Building a Temporal Knowledge Graph with Python and NetworkX

digitado ⋅ 28 February 2026

Authors:
Shon Mohsin, D. Eng AI
Jeremiah Lowhorn, D. Eng AI
Seth Thor
Matthew Morais, D. Eng AI

What if you could ask your data: “What did we know about this research topic in June 2024?” — and get a precise, structured answer?

Scientific research generates a web of interconnected entities (subjects, methods, outcomes, organizations, publications) that evolves over time. A spreadsheet can track individual facts. A relational database can join tables. But neither preserves the shape of knowledge: the web of typed relationships between entities that grows and changes with every new publication.

Knowledge graphs do.

In this article, we’ll build a temporal knowledge graph in Python using NetworkX. It ingests scientific publications, extracts entities, wires them into a typed graph, and supports “time-travel” queries — answering questions about what was known at any point in the past. The implementation uses a scientific research domain, but the patterns transfer directly to supply chain tracking, legal document analysis, financial intelligence, and any field where relationships between entities matter more than the entities themselves.

What Is a Knowledge Graph?

A knowledge graph is a data structure where nodes represent entities and edges represent typed relationships between them. Both nodes and edges carry attributes.

This sounds like a regular graph, but the key distinction is typing. In a simple graph, nodes are nodes and edges are edges. In a knowledge graph:

Nodes have types: a Publication node is fundamentally different from a Drug node or a Target node
Edges have types: a DESCRIBES_DRUG edge is different from a TARGETS edge or a TREATS edge
Both carry attributes: a publication has a timestamp, a drug has a development stage, a relationship has a date it was established

This is what separates a knowledge graph from a relational database. In a relational model, you’d represent these as separate tables with foreign keys. In a KG, the relationships are first-class citizens — you can traverse, query, and reason over them directly.
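To make the idea of typed nodes and typed edges concrete, here is a minimal sketch of a toy graph. All node ids and attribute values are made up for illustration, and the follow helper is a hypothetical name rather than part of the implementation described later.

```python
import networkx as nx

# Toy graph: every identifier below is illustrative.
g = nx.MultiDiGraph()
g.add_node("pub_1", node_type="Publication")
g.add_node("drug_x", node_type="Drug", stage="Phase 2")
g.add_node("target_y", node_type="Target")
g.add_node("indication_z", node_type="Indication")

# Each edge records which kind of relationship it represents.
g.add_edge("pub_1", "drug_x", edge_type="DESCRIBES_DRUG")
g.add_edge("drug_x", "target_y", edge_type="TARGETS")
g.add_edge("drug_x", "indication_z", edge_type="TREATS")

def follow(graph, node, edge_type):
    """Return the nodes reachable from `node` over edges of one type."""
    return [tgt for _, tgt, data in graph.out_edges(node, data=True)
            if data["edge_type"] == edge_type]

drugs = [n for n, d in g.nodes(data=True) if d["node_type"] == "Drug"]
print(drugs)                            # ['drug_x']
print(follow(g, "drug_x", "TARGETS"))   # ['target_y']
```

The traversal reads the relationship type straight off the edge, with no join table in between.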

Real-world examples include Google’s Knowledge Graph (powering search cards), Wikidata (structured Wikipedia), and biomedical KGs like Hetionet. Increasingly, knowledge graphs serve as structured retrieval layers for LLM-based systems, providing context that vector similarity search alone cannot capture.

If we add temporal awareness — timestamps on edges and the ability to filter by time — the knowledge graph becomes a time machine for your domain.

Step 1: Design the Ontology

Every knowledge graph starts with an ontology: the schema that defines what types of entities and relationships your graph can contain. Think of it as the blueprint before construction.

For our scientific publications graph, we define 8 node types and 10 relationship types:

from enum import Enum

class NodeType(Enum):
    STOCK = "Stock"               # Organizations / companies
    PUBLICATION = "Publication"    # Source documents
    DRUG = "Drug"                 # Research subjects
    TARGET = "Target"             # Biological targets
    INDICATION = "Indication"     # Disease areas
    TECHNOLOGY = "Technology"     # Methods and platforms
    CLINICAL_TRIAL = "ClinicalTrial"
    PARTNERSHIP = "Partnership"

class EdgeType(Enum):
    PUBLISHED_BY = "PUBLISHED_BY"       # Publication -> Organization
    DESCRIBES_DRUG = "DESCRIBES_DRUG"   # Publication -> Subject
    TARGETS = "TARGETS"                 # Subject -> Target
    TREATS = "TREATS"                   # Subject -> Disease
    USES_TECHNOLOGY = "USES_TECHNOLOGY" # Publication -> Technology
    ADVANCES_TO = "ADVANCES_TO"        # Subject -> Study
    PARTNERS_WITH = "PARTNERS_WITH"    # Organization -> Partnership
    FOLLOWS_TEMPORALLY = "FOLLOWS"     # Publication -> Publication
    OWNED_BY = "OWNED_BY"             # Subject -> Organization
    DEVELOPED_BY = "DEVELOPED_BY"     # Technology -> Organization

Each node type is backed by a Python dataclass that defines its attributes and provides serialization:

from dataclasses import dataclass
from typing import Optional

# DrugStage and TherapeuticArea are Enum classes defined alongside NodeType (not shown)

@dataclass
class DrugNode:
    drug_id: str
    name: str
    code: Optional[str] = None
    mechanism: Optional[str] = None
    stage: DrugStage = DrugStage.DISCOVERY
    therapeutic_area: Optional[TherapeuticArea] = None

    @property
    def node_id(self) -> str:
        return self.drug_id

    @property
    def node_type(self) -> NodeType:
        return NodeType.DRUG

    def to_dict(self) -> dict:
        return {
            "node_type": self.node_type.value,
            "drug_id": self.drug_id,
            "name": self.name,
            "code": self.code,
            "mechanism": self.mechanism,
            "stage": self.stage.value,
            "therapeutic_area": self.therapeutic_area.value if self.therapeutic_area else None,
        }

The key design decisions here:

Enums for controlled vocabularies — node types, edge types, development stages, and therapeutic areas are all enums. This prevents typos from creating ghost categories and makes the ontology self-documenting.
Dataclasses for structure — each entity type has explicit fields with types and defaults, making the schema clear to any contributor.
to_dict() for serialization – every node can flatten itself into a dictionary for graph storage or JSON export.
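The three decisions above can be seen working together in a short sketch. The DrugStage members and the trimmed-down DrugNode below are assumptions chosen for illustration; the real classes carry more fields.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

import networkx as nx

class DrugStage(Enum):  # hypothetical subset of stages
    DISCOVERY = "Discovery"
    PHASE2 = "Phase 2"

@dataclass
class DrugNode:  # reduced version of the dataclass shown earlier
    drug_id: str
    name: str
    stage: DrugStage = DrugStage.DISCOVERY
    mechanism: Optional[str] = None

    def to_dict(self) -> dict:
        return {"node_type": "Drug", "drug_id": self.drug_id, "name": self.name,
                "stage": self.stage.value, "mechanism": self.mechanism}

# A misspelled stage raises instead of silently creating a new category.
try:
    DrugStage("Phase2")
    typo_rejected = False
except ValueError:
    typo_rejected = True

# The flattened dictionary becomes the node's attribute set in the graph.
graph = nx.MultiDiGraph()
drug = DrugNode(drug_id="drug_example", name="Example", stage=DrugStage.PHASE2)
graph.add_node(drug.drug_id, **drug.to_dict())
print(graph.nodes["drug_example"]["stage"])  # Phase 2
```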

Source: Image created by authors. The ontology as a hub-and-spoke diagram: Publication nodes sit at the center, linking outward to Drug, Technology, and Organization. Drug nodes connect to Target, Indication, and ClinicalTrial.

Step 2: Parse Unstructured Data into Entities

Knowledge graphs don’t build themselves. The hardest part is going from unstructured documents to structured entities. Our implementation takes a pragmatic approach: a combination of filename conventions and regex-based entity extraction.

Each publication file follows a naming convention that encodes metadata:

YYYYMMDD_ORG_Topic_Source.md

The parser extracts timestamp, organization ticker, and source type from the filename alone:

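from datetime import datetime

def parse_filename(filename: str) -> dict:
    # Strip the extension, then split on the underscores of the naming convention
    base = filename.replace('.md', '').replace('.pdf', '')
    parts = base.split('_')

    # First field is the date; a day of "00" is clamped to the 1st of the month
    date_str = parts[0]
    year, month, day = int(date_str[:4]), int(date_str[4:6]), int(date_str[6:8])
    timestamp = datetime(year, month, max(1, day))

    ticker = parts[1]   # organization ticker
    source = parts[-1]  # last field is the source type
    # _infer_publication_type is a project helper not shown in this article
    pub_type = _infer_publication_type(source, filename)
    return {'timestamp': timestamp, 'ticker': ticker, 'source': source, 'pub_type': pub_type}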
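The parser above trusts that every filename follows the convention. A stricter variant might validate the name first and fail loudly otherwise. The sketch below assumes the topic field may itself contain underscores; the function name parse_filename_strict, the topic key, and the example filenames are illustrative and not part of the original code.

```python
import re
from datetime import datetime

# YYYYMMDD_ORG_Topic_Source.md (or .pdf); the topic may contain underscores.
FILENAME_RE = re.compile(
    r"^(?P<date>\d{8})_(?P<ticker>[^_]+)_(?P<topic>.+)_(?P<source>[^_]+)\.(?:md|pdf)$"
)

def parse_filename_strict(filename: str) -> dict:
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    date_str = match["date"]
    # Same clamping as above: a day of "00" becomes the 1st of the month.
    timestamp = datetime(int(date_str[:4]), int(date_str[4:6]),
                         max(1, int(date_str[6:8])))
    return {"timestamp": timestamp, "ticker": match["ticker"],
            "topic": match["topic"], "source": match["source"]}

parsed = parse_filename_strict("20240615_RLY_Pipeline_Update_PR.md")
print(parsed["ticker"], parsed["topic"], parsed["source"])  # RLY Pipeline_Update PR
```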

For entity extraction from document content, the system uses regex patterns matched against known entities:

import re

def _extract_drugs(content: str) -> list:
    drugs = []
    # Each entry: (regex, name, code, mechanism, stage, therapeutic area)
    drug_patterns = [
        (r'zovegalisib|RLY-2608', 'Zovegalisib', 'RLY-2608',
         'Allosteric inhibitor', DrugStage.PHASE3, TherapeuticArea.ONCOLOGY),
        (r'REC-994|tempol', 'REC-994', 'REC-994',
         'Redox-cycling nitroxide', DrugStage.PHASE2, TherapeuticArea.NEUROLOGY),
        # ... more patterns
    ]
    for pattern, name, code, mechanism, stage, area in drug_patterns:
        if re.search(pattern, content, re.IGNORECASE):
            drugs.append({'name': name, 'code': code, 'mechanism': mechanism,
                          'stage': stage, 'therapeutic_area': area})
    return drugs

The same pattern-matching approach applies to targets (gene symbols), indications (disease names), technologies (platform names), clinical trials, and partnerships.

Trade-off note: Regex-based extraction is deterministic and precise for a bounded domain with known entities. For open-domain extraction at scale, you’d swap in NLP/NER models (spaCy, biomedical BERT, etc.). The graph builder doesn’t care how entities are extracted — it just needs structured dictionaries.
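One way to keep the extractor swappable is to treat it as any callable that returns entity dictionaries. The sketch below is an assumption about how that boundary could look: the Extractor alias, both extractor functions, and drug_names are hypothetical names, and model_drug_extractor is only a stand-in for a real NER model.

```python
import re
from typing import Callable, Dict, List

# Any callable that maps document text to a list of entity dicts will do.
Extractor = Callable[[str], List[Dict[str, str]]]

def regex_drug_extractor(content: str) -> List[Dict[str, str]]:
    # \b anchors keep a pattern from matching inside a longer word.
    patterns = [(r"\b(?:REC-994|tempol)\b", "REC-994")]
    return [{"name": name} for pattern, name in patterns
            if re.search(pattern, content, re.IGNORECASE)]

def model_drug_extractor(content: str) -> List[Dict[str, str]]:
    # Placeholder for an NER model: a real one would run inference here.
    return [{"name": "REC-994"}] if "rec-994" in content.lower() else []

def drug_names(content: str, extractor: Extractor) -> List[str]:
    """Downstream code only sees the dictionaries, not how they were made."""
    return sorted({entity["name"] for entity in extractor(content)})

text = "Phase 2 results for tempol (REC-994) were presented."
print(drug_names(text, regex_drug_extractor))  # ['REC-994']
print(drug_names(text, model_drug_extractor))  # ['REC-994']
```

Because both functions satisfy the same signature, switching extraction strategies does not touch the graph-building code.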

Step 3: Build the Graph with NetworkX

With the ontology defined and the parser producing structured data, we can build the graph. We use NetworkX’s MultiDiGraph, a directed graph that allows multiple edges between the same pair of nodes. That matters here because the same two nodes can be linked more than once, for example when several publications assert the same relationship at different timestamps.
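The difference between a plain directed graph and a multigraph is easy to check. In this small sketch (node ids and dates are made up), the same relationship is asserted twice with different timestamps:

```python
import networkx as nx

simple, multi = nx.DiGraph(), nx.MultiDiGraph()
for g in (simple, multi):
    g.add_edge("drug_x", "org_a", edge_type="OWNED_BY", timestamp="2024-01-10T00:00:00")
    g.add_edge("drug_x", "org_a", edge_type="OWNED_BY", timestamp="2024-08-02T00:00:00")

print(simple.number_of_edges())  # 1: the second call overwrote the first timestamp
print(multi.number_of_edges())   # 2: both dated assertions are kept
```

With DiGraph the earlier date is lost, which would make "what was known at date X" unanswerable for that relationship.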

The core class maintains tracking dictionaries for deduplication:

import networkx as nx

class BiotechKnowledgeGraph:
    def __init__(self):
        self.graph = nx.MultiDiGraph()
        self._drugs = {}
        self._targets = {}
        self._publications = {}
        # ... tracking dicts for each node type

The central method processes a parsed publication and creates all nodes and edges:

def add_publication_with_entities(self, parsed_data: dict):
    timestamp = parsed_data.get('timestamp')
    ticker = parsed_data.get('ticker')

    # Create publication node
    pub_id = f"pub_{parsed_data['filename'].replace('.md', '')}"
    pub = PublicationNode(pub_id=pub_id, timestamp=timestamp)  # remaining fields elided
    self._add_publication(pub)

    # Link publication to organization
    self._add_edge(pub_id, ticker, EdgeType.PUBLISHED_BY, timestamp)

    # Process each entity type and create relationships
    for drug_data in parsed_data.get('drugs', []):
        drug_id = self._add_drug(drug_data)
        self._add_edge(pub_id, drug_id, EdgeType.DESCRIBES_DRUG, timestamp)
        self._add_edge(drug_id, ticker, EdgeType.OWNED_BY, timestamp)

        # Every target and indication found in the document is linked to every drug found in it
        for target_data in parsed_data.get('targets', []):
            target_id = self._add_target(target_data)
            self._add_edge(drug_id, target_id, EdgeType.TARGETS, timestamp)

        for indication_data in parsed_data.get('indications', []):
            indication_id = self._add_indication(indication_data)
            self._add_edge(drug_id, indication_id, EdgeType.TREATS, timestamp)

    for tech_data in parsed_data.get('technologies', []):
        tech_id = self._add_technology(tech_data)
        self._add_edge(pub_id, tech_id, EdgeType.USES_TECHNOLOGY, timestamp)

Notice that every edge carries a timestamp. This is the foundation for temporal queries.
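The _add_edge helper is not shown in the article. Judging from how the query code later reads data['edge_type'] and parses data['timestamp'] with datetime.fromisoformat, it plausibly stores the enum value and an ISO-formatted string. A minimal sketch under that assumption, written as a standalone function with a hypothetical name:

```python
from datetime import datetime
from enum import Enum
from typing import Optional

import networkx as nx

class EdgeType(Enum):  # one member is enough for the sketch
    TARGETS = "TARGETS"

def add_typed_edge(graph: nx.MultiDiGraph, source: str, target: str,
                   edge_type: EdgeType, timestamp: Optional[datetime] = None,
                   **attrs):
    """Add one typed edge; the timestamp is stored as an ISO 8601 string."""
    data = {"edge_type": edge_type.value, **attrs}
    if timestamp is not None:
        data["timestamp"] = timestamp.isoformat()
    # MultiDiGraph.add_edge returns the key of the newly created parallel edge.
    return graph.add_edge(source, target, **data)

g = nx.MultiDiGraph()
key = add_typed_edge(g, "drug_x", "target_y", EdgeType.TARGETS,
                     datetime(2024, 3, 1), context="pub_1")
print(g.edges["drug_x", "target_y", key]["timestamp"])  # 2024-03-01T00:00:00
```

Storing strings rather than datetime objects keeps the attributes serializable to JSON and GraphML, at the cost of parsing them again at query time.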

After all publications are ingested, we create temporal chains — FOLLOWS_TEMPORALLY edges that link publications about the same research subject in chronological order:

from collections import defaultdict

def add_temporal_edges(self):
    # Group publications by the drug they describe
    drug_pubs = defaultdict(list)
    for pub_id, pub in self._publications.items():
        for _, target, data in self.graph.out_edges(pub_id, data=True):
            if data.get('edge_type') == EdgeType.DESCRIBES_DRUG.value:
                drug_pubs[target].append((pub.timestamp, pub_id))

    for drug_id, pubs in drug_pubs.items():
        sorted_pubs = sorted(pubs, key=lambda x: x[0])
        for i in range(len(sorted_pubs) - 1):
            self._add_edge(
                sorted_pubs[i][1], sorted_pubs[i + 1][1],
                EdgeType.FOLLOWS_TEMPORALLY, sorted_pubs[i + 1][0],
                context=f"drug:{drug_id}"
            )
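The chaining step is easier to follow on a tiny example. Assuming three publications about one drug arrive out of order (ids and dates are made up), sorting and pairing neighbours yields two FOLLOWS edges, each stamped with the date of the later publication:

```python
from datetime import datetime

# (timestamp, publication id) pairs for a single drug, in ingestion order
pubs = [
    (datetime(2024, 5, 1), "pub_b"),
    (datetime(2023, 11, 20), "pub_a"),
    (datetime(2025, 1, 9), "pub_c"),
]

ordered = sorted(pubs)  # tuples sort by timestamp first
chain = [
    (earlier_id, later_id, later_ts.isoformat())
    for (_, earlier_id), (later_ts, later_id) in zip(ordered, ordered[1:])
]
print(chain)
# [('pub_a', 'pub_b', '2024-05-01T00:00:00'), ('pub_b', 'pub_c', '2025-01-09T00:00:00')]
```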

A single publication typically generates 10+ nodes and edges. From ~20 documents, our graph reached 66 nodes and 230 edges.

Source: Image created by authors. The complete knowledge graph visualized with NetworkX. Node colors indicate type; node size reflects connectivity (degree). Stock and Drug nodes emerge as highly connected hubs.

Step 4: Time-Travel Queries — The Killer Feature

What makes this a temporal knowledge graph is the ability to ask: “What was known at date X?”

The TemporalQueryEngine class wraps the graph with time-aware query methods.

Historical Snapshot

The most powerful query reconstructs the known state of any entity at a specific point in time:

def get_drug_state_at_date(self, drug_id: str, as_of_date: datetime) -> dict:
    drug_data = dict(self.graph.nodes[drug_id])
    timeline = self.get_drug_timeline(drug_id)

    relevant_pubs = [p for p in timeline if p['timestamp'] <= as_of_date]

    known_targets, known_indications, known_trials = set(), set(), set()

    for _, target, data in self.graph.out_edges(drug_id, data=True):
        # Edges without a timestamp are treated as always known
        edge_ts_str = data.get('timestamp')
        if edge_ts_str and datetime.fromisoformat(edge_ts_str) > as_of_date:
            continue

        edge_type = data.get('edge_type')
        if edge_type == EdgeType.TARGETS.value:
            known_targets.add(target)
        elif edge_type == EdgeType.TREATS.value:
            known_indications.add(target)
        elif edge_type == EdgeType.ADVANCES_TO.value:
            known_trials.add(target)

    return {
        'drug_id': drug_id,
        'publication_count': len(relevant_pubs),
        'known_targets': list(known_targets),
        'known_indications': list(known_indications),
        'known_trials': list(known_trials),
    }

This simulates the information landscape at any historical date, which is useful for backtesting research decisions, regulatory analysis, or understanding how knowledge accumulates over time.
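A snapshot becomes more informative when two dates are compared. The sketch below uses a toy graph with made-up ids and dates, and a hypothetical helper neighbors_known_at, to show what was learned about a drug between two points in time. It assumes edge timestamps are ISO strings and treats undated edges as always known.

```python
from datetime import datetime

import networkx as nx

g = nx.MultiDiGraph()
g.add_edge("drug_x", "target_y", edge_type="TARGETS", timestamp="2023-05-01T00:00:00")
g.add_edge("drug_x", "indication_z", edge_type="TREATS", timestamp="2024-09-01T00:00:00")

def neighbors_known_at(graph, node, edge_type, as_of):
    """Targets of `edge_type` edges from `node` dated on or before `as_of`."""
    found = set()
    for _, tgt, data in graph.out_edges(node, data=True):
        ts = data.get("timestamp")
        in_window = not ts or datetime.fromisoformat(ts) <= as_of
        if data.get("edge_type") == edge_type and in_window:
            found.add(tgt)
    return found

mid_2024, start_2025 = datetime(2024, 6, 30), datetime(2025, 1, 1)
before = neighbors_known_at(g, "drug_x", "TREATS", mid_2024)
after = neighbors_known_at(g, "drug_x", "TREATS", start_2025)
print(sorted(after - before))  # ['indication_z']
```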

Source: Image created by authors. A drug-centric view: the neighborhood around Zovegalisib (RLY-2608) showing its connections to publications, targets, indications, clinical trials, and the owning organization.

Time-Constrained Path Finding

Perhaps the most interesting query finds paths between two entities using only edges dated on or before a cutoff date:

def find_paths_before_date(self, source: str, target: str,
                            cutoff_date: datetime, max_depth: int = 4) -> list:
    # Build subgraph with only pre-cutoff edges
    valid_edges = []
    for u, v, key, data in self.graph.edges(keys=True, data=True):
        edge_ts_str = data.get('timestamp')
        if not edge_ts_str or datetime.fromisoformat(edge_ts_str) <= cutoff_date:
            valid_edges.append((u, v, key))

    subgraph = nx.MultiDiGraph()
    subgraph.add_nodes_from(self.graph.nodes(data=True))
    for u, v, key in valid_edges:
        subgraph.add_edge(u, v, key=key, **self.graph.edges[u, v, key])

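    # Parallel edges can yield the same node sequence more than once, so deduplicate
    paths = nx.all_simple_paths(subgraph, source, target, cutoff=max_depth)
    return [list(p) for p in dict.fromkeys(tuple(p) for p in paths)]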

This answers questions like: “Could we have connected research subject A to disease B based on what was published before March 2024?” — a powerful tool for analyzing information flow and research gaps.
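A toy case shows the effect of the cutoff. In this sketch (ids and dates are invented, and paths_as_of is a hypothetical standalone version of the method), the link from the target to the disease is only published in September 2024, so no path exists before then:

```python
from datetime import datetime

import networkx as nx

g = nx.MultiDiGraph()
g.add_edge("drug_x", "target_y", timestamp="2023-05-01T00:00:00")
g.add_edge("target_y", "indication_z", timestamp="2024-09-01T00:00:00")

def paths_as_of(graph, source, target, cutoff, max_depth=4):
    """Simple paths that use only edges dated on or before `cutoff`."""
    view = nx.MultiDiGraph()
    view.add_nodes_from(graph.nodes)  # keep all nodes so lookups never fail
    for u, v, key, data in graph.edges(keys=True, data=True):
        if datetime.fromisoformat(data["timestamp"]) <= cutoff:
            view.add_edge(u, v, key=key)
    found = nx.all_simple_paths(view, source, target, cutoff=max_depth)
    return sorted({tuple(path) for path in found})  # drop repeated node sequences

print(paths_as_of(g, "drug_x", "indication_z", datetime(2024, 3, 1)))
# []
print(paths_as_of(g, "drug_x", "indication_z", datetime(2025, 1, 1)))
# [('drug_x', 'target_y', 'indication_z')]
```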

Other Temporal Queries

The engine also supports:

Entity timelines: chronological list of all publications about a subject
Organization comparison: compare pipeline sizes, tech stacks, and publication activity at any date
Publication counts by month: track research activity trends over time
Development stage progression: track how a subject’s status evolved through publications
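Of these, the monthly publication count is the simplest to sketch. Assuming publication timestamps are available as datetime objects (the dates below are made up), a Counter keyed by year and month is enough:

```python
from collections import Counter
from datetime import datetime

timestamps = [
    datetime(2024, 1, 5),
    datetime(2024, 1, 20),
    datetime(2024, 3, 2),
]

# "YYYY-MM" keys sort chronologically as plain strings
by_month = Counter(ts.strftime("%Y-%m") for ts in timestamps)
print(sorted(by_month.items()))  # [('2024-01', 2), ('2024-03', 1)]
```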

Source: Image created by authors. Publication activity over time: the temporal distribution of publications by organization, showing research intensity patterns across the dataset’s 3-year span.

Step 5: Serialize and Visualize

The graph exports to four formats, each serving a different use case:

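import json
import os
import pickle

def save(self, output_dir: str, name: str = "biotech_kg"):
    os.makedirs(output_dir, exist_ok=True)
    base = os.path.join(output_dir, name)

    # Pickle -- full Python object, best for reloading
    with open(f"{base}.pickle", 'wb') as f:
        pickle.dump(self.graph, f)

    # GraphML -- interoperable with Gephi, Cytoscape, yEd
    # G_export is assumed to be a copy of self.graph with GraphML-safe
    # attribute values (the writer rejects None); its construction is not shown
    nx.write_graphml(G_export, f"{base}.graphml")

    # JSON -- D3.js-ready for web visualization
    self._save_as_json(f"{base}.json")

    # Stats JSON -- summary metrics (file name assumed)
    with open(f"{base}_stats.json", 'w') as f:
        json.dump(self.get_statistics(), f)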

For visualization, the GraphML export works directly with Gephi (network analysis) or Cytoscape (biological networks). The JSON export is structured for D3.js force-directed layouts. For quick Python visualization, pyvis can render interactive HTML graphs from NetworkX directly.
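Two details of the export are worth sketching. The save method passes a separate G_export object to the GraphML writer; one plausible reason is that the NetworkX GraphML writer rejects None attribute values, so optional fields have to be dropped first. The helper graphml_safe and the hand-built nodes/links payload below are assumptions for illustration, not the article's _save_as_json.

```python
import json

import networkx as nx

g = nx.MultiDiGraph()
g.add_node("drug_x", node_type="Drug", code=None)  # optional field left empty
g.add_node("target_y", node_type="Target")
g.add_edge("drug_x", "target_y", edge_type="TARGETS", timestamp="2024-03-01T00:00:00")

def graphml_safe(graph):
    """Copy the graph, dropping None-valued attributes before GraphML export."""
    clean = nx.MultiDiGraph()
    for node, data in graph.nodes(data=True):
        clean.add_node(node, **{k: val for k, val in data.items() if val is not None})
    for u, v, key, data in graph.edges(keys=True, data=True):
        clean.add_edge(u, v, key=key,
                       **{k: val for k, val in data.items() if val is not None})
    return clean

# generate_graphml yields the document line by line, so no file is needed here
graphml_text = "\n".join(nx.generate_graphml(graphml_safe(g)))

# A nodes/links structure of the kind D3 force layouts commonly consume
d3_payload = {
    "nodes": [{"id": n, **d} for n, d in g.nodes(data=True)],
    "links": [{"source": u, "target": v, **d} for u, v, d in g.edges(data=True)],
}
d3_json = json.dumps(d3_payload)
```

JSON has a null value, so the unsanitized graph can be used for the D3 payload.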

What We Built

From 21 scientific publication documents, the system extracted:

+--------------------+----------+
| Metric             | Count    |
+--------------------+----------+
| Total nodes        | 66       |
| Total edges        | 230      |
| Node types         | 8        |
| Edge types         | 10       |
| Publications       | 21       |
| Research subjects  | 8        |
| Technologies       | 10       |
| Biological targets | 7        |
| Disease areas      | 7        |
| Partnerships       | 8        |
| Clinical studies   | 3        |
| Temporal range     | 3+ years |
+--------------------+----------+

Source: Image created by authors. Degree distribution analysis: Left shows the overall connectivity distribution (most nodes have few connections, some hubs have many). Right shows connectivity patterns by node type — Stock and Drug nodes are the most connected, serving as hubs in the graph.

The graph answers questions that no single document, spreadsheet, or database table can:

What was the state of knowledge about subject X at date Y?
How did an organization’s research pipeline evolve over time?
What technologies were associated with which publications?
What paths connect two entities through the graph?

Where to Go From Here

This implementation is a foundation. Here’s how to extend it:

NLP-based entity extraction — Replace regex patterns with spaCy or domain-specific NER models for open-domain extraction at scale
Graph database backend — Swap NetworkX for Neo4j or Amazon Neptune when the graph outgrows memory
Graph algorithms — Run centrality analysis (which entity is most connected?), community detection (which entities cluster together?), or link prediction (what connections are likely missing?)

RAG integration — Use the knowledge graph as a structured retrieval layer for LLM-based question answering, combining graph traversal with semantic search
Collaboration networks — Add author and institution nodes to map who works with whom across publications
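As a first taste of the graph-algorithm direction, degree centrality is a one-liner in NetworkX. The toy edges below are invented; on the real graph the same call would run against the MultiDiGraph built earlier.

```python
import networkx as nx

g = nx.MultiDiGraph()
g.add_edges_from([
    ("pub_1", "drug_x"),
    ("pub_2", "drug_x"),
    ("drug_x", "target_y"),
    ("pub_1", "tech_a"),
])

# Degree centrality: (in-degree + out-degree) divided by (number of nodes - 1)
centrality = nx.degree_centrality(g)
hub = max(centrality, key=centrality.get)
print(hub, centrality[hub])  # drug_x 0.75
```

In a multigraph, parallel edges each count toward the degree, so values can exceed 1 for heavily connected nodes.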

Conclusion

Knowledge graphs aren’t just for Google. They’re a practical data structure for anyone dealing with complex, evolving relationships — from research intelligence to supply chains to legal document analysis.

The combination of typed nodes, typed edges, temporal awareness, and graph traversal gives you a fundamentally different tool than relational databases or flat files. You don’t just store facts — you store the structure of knowledge, and you can query that structure across time.

The full implementation is available in the project repository. The entire system runs on the Python standard library plus NetworkX, with no database server, no cloud service, and no heavyweight dependencies.

Think about your own domain. What entities do you track? What relationships between them are you currently losing to flat data structures? That’s where your knowledge graph begins.

Built with Python 3, NetworkX, and dataclasses. No external database required.

For questions, please email us at shon@numenai.capital

Building a Temporal Knowledge Graph with Python and NetworkX was originally published in Towards AI on Medium.
