Knowledge Graphs from Documents

Where vector search fails

Vector search finds documents that are semantically similar to a query. This is powerful for most questions, but it fails systematically for a specific category: relationship queries.

Consider these questions:

"Who approved the change order that introduced the $2M liability?"
"Which contracts reference the subsidiary that was acquired in 2024?"
"Show me all projects managed by people who report to Sarah Chen."
"What vendors are connected to the procurement irregularity flagged in the audit?"

These questions are not about finding a document -- they are about traversing relationships between entities. The answer is not in any single chunk; it emerges from connecting information scattered across multiple documents.

Vector search cannot traverse relationships. It can find documents that mention "change order" and documents that mention "$2M liability," but it cannot connect them through the approval chain. A query like "Who approved the change order that introduced the $2M liability?" requires:

Finding the change order that introduced the $2M liability (which might be in one document)
Finding who approved that specific change order (which might be in a different document)
Connecting the two through a shared identifier (the change order number)

This is a graph problem, not a similarity problem. And for enterprises with complex organisational structures, contractual relationships, and regulatory obligations, these graph queries are some of the highest-value questions the RAG system needs to answer.
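
The three steps above amount to a join on a shared identifier. A minimal sketch, in which two small lists stand in for facts extracted from separate documents; the field names, file names, and values are hypothetical and purely illustrative:

```python
# Hypothetical facts extracted from two different documents.
# Field names and values are illustrative only.
liability_facts = [
    {"change_order": "CO-117", "introduced_liability_usd": 2_000_000, "source": "risk_register.pdf"},
    {"change_order": "CO-118", "introduced_liability_usd": 0, "source": "risk_register.pdf"},
]
approval_facts = [
    {"change_order": "CO-117", "approved_by": "A. Example", "source": "approvals_log.pdf"},
    {"change_order": "CO-118", "approved_by": "B. Example", "source": "approvals_log.pdf"},
]


def approvers_of_liability(liabilities, approvals, amount_usd):
    """Join the two fact sets on the shared change order number."""
    # Step 1: find the change orders that introduced the liability.
    orders = {f["change_order"] for f in liabilities
              if f["introduced_liability_usd"] == amount_usd}
    # Steps 2 and 3: keep approvals whose change order number matches.
    return [(a["change_order"], a["approved_by"])
            for a in approvals if a["change_order"] in orders]


result = approvers_of_liability(liability_facts, approval_facts, 2_000_000)
print(result)
```

Similarity search can surface either list of facts, but the join itself only becomes possible once the change order number has been extracted as an explicit, shared key.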

A compliance officer asks: 'Which of our active contracts were signed by employees who have since left the company, and do any of those contracts have upcoming renewal dates?' Why can't standard vector search answer this effectively?

Extracting entities and relationships from documents

Building a knowledge graph from enterprise documents is a two-phase process: extract entities, then extract relationships.

Entity extraction identifies the key objects in your documents: people, organisations, contracts, dates, monetary amounts, regulatory references, projects, and products. Gemma 4 handles this well with a structured prompt:

Extract all entities from the following document chunk.
For each entity, provide:
- entity_text: the exact text as it appears in the document
- entity_type: one of [PERSON, ORGANISATION, CONTRACT, DATE,
  AMOUNT, REGULATION, PROJECT, PRODUCT, LOCATION]
- entity_id: a normalised identifier (e.g., "John Smith" and
  "J. Smith" should share the same ID if they are the same person)

Document chunk:
{chunk_text}

Return as JSON array.

The entity_id normalisation is the hard part. "John Smith," "J. Smith," "Mr. Smith," and "the VP of Engineering" might all refer to the same person. Entity resolution -- determining which mentions refer to the same real-world entity -- requires contextual reasoning that rule-based systems handle poorly but LLMs handle well.
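
Model output is not guaranteed to follow the requested schema, so it is worth validating before anything reaches the graph. A minimal sketch, assuming the model returns the JSON array described in the prompt above; the `parse_entities` helper, the ID format, and the sample response are hypothetical:

```python
import json

ALLOWED_TYPES = {
    "PERSON", "ORGANISATION", "CONTRACT", "DATE", "AMOUNT",
    "REGULATION", "PROJECT", "PRODUCT", "LOCATION",
}
REQUIRED_KEYS = {"entity_text", "entity_type", "entity_id"}


def parse_entities(raw_response: str) -> list[dict]:
    """Parse the model's JSON array and keep only well-formed entities."""
    try:
        items = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # caller can retry or log the chunk for review
    if not isinstance(items, list):
        return []
    valid = []
    for item in items:
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            continue  # missing one of the requested fields
        if item["entity_type"] not in ALLOWED_TYPES:
            continue  # type outside the list given in the prompt
        valid.append({key: item[key] for key in REQUIRED_KEYS})
    return valid


# Hypothetical response: two mentions resolved to one ID, plus one invalid type.
sample = json.dumps([
    {"entity_text": "John Smith", "entity_type": "PERSON", "entity_id": "person:john-smith"},
    {"entity_text": "J. Smith", "entity_type": "PERSON", "entity_id": "person:john-smith"},
    {"entity_text": "next week", "entity_type": "VAGUE_TIME", "entity_id": "time:next-week"},
])
entities = parse_entities(sample)
print(len(entities), {e["entity_id"] for e in entities})
```

Validation of this kind catches malformed output only. Whether two mentions were correctly given the same entity_id still has to be checked separately, for example by sampling resolved pairs for human review.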

Relationship extraction identifies the connections between entities:

Given the following document chunk and the extracted entities,
identify all relationships between entities.
For each relationship, provide:
- source_entity: the entity ID of the relationship source
- relationship_type: one of [SIGNED_BY, APPROVED_BY, REPORTS_TO,
  REFERENCES, DATED, VALUED_AT, MANAGED_BY, CONTRACTED_WITH,
  GOVERNED_BY, SUPERSEDES]
- target_entity: the entity ID of the relationship target
- evidence: the specific text in the document that supports
  this relationship

Document chunk:
{chunk_text}

Extracted entities:
{entities_json}

Return as JSON array.

The output is a set of triples: (source, relationship, target). These triples form the edges of your knowledge graph.
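
A minimal sketch of turning extracted relationships into edges, assuming the JSON fields named in the prompt above. The `Triple` type, the `to_triples` helper, and the sample IDs are hypothetical. Dropping relationships whose endpoints were never extracted as entities is one possible policy; queuing them for review is another.

```python
from typing import NamedTuple

ALLOWED_RELATIONSHIPS = {
    "SIGNED_BY", "APPROVED_BY", "REPORTS_TO", "REFERENCES", "DATED",
    "VALUED_AT", "MANAGED_BY", "CONTRACTED_WITH", "GOVERNED_BY", "SUPERSEDES",
}


class Triple(NamedTuple):
    source: str
    relationship: str
    target: str
    evidence: str


def to_triples(relationships: list[dict], known_entity_ids: set[str]) -> list[Triple]:
    """Keep relationships with an allowed type and two known endpoints."""
    triples = []
    for rel in relationships:
        if rel.get("relationship_type") not in ALLOWED_RELATIONSHIPS:
            continue
        source, target = rel.get("source_entity"), rel.get("target_entity")
        if source not in known_entity_ids or target not in known_entity_ids:
            continue  # endpoint was never extracted as an entity
        triples.append(Triple(source, rel["relationship_type"], target,
                              rel.get("evidence", "")))
    return triples


# Hypothetical extraction output for one chunk.
entity_ids = {"contract:msa-001", "person:john-smith"}
extracted = [
    {"source_entity": "contract:msa-001", "relationship_type": "SIGNED_BY",
     "target_entity": "person:john-smith", "evidence": "Signed: John Smith"},
    {"source_entity": "contract:msa-001", "relationship_type": "SIGNED_BY",
     "target_entity": "person:unknown", "evidence": "Signed on behalf of the company"},
]
triples = to_triples(extracted, entity_ids)
print(triples)
```

Keeping the evidence text on each edge makes it possible to show a user why the graph believes a relationship exists.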

Throughput considerations. Entity and relationship extraction with Gemma 4 12B processes approximately 50-100 chunks per minute on a single L40S (each chunk requires a generation call with substantial output). For 80 million chunks, that is 800,000-1,600,000 minutes, or roughly 1.5-3 years on a single GPU -- clearly impractical for the full corpus.

The practical approach: extract entities and relationships from high-value document categories only. Contracts, policies, organisational charts, project charters, and audit reports are where relationship queries are most valuable. This might be 5-10% of your corpus (4-8 million chunks), which takes 40,000-160,000 minutes -- roughly 28-111 days on a single GPU, or about 3.5-14 days on 8 GPUs if throughput scales close to linearly. Significant but feasible.
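
The arithmetic behind these estimates can be checked directly. The sketch below assumes throughput scales linearly with GPU count, which real deployments will only approximate; `extraction_days` is a hypothetical helper.

```python
MINUTES_PER_DAY = 24 * 60


def extraction_days(chunks: int, chunks_per_minute: float, gpus: int = 1) -> float:
    """Wall-clock days, assuming perfectly linear scaling across GPUs."""
    return chunks / chunks_per_minute / gpus / MINUTES_PER_DAY


# 5-10% of an 80-million-chunk corpus at 50-100 chunks per minute.
best_case = extraction_days(4_000_000, 100)           # fewest chunks, fastest rate
worst_case = extraction_days(8_000_000, 50)           # most chunks, slowest rate
best_case_8 = extraction_days(4_000_000, 100, gpus=8)
worst_case_8 = extraction_days(8_000_000, 50, gpus=8)

print(f"1 GPU:  {best_case:.1f} to {worst_case:.1f} days")
print(f"8 GPUs: {best_case_8:.1f} to {worst_case_8:.1f} days")
```

Rerunning the function with your own measured throughput is more useful than relying on the range quoted here, since chunk length and output size both affect generation time.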

You are extracting entities from a contract that mentions 'Smith & Associates' in one clause and 'Smith and Associates LLC' in another. These are the same organisation. What challenge does this illustrate?

Where to store your knowledge graph

The knowledge graph needs a storage layer that supports efficient graph traversal: "Starting from entity A, follow relationship B, find entities of type C, then follow relationship D." Here are the options.

Neo4j (Community Edition: GPL, Enterprise: commercial). The most mature graph database. Neo4j's Cypher query language is purpose-built for graph traversal:

MATCH (contract:CONTRACT)-[:SIGNED_BY]->(person:PERSON)
WHERE person.departure_date IS NOT NULL
  AND contract.status = 'active'
RETURN contract.name, person.name, contract.renewal_date
ORDER BY contract.renewal_date

Neo4j handles millions of nodes and edges efficiently. The Community Edition is free for most deployment scenarios. The trade-off: it is another system to operate -- installation, backups, monitoring, upgrades.

Apache AGE (Apache 2.0). A PostgreSQL extension that adds graph database capabilities. If you already run PostgreSQL (and most enterprises do), AGE adds graph queries without introducing a new system:

-- Run once per session before calling cypher():
LOAD 'age';
SET search_path = ag_catalog, "$user", public;

SELECT * FROM cypher('knowledge_graph', $$
  MATCH (c:Contract)-[:SIGNED_BY]->(p:Person)
  WHERE p.departure_date IS NOT NULL
    AND c.status = 'active'
  RETURN c.name, p.name, c.renewal_date
$$) as (contract_name agtype, person_name agtype, renewal_date agtype);

Same Cypher-like query language, but running inside PostgreSQL. This is the recommended option for enterprises that want to minimise operational complexity. Performance is adequate for knowledge graphs up to tens of millions of edges. For larger graphs, Neo4j's dedicated engine outperforms it.

Lightweight in-memory (NetworkX, igraph). For smaller knowledge graphs (under 1 million nodes), Python graph libraries like NetworkX work well. No database to operate -- the graph loads into memory from a serialised file. This is ideal for prototyping and for departmental graphs that are small enough to fit in RAM. The limitation: no persistence, no concurrent access, no query language -- you write graph traversal in Python code.
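
To make "you write graph traversal in Python code" concrete, here is a dependency-free sketch that uses plain dictionaries. NetworkX and igraph offer much richer equivalents; the node IDs, the `type` attribute, and the `follow_incoming` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical nodes and edges; IDs and attributes are illustrative.
nodes = {
    "person:sarah-chen": {"type": "PERSON"},
    "person:a": {"type": "PERSON"},
    "person:b": {"type": "PERSON"},
    "project:x": {"type": "PROJECT"},
    "project:y": {"type": "PROJECT"},
}
edges = [
    ("person:a", "REPORTS_TO", "person:sarah-chen"),
    ("person:b", "REPORTS_TO", "person:sarah-chen"),
    ("project:x", "MANAGED_BY", "person:a"),
    ("project:y", "MANAGED_BY", "person:sarah-chen"),
]

# Index edges by (target, relationship) so they can be walked backwards.
incoming = defaultdict(set)
for source, relationship, target in edges:
    incoming[(target, relationship)].add(source)


def follow_incoming(start_ids, relationship, node_type):
    """Return nodes of node_type that point at any start node via relationship."""
    found = set()
    for node_id in start_ids:
        for source in incoming.get((node_id, relationship), ()):
            if nodes[source]["type"] == node_type:
                found.add(source)
    return found


# "Projects managed by people who report to Sarah Chen": two hops.
reports = follow_incoming({"person:sarah-chen"}, "REPORTS_TO", "PERSON")
projects = follow_incoming(reports, "MANAGED_BY", "PROJECT")
print(sorted(projects))
```

Each additional hop is another call to the helper, which is manageable for a prototype but quickly becomes harder to read than the equivalent Cypher pattern.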

Which to choose:

Graph size	Recommendation
Under 100K nodes	NetworkX in memory -- simplest, no infrastructure
100K - 10M nodes	Apache AGE if you have PostgreSQL, Neo4j Community otherwise
Over 10M nodes	Neo4j -- its dedicated engine handles large graphs most efficiently

Combining graph traversal with vector search

The real power emerges when you combine graph traversal with vector search in a single query pipeline. Here is the pattern:

Step 1: Entity recognition. Identify entities in the user's query. "Which contracts were signed by employees who left in 2024?" contains entities: CONTRACT (type), PERSON (implied, "employees who left"), DATE ("2024").

Step 2: Graph traversal. Query the knowledge graph for the structural part of the question:

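// Assumes departure_date is stored as an ISO 8601 string;
// for DATE properties, compare against date('2024-01-01') and date('2024-12-31') instead.
MATCH (c:CONTRACT)-[:SIGNED_BY]->(p:PERSON)
WHERE p.departure_date >= '2024-01-01'
  AND p.departure_date <= '2024-12-31'
  AND c.status = 'active'
RETURN c.contract_id, c.name, p.name, p.departure_date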

This returns a list of contract IDs and their details.

Step 3: Targeted vector search. Use the contract IDs from the graph traversal as metadata filters in a vector search to find relevant chunks from those specific contracts:

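# vector_db and embed() are placeholders for your vector store client and
# embedding model; the filter syntax shown here varies between vector stores.
# contract_ids_from_graph is the list of contract IDs returned in Step 2.
results = vector_db.search(
    query_embedding=embed("contract terms and obligations"),
    filter={"contract_id": {"$in": contract_ids_from_graph}},
    top_k=20
)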

Step 4: Generation. Pass the graph results (structured data) and the vector search results (document chunks) to the generative model:

Based on the following information, answer the user's question.

Graph query results (structured data):
{graph_results_as_table}

Relevant document sections:
{retrieved_chunks}

Question: {user_query}

This hybrid approach handles queries that neither vector search nor graph traversal can answer alone. The graph provides structural relationships (who signed what, who reports to whom). The vector search provides the semantic content (what do those contracts actually say). Together, they answer questions like "Summarise the key obligations in contracts signed by employees who left in 2024" -- a question that requires both relationship traversal and content understanding.
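
A compact sketch of how Steps 2 to 4 fit together. Everything here is a stand-in: `graph_rows` plays the role of the Cypher result, the in-memory `chunks` list with toy two-dimensional embeddings plays the role of the vector store, and `filtered_search` and `build_prompt` are hypothetical helpers.

```python
import math

# Step 2 stand-in: rows a graph query might return (hypothetical values).
graph_rows = [
    {"contract_id": "C-1", "contract": "Supply agreement", "signer": "A. Example"},
    {"contract_id": "C-3", "contract": "Lease", "signer": "B. Example"},
]

# Vector store stand-in: chunks with toy 2-d embeddings and metadata.
chunks = [
    {"contract_id": "C-1", "text": "Supplier shall deliver monthly.", "vec": (0.9, 0.1)},
    {"contract_id": "C-2", "text": "Tenant shall pay rent.", "vec": (0.95, 0.05)},
    {"contract_id": "C-3", "text": "Lessee shall insure the premises.", "vec": (0.7, 0.3)},
]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def filtered_search(query_vec, allowed_ids, top_k):
    """Step 3: rank only chunks whose contract_id came from the graph."""
    candidates = [c for c in chunks if c["contract_id"] in allowed_ids]
    candidates.sort(key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return candidates[:top_k]


def build_prompt(rows, hits, question):
    """Step 4: combine structured rows and retrieved text into one prompt."""
    table = "\n".join(f'{r["contract_id"]} | {r["contract"]} | {r["signer"]}' for r in rows)
    sections = "\n".join(f'[{h["contract_id"]}] {h["text"]}' for h in hits)
    return (
        "Based on the following information, answer the user's question.\n\n"
        f"Graph query results (structured data):\n{table}\n\n"
        f"Relevant document sections:\n{sections}\n\n"
        f"Question: {question}"
    )


contract_ids = {row["contract_id"] for row in graph_rows}
hits = filtered_search((1.0, 0.0), contract_ids, top_k=20)
prompt = build_prompt(graph_rows, hits, "Summarise the key obligations.")
print(prompt)
```

The chunk from contract C-2 is the closest match to the toy query vector, yet it never reaches the prompt because the graph did not select that contract. That filtering is what keeps the generated answer scoped to the relationship the user asked about.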

A user asks: 'What are the financial exposure risks from contracts managed by the team that was reorganised last quarter?' How does the hybrid graph-vector pattern handle this?

Incremental updates as documents change

A knowledge graph that reflects last month's organisational structure is a liability, not an asset. The graph must stay current as documents change.

Event-driven entity extraction. When a document is updated in your ingestion pipeline (Module 5), add a graph extraction step:

1. Extract entities and relationships from the new/updated chunks.
2. Identify which existing entities the new entities resolve to (entity resolution).
3. Add new entities and relationships to the graph.
4. If the document is an update (not a new document), identify which relationships from the previous version should be removed or modified.

Step 4 is the hardest. When a contract amendment changes the signatory from Person A to Person B, you need to:

Detect that the SIGNED_BY relationship to Person A should be removed (or marked as historical)
Add a new SIGNED_BY relationship to Person B
Potentially add an AMENDED_BY relationship between the original contract and the amendment
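
One way to approach the detection is to diff the relationships extracted from the previous and current versions of a document. A minimal sketch, assuming each version yields a set of (source, relationship, target) triples; the `diff_relationships` helper and the IDs are hypothetical:

```python
def diff_relationships(previous: set[tuple], current: set[tuple]):
    """Return (to_retire, to_add) between two versions' triples."""
    to_retire = previous - current   # present before, absent now
    to_add = current - previous      # new in this version
    return to_retire, to_add


# Hypothetical triples extracted from two versions of the same contract.
old_version = {
    ("contract:msa-001", "SIGNED_BY", "person:a"),
    ("contract:msa-001", "GOVERNED_BY", "regulation:r1"),
}
new_version = {
    ("contract:msa-001", "SIGNED_BY", "person:b"),
    ("contract:msa-001", "GOVERNED_BY", "regulation:r1"),
}

to_retire, to_add = diff_relationships(old_version, new_version)
print("retire:", to_retire)
print("add:", to_add)
```

A relationship that is missing from the new extraction may reflect an extraction miss rather than a real change, which is one reason to mark retired edges as historical instead of deleting them outright.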

Temporal relationships. Enterprise knowledge graphs benefit from temporal modelling -- relationships have a valid_from and valid_to date. "Sarah Chen REPORTS_TO Michael Zhang" was true from 2023-01-15 to 2024-06-30. "Sarah Chen REPORTS_TO Lisa Park" has been true since 2024-07-01. This lets you answer historical queries ("Who did Sarah report to when this contract was signed?") and current queries ("Who does Sarah report to now?") from the same graph.
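
A minimal sketch of temporal edges, using the reporting example above. The `TemporalEdge` class and `manager_on` helper are hypothetical, and treating valid_to as an inclusive end date is an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class TemporalEdge:
    source: str
    relationship: str
    target: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the edge is still valid

    def active_on(self, day: date) -> bool:
        """True if the edge was valid on the given day (inclusive bounds)."""
        return self.valid_from <= day and (self.valid_to is None or day <= self.valid_to)


edges = [
    TemporalEdge("Sarah Chen", "REPORTS_TO", "Michael Zhang", date(2023, 1, 15), date(2024, 6, 30)),
    TemporalEdge("Sarah Chen", "REPORTS_TO", "Lisa Park", date(2024, 7, 1)),
]


def manager_on(person: str, day: date) -> list[str]:
    """Return everyone the person reported to on the given day."""
    return [e.target for e in edges
            if e.source == person and e.relationship == "REPORTS_TO" and e.active_on(day)]


print(manager_on("Sarah Chen", date(2024, 3, 1)))  # historical query
print(manager_on("Sarah Chen", date.today()))      # current query
```

Returning a list rather than a single name leaves room for matrix reporting, where more than one REPORTS_TO edge can legitimately be active on the same day.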

Consistency checks. Run periodic validations:

Orphaned entities: entities with no relationships (might indicate extraction failures)
Contradictory relationships: a person who REPORTS_TO two different managers simultaneously (unless the organisation allows matrix reporting)
Stale entities: entities whose source documents have not been updated in N months (might need re-verification)
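
The three checks above can be sketched over a simple edge list. The tuple layout, the sample data, and the helper names are hypothetical; a production check would typically run as queries against the graph store instead.

```python
from datetime import date, timedelta

entities = {"person:a", "person:b", "person:c", "contract:1"}
# (source, relationship, target, valid_from, valid_to); valid_to None = open-ended
edges = [
    ("person:a", "REPORTS_TO", "person:b", date(2024, 1, 1), None),
    ("person:a", "REPORTS_TO", "person:c", date(2024, 6, 1), None),
]
last_updated = {"person:a": date(2025, 3, 1), "person:b": date(2024, 1, 1)}


def orphaned(entities, edges):
    """Entities that appear in no relationship at all."""
    connected = {e[0] for e in edges} | {e[2] for e in edges}
    return entities - connected


def overlapping_managers(edges):
    """People with two REPORTS_TO edges whose validity intervals overlap."""
    flagged = set()
    reports = [e for e in edges if e[1] == "REPORTS_TO"]
    for i, (src_a, _, tgt_a, from_a, to_a) in enumerate(reports):
        for src_b, _, tgt_b, from_b, to_b in reports[i + 1:]:
            if src_a != src_b or tgt_a == tgt_b:
                continue
            # Two intervals overlap when each starts before the other ends.
            if from_a <= (to_b or date.max) and from_b <= (to_a or date.max):
                flagged.add(src_a)
    return flagged


def stale(last_updated, today, max_age_days):
    """Entities whose source was last updated before the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    return {entity for entity, updated in last_updated.items() if updated < cutoff}


print(orphaned(entities, edges))
print(overlapping_managers(edges))
print(stale(last_updated, today=date(2025, 6, 1), max_age_days=180))
```

The pairwise comparison in `overlapping_managers` is quadratic in the number of REPORTS_TO edges, so for a large graph it would make sense to group edges by source first. Flagged results are candidates for review, not confirmed errors, particularly where matrix reporting is allowed.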

Practical cadence. For most enterprises, graph extraction on new/updated documents happens continuously (event-driven). A full graph consistency check runs weekly or monthly. A complete re-extraction from scratch runs quarterly or when the extraction model is upgraded.

Module 11 -- Final Assessment

Why does vector search fail for the query 'Who approved the change order that introduced the $2M liability?'

What is entity resolution in the context of building a knowledge graph from documents?

In the hybrid graph-vector query pattern, what role does the graph traversal play versus the vector search?

You are building a knowledge graph for an enterprise with 80 million document chunks. Why is it impractical to extract entities and relationships from all 80 million chunks?