Why fixed-size chunking is terrible
The default chunking strategy in every RAG tutorial is fixed-size: split the document every N tokens (typically 256-512), with some overlap (typically 50-100 tokens). It is simple, fast, and deterministic. It is also the single largest source of retrieval failures in production RAG systems.
Here is why. A 512-token fixed-size chunk is an arbitrary window that has no relationship to the document's semantic structure. Consider a legal contract:
...The Vendor shall deliver all equipment specified in Exhibit B
within thirty (30) calendar days of the Effective Date, subject to
[--- CHUNK BOUNDARY ---]
force majeure conditions as defined in Section 14.2. Failure to
deliver within the specified timeframe shall result in liquidated
damages of 0.5% of the total contract value per day of delay...

Chunk 1 contains the delivery obligation but not the penalty. Chunk 2 contains the penalty and force majeure reference but not the delivery timeframe. Neither chunk, on its own, answers the question "What happens if the vendor delivers late?" The semantic unit -- the complete delivery clause -- has been split across two chunks.
Overlap (including 50-100 tokens from the previous chunk) partially mitigates this, but it is a crude fix. The overlap might capture the tail of the delivery obligation in chunk 2, but it might not. And overlap increases your total vector count (and therefore storage cost and search time) by 10-30% without any guarantee of capturing the right information.
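To make the overlap overhead concrete, here is a minimal sketch of fixed-size chunking with overlap. The function name `fixed_size_chunks` and the synthetic token list are assumptions for illustration; a real pipeline would use its tokenizer's output.

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size windows that overlap by `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Synthetic tokens stand in for real tokenizer output (illustration only).
tokens = [f"tok{i}" for i in range(10_000)]
without_overlap = fixed_size_chunks(tokens, chunk_size=512, overlap=0)
with_overlap = fixed_size_chunks(tokens, chunk_size=512, overlap=50)
overhead = len(with_overlap) / len(without_overlap) - 1
print(len(without_overlap), len(with_overlap), f"{overhead:.0%} more vectors")
```

Note that the window boundaries depend only on token positions, which is exactly why a clause can be cut in half.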
At enterprise scale, fixed-size chunking compounds these problems. A 50,000-page corpus might produce 500,000 chunks, of which 20-40% have semantic breaks at chunk boundaries. That is 100,000-200,000 chunks that are degraded retrieval units. Every one of them is a potential missed answer.
Your RAG system uses 512-token fixed-size chunks with 50-token overlap. Users report that the system can answer questions about individual facts but fails on questions requiring understanding of complete clauses or procedures. What is the most likely cause?
Using the embedding model to find natural break points
Semantic chunking uses the embedding model itself to determine where to split text. The intuition: if two adjacent sentences have very different embeddings, there is a topic shift -- a natural place to insert a chunk boundary.
The algorithm:
- Split the document into sentences (using a sentence tokeniser, not just splitting on periods -- abbreviations like "Dr." and "U.S." need handling).
- Compute the embedding for each sentence.
- Calculate the cosine similarity between consecutive sentence embeddings.
- Where similarity drops below a threshold (or drops significantly relative to the local average), insert a chunk boundary.
- Group sentences between boundaries into chunks.
This produces chunks that align with topic shifts in the document. A section about delivery timelines stays in one chunk. A transition to payment terms starts a new chunk. The boundaries are semantically meaningful rather than arbitrary.
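The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not a production implementation: `toy_embed` is a bag-of-words stand-in for a real embedding model, the sentences are assumed to be already split by a sentence tokeniser, and the fixed threshold of 0.2 is arbitrary.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_embed(sentence):
    """Bag-of-words stand-in for a real embedding model (illustration only)."""
    return Counter(word.strip(".,").lower() for word in sentence.split())


def semantic_chunks(sentences, embed, threshold):
    """Group consecutive sentences; start a new chunk when similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, curr, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, curr) < threshold:
            chunks.append([sentence])      # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)    # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "The vendor shall deliver the equipment within thirty days.",
    "Late delivery of the equipment shall incur liquidated damages.",
    "Invoices are payable within sixty days of receipt.",
    "Overdue invoices are payable with interest at the agreed rate.",
]
chunks = semantic_chunks(sentences, toy_embed, threshold=0.2)
for chunk in chunks:
    print(chunk)
```

With these inputs the delivery sentences end up in one chunk and the payment sentences in another. Swapping `toy_embed` for a real model changes the similarity values, so the threshold would need retuning.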
Tuning the threshold. A lower similarity threshold produces larger chunks (only splitting at major topic shifts). A higher threshold produces smaller chunks (splitting at minor topic shifts too). The right setting depends on your domain and document types. Legal contracts have sharper topic boundaries than narrative reports. Start with the 20th percentile of similarity scores as your threshold and adjust based on retrieval quality.
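One way to derive a percentile-based starting threshold is with the standard library's `statistics.quantiles`. The helper name `percentile_threshold` and the similarity values are made up for illustration.

```python
import statistics


def percentile_threshold(similarities, percentile=20):
    """Similarity value at the given percentile (1-99) of the observed scores."""
    cut_points = statistics.quantiles(similarities, n=100, method="inclusive")
    return cut_points[percentile - 1]


# Made-up similarities between consecutive sentences in one document.
similarities = [0.82, 0.79, 0.31, 0.85, 0.77, 0.28, 0.80, 0.83, 0.76, 0.81, 0.84]
threshold = percentile_threshold(similarities)
# similarities[i] compares sentence i with sentence i + 1, so a drop at i
# means a new chunk starts at sentence i + 1.
boundaries = [i + 1 for i, s in enumerate(similarities) if s < threshold]
print(f"threshold={threshold:.2f}, new chunks start at sentences {boundaries}")
```

Computing the percentile per document rather than per corpus is a design choice: it adapts to documents whose baseline similarity is unusually high or low.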
The cost. Semantic chunking requires computing an embedding for every sentence in your corpus before you compute the final chunk embeddings. For a large corpus, this pre-processing step can be significant. A 10 TB corpus with ~200 million sentences, at 2,000 sentences per second (Nomic Embed v2 on A100), takes roughly 28 hours. This is a one-time cost that produces substantially better chunks.
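The estimate follows directly from the figures quoted above. A small helper (the name `embedding_hours` is hypothetical) makes it easy to re-run the arithmetic for your own corpus size and measured throughput.

```python
def embedding_hours(sentence_count, sentences_per_second):
    """Wall-clock hours to embed every sentence once at a steady throughput."""
    return sentence_count / sentences_per_second / 3600


hours = embedding_hours(200_000_000, 2_000)
print(f"{hours:.1f} hours")
```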
The limitation. Semantic chunking works well for flowing text but poorly for documents with implicit structure that does not produce clear similarity drops: numbered lists, bullet points, FAQ sections where every item is a distinct topic. For these, document-structure-aware chunking is better.
Respecting headings, sections, and document hierarchy
If your ingestion pipeline produces structured output (which Docling and Unstructured do), you have access to the document's hierarchy: headings, subheadings, paragraphs, lists, tables. This structure is the author's own indication of how information is organised, and it is far more reliable than algorithmic boundary detection.
Document-structure-aware chunking follows these rules:
- Never split within a paragraph. Paragraphs are the smallest semantic unit the author intended to be read as a whole.
- Group paragraphs under their nearest heading. A heading and its subsequent paragraphs form a natural chunk.
- Respect the hierarchy. An H2 section with three H3 subsections produces either one large chunk (the whole H2) or three medium chunks (one per H3), but never a chunk that starts in one H3 and ends in another.
- Handle tables as atomic units. A table, plus its caption and any immediately surrounding explanatory text, should be a single chunk. Splitting a table across chunks is almost always destructive.
- Include hierarchical context. When you create a chunk from an H3 subsection, prefix it with the H1 and H2 headings above it. A chunk reading "Liquidated damages of 0.5% per day..." is less useful than one reading "Master Services Agreement > Delivery Terms > Penalties: Liquidated damages of 0.5% per day..."
This approach requires structured document parsing (Docling, Unstructured with layout analysis) rather than raw text extraction. The investment in better extraction pays off directly in chunking quality.
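The grouping and heading-prefix rules can be sketched as follows. The input format, a flat list of `(kind, level, text)` tuples, is a simplified assumption for illustration and is not the actual output schema of Docling or Unstructured; tables and lists are omitted to keep the sketch short.

```python
def structure_chunks(elements):
    """Group paragraphs under their nearest heading and prefix each chunk with its heading path.

    `elements` is a list of ("heading", level, text) or ("paragraph", None, text) tuples.
    """
    chunks = []
    path = []   # current heading trail, outermost first
    body = []   # paragraphs collected under the current heading

    def flush():
        if body:
            prefix = " > ".join(path)
            text = " ".join(body)
            chunks.append(f"{prefix}: {text}" if prefix else text)
            body.clear()

    for kind, level, text in elements:
        if kind == "heading":
            flush()                 # a heading always closes the previous chunk
            del path[level - 1:]    # drop headings at this level or deeper
            path.append(text)
        else:
            body.append(text)       # paragraphs are kept whole, never split
    flush()
    return chunks


elements = [
    ("heading", 1, "Master Services Agreement"),
    ("heading", 2, "Delivery Terms"),
    ("paragraph", None, "The Vendor shall deliver all equipment within thirty (30) calendar days."),
    ("heading", 3, "Penalties"),
    ("paragraph", None, "Liquidated damages of 0.5% per day of delay apply."),
    ("heading", 2, "Payment Terms"),
    ("paragraph", None, "Invoices are payable within sixty days."),
]
structured = structure_chunks(elements)
for chunk in structured:
    print(chunk)
```

Because a heading always flushes the pending paragraphs, no chunk can start under one heading and end under another.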
Handling oversized sections. Some document sections are enormous -- a 40-page appendix under a single heading. When a structure-aware chunk exceeds your maximum size (typically 1000-2000 tokens), fall back to semantic chunking within that section. This gives you the best of both: structural boundaries where the document provides them, semantic boundaries where it does not.
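A minimal sketch of the fallback logic, assuming the token counter and the semantic splitter are supplied by the caller. `count_words` and `split_on_blank_lines` below are stand-ins for illustration, not a real tokenizer or semantic chunker.

```python
def chunk_section(heading_path, text, max_tokens, count_tokens, semantic_split):
    """Keep a section whole if it fits; otherwise fall back to a semantic splitter."""
    if count_tokens(text) <= max_tokens:
        pieces = [text]
    else:
        pieces = semantic_split(text)
    # Every piece keeps the heading path so it still carries its structural context.
    return [f"{heading_path}: {piece}" for piece in pieces]


def count_words(text):
    """Stand-in token counter (illustration only)."""
    return len(text.split())


def split_on_blank_lines(text):
    """Stand-in for a semantic splitter (illustration only)."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


small = "Invoices are payable within sixty days."
large = "Term A means the first thing defined here.\n\nTerm B means the second thing defined here."
kept_whole = chunk_section("Policy > Payment", small, 8, count_words, split_on_blank_lines)
split_up = chunk_section("Policy > Appendix A", large, 8, count_words, split_on_blank_lines)
print(kept_whole)
print(split_up)
```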
A 200-page policy document has the structure: H1 (10 sections) > H2 (3-5 subsections each) > paragraphs. Most H2 subsections are 300-600 tokens. One H2 subsection ('Appendix A: Definitions') is 8,000 tokens. How should you chunk this document?
Embed at sentence, paragraph, section, and document level simultaneously
Here is a technique that dramatically improves retrieval for diverse query types: embed the same content at multiple granularities and store all of them.
The insight: different queries require different retrieval granularities. "What is the liquidated damages percentage?" is best answered by a single sentence. "Summarise our delivery obligations" requires a full section. "Compare our vendor contracts" requires document-level understanding.
Multi-granularity indexing creates embeddings at four levels:
| Level | Example | Best for |
|---|---|---|
| Sentence | "Liquidated damages shall be 0.5% per day." | Precise fact retrieval |
| Paragraph | The full paragraph containing the sentence above, plus surrounding context | Questions requiring context around a specific fact |
| Section | The entire "Delivery Terms" section under its heading | Summarisation, comprehensive topic queries |
| Document | A summary of the entire contract (generated by the LLM) | Comparative queries, "which document discusses X" |
At query time, the vector search retrieves results across all granularity levels. A precise factual query naturally matches sentence-level embeddings (which are tightly focused on the specific fact). A broad thematic query naturally matches section or document-level embeddings (which capture the overall topic).
The cost. Multi-granularity indexing increases your vector count several-fold compared to single-level chunking: commonly 3-5x, and more when paragraphs contain many sentences. For an 80-million-chunk corpus at the paragraph level, you might have 500 million sentence embeddings, 80 million paragraph embeddings, 10 million section embeddings, and 1 million document embeddings -- roughly 591 million vectors total, or about 7x the paragraph-only index. The storage cost increase is significant, but the retrieval quality improvement is substantial.
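The totals in this example can be checked directly, which also gives the multiplier relative to a paragraph-only index. The counts are the ones quoted above; substitute your own corpus statistics.

```python
counts = {
    "sentence": 500_000_000,
    "paragraph": 80_000_000,
    "section": 10_000_000,
    "document": 1_000_000,
}
total = sum(counts.values())
multiplier = total / counts["paragraph"]
print(f"{total:,} vectors, {multiplier:.1f}x the paragraph-only index")
```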
The implementation. Tag each vector with a granularity metadata field ("sentence", "paragraph", "section", "document") and a parent_id linking it to its containing chunk at the next level. This enables granularity-aware retrieval: you can restrict search to a specific level, or search across all levels and let the reranker sort out which granularity best answers the query.
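A minimal in-memory sketch of the metadata layout and a granularity filter. The `VectorRecord` class, the `search` function, and the two-dimensional embeddings are hypothetical; a real vector database would express the filter through its own metadata query syntax.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VectorRecord:
    id: str
    text: str
    granularity: str            # "sentence", "paragraph", "section" or "document"
    parent_id: Optional[str]    # containing record one level up; None at document level
    embedding: List[float]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(records, query_vec, top_k=3, granularity=None):
    """Rank records by dot product, optionally restricted to one granularity level."""
    candidates = [r for r in records if granularity is None or r.granularity == granularity]
    ranked = sorted(candidates, key=lambda r: dot(query_vec, r.embedding), reverse=True)
    return ranked[:top_k]


# Hand-set toy embeddings (illustration only).
records = [
    VectorRecord("doc-1", "Summary of the Master Services Agreement", "document", None, [0.5, 0.5]),
    VectorRecord("sec-1", "Delivery Terms section", "section", "doc-1", [0.7, 0.3]),
    VectorRecord("par-1", "Paragraph on late delivery penalties", "paragraph", "sec-1", [0.9, 0.1]),
    VectorRecord("sen-1", "Liquidated damages shall be 0.5% per day.", "sentence", "par-1", [1.0, 0.0]),
]
query_vec = [1.0, 0.0]
all_levels = [r.id for r in search(records, query_vec, top_k=2)]
sections_only = [r.id for r in search(records, query_vec, granularity="section")]
print(all_levels, sections_only)
```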
Retrieve the child, return the parent
Parent-child chunking addresses a specific problem: small chunks embed well (focused, precise semantics) but contain too little context for the generative model to produce a good answer.
The pattern:
- Index small chunks (children). Embed sentences or small paragraphs for maximum retrieval precision. These are the units that the vector database searches over.
- Return larger chunks (parents). When a child chunk is retrieved, return its parent chunk (the full section or surrounding paragraphs) to the generative model. The parent provides the context the model needs.
This separates the retrieval unit (small, precise) from the generation unit (large, contextual). You get the best of both: precise retrieval that finds exactly the right information, and rich context that gives the generative model enough material to produce a comprehensive answer.
Implementation in your vector database: each child vector has a parent_id metadata field pointing to the parent chunk. After retrieval, a lookup step fetches the parent text. Deduplication is important -- if three child chunks from the same parent are retrieved, you only pass the parent to generation once.
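The lookup and deduplication step can be sketched as below. The dictionary shapes and the name `parents_for_generation` are assumptions for illustration; the child hits are assumed to arrive sorted best-first from the vector search.

```python
def parents_for_generation(child_hits, parents):
    """Map retrieved children to parent texts, keeping each parent once, in rank order."""
    seen = set()
    contexts = []
    for child in child_hits:
        parent_id = child["parent_id"]
        if parent_id not in seen:       # skip parents already queued for generation
            seen.add(parent_id)
            contexts.append(parents[parent_id])
    return contexts


parents = {
    "sec-delivery": "Delivery Terms: full section text...",
    "sec-payment": "Payment Terms: full section text...",
}
child_hits = [
    {"id": "s1", "parent_id": "sec-delivery"},
    {"id": "s2", "parent_id": "sec-delivery"},
    {"id": "s7", "parent_id": "sec-payment"},
    {"id": "s3", "parent_id": "sec-delivery"},
]
contexts = parents_for_generation(child_hits, parents)
print(contexts)
```

Keeping the first occurrence preserves the rank of the best-scoring child, so the most relevant parent still appears first in the prompt.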
Synthetic query generation
This technique uses the generative model (Gemma 4) during indexing to improve retrieval quality at query time. For each chunk, ask Gemma 4: "What questions would this chunk answer?"
A chunk about liquidated damages might generate:
- "What is the penalty for late delivery?"
- "How are liquidated damages calculated?"
- "What percentage is deducted per day of delay?"
These synthetic queries are embedded and stored alongside the chunk's own embedding. At query time, a user asking "What happens if the vendor is late?" matches the synthetic query "What is the penalty for late delivery?" with high similarity, even though the user's phrasing is quite different from the chunk's text.
This is effectively the inverse of HyDE. HyDE generates a hypothetical answer at query time. Synthetic queries generate hypothetical questions at index time. The advantage: the generation cost is amortised over index time (run once) rather than incurred at query time (run for every query).
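A sketch of the index-time side, under clear assumptions: `generate_questions` is a hypothetical callback standing in for the LLM call (here it returns canned questions), and `toy_embed` is a bag-of-words stand-in for the embedding model. Each synthetic question is stored as its own vector that points back to the chunk it was generated from.

```python
import math
from collections import Counter


def toy_embed(text):
    """Bag-of-words stand-in for a real embedding model (illustration only)."""
    return Counter(word.strip("?.,").lower() for word in text.split())


def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks, generate_questions, embed):
    """Index each chunk under its own embedding plus one embedding per synthetic question."""
    index = []
    for chunk_id, text in chunks.items():
        index.append({"chunk_id": chunk_id, "source": "chunk", "vector": embed(text)})
        for question in generate_questions(text):
            index.append({"chunk_id": chunk_id, "source": "synthetic_query", "vector": embed(question)})
    return index


def best_match(index, query, embed):
    """Return (chunk_id, source) of the entry most similar to the query."""
    query_vec = embed(query)
    best = max(index, key=lambda entry: cosine(query_vec, entry["vector"]))
    return best["chunk_id"], best["source"]


chunks = {
    "c1": "Failure to deliver within the specified timeframe shall result in liquidated "
          "damages of 0.5% of the total contract value per day of delay.",
    "c2": "Invoices are payable within sixty days of receipt.",
}
# Canned questions stand in for the generative model's output.
canned = {
    chunks["c1"]: [
        "What is the penalty for late delivery?",
        "How are liquidated damages calculated?",
        "What percentage is deducted per day of delay?",
    ],
    chunks["c2"]: ["When are invoices due?"],
}
index = build_index(chunks, canned.get, toy_embed)
match = best_match(index, "What happens if the vendor is late?", toy_embed)
print(match)
```

In this toy setup the user's question lands on a synthetic query for chunk `c1` rather than on the chunk text itself, which is the effect described above.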
The cost: generating 3-5 synthetic queries per chunk using Gemma 4 12B takes approximately 1-2 seconds per chunk. For 80 million chunks, that is 80-160 million seconds of GPU time -- roughly 2.5-5 years on a single GPU. In practice, you parallelise across multiple GPUs and accept that this is a significant one-time investment. At 8 GPUs, the timeline drops to roughly 4-8 months. For many enterprises, applying synthetic queries selectively (to high-value document categories only) is more practical.
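The timeline arithmetic can be reproduced with a short helper. The name `generation_years` is hypothetical, and the division by GPU count assumes perfect linear scaling, which real clusters only approximate.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def generation_years(chunk_count, seconds_per_chunk, gpus=1):
    """Years of wall-clock time, assuming work divides evenly across GPUs."""
    return chunk_count * seconds_per_chunk / gpus / SECONDS_PER_YEAR


single_gpu = [generation_years(80_000_000, s) for s in (1, 2)]
eight_gpu_months = [generation_years(80_000_000, s, gpus=8) * 12 for s in (1, 2)]
print([f"{y:.1f} years" for y in single_gpu])
print([f"{m:.1f} months" for m in eight_gpu_months])
```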
You implement parent-child chunking where sentences are indexed (children) but full sections are returned to the generative model (parents). Three sentences from the same section are all retrieved for a query. What should your pipeline do?
The precision-recall tradeoff
There is a temptation to chunk aggressively -- smaller chunks, more chunks, index everything at every granularity. More vectors means more chances to match a query, right?
Not necessarily. More chunks means more noise in retrieval results. If you index every sentence individually, a query about "delivery obligations" might match:
- "The vendor shall deliver within 30 days" (relevant)
- "Delivery of this notice shall be by email" (irrelevant -- different sense of "delivery")
- "The final delivery was completed on March 15" (marginally relevant -- a historical fact, not the obligation)
At the sentence level, all three have high semantic similarity to "delivery obligations." At the paragraph level, the second and third sentences would sit inside more context, which pushes their embeddings away from the legal-obligation sense of "delivery."
Precision is the fraction of retrieved results that are actually relevant. Recall is the fraction of all relevant results that were retrieved.
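Both definitions translate directly into set arithmetic. In this sketch the identifiers `s1` to `s4` are hypothetical: `s1` to `s3` are the three retrieved sentences from the example, and `s4` is an assumed relevant sentence the search missed.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved IDs against the relevant IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = precision_recall(retrieved=["s1", "s2", "s3"], relevant=["s1", "s4"])
print(f"precision={precision:.2f}, recall={recall:.2f}")
```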
Smaller chunks improve recall (more chances to match) but often hurt precision (more false positives). Larger chunks improve precision (more context disambiguates meaning) but can hurt recall (the embedding must represent more information, diluting the specific topic you are searching for).
The practical balance:
- For factual Q&A: favour smaller chunks (200-400 tokens) with reranking to filter noise. High recall matters because you need to find the specific fact.
- For summarisation and analysis: favour larger chunks (800-1500 tokens) with fewer retrieved results. High precision matters because you need coherent context, not fragments.
- For mixed workloads: use multi-granularity indexing with a reranker that can evaluate relevance across granularity levels.
The reranker is the safety net. If your chunking produces some false-positive matches (inevitable at any granularity), a strong reranker -- especially an LLM-based one like Gemma 4 E4B used as a cross-encoder -- can identify and demote the irrelevant results before they reach generation.
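The reranking step itself is simple once a scoring function exists. In this sketch `score_fn` is a hypothetical callback standing in for the reranker model, the canned scores are invented for illustration, and `min_score` is an assumed cut-off rather than a recommended value.

```python
def rerank(query, candidates, score_fn, top_n=3, min_score=0.0):
    """Re-score candidates against the query, drop weak matches, return the best top_n."""
    scored = [(score_fn(query, text), text) for text in candidates]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept[:top_n]]


candidates = [
    "Delivery of this notice shall be by email",
    "The final delivery was completed on March 15",
    "The vendor shall deliver within 30 days",
]
# Invented relevance scores standing in for a real reranker's output.
canned_scores = {candidates[0]: 0.05, candidates[1]: 0.40, candidates[2]: 0.95}


def canned_score(query, text):
    return canned_scores[text]


reranked = rerank("delivery obligations", candidates, canned_score, min_score=0.3)
print(reranked)
```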
Module 6 -- Final Assessment
What is the primary problem with fixed-size chunking (e.g., 512 tokens with 50-token overlap) for enterprise document corpora?
In parent-child chunking, what are the 'children' and 'parents' respectively used for?
What does the synthetic query generation technique do, and when is the generation cost incurred?
You are building a RAG system that must handle both precise factual queries ('What is the liquidated damages percentage?') and broad analytical queries ('Summarise our delivery obligations'). Which chunking approach best serves both?