RAG Architecture Fundamentals

The canonical RAG pipeline

Before we optimise anything, we need a shared vocabulary for the eight stages of a RAG pipeline. Every RAG system, no matter how sophisticated, is a variation on this sequence:

1. Ingest -- Documents enter the system. PDFs get parsed, emails get extracted, Confluence pages get pulled via API. The output is clean text with metadata (source, date, author, permissions).

2. Chunk -- Clean text gets split into retrieval units. A 50-page contract becomes hundreds of chunks, each sized to be a useful unit of retrieval. This is where most enterprise RAG systems silently fail, and we will dedicate an entire module to it.

3. Embed -- Each chunk gets converted into a dense vector -- a list of floating-point numbers that captures the chunk's semantic meaning. This is done by an embedding model, which is a completely different model from the one that generates answers. If that distinction is not clear to you yet, Module 3 will make it crystal clear.

4. Store -- Vectors and their associated metadata go into a vector database. The database builds an index (typically HNSW or IVF) that enables fast approximate nearest-neighbour search.

5. Query -- A user asks a question. The question itself gets embedded using the same embedding model that embedded the chunks. This produces a query vector.

6. Retrieve -- The query vector is matched against the stored vectors using similarity search (typically cosine similarity). Because the index is approximate, the database does not score every stored vector; it walks the index and returns the top-K most similar chunks it finds -- usually 20-100 candidates at this stage.

7. Rerank -- A more expensive model (a cross-encoder or even the generative LLM itself) re-scores the candidates. The top 5-10 chunks after reranking are the ones that actually get passed to generation. This stage is optional in simple systems and critical in production ones.

8. Generate -- The generative LLM (Gemma 4, Llama, or a proprietary model) receives the user's question plus the reranked context chunks and produces a grounded answer.

This eight-stage pipeline is the skeleton. The quality of your RAG system is determined by how well you execute each stage and how the stages interact.
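Stages 5 and 6 can be sketched in a few lines. This is a minimal illustration, assuming toy three-dimensional vectors in place of real embedding-model output and a brute-force scan in place of an HNSW or IVF index. The names `cosine_similarity`, `top_k`, and the chunk IDs are hypothetical, not taken from any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k):
    """Return the k (chunk_id, score) pairs most similar to the query.

    `store` maps chunk_id -> vector. This scans every vector; a real
    vector database would use an approximate index instead.
    """
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for embedding-model output (assumed data).
store = {
    "audit-q3-2025": [0.9, 0.1, 0.0],
    "financials-q1-2024": [0.1, 0.9, 0.0],
    "hr-policy": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]  # pretend this is the embedded user question
results = top_k(query_vec, store, k=2)
```

The important constraint is the one stated in stage 5: the query vector and the chunk vectors must come from the same embedding model, otherwise the similarity scores are not comparable.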

A user asks: 'What were the key findings in the Q3 2025 audit report?' The system returns irrelevant chunks about Q1 2024 financials. Which stage most likely failed?

The five failure modes of naive RAG

"Naive RAG" is the approach you get from following a LangChain tutorial: split documents into fixed-size chunks, embed them with a single model, do a simple top-K vector search, and pass the results straight to the LLM. It works surprisingly well in demos. It fails predictably in production. Here are the five ways it breaks.

Failure 1: Semantic dilution. Fixed-size chunks (say, 512 tokens) routinely split meaningful content across chunk boundaries. A contract clause that says "The vendor shall deliver within 30 days of the purchase order date, subject to force majeure provisions outlined in Section 14.2" might get split at "subject to" -- and the second half, without the first, is meaningless for retrieval. The embedding of the truncated chunk captures "force majeure in Section 14.2" but not "30-day delivery obligation," destroying the semantic connection between the two halves.
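The boundary problem is easy to reproduce. The sketch below assumes whitespace-separated words as a stand-in for tokens and a tiny chunk size of 12 so the split is visible; `fixed_size_chunks` is a hypothetical helper. The overlapping variant is included only for contrast and is one common mitigation, not a complete fix.

```python
def fixed_size_chunks(text, size, overlap=0):
    """Split text into chunks of `size` words, optionally overlapping."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks


clause = (
    "The vendor shall deliver within 30 days of the purchase order date, "
    "subject to force majeure provisions outlined in Section 14.2"
)

# Naive split: the delivery obligation and its exception land in different chunks.
naive = fixed_size_chunks(clause, size=12)

# With overlap, at least one chunk carries both halves of the clause.
overlapped = fixed_size_chunks(clause, size=12, overlap=8)
```

In the naive split, the second chunk begins at "subject to", exactly the break described above. Overlap trades index size for a better chance that related phrases share a chunk.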

Failure 2: The lost-in-the-middle problem. Liu et al. (2023) demonstrated that LLMs pay more attention to information at the beginning and end of a long context, and less to information in the middle. So when you pass 5-10 retrieved chunks to the LLM and your most relevant chunk happens to land in positions 3-7, the LLM may underweight it. This is not a retrieval failure -- it is a generation failure caused by naive context assembly.
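One way to make context assembly less naive is to reorder chunks so the strongest ones sit at the edges of the prompt. The sketch below is an illustration of that idea under the assumption that the input list is already sorted best-first; `reorder_for_long_context` is a hypothetical helper, and whether this helps a given model should be measured rather than assumed.

```python
def reorder_for_long_context(chunks_best_first):
    """Place the highest-ranked chunks at the start and end of the context.

    Even-indexed items (1st, 3rd, ...) fill the front in order; odd-indexed
    items (2nd, 4th, ...) fill the back in reverse, so the weakest chunks
    end up in the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


ranked = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]  # c1 is the most relevant
ordered = reorder_for_long_context(ranked)
```

Here the best chunk opens the context, the second-best closes it, and the lowest-ranked chunk sits in the middle position.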

Failure 3: No negative signal. Naive RAG always returns the top-K results, even when none of them are actually relevant. If you ask about a topic that is not in your corpus, the system still returns the K "least irrelevant" chunks, and the LLM gamely synthesises an answer from them. There is no mechanism to say "we do not have information about this."
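The simplest way to add a negative signal is a similarity cut-off in front of generation. This is a rough sketch, assuming retrieval returns `(chunk, score)` pairs; the threshold of 0.5, the scores, and `stub_generate` are all invented for illustration, and a usable threshold has to be calibrated against your own embedding model and corpus.

```python
NO_ANSWER = "We do not have information about this."


def answer_or_abstain(results, min_score, generate):
    """Generate only from chunks whose similarity clears `min_score`.

    `results` is a list of (chunk_text, score) pairs.
    `generate` is any callable that turns a list of chunks into an answer.
    """
    relevant = [chunk for chunk, score in results if score >= min_score]
    if not relevant:
        return NO_ANSWER  # nothing cleared the bar: abstain instead of guessing
    return generate(relevant)


def stub_generate(chunks):
    # Hypothetical stand-in for the generative LLM call.
    return f"Answer grounded in {len(chunks)} chunk(s)."


# Assumed scores for a question the corpus covers, and one it does not.
in_corpus = [("Q3 audit findings", 0.82), ("Q3 audit scope", 0.61), ("Cafeteria menu", 0.12)]
off_corpus = [("Cafeteria menu", 0.21), ("Parking rules", 0.18)]

answered = answer_or_abstain(in_corpus, 0.5, stub_generate)
abstained = answer_or_abstain(off_corpus, 0.5, stub_generate)
```

A fixed cut-off is crude, because raw similarity scores are not calibrated probabilities, but it shows where the abstain decision has to sit in the pipeline: after retrieval and before the LLM sees any context.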

Failure 4: Single-hop only. Naive RAG handles questions that can be answered from a single chunk or a handful of related chunks. It fails on multi-hop questions: "Which projects led by managers who joined after 2023 have exceeded their budget by more than 20%?" This requires retrieving manager records, filtering by join date, cross-referencing project assignments, and then checking budget data. Vector similarity search alone cannot do compositional reasoning.

Failure 5: No query understanding. The user's question is embedded as-is and used for retrieval. But user questions are often ambiguous, underspecified, or phrased differently from how the information is stated in the documents. "What's our liability exposure?" could refer to legal liability, financial liability, insurance liability, or regulatory liability. Naive RAG does not expand, disambiguate, or reformulate the query.

Your enterprise RAG system consistently returns relevant documents but the generated answers miss key details that are present in the retrieved context. The relevant information tends to appear in chunks ranked 3rd through 6th. What is the most likely cause?

Patterns that solve the failure modes

Each naive RAG failure mode has a corresponding advanced pattern. Here are the four most impactful.

Query expansion addresses the "no query understanding" failure. Instead of embedding the user's raw question, you first use the generative LLM to produce 3-5 reformulated queries. "What's our liability exposure?" becomes:

"What legal liabilities does the organisation currently face?"
"What is the estimated financial exposure from pending litigation?"
"What regulatory compliance risks could result in fines or penalties?"

Each reformulation is embedded and searched separately. The results are merged using Reciprocal Rank Fusion (RRF), which combines rankings from multiple retrieval signals. Query expansion typically improves recall by 15-30% because it casts a wider semantic net.
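Reciprocal Rank Fusion itself is only a few lines. In this sketch each document's fused score is the sum of 1 / (k + rank) over every result list it appears in. The constant k = 60 is a commonly used default and is treated here as an assumption, as are the document IDs and the three example rankings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first rankings into one fused ranking.

    `rankings` is a list of lists of document IDs. Documents that rank
    well in several lists accumulate the highest fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# One ranking per reformulated query (hypothetical document IDs).
rankings = [
    ["legal-memo", "litigation-report", "insurance-policy"],
    ["litigation-report", "finance-summary", "legal-memo"],
    ["compliance-audit", "litigation-report", "legal-memo"],
]
fused = reciprocal_rank_fusion(rankings)
```

RRF uses only rank positions, never raw scores, which is why it can merge result lists whose similarity scores are not directly comparable.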

HyDE (Hypothetical Document Embeddings) takes a different approach. Instead of reformulating the query, you ask the LLM to generate a hypothetical document that would answer the query. This hypothetical answer is then embedded, and you search for real documents similar to the hypothetical answer rather than similar to the question.

Why does this work? Because questions and answers often use different vocabulary. The question "What's our parental leave policy?" and the document "Employees are entitled to 16 weeks of paid family leave..." are semantically related but lexically different. HyDE bridges this vocabulary gap by converting the question into answer-space before searching.
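The vocabulary gap can be made visible with a deliberately crude setup. The sketch below assumes a bag-of-words "embedding" so that lexical overlap is the only signal, and a canned `fake_llm` in place of a real generative model. Both are stand-ins invented for illustration, as is the distractor document.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query_vec, docs, k=1):
    """Return the k documents most similar to the query vector."""
    return sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)[:k]


def hyde_search(question, docs, generate_hypothetical, k=1):
    """Embed a hypothetical answer instead of the question, then search."""
    hypothetical = generate_hypothetical(question)
    return search(embed(hypothetical), docs, k)


def fake_llm(question):
    # Hypothetical stand-in: a real LLM would draft this answer from the question.
    return "Employees are entitled to paid family leave for several weeks."


docs = [
    "Employees are entitled to 16 weeks of paid family leave.",
    "What is our policy on parental controls for company laptops?",
]
question = "What is our parental leave policy?"

direct = search(embed(question), docs)          # question-space search
via_hyde = hyde_search(question, docs, fake_llm)  # answer-space search
```

With this toy data the direct search picks the laptop document because it shares the question's wording, while the HyDE search picks the leave policy. Real embedding models narrow this gap considerably, so the size of the benefit in practice depends on the model and corpus.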

Multi-hop retrieval addresses compositional questions. Instead of a single retrieval step, the system breaks the question into sub-questions, retrieves for each, and chains the results. "Which projects led by managers who joined after 2023 have exceeded budget?" becomes:

Retrieve: managers who joined after 2023 (returns names)
Retrieve: projects led by [those specific managers]
Retrieve: budget status for [those specific projects]
Synthesise: combine the chain into a final answer

This requires an orchestration layer that uses the LLM to decompose questions and plan retrieval steps. It is slower (3-4 retrieval round-trips instead of one) but handles questions that single-hop retrieval literally cannot answer.
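The chaining itself can be shown with in-memory records. This sketch assumes the decomposition has already been done and hard-codes the three hops; in a real system the LLM would plan the steps and each `retrieve_*` function would be a retrieval call. All names and data are hypothetical.

```python
# Toy records standing in for retrievable documents (assumed data).
MANAGERS = [
    {"name": "Ana", "joined": 2024},
    {"name": "Ben", "joined": 2021},
    {"name": "Chi", "joined": 2025},
]
PROJECTS = [
    {"name": "Atlas", "lead": "Ana", "budget": 100, "spent": 130},
    {"name": "Beacon", "lead": "Ben", "budget": 100, "spent": 150},
    {"name": "Comet", "lead": "Chi", "budget": 200, "spent": 220},
]


def retrieve_managers_joined_after(year):
    """Hop 1: names of managers who joined after `year`."""
    return [m["name"] for m in MANAGERS if m["joined"] > year]


def retrieve_projects_led_by(names):
    """Hop 2: projects whose lead is one of `names`."""
    return [p for p in PROJECTS if p["lead"] in names]


def over_budget(project, threshold=0.20):
    """Hop 3: True if spend exceeds budget by more than `threshold`."""
    return project["spent"] > project["budget"] * (1 + threshold)


def multi_hop_answer(year, threshold=0.20):
    """Chain the hops; each step consumes the previous step's output."""
    managers = retrieve_managers_joined_after(year)
    projects = retrieve_projects_led_by(managers)
    return [p["name"] for p in projects if over_budget(p, threshold)]


answer = multi_hop_answer(2023)
```

Beacon is the most over budget, yet it is correctly excluded because its manager fails the first hop. That dependency between steps is what a single similarity search cannot express.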

Self-RAG addresses the "no negative signal" problem. In self-RAG, the generative model is trained (or prompted) to emit special tokens that indicate whether it is using retrieved context, whether the retrieved context is relevant, and whether its own response is supported by the context. This gives the system a built-in quality signal: if the model flags low relevance or low support, the system can return "I don't have enough information to answer that" instead of hallucinating.

A user asks: 'What is the average response time for customer support tickets in the EMEA region for clients with platinum SLAs?' Your RAG system returns chunks about general SLA terms and average global response times, but nothing specific to EMEA platinum clients. What pattern would most improve this query?

Two models, two jobs

This is one of the most common points of confusion in enterprise RAG, and getting it wrong leads to architectural mistakes that are expensive to fix.

Embedding models convert text into fixed-size numerical vectors. They are small (140M to 1.5B parameters), fast (thousands of chunks per second on a single GPU), and produce output of fixed dimensions (768 to 4096 floats). Their job is to create a numerical representation of meaning so that similar texts produce similar vectors. They do not generate text. They do not understand questions. They do not reason. They are a very sophisticated similarity function.

Generative models (Gemma 4, Llama, GPT-4o, Claude) take text as input and produce text as output. They are large (2B to 400B+ parameters), slower (tens to hundreds of tokens per second), and their output is variable-length natural language. Their job is to read context and produce a thoughtful, structured response.

In a RAG system, you need both. The embedding model handles stages 3 and 5 (embed chunks, embed queries) and supplies the vectors that the vector database compares in stage 6. The generative model handles stage 8 (generation) and, when you do not use a dedicated cross-encoder, stage 7 (reranking). They are different tools for different jobs, like a library's catalogue system (embedding model) versus the librarian who reads the books and answers your questions (generative model).

You should not use Gemma 4 27B as your embedding model. It is a generative model -- it is built to produce text, not fixed-size vectors. You could technically extract internal representations from it, but that would be absurdly expensive (running a 27B model for every chunk embedding when a purpose-built 140M model typically produces better retrieval embeddings) and architecturally wrong.

The correct architecture: a small, fast embedding model (GTE-Qwen2-1.5B, Nomic Embed v2, or similar) running on a modest GPU for embedding and retrieval, plus Gemma 4 for reasoning and generation. The embedding model might process 2,000 chunks per second. The generative model might produce 80 tokens per second. They have completely different performance profiles and hardware requirements.
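A quick back-of-the-envelope calculation shows how different the two profiles are. The rates below are the ones quoted above; the corpus size of one million chunks and the 500-token answer length are assumptions chosen for illustration.

```python
def embedding_seconds(num_chunks, chunks_per_second=2000):
    """Wall-clock time to embed a corpus at a steady chunk rate."""
    return num_chunks / chunks_per_second


def generation_seconds(num_tokens, tokens_per_second=80):
    """Wall-clock time to generate one answer at a steady token rate."""
    return num_tokens / tokens_per_second


embed_corpus = embedding_seconds(1_000_000)  # one-off indexing job
one_answer = generation_seconds(500)         # cost paid on every query

print(f"Embed 1M chunks: {embed_corpus / 60:.1f} minutes")
print(f"Generate one 500-token answer: {one_answer:.2f} seconds")
```

Under these assumptions, indexing a million chunks takes under ten minutes, while every single answer costs several seconds of generation. Embedding is a throughput problem you mostly pay once per document; generation is a latency problem you pay on every query.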

What each stage costs in milliseconds

One of the most practical things you can do as an architect is establish a latency budget for your RAG pipeline. Here are realistic numbers for an on-premises deployment.

Stage	Typical latency	Notes
Query embedding	5-15 ms	Single query through embedding model on GPU
Vector search (HNSW)	2-10 ms	For 10-100M vectors with pre-loaded index
Metadata filtering	1-5 ms	If filters are applied post-retrieval; pre-filtering can add 10-50 ms
Reranking (cross-encoder, top-100 to top-10)	50-150 ms	BGE-reranker-v2 on GPU; scales linearly with candidate count
Context assembly	1-5 ms	Formatting retrieved chunks into the generation prompt
Generation (time to first token)	100-500 ms	Gemma 4 27B on 2x L40S with vLLM; depends on queue depth
Generation (full response, ~500 tokens)	3-8 seconds	At 60-150 tokens/second depending on model and hardware
Total (to first token)	160-685 ms
Total (full response)	3-9 seconds

The critical insight: retrieval is fast; generation is slow. Everything from query embedding through reranking typically completes in under 200 ms. The generation step dominates the end-to-end latency. This means optimising retrieval latency from 50 ms to 20 ms is irrelevant if your generation step takes 5 seconds.
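The budget in the table can be checked by summing the ranges. This sketch uses the table's own numbers; the dictionary keys and the helper `total_range` are hypothetical names. The lower bound sums to 159 ms, which the table rounds to 160 ms.

```python
# (min_ms, max_ms) per stage, copied from the latency table.
STAGES_MS = {
    "query_embedding": (5, 15),
    "vector_search": (2, 10),
    "metadata_filtering": (1, 5),
    "reranking": (50, 150),
    "context_assembly": (1, 5),
    "generation_ttft": (100, 500),
}

RETRIEVAL_STAGES = ["query_embedding", "vector_search", "metadata_filtering", "reranking"]


def total_range(stage_names):
    """Sum the best-case and worst-case latency over the named stages."""
    low = sum(STAGES_MS[name][0] for name in stage_names)
    high = sum(STAGES_MS[name][1] for name in stage_names)
    return low, high


to_first_token = total_range(list(STAGES_MS))
retrieval_only = total_range(RETRIEVAL_STAGES)

print("To first token (ms):", to_first_token)
print("Query embedding through reranking (ms):", retrieval_only)
```

Even in the worst case, the stages from query embedding through reranking stay under 200 ms, while time to first token alone can take up to 500 ms before any of the 3-8 seconds of full-response generation begins.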

Where to invest your optimisation effort depends on your use case. For interactive chat (where users see streaming tokens), time-to-first-token (TTFT) matters most -- and TTFT is dominated by the time the generation model spends waiting in the queue and processing the prompt before it emits its first token. For batch processing (where you process thousands of queries), throughput matters most -- and throughput is dominated by how many concurrent requests your generation infrastructure can handle.

Your enterprise RAG pipeline has a total end-to-end latency of 6 seconds. You need to reduce it to under 4 seconds. Where should you focus optimisation effort?

Module 2 -- Final Assessment

In the canonical 8-stage RAG pipeline, what is the purpose of the reranking stage?

Why does the 'lost-in-the-middle' problem occur in RAG generation?

How does HyDE (Hypothetical Document Embeddings) improve retrieval?

In a typical on-premises RAG pipeline, which stage dominates the end-to-end latency?