The canonical RAG pipeline
Before we optimise anything, we need a shared vocabulary for the eight stages of a RAG pipeline. Every RAG system, no matter how sophisticated, is a variation on this sequence:
1. Ingest -- Documents enter the system. PDFs get parsed, emails get extracted, Confluence pages get pulled via API. The output is clean text with metadata (source, date, author, permissions).
2. Chunk -- Clean text gets split into retrieval units. A 50-page contract becomes hundreds of chunks, each sized to be a useful unit of retrieval. This is where most enterprise RAG systems silently fail, and we will dedicate an entire module to it.
3. Embed -- Each chunk gets converted into a dense vector -- a list of floating-point numbers that captures the chunk's semantic meaning. This is done by an embedding model, which is a separate model from the one that generates answers. If that distinction is not yet clear, Module 3 covers it in detail.
4. Store -- Vectors and their associated metadata go into a vector database. The database builds an index (typically HNSW or IVF) that enables fast approximate nearest-neighbour search.
5. Query -- A user asks a question. The question is embedded with the same embedding model that embedded the chunks, so that the resulting query vector lives in the same vector space as the chunk vectors and can be compared with them.
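6. Retrieve -- The query vector is compared against the stored vectors using similarity search (typically cosine similarity). With an approximate index such as HNSW or IVF, the database examines only a subset of the vectors rather than scanning all of them. It returns the top-K most similar chunks -- usually 20-100 candidates at this stage.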
7. Rerank -- A more expensive model (a cross-encoder or even the generative LLM itself) re-scores the candidates. The top 5-10 chunks after reranking are the ones that actually get passed to generation. This stage is optional in simple systems and critical in production ones.
8. Generate -- The generative LLM (Gemma 4, Llama, or a proprietary model) receives the user's question plus the reranked context chunks and produces a grounded answer.
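To make stages 2 through 6 concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than a recommended implementation: the function names (`chunk`, `embed`, `build_index`, `retrieve`) are hypothetical, the "embedding model" is a hashed bag of words standing in for a real neural model, and the "vector database" is a plain list that is scanned exhaustively instead of an HNSW or IVF index.

```python
import hashlib
import math
import re

DIM = 256  # toy vector size, chosen arbitrarily for this sketch


def chunk(text, size=12, overlap=4):
    """Stage 2: split text into overlapping windows of `size` words."""
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text):
    """Stages 3 and 5: toy stand-in for an embedding model.

    Each token is hashed into one of DIM buckets, and the resulting count
    vector is L2-normalised. A real model captures meaning; this only
    captures word overlap.
    """
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def build_index(docs):
    """Stage 4: store (vector, chunk text, metadata) records in a list."""
    return [
        (embed(piece), piece, {"source": name})
        for name, text in docs.items()
        for piece in chunk(text)
    ]


def retrieve(index, question, k=3):
    """Stages 5 and 6: embed the question and return the k closest chunks.

    Vectors are unit length, so the dot product equals cosine similarity.
    This is an exhaustive scan; a vector database would use its index.
    """
    q = embed(question)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), text, meta)
        for vec, text, meta in index
    ]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]


docs = {
    "contract.txt": "The supplier shall deliver the goods within thirty days "
                    "of the purchase order. Late delivery incurs a penalty "
                    "of two percent per week.",
    "handbook.txt": "Employees accrue vacation days monthly. Unused vacation "
                    "days expire at the end of the calendar year.",
}
index = build_index(docs)
for score, text, meta in retrieve(index, "What is the penalty for late delivery?", k=2):
    print(f"{score:.2f}  {meta['source']}  {text}")
```

Note that `embed` is called in two places: once per chunk when the index is built and once per question at query time. That mirrors the requirement in stage 5 that both sides use the same embedding model.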
This eight-stage pipeline is the skeleton. The quality of your RAG system is determined by how well you execute each stage and how the stages interact.
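One of those interactions, the hand-off from reranking to generation, can be sketched in a few lines of Python. This is a toy under stated assumptions: `rerank` and `build_prompt` are hypothetical names, the scorer is simple term overlap standing in for a cross-encoder, and the prompt wording is only one plausible template. No model is called; the sketch stops at the prompt that stage 8 would receive.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank(question, candidates, keep=2):
    """Stage 7: re-score each candidate against the question, keep the best.

    The score is the fraction of question terms that appear in the chunk.
    A cross-encoder would instead read the question and chunk together.
    """
    q = tokens(question)
    scored = [(len(q & tokens(c)) / max(len(q), 1), c) for c in candidates]
    # sort() is stable, so ties keep their original retrieval order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:keep]]


def build_prompt(question, chunks):
    """Stage 8 input: the question plus numbered context chunks."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


candidates = [
    "Unused vacation days expire at the end of the calendar year.",
    "Late delivery incurs a penalty of two percent per week.",
    "The supplier shall deliver the goods within thirty days.",
]
question = "What is the penalty for late delivery?"
top = rerank(question, candidates, keep=2)
print(build_prompt(question, top))
```

Numbering the chunks in the prompt is a design choice, not a requirement: it gives the generative model a way to cite which chunk supports each claim.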