Enterprise RAG on Your Own Infrastructure

RAG Architecture Fundamentals

The canonical RAG pipeline from ingestion to generation, why naive RAG fails at enterprise scale, advanced patterns like HyDE and self-RAG, and latency budgets for every stage.

The canonical RAG pipeline

Before we optimise anything, we need a shared vocabulary for the eight stages of a RAG pipeline. Every RAG system, no matter how sophisticated, is a variation on this sequence:

1. Ingest -- Documents enter the system. PDFs get parsed, emails get extracted, Confluence pages get pulled via API. The output is clean text with metadata (source, date, author, permissions).

2. Chunk -- Clean text gets split into retrieval units. A 50-page contract becomes hundreds of chunks, each sized to be a useful unit of retrieval. This is where most enterprise RAG systems silently fail, and we will dedicate an entire module to it.

3. Embed -- Each chunk gets converted into a dense vector -- a list of floating-point numbers that captures the chunk's semantic meaning. This is done by an embedding model, which is a completely different model from the one that generates answers. If that distinction is not clear to you yet, Module 3 will make it crystal clear.

4. Store -- Vectors and their associated metadata go into a vector database. The database builds an index (typically HNSW or IVF) that enables fast approximate nearest-neighbour search.

5. Query -- A user asks a question. The question itself gets embedded using the same embedding model that embedded the chunks. This produces a query vector.

6. Retrieve -- The query vector is compared against the stored vectors using similarity search (typically cosine similarity); thanks to the ANN index, the database does this without scanning every vector. It returns the top-K most similar chunks -- usually 20-100 candidates at this stage.

7. Rerank -- A more expensive model (a cross-encoder or even the generative LLM itself) re-scores the candidates. The top 5-10 chunks after reranking are the ones that actually get passed to generation. This stage is optional in simple systems and critical in production ones.

8. Generate -- The generative LLM (Gemma, Llama, or a proprietary model) receives the user's question plus the reranked context chunks and produces a grounded answer.
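To make stage 2 concrete, here is a minimal fixed-size chunker with overlap. This is a sketch only: the character-based sizing, the 500/50 defaults, and the function name are illustrative assumptions, not recommendations -- production chunkers typically count tokens and respect document structure.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping fixed-size chunks.

    Sizes are in characters for simplicity; real systems
    usually count tokens and split on semantic boundaries.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

document = "x" * 1200  # stand-in for parsed document text
print(len(chunk_text(document)))  # → 3
```

The overlap ensures that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of some storage duplication.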

This eight-stage pipeline is the skeleton. The quality of your RAG system is determined by how well you execute each stage and how the stages interact.
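To make the query-side stages (embed, store, retrieve) concrete, here is a toy end-to-end sketch. The bag-of-characters `embed` function is a deliberate stand-in -- a real pipeline calls an embedding model here, and a vector database would index the vectors with HNSW or IVF rather than scanning a list.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: a 26-dim bag-of-characters vector.
    A real system would call an embedding model instead."""
    counts = Counter(text.lower())
    return [counts.get(chr(c), 0) for c in range(ord('a'), ord('z') + 1)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Store: embed each chunk (a vector DB would build an ANN index over these)
chunks = ["quarterly audit findings", "employee onboarding guide", "audit scope for Q3"]
index = [(c, embed(c)) for c in chunks]

# Query + Retrieve: embed the question with the SAME model, rank by similarity
query_vec = embed("what did the audit find?")
top_k = sorted(index, key=lambda cv: cosine(query_vec, cv[1]), reverse=True)[:2]
print([c for c, _ in top_k])  # → ['quarterly audit findings', 'audit scope for Q3']
```

Even this crude embedding surfaces the audit-related chunks first, which illustrates the key contract of stages 5-6: query and chunks must pass through the same embedding model, or the similarity scores are meaningless.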


A user asks: 'What were the key findings in the Q3 2025 audit report?' The system returns irrelevant chunks about Q1 2024 financials. Which stage most likely failed?