Enterprise RAG on Your Own Infrastructure

Retrieval and Reranking

The two-stage retrieval pattern, cross-encoder rerankers, LLM-based reranking with Gemma 4, query expansion, HyDE, Reciprocal Rank Fusion, and measuring retrieval quality with MRR, NDCG, and Recall@K.

The two-stage retrieval pattern

In a naive RAG system, you run a single vector search, take the top-K results, and pass them directly to the generative model. This is fast but imprecise. The vector search is optimised for recall -- finding all potentially relevant results -- not for precision. The top-10 results from a vector search against 100 million vectors will include some genuinely relevant chunks and some that are topically adjacent but not actually useful for answering the query.

The two-stage retrieval pattern fixes this by separating the recall-optimised stage from the precision-optimised stage:

Stage 1: Fast retrieval (recall-optimised). Run the vector search and retrieve a large candidate set -- typically the top 50-100 results. This is cheap: roughly 2-10 ms with an HNSW index, whose query latency grows only logarithmically with corpus size. The goal is to miss no relevant chunks, even at the cost of including some irrelevant ones.

Stage 2: Slow reranking (precision-optimised). Run a more expensive model over the 50-100 candidates to re-score them. The reranker evaluates each candidate against the query with much more sophistication than cosine similarity. Keep the top 5-10 after reranking for generation.
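The two stages can be sketched in a few lines of Python. This is a minimal illustration over an in-memory toy corpus: the function name `two_stage_retrieve`, the brute-force cosine scan (standing in for an HNSW index), and the keyword stand-in for a trained reranker are all assumptions for demonstration, not a production implementation.

```python
import numpy as np

def two_stage_retrieve(query_vec, index_vectors, chunks, rerank_fn,
                       candidates=50, top_k=5):
    # Stage 1 (recall): score every stored vector by cosine similarity and
    # keep a generous candidate set. An ANN index would replace this scan.
    norms = np.linalg.norm(index_vectors, axis=1) * np.linalg.norm(query_vec)
    sims = (index_vectors @ query_vec) / np.where(norms == 0, 1.0, norms)
    candidate_ids = np.argsort(sims)[::-1][:candidates]

    # Stage 2 (precision): run the expensive scorer only on the candidates.
    scored = sorted(((rerank_fn(chunks[i]), i) for i in candidate_ids),
                    reverse=True)
    return [chunks[i] for _, i in scored[:top_k]]

chunks = ["contract delivery terms", "company picnic", "vendor obligations"]
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.3]])
query = np.array([1.0, 0.1])
top = two_stage_retrieve(query, vecs, chunks,
                         rerank_fn=lambda c: "vendor" in c,  # toy reranker
                         candidates=3, top_k=1)
```

The key structural point is that `rerank_fn` is called at most `candidates` times, never once per corpus item, which is what keeps the expensive stage affordable.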

Why does the reranker produce better rankings than the original vector search? Because vector search uses bi-encoder similarity: the query and the chunk are embedded independently, and you compare their vectors. The model never "sees" the query and chunk together. A reranker (cross-encoder) processes the query and chunk as a single input, allowing it to model fine-grained interactions between query terms and chunk content. The cross-encoder can notice that "delivery obligation" in the query should match "the vendor shall deliver" in the chunk, even though their individual embeddings might not be maximally similar.
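The distinction can be made concrete with a deliberately crude sketch (these are not real models: the "bi-encoder" is a bag-of-words vector, and the "cross-encoder" is a pair-aware prefix matcher -- both invented here purely to show the structural difference):

```python
import math

def embed(text):
    # Bi-encoder side: each text is turned into a vector independently,
    # with no knowledge of the other text. Here: exact-word counts.
    vec = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def bi_encoder_score(query, chunk):
    q, c = embed(query), embed(chunk)
    dot = sum(q[w] * c.get(w, 0) for w in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in c.values())))
    return dot / norm if norm else 0.0

def cross_encoder_score(query, chunk):
    # Cross-encoder side: one scorer sees BOTH texts, so it can apply
    # pair-aware matching -- here, a crude shared-prefix rule that links
    # "delivery" in the query to "deliver" in the chunk.
    hits = sum(1 for qw in query.lower().split()
               for cw in chunk.lower().split()
               if len(qw) >= 5 and qw[:5] == cw[:5])
    return hits / max(len(query.split()), 1)

query = "vendor delivery obligation"
chunk = "the vendor shall deliver the goods"
# The bi-encoder credits only the exact shared word "vendor"; the
# pair-aware scorer also links "delivery" to "deliver".
print(bi_encoder_score(query, chunk), cross_encoder_score(query, chunk))
```

A real cross-encoder learns such interactions through attention over the concatenated pair rather than a hand-written rule, but the asymmetry is the same: the joint scorer has access to information the independent embeddings cannot represent.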

The trade-off: cross-encoder reranking is orders of magnitude slower than vector search. Processing 100 candidate pairs takes 50-150 ms on a GPU. This is why you do not use it as the primary search -- at 100 million candidates, it would take days. The two-stage pattern gives you the speed of vector search and the precision of cross-encoder evaluation.
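A quick back-of-envelope check makes the "days" claim concrete, taking the midpoint of the 50-150 ms range quoted above (the per-pair cost is an assumption derived from that figure, not a benchmark):

```python
# If scoring 100 query-chunk pairs takes ~100 ms on a GPU,
# that is ~1 ms per pair.
pairs_per_batch = 100
batch_ms = 100  # midpoint of the 50-150 ms range
ms_per_pair = batch_ms / pairs_per_batch

rerank_candidates = 100
corpus_size = 100_000_000

candidate_cost_ms = rerank_candidates * ms_per_pair
corpus_cost_days = corpus_size * ms_per_pair / 1000 / 86400

print(f"reranking {rerank_candidates} candidates: {candidate_cost_ms:.0f} ms")
print(f"cross-encoding the full corpus: {corpus_cost_days:.1f} days")
```

At 1 ms per pair the full corpus costs over a day of GPU time per query, and at the slow end of the range closer to two days, while the 100-candidate set stays around 100 ms.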


Your RAG system retrieves the top-10 results from vector search and passes them directly to the generative model (no reranking). Users report that answers are often based on tangentially related content rather than the most relevant documents. What architectural change would most improve answer quality?