The two-stage retrieval pattern
In a naive RAG system, you run a single vector search, take the top-K results, and pass them directly to the generative model. This is fast but imprecise. The vector search is optimised for recall -- finding all potentially relevant results -- not for precision. The top-10 results from a vector search against 100 million vectors will include some genuinely relevant chunks and some that are topically adjacent but not actually useful for answering the query.
The two-stage retrieval pattern fixes this by separating the recall-optimised stage from the precision-optimised stage:
Stage 1: Fast retrieval (recall-optimised). Run the vector search and retrieve a large candidate set -- typically the top 50-100 results. This is cheap: 2-10 ms almost independently of corpus size, since HNSW search cost grows only logarithmically with the number of vectors. The goal is to not miss any relevant chunks, even at the cost of including some irrelevant ones.
Stage 2: Slow reranking (precision-optimised). Run a more expensive model over the 50-100 candidates to re-score them. The reranker evaluates each candidate against the query with much more sophistication than cosine similarity. Keep the top 5-10 after reranking for generation.
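The two stages can be sketched as a single function. This is a minimal illustration with toy stand-ins: a real stage 1 would query an ANN index (e.g. HNSW) over learned embeddings, and a real stage 2 would call a trained cross-encoder; the scoring functions and parameter names here are hypothetical.

```python
import heapq
import math
from collections import Counter

def embed(text):
    # Toy "embedding": a bag-of-words term-frequency vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Bi-encoder similarity between two independently built vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rerank_score(query, chunk):
    # Toy stand-in for a cross-encoder: scores the (query, chunk) PAIR
    # jointly, here as the fraction of query terms present in the chunk.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

def two_stage_retrieve(query, corpus, k1=50, k2=5):
    q_vec = embed(query)
    # Stage 1: cheap, recall-oriented -- keep a large candidate set.
    candidates = heapq.nlargest(k1, corpus,
                                key=lambda c: cosine(q_vec, embed(c)))
    # Stage 2: expensive, precision-oriented -- re-score each pair jointly
    # and keep only a handful for generation.
    return heapq.nlargest(k2, candidates,
                          key=lambda c: rerank_score(query, c))
```

The shape is the important part: stage 1 touches the whole corpus with a cheap per-item score, stage 2 touches only `k1` survivors with an expensive per-pair score.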
Why does the reranker produce better rankings than the original vector search? Because vector search uses bi-encoder similarity: the query and the chunk are embedded independently, and you compare their vectors. The model never "sees" the query and chunk together. A reranker (cross-encoder) processes the query and chunk as a single input, allowing it to model fine-grained interactions between query terms and chunk content. The cross-encoder can notice that "delivery obligation" in the query should match "the vendor shall deliver" in the chunk, even though their individual embeddings might not be maximally similar.
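The distinction can be made concrete with a deliberately crude toy. Here the "bi-encoder" compares independently built exact-token vectors, so "delivery" and "deliver" never match; the "cross-encoder" sees both texts at once and can align related surface forms (crudely approximated by shared 5-character prefixes). Both scorers are invented for illustration, not real models.

```python
import math
from collections import Counter

def bi_encoder_score(query, chunk):
    # Each side is "embedded" independently as exact-token counts;
    # surface mismatches like 'delivery' vs 'deliver' contribute nothing.
    q, c = Counter(query.lower().split()), Counter(chunk.lower().split())
    dot = sum(q[t] * c[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in c.values())))
    return dot / norm if norm else 0.0

def cross_encoder_score(query, chunk):
    # Sees the (query, chunk) pair jointly, so it can align related
    # surface forms -- here via shared 5-character prefixes.
    q = {t[:5] for t in query.lower().split()}
    c = {t[:5] for t in chunk.lower().split()}
    return len(q & c) / len(q)

query = "delivery obligation"
chunk = "the vendor shall deliver the goods"
bi_encoder_score(query, chunk)     # 0.0 -- the exact tokens never co-occur
cross_encoder_score(query, chunk)  # 0.5 -- 'deliv' aligns across the pair
```

Real cross-encoders learn far subtler interactions than prefix overlap, but the structural point is the same: the pair is scored as one input.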
The trade-off: cross-encoder reranking is orders of magnitude slower than vector search. Processing 100 candidate pairs takes 50-150 ms on a GPU. This is why you do not use it as the primary search -- scoring all 100 million candidates at that rate would take somewhere between half a day and nearly two days of GPU time per query. The two-stage pattern gives you the speed of vector search and the precision of cross-encoder evaluation.
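A quick back-of-envelope check of that claim, taking the mid-range figure of 100 ms per 100 pairs (an assumption picked from the range above):

```python
# Back-of-envelope: cross-encoding the whole corpus instead of 100 candidates.
# Assumes the mid-range throughput from the text: 100 pairs per ~100 ms on a GPU.
pairs_per_second = 100 / 0.100           # 1,000 pairs/s
corpus_size = 100_000_000                # 100 million candidate chunks
total_seconds = corpus_size / pairs_per_second
total_hours = total_seconds / 3600       # roughly 28 hours for a single query
```

At the optimistic end of the range (50 ms) this halves; at the pessimistic end (150 ms) it grows to about 42 hours. Either way it is unusable as a primary search, while reranking only 100 candidates stays comfortably within an interactive latency budget.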