Enterprise RAG on Your Own Infrastructure

Retrieval and Reranking

The two-stage retrieval pattern, cross-encoder rerankers, LLM-based reranking with Gemma 4, query expansion, HyDE, Reciprocal Rank Fusion, and measuring retrieval quality with MRR, NDCG, and Recall@K.

The two-stage retrieval pattern

In a naive RAG system, you run a single vector search, take the top-K results, and pass them directly to the generative model. This is fast but imprecise. The vector search is optimised for recall -- finding all potentially relevant results -- not for precision. The top-10 results from a vector search against 100 million vectors will include some genuinely relevant chunks and some that are topically adjacent but not actually useful for answering the query.

The two-stage retrieval pattern fixes this by separating the recall-optimised stage from the precision-optimised stage:

Stage 1: Fast retrieval (recall-optimised). Run the vector search and retrieve a large candidate set -- typically the top 50-100 results. This is cheap: roughly 2-10 ms with an HNSW index, whose query latency grows only logarithmically with corpus size. The goal is to miss no relevant chunks, even at the cost of including some irrelevant ones.

Stage 2: Slow reranking (precision-optimised). Run a more expensive model over the 50-100 candidates to re-score them. The reranker evaluates each candidate against the query with much more sophistication than cosine similarity. Keep the top 5-10 after reranking for generation.
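The two stages can be sketched in a few lines of Python. This is a minimal illustration over an in-memory toy corpus: the function name `two_stage_retrieve`, the brute-force cosine scan (standing in for an HNSW index), and the keyword stand-in for a trained reranker are all assumptions for demonstration, not a production implementation.

```python
import numpy as np

def two_stage_retrieve(query_vec, index_vectors, chunks, rerank_fn,
                       candidates=50, top_k=5):
    # Stage 1 (recall): score every stored vector by cosine similarity and
    # keep a generous candidate set. An ANN index would replace this scan.
    norms = np.linalg.norm(index_vectors, axis=1) * np.linalg.norm(query_vec)
    sims = (index_vectors @ query_vec) / np.where(norms == 0, 1.0, norms)
    candidate_ids = np.argsort(sims)[::-1][:candidates]

    # Stage 2 (precision): run the expensive scorer only on the candidates.
    scored = sorted(((rerank_fn(chunks[i]), i) for i in candidate_ids),
                    reverse=True)
    return [chunks[i] for _, i in scored[:top_k]]

chunks = ["contract delivery terms", "company picnic", "vendor obligations"]
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.3]])
query = np.array([1.0, 0.1])
top = two_stage_retrieve(query, vecs, chunks,
                         rerank_fn=lambda c: "vendor" in c,  # toy reranker
                         candidates=3, top_k=1)
```

The key structural point is that `rerank_fn` is called at most `candidates` times, never once per corpus item, which is what keeps the expensive stage affordable.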

Why does the reranker produce better rankings than the original vector search? Because vector search uses bi-encoder similarity: the query and the chunk are embedded independently, and you compare their vectors. The model never "sees" the query and chunk together. A reranker (cross-encoder) processes the query and chunk as a single input, allowing it to model fine-grained interactions between query terms and chunk content. The cross-encoder can notice that "delivery obligation" in the query should match "the vendor shall deliver" in the chunk, even though their individual embeddings might not be maximally similar.
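The distinction can be made concrete with a deliberately crude sketch (these are not real models: the "bi-encoder" is a bag-of-words vector, and the "cross-encoder" is a pair-aware prefix matcher -- both invented here purely to show the structural difference):

```python
import math

def embed(text):
    # Bi-encoder side: each text is turned into a vector independently,
    # with no knowledge of the other text. Here: exact-word counts.
    vec = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def bi_encoder_score(query, chunk):
    q, c = embed(query), embed(chunk)
    dot = sum(q[w] * c.get(w, 0) for w in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in c.values())))
    return dot / norm if norm else 0.0

def cross_encoder_score(query, chunk):
    # Cross-encoder side: one scorer sees BOTH texts, so it can apply
    # pair-aware matching -- here, a crude shared-prefix rule that links
    # "delivery" in the query to "deliver" in the chunk.
    hits = sum(1 for qw in query.lower().split()
               for cw in chunk.lower().split()
               if len(qw) >= 5 and qw[:5] == cw[:5])
    return hits / max(len(query.split()), 1)

query = "vendor delivery obligation"
chunk = "the vendor shall deliver the goods"
# The bi-encoder credits only the exact shared word "vendor"; the
# pair-aware scorer also links "delivery" to "deliver".
print(bi_encoder_score(query, chunk), cross_encoder_score(query, chunk))
```

A real cross-encoder learns such interactions through attention over the concatenated pair rather than a hand-written rule, but the asymmetry is the same: the joint scorer has access to information the independent embeddings cannot represent.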

The trade-off: cross-encoder reranking is orders of magnitude slower than vector search. Processing 100 candidate pairs takes 50-150 ms on a GPU. This is why you do not use it as the primary search -- at 100 million candidates, it would take days. The two-stage pattern gives you the speed of vector search and the precision of cross-encoder evaluation.
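A quick back-of-envelope check makes the "days" claim concrete, taking the midpoint of the 50-150 ms range quoted above (the per-pair cost is an assumption derived from that figure, not a benchmark):

```python
# If scoring 100 query-chunk pairs takes ~100 ms on a GPU,
# that is ~1 ms per pair.
pairs_per_batch = 100
batch_ms = 100  # midpoint of the 50-150 ms range
ms_per_pair = batch_ms / pairs_per_batch

rerank_candidates = 100
corpus_size = 100_000_000

candidate_cost_ms = rerank_candidates * ms_per_pair
corpus_cost_days = corpus_size * ms_per_pair / 1000 / 86400

print(f"reranking {rerank_candidates} candidates: {candidate_cost_ms:.0f} ms")
print(f"cross-encoding the full corpus: {corpus_cost_days:.1f} days")
```

At 1 ms per pair the full corpus costs over a day of GPU time per query, and at the slow end of the range closer to two days, while the 100-candidate set stays around 100 ms.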


Your RAG system retrieves the top-10 results from vector search and passes them directly to the generative model (no reranking). Users report that answers are often based on tangentially related content rather than the most relevant documents. What architectural change would most improve answer quality?