Generation with Open Models

The Gemma 4 model family for RAG

Gemma 4 is Google's open-weights model family released under an Apache 2.0-compatible licence. For self-hosted RAG, it is the generation model you should evaluate first. Here is the family and where each member fits in a RAG architecture.

Gemma 4 E2B (2 billion parameters). The smallest model. Runs on a single consumer GPU (8 GB VRAM) or even on CPU. Generates 100-200 tokens/second on an L4 GPU. For RAG, it handles simple factual queries where the answer is directly stated in the retrieved context. It struggles with synthesis, multi-step reasoning, and nuanced questions. Best for: L1 edge/device tier, simple factual Q&A, high-throughput low-complexity workloads.

Gemma 4 E4B (4 billion parameters). The sweet spot for many RAG deployments. Runs on a single L4 (24 GB) or T4 (16 GB with quantisation). Generates 80-150 tokens/second on an L4. Capable of synthesis across multiple retrieved chunks, following structured output formats, and basic reasoning. This is also the model you might use as an LLM-based reranker (as discussed in Module 7). Best for: mid-tier RAG generation, reranking, query expansion, environments where GPU budget is limited.

Gemma 4 12B (12 billion parameters). Requires a single L40S (48 GB) or A100 (40/80 GB). Generates 40-80 tokens/second on an L40S. Substantially better at reasoning, handling ambiguous queries, producing well-structured responses, and faithfully representing nuances in the retrieved context. For most enterprise RAG deployments, this is the primary generation model. Best for: departmental RAG systems, complex queries, scenarios where answer quality matters more than throughput.

Gemma 4 27B (27 billion parameters). Requires 2x L40S or a single A100 80 GB (with quantisation). Generates 30-60 tokens/second on 2x L40S. The highest-quality open model in the Gemma 4 family. Comparable to GPT-4o for RAG-grounded tasks where the model reads provided context and synthesises answers. Excels at complex synthesis, handling contradictory evidence across chunks, and producing nuanced, well-qualified answers. Best for: high-stakes queries (legal, medical, financial), L2 departmental tier where quality is paramount.

The right model depends on your query complexity distribution. If 80% of your queries are factual ("What is the delivery deadline in contract X?"), E4B or 12B handles them efficiently. The remaining 20% of complex queries ("Compare the indemnification terms across our three major vendor contracts") benefit from 27B. A tiered architecture (covered in Module 10) routes queries to the appropriate model.
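As a rough illustration of the routing idea, the sketch below maps a complexity label to a model tier and estimates the daily load per tier. The complexity labels, the model identifier strings, and the function names are hypothetical placeholders. Classifying a query is left out here, since the tiered architecture is covered in Module 10. Factual queries are sent to E4B in this sketch, though the text notes that 12B is an equally valid choice.

```python
from collections import Counter

# Hypothetical mapping from complexity label to model tier.
# The identifier strings are placeholders, not official model names.
MODEL_BY_COMPLEXITY = {
    "factual": "gemma-4-e4b",
    "synthesis": "gemma-4-12b",
    "complex": "gemma-4-27b",
}


def route(complexity: str) -> str:
    """Return the model tier for a complexity label; unknown labels go to the largest model."""
    return MODEL_BY_COMPLEXITY.get(complexity, MODEL_BY_COMPLEXITY["complex"])


def daily_load(total_queries: int, mix: dict[str, float]) -> Counter:
    """Estimate queries per model per day from a complexity distribution."""
    load = Counter()
    for complexity, share in mix.items():
        load[route(complexity)] += round(total_queries * share)
    return load
```

Calling daily_load(50_000, {"factual": 0.8, "complex": 0.2}) gives a per-model query count that you can set against the throughput figures above when sizing GPUs.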

Your enterprise RAG system handles 50,000 queries/day. Analysis shows 70% are simple factual lookups, 25% require moderate synthesis, and 5% require complex multi-document reasoning. What model deployment strategy minimises cost while maintaining quality?

The RAG prompt template

The prompt you send to the generative model is the interface between your retrieval pipeline and the user-facing answer. Getting it right is the difference between a system that users trust and one they abandon.

Here is a production RAG prompt template:

<system>
You are an internal knowledge assistant for {organisation_name}.
Answer the user's question using ONLY the information provided
in the context sections below. Follow these rules strictly:

1. If the context does not contain enough information to answer
   the question, say "I don't have enough information in the
   available documents to answer this question" and explain
   what information would be needed.
2. Never use information from your training data. Only use the
   provided context.
3. Cite your sources using [Source N] notation after each claim,
   where N corresponds to the context section number.
4. If different context sections contain contradictory information,
   acknowledge the contradiction and present both positions with
   their sources.
5. Keep your answer concise and directly relevant to the question.
</system>

<context>
[Source 1: {document_title_1}, {document_date_1}]
{chunk_text_1}

[Source 2: {document_title_2}, {document_date_2}]
{chunk_text_2}

[Source 3: {document_title_3}, {document_date_3}]
{chunk_text_3}
</context>

<question>
{user_query}
</question>

Key design decisions in this template:

System prompt sets the grounding rules. The instruction to use "ONLY the information provided" is the primary defence against hallucination. Without this, the model may supplement retrieved context with knowledge from its training data, producing answers that sound authoritative but cite information not in your documents.

Source metadata is included. Document title and date help the model (and the user) assess recency and authority. An answer citing a 2019 policy when a 2024 revision exists is a failure your users will catch.

Citation format is explicit. The [Source N] notation creates a verifiable link between each claim and its source document. Users can click through to verify. This is non-negotiable in enterprise RAG -- unsourced answers are unacceptable in legal, financial, and medical contexts.

Contradiction handling is specified. Enterprise document corpora invariably contain contradictions (policy revisions, regional variations, superseded documents). The prompt tells the model how to handle this rather than letting it silently choose one version.
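One way to fill the template programmatically is sketched below. The Chunk class and the function names are illustrative assumptions, not part of any library. The system prompt is passed in as a string so the rules above can be maintained in one place. Chunks are numbered from 1 in the order given, so a [Source N] citation in the answer maps back to the Nth chunk.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical container for one retrieved chunk and its metadata."""
    title: str
    date: str
    text: str


def build_context(chunks: list[Chunk]) -> str:
    """Render chunks as numbered [Source N: title, date] sections."""
    sections = [
        f"[Source {n}: {c.title}, {c.date}]\n{c.text}"
        for n, c in enumerate(chunks, start=1)
    ]
    return "\n\n".join(sections)


def build_prompt(system_prompt: str, chunks: list[Chunk], user_query: str) -> str:
    """Assemble the system, context, and question sections in template order."""
    return (
        f"<system>\n{system_prompt}\n</system>\n\n"
        f"<context>\n{build_context(chunks)}\n</context>\n\n"
        f"<question>\n{user_query}\n</question>"
    )
```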

Your RAG system's prompt template does not include the instruction to use only the provided context. Users report that answers sometimes include accurate information that is not in any of the retrieved documents. Why is this a problem?

Grounding, attribution, and self-verification

Hallucination in RAG takes a specific form: the model generates claims that are not supported by the retrieved context. This is different from general LLM hallucination (making up facts) -- in RAG, the model has access to the correct information but either ignores it, misrepresents it, or supplements it with unsupported content.

Three techniques reduce RAG hallucination:

Grounding through prompt design. The system prompt instructions from the previous section are the first layer. Beyond the basic "use only provided context" instruction, you can strengthen grounding with:

"Begin your answer by identifying which source sections are most relevant to the question."
"For each claim, quote the specific phrase from the source that supports it."
"If you are uncertain whether the context supports a claim, do not include it."

These instructions force the model to explicitly connect its output to the input context, making it harder to inject unsupported claims.

Attribution verification (post-generation check). After the model generates a response, run a second pass that checks each cited claim against the cited source. This can be done by the same model (with a different prompt) or a smaller model:

Given the following claim and the source text it cites,
does the source text support the claim?

Claim: {claim}
Cited source: {source_text}

Answer YES if the source directly supports the claim,
NO if it does not, or PARTIAL if the source partially
supports the claim. Explain briefly.

Claims that fail verification are flagged or removed. This adds at least one more LLM call per response (100-300 ms each, and one call per cited claim unless you batch the claims into a single prompt) but catches hallucinated citations -- a particularly insidious failure mode where the model cites a real source for a fabricated claim.
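The post-generation check could be wired up roughly as follows. This is a sketch under several assumptions: each citation appears before the sentence's closing punctuation, sentences can be split on terminal punctuation, and ask_model is a hypothetical callable that sends a prompt to whichever model you use for verification and returns its reply. Unparseable replies and citations to sources that were never provided are treated as NO.

```python
import re
from typing import Callable

VERIFY_PROMPT = (
    "Given the following claim and the source text it cites,\n"
    "does the source text support the claim?\n\n"
    "Claim: {claim}\n"
    "Cited source: {source_text}\n\n"
    "Answer YES if the source directly supports the claim,\n"
    "NO if it does not, or PARTIAL if the source partially\n"
    "supports the claim. Explain briefly."
)

CITATION = re.compile(r"\s*\[Source (\d+)\]")


def extract_cited_claims(answer: str) -> list[tuple[str, int]]:
    """Pair each sentence (citations stripped) with every source number it cites."""
    pairs = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        claim = CITATION.sub("", sentence).strip()
        for number in CITATION.findall(sentence):
            pairs.append((claim, int(number)))
    return pairs


def verify_answer(
    answer: str,
    sources: dict[int, str],
    ask_model: Callable[[str], str],  # hypothetical: prompt in, reply text out
) -> list[tuple[str, int, str]]:
    """Return (claim, source number, verdict) for every cited claim."""
    results = []
    for claim, number in extract_cited_claims(answer):
        source_text = sources.get(number)
        if source_text is None:
            results.append((claim, number, "NO"))  # cites a source that was not provided
            continue
        reply = ask_model(VERIFY_PROMPT.format(claim=claim, source_text=source_text))
        words = reply.strip().split()
        verdict = words[0].upper().strip(".,:;") if words else "NO"
        if verdict not in {"YES", "NO", "PARTIAL"}:
            verdict = "NO"  # conservative default for unparseable replies
        results.append((claim, number, verdict))
    return results
```

Claims with a NO or PARTIAL verdict can then be flagged in the interface or removed before the response is shown.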

Self-verification (self-RAG pattern). Train or prompt the model to emit confidence signals alongside its response: "Based on Source 2, I am confident that the delivery deadline is 30 days. I am less certain about the penalty amount -- Source 3 mentions liquidated damages but does not specify a percentage." This gives users a nuanced understanding of answer reliability rather than a binary "here is the answer" that may or may not be fully supported.

When retrieved context exceeds the window

Modern LLMs have context windows of 128K tokens (Gemma 4) or even longer, so context overflow is less common than it was in 2023. But at enterprise scale, it still happens. A query about a major multi-year project might retrieve 20 relevant document sections totalling 50,000 tokens, and broader queries can retrieve far more. The retrieved content also shares the window with the system prompt, the question, and the generated response. Here is how to handle it.

Strategy 1: Truncate by relevance. After reranking, you have results ordered by relevance score. Include chunks starting from the most relevant until you hit your context budget (leave room for the system prompt, the question, and the generated response). Typical budget: 60-70% of the context window for retrieved content.
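A minimal sketch of relevance-ordered truncation follows. The count_tokens callable is an assumption that stands in for your model's tokenizer, and the 0.65 default simply sits inside the 60-70% range given above. The loop stops at the first chunk that does not fit, so the selection always remains a prefix of the reranked list.

```python
from typing import Callable


def select_within_budget(
    ranked_chunks: list[str],
    context_window: int,
    count_tokens: Callable[[str], int],  # assumption: wraps your tokenizer
    retrieved_share: float = 0.65,
) -> list[str]:
    """Keep the most relevant chunks that fit the retrieved-content budget."""
    budget = int(context_window * retrieved_share)
    selected, used = [], 0
    for chunk in ranked_chunks:  # already ordered from most to least relevant
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return selected
```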

Strategy 2: Summarise low-ranked chunks. Instead of including the full text of every chunk, summarise chunks ranked 5-10 into a condensed form. Include full text for the top 4 chunks (highest relevance, most likely to contain the direct answer) and summaries for the rest (providing breadth of context without consuming as many tokens).

Strategy 3: Iterative retrieval. For complex queries, do not try to fit everything into a single generation call. First, use the LLM to generate an initial answer from the top-5 chunks. Then, identify gaps ("What information is missing?"), retrieve additional chunks for those gaps, and generate a more complete answer. This is more expensive (2-3 LLM calls) but handles complex multi-aspect queries better.
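The iterative loop might look like the sketch below. The retrieve, generate, and find_gaps callables are hypothetical hooks into your own retrieval pipeline and model client, and none of them is a real library API. With max_rounds=3 the loop makes at most three generation calls. If find_gaps is itself implemented as an LLM call, the total is correspondingly higher.

```python
from typing import Callable


def iterative_answer(
    query: str,
    retrieve: Callable[[str, int], list[str]],              # (text, top_k) -> chunks
    generate: Callable[[str, list[str]], str],              # (query, chunks) -> answer
    find_gaps: Callable[[str, str, list[str]], list[str]],  # -> follow-up queries
    top_k: int = 5,
    max_rounds: int = 3,
) -> str:
    """Answer from the top chunks, then retrieve for gaps and regenerate."""
    chunks = list(retrieve(query, top_k))
    answer = generate(query, chunks)
    for _ in range(max_rounds - 1):
        gaps = find_gaps(query, answer, chunks)
        if not gaps:
            break  # the model reports nothing missing
        for gap in gaps:
            for chunk in retrieve(gap, top_k):
                if chunk not in chunks:  # avoid duplicating context
                    chunks.append(chunk)
        answer = generate(query, chunks)
    return answer
```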

The "lost in the middle" mitigation. As discussed in Module 2, LLMs underweight information in the middle of the context. Two practical mitigations:

Put the most relevant chunks first and last. Sandwich the less-critical chunks in the middle. The model pays more attention to the beginning and end of the context.
Reduce total chunks. Passing 5 highly relevant chunks is often better than passing 15 chunks of mixed relevance. The model processes fewer chunks more thoroughly.
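Both mitigations can be combined in a few lines. The sketch below keeps only the top-ranked chunks, then alternates them between the front and the back of the context so that the least relevant survivors end up in the middle. The function name and the default cap of 5 are illustrative choices.

```python
def sandwich_order(ranked_chunks: list[str], max_chunks: int = 5) -> list[str]:
    """Place the best chunk first, the second best last, and the weakest in the middle."""
    top = ranked_chunks[:max_chunks]  # input is ordered from most to least relevant
    front = top[0::2]                 # ranks 1, 3, 5, ...
    back = top[1::2]                  # ranks 2, 4, ...
    return front + back[::-1]
```

For chunks ranked a through e, this yields the order a, c, e, d, b.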

Your RAG system retrieves 15 document chunks (each ~2,000 tokens) for a complex query. The Gemma 4 12B context window is 128K tokens. The total retrieved context is 30,000 tokens. Should you include all 15 chunks?

Streaming responses

For interactive RAG applications, streaming the response (showing tokens as they are generated) is critical for user experience. A 500-token response takes 5-10 seconds to generate fully. Without streaming, the user stares at a blank screen for 5-10 seconds. With streaming, they see the first token in 100-500 ms and can start reading while the rest generates.

vLLM supports streaming natively via its OpenAI-compatible API. Your application receives Server-Sent Events (SSE), with each event containing one or more tokens. The implementation is straightforward -- the complexity is in the frontend:

Progressive rendering. Render tokens as they arrive. Markdown formatting (headers, lists, code blocks) should be applied incrementally.
Citation rendering. When the model outputs [Source 3], your frontend must resolve this to the actual source document and render it as a clickable link. This can happen inline or after the response completes.
Error handling. If generation fails mid-stream (GPU OOM, timeout), show what was generated so far with a clear error indicator, not a blank screen.
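On the receiving side, the stream can be consumed along the following lines. The sketch assumes the OpenAI-style chat completion chunk format, where each event is a "data:" line carrying JSON with the text under choices[].delta.content and the stream ends with "data: [DONE]". It takes an iterable of already-decoded lines, so the HTTP client is left to you. The function names are illustrative. The second function keeps whatever arrived before a mid-stream failure, which is what the error-handling point above asks for.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_stream_text(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style chat completion SSE lines."""
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        event = json.loads(payload)
        for choice in event.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


def collect_stream(sse_lines: Iterable[str]) -> tuple[str, Optional[Exception]]:
    """Return the text received so far plus the error, if the stream failed."""
    parts: list[str] = []
    try:
        for text in iter_stream_text(sse_lines):
            parts.append(text)  # a real frontend would render each fragment here
    except Exception as exc:  # timeout, dropped connection, malformed event
        return "".join(parts), exc
    return "".join(parts), None
```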

Handling "I don't know."

One of the most important capabilities of an enterprise RAG system is knowing when it does not know. If a user asks about a topic not covered in the corpus, the system must say so clearly rather than hallucinating an answer.

The prompt template from earlier includes this instruction, but implementation requires more:

Retrieval confidence scoring. If the highest reranker score for any retrieved chunk is below a threshold, flag the query as potentially unanswerable before generation even starts. You can show a warning: "Limited relevant documents found -- answer may be incomplete."
Generation confidence. If the model's response includes qualifiers like "I could not find specific information about..." or "The available documents do not address...", surface these prominently rather than burying them in the response.
Fallback paths. When the system cannot answer, offer constructive alternatives: "This topic may be covered in [Department X's] knowledge base" or "You might consult [Subject Matter Expert] for this type of question." These fallbacks require a knowledge map of what your RAG system covers and what it does not.
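The first two checks can be sketched as simple gates around generation. The threshold value of 0.3 is a placeholder assumption that you would tune against your own reranker's score distribution, and the qualifier list is a starting point drawn from the phrases above rather than an exhaustive set.

```python
from typing import Optional

LOW_CONFIDENCE_WARNING = "Limited relevant documents found -- answer may be incomplete."

# Phrases that signal the model could not fully answer (lowercase for matching).
QUALIFIERS = (
    "i could not find specific information",
    "the available documents do not address",
    "i don't have enough information",
)


def retrieval_warning(reranker_scores: list[float], threshold: float = 0.3) -> Optional[str]:
    """Return a warning when no retrieved chunk clears the threshold, else None."""
    if not reranker_scores or max(reranker_scores) < threshold:
        return LOW_CONFIDENCE_WARNING
    return None


def flags_uncertainty(answer: str) -> bool:
    """True when the answer contains a qualifier worth surfacing prominently."""
    lowered = answer.lower()
    return any(phrase in lowered for phrase in QUALIFIERS)
```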

The worst outcome is a confident-sounding answer that is wrong. A clear "I don't know" preserves user trust and avoids the downstream consequences of acting on incorrect information -- which in enterprise contexts can mean regulatory violations, financial losses, or legal liability.


Module 8 -- Final Assessment

For a self-hosted enterprise RAG system processing 50,000 queries/day where 70% are simple factual lookups, which Gemma 4 model deployment minimises cost without sacrificing quality on complex queries?

Why is the system prompt instruction to 'use ONLY the information provided in the context' critical for enterprise RAG?

Your RAG system retrieves 15 chunks for a query, all within the model's context window. Why might including only the top 5-7 chunks produce a better answer than including all 15?

What is attribution verification in the context of RAG hallucination reduction?