Enterprise RAG on Your Own Infrastructure

Generation with Open Models

This module covers the Gemma 4 model family for RAG generation, prompt engineering for grounded responses, reducing hallucination through faithfulness techniques, context-window management, streaming, and handling knowledge gaps.

The Gemma 4 model family for RAG

Gemma 4 is Google's open-weights model family released under an Apache 2.0-compatible licence. For self-hosted RAG, it is the generation model you should evaluate first. Here is the family and where each member fits in a RAG architecture.

Gemma 4 E2B (2 billion parameters). The smallest model. Runs on a single consumer GPU (8 GB VRAM) or even on CPU. Generates 100-200 tokens/second on an L4 GPU. For RAG, it handles simple factual queries where the answer is directly stated in the retrieved context. It struggles with synthesis, multi-step reasoning, and nuanced questions. Best for: L1 edge/device tier, simple factual Q&A, high-throughput low-complexity workloads.

Gemma 4 E4B (4 billion parameters). The sweet spot for many RAG deployments. Runs on a single L4 (24 GB) or T4 (16 GB with quantisation). Generates 80-150 tokens/second on an L4. Capable of synthesis across multiple retrieved chunks, following structured output formats, and basic reasoning. This is also the model you might use as an LLM-based reranker (as discussed in Module 7). Best for: mid-tier RAG generation, reranking, query expansion, environments where GPU budget is limited.

Gemma 4 12B (12 billion parameters). Requires a single L40S (48 GB) or A100 (40/80 GB). Generates 40-80 tokens/second on an L40S. Substantially better at reasoning, handling ambiguous queries, producing well-structured responses, and faithfully representing nuances in the retrieved context. For most enterprise RAG deployments, this is the primary generation model. Best for: departmental RAG systems, complex queries, scenarios where answer quality matters more than throughput.

Gemma 4 27B (27 billion parameters). Requires 2x L40S or a single A100 80 GB (with quantisation). Generates 30-60 tokens/second on 2x L40S. The highest-quality open model in the Gemma 4 family. Comparable to GPT-4o for RAG-grounded tasks where the model reads provided context and synthesises answers. Excels at complex synthesis, handling contradictory evidence across chunks, and producing nuanced, well-qualified answers. Best for: high-stakes queries (legal, medical, financial), L2 departmental tier where quality is paramount.
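The hardware and throughput figures above can be captured in a small model registry, which a serving layer can query when deciding what fits on a given GPU. This is a minimal sketch: the `GemmaSpec` class, the `REGISTRY` keys, and the `fits` helper are illustrative names, and the numbers are simply the figures quoted in the paragraphs above, not independent benchmarks.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GemmaSpec:
    """Deployment spec for one family member (figures from the text above)."""
    name: str
    params_b: float
    min_vram_gb: int                 # smallest quoted footprint, unquantised
    tokens_per_sec: tuple[int, int]  # (low, high) on the reference GPU

REGISTRY = {
    "e2b": GemmaSpec("gemma-4-e2b", 2, 8, (100, 200)),   # single consumer GPU / L4
    "e4b": GemmaSpec("gemma-4-e4b", 4, 24, (80, 150)),   # single L4
    "12b": GemmaSpec("gemma-4-12b", 12, 48, (40, 80)),   # single L40S
    "27b": GemmaSpec("gemma-4-27b", 27, 96, (30, 60)),   # 2x L40S
}

def fits(spec: GemmaSpec, vram_gb: int) -> bool:
    """Can this model run unquantised within the given VRAM budget?"""
    return spec.min_vram_gb <= vram_gb
```

A deployment planner might call `fits(REGISTRY["12b"], 48)` to confirm that a single L40S suffices for the 12B model, while `fits(REGISTRY["27b"], 48)` returns `False`, signalling that quantisation or a second GPU is needed.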

The right model depends on your query complexity distribution. If 80% of your queries are factual ("What is the delivery deadline in contract X?"), E4B or 12B handles them efficiently. The remaining 20% of complex queries ("Compare the indemnification terms across our three major vendor contracts") benefit from 27B. A tiered architecture (covered in Module 10) routes queries to the appropriate model.
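The tiered routing described above can be sketched as a classify-then-dispatch step. The keyword heuristic here is a deliberately toy stand-in (a production router would use a trained classifier or a small LLM, as Module 10 discusses); the function names and the route table are assumptions for illustration, with the tier-to-model mapping taken from the text.

```python
def classify_complexity(query: str) -> str:
    """Toy heuristic — comparative/multi-document phrasing suggests a
    complex query; explanatory phrasing suggests moderate synthesis."""
    q = query.lower()
    if any(w in q for w in ("compare", "across", "contrast", "versus")):
        return "complex"
    if any(w in q for w in ("summarise", "explain", "how does")):
        return "moderate"
    return "simple"

# Tier-to-model mapping from the text: factual lookups to E4B,
# moderate synthesis to 12B, complex multi-document reasoning to 27B.
ROUTES = {
    "simple": "gemma-4-e4b",
    "moderate": "gemma-4-12b",
    "complex": "gemma-4-27b",
}

def route(query: str) -> str:
    """Return the model that should serve this query."""
    return ROUTES[classify_complexity(query)]
```

With this sketch, "What is the delivery deadline in contract X?" routes to E4B, while "Compare the indemnification terms across our three major vendor contracts" routes to 27B.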


Your enterprise RAG system handles 50,000 queries/day. Analysis shows 70% are simple factual lookups, 25% require moderate synthesis, and 5% require complex multi-document reasoning. What model deployment strategy minimises cost while maintaining quality?
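One way to approach this question is with back-of-envelope capacity arithmetic: convert each tier's daily query share into a sustained token rate, then divide by per-replica throughput. This helper is a sketch under stated assumptions — the 300-token average response, the 3x peak-traffic multiplier, and the per-replica rates (drawn from the throughput ranges quoted earlier) are all illustrative, not measurements.

```python
def replicas_needed(queries_per_day: int, share: float, avg_out_tokens: int,
                    tokens_per_sec_per_replica: float,
                    peak_factor: float = 3.0) -> float:
    """Back-of-envelope replica count for one query tier.

    Assumes traffic spread over 24h, scaled by a peak-burst multiplier.
    """
    qps = queries_per_day * share / 86_400 * peak_factor
    required_token_rate = qps * avg_out_tokens   # tokens/sec to sustain
    return required_token_rate / tokens_per_sec_per_replica

# 50,000 queries/day scenario, assuming 300-token answers:
simple_tier = replicas_needed(50_000, 0.70, 300, 150)  # E4B on L4, upper range
complex_tier = replicas_needed(50_000, 0.05, 300, 45)  # 27B on 2x L40S, mid range
```

Under these assumptions the 70% simple tier needs roughly two to three E4B replicas at peak, while the 5% complex tier fits comfortably on a single 27B deployment — a concrete starting point for comparing a tiered architecture against serving everything from 27B.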