The cache analogy for RAG
If you have spent time in systems architecture, you know the memory hierarchy: L1 cache (small, fast, on-chip), L2 cache (larger, slower), L3 cache (largest, slowest), then main memory, then disk. Each tier trades capacity for speed. The system routes data access to the fastest tier that has the needed data, falling through to slower tiers only when necessary.
Enterprise RAG benefits from exactly the same architecture. Not every question needs a 27B model searching a 100-million-vector index. "What is the WiFi password for the London office?" can be answered from a device-local FAQ with a 2B model in under 100 milliseconds. "Summarise the key changes between our 2024 and 2025 employee handbook" needs the departmental knowledge base and a 12B model. "Analyse the regulatory implications of our proposed acquisition across all jurisdictions we operate in" needs the full organisational knowledge base, possibly with graph traversal, and the highest-quality model available.
A single-tier architecture that routes everything through the same pipeline is either over-provisioned for simple queries (wasting GPU capacity on trivial questions) or under-provisioned for complex ones (delivering poor answers when it matters most). Tiered RAG matches the query to the right resources.
Your enterprise RAG system processes 50,000 queries/day. Analysis shows 60% are simple factual lookups ('What is policy X?'), 30% need moderate synthesis ('Summarise our approach to Y'), and 10% need complex multi-document reasoning ('Compare X across our three divisions'). You currently route everything through Gemma 4 27B on 2x L40S. What is the primary inefficiency?
L1: Device/edge tier
- Model: Gemma 4 E2B (2B parameters)
- Knowledge scope: Personal documents, team FAQs, frequently accessed policies
- Hardware: Laptop GPU, mobile NPU, or a shared L4 GPU on the network
- Latency target: Under 100 ms end-to-end
- Vector index size: 10,000-100,000 vectors (personal and team documents)
The L1 tier answers questions that the individual user or their immediate team already has the context for. Think: personal notes, team knowledge bases, company FAQs, bookmarked documents. The vector index is small enough to fit in RAM on a laptop or a single low-cost GPU.
At 2B parameters, Gemma 4 E2B generates 100-200 tokens/second on an L4 and can run (slowly) on a modern laptop CPU. The responses are adequate for factual lookups -- "What is the VPN configuration for remote access?" -- where the answer is directly stated in the retrieved chunk. E2B struggles with synthesis and nuance but excels at speed.
The L1 tier exists because most queries in an enterprise are repetitive. The same 500 questions account for 40-60% of all queries. Caching these at the edge eliminates unnecessary round-trips to the central infrastructure.
L2: Departmental/on-premises tier
- Model: Gemma 4 12B or 27B
- Knowledge scope: Department knowledge base, organisational policies, project documentation
- Hardware: On-premises GPU cluster or private cloud (L40S, A100)
- Latency target: 200 ms - 2 seconds to first token, 3-8 seconds full response
- Vector index size: 10 million - 500 million vectors
The L2 tier is the workhorse of enterprise RAG. It hosts the departmental and organisational knowledge base, runs the full retrieval pipeline (hybrid search, reranking, multi-hop), and generates responses with Gemma 4 12B or 27B.
This is where the retrieval patterns from earlier modules apply: query expansion, HyDE, cross-encoder reranking, parent-child chunk retrieval. The L2 tier has the latency budget (200 ms - 2 seconds to first token) to run these more expensive techniques.
Most enterprise RAG deployments are single-tier at L2. Adding L1 below and L3 above turns a good system into an excellent one.
L3: Cloud API escalation tier
- Model: Claude Opus/Sonnet, GPT-4o, or equivalent frontier model
- Knowledge scope: Cross-organisational synthesis, questions that exceed local model capability
- Hardware: Cloud API (you pay per token)
- Latency target: 1-5 seconds
- Vector index size: Uses L2's vector index, but with a more capable generation model
The L3 tier handles the 1-5% of queries that exceed L2's capability. These are questions that require the reasoning depth of a frontier model: complex multi-document synthesis, nuanced analysis of contradictory sources, or tasks that benefit from Claude or GPT-4's broader training data.
L3 queries still use your self-hosted retrieval pipeline (the vector search and reranking happen on your infrastructure). Only the generation step calls an external API, and the retrieved context sent to the API is pre-filtered to exclude documents above a certain classification level.
This is a deliberate architectural choice. You maintain full control of retrieval and data access. You only escalate the generation step, and only for queries that justify the cost and the (limited, filtered) data exposure.
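The pre-filtering step described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the `Classification` levels, the `Chunk` fields, and the `INTERNAL` cut-off are all assumptions chosen for the example, and a real deployment would take them from its own data-classification policy.

```python
from dataclasses import dataclass
from enum import IntEnum


class Classification(IntEnum):
    """Hypothetical classification levels, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class Chunk:
    """A retrieved chunk plus the classification of its source document."""
    doc_id: str
    text: str
    classification: Classification


def filter_for_escalation(chunks, max_level=Classification.INTERNAL):
    """Keep only chunks at or below max_level; the rest never leave the network."""
    return [c for c in chunks if c.classification <= max_level]


# Retrieval and reranking have already run on local infrastructure.
retrieved = [
    Chunk("handbook-2025", "Remote work policy text", Classification.INTERNAL),
    Chunk("m-and-a-memo", "Draft acquisition terms", Classification.RESTRICTED),
    Chunk("press-release", "Public announcement text", Classification.PUBLIC),
]

# Only this filtered context would be placed in the prompt sent to the cloud API.
safe_context = filter_for_escalation(retrieved)
print([c.doc_id for c in safe_context])
```

Filtering after retrieval rather than before keeps the local pipeline unchanged: the same index serves L2 and L3, and the classification check is applied only at the point where context would cross the network boundary.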
A user asks: 'What are the regulatory implications if we acquire a company with operations in five EU member states, considering our existing GDPR compliance framework and the target's different data processing practices?' The query is routed to your tiered RAG system. Which tier should handle it?
How to classify and route queries
The effectiveness of a tiered architecture depends on the query router -- the component that decides which tier handles each query. A bad router sends complex queries to L1 (producing poor answers) or simple queries to L3 (wasting money).
Heuristic routing (simple, effective). A rules-based classifier using query features:
| Feature | L1 indicators | L2 indicators | L3 indicators |
|---|---|---|---|
| Query length | Short (< 15 tokens) | Medium (15-50 tokens) | Long (> 50 tokens) with complex structure |
| Keywords | "what is", "how do I", lookup verbs | "summarise", "explain", "compare" | "analyse", "evaluate implications", "across" |
| Entity count | 1 entity | 2-3 entities | 4+ entities or cross-domain entities |
| Temporal scope | Single point in time | Time range | Multiple periods with comparison |
| Prior query success | Previously answered at L1 | -- | Previously failed at L2 |
Heuristic routing is transparent, debuggable, and requires no training data. Start here.
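A rough sketch of such a rules-based router is shown here, using only two of the features from the table (query length and keywords). The keyword lists, the whitespace token count, the one-vote-per-feature scoring, and the tie-break toward the higher tier are all assumptions made for illustration; entity counting and prior-success tracking are left out because they need components the text does not specify.

```python
TIERS = ("L1", "L2", "L3")

# Illustrative keyword lists based on the table; tune these for your own traffic.
KEYWORDS = {
    "L1": ("what is", "how do i", "where is"),
    "L2": ("summarise", "summarize", "explain", "compare"),
    "L3": ("analyse", "analyze", "evaluate implications", "across"),
}


def route_heuristic(query: str) -> str:
    """Return 'L1', 'L2' or 'L3' by letting each feature cast one vote."""
    text = query.lower()
    n_tokens = len(text.split())  # whitespace split as a rough token count

    votes = {tier: 0 for tier in TIERS}

    # Feature 1: query length.
    if n_tokens < 15:
        votes["L1"] += 1
    elif n_tokens <= 50:
        votes["L2"] += 1
    else:
        votes["L3"] += 1

    # Feature 2: keywords (at most one vote per tier).
    for tier, phrases in KEYWORDS.items():
        if any(phrase in text for phrase in phrases):
            votes[tier] += 1

    # On a tie, prefer the higher tier: answer quality over cost (an assumption).
    best = max(votes.values())
    return [tier for tier in TIERS if votes[tier] == best][-1]


print(route_heuristic("What is the WiFi password for the London office?"))
print(route_heuristic("Summarise the key changes between our 2024 and 2025 employee handbook"))
```

Because several features vote, a single trigger word cannot push a query to the most expensive tier on its own; a short query that happens to contain "compare" ends up at L2 under these rules, not L3.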
LLM-based routing. Use Gemma 4 E2B (fast, cheap) to classify query complexity:
Classify this query's complexity as SIMPLE, MODERATE, or COMPLEX.
SIMPLE: Can be answered from a single document with a direct factual lookup.
MODERATE: Requires synthesising information from 2-5 documents.
COMPLEX: Requires multi-document reasoning, comparison, or analysis
across domains or time periods.
Query: {query}
Complexity:

This adds 20-50 ms of latency but handles edge cases better than heuristics. In practice, a hybrid approach works best: heuristic rules handle the obvious cases (80% of queries), and the LLM classifier handles the ambiguous 20%.
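One way the hybrid approach might be wired together is sketched below. The `generate` and `heuristic` callables are placeholders: `generate` stands in for whatever client calls your small classification model, and `heuristic` is assumed to return a tier name for obvious cases and `None` for ambiguous ones. Falling back to MODERATE when the model output cannot be parsed is also an assumption.

```python
PROMPT_TEMPLATE = (
    "Classify this query's complexity as SIMPLE, MODERATE, or COMPLEX.\n"
    "SIMPLE: Can be answered from a single document with a direct factual lookup.\n"
    "MODERATE: Requires synthesising information from 2-5 documents.\n"
    "COMPLEX: Requires multi-document reasoning, comparison, or analysis\n"
    "across domains or time periods.\n"
    "Query: {query}\n"
    "Complexity:"
)

LABEL_TO_TIER = {"SIMPLE": "L1", "MODERATE": "L2", "COMPLEX": "L3"}


def parse_label(raw: str, default: str = "MODERATE") -> str:
    """Return the first known label appearing in the model output, else default."""
    upper = raw.upper()
    found = [(upper.find(label), label) for label in LABEL_TO_TIER if label in upper]
    return min(found)[1] if found else default


def route_with_llm(query: str, generate) -> str:
    """Ask the classifier model for a label and map it to a tier."""
    raw = generate(PROMPT_TEMPLATE.format(query=query))
    return LABEL_TO_TIER[parse_label(raw)]


def hybrid_route(query: str, heuristic, generate) -> str:
    """Use the heuristic when it is confident; otherwise pay for the LLM call."""
    tier = heuristic(query)
    return tier if tier is not None else route_with_llm(query, generate)


# Stand-ins for illustration only: replace with a real heuristic and model client.
def stub_heuristic(query):
    return "L1" if len(query.split()) < 6 else None


def stub_generate(prompt):
    return " COMPLEX\n"


print(hybrid_route("WiFi password for London?", stub_heuristic, stub_generate))
print(hybrid_route(
    "How do our three divisions differ in their approach to supplier risk over time?",
    stub_heuristic,
    stub_generate,
))
```

Passing the model call in as a function keeps the routing logic testable without a GPU, and makes it straightforward to swap the classifier model later.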
Feedback-driven routing. Track whether each tier's response was accepted by the user (no follow-up question, no thumbs-down). If L1 responses for a certain query pattern consistently lead to follow-up queries (indicating the answer was inadequate), automatically route that pattern to L2. This creates a self-improving router.
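The feedback loop can be sketched as a small counter per query pattern. How queries are grouped into patterns is not specified in the text, so the `pattern` key here is a hypothetical string label, and the `min_samples` and `max_followup_rate` thresholds are illustrative values, not recommendations.

```python
from collections import defaultdict


class FeedbackRouter:
    """Promotes a query pattern from L1 to L2 when L1 answers keep failing."""

    def __init__(self, min_samples: int = 20, max_followup_rate: float = 0.3):
        self.min_samples = min_samples
        self.max_followup_rate = max_followup_rate
        self.stats = defaultdict(lambda: {"total": 0, "followups": 0})
        self.promoted = set()

    def record(self, pattern: str, had_followup: bool) -> None:
        """Log one L1 response; a follow-up or thumbs-down counts as a failure."""
        s = self.stats[pattern]
        s["total"] += 1
        s["followups"] += int(had_followup)
        rate = s["followups"] / s["total"]
        if s["total"] >= self.min_samples and rate > self.max_followup_rate:
            self.promoted.add(pattern)

    def tier_for(self, pattern: str, default: str = "L1") -> str:
        """Return L2 for promoted patterns, otherwise the default tier."""
        return "L2" if pattern in self.promoted else default


router = FeedbackRouter(min_samples=5, max_followup_rate=0.5)
for had_followup in [True, True, False, True, True]:
    router.record("vpn-setup", had_followup)

print(router.tier_for("vpn-setup"))      # promoted: 4 of 5 answers needed a follow-up
print(router.tier_for("wifi-password"))  # no evidence against L1 yet
```

Requiring a minimum sample count before promoting avoids rerouting a pattern on the strength of one or two unlucky answers.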
The blended cost per query
The economic power of tiered RAG comes from the blended cost. Here is a realistic model:
| Tier | Queries/day | Cost per query | Daily cost |
|---|---|---|---|
| L1 (E2B on L4) | 30,000 (60%) | $0.0002 | $6.00 |
| L2 (27B on 2x L40S) | 18,000 (36%) | $0.0016 | $28.80 |
| L3 (Cloud API) | 2,000 (4%) | $0.02 | $40.00 |
| Total | 50,000 | $0.0015 blended | $74.80 |
Compare with a single-tier architecture running everything through Gemma 4 27B: 50,000 queries x $0.0016 = $80/day. The tiered architecture is slightly cheaper in this example, but the real savings come from reduced GPU provisioning: you only need enough 27B capacity for 18,000 queries/day instead of 50,000, which means fewer L40S GPUs.
Compare with cloud-only: 50,000 queries x $0.016 (GPT-4o) = $800/day. The tiered self-hosted architecture is approximately 10x cheaper.
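The arithmetic behind these comparisons is easy to reproduce and to adapt to your own traffic mix. The figures below are the ones from the table and the two comparisons above; only the variable names are introduced for the example.

```python
# Per-tier volumes and unit costs taken from the table above.
tiers = {
    "L1": {"queries": 30_000, "cost_per_query": 0.0002},
    "L2": {"queries": 18_000, "cost_per_query": 0.0016},
    "L3": {"queries": 2_000, "cost_per_query": 0.02},
}

total_queries = sum(t["queries"] for t in tiers.values())
daily_cost = sum(t["queries"] * t["cost_per_query"] for t in tiers.values())
blended = daily_cost / total_queries

# Baselines: everything on the 27B model, or everything on a cloud API.
single_tier = total_queries * 0.0016
cloud_only = total_queries * 0.016

print(f"Tiered:      ${daily_cost:.2f}/day (${blended:.4f} blended)")  # $74.80/day ($0.0015 blended)
print(f"Single-tier: ${single_tier:.2f}/day")                          # $80.00/day
print(f"Cloud-only:  ${cloud_only:.2f}/day")                           # $800.00/day
print(f"Cloud-only is {cloud_only / daily_cost:.1f}x the tiered cost") # 10.7x
```

Changing the L3 share is the quickest way to see how sensitive the blended figure is: at these unit costs, L3 handles 4% of queries but accounts for more than half of the daily spend.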
Ambient RAG: when inference is free at the margin
Tiered RAG creates an interesting economic dynamic at L1. If your device-tier hardware (laptop GPUs, shared L4 instances) has idle capacity -- and it typically does, since query volume is bursty -- the marginal cost of additional inference is zero. The hardware is already powered on and paid for.
This enables ambient RAG: proactively surfacing relevant knowledge without the user asking for it.
Examples:
- You open a contract in your document editor. The RAG system, running locally, identifies relevant precedents, similar contracts, and known issues, and surfaces them in a sidebar.
- You join a meeting about Project Phoenix. The RAG system pre-fetches the latest project status, budget data, and open issues, and presents a briefing before the meeting starts.
- You receive an email mentioning a regulation. The RAG system retrieves your organisation's compliance status for that regulation and adds a contextual annotation.
Ambient RAG is only practical when inference cost is near-zero at the margin. With cloud APIs at $0.016 per query, proactively generating 50 knowledge suggestions per user per day would cost $0.80/user/day -- about $24/user/month, which becomes prohibitive once multiplied across an enterprise headcount. With L1 self-hosted inference on hardware you already own, the cost is electricity.
This shifts RAG from a reactive tool (user asks, system answers) to a proactive knowledge layer (system anticipates, user benefits). It is the architectural end-state that justifies the investment in self-hosted infrastructure.
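To keep ambient work from competing with interactive queries, one option is to run it only when the L1 hardware is idle. The sketch below assumes a utilisation reading between 0 and 1 is supplied by some external monitor; the `AmbientScheduler` class, its 0.5 threshold, and the task callables are hypothetical and exist only to illustrate the gating idea.

```python
from collections import deque


class AmbientScheduler:
    """Runs queued ambient RAG tasks only while the GPU has spare capacity."""

    def __init__(self, max_utilisation: float = 0.5):
        self.max_utilisation = max_utilisation
        self.queue = deque()

    def submit(self, task) -> None:
        """Queue a zero-argument callable that produces an ambient suggestion."""
        self.queue.append(task)

    def tick(self, gpu_utilisation: float):
        """Run one task if utilisation is below the threshold; else return None."""
        if gpu_utilisation >= self.max_utilisation or not self.queue:
            return None
        task = self.queue.popleft()
        return task()


scheduler = AmbientScheduler(max_utilisation=0.5)
scheduler.submit(lambda: "Briefing: Project Phoenix status, budget, open issues")

print(scheduler.tick(gpu_utilisation=0.9))  # busy with user queries: nothing runs
print(scheduler.tick(gpu_utilisation=0.3))  # idle: the queued briefing is generated
```

Interactive queries always take priority under this scheme; ambient suggestions simply wait in the queue until a quiet period.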
Your tiered RAG system routes 60% of queries to L1 (device/edge). The L1 hardware runs at 30% average GPU utilisation because query volume is bursty. What opportunity does this idle capacity create?
Module 10 -- Final Assessment
What is the primary benefit of routing 60% of simple queries to an L1 (device/edge) tier instead of sending all queries through the L2 (departmental) tier?
In the L3 (cloud API escalation) tier, what data leaves your network?
What makes ambient RAG economically viable on self-hosted infrastructure but prohibitively expensive with cloud APIs?
A heuristic query router sends all queries containing the word 'compare' to L3 (cloud API). A user asks: 'Compare the WiFi password for the London and Paris offices.' What is wrong with this routing decision?