The cache analogy for RAG
If you have spent time in systems architecture, you know the memory hierarchy: L1 cache (small, fast, on-chip), L2 cache (larger, slower), L3 cache (largest, slowest), then main memory, then disk. Each tier trades capacity for speed. The system routes data access to the fastest tier that has the needed data, falling through to slower tiers only when necessary.
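The fallthrough behaviour is easy to see in miniature. This is a toy sketch, not real cache hardware: each tier is just a Python dict ordered fastest-to-slowest, standing in for L1/L2/L3/RAM/disk.

```python
def lookup(key, tiers):
    """Return the value from the fastest tier that holds the key.

    `tiers` is ordered fastest-to-slowest; each tier is a dict here,
    a stand-in for real storage. We fall through to slower tiers only
    on a miss, mirroring the memory hierarchy described above.
    """
    for tier in tiers:
        if key in tier:
            return tier[key]
    raise KeyError(key)

# Illustrative tiers: "c" is only on the slowest tier, so the lookup
# falls all the way through.
l1, l2, disk = {"a": 1}, {"b": 2}, {"a": 0, "b": 0, "c": 3}
```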
Enterprise RAG benefits from exactly the same architecture. Not every question needs a 27B model searching a 100-million-vector index. "What is the WiFi password for the London office?" can be answered from a device-local FAQ with a 2B model in under 100 milliseconds. "Summarise the key changes between our 2024 and 2025 employee handbook" needs the departmental knowledge base and a 12B model. "Analyse the regulatory implications of our proposed acquisition across all jurisdictions we operate in" needs the full organisational knowledge base, possibly with graph traversal, and the highest-quality model available.
A single-tier architecture that routes everything through the same pipeline is either over-provisioned for simple queries (wasting GPU capacity on trivial questions) or under-provisioned for complex ones (delivering poor answers when it matters most). Tiered RAG matches the query to the right resources.
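The routing idea above can be sketched in a few lines. Everything here is illustrative: the tier names, model sizes, and index names come from the examples in this section, and the keyword-and-length heuristic is a deliberately crude stand-in for what would, in practice, be a learned query-complexity classifier.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    model: str   # illustrative model size, per the examples above
    index: str   # illustrative knowledge-base name (assumption)

TIERS = [
    Tier("local", "2B", "device_faq"),
    Tier("departmental", "12B", "dept_kb"),
    Tier("organisational", "27B", "org_kb"),
]

def classify_complexity(query: str) -> int:
    """Toy heuristic: score analytical keyword cues plus query length.

    A production router would use a trained classifier; this stand-in
    merely shows the tiering mechanism.
    """
    analytical = {"analyse", "analyze", "summarise", "summarize",
                  "compare", "implications", "regulatory"}
    words = query.lower().split()
    score = sum(w.strip("?,.") in analytical for w in words)
    if score >= 2 or len(words) > 20:
        return 2
    if score == 1 or len(words) > 10:
        return 1
    return 0

def route(query: str) -> Tier:
    """Send the query to the cheapest tier that can plausibly answer it."""
    return TIERS[classify_complexity(query)]
```

With this sketch, the WiFi question routes to the local tier, the handbook summary to the departmental tier, and the acquisition analysis to the organisational tier, matching the three examples above.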