Why Self-Hosted RAG

The cloud RAG bill nobody talks about

Let us do the maths that your cloud RAG vendor hopes you never do.

Take a moderately large enterprise: 10 TB of documents, 50,000 queries per day from internal users. This is not exceptional -- it is a mid-size law firm, a regional bank, or a defence contractor with a decade of accumulated institutional knowledge.

Embedding costs. Your 10 TB corpus, after extraction and chunking, produces roughly 40 billion tokens. Using OpenAI's text-embedding-3-large at $0.13 per million tokens, the initial embedding run costs approximately $5,200. That sounds manageable -- until you realise documents change. With 5% monthly churn (new documents, updates, deletions), you are re-embedding 2 billion tokens per month: $260/month just to keep embeddings current. And that is before you decide to re-embed everything because a better model came out, which you will, because the embedding landscape changes every six months.
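
The embedding arithmetic above can be checked in a few lines. This is a minimal sketch using only the figures already quoted (40 billion tokens, $0.13 per million tokens, 5% monthly churn); the function and variable names are illustrative.

```python
# Figures taken from the paragraph above.
CORPUS_TOKENS = 40_000_000_000   # tokens after extraction and chunking
PRICE_PER_MILLION = 0.13         # USD per million tokens
MONTHLY_CHURN = 0.05             # share of the corpus re-embedded each month


def embedding_cost(tokens, price_per_million):
    """Cost in USD of embedding `tokens` tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


initial_cost = embedding_cost(CORPUS_TOKENS, PRICE_PER_MILLION)
monthly_cost = embedding_cost(CORPUS_TOKENS * MONTHLY_CHURN, PRICE_PER_MILLION)
annual_refresh = monthly_cost * 12

print(f"Initial run:     ${initial_cost:,.0f}")
print(f"Monthly churn:   ${monthly_cost:,.0f}")
print(f"Churn per year:  ${annual_refresh:,.0f}")
```

Note that a single full re-embed to adopt a newer model costs as much as the initial run again, on top of the churn line.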

Vector database hosting. With 40 billion tokens chunked into ~80 million vectors (about 500 tokens per chunk) and embeddings shortened to 1024 dimensions (text-embedding-3-large is natively 3072-dimensional but supports reduced output sizes), you need roughly 330 GB of raw vector storage (4 bytes per float32 value times 1024 dimensions times 80 million vectors). At the native 3072 dimensions, that triples to nearly 1 TB. With metadata, indexes, and replication, plan for 1-2 TB of actual storage. Pinecone's enterprise tier runs $0.33 per GB per month for storage plus query costs. Weaviate Cloud starts at $1,840/month for a production cluster. At this scale, expect $3,000-8,000/month for managed vector database hosting alone.
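
The raw storage figure follows directly from vector count, dimensionality, and bytes per value. A minimal sketch of that calculation, comparing 1024 dimensions with the 3072 dimensions text-embedding-3-large produces natively; it covers raw float32 vectors only, not metadata, index structures, or replicas, and the names are illustrative.

```python
NUM_VECTORS = 80_000_000          # chunks in the example corpus
CORPUS_TOKENS = 40_000_000_000    # tokens in the example corpus


def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Bytes needed to store `num_vectors` float32 vectors of `dims` dimensions."""
    return num_vectors * dims * bytes_per_value


tokens_per_chunk = CORPUS_TOKENS / NUM_VECTORS
gb_at_1024 = raw_vector_bytes(NUM_VECTORS, 1024) / 1e9
gb_at_3072 = raw_vector_bytes(NUM_VECTORS, 3072) / 1e9

print(f"Tokens per chunk:     {tokens_per_chunk:.0f}")
print(f"Raw storage at 1024d: {gb_at_1024:.2f} GB")
print(f"Raw storage at 3072d: {gb_at_3072:.2f} GB")
```

Dimensionality is the lever here: storage grows linearly with it, which is why the choice between shortened and native-size embeddings matters at this scale.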

LLM inference costs. Each query retrieves 5-10 context chunks (roughly 4,000 tokens of context) plus the query itself, about 4,500 input tokens in total, then generates a 500-token response. At 50,000 queries/day, that is approximately 225 million input tokens and 25 million output tokens per day, or about 6.75 billion input tokens and 750 million output tokens over a 30-day month. Using GPT-4o at $2.50/$10.00 per million tokens, you are looking at roughly $812 per day, or about $24,400 per month. Using Claude Sonnet 4 at $3/$15 per million tokens, roughly $1,050 per day, or $31,500 per month. And that is with mid-tier models. If your use case requires the reasoning depth of a top-tier model such as Claude Opus, multiply by 4-10x.

The total. For a 10 TB corpus with 50,000 queries/day using mid-tier models:

Component	Monthly cost
Embedding API (ongoing)	$260
Vector DB hosting	$3,000-8,000
LLM inference	$24,400-31,500
Orchestration (LangSmith, etc.)	$500-2,000
Total	$28,160-41,760/month

That is roughly $338,000-501,000 per year. And it scales roughly linearly: embedding and vector storage grow with the corpus, inference grows with query volume. Double either one and the corresponding lines of the bill double with it. Enterprises with 50 TB of documents and 200,000 queries/day would spend roughly $100,000-125,000 per month on inference alone at these rates.
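
The inference line can be recomputed from the per-query figures. A minimal sketch, assuming a 30-day month and about 4,500 input tokens per query (4,000 of context plus query and prompt overhead); both are assumptions, and the prices are the per-million-token rates quoted above.

```python
QUERIES_PER_DAY = 50_000
DAYS_PER_MONTH = 30               # assumption: 30-day month
INPUT_TOKENS_PER_QUERY = 4_500    # assumption: ~4,000 context + query/prompt overhead
OUTPUT_TOKENS_PER_QUERY = 500


def monthly_inference_cost(input_price, output_price):
    """Monthly USD cost given per-million-token input and output prices."""
    queries = QUERIES_PER_DAY * DAYS_PER_MONTH
    input_millions = queries * INPUT_TOKENS_PER_QUERY / 1_000_000
    output_millions = queries * OUTPUT_TOKENS_PER_QUERY / 1_000_000
    return input_millions * input_price + output_millions * output_price


gpt_4o_monthly = monthly_inference_cost(2.50, 10.00)
sonnet_monthly = monthly_inference_cost(3.00, 15.00)

print(f"GPT-4o:          ${gpt_4o_monthly:,.0f}/month")
print(f"Claude Sonnet 4: ${sonnet_monthly:,.0f}/month")
```

Because the cost is a product of query volume and tokens per query, trimming retrieved context is the most direct way to move this number without changing models.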

Your organisation has a 10 TB document corpus and 50,000 queries/day. What is likely the largest cost component of a cloud-hosted RAG system?

When sending queries to an API is a non-starter

Cost is the problem everyone can quantify. Data sovereignty is the problem that kills the project before it starts.

Consider what happens when your legal department uses a cloud RAG system. Every query -- "What are our obligations under the 2024 MegaCorp acquisition agreement?" -- gets sent to an external API. The LLM provider sees the query. They see the retrieved document chunks. They see the generated response. Even with contractual assurances about data handling, you have fundamentally lost control of three things:

The query itself reveals confidential intent. A query like "What is our exposure if the FDA rejects the Phase III trial data?" tells any observer exactly what your organisation is worried about. This is not hypothetical paranoia. In regulated industries -- healthcare, defence, financial services -- the mere act of asking a question can itself disclose material non-public information. A pharmaceutical company querying about regulatory exposure before an FDA decision is generating information that, if leaked, could move stock prices.

Retrieved context contains your most sensitive documents. The whole point of RAG is to ground the LLM's response in your actual documents. That means contract clauses, patient records, classified technical specifications, or merger analysis get sent as context to an external API. Enterprise agreements and SOC 2 compliance help, but they do not change the fundamental architecture: your sensitive documents are leaving your network.

The response synthesis can create derivative classified information. When an LLM synthesises an answer from multiple classified document fragments, the response itself may warrant a higher classification than any of its sources. In defence contexts, combining information from two SECRET documents can produce a TOP SECRET synthesis. Cloud RAG systems have no mechanism for this kind of classification algebra.

For organisations subject to ITAR (defence), HIPAA (healthcare), SEC regulations (financial services), or GDPR with data residency requirements, cloud RAG is often not a risk management question -- it is simply not permitted.

A pharmaceutical company uses cloud RAG to search internal research documents. Why might the search queries alone -- not the documents, just the queries -- be a compliance concern?

The embedding trap

There is a subtler form of lock-in in cloud RAG that most architects do not notice until it is too late: your embeddings are meaningless outside the model that created them.

When you embed 80 million document chunks using OpenAI's text-embedding-3-large, those vectors exist in a 3072-dimensional space that is specific to that model. You cannot take those vectors and use them with Cohere's embedding model, or with an open-source model, or with the next version of OpenAI's own embedding model. The geometric relationships between vectors -- the entire basis of similarity search -- only hold within the same model's embedding space.
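
A toy illustration of why vectors do not transfer between models. This sketch uses hand-picked 3-dimensional vectors and a hypothetical "model B" that is simply model A with its axes shuffled and one sign flipped; real models differ in far more complicated ways, but the effect on similarity search is the same in kind. All names and values here are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def to_space_b(vec):
    """Hypothetical 'model B': same geometry as model A, but with the axes
    permuted and one sign flipped (an orthogonal change of basis)."""
    x, y, z = vec
    return [z, -x, y]


# Toy vectors standing in for model A's output (illustrative values).
doc_contract_a = [0.9, 0.1, 0.0]
doc_recipe_a = [0.0, 0.2, 0.9]
query_a = [0.8, 0.2, 0.1]  # a contract-related query

# Query and index from the same model: the contract document ranks first.
same_a = (cosine(query_a, doc_contract_a), cosine(query_a, doc_recipe_a))

# Everything re-embedded with model B: the ranking is preserved.
query_b = to_space_b(query_a)
same_b = (cosine(query_b, to_space_b(doc_contract_a)),
          cosine(query_b, to_space_b(doc_recipe_a)))

# Query from model B against the old model A index: the scores are noise.
mixed = (cosine(query_b, doc_contract_a), cosine(query_b, doc_recipe_a))

print("A query, A index (contract, recipe):", same_a)
print("B query, B index (contract, recipe):", same_b)
print("B query, A index (contract, recipe):", mixed)
```

Each space is internally consistent, which is why a full re-embed works, while mixing the two silently returns the wrong documents rather than raising an error.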

This means switching embedding providers requires re-embedding your entire corpus. For our 10 TB example, that is on the order of $5,200 in embedding fees (if the new provider charges rates similar to OpenAI's), plus days of processing time, plus a transition period where your search quality is degraded because you are serving from a partially re-embedded index.

Worse, if you have fine-tuned your retrieval pipeline around the specific characteristics of one embedding model -- its strengths with legal language, its weaknesses with numerical data, its optimal chunk sizes -- that tuning does not transfer. You are back to square one on retrieval quality.

The vendor knows this. That is why embedding APIs are priced so cheaply relative to generation APIs. Embeddings are the hook. Once your corpus is embedded with their model, the switching cost is enormous, and they have a captive customer for the more expensive generation and hosting services.

The open-source escape hatch. Open-source embedding models break this lock-in entirely. When you run GTE-Qwen2 or BGE-M3 on your own hardware, you own the model weights. You can re-embed at any time at the cost of electricity and GPU hours. You can run the exact same model version five years from now. And you can switch models without paying a per-token fee to do so.

You have 80 million document chunks embedded with OpenAI's text-embedding-3-large. You want to switch to an open-source embedding model. What is the primary technical challenge?

Why 2025-2026 changes the calculus

Every argument for self-hosted RAG existed in 2023. What has changed is that the open-source model ecosystem has caught up to a degree that makes self-hosting technically viable for production enterprise use.

Embedding models have reached parity. The top open-source embedding models -- GTE-Qwen2-1.5B, BGE-M3, Nomic Embed v2 -- now match or exceed OpenAI's text-embedding-3-large and Cohere's embed-v3 on the MTEB benchmark. This is not a marginal improvement; GTE-Qwen2-1.5B achieves a higher average score across retrieval tasks than any proprietary embedding model as of early 2026. You no longer sacrifice retrieval quality by self-hosting embeddings.

Generative models have crossed the enterprise threshold. Gemma 4 27B, released under an Apache 2.0-compatible licence, produces generation quality that is comparable to GPT-4o for RAG-grounded tasks. For RAG specifically -- where the model needs to read provided context and synthesise a faithful answer, not reason from its own knowledge -- the gap between open 27B models and proprietary frontier models has narrowed dramatically. The Gemma 4 family (E2B, E4B, 12B, 27B) gives you a model for every latency and cost tier.

vLLM has matured. Self-hosted inference used to mean writing custom serving code and praying it scaled. vLLM now handles continuous batching, speculative decoding, tensor parallelism, and quantised model serving out of the box. It is battle-tested at scale by organisations running thousands of queries per second. The operational burden of self-hosted inference has dropped by an order of magnitude.

Hardware costs have normalised. An NVIDIA L40S (48 GB VRAM, excellent for inference) is available from major cloud providers at $1.50-2.00/hour or can be purchased outright for $7,000-9,000. A pair of L40S GPUs can serve Gemma 4 27B at hundreds of tokens per second with quantisation. Compare that to $10 per million output tokens from a proprietary API.
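
To put the rented-GPU figure on the same footing as API pricing, convert it to a cost per million generated tokens. A minimal sketch: the GPU count and hourly rates come from the paragraph above, while the 300 tokens/second aggregate throughput is an assumption standing in for "hundreds of tokens per second" and should be replaced with a measured figure. It also assumes the GPUs are kept fully busy.

```python
def cost_per_million_tokens(gpu_count, hourly_rate_per_gpu, tokens_per_second):
    """USD per million generated tokens on rented GPUs at full utilisation."""
    tokens_per_hour = tokens_per_second * 3600
    hourly_cost = gpu_count * hourly_rate_per_gpu
    return hourly_cost / tokens_per_hour * 1_000_000


ASSUMED_TOKENS_PER_SECOND = 300   # assumption, not a benchmark result

low = cost_per_million_tokens(2, 1.50, ASSUMED_TOKENS_PER_SECOND)
high = cost_per_million_tokens(2, 2.00, ASSUMED_TOKENS_PER_SECOND)

print(f"Self-hosted: ${low:.2f}-{high:.2f} per million output tokens")
print("Proprietary API (figure quoted above): $10.00 per million output tokens")
```

The result is sensitive to utilisation: a pair of GPUs that sits idle half the time costs twice as much per token, so the comparison favours self-hosting most strongly at steady, high query volume.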

The inflection point is not any single development. It is the convergence of all four: embedding models that match proprietary quality, generative models that are good enough for RAG, serving infrastructure that handles production scale, and hardware economics that make the maths work.

Which development has been most critical for making self-hosted enterprise RAG viable in 2025-2026?

Module 1 -- Final Assessment

For a 10 TB document corpus with 50,000 queries/day, which cost component of cloud RAG is most directly driven by corpus size rather than query volume?

Why is the 'query leaks intent' problem particularly serious for pharmaceutical companies before FDA decisions?

You have embedded 80 million document chunks using a proprietary embedding API. What makes switching to a different embedding model operationally expensive?

What is the primary reason the open-source model ecosystem became viable for enterprise RAG in 2025-2026, as opposed to earlier?