Enterprise RAG on Your Own Infrastructure

Why Self-Hosted RAG

The real cost of cloud RAG at enterprise scale, data sovereignty requirements, the query-leaks-intent problem, vendor lock-in, and why 2025-2026 is the inflection point for self-hosted RAG.

The cloud RAG bill nobody talks about

Let us do the maths that your cloud RAG vendor hopes you never do.

Take a moderately large enterprise: 10 TB of documents, 50,000 queries per day from internal users. This is not exceptional -- it is a mid-size law firm, a regional bank, or a defence contractor with a decade of accumulated institutional knowledge.

Embedding costs. Your 10 TB corpus, after extraction and chunking, produces roughly 40 billion tokens. Using OpenAI's text-embedding-3-large at $0.13 per million tokens, the initial embedding run costs approximately $5,200. That sounds manageable -- until you realise documents change. With 5% monthly churn (new documents, updates, deletions), you are re-embedding 2 billion tokens per month: $260/month just to keep embeddings current. And that is before you decide to re-embed everything because a better model came out, which you will, because the embedding landscape changes every six months.
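The embedding arithmetic above is easy to sanity-check. A minimal sketch, using the figures from the text (text-embedding-3-large at $0.13 per million tokens, a ~40-billion-token corpus, 5% monthly churn):

```python
# Back-of-envelope embedding costs for a 10 TB corpus.
EMBED_PRICE_PER_M_TOKENS = 0.13   # USD, text-embedding-3-large list price

corpus_tokens = 40_000_000_000    # ~40B tokens after extraction and chunking
initial_cost = corpus_tokens / 1_000_000 * EMBED_PRICE_PER_M_TOKENS
print(f"Initial embedding run: ${initial_cost:,.0f}")        # ~$5,200

monthly_churn = 0.05              # 5% of documents added/updated/deleted monthly
churn_tokens = corpus_tokens * monthly_churn                 # 2B tokens/month
ongoing_cost = churn_tokens / 1_000_000 * EMBED_PRICE_PER_M_TOKENS
print(f"Ongoing re-embedding: ${ongoing_cost:,.0f}/month")   # ~$260/month
```

Note that a full re-embed after a model upgrade costs the same as the initial run -- another ~$5,200 each time the embedding landscape shifts.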

Vector database hosting. At 1024 dimensions with 40 billion tokens chunked into ~80 million vectors, you need roughly 320 GB of raw vector storage (80 million vectors × 1024 dimensions × 4 bytes per float32). With metadata, indexes, and replication, plan for 1-2 TB of actual storage. Pinecone's enterprise tier runs $0.33 per GB per month for storage plus query costs. Weaviate Cloud starts at $1,840/month for a production cluster. At this scale, expect $3,000-8,000/month for managed vector database hosting alone.
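The storage estimate follows directly from the vector maths. A quick sketch (the 3-6x overhead multiplier is an assumption for illustration, covering metadata, index structures, and replication):

```python
# Raw vector storage for ~80M chunks at 1024 float32 dimensions.
vectors = 80_000_000
dims = 1024
bytes_per_float32 = 4

raw_bytes = vectors * dims * bytes_per_float32
raw_gb = raw_bytes / 1e9
print(f"Raw vectors: {raw_gb:.0f} GB")          # ~328 GB

# Metadata, graph/index structures (e.g. HNSW), and replication typically
# inflate this several-fold in practice -- hence the 1-2 TB planning figure.
planned_tb_low, planned_tb_high = raw_gb * 3 / 1000, raw_gb * 6 / 1000
print(f"Plan for: {planned_tb_low:.1f}-{planned_tb_high:.1f} TB")
```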

LLM inference costs. Each query retrieves 5-10 context chunks (roughly 4,000 tokens of context) plus the query itself, then generates a 500-token response. At 50,000 queries/day, that is approximately 225 million input tokens and 25 million output tokens per day -- roughly 6.75 billion input and 750 million output tokens per month. Using GPT-4o at $2.50/$10.00 per million tokens, you are looking at about $24,400 per month. Using Claude Sonnet 4 at $3/$15 per million tokens, roughly $31,500/month. And those are the mid-tier models -- if your use case requires the reasoning depth of a premium model such as Claude Opus, multiply by 4-10x.
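The inference numbers are worth verifying yourself, since per-day and per-month token counts are easy to conflate. A sketch using the workload assumptions above (~4,500 input tokens per query including the question, 500 output tokens, 30-day month):

```python
# Monthly LLM inference cost at 50,000 queries/day.
queries_per_day = 50_000
input_tokens_per_query = 4_500    # ~4,000 tokens of retrieved context + query
output_tokens_per_query = 500
days_per_month = 30

queries_per_month = queries_per_day * days_per_month
input_m = queries_per_month * input_tokens_per_query / 1e6    # 6,750M tokens
output_m = queries_per_month * output_tokens_per_query / 1e6  # 750M tokens

def monthly_cost(in_price_per_m, out_price_per_m):
    """Prices in USD per million tokens."""
    return input_m * in_price_per_m + output_m * out_price_per_m

print(f"GPT-4o:          ${monthly_cost(2.50, 10.00):,.0f}/month")  # $24,375
print(f"Claude Sonnet 4: ${monthly_cost(3.00, 15.00):,.0f}/month")  # $31,500
```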

The total. For a 10 TB corpus with 50,000 queries/day using mid-tier models:

| Component | Monthly cost |
| --- | --- |
| Embedding API (ongoing) | $260 |
| Vector DB hosting | $3,000-8,000 |
| LLM inference | $24,000-31,000 |
| Orchestration (LangSmith, etc.) | $500-2,000 |
| Total | $27,760-41,260/month |

That is roughly $333,000-495,000 per year. And this scales linearly. Double the corpus or double the queries, double the bill. An enterprise with 50 TB of documents and 200,000 queries/day can easily be spending well over $100,000 per month.
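The linear scaling is simple enough to capture in a toy model. A sketch under stated assumptions: embedding and vector DB costs scale with corpus size from a 10 TB baseline ($260 and $3,000-8,000/month), inference scales with query volume from a 50,000 queries/day baseline (~$24,000-31,000/month at GPT-4o/Sonnet-class pricing), and orchestration stays flat ($500-2,000/month):

```python
# Toy linear cost model, scaled from the 10 TB / 50k-queries-per-day baseline.
def monthly_cost_usd(corpus_tb, queries_per_day):
    """Return a (low, high) monthly estimate in USD, linear in both inputs."""
    corpus_scale = corpus_tb / 10            # relative to 10 TB baseline
    query_scale = queries_per_day / 50_000   # relative to 50k/day baseline
    low = 260 * corpus_scale + 3_000 * corpus_scale + 24_000 * query_scale + 500
    high = 260 * corpus_scale + 8_000 * corpus_scale + 31_000 * query_scale + 2_000
    return low, high

print(monthly_cost_usd(10, 50_000))      # baseline: (27760.0, 41260.0)
print(monthly_cost_usd(50, 200_000))     # 50 TB, 200k/day: (112800.0, 167300.0)
```

Even this crude model shows how quickly the larger deployment clears $100,000/month, driven mostly by inference volume.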


Your organisation has a 10 TB document corpus and 50,000 queries/day. What is likely the largest cost component of a cloud-hosted RAG system?