The cloud RAG bill nobody talks about
Let's do the maths your cloud RAG vendor hopes you never do.
Take a moderately large enterprise: 10 TB of documents, 50,000 queries per day from internal users. This is not exceptional -- it is a mid-size law firm, a regional bank, or a defence contractor with a decade of accumulated institutional knowledge.
Embedding costs. Your 10 TB corpus, after extraction and chunking, produces roughly 40 billion tokens. Using OpenAI's text-embedding-3-large at $0.13 per million tokens, the initial embedding run costs approximately $5,200. That sounds manageable -- until you realise documents change. With 5% monthly churn (new documents, updates, deletions), you are re-embedding 2 billion tokens per month: $260/month just to keep embeddings current. And that is before you decide to re-embed everything because a better model came out, which you will, because the embedding landscape changes every six months.
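The embedding arithmetic is worth writing down. A minimal sketch, using the corpus size, price, and churn rate above (the text's estimates, not vendor quotes):

```python
# One-off embedding run plus ongoing churn, using the text's estimates:
# 10 TB -> ~40 billion tokens, text-embedding-3-large at $0.13/M tokens.
CORPUS_TOKENS = 40e9
PRICE_PER_M_TOKENS = 0.13     # USD per million tokens
MONTHLY_CHURN = 0.05          # 5% of the corpus changes each month

initial_cost = CORPUS_TOKENS / 1e6 * PRICE_PER_M_TOKENS
monthly_reembed = CORPUS_TOKENS * MONTHLY_CHURN / 1e6 * PRICE_PER_M_TOKENS

print(f"initial run: ${initial_cost:,.0f}")           # $5,200
print(f"churn:       ${monthly_reembed:,.0f}/month")  # $260/month
```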
Vector database hosting. At 1024 dimensions with 40 billion tokens chunked into ~80 million vectors, you need roughly 320 GB of vector storage (4 bytes per float32 dimension times 1024 dimensions times 80 million vectors). With metadata, indexes, and replication, plan for 1-2 TB of actual storage. Pinecone's enterprise tier runs $0.33 per GB per month for storage plus query costs. Weaviate Cloud starts at $1,840/month for a production cluster. At this scale, expect $3,000-8,000/month for managed vector database hosting alone.
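The storage figure falls straight out of the vector count and dimensionality. A quick sketch, assuming float32 vectors and the illustrative $0.33/GB/month storage rate quoted above:

```python
# Raw vector storage: ~80 million chunks at 1024 dimensions, float32.
NUM_VECTORS = 80_000_000
DIMENSIONS = 1024
BYTES_PER_DIM = 4              # float32

raw_gb = NUM_VECTORS * DIMENSIONS * BYTES_PER_DIM / 1e9
print(f"raw vectors: {raw_gb:.0f} GB")  # 328 GB before metadata and replication

# With indexes, metadata, and replicas the footprint lands at 1-2 TB; at the
# quoted $0.33/GB/month this is the storage line alone, before query costs.
for footprint_gb in (1_000, 2_000):
    print(f"{footprint_gb:,} GB -> ${footprint_gb * 0.33:,.0f}/month")
```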
LLM inference costs. Each query retrieves 5-10 context chunks (roughly 4,000 tokens of context) plus the query itself, then generates a 500-token response. At 50,000 queries/day, that is approximately 225 million input tokens and 25 million output tokens per day -- roughly 6.75 billion input and 750 million output tokens per month. Using GPT-4o at $2.50/$10.00 per million tokens, you are looking at about $24,400 per month. Using Claude Sonnet 4 at $3/$15 per million tokens, roughly $31,500/month. And that is with the cheapest capable models: if your use case requires the reasoning depth of a frontier model like Claude Opus, multiply by 4-10x.
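Multiplying the per-query token counts out over a 30-day month makes the inference bill explicit. A sketch using the per-million-token prices quoted above:

```python
# Monthly LLM inference bill from per-query token counts (the text's estimates).
QUERIES_PER_DAY = 50_000
INPUT_TOKENS_PER_QUERY = 4_500   # ~4,000 tokens of retrieved context + the query
OUTPUT_TOKENS_PER_QUERY = 500
DAYS_PER_MONTH = 30

def monthly_inference_cost(input_price: float, output_price: float) -> float:
    """Prices are USD per million tokens."""
    input_tokens = QUERIES_PER_DAY * INPUT_TOKENS_PER_QUERY * DAYS_PER_MONTH
    output_tokens = QUERIES_PER_DAY * OUTPUT_TOKENS_PER_QUERY * DAYS_PER_MONTH
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

print(f"GPT-4o ($2.50/$10):       ${monthly_inference_cost(2.50, 10.00):,.0f}/month")  # $24,375
print(f"Claude Sonnet 4 ($3/$15): ${monthly_inference_cost(3.00, 15.00):,.0f}/month")  # $31,500
```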
The total. For a 10 TB corpus with 50,000 queries/day using mid-tier models:
| Component | Monthly cost |
|---|---|
| Embedding API (ongoing) | $260 |
| Vector DB hosting | $3,000-8,000 |
| LLM inference | $24,000-32,000 |
| Orchestration (LangSmith, etc.) | $500-2,000 |
| Total | $27,760-42,260/month |
That is roughly $333,000-507,000 per year. And this scales linearly. Double the corpus or double the queries, double the bill. Enterprises with 50 TB of documents and 200,000 queries/day are easily spending well over $100,000 per month.
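Because every component scales linearly in corpus size or query volume, a hypothetical helper can project the bill from those two drivers. The rates below are assumptions taken from the estimates above (midpoints where a range was given):

```python
# Hypothetical cost projection from the two drivers that matter. All rates are
# assumptions drawn from the estimates above, not vendor quotes.
def projected_monthly_cost(corpus_tb: float, queries_per_day: float) -> float:
    embedding = corpus_tb / 10 * 260        # 5% churn re-embedding, per 10 TB
    vector_db = corpus_tb / 10 * 5_500      # midpoint of $3,000-8,000 per 10 TB
    # 4,500 input + 500 output tokens per query at GPT-4o prices ($2.50/$10.00 per M)
    inference = queries_per_day * 30 * (4_500 * 2.50 + 500 * 10.00) / 1e6
    orchestration = 1_250                   # midpoint of $500-2,000
    return embedding + vector_db + inference + orchestration

print(f"${projected_monthly_cost(10, 50_000):,.0f}/month")   # $31,385/month (baseline)
print(f"${projected_monthly_cost(50, 200_000):,.0f}/month")  # $127,550/month
```

Doubling either input roughly doubles its share of the bill; there is no economy of scale anywhere in the managed stack.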