Capstone: Your Enterprise RAG Blueprint

From knowledge to action

You have now covered the full stack of enterprise RAG: the economics of self-hosting, embedding models, vector databases, document ingestion, chunking, retrieval, generation, deployment, tiered architecture, knowledge graphs, and security.

This capstone turns that knowledge into a concrete plan for your organisation. Each exercise produces a deliverable. Together, they form a one-page RAG transformation blueprint that you can present to your CTO, your board, or your procurement committee.

Work through these exercises with your actual numbers. Estimates are fine where you do not have exact figures -- the goal is a defensible order-of-magnitude plan, not a decimal-precise budget.

Before starting the exercises, decide which is your primary driver for self-hosted RAG: cost, data sovereignty, vendor independence, or capability gaps.

Exercise 1: Calculate your current spend vs self-hosted projection

Build a cost comparison between your current approach and a self-hosted alternative. Use the framework below.

Current cloud/API costs (monthly):

Component	Your numbers
Embedding API costs (monthly re-embedding + new documents)	$ _____
Vector database hosting (managed service)	$ _____
LLM inference API costs (generation)	$ _____
Orchestration/observability (LangSmith, Langfuse, etc.)	$ _____
Total current monthly spend	$ _____

If you are not currently running RAG and are evaluating a greenfield deployment, estimate what the cloud approach would cost using the numbers from Module 1.

Self-hosted projection (monthly):

Component	Specification	Monthly cost
Embedding GPU (e.g., 1x T4 or L4)	Model: _____, GPU: _____	$ _____
Vector database infrastructure	DB: _____, Nodes: _____	$ _____
Generation GPU(s)	Model: _____, GPU: _____	$ _____
L3 cloud API escalation (if applicable)	% of queries: _____	$ _____
Engineering operations (hours/month x rate)	Hours: _____, Rate: _____	$ _____
Total self-hosted monthly spend		$ _____

Key metrics:

Metric	Value
Total document corpus size	_____ TB
Estimated chunk count	_____ million
Daily query volume	_____
Cost per query (current)	$ _____
Cost per query (self-hosted)	$ _____
Monthly savings	$ _____
Break-even period (including setup costs)	_____ months

The break-even period should include one-time costs: hardware procurement (if not renting), initial engineering build (estimate 2-4 engineer-months for a production deployment), and initial corpus embedding time.
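The arithmetic behind the key metrics can be sketched in a few lines of Python. This is a minimal illustration, not a budgeting tool: the function names and all input figures are hypothetical placeholders, and the 30-day month is an assumption. Replace them with your own numbers from the tables above.

```python
def cost_per_query(monthly_cost: float, daily_queries: float,
                   days_per_month: int = 30) -> float:
    """Monthly spend divided by monthly query volume (30-day month assumed)."""
    return monthly_cost / (daily_queries * days_per_month)


def break_even_months(current_monthly: float, self_hosted_monthly: float,
                      one_time_costs: float) -> float:
    """Months until cumulative monthly savings cover the one-time costs."""
    monthly_savings = current_monthly - self_hosted_monthly
    if monthly_savings <= 0:
        # Self-hosting costs at least as much per month: it never pays back.
        return float("inf")
    return one_time_costs / monthly_savings


# Hypothetical placeholder figures, for illustration only.
current = 12_000.0       # current cloud/API spend per month
self_hosted = 7_000.0    # projected self-hosted spend per month
one_time = 45_000.0      # hardware + initial build + initial corpus embedding
daily_queries = 5_000

print(f"Cost/query (current):     ${cost_per_query(current, daily_queries):.4f}")
print(f"Cost/query (self-hosted): ${cost_per_query(self_hosted, daily_queries):.4f}")
print(f"Monthly savings:          ${current - self_hosted:,.0f}")
print(f"Break-even:               {break_even_months(current, self_hosted, one_time):.1f} months")
```

If the monthly savings are zero or negative, the function returns infinity rather than a misleading negative number; in that case the business case has to rest on the non-financial drivers.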

You estimate your self-hosted RAG system will cost $8,000/month to operate (hardware + engineering time) versus $15,000/month for cloud RAG. The initial setup requires $40,000 in engineering effort and $20,000 in hardware. What is the approximate break-even period?

Exercise 2: Design your document ingestion architecture

Map your organisation's document sources and design the connectors.

Document source inventory:

Source system	Document types	Estimated volume	Change frequency	Connector complexity
Example: Confluence	Wiki pages, attachments	50,000 pages	Daily	Low (REST API, webhooks)
Example: Network share	PDFs, Word, Excel	2 TB, 500K files	Weekly	Medium (polling, no API)

For each source, determine:

Authentication method: OAuth 2.0, API token, service account, NTLM?
Change detection: Webhooks, delta queries, polling, file system events?
Extraction strategy: Fast (pymupdf/text-only) or quality (Docling/Unstructured with layout analysis)?
Vision processing needed? What percentage of documents have tables, charts, or scanned content that requires vision model processing?

Ingestion pipeline design:

[Source Connectors] → [Change Detection Queue]
    → [Extraction Workers]
        → Fast path (pymupdf): 80% of documents
        → Quality path (Docling): 15% of documents
        → Vision path (Gemma 4): 5% of documents
    → [Quality Check]
    → [Chunking]
    → [Embedding]
    → [Vector DB Upsert]
    → [Graph Extraction (selective)]
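The split between the three extraction paths can be expressed as a small routing function. The sketch below is one possible rule set, not a prescribed one: the `DocumentProfile` flags are hypothetical and assumed to come from a cheap pre-scan of each file, and the decision order (vision first, then quality, then fast) is an assumption based on the extraction questions listed earlier.

```python
from dataclasses import dataclass
from enum import Enum


class ExtractionPath(Enum):
    FAST = "fast"        # text-only extraction (e.g. pymupdf)
    QUALITY = "quality"  # layout-aware extraction (e.g. Docling)
    VISION = "vision"    # vision model processing


@dataclass
class DocumentProfile:
    # Hypothetical flags, assumed to be produced by a cheap pre-scan step.
    has_text_layer: bool
    has_tables_or_charts: bool
    has_complex_layout: bool


def choose_extraction_path(doc: DocumentProfile) -> ExtractionPath:
    """Pick the cheapest path that can handle the document."""
    # Scanned pages (no text layer), tables, and charts go to the vision model.
    if not doc.has_text_layer or doc.has_tables_or_charts:
        return ExtractionPath.VISION
    # Multi-column or otherwise complex layouts need layout analysis.
    if doc.has_complex_layout:
        return ExtractionPath.QUALITY
    # Everything else is plain text and takes the fast path.
    return ExtractionPath.FAST


plain = DocumentProfile(has_text_layer=True, has_tables_or_charts=False,
                        has_complex_layout=False)
print(choose_extraction_path(plain).value)
```

Running this over a sample of your corpus is a quick way to check whether the 80/15/5 split in the diagram holds for your documents.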

Estimated initial ingestion time:

Total documents: _____
Average pages per document: _____
Total pages: _____
Fast extraction: _____ pages ÷ _____ pages/sec ÷ 3,600 = _____ hours
Quality extraction: _____ pages ÷ _____ pages/sec ÷ 3,600 = _____ hours
Vision processing: _____ pages ÷ _____ pages/sec ÷ 3,600 = _____ hours
Embedding: _____ chunks ÷ _____ chunks/sec ÷ 3,600 = _____ hours
Total initial ingestion: _____ hours/days
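The worksheet above can be turned into a small estimator: each stage's duration is its item count divided by its throughput, converted from seconds to hours. The sketch below is a rough planning aid under stated assumptions. All throughput figures, the chunks-per-page ratio, and the page count are hypothetical placeholders, and the 80/15/5 split from the pipeline diagram is assumed to apply to pages as well as documents.

```python
def stage_hours(items: float, items_per_sec: float) -> float:
    """Hours needed to process `items` at a sustained rate of `items_per_sec`."""
    if items_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return items / items_per_sec / 3600


def estimate_ingestion(total_pages: float,
                       path_share: dict[str, float],
                       pages_per_sec: dict[str, float],
                       chunks_per_page: float,
                       chunks_per_sec: float) -> dict[str, float]:
    """Return estimated hours per stage, keyed by stage name."""
    hours = {
        path: stage_hours(total_pages * share, pages_per_sec[path])
        for path, share in path_share.items()
    }
    hours["embedding"] = stage_hours(total_pages * chunks_per_page, chunks_per_sec)
    return hours


# Hypothetical placeholder inputs; measure real throughput on your hardware.
estimate = estimate_ingestion(
    total_pages=1_000_000,
    path_share={"fast": 0.80, "quality": 0.15, "vision": 0.05},
    pages_per_sec={"fast": 100.0, "quality": 2.0, "vision": 0.5},
    chunks_per_page=3.0,
    chunks_per_sec=500.0,
)

for stage, h in estimate.items():
    print(f"{stage:>10}: {h:8.1f} hours")

# Sequential total is the pessimistic bound; if stages run on separate
# workers in parallel, wall-clock time is closer to the slowest stage.
print(f"sequential total: {sum(estimate.values()):.1f} hours")
print(f"slowest stage:    {max(estimate.values()):.1f} hours")
```

Small shares on slow paths tend to dominate the total, so it is worth measuring the quality and vision throughput before committing to a timeline.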

Your organisation has 10 source systems. You estimate the ingestion pipeline will take 3 months to build all connectors. A colleague suggests building all 10 connectors before going live. What is a better approach?

Exercise 3: Choose your model stack

Select the three models that form the core of your RAG system.

Embedding model selection:

Criterion	Your requirement	Model A: _____	Model B: _____
Languages needed
Retrieval quality (MTEB)
Dimensions / Matryoshka support
Throughput needed (chunks/sec)
GPU requirement
Licence

Selected embedding model: _____

Reranker selection:

Criterion	Your requirement	Model A: _____	Model B: _____
Reranking latency budget
Candidate count to rerank
Multilingual support
GPU requirement

Selected reranker: _____

Generation model selection (per tier):

Tier	Query types	Model	GPU	Throughput	Monthly cost
L1 (edge)	Simple factual
L2 (departmental)	Moderate synthesis
L2 (complex)	Multi-document reasoning
L3 (escalation)	Beyond local capability

Decision rationale:

Why this embedding model over alternatives?
Why this generation model over alternatives?
What is your quantisation strategy?
How will you handle model upgrades (new Gemma versions, new embedding models)?

You have selected Gemma 4 27B as your L2 generation model. A new version of Gemma (Gemma 5) is released six months after deployment with substantially better RAG performance. What is the migration path?

Exercise 4: Design your tiered architecture

Define the tiers, routing logic, and escalation paths for your deployment.

Tier definitions:

Tier	Model	Knowledge scope	Latency target	Hardware	Estimated % of queries
L1					%
L2					%
L3					%

Query routing rules:

Rule	Condition	Route to
1		L1
2		L2
3		L3
Fallback	L2 response confidence below threshold	Escalate to L3
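The routing table can be prototyped as a lookup plus a confidence-based fallback. The sketch below is illustrative only: the query-type labels, the `Answer` type, the `generate` callable, and the 0.6 threshold are all hypothetical, and how a confidence score is produced is left out of scope. The `l3_permitted` flag stands in for whatever data-classification check governs escalation to a cloud API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    confidence: float  # assumed score in [0, 1]; scoring method is out of scope
    tier: str


# Hypothetical query-type labels mapped to their initial tier.
ROUTES = {
    "simple_factual": "L1",
    "moderate_synthesis": "L2",
    "multi_document": "L2",
    "beyond_local": "L3",
}


def initial_tier(query_type: str) -> str:
    """Route by query type; unknown types default to L2."""
    return ROUTES.get(query_type, "L2")


def answer_with_escalation(query: str, query_type: str,
                           generate: Callable[[str, str], Answer],
                           threshold: float = 0.6,
                           l3_permitted: bool = True) -> Answer:
    """Answer at the routed tier; escalate a low-confidence L2 answer to L3."""
    tier = initial_tier(query_type)
    answer = generate(tier, query)
    if tier == "L2" and answer.confidence < threshold and l3_permitted:
        answer = generate("L3", query)
    return answer


# Stub generator for demonstration: L2 is unsure, other tiers are confident.
def stub_generate(tier: str, query: str) -> Answer:
    confidence = 0.3 if tier == "L2" else 0.9
    return Answer(text=f"{tier} answer to {query!r}", confidence=confidence, tier=tier)


print(answer_with_escalation("Compare Q1 and Q2 policy changes",
                             "moderate_synthesis", stub_generate).tier)
```

Keeping the fallback separate from the initial routing makes it easy to log how often escalation fires, which feeds directly into the estimated percentage of queries per tier.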

Blended cost calculation:

Tier	Cost/query	Daily cost
L1	$	$
L2	$	$
L3	$	$
Total	$ blended	$
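The blended cost per query is the share-weighted average of the per-tier costs, and each tier's daily cost is its share of daily queries multiplied by its cost per query. A minimal sketch follows; the tier shares, per-query costs, and daily volume are hypothetical placeholders to be replaced with your own figures.

```python
def blended_cost(tiers: dict[str, tuple[float, float]],
                 daily_queries: int) -> tuple[float, dict[str, float]]:
    """Return (blended cost per query, daily cost per tier).

    `tiers` maps tier name -> (share of queries, cost per query in dollars).
    """
    total_share = sum(share for share, _ in tiers.values())
    if abs(total_share - 1.0) > 1e-6:
        raise ValueError(f"tier shares must sum to 1.0, got {total_share}")
    blended = sum(share * cost for share, cost in tiers.values())
    daily = {name: share * daily_queries * cost
             for name, (share, cost) in tiers.items()}
    return blended, daily


# Hypothetical placeholder figures, for illustration only.
tiers = {
    "L1": (0.60, 0.001),
    "L2": (0.35, 0.010),
    "L3": (0.05, 0.100),
}
blended, daily = blended_cost(tiers, daily_queries=10_000)

for name, cost in daily.items():
    print(f"{name}: ${cost:,.2f}/day")
print(f"Blended cost per query: ${blended:.4f}")
print(f"Total daily cost:       ${sum(daily.values()):,.2f}")
```

With figures like these, a small L3 share can account for more than half of the total, which is why the escalation rate is worth tracking closely.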

Architecture decisions to document:

Does your L3 tier use a cloud API or a larger on-premises model? If cloud API, which data classification levels are permitted to be sent?
Where do your knowledge graph queries execute? L2 only, or across tiers?
What is your ambient RAG strategy for L1 idle capacity?
How do you handle tier failures? For example, if L2 is down, do all queries fall through to L3, or do you queue them?

Exercise 5: Your RAG transformation blueprint

Compile the outputs of Exercises 1-4 into a one-page blueprint. This is the document you present to your leadership team.

RAG Transformation Blueprint -- [Your Organisation Name]

Business case:

Current annual spend on knowledge search / RAG: $_____
Projected self-hosted annual cost: $_____
Annual savings: $_____
Break-even period: _____ months
Non-financial drivers: [data sovereignty / vendor independence / capability gaps]

Architecture summary:

Corpus: _____ TB across _____ source systems
Model stack: [embedding model] + [reranker] + [L1 model] / [L2 model] / [L3 model or API]
Vector database: [choice] with [index type], [estimated vector count]
Knowledge graph: [Yes/No], covering [document categories]
Security model: [per-query filtering / federated] with [compliance frameworks]

Hardware requirements:

Embedding: _____
Generation (L1): _____
Generation (L2): _____
Vector database: _____
Total GPU cost (monthly): $_____

Implementation roadmap:

Phase	Duration	Deliverable
Phase 1: Foundation	4-6 weeks	Core pipeline with 2-3 source systems, single-tier (L2), 100 internal beta users
Phase 2: Quality	4-6 weeks	Retrieval tuning, reranking, chunking optimisation, 500 users
Phase 3: Scale	4-6 weeks	Full source system connectors, tiered architecture (L1+L2), knowledge graph for high-value docs
Phase 4: Production	4-6 weeks	Security hardening, audit logging, multi-tenancy, L3 escalation, full rollout

Team requirements:

1 ML engineer (model selection, fine-tuning, evaluation)
1 infrastructure engineer (GPU deployment, vLLM, vector database operations)
1 backend engineer (connectors, pipeline, API)
0.5 security engineer (access controls, compliance, audit)
Total: 3-4 engineers for 4-6 months to reach production

Risks and mitigations:

Risk	Impact	Mitigation
Retrieval quality below expectations	Users lose trust	Phased rollout with retrieval quality benchmarks; revert to cloud API if quality gates are not met
Operational complexity exceeds team capacity	Service reliability issues	Start with single-tier (L2), add tiers only when the base system is stable
Model quality regression after open-source model upgrade	Answer quality drops	Shadow mode evaluation before any model switch
Security incident (data leak through RAG)	Regulatory consequences	Per-query access filtering from day one; penetration testing in Phase 4

You have completed your blueprint and are presenting to the CTO. They ask: 'What is the biggest risk of this project, and what is your honest assessment of whether we should do it?' What is the most credible response?

Module 13 -- Final Assessment

A phased RAG deployment starts with 2-3 source systems and a single-tier architecture (L2 only) for 100 beta users. What is the primary benefit of this approach over a full deployment from day one?

Your cost model shows self-hosted RAG saves $7,000/month but requires $60,000 in upfront costs. The CTO asks: 'What if our query volume doubles in 12 months?' How does this affect the economics?

Your blueprint specifies BGE-M3 for embeddings and Gemma 4 27B for generation. Six months later, a new embedding model outperforms BGE-M3 by 5% on your domain's retrieval benchmarks. What is the migration impact?

An enterprise RAG team has completed Phase 1 (core pipeline with 2-3 source systems) and Phase 2 (retrieval tuning, 500 users). Users report high satisfaction for factual queries but frustration with complex analytical queries. What should Phase 3 prioritise?