From knowledge to action
You have now covered the full stack of enterprise RAG: from the economics of self-hosting through embedding models, vector databases, document ingestion, chunking, retrieval, generation, deployment, tiered architecture, knowledge graphs, and security.
This capstone turns that knowledge into a concrete plan for your organisation. Each exercise produces a deliverable. Together, they form a one-page RAG transformation blueprint that you can present to your CTO, your board, or your procurement committee.
Work through these exercises with your actual numbers. Estimates are fine where you do not have exact figures -- the goal is a defensible order-of-magnitude plan, not a decimal-precise budget.
Before starting the exercises, identify your primary driver for self-hosted RAG: cost savings, data sovereignty, vendor independence, or capability gaps.
Exercise 1: Calculate your current spend vs self-hosted projection
Build a cost comparison between your current approach and a self-hosted alternative. Use the framework below.
Current cloud/API costs (monthly):
| Component | Your numbers |
|---|---|
| Embedding API costs (monthly re-embedding + new documents) | $ _____ |
| Vector database hosting (managed service) | $ _____ |
| LLM inference API costs (generation) | $ _____ |
| Orchestration/observability (LangSmith, Langfuse, etc.) | $ _____ |
| Total current monthly spend | $ _____ |
If you are not currently running RAG and are evaluating a greenfield deployment, estimate what the cloud approach would cost using the numbers from Module 1.
Self-hosted projection (monthly):
| Component | Specification | Monthly cost |
|---|---|---|
| Embedding GPU (e.g., 1x T4 or L4) | Model: _____, GPU: _____ | $ _____ |
| Vector database infrastructure | DB: _____, Nodes: _____ | $ _____ |
| Generation GPU(s) | Model: _____, GPU: _____ | $ _____ |
| L3 cloud API escalation (if applicable) | % of queries: _____ | $ _____ |
| Engineering operations (hours/month x rate) | Hours: _____, Rate: _____ | $ _____ |
| Total self-hosted monthly spend | | $ _____ |
Key metrics:
| Metric | Value |
|---|---|
| Total document corpus size | _____ TB |
| Estimated chunk count | _____ million |
| Daily query volume | _____ |
| Cost per query (current) | $ _____ |
| Cost per query (self-hosted) | $ _____ |
| Monthly savings | $ _____ |
| Break-even period (including setup costs) | _____ months |
The break-even period should include one-time costs: hardware procurement (if not renting), initial engineering build (estimate 2-4 engineer-months for a production deployment), and initial corpus embedding time.
You estimate your self-hosted RAG system will cost $8,000/month to operate (hardware + engineering time) versus $15,000/month for cloud RAG. The initial setup requires $40,000 in engineering effort and $20,000 in hardware. What is the approximate break-even period?
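The break-even arithmetic for the scenario above can be sketched as follows. This is a minimal illustration, assuming savings are constant month to month and ignoring financing costs; the function names `break_even_months` and `cost_per_query` are illustrative, and the 30-day month is an assumption.

```python
def break_even_months(setup_cost: float, current_monthly: float,
                      self_hosted_monthly: float) -> float:
    """Months until cumulative monthly savings cover the one-time setup cost."""
    monthly_savings = current_monthly - self_hosted_monthly
    if monthly_savings <= 0:
        raise ValueError("no break-even: self-hosting does not save money monthly")
    return setup_cost / monthly_savings


def cost_per_query(monthly_spend: float, daily_queries: int,
                   days_per_month: int = 30) -> float:
    """Average cost of one query (assumes a 30-day month by default)."""
    return monthly_spend / (daily_queries * days_per_month)


# Numbers from the scenario: $40,000 engineering + $20,000 hardware up front,
# $15,000/month cloud versus $8,000/month self-hosted.
setup = 40_000 + 20_000
months = break_even_months(setup, current_monthly=15_000, self_hosted_monthly=8_000)
print(f"Break-even after about {months:.1f} months")
```

Substitute your own figures from the tables in this exercise. If your monthly savings are small relative to setup costs, the break-even period grows quickly, which is worth showing explicitly in the business case.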
Exercise 2: Design your document ingestion architecture
Map your organisation's document sources and design the connectors.
Document source inventory:
| Source system | Document types | Estimated volume | Change frequency | Connector complexity |
|---|---|---|---|---|
| Example: Confluence | Wiki pages, attachments | 50,000 pages | Daily | Low (REST API, webhooks) |
| Example: Network share | PDFs, Word, Excel | 2 TB, 500K files | Weekly | Medium (polling, no API) |
For each source, determine:
- Authentication method: OAuth 2.0, API token, service account, NTLM?
- Change detection: Webhooks, delta queries, polling, file system events?
- Extraction strategy: Fast (pymupdf/text-only) or quality (Docling/Unstructured with layout analysis)?
- Vision processing needed? What percentage of documents have tables, charts, or scanned content that requires vision model processing?
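One way to turn the extraction and vision questions above into a routing decision is sketched below. It assumes a cheap pre-scan has already produced a few per-document signals; the `DocumentProfile` fields and the `choose_extraction_path` rules are hypothetical and should be tuned against a sample of your own documents.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class DocumentProfile:
    """Hypothetical signals gathered by a cheap pre-scan of each document."""
    has_text_layer: bool        # False for scanned pages with no embedded text
    has_tables_or_charts: bool  # content that a text-only extractor tends to mangle
    has_complex_layout: bool    # multi-column pages, forms, nested headings


def choose_extraction_path(doc: DocumentProfile) -> str:
    """Return 'vision', 'quality', or 'fast' for a single document."""
    if not doc.has_text_layer or doc.has_tables_or_charts:
        return "vision"   # scanned content, tables, charts
    if doc.has_complex_layout:
        return "quality"  # layout analysis needed
    return "fast"         # plain text extraction is enough


def path_shares(profiles: Iterable[DocumentProfile]) -> Dict[str, float]:
    """Fraction of documents routed to each path, for capacity planning."""
    counts = Counter(choose_extraction_path(p) for p in profiles)
    total = sum(counts.values())
    if total == 0:
        return {"fast": 0.0, "quality": 0.0, "vision": 0.0}
    return {path: counts[path] / total for path in ("fast", "quality", "vision")}


sample = [
    DocumentProfile(True, False, False),
    DocumentProfile(True, False, True),
    DocumentProfile(False, False, False),
    DocumentProfile(True, False, False),
]
print(path_shares(sample))
```

Running this over a representative sample of each source system gives you the per-path percentages needed for the ingestion time estimate.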
Ingestion pipeline design:
[Source Connectors] → [Change Detection Queue]
  → [Extraction Workers]
      → Fast path (pymupdf): 80% of documents
      → Quality path (Docling): 15% of documents
      → Vision path (Gemma 4): 5% of documents
  → [Quality Check]
  → [Chunking]
  → [Embedding]
  → [Vector DB Upsert]
  → [Graph Extraction (selective)]
Estimated initial ingestion time:
- Total documents: _____
- Average pages per document: _____
- Total pages: _____
- Fast extraction: _____ pages ÷ _____ pages/sec ÷ 3,600 = _____ hours
- Quality extraction: _____ pages ÷ _____ pages/sec ÷ 3,600 = _____ hours
- Vision processing: _____ pages ÷ _____ pages/sec ÷ 3,600 = _____ hours
- Embedding: _____ chunks ÷ _____ chunks/sec ÷ 3,600 = _____ hours
- Total initial ingestion: _____ hours/days
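The worksheet above reduces to dividing each stage's item count by its sustained throughput. The sketch below shows that arithmetic; every input value is a placeholder rather than a benchmark, and the 80/15/5 path split is taken from the pipeline design in this exercise. Measure throughput on your own hardware before relying on the result.

```python
from typing import Dict


def stage_hours(items: float, items_per_sec: float) -> float:
    """Hours needed to process `items` at a sustained `items_per_sec`."""
    if items_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return items / items_per_sec / 3600


def ingestion_estimate(total_pages: float, chunks_per_page: float,
                       throughput: Dict[str, float],
                       path_share: Dict[str, float]) -> Dict[str, float]:
    """Hours per stage for the initial corpus load."""
    hours = {
        path: stage_hours(total_pages * share, throughput[path])
        for path, share in path_share.items()
    }
    hours["embedding"] = stage_hours(total_pages * chunks_per_page,
                                     throughput["embedding"])
    return hours


# Placeholder inputs: replace with your own page counts and measured rates.
estimate = ingestion_estimate(
    total_pages=5_000_000,
    chunks_per_page=3,
    throughput={"fast": 100.0, "quality": 2.0, "vision": 0.5, "embedding": 500.0},
    path_share={"fast": 0.80, "quality": 0.15, "vision": 0.05},
)
serial = sum(estimate.values())      # stages run one after another
bottleneck = max(estimate.values())  # stages run concurrently on separate workers

for stage, hours in estimate.items():
    print(f"{stage:>10}: {hours:8.1f} h")
print(f"serial total: {serial:.1f} h, slowest stage: {bottleneck:.1f} h")
```

The real figure usually sits between the two totals: a pipelined deployment is bounded below by its slowest stage and above by the serial sum. Adding workers to the slowest path is typically the cheapest way to shorten the initial load.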
Your organisation has 10 source systems. You estimate the ingestion pipeline will take 3 months to build all connectors. A colleague suggests building all 10 connectors before going live. What is a better approach?
Exercise 3: Choose your model stack
Select the three models that form the core of your RAG system.
Embedding model selection:
| Criterion | Your requirement | Model A: _____ | Model B: _____ |
|---|---|---|---|
| Languages needed | | | |
| Retrieval quality (MTEB) | | | |
| Dimensions / Matryoshka support | | | |
| Throughput needed (chunks/sec) | | | |
| GPU requirement | | | |
| Licence | | | |
| Selected embedding model | _____ | | |
Reranker selection:
| Criterion | Your requirement | Model A: _____ | Model B: _____ |
|---|---|---|---|
| Reranking latency budget | | | |
| Candidate count to rerank | | | |
| Multilingual support | | | |
| GPU requirement | | | |
| Selected reranker | _____ | | |
Generation model selection (per tier):
| Tier | Query types | Model | GPU | Throughput | Monthly cost |
|---|---|---|---|---|---|
| L1 (edge) | Simple factual | | | | |
| L2 (departmental) | Moderate synthesis | | | | |
| L2 (complex) | Multi-document reasoning | | | | |
| L3 (escalation) | Beyond local capability | | | | |
Decision rationale:
- Why this embedding model over alternatives?
- Why this generation model over alternatives?
- What is your quantisation strategy?
- How will you handle model upgrades (new Gemma versions, new embedding models)?
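For the quantisation question, a first-pass sizing check is to estimate how much GPU memory the weights alone occupy at each precision. The sketch below uses the standard parameters × bits ÷ 8 approximation; it deliberately ignores KV cache, activations, runtime overhead, and the small per-block metadata that quantisation formats add, so treat the result as a lower bound. The function name is illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


# A 27B-parameter model at three common precisions.
for bits in (16, 8, 4):
    print(f"27B at {bits}-bit: ~{weight_memory_gb(27, bits):.1f} GB for weights")
```

Compare the result against the memory of your candidate GPU, leaving headroom for the KV cache at your target context length and concurrency. Always confirm the quality impact of a lower precision on your own retrieval and answer benchmarks before committing to it.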
You have selected Gemma 4 27B as your L2 generation model. A new version of Gemma (Gemma 5) is released six months after deployment with substantially better RAG performance. What is the migration path?
Exercise 4: Design your tiered architecture
Define the tiers, routing logic, and escalation paths for your deployment.
Tier definitions:
| Tier | Model | Knowledge scope | Latency target | Hardware | Estimated % of queries |
|---|---|---|---|---|---|
| L1 | | | | | _____ % |
| L2 | | | | | _____ % |
| L3 | | | | | _____ % |
Query routing rules:
| Rule | Condition | Route to |
|---|---|---|
| 1 | | L1 |
| 2 | | L2 |
| 3 | | L3 |
| Fallback | L2 response confidence below threshold | Escalate to L3 |
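The routing table above, including the confidence fallback, could be implemented along these lines. This is a sketch under stated assumptions: the `Answer` type, the `classify` and `l3_permitted` callables, the 0.6 threshold, and the stub tiers are all hypothetical stand-ins for your own classifier, confidence scorer, and data classification policy.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Answer:
    text: str
    confidence: float  # hypothetical 0..1 score from your own evaluator
    tier: str


TierFn = Callable[[str], Answer]


def route(query: str,
          classify: Callable[[str], str],
          tiers: Dict[str, TierFn],
          l3_permitted: Callable[[str], bool],
          threshold: float = 0.6) -> Answer:
    """Send the query to its first-choice tier, escalating L2 to L3 if needed."""
    answer = tiers[classify(query)](query)
    # Fallback rule: a low-confidence L2 answer escalates, but only when the
    # query's data classification allows it to leave the local tiers.
    if answer.tier == "L2" and answer.confidence < threshold and l3_permitted(query):
        answer = tiers["L3"](query)
    return answer


# Stub tiers and rules, purely for illustration.
tiers = {
    "L1": lambda q: Answer("short factual answer", 0.9, "L1"),
    "L2": lambda q: Answer("tentative synthesis", 0.4, "L2"),
    "L3": lambda q: Answer("escalated answer", 0.8, "L3"),
}
classify = lambda q: "L1" if len(q.split()) < 6 else "L2"
l3_permitted = lambda q: "confidential" not in q.lower()

print(route("What is our VPN address?", classify, tiers, l3_permitted).tier)
print(route("Compare the 2023 and 2024 supplier contracts for risk exposure",
            classify, tiers, l3_permitted).tier)
```

Keeping the permission check inside the router means a query that must stay on-premises returns the lower-confidence L2 answer instead of silently leaving your environment.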
Blended cost calculation:
| Tier | Queries/day | Cost/query | Daily cost |
|---|---|---|---|
| L1 | | $ | $ |
| L2 | | $ | $ |
| L3 | | $ | $ |
| Total | | $ (blended) | $ |
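The blended figure in the table above is a volume-weighted average of the per-tier costs. The sketch below shows the calculation; the query volumes and per-query costs are placeholders, not benchmarks.

```python
from typing import Dict, Tuple


def blended_cost(tiers: Dict[str, Tuple[int, float]]) -> Tuple[float, float]:
    """Return (total daily cost, blended cost per query).

    `tiers` maps a tier name to (queries per day, cost per query).
    """
    total_queries = sum(queries for queries, _ in tiers.values())
    if total_queries == 0:
        raise ValueError("at least one tier must serve queries")
    daily_cost = sum(queries * cost for queries, cost in tiers.values())
    return daily_cost, daily_cost / total_queries


# Placeholder inputs: replace with your own volumes and per-query costs.
daily, per_query = blended_cost({
    "L1": (7_000, 0.001),
    "L2": (2_500, 0.010),
    "L3": (500, 0.050),
})
print(f"Daily cost: ${daily:.2f}, blended cost per query: ${per_query:.4f}")
```

Re-running the calculation with different tier splits shows how sensitive the blended cost is to the share of queries that escalate to the most expensive tier.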
Architecture decisions to document:
- Does your L3 tier use a cloud API or a larger on-premises model? If cloud API, which data classification levels are permitted to be sent?
- Where do your knowledge graph queries execute? L2 only, or across tiers?
- What is your ambient RAG strategy for L1 idle capacity?
- How do you handle tier failures? (L2 is down -- do all queries fall through to L3, or do you queue them?)
Exercise 5: Your RAG transformation blueprint
Compile the outputs of Exercises 1-4 into a one-page blueprint. This is the document you present to your leadership team.
RAG Transformation Blueprint -- [Your Organisation Name]
Business case:
- Current annual spend on knowledge search / RAG: $_____
- Projected self-hosted annual cost: $_____
- Annual savings: $_____
- Break-even period: _____ months
- Non-financial drivers: [data sovereignty / vendor independence / capability gaps]
Architecture summary:
- Corpus: _____ TB across _____ source systems
- Model stack: [embedding model] + [reranker] + [L1 model] / [L2 model] / [L3 model or API]
- Vector database: [choice] with [index type], [estimated vector count]
- Knowledge graph: [Yes/No], covering [document categories]
- Security model: [per-query filtering / federated] with [compliance frameworks]
Hardware requirements:
- Embedding: _____
- Generation (L1): _____
- Generation (L2): _____
- Vector database: _____
- Total GPU cost (monthly): $_____
Implementation roadmap:
| Phase | Duration | Deliverable |
|---|---|---|
| Phase 1: Foundation | 4-6 weeks | Core pipeline with 2-3 source systems, single-tier (L2), 100 internal beta users |
| Phase 2: Quality | 4-6 weeks | Retrieval tuning, reranking, chunking optimisation, 500 users |
| Phase 3: Scale | 4-6 weeks | Full source system connectors, tiered architecture (L1+L2), knowledge graph for high-value docs |
| Phase 4: Production | 4-6 weeks | Security hardening, audit logging, multi-tenancy, L3 escalation, full rollout |
Team requirements:
- 1 ML engineer (model selection, fine-tuning, evaluation)
- 1 infrastructure engineer (GPU deployment, vLLM, vector database operations)
- 1 backend engineer (connectors, pipeline, API)
- 0.5 security engineer (access controls, compliance, audit)
- Total: 3-4 engineers for 4-6 months to reach production
Risks and mitigations:
| Risk | Impact | Mitigation |
|---|---|---|
| Retrieval quality below expectations | Users lose trust | Phased rollout with retrieval quality benchmarks; revert to cloud API if quality gates not met |
| Operational complexity exceeds team capacity | Service reliability issues | Start with single-tier (L2), add tiers only when the base system is stable |
| Model quality regression after open-source model upgrade | Answer quality drops | Shadow mode evaluation before any model switch |
| Security incident (data leak through RAG) | Regulatory consequences | Per-query access filtering from day one; penetration testing in Phase 4 |
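The shadow mode evaluation named in the risk table can be sketched as a simple promotion gate: run the candidate model on the same queries as the current one, score both, and switch only if the candidate clears a minimum gain. The `shadow_compare` function, the scoring callable, and the stub models below are hypothetical; in practice the scorer would be your retrieval or answer quality benchmark.

```python
from statistics import mean
from typing import Callable, Dict, Sequence, Union


def shadow_compare(queries: Sequence[str],
                   current: Callable[[str], str],
                   candidate: Callable[[str], str],
                   score: Callable[[str, str], float],
                   min_gain: float = 0.0) -> Dict[str, Union[float, bool]]:
    """Score both models on the same queries and decide whether to promote."""
    current_score = mean(score(q, current(q)) for q in queries)
    candidate_score = mean(score(q, candidate(q)) for q in queries)
    return {
        "current": current_score,
        "candidate": candidate_score,
        "promote": candidate_score - current_score >= min_gain,
    }


# Stub models and a toy scorer, purely for illustration.
queries = ["refund policy", "leave policy"]
current = lambda q: "see handbook"
candidate = lambda q: f"{q}: see handbook section 4"
score = lambda q, answer: 1.0 if q in answer else 0.0

result = shadow_compare(queries, current, candidate, score, min_gain=0.05)
print(result)
```

Only the current model's answers are shown to users during the shadow period, so a regression in the candidate is caught before it affects anyone.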
You have completed your blueprint and are presenting to the CTO. They ask: 'What is the biggest risk of this project, and what is your honest assessment of whether we should do it?' What is the most credible response?
Module 13 -- Final Assessment
A phased RAG deployment starts with 2-3 source systems and a single-tier architecture (L2 only) for 100 beta users. What is the primary benefit of this approach over a full deployment from day one?
Your cost model shows self-hosted RAG saves $7,000/month but requires $60,000 in upfront costs. The CTO asks: 'What if our query volume doubles in 12 months?' How does this affect the economics?
Your blueprint specifies BGE-M3 for embeddings and Gemma 4 27B for generation. Six months later, a new embedding model outperforms BGE-M3 by 5% on your domain's retrieval benchmarks. What is the migration impact?
An enterprise RAG team has completed Phase 1 (core pipeline with 2-3 source systems) and Phase 2 (retrieval tuning, 500 users). Users report high satisfaction for factual queries but frustration with complex analytical queries. What should Phase 3 prioritise?