Data classification in RAG
Enterprise documents have access controls. Not everyone in the organisation can see everything. A standard corporate environment has at minimum:
- Public: Information approved for external sharing (marketing materials, published reports)
- Internal: Available to all employees (policies, general procedures, company news)
- Confidential: Restricted to specific teams or roles (financial forecasts, M&A plans, personnel records)
- Restricted/Classified: Available only to named individuals (board materials, litigation strategy, classified technical data)
A RAG system that ignores these classifications is a security breach waiting to happen. If an intern asks "What is our acquisition strategy?" and the system retrieves and presents confidential board materials, you have a data leak -- regardless of whether the system is cloud-hosted or self-hosted.
The challenge is that retrieval systems are designed to find the most relevant content, and the most relevant content for a sensitive query is often the most sensitive document. Without access controls, the RAG system is an oracle that bypasses every document access control your organisation has spent years implementing.
There are two architectural approaches to enforcing access controls in RAG: per-query filtering and federated indexes.
Your enterprise RAG system has no access controls. An employee from the sales department asks: 'What is the company's position on the pending regulatory investigation?' The system retrieves and returns excerpts from privileged legal strategy documents. What has gone wrong?
Per-query filtering
The most straightforward approach to RAG access control: attach security metadata to every vector at ingestion time, and filter vectors by the user's access rights at query time.
At ingestion: Every document chunk gets metadata fields that encode its access classification:
```json
{
  "vector": [0.123, 0.456, ...],
  "text": "The acquisition target's EBITDA...",
  "metadata": {
    "source": "board_materials/q3_strategy.pdf",
    "classification": "restricted",
    "department": "executive",
    "access_groups": ["board", "c-suite", "legal"],
    "document_date": "2025-09-15"
  }
}
```

At query time: The retrieval call includes a filter based on the authenticated user's access rights:
```python
user_groups = get_user_groups(authenticated_user_id)
# Returns e.g., ["engineering", "all_employees"]
results = vector_db.search(
    query_embedding=query_vector,
    filter={
        "access_groups": {"$overlap": user_groups}
    },
    top_k=20
)
```

The vector database applies the filter, ensuring only vectors whose access_groups overlap with the user's groups are returned. The engineering employee never sees the board materials. The board member sees everything they are authorised for.
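To make the overlap filter concrete, here is a minimal in-memory sketch. It is not any particular vector database's API: `filtered_search`, `cosine`, and the two-dimensional sample vectors are hypothetical stand-ins, and a real database would apply the filter inside its index rather than over a Python list.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_search(chunks, query_vector, user_groups, top_k=20):
    """Return the top_k most similar chunks the user is allowed to see."""
    allowed = set(user_groups)
    # Apply the access filter first: a chunk is visible only if at least
    # one of its access_groups is also one of the user's groups.
    visible = [c for c in chunks if allowed & set(c["metadata"]["access_groups"])]
    ranked = sorted(visible, key=lambda c: cosine(c["vector"], query_vector), reverse=True)
    return ranked[:top_k]

# Hypothetical two-chunk corpus with toy 2-D embeddings.
chunks = [
    {"text": "The acquisition target's EBITDA...", "vector": [0.9, 0.1],
     "metadata": {"access_groups": ["board", "c-suite", "legal"]}},
    {"text": "Expense policy: submit receipts within 30 days.", "vector": [0.8, 0.2],
     "metadata": {"access_groups": ["all_employees"]}},
]

engineer_hits = filtered_search(chunks, [1.0, 0.0], ["engineering", "all_employees"])
board_hits = filtered_search(chunks, [1.0, 0.0], ["board", "all_employees"])
```

The engineer's result list contains only the expense policy, even though the board chunk is the closer match to the query. The filter runs before ranking, so relevance never overrides access rights.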
Implementation considerations:
- Metadata must be authoritative. The access classification on each vector must come from the source system's actual access controls, not from heuristic classification. If a SharePoint document has specific permissions, those permissions should propagate to the vector metadata. Manually classifying documents is error-prone and does not scale.
- User identity must be verified. The RAG API must authenticate the user and resolve their group memberships. Do not rely on the client to send group information -- this can be spoofed. Use your identity provider (Active Directory, Okta, Auth0) as the authority.
- Filter performance matters. As discussed in Module 4, metadata filtering interacts with vector search performance. If a user has access to only 1% of the corpus, pre-filtering to that 1% before vector search can degrade HNSW performance. Qdrant's filterable HNSW handles this well. For other databases, test the performance impact of narrow filters.
- Group changes must propagate. When a user changes departments or gets promoted, their access groups change. The vector database does not need to be re-indexed (the document metadata does not change), but the user's resolved group memberships must update in your identity provider.
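Three of these considerations (authoritative metadata, server-side identity, and group-change propagation) can be sketched together. Everything here is an assumption for illustration: `directory` is a plain dict standing in for your identity provider, and `resolve_groups`, `build_filter`, and `chunk_metadata` are hypothetical helper names, not part of any real SDK.

```python
def resolve_groups(user_id, directory):
    """Look up group memberships server-side; `directory` stands in for the identity provider."""
    return set(directory.get(user_id, ()))

def build_filter(request, directory):
    """Build the retrieval filter from the authenticated user id only."""
    # Any "groups" field supplied by the client is ignored on purpose:
    # it could be spoofed.
    groups = resolve_groups(request["authenticated_user_id"], directory)
    return {"access_groups": {"$overlap": sorted(groups)}}

def chunk_metadata(source_acl, source_path):
    """Copy the source system's ACL onto the chunk instead of guessing a classification."""
    return {"source": source_path, "access_groups": sorted(source_acl)}

directory = {"u-17": ["engineering", "all_employees"]}
request = {"authenticated_user_id": "u-17", "groups": ["board"]}  # spoofed claim
search_filter = build_filter(request, directory)

# A department move is reflected on the next query, with no re-indexing.
directory["u-17"] = ["legal", "all_employees"]
updated_filter = build_filter(request, directory)
```

Because groups are resolved on every request, a change in the directory takes effect on the next query. If you cache resolved groups for performance, the cache lifetime becomes the upper bound on how long stale access can persist.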
Your RAG system uses per-query filtering with access_groups metadata. A new employee joins the Legal department. They can immediately search for legal documents, but they report that some documents they should have access to are missing from their results. What is the most likely cause?
The federated RAG pattern
Per-query filtering works well when access controls are metadata-based (group memberships, classifications). But some enterprises need stronger isolation: physically separate indexes per department, classification level, or tenant.
When federated RAG is necessary:
- Regulatory requirement. Some data classifications require that the data never co-reside with data at lower classifications, even in encrypted form. Defence and intelligence contexts may require this.
- Multi-tenant SaaS. If you operate a RAG system for multiple client organisations, each client's data must be in a physically separate index. A metadata filter bug must not be able to expose Client A's data to Client B.
- Performance isolation. A department with 500 million vectors should not have its search performance degraded by another department adding 2 billion vectors to a shared index.
Architecture:
Each tenant (department, classification level, client) gets its own vector database instance or collection. The query router determines which index to search based on the user's identity and query context:
```text
User (Legal Dept, Secret clearance)
  → Query Router
      → Legal Department Index (Confidential + below)
      → Cross-Org Index (Internal + Public only)
      → Secret-Classified Index (clearance-verified)
  → Merge results
  → Rerank
  → Generate
```

The router can search multiple indexes in parallel and merge results with RRF, effectively creating a unified search experience over physically separated data stores.
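The routing and merge steps can be sketched as follows. This is a simplified illustration under stated assumptions: each index is represented by a hypothetical callable returning a ranked list of document IDs, the indexes are queried sequentially rather than in parallel, and `k=60` is a commonly used RRF constant rather than a value taken from this course.

```python
def rrf_merge(result_lists, k=60, top_n=10):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first; ties broken by doc_id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]

def route(user_entitlements, indexes):
    """Select only the indexes the user is entitled to search."""
    return [name for name in indexes if name in user_entitlements]

def federated_search(query, user_entitlements, indexes):
    """Query each permitted index separately, then fuse the ranked lists."""
    permitted = route(user_entitlements, indexes)
    return rrf_merge([indexes[name](query) for name in permitted])

# Hypothetical indexes: each callable stands in for a separate vector database.
indexes = {
    "legal": lambda q: ["L1", "L2", "X9"],
    "cross_org": lambda q: ["X9", "C4"],
    "secret": lambda q: ["S1"],
}
merged = federated_search("regulatory investigation", {"legal", "cross_org"}, indexes)
```

The user in this example has no entitlement to the `secret` index, so it is never queried and `S1` cannot appear in the merged list. A document returned by two indexes (`X9`) accumulates score from both and rises to the top.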
Trade-offs:
- Pro: Strongest isolation guarantee -- a bug in one index cannot leak data from another.
- Pro: Independent scaling -- each index can be sized and resourced for its workload.
- Con: Operational complexity -- N indexes means N databases to operate, monitor, and back up.
- Con: No cross-index similarity -- you cannot find documents in the Legal index that are similar to documents in the Finance index unless you explicitly search both.
- Con: Duplication -- a document that is accessible to multiple departments must be indexed in each department's index, or in a shared "common" index.
For most enterprises, per-query filtering with strong metadata controls is sufficient. Federated RAG is the right choice when regulatory requirements mandate physical separation or when you are operating a multi-tenant platform.
PII in the RAG pipeline
Enterprise documents contain personally identifiable information (PII): names, email addresses, phone numbers, national identification numbers, medical record numbers, financial account numbers. PII in your RAG pipeline creates compliance risk under GDPR, CCPA, HIPAA, and dozens of other regulations.
PII interacts with RAG at three points:
1. In the document corpus. Employee records, customer data, patient information -- PII is embedded in the source documents. If a document chunk containing "John Smith, SSN 123-45-6789, diagnosed with diabetes" gets embedded and indexed, that PII is now searchable via vector similarity.
2. In the user's query. Users might include PII in their questions: "What is the policy status for patient Jane Doe, MRN 12345?" The query, including the PII, gets embedded and potentially logged.
3. In the generated response. The model might surface PII from retrieved chunks in its response, even if the user did not ask for it.
Detection. Use a combination of:
- Regular expressions for structured PII: SSN patterns, email addresses, phone numbers, credit card numbers. Fast and deterministic.
- Named Entity Recognition (NER) for unstructured PII: person names, addresses, medical conditions. Libraries like spaCy, Presidio (Microsoft, open-source), or Stanza handle this.
- LLM-based detection for context-dependent PII: "the patient in Room 304" is PII in a hospital context but not in a hotel context. Use Gemma 4 for ambiguous cases.
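The regular-expression tier can be sketched with the standard library alone. The patterns below are simplified assumptions (US-style SSN and phone formats, a permissive email pattern) and will miss many real-world variants; `find_structured_pii` is a hypothetical helper name.

```python
import re

# Simplified, illustrative patterns; production rules need locale-specific formats.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "US_PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def find_structured_pii(text):
    """Return (label, matched_text, start, end) tuples sorted by position."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.group(), m.start(), m.end()))
    return sorted(hits, key=lambda h: h[2])

sample = "John Smith, SSN 123-45-6789, jsmith@example.com, 555-867-5309"
hits = find_structured_pii(sample)
```

The regex tier finds the SSN, email address, and phone number in the sample but not the name "John Smith". That gap is what the NER tier is for.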
Redaction strategies:
| Strategy | Example | Pros | Cons |
|---|---|---|---|
| Replacement | "John Smith" becomes "[PERSON_1]" | Preserves document structure, allows re-identification with a key | Requires secure key management |
| Masking | "123-45-6789" becomes "XXX-XX-XXXX" | Simple, irreversible | Loses information that might be needed |
| Generalisation | "42-year-old male" becomes "adult male" | Preserves some utility | Loses precision that might matter for retrieval |
| Removal | Entire PII-containing sentence removed | Maximum privacy | Loses potentially valuable context |
Where to redact:
The safest approach is to redact PII at ingestion time, before embedding. The redacted text gets embedded, so PII never enters the vector database. The trade-off: if the PII is relevant to retrieval (e.g., a user searches for a specific patient by name), redaction breaks that use case. For healthcare and similar domains, role-based access to unredacted content may be necessary for authorised users, with redaction applied only for unauthorised users.
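A minimal sketch of redaction at ingestion, combining the masking and replacement strategies from the table. It assumes the person names have already been found by an NER step (passed in as `known_names`); the function names and the in-memory `key` dict are hypothetical, and a real key would need secure storage.

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_ssns(text):
    """Masking: irreversible, every SSN becomes the same fixed string."""
    return SSN.sub("XXX-XX-XXXX", text)

def replace_names(text, known_names, key=None):
    """Replacement: map each name to a stable placeholder such as [PERSON_1].

    `known_names` stands in for the output of an NER step. `key` maps
    placeholder -> original and must be stored securely if
    re-identification is ever needed.
    """
    key = {} if key is None else key
    reverse = {original: placeholder for placeholder, original in key.items()}
    for name in known_names:
        if name not in reverse:
            reverse[name] = f"[PERSON_{len(reverse) + 1}]"
            key[reverse[name]] = name
        text = text.replace(name, reverse[name])
    return text, key

def redact_for_ingestion(text, known_names):
    """Redact before embedding so raw PII never reaches the vector database."""
    return replace_names(mask_ssns(text), known_names)

chunk = "John Smith, SSN 123-45-6789, diagnosed with diabetes"
redacted, key = redact_for_ingestion(chunk, ["John Smith"])
```

Only `redacted` would be passed to the embedding model. Note that the diagnosis is still present in the redacted text: health conditions are not caught by these two rules and would need the NER or LLM-based tiers, or a generalisation rule.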
Your RAG system indexes employee records. A manager asks: 'What is the performance rating for my team members?' The system retrieves and returns performance data including names and ratings for employees outside the manager's team. What control failed?
Logging every query, retrieval, and generation
In regulated industries, you must prove what your RAG system did: which documents it searched, which chunks it retrieved, what context it sent to the model, and what answer it generated. This is not optional -- it is a compliance requirement.
What to log:
| Event | Data to capture | Retention |
|---|---|---|
| Query received | Timestamp, user_id, query text, user's access groups | Duration of regulatory retention period |
| Retrieval executed | Query embedding (or hash), vector DB queried, filter applied, top-K results with scores | Same |
| Chunks retrieved | Chunk IDs, source document IDs, relevance scores | Same |
| Context assembled | Full prompt sent to the generative model (system prompt + context + query) | Same |
| Response generated | Full response text, model ID, model version, token counts, latency | Same |
| User feedback | Thumbs up/down, follow-up query (indicating dissatisfaction), explicit correction | Same |
Storage considerations. At 50,000 queries/day with full prompt logging, expect 500 MB - 2 GB of log data per day, depending on context size (roughly 10-40 KB per query). That is about 180-730 GB per year. Store the logs in an append-only, tamper-evident system: a dedicated database with write-once policies, or a log management system such as Elasticsearch or Loki configured so that written entries cannot be modified.
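The storage arithmetic, and one way to make a log tamper-evident, can be sketched as below. The per-query sizes of 10 KB and 40 KB are derived from the daily figures above (500 MB and 2 GB divided by 50,000 queries). The hash chain is an illustrative technique, not a replacement for write-once storage, and `HashChainedLog` is a hypothetical class name.

```python
import hashlib
import json

def yearly_log_gb(queries_per_day, kb_per_query):
    """Yearly log volume in GB, using decimal units (1 GB = 1,000,000 KB)."""
    return queries_per_day * kb_per_query * 365 / 1_000_000

class HashChainedLog:
    """Append-only log where each record commits to the hash of the previous one."""

    def __init__(self):
        self.records = []

    def append(self, event):
        prev_hash = self.records[-1]["hash"] if self.records else "0" * 64
        body = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
        self.records.append({"prev": prev_hash, "event": event, "hash": digest})

    def verify(self):
        """Recompute every hash; any edited or reordered record breaks the chain."""
        prev_hash = "0" * 64
        for rec in self.records:
            body = json.dumps(rec["event"], sort_keys=True)
            expected = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
            if rec["prev"] != prev_hash or rec["hash"] != expected:
                return False
            prev_hash = rec["hash"]
        return True

log = HashChainedLog()
log.append({"type": "query_received", "user_id": "u-17", "query": "expense policy"})
log.append({"type": "chunks_retrieved", "chunk_ids": ["c-1", "c-2"]})
intact = log.verify()

# Simulate someone editing an old record after the fact.
log.records[0]["event"]["query"] = "edited after the fact"
tampered_detected = not log.verify()
```

`yearly_log_gb(50_000, 10)` and `yearly_log_gb(50_000, 40)` reproduce the low and high ends of the yearly range. The chain detects modification but does not prevent it, so the records still belong on storage that the application cannot rewrite.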
Compliance framework implications:
SOC 2. Requires demonstrating that access controls are enforced, that systems are monitored, and that data is protected. Self-hosted RAG is well-positioned for SOC 2 because you control the entire infrastructure. Document: access control enforcement (metadata filtering), monitoring (the metrics from Module 9), data protection (encryption at rest and in transit), and incident response procedures.
HIPAA. Requires protecting Protected Health Information (PHI). Self-hosted RAG eliminates the "Business Associate" relationship that cloud RAG would require (because you are not sending PHI to a third party). You still must: encrypt PHI at rest and in transit, implement access controls (covered above), maintain audit logs, and have a breach response plan. Self-hosted infrastructure simplifies HIPAA compliance by keeping PHI within your controlled environment.
FedRAMP. Required for cloud services used by US federal agencies. If your RAG system runs in a FedRAMP-authorised environment (GovCloud), you inherit the infrastructure-level controls from that environment. The RAG application layer must still meet FedRAMP controls for access management, audit logging, and data protection. Self-hosted models (no external API calls) simplify the authorisation boundary.
GDPR. Requires lawful basis for processing personal data, data minimisation, right to erasure, and data protection impact assessments. Self-hosted RAG avoids the complexities of data transfer to third-party processors (Article 28). For the right to erasure: when a data subject requests deletion, you must delete their data from the vector database (re-embed documents with the individual's data removed), the knowledge graph (remove associated entities), and the audit logs (except where retention is legally required). This is operationally complex and must be planned at architecture time, not retrofitted.
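An erasure request touching all three stores can be sketched as below. This is an illustration only: the stores are plain Python containers, `erase_subject` and the `subject_ids` tag on each chunk are hypothetical, and the sketch assumes you recorded at ingestion time which data subjects each chunk mentions.

```python
def erase_subject(subject_id, vector_store, graph, audit_log, legal_hold=frozenset()):
    """Remove one data subject's data from each store and report what was done."""
    # Vector store: this sketch deletes every chunk tagged with the subject.
    # A chunk that also holds other people's data would instead need the
    # subject's data removed and the remainder re-embedded.
    doomed = [cid for cid, chunk in vector_store.items()
              if subject_id in chunk["subject_ids"]]
    for cid in doomed:
        del vector_store[cid]
    # Knowledge graph: remove the subject's entity node if present.
    graph_removed = graph.pop(subject_id, None) is not None
    # Audit logs: keep entries that fall under a legal retention requirement.
    kept = [e for e in audit_log
            if e["user_id"] != subject_id or e["id"] in legal_hold]
    logs_removed = len(audit_log) - len(kept)
    audit_log[:] = kept
    return {"chunks_removed": len(doomed), "graph_removed": graph_removed,
            "logs_removed": logs_removed}

vector_store = {
    "c1": {"subject_ids": {"p-1"}},
    "c2": {"subject_ids": {"p-2"}},
}
graph = {"p-1": {"type": "person"}, "p-2": {"type": "person"}}
audit_log = [
    {"id": 1, "user_id": "p-1"},
    {"id": 2, "user_id": "p-1"},
    {"id": 3, "user_id": "p-2"},
]
report = erase_subject("p-1", vector_store, graph, audit_log, legal_hold={2})
```

The sketch only works because each chunk carries a `subject_ids` tag. Without that mapping, finding every chunk that mentions a person means re-scanning the corpus, which is why erasure has to be designed in at architecture time.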
Module 12 -- Final Assessment
In a RAG system with per-query filtering, where must access control be enforced to prevent unauthorised data exposure?
When is federated RAG (physically separate indexes per tenant) necessary instead of per-query metadata filtering?
A user requests deletion under GDPR's right to erasure. Which components of the RAG system must be updated?
Your RAG system processes 50,000 queries/day with full prompt logging. Approximately how much log storage should you plan for per year?