Data classification in RAG
Enterprise documents have access controls. Not everyone in the organisation can see everything. A typical corporate environment has, at minimum, the following classification levels:
- Public: Information approved for external sharing (marketing materials, published reports)
- Internal: Available to all employees (policies, general procedures, company news)
- Confidential: Restricted to specific teams or roles (financial forecasts, M&A plans, personnel records)
- Restricted/Classified: Available only to named individuals (board materials, litigation strategy, classified technical data)
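The four levels above can be attached to each document as metadata so that a retrieval layer has something to check. The following is a minimal sketch, not a reference implementation: the names `Classification`, `User`, `Document`, and `can_read` are hypothetical, and the rules (employees see Internal, group membership gates Confidential, an explicit user list gates Restricted) are one plausible reading of the list above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Classification(IntEnum):
    """Hypothetical encoding of the four levels, ordered by sensitivity."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class User:
    user_id: str
    is_employee: bool = True
    groups: frozenset = frozenset()  # team or role names (assumed model)


@dataclass(frozen=True)
class Document:
    doc_id: str
    classification: Classification
    allowed_groups: frozenset = frozenset()  # consulted for CONFIDENTIAL
    allowed_users: frozenset = frozenset()   # consulted for RESTRICTED


def can_read(user: User, doc: Document) -> bool:
    """Return True if this user may see this document."""
    if doc.classification == Classification.PUBLIC:
        return True
    if doc.classification == Classification.INTERNAL:
        return user.is_employee
    if doc.classification == Classification.CONFIDENTIAL:
        # Restricted to specific teams or roles.
        return user.is_employee and bool(user.groups & doc.allowed_groups)
    # RESTRICTED: named individuals only.
    return user.user_id in doc.allowed_users


intern = User("intern-1", groups=frozenset({"engineering"}))
board_pack = Document(
    "board-pack",
    Classification.RESTRICTED,
    allowed_users=frozenset({"ceo", "cfo"}),
)
forecast = Document(
    "fy-forecast",
    Classification.CONFIDENTIAL,
    allowed_groups=frozenset({"finance"}),
)
```

Note that a single ordered clearance number would not be enough here: Confidential and Restricted depend on who the user is, not only on how senior they are, which is why the sketch carries group and user lists alongside the level.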
A RAG system that ignores these classifications is a security breach waiting to happen. If an intern asks "What is our acquisition strategy?" and the system retrieves and presents confidential board materials, you have a data leak, regardless of whether the system is cloud-hosted or self-hosted.
The challenge is that retrieval systems are designed to find the most relevant content, and the most relevant content for a sensitive query is often the most sensitive document. Unless permissions are enforced at retrieval time, the RAG system becomes an oracle that bypasses every document access control your organisation has spent years implementing.
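The effect is easy to reproduce with a toy retriever. The sketch below is illustrative only: `Chunk`, `relevance`, `retrieve`, and the two-item corpus are hypothetical, and word overlap stands in for the embedding similarity a real system would use. It shows relevance-only ranking surfacing the restricted chunk, and the same query returning something else once candidates are limited to levels the caller may see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    classification: str  # "public", "internal", "confidential" or "restricted"


def relevance(query: str, chunk: Chunk) -> int:
    """Toy score: number of distinct query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.text.lower().split())
    return len(query_words & chunk_words)


def retrieve(query, chunks, allowed=None):
    """Return the best-scoring chunk, or None if there are no candidates.

    If `allowed` is None, no access check is applied (the unsafe case).
    Otherwise only chunks whose classification is in `allowed` are ranked.
    """
    candidates = [
        c for c in chunks if allowed is None or c.classification in allowed
    ]
    return max(candidates, key=lambda c: relevance(query, c), default=None)


CORPUS = [
    Chunk("our company news and general procedures", "internal"),
    Chunk("board paper on our acquisition strategy for next year", "restricted"),
]

query = "what is our acquisition strategy"

# Relevance only: the restricted board paper wins because it matches best.
leaked = retrieve(query, CORPUS)

# Same query, but candidates limited to what an intern may read.
safe = retrieve(query, CORPUS, allowed={"public", "internal"})
```

In this sketch the filter is applied before ranking, so the restricted chunk never becomes a candidate and cannot reach the prompt.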
There are two architectural approaches to enforcing access controls in RAG: per-query filtering and federated indexes.