Data Classification for AI Workflows

Your classification scheme was not built for AI

Most enterprises have a data classification scheme. It typically has three to five levels — something like Public, Internal, Confidential, Restricted. It was designed for traditional data security: controlling access to files, databases, and network shares. It answers the question "who can see this data?"

AI breaks this model because the question is no longer just "who can see it" but "where does it go, who processes it, what happens to it during processing, and does it influence future outputs?" A document classified as "Internal" under your existing scheme might be perfectly fine for employees to read, but completely unacceptable to send to a cloud AI provider that retains data for 30 days and operates infrastructure in a different jurisdiction.

Traditional classification also assumes you know the data in advance. A database of customer records can be classified once and that classification persists. But AI usage is dynamic — an employee might paste a sentence that contains no sensitive data, or they might paste an entire medical record. The classification needs to happen at the point of use, not at the point of storage.

This module builds an AI-specific classification framework that accounts for these differences. It maps to your existing classification scheme rather than replacing it, so you do not need to reclassify your entire data estate.

Your organisation classifies a customer support knowledge base as 'Internal' data. An employee wants to use it as context for a RAG system powered by a cloud AI provider. Under your current classification, is this allowed?

Five levels for AI data handling

Here is a classification framework designed specifically for AI workflows. Each level defines what AI processing is permissible, not just who can access the data.

Level 1: Public

Data that is publicly available or intended for public distribution: press releases, public documentation, published research, open-source code, publicly available datasets.

AI processing: Any AI service, cloud or local, with no restrictions.
Example: Using ChatGPT to rewrite a press release for a different audience.

Level 2: Internal

Data intended for use within the organisation but not sensitive if exposed: internal newsletters, general process documentation, non-sensitive meeting notes, team schedules.

AI processing: Cloud AI services with enterprise agreements (API, not consumer products). Data Processing Agreements must be in place. No consumer AI tools.
Example: Using the Claude API to summarise internal project documentation.

Level 3: Confidential

Data that could cause harm to the organisation or individuals if exposed: customer lists, non-public financial data, internal strategy documents, proprietary business processes, employee performance data.

AI processing: Cloud AI services with enterprise agreements AND the gateway pattern (Module 6). PII must be detected and redacted before transmission. Pseudonymisation preferred where referential integrity is needed.
Example: Using the gateway pattern to analyse customer feedback — names and account numbers redacted before the data reaches the cloud model.

Level 4: Restricted

Data subject to specific regulatory requirements or contractual obligations: PII governed by GDPR/CCPA, PHI under HIPAA, PCI cardholder data, data subject to ITAR/EAR controls, data covered by legal privilege.

AI processing: Local inference only, or cloud AI with explicit regulatory compliance (e.g., HIPAA BAA with the provider, FedRAMP authorisation). Gateway pattern mandatory with full redaction. Re-identification risk assessment required.
Example: Using a local Gemma 4 model to analyse clinical notes, with no data leaving the hospital's network.

Level 5: Prohibited

Data that must never be processed by AI systems under any circumstances: active litigation hold material, classified government information, data with explicit contractual prohibitions on AI processing, raw biometric data.

AI processing: None. Not local, not cloud. Human review only.
Example: Documents under a litigation hold where AI processing could raise spoliation concerns.

The key insight is that this framework is a decision layer on top of your existing classification, not a replacement. Your existing "Confidential" data might map to AI Level 3 or Level 4 depending on the specific regulatory requirements. The AI classification adds the "what kind of AI processing is allowed" dimension that traditional classification lacks.
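One way to make the decision layer concrete is to encode each level's handling rules as data that a gateway or policy check can consult. The following is a minimal sketch, not a definitive implementation: the names AILevel, Policy, POLICIES, EXISTING_TO_AI and may_use_cloud are hypothetical, and apart from the Confidential entry (which follows the paragraph above) the existing-scheme mapping is an assumption for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class AILevel(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4
    PROHIBITED = 5


@dataclass(frozen=True)
class Policy:
    ai_allowed: bool                   # False only for Level 5
    needs_enterprise_agreement: bool   # enterprise agreement and DPA for cloud use
    needs_gateway: bool                # gateway pattern with redaction for cloud use
    needs_regulatory_compliance: bool  # e.g. a HIPAA BAA with the provider


# Cloud-processing conditions per level, as described in the five levels.
# Requiring an enterprise agreement at Level 4 is an assumption of this sketch.
POLICIES = {
    AILevel.PUBLIC:       Policy(True, False, False, False),
    AILevel.INTERNAL:     Policy(True, True, False, False),
    AILevel.CONFIDENTIAL: Policy(True, True, True, False),
    AILevel.RESTRICTED:   Policy(True, True, True, True),
    AILevel.PROHIBITED:   Policy(False, False, False, False),
}

# Hypothetical mapping from an existing scheme to candidate AI levels.
# More than one candidate means a further decision is needed.
EXISTING_TO_AI = {
    "Public": [AILevel.PUBLIC],
    "Internal": [AILevel.INTERNAL],
    "Confidential": [AILevel.CONFIDENTIAL, AILevel.RESTRICTED],
    "Restricted": [AILevel.RESTRICTED, AILevel.PROHIBITED],
}


def may_use_cloud(level, *, enterprise_agreement=False, via_gateway=False,
                  regulatory_compliance=False):
    """Return True if cloud AI processing satisfies the level's conditions."""
    policy = POLICIES[level]
    if not policy.ai_allowed:
        return False
    if policy.needs_enterprise_agreement and not enterprise_agreement:
        return False
    if policy.needs_gateway and not via_gateway:
        return False
    if policy.needs_regulatory_compliance and not regulatory_compliance:
        return False
    return True


print(may_use_cloud(AILevel.CONFIDENTIAL, enterprise_agreement=True))
print(may_use_cloud(AILevel.CONFIDENTIAL, enterprise_agreement=True, via_gateway=True))
```

Keeping the rules in a table rather than in scattered conditionals makes it easier for legal and compliance teams to review what each level permits.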

A dataset contains customer names, email addresses, and product preferences. Under this framework, what is the minimum AI classification level?

Direct identifiers, quasi-identifiers, and sensitive attributes

PII is not a binary property. Data falls on a spectrum from directly identifying to indirectly identifying to non-identifying, and the boundaries are less clear than most people assume.

Direct identifiers uniquely identify an individual on their own. These are the obvious ones:

Full name
Email address
Phone number
Social Security Number (SSN) / National Insurance Number / national ID numbers
Passport number
Driver's licence number
Financial account numbers (bank account, credit card)
Biometric data (fingerprints, facial recognition templates, retinal scans)
IP address (considered personal data under GDPR; treatment under US law depends on context)

Quasi-identifiers do not identify an individual alone but can do so in combination. This is where classification gets difficult:

Date of birth
Postcode / ZIP code
Gender
Job title
Employer name
Educational institution
Ethnicity
Age

The seminal research by Latanya Sweeney demonstrated that 87% of the US population can be uniquely identified by the combination of just three quasi-identifiers: date of birth, gender, and 5-digit ZIP code. This means that a dataset with these three fields and nothing else is effectively PII — even though no individual field is a direct identifier.
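The effect of combining quasi-identifiers can be measured directly on a dataset. The sketch below groups records by their quasi-identifier values and reports the smallest group size (the k in k-anonymity) and the share of records that are unique on that combination. The function names, field names, and sample records are invented for illustration.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Smallest group size; k == 1 means at least one record is unique."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def unique_fraction(records, quasi_identifiers):
    """Fraction of records that are the only member of their group."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    unique = sum(1 for n in sizes.values() if n == 1)
    return unique / len(records) if records else 0.0


# Invented sample data: no names, only three quasi-identifiers.
records = [
    {"dob": "1980-03-14", "gender": "F", "zip": "02139"},
    {"dob": "1980-03-14", "gender": "F", "zip": "02139"},
    {"dob": "1975-11-02", "gender": "M", "zip": "02139"},
    {"dob": "1991-07-30", "gender": "F", "zip": "94110"},
]
qi = ["dob", "gender", "zip"]

print(k_anonymity(records, qi))      # 1
print(unique_fraction(records, qi))  # 0.5
```

A result of k = 1 means at least one person in the dataset can be singled out by these fields alone, which is the situation the 87% figure describes at population scale.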

Sensitive attributes are data points that, while not necessarily identifying, create elevated risk if associated with an identified individual:

Medical diagnoses and treatment history
Sexual orientation
Religious beliefs
Political opinions
Trade union membership
Criminal records
Financial status and credit history
Genetic data

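Under GDPR, most of these attributes are "special category data" (Article 9), which may be processed only under one of the Article 9(2) conditions, such as explicit consent, in addition to an ordinary lawful basis. That is a higher bar than for ordinary personal data. Criminal records fall under the separate regime of Article 10, and financial status is not a special category under GDPR, though it still carries elevated risk. This distinction matters for AI classification because processing sensitive attributes through AI systems may require a Data Protection Impact Assessment even when the processing of ordinary PII would not.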

A dataset has been 'anonymised' by removing names and email addresses, but retains date of birth, gender, ZIP code, and medical diagnosis. Is this dataset safe to process through a cloud AI service?

HIPAA's 18 identifiers and PCI cardholder data

Two regulated data categories require special attention in AI systems because they have explicit, enumerated definitions of what constitutes protected data.

HIPAA: the 18 PHI identifiers

The HIPAA Privacy Rule defines Protected Health Information (PHI) as individually identifiable health information held by a covered entity or its business associates. The Safe Harbor method of de-identification (45 CFR 164.514(b)(2)) requires the removal of 18 specific identifiers:

Names
Geographic data smaller than a state (street address, city, ZIP code — note: the first three digits of a ZIP code may be retained if the geographic unit contains more than 20,000 people)
All dates (except year) directly related to an individual (birth date, admission date, discharge date, date of death)
Telephone numbers
Fax numbers
Email addresses
Social Security Numbers
Medical record numbers
Health plan beneficiary numbers
Account numbers
Certificate/licence numbers
Vehicle identifiers and serial numbers (including licence plates)
Device identifiers and serial numbers
Web URLs
IP addresses
Biometric identifiers
Full-face photographs and comparable images
Any other unique identifying number, characteristic, or code

That eighteenth category is the catch-all that makes HIPAA de-identification particularly challenging for AI systems. If your PII detection pipeline catches identifiers 1 through 17 but misses a custom patient reference number, you have failed HIPAA de-identification.
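A short sketch shows how this failure happens in practice. A generic pattern set catches emails and SSNs but leaves an organisation-specific reference number in the text until a custom pattern is registered. The PT-NNNNNN format, the pattern names, and the sample note are hypothetical, and the regular expressions are simplified for illustration rather than production-grade detectors.

```python
import re

# Generic patterns of the kind an off-the-shelf detector might ship with.
BUILTIN_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

# Hypothetical organisation-specific identifier (the eighteenth category).
CUSTOM_PATTERNS = {
    "patient_ref": re.compile(r"\bPT-\d{6}\b"),
}


def redact(text, patterns):
    """Replace every match of every pattern with a bracketed label."""
    for label, pattern in patterns.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text


note = "Patient PT-004217 (jane.doe@example.com, SSN 123-45-6789) was admitted."

# Built-in patterns only: the patient reference survives redaction.
print(redact(note, BUILTIN_PATTERNS))

# With the custom pattern registered, the reference is removed as well.
print(redact(note, {**BUILTIN_PATTERNS, **CUSTOM_PATTERNS}))
```

The practical consequence is that the pattern inventory has to be built with the teams who know which local reference numbers, codes, and labels exist in the organisation's records.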

PCI DSS: cardholder data elements

The Payment Card Industry Data Security Standard defines cardholder data as:

Primary Account Number (PAN): the card number, typically 15 or 16 digits, although lengths from 13 to 19 digits exist. This is the key element; if PAN is present, PCI DSS applies.
Cardholder name (when stored with PAN)
Expiration date (when stored with PAN)
Service code (when stored with PAN)

Sensitive authentication data — which must never be stored after authorisation — includes:

Full magnetic stripe data / chip equivalent
CAV2/CVC2/CVV2/CID (the 3-4 digit security code)
PINs and PIN blocks

For AI systems, the critical rule is: never send full PAN to any AI service, cloud or local, unless that service is within your PCI Cardholder Data Environment (CDE). Even truncated PANs (first six and last four digits) should be treated with care, as they can be combined with other data to derive the full number.
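A pre-transmission check for PANs can combine a digit-run pattern with the Luhn checksum that payment card numbers satisfy. The sketch below is an illustration under stated assumptions: it treats runs of 13 to 19 digits (optionally separated by single spaces or hyphens) as candidates, the function names are hypothetical, and the sample value is a widely used test card number rather than a real account.

```python
import re

# Candidate PANs: 13 to 19 digits, optionally separated by single spaces or hyphens.
PAN_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pans(text: str) -> list[str]:
    """Return digit strings that look like PANs and pass the Luhn check."""
    found = []
    for match in PAN_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found


def mask_pans(text: str) -> str:
    """Replace each Luhn-valid candidate with a placeholder, keeping no digits."""
    def _mask(match):
        digits = re.sub(r"[ -]", "", match.group())
        return "[PAN]" if luhn_valid(digits) else match.group()
    return PAN_CANDIDATE.sub(_mask, text)


print(find_pans("Card 4111 1111 1111 1111 exp 12/27"))  # ['4111111111111111']
print(mask_pans("Card 4111-1111-1111-1111 on file"))    # Card [PAN] on file
```

The placeholder deliberately keeps no digits, in line with the caution about truncated PANs. The Luhn check reduces false positives on order numbers and similar digit runs, but it does not eliminate them, so a match is best treated as a reason to block or review rather than as proof.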

Intellectual property: where automation fails

Direct PII — names, emails, SSNs — follows patterns. You can write regex for it, train NER models on it, and build reliable automated detection. Trade secrets and intellectual property are fundamentally different: they are defined by context, not by format.

A line of source code is not inherently a trade secret. But that same line of code, in the context of a proprietary algorithm that gives your company a competitive advantage, is a trade secret. A chemical formula is not inherently confidential. But a chemical formula that represents your unreleased product formulation is highly confidential. No regex pattern or NER model can make this distinction.

This is why trade secret and IP classification for AI requires a different approach:

Classification by source, not by content. Instead of trying to detect whether a piece of text is a trade secret, classify the source. Code from your proprietary algorithm repository is Level 4 or 5 regardless of which specific line is sent to the AI. Documents from your R&D folder are Confidential regardless of which paragraph is queried.

Repository-level tagging. Work with your legal and IP teams to tag repositories, document libraries, and data stores with AI classification levels. This turns the problem from "detect whether this text is a trade secret" (impossible to automate reliably) to "check whether this text came from a tagged source" (straightforward to automate).

The human-in-the-loop requirement. For trade secret and IP classification, accept that full automation is not achievable. Build your pipeline to flag ambiguous cases for human review rather than making automated pass/block decisions on content that might be IP.
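The source-based approach above can be sketched as a lookup from a file path to the tag on its nearest tagged ancestor. The paths, tag values, and the classify_by_source name are hypothetical. Routing untagged sources to human review is an assumption of this sketch, chosen to match the human-in-the-loop principle.

```python
from pathlib import PurePosixPath

# Hypothetical tags agreed with legal and IP teams: source prefix -> AI level.
SOURCE_TAGS = {
    "repos/pricing-engine": 5,
    "repos/internal-tools": 2,
    "docs/rnd": 4,
}

REVIEW = "human_review"


def classify_by_source(path: str):
    """Return the AI level of the most specific tagged ancestor, or REVIEW if none."""
    p = PurePosixPath(path)
    # Check the path itself, then each parent from most to least specific.
    for candidate in [p, *p.parents]:
        level = SOURCE_TAGS.get(str(candidate))
        if level is not None:
            return level
    return REVIEW


print(classify_by_source("repos/pricing-engine/src/model.py"))  # 5
print(classify_by_source("docs/marketing/plan.md"))             # human_review
```

The check never inspects the content itself, which is the point: the classification follows from where the text came from, and anything without a tag is escalated rather than silently passed or blocked.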

Your R&D team wants to use a cloud AI to help analyse experimental results. The data contains proprietary formulations. What is the right approach?

The decision tree for AI data classification

Here is a practical decision tree you can adapt for your organisation. For any data that might be processed by an AI system, walk through these questions in order.

Question 1: Is this data publicly available or intended for public distribution?

Yes: Level 1 (Public). Any AI processing is acceptable.
No: Continue to Question 2.

Question 2: Does this data contain any of the HIPAA 18 identifiers in combination with health information?

Yes: Level 4 (Restricted). HIPAA BAA required with any AI provider. Local inference strongly preferred. Gateway pattern with full de-identification if cloud processing is necessary.
No: Continue to Question 3.

Question 3: Does this data contain PCI cardholder data (PAN, cardholder name with PAN, expiration date with PAN)?

Yes: Level 4 (Restricted). AI processing only within PCI CDE. Never send PAN to external AI services.
No: Continue to Question 4.

Question 4: Does this data contain direct PII identifiers (names, email, phone, SSN, passport numbers, etc.)?

Yes: Continue to Question 4a.
No: Continue to Question 5.

Question 4a: Are the data subjects in jurisdictions with specific data protection regulations (EU/GDPR, California/CCPA, etc.)?

Yes: Level 4 (Restricted). Regulatory compliance required. DPIA may be needed. Gateway pattern with full redaction mandatory for cloud AI.
No: Level 3 (Confidential). Gateway pattern required. PII detection and redaction before any cloud AI processing.

Question 5: Does this data contain quasi-identifiers that could enable re-identification when combined?

Yes: Level 3 (Confidential). Apply the gateway pattern. Consider k-anonymity analysis to assess re-identification risk.
No: Continue to Question 6.

Question 6: Is this data from a repository or source tagged as containing trade secrets or proprietary IP?

Yes: Level 4 (Restricted) or Level 5 (Prohibited), depending on legal assessment. Local inference for Level 4. No AI processing for Level 5.
No: Continue to Question 7.

Question 7: Is this data subject to litigation hold, classified government controls, or contractual AI processing prohibitions?

Yes: Level 5 (Prohibited). No AI processing of any kind.
No: Level 2 (Internal). Enterprise AI services with DPA in place.

This decision tree is a starting point. Your legal, compliance, and security teams should review and adapt it for your specific regulatory obligations and risk tolerance. The important thing is that a decision tree exists — that the classification decision for AI is structured, documented, and consistently applied rather than left to individual employees' judgement.
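For teams that want the tree in executable form, the questions can be written as a single function over a set of yes/no facts about the data. This is a minimal sketch that follows the question order exactly as written above. The DataFacts fields and the classify name are hypothetical, and the legal_says_prohibited flag stands in for the legal assessment mentioned in Question 6.

```python
from dataclasses import dataclass


@dataclass
class DataFacts:
    is_public: bool = False                       # Question 1
    has_phi: bool = False                         # Question 2: HIPAA identifiers with health info
    has_cardholder_data: bool = False             # Question 3
    has_direct_pii: bool = False                  # Question 4
    regulated_jurisdiction: bool = False          # Question 4a
    has_quasi_identifiers: bool = False           # Question 5
    tagged_trade_secret: bool = False             # Question 6
    legal_says_prohibited: bool = False           # outcome of legal assessment for Question 6
    prohibited_by_hold_or_contract: bool = False  # Question 7


def classify(f: DataFacts) -> int:
    """Walk the questions in order and return the AI classification level (1 to 5)."""
    if f.is_public:
        return 1
    if f.has_phi:
        return 4
    if f.has_cardholder_data:
        return 4
    if f.has_direct_pii:
        return 4 if f.regulated_jurisdiction else 3
    if f.has_quasi_identifiers:
        return 3
    if f.tagged_trade_secret:
        return 5 if f.legal_says_prohibited else 4
    if f.prohibited_by_hold_or_contract:
        return 5
    return 2


print(classify(DataFacts(has_direct_pii=True, regulated_jurisdiction=True)))  # 4
print(classify(DataFacts()))                                                  # 2
```

Writing the tree as code also exposes an ordering effect worth reviewing when you adapt it: because the function returns at the first matching question, data that contains direct PII and is also under a litigation hold is assigned Level 3 or 4 before Question 7 is reached. An adopter may prefer to evaluate the Question 7 conditions first so that a prohibition always takes precedence.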


Module 2 — Final Assessment

Why does traditional data classification (Public/Internal/Confidential/Restricted) fail for AI workflows?

Under the HIPAA Safe Harbor method of de-identification, how many specific identifier categories must be removed?

A dataset contains dates of birth, gender, and 5-digit ZIP codes but no names or other direct identifiers. What is the re-identification risk?

Why is automated classification of trade secrets and intellectual property fundamentally harder than automated PII detection?