AI Data Privacy & PII Management

Redaction and Anonymisation Techniques

The legal and technical differences between redaction, pseudonymisation, anonymisation, and tokenisation — plus statistical privacy concepts, synthetic data, and building a practical redaction pipeline.

The terminology matters legally

Redaction, pseudonymisation, anonymisation, and tokenisation are not synonyms. They are distinct techniques with different legal implications, different technical implementations, and different impacts on AI utility. Confusing them in a compliance document or a vendor conversation can create real legal exposure.

Here are the precise distinctions:

Redaction removes PII entirely and replaces it with a marker. "John Smith called about his account" becomes "[PERSON] called about his account." The original data is destroyed — there is no way to recover "John Smith" from the redacted text. Redacted data is no longer personal data under GDPR because no individual can be identified.
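The defining property is irreversibility: the marker replaces the value and nothing is stored that could recover it. A minimal sketch (assuming PII spans have already been found by an upstream NER step, e.g. a tool like spaCy or Presidio; the `redact` helper and its span format are illustrative, not a real library API):

```python
def redact(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) span with a [LABEL] marker.

    The original values are not stored anywhere, so the operation is
    irreversible -- the defining property of redaction.
    """
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

print(redact("John Smith called about his account", [(0, 10, "PERSON")]))
# [PERSON] called about his account
```

In practice the hard part is span detection, not replacement: the legal status of the output depends entirely on the detector catching every identifier.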

Pseudonymisation replaces PII with consistent artificial identifiers. "John Smith called about his account" becomes "Person_A called about his account," and every occurrence of "John Smith" in the same document becomes "Person_A." The mapping between "John Smith" and "Person_A" is stored separately. Under GDPR, pseudonymised data is still personal data because it can be re-identified using the mapping. However, GDPR explicitly recognises pseudonymisation as a safeguard (Article 4(5) and Recital 26), and it is considered a measure that reduces risk.
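The two properties that distinguish pseudonymisation from redaction are consistency (the same name always maps to the same identifier) and a separately stored mapping. A sketch, assuming names are detected upstream (the `Pseudonymiser` class is illustrative):

```python
import string

class Pseudonymiser:
    def __init__(self):
        # original name -> pseudonym; this mapping is the re-identification
        # key and must be stored separately under access control.
        self._mapping: dict[str, str] = {}
        self._labels = iter(string.ascii_uppercase)

    def pseudonym_for(self, name: str) -> str:
        if name not in self._mapping:
            self._mapping[name] = f"Person_{next(self._labels)}"
        return self._mapping[name]

    def apply(self, text: str, names: list[str]) -> str:
        for name in names:
            text = text.replace(name, self.pseudonym_for(name))
        return text

p = Pseudonymiser()
print(p.apply("John Smith called. John Smith was angry.", ["John Smith"]))
# Person_A called. Person_A was angry.
```

Because `p._mapping` still holds `{"John Smith": "Person_A"}`, anyone with access to it can reverse the transformation, which is exactly why the output remains personal data under GDPR.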

Anonymisation transforms data so that no individual can be identified, even by the data controller, even with additional information. Under GDPR, truly anonymised data is no longer personal data and falls outside GDPR scope entirely. The threshold for anonymisation is high: it must be irreversible, and the data must not be re-identifiable "by any means reasonably likely to be used."

Tokenisation replaces sensitive data elements with non-sensitive tokens that map back to the originals through a tokenisation vault. Unlike pseudonymisation, tokenisation preserves the format of the original data (a 16-digit token replaces a 16-digit card number), making it useful for systems that need format-consistent data. Tokenisation is widely used in PCI DSS contexts to keep cardholder data out of systems that do not need it.
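The vault pattern can be sketched as follows. This is a toy in-memory version for illustration only; production systems use a hardened vault service or format-preserving encryption (e.g. NIST FF1) rather than a dictionary:

```python
import secrets

class TokenVault:
    def __init__(self):
        self._by_token: dict[str, str] = {}  # token -> original value
        self._by_value: dict[str, str] = {}  # original value -> token

    def tokenise(self, pan: str) -> str:
        """Return a random token with the same length and digit format
        as the card number, storing the mapping in the vault."""
        if pan in self._by_value:
            return self._by_value[pan]
        while True:
            token = "".join(secrets.choice("0123456789") for _ in pan)
            if token not in self._by_token:  # avoid collisions
                break
        self._by_token[token] = pan
        self._by_value[pan] = token
        return token

    def detokenise(self, token: str) -> str:
        """Recover the original value -- only possible with vault access."""
        return self._by_token[token]

vault = TokenVault()
token = vault.tokenise("4111111111111111")
assert len(token) == 16
assert vault.detokenise(token) == "4111111111111111"
```

Downstream systems that only ever see the token stay out of PCI DSS scope, because the token is useless without access to the vault.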

Question

Your legal team says: "We can process this data through cloud AI because it has been anonymised." The data has had names replaced with Person_A, Person_B, etc., with a mapping table stored in your database. Is this actually anonymised?