AI Data Privacy & PII Management

Redaction and Anonymisation Techniques

The legal and technical differences between redaction, pseudonymisation, anonymisation, and tokenisation — plus statistical privacy concepts, synthetic data, and building a practical redaction pipeline.

The terminology matters legally

Redaction, pseudonymisation, anonymisation, and tokenisation are not synonyms. They are distinct techniques with different legal implications, different technical implementations, and different impacts on AI utility. Confusing them in a compliance document or a vendor conversation can create real legal exposure.

Here are the precise distinctions:

Redaction removes PII entirely and replaces it with a marker. "John Smith called about his account" becomes "[PERSON] called about his account." The original data is destroyed — there is no way to recover "John Smith" from the redacted text. Redacted data is no longer personal data under GDPR because no individual can be identified.
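The defining property is irreversibility: the marker replaces the value and nothing is stored that could recover it. A minimal sketch (assuming PII spans have already been found by an upstream NER step, e.g. a tool like spaCy or Presidio; the `redact` helper and its span format are illustrative, not a real library API):

```python
def redact(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) span with a [LABEL] marker.

    The original values are not stored anywhere, so the operation is
    irreversible -- the defining property of redaction.
    """
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

print(redact("John Smith called about his account", [(0, 10, "PERSON")]))
# [PERSON] called about his account
```

In practice the hard part is span detection, not replacement: the legal status of the output depends entirely on the detector catching every identifier.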

Pseudonymisation replaces PII with consistent artificial identifiers. "John Smith called about his account" becomes "Person_A called about his account," and every occurrence of "John Smith" in the same document becomes "Person_A." The mapping between "John Smith" and "Person_A" is stored separately. Under GDPR, pseudonymised data is still personal data because it can be re-identified using the mapping. However, GDPR explicitly recognises pseudonymisation as a safeguard (Article 4(5) and Recital 26), and it is considered a measure that reduces risk.
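The two properties that distinguish pseudonymisation from redaction are consistency (the same name always maps to the same identifier) and a separately stored mapping. A sketch, assuming names are detected upstream (the `Pseudonymiser` class is illustrative):

```python
import string

class Pseudonymiser:
    def __init__(self):
        # original name -> pseudonym; this mapping is the re-identification
        # key and must be stored separately under access control.
        self._mapping: dict[str, str] = {}
        self._labels = iter(string.ascii_uppercase)

    def pseudonym_for(self, name: str) -> str:
        if name not in self._mapping:
            self._mapping[name] = f"Person_{next(self._labels)}"
        return self._mapping[name]

    def apply(self, text: str, names: list[str]) -> str:
        for name in names:
            text = text.replace(name, self.pseudonym_for(name))
        return text

p = Pseudonymiser()
print(p.apply("John Smith called. John Smith was angry.", ["John Smith"]))
# Person_A called. Person_A was angry.
```

Because `p._mapping` still holds `{"John Smith": "Person_A"}`, anyone with access to it can reverse the transformation, which is exactly why the output remains personal data under GDPR.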

Anonymisation transforms data so that no individual can be identified, even by the data controller, even with additional information. Under GDPR, truly anonymised data is no longer personal data and falls outside GDPR scope entirely. The threshold for anonymisation is high: it must be irreversible, and the data must not be re-identifiable "by any means reasonably likely to be used."

Tokenisation replaces sensitive data elements with non-sensitive tokens that map back to the originals through a tokenisation vault. Unlike pseudonymisation, tokenisation preserves the format of the original data (a 16-digit token replaces a 16-digit card number), making it useful for systems that need format-consistent data. Tokenisation is widely used in PCI DSS contexts to keep cardholder data out of systems that do not need it.
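The vault pattern can be sketched as follows. This is a toy in-memory version for illustration only; production systems use a hardened vault service or format-preserving encryption (e.g. NIST FF1) rather than a dictionary:

```python
import secrets

class TokenVault:
    def __init__(self):
        self._by_token: dict[str, str] = {}  # token -> original value
        self._by_value: dict[str, str] = {}  # original value -> token

    def tokenise(self, pan: str) -> str:
        """Return a random token with the same length and digit format
        as the card number, storing the mapping in the vault."""
        if pan in self._by_value:
            return self._by_value[pan]
        while True:
            token = "".join(secrets.choice("0123456789") for _ in pan)
            if token not in self._by_token:  # avoid collisions
                break
        self._by_token[token] = pan
        self._by_value[pan] = token
        return token

    def detokenise(self, token: str) -> str:
        """Recover the original value -- only possible with vault access."""
        return self._by_token[token]

vault = TokenVault()
token = vault.tokenise("4111111111111111")
assert len(token) == 16
assert vault.detokenise(token) == "4111111111111111"
```

Downstream systems that only ever see the token stay out of PCI DSS scope, because the token is useless without access to the vault.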

Question

Your legal team says: "We can process this data through cloud AI because it has been anonymised." The data has had names replaced with Person_A, Person_B, etc., with a mapping table stored in your database. Is this actually anonymised?