AI Data Privacy & PII Management

PII Detection and Recognition

From regex to transformer NER to context-aware LLM detection — building a layered PII detection system with Microsoft Presidio, spaCy, and local models.

Detection is the foundation of everything else

You cannot redact what you cannot find. Every privacy architecture in this course (the gateway pattern, the data privacy pipeline, the audit system) depends on a PII detection layer that is accurate, fast, and comprehensive. If your detection misses a Social Security Number, your redaction pipeline passes it through to the cloud model. If your detection flags every occurrence of "John", including the substring "john" inside an identifier such as "john_doe_table", your users will abandon the system within a week.
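The "john_doe_table" false positive can be made concrete with a few lines of Python. This is a minimal sketch using only the standard library; the sample text and the function names naive_matches and bounded_matches are hypothetical, and matching a bare first name is shown only to illustrate the failure mode, not as a recommended way to detect names.

```python
import re

# Hypothetical sample: one real mention of a person, one table identifier.
TEXT = "SELECT * FROM john_doe_table; -- requested by John"


def naive_matches(text: str) -> list[str]:
    """Case-insensitive substring search: also fires inside identifiers."""
    return re.findall(r"john", text, flags=re.IGNORECASE)


def bounded_matches(text: str) -> list[str]:
    """Same search, but only where 'john' stands alone as a token.

    The regex word boundary treats '_' as a word character, so there is
    no boundary between 'john' and '_doe_table' and that hit is skipped.
    """
    return re.findall(r"\bjohn\b", text, flags=re.IGNORECASE)


naive = naive_matches(TEXT)
bounded = bounded_matches(TEXT)
```

Here the naive search reports two hits and the bounded search reports only the standalone "John". Token boundaries remove one class of false positive; they do not tell you whether "John" is actually a person, which is where the later layers come in.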

PII detection is a tradeoff between recall (the share of real PII that is caught) and precision (the share of detections that are real PII), and the right balance depends on your risk tolerance. A healthcare organisation processing PHI under HIPAA needs recall above 99%, because a single missed identifier can constitute a compliance violation. A marketing team using AI to analyse customer feedback might accept 95% recall if it means fewer false positives disrupting their workflow.
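To make the two metrics concrete, the sketch below computes them from raw counts. The class DetectionCounts, the helper functions, and the counts themselves are hypothetical and chosen only for illustration; they are not measurements from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionCounts:
    true_positives: int   # real PII that the detector flagged
    false_positives: int  # non-PII that the detector flagged
    false_negatives: int  # real PII that the detector missed


def precision(c: DetectionCounts) -> float:
    """Share of detections that are real PII."""
    flagged = c.true_positives + c.false_positives
    return c.true_positives / flagged if flagged else 0.0


def recall(c: DetectionCounts) -> float:
    """Share of real PII that was detected."""
    actual = c.true_positives + c.false_negatives
    return c.true_positives / actual if actual else 0.0


# Hypothetical evaluation set containing 1,000 real PII items.
example = DetectionCounts(true_positives=995, false_positives=176, false_negatives=5)
example_precision = precision(example)
example_recall = recall(example)
```

With these made-up counts, recall is 995 / 1000 = 99.5% and precision is 995 / 1171, roughly 85%. Five identifiers still leak, and about one in seven flags is noise; which of those costs more is the risk-tolerance question described above.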

This module covers the three layers of PII detection — rule-based, ML-based, and LLM-based — and how to combine them into a pipeline that achieves both high recall and acceptable precision.
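One way to picture the layering is as a list of detector functions whose findings are merged. The sketch below is an assumption-laden illustration, not the pipeline this course builds: the Span type, the detect function, and both layers are hypothetical, the regex patterns are deliberately simplistic, and name_lookup_layer is only a stand-in for a real ML or LLM layer.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    entity_type: str
    source: str  # which layer produced the finding


Detector = Callable[[str], list[Span]]

# Layer 1: rule-based. Illustrative patterns only, not production-grade.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def regex_layer(text: str) -> list[Span]:
    spans = [Span(m.start(), m.end(), "US_SSN", "regex")
             for m in SSN_PATTERN.finditer(text)]
    spans += [Span(m.start(), m.end(), "EMAIL", "regex")
              for m in EMAIL_PATTERN.finditer(text)]
    return spans


def name_lookup_layer(text: str) -> list[Span]:
    """Stand-in for an ML NER layer: matches a tiny hypothetical name list."""
    known_names = ("John Smith",)
    return [Span(m.start(), m.end(), "PERSON", "name_lookup")
            for name in known_names
            for m in re.finditer(re.escape(name), text)]


def detect(text: str, layers: list[Detector]) -> list[Span]:
    """Run every layer, then drop spans that overlap an earlier, longer one."""
    found = [span for layer in layers for span in layer(text)]
    found.sort(key=lambda s: (s.start, -(s.end - s.start)))
    kept: list[Span] = []
    for span in found:
        if kept and span.start < kept[-1].end:
            continue  # overlaps the span we already kept
        kept.append(span)
    return kept


sample = "Contact John Smith at john.smith@example.com, SSN 123-45-6789."
findings = detect(sample, [regex_layer, name_lookup_layer])
```

Running every layer and taking the union favours recall, since a finding from any layer survives. Precision then has to be recovered separately, for example by scoring or filtering the merged spans, which this sketch does not attempt.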

Your PII detection system has 99.5% recall (misses 0.5% of PII) and 85% precision (15% of detections are false positives). Which metric is more important to optimise?