PII Detection and Recognition

Detection is the foundation of everything else

You cannot redact what you cannot find. Every privacy architecture in this course (the gateway pattern, the data privacy pipeline, the audit system) depends on a PII detection layer that is accurate, fast, and comprehensive. If your detection misses a Social Security Number, your redaction pipeline passes it through to the cloud model. If your detection flags every occurrence of "john", including the one inside a table name such as "john_doe_table", your users will abandon the system within a week.

PII detection is a precision-recall tradeoff, and the right balance depends on your risk tolerance. A healthcare organisation processing PHI under HIPAA needs recall above 99% — missing even one identifier is a compliance violation. A marketing team using AI to analyse customer feedback might accept 95% recall if it means fewer false positives disrupting their workflow.
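To make the two metrics concrete, the sketch below computes them from raw counts. The counts are hypothetical and only illustrate the arithmetic; they do not come from a real evaluation.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) from raw detection counts."""
    flagged = true_pos + false_pos   # everything the detector flagged
    actual = true_pos + false_neg    # everything that really was PII
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical evaluation run: 1,000 real PII items in the test set,
# 950 of them found, plus 50 benign spans flagged by mistake.
precision, recall = precision_recall(true_pos=950, false_pos=50, false_neg=50)
print(f"precision={precision:.2%} recall={recall:.2%}")
```

Raising recall means driving false_neg toward zero; raising precision means driving false_pos toward zero. Most tuning decisions trade one against the other.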

This module covers the three layers of PII detection — rule-based, ML-based, and LLM-based — and how to combine them into a pipeline that achieves both high recall and acceptable precision.

Your PII detection system has 99.5% recall (misses 0.5% of PII) and 85% precision (15% of detections are false positives). Which metric is more important to optimise?

Regex patterns: fast, deterministic, and limited

Rule-based detection uses regular expressions and pattern matching to identify PII that follows predictable formats. It is the fastest detection layer, runs without ML infrastructure, and produces deterministic results — the same input always produces the same output.

What regex handles well:

Email addresses: The pattern [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} catches the vast majority of email addresses. Edge cases exist (quoted local parts, internationalised domain names), but for enterprise PII detection, this pattern has recall above 99%.
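A minimal sketch of applying that pattern with Python's standard re module follows. The function name and sample text are illustrative assumptions.

```python
import re

# The email pattern quoted above, compiled once for reuse.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def find_emails(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, match) for every email-like span in the text."""
    return [(m.start(), m.end(), m.group()) for m in EMAIL_RE.finditer(text)]


sample = "Contact jane.doe@example.com or ops+alerts@mail.example.org today."
emails = find_emails(sample)
for start, end, value in emails:
    print(start, end, value)
```

Returning character offsets alongside the match lets a later redaction step replace the exact span.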

Phone numbers: Phone numbers are harder because formats vary by country. A US-focused pattern like (\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4} catches most North American formats. For international numbers, the libphonenumber library (originally developed by Google) provides parsing and validation across 200+ country formats.
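The US-focused pattern can be exercised as in the sketch below; the helper name and the sample numbers are illustrative assumptions. The pattern has no word boundaries, so it can also match inside longer digit runs, and hits are best treated as candidates.

```python
import re

# The North American pattern quoted above.
US_PHONE_RE = re.compile(r"(\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")


def find_us_phones(text: str) -> list[str]:
    """Return every substring that matches the North American pattern."""
    # group(0) is the whole match; group(1) is only the optional country prefix.
    return [m.group(0) for m in US_PHONE_RE.finditer(text)]


phones = find_us_phones("Call (415) 555-0132 or +1 415.555.0199 after 5pm.")
print(phones)
```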

Social Security Numbers: The pattern \b\d{3}-\d{2}-\d{4}\b catches the standard format. But SSNs also appear without dashes (\b\d{9}\b) and in varied formats. The challenge is that a 9-digit number could be many things — you need context-aware validation (does this number appear near terms like "SSN," "social security," or "tax ID"?).
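One way to sketch the context check: accept dashed SSNs directly, and accept bare 9-digit numbers only when a keyword appears nearby. The keyword list and the 40-character window are assumptions for illustration and would need tuning on real data.

```python
import re

SSN_DASHED_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
SSN_BARE_RE = re.compile(r"\b\d{9}\b")
# Hypothetical keyword list; word boundaries avoid hits inside other words.
CONTEXT_RE = re.compile(r"\b(?:ssn|social security|tax id)\b", re.IGNORECASE)
CONTEXT_WINDOW = 40  # characters inspected on each side of a candidate


def find_ssns(text: str) -> list[str]:
    """Dashed SSNs always count; bare 9-digit numbers need nearby context."""
    hits = [(m.start(), m.group()) for m in SSN_DASHED_RE.finditer(text)]
    for m in SSN_BARE_RE.finditer(text):
        window = text[max(0, m.start() - CONTEXT_WINDOW): m.end() + CONTEXT_WINDOW]
        if CONTEXT_RE.search(window):
            hits.append((m.start(), m.group()))
    # Return matches in document order.
    return [value for _, value in sorted(hits)]


print(find_ssns("Order 123456789 shipped."))
print(find_ssns("Her SSN is 123456789."))
print(find_ssns("On file: 078-05-1120"))
```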

Credit card numbers: The Luhn algorithm validates whether a number is a plausible credit card number. Combined with patterns for common card formats, such as Visa (starts with 4, 16 digits), Mastercard (starts with 51-55 or 2221-2720, 16 digits), and Amex (starts with 34 or 37, 15 digits), you can detect card numbers with high precision. The pattern to watch for is any 13-19 digit number that passes the Luhn check.
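The Luhn check itself is short. The sketch below pairs it with a candidate pattern for 13-19 digits; the candidate regex, which tolerates single spaces or hyphens between digits, is an assumption of this example.

```python
import re

# Candidate: 13-19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return candidate spans that also pass the Luhn check."""
    return [m.group() for m in CARD_CANDIDATE_RE.finditer(text) if luhn_valid(m.group())]


# 4111 1111 1111 1111 is a widely used test number, not a real card.
print(find_card_numbers("Card 4111 1111 1111 1111 on file"))
print(find_card_numbers("Card 4111 1111 1111 1112 on file"))
```

The checksum is what keeps precision high: a random 16-digit number passes Luhn only about one time in ten.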

IP addresses: IPv4 is straightforward: \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b with validation that each octet is 0-255. IPv6 is more complex due to abbreviation rules and mixed notation.
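A sketch of the IPv4 check, with the octet range validated in code rather than in the regex (the function name is illustrative):

```python
import re

IPV4_CANDIDATE_RE = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")


def find_ipv4(text: str) -> list[str]:
    """Return dotted-quad matches whose four octets are all in 0-255."""
    results = []
    for m in IPV4_CANDIDATE_RE.finditer(text):
        if all(0 <= int(octet) <= 255 for octet in m.group().split(".")):
            results.append(m.group())
    return results


print(find_ipv4("Login from 192.168.1.10, build 999.1.1.1 rejected"))
```

For IPv6, validating candidate strings with the standard library's ipaddress module is usually simpler than maintaining a regex.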

Dates: Date patterns vary enormously by locale and context. US format (MM/DD/YYYY), European format (DD/MM/YYYY), ISO format (YYYY-MM-DD), and written formats ("March 15, 2024") all need separate patterns. Dates are PII under HIPAA (except year) and can be quasi-identifiers in other contexts.

Where regex fails:

Names. "John Smith" is a name. "Main Street" is not. "Victoria" could be a name, a city, or a state. No regex can make these distinctions reliably.
Addresses. Addresses have some structural patterns but are enormously variable across countries and contexts.
Context-dependent PII. "Patient presented with chest pain" — "Patient" is not PII, but it indicates the surrounding text is likely PHI. Regex cannot understand context.
Obfuscated PII. "My social is three four five, six seven, eight nine zero one" — a human reads the SSN. Regex does not.

You are building a PII detection pipeline. Your regex layer catches emails, phone numbers, SSNs, and credit card numbers. What is the most critical gap?

Named Entity Recognition: understanding context

Named Entity Recognition (NER) is the NLP task of identifying and classifying named entities in text into predefined categories: person, organisation, location, date, etc. NER models understand linguistic context — they know that "Apple" in "Apple reported strong earnings" is an organisation, while "apple" in "she ate an apple" is not an entity.

spaCy

spaCy is the most widely used NER library in production PII detection systems. Its pretrained English models (en_core_web_sm, en_core_web_md, en_core_web_lg, en_core_web_trf) provide NER out of the box with entity types including PERSON, ORG, GPE (geo-political entity), DATE, MONEY, and CARDINAL.

Performance varies by model size. The transformer-based model (en_core_web_trf, based on RoBERTa) achieves an F1 score of approximately 0.90 on the OntoNotes 5.0 benchmark for NER. The small model (en_core_web_sm) achieves approximately 0.85 F1 but runs 10-50x faster. For a PII detection pipeline where you need to process thousands of requests per minute, the speed-accuracy tradeoff matters.

spaCy's strength is speed and integration. It processes text at thousands of tokens per second (even the transformer model runs in milliseconds for typical prompt lengths), integrates cleanly with Python pipelines, and can be extended with custom entity types.

Flair

Flair is an NLP framework built on PyTorch that achieves state-of-the-art NER performance through stacked embeddings. Flair's NER models combine contextual string embeddings with classic word embeddings such as GloVe, achieving F1 scores above 0.93 on CoNLL-03 (a standard NER benchmark).

Flair is slower than spaCy for inference but more accurate, especially on entity types that benefit from character-level context (misspelled names, unusual entity formats). For a pipeline that prioritises recall over speed, Flair is a strong choice for the ML layer.

Hugging Face NER pipelines

The Hugging Face transformers library provides access to hundreds of pretrained NER models. The most relevant for PII detection:

dslim/bert-base-NER — BERT-based NER, ~0.91 F1 on CoNLL-03, good general-purpose model
Jean-Baptiste/camembert-ner — for French text NER
StanfordAIMI/stanford-deidentifier-base — specifically trained for medical de-identification
Various community models fine-tuned on PII-specific datasets

The Hugging Face ecosystem's advantage is breadth: you can find NER models for specific languages, specific domains (medical, legal, financial), and specific entity types. The disadvantage is consistency — community models vary in quality, and you need to evaluate each one on your specific data.

The practical ML NER setup for PII detection:

For most enterprise pipelines, the recommended ML NER configuration is:

spaCy with the transformer model (en_core_web_trf) as the primary NER layer — good accuracy, fast enough for real-time processing
Custom entity recognisers for domain-specific PII (e.g., medical record numbers, internal employee IDs) trained on your organisation's data
A confidence threshold: entities detected with confidence above 0.8 are flagged as PII; entities between 0.5 and 0.8 are flagged for review
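The threshold rule can be sketched as a small triage function. The Entity type is hypothetical, and how you obtain a per-entity score depends on the model you use.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    text: str
    label: str
    score: float  # model confidence in [0, 1]; source depends on your model


AUTO_FLAG = 0.8  # above this: treat as PII automatically
REVIEW = 0.5     # from here up to AUTO_FLAG: send to review


def triage(entity: Entity) -> str:
    """Map a detection to 'flag', 'review', or 'ignore' by confidence."""
    if entity.score > AUTO_FLAG:
        return "flag"
    if entity.score >= REVIEW:
        return "review"
    return "ignore"


for ent in [Entity("John Smith", "PERSON", 0.93),
            Entity("Victoria", "PERSON", 0.62),
            Entity("Main Street", "PERSON", 0.30)]:
    print(ent.text, "->", triage(ent))
```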

Using local LLMs for context-aware PII detection

Regex catches patterns. NER catches entities. But some PII requires understanding the full context of a passage to detect. This is where local LLMs add a detection layer that neither regex nor NER can provide.

Why LLMs for PII detection?

Consider this text: "The patient in room 412 who was admitted on Tuesday for the procedure discussed in Dr. Ramirez's email last week." A human reader understands that "room 412" combined with "admitted on Tuesday" could identify a patient. Regex sees a number. NER sees a date. Neither understands that the combination of hospital room number + admission date + reference to a specific doctor is PHI that needs to be handled.

Local LLMs — specifically smaller, efficient models like Google's Gemma 4 family — can perform this contextual analysis. You prompt the model with the text and ask it to identify all PII, including contextual PII that would not be caught by pattern matching or entity recognition.

Gemma 4 for PII detection

Gemma 4 E2B (2 billion parameters) and E4B (4 billion parameters) are small enough to run on a single GPU or even a modern CPU, yet capable enough to understand context and identify subtle PII. A typical PII detection prompt:

Analyse the following text and identify all personally identifiable information (PII).
For each PII instance, provide:
- The exact text span
- The PII category (PERSON, EMAIL, PHONE, SSN, ADDRESS, DATE_OF_BIRTH, MEDICAL_ID, FINANCIAL, OTHER)
- Your confidence level (HIGH, MEDIUM, LOW)
- Why this is PII (brief explanation)

Pay special attention to:
- Indirect identifiers that could identify someone in combination
- Context clues that suggest surrounding text contains PII
- Domain-specific identifiers (medical record numbers, case IDs, employee IDs)

Text to analyse:
{input_text}

At E4B scale, Gemma 4 can process a typical prompt (500-1000 tokens) in under 2 seconds on a modern GPU (NVIDIA T4 or better). This is slower than regex (microseconds) or spaCy NER (milliseconds), but fast enough for a pipeline that processes prompts before forwarding them to a cloud AI service.
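A minimal sketch of wiring such a prompt into code follows. It makes two assumptions that are not part of the prompt above: the model is asked to reply in JSON so the output can be parsed, and the model call is abstracted behind a hypothetical generate callable. The fake_model function stands in for a real local model.

```python
import json
from typing import Callable

# Abbreviated version of the prompt above. The JSON instruction is an added
# assumption so that the reply can be parsed by code.
PROMPT_TEMPLATE = (
    "Analyse the following text and identify all personally identifiable "
    "information (PII), including indirect identifiers.\n"
    "Reply with a JSON list of objects with the keys "
    "span, category, confidence and reason.\n\n"
    "Text to analyse:\n{input_text}"
)


def detect_pii_with_llm(text: str, generate: Callable[[str], str]) -> list[dict]:
    """Run one detection pass; `generate` wraps whatever local model you use."""
    raw = generate(PROMPT_TEMPLATE.format(input_text=text))
    items = json.loads(raw)  # raises ValueError on malformed output
    if not isinstance(items, list):
        raise ValueError("expected a JSON list of detections")
    kept = []
    for item in items:
        span = item.get("span") if isinstance(item, dict) else None
        # Drop spans that do not literally occur in the input (hallucinations).
        if isinstance(span, str) and span and span in text:
            kept.append(item)
    return kept


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the parser."""
    return json.dumps([
        {"span": "room 412", "category": "OTHER", "confidence": "MEDIUM",
         "reason": "room number plus admission day may identify a patient"},
        {"span": "Jane Roe", "category": "PERSON", "confidence": "HIGH",
         "reason": "name"},
    ])


found = detect_pii_with_llm("The patient in room 412 was admitted on Tuesday.", fake_model)
print(found)
```

Malformed model output raises an error here rather than returning an empty list, so a failed scan is never mistaken for a clean one.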

The false positive/negative tradeoff at the LLM layer

LLMs are better at avoiding false negatives (they catch contextual PII that other layers miss) but worse at avoiding false positives (they may flag benign text as PII because they are being cautious). This is actually the right behaviour for a third detection layer — you want the LLM to err on the side of caution, catching things the other layers missed, and let the downstream review or redaction process handle the false positives.

The key is that the LLM layer should be additive, not a replacement. It reviews the same text that regex and NER have already processed, specifically looking for PII those layers might have missed.

Your pipeline runs regex, then spaCy NER, then a local Gemma 4 model for PII detection. The LLM layer adds 1.5 seconds of latency per request. When is this latency acceptable?

Structured data, multi-language, and Microsoft Presidio

Structured vs unstructured data

The detection approaches above focus on unstructured text — the primary input for AI systems. But AI pipelines also process structured data: CSV files, JSON payloads, database query results, spreadsheet exports. Structured data requires a different detection approach.

For structured data, column-level classification is more effective than row-level scanning. A column named "email" almost certainly contains email addresses. A column named "dob" contains dates of birth. Column name heuristics, combined with sample-based validation (scan 100 rows and check if the data matches the expected PII type), can classify entire columns efficiently.

For structured data processed through AI, the practical approach is:

Classify columns at the schema level (using column names, data types, and sample values)
Apply appropriate redaction at the column level (redact all values in the "ssn" column)
Use row-level detection only for free-text columns (like "notes" or "comments") that might contain unstructured PII
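The schema-level step might look like the sketch below. The name hints, validators, sample size, and 90% match threshold are all illustrative assumptions.

```python
import re

# Hypothetical hints and validators for illustration; extend for your schemas.
NAME_HINTS = {"email": "EMAIL", "ssn": "SSN", "dob": "DATE_OF_BIRTH", "phone": "PHONE"}
VALIDATORS = {
    "EMAIL": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def classify_column(name, values, sample_size=100, min_match=0.9):
    """Return a PII type for the column, or None if nothing suggests PII."""
    sample = [str(v) for v in values[:sample_size] if v not in (None, "")]
    if sample:
        for pii_type, pattern in VALIDATORS.items():
            matches = sum(1 for v in sample if pattern.fullmatch(v))
            if matches / len(sample) >= min_match:
                return pii_type
    # No validator matched: fall back to the column name, erring toward PII.
    return NAME_HINTS.get(name.strip().lower())


print(classify_column("contact", ["a@example.com", "b@example.org"]))
print(classify_column("dob", ["1990-01-01", "1985-07-23"]))
print(classify_column("quantity", [1, 2, 3]))
```

Sample values are checked first so that a misleadingly named column (such as "contact" holding emails) is still caught; the column name acts as a conservative fallback.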

Multi-language PII detection

Enterprise data is rarely monolingual. Customer support systems process queries in dozens of languages. Global organisations have documents in multiple languages within the same pipeline.

Names vary dramatically across languages and cultures. Chinese names are typically 2-3 characters with family name first. Arabic names may include patronymics (bin/bint), tribal names, and honorifics. Hispanic names often include two family names. Indian names vary by region, religion, and language.

National ID numbers differ by country: 9-digit SSN in the US, 11-digit CPF in Brazil, 12-digit Aadhaar in India, 9-character NINO format in the UK (two letters, six digits, one letter). Phone number formats, address structures, and date formats all vary by country.

For multi-language detection, the practical stack is:

Regex patterns parameterised by locale (US SSN pattern, UK NINO pattern, Brazilian CPF pattern, etc.)
Multilingual NER models: spaCy supports models for 25+ languages; Hugging Face hosts NER models for 100+ languages
A language detection step at the start of the pipeline that routes text to the appropriate locale-specific detection rules
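Locale-parameterised patterns can be sketched as a lookup table keyed by locale. The locale codes, entity names, and simplified patterns below are assumptions for illustration; none of them validates checksums, and the locale is assumed to come from an upstream language or region detection step.

```python
import re

# Simplified national ID formats, keyed by locale. Format checks only.
NATIONAL_ID_PATTERNS = {
    "en_US": ("US_SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    "en_GB": ("UK_NINO", re.compile(r"\b[A-Z]{2}\d{6}[A-D]\b")),
    "pt_BR": ("BR_CPF", re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b")),
    "en_IN": ("IN_AADHAAR", re.compile(r"\b\d{4} ?\d{4} ?\d{4}\b")),
}


def find_national_ids(text: str, locale: str) -> list[tuple[str, str]]:
    """Return (entity_type, match) pairs for the given locale.

    If the locale is unknown, every pattern is applied, trading some
    precision for recall.
    """
    if locale in NATIONAL_ID_PATTERNS:
        selected = [NATIONAL_ID_PATTERNS[locale]]
    else:
        selected = list(NATIONAL_ID_PATTERNS.values())
    return [(entity_type, m.group())
            for entity_type, pattern in selected
            for m in pattern.finditer(text)]


print(find_national_ids("CPF 123.456.789-09", "pt_BR"))
print(find_national_ids("NINO QQ123456C", "en_GB"))
print(find_national_ids("SSN 078-05-1120", "xx_XX"))
```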

Microsoft Presidio: the open-source PII framework

Microsoft Presidio is an open-source PII detection and anonymisation framework that implements the layered approach described in this module: regex and NER recognisers out of the box, with extension points for additional layers. It is the most mature open-source option for enterprise PII detection.

Presidio's architecture has two main components:

Presidio Analyzer: Detects PII in text using a combination of regex patterns, NER models (spaCy by default), and custom recognisers. Returns a list of detected PII entities with type, position, and confidence score.
Presidio Anonymizer: Applies anonymisation operators (redact, replace, mask, hash, encrypt) to detected entities.

Presidio ships with recognisers for ~30 PII types across multiple locales. You can add custom recognisers for domain-specific PII (medical record numbers, internal employee IDs) by implementing a simple Python interface.

Presidio integrates with spaCy, Hugging Face transformers, and Azure AI Language. For the architecture in this course, Presidio serves as the orchestration layer: it runs regex recognisers and ML NER, and you can extend it with a custom recogniser that calls your local LLM for the third detection layer.

You are processing a dataset of customer support tickets from a global company. Tickets are in English, Spanish, French, German, and Japanese. What is the most effective PII detection approach?

Combining layers: the detection pipeline architecture

The three detection layers complement each other. Regex provides speed and deterministic detection of structured PII. ML NER provides context-aware entity recognition. The local LLM catches contextual and subtle PII that the other layers miss.

Here is the pipeline architecture:

Input text
    │
    ├─── Layer 1: Regex (microseconds)
    │    ├── Email patterns
    │    ├── Phone number patterns (locale-aware)
    │    ├── SSN / National ID patterns (locale-aware)
    │    ├── Credit card numbers (Luhn validation)
    │    ├── IP addresses (v4 and v6)
    │    └── Custom patterns (internal IDs, etc.)
    │
    ├─── Layer 2: ML NER (milliseconds)
    │    ├── spaCy transformer model (or Flair)
    │    ├── Entity types: PERSON, ORG, GPE, DATE, MONEY
    │    ├── Custom entity recognisers for domain-specific PII
    │    └── Confidence threshold: 0.8 for auto-flag, 0.5-0.8 for review
    │
    └─── Layer 3: Local LLM (seconds) [for Level 3+ data only]
         ├── Gemma 4 E4B with PII detection prompt
         ├── Contextual PII identification
         ├── Indirect identifier combination detection
         └── Confidence-scored results
    │
    ▼
Merged results (union of all detections, deduplicated)
    │
    ▼
Output: List of PII entities with type, position, confidence, detection source

Merging results across layers:

When merging results from multiple layers, take the union (not the intersection): a detection from any single layer is kept. If regex detects an email and NER does not, keep the detection. If NER detects a name and regex does not, keep the detection. If the LLM detects contextual PII that neither other layer caught, keep the detection.

For overlapping detections (regex and NER both detect the same email address), deduplicate by taking the detection with higher confidence and keeping the source attribution for audit purposes.
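The merge rule can be sketched as follows. The Detection type and its field names are hypothetical; the logic keeps every non-overlapping detection and, for overlapping ones, keeps the highest-confidence detection while recording every layer that contributed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    start: int
    end: int
    pii_type: str
    score: float
    source: str  # "regex", "ner", or "llm"


def merge_detections(detections: list[Detection]) -> list[tuple[Detection, list[str]]]:
    """Union of all detections; overlaps collapse to the best-scoring one."""
    clusters = []  # each: {"best": Detection, "sources": set, "end": int}
    for det in sorted(detections, key=lambda d: (d.start, d.end)):
        if clusters and det.start < clusters[-1]["end"]:
            cluster = clusters[-1]
            cluster["sources"].add(det.source)
            # Track the furthest end so later overlaps are still recognised.
            cluster["end"] = max(cluster["end"], det.end)
            if det.score > cluster["best"].score:
                cluster["best"] = det
        else:
            clusters.append({"best": det, "sources": {det.source}, "end": det.end})
    return [(c["best"], sorted(c["sources"])) for c in clusters]


merged = merge_detections([
    Detection(8, 28, "EMAIL", 1.0, "regex"),
    Detection(8, 16, "PERSON", 0.6, "ner"),
    Detection(40, 50, "PERSON", 0.9, "ner"),
])
for best, sources in merged:
    print(best.pii_type, best.start, best.end, sources)
```

Note that keeping only the best-scoring span can leave part of a wider overlapping span uncovered; a stricter variant would redact the full extent of each cluster.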

Performance benchmarks:

On a typical enterprise prompt (200-500 tokens), the combined pipeline performs as follows:

Regex layer: < 1ms
spaCy NER (transformer): 15-50ms
Gemma 4 E4B (GPU): 800-1500ms
Total for a three-layer pipeline: ~1-2 seconds

If the LLM layer is reserved for Level 3+ data, the majority of requests (Level 1-2) complete in under 50ms. This overhead is negligible compared to the cloud AI inference time (typically 1-10 seconds for a response).
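The routing rule and its latency effect can be sketched with the upper-bound figures above. The layer names are illustrative, and the sum assumes the layers run one after another.

```python
# Upper-bound latencies in milliseconds, taken from the benchmark figures above.
LAYER_LATENCY_MS = {"regex": 1, "ner": 50, "llm": 1500}
LLM_MIN_LEVEL = 3  # the LLM layer runs only for Level 3+ data


def layers_for_level(level: int) -> list[str]:
    """Return the detection layers to run for a data classification level."""
    layers = ["regex", "ner"]
    if level >= LLM_MIN_LEVEL:
        layers.append("llm")
    return layers


def worst_case_latency_ms(level: int) -> int:
    """Sum the upper-bound latencies of the layers that run sequentially."""
    return sum(LAYER_LATENCY_MS[name] for name in layers_for_level(level))


for level in (1, 2, 3, 4):
    print(level, layers_for_level(level), worst_case_latency_ms(level), "ms")
```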


Module 4 — Final Assessment

A PII detection system achieves 98% precision and 92% recall. In a regulated healthcare environment, which metric needs the most improvement?

What is the primary advantage of ML-based NER (e.g., spaCy transformer model) over regex for PII detection?

Why should the local LLM detection layer (e.g., Gemma 4) be reserved for higher data classification levels rather than applied to every request?

When merging PII detections from multiple pipeline layers (regex, NER, LLM), what is the correct aggregation strategy?