The end-to-end architecture
The previous modules gave you the individual components: classification (Module 2), detection (Module 4), redaction (Module 5), the gateway pattern (Module 6), and local inference (Module 7). This module wires them together into a production pipeline.
The pipeline has six stages:
Ingestion → Classification → Detection → Redaction → Routing → Audit
Stage 1: Ingestion captures the AI request from whatever source it originates: a chat interface, an API call, an automated workflow, or a RAG system retrieving documents.
Stage 2: Classification determines the data sensitivity level using the framework from Module 2. This drives every downstream decision.
Stage 3: Detection identifies PII using the layered approach from Module 4, with detection depth proportional to the classification level.
Stage 4: Redaction transforms detected PII using the appropriate technique from Module 5 — simple redaction, typed redaction, pseudonymisation, or no redaction (for Level 1 data).
Stage 5: Routing directs the request to the appropriate AI endpoint — local inference, gateway-to-cloud, or direct cloud — based on the classification level and task complexity from Module 7.
Stage 6: Audit logs every decision made by the pipeline — what was detected, what was redacted, where the request was routed — without logging the actual PII.
Each stage must be independently configurable, testable, and monitorable. If detection accuracy degrades, you need to know. If routing latency increases, you need to know. If a new PII type emerges that your pipeline does not catch, you need to add a recogniser without redesigning the pipeline.
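As a rough illustration of how the six stages can stay independently testable, the sketch below models each stage as a plain function over a shared context object. Every name here (`PipelineContext`, `classify`, `detect`, `redact`, `route`) is hypothetical, and the classification rule, the single email regex, and the routing rule are placeholders rather than the logic from Modules 2 to 7.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

# Placeholder detector: one simple email pattern stands in for the full detection stack.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

@dataclass
class PipelineContext:
    """State handed from stage to stage for a single AI request."""
    text: str
    level: int = 0
    detections: List[Tuple[str, int, int]] = field(default_factory=list)
    route: str = ""
    audit: List[str] = field(default_factory=list)  # decisions only, never raw PII

def classify(ctx):
    # Assumed toy rule; a real classifier would apply the Module 2 framework.
    ctx.level = 4 if "patient" in ctx.text.lower() else 2
    ctx.audit.append(f"classified level={ctx.level}")
    return ctx

def detect(ctx):
    ctx.detections = [("EMAIL", m.start(), m.end()) for m in EMAIL_RE.finditer(ctx.text)]
    ctx.audit.append(f"detected count={len(ctx.detections)}")
    return ctx

def redact(ctx):
    # Replace from the end of the text so earlier offsets stay valid.
    for entity, start, end in sorted(ctx.detections, key=lambda d: d[1], reverse=True):
        ctx.text = ctx.text[:start] + f"<{entity}>" + ctx.text[end:]
    ctx.audit.append(f"redacted count={len(ctx.detections)}")
    return ctx

def route(ctx):
    # Assumed rule for illustration: the highest-sensitivity data stays local.
    ctx.route = "local" if ctx.level >= 4 else "gateway-to-cloud"
    ctx.audit.append(f"routed to={ctx.route}")
    return ctx

STAGES = [classify, detect, redact, route]

def run_pipeline(text, stages=STAGES):
    ctx = PipelineContext(text=text)  # ingestion: wrap the incoming request
    for stage in stages:
        ctx = stage(ctx)
    return ctx

result = run_pipeline("Patient follow-up: reply to jane.doe@example.com today.")
print(result.text, result.route, result.audit)
```

Because each stage has the same signature, a stage can be unit-tested on its own or swapped out without touching the others, and the audit entries record counts and decisions rather than the detected values.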
You are designing the pipeline for an organisation that processes both customer support tickets (medium sensitivity) and medical records (high sensitivity) through AI. What is the most important design principle?
The detection stack: Presidio + spaCy + custom rules + Gemma 4
For a production enterprise pipeline, here is the recommended detection stack and why each component is included.
Microsoft Presidio (orchestration layer)
Presidio serves as the detection orchestrator. It manages the registry of recognisers, runs them in sequence, merges results, and provides a consistent API for the rest of the pipeline. You do not need to build the orchestration yourself — Presidio handles recogniser management, result deduplication, and confidence scoring.
spaCy en_core_web_trf (ML NER layer)
Presidio uses spaCy as its default NER engine. The transformer model (en_core_web_trf) provides the best accuracy for entity types like PERSON, ORG, GPE, and DATE. For production, pin the spaCy model version to ensure consistent behaviour across deployments.
Custom Presidio recognisers (domain-specific patterns)
Every organisation has PII types that are not covered by default recognisers:
- Internal employee IDs (EMP-12345)
- Customer account numbers (ACC-XXXXXXXX)
- Case reference numbers (CASE-2025-XXXXX)
- Medical record numbers (MRN format varies by health system)
- Project codes that reveal confidential initiatives
Build custom Presidio recognisers for each. This is the most organisation-specific part of the detection stack, and it requires collaboration with your data governance team to enumerate all custom identifier formats.
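The sketch below shows the kind of pattern table a custom recogniser would encode, using only the standard library so it runs anywhere. The digit counts (five for employee IDs, eight for account numbers, five for the case suffix) and the 0.9 score are assumptions read off the example formats above, not confirmed specifications. In Presidio, each entry would typically become a `PatternRecognizer` added to the analyzer's registry.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Detection:
    entity_type: str
    start: int
    end: int
    score: float

# Hypothetical formats inferred from the examples; confirm the real ones
# with your data governance team before relying on them.
CUSTOM_PATTERNS = {
    "EMPLOYEE_ID": (re.compile(r"\bEMP-\d{5}\b"), 0.9),
    "ACCOUNT_NUMBER": (re.compile(r"\bACC-\d{8}\b"), 0.9),
    "CASE_REFERENCE": (re.compile(r"\bCASE-\d{4}-\d{5}\b"), 0.9),
}

def find_custom_identifiers(text):
    """Return every custom identifier found in text, ordered by position."""
    results = []
    for entity_type, (pattern, score) in CUSTOM_PATTERNS.items():
        for match in pattern.finditer(text):
            results.append(Detection(entity_type, match.start(), match.end(), score))
    return sorted(results, key=lambda d: d.start)

hits = find_custom_identifiers("EMP-12345 escalated CASE-2025-00042 for ACC-00981234.")
for hit in hits:
    print(hit)
```

Keeping the patterns in one table makes it straightforward to review them with the governance team and to add a new identifier format without changing the scanning code.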
Gemma 4 E4B (contextual LLM layer, Level 4 only)
The local LLM adds contextual PII detection for higher-sensitivity data. Implement it as a custom Presidio recogniser that calls the local model endpoint, so it integrates cleanly into the Presidio orchestration flow. The LLM recogniser runs after regex and NER, specifically looking for PII those layers missed.
Configuration per classification level:
| Level | Regex | NER (spaCy) | Custom recognisers | LLM (Gemma 4) |
|---|---|---|---|---|
| 1 (Public) | No detection | No detection | No detection | No detection |
| 2 (Internal) | Yes | No | Yes | No |
| 3 (Confidential) | Yes | Yes | Yes | No |
| 4 (Restricted) | Yes | Yes | Yes | Yes |
| 5 (Prohibited) | Block (no AI processing) | Block | Block | Block |
This configuration ensures that detection overhead scales with data sensitivity. Level 2 data gets fast pattern-only scanning (regex plus custom recognisers). Level 4 data gets the full stack: pattern matching, NER, and the LLM layer.
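One way to express the table in code is a lookup from classification level to the set of enabled layers, with Level 5 handled as a hard stop. The layer names and the `ProcessingBlocked` exception are hypothetical labels chosen for this sketch.

```python
# Enabled detection layers per classification level, mirroring the table above.
LAYERS_BY_LEVEL = {
    1: frozenset(),                                   # Public: no detection
    2: frozenset({"regex", "custom"}),
    3: frozenset({"regex", "ner", "custom"}),
    4: frozenset({"regex", "ner", "custom", "llm"}),
}

class ProcessingBlocked(Exception):
    """Raised when data must not be sent to any AI endpoint."""

def layers_for(level):
    """Return the detection layers to run for a classification level."""
    if level == 5:
        raise ProcessingBlocked("Level 5 (Prohibited) data: no AI processing")
    if level not in LAYERS_BY_LEVEL:
        raise ValueError(f"unknown classification level: {level}")
    return LAYERS_BY_LEVEL[level]

print(sorted(layers_for(2)))
print(sorted(layers_for(4)))
```

Raising an exception for Level 5, instead of returning an empty set, keeps "no detection needed" and "processing forbidden" from being confused downstream.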
Your organisation's internal project codes follow the format 'PROJ-[department]-[year]-[number]' (e.g., PROJ-RND-2025-0042). These codes reveal which department is working on what initiative. How should you add detection for these?
Detection thresholds: tuning the precision-recall tradeoff
Every PII detection produces a confidence score between 0 and 1. The threshold determines which detections are treated as PII (and redacted) and which are ignored. Setting this threshold correctly is the difference between a pipeline that is useful and one that is abandoned.
The threshold spectrum:
- Threshold 0.3: Very aggressive. Catches almost everything, including many false positives. Use for Level 4+ data where missing PII is a compliance violation and users accept higher friction.
- Threshold 0.5: Balanced. Good general-purpose threshold for most enterprise use cases. Catches clear PII, may miss subtle or uncertain detections.
- Threshold 0.7: Conservative. Catches only high-confidence PII. Fewer false positives, but higher risk of missing ambiguous PII. Use for Level 2 data where the consequences of a miss are lower.
- Threshold 0.9: Minimal intervention. Only catches near-certain PII. Low false positive rate, but misses a significant portion of PII. Not recommended for any data containing actual PII.
Per-entity-type thresholds:
One global threshold is too coarse. Different PII types have different detection reliability:
- Email addresses (regex): Confidence is binary: either the pattern matches or it does not. A match effectively scores 1.0, so the threshold can be set high without losing real matches.
- Phone numbers (regex): Moderate confidence — the pattern matches many non-phone-number digit sequences. Threshold 0.6-0.7 with contextual validation (is there a "phone," "call," or "contact" nearby?).
- Person names (NER): Highly variable confidence. "Dr. Sarah Chen" detects with confidence 0.95. "Jordan" detects with confidence 0.4 because it could be a name or a country. Threshold 0.5 with contextual review for 0.4-0.6 range.
- Dates (NER/regex): Dates tied to an individual (other than the year) are identifiers under HIPAA's Safe Harbor method but are generally not PII in other contexts. Use context-dependent thresholds: in healthcare documents, threshold 0.3. In general business documents, threshold 0.7.
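The contextual validation described for phone numbers can be sketched as a score boost when a cue word appears near the match. The base score (0.4), boost (0.35), window size, and the simplified US-style pattern are all assumptions for illustration. Presidio offers a comparable mechanism through recogniser context words.

```python
import re

# Simplified pattern: three groups of digits separated by dash, dot, or space.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
CONTEXT_WORDS = ("phone", "call", "contact")

def score_phone_candidates(text, base=0.4, boost=0.35, window=30):
    """Score each phone-like match, boosting it when a cue word is nearby."""
    lowered = text.lower()
    results = []
    for match in PHONE_RE.finditer(text):
        nearby = lowered[max(0, match.start() - window): match.end() + window]
        has_context = any(word in nearby for word in CONTEXT_WORDS)
        score = base + boost if has_context else base
        results.append((match.group(), round(score, 2)))
    return results

with_cue = score_phone_candidates("Please call 555-123-4567 after 5pm")
without_cue = score_phone_candidates("Part number 800-555-0199 is in stock")
print(with_cue, without_cue)
```

With a threshold in the 0.6 to 0.7 range, the first candidate is kept and the second is dropped, which is the behaviour the per-type threshold is meant to produce.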
Configuring thresholds per classification level:
```python
THRESHOLDS = {
    "Level_2": {
        "PERSON": 0.7,
        "EMAIL": 0.9,  # Regex: high confidence or false positive
        "PHONE": 0.7,
        "US_SSN": 0.5,  # Better safe: SSN exposure is severe
        "CREDIT_CARD": 0.5,
        "DEFAULT": 0.7,
    },
    "Level_3": {
        "PERSON": 0.5,
        "EMAIL": 0.9,
        "PHONE": 0.5,
        "US_SSN": 0.3,
        "CREDIT_CARD": 0.3,
        "DEFAULT": 0.5,
    },
    "Level_4": {
        "PERSON": 0.3,
        "EMAIL": 0.5,
        "PHONE": 0.3,
        "US_SSN": 0.2,
        "CREDIT_CARD": 0.2,
        "DATE_TIME": 0.3,  # HIPAA requires date detection
        "DEFAULT": 0.3,
    },
}
```

Lower thresholds at higher classification levels mean more aggressive detection, with more false positives but fewer missed PII instances. This is the correct tradeoff: the regulatory cost of a missed SSN at Level 4 far exceeds the productivity cost of a false positive.
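A minimal sketch of how such a table might be applied: each candidate detection is kept only if its score meets the threshold for its entity type, falling back to `DEFAULT` for types the table does not list. The abbreviated table and the `filter_detections` helper are illustrative, and detections are simplified to `(entity_type, score)` pairs.

```python
# Abbreviated copy of the threshold table, limited to what this example needs.
THRESHOLDS = {
    "Level_2": {"PERSON": 0.7, "US_SSN": 0.5, "DEFAULT": 0.7},
    "Level_4": {"PERSON": 0.3, "US_SSN": 0.2, "DEFAULT": 0.3},
}

def filter_detections(detections, level):
    """Keep detections whose score meets the per-type threshold for this level."""
    table = THRESHOLDS[level]
    kept = []
    for entity_type, score in detections:
        threshold = table.get(entity_type, table["DEFAULT"])
        if score >= threshold:
            kept.append((entity_type, score))
    return kept

candidates = [("PERSON", 0.45), ("US_SSN", 0.55), ("LOCATION", 0.35)]
level_2_kept = filter_detections(candidates, "Level_2")
level_4_kept = filter_detections(candidates, "Level_4")
print(level_2_kept)
print(level_4_kept)
```

The same three candidates yield one redaction at Level 2 and three at Level 4, which makes the effect of the level-specific thresholds easy to see in a test.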
Orchestration and red-team testing
Pipeline orchestration
For a production pipeline processing requests in real time, the orchestration is typically synchronous within the request path — the user waits for classification, detection, redaction, and routing to complete before getting a response. The audit stage runs asynchronously (fire-and-forget) to avoid adding latency.
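The split between a synchronous request path and asynchronous audit logging can be sketched with `asyncio`. The handler below returns as soon as routing is decided and leaves the audit write running as a background task. The stage logic is a placeholder, and `write_audit`, `handle_request`, and the in-memory `AUDIT_LOG` are hypothetical names.

```python
import asyncio

AUDIT_LOG = []
_background_tasks = set()  # hold references so pending tasks are not garbage-collected

async def write_audit(record):
    await asyncio.sleep(0.05)  # stands in for a slow log sink
    AUDIT_LOG.append(record)

async def handle_request(text):
    # Placeholder classification, redaction, and routing, done inline.
    level = 2
    redacted = text.replace("jane@example.com", "<EMAIL>")
    route = "gateway-to-cloud"
    # Fire-and-forget: schedule the audit write without awaiting it.
    task = asyncio.create_task(
        write_audit({"level": level, "redactions": 1, "route": route})
    )
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return redacted, route

async def main():
    response = await handle_request("Reply to jane@example.com")
    logged_at_response = len(AUDIT_LOG)          # audit write has not finished yet
    await asyncio.gather(*_background_tasks)     # drain pending writes before exit
    return response, logged_at_response, len(AUDIT_LOG)

response, logged_before, logged_after = asyncio.run(main())
print(response, logged_before, logged_after)
```

Draining the pending tasks on shutdown matters in practice: a fire-and-forget write that is dropped when the process exits leaves a gap in the audit trail.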
For batch processing (analysing a document corpus, processing a data export), orchestration tools provide scheduling, retry logic, and monitoring:
- Apache Airflow: The most mature orchestration platform. Good for complex DAGs (Directed Acyclic Graphs) where pipeline stages have dependencies. Overkill for simple linear pipelines.
- Prefect: Modern alternative to Airflow with a simpler API. Better for Python-native teams. Good observability out of the box.
- Simple queue-based: For many enterprises, a Redis or RabbitMQ queue with worker processes is sufficient. Each worker runs the full pipeline for one request. Scale by adding workers.
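The queue-and-workers pattern can be sketched with the standard library alone. Here an in-process `queue.Queue` stands in for Redis or RabbitMQ, threads stand in for worker processes, and `run_pipeline` is a placeholder for the full pipeline.

```python
import queue
import threading

def run_pipeline(text):
    # Placeholder for classification, detection, redaction, and routing.
    return text.replace("555-123-4567", "<PHONE>")

def worker(jobs, results):
    """Process jobs until a None sentinel arrives."""
    while True:
        item = jobs.get()
        if item is None:
            break
        doc_id, text = item
        results[doc_id] = run_pipeline(text)

def process_batch(documents, num_workers=3):
    jobs = queue.Queue()
    results = {}
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for item in documents.items():
        jobs.put(item)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker so each one stops
    for thread in threads:
        thread.join()
    return results

batch = {
    "doc-1": "Call 555-123-4567 tomorrow.",
    "doc-2": "Nothing sensitive here.",
}
processed = process_batch(batch)
print(processed)
```

Scaling is then a matter of raising the worker count, which is the property the queue-based option relies on.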
For the real-time gateway use case, avoid orchestration tools entirely: queue-based processing, even in-memory, adds latency that the request path does not need. Process the pipeline stages inline within the gateway request handler.
Testing your pipeline: red-team exercises
You cannot know if your pipeline works without testing it against data that contains PII. But you cannot use real PII for testing (that would defeat the purpose). The solution is synthetic PII datasets designed to stress-test your detection.
Building a red-team dataset:
- Basic coverage: Generate documents containing each PII type your pipeline is supposed to detect. Use Faker (Python library) to generate realistic names, addresses, phone numbers, SSNs, emails, and credit card numbers.
- Edge cases: Include the hard cases that break detection:
  - Names that are also common nouns ("Rose," "Grace," "Hunter," "Chase")
  - Phone numbers in unusual formats ("call me at five-five-five, one-two-three-four")
  - Partial SSNs ("last four digits are 4567")
  - Obfuscated emails ("john dot smith at acme dot com")
  - Multi-language names and identifiers
- Contextual PII: Include cases where the PII is in the context, not in the format:
  - "The patient in bed 3 on ward 7" (contextual PHI)
  - "My neighbour at 42 Oak Street" (third-party PII)
  - "The CEO of Acme Corp" (identifiable individual without naming them)
- False positive traps: Include text that looks like PII but is not:
  - "The model number is 123-45-6789" (looks like an SSN)
  - "Contact the Springfield office" (Springfield is a location, not PII)
  - "The John Deere tractor" (John Deere is a brand, not a person)
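A small labelled dataset along these lines might be generated as follows. To keep the example dependency-free it uses the standard `random` module with hand-written generators instead of Faker, so the values are less realistic than Faker's. The sentence template and helper names are hypothetical. Each case carries its expected spans so detections can be scored later, and trap cases carry none.

```python
import random

AMBIGUOUS_NAMES = ["Rose", "Grace", "Hunter", "Chase"]  # names that are also common nouns

TRAPS = [
    "The model number is 123-45-6789.",
    "Contact the Springfield office.",
    "The John Deere tractor needs servicing.",
]

def fake_phone(rng):
    return f"555-{rng.randint(100, 999)}-{rng.randint(1000, 9999)}"

def fake_ssn(rng):
    return f"{rng.randint(100, 899)}-{rng.randint(10, 99)}-{rng.randint(1000, 9999)}"

def make_case(rng):
    """Build one synthetic sentence with labelled (type, start, end) spans."""
    name = rng.choice(AMBIGUOUS_NAMES)
    phone = fake_phone(rng)
    ssn = fake_ssn(rng)
    text = f"{name} can be reached on {phone}; SSN on file is {ssn}."
    labels = [
        (entity_type, text.index(value), text.index(value) + len(value))
        for entity_type, value in (("PERSON", name), ("PHONE", phone), ("US_SSN", ssn))
    ]
    return {"text": text, "labels": labels}

def build_dataset(n_positive=5, seed=42):
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    cases = [make_case(rng) for _ in range(n_positive)]
    cases += [{"text": trap, "labels": []} for trap in TRAPS]
    return cases

dataset = build_dataset()
for case in dataset:
    print(case)
```

Seeding the generator means a regression in detection shows up as a changed score on identical inputs, not as noise from a new random sample.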
Scoring the red team:
Run your red-team dataset through the pipeline and calculate:
- Recall per PII type: what percentage of each PII type was detected?
- Precision per PII type: what percentage of detections were correct?
- Edge case coverage: did the pipeline catch obfuscated, contextual, and multi-language PII?
- False positive rate on traps: how many non-PII items were incorrectly flagged?
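Per-type recall and precision can be computed from sets of expected and predicted spans. This sketch assumes exact span matching, where a detection counts only if document, type, start, and end all agree. A real harness might accept partial overlaps instead. The function name and tuple layout are illustrative.

```python
from collections import Counter

def score_by_type(expected, predicted):
    """Compute precision and recall per entity type.

    Both arguments are sets of (doc_id, entity_type, start, end) tuples.
    A metric is None when it is undefined (no detections or no expected items).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for item in predicted:
        if item in expected:
            tp[item[1]] += 1
        else:
            fp[item[1]] += 1
    for item in expected - predicted:
        fn[item[1]] += 1

    report = {}
    for entity_type in set(tp) | set(fp) | set(fn):
        detected = tp[entity_type] + fp[entity_type]
        actual = tp[entity_type] + fn[entity_type]
        report[entity_type] = {
            "precision": tp[entity_type] / detected if detected else None,
            "recall": tp[entity_type] / actual if actual else None,
        }
    return report

expected = {
    (1, "EMAIL", 0, 10), (1, "PERSON", 20, 25),
    (2, "PERSON", 5, 9), (2, "PERSON", 30, 36),
}
predicted = {(1, "EMAIL", 0, 10), (1, "PERSON", 20, 25), (3, "US_SSN", 20, 31)}
report = score_by_type(expected, predicted)
print(report)
```

In this toy run the email type scores perfectly, person names have full precision but low recall, and the stray SSN detection on a trap document shows up as zero precision with undefined recall.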
Set minimum thresholds for production readiness:
- Overall recall: >95% (>99% for Level 4 data)
- Overall precision: >80%
- Edge case recall: >80%
- False positive rate on traps: under 20%
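These minimums can be turned into an automated gate, for example in CI. The sketch below takes a dictionary of already-computed metrics and returns the names of any checks that fail. The metric keys and the `readiness_failures` helper are hypothetical.

```python
def readiness_failures(metrics, level):
    """Return the names of readiness checks that the metrics fail.

    metrics holds fractions between 0 and 1 under the keys
    'recall', 'precision', 'edge_case_recall', and 'trap_fp_rate'.
    """
    min_recall = 0.99 if level >= 4 else 0.95  # stricter recall for Level 4 data
    checks = [
        ("recall", metrics["recall"] > min_recall),
        ("precision", metrics["precision"] > 0.80),
        ("edge_case_recall", metrics["edge_case_recall"] > 0.80),
        ("trap_fp_rate", metrics["trap_fp_rate"] < 0.20),
    ]
    return [name for name, passed in checks if not passed]

metrics = {
    "recall": 0.97,
    "precision": 0.85,
    "edge_case_recall": 0.82,
    "trap_fp_rate": 0.10,
}
print(readiness_failures(metrics, level=3))
print(readiness_failures(metrics, level=4))
```

Returning the failing check names, and not just a pass or fail flag, tells the team which part of the detection stack to work on.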
Your red-team test reveals that the pipeline catches 99% of emails and phone numbers but only 78% of person names. What is the most effective fix?
Performance benchmarks and handling false positives
Throughput benchmarks
Here are realistic throughput numbers for the pipeline at different configurations, measured on a single server with one NVIDIA T4 GPU:
| Configuration | Requests/sec | P95 latency | Use case |
|---|---|---|---|
| Regex only | ~500 | under 5ms | Level 2 data, high-throughput scanning |
| Regex + spaCy (sm) | ~200 | ~15ms | Level 2-3 data, general enterprise |
| Regex + spaCy (trf) | ~50 | ~50ms | Level 3 data, accuracy-prioritised |
| Regex + spaCy (trf) + Gemma 4 E4B | ~5-10 | ~1.5s | Level 4 data, maximum detection |
For most enterprise deployments, the bottleneck is not the detection pipeline but the cloud AI inference (1-10 seconds). The pipeline overhead is a small fraction of total request time.
To scale beyond these numbers:
- Horizontal scaling: deploy multiple pipeline instances behind a load balancer
- Batch processing: for document corpora, process multiple documents in parallel
- Model optimisation: use quantised models (INT8 quantisation for spaCy and Gemma 4 can reduce latency by roughly 30-50%, usually with minimal accuracy loss; measure both on your own workload)
- GPU batching: vLLM and TGI support batching multiple requests through the LLM simultaneously
Handling false positives without creating security holes
False positives — text incorrectly flagged as PII — are the primary reason users abandon privacy tools and revert to shadow AI. You must handle them, but you must do so without creating a bypass mechanism that also allows real PII through.
Approach 1: User correction with audit trail. When the pipeline flags something as PII, the user can mark it as a false positive. The correction is logged in the audit trail, and the text is released without redaction. Key safeguards:
- The user must explicitly review and approve each false positive
- The correction is logged with the user's identity and timestamp
- Corrections are reviewed periodically to identify detection issues
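A correction record for Approach 1 might look like the sketch below. Storing a SHA-256 hash of the flagged text in place of the text itself is an assumption made here so that the audit trail never holds potential PII. The field names and the in-memory `CORRECTIONS` list are hypothetical.

```python
import hashlib
from datetime import datetime, timezone

CORRECTIONS = []  # stands in for a durable audit store

def record_false_positive(user_id, entity_type, flagged_text, confirmed):
    """Log a user's false-positive correction without storing the flagged text."""
    if not confirmed:
        # The user must explicitly approve each correction.
        raise ValueError("explicit user confirmation is required")
    entry = {
        "user_id": user_id,
        "entity_type": entity_type,
        "text_sha256": hashlib.sha256(flagged_text.encode("utf-8")).hexdigest(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    CORRECTIONS.append(entry)
    return entry

entry = record_false_positive("u-1001", "PERSON", "John Deere", confirmed=True)
print(entry)
```

Hashing still lets reviewers count how often the same string is corrected, which supports the periodic review, although short or guessable strings can be recovered from an unsalted hash.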
Approach 2: Allowlists for known false positives. Maintain an allowlist of terms that frequently trigger false positives (company names that look like person names, product codes that look like SSNs). The allowlist is managed by the security team, not individual users, and is version-controlled.
Approach 3: Context-aware suppression. If the detection is triggered by a known context (e.g., "John Deere" always triggers PERSON detection), add a suppression rule: "if PERSON detection overlaps with a known company name, suppress the detection." This is more surgical than threshold adjustment and does not reduce detection sensitivity globally.
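The suppression rule quoted above can be sketched as an overlap check between PERSON detections and occurrences of known company names. The list contents and function name are illustrative, and detections are simplified to `(entity_type, start, end)` tuples.

```python
# Hypothetical list, maintained and version-controlled by the security team.
KNOWN_COMPANY_NAMES = ["John Deere"]

def suppress_known_companies(text, detections):
    """Split detections into (kept, suppressed) using known company names."""
    company_spans = []
    for name in KNOWN_COMPANY_NAMES:
        start = text.find(name)
        while start != -1:
            company_spans.append((start, start + len(name)))
            start = text.find(name, start + 1)

    kept, suppressed = [], []
    for detection in detections:
        entity_type, start, end = detection
        overlaps = any(start < c_end and c_start < end for c_start, c_end in company_spans)
        if entity_type == "PERSON" and overlaps:
            suppressed.append(detection)  # returned so the decision can be audited
        else:
            kept.append(detection)
    return kept, suppressed

text = "John Smith bought the John Deere tractor."
detections = [("PERSON", 0, 10), ("PERSON", 22, 32)]
kept, suppressed = suppress_known_companies(text, detections)
print(kept, suppressed)
```

Only the detection that overlaps the brand name is dropped, so "John Smith" in the same sentence is still redacted and global sensitivity is unchanged.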
What you must NOT do:
- Allow users to disable the pipeline entirely ("turn off PII detection for this session")
- Create a "trusted user" bypass that lets certain users skip detection
- Suppress all false positives without review — some may be true positives the user does not recognise as PII
Module 8 — Final Assessment
In a production privacy pipeline, what determines whether the LLM detection layer (Gemma 4) is activated for a given request?
Your red-team test shows 99% recall on emails, 97% on phone numbers, and 78% on person names. Which improvement would have the greatest impact on overall pipeline effectiveness?
A user reports that the pipeline flagged 'John Deere tractor' as containing a person name. What is the correct fix?
For real-time AI request processing through the gateway, which pipeline orchestration approach is most appropriate?