Local Inference as a Privacy Architecture

The strongest guarantee you can make

Every privacy control in this course — detection, redaction, pseudonymisation, the gateway pattern — exists to mitigate the risk of data leaving your environment. Local inference eliminates that risk entirely. If the data never leaves, there is no cross-border transfer, no third-party processing, no retention by a provider, no training data contamination, and no metadata exposure. The privacy guarantee is architectural, not contractual.

This is not theoretical. As of early 2026, local models have reached a capability threshold where they can handle a meaningful share of enterprise AI workloads. Gemma 4 E4B running on a single NVIDIA L4 GPU can perform text classification, entity extraction, summarisation, Q&A over documents, and structured data extraction at quality levels that would have required cloud models two years ago. Larger local deployments — Llama 3.3 70B on a multi-GPU server, or Mistral models on an on-premises cluster — can handle more complex reasoning tasks.

The question is no longer "can we run AI locally?" It is "which workloads should run locally, which should go through the gateway, and which genuinely need cloud?"

What percentage of your organisation's AI workload do you estimate could be handled by a local model without meaningful quality loss?

What local models can and cannot do

Tasks local models handle well (Gemma 4 E2B/E4B scale):

Text classification. Categorising support tickets, routing emails, tagging documents by topic, sentiment analysis. Classification tasks typically involve choosing from a predefined set of labels, and small models achieve near-parity with frontier models when the label set is well-defined. A fine-tuned Gemma 4 E4B can match GPT-4-class performance on domain-specific classification tasks.

Named entity extraction. Pulling structured data from unstructured text — names, dates, monetary amounts, product references, clause types in contracts. This is the same capability used in PII detection (Module 4), applied to business-relevant entities.

Summarisation (short to medium documents). Summarising a 2-3 page document, a support ticket thread, or a meeting transcript. For documents under 4,000 tokens, E4B produces summaries that are coherent and accurate. For longer documents, you need chunk-and-summarise strategies or larger local models.
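
A chunk-and-summarise strategy can be sketched as follows. This is a minimal illustration, not a production pipeline: the function names are hypothetical, words are used as a rough stand-in for tokens, and `summarise` is whatever callable wraps your local model.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 3000, overlap: int = 100) -> List[str]:
    """Split text into overlapping word windows (words approximate tokens here)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    chunks: List[str] = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def chunk_and_summarise(
    text: str,
    summarise: Callable[[str], str],  # assumption: wraps a call to the local model
    max_words: int = 3000,
    overlap: int = 100,
) -> str:
    """Summarise each chunk, then summarise the concatenated partial summaries."""
    chunks = chunk_words(text, max_words, overlap)
    if len(chunks) <= 1:
        return summarise(text)  # short document: a single pass is enough
    partials = [summarise(chunk) for chunk in chunks]
    return summarise("\n".join(partials))
```

For very long documents the combined partial summaries can themselves exceed the context window, in which case the second step would need to be applied recursively.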

Q&A over provided context. Given a document and a question, extract or synthesise the answer. This is the core capability behind local RAG (Retrieval-Augmented Generation) systems. E4B handles single-document Q&A effectively; multi-document reasoning requires larger models.

Structured output generation. Converting unstructured text to JSON, filling forms from narratives, extracting data into tables. Small models are particularly strong here because structured output follows patterns that smaller parameter counts handle well.

Template-based generation. Drafting emails from bullet points, generating standard reports, creating routine documentation. When the output follows a predictable structure, local models perform well.

Tasks that still benefit from cloud models:

Complex multi-step reasoning. Problems requiring many steps of logical deduction, mathematical proof, or causal analysis. Frontier models (Claude Opus 4/Sonnet 4, GPT-4.1) maintain a significant advantage on reasoning-heavy tasks.

Large context windows. Processing a 100-page document, an entire codebase, or a long conversation history. Cloud models with 128K-1M token contexts have no local equivalent at comparable quality (as of early 2026).

Creative and nuanced writing. Drafting marketing copy, generating creative content, writing with a specific voice or tone. Larger models produce more natural, varied, and contextually appropriate text.

Code generation. While local models can handle simple code completion and template generation, complex code generation (designing architectures, refactoring large codebases, implementing novel algorithms) benefits from frontier model capabilities.

Multi-modal tasks. Processing images, audio, or video alongside text. Multi-modal local models exist but lag significantly behind cloud offerings in capability.

Your legal team wants AI to review 50-page contracts and flag non-standard clauses. Should this run locally or through the gateway to a cloud model?

Local, gateway, and cloud: the three-tier architecture

The practical architecture is not "all local" or "all cloud." It is a tiered system where the routing decision is based on two dimensions: data sensitivity and task complexity.

                    Task Complexity
                Low          Medium          High
           ┌────────────┬──────────────┬──────────────┐
    High   │  Local     │  Local       │  Local       │
Sensitivity│  (E4B)     │  (70B)       │  (70B+)      │
           ├────────────┼──────────────┼──────────────┤
    Medium │  Local     │  Gateway +   │  Gateway +   │
           │  (E4B)     │  Cloud       │  Cloud       │
           ├────────────┼──────────────┼──────────────┤
    Low    │  Local     │  Cloud       │  Cloud       │
           │  (E4B)     │  Direct      │  Direct      │
           └────────────┴──────────────┴──────────────┘

Tier 1: Local inference (data stays in your environment)

All high-sensitivity data (Level 4-5 classification)
Low-complexity tasks at any sensitivity level (classification, extraction, simple summarisation)
Model selection: E4B for simple tasks, 70B for complex tasks, fine-tuned models for domain-specific tasks

Tier 2: Gateway + Cloud (data is sanitised before leaving)

Medium-sensitivity data (Level 3) with medium-to-high complexity tasks
Tasks where pseudonymisation preserves enough context for useful AI processing
The gateway detects, redacts/pseudonymises, forwards, and re-hydrates
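
The pseudonymise and re-hydrate steps can be illustrated with a small sketch. It assumes a single entity type (email addresses) found by a simple regular expression; a real gateway would use the detection pipeline from the earlier modules. The placeholder format `<EMAIL_n>` and the function names are illustrative choices, not part of any specific product.

```python
import re
from typing import Dict, Tuple

# Deliberately simple pattern for illustration; real detection needs far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymise(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace each distinct email with a stable placeholder; return text and mapping."""
    mapping: Dict[str, str] = {}  # placeholder -> original value (stays inside the gateway)
    seen: Dict[str, str] = {}     # original value -> placeholder

    def replace(match: re.Match) -> str:
        original = match.group(0)
        if original not in seen:
            placeholder = f"<EMAIL_{len(seen) + 1}>"
            seen[original] = placeholder
            mapping[placeholder] = original
        return seen[original]

    return EMAIL_RE.sub(replace, text), mapping


def rehydrate(text: str, mapping: Dict[str, str]) -> str:
    """Restore the original values in the provider's response."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

Only the sanitised text is forwarded to the provider. The mapping never leaves the gateway, which is what makes re-hydration possible without exposing the original values.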

Tier 3: Direct cloud (data goes directly to cloud AI)

Low-sensitivity data (Level 1-2) where the overhead of the gateway is not justified
Public data analysis, internal documentation generation, general knowledge queries
Still requires an enterprise AI agreement (not consumer products)

The routing decision:

The routing decision should be automated, not left to users. The gateway itself can serve as the routing layer:

1. The user sends a request to the gateway.
2. The gateway classifies the data (using the framework from Module 2) and identifies the task type.
3. Based on classification level and task type:
- Level 4-5: Route to local inference endpoint
- Level 3: Detect and redact or pseudonymise PII, then forward to cloud AI
- Level 1-2: Forward to cloud AI (optionally through the gateway for audit logging)
- Low-complexity tasks at any level: Route to local inference endpoint, as in the matrix above
4. The response is delivered to the user, with re-hydration if pseudonymisation was used.
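
As a rough sketch, the routing rule from the sensitivity-by-complexity matrix can be written as a single function. The names `Tier` and `route` are hypothetical, and the sketch assumes the gateway has already produced a classification level (1-5) and a complexity label for the request.

```python
from enum import Enum


class Tier(Enum):
    LOCAL = "local"
    GATEWAY_CLOUD = "gateway+cloud"
    DIRECT_CLOUD = "direct-cloud"


def route(classification_level: int, complexity: str) -> Tier:
    """Map (sensitivity level, task complexity) to a routing tier, following the matrix."""
    if not 1 <= classification_level <= 5:
        raise ValueError("classification_level must be between 1 and 5")
    if complexity not in {"low", "medium", "high"}:
        raise ValueError("complexity must be 'low', 'medium' or 'high'")

    # High-sensitivity data and all low-complexity tasks stay local.
    if classification_level >= 4 or complexity == "low":
        return Tier.LOCAL
    # Medium sensitivity with a harder task: sanitise, then use cloud.
    if classification_level == 3:
        return Tier.GATEWAY_CLOUD
    # Low sensitivity with a harder task: direct cloud.
    return Tier.DIRECT_CLOUD
```

Keeping the rule in one pure function makes it easy to unit-test and to log the decision alongside each request in the audit trail.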

The user experience should be identical regardless of routing tier. The user types a prompt and gets a response. Whether that response came from a local model, a cloud model with gateway sanitisation, or a cloud model directly should be invisible to the user, though recorded in the audit log.

Gemma 4 on-device: practical capabilities and limitations

Google's Gemma 4 family, released in 2025, provides some of the most capable open-weight models at the sub-10B parameter scale. For enterprise privacy architectures, the E2B (2 billion parameter) and E4B (4 billion parameter) models are the workhorses.

E2B (2 billion parameters):

Runs on: CPU (modern laptop, 8GB+ RAM), mobile devices, edge hardware
Inference speed: ~30-50 tokens/second on a modern CPU, ~100+ tokens/second on a modest GPU
Strengths: Classification, simple extraction, short summarisation, PII detection
Context window: 8K tokens
Use case: Browser-based AI tools, mobile applications, low-latency classification endpoints

E4B (4 billion parameters):

Runs on: GPU (NVIDIA T4 or better, ~8GB VRAM), high-end CPU with quantisation
Inference speed: ~50-80 tokens/second on T4 GPU, ~20-30 tokens/second on CPU
Strengths: All E2B tasks plus better summarisation, multi-step extraction, Q&A, structured output
Context window: 8K-32K tokens (depending on configuration)
Use case: PII detection layer, document processing, structured data extraction, internal chatbots for routine queries

Deployment options:

Single-server deployment: A single NVIDIA T4 or L4 GPU server running Gemma 4 E4B with vLLM or Ollama can handle 10-50 concurrent requests with acceptable latency. This is sufficient for a department of 50-200 users with moderate AI usage.

Kubernetes deployment: For larger organisations, deploy Gemma 4 as a containerised service with auto-scaling. Tools like vLLM, TGI (Text Generation Inference from Hugging Face), and Ollama all support containerised deployment with GPU scheduling.

Desktop deployment: Ollama enables running Gemma 4 E4B on a developer's workstation with a modest GPU. This is useful for developer tools (AI-assisted coding, documentation generation) where the data should not leave the developer's machine.
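
Calling a model served by Ollama on the same machine might look like the sketch below. It uses Ollama's local REST endpoint (`/api/generate` on port 11434 by default) and only the Python standard library. The model tag is a placeholder assumption: substitute whatever tag `ollama list` reports on your installation.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address
MODEL_TAG = "your-local-model-tag"  # hypothetical placeholder; check `ollama list`


def build_request(prompt: str, model: str = MODEL_TAG,
                  url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request addressed to the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_locally(prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

Because the URL points at localhost, the prompt never crosses the machine boundary. It is worth confirming that in your own network configuration rather than taking it on trust.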

What E4B cannot do well:

Process documents longer than its context window without chunking strategies
Match frontier model quality on complex reasoning, creative writing, or nuanced analysis
Handle multi-modal inputs (images, audio) — the base Gemma 4 text models are text-only
Replace specialised fine-tuned models for domain-specific tasks without its own fine-tuning

The honest assessment: E4B is a capable local model for routine tasks, not a local replacement for Claude Opus 4 or GPT-4.1. It is the "good enough for 40-60% of enterprise AI tasks" model, and that 40-60% is where the majority of your data privacy risk lives — routine, repetitive tasks involving real customer data.

Your organisation processes 10,000 customer support tickets per day with AI. 70% are routine (classify and route), 20% need summarisation, and 10% need complex analysis and response drafting. How would you architect the deployment?

Measuring your privacy posture

Once you have a tiered deployment, you need to measure whether it is actually delivering the privacy protection you intend. Here are the metrics that matter.

Local processing ratio: The percentage of AI requests handled entirely by local inference. This is the single most important privacy metric. If 80% of your requests stay local, only the remaining 20% are exposed to a third-party provider at all. Treat it as a proxy rather than a precise measure of risk: it counts requests, not how sensitive each request is.

Formula: Local processing ratio (%) = (Local requests / Total requests) × 100

Target: varies by industry. Healthcare and defence: 90%+. Financial services: 70-80%. General enterprise: 50-70%.

PII interception rate: Of the requests that go through the gateway, what percentage had PII detected and redacted? If this number is very low, either your detection is not sensitive enough or your users are already self-censoring (unlikely).

Re-identification risk score: For pseudonymised data that reaches the cloud, what is the estimated re-identification risk? This can be assessed using k-anonymity analysis on the pseudonymised prompt set. If every pseudonymised prompt is unique (k=1), the re-identification risk is higher than if multiple prompts share the same pseudonymisation pattern.
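
One crude way to approximate this analysis is to treat the set of placeholder types in each pseudonymised prompt as its pattern and take k as the size of the smallest group sharing a pattern. The sketch below assumes placeholders look like `<TYPE_n>` (a hypothetical format), and the function names are illustrative. A real assessment would also consider the surrounding free text, which this ignores.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Assumed placeholder format, e.g. <NAME_1>, <EMAIL_2>
PLACEHOLDER_RE = re.compile(r"<([A-Z]+)_\d+>")


def pattern_of(prompt: str) -> Tuple[str, ...]:
    """Return the sorted placeholder types in a prompt, used as its pattern."""
    return tuple(sorted(PLACEHOLDER_RE.findall(prompt)))


def k_anonymity(prompts: Iterable[str]) -> int:
    """Size of the smallest group of prompts sharing a pattern (0 if no prompts)."""
    counts = Counter(pattern_of(p) for p in prompts)
    return min(counts.values()) if counts else 0
```

A result of 1 means at least one prompt has a pattern that no other prompt shares, which is the higher-risk case described above.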

Audit coverage: What percentage of AI interactions are captured in the audit log with complete metadata (classification level, detection results, routing decision, provider)? Target: 100%.
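
The ratio-style metrics above (local processing ratio, PII interception rate, audit coverage) can all be computed from the audit log. The sketch below assumes a hypothetical `AuditRecord` shape; your log schema will differ. It measures completeness only among logged records, so interactions that bypass the log entirely have to be estimated separately.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AuditRecord:
    tier: str                            # "local", "gateway", or "direct"
    pii_detected: Optional[bool]         # None if detection results were not logged
    classification_level: Optional[int]  # None if classification was not logged
    provider: Optional[str]              # None if the provider was not logged


def pct(part: int, whole: int) -> float:
    """Percentage helper that returns 0.0 for an empty denominator."""
    return 100.0 * part / whole if whole else 0.0


def local_processing_ratio(records: List[AuditRecord]) -> float:
    """Share of all requests handled entirely by local inference."""
    return pct(sum(r.tier == "local" for r in records), len(records))


def pii_interception_rate(records: List[AuditRecord]) -> float:
    """Share of gateway requests in which PII was detected."""
    gateway = [r for r in records if r.tier == "gateway"]
    return pct(sum(r.pii_detected is True for r in gateway), len(gateway))


def audit_coverage(records: List[AuditRecord]) -> float:
    """Share of logged requests with complete metadata."""
    complete = sum(
        r.pii_detected is not None
        and r.classification_level is not None
        and r.provider is not None
        for r in records
    )
    return pct(complete, len(records))
```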

Mean detection latency: How much time does the detection and redaction pipeline add to each request? Track this by classification level (Level 2 should be under 50ms; Level 4 should be under 2 seconds).

False positive rate in production: Track how often users report that the gateway blocked or redacted content that should have been allowed through. High false positive rates drive shadow AI adoption.

The cost comparison: is local actually more expensive?

A common objection to local inference is cost. Let us run the numbers for a mid-sized enterprise deployment.

Cloud-only costs (1,000 requests/day at ~500 tokens input, ~500 tokens output):

Claude Sonnet 4 API: ~$3/million input tokens, ~$15/million output tokens
Daily cost: (500K input tokens * $3/M) + (500K output tokens * $15/M) = $1.50 + $7.50 = $9/day
Monthly: ~$270/month
Plus: Gateway infrastructure ($200-500/month for detection pipeline compute)

Local inference costs (same workload):

NVIDIA L4 GPU server: ~$0.50-1.00/hour on cloud, or ~$3,000-5,000 one-time for on-premises
Monthly cloud GPU cost: ~$360-720/month
Monthly on-premises (amortised over 3 years): ~$85-140/month
Operational overhead: system administration, model updates, monitoring

The comparison: For low-volume workloads (under 1,000 requests/day), cloud AI is cheaper than maintaining local GPU infrastructure. For high-volume workloads (over 5,000 requests/day) or when you are already running GPU infrastructure for other purposes, local inference becomes cost-competitive or cheaper.
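
The arithmetic behind this comparison can be reproduced with a short sketch. The defaults are the figures quoted above (500 input and 500 output tokens per request, $3 and $15 per million tokens, a 30-day month). The function names are illustrative, and the calculation deliberately ignores gateway infrastructure and operational overhead, so the break-even it reports is a lower bound on the volume at which local inference pays off.

```python
def monthly_cloud_cost(
    requests_per_day: float,
    input_tokens: int = 500,
    output_tokens: int = 500,
    input_price_per_m: float = 3.0,    # USD per million input tokens
    output_price_per_m: float = 15.0,  # USD per million output tokens
    days: int = 30,
) -> float:
    """Monthly API spend for a given daily request volume."""
    per_request = (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    return requests_per_day * per_request * days


def break_even_requests_per_day(monthly_local_cost: float, **pricing) -> float:
    """Daily volume at which API spend equals a fixed monthly local cost."""
    return monthly_local_cost / monthly_cloud_cost(1, **pricing)
```

With these defaults, 1,000 requests per day costs about $270 per month, and a rented GPU at $360 per month breaks even at roughly 1,333 requests per day before any operational overhead is counted.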

But cost is not the right frame for this decision. The decision driver is privacy posture. If your regulatory requirements mandate local processing (HIPAA without a BAA, ITAR, classified data), the cost comparison is irrelevant — local is the only option. If your regulatory requirements allow cloud processing with appropriate controls (GDPR with DPA and SCCs), the gateway pattern with selective local inference for the most sensitive data is typically the most cost-effective approach.

Your CFO asks: 'Why are we spending money on GPU infrastructure when the cloud AI API is cheaper per request?' What is the most compelling response?

Module 7 — Final Assessment

What is the primary privacy advantage of local inference over the gateway pattern?

In the three-tier architecture (local, gateway+cloud, direct cloud), which factor primarily determines whether a request should be routed to local inference or through the gateway to a cloud model?

Gemma 4 E4B can realistically handle which of the following enterprise tasks at acceptable quality?

A CFO questions the cost of local GPU infrastructure. What is the most accurate framing of the cost comparison?