The pragmatic middle ground
Pure edge and pure cloud are both valid architectures. But for most enterprises, the right answer is somewhere in between.
The hybrid pattern recognises a practical reality: edge models (2-27B parameters) handle 70-85% of enterprise queries at adequate quality. The remaining 15-30% -- queries requiring complex multi-step reasoning, very long context windows, or frontier-level capability -- benefit from a larger cloud model.
The question is not edge vs cloud. The question is: how do you route each query to the right target while maintaining your privacy requirements?
Three hybrid patterns dominate enterprise deployments:
- Complexity-based routing: Simple queries go to edge, complex queries go to cloud
- PII gateway: Edge model strips sensitive data before sending to cloud
- Escalation with sanitisation: Edge handles everything, but can escalate anonymised queries to cloud for better answers
Each pattern has different privacy, cost, and quality characteristics. The right choice depends on your regulatory constraints and quality requirements.
Your organisation can tolerate sending anonymised queries to a cloud API, but customer PII must never leave your infrastructure. Which hybrid pattern fits?
Building a query complexity classifier
The simplest hybrid pattern: classify incoming queries by complexity and route them to the appropriate model.
Why this works: Most enterprise queries are straightforward. "Summarise this email." "What is the policy for expense reimbursement?" "Extract the payment terms from this contract." A 4B or 27B model handles these as well as a frontier model. The hard queries -- "Analyse the strategic implications of these three competing proposals in the context of our five-year roadmap" -- are rare, and they benefit disproportionately from a larger model.
The classifier approach:
You can build a complexity classifier using a small local model, heuristic rules, or a combination.
Heuristic-based routing (simple, fast, no model needed):
def route_query(query: str, context_length: int) -> str:
    """Route to 'edge' or 'cloud' based on heuristics."""
    lowered = query.lower()
    # Long context likely needs more capable model
    if context_length > 4000:
        return "cloud"
    # Multi-part questions with conjunctions
    complexity_markers = [
        "compare", "contrast", "analyse the implications",
        "evaluate the tradeoffs", "synthesise",
        "what are the second-order effects",
        "taking into account", "in the context of"
    ]
    if any(marker in lowered for marker in complexity_markers):
        return "cloud"
    # Queries requesting structured output or specific formats
    format_markers = [
        "create a table", "write a report",
        "generate a detailed plan", "produce a comprehensive"
    ]
    if any(marker in lowered for marker in format_markers):
        return "cloud"
    # Default: handle locally
    return "edge"

Model-based routing (more accurate, adds ~50ms latency):
Use your edge model itself to classify query complexity:
async def classify_complexity(query: str, edge_model) -> str:
    """Ask the edge model to label the query, then map the label to a route."""
    prompt = (
        "Classify this query as SIMPLE or COMPLEX.\n"
        "SIMPLE: factual lookup, summarisation, extraction, classification, short Q&A.\n"
        "COMPLEX: multi-step reasoning, comparative analysis, long-form generation, strategic analysis.\n"
        f"Query: {query}\n"
        "Classification:"
    )
    classification = await edge_model.generate(
        prompt,
        max_tokens=5,
        temperature=0
    )
    # Compare case-insensitively so "Complex" or "complex" still routes to cloud
    return "cloud" if "COMPLEX" in classification.upper() else "edge"

The routing architecture:
User Query
│
├── Complexity Classifier (local, under 50ms)
│   ├── SIMPLE (70-85% of queries)
│   │   └── Edge Model (vLLM on-prem or local device)
│   │       └── Response to user
│   └── COMPLEX (15-30% of queries)
│       └── [Optional: PII sanitisation]
│           └── Cloud API (GPT-4, Claude, etc.)
│               └── Response to user

Cost impact: If 80% of queries are routed to edge and 20% to cloud, and your edge marginal cost is effectively $0/query while cloud costs $0.02/query, your blended cost is $0.004/query instead of $0.02/query -- an 80% reduction. At 100,000 queries/day, that is $400/day instead of $2,000/day (roughly $12,000/month instead of $60,000/month) for the cloud component, plus the fixed edge infrastructure cost.
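The blended-cost arithmetic can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, the inputs are the figures quoted above, and fixed edge infrastructure cost is deliberately left out because it does not scale per query.

```python
def blended_cost_per_query(edge_share: float, edge_cost: float, cloud_cost: float) -> float:
    """Average variable cost per query when edge_share of traffic stays on edge."""
    if not 0.0 <= edge_share <= 1.0:
        raise ValueError("edge_share must be between 0 and 1")
    return edge_share * edge_cost + (1.0 - edge_share) * cloud_cost


def daily_cost(queries_per_day: int, cost_per_query: float) -> float:
    """Variable cost per day; fixed edge infrastructure is not included."""
    return queries_per_day * cost_per_query


blended = blended_cost_per_query(edge_share=0.80, edge_cost=0.0, cloud_cost=0.02)
hybrid_daily = daily_cost(100_000, blended)
cloud_only_daily = daily_cost(100_000, 0.02)
saving = 1.0 - hybrid_daily / cloud_only_daily

print(f"blended: ${blended:.4f}/query")
print(f"hybrid: ${hybrid_daily:,.0f}/day vs cloud-only: ${cloud_only_daily:,.0f}/day")
print(f"saving: {saving:.0%}")
```

Changing `edge_share` shows how sensitive the saving is to classifier behaviour: every percentage point of traffic moved from cloud to edge removes one percentage point of the cloud bill.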
Your complexity classifier routes 75% of queries to edge. But user satisfaction surveys show that edge responses score 3.8/5 on average while cloud responses score 4.3/5. What should you adjust?
PII detection and sanitisation as a routing layer
The PII gateway is the most important pattern for enterprises that want hybrid architectures but cannot send sensitive data to cloud APIs.
The concept: before any query reaches a cloud API, a local processing layer detects and redacts personally identifiable information. The cloud model receives only the sanitised query. Optionally, PII is re-inserted into the cloud response before showing it to the user.
Detection approaches:
1. Rule-based NER (Named Entity Recognition):
- Regex patterns for structured PII: email addresses, phone numbers, social security numbers, credit card numbers, dates of birth
- Dictionary matching for known entities: employee names from HR systems, customer names from CRM
- Fast (~1ms per query), deterministic, and very reliable on well-formed structured formats (it can still miss unusual formatting, such as a phone number written out in words)
- Misses unstructured PII: "the CEO told me last Tuesday" contains temporal information that could identify a specific conversation
2. Model-based NER:
- Use a small NER model (e.g., GLiNER, Presidio with spaCy, or a fine-tuned BERT) to detect PII entities
- Catches unstructured PII that rules miss: names in unusual contexts, addresses described in prose
- 10-50ms per query, with occasional false positives
- Run locally -- the NER model itself is an edge model
3. LLM-based detection:
- Use your edge LLM to identify PII before forwarding to cloud
- Most flexible: can handle ambiguous cases and contextual PII
- 200-500ms per query (a full LLM inference pass)
- Best quality but highest latency
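The rule-based layer (approach 1) can be sketched as a small class with one compiled pattern per PII type. The patterns below are illustrative assumptions, not production-grade rules, and the `PIIMatch` type is hypothetical; the `find_all` method and the `entity_type` and `text` fields are chosen to match the interface the gateway code in this section expects from its regex detector.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PIIMatch:
    entity_type: str
    text: str


class RegexPIIDetector:
    """Rule-based layer: one compiled pattern per structured PII type."""

    # Illustrative patterns only; real deployments need locale-specific rules.
    PATTERNS = {
        "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
        "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    }

    def find_all(self, text: str) -> list[PIIMatch]:
        """Return every match, grouped by pattern in PATTERNS order."""
        matches = []
        for entity_type, pattern in self.PATTERNS.items():
            for m in pattern.finditer(text):
                matches.append(PIIMatch(entity_type, m.group()))
        return matches


detector = RegexPIIDetector()
found = detector.find_all("Contact jane.doe@example.com, SSN 123-45-6789.")
print(found)
```

A sentence with no structured identifiers, such as one that only mentions a first name and a weekday, produces an empty list here, which is exactly the gap the model-based layers are meant to cover.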
The recommended approach: layered.
from collections import defaultdict


class PIIGateway:
    # RegexPIIDetector and NERPIIDetector stand in for your own detectors;
    # each is assumed to yield objects exposing an entity type and the matched text.
    def __init__(self):
        self.regex_detector = RegexPIIDetector()  # Layer 1: fast, structured
        self.ner_detector = NERPIIDetector()      # Layer 2: unstructured

    def sanitise(self, text: str) -> tuple[str, dict]:
        """Remove PII, return sanitised text and a placeholder -> original mapping."""
        entity_map = {}              # built per call, so concurrent requests never share state
        counters = defaultdict(int)  # per-type counters give [PERSON_0], [EMAIL_0], ...
        sanitised = text

        def redact(entity_type: str, value: str) -> None:
            nonlocal sanitised
            if value in entity_map.values():
                return  # already replaced everywhere by an earlier match
            placeholder = f"[{entity_type}_{counters[entity_type]}]"
            counters[entity_type] += 1
            entity_map[placeholder] = value
            sanitised = sanitised.replace(value, placeholder)

        # Layer 1: Regex for structured PII
        for match in self.regex_detector.find_all(sanitised):
            redact(match.entity_type, match.text)
        # Layer 2: NER for unstructured PII (runs on the partly sanitised text)
        for entity in self.ner_detector.detect(sanitised):
            redact(entity.type, entity.text)
        return sanitised, entity_map

    def restore(self, response: str, entity_map: dict) -> str:
        """Re-insert original PII into the response."""
        restored = response
        for placeholder, original in entity_map.items():
            restored = restored.replace(placeholder, original)
        return restored

Example transformation:
Original query:
"Draft a response to John Smith (john.smith@example.com) regarding his complaint about the delayed shipment to 123 Oak Street, Bristol. His order number is ORD-2024-78432."
Sanitised query:
"Draft a response to [PERSON_0] ([EMAIL_0]) regarding his complaint about the delayed shipment to [ADDRESS_0]. His order number is [ORDER_ID_0]."
The cloud model generates a response using the placeholders. The gateway re-inserts the original values before showing the response to the user. The cloud API never sees John Smith's name, email, or address.
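The round trip can be traced end to end with a self-contained sketch. The detected entities are hard-coded here (a real gateway would get them from its detectors), the cloud reply is a stand-in string rather than a real API call, and the function names are illustrative.

```python
def sanitise(text: str, entities: list[tuple[str, str]]) -> tuple[str, dict[str, str]]:
    """Replace each detected (entity_type, value) pair with a typed placeholder."""
    entity_map: dict[str, str] = {}
    counters: dict[str, int] = {}
    for entity_type, value in entities:
        index = counters.get(entity_type, 0)
        counters[entity_type] = index + 1
        placeholder = f"[{entity_type}_{index}]"
        entity_map[placeholder] = value
        text = text.replace(value, placeholder)
    return text, entity_map


def restore(response: str, entity_map: dict[str, str]) -> str:
    """Swap placeholders back to the original values."""
    for placeholder, original in entity_map.items():
        response = response.replace(placeholder, original)
    return response


# Entities a detector might report for a shortened version of the example query.
detected = [
    ("PERSON", "John Smith"),
    ("ADDRESS", "123 Oak Street, Bristol"),
    ("ORDER_ID", "ORD-2024-78432"),
]
query = ("Draft a response to John Smith regarding the delayed shipment "
         "to 123 Oak Street, Bristol. His order number is ORD-2024-78432.")

clean, mapping = sanitise(query, detected)
# Stand-in for the cloud model's reply, which only ever sees placeholders.
cloud_reply = "Dear [PERSON_0], we apologise that order [ORDER_ID_0] was delayed."
final = restore(cloud_reply, mapping)

print(clean)
print(final)
```

The mapping never leaves the local process, so restoring the response does not require the cloud provider to have seen any original value.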
Your PII gateway uses regex rules only. A query reads: 'The incident happened when Sarah talked to her manager at the coffee shop on Tuesday.' What PII does the regex miss?
Keeping edge and cloud consistent
When edge and cloud systems both serve AI capabilities, you need strategies for keeping them consistent and handling failures.
Index synchronisation patterns:
If both edge and cloud maintain knowledge indexes (for RAG), they need to stay in sync. Three approaches:
1. Cloud-primary, edge-cache. The cloud system is the authoritative index. Edge systems pull a subset of the index relevant to their users. Updates flow from cloud to edge on a schedule (hourly, daily) or event-driven (when documents change).
- Pro: simple consistency model, cloud always has the full picture
- Con: edge is stale between syncs, requires connectivity for updates
2. Edge-primary, cloud-aggregate. Each edge system maintains its own index of local data. The cloud aggregates metadata (not content) from edge systems to enable cross-system search.
- Pro: data stays on edge by default, cloud only sees metadata
- Con: cross-system search quality depends on metadata quality
3. Bidirectional sync. Both edge and cloud can add and update documents. Conflict resolution handles simultaneous updates.
- Pro: maximum flexibility
- Con: conflict resolution is genuinely hard, especially for vector indexes
For most enterprise edge AI deployments, pattern 1 (cloud-primary, edge-cache) is the right starting point. It is the simplest to implement and reason about. Data sovereignty is maintained as long as two conditions hold: the edge caches only the data it needs, and the "cloud" index runs within your own infrastructure (a private cloud or data centre, not a third-party service).
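A minimal sketch of pattern 1, assuming documents carry a version number and a set of tags, and that each edge site declares which tags it cares about. The `Document` and `EdgeCache` names, the tag-based relevance filter, and the in-memory dictionaries are all illustrative assumptions; a real deployment would sync a vector index over the network.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    version: int
    tags: frozenset[str]
    body: str


@dataclass
class EdgeCache:
    """Edge-side copy of the subset of the index this site needs."""
    wanted_tags: frozenset[str]
    docs: dict[str, Document] = field(default_factory=dict)

    def sync_from(self, cloud_index: dict[str, Document]) -> int:
        """Pull new or newer relevant documents; return how many entries changed."""
        changed = 0
        for doc in cloud_index.values():
            if not (doc.tags & self.wanted_tags):
                continue  # not relevant to this edge site, so never copied
            cached = self.docs.get(doc.doc_id)
            if cached is None or cached.version < doc.version:
                self.docs[doc.doc_id] = doc
                changed += 1
        # Drop documents the cloud no longer holds so deletions propagate.
        for doc_id in list(self.docs):
            if doc_id not in cloud_index:
                del self.docs[doc_id]
                changed += 1
        return changed


cloud = {
    "hr-1": Document("hr-1", 1, frozenset({"hr"}), "Expense policy v1"),
    "fin-1": Document("fin-1", 1, frozenset({"finance"}), "Q3 forecast"),
}
edge = EdgeCache(wanted_tags=frozenset({"hr"}))
print(edge.sync_from(cloud), sorted(edge.docs))
```

Calling `sync_from` on a schedule gives the hourly or daily model; calling it from a document-changed hook gives the event-driven one. Either way the edge is only as fresh as its last call.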
Failover strategies:
Edge fails, cloud available:
- Route all queries to cloud (with PII sanitisation if required)
- Quality may improve (larger model), but privacy posture changes
- Alert operations team to restore edge service
Cloud fails, edge available:
- Edge handles all queries, including complex ones
- Quality degrades for complex queries, but service continues
- This is why the edge system must be capable enough for standalone operation
Both fail:
- Degrade gracefully to cached responses or static fallbacks
- Queue queries for processing when service resumes
- This scenario should be rare in a well-architected system
The key design principle: Edge should be self-sufficient. Cloud is an enhancement, not a dependency. If cloud goes down, the user experience degrades but does not break. This inverts the typical cloud dependency pattern and makes your AI infrastructure more resilient.
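The failover rules above reduce to a small decision function. This sketch assumes health checks are already available as booleans and that complexity has been classified upstream; the `Target` enum and `choose_target` name are illustrative.

```python
from enum import Enum


class Target(Enum):
    EDGE = "edge"
    CLOUD_SANITISED = "cloud (after PII sanitisation)"
    FALLBACK = "cached response or queue"


def choose_target(is_complex: bool, edge_up: bool, cloud_up: bool) -> Target:
    """Pick a target from health checks; edge is the default, cloud the enhancement."""
    if edge_up and cloud_up:
        return Target.CLOUD_SANITISED if is_complex else Target.EDGE
    if edge_up:
        return Target.EDGE  # cloud down: edge answers everything, quality may dip
    if cloud_up:
        return Target.CLOUD_SANITISED  # edge down: sanitise first, alert operations
    return Target.FALLBACK  # both down: degrade gracefully


for complex_query in (False, True):
    print(complex_query, choose_target(complex_query, edge_up=True, cloud_up=False))
```

Note that the cloud-down branch never fails a request: complex queries simply get the edge answer, which is the self-sufficiency principle expressed in code.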
Three reference architectures
Pattern A: On-premises primary, cloud escalation
User Query → On-prem vLLM (Gemma 27B)
│
├── Answer quality check (confidence scoring)
│   ├── High confidence → Return response
│   └── Low confidence → Sanitise → Cloud API → Return response

Best for: organisations with existing data centre infrastructure, moderate quality requirements, and willingness to use cloud for the long tail of complex queries.
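Pattern A's escalation step might look like the sketch below. The confidence score used here, the geometric mean of token probabilities computed from per-token log-probabilities, is one possible choice and an assumption of this example, as are the 0.6 threshold and the callable signatures; the source does not prescribe a scoring method. Re-inserting PII into the cloud response is omitted for brevity.

```python
import math
from typing import Callable


def mean_token_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability in (0, 1]; 0.0 for an empty answer."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_with_escalation(
    query: str,
    edge_generate: Callable[[str], tuple[str, list[float]]],
    sanitise: Callable[[str], str],
    cloud_generate: Callable[[str], str],
    threshold: float = 0.6,  # assumed value; tune against your own quality data
) -> tuple[str, str]:
    """Return (answer, source) where source is 'edge' or 'cloud'."""
    answer, logprobs = edge_generate(query)
    if mean_token_confidence(logprobs) >= threshold:
        return answer, "edge"
    # Low confidence: only the sanitised query leaves your infrastructure.
    return cloud_generate(sanitise(query)), "cloud"


# Stub callables standing in for real model clients.
result = answer_with_escalation(
    "Summarise this email.",
    edge_generate=lambda q: ("Short summary.", [-0.05, -0.10]),
    sanitise=lambda q: q,
    cloud_generate=lambda q: "Cloud answer.",
)
print(result)
```

Token-level confidence is a rough proxy: a model can be fluent and wrong. Treat the threshold as something to calibrate against user feedback rather than a fixed constant.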
Pattern B: Browser-first, server fallback
User Query → Browser model (Gemma E2B/E4B)
│
├── Context length check
│   ├── Short context → Generate locally
│   └── Long context → Send to on-prem vLLM (27B)
│       ├── Generates locally on your server
│       └── Returns response to browser

Best for: privacy-maximising deployments where data should stay on the employee's device whenever possible, with server-side processing only for queries that exceed the browser model's capability.
Pattern C: Air-gapped with manual cloud bridge
Secure Network:
User Query → On-prem vLLM (air-gapped, no internet)
│
├── Most queries answered locally
└── Unanswerable query → Flagged for analyst
    │
    Analyst manually reformulates query
    without classified information
    │
    Analyst submits to cloud API (separate device)
    │
    Analyst incorporates cloud response into
    local answer (manual synthesis)

Best for: defence, intelligence, and classified environments where the secure network has no internet connectivity. The "manual cloud bridge" preserves human judgment about what information can cross the air gap.
Your organisation has both a HIPAA-regulated healthcare division and a non-regulated marketing division. Both want AI. What hybrid architecture serves both?
Module 9 -- Final Assessment
In a hybrid cloud-edge architecture, what percentage of typical enterprise queries can be handled adequately by an edge model (2-27B parameters)?
What does the PII gateway pattern accomplish that simple complexity-based routing does not?
In a well-designed hybrid architecture, what should happen when the cloud component becomes unavailable?
A regex-only PII gateway processes the query: 'Please review the proposal Dr. Patel sent regarding the Thames Water acquisition.' What PII does the regex likely miss?