When edge means your data centre
Not all edge AI runs on employee devices. For many enterprises, "edge" means "our own data centre, not someone else's cloud." The data never leaves your network, but it is served from centralised infrastructure that you control.
This is the on-premises deployment pattern: a GPU cluster in your data centre (or co-location facility) running an open-source inference engine, serving your entire organisation through an internal API. Employees use AI through internal applications, but every inference request is processed on hardware you own, in a facility you control, under your security policies.
vLLM is a widely used production engine for this pattern. It is an open-source inference engine that provides:
- High throughput: PagedAttention for efficient KV cache management, continuous batching for maximising GPU utilisation
- OpenAI-compatible API: Drop-in replacement for applications currently using OpenAI's API
- Broad model support: Gemma, Llama, Mistral, Qwen, Phi, and most Hugging Face models
- Quantisation support: AWQ, GPTQ, GGUF, bitsandbytes -- serve quantised models with minimal configuration
- Tensor parallelism: Split a single model across multiple GPUs for larger models or higher throughput
Your organisation currently uses the OpenAI API for an internal document analysis tool. You want to migrate to on-premises for data sovereignty. What is the lowest-friction migration path?
From zero to serving Gemma 4 27B
Here is the practical setup for a vLLM deployment serving Gemma 4 27B on a single GPU.
Prerequisites:
- Linux server (Ubuntu 22.04+ recommended)
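- NVIDIA GPU with 24GB+ VRAM for the AWQ-quantised model (A100 40GB/80GB, H100, L40S, or RTX 4090 for development); the unquantised FP16 weights are roughly 54GB and need an 80GB card or multiple GPUs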
- CUDA 12.1+ and compatible NVIDIA drivers
- Python 3.9+
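The VRAM requirement follows from simple weight-size arithmetic. The sketch below assumes a nominal 27 billion parameters and counts weights only; KV cache, activations, and quantisation metadata add to these figures. The function name is illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters x bytes per parameter = gigabytes
    return params_billion * bytes_per_weight


fp16_gb = weight_memory_gb(27, 16)  # unquantised FP16/BF16 weights
int4_gb = weight_memory_gb(27, 4)   # 4-bit weights (AWQ or GPTQ)

print(f"FP16 weights: ~{fp16_gb:.1f} GB")
print(f"INT4 weights: ~{int4_gb:.1f} GB")
```

By this estimate the 4-bit weights leave several gigabytes of a 24GB card free for KV cache, while the FP16 weights do not fit on a 24GB or 40GB card at all.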
Step 1: Install vLLM
pip install vllm
Step 2: Download the model
# Using huggingface-cli
huggingface-cli download google/gemma-4-27b-it --local-dir ./models/gemma-4-27b-it
# Or for a pre-quantised AWQ version (smaller, faster to download)
huggingface-cli download casperhansen/gemma-4-27b-it-awq --local-dir ./models/gemma-4-27b-it-awq
Step 3: Start the server
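# Serve the AWQ-quantised model.
# --served-model-name sets the name clients pass as "model"; without it,
# vLLM registers the model under the --model path instead.
python -m vllm.entrypoints.openai.api_server \
  --model ./models/gemma-4-27b-it-awq \
  --served-model-name gemma-4-27b-it-awq \
  --quantization awq \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --dtype float16
Step 4: Test the endpoint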
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-27b-it-awq",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the concept of data sovereignty in two sentences."}
    ],
    "temperature": 0.3,
    "max_tokens": 200
  }'
Step 5: Use with existing OpenAI client code
from openai import OpenAI

# Point to your vLLM server instead of OpenAI
client = OpenAI(
    base_url="http://your-vllm-server:8000/v1",
    api_key="not-needed",  # vLLM does not require an API key by default
)

response = client.chat.completions.create(
    model="gemma-4-27b-it-awq",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise this contract clause..."},
    ],
    temperature=0.3,
    max_tokens=500,
    stream=True,
)

# Print tokens as they arrive; some chunks carry no text (delta.content is None)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

The key insight: from the application's perspective, vLLM exposes the same chat completions interface as the OpenAI API. Existing code, client libraries, and integrations keep working once the base URL and model name point at your server.
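Because only the base URL, API key, and model name differ between the two backends, one low-friction way to migrate is to move those three values into configuration. The following is a minimal sketch; the environment variable names LLM_BASE_URL, LLM_API_KEY, and LLM_MODEL are hypothetical names chosen for this example and are not read by vLLM or the OpenAI SDK themselves.

```python
import os


def inference_settings(env=None):
    """Return OpenAI-client keyword arguments plus the model name to request.

    The variable names are illustrative; pick whatever fits your deployment.
    """
    env = os.environ if env is None else env
    return {
        "client_kwargs": {
            # None leaves the SDK on its default endpoint (OpenAI's hosted API).
            "base_url": env.get("LLM_BASE_URL"),
            # The SDK needs a non-empty key. A vLLM server without key checking
            # accepts any value; the hosted API needs a real key here.
            "api_key": env.get("LLM_API_KEY", "not-needed"),
        },
        "model": env.get("LLM_MODEL", "gemma-4-27b-it-awq"),
    }


settings = inference_settings({"LLM_BASE_URL": "http://your-vllm-server:8000/v1"})
print(settings["client_kwargs"]["base_url"], settings["model"])
```

Application code would then build the client with OpenAI(**settings["client_kwargs"]) and pass settings["model"] to each request, so switching backends becomes a configuration change instead of a code change.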
GPU selection and capacity planning
Choosing the right GPU is the most consequential hardware decision. It determines your model options, concurrent capacity, and cost profile.
GPU comparison for on-premises inference:
| GPU | VRAM | FP16 TFLOPS | Approx. price | Best for |
|---|---|---|---|---|
| RTX 4090 | 24 GB | 82.6 | $1,600-2,000 | Development, small-scale production |
| A10G | 24 GB | 31.2 | $2,000-3,000 | AWS instances, moderate throughput |
| L40S | 48 GB | 91.6 | $7,000-9,000 | Production sweet spot for 27B models |
| A100 40GB | 40 GB | 77.97 | $8,000-12,000 | Production standard, tensor parallelism |
| A100 80GB | 80 GB | 77.97 | $15,000-20,000 | Large models (70B), high concurrency |
| H100 80GB | 80 GB | 267.6 | $25,000-35,000 | Highest throughput, FP8 support |
Capacity estimation:
vLLM's throughput depends on the model size, quantisation, GPU, context length, and batch size. Here are representative numbers for Gemma 4 27B (AWQ INT4):
| GPU | Concurrent requests | Tokens/sec (total throughput) | Avg latency (256-token response) |
|---|---|---|---|
| RTX 4090 (24GB) | 4-8 | 200-400 | 1-3 seconds |
| L40S (48GB) | 12-20 | 500-900 | 0.5-1.5 seconds |
| A100 40GB | 8-15 | 400-700 | 0.8-2 seconds |
| A100 80GB | 20-35 | 800-1400 | 0.4-1 second |
| H100 80GB | 30-50 | 1500-2500 | 0.2-0.5 seconds |
Sizing example:
An enterprise with 5,000 knowledge workers making 20 queries per day = 100,000 queries per day = ~1.2 queries per second average. With a 5x peak factor, peak load is ~6 queries per second.
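Each query generates roughly 300 output tokens in 1-2 seconds. At ~6 queries per second with 300-token responses, you need sustained throughput of ~1,000-1,800 tokens/second during peaks, with the upper figure corresponding to the full 6 queries per second.
Against the capacity table above, that is more than a single L40S or A100 40GB sustains at full peak, so plan on two GPUs serving in parallel: two L40S cards deliver roughly 1,000-1,800 tokens/second combined. If you need full peak capacity to survive a hardware failure or maintenance window, add a third. Total cost for two GPUs: $14,000-24,000 in hardware, replacing $30,000-90,000/month in cloud API costs.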
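The arithmetic in this sizing example can be reproduced in a few lines. This is a sketch under the example's own assumptions (load spread evenly over 24 hours before the peak factor is applied, output tokens only); the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def peak_token_demand(users, queries_per_user_per_day, peak_factor, tokens_per_response):
    """Return (average q/s, peak q/s, peak output tokens/s) for a daily workload."""
    queries_per_day = users * queries_per_user_per_day
    avg_qps = queries_per_day / SECONDS_PER_DAY
    peak_qps = avg_qps * peak_factor
    return avg_qps, peak_qps, peak_qps * tokens_per_response


avg_qps, peak_qps, peak_tps = peak_token_demand(5_000, 20, 5, 300)
print(f"average: {avg_qps:.2f} q/s, peak: {peak_qps:.2f} q/s, "
      f"peak output: {peak_tps:.0f} tokens/s")
```

Comparing the resulting peak tokens/second against the capacity table gives the number of GPUs needed before adding any redundancy.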
You need to serve Gemma 4 27B to 500 concurrent users during peak business hours (9am-5pm). Average query: 500 input tokens, 300 output tokens. Average think time between queries: 3 minutes. What hardware do you need?
Production deployment on Kubernetes
For production on-premises deployment, Kubernetes provides the orchestration layer for autoscaling, health checks, rolling updates, and resource management.
The deployment architecture:
Load Balancer (nginx, HAProxy, or K8s Ingress)
├── vLLM Pod 1 (GPU node 1)
│ ├── vLLM server container
│ ├── Model volume (PVC or hostPath)
│ └── Health check sidecar
├── vLLM Pod 2 (GPU node 2)
│ ├── vLLM server container
│ ├── Model volume (PVC or hostPath)
│ └── Health check sidecar
└── Monitoring stack
├── Prometheus (metrics collection)
├── Grafana (dashboards)
└── AlertManager (on-call notifications)
Kubernetes deployment manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-27b
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-gemma-27b
  template:
    metadata:
      labels:
        app: vllm-gemma-27b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model=/models/gemma-4-27b-it-awq"
            # Name clients pass as "model"; defaults to the --model path if omitted
            - "--served-model-name=gemma-4-27b-it-awq"
            - "--quantization=awq"
            - "--host=0.0.0.0"
            - "--port=8000"
            - "--max-model-len=8192"
            - "--gpu-memory-utilization=0.90"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              memory: "32Gi"
              cpu: "8"
          volumeMounts:
            - name: model-storage
              mountPath: /models
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-weights-pvc
      nodeSelector:
        gpu-type: a100
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ai-inference
spec:
  selector:
    app: vllm-gemma-27b
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
Key configuration details:
- initialDelaySeconds: 120 for the liveness probe gives vLLM time to load the model into GPU memory. A 27B model takes 60-90 seconds to load.
- gpu-memory-utilization: 0.90 tells vLLM to use 90% of available GPU memory. The remaining 10% provides headroom for CUDA overhead and prevents OOM crashes.
- nvidia.com/gpu: 1 requests one GPU per pod. For tensor parallelism across multiple GPUs, increase this number and add the --tensor-parallel-size flag to vLLM.
- Model storage: Use a PersistentVolumeClaim backed by fast storage (NVMe SSD). Model loading from spinning disk adds 2-5 minutes to pod startup.
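The probe settings and the load times above interact: if the model is still loading when the liveness probe has failed enough times, Kubernetes restarts the container. The sketch below estimates that budget, assuming the Kubernetes default failureThreshold of 3 (the manifest does not set it) and ignoring probe timeouts and kubelet jitter. The function name is illustrative.

```python
def seconds_until_liveness_restart(initial_delay, period, failure_threshold=3):
    """Approximate time after container start at which failed probes trigger a restart.

    The first probe runs after initial_delay; the restart happens once
    failure_threshold consecutive probes have failed.
    """
    return initial_delay + (failure_threshold - 1) * period


budget = seconds_until_liveness_restart(initial_delay=120, period=30)

# Load times quoted above: 60-90 s from NVMe, plus 2-5 minutes from spinning disk.
scenarios = [("NVMe, slow case", 90), ("spinning disk, best case", 90 + 120)]
for label, load_seconds in scenarios:
    verdict = "ready in time" if load_seconds < budget else "restarted before ready"
    print(f"{label}: load {load_seconds}s vs budget {budget}s -> {verdict}")
```

Under these assumptions a load from NVMe fits the budget, while a load from spinning disk may not; a larger initialDelaySeconds or a Kubernetes startupProbe is the usual remedy.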
Monitoring your inference cluster
A production inference cluster needs monitoring for three reasons: capacity planning, incident detection, and cost justification.
Key metrics to track:
| Metric | What it tells you | Alert threshold |
|---|---|---|
| Tokens/second (throughput) | Cluster capacity utilisation | >80% of benchmarked peak, sustained = scale up |
| Time-to-first-token (TTFT) | User-perceived responsiveness | >2 seconds = investigate |
| Request queue depth | Whether demand exceeds capacity | >50 queued = scale up |
| GPU utilisation (%) | Hardware efficiency | <30% sustained = scale down |
| GPU memory usage (%) | Memory pressure | >95% = risk of OOM |
| Request error rate | Service health | >1% = investigate |
| Active requests | Concurrent load | Informational |
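The thresholds in this table can be expressed as a small rule set. This is a sketch only: the snapshot keys are hypothetical names, the values would come from your own monitoring queries, and the "sustained" condition (holding over a time window) is left to the alerting system.

```python
def evaluate_alerts(snapshot):
    """Map a metrics snapshot (plain dict) to the actions in the threshold table."""
    rules = [
        ("throughput_pct_of_peak", lambda v: v > 80, "scale up: throughput above 80%"),
        ("ttft_seconds", lambda v: v > 2, "investigate: TTFT above 2 seconds"),
        ("queue_depth", lambda v: v > 50, "scale up: more than 50 requests queued"),
        ("gpu_utilisation_pct", lambda v: v < 30, "scale down: GPU utilisation under 30%"),
        ("gpu_memory_pct", lambda v: v > 95, "OOM risk: GPU memory above 95%"),
        ("error_rate_pct", lambda v: v > 1, "investigate: error rate above 1%"),
    ]
    # Metrics missing from the snapshot are skipped, not treated as breaches.
    return [message for key, breached, message in rules
            if key in snapshot and breached(snapshot[key])]


alerts = evaluate_alerts({
    "throughput_pct_of_peak": 85,
    "ttft_seconds": 1.2,
    "queue_depth": 12,
    "gpu_utilisation_pct": 78,
    "gpu_memory_pct": 96,
    "error_rate_pct": 0.2,
})
for alert in alerts:
    print(alert)
```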
vLLM exports Prometheus-compatible metrics natively:
# Start vLLM; the server exposes its Prometheus endpoint by default
python -m vllm.entrypoints.openai.api_server \
  --model ./models/gemma-4-27b-it-awq \
  --quantization awq \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name gemma-4-27b-it-awq
Metrics are available at http://localhost:8000/metrics in Prometheus format; no extra flag is needed. Configure your Prometheus instance to scrape this endpoint.
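For a quick check without a full Prometheus stack, the text format is simple enough to parse by hand. The sketch below works on a sample payload; the metric names vllm:num_requests_running and vllm:num_requests_waiting and the sample values are assumptions for illustration, so confirm the names against the /metrics output of your vLLM version.

```python
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="gemma-4-27b-it-awq"} 7.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="gemma-4-27b-it-awq"} 3.0
"""


def parse_gauges(metrics_text):
    """Parse Prometheus text-format samples into {metric_name: value}.

    Labels are dropped, so this only suits a server where each metric name
    appears once. Comment lines (# HELP, # TYPE) and blank lines are skipped.
    """
    values = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(value)
    return values


gauges = parse_gauges(SAMPLE)
print("running:", gauges["vllm:num_requests_running"])
print("queued:", gauges["vllm:num_requests_waiting"])
```

In practice the text would come from an HTTP GET against the metrics endpoint instead of the SAMPLE string.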
Cost modelling: on-premises vs cloud GPU rental
The buy-versus-rent decision depends on your utilisation pattern and time horizon.
| Approach | Monthly cost (2x A100 40GB) | 3-year total |
|---|---|---|
| Own hardware (purchase) | ~$2,500/mo (amortised hardware + power + admin) | ~$90,000 |
| Lambda Labs (cloud GPU) | ~$5,600/mo (2x A100 40GB on-demand) | ~$201,600 |
| RunPod (cloud GPU) | ~$4,800/mo (2x A100 40GB on-demand) | ~$172,800 |
| CoreWeave (cloud GPU) | ~$5,200/mo (2x A100 40GB) | ~$187,200 |
| AWS p4d.24xlarge (8x A100) | ~$24,000/mo (overkill but minimum instance) | ~$864,000 |
The owned-hardware numbers assume:
- Hardware purchase: $20,000-24,000 (2x A100 40GB + server)
- Amortised over 3 years: ~$560-670/mo
- Power and cooling: ~$400/mo (2x 400W servers at $0.12/kWh + cooling)
- Part-time admin (10% of an engineer): ~$1,500/mo
- Total: ~$2,500/mo
For sustained workloads (8+ hours/day, 5+ days/week), owning hardware is 40-60% cheaper than cloud GPU rental over a 3-year horizon. For bursty or experimental workloads, cloud GPU rental avoids the capital commitment.
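The owned-hardware figure and the savings range can be checked directly from the numbers above. This sketch uses the upper hardware price ($24,000) with straight-line amortisation and the on-demand monthly rates from the table; the table rounds the results to ~$2,500/mo and ~$90,000.

```python
MONTHS = 36  # 3-year horizon used in the table


def owned_monthly_cost(hardware_price, power_and_cooling, admin):
    """Monthly cost of owned hardware: amortised purchase plus running costs."""
    return hardware_price / MONTHS + power_and_cooling + admin


owned = owned_monthly_cost(hardware_price=24_000, power_and_cooling=400, admin=1_500)
cloud_monthly = {"Lambda Labs": 5_600, "RunPod": 4_800, "CoreWeave": 5_200}

print(f"owned: ~${owned:,.0f}/mo, ~${owned * MONTHS:,.0f} over 3 years")
for provider, monthly in cloud_monthly.items():
    saving = 1 - owned / monthly
    print(f"{provider}: ${monthly * MONTHS:,} over 3 years, owning is {saving:.0%} cheaper")
```

Note that the comparison assumes the cloud instances run continuously; if rented GPUs can be shut down outside working hours, the gap narrows.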
Your CFO asks: 'Why should we spend $50,000 on GPU hardware when we can just pay per API call?' What is the strongest counter-argument?
Module 8 -- Final Assessment
What is the primary advantage of vLLM over llama.cpp for on-premises production serving?
You are sizing hardware for 5,000 users making 20 AI queries per day. What is the average query rate?
In the Kubernetes vLLM deployment, why is the liveness probe initialDelaySeconds set to 120 seconds?
An enterprise processes 100,000 AI queries per day. Cloud API cost is $45,000/month. On-premises cost (2x A100 40GB, amortised over 3 years with power and admin) is $2,500/month. What is the approximate payback period for the hardware investment?