On-Premises Deployment with vLLM

When edge means your data centre

Not all edge AI runs on employee devices. For many enterprises, "edge" means "our own data centre, not someone else's cloud." The data never leaves your network, but it is served from centralised infrastructure that you control.

This is the on-premises deployment pattern: a GPU cluster in your data centre (or co-location facility) running an open-source inference engine, serving your entire organisation through an internal API. Employees use AI through internal applications, but every inference request is processed on hardware you own, in a facility you control, under your security policies.

vLLM is a widely adopted production choice for this pattern. It is an open-source inference engine that provides:

High throughput: PagedAttention for efficient KV cache management, continuous batching for maximising GPU utilisation
OpenAI-compatible API: Drop-in replacement for applications currently using OpenAI's API
Broad model support: Gemma, Llama, Mistral, Qwen, Phi, and most Hugging Face models
Quantisation support: AWQ, GPTQ, GGUF (experimental in vLLM), bitsandbytes -- serve quantised models with minimal configuration
Tensor parallelism: Split a single model across multiple GPUs for larger models or higher throughput

Your organisation currently uses the OpenAI API for an internal document analysis tool. You want to migrate to on-premises for data sovereignty. What is the lowest-friction migration path?

From zero to serving Gemma 4 27B

Here is the practical setup for a vLLM deployment serving Gemma 4 27B on a single GPU.

Prerequisites:

Linux server (Ubuntu 22.04+ recommended)
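NVIDIA GPU with 24GB+ VRAM for the AWQ INT4 build (A100 40GB/80GB, H100, L40S, or RTX 4090 for development); the unquantised FP16 weights alone are roughly 54 GB and need an 80GB card or multiple GPUs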
CUDA 12.1+ and compatible NVIDIA drivers
Python 3.9+
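
A rough way to check whether a model fits a given GPU is to multiply the parameter count by the bytes per weight. The sketch below is illustrative only: weight_footprint_gb is a hypothetical helper, it counts weights alone (no KV cache, activations, or quantisation metadata), and it uses decimal gigabytes.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in decimal GB."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billions * 1e9 weights) * bytes_per_weight / (1e9 bytes per GB)
    return params_billions * bytes_per_weight


for label, bits in [("FP16", 16), ("INT8", 8), ("INT4 (AWQ/GPTQ)", 4)]:
    print(f"27B at {label}: ~{weight_footprint_gb(27, bits):.1f} GB")
```

Under these assumptions a 4-bit build of a 27B model needs about 13.5 GB for weights, which leaves room for the KV cache on a 24 GB card, while FP16 weights need about 54 GB.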

Step 1: Install vLLM

pip install vllm

Step 2: Download the model

# Using huggingface-cli
huggingface-cli download google/gemma-4-27b-it --local-dir ./models/gemma-4-27b-it

# Or for a pre-quantised AWQ version (smaller, faster to download)
huggingface-cli download casperhansen/gemma-4-27b-it-awq --local-dir ./models/gemma-4-27b-it-awq

Step 3: Start the server

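# Serve the AWQ-quantised model.
# --served-model-name sets the name clients pass as "model"; without it,
# vLLM registers the model under the path given to --model.
python -m vllm.entrypoints.openai.api_server \
  --model ./models/gemma-4-27b-it-awq \
  --served-model-name gemma-4-27b-it-awq \
  --quantization awq \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --dtype float16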
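
The server does not accept requests until the model has finished loading, so scripts that start vLLM and then call it straight away tend to fail. A minimal sketch of a readiness wait follows, assuming the /health endpoint on port 8000 used elsewhere in this module. The helper names health_ok and wait_until_ready and the default timings are hypothetical choices for this example.

```python
import time
import urllib.request
from typing import Callable


def health_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if GET <url> answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and timeouts are all OSError.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    max_wait_s: float = 180.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns True or max_wait_s has elapsed."""
    waited = 0.0
    while True:
        if check():
            return True
        if waited >= max_wait_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Example (needs a running server, so it is left commented out):
# ready = wait_until_ready(lambda: health_ok("http://localhost:8000/health"))
```

Passing the check and sleep functions in as arguments keeps the loop testable without a live server.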

Step 4: Test the endpoint

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-27b-it-awq",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the concept of data sovereignty in two sentences."}
    ],
    "temperature": 0.3,
    "max_tokens": 200
  }'

Step 5: Use with existing OpenAI client code

from openai import OpenAI

# Point to your vLLM server instead of OpenAI
client = OpenAI(
    base_url="http://your-vllm-server:8000/v1",
    api_key="not-needed"  # vLLM does not require an API key by default
)

response = client.chat.completions.create(
    model="gemma-4-27b-it-awq",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise this contract clause..."}
    ],
    temperature=0.3,
    max_tokens=500,
    stream=True
)

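for chunk in response:
    # Some chunks (the role header, the final chunk) carry no text, so
    # delta.content can be None; skip those instead of printing "None".
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)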

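The key insight: from the application's perspective, vLLM exposes the same chat completions interface as the OpenAI API. Existing code, client libraries, and integrations typically need only a new base URL and model name, provided they use endpoints and parameters that vLLM implements.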
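
One low-friction way to migrate is to move the endpoint and model name into configuration, so switching between providers is a deployment change rather than a code change. The sketch below is an assumption-laden illustration: LLM_BASE_URL, LLM_API_KEY, and LLM_MODEL are hypothetical variable names, and the defaults simply reuse the placeholder server address and model name from the steps above.

```python
import os
from typing import Dict, Mapping, Optional


def llm_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Resolve endpoint settings from environment variables.

    The variable names are hypothetical; the defaults point at the
    placeholder vLLM server used in this module.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", "http://your-vllm-server:8000/v1"),
        # vLLM needs no key by default, but the client expects a non-empty string.
        "api_key": env.get("LLM_API_KEY", "not-needed"),
        "model": env.get("LLM_MODEL", "gemma-4-27b-it-awq"),
    }


# Usage with the OpenAI client (requires `pip install openai`):
# from openai import OpenAI
# settings = llm_settings()
# client = OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])
# client.chat.completions.create(model=settings["model"], messages=[...])
```

With this in place, rolling back to the hosted API during a trial period means changing three environment variables.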

GPU selection and capacity planning

Choosing the right GPU is the most consequential hardware decision. It determines your model options, concurrent capacity, and cost profile.

GPU comparison for on-premises inference:

GPU	VRAM	FP16 TFLOPS (non-Tensor)	Approx. price	Best for
RTX 4090	24 GB	82.6	$1,600-2,000	Development, small-scale production
A10G	24 GB	31.2	$2,000-3,000	AWS instances, moderate throughput
L40S	48 GB	91.6	$7,000-9,000	Production sweet spot for 27B models
A100 40GB	40 GB	77.97	$8,000-12,000	Production standard, tensor parallelism
A100 80GB	80 GB	77.97	$15,000-20,000	Large models (70B), high concurrency
H100 80GB	80 GB	267.6	$25,000-35,000	Highest throughput, FP8 support

Capacity estimation:

vLLM's throughput depends on the model size, quantisation, GPU, context length, and batch size. Here are representative numbers for Gemma 4 27B (AWQ INT4):

GPU	Concurrent requests	Tokens/sec (total throughput)	Avg latency (256-token response)
RTX 4090 (24GB)	4-8	200-400	1-3 seconds
L40S (48GB)	12-20	500-900	0.5-1.5 seconds
A100 40GB	8-15	400-700	0.8-2 seconds
A100 80GB	20-35	800-1400	0.4-1 second
H100 80GB	30-50	1500-2500	0.2-0.5 seconds

Sizing example:

An enterprise with 5,000 knowledge workers making 20 queries per day = 100,000 queries per day = ~1.2 queries per second average. With a 5x peak factor, peak load is ~6 queries per second.

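Each query generates roughly 300 tokens in 1-2 seconds. At a peak of 6 queries per second, about 6-12 requests are in flight at once, and the cluster must produce up to 6 × 300 = 1,800 tokens/second. Plan for sustained throughput of ~1,000-1,800 tokens/second during peaks.

Measured against the capacity table above, a single L40S (500-900 tokens/sec) or A100 40GB (400-700 tokens/sec) covers the average load of roughly 350 tokens/second comfortably, but not the 5x peak on its own. Two GPUs serving in parallel deliver roughly 800-1,800 tokens/second combined, which reaches the peak range and keeps the service running at reduced capacity during hardware failure or maintenance. Total cost: $14,000-24,000 in hardware, replacing $30,000-90,000/month in cloud API costs.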
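
The sizing arithmetic can be written down as a short calculation so it is easy to rerun with your own numbers. This is a sketch under the same assumptions as the example (load spread evenly over 24 hours, a fixed peak factor, a fixed response length). The function names are hypothetical, and the per-GPU throughput you divide by should come from your own benchmarks or from the capacity table.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(users: int, queries_per_user_per_day: float) -> float:
    """Average queries per second, spread evenly over 24 hours."""
    return users * queries_per_user_per_day / SECONDS_PER_DAY


def peak_tokens_per_second(avg_qps: float, peak_factor: float,
                           tokens_per_response: int) -> float:
    """Output tokens/second the cluster must sustain at peak."""
    return avg_qps * peak_factor * tokens_per_response


def gpus_needed(required_tps: float, per_gpu_tps: float, spares: int = 1) -> int:
    """GPUs needed to meet required_tps, plus spares for failover."""
    return math.ceil(required_tps / per_gpu_tps) + spares


qps = average_qps(5_000, 20)
peak = peak_tokens_per_second(qps, peak_factor=5, tokens_per_response=300)
print(f"average: {qps:.2f} queries/s, peak: {peak:.0f} tokens/s")
```

Without intermediate rounding the peak comes out slightly below the 1,800 tokens/second used in the text, which rounds the query rate up at each step. Using the low end of a GPU's throughput range in gpus_needed gives the conservative count.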

You need to serve Gemma 4 27B to 500 concurrent users during peak business hours (9am-5pm). Average query: 500 input tokens, 300 output tokens. Average think time between queries: 3 minutes. What hardware do you need?

Production deployment on Kubernetes

For production on-premises deployment, Kubernetes provides the orchestration layer for autoscaling, health checks, rolling updates, and resource management.

The deployment architecture:

Load Balancer (nginx, HAProxy, or K8s Ingress)
├── vLLM Pod 1 (GPU node 1)
│   ├── vLLM server container
│   ├── Model volume (PVC or hostPath)
│   └── Health check sidecar
├── vLLM Pod 2 (GPU node 2)
│   ├── vLLM server container
│   ├── Model volume (PVC or hostPath)
│   └── Health check sidecar
└── Monitoring stack
    ├── Prometheus (metrics collection)
    ├── Grafana (dashboards)
    └── AlertManager (on-call notifications)

Kubernetes deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-27b
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-gemma-27b
  template:
    metadata:
      labels:
        app: vllm-gemma-27b
    spec:
      containers:
      - name: vllm
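        # Pin a specific release tag in production; "latest" can change between restarts.
        image: vllm/vllm-openai:latest
        args:
          - "--model=/models/gemma-4-27b-it-awq"
          - "--served-model-name=gemma-4-27b-it-awq"
          - "--quantization=awq"
          - "--host=0.0.0.0"
          - "--port=8000"
          - "--max-model-len=8192"
          - "--gpu-memory-utilization=0.90"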
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            memory: "32Gi"
            cpu: "8"
        volumeMounts:
        - name: model-storage
          mountPath: /models
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-weights-pvc
      nodeSelector:
        gpu-type: a100
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ai-inference
spec:
  selector:
    app: vllm-gemma-27b
  ports:
  - port: 8000
    targetPort: 8000
  type: ClusterIP

Key configuration details:

initialDelaySeconds: 120 for the liveness probe gives vLLM time to load the model into GPU memory. A 27B model takes 60-90 seconds to load.
gpu-memory-utilization: 0.90 tells vLLM to use 90% of available GPU memory. The remaining 10% provides headroom for CUDA overhead and prevents OOM crashes.
nvidia.com/gpu: 1 requests one GPU per pod. For tensor parallelism across multiple GPUs, increase this number and add the --tensor-parallel-size flag to vLLM.
Model storage: Use a PersistentVolumeClaim backed by fast storage (NVMe SSD). Model loading from spinning disk adds 2-5 minutes to pod startup.
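
To see what the gpu-memory-utilization setting leaves for the KV cache, subtract the weight footprint from the usable share of VRAM. The sketch below is a back-of-envelope estimate, not vLLM's actual accounting: kv_cache_budget_gb is a hypothetical helper, and other_gb stands in for activations and workspace memory that you would need to measure.

```python
def kv_cache_budget_gb(vram_gb: float, gpu_memory_utilization: float,
                       weights_gb: float, other_gb: float = 0.0) -> float:
    """Rough memory left for the KV cache, in GB (never negative)."""
    usable_gb = vram_gb * gpu_memory_utilization
    return max(0.0, usable_gb - weights_gb - other_gb)


# Assumed weight footprint: ~13.5 GB for a 27B model at 4 bits per weight.
for name, vram in [("RTX 4090", 24), ("L40S", 48), ("A100 80GB", 80)]:
    budget = kv_cache_budget_gb(vram, 0.90, weights_gb=13.5)
    print(f"{name}: ~{budget:.1f} GB for KV cache")
```

A larger KV cache budget is what allows more concurrent requests or longer contexts on the bigger cards.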

Monitoring your inference cluster

A production inference cluster needs monitoring for three reasons: capacity planning, incident detection, and cost justification.

Key metrics to track:

Metric	What it tells you	Alert threshold
Tokens/second (throughput)	Cluster capacity utilisation	>80% of benchmarked capacity, sustained = scale up
Time-to-first-token (TTFT)	User-perceived responsiveness	>2 seconds = investigate
Request queue depth	Whether demand exceeds capacity	>50 queued = scale up
GPU utilisation (%)	Hardware efficiency	<30% sustained = scale down
GPU memory usage (%)	Memory pressure	>95% = risk of OOM
Request error rate	Service health	>1% = investigate
Active requests	Concurrent load	Informational
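
The thresholds in the table can be expressed as data and checked against a metrics snapshot. The following is a minimal sketch: the dictionary keys are hypothetical names rather than vLLM metric names, and it evaluates a single snapshot, so the "sustained" conditions would still need a time window in your alerting system.

```python
import operator
from typing import Dict, List, Tuple

# (snapshot key, comparison, threshold, suggested action) from the table above
RULES = [
    ("throughput_pct", operator.gt, 80.0, "scale up"),
    ("ttft_seconds", operator.gt, 2.0, "investigate"),
    ("queue_depth", operator.gt, 50, "scale up"),
    ("gpu_util_pct", operator.lt, 30.0, "scale down"),
    ("gpu_mem_pct", operator.gt, 95.0, "risk of OOM"),
    ("error_rate_pct", operator.gt, 1.0, "investigate"),
]


def evaluate(snapshot: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (metric, action) for every rule the snapshot violates."""
    findings = []
    for key, compare, threshold, action in RULES:
        value = snapshot.get(key)
        if value is not None and compare(value, threshold):
            findings.append((key, action))
    return findings


snapshot = {"throughput_pct": 85, "ttft_seconds": 0.4, "queue_depth": 60,
            "gpu_util_pct": 70, "gpu_mem_pct": 91, "error_rate_pct": 0.2}
for metric, action in evaluate(snapshot):
    print(f"{metric}: {action}")
```

Keeping the rules in one list makes it easy to review the thresholds alongside the table when they change.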

vLLM exports Prometheus-compatible metrics natively:

# vLLM's OpenAI-compatible server exposes /metrics by default;
# no extra flag is needed to enable it
python -m vllm.entrypoints.openai.api_server \
  --model ./models/gemma-4-27b-it-awq \
  --quantization awq \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name gemma-4-27b

Metrics are available at http://localhost:8000/metrics in Prometheus format. Configure your Prometheus instance to scrape this endpoint.
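
For a quick check without a full Prometheus setup, the text format returned by /metrics can be parsed in a few lines. This is a simplified sketch, not a complete parser: it sums series that share a name across labels and assumes label values contain no "}" character. The metric names in the usage comment (vllm:num_requests_running, vllm:num_requests_waiting) are assumptions based on recent vLLM releases, so check your own /metrics output.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition format into {metric_name: value}.

    Values of series that share a name but differ in labels are summed.
    """
    totals: Dict[str, float] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line and "}" in line:
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1:].split()
        else:
            name, *fields = line.split()
        if not fields:
            continue
        try:
            value = float(fields[0])  # fields[1], if present, is a timestamp
        except ValueError:
            continue
        totals[name] = totals.get(name, 0.0) + value
    return totals


# Usage against a running server (left commented out):
# import urllib.request
# body = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
# metrics = parse_metrics(body)
# print(metrics.get("vllm:num_requests_waiting"))
```

For production dashboards and alerts, scraping with Prometheus remains the better route; this kind of script is mainly useful for smoke tests.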

Cost modelling: on-premises vs cloud GPU rental

The own-vs-rent decision depends on your utilisation pattern and time horizon.

Approach	Monthly cost (2x A100 40GB)	3-year total
Own hardware (purchase)	~$2,500/mo (amortised hardware + power + admin)	~$90,000
Lambda Labs (cloud GPU)	~$5,600/mo (2x A100 40GB on-demand)	~$201,600
RunPod (cloud GPU)	~$4,800/mo (2x A100 40GB on-demand)	~$172,800
CoreWeave (cloud GPU)	~$5,200/mo (2x A100 40GB)	~$187,200
AWS p4d.24xlarge (8x A100)	~$24,000/mo (overkill but minimum instance)	~$864,000

The owned-hardware numbers assume:

Hardware purchase: $20,000-24,000 (2x A100 40GB + server)
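Amortised over 3 years: ~$670/mo (upper end: $24,000 / 36 months)
Power and cooling: ~$400/mo (2x 400W servers at $0.12/kWh is about $70/mo in electricity; the rest is an allowance for cooling and facility overhead)
Part-time admin (10% of an engineer): ~$1,500/mo
Total: ~$2,570/mo, rounded to ~$2,500/mo in the table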

For sustained workloads (8+ hours/day, 5+ days/week), owning hardware is 40-60% cheaper than cloud GPU rental over a 3-year horizon. For bursty or experimental workloads, cloud GPU rental avoids the capital commitment.
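
The comparison can be reproduced with a few lines of arithmetic, which makes it easy to substitute your own quotes. This sketch uses the figures from the table and the assumptions listed above. The function names are hypothetical, and the rental prices are the approximate monthly rates quoted in this section, not current list prices.

```python
def owned_monthly_cost(hardware_usd: float, amortisation_months: int,
                       power_cooling_usd: float, admin_usd: float) -> float:
    """Monthly cost of owned hardware: amortisation plus running costs."""
    return hardware_usd / amortisation_months + power_cooling_usd + admin_usd


def savings_fraction(owned_monthly_usd: float, rented_monthly_usd: float) -> float:
    """Fraction saved by owning instead of renting (0.5 means 50% cheaper)."""
    return 1 - owned_monthly_usd / rented_monthly_usd


owned = owned_monthly_cost(24_000, 36, power_cooling_usd=400, admin_usd=1_500)
print(f"owned (unrounded): ${owned:,.0f}/mo")

# Approximate monthly rental figures from the table above (2x A100 40GB).
for provider, rented in [("Lambda Labs", 5_600), ("RunPod", 4_800), ("CoreWeave", 5_200)]:
    saving = savings_fraction(2_500, rented)
    print(f"{provider}: owning is ~{saving:.0%} cheaper over the same period")
```

The savings fraction is the same over one month or three years as long as both costs stay constant, so the 3-year totals in the table tell the same story as the monthly figures.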

Your CFO asks: 'Why should we spend $50,000 on GPU hardware when we can just pay per API call?' What is the strongest counter-argument?

Module 8 -- Final Assessment

What is the primary advantage of vLLM over llama.cpp for on-premises production serving?

You are sizing hardware for 5,000 users making 20 AI queries per day. What is the average query rate?

In the Kubernetes vLLM deployment, why is the liveness probe initialDelaySeconds set to 120 seconds?

An enterprise processes 100,000 AI queries per day. Cloud API cost is $45,000/month. On-premises cost (2x A100 40GB, amortised over 3 years with power and admin) is $2,500/month. What is the approximate payback period for the hardware investment?