Enterprise RAG on Your Own Infrastructure

vLLM Deployment Guide

What vLLM is and why it dominates self-hosted inference, hardware selection with real cost-performance numbers, quantisation trade-offs, batching and throughput optimisation, multi-GPU serving, monitoring, and cost modelling.

What vLLM is and why it is the standard

vLLM is an open-source (Apache 2.0) inference engine for large language models. It has become the de facto standard for self-hosted LLM serving because it solves the hard engineering problems of efficient inference: memory management, request scheduling, and batching.

The core innovation in vLLM is PagedAttention -- a memory management technique that eliminates the memory waste caused by the KV cache in transformer models. Without PagedAttention, conventional serving systems waste 60-80% of the memory they reserve for the KV cache through fragmentation and over-allocation -- memory that could otherwise hold additional concurrent requests. PagedAttention manages the KV cache like an operating system manages virtual memory: in fixed-size pages that can be allocated, freed, and shared efficiently. This single optimisation roughly doubles the number of concurrent requests a GPU can handle.
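To see why paging matters, it helps to put numbers on the KV cache. The sketch below uses illustrative transformer dimensions (layer count, KV heads, and head size are assumptions, not the actual Gemma architecture) and vLLM's default block size of 16 tokens:

```python
# Back-of-envelope KV cache sizing. Model dimensions below are
# illustrative, not the actual Gemma architecture.
num_layers = 46      # hypothetical transformer depth
num_kv_heads = 16    # KV heads (grouped-query attention)
head_dim = 128       # per-head dimension
dtype_bytes = 2      # bfloat16

# Each token stores one key and one value vector per layer.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# vLLM allocates the cache in fixed-size blocks (16 tokens by default),
# so the worst case per sequence is one partially filled block -- instead
# of the large contiguous over-reservation a naive allocator must make
# for the maximum possible sequence length.
block_size_tokens = 16
bytes_per_block = kv_bytes_per_token * block_size_tokens

print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")
print(f"{bytes_per_block / 1024 / 1024:.2f} MiB per block")
```

With these numbers, a naive allocator reserving 8,192 tokens up front for a request that only generates 500 commits hundreds of megabytes it never uses; paging caps that waste at under one block per sequence.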

What vLLM gives you out of the box:

  • OpenAI-compatible API. Drop-in replacement for the OpenAI API. Your application code does not need to know whether it is talking to GPT-4 or a self-hosted Gemma 4 instance.
  • Continuous batching. Incoming requests are batched dynamically rather than waiting for a fixed batch to fill. New requests join the batch as soon as there is GPU capacity, and completed requests leave without blocking others.
  • Speculative decoding. A small draft model generates candidate tokens that the larger model verifies in parallel, improving throughput by 1.5-2x for some workloads.
  • Tensor parallelism. Automatically splits large models across multiple GPUs. A 27B model on 2x L40S "just works" with --tensor-parallel-size 2.
  • Quantised model support. Serves AWQ, GPTQ, and GGUF quantised models natively.
  • Streaming. Server-Sent Events for token-by-token streaming to clients.

Starting a vLLM server serving Gemma 4 12B:

python -m vllm.entrypoints.openai.api_server \
  --model google/gemma-4-12b \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --port 8000

That is it. You now have an OpenAI-compatible API serving Gemma 4 12B. For production, you will add monitoring, load balancing, and quantisation -- but the barrier to getting started is remarkably low.
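One small production habit worth adopting from day one: do not route traffic to the server until it reports healthy, since model loading can take minutes. A minimal readiness poll against vLLM's health endpoint might look like this (the base URL and timeout values are arbitrary choices for illustration):

```python
# Poll the vLLM server's /health endpoint until it answers, so a
# deployment script or load balancer waits out model loading.
# Base URL and timeouts are illustrative defaults.
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url: str = "http://localhost:8000",
                     timeout_s: float = 300.0,
                     poll_s: float = 2.0) -> bool:
    """Return True once GET {base_url}/health answers 200, else False."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as r:
                if r.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not listening yet; keep polling
        time.sleep(poll_s)
    return False
```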

Question

Your team currently calls the OpenAI API from your RAG application. You want to switch to self-hosted Gemma 4 via vLLM. How much application code needs to change?