What vLLM is and why it is the standard
vLLM is an open-source (Apache 2.0) inference engine for large language models. It has become the de facto standard for self-hosted LLM serving because it solves the hard engineering problems of efficient inference: memory management, request scheduling, and batching.
The core innovation in vLLM is PagedAttention -- a memory management technique that eliminates the waste caused by the KV cache in transformer models. Without PagedAttention, naive contiguous allocation can waste 60-80% of the GPU memory reserved for the KV cache on fragmentation -- a serious problem when serving a model in the 27B class. PagedAttention manages the KV cache the way an operating system manages virtual memory: in fixed-size pages that can be allocated, freed, and shared efficiently. This single optimisation roughly doubles the number of concurrent requests a GPU can handle.
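To make the virtual-memory analogy concrete, here is a toy sketch of the idea -- not vLLM's actual implementation, and the block size and class names are illustrative only. Each sequence keeps a "page table" of fixed-size block IDs instead of one contiguous KV cache slab, so at most one partially filled block per sequence is ever wasted:

```python
BLOCK_SIZE = 16  # tokens per KV cache block (illustrative value)

class BlockAllocator:
    """Free-list of physical KV cache blocks, like an OS page allocator."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class Sequence:
    """Holds a 'page table' mapping logical positions to physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.blocks = []
        self.num_tokens = 0

    def append_token(self):
        # A new block is grabbed only when the current one fills up, so
        # fragmentation is bounded by BLOCK_SIZE - 1 slots per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self):
        # When the request finishes, its blocks return to the pool
        # immediately and can be reused by any other sequence.
        for b in self.blocks:
            self.allocator.release(b)
        self.blocks = []

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):      # generate a 40-token sequence
    seq.append_token()
print(len(seq.blocks))   # 3 blocks: ceil(40 / 16)
```

Contrast this with contiguous allocation, which must reserve space for the maximum possible sequence length up front and cannot return it until the whole request completes.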
What vLLM gives you out of the box:
- OpenAI-compatible API. Drop-in replacement for the OpenAI API. Your application code does not need to know whether it is talking to GPT-4 or a self-hosted Gemma 4 instance.
- Continuous batching. Incoming requests are batched dynamically rather than waiting for a fixed batch to fill. New requests join the batch as soon as there is GPU capacity, and completed requests leave without blocking others.
- Speculative decoding. A small draft model generates candidate tokens that the larger model verifies in parallel, improving throughput by 1.5-2x for some workloads.
- Tensor parallelism. Automatically splits large models across multiple GPUs. A 27B model on 2x L40S "just works" with --tensor-parallel-size 2.
- Quantised model support. Serves AWQ, GPTQ, and GGUF quantised models natively.
- Streaming. Server-Sent Events for token-by-token streaming to clients.
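Continuous batching is the feature in this list that is easiest to misread, so here is a toy simulation of the scheduling idea -- not vLLM's scheduler, and the function and its parameters are illustrative. New requests are admitted the moment a batch slot frees up, rather than waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batching(requests, batch_slots):
    """requests: list of (name, tokens_to_generate). Returns finish order."""
    waiting = deque(requests)
    running = {}      # name -> tokens still to generate
    finished = []
    while waiting or running:
        # The key idea: admit waiting requests as soon as capacity exists,
        # instead of only when the entire batch has completed.
        while waiting and len(running) < batch_slots:
            name, n = waiting.popleft()
            running[name] = n
        # One decoding step: every running request emits one token.
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]       # finished requests leave the batch
                finished.append(name)   # without blocking the others
    return finished

# With 2 slots: A and B start together; when A finishes, C joins
# immediately while B is still generating.
print(continuous_batching([("A", 2), ("B", 5), ("C", 1)], batch_slots=2))
```

With static batching, C would have to wait for both A and B to finish; here it starts as soon as A's slot frees up, which is where the throughput gain comes from.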
Starting a vLLM server serving Gemma 4 12B:

```shell
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-12b \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --port 8000
```

That is it. You now have an OpenAI-compatible API serving Gemma 4 12B. For production, you will add monitoring, load balancing, and quantisation -- but the barrier to getting started is remarkably low.
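Because the server speaks the OpenAI API, calling it is a standard chat-completions request. A minimal stdlib sketch, assuming the server above is running on localhost:8000 (the prompt and max_tokens value are placeholders):

```python
import json
import urllib.request

# Standard OpenAI chat-completions request body; vLLM accepts the
# same schema at its /v1/chat/completions endpoint.
payload = {
    "model": "google/gemma-4-12b",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# With the server running, this returns an OpenAI-style response:
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
print(req.full_url)
```

Equivalently, you can point the official `openai` client library at the server by setting its base URL to http://localhost:8000/v1 -- application code written against the OpenAI API needs no other changes.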