Enterprise RAG on Your Own Infrastructure

vLLM Deployment Guide

What vLLM is and why it dominates self-hosted inference, hardware selection with real cost-performance numbers, quantisation trade-offs, batching and throughput optimisation, multi-GPU serving, monitoring, and cost modelling.

What vLLM is and why it is the standard

vLLM is an open-source (Apache 2.0) inference engine for large language models. It has become the de facto standard for self-hosted LLM serving because it solves the hard engineering problems of efficient inference: memory management, request scheduling, and batching.

The core innovation in vLLM is PagedAttention -- a memory management technique that eliminates the memory waste caused by the KV cache in transformer models. Without PagedAttention, conventional serving systems waste 60-80% of the memory they reserve for the KV cache through fragmentation and over-allocation -- memory that could otherwise hold additional concurrent requests. PagedAttention manages the KV cache like an operating system manages virtual memory: in fixed-size pages that can be allocated, freed, and shared efficiently. This single optimisation roughly doubles the number of concurrent requests a GPU can handle.
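To see why paging matters, it helps to put numbers on the KV cache. The sketch below uses illustrative transformer dimensions (layer count, KV heads, and head size are assumptions, not the actual Gemma architecture) and vLLM's default block size of 16 tokens:

```python
# Back-of-envelope KV cache sizing. Model dimensions below are
# illustrative, not the actual Gemma architecture.
num_layers = 46      # hypothetical transformer depth
num_kv_heads = 16    # KV heads (grouped-query attention)
head_dim = 128       # per-head dimension
dtype_bytes = 2      # bfloat16

# Each token stores one key and one value vector per layer.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# vLLM allocates the cache in fixed-size blocks (16 tokens by default),
# so the worst case per sequence is one partially filled block -- instead
# of the large contiguous over-reservation a naive allocator must make
# for the maximum possible sequence length.
block_size_tokens = 16
bytes_per_block = kv_bytes_per_token * block_size_tokens

print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")
print(f"{bytes_per_block / 1024 / 1024:.2f} MiB per block")
```

With these numbers, a naive allocator reserving 8,192 tokens up front for a request that only generates 500 commits hundreds of megabytes it never uses; paging caps that waste at under one block per sequence.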

What vLLM gives you out of the box:

  • OpenAI-compatible API. Drop-in replacement for the OpenAI API. Your application code does not need to know whether it is talking to GPT-4 or a self-hosted Gemma 4 instance.
  • Continuous batching. Incoming requests are batched dynamically rather than waiting for a fixed batch to fill. New requests join the batch as soon as there is GPU capacity, and completed requests leave without blocking others.
  • Speculative decoding. A small draft model generates candidate tokens that the larger model verifies in parallel, improving throughput by 1.5-2x for some workloads.
  • Tensor parallelism. Automatically splits large models across multiple GPUs. A 27B model on 2x L40S "just works" with --tensor-parallel-size 2.
  • Quantised model support. Serves AWQ, GPTQ, and GGUF quantised models natively.
  • Streaming. Server-Sent Events for token-by-token streaming to clients.

Starting a vLLM server serving Gemma 4 12B:

python -m vllm.entrypoints.openai.api_server \
  --model google/gemma-4-12b \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --port 8000

That is it. You now have an OpenAI-compatible API serving Gemma 4 12B. For production, you will add monitoring, load balancing, and quantisation -- but the barrier to getting started is remarkably low.
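One small production habit worth adopting from day one: do not route traffic to the server until it reports healthy, since model loading can take minutes. A minimal readiness poll against vLLM's health endpoint might look like this (the base URL and timeout values are arbitrary choices for illustration):

```python
# Poll the vLLM server's /health endpoint until it answers, so a
# deployment script or load balancer waits out model loading.
# Base URL and timeouts are illustrative defaults.
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url: str = "http://localhost:8000",
                     timeout_s: float = 300.0,
                     poll_s: float = 2.0) -> bool:
    """Return True once GET {base_url}/health answers 200, else False."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as r:
                if r.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not listening yet; keep polling
        time.sleep(poll_s)
    return False
```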

Question

Your team currently calls the OpenAI API from your RAG application. You want to switch to self-hosted Gemma 4 via vLLM. How much application code needs to change?