When edge means your data centre
Not all edge AI runs on employee devices. For many enterprises, "edge" means "our own data centre, not someone else's cloud." The data never leaves your network, but it is served from centralised infrastructure that you control.
This is the on-premises deployment pattern: a GPU cluster in your data centre (or co-location facility) running an open-source inference engine, serving your entire organisation through an internal API. Employees use AI through internal applications, but every inference request is processed on hardware you own, in a facility you control, under your security policies.
vLLM is the production standard for this pattern. It is an open-source inference engine that provides:
- High throughput: PagedAttention for efficient KV-cache management, and continuous batching to keep GPU utilisation high under concurrent load
- OpenAI-compatible API: Drop-in replacement for applications currently using OpenAI's API
- Broad model support: Gemma, Llama, Mistral, Qwen, Phi, and most Hugging Face models
- Quantisation support: AWQ, GPTQ, GGUF, and bitsandbytes, so quantised models can be served with minimal configuration
- Tensor parallelism: Split a single model across multiple GPUs for larger models or higher throughput
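As a concrete sketch of the pattern, the commands below launch a vLLM server and query it through its OpenAI-compatible endpoint. The model name, port, and GPU count are illustrative placeholders, not a recommendation; substitute your own model and hardware configuration.

```shell
# Install vLLM and launch its OpenAI-compatible server.
# --tensor-parallel-size splits the model across 2 GPUs (adjust to your cluster).
pip install vllm

vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --port 8000

# Any application written against the OpenAI API can now point at the
# internal endpoint instead of api.openai.com:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-3.1-8B-Instruct",
          "messages": [{"role": "user", "content": "Hello"}]
        }'
```

Because the server speaks the OpenAI wire format, migrating an internal application typically means changing only its base URL and API key, not its request or response handling.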