Edge AI & Private Inference

Quantisation for Edge Deployment

What quantisation is, why it matters for edge AI, the quality-size tradeoff with real benchmarks, quantisation methods, how to read model cards, and practical guidance on quantising models yourself.

Why quantisation is the key to edge deployment

A language model's weights are numbers. During training, those numbers are stored as 32-bit floating point values (FP32) -- each weight occupies 4 bytes of memory. A 7B parameter model at FP32 requires 28GB of memory just for the weights, before accounting for the memory needed during inference (KV cache, activations, overhead).

That is fine for a data centre GPU with 80GB of VRAM. It is not fine for a browser tab, a phone, or even most desktop GPUs.

Quantisation is the process of representing those weights using fewer bits. Instead of 32-bit floats, you use 16-bit, 8-bit, or 4-bit representations. The maths is straightforward:

Memory for weights (bytes) = parameters x bits per parameter / 8

| Model | FP32 (32-bit) | FP16 (16-bit) | INT8 (8-bit) | INT4 (4-bit) |
|---|---|---|---|---|
| 2B params | 8 GB | 4 GB | 2 GB | 1 GB |
| 7B params | 28 GB | 14 GB | 7 GB | 3.5 GB |
| 13B params | 52 GB | 26 GB | 13 GB | 6.5 GB |
| 27B params | 108 GB | 54 GB | 27 GB | 13.5 GB |
| 70B params | 280 GB | 140 GB | 70 GB | 35 GB |
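The formula is simple enough to check in code. A small sketch (the function name is mine) that reproduces the table rows:

```python
def weight_memory_gb(params_billions: float, bits: int) -> float:
    """Memory for weights alone: parameters x bits per parameter / 8 bytes."""
    bytes_total = params_billions * 1e9 * bits / 8
    return bytes_total / 1e9  # decimal gigabytes, matching the table

# Reproduce a few rows of the table above
for params in (2, 7, 13, 27, 70):
    row = [weight_memory_gb(params, bits) for bits in (32, 16, 8, 4)]
    print(f"{params}B params: {row} GB")
```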

These are weight-only numbers. Actual inference memory is higher because you also need space for the KV cache (which grows with context length) and runtime overhead. A practical rule of thumb: add 20-30% to the weight memory for inference overhead at short context lengths, and more for long contexts.

So a 7B model at INT4 needs roughly 3.5GB for weights plus ~1-1.5GB for inference overhead, totalling about 4.5-5GB. That fits in a discrete laptop GPU. A 2B model at INT4 needs roughly 1GB for weights plus ~0.5GB overhead, totalling about 1.5GB. That fits in a browser tab using WebGPU.
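The fit checks in the paragraph above can be sketched as a quick estimator. The 25% default here is my own choice from within the 20-30% rule of thumb, and applies only at short context lengths:

```python
def estimated_inference_gb(params_billions: float, bits: int,
                           overhead: float = 0.25) -> float:
    """Weight memory plus a rule-of-thumb fraction for KV cache and runtime.

    overhead=0.25 is an assumed short-context figure (20-30% rule of thumb);
    long contexts need substantially more.
    """
    weights_gb = params_billions * 1e9 * bits / 8 / 1e9
    return weights_gb * (1 + overhead)

def fits(params_billions: float, bits: int, budget_gb: float) -> bool:
    """Does the estimated total fit in the given memory budget?"""
    return estimated_inference_gb(params_billions, bits) <= budget_gb

# 2B at INT4: ~1.25GB estimated -> fits in a 4GB browser tab
# 7B at INT4: ~4.4GB estimated -> does not fit in 4GB
```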

This is why quantisation unlocks edge deployment. Without it, useful models simply do not fit on edge hardware.
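To make "fewer bits" concrete, here is a minimal symmetric 8-bit round-trip in plain Python. The names and the single absmax scale are my own simplification; real quantisers are more sophisticated (per-channel or per-group scales, calibration, outlier handling):

```python
def quantize_absmax_int8(weights):
    """Map floats to signed 8-bit ints using one absolute-max scale."""
    scale = max(abs(w) for w in weights) / 127  # largest weight maps to +/-127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the ints and the scale."""
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.003, 0.5]
q, scale = quantize_absmax_int8(weights)
recovered = dequantize(q, scale)
# Each weight now fits in 1 byte instead of 4; recovery is approximate,
# with per-weight error bounded by half the scale.
```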


You need to deploy a model in a browser tab where the maximum available GPU memory is 4GB. What is the largest model you can practically run?