Why quantisation is the key to edge deployment
A language model's weights are numbers. During training, those numbers are typically stored as 32-bit floating-point values (FP32) -- each weight occupies 4 bytes of memory. A 7B-parameter model at FP32 therefore requires 28GB of memory for the weights alone, before accounting for the memory needed during inference (KV cache, activations, overhead).
That is fine for a data centre GPU with 80GB of VRAM. It is not fine for a browser tab, a phone, or even most desktop GPUs.
Quantisation is the process of representing those weights using fewer bits. Instead of 32-bit floats, you use 16-bit, 8-bit, or 4-bit representations. The maths is straightforward:
Memory for weights (in bytes) = parameters x bits per parameter / 8
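As a sanity check on that formula, here is a minimal Python sketch (the function name is mine; the table below uses decimal gigabytes, i.e. 10^9 bytes):

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Weight-only memory in decimal GB: parameters x bits / 8 bytes."""
    bytes_total = params_billion * 1e9 * bits / 8
    return bytes_total / 1e9

# Reproduce the table rows: FP32, FP16, INT8, INT4
for params in (2, 7, 13, 27, 70):
    row = [weight_memory_gb(params, b) for b in (32, 16, 8, 4)]
    print(f"{params}B params: {row}")
```

Running this prints exactly the values in the table that follows.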
| Model | FP32 (32-bit) | FP16 (16-bit) | INT8 (8-bit) | INT4 (4-bit) |
|---|---|---|---|---|
| 2B params | 8 GB | 4 GB | 2 GB | 1 GB |
| 7B params | 28 GB | 14 GB | 7 GB | 3.5 GB |
| 13B params | 52 GB | 26 GB | 13 GB | 6.5 GB |
| 27B params | 108 GB | 54 GB | 27 GB | 13.5 GB |
| 70B params | 280 GB | 140 GB | 70 GB | 35 GB |
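The "fewer bits" idea itself can be sketched as a symmetric per-tensor INT8 quantiser using NumPy. This is illustrative only -- production formats typically use per-group scales and more elaborate encodings -- but it shows the core round-trip of scale, round, and dequantise:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map weights in [-max|w|, max|w|] onto the INT8 range [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 weights from INT8 values and a scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)  # stand-in for a weight tensor
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(np.abs(w - w_hat).max())  # reconstruction error, at most scale / 2
```

Each weight now occupies 1 byte instead of 4, at the cost of a small, bounded rounding error per weight.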
These are weight-only numbers. Actual inference memory is higher because you also need space for the KV cache (which grows with context length) and runtime overhead. A practical rule of thumb: add 20-30% to the weight memory for inference overhead at short context lengths, and more for long contexts.
So a 7B model at INT4 needs roughly 3.5GB for weights plus ~1-1.5GB of inference overhead, totalling about 4.5-5GB. That fits on a discrete laptop GPU. A 2B model at INT4 needs roughly 1GB for weights plus ~0.5GB of overhead, totalling about 1.5GB -- proportionally more than the 20-30% rule, because the KV cache and runtime do not shrink along with the weights. That fits in a browser tab using WebGPU.
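That back-of-envelope estimate can be wrapped in a small helper (a hypothetical function of my own, applying the overhead rule from the text as an adjustable fraction):

```python
def inference_memory_gb(params_billion: float, bits: int,
                        overhead: float = 0.3) -> float:
    """Rough total inference memory: weights plus an overhead fraction
    (20-30% at short context lengths; larger for long contexts or
    small models, since KV cache does not shrink with weight bits)."""
    weights_gb = params_billion * 1e9 * bits / 8 / 1e9
    return weights_gb * (1 + overhead)

print(inference_memory_gb(7, 4))       # 7B at INT4, ~30% overhead
print(inference_memory_gb(2, 4, 0.5))  # 2B at INT4, ~50% overhead
```

This reproduces the 7B and 2B estimates above: roughly 4.5GB and 1.5GB respectively.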
This is why quantisation unlocks edge deployment. Without it, useful models simply do not fit on edge hardware.