Quantisation for Edge Deployment

Why quantisation is the key to edge deployment

A language model's weights are numbers. The reference precision is 32-bit floating point (FP32), where each weight occupies 4 bytes of memory. A 7B-parameter model at FP32 requires 28GB of memory just for the weights, before accounting for the memory needed during inference (KV cache, activations, overhead).

That is fine for a data centre GPU with 80GB of VRAM. It is not fine for a browser tab, a phone, or even most desktop GPUs.

Quantisation is the process of representing those weights using fewer bits. Instead of 32-bit floats, you use 16-bit, 8-bit, or 4-bit representations. The maths is straightforward:

Memory for weights (in bytes) = parameters x bits per parameter / 8

Model	FP32 (32-bit)	FP16 (16-bit)	INT8 (8-bit)	INT4 (4-bit)
2B params	8 GB	4 GB	2 GB	1 GB
7B params	28 GB	14 GB	7 GB	3.5 GB
13B params	52 GB	26 GB	13 GB	6.5 GB
27B params	108 GB	54 GB	27 GB	13.5 GB
70B params	280 GB	140 GB	70 GB	35 GB

These are weight-only numbers. Actual inference memory is higher because you also need space for the KV cache (which grows with context length) and runtime overhead. A practical rule of thumb: add 20-30% to the weight memory for inference overhead at short context lengths, and more for long contexts or for small models, where fixed runtime costs make up a larger share of the total.

So a 7B model at INT4 needs roughly 3.5GB for weights plus ~1-1.5GB for inference overhead, totalling about 4.5-5GB. That fits in a discrete laptop GPU. A 2B model at INT4 needs roughly 1GB for weights plus ~0.5GB overhead, totalling about 1.5GB. That fits in a browser tab using WebGPU.

This is why quantisation unlocks edge deployment. Without it, useful models simply do not fit on edge hardware.
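The arithmetic in this section can be sketched in a few lines of Python. The function names and the flat 25% overhead default are illustrative assumptions, not part of any library, and sizes are in decimal GB to match the table.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Weights-only memory in GB: parameters x bits per parameter / 8."""
    # One billion parameters at 8 bits (1 byte) each is 1 GB (decimal).
    return params_billion * bits_per_param / 8


def inference_memory_gb(params_billion: float, bits_per_param: int,
                        overhead_fraction: float = 0.25) -> float:
    """Weights plus a flat overhead fraction (assumed 25%, short contexts)."""
    weights = weight_memory_gb(params_billion, bits_per_param)
    return weights * (1 + overhead_fraction)


if __name__ == "__main__":
    # Reproduce the rows of the table above.
    for params in (2, 7, 13, 27, 70):
        row = [weight_memory_gb(params, bits) for bits in (32, 16, 8, 4)]
        print(f"{params}B params:", " / ".join(f"{gb:g} GB" for gb in row))
```

Treat the overhead fraction as a knob to tune rather than a constant: long contexts and very small models both push it above the default.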

You need to deploy a model in a browser tab where the maximum available GPU memory is 4GB. What is the largest model you can practically run?

What you lose when you quantise

Quantisation is not free. Reducing precision from 32 bits to 4 bits means each weight can represent fewer distinct values. Information is lost. The question is: how much quality do you actually lose?

The honest answer: less than you might expect, but more than zero.

Real-world benchmarks on representative tasks show a consistent pattern:

FP16 vs FP32: Essentially no measurable quality difference. 16-bit formats (FP16 or BF16) are the standard training and inference precision for most modern models. There is rarely a reason to use FP32 for inference.

INT8 vs FP16: Typically 0.5-2% degradation on aggregate benchmarks. On most enterprise tasks (summarisation, extraction, classification), the difference is imperceptible to human evaluators. INT8 is safe for virtually all production use cases.

INT4 vs FP16: Typically 2-5% degradation on aggregate benchmarks. The degradation is not uniform -- it is more pronounced on tasks requiring precise numerical reasoning, code generation with exact syntax, and long-chain logical deduction. For summarisation and extraction, INT4 quality is usually acceptable. For complex reasoning, you may notice the difference.

Below INT4 (3-bit, 2-bit): Quality degrades significantly. Models become noticeably less coherent, hallucinate more frequently, and struggle with instruction following. Not recommended for production enterprise use cases.
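To get a feel for why the loss grows as the bit width shrinks, the sketch below applies naive symmetric round-to-nearest quantisation to synthetic weights and measures the reconstruction error. This is a deliberately simplified illustration: the Gaussian weights are made up, the function names are hypothetical, and production methods (k-quants, GPTQ, AWQ) use per-group scales and calibration that reduce the error considerably. Weight error is also only a proxy for task quality.

```python
import random


def quantise_round_to_nearest(weights, bits):
    """Symmetric per-tensor round-to-nearest quantisation, then dequantise."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:                          # all-zero tensor: nothing to quantise
        return list(weights)
    restored = []
    for w in weights:
        code = max(-qmax, min(qmax, round(w / scale)))   # integer code
        restored.append(code * scale)                    # back to float
    return restored


def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    restored = quantise_round_to_nearest(weights, bits)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]   # synthetic weights
    for bits in (8, 4, 3, 2):
        print(f"{bits}-bit: mean abs error = {mean_abs_error(weights, bits):.6f}")
```

Each bit removed halves the number of representable values, so the error climbs steeply towards the low end, which is consistent with the pattern described above.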

The practical implication for edge deployment:

Browser and mobile: INT4 is your primary option. The quality tradeoff is acceptable for most tasks on small models (2-4B).
Desktop with discrete GPU: INT4 or INT8, depending on available VRAM. INT8 gives you a meaningful quality bump if you have the memory.
On-premises server: INT8 is the sweet spot. You have the VRAM budget, and the quality improvement over INT4 is worth it. Use FP16 only if quality on complex reasoning tasks is critical and you have the GPU memory.
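A first-pass version of this decision can be sketched as a memory-fit check: try the highest precision first and fall back until the estimate fits. The function name and the 25% overhead default are assumptions for illustration. The check considers memory only, so the quality and task considerations above still apply.

```python
def pick_precision(params_billion, vram_gb, overhead_fraction=0.25):
    """Return the highest-precision option whose estimated footprint fits.

    Returns None when even INT4 does not fit in the given VRAM budget.
    """
    for name, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
        weights_gb = params_billion * bits / 8          # weights-only estimate
        if weights_gb * (1 + overhead_fraction) <= vram_gb:
            return name
    return None


if __name__ == "__main__":
    print(pick_precision(7, 12))    # 7B model on a 12 GB desktop GPU
    print(pick_precision(27, 24))   # 27B model on a 24 GB GPU
```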

You are deploying a model for contract clause extraction -- identifying specific legal provisions in contracts. Quality of extraction is critical. You have a desktop GPU with 12GB VRAM and want to use a 7B model. What quantisation level should you choose?

GPTQ, AWQ, GGUF, and bitsandbytes

Not all quantisation is the same. Different methods produce different quality-to-size tradeoffs, and they target different inference engines. Here is what you need to know about the major methods.

GGUF (llama.cpp format)

GGUF is the format used by llama.cpp, the most widely deployed local inference engine. It is a single-file format that contains the model weights, tokenizer, and metadata in one package.

GGUF quantisation uses a scheme called k-quants that applies different quantisation levels to different parts of the model based on sensitivity analysis. This is why you see designations like Q4_K_M -- the "K" indicates k-quant, and the "M" indicates medium (between small and large variants).

Common GGUF quantisation levels:

Q2_K: 2-bit. Very small, very lossy. Not recommended for production.
Q3_K_S / Q3_K_M / Q3_K_L: 3-bit variants. Small/medium/large differ in which layers get higher precision. Marginal for production.
Q4_K_S / Q4_K_M: 4-bit variants. Q4_K_M is the most popular choice for edge deployment. Good balance of size and quality.
Q5_K_S / Q5_K_M: 5-bit variants. Noticeably better quality than Q4, at roughly 20% more memory. Strong choice when you have the headroom.
Q6_K: 6-bit. Near-INT8 quality at lower memory. Excellent if you have the VRAM.
Q8_0: 8-bit. Minimal quality loss. The "safe" choice for on-premises deployment.

The S/M/L suffixes (small, medium, large) indicate how much of the model receives higher-precision treatment: the larger variants keep more of the sensitive tensors at a higher bit width. Q4_K_M means 4-bit base with a medium allocation of higher-precision tensors. In practice, Q4_K_M is almost always the right choice for 4-bit deployment.
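The naming scheme is regular enough to decode programmatically. The parser below is a small sketch that handles only the names listed in this section (for example Q4_K_M, Q6_K, Q8_0); the class and function names are hypothetical, and other GGUF types are not covered.

```python
import re
from typing import NamedTuple, Optional


class GgufQuant(NamedTuple):
    bits: int
    is_k_quant: bool
    variant: Optional[str]    # "small", "medium", "large", or None


# Q<bits>_<K or legacy scheme digit>[_<S|M|L>]
_PATTERN = re.compile(r"^Q(\d)_(K|\d)(?:_([SML]))?$")
_VARIANTS = {"S": "small", "M": "medium", "L": "large"}


def parse_gguf_quant(name: str) -> GgufQuant:
    """Split a name such as 'Q4_K_M' into bit width, k-quant flag and variant."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised quantisation name: {name!r}")
    bits, scheme, suffix = match.groups()
    return GgufQuant(int(bits), scheme == "K", _VARIANTS.get(suffix))


if __name__ == "__main__":
    for name in ("Q2_K", "Q3_K_L", "Q4_K_M", "Q6_K", "Q8_0"):
        print(name, parse_gguf_quant(name))
```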

GPTQ (GPU-optimised post-training quantisation)

GPTQ uses a calibration dataset to determine optimal quantisation parameters for each weight matrix. It produces models optimised for GPU inference and integrates with frameworks like Transformers, vLLM, and text-generation-inference.

Requires a GPU for inference (no CPU fallback).
Typically used for 4-bit and 8-bit quantisation.
Slightly better quality than naive round-to-nearest quantisation because of calibration.
Widely available on Hugging Face with pre-quantised models.

AWQ (Activation-Aware Weight Quantisation)

AWQ observes that not all weights are equally important -- some contribute more to the model's output than others. It identifies the most important weights (based on activation patterns) and preserves their precision while aggressively quantising less important weights.

Generally produces slightly better quality than GPTQ at the same bit width.
Supported by vLLM, making it a strong choice for on-premises serving.
4-bit AWQ models are a common choice for vLLM deployment.
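As a rough illustration, loading an AWQ checkpoint with vLLM's offline Python API looks like the sketch below. The helper name is hypothetical, the model path is whatever AWQ repository or directory you supply, and running it requires vLLM and a supported GPU. Argument names can change between vLLM releases, so check them against the version you have installed.

```python
def generate_with_awq(model_path: str, prompt: str, max_tokens: int = 200) -> str:
    """Load an AWQ-quantised checkpoint in vLLM and return one completion."""
    # Imported inside the function so this file can be imported on a
    # machine that has neither vLLM nor a GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path, quantization="awq")
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    # One prompt in, one request out; take its first completion.
    return outputs[0].outputs[0].text
```

For a long-running service you would normally start vLLM's server process instead of calling the engine per request; the function above is only meant to show where the quantisation method is specified.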

bitsandbytes

bitsandbytes is a library for quantised inference within the Hugging Face Transformers ecosystem. It supports 8-bit (LLM.int8()) and 4-bit (NF4, FP4) quantisation.

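Integrates directly with from_pretrained() in Transformers (for example AutoModelForCausalLM.from_pretrained()): load any supported model in 4-bit by passing a quantisation config.
Supports NF4 (4-bit NormalFloat), which is optimised for the statistical distribution of neural network weights. In Transformers' BitsAndBytesConfig the default 4-bit type is FP4, so NF4 has to be requested explicitly with bnb_4bit_quant_type="nf4".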
Best for experimentation and development. For production serving, vLLM with AWQ or GPTQ is more performant.
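A minimal sketch of 4-bit loading through Transformers is shown below. The helper name and the choice of bfloat16 as compute dtype are assumptions for illustration; running it requires torch, transformers, bitsandbytes and a supported GPU.

```python
# Settings passed to BitsAndBytesConfig; NF4 is requested explicitly.
NF4_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,   # also quantise the quantisation constants
}


def load_in_nf4(model_id: str):
    """Load a causal language model with 4-bit NF4 weights via bitsandbytes."""
    # Imported inside the function so the settings above can be inspected
    # on a machine without the GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(bnb_4bit_compute_dtype=torch.bfloat16, **NF4_SETTINGS)
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=config,
        device_map="auto",               # place layers on the available device(s)
    )
```

The quantisation happens at load time from the original checkpoint, which is convenient for experiments but means every process start pays the conversion cost.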

You are deploying a model to an on-premises vLLM cluster for production inference. Which quantisation method should you use?

How to read a quantised model card

When you browse Hugging Face for quantised models, the naming conventions tell you what you are getting. Here is how to decode them.

A typical quantised model name looks like:

google/gemma-4-27b-it-GGUF or TheBloke/gemma-4-27B-IT-GPTQ or author/model-name-AWQ

Within a GGUF repository, you will see multiple files:

gemma-4-27b-it-Q2_K.gguf      (10.3 GB)
gemma-4-27b-it-Q3_K_M.gguf    (12.8 GB)
gemma-4-27b-it-Q4_K_M.gguf    (15.6 GB)
gemma-4-27b-it-Q5_K_M.gguf    (18.4 GB)
gemma-4-27b-it-Q6_K.gguf      (21.1 GB)
gemma-4-27b-it-Q8_0.gguf      (27.5 GB)

The "it" suffix means instruction-tuned (the model has been fine-tuned to follow instructions, as opposed to a base model that just does text completion). For enterprise deployment, you almost always want the instruction-tuned variant.

The memory requirement calculator:

For GGUF models, the file size on disk closely approximates the VRAM needed for the weights. Add overhead for the KV cache:

Total VRAM = model file size + KV cache size

KV cache per token = 2 x num_layers x num_kv_heads x head_dim x precision_bytes

For a 27B model with 32 layers, 8 KV heads, 128-dim heads, at FP16:

KV cache per token = 2 x 32 x 8 x 128 x 2 bytes = 131,072 bytes = 128 KB per token
For a 4K context: 128 KB x 4,096 = 512 MB
For an 8K context: 128 KB x 8,192 = 1 GB

So that Q4_K_M 27B model (15.6 GB file) with an 8K context needs roughly 15.6 + 1 + 0.5 (runtime overhead) = ~17 GB of VRAM. That fits on a single 24GB GPU (for example an RTX 4090 or RTX A5000) with room to spare.
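The same calculation as a small Python sketch, using the illustrative architecture numbers from the worked example (32 layers, 8 KV heads, 128-dim heads). The function names and the 0.5 GB runtime overhead default are assumptions, and, like the text, the code treats 1 GB as 1,024 MB for the cache.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, precision_bytes=2):
    """Bytes of KV cache stored for each token of context."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * precision_bytes


def total_vram_gb(model_file_gb, context_tokens, num_layers, num_kv_heads,
                  head_dim, precision_bytes=2, runtime_overhead_gb=0.5):
    """Model file size + KV cache for the full context + runtime overhead."""
    per_token = kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim, precision_bytes
    )
    kv_cache_gb = per_token * context_tokens / 1024 ** 3
    return model_file_gb + kv_cache_gb + runtime_overhead_gb


if __name__ == "__main__":
    per_token = kv_cache_bytes_per_token(32, 8, 128)
    print(f"KV cache per token: {per_token // 1024} KB")
    for context in (4096, 8192, 32768):
        total = total_vram_gb(15.6, context, 32, 8, 128)
        print(f"{context} tokens: ~{total:.1f} GB total")
```

Because the cache scales linearly with context length, quadrupling the context from 8K to 32K adds another 3 GB on top of the 8K figure in this example.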

Quick reference for deployment targets:

Target	Available VRAM	Max model (INT4)	Recommended
Browser (WebGPU)	2-6 GB shared	4B	E2B or E4B
Phone (2024+)	4-8 GB shared	4-7B	E2B or E4B
Laptop (integrated GPU)	4-8 GB shared	4-7B	E4B
Laptop (discrete GPU)	6-16 GB dedicated	7-13B	7-12B
Desktop (RTX 4090)	24 GB dedicated	27B	27B Q4_K_M
Server (A100 40GB)	40 GB dedicated	70B (tight)	27B Q8_0
Server (A100 80GB)	80 GB dedicated	70B comfortable	70B Q4_K_M
Server (H100 80GB)	80 GB dedicated	70B+	70B Q5_K_M

Practical: quantising with llama.cpp

Most of the time, you will download pre-quantised models from Hugging Face. But there are cases where you need to quantise yourself: a new model release, a custom fine-tune, or a specific quantisation level that is not available pre-built.

The standard tool for GGUF quantisation is llama.cpp's convert and quantize utilities.

Step 1: Convert from Hugging Face format to GGUF

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Install Python dependencies
pip install -r requirements.txt

# Convert a Hugging Face model to GGUF (FP16)
python convert_hf_to_gguf.py /path/to/model --outtype f16 --outfile model-f16.gguf

Step 2: Quantise to your target precision

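# Build the tools with CMake
# (older llama.cpp checkouts used `make quantize` instead)
cmake -B build
cmake --build build --config Release

# Quantise from FP16 to Q4_K_M
# (the binary was named ./quantize in older llama.cpp versions)
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M

The quantisation step takes 2-10 minutes depending on model size and your CPU. It does not require a GPU.

Step 3: Verify the output

# Test inference with the quantised model
# (the binary was named ./main in older llama.cpp versions)
./build/bin/llama-cli -m model-Q4_K_M.gguf -p "Summarise the key points:" -n 200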
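If you quantise regularly, the convert and quantise steps can be scripted. The sketch below builds the two commands as argument lists and runs them in order. The function names are hypothetical, and the default binary location assumes a CMake build of llama.cpp (build/bin/llama-quantize); pass a different quantize_binary if your checkout places the tool elsewhere.

```python
import subprocess
import sys
from pathlib import Path


def gguf_commands(model_dir, quant_type="Q4_K_M", llama_cpp_dir="llama.cpp",
                  quantize_binary="build/bin/llama-quantize"):
    """Build the convert and quantise commands as argument lists."""
    root = Path(llama_cpp_dir)
    f16_file = "model-f16.gguf"
    quant_file = f"model-{quant_type}.gguf"
    # Step 1: Hugging Face checkpoint -> FP16 GGUF
    convert = [sys.executable, str(root / "convert_hf_to_gguf.py"), str(model_dir),
               "--outtype", "f16", "--outfile", f16_file]
    # Step 2: FP16 GGUF -> target quantisation level
    quantise = [str(root / quantize_binary), f16_file, quant_file, quant_type]
    return [convert, quantise]


def run_gguf_pipeline(model_dir, **options):
    """Run both steps, stopping at the first one that fails."""
    for command in gguf_commands(model_dir, **options):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    for command in gguf_commands("/path/to/model"):
        print(" ".join(command))
```

Keeping the FP16 file around lets you produce several quantisation levels from one conversion, for example by calling gguf_commands again with quant_type="Q5_K_M" and running only the second command.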

For GPTQ and AWQ quantisation, the process involves running a calibration dataset through the model to determine optimal quantisation parameters. This requires a GPU and takes longer:

# AWQ quantisation using autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "google/gemma-4-27b-it"
quant_path = "gemma-4-27b-it-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# Quantise with calibration data (AutoAWQ uses its default calibration
# set unless you pass your own via the calib_data argument)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

The calibration dataset matters for GPTQ and AWQ. Using a calibration set that matches your deployment domain (e.g., legal text if you are deploying for legal tasks) can produce slightly better quantisation for your specific use case.
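One way to supply domain-specific calibration text is sketched below. It assumes your sample documents are plain .txt files in a folder and that the installed AutoAWQ version accepts a list of strings for the calib_data argument of quantize(); both function names and the filtering thresholds are illustrative, so verify them against your setup before relying on the result.

```python
from pathlib import Path


def load_calibration_texts(folder, max_samples=128, min_chars=200):
    """Collect up to max_samples documents from *.txt files in a folder."""
    texts = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if len(text) >= min_chars:        # skip fragments too short to be useful
            texts.append(text)
        if len(texts) == max_samples:
            break
    return texts


def quantise_with_domain_calibration(model_path, quant_path, texts, quant_config):
    """AWQ-quantise a model using your own calibration texts (needs a GPU)."""
    # Imported here so the text-loading helper works without autoawq installed.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model.quantize(tokenizer, quant_config=quant_config, calib_data=texts)
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
```

Use documents that resemble real production inputs, and keep them separate from any data you later use to evaluate the quantised model.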

You have fine-tuned Gemma 4 12B on your company's internal documents. You need to deploy it in-browser via WebGPU. What is your quantisation workflow?

Module 3 -- Final Assessment

A 7B parameter model quantised to INT4 requires approximately how much memory for weights alone?

What does the 'K' in GGUF quantisation names like Q4_K_M indicate?

You need to serve a quantised model in production on a vLLM cluster. Which quantisation format is recommended?

At what quantisation level does quality degradation become significant enough to potentially affect production enterprise use cases?