WebGPU: the GPU comes to the browser
For the first time in the history of the web platform, JavaScript applications can access the GPU with an API designed for general-purpose compute, not just rendering triangles.
WebGPU is the successor to WebGL. Where WebGL was designed for 3D graphics rendering (and AI researchers had to creatively abuse its texture processing pipelines to run matrix multiplications), WebGPU exposes compute shaders -- direct GPU computation pipelines designed for arbitrary parallel workloads, including neural network inference.
The practical difference is enormous. WebGL-based inference was slow, memory-inefficient, and fragile. WebGPU-based inference runs matrix multiplications on the GPU the way they were meant to run: as proper compute workloads with efficient memory management, parallel execution, and hardware-appropriate data types.
What WebGPU gives you:
- Compute shaders: run arbitrary parallel computations on the GPU, not just graphics
- Storage buffers: large, GPU-resident memory buffers for model weights
- Shader compilation: GPU programs compiled to native hardware instructions
- Efficient memory management: explicit control over GPU memory allocation and data transfer
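- FP16 and INT8 compute: reduced-precision arithmetic on supported hardware (FP16 via the optional shader-f16 feature; 8-bit integers via packed dot-product built-ins rather than a native 8-bit type)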
For AI inference specifically, this means you can load quantised model weights into GPU storage buffers and run the transformer's matrix multiplications as compute shader dispatches. The performance approaches what native applications achieve on the same hardware, typically reaching 60-80% of native GPU inference speed.
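To make "matrix multiplication as a compute shader dispatch" concrete, here is a minimal sketch. It is not how any particular inference framework does it: it multiplies an unquantised f32 weight matrix by a vector, whereas real engines use quantised weights and tiled kernels. The names `MATVEC_WGSL`, `matVecOnGpu`, and `matVecOnCpu` are illustrative, and `device` is assumed to be a `GPUDevice` obtained via `navigator.gpu.requestAdapter()` and `adapter.requestDevice()`.

```javascript
// WGSL compute shader: each invocation computes one output row.
const MATVEC_WGSL = `
struct Dims { rows : u32, cols : u32 }

@group(0) @binding(0) var<storage, read> weights : array<f32>;
@group(0) @binding(1) var<storage, read> input : array<f32>;
@group(0) @binding(2) var<storage, read_write> output : array<f32>;
@group(0) @binding(3) var<uniform> dims : Dims;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id : vec3<u32>) {
  let row = id.x;
  if (row >= dims.rows) { return; }
  var acc = 0.0;
  for (var c = 0u; c < dims.cols; c = c + 1u) {
    acc = acc + weights[row * dims.cols + c] * input[c];
  }
  output[row] = acc;
}`;

// CPU reference using the same row-major layout, handy for checking GPU results.
function matVecOnCpu(weights, input, rows, cols) {
  const out = new Float32Array(rows);
  for (let r = 0; r < rows; r++) {
    let acc = 0;
    for (let c = 0; c < cols; c++) acc += weights[r * cols + c] * input[c];
    out[r] = acc;
  }
  return out;
}

// Upload the data, dispatch the shader, and read the result back.
// `device` is a GPUDevice; `weights` and `input` are Float32Arrays.
async function matVecOnGpu(device, weights, input, rows, cols) {
  const upload = (data, usage) => {
    const buffer = device.createBuffer({
      size: data.byteLength,
      usage: usage | GPUBufferUsage.COPY_DST,
    });
    device.queue.writeBuffer(buffer, 0, data);
    return buffer;
  };
  const weightBuf = upload(weights, GPUBufferUsage.STORAGE);
  const inputBuf = upload(input, GPUBufferUsage.STORAGE);
  const dimsBuf = upload(new Uint32Array([rows, cols]), GPUBufferUsage.UNIFORM);
  const outputBuf = device.createBuffer({
    size: rows * 4,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
  });
  const readBuf = device.createBuffer({
    size: rows * 4,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });

  const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: {
      module: device.createShaderModule({ code: MATVEC_WGSL }),
      entryPoint: 'main',
    },
  });
  // Binding order matches the @binding indices in the shader: 0, 1, 2, 3.
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [weightBuf, inputBuf, outputBuf, dimsBuf].map((buffer, binding) => ({
      binding,
      resource: { buffer },
    })),
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil(rows / 64)); // 64 rows per workgroup
  pass.end();
  encoder.copyBufferToBuffer(outputBuf, 0, readBuf, 0, rows * 4);
  device.queue.submit([encoder.finish()]);

  await readBuf.mapAsync(GPUMapMode.READ);
  const result = new Float32Array(readBuf.getMappedRange().slice(0));
  readBuf.unmap();
  return result;
}
```

The weights live in a storage buffer and stay on the GPU between dispatches; only the small output vector is copied back. A transformer forward pass is, to a first approximation, many such dispatches in sequence.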
Why was WebGL inadequate for AI inference, despite being a GPU API?
Where WebGPU works today
As of early 2026, WebGPU support across major browsers:
Chrome/Chromium (including Edge, Brave, Opera): Supported by default since Chrome 113 (May 2023) on Windows, macOS, and ChromeOS. This is the most mature implementation and the primary target for in-browser AI. Chrome on Android enabled WebGPU by default in Chrome 121 (January 2024) on devices running Android 12 or later with supported GPUs. Linux support has lagged the other desktop platforms and may still require a flag, depending on the GPU and driver.
Safari: WebGPU is enabled by default from Safari 26 (September 2025) on macOS, iOS, and iPadOS. Earlier Safari releases exposed it only as an experimental feature or in Safari Technology Preview. Apple's implementation uses Metal as the backend, which suits Apple hardware well.
Firefox: WebGPU shipped by default in Firefox 141 (July 2025) on Windows after an extended development period, with macOS and other platforms following in later releases. Firefox uses its own WebGPU implementation backed by the wgpu library (written in Rust). Performance is broadly competitive with Chrome, though it varies by hardware and workload.
Practical implications:
- On desktop, WebGPU covers ~90%+ of users across Chrome, Edge, Safari, and Firefox.
- On mobile, coverage is lower. Chrome on Android supports WebGPU on recent devices and Android versions; Safari on iOS supports it only on recent iOS releases.
- The primary gap is older devices and browsers that have not updated. For enterprise deployments where you control the browser version (managed devices, internal tools), this is not a problem.
The WASM fallback: For devices without WebGPU support, WebAssembly (WASM) provides a CPU-based fallback. WASM inference is typically 3-10x slower than WebGPU but works on virtually every modern browser. This is important for progressive enhancement: your application works everywhere, it just runs faster on WebGPU-capable devices.
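A minimal feature-detection sketch for this progressive-enhancement pattern follows. The function name `pickBackend` and the returned strings are illustrative; `navigator.gpu` and `requestAdapter()` are the standard WebGPU entry points. Note that `navigator.gpu` existing is not enough: `requestAdapter()` can still resolve to `null` when no usable GPU is available.

```javascript
// Decide which backend to use. `nav` defaults to the browser's navigator;
// passing it in makes the function easy to exercise outside a browser.
async function pickBackend(nav = globalThis.navigator) {
  if (!nav || !nav.gpu) return 'wasm'; // WebGPU API not exposed at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? 'webgpu' : 'wasm'; // null adapter: no usable GPU
  } catch {
    return 'wasm'; // treat any failure as "no WebGPU"
  }
}
```

Calling `await pickBackend()` once at startup and storing the result lets the rest of the application branch on a single value.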
WebAssembly: near-native CPU in the browser
WebAssembly (WASM) is a binary instruction format that runs in the browser at near-native CPU speed. Where WebGPU accelerates computation on the GPU, WASM accelerates computation on the CPU.
For AI inference, WASM matters in two scenarios:
1. As a fallback when WebGPU is unavailable. Not every device has a capable GPU or a browser that supports WebGPU. WASM-based inference engines (like the WASM backend of ONNX Runtime Web) run on the CPU and work on any modern browser. Slower, but universal.
2. For non-GPU workloads in the AI pipeline. Even when you use WebGPU for the main model inference, parts of the pipeline run better on the CPU: tokenisation, text processing, embedding post-processing, vector search. WASM handles these efficiently.
ONNX Runtime Web is the primary framework for WASM-based AI in the browser. It supports both WASM (CPU) and WebGPU (GPU) backends, allowing you to write code once and run on whichever backend is available.
```javascript
import * as ort from 'onnxruntime-web';

// Automatically uses WebGPU if available, falls back to WASM
const session = await ort.InferenceSession.create('model.onnx', {
  executionProviders: ['webgpu', 'wasm']
});
```

WASM inference performance for a small model (2B parameters, INT4):
- Modern laptop CPU (Apple M3, Intel 13th gen): 5-15 tokens/second
- Older laptop CPU (Intel 10th gen): 2-8 tokens/second
- Mobile CPU (Snapdragon 8 Gen 3): 3-10 tokens/second
Compare with WebGPU on comparable laptop hardware:
- Laptop with discrete GPU (RTX 4060): 30-60 tokens/second
- Laptop with Apple M3 (unified memory): 25-45 tokens/second
- Laptop with Intel integrated GPU: 10-20 tokens/second
The performance gap is significant: WebGPU is roughly 3-10x faster than WASM for inference. But WASM at 5-15 tokens/second is still usable for many interactive applications. Users read at roughly 4-5 words per second, and a token is typically about three-quarters of an English word, so 5-15 tokens/second works out to about 4-11 words per second. That is enough for streaming output to keep pace with, or ahead of, reading speed.
You are building an internal enterprise tool that must work on all employee devices, including older machines without WebGPU support. What is the right architecture?
The browser GPU memory problem
Running AI inference in a browser tab is not the same as running it in a native application. The browser imposes constraints that fundamentally limit what you can deploy.
Shared GPU memory. The browser shares GPU resources with the operating system, other browser tabs, and other applications. A laptop with an 8GB GPU does not give you 8GB for your model. The OS compositor, other tabs with WebGPU content, and the browser's own rendering pipeline all consume GPU memory. In practice, expect to use 50-70% of total GPU memory for your model.
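As a rough sketch of that rule of thumb, the arithmetic can be wrapped in a small helper. The function names are illustrative, and the 50-70% fractions are simply the estimate from the paragraph above, not values reported by any browser API.

```javascript
const GIB = 1024 ** 3;

// Usable range for model memory, using the 50-70% rule of thumb.
function usableGpuBudget(totalBytes, lowFraction = 0.5, highFraction = 0.7) {
  return { low: totalBytes * lowFraction, high: totalBytes * highFraction };
}

// Conservative check: does the model fit even at the low end of the range?
// `modelBytes` should include headroom for activations and the KV cache,
// not just the weight file size.
function fitsConservatively(modelBytes, totalBytes) {
  return modelBytes <= usableGpuBudget(totalBytes).low;
}
```

For an 8GB GPU this gives a budget of roughly 4 to 5.6GB, so a 3GB model fits comfortably while a 5GB model is a gamble.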
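Adapter limits. WebGPU exposes limits that cap buffer sizes, notably maxBufferSize and maxStorageBufferBindingSize. The defaults guaranteed by the specification are modest (256 MiB and 128 MiB respectively); larger values, typically 256MB-2GB per storage buffer depending on the hardware and driver, must be requested explicitly when the device is created. Model weights that exceed the maximum buffer size must be split across multiple buffers, which the inference frameworks handle internally but which adds overhead.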
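The sketch below shows the two pieces involved, under simplifying assumptions. `requestDeviceWithLargeBuffers` and `planBufferSplits` are illustrative names; `adapter.limits` and the `requiredLimits` option of `requestDevice()` are standard WebGPU. Real frameworks split on tensor boundaries rather than at arbitrary byte offsets as done here.

```javascript
// Ask for the largest buffer sizes the adapter supports. Without
// requiredLimits, the device only gets the (lower) default limits.
async function requestDeviceWithLargeBuffers(adapter) {
  return adapter.requestDevice({
    requiredLimits: {
      maxBufferSize: adapter.limits.maxBufferSize,
      maxStorageBufferBindingSize: adapter.limits.maxStorageBufferBindingSize,
    },
  });
}

// Plan how to split `totalBytes` of weights into buffers of at most `maxBytes`.
function planBufferSplits(totalBytes, maxBytes) {
  const ranges = [];
  for (let offset = 0; offset < totalBytes; offset += maxBytes) {
    ranges.push({ offset, size: Math.min(maxBytes, totalBytes - offset) });
  }
  return ranges;
}
```

With a 1GB per-buffer limit, a 3GB weight file would be planned as three buffers, each bound to the shader separately.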
No direct memory mapping. Unlike native applications that can memory-map model files directly from disk to GPU, browser applications must download weights, store them in JavaScript memory, and then upload them to GPU buffers. This means you need sufficient JavaScript heap memory in addition to GPU memory.
Practical memory budgets by device class:
| Device | Total GPU | Available for AI | Max model (INT4) |
|---|---|---|---|
| Integrated GPU (Intel UHD) | 2-4 GB shared | 1-2.5 GB | 2B |
| Apple M1/M2 (unified memory) | 8-16 GB shared | 3-8 GB | 4-7B |
| Apple M3/M4 (unified memory) | 8-36 GB shared | 4-18 GB | 4-27B |
| Discrete GPU (RTX 3060, 12GB) | 12 GB dedicated | 6-8 GB | 7-12B |
| Discrete GPU (RTX 4090, 24GB) | 24 GB dedicated | 14-18 GB | 27B |
Apple Silicon deserves special mention. Its unified memory architecture means the CPU and GPU share the same physical memory, so weights do not have to cross a PCIe bus to reach the GPU. An M3 MacBook Pro with 36GB of unified memory can run a quantised 27B model in a browser tab -- something few other laptops can match.
For most enterprise deployments targeting a diverse device fleet, 2B-4B models at INT4 are the practical ceiling for browser inference. This covers Gemma 4 E2B and E4B, which deliver genuinely useful quality for common enterprise tasks.
Caching model weights in the browser
A quantised 2B model is roughly 1.5GB. A 4B model is roughly 3GB. Downloading this every time a user opens your application is unacceptable -- it would take 15-30 seconds even on a connection approaching 1 Gbps, and much longer on typical enterprise networks.
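The download-time figures follow from simple arithmetic, sketched below. The helper name is illustrative, sizes are taken as decimal gigabytes, and latency and protocol overhead are ignored, so real downloads will be somewhat slower.

```javascript
// Seconds to download `bytes` at `megabitsPerSecond`,
// ignoring latency and protocol overhead.
function downloadSeconds(bytes, megabitsPerSecond) {
  return (bytes * 8) / (megabitsPerSecond * 1e6);
}

downloadSeconds(1.5e9, 800); // 15  (1.5GB model, ~800 Mbps)
downloadSeconds(3e9, 800);   // 30  (3GB model, ~800 Mbps)
downloadSeconds(3e9, 100);   // 240 (3GB model, 100 Mbps)
```

At 100 Mbps the 3GB model takes four minutes, which is why a one-time download followed by local caching matters.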
The solution is IndexedDB, the browser's built-in key-value database. IndexedDB can store large binary blobs (model weight files) persistently across sessions. The user downloads the model once, it is cached in IndexedDB, and subsequent loads read from the local cache.
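The loading function in this section calls two helpers, `getFromIndexedDB` and `saveToIndexedDB`, without defining them. One possible minimal implementation is sketched here using the standard IndexedDB API. The database name, store name, and the `promisifyRequest` and `openCacheDb` helpers are assumptions for illustration only.

```javascript
const DB_NAME = 'model-cache'; // hypothetical database name
const STORE_NAME = 'weights';  // hypothetical object store name

// Wrap an IDBRequest in a Promise.
function promisifyRequest(request) {
  return new Promise((resolve, reject) => {
    request.onsuccess = () => resolve(request.result);
    request.onerror = () => reject(request.error);
  });
}

// Open the database, creating the object store on first use.
function openCacheDb(idb = globalThis.indexedDB) {
  const request = idb.open(DB_NAME, 1);
  request.onupgradeneeded = () => {
    request.result.createObjectStore(STORE_NAME); // keys supplied on put()
  };
  return promisifyRequest(request);
}

// Resolve to the cached value, or null if nothing is stored under cacheKey.
async function getFromIndexedDB(cacheKey) {
  const db = await openCacheDb();
  const store = db.transaction(STORE_NAME, 'readonly').objectStore(STORE_NAME);
  const value = await promisifyRequest(store.get(cacheKey));
  db.close();
  return value ?? null;
}

// Store the weights and resolve once the transaction has committed.
async function saveToIndexedDB(cacheKey, weights) {
  const db = await openCacheDb();
  const tx = db.transaction(STORE_NAME, 'readwrite');
  tx.objectStore(STORE_NAME).put(weights, cacheKey);
  await new Promise((resolve, reject) => {
    tx.oncomplete = resolve;
    tx.onerror = () => reject(tx.error);
    tx.onabort = () => reject(tx.error);
  });
  db.close();
}
```

Waiting for `oncomplete` rather than the `put` request's success event matters here: a write that exceeds the storage quota can fail when the transaction commits.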
The pattern:
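```javascript
// getFromIndexedDB, saveToIndexedDB, and updateProgressBar are
// application-supplied helpers.
async function loadModelWeights(modelUrl, cacheKey) {
  // Check the IndexedDB cache first
  const cached = await getFromIndexedDB(cacheKey);
  if (cached) {
    return cached; // Fast local load, no network request
  }

  // Download with progress reporting
  const response = await fetch(modelUrl);
  if (!response.ok) {
    throw new Error(`Model download failed: HTTP ${response.status}`);
  }
  // Content-Length is a string and may be absent (e.g. chunked transfer)
  const contentLength = Number(response.headers.get('Content-Length')) || 0;
  const reader = response.body.getReader();
  let receivedBytes = 0;
  const chunks = [];
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
    receivedBytes += value.length;
    if (contentLength > 0) {
      updateProgressBar(receivedBytes / contentLength);
    }
  }

  // Join the chunks into one contiguous Uint8Array
  const weights = new Uint8Array(receivedBytes);
  let offset = 0;
  for (const chunk of chunks) {
    weights.set(chunk, offset);
    offset += chunk.length;
  }

  // Cache for next time
  await saveToIndexedDB(cacheKey, weights);
  return weights;
}
```

Important details for production: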
- Storage quota: IndexedDB storage is limited by browser-specific quotas. Chrome lets the browser use up to about 80% of total disk space, with a single origin limited to roughly 60%. Safari has historically been more restrictive, particularly on iOS (a 1GB limit in some configurations). Always check available storage before caching and handle quota errors gracefully.
- Cache versioning: When you update the model, you need to invalidate the old cache. Include a version identifier in your cache key (e.g., gemma-e2b-q4km-v2) and clean up old versions on first load.
- Chunked storage: Very large files (4GB+) can be problematic in IndexedDB. Split them into smaller chunks (256MB each) and reassemble during load.
- Service Worker integration: For a truly offline-capable application, register a Service Worker that intercepts model weight requests and serves them from IndexedDB. This makes the caching transparent to the inference library.
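The first three points can be sketched as small helpers. All function names here are illustrative. `navigator.storage.estimate()` is the standard StorageManager API, but the usage and quota figures it returns are approximate, so quota errors still need handling at write time.

```javascript
const CHUNK_BYTES = 256 * 1024 * 1024; // 256MB chunks, as suggested above

// Storage quota: is there (approximately) room for `requiredBytes`?
// Returns false when the estimate API is unavailable, since space cannot be verified.
async function hasRoomFor(requiredBytes, storage = globalThis.navigator?.storage) {
  if (!storage || typeof storage.estimate !== 'function') return false;
  const { usage = 0, quota = 0 } = await storage.estimate();
  return quota - usage >= requiredBytes;
}

// Cache versioning: keys for the same model that are not the current version.
function staleCacheKeys(allKeys, modelPrefix, currentKey) {
  return allKeys.filter((key) => key.startsWith(modelPrefix) && key !== currentKey);
}

// Chunked storage: split a Uint8Array into independent chunks.
// slice() copies, so each chunk is stored on its own; a subarray() view
// would drag the whole underlying buffer along when it is cloned for storage.
function splitIntoChunks(bytes, chunkBytes = CHUNK_BYTES) {
  const chunks = [];
  for (let offset = 0; offset < bytes.byteLength; offset += chunkBytes) {
    chunks.push(bytes.slice(offset, offset + chunkBytes));
  }
  return chunks;
}

// Reassemble chunks into one contiguous Uint8Array during load.
function joinChunks(chunks) {
  const total = chunks.reduce((sum, chunk) => sum + chunk.byteLength, 0);
  const bytes = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    bytes.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return bytes;
}
```

A typical flow would be to call `hasRoomFor` before downloading, store each chunk under a key such as the versioned cache key plus a chunk index, and delete everything returned by `staleCacheKeys` once the new version is safely cached.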
Your in-browser AI application uses a 3GB model. Users report that the first load takes too long. What is the most impactful improvement?
Real-world tokens per second
Benchmarks are slippery because they depend on model, quantisation, hardware, browser, and workload. The numbers below are representative of what you can expect in production, measured on real hardware with real models in Chrome.
Gemma 4 E2B (Q4_K_M, ~1.5GB) via WebLLM:
| Hardware | Tokens/sec (prompt) | Tokens/sec (generation) | Time-to-first-token |
|---|---|---|---|
| Apple M3 Pro (18GB unified) | 180-250 | 35-50 | 80-120ms |
| Apple M1 (8GB unified) | 100-150 | 20-30 | 120-200ms |
| RTX 4060 Laptop (8GB) | 200-300 | 40-60 | 60-100ms |
| RTX 3060 (12GB) | 150-220 | 30-45 | 80-150ms |
| Intel Arc A770 (16GB) | 120-180 | 25-35 | 100-180ms |
| Intel Iris Xe (integrated) | 30-50 | 8-12 | 300-500ms |
Gemma 4 E4B (Q4_K_M, ~3GB) via WebLLM:
| Hardware | Tokens/sec (prompt) | Tokens/sec (generation) | Time-to-first-token |
|---|---|---|---|
| Apple M3 Pro (18GB unified) | 100-150 | 22-32 | 120-180ms |
| Apple M1 (8GB unified) | 50-80 | 12-18 | 200-350ms |
| RTX 4060 Laptop (8GB) | 120-180 | 28-40 | 80-130ms |
| RTX 3060 (12GB) | 90-130 | 20-30 | 100-200ms |
Interpretation:
Generation speed (tokens per second during output) is the most user-visible metric. At 20+ tokens/second, streaming output feels responsive -- text appears faster than a user can read. At 10-20 tokens/second, there is a slight but acceptable delay. Below 10 tokens/second, users notice lag.
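These thresholds are easy to apply at runtime, for example to decide whether to warn a user that their device will be slow. The sketch below is illustrative: the function names and labels are assumptions, and the cut-offs are the ones given in the paragraph above.

```javascript
// Generation speed from a token count and the elapsed wall-clock time.
function tokensPerSecond(tokenCount, elapsedMs) {
  return elapsedMs > 0 ? (tokenCount * 1000) / elapsedMs : 0;
}

// Map a generation speed onto the three bands described above.
function classifyGenerationSpeed(tps) {
  if (tps >= 20) return 'responsive'; // text appears faster than it can be read
  if (tps >= 10) return 'acceptable'; // slight delay
  return 'laggy';                     // users notice the wait
}
```

Measuring over a short warm-up generation (say, 50 tokens) gives a device-specific figure that is more reliable than guessing from the hardware name.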
For the E2B model, virtually all modern hardware with a discrete GPU or Apple Silicon delivers a responsive experience. Even integrated Intel GPUs produce usable (if not fast) generation speeds.
The E4B model is roughly 30-40% slower than E2B at generation and 40-50% slower at prompt processing on the same hardware, which is expected given the larger model size. It remains responsive on discrete GPUs and higher-end Apple Silicon.
Key takeaway: For enterprise deployment targeting a mix of employee hardware, Gemma 4 E2B is the safe choice -- it runs well on everything from 2023 onwards. E4B is the quality upgrade for organisations that can guarantee at least a discrete GPU or Apple M-series chip in every target device.
Module 4 -- Final Assessment
What is the fundamental advantage of WebGPU over WebGL for AI inference?
A laptop has 8GB of GPU memory. What is a realistic amount available for AI model inference in a browser tab?
Why is IndexedDB caching essential for in-browser AI applications?
What role does WebAssembly (WASM) play in a browser AI application that primarily uses WebGPU?