Edge AI & Private Inference

WebGPU and WebAssembly for AI

What WebGPU and WebAssembly are, their browser support status, GPU memory constraints, the IndexedDB caching pattern, and practical performance benchmarks for in-browser AI inference.

WebGPU: the GPU comes to the browser

For the first time in the history of the web platform, JavaScript applications can access the GPU with an API designed for general-purpose compute, not just rendering triangles.

WebGPU is the successor to WebGL. Where WebGL was designed for 3D graphics rendering (and AI researchers had to creatively abuse its texture processing pipelines to run matrix multiplications), WebGPU exposes compute shaders -- direct GPU computation pipelines designed for arbitrary parallel workloads, including neural network inference.

The practical difference is enormous. WebGL-based inference was slow, memory-inefficient, and fragile. WebGPU-based inference runs matrix multiplications on the GPU the way they were meant to run: as proper compute workloads with efficient memory management, parallel execution, and hardware-appropriate data types.
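To make "compute shaders" concrete, here is a minimal sketch of what one looks like. The shader is written in WGSL (WebGPU's shading language) and embedded as a JavaScript string; the doubling operation, binding layout, and workgroup size are all illustrative, not from any particular inference library.

```javascript
// A minimal WGSL compute shader: multiply every element of a buffer by 2.
// This is the kind of arbitrary parallel workload WebGL's graphics pipeline
// could not express directly -- no triangles, no textures, just computation.
const shaderSource = /* wgsl */ `
@group(0) @binding(0) var<storage, read> input : array<f32>;
@group(0) @binding(1) var<storage, read_write> output : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id : vec3<u32>) {
  // Guard against out-of-bounds threads in the final workgroup.
  if (id.x >= arrayLength(&input)) {
    return;
  }
  output[id.x] = input[id.x] * 2.0;
}
`;

// In a WebGPU context, this string compiles to native GPU code:
// const module = device.createShaderModule({ code: shaderSource });
```

A transformer's matrix multiplications are dispatched the same way, just with larger buffers and a more involved kernel.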

What WebGPU gives you:

  • Compute shaders: run arbitrary parallel computations on the GPU, not just graphics
  • Storage buffers: large, GPU-resident memory buffers for model weights
  • Shader compilation: GPU programs compiled to native hardware instructions
  • Efficient memory management: explicit control over GPU memory allocation and data transfer
  • FP16 and INT8 compute: native support for reduced-precision arithmetic on supported hardware
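Getting access to these capabilities starts with requesting an adapter and device. The sketch below shows one way to do that defensively, including opting into FP16 shaders only when the hardware reports support; the helper name and the fallback-to-null behaviour are choices made here, not a prescribed pattern.

```javascript
// Request a WebGPU device suitable for inference, or null if unavailable.
// navigator.gpu exists only in WebGPU-capable browsers; elsewhere this
// resolves to null rather than throwing.
async function getInferenceDevice() {
  const gpu = globalThis.navigator?.gpu;
  if (!gpu) return null; // no WebGPU (older browser, Node, etc.)

  const adapter = await gpu.requestAdapter();
  if (!adapter) return null; // no usable GPU on this machine

  // "shader-f16" is the optional feature gating FP16 arithmetic in shaders;
  // request it only when the adapter advertises it.
  const features = adapter.features.has("shader-f16") ? ["shader-f16"] : [];
  return adapter.requestDevice({ requiredFeatures: features });
}
```

Feature-detecting like this matters in practice: the same page may land on hardware with FP16 support, hardware without it, or a browser with no WebGPU at all.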

For AI inference specifically, this means you can load quantised model weights into GPU storage buffers and run the transformer's matrix multiplications as compute shader dispatches. Performance approaches that of native applications on the same hardware, typically 60-80% of native GPU inference speed.
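The "quantised weights in storage buffers" step is mostly arithmetic: how many bytes does a weight matrix need at a given bit width? A small sketch, with an illustrative 7B-parameter example and the commented-out upload call it would feed in a browser (the helper name and rounding policy are assumptions made here):

```javascript
// Estimate the storage-buffer size for a quantised weight tensor.
function quantizedBufferBytes(paramCount, bitsPerWeight) {
  // Storage buffers are byte-addressed: round the bit count up to whole
  // bytes, then up to the 4-byte alignment WebGPU buffer sizes require.
  const bytes = Math.ceil((paramCount * bitsPerWeight) / 8);
  return Math.ceil(bytes / 4) * 4;
}

// Illustrative: a 7B-parameter model at 4 bits per weight.
const needed = quantizedBufferBytes(7e9, 4); // 3,500,000,000 bytes (~3.5 GB)

// In a browser with a WebGPU device, the upload itself would look like:
// const buffer = device.createBuffer({
//   size: needed,
//   usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
// });
// device.queue.writeBuffer(buffer, 0, weightBytes);
```

The same arithmetic also tells you up front whether a model will fit in the GPU memory the browser is willing to hand out, before you download any weights.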


Why was WebGL inadequate for AI inference, despite being a GPU API?