## The three libraries that matter
There are three serious options for running LLM inference in a browser tab. Each takes a different approach, and the right choice depends on your deployment requirements.
### Transformers.js (Hugging Face)
Transformers.js is a JavaScript port of the Hugging Face Transformers library. It supports both WebGPU and WASM backends, covers a wide range of model architectures (not just LLMs -- also embeddings, image classification, speech recognition), and uses the familiar Hugging Face pipeline API.
Strengths:
- Broadest model support -- any ONNX-exported model works
- Both WebGPU and WASM backends with automatic fallback
- Familiar API for anyone who has used Python Transformers
- Active development with regular releases
- Good for multi-model applications (embedding + generation in one app)
Limitations:
- LLM generation is slower than WebLLM, because ONNX Runtime Web's general-purpose kernels cannot match WebLLM's model-specific compiled GPU kernels
- More runtime overhead than a dedicated LLM runtime if text generation is all you need
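A minimal sketch of how this looks in practice. The `pipeline()` call and the `device` option follow the `@huggingface/transformers` API; the `pickDevice()` helper and the model id are our own illustrations, not part of the library.

```javascript
// Illustrative helper (not part of Transformers.js): WebGPU is exposed as
// navigator.gpu, so its absence signals a CPU-only browser needing WASM.
function pickDevice(nav) {
  return nav && nav.gpu ? "webgpu" : "wasm";
}

async function loadGenerator(nav) {
  const { pipeline } = await import("@huggingface/transformers");
  // Any ONNX-exported text-generation model works; this id is just an example.
  return pipeline("text-generation", "onnx-community/Qwen2.5-0.5B-Instruct", {
    device: pickDevice(nav),
  });
}
```

In newer releases the library can also fall back automatically, but passing `device` explicitly makes the choice visible and debuggable.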
### WebLLM (MLC AI)
WebLLM uses the Apache TVM compiler stack to compile each model ahead of time into optimised, model-specific WebGPU kernels. The result is faster LLM inference than Transformers.js, at the cost of supporting fewer model architectures and requiring a compilation step for new models.
Strengths:
- Fastest LLM inference in the browser -- 20-50% faster than Transformers.js for generation
- Pre-compiled model library covering major open models (Gemma, Llama, Phi, Qwen, Mistral)
- OpenAI-compatible API -- easy to integrate with existing LLM application code
- Efficient KV cache management for longer conversations
Limitations:
- LLM-only -- cannot run embedding models or other architectures
- Adding a new model requires compilation with the MLC toolchain
- WebGPU-only -- no WASM fallback for CPU-only devices
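A sketch of the OpenAI-compatible surface, assuming the `@mlc-ai/web-llm` package. `CreateMLCEngine` and `chat.completions.create` are the library's actual entry points; `buildRequest()` is our own helper, and the model id is one example from WebLLM's pre-compiled library.

```javascript
// Illustrative helper (not part of WebLLM): assembles an OpenAI-style
// Chat Completions request body, which WebLLM accepts directly.
function buildRequest(history, userText) {
  return {
    messages: [...history, { role: "user", content: userText }],
    temperature: 0.7,
  };
}

async function chat(userText, history = []) {
  const { CreateMLCEngine } = await import("@mlc-ai/web-llm");
  // The model id must come from WebLLM's pre-compiled model library.
  const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC");
  const reply = await engine.chat.completions.create(
    buildRequest(history, userText)
  );
  return reply.choices[0].message.content;
}
```

Because the request and response shapes mirror the OpenAI client, code written against a server-side endpoint often ports over with little more than swapping the client object.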
### MediaPipe LLM (Google)
MediaPipe LLM is Google's official library for running Gemma and other models in the browser. It is tightly optimised for Google's own models and integrates with the broader MediaPipe ecosystem (vision, audio, pose detection).
Strengths:
- Tight optimisation for Gemma models specifically
- Integrated with MediaPipe's cross-platform runtime (web, Android, iOS)
- Good documentation and Google support
- Simple API with fewer configuration options to get wrong
Limitations:
- Primarily supports Google models (Gemma family)
- Less flexibility than Transformers.js or WebLLM for custom models
- Newer and less battle-tested in production deployments
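The "simple API" claim is visible in a basic usage sketch, assuming the `@mediapipe/tasks-genai` package. `FilesetResolver.forGenAiTasks`, `LlmInference.createFromOptions`, and `generateResponse` are the library's documented calls; the CDN URL, the model path, and the `llmOptions()` helper are placeholders for illustration.

```javascript
// Illustrative helper (not part of MediaPipe): the small options object is
// most of the configuration surface the task exposes.
function llmOptions(modelAssetPath) {
  return {
    baseOptions: { modelAssetPath },
    maxTokens: 512,
    temperature: 0.8,
  };
}

async function runGemma(prompt) {
  const { FilesetResolver, LlmInference } = await import("@mediapipe/tasks-genai");
  const genai = await FilesetResolver.forGenAiTasks(
    "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm" // example CDN path
  );
  const llm = await LlmInference.createFromOptions(
    genai,
    llmOptions("/models/gemma-2b-it-gpu-int4.bin") // example model path
  );
  return llm.generateResponse(prompt);
}
```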
### Which to choose
| Use case | Recommendation |
|---|---|
| Fastest LLM chat in browser | WebLLM |
| Multi-model pipeline (embed + generate) | Transformers.js |
| Gemma-specific deployment with cross-platform plans | MediaPipe LLM |
| Must support CPU-only devices | Transformers.js (WASM fallback) |
| Existing OpenAI-compatible codebase | WebLLM (compatible API) |
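The table above can be read as a small decision function. This is purely our own illustration; the requirement flags are assumptions, not any library's API.

```javascript
// Illustrative decision helper encoding the table above.
// cpuOnly: must run on devices without WebGPU (only Transformers.js has WASM).
// multiModel: needs embeddings etc. alongside generation (ONNX breadth).
// gemmaOnly: Gemma-specific deployment with cross-platform plans.
function recommendLibrary({ cpuOnly = false, multiModel = false, gemmaOnly = false } = {}) {
  if (cpuOnly || multiModel) return "Transformers.js";
  if (gemmaOnly) return "MediaPipe LLM";
  return "WebLLM"; // default: fastest WebGPU chat, OpenAI-compatible API
}
```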