Edge AI & Private Inference

Building In-Browser AI Applications

Transformers.js, WebLLM, and MediaPipe LLM for production in-browser inference -- library comparison, step-by-step deployment of Gemma 3n E2B, first-load UX, memory pressure handling, and streaming chat interfaces.

The three libraries that matter

There are three serious options for running LLM inference in a browser tab. Each takes a different approach, and the right choice depends on your deployment requirements.

Transformers.js (Hugging Face)

Transformers.js is a JavaScript port of the Hugging Face Transformers library. It supports both WebGPU and WASM backends, covers a wide range of model architectures (not just LLMs -- also embeddings, image classification, speech recognition), and uses the familiar Hugging Face pipeline API.

Strengths:

  • Broadest model support -- any ONNX-exported model works
  • Both WebGPU and WASM backends with automatic fallback
  • Familiar API for anyone who has used Python Transformers
  • Active development with regular releases
  • Good for multi-model applications (embedding + generation in one app)

Limitations:

  • LLM inference is slower than WebLLM because it runs through ONNX Runtime's generic kernels rather than model-specific compiled GPU kernels
  • Larger runtime overhead for LLM-specific use cases
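A minimal sketch of what Transformers.js usage looks like with the fallback behaviour described above. The commented-out pipeline call assumes the v3-style API of the `@huggingface/transformers` package and uses an illustrative ONNX model id -- check both against the current docs. The feature-detection helper is our own addition, not part of the library:

```javascript
// Intended usage (assumed v3 API shape, illustrative model id):
//
//   import { pipeline } from '@huggingface/transformers';
//   const generator = await pipeline(
//     'text-generation',
//     'onnx-community/Qwen2.5-0.5B-Instruct',
//     { device: pickDevice() },
//   );
//   const out = await generator('Explain WebGPU in one sentence.', {
//     max_new_tokens: 64,
//   });

// Feature-detect WebGPU so CPU-only devices get the WASM backend.
// navigator.gpu is the WebGPU entry point; it is absent on
// unsupported browsers, where we fall back to 'wasm'.
function pickDevice(root = globalThis) {
  return root.navigator && root.navigator.gpu ? 'webgpu' : 'wasm';
}
```

The explicit `device` choice mirrors the automatic fallback the library performs; making it explicit is useful when you want to warn users that they are on the slower WASM path.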

WebLLM (MLC AI)

WebLLM uses the TVM compiler to generate hardware-specific GPU kernels at compile time. The result is faster LLM inference than Transformers.js, at the cost of supporting fewer model architectures and requiring a compilation step for new models.

Strengths:

  • Fastest LLM inference in the browser -- 20-50% faster than Transformers.js for generation
  • Pre-compiled model library covering major open models (Gemma, Llama, Phi, Qwen, Mistral)
  • OpenAI-compatible API -- easy to integrate with existing LLM application code
  • Efficient KV cache management for longer conversations

Limitations:

  • LLM-only -- cannot run embedding models or other architectures
  • Adding a new model requires compilation with the MLC toolchain
  • WebGPU-only -- no WASM fallback for CPU-only devices
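A sketch of WebLLM's OpenAI-compatible chat API (package `@mlc-ai/web-llm`; the model id is illustrative), plus a small history-trimming helper of our own. Because the engine keeps a KV cache per conversation, long chats grow GPU memory use; dropping old turns while preserving the system prompt is one simple mitigation:

```javascript
// Intended usage (API shape as documented by WebLLM):
//
//   import { CreateMLCEngine } from '@mlc-ai/web-llm';
//   const engine = await CreateMLCEngine('Llama-3.1-8B-Instruct-q4f16_1-MLC');
//   const reply = await engine.chat.completions.create({
//     messages: trimHistory(messages, 8),
//   });

// Keep the system prompt, drop all but the most recent maxTurns
// user/assistant messages. This is our own helper, not a WebLLM API.
function trimHistory(messages, maxTurns) {
  const system = messages.filter((m) => m.role === 'system');
  const rest = messages.filter((m) => m.role !== 'system');
  return [...system, ...rest.slice(-maxTurns)];
}
```

Because the `chat.completions.create` surface matches OpenAI's, code written against the OpenAI SDK's request/response shapes usually ports over with only the engine construction changed.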

MediaPipe LLM (Google)

MediaPipe LLM is Google's official library for running Gemma and other models in the browser. It is tightly optimised for Google's own models and integrates with the broader MediaPipe ecosystem (vision, audio, pose detection).

Strengths:

  • Tight optimisation for Gemma models specifically
  • Integrated with MediaPipe's cross-platform runtime (web, Android, iOS)
  • Good documentation and Google support
  • Simple API with fewer configuration options to get wrong

Limitations:

  • Primarily supports Google models (Gemma family)
  • Less flexibility than Transformers.js or WebLLM for custom models
  • Newer and less battle-tested in production deployments
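A sketch of the MediaPipe LLM Inference API (package `@mediapipe/tasks-genai`); the model path is an assumption for your own hosting, and the prompt formatter below is our own helper reflecting Gemma's documented turn format:

```javascript
// Intended usage (API shape as documented by MediaPipe):
//
//   import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';
//   const genai = await FilesetResolver.forGenAiTasks(
//     'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm',
//   );
//   const llm = await LlmInference.createFromOptions(genai, {
//     baseOptions: { modelAssetPath: '/models/gemma.task' }, // your own hosting
//     maxTokens: 512,
//   });
//   const answer = await llm.generateResponse(formatGemmaPrompt('Hello'));

// Gemma instruction-tuned models expect this turn-based prompt format;
// the trailing model turn cues the model to start generating.
function formatGemmaPrompt(userText) {
  return `<start_of_turn>user\n${userText}<end_of_turn>\n<start_of_turn>model\n`;
}
```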

Which to choose:

  • Fastest LLM chat in browser: WebLLM
  • Multi-model pipeline (embed + generate): Transformers.js
  • Gemma-specific deployment with cross-platform plans: MediaPipe LLM
  • Must support CPU-only devices: Transformers.js (WASM fallback)
  • Existing OpenAI-compatible codebase: WebLLM (compatible API)
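The decision criteria above can be condensed into a small helper; the function and flag names are illustrative, and the priority order (model coverage and CPU support before raw speed) is our own reading of the trade-offs:

```javascript
// Encode the selection logic: Transformers.js is the only option with a
// WASM/CPU fallback and the only one that runs non-LLM models such as
// embedders; MediaPipe LLM wins for Gemma-centric cross-platform plans;
// otherwise WebLLM gives the fastest pure-LLM chat.
function chooseLibrary({
  needsEmbeddings = false,
  cpuOnlyDevices = false,
  gemmaCrossPlatform = false,
} = {}) {
  if (needsEmbeddings || cpuOnlyDevices) return 'Transformers.js';
  if (gemmaCrossPlatform) return 'MediaPipe LLM';
  return 'WebLLM';
}
```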

You are building an internal enterprise tool that needs both text embedding (for search) and text generation (for answers) running entirely in the browser. Which library?