Transformers.js, WebLLM, and MediaPipe LLM for production in-browser inference -- library comparison, step-by-step deployment of Gemma 4 E2B, first-load UX, memory pressure handling, and streaming chat interfaces.
The three libraries that matter
There are three serious options for running LLM inference in a browser tab. Each takes a different approach, and the right choice depends on your deployment requirements.
Transformers.js (Hugging Face)
Transformers.js is a JavaScript port of the Hugging Face Transformers library. It supports both WebGPU and WASM backends, covers a wide range of model architectures (not just LLMs -- also embeddings, image classification, speech recognition), and uses the familiar Hugging Face pipeline API.
Strengths:
Broadest model support -- any ONNX-exported model works
Both WebGPU and WASM backends with automatic fallback
Familiar API for anyone who has used Python Transformers
Active development with regular releases
Good for multi-model applications (embedding + generation in one app)
Limitations:
LLM inference is slower than WebLLM because models run through the general-purpose ONNX Runtime Web rather than model-specific compiled GPU kernels
Larger runtime overhead for LLM-specific use cases
WebLLM (MLC AI)
WebLLM uses the TVM compiler to generate hardware-specific GPU kernels at compile time. The result is faster LLM inference than Transformers.js, at the cost of supporting fewer model architectures and requiring a compilation step for new models.
Strengths:
Fastest LLM inference in the browser -- 20-50% faster than Transformers.js for generation
Pre-compiled model library covering major open models (Gemma, Llama, Phi, Qwen, Mistral)
OpenAI-compatible API -- easy to integrate with existing LLM application code
Efficient KV cache management for longer conversations
Limitations:
LLM-only -- cannot run embedding models or other architectures
Adding a new model requires compilation with the MLC toolchain
WebGPU-only -- no WASM fallback for CPU-only devices
MediaPipe LLM (Google)
MediaPipe LLM is Google's official library for running Gemma and other models in the browser. It is tightly optimised for Google's own models and integrates with the broader MediaPipe ecosystem (vision, audio, pose detection).
Strengths:
Tight optimisation for Gemma models specifically
Integrated with MediaPipe's cross-platform runtime (web, Android, iOS)
Good documentation and Google support
Simple API with fewer configuration options to get wrong
Limitations:
Primarily supports Google models (Gemma family)
Less flexibility than Transformers.js or WebLLM for custom models
Newer and less battle-tested in production deployments
Which to choose:
| Use case | Recommendation |
| --- | --- |
| Fastest LLM chat in browser | WebLLM |
| Multi-model pipeline (embed + generate) | Transformers.js |
| Gemma-specific deployment with cross-platform plans | MediaPipe LLM |
| Must support CPU-only devices | Transformers.js (WASM fallback) |
| Existing OpenAI-compatible codebase | WebLLM (compatible API) |
You are building an internal enterprise tool that needs both text embedding (for search) and text generation (for answers) running entirely in the browser. Which library?
Step-by-step: WebLLM with Gemma 4 E2B
Let us walk through the complete code for loading and running Gemma 4 E2B in a browser tab using WebLLM. This is the path to fastest LLM performance in the browser.
1. Install and import
npm install @mlc-ai/web-llm
import * as webllm from "@mlc-ai/web-llm";
2. Initialise the engine with progress reporting
const initProgressCallback = (progress) => {
  // progress.text contains a human-readable status message
  // progress.progress is 0-1 for download progress
  document.getElementById("status").textContent = progress.text;
  if (progress.progress !== undefined) {
    document.getElementById("progress-bar").style.width =
      `${Math.round(progress.progress * 100)}%`;
  }
};

const engine = await webllm.CreateMLCEngine(
  "gemma-2-2b-it-q4f16_1-MLC", // Model identifier
  {
    initProgressCallback,
    logLevel: "INFO",
  }
);
The model identifier follows a naming convention: {model}-{quantisation}-MLC. Check the WebLLM model list for available pre-compiled models. Gemma 4 E2B models will use identifiers like gemma-4-e2b-it-q4f16_1-MLC when available in the library.
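As a small illustration of that naming convention, the sketch below splits an identifier into its model and quantisation parts. It assumes the quantisation tag always looks like `q4f16_1`; `parseModelId` is a hypothetical helper written for this example, not part of WebLLM.

```javascript
// Sketch: split a "{model}-{quantisation}-MLC" identifier into its parts.
// Assumption: the quantisation tag is shaped like "q4f16_1" (q<bits>f<bits>
// with an optional _<n> suffix). Identifiers with other shapes return null.
function parseModelId(modelId) {
  const match = /^(.+)-(q\d+f\d+(?:_\d+)?)-MLC$/.exec(modelId);
  if (!match) return null;
  return { model: match[1], quantisation: match[2] };
}

const parts = parseModelId("gemma-2-2b-it-q4f16_1-MLC");
console.log(parts.model, parts.quantisation);
```

A helper like this can be useful for showing a friendly model name in the UI or for logging which quantisation a user actually loaded.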
3. Run inference with the OpenAI-compatible API
const response = await engine.chat.completions.create({
  messages: [
    {
      role: "system",
      content: "You are a helpful assistant that summarises documents concisely.",
    },
    {
      role: "user",
      content: `Summarise the following document:\n\n${documentText}`,
    },
  ],
  temperature: 0.3,
  max_tokens: 500,
});

const summary = response.choices[0].message.content;
4. Stream the response
Streaming is critical for user experience. Without it, the user stares at a blank screen for 5-30 seconds while the model generates the full response. With streaming, text appears token by token -- users see output beginning within 100-200ms of submitting their query.
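A minimal sketch of streaming with the OpenAI-compatible API follows. It assumes `engine` is a WebLLM engine (or any object with the same `chat.completions.create` method) and that passing `stream: true` yields an async iterable of chunks whose text arrives in `choices[0].delta.content`. `streamReply` and `onToken` are hypothetical names chosen for this example.

```javascript
// Sketch: stream a chat completion and forward each text fragment to the UI.
// Assumptions: `engine` exposes an OpenAI-style chat.completions.create();
// `onToken` is a UI callback supplied by the application.
async function streamReply(engine, messages, onToken) {
  const chunks = await engine.chat.completions.create({
    messages,
    temperature: 0.3,
    max_tokens: 500,
    stream: true, // Request an async iterable of chunks, not one response
  });

  let fullText = "";
  for await (const chunk of chunks) {
    // Each chunk carries an incremental delta, which may be empty
    const delta = chunk.choices[0]?.delta?.content ?? "";
    if (delta) {
      fullText += delta;
      onToken(delta);
    }
  }
  return fullText;
}
```

In a page, `onToken` would typically append the fragment to the output element, for example `(t) => { outputEl.textContent += t; }`, where `outputEl` is whatever element holds the reply.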
5. The complete initialisation pattern for production
async function initAI() {
  // Check WebGPU support
  if (!navigator.gpu) {
    showFallbackMessage(
      "Your browser does not support WebGPU. " +
      "Please use Chrome 113+ or another browser with WebGPU enabled."
    );
    return null;
  }

  // Check GPU adapter availability
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    showFallbackMessage(
      "No GPU adapter found. " +
      "Your device may not have a compatible GPU."
    );
    return null;
  }

  // Log adapter details for diagnostics. Newer browsers expose adapter.info;
  // older ones only offer the deprecated requestAdapterInfo().
  const adapterInfo = adapter.info ?? (await adapter.requestAdapterInfo?.());
  console.log("GPU:", adapterInfo?.description);

  try {
    // initProgressCallback is the function defined in step 2
    return await webllm.CreateMLCEngine("gemma-2-2b-it-q4f16_1-MLC", {
      initProgressCallback,
    });
  } catch (error) {
    const message = String(error?.message ?? error);
    if (message.toLowerCase().includes("out of memory")) {
      showFallbackMessage(
        "Insufficient GPU memory for this model. " +
        "Try closing other tabs or applications."
      );
    } else {
      showFallbackMessage(`Failed to load AI model: ${message}`);
    }
    return null;
  }
}
Your browser AI application needs to support an enterprise help desk. Agents process ~50 queries per shift. What is the most important UX consideration?
Step-by-step: Transformers.js with streaming
For applications that need embedding and generation in a single library, or that require WASM fallback, Transformers.js is the right choice.
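import { pipeline, env, TextStreamer } from '@huggingface/transformers';

// Configure caching: fetch models from the Hub and cache them in the browser
env.allowLocalModels = false;
env.useBrowserCache = true;

// Select the backend explicitly: WebGPU when available, WASM otherwise
const device = navigator.gpu ? 'webgpu' : 'wasm';

// Create a text generation pipeline
const generator = await pipeline(
  'text-generation',
  'onnx-community/gemma-2-2b-it-ONNX', // ONNX-format model
  {
    device,
    dtype: 'q4', // INT4 quantisation
  }
);

// Stream decoded text to the page as it is generated.
// "output" is a placeholder element id for this example.
const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true, // Do not echo the prompt back
  callback_function: (text) => {
    document.getElementById('output').textContent += text;
  },
});

// Generate with streaming
const output = await generator(
  [{ role: "user", content: "Summarise this document briefly." }],
  {
    max_new_tokens: 256,
    temperature: 0.3,
    do_sample: true,
    streamer,
  }
);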
Running an embedding model alongside generation:
// Embedding pipeline (for RAG or search)
const embedder = await pipeline(
  'feature-extraction',
  'Xenova/nomic-embed-text-v1.5',
  { device: 'webgpu', dtype: 'fp16' }
);

// Embed a document
const embedding = await embedder("Your document text here", {
  pooling: 'mean',
  normalize: true,
});
// embedding.data is a Float32Array of 768 dimensions
This is where Transformers.js shines -- you have a single library managing both embedding and generation models, sharing the WebGPU device and caching infrastructure.
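To show how the embedding half might feed the generation half, here is a sketch of a simple in-memory search. It assumes every vector was produced with `normalize: true`, so a plain dot product equals cosine similarity. `rankBySimilarity` and the `{ id, vector }` document shape are hypothetical and not part of Transformers.js.

```javascript
// Dot product of two equal-length numeric vectors (arrays or Float32Arrays).
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Sketch: rank documents against a query embedding.
// Assumption: all vectors are L2-normalised, so dot product == cosine similarity.
// docs is an array of { id, vector }; returns the topK { id, score } pairs.
function rankBySimilarity(queryVector, docs, topK = 3) {
  return docs
    .map((doc) => ({ id: doc.id, score: dot(queryVector, doc.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

The top-ranked documents can then be pasted into the generation prompt as context. For a few thousand documents a linear scan like this is usually adequate; larger collections would call for an index.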
When the browser runs out of GPU memory
GPU out-of-memory (OOM) in a browser is different from OOM in a native application. There is no segfault. Instead, you get one of several failure modes:
WebGPU buffer allocation fails. The browser returns an error when you try to create a GPU buffer for model weights. This is the cleanest failure -- your code catches it and can show a user-friendly message.
The browser tab crashes. Chrome's "Aw, Snap!" page appears. The user loses any unsaved state. This happens when the GPU driver runs out of memory during inference, after weights have been successfully loaded.
The GPU driver crashes. On Windows especially, the display may flicker or go black momentarily as the driver recovers. Other applications may be affected.
Other tabs slow down. Your model consumes so much GPU memory that browser rendering in other tabs becomes laggy. The user does not see an error but their overall experience degrades.
Defensive strategies:
// Strategy 1: Estimate a memory budget before loading.
// WebGPU does not report free GPU memory, so this uses the adapter's
// maxBufferSize limit as a rough proxy.
async function estimateAvailableGPUMemory() {
  const adapter = await navigator.gpu?.requestAdapter();
  if (!adapter) return 0;
  // maxBufferSize gives an upper bound on a single allocation
  // Total available is typically 2-4x this
  return adapter.limits.maxBufferSize;
}

// Strategy 2: Graceful degradation.
// tryLoadModel is an application-defined helper; model names are placeholders.
const GIB = 1024 * 1024 * 1024;

async function loadModelWithFallback() {
  const availableMemory = await estimateAvailableGPUMemory();
  if (availableMemory > 4 * GIB) {
    // 4GB+ available: try E4B
    return await tryLoadModel("gemma-4-e4b-q4");
  } else if (availableMemory > 1.5 * GIB) {
    // 1.5GB+ available: use E2B
    return await tryLoadModel("gemma-4-e2b-q4");
  } else {
    // Not enough GPU memory: fall back to WASM
    return await tryLoadModel("gemma-4-e2b-q4", { device: "wasm" });
  }
}

// Strategy 3: Release memory when not in use
async function releaseModel(engine) {
  await engine.unload(); // Releases GPU buffers
  // Model will need to be reloaded from IndexedDB cache for next use
}
The key insight: always have a degradation path. If the large model does not fit, try a smaller one. If no model fits on the GPU, fall back to WASM. If WASM is too slow, offer a server-side fallback. Each step degrades performance but maintains functionality.
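The degradation path can also be written as a generic chain that does not depend on a memory estimate at all: try each option in order and keep the first that loads. The sketch below assumes each candidate is an object `{ name, load }` supplied by the application; `loadWithFallback` is a hypothetical helper, not a library function.

```javascript
// Sketch: try each loader in order and return the first one that succeeds.
// Assumption: candidates is an array of { name, load }, where load() returns
// a promise that rejects if the model cannot be loaded on this device.
async function loadWithFallback(candidates) {
  const failures = [];
  for (const { name, load } of candidates) {
    try {
      const model = await load();
      return { name, model, failures };
    } catch (error) {
      // Record why this option failed, then move on to the next one
      failures.push({ name, reason: String(error?.message ?? error) });
    }
  }
  throw new Error(
    "All fallbacks failed: " + failures.map((f) => f.name).join(", ")
  );
}
```

A caller would list candidates from most to least capable, for example a larger GPU model, a smaller GPU model, a WASM build, and finally a server-side client. The returned `failures` array is useful for telemetry on which devices needed to degrade.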
During testing, your in-browser AI application works perfectly on developer machines (MacBook Pro M3, 36GB) but crashes on employee laptops (Intel i5, 8GB RAM, Intel UHD integrated graphics). What is the root cause and fix?
Production chat UI patterns
A chat interface powered by in-browser inference has unique UX considerations that differ from cloud-powered chat.
The loading state lifecycle:
Application load: Check for cached model in IndexedDB. If found, show "Loading AI..." (2-5 seconds). If not found, show download progress bar with time estimate.
Model warm-up: After weights are loaded, the first inference takes 1-3 seconds longer than subsequent ones (GPU shader compilation). Run a silent warm-up inference with a short prompt to move this cost out of the user's first interaction.
Ready state: The model is loaded and warm. Inference requests are handled immediately.
Generation: Stream tokens to the UI. Show a typing indicator during time-to-first-token (80-200ms), then switch to streaming text output.
Error recovery: If the model crashes (OOM, driver error), detect the failure, show a clear message, and offer to reload or switch to a fallback.
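The lifecycle above can be sketched as a small driver function that reports each state to the UI and performs the silent warm-up. It assumes `loadEngine` is an application-supplied async function returning an engine with an OpenAI-style `chat.completions.create` method, and that `onState` is a UI callback. Both names and the state labels are hypothetical.

```javascript
// Sketch: load the model, warm it up, and report each lifecycle state.
// Assumptions: loadEngine() resolves to a WebLLM-style engine; onState is a
// UI callback receiving "loading", "warming", "ready", or "error".
async function startModel(loadEngine, onState) {
  try {
    onState("loading");
    const engine = await loadEngine();

    onState("warming");
    // Silent warm-up: a tiny request whose output is discarded, so the
    // one-off shader compilation cost is paid before the user's first query
    await engine.chat.completions.create({
      messages: [{ role: "user", content: "Hi" }],
      max_tokens: 1,
    });

    onState("ready");
    return engine;
  } catch (error) {
    // Covers both load failures and warm-up failures
    onState("error", error);
    return null;
  }
}
```

The UI can map these states directly to what it shows: a progress bar while loading, a brief "Preparing..." label while warming, an enabled input box when ready, and a reload or fallback prompt on error.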
Conversation memory management: Unlike cloud APIs where the server handles context, in-browser inference processes the full conversation history on every message. As conversations grow, inference slows because the model must process more prompt tokens. Reset the conversation periodically or implement a sliding window that keeps only the last N turns.
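One possible shape for the sliding window is sketched below. It assumes messages use the `{ role, content }` format from the earlier examples and treats a "turn" as one user message plus whatever replies follow it. `trimHistory` is a hypothetical helper written for this example.

```javascript
// Sketch: keep all system messages plus the last `maxTurns` turns.
// Assumption: a turn starts at each message with role "user".
function trimHistory(messages, maxTurns) {
  const system = messages.filter((m) => m.role === "system");
  const dialogue = messages.filter((m) => m.role !== "system");
  if (maxTurns <= 0) return system;

  // Walk backwards counting user messages to find where the window starts.
  // If there are fewer than maxTurns turns, start stays 0 and nothing is cut.
  let start = 0;
  let turns = 0;
  for (let i = dialogue.length - 1; i >= 0; i--) {
    if (dialogue[i].role === "user") {
      turns += 1;
      if (turns === maxTurns) {
        start = i;
        break;
      }
    }
  }
  return [...system, ...dialogue.slice(start)];
}
```

Calling `trimHistory(history, N)` before each request bounds the prompt length and therefore the prefill time. Keeping the system message outside the window preserves the assistant's instructions no matter how long the conversation runs.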
Module 5 -- Final Assessment
1. What is the primary advantage of WebLLM over Transformers.js for LLM inference in the browser?
2. Why is a warm-up inference recommended after loading the model but before the user's first interaction?
3. In-browser LLM conversations get slower over time. What is the most likely cause?
4. Your application must support both embedding search and LLM generation in the browser. Which library architecture is most appropriate?