Building In-Browser AI Applications

The three libraries that matter

There are three serious options for running LLM inference in a browser tab. Each takes a different approach, and the right choice depends on your deployment requirements.

Transformers.js (Hugging Face)

Transformers.js is a JavaScript port of the Hugging Face Transformers library. It supports both WebGPU and WASM backends, covers a wide range of model architectures (not just LLMs -- also embeddings, image classification, speech recognition), and uses the familiar Hugging Face pipeline API.

Strengths:

Broadest model support -- most models that export cleanly to ONNX will run
Both WebGPU and WASM backends, so CPU-only devices can still run models
Familiar API for anyone who has used Python Transformers
Active development with regular releases
Good for multi-model applications (embedding + generation in one app)

Limitations:

LLM inference is slower than WebLLM because it runs models through ONNX Runtime Web, a general-purpose runtime, rather than kernels compiled for each model
Larger runtime overhead for LLM-specific use cases

WebLLM (MLC AI)

WebLLM uses the MLC LLM toolchain, built on the Apache TVM compiler, to generate optimised WebGPU kernels for each model ahead of time. The result is faster LLM inference than Transformers.js, at the cost of supporting fewer model architectures and requiring a compilation step for new models.

Strengths:

Fastest LLM inference in the browser -- roughly 20-50% faster than Transformers.js for generation, depending on model and hardware
Pre-compiled model library covering major open models (Gemma, Llama, Phi, Qwen, Mistral)
OpenAI-compatible API -- easy to integrate with existing LLM application code
Efficient KV cache management for longer conversations

Limitations:

LLM-only -- cannot run embedding models or other architectures
Adding a new model requires compilation with the MLC toolchain
WebGPU-only -- no WASM fallback for CPU-only devices

MediaPipe LLM (Google)

MediaPipe LLM (the MediaPipe LLM Inference API) is Google's official library for running Gemma and other models in the browser. It is tightly optimised for Google's own models and integrates with the broader MediaPipe ecosystem (vision, audio, pose detection).

Strengths:

Tight optimisation for Gemma models specifically
Integrated with MediaPipe's cross-platform runtime (web, Android, iOS)
Good documentation and Google support
Simple API with fewer configuration options to get wrong

Limitations:

Primarily supports Google models (Gemma family)
Less flexibility than Transformers.js or WebLLM for custom models
Newer and less battle-tested in production deployments

Which to choose:

Use case	Recommendation
Fastest LLM chat in browser	WebLLM
Multi-model pipeline (embed + generate)	Transformers.js
Gemma-specific deployment with cross-platform plans	MediaPipe LLM
Must support CPU-only devices	Transformers.js (WASM fallback)
Existing OpenAI-compatible codebase	WebLLM (compatible API)
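
The table can also be read as a small decision function. The sketch below is only an illustration of that reading: the function name and the requirement flags (needsEmbeddings, needsCpuFallback, gemmaCrossPlatform) are hypothetical and do not come from any of the three libraries.

```javascript
// Hypothetical helper: maps deployment requirements to the table's recommendations.
function recommendLibrary({
  needsEmbeddings = false,
  needsCpuFallback = false,
  gemmaCrossPlatform = false,
} = {}) {
  // WebLLM is LLM-only and WebGPU-only, so either requirement rules it out.
  if (needsEmbeddings || needsCpuFallback) return "Transformers.js";
  // Gemma deployments that also target Android or iOS.
  if (gemmaCrossPlatform) return "MediaPipe LLM";
  // Default: fastest chat and an OpenAI-compatible API.
  return "WebLLM";
}
```

Hard constraints (embeddings, CPU-only devices) are checked first because they eliminate options outright, while speed is only a preference.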

You are building an internal enterprise tool that needs both text embedding (for search) and text generation (for answers) running entirely in the browser. Which library?

Step-by-step: WebLLM with Gemma 4 E2B

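Let us walk through the complete code for loading and running a Gemma model in a browser tab using WebLLM. The examples use the gemma-2-2b-it-q4f16_1-MLC identifier from WebLLM's pre-compiled model list; swap in the Gemma 4 E2B identifier once it is published there. This is the path to fastest LLM performance in the browser.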

1. Install and import

npm install @mlc-ai/web-llm

import * as webllm from "@mlc-ai/web-llm";

2. Initialise the engine with progress reporting

const initProgressCallback = (progress) => {
  // progress.text contains a human-readable status message
  // progress.progress is 0-1 for download progress
  document.getElementById("status").textContent = progress.text;

  if (progress.progress !== undefined) {
    document.getElementById("progress-bar").style.width =
      `${Math.round(progress.progress * 100)}%`;
  }
};

const engine = await webllm.CreateMLCEngine(
  "gemma-2-2b-it-q4f16_1-MLC", // Model identifier
  {
    initProgressCallback,
    logLevel: "INFO",
  }
);

The model identifier follows a naming convention: {model}-{quantisation}-MLC. Check the WebLLM model list for available pre-compiled models. Gemma 4 E2B models are expected to use identifiers like gemma-4-e2b-it-q4f16_1-MLC once they are available in the library; confirm the exact string against the model list before shipping.
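
To make the naming convention concrete, here is a minimal sketch that splits an identifier into its parts. The function name is hypothetical, and the regular expression assumes quantisation tags of the form q<bits>f<bits> with an optional _<n> suffix, which matches the identifier used above but may not cover every entry in the model list.

```javascript
// Splits an identifier such as "gemma-2-2b-it-q4f16_1-MLC" into its parts.
// Assumption: the quantisation tag looks like q<bits>f<bits>, optionally followed by _<n>.
function parseModelId(modelId) {
  const match = /^(.+)-(q\d+f\d+(?:_\d+)?)-MLC$/.exec(modelId);
  if (!match) return null; // identifier does not follow the convention
  const [, model, quantisation] = match;
  return { model, quantisation };
}
```

For "gemma-2-2b-it-q4f16_1-MLC" this yields the model gemma-2-2b-it and the quantisation q4f16_1. A null result is a useful early warning that an identifier was mistyped.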

3. Run inference with the OpenAI-compatible API

const response = await engine.chat.completions.create({
  messages: [
    {
      role: "system",
      content: "You are a helpful assistant that summarises documents concisely."
    },
    {
      role: "user",
      content: `Summarise the following document:\n\n${documentText}`
    }
  ],
  temperature: 0.3,
  max_tokens: 500,
});

const summary = response.choices[0].message.content;

4. Stream output for responsive UX

const chunks = await engine.chat.completions.create({
  messages: [
    { role: "user", content: "Explain the key risks in this contract." }
  ],
  temperature: 0.3,
  max_tokens: 800,
  stream: true,
});

let fullResponse = "";
for await (const chunk of chunks) {
  const delta = chunk.choices[0]?.delta?.content || "";
  fullResponse += delta;
  document.getElementById("output").textContent = fullResponse;
}

Streaming is critical for user experience. Without it, the user stares at an empty output area until the whole response has been generated, which can take 5-30 seconds. With streaming, text appears token by token, and on capable hardware users typically see output beginning within 100-200ms of submitting their query.
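
The streaming loop can be separated from the DOM so that it is easy to test. The following is a sketch under stated assumptions: collectStream and fakeStream are hypothetical names, and fakeStream merely imitates the chunk shape (choices[0].delta.content) used in the example above.

```javascript
// Consumes any async iterable of OpenAI-style streaming chunks.
// onUpdate is a hypothetical callback that receives the text accumulated so far.
async function collectStream(chunks, onUpdate = () => {}) {
  let text = "";
  for await (const chunk of chunks) {
    const delta = chunk.choices?.[0]?.delta?.content ?? "";
    if (delta === "") continue; // some chunks carry no text
    text += delta;
    onUpdate(text);
  }
  return text;
}

// Stand-in for a streaming completion, for testing only.
async function* fakeStream(parts) {
  for (const part of parts) {
    yield { choices: [{ delta: { content: part } }] };
  }
}
```

In the page, onUpdate would set the output element's textContent; in a test, it can push to an array.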

5. The complete initialisation pattern for production

async function initAI() {
  // Check WebGPU support
  if (!navigator.gpu) {
    showFallbackMessage("Your browser does not support WebGPU. " +
      "Please use a recent version of Chrome or Edge (113+), " +
      "or another browser with WebGPU enabled.");
    return null;
  }

  // Check GPU adapter availability
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    showFallbackMessage("No GPU adapter found. " +
      "Your device may not have a compatible GPU.");
    return null;
  }

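  // Log which GPU was selected (WebGPU does not expose available memory).
  // adapter.info replaces the deprecated adapter.requestAdapterInfo().
  const adapterInfo = adapter.info ?? (await adapter.requestAdapterInfo?.());
  console.log("GPU:", adapterInfo?.description || adapterInfo?.vendor || "unknown");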

  try {
    const engine = await webllm.CreateMLCEngine(
      "gemma-2-2b-it-q4f16_1-MLC",
      { initProgressCallback }
    );
    return engine;
  } catch (error) {
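    // Errors are not guaranteed to be Error objects, so normalise to a string first
    if (String(error?.message ?? error).toLowerCase().includes("out of memory")) {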
      showFallbackMessage("Insufficient GPU memory for this model. " +
        "Try closing other tabs or applications.");
    } else {
      showFallbackMessage(`Failed to load AI model: ${error.message}`);
    }
    return null;
  }
}
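
Matching on error text is fragile, so it can help to keep it in one place. The sketch below is illustrative only: classifyLoadError is a hypothetical helper, and the patterns are guesses at the kinds of messages a browser or library might produce, not a documented list.

```javascript
// Maps a load failure to a coarse category the UI can act on.
// Assumption: the error text is the only signal available; the patterns are illustrative.
function classifyLoadError(error) {
  const message = String(error?.message ?? error).toLowerCase();
  if (/out of memory|\boom\b|allocation/.test(message)) return "gpu-memory";
  if (/network|fetch|quota/.test(message)) return "download";
  if (/device.*lost|webgpu/.test(message)) return "gpu-device";
  return "unknown";
}
```

Each category can then map to its own message and recovery action, for example suggesting a smaller model for "gpu-memory" and a retry for "download".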

Your browser AI application needs to support an enterprise help desk. Agents process ~50 queries per shift. What is the most important UX consideration?

Step-by-step: Transformers.js with streaming

For applications that need embedding and generation in a single library, or that require WASM fallback, Transformers.js is the right choice.

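import { pipeline, env, TextStreamer } from '@huggingface/transformers';

// Configure model loading: fetch from the Hugging Face Hub and
// cache downloaded files with the browser Cache API
env.allowLocalModels = false;
env.useBrowserCache = true;

// Pick the backend explicitly rather than relying on an implicit fallback
const device = navigator.gpu ? 'webgpu' : 'wasm';

// Create a text generation pipeline
const generator = await pipeline(
  'text-generation',
  'onnx-community/gemma-2-2b-it-ONNX', // ONNX-format model
  {
    device,                  // 'webgpu' when available, otherwise 'wasm'
    dtype: 'q4',             // INT4 quantisation
  }
);

// Stream decoded text to the page as it is generated
const outputEl = document.getElementById("output");
const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true,         // Do not echo the prompt back
  skip_special_tokens: true,
  callback_function: (text) => {
    // Called with each newly decoded piece of text
    outputEl.textContent += text;
  },
});

const output = await generator(
  [
    { role: "user", content: "Summarise this document briefly." }
  ],
  {
    max_new_tokens: 256,
    temperature: 0.3,
    do_sample: true,
    streamer,
  }
);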

Running an embedding model alongside generation:

// Embedding pipeline (for RAG or search)
const embedder = await pipeline(
  'feature-extraction',
  'Xenova/nomic-embed-text-v1.5',
  { device: 'webgpu', dtype: 'fp16' }
);

// Embed a document
const embedding = await embedder("Your document text here", {
  pooling: 'mean',
  normalize: true
});

// embedding.data is a Float32Array of 768 dimensions

This is where Transformers.js shines -- you have a single library managing both embedding and generation models, sharing the WebGPU device and caching infrastructure.
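
Once documents are embedded, search reduces to comparing vectors. Because the embeddings above are normalised, the dot product equals cosine similarity. The sketch below assumes a hypothetical docs array of { id, vector } objects built from embedder output; the function names are illustrative.

```javascript
// Dot product of two equal-length vectors. For unit-length (normalised)
// embeddings this equals cosine similarity.
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Returns the topK documents most similar to the query, best first.
// `docs` is a hypothetical array of { id, vector } objects.
function rankBySimilarity(queryVector, docs, topK = 3) {
  return docs
    .map(({ id, vector }) => ({ id, score: dot(queryVector, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

The top-ranked documents can then be placed into the generation prompt as context. A linear scan like this is usually adequate for a few thousand documents held in memory.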

When the browser runs out of GPU memory

GPU out-of-memory (OOM) in a browser is different from OOM in a native application. There is no segfault. Instead, you get one of several failure modes:

WebGPU buffer allocation fails. The browser returns an error when you try to create a GPU buffer for model weights. This is the cleanest failure -- your code catches it and can show a user-friendly message.
The browser tab crashes. Chrome's "Aw, Snap!" page appears. The user loses any unsaved state. This happens when the GPU driver runs out of memory during inference, after weights have been successfully loaded.
The GPU driver crashes. On Windows especially, the display may flicker or go black momentarily as the driver recovers. Other applications may be affected.
Other tabs slow down. Your model consumes so much GPU memory that browser rendering in other tabs becomes laggy. The user does not see an error but their overall experience degrades.

Defensive strategies:

// Strategy 1: Estimate GPU capacity before loading
// WebGPU does not report free or total GPU memory. The adapter's
// maxBufferSize limit is only a rough proxy for how capable the GPU is.
async function estimateAvailableGPUMemory() {
  const adapter = await navigator.gpu?.requestAdapter();
  if (!adapter) return 0;

  // Largest single buffer the adapter supports, in bytes.
  // Treat it as a heuristic signal, not a measurement of free memory.
  return adapter.limits.maxBufferSize;
}

// Strategy 2: Graceful degradation
async function loadModelWithFallback() {
  const availableMemory = await estimateAvailableGPUMemory();

  if (availableMemory > 4 * 1024 * 1024 * 1024) {
    // 4GB+ available: try E4B
    return await tryLoadModel("gemma-4-e4b-q4");
  } else if (availableMemory > 1.5 * 1024 * 1024 * 1024) {
    // 1.5GB+ available: use E2B
    return await tryLoadModel("gemma-4-e2b-q4");
  } else {
    // Not enough GPU memory: fall back to WASM
    return await tryLoadModel("gemma-4-e2b-q4", { device: "wasm" });
  }
}

// Strategy 3: Release memory when not in use
async function releaseModel(engine) {
  await engine.unload(); // Releases GPU buffers
  // The model must be reloaded from the browser cache before its next use
}

The key insight: always have a degradation path. If the large model does not fit, try a smaller one. If no model fits on the GPU, fall back to WASM. If WASM is too slow, offer a server-side fallback. Each step degrades performance but maintains functionality.
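
The degradation path can be expressed as an ordered list of candidates tried one at a time. The following is a minimal sketch, not any library's API: loadFirstAvailable, the candidates array, and loadFn are all hypothetical, and loadFn is assumed to resolve to a ready engine or reject on failure.

```javascript
// Tries each candidate in order and returns the first one that loads.
// `loadFn(candidate)` is a hypothetical loader that resolves to an engine or rejects.
async function loadFirstAvailable(candidates, loadFn) {
  const failures = [];
  for (const candidate of candidates) {
    try {
      const engine = await loadFn(candidate);
      return { engine, candidate, failures };
    } catch (error) {
      // Record why this step failed, then fall through to the next candidate.
      failures.push({ candidate, reason: String(error?.message ?? error) });
    }
  }
  return { engine: null, candidate: null, failures };
}
```

Returning the failures alongside the result lets the application log which steps were skipped and tell the user when it is running in a degraded mode.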

During testing, your in-browser AI application works perfectly on developer machines (MacBook Pro M3, 36GB) but crashes on employee laptops (Intel i5, 8GB RAM, Intel UHD integrated graphics). What is the root cause and fix?

Production chat UI patterns

A chat interface powered by in-browser inference has unique UX considerations that differ from cloud-powered chat.

The loading state lifecycle:

Application load: Check for a cached model in browser storage (the Cache API or IndexedDB, depending on the library and its configuration). If found, show "Loading AI..." (2-5 seconds). If not found, show download progress bar with time estimate.
Model warm-up: After weights are loaded, the first inference takes 1-3 seconds longer than subsequent ones (GPU shader compilation). Run a silent warm-up inference with a short prompt to move this cost out of the user's first interaction.
Ready state: The model is loaded and warm. Inference requests are handled immediately.
Generation: Stream tokens to the UI. Show a typing indicator during time-to-first-token (80-200ms), then switch to streaming text output.
Error recovery: If the model crashes (OOM, driver error), detect the failure, show a clear message, and offer to reload or switch to a fallback.
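
The lifecycle above behaves like a small state machine, and writing it as one prevents the UI from showing impossible states. The sketch below is an assumption-laden illustration: the transition table and the createLifecycle name are hypothetical, and "generating" is a phase name added here for the streaming step.

```javascript
// Allowed transitions between lifecycle phases (an illustrative table).
const TRANSITIONS = {
  checking: ["loading", "error"],
  loading: ["warming", "error"],
  warming: ["ready", "error"],
  ready: ["generating", "error"],
  generating: ["ready", "error"],
  error: ["checking"], // a reload restarts the lifecycle
};

function createLifecycle(onChange = () => {}) {
  let phase = "checking";
  return {
    get phase() { return phase; },
    // Moves to `next` if allowed and returns true; otherwise leaves the phase unchanged.
    transition(next) {
      if (!TRANSITIONS[phase].includes(next)) return false;
      phase = next;
      onChange(phase);
      return true;
    },
  };
}
```

The onChange callback is where the UI would swap between the progress bar, the typing indicator, and the error message.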

// Production chat implementation with WebLLM
class BrowserChat {
  constructor() {
    this.engine = null;
    this.conversationHistory = [];
    this.isGenerating = false;
  }

  async initialise(statusCallback) {
    statusCallback({ phase: "checking", message: "Checking GPU support..." });

    if (!navigator.gpu) {
      statusCallback({ phase: "error", message: "WebGPU not supported" });
      return false;
    }

    statusCallback({ phase: "loading", message: "Loading AI model..." });

    try {
      this.engine = await webllm.CreateMLCEngine(
        "gemma-2-2b-it-q4f16_1-MLC",
        {
          initProgressCallback: (progress) => {
            statusCallback({
              phase: "loading",
              message: progress.text,
              progress: progress.progress
            });
          }
        }
      );

      // Warm-up inference
      statusCallback({ phase: "warming", message: "Warming up..." });
      await this.engine.chat.completions.create({
        messages: [{ role: "user", content: "Hi" }],
        max_tokens: 1,
      });

      statusCallback({ phase: "ready", message: "AI ready" });
      return true;

    } catch (error) {
      statusCallback({
        phase: "error",
        message: `Failed to load: ${error.message}`
      });
      return false;
    }
  }

  async sendMessage(userMessage, onToken) {
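    // Ignore calls made while a reply is streaming or before initialise() has succeeded
    if (this.isGenerating || !this.engine) return;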
    this.isGenerating = true;

    this.conversationHistory.push({
      role: "user",
      content: userMessage
    });

    try {
      const stream = await this.engine.chat.completions.create({
        messages: [
          {
            role: "system",
            content: "You are a helpful enterprise assistant. Be concise and accurate."
          },
          ...this.conversationHistory
        ],
        temperature: 0.3,
        max_tokens: 1024,
        stream: true,
      });

      let assistantMessage = "";
      for await (const chunk of stream) {
        const delta = chunk.choices[0]?.delta?.content || "";
        assistantMessage += delta;
        onToken(delta, assistantMessage);
      }

      this.conversationHistory.push({
        role: "assistant",
        content: assistantMessage
      });

    } catch (error) {
      // Drop the unanswered user turn so a retry does not duplicate it
      this.conversationHistory.pop();
      onToken(null, null, error);
    } finally {
      this.isGenerating = false;
    }
  }

  async resetConversation() {
    this.conversationHistory = [];
    // resetChat() is asynchronous; the engine may be null if loading failed
    await this.engine?.resetChat();
  }
}

Conversation memory management: With a cloud API, the cost of processing a long conversation history is paid on the server. With in-browser inference, it is paid on the user's device: the application sends the full history with every message, and as conversations grow, inference slows because the model must process more prompt tokens. Reset the conversation periodically or implement a sliding window that keeps only the last N turns.
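
A sliding window over the history can be sketched in a few lines. This is one possible approach, assuming that a turn is one user message plus the assistant reply that follows it; trimHistory is a hypothetical helper and counts messages rather than tokens.

```javascript
// Keeps only the most recent `maxTurns` user/assistant exchanges.
// Assumption: a turn is one user message plus the assistant reply that follows.
function trimHistory(history, maxTurns) {
  const maxMessages = maxTurns * 2;
  if (history.length <= maxMessages) return history.slice();
  let trimmed = history.slice(-maxMessages);
  // Make sure the window starts on a user message so roles keep alternating.
  while (trimmed.length > 0 && trimmed[0].role !== "user") {
    trimmed = trimmed.slice(1);
  }
  return trimmed;
}
```

Applying it to the history just before each request caps the prompt length. The system prompt is sent separately, so it is never trimmed away.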


Module 5 -- Final Assessment

What is the primary advantage of WebLLM over Transformers.js for LLM inference in the browser?

Why is a warm-up inference recommended after loading the model but before the user's first interaction?

In-browser LLM conversations get slower over time. What is the most likely cause?

Your application must support both embedding search and LLM generation in the browser. Which library architecture is most appropriate?