RAG in the Browser

The zero-network-request RAG stack

Here is something that would have been impractical two years ago: a complete RAG pipeline -- embed documents, index them, search by semantic similarity, and generate answers with citations -- running entirely in a browser tab. No server. No API calls. No data leaving the user's device.

The architecture is straightforward:

Embed: Run a small embedding model (Nomic Embed, BGE-small) in the browser via Transformers.js to convert document text into vectors
Index: Store the vectors in an in-memory HNSW index with persistence to IndexedDB
Search: Given a user query, embed it with the same model and find the top-k most similar document chunks
Generate: Pass the retrieved chunks as context to a local LLM (Gemma 4 E2B/E4B) and generate an answer

Every step runs client-side. The user's documents stay on their device. This is the "personal knowledge base" pattern -- each employee has their own local knowledge base that only they can access, built from the documents they choose to add.

This pattern is particularly powerful for:

Regulated professionals (lawyers, doctors, financial advisers) who work with sensitive client documents
Field workers who need to query technical documentation offline
Executives who want AI search over confidential board papers and financial reports
Anyone whose documents are too sensitive for cloud processing

A law firm wants AI-powered search over client case files. Each lawyer works on different cases with strict client confidentiality requirements. What architecture fits?

Client-side embedding models

The embedding step converts text into dense vectors that capture semantic meaning. You need a small, fast embedding model that runs efficiently in the browser.

Recommended models:

Nomic Embed Text v1.5 (137M parameters)

Dimension: 768 (or 256/512 with Matryoshka truncation)
Size: ~270MB at FP16, ~140MB at INT8
Performance: 50-100 documents/second in browser (short documents, ~100 tokens each)
Quality: competitive with larger embedding models on MTEB benchmarks
Matryoshka support means you can truncate embeddings to 256 dimensions with minimal quality loss, reducing storage by 3x

BGE-small-en-v1.5 (33M parameters)

Dimension: 384
Size: ~65MB at FP16
Performance: 100-200 documents/second in browser
Quality: good for English-only use cases, lower quality on multilingual text
The smallest option that still produces useful embeddings

all-MiniLM-L6-v2 (22M parameters)

Dimension: 384
Size: ~45MB at FP16
Performance: 150-250 documents/second in browser
Quality: widely used baseline, adequate for many retrieval tasks
The lightest option with proven track record
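
The Matryoshka truncation mentioned for Nomic Embed can be sketched as follows. This is a minimal illustration, not the model's official recipe: truncateEmbedding is a hypothetical helper, and the model card for a given model may prescribe extra steps (for example, a layer normalisation before slicing). It keeps the first N values and re-normalises to unit length so that dot products still behave as cosine similarities.

```javascript
// Hypothetical helper: keep the first `dims` values of an embedding and
// rescale the result to unit length.
function truncateEmbedding(vector, dims) {
  if (!Number.isInteger(dims) || dims < 1 || dims > vector.length) {
    throw new RangeError(`dims must be an integer between 1 and ${vector.length}`);
  }

  // Copy the leading values into a fresh Float32Array
  const truncated = new Float32Array(dims);
  let sumSquares = 0;
  for (let i = 0; i < dims; i++) {
    truncated[i] = vector[i];
    sumSquares += vector[i] * vector[i];
  }

  // Re-normalise; an all-zero vector is returned unchanged
  const norm = Math.sqrt(sumSquares);
  if (norm > 0) {
    for (let i = 0; i < dims; i++) truncated[i] /= norm;
  }
  return truncated;
}
```

Going from 768 to 256 dimensions cuts the stored floats per chunk to a third, which is where the 3x storage reduction comes from.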

Loading an embedding model with Transformers.js:

import { pipeline } from '@huggingface/transformers';

const embedder = await pipeline(
  'feature-extraction',
  'nomic-ai/nomic-embed-text-v1.5',
  {
    device: 'webgpu',
    dtype: 'fp16',
    revision: 'onnx',
  }
);

// Embed a single query
const queryEmbedding = await embedder(
  "search_query: What are the termination clauses?",
  { pooling: 'mean', normalize: true }
);
// queryEmbedding.data is a Float32Array of 768 values

// Embed a batch of document chunks
const chunks = [
  "search_document: The agreement may be terminated by either party...",
  "search_document: Upon termination, all confidential materials...",
  "search_document: The non-compete clause survives termination..."
];
const chunkEmbeddings = await embedder(chunks, {
  pooling: 'mean',
  normalize: true
});

Note the search_query: and search_document: prefixes -- Nomic Embed uses these task-specific prefixes to produce better embeddings. Use search_document: for every chunk you index and search_query: for every query, so that both sides of the comparison are embedded the way the model expects.

Memory budget for embeddings:

The embedding model (140-270MB) runs alongside your generation model (1.5-3GB). On a device with 4GB available GPU memory, this means:

E2B (1.5GB) + Nomic Embed (270MB) = 1.77GB for models, leaving ~2.2GB for KV cache, vectors, and overhead. Comfortable.
E4B (3GB) + Nomic Embed (270MB) = 3.27GB for models, leaving ~0.7GB. Tight. Consider using BGE-small (65MB) instead or offloading the embedding model to WASM (CPU).
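
The budget arithmetic above can be checked with a few lines of code. This is only a sketch: remainingMemoryGB is a hypothetical helper, and the sizes are the approximate figures quoted in this section, not measured values.

```javascript
// Hypothetical helper: GPU memory (in GB) left over after loading models.
function remainingMemoryGB(availableGB, modelSizesGB) {
  const usedGB = modelSizesGB.reduce((sum, size) => sum + size, 0);
  return availableGB - usedGB;
}

// Approximate model sizes quoted in this section (GB)
const NOMIC_FP16_GB = 0.27;
const BGE_SMALL_FP16_GB = 0.065;

const headroomE2B = remainingMemoryGB(4, [1.5, NOMIC_FP16_GB]);          // about 2.2
const headroomE4B = remainingMemoryGB(4, [3.0, NOMIC_FP16_GB]);          // about 0.7
const headroomE4BSmall = remainingMemoryGB(4, [3.0, BGE_SMALL_FP16_GB]); // about 0.9

console.log({ headroomE2B, headroomE4B, headroomE4BSmall });
```

Whatever is left has to cover the KV cache, the vector index, and browser overhead, so the E4B configurations leave little room for a large index.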

You are indexing a corpus of 5,000 internal documents (average 2,000 words each). Each document is chunked into ~10 chunks. You need to embed all 50,000 chunks in the browser. How long will this take with Nomic Embed?

Client-side vector indexing and search

Once you have embeddings, you need to search them. The standard approach is an approximate nearest neighbour (ANN) index using the HNSW (Hierarchical Navigable Small World) algorithm.
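
To see what an ANN index approximates, it helps to look at the exact baseline it replaces. The sketch below is a brute-force search, not HNSW: it scores every stored vector against the query and keeps the best k. It assumes all vectors are unit-length (as produced with normalize: true), so the dot product equals cosine similarity. The names dot and bruteForceTopK are illustrative only.

```javascript
// Dot product of two equal-length vectors (arrays or typed arrays)
function dot(a, b) {
  if (a.length !== b.length) {
    throw new RangeError('vectors must have the same length');
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Exact top-k search: score every item, sort descending, keep the first k.
// Each item is assumed to look like { vector, metadata }.
function bruteForceTopK(queryVector, items, k) {
  return items
    .map((item) => ({ item, score: dot(queryVector, item.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

This costs one dot product per stored chunk on every query. For small corpora that can be fast enough; an ANN index trades a small amount of recall for much less work per query as the corpus grows.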

HNSW in the browser:

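Several JavaScript and WebAssembly libraries can provide client-side vector search:

vectra: A lightweight local vector database. The published package targets Node.js and keeps its index in local files, so the browser examples in this section assume a build or adapter that persists to IndexedDB instead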
hnswlib-wasm: A WebAssembly port of the popular hnswlib C++ library
usearch: Compiled to WASM, offering high-performance ANN search

Building an index with vectra:

import { LocalIndex } from 'vectra';

// Create or open an index (persisted to IndexedDB)
const index = new LocalIndex('knowledge-base');
await index.createIndex({ version: 1, dimensions: 768 });

// Insert document chunks with embeddings
for (const chunk of documentChunks) {
  const embedding = await embedder(
    `search_document: ${chunk.text}`,
    { pooling: 'mean', normalize: true }
  );

  await index.insertItem({
    vector: Array.from(embedding.data),
    metadata: {
      text: chunk.text,
      documentId: chunk.documentId,
      documentTitle: chunk.documentTitle,
      chunkIndex: chunk.index,
    }
  });
}

// Search
async function search(query, topK = 5) {
  const queryEmbedding = await embedder(
    `search_query: ${query}`,
    { pooling: 'mean', normalize: true }
  );

  const results = await index.queryItems(
    Array.from(queryEmbedding.data),
    topK
  );

  return results.map(r => ({
    text: r.item.metadata.text,
    documentTitle: r.item.metadata.documentTitle,
    score: r.score,
  }));
}

Performance at different corpus sizes:

Chunks indexed	Index size (768-dim)	Search latency (top-5)	Build time
1,000	~3 MB	under 5ms	Seconds
5,000	~15 MB	under 10ms	Minutes
10,000	~30 MB	under 15ms	Minutes
50,000	~150 MB	under 30ms	10-15 min
100,000	~300 MB	under 50ms	30+ min

Search is fast even at scale. The HNSW algorithm provides logarithmic search time, so doubling the corpus size adds only milliseconds to search latency.

The practical ceiling is not search speed but memory and IndexedDB storage. A 100,000-chunk index at 768 dimensions occupies ~300MB in memory and on disk. Combined with the embedding model and generation model, you are looking at 2-3.5GB of total memory usage. This is feasible on modern hardware but pushes the limits on constrained devices.
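
The index sizes in the table follow from simple arithmetic: each dimension is stored as a 4-byte float. The sketch below reproduces that estimate and adds a rough embedding-time estimate. Both helpers are hypothetical, the size excludes the HNSW graph structure and metadata, and the throughput you pass in is an assumption (the 50-100 documents/second quoted earlier was for short, ~100-token inputs, so longer chunks will be slower).

```javascript
const BYTES_PER_FLOAT32 = 4;

// Raw vector storage in MB (1 MB = 1,000,000 bytes), excluding graph
// structure and metadata overhead
function rawIndexSizeMB(chunkCount, dimensions) {
  return (chunkCount * dimensions * BYTES_PER_FLOAT32) / 1000000;
}

// Rough time to embed a corpus at an assumed, constant throughput
function embeddingTimeMinutes(chunkCount, chunksPerSecond) {
  if (chunksPerSecond <= 0) {
    throw new RangeError('chunksPerSecond must be positive');
  }
  return chunkCount / chunksPerSecond / 60;
}

console.log(rawIndexSizeMB(100000, 768));     // roughly 307 MB
console.log(rawIndexSizeMB(100000, 256));     // roughly 102 MB after truncation
console.log(embeddingTimeMinutes(50000, 50)); // roughly 17 minutes
```

At 50-100 chunks per second, 50,000 chunks take roughly 8-17 minutes, which is in line with the 10-15 minute build time in the table.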

Persistence with IndexedDB:

The index must persist across browser sessions. Otherwise, users would re-embed their entire document corpus every time they open the application.

// Save index to IndexedDB after building/updating
await index.save();

// On application load, check for existing index
const existingIndex = new LocalIndex('knowledge-base');
const exists = await existingIndex.isIndexCreated();

if (exists) {
  // Load from cache -- seconds, not minutes
  await existingIndex.loadIndex();
} else {
  // First time -- need to embed and index documents
  await buildIndexFromDocuments();
}
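
If you persist vectors yourself rather than through a library, one option is to pack them into a single contiguous Float32Array before writing. IndexedDB stores values using the structured clone algorithm, which handles typed arrays, so the packed buffer can be saved as one record. The sketch below shows only the packing and unpacking step. packVectors and unpackVectors are hypothetical helpers, and how a given vector library lays out its own storage is not something this section specifies.

```javascript
// Pack equal-length vectors into one contiguous Float32Array
function packVectors(vectors, dimensions) {
  const packed = new Float32Array(vectors.length * dimensions);
  vectors.forEach((vector, i) => {
    if (vector.length !== dimensions) {
      throw new RangeError(`vector ${i} has ${vector.length} values, expected ${dimensions}`);
    }
    packed.set(vector, i * dimensions);
  });
  return packed;
}

// Split a packed buffer back into per-vector views (no copying)
function unpackVectors(packed, dimensions) {
  if (packed.length % dimensions !== 0) {
    throw new RangeError('packed length is not a multiple of dimensions');
  }
  const vectors = [];
  for (let offset = 0; offset < packed.length; offset += dimensions) {
    vectors.push(packed.subarray(offset, offset + dimensions));
  }
  return vectors;
}
```

Storing one large buffer instead of tens of thousands of small records keeps the number of IndexedDB reads on startup low, at the cost of rewriting the whole buffer when the corpus changes.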

Putting it together: embed, search, generate

Here is the complete pipeline that ties embedding, search, and generation together in a single browser application:

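import { pipeline } from '@huggingface/transformers';
import * as webllm from '@mlc-ai/web-llm';
import { LocalIndex } from 'vectra';

class BrowserRAG {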
  constructor() {
    this.embedder = null;
    this.generator = null;
    this.index = null;
  }

  async initialise(onProgress) {
    onProgress("Loading embedding model...");
    this.embedder = await pipeline(
      'feature-extraction',
      'nomic-ai/nomic-embed-text-v1.5',
      { device: 'webgpu', dtype: 'fp16' }
    );

    onProgress("Loading generation model...");
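    // Model ID from WebLLM's prebuilt list; substitute the ID of the
    // generation model you actually deploy
    this.generator = await webllm.CreateMLCEngine(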
      "gemma-2-2b-it-q4f16_1-MLC",
      { initProgressCallback: (p) => onProgress(p.text) }
    );

    onProgress("Loading knowledge base index...");
    this.index = new LocalIndex('knowledge-base');

    if (await this.index.isIndexCreated()) {
      await this.index.loadIndex();
      onProgress("Ready");
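    } else {
      // First run: create an empty index so addDocument() has somewhere to insert
      await this.index.createIndex({ version: 1, dimensions: 768 });
      onProgress("No documents indexed yet. Add documents to get started.");
    }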
  }

  async addDocument(title, text) {
    const chunks = this.chunkText(text, 512, 50); // 512 tokens, 50 overlap

    for (const chunk of chunks) {
      const embedding = await this.embedder(
        `search_document: ${chunk}`,
        { pooling: 'mean', normalize: true }
      );

      await this.index.insertItem({
        vector: Array.from(embedding.data),
        metadata: { text: chunk, title }
      });
    }

    await this.index.save();
  }

  async ask(question, onToken) {
    // Step 1: Embed the question
    const queryEmbedding = await this.embedder(
      `search_query: ${question}`,
      { pooling: 'mean', normalize: true }
    );

    // Step 2: Search for relevant chunks
    const results = await this.index.queryItems(
      Array.from(queryEmbedding.data),
      5 // top 5 results
    );

    // Step 3: Build context from retrieved chunks
    const context = results
      .filter(r => r.score > 0.5)
      .map(r => `[From: ${r.item.metadata.title}]\n${r.item.metadata.text}`)
      .join("\n\n---\n\n");

    // Step 4: Generate answer with context
    const stream = await this.generator.chat.completions.create({
      messages: [
        {
          role: "system",
          content: `Answer questions based only on the provided context.
If the context does not contain enough information, say so.
Cite the source document for each claim.`
        },
        {
          role: "user",
          content: `Context:\n${context}\n\nQuestion: ${question}`
        }
      ],
      temperature: 0.2,
      max_tokens: 512,
      stream: true,
    });

    let answer = "";
    for await (const chunk of stream) {
      const delta = chunk.choices[0]?.delta?.content || "";
      answer += delta;
      onToken(delta, answer);
    }

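    return {
      answer,
      // Report only the chunks that passed the score filter and were
      // actually included in the context
      sources: results
        .filter(r => r.score > 0.5)
        .map(r => ({
          title: r.item.metadata.title,
          text: r.item.metadata.text,
          score: r.score
        }))
    };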
  }

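  chunkText(text, maxTokens, overlapTokens) {
    // Simple word-based chunking (production would use a tokeniser).
    // Approximation: 1 word ~ 1.3 tokens, so convert token budgets to words.
    const TOKENS_PER_WORD = 1.3;
    const words = text.split(/\s+/).filter(Boolean);
    const chunkSize = Math.max(1, Math.floor(maxTokens / TOKENS_PER_WORD));
    const overlap = Math.floor(overlapTokens / TOKENS_PER_WORD);
    const step = Math.max(1, chunkSize - overlap);
    const chunks = [];

    for (let i = 0; i < words.length; i += step) {
      chunks.push(words.slice(i, i + chunkSize).join(' '));
      // Stop once a chunk reaches the end of the text, so the last
      // chunk is never made up entirely of overlap
      if (i + chunkSize >= words.length) break;
    }

    return chunks;
  }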
}

The user experience:

User opens the application. On repeat visits, models load from the browser's local cache in 2-5 seconds; the first visit has to download them.
User drags and drops PDF/text files into the application. Documents are chunked and embedded locally. This takes a few seconds per document.
User asks a question. The pipeline embeds the query (~50ms), searches the index (~10ms), retrieves relevant chunks, and generates an answer with streaming output (~100ms to first token, full response in 5-15 seconds).

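Total time from question to first answer token: under 200ms (about 50ms to embed the query, 10ms to search, and 100ms to the first token), once the models are loaded and the documents are indexed. No network request. No data leaving the device.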

A user reports that your browser RAG application gives worse answers as they add more documents. What is the most likely cause?

Where browser RAG breaks down

Browser RAG is powerful, but it has hard limits. Knowing when to stop pushing the browser and move to a server is an important architectural skill.

Corpus size ceiling: ~10,000-50,000 chunks. At 768 dimensions per vector, 50,000 chunks consume ~150MB of memory for the index alone, plus the index structure overhead. Add the embedding model and generation model, and you are at 2-3.5GB of total memory usage. Beyond 50,000 chunks, memory pressure becomes problematic on most devices.

For context: 50,000 chunks at ~500 words per chunk is roughly 25 million words, or about 50,000 pages of text. That is a substantial personal knowledge base -- equivalent to 100-200 books or several years of business documents. For many use cases, this is more than enough.

No cross-user search. Each user's index is isolated on their device. There is no way to search across other users' documents without a shared server-side index. If your use case requires "search all company documents," browser RAG is not sufficient.

No real-time updates from external sources. If documents change on a server, the browser application does not know about it until the user explicitly refreshes. There is no push mechanism for updating the local index.

Embedding model quality ceiling. Small embedding models (33-137M parameters) that fit in a browser produce good but not state-of-the-art embeddings. Server-side embedding with larger models (400M-1B parameters) produces meaningfully better retrieval quality for complex queries.

When browser RAG is enough:

Personal knowledge base for individual professionals
Sensitive document search where data must stay on-device
Offline-capable search for field workers
Demo and prototyping without server infrastructure

When you need server-side:

Organisation-wide search across all users' documents
Corpus larger than ~50,000 pages
Real-time index updates from external document management systems
Quality-critical retrieval requiring large embedding models or rerankers

The hybrid approach often works best: browser RAG for personal, sensitive documents that must stay local; server-side RAG for shared, less-sensitive organisational knowledge.
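
The decision criteria above can be written down as a small routing function. This is a sketch under stated assumptions rather than a prescribed design: chooseRagBackend and its field names are hypothetical, the 50,000-chunk ceiling is the rough figure from this section, and the rule that an on-device requirement overrides everything else is one possible policy.

```javascript
// Rough browser ceiling discussed in this section
const BROWSER_CHUNK_CEILING = 50000;

// Hypothetical router: returns 'browser' or 'server' for a document collection
function chooseRagBackend({
  mustStayOnDevice = false,
  needsCrossUserSearch = false,
  needsLiveUpdates = false,
  estimatedChunks = 0,
} = {}) {
  // Confidentiality wins: sensitive documents never leave the device
  if (mustStayOnDevice) return 'browser';

  // Shared search and pushed updates need a server-side index
  if (needsCrossUserSearch || needsLiveUpdates) return 'server';

  // Large corpora exceed what a browser tab can comfortably hold
  if (estimatedChunks > BROWSER_CHUNK_CEILING) return 'server';

  return 'browser';
}
```

For example, a lawyer's confidential case files would route to the browser even at 80,000 chunks under this policy, which means the application would need to warn about memory pressure rather than silently move the data to a server.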

Module 6 -- Final Assessment

In a browser RAG pipeline, what is the correct order of operations when a user asks a question?

Why do Nomic Embed text prompts use 'search_query:' and 'search_document:' prefixes?

Your browser RAG application supports a corpus of 30,000 document chunks. What is the approximate memory overhead for the vector index alone at 768 dimensions?

A user's browser RAG answers degrade as the corpus grows. Which solution most directly addresses the root cause?