Enterprise RAG on Your Own Infrastructure

Embedding Models Demystified

What embedding models actually do, the MTEB leaderboard, top open-source options for 2025-2026, Matryoshka embeddings, multilingual considerations, and the critical distinction between embedding and generative models.

What embedding models actually do

An embedding model takes a piece of text -- a sentence, a paragraph, a document chunk -- and converts it into a list of floating-point numbers. That list is called a vector, and it typically has 768 to 4096 dimensions.

The mathematical intuition is this: the model learns to place semantically similar texts close together in a high-dimensional space, and semantically different texts far apart. "The quarterly revenue exceeded projections" and "Q3 earnings beat analyst estimates" should produce vectors that are close together (high cosine similarity), while "The quarterly revenue exceeded projections" and "The office kitchen needs new coffee filters" should produce vectors that are far apart.
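That geometric intuition can be sketched with plain numpy. The tiny 4-dimensional vectors below are hand-made stand-ins for real model output (which would be 768 to 4096 dimensions), but the cosine-similarity arithmetic is exactly what a vector store computes:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim "embeddings" standing in for real 768-4096-dim model outputs.
revenue_a = np.array([0.9, 0.1, 0.8, 0.2])  # "quarterly revenue exceeded projections"
revenue_b = np.array([0.8, 0.2, 0.9, 0.1])  # "Q3 earnings beat analyst estimates"
coffee    = np.array([0.1, 0.9, 0.1, 0.7])  # "office kitchen needs coffee filters"

print(cosine_similarity(revenue_a, revenue_b))  # high: semantically close
print(cosine_similarity(revenue_a, coffee))     # much lower: unrelated topics
```

A real embedding model's job is to produce vectors for which this simple geometric comparison tracks semantic relatedness.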

How does the model learn this? Through contrastive training. The model is shown millions of pairs: (query, relevant document) and (query, irrelevant document). It learns to produce vectors where the relevant pair has high similarity and the irrelevant pair has low similarity. The specific architecture is usually a transformer encoder (similar to BERT, but much larger and better trained) that processes the input text and produces a single fixed-size vector as output.
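The objective behind contrastive training can be sketched as an InfoNCE-style loss -- a minimal numpy illustration of the idea, not the training code of any particular model. Given a query vector, the loss is a cross-entropy over similarities where the "correct class" is the relevant document, so it is low when the positive sits closest to the query and high otherwise:

```python
import numpy as np

def info_nce_loss(query: np.ndarray, docs: np.ndarray, positive_idx: int,
                  temperature: float = 0.05) -> float:
    """Cross-entropy over cosine similarities; the relevant doc is the target."""
    # Normalise so dot products become cosine similarities.
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = d @ q / temperature                       # one similarity per document
    log_probs = sims - np.log(np.sum(np.exp(sims)))  # log-softmax over the batch
    return float(-log_probs[positive_idx])

rng = np.random.default_rng(0)
query = rng.normal(size=8)
positive = query + 0.1 * rng.normal(size=8)  # relevant doc: near the query
negatives = rng.normal(size=(3, 8))          # irrelevant docs: random directions
docs = np.vstack([positive, negatives])

# Loss is small when the positive document is the one closest to the query.
print(info_nce_loss(query, docs, positive_idx=0))
```

During training, gradients of this loss push the encoder to move relevant pairs together and irrelevant pairs apart; across millions of pairs, that pressure is what shapes the geometry of the embedding space.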

Critically, embedding models are encoders, not generators. They process text in one direction -- text in, vector out. They cannot produce text. They cannot answer questions. They cannot reason. They are a mapping function from the space of all possible texts to a point in high-dimensional geometric space. That mapping is what makes similarity search possible: instead of comparing texts linguistically, you compare their geometric positions.
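The "mapping function" property -- any input length in, one fixed-size vector out -- typically comes from pooling the encoder's per-token vectors into a single vector. A sketch with mean pooling, using random arrays as stand-ins for transformer token outputs (the pooling step itself is real; the inputs here are made up):

```python
import numpy as np

DIM = 768  # a common embedding width

def mean_pool(token_vectors: np.ndarray) -> np.ndarray:
    """Collapse (num_tokens, DIM) per-token outputs into one (DIM,) vector."""
    return token_vectors.mean(axis=0)

rng = np.random.default_rng(1)
short_text = rng.normal(size=(5, DIM))   # stand-in for a 5-token sentence
long_text = rng.normal(size=(300, DIM))  # stand-in for a 300-token chunk

# Different input lengths, identical output shape: that fixed size is what
# lets a vector index compare any two texts geometrically.
print(mean_pool(short_text).shape)  # (768,)
print(mean_pool(long_text).shape)   # (768,)
```

Whether a model uses mean pooling, a [CLS]-token vector, or something else varies by model, but the end result is the same: every text, whatever its length, lands at a single point in the same fixed-dimensional space.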

This is fundamentally different from what generative models like Gemma 4 do. A generative model takes text in and produces text out. It can reason, follow instructions, and synthesise information. But it cannot efficiently produce fixed-size vector representations of meaning. These are different architectures solving different problems.


An embedding model produces a 1024-dimensional vector for the text 'Annual revenue was $4.2 billion.' Which of the following is true about this vector?