Vector Databases at Scale

The vector database landscape

A vector database stores embedding vectors and enables fast similarity search over them. That is the core job. But at enterprise scale, the differences between vector databases matter enormously -- in performance, operational complexity, cost, and the features that make production deployment viable.

Here is the landscape as it stands in 2026, focused on the options that are deployable on your own infrastructure.

Milvus (LF AI & Data Foundation, Apache 2.0). The most mature open-source vector database for large-scale deployment. Written in Go and C++ with a distributed architecture that separates compute, storage, and coordination. Natively supports sharding across nodes, GPU-accelerated indexing and search, and hybrid dense+sparse search. Milvus is the default choice when you know you will exceed a single node's capacity. The trade-off: operational complexity. A production Milvus cluster involves etcd for metadata, MinIO or S3 for storage, and multiple query/data/index nodes. You are operating a distributed system with all the associated monitoring, failover, and upgrade complexity.

Qdrant (Apache 2.0). Written in Rust, optimised for single-node performance with horizontal scaling when needed. Qdrant's standout feature is its filtering performance -- metadata filters are applied during the search itself (not as a post-filter), which makes filtered queries almost as fast as unfiltered ones. It supports quantisation (scalar and product quantisation) to reduce memory usage, and on-disk storage for vectors that do not fit in RAM. Simpler to operate than Milvus for deployments up to a few hundred million vectors.

Weaviate (BSD-3-Clause). Written in Go. Weaviate distinguishes itself with built-in "modules" for vectorisation, reranking, and generative search -- you can configure the entire RAG pipeline within Weaviate. This is convenient for prototyping but can be limiting in production where you want control over each pipeline stage. Strong multi-tenancy support. Its HNSW implementation is well-optimised, and it supports flat, dynamic, and HNSW indexes.

pgvector (PostgreSQL extension, PostgreSQL License). If you already run PostgreSQL, pgvector adds vector similarity search without introducing a new database. It supports HNSW and IVFFlat indexes. Performance is good for up to 10-20 million vectors. The advantage: you get vector search, relational queries, ACID transactions, and your existing PostgreSQL expertise in one system. The disadvantage: beyond that scale, it cannot match the throughput of purpose-built vector databases.

Chroma (Apache 2.0). An embedded vector database designed for simplicity. Excellent for prototyping and small deployments (up to a few million vectors). Not designed for distributed deployment or billion-vector scale. Think of it as SQLite for vectors -- perfect when simplicity matters more than scale.

Your enterprise has 500 million document chunk vectors, a PostgreSQL-heavy infrastructure team, and strict requirements for ACID transactions on metadata updates. Which vector database fits best?

The cost of operating your own vector database

Let us do a concrete cost comparison for a 500-million-vector deployment at 1024 dimensions.

Storage requirement: 500M vectors x 1024 dimensions x 4 bytes (float32) = 2 TB of raw vector data. With HNSW index overhead (typically 1.5-2x), metadata, and replication, plan for 4-6 TB of total storage.
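The sizing arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function name raw_vector_bytes is illustrative, and the 1.5-2x overhead range is simply the figure quoted in the text, not a measured value.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Size of the vectors alone, before index overhead, metadata, or replicas."""
    return num_vectors * dims * bytes_per_value


TB = 10**12  # decimal terabyte, matching the rounding used in the text

raw_tb = raw_vector_bytes(500_000_000, 1024) / TB
# Apply the 1.5-2x HNSW overhead range quoted above to a single copy.
indexed_tb = (raw_tb * 1.5, raw_tb * 2.0)

print(f"raw vectors: {raw_tb:.2f} TB")
print(f"with index:  {indexed_tb[0]:.2f}-{indexed_tb[1]:.2f} TB per copy")
```

Changing bytes_per_value to 1 models int8 storage, which is a quick way to see how much of the budget comes from the float32 representation.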

Self-hosted (Qdrant on bare metal or cloud VMs):

Component	Specification	Monthly cost (cloud VMs)
Primary node	256 GB RAM, 4 TB NVMe SSD, 32 cores	$2,500-3,500
Replica node	Same specification	$2,500-3,500
Monitoring/management	Prometheus, Grafana on small VM	$100-200
Engineer time (estimated)	10-20 hours/month operations	$3,000-6,000
Total		$8,100-13,200/month

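With scalar quantisation (int8), vector memory drops by 4x: the 2 TB of float32 vectors shrinks to roughly 512 GB. The 256 GB nodes above cannot hold the full float32 index in RAM and so rely on serving part of it from NVMe; by the same 4x ratio, 64 GB RAM nodes suffice for a comparable setup. That drops hardware cost to $800-1,200/month per node.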

Managed (Pinecone, Weaviate Cloud):

Pinecone's serverless pricing at this scale: storage costs ($0.33/GB/month for 4 TB = $1,320) plus read units based on query volume. At 50,000 queries/day, expect $3,000-8,000/month depending on query complexity and pod configuration.

Weaviate Cloud's enterprise tier for this scale: $5,000-15,000/month depending on SLA requirements and cluster configuration.

The crossover point. For small deployments (under 50 million vectors, under 10,000 queries/day), managed services are cheaper than self-hosting because you avoid the operational overhead. Above 200-300 million vectors, self-hosting on dedicated hardware (especially with quantisation) becomes significantly cheaper -- and you gain full control over data residency.

The real cost of self-hosting is not hardware. It is the engineering time to operate, monitor, upgrade, and troubleshoot the database. If your team does not have experience running stateful distributed systems, factor in a substantial ramp-up period.

HNSW, IVF, and Product Quantisation

The indexing algorithm determines how the vector database organises vectors for fast search. Understanding the trade-offs is essential for tuning performance at scale.

HNSW (Hierarchical Navigable Small World). The default choice for most deployments. HNSW builds a multi-layer graph where each vector is a node, connected to its nearest neighbours. The upper layers are sparse and contain only a small fraction of the nodes; the bottom layer contains all of them. Search starts from a fixed entry point in the top layer, greedily moves to closer neighbours at each step, then descends to the denser, finer-grained layers.

Trade-offs:

Build time: moderate (hours for hundreds of millions of vectors)
Query latency: excellent (2-10 ms for 100M vectors)
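Memory: high -- the graph structure must fit in RAM for fast search. A 500M vector index at 1024 dimensions holds about 2 TB of float32 vectors, so with the 1.5-2x HNSW overhead cited earlier it needs roughly 3-4 TB of RAM per copy unless the vectors are quantised or kept on disk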
Accuracy: very high (typically 95-99% recall)
Tuneable parameters: M (number of edges per node, higher = better recall, more memory), efConstruction (build quality), efSearch (query quality vs speed)
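To make the role of efSearch concrete, here is a simplified sketch of the search within a single HNSW layer. It is an illustration only: the names search_layer, graph, and vectors are hypothetical, the toy graph is a hand-built chain rather than a real HNSW index, and a full implementation would repeat this per layer while descending.

```python
import heapq
import math


def search_layer(graph, vectors, query, entry, ef):
    """Best-first search over one graph layer, keeping the ef closest nodes."""
    d0 = math.dist(vectors[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]  # min-heap: closest unexplored node first
    best = [(-d0, entry)]       # max-heap via negation: the ef best so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break  # the closest candidate is worse than the worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(vectors[nb], query)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current worst
    return sorted((-neg_d, n) for neg_d, n in best)


# Toy layer: ten points on a line, each linked to its immediate neighbours.
vectors = [(float(i),) for i in range(10)]
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

result = search_layer(graph, vectors, query=(7.2,), entry=0, ef=2)
nearest_ids = [node for _, node in result]
print(nearest_ids)
```

A larger ef keeps more candidates alive, so the search explores more of the graph: better recall at the cost of more distance computations.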

IVF (Inverted File Index). IVF partitions the vector space into clusters (typically 1,000-10,000 for large datasets) using k-means clustering. At query time, it identifies the nearest clusters and only searches vectors within those clusters.

Trade-offs:

Build time: faster than HNSW (the k-means clustering is the main cost)
Query latency: good (5-30 ms depending on nprobe -- the number of clusters searched)
Memory: lower than HNSW because there is no graph structure
Accuracy: depends heavily on nprobe. Searching 1% of clusters gives ~80% recall; 10% gives ~95% recall
Best for: very large datasets where HNSW's memory requirement is prohibitive
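The nprobe trade-off is easier to see in code. The sketch below assumes the centroids have already been computed (the k-means step is omitted), and all names (build_ivf, ivf_search, the toy data) are illustrative rather than any library's API.

```python
import math


def nearest_centroids(centroids, v):
    """Centroid indices ordered from closest to farthest from v."""
    return sorted(range(len(centroids)), key=lambda c: math.dist(centroids[c], v))


def build_ivf(vectors, centroids):
    """Put each vector id into the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, v in enumerate(vectors):
        lists[nearest_centroids(centroids, v)[0]].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe closest clusters, then rank those candidates exactly."""
    probed = nearest_centroids(centroids, query)[:nprobe]
    candidates = [vid for c in probed for vid in lists[c]]
    candidates.sort(key=lambda vid: math.dist(vectors[vid], query))
    return candidates[:k]


# Toy data: two well-separated groups and one assumed centroid per group.
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = [(0.33, 0.33), (10.33, 10.33)]
lists = build_ivf(vectors, centroids)

fast = ivf_search((9, 9), vectors, centroids, lists, k=1, nprobe=1)
full = ivf_search((9, 9), vectors, centroids, lists, k=4, nprobe=2)
print(fast, full)
```

With nprobe equal to the number of clusters the search degenerates to brute force; smaller values skip whole clusters, which is where both the speed-up and the recall loss come from.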

Product Quantisation (PQ). PQ is not an index structure but a compression technique. It divides each high-dimensional vector into sub-vectors and quantises each sub-vector independently, reducing memory by 4-32x. A 1024-dimensional float32 vector (4 KB) can be compressed to 128 bytes with PQ.

PQ is typically combined with IVF (IVF-PQ) for large-scale deployments where the full vectors do not fit in memory. The trade-off is accuracy: PQ introduces quantisation error that reduces recall by 2-10%, depending on compression ratio. For billion-vector deployments, IVF-PQ is often the only viable option without enormous hardware investment.
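As a rough illustration of how PQ reaches those compression ratios, the sketch below encodes a vector against tiny hand-written codebooks. The codebooks would normally be learned with k-means per sub-space; here they are assumed, and the function names are hypothetical.

```python
import math


def pq_code_bytes(dims, num_subvectors, bits_per_code=8):
    """Bytes needed to store one PQ-compressed vector."""
    if dims % num_subvectors:
        raise ValueError("dims must be divisible by num_subvectors")
    return math.ceil(num_subvectors * bits_per_code / 8)


def pq_encode(vector, codebooks):
    """Replace each sub-vector with the index of its nearest codebook entry."""
    step = len(vector) // len(codebooks)
    codes = []
    for i, book in enumerate(codebooks):
        sub = vector[i * step:(i + 1) * step]
        codes.append(min(range(len(book)), key=lambda j: math.dist(book[j], sub)))
    return codes


def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the chosen codebook entries."""
    out = []
    for code, book in zip(codes, codebooks):
        out.extend(book[code])
    return out


# The example from the text: 1024 float32 dims (4096 bytes) as 128 one-byte codes.
compressed = pq_code_bytes(1024, 128)
ratio = (1024 * 4) / compressed

# Toy codebooks: 2 sub-spaces of 2 dims each, 2 entries per codebook (assumed).
codebooks = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 1.0), (1.0, 0.0)]]
codes = pq_encode([0.9, 1.1, 0.1, 0.8], codebooks)
approx = pq_decode(codes, codebooks)
print(compressed, ratio, codes, approx)
```

The gap between the original vector and approx is the quantisation error that costs recall; more sub-vectors or larger codebooks shrink it at the price of a bigger code.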

Scalar Quantisation (SQ). A simpler compression technique: convert float32 to int8, reducing memory by 4x with minimal accuracy loss (typically < 1% recall reduction). Qdrant and Milvus both support this natively. It is the first optimisation you should apply before reaching for PQ.
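A minimal sketch of scalar quantisation follows. It assumes a single global value range [lo, hi] and unsigned 8-bit codes for simplicity; real engines choose ranges and signedness in their own ways, and the function names here are illustrative.

```python
def _scale(lo, hi):
    """Width of one quantisation step; falls back to 1.0 for a zero range."""
    return (hi - lo) / 255 or 1.0


def sq_encode(vector, lo, hi):
    """Map each float in [lo, hi] to one byte (0-255), clamping outliers."""
    s = _scale(lo, hi)
    return bytes(min(255, max(0, round((x - lo) / s))) for x in vector)


def sq_decode(code, lo, hi):
    """Approximate reconstruction of the original floats."""
    s = _scale(lo, hi)
    return [lo + b * s for b in code]


lo, hi = -1.0, 1.0
original = [-1.0, 0.25, 1.0]
code = sq_encode(original, lo, hi)
restored = sq_decode(code, lo, hi)

# One byte per dimension instead of four for float32: a 4x reduction.
print(len(code), restored)
```

The reconstruction error per value is at most half a quantisation step, which is why recall usually changes so little.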

You have 2 billion vectors at 1024 dimensions. Your hardware budget allows for 512 GB of total RAM across your cluster. Which indexing strategy is most appropriate?

Sharding at billion-vector scale

When your vector count exceeds what a single node can handle efficiently (typically 500M-1B vectors depending on hardware), you need to shard: distribute vectors across multiple nodes.

Range-based sharding assigns vectors to nodes based on a metadata attribute -- for example, all documents from 2023 go to shard 1, 2024 to shard 2. This works well when queries frequently filter by that attribute, because you can route queries to the relevant shard and skip the others. The downside: uneven distribution if some ranges have far more documents than others.

Hash-based sharding distributes vectors uniformly across nodes using a hash of the vector ID. This ensures even distribution but requires querying all shards for every search (because similar vectors can land on any shard). The results are merged using a top-K merge sort.
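Hash-based placement and the scatter-gather merge can be sketched as follows. The helper names shard_for and merge_top_k are hypothetical, and each shard is assumed to return its own results already sorted by ascending distance.

```python
import hashlib
import heapq
from itertools import islice


def shard_for(vector_id: str, num_shards: int) -> int:
    """Stable shard assignment from a hash of the vector ID."""
    digest = hashlib.sha256(vector_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def merge_top_k(per_shard_results, k):
    """Merge per-shard (distance, id) lists, each sorted ascending, into a global top-k."""
    return list(islice(heapq.merge(*per_shard_results), k))


# Each shard returns its local top-k; the coordinator merges them.
shard_a = [(0.1, "doc-17"), (0.4, "doc-03")]
shard_b = [(0.2, "doc-42"), (0.3, "doc-08")]
merged = merge_top_k([shard_a, shard_b], k=3)
print(shard_for("doc-17", 4), merged)
```

Because every shard must be queried, each shard needs to return its own top-k for the merged result to be a correct global top-k.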

Partition-based sharding (used by Milvus) creates logical partitions within a cluster, and each partition can span multiple physical nodes. Milvus handles the routing, replication, and rebalancing automatically. This is the most operationally convenient approach but requires the full Milvus distributed deployment.

For most enterprise deployments, the practical approach is: start with a single high-memory node with scalar quantisation. When that is exhausted, move to IVF-PQ compression on a single node. When that is exhausted, shard. Many organisations hit 500M-1B vectors before needing to shard.

Hybrid search: combining dense and sparse retrieval

Dense vector search (what we have been discussing) excels at semantic matching -- finding documents about the same concept even if they use different words. But it struggles with exact keyword matching, numerical values, and entity names. A search for "ISO 27001 Section 6.2.3" might return documents about information security generally, but miss the exact section reference.

Sparse retrieval (BM25, SPLADE) excels at exact matching. It is the technology behind traditional search engines. It finds documents containing the specific keywords in your query.

Hybrid search combines both signals. For each query, you run both a dense vector search and a sparse keyword search, then merge the results. The standard merging technique is Reciprocal Rank Fusion (RRF):

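For each document, calculate: RRF_score = sum(1 / (k + rank_in_list)) across all retrieval lists in which the document appears, where rank_in_list is its 1-based position in that list and k is a smoothing constant, typically 60.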
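The RRF formula translates directly into code. This is a minimal sketch: the function name rrf_fuse and the document IDs are made up for illustration, and ties are broken alphabetically by ID purely to keep the output deterministic.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; alphabetical ID as a deterministic tie-break.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_hits = ["a", "b", "c"]   # order returned by the vector search
sparse_hits = ["c", "a", "d"]  # order returned by the keyword search
fused = rrf_fuse([dense_hits, sparse_hits])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

Because RRF uses only ranks, it needs no score normalisation between the dense and sparse retrievers, which is a large part of its appeal.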

In practice, hybrid search improves retrieval quality by 5-15% over dense-only search, with the largest gains on queries containing specific entity names, numbers, codes, or technical terms.

Milvus, Qdrant, and Weaviate all support hybrid search natively. With BGE-M3, you can generate both dense and sparse embeddings from the same model, simplifying the pipeline.

Your legal department searches for specific contract clause references like 'Section 14.2(b)' and also for conceptual queries like 'indemnification obligations related to data breaches.' Which search strategy serves both use cases?

Metadata filtering at scale

In enterprise RAG, almost every query should include metadata filters. "Find documents about GDPR compliance" is rarely the actual query. It is "Find documents about GDPR compliance that my security clearance allows me to see, from the legal department, created after January 2024, that are marked as current (not superseded)."

Metadata filtering interacts with vector search in two ways, and the difference matters enormously for performance.

Pre-filtering applies metadata conditions before vector search. The database first identifies all vectors matching the filter, then searches only within that subset. This gives exact filter results but can degrade search quality if the filtered subset is small -- HNSW navigation may find poor paths when most nodes are excluded.

Post-filtering runs the full vector search first, then removes results that do not match the filter. This maintains search quality but may return fewer results than requested (you ask for top-10 but only 6 pass the filter), or even zero results if the top-K candidates happen to not match.

Qdrant's approach (filterable HNSW) applies filters during graph traversal, achieving near-native speed for filtered queries. This is a significant technical advantage at scale.
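The difference between pre- and post-filtering shows up even with a brute-force search. The sketch below uses exact distance ranking in place of an HNSW index, and the item layout and function names are assumptions made for the example.

```python
import math


def top_k(query, items, k):
    """Exact nearest neighbours by Euclidean distance (stand-in for an ANN index)."""
    return sorted(items, key=lambda it: math.dist(it["vector"], query))[:k]


def pre_filter_search(query, items, predicate, k):
    """Filter first, then search only within the matching subset."""
    return top_k(query, [it for it in items if predicate(it)], k)


def post_filter_search(query, items, predicate, k):
    """Search everything first, then drop results that fail the filter."""
    return [it for it in top_k(query, items, k) if predicate(it)]


items = [
    {"id": 1, "vector": (0.0, 0.0), "dept": "hr"},
    {"id": 2, "vector": (0.1, 0.0), "dept": "legal"},
    {"id": 3, "vector": (0.2, 0.0), "dept": "hr"},
    {"id": 4, "vector": (0.9, 0.0), "dept": "legal"},
]


def is_legal(item):
    return item["dept"] == "legal"


pre_ids = [it["id"] for it in pre_filter_search((0.0, 0.0), items, is_legal, k=2)]
post_ids = [it["id"] for it in post_filter_search((0.0, 0.0), items, is_legal, k=2)]
print(pre_ids, post_ids)
```

Here the post-filtered query asked for two results but returns only one, because one of the two nearest vectors belongs to another department. A common workaround is to over-fetch before filtering, at the cost of extra work per query.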

Best practices for metadata filtering in enterprise RAG:

Index the fields you filter on. Every metadata field used in filters should be indexed in the vector database. Unindexed field filters fall back to sequential scan.
Avoid highly selective filters that leave only a tiny subset. Filtering to 0.01% of your vectors can degrade HNSW performance. For very narrow filters, consider partitioned indexes.
Use integer and keyword types, not string search. Metadata filtering is fast for exact matches and ranges. Free-text search within metadata is slow -- use the sparse search component for that.
Design your metadata schema at ingestion time. Adding new metadata fields later requires re-indexing. Plan for: source system, document type, department, classification level, date range, and any entity fields you extract.

Module 4 -- Final Assessment

At what vector count does self-hosted vector database deployment typically become more cost-effective than managed services?

Why is HNSW typically not viable as the sole indexing strategy for 2 billion vectors on commodity hardware?

What is the primary advantage of hybrid search (combining dense vectors with sparse BM25 retrieval) over dense-only search?

You need to filter vector search results to only return documents from the 'Legal' department created after January 2024. What is the risk of using post-filtering for this query?