Vector Databases
The Problem: Nearest Neighbor Search at Scale
You have a million document embeddings stored in a Postgres table. A query comes in. You embed the query and need the 10 most similar documents. The naive approach: compute cosine similarity between the query vector and every stored vector. One million dot products, every request.
At small scale this works. At 100,000 documents it starts to hurt. At a million it becomes the bottleneck. And the problem compounds: each vector has hundreds to thousands of dimensions, so each similarity computation is itself expensive.
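To make the cost concrete, here is a minimal sketch of the naive approach in plain Python. The function names and the toy 3-dimensional vectors are illustrative assumptions; the point is that every query touches every stored vector, so the work grows as N times the dimension.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_top_k(query, vectors, k):
    """Score every stored vector against the query and keep the k best.

    `vectors` maps a document ID to its embedding. Cost is O(N * d) per query.
    """
    scored = ((cosine_similarity(query, vec), doc_id) for doc_id, vec in vectors.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

# Toy 3-dimensional embeddings; real ones have hundreds to thousands of dimensions.
docs = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
    "d": [0.0, 0.0, 1.0],
}
top_two = brute_force_top_k([1.0, 0.05, 0.0], docs, k=2)
print(top_two)
```

This exhaustive scan is also the ground truth that approximate indexes are measured against.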
Traditional databases use B-tree or inverted indexes for exact lookups: find the row where user_id = 42. Those index structures have no concept of "close to." You can't ask a B-tree for the 10 nearest neighbors without scanning everything.
Vector databases exist to solve this. They provide fast approximate nearest neighbor (ANN) search: find vectors that are very likely to be the most similar, without scanning every vector in the collection. The trade-off is that the result is approximate (you might miss the true nearest neighbor), but in practice the accuracy loss is small and the speed gain is large.
How Vector Search Works: HNSW
The dominant index structure today is HNSW (Hierarchical Navigable Small World) (Malkov & Yashunin, 2018), a graph-based approach.
When you insert a vector, HNSW adds it to a multi-layer graph. Higher layers are sparse, with a few long-range connections. Lower layers are dense, with many short-range connections. The structure is designed so you can navigate from "approximately right part of the space" (high layer) down to "actually close" (bottom layer) in far fewer steps than a linear scan.
At query time:
- Start at the top layer from the index's entry point (a node that exists in the highest layer)
- Greedily follow edges toward the query vector (whichever neighbor is closer)
- Drop to the next layer and repeat
- At the bottom layer, return the top-K closest nodes found
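The steps above can be sketched on a tiny hand-built graph. This is a simplified illustration, not a full HNSW implementation: the layers are written out by hand rather than built by insertion, the bottom layer uses the same greedy walk instead of a wider beam, and the names `greedy_layer` and `layered_search` are made up for this example.

```python
import math

def greedy_layer(graph, points, query, start):
    """Greedy walk in one layer: move to the closest neighbor until none is closer.

    Returns the node where the walk stopped and the distances of every node evaluated.
    """
    current = start
    seen = {current: math.dist(points[current], query)}
    while True:
        neighbors = graph.get(current, [])
        for neighbor in neighbors:
            seen[neighbor] = math.dist(points[neighbor], query)
        best = min(neighbors, key=seen.get, default=current)
        if seen[best] >= seen[current]:
            return current, seen
        current = best

def layered_search(layers, points, query, entry, k):
    """Descend from the sparse top layer to the dense bottom layer."""
    current = entry
    for graph in layers[:-1]:
        # Upper layers only provide a good starting node for the layer below.
        current, _ = greedy_layer(graph, points, query, current)
    _, seen = greedy_layer(layers[-1], points, query, current)
    # Return the k closest nodes evaluated in the bottom layer.
    return sorted(seen, key=seen.get)[:k]

# Eight points on a line; each layer is an adjacency dict, listed top to bottom.
points = {i: (float(i), 0.0) for i in range(8)}
layers = [
    {0: [4], 4: [0]},                              # sparse: one long-range edge
    {0: [2], 2: [0, 4], 4: [2, 6], 6: [4]},        # medium
    {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)},  # dense
]
result = layered_search(layers, points, (5.2, 0.0), entry=0, k=2)
print(result)
```

The walk reaches node 5 after three hops (0 to 4, 4 to 6, 6 to 5) instead of stepping through every node from 0.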
The key metric is recall@K: what fraction of the true K nearest neighbors appear in your returned K results? Most production systems operate at 95-99% recall. The HNSW parameters ef_construction (index build quality) and ef_search (query time quality) let you trade recall for speed.
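Recall@K is easy to measure offline if you can afford an exact search over a sample of queries. A minimal sketch, assuming you already have the exact and approximate result lists as ID sequences (the function name and the document IDs are illustrative):

```python
def recall_at_k(true_ids, returned_ids, k):
    """Fraction of the true top-k IDs that appear in the returned top-k IDs."""
    truth = set(true_ids[:k])
    return len(truth & set(returned_ids[:k])) / len(truth)

exact = ["d1", "d2", "d3", "d4", "d5"]    # from an exhaustive scan
approx = ["d1", "d3", "d2", "d9", "d5"]   # from the ANN index
score = recall_at_k(exact, approx, k=5)
print(score)
```

Order within the top K does not matter here: the approximate list found four of the five true neighbors, so recall@5 is 0.8.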
The alternative index type, IVF (Inverted File Index), clusters vectors into buckets at index time and searches only a subset of buckets at query time. IVF is faster to build but typically has lower recall than HNSW for the same latency budget. Most managed vector databases use HNSW or a hybrid.
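A rough sketch of the IVF idea, under simplifying assumptions: the centroids are given rather than learned (a real index would typically train them with k-means), and `nprobe`, the number of buckets searched per query, is a name borrowed from common usage rather than from this article.

```python
import math

def nearest_centroids(vector, centroids, n):
    """Indices of the n centroids closest to `vector`."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))
    return order[:n]

def build_ivf(vectors, centroids):
    """Assign every vector ID to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        buckets[nearest_centroids(vec, centroids, 1)[0]].append(vec_id)
    return buckets

def search_ivf(query, vectors, centroids, buckets, nprobe, k):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    candidates = [
        vec_id
        for bucket in nearest_centroids(query, centroids, nprobe)
        for vec_id in buckets[bucket]
    ]
    return sorted(candidates, key=lambda vec_id: math.dist(query, vectors[vec_id]))[:k]

centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = {"a": (1.0, 0.0), "b": (2.0, 1.0), "c": (4.9, 0.0), "d": (5.6, 0.0), "e": (9.0, 0.0)}
buckets = build_ivf(vectors, centroids)

query = (5.2, 0.0)
one_bucket = search_ivf(query, vectors, centroids, buckets, nprobe=1, k=1)
two_buckets = search_ivf(query, vectors, centroids, buckets, nprobe=2, k=1)
print(one_bucket, two_buckets)
```

With `nprobe=1` the search returns "d" and misses the true nearest neighbor "c", which sits just across the bucket boundary. Probing both buckets finds it, at the cost of scanning more vectors.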
Core Operations
Every vector database exposes the same core operations:
Upsert: insert or update a vector, with an ID and optional metadata payload:
{ "id": "doc-42", "vector": [0.12, -0.87, ...], "payload": { "text": "...", "user_id": "u-7" } }
Similarity search: given a query vector, return the top-K most similar:
{ "vector": [0.11, -0.90, ...], "top_k": 5 }
Filtered similarity search: similarity search within a subset of records:
{ "vector": [...], "top_k": 5, "filter": { "user_id": "u-7" } }
This is the critical operation for multi-tenant applications. Without filtering, a search for user A's memories would return results from user B.
Delete by ID: remove a vector when the underlying data is deleted or updated.
Collections / namespaces: logical groupings of vectors, each with its own index. Most systems let you create multiple collections in a single deployment (one per tenant, one per document type, etc.).
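To show how these operations fit together, here is a toy in-memory store. It is a sketch for illustration only: `ToyVectorStore` and its method names are hypothetical and do not match any particular product's API, and search is an exhaustive scan rather than an ANN index.

```python
import math

class ToyVectorStore:
    """Illustrative in-memory store; search is an exhaustive scan, not an ANN index."""

    def __init__(self):
        self._records = {}  # record ID -> (vector, payload)

    def upsert(self, record_id, vector, payload=None):
        """Insert a new record or overwrite the one with the same ID."""
        self._records[record_id] = (list(vector), dict(payload or {}))

    def delete(self, record_id):
        """Remove a record; deleting a missing ID is a no-op."""
        self._records.pop(record_id, None)

    def search(self, vector, top_k, filter=None):
        """Return the IDs of the top_k most cosine-similar records.

        `filter` is a dict of payload fields that must match exactly.
        """
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))

        matches = [
            (cosine(vector, vec), record_id)
            for record_id, (vec, payload) in self._records.items()
            if all(payload.get(key) == value for key, value in (filter or {}).items())
        ]
        return [record_id for _, record_id in sorted(matches, reverse=True)[:top_k]]

store = ToyVectorStore()
store.upsert("doc-1", [1.0, 0.0], {"user_id": "u-7"})
store.upsert("doc-2", [0.9, 0.1], {"user_id": "u-8"})
store.upsert("doc-3", [0.0, 1.0], {"user_id": "u-7"})

everyone = store.search([1.0, 0.0], top_k=2)
only_u7 = store.search([1.0, 0.0], top_k=2, filter={"user_id": "u-7"})
store.delete("doc-1")
after_delete = store.search([1.0, 0.0], top_k=2, filter={"user_id": "u-7"})
print(everyone, only_u7, after_delete)
```

The unfiltered search returns a record belonging to u-8, which is exactly the cross-tenant leak that the filter prevents.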
The Landscape
Four categories of vector database, each with a different operational posture:
Embedded (no server)
Chroma: runs in-process with your application. No separate server to deploy. Data lives on disk. The right choice for local development, prototyping, and small-scale production (tens of thousands of vectors). Not suited to multi-server deployments.
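import chromadb

# Runs in-process and persists to local disk; there is no server to start.
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection("memories")
# Toy 3-dimensional vectors for illustration; real embeddings come from your embedding model.
collection.add(ids=["1"], embeddings=[[0.1, 0.2, 0.3]], documents=["User prefers Python"])
results = collection.query(query_embeddings=[[0.1, 0.2, 0.3]], n_results=5)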
Open-Source, Self-Hosted or Managed
Qdrant: written in Rust, strong filtering capabilities, cloud and self-hosted. Supports multiple vector types per record (useful for hybrid dense/sparse search). Has a generous free cloud tier. The filtering system is the most expressive of the open-source options.
Weaviate: built-in hybrid search that combines vector similarity with BM25 keyword scoring in a single query. Schema-driven data model. Good choice when your retrieval needs to blend semantic and lexical matching (e.g., a search that should prefer documents mentioning an exact product SKU).
Milvus: high scale, open-source, with Zilliz Cloud as the managed offering. Designed for very large collections (hundreds of millions of vectors). More operational complexity than Qdrant or Weaviate. Worth the overhead only at scale.
Fully Managed
Pinecone: the category pioneer. Fully managed, serverless tier available, no infrastructure to run. Battle-tested at production scale. Proprietary, with no self-hosted option. Pricing is per record/query rather than per compute. Simple API, strong reliability.
Postgres Extension
pgvector: adds vector storage and HNSW/IVF search directly to Postgres. If your application already runs on Postgres, this is the lowest-overhead path: one database, existing backups and monitoring, familiar query language.
-- Store a vector
INSERT INTO memories (user_id, content, embedding)
VALUES ('u-7', 'User prefers Python', '[0.12, -0.87, ...]');
-- Similarity search with filter
SELECT content, embedding <=> '[0.11, -0.90, ...]' AS distance
FROM memories
WHERE user_id = 'u-7'
ORDER BY distance
LIMIT 5;
pgvector added HNSW support in 0.5.0 (2023). Before that, its only index type was IVFFlat, which has to be built after the table already holds representative data and performs poorly on small collections. Use HNSW unless you have specific reasons not to.
Quick Selection Guide
| Situation | Recommendation |
|---|---|
| Already on Postgres | pgvector (one less system to run) |
| Local dev / prototype | Chroma (no setup) |
| Need strong metadata filtering | Qdrant |
| Need hybrid semantic + keyword search | Weaviate |
| Want fully managed, no infra | Pinecone |
| Very large scale (100M+ vectors) | Milvus / Zilliz |
Operational Gotchas
Embedding model lock-in: every vector in a collection must come from the same embedding model. If you switch models (say, from text-embedding-ada-002 to text-embedding-3-large), you cannot mix old and new vectors because the spaces are incompatible. The migration path is to re-embed everything. This is expensive and disruptive. Choose your embedding model deliberately, not by default.
Dimension mismatch: the collection schema specifies the vector dimension (e.g., 1536 for OpenAI's text-embedding-ada-002, 1024 for Cohere's embed-english-v3.0). You cannot insert a 768-dimensional vector into a 1536-dimensional collection. Pick a model, lock in the dimension, and don't mix them. For guidance on choosing the right dimension count for your workload, including how Matryoshka models let you trade quality for cost, see Embeddings: Dimensions.
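One way to catch both problems early is to record the model and dimension next to the collection and check every vector before it is written. The sketch below is an application-side guard with hypothetical names (`CollectionSpec`, `validate_vector`); vector databases generally enforce the dimension themselves, but most cannot know which model produced a vector.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CollectionSpec:
    """What a collection expects: one embedding model, one dimension."""
    embedding_model: str
    dimension: int

def validate_vector(spec, vector, model):
    """Raise ValueError if the vector does not belong in this collection."""
    if model != spec.embedding_model:
        raise ValueError(
            f"collection uses {spec.embedding_model!r}, got a vector from {model!r}"
        )
    if len(vector) != spec.dimension:
        raise ValueError(
            f"collection expects {spec.dimension} dimensions, got {len(vector)}"
        )

spec = CollectionSpec(embedding_model="text-embedding-ada-002", dimension=1536)

# Passes silently: right model, right dimension.
validate_vector(spec, [0.0] * 1536, "text-embedding-ada-002")

# Fails: a 768-dimensional vector cannot go into a 1536-dimensional collection.
try:
    validate_vector(spec, [0.0] * 768, "text-embedding-ada-002")
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```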
Approximation and non-determinism: this is the single most misunderstood property of HNSW, and it has two distinct flavors.
It can miss the true nearest neighbor. The greedy graph descent can get trapped in a local optimum: a node where every reachable neighbor is farther from the query than the current node, even though the true nearest neighbor exists elsewhere in the graph. HNSW mitigates this with the ef_search parameter (beam width): higher values explore more candidate paths before returning, increasing recall at the cost of latency. At ef_search = 10, recall@10 might be 90%. At ef_search = 100, it might be 99%. There is no setting that guarantees 100% recall without degenerating into exhaustive scan.
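The local-optimum trap and the effect of beam width can be reproduced on a four-node graph. The sketch below follows the general shape of a best-first search with a bounded result set; the graph, the node names, and the function name `search_layer` are constructed for this example, and a real index would run this on the bottom layer of a much larger graph.

```python
import heapq
import math

def search_layer(graph, points, query, entry, ef):
    """Best-first search that keeps at most `ef` results (the beam width)."""
    entry_dist = math.dist(points[entry], query)
    visited = {entry}
    candidates = [(entry_dist, entry)]   # min-heap: closest unexplored node first
    results = [(-entry_dist, entry)]     # max-heap (negated): worst kept result on top
    while candidates:
        dist, node = heapq.heappop(candidates)
        if dist > -results[0][0]:
            break  # the closest candidate is worse than the worst kept result
        for neighbor in graph[node]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            neighbor_dist = math.dist(points[neighbor], query)
            if len(results) < ef or neighbor_dist < -results[0][0]:
                heapq.heappush(candidates, (neighbor_dist, neighbor))
                heapq.heappush(results, (-neighbor_dist, neighbor))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the farthest result
    return [node for _, node in sorted(results, reverse=True)]

# "local" looks promising but is a dead end; "target" is only reachable via "detour".
points = {"entry": (0.0, 0.0), "local": (6.0, 0.0), "detour": (3.0, 4.0), "target": (10.0, 1.0)}
graph = {
    "entry": ["local", "detour"],
    "local": ["entry"],
    "detour": ["entry", "target"],
    "target": ["detour"],
}
query = (10.0, 0.0)
narrow = search_layer(graph, points, query, "entry", ef=1)
wide = search_layer(graph, points, query, "entry", ef=2)
print(narrow, wide)
```

With `ef=1` the search commits to "local" and never explores "detour", so it returns a node at distance 4 while the true nearest neighbor sits at distance 1. With `ef=2` the detour stays in the beam and "target" is found.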
The same query can return different results on separate calls. This surprises developers who assume database operations are deterministic. Two sources of variance:
- Index construction is probabilistic. When a vector is inserted, HNSW randomly assigns it to one or more layers (higher layers with exponentially lower probability). Two index builds from identical data in different insertion orders, or with different random seeds, produce different graph structures, and different graph structures mean different traversal paths and possibly different top-K results for the same query. Most libraries expose a random_seed parameter to make construction deterministic; it is rarely set in production code.
- Concurrent insertions mutate the live graph. If new vectors arrive while queries are running (common in any application where users or agents are continuously adding memories or documents), queries may traverse a partially updated graph. A record inserted mid-traversal may or may not appear in results depending on when the graph update landed.
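The random layer assignment can be sketched in a few lines. This follows the exponentially decaying level distribution described in the HNSW paper, where `m` stands for the maximum number of connections per node; the function name and the specific values are assumptions for illustration.

```python
import math
import random

def assign_levels(count, m, seed):
    """Draw a top layer for each of `count` inserted vectors.

    Level 0 is the dense bottom layer; higher levels are exponentially rarer.
    """
    rng = random.Random(seed)
    level_multiplier = 1.0 / math.log(m)
    # 1.0 - rng.random() lies in (0, 1], which keeps log() away from zero.
    return [int(-math.log(1.0 - rng.random()) * level_multiplier) for _ in range(count)]

first_build = assign_levels(1000, m=16, seed=42)
same_seed = assign_levels(1000, m=16, seed=42)
other_seed = assign_levels(1000, m=16, seed=43)

print(first_build == same_seed)    # identical seed, identical layer assignment
print(first_build == other_seed)   # a different seed gives a different assignment
print(first_build.count(0) / len(first_build))  # share of vectors that stay on layer 0
```

With `m=16`, roughly 15 out of 16 vectors are expected to stay on the bottom layer. Fixing the seed makes the layer assignment reproducible, but insertion order and concurrency can still change the resulting graph.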
The practical implication: do not write logic that depends on exact stability of results. Running the same query twice will sometimes return the same results and sometimes slightly different ones. Design for "statistically the most relevant K," not "always these exact K records."
Deletes are expensive: HNSW indexes don't support efficient in-place deletion. Most implementations mark deleted vectors as tombstones and filter them at query time, or rebuild the index periodically. High delete rates degrade performance. Design your data model to minimize deletes (soft-delete with a metadata flag, then filter on that field, rather than physical deletes).
Multi-tenancy requires filtering, not separate collections: creating one collection per tenant sounds clean but doesn't scale to thousands of tenants. Use a single collection with a tenant_id or user_id metadata field and filter on every query. Every major vector DB optimizes this pattern.
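A minimal sketch of the soft-delete pattern, assuming records are plain dicts with a `payload` and using a hypothetical `deleted` flag. The search here is an exhaustive scan to keep the example short; with a real vector database the same idea becomes a metadata filter on every query.

```python
import math

def soft_delete(records, record_id):
    """Mark a record as deleted without touching the index structure."""
    records[record_id]["payload"]["deleted"] = True

def search_live(records, query, top_k):
    """Exhaustive search that filters out soft-deleted records at query time."""
    live = [
        (math.dist(query, record["vector"]), record_id)
        for record_id, record in records.items()
        if not record["payload"].get("deleted", False)
    ]
    return [record_id for _, record_id in sorted(live)[:top_k]]

records = {
    "m1": {"vector": (0.0, 0.0), "payload": {}},
    "m2": {"vector": (1.0, 0.0), "payload": {}},
    "m3": {"vector": (5.0, 5.0), "payload": {}},
}
before = search_live(records, (0.1, 0.0), top_k=2)
soft_delete(records, "m1")
after = search_live(records, (0.1, 0.0), top_k=2)
print(before, after)
```

The flagged record still occupies space until a periodic cleanup or rebuild removes it, which is the cost of avoiding physical deletes on the hot path.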
Vector databases are infrastructure, not magic. The interesting question is not "which vector DB should we use" but "what is our embedding model and what are our recall requirements?" Those two decisions constrain everything else. If you're already on Postgres, pgvector is probably the right answer until you have evidence it isn't.
Vector databases are the storage layer for both RAG and external memory systems. RAG explains how they power knowledge retrieval. External Memory explains how they power agent state persistence across sessions.
Further Reading
- Embeddings: how vectors are produced and what makes them meaningful
- RAG: the most common production use case for vector search
- External Memory: using vector search to give agents persistent memory