
Vector Search Caching: Embedding Lookups at 31 Nanoseconds

May 10, 2026 | 15 min read | Engineering

Vector databases have become the backbone of modern AI applications. Pinecone, Weaviate, Qdrant, Milvus, pgvector -- they all solve the same fundamental problem: given a query vector, find the nearest neighbors in a high-dimensional embedding space. They do this well. A well-tuned Pinecone index returns top-10 results in 10-20 milliseconds. A pgvector HNSW index on a warm dataset responds in 20-50 milliseconds. Qdrant with quantized vectors clocks in at 5-15 milliseconds. These are impressive numbers for a nearest-neighbor search across millions or billions of vectors.

But for hot queries -- the queries that account for 50-70% of your production traffic -- these response times represent unnecessary recomputation. The same user searches for the same product. The same support query retrieves the same knowledge base chunks. The same recommendation request scores the same item pool against the same user embedding. Each time, the vector database performs the same HNSW traversal, computes the same distance calculations, and returns the same result set. The result is identical because the inputs are identical. The computation is wasted because the output was already known.

Caching the top-K results for frequent query embeddings eliminates this redundancy. The first execution queries the vector database and stores the result set. Subsequent queries with the same embedding hit the Cachee L1 in-process cache and return the stored results at 31 nanoseconds. That is roughly 1.6 million times faster than a 50 ms vector database lookup. It is not an optimization of the vector search algorithm. It is a bypass of the entire vector search pipeline for queries whose results are already known.

~1,600,000x -- faster than a vector DB lookup (50 ms down to 31 ns)
50-70% -- hit rate on power-law query traffic
50 GB -- memory for 10 million cached queries

Why Vector Search Follows a Power-Law Distribution

Vector search traffic in production systems is not uniformly distributed across the embedding space. It follows a power-law distribution: a small number of query embeddings account for a disproportionately large share of total query volume. This is the same Zipf's law pattern observed in web search, database queries, and key-value lookups. The top 10% of unique query embeddings typically account for 50-70% of total query volume.

This pattern emerges for structural reasons, not coincidence. In e-commerce search, the most popular products are queried orders of magnitude more frequently than the long tail. In customer support, the top 50 issues generate 80% of support queries. In recommendation systems, active users request recommendations repeatedly with the same or similar user embeddings. In RAG pipelines, employees ask the same questions about company policies, product features, and operational procedures. The power-law distribution is inherent to how humans interact with information systems, and it makes vector search results highly cacheable.

The cache hit rate for vector search results depends on the query diversity of your application. Narrow-domain applications (single-product support, internal knowledge base, focused recommendation) achieve 60-70% hit rates. Broad-domain applications (general-purpose search, multi-category e-commerce) achieve 40-55% hit rates. Even at the low end, caching eliminates nearly half of all vector database queries, reducing both latency and infrastructure cost proportionally.
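These ranges are easy to sanity-check against your own traffic. The short sketch below estimates how much query volume the head of a Zipf-like distribution carries; the catalog size and exponent are illustrative assumptions, not measurements from any production system.

import numpy as np

# Illustrative assumptions: 1,000,000 distinct query embeddings and a
# Zipf-like popularity curve with exponent 0.8. Measure your own traffic
# rather than assuming these values.
num_unique = 1_000_000
zipf_exponent = 0.8

ranks = np.arange(1, num_unique + 1)
weights = 1.0 / ranks ** zipf_exponent
share = weights / weights.sum()

# Share of total traffic carried by the top 10% of unique queries --
# an upper bound on the hit rate of a cache that holds exactly that head.
head = int(0.10 * num_unique)
print(f"Top 10% of unique queries carry {share[:head].sum():.0%} of traffic")

With these particular assumptions the head carries roughly 60% of traffic, which lands inside the 50-70% range above; the exponent is the number worth measuring for your own workload.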

The Vector Search Cache Key

The cache key for a vector search result must capture every input that affects the output. If any input changes, the cached result is potentially stale and must be invalidated. The correct cache key for a vector search result is:

cache_key = SHA3-256(
    query_embedding      ||  # The query vector (float32 or float16 array)
    top_k                ||  # Number of results requested
    distance_metric      ||  # cosine, euclidean, dot_product
    collection_version       # Monotonic version of the vector index
)

Each field serves a specific purpose in ensuring cache correctness. The query_embedding identifies the search query in vector space. The top_k parameter is critical because the result set for top-5 is a strict subset of top-10, but the scores and ordering may differ due to re-ranking. Caching a top-10 result and returning only the first 5 is technically correct in many implementations, but including top-k in the key avoids subtle correctness issues with distance score normalization and post-filtering. The distance_metric determines which vectors are "nearest" -- cosine similarity and Euclidean distance produce different result sets for the same query. The collection_version ensures that cached results are invalidated when the vector index changes.

The collection_version is the field that makes this cache safe. Without it, a cache populated before a vector index update serves stale results after the update. A user searches for "wireless headphones," gets cached results from last week's index, and misses the five new products added yesterday. Including the collection version in the cache key means that every index update -- whether it adds new vectors, removes old ones, or updates existing embeddings -- automatically invalidates the entire cache. The invalidation is not a cache purge command that might arrive late or get lost. It is a structural property of the key: the old key and the new key are different strings, so the old results are never returned.
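To make the structural-invalidation point concrete, here is a minimal sketch of the key construction, using the same field encoding as the implementation examples later in this post. Bumping the collection version produces a key that shares nothing with the old one:

import hashlib
import struct
import numpy as np

def vector_search_key(embedding: np.ndarray, top_k: int,
                      metric: str, collection_version: int) -> str:
    material = (
        embedding.astype(np.float32).tobytes() +
        struct.pack(">I", top_k) +
        metric.encode() +
        struct.pack(">Q", collection_version)
    )
    return hashlib.sha3_256(material).hexdigest()

query = np.zeros(1536, dtype=np.float32)  # stand-in for a real query embedding
print(vector_search_key(query, 10, "cosine", 41))  # key before the index update
print(vector_search_key(query, 10, "cosine", 42))  # unrelated key after the update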

What to Cache: Result Sets, Not Embeddings

A common mistake in vector search caching is caching the raw embedding vectors. This is wrong for two reasons. First, embeddings are the inputs to vector search, not the outputs. Caching inputs does not eliminate the computation you are trying to avoid. Second, embeddings are large -- a single 1536-dimensional float32 embedding is 6,144 bytes. Caching millions of embeddings consumes significant memory without eliminating any vector database queries.

The correct approach is to cache the result set: the document IDs, distance scores, and metadata returned by the vector search. A typical top-10 result set contains ten document IDs, ten distance scores, and the metadata payload for each hit.

Total size per cached result set: approximately 2-5 KB for a top-10 result. This is small enough to cache millions of result sets in memory. At the upper bound of 5 KB per entry, 10 million cached queries occupy 50 GB of memory. A modern server with 256 GB or 512 GB of RAM can dedicate 50 GB to vector search caching without impacting other workloads. The memory cost is proportional to the number of unique queries cached, not the total query volume, because duplicate queries share the same cache entry.

Memory Budget: 10M Queries in 50 GB

At 5 KB per cached top-10 result set, 10 million unique vector search results fit in 50 GB of in-process memory. This covers the hot query set for all but the largest deployments. For comparison, serving the same 5 KB entries from Redis would cost roughly 380 microseconds per lookup for the network round-trip and serialization. Cachee L1 serves the same entry at 31 nanoseconds -- 12,258x faster -- because the data never leaves the application's memory space.
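The capacity math is simple enough to keep in a planning script. A sketch using the 5 KB upper-bound entry size quoted above:

# Back-of-the-envelope memory budget for cached top-10 result sets,
# using the 5 KB per-entry upper bound from the text above.
ENTRY_BYTES = 5 * 1024

for unique_queries in (100_000, 1_000_000, 10_000_000):
    gb = unique_queries * ENTRY_BYTES / 1e9
    print(f"{unique_queries:>12,} unique queries -> ~{gb:,.1f} GB")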

Architecture: Where the Cache Sits

The vector search cache sits between your application and the vector database. It does not replace the vector database. It intercepts queries, checks for cached results, and returns cached results when available. On a cache miss, it forwards the query to the vector database, caches the result, and returns it to the application. This is the standard read-through cache pattern, and it requires no changes to the vector database itself.

# Architecture: Application → Cachee L1 → Vector Database
#
# Cache Hit Path (31ns):
#   Application → Cachee L1 (in-process) → Application
#
# Cache Miss Path (50ms):
#   Application → Cachee L1 (miss) → Vector DB (50ms) → Cachee L1 (write) → Application
#
# The vector database is only queried on cache misses.
# For 60% hit rate, 60% of queries bypass the vector DB entirely.

The Cachee L1 tier is an in-process cache that lives in the application's memory space. There is no network hop, no serialization, no deserialization. The cache lookup is a direct memory access: hash the key, find the entry, verify the signatures, return the value. The 31-nanosecond latency is not a benchmark artifact -- it is the measured time for an in-process hash map lookup with SHA3-256 fingerprint verification. By contrast, Redis would serve the same 5 KB result set in approximately 380 microseconds: 100-200 microseconds for the network round-trip, 50-100 microseconds for Redis processing, and 50-100 microseconds for serialization and deserialization. Cachee L1 is 12,258x faster because it eliminates every component of Redis latency except the hash lookup itself.
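One way to translate hit rate into end-to-end numbers is a quick expected-latency calculation, using the 31 ns hit cost and the 50 ms miss cost quoted throughout this post:

HIT_NS = 31                  # Cachee L1 hit
MISS_NS = 50_000_000         # 50 ms vector DB query on a miss

for hit_rate in (0.40, 0.60, 0.70):
    mean_ns = hit_rate * HIT_NS + (1 - hit_rate) * MISS_NS
    print(f"hit rate {hit_rate:.0%}: mean latency ~{mean_ns / 1e6:.0f} ms")

The mean stays miss-dominated (20 ms at a 60% hit rate), but the median request becomes a cache hit once the hit rate clears 50%, which is where the sub-microsecond median latency quoted later in this post comes from.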

Implementation: pgvector with Cachee

The following example shows how to add Cachee vector search caching to a Python application that uses pgvector for similarity search. The pattern is identical for any vector database -- only the search call changes.

import hashlib
import json
import struct
import numpy as np
from cachee import CacheeClient

client = CacheeClient()

COLLECTION_VERSION = 42  # Increment on every index update

def vector_search_cached(
    query_embedding: np.ndarray,
    top_k: int = 10,
    distance_metric: str = "cosine",
    conn = None,  # psycopg2 connection for pgvector
) -> list[dict]:
    """
    Cached vector search via pgvector.
    Returns top-K results from cache (31ns) or pgvector (20-50ms).
    """

    # 1. Build the cache key
    key_input = (
        query_embedding.astype(np.float32).tobytes() +
        struct.pack(">I", top_k) +
        distance_metric.encode() +
        struct.pack(">Q", COLLECTION_VERSION)
    )
    cache_key = hashlib.sha3_256(key_input).hexdigest()

    # 2. Check cache
    response = client.get(
        key=cache_key,
        contract="vector_search"
    )

    if response.verification_status == "Valid":
        # Cache hit — 31ns lookup, signatures verified
        return json.loads(response.value)

    # 3. Cache miss — query pgvector
    cursor = conn.cursor()
    if distance_metric == "cosine":
        operator = "<=>"
    elif distance_metric == "euclidean":
        operator = "<->"
    else:  # dot_product (pgvector's <#> operator is negative inner product)
        operator = "<#>"

    embedding_str = "[" + ",".join(str(x) for x in query_embedding) + "]"
    cursor.execute(
        f"""
        SELECT id, content, metadata,
               embedding {operator} %s::vector AS distance
        FROM documents
        ORDER BY embedding {operator} %s::vector
        LIMIT %s
        """,
        (embedding_str, embedding_str, top_k)
    )

    results = []
    for row in cursor.fetchall():
        results.append({
            "id": str(row[0]),
            "content": row[1],
            "metadata": row[2],
            "distance": float(row[3]),
        })

    # 4. Cache the results
    client.put(
        key=cache_key,
        value=json.dumps(results),
        contract="vector_search",
        metadata={
            "top_k": top_k,
            "distance_metric": distance_metric,
            "collection_version": COLLECTION_VERSION,
            "result_count": len(results),
        }
    )

    return results
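A usage sketch, assuming a psycopg2 connection and a query embedding produced by whatever embedding model you already run -- the random vector below is only a stand-in to keep the example self-contained:

import numpy as np
import psycopg2

conn = psycopg2.connect("dbname=search user=app")  # connection details are placeholders

# In production this comes from your embedding model; a fixed random
# vector stands in here so the snippet runs on its own.
query_embedding = np.random.default_rng(0).random(1536, dtype=np.float32)

hits = vector_search_cached(query_embedding, top_k=10,
                            distance_metric="cosine", conn=conn)
for hit in hits:
    print(hit["id"], round(hit["distance"], 4))

# A second call with the same embedding (and the same collection version)
# is served from Cachee L1 instead of pgvector.
hits_again = vector_search_cached(query_embedding, top_k=10,
                                  distance_metric="cosine", conn=conn)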

Implementation: Pinecone with Cachee

The same caching pattern applies to managed vector databases like Pinecone. The cache key construction is identical. Only the query call differs.

import pinecone
import hashlib
import json
import struct
import numpy as np
from cachee import CacheeClient

client = CacheeClient()
# Assumes the Pinecone client has already been initialized at startup
# (pinecone.init(api_key=..., environment=...) in the classic client).
pinecone_index = pinecone.Index("product-embeddings")

COLLECTION_VERSION = 42  # Increment on every upsert batch

def pinecone_search_cached(
    query_embedding: np.ndarray,
    top_k: int = 10,
    namespace: str = "default",
    filter_dict: dict | None = None,
) -> list[dict]:
    """
    Cached vector search via Pinecone.
    Cache key includes filter to ensure filtered results are cached separately.
    """

    # 1. Build cache key (include filter in fingerprint)
    key_input = (
        query_embedding.astype(np.float32).tobytes() +
        struct.pack(">I", top_k) +
        namespace.encode() +
        json.dumps(filter_dict or {}, sort_keys=True).encode() +
        struct.pack(">Q", COLLECTION_VERSION)
    )
    cache_key = hashlib.sha3_256(key_input).hexdigest()

    # 2. Check cache
    response = client.get(key=cache_key, contract="vector_search")

    if response.verification_status == "Valid":
        return json.loads(response.value)

    # 3. Cache miss — query Pinecone (10-20ms)
    query_response = pinecone_index.query(
        vector=query_embedding.tolist(),
        top_k=top_k,
        namespace=namespace,
        filter=filter_dict,
        include_metadata=True,
    )

    results = []
    for match in query_response.matches:
        results.append({
            "id": match.id,
            "score": float(match.score),
            "metadata": match.metadata,
        })

    # 4. Cache the result set
    client.put(
        key=cache_key,
        value=json.dumps(results),
        contract="vector_search",
        metadata={
            "top_k": top_k,
            "namespace": namespace,
            "collection_version": COLLECTION_VERSION,
            "result_count": len(results),
        }
    )

    return results

Notice that the Pinecone example includes filter_dict in the cache key. This is essential for applications that use metadata filtering. A search for "wireless headphones" filtered to {"category": "electronics"} returns different results than the same search filtered to {"category": "accessories"}. If the filter is not included in the cache key, a cached result from one filter context is served for a different filter context -- a correctness bug that is difficult to detect because both result sets look plausible.
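The sort_keys=True canonicalization in the key construction matters here: it keeps two spellings of the same logical filter from producing two separate cache entries, while genuinely different filters still hash apart. A small illustration:

import hashlib
import json

def filter_fingerprint(filter_dict: dict | None) -> str:
    # Canonical JSON: key order does not affect the fingerprint.
    return hashlib.sha3_256(
        json.dumps(filter_dict or {}, sort_keys=True).encode()
    ).hexdigest()

same_a = {"category": "electronics", "in_stock": True}
same_b = {"in_stock": True, "category": "electronics"}
different = {"category": "accessories", "in_stock": True}

assert filter_fingerprint(same_a) == filter_fingerprint(same_b)
assert filter_fingerprint(same_a) != filter_fingerprint(different)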

Collection Version: Automatic Invalidation on Index Updates

The collection_version field in the cache key is what makes vector search caching safe across index updates. Without it, the cache serves stale results forever. With it, every index update automatically invalidates all cached results. The implementation pattern depends on your vector database and deployment model.

pgvector: Store the collection version in a metadata table. Increment it in the same transaction that modifies the vector data. This ensures that the version and the data are always consistent -- there is no window where the data has changed but the version has not.

Pinecone: Track the collection version in your application state (database, config store, or environment variable). Increment it after each successful upsert batch. Because Pinecone upserts are eventually consistent, add a short delay (1-2 seconds) after incrementing the version to allow the index to converge.

Qdrant / Weaviate / Milvus: These databases expose collection metadata or version endpoints. Query the collection's update timestamp or point count as a proxy for the version, or maintain an explicit version counter in your application layer.
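For the Pinecone approach above, a minimal sketch of the version bump after an upsert batch. The version_store object and its increment method are assumptions standing in for whatever durable counter your application already has; only index.upsert is a real Pinecone call.

import time

def upsert_and_bump_version(index, vectors, version_store, delay_s: float = 2.0) -> int:
    """
    Upsert a batch into Pinecone, then advance the collection version
    used in cache key construction.
    """
    index.upsert(vectors=vectors)  # list of (id, values, metadata) tuples or dicts

    # Advance the version so new queries build new cache keys...
    new_version = version_store.increment("product-embeddings")  # assumed interface

    # ...and wait briefly, because Pinecone upserts are eventually consistent
    # and fresh cache misses should see the converged index.
    time.sleep(delay_s)

    global COLLECTION_VERSION
    COLLECTION_VERSION = new_version
    return new_version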

# Collection version management for pgvector

def update_documents_and_version(conn, new_documents: list[dict]):
    """
    Update vector index and increment collection version atomically.
    All cached results with the old version become cache misses.
    """
    cursor = conn.cursor()

    # Insert or update documents (within transaction)
    for doc in new_documents:
        cursor.execute(
            """
            INSERT INTO documents (id, content, metadata, embedding)
            VALUES (%s, %s, %s, %s::vector)
            ON CONFLICT (id) DO UPDATE
            SET content = EXCLUDED.content,
                metadata = EXCLUDED.metadata,
                embedding = EXCLUDED.embedding
            """,
            (doc["id"], doc["content"],
             json.dumps(doc["metadata"]),
             doc["embedding_str"])
        )

    # Increment collection version atomically
    cursor.execute(
        """
        UPDATE collection_metadata
        SET version = version + 1,
            updated_at = NOW()
        WHERE collection_name = 'documents'
        RETURNING version
        """
    )
    new_version = cursor.fetchone()[0]
    conn.commit()

    # Update the global version for cache key construction
    global COLLECTION_VERSION
    COLLECTION_VERSION = new_version

    return new_version

When the collection version increments, every subsequent vector search query constructs a cache key with the new version. Because the version is part of the SHA3-256 hash, the new keys are completely different from the old keys. The old cached entries still exist in memory but are never hit -- they are evicted by the CacheeLFU eviction policy as new entries fill the cache. There is no need for an explicit cache purge, no invalidation message that might be lost, and no race condition where a query arrives between the index update and the cache purge.

Never Skip the Collection Version

A vector search cache without a collection version in the key will serve stale results after every index update. The staleness is silent -- no error, no warning, just incorrect results that look correct. Users will see search results that do not include recently added documents, or results ranked by distances computed against an old index. Include the collection version in every vector search cache key. This is non-negotiable.

Performance Comparison: Vector DB vs Redis vs Cachee L1

To understand the magnitude of the performance improvement, consider the complete latency breakdown for each tier in the vector search architecture. The numbers below are based on a 5 KB result set (top-10 with metadata), which is the typical payload size for cached vector search results.

Component                  | Vector DB (pgvector) | Redis (6.x, TCP) | Cachee L1 (In-Process)
Network round-trip         | 100-500 us           | 100-200 us       | 0 (in-process)
Query parsing              | 50-200 us            | 10-20 us         | 0 (direct hash)
Index traversal / lookup   | 5-30 ms (HNSW)       | 1-5 us (hash)    | 15-25 ns (hash)
Distance computation       | 1-10 ms              | N/A              | N/A
Serialization              | 50-100 us            | 50-100 us        | 0 (in-memory struct)
Integrity verification     | None                 | None             | ~5 us (3 PQ sigs)
Total latency              | 10-50 ms             | ~380 us          | 31 ns
Speedup vs vector DB       | 1x (baseline)        | ~100x            | ~1,600,000x

Redis offers a 100x improvement over the vector database by eliminating the index traversal and distance computation. But it retains the network round-trip, serialization, and deserialization overhead that dominates at sub-millisecond latencies. For a 5 KB payload, the Redis round-trip cost is approximately 380 microseconds. This is fast by database standards but slow by cache standards. It is 12,258 times slower than an in-process lookup that reads directly from the application's memory space.

The Cachee L1 tier eliminates every component of Redis latency. There is no network round-trip because the cache is in-process. There is no serialization because the cached value is stored as an in-memory data structure that can be read directly. The only overhead beyond the raw hash map lookup is the integrity verification: three post-quantum signature checks that confirm the cached value has not been tampered with. This verification takes approximately 5 microseconds, but it does not lengthen the 31-nanosecond read path because Cachee overlaps the verification with the lookup using speculative execution. The application receives the cached value in 31 nanoseconds; the verification result arrives before the application can act on the value.

Cache Contract Configuration for Vector Search

Vector search cache contracts define the freshness window, verification requirements, and invalidation strategy for cached result sets. The contract configuration should reflect your index update frequency and your tolerance for stale results.

# cachee.toml — Vector search cache contract

[contracts.vector_search]
computation_type = "vector_similarity_search"
max_staleness_ms = 86400000  # 24 hours (adjust to index update frequency)
verification = ["ML-DSA-65", "FALCON-512", "SLH-DSA-SHA2-128f"]
fingerprint_fields = [
    "query_embedding", "top_k", "distance_metric", "collection_version"
]
audit_trail_depth_days = 30
invalidation = "versioned"  # collection_version change = automatic invalidation
strict_mode = true

[contracts.vector_search_realtime]
computation_type = "vector_similarity_search_rt"
max_staleness_ms = 60000  # 1 minute for real-time search applications
verification = ["ML-DSA-65", "FALCON-512", "SLH-DSA-SHA2-128f"]
fingerprint_fields = [
    "query_embedding", "top_k", "distance_metric", "collection_version"
]
audit_trail_depth_days = 7
invalidation = "eager"
strict_mode = true

Two contracts are defined: a standard contract with a 24-hour staleness window for applications where the vector index updates daily, and a real-time contract with a 1-minute staleness window for applications that require near-real-time search freshness. The invalidation = "versioned" strategy means that a collection version change invalidates all cached results structurally (via the cache key), regardless of the staleness window. The staleness window is a secondary safety net that catches scenarios where the collection version is not incremented correctly.

Scaling Vector Search Caching

The memory budget for vector search caching scales linearly with the number of unique queries cached. The key planning question is: how many unique query embeddings does your application receive in its hot window? For most applications, the answer is far smaller than the total query volume because of the power-law distribution.

Unique Queries | Payload Size | Memory Required | Cache Hit Rate
100,000        | 5 KB         | 500 MB          | 50-60%
1,000,000      | 5 KB         | 5 GB            | 55-65%
10,000,000     | 5 KB         | 50 GB           | 60-70%
100,000,000    | 5 KB         | 500 GB          | 65-75%

At 10 million unique queries, the cache requires 50 GB of memory and delivers a 60-70% hit rate. This means 60-70% of your vector database queries are eliminated entirely, reducing both latency (50ms to 31ns for hits) and infrastructure cost (fewer vector database replicas, lower CPU utilization, smaller index instances). The 50 GB memory cost is easily justified by the vector database infrastructure savings: a single Pinecone pod costs $70-$140/month, and reducing query volume by 60% allows you to run fewer pods or use a smaller pod type.

For deployments that exceed available memory, the CacheeLFU eviction policy ensures that the most frequently accessed entries remain in cache while infrequent entries are evicted. The eviction policy is frequency-aware, not recency-aware: an entry that is accessed 1,000 times per hour is retained over an entry that was accessed once five minutes ago. This frequency-based eviction aligns perfectly with the power-law query distribution, keeping the hottest queries cached and evicting the long tail.

Integrity: Why Vector Search Cache Verification Matters

An unverified vector search cache is a cache poisoning target. If an attacker gains access to the cache and modifies result sets, every subsequent user receives manipulated search results. In an e-commerce application, this means promoting specific products or hiding competitors. In a support application, this means directing users to malicious content. In a security application, this means hiding relevant threat indicators from analysts.

Cachee's triple PQ signature verification on every read prevents this attack. Each cached result set is signed by three independent post-quantum algorithms (ML-DSA-65, FALCON-512, SLH-DSA-SHA2-128f) at write time. At read time, all three signatures are verified before the value is returned. A modified result set fails signature verification and is treated as a cache miss, causing a fresh vector database query. The attacker cannot re-sign the modified result because they do not possess the signing keys. They would need to break all three independent mathematical hardness assumptions simultaneously -- MLWE lattices, NTRU lattices, and stateless hash functions -- to forge a valid cache entry.

The computation fingerprint provides a second layer of protection. The fingerprint SHA3-256(query_embedding || top_k || distance_metric || collection_version) is computed at write time and stored with the cache entry. At read time, the fingerprint is recomputed from the query parameters and compared to the stored fingerprint. If they do not match, the entry is rejected. This prevents a subtler attack where the attacker does not modify the cached value but instead maps a victim's query to a different cache entry -- an entry with valid signatures but for a different query. The fingerprint check catches this because the fingerprint in the entry does not match the fingerprint computed from the victim's query parameters.

The Bottom Line

Vector databases are powerful but slow for hot queries. Production vector search traffic follows a power-law distribution where 50-70% of queries are repeat lookups. Caching top-K result sets -- document IDs, scores, and metadata, not raw embeddings -- in Cachee's L1 in-process tier serves these results at 31 nanoseconds instead of 10-50 milliseconds. That is an improvement of roughly 1.6 million times. The cache key includes the collection version, ensuring automatic invalidation when the vector index updates. The memory budget is 50 GB for 10 million cached queries at 5 KB per entry. Triple PQ signatures verify integrity on every read, preventing cache poisoning. Redis at the same payload size takes 380 microseconds -- 12,258x slower than Cachee L1. The vector database handles the long tail. The cache handles the hot queries. Together, they deliver sub-microsecond median latency with full-index coverage on misses.

Your vector database is recomputing the same queries thousands of times per hour. Cachee serves cached results at 31ns with cryptographic integrity.
