Every vector database — Pinecone, Weaviate, Qdrant, Milvus — shares the same fundamental constraint: a TCP round trip between your application and a remote process. That round trip has a physics floor of roughly 0.5–1ms on localhost and 1–5ms over a network. No amount of index optimization, SIMD acceleration, or clever quantization will break that floor because the bottleneck is not the search algorithm. It is the network. In-process HNSW eliminates the hop entirely, delivering vector similarity search in 0.0015ms — 660 to 3,300 times faster than the fastest managed vector database.
The TCP Floor Nobody Talks About
When your application sends a query to Pinecone, the request travels through a well-known sequence: your application serializes the vector into a protobuf or JSON payload, opens (or reuses) a TCP connection, sends the request over TLS, waits for the remote server to deserialize, search its index, serialize the result, and send it back. Even on a fast network within the same AWS region, this round trip takes 1–5ms. Pinecone’s own documentation cites p50 latencies of 5–10ms for their serverless tier. Weaviate benchmarks show 1.5–4ms for single-vector queries on warm connections. Qdrant published 1.8ms p99 on their cloud offering.
These are good numbers. They represent years of engineering effort on indexing algorithms, connection pooling, and infrastructure optimization. But they are all measuring the same thing: the time to traverse a network stack twice. The actual HNSW search on the remote server typically takes 50–200 microseconds. The remaining 800–4,800 microseconds is pure network overhead. Your application spends 90–97% of its vector search latency waiting for packets to traverse the wire.
Redis 8 introduced Vector Sets with the VADD and VSIM commands — a significant step forward in making vector search a first-class data structure. But Redis is still a network-bound server. Even running on localhost over Unix sockets, Redis vector search measures in the 0.3–1ms range. The serialization, kernel context switches, and IPC overhead impose a floor that no Redis configuration can eliminate. Redis solved the index problem. It did not solve the network problem.
In-Process HNSW: The Architecture That Wins
Cachee’s vector search runs HNSW directly inside your application process. The index lives in the same memory space as your code. A query is a function call, not a network request. There is no serialization, no deserialization, no TCP handshake, no TLS negotiation, no kernel context switch. The CPU walks the HNSW graph in L1/L2 cache, computes distance functions using SIMD instructions, and returns the result. Total latency: 1.5 microseconds for a 1M-vector index with 128 dimensions.
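The call-path difference is easiest to see with a deliberately naive in-process index. The sketch below is a brute-force linear scan standing in for HNSW (a real graph index is what makes this scale to millions of vectors), and it is not Cachee's API; the point is only that the query is a plain function call, with no serialization, socket, or context switch anywhere in the path.

```python
import math

# Minimal in-process "index": a dict of id -> vector, searched by brute force.
# A stand-in for HNSW to illustrate the call path only; the key property is
# shared with the real thing: a query never leaves the process.
class InProcessIndex:
    def __init__(self):
        self.vectors = {}

    def add(self, key, vec):
        self.vectors[key] = vec

    def search(self, query, k=1):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        # Rank every stored vector by cosine similarity, highest first.
        scored = sorted(self.vectors.items(),
                        key=lambda kv: cos(query, kv[1]), reverse=True)
        return [key for key, _ in scored[:k]]

idx = InProcessIndex()
idx.add("a", [1.0, 0.0]); idx.add("b", [0.0, 1.0]); idx.add("c", [0.7, 0.7])
print(idx.search([1.0, 0.1], k=2))  # -> ['a', 'c']
```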
This is not a theoretical number. It is a measured p50 on production workloads running Cachee’s in-process engine. The architecture maps directly to how companies like Spotify and DoorDash have described their vector search pain: they need sub-millisecond results for real-time recommendation and ranking, but every managed vector database introduces latency that exceeds the entire time budget for their hot path.
VADD, VSEARCH, VDEL: The Command Interface
Cachee implements a Redis-compatible command interface for vector operations, so teams already familiar with Redis do not need to learn a new protocol. The three core commands mirror the simplicity of Redis’s own Vector Sets while running entirely in-process.
VADD inserts a vector with an associated key and optional metadata. VSEARCH finds the K nearest neighbors to a query vector with optional metadata filters. VDEL removes a vector from the index. Each command supports cosine similarity, L2 (Euclidean) distance, and dot product metrics — configurable per index at creation time.
The critical difference from Redis 8 Vector Sets is where the commands execute. In Redis, every similarity query crosses a network boundary. In Cachee, it is an in-process function call. Same command style. Same semantics. Three orders of magnitude less latency.
Hybrid Metadata Filtering in One Operation
Real-world vector search is never pure similarity. You need to filter by tenant, language, document type, timestamp range, or access permissions. Most vector databases handle this with a two-phase approach: first find the K nearest vectors, then post-filter by metadata. This produces unstable result counts — ask for 10 results, get 3 back because 7 were filtered out after the vector search.
Cachee’s VSEARCH implements pre-filtered HNSW traversal. Metadata filters are evaluated during graph traversal, not after. The algorithm only visits nodes that satisfy the filter predicate, which means you always get exactly K results (if K qualifying vectors exist) and the search is often faster because filtered-out nodes are never traversed. For AI infrastructure teams running multi-tenant deployments, this eliminates an entire class of bugs where filtered queries return fewer results than requested.
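The exactly-K property can be demonstrated with a toy corpus. This sketch uses a brute-force scan rather than real HNSW traversal, and the tenant predicate is a made-up example; it shows only where the filter runs, which is what changes the result count.

```python
import math

# Toy corpus: vectors tagged with a tenant. A brute-force scan stands in
# for HNSW traversal; only the filter placement matters here.
corpus = {
    f"doc:{i}": ([float(i % 5), 1.0], "acme" if i % 4 == 0 else "other")
    for i in range(20)
}

def l2(a, b):
    return math.dist(a, b)

def post_filter_search(query, k, tenant):
    # Two-phase: take the K nearest overall, THEN drop non-matching tenants.
    nearest = sorted(corpus, key=lambda d: l2(query, corpus[d][0]))[:k]
    return [d for d in nearest if corpus[d][1] == tenant]

def pre_filter_search(query, k, tenant):
    # Predicate applied during the scan: only qualifying docs compete for K.
    qualifying = [d for d in corpus if corpus[d][1] == tenant]
    return sorted(qualifying, key=lambda d: l2(query, corpus[d][0]))[:k]

q = [0.0, 1.0]
print(len(post_filter_search(q, 5, "acme")))  # 1 -- asked for 5, got 1
print(len(pre_filter_search(q, 5, "acme")))   # 5 -- exactly K
```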
Distance Metrics: Choosing the Right One
Cachee supports three distance metrics, each optimized with platform-specific SIMD instructions:
- Cosine similarity: The default for text embeddings (OpenAI, Cohere, Voyage). Measures the angle between vectors, ignoring magnitude. Best for semantic caching and document retrieval where you care about directional similarity.
- L2 (Euclidean) distance: Measures absolute distance between vectors. Best for image embeddings and spatial data where magnitude matters. Used by Instacart for product image similarity in their catalog search pipeline.
- Dot product (inner product): Measures both direction and magnitude. Best for recommendation systems where vector norms encode importance signals. DoorDash has described using dot product for ranking candidate restaurants by relevance and quality simultaneously.
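A plain-Python illustration of how the three metrics treat magnitude (this is the textbook math, not Cachee's SIMD implementation):

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def l2_dist(a, b):
    return math.dist(a, b)

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

v = [3.0, 4.0]
scaled = [6.0, 8.0]           # same direction, twice the magnitude
print(cosine_sim(v, scaled))  # 1.0  -- magnitude ignored entirely
print(l2_dist(v, scaled))     # 5.0  -- pure geometric distance
print(dot_product(v, scaled)) # 50.0 -- grows with vector norms
```

Doubling a vector's norm leaves cosine similarity unchanged, moves it away in L2 terms, and doubles its dot product — which is exactly why norm-encoded importance signals only survive under the dot-product metric.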
When You Still Need a Vector Database
In-process HNSW is not a replacement for every vector database use case. If your index exceeds available RAM — typically beyond 50–100M vectors at 128 dimensions — you need a distributed solution. If you require strong durability guarantees with point-in-time recovery, a managed database provides that out of the box. If your vectors are updated infrequently and queried from dozens of independent services, a centralized database reduces consistency complexity.
The winning architecture for most AI applications is a tiered approach: Cachee’s in-process HNSW as the L1 hot cache for your most-queried vectors, with Pinecone, Qdrant, or pgvector as the L2 cold store for the full corpus. Hot vectors — the 10–20% of your index that serves 80–90% of queries — live in-process at 0.0015ms. Cold vectors fall through to the remote database at 1–5ms. This pattern is identical to how CPU cache hierarchies work: L1 is small and fast, L2 is large and slower, and the system performs as if everything is in L1 because of access pattern locality.
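The tiered read path can be sketched in a few lines. All names here are hypothetical — `l2_lookup` stands in for any remote vector-database client, and the FIFO eviction is a placeholder for a real admission/eviction policy:

```python
# Minimal L1/L2 fallthrough sketch. Not Cachee's API; illustrates only the
# read path: check the in-process tier first, pay the network on a miss,
# then promote the result so repeat reads stay local.
class TieredVectorCache:
    def __init__(self, l2_lookup, capacity=1024):
        self.l1 = {}                # hot vectors, in-process
        self.capacity = capacity
        self.l2_lookup = l2_lookup  # slow path: remote database call

    def get(self, key):
        if key in self.l1:                    # L1 hit: microseconds or less
            return self.l1[key]
        value = self.l2_lookup(key)           # L1 miss: network round trip
        if len(self.l1) >= self.capacity:
            self.l1.pop(next(iter(self.l1)))  # naive FIFO eviction for the sketch
        self.l1[key] = value                  # promote into the hot tier
        return value

calls = []
cache = TieredVectorCache(lambda k: (calls.append(k), [0.1, 0.2])[1])
cache.get("doc:1"); cache.get("doc:1")
print(len(calls))  # 1 -- second read never touched the remote store
```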
For teams running RAG pipelines, AI agent workflows, or real-time recommendation engines, the difference between 0.0015ms and 5ms is the difference between serving a response in the user’s latency budget and blowing it. At Cachee’s pricing, the L1 vector cache costs a fraction of what you are paying for a managed vector database — and it eliminates the latency that your users actually feel.
Related Reading
- AI Infrastructure Solutions
- Cachee Vector Search Documentation
- Cachee Pricing
- Get Started with Cachee
The Numbers That Matter
Cache performance discussions get philosophical fast. Here are the actual measured numbers from production deployments running on documented hardware, so you can compare against your own infrastructure instead of trusting marketing copy.
- L0 hot path GET: 28.9 nanoseconds on Apple M4 Max, single-threaded against pre-warmed in-memory cache. This is the floor — there's no faster way to read a key.
- L1 CacheeLFU GET: ~89 nanoseconds on AWS Graviton4 (c8g.metal-48xl). Sharded DashMap with admission filtering.
- Sustained throughput: 32 million ops/sec single-threaded on M4 Max, 7.41 million ops/sec at 16 workers on Graviton4 c8g.16xlarge.
- L2 fallback: Sub-millisecond hits against ElastiCache Redis 7.4 over same-AZ network when L1 misses cascade through.
The compounding effect matters more than any single number. A 28-nanosecond L0 hit means your application spends almost zero time on cache lookups in the hot path, leaving the CPU free for the actual business logic that generates revenue.
Average Latency Hides The Real Story
Average latency is the most misleading number in cache benchmarking. The percentile distribution is what actually breaks production systems. Tail latency — the slowest 0.1% of requests — is where users notice the lag and where SLAs get violated.
| Percentile | Network Redis (same-AZ) | In-process L0 |
|---|---|---|
| p50 | ~85 microseconds | 28.9 nanoseconds |
| p95 | ~140 microseconds | ~45 nanoseconds |
| p99 | ~280 microseconds | ~80 nanoseconds |
| p99.9 | ~1.2 milliseconds | ~150 nanoseconds |
The p99.9 spike on networked Redis isn't a bug — it's the cost of running a single-threaded event loop that occasionally blocks on background tasks like RDB snapshots, AOF rewrites, and expired-key sweeps. Cachee's L0 stays inside a few hundred nanoseconds because the hot-path read is a lock-free shard lookup with no background work scheduled on the same thread.
If your application is sensitive to tail latency — payments, real-time bidding, fraud detection, trading — the p99.9 number is the one to optimize against. Average latency improvements that don't move the tail are vanity metrics.
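To reproduce numbers like these against your own infrastructure, the nearest-rank method is a simple way to turn raw latency samples into percentile rows. The samples below are synthetic, shaped roughly like the heavy-tailed network column above:

```python
def percentile(samples, p_num, p_den=1):
    """Nearest-rank percentile. The percentile is (p_num / p_den) percent,
    kept as integers so the rank math is exact: p99.9 is (999, 10)."""
    ordered = sorted(samples)
    rank = -(-p_num * len(ordered) // (100 * p_den))  # ceiling division
    return ordered[max(rank, 1) - 1]

# Synthetic samples (microseconds): mostly fast, a band of slower hits,
# and a single tail outlier.
latencies = [85] * 950 + [140] * 40 + [280] * 9 + [1200]
print(percentile(latencies, 50))       # p50   -> 85
print(percentile(latencies, 99))       # p99   -> 140
print(percentile(latencies, 999, 10))  # p99.9 -> 280
```

Note how the single 1200µs outlier never shows up until you look past p99.9 — averaging these samples would report ~88µs and hide it completely.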
Memory Efficiency Is The Hidden Cost Lever
Throughput numbers get the headlines but memory efficiency determines your monthly bill. A cache that stores the same hot data in less RAM lets you run a smaller instance class — and on AWS that's the difference between profitable and breakeven for a lot of services.
Redis stores each key as a Simple Dynamic String with 16 bytes of header overhead, plus dictEntry pointers in the main hashtable, plus embedded TTL metadata. For 1KB values, the total per-entry footprint (value plus overhead) lands around 1,100–1,200 bytes once you account for hashtable load factor and allocator fragmentation. At a million keys, that's roughly 1.2 GB of resident memory for the dataset.
Cachee's L1 layer uses sharded DashMap entries with compact packing — a 64-bit key hash, value bytes, an 8-byte expiry timestamp, and a small frequency counter for the CacheeLFU admission filter. Per-entry overhead lands at roughly 40 bytes of structural data on top of the value itself. For the same million-key workload, that's about 13% smaller resident memory. On AWS ElastiCache pricing, that gap is the difference between needing a cache.r7g.large versus a cache.r7g.xlarge for borderline workloads.
Observability And What To Measure
You can't tune what you can't measure. The four metrics that matter for any production cache deployment, in order of importance:
- Hit rate, broken down by key prefix or namespace. A global hit rate of 92% sounds great until you discover that one critical namespace is sitting at 40% and dragging your tail latency. Per-prefix hit rates expose which workloads are getting cache value and which aren't.
- Latency percentiles, not averages. p50, p95, p99, and p99.9 for both cache hits and cache misses. The cache miss latency is your fallback path performance — when the cache fails, this is what your users actually experience.
- Memory pressure and eviction rate. If your eviction rate is climbing while your hit rate stays flat, you're under-provisioned. If both are climbing, your access pattern shifted and you need to retune TTLs or rethink what you're caching.
- Stale-read rate. The percentage of cache hits that returned a value the application then discovered was stale. This is the canary for your invalidation strategy. If it's above 1%, your invalidation logic has a bug.
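As one illustration, per-prefix hit-rate tracking — the first metric in the list — can be sketched in a few lines. The `namespace:id` key scheme is an assumption about your keys, not a Cachee requirement:

```python
from collections import defaultdict

# Track hits and misses per key prefix so one cold namespace can't hide
# behind a healthy global hit rate.
class PrefixStats:
    def __init__(self):
        self.hits = defaultdict(int)
        self.misses = defaultdict(int)

    def record(self, key, hit):
        prefix = key.split(":", 1)[0]
        (self.hits if hit else self.misses)[prefix] += 1

    def hit_rate(self, prefix):
        total = self.hits[prefix] + self.misses[prefix]
        return self.hits[prefix] / total if total else 0.0

stats = PrefixStats()
for _ in range(9):
    stats.record("user:42", hit=True)
stats.record("user:43", hit=False)
stats.record("search:q1", hit=False)
print(stats.hit_rate("user"))    # 0.9 -- healthy namespace
print(stats.hit_rate("search"))  # 0.0 -- the one dragging the tail
```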
Cachee exposes all four out of the box via Prometheus metrics on the standard scrape endpoint, plus a real-time SSE stream for dashboards that need sub-second visibility. The right time to wire these into your monitoring stack is before the migration, not after the first incident.
Vector Search Without the Network Tax
In-process HNSW delivers 0.0015ms vector similarity search, up to 3,300x faster than a remote vector database. Try it free.