AI Infrastructure

Vector Search at 0.0015ms: Why In-Process Beats Every Vector Database

Every vector database — Pinecone, Weaviate, Qdrant, Milvus — shares the same fundamental constraint: a TCP round trip between your application and a remote process. That round trip has a physics floor of roughly 0.5–1ms on localhost and 1–5ms over a network. No amount of index optimization, SIMD acceleration, or clever quantization will break that floor because the bottleneck is not the search algorithm. It is the network. In-process HNSW eliminates the hop entirely, delivering vector similarity search in 0.0015ms — 660 to 3,300 times faster than the fastest managed vector database.

The TCP Floor Nobody Talks About

When your application sends a query to Pinecone, the request travels through a well-known sequence: your application serializes the vector into a protobuf or JSON payload, opens (or reuses) a TCP connection, sends the request over TLS, waits for the remote server to deserialize, search its index, serialize the result, and send it back. Even on a fast network within the same AWS region, this round trip takes 1–5ms. Pinecone’s own documentation cites p50 latencies of 5–10ms for their serverless tier. Weaviate benchmarks show 1.5–4ms for single-vector queries on warm connections. Qdrant published 1.8ms p99 on their cloud offering.

These are good numbers. They represent years of engineering effort on indexing algorithms, connection pooling, and infrastructure optimization. But they are all measuring the same thing: the time to traverse a network stack twice. The actual HNSW search on the remote server typically takes 50–200 microseconds. The remaining 800–4,800 microseconds is pure network overhead. Your application spends roughly 80–96% of its vector search latency waiting for packets to traverse the wire.

Redis 8 introduced Vector Sets with the VADD and VSEARCH commands — a significant step forward in making vector search a first-class data structure. But Redis is still a network-bound server. Even running on localhost with Unix sockets, Redis vector search measures in the 0.3–1ms range. The serialization, kernel context switches, and IPC overhead impose a floor that no Redis configuration can eliminate. Redis solved the index problem. It did not solve the network problem.

The math is simple: HNSW search takes ~0.0015ms. The TCP round trip takes ~1–5ms. Network overhead therefore accounts for 99.85–99.97% of the total latency in every remote vector database. The only way to eliminate it is to not have a network at all.
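That arithmetic is worth sanity-checking. A few lines of Python, using the ~0.0015ms search figure and the 1–5ms round-trip range from above:

```python
# Share of total latency spent on the network when the in-process
# search itself takes ~1.5 microseconds (0.0015 ms).
SEARCH_MS = 0.0015

def network_fraction(round_trip_ms: float) -> float:
    """Fraction of total query latency attributable to the network hop."""
    total = round_trip_ms + SEARCH_MS
    return round_trip_ms / total

for rt in (1.0, 5.0):
    print(f"{rt} ms round trip -> {network_fraction(rt):.2%} network overhead")
    # prints 99.85% for 1 ms and 99.97% for 5 ms
```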

In-Process HNSW: The Architecture That Wins

Cachee’s vector search runs HNSW directly inside your application process. The index lives in the same memory space as your code. A query is a function call, not a network request. There is no serialization, no deserialization, no TCP handshake, no TLS negotiation, no kernel context switch. The CPU walks the HNSW graph in L1/L2 cache, computes distance functions using SIMD instructions, and returns the result. Total latency: 1.5 microseconds for a 1M-vector index with 128 dimensions.

This is not a theoretical number. It is a measured p50 on production workloads running Cachee’s in-process engine. The architecture maps directly to how companies like Spotify and DoorDash have described their vector search pain: they need sub-millisecond results for real-time recommendation and ranking, but every managed vector database introduces latency that exceeds the entire time budget for their hot path.
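Cachee's engine itself is not open source, so as a minimal sketch of what "a query is a function call" means, here is a brute-force cosine search in NumPy. The real engine walks an HNSW graph rather than scanning every vector, but the call-site shape is the same: no socket, no serialization, just a return value.

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 random 128-dimensional vectors, normalized so that a dot
# product equals cosine similarity.
index = rng.standard_normal((10_000, 128)).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)

def search(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k most cosine-similar vectors.
    A plain function call: no network hop, no context switch."""
    q = query / np.linalg.norm(query)
    scores = index @ q                      # one vectorized pass (BLAS/SIMD)
    return np.argpartition(-scores, k)[:k]  # indices of the k best scores

hits = search(index[42])  # a stored vector is its own nearest neighbor
```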

0.0015ms: Cachee in-process
1–5ms: remote vector database
3,300×: maximum speedup
0: network round trips

VADD, VSEARCH, VDEL: The Command Interface

Cachee implements a Redis-compatible command interface for vector operations, so teams already familiar with Redis do not need to learn a new protocol. The three core commands mirror the simplicity of Redis’s own Vector Sets while running entirely in-process.

VADD inserts a vector with an associated key and optional metadata. VSEARCH finds the K nearest neighbors to a query vector with optional metadata filters. VDEL removes a vector from the index. Each command supports cosine similarity, L2 (Euclidean) distance, and dot product metrics — configurable per index at creation time.

// Add a vector with metadata
VADD my_index "doc:42" VECTOR [0.12, 0.87, -0.34, ...] META '{"category":"support","lang":"en"}'

// Search: top 5 nearest, cosine similarity, filtered by category
VSEARCH my_index VECTOR [0.11, 0.85, -0.31, ...] K 5 METRIC cosine FILTER 'category == "support"'
// Returns: [{key: "doc:42", score: 0.987, meta: {...}}, ...] in 1.5µs

// Delete a vector
VDEL my_index "doc:42"

The critical difference from Redis 8 Vector Sets is where these commands execute. In Redis, VSEARCH crosses a network boundary. In Cachee, it is an in-process function call. Same syntax. Same semantics. Three orders of magnitude less latency.
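To make the semantics concrete, here is a toy in-process vector set in Python exposing the same three verbs. The class is illustrative only, not the Cachee SDK, and it uses brute-force cosine scoring in place of HNSW:

```python
import numpy as np

class VectorSet:
    """Toy in-process analogue of VADD / VSEARCH / VDEL.
    Brute-force cosine metric; illustrative, not the Cachee SDK."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = {}   # key -> normalized vector
        self.meta = {}      # key -> metadata dict

    def vadd(self, key, vector, meta=None):
        v = np.asarray(vector, dtype=np.float32)
        self.vectors[key] = v / np.linalg.norm(v)
        self.meta[key] = meta or {}

    def vsearch(self, vector, k=5):
        q = np.asarray(vector, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scored = [(float(v @ q), key) for key, v in self.vectors.items()]
        scored.sort(reverse=True)           # highest cosine first
        return [(key, score, self.meta[key]) for score, key in scored[:k]]

    def vdel(self, key):
        self.vectors.pop(key, None)
        self.meta.pop(key, None)

vs = VectorSet(dim=4)
vs.vadd("doc:42", [0.12, 0.87, -0.34, 0.1], {"category": "support"})
vs.vadd("doc:7",  [0.90, -0.10, 0.20, 0.0], {"category": "sales"})
top = vs.vsearch([0.11, 0.85, -0.31, 0.1], k=1)  # nearest is doc:42
```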

Hybrid Metadata Filtering in One Operation

Real-world vector search is never pure similarity. You need to filter by tenant, language, document type, timestamp range, or access permissions. Most vector databases handle this with a two-phase approach: first find the K nearest vectors, then post-filter by metadata. This produces unstable result counts — ask for 10 results, get 3 back because 7 were filtered out after the vector search.

Cachee’s VSEARCH implements pre-filtered HNSW traversal. Metadata filters are evaluated during graph traversal, not after. The algorithm only visits nodes that satisfy the filter predicate, which means you always get exactly K results (if K qualifying vectors exist) and the search is often faster because filtered-out nodes are never traversed. For AI infrastructure teams running multi-tenant deployments, this eliminates an entire class of bugs where filtered queries return fewer results than requested.
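The difference is easy to demonstrate with a contrived one-dimensional example, where "distance" is just absolute difference and only 10 of 100 items satisfy the filter:

```python
# 100 one-dimensional "vectors"; only every 10th item matches tenant "t1".
items = [(i, i / 100, "t1" if i % 10 == 0 else "t2") for i in range(100)]
query = 0.5

def post_filter(k):
    """Two-phase: take the k nearest first, THEN drop non-matching tenants.
    Result count is unstable -- most of the k slots are wasted."""
    nearest = sorted(items, key=lambda it: abs(it[1] - query))[:k]
    return [it for it in nearest if it[2] == "t1"]

def pre_filter(k):
    """Single-phase: only qualifying items are ever considered,
    so k qualifying items in, k results out."""
    qualifying = [it for it in items if it[2] == "t1"]
    return sorted(qualifying, key=lambda it: abs(it[1] - query))[:k]

print(len(post_filter(10)), len(pre_filter(10)))  # prints: 1 10
```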

Distance Metrics: Choosing the Right One

Cachee supports three distance metrics, each optimized with platform-specific SIMD instructions. Cosine similarity measures the angle between vectors and is the usual choice for normalized text embeddings. L2 (Euclidean) distance measures straight-line distance and is common for image and audio embeddings. Dot product preserves magnitude and suits workloads where vector length carries signal, such as popularity-weighted recommendations.

Performance note: All three metrics are SIMD-accelerated. On AVX-512 hardware, cosine similarity over 128 dimensions completes in ~80 nanoseconds. The distance computation is never the bottleneck — it is the graph traversal and memory access pattern that dominate HNSW latency.
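For reference, the three metrics reduce to a few lines of NumPy. These are scalar reference implementations that match Cachee's results, not its SIMD kernels:

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle-based: 1.0 for parallel vectors, 0.0 for orthogonal.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_distance(a, b):
    # Straight-line (Euclidean) distance; lower means more similar.
    return float(np.linalg.norm(a - b))

def dot_product(a, b):
    # Unnormalized similarity; magnitude contributes to the score.
    return float(a @ b)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
cosine_similarity(a, b)  # → 0.0 (orthogonal)
l2_distance(a, b)        # → 1.414… (√2)
dot_product(a, b)        # → 0.0
```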

When You Still Need a Vector Database

In-process HNSW is not a replacement for every vector database use case. If your index exceeds available RAM — typically beyond 50–100M vectors at 128 dimensions — you need a distributed solution. If you require strong durability guarantees with point-in-time recovery, a managed database provides that out of the box. If your vectors are updated infrequently and queried from dozens of independent services, a centralized database reduces consistency complexity.

The winning architecture for most AI applications is a tiered approach: Cachee’s in-process HNSW as the L1 hot cache for your most-queried vectors, with Pinecone, Qdrant, or pgvector as the L2 cold store for the full corpus. Hot vectors — the 10–20% of your index that serves 80–90% of queries — live in-process at 0.0015ms. Cold vectors fall through to the remote database at 1–5ms. This pattern is identical to how CPU cache hierarchies work: L1 is small and fast, L2 is large and slower, and the system performs as if everything is in L1 because of access pattern locality.
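One way to sketch the tiered pattern, keyed by document id for simplicity, with the remote tier stubbed out (the `remote_search` callback below stands in for a real pgvector or Qdrant client; none of these names come from the Cachee SDK):

```python
class TieredVectorSearch:
    """L1 in-process hot cache falling through to a remote L2 cold store.
    Illustrative sketch: the remote tier is an injected callback."""

    def __init__(self, remote_search, hot_capacity=100_000):
        self.hot = {}                      # in-process L1: key -> vector
        self.remote_search = remote_search  # L2: a network call (~1-5 ms)
        self.hot_capacity = hot_capacity

    def search(self, key):
        vec = self.hot.get(key)            # L1 hit: microseconds, no network
        if vec is not None:
            return vec, "L1"
        vec = self.remote_search(key)      # L1 miss: pay the round trip once
        if len(self.hot) < self.hot_capacity:
            self.hot[key] = vec            # promote the hot vector into L1
        return vec, "L2"

def fake_remote(key):                      # stand-in for the cold store
    return [0.0, 0.0]

tiered = TieredVectorSearch(fake_remote)
_, first = tiered.search("doc:42")   # cold: falls through to L2, promoted
_, second = tiered.search("doc:42")  # warm: served from L1
```

Because hot vectors are promoted on first access, the 10–20% of the index that serves most traffic migrates into L1 automatically, mirroring the access-pattern locality the CPU-cache analogy relies on.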

For teams running RAG pipelines, AI agent workflows, or real-time recommendation engines, the difference between 0.0015ms and 5ms is the difference between serving a response in the user’s latency budget and blowing it. At Cachee’s pricing, the L1 vector cache costs a fraction of what you are paying for a managed vector database — and it eliminates the latency that your users actually feel.
