Every vector database — Pinecone, Weaviate, Qdrant, Milvus — shares the same fundamental constraint: a TCP round trip between your application and a remote process. That round trip has a physics floor of roughly 0.5–1ms on localhost and 1–5ms over a network. No amount of index optimization, SIMD acceleration, or clever quantization will break that floor, because the bottleneck is not the search algorithm. It is the network. In-process HNSW eliminates the hop entirely, delivering vector similarity search in 0.0015ms — 660 to 3,300 times faster than a managed vector database round trip.
The TCP Floor Nobody Talks About
When your application sends a query to Pinecone, the request travels through a well-known sequence: your application serializes the vector into a protobuf or JSON payload, opens (or reuses) a TCP connection, sends the request over TLS, waits for the remote server to deserialize, search its index, serialize the result, and send it back. Even on a fast network within the same AWS region, this round trip takes 1–5ms. Pinecone’s own documentation cites p50 latencies of 5–10ms for their serverless tier. Weaviate benchmarks show 1.5–4ms for single-vector queries on warm connections. Qdrant published 1.8ms p99 on their cloud offering.
These are good numbers. They represent years of engineering effort on indexing algorithms, connection pooling, and infrastructure optimization. But they are all measuring the same thing: the time to traverse a network stack twice. The actual HNSW search on the remote server typically takes 50–200 microseconds. The remaining 800–4,800 microseconds is pure network overhead. Your application spends roughly 80–96% of its vector search latency waiting for packets to traverse the wire.
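The gap is easy to observe directly. The sketch below (plain Python, no vector math at all) round-trips a vector-sized payload over a localhost TCP socket and compares it with a bare function call on the same payload. The echo server, payload size, and iteration count are arbitrary choices for illustration; absolute numbers vary by machine, but the ratio between the two is the point.

```python
import socket
import threading
import time

PAYLOAD = b"x" * 512  # roughly the wire size of a 128-dim float32 vector

def recv_exact(sock, n):
    """Read exactly n bytes from a socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            break
        buf += chunk
    return buf

def echo_server(listener):
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(1024)
            if not data:
                break
            conn.sendall(data)

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=echo_server, args=(listener,), daemon=True).start()

client = socket.create_connection(listener.getsockname())
N = 1000

start = time.perf_counter()
for _ in range(N):
    client.sendall(PAYLOAD)
    recv_exact(client, len(PAYLOAD))
tcp_us = (time.perf_counter() - start) / N * 1e6

def local_echo(data):  # stand-in for an in-process index lookup
    return data

start = time.perf_counter()
for _ in range(N):
    local_echo(PAYLOAD)
call_us = (time.perf_counter() - start) / N * 1e6

print(f"localhost TCP round trip: {tcp_us:.1f} us/op")
print(f"in-process function call: {call_us:.3f} us/op")
client.close()
```

Even with no TLS, no serialization, and no real work on either side, the loopback round trip costs orders of magnitude more than the function call.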
Redis 8 introduced Vector Sets with the VADD and VSIM commands — a significant step forward in making vector search a first-class data structure. But Redis is still a network-bound server. Even running on localhost with Unix sockets, Redis vector search measures in the 0.3–1ms range. Serialization, kernel context switches, and IPC overhead impose a floor that no Redis configuration can eliminate. Redis solved the index problem. It did not solve the network problem.
In-Process HNSW: The Architecture That Wins
Cachee’s vector search runs HNSW directly inside your application process. The index lives in the same memory space as your code. A query is a function call, not a network request. There is no serialization, no deserialization, no TCP handshake, no TLS negotiation, no kernel context switch. The CPU walks the HNSW graph in L1/L2 cache, computes distance functions using SIMD instructions, and returns the result. Total latency: 1.5 microseconds for a 1M-vector index with 128 dimensions.
This is not a theoretical number. It is a measured p50 on production workloads running Cachee’s in-process engine. The architecture maps directly to how companies like Spotify and DoorDash have described their vector search pain: they need sub-millisecond results for real-time recommendation and ranking, but every managed vector database introduces latency that exceeds the entire time budget for their hot path.
VADD, VSEARCH, VDEL: The Command Interface
Cachee implements a Redis-compatible command interface for vector operations, so teams already familiar with Redis do not need to learn a new protocol. The three core commands mirror the simplicity of Redis’s own Vector Sets while running entirely in-process.
VADD inserts a vector with an associated key and optional metadata. VSEARCH finds the K nearest neighbors to a query vector with optional metadata filters. VDEL removes a vector from the index. Each command supports cosine similarity, L2 (Euclidean) distance, and dot product metrics — configurable per index at creation time.
The critical difference from Redis 8 Vector Sets is where these commands execute. In Redis, every vector command crosses a network boundary. In Cachee, it is an in-process function call: the same operations, the same semantics, three orders of magnitude less latency.
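The command surface can be sketched as an ordinary in-process object, where each "command" is just a method call. The class and method names below are illustrative, not Cachee's actual API, and the brute-force scan stands in for real HNSW graph traversal.

```python
import heapq
import math

class InProcessVectorSet:
    """Toy VADD / VSEARCH / VDEL surface, entirely in-process."""

    def __init__(self):
        self._vectors = {}  # key -> (vector, metadata)

    def vadd(self, key, vector, metadata=None):
        """Insert a vector under a key, with optional metadata."""
        self._vectors[key] = (vector, metadata or {})

    def vdel(self, key):
        """Remove a vector from the index."""
        self._vectors.pop(key, None)

    def vsearch(self, query, k=10):
        """Return the k keys most cosine-similar to `query`.

        Brute force for clarity; a real engine walks an HNSW graph
        and touches only a tiny fraction of the index.
        """
        qnorm = math.sqrt(sum(x * x for x in query))

        def cosine(vec):
            dot = sum(a * b for a, b in zip(query, vec))
            vnorm = math.sqrt(sum(x * x for x in vec))
            return dot / (qnorm * vnorm) if qnorm and vnorm else 0.0

        scored = ((cosine(vec), key) for key, (vec, _) in self._vectors.items())
        return [key for _, key in heapq.nlargest(k, scored)]

idx = InProcessVectorSet()
idx.vadd("doc:1", [1.0, 0.0, 0.0])
idx.vadd("doc:2", [0.9, 0.1, 0.0])
idx.vadd("doc:3", [0.0, 1.0, 0.0])
idx.vdel("doc:3")
print(idx.vsearch([1.0, 0.0, 0.0], k=2))  # ['doc:1', 'doc:2']
```

Note what is absent: no client object, no connection string, no serialization. The query result is returned by the same stack frame that computed it.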
Hybrid Metadata Filtering in One Operation
Real-world vector search is never pure similarity. You need to filter by tenant, language, document type, timestamp range, or access permissions. Most vector databases handle this with a two-phase approach: first find the K nearest vectors, then post-filter by metadata. This produces unstable result counts — ask for 10 results, get 3 back because 7 were filtered out after the vector search.
Cachee’s VSEARCH implements pre-filtered HNSW traversal. Metadata filters are evaluated during graph traversal, not after. The algorithm only visits nodes that satisfy the filter predicate, which means you always get exactly K results (if K qualifying vectors exist) and the search is often faster because filtered-out nodes are never traversed. For AI infrastructure teams running multi-tenant deployments, this eliminates an entire class of bugs where filtered queries return fewer results than requested.
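The difference between the two strategies shows up in the result count. The toy comparison below uses a brute-force scan rather than real HNSW traversal, but the shape of the bug is the same: post-filtering searches first and then drops non-matching hits, so it can return fewer than K results, while pre-filtering evaluates the predicate while selecting candidates and returns exactly K whenever K qualifying vectors exist.

```python
import heapq

# 20 one-dimensional vectors, alternating between two tenants.
VECTORS = {f"v{i}": ([float(i)], {"tenant": "a" if i % 2 else "b"})
           for i in range(20)}

def score(query, vec):
    return -abs(query[0] - vec[0])  # negative distance: higher is closer

def post_filtered(query, k, pred):
    """Search first, filter after -- may return fewer than k results."""
    top = heapq.nlargest(k, VECTORS,
                         key=lambda key: score(query, VECTORS[key][0]))
    return [key for key in top if pred(VECTORS[key][1])]

def pre_filtered(query, k, pred):
    """Filter during candidate selection -- returns exactly k when possible."""
    candidates = (key for key in VECTORS if pred(VECTORS[key][1]))
    return heapq.nlargest(k, candidates,
                          key=lambda key: score(query, VECTORS[key][0]))

def is_tenant_a(meta):
    return meta["tenant"] == "a"

print(len(post_filtered([0.0], 10, is_tenant_a)))  # 5: half were filtered away
print(len(pre_filtered([0.0], 10, is_tenant_a)))   # 10: exactly k results
```

In a graph index the pre-filtered variant also skips the filtered-out nodes during traversal, which is why filtered queries can run faster, not slower.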
Distance Metrics: Choosing the Right One
Cachee supports three distance metrics, each optimized with platform-specific SIMD instructions:
- Cosine similarity: The default for text embeddings (OpenAI, Cohere, Voyage). Measures the angle between vectors, ignoring magnitude. Best for semantic caching and document retrieval where you care about directional similarity.
- L2 (Euclidean) distance: Measures absolute distance between vectors. Best for image embeddings and spatial data where magnitude matters. Used by Instacart for product image similarity in their catalog search pipeline.
- Dot product (inner product): Measures both direction and magnitude. Best for recommendation systems where vector norms encode importance signals. DoorDash has described using dot product for ranking candidate restaurants by relevance and quality simultaneously.
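The three metrics are simple to state side by side. The pure-Python definitions below are for clarity only; a production engine computes these over float32 arrays with SIMD intrinsics.

```python
import math

def dot(a, b):
    """Inner product: direction and magnitude both contribute."""
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean distance: absolute separation in space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Angle between vectors: magnitude is normalized away."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

u, v = [3.0, 4.0], [6.0, 8.0]  # same direction, different magnitude
print(cosine_similarity(u, v))  # 1.0  -- identical direction, magnitude ignored
print(l2_distance(u, v))        # 5.0  -- magnitude difference is visible
print(dot(u, v))                # 50.0 -- magnitude amplifies the score
```

The example pair makes the trade-off concrete: cosine calls these two vectors identical, L2 sees them as far apart, and dot product rewards the larger one.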
When You Still Need a Vector Database
In-process HNSW is not a replacement for every vector database use case. If your index exceeds available RAM — typically beyond 50–100M vectors at 128 dimensions — you need a distributed solution. If you require strong durability guarantees with point-in-time recovery, a managed database provides that out of the box. If your vectors are updated infrequently and queried from dozens of independent services, a centralized database reduces consistency complexity.
The winning architecture for most AI applications is a tiered approach: Cachee’s in-process HNSW as the L1 hot cache for your most-queried vectors, with Pinecone, Qdrant, or pgvector as the L2 cold store for the full corpus. Hot vectors — the 10–20% of your index that serves 80–90% of queries — live in-process at 0.0015ms. Cold vectors fall through to the remote database at 1–5ms. This pattern is identical to how CPU cache hierarchies work: L1 is small and fast, L2 is large and slower, and the system performs as if everything is in L1 because of access pattern locality.
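The fall-through shape of that tiered design can be sketched in a few lines. Everything named here is hypothetical: `query_remote` stands in for a call to whatever remote store holds the full corpus, and the eviction policy is deliberately naive (a real hot cache would use LRU or LFU).

```python
def make_tiered_search(l1_capacity, query_remote):
    """Build a search function with an in-process L1 in front of a remote L2."""
    l1 = {}  # hot cache: query key -> cached results

    def search(key, vector, k):
        if key in l1:                     # L1 hit: in-process, microseconds
            return l1[key]
        results = query_remote(vector, k)  # L1 miss: network, milliseconds
        if len(l1) >= l1_capacity:        # naive eviction for the sketch
            l1.pop(next(iter(l1)))
        l1[key] = results
        return results

    return search

# Count how often the "network" is actually crossed.
calls = []
def fake_remote(vector, k):
    calls.append(vector)
    return [f"id{i}" for i in range(k)]

search = make_tiered_search(100, fake_remote)
search("q1", [0.1, 0.2], 3)  # miss: goes to the remote store
search("q1", [0.1, 0.2], 3)  # hit: served in-process
print(len(calls))            # 1 -- only the first query paid the network tax
```

With the access-pattern skew described above, most queries take the first branch, and the system's effective latency approaches the L1 number.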
For teams running RAG pipelines, AI agent workflows, or real-time recommendation engines, the difference between 0.0015ms and 5ms is the difference between serving a response in the user’s latency budget and blowing it. At Cachee’s pricing, the L1 vector cache costs a fraction of what you are paying for a managed vector database — and it eliminates the latency that your users actually feel.
Related Reading
- AI Infrastructure Solutions
- Cachee Vector Search Documentation
- Cachee Pricing
- Get Started with Cachee