Vector Search

Vector Similarity Search
at 0.0015ms. Not 1ms.

Redis 8 added Vector Sets, queried over the network. We built HNSW in-process, so every lookup is a function call instead of a round-trip. 660x faster, with hybrid metadata filtering in one operation.

Per Query: 0.0015ms
Index Type: HNSW
Similarity Metrics: Cosine / L2 / Dot
Metadata Filters: Hybrid
Architecture

Why Vector Search Belongs in the Cache

Every AI-powered application follows the same pattern: embed the input, find the nearest vectors, serve the result. RAG pipelines query embeddings before every LLM call. Recommendation engines score candidates on every page load. Semantic search runs similarity on every keystroke. These are hot-path operations measured in milliseconds or less.

The conventional answer is a dedicated vector database sitting behind a network connection. That means TCP serialization, connection pooling, and a round-trip for every query. For a RAG pipeline making 3-5 retrieval calls per user request, those milliseconds compound into visible latency. The vector database is not the bottleneck because it is slow at search. It is the bottleneck because it is on the other side of a network hop.

Cachee eliminates the hop. The HNSW index lives in your application's process memory. Vector search runs as a function call, not a network request. The same cache layer that stores your keys, sessions, and API responses now handles vector similarity at 0.0015ms per query. No additional infrastructure. No connection management. No serialization overhead. One process, one memory space, one data layer for both key-value and vector operations.

🧠 RAG Retrieval
Retrieve context chunks from your embedding store in 1.5 microseconds instead of 1-5ms. Faster retrieval means lower end-to-end LLM latency for every generation.
3-5 retrievals per LLM call

🔍 Semantic Search
Turn user queries into embeddings and find semantically similar content without a network round-trip. Typeahead and autocomplete pipelines run in microseconds.
Zero network latency

Recommendations
Score candidates against user embeddings on every page load. In-process means you can afford to re-rank on every request, not batch offline.
Real-time, not batch

Cachee already runs AI-optimized caching with predictive pre-warming and ML-driven TTLs. Vector search is the natural extension: same process, same memory, same microsecond-scale performance tier.

How It Works

In-Process HNSW Vector Search

Store vectors with metadata using VADD. Search for K nearest neighbors using VSEARCH. Filter by metadata and similarity in a single operation. The entire pipeline runs in your application's memory space.

Vector Search Pipeline
Step 1
VADD
Step 2
HNSW Index
Step 3
VSEARCH
Step 4
Filter + Rank
Result
0.0015ms

VADD: Store Vectors with Metadata

VADD inserts a vector into the in-process HNSW graph along with arbitrary key-value metadata. Each vector gets a unique ID, a float array of any dimensionality, and optional metadata attributes like category, timestamp, source, or any custom field your application needs.

The HNSW graph builds incrementally. Every VADD updates the navigable small-world graph structure in real time. There is no batch indexing step, no rebuild trigger, and no read-lock during writes. New vectors are immediately searchable after insertion.

VSEARCH: K-Nearest with Hybrid Filters

VSEARCH takes a query vector and returns the K most similar vectors by your chosen metric: cosine similarity, L2 (Euclidean) distance, or dot product. The HNSW algorithm provides approximate nearest neighbor search with tunable accuracy-performance tradeoffs via the ef_search parameter.
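For reference, the three metrics compute the following. This is a minimal sketch in plain TypeScript, independent of the Cachee SDK; the engine evaluates these inside the HNSW traversal:

```typescript
// Reference implementations of the three VSEARCH similarity metrics.

// Dot product: higher means more similar (for normalized embeddings).
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Cosine similarity: 1.0 means identical direction, 0 means orthogonal.
function cosine(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// L2 (Euclidean) distance: lower means more similar.
function l2(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}
```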

Hybrid search is the key differentiator. VSEARCH accepts metadata filter expressions that are evaluated during graph traversal, not as a post-filter. This means "find the 10 nearest vectors where category = 'electronics' AND price < 50" runs as a single operation. No separate metadata query. No client-side join. One call, one result set, one latency number.
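To make the inline-filter semantics concrete, here is a hypothetical brute-force sketch in plain TypeScript. The `Item` shape and `hybridSearch` helper are illustrative, not SDK types; the real engine runs the same predicate during HNSW graph traversal rather than over a linear scan:

```typescript
interface Item {
  id: string;
  vector: number[];
  metadata: Record<string, string | number | boolean>;
}

function dot(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

function cosine(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Brute-force hybrid search: the metadata predicate runs on every candidate
// during the scan, so non-matching vectors never enter the result set.
// HNSW applies the same check while traversing the graph.
function hybridSearch(
  items: Item[],
  query: number[],
  k: number,
  predicate: (m: Item["metadata"]) => boolean,
): { id: string; score: number }[] {
  return items
    .filter((it) => predicate(it.metadata)) // inline filter, not a post-filter
    .map((it) => ({ id: it.id, score: cosine(query, it.vector) }))
    .sort((a, b) => b.score - a.score) // rank by similarity
    .slice(0, k);
}
```

Under this model, "10 nearest where category = 'electronics' AND price < 50" is one call with one predicate, which is exactly the shape the VSEARCH filter expression takes.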

// Store a product embedding with metadata
await cache.vadd('products', {
  id: 'prod_8291',
  vector: embedding, // float[] from your model
  metadata: { category: 'electronics', price: 29.99, in_stock: true }
});

// Find 10 nearest with metadata filter, in a single operation
const results = await cache.vsearch('products', {
  vector: queryEmbedding,
  k: 10,
  metric: 'cosine',
  filter: { category: 'electronics', price: { $lt: 50 } }
});
// => [{id: 'prod_8291', score: 0.97, metadata: {...}}, ...]
// Total query time: 0.0015ms

The HNSW implementation uses multi-layer navigable graphs with configurable construction parameters (M, ef_construction) for tuning the recall-vs-speed tradeoff. Default parameters deliver >95% recall at microsecond-scale latency for indices up to 1M vectors. For details on how the broader AI pipeline integrates with vector search, see the architecture page.

Comparison

Cachee vs Redis 8 Vector Sets

Redis 8 introduced Vector Sets as a native data type. It is a meaningful step forward for the Redis ecosystem. But vector search over a TCP connection has a hard latency floor that in-process search eliminates entirely.

Cachee: 0.0015ms
Redis 8 Vector Sets: ~1ms+
Pinecone / Weaviate: 5-50ms
Capability          | Redis 8 Vector Sets             | Cachee Vector Search
Query Latency       | ~1ms+ (network round-trip)      | 0.0015ms (in-process)
Index Algorithm     | SVS-VAMANA (DiskANN variant)    | HNSW (navigable small-world)
Hybrid Search       | Separate query + filter         | Single operation, inline filters
Dependencies        | Redis server required           | Zero (runs in your process)
Similarity Metrics  | Cosine, L2                      | Cosine, L2, Dot Product
Metadata Filters    | Via FT.SEARCH (separate module) | Native, inline with VSEARCH
Connection Overhead | TCP + RESP serialization        | None (function call)
Key-Value + Vector  | Same server, different types    | Same process, unified API
Scaling Model       | Cluster sharding                | In-process per node + distributed sync

Redis 8 Vector Sets solve a real problem for teams already running Redis who want to add vector search without deploying a separate database. Cachee solves a different problem: eliminating the network entirely for latency-sensitive vector workloads. If your vector queries are on the critical path of user-facing requests, the 660x difference is not theoretical. It is the gap between "fast enough" and "invisible." For a broader comparison of caching architectures, see our full comparison page.

Use Cases

Built for Latency-Sensitive AI Workloads

Vector search at cache speed unlocks use cases that network-bound databases cannot serve in real time.

01
RAG / LLM Retrieval
Retrieve context chunks before every LLM call. At 0.0015ms per retrieval, a 5-chunk RAG pipeline adds 0.0075ms total vector search time instead of 5-25ms with a network-bound store. The retrieval step disappears from your latency budget.
02
Semantic Caching
Before sending a prompt to your LLM, check if a semantically similar query was already answered. VSEARCH finds the nearest cached response by embedding distance. Save $0.01-0.10 per avoided API call while serving results in microseconds instead of seconds.
03
Product Recommendations
Score product embeddings against user preference vectors on every page load. Hybrid filters let you constrain by availability, price range, or category in the same call. Real-time personalization that runs on the critical path without adding latency.
04
Image Similarity
Store CLIP or ResNet embeddings alongside image metadata. Find visually similar images with metadata constraints (same brand, same color, in-stock) in a single VSEARCH call. Powers visual search, duplicate detection, and content moderation pipelines.
05
Anomaly Detection
Embed incoming events and measure distance to known-good clusters in real time. Outliers that fall beyond a threshold trigger alerts. At microsecond-scale query time, you can afford to check every event, not sample. Ideal for fraud detection and security monitoring.
06
Conversational Memory
Store conversation turns as embeddings and retrieve contextually relevant history for chatbot and agent systems. VSEARCH with a recency filter surfaces the most relevant past exchanges, giving your AI agents long-term memory at cache speed.
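The distance-to-cluster check in use case 05 can be sketched in a few lines of plain TypeScript. The `centroid` and `isAnomalous` helpers are illustrative, not part of the Cachee SDK:

```typescript
// Illustrative anomaly check: flag an event embedding whose L2 distance
// to the centroid of known-good embeddings exceeds a threshold.

function l2(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0));
}

// Mean of the known-good embeddings, used as the cluster center.
function centroid(vectors: number[][]): number[] {
  const dim = vectors[0].length;
  const sum = new Array(dim).fill(0);
  for (const v of vectors) for (let i = 0; i < dim; i++) sum[i] += v[i];
  return sum.map((x) => x / vectors.length);
}

function isAnomalous(event: number[], center: number[], threshold: number): boolean {
  return l2(event, center) > threshold;
}
```

In production you would store the known-good embeddings in a vector index and query the nearest cluster per event; the threshold comparison is the same either way.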
API Reference

Vector Commands

Five commands cover the full vector search lifecycle. Familiar Redis-style naming. Available through the SDK and RESP-compatible CLI.

VADD
Add a vector with metadata to an index. Supports any dimensionality. The HNSW graph updates incrementally on each insert.
VSEARCH
Find K nearest neighbors by cosine, L2, or dot product. Accepts inline metadata filter expressions for hybrid search in one call.
VDEL
Remove a vector by ID. The HNSW graph repairs its connections automatically, maintaining search quality after deletions.
VCARD
Return the cardinality (total vector count) of an index. Useful for monitoring index size and triggering capacity alerts.
VINFO
Return index metadata including dimensionality, metric type, HNSW parameters (M, ef), vector count, and memory usage.
# CLI examples

# Add a 128-dim vector with metadata
VADD products prod_001 128 [0.12, 0.45, -0.33, ...] category=electronics price=29.99

# Search for 5 nearest by cosine, filtered by category
VSEARCH products 5 [0.11, 0.44, -0.31, ...] METRIC cosine FILTER category=electronics

# Check index size
VCARD products
# => (integer) 248391

# Get index info
VINFO products
# => dim:128 metric:cosine vectors:248391 M:16 ef:200 mem:186MB

# Delete a vector
VDEL products prod_001

Full API documentation with parameter references, filter syntax, and tuning guides is available in the documentation. For predictive caching workflows that combine vector search with intelligent pre-warming, see the integration guide.

Quick Start

Add Vector Search in Three Lines

The same SDK you use for key-value caching now handles vector operations. No new dependencies, no infrastructure changes.

// Initialize Cachee (same client for KV + vector)
import { Cachee } from '@cachee/sdk';
const cache = new Cachee({ apiKey: 'ck_live_your_key' });

// Embed with your model (OpenAI, Cohere, local, etc.)
const embedding = await embed(userQuery);

// Semantic cache check: was this question already answered?
const cached = await cache.vsearch('qa_cache', {
  vector: embedding,
  k: 1,
  metric: 'cosine'
});

if (cached.length && cached[0].score > 0.95) {
  // Cache hit: serve the previous answer (0.0015ms)
  return cached[0].metadata.answer;
}

// Cache miss: call LLM, then store for future queries
const answer = await callLLM(userQuery);
await cache.vadd('qa_cache', {
  id: hash(userQuery),
  vector: embedding,
  metadata: { question: userQuery, answer, ts: Date.now() }
});
Unified Data Layer
Key-value cache and vector index share the same process memory. No dual-write consistency issues. No separate infrastructure to monitor. Set a key, add a vector, search both from one client.
One SDK, two superpowers
Any Embedding Model
Cachee is embedding-agnostic. Use OpenAI ada-002, Cohere embed-v3, Sentence Transformers, CLIP, or your own fine-tuned model. VADD accepts any float array. You own the embedding pipeline.
Model-agnostic by design
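Because VADD accepts raw float arrays, one practical client-side pattern (a sketch under our own assumptions, not documented SDK behavior) is to unit-normalize embeddings before insertion, so cosine similarity reduces to a plain dot product:

```typescript
// Scale a vector to unit length. For unit vectors, cosine similarity
// and dot product are the same number, so either metric can be used.
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map((x) => x / norm);
}

function dot(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}
```

Some embedding APIs already return unit-normalized vectors; normalizing defensively makes the index metric-agnostic regardless of which model produced the floats.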

Every AI Application Is a Caching Problem.
Solve Both at Once.

Key-value caching and vector search in one process. Sub-microsecond latency for both. Start with the free tier and add vector search to your existing Cachee deployment in minutes.
