AI Infrastructure

How Glean Can Make Enterprise RAG Search 10x Faster With In-Process Vector Caching

Glean built its entire company around one promise: search everything in your enterprise instantly. Slack messages, Google Docs, Jira tickets, Confluence pages, emails, Notion wikis — one search bar, one answer. The product is enterprise RAG at scale. Every query that enters that search bar triggers an embedding, a vector search across millions of document chunks, context assembly, and LLM generation. The vector search step — typically 1–5ms per lookup, with 3–5 lookups per query — is the silent bottleneck standing between Glean’s promise and a user experience that feels truly instant.

The Bottleneck Hiding in Plain Sight

Glean indexes an enterprise’s entire knowledge graph. For a mid-size company, that means 50–200 million document chunks, each embedded as a high-dimensional vector and stored in a vector database. When a user types “What was the Q4 revenue decision from the board meeting?”, Glean’s pipeline does the following: embed the query into a vector, search the index for the top-k most relevant document chunks (typically 3–5 lookups across different collections — Slack, Docs, email), assemble those chunks as context, and pass them to an LLM for answer generation.
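The pipeline above can be sketched in a few lines. This is an illustrative skeleton only, not Glean's actual code; `embed`, `search_collection`, and `generate` are hypothetical stand-ins for the embedding service, per-collection vector search, and LLM call.

```python
# Hypothetical sketch of the RAG query pipeline described above.
from typing import Callable

def answer_query(
    query: str,
    collections: list[str],  # e.g. ["slack", "docs", "email"]
    embed: Callable[[str], list[float]],
    search_collection: Callable[[str, list[float], int], list[str]],
    generate: Callable[[str, list[str]], str],
    top_k: int = 5,
) -> str:
    query_vec = embed(query)                 # 1. embed the query into a vector
    chunks: list[str] = []
    for name in collections:                 # 2. one vector lookup per collection
        chunks.extend(search_collection(name, query_vec, top_k))
    return generate(query, chunks)           # 3. assemble context, 4. generate
```

Each iteration of that loop is one round trip to the vector store, which is exactly where the 1–5ms-per-lookup cost accrues.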

The LLM generation step takes 500ms–2s and is largely irreducible without model changes. But the vector search step — the retrieval in Retrieval-Augmented Generation — is where Glean controls its own destiny. Each vector lookup against an external database like Pinecone, Weaviate, or a managed Elasticsearch instance incurs 1–5ms of round-trip latency. Five lookups across different data source collections means 5–25ms before the context assembly even begins. That is 5–25ms of dead time where the user sees nothing — no streaming tokens, no partial results, just a spinner.

1–5ms Per Vector DB Lookup
0.0015ms In-Process HNSW Lookup
Up to 3,300x Speed Improvement
95% Hot Set Query Coverage

Why Search Speed IS the Product

For most SaaS companies, latency is a performance metric. For Glean, latency is the product itself. Enterprise search lives or dies on perceived speed. Google trained every knowledge worker on the planet to expect search results in under 200ms. When Glean’s answer takes 2.5 seconds instead of 1.8 seconds, users don’t analyze the pipeline breakdown — they feel the lag and start opening tabs manually. Every millisecond Glean shaves from the retrieval step compounds directly into user satisfaction, adoption rates, and renewal conversations with enterprise buyers.

The competitive landscape makes this existential. Microsoft Copilot is embedded in every Office 365 deployment. Google Vertex AI Search has native access to Workspace data. Both have the advantage of zero-hop access to their own document stores. Glean, as a third-party connector, must overcome the inherent latency penalty of indexing data it does not natively own. The vector search step is where that penalty materializes — and where in-process caching eliminates it.

Key insight: Glean performs 3–5 vector lookups per query across different data source collections. At 1–5ms per lookup, vector search alone accounts for 5–25ms of latency. In-process HNSW at 0.0015ms per lookup compresses those 5 lookups to 0.0075ms — effectively zero.

The Hot Set Architecture

Not all 200 million document embeddings need to live in process. Enterprise search follows a steep power law: the top 10 million vectors cover approximately 95% of all queries. These are the recently edited documents, the frequently referenced Confluence pages, the Slack channels with daily activity, the Jira epics under active development. This is the hot set.

The architecture is a two-tier retrieval system. Tier 1 is an in-process HNSW index (via Cachee’s VADD and VSEARCH commands) holding the hot 10M vectors in the application’s own memory space. Tier 2 is the existing vector database holding the full index. Every query hits L1 first. On a hit — which happens 95% of the time — retrieval completes in 0.0015ms per lookup. On the rare miss (stale documents, archived content, rarely accessed pages), the query falls through to the vector DB at standard latency, and the result gets promoted into L1.
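The L1-then-fallthrough flow can be sketched as below. The `l1` and `vector_db` clients are hypothetical stand-ins (both assumed to return `(doc_id, vector)` pairs), and Cachee's real VADD/VSEARCH signatures may differ.

```python
# Minimal sketch of the two-tier retrieval path, with promotion on miss.
def two_tier_search(l1, vector_db, collection, query_vec, top_k=5):
    # Tier 1: in-process HNSW lookup (~0.0015ms, ~95% hit rate)
    hits = l1.vsearch(collection, query_vec, top_k)
    if hits:
        return hits
    # Tier 2: fall through to the external vector DB (1-5ms round trip)
    hits = vector_db.search(collection, query_vec, top_k)
    for doc_id, vec in hits:
        # Promote the miss into the hot set so the next lookup stays local.
        l1.vadd(collection, doc_id, vec)
    return hits
```

The promotion step is what keeps the hit rate high: any document a user actually searches for ends up in L1 before it is searched for again.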

Current: External Vector DB (per query, 5 lookups)

Embed query: 0.5ms
Vector search (5x): 5–25ms
Context assembly: 0.3ms
LLM generation: 800–2000ms

With L1 In-Process HNSW (per query, 5 lookups)

Embed query: 0.5ms
L1 HNSW search (5x): 0.0075ms
Context assembly: 0.3ms
LLM generation: 800–2000ms

The impact is stark. Total pre-LLM latency drops from 5.8–25.8ms to 0.81ms. That is a 7x–32x reduction in the retrieval phase. The user perceives faster time-to-first-token because the LLM starts generating sooner. For streaming responses, this means tokens begin appearing 5–25ms earlier — a difference that feels immediate at human perception scale.
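The figures above follow directly from the per-step numbers; this snippet reproduces the arithmetic.

```python
# Pre-LLM latency arithmetic from the breakdown above (all times in ms).
embed_ms, assemble_ms = 0.5, 0.3
external = [embed_ms + 5 * per_lookup + assemble_ms for per_lookup in (1.0, 5.0)]
in_process = embed_ms + 5 * 0.0015 + assemble_ms
print([round(t, 1) for t in external])               # [5.8, 25.8]
print(round(in_process, 2))                          # 0.81
print([f"{t / in_process:.0f}x" for t in external])  # ['7x', '32x']
```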

Memory Economics at Enterprise Scale

The natural objection: 10 million vectors in process sounds expensive. It is not. At 768 dimensions (a common dimensionality for enterprise embedding models) with float32 precision, each vector consumes 3,072 bytes. Ten million vectors require 30.7 GB of RAM — well within the capacity of a single modern server instance. With quantized int8 vectors (which HNSW supports with minimal recall loss), that drops to 7.7 GB. Glean already runs substantial infrastructure per enterprise tenant. An additional 8–31 GB of RAM per deployment is a rounding error against the compute cost of the LLM inference they are already paying for.
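The memory figures are simple back-of-the-envelope math:

```python
# Hot-set memory footprint: dimensions x bytes-per-component x vector count.
DIMS, VECTORS = 768, 10_000_000
float32_gb = DIMS * 4 * VECTORS / 1e9  # 4 bytes per float32 component
int8_gb = DIMS * 1 * VECTORS / 1e9     # 1 byte per int8-quantized component
print(f"{float32_gb:.1f} GB float32, {int8_gb:.1f} GB int8")
# 30.7 GB float32, 7.7 GB int8
```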

The ROI calculation is straightforward. If Glean processes 500,000 queries per day for a large enterprise tenant, and each query saves 10ms of vector search latency, that is 5,000 seconds of cumulative latency eliminated daily. More importantly, every individual user experiences faster results on 95% of their searches. That is the difference between “Glean is fast” and “Glean is instant.”

Infrastructure cost: 10M vectors at 768 dimensions = 30.7 GB (float32) or 7.7 GB (int8 quantized). For Glean’s infrastructure footprint, this is negligible. The latency savings on 500K daily queries per enterprise tenant make the RAM investment invisible on the balance sheet.

Predictive Warming and Real-Time Sync

The hot set is not static. Document relevance shifts throughout the day. A Slack thread about a production incident becomes the hottest content in the index within minutes. Glean already tracks document access patterns, edit timestamps, and cross-reference frequency. This metadata feeds directly into a cache warming strategy: newly edited documents get their embeddings promoted to L1 within seconds. Documents that have not been accessed in 72 hours get evicted. The cache stays warm without manual intervention.
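A promote-on-activity, evict-on-idle policy like the one described can be sketched as follows. The `l1` client with `vadd`/`vdel` methods is a hypothetical stand-in, and Glean's real warming signals and thresholds are not public; only the 72-hour window comes from the paragraph above.

```python
# Illustrative warming and eviction loop for the L1 hot set.
import time

EVICT_AFTER_S = 72 * 3600  # evict embeddings untouched for 72 hours

class HotSetWarmer:
    def __init__(self, l1):
        self.l1 = l1
        self.last_access = {}  # doc_id -> last access/edit timestamp

    def on_document_event(self, doc_id, embedding, now=None):
        # Edits and accesses promote the document's embedding within seconds.
        now = time.time() if now is None else now
        self.l1.vadd("hot", doc_id, embedding)
        self.last_access[doc_id] = now

    def sweep(self, now=None):
        # Periodic pass that evicts cold embeddings from the L1 index.
        now = time.time() if now is None else now
        for doc_id, ts in list(self.last_access.items()):
            if now - ts > EVICT_AFTER_S:
                self.l1.vdel("hot", doc_id)
                del self.last_access[doc_id]
```

In practice the document events would come from the access-pattern and edit-timestamp metadata Glean already collects, so no manual cache management is needed.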

For Glean, this creates a feedback loop. The more a document is searched, the faster it is found. The faster it is found, the more users rely on Glean instead of navigating to the source application. The more users rely on Glean, the more search data feeds the warming model. In-process vector caching does not just accelerate Glean’s search — it deepens the product moat.

The Competitive Calculus

Microsoft Copilot has zero-hop access to SharePoint, OneDrive, and Exchange. Google Vertex AI Search has zero-hop access to Drive and Gmail. Glean connects to these systems over APIs, indexes asynchronously, and searches its own copy. That architectural tax is permanent. But the retrieval latency tax is not. With in-process HNSW, Glean can deliver search results on the hot set faster than Copilot or Vertex can query their own native stores through standard database paths. The connector disadvantage becomes a caching advantage — because Glean controls its own index format and can optimize the read path in ways that platform-native search cannot.

At 0.0075ms for 5 lookups, Glean could deliver contextual snippets before the user finishes typing. Autocomplete becomes answer-complete. That is not a performance optimization. That is a product category shift.

Make Enterprise Search Instant.

In-process HNSW delivers vector lookups in 0.0015ms, up to 3,300x faster than external vector databases. Zero network hops.
