Architecture

API Gateway Caching: Patterns for Sub-Millisecond Response Times

Your API gateway is the front door to your entire backend. Every millisecond it adds to response time multiplies across every client, every request, every second. Most gateway caching implementations use a simple TTL-based Redis lookup — adding 1–5ms per cache check. At 10,000 requests per second, that is 10–50 seconds of cumulative latency per second. The cache that is supposed to accelerate your API is quietly becoming its bottleneck.

The problem is not that gateway caching is a bad idea. It is one of the highest-leverage performance optimizations available. The problem is that most implementations stop at "put Redis in front of the backend" and never address the fundamental limitations of network-based caching at the gateway layer. The cache check itself becomes the latency floor, and no amount of backend optimization can push response times below it.

Cachee replaces the network cache hop with an in-process L1 lookup that resolves in 1.5 microseconds. For cacheable responses — which at a 99.05% hit rate means virtually all of them — the API gateway serves the response without ever contacting Redis, without ever reaching the backend, and without any network round-trip at all. The response leaves the gateway in the time it previously took just to check whether a cached response existed.

1.5µs Cache Check
10,000+ RPS Per Node
99.05% Hit Rate
Zero Origin Load on Hit

Gateway Caching Anti-Patterns

Before examining what works, it is worth cataloging the patterns that create problems at scale. These anti-patterns are widespread because they work adequately at low traffic and only fail when the stakes are highest.

Flat TTL on Everything

The most common anti-pattern is applying the same TTL to every cached response. A 60-second TTL on a product catalog endpoint and a 60-second TTL on a user profile endpoint treats fundamentally different data with the same freshness guarantee. The catalog changes once a day; 60 seconds is wasteful churn. The user profile changes when the user edits it; 60 seconds of stale data means they do not see their own changes for a full minute. The right TTL depends on the data, not on a global configuration value.

Missing Vary-By Headers

An API that returns different content based on authentication level, locale, feature flags, or A/B test assignments needs cache keys that vary by those dimensions. Without proper vary-by configuration, user A sees user B's cached response, authenticated users see unauthenticated content, and the German localization serves English content. These are not edge cases. They are the default behavior of a cache that ignores request context.

No Cache Key Normalization

The requests /api/products?sort=price&page=1 and /api/products?page=1&sort=price are semantically identical but produce different cache keys by default. Without query parameter normalization — sorting parameters alphabetically, removing defaults, canonicalizing encoding — the cache stores duplicate responses for every permutation of the same query. Cache utilization drops. Hit rates suffer. Memory fills with redundant data.

Stale-While-Revalidate Without Prediction

Stale-while-revalidate is a useful pattern: serve the stale cached response immediately while refreshing from the backend in the background. But without prediction, every first request after a TTL expiry gets the stale response while triggering a backend call. If traffic is bursty — thousands of requests arriving in the first second after expiry — all of them get stale data and all of them trigger redundant backend calls. The pattern degrades to a thundering herd with stale characteristics.

The Right Patterns

Effective gateway caching requires treating the cache as a first-class component of the API architecture, not as an afterthought bolted onto the side. The following patterns, when combined with L1 caching, deliver sub-millisecond response times at scale.

Response Caching by Route + Parameters

Each API route should have its own caching policy. A product listing endpoint that queries a stable catalog can cache aggressively with long TTLs and AI-driven warming. A user notifications endpoint with real-time delivery expectations should use short TTLs with event-driven invalidation. A search endpoint with high cardinality but temporal locality should cache recent popular queries and evict long-tail results. The caching policy is part of the API contract, not a deployment configuration.
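One way to make per-route policy part of the contract is to declare it alongside the route definitions. The policy map below is a hypothetical sketch — the route names, fields, and fallback behavior are illustrative, not a real Cachee configuration format:

```javascript
// Hypothetical per-route cache policy map. Each route gets its own TTL,
// warming behavior, and invalidation strategy instead of a global default.
const cachePolicies = {
  'GET /api/products':      { ttl: 3600, warm: true,  invalidate: 'event' }, // stable catalog
  'GET /api/notifications': { ttl: 5,    warm: false, invalidate: 'event' }, // near-real-time
  'GET /api/search':        { ttl: 60,   warm: true,  invalidate: 'ttl'   }, // temporal locality
};

// Resolve the policy for an incoming request; uncached routes fall back to ttl: 0.
function policyFor(method, path) {
  return cachePolicies[`${method} ${path}`] || { ttl: 0 };
}
```

Because the policy lives next to the route table, a reviewer can see the freshness guarantee of each endpoint in one place rather than hunting through deployment config.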

Vary-By Auth Level

For APIs that serve different content based on authentication — which is most of them — the cache key must include the auth tier. Not the specific user token (that would make every request unique), but the authorization level: anonymous, free tier, premium, admin. A product API might return different pricing tiers to different user classes. A content API might return different feature access levels. Cache keys that include the auth tier serve the correct response from cache without per-user fragmentation.
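The tier derivation can be a small pure function that collapses the user object into one of a handful of values. This sketch assumes a hypothetical user shape with `isAdmin` and `plan` fields; adapt the mapping to your own auth model:

```javascript
// Collapse per-user identity into a small set of auth tiers, so the cache
// key varies by authorization level rather than by individual token.
function authTier(user) {
  if (!user) return 'anonymous';          // unauthenticated requests share one tier
  if (user.isAdmin) return 'admin';
  return user.plan === 'premium' ? 'premium' : 'free';
}
```

Four tiers means at most four cached variants per route instead of one per user — the cache stays hot without leaking one user's response to another.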

Cache Key Normalization

Normalize cache keys by sorting query parameters alphabetically, removing parameters that match their defaults, lowercasing header values used in vary-by calculations, and stripping tracking parameters (utm_source, fbclid, etc.) that do not affect the response. This single optimization can improve cache hit rates by 15–30% with zero impact on correctness. Cachee performs this normalization automatically for all gateway-layer cache entries.
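The normalization steps above can be sketched in a few lines. The tracking-parameter list and default values here are illustrative placeholders, not Cachee's built-in behavior:

```javascript
// Gateway cache-key normalization: strip tracking params, drop parameters
// that match their documented defaults, and sort the rest alphabetically
// so parameter order never fragments the cache.
const TRACKING = new Set(['utm_source', 'utm_medium', 'utm_campaign', 'fbclid', 'gclid']);
const DEFAULTS = { page: '1', sort: 'relevance' }; // example API defaults

function normalizeKey(path, query) {
  const parts = Object.keys(query)
    .filter((k) => !TRACKING.has(k))          // tracking params never affect the response
    .filter((k) => query[k] !== DEFAULTS[k])  // explicit defaults are redundant
    .sort()                                   // canonical parameter order
    .map((k) => `${k}=${encodeURIComponent(query[k])}`);
  return parts.length ? `${path}?${parts.join('&')}` : path;
}
```

With this in place, `/api/products?sort=price&page=2` and `/api/products?page=2&sort=price&utm_source=mail` resolve to the same cache entry.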

Short TTL + AI Warming

For dynamic content that cannot tolerate stale responses, the right approach is not a long TTL. It is a short TTL — 5 to 30 seconds — combined with AI prediction that refreshes the cache before the TTL expires. Cachee learns the access frequency of each cached response and schedules background refreshes so that the cached entry is always fresh when the next request arrives. The client never sees stale data. The backend never sees a thundering herd. The cache is always warm.
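The mechanics of refresh-ahead can be sketched generically. The scheduling below is a simple "refresh when close to expiry" heuristic standing in for Cachee's AI-driven warming; the function names and parameters are illustrative:

```javascript
// Short-TTL cache with proactive background refresh: entries nearing expiry
// are refreshed asynchronously, so readers never wait on the backend and
// never see an expired entry under steady traffic.
function createWarmCache(fetcher, { ttl = 30, refreshAhead = 5 } = {}) {
  const entries = new Map(); // key -> { value, expiresAt, refreshing }

  async function get(key) {
    const now = Date.now();
    const e = entries.get(key);
    if (e && e.expiresAt > now) {
      // Entry still fresh; kick off a background refresh if expiry is near.
      if (e.expiresAt - now < refreshAhead * 1000 && !e.refreshing) {
        e.refreshing = true;
        fetcher(key)
          .then((value) => entries.set(key, { value, expiresAt: Date.now() + ttl * 1000 }))
          .catch(() => { e.refreshing = false; }); // retry on next read
      }
      return e.value; // served from memory, no backend wait
    }
    const value = await fetcher(key); // cold miss: fetch synchronously once
    entries.set(key, { value, expiresAt: now + ttl * 1000 });
    return value;
  }
  return { get };
}
```

The `refreshing` flag is the single-flight guard: even under a burst, only one background refresh per key is in flight, which is what prevents the thundering herd described earlier.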

L1 at the Gateway

Cachee deploys as a sidecar to your existing API gateway — Kong, NGINX, Envoy, AWS API Gateway, or any gateway that supports middleware or plugin architecture. The integration point is a cache check on every incoming request, before the request reaches any upstream service.

```javascript
// Express/Node.js API gateway middleware with Cachee
const cachee = require('@cachee/sdk');

const gatewayCacheMiddleware = async (req, res, next) => {
  // Normalize cache key: sorted params, stripped tracking
  const cacheKey = cachee.normalizeKey({
    route: req.path,
    params: req.query,
    authTier: req.user?.tier || 'anonymous',
    locale: req.headers['accept-language']
  });

  // L1 lookup: 1.5µs
  const cached = await cachee.get(cacheKey);
  if (cached) {
    res.set('X-Cache', 'HIT');
    res.set('X-Cache-Latency', '1.5us');
    return res.json(cached); // Response served. Backend never contacted.
  }

  // Cache miss: proceed to backend
  res.set('X-Cache', 'MISS');
  const originalJson = res.json.bind(res);
  res.json = (body) => {
    cachee.set(cacheKey, body, { ttl: 30 });
    return originalJson(body);
  };
  next();
};

app.use(gatewayCacheMiddleware);
// 99 out of 100 requests never reach your backend.
```

On a cache hit — which at 99.05% means virtually every request — the response returns from L1 memory in 1.5 microseconds. There is no Redis round-trip. There is no connection pool management. There is no serialization or deserialization across a network boundary. The cached response bytes are already in the process memory of the gateway node. The response is assembled and sent before a traditional cache check would have even opened a TCP connection.

On a cache miss, the request proceeds to the backend normally. The response is cached in L1 on the way back through the middleware. Cachee's AI prediction engine learns from the miss and pre-warms similar keys in the background, reducing future miss rates. Over time, the hit rate converges toward 99%+ even for APIs with diverse access patterns.

Intelligent Invalidation

The hardest problem in caching is not storage or retrieval. It is invalidation. When should a cached response be evicted? The wrong answer creates either stale data (evict too late) or poor hit rates (evict too early). Most gateway caches solve this with TTL-based expiry, which is the equivalent of setting an alarm clock when you need a precision timer.

Cachee uses event-driven invalidation. When a write operation modifies data that affects a cached response, Cachee invalidates the specific cache entries affected by that mutation — not the entire cache, not every entry for that route, but exactly the entries whose data changed. This is surgical invalidation instead of cache flush.

How It Works in Practice

Consider an e-commerce API. A user updates their shipping address. With TTL-based caching, every cached response that includes user profile data will eventually expire and refresh. With event-driven invalidation, the specific cache entries for that user's profile, their order summary, and their checkout preview are invalidated immediately. Cache entries for other users, product listings, and category pages are untouched. The hit rate stays high. The user sees their change immediately. Every other cached response continues serving at full speed.

Cachee integrates with your write path through a simple invalidation API. When your backend processes a mutation, it sends a lightweight invalidation event describing what changed. Cachee maps that event to the affected cache entries and evicts them. The next read request triggers a fresh cache population. The entire cycle completes in microseconds.
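The mapping from mutation events to affected cache entries can be sketched as a dependency registry. The event shape and function names below are illustrative, not Cachee's actual invalidation API:

```javascript
// Event-driven invalidation sketch: reads register which entity each cache
// entry depends on; a write emits an event naming the changed entity, and
// only the dependent entries are evicted.
const dependents = new Map(); // "type:id" -> Set of cache keys

function trackDependency(entity, cacheKey) {
  if (!dependents.has(entity)) dependents.set(entity, new Set());
  dependents.get(entity).add(cacheKey);
}

function invalidate(event, cache) {
  const entity = `${event.type}:${event.id}`;
  const keys = dependents.get(entity) || new Set();
  for (const key of keys) cache.delete(key); // surgical: only affected entries
  dependents.delete(entity);
  return keys.size; // number of entries evicted
}
```

In the shipping-address example above, the write path would emit `{ type: 'user', id: 42 }`, evicting that user's profile, order summary, and checkout preview while every other entry keeps serving.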

GraphQL Caching

REST APIs have a natural cache boundary: each endpoint returns a predictable response shape. GraphQL breaks this contract. Every query can request different fields, different nesting depths, different relationships. A naive cache-by-query approach leads to astronomical key cardinality — every unique query string is a unique cache entry, and clients rarely send the exact same query twice.

Cachee addresses GraphQL caching at three levels:

Normalized Query Caching

GraphQL queries are parsed and normalized before cache key generation. Whitespace, field ordering, and alias differences are eliminated. Two queries that request the same data in different syntactic forms map to the same cache key. This alone reduces key cardinality by 40–60% in typical GraphQL APIs.

Field-Level Caching

Instead of caching entire query responses, Cachee caches individual resolved fields. A query that requests user name, email, and order history can serve name and email from cache even if order history is a cache miss. The gateway assembles the response from cached fields and only queries the backend for the missing data. This partial-hit strategy dramatically improves effective hit rates for GraphQL workloads where full query cache hits are rare.
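The partial-hit assembly can be sketched as follows. The cache and resolver interfaces here are illustrative stand-ins for Cachee's field-level machinery:

```javascript
// Field-level partial hits: serve whatever fields are cached, and call the
// backend resolver only for the fields that miss.
async function resolveFields(fieldCache, requested, fetchField) {
  const result = {};
  const misses = [];
  for (const f of requested) {
    if (fieldCache.has(f)) result[f] = fieldCache.get(f); // partial hit
    else misses.push(f);
  }
  for (const f of misses) {
    result[f] = await fetchField(f); // backend touched only for missing fields
    fieldCache.set(f, result[f]);    // populate for the next request
  }
  return result;
}
```

A query for name, email, and order history where only order history misses costs one backend field resolution instead of three.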

Persisted Query Optimization

For APIs that use persisted queries (Apollo's APQ or similar), the query hash serves as a stable, compact cache key. Cachee pre-warms responses for frequently used persisted queries during low-traffic periods, ensuring that the most common operations always serve from L1 when traffic spikes. Combined with field-level caching for non-persisted queries, this provides comprehensive coverage across both usage patterns.

The fastest API response is the one that never reaches your backend. At 99.05% hit rate, Cachee means 99 out of 100 requests resolve in 1.5µs at the gateway. Your backend only handles the 1% of requests that actually require fresh computation.

The Compound Effect

Gateway caching at L1 speed has a compound effect across the entire stack. When 99% of API requests resolve in microseconds at the gateway, backend services handle 99% less traffic. Database connection pools shrink. Queues drain faster. Background workers have idle capacity. Autoscaling groups scale down. The entire infrastructure cost structure shifts because the gateway is absorbing the read load that previously cascaded through every layer.

For microservice architectures, this effect multiplies. A single client API request often fans out to 5–10 internal service calls. If the gateway caches the aggregated response, all 5–10 internal calls are eliminated on a cache hit. The internal network traffic drops by the same 99% factor. Service mesh overhead decreases. Distributed tracing spans shrink from 10 hops to 1. The architecture becomes simpler because the cache removes the need for most of the inter-service communication that makes microservices operationally expensive.

The teams that implement gateway caching correctly — with normalized keys, proper vary-by headers, event-driven invalidation, and L1 speed — find that it is not just a performance optimization. It is an architectural simplification. The API gateway stops being a thin proxy that adds latency and becomes the component that absorbs the majority of read traffic, protects backends from load spikes, and delivers response times that were previously impossible without dedicated CDN infrastructure. Cachee makes that transition practical by eliminating the latency tax that conventional caching imposes on the gateway itself.

Ready for Sub-Millisecond API Responses?

See how Cachee's L1 gateway caching eliminates backend load and delivers 1.5µs cache checks.
