API Response Caching: The Complete Guide
Every API response is the output of a computation. A database query, a join across three tables, a call to an upstream service, a machine learning inference. These computations take time -- 5 milliseconds for a simple database lookup, 50 milliseconds for a complex aggregation, 500 milliseconds for an ML prediction. If the same computation produces the same result for multiple requests, running it every time is waste. Caching the result and serving it on subsequent requests eliminates that waste.
This is not a new idea. What is new is the scale at which it matters. Modern APIs serve thousands to millions of requests per second. A 50-millisecond computation at 10,000 requests per second consumes 500 CPU-seconds per second -- you need 500 cores just for that one endpoint. If 95% of those requests return the same response (because the underlying data has not changed), caching reduces the compute requirement from 500 cores to 25 cores. That is not an optimization. It is the difference between a viable architecture and an infrastructure budget that grows linearly with traffic.
This guide covers everything you need to implement API response caching correctly: what to cache, how to design cache keys, TTL strategies, invalidation patterns, large response handling, Cache-Control headers, and monitoring. It applies to REST, GraphQL, and gRPC.
1. What to Cache
Not every API response should be cached. Caching is appropriate when the response is deterministic (the same input produces the same output), the response changes infrequently relative to how often it is read, and serving a slightly stale response is acceptable for the use case.
Cache These
GET responses for reference data. Product catalogs, configuration settings, feature flags, country lists, currency exchange rates. These change infrequently (minutes to hours between updates) and are read on nearly every request. Cache aggressively with TTLs of 60-3600 seconds.
Computed aggregations. Dashboard summaries, analytics rollups, leaderboards, recommendation lists. These are expensive to compute (often involving joins across multiple tables or services) and the result is valid for seconds to minutes. Cache with TTLs of 10-300 seconds and recompute in the background.
Authentication and authorization checks. "Does this token belong to an active user?" is a question asked on every authenticated request. The answer changes when the user is deactivated or the token is revoked, which happens far less often than once per request. Cache auth check results with a short TTL (5-30 seconds) to bound the window between revocation and enforcement.
Upstream API responses. If your API calls a third-party service (payment provider, geolocation, weather, shipping rates), cache the response to reduce external API costs, latency, and dependency on third-party uptime. Respect the upstream Cache-Control headers if present; otherwise, set a conservative TTL based on how often the data changes.
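As a concrete illustration of that last point, here is a minimal sketch that caches an upstream exchange-rate call and derives the TTL from the upstream Cache-Control header when present. The URL, the requests package, and the `cache` client with get/set are assumptions for illustration, not part of this guide's API.

```python
import re
import requests

def fetch_rates(base_currency):
    key = f"upstream:rates:{base_currency}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    resp = requests.get("https://api.example.com/rates", params={"base": base_currency})
    # Honor the upstream max-age when present; otherwise use a conservative 60s
    match = re.search(r"max-age=(\d+)", resp.headers.get("Cache-Control", ""))
    ttl = int(match.group(1)) if match else 60
    value = resp.json()
    cache.set(key, value, ttl=ttl)
    return value
```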
Do Not Cache These
POST, PUT, PATCH, DELETE responses. Mutation endpoints modify state. Caching their responses would serve stale confirmation data. More importantly, caching the fact that a mutation happened does not prevent the mutation from being re-executed if the response is lost -- the client should retry the mutation, not read a cached success response.
User-specific data with high write frequency. If a user's inbox count changes every few seconds, caching the inbox count with a 30-second TTL means the user sees stale data most of the time. The cache provides minimal value because the data changes faster than the TTL. Either cache with a sub-second TTL (which requires in-process caching to be practical) or do not cache at all.
Responses containing secrets or sensitive data. API responses that include API keys, passwords, PII, or financial data should not be cached in shared caches. In-process caching of sensitive data is acceptable if the process is trusted, but network caches (Redis, Memcached, CDNs) add attack surface. When in doubt, set Cache-Control: no-store.
2. Cache Key Design
The cache key determines when two requests are considered equivalent for caching purposes. A key that is too broad will serve incorrect responses to the wrong users. A key that is too narrow will cache every request separately and achieve a near-zero hit rate.
REST APIs
For REST APIs, the natural cache key is the request URL including query parameters, plus any request headers that affect the response. The minimum key for a REST GET is the full URL: GET /api/v1/products?category=electronics&page=2. If the response varies by authenticated user, include the user ID or a hash of the auth token in the key. If the response varies by Accept-Language or Accept-Encoding headers, include those headers in the key.
```python
# Cache key for a REST API response
import hashlib

def sha256(s):
    return hashlib.sha256(s.encode()).hexdigest()

def cache_key(request):
    parts = [
        request.method,         # GET
        request.path,           # /api/v1/products
        request.query_string,   # category=electronics&page=2
    ]
    # Only include auth context if the response varies by user
    if endpoint_is_user_specific(request.path):
        parts.append(sha256(request.auth_token))
    # Only include Accept-Language if the response is localized
    if endpoint_is_localized(request.path):
        parts.append(request.headers.get("Accept-Language", "en"))
    return sha256("|".join(parts))
```
GraphQL APIs
GraphQL is harder to cache because the query body is in the POST request and every client can request different fields. Two clients requesting the same entity with different field selections produce different responses. The cache key for GraphQL must include the query string (or a hash of the normalized query), the variables, and the operation name.
Normalization matters. The queries { user(id: 1) { name email } } and { user(id: 1) { email name } } return the same response but have different string representations. To avoid caching the same response under two different keys, normalize the query by sorting fields alphabetically, removing whitespace, and then hashing. Most GraphQL caching libraries handle this normalization automatically.
```python
# GraphQL cache key
import json

def graphql_cache_key(request):
    body = json.loads(request.body)
    normalized_query = normalize_graphql(body["query"])
    variables = json.dumps(body.get("variables", {}), sort_keys=True)
    operation = body.get("operationName", "")
    parts = [normalized_query, variables, operation]
    if requires_auth_context(normalized_query):
        parts.append(sha256(request.auth_token))  # reuses the sha256 helper above
    return sha256("|".join(parts))
```
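The normalize_graphql helper above is left undefined. A minimal sketch, assuming the graphql-core package, can collapse whitespace and formatting differences by re-printing the parsed AST; sorting selection sets alphabetically, as described above, would additionally require an AST visitor and is omitted here.

```python
from graphql import parse, print_ast  # graphql-core

def normalize_graphql(query: str) -> str:
    # Re-printing the parsed AST yields a canonical text form, so queries that
    # differ only in whitespace or formatting hash to the same cache key.
    return print_ast(parse(query))
```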
gRPC APIs
For gRPC, the cache key is the fully qualified method name plus the serialized request message. Protobuf supports deterministic serialization (stable field ordering and sorted map keys) when you request it explicitly, so the serialized bytes of the request can serve as the cache key directly, prefixed by the method name. This is simpler than REST or GraphQL because protobuf messages are strongly typed and do not have the variable-format problem of JSON or query strings.
```python
# gRPC cache key (deterministic=True sorts map keys for stable bytes)
import hashlib

def grpc_cache_key(method, request_message):
    payload = request_message.SerializeToString(deterministic=True)
    return hashlib.sha256(method.encode() + b"|" + payload).hexdigest()
```
3. TTL Strategy
The TTL (time-to-live) determines how long a cached response is considered valid. Too short, and the cache provides little benefit (low hit rate). Too long, and users see stale data. The right TTL depends on how often the underlying data changes and how much staleness is acceptable.
| Data Type | Change Frequency | Acceptable Staleness | Recommended TTL |
|---|---|---|---|
| Feature flags | Hours to days | Minutes | 300s (5 min) |
| Product catalog | Hours | Minutes | 600s (10 min) |
| User profile | Days | Seconds | 30s |
| Auth check result | Rare (revocation) | Seconds | 10-30s |
| Dashboard aggregation | Minutes | Seconds | 15-60s |
| Search results | Minutes to hours | Minutes | 120-600s |
| Rate limit counter | Every request | Zero | Do not L1 cache |
| Exchange rates | Minutes | Seconds to minutes | 30-120s |
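A hypothetical per-endpoint TTL policy reflecting these recommendations might be expressed as configuration; the paths, the stale windows, and the lookup helper below are illustrative assumptions, not a fixed API.

```python
# TTLs follow the table above; "stale" is the stale-while-revalidate window
TTL_POLICY = {
    "/api/v1/feature-flags": {"ttl": 300, "stale": 60},
    "/api/v1/products":      {"ttl": 600, "stale": 120},
    "/api/v1/users/me":      {"ttl": 30,  "stale": 10},
    "/api/v1/auth/check":    {"ttl": 15,  "stale": 0},
    "/api/v1/dashboard":     {"ttl": 30,  "stale": 30},
    "/api/v1/search":        {"ttl": 300, "stale": 60},
    "/api/v1/rates":         {"ttl": 60,  "stale": 30},
}

def ttl_for(path):
    # Endpoints without an explicit policy are not cached at all
    return TTL_POLICY.get(path)
```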
Stale-While-Revalidate
The most useful TTL pattern for API caching is stale-while-revalidate. When a cached entry expires, serve the stale entry immediately to the requesting client, and trigger an asynchronous background refresh. The next request after the refresh completes will get the fresh data. This pattern eliminates the latency spike that occurs when a popular cache entry expires and all concurrent requests simultaneously query the origin (the thundering herd problem).
In HTTP, this is expressed as Cache-Control: max-age=60, stale-while-revalidate=30, which means the response is fresh for 60 seconds, and for an additional 30 seconds after expiry, the stale response may be served while a background refresh occurs. After 90 seconds total (60 + 30), the entry is fully expired and a synchronous refresh is required. In application-level caching, you implement this by storing both a hard TTL and a soft TTL per entry, and triggering background refresh when the soft TTL expires but the hard TTL has not.
```python
# Stale-while-revalidate implementation
def get_cached(key, fetch_fn, ttl=60, stale_ttl=30):
    entry = cache.get(key)
    if entry is None:
        # Hard miss: synchronous fetch
        value = fetch_fn()
        cache.set(key, value, ttl=ttl + stale_ttl)
        cache.set_meta(key, "fresh_until", now() + ttl)
        return value
    if now() > cache.get_meta(key, "fresh_until"):
        # Stale: serve stale, refresh in background
        background_refresh(key, fetch_fn, ttl, stale_ttl)
    return entry  # Fresh or stale, either way: 31ns
```
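The background_refresh call above is what prevents every concurrent request from hitting the origin when an entry goes stale. A minimal sketch, assuming a shared thread pool and the same cache and now helpers, might look like this; it is not strictly thread-safe, and a production version would use locking or single-flight deduplication.

```python
from concurrent.futures import ThreadPoolExecutor

_refresh_pool = ThreadPoolExecutor(max_workers=4)
_in_flight = set()

def background_refresh(key, fetch_fn, ttl, stale_ttl):
    if key in _in_flight:   # another request already triggered a refresh
        return
    _in_flight.add(key)

    def _do_refresh():
        try:
            value = fetch_fn()
            cache.set(key, value, ttl=ttl + stale_ttl)
            cache.set_meta(key, "fresh_until", now() + ttl)
        finally:
            _in_flight.discard(key)

    _refresh_pool.submit(_do_refresh)
```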
4. Invalidation Patterns
TTL-based expiry is the simplest invalidation strategy, but it means data can be stale for up to the TTL duration. For use cases where staleness is unacceptable (user deactivation, price changes, security policy updates), you need explicit invalidation.
Event-Driven Invalidation (CDC)
The most robust invalidation pattern is event-driven invalidation via change data capture (CDC). When the source data changes (a database row is updated, a configuration is modified), the system emits an event. The cache subscribes to these events and invalidates or updates the affected entries. This gives you near-zero staleness without the cost of aggressive polling.
In practice, CDC-based invalidation uses a message queue (Kafka, SQS, NATS) or a database trigger (Postgres LISTEN/NOTIFY, DynamoDB Streams). The cache listener receives the change event, computes the affected cache keys from the event payload, and either deletes the keys (forcing a refresh on the next read) or updates them with the new value (if the event contains the full updated entity).
The latency between the data change and the cache invalidation is the end-to-end event delivery time, typically 50-500 milliseconds for Kafka and 10-50 milliseconds for LISTEN/NOTIFY. This is fast enough that most users never see stale data, but it is not zero -- there is a window between the write and the invalidation where a read from cache will return the old value.
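As a sketch of what the listener side can look like, the following assumes a Kafka topic named db.products.changes carrying JSON change events with table and id fields, consumed with the kafka-python client; the topic name and event schema are illustrative.

```python
import json
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "db.products.changes",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b),
)

for event in consumer:
    change = event.value
    if change["table"] == "products":
        # Delete the affected entry; the next read repopulates from the origin
        cache.delete(f"product-detail:{change['id']}")
```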
Tag-Based Invalidation
Tag-based invalidation associates each cached entry with one or more tags. When you need to invalidate all entries related to a specific entity, you invalidate the tag, which invalidates all entries associated with that tag. This is useful when a single data change affects multiple cached responses.
For example, when a product's price changes, you might need to invalidate the product detail page, the product listing page, the shopping cart, and the order summary. Instead of tracking all four cache keys individually, you tag all of them with product:12345. When the product changes, you invalidate the product:12345 tag, and all four entries are invalidated in one operation.
```python
# Tag-based invalidation
cache.set("product-detail:12345", detail_response,
          tags=["product:12345", "category:electronics"])
cache.set("product-listing:electronics:page1", listing_response,
          tags=["category:electronics"])
cache.set("cart:user:789", cart_response,
          tags=["product:12345", "product:67890", "user:789"])

# When product 12345 changes:
cache.invalidate_tag("product:12345")
# Invalidates: product-detail:12345, cart:user:789
# Does NOT invalidate: product-listing (different tag)
```
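If your cache client does not provide tags natively, a minimal Redis-backed sketch (using redis-py; the tag key layout is an assumption) can track the keys belonging to each tag in a set and delete them together:

```python
import redis

r = redis.Redis()

def set_with_tags(key, value, tags, ttl=300):
    r.set(key, value, ex=ttl)
    for tag in tags:
        # Record key membership so the tag can be invalidated as a unit
        r.sadd(f"tag:{tag}", key)

def invalidate_tag(tag):
    keys = r.smembers(f"tag:{tag}")
    if keys:
        r.delete(*keys)
    r.delete(f"tag:{tag}")
```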
TTL Expiry
For data where bounded staleness is acceptable and the source system does not emit change events, TTL expiry is the pragmatic choice. Set the TTL to match the maximum acceptable staleness. Combine with stale-while-revalidate to avoid thundering herd on expiry. This is the right pattern for reference data (country lists, timezone data), third-party API responses (where you do not control the source system), and data that changes gradually (analytics rollups, leaderboards).
5. Handling Large Responses
API responses are not uniformly small. A paginated product listing might be 2-5KB. A GraphQL query returning a user's complete activity feed might be 20-150KB. An analytics aggregation endpoint might return 200KB of chart data. A report generation endpoint might return 1-5MB of structured data.
For network caches (Redis, Memcached), response size directly impacts latency. A 64-byte cached response takes 300 microseconds to retrieve from Redis. A 100KB response takes 1.2 milliseconds. A 1MB response takes 8-12 milliseconds. The latency increase is linear with size because serialization and network transfer scale linearly.
| Response Size | Redis Latency | In-Process L1 Latency | Speedup |
|---|---|---|---|
| 1 KB | 0.35 ms | 0.000042 ms (42 ns) | 8,333x |
| 10 KB | 0.55 ms | 0.000120 ms (120 ns) | 4,583x |
| 50 KB | 0.95 ms | 0.000450 ms (450 ns) | 2,111x |
| 100 KB | 1.20 ms | 0.000850 ms (850 ns) | 1,412x |
| 500 KB | 4.80 ms | 0.004000 ms (4 us) | 1,200x |
For large responses, the advantage of in-process caching is dramatic. A 100KB GraphQL response served from Redis takes 1.2 milliseconds. The same response served from in-process L1 takes 850 nanoseconds -- 1,412x faster. The difference grows with response size because network transfer time grows linearly while in-process access is a memory copy that benefits from CPU cache prefetch and wide memory buses.
If your API serves responses larger than 10KB frequently, in-process caching is not an optimization. It is a requirement. The alternative is spending 1-5 milliseconds per request on cache retrieval alone, before any application logic runs. At 10,000 requests per second, that is 10-50 seconds of cumulative network time per second -- more than your origin computation might cost in the first place.
6. Cache-Control Headers
If your API is consumed by browsers, mobile apps, CDNs, or any HTTP-aware client, Cache-Control headers are your primary mechanism for controlling caching behavior across the entire request chain. Setting them correctly avoids redundant requests at every layer.
Key Directives
max-age=N tells the client (browser, app, CDN) that the response is fresh for N seconds. During this window, the client will not make a request to your server at all -- it serves the response from its own local cache. This is the single most effective directive for reducing API traffic. A max-age=60 on a popular endpoint can reduce your request volume by 90%+ if clients poll more frequently than once per minute.
s-maxage=N is like max-age but only applies to shared caches (CDNs, reverse proxies). This lets you set different freshness windows for end-user clients (max-age) and intermediate caches (s-maxage). A common pattern is max-age=0, s-maxage=300: the CDN caches for 5 minutes, but the browser always revalidates with the CDN (not your origin).
no-cache does not mean "do not cache." It means "cache but always revalidate before serving." The client stores the response but checks with the server (via If-None-Match/ETag or If-Modified-Since) before using it. If the server returns 304 Not Modified, the client uses the cached copy. This saves bandwidth (304 responses have no body) while ensuring freshness.
no-store means "do not cache at all." The response must not be stored by any cache at any layer. Use this for responses containing sensitive data (auth tokens, PII, financial data). This is the only directive that truly prevents caching.
stale-while-revalidate=N allows the client to serve a stale response for N seconds past the max-age expiry while it revalidates in the background. This eliminates the latency penalty of revalidation for the user and is supported by modern browsers and CDNs.
```
# Reference data: cache aggressively
Cache-Control: public, max-age=3600, s-maxage=7200, stale-while-revalidate=600

# User-specific data: cache briefly, revalidate
Cache-Control: private, max-age=30, stale-while-revalidate=10

# Sensitive data: never cache
Cache-Control: no-store

# Shared data behind auth: CDN caches, browser revalidates
Cache-Control: no-cache, s-maxage=300
Vary: Authorization
```
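To tie these headers to application code, here is a minimal Flask sketch that attaches the reference-data policy above to a single endpoint; Flask and the loader function are assumptions for illustration.

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/v1/countries")
def countries():
    resp = jsonify(load_country_list())  # hypothetical loader for reference data
    resp.headers["Cache-Control"] = (
        "public, max-age=3600, s-maxage=7200, stale-while-revalidate=600"
    )
    return resp
```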
The Vary Header
The Vary header tells caches which request headers affect the response. If your API returns different content based on the Authorization header, set Vary: Authorization. If it varies by language, set Vary: Accept-Language. Without the Vary header, a CDN might cache the response for User A and serve it to User B, which is a security and correctness bug.
Be careful with Vary: *, which effectively disables caching by saying "this response varies on everything." If you need to vary on a header that most caches do not handle well, consider including the varying dimension in the URL or query string instead, where it naturally becomes part of the cache key.
7. Monitoring
A cache without monitoring is a liability. You do not know if it is working, how much value it provides, or when it degrades. Four metrics are essential for API response caching.
Hit Rate
The fraction of requests served from cache. Track this per endpoint, not just as an aggregate. An aggregate 85% hit rate can hide a 20% hit rate on your most expensive endpoint. The goal is 95%+ on cacheable endpoints. If an endpoint's hit rate is below 70%, investigate: the TTL may be too short, the cache key may be too specific (creating too many unique entries), or the data may genuinely change too fast for caching.
Miss Latency
The latency of requests that miss the cache. This is the cost you are paying for every cache miss. Track the P50 and P99 of miss latency per endpoint. If miss latency is under 5 milliseconds, the cache is a nice optimization. If miss latency is over 50 milliseconds (a common pattern for ML inference, complex aggregations, or upstream API calls), the cache is a structural requirement -- without it, your endpoint cannot meet latency SLAs.
Cache Size and Eviction Rate
How much memory the cache consumes and how often entries are evicted. If the eviction rate is high (entries are being evicted before their TTL expires), the cache is undersized. Consider increasing capacity or improving the eviction policy to retain more valuable entries. If the cache is using only 10% of its allocated capacity, you may be over-provisioned or your TTLs may be too short (entries expire before the cache fills up).
Weighted Miss Cost
The metric that ties everything together: miss_rate * avg_miss_latency. This is the average latency penalty per request caused by cache misses. A high weighted miss cost means either your hit rate is too low, your miss latency is too high, or both. Reducing this number is the goal of all caching optimization. Track it per endpoint and aggregate. The endpoints with the highest weighted miss cost are where caching improvements will deliver the most user-visible latency reduction.
```
# Essential cache monitoring queries (Prometheus/Grafana)

# Hit rate per endpoint
sum(rate(cache_hits_total{endpoint="/api/v1/products"}[5m])) /
  (sum(rate(cache_hits_total{endpoint="/api/v1/products"}[5m])) +
   sum(rate(cache_misses_total{endpoint="/api/v1/products"}[5m])))

# Weighted miss cost per endpoint
(1 - cache_hit_rate{endpoint="/api/v1/products"}) *
  cache_miss_latency_p50{endpoint="/api/v1/products"}

# Eviction rate (entries evicted per second)
rate(cache_evictions_total[5m])

# Cache size utilization
cache_entries_current / cache_entries_max
```
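These queries assume the application exports hit, miss, and latency metrics. A minimal instrumentation sketch using prometheus_client could look like the following; the counter names match the queries above, while the miss-latency percentile would come from a recording rule over the histogram.

```python
from prometheus_client import Counter, Histogram

CACHE_HITS = Counter("cache_hits_total", "Cache hits", ["endpoint"])
CACHE_MISSES = Counter("cache_misses_total", "Cache misses", ["endpoint"])
MISS_LATENCY = Histogram("cache_miss_latency_seconds",
                         "Origin fetch latency on cache miss", ["endpoint"])

def get_or_fetch(endpoint, key, fetch_fn):
    value = cache.get(key)
    if value is not None:
        CACHE_HITS.labels(endpoint=endpoint).inc()
        return value
    CACHE_MISSES.labels(endpoint=endpoint).inc()
    with MISS_LATENCY.labels(endpoint=endpoint).time():
        value = fetch_fn()
    cache.set(key, value)
    return value
```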
Common Monitoring Mistakes
Tracking only the aggregate hit rate hides per-endpoint problems. An 85% aggregate rate can mask a 15% hit rate on your most expensive endpoint. Track hit rate, miss latency, and weighted miss cost per endpoint. Set alerts on weighted miss cost, not hit rate -- a 90% hit rate on a 2ms endpoint is fine, but a 90% hit rate on a 500ms endpoint means 10% of requests take half a second, which is not fine regardless of how the percentage looks.
Putting It Together: A Complete API Caching Architecture
The architecture that addresses all seven concerns (what to cache, key design, TTL, invalidation, large responses, headers, monitoring) is a two-tier cache with event-driven invalidation. L1 is an in-process cache for reads. It uses frequency-based eviction (CacheeLFU), stores responses in their serialized form (avoiding re-serialization on cache hits), and handles TTL and stale-while-revalidate internally. L2 is a network cache (Redis or equivalent) for cross-instance consistency and cold-start warming. The application writes to L2 on origin fetch, and L1 falls through to L2 on L1 miss. CDC events invalidate both L1 and L2 when source data changes.
The read path is: check L1 (31 nanoseconds). On L1 miss, check L2 (300 microseconds). On L2 miss, fetch from origin (15+ milliseconds), populate L2 and L1, return. The write path is: write to the database, emit a CDC event, the event triggers invalidation of L1 and L2 entries tagged with the changed entity. The Cache-Control headers are set based on the endpoint's TTL configuration, allowing browsers and CDNs to cache at the edge as well.
This architecture gives you sub-microsecond average read latency (because 90%+ of reads hit L1), consistent cross-instance behavior (because L2 is shared), near-real-time invalidation (because CDC events propagate in milliseconds), and full HTTP caching compatibility (because Cache-Control headers are set correctly for every response). The monitoring layer tracks all four essential metrics per endpoint, and alerts on weighted miss cost to surface problems before they affect users.
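A stripped-down sketch of that read and invalidation path, assuming a plain dict standing in for the in-process L1, redis-py for L2, and responses stored pre-serialized as bytes (a real L1 would handle TTL, eviction, and stale-while-revalidate itself):

```python
import redis

l1 = {}                                     # stand-in for an in-process L1 cache
l2 = redis.Redis(host="localhost", port=6379)

def read_through(key, fetch_fn, ttl=60):
    value = l1.get(key)                     # in-process: nanoseconds
    if value is not None:
        return value
    value = l2.get(key)                     # network cache: hundreds of microseconds
    if value is None:
        value = fetch_fn()                  # origin: milliseconds or more
        l2.set(key, value, ex=ttl)
    l1[key] = value                         # populate L1 for subsequent reads
    return value

def invalidate(key):
    # Called by the CDC listener when the source data changes
    l1.pop(key, None)
    l2.delete(key)
```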
The Bottom Line
API response caching is not about adding a Redis call before your database query. It is a system design discipline that spans cache key correctness, TTL tuning, invalidation architecture, response size management, HTTP header configuration, and continuous monitoring. Get the keys wrong and you serve stale or incorrect data. Get the TTLs wrong and you either miss too often or serve data that is too old. Skip invalidation and you trade correctness for speed. Ignore response size and your cache becomes slower than your database. The complete approach -- in-process L1 at 31ns, network L2 for consistency, CDC for invalidation, Cache-Control for edge caching, and per-endpoint monitoring -- delivers sub-microsecond reads without sacrificing correctness.
API response caching at 31 nanoseconds. In-process L1 with CacheeLFU eviction.
brew install cachee