Cache Stampede: The Thundering Herd Fix That Actually Works
You have a cache key that serves 10,000 requests per second. You set its TTL to 60 seconds because that seemed reasonable. Every 60 seconds, that key expires. In the milliseconds between expiration and recomputation, hundreds of concurrent requests all discover the key is missing and all independently decide to recompute it. All of them hit your database simultaneously. Your database connection pool fills. Queries queue. Latency spikes from 5ms to 2,000ms. Downstream services time out. The cascade begins. Sixty seconds later, after the cache has been repopulated and your system has recovered, the TTL expires again and the entire cycle repeats.
This is a cache stampede, also called the thundering herd problem. It is the most predictable failure mode in cache infrastructure because it is caused by the most basic cache feature: time-to-live expiration. Every cache key with a TTL is a stampede waiting to happen. The only variables are the request rate and the recomputation cost. High rate and high cost produce outages. Low rate and low cost produce latency blips you never notice. But the mechanism is always the same: TTL expiration creates a window where the cache is empty and every concurrent request tries to fill it independently.
This post ranks five mitigation strategies from least effective to most effective, with code examples and benchmark data for each. Then it introduces the architecture that eliminates stampedes entirely -- not by mitigating the symptom, but by removing the cause.
The Anatomy of a Stampede
To understand why stampedes are so damaging, consider the exact sequence of events. A cache key holding the result of a 50ms database query serves 10,000 requests per second. The key has a 60-second TTL. At T=60.000, the key expires. At T=60.001, before any single request can recompute and repopulate the key, approximately 10 requests have already arrived and found the cache empty. Each of these 10 requests independently initiates the 50ms database query. At T=60.010, 100 requests have arrived. All 100 are now querying the database. At T=60.050, the first request's database query completes, and it writes the result back to the cache. But by now, 500 requests have already started their own database queries. Those 500 queries do not stop because the cache has been repopulated -- they are already in flight.
The result is that a single TTL expiration generates 500 identical database queries that all execute within a 50ms window. If the database query takes 50ms under normal load, it takes 500ms or more under 500x concurrent load. The connection pool saturates at its maximum (typically 20-50 connections), and the remaining 450+ queries queue behind it. The total time to service all stampede requests stretches to seconds. Meanwhile, new non-stampede requests for other keys are also queuing behind the saturated connection pool, creating collateral damage across the entire application.
The Math of Stampede Damage
At 10K requests/sec with a 60s TTL, each expiration produces a stampede window of approximately 50ms (the recomputation time). During that window, 10,000 * 0.050 = 500 concurrent requests hit the database. If your database handles 200 queries/sec for that query type, you have generated a 2.5-second backlog from a single key expiration. With 100 hot keys all using the same TTL, you can generate 50,000 concurrent queries every 60 seconds. This is not a theoretical problem. It is a scheduled outage.
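To make those numbers concrete, here is the same arithmetic as a small calculator. The inputs are the hypothetical figures used throughout this post (10K requests/sec, 50ms recomputation, 200 queries/sec of database headroom, 100 hot keys), not measurements from a specific system.

def stampede_backlog(req_per_sec=10_000, recompute_s=0.050,
                     db_qps=200, hot_keys=100):
    """Rough estimate of the damage from one TTL expiration cycle."""
    # Requests that arrive while the first recomputation is still in flight
    concurrent_queries = req_per_sec * recompute_s   # 10,000 * 0.050 = 500
    # Time the database needs to drain those duplicate queries
    backlog_seconds = concurrent_queries / db_qps    # 500 / 200 = 2.5s
    # If every hot key expires on the same schedule
    total_queries = concurrent_queries * hot_keys    # 500 * 100 = 50,000
    return concurrent_queries, backlog_seconds, total_queries

print(stampede_backlog())  # (500.0, 2.5, 50000.0)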
Strategy 1: Probabilistic Early Expiration
Instead of all instances seeing the key expire at exactly the same moment, probabilistic early expiration causes each instance to independently decide to recompute the value before the TTL hits. The probability of recomputation increases as the TTL approaches zero. This spreads the recomputation window across time rather than concentrating it at a single moment.
The standard implementation uses the XFetch algorithm: a request recomputes the value if current_time - (compute_time * beta * log(random())) >= expiry_time, where compute_time is the measured cost of recomputing the value and beta controls the aggressiveness of early recomputation. Because log(random()) is negative, the subtracted term pushes the effective clock forward by a random amount proportional to the recomputation cost. A beta of 1.0 means recomputation starts happening approximately one computation-time before expiry.
import random
import math
import time

def should_recompute_early(expiry_time, compute_time_ms, beta=1.0):
    """XFetch algorithm: probabilistic early recomputation.
    Returns True if this request should recompute the cached value."""
    ttl_remaining = expiry_time - time.time()
    if ttl_remaining <= 0:
        return True  # already expired
    # Probability increases as TTL approaches 0
    # compute_time_ms converted to seconds for time comparison
    delta = (compute_time_ms / 1000.0) * beta * math.log(random.random())
    return time.time() - delta >= expiry_time

def get_with_early_recompute(cache, db, key, compute_fn, ttl=60):
    value, expiry = cache.get_with_expiry(key)
    if value is not None:
        compute_time = cache.get_meta(key, "compute_time_ms") or 50
        if not should_recompute_early(expiry, compute_time):
            return value  # serve cached, no recomputation needed
        # Fall through to recompute even though value exists
    # Recompute and cache with updated compute time
    start = time.time()
    new_value = compute_fn()
    elapsed_ms = (time.time() - start) * 1000
    cache.set(key, new_value, ttl=ttl)
    cache.set_meta(key, "compute_time_ms", elapsed_ms)
    return new_value
Effectiveness: 60-70% stampede reduction. Probabilistic early recomputation spreads the recomputation window, but it does not eliminate stampedes. Multiple instances can still independently decide to recompute at the same time, especially under high concurrency. The beta parameter requires tuning per workload. Too aggressive and you waste compute recomputing values that are still fresh. Too conservative and stampedes still occur. It is a probabilistic mitigation, not a deterministic fix.
Strategy 2: Mutex Lock (Single Recompute)
The mutex approach ensures that when a cache miss occurs, only one request recomputes the value while all other requests wait for the recomputation to complete. This reduces database load from N concurrent queries to exactly 1 query per expiration event.
import redis
import time

def get_with_mutex(cache, db, key, compute_fn, ttl=60, lock_ttl=10):
    value = cache.get(key)
    if value is not None:
        return value
    # Try to acquire recomputation lock
    lock_key = f"lock:{key}"
    acquired = cache.set(lock_key, "1", nx=True, ex=lock_ttl)
    if acquired:
        # This request won the lock -- recompute
        try:
            new_value = compute_fn()
            cache.set(key, new_value, ex=ttl)
            return new_value
        finally:
            cache.delete(lock_key)
    else:
        # Another request is recomputing -- wait and retry
        for _ in range(100):  # max 100 retries, 50ms each = 5s max wait
            time.sleep(0.05)
            value = cache.get(key)
            if value is not None:
                return value
        # Timeout: recompute anyway (lock holder may have failed)
        return compute_fn()
Effectiveness: 95% stampede reduction. The mutex guarantees exactly one recomputation per expiration. However, it introduces waiting latency for all non-winner requests. If the recomputation takes 200ms, 2,000 requests at 10K/sec are blocked waiting. They are not hitting the database, which is good. But they are consuming application threads and connection resources while they wait, which can cause its own capacity problems. The lock also requires a distributed lock mechanism (typically Redis SET NX), which adds a dependency and failure mode. If the lock holder crashes before releasing the lock, all waiting requests hang until the lock TTL expires.
Strategy 3: Stale-While-Revalidate
The stale-while-revalidate pattern serves the stale cached value to requests while recomputing the fresh value in the background. This eliminates both the stampede and the waiting latency. Every request gets an immediate response (the stale value), and the cache is refreshed asynchronously.
import threading

class StaleWhileRevalidateCache:
    def __init__(self, cache, compute_fn, ttl=60, stale_ttl=300):
        self.cache = cache
        self.compute_fn = compute_fn
        self.ttl = ttl
        self.stale_ttl = stale_ttl
        self._revalidating = set()
        self._lock = threading.Lock()

    def get(self, key):
        value, expiry, is_stale = self.cache.get_with_staleness(key)
        if value is None:
            # True miss: no stale value available, must compute synchronously
            new_value = self.compute_fn(key)
            self.cache.set(key, new_value, ttl=self.ttl, stale_ttl=self.stale_ttl)
            return new_value
        if is_stale:
            # Value expired but stale copy exists: serve stale, refresh async
            self._trigger_revalidation(key)
        return value  # fresh or stale, returned immediately

    def _trigger_revalidation(self, key):
        with self._lock:
            if key in self._revalidating:
                return  # already revalidating
            self._revalidating.add(key)

        def revalidate():
            try:
                new_value = self.compute_fn(key)
                self.cache.set(key, new_value,
                               ttl=self.ttl, stale_ttl=self.stale_ttl)
            finally:
                with self._lock:
                    self._revalidating.discard(key)

        threading.Thread(target=revalidate, daemon=True).start()
Effectiveness: 98% stampede reduction, best user experience. Every request gets an immediate response. The recomputation happens in the background. There is no waiting, no mutex contention, no thread blocking. The trade-off is consistency: for the duration of the background recomputation (50-200ms typically), requests receive stale data. For most cache workloads -- session data, API responses, computed aggregations -- data that is 50ms stale is indistinguishable from fresh. For workloads where staleness matters (financial calculations, inventory counts), this pattern requires careful evaluation of the staleness window.
The second trade-off is the cold start problem. When there is no stale value to serve (first request for a new key, or after the stale TTL also expires), the request must compute synchronously. The stale-while-revalidate pattern degrades to a normal cache miss for cold keys. This means it prevents stampedes on hot keys but does not help with cold start stampedes.
Strategy 4: Background Refresh
Background refresh eliminates TTL-based expiration entirely. Instead of setting a TTL and waiting for expiration, a background process continuously refreshes cached values on a schedule. The cache entry never expires, so there is never a miss, so there is never a stampede.
import asyncio
import logging
import time

log = logging.getLogger(__name__)

class BackgroundRefreshCache:
    def __init__(self, cache, refresh_interval=30):
        self.cache = cache
        self.refresh_interval = refresh_interval
        self.registered_keys = {}  # key -> (compute_fn, last_refresh)
        self._running = False

    def register(self, key, compute_fn):
        """Register a key for background refresh. Computes immediately."""
        value = compute_fn()
        self.cache.set(key, value)  # no TTL -- never expires
        self.registered_keys[key] = (compute_fn, time.time())

    async def start(self):
        """Start the background refresh loop."""
        self._running = True
        while self._running:
            await asyncio.sleep(self.refresh_interval)
            for key, (compute_fn, last_refresh) in self.registered_keys.items():
                try:
                    new_value = compute_fn()
                    self.cache.set(key, new_value)  # still no TTL
                    self.registered_keys[key] = (compute_fn, time.time())
                except Exception as e:
                    # Refresh failed: stale value remains in cache
                    # Log error but do not evict -- stale > missing
                    log.warning(f"Refresh failed for {key}: {e}")

    def get(self, key):
        """Read path: always hits cache, never misses."""
        return self.cache.get(key)  # always returns a value
Effectiveness: 100% stampede prevention for registered keys. If the cache entry never expires, no request ever sees a miss, and no stampede can occur. The background process serializes all recomputations, so the database sees a predictable, steady load rather than periodic spikes. The drawbacks are operational: you must explicitly register every key that needs background refresh, you must manage the background process lifecycle, and the memory cost is higher because entries never expire. For a cache with 10 million keys, only a small fraction are hot enough to warrant background refresh, but identifying which keys qualify requires monitoring and analysis.
The deeper problem with background refresh is that it is solving the wrong problem. It prevents stampedes by ensuring the cache always has a value. But it does not address the fundamental question: why should a cached value expire at all? If the result of a computation has not changed, why would you recompute it? The TTL is an arbitrary timer that says "this value might be wrong after N seconds." It does not know whether the value is actually wrong. It recomputes regardless, wasting database capacity on recomputations that produce identical results.
Strategy 5: Cachee Computation Fingerprinting
Computation fingerprinting eliminates stampedes by removing their root cause: TTL-based expiration. Instead of expiring cache entries based on an arbitrary time-to-live, Cachee binds each cache entry to a fingerprint of the computation that produced it. The fingerprint is SHA3-256(input || computation_type || parameters || version || hardware_class). A cache entry remains valid as long as its fingerprint matches the current computation. If any input changes, the fingerprint changes, and the old entry is never returned because no request will produce the old fingerprint. Invalidation is not time-based. It is truth-based.
use cachee::{CacheeClient, ComputationFingerprint};

// Define the computation fingerprint
let fingerprint = ComputationFingerprint::builder()
    .input(&user_query)               // the actual input data
    .computation_type("fraud_score")  // what computation produces this
    .parameters(&model_params)        // model version, thresholds, etc.
    .version("v2.3.1")                // application version
    .hardware_class("graviton4")      // hardware that ran the computation
    .build();

// Cache lookup by fingerprint -- NOT by arbitrary key + TTL
let result = cachee.get_by_fingerprint(&fingerprint).await?;

match result {
    Some(cached) => {
        // Cache hit: the EXACT computation with the EXACT inputs
        // was previously computed. The result is cryptographically
        // attested as authentic. No TTL involved.
        return cached.value;
    }
    None => {
        // Cache miss: this specific combination of inputs +
        // computation + parameters + version has never been computed,
        // or the inputs have changed since the last computation.
        let fresh_value = compute_fraud_score(&user_query).await?;
        // Store with fingerprint -- no TTL needed
        cachee.set_with_fingerprint(&fingerprint, &fresh_value).await?;
        return fresh_value;
    }
}
Effectiveness: stampedes are architecturally impossible. There is no TTL. There is no expiration event. There is no moment where a cache entry disappears and hundreds of requests simultaneously discover its absence. A cached value is either valid (its fingerprint matches) or irrelevant (the inputs have changed, producing a new fingerprint that has no existing cache entry). When inputs change, the first request with the new fingerprint computes and caches the result. Subsequent requests with the same new fingerprint hit the cache. There is no window where multiple requests race to recompute the same value because the fingerprint is deterministic -- every request with the same inputs produces the same fingerprint, and the first writer wins.
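To illustrate why that determinism holds, here is a minimal sketch of the fingerprinting idea in Python. This is not Cachee's implementation or wire format -- just a SHA3-256 digest over a canonical encoding of the fields listed above, enough to show that identical inputs always map to the same entry and that any change (a deploy bumping the version, a retrained model changing parameters) maps to a new one.

import hashlib
import json

def fingerprint(input_data, computation_type, parameters, version, hardware_class):
    """Illustrative only: hash every field that could change the result."""
    canonical = json.dumps({
        "input": input_data,
        "computation_type": computation_type,
        "parameters": parameters,
        "version": version,
        "hardware_class": hardware_class,
    }, sort_keys=True, separators=(",", ":"))
    return hashlib.sha3_256(canonical.encode("utf-8")).hexdigest()

# Same inputs -> same fingerprint, on every instance, with no coordination
a = fingerprint({"user": 42}, "fraud_score", {"threshold": 0.8}, "v2.3.1", "graviton4")
b = fingerprint({"user": 42}, "fraud_score", {"threshold": 0.8}, "v2.3.1", "graviton4")
assert a == b
# Bump the version (e.g. a deploy) and the old entry simply becomes unreachable
c = fingerprint({"user": 42}, "fraud_score", {"threshold": 0.8}, "v2.4.0", "graviton4")
assert c != a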
Explicit invalidation is handled through Cachee's state machine. If you need to force a recomputation -- because a model was retrained, a configuration changed, or an administrator decided the cached value is wrong -- you Supersede or Revoke the entry. Superseding marks the old entry as replaced and records the transition. Revoking marks the entry as invalid with a reason. Both are explicit, auditable actions performed by an authorized key holder, not an arbitrary timer expiration. The difference is between "this value expired because 60 seconds passed" and "this value was superseded because the fraud model was updated to v2.4.0, authorized by the ML team's deployment key at 2026-05-10T14:23:00Z." One tells you nothing. The other tells you everything an auditor or incident responder needs to know.
Benchmark: Stampede Impact by Strategy
We measured the database impact of a single hot key (10K requests/sec, 50ms recomputation time) across all five strategies, plus the no-mitigation baseline. The test ran for 300 seconds, producing 5 TTL expiration events for the TTL-based strategies.
| Strategy | DB Queries per Stampede | P99 Latency During Stampede | Total Excess DB Queries (300s) |
|---|---|---|---|
| No mitigation | ~500 | 2,100ms | 2,500 |
| Probabilistic early | ~150 | 850ms | 750 |
| Mutex lock | 1 | 250ms (wait time) | 5 |
| Stale-while-revalidate | 1 | 0.5ms (stale serve) | 5 |
| Background refresh | 0 | 0.12ms (always cached) | 0 (scheduled only) |
| Cachee fingerprinting | 0 | 0.000031ms (31ns L1) | 0 |
The unmitigated baseline produces 500 excess database queries every 60 seconds. That is 500 identical queries that all return the same result -- pure waste. Probabilistic early recomputation reduces this to 150, which is still 149 wasted queries. The mutex eliminates excess queries but forces 499 requests to wait 50-250ms each. Stale-while-revalidate eliminates both the excess queries and the wait time, but serves stale data for 50ms per cycle. Background refresh eliminates stampedes entirely for registered keys at the cost of operational complexity and memory.
Cachee fingerprinting eliminates stampedes, excess queries, wait time, and staleness. There is no TTL to expire, so there is no stampede event. There is no stale window because the cached value is valid until its inputs actually change. When inputs change, exactly one computation occurs, and L1 serves subsequent reads at 31ns. The database sees exactly the load it should see: one query per unique computation, zero duplicates.
Real-World Stampede Scenarios
Stampedes are not limited to single-key expirations. Several real-world scenarios produce stampede-like behavior at larger scale.
Mass Expiration After Deploy
A common deployment pattern clears the cache on deploy to ensure fresh data. If your application has 100,000 cached entries and you flush them all at deploy time, the first request to each key triggers a recomputation. If 10,000 of those keys are hot (receiving traffic at that moment), you get 10,000 simultaneous cache misses across 10,000 different keys. This is not a stampede on one key. It is 10,000 simultaneous stampedes. The database sees 10,000 queries arrive in the same second, each from a different key, each hammering different tables and indexes. Connection pools saturate. Query planners struggle. The deploy becomes an outage.
Cachee fingerprinting avoids this entirely. Deploying a new application version changes the version field in the computation fingerprint. Old cache entries are not flushed. They simply become unreachable because no request with the new version will produce the old fingerprint. New entries are populated lazily as requests arrive, naturally spreading the recomputation load across time. There is no flush event, no mass expiration, and no stampede.
Correlated TTL Expiration
If you set all cache entries with the same TTL (60 seconds) and they were all written at approximately the same time (application startup, cache warm-up, or post-deploy repopulation), they all expire at approximately the same time. This creates a periodic "cache cliff" where a large fraction of your cache becomes empty simultaneously. The pattern repeats every TTL interval: startup at T=0, all keys cached by T=5, all keys expire at T=65, mass stampede at T=65, all keys re-cached by T=70, all expire again at T=130.
The standard mitigation is jittered TTLs: instead of TTL=60, use TTL=60 + random(0, 30). This spreads expirations across a 30-second window instead of concentrating them at one moment. But jitter is a partial fix. It reduces the peak of the stampede but extends its duration. Instead of 10,000 queries in 1 second, you get 10,000 queries spread across 30 seconds -- still 333 queries/sec of excess load, sustained for half a minute.
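For completeness, jitter is a one-line change. A minimal sketch, assuming a Redis-style cache.set with an ex argument as in the mutex example above:

import random

def set_with_jitter(cache, key, value, base_ttl=60, jitter=30):
    # Spread expirations across [base_ttl, base_ttl + jitter) seconds
    # so keys written together do not all expire together.
    ttl = base_ttl + random.randint(0, jitter - 1)
    cache.set(key, value, ex=ttl)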
Hot Key Rotation
Some workloads have keys that become hot unpredictably. A product goes viral on social media. A news article drives traffic to a specific API endpoint. A marketing campaign launches and drives all traffic to one feature. The key was cold, so it has no cached value or its cached value expired minutes ago. Suddenly 50,000 requests per second arrive for a key with no cache entry. This is a cold-start stampede, and none of the stale-while-revalidate or background refresh strategies help because there is no stale value to serve and no background process watching a key that was cold five minutes ago.
Cachee handles this through the L1 tier. The first request computes and caches the result. Within nanoseconds, the L1 tier makes that result available to all subsequent requests in the same process at 31ns latency. Across processes, the L1 invalidation channel propagates the new value within 0.5ms. The stampede window -- the time between the first miss and the cache being populated -- is the time of a single computation plus the L1 propagation time. At 10K requests/sec, a 50ms computation window produces 500 competing requests on a traditional cache. With Cachee L1, it produces 1 computation and 499 L1 hits, because L1 population is effectively instantaneous relative to the request arrival rate.
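You can approximate the collapsing behavior described here inside a single process with a single-flight guard: the first caller for a key computes, and every concurrent caller for the same key waits on that one computation instead of starting its own. The sketch below is illustrative only -- it is not Cachee's L1 implementation, which adds the fingerprint model, cross-process invalidation, and attestation on top.

import threading

class SingleFlight:
    """Collapse concurrent computations of the same key into one call."""
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> in-flight call state

    def do(self, key, compute_fn):
        with self._lock:
            call = self._calls.get(key)
            if call is None:
                # First caller for this key becomes the leader
                call = {"event": threading.Event(), "value": None, "error": None}
                self._calls[key] = call
                leader = True
            else:
                leader = False
        if leader:
            try:
                call["value"] = compute_fn()
            except Exception as e:
                call["error"] = e
            finally:
                with self._lock:
                    self._calls.pop(key, None)
                call["event"].set()
        else:
            call["event"].wait()  # followers wait for the leader's result
        if call["error"]:
            raise call["error"]
        return call["value"]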
Stampedes Are a Design Choice
Every cache that uses TTL-based expiration has chosen to create stampedes. The TTL says: "after N seconds, delete this value regardless of whether it is still correct." Deletion creates misses. Misses create recomputations. Concurrent recomputations create stampedes. The fix is not better mitigation of the stampede. The fix is removing the mechanism that causes it. Computation fingerprinting binds cache validity to computation truth, not to arbitrary timers. No expiration means no stampede. No TTL means no thundering herd.
Choosing Your Strategy
The right strategy depends on your constraints. If you cannot change your cache infrastructure, implement strategies 2 and 3 together: mutex lock to prevent duplicate recomputations, with stale-while-revalidate to eliminate wait latency for end users. This combination handles 99% of stampede events with minimal code changes.
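Here is a sketch of that combination, reusing the hypothetical cache helpers from the earlier examples (get_with_staleness for stale reads, SET NX for the lock). Treat it as a starting point, not production code.

import threading
import time

def get_combined(cache, key, compute_fn, ttl=60, stale_ttl=300, lock_ttl=10):
    value, _expiry, is_stale = cache.get_with_staleness(key)
    lock_key = f"lock:{key}"

    if value is not None and not is_stale:
        return value  # fresh hit: nothing to do

    if value is not None and is_stale:
        # Stale hit: serve immediately, let at most one caller refresh in background
        if cache.set(lock_key, "1", nx=True, ex=lock_ttl):
            def refresh():
                try:
                    cache.set(key, compute_fn(), ttl=ttl, stale_ttl=stale_ttl)
                finally:
                    cache.delete(lock_key)
            threading.Thread(target=refresh, daemon=True).start()
        return value

    # True miss (no stale copy): fall back to the mutex pattern
    if cache.set(lock_key, "1", nx=True, ex=lock_ttl):
        try:
            new_value = compute_fn()
            cache.set(key, new_value, ttl=ttl, stale_ttl=stale_ttl)
            return new_value
        finally:
            cache.delete(lock_key)
    for _ in range(100):  # wait up to 5s for the lock holder to populate the key
        time.sleep(0.05)
        value, _, _ = cache.get_with_staleness(key)
        if value is not None:
            return value
    return compute_fn()  # lock holder likely failed; compute as a last resort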
If you can add a cache tier, deploying Cachee L1 in front of your existing cache eliminates stampedes at the read layer. L1 is an in-process cache that populates from your existing cache backend on miss. Because L1 is in-process, there is no network round trip and no connection pooling bottleneck. The stampede window shrinks from "time to recompute from database" to "time to fetch from existing cache backend," typically 0.1-0.5ms instead of 50-200ms. The number of concurrent requests in that window drops proportionally.
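To see why pulling the miss into the process shrinks the window, consider a deliberately simplified two-tier read path (no eviction, no cross-process invalidation, and not Cachee's actual API):

class TieredReadThrough:
    """Illustrative two-tier read path: in-process dict in front of a shared cache."""
    def __init__(self, backend):
        self.l1 = {}             # in-process: no network hop, no connection pool
        self.backend = backend   # existing cache client (e.g. Redis, memcached)

    def get(self, key, compute_fn):
        value = self.l1.get(key)
        if value is not None:
            return value
        value = self.backend.get(key)  # miss window is a sub-millisecond backend fetch
        if value is None:
            value = compute_fn()       # only a cold shared cache reaches the database
            self.backend.set(key, value)
        self.l1[key] = value
        return value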
If you are building new infrastructure or re-architecting your cache layer, computation fingerprinting eliminates the root cause. No TTL, no expiration, no stampede. Invalidation is explicit and auditable. Every cache entry is valid until its inputs change or an authorized key holder supersedes or revokes it. The database sees exactly one query per unique computation. Zero waste. Zero stampedes. Zero thundering herds.
The cache stampede is a solved problem. It has been solved for years by multiple strategies of varying effectiveness. The reason it persists is that teams choose the simplest cache pattern -- key + value + TTL -- without considering the failure mode that TTL creates. Every TTL is a timer counting down to a stampede. The question is not whether you will experience one. The question is whether your mitigation strategy keeps it from becoming an outage. Or better: whether you have chosen an architecture where the timer does not exist at all.
The Bottom Line
Cache stampedes happen every time a hot key's TTL expires and hundreds of concurrent requests hit your database simultaneously. Mutex locks prevent duplicate recomputations but add wait latency. Stale-while-revalidate eliminates wait time but serves briefly stale data. Background refresh prevents stampedes for registered keys but adds operational complexity. Cachee computation fingerprinting eliminates stampedes entirely by removing TTL-based expiration. Cache entries are valid until their inputs change, not until an arbitrary timer fires. No TTL means no expiration. No expiration means no stampede. The thundering herd cannot form if there is nothing to stampede toward.
Your cache stampedes every 60 seconds because TTLs are arbitrary timers, not validity checks. Cachee uses computation fingerprinting -- no TTL, no stampede.