When your cache is full and a new item needs space, something has to go. For the last 30 years, the answer has been LRU — Least Recently Used. Evict whatever was accessed longest ago. It is simple, fast, and wrong about 15-20% of the time. That 15-20% error rate is the gap between an 85% hit rate and a 99% hit rate. Cachee closes that gap with machine learning.
Eviction is the most consequential decision a cache makes. Every eviction is a bet: "this item will not be needed before it can be re-fetched." When the bet is right, memory is freed for something more useful. When the bet is wrong, the evicted item is requested again, triggering a cache miss, a database round-trip, and latency that the end user feels. The quality of the eviction policy directly determines the cache hit rate, which determines the database load, which determines the infrastructure cost and application performance. Getting eviction right is not an academic exercise. It is the single highest-leverage optimization in any caching system.
Why LRU Fails
LRU considers exactly one signal when making an eviction decision: when was this item last accessed? The item that was accessed longest ago is evicted first. This heuristic is based on a reasonable assumption — recently accessed items are more likely to be accessed again soon than items that have not been accessed in a while. For many workloads, this assumption holds well enough to achieve 80-90% hit rates.
But LRU ignores every other signal that matters. It does not consider how frequently an item is accessed. A key that is read 10,000 times per hour will be evicted if it goes 2 seconds without access, even though it will almost certainly be accessed again within the next millisecond. It does not consider how expensive a cache miss is. A database query result that took 50 milliseconds to compute is treated identically to a key-value pair that can be re-fetched in 1 millisecond. Evicting the expensive item costs 50x more in latency than evicting the cheap one, but LRU cannot distinguish between them.
LRU does not consider the size of objects. Evicting a single 10MB object frees enough memory for 10,000 1KB objects. Depending on the access pattern, keeping the 10,000 small objects might generate far more cache hits than keeping the one large object. But LRU evicts by recency regardless of size. It does not consider whether an item is likely to be accessed in the near future. A product page for a flash sale that starts in 5 minutes will not be accessed right now, but it will be hammered in 5 minutes. LRU might evict it in the quiet period before the sale, causing a miss storm at the worst possible moment.
Each of these blind spots individually causes a few percentage points of unnecessary cache misses. Together, they explain the 15-20% gap between LRU's typical performance and the theoretical optimal eviction policy.
LFU Isn't the Answer Either
The natural response to LRU's frequency blindness is LFU — Least Frequently Used. Instead of evicting the least recently accessed item, evict the least frequently accessed item. This solves the frequency problem but introduces a different one: LFU has no concept of temporal decay.
Consider a product page that went viral last week. Over 7 days, it accumulated an access count of 500,000. This week, traffic has returned to normal — the page gets 10 accesses per hour. Under LFU, the page's cumulative frequency score of 500,000 means it will never be evicted. Meanwhile, today's trending product with a frequency count of 5,000 (growing rapidly) gets evicted to make room for yesterday's viral content. LFU is looking in the rearview mirror.
Windowed LFU (sometimes called W-LFU) partially addresses this by counting frequency within a time window rather than cumulatively. But it still considers only two signals — recency and frequency — and the window size is a static parameter that must be tuned per workload. Set the window too short and it behaves like LRU. Set it too long and it behaves like cumulative LFU. There is no single window size that is optimal for all keys at all times.
Cachee's Adaptive Eviction Model
Cachee replaces the single-signal heuristic with a multi-signal ML model that computes an eviction score for every item in the cache. When space is needed, the item with the lowest eviction score is evicted. The score is computed from five weighted signals.
1. Recency
When was the item last accessed? This is the LRU signal, and it remains valuable. Items accessed very recently are more likely to be accessed again soon. But in Cachee's model, recency is one input among five, not the sole determinant. The recency signal decays exponentially over time, with a decay rate that adapts per key based on the key's historical inter-access time distribution.
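As a sketch only — the function name and the fixed decay rate are invented here, and the text describes the rate as adapting per key from the inter-access time distribution — an exponentially decaying recency signal might look like:

```python
import math
import time

def recency_score(last_access_ts, now=None, decay_rate=0.1):
    """Exponentially decayed recency signal in [0, 1].

    decay_rate is a hypothetical per-key parameter; one plausible
    adaptive choice is 1 / mean_inter_access_seconds for the key.
    """
    now = now if now is not None else time.time()
    age = max(0.0, now - last_access_ts)
    # 1.0 for an item touched just now, smoothly approaching 0 with age.
    return math.exp(-decay_rate * age)
```

A key with a short typical inter-access gap would get a large decay rate (its recency evaporates quickly), while a key accessed on a slow, steady cadence keeps a meaningful recency score for much longer.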
2. Frequency (Windowed)
How often is the item accessed within recent time windows? Cachee tracks frequency across multiple window sizes simultaneously — 1 second, 10 seconds, 1 minute, 10 minutes, 1 hour. This multi-resolution view captures both burst patterns (high frequency in the 1-second window) and sustained patterns (consistent frequency in the 1-hour window). A key that is accessed once per minute has a different profile than a key that is accessed 60 times in one second and then goes silent for an hour, even though both have the same frequency over a 1-hour window.
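To make the multi-resolution view concrete, here is an exact trailing-window counter for a single key. This is purely illustrative — a production implementation would use approximate counters to bound memory, not a timestamp deque:

```python
import time
from collections import deque

# The five window sizes named above, in seconds.
WINDOWS = (1, 10, 60, 600, 3600)

class MultiWindowFrequency:
    """Exact trailing-window access counts for one key (illustrative)."""

    def __init__(self):
        self.accesses = deque()  # timestamps of recent accesses

    def record(self, ts=None):
        self.accesses.append(ts if ts is not None else time.time())
        # Drop anything older than the largest window; nothing smaller needs it.
        cutoff = self.accesses[-1] - max(WINDOWS)
        while self.accesses and self.accesses[0] < cutoff:
            self.accesses.popleft()

    def counts(self, now=None):
        now = now if now is not None else time.time()
        return {w: sum(1 for t in self.accesses if t > now - w) for w in WINDOWS}
```

The burst-then-silence key from the example would show 60 in the 1-hour window but 0 in the 1-second window an hour later, while the steady once-per-minute key would show roughly 60 in the 1-hour window and 1 in the 1-minute window — the same hourly total, clearly different profiles.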
3. Predicted Future Access
Will this item be accessed in the next time window? This is where the neural prediction engine integrates with the eviction policy. The same model that powers predictive cache warming also informs eviction decisions. If the model predicts that a key will be accessed within the next prediction window, its eviction score increases dramatically, protecting it from eviction regardless of its recency or frequency scores. This is how Cachee avoids evicting the flash-sale product page 5 minutes before the sale starts.
4. Cost-to-Refetch
How expensive is a cache miss for this key? Cachee measures the actual latency of the origin fetch the last time the key was populated — not an estimate, not a configuration value, but the observed time in microseconds. A key backed by a 50-millisecond database aggregation query receives a higher eviction score (harder to evict) than a key backed by a 1-millisecond primary key lookup. When the cache is full, evicting the cheap key saves the same amount of memory while incurring 50x less miss penalty.
5. Object Size
How much memory does this item consume? Evicting one 10MB object frees space for potentially thousands of smaller, hotter objects. The model factors in the opportunity cost: if evicting this large object would allow N smaller objects to be cached, and those smaller objects have higher aggregate access frequency, the eviction score for the large object decreases. This prevents a few large objects from monopolizing cache memory that would be better allocated to many small, frequently-accessed objects.
The model learns per-workload weights for these five signals. A trading system might heavily weight predicted future access, because pre-market data is about to become extremely hot. An e-commerce site might weight frequency more heavily, because popular products should stay in cache even if they have not been accessed in the last few seconds. A content delivery workload might weight object size, because evicting a few large video segments can free memory for thousands of small API response caches. The weights adapt automatically based on which signals most accurately predict future access patterns in each workload.
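Cachee's actual model internals are not spelled out here, so the following is a purely illustrative sketch: the signal names, the normalization to comparable scales, the linear combination, and the default weights are all assumptions. It only shows how five weighted signals could fold into a single keep-vs-evict score, with the minimum-score item chosen as the victim:

```python
# Hypothetical weights; the text says these adapt per workload.
DEFAULT_WEIGHTS = {
    "recency": 1.0,       # decayed recency in [0, 1]
    "frequency": 1.0,     # normalized windowed frequency
    "predicted": 2.0,     # P(access within next prediction window)
    "refetch_cost": 1.0,  # normalized observed miss latency
    "size": 1.0,          # normalized object size (a penalty)
}

def eviction_score(signals, weights=DEFAULT_WEIGHTS):
    """Higher score = more worth keeping. Size subtracts: large objects
    carry an opportunity cost against many smaller, hotter objects."""
    return (
        weights["recency"] * signals["recency"]
        + weights["frequency"] * signals["frequency"]
        + weights["predicted"] * signals["predicted"]
        + weights["refetch_cost"] * signals["refetch_cost"]
        - weights["size"] * signals["size"]
    )

def pick_victim(cache_signals, weights=DEFAULT_WEIGHTS):
    """Evict the item with the lowest eviction score."""
    return min(cache_signals, key=lambda k: eviction_score(cache_signals[k], weights))
```

Under this sketch, a key with a strong predicted-access signal (the flash-sale page) keeps a high score even when its recency and frequency are momentarily low, which is exactly the protection described above.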
The TinyLFU Foundation
Cachee's eviction policy builds on the TinyLFU research by Einziger, Friedman, and Manes, which demonstrated that a frequency sketch can approximate access frequency with minimal memory overhead. The core insight of TinyLFU is that you do not need exact frequency counts for every key. A probabilistic data structure — specifically a Count-Min Sketch — can track approximate frequency for millions of keys in a few kilobytes of memory.
The Count-Min Sketch uses multiple hash functions to map each key to one counter in each of several rows. On access, every counter the key maps to is incremented. To query frequency, the minimum of those counters is returned: hash collisions can only inflate counters, so the minimum is the tightest available upper bound on the true count. The structure is lossy — it can overcount but never undercount — and the approximation error is bounded and configurable by adjusting the sketch's width and depth.
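A minimal Count-Min Sketch makes the mechanics concrete. The dimensions and hash construction here are illustrative defaults, not Cachee's actual parameters:

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: `depth` hash rows of `width` counters.

    Tracks approximate frequency for arbitrarily many keys in
    depth * width counters, with one-sided error (never undercounts).
    """

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        # One independent-ish hash per row, via a per-row salt.
        for i in range(self.depth):
            h = hashlib.blake2b(key.encode(), digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(h, "big") % self.width

    def increment(self, key):
        for i, idx in enumerate(self._indexes(key)):
            self.rows[i][idx] += 1

    def estimate(self, key):
        # Minimum over rows: the tightest upper bound on the true count.
        return min(self.rows[i][idx] for i, idx in enumerate(self._indexes(key)))
```

Widening the sketch reduces collision-driven overcounting; deepening it makes the minimum more likely to dodge a collided counter. That is the "bounded and configurable" error trade-off.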
Cachee extends the TinyLFU foundation in three ways. First, it replaces the single frequency sketch with a hierarchical sketch that tracks frequency at multiple time resolutions simultaneously, providing the multi-window frequency signal described above. Second, it integrates the neural prediction signal, which TinyLFU's original design did not include. Third, it adds cost-weighted eviction, which considers the refetch cost and object size rather than treating all items as equal. The result is all the memory efficiency of TinyLFU — frequency tracking for millions of keys in kilobytes of overhead — with the accuracy of a trained multi-signal model.
Scan Resistance
One of LRU's worst failure modes is a sequential scan. A batch job reads every row in a 10-million-row table. Each row touches the cache exactly once. Under LRU, each row enters the cache as the most recently accessed item, pushing the previous most-recently-accessed item one position closer to eviction. By the time the scan completes, every hot item in the cache has been evicted and replaced by scan data that will never be accessed again. The cache is fully polluted. Hit rate drops to near zero. Recovery takes minutes as the cache slowly refills with actually-hot data through normal request traffic.
This is not a theoretical concern. Batch analytics jobs, database vacuum operations, backup processes, ETL pipelines, and reporting queries all exhibit sequential scan patterns. In a production environment, a nightly analytics job can destroy the cache's effectiveness for the morning traffic surge — precisely when high hit rates matter most.
Cachee's model recognizes scan patterns through two mechanisms. First, the multi-window frequency signal detects that scan items are accessed exactly once in every window — a characteristic that distinguishes scan traffic from legitimately hot data. Second, the prediction model learns that sequentially-accessed keys with no repeat access are unlikely to be accessed again. Scan items enter a probationary window rather than the main cache. They are only promoted to the main cache if they are accessed a second time within the probationary period. Since scan items are by definition accessed only once, they never promote, and the main cache remains uncontaminated.
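The probationary mechanism can be sketched as a second-access admission filter. Everything below — the class name, the fixed 60-second window, the probation capacity — is illustrative; the text describes this behavior as learned and automatic rather than exposed as configuration:

```python
import time

class ProbationaryAdmission:
    """Scan-resistant admission sketch: a first-time key waits in a
    small probation map and is promoted to the main cache only if it
    is accessed again within `window` seconds."""

    def __init__(self, window=60.0, probation_capacity=10_000):
        self.window = window
        self.capacity = probation_capacity
        self.probation = {}  # key -> first-seen timestamp
        self.main = set()    # stand-in for the real cache contents

    def access(self, key, now=None):
        now = now if now is not None else time.time()
        if key in self.main:
            return "hit"
        first_seen = self.probation.get(key)
        if first_seen is not None and now - first_seen <= self.window:
            # Second access within the window: promote to the main cache.
            del self.probation[key]
            self.main.add(key)
            return "promoted"
        if len(self.probation) >= self.capacity:
            # Crude insertion-order eviction keeps probation bounded.
            self.probation.pop(next(iter(self.probation)))
        self.probation[key] = now  # first (or stale) access: probation only
        return "probation"
```

A 10-million-row scan touches each key exactly once, so no scan key ever takes the "promoted" branch: the scan churns only the small probation map while the main cache stays intact.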
This scan resistance is automatic. There is no configuration parameter to enable it, no scan detection threshold to tune, and no need to tag batch jobs with special cache bypass flags. The model detects the pattern and adapts.
Workload-Specific Adaptation
The optimal eviction policy changes with the workload, and workloads change over time. A trading system at market open has radically different access patterns than during after-hours maintenance. An e-commerce site during Black Friday looks nothing like a Tuesday in February. A SaaS application during a customer's fiscal year-end close has different hot data than during a normal business day.
Static eviction policies cannot adapt to these shifts. LRU evicts the same way at 3 AM as it does at 3 PM. The 10-second TTL that works during normal traffic causes a miss storm during a traffic spike. The frequency threshold that prevents eviction of popular items during peak hours wastes memory during low-traffic periods.
Cachee's model continuously retrains on a sliding window of recent access data. The retraining cycle is measured in seconds, not hours. When access patterns shift — market open, traffic spike, batch job — the model detects the shift and adjusts eviction weights within seconds. The adaptation is smooth and continuous. There is no manual intervention, no configuration change, and no restart required.
Measuring Eviction Quality
Hit rate is the most visible metric for eviction quality, but it is not the only one that matters. Three additional metrics provide deeper insight into whether your eviction policy is working well.
Eviction regret measures how many evicted items were accessed within one minute of eviction. A high regret rate means the eviction policy is consistently making wrong bets — evicting items that are still hot. Under LRU, eviction regret rates of 15-25% are common during workload transitions. Under Cachee's ML model, regret rates drop to 2-5% because the prediction signal protects items that are about to be accessed.
Miss cost is the total latency penalty from cache misses, weighted by the refetch cost of each missed item. A miss on a 50ms database query is 50x more costly than a miss on a 1ms lookup. LRU treats both identically. Cachee's cost-weighted eviction preferentially evicts cheap items, which reduces the aggregate miss cost even at the same hit rate.
Churn rate measures how many items are evicted and then immediately re-loaded into the cache. High churn means the eviction policy is fighting itself — evicting items that are immediately re-fetched, wasting both cache space and database resources on the eviction-refetch cycle. LRU under scan workloads exhibits extreme churn. Cachee's scan resistance eliminates this pathological behavior.
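All three of these metrics can be instrumented from an eviction log. This is hypothetical bookkeeping — the class and method names are invented — but the one-minute horizon and the regret/churn definitions follow the descriptions above:

```python
class EvictionAudit:
    """Compute eviction regret and churn rates from cache events.

    Regret: an evicted key is accessed again within `horizon` seconds.
    Churn: that access also re-loads the key into the cache.
    """

    def __init__(self, horizon=60.0):
        self.horizon = horizon
        self.evicted_at = {}  # key -> timestamp of last eviction
        self.evictions = 0
        self.regrets = 0
        self.reloads = 0

    def on_evict(self, key, ts):
        self.evictions += 1
        self.evicted_at[key] = ts

    def on_access(self, key, ts, reloaded=False):
        """Call on every lookup; reloaded=True if the resulting miss
        wrote the key back into the cache."""
        evicted_ts = self.evicted_at.pop(key, None)
        if evicted_ts is not None and ts - evicted_ts <= self.horizon:
            self.regrets += 1        # evicted while still hot
            if reloaded:
                self.reloads += 1    # full eviction-refetch cycle

    def regret_rate(self):
        return self.regrets / self.evictions if self.evictions else 0.0

    def churn_rate(self):
        return self.reloads / self.evictions if self.evictions else 0.0
```

Sampling even a fraction of evictions this way is enough to compare policies: a policy with a lower regret rate at the same hit rate is making measurably better bets.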
From Theory to Production
ML-based eviction sounds complex to operate, but from the user's perspective, it is simpler than LRU. With LRU, you tune TTLs, configure maxmemory-policy, decide between allkeys-lru and volatile-lru, set maxmemory-samples, and still end up with suboptimal hit rates. With Cachee, you deploy the cache and the model handles everything. There are no eviction parameters to tune. No TTLs to set. No sampling rates to configure. The model observes your workload and adapts continuously.
The 12-18% hit rate improvement over LRU translates directly to reduced database load, lower infrastructure costs, and better application performance. At 10,000 requests per second, a 12% hit rate improvement eliminates 1,200 database queries per second. At 100,000 requests per second, it eliminates 12,000 database queries per second. The cost savings compound: fewer database queries mean smaller database instances, fewer read replicas, and lower cross-AZ data transfer charges.
For teams that have spent months tuning TTLs and eviction policies and still cannot break past 90% hit rates, ML eviction is not a marginal improvement. It is the architectural change that unlocks the next tier of cache performance without the operational complexity of managing eviction policies by hand.
See Cache Hit Rate Results
Compare your current LRU hit rate against Cachee's ML eviction on your real workload.
See Hit Rate Improvements
Start Free Trial