
Machine Learning Cache Eviction: Beyond LRU and LFU

December 21, 2025 • 7 min read • Technical Deep Dive

Cache eviction determines which data gets removed when memory fills up. For decades, we've relied on simple policies like LRU (Least Recently Used) and LFU (Least Frequently Used). These algorithms are fast, predictable, and fundamentally limited. Machine learning changes everything by predicting future access patterns instead of just reacting to past behavior.

Why Traditional Eviction Policies Fall Short

LRU (Least Recently Used)

LRU evicts the item accessed longest ago. It's simple and works well for many workloads, but fails badly in common scenarios:

# Scenario: Scanning through data once
for i in range(1000000):
    cache.get(f"item:{i}")  # Each item accessed once

# Problem: Recently-used scan data evicts
# frequently-accessed hot data

LRU weakness: One-time sequential scans pollute the cache, evicting valuable hot data. A single bulk operation can destroy your hit rate.
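The pollution is easy to reproduce with a toy LRU built on Python's OrderedDict (purely illustrative, not how any production cache is implemented): a handful of hot keys are accessed repeatedly, then a single one-time scan over cold keys pushes every hot key out.

from collections import OrderedDict

class ToyLRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # refresh recency on every hit
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used key

cache = ToyLRU(capacity=100)

# A handful of hot keys, accessed over and over
for _ in range(50):
    for k in range(10):
        cache.put(f"hot:{k}", "profile-data")

# A one-time scan touches 1,000 cold keys exactly once each
for i in range(1000):
    cache.put(f"scan:{i}", "scan-data")

# The scan has pushed every hot key out of the cache
hot_still_cached = sum(cache.get(f"hot:{k}") is not None for k in range(10))
print(hot_still_cached)  # -> 0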

LFU (Least Frequently Used)

LFU evicts the least-accessed items. Better for workloads with stable hot data, but struggles with changing patterns:

# Scenario: Yesterday's popular content
# "viral-video-123" accessed 1M times yesterday
# "viral-video-456" accessed 100K times today

# Problem: LFU keeps old viral content
# and evicts today's trending content

LFU weakness: Historical frequency dominates, making the cache slow to adapt to changing access patterns. Popular old data crowds out important new data.
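The same kind of toy model shows the opposite failure. A minimal count-based LFU sketch (again illustrative only): yesterday's viral item holds an enormous lifetime count, so when memory pressure forces a choice, today's trending item loses.

from collections import Counter

class ToyLFU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}
        self.freq = Counter()  # lifetime access counts, never decayed

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            # Evict the resident key with the lowest lifetime count
            victim = min(self.data, key=lambda k: self.freq[k])
            del self.data[victim]
        self.data[key] = value
        self.freq[key] += 1

cache = ToyLFU(capacity=2)

# Yesterday: "viral-video-123" was requested a million times
for _ in range(1_000_000):
    cache.put("viral-video-123", "old-content")

# Today: "viral-video-456" is trending, 100K requests so far
for _ in range(100_000):
    cache.put("viral-video-456", "new-content")

# Memory pressure arrives: a third key needs a slot.
# LFU evicts today's trending video and keeps yesterday's stale one.
cache.put("user:session:789", "session-data")
print("viral-video-456" in cache.data)  # -> False
print("viral-video-123" in cache.data)  # -> True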

The Core Problem: Reacting vs. Predicting

Traditional policies react to past access patterns. ML-powered eviction predicts future access probability. This fundamental shift enables dramatic improvements.

How ML-Powered Eviction Works

Feature Extraction

For each cached item, the system tracks features that correlate with future access:

{
  "key": "user:profile:12345",
  "features": {
    "access_count_1h": 45,
    "access_count_24h": 203,
    "access_count_7d": 1847,
    "time_since_last_access": 120,  // seconds
    "time_of_day": 14,  // hour
    "day_of_week": 3,  // Wednesday
    "size_bytes": 2048,
    "computation_cost_ms": 35,
    "ttl_remaining": 1800,
    "key_pattern": "user:profile:*",
    "access_variance": 0.34,
    "trend": "increasing"  // +12% hour-over-hour
  }
}
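A sketch of how such a feature vector could be assembled from a per-key access log. The access_times list and the sliding-window counters are assumptions for illustration, not Cachee's internal bookkeeping, but they produce most of the fields shown above.

import time

def extract_features(key, access_times, size_bytes, computation_cost_ms, ttl_remaining):
    """Compute the feature vector for a single cached item.

    access_times: access timestamps in seconds since the epoch, oldest first.
    """
    now = time.time()

    def in_window(seconds):
        return sum(1 for t in access_times if now - t <= seconds)

    count_1h = in_window(3600)
    count_24h = in_window(24 * 3600)
    hourly_avg = count_24h / 24  # baseline rate over the last day

    return {
        "key": key,
        "access_count_1h": count_1h,
        "access_count_24h": count_24h,
        "access_count_7d": in_window(7 * 24 * 3600),
        "time_since_last_access": now - access_times[-1] if access_times else float("inf"),
        "time_of_day": time.localtime(now).tm_hour,
        "day_of_week": time.localtime(now).tm_wday,
        "size_bytes": size_bytes,
        "computation_cost_ms": computation_cost_ms,
        "ttl_remaining": ttl_remaining,
        # Hour-over-hour trend: positive when the access rate is rising
        "trend": (count_1h - hourly_avg) / max(hourly_avg, 1e-9),
    }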

Access Prediction Model

A lightweight neural network or gradient boosted tree predicts "probability of access in next N minutes" for each cached item:

# Simplified prediction model.
# Assumes three components trained or derived elsewhere:
#   max_access_rate - normalization constant for hourly access counts
#   temporal_model  - learned time-of-day / day-of-week access model
#   ml_model        - trained model (e.g. gradient boosted trees)
def predict_access_probability(features):
    # Combine multiple signals
    recency_score = 1.0 / (1 + features.time_since_last_access)
    frequency_score = features.access_count_1h / max_access_rate
    trend_score = features.trend_coefficient
    time_pattern_score = temporal_model.predict(
        features.time_of_day,
        features.day_of_week
    )

    # ML model weighs and combines signals
    probability = ml_model.predict([
        recency_score,
        frequency_score,
        trend_score,
        time_pattern_score,
        features.computation_cost_ms,
        features.size_bytes
    ])

    return probability

Cost-Aware Eviction

The system calculates eviction cost as:

eviction_cost = (
    access_probability *
    computation_cost *
    size_efficiency_factor
)

# Evict items with lowest cost
# High probability + expensive to recompute = keep in cache
# Low probability + cheap to recompute = safe to evict
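Putting the score to work: a minimal sketch of an eviction pass, assuming a predict_access_probability function like the one above and a simple item record exposing key and a features dict. The sampled-candidate loop illustrates the idea; it is not Cachee's exact algorithm.

import random

def eviction_cost(item):
    # Probability-weighted cost of losing this item
    probability = predict_access_probability(item.features)
    size_efficiency = 1.0 / max(item.features["size_bytes"], 1)
    return probability * item.features["computation_cost_ms"] * size_efficiency

def run_eviction_pass(cache, bytes_to_free, sample_size=64):
    # Score a random sample instead of every resident item,
    # so the pass stays cheap even for very large caches.
    candidates = random.sample(list(cache.values()), min(sample_size, len(cache)))
    candidates.sort(key=eviction_cost)  # cheapest-to-lose first

    freed = 0
    for item in candidates:
        if freed >= bytes_to_free:
            break
        del cache[item.key]
        freed += item.features["size_bytes"]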

Real-World Performance Improvements

Two production case studies, an e-commerce product catalog and a social media feed, saw the hit-rate and memory improvements summarized in the conclusion below.

Key ML Eviction Strategies

1. Temporal Pattern Recognition

ML models detect time-based patterns humans miss:

# Detected pattern: User profiles accessed heavily
# Mon-Fri 9am-5pm, minimal weekend access

# Traditional LRU/LFU: Treats all times equally
# ML eviction: Aggressively caches profiles during
# weekday business hours, allows eviction on weekends
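One lightweight way to capture such a schedule (a sketch of the idea, not Cachee's actual temporal model): track average access rates per hour-of-week for each key pattern, then fold the ratio into the keep score.

from collections import defaultdict

class HourOfWeekModel:
    """Mean access rate per (key pattern, hour-of-week) bucket."""

    HOURS_PER_WEEK = 7 * 24

    def __init__(self):
        self.bucket_counts = defaultdict(float)   # (pattern, hour-of-week) -> accesses
        self.pattern_totals = defaultdict(float)  # pattern -> total accesses

    def record(self, pattern, day_of_week, hour):
        self.bucket_counts[(pattern, day_of_week * 24 + hour)] += 1
        self.pattern_totals[pattern] += 1

    def multiplier(self, pattern, day_of_week, hour):
        """Returns >1 when this hour is historically busier than average."""
        total = self.pattern_totals[pattern]
        if total == 0:
            return 1.0  # no history yet: neither boost nor penalize
        average_per_hour = total / self.HOURS_PER_WEEK
        observed = self.bucket_counts[(pattern, day_of_week * 24 + hour)]
        return observed / average_per_hour

# Example: user:profile:* keys recorded mostly during weekday business hours
# get a multiplier well above 1.0 at Wednesday 2pm and well below 1.0 on
# Sunday night, so weekend eviction becomes far more likely.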

2. Trend Detection

Identify rising and falling access trends:

# Trending up: Keep in cache even with low historical count
# Trending down: Evict even with high historical count

def calculate_trend(access_history):
    # Compare the last hour's rate against the trailing 24-hour average
    recent_rate = access_history.last_1h
    historical_rate = access_history.last_24h / 24
    if historical_rate == 0:
        return 1.0 if recent_rate > 0 else 0.0  # new keys with any traffic count as rising
    return (recent_rate - historical_rate) / historical_rate

3. Size-Efficiency Optimization

Large low-value items get evicted before small high-value items:

# 10MB video thumbnail (rarely accessed)
# vs 2KB user session (frequently accessed)

value_per_byte = access_probability / size_bytes

# Evict low value-per-byte items first
# Even if access count is similar

4. Computation-Cost Weighting

Items expensive to regenerate stay cached longer:

# Computed recommendation: 500ms to generate
# vs simple database query: 5ms

keep_score = access_probability * computation_cost_ms

# High computation cost items stay cached
# even with lower access probability

Online Learning: Adapting to Traffic Changes

Static models become stale as traffic patterns evolve. Online learning continuously updates the eviction model:

# Every 5 minutes:
#   1. Measure actual access patterns vs predictions
#   2. Calculate prediction accuracy
#   3. Update model weights based on errors
#   4. Deploy updated model with <1ms interruption

# Result: Model adapts to traffic shifts in minutes
# instead of weeks of manual retraining
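A sketch of what one update cycle might look like, assuming an incrementally trainable model such as scikit-learn's SGDClassifier with a logistic loss. The prediction log and feature vectors are hypothetical hooks, not Cachee's internals.

import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")  # outputs access probabilities

def update_cycle(prediction_log):
    """Run every few minutes over the recent prediction log.

    prediction_log: list of (feature_vector, predicted_prob, was_accessed)
    tuples collected since the last cycle.
    """
    if not prediction_log:
        return None

    X = np.array([features for features, _, _ in prediction_log])
    y = np.array([1 if accessed else 0 for _, _, accessed in prediction_log])
    predicted = np.array([p for _, p, _ in prediction_log])

    # Steps 1-2: compare predictions against what actually happened
    accuracy = np.mean((predicted > 0.5) == y)

    # Step 3: nudge the model weights toward the observed outcomes
    model.partial_fit(X, y, classes=[0, 1])

    # Step 4: the updated model is picked up by the next eviction pass;
    # no cache restart or traffic interruption is required.
    return accuracy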

Implementation Considerations

Computational Overhead

ML eviction must be fast enough for production use; modern implementations add <1% CPU overhead compared to LRU.

Memory Overhead

Feature storage requires additional memory:

# Per-item metadata:
# Traditional LRU: 16 bytes (timestamp)
# ML eviction: 64-128 bytes (features + prediction)

# For 1M cached items:
# LRU overhead: 16MB
# ML overhead: 64-128MB

# Trade-off: 48-112MB extra memory for 15%+ hit rate improvement

When to Use ML Eviction

ML-powered eviction provides the most value when:

- Access patterns shift over time: trending content, time-of-day and day-of-week cycles
- Items vary widely in size or in the cost to recompute them
- One-time scans and bulk operations share the cache with hot data
- Hit rate directly drives user-facing latency or infrastructure cost

Less valuable for:

- Small caches with stable, uniform access patterns, where LRU or LFU already delivers high hit rates
- Workloads where the extra 64-128 bytes of per-item metadata outweighs the expected hit-rate gain

Conclusion

LRU and LFU served us well for decades, but modern applications demand smarter eviction. Machine learning transforms cache eviction from reactive to predictive, considering recency, frequency, trends, costs, and temporal patterns simultaneously.

The result: 15-25% higher hit rates, 30-40% memory savings, and automatic adaptation to changing traffic—all with minimal overhead. As cache workloads grow more complex, ML-powered eviction is becoming essential infrastructure.

Experience ML-Powered Cache Eviction

Cachee.ai uses adaptive ML models to optimize eviction automatically. 94% hit rates, zero configuration.

Start Free Trial

Related Reading

The Numbers That Matter

Cache performance discussions get philosophical fast. Here are the actual measured numbers from production deployments running on documented hardware, so you can compare against your own infrastructure instead of trusting marketing copy.

The compounding effect matters more than any single number. A 28-nanosecond L0 hit means your application spends almost zero time on cache lookups in the hot path, leaving the CPU free for the actual business logic that generates revenue.

Three Pitfalls That Burn Teams

Three things consistently bite teams during the first month of running an in-process cache alongside or instead of a network cache. We've seen each of these in production. Here's how to avoid them.

Observability And What To Measure

You can't tune what you can't measure. Four metrics matter for any production cache deployment.

Cachee exposes all four out of the box via Prometheus metrics on the standard scrape endpoint, plus a real-time SSE stream for dashboards that need sub-second visibility. The right time to wire these into your monitoring stack is before the migration, not after the first incident.

Memory Efficiency Is The Hidden Cost Lever

Throughput numbers get the headlines but memory efficiency determines your monthly bill. A cache that stores the same hot data in less RAM lets you run a smaller instance class — and on AWS that's the difference between profitable and breakeven for a lot of services.

Redis stores each key as a Simple Dynamic String with 16 bytes of header overhead, plus dictEntry pointers in the main hashtable, plus embedded TTL metadata. For 1KB values, the per-entry footprint lands around 1100-1200 bytes once you account for hashtable load factor and slab fragmentation. At a million keys, that's roughly 1.2 GB of resident memory just for the data.

Cachee's L1 layer uses sharded DashMap entries with compact packing — a 64-bit key hash, value bytes, an 8-byte expiry timestamp, and a small frequency counter for the CacheeLFU admission filter. Per-entry overhead lands at roughly 40 bytes of structural data on top of the value itself. For the same million-key workload, that's about 13% smaller resident memory. On AWS ElastiCache pricing, that gap is the difference between needing a cache.r7g.large versus a cache.r7g.xlarge for borderline workloads.