Architecture

Distributed Cache Architecture: A Complete Guide to Multi-Tier Caching

Every high-scale system eventually builds a caching layer. Most build it wrong. The typical pattern is a single Redis instance — or a Redis cluster — that sits between the application and the database. It works at first. Then traffic grows, the cluster becomes a single point of failure, network round-trips become the dominant latency source, and the cost of scaling the cluster grows linearly with request volume. The architecture that was supposed to solve the performance problem becomes the performance problem.

Multi-tier caching solves this by recognizing that not all cache reads are equal. Some data is accessed thousands of times per second and should live as close to the CPU as physically possible. Other data is accessed once per minute and can afford a network round-trip. The right architecture serves each access pattern at the appropriate tier, minimizing both latency and infrastructure cost.

- 3 tiers: L1 / L2 / L3
- L1 hit latency: 1.5µs
- L1 hit rate: 99.05%
- Throughput: 660K+ ops/second

The Single-Tier Problem

A single Redis cluster serving all cache reads has three structural problems that no amount of scaling solves.

Network bottleneck. Every read requires a TCP round-trip to the Redis server. In the same availability zone, this is 100–300µs. Cross-AZ, it is 500µs–2ms. Under load, TCP congestion and connection pool contention push this higher. The Redis process itself responds in microseconds — the network is the bottleneck, and you cannot optimize it away without eliminating the hop.

Single point of failure. Redis Sentinel and Cluster mode provide failover, but failover takes time — typically 10–30 seconds. During that window, all cache reads fail or hang. If your application depends on the cache for performance (and if it does not, why do you have a cache?), a 30-second outage cascades into a full-system latency spike as every request hits the database directly.

Linear cost scaling. When a single Redis cluster handles 100% of reads, doubling traffic requires doubling cluster capacity. The cost curve is linear and predictable — and predictably expensive. At 100M reads per day, you need beefy nodes. At 1B reads per day, you need a cluster that costs more than the database it is supposed to protect.

The core insight: A single-tier cache treats every read identically. But the data your application reads most frequently should be closer, faster, and cheaper to serve than the data it reads occasionally. One tier cannot optimize for both patterns simultaneously.

Multi-Tier Architecture Explained

A multi-tier cache mimics the CPU cache hierarchy that has driven processor performance for decades. Just as CPUs use L1, L2, and L3 caches with increasing size and decreasing speed, application caches benefit from the same tiered approach.

L1: In-Process Memory (1–10µs)

The L1 tier lives inside your application process. There is no network hop, no serialization, no TCP overhead. A read from L1 is a memory lookup — the fastest operation your application can perform short of a CPU register access. Cachee's L1 tier delivers reads in 1.5µs.

L1 is small relative to your total cache. It holds the hottest keys — the working set that handles the vast majority of reads. A typical application's access pattern follows a power law: 5–15% of keys handle 80–95% of reads. L1 only needs to hold that 5–15% to intercept almost all traffic.

L2: Network Cache (100µs–2ms)

The L2 tier is your Redis or Memcached cluster. It holds the full cached dataset and handles the reads that miss L1. Because L1 intercepts 95–99% of reads, L2 sees only 1–5% of total traffic. This dramatically changes the sizing requirements: a cluster that would need to handle 100,000 req/s in a single-tier design now needs to handle only 1,000–5,000 req/s.

L3: Persistent Storage (10–100ms)

The L3 tier is disk-backed storage — SSD, S3, or a similar durable store. It holds cold data that has been evicted from L2 but is too expensive to recompute from the origin. L3 reads are slow (10–100ms) but still faster than regenerating the data. Not every architecture needs an L3 tier — it depends on whether your cache misses hit a fast database or a slow computation.

Each tier catches what the tier above misses. The read path is L1 check, L2 check, L3 check, origin. On the way back, the value is stored at each tier it passed through, so subsequent reads serve from the fastest available tier.

```javascript
// Multi-tier cache read flow (pseudocode)
async function cacheRead(key) {
  // Tier 1: In-process L1 (1.5µs)
  let value = l1.get(key);
  if (value) return value; // 99.05% of reads stop here

  // Tier 2: Redis/Memcached L2 (200µs–1ms)
  value = await l2.get(key);
  if (value) {
    l1.set(key, value); // Promote to L1
    return value;
  }

  // Tier 3: Persistent store L3 (10–100ms)
  value = await l3.get(key);
  if (value) {
    l2.set(key, value); // Promote to L2
    l1.set(key, value); // Promote to L1
    return value;
  }

  // Origin: Database or computation
  value = await origin.fetch(key);
  l3.set(key, value);
  l2.set(key, value);
  l1.set(key, value);
  return value;
}
```

Cache Consistency Patterns

Multi-tier caching introduces a consistency challenge: when data changes at the origin, how do you ensure all tiers reflect the update? There are four established patterns, each with different trade-offs.

Cache-Aside (Lazy Loading)

The application manages the cache explicitly. On read, check the cache; on miss, fetch from origin and store in cache. On write, update the origin and invalidate the cache. The next read will miss the cache and repopulate with fresh data. This is the simplest pattern and works well when reads vastly outnumber writes. The downside is that the first read after a write always misses the cache, causing a latency spike.
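The read-on-miss and invalidate-on-write halves of cache-aside can be sketched in a few lines. The `Store` interface, class, and key names below are illustrative assumptions, not a real client API:

```typescript
// Cache-aside sketch. `Store`, `CacheAside`, and the origin shape are
// illustrative names, not a specific library's interface.
interface Store<V> {
  get(key: string): V | undefined;
  set(key: string, value: V): void;
  del(key: string): void;
}

class CacheAside<V> {
  constructor(
    private cache: Store<V>,
    private origin: { read(key: string): V; write(key: string, v: V): void },
  ) {}

  read(key: string): V {
    const cached = this.cache.get(key);
    if (cached !== undefined) return cached; // hit: no origin traffic
    const fresh = this.origin.read(key);     // miss: fetch from origin
    this.cache.set(key, fresh);              // populate lazily
    return fresh;
  }

  write(key: string, value: V): void {
    this.origin.write(key, value); // update the origin first
    this.cache.del(key);           // invalidate; next read repopulates
  }
}
```

Note that `write` invalidates rather than updates the cache entry; updating it in place risks racing with a concurrent read that repopulates stale data.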

Read-Through

The cache itself fetches from the origin on a miss. The application only talks to the cache. This simplifies application code but requires the cache layer to understand how to reach the origin. Read-through caches are popular with managed caching platforms because the platform handles the origin integration. Cachee uses read-through for L2 and L3 misses, transparently fetching from the next tier without application involvement.

Write-Through

Every write goes to both the cache and the origin synchronously. The cache is always consistent with the origin, eliminating stale reads. The cost is write latency — every write pays the origin write time plus the cache write time. Write-through is appropriate when consistency is more important than write performance, which is common in financial applications and anywhere stale data has business consequences.

Write-Behind (Write-Back)

Writes go to the cache immediately and are flushed to the origin asynchronously. Write latency is minimal because the application only waits for the cache write. The risk is data loss: if the cache fails before flushing to the origin, the write is lost. Write-behind is appropriate for data that can tolerate eventual consistency and where write throughput is more important than durability. Session data and analytics counters are typical candidates.

Eviction Strategies Beyond LRU

The eviction strategy determines which keys the cache drops when it reaches capacity. The right strategy can mean the difference between a 90% hit rate and a 99% hit rate.

LRU (Least Recently Used)

The standard. Evict the key that was accessed least recently. LRU works well when recent access is a good predictor of future access. It fails badly during scan patterns — when a batch job reads through the entire keyspace once, it evicts every hot key and replaces them with keys that will never be accessed again. A single table scan can drop your hit rate from 95% to 20%.
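The mechanics are compact enough to sketch. This minimal LRU relies on the guarantee that a JavaScript `Map` iterates in insertion order, so re-inserting a key on access moves it to the "most recent" end; the class name and API are illustrative:

```typescript
// Minimal LRU built on Map insertion order (an illustrative sketch,
// not a production cache). First key in iteration order = least recent.
class LRUCache<V> {
  private map = new Map<string, V>();
  constructor(private capacity: number) {}

  get(key: string): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key)!;
    this.map.delete(key); // refresh recency by re-inserting
    this.map.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      const oldest = this.map.keys().next().value!; // least recently used
      this.map.delete(oldest);
    }
  }
}
```

The scan-pollution failure mode falls straight out of `set`: a batch job that touches every key pushes its one-shot keys to the recent end and evicts the hot working set.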

LFU (Least Frequently Used)

Evict the key with the fewest accesses. LFU resists scan pollution because frequently accessed keys have high counters that survive a single-pass scan. The weakness is stale popularity: a key that was accessed 10,000 times last week but zero times today will stay in cache indefinitely, blocking new hot keys from entering. Frequency-based eviction is backward-looking by nature.
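Both LFU's scan resistance and its stale-popularity weakness are visible in a minimal sketch with non-decaying counters (illustrative names, linear-scan victim selection for brevity):

```typescript
// Minimal LFU sketch: evict the key with the smallest access count.
// Counts never decay, which is exactly the stale-popularity problem.
class LFUCache<V> {
  private values = new Map<string, V>();
  private counts = new Map<string, number>();
  constructor(private capacity: number) {}

  get(key: string): V | undefined {
    const value = this.values.get(key);
    if (value !== undefined) this.counts.set(key, this.counts.get(key)! + 1);
    return value;
  }

  set(key: string, value: V): void {
    if (!this.values.has(key) && this.values.size >= this.capacity) {
      // Evict the least frequently used key (O(n) scan for clarity)
      let victim = "", min = Infinity;
      for (const [k, c] of this.counts) if (c < min) { min = c; victim = k; }
      this.values.delete(victim);
      this.counts.delete(victim);
    }
    this.values.set(key, value);
    this.counts.set(key, this.counts.get(key) ?? 1);
  }
}
```

A one-pass scan inserts keys with count 1, so they become the eviction victims instead of the hot set; but a key that racked up a huge count last week keeps winning ties today, even if it is never accessed again.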

Adaptive ML-Driven Eviction

Cachee's eviction strategy combines frequency, recency, and predicted future access into a single eviction score. The ML model considers temporal patterns (is this key accessed at specific times of day?), access velocity (is this key trending up or down?), and correlation patterns (when key A is accessed, does key B follow within milliseconds?). Keys are evicted based on predicted utility, not just historical counts.

This is why Cachee achieves 99.05% hit rates where standard LRU achieves 85–92%. The eviction strategy does not just react to what has happened — it anticipates what will happen and keeps the keys that will be needed while evicting the ones that will not.
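To show the shape of the idea only: the toy scoring function below blends recency, damped frequency, and a predicted next-access time into one utility score. Cachee's actual model, features, and weights are not public; every name and weight here is an assumption for illustration.

```typescript
// Illustrative only -- NOT Cachee's model. A hand-rolled eviction score
// combining recency, frequency, and a predicted next access; the key
// with the lowest score would be evicted first.
interface KeyStats {
  lastAccessMs: number;    // timestamp of the most recent access
  accessCount: number;     // lifetime hit counter
  predictedNextMs: number; // model's estimate of the next access time
}

function evictionScore(s: KeyStats, nowMs: number): number {
  const recency = 1 / (1 + (nowMs - s.lastAccessMs) / 1000);            // decays with age
  const frequency = Math.log1p(s.accessCount);                           // damped popularity
  const prediction = 1 / (1 + Math.max(0, s.predictedNextMs - nowMs) / 1000); // soon = valuable
  return 0.3 * recency + 0.3 * frequency + 0.4 * prediction;             // weights are arbitrary
}
```

The point of the third term is what pure LRU and LFU lack: two keys with identical histories score differently if one is predicted to be needed in the next second and the other in an hour.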

AI-Powered Predictive Warming

Eviction decides what to remove. Predictive warming decides what to add before it is requested. These are complementary systems that together maximize hit rate.

Standard cache warming is reactive: a key enters the cache only when it is first requested. This guarantees that every unique key access has at least one cache miss. After a cold start — deployment, restart, failover — every key is a miss. Hit rates start at zero and take minutes to hours to recover, depending on the working set size and traffic volume.

Predictive warming changes the model from reactive to proactive. The AI engine learns access patterns across three dimensions. Temporal patterns identify keys accessed at specific times — configuration keys loaded at startup, session keys accessed during business hours, batch job keys accessed at scheduled intervals. Sequential patterns identify chains — when user A's profile is loaded, their preferences, permissions, and recent activity follow within milliseconds. Frequency trends identify keys transitioning from cold to hot — a product page that is starting to go viral, a cache key referenced by a newly deployed feature.
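The sequential-pattern dimension is the easiest to sketch: record which key tends to follow which within a request stream, then prefetch the learned followers. The class, threshold, and API below are assumptions for illustration, not Cachee's real interface:

```typescript
// Sketch of sequential-pattern learning for predictive warming.
// `SequenceWarmer` and its threshold are illustrative, not a real API.
class SequenceWarmer {
  // followers.get(a).get(b) = number of times b was seen right after a
  private followers = new Map<string, Map<string, number>>();
  private lastKey: string | null = null;

  observe(key: string): void {
    if (this.lastKey !== null && this.lastKey !== key) {
      const next = this.followers.get(this.lastKey) ?? new Map<string, number>();
      next.set(key, (next.get(key) ?? 0) + 1);
      this.followers.set(this.lastKey, next);
    }
    this.lastKey = key;
  }

  // Keys observed to follow `key` at least `minCount` times: candidates
  // to prefetch into L1 the moment `key` is read.
  predict(key: string, minCount = 3): string[] {
    const next = this.followers.get(key);
    if (!next) return [];
    return [...next].filter(([, c]) => c >= minCount).map(([k]) => k);
  }
}
```

A real engine would track per-session streams and time windows rather than a single global `lastKey`, but the core idea is the same: the profile read becomes a trigger that warms preferences and permissions before they are requested.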

By loading keys into L1 before they are requested, the prediction engine eliminates cold-start misses and keeps the hit rate above 99% even during deployments, traffic spikes, and access pattern shifts. The difference between a reactive cache at 85–92% hit rate and a predictive cache at 99.05% is not incremental — it is the difference between 8–15% of reads paying full network latency and 1% of reads paying it.

Implementation Patterns

There are three common deployment patterns for adding an L1 caching tier. Each has distinct trade-offs around operational complexity, resource isolation, and latency.

Embedded SDK

The L1 cache runs inside your application process as a library. Reads are function calls — no IPC, no serialization, no network. This is the fastest possible approach (Cachee's 1.5µs reads use this model). The trade-off is that the cache shares memory and CPU with your application. For most applications, L1 memory usage is small (100MB–2GB) relative to available memory, so this is not a practical concern.

Sidecar

The L1 cache runs as a separate process on the same host, communicating via Unix domain sockets or shared memory. This provides resource isolation — the cache cannot OOM your application — at the cost of IPC latency (5–50µs, depending on the mechanism). Sidecars are popular in Kubernetes environments where each pod gets a cache container alongside the application container.

Proxy

The cache runs as a separate service that intercepts traffic between your application and Redis. This is the most operationally simple model — point your Redis connection at the proxy and you are done. The trade-off is network latency: even on the same host, the proxy adds a TCP round-trip (50–200µs). On a different host, it adds a full network round-trip, which partially defeats the purpose of L1 caching.


For most applications, the embedded SDK delivers the best latency. The sidecar model is appropriate when resource isolation is a hard requirement. The proxy model is appropriate when you cannot modify application dependencies at all — it works as a transparent drop-in with zero code changes.

Designing for Failure

A well-designed multi-tier cache degrades gracefully at every tier. If L1 fails, reads fall through to L2. If L2 fails, reads fall through to L3 or the origin. No single tier failure causes a system outage — it causes a latency increase that is proportional to the failed tier's traffic share.

This is the fundamental advantage of multi-tier architecture over single-tier. A single Redis cluster that fails causes all cache reads to hit the database — a thundering herd that typically overwhelms the database and causes a full outage. A multi-tier architecture where L1 handles 99% of reads means that an L2 failure only increases database load by 1% of total reads (the L1 misses that would have hit L2). The database barely notices.
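The fall-through behavior can be sketched as a read loop that treats a tier error the same as a miss. The `Tier` shape and names are illustrative assumptions:

```typescript
// Sketch: a tier failure degrades to the next tier instead of failing
// the read. `Tier` and `resilientRead` are illustrative names.
type Tier = { name: string; get(key: string): Promise<string | undefined> };

async function resilientRead(
  tiers: Tier[], // ordered fastest-first: [l1, l2, l3]
  origin: (key: string) => Promise<string>,
  key: string,
): Promise<string> {
  for (const tier of tiers) {
    try {
      const value = await tier.get(key);
      if (value !== undefined) return value; // hit at this tier
    } catch {
      // Tier unavailable: treat as a miss and fall through, never
      // propagate a tier outage to the caller
      continue;
    }
  }
  return origin(key); // last resort: the database or computation
}
```

A production version would add per-tier timeouts and circuit breakers so a slow tier does not stall the loop, but the structural property is the one above: an outage at any tier only costs that tier's latency advantage.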

Design each tier with independent health checks, independent failover, and independent capacity planning. L1 capacity is determined by your application's hot working set size. L2 capacity is determined by your total cached dataset. L3 capacity is determined by your cold data retention policy. No tier's sizing depends on another's.

Ready to Build a Multi-Tier Cache?

See how Cachee implements L1/L2/L3 caching with AI-powered warming and 99.05% hit rates out of the box.
