Distributed vs Local Cache: When to Use Each
The most common caching mistake is using one tier for everything. Teams pick Redis, use it for session state, feature flags, rate limiting, API response caching, and hot read acceleration, and then wonder why their caching architecture has contradictory requirements. Sessions need strong consistency across instances. Feature flags need low latency but can tolerate staleness. Rate limiting needs approximate accuracy at high throughput. API responses need fast reads with tunable freshness. These are fundamentally different access patterns, and a single cache tier cannot optimize for all of them simultaneously.
The solution is a two-tier architecture: an in-process local cache (L1) for hot reads that need nanosecond latency, and a distributed cache (L2) for shared state that needs cross-instance visibility. L1 handles the volume. L2 handles the coordination. The database handles the truth. Each tier does what it is best at, and nothing else.
This article provides the decision framework for choosing which data belongs in which tier, the architecture for combining them, and the consistency model that makes it all work.
The Two Tiers: What They Are and What They Cost
L1: In-Process Local Cache
An L1 cache lives inside your application process. It is a hash map (or a more sophisticated structure with TTL and eviction support) that stores key-value pairs in the application's own heap memory. There is no network call, no serialization, no TCP overhead. A lookup takes approximately 31 nanoseconds on modern hardware. That is the time for a hash computation, a pointer dereference, and a memory read from L1/L2 CPU cache.
The L1 cache is per-instance. If you have 10 application instances behind a load balancer, each instance has its own L1 cache. They do not share data. Instance A's L1 contains whatever Instance A has read recently. Instance B's L1 contains whatever Instance B has read recently. They may contain different versions of the same key. This is fine for reads where bounded staleness is acceptable. It is not fine for data that must be consistent across instances.
The cost of L1 is application memory. Each cached entry consumes heap memory in your application process. At 10,000 entries with 500-byte average values, L1 uses approximately 5 MB. At 100,000 entries, it uses approximately 50 MB. This is typically a small fraction of your application's total memory footprint, but it should be bounded with a maxsize limit to prevent unbounded growth.
L2: Distributed Cache (Redis, Memcached, or equivalent)
An L2 cache is a separate service that all application instances connect to over the network. It provides a single shared keyspace visible to all instances. When Instance A writes a value to L2, Instance B can read it immediately. This shared visibility makes L2 the right place for data that must be consistent across instances: sessions, locks, pub/sub channels, and shared counters.
The cost of L2 is latency and infrastructure. Every L2 operation requires a network round-trip: serialize the request, send it over TCP, wait for Redis to process it, receive the response, deserialize it. On a same-AZ connection, this takes 100-300 microseconds. On a cross-AZ connection, it takes 300-800 microseconds. Additionally, L2 requires a separate server (or cluster) with its own memory, CPU, and network resources. On AWS ElastiCache, a cache.r7g.large instance costs approximately $0.068 per GB per hour, which is $49 per GB per month.
The Four-Quadrant Decision Framework
Every piece of cacheable data can be classified along two dimensions: access frequency (how often it is read) and mutability (how often it changes and whether changes must be immediately visible). These two dimensions create four quadrants, and each quadrant maps to a specific caching strategy.
| | Read-Only / Rarely Changes | Mutable / Changes Frequently |
|---|---|---|
| High Frequency | L1 (in-process) | L2 + L1 read cache |
| Low Frequency | L2 only | Database direct |
Quadrant 1: High Frequency + Read-Only = L1
Data that is read on every request and changes rarely or never. Feature flags, application configuration, static reference data (country codes, currency conversion rates, product categories), and auth validation keys. These are read thousands of times per second across all instances and change once per hour, once per day, or never. They are the perfect L1 candidates because the high read frequency amortizes the rare staleness, and the read-only nature means there is no write coordination needed.
An L1 TTL of 30-60 seconds means each instance refreshes this data once per minute. If the data changes, the worst case is 30-60 seconds of staleness. For a feature flag, this means a user might see the old flag value for up to 60 seconds after a flag change. In practice, this is indistinguishable from "instant" for almost all feature flag use cases.
# Feature flags: perfect L1 candidate
# Read on every request, changes once per deploy
from cachetools import TTLCache

# L1: 10,000 entries max, 30-second TTL
l1 = TTLCache(maxsize=10000, ttl=30)

# Assumes `redis` is a connected redis-py client, `db` is a database handle,
# and serialize/deserialize and evaluate_flag are application helpers.
def is_feature_enabled(flag_name, user_id):
    cache_key = f"flag:{flag_name}"
    # L1 hit: ~31 nanoseconds
    result = l1.get(cache_key)
    if result is not None:
        return evaluate_flag(result, user_id)
    # L1 miss: fetch from L2 (~300us) or the database (~15ms)
    cached = redis.get(cache_key)
    if cached is not None:
        result = deserialize(cached)
    else:
        result = db.query("SELECT * FROM feature_flags WHERE name = %s",
                          flag_name)
        redis.setex(cache_key, 300, serialize(result))
    l1[cache_key] = result
    return evaluate_flag(result, user_id)
Quadrant 2: High Frequency + Mutable = L2 + L1 Read Cache
Data that is read frequently and written occasionally, where writes must be visible across instances within a bounded time. User sessions, user profiles (for display), shopping cart summaries, and rate limit states that need approximate cross-instance consistency. The write goes to L2 (for cross-instance visibility), and each instance maintains an L1 read cache with a short TTL for fast reads.
The key insight is that even "mutable" data is usually read far more often than it is written. A user session is read on every request (hundreds of times per session) and written once (at creation) or a few times (on activity update). An L1 cache with a 10-second TTL absorbs 99% of the reads and adds at most 10 seconds of staleness on writes. The write path updates L2 directly, so any instance that has an L1 miss will see the fresh value immediately.
# User session: high frequency, mutable
# Read on every request, written on login and activity update
def get_session(session_id):
    cache_key = f"session:{session_id}"
    # L1 hit: ~31 nanoseconds (absorbs 95%+ of reads)
    session = l1.get(cache_key)
    if session is not None:
        return session
    # L1 miss: read from L2 (~300 microseconds)
    cached = redis.get(cache_key)
    if cached is not None:
        session = deserialize(cached)
        l1[cache_key] = session  # Populate L1
        return session
    return None  # No session exists

def update_session(session_id, data):
    cache_key = f"session:{session_id}"
    # Write to L2 with a 30-minute TTL (cross-instance visibility)
    redis.setex(cache_key, 1800, serialize(data))
    # Optionally update local L1 (avoids staleness on this instance)
    l1[cache_key] = data
Quadrant 3: Low Frequency + Read-Only = L2 Only
Data that is accessed occasionally and changes rarely. Full user profiles (for admin pages), historical reports, infrequently accessed product details, and API responses for long-tail queries. These are not read often enough to justify consuming L1 memory across all instances. A single copy in L2, shared by all instances, is sufficient. The 300-microsecond L2 latency is acceptable because these reads are infrequent and not on the hot path.
The key question for Quadrant 3 data is: "Will this key be read again within a reasonable TTL window?" If a key is read once per hour and your L2 TTL is 5 minutes, the cache hit rate for that key is nearly zero. You are paying the cost of caching (write to L2, memory in L2) without the benefit (cache hits). For data that is read less frequently than the TTL, consider whether caching it provides any value at all. It may be simpler and cheaper to read directly from the database.
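Where caching does pay off, a Quadrant 3 read is a plain L2-only cache-aside lookup with no L1 copy on any instance. A minimal sketch, assuming the same `redis`, `db`, and serialize/deserialize helpers as the earlier examples (the report query is illustrative):

# L2-only cache-aside: long-tail data, no per-instance L1 copy
def get_report(report_id):
    cache_key = f"report:{report_id}"
    cached = redis.get(cache_key)
    if cached is not None:
        return deserialize(cached)  # ~300 microseconds, fine off the hot path
    report = db.query("SELECT * FROM reports WHERE id = %s", report_id)
    if report is not None:
        # Only worth writing if the key is likely to be read again within the TTL
        redis.setex(cache_key, 300, serialize(report))
    return report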
Quadrant 4: Low Frequency + Mutable = Database Direct
Data that is accessed rarely and changes frequently. Audit logs, analytics events, write-heavy counters that are rarely read, and transactional data that must be strongly consistent. Caching this data provides little benefit (low read frequency means low hit rate) and adds complexity (frequent mutations require frequent invalidation or short TTLs). Read directly from the database. The 15-millisecond database latency is acceptable for data that is accessed once per minute or less.
The most common mistake here is caching data "just in case." Engineers add caching to every database query by default, even for queries that are executed once per hour. This wastes cache memory, adds invalidation complexity, and provides no measurable latency improvement. The discipline is to cache only what benefits from caching: high-frequency reads of slowly-changing data.
The Architecture: L1, L2, and Database
The three-tier architecture is simple in concept but requires care in implementation. The read path checks L1, then L2, then the database. The write path writes to the database, then to L2, then optionally updates L1 on the writing instance.
# Three-tier read path
def get(key):
    # Tier 1: L1 in-process cache (~31 nanoseconds)
    value = l1.get(key)
    if value is not None:
        metrics.increment("cache.l1.hit")
        return value
    # Tier 2: L2 distributed cache (~300 microseconds)
    value = l2.get(key)
    if value is not None:
        metrics.increment("cache.l2.hit")
        l1[key] = value  # Populate L1 for future reads
        return value
    # Tier 3: Database (~15 milliseconds)
    value = db.query(key)
    if value is not None:
        metrics.increment("cache.db.hit")
        l2.setex(key, 300, value)  # Populate L2 with a 5-minute TTL
        l1[key] = value            # Populate L1
    return value

# Three-tier write path
def put(key, value):
    # Write to database first (source of truth)
    db.write(key, value)
    # Update L2 (cross-instance visibility)
    l2.setex(key, 300, value)
    # Update local L1 (avoid staleness on this instance)
    l1[key] = value
Sizing L1 and L2
L1 should be sized to hold your hot set: the keys that account for 80-95% of your reads. In most workloads, this is 1,000 to 100,000 keys. At 500 bytes per value, that is 500 KB to 50 MB of application memory per instance. Use a bounded cache (maxsize) with LFU eviction to ensure L1 holds the most frequently accessed keys. Monitor L1 hit rate: if it is below 80%, your maxsize is too small and hot keys are being evicted. If it is above 98%, your maxsize might be larger than necessary.
L2 should be sized to hold your warm set: the keys that benefit from caching but are not hot enough for L1. This is typically 10x to 100x the size of L1. Use a distributed cache with allkeys-lfu eviction and aggressive TTLs. Monitor L2 hit rate and eviction rate: if the eviction rate is high (keys are being evicted before they expire), you need more L2 capacity or shorter TTLs to reduce the working set.
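The hit-rate signal these sizing decisions depend on has to come from instrumentation. One possible shape is a thin wrapper around the L1 cache; a sketch, with the counters standing in for whatever metrics system you use:

# Instrumented L1 wrapper: emits the hit-rate signal used to tune maxsize
from cachetools import TTLCache

class InstrumentedL1:
    def __init__(self, maxsize=10000, ttl=30):
        self.cache = TTLCache(maxsize=maxsize, ttl=ttl)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        value = self.cache.get(key)
        if value is not None:
            self.hits += 1
        else:
            self.misses += 1
        return value

    def set(self, key, value):
        self.cache[key] = value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Below ~80%: raise maxsize (hot keys are being evicted).
# Above ~98%: maxsize can likely be reduced.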
The Consistency Model: Eventual Consistency Between L1 Instances
The most common objection to L1 caching is: "But my L1 might be stale." This is true. By design. And it is fine for the vast majority of workloads.
When Instance A writes a new value to L2, Instance B's L1 still has the old value. Instance B will serve the old value until its L1 TTL expires (at most TTL seconds) or until Instance B reads the key from L2 (on the next L1 miss). The maximum staleness is equal to the L1 TTL. If your L1 TTL is 30 seconds, Instance B may serve data that is up to 30 seconds old.
The question to ask is: "What is the worst consequence of serving data that is 30 seconds old?" For most data categories, the answer is "nothing meaningful." A user sees an old feature flag value for 30 seconds. A product page shows the previous price for 30 seconds. A dashboard widget shows a count that is 30 seconds behind. These are acceptable for virtually all applications. The latency improvement (10,000x faster reads) far outweighs the consistency cost (30 seconds of bounded staleness).
When NOT to Use L1 Caching
L1 caching is not appropriate for data that requires strong read-after-write consistency across instances. Specific examples: inventory counts for limited-stock items (overselling risk), financial balances (double-spend risk), distributed locks (mutual exclusion requires immediate visibility), and rate limiting that must be globally accurate (per-instance L1 counters can allow 10x the intended rate if you have 10 instances). For these use cases, read directly from L2 (which provides cross-instance consistency) or from the database (which provides strong consistency).
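A globally accurate rate limiter, for instance, keeps its counter only in L2 so every instance sees the same count. A minimal fixed-window sketch using Redis INCR and EXPIRE (the key format and limits are illustrative):

# Globally accurate rate limit: the counter lives only in L2, never in L1
import time

def allow_request(user_id, limit=100, window_seconds=60):
    window = int(time.time() // window_seconds)
    key = f"ratelimit:{user_id}:{window}"
    count = redis.incr(key)                 # Atomic across all instances
    if count == 1:
        redis.expire(key, window_seconds)   # Start the window on first hit
    return count <= limit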
Invalidation Between L1 Instances
For most workloads, TTL-based L1 invalidation is sufficient. Set a short TTL (10-60 seconds) and let entries expire naturally. The simplicity of this approach is its greatest advantage: no additional infrastructure, no pub/sub channels, no invalidation coordination.
If you need faster L1 invalidation (sub-second instead of TTL-based), you have two options. First, use L2 pub/sub: when a write occurs, publish an invalidation message on a channel. All instances subscribe to the channel and delete the key from their L1 when they receive the message. This reduces staleness from TTL seconds to pub/sub delivery latency (typically 1-5 milliseconds). Second, use a polling approach: each instance periodically (every 1-5 seconds) checks a "version" key in L2 for each L1-cached key. If the version has changed, the L1 entry is evicted. This is simpler than pub/sub but adds L2 polling overhead.
# L1 invalidation via L2 pub/sub
import threading

def start_invalidation_listener():
    """Subscribe to the L2 invalidation channel and evict L1 entries."""
    pubsub = redis.pubsub()
    pubsub.subscribe("cache:invalidate")

    def listener():
        for message in pubsub.listen():
            if message['type'] == 'message':
                key = message['data'].decode('utf-8')
                l1.pop(key, None)  # Remove from L1

    thread = threading.Thread(target=listener, daemon=True)
    thread.start()

def put_with_invalidation(key, value):
    """Write to DB and L2, then notify all instances to evict L1."""
    db.write(key, value)
    redis.setex(key, 300, serialize(value))
    redis.publish("cache:invalidate", key)  # Notify all instances
    # Update local L1 immediately. Note that the writing instance also
    # receives its own invalidation message and may evict this entry,
    # which simply forces the next read to refresh from L2.
    l1[key] = value
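The polling alternative can be sketched as follows, assuming each cached key has a companion "version:" key in L2 that writers bump on every update (the helper names are illustrative):

# L1 invalidation via version polling (simpler than pub/sub, adds L2 reads)
import threading
import time

l1_versions = {}  # key -> version value observed when the key entered L1

def get_with_version(key):
    value = l1.get(key)
    if value is not None:
        return value
    cached = redis.get(key)
    if cached is not None:
        value = deserialize(cached)
        l1[key] = value
        l1_versions[key] = redis.get(f"version:{key}")  # Remember current version
    return value

def start_version_poller(interval_seconds=2):
    def poller():
        while True:
            for key, seen in list(l1_versions.items()):
                if redis.get(f"version:{key}") != seen:
                    l1.pop(key, None)           # Evict stale L1 entry
                    l1_versions.pop(key, None)
            time.sleep(interval_seconds)
    threading.Thread(target=poller, daemon=True).start()

def put_with_version(key, value):
    db.write(key, value)
    redis.setex(key, 300, serialize(value))
    redis.incr(f"version:{key}")  # Bump version so pollers evict their L1 copies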
Real-World Mapping: What Goes Where
The following table maps common data categories to their appropriate cache tier. This is not exhaustive, but it covers the data types that account for the majority of cache operations in typical web applications.
| Data Category | Tier | L1 TTL | L2 TTL | Rationale |
|---|---|---|---|---|
| Feature flags | L1 | 30s | 5m | Read every request, change rarely |
| Auth/JWT validation keys | L1 | 60s | 10m | Read every request, rotate daily |
| App configuration | L1 | 60s | 10m | Read every request, change per-deploy |
| User session (read) | L1 + L2 | 10s | 30m | Read every request, written on activity |
| User session (create) | L2 | -- | 30m | Must be visible to all instances immediately |
| Rate limit counters | L2 | -- | Window | Must be globally accurate |
| API response cache | L1 + L2 | 15s | 60s | Read frequently, freshness varies |
| Product catalog | L1 + L2 | 30s | 5m | Read heavily, updated occasionally |
| User profile (display) | L1 + L2 | 15s | 5m | Read on page load, updated occasionally |
| Inventory counts | L2 | -- | 10s | Consistency critical, no L1 |
| Financial balances | Database | -- | -- | Strong consistency required |
| Audit logs | Database | -- | -- | Write-heavy, rarely read |
The Numbers: Why Two Tiers Beat One
Consider a web application that handles 10,000 requests per second. Each request makes an average of 5 cache reads. That is 50,000 cache reads per second. With a single-tier architecture (L2 only, Redis at 300 microseconds per read), the total cache latency per request is 5 * 300 = 1,500 microseconds (1.5 milliseconds). At 10,000 requests per second, Redis handles 50,000 operations per second, which is within capacity for a single node but leaves no headroom.
Now add an L1 tier with an 85% hit rate. Of the 5 cache reads per request, 4.25 hit L1 (31 nanoseconds each) and 0.75 miss L1 and hit L2 (300 microseconds each). The total cache latency per request is (4.25 * 0.031) + (0.75 * 300) = 0.13 + 225 = 225 microseconds. That is a 6.7x latency reduction versus L2 only. Redis now handles 7,500 operations per second instead of 50,000 -- an 85% load reduction. You can downgrade to a smaller Redis instance, saving infrastructure cost.
The Impact of L1 Hit Rate on Total Latency
At 50% L1 hit rate: weighted latency drops from 1,500us to 750us (2x improvement). At 70% L1 hit rate: weighted latency drops to 450us (3.3x). At 85% L1 hit rate: weighted latency drops to 225us (6.7x). At 95% L1 hit rate: weighted latency drops to 75us (20x). The returns are not linear -- each percentage point of L1 hit rate above 80% is increasingly valuable because it eliminates the most expensive operations (L2 round-trips).
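These figures fall straight out of the hit-rate arithmetic. A quick back-of-the-envelope check, using the latency constants quoted throughout this article:

# Weighted cache latency per request as a function of L1 hit rate
READS_PER_REQUEST = 5
L1_LATENCY_US = 0.031   # 31 nanoseconds
L2_LATENCY_US = 300.0   # 300 microseconds

for hit_rate in (0.50, 0.70, 0.85, 0.95):
    latency_us = READS_PER_REQUEST * (
        hit_rate * L1_LATENCY_US + (1 - hit_rate) * L2_LATENCY_US
    )
    print(f"L1 hit rate {hit_rate:.0%}: {latency_us:.0f} us per request")
# -> 750, 450, 225, 75 microseconds (vs 1,500 us with L2 only)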
Common Objections and Responses
"L1 wastes memory because each instance caches the same data"
This is true and it is the right tradeoff. If your hot set is 10,000 keys at 500 bytes each, L1 uses 5 MB per instance. With 20 instances, that is 100 MB total across all instances, versus 5 MB in a single L2 node. You are trading 95 MB of application memory for a 10,000x latency reduction on 85% of your reads. Application memory costs approximately $0.005 per GB per hour on a typical cloud instance. The 95 MB costs $0.0005 per hour. The Redis instance it partially replaces costs $0.068 per GB per hour. The math is overwhelmingly in favor of L1 duplication.
"L1 makes debugging harder because each instance has different state"
This is a valid concern that is solved by observability, not by avoiding L1. Instrument your L1 with hit/miss metrics per key prefix. Add a debug endpoint that dumps the current L1 contents for a specific instance. Log the L1 hit/miss decision on each request so you can trace whether a specific response came from L1, L2, or the database. These are straightforward engineering practices that pay dividends beyond cache debugging.
"We should just use a faster distributed cache instead of adding L1"
No distributed cache can match in-process latency. Light in fiber travels roughly 200 meters per microsecond, so every kilometer of cable adds about 10 microseconds of round-trip latency before any NIC, kernel, or protocol overhead. Even on the same rack, a network round-trip takes 10-50 microseconds. An in-process hash map lookup takes 31 nanoseconds. There is a 300-1,600x gap between in-process and any networked solution that no protocol optimization or hardware acceleration can close. The gap is physics, not engineering.
The Bottom Line
The caching mistake is using one tier for everything. Distributed caches (Redis) are for shared mutable state: sessions, locks, pub/sub, and cross-instance coordination. Local caches (in-process L1) are for hot reads: feature flags, auth validation, configuration, and any data that is read frequently and changes slowly. The two tiers are complementary, not competing. L1 handles 85% of reads at 31 nanoseconds. L2 handles the remaining 15% at 300 microseconds. The database handles the truth at 15 milliseconds. Each tier does what it is best at. The result is 6.7x lower latency, 85% less load on L2, and a simpler mental model than trying to make one cache tier do everything.