
In-Process Cache: Why 31ns Beats Redis at Any Scale

April 24, 2026 | 13 min read | Engineering

An in-process cache stores values in the same address space as your application. A GET is not a network call. It is a hash lookup that finds a memory address and a pointer dereference that reads the value at that address. The total operation takes 31 nanoseconds. A Redis GET, by contrast, requires your application to serialize the key, send bytes over a TCP connection, wait for Redis to process the command on its single-threaded event loop, receive the serialized response, deserialize the value, and allocate memory for it. That round-trip takes 300 microseconds to 3 milliseconds depending on value size and network conditions.

The difference is not a percentage improvement. It is a category change. 31 nanoseconds versus 300 microseconds is a 9,677x gap. That gap does not close at scale. It widens. This post explains the architecture of in-process caching, why the performance difference is fundamental rather than incidental, when in-process wins, when Redis still matters, and how to combine them in a tiered model that gives you both.

The Architecture: What Makes It 31ns

An in-process cache is a concurrent hash map embedded in your application's runtime. Cachee uses a DashMap with 64 shards -- a lock-sharded concurrent hash table where each shard is independently locked. The number 64 is chosen because modern servers have 32-192 CPU cores, and 64 shards provide sufficient parallelism for up to 192 concurrent readers without contention on any single shard.

The GET Path

When your application calls cache.get(key), the following operations execute:

  1. Hash the key (10-15ns): Compute a 64-bit hash of the key bytes using a fast non-cryptographic hash function (SipHash-2-4 or xxHash). This determines which of the 64 shards holds the entry.
  2. Acquire the shard read lock (1-3ns): Read locks on DashMap shards are essentially free under read-heavy workloads. The lock is a compare-and-swap on an atomic integer. Under contention, the read lock never blocks other readers -- only writers block readers, and writes are infrequent in a cache (reads dominate by 10-100x).
  3. Probe the bucket (5-10ns): Walk the hash table bucket (typically 1-2 entries due to low load factor) and compare the key. This is one or two cache-line reads from L1 or L2 CPU cache.
  4. Return the pointer (5ns): The hash table entry contains a pointer (or Arc reference) to the cached value. The GET operation returns this pointer. It does not copy the value. It does not serialize anything. It hands back a reference to bytes that already exist in the application's heap.
  5. Release the read lock (1ns): Decrement the atomic counter.

Total: 22-34 nanoseconds. The measured median is 31 nanoseconds.
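The five steps above can be sketched with a minimal lock-sharded map built on the standard library. This is a stand-in for DashMap, not Cachee's actual internals: the `ShardedCache` type, the FNV-1a hash, and the `SHARDS` constant are illustrative.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

const SHARDS: usize = 64;

// A minimal lock-sharded cache: each shard is an independently
// locked HashMap, so readers on different shards never contend.
struct ShardedCache {
    shards: Vec<RwLock<HashMap<String, Arc<Vec<u8>>>>>,
}

impl ShardedCache {
    fn new() -> Self {
        Self {
            shards: (0..SHARDS).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    // Step 1: hash the key to pick a shard (FNV-1a here as a stand-in
    // for a fast non-cryptographic hash like xxHash).
    fn shard_for(&self, key: &str) -> usize {
        let mut h: u64 = 0xcbf29ce484222325;
        for b in key.bytes() {
            h ^= b as u64;
            h = h.wrapping_mul(0x100000001b3);
        }
        (h as usize) % SHARDS
    }

    // Steps 2-5: take the shard read lock, probe the bucket, and
    // return an Arc clone -- a pointer copy, never a value copy.
    fn get(&self, key: &str) -> Option<Arc<Vec<u8>>> {
        let shard = self.shards[self.shard_for(key)].read().unwrap();
        shard.get(key).cloned()
    }

    fn insert(&self, key: String, value: Vec<u8>) {
        let idx = self.shard_for(&key);
        self.shards[idx].write().unwrap().insert(key, Arc::new(value));
    }
}
```

Note that `get` returns `Arc<Vec<u8>>`: cloning an `Arc` bumps a reference count, which is why the value's size never enters the GET path.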

31ns median GET latency. 64 concurrent shards. 0 bytes serialized.

Why the Value Size Does Not Matter

This is the most counterintuitive property of in-process caching, and the most important. A GET for a 64-byte value takes 31 nanoseconds. A GET for a 1 MB value takes 31 nanoseconds. A GET for a 100 MB value takes 31 nanoseconds. The operation is identical because it never touches the value bytes. It returns a pointer to the value, not a copy of the value.

In Redis, value size dominates latency because every byte must be serialized, transmitted over the network, and deserialized. In an in-process cache, the value is already in the application's memory. The GET operation locates the pointer and returns it. Whether that pointer points to 64 bytes or 100 megabytes is irrelevant to the pointer lookup.

This property means that in-process caching becomes more advantageous as value sizes increase. At 64 bytes, in-process is 9,677x faster than Redis (31ns vs 0.3ms). At 1 MB, in-process is 403,226x faster (31ns vs 12.5ms). The larger the values you cache, the more you benefit from in-process caching.

CacheeLFU Admission: 512 KiB Constant Memory

An in-process cache has a fixed memory budget. You cannot store everything. CacheeLFU (the admission and eviction policy used by Cachee) decides which entries to keep and which to evict, using a frequency sketch that tracks access frequency in constant memory.

The frequency sketch is a probabilistic data structure (a variant of Count-Min Sketch) that estimates how many times each key has been accessed. It uses 512 KiB of memory regardless of the number of keys tracked. At 100,000 keys, it uses 512 KiB. At 1 million keys, it uses 512 KiB. At 10 million keys, it uses 512 KiB. This constant-memory property means the eviction overhead does not scale with your dataset size.

When the cache is full and a new entry needs to be inserted, CacheeLFU compares the frequency estimate of the new entry against the frequency estimate of the least-frequently-used entry currently in the cache. If the new entry has a higher frequency, it replaces the old entry. If not, the new entry is rejected. This ensures that the cache always contains the hottest entries -- the ones that provide the most hits per byte of memory consumed.

// CacheeLFU admission check (simplified)
fn should_admit(new_key: &Key, victim_key: &Key, sketch: &FrequencySketch) -> bool {
    let new_freq = sketch.estimate(new_key);
    let victim_freq = sketch.estimate(victim_key);
    new_freq > victim_freq
}

// Memory usage is constant
// 100K keys: 512 KiB sketch
// 1M keys:   512 KiB sketch
// 10M keys:  512 KiB sketch
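A minimal Count-Min-style sketch shows where the constant-memory property comes from: the counter array is sized up front and never grows with the key count. The dimensions and seeded FNV-1a hashing below are illustrative, not Cachee's actual sketch; the `estimate` method is what `should_admit` above consults.

```rust
// A tiny Count-Min-style frequency sketch: depth rows of width
// counters. Memory is width * depth * 4 bytes regardless of how
// many distinct keys are tracked (e.g. 4 x 32,768 u32 = 512 KiB).
struct FrequencySketch {
    rows: Vec<Vec<u32>>,
    width: usize,
}

impl FrequencySketch {
    fn new(width: usize, depth: usize) -> Self {
        Self { rows: vec![vec![0u32; width]; depth], width }
    }

    // One cheap hash per row, derived by perturbing the FNV-1a seed.
    fn index(&self, key: &str, row: u64) -> usize {
        let mut h = 0xcbf29ce484222325u64 ^ row.wrapping_mul(0x9e3779b97f4a7c15);
        for b in key.bytes() {
            h ^= b as u64;
            h = h.wrapping_mul(0x100000001b3);
        }
        (h as usize) % self.width
    }

    // Record one access: bump one counter in every row.
    fn increment(&mut self, key: &str) {
        let idxs: Vec<usize> = (0..self.rows.len())
            .map(|i| self.index(key, i as u64))
            .collect();
        for (row, &idx) in self.rows.iter_mut().zip(&idxs) {
            row[idx] = row[idx].saturating_add(1);
        }
    }

    // Count-Min estimate: the minimum across rows. Collisions can
    // over-count a key, but never under-count it.
    fn estimate(&self, key: &str) -> u32 {
        (0..self.rows.len())
            .map(|i| self.rows[i][self.index(key, i as u64)])
            .min()
            .unwrap_or(0)
    }
}
```

Because distinct keys can collide into the same counters, the estimate is approximate, which is acceptable for admission: a rare over-count admits a slightly-too-cold key, and the next eviction cycle corrects it.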

Head-to-Head: In-Process vs. Redis at Every Value Size

The following table compares in-process cache GET latency against Redis GET latency across value sizes from 64 bytes to 1 MB. Redis numbers are from a same-AZ r7g.xlarge at moderate load (50% event loop utilization). In-process numbers are from a DashMap with 64 shards on the same instance type.

Value Size | Redis P50 | Redis P99 | In-Process P50 | In-Process P99 | Speedup (P50)
64 B       | 0.30ms    | 0.55ms    | 31ns           | 48ns           | 9,677x
512 B      | 0.32ms    | 0.58ms    | 31ns           | 48ns           | 10,323x
1 KB       | 0.36ms    | 0.65ms    | 31ns           | 48ns           | 11,613x
4 KB       | 0.52ms    | 0.90ms    | 31ns           | 48ns           | 16,774x
10 KB      | 0.78ms    | 1.40ms    | 31ns           | 48ns           | 25,161x
50 KB      | 1.60ms    | 3.00ms    | 31ns           | 48ns           | 51,613x
100 KB     | 2.80ms    | 5.50ms    | 31ns           | 48ns           | 90,323x
500 KB     | 7.00ms    | 15.00ms   | 31ns           | 48ns           | 225,806x
1 MB       | 12.50ms   | 28.00ms   | 31ns           | 48ns           | 403,226x

The columns tell the entire story. The Redis columns scale linearly with value size. The in-process columns are constant. At 1 MB, the gap is over 400,000x. Note also the P99 columns: in-process P99 is 48 nanoseconds -- a 1.5x P99/P50 ratio. Redis P99 is 2-3x P50 at moderate load, and 5-10x at high load. In-process caching does not just eliminate average latency. It eliminates tail latency.

The Objection: "But What About Shared State?"

This is the first question every engineer asks, and it is the right question. An in-process cache is local to one application instance. If you have 20 application servers, each has its own cache. If server A updates a value, server B's cache still has the old value. How do you handle this?

The answer is: you do not try to make the in-process cache consistent across instances. That would turn it into a distributed cache, which would reintroduce network overhead and defeat the purpose. Instead, you accept eventual consistency for the L1 layer and use Redis as the L2 source of truth for shared state.

The Tiered Model

The architecture is two tiers:

The read path: check L1 first. On hit, return in 31ns. On miss, fetch from L2 (Redis), promote the result to L1, and return it.

The write path: write to L2 (Redis) first. L1 entries expire via TTL, or (for latency-sensitive consistency) are invalidated via Redis pub/sub.

This model gives you the best of both worlds. Hot-path reads are 31 nanoseconds. Shared state is consistent via Redis. The staleness window for L1 is bounded by the TTL (typically 5-60 seconds). For most cache workloads, 5-60 seconds of staleness is perfectly acceptable. A session token cached for 30 seconds does not cause problems. A feature flag cached for 10 seconds does not cause problems. A user profile cached for 60 seconds does not cause problems.
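In code, the read path reduces to a read-through helper. The sketch below uses assumed types: `L2Store` is a stand-in trait for a Redis client, the L1 map uses a single lock for brevity (Cachee shards it), and TTL expiry is omitted.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-in for the L2 tier; in production this would be a Redis client.
trait L2Store {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

struct TieredCache<L2: L2Store> {
    l1: Mutex<HashMap<String, Arc<Vec<u8>>>>, // single lock for brevity
    l2: L2,
}

impl<L2: L2Store> TieredCache<L2> {
    fn new(l2: L2) -> Self {
        Self { l1: Mutex::new(HashMap::new()), l2 }
    }

    // Read path: an L1 hit returns a pointer copy (the ~31ns path).
    // An L1 miss fetches from L2 (the ~0.3ms path), promotes the
    // value into L1, then returns it.
    fn get(&self, key: &str) -> Option<Arc<Vec<u8>>> {
        if let Some(v) = self.l1.lock().unwrap().get(key) {
            return Some(Arc::clone(v));
        }
        let fetched = Arc::new(self.l2.get(key)?);
        self.l1.lock().unwrap().insert(key.to_string(), Arc::clone(&fetched));
        Some(fetched)
    }
}
```

Writes are intentionally absent from this sketch: they go straight to L2, keeping Redis the source of truth, and L1 copies age out via TTL or pub/sub invalidation as described above.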

When Staleness Matters

There are workloads where even 5 seconds of staleness is unacceptable: distributed locks, real-time inventory counts, auction bids, financial balances. For these, do not use L1 caching. Go directly to Redis (or your database) for every read. These workloads are typically a small fraction of total cache traffic. Keeping them on Redis while moving the rest to L1 is the correct architecture.

When In-Process Wins

In-process caching provides the largest benefit for workloads with three properties: high read frequency, tolerable staleness, and hot-key concentration.

Hot-Path Reads

Any value that is read more than 100 times per second per instance benefits from in-process caching. At 100 reads/sec, you save 100 * 0.3ms = 30ms of Redis round-trip time per second per instance. At 10,000 reads/sec per instance, you save 3 seconds of Redis time per second. Common examples: auth tokens (read on every API request), feature flags (read on every page render), rate limit state (read on every request), user sessions (read on every request), configuration values (read throughout the codebase).

Auth and Session Management

Auth tokens are the single highest-frequency cache access in most applications. Every API request validates the auth token. In a service handling 5,000 requests per second per instance, that is 5,000 Redis GETs per second just for auth. Moving auth to L1 eliminates 5,000 network round-trips per second per instance. At 20 instances, that is 100,000 fewer Redis operations per second -- enough to downsize your Redis cluster.

Computation Results

Cached computation results (ZK proof verification, ML inference, pricing calculations) must be in-process if the computation takes less than 1 millisecond. Caching a 25-microsecond computation in Redis (300-microsecond lookup) makes the cache 12x slower than recomputing. In-process caching (31ns lookup) makes the cache 806x faster than recomputing. The math only works in-process.

Multi-Lookup Request Paths

If a single request requires 3-5 cache lookups (auth, user profile, feature flags, rate limit, preferences), each lookup adds 0.3ms from Redis. Five lookups add 1.5ms. From L1, five lookups add 155 nanoseconds -- invisible. The cumulative effect across many lookups per request is where in-process caching makes the most dramatic difference in end-to-end request latency.

When Redis Wins

Redis remains the correct choice for four categories of workload. In-process caching cannot replace these, and should not try.

Pub/Sub and Event Distribution

Redis pub/sub, streams, and list-based queues provide cross-process communication. An in-process cache is inherently local to one process. If you need to broadcast events from one producer to many consumers across instances, Redis pub/sub is the right tool. In fact, Redis pub/sub is often used to invalidate L1 cache entries -- the two systems are complementary, not competitive.

Shared Mutable State

Distributed locks (SETNX), atomic counters (INCR), and compare-and-swap operations (SET ... NX) require a single shared instance that all processes agree on. In-process caches are per-process and cannot provide cross-process atomicity. If you need a global rate limit counter, a distributed lock, or a leader election mechanism, Redis (or another distributed coordinator) is necessary.

Persistence and Durability

In-process cache entries are lost when the process restarts. If you need cache entries to survive deployments, crashes, or instance recycling, Redis with AOF persistence provides durability. This matters for expensive-to-rebuild caches where a cold start would overwhelm the backend. Redis as L2 solves this: on restart, L1 is empty but L2 has the warm data.

Cross-Service Cache Sharing

If service A computes a value and service B needs it, an in-process cache on service A is invisible to service B. Redis provides a shared namespace that both services can read from and write to. For cross-service cache sharing, a network cache is the only option. The in-process cache on each service handles its own hot-path reads; Redis handles the shared data.

The Tiered Architecture in Practice

Here is how to deploy the L1/L2 tiered model. Cachee acts as a RESP-compatible proxy that your existing Redis client connects to. It maintains the L1 cache in-process and falls through to Redis on misses.

# Install
brew tap h33ai-postquantum/tap
brew install cachee

# Initialize with Redis L2 upstream
cachee init --upstream redis://your-redis:6379 --l1-memory 1gb --l1-ttl 30s

# Start (listens on localhost:6380, RESP-compatible)
cachee start

# Point your app at Cachee instead of Redis directly
# Before: REDIS_URL=redis://your-redis:6379
# After:  REDIS_URL=redis://localhost:6380

The migration requires zero application code changes. Your Redis client library connects to Cachee on localhost:6380 instead of your Redis endpoint on port 6379. Cachee handles the L1/L2 tiering transparently. Reads check L1 first (31ns), fall through to Redis L2 on miss (0.3ms), and promote the result to L1. Writes go directly to Redis L2 to maintain it as the source of truth.

Tuning the L1 Layer

Three parameters control L1 behavior:

Memory budget (--l1-memory): How much RAM to dedicate to L1. Start with 256 MB per instance and increase based on hit rate. If your hit rate plateaus below 80%, increasing memory may not help -- your access pattern may be too diffuse. If your hit rate is 95%+, you may be able to reduce memory since CacheeLFU is efficiently keeping only the hottest entries.

TTL (--l1-ttl): How long entries stay in L1 before expiring. This bounds staleness. 30 seconds is a good default for most workloads. Reduce to 5-10 seconds for latency-sensitive consistency. Increase to 60-300 seconds for slowly changing data (feature flags, configuration). Per-key TTL overrides are available for mixed workloads.

Key prefix filter (--l1-keys): Optionally restrict L1 to specific key prefixes. If only your session and auth keys benefit from L1 caching, filter to session:*,auth:* to avoid polluting L1 with low-frequency keys from other namespaces.
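Putting the three knobs together, a hot-namespace-only configuration might look like this (the flag names come from the sections above; the specific values are illustrative starting points, not universal recommendations):

```shell
# L1 restricted to session and auth keys, staleness bounded at 10s
cachee init \
  --upstream redis://your-redis:6379 \
  --l1-memory 512mb \
  --l1-ttl 10s \
  --l1-keys "session:*,auth:*"
```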

Monitoring

# Real-time L1 metrics
cachee status

# Output:
# L1 hit rate:      88.4%
# L1 entries:       247,891
# L1 memory:        892 MB / 1024 MB
# L1 evictions/sec: 142
# L1 avg latency:   31ns
# L2 fallback rate:  11.6%
# L2 avg latency:   0.34ms
# Effective avg:    0.039ms

The "effective avg" metric is the weighted average of L1 and L2 latency based on hit rates. At 88.4% L1 hit rate, the effective average is 0.039ms -- an 8.7x improvement over Redis-only at 0.34ms. But averages obscure the full picture. The important metric is the P99, which drops from 0.6ms (Redis-only) to 0.34ms (Cachee tiered), because the only requests that hit Redis are L1 misses, and Redis is lightly loaded.
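The weighted average behind the "effective avg" metric is easy to reproduce. A sketch, using the 31ns L1 latency and the hit rate and L2 latency from the status output above:

```rust
// Effective latency of the tiered cache: hit-rate-weighted average
// of the L1 path (in nanoseconds) and the L2 path (in milliseconds).
fn effective_latency_ms(l1_hit_rate: f64, l1_ns: f64, l2_ms: f64) -> f64 {
    let l1_ms = l1_ns / 1_000_000.0;
    l1_hit_rate * l1_ms + (1.0 - l1_hit_rate) * l2_ms
}
```

At an 88.4% hit rate, 31ns L1, and 0.34ms L2, this evaluates to roughly 0.039ms, matching the status output; note the L1 term contributes almost nothing, so the effective average is dominated entirely by the miss rate.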

Scale Analysis: Why the Gap Widens

At small scale (1,000 requests per second), the absolute time saved per request is roughly 0.27ms (the 0.3ms Redis round-trip minus the ~0.03ms effective latency of the tiered cache). That is 270 milliseconds of cumulative time saved per second. Noticeable, but not transformative.

At moderate scale (100,000 requests per second), the absolute time saved is 0.27ms * 100,000 = 27 seconds of cumulative time saved per second. That is 27 CPU-seconds per second of serialization and deserialization eliminated. 27 cores worth of work removed from your fleet.

At large scale (1,000,000 requests per second), the absolute time saved is 270 seconds of cumulative time per second. That is 270 CPU cores of serialization overhead eliminated. 270 cores that you no longer need to provision, pay for, or maintain.

But the gap widens further because of secondary effects. At 1,000,000 requests per second, Redis is under extreme load. Event loop contention pushes P99 from 0.6ms to 5-15ms. NIC saturation adds 1-2ms. Cross-AZ variance adds 0.5-1ms. The effective Redis latency at scale is 3-10x worse than the nominal Redis latency at low load. In-process latency does not degrade at scale because there is no contention bottleneck -- 64 shards handle 1,000,000 concurrent reads without measurable contention.

9,677x faster at 64 bytes. 403,226x faster at 1 MB. Zero degradation under load.

The Concurrency Model

A common concern with in-process caches is thread safety. If 96 application threads are reading and writing the cache concurrently, will lock contention negate the performance advantage? The answer depends on the implementation.

A single-lock hash map (e.g., Mutex<HashMap>) would indeed suffer catastrophic contention at 96 threads. Every read and write would serialize on the lock. Throughput would plateau at a few million ops/sec regardless of core count.

DashMap with 64 shards eliminates this problem. Each shard has its own RwLock. A read on shard 7 does not contend with a read on shard 42. With 96 threads and 64 shards, the expected contention per shard is 1.5 concurrent accessors -- well within the capacity of an RwLock where multiple readers can proceed in parallel. Only write-write and write-read contention on the same shard causes blocking, and writes are rare in a cache (reads outnumber writes 10-100x).
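A std-only sketch makes the sharding argument concrete: many reader threads spread across independently locked shards all proceed in parallel. The shard count, key range, and thread counts below are illustrative, and this demonstrates the scheme rather than benchmarks it.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

const SHARDS: usize = 64;

// Spawn `readers` threads against a sharded map. Readers that land
// on different shards touch different RwLocks; readers on the same
// shard share a read guard and still proceed in parallel.
fn concurrent_read_demo(readers: usize, reads_per_thread: usize) -> usize {
    let shards: Arc<Vec<RwLock<HashMap<u64, u64>>>> =
        Arc::new((0..SHARDS).map(|_| RwLock::new(HashMap::new())).collect());

    // Populate: key -> key * 2, sharded by key % SHARDS.
    for k in 0..1_000u64 {
        shards[(k as usize) % SHARDS].write().unwrap().insert(k, k * 2);
    }

    let handles: Vec<_> = (0..readers)
        .map(|t| {
            let shards = Arc::clone(&shards);
            thread::spawn(move || {
                let mut hits = 0;
                for i in 0..reads_per_thread {
                    let k = ((t * 31 + i) % 1_000) as u64;
                    let shard = shards[(k as usize) % SHARDS].read().unwrap();
                    if shard.get(&k) == Some(&(k * 2)) {
                        hits += 1;
                    }
                }
                hits
            })
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

With 96 threads and 64 shards, the expected load is 1.5 accessors per shard, and since they are almost all readers, the RwLocks rarely block anyone.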

Measured throughput on a 96-core Graviton4: 1,708,400 lookups per second with 96 concurrent workers. That is 17,796 lookups per second per core, each completing in 56 microseconds (which includes the full application pipeline, not just the cache lookup). The cache lookup itself remains 31 nanoseconds under full contention.

Memory Efficiency

An in-process cache consumes application heap memory. This is a real cost. But the memory efficiency of CacheeLFU makes this cost manageable.

The CacheeLFU frequency sketch uses 512 KiB constant regardless of the number of keys. Compare this to a DashMap storing the actual entries: at 10 million 100-byte entries, the DashMap consumes approximately 1.9 GB. The CacheeLFU overhead is 512 KiB on top of whatever the entries themselves consume -- a ratio of 1,239x more memory-efficient for the eviction metadata at 10 million keys.

Practical memory budgets for in-process caching: 256 MB holds approximately 500,000 entries at 500 bytes average. 1 GB holds approximately 2 million entries. 4 GB holds approximately 8 million entries. These numbers are for the entries themselves; the CacheeLFU overhead is always 512 KiB regardless.
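The budget arithmetic is plain integer division. A sketch, noting that real capacity runs somewhat lower because each entry also carries hash-table and reference-counting overhead:

```rust
// Approximate entry capacity for an L1 memory budget, ignoring
// per-entry bookkeeping overhead.
fn approx_entries(budget_bytes: u64, avg_entry_bytes: u64) -> u64 {
    budget_bytes / avg_entry_bytes
}
```

At 500 bytes per entry, a 256 MB budget works out to roughly 537,000 entries and 1 GB to roughly 2.1 million, consistent with the round figures above.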

The Bottom Line

An in-process cache delivers 31 nanoseconds because it eliminates the three costs that dominate Redis latency: serialization (0 bytes converted), network transfer (0 hops), and deserialization (0 allocations). The value lives in your application's address space. A GET is a hash lookup and a pointer dereference. This is constant at every value size and does not degrade under load. Use in-process L1 for hot-path reads (auth, sessions, feature flags, computation results). Use Redis as L2 for shared state, pub/sub, and persistence. The tiered model gives you 31ns on the hot path and shared consistency where you need it.

31ns reads. Zero serialization. Zero network. Drop-in RESP proxy, zero code changes.
