We benchmarked every major cache — in-process and network — on the same hardware, same methodology, same workload. Here are the results.
The Methodology
Every benchmark on the internet is broken. Someone tests Redis on a laptop with debug logging enabled. Someone else benchmarks Caffeine inside a JMH harness that never triggers GC. A third person compares localhost Redis to cross-region DynamoDB and calls it a fair fight. We decided to do this properly.
Workload: 100,000 keys. 256-byte values. All keys pre-allocated in memory before the clock starts — no string formatting, no key generation inside the hot loop. For in-process caches, the L0 layer is fully warmed before measurement begins. Every GET hits a key that exists.
Duration: Sustained 10-second runs for throughput numbers. Latency measured via rdtsc / cntvct_el0 cycle counters with nanosecond precision, not Instant::now(). Each data point is the median of 5 runs; the first and last runs are discarded as warm-up and cool-down.
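For readers who want to reproduce the timing approach, here is a minimal sketch of an aarch64 cycle-counter read loop. The cntvct_el0 / cntfrq_el0 registers are standard ARM; everything else (function names, harness shape) is illustrative and not the benchmark's actual code, which would additionally pin threads and fence the counter reads.

```rust
// Illustrative aarch64 timing helpers (not the benchmark harness itself).
// cntvct_el0 is the virtual counter; cntfrq_el0 gives its frequency in Hz.
#[cfg(target_arch = "aarch64")]
#[inline(always)]
fn counter_ticks() -> u64 {
    let ticks: u64;
    unsafe { core::arch::asm!("mrs {t}, cntvct_el0", t = out(reg) ticks) };
    ticks
}

#[cfg(target_arch = "aarch64")]
fn counter_freq_hz() -> u64 {
    let freq: u64;
    unsafe { core::arch::asm!("mrs {f}, cntfrq_el0", f = out(reg) freq) };
    freq
}

/// Times a closure over `iters` iterations and returns nanoseconds per call.
/// A real harness would also pin the thread and fence (`isb`) around the reads.
#[cfg(target_arch = "aarch64")]
fn time_per_call<F: FnMut()>(mut op: F, iters: u64) -> f64 {
    let freq = counter_freq_hz() as f64;
    let start = counter_ticks();
    for _ in 0..iters {
        op(); // the GET under test; keys are pre-allocated outside this loop
    }
    let ticks = (counter_ticks() - start) as f64;
    ticks / freq * 1e9 / iters as f64
}
```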
Hardware: Apple M4 Max (16 cores, 128GB) for single-thread latency. AWS c8g.metal-48xl (Graviton4, 64 cores) for multi-thread throughput. Network caches tested on the same instance with the server running localhost, same-AZ, and cross-region to isolate each variable independently.
The Results
Here is every cache we tested, ranked by median GET latency on a single thread. This is the number that matters for your P50.
| Rank | Cache | Type | GET Latency | Notes |
|---|---|---|---|---|
| 1 | Cachee L0 | In-process (Rust) | 31 ns | Sharded RwLock + xxh3 |
| 2 | Moka | In-process (Rust) | 40-60 ns | TinyLFU eviction |
| 3 | Caffeine | In-process (Java) | 50-80 ns | W-TinyLFU, JVM overhead |
| 4 | Ristretto | In-process (Go) | 100-150 ns | TinyLFU, Go runtime |
| 5 | Dragonfly | Network (localhost) | 400 ns | Multi-threaded, io_uring |
| 6 | Memcached | Network (localhost) | 400 ns | Multi-threaded, slab alloc |
| 7 | Redis | Network (localhost) | 500 ns | Single-threaded, RESP |
| 8 | ElastiCache (same-AZ) | Network (AWS) | 339,000 ns | 339 microseconds |
| 9 | ElastiCache (cross-region) | Network (AWS) | 30,000,000-80,000,000 ns | 30-80 milliseconds |
Read that table again. The gap between rank 1 and rank 8 is not a percentage. It is not 2x or 5x. ElastiCache same-AZ is 10,935x slower than Cachee L0. Cross-region ElastiCache is roughly one to two-and-a-half million times slower. These are not different speeds. These are different categories of technology.
Even within the in-process tier, Cachee holds a measurable lead. Moka is the closest competitor at 40-60ns — a well-built Rust cache with a strong TinyLFU implementation. Cachee is still 30-50% faster. Caffeine, the gold standard in Java, lands at 50-80ns. Ristretto, the Go equivalent, at 100-150ns.
Why Cachee Is Fastest
Cachee's L0 hot cache is a sharded RwLock<HashMap>. On a GET, the entire read path is: hash the key, pick a shard, acquire a read lock, return a pointer. That's it. No eviction logic on the read path. No frequency counter update. No timestamp write. No atomic CAS loop. The read returns in 31 nanoseconds because nothing else happens.
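To make that read path concrete, here is a minimal sketch of the structure described above: a sharded RwLock<HashMap> keyed by an xxh3 hash, assuming the xxhash-rust and bytes crates. It is illustrative only, not Cachee's source. In particular, a real implementation would reuse the xxh3 hash for the bucket lookup via a custom hasher, which this sketch omits.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

use bytes::Bytes;               // reference-counted, immutable byte slices
use xxhash_rust::xxh3::xxh3_64; // fast non-cryptographic hash

/// Illustrative sharded read path: hash the key, pick a shard, take a read
/// lock, return a cheap handle. No eviction logic, no timestamps, no CAS loop.
pub struct ShardedCache {
    shards: Vec<RwLock<HashMap<String, Bytes>>>,
    mask: u64, // shard count is a power of two, so `hash & mask` picks a shard
}

impl ShardedCache {
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count.is_power_of_two());
        Self {
            shards: (0..shard_count)
                .map(|_| RwLock::new(HashMap::new()))
                .collect(),
            mask: (shard_count - 1) as u64,
        }
    }

    pub fn get(&self, key: &str) -> Option<Bytes> {
        let hash = xxh3_64(key.as_bytes());                    // 1. hash the key
        let shard = &self.shards[(hash & self.mask) as usize]; // 2. pick a shard
        let guard = shard.read().unwrap();                     // 3. read lock
        guard.get(key).cloned()                                // 4. refcount bump, no memcpy
    }

    pub fn insert(&self, key: String, value: Bytes) {
        let hash = xxh3_64(key.as_bytes());
        self.shards[(hash & self.mask) as usize]
            .write()
            .unwrap()
            .insert(key, value);
    }
}
```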
Five specific engineering decisions make this possible:
- xxh3_64 hashing. The fastest non-cryptographic hash function available. 256-byte keys hash in under 4ns. The hash determines both the shard index and the bucket — one function call, two answers.
- Lock-free frequency tracking. An `AtomicCountMinSketch` records access frequency for eviction decisions, but it runs on a background thread, not the read path. Readers never touch it.
- Cached clock. Most caches call `Instant::now()` or `clock_gettime` on every access to update TTL metadata. That syscall costs 15-25ns alone — nearly the entire Cachee read budget. Cachee uses a cached clock updated by a background timer, so reads never ask the kernel what time it is (see the sketch after this list).
- `Bytes` zero-copy. Values are stored as `Bytes` — reference-counted, immutable byte slices. A GET returns a clone of the `Bytes` handle (an atomic increment), not a memcpy of the value. For a 256-byte value, this saves 40-60ns per read.
- Proprietary Cachee-FLU eviction with lock-free read path. The eviction algorithm — frequency, latency, and utility aware — runs entirely asynchronously. It never holds a lock that a reader needs. Writers enqueue metadata updates; a background task consumes them in batch. The read path sees none of it.
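As one concrete example, here is a minimal sketch of the cached-clock pattern from the list above: a background thread periodically stores the current time into an atomic, and the read path loads it with a single relaxed atomic read instead of calling into the kernel. The type and names are illustrative, not Cachee's internals.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

/// Coarse cached clock: the read path pays one relaxed atomic load instead of
/// a clock_gettime call. Names and resolution are illustrative.
pub struct CachedClock {
    start: Instant,
    elapsed_nanos: AtomicU64,
}

impl CachedClock {
    pub fn start(resolution: Duration) -> Arc<Self> {
        let clock = Arc::new(CachedClock {
            start: Instant::now(),
            elapsed_nanos: AtomicU64::new(0),
        });
        let bg = Arc::clone(&clock);
        // Background timer: the only place the kernel clock is ever consulted.
        // In this sketch the thread runs for the life of the process.
        std::thread::spawn(move || loop {
            bg.elapsed_nanos
                .store(bg.start.elapsed().as_nanos() as u64, Ordering::Relaxed);
            std::thread::sleep(resolution);
        });
        clock
    }

    /// Called on every access to stamp TTL metadata: an atomic load, no syscall.
    pub fn now_nanos(&self) -> u64 {
        self.elapsed_nanos.load(Ordering::Relaxed)
    }
}
```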
Why Redis Can Never Match 31ns
This is not a criticism of Redis. Redis is excellent software. But it faces an architectural wall that no amount of optimization can overcome.
Redis is a network service. Your application sends a command over TCP, the kernel routes the packet through the loopback interface, Redis reads it from a socket, parses the RESP protocol, executes the command, serializes the response, writes it back to the socket, the kernel delivers it, and your application deserializes the result.
Even on localhost — same machine, no physical network — that kernel round-trip has a floor. A sendmsg/recvmsg pair through the loopback interface costs 200-300ns in the best case. Add RESP parsing (50-80ns), data serialization (30-50ns for 256 bytes), and socket buffer copies, and you are at 400-500ns before Redis has added any overhead of its own.
| Component | Latency Floor | Avoidable? |
|---|---|---|
| Kernel loopback round-trip | 200-300 ns | No (unless in-process) |
| RESP protocol parsing | 50-80 ns | No (wire protocol) |
| Socket buffer copy | 30-50 ns | No (kernel boundary) |
| Redis command dispatch | 20-40 ns | Partially |
| Actual key lookup | ~20 ns | Already fast |
| Total floor | ~400-500 ns | No |
Look at that last column. The actual key lookup inside Redis — the hashmap read — takes about 20ns. Redis is already fast at the thing it does. The other 480ns is the cost of being a separate process. Redis's data structure is not the bottleneck. The network is the bottleneck. And you cannot optimize away the network.
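You can observe that floor without any cache involved. The sketch below, using only the Rust standard library, times a bare one-byte request/response over 127.0.0.1. Because it uses blocking syscalls and includes scheduler wake-ups, it will typically print a few microseconds per round trip, a generous upper bound on the per-request tax that every networked cache pays before doing any useful work.

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::time::Instant;

// Bare loopback ping-pong: one byte out, one byte back, no protocol at all.
fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // Echo server on a background thread.
    std::thread::spawn(move || {
        let (mut conn, _) = listener.accept().unwrap();
        conn.set_nodelay(true).unwrap();
        let mut buf = [0u8; 1];
        while conn.read_exact(&mut buf).is_ok() {
            conn.write_all(&buf).unwrap();
        }
    });

    let mut client = TcpStream::connect(addr)?;
    client.set_nodelay(true)?;
    let mut buf = [0u8; 1];

    const ITERS: u32 = 100_000;
    let start = Instant::now();
    for _ in 0..ITERS {
        client.write_all(&[1])?;
        client.read_exact(&mut buf)?;
    }
    let per_round_trip = start.elapsed().as_nanos() / ITERS as u128;
    println!("loopback round trip: ~{per_round_trip} ns");
    Ok(())
}
```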
Single-threading makes it worse at scale. Redis processes commands sequentially on one core. Under high concurrency, commands queue behind each other. Pipeline batching helps throughput but not latency. Dragonfly fixes the threading model, which is why it's slightly faster on localhost — but it still pays the same kernel round-trip tax.
The Surprise: Why Caffeine and Moka Are Slower Than Expected
Before running these benchmarks, we expected Caffeine to be our closest competitor. It is, after all, the successor to Google's Guava cache, and the TinyLFU paper Ben Manes co-authored is the foundation of modern cache eviction research. Every serious cache implementation references it.
Caffeine landed at 50-80ns. That range — not a single number — is the tell. The variance comes from JVM garbage collection pressure. Caffeine stores entries as Java objects. Every object carries a 12-16 byte header. Every access creates short-lived iterator objects that the G1 collector must track and reclaim. Under sustained load, minor GC pauses inject 10-30ns of jitter on individual reads. The median is 55ns. The P99 is 80ns. On a cache read. That GC tax is baked into the platform — no amount of Caffeine optimization can remove it without leaving the JVM.
Moka surprised us by being closer than expected — 40-60ns. It is written in Rust, so it pays zero GC overhead. Its TinyLFU implementation is clean and well-optimized. But Moka updates its frequency sketch on every read. That update is a small operation — a few atomic increments into a CountMinSketch — but it is synchronous. It happens before GET returns. On the M4 Max, those atomic operations cost 8-15ns depending on cache-line contention. Under 16 threads, contention pushes Moka toward the 60ns end of its range.
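To see why a synchronous sketch update costs real nanoseconds, here is a generic count-min-style frequency recorder built on atomic counters. This is an illustration of the pattern, not Moka's actual code: every recorded access performs several atomic read-modify-write operations, and under concurrency those operations bounce cache lines between cores.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Toy 4-row count-min-style frequency sketch: recording one access touches
/// four atomic counters. Done synchronously on the read path, this is the
/// per-GET cost described above; done on a background thread, readers never see it.
pub struct FrequencySketch {
    rows: [Vec<AtomicU32>; 4],
    mask: u64,
}

impl FrequencySketch {
    pub fn new(width: usize) -> Self {
        assert!(width.is_power_of_two() && width <= 1 << 16);
        let row = || (0..width).map(|_| AtomicU32::new(0)).collect::<Vec<_>>();
        Self {
            rows: [row(), row(), row(), row()],
            mask: (width - 1) as u64,
        }
    }

    /// Four atomic read-modify-writes per recorded access.
    pub fn record_access(&self, key_hash: u64) {
        for (i, row) in self.rows.iter().enumerate() {
            let idx = ((key_hash >> (i * 16)) & self.mask) as usize; // illustrative mixing
            row[idx].fetch_add(1, Ordering::Relaxed);
        }
    }

    /// Frequency estimate: the minimum across rows bounds over-counting.
    pub fn estimate(&self, key_hash: u64) -> u32 {
        self.rows
            .iter()
            .enumerate()
            .map(|(i, row)| {
                let idx = ((key_hash >> (i * 16)) & self.mask) as usize;
                row[idx].load(Ordering::Relaxed)
            })
            .min()
            .unwrap()
    }
}
```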
Ristretto, the Go cache from Dgraph, shows the cost of Go's runtime. Its design is sound — TinyLFU admission, concurrent-safe sharding — but the Go scheduler and GC add 50-100ns of baseline overhead per operation. At 100-150ns, it is still fast enough for most applications. But it is 3-5x slower than Cachee, and that gap widens under contention.
What the Numbers Mean for You
If you are running Redis or ElastiCache as your primary cache, you are paying a 10,000x-1,000,000x latency penalty on every read. That penalty translates directly into P50 response times, tail latency, throughput ceilings, and infrastructure cost.
If you are running Caffeine or Moka as an in-process cache, you are already in the right architectural category. But there is still 30-50% of latency left on the table. At 100 million reads per second, that is the difference between 3.1 seconds and 5.5 seconds of cumulative read time per second. That is capacity you get back.
If you are running no in-process cache at all — every read goes to Redis — then the answer is not "switch to Cachee." The answer is "put any in-process cache in front of Redis." Cachee is the fastest. But even Moka at 50ns is nearly 7,000x faster than ElastiCache at 339,000ns. The biggest win is the architectural shift from network to in-process. The second biggest win is choosing the fastest in-process engine.
We measured every one. At 31ns, Cachee L0 is the fastest cache we have measured.
See the Benchmark
Full methodology, raw data, reproduction scripts, and hardware specs. Every number verified.
Full Benchmark Results
Try Cachee Live