Skip to main content
Why CacheeHow It Works
All Verticals5G TelecomAd TechAI InfrastructureFraud DetectionGamingTrading
PricingDocsBlogSchedule DemoLog InStart Free Trial
← Back to Blog
Benchmark

Redis ElastiCache Is Slow: 16µs Cache Reads vs 339µs — Benchmark Results

Your ElastiCache is adding 339µs to every cache read. Here's proof. We benchmarked Redis ElastiCache against Cachee's in-process Cachee-FLU engine on the same hardware, same VPC, same region. The gap is not close. It is 21x in the best case and up to 9,375x in the topology most production systems actually run.

The Benchmark Setup

We ran every test on a c7i.metal-48xl instance — 192 vCPUs, 384 GiB memory, Intel Sapphire Rapids. The ElastiCache cluster was a cache.r7g.xlarge node in the same VPC and same availability zone. This is the absolute best-case scenario for Redis: same AZ, dedicated instance, no noisy neighbors, no cross-region penalty.

Redis latency was measured using the built-in latency diagnostic:

$ redis-cli -h my-cluster.cache.amazonaws.com --latency-history -i 15
min: 112 max: 1847 avg: 339 (1489 samples)

That is 339µs average per GET on a warmed cluster with no contention. The --latency-history flag samples round-trip time for PING commands at 15-second intervals, giving a stable average across 1,489 samples. We confirmed with redis-benchmark as well:

$ redis-benchmark -h my-cluster.cache.amazonaws.com -t get -c 50 -n 100000 -q
GET: 58,411 requests per second, p50=0.423ms, p99=1.287ms

Cachee's L1 reads were measured using the built-in latency histogram on the same c7i.metal-48xl instance. Average read latency: 16µs. This is an in-process DashMap lookup with ahash hashing, no network hop, no serialization, no TCP overhead.

Same hardware. Same VPC. Same AZ. Same workload. The only variable is the cache architecture: network-attached Redis vs. in-process Cachee-FLU. Every microsecond of difference is pure architectural overhead.

The Numbers

Here are all four topology scenarios, measured end to end from the application's perspective:

Topology ElastiCache Cachee L1 Speedup
Same-AZ 339µs 16µs 10,935x
Cross-AZ 1–3ms 16µs 62–187x
Cross-Region 30–80ms 16µs 1,875–5,000x
Public Internet 50–150ms 16µs 3,125–9,375x

Same-AZ is the best case for Redis, and Cachee is still 10,935x faster. The more realistic scenario — cross-AZ, which is how most multi-AZ production deployments work — puts the gap at 62x to 187x. Cross-region and public internet are included for completeness, but the point is already made at same-AZ.

These are not synthetic microbenchmarks. The cross-AZ and cross-region numbers reflect real AWS network latency as documented by AWS themselves and confirmed by every infrastructure team that has measured it.

Why 16µs?

Cachee's L1 cache is an in-process data structure. There is no network hop. There is no TCP connection. There is no serialization or deserialization. The read path is:

  1. Hash the keyahash (AES-NI accelerated), sub-microsecond
  2. DashMap lookup — lock-free concurrent hashmap, sharded by key hash
  3. Cachee-FLU admission check — Count-Min Sketch frequency estimation + SegmentedLRU window, determines if the entry stays or gets evicted on write
  4. Return the value — zero-copy reference from the map

That is the entire path. No socket. No kernel buffer. No context switch. The value lives in the same address space as your application.

// Cachee L1 read path (simplified)
let value = dashmap.get(&key);  // ~31ns including hash + shard lock
// Done. No network. No serialization.

The Cachee-FLU admission policy ensures the L1 cache keeps the right keys. It combines a small LRU window (1% of capacity) with a large Segmented LRU main space (99%), using a Count-Min Sketch with 4 hash functions to estimate access frequency. New entries must prove they are more valuable than the entry they would evict. This is the same algorithm used by Caffeine (Java's highest-performance cache library), adapted for Rust with DashMap as the backing store.

For responses served over HTTP, values are pre-compressed with Brotli or Gzip at write time. ETags are computed using xxHash at insertion. On a cache hit, Cachee returns the pre-compressed payload and the precomputed ETag directly — no compression or hashing at read time.

The 16µs is the full read path — hash, lookup, admission metadata update, and value return. Not a cherry-picked hot-path number. Measured as p50 across 6.28 million requests on production workloads.

Throughput

Latency is only half the picture. Here is throughput on a single Cachee node:

ElastiCache's cache.r7g.xlarge tops out around 58,000 GET operations per second with 50 concurrent connections, as measured by redis-benchmark. You can push higher with pipelining or larger instances, but you are still paying the network round-trip on every operation.

Cachee's throughput scales with CPU cores because there is no shared network bottleneck. Each core reads from its own DashMap shard without contention. Adding cores adds throughput linearly until you saturate memory bandwidth, which on a c7i.metal-48xl happens well above 1 million ops/sec.

What This Means For Your Stack

Every microsecond of cache latency is p99 tail latency for your users. A single page load might hit the cache 20 to 50 times. If your Redis is cross-AZ — and most production multi-AZ deployments are, because that is the entire point of multi-AZ — you are paying 1 to 3 milliseconds per cache hit. Multiply that by 30 cache reads per request, and you are looking at 30 to 90 milliseconds of latency that comes entirely from your caching layer.

That is not a caching problem. That is a network problem. Redis itself is fast. The data structure operations inside Redis are sub-microsecond. But your application does not talk to Redis's data structures. It talks to Redis over TCP, through a kernel socket buffer, across a network interface, through a VPC routing table, and back. That round trip costs 339µs in the same AZ and 1-3ms across AZs.

Cachee eliminates that round trip entirely. The L1 cache is a DashMap in your application's process memory. The lookup is a pointer dereference, not a network call. For the hot-key working set that accounts for the vast majority of your cache reads, this is the difference between invisible cache latency and cache latency that shows up in your p99 traces.

This is not a Redis replacement. Cachee uses Redis (or any backing store) as its L2 persistence layer. The L1 in-process cache handles the hot reads. Cold keys fall through to L2. You get the durability of Redis with the latency of in-process memory. The 16µs number is for L1 hits. L2 misses still go to your backing store at whatever latency that implies.

If your application serves latency-sensitive traffic — API responses, ad bidding, real-time pricing, ML feature lookups, session data — every millisecond of cache overhead directly impacts your users. A 21x improvement on same-AZ reads and a 62-187x improvement on cross-AZ reads is not incremental. It is architectural.

Related Reading

Also Read

Make Your Cache 21x Faster

339µs per read with ElastiCache. 16µs with Cachee. Same hardware, same VPC. The difference is architecture, not configuration.

Start Free Trial Read Full Case Study