
16 Microseconds: Why We Stopped Using ElastiCache

We were paying AWS $2,300/month for ElastiCache. Every cache read cost us 339 microseconds. We built something 10,935x faster for the cost of a single EC2 instance.

This is the story of how a latency investigation turned into a company. It starts with a question that nobody on our team could answer: why is our cache so slow?

We were running a standard setup. A cluster of cache.r6g.xlarge nodes in us-east-1. Our application servers lived in the same availability zone. We had done everything right — connection pooling, pipelining, compressed values, the works. Redis was supposed to be our sub-millisecond layer. The marketing says so. The benchmarks say so. Every blog post and conference talk says so.

So I built a latency histogram from 10 million production reads. The p50 was 339 microseconds. The p99 was 1.2 milliseconds. The p99.9, in a bad minute, hit 4.7 milliseconds. For a cache. A cache.

That number — 339 microseconds — became an obsession. Not because Redis is slow. Redis is fast. The problem is physics.

The Problem Nobody Talks About

ElastiCache is a network hop. It does not matter how fast Redis processes your command internally. Your data has to leave your application process, traverse a TCP socket, pass through a kernel network stack, cross a physical network (even if it is "same-AZ"), arrive at the ElastiCache node, get processed, and make the entire return trip. Every single read. Every single time.

Same-AZ latency is the best case, and it is 200-500 microseconds. Cross-AZ — which AWS explicitly recommends for high availability with Multi-AZ replication — adds 1-3 milliseconds. Cross-region, if you are doing global replication, is 50-150 milliseconds. These are not edge cases. These are the default architectures AWS sells you.

The "sub-millisecond" marketing is technically correct and practically misleading. Redis processes commands in single-digit microseconds on the server side. But your application never sees server-side latency. It sees network round-trip latency plus serialization plus kernel context switches plus connection pool overhead. The number that matters — the number your users feel — is 100-1000x higher than the number in the benchmark.

We started measuring. Not Redis benchmarks — actual production reads from our application code to ElastiCache and back. The numbers were consistent across three months of data:

Every cached value on a hot path was adding a third of a millisecond to our response time. A page that hit the cache six times was spending 2 milliseconds just talking to Redis. That is 2 milliseconds of pure network overhead on data that is already computed, already stored, already "cached."

I kept coming back to the same thought: why are we sending data over a network to read data that already exists in our own process's memory space?

What We Built

We built a Rust-native cache engine with a Cachee-FLU admission policy and a tiered architecture that keeps hot data where it belongs — in-process, zero network hops, zero serialization.

The core is a DashMap — a lock-free concurrent hashmap from the Rust ecosystem — combined with ahash for hardware-accelerated hashing. DashMap uses fine-grained sharding internally, so 96 concurrent threads can read and write without contention. There is no mutex. There is no lock. There is no kernel involvement. A read is a hash computation, a shard lookup, and a pointer dereference. That is it.
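The sharding idea is simple to sketch. Here is a minimal illustration in Python (standing in for Rust purely for brevity; `ShardedMap` is an invented name, not Cachee's API): the key's hash selects one of N independent shards, so threads working on different keys rarely touch the same lock.

```python
import threading

class ShardedMap:
    """Toy sketch of fine-grained sharding, DashMap-style.

    The key's hash picks one of `n_shards` independent dicts, each
    guarded by its own lock, so operations on different shards never
    contend. (Real DashMap reads are cheaper still; Python needs the
    per-shard lock to make the sketch thread-safe.)
    """

    def __init__(self, n_shards=16):
        self.n_shards = n_shards
        self.shards = [{} for _ in range(n_shards)]
        self.locks = [threading.Lock() for _ in range(n_shards)]

    def _shard_index(self, key):
        return hash(key) % self.n_shards

    def get(self, key):
        i = self._shard_index(key)
        with self.locks[i]:
            return self.shards[i].get(key)

    def put(self, key, value):
        i = self._shard_index(key)
        with self.locks[i]:
            self.shards[i][key] = value

m = ShardedMap()
m.put("user:session:abc123", "eyJhbGciOiJIUzI1NiJ9...")
print(m.get("user:session:abc123"))
```

DashMap itself picks the shard count automatically and allows concurrent readers within a shard; the sketch only shows why sharding removes the single global lock.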

Cachee-FLU sits in front of the map as the admission policy. It uses a Count-Min Sketch to track access frequency with minimal memory overhead (8 bytes per counter, not per key) and a doorkeeper filter to prevent cache pollution from scan operations. New items are admitted only if their estimated frequency exceeds that of the item they would replace. This gives us near-optimal hit rates without the memory overhead of full LFU tracking.
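The admission mechanics can be sketched in a few lines of stdlib Python. This is not Cachee's implementation: `CountMinSketch` and `should_admit` are illustrative names, blake2b stands in for the engine's real hash functions, and the sizes are arbitrary.

```python
import hashlib

class CountMinSketch:
    """Tiny Count-Min Sketch: fixed memory, approximate key frequency."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        # One salted hash per row; the salt makes the rows independent.
        for i in range(self.depth):
            h = hashlib.blake2b(key.encode(), salt=bytes([i]) * 8).digest()
            yield i, int.from_bytes(h[:8], "little") % self.width

    def add(self, key):
        for i, j in self._indexes(key):
            self.rows[i][j] += 1

    def estimate(self, key):
        # Min over rows bounds the overcounting from hash collisions.
        return min(self.rows[i][j] for i, j in self._indexes(key))

def should_admit(sketch, candidate, victim):
    """Frequency-based admission: only evict if the newcomer is hotter."""
    return sketch.estimate(candidate) > sketch.estimate(victim)

sketch = CountMinSketch()
for _ in range(5):
    sketch.add("hot:key")        # genuinely hot key
sketch.add("scan:key")           # one-off key from a scan
print(should_admit(sketch, "hot:key", "scan:key"))
print(should_admit(sketch, "scan:key", "hot:key"))
```

The second call returns `False`: the scan key's estimated frequency never beats the resident hot key, so the scan cannot pollute the cache.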

# In-process read — no network, no serialization
# Application code calls directly into the Cachee engine

$ cachee-cli GET user:session:abc123
"eyJhbGciOiJIUzI1NiJ9..."
(31 ns)

# Compare: same key via ElastiCache
$ redis-cli -h my-cluster.cache.amazonaws.com GET user:session:abc123
"eyJhbGciOiJIUzI1NiJ9..."
(339 µs)

The result: 16-microsecond reads. Not as a benchmark outlier. As the production average across millions of operations. The 31-nanosecond figure is the raw L1 map read; sixteen microseconds is the full request-to-response path, including hash computation, shard lookup, and value copy.

The Architecture

Cachee is not a single layer. It is a tiered cache with automatic promotion and demotion:

On a cache miss at L1, Cachee checks L0, then L2, and automatically promotes the value back to L1 for subsequent reads. Hot data migrates up. Cold data stays in the lower tiers. You never make a network call for data that is being actively read.
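The promotion logic reads roughly like this sketch, which follows the lookup order described above (L1, then L0, then L2). The `TieredCache` name and the tier contents are illustrative, not Cachee's actual API.

```python
class TieredCache:
    """Sketch of tiered lookup with automatic promotion: on an L1 miss,
    check the remaining tiers in order, then promote the hit back into
    L1 so subsequent reads stay in the fastest tier."""

    def __init__(self):
        self.tiers = {"L1": {}, "L0": {}, "L2": {}}
        self.lookup_order = ["L1", "L0", "L2"]

    def get(self, key):
        for name in self.lookup_order:
            tier = self.tiers[name]
            if key in tier:
                value = tier[key]
                if name != "L1":              # hot data migrates up
                    self.tiers["L1"][key] = value
                return value
        return None                           # full miss

cache = TieredCache()
cache.tiers["L2"]["product:456"] = {"price": 19.99}
cache.get("product:456")                      # miss in L1, hit in L2, promote
print("product:456" in cache.tiers["L1"])
```

A real engine would also demote cold entries out of L1 under memory pressure; the sketch shows only the promotion half.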

Two features make this architecture practical at scale. First, pre-compressed responses: Cachee applies Brotli compression at write time, not read time. When a client requests a cached value, the compressed bytes are already stored and ready to serve. No CPU spent on compression in the hot path. Second, xxHash ETags: every cached value gets an xxHash64 fingerprint. Clients that send If-None-Match headers get 304 Not Modified responses without Cachee ever reading the value from the cache — just a hash comparison.
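Both features fit in a short stdlib sketch. Cachee uses Brotli and xxHash64; neither ships in Python's standard library, so this sketch substitutes zlib and blake2b. The mechanism, compress once at write time and answer revalidations by hash comparison alone, is the point, not the exact algorithms. All names here are invented for illustration.

```python
import hashlib
import zlib

store = {}

def put(key, value: bytes):
    store[key] = {
        # Compress once, at write time: no CPU spent in the read path.
        "compressed": zlib.compress(value),
        # Fingerprint the value for conditional requests.
        "etag": hashlib.blake2b(value, digest_size=8).hexdigest(),
    }

def get(key, if_none_match=None):
    entry = store[key]
    if if_none_match == entry["etag"]:
        return 304, None                 # hash comparison only, no value read
    return 200, entry["compressed"]      # pre-compressed bytes, ready to serve

put("page:home", b"<html>...</html>")
status, body = get("page:home")
print(status)                                            # first read
print(get("page:home", store["page:home"]["etag"])[0])   # revalidation
```

The first read returns 200 with the pre-compressed body; the revalidation with a matching ETag returns 304 without touching the stored value.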

# RESP protocol — drop-in Redis replacement
SET mykey "hello" EX 3600
+OK

GET mykey
$5
hello

# Cachee extensions: RESP-compatible
SET mykey "hello" EX 3600 COST 500 COMPRESS brotli
+OK

The Results That Changed Everything

We benchmarked Cachee against ElastiCache across every deployment topology AWS offers. The numbers are not modest improvements. They are order-of-magnitude differences.

Scenario                    ElastiCache   Cachee                       Speedup
Same-AZ                     339 µs        31 ns (raw L1 read)          10,935x
Cross-AZ (HA recommended)   1–3 ms        16 µs (full request path)    62–187x
Cross-region (global)       30–80 ms      16 µs (full request path)    1,875–5,000x
Public internet (edge)      50–150 ms     16 µs (full request path)    3,125–9,375x

Production numbers: 32,000,000 ops/sec sustained throughput. 99%+ hit rate on hot data (Cachee-FLU admission). 16 MB binary size. Four protocol interfaces: REST, RESP, gRPC, and QUIC.

The cost comparison was equally stark. Our ElastiCache cluster — three cache.r6g.xlarge nodes for HA — cost $2,300/month. Cachee runs as a sidecar on your existing compute. The incremental cost is the memory you allocate to the L1 cache, which for most workloads is 256 MB to 2 GB. That is memory you are already paying for on your EC2 instances or containers. The effective incremental cost is zero.

What We Didn't Expect

When your cache is in-process, you can do things that are physically impossible over a network. We did not plan most of these features. They emerged because the architecture made them trivial.

Speculative prefetch. Cachee observes access patterns in real time and learns sequential relationships between keys. If every time your application reads user:123:profile, it reads user:123:preferences within 5 milliseconds, Cachee prefetches the second key into L1 before you ask for it. Over a network, prefetching is a gamble — you are spending bandwidth on data the client might not need. In-process, prefetching is nearly free. A wrong guess costs 16 microseconds of wasted computation. A right guess saves a full read latency.
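The pattern-learning half of prefetch can be sketched simply: record which key follows which within a short window, and once a pair co-occurs often enough, treat the second key as a prefetch candidate. The `PrefetchLearner` name, the 5 ms window, and the count threshold are illustrative, not Cachee's actual tuning.

```python
import time
from collections import defaultdict

class PrefetchLearner:
    """Sketch of sequential-pattern learning: if key B reliably follows
    key A within `window` seconds, record the A -> B edge; once the edge
    has been seen `min_count` times, B becomes a prefetch candidate."""

    def __init__(self, window=0.005, min_count=3):
        self.window = window
        self.min_count = min_count
        self.last = None                      # (key, timestamp) of previous read
        self.follows = defaultdict(lambda: defaultdict(int))

    def observe(self, key, now=None):
        now = time.monotonic() if now is None else now
        if self.last and now - self.last[1] <= self.window:
            self.follows[self.last[0]][key] += 1
        self.last = (key, now)

    def prefetch_candidates(self, key):
        pairs = self.follows.get(key, {})
        return [k for k, n in pairs.items() if n >= self.min_count]

learner = PrefetchLearner()
for i in range(3):  # profile read followed by preferences within 1 ms
    learner.observe("user:123:profile", now=i * 1.0)
    learner.observe("user:123:preferences", now=i * 1.0 + 0.001)
print(learner.prefetch_candidates("user:123:profile"))
```

After three co-occurrences, a read of `user:123:profile` would trigger a background fetch of `user:123:preferences` into L1.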

Dependency graph cascade invalidation. Cachee tracks causal relationships between keys. When you invalidate product:456:price, every derived key that depends on it — the catalog page, the cart total, the recommendation score — gets invalidated atomically in the same operation. Over a network, cascade invalidation requires multiple round trips or a Lua script. In-process, it is a graph traversal that completes in single-digit microseconds.
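The traversal itself is a few lines. In this sketch (invented names, not Cachee's API), `depends_on` edges form a graph, and invalidating a key walks it, evicting every derived key in one pass.

```python
from collections import defaultdict

class DependencyCache:
    """Sketch of cascade invalidation via a dependency graph."""

    def __init__(self):
        self.data = {}
        self.dependents = defaultdict(set)    # key -> keys derived from it

    def put(self, key, value, depends_on=()):
        self.data[key] = value
        for parent in depends_on:
            self.dependents[parent].add(key)

    def invalidate(self, key):
        # Depth-first walk: evict the key and everything derived from it.
        stack, evicted = [key], []
        while stack:
            k = stack.pop()
            if k in self.data:
                del self.data[k]
                evicted.append(k)
            stack.extend(self.dependents.pop(k, ()))
        return evicted

c = DependencyCache()
c.put("product:456:price", 19.99)
c.put("catalog:page:1", "...", depends_on=["product:456:price"])
c.put("cart:total:u123", 19.99, depends_on=["product:456:price"])
print(sorted(c.invalidate("product:456:price")))
```

One `invalidate` call evicts the price and both derived keys; in-process, the whole walk is a handful of hash lookups.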

MVCC for snapshot isolation. Cachee supports multi-version concurrency control. A long-running request can read a consistent snapshot of the cache even while other requests are writing new values. This is impossible with Redis — there is no isolation between concurrent clients. In-process, MVCC is a reference count and a pointer swap.
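The "reference count and a pointer swap" idea can be shown with copy-on-write semantics. This sketch (invented `MVCCCache` name) leans on the fact that a snapshot is just a reference to an immutable version: writers build a new version and swap the pointer, so existing readers keep seeing their old version untouched.

```python
class MVCCCache:
    """Sketch of snapshot isolation via immutable versions: a snapshot
    is a reference to the current version map; writers never mutate a
    published version, they replace it (copy-on-write + pointer swap)."""

    def __init__(self):
        self.current = {}

    def snapshot(self):
        return self.current        # readers hold this reference for as
                                   # long as their request runs

    def set(self, key, value):
        new = dict(self.current)   # copy-on-write of the version map
        new[key] = value
        self.current = new         # the "pointer swap": publish atomically

cache = MVCCCache()
cache.set("k", 1)
snap = cache.snapshot()            # long-running request takes a snapshot
cache.set("k", 2)                  # concurrent write lands meanwhile
print(snap["k"], cache.current["k"])
```

The snapshot still reads 1 while new requests see 2. A production engine would version per-entry rather than copying the whole map, but the isolation property is the same.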

Cache contracts with freshness SLAs. You can declare that a key must be refreshed within a specific time window, and Cachee enforces it. If a key's freshness contract is violated — the background refresh failed, the upstream is down — Cachee can serve stale data with a degradation header, reject the read, or trigger a fallback. This turns your cache from a passive data store into an active system with auditable guarantees.
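A freshness contract reduces to an age check plus a policy at read time. In this sketch, `ContractCache`, the policy names, and the error type are all invented for illustration; the real feature also covers background refresh, which is omitted here.

```python
import time

class FreshnessError(Exception):
    """Raised when a read hits a violated contract under 'reject' policy."""

class ContractCache:
    """Sketch of freshness contracts: each entry records its write time
    and a max age; a violated contract either serves stale data with a
    flag the caller can audit, or rejects the read outright."""

    def __init__(self):
        self.entries = {}

    def put(self, key, value, max_age, now=None):
        now = time.monotonic() if now is None else now
        self.entries[key] = (value, now, max_age)

    def get(self, key, policy="serve_stale", now=None):
        now = time.monotonic() if now is None else now
        value, written, max_age = self.entries[key]
        if now - written <= max_age:
            return value, False            # fresh: contract satisfied
        if policy == "serve_stale":
            return value, True             # stale, flagged (degradation)
        raise FreshnessError(key)          # policy == "reject"

c = ContractCache()
c.put("fx:usd-eur", 0.92, max_age=1.0, now=0.0)
print(c.get("fx:usd-eur", now=0.5))        # within contract
print(c.get("fx:usd-eur", now=5.0))        # violated: stale-but-served
```

The boolean flag is the in-process analogue of the degradation header: the caller always knows whether the value it got was within contract.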

None of these features require exotic infrastructure. They require your cache to be in the same process as your application. That is the insight that started Cachee, and it is the insight that keeps compounding.

339 microseconds was the number that started the investigation. 16 microseconds is the number that ended it. The difference is not an optimization. It is an architectural decision: stop sending your data over a network to read it back.
