MVCC for Caches: Zero-Contention Reads at 96-Worker Scale

DashMap is the best concurrent hash map in the Rust ecosystem. Its sharded architecture delivers sub-microsecond reads under heavy concurrency, and it is the foundation of Cachee's in-process cache engine. But at 96 workers on Graviton4 with a 30% write ratio, we measured something that benchmarks with pure-read workloads never reveal: same-shard contention adding 1–3 microseconds of P99 jitter. For most workloads, this is invisible. For HFT, ML feature stores, and real-time pricing engines, it is the difference between acceptable and unacceptable. We built MVCC into the cache engine to eliminate it.

The Ceiling You Don't See in Read-Only Benchmarks

DashMap divides keys into shards, each protected by a read-write lock. Multiple readers can hold the same shard lock concurrently, but a writer requires exclusive access. This is excellent engineering — it reduces contention by a factor equal to the number of shards. At 64 shards (the default), the probability of two operations colliding on the same shard is roughly 1 in 64.

The problem is that probability compounds with worker count. At 96 workers performing tight loops of reads and writes (the FHE/NTT batch pipeline, for example), the expected number of same-shard collisions per second is not negligible. It is statistically frequent. A reader that arrives during the ~13µs window of a concurrent write to the same shard will block until the write completes. The P50 is unaffected — most reads hit uncontended shards. But the P99 tells a different story.

We measured this on a c8g.metal-48xl (192 vCPUs, Graviton4) running 96 workers with a workload mix of 70% reads and 30% writes:

P50 read latency: ~0.0015ms (identical to pure-read benchmarks)
P99 read latency: ~4µs (vs ~1.5µs in pure-read benchmarks)
Cause: same-shard RwLock contention during concurrent writes

A 2.5µs P99 increase sounds trivial until you consider the workloads where it matters. An HFT system with a 10µs tick-to-trade budget just lost 25% of its latency budget to lock contention in the cache. An ML inference pipeline reading features while a streaming ingestion pipeline writes them sees unpredictable jitter in a path that is supposed to be deterministic. A pricing engine reading prices for order validation while market feeds write continuous updates gets occasional stalls on the read path.

            The core insight: Low contention is not zero contention. DashMap's sharded locking reduces contention by 64x. MVCC eliminates it. For workloads where microseconds are P&L, the difference between “reduced” and “eliminated” is the entire product decision.
        

How MVCC Eliminates Read-Path Contention

Multi-Version Concurrency Control borrows a technique from database engines (PostgreSQL, MySQL/InnoDB, Oracle) and applies it to in-process cache reads. The core idea is simple: instead of overwriting a value in place (which requires a write lock that blocks readers), each write creates a new version of the value. Readers see a consistent snapshot at their read timestamp. No lock required.

The implementation has three components:

Version Chains

Each key maintains a linked chain of versions, ordered newest to oldest. When a writer updates a key, it allocates a new version struct (value + timestamp + epoch = 24 bytes of overhead), sets the value, and atomically swaps the head pointer to the new version. The previous version remains accessible to any reader that started before the write.

Epoch-Based Reads

When a reader begins, it captures the current global epoch (a single atomic load — one CPU instruction on ARM). It then traverses the version chain and returns the most recent version whose epoch is less than or equal to the reader's epoch. This guarantees a consistent snapshot: the reader sees the state of the cache as it existed at the moment it started reading, regardless of any concurrent writes.

The read path is completely lock-free. Not “mostly lock-free” like DashMap (which is lock-free until a concurrent write to the same shard acquires the write lock). Unconditionally lock-free. A reader never waits on any writer, on any shard, under any level of concurrency.

Epoch-Based Garbage Collection

Old versions cannot live forever. A background GC thread runs every 100µs (configurable) and scans version chains for versions that are no longer visible to any active reader. When all active readers have epoch greater than a version's epoch, that version is reclaimed. The GC is non-blocking — it operates on a separate thread and never pauses the read or write path.

Under sustained load, versions are GC'd within 100–500µs of becoming unreachable. Memory overhead stays bounded at approximately 24 bytes per version per key. With the default of 2 versions per key and 10 million keys, the total version overhead is ~480 MB.

Before and After: The Numbers

Same hardware (c8g.metal-48xl), same worker count (96), same workload mix (70/30 read/write):

P50 read latency: ~0.0015ms → ~0.0015ms (unchanged)
P99 read latency: ~4µs → ~1.8µs (55% jitter reduction)
Write latency: 0.013ms → 0.014ms (+0.001ms for version allocation)
Read-path locks: per-shard RwLock → zero (completely lock-free)

The P50 is unchanged because the common case (uncontended reads) was already fast. The P99 drops by 55% because the uncommon-but-critical case (reads that collide with writes on the same shard) is eliminated entirely. The write latency increase of 0.001ms is the cost of allocating a 24-byte version struct and performing one atomic CAS. For write-latency-sensitive workloads, this is invisible.

Who Needs This

MVCC is not for every workload. If your write ratio is below 5%, DashMap contention is negligible and MVCC adds memory overhead for no measurable benefit. Enable MVCC when P99 jitter under concurrent writes is a measured problem, not a theoretical concern.

The workloads where it matters most:

HFT and algorithmic trading: Deterministic latency is not optional. A 2µs P99 spike during a critical position lookup can translate directly into P&L impact. MVCC guarantees that price reads, position checks, and risk limit lookups never block on concurrent market data writes.
ML feature stores: Real-time feature ingestion produces write ratios of 30–50%. Inference reads must be low-latency and deterministic. MVCC ensures that the inference path never stalls waiting for a feature write to release a shard lock.
IoT device state: Millions of devices writing state updates while fleet queries read aggregate and per-device data. The write volume is enormous, the read latency requirement is strict, and the devices never stop sending.
Real-time pricing engines: Continuous price feeds write updated prices while downstream services read them for order validation, display, and risk calculation. Every shard-level collision is a potential stall in the order path.

            The decision rule: Measure your P99 read latency under your actual read/write ratio at your actual worker count. If it is materially higher than your P50, you have lock contention. CONFIG SET mvcc.enabled true eliminates it.
        

Configuration

MVCC is transparent to the client. No API changes, no code changes. Three config parameters:

CONFIG SET mvcc.enabled true          # Enable MVCC
CONFIG SET mvcc.max_versions 2        # Versions retained per key (default: 2)
CONFIG SET mvcc.gc_interval_us 100    # GC scan interval in microseconds (default: 100)

All three are changeable at runtime. Enabling MVCC does not require a restart or data migration. Disabling it collapses all version chains back to single versions during the next GC cycle.

The Numbers That Matter

Cache performance discussions get philosophical fast. Here are the actual measured numbers from production deployments running on documented hardware, so you can compare against your own infrastructure instead of trusting marketing copy.

L0 hot path GET: 28.9 nanoseconds on Apple M4 Max, single-threaded against pre-warmed in-memory cache. This is the floor — there's no faster way to read a key.
L1 CacheeLFU GET: ~89 nanoseconds on AWS Graviton4 (c8g.metal-48xl). Sharded DashMap with admission filtering.
Sustained throughput: 32 million ops/sec single-threaded on M4 Max, 7.41 million ops/sec at 16 workers on Graviton4 c8g.16xlarge.
L2 fallback: Sub-millisecond hits against ElastiCache Redis 7.4 over same-AZ network when L1 misses cascade through.

The compounding effect matters more than any single number. A 28-nanosecond L0 hit means your application spends almost zero time on cache lookups in the hot path, leaving the CPU free for the actual business logic that generates revenue.

When Caching Actually Helps

Caching isn't free. It introduces a consistency problem you didn't have before. Before adding any cache layer, the question to answer is whether your workload actually benefits from caching at all.

Caching helps when three conditions hold simultaneously. First, your reads dramatically outnumber your writes — typically a 10:1 ratio or higher. Second, the same keys get read repeatedly within a window where a cached value remains valid. Third, the cost of computing or fetching the underlying value is meaningfully higher than the cost of a cache lookup. Database queries that hit secondary indexes, RPC calls to slow upstream services, expensive computed aggregations, and rendered template fragments all qualify.

Caching hurts when those conditions don't hold. Write-heavy workloads suffer because every write invalidates a cache entry, multiplying your work. Workloads with poor key locality suffer because the cache wastes memory storing entries that never get reused. Workloads where the underlying fetch is already fast — well-indexed primary key lookups against a properly tuned database, for example — gain almost nothing from caching and inherit the consistency complexity for no reason.

The honest first step before any cache deployment is measuring your actual read/write ratio, key access distribution, and underlying fetch latency. If your read/write ratio is below 5:1 or your underlying database is already returning results in single-digit milliseconds, the engineering time is better spent elsewhere.

Memory Efficiency Is The Hidden Cost Lever

Throughput numbers get the headlines but memory efficiency determines your monthly bill. A cache that stores the same hot data in less RAM lets you run a smaller instance class — and on AWS that's the difference between profitable and breakeven for a lot of services.

Redis stores each key as a Simple Dynamic String with 16 bytes of header overhead, plus dictEntry pointers in the main hashtable, plus embedded TTL metadata. For 1KB values, per-entry overhead lands around 1100-1200 bytes once you account for hashtable load factor and slab fragmentation. At a million keys, that's roughly 1.2 GB of resident memory just for the data.

Cachee's L1 layer uses sharded DashMap entries with compact packing — a 64-bit key hash, value bytes, an 8-byte expiry timestamp, and a small frequency counter for the CacheeLFU admission filter. Per-entry overhead lands at roughly 40 bytes of structural data on top of the value itself. For the same million-key workload, that's about 13% smaller resident memory. On AWS ElastiCache pricing, that gap is the difference between needing a cache.r7g.large versus a cache.r7g.xlarge for borderline workloads.

What This Actually Costs

Concrete pricing math beats hypothetical. A typical SaaS workload with 1 billion cache operations per month, average 800-byte values, and a 5 GB hot working set currently runs on AWS ElastiCache cache.r7g.xlarge primary plus a read replica — roughly $480 per month for the two nodes, plus cross-AZ data transfer charges that quietly add another $50-150 per month depending on access patterns.

Migrating the hot path to an in-process L0/L1 cache and keeping ElastiCache as a cold L2 fallback drops the dedicated cache spend to $120-180 per month. For workloads where the hot working set fits inside the application's existing memory budget, you can eliminate the dedicated cache tier entirely. The cache becomes a library you link into your binary instead of a separate service to operate.

Compounded over twelve months, that's $3,600 to $4,500 per year on a single small workload. Multiply across a fleet of services and the savings start showing up in finance team conversations. The bigger savings usually come from eliminating cross-AZ data transfer charges, which Redis-as-a-service architectures incur on every read that crosses an availability zone.

Eliminate Lock Contention. Ship Deterministic Latency.

MVCC for the cache engine. Zero-contention reads under concurrent writes. One config flag to enable.

Start Free Trial Schedule Demo