Your working set is growing faster than your RAM budget. Ten gigabytes in RAM costs $50–100/month. A hundred gigabytes costs $500–1,000. A terabyte is off the table. Meanwhile, every key that doesn’t fit in RAM falls through to Redis at 1–5ms — a network round-trip that your users feel on every request. There is a 1,000x gap between RAM speed and network speed, and until now, nothing has filled it.
The Gap Nobody Fills
Look at the latency landscape for a typical cache read. RAM gives you 1.5 microseconds. Redis gives you 1–5 milliseconds. That is a 1,000x difference. In CPU terms, it is the equivalent of jumping from L1 cache directly to main memory, skipping L2 and L3 entirely. No reasonable engineer would design a CPU that way. Yet every caching architecture in production does exactly this.
The reason is historical. When caching systems were designed, the options were RAM (fast, expensive) and network (slow, shared). NVMe SSDs did not exist, or they were too slow to matter. That is no longer the case. Modern NVMe drives deliver 10–50 microsecond random reads. That is 50–250x faster than a network round-trip to Redis, at 100x lower cost per GB than RAM.
NVMe is the L2 cache that the data layer has been missing.
The CPU Cache Hierarchy, Applied to Data
CPU designers solved this problem decades ago. They did not try to make L1 cache big enough to hold everything. They built a hierarchy: L1 is small and fast, L2 is larger and slightly slower, L3 is larger still. Each tier catches what the tier above it cannot hold. The result is that the effective capacity of the cache system is the sum of all tiers, while the effective latency for most accesses is close to L1 speed because the most frequently accessed data stays in the fastest tier.
L1: RAM (CacheeLFU) → 1.5µs (current Cachee)
L1.5: NVMe SSD → 10–50µs (new: hybrid tiering)
L2: Redis / ElastiCache → 1–5ms (network)
L3: Database → 5–50ms (disk)
This is the architecture Cachee now implements. Your hottest 5% of keys — the ones driving 95% of reads — stay in RAM at 1.5µs. The next 30% of keys, the warm tier, live on NVMe at 10–50µs. The remaining 65% fall through to Redis or your database. The application sees a single cache interface. The hierarchy is invisible.
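In code, the read path is just a walk down the tiers in latency order. A minimal Rust sketch, with plain `HashMap`s standing in for the real DashMap and NVMe backends (the `Tier` enum and function names here are illustrative, not Cachee's API):

```rust
use std::collections::HashMap;

/// Which tier served a read: RAM first, then NVMe, then the network fall-through.
#[derive(Debug, PartialEq)]
enum Tier {
    Ram,
    Nvme,
    Remote,
}

/// Walk the tiers in latency order and report where the key was found.
/// `Remote` models the Redis / database fall-through for cold keys.
fn tiered_get(
    key: &str,
    ram: &HashMap<String, Vec<u8>>,
    nvme: &HashMap<String, Vec<u8>>,
) -> (Tier, Option<Vec<u8>>) {
    if let Some(v) = ram.get(key) {
        return (Tier::Ram, Some(v.clone())); // ~1.5µs in the real system
    }
    if let Some(v) = nvme.get(key) {
        return (Tier::Nvme, Some(v.clone())); // ~10–50µs
    }
    (Tier::Remote, None) // caller falls through to Redis at 1–5ms
}
```

The application only ever issues one get; which tier answered is an internal detail.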
CacheeLFU Already Knows What Is Hot
The key insight is that Cachee’s eviction engine already has the information needed to make tiering decisions. CacheeLFU tracks both frequency and recency for every key. It knows which keys are hot (accessed constantly), which are warm (accessed occasionally), and which are cold (accessed rarely). The eviction engine has been using this information to decide what to keep in RAM. Now it also decides what to demote to NVMe instead of dropping entirely.
When RAM fills up and a new key needs space, the eviction engine picks the least valuable key. Previously, that key was evicted — gone from the L1 cache, requiring a 1–5ms Redis round-trip on its next access. With hybrid tiering, the key is written to NVMe asynchronously. Next time it is accessed, the read completes in 10–50µs instead of 1–5ms. If it gets accessed frequently enough (configurable threshold, default 3 hits), it is promoted back to RAM.
The demotion write is non-blocking. It happens via io_uring in the background. The hot path — serving the new key that triggered the eviction — is never delayed by a disk write.
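The demote-then-promote lifecycle can be sketched in a few lines. This is a simplified model, not Cachee's implementation: the real demotion write is an asynchronous io_uring submission to a memory-mapped file, and only the 3-hit default threshold comes from the description above.

```rust
use std::collections::HashMap;

/// Promotion threshold: default 3 hits, configurable in the real system.
const PROMOTE_THRESHOLD: u32 = 3;

struct HybridCache {
    ram: HashMap<String, Vec<u8>>,
    /// Warm tier: value plus a hit counter since demotion. A map stands in
    /// for the NVMe file so the promotion logic stays visible.
    nvme: HashMap<String, (Vec<u8>, u32)>,
}

impl HybridCache {
    /// Demotion: instead of dropping the evicted key, park it in the warm tier.
    fn demote(&mut self, key: String, value: Vec<u8>) {
        self.ram.remove(&key);
        self.nvme.insert(key, (value, 0));
    }

    /// Read path: a warm key that reaches the threshold moves back to RAM.
    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        if let Some(v) = self.ram.get(key) {
            return Some(v.clone()); // hot hit
        }
        if let Some((v, hits)) = self.nvme.get_mut(key) {
            *hits += 1;
            let value = v.clone();
            if *hits >= PROMOTE_THRESHOLD {
                self.nvme.remove(key);
                self.ram.insert(key.to_string(), value.clone());
            }
            return Some(value); // warm hit
        }
        None // cold: fall through to Redis / database
    }
}
```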
Pluggable Storage Backend
The tiering system is built on a StorageBackend trait. RAM, NVMe, and future storage technologies all implement the same interface: get, put, delete, capacity. The current implementation provides a RAM backend (DashMap, unchanged from existing Cachee) and an NVMe backend (memory-mapped file with io_uring).
The trait is designed for extensibility. CXL-attached memory is coming — byte-addressable, approximately 300ns latency, sitting between RAM and NVMe. Intel Optane persistent memory offers another point on the curve. Cloud providers are introducing specialized storage tiers (EBS io2, Azure Ultra Disk) that could serve as warm tiers for distributed deployments. Each of these becomes a new StorageBackend implementation without changes to the tiering logic, the eviction engine, or the application-facing API.
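A rough Rust rendering of that design, with the method set (get, put, delete, capacity) taken from the description above and everything else (signatures, struct names) assumed for illustration:

```rust
use std::collections::HashMap;

/// Sketch of the StorageBackend interface. The method set is from the post;
/// the exact signatures are assumptions.
trait StorageBackend {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn put(&mut self, key: &str, value: &[u8]);
    fn delete(&mut self, key: &str) -> bool;
    /// Total capacity of this tier, in bytes.
    fn capacity(&self) -> u64;
}

/// Toy in-memory backend. The real RAM tier is a sharded DashMap; the NVMe
/// tier is a memory-mapped file written via io_uring; a CXL or Optane tier
/// would be one more implementation of the same trait.
struct RamBackend {
    map: HashMap<String, Vec<u8>>,
    capacity_bytes: u64,
}

impl StorageBackend for RamBackend {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.map.get(key).cloned()
    }
    fn put(&mut self, key: &str, value: &[u8]) {
        self.map.insert(key.to_string(), value.to_vec());
    }
    fn delete(&mut self, key: &str) -> bool {
        self.map.remove(key).is_some()
    }
    fn capacity(&self) -> u64 {
        self.capacity_bytes
    }
}
```

Adding a new tier means writing one more impl block; the tiering logic, the eviction engine, and the application-facing API stay untouched.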
The Cost Math
Consider a 100GB working set — common for large catalog e-commerce, recommendation engines, or IoT device state stores. The traditional approach has two options, both bad:
- All-RAM: Keep 100GB in RAM at $500–1,000/month. Fast, but expensive.
- RAM + Redis: Keep 5–10GB in RAM, let the rest miss to Redis at 1–5ms. Cheap, but 90%+ of reads hit the network.
Hybrid tiering introduces a third option:
- RAM (5GB): $25–50/month. Hot keys at 1.5µs.
- NVMe (30GB): $1.50–3/month. Warm keys at 10–50µs.
- Redis (65GB): $32–65/month. Cold keys at 1–5ms.
- Total: $58–118/month — an 80–90% reduction versus all-RAM.
The 35% of keys that land in RAM + NVMe get sub-50µs P99 latency. The 95/5 rule means those 35% of keys serve the overwhelming majority of reads. The effective latency profile is nearly indistinguishable from all-RAM for most workloads, at a fraction of the cost.
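The blended total follows directly from the per-GB prices those line items imply (roughly $5–10 for RAM, $0.05–0.10 for NVMe, and $0.50–1.00 for Redis per GB-month; these are the figures used above, not general cloud prices). A quick check:

```rust
/// Blended monthly cost for a tiered split.
/// Each tier is (gigabytes, dollars per GB-month).
fn blended_cost(tiers: &[(f64, f64)]) -> f64 {
    tiers.iter().map(|(gb, price)| gb * price).sum()
}
```

With the low-end prices this comes to $59/month (the $58 above rounds the Redis line item down to $32) and $118/month at the high end, versus $500–1,000 for all-RAM.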
Who This Changes Everything For
Large catalog e-commerce: Millions of SKUs, but 5% drive 80% of pageviews. Hot products stay in RAM; the long-tail catalog sits on NVMe, so product pages for niche items load in 20 microseconds instead of the 2 milliseconds a Redis round-trip would cost.
Recommendation engines: Millions of user and item embeddings with power-law access patterns. Popular vectors in RAM, the full embedding table on NVMe. No more choosing between model size and latency.
IoT platforms: Millions of device states, but recently active devices are hot. Active devices in RAM, dormant devices on NVMe. When a dormant device wakes up, its state is available in 30µs, not 3ms.
Enterprise search: Millions of document indices, but trending topics drive most queries. Trending indices in RAM, the full corpus on NVMe. Every query gets sub-millisecond response, not just the popular ones.
Related Reading
- Hybrid Memory Tiering Product Page
- Hybrid Tiering Technical Specification
- How Cachee Works
- Predictive Caching
- Causal Dependency Graph
The Numbers That Matter
Cache performance discussions get philosophical fast. Here are the actual measured numbers from production deployments running on documented hardware, so you can compare against your own infrastructure instead of trusting marketing copy.
- L0 hot path GET: 28.9 nanoseconds on Apple M4 Max, single-threaded against pre-warmed in-memory cache. This is the floor — there's no faster way to read a key.
- L1 CacheeLFU GET: ~89 nanoseconds on AWS Graviton4 (c8g.metal-48xl). Sharded DashMap with admission filtering.
- Sustained throughput: 32 million ops/sec single-threaded on M4 Max, 7.41 million ops/sec at 16 workers on Graviton4 c8g.16xlarge.
- L2 fallback: Sub-millisecond hits against ElastiCache Redis 7.4 over same-AZ network when L1 misses cascade through.
The compounding effect matters more than any single number. A 28-nanosecond L0 hit means your application spends almost zero time on cache lookups in the hot path, leaving the CPU free for the actual business logic that generates revenue.
Average Latency Hides The Real Story
Average latency is the most misleading number in cache benchmarking. The percentile distribution is what actually breaks production systems. Tail latency — the slowest 0.1% of requests — is where users notice the lag and where SLAs get violated.
| Percentile | Network Redis (same-AZ) | In-process L0 |
|---|---|---|
| p50 | ~85 microseconds | 28.9 nanoseconds |
| p95 | ~140 microseconds | ~45 nanoseconds |
| p99 | ~280 microseconds | ~80 nanoseconds |
| p99.9 | ~1.2 milliseconds | ~150 nanoseconds |
The p99.9 spike on networked Redis isn't a bug — it's the cost of running a single-threaded event loop that occasionally blocks on background tasks like RDB snapshots, AOF rewrites, and expired-key sweeps. Cachee's L0 stays inside a few hundred nanoseconds because the hot-path read is a lock-free shard lookup with no background work scheduled on the same thread.
If your application is sensitive to tail latency — payments, real-time bidding, fraud detection, trading — the p99.9 number is the one to optimize against. Average latency improvements that don't move the tail are vanity metrics.
Memory Efficiency Is The Hidden Cost Lever
Throughput numbers get the headlines but memory efficiency determines your monthly bill. A cache that stores the same hot data in less RAM lets you run a smaller instance class — and on AWS that's the difference between profitable and breakeven for a lot of services.
Redis stores each key as a Simple Dynamic String with 16 bytes of header overhead, plus dictEntry pointers in the main hashtable, plus embedded TTL metadata. For 1KB values, the total per-entry footprint lands around 1,100–1,200 bytes once you account for hashtable load factor and slab fragmentation. At a million keys, that's roughly 1.2 GB of resident memory just for the data.
Cachee's L1 layer uses sharded DashMap entries with compact packing — a 64-bit key hash, value bytes, an 8-byte expiry timestamp, and a small frequency counter for the CacheeLFU admission filter. Per-entry overhead lands at roughly 40 bytes of structural data on top of the value itself. For the same million-key workload, that's about 13% smaller resident memory. On AWS ElastiCache pricing, that gap is the difference between needing a cache.r7g.large versus a cache.r7g.xlarge for borderline workloads.
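Those per-entry figures are easy to sanity-check. The overheads below are the estimates from the text, with ~176 B assumed for Redis structural overhead so the total lands in the quoted 1,100–1,200 byte range:

```rust
/// Resident bytes for `keys` entries of `value_bytes` each,
/// plus fixed structural overhead per entry.
fn resident_bytes(keys: u64, value_bytes: u64, overhead_bytes: u64) -> u64 {
    keys * (value_bytes + overhead_bytes)
}
```

With these inputs the gap between the two layouts comes out near 11%, in the ballpark of the figure quoted above; the exact saving depends on where in the 1,100–1,200 byte range a Redis entry actually lands.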
Observability And What To Measure
You can't tune what you can't measure. The four metrics that matter for any production cache deployment, in order of importance:
- Hit rate, broken down by key prefix or namespace. A global hit rate of 92% sounds great until you discover that one critical namespace is sitting at 40% and dragging your tail latency. Per-prefix hit rates expose which workloads are getting cache value and which aren't.
- Latency percentiles, not averages. p50, p95, p99, and p99.9 for both cache hits and cache misses. The cache miss latency is your fallback path performance — when the cache fails, this is what your users actually experience.
- Memory pressure and eviction rate. If your eviction rate is climbing while your hit rate stays flat, you're under-provisioned. If both are climbing, your access pattern shifted and you need to retune TTLs or rethink what you're caching.
- Stale-read rate. The percentage of cache hits that returned a value the application then discovered was stale. This is the canary for your invalidation strategy. If it's above 1%, your invalidation logic has a bug.
Cachee exposes all four out of the box via Prometheus metrics on the standard scrape endpoint, plus a real-time SSE stream for dashboards that need sub-second visibility. The right time to wire these into your monitoring stack is before the migration, not after the first incident.
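The first metric on that list, per-prefix hit rate, takes only a few lines to track in-process. A sketch, assuming the common `namespace:key` convention (Cachee's real metric names and labels are whatever its Prometheus endpoint exports):

```rust
use std::collections::HashMap;

#[derive(Default)]
struct HitStats {
    hits: u64,
    total: u64,
}

/// Per-namespace hit-rate tracking: everything before the first ':' in the
/// key is treated as the namespace (a convention assumed for illustration).
#[derive(Default)]
struct PrefixMetrics {
    by_prefix: HashMap<String, HitStats>,
}

impl PrefixMetrics {
    fn record(&mut self, key: &str, hit: bool) {
        let prefix = key.split(':').next().unwrap_or(key).to_string();
        let stats = self.by_prefix.entry(prefix).or_default();
        stats.total += 1;
        if hit {
            stats.hits += 1;
        }
    }

    /// Hit rate for one namespace, or None if it has seen no traffic.
    fn hit_rate(&self, prefix: &str) -> Option<f64> {
        self.by_prefix
            .get(prefix)
            .map(|s| s.hits as f64 / s.total as f64)
    }
}
```

Exporting `hits` and `total` as two counters labeled by prefix lets the rate be computed at query time instead of stored.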
Stop Choosing Between Speed and Scale
Hybrid memory tiering. 1.5µs for hot keys. 10–50µs for warm keys. 100x larger working sets at 80–90% lower cost.
Start Free Trial · Schedule Demo