Engineering

Hybrid Memory Tiering: RAM + NVMe for 100x Larger Working Sets

Your working set is growing faster than your RAM budget. Ten gigabytes in RAM costs $50–100/month. A hundred gigabytes costs $500–1,000. A terabyte is off the table. Meanwhile, every key that doesn’t fit in RAM falls through to Redis at 1–5ms — a network round-trip that your users feel on every request. There is a roughly 1,000x gap between RAM speed and network speed, and until now, nothing has filled it.

The Gap Nobody Fills

Look at the latency landscape for a typical cache read. RAM gives you 1.5 microseconds. Redis gives you 1–5 milliseconds. That is a 1,000x difference. In CPU terms, it is the equivalent of jumping from L1 cache directly to main memory, skipping L2 and L3 entirely. No reasonable engineer would design a CPU that way. Yet every caching architecture in production does exactly this.

The reason is historical. When caching systems were designed, the options were RAM (fast, expensive) and network (slow, shared). NVMe SSDs did not exist, or they were too slow to matter. That is no longer the case. Modern NVMe drives deliver 10–50 microsecond random reads. That is 50–250x faster than a network round-trip to Redis, at 100x lower cost per GB than RAM.

NVMe is the L2 cache that the data layer has been missing.

The CPU Cache Hierarchy, Applied to Data

CPU designers solved this problem decades ago. They did not try to make L1 cache big enough to hold everything. They built a hierarchy: L1 is small and fast, L2 is larger and slightly slower, L3 is larger still. Each tier catches what the tier above it cannot hold. The result is that the effective capacity of the cache system is the sum of all tiers, while the effective latency for most accesses is close to L1 speed because the most frequently accessed data stays in the fastest tier.

L0: Zero-Copy Shared Memory <1µs (future)
L1: RAM (W-TinyLFU) 1.5µs (current Cachee)
L1.5: NVMe SSD 10–50µs (NEW — hybrid tiering)
L2: Redis / ElastiCache 1–5ms (network)
L3: Database 5–50ms (disk)

This is the architecture Cachee now implements. Your hottest 5% of keys — the ones driving 95% of reads — stay in RAM at 1.5µs. The next 30% of keys, the warm tier, live on NVMe at 10–50µs. The remaining 65% fall through to Redis or your database. The application sees a single cache interface. The hierarchy is invisible.
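A minimal sketch of that read path, with plain HashMaps standing in for both tiers (the real NVMe tier is a memory-mapped file driven by io_uring); `TieredCache` and `Hit` are hypothetical names, not Cachee's actual API:

```rust
use std::collections::HashMap;

// Sketch of the tiered read path: check the RAM tier first, then the
// NVMe warm tier, then report a miss so the caller can fall through
// to Redis or the database. Both tiers are in-memory stand-ins here.
struct TieredCache {
    ram: HashMap<String, Vec<u8>>,  // L1: ~1.5µs reads
    nvme: HashMap<String, Vec<u8>>, // L1.5: ~10–50µs reads
}

enum Hit {
    Ram(Vec<u8>),
    Nvme(Vec<u8>),
    Miss, // fall through to Redis (1–5ms) or the database
}

impl TieredCache {
    fn get(&self, key: &str) -> Hit {
        if let Some(v) = self.ram.get(key) {
            return Hit::Ram(v.clone());
        }
        if let Some(v) = self.nvme.get(key) {
            return Hit::Nvme(v.clone());
        }
        Hit::Miss
    }
}
```

The application only ever calls `get`; which tier answered is an internal detail.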

W-TinyLFU Already Knows What Is Hot

The key insight is that Cachee’s eviction engine already has the information needed to make tiering decisions. W-TinyLFU tracks both frequency and recency for every key. It knows which keys are hot (accessed constantly), which are warm (accessed occasionally), and which are cold (accessed rarely). The eviction engine has been using this information to decide what to keep in RAM. Now it also decides what to demote to NVMe instead of dropping entirely.

When RAM fills up and a new key needs space, the eviction engine picks the least valuable key. Previously, that key was evicted — gone from the L1 cache, requiring a 1–5ms Redis round-trip on its next access. With hybrid tiering, the key is written to NVMe asynchronously. Next time it is accessed, the read completes in 10–50µs instead of 1–5ms. If it gets accessed frequently enough (configurable threshold, default 3 hits), it is promoted back to RAM.

The demotion write is non-blocking. It happens via io_uring in the background. The hot path — serving the new key that triggered the eviction — is never delayed by a disk write.
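The demote-then-promote cycle can be sketched as follows. This is a synchronous, in-memory stand-in — the real demotion write goes through io_uring in the background — and `Tiering`, its fields, and the victim selection are simplified assumptions, not Cachee's actual implementation. The promotion threshold of 3 hits is the default the post describes.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for demote-instead-of-evict plus hit-count
// promotion (configurable threshold, default 3 per the post).
const PROMOTE_THRESHOLD: u32 = 3;

struct Tiering {
    ram: HashMap<String, Vec<u8>>,
    warm: HashMap<String, (Vec<u8>, u32)>, // value + hits since demotion
    ram_capacity: usize,                   // max entries in RAM (sketch)
}

impl Tiering {
    fn put(&mut self, key: String, value: Vec<u8>) {
        if self.ram.len() >= self.ram_capacity && !self.ram.contains_key(&key) {
            // Stand-in victim selection: take any key. The real engine
            // asks W-TinyLFU for the least valuable one. Either way,
            // the victim is demoted to the warm tier, not dropped.
            if let Some(victim) = self.ram.keys().next().cloned() {
                if let Some(val) = self.ram.remove(&victim) {
                    self.warm.insert(victim, (val, 0));
                }
            }
        }
        self.ram.insert(key, value);
    }

    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        if let Some(v) = self.ram.get(key) {
            return Some(v.clone());
        }
        // Warm hit: count it, and promote back to RAM at the threshold.
        // (This sketch lets promotion briefly exceed ram_capacity.)
        let (out, promote) = match self.warm.get_mut(key) {
            Some((val, hits)) => {
                *hits += 1;
                (val.clone(), *hits >= PROMOTE_THRESHOLD)
            }
            None => return None,
        };
        if promote {
            if let Some((val, _)) = self.warm.remove(key) {
                self.ram.insert(key.to_string(), val);
            }
        }
        Some(out)
    }
}
```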

Key insight: The eviction engine already classifies keys by temperature. Hybrid tiering gives it a new option: demote instead of evict. The warm tier is not an add-on. It is a natural extension of what the eviction engine already does.

Pluggable Storage Backend

The tiering system is built on a StorageBackend trait. RAM, NVMe, and future storage technologies all implement the same interface: get, put, delete, capacity. The current implementation provides a RAM backend (DashMap, unchanged from existing Cachee) and an NVMe backend (memory-mapped file with io_uring).

The trait is designed for extensibility. CXL-attached memory is coming — byte-addressable, approximately 300ns latency, sitting between RAM and NVMe. Intel Optane persistent memory offers another point on the curve. Cloud providers are introducing specialized storage tiers (EBS io2, Azure Ultra Disk) that could serve as warm tiers for distributed deployments. Each of these becomes a new StorageBackend implementation without changes to the tiering logic, the eviction engine, or the application-facing API.
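A sketch of what that trait might look like, paired with a minimal RAM backend. The post only names the four operations; the exact signatures below are guesses, and the real RAM backend is DashMap rather than a plain HashMap.

```rust
use std::collections::HashMap;

// Guessed shape of the StorageBackend trait: every tier (RAM, NVMe,
// future CXL or cloud tiers) implements the same four operations.
trait StorageBackend {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn put(&mut self, key: String, value: Vec<u8>) -> bool;
    fn delete(&mut self, key: &str) -> bool;
    fn capacity(&self) -> usize; // bytes this tier can hold
}

// Minimal RAM backend as an illustration.
struct RamBackend {
    map: HashMap<String, Vec<u8>>,
    cap: usize,
}

impl StorageBackend for RamBackend {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.map.get(key).cloned()
    }
    fn put(&mut self, key: String, value: Vec<u8>) -> bool {
        self.map.insert(key, value);
        true
    }
    fn delete(&mut self, key: &str) -> bool {
        self.map.remove(key).is_some()
    }
    fn capacity(&self) -> usize {
        self.cap
    }
}
```

A new tier — CXL memory, Optane, a cloud block store — slots in as another `impl StorageBackend` while the tiering logic stays untouched.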

The Cost Math

Consider a 100GB working set — common for large catalog e-commerce, recommendation engines, or IoT device state stores. The traditional approach has two options, both bad:

- All-RAM: hold the full 100GB in memory at $500–1,000/month.
- RAM + Redis: keep a small slice in RAM and take a 1–5ms network round-trip on every access that misses it.

Hybrid tiering introduces a third option:

- The hottest 5% (5GB) in RAM, the next 30% (30GB) on NVMe at roughly 100x lower cost per GB, and the remaining 65% falling through to Redis or the database as before.
The 35% of keys that land in RAM + NVMe get sub-50µs P99 latency. The 95/5 rule means those 35% of keys serve the overwhelming majority of reads. The effective latency profile is nearly indistinguishable from all-RAM for most workloads, at a fraction of the cost.
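The arithmetic works out as follows, using the post's own figures — $5–10/GB/month for RAM (from "$50–100/month for 10GB") and NVMe at roughly 100x less per GB. This is a back-of-envelope illustration, not a price quote:

```rust
// Monthly cost of a tiered deployment, given a per-GB RAM price and
// the post's ~100x-cheaper claim for NVMe. Illustrative only.
fn monthly_cost(ram_gb: f64, nvme_gb: f64, ram_per_gb: f64) -> f64 {
    let nvme_per_gb = ram_per_gb / 100.0;
    ram_gb * ram_per_gb + nvme_gb * nvme_per_gb
}

// All-RAM: 100GB in memory.
// Hybrid: the hottest 5GB in RAM plus the next 30GB on NVMe
// (the 5% / 30% split described above).
```

At $5/GB, all-RAM comes to $500/month while the hybrid split comes to about $26.50/month — comfortably inside the 80–90% savings the post claims, even before Redis costs for the cold tail.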

Who This Changes Everything For

Large catalog e-commerce: Millions of SKUs, but 5% drive 80% of pageviews. Hot products stay in RAM; the long-tail catalog sits on NVMe, so product pages for niche items load from cache in 20 microseconds instead of a 2-millisecond round-trip to Redis.

Recommendation engines: Millions of user and item embeddings with power-law access patterns. Popular vectors in RAM, the full embedding table on NVMe. No more choosing between model size and latency.

IoT platforms: Millions of device states, but recently active devices are hot. Active devices in RAM, dormant devices on NVMe. When a dormant device wakes up, its state is available in 30µs, not 3ms.

Enterprise search: Millions of document indices, but trending topics drive most queries. Trending indices in RAM, the full corpus on NVMe. Every query gets sub-millisecond response, not just the popular ones.

The question is not whether you can fit your working set in RAM. It is whether you need to. With hybrid tiering, the answer is no. Keep the hot keys fast, the warm keys fast enough, and the cold keys where they have always been. Same interface. 100x larger effective capacity. 80–90% lower cost.

Stop Choosing Between Speed and Scale.

Hybrid memory tiering. 1.5µs for hot keys. 10–50µs for warm keys. 100x larger working sets at 80–90% lower cost.
