Your working set is growing faster than your RAM budget. Ten gigabytes in RAM costs $50–100/month. A hundred gigabytes costs $500–1,000. A terabyte is off the table. Meanwhile, every key that doesn’t fit in RAM falls through to Redis at 1–5ms, a network round-trip that your users feel on every request. There is a 1,000x gap between RAM speed and network speed, and until now, nothing has filled it.
The Gap Nobody Fills
Look at the latency landscape for a typical cache read. RAM gives you 1.5 microseconds. Redis gives you 1–5 milliseconds. That is a 1,000x difference. In CPU terms, it is the equivalent of jumping from L1 cache directly to main memory, skipping L2 and L3 entirely. No reasonable engineer would design a CPU that way. Yet every caching architecture in production does exactly this.
The reason is historical. When caching systems were designed, the options were RAM (fast, expensive) and network (slow, shared). NVMe SSDs did not exist, or they were too slow to matter. That is no longer the case. Modern NVMe drives deliver 10–50 microsecond random reads. That is 50–250x faster than a network round-trip to Redis, at 100x lower cost per GB than RAM.
NVMe is the L2 cache that the data layer has been missing.
The CPU Cache Hierarchy, Applied to Data
CPU designers solved this problem decades ago. They did not try to make L1 cache big enough to hold everything. They built a hierarchy: L1 is small and fast, L2 is larger and slightly slower, L3 is larger still. Each tier catches what the tier above it cannot hold. The result is that the effective capacity of the cache system is the sum of all tiers, while the effective latency for most accesses is close to L1 speed because the most frequently accessed data stays in the fastest tier.
L1: RAM (W-TinyLFU) → 1.5µs (current Cachee)
L1.5: NVMe SSD → 10–50µs (NEW: hybrid tiering)
L2: Redis / ElastiCache → 1–5ms (network)
L3: Database → 5–50ms (disk)
This is the architecture Cachee now implements. Your hottest 5% of keys — the ones driving 95% of reads — stay in RAM at 1.5µs. The next 30% of keys, the warm tier, live on NVMe at 10–50µs. The remaining 65% fall through to Redis or your database. The application sees a single cache interface. The hierarchy is invisible.
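The read path above can be sketched as a single get() that walks the tiers in order. This is a minimal illustration, not Cachee’s actual API: plain HashMaps stand in for the real RAM and NVMe backends, and the Redis fall-through is elided.

```rust
use std::collections::HashMap;

// Illustrative tiered cache: RAM first, NVMe second, network last.
struct TieredCache {
    ram: HashMap<String, Vec<u8>>,  // L1: hottest ~5% of keys, ~1.5µs
    nvme: HashMap<String, Vec<u8>>, // L1.5: warm ~30% of keys, ~10–50µs
}

impl TieredCache {
    // Returns the value plus which tier served it, or None if the
    // read must fall through to Redis / the database (not shown).
    fn get(&self, key: &str) -> Option<(&'static str, &Vec<u8>)> {
        if let Some(v) = self.ram.get(key) {
            return Some(("ram", v));
        }
        if let Some(v) = self.nvme.get(key) {
            return Some(("nvme", v));
        }
        None // L2/L3: network round-trip territory
    }
}

fn main() {
    let mut cache = TieredCache { ram: HashMap::new(), nvme: HashMap::new() };
    cache.ram.insert("user:1".into(), b"hot".to_vec());
    cache.nvme.insert("sku:42".into(), b"warm".to_vec());
    assert_eq!(cache.get("user:1").unwrap().0, "ram");
    assert_eq!(cache.get("sku:42").unwrap().0, "nvme");
    assert!(cache.get("sku:999").is_none()); // would fall through to Redis
}
```

The application calls one get(); which tier answered is an internal detail.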
W-TinyLFU Already Knows What Is Hot
The key insight is that Cachee’s eviction engine already has the information needed to make tiering decisions. W-TinyLFU tracks both frequency and recency for every key. It knows which keys are hot (accessed constantly), which are warm (accessed occasionally), and which are cold (accessed rarely). The eviction engine has been using this information to decide what to keep in RAM. Now it also decides what to demote to NVMe instead of dropping entirely.
When RAM fills up and a new key needs space, the eviction engine picks the least valuable key. Previously, that key was evicted — gone from the L1 cache, requiring a 1–5ms Redis round-trip on its next access. With hybrid tiering, the key is written to NVMe asynchronously. Next time it is accessed, the read completes in 10–50µs instead of 1–5ms. If it gets accessed frequently enough (configurable threshold, default 3 hits), it is promoted back to RAM.
The demotion write is non-blocking. It happens via io_uring in the background. The hot path — serving the new key that triggered the eviction — is never delayed by a disk write.
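Demote-on-evict plus promote-after-N-hits can be sketched as follows. The struct layout and field names are assumptions; in the real system the demotion write is an asynchronous io_uring write to NVMe, modeled here as a plain insert.

```rust
use std::collections::HashMap;

// Sketch of the warm (NVMe) tier: each entry tracks hits since demotion.
struct WarmTier {
    entries: HashMap<String, (Vec<u8>, u32)>, // value + hit count
    promote_after: u32,                       // default 3 per the text
}

impl WarmTier {
    // Eviction from RAM lands here instead of being dropped entirely.
    fn demote(&mut self, key: String, value: Vec<u8>) {
        self.entries.insert(key, (value, 0));
    }

    // Returns the value and whether the key just crossed the promotion
    // threshold and should move back to RAM.
    fn read(&mut self, key: &str) -> Option<(Vec<u8>, bool)> {
        let (value, hits) = self.entries.get_mut(key)?;
        *hits += 1;
        let promote = *hits >= self.promote_after;
        let v = value.clone();
        if promote {
            self.entries.remove(key); // RAM tier takes ownership
        }
        Some((v, promote))
    }
}

fn main() {
    let mut warm = WarmTier { entries: HashMap::new(), promote_after: 3 };
    warm.demote("sku:42".into(), b"v".to_vec());
    assert_eq!(warm.read("sku:42").unwrap().1, false); // hit 1
    assert_eq!(warm.read("sku:42").unwrap().1, false); // hit 2
    assert_eq!(warm.read("sku:42").unwrap().1, true);  // hit 3: promote
    assert!(warm.read("sku:42").is_none()); // now back in RAM
}
```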
Pluggable Storage Backend
The tiering system is built on a StorageBackend trait. RAM, NVMe, and future storage technologies all implement the same interface: get, put, delete, capacity. The current implementation provides a RAM backend (DashMap, unchanged from existing Cachee) and an NVMe backend (memory-mapped file with io_uring).
The trait is designed for extensibility. CXL-attached memory is coming — byte-addressable, approximately 300ns latency, sitting between RAM and NVMe. Intel Optane persistent memory offers another point on the curve. Cloud providers are introducing specialized storage tiers (EBS io2, Azure Ultra Disk) that could serve as warm tiers for distributed deployments. Each of these becomes a new StorageBackend implementation without changes to the tiering logic, the eviction engine, or the application-facing API.
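A sketch of what the trait might look like, with an in-memory stand-in for the RAM backend. The text names only get, put, delete, and capacity; the exact signatures here are assumptions.

```rust
use std::collections::HashMap;

// Assumed shape of the StorageBackend trait described above.
trait StorageBackend {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn put(&mut self, key: String, value: Vec<u8>);
    fn delete(&mut self, key: &str) -> bool;
    fn capacity(&self) -> usize; // bytes this tier can hold
}

// RAM backend (DashMap in Cachee) modeled with a plain HashMap.
struct RamBackend {
    map: HashMap<String, Vec<u8>>,
    cap_bytes: usize,
}

impl StorageBackend for RamBackend {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.map.get(key).cloned()
    }
    fn put(&mut self, key: String, value: Vec<u8>) {
        self.map.insert(key, value);
    }
    fn delete(&mut self, key: &str) -> bool {
        self.map.remove(key).is_some()
    }
    fn capacity(&self) -> usize {
        self.cap_bytes
    }
}

fn main() {
    // The tiering logic only sees `dyn StorageBackend`, so a CXL,
    // Optane, or cloud-disk backend slots in without touching callers.
    let mut l1: Box<dyn StorageBackend> = Box::new(RamBackend {
        map: HashMap::new(),
        cap_bytes: 5 * 1024 * 1024 * 1024, // 5 GB
    });
    l1.put("user:1".into(), b"profile".to_vec());
    assert_eq!(l1.get("user:1"), Some(b"profile".to_vec()));
    assert!(l1.delete("user:1"));
}
```

Because callers hold a trait object rather than a concrete backend, adding a new tier is a new impl, not a change to the eviction engine.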
The Cost Math
Consider a 100GB working set — common for large catalog e-commerce, recommendation engines, or IoT device state stores. The traditional approach has two options, both bad:
- All-RAM: Keep 100GB in RAM at $500–1,000/month. Fast, but expensive.
- RAM + Redis: Keep 5–10GB in RAM, let the rest miss to Redis at 1–5ms. Cheap, but 90%+ of reads hit the network.
Hybrid tiering introduces a third option:
- RAM (5GB): $25–50/month. Hot keys at 1.5µs.
- NVMe (30GB): $1.50–3/month. Warm keys at 10–50µs.
- Redis (65GB): $32–65/month. Cold keys at 1–5ms.
- Total: $58–118/month — an 80–90% reduction versus all-RAM.
The 35% of keys that land in RAM + NVMe get sub-50µs P99 latency. The 95/5 rule means those 35% of keys serve the overwhelming majority of reads. The effective latency profile is nearly indistinguishable from all-RAM for most workloads, at a fraction of the cost.
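The arithmetic spelled out, plus a blended-latency estimate. The 95/4/1 per-tier read shares are illustrative assumptions, not numbers from the spec.

```rust
// Per-tier monthly cost ranges ($) from the breakdown above.
fn total_cost() -> (f64, f64) {
    let ram = (25.0, 50.0);   // 5 GB RAM
    let nvme = (1.5, 3.0);    // 30 GB NVMe
    let redis = (32.0, 65.0); // 65 GB Redis
    (ram.0 + nvme.0 + redis.0, ram.1 + nvme.1 + redis.1)
}

// Blended read latency in µs under ASSUMED read shares:
// 95% hot at 1.5µs, 4% warm at 30µs, 1% cold at 3ms.
fn blended_latency_us() -> f64 {
    0.95 * 1.5 + 0.04 * 30.0 + 0.01 * 3000.0
}

fn main() {
    let (lo, hi) = total_cost();
    println!("total: ${:.2}-${:.2}/month", lo, hi); // $58.50-$118.00
    println!("blended latency: {:.1}µs", blended_latency_us()); // 32.6µs
}
```

Even with 1% of reads paying the full network round-trip, the blended latency stays in the tens of microseconds under these assumptions.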
Who This Changes Everything For
Large catalog e-commerce: Millions of SKUs, but 5% drive 80% of pageviews. Hot products stay in RAM; the long-tail catalog lives on NVMe, so product pages for niche items load in 20 microseconds instead of the 2 milliseconds a Redis round-trip would cost.
Recommendation engines: Millions of user and item embeddings with power-law access patterns. Popular vectors in RAM, the full embedding table on NVMe. No more choosing between model size and latency.
IoT platforms: Millions of device states, but recently active devices are hot. Active devices in RAM, dormant devices on NVMe. When a dormant device wakes up, its state is available in 30µs, not 3ms.
Enterprise search: Millions of document indices, but trending topics drive most queries. Trending indices in RAM, the full corpus on NVMe. Every query gets sub-millisecond response, not just the popular ones.
Related Reading
- Hybrid Memory Tiering Product Page
- Hybrid Tiering Technical Specification
- How Cachee Works
- Predictive Caching
- Causal Dependency Graph
Stop Choosing Between Speed and Scale.
Hybrid memory tiering. 1.5µs for hot keys. 10–50µs for warm keys. 100x larger working sets at 80–90% lower cost.