Overview
Hybrid tiering adds NVMe as an intermediate storage tier between Cachee's existing RAM-based L1 cache and Redis-based L2 fallback. The goal is to serve warm keys — keys that are accessed occasionally but not frequently enough to stay in RAM — at 10–50µs latency from local NVMe instead of 1–5ms latency from a network round-trip to Redis.
The tiering engine integrates with the existing W-TinyLFU eviction pipeline. When the RAM tier reaches capacity, evicted keys are demoted to NVMe instead of being dropped. When a key on NVMe is accessed frequently enough, it is promoted back to RAM. The application-facing API is unchanged — GET and SET commands work identically, with the tiering engine transparently routing reads across tiers.
All tiering operations are non-blocking. Demotions happen asynchronously after the RAM eviction completes. Promotions happen asynchronously after the NVMe read returns. The read hot path never waits for a write to NVMe. The write hot path (new key insertion) never waits for a demotion to complete.
Architecture
StorageBackend Trait
All storage tiers implement a common trait that provides the four fundamental operations plus capacity reporting.
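The doc does not show the trait itself, so the following is a minimal sketch under assumptions: the four operations are taken to be get/set/delete/contains, the method names and the `MemBackend` stand-in are illustrative, and a real `NvmeBackend` would expose async variants rather than this synchronous form.

```rust
use std::collections::HashMap;

// Hypothetical sketch of the StorageBackend trait; names are assumptions,
// not the actual Cachee API.
pub trait StorageBackend {
    fn get(&mut self, key: &str) -> Option<Vec<u8>>;
    fn set(&mut self, key: String, value: Vec<u8>);
    fn delete(&mut self, key: &str) -> bool;
    fn contains(&self, key: &str) -> bool;
    /// Capacity reporting: (used_bytes, max_bytes).
    fn capacity(&self) -> (usize, usize);
}

/// Toy in-memory backend standing in for RamBackend.
pub struct MemBackend {
    map: HashMap<String, Vec<u8>>,
    used: usize,
    max: usize,
}

impl MemBackend {
    pub fn new(max: usize) -> Self {
        Self { map: HashMap::new(), used: 0, max }
    }
}

impl StorageBackend for MemBackend {
    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        self.map.get(key).cloned()
    }
    fn set(&mut self, key: String, value: Vec<u8>) {
        self.used += value.len();
        if let Some(old) = self.map.insert(key, value) {
            self.used -= old.len(); // replaced an existing value
        }
    }
    fn delete(&mut self, key: &str) -> bool {
        if let Some(old) = self.map.remove(key) {
            self.used -= old.len();
            true
        } else {
            false
        }
    }
    fn contains(&self, key: &str) -> bool {
        self.map.contains_key(key)
    }
    fn capacity(&self) -> (usize, usize) {
        (self.used, self.max)
    }
}
```

A shared trait like this lets the tiering engine route reads and demotions without knowing which tier it is talking to.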
RamBackend
The existing Cachee L1 implementation, unchanged. Uses DashMap for concurrent reads and writes with W-TinyLFU frequency-based eviction. This is the hot tier. Capacity is bounded by available process memory.
NvmeBackend
New tier. Uses a memory-mapped file on NVMe with io_uring for asynchronous, zero-copy I/O operations. Internal organization uses a slab allocator with fixed-size slots for predictable random-read latency. LRU eviction within the tier — simpler than W-TinyLFU because NVMe capacity is typically 10–100x larger than RAM capacity, making sophisticated eviction less critical.
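The fixed-size slot layout can be sketched as simple offset arithmetic. The 4096-byte slot size is an assumption, inferred from the 4KB-aligned reads described in the read path below; the actual slot size is not specified here.

```rust
// Sketch of fixed-size slab addressing. SLOT_SIZE is an assumption
// (4096 matches the 4KB-aligned reads mentioned in the read path).
const SLOT_SIZE: u64 = 4096;

/// Byte offset of a slot in the slab file: fixed-size slots make the
/// offset a pure multiplication, giving predictable random-read latency.
fn slot_offset(slot_id: u64) -> u64 {
    slot_id * SLOT_SIZE
}

/// Number of slots needed to hold `value_len` bytes (values are padded
/// up to a whole number of slots).
fn slots_needed(value_len: u64) -> u64 {
    (value_len + SLOT_SIZE - 1) / SLOT_SIZE
}
```

The trade-off of fixed slots is internal fragmentation for values much smaller than the slot size, which is one reason small values are better kept in RAM (see Limitations).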
Promotion Policy
When a key is read from NVMe, the access is recorded. After the key accumulates a configurable number of NVMe hits (default: 3), it is promoted to RAM asynchronously. The promotion is non-blocking — the NVMe read returns the value immediately, and the promotion write to RAM happens in the background. If the key is read again before promotion completes, it is served from NVMe at 10–50µs latency.
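The threshold check can be sketched as a per-key hit counter. This is illustrative, not the actual Cachee code: the struct and method names are invented, and a real implementation would use a concurrent map and bounded counter storage.

```rust
use std::collections::HashMap;

// Sketch of the promotion-threshold check (default threshold 3, per
// the configuration table). Names are illustrative.
struct PromotionTracker {
    threshold: u32,
    hits: HashMap<String, u32>,
}

impl PromotionTracker {
    fn new(threshold: u32) -> Self {
        Self { threshold, hits: HashMap::new() }
    }

    /// Record one NVMe hit; returns true when the key has crossed the
    /// threshold and should be promoted to RAM (asynchronously).
    fn record_hit(&mut self, key: &str) -> bool {
        let count = self.hits.entry(key.to_string()).or_insert(0);
        *count += 1;
        if *count >= self.threshold {
            self.hits.remove(key); // reset once promotion is queued
            true
        } else {
            false
        }
    }
}
```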
Demotion Policy
When W-TinyLFU evicts a key from RAM, the eviction callback writes the key-value pair to NVMe asynchronously via io_uring. The eviction from RAM completes immediately — the demotion write does not block the hot path. If NVMe is full, the NVMe tier's LRU eviction removes the least recently accessed NVMe entry to make room.
Eviction from NVMe
LRU eviction within the NVMe tier. When NVMe capacity is reached and a new demotion arrives, the least recently accessed entry on NVMe is evicted. Evicted NVMe entries are not written to Redis (L2) — they simply become cache misses that fall through to the L2 tier or origin on next access. NVMe capacity is large enough that LRU is effective without the frequency-tracking overhead of W-TinyLFU.
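The policy above can be sketched with a last-access timestamp per entry. This is a deliberately naive model: a real implementation would use an O(1) intrusive LRU list rather than the linear scan here, and would track bytes rather than entry counts.

```rust
use std::collections::HashMap;

// Minimal LRU sketch for the NVMe tier (illustrative only).
struct NvmeLru {
    max_entries: usize,
    clock: u64,
    entries: HashMap<String, (Vec<u8>, u64)>, // value, last-access tick
}

impl NvmeLru {
    fn new(max_entries: usize) -> Self {
        Self { max_entries, clock: 0, entries: HashMap::new() }
    }

    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        self.clock += 1;
        let clock = self.clock;
        self.entries.get_mut(key).map(|(v, t)| {
            *t = clock; // refresh recency on access
            v.clone()
        })
    }

    /// Insert a demoted entry, evicting the least recently accessed
    /// entry when the tier is full. Returns the evicted key, if any.
    /// Evicted entries are NOT written to Redis; they simply become
    /// misses that fall through to L2 on next access.
    fn insert(&mut self, key: String, value: Vec<u8>) -> Option<String> {
        self.clock += 1;
        let mut evicted = None;
        if self.entries.len() >= self.max_entries && !self.entries.contains_key(&key) {
            if let Some(oldest) = self
                .entries
                .iter()
                .min_by_key(|(_, (_, t))| *t)
                .map(|(k, _)| k.clone())
            {
                self.entries.remove(&oldest);
                evicted = Some(oldest);
            }
        }
        self.entries.insert(key, (value, self.clock));
        evicted
    }
}
```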
Configuration
Hybrid tiering is configured via CONFIG SET commands. All settings can be changed at runtime without restart.
| Parameter | Default | Description |
|---|---|---|
| tiering.enabled | false | Enable/disable hybrid tiering. When disabled, eviction drops keys normally. |
| tiering.nvme_path | — | Path to NVMe device or directory. Required when tiering is enabled. |
| tiering.nvme_capacity_gb | 50 | Maximum NVMe capacity in GB. Should be 5–20x RAM tier capacity. |
| tiering.ram_capacity_gb | 10 | RAM tier capacity in GB. Same as existing max_memory setting. |
| tiering.promotion_threshold | 3 | Number of NVMe accesses before promoting a key to RAM. Higher values keep NVMe entries stable; lower values promote aggressively. |
| tiering.demotion_async | true | When true, demotion writes to NVMe are non-blocking. When false, eviction blocks until the NVMe write completes (not recommended). |
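For example, tiering could be enabled at runtime like this (the path is illustrative; set tiering.nvme_path before flipping tiering.enabled, since the path is required):

```
CONFIG SET tiering.nvme_path /mnt/nvme0/cachee
CONFIG SET tiering.nvme_capacity_gb 50
CONFIG SET tiering.promotion_threshold 3
CONFIG SET tiering.enabled true
```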
I/O Architecture
The NVMe backend uses io_uring for all I/O operations, providing asynchronous reads and writes with minimal per-operation system-call overhead: submissions and completions flow through ring buffers shared with the kernel, so many operations can be batched into a single syscall after the initial ring setup (and with SQPOLL mode, submission syscalls can be avoided entirely).
Read Path
- RAM check: DashMap lookup. If hit, return immediately at ~1.5µs. No NVMe I/O.
- NVMe check: Submit io_uring read request. The key's location on NVMe is tracked in an in-memory index (hash map of key → slab offset). If the key exists in the index, a single 4KB-aligned read is submitted to io_uring and awaited. Returns at 10–50µs.
- L2 miss: If the key is not in RAM or NVMe, the request falls through to the L2 tier (Redis) or origin database. Standard Cachee miss behavior, unchanged.
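The three-step routing above can be sketched as a fallthrough function. The tier probes here are plain HashMaps standing in for DashMap and the io_uring-backed NVMe index; all names are illustrative.

```rust
use std::collections::HashMap;

// Sketch of the tiered read routing described above.
#[derive(Debug, PartialEq)]
enum ReadResult {
    RamHit(Vec<u8>),
    NvmeHit(Vec<u8>), // also triggers the async promotion check
    Miss,             // falls through to Redis L2 / origin
}

fn tiered_get(
    ram: &HashMap<String, Vec<u8>>,
    nvme_index: &HashMap<String, Vec<u8>>,
    key: &str,
) -> ReadResult {
    // 1. RAM check: ~1.5µs on a hit, no NVMe I/O.
    if let Some(v) = ram.get(key) {
        return ReadResult::RamHit(v.clone());
    }
    // 2. NVMe check: in-memory index lookup, then one 4KB-aligned
    //    io_uring read (modeled here as a map lookup).
    if let Some(v) = nvme_index.get(key) {
        return ReadResult::NvmeHit(v.clone());
    }
    // 3. Miss: fall through to L2 (Redis) or origin, unchanged.
    ReadResult::Miss
}
```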
Write Path
- Always write to RAM first. New keys and updates go directly to the RAM tier via DashMap. This ensures the most recent write is always in the fastest tier.
- Async demotion on eviction. When RAM eviction fires, the evicted key-value pair is submitted to io_uring as an asynchronous write to NVMe. The write is batched with other pending demotions for efficiency.
- NVMe write completion. io_uring completion events are polled in a dedicated background thread. Failed writes are retried once; on second failure, the entry is silently dropped (treated as an eviction).
The critical design constraint: no blocking I/O on the read path. RAM reads are lock-free (DashMap). NVMe reads are submitted via io_uring and awaited with epoll — the calling thread is not blocked; it can serve other requests while waiting for the NVMe read to complete. Demotion writes are fire-and-forget from the hot path's perspective.
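The retry-once-then-drop completion policy can be sketched in isolation. The actual NVMe write is abstracted as a closure so the sketch stays self-contained; in the real system this logic would run in the completion-polling background thread.

```rust
// Sketch of the demotion completion policy: a failed NVMe write is
// retried once; a second failure silently drops the entry (treated as
// a plain eviction). `write` stands in for the io_uring submission.
fn demote_with_retry<F>(key: &str, value: &[u8], mut write: F) -> bool
where
    F: FnMut(&str, &[u8]) -> Result<(), String>,
{
    for attempt in 0..2 {
        match write(key, value) {
            Ok(()) => return true,
            Err(_) if attempt == 0 => continue, // retry once
            Err(_) => break,                    // second failure: drop
        }
    }
    false
}
```

Dropping on repeated failure is safe here because the NVMe tier is a cache: the entry simply becomes a miss that falls through to L2 on next access.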
Performance Benchmarks
Expected latency by tier and operation. All numbers assume enterprise NVMe SSDs (Intel P5800X, Samsung PM9A3, or equivalent).
| Operation | RAM | NVMe | Redis L2 |
|---|---|---|---|
| Random read (P50) | 1.5µs | 15µs | 1ms |
| Random read (P99) | 4µs | 50µs | 3ms |
| Sequential read | 1.5µs | 8µs | 1ms |
| Write | 13µs | 20µs (async demotion) | 1ms |
The NVMe P99 of 50µs is 60x faster than the Redis P99 of 3ms. For workloads where 30% of keys live in the NVMe tier, the effective P99 across all cache hits drops significantly compared to a RAM-only cache that misses to Redis for everything not in RAM.
With the 95/5 access pattern (95% of reads hit 5% of keys), approximately 95% of reads hit RAM at 1.5µs, 4% hit NVMe at 15µs (P50), and 1% miss to Redis at 1ms. The miss-inclusive weighted average is approximately 12µs, dominated by the 1% of reads that go to Redis; across cache hits alone (RAM + NVMe), the average read latency is approximately 2.0µs — barely distinguishable from a pure RAM cache, at a fraction of the cost.
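Working that arithmetic explicitly (latencies in µs from the benchmark table; the split of the cold 5% into 4% NVMe and 1% Redis follows the text):

```rust
// Weighted-average latency over a traffic mix of (share, latency_us)
// pairs. Shares should sum to 1.0 for a miss-inclusive average.
fn weighted_latency(mix: &[(f64, f64)]) -> f64 {
    mix.iter().map(|(share, lat)| share * lat).sum()
}

// All reads: 0.95*1.5 + 0.04*15 + 0.01*1000 ≈ 12.0µs — the 1% Redis
// leg contributes 10µs of that on its own.
// Cache hits only, renormalized over the 99% that hit RAM or NVMe:
// (0.95*1.5 + 0.04*15) / 0.99 ≈ 2.0µs.
```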
Capacity Planning
Recommended tier sizing based on total working set. The RAM tier holds the hottest keys, NVMe holds the warm tier, and Redis L2 holds everything else.
| Working Set | RAM Tier | NVMe Tier | Redis L2 | Monthly Cost |
|---|---|---|---|---|
| 10GB | 2GB | 3GB | 5GB | $15–30 |
| 100GB | 5GB | 30GB | 65GB | $58–118 |
| 1TB | 10GB | 200GB | 790GB | $120–250 |
At 1TB working set, the all-RAM cost would be $5,000–10,000/month. Hybrid tiering achieves the same effective capacity at $120–250/month — a 40–80x cost reduction — while keeping 21% of the working set (the 210GB in RAM + NVMe, representing 99%+ of reads) at sub-50µs latency.
Limitations
- Linux 5.1+ required. io_uring is a Linux-specific kernel feature introduced in 5.1. The NVMe backend is not available on macOS, Windows, or older Linux kernels. For development on non-Linux systems, use the RAM-only backend (existing behavior, unchanged).
- NVMe write endurance. Enterprise NVMe SSDs are rated for 1–3 DWPD (drive writes per day). At 100K demotions/sec with 1KB average value size, write volume is approximately 8.6TB/day — within the endurance budget of a multi-terabyte enterprise drive (a 3.2TB drive at 3 DWPD tolerates 9.6TB/day), but tight for smaller or lower-rated drives. For drives rated at 0.3 DWPD (consumer-grade), monitor write volume and consider write throttling.
- Minimum value size. Not recommended for keys with values smaller than 100 bytes. The io_uring I/O overhead for a 4KB-aligned read exceeds the benefit of caching a sub-100-byte value at NVMe latency. For small-value workloads, the RAM tier is sufficient.
- First-access latency for demoted keys. A key that was demoted from RAM to NVMe will return at 10–50µs on its next access, not 1.5µs. This is 50–250x faster than a Redis miss, but it is measurably slower than RAM. Applications with strict sub-5µs P99 requirements for all keys should size the RAM tier to hold the entire hot + warm working set.
- No cross-instance NVMe sharing. NVMe is local to the instance. Each Cachee instance maintains its own NVMe tier. Cross-instance coherence operates at the RAM tier level (existing behavior). NVMe entries are populated by local demotion from RAM, not by coherence.
io_uring requires Linux 5.1+. Amazon Linux 2023 (kernel 6.1), Ubuntu 22.04+ (kernel 5.15), and RHEL 9+ (kernel 5.14) all support io_uring. Check with uname -r before enabling tiering. On unsupported kernels, CONFIG SET tiering.enabled true will return an error.
Future Extensions
The StorageBackend trait is designed for extensibility as new hardware tiers emerge.
- CXL-attached memory. Compute Express Link (CXL) enables byte-addressable memory expansion over PCIe. CXL Type 3 devices provide approximately 300ns random read latency — 5x slower than local DRAM but 50x faster than NVMe. A CXL backend would sit between RAM and NVMe in the hierarchy: L0 (shared memory) → L1 (RAM) → L1.25 (CXL) → L1.5 (NVMe) → L2 (Redis).
- Intel Optane persistent memory. If available on the deployment platform, Optane DCPMM offers approximately 350ns latency with byte-addressability and persistence. Similar position in the hierarchy to CXL, with the additional benefit of crash recovery for the warm tier.
- Cloud-specific storage backends. AWS EBS io2 Block Express (sub-millisecond latency), Azure Ultra Disk (sub-ms), and GCP Hyperdisk Extreme provide NVMe-like performance as network-attached storage. A cloud storage backend would enable hybrid tiering in containerized deployments where local NVMe is not available, at the cost of slightly higher latency (100–500µs vs 10–50µs for local NVMe).
- Persistent NVMe tier. Currently, the NVMe tier is volatile — it is rebuilt from RAM demotions after a restart. A future extension could persist the NVMe slab file and index across restarts, allowing instant warm-tier recovery without a cold-start period. This requires adding checkpointing for the key-to-offset index.