Overview
Hybrid tiering adds NVMe as an intermediate storage tier between Cachee's existing RAM-based L1 cache and Redis-based L2 fallback. The goal is to serve warm keys — keys that are accessed occasionally but not frequently enough to stay in RAM — at 10–50µs latency from local NVMe instead of 1–5ms latency from a network round-trip to Redis.
The tiering engine integrates with the existing W-TinyLFU eviction pipeline. When the RAM tier reaches capacity, evicted keys are demoted to NVMe instead of being dropped. When a key on NVMe is accessed frequently enough, it is promoted back to RAM. The application-facing API is unchanged — GET and SET commands work identically, with the tiering engine transparently routing reads across tiers.
All tiering operations are non-blocking. Demotions happen asynchronously after the RAM eviction completes. Promotions happen asynchronously after the NVMe read returns. The read hot path never waits for a write to NVMe. The write hot path (new key insertion) never waits for a demotion to complete.
Architecture
StorageBackend Trait
All storage tiers implement a common trait that provides the four fundamental operations plus capacity reporting.
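The doc does not show the trait itself, so the following is a minimal sketch under assumptions: the four operations are taken to be get/set/delete/contains, the method names and the `MemBackend` stand-in are illustrative, and a real `NvmeBackend` would expose async variants rather than this synchronous form.

```rust
use std::collections::HashMap;

// Hypothetical sketch of the StorageBackend trait; names are assumptions,
// not the actual Cachee API.
pub trait StorageBackend {
    fn get(&mut self, key: &str) -> Option<Vec<u8>>;
    fn set(&mut self, key: String, value: Vec<u8>);
    fn delete(&mut self, key: &str) -> bool;
    fn contains(&self, key: &str) -> bool;
    /// Capacity reporting: (used_bytes, max_bytes).
    fn capacity(&self) -> (usize, usize);
}

/// Toy in-memory backend standing in for RamBackend.
pub struct MemBackend {
    map: HashMap<String, Vec<u8>>,
    used: usize,
    max: usize,
}

impl MemBackend {
    pub fn new(max: usize) -> Self {
        Self { map: HashMap::new(), used: 0, max }
    }
}

impl StorageBackend for MemBackend {
    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        self.map.get(key).cloned()
    }
    fn set(&mut self, key: String, value: Vec<u8>) {
        self.used += value.len();
        if let Some(old) = self.map.insert(key, value) {
            self.used -= old.len(); // replaced an existing value
        }
    }
    fn delete(&mut self, key: &str) -> bool {
        if let Some(old) = self.map.remove(key) {
            self.used -= old.len();
            true
        } else {
            false
        }
    }
    fn contains(&self, key: &str) -> bool {
        self.map.contains_key(key)
    }
    fn capacity(&self) -> (usize, usize) {
        (self.used, self.max)
    }
}
```

A shared trait like this lets the tiering engine route reads and demotions without knowing which tier it is talking to.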
RamBackend
The existing Cachee L1 implementation, unchanged. Uses DashMap for concurrent reads and writes with W-TinyLFU frequency-based eviction. This is the hot tier. Capacity is bounded by available process memory.
NvmeBackend
New tier. Uses a memory-mapped file on NVMe with io_uring for asynchronous, zero-copy I/O operations. Internal organization uses a slab allocator with fixed-size slots for predictable random-read latency. LRU eviction within the tier — simpler than W-TinyLFU because NVMe capacity is typically 10–100x larger than RAM capacity, making sophisticated eviction less critical.
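The fixed-size slot layout can be sketched as simple offset arithmetic. The 4096-byte slot size is an assumption, inferred from the 4KB-aligned reads described in the read path below; the actual slot size is not specified here.

```rust
// Sketch of fixed-size slab addressing. SLOT_SIZE is an assumption
// (4096 matches the 4KB-aligned reads mentioned in the read path).
const SLOT_SIZE: u64 = 4096;

/// Byte offset of a slot in the slab file: fixed-size slots make the
/// offset a pure multiplication, giving predictable random-read latency.
fn slot_offset(slot_id: u64) -> u64 {
    slot_id * SLOT_SIZE
}

/// Number of slots needed to hold `value_len` bytes (values are padded
/// up to a whole number of slots).
fn slots_needed(value_len: u64) -> u64 {
    (value_len + SLOT_SIZE - 1) / SLOT_SIZE
}
```

The trade-off of fixed slots is internal fragmentation for values much smaller than the slot size, which is one reason small values are better kept in RAM (see Limitations).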
Promotion Policy
When a key is read from NVMe, the access is recorded. After the key accumulates a configurable number of NVMe hits (default: 3), it is promoted to RAM asynchronously. The promotion is non-blocking — the NVMe read returns the value immediately, and the promotion write to RAM happens in the background. If the key is read again before promotion completes, it is served from NVMe at 10–50µs latency.
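The threshold check can be sketched as a per-key hit counter. This is illustrative, not the actual Cachee code: the struct and method names are invented, and a real implementation would use a concurrent map and bounded counter storage.

```rust
use std::collections::HashMap;

// Sketch of the promotion-threshold check (default threshold 3, per
// the configuration table). Names are illustrative.
struct PromotionTracker {
    threshold: u32,
    hits: HashMap<String, u32>,
}

impl PromotionTracker {
    fn new(threshold: u32) -> Self {
        Self { threshold, hits: HashMap::new() }
    }

    /// Record one NVMe hit; returns true when the key has crossed the
    /// threshold and should be promoted to RAM (asynchronously).
    fn record_hit(&mut self, key: &str) -> bool {
        let count = self.hits.entry(key.to_string()).or_insert(0);
        *count += 1;
        if *count >= self.threshold {
            self.hits.remove(key); // reset once promotion is queued
            true
        } else {
            false
        }
    }
}
```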
Demotion Policy
When W-TinyLFU evicts a key from RAM, the eviction callback writes the key-value pair to NVMe asynchronously via io_uring. The eviction from RAM completes immediately — the demotion write does not block the hot path. If NVMe is full, the NVMe tier's LRU eviction removes the least recently accessed NVMe entry to make room.
Eviction from NVMe
LRU eviction within the NVMe tier. When NVMe capacity is reached and a new demotion arrives, the least recently accessed entry on NVMe is evicted. Evicted NVMe entries are not written to Redis (L2) — they simply become cache misses that fall through to the L2 tier or origin on next access. NVMe capacity is large enough that LRU is effective without the frequency-tracking overhead of W-TinyLFU.
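The policy above can be sketched with a last-access timestamp per entry. This is a deliberately naive model: a real implementation would use an O(1) intrusive LRU list rather than the linear scan here, and would track bytes rather than entry counts.

```rust
use std::collections::HashMap;

// Minimal LRU sketch for the NVMe tier (illustrative only).
struct NvmeLru {
    max_entries: usize,
    clock: u64,
    entries: HashMap<String, (Vec<u8>, u64)>, // value, last-access tick
}

impl NvmeLru {
    fn new(max_entries: usize) -> Self {
        Self { max_entries, clock: 0, entries: HashMap::new() }
    }

    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        self.clock += 1;
        let clock = self.clock;
        self.entries.get_mut(key).map(|(v, t)| {
            *t = clock; // refresh recency on access
            v.clone()
        })
    }

    /// Insert a demoted entry, evicting the least recently accessed
    /// entry when the tier is full. Returns the evicted key, if any.
    /// Evicted entries are NOT written to Redis; they simply become
    /// misses that fall through to L2 on next access.
    fn insert(&mut self, key: String, value: Vec<u8>) -> Option<String> {
        self.clock += 1;
        let mut evicted = None;
        if self.entries.len() >= self.max_entries && !self.entries.contains_key(&key) {
            if let Some(oldest) = self
                .entries
                .iter()
                .min_by_key(|(_, (_, t))| *t)
                .map(|(k, _)| k.clone())
            {
                self.entries.remove(&oldest);
                evicted = Some(oldest);
            }
        }
        self.entries.insert(key, (value, self.clock));
        evicted
    }
}
```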
Configuration
Hybrid tiering is configured via CONFIG SET commands. All settings can be changed at runtime without restart.
| Parameter | Default | Description |
|---|---|---|
| tiering.enabled | false | Enable/disable hybrid tiering. When disabled, eviction drops keys normally. |
| tiering.nvme_path | — | Path to NVMe device or directory. Required when tiering is enabled. |
| tiering.nvme_capacity_gb | 50 | Maximum NVMe capacity in GB. Should be 5–20x RAM tier capacity. |
| tiering.ram_capacity_gb | 10 | RAM tier capacity in GB. Same as existing max_memory setting. |
| tiering.promotion_threshold | 3 | Number of NVMe accesses before promoting a key to RAM. Higher values keep NVMe entries stable; lower values promote aggressively. |
| tiering.demotion_async | true | When true, demotion writes to NVMe are non-blocking. When false, eviction blocks until the NVMe write completes (not recommended). |
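For example, tiering could be enabled at runtime like this (the path is illustrative; set tiering.nvme_path before flipping tiering.enabled, since the path is required):

```
CONFIG SET tiering.nvme_path /mnt/nvme0/cachee
CONFIG SET tiering.nvme_capacity_gb 50
CONFIG SET tiering.promotion_threshold 3
CONFIG SET tiering.enabled true
```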
I/O Architecture
The NVMe backend uses io_uring for all I/O operations, providing asynchronous reads and writes with minimal per-operation system-call overhead: submissions and completions flow through ring buffers shared with the kernel, so many operations can be batched into a single syscall after the initial ring setup (and with SQPOLL mode, submission syscalls can be avoided entirely).
Read Path
- RAM check: DashMap lookup. If hit, return immediately at ~1.5µs. No NVMe I/O.
- NVMe check: Submit io_uring read request. The key's location on NVMe is tracked in an in-memory index (hash map of key → slab offset). If the key exists in the index, a single 4KB-aligned read is submitted to io_uring and awaited. Returns at 10–50µs.
- L2 miss: If the key is not in RAM or NVMe, the request falls through to the L2 tier (Redis) or origin database. Standard Cachee miss behavior, unchanged.
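The three-step routing above can be sketched as a fallthrough function. The tier probes here are plain HashMaps standing in for DashMap and the io_uring-backed NVMe index; all names are illustrative.

```rust
use std::collections::HashMap;

// Sketch of the tiered read routing described above.
#[derive(Debug, PartialEq)]
enum ReadResult {
    RamHit(Vec<u8>),
    NvmeHit(Vec<u8>), // also triggers the async promotion check
    Miss,             // falls through to Redis L2 / origin
}

fn tiered_get(
    ram: &HashMap<String, Vec<u8>>,
    nvme_index: &HashMap<String, Vec<u8>>,
    key: &str,
) -> ReadResult {
    // 1. RAM check: ~1.5µs on a hit, no NVMe I/O.
    if let Some(v) = ram.get(key) {
        return ReadResult::RamHit(v.clone());
    }
    // 2. NVMe check: in-memory index lookup, then one 4KB-aligned
    //    io_uring read (modeled here as a map lookup).
    if let Some(v) = nvme_index.get(key) {
        return ReadResult::NvmeHit(v.clone());
    }
    // 3. Miss: fall through to L2 (Redis) or origin, unchanged.
    ReadResult::Miss
}
```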
Write Path
- Always write to RAM first. New keys and updates go directly to the RAM tier via DashMap. This ensures the most recent write is always in the fastest tier.
- Async demotion on eviction. When RAM eviction fires, the evicted key-value pair is submitted to io_uring as an asynchronous write to NVMe. The write is batched with other pending demotions for efficiency.
- NVMe write completion. io_uring completion events are polled in a dedicated background thread. Failed writes are retried once; on second failure, the entry is silently dropped (treated as an eviction).
The critical design constraint: no blocking I/O on the read path. RAM reads are lock-free (DashMap). NVMe reads are submitted via io_uring and awaited with epoll — the calling thread is not blocked; it can serve other requests while waiting for the NVMe read to complete. Demotion writes are fire-and-forget from the hot path's perspective.
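The retry-once-then-drop completion policy can be sketched in isolation. The actual NVMe write is abstracted as a closure so the sketch stays self-contained; in the real system this logic would run in the completion-polling background thread.

```rust
// Sketch of the demotion completion policy: a failed NVMe write is
// retried once; a second failure silently drops the entry (treated as
// a plain eviction). `write` stands in for the io_uring submission.
fn demote_with_retry<F>(key: &str, value: &[u8], mut write: F) -> bool
where
    F: FnMut(&str, &[u8]) -> Result<(), String>,
{
    for attempt in 0..2 {
        match write(key, value) {
            Ok(()) => return true,
            Err(_) if attempt == 0 => continue, // retry once
            Err(_) => break,                    // second failure: drop
        }
    }
    false
}
```

Dropping on repeated failure is safe here because the NVMe tier is a cache: the entry simply becomes a miss that falls through to L2 on next access.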
Performance Benchmarks
Expected latency by tier and operation. All numbers assume enterprise NVMe SSDs (Intel P5800X, Samsung PM9A3, or equivalent).
| Operation | RAM | NVMe | Redis L2 |
|---|---|---|---|
| Random read (P50) | 1.5µs | 15µs | 1ms |
| Random read (P99) | 4µs | 50µs | 3ms |
| Sequential read | 1.5µs | 8µs | 1ms |
| Write | 13µs | 20µs (async demotion) | 1ms |
The NVMe P99 of 50µs is 60x faster than the Redis P99 of 3ms. For workloads where 30% of keys live in the NVMe tier, the effective P99 across all cache hits drops significantly compared to a RAM-only cache that misses to Redis for everything not in RAM.
With the 95/5 access pattern (95% of reads hit 5% of keys), approximately 95% of reads hit RAM at 1.5µs, 4% hit NVMe at 15µs (P50), and 1% miss to Redis at 1ms. The miss-inclusive weighted average is approximately 12µs, dominated by the 1% of reads that go to Redis; across cache hits alone (RAM + NVMe), the average read latency is approximately 2.0µs — barely distinguishable from a pure RAM cache, at a fraction of the cost.
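Working that arithmetic explicitly (latencies in µs from the benchmark table; the split of the cold 5% into 4% NVMe and 1% Redis follows the text):

```rust
// Weighted-average latency over a traffic mix of (share, latency_us)
// pairs. Shares should sum to 1.0 for a miss-inclusive average.
fn weighted_latency(mix: &[(f64, f64)]) -> f64 {
    mix.iter().map(|(share, lat)| share * lat).sum()
}

// All reads: 0.95*1.5 + 0.04*15 + 0.01*1000 ≈ 12.0µs — the 1% Redis
// leg contributes 10µs of that on its own.
// Cache hits only, renormalized over the 99% that hit RAM or NVMe:
// (0.95*1.5 + 0.04*15) / 0.99 ≈ 2.0µs.
```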
Capacity Planning
Recommended tier sizing based on total working set. The RAM tier holds the hottest keys, NVMe holds the warm tier, and Redis L2 holds everything else.
| Working Set | RAM Tier | NVMe Tier | Redis L2 | Monthly Cost |
|---|---|---|---|---|
| 10GB | 2GB | 3GB | 5GB | $15–30 |
| 100GB | 5GB | 30GB | 65GB | $58–118 |
| 1TB | 10GB | 200GB | 790GB | $120–250 |
At 1TB working set, the all-RAM cost would be $5,000–10,000/month. Hybrid tiering achieves the same effective capacity at $120–250/month — a 40–80x cost reduction — while keeping 21% of the working set (the 210GB in RAM + NVMe, representing 99%+ of reads) at sub-50µs latency.
Limitations
- Linux 5.1+ required. io_uring is a Linux-specific kernel feature introduced in 5.1. The NVMe backend is not available on macOS, Windows, or older Linux kernels. For development on non-Linux systems, use the RAM-only backend (existing behavior, unchanged).
- NVMe write endurance. Enterprise NVMe SSDs are rated for 1–3 DWPD (drive writes per day). At 100K demotions/sec with 1KB average value size, write volume is approximately 8.6TB/day — within the endurance budget of a multi-terabyte enterprise drive (a 3.2TB drive at 3 DWPD tolerates 9.6TB/day), but tight for smaller or lower-rated drives. For drives rated at 0.3 DWPD (consumer-grade), monitor write volume and consider write throttling.
- Minimum value size. Not recommended for keys with values smaller than 100 bytes. The io_uring I/O overhead for a 4KB-aligned read exceeds the benefit of caching a sub-100-byte value at NVMe latency. For small-value workloads, the RAM tier is sufficient.
- First-access latency for demoted keys. A key that was demoted from RAM to NVMe will return at 10–50µs on its next access, not 1.5µs. This is 50–250x faster than a Redis miss, but it is measurably slower than RAM. Applications with strict sub-5µs P99 requirements for all keys should size the RAM tier to hold the entire hot + warm working set.
- No cross-instance NVMe sharing. NVMe is local to the instance. Each Cachee instance maintains its own NVMe tier. Cross-instance coherence operates at the RAM tier level (existing behavior). NVMe entries are populated by local demotion from RAM, not by coherence.
io_uring requires Linux 5.1+. Amazon Linux 2023 (kernel 6.1), Ubuntu 22.04+ (kernel 5.15), and RHEL 9+ (kernel 5.14) all support io_uring. Check with uname -r before enabling tiering. On unsupported kernels, CONFIG SET tiering.enabled true will return an error.
Future Extensions
The StorageBackend trait is designed for extensibility as new hardware tiers emerge.
- CXL-attached memory. Compute Express Link (CXL) enables byte-addressable memory expansion over PCIe. CXL Type 3 devices provide approximately 300ns random read latency — 5x slower than local DRAM but 50x faster than NVMe. A CXL backend would sit between RAM and NVMe in the hierarchy: L0 (shared memory) → L1 (RAM) → L1.25 (CXL) → L1.5 (NVMe) → L2 (Redis).
- Intel Optane persistent memory. If available on the deployment platform, Optane DCPMM offers approximately 350ns latency with byte-addressability and persistence. Similar position in the hierarchy to CXL, with the additional benefit of crash recovery for the warm tier.
- Cloud-specific storage backends. AWS EBS io2 Block Express (sub-millisecond latency), Azure Ultra Disk (sub-ms), and GCP Hyperdisk Extreme provide NVMe-like performance as network-attached storage. A cloud storage backend would enable hybrid tiering in containerized deployments where local NVMe is not available, at the cost of slightly higher latency (100–500µs vs 10–50µs for local NVMe).
- Persistent NVMe tier. Currently, the NVMe tier is volatile — it is rebuilt from RAM demotions after a restart. A future extension could persist the NVMe slab file and index across restarts, allowing instant warm-tier recovery without a cold-start period. This requires adding checkpointing for the key-to-offset index.