1. Shared Memory Allocation

L0 allocates a shared memory region at startup using POSIX shared memory primitives. The region is pre-allocated to a configurable size and memory-mapped into every process that attaches to it.

POSIX API

```c
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

// 1. Create shared memory object
int fd = shm_open("/cachee-l0", O_CREAT | O_RDWR, 0600);
ftruncate(fd, region_size);

// 2. Memory-map with MAP_SHARED
void* region = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED,  // all processes see the same physical pages
                    fd, 0);

// 3. Workers inherit the mapping after fork() — no additional setup
```

Platform Support

  • Linux: shm_open + mmap. Shared memory objects live in /dev/shm (tmpfs). Memory is backed by RAM, not disk. Supports MAP_HUGETLB for 2MB huge pages when available.
  • macOS: shm_open + mmap. Functionally identical to Linux. Shared memory objects are managed by the kernel. No /dev/shm filesystem, but the API is the same.
Pre-Fork Model

When used with gunicorn or uvicorn, the master process allocates the shared memory region before forking workers. Each forked worker inherits the memory mapping via the standard fork() semantics — the child process has the same virtual-to-physical page mappings as the parent. No shm_open call is needed in the worker. The region is immediately accessible.
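The inheritance semantics can be seen in a minimal, self-contained sketch. An anonymous MAP_SHARED mapping stands in for the shm_open-backed region (Linux/macOS only, since it uses fork()):

```python
import mmap
import os
import struct

# Anonymous MAP_SHARED mapping: a stand-in for the shm_open-backed region.
region = mmap.mmap(-1, 4096)

pid = os.fork()
if pid == 0:
    # Child: the mapping was inherited via fork(); write through it directly.
    region[:4] = struct.pack("<I", 42)
    os._exit(0)

os.waitpid(pid, 0)
# Parent reads the child's write: same physical pages, no IPC, no syscall per read.
print(struct.unpack("<I", region[:4])[0])  # 42
```

The same property is what lets gunicorn workers use the region with no per-worker setup.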

2. Data Layout

The shared memory region contains a fixed-size hash table using open addressing with linear probing. All data is stored inline — no pointers to heap-allocated memory, no indirection through process-local address spaces.

Slot Layout

```rust
struct L0Slot {
    state: AtomicU8,        // 0=empty, 1=occupied, 2=tombstone
    key_len: u16,           // key length in bytes (max 256)
    val_len: u32,           // value length in bytes
    key_hash: u64,          // pre-computed hash for fast comparison
    epoch: AtomicU64,       // MVCC write epoch
    key: [u8; MAX_KEY],     // key bytes, inline
    value: [u8; MAX_VAL],   // value bytes, inline
}
// Total slot size: fixed, depends on MAX_KEY + MAX_VAL configuration
// Default: 256 (key) + 4096 (value) = ~4.4KB per slot
```
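As a quick sanity check on the quoted slot size: the header fields above sum to 23 bytes, and the ~24-byte figure used in capacity planning assumes this pads to 24 for alignment (an assumption, but consistent with the rest of this document):

```python
# state (1) + key_len (2) + val_len (4) + key_hash (8) + epoch (8) = 23 bytes,
# assumed to pad to 24 for alignment.
HEADER = 24
MAX_KEY = 256
MAX_VAL = 4096

slot_size = HEADER + MAX_KEY + MAX_VAL
print(slot_size)  # 4376 bytes, i.e. ~4.4KB per slot
```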

Design Decisions

  • Inline storage: Keys and values are stored directly in the slot, not as pointers. Pointers to heap memory are process-local and would be invalid in other processes. Inline storage ensures every process can read the data directly.
  • Open addressing: Linear probing with a fixed slot array. No linked lists (which would require heap pointers). Probe sequences are deterministic and cache-friendly.
  • Fixed slot size: Every slot is the same size regardless of actual key/value length. This wastes some memory for small values but ensures O(1) index-to-address calculation: address = base + (index * slot_size).
  • Pre-computed hash: The 64-bit hash is stored in the slot to avoid rehashing during probe sequences. Comparison is: check hash (fast integer compare), then check key bytes (only on hash match).

| Configuration | Slot Size | 10M Slots | Use Case |
|---|---|---|---|
| Default (4KB values) | ~4.4 KB | ~44 GB | ML features, config values |
| Embeddings (64KB values) | ~64.3 KB | ~643 GB | Large embedding vectors |
| Small (512B values) | ~0.8 KB | ~8 GB | Counters, flags, small features |

3. Read Path

The L0 read path is three operations with zero syscalls and, on the get_numpy() path, zero memory copies:

  1. Hash key: Compute the 64-bit hash of the key using the same hash function as L1 (xxHash3). This is a pure CPU operation.
  2. Index into shared memory: slot_index = hash % num_slots. Linear probe from this index, comparing key_hash first, then key bytes on match. Address calculation: slot_ptr = base_addr + (slot_index * slot_size).
  3. Return value: Read the value bytes directly from the slot. For standard get(), the bytes are copied to the caller. For get_numpy(), a NumPy array view is constructed that points directly to the value bytes in the slot — zero copy.
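The three steps can be sketched over a raw byte buffer. This is a toy: the slot layout is simplified, keys are capped at 8 bytes, Python's built-in hash() stands in for xxHash3, and the put() helper exists only so the sketch is self-contained:

```python
import struct

# Toy slot layout: 8-byte key_hash | 2-byte key_len | 8-byte key | 16-byte value.
SLOT_SIZE = 8 + 2 + 8 + 16
NUM_SLOTS = 64

def get(region: memoryview, key: bytes):
    key_hash = hash(key) & 0xFFFFFFFFFFFFFFFF          # step 1: hash the key
    idx = key_hash % NUM_SLOTS                         # step 2: index into region
    for step in range(NUM_SLOTS):                      # linear probe on collision
        off = ((idx + step) % NUM_SLOTS) * SLOT_SIZE   # O(1) address calculation
        stored_hash, key_len = struct.unpack_from("<QH", region, off)
        if key_len == 0:                               # empty slot ends the probe: miss
            return None
        # hash compare first (fast), key bytes only on hash match
        if stored_hash == key_hash and region[off+10 : off+10+key_len] == key:
            return region[off+18 : off+34]             # step 3: zero-copy view

def put(region: memoryview, key: bytes, value: bytes):
    key_hash = hash(key) & 0xFFFFFFFFFFFFFFFF
    idx = key_hash % NUM_SLOTS
    for step in range(NUM_SLOTS):
        off = ((idx + step) % NUM_SLOTS) * SLOT_SIZE
        stored_hash, key_len = struct.unpack_from("<QH", region, off)
        if key_len == 0 or (stored_hash == key_hash
                            and region[off+10 : off+10+key_len] == key):
            struct.pack_into("<QH", region, off, key_hash, len(key))
            region[off+10 : off+18] = key.ljust(8, b"\0")
            region[off+18 : off+34] = value.ljust(16, b"\0")
            return
    raise RuntimeError("table full")
```

The memoryview returned by get() is a view into the region, mirroring get_numpy() semantics; bytes(view) would be the copying equivalent of get().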
Why Sub-Nanosecond

The shared memory region is in the process's virtual address space. Accessing it is a pointer dereference, which resolves to a CPU cache line load. If the slot is in the L1 CPU cache (likely for hot features accessed frequently), the read completes in ~0.3ns. If it requires an L2/L3 cache hit, it is 0.5–0.8ns. There is no kernel transition, no context switch, no memory copy, and no function call beyond the hash computation.

| Scenario | Latency | Notes |
|---|---|---|
| L1 CPU cache hit (hot feature) | ~0.3ns | Slot data in L1 cache line |
| L2/L3 CPU cache hit | 0.5–0.8ns | Slot data in L2/L3, no main memory access |
| Main memory (cold access) | ~50–100ns | First access to a slot not in any CPU cache |
| Linear probe (collision) | +0.3–0.8ns per probe | Additional probes at 70% load factor |

4. Write Path

Writes use MVCC atomic swap to ensure readers never see partial writes. The write path is designed so that readers are never blocked.

  1. Find slot: Hash the key and linear-probe to find either the existing slot (update) or an empty/tombstone slot (insert).
  2. Write new version: For updates, the new value is written to the next available slot in a versioned ring. The key hash, key bytes, and value bytes are written. The epoch is set to the current global epoch.
  3. Atomic pointer swap: The primary slot's version pointer is atomically swapped (CAS) to point to the new version. The old version remains readable by any reader that acquired its epoch before the swap.
  4. Epoch advance: The global epoch is incremented via fetch_add(1, Release). Subsequent readers will see the new version.
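The reader-never-blocks property can be illustrated with a toy version cell. A single reference swap stands in for the CAS pointer swap (the real implementation uses atomics on the slot and a versioned ring; this is only a model of the publish step):

```python
import itertools
import threading

class VersionCell:
    """Toy MVCC cell: a reader loads one reference and always sees a
    complete (epoch, value) pair, never a partially written value."""

    def __init__(self):
        self._current = (0, None)              # (epoch, value), swapped as one unit
        self._next_epoch = itertools.count(1)
        self._writers = threading.Lock()       # writers serialize; readers never block

    def read(self):
        return self._current                   # single reference load, no lock

    def write(self, value):
        with self._writers:
            version = (next(self._next_epoch), value)  # build the new version aside
            self._current = version                    # publish in one swap
```

A reader that loaded the old pair before the swap keeps a consistent snapshot; subsequent reads observe the new epoch.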

Write Latency

| Operation | Latency |
|---|---|
| Insert (empty slot) | 5–10ns |
| Update (atomic swap) | 10–15ns |
| Insert with probe (collision) | +2–5ns per probe |

Write Contention

Concurrent writes to the same key require CAS retry. Under extreme contention (multiple workers writing the same key simultaneously), write latency can increase. For write-heavy workloads on hot keys, consider writing from a single coordinator process and letting workers read via shared memory.
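A minimal sketch of that single-coordinator pattern. The queue.Queue and dict here are stand-ins: in production the queue would be a multiprocessing.Queue and the dict would be the SharedCache:

```python
import queue

def run_coordinator(writes: "queue.Queue", cache: dict) -> None:
    # Drain the queue and perform all writes from one place, so hot keys
    # never see CAS retry storms from competing writers.
    for key, value in iter(writes.get, None):   # None is the shutdown sentinel
        cache[key] = value                      # stand-in for cache.set(key, value)

# Usage: workers enqueue writes instead of calling set() themselves.
writes, cache = queue.Queue(), {}
writes.put(("model:version", b"v42"))
writes.put(None)
run_coordinator(writes, cache)
print(cache["model:version"])  # b'v42'
```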

5. Python Bindings

Python bindings use cffi for direct memory access with zero Python-level overhead. The GIL is released during all shared memory operations.

Python

```python
from cachee import SharedCache
import numpy as np

# Initialize (call in gunicorn master, before fork)
cache = SharedCache(
    path="/dev/shm/cachee",
    size_gb=2,
    max_value_size=4096,    # bytes, per slot
    hash_slots=10_000_000,  # number of hash table slots
)

# Standard key-value operations
cache.set("key", b"value")   # returns None
value = cache.get("key")     # returns bytes or None
cache.delete("key")          # tombstone the slot

# NumPy zero-copy access
embedding = np.random.randn(128).astype(np.float32)
cache.set_numpy("emb:123", embedding)

# Returns a VIEW into shared memory (not a copy)
result = cache.get_numpy("emb:123", shape=(128,), dtype=np.float32)
# result.base is the shared memory buffer
# result.flags.owndata is False (does not own the memory)
```

API Reference

| Method | Returns | Copy? | Notes |
|---|---|---|---|
| get(key) | bytes \| None | Yes (memcpy) | Safe; returned bytes are independent of shared memory |
| get_numpy(key, shape, dtype) | np.ndarray \| None | No (view) | Zero copy; array points into shared memory |
| set(key, value) | None | Write | MVCC atomic write |
| set_numpy(key, arr) | None | Write | Copies array data into shared memory slot |
| delete(key) | None | N/A | Sets slot state to tombstone |

Zero-Copy Safety

Arrays returned by get_numpy() point directly into shared memory. If another process writes to the same key, the array contents may change. For read-only access (the common case in ML feature serving), this is safe. If you need an immutable snapshot, call .copy() on the returned array.
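The view-versus-snapshot distinction can be reproduced with any writable buffer standing in for the shared memory slot:

```python
import numpy as np

buf = bytearray(4 * np.dtype(np.float32).itemsize)  # stand-in for a slot's value bytes
view = np.frombuffer(buf, dtype=np.float32)         # like get_numpy(): zero-copy view
snapshot = view.copy()                              # independent snapshot

buf[0:4] = np.float32(1.0).tobytes()                # another "process" writes the slot

print(view[0])      # 1.0 — the view reflects the concurrent write
print(snapshot[0])  # 0.0 — the copy is unaffected
```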

6. Configuration

L0 is controlled by four parameters. l0.enabled can be toggled at runtime. The remaining parameters require a restart because they determine the shared memory region size.

Config Commands

```
# Enable L0 shared memory tier (default: false)
CONFIG SET l0.enabled true

# Shared memory region size in gigabytes (default: 1)
CONFIG SET l0.size_gb 4

# Maximum value size per slot in bytes (default: 4096)
CONFIG SET l0.max_value_size 65536

# Number of hash table slots (default: 1000000)
CONFIG SET l0.hash_slots 10000000
```

| Parameter | Default | Requires Restart | Description |
|---|---|---|---|
| l0.enabled | false | No | Enable or disable the L0 shared memory tier. |
| l0.size_gb | 1 | Yes | Total shared memory region size. Determines maximum data capacity. |
| l0.max_value_size | 4096 | Yes | Maximum value size per slot in bytes. Determines slot size and therefore capacity. |
| l0.hash_slots | 1000000 | Yes | Number of hash table slots. Determines maximum number of keys. Keep load factor below 70%. |

Capacity Planning

Region size = hash_slots * (max_key_size + max_value_size + 24 bytes of per-slot overhead). At 10M slots with 4KB values, the region requires ~44GB. At 1M slots with 4KB values, it requires ~4.4GB. Size the region to keep the load factor (used slots / total slots) below 70% for optimal probe performance.
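The formula can be wrapped in a small planning helper (names here are illustrative, not part of the API):

```python
SLOT_OVERHEAD = 24  # per-slot header bytes, per the formula above

def region_size_gb(hash_slots: int, max_key_size: int, max_value_size: int) -> float:
    """Shared memory region size in GB for a given slot configuration."""
    return hash_slots * (max_key_size + max_value_size + SLOT_OVERHEAD) / 1e9

print(round(region_size_gb(10_000_000, 256, 4096), 1))  # 43.8 — the ~44GB case
print(round(region_size_gb(1_000_000, 256, 4096), 1))   # 4.4
```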

7. Performance

Measured on c8g.metal-48xl (192 vCPUs, Graviton4), 96 workers, 10M pre-loaded 128-dim float32 embeddings.

| Operation | Latency | Notes |
|---|---|---|
| Read (hot, CPU L1 hit) | ~0.3ns | Slot in CPU L1 cache |
| Read (warm, CPU L2/L3 hit) | 0.5–0.8ns | Slot in CPU L2/L3 cache |
| Read (cold, main memory) | 50–100ns | First access, not in any CPU cache |
| Write (insert) | 5–10ns | Empty slot, no CAS contention |
| Write (update, atomic swap) | 10–15ns | MVCC version swap |
| get_numpy() overhead | <5ns | Constructing NumPy view (no data copy) |

| Capacity | Value |
|---|---|
| Maximum shared region size | Up to 64 GB |
| Maximum hash slots | Configurable (limited by region size) |
| Maximum key size | 256 bytes |
| Maximum value size | Configurable (default 4KB, up to 64KB) |

8. Gunicorn / Uvicorn Integration

L0 integrates with the pre-fork model used by gunicorn and uvicorn. The shared memory region is allocated once in the master process and inherited by all workers via fork().

gunicorn.conf.py

```python
from cachee import SharedCache

# Allocate in master process (runs before fork)
def on_starting(server):
    server.app.shared_cache = SharedCache(
        path="/dev/shm/cachee",
        size_gb=4,
        max_value_size=4096,
        hash_slots=10_000_000,
    )

# Workers inherit the mapping — no additional setup
def post_fork(server, worker):
    # server.app.shared_cache is already accessible;
    # the memory mapping was inherited via fork()
    pass
```

No IPC channels are created. No sockets are opened. No additional configuration is needed per worker. The shared memory mapping persists for the lifetime of the master process. If the master is restarted, the shared memory object (/dev/shm/cachee) persists in tmpfs (RAM-backed, not disk) and can be re-attached.

9. Limitations

L0 is optimized for a specific deployment model. It is not a general-purpose distributed cache.

  • Same-machine only: Shared memory is not distributed. Processes on different machines cannot attach to the same region. For cross-machine caching, use L2 (Redis/ElastiCache) with coherence.
  • Fixed hash table size: The number of slots is set at allocation time and cannot be changed without re-creating the shared memory region. Over-provision slots to accommodate growth (keep load factor below 70%).
  • Value size limit: Values larger than max_value_size cannot be stored in L0. They fall through to L1. The default (4KB) handles most ML features. Set to 64KB for large embedding vectors.
  • Linux and macOS only: L0 requires POSIX shared memory (shm_open + mmap). Windows is not supported.
  • No eviction policy: L0 does not implement W-TinyLFU or any eviction policy. It is a fixed-size hash table. When full, new inserts fail until keys are deleted. Use L0 for known-size, hot-path data; use L1 for general-purpose caching with admission control.
  • Zero-copy arrays are mutable: NumPy arrays returned by get_numpy() share memory with the cache. Concurrent writes to the same key will change the array contents. Call .copy() if immutability is required.
When to Use L0 vs L1

Use L0 for data that is accessed on every request across multiple worker processes and benefits from sub-nanosecond reads: ML features, embedding vectors, tokenizer vocabularies, hot configuration. Use L1 for the general working set with dynamic admission control. The two tiers compose: L0 serves the hottest data, L1 handles the rest, and promotion/demotion happens automatically via the tiering engine.