1. Shared Memory Allocation

L0 allocates a shared memory region at startup using POSIX shared memory primitives. The region is pre-allocated to a configurable size and memory-mapped into every process that attaches to it.

POSIX API

```c
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

// 1. Create shared memory object
int fd = shm_open("/cachee-l0", O_CREAT | O_RDWR, 0600);
ftruncate(fd, region_size);

// 2. Memory-map with MAP_SHARED
void* region = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED,  // all processes see the same physical pages
                    fd, 0);

// 3. Workers inherit the mapping after fork() — no additional setup
```

Platform Support

  • Linux: shm_open + mmap. Shared memory objects live in /dev/shm (tmpfs). Memory is backed by RAM, not disk. Supports MAP_HUGETLB for 2MB huge pages when available.
  • macOS: shm_open + mmap. Functionally identical to Linux. Shared memory objects are managed by the kernel. No /dev/shm filesystem, but the API is the same.
Pre-Fork Model

When used with gunicorn or uvicorn, the master process allocates the shared memory region before forking workers. Each forked worker inherits the memory mapping via the standard fork() semantics — the child process has the same virtual-to-physical page mappings as the parent. No shm_open call is needed in the worker. The region is immediately accessible.
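The inheritance semantics can be seen in a minimal, self-contained sketch. An anonymous MAP_SHARED mapping stands in for the shm_open-backed region (Linux/macOS only, since it uses fork()):

```python
import mmap
import os
import struct

# Anonymous MAP_SHARED mapping: a stand-in for the shm_open-backed region.
region = mmap.mmap(-1, 4096)

pid = os.fork()
if pid == 0:
    # Child: the mapping was inherited via fork(); write through it directly.
    region[:4] = struct.pack("<I", 42)
    os._exit(0)

os.waitpid(pid, 0)
# Parent reads the child's write: same physical pages, no IPC, no syscall per read.
print(struct.unpack("<I", region[:4])[0])  # 42
```

The same property is what lets gunicorn workers use the region with no per-worker setup.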

2. Data Layout

The shared memory region contains a fixed-size hash table using open addressing with linear probing. All data is stored inline — no pointers to heap-allocated memory, no indirection through process-local address spaces.

Slot Layout

```rust
struct L0Slot {
    state: AtomicU8,        // 0=empty, 1=occupied, 2=tombstone
    key_len: u16,           // key length in bytes (max 256)
    val_len: u32,           // value length in bytes
    key_hash: u64,          // pre-computed hash for fast comparison
    epoch: AtomicU64,       // MVCC write epoch
    key: [u8; MAX_KEY],     // key bytes, inline
    value: [u8; MAX_VAL],   // value bytes, inline
}
// Total slot size: fixed, depends on MAX_KEY + MAX_VAL configuration
// Default: 256 (key) + 4096 (value) = ~4.4KB per slot
```
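As a quick sanity check on the quoted slot size: the header fields above sum to 23 bytes, and the ~24-byte figure used in capacity planning assumes this pads to 24 for alignment (an assumption, but consistent with the rest of this document):

```python
# state (1) + key_len (2) + val_len (4) + key_hash (8) + epoch (8) = 23 bytes,
# assumed to pad to 24 for alignment.
HEADER = 24
MAX_KEY = 256
MAX_VAL = 4096

slot_size = HEADER + MAX_KEY + MAX_VAL
print(slot_size)  # 4376 bytes, i.e. ~4.4KB per slot
```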

Design Decisions

  • Inline storage: Keys and values are stored directly in the slot, not as pointers. Pointers to heap memory are process-local and would be invalid in other processes. Inline storage ensures every process can read the data directly.
  • Open addressing: Linear probing with a fixed slot array. No linked lists (which would require heap pointers). Probe sequences are deterministic and cache-friendly.
  • Fixed slot size: Every slot is the same size regardless of actual key/value length. This wastes some memory for small values but ensures O(1) index-to-address calculation: address = base + (index * slot_size).
  • Pre-computed hash: The 64-bit hash is stored in the slot to avoid rehashing during probe sequences. Comparison is: check hash (fast integer compare), then check key bytes (only on hash match).

| Configuration | Slot Size | 10M Slots | Use Case |
|---|---|---|---|
| Default (4KB values) | ~4.4 KB | ~44 GB | ML features, config values |
| Embeddings (64KB values) | ~64.3 KB | ~643 GB | Large embedding vectors |
| Small (512B values) | ~0.8 KB | ~8 GB | Counters, flags, small features |

3. Read Path

The L0 read path is three operations with zero syscalls and, on the get_numpy() path, zero memory copies:

  1. Hash key: Compute the 64-bit hash of the key using the same hash function as L1 (xxHash3). This is a pure CPU operation.
  2. Index into shared memory: slot_index = hash % num_slots. Linear probe from this index, comparing key_hash first, then key bytes on match. Address calculation: slot_ptr = base_addr + (slot_index * slot_size).
  3. Return value: Read the value bytes directly from the slot. For standard get(), the bytes are copied to the caller. For get_numpy(), a NumPy array view is constructed that points directly to the value bytes in the slot — zero copy.
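The three steps can be sketched over a raw byte buffer. This is a toy: the slot layout is simplified, keys are capped at 8 bytes, Python's built-in hash() stands in for xxHash3, and the put() helper exists only so the sketch is self-contained:

```python
import struct

# Toy slot layout: 8-byte key_hash | 2-byte key_len | 8-byte key | 16-byte value.
SLOT_SIZE = 8 + 2 + 8 + 16
NUM_SLOTS = 64

def get(region: memoryview, key: bytes):
    key_hash = hash(key) & 0xFFFFFFFFFFFFFFFF          # step 1: hash the key
    idx = key_hash % NUM_SLOTS                         # step 2: index into region
    for step in range(NUM_SLOTS):                      # linear probe on collision
        off = ((idx + step) % NUM_SLOTS) * SLOT_SIZE   # O(1) address calculation
        stored_hash, key_len = struct.unpack_from("<QH", region, off)
        if key_len == 0:                               # empty slot ends the probe: miss
            return None
        # hash compare first (fast), key bytes only on hash match
        if stored_hash == key_hash and region[off+10 : off+10+key_len] == key:
            return region[off+18 : off+34]             # step 3: zero-copy view

def put(region: memoryview, key: bytes, value: bytes):
    key_hash = hash(key) & 0xFFFFFFFFFFFFFFFF
    idx = key_hash % NUM_SLOTS
    for step in range(NUM_SLOTS):
        off = ((idx + step) % NUM_SLOTS) * SLOT_SIZE
        stored_hash, key_len = struct.unpack_from("<QH", region, off)
        if key_len == 0 or (stored_hash == key_hash
                            and region[off+10 : off+10+key_len] == key):
            struct.pack_into("<QH", region, off, key_hash, len(key))
            region[off+10 : off+18] = key.ljust(8, b"\0")
            region[off+18 : off+34] = value.ljust(16, b"\0")
            return
    raise RuntimeError("table full")
```

The memoryview returned by get() is a view into the region, mirroring get_numpy() semantics; bytes(view) would be the copying equivalent of get().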
Why Sub-Nanosecond

The shared memory region is in the process's virtual address space. Accessing it is a pointer dereference, which resolves to a CPU cache line load. If the slot is in the L1 CPU cache (likely for hot features accessed frequently), the read completes in ~0.3ns. If it requires an L2/L3 cache hit, it is 0.5–0.8ns. There is no kernel transition, no context switch, no memory copy, and no function call beyond the hash computation.

| Scenario | Latency | Notes |
|---|---|---|
| L1 CPU cache hit (hot feature) | ~0.3ns | Slot data in L1 cache line |
| L2/L3 CPU cache hit | 0.5–0.8ns | Slot data in L2/L3, no main memory access |
| Main memory (cold access) | ~50–100ns | First access to a slot not in any CPU cache |
| Linear probe (collision) | +0.3–0.8ns per probe | Additional probes at 70% load factor |

4. Write Path

Writes use MVCC atomic swap to ensure readers never see partial writes. The write path is designed so that readers are never blocked.

  1. Find slot: Hash the key and linear-probe to find either the existing slot (update) or an empty/tombstone slot (insert).
  2. Write new version: For updates, the new value is written to the next available slot in a versioned ring. The key hash, key bytes, and value bytes are written. The epoch is set to the current global epoch.
  3. Atomic pointer swap: The primary slot's version pointer is atomically swapped (CAS) to point to the new version. The old version remains readable by any reader that acquired its epoch before the swap.
  4. Epoch advance: The global epoch is incremented via fetch_add(1, Release). Subsequent readers will see the new version.
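The reader-never-blocks property can be illustrated with a toy version cell. A single reference swap stands in for the CAS pointer swap (the real implementation uses atomics on the slot and a versioned ring; this is only a model of the publish step):

```python
import itertools
import threading

class VersionCell:
    """Toy MVCC cell: a reader loads one reference and always sees a
    complete (epoch, value) pair, never a partially written value."""

    def __init__(self):
        self._current = (0, None)              # (epoch, value), swapped as one unit
        self._next_epoch = itertools.count(1)
        self._writers = threading.Lock()       # writers serialize; readers never block

    def read(self):
        return self._current                   # single reference load, no lock

    def write(self, value):
        with self._writers:
            version = (next(self._next_epoch), value)  # build the new version aside
            self._current = version                    # publish in one swap
```

A reader that loaded the old pair before the swap keeps a consistent snapshot; subsequent reads observe the new epoch.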

Write Latency

| Operation | Latency |
|---|---|
| Insert (empty slot) | 5–10ns |
| Update (atomic swap) | 10–15ns |
| Insert with probe (collision) | +2–5ns per probe |

Write Contention

Concurrent writes to the same key require CAS retry. Under extreme contention (multiple workers writing the same key simultaneously), write latency can increase. For write-heavy workloads on hot keys, consider writing from a single coordinator process and letting workers read via shared memory.
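A minimal sketch of that single-coordinator pattern. The queue.Queue and dict here are stand-ins: in production the queue would be a multiprocessing.Queue and the dict would be the SharedCache:

```python
import queue

def run_coordinator(writes: "queue.Queue", cache: dict) -> None:
    # Drain the queue and perform all writes from one place, so hot keys
    # never see CAS retry storms from competing writers.
    for key, value in iter(writes.get, None):   # None is the shutdown sentinel
        cache[key] = value                      # stand-in for cache.set(key, value)

# Usage: workers enqueue writes instead of calling set() themselves.
writes, cache = queue.Queue(), {}
writes.put(("model:version", b"v42"))
writes.put(None)
run_coordinator(writes, cache)
print(cache["model:version"])  # b'v42'
```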

5. Python Bindings

Python bindings use cffi for direct memory access with zero Python-level overhead. The GIL is released during all shared memory operations.

Python

```python
from cachee import SharedCache
import numpy as np

# Initialize (call in gunicorn master, before fork)
cache = SharedCache(
    path="/dev/shm/cachee",
    size_gb=2,
    max_value_size=4096,    # bytes, per slot
    hash_slots=10_000_000,  # number of hash table slots
)

# Standard key-value operations
cache.set("key", b"value")   # returns None
value = cache.get("key")     # returns bytes or None
cache.delete("key")          # tombstone the slot

# NumPy zero-copy access
embedding = np.random.randn(128).astype(np.float32)
cache.set_numpy("emb:123", embedding)

# Returns a VIEW into shared memory (not a copy)
result = cache.get_numpy("emb:123", shape=(128,), dtype=np.float32)
# result.base is the shared memory buffer
# result.flags.owndata is False (does not own the memory)
```

API Reference

| Method | Returns | Copy? | Notes |
|---|---|---|---|
| get(key) | bytes \| None | Yes (memcpy) | Safe; returned bytes are independent of shared memory |
| get_numpy(key, shape, dtype) | np.ndarray \| None | No (view) | Zero copy; array points into shared memory |
| set(key, value) | None | Write | MVCC atomic write |
| set_numpy(key, arr) | None | Write | Copies array data into shared memory slot |
| delete(key) | None | N/A | Sets slot state to tombstone |

Zero-Copy Safety

Arrays returned by get_numpy() point directly into shared memory. If another process writes to the same key, the array contents may change. For read-only access (the common case in ML feature serving), this is safe. If you need an immutable snapshot, call .copy() on the returned array.
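The view-versus-snapshot distinction can be reproduced with any writable buffer standing in for the shared memory slot:

```python
import numpy as np

buf = bytearray(4 * np.dtype(np.float32).itemsize)  # stand-in for a slot's value bytes
view = np.frombuffer(buf, dtype=np.float32)         # like get_numpy(): zero-copy view
snapshot = view.copy()                              # independent snapshot

buf[0:4] = np.float32(1.0).tobytes()                # another "process" writes the slot

print(view[0])      # 1.0 — the view reflects the concurrent write
print(snapshot[0])  # 0.0 — the copy is unaffected
```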

6. Configuration

L0 is controlled by four parameters. l0.enabled can be toggled at runtime. The remaining parameters require a restart because they determine the shared memory region size.

Config Commands

```
# Enable L0 shared memory tier (default: false)
CONFIG SET l0.enabled true

# Shared memory region size in gigabytes (default: 1)
CONFIG SET l0.size_gb 4

# Maximum value size per slot in bytes (default: 4096)
CONFIG SET l0.max_value_size 65536

# Number of hash table slots (default: 1000000)
CONFIG SET l0.hash_slots 10000000
```

| Parameter | Default | Requires Restart | Description |
|---|---|---|---|
| l0.enabled | false | No | Enable or disable the L0 shared memory tier. |
| l0.size_gb | 1 | Yes | Total shared memory region size. Determines maximum data capacity. |
| l0.max_value_size | 4096 | Yes | Maximum value size per slot in bytes. Determines slot size and therefore capacity. |
| l0.hash_slots | 1000000 | Yes | Number of hash table slots. Determines maximum number of keys. Keep load factor below 70%. |

Capacity Planning

Region size = hash_slots * (max_key_size + max_value_size + 24 bytes of per-slot overhead). At 10M slots with 4KB values, the region requires ~44GB. At 1M slots with 4KB values, it requires ~4.4GB. Size the region to keep the load factor (used slots / total slots) below 70% for optimal probe performance.
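The formula can be wrapped in a small planning helper (names here are illustrative, not part of the API):

```python
SLOT_OVERHEAD = 24  # per-slot header bytes, per the formula above

def region_size_gb(hash_slots: int, max_key_size: int, max_value_size: int) -> float:
    """Shared memory region size in GB for a given slot configuration."""
    return hash_slots * (max_key_size + max_value_size + SLOT_OVERHEAD) / 1e9

print(round(region_size_gb(10_000_000, 256, 4096), 1))  # 43.8 — the ~44GB case
print(round(region_size_gb(1_000_000, 256, 4096), 1))   # 4.4
```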

7. Performance

Measured on c8g.metal-48xl (192 vCPUs, Graviton4), 96 workers, 10M pre-loaded 128-dim float32 embeddings.

| Operation | Latency | Notes |
|---|---|---|
| Read (hot, CPU L1 hit) | ~0.3ns | Slot in CPU L1 cache |
| Read (warm, CPU L2/L3 hit) | 0.5–0.8ns | Slot in CPU L2/L3 cache |
| Read (cold, main memory) | 50–100ns | First access, not in any CPU cache |
| Write (insert) | 5–10ns | Empty slot, no CAS contention |
| Write (update, atomic swap) | 10–15ns | MVCC version swap |
| get_numpy() overhead | <5ns | Constructing NumPy view (no data copy) |

| Capacity | Value |
|---|---|
| Maximum shared region size | Up to 64 GB |
| Maximum hash slots | Configurable (limited by region size) |
| Maximum key size | 256 bytes |
| Maximum value size | Configurable (default 4KB, up to 64KB) |

8. Gunicorn / Uvicorn Integration

L0 integrates with the pre-fork model used by gunicorn and uvicorn. The shared memory region is allocated once in the master process and inherited by all workers via fork().

gunicorn.conf.py

```python
from cachee import SharedCache

# Allocate in master process (runs before fork)
def on_starting(server):
    server.app.shared_cache = SharedCache(
        path="/dev/shm/cachee",
        size_gb=4,
        max_value_size=4096,
        hash_slots=10_000_000,
    )

# Workers inherit the mapping — no additional setup
def post_fork(server, worker):
    # server.app.shared_cache is already accessible;
    # the memory mapping was inherited via fork()
    pass
```

No IPC channels are created. No sockets are opened. No additional configuration is needed per worker. The shared memory mapping persists for the lifetime of the master process. If the master is restarted, the shared memory object (/dev/shm/cachee) persists in tmpfs (RAM-backed, not disk) and can be re-attached.

9. Limitations

L0 is optimized for a specific deployment model. It is not a general-purpose distributed cache.

  • Same-machine only: Shared memory is not distributed. Processes on different machines cannot attach to the same region. For cross-machine caching, use L2 (Redis/ElastiCache) with coherence.
  • Fixed hash table size: The number of slots is set at allocation time and cannot be changed without re-creating the shared memory region. Over-provision slots to accommodate growth (keep load factor below 70%).
  • Value size limit: Values larger than max_value_size cannot be stored in L0. They fall through to L1. The default (4KB) handles most ML features. Set to 64KB for large embedding vectors.
  • Linux and macOS only: L0 requires POSIX shared memory (shm_open + mmap). Windows is not supported.
  • No eviction policy: L0 does not implement W-TinyLFU or any eviction policy. It is a fixed-size hash table. When full, new inserts fail until keys are deleted. Use L0 for known-size, hot-path data; use L1 for general-purpose caching with admission control.
  • Zero-copy arrays are mutable: NumPy arrays returned by get_numpy() share memory with the cache. Concurrent writes to the same key will change the array contents. Call .copy() if immutability is required.
When to Use L0 vs L1

Use L0 for data that is accessed on every request across multiple worker processes and benefits from sub-nanosecond reads: ML features, embedding vectors, tokenizer vocabularies, hot configuration. Use L1 for the general working set with dynamic admission control. The two tiers compose: L0 serves the hottest data, L1 handles the rest, and promotion/demotion happens automatically via the tiering engine.