1. Shared Memory Allocation
L0 allocates a shared memory region at startup using POSIX shared memory primitives. The region is pre-allocated to a configurable size and memory-mapped into every process that attaches to it.
Platform Support
- Linux: `shm_open` + `mmap`. Shared memory objects live in `/dev/shm` (tmpfs). Memory is backed by RAM, not disk. Supports `MAP_HUGETLB` for 2MB huge pages when available.
- macOS: `shm_open` + `mmap`. Functionally identical to Linux. Shared memory objects are managed by the kernel. No `/dev/shm` filesystem, but the API is the same.
When used with gunicorn or uvicorn, the master process allocates the shared memory region before forking workers. Each forked worker inherits the memory mapping via the standard fork() semantics — the child process has the same virtual-to-physical page mappings as the parent. No shm_open call is needed in the worker. The region is immediately accessible.
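The inheritance pattern above can be sketched with Python's stdlib `multiprocessing.shared_memory` (which wraps `shm_open` + `mmap` on Linux and macOS). This is an illustration of the fork semantics, not L0's actual allocation code; the region size is deliberately tiny for the demo.

```python
import os
from multiprocessing import shared_memory

REGION_SIZE = 4 * 1024 * 1024  # 4 MB for the demo; L0 would pre-allocate GBs

# Master: create and map the region before forking.
shm = shared_memory.SharedMemory(create=True, size=REGION_SIZE)
shm.buf[0:5] = b"hello"

pid = os.fork()
if pid == 0:
    # Worker: the mapping is inherited via fork(); no shm_open call needed.
    assert bytes(shm.buf[0:5]) == b"hello"
    os._exit(0)

_, status = os.waitpid(pid, 0)
shm.close()
shm.unlink()
print("worker ok:", os.waitstatus_to_exitcode(status) == 0)  # worker ok: True
```

The same property is what makes the gunicorn pre-fork model work: anything mapped in the master before `fork()` is visible to every worker at the same virtual addresses.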
2. Data Layout
The shared memory region contains a fixed-size hash table using open addressing with linear probing. All data is stored inline — no pointers to heap-allocated memory, no indirection through process-local address spaces.
Design Decisions
- Inline storage: Keys and values are stored directly in the slot, not as pointers. Pointers to heap memory are process-local and would be invalid in other processes. Inline storage ensures every process can read the data directly.
- Open addressing: Linear probing with a fixed slot array. No linked lists (which would require heap pointers). Probe sequences are deterministic and cache-friendly.
- Fixed slot size: Every slot is the same size regardless of actual key/value length. This wastes some memory for small values but ensures O(1) index-to-address calculation: `address = base + (index * slot_size)`.
- Pre-computed hash: The 64-bit hash is stored in the slot to avoid rehashing during probe sequences. Comparison is: check hash (fast integer compare), then check key bytes (only on hash match).
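The fixed-slot decisions above can be illustrated with a small sketch. The field widths and header layout here are hypothetical, not L0's actual on-disk format; the point is that a fixed slot size makes the index-to-address math a single multiply.

```python
import struct

MAX_KEY = 64      # hypothetical inline key capacity
MAX_VALUE = 4096  # hypothetical inline value capacity

# Header: u64 pre-computed hash, u32 key length, u32 value length (16 bytes),
# followed by the key and value stored inline -- no heap pointers.
SLOT_FMT = f"<QII{MAX_KEY}s{MAX_VALUE}s"
SLOT_SIZE = struct.calcsize(SLOT_FMT)

def slot_offset(index: int) -> int:
    """O(1) address calculation: offset of slot `index` from the region base."""
    return index * SLOT_SIZE

def pack_slot(key_hash: int, key: bytes, value: bytes) -> bytes:
    """Serialize one slot; struct pads key/value to their fixed widths."""
    return struct.pack(SLOT_FMT, key_hash, len(key), len(value), key, value)

print(SLOT_SIZE)  # 16 + 64 + 4096 = 4176 bytes per slot
```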
| Configuration | Slot Size | 10M Slots | Use Case |
|---|---|---|---|
| Default (4KB values) | ~4.4 KB | ~44 GB | ML features, config values |
| Embeddings (64KB values) | ~64.3 KB | ~643 GB | Large embedding vectors |
| Small (512B values) | ~0.8 KB | ~8 GB | Counters, flags, small features |
3. Read Path
The L0 read path is three operations with zero syscalls; the `get_numpy()` path also involves zero memory copies:
- Hash key: Compute the 64-bit hash of the key using the same hash function as L1 (xxHash3). This is a pure CPU operation.
- Index into shared memory: `slot_index = hash % num_slots`. Linear probe from this index, comparing `key_hash` first, then key bytes on match. Address calculation: `slot_ptr = base_addr + (slot_index * slot_size)`.
- Return value: Read the value bytes directly from the slot. For standard `get()`, the bytes are copied to the caller. For `get_numpy()`, a NumPy array view is constructed that points directly to the value bytes in the slot — zero copy.
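The three steps can be modeled in plain Python, using a list as a stand-in for the mapped region and the built-in `hash()` in place of xxHash3. This is illustration only; the real path is a direct pointer dereference into mapped memory.

```python
NUM_SLOTS = 8
slots = [None] * NUM_SLOTS  # each occupied slot: (key_hash, key, value)

def _put(key: bytes, value: bytes) -> None:
    """Minimal insert so the read path has something to find."""
    h = hash(key)
    i = h % NUM_SLOTS
    while slots[i] is not None and slots[i][1] != key:
        i = (i + 1) % NUM_SLOTS          # linear probe
    slots[i] = (h, key, value)

def get(key: bytes):
    h = hash(key)                        # 1. hash the key
    i = h % NUM_SLOTS                    # 2. index, then linear probe
    while slots[i] is not None:
        kh, k, v = slots[i]
        if kh == h and k == key:         # fast int compare, then key bytes
            return v                     # 3. return the value
        i = (i + 1) % NUM_SLOTS
    return None

_put(b"feat:1", b"\x01\x02")
print(get(b"feat:1"))  # b'\x01\x02'
```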
The shared memory region is in the process's virtual address space. Accessing it is a pointer dereference, which resolves to a CPU cache line load. If the slot is in the L1 CPU cache (likely for hot features accessed frequently), the read completes in ~0.3ns. If it requires an L2/L3 cache hit, it is 0.5–0.8ns. There is no kernel transition, no context switch, no memory copy, and no function call beyond the hash computation.
| Scenario | Latency | Notes |
|---|---|---|
| L1 CPU cache hit (hot feature) | ~0.3ns | Slot data in L1 cache line |
| L2/L3 CPU cache hit | 0.5–0.8ns | Slot data in L2/L3, no main memory access |
| Main memory (cold access) | ~50–100ns | First access to a slot not in any CPU cache |
| Linear probe (collision) | +0.3–0.8ns per probe | Additional probes at 70% load factor |
4. Write Path
Writes use MVCC atomic swap to ensure readers never see partial writes. The write path is designed so that readers are never blocked.
- Find slot: Hash the key and linear-probe to find either the existing slot (update) or an empty/tombstone slot (insert).
- Write new version: For updates, the new value is written to the next available slot in a versioned ring. The key hash, key bytes, and value bytes are written. The epoch is set to the current global epoch.
- Atomic pointer swap: The primary slot's version pointer is atomically swapped (CAS) to point to the new version. The old version remains readable by any reader that acquired its epoch before the swap.
- Epoch advance: The global epoch is incremented via `fetch_add(1, Release)`. Subsequent readers will see the new version.
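The write steps above can be modeled as a toy in Python, where rebinding a single reference stands in for the native CAS pointer swap and a plain counter stands in for the atomic epoch. None of this is L0's actual implementation; it only shows why readers are never blocked: they either see the old version or the new one, never a partial write.

```python
class Slot:
    def __init__(self):
        self.version = None          # "version pointer": (epoch, value)

global_epoch = [0]                   # stand-in for an atomic counter
slot = Slot()

def write(slot: Slot, value: bytes) -> None:
    new_version = (global_epoch[0], value)  # write new version out of place
    slot.version = new_version              # "atomic swap": old or new, never partial
    global_epoch[0] += 1                    # epoch advance (fetch_add analogue)

def read(slot: Slot):
    v = slot.version                 # snapshot the version pointer once
    return None if v is None else v[1]

write(slot, b"v1")
write(slot, b"v2")
print(read(slot))  # b'v2'
```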
Write Latency
| Operation | Latency |
|---|---|
| Insert (empty slot) | 5–10ns |
| Update (atomic swap) | 10–15ns |
| Insert with probe (collision) | +2–5ns per probe |
Concurrent writes to the same key require CAS retry. Under extreme contention (multiple workers writing the same key simultaneously), write latency can increase. For write-heavy workloads on hot keys, consider writing from a single coordinator process and letting workers read via shared memory.
5. Python Bindings
Python bindings use cffi for direct memory access with zero Python-level overhead. The GIL is released during all shared memory operations.
API Reference
| Method | Returns | Copy? | Notes |
|---|---|---|---|
| `get(key)` | `bytes \| None` | Yes (memcpy) | Safe; returned bytes are independent of shared memory |
| `get_numpy(key, shape, dtype)` | `np.ndarray \| None` | No (view) | Zero copy; array points into shared memory |
| `set(key, value)` | `None` | Write | MVCC atomic write |
| `set_numpy(key, arr)` | `None` | Write | Copies array data into shared memory slot |
| `delete(key)` | `None` | N/A | Sets slot state to tombstone |
Arrays returned by get_numpy() point directly into shared memory. If another process writes to the same key, the array contents may change. For read-only access (the common case in ML feature serving), this is safe. If you need an immutable snapshot, call .copy() on the returned array.
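The view-versus-snapshot distinction can be demonstrated with `np.frombuffer` over a mutable buffer standing in for a cache slot (requires NumPy; this models the semantics, not L0's bindings):

```python
import numpy as np

slot = bytearray(np.zeros(4, dtype=np.float32).tobytes())  # "slot" memory
view = np.frombuffer(slot, dtype=np.float32)  # zero-copy, like get_numpy()
snapshot = view.copy()                        # independent snapshot

# A "concurrent writer" updates the slot bytes in place...
slot[0:4] = np.float32(1.5).tobytes()

print(view[0])      # 1.5  -- the view observes the write
print(snapshot[0])  # 0.0  -- the copy does not
```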
6. Configuration
L0 is controlled by four parameters. l0.enabled can be toggled at runtime. The remaining parameters require a restart because they determine the shared memory region size.
| Parameter | Default | Requires Restart | Description |
|---|---|---|---|
| `l0.enabled` | `false` | No | Enable or disable the L0 shared memory tier. |
| `l0.size_gb` | `1` | Yes | Total shared memory region size. Determines maximum data capacity. |
| `l0.max_value_size` | `4096` | Yes | Maximum value size per slot in bytes. Determines slot size and therefore capacity. |
| `l0.hash_slots` | `1000000` | Yes | Number of hash table slots. Determines maximum number of keys. Keep load factor below 70%. |
Required region size = hash_slots * (max_key_size + max_value_size + 24 bytes overhead). At 10M slots with 4KB values, the region requires ~44GB. At 1M slots with 4KB values, it requires ~4.4GB. Size the region to keep the load factor (used slots / total slots) below 70% for optimal probe performance.
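A small sizing helper implementing this formula, assuming the documented 256-byte maximum key size and 24 bytes of per-slot overhead:

```python
MAX_KEY_SIZE = 256     # documented maximum key size
SLOT_OVERHEAD = 24     # per-slot metadata overhead from the formula above

def region_size_bytes(hash_slots: int, max_value_size: int) -> int:
    """Total shared region size for the given slot count and value size."""
    return hash_slots * (MAX_KEY_SIZE + max_value_size + SLOT_OVERHEAD)

def usable_keys(hash_slots: int, load_factor: float = 0.7) -> int:
    """Keys that fit while keeping the load factor at the target."""
    return round(hash_slots * load_factor)

print(f"{region_size_bytes(10_000_000, 4096) / 1e9:.1f} GB")  # 43.8 GB
print(usable_keys(10_000_000))                                # 7000000
```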
7. Performance
Measured on c8g.metal-48xl (192 vCPUs, Graviton4), 96 workers, 10M pre-loaded 128-dim float32 embeddings.
| Operation | Latency | Notes |
|---|---|---|
| Read (hot, CPU L1 hit) | ~0.3ns | Slot in CPU L1 cache |
| Read (warm, CPU L2/L3 hit) | 0.5–0.8ns | Slot in CPU L2/L3 cache |
| Read (cold, main memory) | 50–100ns | First access, not in any CPU cache |
| Write (insert) | 5–10ns | Empty slot, no CAS contention |
| Write (update, atomic swap) | 10–15ns | MVCC version swap |
| get_numpy() overhead | <5ns | Constructing NumPy view (no data copy) |
| Capacity | Value |
|---|---|
| Maximum shared region size | Up to 64 GB |
| Maximum hash slots | Configurable (limited by region size) |
| Maximum key size | 256 bytes |
| Maximum value size | Configurable (default 4KB, up to 64KB) |
8. Gunicorn / Uvicorn Integration
L0 integrates with the pre-fork model used by gunicorn and uvicorn. The shared memory region is allocated once in the master process and inherited by all workers via fork().
No IPC channels are created. No sockets are opened. No additional configuration is needed per worker. The shared memory mapping persists for the lifetime of the master process. If the master is restarted, the shared memory object (`/dev/shm/cachee`) persists in tmpfs (RAM-backed, so it survives a master restart but not a reboot) and can be re-attached.
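A hypothetical `gunicorn.conf.py` sketch of this lifecycle. The `cachee.l0` module and its `allocate`/`unlink` calls are illustrative names, not a documented API; the `on_starting` and `on_exit` hooks are standard gunicorn server hooks that run in the master process.

```python
# gunicorn.conf.py -- sketch only; substitute the actual L0 API.
workers = 8
preload_app = True  # import the app in the master, before forking

def on_starting(server):
    # Runs once in the master before any worker is forked, so every
    # worker inherits the shared memory mapping.
    import cachee.l0 as l0            # hypothetical module name
    l0.allocate(size_gb=1)            # hypothetical allocation call

def on_exit(server):
    import cachee.l0 as l0
    l0.unlink()                       # remove the /dev/shm object on shutdown
```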
9. Limitations
L0 is optimized for a specific deployment model. It is not a general-purpose distributed cache.
- Same-machine only: Shared memory is not distributed. Processes on different machines cannot attach to the same region. For cross-machine caching, use L2 (Redis/ElastiCache) with coherence.
- Fixed hash table size: The number of slots is set at allocation time and cannot be changed without re-creating the shared memory region. Over-provision slots to accommodate growth (keep load factor below 70%).
- Value size limit: Values larger than `max_value_size` cannot be stored in L0. They fall through to L1. The default (4KB) handles most ML features. Set to 64KB for large embedding vectors.
- Linux and macOS only: L0 requires POSIX shared memory (`shm_open` + `mmap`). Windows is not supported.
- No eviction policy: L0 does not implement W-TinyLFU or any eviction policy. It is a fixed-size hash table. When full, new inserts fail until keys are deleted. Use L0 for known-size, hot-path data; use L1 for general-purpose caching with admission control.
- Zero-copy arrays are mutable: NumPy arrays returned by `get_numpy()` share memory with the cache. Concurrent writes to the same key will change the array contents. Call `.copy()` if immutability is required.
Use L0 for data that is accessed on every request across multiple worker processes and benefits from sub-nanosecond reads: ML features, embedding vectors, tokenizer vocabularies, hot configuration. Use L1 for the general working set with dynamic admission control. The two tiers compose: L0 serves the hottest data, L1 handles the rest, and promotion/demotion happens automatically via the tiering engine.