L0 maps a shared memory region that all processes on the same machine read directly. No serialization. No IPC. No function call overhead. Sub-nanosecond reads for data that's already in your address space.
Every production Python deployment runs into the same wall. The GIL prevents multi-threaded parallelism, so you scale with processes. And every process has its own memory space. Your cache data is duplicated N times.
Cachee allocates a memory-mapped region (shm_open + mmap with MAP_SHARED) at startup. All worker processes attach to the same region. Reads are direct pointer dereference — no syscall, no copy, no serialization. Writes use lock-free MVCC for consistency. The result: reading a cached ML feature is as fast as reading a struct field.
Cachee allocates the shared memory region in the master process before gunicorn or uvicorn forks workers. Each forked worker inherits the memory mapping automatically. No additional setup, no IPC channels, no per-worker configuration. The region is live the moment the worker starts.
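The mechanism can be sketched with Python's standard-library `SharedMemory`, which wraps the same shm_open + mmap(MAP_SHARED) primitives. This is not Cachee's API; it only demonstrates how two processes end up reading one physical copy of the bytes.

```python
from multiprocessing import shared_memory

# Illustrative sketch: stdlib SharedMemory wraps shm_open + mmap with
# MAP_SHARED on POSIX. This is not Cachee's API -- it only shows the
# mechanism described above.
master = shared_memory.SharedMemory(create=True, size=1024)
master.buf[:5] = b"hello"          # "master" writes before workers start

# A "worker" attaches to the same region. Under gunicorn's fork model the
# mapping is inherited rather than re-opened by name, but either way both
# processes see a single physical copy of the bytes.
worker = shared_memory.SharedMemory(name=master.name)
data = bytes(worker.buf[:5])       # direct read out of the shared region

worker.close()
master.close()
master.unlink()
```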
This is the same mechanism that PostgreSQL uses for shared buffers and that Nginx uses for shared caching zones. The operating system handles the mapping. Cachee handles the data layout and concurrency.
Writes to L0 use the same MVCC architecture as the L1 engine. New versions are written to the next available slot. The pointer is atomically swapped. Old versions remain readable until all readers advance their epoch. Readers are never blocked by writers.
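The version-slot idea can be sketched in plain Python. The real engine does this lock-free in shared memory; the class and field names below are illustrative, not Cachee's API, and Python's single assignment stands in for the atomic pointer swap.

```python
# Minimal sketch of the MVCC write path described above (illustrative
# names, not Cachee's API).
class MVCCCell:
    def __init__(self, slots=4):
        self.slots = [None] * slots   # version slots
        self.current = 0              # published slot index

    def read(self):
        # Readers load the published index, then dereference the slot.
        # No lock is taken; readers are never blocked by a writer.
        return self.slots[self.current]

    def write(self, value):
        # The new version is written to the next available slot...
        nxt = (self.current + 1) % len(self.slots)
        self.slots[nxt] = value
        # ...then published with a single atomic index swap. The old
        # version stays readable until readers advance past it.
        self.current = nxt

cell = MVCCCell()
cell.write({"feature": 1.0})
snapshot = cell.read()             # a reader pins the old version
cell.write({"feature": 2.0})       # writer publishes a new version
```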
This composes naturally with hybrid tiering. L0 serves the hottest data at pointer speed. L1 handles the working set with W-TinyLFU admission. L2 (Redis) backs the full keyspace. Promotion and demotion happen automatically.
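The tiered read path can be sketched as a waterfall lookup with promotion. Plain dicts stand in for the real tiers here, and the function name is illustrative, not Cachee's public API.

```python
# Hedged sketch of the L0 -> L1 -> L2 read path. Dicts stand in for the
# real tiers; `tiered_get` is an illustrative name, not Cachee's API.
l0, l1, l2 = {}, {}, {"user:42": b"cold-feature"}

def tiered_get(key):
    if key in l0:                  # hottest data: pointer-speed hit
        return l0[key]
    if key in l1:                  # working set: in-process hit
        return l1[key]
    value = l2.get(key)            # full keyspace: network round-trip
    if value is not None:
        l1[key] = value            # promotion warms the faster tier
    return value

first = tiered_get("user:42")      # L2 hit, promoted into L1
second = tiered_get("user:42")     # now served from L1
```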
L0 is not a replacement for L1. It is the tier below it — faster, closer to the CPU, and shared across processes. Together with L1, L1.5, and L2, it forms a complete caching hierarchy that matches data temperature to access speed.
| Tier | Mechanism | Latency | Scope | Best For |
|---|---|---|---|---|
| L0: Zero-Copy Shared Memory | Pointer dereference into mmap region | <1ns | All processes, same machine | ML features, embeddings, hot config |
| L1: In-Process RAM (W-TinyLFU) | DashMap lookup + memory copy | ~1.5µs | Single process | Working set, general-purpose cache |
| L1.5: NVMe SSD | io_uring async read | 10–50µs | Single machine | Warm data, overflow from RAM |
| L2: Redis / ElastiCache | Network round-trip | 1–5ms | Cluster-wide | Full keyspace, shared state |
Cache tensors, embeddings, and feature matrices as shared memory arrays. Read them back as NumPy views — zero copy, zero allocation, zero deserialization.
get_numpy(key, shape, dtype) returns a NumPy array whose underlying buffer is the shared memory region. No allocation, no memcpy, no deserialization. The array is usable immediately with TensorFlow, PyTorch, scikit-learn, or any library that accepts NumPy arrays.

L0 shared memory is not for every workload. It is for workloads where per-request feature access is the bottleneck — and every microsecond between request and prediction is revenue.
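The zero-copy view can be illustrated with stdlib SharedMemory and NumPy's buffer-backed ndarray constructor. This reproduces the effect of get_numpy, not its implementation.

```python
import numpy as np
from multiprocessing import shared_memory

# Sketch of a zero-copy NumPy read: the array's buffer IS the shared
# region. Illustrates the effect of get_numpy, not Cachee's internals.
shape, dtype = (4,), np.float32
shm = shared_memory.SharedMemory(create=True, size=int(np.prod(shape)) * 4)

# A "writer" fills the region through one view...
writer_view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
writer_view[:] = [1.0, 2.0, 3.0, 4.0]

# ...and a "reader" view over the same buffer sees the data with no copy,
# no allocation, no deserialization:
reader_view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
result = float(reader_view.sum())  # usable directly by NumPy-based libraries

del writer_view, reader_view       # release buffer exports before closing
shm.close()
shm.unlink()
```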
When embedded micro-inference ships, L0 shared memory becomes its natural data source. Models read features from shared memory at sub-nanosecond speed. The cache is the feature store is the inference input.
Zero-copy shared memory caching maps a memory region (via mmap/shm_open) that is shared across all processes on the same machine. When a process reads a cached value, it performs a direct pointer dereference into the shared region — no system call, no memory copy, no serialization. The read completes in sub-nanosecond time because the data is already in the process's virtual address space.
Python's Global Interpreter Lock (GIL) prevents true multi-threaded parallelism for CPU-bound work. Production Python deployments use multi-process architectures where each process has its own memory space. Without shared memory, each process maintains its own cache copy. Eight workers with a 2GB feature cache means 16GB of duplicated data. L0 eliminates this duplication: one copy, shared across all workers, with zero-copy reads.
L1 in-process caching (DashMap with W-TinyLFU) requires a function call, hash table lookup, and memory copy — approximately 1.5 microseconds per read. L0 shared memory is a direct pointer dereference — approximately 0.3–0.8 nanoseconds per read, which is 2,000–5,000x faster. L0 sits below L1 in the memory hierarchy: data that is in L0 never needs to be looked up in L1.
Yes. The master process allocates the shared memory region before forking workers. When gunicorn forks, each worker inherits the memory mapping automatically — no additional setup, no IPC channels, no configuration per worker. Writes use lock-free MVCC to ensure consistency without blocking readers.
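For the fork-inheritance pattern to work, the application (and with it the shared region) must load in the master process before workers fork. With gunicorn that is its standard preload setting; a minimal sketch of the config:

```python
# gunicorn.conf.py -- preload the app so the master allocates shared
# state before forking; every worker then inherits the same mapping.
preload_app = True
workers = 8
```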
Yes. get_numpy(key, shape, dtype) returns a NumPy array view directly into the shared memory region. The returned array shares the same underlying memory — no copy, no allocation, no deserialization. You can pass it directly to TensorFlow, PyTorch, or scikit-learn.
Zero-copy shared memory. Sub-nanosecond reads. Native Python bindings. Purpose-built for the workloads where every nanosecond between request and prediction is revenue.