L0 Cache Tier

Faster Than L1. Zero Copies.
Shared Across Processes.

L0 maps a shared memory region that all processes on the same machine read directly. No serialization. No IPC. No function call overhead. Sub-nanosecond reads for data that's already in your address space.

Sub-ns
Reads
Zero
Copies
Multi-Process
Shared Memory
Python
Native
The Problem

Python's Architecture Forces Duplication

Every production Python deployment runs into the same wall. The GIL prevents multi-threaded parallelism, so you scale with processes. And every process has its own memory space. Your cache data is duplicated N times.

🐍
The GIL Forces Multi-Process
Gunicorn workers. Uvicorn workers. Multiprocessing pools. Celery workers. Every production Python deployment that needs parallelism forks into multiple processes. Each process has its own memory space, so cached data is duplicated once per worker. 8 workers with a 2GB feature cache consume 16GB of RAM: 14GB of it wasted on identical copies.
8 workers × 2GB cache = 16GB of RAM for 2GB of data
📋
Every Cache Today Copies Data
Even in-process L1 caches — Cachee's DashMap, Caffeine, lru_cache — require a function call, a hash table lookup, and a memory copy to return a value. For hot-path ML feature serving at millions of inferences per second, even microseconds matter. The fastest L1 read is still 1.5µs. The fastest L0 read is 0.3ns.
L1: 1.5µs per read. L0: <1ns per read.
🚫
IPC Is Too Slow
Unix sockets, pipes, and shared Redis all add 10–100µs per operation. When you need to serve ML features at millions of inferences per second, you cannot afford a network hop or even a syscall for every feature lookup. The feature store should be as fast as reading a local variable. That requires shared memory.
IPC: 10–100µs. Shared memory: <1ns.
How It Works

mmap a Region. All Workers Read It. Zero Copies.

Cachee allocates a memory-mapped region (shm_open + mmap with MAP_SHARED) at startup. All worker processes attach to the same region. Reads are direct pointer dereference — no syscall, no copy, no serialization. Writes use lock-free MVCC for consistency. The result: reading a cached ML feature is as fast as reading a struct field.
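The pre-fork flow above can be sketched with nothing but the Python standard library. This is an illustrative stand-in, not Cachee's implementation: it uses an anonymous MAP_SHARED mapping inherited across fork(), whereas Cachee uses a named shm_open region so unrelated processes can attach as well.

```python
import mmap
import os
import struct

# Master maps an anonymous MAP_SHARED region, then forks; the child
# inherits the same physical pages. (On Unix, mmap(-1, n) is
# anonymous + MAP_SHARED by default.)
region = mmap.mmap(-1, 4096)
struct.pack_into("d", region, 0, 3.14)   # master writes before forking

pid = os.fork()
if pid == 0:
    # Worker: reads the master's value by direct memory access.
    # No pipe, no socket, no copy.
    value = struct.unpack_from("d", region, 0)[0]
    os._exit(0 if value == 3.14 else 1)

_, status = os.waitpid(pid, 0)
assert os.WEXITSTATUS(status) == 0       # worker saw the shared value
```

The same inheritance is what makes the gunicorn master/worker pattern work: the mapping created before fork() is live in every child with no further setup.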

L0 Shared Memory Architecture
Master Process
shm_open + mmap
Allocate shared region
fork()
Workers Inherit
Memory mapping shared
Worker Read
Pointer Deref
<1ns, zero syscalls
Read Path
hash(key) → shared_region[index] → return value
No syscall. No copy. No serialization. Direct pointer dereference into shared memory.
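The read path can be made concrete with a toy slot layout, again using only the standard library. The fixed-size slots and CRC32 hash here are illustrative assumptions; Cachee's real layout (collision handling, variable-size values) is richer.

```python
import mmap
import struct
import zlib

# Toy layout: 1024 fixed-size slots in one shared mapping.
SLOTS, SLOT_SIZE = 1024, 8
region = mmap.mmap(-1, SLOTS * SLOT_SIZE)

def slot_offset(key: bytes) -> int:
    # hash(key) -> slot index -> byte offset into the shared region
    return (zlib.crc32(key) % SLOTS) * SLOT_SIZE

def put(key: bytes, value: float) -> None:
    struct.pack_into("d", region, slot_offset(key), value)

def get(key: bytes) -> float:
    # The read path: offset arithmetic plus one unpack from the
    # mapping. No syscall, no serialization.
    return struct.unpack_from("d", region, slot_offset(key))[0]

put(b"feature:user:123", 0.42)
assert get(b"feature:user:123") == 0.42
```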

Pre-Fork Compatible

Cachee allocates the shared memory region in the master process before gunicorn or uvicorn forks workers. Each forked worker inherits the memory mapping automatically. No additional setup, no IPC channels, no per-worker configuration. The region is live the moment the worker starts.

This is the same mechanism that PostgreSQL uses for shared buffers and that Nginx uses for shared caching zones. The operating system handles the mapping. Cachee handles the data layout and concurrency.
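With gunicorn, the allocation step slots into the standard server hooks. This is a sketch: `on_starting` and `post_fork` are real gunicorn hooks, but `SharedCache` and its parameters are taken from the example later on this page and should be treated as illustrative.

```python
# gunicorn.conf.py — sketch of the pre-fork flow
from cachee import SharedCache

cache = None

def on_starting(server):
    # Runs once, in the master, before any worker is forked:
    # allocate the shared region here so every fork inherits it.
    global cache
    cache = SharedCache(path="/dev/shm/cachee", size_gb=2)

def post_fork(server, worker):
    # Nothing to open or configure per worker: the mapping
    # came along with fork().
    worker.log.info("L0 region inherited by pid %s", worker.pid)
```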

Lock-Free MVCC Writes

Writes to L0 use the same MVCC architecture as the L1 engine. New versions are written to the next available slot. The pointer is atomically swapped. Old versions remain readable until all readers advance their epoch. Readers are never blocked by writers.
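The core of the write path, new version then atomic publish, can be reduced to a double-buffered sketch: two value slots plus a one-byte selector that plays the role of the swapped pointer. Real MVCC (multiple versions, reader epochs, garbage collection of old versions) is more involved; this shows only the swap idea.

```python
import mmap
import struct

# Layout: [selector byte][slot 0: 8 bytes][slot 1: 8 bytes]
region = mmap.mmap(-1, 1 + 2 * 8)

def write(value: float) -> None:
    current = region[0]
    spare = 1 - current
    # 1. Write the new version into the slot readers are NOT using.
    struct.pack_into("d", region, 1 + spare * 8, value)
    # 2. Publish it with a single-byte store (the "pointer swap").
    region[0] = spare

def read() -> float:
    # Readers never block: dereference whichever slot is current.
    return struct.unpack_from("d", region, 1 + region[0] * 8)[0]

write(1.0)
write(2.0)              # new version written, then swapped in
assert read() == 2.0    # readers see the latest published version
```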

This composes naturally with hybrid tiering. L0 serves the hottest data at pointer speed. L1 handles the working set with W-TinyLFU admission. L2 (Redis) backs the full keyspace. Promotion and demotion happen automatically.

The Full Picture

The Complete Memory Hierarchy

L0 is not a replacement for L1. It is the tier below it — faster, closer to the CPU, and shared across processes. Together with L1, L1.5, and L2, it forms a complete caching hierarchy that matches data temperature to access speed.

Tier | Mechanism | Latency | Scope | Best For
L0: Zero-Copy Shared Memory | Pointer dereference into mmap region | <1ns | All processes, same machine | ML features, embeddings, hot config
L1: In-Process RAM (W-TinyLFU) | DashMap lookup + memory copy | ~1.5µs | Single process | Working set, general-purpose cache
L1.5: NVMe SSD | io_uring async read | 10–50µs | Single machine | Warm data, overflow from RAM
L2: Redis / ElastiCache | Network round-trip | 1–5ms | Cluster-wide | Full keyspace, shared state
L1 was the fastest cache tier.
L0 is 2,000x faster.
Python Integration

Zero-Copy NumPy Arrays

Cache tensors, embeddings, and feature matrices as shared memory arrays. Read them back as NumPy views — zero copy, zero allocation, zero deserialization.

from cachee import SharedCache
import numpy as np

# Allocate shared memory region (pre-fork, in gunicorn master)
cache = SharedCache(path="/dev/shm/cachee", size_gb=2)

# Store an embedding vector
embedding = np.random.randn(128).astype(np.float32)
cache.set_numpy("embedding:user:123", embedding)

# Read it back — returns a VIEW into shared memory, not a copy
result = cache.get_numpy("embedding:user:123", shape=(128,), dtype=np.float32)
# result.base points to the shared memory region
# Zero allocation. Zero copy. Sub-nanosecond.

# Works with any ML framework
import torch
tensor = torch.from_numpy(result)  # Still zero-copy — shares the buffer

# Standard key-value access
cache.set("config:model_version", b"v2.3.1")
version = cache.get("config:model_version")  # returns bytes
🧠
NumPy View, Not Copy
get_numpy(key, shape, dtype) returns a NumPy array whose underlying buffer is the shared memory region. No allocation, no memcpy, no deserialization. The array is usable immediately with TensorFlow, PyTorch, scikit-learn, or any library that accepts NumPy arrays.
np.ndarray view into shared memory
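The view-not-copy idea can be demonstrated with the standard library alone: an ndarray whose buffer is a shared memory segment. This uses `multiprocessing.shared_memory` as a stand-in for Cachee's region; `get_numpy` wraps its own mapping the same way.

```python
import numpy as np
from multiprocessing import shared_memory

# Create a shared segment sized for 128 float32 values.
shm = shared_memory.SharedMemory(create=True, size=128 * 4)
writer = np.ndarray((128,), dtype=np.float32, buffer=shm.buf)
writer[:] = np.arange(128, dtype=np.float32)      # fill through one view

# A second "read" view over the same buffer: no allocation, no memcpy.
reader = np.ndarray((128,), dtype=np.float32, buffer=shm.buf)
is_view = reader.base is not None                 # backed by shm, not owned
same_data = bool(np.array_equal(reader, writer))  # same bytes, zero copies

del writer, reader    # release buffer exports before closing the segment
shm.close()
shm.unlink()
assert is_view and same_data
```

Note the explicit `del` before `close()`: a live ndarray keeps the buffer exported, and closing a segment with exported buffers raises an error.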
⚙️
cffi Bindings, Zero Overhead
Python bindings use cffi/ctypes for direct memory access with zero Python-level overhead. No pickle, no JSON, no msgpack. The binding maps the shared memory pointer directly into Python's buffer protocol. The GIL is released during the read — other Python threads proceed without blocking.
GIL released during read
Use Cases

Built for the Hardest ML Workloads

L0 shared memory is not for every workload. It is for workloads where per-request feature access is the bottleneck — and every microsecond between request and prediction is revenue.

🤖
Python ML Serving
Gunicorn + TensorFlow/PyTorch inference. Feature lookups across 8–32 worker processes, millions of inferences per second. L0 eliminates per-worker cache duplication and serves features at hardware speed.
📊
Feature Stores
Millions of features accessed across worker processes. User embeddings, item embeddings, real-time aggregations. One shared copy, zero-copy reads, sub-nanosecond access from any worker.
📝
NLP Pipelines
Shared tokenizer caches, embedding lookup tables, vocabulary maps. Data that every worker needs and that never changes mid-request. L0 stores it once, all workers read it at pointer speed.
Real-Time Inference
Every microsecond between request arrival and prediction response matters. L0 removes the feature lookup from the latency budget entirely. The features are already in your address space.
Future

The Cache Becomes the Feature Store

When embedded micro-inference ships, L0 shared memory becomes its natural data source. Models read features from shared memory at sub-nanosecond speed. The cache is the feature store is the inference input.

L0 Shared Memory
Feature Store
<1ns reads
Embedded Model
Micro-Inference
In-cache prediction
Result
Prediction
Zero network hops
Feature lookup + model inference + prediction — entirely in shared memory. No network. No serialization. No external feature store.
FAQ

Frequently Asked Questions

What is zero-copy shared memory caching?

Zero-copy shared memory caching maps a memory region (via mmap/shm_open) that is shared across all processes on the same machine. When a process reads a cached value, it performs a direct pointer dereference into the shared region — no system call, no memory copy, no serialization. The read completes in sub-nanosecond time because the data is already in the process's virtual address space.

Why do Python ML workloads need shared memory caching?

Python's Global Interpreter Lock (GIL) prevents true multi-threaded parallelism for CPU-bound work. Production Python deployments use multi-process architectures where each process has its own memory space. Without shared memory, each process maintains its own cache copy. Eight workers with a 2GB feature cache consume 16GB of RAM for 2GB of unique data. L0 eliminates this duplication: one copy, shared across all workers, with zero-copy reads.

How does L0 compare to L1 in-process caching?

L1 in-process caching (DashMap with W-TinyLFU) requires a function call, hash table lookup, and memory copy — approximately 1.5 microseconds per read. L0 shared memory is a direct pointer dereference — approximately 0.3–0.8 nanoseconds per read, which is 2,000–5,000x faster. L0 sits below L1 in the memory hierarchy: data that is in L0 never needs to be looked up in L1.

Does L0 work with gunicorn's pre-fork model?

Yes. The master process allocates the shared memory region before forking workers. When gunicorn forks, each worker inherits the memory mapping automatically — no additional setup, no IPC channels, no configuration per worker. Writes use lock-free MVCC to ensure consistency without blocking readers.

Can I cache NumPy arrays in shared memory without copying?

Yes. get_numpy(key, shape, dtype) returns a NumPy array view directly into the shared memory region. The returned array shares the same underlying memory — no copy, no allocation, no deserialization. You can pass it directly to TensorFlow, PyTorch, or scikit-learn.

Stop Duplicating Cache Data Across Workers.
Share It. At Hardware Speed.

Zero-copy shared memory. Sub-nanosecond reads. Native Python bindings. Purpose-built for the workloads where every nanosecond between request and prediction is revenue.
