L0 Cache Tier

Faster Than L1. Zero Copies.
Shared Across Processes.

L0 maps a shared memory region that all processes on the same machine read directly. No serialization. No IPC. No function call overhead. Sub-nanosecond reads for data that's already in your address space.

Sub-ns
Reads
Zero
Copies
Multi-Process
Shared Memory
Python
Native
The Problem

Python's Architecture Forces Duplication

Every production Python deployment runs into the same wall. The GIL prevents multi-threaded parallelism, so you scale with processes. And every process has its own memory space. Your cache data is duplicated N times.

🐍
The GIL Forces Multi-Process
Gunicorn workers. Uvicorn workers. Multiprocessing pools. Celery workers. Every production Python deployment that needs parallelism forks into multiple processes. Each process has its own memory space, so cached data is duplicated once per worker. 8 workers with a 2GB feature cache consume 16GB of RAM: 14GB of it wasted on identical copies.
8 workers × 2GB cache = 16GB of RAM for 2GB of data
📋
Every Cache Today Copies Data
Even in-process L1 caches — Cachee's DashMap, Caffeine, lru_cache — require a function call, a hash table lookup, and a memory copy to return a value. For hot-path ML feature serving at millions of inferences per second, even microseconds matter. The fastest L1 read is still 1.5µs. The fastest L0 read is 0.3ns.
L1: 1.5µs per read. L0: <1ns per read.
🚫
IPC Is Too Slow
Unix sockets, pipes, and shared Redis all add 10–100µs per operation. When you need to serve ML features at millions of inferences per second, you cannot afford a network hop or even a syscall for every feature lookup. The feature store should be as fast as reading a local variable. That requires shared memory.
IPC: 10–100µs. Shared memory: <1ns.
How It Works

mmap a Region. All Workers Read It. Zero Copies.

Cachee allocates a memory-mapped region (shm_open + mmap with MAP_SHARED) at startup. All worker processes attach to the same region. Reads are direct pointer dereference — no syscall, no copy, no serialization. Writes use lock-free MVCC for consistency. The result: reading a cached ML feature is as fast as reading a struct field.
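The pre-fork flow above can be sketched with nothing but the Python standard library. This is an illustrative stand-in, not Cachee's implementation: it uses an anonymous MAP_SHARED mapping inherited across fork(), whereas Cachee uses a named shm_open region so unrelated processes can attach as well.

```python
import mmap
import os
import struct

# Master maps an anonymous MAP_SHARED region, then forks; the child
# inherits the same physical pages. (On Unix, mmap(-1, n) is
# anonymous + MAP_SHARED by default.)
region = mmap.mmap(-1, 4096)
struct.pack_into("d", region, 0, 3.14)   # master writes before forking

pid = os.fork()
if pid == 0:
    # Worker: reads the master's value by direct memory access.
    # No pipe, no socket, no copy.
    value = struct.unpack_from("d", region, 0)[0]
    os._exit(0 if value == 3.14 else 1)

_, status = os.waitpid(pid, 0)
assert os.WEXITSTATUS(status) == 0       # worker saw the shared value
```

The same inheritance is what makes the gunicorn master/worker pattern work: the mapping created before fork() is live in every child with no further setup.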

L0 Shared Memory Architecture
Master Process
shm_open + mmap
Allocate shared region
fork()
Workers Inherit
Memory mapping shared
Worker Read
Pointer Deref
<1ns, zero syscalls
Read Path
hash(key) → shared_region[index] → return value
No syscall. No copy. No serialization. Direct pointer dereference into shared memory.
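The read path can be made concrete with a toy slot layout, again using only the standard library. The fixed-size slots and CRC32 hash here are illustrative assumptions; Cachee's real layout (collision handling, variable-size values) is richer.

```python
import mmap
import struct
import zlib

# Toy layout: 1024 fixed-size slots in one shared mapping.
SLOTS, SLOT_SIZE = 1024, 8
region = mmap.mmap(-1, SLOTS * SLOT_SIZE)

def slot_offset(key: bytes) -> int:
    # hash(key) -> slot index -> byte offset into the shared region
    return (zlib.crc32(key) % SLOTS) * SLOT_SIZE

def put(key: bytes, value: float) -> None:
    struct.pack_into("d", region, slot_offset(key), value)

def get(key: bytes) -> float:
    # The read path: offset arithmetic plus one unpack from the
    # mapping. No syscall, no serialization.
    return struct.unpack_from("d", region, slot_offset(key))[0]

put(b"feature:user:123", 0.42)
assert get(b"feature:user:123") == 0.42
```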

Pre-Fork Compatible

Cachee allocates the shared memory region in the master process before gunicorn or uvicorn forks workers. Each forked worker inherits the memory mapping automatically. No additional setup, no IPC channels, no per-worker configuration. The region is live the moment the worker starts.

This is the same mechanism that PostgreSQL uses for shared buffers and that Nginx uses for shared caching zones. The operating system handles the mapping. Cachee handles the data layout and concurrency.
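With gunicorn, the allocation step slots into the standard server hooks. This is a sketch: `on_starting` and `post_fork` are real gunicorn hooks, but `SharedCache` and its parameters are taken from the example later on this page and should be treated as illustrative.

```python
# gunicorn.conf.py — sketch of the pre-fork flow
from cachee import SharedCache

cache = None

def on_starting(server):
    # Runs once, in the master, before any worker is forked:
    # allocate the shared region here so every fork inherits it.
    global cache
    cache = SharedCache(path="/dev/shm/cachee", size_gb=2)

def post_fork(server, worker):
    # Nothing to open or configure per worker: the mapping
    # came along with fork().
    worker.log.info("L0 region inherited by pid %s", worker.pid)
```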

Lock-Free MVCC Writes

Writes to L0 use the same MVCC architecture as the L1 engine. New versions are written to the next available slot. The pointer is atomically swapped. Old versions remain readable until all readers advance their epoch. Readers are never blocked by writers.
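The core of the write path, new version then atomic publish, can be reduced to a double-buffered sketch: two value slots plus a one-byte selector that plays the role of the swapped pointer. Real MVCC (multiple versions, reader epochs, garbage collection of old versions) is more involved; this shows only the swap idea.

```python
import mmap
import struct

# Layout: [selector byte][slot 0: 8 bytes][slot 1: 8 bytes]
region = mmap.mmap(-1, 1 + 2 * 8)

def write(value: float) -> None:
    current = region[0]
    spare = 1 - current
    # 1. Write the new version into the slot readers are NOT using.
    struct.pack_into("d", region, 1 + spare * 8, value)
    # 2. Publish it with a single-byte store (the "pointer swap").
    region[0] = spare

def read() -> float:
    # Readers never block: dereference whichever slot is current.
    return struct.unpack_from("d", region, 1 + region[0] * 8)[0]

write(1.0)
write(2.0)              # new version written, then swapped in
assert read() == 2.0    # readers see the latest published version
```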

This composes naturally with hybrid tiering. L0 serves the hottest data at pointer speed. L1 handles the working set with W-TinyLFU admission. L2 (Redis) backs the full keyspace. Promotion and demotion happen automatically.

The Full Picture

The Complete Memory Hierarchy

L0 is not a replacement for L1. It is the tier below it — faster, closer to the CPU, and shared across processes. Together with L1, L1.5, and L2, it forms a complete caching hierarchy that matches data temperature to access speed.

Tier | Mechanism | Latency | Scope | Best For
L0: Zero-Copy Shared Memory | Pointer dereference into mmap region | <1ns | All processes, same machine | ML features, embeddings, hot config
L1: In-Process RAM (W-TinyLFU) | DashMap lookup + memory copy | ~1.5µs | Single process | Working set, general-purpose cache
L1.5: NVMe SSD | io_uring async read | 10–50µs | Single machine | Warm data, overflow from RAM
L2: Redis / ElastiCache | Network round-trip | 1–5ms | Cluster-wide | Full keyspace, shared state
L1 was the fastest cache tier.
L0 is 2,000x faster.
Python Integration

Zero-Copy NumPy Arrays

Cache tensors, embeddings, and feature matrices as shared memory arrays. Read them back as NumPy views — zero copy, zero allocation, zero deserialization.

from cachee import SharedCache
import numpy as np

# Allocate shared memory region (pre-fork, in gunicorn master)
cache = SharedCache(path="/dev/shm/cachee", size_gb=2)

# Store an embedding vector
embedding = np.random.randn(128).astype(np.float32)
cache.set_numpy("embedding:user:123", embedding)

# Read it back — returns a VIEW into shared memory, not a copy
result = cache.get_numpy("embedding:user:123", shape=(128,), dtype=np.float32)
# result.base points to the shared memory region
# Zero allocation. Zero copy. Sub-nanosecond.

# Works with any ML framework
import torch
tensor = torch.from_numpy(result)  # Still zero-copy — shares the buffer

# Standard key-value access
cache.set("config:model_version", b"v2.3.1")
version = cache.get("config:model_version")  # returns bytes
🧠
NumPy View, Not Copy
get_numpy(key, shape, dtype) returns a NumPy array whose underlying buffer is the shared memory region. No allocation, no memcpy, no deserialization. The array is usable immediately with TensorFlow, PyTorch, scikit-learn, or any library that accepts NumPy arrays.
np.ndarray view into shared memory
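The view-not-copy idea can be demonstrated with the standard library alone: an ndarray whose buffer is a shared memory segment. This uses `multiprocessing.shared_memory` as a stand-in for Cachee's region; `get_numpy` wraps its own mapping the same way.

```python
import numpy as np
from multiprocessing import shared_memory

# Create a shared segment sized for 128 float32 values.
shm = shared_memory.SharedMemory(create=True, size=128 * 4)
writer = np.ndarray((128,), dtype=np.float32, buffer=shm.buf)
writer[:] = np.arange(128, dtype=np.float32)      # fill through one view

# A second "read" view over the same buffer: no allocation, no memcpy.
reader = np.ndarray((128,), dtype=np.float32, buffer=shm.buf)
is_view = reader.base is not None                 # backed by shm, not owned
same_data = bool(np.array_equal(reader, writer))  # same bytes, zero copies

del writer, reader    # release buffer exports before closing the segment
shm.close()
shm.unlink()
assert is_view and same_data
```

Note the explicit `del` before `close()`: a live ndarray keeps the buffer exported, and closing a segment with exported buffers raises an error.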
⚙️
cffi Bindings, Zero Overhead
Python bindings use cffi/ctypes for direct memory access with zero Python-level overhead. No pickle, no JSON, no msgpack. The binding maps the shared memory pointer directly into Python's buffer protocol. The GIL is released during the read — other Python threads proceed without blocking.
GIL released during read
Use Cases

Built for the Hardest ML Workloads

L0 shared memory is not for every workload. It is for workloads where per-request feature access is the bottleneck — and every microsecond between request and prediction is revenue.

🤖
Python ML Serving
Gunicorn + TensorFlow/PyTorch inference. Feature lookups across 8–32 worker processes, millions of inferences per second. L0 eliminates per-worker cache duplication and serves features at hardware speed.
📊
Feature Stores
Millions of features accessed across worker processes. User embeddings, item embeddings, real-time aggregations. One shared copy, zero-copy reads, sub-nanosecond access from any worker.
📝
NLP Pipelines
Shared tokenizer caches, embedding lookup tables, vocabulary maps. Data that every worker needs and that never changes mid-request. L0 stores it once, all workers read it at pointer speed.
Real-Time Inference
Every microsecond between request arrival and prediction response matters. L0 removes the feature lookup from the latency budget entirely. The features are already in your address space.
Future

The Cache Becomes the Feature Store

When embedded micro-inference ships, L0 shared memory becomes its natural data source. Models read features from shared memory at sub-nanosecond speed. The cache is the feature store is the inference input.

L0 Shared Memory
Feature Store
<1ns reads
Embedded Model
Micro-Inference
In-cache prediction
Result
Prediction
Zero network hops
Feature lookup + model inference + prediction — entirely in shared memory. No network. No serialization. No external feature store.
FAQ

Frequently Asked Questions

What is zero-copy shared memory caching?

Zero-copy shared memory caching maps a memory region (via mmap/shm_open) that is shared across all processes on the same machine. When a process reads a cached value, it performs a direct pointer dereference into the shared region — no system call, no memory copy, no serialization. The read completes in sub-nanosecond time because the data is already in the process's virtual address space.

Why do Python ML workloads need shared memory caching?

Python's Global Interpreter Lock (GIL) prevents true multi-threaded parallelism for CPU-bound work. Production Python deployments use multi-process architectures where each process has its own memory space. Without shared memory, each process maintains its own cache copy. Eight workers with a 2GB feature cache consume 16GB of RAM for 2GB of unique data. L0 eliminates this duplication: one copy, shared across all workers, with zero-copy reads.

How does L0 compare to L1 in-process caching?

L1 in-process caching (DashMap with W-TinyLFU) requires a function call, hash table lookup, and memory copy — approximately 1.5 microseconds per read. L0 shared memory is a direct pointer dereference — approximately 0.3–0.8 nanoseconds per read, which is 2,000–5,000x faster. L0 sits below L1 in the memory hierarchy: data that is in L0 never needs to be looked up in L1.

Does L0 work with gunicorn's pre-fork model?

Yes. The master process allocates the shared memory region before forking workers. When gunicorn forks, each worker inherits the memory mapping automatically — no additional setup, no IPC channels, no configuration per worker. Writes use lock-free MVCC to ensure consistency without blocking readers.

Can I cache NumPy arrays in shared memory without copying?

Yes. get_numpy(key, shape, dtype) returns a NumPy array view directly into the shared memory region. The returned array shares the same underlying memory — no copy, no allocation, no deserialization. You can pass it directly to TensorFlow, PyTorch, or scikit-learn.

Stop Duplicating Cache Data Across Workers.
Share It. At Hardware Speed.

Zero-copy shared memory. Sub-nanosecond reads. Native Python bindings. Purpose-built for the workloads where every nanosecond between request and prediction is revenue.
