
Caching ML Model Weights: Sub-Microsecond Guide

April 18, 2026 | 11 min read | Engineering

Every ML serving pipeline has the same bottleneck, and it is not inference. It is the data lookup that happens before inference begins. Before your model can score a recommendation, classify a transaction, or generate an embedding, it needs input features. Those features come from somewhere -- a feature store, a database, a Redis cluster, an API call to another service. The lookup latency for those features often exceeds the inference latency itself.

A typical embedding lookup from Redis takes 0.5ms for a 4 KB vector. A feature vector retrieval from a managed feature store takes 2-10ms. A cold embedding lookup from DynamoDB takes 5-15ms. Meanwhile, the actual model inference on a modern GPU takes 1-5ms for most production models, and CPU-based models (XGBoost, LightGBM, small neural nets) run in 0.1-1ms.

Your model is fast. Your data access is slow. This guide shows how to fix that by caching the right ML artifacts at the right tier, with the right eviction policy, to achieve sub-microsecond data access for ML serving.

What ML Engineers Actually Cache

Not all ML artifacts are cacheable, and not all cacheable artifacts benefit equally from caching. Here is what matters, ordered by impact.

1. Embedding Vectors (4 KB per vector, millions of entries)

This is the highest-impact ML caching use case. Every recommendation system, search engine, fraud detector, and personalization pipeline uses embeddings. A typical embedding is a 1024-dimensional float32 vector: 1024 dimensions times 4 bytes per float equals 4,096 bytes per embedding. Some systems use 768 dimensions (3,072 bytes) or 1536 dimensions (6,144 bytes), but 4 KB is the common center.

A product catalog with 10 million items has 10 million embeddings. A user embedding table with 50 million users has 50 million embeddings. At serving time, a recommendation request requires the query user's embedding and the candidate items' embeddings. A typical request touches 1 user embedding and 100-1,000 candidate embeddings. That is 100-1,000 lookups per request.

From Redis, each 4 KB embedding lookup takes 0.5ms. One hundred lookups take 50ms (with pipelining, maybe 10-15ms). One thousand lookups take 500ms (with pipelining, 50-80ms). For a recommendation system that must respond in under 100ms, spending 50-80ms on embedding lookups leaves almost no budget for the actual scoring model.

From an in-process cache, each 4 KB embedding lookup takes 31 nanoseconds. One hundred lookups take 3.1 microseconds. One thousand lookups take 31 microseconds. The entire embedding retrieval phase drops from dominating the request budget to being a rounding error.

31ns per embedding lookup | 3.1us per 100 embeddings | 16,129x faster than Redis at 4 KB

2. Model Inference Outputs (Variable size, high reuse)

Many ML systems receive the same or similar inputs repeatedly. A fraud detection model scoring the same merchant-category-amount pattern. A recommendation model scoring the same user-item pair. A classification model processing the same text input. If the model is deterministic (same input produces same output), caching the output eliminates redundant inference entirely.

The cache key is a hash of the model input features. The cache value is the model output -- typically a score (8 bytes), a probability distribution (100-500 bytes), or a full prediction with metadata (1-4 KB). At 1-4 KB per entry, this is squarely in the range where Redis latency begins to scale linearly with value size.

In practice, model output caching delivers 30-60% hit rates for recommendation systems (users revisit similar content), 40-70% for fraud scoring (transactions cluster around common patterns), and 60-90% for search ranking (popular queries repeat). Each cache hit eliminates a GPU inference call that costs 1-5ms and consumes GPU memory that could serve other requests.

3. Feature Vectors from Feature Stores (2-20 KB, latency-critical)

Feature stores (Feast, Tecton, Hopsworks, SageMaker Feature Store) provide precomputed feature vectors for model inference. A typical feature vector contains 50-200 features, each a float32 or float64, plus metadata. Total size: 2-20 KB depending on the number of features and encoding.

Managed feature stores add significant latency. Feast with Redis online store: 2-5ms per lookup. Tecton: 5-15ms per lookup. SageMaker Feature Store: 10-25ms per lookup. These latencies exist because the feature store is a separate service with its own serialization, network hop, and deserialization overhead.

Caching the most frequently accessed feature vectors in-process eliminates this latency entirely. For a fraud model that checks the same merchant's features 10,000 times per day, the feature vector is computed once and served at 31ns for the remaining 9,999 lookups.

4. Tokenizer State and Vocabulary Mappings (1-50 MB, loaded once)

NLP models require tokenizer vocabularies that map strings to token IDs. A BERT tokenizer vocabulary is approximately 800 KB. A GPT-style BPE tokenizer vocabulary with merge rules is 2-5 MB. A multilingual tokenizer (like mT5) can be 10-50 MB. These are loaded once and read millions of times.

This is not a traditional caching problem -- tokenizer state should be loaded into memory at startup and kept permanently. But in serverless and container-based environments (Lambda, Cloud Run, Kubernetes with frequent scaling), every new instance must reload the tokenizer. Caching the tokenizer state in a shared L0 cache that persists across container restarts eliminates cold-start tokenizer loading (typically 500ms-2s).

Memory Math: Can You Fit It?

The first question every ML engineer asks about in-process caching is: "Can I fit my embeddings in memory?" The answer is usually yes for the hot subset, and the math is straightforward.

Embedding Count   Dimensions   Bytes/Vector   Total Memory   Fits in RAM?
100,000           1024         4,096          390 MB         Easily (any instance)
1,000,000         1024         4,096          3.8 GB         Yes (8GB+ instance)
10,000,000        1024         4,096          38 GB          Yes (64GB+ instance)
100,000,000       1024         4,096          381 GB         Dedicated (r7g.12xlarge)
1,000,000         1536         6,144          5.7 GB         Yes (16GB+ instance)
1,000,000         768          3,072          2.9 GB         Yes (8GB+ instance)

Most production ML systems do not need all embeddings in memory. The access distribution follows a power law: 5% of embeddings account for 80% of lookups. If you have 10 million product embeddings but only 500,000 products are actively viewed, you need 1.9 GB of L0 cache to cover 80% of embedding lookups at 31ns. The remaining 20% fall through to Redis or your embedding store and pay the network latency -- but only on the cold tail.
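The table's arithmetic is easy to verify with a few lines of Python. This is a sanity-check helper, not part of any Cachee API, and it counts raw embedding payload only (cache metadata overhead excluded):

```python
def embedding_cache_bytes(count: int, dims: int, bytes_per_component: int = 4) -> int:
    """Raw payload size for `count` embeddings of `dims` float32 components."""
    return count * dims * bytes_per_component

# 500,000 hot embeddings at 1024 dims: the ~1.9 GB hot-set figure quoted above
hot_gib = embedding_cache_bytes(500_000, 1024) / 2**30
print(f"{hot_gib:.2f} GiB")  # 1.91 GiB
```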

The 80/20 Rule for ML Caching

You do not need to cache all embeddings to get most of the benefit. 1 million hot embeddings at 4 KB each = 3.8 GB of L0 memory. This covers 80%+ of lookups at 31ns per read. The cold 20% falls through to Redis at 0.5ms. Your effective average lookup latency drops from 0.5ms to 0.1ms -- a 5x improvement -- even with imperfect cache coverage. With CacheeLFU admission, the hot set self-tunes based on observed access patterns.

CacheeLFU: Hot/Cold Embedding Tiering

Embedding access patterns are highly skewed but not static. Today's popular product is tomorrow's long tail. A static cache configuration (fixed set of embeddings in memory) becomes stale as access patterns shift. You need an eviction policy that automatically promotes hot embeddings and demotes cold ones.

CacheeLFU tracks access frequency using a count-min sketch with 4 rows of 65,536 atomic counters. Total memory: 512 KiB, constant regardless of how many embeddings exist in the system. At 10 million embeddings, per-key frequency tracking in a hash map would consume approximately 640 MB (64 bytes per key for the map entry). CacheeLFU's sketch uses 512 KiB -- a roughly 1,200x memory reduction.

The admission scoring function is frequency / ln(age_since_last_access). An embedding accessed 10,000 times in the last minute gets a high score and stays in L0. An embedding accessed 3 times last week gets a low score and evicts to L1 (warm tier) or drops entirely. The scoring naturally handles the power-law distribution of embedding accesses: popular items stay resident, long-tail items cycle through on demand.
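A toy count-min sketch plus the scoring function makes the mechanism concrete. This is an illustrative stdlib sketch, not Cachee's implementation: CRC32 stands in for real per-row hash functions, and `CountMinSketch` and `admission_score` are made-up names.

```python
import math
import zlib

class CountMinSketch:
    """4 rows x 65,536 counters, as described above."""
    ROWS, WIDTH = 4, 65_536

    def __init__(self):
        self.rows = [[0] * self.WIDTH for _ in range(self.ROWS)]

    def _slots(self, key: str):
        # One independent-ish hash per row (CRC32 with a per-row salt)
        for r in range(self.ROWS):
            yield r, zlib.crc32(f"{r}:{key}".encode()) % self.WIDTH

    def increment(self, key: str) -> None:
        for r, i in self._slots(key):
            self.rows[r][i] += 1

    def estimate(self, key: str) -> int:
        # Collisions only inflate counters, so the minimum is the best estimate
        return min(self.rows[r][i] for r, i in self._slots(key))

def admission_score(freq: int, age_seconds: float) -> float:
    """frequency / ln(age_since_last_access), clamped so ln(...) >= 1."""
    return freq / math.log(max(age_seconds, math.e))

sketch = CountMinSketch()
for _ in range(10_000):
    sketch.increment("emb:viral-product")
sketch.increment("emb:long-tail-item")

# Hot item accessed 10,000x in the last minute vs. 3x in the last week
hot = admission_score(sketch.estimate("emb:viral-product"), 60)
cold = admission_score(3, 7 * 24 * 3600)
```

The hot embedding scores orders of magnitude higher than the cold one, so it stays resident in L0 while the long-tail item is a candidate for demotion.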

How Tiering Works in Practice

Consider a recommendation system with 5 million product embeddings. You allocate 4 GB to L0 cache, which holds approximately 1 million 4 KB embeddings. The system starts cold -- every lookup misses L0 and falls through to Redis. Over the first few minutes, CacheeLFU observes access patterns and promotes the most frequently accessed embeddings to L0.

After warmup (typically 2-5 minutes under production traffic), L0 holds the 1 million most popular product embeddings. Hit rate stabilizes at 82-88% depending on how skewed the access distribution is. The remaining 12-18% of lookups go to Redis at 0.5ms. The effective average latency is: (0.85 x 31ns) + (0.15 x 500,000ns) = 75,026ns = 0.075ms. That is 6.7x faster than pure Redis, with no manual configuration of which embeddings to cache.
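The blended-latency formula above generalizes to any hit rate; a tiny helper (illustrative only) makes it easy to plug in your own numbers:

```python
def effective_latency_ns(hit_rate: float, hit_ns: float = 31,
                         miss_ns: float = 500_000) -> float:
    """Blended lookup latency for an L0 cache in front of Redis."""
    return hit_rate * hit_ns + (1 - hit_rate) * miss_ns

print(effective_latency_ns(0.85))  # ~75,026 ns = 0.075 ms, matching the text
```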

When access patterns shift -- a new product goes viral, a seasonal trend changes the popular items -- CacheeLFU automatically evicts the least-frequently-accessed embeddings and promotes the newly popular ones. The transition happens within minutes, not hours. There is no manual cache warming, no static configuration files, no deployment to update the cached embedding set.

Zero-Copy Reads for Large Model Weight Shards

Some ML serving architectures shard model weights across multiple workers. Each worker holds a subset of the model and serves inference for its shard. When a request arrives, the router identifies which shard(s) are needed and dispatches the request. The worker loads the relevant weight matrices, runs inference, and returns the result.

Weight matrices are large. A single 4096 x 4096 attention weight matrix in a 7B parameter model is approximately 32 MB (4096 x 4096 elements x 2 bytes for float16). Loading these from disk takes 1-5ms on NVMe. Loading from Redis would take 80-200ms (32 MB over network). Loading from an in-process cache takes 31ns -- but only if the read is zero-copy.

A traditional in-process hash map returns a clone of the value. For a 32 MB weight matrix, that clone takes approximately 200-500 microseconds (memcpy at memory bandwidth). This is faster than Redis but still significant. Cachee returns a reference to the value, not a copy. The 31ns measurement is the hash lookup plus pointer dereference. The model inference code reads directly from the cached memory region. Zero allocation, zero copy, zero overhead beyond the initial lookup.

This matters for two reasons. First, it eliminates the clone latency that would otherwise dominate for large values. Second, it eliminates the memory allocation. A model serving 1,000 requests per second, each reading a 32 MB weight shard, would allocate 32 GB/s of heap memory if using copies. With zero-copy reads, heap allocation for cache reads is exactly zero.
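Python has a cheap analogue of this distinction: `bytes(...)` copies, while `memoryview` hands out a window over the existing buffer. This is not Cachee's implementation, just a way to see the copy cost that a zero-copy read avoids:

```python
# A cached 32 MB value, stood in for by a bytes object
weights = bytes(32 * 1024 * 1024)

view = memoryview(weights)   # zero-copy: a pointer + length into `weights`
assert view.obj is weights   # same underlying buffer, nothing allocated

chunk = view[:4096]          # slicing the view is still zero-copy
copied = bytes(chunk)        # THIS is the memcpy a clone-on-read cache pays
```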

When Not to Cache Model Weights

Full model weights for large models (7B+ parameters) are tens of gigabytes. Do not attempt to cache an entire 14 GB model in a key-value cache. Use memory-mapped files (mmap) instead -- frameworks like ONNX Runtime, PyTorch (torch.load(mmap=True)), and vLLM already do this. In-process caching is for shards, layers, and frequently accessed subsets of weights, not the entire model. Cache the hot attention layers. Let the OS page cache handle the rest.
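The stdlib `mmap` module shows the mechanism those frameworks rely on: the file is mapped, not read, and the OS faults pages in on first access. The file here is a synthetic stand-in for a real weights file.

```python
import mmap
import os
import tempfile

# Create a synthetic 16 MB "weights" file
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(bytes(16 * 1024 * 1024))

# Map it read-only: no bulk copy into the heap, pages load lazily on access
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_page = mapped[:4096]   # only these pages get faulted in
    mapped.close()
```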

Common ML Caching Patterns

Pattern 1: Embedding Lookup Cache (Recommendations, Search)

The most common and highest-impact pattern. Cache user and item embeddings for nearest-neighbor lookup. The cache key is the entity ID (user ID, product ID, document ID). The cache value is the embedding vector (4 KB at 1024 dimensions). Access pattern: extremely high frequency for popular items, power-law distribution.

# Pseudocode: embedding lookup with L0 cache
import numpy as np

def get_embedding(entity_id: str) -> np.ndarray:
    key = f"emb:{entity_id}"

    # L0: in-process, 31ns
    cached = cachee.get(key)
    if cached is not None:
        return np.frombuffer(cached, dtype=np.float32)

    # L1: Redis fallthrough, 0.5ms
    raw = redis.get(key)
    if raw is not None:
        cachee.set(key, raw)  # promote to L0
        return np.frombuffer(raw, dtype=np.float32)

    # L2: compute from model, then populate both tiers
    embedding = model.encode(entity_id)
    cachee.set(key, embedding.tobytes())
    redis.set(key, embedding.tobytes())
    return embedding

Pattern 2: Model Output Cache (Fraud, Classification)

Cache the output of deterministic models to avoid redundant inference. The cache key is a hash of the input features. The cache value is the model prediction. This pattern works best when the same inputs recur frequently -- fraud models see the same merchant-amount-category patterns thousands of times per day.

# Pseudocode: model output caching
import hashlib
import json
import struct

def predict(features: dict) -> float:
    # Canonical JSON so identical features always hash to the same key
    payload = json.dumps(features, sort_keys=True).encode()
    cache_key = f"pred:{hashlib.sha256(payload).hexdigest()[:16]}"

    cached = cachee.get(cache_key)
    if cached is not None:
        return struct.unpack('d', cached)[0]  # 8 bytes -> float64

    score = model.predict(features)
    cachee.set(cache_key, struct.pack('d', score), ttl=300)  # 5 min TTL
    return score

The TTL on model output caches is critical. Feature values change over time (a user's transaction count increases, a merchant's average ticket size shifts). A TTL of 5-15 minutes balances cache hit rate against staleness. For real-time fraud detection, shorter TTLs (30-60 seconds) prevent stale scores from passing transactions that a fresh model would flag.

Pattern 3: Feature Store Cache (Any ML Pipeline)

Place an in-process cache in front of your feature store. The cache key is the entity ID plus feature group name. The cache value is the serialized feature vector. This pattern reduces feature store latency from 5-15ms to 31ns for hot entities.

# Pseudocode: feature store with L0 cache
import msgpack

def get_features(user_id: str, feature_group: str) -> dict:
    cache_key = f"feat:{feature_group}:{user_id}"

    cached = cachee.get(cache_key)
    if cached is not None:
        return msgpack.unpackb(cached)

    # Feature store lookup: 5-15ms
    features = feast_client.get_online_features(
        entity_rows=[{"user_id": user_id}],
        feature_refs=[f"{feature_group}:*"],
    ).to_dict()

    cachee.set(cache_key, msgpack.packb(features), ttl=60)
    return features

Pattern 4: Tokenizer and Vocabulary Cache (NLP Serving)

For NLP models in containerized environments, cache the tokenizer state to eliminate cold-start loading. The cache key is the model name plus tokenizer version. The cache value is the serialized tokenizer (1-50 MB). This is a write-rarely, read-constantly pattern -- the tokenizer loads once and is read on every inference request.

In a Kubernetes environment with horizontal pod autoscaling, every new pod must load the tokenizer from disk or a remote store. With a shared Cachee instance (or a Cachee sidecar with persistent volume), the tokenizer loads from L0 at 31ns instead of from S3 at 500ms-2s. For models that scale from 2 to 50 pods during traffic spikes, this eliminates 24-96 seconds of cumulative tokenizer loading during scale-up events.

Latency Comparison: The Full Picture

Operation                  Redis      Feature Store   Cachee L0     Speedup vs Redis
Embedding lookup (4 KB)    0.50ms     5-15ms          0.000031ms    16,129x
100 embedding batch        10-15ms*   50-150ms        0.0031ms      3,226-4,839x
Feature vector (8 KB)      0.65ms     5-15ms          0.000031ms    20,968x
Model output (1 KB)        0.35ms     N/A             0.000031ms    11,290x
Tokenizer load (5 MB)      15ms       500-2000ms      0.000031ms    483,871x

* Redis batch with pipelining. Without pipelining: 50ms.

Architecture Recommendations by Scale

Small Scale: Under 100K Embeddings

Cache everything in-process. At 4 KB per embedding, 100K embeddings is 390 MB. This fits in any production instance. No Redis needed for embedding serving. Use Cachee as the sole embedding store at serving time, with a batch job that refreshes embeddings from your training pipeline on a schedule (hourly, daily). Every lookup is 31ns. Hit rate: 100%.

Medium Scale: 100K - 10M Embeddings

Tiered architecture. L0 (in-process Cachee) holds the hot 500K-2M embeddings in 2-8 GB of memory. CacheeLFU handles admission and eviction automatically. Redis L1 holds the full embedding set for fallthrough. Expected L0 hit rate: 80-92% depending on access distribution skew. Effective average latency: 0.05-0.1ms.

Large Scale: 10M+ Embeddings

Tiered architecture with dedicated embedding cache instances. Run Cachee on high-memory instances (r7g.4xlarge, 128 GB) with 80 GB allocated to L0 embedding cache. This holds 20 million embeddings in-process. For catalogs larger than 20M items, partition by embedding space (different Cachee instances for different product categories or user segments). Each partition covers its hot set independently.

At this scale, the cost of a dedicated Cachee instance ($0.80/hr for r7g.4xlarge) is a fraction of the cost of the GPU instances running inference. If the Cachee layer reduces average embedding lookup from 0.5ms to 0.05ms, and your inference pipeline performs 500 embedding lookups per request at 10,000 requests per second, the cumulative time savings is 2,250 GPU-seconds per second. That is the equivalent of freeing up 2,250 GPUs worth of wait time, which more than justifies a $0.80/hr instance.
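The arithmetic behind that claim, spelled out with the numbers from the paragraph above:

```python
requests_per_sec = 10_000
lookups_per_request = 500
saved_ms_per_lookup = 0.45           # 0.5ms Redis -> 0.05ms effective average

lookups_per_sec = requests_per_sec * lookups_per_request
saved_seconds_per_second = lookups_per_sec * saved_ms_per_lookup / 1000
print(saved_seconds_per_second)      # 2,250 seconds of wait time per second
```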

The ML Caching Payoff

ML inference is expensive. Every millisecond your model waits for data is a millisecond of GPU idle time you are paying for. At $3/hr per GPU and 10,000 inferences per second, reducing data access latency by 0.45ms (from 0.5ms to 0.05ms) saves 4.5 GPU-seconds per second of idle time. Over a month, that is 11.7 million GPU-seconds recovered. In-process embedding caching does not just improve latency -- it improves GPU utilization, which directly reduces your inference infrastructure bill.

Getting Started

# Install
brew tap h33ai-postquantum/tap
brew install cachee

# Initialize with ML-optimized settings
cachee init --memory-limit 4GB
cachee start

# Cache an embedding (4KB float32 vector)
cachee set "emb:user:12345" "$(python3 -c "import numpy as np; print(np.random.randn(1024).astype(np.float32).tobytes().hex())")"

# Retrieve at 31ns
cachee get "emb:user:12345"

# Monitor hit rate and memory usage
cachee status

Cachee speaks RESP -- any Redis client library works. If your ML serving code already uses Redis for embedding lookups, change the connection string from redis://your-redis:6379 to redis://localhost:6380. Cachee intercepts the lookups, serves hot embeddings at 31ns from L0, and falls through to your existing Redis for cold misses. No model code changes. No retraining. No redeployment of your inference pipeline.

The data access layer is the bottleneck your ML monitoring dashboard does not show. Your GPU utilization looks healthy. Your model latency looks acceptable. But 30-80% of your end-to-end inference time is spent waiting for embeddings, features, and cached outputs to arrive over the network. Move that data into the same process as your model, and the latency disappears.

Serve embeddings at 31ns. Stop paying GPU idle time for network latency.
