OpenAI published how they scaled PostgreSQL to serve 800 million ChatGPT users. The engineering is impressive — connection pooling, read replicas, aggressive partitioning. Hacker News dissected every paragraph. But buried in the details is the part nobody is talking about: none of those database optimizations survive first contact with 800 million users without a caching layer absorbing the overwhelming majority of reads. The database is the foundation. The cache is the load-bearing wall.
What OpenAI Published
The blog post details a PostgreSQL architecture that has been hardened through three years of explosive growth. The core strategies are well-understood individually but rarely deployed together at this scale. PgBouncer sits in front of every PostgreSQL instance, multiplexing thousands of application connections into a smaller pool of actual database connections. Without it, connection overhead alone would saturate the database before any queries execute. At 800 million users, even with modest concurrency, the raw connection count would overwhelm any single PostgreSQL cluster.
Read replicas distribute query load across multiple PostgreSQL instances. ChatGPT’s workload is overwhelmingly read-heavy — users retrieving conversation history, loading preferences, fetching model routing configuration. OpenAI routes these reads to replicas, reserving the primary for writes. This is standard practice at scale, but the ratio matters: at their volume, they likely run dozens of read replicas per primary, each handling tens of thousands of queries per second.
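A simplified sketch of that read/write split (the query classification here is deliberately naive, and OpenAI has not published its actual routing logic):

```python
import itertools

class ReplicaRouter:
    """Send writes to the primary; spread reads round-robin across replicas.
    Hypothetical sketch: DSNs, verb list, and round-robin policy are assumptions."""

    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "ALTER", "CREATE", "DROP"}

    def __init__(self, primary_dsn, replica_dsns):
        self.primary_dsn = primary_dsn
        self._replicas = itertools.cycle(replica_dsns)

    def dsn_for(self, query: str) -> str:
        # Classify by leading SQL verb; real systems route at the
        # transaction level and account for replica lag.
        verb = query.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            return self.primary_dsn
        return next(self._replicas)

router = ReplicaRouter("primary:5432", ["replica-1:5432", "replica-2:5432"])
```

In practice the split usually lives in the data-access layer or a proxy, not in each query call site, but the decision rule is the same.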
Partition strategies keep individual tables manageable. Conversation history, the largest data set by volume, is partitioned by user ID and time range. This ensures that queries for a single user’s recent conversations scan a small partition rather than a multi-terabyte table. Index sizes stay bounded. Vacuum operations complete in minutes instead of hours. The operational benefit is as significant as the performance benefit — a database you can maintain is a database that stays fast.
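The post does not specify the exact partition scheme; one common shape for "user ID and time range" is hash-bucketing the user id and sub-partitioning by month, which a helper might compute like this (bucket count and naming are illustrative assumptions):

```python
def partition_for(user_id: int, month: str, n_buckets: int = 64) -> str:
    """Map a conversation row to a partition name by hashing the user id
    into a fixed number of buckets and appending the month.
    Illustrative sketch, not the published scheme."""
    bucket = user_id % n_buckets
    return f"conversations_u{bucket:02d}_{month}"

print(partition_for(130, "2024_05"))  # conversations_u02_2024_05
```

Any query filtered by user id and a recent time window then prunes down to one small partition instead of scanning the whole table.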
The Caching Layer They Didn’t Detail
Here is the arithmetic that makes the case. ChatGPT has 800 million users. Assume 200 million are monthly active. Assume 30 million are daily active. Assume each daily active user generates an average of 15 database-touching interactions per day — loading the conversation list, opening a conversation, sending a message, receiving a response, checking model availability, loading user preferences. That is 450 million database operations per day, or roughly 5,200 queries per second sustained, with peaks during US and European business hours pushing to 15,000–25,000 queries per second.
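A quick sanity check of that arithmetic (the traffic figures are this article's assumptions, not published numbers):

```python
# Assumed traffic figures from the estimate above -- not OpenAI-published data.
daily_active_users = 30_000_000
interactions_per_user_per_day = 15
seconds_per_day = 86_400

daily_db_ops = daily_active_users * interactions_per_user_per_day
sustained_qps = daily_db_ops / seconds_per_day

print(f"{daily_db_ops:,} ops/day, ~{sustained_qps:,.0f} queries/sec sustained")
# 450,000,000 ops/day, ~5,208 queries/sec sustained
```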
Even with read replicas, 25,000 queries per second of actual database work is expensive. Each query requires parsing, planning, and execution. Each result set must be serialized and transmitted over the network. Connection pool slots are consumed. Replica lag becomes a factor under write pressure. The only way to make this sustainable is to ensure that the vast majority of those 25,000 requests never reach PostgreSQL at all.
The data that must be cached is extensive: session state (which user is logged in, their current conversation context), user preferences (model selection, UI settings, plugin configurations), conversation metadata (title, creation date, message count — everything needed to render the sidebar without loading full message bodies), and model routing decisions (which model serves which request, capacity availability, feature flags). Every one of these is read on nearly every page load. Every one of them changes infrequently. They are textbook caching candidates.
To keep PostgreSQL healthy at this scale, OpenAI likely needs a cache hit rate above 95%. At 95%, only 1,250 of those 25,000 peak queries per second hit the database. At 99%, it drops to 250. The difference between 95% and 99% is the difference between needing 10 read replicas and needing 2. That is not a marginal optimization. That is the architecture that makes the published database strategy viable.
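The hit-rate arithmetic is worth making concrete (integer math on the article's 25,000 QPS peak):

```python
peak_qps = 25_000

# Queries that still reach PostgreSQL after the cache absorbs hit_pct percent.
residual = {hit_pct: peak_qps * (100 - hit_pct) // 100 for hit_pct in (95, 99)}

print(residual)  # {95: 1250, 99: 250}
```

Every percentage point of hit rate above 95% removes another 250 queries per second from the database tier, which is why the last few points are the most valuable.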
What a Predictive Cache Would Add
A traditional cache is reactive. A user opens ChatGPT, the application checks the cache for their session data, gets a miss on first load, queries PostgreSQL, and populates the cache. Subsequent requests hit the cache. This works, but that first request — the cold start — always pays the full database penalty. Multiply that across 30 million daily active users, and cold-start misses alone generate millions of database queries that a smarter cache could eliminate.
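This reactive pattern is usually called cache-aside. A minimal in-process sketch (a real deployment would back this with Redis or similar rather than a dict):

```python
import time

class CacheAside:
    """Reactive (cache-aside) caching: check the cache, fall back to the
    database on a miss, then populate the cache for subsequent reads.
    The first request for any key always pays the full database cost."""

    def __init__(self, load_from_db, ttl_seconds=300):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]                                   # cache hit
        value = self._load(key)                               # cold start: DB query
        self._store[key] = (value, time.monotonic() + self._ttl)
        return value
```

The `load_from_db` callable stands in for a real query; the point is that it runs exactly once per key per TTL window, and always on the very first request.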
A predictive cache inverts the model. Instead of waiting for a request to trigger a cache population, it analyzes access patterns and pre-warms data before the user arrives. If a user opens ChatGPT at 9:03 AM every weekday, the cache begins loading their session state, conversation list, and model preferences at 9:01 AM. When the user’s first request arrives, everything is already in memory. The cold start disappears. The database never sees the query.
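The prediction step can be sketched with a deliberately simple heuristic based on each user's habitual open time; a production system would use richer access-pattern models, but the shape is the same:

```python
def prewarm_candidates(habitual_open_minute, now_minute, lead_minutes=2):
    """Return users whose habitual open time falls within the next
    `lead_minutes`, so their session data can be loaded before they arrive.
    `habitual_open_minute` maps user_id -> minute-of-day (9:03 AM -> 543).
    Illustrative heuristic; real predictors model per-day patterns and confidence."""
    return [
        user_id
        for user_id, minute in habitual_open_minute.items()
        if 0 <= minute - now_minute <= lead_minutes
    ]

# At 9:01 AM (minute 541), the 9:03 AM regular is queued for pre-warming:
print(prewarm_candidates({"alice": 543, "bob": 600}, now_minute=541))  # ['alice']
```

A scheduler runs this every minute and enqueues pre-warm jobs for the returned users.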
At OpenAI’s scale, predictive caching would have three high-impact applications. First, session pre-warming — loading user context into the cache before the user opens the app. Second, conversation resumption prediction — identifying which of a user’s conversations are likely to be reopened and pre-loading their metadata (most users return to their most recent 2–3 conversations). Third, model routing pre-computation — caching the decision of which model will serve a user’s request based on their subscription tier, current load, and feature flags, so the routing lookup is a cache read instead of a multi-step computation.
The database caching layer approach extends this further. Rather than caching individual query results, you cache the logical data objects that the application works with. A user’s “session bundle” — preferences, active conversation, model assignment — is cached as a single unit, fetched in one lookup instead of three separate cache reads. This reduces not just database load but also cache access volume, which matters when you are serving billions of cache reads per day.
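As a sketch, the bundle might be assembled once and cached under a single key (`db` and its fetch methods are hypothetical stand-ins for your data-access layer; the bundle contents follow the example above):

```python
def build_session_bundle(user_id, db):
    """Assemble the logical session bundle once, so the hot path needs a
    single cache read instead of one per component.
    `db` and its fetch methods are hypothetical stand-ins."""
    return {
        "preferences": db.fetch_preferences(user_id),
        "active_conversation": db.fetch_active_conversation(user_id),
        "model_assignment": db.fetch_model_assignment(user_id),
    }

# cache.set(f"session:{user_id}", build_session_bundle(user_id, db))
# One cache read on the hot path now replaces several separate lookups.
```

The trade-off is invalidation granularity: a change to any component invalidates the whole bundle, which is acceptable when, as here, the components change infrequently.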
Lessons for Your Architecture
You do not need 800 million users to benefit from the architecture patterns OpenAI is using. The principles scale down cleanly. At 10,000 users, the same L1 + L2 + database architecture applies — the numbers are smaller, but the ratios are identical. Your database is still doing redundant work on every repeated read. Your users are still experiencing cold-start latency on first load. Your cache hit rate still determines whether your infrastructure costs are linear or sublinear with user growth.
The layered approach is what matters. L1 (in-process) handles the hottest data — the 100–1,000 keys that account for 80% of your read volume. Zero serialization, zero network hops, microsecond lookups. L2 (Redis or equivalent) handles the warm data — user sessions, recent query results, computed aggregations. Millisecond lookups, shared across application instances. Database handles cold reads and all writes — the fallback for cache misses and the source of truth for mutations.
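A minimal two-tier lookup path, with the L2 tier faked by a dict so the sketch is self-contained (in production it would be a Redis client, and L1 would use a proper LRU rather than the crude FIFO shown here):

```python
class TwoTierCache:
    """L1: in-process dict (no network hop, no serialization).
    L2: shared cache, faked here with a dict; in production, Redis.
    Misses fall through to the database loader, the source of truth."""

    def __init__(self, l2, load_from_db, l1_max=1_000):
        self._l1 = {}
        self._l1_max = l1_max
        self.l2 = l2
        self._load = load_from_db

    def get(self, key):
        if key in self._l1:
            return self._l1[key]                 # microseconds: same process
        value = self.l2.get(key)                 # milliseconds: network hop
        if value is None:
            value = self._load(key)              # cold read: hits the database
            self.l2[key] = value
        if len(self._l1) >= self._l1_max:
            self._l1.pop(next(iter(self._l1)))   # crude FIFO eviction for the sketch
        self._l1[key] = value
        return value
```

Note that an L2 hit also populates L1, so each application instance pays the network hop at most once per key per L1 residency.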
The mistake most teams make is jumping straight to L2 (Redis) and skipping L1 entirely. They accept the serialization overhead, the network hop, and the stampede risk as inherent costs of caching. They are not. They are costs of remote caching. An in-process L1 tier eliminates them on the hottest path, where the savings compound most aggressively. OpenAI has the engineering resources to build custom solutions for each layer. You do not need to. The architecture pattern is the same whether you are caching ChatGPT sessions or product catalog pages.
To increase your cache hit rate from the typical 85–90% to the 95%+ that makes this architecture sing, focus on three things: cache the right granularity (objects, not query results), use predictive pre-warming instead of passive TTLs, and ensure your L1 and L2 tiers have independent eviction policies optimized for their respective access patterns. For strategies on reducing the latency of your L2 tier specifically, see how to reduce Redis latency.
The Numbers at Scale
The impact of a properly layered cache architecture is not theoretical. Here is what the numbers look like at different scales, assuming a 97% L1 + L2 combined hit rate and an average database query cost of 5ms.
| Users | Peak QPS | DB Queries (No Cache) | DB Queries (97% Hit) | DB Load Saved |
|---|---|---|---|---|
| 1,000 | 50 | 50/sec | 1.5/sec | 97% |
| 10,000 | 500 | 500/sec | 15/sec | 97% |
| 100,000 | 5,000 | 5,000/sec | 150/sec | 97% |
| 1,000,000 | 50,000 | 50,000/sec | 1,500/sec | 97% |
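The table's figures can be reproduced directly:

```python
# Reproduce the table: residual database QPS at a 97% combined hit rate.
hit_pct = 97
scenarios = [(1_000, 50), (10_000, 500), (100_000, 5_000), (1_000_000, 50_000)]

for users, peak_qps in scenarios:
    db_qps = peak_qps * (100 - hit_pct) / 100
    print(f"{users:>9,} users: {peak_qps:>6,} QPS peak -> {db_qps:g}/sec reach the DB")
```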
At 1,000 users, the savings are measurable but not critical — your database can handle 50 queries per second without breaking a sweat. At 100,000 users, the cache is the difference between a single database instance and a read-replica cluster. At 1 million users, it is the difference between a manageable infrastructure bill and a scaling emergency. The percentage saved is constant. The absolute cost avoided grows linearly. This is why caching is not a performance optimization — it is a scaling strategy.
Further Reading
- Database Caching Layer: Architecture Guide
- Predictive Caching: How AI Pre-Warming Works
- Low-Latency Caching Architecture
- How to Reduce Redis Latency in Production
- How to Increase Cache Hit Rate
- Cachee Performance Benchmarks
Also Read
You Don’t Need OpenAI’s Scale to Use Their Architecture.
See how in-process L1 caching, predictive pre-warming, and layered architecture reduce your database load by 97% — at any scale.