Architecture

OpenAI Scaled PostgreSQL to 800M Users — Here’s the Caching Architecture They Used

OpenAI published how they scaled PostgreSQL to serve 800 million ChatGPT users. The engineering is impressive — connection pooling, read replicas, aggressive partitioning. Hacker News dissected every paragraph. But buried in the details is the part nobody is talking about: none of those database optimizations survive first contact with 800 million users without a caching layer absorbing the overwhelming majority of reads. The database is the foundation. The cache is the load-bearing wall.

What OpenAI Published

The blog post details a PostgreSQL architecture that has been hardened through three years of explosive growth. The core strategies are well-understood individually but rarely deployed together at this scale. PgBouncer sits in front of every PostgreSQL instance, multiplexing thousands of application connections into a smaller pool of actual database connections. Without it, connection overhead alone would saturate the database before any queries execute. At 800 million users, even with modest concurrency, the raw connection count would overwhelm any single PostgreSQL cluster.

Read replicas distribute query load across multiple PostgreSQL instances. ChatGPT’s workload is overwhelmingly read-heavy — users retrieving conversation history, loading preferences, fetching model routing configuration. OpenAI routes these reads to replicas, reserving the primary for writes. This is standard practice at scale, but the ratio matters: at their volume, they likely run dozens of read replicas per primary, each handling tens of thousands of queries per second.

Partition strategies keep individual tables manageable. Conversation history, the largest data set by volume, is partitioned by user ID and time range. This ensures that queries for a single user’s recent conversations scan a small partition rather than a multi-terabyte table. Index sizes stay bounded. Vacuum operations complete in minutes instead of hours. The operational benefit is as significant as the performance benefit — a database you can maintain is a database that stays fast.

The published stack: PgBouncer for connection multiplexing, read replicas for horizontal read scaling, hash + range partitioning for table management, and careful query optimization. All of this is necessary. None of it is sufficient at 800 million users without a caching layer in front.

The Caching Layer They Didn’t Detail

Here is the arithmetic that makes the case. ChatGPT has 800 million users. Assume 200 million are monthly active. Assume 30 million are daily active. Assume each daily active user generates an average of 15 database-touching interactions per session — loading the conversation list, opening a conversation, sending a message, receiving a response, checking model availability, loading user preferences. That is 450 million database operations per day, or roughly 5,200 queries per second sustained, with peaks during US and European business hours pushing to 15,000–25,000 queries per second.
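That estimate is easy to reproduce. A quick arithmetic check, where every input is this article's assumption rather than a published OpenAI figure:

```python
# Back-of-envelope load estimate. All inputs are assumptions from the text,
# not figures OpenAI has published.
DAU = 30_000_000          # assumed daily active users
OPS_PER_SESSION = 15      # assumed DB-touching interactions per session
SECONDS_PER_DAY = 86_400

daily_ops = DAU * OPS_PER_SESSION
sustained_qps = daily_ops / SECONDS_PER_DAY

print(f"{daily_ops:,} ops/day is roughly {sustained_qps:,.0f} QPS sustained")
```

Peak traffic is bursty; a 3-5x peak-to-average ratio puts the top of the range near the 15,000-25,000 QPS cited above.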

Even with read replicas, 25,000 queries per second of actual database work is expensive. Each query requires parsing, planning, and execution. Each result set must be serialized and transmitted over the network. Connection pool slots are consumed. Replica lag becomes a factor under write pressure. The only way to make this sustainable is to ensure that the vast majority of those 25,000 requests never reach PostgreSQL at all.

The data that must be cached is extensive: session state (which user is logged in, their current conversation context), user preferences (model selection, UI settings, plugin configurations), conversation metadata (title, creation date, message count — everything needed to render the sidebar without loading full message bodies), and model routing decisions (which model serves which request, capacity availability, feature flags). Every one of these is read on nearly every page load. Every one of them changes infrequently. They are textbook caching candidates.

To keep PostgreSQL healthy at this scale, OpenAI likely needs a cache hit rate above 95%. At 95%, only 1,250 of those 25,000 peak queries per second hit the database. At 99%, it drops to 250. The difference between 95% and 99% is the difference between needing 10 read replicas and needing 2. That is not a marginal optimization. That is the architecture that makes the published database strategy viable.
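The sensitivity to hit rate works out as follows, using the peak load assumed above:

```python
# How cache hit rate translates into queries that actually reach PostgreSQL.
PEAK_QPS = 25_000   # assumed peak load from the estimate above

for hit_rate in (0.95, 0.99):
    db_qps = PEAK_QPS * (1 - hit_rate)
    print(f"{hit_rate:.0%} hit rate -> {db_qps:,.0f} queries/sec reach the database")
```

Each additional point of hit rate above 95% removes a fifth of the remaining database load, which is why the last few percent matter so much.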

800M total users · 25K peak QPS · 95%+ required hit rate · 20× DB load reduction

What a Predictive Cache Would Add

A traditional cache is reactive. A user opens ChatGPT, the application checks the cache for their session data, gets a miss on first load, queries PostgreSQL, and populates the cache. Subsequent requests hit the cache. This works, but that first request — the cold start — always pays the full database penalty. Multiply that across 30 million daily active users, and cold-start misses alone generate millions of database queries that a smarter cache could eliminate.
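The reactive pattern just described is classic cache-aside. A minimal in-process sketch, assuming a dict-backed store and a fixed TTL (illustrative, not any specific library's API):

```python
import time

class CacheAside:
    """Reactive read path: check the cache, fall back to the database on a
    miss, then populate the cache for subsequent reads."""

    def __init__(self, db_fetch, ttl_seconds=300):
        self._db_fetch = db_fetch        # callable: key -> value (stands in for a DB query)
        self._ttl = ttl_seconds
        self._store = {}                 # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None:
            value, expires_at = entry
            if time.monotonic() < expires_at:
                return value             # cache hit
            del self._store[key]         # expired entry
        value = self._db_fetch(key)      # cold start: full database penalty
        self._store[key] = (value, time.monotonic() + self._ttl)
        return value
```

The first `get` for any key always pays the full database round trip; that cold start is exactly what a predictive cache targets.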

A predictive cache inverts the model. Instead of waiting for a request to trigger a cache population, it analyzes access patterns and pre-warms data before the user arrives. If a user opens ChatGPT at 9:03 AM every weekday, the cache begins loading their session state, conversation list, and model preferences at 9:01 AM. When the user’s first request arrives, everything is already in memory. The cold start disappears. The database never sees the query.

At OpenAI’s scale, predictive caching would have three high-impact applications. First, session pre-warming — loading user context into the cache before the user opens the app. Second, conversation resumption prediction — identifying which of a user’s conversations are likely to be reopened and pre-loading their metadata (most users return to their most recent 2–3 conversations). Third, model routing pre-computation — caching the decision of which model will serve a user’s request based on their subscription tier, current load, and feature flags, so the routing lookup is a cache read instead of a multi-step computation.
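A heavily simplified sketch of the first application, session pre-warming. The median-of-history heuristic, the two-minute lead time, and the fetcher names are all hypothetical; a real system would model per-weekday patterns, prediction confidence, and warm-up capacity:

```python
from statistics import median

def prewarm_minute(open_minutes, lead_minutes=2):
    """Given a user's historical app-open times (as minutes past midnight),
    pick the minute at which to start pre-warming their cache entries."""
    return max(0, median(open_minutes) - lead_minutes)

def warm(user_id, cache, fetchers):
    """Pre-load a user's hot objects before their first request arrives.
    `fetchers` maps an object name to a callable that reads it from the
    database (hypothetical helpers, not a real API)."""
    for name, fetch in fetchers.items():
        cache[(name, user_id)] = fetch(user_id)
```

When the user's first request arrives, the keys it needs are already populated, so the read path never reaches the database.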

The database caching layer approach extends this further. Rather than caching individual query results, you cache the logical data objects that the application works with. A user’s “session bundle” — preferences, active conversation, model assignment — is cached as a single unit, fetched in one lookup instead of four separate cache reads. This reduces not just database load but also cache access volume, which matters when you are serving billions of cache reads per day.
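The bundle approach can be sketched as one serialized object behind one key. The field names are illustrative, not OpenAI's schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class SessionBundle:
    """One logical cache object instead of four separate entries."""
    user_id: str
    preferences: dict
    active_conversation_id: str
    model_assignment: str

def bundle_key(user_id):
    return f"session_bundle:{user_id}"

def cache_bundle(cache, bundle):
    # One write, one serialization, one round trip on the read side.
    cache[bundle_key(bundle.user_id)] = json.dumps(asdict(bundle))

def load_bundle(cache, user_id):
    raw = cache.get(bundle_key(user_id))
    return SessionBundle(**json.loads(raw)) if raw else None
```

Fetching the bundle is a single lookup where the query-result approach would cost four, which compounds at billions of cache reads per day.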

Predictive caching at 800M users: If 60% of daily active users have predictable access patterns (same time, same device, same initial actions), pre-warming eliminates 18 million cold-start database queries per day. At an average of 3ms per query, that is 54,000 seconds of database compute time — removed entirely.
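The callout's arithmetic, with every input an assumption from the text:

```python
# Checking the pre-warming savings estimate. All inputs are assumptions.
DAU = 30_000_000            # assumed daily active users
PREDICTABLE = 0.60          # assumed share with predictable access patterns
AVG_QUERY_MS = 3            # assumed average query cost

cold_starts_avoided = round(DAU * PREDICTABLE)           # one per user per day
db_seconds_saved = cold_starts_avoided * AVG_QUERY_MS / 1000

print(f"{cold_starts_avoided:,} queries avoided, {db_seconds_saved:,.0f}s of DB time/day")
```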

Lessons for Your Architecture

You do not need 800 million users to benefit from the architecture patterns OpenAI is using. The principles scale down cleanly. At 10,000 users, the same L1 + L2 + database architecture applies — the numbers are smaller, but the ratios are identical. Your database is still doing redundant work on every repeated read. Your users are still experiencing cold-start latency on first load. Your cache hit rate still determines whether your infrastructure costs are linear or sublinear with user growth.

The layered approach is what matters. L1 (in-process) handles the hottest data — the 100–1,000 keys that account for 80% of your read volume. Zero serialization, zero network hops, microsecond lookups. L2 (Redis or equivalent) handles the warm data — user sessions, recent query results, computed aggregations. Millisecond lookups, shared across application instances. Database handles cold reads and all writes — the fallback for cache misses and the source of truth for mutations.
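A minimal sketch of that layered read path. L2 is modeled here as a plain dict standing in for Redis; L1 is a size-bounded LRU held in process:

```python
from collections import OrderedDict

class TieredCache:
    """L1 (in-process, LRU-bounded) in front of a shared L2, with the
    database as the fallback for cold reads."""

    def __init__(self, l2, db_fetch, l1_capacity=1000):
        self._l1 = OrderedDict()         # hottest keys, per process
        self._l1_capacity = l1_capacity
        self._l2 = l2                    # shared store; swap in a Redis client
        self._db_fetch = db_fetch        # callable: key -> value (source of truth)

    def get(self, key):
        if key in self._l1:              # microseconds: no hop, no serialization
            self._l1.move_to_end(key)
            return self._l1[key]
        value = self._l2.get(key)        # milliseconds: network + deserialize
        if value is None:
            value = self._db_fetch(key)  # cold read: database
            self._l2[key] = value
        self._put_l1(key, value)
        return value

    def _put_l1(self, key, value):
        self._l1[key] = value
        self._l1.move_to_end(key)
        if len(self._l1) > self._l1_capacity:
            self._l1.popitem(last=False) # evict least recently used
```

Hot keys are served from L1 with no network hop; L2 absorbs misses shared across application instances; only the remainder reaches the database.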

The mistake most teams make is jumping straight to L2 (Redis) and skipping L1 entirely. They accept the serialization overhead, the network hop, and the stampede risk as inherent costs of caching. They are not. They are costs of remote caching. An in-process L1 tier eliminates them on the hottest path, where the savings compound most aggressively. OpenAI has the engineering resources to build custom solutions for each layer. You do not need to. The architecture pattern is the same whether you are caching ChatGPT sessions or product catalog pages.

To increase your cache hit rate from the typical 85–90% to the 95%+ that makes this architecture sing, focus on three things: cache the right granularity (objects, not query results), use predictive pre-warming instead of passive TTLs, and ensure your L1 and L2 tiers have independent eviction policies optimized for their respective access patterns.

The Numbers at Scale

The impact of a properly layered cache architecture is not theoretical. Here is what the numbers look like at different scales, assuming a 97% L1 + L2 combined hit rate and an average database query cost of 5ms.

Users        Peak QPS   DB Queries (No Cache)   DB Queries (97% Hit)   DB Load Saved
1,000        50         50/sec                  1.5/sec                97%
10,000       500        500/sec                 15/sec                 97%
100,000      5,000      5,000/sec               150/sec                97%
1,000,000    50,000     50,000/sec              1,500/sec              97%
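The table's rows can be recomputed directly from the stated 97% hit rate:

```python
# Reproducing the table above: at a 97% hit rate, 3% of peak QPS reaches the DB.
HIT_RATE = 0.97
rows = [(1_000, 50), (10_000, 500), (100_000, 5_000), (1_000_000, 50_000)]

for users, peak_qps in rows:
    db_qps = peak_qps * (1 - HIT_RATE)
    print(f"{users:>9,} users | {peak_qps:>6,} peak QPS | {db_qps:>8,.1f}/sec reach the DB")
```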

At 1,000 users, the savings are measurable but not critical — your database can handle 50 queries per second without breaking a sweat. At 100,000 users, the cache is the difference between a single database instance and a read-replica cluster. At 1 million users, it is the difference between a manageable infrastructure bill and a scaling emergency. The percentage saved is constant. The absolute cost avoided grows linearly. This is why caching is not a performance optimization — it is a scaling strategy.

97% DB load reduction · 33× fewer DB queries · ~2µs L1 lookup latency · 0ms serialization cost
