Scaling PostgreSQL
How OpenAI scales PostgreSQL
I share interesting articles, videos, papers and more about distributed systems, formal methods and computer science.
For years, PostgreSQL has been one of the most critical, under-the-hood data systems powering core products like ChatGPT and OpenAI’s API. As our user base grows rapidly, the demands on our databases have increased exponentially, too. Over the past year, our PostgreSQL load has grown by more than 10x, and it continues to rise quickly.
In distributed systems, there’s a common understanding that it is not possible to guarantee exactly-once delivery of messages. What is possible though is exactly-once processing. By adding a unique …
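To make the delivery versus processing distinction concrete, here is a minimal sketch of one common approach: deduplicating on the consumer side. It assumes every message carries a unique ID. The names `IdempotentConsumer`, `handle`, and `processed_ids` are hypothetical and not taken from the linked article, whose excerpt is cut off above.

```python
class IdempotentConsumer:
    """Applies each message at most once, even if it is delivered many times."""

    def __init__(self):
        self.processed_ids = set()  # IDs of messages already applied
        self.balance = 0            # example state mutated by messages

    def handle(self, message_id, amount):
        # A redelivered message is recognised by its ID and skipped.
        if message_id in self.processed_ids:
            return False
        # Assumption: in a real system the state change and the ID record
        # would be committed atomically (for example, in one transaction).
        self.balance += amount
        self.processed_ids.add(message_id)
        return True


consumer = IdempotentConsumer()
# "m1" is delivered twice, as an at-least-once transport might do.
deliveries = [("m1", 10), ("m2", 5), ("m1", 10)]
results = [consumer.handle(mid, amt) for mid, amt in deliveries]
```

The duplicate delivery is still received, but it has no second effect on `balance`, which is the sense in which processing is exactly-once.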
This post is about gaining intuition for Write Skew, and, by extension, Snapshot Isolation. Snapshot Isolation is billed as a transaction isolation level that offers a good mix between performance and correctness.
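As a rough illustration of write skew, the sketch below simulates a toy snapshot-isolated store in memory. It is not how any real database implements Snapshot Isolation. The classes `SnapshotStore` and `Txn`, and the on-call doctors scenario, are assumptions chosen for the example. The store aborts a transaction only on a write-write conflict, which is the rule that lets write skew through.

```python
class SnapshotStore:
    """Toy key-value store with per-key versions."""

    def __init__(self, data):
        self.data = dict(data)
        self.version = {key: 0 for key in data}

    def begin(self):
        return Txn(self)


class Txn:
    """Transaction that reads from a snapshot taken when it began."""

    def __init__(self, store):
        self.store = store
        self.snapshot = dict(store.data)
        self.start_versions = dict(store.version)
        self.writes = {}

    def read(self, key):
        return self.writes.get(key, self.snapshot[key])

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Abort only if a key we WROTE was changed since our snapshot.
        # Keys we merely read are not checked.
        for key in self.writes:
            if self.store.version[key] != self.start_versions[key]:
                return False
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.version[key] += 1
        return True


def go_off_call(txn, doctor):
    # Invariant the application wants: at least one doctor stays on call.
    on_call = [d for d in ("alice", "bob") if txn.read(d)]
    if len(on_call) >= 2:
        txn.write(doctor, False)


store = SnapshotStore({"alice": True, "bob": True})
t1, t2 = store.begin(), store.begin()  # both see the same snapshot
go_off_call(t1, "alice")
go_off_call(t2, "bob")
committed = (t1.commit(), t2.commit())
on_call_after = [d for d, is_on in store.data.items() if is_on]
```

Both transactions commit because their write sets are disjoint, yet together they break the invariant: nobody is left on call. A serializable execution would have forced one of them to see the other's write.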
Redis has been gradually making inroads into areas of data management where there are stronger consistency and durability expectations – which worries me, because this is not what Redis is designed for. Arguably, distributed locking is one of those areas. Let’s examine it in some more detail.
As a small experiment, we’ll use a model checker to see how such a race could happen. Formal verification can’t prevent every failure, but it helps us think more clearly about correctness and reason about subtle concurrency bugs.
On Oct 19–20, 2025, AWS’s N. Virginia region suffered a major DynamoDB outage triggered by a DNS automation defect that broke endpoint resol...
This post motivates TernFS, explains its high-level architecture, and then explores some key implementation details.
The goal with Aurora DSQL’s design is to break up the database into bite-sized chunks with clear interfaces and explicit contracts. Each component follows the Unix mantra—do one thing, and do it well—but working together they are able to offer all the features users expect from a database (transactions, durability, queries, isolation, consistency, recovery, concurrency, performance, logging, and so on).