Recent Posts
2025-11-10
As a small experiment, we’ll use a model checker to see how such a race could happen. Formal verification can’t prevent every failure, but it helps us think more clearly about correctness and reason about subtle concurrency bugs.
2025-11-06
On Oct 19–20, 2025, AWS’s N. Virginia region suffered a major DynamoDB outage triggered by a DNS automation defect that broke endpoint resol...
2025-11-03
This post motivates TernFS, explains its high-level architecture, and then explores some key implementation details.
2025-10-27
The goal with Aurora DSQL’s design is to break up the database into bite-sized chunks with clear interfaces and explicit contracts. Each component follows the Unix mantra—do one thing, and do it well—but working together they are able to offer all the features users expect from a database (transactions, durability, queries, isolation, consistency, recovery, concurrency, performance, logging, and so on).
2025-10-20
In this article, I’m going to explain how connections to Aurora DSQL are authenticated and authorized. This information is meant to be supplemental to what is found in the official Amazon Aurora DSQL documentation.
2025-10-14
People often ask me about the architectural relationship between Amazon Dynamo (as described in the classic 2007 SOSP paper), Amazon DynamoDB (the serverless distributed NoSQL database from AWS), and Aurora DSQL (the serverless distributed SQL database from AWS). There’s a ton to say on the topic, but I’ll start off by comparing how the systems achieve a few key properties.
2025-09-30
With S2, it is a hard requirement that our Stream API operations exhibit linearizability. Linearizable systems are far simpler to reason about, and many applications are only possible to build on top of data platforms that offer strong consistency guarantees like this. Because it's important, we also need to test it! We can gain confidence that S2 is linearizable by taking an empirical validation approach, using a model checker like Knossos or Porcupine.
2025-09-22
Learn how queues make horizontal scaling, scheduling, and flow control easier in cloud systems, and how to make them durable and observable.
2025-08-09
We are on a path to build a strong foundation in distributed systems. We have already gone over distributed time; the next topic we will cover is distributed consensus. To build the foundation on distributed consensus, we will go over Paxos. Paxos revolutionized distributed computing by providing the first provably correct solution for achieving consensus among unreliable processors, forming the theoretical foundation for modern distributed systems and databases. Paxos is one of the most important and most difficult-to-understand algorithms. In this blog I will simplify and explain Paxos in a very intuitive way.
2025-07-31
Murat Demirbas (https://muratbuffalo.blogspot.com) and Aleksey Charapko (https://charap.co) read and discuss "Real Life Is Uncertain. Consensus Should Be Too...
2025-05-30
This is definitely not a "learn distributed systems in 21 days" post. I recommend a principled, from the foundations-up, studying of distrib...
2025-05-29
The consensus problem involves an asynchronous system of processes, some of which may be unreliable. The problem is for the reliable processes to agree on a binary value. In this paper, it is shown that every protocol for this problem has the possibility of nontermination, even with only one faulty process. By way of contrast, solutions are known for the synchronous case, the “Byzantine Generals” problem.
2025-05-28
AWS Senior Principal Engineers, Niko Matsakis and Marc Bowes, take us inside Aurora DSQL's development: scaling write operations without two-phase commit, overcoming garbage collection hurdles, and embracing Rust for both data and control planes.
2025-05-27
Here at Decentralized Thoughts, we spend a lot of time reasoning about distributed protocols. Often, we focus on solving distributed consensus; personally, it’s my favorite CS problem, but it’s also famously one of the most difficult and subtle problems in distributed computing. Reasoning about distributed algorithms is hard at the...
2025-05-15
In this blog I will go over how Apache Iceberg contributes to the performance of compute engines. Apache Iceberg is an ACID table format designed for large-scale analytics workloads. While its consistency and schema evolution features were covered in a previous blog, its impact on query performance can be equally transformative. By the end of this document, you will have a deep understanding of how Iceberg enhances performance, the trade-offs involved, and best practices for maximizing efficiency in read-heavy workloads.
2025-05-12
Debugging concurrency bugs is no picnic, but we're going to get into it. Enter Fray, a deterministic concurrency testing framework from CMU’s PASTA Lab, that turns flaky failures into reliably reproducible ones.
2025-05-09
To me it’s clear that the big idea there isn’t lightweight processes and message passing, but rather the generic components which in Erlang are called behaviours.
2025-05-08
A curated collection of resources about deterministic simulation testing for distributed systems.
2025-05-07
Emerging patterns for building scalable, high-performance observability pipelines
2025-05-06
Material for the course Parallel, Concurrent and Distributed Programming by Ilya Sergey at Yale-NUS College
2025-04-25
Why is the Raft consensus algorithm called 'Raft'?
2025-04-22
The challenge is that physical imperfections in hardware clocks (called quartz crystal oscillators) cause our software clocks to tick at different speeds, so that time passes faster or slower than it should, with these “drift” errors also accumulating into significant “skew” errors within a matter of minutes.
2025-04-21
This page is a relatively informal discussion of distributed consensus and Paxos, what it does, how it works, and some tricks and variants.
2025-04-19
The architecture of Restate, a Durable Execution engine built from the ground up.