Welcome to Distributed Bytes!

I share interesting articles, videos, papers and more about distributed systems, formal methods and computer science.

Made by Federico Ponzi

Recent Posts

Simple and Correct Snapshot Isolation

Snapshot isolation (SI) is a popular approach to concurrency control in database systems. It avoids many forms of anomalies, yet provides a high degree of concurrency, especially for read-heavy workloads.

Notes on Paxos

These are my notes after learning the Paxos algorithm. The primary goal here is to sharpen my own understanding of the algorithm, but maybe someone will find this explanation of Paxos useful! This post assumes fluency with mathematical notation.

Hypothesis, Antithesis, synthesis | Antithesis Blog

Introducing Hegel, our new family of property-based testing libraries.

Testing for DR Failover Testing | USENIX

Disaster Recovery is an important area in SRE. A simplified scenario is recovery from a full data centre failure. The Zendesk Chat backend infrastructure operates in a single data centre. The way to be sure that DR works is to perform a real failover. Past failover attempts were full of surprises and unexpected issues, most of them involving applications that failed to work after failover for various reasons. These unexpected issues led to failed failover tests and/or extended maintenance windows, because of the extra effort required to bring things back to order, and caused a bad customer experience.

Top Five Scalability Patterns | F5

Availability is serious business in an economy where applications are currency. Apps that don’t respond are summarily deleted and bad mouthed on the Internet with the speed and sarcasm of a negative Yelp review. Since the earliest days of the Internet, organizations have sought to ensure applications (web sites, in the old days) were available 24x7x365. Because the Internet never sleeps, never takes vacation, and never calls in sick. To serve that need (requirement, really), scalability rose as one of the first application services to provide for availability. The most visible – and well-understood – application service serving the needs of availability is load balancing. There are many forms of load balancing, however, and scalability patterns you can implement using this core technology. Today, I’m going to highlight the top five scalability patterns in use that keep apps and the Internet online and available 24x7x365.

Building A Billion User Load Balancer

Want to learn how Facebook scales their load balancing infrastructure to support more than 1.3 billion users? We will be revealing the technologies and methods we use to globally route and balance Facebook's traffic. The Traffic team at Facebook has built several systems for managing and balancing our site traffic, including both a DNS load balancer and a software load balancer capable of handling several protocols. This talk will focus on these technologies and how they have helped improve user performance, manage capacity, and increase reliability.

What Is Coordination, Really? | Async Stream

In an earlier post I argued that coordination is about ruling out futures that the world hasn't ruled out on its own. Waiting, ordering, and commitment all…

Distributed rate limiting of delivery attempts | blog.allegro.tech

In our services ecosystem it’s usually the case that services can handle a limited number of requests per second. We show how we introduced a new algorithm for our publish-subscribe queue system. The road to production deployment highlights some key distributed systems takeaways we’d like to discuss.

Smarter Auto-Scaling for ClickHouse: The Two-Window Approach

How ClickHouse Cloud's two-window recommender and target-tracking CPU algorithm cut scale-down latency from 30 hours to 3 hours while eliminating oscillations and reducing infrastructure costs.

pthorpe92.dev

I get a lot of emails from people asking me how they can begin to learn the vast world of databases, and whether they are far enough along on their programming journey to bother trying to start to learn this sub-genre of CS. This post is meant to be my authoritative answer to those questions... Sort of a database-specific version of this post about programming in general.
