Papers We Love: Failure Detectors

I gave this talk at Papers We Love, March 2016.


The problem of consensus is central to many distributed systems algorithms, and failure detectors are central to the way we think about consensus algorithms. In a fully asynchronous system, the FLP impossibility result shows that no deterministic consensus protocol can tolerate even a single crash failure! This simple, stunning result imposed a hard constraint on what could be solved in an asynchronous model.

The FLP result kicked off a flurry of research into ways to circumvent the impossibility result. Failure detectors were the most compelling abstraction proposed. These augmented the asynchronous model just enough to allow consensus, while retaining most of the neat abstractions that make asynchronous systems simple to reason about.

In this talk, I’ll cover some of the history and background of Chandra and Toueg’s failure detector proposal, then look at some of the failure detector mechanisms that followed the paper.
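
To make the idea concrete, here is a minimal sketch of one common mechanism: a heartbeat-based detector that suspects a peer after a timeout and lengthens that timeout whenever a suspicion turns out to be wrong. It is an illustration only, not code from the paper or the talk. The class name, method names, and parameters are all hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class HeartbeatFailureDetector:
    """Illustrative timeout-based failure detector (all names are hypothetical)."""

    def __init__(self, peers, initial_timeout=1.0, backoff=1.0):
        # Per-peer timeout, so a slow peer does not penalize the others.
        self.timeout = {p: initial_timeout for p in peers}
        # Time at which each peer was last heard from.
        self.last_heard = {p: 0.0 for p in peers}
        self.suspected = set()
        self.backoff = backoff

    def on_heartbeat(self, peer, now):
        """Record a heartbeat; undo a wrong suspicion and relax the timeout."""
        self.last_heard[peer] = now
        if peer in self.suspected:
            # We suspected a live peer, so the timeout was too aggressive.
            self.suspected.discard(peer)
            self.timeout[peer] += self.backoff

    def check(self, now):
        """Suspect every peer that has been silent longer than its timeout."""
        for peer, last in self.last_heard.items():
            if now - last > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)


fd = HeartbeatFailureDetector(["a", "b"], initial_timeout=1.0, backoff=1.0)
fd.on_heartbeat("a", 0.5)
fd.on_heartbeat("b", 0.5)
early = fd.check(1.0)      # both peers heard from recently
late = fd.check(2.0)       # both silent for 1.5s, longer than the 1.0s timeout
fd.on_heartbeat("a", 2.1)  # "a" was alive after all; its timeout grows to 2.0s
final = fd.check(4.0)      # "a" is within its new timeout, "b" is still silent
```

The detector can be wrong: it suspects "a" even though "a" is alive. The point of the abstraction is that a detector is allowed to make such mistakes, as long as it satisfies stated completeness and accuracy properties. Growing the timeout after each mistake is one way to stop making them once message delays settle down, which is an assumption about the network rather than something the asynchronous model guarantees.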


Kiran (@kiranb) is a software engineer at Stripe. At work, she thinks a lot about distributed systems fallacies and how we can observe what our software is doing. A normal day working with Kiran involves conversations about operating distributed systems and learning that she made that awesome space dress she’s wearing.