(paper report) Lessons from giant-scale services

This is part of a series of posts I’m writing for my version of NaNoWriMo, where I summarize/review papers I’ve read recently.

This is a paper from Eric Brewer (of the CAP theorem) summarizing his experiences working with large systems. Though the paper is from 2001, the basic tradeoffs of the systems he describes have not really changed since.

The most interesting insight of this paper is the DQ principle, which is related to the principles discussed in the harvest/yield paper:

This paper is kind of a grab-bag of ideas, but I think the key insight is using the DQ principle to evaluate graceful degradation strategies.

Measuring availability (harvest/yield)

Uptime, or the amount of time a site handles traffic, is the traditional metric of availability.

As a better set of availability metrics, Brewer reiterates the harvest/yield metrics, discussed here previously.

The key insight is that we can influence whether faults impact yield, harvest, or both. Replicated systems tend to map faults to reduced capacity (and to yield at high utilizations), while partitioned systems tend to map faults to reduced harvest, as parts of the database temporarily disappear, but the capacity in queries per second remains the same.

That is, given a failure, you can either completely answer less queries, or answer queries with less data. Given this data, Brewer goes on to define…

The DQ principle

The DQ principle is not a strict theorem, but is a useful way of thinking about very constrained systems.

The intuition behind this principle is that the system’s overall capacity tends to have a particular physical bottleneck, … which is tied to data movement.

The DQ value/overall throughput is easily measured, and is a good measure for how overloaded a system is.1 The absolute value isn’t super important, but it’s crucial to understand both how adding/losing nodes affects it relatively and how close to the limit you’re operating at. Load testing is a good way to get an initial measurement for the DQ value of an entire system.

Graceful degradation under load

Being able to avoid saturation is a nice thought, but unless you’re horribly over-provisioned, it’s near impossible. Brewer presents three main reasons for this:

Given the inherent spikiness of traffic, if you don’t plan how you shed load, you still will shed load, but not necessarily the load you can afford to shed.

The DQ principle gives you a disciplined way to think about load shedding:

you can either focus on harvest through admission control, which reduces Q, or on yield through dynamic database reduction, which reduces D, or use a combination of the two.

There are a couple of different ways this can manifest3:

Upgrades

Maintenance and upgrades are basically controlled failures.

completing upgrades without taking down the site is important becasue giant-scale services are updated so frequently.

Online upgrades can be viewed as a temporary reduction in DQ value –

There are three main approaches to upgrades:

  1. Fast reboots
    • SUPER QUICKLY reboot everything. The entire system is down for u time, which might not be a huge drop in yield during off-peak hours.
  2. Rolling upgrades
    • Take one node at a time, which probably doesn’t reduce harvest or yield at all during off-peak hours, since there’s probably enough excess capacity in the system to handle a single node failure. This requires that the new and old systems are inter-compatible, though.
    • This is the most commonly used upgrade path.
  3. Big flip
    • Take down and upgrade half the cluster at a time. This allows for incompatible changes – each half of the cluster handles all the traffic at any point in time. That is – you take down part A, and upgrade it while B handles all the traffic. Once A is back up, you cut over traffic to it vian DNS, drain B, then take down B for upgrades.
    • You need to be over-provisioned by 100% for this to work.
    • The big flip is pretty powerful, since you don’t have to ensure compatibility between the new and old versions of the world. For example, both Inktomi (in the paper) and Stripe (in my experience) have used this for data center moves. It’s mostly useful for large and complex changes, which are rare.

Brewer also mentions that most systems set up a staging area to perform upgrades, making rollbacks easy. This, in combination with the big flip, sounds a lot like blue-green deploys.

A note about observability/recovery systems

Relating uptime to mean-time-between-failure (MTBF) and mean-time-to-repair (MTTR):

you can improve uptime either by reducing the frequency of failures, or reducing the time to fix them. Although the former is more pleasing aesthetically, the latter is much easier to accomplish with evolving systems.

In other words, it’s just as important to assume failures and improve tooling around debugging and recovery as it is to program defensively and attempt to prevent failures in the first place. In fact, given that new features often reduce MTBF without affecting MTTR much, it’s easier to focus on improving time to recovery.

There’s a bit about this in “How Complex Systems Fail”, on how change constantly introduces new forms of failure, which often interact with each other in unexpected ways to cause failure. Essentially – it’s easier to improve observability, debugging tools, and recovery systems (circuit breakers, etc) than it is to solely prevent failures.

Acceptable quality comes down to software the provides a target MTBF, a low MTTR, and no cascading failures.

Happy to hear suggestions!

footnotes


  1. This is true only for data-intensive sites. If the system is computationally-bound, or bottlenecked by external systems, there’s not much improving DQ can do. Most large services tend to be data-bound, though.
  2. I’d give a small fortune for a dependency graph of their systems.
    SQS and Auto-Scaling groups are dependent on DynamoDB. ELB and RDS depend on EBS (again). What else?
  3. I’ve omitted the third option Brewer presents: cost-based admission control, which estimates the difficulty (DQ cost) of the query before load-shedding. You can deny a very thorough unindexed query, for example, to enable several quicker ones. This can be taken further, to probabilistically blocking queries.

Comments