(Paper report) Harvest/Yield

This is part of a series of posts I’m writing for my version of NaNoWriMo, where I summarize/review papers I’ve read recently.

Overview

Coda Hale’s blog post on partition tolerance1 summarizes this beautifully, so I’m just going to pull from him.

Despite your best efforts, your system will experience enough faults that it will have to make a choice between reducing yield (i.e., stop answering requests) and reducing harvest (i.e., giving answers based on incomplete data). This decision should be based on business requirements.

The easiest/default error-handling setup degrades the system entirely (affecting the yield). Instead, Fox and Brewer suggest an alternative approach to increasing availability: mapping faults to a reduction in the amount of data a system operates on – in other words, by weakening the consistency guarantees of your system.

It’s really easy to point to the CAP theorem and assert that systems can’t have both strong consistency and availability, ending the discussion there. While some systems absolutely can’t tolerate imperfect/degraded answers, there are a wide variety of systems that can tolerate some other point on the consistency spectrum, especially for queries. Clearly defining consistency/availability goals, especially for a system composed of many independent services, is crucial to improving availability.

Given this need, the authors define 2 metrics for correct behavior in a distributed system:
  1. yield = requests completed successfully / total requests
    • note: yield is related to uptime, but deals with the number of queries missed, not just time down. (Being down for 10 minutes on Cyber Monday is not the same as being down for 10 minutes at 3 AM on a Tuesday.)
  2. harvest = data in response / complete data
    • for example: if 1% of a sharded data store goes down, the request drops 1% of the data it would have returned otherwise, giving it 99% harvest.
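The two metrics are just ratios, so they're easy to pin down concretely. A minimal sketch (the function names are mine, not the paper's):

```python
def yield_fraction(completed: int, total: int) -> float:
    """yield = requests completed successfully / total requests"""
    return completed / total if total else 1.0

def harvest_fraction(data_returned: int, data_available: int) -> float:
    """harvest = data in response / complete data"""
    return data_returned / data_available if data_available else 1.0

# The sharded-store example above: 1 of 100 shards down drops 1% of
# the data a request would otherwise return, so harvest is 99%.
print(harvest_fraction(99, 100))   # 0.99
```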

Our approaches tolerate partial failures by emphasizing simple composition mechanisms that promote fault containment, and by translating partial failure modes into engineering mechanisms that provide smoothly-degrading functionality rather than lack of availability of the service as a whole.

The authors begin by talking about harvest degradation mechanisms within a single monolithic system (responding to failures by operating on partial data), and about harvest degradation in a system composed of many independent services.

Reducing harvest: probabilistic availability

(note: there’s not much I can say here that hasn’t been said better by Coda2, especially starting from “A Readjustment In Focus”)

Nearly all systems are probabilistic whether they realize it or not. … Availability maps naturally to probabilistic approaches.

A point that I found interesting in here: replicating all data (not just highly-accessed data) might have relatively little impact on harvest/yield. Temporary loss of data that isn’t heavily accessed is unlikely to show up in yield numbers, since it’s unlikely that requests were made against that data, anyway.

The authors also mention trading off latency for harvest – a system can declare with some confidence that it’s received enough data to make a satisfactory response, without waiting on straggler information, which reminded me of the watermarks in Google’s Millwheel pipeline.3
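That latency-for-harvest trade can be sketched as a fan-out that waits only until a deadline and answers with whatever has arrived — stragglers reduce harvest instead of blocking the response. This is my own illustration, not code from the paper; `fetch_shard` and the shard list are hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def query_with_deadline(shards, fetch_shard, timeout_s=0.05):
    """Fan out to every shard, but only wait timeout_s seconds.

    Returns (partial_results, harvest): slow shards are simply
    left out of the response rather than delaying it.
    """
    pool = ThreadPoolExecutor(max_workers=len(shards))
    futures = [pool.submit(fetch_shard, s) for s in shards]
    done, _not_done = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False, cancel_futures=True)  # drop stragglers
    results = [f.result() for f in done if f.exception() is None]
    return results, len(results) / len(shards)
```

(`Executor.shutdown(cancel_futures=True)` needs Python 3.9+.)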

Reducing harvest: compartmentalization

Some large applications can be decomposed into subsystems that are independently intolerant of harvest degradation (i.e., they fail by reducing yield), but whose independent failure allows the overall application to continue functioning with reduced utility.

Fox and Brewer emphasize orthogonal system design in this paper. A system with orthogonal services is composed of mechanisms that have essentially no runtime interface to each other. They cite Nancy Leveson’s claim4 that “most failures in complex systems result from unexpected inter-component interaction, rather than intra-component bugs.”

Compartmentalization5 is a related concept from fire safety codes. It divides a building into several independent compartments that can be isolated in case of a fire, allowing operations to continue and containing the damage. I like to think of building fire doors between services: clear entry/exit points with protection mechanisms that can be engaged to contain the spread of system fires. This could be circuit-breaker code that sheds load when a dependency fails, or designing services that can go entirely down without affecting the operation of the others. When a downstream service errors, it’s easy to throw up your hands and error out the whole system. Error-handling code is tricky to write, and far enough off the common execution path that it’s rarely tested, which is where gamedays and kill -9 tests6 come into play.
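A fire door of that kind is simple enough to sketch. This toy circuit breaker is my own illustration (names and thresholds are made up, not from the paper): after enough consecutive failures it "opens" and fails fast instead of piling more load onto a sick dependency, then lets a probe through once a reset window has passed.

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: fail fast once a dependency looks down."""

    def __init__(self, max_failures=3, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

While the breaker is open, callers get an immediate error (reduced yield for this one subsystem) instead of a slow timeout, and the rest of the application keeps functioning.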

This is what I’ve read so far. Happy to hear suggestions!

footnotes


  1. You Can’t Sacrifice Partition Tolerance
  2. You Can’t Sacrifice Partition Tolerance
  3. MillWheel’s watermark is a lower bound (often heuristically established) on event times that have been processed by the pipeline, used to determine how complete the pipeline’s view of the world is at a certain time.
  4. Her book’s on Amazon. This sounds like a series of case studies, which is the best! I love scary systems cautionary tales.
  5. Compartmentalization
  6. gamedays and kill -9 tests
