Deployment
Learning from history
LEARNING FROM THE ACCIDENT
Three Mile Island is one of the most studied complex systems accidents. Some blame the operators, who "should" have understood what was happening, who "should" have closed the valves after maintenance, and who "should" have left the high-pressure cold water injection running¹. Have you ever left home and been unable to remember whether you locked the front door? Imagine having 500 front doors. On any given day, in any given reactor, some small percentage of valves will be in the wrong state. Others blame the sloppiness of the management culture. There should have been lock sheets for the valves. But more paperwork to track work practices has only reduced valve errors in other reactors, not eliminated them. Some blame the design of the reactor: too much complexity, coupling, and interdependence. A simpler design has fewer failure modes, but hidden complexity is inherent to systems engineering, and truly simple designs aren’t possible.
None of these judgments are useful, because they’re all obvious and true to some degree. The real learning is that complex systems are fragile, and will fail. No amount of safety devices and procedures can solve this problem, because the safety devices and procedures are part of the problem. Three Mile Island makes this clear. The interactions of all the components of the system (including the humans) led to failure.
This is a clear analogy to software systems. We build architectures that have a similar degree of complexity, the same kinds of interactions and tight couplings. We try to add redundancy and fail-safes, and then find that they fail anyway, as they haven’t been sufficiently tested. We try to control risk with detailed release procedures and strict quality assurance, and still end up having to do releases at the weekend, with inevitable downtime. In one way, we are worse than nuclear reactors: with every release we change fundamental core components!
You can’t remove risk by trying to contain complexity. Eventually you’ll have a LOCA.
5.2.2 A model for failure in software systems
Let’s try to understand the nature of failure in software systems using a simple model. We need to quantify our exposure to risk to understand how different levels of complexity and change affect a system.
A software system can be thought of as a set of components, with dependency relationships between the components. The simplest case is a single component. Under what conditions does the component, and the entire system, fail? To answer that question, we should clarify the term failure. In this model, failure isn’t an absolute binary condition, but a quantity we can measure. Success might be 100% up-time over a given period, and failure is any up-time less than 100%. But we could be quite happy with a failure rate of 1%, giving us 99% up-time as the threshold of success. Alternatively, we could count the number of requests that have correct responses. Out of every 1,000 requests, perhaps 10 fail, and we have a failure rate of 1%. Again, we could be quite happy with this. Loosely, we can define the failure rate as the proportion of some quantity (continuous or discrete) that fails to meet a specific threshold value. Remember that we are building as simple a model as we can, so exactly what is failing is excluded from the model. All we care about is the rate, and meeting the threshold. Failure is failure to meet the threshold, not failure to operate.
For our one-component system, if the component has a failure rate of 1%, then the system has a failure rate of 1%. Is the system failing?
If the acceptable failure threshold is 0.5%, then the system is failing. If the acceptable failure threshold is 2%, then the system isn’t failing, it’s succeeding, and we can go home.
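The threshold test described above can be sketched in a few lines of Python. This is only an illustration of the model, not part of it: the function names `failure_rate` and `is_failing` are hypothetical, and the sketch assumes that a rate exactly equal to the threshold still counts as meeting it.

```python
def failure_rate(failed: int, total: int) -> float:
    """Proportion of requests (or any counted quantity) that failed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


def is_failing(rate: float, threshold: float) -> bool:
    """The system fails only when its failure rate exceeds the threshold.

    Assumption: a rate exactly at the threshold is treated as acceptable.
    """
    return rate > threshold


# 10 failed requests out of every 1,000 gives a failure rate of 1%.
rate = failure_rate(failed=10, total=1000)

print(is_failing(rate, threshold=0.005))  # threshold of 0.5%: True, failing
print(is_failing(rate, threshold=0.02))   # threshold of 2%: False, succeeding
```

The same 1% failure rate gives opposite verdicts depending only on the threshold, which is the point of the model: failure is defined by the threshold, not by the component.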
This model reflects an important change of perspective: accepting that software systems are in a constant state of low-level failure. A failure rate always exists. Valves are always left closed somewhere. The system fails only when a threshold of pain is crossed. This new perspective is different from the embedded organizational assumption that software can be perfect and operate without defects. The obsession with tallying defective features seems quaint from this viewpoint. Once you gain this perspective, you can begin to understand how the operational costs of the microservice architecture are outweighed by the benefit of superior risk management.
Figure 5.3 A single-component system, where P₀ is the failure rate of component C₀.
TWO COMPONENTS
Now consider a two-component system. One component depends on the other, and both must function correctly for the system to succeed. Let’s set the failure threshold at 1%. Perhaps this is the proportion of failed purchases. Perhaps we’re counting many different kinds of error, and purchases are one type: it isn’t relevant to the model. Let’s also make the assumption that both components fail independently of each other¹. One failing doesn’t make the other more likely to fail. Each component has its own failure rate. In the two-component system shown below, a given function can succeed only if both components succeed. Both are needed.
Figure 5.4 A two-component system, where Pᵢ is the failure rate of component Cᵢ.
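Under the two assumptions stated above (both components are needed, and they fail independently), the system succeeds with probability (1 - P₀)(1 - P₁), so its failure rate is 1 - (1 - P₀)(1 - P₁). The following is a minimal sketch of that calculation. The function name `system_failure_rate` is hypothetical, and the component failure rates of 1% each are illustrative values chosen for this example, not figures from the text.

```python
def system_failure_rate(component_rates):
    """Failure rate of a system in which every component must succeed.

    Assumes components fail independently, so the probability that all
    of them succeed is the product of their individual success rates.
    """
    success = 1.0
    for p in component_rates:
        if not 0.0 <= p <= 1.0:
            raise ValueError("failure rates must lie between 0 and 1")
        success *= 1.0 - p
    return 1.0 - success


THRESHOLD = 0.01  # the 1% failure threshold used in the text

# Illustrative assumption: each component has a failure rate of 1%.
rate = system_failure_rate([0.01, 0.01])

print(f"{rate:.4%}")      # 1.9900%
print(rate > THRESHOLD)   # True: the system crosses the 1% threshold
```

With these illustrative numbers, two components that would each pass a 1% threshold on their own combine into a system that fails it.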