The Error Kernel Pattern

with Brian Hanafee and Jamie Allen

components, keeping an eye on their anticipated communication needs. We then dived into the scheduling component and repeated the process, finding that there are sizable and non-overlapping sub-responsibilities which we split out into their own sub-components. This left the overall scheduling responsibility in a parent component because we anticipate coordination tasks that will be needed independently of the sub-components’ functions.

By this process we arrived at segregated components that can be treated independently during the further development of the system. Each of these has one clearly defined purpose and each core responsibility of the system lies with exactly one component. Though the overall system and the internals of any component may be complex, the Single Responsibility Principle yields the simplest division of components to further work on—it frees us from always having to consider the whole picture when working on smaller pieces. This is its quintessential feature: it addresses the concern of system complexity.

Additionally, following the Simple Component Pattern simplifies the treatment of failures, a property that the following two patterns exploit.

12.1.4 Applicability

This is the most basic pattern to follow and it is universally applicable. Its application may lead you to a fine-grained split of your problem or to the realization that you are dealing with only one single component; the important part is that afterward you know why you chose your system structure as you did. It helps all later phases of the design and implementation to document and remember this, because when questions come up later about where to place certain functionality you can let yourself be guided by the simple question of “what is its purpose?” The answer will directly point you toward one of the responsibilities you identified, or it will send you back to the drawing board in case you forgot to consider it.

It is important to remember that this pattern is meant to be applied in a recursive fashion, making sure that none of the identified responsibilities remain too complex or high-level. One word of warning, though: once you start dividing up components hierarchically it is easy to get carried away and go too far—the goal is simple components that have a real responsibility, not trivial components without an individual reason to exist.

12.2 The Error Kernel Pattern

“In a supervision hierarchy keep important application state or functionality near the root while delegating risky operations towards the leaves.”

This pattern builds upon the Simple Component Pattern and is applicable wherever components of different failure probability and reliability requirements are combined into a larger system or application: some functions of the system must “never” go down while others are necessarily exposed to failure. Applying the Simple Component Pattern will frequently leave you in this position, hence it pays to familiarize yourself well with the Error Kernel.
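The idea can be illustrated with a minimal sketch in plain Python rather than an actor framework. The class names `ErrorKernel` and `RiskyWorker`, the `poison` flag, and the restart-by-replacement policy are all assumptions made for illustration; they are not taken from the text or from any library. The root keeps the important state and never performs the risky operation itself, so a crash in the leaf costs only a fresh worker.

```python
class RiskyWorker:
    """Leaf component: performs the failure-prone operation and holds no important state."""

    def run(self, job):
        # Hypothetical failure trigger, used only to simulate a crash.
        if job.get("poison"):
            raise RuntimeError("worker crashed while running " + job["id"])
        return "result of " + job["id"]


class ErrorKernel:
    """Root component: owns the important state and delegates all risky work."""

    def __init__(self):
        self.job_status = {}         # important state, kept near the root
        self.worker = RiskyWorker()  # risky work lives in a disposable leaf
        self.worker_restarts = 0

    def submit(self, job):
        self.job_status[job["id"]] = "running"
        try:
            result = self.worker.run(job)
        except Exception:
            # Supervision decision: discard the failed leaf and create a fresh one.
            self.worker = RiskyWorker()
            self.worker_restarts += 1
            self.job_status[job["id"]] = "failed"
            return None
        self.job_status[job["id"]] = "done"
        return result


kernel = ErrorKernel()
kernel.submit({"id": "job-1"})
kernel.submit({"id": "job-2", "poison": True})  # the leaf fails here
kernel.submit({"id": "job-3"})                  # the kernel's state survived
print(kernel.job_status)
```

In this sketch the status table survives the failure of the second job because it was never stored in the component that crashed. A real actor system would make the same separation through its supervision mechanism instead of a `try`/`except` block.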

This pattern has been established in Erlang programs for decades [3] and was one of the main reasons that inspired Jonas Bonér to implement an Actor framework, Akka, on the JVM. The name “AKKA” was originally conceived as the palindrome of “Actor Kernel,” referring to this core design pattern.

12.2.1 The Problem Setting

From the discussion of hierarchical failure handling in chapter 7 we know that each component of a reactive system is supervised by another component that is responsible for its lifecycle management. This implies that if the supervisor component fails then all its subordinates will be affected by the subsequent restart, resetting everything to a known good state and potentially losing intermediate updates. If the recovery of important pieces of state data is expensive then such a failure will lead to extensive service downtimes, a condition that reactive systems aim to minimize.
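The cascading effect of a supervisor restart can be made concrete with a small sketch. The `Component` class and its `restart` method are hypothetical and only model the behavior described above: restarting a component resets its own state and, recursively, that of everything it supervises.

```python
class Component:
    """A node in a supervision hierarchy (illustrative model, not a real framework)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.state = {}  # intermediate updates accumulated since the last (re)start

    def restart(self):
        """Reset this component and, recursively, everything it supervises."""
        self.state = {}
        for child in self.children:
            child.restart()


leaf_a = Component("leaf_a")
leaf_b = Component("leaf_b")
supervisor = Component("supervisor", [leaf_a, leaf_b])

leaf_a.state["progress"] = 40
leaf_b.state["progress"] = 75
supervisor.state["assignments"] = ["leaf_a", "leaf_b"]

# A failing leaf loses only its own state; its sibling is untouched.
leaf_a.restart()
after_leaf_failure = (leaf_a.state, dict(leaf_b.state))

# A failing supervisor takes all of its subordinates with it.
supervisor.restart()
after_supervisor_failure = (supervisor.state, leaf_a.state, leaf_b.state)

print(after_leaf_failure)
print(after_supervisor_failure)
```

The further up the hierarchy a failure occurs, the more state is reset, which is why expensive-to-recover state should not sit below components that fail often.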

THE TASK

Consider each of the six components identified in the previous example as a failure domain and ask yourself which component should be responsible for reacting to its failures as well as which components will be directly affected by them. Summarize your findings by drawing the supervision hierarchy for the resulting system architecture.
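When working through this exercise it can help to write a candidate hierarchy down as data and query it. The sketch below is one way to do so. The arrangement in `SUPERVISOR_OF`, including the extra `system` root, is purely an illustrative guess used to exercise the helper functions; it is not presented as the solution, and all names are assumptions.

```python
# Hypothetical candidate hierarchy: component -> its supervisor (None marks the root).
SUPERVISOR_OF = {
    "system": None,
    "client_communication": "system",
    "storage": "system",
    "scheduling": "system",
    "validation": "scheduling",
    "planning": "scheduling",
    "execution": "scheduling",
}


def responsible_for(component):
    """Return the component that must react when `component` fails."""
    return SUPERVISOR_OF[component]


def affected_by(component):
    """Return every component transitively supervised by `component`, sorted by name."""
    affected = []
    frontier = [component]
    while frontier:
        current = frontier.pop()
        for child, parent in SUPERVISOR_OF.items():
            if parent == current:
                affected.append(child)
                frontier.append(child)
    return sorted(affected)


for name in SUPERVISOR_OF:
    print(name, "| handled by:", responsible_for(name), "| affects:", affected_by(name))
```

Changing the entries of `SUPERVISOR_OF` and rerunning the loop shows quickly how a different placement of a component changes who handles its failures and who is affected by them.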

12.2.2 Applying the Pattern

Since recovering from a component’s failure implies the loss and subsequent recreation of its state, we shall look for opportunities to separate likely points of failure from the places where important and expensive data are kept. The same applies to pieces that provide services that must be highly available: these should be obstructed neither by frequent failures nor by long recovery times. In the example we identified the following disparate responsibilities:

• Communication with clients (accepting jobs and delivering their results)
• Persistent storage of job descriptions and their status
• Overall job scheduling responsibility
• Validation of jobs against quotas or authorization requirements
• Job schedule planning
• Job execution

Each of these responsibilities benefits from being decoupled from the rest. For example, communication with clients should not be obstructed by a failure of the job scheduling logic, just as client-induced failures should not affect the currently running jobs. The same reasoning applies analogously to the other pieces. This is another reason, in addition to the Single Responsibility Principle, for considering them as dedicated components, as shown again in figure 12.4.

[3] The Ericsson AXD301’s legendary reliability is attributed in part to this design pattern and its success popu-
