29The Error Kernel Pattern

with Brian Hanafee and Jamie Allen

The next step is to consider the failure domains in the system and ask ourselves how each of them should recover and how costly that process will be. To this end we follow the path by which a job travels through the system.

Jobs enter the service through the communication component, which speaks an appropriate protocol with the clients, maintaining protocol state and validating inputs. The state that is kept is short-lived, tied to the communication sessions that are currently open with clients. When this component fails, affected clients will have to re- establish a session and possibly send commands or queries again, but our component does not need to take responsibility for these activities. In this sense it is effectively stateless—the state that it does keep is ephemeral and local. Recovery of such components is trivially done by just terminating the old and starting the new runtime instance.

Once a job has been received from a client, it will need to be persisted, a responsibility that we placed with the storage component. This component will have to allow all other components to query the list of jobs, selecting them by current status or client account and holding all necessary meta-information. Apart from caches for more efficient operation, this component does not hold any runtime state; its function is only to operate a persistent storage medium, therefore it can easily be restarted in case of failure. This assumes that the responsibility of providing persistence will be split out into a sub-component—which today is a likely approach—that we would have to consider as well. If the contents of the persistent storage become corrupted then it is a business decision whether to implement (partial) automatic resolution of these cases or leave it to the operations personnel. Automatic recovery would presumably interfere with normal operation of the storage medium and would therefore fall into the storage component’s responsibility.

The next stop of a job’s journey through the batch service is the scheduling component. At the top level this one has the responsibilities of applying quotas and resource request validation, as well as providing the executor component with a queue of jobs to pick up. The latter is crucial for the operation of the overall batch service: without it the executors would run idle and the system would fail to perform its core function. For this reason we place this function at the top of the scheduling

Clients Client interface Storage Execution Validation Planning Job scheduling

component’s priorities and correspondingly at the root of its sub-component hierarchy as shown in figure 12.5.

While applying the Simple Component Pat- tern we identified two sub-responsibilities of the scheduling component. The first is to vali- date jobs against policy rules like per-client quotas4_{or general compatibility with the cur-}

rently available resource set—it would not do to accept a job that needs 20 executor units when only 15 can be provisioned. Those jobs that pass validation from the input to the second sub-component that performs the job schedule planning for all currently outstanding and accepted jobs. Both of these responsibilities are task-based; they are started periodically and then either complete successfully or fail. Failure modes include hardware failures as well as not terminating within a reasonable time frame. In order to compartmentalize possible failures, these tasks should not directly modify the persistent state of jobs or the planned schedule but instead report back to their parent component who then takes action, be that notifying clients (via the client interface component) of jobs that failed their submission criteria or updating the internal queue of jobs to be picked next.

Whereas restarting the sub-components proved to be trivial, restarting the parent scheduling component is more complex—it will need to initiate one successful schedule planning run before it can reliably resume performing its duties. Therefore we keep the important data and the vital functionality at the root and delegate the poten- tially risky tasks to the leaves. Here again we note that the Error Kernel Pattern con- firms and reinforces the results of the Simple Component Pattern: we frequently find that the boundaries of responsibilities and failure domains coincide and that their hierarchies match as well.

Once a job has reached the head of the scheduler’s priority queue it will be picked up for execution as soon as computing resources become available. We have so far considered execution to be an atomic component, but when considering failure we come to the conclusion that we will have to divide its function: the executor needs to keep track of which job is currently running where and it will also have to monitor the health and progress of all worker nodes. The worker nodes are those components that upon receiving a job description will interpret the contained information, contact data sources, and run the analysis code that was specified by the client. Clearly the failure of each worker shall be contained to that node and not spread to other workers or the overall executor, which implies that the execution manager supervises all worker nodes as shown in figure 12.6.

4 _{For example one might want to limit the maximal number of jobs queued by one client—both in order to}

protect the scheduling algorithm and to enforce administrative limits.

Job scheduling

Parent of...

Planning Validation

In document Reactive Data Handling (Page 34-36)