7.1 Centralised control for adaptive processing
7.1.2 Control design
In the previous sections we identified the process behaviour and the process model space; we now present the control loop design.
The control loop
For the sake of simplicity, we present in Figure 7.5 the block diagram of an example control loop using only two nodes. Figure 7.5 shows the controller (CONDUCTOR) located on a central supervisor, the actuators (WorkMan_x, WorkMan_y), and the controlled tasks (Node_x.Task[], Node_y.Task[]) located on Node_x and Node_y, respectively.
In order to keep the supervised system stable regardless of changes in the environment, CONDUCTOR has, by design, two sub-objectives: a continuous analysis and a failure analysis. As the name suggests, the former is performed continuously (at the highest sampling rate the system supports), while the latter runs only when the former has detected a failure event.
In the continuous analysis, CONDUCTOR periodically performs two actions at the highest sampling rate offered by the system, in our case every 100 ms. First, CONDUCTOR uses the load status of each node (L_Node[x]) to update a local process map. The process map represents CONDUCTOR’s view of the entire system under its control. Note that if the number of nodes grows to a point where scalability becomes an issue, we may be forced to consider a more distributed controller (e.g., a federated one). However, as we assume that the number of nodes does not exceed a few tens, we did not explore distributed control in this thesis. Second, CONDUCTOR checks the evolution of each node’s status against a pre-loaded model and sends a request to the failure analysis when, for instance, an overload or underload (or even a hardware failure) occurs. Such a PRE-FAIL event is detected by evaluating the model, which represents the dynamic behaviour of a node as illustrated in Figure 7.2. The model is loaded and set up at deployment time with the system configuration and specific parametric data such as the LO and HI threshold levels.
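The two actions of one sampling step can be sketched as follows. This is a minimal illustration, not the thesis implementation: the class and method names, the event labels OVERLOAD and UNDERLOAD, and the threshold values are assumptions made for the example, and the node model is reduced to a simple comparison against LO and HI.

```python
from dataclasses import dataclass, field


@dataclass
class NodeModel:
    """Per-node thresholds loaded at deployment time (illustrative values)."""
    lo: float = 5.0    # assumed underload threshold, in percent
    hi: float = 80.0   # assumed overload threshold, in percent


@dataclass
class Conductor:
    models: dict                                      # node name -> NodeModel
    process_map: dict = field(default_factory=dict)   # node name -> last load

    def continuous_analysis(self, loads):
        """Run one sampling step and return the PRE-FAIL events it detects."""
        events = []
        for node, load in loads.items():
            self.process_map[node] = load             # action 1: update the map
            model = self.models[node]
            if load >= model.hi:                      # action 2: check the model
                events.append((node, "OVERLOAD"))
            elif load <= model.lo:
                events.append((node, "UNDERLOAD"))
        return events


# A caller would invoke this once per sampling period (100 ms in the text).
conductor = Conductor(models={"NodeX": NodeModel(), "NodeY": NodeModel()})
events = conductor.continuous_analysis({"NodeX": 91.0, "NodeY": 40.0})
```

In this example only NodeX crosses its HI threshold, so a single event would be handed to the failure analysis while the process map records the load of both nodes.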
[Figure 7.5 labels: node sensor adaptors reporting the loads L_x and L_y (load feedback from N nodes); Conductor with the process map of the N distributed nodes, the reference W_ref, the continuous analysis and the failure analysis (cost function; re-routing: RoutingTable update; re-mapping: Node[i].Task[x] -> Node[j].Task[y]); WorkMan X and WorkMan Y; Node X and Node Y, each with Task 1, Task 2 and Task 3.]
Figure 7.5: Block diagram of the control loop.
The failure analysis processes each PRE-FAIL event received from the continuous analysis. It searches for an available core on the same node or on another node of the processing hierarchy. When it finds multiple cores available, it chooses the first fit based on the cost function described in Equation 6.1. Once such a solution is found (e.g., the one with the smallest cost), CONDUCTOR re-maps the overloaded task from the congested node onto the chosen node with the help of the nodes' actuators (WorkMans). This action also involves updating the routing policies (event R in Figure 7.5), because moving or replicating a task between two nodes requires moving or replicating, respectively, the input stream it needs.
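The selection and re-mapping steps can be sketched as follows. Equation 6.1 is not reproduced in this section, so the cost function below is a hypothetical stand-in that merely penalises leaving the source node; the function names, the tuple layout and the routing table format are likewise assumptions. The sketch picks the candidate with the smallest cost, which is one reading of the selection rule in the text.

```python
def cost(source_node, target_node, core_load):
    """Hypothetical cost (NOT Equation 6.1): penalise moving to another node."""
    hop_penalty = 0.0 if source_node == target_node else 10.0
    return hop_penalty + core_load


def choose_core(source_node, free_cores):
    """free_cores: list of (node, core_id, core_load); return cheapest or None."""
    if not free_cores:
        return None
    return min(free_cores, key=lambda c: cost(source_node, c[0], c[2]))


def remap(routing_table, task, target):
    """Re-map the task and update routing so its input stream follows it."""
    node, core_id, _ = target
    routing_table[task] = (node, core_id)
    return routing_table


free = [("NodeY", 0, 2.0), ("NodeX", 3, 5.0), ("NodeY", 1, 0.0)]
best = choose_core("NodeX", free)
table = remap({"Task2": ("NodeX", 1)}, "Task2", best)
```

With these made-up numbers the free core on the same node wins, because the assumed hop penalty outweighs the lower load of the cores on NodeY.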
Another anomaly that the failure analysis handles is the case in which a processing task becomes underloaded; for example, the load status indicates a very low workload, such as less than 5 %. In this case, CONDUCTOR searches the process map of the distributed system for other available processing nodes that could handle the workload of the underloaded task. When it finds a suitable solution, it moves the underloaded task and re-distributes the traffic so that the underloaded core becomes available. It then simply marks the available core as ‘unused’.
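The underload case can be sketched in the same style. The 5 % threshold is the example value from the text; the packing rule (prefer the busiest host that still stays below HI), the HI value of 80 and all names are assumptions made for illustration only.

```python
UNDERLOAD_PCT = 5.0  # example threshold from the text: "less than 5 %"


def pick_host(task_load, hosts, hi):
    """Return a host that can absorb the task and stay below HI, or None."""
    fitting = {name: load for name, load in hosts.items()
               if load + task_load < hi}
    if not fitting:
        return None
    # Assumption: prefer the busiest host that still fits, to pack work densely.
    return max(fitting, key=fitting.get)


def handle_underload(core, task_load, hosts, core_state, hi=80.0):
    """Move an underloaded task away and mark its core as unused."""
    if task_load >= UNDERLOAD_PCT:
        return None                        # not underloaded, nothing to do
    target = pick_host(task_load, hosts, hi)
    if target is not None:
        hosts[target] += task_load         # the target now carries the traffic
        core_state[core] = "unused"        # the freed core becomes available
    return target


hosts = {"NodeX.core0": 60.0, "NodeY.core0": 79.0}
core_state = {"NodeX.core2": "used"}
target = handle_underload("NodeX.core2", 3.0, hosts, core_state)
```

Here NodeY.core0 is rejected because adding the task would push it past the assumed HI level, so the task moves to NodeX.core0 and the original core is released.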
Note that the continuous analysis determines an important characteristic of the control system: whether packets may be lost during the decision time after a failure, or whether packet loss can be prevented altogether. Based on this characteristic, two algorithms are possible for the continuous analysis: (1) AIMD and (2) prediction. We opted for AIMD, which stands for additive increase, multiplicative decrease and is a dynamic control mechanism used, for instance, in TCP congestion control. We explain AIMD first and then, for completeness, present the possible alternative.
AIMD control algorithm
This algorithm is a simple control algorithm inspired by the TCP congestion control model. It tries to avoid packet loss during node offloading by adjusting the current HI level according to the previous offload experience, so as to give the system enough spare time for the next offload. Initially, a default value is computed for each threshold, LO and HI. The algorithm starts with the default threshold values and then dynamically adjusts the HI level so as to provide a good tradeoff between early offloading (preventing packet loss) and late offloading (for higher resource utilisation). For each offloading experience, the algorithm checks whether packet loss occurred (see Figure 7.6) and adjusts the HI level accordingly: when there was no packet loss, the HI level grows by one point (HI = HI + 1); when there was some packet loss, the HI level decreases according to the formula HI = α · HI, where α ∈ [0, 1]. The choice of α is also important because it determines the system's dynamics, that is, how often the troubled tasks are re-mapped. In our experiments we found that α = 0.8 is a good tradeoff value.
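The update rule can be written down in a few lines. The sketch below uses α = 0.8 from the text; the function name, the starting value of 80 and the clamping of HI between LO and a maximum are assumptions added for the example and are not stated in the source.

```python
def aimd_update(hi, packet_loss, alpha=0.8, lo=0.0, hi_max=100.0):
    """Return the new HI level after one offloading experience."""
    if packet_loss:
        hi = alpha * hi      # multiplicative decrease: HI = alpha * HI
    else:
        hi = hi + 1.0        # additive increase: HI = HI + 1
    # Assumption: keep HI between LO and the maximum (congested) level.
    return max(lo, min(hi, hi_max))


hi = 80.0                    # assumed default HI level, in percent
history = []
for loss in [False, False, True, False]:
    hi = aimd_update(hi, loss)
    history.append(hi)
```

Two loss-free offloads raise HI from 80 to 82, a single offload with packet loss then drops it to roughly 65.6, after which it starts climbing again by one point per offload.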
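[Figure 7.6: workload over time t, with the levels LO (underloaded system), HI (default level and adapted level HI_i) and MAX (congested system). Legend: HI reached without packet loss; HI reached with packet loss. Update rules: if pkt.drop => HI = α · HI; if NOT pkt.drop => HI = HI + 1.]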
The advantage of this algorithm is its simplicity; the disadvantage is that it assumes packet loss is acceptable while tasks are re-mapped.
Prediction control algorithm
An alternative would be to try to prevent packet loss during task re-mapping by forecasting the moment of congestion. A popular approach, widely used in trend estimation, is a least squares fit. The efficiency of linear prediction using a least squares fit was shown by [100] for forecasting traffic congestion on web servers. However, an implementation of this algorithm is beyond the scope of this research, and interested readers are referred to [100].
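To make the idea concrete, the sketch below fits a straight line to recent load samples by ordinary least squares and extrapolates the time at which the load would reach the HI level. It is a generic illustration and does not reproduce the method of [100]; the function names, the sample values and the HI value of 80 are assumptions.

```python
def fit_line(ts, loads):
    """Ordinary least squares fit of load = slope * t + intercept."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_l = sum(loads) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    if sxx == 0.0:
        raise ValueError("need samples taken at two or more distinct times")
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in zip(ts, loads))
    slope = sxy / sxx
    return slope, mean_l - slope * mean_t


def time_to_threshold(ts, loads, hi):
    """Forecast when the fitted trend reaches HI; None if load is not rising."""
    slope, intercept = fit_line(ts, loads)
    if slope <= 0.0:
        return None
    return (hi - intercept) / slope


ts = [0.0, 0.1, 0.2, 0.3]            # sample times in seconds (100 ms apart)
loads = [50.0, 55.0, 60.0, 65.0]     # made-up load samples, in percent
t_hi = time_to_threshold(ts, loads, hi=80.0)
```

For these samples the load rises by 50 points per second from 50 %, so the trend reaches the assumed HI level at about t = 0.6 s, which would leave the controller three sampling periods to re-map the task before congestion.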
Summarising, the main job of the controller (CONDUCTOR) is to re-distribute the tasks over the distributed processing hierarchy so that the entire system is well loaded (neither underloaded nor overloaded). A secondary job is to re-route the traffic when tasks are re-mapped. Although the controller continuously analyses the load status of the entire distributed system, these jobs run only when a pre-failure event occurs, as shown in Figure 7.2.