7.1 Centralised control for adaptive processing
7.1.1 Model identification in a distributed traffic processing system
Given a processing node as described in Section 6.2.3, we first need to identify the process model by investigating its behaviour, then try to synthesise a controller that will control the real process.
Processing node behaviour
A node’s workload is expressed as a buffer usage level percentage. A workload example in a processing node is depicted in Figure 7.2. There is a maximum level (M AX) up to where the system is able to process packets. In other words, when the system hits theM AXlevel it means it gets congested and starts dropping packets. TheHIlevel is an intermediate thresh- old level that is set-up by the supervisor as a reference so as to provide enough provisioning time for solving the problem of re-mapping of the task. Therefore,HI is estimated at de- ployment time and depends on the size of the buffers and the estimated core capacity to run the task in normal environment conditions. As we will see, it will be dynamically adjusted at runtime. Moreover, when the system workload is below a minimum level (M IN) it means that the node is underloaded and its task(s) should be moved to other nodes so as to free up the underloaded node completely. The underloaded node informs its supervisor about its new
state and the supervisor decides, according to a ‘defragmentation policy’, how and when to re-map the tasks from the underloaded node so as to efficiently use the resources of the entire distributed traffic processing system.
TL= loading time t
MAX: congested system workload
normal runtime LO: underloaded system
HI:
TR = re−mapping time
PRE−FAIL
Figure 7.2: Dynamic behaviour in a processing node.
Let us considering the example in Figure 7.2. When the system starts up, it takes a while, called loading time (T Lin Figure 7.2). During the loading time the system is filling all the hardware cores with tasks and the supervisor is achieving the first stable state. Then we say that the entire system is stable as long as each node’s workload is steady or slightly fluctuating up to node’s HI level. However, the workload could go up to theM AX limit because of changes in the environment. When the workload exceeds theHI reference, the supervisor throws an exceptionPRE-FAILin order to take an early decision for solving the problem before the node reaches the M AX limit. Next, the system tries to solve the problem by re-mapping the troubled task from congested node to an available core with an appropriate capacity. After the re-mapping time,T R, the system comes back to a stable state.
Model spaces
In order to identify the process parameters (TL, TR, etc.), we need to do some practical experiments. We take two tasks located on different nodes:N ode1.T askxandN ode2.T asky
and work on the following scenario (see Figure 7.3):T askxgets overloaded inN ode1, and we note traffic re-routing fromT askxtoT asky, and task moving fromN ode1 toN ode2. Although the experiments use, for the sake of simplicity, tasks with no dependent children, the measured parameters can be scaled according to the amount of children that would need to be re-mapped together with their parent.
7.1 Centralised control for adaptive processing 109 000000000 000000000 000000000 111111111 111111111 111111111 000000000 000000000 000000000 111111111 111111111 111111111 0000000 0000000 0000000 1111111 1111111 1111111 0000000 0000000 0000000 1111111 1111111 1111111
PBuf WorkMan 1 WorkMan 2 PBuf
IBuf MBuf IBuf MBuf WorkMan 0 X Task TaskY Node0 Node2 Node1 Conductor WorkMan
supervisor Input traffic
Figure 7.3: MovingT askx→T asky.
The algorithm of a task moving between two nodes tries to solve the following two prob- lems: (1) transferring of process state and (2) minimise the packet loss. Let us assume that when the supervisor receives a ‘congestion’ notification fromN ode1.T askx, it first prepares
the offloading of the overloaded task. It finds an available core that has the appropriate ca- pacity to run the troubled task, then it copies and loads the object code onto the core without starting the task effectively. Next, it re-routes the traffic fromN ode1toN ode2. Then the supervisor waits for the ‘ready’ status fromN ode1that tells when the shared bufferPBufis completely drained. In other words, we keep theN ode1.T askxprocessing of the remaining
packets in itsPBufwhile the new incoming traffic is routed to theN ode2and is currently stored inN ode2.P Buf. When the supervisor receives the ‘ready’ status (all packets from N ode1 were processed), it issues a state transfer fromN ode1.T askx → N ode2.T asky by
means of anMBuf copy. After the state transfer, the supervisor is ready to start the new processing task:N ode2.T asky.
We implemented this algorithm on a distributed architecture (see Figure 7.3) composed of a commodity PC (used as supervisor), a switch, and two different generations of network processors: IXP2400 and IXP2850. The task used in this experiment consists of a simple application that counts the UDP packets in a stateful counter and computes a hash over the first 64 Bytes of the packet. The counter gives us a local state that needs to be transferred together with the task when it gets overloaded. While this application is not in itself par- ticularly useful, it is illustrative for real applications, while also providing a simple constant benchmark. The re-mapping algorithm, in this particular distributed architecture, illustrates a ‘system upgrade’ case and is described in detail below.
1. The CONDUCTOR calls onW orkM an1 andW orkM an2to load the same applica- tion’s code object – ‘UDP packet counter and hash over the first 64 Bytes’ – on their nodes: T askxonN ode1,T askyonN ode2; then CONDUCTORstarts theT askxand
keeps theT askyin standby (not running);
UDP copy packets inPBuf. We have evaluated various sizes forPBuf, as illustrated in Figure 7.4 and described later in this section;
3. T askxprocesses all packets fromIBufand stores the temporary state/results intoMBuf.
We used a 1150 Bytes long array for storing a hash table inMBuf. We found that the size ofMBufdoes not affect the re-mapping time significantly compared to a stateless case when the task has noMBufand hence no need to wait for the drain;
4. APRE-FAILevent occurs inN ode1: the incoming UDP traffic grows, exceeds the T askxreference (HI), and the event is detected byW orkM an1by means of checking the buffer usagedx. HI was set at 75 % to ensure with high probability that there is
enough time for task re-mapping before the workload would hit theM AXlevel; 5. W orkM an1signals thePRE-FAILevent to the CONDUCTOR:W orkM an1.T askx=
−1 and therefore, the CONDUCTOR signalsW orkM an0to re-route the UDP traffic fromN ode1toN ode2;
6. The CONDUCTOR waits for a second event of the troubled node (N ode1) that tells whenW orkM an1.T askxfinishes the processing of buffered packets (T askx.IBuf
becomes empty);
7. When the CONDUCTORreceives the second event, it stops theW orkM an1.T askxand
issues a state transfer:W orkM an1.T askx.M Buf →W orkM an2.T asky.M Buf;
8. Finally, the CONDUCTORstartsW orkM an2.T asky.
The goals of this test are threefold. First, we measure the loading time of an application’s code object (a task) on a node, and then on two nodes (T askx andT asky). This loading
time (TL in Figure 7.2) is described below in Step 1. Second, we measure the dead-time betweenPRE-FAILevent occurrence (Step 4) and the successful failure addressing (Step 8) for a single task (no dependent children). This time is depicted in Figure 7.2: TR. Lastly, we find the best way to address the failure detection: a realisticHI threshold in order to signal a PRE-FAILin time for minimising the dead-time introduced by a task movement and avoiding of packet loss.
Our experiments show that the loading time (TL) measured in Step 1 is roughly 1 second. This is mainly due to the fact that we use secure file transfer (scp) for loading the application code remotely. The application’s code object file is about 500KB. Moreover, we found that TL is not much dependent on the number of nodes in the distributed architecture due to the simultaneous code loading on all the nodes. The most important limitation is the supervisor’s bandwidth for the control link.
When handling aPRE-FAILevent, we found that there is a constant detection time of roughly 1.7ms. It involves one way communication over a TCP connection. Next, when moving a task between two nodes we found that the re-distributing time (TR) is dependent on the following attributes:
• feedback communication: we use a dedicated asynchronous TCP connection separate from the main control connection that is needed to inform the CONDUCTORon each node’s status (workload);
7.1 Centralised control for adaptive processing 111
• state transfer: theMBuf stream size in our case is 1150Bytes and it takes 9.7ms to transfer this state from one node to another;
• draining time: the time needed to process the rest of the packets inPBufsince the route was changed and which is therefore, dependent on thePBufsize.
Figure 7.4 illustrates the draining and re-distributing (TR) time for different sizes of the shared packed buffer (PBuf). The ‘simulated drain time’ chart was computed using Intel’s cycle simulator for the µEngine processing time of each incoming packet. The ‘Drain- Time’ chart was measured locally, in hardware, inside the troubled node using an interrupt- based mechanism for reading the sensors (SRAM memory locations written by the process- ing tasks). The ‘TotalTime’ is the entire re-distributing time since the occurrence of the
PRE-FAILevent until the successful moving of task (Steps 4-8). This time includes the draining time, the copy of local states (MBufstream) and the additional transfer events over the control link (TCP connections).
Figure 7.4: Tasks moving benchmark.