are shown as small black filled rectangles.
Such protocol might be implementable on custom hardware, e. g. data switch or pro- cessor with all custom hardware or at least some programmable routing device like FPGA- enabled routing cards.
5.3. OPERATION IN DYNAMIC ENVIRONMENT 38
5.3 Operation in Dynamic Environment
In this section, we describe necessary protocols for distributed AE operating in dynamic environment, where nodes of the tightly coupled cluster may disappear suddenly while new becoming available. So protocols for failure recovery and nodes joining and leaving distributed AE are discussed in this section.
Further, processing and network capacity of the individual nodes may change in time when the nodes are shared for some other operations and thus some protocol for commu- nication with distribution unit to allow dynamically adjusted load balancing. Generalizing this scenario, we develop a general protocol for distributed AE units communicating with distribution unit.
5.3.1 Setup of a New Ring
When new distributed AE is set up, the following steps are executed in order: 1. Any of the nodes starts to broadcast setup message.
2. All the nodes that want to participate reply with broadcasting their randomly gen- erated IDs. If collision is encountered, the colliding nodes generate new random IDs which must not be any of existing IDs.
3. A node with lowest ID becomes the master. All nodes create the ring according to increasing ID and remember IDs of all nodes participating in the ring.
5.3.2 Addition of a New Node
When a node wants to join an existing ring of AEs, the following steps take place:
1. New node asks any of the existing ring member for the ring topology including list of existing node IDs. If anycast is present in the network, it can be used.
2. New node generates a new random and unique ID. 3. New node broadcasts its joining to all the nodes.
4. After an acknowledgment received from all the nodes, it announces its availability to ingress load balancer.
5. Load balancer acknowledges its addition.
5.3.3 Failure Detection and Recovery
Failure detection relies on timeout if no token is received during some timeout period, by default 10× RTT. To recover, each node uses tokens, but this time tokens are acknowledged and they gather the new topology as they travel around the ring. The following procedure is performed:
1. Each node generates its own recovery token.
2. On reception of recovery token, the node adds itself to the list in the token and ac- knowledges its reception to the node the token came from.
3. If the token is not the token it has generated, it sends the token to the next node in the ring and waits for an acknowledgment. If the acknowledgment doesn’t arrive unit a timeout, it tries to send it to next plus one node in the ring and so on, until it receives and acknowledgment.
4. If the token is node’s own one, it extracts the topology of the new ring from it. As an alternative, it is also possible to broadcast the failure of the ring to all the nodes and use standard setup procedure instead as described in Section 5.3.1.
5.3.4 Removal of Existing Node
When a node wants to remove itself from a ring of AEs while allowing distributed AE to operate flawlessly without need for failure recovery, it executes the following protocol:
1. It announces its unavailability to the ingress load distribution unit.
2. After acknowledgment from distribution unit it considers itself ready for removal from the ring of AEs.
3. It broadcasts new topology to all nodes.
4. It waits until it hasn’t received circulating token for some timeout value, by default 10× last RTT, and shuts down.
5.3.5 Communication between Distributed AE and Load Balancers
There are four cases when the distributed AE communicates with the load distribution unit in dynamic environment:
Announcement of new ring topology after ring set-up or failure recovery. After a new ring is set up or after ring recovery, the master node announces ring topology to the distribution unit if it is known. Otherwise the distribution unit asks for that information (by unicast if master or at least one node of the ring is know, or using broadcast/multicast otherwise).
Addition of a new node. The new node announces its availability after it has successfully started its operation and joined the ring of existing parallel AE units as described above.
Removal of an existing node. Before leaving the ring of parallel AE units, the node an- nounces removal to distribution unit and waits for an acknowledgment.
Load reporting from parallel AE units. This process is handled by the load distribution unit. It polls Network Information module of each parallel AE unit through a mes- saging interface either regularly or according to specified policy.
Knowledge of the topology by the distribution unit is handled in soft-state manner, i. e. it has to be periodically refreshed by master node. If the information expires, the distributing unit stops distribution (to avoid flooding network) and tries to get information about the topology.
5.4 Prototype Implementation
Prototype implementation of the distributed AE is implemented in ANSI C language for portability and performance reasons. The implementation comprises two parts: a load distribution library and distributed AE itself.
Because of lack of flexible enough load distribution hardware unit, we have imple- mented it as a library, which allows simple replacement of standard UDP related send- ing functions in existing applications and allows developers to have defined type of load distribution—either pure round robin or load balancing.
5.4. PROTOTYPE IMPLEMENTATION 40 Each parallel AE element uses threaded modular implementation based on reflector architecture described in Section 3.2.1 with networking and distribution extensions de- scribed in Section 4.2. Internal buffering capacity of each AE node has been set to 500 packets. Explicit synchronization using FCT protocol has been implemented using MPICH implementation [80] of MPI [79] built with low-latency Myrinet GM 2.0 API [84] (so called MPICH–GM).
For cost-effective prototype implementation, the aggregation unit was a implemented as commodity switch satisfying condition that egress link capacity is equal or larger than ingress capacities and with sufficient capacity of internal switching matrix.
Prototype implementation has been tested and known to work on Linux and FreeBSD 5.x platforms.
5.4.1 Experimental Setup
In order to evaluate performance and behavior of the distribute AE experimentally, we have set up a testbed shown in Figure 5.6 comprising eight machines and two switches:
GE data link Myrinet low latency
interconnect AE parallel nodes low latency interconnect switch data path switch sending/receiving probes