What is Scalability?

4.3 A Review of Protocol Designs

The protocols described in Chapter 2 were, in general, not designed for scalability. Although many of them discuss their performance with respect to the number of participants, and describe this as scalability, the number of participants is usually small, in the region of tens of hosts. While the designs work well in this region, extrapolation to larger groups should be treated cautiously. It should be noted that implosion will always occur whenever a multicast is answered, whether by an acknowledgement or by a reply message, although that implosion need not result in buffer overrun. While many of the protocols previously discussed were not specifically transactional, that does not preclude their use in such a manner by a higher-level protocol.

Centralized ordering protocols [Ka89a,Ch84a,Na88a,Ar92a,Ga88b,Ga88a], where a single designated group member is responsible for ordering messages from the other group members, have two potential scalability problems. The first is the potential for implosion at the central site, the second the performance of the central site in normal usage. The central site is an implosion point because all of the other group members direct messages exclusively to that site. In addition, some of the protocol designs discussed require some form of buffer synchronization to ensure reliability: each group member is required to return the number of the last message it received in sequence, so that the central site can flush any older messages from its buffers. The scalability limit here is set by the number of messages that the central site can process with an acceptable delay, which in turn determines the number of group members that can be supported. Designs based on broadcast are more susceptible to this, as a single central site has to support every group member, whereas some other designs provide a separate central site for each group.

One of the problems of the central site concept is the failure of the central site itself, which may result in an inconsistent group state due to the loss of buffered, but not yet transmitted, packets. This is a problem for those designs which acknowledge each message separately, which is the case if the group accepts messages from non-group members, as the subsequent multicast itself is not receivable by the originator, requiring an extra acknowledgement. Also, the failure of the central site requires that a new central site be created, most often by employing some form of election algorithm to resolve any contention. As the central site also holds the group management information, either this information must be maintained by every group member or it must be gathered when required, which is again a possible source of scalability problems.

Garcia-Molina’s Propagation Graph, PG, produces a number of short trees, the structure of which is passed to the individual group members so that they can forward messages based on the tree structure. Each membership change requires that a new tree structure be calculated and distributed, a potentially costly operation. One of the problems is that the "old" tree is used to propagate the "new" tree, which seems to assume that all parts of a tree are reachable at all times, something which may not in fact be true if the group change was prompted by a member failure. As the protocol was designed to be employed over an arbitrary network architecture, there may be no recourse to a broadcast to inform any lost branches of the new graph, requiring the generator of the PG to directly inform the lost branch of the new graph.

Kaashoek’s protocol [Ka89a] was reliable because the central site retained every packet transmitted to it, with packets being flushed from the finite buffer space when the central site was assured that every group member, in this case every host which was participating in the protocol, had received that packet. By implication, therefore, some method was required to track the group membership, as the primary technique for flushing this buffer was to record the last received packet for each host, the last received sequence number being placed in each data packet by originating hosts. If the buffer became full, then a special flushing algorithm was used, which required each host to reply to a multicast. Two factors effectively limit the scalability of this design. The first is the buffering available: as more hosts participate, potentially more buffering is required if the second limiting factor, the implosion of replies at the central site, is to be reduced.
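The buffer management described above can be illustrated with a minimal sketch. It is not Kaashoek's implementation: the class `SequencerBuffer`, its method names and the fixed `capacity` are assumptions made purely for illustration. The sketch shows why membership must be tracked (the flush point is the minimum over all hosts) and why a silent host eventually fills the buffer and forces a poll of the group.

```python
class SequencerBuffer:
    """Hypothetical history buffer held at a central site (illustrative only)."""

    def __init__(self, hosts, capacity):
        self.capacity = capacity
        self.packets = {}                              # sequence number -> payload
        self.last_received = {h: 0 for h in hosts}     # highest in-sequence number per host
        self.next_seq = 1

    def store(self, payload):
        """Retain a packet under the next sequence number.

        Returns the sequence number, or None when the buffer is full, in which
        case the caller would have to multicast a request for every host's state.
        """
        if len(self.packets) >= self.capacity:
            return None
        seq = self.next_seq
        self.packets[seq] = payload
        self.next_seq += 1
        return seq

    def note_piggyback(self, host, last_seq):
        """Record the last in-sequence number a host placed in a data packet."""
        self.last_received[host] = max(self.last_received[host], last_seq)
        self._flush()

    def _flush(self):
        # A packet may be discarded only once every tracked host has received it.
        safe = min(self.last_received.values())
        for seq in [s for s in self.packets if s <= safe]:
            del self.packets[seq]
```

In this sketch a single host that never sends data never advances its entry in `last_received`, so nothing can be flushed until that host is polled, which is the point at which the replies converge on the central site.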

Chang and Maxemchuk’s [Ch84a] design employed a ring structure both to ensure that order was preserved and to protect against member failure. With increasing group size this ring becomes large, and because the design requires that each member be the central site twice before messages are delivered, the latency, that is the delay between transmission and delivery, increases proportionally.

Multi-phase and positively acknowledged protocols [Ve89a,Bi89b,Da89a] require that replies be received by each group member in order for the protocol to operate correctly, a classic example of the potential for implosion described above. Indeed, the requirement of multi-phase designs that this occurs a multiplicity of times exacerbates the situation. Because all the replies must be gathered, the originator must have complete knowledge of the group membership so that each reply can be "ticked off". The method for gathering this management information is often ignored in the design of such a protocol, being assumed to exist outside the protocol. Several methods have been proposed, such as providing a list of potential group members, which is subsequently pruned based on the replies gathered by a multicast or series of multicasts. The ISIS toolkit provides management information at each ISIS site, so that complete information is available, the information being maintained in a consistent and timely manner by the group management protocol, GBCAST.
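The "ticking off" of replies, and the way the reply count grows with both group size and number of phases, can be sketched as follows. The names `ReplyTracker` and `replies_per_multicast` are hypothetical, and the count assumes that the originator is itself a group member and that every other member replies exactly once per phase.

```python
class ReplyTracker:
    """Hypothetical per-phase record of which group members have yet to reply."""

    def __init__(self, members):
        # Complete membership knowledge is needed before the multicast is sent.
        self.outstanding = set(members)

    def tick_off(self, member):
        """Mark a member's reply as received; True once the phase is complete."""
        self.outstanding.discard(member)
        return not self.outstanding


def replies_per_multicast(group_size, phases):
    """Replies converging on the originator for one multi-phase multicast.

    Assumes the originator is a member and each other member replies once
    in every phase.
    """
    return (group_size - 1) * phases
```

Under these assumptions a three-phase multicast to a group of 50 directs 147 replies at the originator, against 49 for a single-phase design.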

The AMp [Ve89a], in addition to requiring that all replies be gathered at each phase, requires that all the replies be received in response to the same multicast message, ensuring that the multicast is atomic by guaranteeing that every member of the multicast group receives the same message. This further restricts the potential scalability, as replies cannot be gathered over a number of transmissions.

A potential limiting factor may be seen in Danzig’s protocol [Da89a], where a bit map of received acknowledgements is transmitted as part of a retransmission, the number of group members being limited by the size of this bit map. Of course, a bit map is the most efficient method of representing this information and so can be extended without impacting too heavily on the data capacity of a packet. However, a more sophisticated management policy must be employed to ensure that each group member "knows" which bit map field is allocated to itself, this information having to be passed to any originator on request.
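A minimal sketch of such an acknowledgement bit map is given below. It does not reproduce Danzig's packet format: the class `AckBitmap`, its methods and the assignment of one bit index per member are assumptions for illustration. It shows that the map costs one bit per member, and that the index allocated to each member must be agreed by some separate management mechanism.

```python
class AckBitmap:
    """Hypothetical acknowledgement bit map, one bit per group member."""

    def __init__(self, group_size):
        self.group_size = group_size
        self.bits = 0

    def mark(self, index):
        """Record an acknowledgement from the member allocated this bit index."""
        if not 0 <= index < self.group_size:
            raise IndexError("no bit allocated for this member")
        self.bits |= 1 << index

    def missing(self):
        """Indices of members whose acknowledgement has not been seen."""
        return [i for i in range(self.group_size) if not (self.bits >> i) & 1]

    def to_bytes(self):
        """Encode the map as it might be carried in a retransmission."""
        return self.bits.to_bytes((self.group_size + 7) // 8, "big")
```

In this encoding a group of 1000 members needs 125 bytes of bit map in each retransmission, which suggests that the packet overhead is a less pressing limit than the bookkeeping required to allocate indices.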

Transactional protocols [Cr88a,Ch88a], where the reply contains data, are also potentially limited by implosion. However, the number of replies that are actually required by the application making the multicast, expressed in terms of reliability, would tend to reduce the effect of buffer overrun if it occurs. If the transaction is set to be at least k-reliable, that is at least k replies are required, then as long as the buffer overrun occurs after k replies there is no problem. Indeed, Cheriton assumes in the context of the V-Kernel that only the first reply is needed, which assumes that the data contained in each reply is identical.
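The k-reliability argument can be expressed as a small sketch. The function `k_reliable` and the model of overrun it uses, in which every reply arriving after the buffer has filled is simply lost, are simplifying assumptions rather than part of any of the cited designs.

```python
def k_reliable(arrivals, k, buffer_slots):
    """Return the first k replies, or None if overrun prevents k being gathered.

    arrivals     -- replies in order of arrival at the originator
    k            -- number of replies the application requires
    buffer_slots -- replies that can be held before overrun (assumed model:
                    anything arriving after the buffer fills is lost)
    """
    buffered = arrivals[:buffer_slots]
    return buffered[:k] if len(buffered) >= k else None
```

With six replies and four buffer slots, a 3-reliable transaction completes despite the overrun, whereas a 5-reliable transaction does not. Setting k to 1 corresponds to using only the first reply.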

As has already been described, some failure modes may lead to replies which do not contain the same data, which, if the above argument were employed, may result in an inconsistent application state. In addition, many applications may require all replies in order to be assured that the latest, or most accurate, data is used by the application. So, while transactional designs may be less affected by implosion, it is by no means certain that they will not be.

Simple protocols, such as those that use negative acknowledgements [Er87a,Me90a] or simple datagrams [Po80a], are less prone to reply synchronization, as replies are generated only on the detection of missed packets, or not at all. However, a number of shortcomings are inherent in such a system, the main one being the lack of reliability.

Although the use of negative acknowledgements reduces reply traffic, some synchronization will probably occur if messages are missed at bridges, so that a number of group members miss the same message, each then transmitting a NACK. One of the features of many negatively acknowledged designs is the use of extra traffic which is transmitted when there is no data to send, these extra messages being used to inform other group members of the last message received as well as the next sequence number to be used by that host. As a group becomes larger, the amount of overhead will also increase, possibly resulting in a substantial amount of unproductive activity.
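Gap detection at a receiver, which is what triggers a NACK, can be sketched as below. The class `NackReceiver` is hypothetical and is not taken from any of the cited protocols; it assumes sequence numbers start at 1 and that a NACK list is recomputed on every arrival.

```python
class NackReceiver:
    """Hypothetical receiver that detects gaps in a sequence-numbered stream."""

    def __init__(self):
        self.expected = 1      # next sequence number to deliver
        self.held = {}         # out-of-order packets awaiting delivery

    def receive(self, seq, payload):
        """Return (payloads now deliverable in order, sequence numbers to NACK)."""
        if seq < self.expected:
            return [], []      # duplicate of a packet already delivered
        self.held[seq] = payload
        delivered = []
        while self.expected in self.held:
            delivered.append(self.held.pop(self.expected))
            self.expected += 1
        # Every number between the next expected packet and the highest
        # packet held is a detected gap.
        highest = max(self.held, default=self.expected)
        nacks = [s for s in range(self.expected, highest) if s not in self.held]
        return delivered, nacks
```

If packet 2 is lost at a bridge, every receiver behind that bridge detects the same gap on the arrival of packet 3 and issues the same NACK at much the same time, which is the synchronization referred to above.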

The protocol by Melliar-Smith and others [Me90a] employs a distributed method of achieving order, each message containing information about messages received and the order in which they are currently arranged. It would be expected that as the group increases in size, this information may become a large proportion of the data in a message, effectively limiting the number of group members supported.

Most of the protocol designs described in Chapter 2 are designed around a broadcast-capable LAN environment. One of the considerations when discussing large scale multicasting is the use of a WAN to link LANs together. A WAN can be characterized as having many paths, of different lengths, to each recipient, which introduces dispersion into the system, much like that introduced artificially by a delay algorithm. This is of benefit for multicast transactional protocols, reducing implosion at intermediate bridges and gateways, although as the speed of networks increases this dispersion is likely to reduce, as it is largely caused by storing messages temporarily at bridges. However, the use of disparate paths, implying that messages are forwarded effectively in parallel, may be a cause of implosion. As implosion is caused mainly by a speed mismatch, it is unlikely that messages passing from WAN to LAN implode, as a LAN is often very much faster than a WAN. In addition, buffer overrun at intermediate bridges may be delayed due to the dedicated nature of bridges, both in terms of the dedicated processing available for protocol processing and the large number of message buffers which can be expected.

In document The Scalability of Multicast Communication (Page 69-73)