BUILDING SCALABLE AND ADAPTIVE NETWORK
SERVICES
by
Adolfo Francisco Rodriguez
Department of Computer Science
Duke University
Date:
Approved:
Dr. Amin Vahdat, Supervisor
Dr. Jeffrey Chase
Dr. Carla Ellis
Dr. Kishor Trivedi
Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
in the Department of Computer Science in the Graduate School of
Duke University 2003
Copyright © 2003 by Adolfo Francisco Rodriguez
All rights reserved
ABSTRACT
(Computer Science)
An abstract of a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of
Computer Science in the Graduate School of Duke University
Abstract
The lack of infrastructural functionality in IP-based networks has led to the creation of many overlay network algorithms where application-level nodes self-organize, forming an overlay atop the underlying IP substrate. The process of overlay design, development, and evaluation is plagued by a number of challenges. Current overlay algorithms are designed toward specific application requirements and tend to either be one-dimensional, optimizing for a single performance metric while sacrificing others, or not scalable to large numbers of nodes. Development of overlays is tedious, complex, and redundant since these algorithms make use of similar system and network functionality. Disparate evaluation techniques expose implementation artifacts rather than differences in algorithmic principles, thus leading to unfair and inconsistent evaluation.
This dissertation presents a collection of overlay services geared at simplifying the creation of scalable and adaptive overlay networks. First, RanSub is a scalable mechanism for providing overlay nodes with random subsets of global participants. Second, TreeMaint is a protocol for scalable and adaptive overlay tree maintenance with guaranteed loop-free properties. Finally, MACEDON provides a language for the high-level representation of overlay algorithms from which high-performance implementations are generated. It leverages functional similarities of overlays to decrease development and evaluation complexity and to promote consistent evaluation. In addition, this dissertation describes how these tools can create high-performance versions of prominent overlays and design novel algorithms that overcome current one-dimensional approaches by optimizing for multiple performance metrics.
Acknowledgements
I would like to thank my advisors, Dr. Amin Vahdat and Dr. Kishor Trivedi, for their invaluable support. I acknowledge the other members of my committee, Dr. Jeff Chase and Dr. Carla Ellis, as well as other professors who have been influential in my academic career, Dr. Lars Arge, Dr. Alvin Lebeck, Dr. Thomas Narten, Dr. Xiaobai Sun, and Dr. Bart Vashaw. Also, I wish to thank my research colleagues Jeannie Albrecht, Rebecca Braynard, Yun Fu, Chip Killian, Dejan Kostić, Priya Mahadevan, Jaidev Patwardhan, Dr. Paul Pauca, and Kashi Vishwanath for their support and collaboration. I would like to thank David Becker for all of his help with ModelNet and Diane Riggs for her encouragement and guidance. Last, but certainly not least, I thank my wife, Karina, and daughter, Camila, without whose support this dissertation would not have been possible.
Contents
Abstract
Acknowledgements
List of Figures
1 Introduction
1.1 The Rise of Overlays
1.2 Challenges in Overlay Research
1.3 Hypothesis
1.3.1 Design Principles
1.3.2 RanSub
1.3.3 TreeMaint
1.3.4 MACEDON
1.3.5 Using the Tools
2 RanSub
2.1 Desirable Properties
2.2 Overview
2.3 Collect/Distribute
2.3.1 Collect Phase
2.3.2 Distribute Phase
2.3.3 Discussion
2.4 Analysis
2.4.1 RanSub Uniformity
2.4.2 Node Churn and Failure
2.4.3 Security Issues
2.5 Experimental Results
2.6 Conclusions
3 TreeMaint
3.1 Use of RanSub
3.2 TreeMaint Operation
3.3 Correctness
3.4 Joining and Leaving TreeMaint
3.5 Evaluation
3.6 Conclusions
4 MACEDON
4.1 System Model
4.1.1 Algorithm Characteristics
4.1.2 Representing Overlays in MACEDON
4.2 The MACEDON API
4.3 Language Support
4.3.1 MACEDON Grammar
4.3.2 An Example Specification: Scribe
4.4 Operating Environment
4.4.1 Code Generation
4.4.2 Thread Management
4.4.4 Debugging Support
4.5 Evaluation
4.5.1 NICE
4.5.2 Chord
4.5.3 Pastry
4.5.4 SplitStream
4.6 Conclusions
5 Case Study: AMMO
5.1 Types of Metrics
5.2 Application Specifications
5.3 Overlay Transformations
5.3.1 The Effect of Simultaneous Transformations
5.4 Constraining Degree
5.5 Evaluating AMMO
5.5.1 Overlay Convergence Results
5.5.2 Adaptivity
5.5.3 Dealing With Multiple Metrics
5.5.4 Effects of Probe Set Size
5.5.5 PlanetLab Deployment
5.6 Conclusions
6 Case Study: HCO
6.1 Algorithmic Foundations
6.1.2 HyperChromatic Trees
6.1.3 HCO: Beyond HyperChromatic Trees
6.2 HCO Protocol
6.2.1 Node State
6.2.2 Node Insertion
6.2.3 Node Locking
6.2.4 HCO Rebalancing Transformations
6.2.5 HCO Swapping Transformations
6.2.6 Node Deletion and Failure
6.3 Evaluating HCO
6.3.1 Tree Convergence
6.3.2 Tree Height After Convergence
6.3.3 Control Packet Overhead
6.3.4 Matching the Substrate Network Topology
6.3.5 Adaptivity
6.4 Conclusions
7 Related Work
7.1 Group Membership
7.2 Reliable Multicast
7.2.1 Scalable Reliable Multicast (SRM)
7.2.2 Logical Group Concept (LGC)
7.2.3 Multicast TCP
7.3.1 CAN
7.3.2 Chord
7.3.3 Pastry
7.4 Indirect Overlay Routing
7.4.1 RON
7.4.2 Traffic Tunneling
7.5 Application Level Multicast
7.5.1 Yoid
7.5.2 ALMI
7.5.3 RMX
7.5.4 Overcast
7.5.5 Narada
7.5.6 BTP and HMTP
7.5.7 Scribe
7.5.8 NICE
7.6 Content Distribution Networks
7.7 Evaluation Mechanisms
7.7.1 ns
7.7.2 ModelNet
7.8 Domain-Specific Languages
7.8.1 Teapot
7.8.2 Devil
7.9 Algorithm Verification
8 Conclusions and Future Work
8.1 Contributions
8.2 Future Work
8.2.1 RanSub
8.2.2 MACEDON
8.2.3 AMMO
8.2.4 HCO
Bibliography
Biography
List of Figures
1.1 A sample overlay network
1.2 Current overlay research cycle
1.3 Streamlined, MACEDON-enabled overlay research cycle
2.1 RanSub operation
2.2 Example scenario depicting the first phase of the RanSub protocol: The collect phase traveling up the overlay
2.3 Example scenario depicting the second phase of the RanSub protocol: The distribute phase traveling down the overlay
2.4 Average number of nodes learned as a function of epochs for optimal pure uniform and RanSub
3.1 Number of transformations effected over time for TreeMaint-based Overcast and distributed lock-based Overcast
4.1 The MACEDON protocol stack
4.2 Portion of the Overcast algorithm representation
4.3 Simplified MACEDON API, part 1
4.4 Simplified MACEDON API, part 2
4.5 MACEDON grammar
4.6 Portions of the Scribe specification, part 1
4.7 Portions of the Scribe specification, part 2
4.8 MACEDON agents
4.10 CDF of node stress for published and MACEDON NICE implementations
4.11 Observed latencies for published and MACEDON NICE implementations
4.12 Observed stretch for published and MACEDON NICE implementations
4.13 Convergence toward correct routing tables for MIT and MACEDON Chord implementations
4.14 Average latency of received packets
4.15 Achieved bandwidth over SplitStream
5.1 Sample network illustrating a possible transformation
5.2 Delay convergence as a function of time for three different delay targets in AMMO_D
5.3 AMMO_DC delay and cost convergence as a function of time for two different delay constraints
5.4 Overcast achieved bandwidth over time
5.5 Adaptivity of an AMMO_D tree in response to pronounced change in network delay
5.6 Adaptivity of AMMO_DC in response to pronounced change in network delay
5.7 Cost and delay under varying delay bounds
5.8 Delay convergence time and resulting per-node probing overhead as a function of the size of the random subset in AMMO_D
5.9 Cost and delay convergence times with resulting per-node probing overhead for varying probe set sizes for AMMO_DC
5.10 CDF of convergence time for 19 PlanetLab nodes, with each node acting as root in turn for two different delay targets
5.11 Streaming live media using AMMO_D and SHOUTCast over PlanetLab
5.12 Distribution of packet received ratio for a 5 minute experiment streaming 256 Kbps over AMMO_D running on PlanetLab
6.1 A sample red-black tree red propagation
6.2 A sample red-black tree rotation
6.3 A sample HyperChromatic tree
6.4 Red propagation about node p
6.5 Black propagation about node p
6.6 A rotation about node n
6.7 A single child/double grandchild rotation about node p
6.8 A single child/single grandchild rotation about node p
6.9 The first phase of locking for rebalancing
6.10 The second phase of locking for rebalancing
6.11 A parental swap of nodes p and n
6.12 An ancestral swap of nodes p and n
6.13 Node d could disappear from the HCO tree
6.14 Sample HCO trees with (left) rebalancing disabled and (right) rebalancing enabled
6.16 Height of 100 node tree over time for HCO, HCO with static coordinator timer, and HCO without rebalancing
6.17 Height of HCO tree for varying number of nodes and coordinator timer values
6.18 Control packet overhead in 100 node topology
6.19 Network waste factor with varying number of HCO nodes in a 600 node topology
Chapter 1
Introduction
With the advent of the Internet, the world has seen an explosion in the number of users who depend on computer networks for business and personal tasks. The exponential growth [66] in the number of Internet hosts suggests the importance of a connected world in many facets of life. Much of the Internet’s success can be attributed to the simplicity of the network protocols it uses, the TCP/IP protocol family [75]. While competing network protocols such as IBM’s Systems Network Architecture (SNA) [20, 45] and Novell’s Internetwork Packet Exchange (IPX) [15] have previously rivaled the popularity of IP, it is TCP/IP’s simple and open architecture that has increased its interoperability and ultimately established it as the network protocol of choice for small- and large-scale networks.
Layering
TCP/IP’s approach to networking relies heavily on its layering of protocols. In this approach, networking functionality is divided vertically into components that each provide additional services beyond those provided by lower layers [58]. By doing so, TCP/IP components (such as IP, TCP, and UDP) can be kept simple, yet can be combined to provide powerful transmission semantics. Further, modularity allows competing protocols at one layer to use the same implementation of lower layers, promoting code reusability and easing code development and maintenance. Network components (routers) that deliver data from one part of the network to another need only implement up to layer 3 (network, IP) functionality. Because of this design choice, transmission functionality in TCP/IP networks is end-to-end [73]. In this paradigm, the ends of a network connection communicate over the best-effort IP substrate to assure transport semantics, such as the in-order, reliable, and congestion-friendly delivery characteristics of TCP.
Traditional IP applications are, for the most part, client-server based and depend on IP for the bulk of their communication needs. In this model, a client contacts a server implementing a desired service, makes a request, and receives some desired result. Examples include the File Transfer Protocol (FTP) and Telnet as well as the ever-popular HyperText Transfer Protocol (HTTP) that allows access to web content in the World-Wide Web. A web browser, for instance, opens a connection with a web server, requests a web page and associated images, receives these data objects, displays them to the user, and terminates the connection. In these systems, communication is unicast, meaning that it is from one machine to another.
Increased Application Requirements
With increased popularity and use, the Internet is seeing a large increase in the network requirements of applications. Users demand more application functionality, which in turn demands more of the supporting network infrastructure. Network research has focused on retrofitting the necessary mechanisms to the existing Internet infrastructure. An example of increased application requirements is in the security properties of unicast communication. While native IP provides little support for security measures such as authentication and secure channels, much work has addressed this concern via secure tunnels as with IP security (IPsec [66]).
Another example is multicasting, where communication occurs many-to-many. A news service could multicast a video signal over the Internet to a large number of subscribers without over-burdening the signal source. While attempts have been made to retrofit such distribution characteristics into the network fabric (i.e., routers), none have been successful. ISPs and customers are reluctant to enable IP multicast in their routers because of the potentially crippling effect it could have on their networks. In essence, multicast packets are received on a particular link and transmitted over multiple links. What was once a single copy of a packet is now multiple copies, each requiring its own share of processing resources in the network. Issues involving congestion control and scalability have further decreased the appeal of IP multicast. In addition, the difficulty and complexity of deploying a standard multicast routing protocol and the challenges related to address allocation and reliable multicast data delivery have decreased the momentum of this technology.
Quality of Service (QoS), where applications are driving the need for greater administrative control over network-level routing decisions made in the network fabric, is another example. While traditional packet delivery in IP networks has been best-effort, researchers have created mechanisms to prioritize certain traffic over others. In some cases, application traffic is categorized into equivalence classes and assigned a particular priority as is done in Differentiated Services [66] (DiffServ). In other cases, Integrated Services [66] (IntServ) maintain state in the network fabric, enabling each flow to have guaranteed network resources such as a minimum bandwidth requirement. While Integrated Services provide more stringent controls over a flow’s network access and ensure that network expectations are consistently met, these services have been plagued by scalability considerations in that routers must maintain per-flow state regarding user-level network connections. On the other hand, Differentiated Services provide a more scalable solution at the cost of decreased control over network expectations. Additionally, DiffServ techniques suffer from administrative burdens since it is difficult for multiple autonomous systems to agree on the appropriate policies for inter-domain packet prioritization and traffic shaping. As a result, Quality of Service technologies have also been hindered by scalability and administrative obstacles in placing functionality in the network fabric.
Active Networks
Active Networks technology [31, 51, 80, 83] promises to enable infrastructural programmability for implementing these functions throughout the network infrastructure. In this framework, data packets carry portions of code to be executed on behalf of the packet’s sender at strategic locations in the network fabric. Using such a mechanism, application-specific functionality can be injected into network routers. Routers not only forward packets to one another as in traditional IP networks, but also perform application-determined tasks such as, in the case of multicasting, forwarding data along multiple network interfaces.
Much like with IPv6 [38], these advances have been slow and expensive to deploy in the wide area as they require large-scale router replacement. The current IPv4 Internet framework is resistant to such replacement, regardless of how incrementally deployable the technology may be. Critics question whether such functionality belongs in the network at all, since it is an apparent violation of layering (application-level functions are performed in network-aware entities). Another concern is whether adding such higher-layer functionality would cause performance issues in routers, since traditionally they are highly tuned machines with the simple task of forwarding packets. In addition, Active Networks lead to problems involving resource allocation (which functions get access to which network resources), security (running arbitrary code in the network could cripple the network under attacks or programming errors), and accounting (if programmed packets of millions of users are running in millions of routers, how, if at all, do those users get charged for their CPU and network usage).
In essence, the current routing infrastructure has been slowly built over a number of years with very large investment and effort. It is unlikely that this infrastructure will be replaced with one that provides the required additional functionality, primarily due to the barrier in resource investment. As a result, much research, including this dissertation, operates under the assumption that such functionality must be provided by end hosts.
Peer-to-Peer Networks
While application requirements and lack of infrastructural support have led to the desire to distribute functionality among end hosts, the ability to distribute such functionality has been possible due to vast technological advancements in general purpose computers. Increased processing power, large memory capacity, and vast disk space have delivered powerful, yet largely idle, computing resources, the so-called “dark matter of the Internet,” at the fingertips of millions of users. Loosely referred to as Peer-to-Peer (P2P) networks, applications are increasingly leveraging these idle resources. Though slightly misused in this context, the term Peer-to-Peer typically refers to networks where the relationship between communicating hosts is that of equal peers. This is in direct contrast with traditional centralized networks (such as IBM’s subarea SNA [45]) and the client-server model as a whole. In some sense, IP routers form peering relationships. Hence, the Internet routing fabric could be considered an early form of a P2P network (another early example is that of IBM’s Advanced Peer-to-Peer Networking or APPN [20]). In general, however, P2P refers to networks where nodes act as both client and server.
Figure 1.1: A sample overlay network (labels: root, Internet, FDDI, Token Ring, Ethernet, PPP)
1.1 The Rise of Overlays
Overlay networks leverage P2P capabilities to provide rich application-specific computation and communication. In this manner, overlays overcome the deployment and administrative concerns of Active Networks. Overlays consist of peer nodes that self-organize into a distributed data structure based on application criteria. Strategically placed application-level agents serve as intermediaries for forwarding data from a source to a set of destinations, in effect forming an overlay on top of the underlying IP substrate. By doing so, the administrative burden is pushed out of the rigid network fabric controlled by network administrators to the more robust P2P realm controlled by application users.
Figure 1.1 shows an IP internetwork connecting a number of local area networks (LANs) and illustrates how nodes in these networks could organize to form a logical network overlaid atop IP. In this case, the overlay takes a tree form, containing a root that could, for example, be the source of a multicast video data stream. A number of overlays have been proposed to deliver multicast functionality [5, 11, 13, 25, 36, 37, 39, 68, 69].
Overcoming IP Limitations
Beyond providing multicast functionality, overlay networks are an attractive methodology for enabling a number of other communication semantics. For example, it is simple to leverage an overlay network to provide an anycast service where a packet is delivered to one of a set of nodes. Another example would be the use of indirect routing such as with RON [4] where traditional IP unicast traffic is tunneled through an overlay to achieve higher packet reliability (probability that a packet is delivered to its destination) and better performance characteristics (such as lower end-to-end delay). This is particularly advantageous given that dynamic IP routing in the Internet (effected by BGP) can lead to very long recovery times in response to network or router failure [44].
Application-Specific Functionality
Overlays can meet application requirements such as the ability to route, modify, or compute on data as it traverses the network. Through application layer implementation, overlay networks provide a cost-effective alternative to full router replacement to accomplish application-specific goals.
One particularly interesting instance of this type of functionality is with distributed hash tables (DHTs). In this paradigm, nodes are assigned portions of a hash address space. Nodes self-organize to create routing tables enabling messages to be delivered to the appropriate owning node. For example, digests of data object names or filenames would map to some owning node. This node would be responsible for maintaining metadata regarding the object’s real network location or status. To enable scalability, both the size of these routing tables and the number of hops traversed in routing a message from any source to its destination are sub-linear in the number of nodes. This routing functionality is termed key-based routing (KBR) and can be used to enable a wide variety of applications. A number of overlays provide such routing capability [61, 70, 76, 87].
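The mapping from object names to owning nodes described above can be illustrated with a minimal sketch. It assumes one common convention, in which node identifiers sit on a ring and a key is owned by the first node identifier at or after it, wrapping around at the end of the address space. The function names `digest` and `owner`, the choice of SHA-1, and the 32-bit address space are assumptions made for illustration only. The sketch uses a global list of node identifiers and therefore shows ownership, not the sub-linear routing itself.

```python
import hashlib
from bisect import bisect_left


def digest(name, bits=32):
    """Map an object name to a point in a 2**bits hash address space."""
    raw = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(raw, "big") % (1 << bits)


def owner(key, node_ids):
    """Return the id of the node owning `key`.

    Illustrative convention: the owner is the first node id >= key,
    wrapping around to the smallest id when no such node exists.
    """
    ring = sorted(node_ids)
    index = bisect_left(ring, key)
    return ring[index % len(ring)]


if __name__ == "__main__":
    nodes = [digest("node-%d" % i) for i in range(8)]
    key = digest("movie.mpg")
    print("key", key, "is owned by node", owner(key, nodes))
```

In a real DHT no node holds the full `node_ids` list; each node keeps a small routing table and forwards a message toward the owner over a sub-linear number of hops.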
Router Offload
By pushing functionality into the application layer and non-router based systems (the dark matter of the Internet), the routers themselves are relieved of the duties that could potentially be placed upon them. For example, by implementing multicast in an overlay, intermediate routers only need to forward unicast packets and need not maintain state regarding multicast sessions. This offload allows routers to decrease the set of infrastructural functions that they must perform and therefore to optimize for the more common unicast packet forwarding.
Control Domain
One last aspect of overlay networks that increases their appeal is the clear division of control domains. No one single entity owns all routers in the Internet. Portions of the network are owned by different Internet Service Providers (ISPs), yielding an array of trust conflicts when it comes to implementing consistent policies across domains. By using overlay networks, more complex functionality is pushed higher in the network stack, into peer nodes that are controlled by a single entity, whether it be PCs owned by a single company or software installed on users’ PCs that act in cooperation with each other (as in Napster [49] and Kazaa [40]). By keeping the router functionality simple, the cooperation among ISPs is also kept simple.
1.2 Challenges in Overlay Research
This section describes some of the challenges researchers face in building large-scale, adaptive, performance-tuned, decentralized, robust overlays of today. While these are but a subset of the difficult problems in overlay research, the outlined challenges correspond to key steps toward qualitatively advancing this field.
Group Membership
One interesting problem arising in large-scale network services is that of scalable membership awareness. Nodes periodically evaluate their positions with respect to other nodes. In order for the system to scale in terms of network and computation consumption, the number of these node-to-node overlay edge evaluations must be sub-linear in the number of global participants. A simple solution to this problem would be for each node to maintain a list of global participants. Unfortunately, it is typically difficult to have direct knowledge of participants ahead of time. Additionally, even if such information were available before run-time, the space required to store information about such a large number of nodes would limit scalability. The challenge is to provide a tool that is able to periodically deliver random views, or subsets, of nodes to each participant. Using these subsets, network services can perform application-specific evaluations and ultimately improve the overlay as needed.
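As a rough illustration of how a node might consume such a random view, the following sketch evaluates only the nodes in a small subset rather than every participant. It is a centralized stand-in: `random_view` assumes the full participant list is available in one place, which is exactly what the discussion above argues is impractical at scale, so it only models the interface a membership service would offer. The names `random_view`, `best_peer`, and the `cost` callback are hypothetical.

```python
import random


def random_view(participants, k, rng=random):
    """Return up to k distinct participants chosen uniformly at random.

    Centralized stand-in for a distributed membership service.
    """
    pool = list(participants)
    return rng.sample(pool, min(k, len(pool)))


def best_peer(me, view, cost):
    """Evaluate only the nodes in `view` and return the cheapest candidate.

    `cost(a, b)` is an application-specific edge metric (lower is better).
    Returns None when the view holds no node other than `me`.
    """
    candidates = [node for node in view if node != me]
    if not candidates:
        return None
    return min(candidates, key=lambda node: cost(me, node))


if __name__ == "__main__":
    view = random_view(range(1000), 10)
    print("view:", view)
    print("best peer for node 500:", best_peer(500, view, lambda a, b: abs(a - b)))
```

Each round costs a number of evaluations equal to the view size, independent of the total number of participants.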
Overlay Maintenance
Overlay nodes add and remove edges based on their utility. An edge’s utility could take into account network-level metrics such as latency or bandwidth characteristics, but usually also takes into account application-specific criteria. We refer to the process of adding and removing overlay edges as overlay maintenance. It is through these mechanisms that overlay algorithms are able to adapt to changing conditions in the underlying IP network. Unfortunately, overlay maintenance is challenged by global limitations unbeknownst to nodes making local decisions. For example, maintenance of an overlay tree must maintain the properties of the tree structure. Node edge additions and removals must not cause the tree to become disconnected or cyclic. Current tree maintenance algorithms rely on loop detection to overcome these scenarios, but this comes at the price of increased protocol complexity and decreased scalability since it typically requires node ancestry lists to be maintained.
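The ancestry-based loop detection mentioned above can be sketched as follows. This is a simplified, single-process illustration that assumes the tree is stored as a `parent` dictionary with a global view and is already a valid tree; in a distributed setting each node would instead carry its own ancestry list, which is the source of the scalability cost. The names `ancestors` and `try_reparent` are hypothetical, and this is not the TreeMaint protocol.

```python
def ancestors(parent, node):
    """Return the ancestors of `node`, nearest first, by following parent links.

    `parent` maps each node to its parent; the root maps to None.
    """
    chain = []
    while parent.get(node) is not None:
        node = parent[node]
        chain.append(node)
    return chain


def try_reparent(parent, node, new_parent):
    """Move `node` under `new_parent` unless doing so would create a cycle.

    A cycle would arise if `new_parent` is `node` itself or lies in the
    subtree rooted at `node`, i.e. `node` is an ancestor of `new_parent`.
    """
    if new_parent == node or node in ancestors(parent, new_parent):
        return False
    parent[node] = new_parent
    return True


if __name__ == "__main__":
    tree = {"r": None, "a": "r", "b": "a", "c": "b"}
    print(try_reparent(tree, "a", "c"))  # rejected: c is a descendant of a
    print(try_reparent(tree, "c", "r"))  # accepted: c moves directly under r
```

The check walks the full ancestry of the prospective parent, so its cost grows with tree height.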
Long and Difficult Development Cycle
Current overlay research follows a cycle consisting of four phases as depicted in Figure 1.2, each of which suffers from a number of challenges. First, a researcher designs an algorithm to meet specific performance goals, optimizing for certain underlying substrate metrics and providing application behavior such as O(lg n) routing hops in DHTs. From this description, one or more implementations are created to evaluate the performance of the algorithm under certain conditions. For example, many researchers create a hand-crafted simulator capable of evaluating performance under large numbers of nodes and an extensive live implementation for evaluation in real settings. Such implementations are tedious and difficult, both due to the sheer size of the software components needed to build scalable implementations and the complexity of such functionality.
Figure 1.2: Current overlay research cycle (phases: Algorithm, Implementation, Experimentation, Evaluation)
Using an algorithm’s implementation, researchers use experimentation to gather run-time performance data. Usually, this includes a combination of simulation such as with the network simulator, ns, [84] and small-scale live Internet runs (e.g. PlanetLab [57]). Unfortunately, simulation is unable to completely capture the complex behavior of real applications and networks. Live experiments suffer from scale limitations and the inability to extract vital network-level performance data. As a result, experimentation cannot fully generate the necessary information to gain complete understanding of overlays.
The evaluation phase of the cycle processes the information generated through experimentation. Researchers process performance data with hand-crafted tools and subsequently modify their implementation in light of code bugs or sub-optimal performance. Since different researchers employ disparate implementation techniques, the evaluation of competing overlay algorithms tends to reflect differences in implementation methodologies as opposed to algorithmic differences. Further, it is difficult to complete the cycle appropriately since the focus is on implementation rather than the algorithm itself.
Multi-Metric Application Level Multicast
In application level multicast [5, 11, 13, 25, 36, 37, 39, 68, 69], nodes receive multicast data from peers in the overlay and forward it on to other peers. Performance requirements of the overlay depend on the multicast application for which the overlay is created. For example, an online chat between multiple end-users would require overlay edges that are low in latency, while software distribution would require large bandwidth and could tolerate much higher delays. In contrast, a video broadcast is very sensitive to both end-to-end delays and low bandwidth edges. Therefore, while some applications require overlays optimized for a single network-level performance metric, considerable benefit can be gained from overlays optimizing for a plurality of metrics. Current multicast overlay algorithms are one-dimensional, optimizing for a single performance metric, and are insufficient for a class of large-scale applications.
Data Aggregation
Data aggregation applications require an overlay of bounded height, such as O(lg n). At each level in the overlay, data received from lower levels is aggregated and forwarded toward a root. This relieves higher nodes from receiving information directly from a large number of nodes. Instead, higher nodes receive summaries of information from a small number of participants, their direct descendants in the overlay. By bounding overlay height, the number of times that data is aggregated is kept low. Applications that benefit from data aggregation include sensor networks and reliable multicast protocols. In reliable multicast, repair requests can be forwarded up the overlay while being merged with requests from other nodes until a node in the path is capable of servicing the request directly. In this manner, nodes higher in the overlay are relieved of servicing all requests from every node. Applications that collect average statistics, such as resource consumption, over a large number of participants could be arranged in a height-bounded overlay so that reports may be summarized at each level.
1.3 Hypothesis
The hypothesis of this dissertation is aimed at overcoming many challenges currently facing overlay research by creating a collection of powerful tools for overlay construction. First, I argue that it is possible to create a group membership service that periodically delivers uniformly random subsets of global overlay participants. Second, it is also possible to create a robust, distributed protocol for efficiently building and maintaining overlays in a scalable, yet adaptive, manner. Finally, I further conjecture that it is feasible to create a methodology for describing and implementing overlay algorithms that is both descriptive and powerful enough to ease the development of a wide variety of overlays.
To this end, this dissertation documents my experiences in building three overlay services, each addressing a portion of my hypothesis. First, I describe how RanSub [41] delivers random subsets to each overlay participant by leveraging a Compact operation. I evaluate RanSub using a live implementation and analytically prove its uniformity. Second, I present TreeMaint [68], a distributed, highly parallel protocol that enables overlay tree transformations, allowing the tree to conform to underlying network conditions. I show how this tool can be used in a variety of settings and prove its correctness in maintaining loop-free trees. Finally, I introduce MACEDON [67], a methodology for qualitatively improving the overlay research process. It provides a concise language for describing overlays, from which code is generated that can be used to evaluate, experiment with, and deploy a variety of overlays. I use MACEDON to implement and debug performance-optimized versions of many prominent overlay algorithms. In addition, this dissertation describes, as proofs-of-concept, how these services can be used together to create novel algorithms that optimize for multiple network- and application-level metrics simultaneously. Specifically, I detail two novel overlay algorithms, Adaptive Multi-Metric Overlay (AMMO) [68] and Hyper-Chromatic Overlay (HCO), each addressing specific application-driven requirements.
1.3.1 Design Principles
While overlays are free to make design choices that greatly influence performance, they share key design principles to ensure scalability.
Sub-linear Node State
Overlay participants have limited memory, processing capacity, and network bandwidth. Hence, to ensure scalability, they must not maintain state regarding all other members of the overlay. As a result, a participant must be able to limit the number of nodes that it will peer with. This limit should be based on current processing and local network load and memory consumption and need not be the same for all participants in the overlay. For example, a mainframe node should be allowed to peer with many more overlay participants than an old PC.
Sub-linear Message Overhead
In a similar manner, nodes have limited network bandwidth. As a result, the control overhead of overlay algorithms and services must not exceed network restrictions. This is particularly a problem with network probing. Most adaptive overlay network algorithms make use of probing to determine which nodes make better peers than others. It is clearly not scalable to allow each node to probe all other nodes in a given epoch of time since doing so would undoubtedly strain the network. Thus, to ensure probing scalability, each node should probe a sub-linear number of nodes in any given epoch. In our experience, probe set sizes of at most O(lg n) are scalable and adequate to ensure adaptivity.
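To make the difference in probing load concrete, the sketch below picks a probe set of about lg n candidates for one node and compares the total number of probes per epoch with the all-pairs case. The function names, the use of ceil(log2 n) as the size, and the candidate list are assumptions made for illustration; in practice the candidates would come from whatever membership information the node has learned.

```python
import math
import random


def probe_set_size(n: int) -> int:
    """Assumed sizing rule: about lg(n) probes per node per epoch."""
    return max(1, math.ceil(math.log2(n)))


def choose_probe_set(me, candidates, n, rng):
    """Pick a random probe set for `me` from the candidates it knows about."""
    others = [c for c in candidates if c != me]
    k = min(probe_set_size(n), len(others))
    return rng.sample(others, k)


n = 1024
nodes = list(range(n))                 # stand-in for learned candidates
rng = random.Random(7)
probes = choose_probe_set(0, nodes, n, rng)

per_epoch_sublinear = n * probe_set_size(n)   # every node probes lg n peers
per_epoch_all_pairs = n * (n - 1)             # every node probes everyone
```

For n = 1024 the per-epoch totals differ by roughly two orders of magnitude, which is the gap the sub-linear probing principle is meant to capture.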
1.3.2 RanSub
The RanSub utility periodically delivers uniformly random subsets of global group membership to overlay participants in an overlay tree using a scalable synchronization protocol called Collect/Distribute. In the Collect phase, nodes send a size-bounded random subset of tree descendants to their parents (this is initiated by the leaves with subsets of size 1). Each node, upon receiving collect sets from its children, in turn creates a collect set respecting uniformity that it propagates to its parent. This is done via the Compact operation, which randomly selects nodes from size-bounded input sets based on the node population each set represents and generates a size-bounded output set. Once the root of the tree receives collect sets from all of its children, the Distribute phase begins, in which each node receives a random subset. Each parent constructs a distribute set for each child by compacting the distribute set received from its parent with the collect sets previously received from all other children. RanSub supports different semantics for different applications. The default is uniform random subsets of all participants, but uniformity over non-descendant nodes is also supported. RanSub is used by TreeMaint (described in Chapter 3) to provide loop-free tree maintenance. It is also used by Bullet [42] to create a mesh that delivers high bandwidth data to each node. More information on RanSub is provided in Chapter 2.
1.3.3 TreeMaint
The TreeMaint algorithm allows for the scalable and adaptive maintenance of overlay trees. By establishing periodic global orderings of overlay participants, TreeMaint provides a loop-free technique for making overlay transformations. The key insight is that nodes in the tree respect the global ordering: a node may only move under nodes that precede it in the ordering. This is proven to guarantee that transformations will not create cycles or disconnections. TreeMaint makes use of RanSub to deliver uniformly random subsets respecting this global ordering to nodes in the overlay. The global ordering changes periodically so nodes may consider all overlay participants over time. TreeMaint has been used by SARO [41] to create delay-constrained, cost¹-optimized trees and by the more general AMMO [68], described further in Chapter 5, to build trees optimized for multiple performance metrics. Chapter 3 provides more information on TreeMaint.
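The ordering rule can be illustrated with a small sketch. It assumes one fixed total ordering in which every node's current parent already precedes it, and it applies randomly chosen moves that obey the rule. The tree, the ordering, and the helper names are hypothetical; this is not the TreeMaint protocol itself, only a check of the invariant the rule preserves within a single ordering period.

```python
import random


def may_move(node, new_parent, rank):
    """A node may move only under a node that precedes it in the ordering."""
    return rank[new_parent] < rank[node]


def is_tree(parent, root):
    """True if every node reaches `root` by following parent pointers."""
    for start in parent:
        seen, cur = set(), start
        while cur != root:
            if cur in seen:           # revisited a node: there is a cycle
                return False
            seen.add(cur)
            cur = parent[cur]
    return True


# Assumed example tree (child -> parent) and an assumed total ordering in
# which each parent precedes its children.
parent = {"A": None, "B": "A", "C": "A", "D": "B", "E": "B", "F": "C", "G": "C"}
order = ["A", "C", "G", "F", "B", "E", "D"]
rank = {name: i for i, name in enumerate(order)}

# Moving B under its own descendant D would create a cycle; the rule rejects it.
rejected = not may_move("B", "D", rank)

rng = random.Random(1)
for _ in range(1000):
    node = rng.choice(order[1:])      # the root never moves
    target = rng.choice(order)
    if may_move(node, target, rank):
        parent[node] = target         # the node's subtree moves with it

still_tree = is_tree(parent, "A")
```

Because every accepted move keeps each parent ahead of its child in the ordering, following parent pointers always moves strictly toward the front of the ordering, so no cycle can form.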
1.3.4 MACEDON
MACEDON is geared toward simplifying and streamlining the overlay research development cycle as shown in Figure 1.3. It consists of a simple domain-specific language that specifies the high-level behavior of an overlay. From this specification, functional code is generated to interface with an operational framework providing support for common overlay mechanisms. While initially focusing on overlay algorithms and applications, we expect MACEDON to be extensible to a wide variety of other distributed services. MACEDON has been used to develop performance-tuned versions of many prominent overlay algorithms including Nice [5], SplitStream [11], Overcast [39], RanSub [41], Bullet [42], AMMO [68], Pastry [70], Scribe [69], Chord [76], and HCO, described further in Chapter 6. These MACEDON-encoded specifications typically consist of 200-600 lines, ultimately decreasing development and debugging effort. Further, because the generated C++ code shares the same baseline implementation for probing, joining, failure detection, etc., relatively fair comparisons can be carried out between competing systems, isolating differences in algorithms rather than implementation artifacts. Further information on MACEDON is provided in Chapter 4.

[Figure 1.3: Streamlined, MACEDON-enabled overlay research cycle. Labels: overlay.mac (< 600 lines), MACEDON code generator, Implementation, Simulation, Network-aware emulation, Live deployment, Experimentation, MACEDON post-processing, Evaluation.]

¹ Cost is a weight assigned to each overlay edge using some specified policy. This is described
1.3.5 Using the Tools
Aside from having created performance-tuned versions of a number of prominent overlay algorithms, this dissertation presents two novel overlay algorithms geared toward addressing some of the shortcomings in today’s overlays. The first of these algorithms, Adaptive Multi-Metric Overlay (AMMO), targets the problem of application-level multicast optimized for multiple network-level performance metrics. The second algorithm, Hyper-Chromatic Overlay (HCO), provides a height-bounded overlay for data aggregation applications.
Design Choices
This section describes the basic design choices used in the creation of AMMO and HCO.
• Tree-based Overlays
While some overlay network algorithms make use of connected meshes among participants, HCO and AMMO currently implement tree-based algorithms. In a pure mesh, nodes have multiple paths to other nodes, resulting in what we loosely refer to as loops. Because of this, it is more difficult to determine whether multiple sections of the overlay are disconnected or whether a specific change in the overlay would cause a disconnection. Additionally, application semantics are complicated by mesh overlays. For example, implementing application-level multicast over a tree involves receiving data from a parent overlay edge and forwarding it over children edges. In a mesh, additional functionality must be created to determine whether a message should be forwarded and over which subset of overlay edges it should be transmitted. As a result of this added complexity, we have chosen the overlay tree as the basis for our approach.
• Limited Synchronization
Another important premise of our novel algorithms is limiting the scope of the mechanisms used to transform the overlay. As the tree adapts to changes in the underlying network infrastructure, the overlay algorithm should make transformations to better conform to the new underlying network. Changes should involve the coordination of a small number of nodes in a small number of steps. The number and sizes of messages necessary for this coordination should also be small. As a result, simultaneous transformations can occur in multiple sections of the overlay, leading to quicker convergence in response to underlying network changes.
• Priority-Based Decisions
The overlay algorithm should prioritize the performance metrics it optimizes. There are three different mechanisms that an overlay algorithm uses to tune for a performance metric: constraining (a metric is not allowed to increase or decrease beyond a certain point), bounding (algorithmically bounding a performance metric, such as overlay height, based on the number of participants), and marginal optimization (tuning for a performance metric under given constraints and bounds). As the algorithm tunes for a metric, it should operate under restrictions placed on it by more important metrics. For example, if latency is the most important metric and it is constrained, and the algorithm wishes to improve jitter, it must not violate the latency constraint to do so. This establishes a priority list of metrics to which the algorithm should adhere.
Hyper-Chromatic Overlay (HCO)
HCO uses a red-black tree based distributed algorithm aimed at creating an overlay of O(lg n) height. It does this by decoupling the tree re-balancing process from insertions and deletions of nodes in the tree. In this way, HCO avoids global locking, though nodes make use of local locks to update their state. What results is an overlay that can marginally optimize for a performance characteristic (such as latency) while maintaining the desirable O(lg n) overlay height bound. HCO is implemented within the MACEDON framework further described in Chapter 4. Chapter 6 discusses the HCO algorithm and protocol in great detail.
Adaptive Multi-Metric Overlay (AMMO)
AMMO allows for the specification of performance constraints that indicate the minimum tolerable performance for correct operation. It also promotes the specification of a metric function that describes the relative priorities of non-constrained performance metrics. The AMMO algorithm creates an overlay tree that achieves the specified constraints (for example, a delay target that indicates the maximum tolerated latency between the root of the tree and all other nodes in the tree). Within these constraints, the AMMO algorithm is able to optimize edges, respecting the specified metric function. We have implemented AMMO in MACEDON with versions that constrain delay and optimize for cost or bandwidth. Chapter 5 describes the basic AMMO protocol as needed for application level multicast.
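The constrain-then-optimize idea can be sketched as a parent selection step: discard candidates that would violate a delay target, then pick the cheapest remaining edge. This is only an illustration of the priority principle under assumed data; the `Candidate` fields, the `choose_parent` helper, and the numbers are hypothetical and do not reproduce the AMMO protocol of Chapter 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical view of a potential parent, as seen by one node."""
    name: str
    delay_to_root_ms: float   # candidate's current latency from the root
    edge_delay_ms: float      # measured latency of the edge to the candidate
    edge_cost: float          # policy-assigned weight of that edge


def choose_parent(candidates, delay_target_ms):
    """Apply the delay constraint first, then minimize cost among survivors."""
    feasible = [
        c for c in candidates
        if c.delay_to_root_ms + c.edge_delay_ms <= delay_target_ms
    ]
    if not feasible:
        return None           # no candidate satisfies the constraint
    return min(feasible, key=lambda c: c.edge_cost)


candidates = [
    Candidate("p1", 40.0, 30.0, 5.0),   # 70 ms total, cost 5
    Candidate("p2", 90.0, 30.0, 1.0),   # 120 ms total, cheapest but too slow
    Candidate("p3", 20.0, 60.0, 3.0),   # 80 ms total, cost 3
]
best = choose_parent(candidates, delay_target_ms=100.0)
```

Here the cheapest edge is rejected because it would break the higher-priority delay constraint, so the selection falls to the cheapest candidate that still meets the target.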
Chapter 2
RanSub
Many distributed services must track the characteristics of a subset of their peers. This information is used for failure detection, routing, application-layer multicast, resource discovery, or update propagation. Ideally, the size of this subset would equal the number of all global participants to provide each node with the highest quality, most up-to-date information. Unfortunately, this approach breaks down beyond a few tens of nodes across the wide-area, encountering scalability limitations both in terms of per-node state and network overhead. Recent work suggests building scalable distributed systems on top of a location infrastructure where each node can quickly (in O(lg n) steps) locate any remote node while maintaining only O(lg n) local state [61, 70, 76, 87]. This approach holds promise for scaling to distributed systems consisting of millions of participating nodes.
While existing techniques track the characteristics of a fixed set of O(lg n) nodes, a hypothesis of this work is that there are significant additional benefits from periodically distributing a different random subset of global participants to each node. By ensuring that the received subsets are uniformly representative of the entire participant set and are frequently refreshed, nodes will eventually receive information regarding a large fraction of participants. Consider the applicability of such a mechanism to the following application classes:
• Adaptive overlays: A number of efforts build overlays that adapt to dynamically changing network conditions by probing peers. For instance, both Narada [36] and RON [4] maintain global group membership and periodically probe all participants to determine appropriate peering arrangements, limiting overall system scalability. The presence of a mechanism to deliver random subsets to each node would allow overlay participants to learn of remote nodes suitable for peering. In doing so, the nodes periodically learn enough new information to adapt to dynamically changing network conditions and changes in overlay membership.
• Parallel downloads: One recent effort [8] suggests “perpendicular” downloads of popular content from a set of peers to receive erasure-coded content. Here, nodes receive data not only from the source, but also from peers that might have already received the data from the source or some other peer. One unresolved challenge to this approach is locating peers with both available bandwidth and received data item diversity. Random subsets would provide a convenient mechanism for locating such peers. Related to this approach, a number of efforts into reliable multicast [7] propose the use of peers in the multicast tree for data repairs (to avoid scalability issues at the root). Random subsets would likewise provide a convenient mechanism for locating nearby peers that do not share the same bottleneck link (and hence have a good chance of containing lost data).
• Peer to peer systems: For locality, peer to peer systems [61, 70, 76, 87] often desire multiple choices at each hop between source and destination. A changing, random subset of participating nodes would enable nodes to insert entries into their routing table with good locality properties and adapt to dynamically changing network conditions.
• Content distribution networks: In CDNs, objects are stored at multiple sites spread across the network. Important challenges from the client perspective include resource discovery (determining which replicas store which objects) and request routing (sending the request to the replica likely to deliver the best performance given current load levels and network conditions). Random subsets would allow CDNs to track the state of a subset of global replicas. A number of earlier studies [18] indicate that making decisions based on a random subset of global information often performs comparably to maintaining global system state.
• Joining distributed systems: One challenge in large-scale distributed services, such as peer-to-peer systems, is bootstrapping the join process. Alternative approaches include falling back to network-layer multicast or anycast, which are typically not available, or joining at well-known sites, which creates bottlenecks. Given random subsets, a joining node is able to contact any existing system member to obtain a random subset of all participants. Based on the size of this subset, the node can locate a peer or parent that maintains network locality.
• Epidemic algorithms: A classic application of random subsets is epidemic algorithms [19, 77], where nodes transmit updates to random neighbors. With high probability, n nodes performing “anti-entropy” will converge to see the same set of updates in O(lg n) communication steps. Random subsets provide a convenient mechanism for locating neighbors and perhaps biasing communication to nearby sites.
Thus, we view a scalable mechanism for delivering uniformly random subsets of global participants as fundamental to a broad range of important network services. This chapter presents the design and implementation of RanSub, one such protocol. We have completed an implementation of RanSub and conducted a number of large-scale experiments, typically consisting of 1000 nodes in a 20,000-node network. RanSub utilizes an overlay tree to periodically distribute random subsets to overlay participants. RanSub could leverage any number of existing techniques [5, 13, 25, 30, 36, 39, 56, 62, 69, 88] to provide this infrastructure. Though we note that RanSub’s correctness is independent of the underlying tree, we used a randomly constructed tree as the underlying basis for our live evaluation of RanSub.
The remainder of this chapter is organized as follows. Section 2.1 details the goals of RanSub for distributing random subsets. Section 2.2 gives an outline of the RanSub utility, while Section 2.3 provides the details of RanSub’s Collect/Distribute protocol. RanSub uniformity is shown analytically in Section 2.4 as we discuss how RanSub meets its stated goals. We provide a comparison of our live RanSub implementation versus the optimal uniformity case in Section 2.5. Section 2.6 provides concluding remarks related to RanSub.
2.1 Desirable Properties
Before presenting the details of our design and implementation, we discuss desirable properties of a random subset service. Ideally, the system will offer:
1. Customization: Applications should determine the size of random subsets that are delivered. This size will depend on application-specific actions performed by nodes upon receiving the random subset. For example, a parallel download application may wish to initiate data transfer with only a small constant number of peers while a P2P system may wish to probe
2. Scalability: The system should support large-scale services without posing a burden on the underlying network in terms of control overhead. Additionally, correct system operation should not depend on system size, i.e., the application should be able to request any random subset size. Overall, scalability implies that required per-node state and network communication overhead should grow sub-linearly with the number of participants.
3. Uniform, changing subsets: We envision a tool that is repeatedly invoked to retrieve “snapshots” of global participants at different points in time. Each snapshot, or random subset, should consist of nodes uniformly distributed across all global participants, such that each remote node appears in a delivered subset with equal probability. If desired by the application, each invocation of the tool should return to each participant a different random subset independently chosen over all participants. Similarly, across invocations, each participant should receive probabilistically different subsets with no correlation across invocations. In this way, over time, each node can be exposed to a wide variety of global participants. Certain applications may desire non-uniform distribution that, for example, favors nearby nodes; this functionality can be layered on top of the baseline system.
4. Frequent updates: The system should offer frequent distribution of random subsets. This is critical, for example, in supporting network services that use the system to adapt to changing network conditions. A parallel download application [41], as another example, would need to react to changes in available peer data, particularly under cases of high node churn¹ where nodes could come and go frequently.

¹ Node churn relates to the rate at which nodes enter and leave the overlay. The term is used
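[Figure 2.1: RanSub operation. An overlay tree with nodes A, B, and C; RanSub Collect messages (collect sets) travel up the tree and RanSub Distribute messages (distribute sets) travel down it. Example shown: $DS_A = \{B, C\}$.]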
5. Resilience to failures: The system will preserve its properties even in the face of failures. Failed nodes should not appear in future random subsets within a short and bounded amount of time.
6. Resilience to security attacks: Even when under attack by malicious users, the system should maintain its properties (uniform distribution, etc.), and degrade its performance gracefully when it is unable to defend against a large-scale attack.
2.2 Overview
Given the goals described above, we now describe RanSub, our scalable approach to distributing random subsets containing nodes that are uniformly spread across all participants. For the purposes of this discussion, we assume the presence of some scalable mechanism for efficiently building and maintaining an overlay tree. A number of such techniques exist [5, 36, 39, 68, 69]. Figure 2.1 summarizes RanSub’s operation. RanSub distributes random subsets through Collect messages that propagate up the tree, leaving state at each node, and Distribute messages traveling down the tree using soft state from the previous collect round.
RanSub distributes a subset of participants to each node once per configurable epoch. An epoch consists of two phases: one distribute phase in which data is transmitted from the root of an overlay tree to all participants (data is distributed down the tree) and a second collect phase where each participant successively propagates to its parent a random subset called a collect set (CS) containing nodes in the subtree it roots (data is aggregated up the tree). During the distribute phase, each node sends to its children a uniformly random subset called a distribute set (DS) of remote nodes. The contents of the distribute set are constructed using collect sets gathered during the previous collect phase.
When a Distribute message reaches a leaf in the RanSub tree, it triggers the beginning of the next collect phase where each node sends its parent a subset of its descendants (the collect set) along with other metadata. This process continues until the root of the tree is reached. The collect phase is complete once the root has received collect sets from all of its children. The root signals the beginning of a new epoch by distributing a new distribute set to each of its children, at which point the entire process begins again. The length of an epoch is configurable based on the requirements of applications running on top of RanSub. The lower bound on the length of an epoch is determined by the worst-case root-to-leaf and leaf-to-root transmission times of the overlay.
Collect message                                              | Distribute message
Field        | Contents                                      | Field                                    | Contents
-------------+-----------------------------------------------+------------------------------------------+----------------------------------------------
Sequence #   | Sequence # of current epoch                   | Sequence #                               | Sequence # of current epoch
Collect set  | Uniformly random subset of nodes in sender’s  | Distribute set                           | Uniformly random subset of all overlay
             | subtree                                       |                                          | participants
Descendants  | Estimate of # of nodes in sender’s subtree    | Participants                             | Estimate of total number of nodes in the
             |                                               |                                          | overlay
             |                                               | Reshuffle flag (only for ordered RanSub) | Determines if children should be reshuffled
             |                                               |                                          | so that a new total ordering is created

Table 2.1: Contents of Collect and Distribute messages.
2.3 Collect/Distribute
Each node participating in RanSub maintains the following state: address of its parent in the overlay, a list of its children, and the sequence number of the current epoch. In addition, it maintains the following soft state: a collect set and number of subtree descendants for each of its children, a distribute set, and the total number of overlay participants. Below, we describe how RanSub uses this information and maintains it in a decentralized manner.
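The per-node state and the message fields of Table 2.1 could be represented roughly as follows. This is a sketch under assumptions: the class and field names are invented for illustration, node addresses are plain strings, and the `record_collect` helper is a hypothetical convenience rather than part of the RanSub implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CollectMessage:
    sequence: int                 # sequence number of the current epoch
    collect_set: list[str]        # random subset of the sender's subtree
    descendants: int              # estimated size of the sender's subtree


@dataclass
class DistributeMessage:
    sequence: int                 # sequence number of the current epoch
    distribute_set: list[str]     # random subset of overlay participants
    participants: int             # estimated total overlay size
    reshuffle: bool = False       # only meaningful for RanSub-ordered


@dataclass
class NodeState:
    # State listed in the text: parent, children, current epoch.
    parent: Optional[str]
    children: list[str] = field(default_factory=list)
    epoch: int = 0
    # Soft state, refreshed every epoch.
    child_collect_sets: dict[str, list[str]] = field(default_factory=dict)
    child_descendants: dict[str, int] = field(default_factory=dict)
    distribute_set: list[str] = field(default_factory=list)
    participants: int = 1

    def record_collect(self, child: str, msg: CollectMessage) -> bool:
        """Store a child's report; return True once all children reported."""
        self.child_collect_sets[child] = msg.collect_set
        self.child_descendants[child] = msg.descendants
        return all(c in self.child_collect_sets for c in self.children)

    def subtree_size(self) -> int:
        """This node plus the descendants reported by its children."""
        return 1 + sum(self.child_descendants.values())


state = NodeState(parent="A", children=["D", "E"], epoch=3)
first_done = state.record_collect("D", CollectMessage(3, ["D"], 1))
second_done = state.record_collect("E", CollectMessage(3, ["E"], 1))
```

Keeping the per-child collect sets and descendant counts separate from the hard state mirrors the distinction the text draws between state and soft state.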
2.3.1 Collect Phase
Overall, the goal of the Collect message is for each node to: i) compose the collect sets for constructing the distribute set during the subsequent distribute phase, and ii) determine the total number of participants in its local subtree.
The collect phase begins at the leaves of the tree in response to the reception of a Distribute message. Table 2.1 describes the fields in Collect messages (in the left half of the table). The Collect message has the same sequence number as the triggering Distribute message. At the leaves, the number of descendants is set to one and the collect set contains only the leaf node itself. Once a parent receives all Collect messages from its children, it further propagates a Collect message to its own parent. The nodes in the collect set are selected randomly from the collect sets received from its children to form a subset of configurable size (O(lg n) by default). Each node stores this collect set to aid in the construction of distribute sets in the subsequent distribute phase.
One key challenge is to ensure that membership in the collect set propagated by a node to its parent is both random and uniformly representative of all members of the subtree rooted at the node. To achieve this, RanSub makes use of a Compact operation, which takes as input multiple subsets and the total population represented by each subset. Compact outputs a new subset with two properties: i) group membership randomly chosen over the input subsets and ii) a target size constraint. This is achieved by building the output set incrementally. We first randomly choose an input subset based on the population that it represents. We then randomly choose a member of this subset (not already selected) and add it to the output set. RanSub does this by splitting the population weight of a subset uniformly among subset members. For example, a subset A with two nodes and a population size of 10 will assign a weight of 5 to each member. Likewise, a subset B with three nodes and a population size of 18 will assign a weight of 6 to each member. Assuming a subset size of three, Compact will choose the first element of A with probability 5/(10 + 18) = 5/28. Assuming this element was chosen, it would then choose the second element of A with probability 5/(28 − 5) = 5/23 since the first element cannot be chosen again.² Note that Compact is able to properly weigh each subset because, as part of collect/distribute, each node learns the number of nodes in the subtree of each of its children.

² Note that this slightly affects RanSub’s uniformity, though in practice, this effect is negligible.
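The weighting scheme just described can be sketched as weighted sampling without replacement. The `compact` function below is an illustrative reading of the text, not the RanSub source code: its signature, the representation of inputs as (members, population) pairs, and the assumption that input subsets are disjoint are all choices made for this sketch.

```python
import random


def compact(inputs, target_size, rng):
    """Sketch of Compact.

    `inputs` is a list of (members, population) pairs, assumed disjoint.
    Returns (output_subset, total_population_represented).
    """
    # Split each subset's population weight uniformly among its members.
    weights = {}
    for members, population in inputs:
        for member in members:
            weights[member] = population / len(members)
    total_population = sum(population for _, population in inputs)

    # Build the output incrementally; a chosen member is removed so that it
    # cannot be selected again.
    chosen = []
    while weights and len(chosen) < target_size:
        names = list(weights)
        pick = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        chosen.append(pick)
        del weights[pick]
    return chosen, total_population


rng = random.Random(42)

# The example from the text: A has 2 members for a population of 10
# (weight 5 each); B has 3 members for a population of 18 (weight 6 each).
A = (["a1", "a2"], 10)
B = (["b1", "b2", "b3"], 18)
subset, population = compact([A, B], 3, rng)

# Collect phase at node B of a small tree: leaves D and E each report
# themselves with population 1, and B adds itself before reporting upward.
cs_B, pop_B = compact([(["D"], 1), (["E"], 1), (["B"], 1)], 3, rng)
```

With these inputs the first pick lands on `a1` with probability 5/28, and if `a1` is chosen the remaining weight is 23, giving `a2` probability 5/23 on the next pick, as in the worked example.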
2.3.2 Distribute Phase
A new epoch can begin once the root has received a Collect message from all of its children for the previous epoch. The actual length of an epoch is determined by individual application requirements. The right half of Table 2.1 describes the fields contained in the Distribute message.
A parent constructs distribute sets for each child in the following manner. Recall that each node stores the collect set received from each child during the previous collect phase. Thus, for each of $k$ children, a particular node maintains $CS_1, CS_2, \ldots, CS_k$. Also recall that each collect set, $CS_i$, consists of nodes selected uniformly randomly from the subtree rooted at child $i$. A parent node $A$ constructs a distribute set for each child from this information saved during the preceding collect phase. This information includes the collect set for each child, the node $A$ itself, as well as $DS_A$, $A$’s own distribute set.
RanSub offers three choices regarding the contents of distribute sets:
• RanSub-all: This is suitable when the application requires uniformly random subsets of all nodes in the system. There are two flavors of the All option. All-identical delivers the same distribute set to all nodes in the overlay. This distribute set, $DS_{root}$, is created by the root using the Compact operation:

$$DS_{root} = \mathrm{Compact}(CS_1, \ldots, CS_j, \{root\}) \tag{2.1}$$

where $CS_i$ represents the subtree rooted at child $i$ of the root (numbered $1, \ldots, j$). The All-non-identical option delivers different distribute sets to each node. In this case, node $Z$ receives a Distribute message from its parent $P$ containing $DS'_P$. $Z$ constructs its random subset $DS_Z$ using $DS'_P$ and the collect sets stored from its children, $CS_1, \ldots, CS_k$, in the following manner:

$$DS_Z = \mathrm{Compact}(CS_1, \ldots, CS_k, DS'_P, \{P\}) \tag{2.2}$$

$Z$ then forwards the following to each child $x$:

$$DS'_Z = \mathrm{Compact}(CS_1, \ldots, CS_{x-1}, CS_{x+1}, \ldots, CS_k, DS'_P, \{P\}) \tag{2.3}$$

Note that the root’s $DS'_P$ is $\{\}$ since it has no parent.
• RanSub-nondescendants: In this case, each node should receive a random subset consisting of all nodes except its descendants. This might be appropriate for an application-layer multicast structure where participants are probing for better bandwidth and latency to the root of the tree. In this case, considering a node’s descendants could introduce a cycle in the overlay tree. For each child $x$ (numbered $1, \ldots, k$), the parent node, $A$, constructs $DS_x$ in the following manner:

$$DS_x = \mathrm{Compact}(CS_1, \ldots, CS_{x-1}, CS_{x+1}, \ldots, CS_k, DS_A, \{A\}) \tag{2.4}$$
• RanSub-ordered: This type of distribute set calculation imposes a total ordering among participating nodes. A node receives a distribute set containing random nodes that come before it in the total ordering. For each child $x$ (numbered $1, \ldots, k$), the parent node $A$ constructs $DS_x$ in the following manner:
We use RanSub-ordered in TreeMaint (described in Chapter 3) to ensure that simultaneous transformations to the tree overlay do not introduce loops. On the other hand, Bullet [42] uses RanSub-nondescendants to deliver knowledge of potential peers that provide high-bandwidth data streams.
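The RanSub-nondescendants rule of Equation 2.4 can be traced on a small tree with the following sketch. To keep the result easy to inspect, Compact is replaced here by a plain set union, so no size bound is applied; the seven-node tree, the dictionaries, and the function names are assumptions made for illustration.

```python
# Assumed example tree: A is the root; B and C are its children.
children = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F", "G"],
    "D": [], "E": [], "F": [], "G": [],
}


def collect(node, cs):
    """Collect phase without compaction: a node plus its children's sets."""
    cs[node] = {node}
    for child in children[node]:
        collect(child, cs)
        cs[node] |= cs[child]


def distribute(node, ds_node, cs, ds):
    """Distribute phase following Equation 2.4, with union instead of Compact."""
    ds[node] = ds_node
    for x in children[node]:
        ds_x = set(ds_node) | {node}          # DS_A and {A}
        for sibling in children[node]:
            if sibling != x:                  # every CS_i except CS_x
                ds_x |= cs[sibling]
        distribute(x, ds_x, cs, ds)


cs, ds = {}, {}
collect("A", cs)
distribute("A", set(), cs, ds)                # the root has no parent: DS = {}
```

Without compaction, each node's distribute set comes out as exactly the set of all participants minus the node and its descendants, which is the population that RanSub-nondescendants samples from.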
2.3.3 Discussion
A limitation of RanSub-ordered is that the first child of a particular node will always have a smaller set of potential nodes to choose from than the kth. In fact, the first child’s distribute set would always be restricted to a relatively small subset of global nodes. For RanSub-ordered, this violates our goal of distributing random subsets to all nodes that are uniformly chosen across all global participants in a single epoch. We take the following step to ensure that every node still receives a uniformly random subset across multiple invocations of RanSub-ordered. Every r epochs, where r is configurable, the root of the overlay sets a reshuffle flag in its Distribute message, signaling overlay participants to randomly re-order children lists. This allows children that were at the beginning of the total ordering (and hence received few nodes in distribute sets) a chance to move toward the end of the total ordering and receive information about more nodes.
Figures 2.2 and 2.3 summarize the operation of the two phases of the RanSub protocol (RanSub-ordered version). For simplicity, we do not include the results of Compact, which would appropriately reduce the size of all subsets to the application-specified size constraint. In the collect phase, each node constructs a collect set (CS) composed of the union of itself and the collect sets it received from its children. Thus, A receives a collect set from each of its children, B and C, that are uniformly representative of the subtrees rooted at B and C respectively.
[Figure 2.2: Example scenario depicting the first phase of the RanSub protocol: The collect phase traveling up the overlay. Nodes shown: A, B, C, D, E, F, G. Collect sets: $CS_D = \{D\}$, $CS_E = \{E\}$, $CS_F = \{F\}$, $CS_G = \{G\}$, $CS_B = \{B, D, E\}$, $CS_C = \{C, F, G\}$.]
Each node determines which of its collect sets should be used to compose the distribute set (DS) for each of its children.
For the distribute phase, node A constructs $DS_B$ by taking the union of itself (A) with $CS_C$. Node B in turn constructs $DS_D$ by taking the union of itself (B) with its own distribute set ($DS_B = \{A, C, F, G\}$) and $CS_E = \{E\}$ from the previous collect phase. D gets “lucky” in this ordering (it is actually the last node in this total ordering) and receives a distribute set representative of the entire topology (once again, recall that we are omitting the Compact operation, which would discard set elements as needed to maintain size-constrained sets). Node G would get “unlucky” and only receive a distribute set consisting of itself (it is the first node in this total ordering). However, once the children lists are reshuffled, the resulting total ordering would be a random walk of the tree in which each node is visited only once, yielding an entirely different total ordering. Finally, note that the complexity of reshuffling children is only needed if a total ordering is required.
[Diagram: the same tree, annotated with distribute sets sent down the tree: $DS_B = \{A, C, F, G\}$, $DS_C = \{A\}$, $DS_D = \{A, B, C, E, F, G\}$, $DS_F = \{A, C, G\}$.]
Figure 2.3: Example scenario depicting the second phase of the RanSub protocol: The distribute phase traveling down the overlay
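The distribute phase of this example can be sketched in the same style. The rule used here is inferred from the worked cases for B and D: a child's distribute set is its parent, the parent's own distribute set, and the collect sets of the siblings that precede it in the parent's ordering. The child orderings (C before B, G before F, E before D), the empty distribute set at the root, and the function names are assumptions chosen to be consistent with the four sets shown in Figure 2.3; Compact is again omitted.

```python
# Children appear in each node's assumed current ordering.
TREE = {"A": ["C", "B"], "C": ["G", "F"], "B": ["E", "D"],
        "D": [], "E": [], "F": [], "G": []}

def collect_sets(tree, root):
    """CS of a node: the node itself plus the collect sets of its children."""
    cs = {}

    def visit(node):
        cs[node] = {node}.union(*(visit(c) for c in tree[node]))
        return cs[node]

    visit(root)
    return cs

def distribute_sets(tree, root, cs):
    """DS of a child: parent, parent's DS, and CS of earlier siblings (inferred rule)."""
    ds = {root: set()}                 # assumption: the root receives nothing

    def visit(parent):
        seen = {parent} | ds[parent]   # what the first child in the ordering is told
        for child in tree[parent]:
            ds[child] = set(seen)
            seen |= cs[child]          # later siblings also learn about this subtree
            visit(child)

    visit(root)
    return ds

ds = distribute_sets(TREE, "A", collect_sets(TREE, "A"))
for node in "BCDF":
    print(f"DS_{node} =", sorted(ds[node]))
```

Under these assumptions the sketch yields the four distribute sets shown in Figure 2.3; changing the child orderings changes which nodes receive the larger sets.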
2.4 Analysis
We believe that RanSub closely approximates the ideal properties outlined at the beginning of this section. Since it uses a tree to propagate sublinear-sized collect and distribute sets between parents and children, it imposes low overhead on the underlying network. Using an efficient overlay structure further ensures that epochs can be short. For instance, we find that for a 1000-node system (in a network with a diameter less than 500 ms), epochs can be as short as five seconds.
2.4.1 RanSub Uniformity
In the absence of node failure and churn, RanSub delivers random subsets that are approximately uniformly distributed. The key to achieving uniformity is accounting for the number of nodes that a random sample represents at any given time during the protocol execution. We achieve this by running the protocol over a tree, where it is straightforward for a node to estimate the number of its descendants. Future work includes adapting RanSub to function over meshed overlays, not just trees. We now provide a proof of RanSub’s uniformity.
To this end, we make a small modification to the Compact operation, allowing entries in random subsets to be treated independently from one another. Consider RanSub executing with a subset size of $k$ on input sets $A$ and $B$, producing an output subset $C$. By treating entries individually, the Compact operation would select the first element of $C$ by randomly selecting the first element of $A$ with probability $|A_{pop}|/(|A_{pop}|+|B_{pop}|)$, where $|A_{pop}|$ is the population size that subset $A$ represents and $|B_{pop}|$ is the population size that subset $B$ represents ($A \subseteq A_{pop}$ and $B \subseteq B_{pop}$). Likewise, Compact would choose the first entry of $B$ with probability $|B_{pop}|/(|A_{pop}|+|B_{pop}|)$. Each entry $c_i$ of $C$ would in turn be chosen in the same manner from $a_i$ and $b_i$ of $A$ and $B$, respectively. Initially, a subset $Z$, constructed from a population consisting of only one node ($|Z_{pop}| = 1$), contains that single node in each of its $k$ entries. This occurs at all leaves in the tree over which RanSub runs.
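The modified Compact described above can be sketched as follows. This is an illustration under stated assumptions, not the RanSub implementation: the function name `compact`, its signature, and the way an interior node folds in its own single-node subset (as one more Compact against a population of size 1) are choices made for the example.

```python
import random

def compact(a, a_pop, b, b_pop, rng):
    """Merge two k-entry subsets, choosing each entry independently.

    Entry i of the result is a[i] with probability a_pop / (a_pop + b_pop)
    and b[i] otherwise. Returns the merged subset and the population size
    it represents.
    """
    assert len(a) == len(b), "both inputs must hold k entries"
    p_a = a_pop / (a_pop + b_pop)
    c = [a_i if rng.random() < p_a else b_i for a_i, b_i in zip(a, b)]
    return c, a_pop + b_pop

rng = random.Random(42)                  # fixed seed only to make the demo repeatable
k = 4
leaf_d, leaf_e = ["D"] * k, ["E"] * k    # a leaf holds itself in all k entries
merged, pop = compact(leaf_d, 1, leaf_e, 1, rng)
cs_b, pop_b = compact(["B"] * k, 1, merged, pop, rng)
print(cs_b, pop_b)
```

Because entries are drawn independently, the same node may appear in more than one of the k entries.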
By making such a modification, we are essentially allowing duplicate entries to exist in the random subsets (after an entry is selected from an input random subset, it may be randomly selected again for another entry). In practice, allowing for duplicates is somewhat wasteful since two or more entries in the subset are used for the same node and will carry the same information. In our RanSub implementation, we therefore do not allow duplicates, which slightly affects uniformity. Our evaluation results indicate that this effect is negligible.
We show by induction that the modified Compact, given input subsets $A$ and $B$ with entries uniformly selected at random from the nodes they each represent, $A_{pop}$ and $B_{pop}$, will generate a subset $C$ with entries uniformly selected at random from $C_{pop} = A_{pop} \cup B_{pop}$. Because our modified RanSub treats all $k$ elements in our subsets independently, it suffices to show that each single element in $C$ is chosen from $C_{pop}$ with probability $1/|C_{pop}|$. As a result, we can view RanSub subsets of $k$ entries as $k$ RanSub instances, each operating on single-entry subsets. Hence, for the remainder of this proof, we assume $k = 1$ ($|A| = |B| = |C| = 1$).
Each leaf, $l$, of the tree provides our base case, with its subset consisting of the element $l$ with probability 1. By our inductive assumption, each element of $A_{pop}$ is in $A$ with probability $1/|A_{pop}|$ and each element of $B_{pop}$ is in $B$ with probability $1/|B_{pop}|$. When choosing the single element in $C$, our modified RanSub picks the element from $A$ with probability $|A_{pop}|/(|A_{pop}|+|B_{pop}|)$ and the element from $B$ with probability $|B_{pop}|/(|A_{pop}|+|B_{pop}|)$. Hence, for each element in $A_{pop}$, the probability that it is chosen is the probability that it is in $A$ times the probability that the element from $A$ is selected:
$$\frac{1}{|A_{pop}|} \cdot \frac{|A_{pop}|}{|A_{pop}|+|B_{pop}|} = \frac{1}{|A_{pop}|+|B_{pop}|} = \frac{1}{|C_{pop}|}$$
The last step holds because $A_{pop}$ and $B_{pop}$ come from distinct parts of the tree and are therefore disjoint, so $|C_{pop}| = |A_{pop}|+|B_{pop}|$. The same is true for elements in $B_{pop}$, and we can therefore conclude that $C$ is constructed in a uniformly random fashion.
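The inductive argument can be checked numerically for a small tree. The sketch below computes, with exact rational arithmetic, the distribution of the single entry ($k = 1$) held by the root after repeated applications of the modified Compact. The seven-node example tree, the function names, and the choice to fold each node's own one-node subset in before its children's subsets are assumptions made for illustration.

```python
from fractions import Fraction

# Seven-node example tree: parent -> list of children.
TREE = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"],
        "D": [], "E": [], "F": [], "G": []}

def merge(dist_a, pop_a, dist_b, pop_b):
    """Exact distribution of the single entry of C after one modified Compact."""
    total = pop_a + pop_b
    # The entry from A is kept with probability pop_a / total.
    dist_c = {x: p * Fraction(pop_a, total) for x, p in dist_a.items()}
    # The populations are disjoint, so no key from B collides with one from A.
    for x, p in dist_b.items():
        dist_c[x] = p * Fraction(pop_b, total)
    return dist_c, total

def entry_distribution(tree, node):
    """Distribution over which node occupies the single entry at `node`."""
    dist, pop = {node: Fraction(1)}, 1       # base case: a node alone
    for child in tree[node]:
        child_dist, child_pop = entry_distribution(tree, child)
        dist, pop = merge(dist, pop, child_dist, child_pop)
    return dist, pop

dist, pop = entry_distribution(TREE, "A")
assert all(p == Fraction(1, pop) for p in dist.values())
print(pop, dist)
```

For this tree every node ends up with probability 1/7 of occupying the root's entry, matching the $1/|C_{pop}|$ result of the proof.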
2.4.2 Node Churn and Failure
RanSub uniformity might suffer as nodes join and leave the system. We do not provide the guarantee that nodes in the random subset will be active when considered by other nodes. RanSub is a mechanism for taking a snapshot of