The Feasibility of Supporting Large-Scale Live Streaming Applications with Dynamic Application End-Points

(1)

The Feasibility of Supporting Large-Scale Live Streaming Applications

with Dynamic Application End-Points

Paper 475, Regular paper, 14 pages

Abstract

While application end-point architectures have proven to be viable solutions for large-scale distributed applications such as distributed computing and file-sharing, there is little known about its feasibility for live streaming. Inherent prop-erties of application end-points, for example, heterogene-ity in bandwidth resources and dynamic group membership could adversely affect the feasibility of constructing a usable overlay. Even more challenging is attempting to do this at very large scales. In this paper, we use traces taken from a large content delivery network to characterize the behavior of users watching live audio and video streams. Based on these traces, we answer one of the most prominent architectural issues in the overlay multicast community: the feasibility of supporting large-scale groups using a purely application end-point architecture. We show that in the majority of common scenarios, application end-points have inherent resources and stability to support large-scale live streaming. In addition, we motivate a real-world problem that has been largely ignored in previous designs, but must be addressed in order for appli-cation end-point architectures to be successful at very large scales: flash crowd joins. We show that a scalable and ef-ficient join protocol can be implemented on top of dynamic application end-points.

1 Introduction

Live audio and video streams are now being delivered suc-cessfully over the Internet on a large scale. In the commer-cial sector, content delivery networks such as Akamai Tech-nologies [1] and Real Networks [15] have developed large scale dedicated infrastructure to deliver both live streams and video-on-demand. These architectures are capable of sup-porting many streams simultaneously and each stream may have thousands of subscribers.

Previous work in Overlay Multicast [7, 5, 9, 3, 10, 14, 20, 11, 17, 13, 21, 6, 2] has made the case that overlay networks are a promising architecture to enable quick deployment of multicast functionality on the Internet. In such an archi-tecture, application end-points self-organize into an overlay structure and data is distributed along the links of the over-lay. The responsibilities and cost of providing bandwidth is shared amongst the application end-points, reducing the burden at the content publisher. The lack of dependence on infrastructure support makes application end-point architec-tures easy to deploy, and economically viable. The ability for

users to receive content that they would otherwise not have access to provides a natural incentive for them to contribute resources to the system. In contrast to content delivery net-works, application end-point based architectures have shown promise for events where the peak group size is small-scale, on the order of 10 to 100 nodes [4]. The question remains whether or not such architectures are feasible at larger scales. In order to answer the feasibility question, we take the approach that we must first measure and understand the char-acteristics of application end-points. We leverage the wealth of data from a large content delivery network providing live streaming service. Because the system has been in everyday use for a some time, we can use the data to identify com-mon “real-world” application end-point characteristics and behavior that are likely to impact the choice of architectures. Inherent properties of application end-point architectures, for example, heterogeneity in bandwidth resources and dynamic group membership could adversely affect the feasibility of constructing a usable overlay for data delivery.

We look at feasibility along two key dimensions (i) band-width resources, and (ii) stability. The reason we choose these two dimensions is that they are fundamental in the sense that a purely application end-point architecture has lit-tle chance of providing good performance if these two are not satisfied. Secondary feasibility questions, such as net-work dynamics are relevant only when these primary feasi-bility requirements are satisfied. We find that in the majority of common scenarios, there is inherent bandwidth resources and inherent stability at application end-points, indicating the feasibility of such architectures for large-scale applications. Furthermore, we evaluate the viability of several design alter-natives that seek to exploit such properties to improve system performance.

In addition, we explore the design issues that one must address in order for application end-point architectures to be successful at very large scales. We find that a surprisingly large number of events have flash crowd behavior, motivating that a scalable and efficient solution for joining the system and managing group membership is required. We show that a simple solution based solely on application end-points can

In Section 2, we describe and analyze the streaming me-dia workload that we use in our study. Section 3 studies the feasibility of application end-point architectures to support large-scale live streaming events. In Section 4, we show that flash crowds are common in streaming media workloads and motivate the need for a scalable join and group management protocol. In Section 5, we summarize our findings.

(2)

0 200 400 600 800 1000 1200 1400 10/04 10/18 11/01 11/15 11/29 12/13 12/27 01/10 01/24

Number of live streams

Date

live streams

(a) Distinct streams.

0 200000 400000 600000 800000 1e+06 1.2e+06 10/04 10/18 11/01 11/15 11/29 12/13 12/27 01/10 01/24

Number of live requests

Date live requests (b) Number of requests. 0 50000 100000 150000 200000 250000 300000 10/04 10/18 11/01 11/15 11/29 12/13 12/27 01/10 01/24

Number of video requests

Date

video requests

(c) Number of video requests.

0 200000 400000 600000 800000 1e+06 1.2e+06 10/04 10/18 11/01 11/15 11/29 12/13 12/27 01/10 01/24

Number of audio requests

Date

audio requests

(d) Number of audio requests.

Figure 1: Daily summary of all live streams.

1000 10000 100000 0 100 200 300 400 500 600 Number of Hosts Stream ID Unique Hosts Peak Hosts Video Unique Hosts Video Peak Hosts

Figure 2: The peak group size and unique number of entities for the large-scale streams.

2 Live Streaming Workload

2.1 Methodology

In this study, we analyze the live streaming workload from a large content delivery network with the following motivat-ing question: what can we learn about user behavior that will enable us to better understand the design requirements for building a live streaming system. We focus our analysis on characteristics that are most likely to impact design, such as group dynamics. We then evaluate the impact of these work-loads on the performance of a application end-point based architecture.

2.2 Data Source

The logs used in our study are collected from thousands of streaming servers belonging to a large content delivery net-work (CDN). The CDN uses a static overlay, composed of (i) edge nodes which are located at the edge of the network, very close to clients, and (ii) intermediate nodes that take streams from the original content publisher and split and replicate them to the edge nodes. The logs that we use in this study are from the edge nodes that directly serve client requests.

The logs were collected over a 3-month period from Oc-tober 2003 to January 2004. The daily statistics for live streaming traffic during that period is depicted in Figure 1. The traffic consists of three of the most popular streaming media formats. In Figure 1(a), there were typically 800-1000 distinct streams on most days. However, there was a sharp drop in early December and a drop again in mid-December to January. This is because we do not have logs for one of

the formats on those days due to a problem with our log collection infrastructure. Figure 1(b) depicts the number of requests for live streams which varies from 600,000 on the weekends to 1 million on weekdays. Again, the drop in re-quests from mid-December onwards is due to the missing logs. Figure 1(c) and (d) depicts the number request for video streams and audio streams. The sharp peaks on various days correspond to special events. Note that there is an or-der of magnitude more requests for audio streams than video streams. In addition, audio streams have extremely regular weekend/weekday usage patterns. On the other hand, video streams are less regular, and are more dominated by special events with the very large events corresponding to the sharp peaks on various days.

Streaming media events can be classified into two broad categories based on the event duration. The first category, which we calllong-running events,are events in which there is live broadcast every day, all hours of the day. This is sim-ilar to always being “on-the-air” in radio terminology. The second category, which we callspecial events,are events in which the event duration is well-defined and typically on the order of a couple of hours. For example, a talk show that runs from 9am-10am is only broadcast during that period, and has no traffic at any other time during the day. For sim-plicity, for either one of these categories, we break the events into 24-hour chunks, which we call streams. For the rest of the paper, we present analysis on the granularity of streams. Note that forspecial events, a stream is the same as an event. Rather than present an analysis on all the streams ob-served over the three-month period, which is an overwhelm-ing amount of data, we limit the discussion in this paper to

large-scale streams. Large-scale streams are defined as the streams in which the peak group size, or the simultaneous number of hosts is larger than 1,000 hosts. There were a to-tal of over 653 large-scale streams, of which 60 were video streams and 593 were audio streams. In the following sec-tions, we present results from the analysis conducted for 590 streams (55 video and 535 audio) encoded in two of the for-mats. The third format has similar events and characteristics to the others and are not discussed due to space limitations. Figure 2 depicts the peak group size and the unique number of IP addresses (or entities, as we define in the next section). Most events had a peak group size of a few thousands. In ad-dition to group size, we also summarize the session duration characteristics, which are analyzed in detail in Section 3.2.

(3)

0 10000 20000 30000 40000 50000 60000 70000 80000 16.00 17.00 18.00 19.00 20.00 21.00 22.00 Number of Hosts Time All 250kbps 100kbps 20kbps

(a) Membership over time.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1 10 100 1000 10000 100000 1e+06

Cumulative Probability Distribution

Session Duration (seconds)

mean 1304.76, 394101 Incarnations

mean 4323.95, 118921 Entities 1 min, 5 min

(b) Incarnation and entity session duration.

Figure 3: Largest event.

2.3 Workload and Log Processing

Streaming workload: Each entry in the log corresponds to

a request made by a client to an edge server. The following fields extracted from each entry are used in our study.

User identification: IP address and player ID Requested object: stream URI

Time-stamps: session start time and session duration at the granularity of seconds

Performance statistics: average received bandwidth for en-tire duration

Entity vs. incarnation: We define an entity to be a unique

user. An entity may join the broadcast many times, perhaps to tune in to distinct portions of the broadcast, and as a result have many incarnations.

We use the IP address in the logs to determine entity uniqueness. While this is not entirely accurate, it is the most robust in determining uniqueness. Its weaknesses are that it can (i) underestimate when multiple hosts are sharing the same IP addresses in the case of NATs, firewalls, proxies, and DHCP, and (ii) overestimate when the same host uses multi-ple IP addresses in the case of DHCP. We looked at other heuristics, such as the player id field which is intended to be a unique player identifier. However, in more than 50% of the requests, this field was anonymized by the user for pri-vacy reasons. We also looked at combinations of operating system version, player version and user agent, but no combi-nation was reliable in detecting uniqueness even for the cases in which we knew the clients were unique.

2.4 Largest event

Next, we present more detailed statistics for the largest event in the logs. The event consisted of three streams with bit-rates of 20 kbps audio, 100 kbps video and audio, and 250 kbps video and audio. The event duration was 2 hours, from 19:00-21:00, as shown in Figure 3(a). Combining all three streams, the peak membership was 74,000 entities. The sharp rise in membership at 19:00 is caused by everyone want-ing to tune in to the start of the event. There were roughly 119,000 entities and over 394,000 incarnations for the entire duration of the event. Note that there are many people who join the broadcast before the event started, probably to test

that their set up was working. Also, there are many people who stay after the broadcast was over, for unknown reasons (they may have just left their player running). The 20 kbps audio stream has a sharper drop-off in membership than the other streams because many incarnations switch from audio to a better quality stream. The most dynamic segment of the stream was between 19:00-19:30, where the peak join rate was over 700 arrivals/second and the peak leave rate was over 70 leaves/second.

Figure 3(b) depicts the cumulative distribution of the in-carnation and entity session duration time in seconds for the combined streams. For incarnation session duration, we see a sharp rise at 2 minutes. This is caused by a known client-side NAT/firewall problem that forced the streaming servers to time out on the connection.

The average incarnation session duration was 22 min-utes. And, more than 20% of incarnations stayed longer than 30 minutes. The entity session duration, which reflects the person’s attention span, was much longer, with an average and median of 72 minutes and 63 minutes. The average is dominated by the tail of the distribution. Taking a closer look at the distribution, we find a fair amount of dynamics. For example, over 60% of incarnations and 15% of entities stay for less than 5 minutes. Furthermore, 15% of incarnations and 5% of entities stay for under 1 minute. We explore what these numbers indicate about the stability of the system in Section 3.2.

Unless otherwise stated, in the remaining sections, we combine all three streams into one stream, and refer to this one stream as the “largest stream or event.” We assume that the “desired” streaming bit rate for all hosts is 250 kbps.

3 Feasibility of overlay tree

construc-tion

In this section, we look at the feasibility of supporting a large-scale live streaming application using a purely appli-cation end-point architecture with a focus on the data deliv-ery plane. We capture feasibility using two important dimen-sions. The first dimension is bandwidth resources, and the second is system stability.

(4)

Access technology Packet-pair measurement Outgoing bandwidth estimate Dial-up modems 0 kbps BW 100 kbps 30 kbps DSL, ISDN, Wireless 100 kbps BW 600 kbps 100 kbps Cable modems 600 kbps BW 1 Mbps 250 kbps Edu, Others BW 1 Mbps BW

Table 1: Mapping of access technologies to outgoing band-width.

Type Degree-bound Number of hosts

Free-riders 0 58646 (49.3%) Contributors 1 22264 (18.7%) Contributors 2 10033 (8.4%) Contributors 3-19 6128 (5.2%) Contributors 20 8115 (6.8%) Unknown - 13735 (11.6%) Total - 118921 (100%)

Table 2: Number of hosts with assigned degree for the largest event.

3.1 Are There Enough Resources?

In a purely application end-point architecture, there is no de-pendence on costly pre-provisioned infrastructure, making it favorable as an economical and quickly deployable alterna-tive for streaming applications. On the other hand, the lack of any supporting infrastructure requires that application end-points contribute their outgoing bandwidth as resources. The feasibility of such an architecture depends on whether or not there are enough bandwidth resources amongst application end-points to support all participants.

To answer this feasibility question, we run trace-based analysis on 81 of the 590 large-scale streams. We first need to estimate the outgoing bandwidth resources at each host. Then, using these estimates, we derive the amount of re-sources in the system for each second in the traces. Note that the amount of resources are dependent on the join and leave patterns of participating hosts.

3.1.1 Outgoing Bandwidth Estimation

Data sources:We define resources as the amount of

outgo-ing bandwidth that hosts in the system can contribute (i.e., how much bandwidth each host can send). We use a com-bination of data mining and active measurements for the es-timation. Our primary goal was to generate an estimate for roughly 119,000 IP addresses quickly and accurately. The most accurate methodology would be to actively measure the bandwidth from all the IP addresses. However, doing so re-quires more time and resources than what we had available. Therefore, we took a different approach. We used existing tools and databases to help reduce the set of IP addresses.

Step 1: As a first order filter, we used “speed test”

re-sults from a popular web-site, broadbandreports.com. Speed tests are TCP-based bandwidth measurements that are con-ducted from well-provisioned servers owned by the web-site. Users voluntarily go to the web-site to test their connection speeds. The results from the tests are aggregated based on DNS domain names or ISPs, and are summarized over the past week on a publicly available web-page. Approximately 10,000 tests for over 200 ISPs are listed in the summary. The

Quality Index: P P P NAT P P P NAT P P P P

Public only NAT and Public

Inefficient structure Connectivity-optimal structureNAT and Public

8/3 = 2.7

(a) 6/3 = 2.0(b) 8/3 = 2.7(c)

Figure 4: Example of how to compute the Quality Index.

0 1 2 3 4 5 6 17.00 18.00 19.00 20.00 21.00 22.00 Quality Index Time Optimistic Cap 20 Distribution Cap 20 Pessimistic Cap 20 Optimistic Cap 6 Distribution Cap 6 Pessimistic Cap 6

Figure 6: Quality Index for largest event with two degree cap policies.

measurements include bandwidth measurements in both in-coming and outgoing directions. We use the outgoing band-width values for bandband-width estimation. We matched the hosts in our list with the ones from broadbandreports.com using DNS names. For the 85935 of hosts that matched, we as-sign their bandwidth values to the ones reported by broad-bandreports.com.

Step 2: Next, we aggregated the remaining 32,986 IP

addresses into /24 prefix blocks and conducted packet-pair based measurements to one host in each block. Of the 13,483 IP addresses probed, 6,020 of them failed because the host did not respond do the probes. Finally, after all probing finished, we were able to compute bandwidth estimates for 7,463 IP addresses. The bandwidth estimate for a /24 pre-fix was assigned to each host in that block, which resulted in estimates for a total of 9,028 (7.6%) IP addresses.

Step 3: We used a commercial service that maps IP

addresses to host information. The service is provided by the CDN who provided us with the streaming media logs. One of the fields in the host information database is access technology. We compared the overlapping matches from this database to the ones from broadbandreports.com and our DNS name mapping, and found them to be fairly consistent. Using this database, we were able to eliminate 8,412 hosts.

Step 4:Finally, we used hosts’ DNS names to infer their

access technology. For example, we use keywords such as

adsl, cable-modem, dial-up, wireless, .eduand popular cable modem and dsl service providers such asrr.com and attbi. Of the remaining 15,546 IP addresses, we matched 2,597 using this technique. Finally, we constructed a database manually

(5)

0 1 2 3 4 5 0 5 10 15 20 25 30 Quality Index

Stream Sorted By Optimistic Quality Index Optimistic Distribution Pessimistic

(a) Audio streams, Degree Cap 4

0 1 2 3 4 5 0 10 20 30 40 50 60 Quality Index

(b) Video streams, Degree Cap 6

0 1 2 3 4 5 0 10 20 30 40 50 60 Quality Index

(c) Video streams, Degree Cap 10

0 1 2 3 4 5 6 7 8 0 10 20 30 40 50 60 Quality Index

(d) Video streams, Degree Cap 20

Figure 5: Average quality index for each stream. for a subset of the remaining IP addresses with resolved DNS

names. This served to generate estimates for small DSL and cable modem providers, corporations and international ed-ucational institutions that do not use common DNS names. This techniques provided estimates for an additional 1,418 IP addresses.

Assign degree-bounds:To simplify the discussion, we

nor-malize the bandwidth value by the streaming bit rate. For example, if a host has an outgoing link bandwidth of 300 kbps and the streaming rate is 250 kbps, then the normal-ized value is . That is, this host

can have an out-degree of 1, or can support 1 child at the full streaming rate. Although we assume a single-tree sce-nario, the out-degree values for multiple trees [2, 13, 6] can be derived from our estimates. Throughout this paper, we use “degree” as the outgoing bandwidth unit, instead of kbps.

Using the data sources from the previous step, we end up with estimates with the following formats:

Outgoing bandwidth (broadbandreports.com): We use this value to directly compute the degree.

Average of the incoming and outgoing bandwidth (packet-pair probing): Many access technologies have asymmetric bandwidth properties. For example, a DSL host with incom-ing link of 400 kbps and outgoincom-ing link of 100 kbps will have an estimate of 250 kbps. Using the average of the two direc-tions could over-estimate the outgoing link. Taking this into account, we discretize the bandwidth values from packet-pair into the access technology categories listed in Table 1, where BW stands for the bandwidth measured using packet-pair, and use the outgoing bandwidth estimate to compute the de-gree.

Access technology (Host information database, DNS name resolution): We use the outgoing bandwidth estimate in Ta-ble 1 to compute the degree.

The degree estimate derived from the outgoing band-width value reflects theinherentcapacity of a host. However, all that capacity may not be available for use. For example, the bandwidth may be shared across many applications on the end-host, or may be shared across many end-hosts in the case of a shared access link. Also, an application may want to be “nice” to the network and not saturate the link. In this paper, we set an absolute maximum bound of 20 for the out-degree such that no host in our simulations can exceed this limit, even if they have more resources to contribute. This roughly translates to 5 Mbps for video applications. We also study the effect of more conservative policies such as degree caps of 4, 6, and 10.

Contributors vs. free-riders:The results from the previous

two steps for the largest broadcast are listed in Table 2. Over half of the hosts have 0-degree, and are labeled as free-riders. Roughly 39% of the hosts are contributors, capable of sup-porting one or more children. Of these, 6.8% are hosts who are capable of supporting 20 or more children.

Assign values to unknowns:After using all the data sources

in the first step, we are missing estimates for 10% of the IP addresses. For these unknown values, we assign resources to them using 3 assignment algorithms. The optimistic as-signment assumes that all unknowns can contribute up to the maximum resource allocation (which we set to be degree 20). This provides an upper-bound on the best case resource esti-mate. The pessimistic assignment assumes that all unknowns are free-riders and contribute no resources. This provides a lower bound for the worst-case. The distribution algorithm assigns a random value drawn from the same distribution as the known resources. This provides a fair estimate assum-ing that the known and unknown resources follow the same distribution.

3.1.2 Quality Index

To measure the resource capacity of the system, we use a metric called Quality Index[4]. The Quality Indexis de-fined as the ratio of the supply of bandwidth to the demand for bandwidth in the system for a particular source rate. The supply is computed as the sum of all the degrees that applica-tion end-points participating in the system contribute. Note that the degree is dependent on the streaming bit rate, and in turn the Quality Index is also dependent on the streaming bit rate. Demand is computed as the number of participat-ing end-points. For example, consider Figure 4, where each host has enough outgoing bandwidth to sustain 2 children. The number of unused slots is 5, and the Quality Indexis

"!#$ %'&( . A Quality Index of 1 indicates that the

sys-tem is fully saturated, and a ratio less than 1 indicates that not all the participating hosts in the broadcast can receive the full source rate. As the Quality Index gets higher, the environ-ment becomes less constrained and it becomes more feasible to construct a good overlay tree. A Quality Index of 2 indi-cates that there are enough resources to support two times the current number of participants.

3.1.3 Trace replay analysis

In this section, we use Quality Index to measure the quality of resources across a set of 81 streams out of the 590 large-scale

(6)

streams. The analysis was conducted for all video streams and 5% of audio streams (randomly selected). To ensure some confidence in our results, we only analyzed streams that had sufficient bandwidth estimates (at least 70% of the IP addresses in the traces had estimates) using the estimation methodology described earlier in Section 3.1.1.

For each stream, we replay the trace using the group par-ticipation dynamics (joins and leaves) and compute the Qual-ity Index for each second in the trace. First, we discuss the largest event, and then we present a summary for the other large-scale events.

Figure 6 depicts the Quality Index as a function of time, with degree caps of 6 (bottom 3 lines) and 20 (top 3 lines) children. The interesting interval is between 19:00 - 21:00 when the event was taking place. Again, note that a Qual-ity Index above 1 means that there are sufficient resources to support the stream using a purely application end-point ar-chitecture. The highest and lowest curves for each degree cap policy are computed using optimistic and pessimistic as-signments of bandwidth estimates to unknowns, respectively. Regardless of the treatment of hosts with unknown estimates and the degree cap policy, the Quality Index is always above 1 during 19:00 - 21:00. However, a degree cap of 6 places more constraints on the resources in the system and could potentially be more difficult to construct a tree with good per-formance.

Figure 5 depicts a summary of the other large-scale streams. The quality index for audio streams is depicted in Figure 5(a). Each point on the x-axis represents an au-dio stream. The y-axis is the Quality Index for that stream averaged over the stream duration.The lowest curve in the figure is the Quality Index computed using the pessimistic assignment of bandwidth estimates to unknowns. For au-dio streams, even the pessimistic assignment is always be-tween 2-3 when using a degree cap of 4 children. This is expected because audio is not a bandwidth-demanding appli-cation. The typical streaming rate for audio was between 16 kbps to 20 kbps. Most hosts on the Internet, including dial-up modems, have outgoing bandwidth that is higher than 20 kbps. Application end-points participating in audio stream-ing applications can provide more than enough resources to support live streaming.

Figure 5(b),(c), and (d) depict the average Quality Index for video streams with degree caps of 6, 10, and 20 children. As the degree cap increases, the Quality Index increases. The top most curve in all 3 figures represents the optimistic as-signment. In the most optimistic view, across all degree cap policies (6,10, and 20), only one stream had a Quality Index below one. In the worst case scenario, where the degree cap is 6 and the unknown assignment policy is pessimistic (b), roughly a third of video streams had Quality Index below 1.

In order to determine feasibility, we look at theinherent

amount of resources in the system. Using a degree cap of 20, and the distribution-based degree assignment for unknowns in Figure 5(d), we find that only 1 stream has a Quality Index of less than 1.

The stream with the worst Quality Index (40 on the x-axis) had a streaming rate of 300 kbps, but was composed al-most exclusively (96%) of home broadband users (DSL and

cable modem). Many home broadband connections can only support 100-250 kbps of outgoing bandwidth which is less than the steaming rate. Therefore, such hosts did. not con-tribute any resources to the system at all.

To better understand whether or not this composition of hosts is common, we look at the nature of the event. This is a special event, starting on Sunday night at 11pm and ending at 2am in local time, where local time is determined based on the geographic location (Section 4.3 gives an overview of how geographic location is determined) of the largest group of hosts. Most participants are home users because the event took place on a weekend night where they are most likely to be at home. In contrast, most of the other events take place during the day, and are often accessed from more het-erogeneous locations with potentially more bandwidth re-sources, such as from school or the workplace. About 5 of the 55 large-scale video streams fall under this category. Their Quality Index dips below 1 for the distribution-based degree assignment in Figure 5(c).

To summarize the results in this section, we find that there are more than sufficient bandwidth resources amongst application end-points to support audio streaming. In addi-tion, there are enoughinherent resources in the system at scales of 1000 or more simultaneous hosts to support video streaming in over 90% of the common scenarios. This shows that using application end-point architectures for live stream-ing is feasible. While we have shown that there areinherent resources, we wish to point out that assigning policies and mechanisms to extract those resources from application end-points is beyond the scope of this paper. We have looked at simple policies, such as capping the degree bound to a static amount to make sure that no person contributes more than what is considered to be “reasonable.” There could be more complex policies to allow users to individually deter-mine how much they are willing to contribute in return for some level of performance.

While there are resources in the common scenarios, in 10% of the cases there were not enough resources to allow all participants to receive at the full streaming rate. In such scenarios, there are three generic classes of alternatives to consider. The first alternative is to enforce admission control and reject incoming free-riders when the Quality Index dips below one. A second alternative is to dynamically adapt the streaming bit-rate, either in the form of scalable coding [19], MDC [8], or multiple source rates [4]. This has the advan-tage that the system can still be fully supported using an ap-plication end-point architecture with a slight reduction in the perceived quality of the stream. Scaling the streaming by a factor of two effectively increases the Quality Index by a fac-tor of two. And lastly, a third alternative is to add additional resources into the system. These resources can be statically allocated from an infrastructure service such as CDN’s with the advantage that the resources only need tocomplementthe already existing resources provided by the application end-points. Another viable solution is to allocate resources in a moreon-demandnature using a more dynamic pool of re-sources, for example, using a waypoint architecture [4].

(7)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.01 0.1 1 10 100 1000 CDF Minutes 5th Percentile 25th Percentile 50th Percentile 95th Percentile

Figure 7: Incarnation session duration in minutes.

0 10 20 30 40 50 60 0 5 10 15 20 25 30 35 40 45 50

Interval between ancestor change

Streams Random MinDepth Stability 0 10 20 30 40 50 60 0 5 10 15 20 25 30 35 40 45 50

Streams

Random MinDepth Stability

Figure 9: Median interval between ancestor change for 50 large-scale events.

3.2 Is There Any Stability?

In this section, we look at the impact of group dynamics on stability, and evaluate mechanisms that can be used to in-crease the stability of the overlay.

3.2.1 Group dynamics

Figure 7 depicts the incarnation session duration for the 590 large-scale streams. Note that an incarnation, as defined pre-viously, refers to an instantiation of an entity or a unique host. The first curve on the left depicts the cumulative distribution of the observed 5th percentile incarnation session duration in all 590 streams. For example, 5% of incarnations in 30% of the streams had session durations of shorter than 1 second (shown as 0.01 minutes). The next 3 curves are for the 25th, 50th and 95th percentile. Note that the same point on the y-axis does not necessarily correspond to the same stream across all curves.

Based on this figure, we make two observations. First, there is a significant number of hosts that stay for very short durations. Looking at the 25th percentile curve, we find that for most streams, 25% of incarnations stay for under 2 min-utes. Furthermore, looking at the 5th percentile curve, in most streams, 5% of hosts stay for roughly 10 seconds. The most disastrous is that in the 50th percentile curve, more than 10% of the streams have extremely dynamic group member-ship with half of the incarnations staying less than 5 minutes. With such short session durations, it seems very unlikely that there could be any stability.

Our second observation is that there appears to be some stability at the tail end, where there are a small number of incarnations that stay for very long periods. The 95th per-centile curve in Figure 7 shows that for for most streams, the incarnations in the 95th percentile stay for longer than 30 minutes. There is a sharp rise at 30 minutes, caused by one “long-running” event. On multiple days, most people tuned in for 30 minutes. Perhaps the tail can can help counter-act the effects of the high rate of dynamics at the other end of the distribution.

Note that these observations are consistent with the ses-sion duration analysis of the largest event in Section 2.

3.2.2 Stability metrics

When an incarnation leaves, it can cause all of its descen-dants to become disconnected from the overlay and stop re-ceiving data. The descendants will need to find new parents and reconnect to continue receiving the stream. To capture stability of the overlay we look at two metrics:

Mean interval between ancestor changefor each

incarna-tion. An ancestor change is caused by an ancestor leaving the group. This metric captures the typical performance of each incarnation. A change every second, or every minute is annoying. A larger value indicates better performance. If a host sees only one ancestor change during its session, the time between ancestor change is computed as its session du-ration. If a host sees no ancestor changes at all, the time between ancestor changes is infinity.

Number of descendants of a departing incarnation. This

metric captures overall stability of the system. If many hosts are affected by one host leaving, then the overall stability of the system is poor. However, assuming a balanced tree, most hosts will be leaf nodes and will not have children. There-fore, we hope to see that a large percentage of hosts will not have children when they leave.

3.2.3 Overlay protocol

We simulate the effect of group dynamics on the overlay pro-tocol using a trace-driven event-based simulator. The simu-lator takes the group dynamics trace from the real event and the degree assignments based on the techniques in the previ-ous section, and simulates the overlay tree at each instance in time. Hosts in the simulator run a fully distributed self-organizing protocol to build a single connected tree rooted at the source. Note that we do not simulate any network dy-namics or adaptation to network dydy-namics.

Host Join:When a host joins, it contacts the source to get a

random list of) current group members. In our simulations, ) is set to 100. It then picks one of these members as its

parent using the parent selection algorithm.

Host Leave: When a host leaves, all of its descendants are

disconnected from the overlay tree. For each of its descen-dants, this is counted as an ancestor change. Descendants then connect back to the tree by finding a new parent using the parent selection algorithm.

Parent Selection:When a host needs to find a parent, it

(8)

0 10 20 30 40 50 60 70 80 90 100 0 5 10 15 20 25 30 35 Cumulative Distribution

Interval between ancestor change Oracle Minimum Depth Random Longest-first 0 10 20 30 40 50 60 70 80 90 100 0 5 10 15 20 25 30 35 Cumulative Distribution

Interval between ancestor change Oracle Minimum Depth Random Longest-first

(a) CDF of interval between ancestor change.

70 75 80 85 90 95 100 1 10 100 1000 10000 100000 Cumulative Distribution Number of Descendents Oracle Minimum Depth Random Longest-first 70 75 80 85 90 95 100 1 10 100 1000 10000 100000 Cumulative Distribution Number of Descendents Oracle Minimum Depth Random Longest-first

(b) Number of descendants of a departing host.

0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 Cumulative Distribution Depth Oracle Minimum Depth Random Longest-first 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 Cumulative Distribution Depth Oracle Minimum Depth Random Longest-first (c) Tree depth.

Figure 8: Performance of largest event. probes them to see if they are currently connected to the tree

and have enough resources to support a new incoming child, and then ranks them based on the parent selection criteria. We do not simulate the mechanisms for learning about the current hosts in the system, but simulate the effect of that knowledge propagation mechanism which provides random knowledge about hosts currently in the system. In a real sys-tem, Gossip-based mechanisms [16] can be used.

3.2.4 Parent selection algorithms

We ran simulations on 4 parent selection algorithms. Note that the chosen parent needs to be connected to the tree and have enough resources to support an incoming child (has not saturated its degree-bound) , in addition to satisfying the par-ent selection criteria.

Oracle:A host chooses the parent who will stay in the sys-tem the longest. This algorithm requires future knowledge and cannot be implemented in practice. However, it provides a good baseline comparison for the other algorithms.

Longest-first: This algorithm attempts to predict the future and guess which nodes are stable by using the heuristic that if a host has stayed in the system for a long time, it will con-tinue to stay for a long time. The intuition is that if the ses-sion duration distributions are heavy-tailed, then this heuris-tic should be a reasonable predictor for identifying the stable nodes.

Minimum depth: A host chooses the parent with the min-imum depth. If there is a tie, a random parent is selected from the ones with the minimum depth. The intuition is that a balanced shallow trees minimize the number of effected de-scendants when an ancestor higher up at the top of the tree leaves.

Random: A host chooses a random parent. This algorithm provide a baseline for how a “stability-agnostic” algorithm would perform. Intuitively, random should perform the worst compared to the above algorithms.

We used the degree assignment from the previous sec-tion, with a degree cap of 4 for audio streams and 20 for video streams, and the distribution-based assignment for the hosts with unknown measurements. Unless otherwise stated we use this same set up

3.2.5 Results

We simulated the performance of the 4 parent selection algo-rithms for the largest event over the most dynamic 30-minute segment from 19:00-19:30. We also removed the hosts with NAT/firewall timeout problems discussed in Section 2 be-cause their 2-minute time-outs are artificial and do not repre-sent user behavior. We used the degree assignment from the previous section with a degree cap of 20 for video streams, and the distribution-based assignment for the hosts with un-known measurements. Unless otherwise stated we use this same set up for all video streams. We simulated 10 runs for each algorithm with different random seeds.

A CDF of the mean time interval between ancestor change is depicted in Figure 8(a). The x-axis is time in min-utes. A larger interval is more desirable. Because we are simulating a 30-minute trace, the maximum time interval is 30 minutes, if a host sees on ancestor change. Hosts that do not see any ancestor changes have an infinite interval. For presentation purposes, we represent infinity as 31 on the x-axis. The bottom most line in the figure represents the CDF when using the oracle algorithm. Roughly 10% of the incar-nations saw only one ancestor change in 30 minutes. Further-more, 87% of incarnations did not see any changes at all. In fact, there were only one or two events that caused ancestor changes across all the runs. It is surprising that during the busiest 30 minutes in the trace, there is stability in the sys-tem. In addition, the overlay built by the oracle algorithm can exploit thatinherent stability.

The second-best algorithm is minimum depth. Over 80% of the incarnations saw either no changes, or 10 or more min-utes between changes. This is somewhat tolerable to humans as they will see a glitch every 10 minutes or so. The random algorithm and the longest-first algorithm performed poorly in this metric. For random, only 65% of the incarnations saw no changes, or 10 or more minutes between changes. To our sur-prise, the longest-first algorithm performed the worst, with 55% of incarnations seeing decent performance. The reason that it did not perform well stems from many factors. First, it may not always be correct in selecting the stable nodes. In fact, it was wrong for 9% of the cases because it had only 91% of hosts that had no descendants on departure (com-pared to 100% for oracle) as depicted in Figure 8(b). The number of descendants of a departing host is on the x-axis, in log scale. If longest-first were always correct, it would

(9)

overlap with the y-axis, like oracle where close to 100% of the incarnations had no descendants when they left the sys-tem. Note that longest-first is predicting correctly for some of the cases because it is performing better than random and min depth, which had 72% and 82% of incarnations with no descendants when they departed the system. One of the dif-ficulties in getting accurate predictions is that at the start of the event, almost all hosts will appear to have been in the group for the same amount of time making stable hosts in-distinguishable from dynamic hosts.

An additional factor is that the consequence of its mis-take is severe as depicted in Figure 8(c). The x-axis is the average depth of each node in the tree. The longest-first al-gorithm has taller trees than the random and min depth algo-rithms. Therefore, when it guesses incorrectly, a large num-ber of descendants are effected. We examined the tail end of Figure 8(b) more closely and confirmed that this was the case.

One interesting observation is that the oracle algorithm builds the tallest tree. The intuition here is that nodes that are stable will cluster together and “stable” branches will emerge. More nodes will cluster under these branches, mak-ing the tree taller. Although the height does not affect the stability results in our simulations, in practice a tall tree is more likely to suffer from problems with network dynamics. We find that min depth is the most effective and robust algorithm to enforce stability. Its property of minimizing damage seems to be the key to its good performance. The fact that it does not attempt to predict node stability makes it more robust to a variety of scenarios, as depicted in Figure 9. We ran the same set of simulations using the longest-first, random and min depth algorithms for 50 of the large-scale streams. These are the same streams as the ones used Sec-tion 3.1, with only half of the video streams used. Again, we used the distribution-based assignment for hosts with un-known measurements, a degree cap of 20 for video streams, and a degree cap of 4 for audio streams. The simulations were run over the most dynamic 1-hour segments in each trace. The y-axis is the median time interval between ances-tor change observed in each stream in minutes. Again, min depth performed the best with 45 out of the 50 streams see-ing a median time interval between ancestor changes of 10 minutes or longer. Random and longest-first both performed poorly with only 20 and 30 streams seeing a median time interval of longer than 10 minutes.

While we present results based on 4 parent selection al-gorithms, we also explored many design alternatives. For ex-ample, we looked at prioritizing contributors, and combining multiple algorithms. However, we do not present them in this paper due to lack of space. We note that alternate algorithms did not perform as well as the ones listed above. For example, when the parent selection algorithm prioritized contributors such that they are higher up in the tree, the performance was as poor as random. This is explained by the observation that there is no correlation between being a contributor and being stable.

We also looked at the impact of resource on stability. In particular, we looked at whether there is more stability if

Join requests RJ

Join responses RJ

Keep-alives RK

RP

Keep-alive responses RK

Explicit state update RE

Figure 12: Load at rendezvous point.

there are more high degree nodes (i.e., more resources). We ran simulations on the audio streams with degree caps of 6 and 10, and found that there was only a slight improvement compared to when the degree cap was 4.

To summarize, we find that there isinherent stabilityin application end-point architectures. Without future knowl-edge, the most effective and robust parent selection algo-rithm for maintaining stability and good performance is min depth. While we have looked at how to reduce the number of ancestor changes in the system, another important direc-tion is to reduce the perceived impact of an ancestor change. In particular, mechanisms such as maintaining application-level buffers to smooth out interruptions could help while the effected descendants are looking for new parents. Multiple trees [2, 13, 6] would also help reduce the impact of ancestor changes. In multiple trees, the stream is split and distributed across* independent trees. The probability of all trees

see-ing a disruption is small. And the impact of one tree seesee-ing a disruption will degrade the quality of the stream. How-ever, because a host now has * times more ancestors, it is

likely that overall, it would see a larger number of ancestor changes.

4 Scalability for joining and

member-ship management

4.1 Flash Crowds Are The Norm

A common pattern emerged in our analysis of group mem-bership across many of the large-scale streams. In 40% of the 590 large-scale streams, 236 (40%) of them exhibited “flash crowd” behavior. Breaking this down further, almost all video streams (53 out of 55) and 34% of audio streams had flash crowds. Figure 10 and 11 depict the group membership as a function of time for 4 large-scale video and 4 large-scale audio streams. The streams were sorted by their peak join ar-rival rate, and the streams with the maximum, 25th, 50th, and 75th percentile are shown in this figure. By 25th percentile, we mean that 25% of the streams had higher peak arrival rate than this stream. For all video streams in Figure 10, the sharp rise in group membership stands out. For the audio streams in Figure 11, the first two have a sharp rise in group mem-bership. The last two have smoother join patterns. The video streams and audio streams with the maximum peak arrival rate had rates as high as 1000 arrivals/second and 1200 ar-rivals/sec.

For video streams, flash crowds are the common behav-ior. Intuitively, this is because all of the video streams and some of the audio streams arespecial eventswhere the event

(10)

0 5000 10000 15000 20000 25000 30000 35000 00:00 03:00 06:00 09:00 12:00 15:00 18:00 21:00 00:00 Number of Hosts Time

(a) Max peak arrival rate.

0 500 1000 1500 2000 2500 3000 3500 00:00 03:00 06:00 09:00 12:00 15:00 18:00 21:00 00:00 Number of Hosts Time (b) 25th percentile. 0 500 1000 1500 2000 2500 3000 3500 00:00 03:00 06:00 09:00 12:00 15:00 18:00 21:00 00:00 Number of Hosts Time (c) 50th percentile. 0 500 1000 1500 2000 2500 00:00 03:00 06:00 09:00 12:00 15:00 18:00 21:00 00:00 Number of Hosts Time (d) 75th percentile.

Figure 10: Group membership for large-scale video streams, sorted by peak arrival rate.

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 00:00 03:00 06:00 09:00 12:00 15:00 18:00 21:00 00:00 Number of Hosts Time

(a) Max peak arrival rate.

0 1000 2000 3000 4000 5000 6000 00:00 03:00 06:00 09:00 12:00 15:00 18:00 21:00 00:00 Number of Hosts Time (b) 25th percentile. 0 500 1000 1500 2000 2500 3000 3500 4000 00:00 03:00 06:00 09:00 12:00 15:00 18:00 21:00 00:00 Number of Hosts Time (c) 50th percentile. 0 1000 2000 3000 4000 5000 6000 00:00 03:00 06:00 09:00 12:00 15:00 18:00 21:00 00:00 Number of Hosts Time (d) 75th percentile.

Figure 11: Group membership for large-scale audio streams, sorted by peak arrival rate. is taking place during a specific period, for example 2 hours.

It is natural for people to want to start joining and watching the stream from the beginning. As long as there arespecial events,there are flash crowds. However, a number of long-runningaudio streams also had flash crowds. We believe that special events that happened briefly during the long-running events, for example, invited guest appearances, could explain such behavior.

With motivating data that flash crowds are a normal fea-ture, it is very important to ensure that hosts can successfully join the system. If joining fails, there are no hosts in the sys-tem and no overlay to construct. The control plane in over-lay multicast can be divided into three protocols: (i) join, which is used to bootstrap hosts into the system, (ii) mem-bership management, which ensures that hosts know about other members in the system and (iii) tree construction and tree performance optimization, which depends on informa-tion gathered by the join and membership management pro-tocol to construct a data delivery tree and optimize it. Given that flash crowds are normal in live streaming, the join proto-col must be very scalable and efficient. In previous studies on Overlay Multicast scalability [17], the focus has been on the group management and tree construction protocols, however, attention has not been adequately given to flash crowds and the join protocol. In the remainder of this section, we outline the design requirements for scaling the join protocol and de-sign a simple join protocol. We evaluate its performance in Section 4.3.

4.2 Join Protocol

When a host wants to join the system, it needs to contact some node who knows the current members in the system. We refer to this node as the rendezvous point. For exam-ple, in End System Multicast [4] and CoopNet [13], the ren-dezvous point is the broadcast source and in NICE [17] the

rendezvous points is separated from the data source. Newly joining members contact the rendezvous point to get a sub-set of members currently in the overlay. Current members in the group maintain periodic keep-alives with the rendezvous point.

4.2.1 Design guidelines

This section motivates design guidelines resulting from the need to handle flash crowds.

The rendezvous point cannot be overloaded. Figure 12 depicts the rendezvous point and the rate of join and keep-alive traffic that it needs to handle. Assuming that join requests, keep-alives and keep-alive responses have packet sizes of+ , and the join responses have packet sizes of +-,/.10324.157698;:=<?>@:A64B , we can formulate the following bandwidth

capacity constraint equations that need to be satisfied:

CED@+FG!'CIHE+F3JLKM (1)

C D + ,N.90O24.157698;:P<Q>R:S64B T!UC H +NJVKW (2)

KXM and K W are the bandwidth resources in the

incom-ing and outgoincom-ing direction that the rendezvous point must dedicate to handling the control traffic. The rates C D

for join rate and C-H keep-alive rate are in packets per

second. Assuming that the rendezvous gives out a list of 100 members to newly joining hosts, the outgoing bandwidth in equation (2) becomes the primary constraint. Substituting values from the largest event, where the peakC%D was 1200

packets/sec and +%,/.1032Y.95$698;:P< is 100 IP-port pairs (600

Bytes), thenKW needs to be more than 5 Mbps to just handle

the membership lists. Keep-alive traffic would increase the bandwidth requirement even more.

Explicit state update is required for maintaining fresh state.

A less obvious failure than overload is that the group mem-bership information given to newly joining hosts is useless

(11)

if it is stale, which is a very important problem for applica-tion end-point based live streaming because staleness is very dynamic in this context. Information is stale if it contains (i) members that have already left the group, or (ii) mem-bers that are currently saturated and cannot support any more children. Newly joining members need fresh, high quality membership information in order to find a parent and join the overlay. Generally, staleness can be addressed by main-taining more frequent keep-alives between members and the rendezvous points. However, using a single rendezvous point to maintain keep-alives between all hosts is very prohibitive. For example, using the session distribution presented in sec-tion 2, optimal saturasec-tion of hosts where a host gets saturated only if no other host in the system can take a child without getting saturated, and a bandwidth limit of 1Mbps, 56% of the membership information given to new hosts will be stale. A general principle to addressing overload is to distribute the load among multiple machines. Our solution leverages this principle. First, we simplify the computation function-ality at the rendezvous server to only serve as a redirec-tion mechanism that redirects incoming hosts to a number of

membership service nodes. Second, the load of maintaining and distributing group membership lists is also distributed over the membership services nodes. We use clustering tech-niques to partition members into clusters and have one group member in each cluster serve as the membership server. Ex-plicit state updates from a host go directly to its membership server, instead of the rendezvous point to maintain fresh state. Our proposed solution uses dynamic membership servers that are selected from amongst the contributors in the system. Because of their dynamic properties,robust mecha-nisms needed to find replacement membership serversfor the cluster becomes another important guideline.

4.2.2 Join protocol

We outline a simple join protocol that we will evaluate in the following section. We wish to highlight that the simplicity of the protocol allows for simple recovery given the dynamic nature of the membership service.

Handling host join A new host joining the system

con-tacts the rendezvous point. If there are existing membership servers, the rendezvous point gives out a subset of that list to newly joining hosts. The host then selects a membership server, either randomly or by using clustering techniques dis-cussed in the next section. It contacts the chosen membership server, who then replies with a fresh list of current members in its cluster. The list is then given to the tree construction protocol. The tree construction protocol is responsible for connecting the newly joining host to some member in that list.

Creating membership serversThe rendezvous point is

re-sponsible for ensuring that there are enough membership servers in the system. Membership servers are created on-demandbased on the needs of the system. For example, when a new host arrives, and there are not enough member-ship servers in the system, the rendezvous point will immedi-ately assign the new host to function as a membership server

(assuming the new host has enough resources to support the control traffic).

Recovering from membership server dynamicsJust before

leaving a membership server looks inside its cluster to see if it can promote one of the hosts to become the new mem-bership server. In some cases, due to resource constraints, a promotion may not be possible. The rendezvous server will notice that the number of membership servers has decreased and will create a new membership server from the newly ar-riving hosts. Note that when a membership server leaves, it only affects the ability for new hosts to join. The data overlay is not affected except for the hosts that are direct descendants of the membership server. The existing hosts that were part of the membership server’s cluster need to find a new mem-bership server. If a promotion was successful, the newly pro-moted host becomes their replacement membership server. The membership state can be quickly refreshed as hosts will maintain both explicit liveness and keep-alives with the re-placement membership server. If a promotion was not suc-cessful, hosts will move to a different membership server.

State maintenance

In order to recover from dynamic membership servers, all membership servers exchange explicit state about their liveness with the rendezvous point. Membership servers also maintain explicit state, liveness, and information about a ran-domly selected subset of members in their cluster. In ad-dition, all hosts inside a cluster exchange explicit state and maintain keep-alives with their membership server. When keep-alive messages are received at the membership server, the membership server will respond with a list of a subset of the other live membership servers in the system and other members outside its cluster (learned from exchanges with other membership servers).

Interplay between control protocol and tree construction

In this protocol, the control structure is loosely coupled with the data forwarding structure in the sense that a poten-tial parent in the data forwarding structure is “discovered” through the control structure. Note that both protocols have full autonomy which makes recovery from failures indepen-dent and simple.

4.3 Resource-aware Clustering

The previous section motivated the need for hierarchy and clustering of hosts in the group management structure. This section considers different clustering policies and evaluates their effectiveness. [18] showed that short delay is reason-ably correlated with good bandwidth. We use Global Net-work Positioning (GNP), geographic and random clustering policies. GNP clustering approximates delay based cluster-ing, however is much more efficient in terms of the measure-ments required [12]. Geographic clustering achieves locality in real world distance, which at a large scale approximates network distance. For example, the delay between two hosts on the east coast of the U.S. will likely be shorter than the delay between a host on the east and one on west coast. Sim-ilarly, hosts within a country will likely have shorter delays than hosts in two different countries.

(12)

0 1 2 3 4 5 0 20 40 60 80 100 120 140 160 180 Quality Index Cluster NaturalClusters.Resource Random Clustering. Geographic Clustering. GNP Clustering.

Figure 13: Quality Index of clusters.

0 10 20 30 40 50 60 70 80 90 100 0 200 400 600 800 1000 1200

Cumulative Distribution of Clusters

Cluster Size

Geographic GNP Random

Figure 14: Cumulative distribution of cluster size.

4.3.1 Data Acquisition and Simulation Setup

GNP coordinates are generated using the algorithm presented in [12]. We use active measurements to all hosts in the largest event from 13 different landmark nodes distributed around the world. Of the 118,921, only 27,305 hosts re-sponded. We generated GNP coordinates with dimention-ality 8 for this set.

Geographic data is acquired from a service provided by the CDN who provided us with the streaming-media logs. This service is the same as the one that provided access tech-nology information and provided us with latitude and lon-gitude coordinates for each host. Manual correlation with known geographic locations showed that the information was accurate for our purposes which is coarse-grain geographic clustering.

We enhanced the tree construction simulator described in section 3 with the distributed join mechanism described in the previous section. However, when hosts join they are directed to the cluster with the closest membership server to them, where the measure of closest depends on the cluster-ing policy. The membership server selection algorithm is the same as described in section 3. Also, the degree distribu-tion used is the same as in secdistribu-tion 3 The following secdistribu-tion present results from simulations conducted using the largest event with GNP and geographic clustering using the subset of hosts for which we had GNP coordinates

4.3.2 Cluster Properties

We first analyze the largest event, which was introduced in section 2. Figure 13 shows the average quality index for all

the clusters. For a system to support all members at a partic-ular data rate, the quality index must be at least 1. Random clustering achieves close to uniform distribution of resources among clusters, where about 7% of clusters have quality in-dex below 1. Since no locality exists with random clustering, the existence of resources does not necessarily mean they are useful. For example, a trans-oceanic link may need to be crossed to access a resource within the same cluster. Both geographic and GNP clustering provide locality within clus-ters, however both policies produce clusters where 20% have quality index below 1. Figure 14 plots the cumulative dis-tribution of cluster size for the three clustering policies. It shows that random clustering achieves cluster sizes with ap-proximately uniform size distribution close to 200. The small clusters result from chosen membership servers staying for a very short duration. We noticed that all clusters with quality index below 1 are very small using random. Geographic and GNP clustering exhibit very different behavior. In general, these two clustering policies produce clusters with a wide variability in size. In addition, they produce a few large clus-ters with low quality index. These clusclus-ters comprise mainly of DSL and cable modem hosts.

It is critical to ensure that (i) the quality index of a cluster stays above one such that hosts can find parents inside their own clusters, and (ii) the cluster sizes are bounded so the membership servers do not get overloaded. Therefore, the clustering algorithm must be aware of both constraints and maintain good values for available resources within clusters and size of clusters.

Cluster Size Maintenance:Two possibilities for bounding the cluster size and handling overflows are: (i)redirection, where new hosts are redirected to the next best cluster un-til an unsaturated one is found and (ii)new cluster creation, where a new contributor host that is supposed to join a full cluster creates a new cluster and new free rider hosts are redi-rected.

Resource Maintenance:Hosts within a cluster with quality index below 1 must find parents outside their cluster. The design presented in the previous section incorporates this idea by having the membership servers provide its cluster members information about members in other clusters. It is still very important to maintain the quality index of a cluster above 1 to reduce the recovery time from disconnects. We redirect free riders (using redirection from above) joining a cluster with quality index at or below 1 to other clusters, but allow contributors in because they either increase the quality index or keep it the same.

4.3.3 Simulation Results

Two parameters in the system that can potentially effect per-formance of hosts are the maximum cluster size and the mimum number of clusters. The number of clusters may in-crease if the new cluster creation mechanism is used to bound cluster size. We use the average and maximum intra-cluster distances to understand the impact of the two parameters and evaluate both control mechanisms.

(13)

0 10 20 30 40 50 60 70 80 90 100 0 100 200 300 400 500 600

Intra-Cluster Distance Average 50 Maximum 50 Average 100 Maximum 100 Average 200 Maximum 200 Average 500 Maximum 500

Figure 15: Intra-cluster distances varying the number of clus-ters 0 10 20 30 40 50 60 70 80 90 100 0 50 100 150 200 250 300 350 400 450 500

Intra-Cluster Distance GNP Average GNP Maximum GNP Redirect Average GNP Redirect Maximum GNP Create New Average GNP Create New Maximum

Figure 16: Intra-cluster distances for bounding cluster size comparing naive (no bound), redirect and create new cluster. Analysis of intra-cluster distances varying the maximum cluster size between 200 and 1000 showed that the intra-cluster distances were not effected. However varying the minimum number of clusters does effect intra-cluster dis-tances. Figure 15 plots the the cumulative distribution of the intra-cluster distances where the number of minimum clus-ters is varied between 50 and 500 and the maximum cluster size is maintained at 200. More clusters results in small intra-cluster distance for each intra-cluster. The average improves only slightly, however the maximum improves significantly from close to 600ms for 50 clusters to about 250ms for 500 clus-ters. All remaining analysis uses 100 minimum clusters and a 200 maximum cluster size.

Next, we evaluate both control mechanisms. Figure 16 shows the cumulative distribution of intra-cluster distances for naive GNP clustering, GNP clustering with redirection and GNP clustering with new cluster creation for bounding the cluster size. For 90% of the clusters the average cluster distance is within 100ms and the maximum reaches 300ms. This figure shows that redirection and new cluster creation do not effect the intra-cluster distances significantly as one may expect when hosts are not put into the closest cluster. Redi-rection of hosts from large clusters, whether they have low quality index or are at full host capacity, does not severely effect intra-cluster distances because there are many other hosts nearby due to the fact that the cluster is large and the second best cluster is probably also nearby. Similarly, us-ing a new contributor to create a new cluster will work well because it should be expected that more hosts will join near

0 10 20 30 40 50 60 70 80 90 100 0 50 100 150 200 250 300 350 400 450 500

Intra-Cluster Distance GNP Average GNP Maximum GNP Redirect Average GNP Redirect Maximum

Figure 17: Intra-cluster distances for bounding quality index comparing naive (no bound) and redirect.

0 10 20 30 40 50 60 0 5 10 15 20 25 30 35 40 45 50

Streams

Random Random Minimum Depth Random Longest-first Random Random Geographic Minimum Depth Geographic Longest-first Geographic 0 10 20 30 40 50 60 0 5 10 15 20 25 30 35 40 45 50

Streams

Random Random Minimum Depth Random Longest-first Random Random Geographic Minimum Depth Geographic Longest-first Geographic

Figure 18: Interval between ancestor change it.

Figure 17 shows the cumulative distribution of intra-cluster distances for naive GNP and GNP using redirection for low quality index. Again, intra-cluster distances are not significantly effected.

We compared the intra-cluster distances from our re-sults above to distances resulting from clustering using the k-means algorithm. The key difference is that k-means will choose an optimal cluster center based on neighboring coor-dinates and our proposed mechanism chooses cluster heads randomly. The results for k-means assume all hosts are present at the same time so a direct comparison is not pos-sible, however, it is still useful. Using k-means, 90% of the clusters had an average intra-cluster distance less than 150ms. Splitting large clusters and choosing optimal cluster heads is not a critical problem when clustering for the con-trol plane. However, this does not convey the performance of hosts participating in the tree, which we consider next.

4.3.4 Summary of Results

We evaluate the same set of streams from section 3 using ge-ographic clustering with redirection to bound cluster size and maintain quality index above 1. We do not use GNP cluster-ing here because we did not have GNP coordinates for hosts in all events. Figure 18 plots the interval between ancestor change for the different tree construction and clustering poli-cies. The interval between ancestor change is very close to the result presented in section 3 without clustering. There are two ways that clustering can effect tree stability: (i) hosts