Any scalable substrate for data storage in a peer-to-peer setting needs to adopt techniques for (1) data partitioning, (2) data retrieval, and (3) handling node membership changes, including failures. We describe how our custom hashing-based storage layer addresses these issues in a way that is fully decentralized and supports multiple administrative domains.
Data Partitioning
Like most content-addressable overlay networks, we adopt a hash-based system for data placement. Similar to previous well-known distributed hash tables (DHTs) such as Pastry (Rowstron and Druschel, 2001), we use 160-bit unsigned integers as our key space, matching the output of the SHA-1 cryptographic hash function. It is convenient to visualize the key space as a ring of values, starting at 0 and increasing clockwise until they reach 2^160 − 1, at which point they overflow back to 0. Figure 4.2 shows two examples of this ring that we will discuss in more detail.
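As a minimal sketch of the key space described above, the following Python maps arbitrary bytes onto the 160-bit ring and measures clockwise distance with wraparound. The function names `key_for` and `clockwise_distance` are illustrative assumptions, not names from the system itself.

```python
import hashlib

KEY_BITS = 160
KEY_SPACE = 1 << KEY_BITS  # number of distinct keys: 2^160


def key_for(value: bytes) -> int:
    """Map arbitrary bytes to a ring position using SHA-1 (hypothetical helper)."""
    return int.from_bytes(hashlib.sha1(value).digest(), "big")


def clockwise_distance(a: int, b: int) -> int:
    """Distance travelled going clockwise from a to b, wrapping past 2^160 - 1."""
    return (b - a) % KEY_SPACE
```

For example, `clockwise_distance(KEY_SPACE - 1, 0)` is 1, which reflects the overflow from the largest key back to 0.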
Most overlay networks assign a position in the ring to each node according to a SHA-1 hash of the node's IP address (forming a DHT ID). Values are placed at nodes according to the relationship between node IDs and the values' hash keys. In Chord, keys are placed at the node whose hashed IP address lies ahead of them on the ring; in Pastry, keys are placed at the node with the nearest hash value. The Pastry scheme of partitioning the key space among the participant nodes is shown in Figure 4.2a. Both of these approaches can determine the range a node "owns" given its ID and the IDs of its neighbors. These schemes are optimized for settings with large numbers of nodes, and assume the nodes will be more or less uniformly distributed across the ring. Each node maintains information about the position of a limited number of its neighbors, as it has a routing table with a number of entries logarithmic in the membership of the DHT. DHTs, as presented so far, often have highly nonuniform (approximately Poisson) distributions of values among the peers. Indeed, in the figure, nodes n3 and n4 are together responsible for a disproportionately large share of the key space, while node n2 is responsible for only a small share of it.
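The two placement rules can be sketched as follows. This is a simplified illustration under stated assumptions: node IDs are plain integers on the ring, `chord_owner` and `pastry_owner` are hypothetical names, and ties in the nearest-node rule are broken arbitrarily, which real systems handle more carefully.

```python
from bisect import bisect_left

KEY_SPACE = 1 << 160


def chord_owner(node_ids, key):
    """Chord-style rule: the first node at or clockwise after the key."""
    ids = sorted(node_ids)
    i = bisect_left(ids, key)
    return ids[i % len(ids)]  # wrap to the smallest ID past the top of the ring


def ring_distance(a, b):
    """Shortest distance between two ring positions, in either direction."""
    d = abs(a - b)
    return min(d, KEY_SPACE - d)


def pastry_owner(node_ids, key):
    """Pastry-style rule: the node whose ID is numerically nearest on the ring."""
    return min(node_ids, key=lambda n: ring_distance(n, key))
```

With nodes at 10, 100, and 1000, key 50 goes to node 100 under the Chord-style rule but to node 10 under the Pastry-style rule, since 10 is nearer.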
The canonical approach to load imbalance in DHTs is to use many virtual nodes, also referred to as virtual servers in, e.g., Dabek et al. (2001), at each physical node. This makes it much more likely that each physical node receives an approximately even fraction of the key space; even though the amount assigned to individual virtual nodes will vary, the total amount assigned to a large number of them is very consistent with high probability. Virtual nodes have some drawbacks; in particular, they lead to key space fragmentation. For our purposes, it is very advantageous to assign a single contiguous key range to each node: in addition to reducing the size of the routing table, this improves data retrieval performance by allowing us to colocate data at nearby DHT keys, as discussed in Section 4.3.
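To make the effect of virtual nodes concrete, the sketch below computes the fraction of the key space owned by each physical node when every physical node places several virtual nodes on the ring. It assumes Chord-style ownership (each virtual node owns the arc from its predecessor up to itself) and derives virtual node IDs by hashing a made-up label of the form `name#index`; both choices are illustrative assumptions rather than details from the text.

```python
import hashlib

KEY_SPACE = 1 << 160


def ring_id(label: str) -> int:
    """Hash a label to a ring position (hypothetical ID scheme)."""
    return int.from_bytes(hashlib.sha1(label.encode()).digest(), "big")


def key_space_shares(physical_nodes, vnodes_per_node):
    """Return {physical node: fraction of the key space it owns}."""
    points = sorted(
        (ring_id(f"{node}#{v}"), node)
        for node in physical_nodes
        for v in range(vnodes_per_node)
    )
    owned = {node: 0 for node in physical_nodes}
    for i, (vid, node) in enumerate(points):
        pred = points[i - 1][0]  # index -1 wraps to the last point on the ring
        # Arc length from predecessor (exclusive) to this virtual node (inclusive);
        # the -1/+1 makes a lone point own the whole ring.
        owned[node] += ((vid - pred - 1) % KEY_SPACE) + 1
    return {node: arc / KEY_SPACE for node, arc in owned.items()}
```

Comparing `key_space_shares(nodes, 1)` with `key_space_shares(nodes, 100)` for the same node list is one way to observe how the spread between the largest and smallest shares tends to narrow as the number of virtual nodes grows, at the cost of each physical node owning many small, non-contiguous ranges.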
For this chapter, we adopt a simple solution we term totally even partitioning. We divide the key space into evenly sized sequential ranges, one for each node, and assign the ranges in order to the nodes, sorted by their hash ID. Such an assignment for the same network we examined for Pastry-style partitioning is shown in Figure 4.2b; it distributes the key space, and therefore the data, uniformly among the nodes. In response to node arrival or failure, we redistribute the ranges over the new node set. This approach is more sensitive to churn than the Pastry approach. Pastry has what we term partitioning resiliency: only a neighboring node's arrival or departure causes a change to a node's owned range. In totally even partitioning, any node's arrival or departure will cause a change to a node's owned range. Partitioning resiliency reduces the effect of churn by causing less of the key space (and therefore less data) to be moved when a node arrives or departs¹. However, in a smaller, low-churn network the performance benefits afforded by more even data distribution outweigh the lack of partitioning resiliency. In this chapter, our experiments exclusively consider totally even partitioning to explore the limits of scale-up. In Chapter 5, we will explore an alternative approach that uses the flexibility afforded by data replication to offer almost as good performance while restoring partitioning resiliency.
¹It is worth noting that the use of virtual nodes can also increase the effects of churn by giving each physical node more neighboring nodes. However, it does not increase the number of data items that must be moved, only the number of (now much smaller) sections of the key space that must be moved.
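Totally even partitioning, as described above, can be sketched in a few lines. The sketch assumes that the first range starts at key 0 and that range boundaries are computed by integer division, neither of which the text specifies; `even_partition` and `changed_nodes` are hypothetical names.

```python
KEY_SPACE = 1 << 160


def even_partition(node_ids):
    """Split the ring into equal sequential ranges, assigned in hash-ID order.

    Returns a list of (start, end_exclusive, node_id) tuples.
    """
    ids = sorted(node_ids)
    n = len(ids)
    bounds = [i * KEY_SPACE // n for i in range(n + 1)]
    return [(bounds[i], bounds[i + 1], ids[i]) for i in range(n)]


def changed_nodes(before, after):
    """Nodes present in both partitionings whose owned range differs."""
    old = {node: (s, e) for s, e, node in before}
    new = {node: (s, e) for s, e, node in after}
    return {node for node in old if node in new and old[node] != new[node]}
```

Calling `changed_nodes(even_partition([10, 20, 30, 40]), even_partition([10, 20, 25, 30, 40]))` returns all four original nodes, which illustrates the lack of partitioning resiliency: a single arrival shifts every existing range.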
Additionally, our partitioning approach assumes that all nodes are equally powerful, as they are assigned the same fraction of the key space and therefore approximately the same amount of data. This assumption holds true for the experiments (with one explicit exception) in this chapter. The virtual node approach can compensate for heterogeneity by assigning more virtual nodes to more powerful or better connected nodes, making them more likely to own a larger fraction of the key space. The techniques of Chapter 5 will also consider node heterogeneity. For the rest of this chapter, however, we focus on the simpler case of totally even partitioning over totally homogeneous nodes.
Data Retrieval
As mentioned above, a traditional DHT node maintains a routing table with only a limited number of entries (typically logarithmic in the number of nodes). This reduces the amount of state required, enabling greater scale-up, but requires multiple hops to route data. Recent peer-to-peer research (Gupta et al., 2004) has shown that storing a complete routing table (describing the partitioning of the key space among all nodes) at each node provides superior performance for up to thousands of nodes, since it provides single-hop communication in exchange for a small amount of state; we therefore adopt this approach. Our system requires a reliable, message-based networking layer connection with flow control. We found experimentally that, for scaling at least to one hundred nodes, maintaining a direct TCP connection to each node was feasible. With the use of modern non-blocking I/O, a single thread easily supports hundreds or thousands of open connections. For larger networks, a UDP-based approach could be developed to avoid the overhead of maintaining TCP's in-order delivery guarantees, as all of the techniques described here are independent of message ordering.
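With a complete routing table at every node, finding the owner of a key is a local binary search followed by a single network hop. The sketch below shows only the local lookup; the `RoutingTable` class, its constructor arguments, and the string node addresses are assumptions made for illustration.

```python
from bisect import bisect_right


class RoutingTable:
    """Complete map from range start to node address (hypothetical structure).

    `entries` is a list of (range_start, node_address) pairs covering the whole
    key space; the first range is assumed to start at key 0.
    """

    def __init__(self, entries):
        entries = sorted(entries)
        if not entries or entries[0][0] != 0:
            raise ValueError("routing table must cover the key space from 0")
        self._starts = [start for start, _ in entries]
        self._nodes = [node for _, node in entries]

    def owner(self, key):
        """Return the address of the node whose range contains the key."""
        return self._nodes[bisect_right(self._starts, key) - 1]
```

The lookup costs O(log n) comparisons in memory and no routing hops, whereas a logarithmic-size routing table would need O(log n) network hops for the same request.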
Node Arrival and Departure
Traditional DHTs deal with node arrival and departure through background replication. Each data item is replicated at some number of nodes (known as the replication factor). In Pastry, for example, for a replication factor r, each item is replicated at ⌊r/2⌋ nodes clockwise from the node that owns it, and the same number counterclockwise from it, leading to r total copies. In the ring of Figure 4.2a, if r = 3, each data item that is owned by node n1 will be replicated to n4 and n2 as well. When a
be stored at one of its neighbors. If a node leaves, each of its neighbors already has a copy of the data that it owned, so they are ready to respond to queries for data stored at the departed node. A similar approach works in the totally even partitioning scheme we adopt.
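The neighbor-based replica placement can be sketched as follows. The sketch assumes the nodes are given in ring order and that the ring of the running example is ordered n1, n2, n3, n4; `replica_set` is a hypothetical helper name.

```python
def replica_set(ring_nodes, owner_index, r):
    """Nodes holding a copy of an item: the owner plus floor(r/2) neighbors
    on each side of it on the ring.

    `ring_nodes` lists the nodes in clockwise ring order. Duplicates are
    dropped, so very small rings yield fewer than r distinct copies.
    """
    k = r // 2
    n = len(ring_nodes)
    holders = []
    for offset in range(-k, k + 1):
        node = ring_nodes[(owner_index + offset) % n]
        if node not in holders:
            holders.append(node)
    return holders
```

For the ring `["n1", "n2", "n3", "n4"]` with r = 3, the replica set for items owned by n1 is n4, n1, and n2, matching the example in the text. Note that the count of r total copies is exact only for odd r.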
This approach makes an implicit assumption that all of the state at the nodes is stored in the DHT, and therefore that any node that has a copy of a particular data item can handle requests for it. If a node joins or fails, certain requests will suddenly be re-routed to different nodes, which are assumed to provide identical behavior (and hence do not get notified of this change). This does not work in the case of a distributed query processor, where in addition to persistent stored data there may be distributed "soft state" that is local to a query and is not replicated; this includes operator state, such as the intermediate results stored in a join operator or an aggregator. If data for a particular range is suddenly rerouted from one node to another, tuples might never "meet up" with other tuples they should join with, or data for a single aggregate group may be split across multiple nodes, causing incorrect results.
To solve this problem, our system works on snapshots of the routing table, which for us is the complete partitioning of the key space, since all nodes have global knowledge of other nodes' owned partitions. When a participant initiates a distributed computation, it sends out a snapshot of its current routing table, which all nodes will use in processing this request. Therefore, if a new node joins in mid-execution, it does not participate in the current computation (otherwise it may be missing important state from messages prior to its arrival). If a node fails, the query processor can detect what data was owned by the failed node, and thus can reprocess this state (this is discussed in Section 4.4). Our system must still handle replication of base data, which is done in a manner very similar to that of Pastry; each data point is replicated at ⌊r/2⌋ nodes clockwise and counterclockwise from the node that owns it. This ensures that data can survive multiple node failures, and that in the event of a node failure, the nodes that take over for a failed node have copies of the base data for the sections of the ring they are newly responsible for. Unlike in Pastry, a single node arrival or departure will cause all the ranges in the ring to change slightly; this causes a membership change to be more expensive, but we are assuming reasonable bandwidth and less frequent failures. With smaller numbers of fairly reliable nodes, the performance benefits of uniform distribution likely outweigh the costs of extra shipping.
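The snapshot idea can be illustrated with an immutable routing table that is captured when a computation starts and is unaffected by later membership changes. This is a sketch under assumptions: the `Snapshot` class, `take_snapshot`, and the use of even partitioning starting at key 0 are illustrative, and the actual message format and recovery protocol are not shown.

```python
from bisect import bisect_right
from dataclasses import dataclass

KEY_SPACE = 1 << 160


@dataclass(frozen=True)
class Snapshot:
    """Immutable copy of the key space partitioning, shipped with a request."""
    starts: tuple  # range start keys, ascending, first one is 0
    nodes: tuple   # nodes[i] owns [starts[i], starts[i + 1])

    def owner(self, key):
        return self.nodes[bisect_right(self.starts, key) - 1]

    def range_of(self, node):
        """Range owned by a node, e.g. to find what a failed node held."""
        i = self.nodes.index(node)
        end = self.starts[i + 1] if i + 1 < len(self.starts) else KEY_SPACE
        return (self.starts[i], end)


def take_snapshot(node_ids):
    """Freeze the current even partitioning over the given membership."""
    ids = sorted(node_ids)
    n = len(ids)
    return Snapshot(tuple(i * KEY_SPACE // n for i in range(n)), tuple(ids))
```

A computation started with `snap = take_snapshot(members)` keeps routing through `snap.owner(key)` even if `members` later gains a node, so the newcomer receives none of that computation's tuples. If a node in the snapshot fails, `snap.range_of(failed)` identifies the key range whose state needs to be reprocessed.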
Currently we only replicate data as it is inserted into the DHT. This has been sufficient for the development and experimental analysis of our system, since we inserted data before any node failures, and failed few enough nodes that data was never lost. For completeness we plan to implement the Bloom filter-based background replication approach of the Pastry-based PAST storage system described in Druschel and Rowstron (2001), which can be directly applied to our context.
Using the Substrate on Cloud Services
One of the goals of our work is to be able to scale not only across the participants in the O - system, but also, especially as query load increases, to be able to flexibly expand to incorporate cloud computing nodes. As we describe later, O can employ Amazon's Elastic Compute Cloud (EC2), which provides virtual Linux nodes upon which our query processor can be easily deployed. EC2 has several regionally distributed host sites with excellent connectivity, and we can quite efficiently migrate and replicate data to the EC2 nodes and bring them up to speed.