• No results found

Recall that Pastry-style partitioning uses a circular DHT key space (from0to2160−1) which wraps around. Each node has an ID in the DHT key space, typically the hash of it’s IP address (or IP address and some unique local identi er, such as TCP port, if multiple virtual nodes are being run on the same physical node). ese nodes are then placed onto the DHT key space ring, and each node is responsible for, or “owns,” the region of the key space closer to it than to its neighbors. is is shown in Figure 5.1a. If data is then replicated with a replication factorr, each data item (and therefore each portion of the key space) is stored atr−1other nodes as well, for a total ofrcopies. In this chapter, we assume thatris always odd, and that each data item is replicated at r−21nodes counterclockwise

from the node that owns it, andr−1

2 node clockwise as well. is leads to the overlapping assignment

of regions of the DHT key space (for purposes of data storage) to nodes as shown in Figure 5.1b. In a DHT-based storage system and query processor such as ours, the partitioning of the key space is used both to determine where persistent data is retrieved from persistent storage, and how intermediate results are partitioned. Each region of the key space must be assigned to exactly one node. One such assignment in shown in Figure 5.1a. Note, however, that this assignment is rather lopsided.

De nition 9(Routing Imbalance). We use therouting imbalancemetric as a measure of the quality of a partitioning. e routing imbalance is the largest action of the key space assigned to one node divided by the action assigned to each node in a perfectly even assignment; this is equivalent to the largest action times n, the total number of nodes. Since in any assignment one node must be responsible for at leastn1 of the key

space, and can be responsible for at most all of it, the routing imbalance ranges om one ton. If all nodes in a system are equally powerful, the node that is assigned to the largest action of the key space, and therefore has more work routed to it, is typically the bo leneck; therefore higher routing imbalance is strongly correlated with the execution time of a query. For the time being we assume homogeneous nodes, and will address node heterogeneity shortly.

We can use the routing imbalance to quantify how lopsided the partitioning in Figure 5.1a is. From the node IDs, for Pastry-style partitioning we can determine thatn1owns the region[0x00…, 0x25…),n2the region[0x25…,0x35…),n3the region[0x35…,0x90…), andn4 the region [0x90…,0x00…). n4therefore owns 167 0.44of the key space, giving a routing imbalance of

1.75. If the nodes are equally powerful, this partitioning would lead to much worse performance than the totally even partitioning we used in Chapter 4, sincen4in particular would be overloaded,

andn2in particular would be underloaded.

Since we have commi ed to preserving the property of partitioning resiliency, we cannot directly apply the techniques of even partitioning from Section 4.2 for all operations. Scans of persistent stor- age must occur where a copy of the data is stored. We could, in principle, adopt even partitioning for all operations that take place a er a scan, but use Pastry-style partitioning for base data. is com- plexity of multiple partitionings per query has several drawbacks. First, it signi cantly increases the complexity of the system, and would have required extensive modi cation to our existing codebase. Second, it reduces the potential for exploiting data locality to increase performance; data would have to be repartitioned using the second partitioning to, say, join a table on an a ribute that was parti- tioned by that a ribute.

We decided to continue to use Pastry-style partitioning for data storage, since this preserves par- titioning resiliency. However, since the data is replicated, each data item is stored at many different nodes in the system; the overlapping regions of the key space stored at multiple node are shown in Figure 5.1b. is redundancy is the key to our approach, as it means that data items are available at nodes other than the nodes that own according to the original partitioning of the key space. Further- more, since replication is designed for seamless failover, redundant data is placed in such a way that we can adjust the boundaries of nodes’ partitions in the key space by sliding them; in particular, we can slide them to reduce the load on overloaded nodes and increase the load on underloaded nodes. is exibility gives us signi cant opportunities to improve query performance by using a partitioning for query

(a) Balanced Partitioning (b) Weighted Partitioning

Figure 5.2: Optimized partitioning of the DHT key space. Here we show several partitionings of the key space that are possible, given the replicated data placement shown in Figure 5.1b. e overlap- ping colored ring segments inside the black ring show where the data is available, and the coloring of the outer ring shows which regions of the key space are assigned to which node. Figure 5.2a shows how the key space can be partitioned to assign equal amounts of data to each of the four nodes, and Figure 5.2b shows how it can be partitioned to give much less data to noden4(represented by the

do ed orange lines). ese partitionings respect the data availability due to replication.

execution that is much more balanced.We can tweak the regions of the key space assigned to each node to create a more balanced partitioning; by ensuring that the region assigned to each node falls within the region of the key space that it hasreplicateddata for, we retain the ability to scan persistent storage. is eliminates the need for two different partitionings that was a problem with the above proposal to use totally even partitioning only for intermediate results. Figure 5.2a shows how the redundancy in Figure 5.1b can be used to create a more even partitioning; each each node has an exactly even fraction of the key space, giving a routing imbalance of one.

ere are of course occasions where the routing imbalance is not as strongly correlated with per- formance. For example, if nodes are heterogeneous, then we want to allocate more of the key space to more powerful nodes, and less to less powerful nodes, in proportion to some sort ofcapabilitymetric that is directly proportional to query execution speed at each node. Suppose that, of the nodes shown in Figure 5.1b, noden4is approximately six times less powerful than the other nodes, therefore has

key space to noden4(the do ed orange regions) than to each of the other nodes. In this case, in-

stead of creating a partitioning with a lower routing imbalance, we want to create a partitioning with a lowerweighted routing imbalance, which considers the node with the largest fraction of the key space divided by its capability; we then normalize this by multiplying by the sum of all nodes’ capabili- ties. is metric ranges from one (optimal) to the quotient of the sum of all nodes’ capabilities and the least powerful node’s capability (worst outcome). Figure 5.2b shows a partitioning that mini- mizes this quantity to one. Noden4has of the key space, giving its weighted routing imbalance

as(191 ÷1

)

×19=1, and the other nodes have their routing imbalance as(196 ÷6

)

×19 =1; the maximum of these is one, so the overall routing imbalance is also one.

Another case case where naïve routing imbalance is not ideal occurs when data is skewed. Here, the implicit assumption that data points are evenly distributed through the key space is no longer true. erefore, instead of considering fractions of the key space when computing the routing imbalance, we must consider fractions of all data. is is a very difficult problem when considering skew in the distribution of intermediate results or different skews in different relations; we want to continue to use only one partitioning during query execution, as explained above. erefore, we consider data skew only to reduce the scanning cost of a single relation, when that cost may be the dominant component of query execution cost.

In the above examples, the data was highly replicated. Each portion of the key space was stored at of the nodes, giving us considerable leeway in choosing a partitioning. In general, however, each portion of the key space may be stored at a much smaller fraction of the nodes. ere may be a string of successive nodes with IDs very close together, meaning that some nodes may not have anywhere near their “fair share” of the key space. In the more general se ing, it is not clear what an algorithm to partition the key space optimally would look like. Decisions made locally (i.e. to assign a certain region of the key space to a particular node) can have global effects by forcing overly large or small regions of the key space to be assigned to other nodes because of the constraints imposed by data availability. However, the problem is relatively simply speci ed: we want to minimize the (weighted) routing imbalance, subject to the constraint that each node only be assigned regions of the key space it has data for. ere are general so ware tools (and indeed an entire eld of computer science research) devoted to the general problems of optimization and constraint satisfaction. In the next section, we describe how we have formulated this problem for such a tool.