2.3.3 Query Routing
Data-partitioning algorithms need to route queries to the nodes which hold the concerned data items.
For algorithms based on consistent hashing, there is a tradeoff between perfor- mance and robustness. For example, Chord has been to designed for unstable P2P environments where nodes can join and leave freely [98]. Chord improves the ro- bustness of consistent hashing by avoiding the requirement that every node knows about every other node. In an N-node network, each node maintains information about O(logN) other nodes, which can route any query. A query may need to route between multiple nodes before reaching the destination nodes, and a lookup requiresO(logN)network messages. On the other hand, joining and leaving of nodes only affects a small part of the system.
In data centers where node failures are rare, optimizing for performance is often preferred to robustness. Dynamo targets latency-sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds [35]. Instead of using DHT routing, it maintains the complete membership information across all nodes of the system, such that queries are routed in 1 hop only.
For range-based algorithms, the key-range assignment information has to been known to route queries. Bigtable employs a master node to manage the key-range assignment information. The master node is responsible for assigning tablets to nodes, detecting the addition and expiration of nodes, and balancing load across nodes. Nodes cache this key-range assignment locally to route queries without looking up locations remotely.
PNUTS [29] supports both hash-based and range-based data partitioning. Ta- bles are ordered by the partition key, then partitioned into tablets and assigned to different servers. A binary search tree of tablets is maintained to efficiently map each record to a tablet. PNUTS uses standalone query routers, which maintain a cached copy of the mapping information and can directly route queries to the right node. The mapping itself is owned by the table controller. Routers maintain only soft states. If a router fails, PNUTS can simply start a new one instead. Query routers detect changes in data locations when query routing fails. They can simply reload the new mapping from the table controller.
2.4
Complex Query Support
Centralized relational databases commonly support the SQL language, which pro- vides rich semantics for complex queries such as secondary-key queries, range queries, join queries and aggregations. Implementing these complex queries in a centralized database is a well-understood topic [94]. However, supporting them in distributed and partitioned data stores remains a challenge.
A number of research efforts have been conducted to implement specific types of complex queries in distributed databases. However, they typically rely on a spe-
cific data-partitioning design targeted at one particular type of complex query. For example, the hash-join algorithm is widely used for join queries and it can be natu- rally parallelized for a distributed environment [90]. Parallel hash-join algorithms partition tables by the hash value of the join attribute. Records that have the same hashed value of the join attribute are assigned to the same machine, so identifying matching records can be executed by each machine locally. However, as the parti- tioning key is the join attribute rather than the primary key, such data-partitioning design is efficient only for queries that give the value of the join attribute in ad- vance or request a full join of two tables. This partitioning scheme implies large overheads for executing other queries such as primary-key queries and other join queries.
Centralized databases typically rely on indexes to implement complex queries efficiently. In a partitioned and distributed data store, one option is to maintain scalable and distributed B-trees of indexed columns [3]. By replicating all inner nodes and partitioning the leaf nodes of the B-tree, such a distributed data structure allows efficient data location and retrieval. However, this technique faces chal- lenges in the case of update-intensive workloads, as this causes frequent updates of the inner nodes. In addition, for large data sizes, replicating all inner nodes at all machines may be prohibitively expensive.
Another approach to implement a scalable general-purpose database is to run any number of database engines in the cloud, and use the cloud’s file system as the shared storage medium [17]. Each engine has access to the full data set and can therefore support any type of SQL queries. However, cloud file systems usually have very high latency compared to an ordinary local disk drive. A complex query accessing many data items would require accessing the cloud file system multiple times, resulting in very long query execution times.
As described in Section 2.3, NoSQL data stores typically partition data by their primary key. Such data-partitioning design is therefore not efficient for com- plex queries where the primary keys of accessed records are not given in advance. Bigtable [27] supports secondary-key and range queries by scanning the ordered table with user-defined filters on values of columns. However, this algorithm is not efficient for response-time-sensitive Web applications. SimpleDB [6] supports secondary-key and range queries automatically by indexing data. This clearly im- proves performance compared to Bigtable. However, such queries are restricted within a single table partition (or “domain” in SimpleDB terminology). Cassan- dra [23] maintains indexes of user-specified columns and supports secondary-key queries over these indexed columns. The indexes are maintained in a new column- family with the value of the indexed column as the primary key. Cassandra par- titions these index records in a hash-based approach so range queries cannot be supported. To our best knowledge, no scalable data store supports complex queries across tables, such as join queries. We will return to this topic in Chapter 5 where we discuss join query support in CloudTPS.