3.4 Large-scale Data Analysis Based on MapReduce

3.4.3 Shared-Nothing Parallel Databases vs MapReduce

Although it can run on different hardware, MapReduce typically runs on a shared-nothing architecture, where computing nodes are connected by a network and share neither memory nor disk. Many parallel databases have also adopted the shared-nothing architecture, as in the parallel database machines Gamma (54) and Grace (56).

Though MapReduce and parallel databases target different users, it is in fact possible to write almost any parallel processing task as either a set of database queries or a set of MapReduce jobs (88). This led to controversy about which system is better for large-scale data processing, including criticism of the newly emerging MapReduce. Some researchers in the database field even argued that MapReduce is a step backward in the programming paradigm for large-scale data-intensive applications (53). However, more and more commercial database software has begun to integrate cloud computing concepts into its products. Existing commercial shared-nothing parallel databases suitable for data analysis applications in the cloud are Teradata, IBM DB2, Greenplum, DATAllegro, Vertica and Aster Data. Among these, DB2, Greenplum, Vertica and Aster Data are naturally suitable, since their products could theoretically run in the data centers hosted by cloud computing providers (25). It is therefore interesting to compare the features of both systems.

3.4.3.1 Comparison

We compare shared-nothing parallel databases and MapReduce in the following aspects:

Data partitioning Despite their many differences, shared-nothing parallel databases and MapReduce share one feature: in both systems the data set is partitioned. However, because data in a shared-nothing parallel database is structured in tables, partitioning is done with specific data partitioning methods. Partitioning takes the data semantics into account and runs under the control of the user. In contrast, data partitioning in a typical MapReduce system is done automatically by the system, and the user can influence it only in limited ways, for example by configuring the block size. The semantics of the data are not considered during partitioning.
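The contrast between the two partitioning styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `customer` column, the node count and the 10-byte block size are all hypothetical, and real systems use their own partitioning functions and much larger blocks.

```python
import zlib

def hash_partition(rows, key, num_nodes):
    """Database-style partitioning (sketch): the user names a key column,
    so rows with equal key values always land on the same node."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        # crc32 is used only to get a hash that is stable across runs.
        node = zlib.crc32(str(row[key]).encode("utf-8")) % num_nodes
        partitions[node].append(row)
    return partitions

def block_partition(data, block_size):
    """MapReduce-style partitioning (sketch): the input is cut into
    fixed-size blocks without looking at what the bytes mean."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Hypothetical sample data.
rows = [
    {"customer": "alice", "amount": 10},
    {"customer": "bob", "amount": 5},
    {"customer": "alice", "amount": 7},
]
by_customer = hash_partition(rows, "customer", num_nodes=4)

raw = b"alice,10\nbob,5\nalice,7\n"
blocks = block_partition(raw, block_size=10)
print(by_customer)
print(blocks)
```

In the first case both `alice` rows end up in the same partition because the partitioning key carries meaning. In the second case the block boundary falls in the middle of a record, which is harmless for storage but shows that the split ignores the data semantics; the only knob in this sketch is `block_size`.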

Data distribution In a shared-nothing parallel database, knowledge of the data distribution is available before query processing. This knowledge can help the query optimizer achieve load balancing. In a MapReduce system, the detailed data distribution remains unknown, since distribution is done automatically by the system.

Support for schema Shared-nothing parallel databases require data to conform to a well-defined schema; data is structured in rows and columns. In contrast, MapReduce permits data in any arbitrary format. The MapReduce programmer is not bound to a schema, and the data may have no structure at all.

Programming model Like other DBMSs, shared-nothing parallel databases support a high-level declarative programming language, i.e. SQL, which is well known and widely accepted by both professional and non-professional users. With SQL, users only need to declare what they want, not provide a specific algorithm to compute it. In a MapReduce system, however, developers must provide an algorithm that carries out the query processing.
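To make the difference concrete, the following sketch computes the same per-customer total twice: once as a declarative SQL query and once as explicit map and reduce functions. Several things here are assumptions made for illustration only: the `sales` table and its columns are hypothetical, SQLite stands in for a database engine merely because it ships with Python (it is not a parallel database), and `run_job` is a toy single-process driver rather than any real MapReduce API.

```python
import sqlite3
from collections import defaultdict

# Hypothetical input: (customer, amount) records.
sales = [("alice", 10), ("bob", 5), ("alice", 7)]

# Declarative style: state WHAT is wanted; the engine picks the algorithm.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
sql_totals = dict(
    conn.execute("SELECT customer, SUM(amount) FROM sales GROUP BY customer")
)
conn.close()

# MapReduce style: the developer writes HOW the result is computed.
def map_fn(record):
    customer, amount = record
    yield customer, amount          # emit one (key, value) pair per record

def reduce_fn(key, values):
    return key, sum(values)         # combine all values seen for one key

def run_job(records, mapper, reducer):
    """Toy driver: group mapper output by key, then call the reducer."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

totals = run_job(sales, map_fn, reduce_fn)
print(sql_totals, totals)
```

Both paths produce the same totals. The SQL version is one statement, while the MapReduce version needs the grouping and aggregation logic to be spelled out by the developer.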

Flexibility SQL is routinely criticized for its insufficient expressive power. To mitigate this limitation, shared-nothing parallel databases allow user-defined functions. MapReduce offers good flexibility by allowing developers to implement all the calculations in the query processing themselves.

Fault tolerance Both parallel databases and MapReduce use replication to deal with disk failures. However, parallel databases cannot handle node failures: since they do not save intermediate results, the whole query processing must be restarted once a node fails. MapReduce is able to handle node failures during the execution of a MapReduce computation. The intermediate results (from mappers) are stored before the reducers are launched, so that processing does not have to start from zero after a node failure.
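The benefit of storing intermediate results can be illustrated with a small simulation. It is a sketch under simplifying assumptions, not a description of any real scheduler: `run_job`, the `fail_once_on` parameter and the word-count workload are hypothetical, a "node failure" is simulated by discarding the first reduce attempt for the chosen keys, and only reducer failures are modelled.

```python
from collections import defaultdict

def run_job(splits, fail_once_on):
    """Word count with simulated reducer failures (single-process sketch)."""
    stats = {"map_runs": 0, "reduce_runs": 0}

    # Map phase: every mapper's output is stored before any reducer starts.
    stored = []
    for split in splits:
        stats["map_runs"] += 1
        stored.append([(word, 1) for word in split.split()])

    # Group the stored intermediate results by key.
    groups = defaultdict(list)
    for output in stored:
        for key, value in output:
            groups[key].append(value)

    # Reduce phase: a failed attempt is retried from the stored map output,
    # so no map task has to run again.
    pending_failures = set(fail_once_on)
    results = {}
    for key, values in groups.items():
        while True:
            stats["reduce_runs"] += 1
            if key in pending_failures:
                pending_failures.discard(key)   # simulated crash, first attempt only
                continue
            results[key] = sum(values)
            break
    return results, stats

results, stats = run_job(["a b a", "b c"], fail_once_on={"b"})
print(results, stats)
```

In this run the reducer for key `b` fails once and is retried, giving four reduce attempts in total, while the number of map runs stays at two. A system that kept no intermediate results would have to restart the whole computation instead.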

Indexing Parallel databases have many indexing techniques, such as hash or B-tree indexes, to accelerate data access. MapReduce does not have built-in indexes.

Support for transactions Support for transactions requires the processing to respect the ACID properties. Shared-nothing parallel databases support transactions, since they can easily respect ACID. It is difficult for MapReduce to do so. Note that in large-scale data analysis, ACID is not really necessary.

Scalability Shared-nothing parallel databases scale well to tens of nodes, but have difficulty going further. MapReduce has very good scalability, as demonstrated by Google’s use of it. It can scale to thousands of nodes.

Adaptability to heterogeneous environments As shared-nothing parallel databases are designed to run in a homogeneous environment, they are not suited to a heterogeneous environment. MapReduce is able to run in a heterogeneous environment.

Execution strategy MapReduce has two phases, the map phase and the reduce phase. Each reducer needs to pull its input data from the nodes where the mappers ran. Shared-nothing parallel databases use a push approach to transfer data instead of pull.
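The difference between pull and push transfer can be sketched as follows. This is a single-process illustration with hypothetical names (`mapper_output`, `pull`, `push`, `NUM_REDUCERS`); network transfer and disk storage are replaced by plain Python lists, so the sketch shows only which side initiates the transfer.

```python
import zlib

NUM_REDUCERS = 2  # hypothetical number of consumers

def partition_of(key):
    # Stable hash so that every producer routes a key to the same consumer.
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

def mapper_output(pairs):
    """What one mapper leaves behind locally: one bucket per reducer."""
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for key, value in pairs:
        buckets[partition_of(key)].append((key, value))
    return buckets

def pull(reducer_id, all_mapper_outputs):
    """Pull mode: the reducer fetches its bucket from every mapper."""
    fetched = []
    for buckets in all_mapper_outputs:
        fetched.extend(buckets[reducer_id])
    return fetched

def push(pairs_per_producer):
    """Push mode: each producer sends rows straight to the consumer's inbox."""
    inboxes = [[] for _ in range(NUM_REDUCERS)]
    for pairs in pairs_per_producer:
        for key, value in pairs:
            inboxes[partition_of(key)].append((key, value))
    return inboxes

producers = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
outputs = [mapper_output(pairs) for pairs in producers]
pulled = [pull(r, outputs) for r in range(NUM_REDUCERS)]
pushed = push(producers)
print(pulled == pushed)
```

Both modes deliver the same pairs to the same consumer. In pull mode the producer's output exists as a complete, separately stored unit (`outputs`) before any consumer reads it, whereas in push mode rows are forwarded as they are produced and no such stored copy exists.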

Table 3.1 summarizes the differences between parallel database and MapReduce with short descriptions.

Aspect                      Parallel database                 MapReduce
Data partitioning           Uses specific methods;            Done automatically;
                            considers data semantics          does not consider data semantics
Data distribution           Known to developers               Unknown
Schema support              Yes                               No
Programming model           Declarative                       Direct implementation
Flexibility                 Not good                          Good
Fault tolerance             Handles disk failures             Handles disk and node failures
Indexing                    Supported                         No built-in index
Transaction support         Yes                               No
Scalability                 Not good                          Good
Heterogeneous environment   Unsuitable                        Suitable
Execution strategy          Push mode                         Pull mode

Table 3.1: Differences between parallel database and MapReduce

3.4.3.2 Hybrid Solution

MapReduce-like software and shared-nothing parallel databases each have their own advantages and disadvantages. People therefore look for a hybrid solution that combines the fault tolerance, support for heterogeneous clusters, and ease of scaling of MapReduce with the efficiency, performance, and tool pluggability of shared-nothing parallel databases.