MapReduce Join Strategies for Key-Value Storage

(1)

MapReduce Join Strategies for Key-Value Storage

Duong Van Hieu, Sucha Smanchat, and Phayung Meesad Faculty of Information Technology

King Mongkut’s University of Technology North Bangkok Bangkok 10800, Thailand

duongvanhieu@tgu.edu.vn, {suchas@kmutnb.ac.th,sucha.smanchat@acm.org}, pym@kmutnb.ac.th Abstract—This paper analyses MapReduce join strategies

used for big data analysis and mining known as map-side and reduce-side joins. The most used joins will be analysed in this paper, which are theta-join algorithms including all pair partition join, repartition join, broadcasting join, semi join, per-split semi join. This paper can be considered as a guideline for MapReduce application developers for the selection of join strategies. The analysis of several join strategies for big data analysis and mining is accompanied by comprehensive examples.

Keywords—MapReduce; join strategy; NoSQL

I. INTRODUCTION

With the continuous development of big data and cloud computing, it is believed that traditional database technologies are insufficient for data storage and access, and also performance and flexibility requirements. In the new era of big data, NoSQL databases are more appropriate than relational databases [1]. Key-Value store, a kind of NoSQL databases, is an appropriate choice for applications that use MapReduce model for distributed processing. Key-Value stores offer only four underlying operators including inserting <key, value> pairs to a data collection, updating values of existing pairs, finding values associated with a specific key, and deleting pairs from a data collection [2].

Joining two data collections to produce a new dataset based on joining fields is a responsibility of programmers or application developers rather than of database management systems. However, several join strategies existing, which have different advantages and disadvantages. To provide programmers a guideline to the selection of join strategies, this study analyses several joining strategies for big data analysis and mining accompanied comprehensive examples. The content of this paper is organised into four main sections. Section 2 gives an overview of the MapReduce programming model, Section 3 explains MapReduce join strategies, and Section 4 is the conclusion of and comparison of join strategies used in MapReduce.

II. MAPREDUCE OVERVIEW

MapReduce has been used at Google since February 2003, and was first introduced in 2004 by Dean and Ghemawat [3] and in Communications of the ACM in 2008 [4]. It is used for processing large datasets in a parallel or distributed computing environment. It is a combination of map processes and reduce processes. A map process is a function that processes a set of input <key, value> pairs that is a portion of a large input

dataset to generate a set of intermediate <key, value> pairs. A reduce process with a reduce function merges all of intermediate values generated by the map processes associated with the same intermediate key to form a possibly smaller set of <key, value> pairs, called final output <key, value> pairs.

Fig. 1 is a simple word counting example. The input string data “Advanced Research Methodology, Advanced Information Modelling and Database, Advanced Network and Information Security, Advanced Database and Distributed Systems” is divided into four blocks corresponding to each subject name separated by commas. A Hash function mod(code(upper(left(key,1))),k)+1 is used for distributing intermediate <key, value> pairs into reduce tasks. The left(key,1) means taking the first letter of key, the upper(x) means changing x to upper case, the code (x) means taking ASCII code of character x, and the mod(m, k) means returning the remainder after m is divided by k.

Inputdata Key Value Key Value

Advanced Advanced 1 Database 1

<key,value>pairs producedbyReduce

processs Intermediate <key,value>pairs

distribution

Red_uce

1

<key,value>pairs

Block1Research Map1 Research 1 Group1 Database 1 Key Value

Methodology Methodology 1 Distributed 1 Database 2

Key Value Key Value Distributed 1

Advanced Advanced 1 Advanced 1

Information Information 1 Advanced 1 Key Value

Block2Modelling Map2 Modelling 1 Advanced 1 Advanced 4

and and 1 Advanced 1 and 3

Database Database 1 Group2 and 1 Reduce2 Information 2

Key Value and 1 Methodology 1

Advanced Advanced 1 and 1 Modelling 1 Network Network 1 Information 1

Block3and Map3 and 1 Information 1 Key Value

Information Information 1 Methodology 1 Network 1

Red_uce

1

Redu ce3

Security Security 1 Modelling 1 Research 1

Key Value Key Value

Advanced Advanced 1 Group3 Network 1 Key Value

Reduc e4

Redu ce3

Database Database 1 Research 1 Security 1

Block4and Map4 and 1 Key Value Systems 1

Distributed Distributed 1 Group4 Security 1 Systems Systems 1 Systems 1

Reduc e4

Fig. 1. Map and reduce processes of a simple word counting example

III. MAPREDUCE KEY JOIN STRATEGIES

Physically, data in a Key-Value format can be stored in the form of a data structure such as B-Tree, Queue, and Hash table [5, 6]. Logically, each record in a Key-Value store is a single entry including a key and a value. To make it easy to under-stand, a set of <key, value> pairs, called data collection, can be

(2)

considered as a two-column table. The first column stores keys and the second one, which can be a combination of more than two columns, stores values associated with the keys.

Joins using MapReduce can be categorized as map-side join, reduce-side join, memory-backed join, join using Bloom Filter, and map-reduce merge [7]. However, this paper follows the categories proposed by Tom White [8, 9], grouping into two types which are map-side joins and reduce-side joins. Map-side joins are joins per-formed by mappers, used to join two large input datasets before feeding data to the map functions. Reduce-side joins are joins performed by reducers, being more general than a map-side join because inputs do not need to be structured in any particular way [9]. In some cases, reduce-side joins are less efficient than map-side joins because datasets go through the MapReduce shuffle. For reduce-side joining, several components are involved. These are multiple inputs and secondary sorting [8].

Multiple inputs mean inputs from different sources can have different formats or presentations. To deal with this situation, multiple inputs need to be parsed separately. This parsing is provided in Hadoop, called per-path basis [10]. Secondary sorting occurs when reducers obtain inputs from two sources and each of them can be sorted by different orders. To solve this challenge, when the first dataset comes from source A sorted by key1, the second dataset comes from source B sorted by key2. The merged data should be sorted by a composite key (key1, key2) before reducing.

A. Theta Joins

Theta join is a kind of join that uses comparison operators such as <, <=, >, >=, =, <> in the join predicates. Among these, equi-join is the most used join for joining two datasets to achieve the intersection between them. Fig. 2 is an example of equi-join. This join matches every record from table L to every record from table R which has the same value of the field join. The results of joining can be projected to eliminate some redundant fields to produce only required fields.

Among join algorithms used in MapReduce literature listed in [11-15], it is believed that equi-join strategies used in [11] are more efficient than those used in Yahoo Pig, Facebook Hive, and IBM Jaql. This paper focuses on the theta-join implementation strategies proposed by Blanas et al.[11] and

Okcan [16]. Theta join algorithms will be analysed in the following sections.

SELECT*FROML,RWHEREL.Profs=R.Profs

Stds Profs Stds Profs L.Stds L.Profs R.Stds R.Profs

Aj PhMe Hin Sup Aj PhMe Jia PhMe

Hiu PhMe Hiu Sup Aj PhMe Sul PhMe

Lo PhMe Jia PhMe Hiu PhMe Jia PhMe

Su Mar Ling Su Hiu PhMe Sul PhMe

Suna Un Sul PhMe Lo PhMe Jia PhMe

Lo PhMe Sul PhMe

TableL TableR

Fig. 2. A simple equi- join example (using equi-join on the field Profs)

B. All Pair Partition Joins

Given table R having |R| records and table L having |L| records, product of R and L is a set of |R|*|L| records. This traditional method takes a long time when joining two very large tables. To compute this product in MapReduce, table R and table L will be divided into u and v disjoin partitions, respectively. |R|*|L| records can be obtained from u*v products, each product partition (1, 1), partition (1, 2),.., partition (u, v) can be processed by a map or a reduce function. This method is called all pairs partition join in MapReduce model [16].

Fig. 3. All pairs partition join

Each compound partition will be assigned to a map task. Output of the map task is <compound key, tagged record> pairs. A compound key is a combination of partition name from table R and L such as (1, 2), (1, 2), and (1, 3). To identify which record comes from which table, each record from table R or L will be tagged its table name, called tagged record. Each group of <compound key, tagged record> pairs will be passed to reducers. Before reducing data, this input data will be split into table R and L and they will join in the same way as the traditional joining method.

Ta bl eL

Stds Profs Key Valuelists Key Valuelists Key Valuelists

Aj PhMe ('R',Hi n,Sup) ('R',Hi n,Sup) ('R',Ji a ,PhMe) Key

Hi u PhMe ('R',Hi u,Sup) ('R',Hi u,Sup) ('R',Li ng,Su) emptyR.Stds R.Profs L.Stds L.Profs

Lo PhMe ('L',Aj,PhMe) ('L',Lo,PhMe) ('L',Aj,PhMe) Ji a PhMe Aj PhMe

Di t PhMe ('L',Hi u,PhMe) ('L',Di t,PhMe) ('L',Hi u,PhMe) Ji a PhMe Hi u PhMe

Su Ma r Ji a PhMe Lo PhMe

Sun Un Key Valuelists Key Valuelists Key Valuelists Ji a PhMe Di t PhMe

Ta bl eR ('R',Hi n,Sup) ('R',Ji a ,PhMe) ('R',Ji a ,PhMe)

Stds Profs ('R',Hi u,Sup) ('R',Li ng,Su) ('R',Li ng,Su)

Pa rt1 Hi n Sup ('L',Su,Ma r) ('L',Su,Ma r) ('L',Lo,PhMe)

Hi u Sup ('L',Sun,Un) ('L',Sun,Un) ('L',Di t,PhMe)

Pa rt2 Ji a PhMe Li ng Su Pa rt1 Pa rt2 Pa rt3 (1,3) (2,3) (2,2) Valuelists (1,1) (1,2) (2,1)

(3)

In Fig. 4, each record from table L and R will be added tag ‘L’ and ‘R’, respectively. Those records are called tagged records. Only the composite key has records from both table L and R having the same join key are fed to reduce functions. In this example, only partition (2, 1), partition (2, 2) has shared join key records from table R and L, which will be used for joining. The remaining partitions will be ignored. Disadvantage of this joining is enumerating every pair may not be processed by reducers.

C. Repartition Join

Repartition join is the most used join strategy in MapReduce. Datasets L and R are dynamically split into parts based on the join key and pairs of partitions from L and R will be joined [15]. It has two versions called standard repartition join and improved repartition join.

The standard version is the same as the partitioned sort-merge join that is used in parallel Rational Database Management Systems [11]. In the map phase, each map task works on a block of either table L or table R. To identify which table an input record is from, the map function tags each record with its original table and produces the extracted join key and the tagged records. Output of the map function is a set of <join_key, tagged_record> pairs. Join_key is the attribute used to join two tables, and tagged_record is a compound of table name and record. These outputs are then partitioned, sorted, and merged. Then, all records for each join key are grouped together and fed to a reducer. In the reduce phase, for each join key, the reducer first separates and buffers the input records into two sets according to the table tagged, and then performs a cross-product between two sets. This following example uses hash function mod(code(upper(left(join key,1))),2)+1 for distributing intermediate <key, value> pairs to each reducer (the similar has function used earlier).

T able L Intermediate output

Stds Profs Join key Tagge d Re cord key tagged record T able L Stds Profs

Block1 Su Mar Map1 Mar ('L', Su, Mar) Group 1 PhMe ('L' , Aj, PhMe) Aj PhMe Aj PhMe PhMe ('L', Aj, PhMe) PhMe ('L' , Hiu, PhMe) Reduce 1 Hiu PhMe Block2 Hiu PhMe Map2 PhMe ('L', Hiu, PhMe) PhMe ('L' , Lo, PhMe) Lo PhMe

Lo PhMe PhMe ('L', Lo, PhMe) PhMe ('R' , Jia, PhMe) T able R Stds Profs L.Stds L.Profs R.Stds R.Profs

Block3 Sun Un Map3 Un ('L', Sun, Un) PhMe ('R' , Sul, PhMe) Jia PhMe Aj PhMe Jia PhMe T able R Sul PhMe Aj PhMe Sul PhMe

Stds Profs key tagged record T able L Stds Profs Hiu PhMe Jia PhMe

Jia PhMe PhMe ('R', Jia, PhMe) Mar ('L' , Su, Mar) Su Mar Hiu PhMe Sul PhMe Sul PhMe PhMe ('R', Sul, PhMe) Su ('R' , Ling, Su) Reduce 2 Sun Un Lo PhMe Jia PhMe Block2 Ling Su Map5 Su ('R', Ling, Su) Sup ('R' , Hin, Sup) Stds Profs Lo PhMe Sul PhMe

Hin Sup Sup ('R', Hin, Sup) Sup ('R' , Hiu, Sup) T able R Ling Su Hiu Sup Sup ('R', Hiu, Sup) Group 2 Un ('L' , Sun, Un) Hin Sup

Hiu Sup

Final result from reduce process

me rg in g, s or tin g, a nd gr oup in

Input of map functions Reduce process

Block3

Block1 Map4 Map6

Fig. 5. An example of standard repartition joins (using equi-join on the field Profs)

All records from table L and R will be buffered before joining and that may lead to insufficient memory problem, as encountered by Yahoo Pig and Facebook Hive [11, 17, 18]. To deal with this, improved repartition join is proposed.

In the improved version, the map function is changed. Output key of the map function is changed to a composite of join key and table tag. The table tags will be generated in a way that guarantees that records from table R will be stored ahead

of those from table L on a given join key. Partition function is also customised so that hash code is computed from just the join key instead of composite key. Records are then grouped by just the join key instead of the composite key. Grouping function in the reducer which groups records on the join key, and ensures that records from table R are stored ahead of those from table L for a given key. To decrease buffer size, only the record, that have composite key containing all table tags will be written into buffer.

Output of map functions Input of reduce functiom

Stds Profs Comp. Keys Tagged Re cords Ke ys Lists of Value s L.Stds L.Profs R.Stds R.Profs

Jia PhMe [PhMe, R] ('R', Jia, PhMe) Ke ys Tagged Re cords ([Jiaja, PhMe], [AjPae, PhMe]) Aj PhMe Jia PhMe

Sul PhMe [PhMe, R] ('R', Sul, PhMe) [Mar, L] ('L', Su, Mar) ([Jiaja,PhMe], [Hiu, PhMe]) Aj PhMe Sul PhMe

Block 2 Ling Su [Su, R] ('R', Ling, Su) [PhMe, R ('R', Jia, PhMe) ([Jiaja, PhMe], [Lo, PhMe]) Hiu PhMe Jia PhMe

Hin Sup [Sup, R] ('R', Hin, Sup) [PhMe, R ('R', Sul, PhMe) ([Sul,PhMe], [AjPae, PhMe]) Hiu PhMe Sul PhMe

Hiu Sup [Sup, R] ('R', Hiu, Sup) [PhMe, L ('L', Aj, PhMe) ([Sul, PhMe], [Hiu, PhMe]) Lo PhMe Jia PhMe

[PhMe, L ('L', Hiu, PhMe) ([Sul, PhMe], [Lo, PhMe]) Lo PhMe Sul PhMe

Stds Profs Comp. Keys Tagged Re cords [PhMe, L ('L', Lo, PhMe) Ke ys Lists of Value s

Block 1 Su Mar [Mar, L] ('L', Su, Mar) [Su, R] ('R', Ling, Su) [Mar,_, L] (_, [Su, Mar])

Aj PhMe [PhMe, L] ('L', Aj, PhMe) [Sup, R] ('R', Hin, Sup) [Un, _, L] (_, [Sun, Un])

Hiu PhMe [PhMe, L] ('L', Hiu, PhMe) [Sup, R] ('R', Hiu, Sup) [Su, R, _] ([Ling, Su],_)

Lo PhMe [PhMe, L] ('L', Lo, PhMe) [Un, L] ('L', Sun, Un) ([Hin, Sup],_)

Block 3 Sun Un [Un, L] ('L', Sun, Un) ([Hiu, Sup],_)

Block 2 Block 1

Block 3

[PhMe R, L] T able L

Final result from reducer T able R

[Sup, R, _] Interme diate Results

Map 1 Map 2 Map 3 Map 4 Map 6 Map 5

(4)

D. Broadcasting Join

Broadcast join is used when table R is much smaller than table L. Instead of passing both tables R and L across the network, the smaller table will be broadcasted to larger table. This technique reduces sorting time and network traffic. At the beginning of each map function, broadcast join checks whether R is stored on the local file system or not. If not, it retrieves table R from the distributed file system, and splits R into partitions on the join key, and stores these partitions on the local file system. Hash table is built from table L or R depending on which one has smaller size.

If R is smaller than a partition of L, then all partitions of R will be loaded to memory to build the hash table. The map function then extracts join key value from each record from L, and uses it to probe the hash table and to generate join output. If R is bigger than a split of L, joining is not done at the map function. The map function will map each partition of L with each partition of R using other join strategies. Then, results from R and L will be joined at the end of the map process.

In Fig. 7 and Fig. 8, table R is smaller than a part of table L, so it is broadcasted to each node. The map function loads all records from table R to build a hash table. For each record from a partition of table L, the map function finds its reference in the hash table, and outputs only those it has referenced. All unreferenced records from table L will be ignored.

TableR

StdId subject Hashtable,Distributedfunction=(StdIdmod2)+1

55501 701 StdId Group 55501 371 55501 2 55502 555 55502 1 56701 511 56701 2 56702 814 Tab 56702 1 le R is us ed to b u ild ha sh tab le

Fig. 7. Building Hash table when R is smaller than any part of L TableL

StdId Name L.StdId L.Name R.StdId R.subject

55701 Lo Map1 56701 Dit 56701 511

55702 Mo 56702 Hiu 56702 814 L.StdId L.Name R.StdId R.subject

56700 Bo 55701Hash table56701Hash table 55502 Sher 55502 555

56701 Dit 55702Hash table56703Hash table Group1 56702 Hiu 56702 814 56702 Hiu 56700Hash table

56703 Cha L.StdId L.Name R.StdId R.subject

55501 Sul Map2 L.StdId L.Name R.StdId R.subject 55501 Sul 55501 701

55502 Sher 55501 Sul 55501 701 Group2 55501 Sul 55501 371

55503 Jia 55501 Sul 55501 371 56701 Dit 56701 511

55504 Dih 55502 Sher 55502 555

55505 Tha 55503Hash table55505Hash table

56501 Ling 55504Hash table56501Hash table

IntermediaResults Joi n ke y is use d to pr obe ha sh ta b le Joi n ke y is use d to pr ob e ha sh ta b le Split1 Split2

Fig. 8. Example of broadcasting joins when R is smaller than any part of L(using equi-join)

In some cases, a large portion of table R may not be referenced by any record from table L. For example R is a table of users including millions of records while L is a table of activities that users act during an hour. In this situation, only a few of records from table R are referenced by records from table L. However, when joining based on broadcasting, a large amount of records of table R are shipped across network and loaded into the hash table. If these data are not referenced based on the join key, the network resource is wasted for the shipping.

E. Semi Join

The semi-join proposed to solve the problem mentioned above is comprised of three phases as follows. The first phase runs as a full MapReduce job. In the map function, a main memory table of hash code is used for determining the set of unique join key values in a part of table L. By sending only unique key values to the map output, number of records that need to be sorted is reduced. The reduce function processes unique join key. In Fig. 9, all unique join keys will be consolidated by a reducer, result from this phase is a single file called L.uk.

TableL HashtableL1

StdId subject StdId outputL.uk

55501 701 55501 StdId 55501 371 55502 55501 55502 555 HashtableL2 55502 56701 511 StdId 56701 56701 814 56701 56702 56702 814 56702 Split1 Split2

Fig. 9. Example of the first phase in Semi joins (using equi-join) The second phase, similar to the broadcast join, runs as a map job. Firstly, L.uk will be loaded into a memory hash table, the map function iterates each record from table R and outputs it if its join key can be found in the L.uk. Each part of table R produces one file called Ri. Output of this phase is a list of file

Ri as shown in Fig. 10.

The third phase, join all file Ri with table L using broadcast join as shown in Fig. 11. One challenge of semi join is that not every record in the Ri of R will join with a particular part Li of table L. To solve this issue, per-split semi join is proposed.

TableR OutputR1 TableR OutputR2

StdId Name Map1 StdId Name StdId Name Map2 StdId Name

55701 Lo Hashtable 56701 Dit 55501 Sul Hashtable 55501 Sul

55702 Mo StdId 56702 Hiu 55502 Sher StdId 55502 Sher

56700 Bo 55501 55701Hash table 55503 Jia 55501 55503Hash table 56701 Dit 55502 55702Hash table 55504 Di 55502 55504Hash table 56702 Hiu 56701 56700Hash table 55505 Tha 56701 55505Hash table 56703 Cha 56702 56703Hash table 56501 Ling 56702 56501Hash table

Split1 Split2

(5)

TableL TableL Intermediateresults2

StdId subject L.StdId L.Name R.StdId R.subject StdId subject L.StdId L.Name R.StdId R.subject

55501 701Ma p155501R1 55501 701 Map1 55501 Sul 55501 701

55501 371 55502R1 55501 371 55501 Sul 55501 371

55502 555 55502 555 55502 Sher 55502 555

56701 511Ma p2Intermediateresults1 56701 511

56701 814 L.StdId L.Name R.StdId R.subject 56701 814 Map2

56702 814 56701 Dit 56701 511 56702 814 L.StdId L.Name R.StdId R.subject

OutputR1 56701 Dit 56701 814 OutputR2 56701R2

StdId Name 56702 Hiu 56702 814 StdId Name 56702R2

56701 Dit 55501 Sul 56702 Hieu 55502 Sher Split1 Split2 R2 Split1 Split2 R1

Fig. 11. Example of the last phase in Semi joins (using equi-join)

F. Per-Split Semi Join

Per-split semi join consists of three phases. The first and the last phases are map jobs, and the second phase is a full map reduce job. The first phase is to generate the set of unique join keys in a split Li of table L, and stores them in the distributed file system, called Li.uk. The second phase is to load all records from a split of table R into main memory hash table, and read the unique keys from file Li.uk and probe the hash table for matching records from R. Each matched record is outputted with a tag RLi, which is used by reduce function to collect all records from table R that will join with Li. In the last phase, the results of the second phase and Li are joined directly as shown in Fig. 12 and Fig. 13.

Fig. 12. Example of the first phase and second phase in Per-Split Semi Join OutputofRjoinLi.uk

Tags StdId Name RL1 55501 Sul

RL1 55502 Sher L.StdId L.Name R.StdId R.subject

RL2 56701 Dit 55501 Sul 55501 701

RL2 56702 Hiu 55501 Sul 55501 371

TableL 55502 Sher 55502 555

StdId subject 56701 Dit 56701 511

55501 701 56701 Dit 56701 814 55501 371 56702 Hiu 56702 814 55502 555 56701 511 56701 814 56702 814

Outputoffinalphase

Fig. 13. Example of the last phase in Per-Split Semi Join (using equi-join)

IV. CONCLUSION

Many of big data mining problems can be solved by using MapReduce associated with Key-Value store. Based on advantages and drawbacks of those explained strategies in terms of time and network resources consumption, we provide a comparison of join strategies as shown in Table 1.

TABLE 1.COMPARISION OF JOIN STRATEGIES

Strategy Pros/Cons Suggestion

All pair partition join

Easy to implement, all compound partition transferred to reducers may not be processed by reducers.

Used when two datasets have more data in common, be sorted by the same fields.

Standard repartition join

Easy to implement, all records from both tables will be buffered before joining that may lead to insufficient memory problem.

Same with all pair partition join.

Improved repartition join

To reduce buffer size, implementation is more complex than the standard version.

Used when two joined datasets have few data in common, be sorted by the same fields. Broadcasting

join

Reduce sorting time and network traffic. May waste of network resource.

Used when one table is much smaller than the other table. Semi-join Some records from parts of

a table broadcasted to another table may not be joined.

Used when a large portion of a table may not be referenced by any record from the other table. Per-split semi

join

Complicated implementation, more reading and writing operations.

Same with semi-join.

Which strategy should be used in any problem depends on nature of the data and available network resources. If two

(6)

joined tables have more data in common or having sufficient network resources, all pair partition join, repartition join should be used because its implementation is not as complex as the others. If two joined tables have few data in common or having inadequate network resources, broadcasting join, semi join, per-split semi join should be used because it may reduce time and resources consumption.

Data in NoSQL database can be structured, semi-structured, or unstructured; and can be stored in many types of data structures such as indexed table of relational database, B-Tree, Queue, Hash table. Therefore, in addition to the consideration presented in this paper, selection of join strategies is also affected by data structures. MapReduce programmers may also need to consider data accessing time, data sorting time when selecting joining strategy. This issue is beyond the scope of this paper and is left for future research.

REFERENCES

[1] Mapanga, I. and P. Kadebu, Database Management Systems: A NoSQL

Analysis. Interna-tional Journal of Modern Communication Technologies & Research (IJMCTR), 2013. 1: p. 12-18.

[2] Hecht, R. and S. Jablonski. NoSQL evaluation: A use case oriented

survey. in Cloud and Service Computing (CSC), 2011 International Conference on. 2011.

[3] Dean, J. and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in OSDI '04: Sixth Symposium on Operating Systems Design and Implementation. 2004, USENIX: San Francisco, California, USA. p. 137–150.

[4] Dean, J. and S. Ghemawat, MapReduce: simplified data processing on large clusters, in Communications of the ACM - 50th anniversary issue: 1958 - 2008. 2008. p. 107-113.

[5] Celko, J., Chapter 6. Key–Value Stores, in Joe Celko's complete guide to NoSQL : what every SQL professional needs to know about

nonrelational databases, A. Dierna and H. Scherer, Editors. 2014, Morgan Kaufmann, Elsevier: USA. p. 81-88.

[6] Oracle, Chapter 1. Introduction to Berkeley DB, in Oracle Berkeley DB: Getting Started with Berkeley DB for C. 2013. p. 8-15.

[7] Jadhav, V., J. Aghav, and S. Dorwani, Join Algorithms Using

MapReduce: A Survey, in International Conference on Electrical Engineering and Computer Science. 2013, IOAJ INDIA: Coimbatore, Tamil Nadu, India. p. 40-44.

[8] White, T., Chapter 8. MapReuce Features, in Hadoop: The Definitive Guide, Second Edi-tion, M. Loukides, Editor. 2011, O'Reilly Media, Inc.,: USA. p. 225-257.

[9] White, T., Chapter 8. MapReduce Features, in Hadoop: The Definitive Guide, Third Edition, M. Loukides and M. Blanchette, Editors. 2012, O'Reilly Media, Inc.,: USA. p. 259-295.

[10] White, T., Chapter 7. MapReduce Types and Formats, in Hadoop: The Definitive Guide, Third Edition, M. Loukides and M. Blanchette, Editors. 2012, O'Reilly Media, Inc.: USA. p. 223-258.

[11] Blanas, S., et al., A comparison of join algorithms for log processing in MaPreduce, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 2010, ACM: Indianapolis, Indiana, USA. p. 975-986.

[12] Özsu, M.T. and P. Valduriez, Chapter 3. Distributed Database Design, in Principles of Dis-tributed Database Systems, Third Edition. 2011, Springer New York. p. 71-125.

[13] Bernstein, P.A., et al., Query processing in a system for distributed databases (SDD-1). ACM Trans. Database Syst., 1981. 6(4): p. 602-625. [14] Lee, K.-H., et al., Parallel data processing with MapReduce: a survey.

SIGMOD Rec., 2012. 40(4): p. 11-20.

[15] Okcan, A. and M. Riedewald, Processing theta-joins using MapReduce, in Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. 2011, ACM: Athens, Greece. p. 949-960. [16] Shim, K., MapReduce algorithms for big data analysis, in Proceedings

of the VLDB En-dowment 2012, VLDB Endowment. p. 2018-2017. [17] Olston, C., et al., Pig latin: a not-so-foreign language for data

processing, in Proceedings of the 2008 ACM SIGMOD international conference on Management of data. 2008, ACM: Vancouver, Canada. p. 1099-1110.