MapReduce Join Strategies for Key-Value Storage
Duong Van Hieu, Sucha Smanchat, and Phayung Meesad Faculty of Information Technology
King Mongkut’s University of Technology North Bangkok Bangkok 10800, Thailand
duongvanhieu@tgu.edu.vn, {suchas@kmutnb.ac.th,sucha.smanchat@acm.org}, pym@kmutnb.ac.th Abstract—This paper analyses MapReduce join strategies
used for big data analysis and mining known as map-side and reduce-side joins. The most used joins will be analysed in this paper, which are theta-join algorithms including all pair partition join, repartition join, broadcasting join, semi join, per-split semi join. This paper can be considered as a guideline for MapReduce application developers for the selection of join strategies. The analysis of several join strategies for big data analysis and mining is accompanied by comprehensive examples.
Keywords—MapReduce; join strategy; NoSQL
I. INTRODUCTION
With the continuous development of big data and cloud computing, it is believed that traditional database technologies are insufficient for data storage and access, and also performance and flexibility requirements. In the new era of big data, NoSQL databases are more appropriate than relational databases [1]. Key-Value store, a kind of NoSQL databases, is an appropriate choice for applications that use MapReduce model for distributed processing. Key-Value stores offer only four underlying operators including inserting <key, value> pairs to a data collection, updating values of existing pairs, finding values associated with a specific key, and deleting pairs from a data collection [2].
Joining two data collections to produce a new dataset based on joining fields is a responsibility of programmers or application developers rather than of database management systems. However, several join strategies existing, which have different advantages and disadvantages. To provide programmers a guideline to the selection of join strategies, this study analyses several joining strategies for big data analysis and mining accompanied comprehensive examples. The content of this paper is organised into four main sections. Section 2 gives an overview of the MapReduce programming model, Section 3 explains MapReduce join strategies, and Section 4 is the conclusion of and comparison of join strategies used in MapReduce.
II. MAPREDUCE OVERVIEW
MapReduce has been used at Google since February 2003, and was first introduced in 2004 by Dean and Ghemawat [3] and in Communications of the ACM in 2008 [4]. It is used for processing large datasets in a parallel or distributed computing environment. It is a combination of map processes and reduce processes. A map process is a function that processes a set of input <key, value> pairs that is a portion of a large input
dataset to generate a set of intermediate <key, value> pairs. A reduce process with a reduce function merges all of intermediate values generated by the map processes associated with the same intermediate key to form a possibly smaller set of <key, value> pairs, called final output <key, value> pairs.
Fig. 1 is a simple word counting example. The input string data “Advanced Research Methodology, Advanced Information Modelling and Database, Advanced Network and Information Security, Advanced Database and Distributed Systems” is divided into four blocks corresponding to each subject name separated by commas. A Hash function mod(code(upper(left(key,1))),k)+1 is used for distributing intermediate <key, value> pairs into reduce tasks. The left(key,1) means taking the first letter of key, the upper(x) means changing x to upper case, the code (x) means taking ASCII code of character x, and the mod(m, k) means returning the remainder after m is divided by k.
Inputdata Key Value Key Value
Advanced Advanced 1 Database 1
<key,value>pairs producedbyReduce
processs Intermediate <key,value>pairs
distribution
Reduce
1
<key,value>pairs
Block1Research Map1 Research 1 Group1 Database 1 Key Value
Methodology Methodology 1 Distributed 1 Database 2
Key Value Key Value Distributed 1
Advanced Advanced 1 Advanced 1
Information Information 1 Advanced 1 Key Value
Block2Modelling Map2 Modelling 1 Advanced 1 Advanced 4
and and 1 Advanced 1 and 3
Database Database 1 Group2 and 1 Reduce2 Information 2
Key Value and 1 Methodology 1
Advanced Advanced 1 and 1 Modelling 1 Network Network 1 Information 1
Block3and Map3 and 1 Information 1 Key Value
Information Information 1 Methodology 1 Network 1
Reduce
1
Redu ce3
Security Security 1 Modelling 1 Research 1
Key Value Key Value
Advanced Advanced 1 Group3 Network 1 Key Value
Reduc e4
Redu ce3
Database Database 1 Research 1 Security 1
Block4and Map4 and 1 Key Value Systems 1
Distributed Distributed 1 Group4 Security 1 Systems Systems 1 Systems 1
Reduc e4
Fig. 1. Map and reduce processes of a simple word counting example
III. MAPREDUCE KEY JOIN STRATEGIES
Physically, data in a Key-Value format can be stored in the form of a data structure such as B-Tree, Queue, and Hash table [5, 6]. Logically, each record in a Key-Value store is a single entry including a key and a value. To make it easy to under-stand, a set of <key, value> pairs, called data collection, can be
considered as a two-column table. The first column stores keys and the second one, which can be a combination of more than two columns, stores values associated with the keys.
Joins using MapReduce can be categorized as map-side join, reduce-side join, memory-backed join, join using Bloom Filter, and map-reduce merge [7]. However, this paper follows the categories proposed by Tom White [8, 9], grouping into two types which are map-side joins and reduce-side joins. Map-side joins are joins per-formed by mappers, used to join two large input datasets before feeding data to the map functions. Reduce-side joins are joins performed by reducers, being more general than a map-side join because inputs do not need to be structured in any particular way [9]. In some cases, reduce-side joins are less efficient than map-side joins because datasets go through the MapReduce shuffle. For reduce-side joining, several components are involved. These are multiple inputs and secondary sorting [8].
Multiple inputs mean inputs from different sources can have different formats or presentations. To deal with this situation, multiple inputs need to be parsed separately. This parsing is provided in Hadoop, called per-path basis [10]. Secondary sorting occurs when reducers obtain inputs from two sources and each of them can be sorted by different orders. To solve this challenge, when the first dataset comes from source A sorted by key1, the second dataset comes from source B sorted by key2. The merged data should be sorted by a composite key (key1, key2) before reducing.
A. Theta Joins
Theta join is a kind of join that uses comparison operators such as <, <=, >, >=, =, <> in the join predicates. Among these, equi-join is the most used join for joining two datasets to achieve the intersection between them. Fig. 2 is an example of equi-join. This join matches every record from table L to every record from table R which has the same value of the field join. The results of joining can be projected to eliminate some redundant fields to produce only required fields.
Among join algorithms used in MapReduce literature listed in [11-15], it is believed that equi-join strategies used in [11] are more efficient than those used in Yahoo Pig, Facebook Hive, and IBM Jaql. This paper focuses on the theta-join implementation strategies proposed by Blanas et al.[11] and
Okcan [16]. Theta join algorithms will be analysed in the following sections.
SELECT*FROML,RWHEREL.Profs=R.Profs
Stds Profs Stds Profs L.Stds L.Profs R.Stds R.Profs
Aj PhMe Hin Sup Aj PhMe Jia PhMe
Hiu PhMe Hiu Sup Aj PhMe Sul PhMe
Lo PhMe Jia PhMe Hiu PhMe Jia PhMe
Su Mar Ling Su Hiu PhMe Sul PhMe
Suna Un Sul PhMe Lo PhMe Jia PhMe
Lo PhMe Sul PhMe
TableL TableR
Fig. 2. A simple equi- join example (using equi-join on the field Profs)
B. All Pair Partition Joins
Given table R having |R| records and table L having |L| records, product of R and L is a set of |R|*|L| records. This traditional method takes a long time when joining two very large tables. To compute this product in MapReduce, table R and table L will be divided into u and v disjoin partitions, respectively. |R|*|L| records can be obtained from u*v products, each product partition (1, 1), partition (1, 2),.., partition (u, v) can be processed by a map or a reduce function. This method is called all pairs partition join in MapReduce model [16].
Fig. 3. All pairs partition join
Each compound partition will be assigned to a map task. Output of the map task is <compound key, tagged record> pairs. A compound key is a combination of partition name from table R and L such as (1, 2), (1, 2), and (1, 3). To identify which record comes from which table, each record from table R or L will be tagged its table name, called tagged record. Each group of <compound key, tagged record> pairs will be passed to reducers. Before reducing data, this input data will be split into table R and L and they will join in the same way as the traditional joining method.
Ta bl eL
Stds Profs Key Valuelists Key Valuelists Key Valuelists
Aj PhMe ('R',Hi n,Sup) ('R',Hi n,Sup) ('R',Ji a ,PhMe) Key
Hi u PhMe ('R',Hi u,Sup) ('R',Hi u,Sup) ('R',Li ng,Su) emptyR.Stds R.Profs L.Stds L.Profs
Lo PhMe ('L',Aj,PhMe) ('L',Lo,PhMe) ('L',Aj,PhMe) Ji a PhMe Aj PhMe
Di t PhMe ('L',Hi u,PhMe) ('L',Di t,PhMe) ('L',Hi u,PhMe) Ji a PhMe Hi u PhMe
Su Ma r Ji a PhMe Lo PhMe
Sun Un Key Valuelists Key Valuelists Key Valuelists Ji a PhMe Di t PhMe
Ta bl eR ('R',Hi n,Sup) ('R',Ji a ,PhMe) ('R',Ji a ,PhMe)
Stds Profs ('R',Hi u,Sup) ('R',Li ng,Su) ('R',Li ng,Su)
Pa rt1 Hi n Sup ('L',Su,Ma r) ('L',Su,Ma r) ('L',Lo,PhMe)
Hi u Sup ('L',Sun,Un) ('L',Sun,Un) ('L',Di t,PhMe)
Pa rt2 Ji a PhMe Li ng Su Pa rt1 Pa rt2 Pa rt3 (1,3) (2,3) (2,2) Valuelists (1,1) (1,2) (2,1)
In Fig. 4, each record from table L and R will be added tag ‘L’ and ‘R’, respectively. Those records are called tagged records. Only the composite key has records from both table L and R having the same join key are fed to reduce functions. In this example, only partition (2, 1), partition (2, 2) has shared join key records from table R and L, which will be used for joining. The remaining partitions will be ignored. Disadvantage of this joining is enumerating every pair may not be processed by reducers.
C. Repartition Join
Repartition join is the most used join strategy in MapReduce. Datasets L and R are dynamically split into parts based on the join key and pairs of partitions from L and R will be joined [15]. It has two versions called standard repartition join and improved repartition join.
The standard version is the same as the partitioned sort-merge join that is used in parallel Rational Database Management Systems [11]. In the map phase, each map task works on a block of either table L or table R. To identify which table an input record is from, the map function tags each record with its original table and produces the extracted join key and the tagged records. Output of the map function is a set of <join_key, tagged_record> pairs. Join_key is the attribute used to join two tables, and tagged_record is a compound of table name and record. These outputs are then partitioned, sorted, and merged. Then, all records for each join key are grouped together and fed to a reducer. In the reduce phase, for each join key, the reducer first separates and buffers the input records into two sets according to the table tagged, and then performs a cross-product between two sets. This following example uses hash function mod(code(upper(left(join key,1))),2)+1 for distributing intermediate <key, value> pairs to each reducer (the similar has function used earlier).
T able L Intermediate output
Stds Profs Join key Tagge d Re cord key tagged record T able L Stds Profs
Block1 Su Mar Map1 Mar ('L', Su, Mar) Group 1 PhMe ('L' , Aj, PhMe) Aj PhMe Aj PhMe PhMe ('L', Aj, PhMe) PhMe ('L' , Hiu, PhMe) Reduce 1 Hiu PhMe Block2 Hiu PhMe Map2 PhMe ('L', Hiu, PhMe) PhMe ('L' , Lo, PhMe) Lo PhMe
Lo PhMe PhMe ('L', Lo, PhMe) PhMe ('R' , Jia, PhMe) T able R Stds Profs L.Stds L.Profs R.Stds R.Profs
Block3 Sun Un Map3 Un ('L', Sun, Un) PhMe ('R' , Sul, PhMe) Jia PhMe Aj PhMe Jia PhMe T able R Sul PhMe Aj PhMe Sul PhMe
Stds Profs key tagged record T able L Stds Profs Hiu PhMe Jia PhMe
Jia PhMe PhMe ('R', Jia, PhMe) Mar ('L' , Su, Mar) Su Mar Hiu PhMe Sul PhMe Sul PhMe PhMe ('R', Sul, PhMe) Su ('R' , Ling, Su) Reduce 2 Sun Un Lo PhMe Jia PhMe Block2 Ling Su Map5 Su ('R', Ling, Su) Sup ('R' , Hin, Sup) Stds Profs Lo PhMe Sul PhMe
Hin Sup Sup ('R', Hin, Sup) Sup ('R' , Hiu, Sup) T able R Ling Su Hiu Sup Sup ('R', Hiu, Sup) Group 2 Un ('L' , Sun, Un) Hin Sup
Hiu Sup
Final result from reduce process
me rg in g, s or tin g, a nd gr oup in
Input of map functions Reduce process
Block3
Block1 Map4 Map6
Fig. 5. An example of standard repartition joins (using equi-join on the field Profs)
All records from table L and R will be buffered before joining and that may lead to insufficient memory problem, as encountered by Yahoo Pig and Facebook Hive [11, 17, 18]. To deal with this, improved repartition join is proposed.
In the improved version, the map function is changed. Output key of the map function is changed to a composite of join key and table tag. The table tags will be generated in a way that guarantees that records from table R will be stored ahead
of those from table L on a given join key. Partition function is also customised so that hash code is computed from just the join key instead of composite key. Records are then grouped by just the join key instead of the composite key. Grouping function in the reducer which groups records on the join key, and ensures that records from table R are stored ahead of those from table L for a given key. To decrease buffer size, only the record, that have composite key containing all table tags will be written into buffer.
Output of map functions Input of reduce functiom
Stds Profs Comp. Keys Tagged Re cords Ke ys Lists of Value s L.Stds L.Profs R.Stds R.Profs
Jia PhMe [PhMe, R] ('R', Jia, PhMe) Ke ys Tagged Re cords ([Jiaja, PhMe], [AjPae, PhMe]) Aj PhMe Jia PhMe
Sul PhMe [PhMe, R] ('R', Sul, PhMe) [Mar, L] ('L', Su, Mar) ([Jiaja,PhMe], [Hiu, PhMe]) Aj PhMe Sul PhMe
Block 2 Ling Su [Su, R] ('R', Ling, Su) [PhMe, R ('R', Jia, PhMe) ([Jiaja, PhMe], [Lo, PhMe]) Hiu PhMe Jia PhMe
Hin Sup [Sup, R] ('R', Hin, Sup) [PhMe, R ('R', Sul, PhMe) ([Sul,PhMe], [AjPae, PhMe]) Hiu PhMe Sul PhMe
Hiu Sup [Sup, R] ('R', Hiu, Sup) [PhMe, L ('L', Aj, PhMe) ([Sul, PhMe], [Hiu, PhMe]) Lo PhMe Jia PhMe
[PhMe, L ('L', Hiu, PhMe) ([Sul, PhMe], [Lo, PhMe]) Lo PhMe Sul PhMe
Stds Profs Comp. Keys Tagged Re cords [PhMe, L ('L', Lo, PhMe) Ke ys Lists of Value s
Block 1 Su Mar [Mar, L] ('L', Su, Mar) [Su, R] ('R', Ling, Su) [Mar,_, L] (_, [Su, Mar])
Aj PhMe [PhMe, L] ('L', Aj, PhMe) [Sup, R] ('R', Hin, Sup) [Un, _, L] (_, [Sun, Un])
Hiu PhMe [PhMe, L] ('L', Hiu, PhMe) [Sup, R] ('R', Hiu, Sup) [Su, R, _] ([Ling, Su],_)
Lo PhMe [PhMe, L] ('L', Lo, PhMe) [Un, L] ('L', Sun, Un) ([Hin, Sup],_)
Block 3 Sun Un [Un, L] ('L', Sun, Un) ([Hiu, Sup],_)
Block 2 Block 1
Block 3
[PhMe R, L] T able L
Final result from reducer T able R
[Sup, R, _] Interme diate Results
Map 1 Map 2 Map 3 Map 4 Map 6 Map 5
D. Broadcasting Join
Broadcast join is used when table R is much smaller than table L. Instead of passing both tables R and L across the network, the smaller table will be broadcasted to larger table. This technique reduces sorting time and network traffic. At the beginning of each map function, broadcast join checks whether R is stored on the local file system or not. If not, it retrieves table R from the distributed file system, and splits R into partitions on the join key, and stores these partitions on the local file system. Hash table is built from table L or R depending on which one has smaller size.
If R is smaller than a partition of L, then all partitions of R will be loaded to memory to build the hash table. The map function then extracts join key value from each record from L, and uses it to probe the hash table and to generate join output. If R is bigger than a split of L, joining is not done at the map function. The map function will map each partition of L with each partition of R using other join strategies. Then, results from R and L will be joined at the end of the map process.
In Fig. 7 and Fig. 8, table R is smaller than a part of table L, so it is broadcasted to each node. The map function loads all records from table R to build a hash table. For each record from a partition of table L, the map function finds its reference in the hash table, and outputs only those it has referenced. All unreferenced records from table L will be ignored.
TableR
StdId subject Hashtable,Distributedfunction=(StdIdmod2)+1
55501 701 StdId Group 55501 371 55501 2 55502 555 55502 1 56701 511 56701 2 56702 814 Tab 56702 1 le R is us ed to b u ild ha sh tab le
Fig. 7. Building Hash table when R is smaller than any part of L TableL
StdId Name L.StdId L.Name R.StdId R.subject
55701 Lo Map1 56701 Dit 56701 511
55702 Mo 56702 Hiu 56702 814 L.StdId L.Name R.StdId R.subject
56700 Bo 55701Hash table56701Hash table 55502 Sher 55502 555
56701 Dit 55702Hash table56703Hash table Group1 56702 Hiu 56702 814 56702 Hiu 56700Hash table
56703 Cha L.StdId L.Name R.StdId R.subject
55501 Sul Map2 L.StdId L.Name R.StdId R.subject 55501 Sul 55501 701
55502 Sher 55501 Sul 55501 701 Group2 55501 Sul 55501 371
55503 Jia 55501 Sul 55501 371 56701 Dit 56701 511
55504 Dih 55502 Sher 55502 555
55505 Tha 55503Hash table55505Hash table
56501 Ling 55504Hash table56501Hash table
IntermediaResults Joi n ke y is use d to pr obe ha sh ta b le Joi n ke y is use d to pr ob e ha sh ta b le Split1 Split2
Fig. 8. Example of broadcasting joins when R is smaller than any part of L(using equi-join)
In some cases, a large portion of table R may not be referenced by any record from table L. For example R is a table of users including millions of records while L is a table of activities that users act during an hour. In this situation, only a few of records from table R are referenced by records from table L. However, when joining based on broadcasting, a large amount of records of table R are shipped across network and loaded into the hash table. If these data are not referenced based on the join key, the network resource is wasted for the shipping.
E. Semi Join
The semi-join proposed to solve the problem mentioned above is comprised of three phases as follows. The first phase runs as a full MapReduce job. In the map function, a main memory table of hash code is used for determining the set of unique join key values in a part of table L. By sending only unique key values to the map output, number of records that need to be sorted is reduced. The reduce function processes unique join key. In Fig. 9, all unique join keys will be consolidated by a reducer, result from this phase is a single file called L.uk.
TableL HashtableL1
StdId subject StdId outputL.uk
55501 701 55501 StdId 55501 371 55502 55501 55502 555 HashtableL2 55502 56701 511 StdId 56701 56701 814 56701 56702 56702 814 56702 Split1 Split2
Fig. 9. Example of the first phase in Semi joins (using equi-join) The second phase, similar to the broadcast join, runs as a map job. Firstly, L.uk will be loaded into a memory hash table, the map function iterates each record from table R and outputs it if its join key can be found in the L.uk. Each part of table R produces one file called Ri. Output of this phase is a list of file
Ri as shown in Fig. 10.
The third phase, join all file Ri with table L using broadcast join as shown in Fig. 11. One challenge of semi join is that not every record in the Ri of R will join with a particular part Li of table L. To solve this issue, per-split semi join is proposed.
TableR OutputR1 TableR OutputR2
StdId Name Map1 StdId Name StdId Name Map2 StdId Name
55701 Lo Hashtable 56701 Dit 55501 Sul Hashtable 55501 Sul
55702 Mo StdId 56702 Hiu 55502 Sher StdId 55502 Sher
56700 Bo 55501 55701Hash table 55503 Jia 55501 55503Hash table 56701 Dit 55502 55702Hash table 55504 Di 55502 55504Hash table 56702 Hiu 56701 56700Hash table 55505 Tha 56701 55505Hash table 56703 Cha 56702 56703Hash table 56501 Ling 56702 56501Hash table
Split1 Split2
TableL TableL Intermediateresults2
StdId subject L.StdId L.Name R.StdId R.subject StdId subject L.StdId L.Name R.StdId R.subject
55501 701Ma p155501R1 55501 701 Map1 55501 Sul 55501 701
55501 371 55502R1 55501 371 55501 Sul 55501 371
55502 555 55502 555 55502 Sher 55502 555
56701 511Ma p2Intermediateresults1 56701 511
56701 814 L.StdId L.Name R.StdId R.subject 56701 814 Map2
56702 814 56701 Dit 56701 511 56702 814 L.StdId L.Name R.StdId R.subject
OutputR1 56701 Dit 56701 814 OutputR2 56701R2
StdId Name 56702 Hiu 56702 814 StdId Name 56702R2
56701 Dit 55501 Sul 56702 Hieu 55502 Sher Split1 Split2 R2 Split1 Split2 R1
Fig. 11. Example of the last phase in Semi joins (using equi-join)
F. Per-Split Semi Join
Per-split semi join consists of three phases. The first and the last phases are map jobs, and the second phase is a full map reduce job. The first phase is to generate the set of unique join keys in a split Li of table L, and stores them in the distributed file system, called Li.uk. The second phase is to load all records from a split of table R into main memory hash table, and read the unique keys from file Li.uk and probe the hash table for matching records from R. Each matched record is outputted with a tag RLi, which is used by reduce function to collect all records from table R that will join with Li. In the last phase, the results of the second phase and Li are joined directly as shown in Fig. 12 and Fig. 13.
Fig. 12. Example of the first phase and second phase in Per-Split Semi Join OutputofRjoinLi.uk
Tags StdId Name RL1 55501 Sul
RL1 55502 Sher L.StdId L.Name R.StdId R.subject
RL2 56701 Dit 55501 Sul 55501 701
RL2 56702 Hiu 55501 Sul 55501 371
TableL 55502 Sher 55502 555
StdId subject 56701 Dit 56701 511
55501 701 56701 Dit 56701 814 55501 371 56702 Hiu 56702 814 55502 555 56701 511 56701 814 56702 814
Outputoffinalphase
Fig. 13. Example of the last phase in Per-Split Semi Join (using equi-join)
IV. CONCLUSION
Many of big data mining problems can be solved by using MapReduce associated with Key-Value store. Based on advantages and drawbacks of those explained strategies in terms of time and network resources consumption, we provide a comparison of join strategies as shown in Table 1.
TABLE 1.COMPARISION OF JOIN STRATEGIES
Strategy Pros/Cons Suggestion
All pair partition join
Easy to implement, all compound partition transferred to reducers may not be processed by reducers.
Used when two datasets have more data in common, be sorted by the same fields.
Standard repartition join
Easy to implement, all records from both tables will be buffered before joining that may lead to insufficient memory problem.
Same with all pair partition join.
Improved repartition join
To reduce buffer size, implementation is more complex than the standard version.
Used when two joined datasets have few data in common, be sorted by the same fields. Broadcasting
join
Reduce sorting time and network traffic. May waste of network resource.
Used when one table is much smaller than the other table. Semi-join Some records from parts of
a table broadcasted to another table may not be joined.
Used when a large portion of a table may not be referenced by any record from the other table. Per-split semi
join
Complicated implementation, more reading and writing operations.
Same with semi-join.
Which strategy should be used in any problem depends on nature of the data and available network resources. If two
joined tables have more data in common or having sufficient network resources, all pair partition join, repartition join should be used because its implementation is not as complex as the others. If two joined tables have few data in common or having inadequate network resources, broadcasting join, semi join, per-split semi join should be used because it may reduce time and resources consumption.
Data in NoSQL database can be structured, semi-structured, or unstructured; and can be stored in many types of data structures such as indexed table of relational database, B-Tree, Queue, Hash table. Therefore, in addition to the consideration presented in this paper, selection of join strategies is also affected by data structures. MapReduce programmers may also need to consider data accessing time, data sorting time when selecting joining strategy. This issue is beyond the scope of this paper and is left for future research.
REFERENCES
[1] Mapanga, I. and P. Kadebu, Database Management Systems: A NoSQL
Analysis. Interna-tional Journal of Modern Communication Technologies & Research (IJMCTR), 2013. 1: p. 12-18.
[2] Hecht, R. and S. Jablonski. NoSQL evaluation: A use case oriented
survey. in Cloud and Service Computing (CSC), 2011 International Conference on. 2011.
[3] Dean, J. and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in OSDI '04: Sixth Symposium on Operating Systems Design and Implementation. 2004, USENIX: San Francisco, California, USA. p. 137–150.
[4] Dean, J. and S. Ghemawat, MapReduce: simplified data processing on large clusters, in Communications of the ACM - 50th anniversary issue: 1958 - 2008. 2008. p. 107-113.
[5] Celko, J., Chapter 6. Key–Value Stores, in Joe Celko's complete guide to NoSQL : what every SQL professional needs to know about
nonrelational databases, A. Dierna and H. Scherer, Editors. 2014, Morgan Kaufmann, Elsevier: USA. p. 81-88.
[6] Oracle, Chapter 1. Introduction to Berkeley DB, in Oracle Berkeley DB: Getting Started with Berkeley DB for C. 2013. p. 8-15.
[7] Jadhav, V., J. Aghav, and S. Dorwani, Join Algorithms Using
MapReduce: A Survey, in International Conference on Electrical Engineering and Computer Science. 2013, IOAJ INDIA: Coimbatore, Tamil Nadu, India. p. 40-44.
[8] White, T., Chapter 8. MapReuce Features, in Hadoop: The Definitive Guide, Second Edi-tion, M. Loukides, Editor. 2011, O'Reilly Media, Inc.,: USA. p. 225-257.
[9] White, T., Chapter 8. MapReduce Features, in Hadoop: The Definitive Guide, Third Edition, M. Loukides and M. Blanchette, Editors. 2012, O'Reilly Media, Inc.,: USA. p. 259-295.
[10] White, T., Chapter 7. MapReduce Types and Formats, in Hadoop: The Definitive Guide, Third Edition, M. Loukides and M. Blanchette, Editors. 2012, O'Reilly Media, Inc.: USA. p. 223-258.
[11] Blanas, S., et al., A comparison of join algorithms for log processing in MaPreduce, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 2010, ACM: Indianapolis, Indiana, USA. p. 975-986.
[12] Özsu, M.T. and P. Valduriez, Chapter 3. Distributed Database Design, in Principles of Dis-tributed Database Systems, Third Edition. 2011, Springer New York. p. 71-125.
[13] Bernstein, P.A., et al., Query processing in a system for distributed databases (SDD-1). ACM Trans. Database Syst., 1981. 6(4): p. 602-625. [14] Lee, K.-H., et al., Parallel data processing with MapReduce: a survey.
SIGMOD Rec., 2012. 40(4): p. 11-20.
[15] Okcan, A. and M. Riedewald, Processing theta-joins using MapReduce, in Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. 2011, ACM: Athens, Greece. p. 949-960. [16] Shim, K., MapReduce algorithms for big data analysis, in Proceedings
of the VLDB En-dowment 2012, VLDB Endowment. p. 2018-2017. [17] Olston, C., et al., Pig latin: a not-so-foreign language for data
processing, in Proceedings of the 2008 ACM SIGMOD international conference on Management of data. 2008, ACM: Vancouver, Canada. p. 1099-1110.