Pig vs Hive: Benchmarking High-Level Query Languages
Dr. Peter McBrien
Imperial College London, UK
This article presents benchmarking results of two benchmarking sets (run on small clusters of 6 and 9 nodes) applied to Hive and Pig running on Hadoop 0.14.1. The first set of results was obtained by replicating the Apache Pig benchmark published by the Apache Foundation on 11/07/07 (which served as a baseline for comparing major Pig Latin releases). The second set was obtained by applying the TPC-H benchmarks.
The two benchmarks showed conflicting results: the first benchmark indicated that Pig outperformed Hive on most operations, yet interestingly the TPC-H results provide evidence that Hive is significantly faster than Pig. The article analyzes the two benchmarks, concluding with a set of differences and a justification of the results.
The article presumes that the reader has a basic knowledge of Hadoop and big data. (The article is not intended as an introduction to Hadoop, Pig or Hive.)
About the authors
Benjamin Jakobus graduated with a BSc in Computer Science from University College Cork in 2011, after which he co-founded an Irish start-up. He returned to university one year later and graduated with an MSc in Advanced Computing from Imperial College London in 2013. Since graduating, he has taken up a position as a Software Engineer at IBM Dublin (SWG, Collaboration Solutions). This article is based on his Masters thesis, developed under the supervision of Dr. Peter McBrien.
Dr. Peter McBrien graduated with a BA in Computer Science from Cambridge University in 1986. After some time working at Racal and ICL, he joined the Department of Computing at Imperial College as an RA in 1989, working on the Tempora Esprit Project. He obtained his PhD, Implementing Graph Rewriting By Graph Rewriting, in 1992, under the supervision of Chris Hankin. In 1994, he joined the Department of Computing at King's College London as a lecturer, and returned to the Department of Computing at Imperial College in August 1999 as a lecturer. He has since been promoted to Senior Lecturer.
The authors would like to thank Yu Liu, PhD student at Imperial College London, who, over the course of the past year, helped us with the technical problems that we encountered.
Despite Hadoop's popularity, users find it cumbersome to develop MapReduce (MR) programs directly. To simplify the task, high-level scripting languages such as Pig Latin and HiveQL have emerged, and users are often faced with the question of whether to use Pig or Hive. At the time of writing, no up-to-date scientific studies exist to help them answer this question; the performance differences between Pig and Hive are not well understood, and literature examining these differences is scarce.
The article presents benchmarking results of two benchmarking sets (run on small clusters of 6 and 9 nodes) applied to Hive and Pig running on Hadoop 0.14.1. The first set of results was obtained by replicating the Apache Pig benchmark published by the Apache Foundation on 11/07/07. The second set was obtained by applying the TPC-H benchmarks. These test cases consist of 22 distinct queries, each of which exhibits the same (or a higher) degree of complexity than is typically found in real-world industry scenarios, with varying query parameters and various types of access.
Whilst existing literature addresses some of these questions, it suffers from the following shortcomings:
1. The most recent Apache Pig benchmarks stem from 2009.
2. None of the cited literature examines how operations scale over different datasets.
3. Hadoop benchmarks were performed on clusters of 100 nodes or fewer. (Hadoop was designed to run on clusters containing thousands of nodes, so small-scale performance analysis may not do it justice.) Naturally, the same argument can be applied against the benchmarking results presented in this article.
4. The literature fails to indicate the communication overhead required by the various database management systems. (Again, this article does not address this concern; rather, it describes benchmark performance at runtime.)
Background: Benchmarking High-Level Query Languages
To date, several publications compare the performance of Pig Latin, HiveQL and other High-Level Query Languages (HLQLs). In 2011, Stewart, Trinder et al. compared Pig, HiveQL and JAQL using runtime metrics, according to how well each language scales, and by how much shorter queries are compared to using the Hadoop Java API directly. Using a 32-node Beowulf cluster, Stewart et al. found that:
• HiveQL scaled best (both up and out) and had the best runtime performance of the three HLQLs; hand-coded Java was only slightly faster. Java also had better scale-up performance than Pig.
• Pig is the most succinct and compact language of those compared.
• Pig and Hive QL are not Turing Complete.
• Pig and JAQL scaled the same except when using joins: Pig significantly outperformed JAQL in that regard.
• Pig and Hive are optimised for skewed key distribution and outperform hand-coded Java MR jobs in that regard.
Hive's performance advantage over Pig is further supported by Apache's own Hive performance benchmarks.
Moussa from the University of Tunis applied the TPC-H benchmark to compare the Oracle SQL Engine to Pig, and found that the SQL Engine greatly outperformed Pig (joins using Pig stood out as being particularly slow). Again, Apache's own benchmarks confirm this: when executing a join, Hadoop took 470 seconds, Hive took 471 seconds, and Pig took 764 seconds (Hive took 0.2% more time than Hadoop, whilst Pig took 63% more time than Hadoop). Moussa used a dataset of 1.1GB.
While studying the performance of Pig using large astrophysical datasets, Loebman et al also found that a relational database management system outperformed Hadoop; however, only a small number of nodes were used throughout the experiment, whereas Hadoop is designed to be used with hundreds if not thousands of nodes. Work by Schätzle et al further underpins this argument: in 2011 the authors proposed PigSPARQL (a framework for translating SPARQL queries to Pig Latin) based on the reasoning that for "scenarios, which can be characterized by first extracting information from a huge data set, second by transforming and loading the extracted data into a different format, cluster-based parallelism seems to outperform parallel databases." Their reasoning is based on an earlier benchmarking study; however, the authors of that study acknowledge that they cannot verify the claim that Hadoop would have outperformed the parallel database systems if only it had more nodes. That is, having benchmarked Hadoop's MapReduce with 100 nodes against two parallel database systems, it was found that both systems outperformed Hadoop:
First, at the scale of the experiments we conducted, both parallel database systems displayed a significant performance advantage over Hadoop MR in executing a variety of data intensive analysis benchmarks. Averaged across all five tasks at 100 nodes, DBMS-X was 3.2 times faster than MR and Vertica was 2.3 times faster than DBMS-X. While we cannot verify this claim, we believe that the systems would have the same relative performance on 1,000 nodes (the largest Teradata configuration is less than 100 nodes managing over four petabytes of data).
Running the Apache Benchmark
The experiment follows in the footsteps of the Pig benchmarks published by the Apache Foundation on 11/07/07. Their objective was to establish baseline numbers to compare against before making major changes to the system.
We decided to benchmark the execution of load, arithmetic, group, join and filter operations on 6 datasets (as opposed to just two):
• Dataset size 1: 30,000 records (772KB)
• Dataset size 2: 300,000 records (6.4MB)
• Dataset size 3: 3,000,000 records (63MB)
• Dataset size 4: 30 million records (628MB)
• Dataset size 5: 300 million records (6.2GB)
• Dataset size 6: 3 billion records (62GB)
That is, our datasets scale linearly, whereby the size equates to 3000 * 10^n.
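As a quick illustration of this scaling formula (n = 1..6), the six record counts can be derived as follows:

```python
# Record counts for dataset sizes 1 through 6: 3000 * 10^n for n = 1..6.
record_counts = [3000 * 10 ** n for n in range(1, 7)]
print(record_counts)
# [30000, 300000, 3000000, 30000000, 300000000, 3000000000]
```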
A seventh dataset consisting of 1,000 records (23KB) was produced to perform join operations on. Its schema is as follows:
name - string
marks - integer
gpa - float
The data was generated using the generate_data.pl Perl script available for download on the Apache website, which produced tab-delimited text files with the following schema:
name - string
age - integer
gpa - float
It should be noted that the experiment differs slightly from the original in that the original used only two datasets, of 200 million records (200MB) and 10 thousand records (10KB), whereas our experiment consists of six separate datasets with a scaling factor of 10 (i.e. 30,000 records, 300,000 records, etc.).
The benchmarks were run on a cluster consisting of 6 nodes (1 dedicated to the NameNode and JobTracker and 5 compute nodes). Each node was equipped with 2 dual-core Intel(R) Xeon(R) CPUs @2.13GHz and 4GB of memory. Furthermore, the cluster had Hadoop 0.14.1 installed, configured to 1024MB memory and 2 map + 2 reduce jobs per node.
As with the original Apache benchmark, the Linux time utility was used to measure the average wall-clock time of each operation (operations were executed 3 times each).
As with the original benchmark produced by Apache, we benchmarked the following operations:
1. Loading and storing of data.
2. Filtering the data so that 10% of the records are removed.
3. Filtering the data so that 90% of the records are removed.
4. Performing basic arithmetic on integers and floats (age * gpa + 3, age / gpa - 1.5).
5. Grouping the data (by name).
6. Joining the data.
7. Distinct select.
8. Ordering the data.
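For illustration, the gist of these operations in Pig Latin might look as follows (a sketch only; the field names follow the schema above, but the load path and filter predicate are placeholders rather than the exact benchmark scripts):

```
A = LOAD 'dataset' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);
F = FILTER A BY gpa > 1.0;                  -- filter (predicate is a placeholder)
C = FOREACH A GENERATE age * gpa + 3;       -- arithmetic
G = GROUP A BY name;                        -- group
D = DISTINCT A;                             -- distinct select
O = ORDER A BY name;                        -- order
STORE C INTO 'output' USING PigStorage();   -- store
```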
3.4.1 Pig Benchmark Results
Having determined the optimal number of reducers (8 is optimal in our case; see section 3.4.3), the results of the Pig benchmarks run on the Hadoop cluster are as follows:
             Set 1    Set 2    Set 3    Set 4     Set 5     Set 6
Arithmetic   32.82    36.21    49.49    83.25     423.63    3900.78
Filter 10%   32.94    34.32    44.56    66.68     295.59    2640.52
Filter 90%   33.93    32.55    37.86    53.22     197.36    1657.37
Group        49.43    53.34    69.84    105.12    497.61    4394.21
Join         49.89    50.08    78.55    150.39    1045.34   10258.19

Table 1: Averaged performance (in seconds) of arithmetic, join, group, order, distinct select and filter operations on six datasets using Pig. Scripts were configured to use 8 reduce and 11 map tasks.
Figure 1: Pig runtime plotted in logarithmic scale
3.4.2 Hive Benchmark Results
Having determined the optimal number of reducers (8 is optimal in our case; see section 3.4.3), the results of the Hive benchmarks run on the Hadoop cluster are as follows:
             Set 1    Set 2    Set 3    Set 4     Set 5     Set 6
Arithmetic   32.84    37.33    72.55    300.08    2633.72   27821.19
Filter 10%   32.36    53.28    59.22    209.5     1672.3    18222.19
Filter 90%   31.23    32.68    36.8     69.55     331.88    3320.59
Group        48.27    47.68    46.87    53.66     141.36    1233.4
Join         48.54    56.86    104.6    517.5     4388.34   -
Distinct     48.73    53.28    72.54    109.77    -         -

Table 2: Averaged performance (in seconds) of arithmetic, join, group, distinct select and filter operations on six datasets using Hive (a dash indicates that the job did not complete). Scripts were configured to use 8 reduce and 11 map tasks.
3.4.3 Hive and Pig: JOIN benchmarks using a variable number of reducers
The following results were obtained by varying the number of reduce tasks using the default JOIN for both Hive and Pig. All jobs were run on the aforementioned Hadoop cluster and each job was run three times. Runtimes were then averaged, as seen below. It should be noted that Map time, Reduce time and Total time refer to the cluster's cumulative CPU time.
Real time is the "actual" time as measured using the Unix time command (/usr/bin/time). It is the difference between these two time metrics that causes the discrepancy between the times in the tables below. That is, the CPU time required by a job running on a 10-node cluster will (more or less) be the same as the time required to run the same job on a 1000-node cluster. However, the real time it takes the job to complete on the 1000-node cluster will be 100 times less than if it were run on a 10-node cluster.
The JOIN scripts are as follows (with varied reduce tasks, of course):

Hive:

set mapred.reduce.tasks=4;
SELECT * FROM dataset JOIN dataset_join
ON dataset.name = dataset_join.name;

Pig:

A = LOAD '/user/bj112/data/4/dataset' using PigStorage('\t')
    AS (name:chararray, age:int, gpa:float) PARALLEL 4;
B = LOAD '/user/bj112/data/join/dataset_join' using PigStorage('\t')
    AS (name:chararray, age:int, gpa:float) PARALLEL 4;
C = JOIN A BY name, B BY name PARALLEL 4;
STORE C INTO 'output' using PigStorage() PARALLEL 4;

By default, Pig uses a hash join whilst Hive uses an equi-join.
# Reducers   # Map tasks   Avg. real time   Avg. map time   Avg. reduce time
1            11            328.54           346.67          228.21
2            11            263.2            340.59          312.82
4            11            208.24           337.01          350.71
8            11            168.44           337.53          351.82
16           11            180.08           333.87          391.95
32           11            205.91           333.78          479.36
64           11            252.35           332.89          624.65
128          11            355.34           334.64          930.7

Table 3: Pig benchmarks for performing the default JOIN operation on Dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.
# R    % time spent on map tasks   % time spent on reduce tasks   Avg. total CPU time
1      60.3                        39.7                           574.89
2      52.13                       47.87                          653.41
4      48.77                       50.75                          691.05
8      48.96                       51.04                          689.35
16     46                          54                             725.82
32     41.05                       58.95                          813.14
64     34.76                       65.23                          957.55
128    26.45                       73.55                          1265.34

Table 4: Pig benchmarks for performing the default JOIN operation on Dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.
# R    Std. dev map time   Std. dev red. time   Std. dev total time   Std. dev real time
1      1.71                1.03                 0.91                  3.09
2      0.66                49.68                49.86                 12.21
4      0.75                24.61                21.34                 19.00
8      0.65                2.42                 2.63                  2.74
16     1.52                3.24                 4.75                  0.54
32     1.22                2.61                 3.77                  0.50
64     3.62                5.80                 9.41                  2.44
128    0.97                5.58                 6.54                  5.63

Table 5: Pig benchmarks for performing the default JOIN operation on Dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.
# Reducers   # Map tasks   Avg. real time   Avg. map time   Avg. reduce time
1            4             846.04           321.98          362.14
2            4             654.82           263.06          367.4
4            4             588.65           204.17          607.89
8            4             541.67           228.13          533.72
16           4             551.62           229.66          548.42
32           4             555.22           216.48          609.11
64           4             580.91           216.12          809.43
128          4             647.78           220.16          1187.91

Table 6: Hive benchmarks for performing the default JOIN operation on Dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.
# R    % time spent on map tasks   % time spent on reduce tasks   Avg. total CPU time
1      47.06%                      52.94%                         684.12
2      41.73%                      58.27%                         630.46
4      25.14%                      74.86%                         812.06
8      39.94%                      60.06%                         680
16     30.00%                      70.48%                         778.1
32     26.00%                      74.05%                         822.54
64     21.00%                      78.93%                         1025.54
128    16.00%                      84.36%                         1408.08

Table 7: Hive benchmarks for performing the default JOIN operation on Dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.
Figure 3: Pig reduce time (CPU time) plotted in logarithmic scale.

# R    Std. dev map time   Std. dev red. time   Std. dev total time   Std. dev real time
1      60.43               67.86                8.14                  7.71
2      21.78               79.5                 74.59                 16.75
4      7.97                2.45                 5.53                  6.26
8      7.26                6.86                 9.33                  11.35
16     28.77               4.96                 31.04                 11.96
32     2.81                3.89                 2.76                  3.18
64     2.04                4.73                 4.42                  4.10
128    3.76                7.54                 5.18                  22.38

Table 8: Hive benchmarks for performing the default JOIN operation on Dataset size 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB). See section 3.1 for info on the datasets.
The resultant join consisted of 37,987,763 lines (1.5GB). The original dataset used to perform the join consisted of 1,000 records; the dataset to which it was joined consisted of 30 million.
Performing a replicated join using Pig on the same datasets resulted in a speed-up of 11% (the replicated join had an average real-time runtime of 149.96 seconds, compared to 168.44 seconds for its hash-join equivalent). By adjusting the minimum and maximum split sizes and providing Hadoop with "hints" as to how many map tasks should be used, we forced Hive to use 11 map tasks (the same as Pig) and arrived at the following results:
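For reference, the replicated-join variant changes only the JOIN statement of the Pig script shown earlier (a sketch; the PARALLEL hint here mirrors the 8-reducer configuration, and A and B are the relations loaded above):

```
-- Fragment-replicate join: the small 1,000-record relation (B) must be listed
-- last; it is copied to every map task, avoiding a reduce-side shuffle.
C = JOIN A BY name, B BY name USING 'replicated' PARALLEL 8;
```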
             Map CPU time   Reduce CPU time   Total CPU time   Real time
Run 1        279.48         444.27            723.75           524.96
Run 2        265.34         433.02            698.36           501.15
Run 3        265.78         443.01            708.79           497.26
Avg.         270.2          440.1             710.3            507.79
Std. dev.    8.04           6.16              12.76            14.99

Table 9: Hive benchmarks for performing the default JOIN operation on Dataset 4 (consisting of 30 million records (628MB)) and a small dataset consisting of 1,000 records (23KB) whilst using 8 reduce and 11 map tasks.
Running the TPC-H Benchmark
As previously noted, the TPC-H benchmark was used to confirm the existence of a performance difference between Pig and Hive. TPC-H is a decision support benchmark published by the Transaction Processing Performance Council (TPC), an organization founded for the purpose of defining global database benchmarks. As stated in the official TPC-H specification:
[TPC-H] consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance while maintaining a sufficient degree of ease of implementation. This benchmark illustrates decision support systems that:
• Examine large volumes of data;
• Execute queries with a high degree of complexity;
• Give answers to critical business questions.
The performance metrics used for these benchmarks are the same as those used as part of the aforementioned Apache benchmarks:
• Real time runtime (using the Unix time command)
• Cumulative CPU time
• Map CPU time
• Reduce CPU time
In addition, 4 new metrics were added:
• Number of map tasks launched
• Number of reduce tasks launched
• Average map heap usage
• Average reduce heap usage
The TPC-H benchmarks differ from the Apache benchmarks (described earlier) in that a) they consist of more queries and b) the queries are more complex and intended to simulate a realistic business environment.
To recap section 3, we first attempted to replicate the Apache Pig benchmark published by the Apache Foundation on 11/07/07. Consequently, the data was generated using the generate_data.pl Perl script available for download on the Apache website. The Perl script produced tab-delimited text files with the following schema:
name - string
age - integer
gpa - float
Six separate datasets were generated in order to measure the performance of arithmetic, group, join and filter operations. The datasets scaled linearly, whereby the size equates to 3000 * 10^n: dataset size 1 consisted of 30,000 records (772KB), dataset size 2 consisted of 300,000 records (6.4MB), dataset size 3 consisted of 3,000,000 records (63MB), dataset size 4 consisted of 30 million records (628MB), dataset size 5 consisted of 300 million records (6.2GB) and dataset size 6 consisted of 3 billion records (62GB).

One obvious downside to the above datasets is their simplicity: in reality, databases tend to be much more complex and most certainly consist of tables containing more than just three columns. Furthermore, databases usually don't consist of just one or two tables (the queries executed as part of the benchmarks from section 3 involved 2 tables at most; in fact all queries, except the join, involved only 1 table).
The benchmarks produced within this report address these shortcomings by employing the much richer TPC-H datasets generated using the TPC dbgen utility. This utility produces 8 individual tables: customer.tbl consisting of 15,000,000 records (2.3GB), lineitem.tbl consisting of 600,037,902 records (75GB), nation.tbl consisting of 25 records (4KB), orders.tbl consisting of 150,000,000 records (17GB), partsupp.tbl consisting of 80,000,000 records (12GB), part.tbl consisting of 20,000,000 records (2.3GB), region.tbl consisting of 5 records (4KB) and supplier.tbl consisting of 1,000,000 records (137MB).
The TPC-H test cases consist of 22 distinct queries, each of which is designed to exhibit the same (or a higher) degree of complexity than is typically found in real-world scenarios, with varying query parameters and various types of access. They are designed so that each query covers a large part of each table/dataset.
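To give a flavour of the workload, one of the simpler of the 22 queries, TPC-H Q6 (the forecasting revenue change query), can be expressed in HiveQL roughly as follows (shown here with the benchmark's default substitution parameters):

```sql
SELECT SUM(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= '1994-01-01'
  AND l_shipdate < '1995-01-01'
  AND l_discount BETWEEN 0.05 AND 0.07
  AND l_quantity < 24;
```

Most of the other 21 queries are considerably more involved, combining multi-way joins, grouping and ordering over the tables listed above.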
Several modifications have been made to the cluster since we ran the first set of experiments detailed in section 3, and the cluster on which the TPC-H benchmarks were run now consists of 9 hosts:
• awake.doc.ic.ac.uk - 3.20GHz Intel(R) Core(TM) i5 CPU, 7847MiB system memory. Running Ubuntu 12.04.2 LTS, Precise Pangolin.
• mavolio.doc.ic.ac.uk- 3.00GHz Intel(R) Core(TM)2 Duo CPU, 5712MiB system memory. Running Ubuntu 12.04.2 LTS, Precise Pangolin.
• zim.doc.ic.ac.uk- 3.00GHz Intel(R) Core(TM)2 Duo CPU, 3824MiB system memory. Running Ubuntu 12.04.2 LTS, Precise Pangolin.
• zorin.doc.ic.ac.uk- 2.66GHz Intel(R) Core(TM)2 Duo CPU, 3872MiB system memory. Running Ubuntu 12.04.2 LTS, Precise Pangolin.
• tiffanycase.doc.ic.ac.uk - 2.66GHz Intel(R) Core(TM)2 Duo CPU, 3872MiB system memory. Running Ubuntu 12.04.2 LTS, Precise Pangolin.
• zosimus.doc.ic.ac.uk- 3.00GHz Intel(R) Core(TM)2 Duo CPU, 3825MiB system memory. Running Ubuntu 12.04.2 LTS, Precise Pangolin.
• artemis.doc.ic.ac.uk - 3.20GHz Intel(R) Core(TM) i5 CPU, 7847MiB system memory. Running Ubuntu 12.04.2 LTS, Precise Pangolin.

Both the Hive and Pig TPC-H scripts are available for download from the Apache website.
Section 4.6 presents additions to the set of benchmarks from section 3. The datasets and scripts used are identical to those presented in section 3. Note: the Linux time utility was used to measure the average wall-clock time of each operation. For other metrics (CPU time, heap usage, etc.) the Hadoop logs were used.
This section presents the results for both Pig and Hive.
Running the TPC-H benchmarks for Hive produced the following results (note: script names were abbreviated):
Script   Avg. runtime   Std. dev.   Avg. cumulative CPU time   Avg. map tasks   Avg. reduce tasks
q1       623.64         26.2        4393.5                     309              81
q2       516.36         3.94        2015                       82               21
q3       1063.73        8.22        10144                      402.5            102
q4       344.88         70.74       0                          0                0
q5       1472.62        28.67       0                          0                0
q6       502.27         7.66        2325.5                     300              1
q7       2303.63        101.51      6                          5                2
q8       1494.1         0.06        13235                      428              111
q9       3921.66        239.36      48817                      747              192
q10      1155.33        44.71       7427                       416              103
q11      434.28         1.26        1446.5                     59.5             15
q12      763.14         11.4        4911.5                     380              82
q13      409.16         11.31       3157                       93.5             24
q14      515.39         9.82        3231.5                     322              80
q15      687.81         14.62       3168.5                     300              80
q16      698.14         69.94       14                         3                0
q17      1890.16        36.81       10643                      300              80
q18      2147           38.57       5591                       300              69
q19      1234.5         13.15       17168.5                    322              80
q20      1228.72        36.92       91                         13               3
q21      3327.84        16.3        10588.5                    300              80
Script   Avg. map heap   Avg. reduce heap   Avg. total heap   Avg. map CPU   Avg. reduce CPU   Avg. total CPU
q1       1428            57                 1486              6225           2890              9115
q2       790             162                953               18485          13425             31910
q3       1743            241                1985              54985          22820             77805
q4       N/A             N/A                N/A               N/A            N/A               N/A
q5       N/A             N/A                N/A               N/A            N/A               N/A
q6       N/A             N/A                N/A               N/A            N/A               N/A
q7       561             174                737               3275           4285              7560
q8       1620            469                2092              31625          23975             55600
q9       1882            199                2082              18055          12585             30640
q10      3960            367                4328              268270         233640            501910
q11      1468            254                1722              60365          33730             94095
q12      1588            145                1733              5665           4565              10230
q13      1663            349                2013              134420         42070             176490
q14      1421            57                 1478              5525           2180              7705
q15      N/A             N/A                N/A               N/A            N/A               N/A
q16      216             0                  216               14435          0                 14435
q17      N/A             N/A                N/A               N/A            N/A               N/A
q18      N/A             N/A                N/A               N/A            N/A               N/A
q19      1421            71                 1493              5250           2395              7645
q20      N/A             N/A                N/A               N/A            N/A               N/A
q21      N/A             N/A                N/A               N/A            N/A               N/A
q22      1202            0                  1202              159390         0                 159390
Figure 4: Real time runtimes of all 22 TPC-H benchmark scripts for Hive.
Running the TPC-H benchmarks for Pig produced the following results (note: script names were abbreviated):
Script   Avg. runtime   Std. dev.
q1       2192.34        5.88
q2       2264.28        48.35
q3       2365.21        355.49
q4       1947.12        262.88
q5       5998.67        250.99
q6       589.74         2.65
q7       1813.7         148.62
q8       5405.69        811.68
q9       7999.28        640.31
q10      1871.74        93.54
q11      824.42         103.37
q12      1401.48        120.69
q13      818.79         104.89
q14      913.31         3.79
q15      878.67         0.98
q16      925.32         133.34
q17      2935.41        178.31
q18      4909.62        67.7
q19      8375.02        438.12
q20      2669.12        299.79
q21      9065.29        543.42
q22      818.79         14.74

Table 12: TPC-H benchmark results for Pig using 6 trials (time is in seconds, unless indicated otherwise).
Script   Avg. map heap   Avg. reduce heap   Avg. total heap   Avg. map CPU   Avg. reduce CPU   Avg. total CPU
q1       3023            400                3426              187750         12980             200730
q2       3221            861                4087              69670          43780             113450
q3       1582            631                2218              110120         46090             156210
q4       633             363                999               1340           9190              10530
q5       3129            878                4010              56020          28550             84570
q6       N/A             N/A                N/A               N/A            N/A               N/A
q7       1336            372                1713              20410          12870             33280
q8       5978            882                6865              306900         162330            469230
q9       874             348                1222              2670           8620              11290
q10      N/A             N/A                N/A               N/A            N/A               N/A
q11      2875            807                3686              172670         46840             219510
q12      1016            751                1771              33880          49830             83710
q13      600             346                950               1740           9860              11600
q14      115             0                  115               710            0                 710
q15      2561            559                3123              67130          24960             92090
q16      3538            576                4116              606220         70250             676470
q17      375             174                551               5470           3740              9210
q18      965             385                1353              9090           13810             22900
q19      328             175                504               4700           5390              10090
q20      3232            858                4094              69820          48000             117820
q21      1668            486                2160              34440          16800             51240
q22      989             417                1409              20260          13080             33340
Figure 5: Real time runtimes of all 22 TPC-H benchmark scripts for Pig.
Hive vs Pig (TPC-H)
As shown in figure 7, Hive outperforms Pig in the majority of cases (12 to be precise). Their performance is roughly equivalent for 3 cases and Pig outperforms Hive in 6 cases. At first glance, this contradicts all results of the experiments from section 3.
Figure 6: Real time runtimes of all 22 TPC-H benchmark scripts contrasted.

Upon examining the TPC-H benchmarks more closely, two issues stood out that explain this discrepancy. The first is that after a Pig script writes results to disk, the output files are immediately deleted using Hadoop's fs -rmr command. This process is quite costly and is measured as part of the real-time execution of the script (the fact that this operation is expensive in terms of runtime is not catered for). In contrast, the HiveQL scripts merely drop tables at the beginning of the script. Dropping tables is cheap, as it only involves manipulating the meta-information on the local filesystem - no interaction with the Hadoop filesystem is required. In fact, omitting the recursive delete operation reduces runtime by about 2%, whereas removing DROP TABLE in Hive does not produce any performance difference.
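The two cleanup styles can be contrasted as follows (a sketch; the output path and table name are illustrative, not those of the actual benchmark scripts):

```
# Pig benchmark scripts: recursive HDFS delete after each run,
# counted as part of the script's measured real time
hadoop fs -rmr /user/bj112/output

-- HiveQL scripts: dropping a table only touches local metadata
DROP TABLE IF EXISTS q1_result;
```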
Figure 7: The runtime comparison between Pig and Hive (plotted in log-arithmic scale) for the Group By operator based on the benchmarks from section 3.
For example, when running the TPC-H benchmarks for Pig, script 21 (q21_suppliers_who_kept_orders_waiting.pig) had a real-time runtime of 9065.29 seconds, 41% of which (or 3748.32 seconds) was required to execute the first Group By. In contrast, Hive required only 1031.23 seconds for the grouping of data. The script grouped data 3 times:
-- This Group By took up 41% of the runtime
gl = group lineitem by l_orderkey;
fo = filter orders by o_orderstatus == 'F';
[...]
ores = order sres by numwait desc, s_name;
Consequently, the excessive use of the Group By operator skews the benchmark results significantly. Re-running the scripts and omitting the grouping of data produces the expected results. For example, running script 3 (q3_shipping_priority.pig) and omitting the Group By operator significantly reduces the runtime (to 1278.49 seconds real-time runtime, or a total of 12,257,630ms CPU time).
Figure 8: The total average heap usage (in bytes) of all 22 TPC-H benchmark scripts contrasted.
Another interesting artefact is exposed by figure 8: in all instances, Hive's heap usage is significantly lower than that of Pig. This might be explained by the fact that Hive does not need to build intermediary data structures, but Pig (at the time of writing) does.
Enabling compression on dataset size 4 (which contains a large amount of random data) produces a 3.2% speed-up in real-time runtime.
Compression in Pig can be enabled by setting the pig.tmpfilecompression flag to true and then setting pig.tmpfilecompression.codec to either gzip or lzo. Note that gzip produces better compression, whilst LZO is much faster in terms of runtime.
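In a Pig script, this amounts to two set statements (shown here with LZO, which favours runtime; gzip would trade runtime for a better compression ratio):

```
-- Compress intermediate (tmp) files written between MapReduce jobs
set pig.tmpfilecompression true;
set pig.tmpfilecompression.codec lzo;
```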
By editing the entry for mapred.reduce.slowstart.completed.maps in Hadoop's conf/mapred-site.xml, we can tune the percentage of map tasks that must be completed before reduce tasks are created. By default, this value is set to 5%, which was found to be too low for our cluster. Balancing the ratio of mappers and reducers is critical to optimizing performance: reducers should be started early enough that data transfer is spread out over time, thus preventing network bottlenecks. On the other hand, reducers should not be started so early that they use up slots that could be used by map tasks. Performance peaked when reduce tasks were started after 70% of map jobs had completed.
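The corresponding entry in conf/mapred-site.xml might look as follows (a sketch reflecting the 70% figure found to be optimal for our cluster):

```xml
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <!-- Start reducers once 70% of map tasks have completed (default: 0.05) -->
  <value>0.70</value>
</property>
```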
The maximum number of map and reduce tasks for a node can be specified using mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. Naturally, care should be taken when configuring these: a node with a maximum of 20 map slots running a script configured to use 30 map slots will incur significant performance penalties, as the first 20 map tasks will run in parallel but the additional 10 will only be spawned once the first 20 have completed execution (consequently requiring one extra round of computation). The same goes for the number of reduce tasks: as illustrated by figure 9, performance peaks when a task requires just below the maximum number of reduce slots per node.
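The per-node slot limits described above are likewise set in conf/mapred-site.xml (the values shown match the 2 map + 2 reduce slots per node used in section 3):

```xml
<!-- Cap the number of concurrent map and reduce tasks per TaskTracker -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
```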
Figure 9: Real time runtimes contrasted with a variable number of reducers for join operations in Pig.
A small addition to section 3 - CPU runtimes
One outstanding item that our first set of results failed to report was the contrast between real-time runtime and CPU runtime. As expected, cumulative CPU runtime was higher than real-time runtime (since tasks are distributed between nodes).
Figure 10: Real time runtime contrasted with CPU runtime for the Pig scripts run on dataset size 5.
Of specific interest was the finding that Pig consistently outperformed Hive (with the exception of grouping data). Specifically:
• For arithmetic operations, Pig is 46% faster (on average) than Hive
• For filtering 10% of the data, Pig is 49% faster (on average) than Hive
• For filtering 90% of the data, Pig is 18% faster (on average) than Hive
• For joining datasets, Pig is 36% faster (on average) than Hive
This conflicted with existing literature that found Hive to outperform Pig: in 2009, Apache's own performance benchmarks found that Pig was significantly slower than Hive. These findings were validated in 2011 by Stewart, Trinder et al., who also found that Hive map-reduce jobs outperformed those produced by the Pig compiler.
When forced to equal terms (that is, when forcing Hive to use the same number of mappers as Pig), Hive remains 67% slower than Pig when comparing real-time runtime (i.e. it takes Pig roughly 1/3 of the time to compute the JOIN). That is, increasing the number of map tasks in Hive from 4 to 11 only resulted in a 13% speed-up.
It should also be noted that the performance difference between Pig and Hive does not scale linearly. That is, initially there is little difference in performance (due to the large start-up costs). However, as the datasets increase in size, Hive becomes consistently slower (to the point of crashing when attempting to join large datasets).
To conclude, the discussed experiments allowed for the answering of 4 core questions:
1. How do Pig and Hive perform as other Hadoop properties are varied (e.g. number of map tasks)? Balancing the ratio of mappers and reducers has a big impact on real-time runtime and is consequently critical to optimizing performance: reducers should be started early enough that data transfer is spread out sufficiently to prevent network congestion. On the other hand, reducers should not be started so early that they use up slots that could be used by map tasks.
Care should also be taken when setting the maximum allowable map and reduce slots per node. For example, a node with a maximum of 20 map slots running a script configured to use 30 map slots will incur a significant performance penalty: the first 20 map tasks run in parallel, but the additional 10 are only spawned once the first 20 have completed execution (consequently requiring one extra round of computation). The same goes for the number of reduce tasks, as illustrated by figure 9: performance peaks when a task requires a number of reduce slots per node that falls just below the maximum.
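In Hadoop's classic (pre-YARN) MapReduce framework, these per-node slot limits live in conf/mapred-site.xml. A minimal sketch matching the example above (20 map slots per node; the reduce-slot value shown is illustrative, not a measured optimum):

```xml
<!-- conf/mapred-site.xml: per-TaskTracker slot limits (classic MapReduce) -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>20</value>  <!-- at most 20 map tasks run concurrently on this node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>   <!-- illustrative; keep requested reducers just below this -->
  </property>
</configuration>
```

A script requesting more slots than a node offers is not rejected; the surplus tasks simply queue behind the first wave, which is what produces the extra round of computation described above.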
2. Do more complex datasets and queries (e.g. the TPC-H benchmarks) yield the same results as the Apache benchmarks from 11/07/07? At first glance, the TPC-H benchmarks contradict the Apache benchmark results: in nearly all instances, Hive outperforms Pig. (However, the heavy use of the Group By operator in the TPC-H scripts skews the entire result set, as can be seen in section 4.4.)
3. How does real time runtime scale with regards to CPU runtime? As expected given the cluster configuration (9 nodes), the real time runtime was between 15% and 20% of the cumulative CPU runtime.
4. What should the ratio of map and reduce tasks be? The point at which reducers start, relative to the fraction of completed map tasks, is configured through the mapred.reduce.slowstart.completed.maps property in Hadoop's conf/mapred-site.xml. The default value of 0.05 (i.e. 5%) was found to be too low; the optimum for our cluster was at about 70%.
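As a configuration sketch, the 70% value found optimal for our cluster would be set as follows:

```xml
<!-- conf/mapred-site.xml: start reducers once 70% of map tasks have completed -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.70</value>  <!-- default is 0.05 (5%), found to be too low here -->
</property>
```

Note that the optimal value is workload- and cluster-dependent; 0.70 reflects this particular 9-node configuration, not a general recommendation.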
It should also be noted that the use of the Group By operator within the TPC-H benchmarks skews results significantly (recall that the Apache benchmarks showed Pig outperforming Hive in all instances except when using the Group By operator: when grouping data, Pig was 104% slower than Hive). Re-running the scripts and omitting the grouping of data produces the expected results. For example, running script 3 (q3_shipping_priority.pig) and omitting the Group By operator significantly reduces the runtime (to 1278.49 seconds real time runtime, or a total of 12,257,630ms CPU time).
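The contrast can be illustrated with a minimal Pig Latin sketch (the relation and field names are hypothetical, not those of the actual TPC-H script):

```pig
-- Hypothetical relation, loosely modelled on the TPC-H orders table
orders  = LOAD 'orders' AS (o_orderkey:int, o_orderdate:chararray);

-- Variant WITH grouping: GROUP BY forces a full shuffle/reduce phase,
-- which is where Pig fell behind Hive in the benchmarks above
grouped = GROUP orders BY o_orderdate;
counts  = FOREACH grouped GENERATE group, COUNT(orders);
STORE counts INTO 'output_grouped';

-- Variant WITHOUT grouping: a plain filter/projection pipeline avoids
-- that shuffle, producing the much lower runtimes reported above
recent  = FILTER orders BY o_orderdate > '1995-03-15';
STORE recent INTO 'output_filtered';
```

The point is not the specific query but the execution shape: removing the Group By removes the grouping shuffle, which is the stage on which Pig's compiler was measured to be slowest.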
As already noted in the introduction, Hadoop was designed to run on clusters containing hundreds or thousands of nodes, so a small-scale performance analysis may not do it justice. Ideally, the benchmarks presented in this article should be run on much larger clusters.
Stewart R. J.; Trinder P.; Loidl H. (2011), "Comparing High Level MapReduce Query Languages". Springer Berlin Heidelberg, Advanced Parallel Processing Technologies, pages 58-72.
Moussa, R. (2012), "TPC-H Benchmarking of Pig Latin on a Hadoop Cluster". Communications and Information Technology (ICCIT), 2012 International Conference, pages 85-90.
Loebman S.; Nunley D.; Kwon Y.; Howe B.; Balazinska M.; Gardner J.P. (2009), "Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help?". In Proc. of CLUSTER 2009, pages 1-10.
Schätzle A.; Przyjaciel-Zablocki M.; Hornung T.; Lausen G. (2011), "PigSPARQL: Übersetzung von SPARQL nach Pig Latin". Proc. BTW, pages 65-84.
Lin J.; Dyer C. (2010), "Data-intensive text processing with MapReduce". Synthesis Lectures on Human Language Technologies, pages 1-177.
Pavlo A.; Paulson E.; Rasin A.; Abadi D. J.; DeWitt D. J.; Madden S.; Stonebraker M. (2009), "A comparison of approaches to large-scale data analysis". In Proc. SIGMOD, ACM, pages 165-178.
DBPedias (2013), "Pig Performance Benchmarks". http://wiki.apache.org/pig/PigPerformance, visited 15/01/2013.
Gates A. F.; Natkovich O.; Chopra S.; Kamath P.; Narayanamurthy S. M.; Olston C.; Reed B.; Srinivasan S.; Srivastava U. (2009), "Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience". Proc. VLDB Endow., pages 1414-1425.
Transaction Processing Council, "Transaction Processing Council Website". http://www.tpc.org/, visited 18/06/2013.
Apache Software Foundation (2009), "Hive, PIG, Hadoop benchmark results". https://issues.apache.org/jira/secure/attachment/12411185/hive_benchmark_2009-06-18.pdf, visited 03/01/2013.