TECHNIQUES FOR IMPROVING APACHE HIVE PERFORMANCE USING RELATIONAL DATA

(1)

www.turkjphysiotherrehabil.org 3556

TECHNIQUES FOR IMPROVING APACHE HIVE PERFORMANCE USING

RELATIONAL DATA

S.Vinitha Reddy¹, Dr. Y. Mohana Roopa²

1PG Student, ²Professor

Department of Computer Science, Institute of Aeronautical Engineering Hyderabad, India.

[email protected], [email protected]

ABSTRACT

Hadoop is an open-source map-reduce implementation for storing and manipulating enormous data sets that have been widely adopted. End-users, on the other hand, can find Hadoop challenging to use, especially if they are unfamiliar with the map-reduce approach. Users must write map-reduce programs except for basic tasks like having raw counts or averages. Apache Hive, a Hadoop data warehouse architecture platform for processing structured data, allows users to quickly query, summarise, and interpret Big Data using HiveQL, which is a 3QL-like phrase. It can import and export data from and to the storage file system in a variety of file formats. When petabytes of data need to be processed, Hive's goal is to make it simple and effective.

Unlike RDBM3, Hive stores data in a document-based format, so JOIN3 queries reduce output and use a lot of resources. However, by correctly configuring Hive, you can boost efficiency for relational data. In this study, we use a variety of optimization approaches to increase database efficiency and analyze the outcomes to see if they affect the end result. We used the TPC-generated dataset for this analysis. TPC is a non-profit organization that was founded to provide benchmarks for database transaction processing. Different methods are often explored in the form of aggregation strategies, as well as their vector states.

Index Terms—Big data, Hadoop, Hive Optimization

I. INTRODUCTION

Hive is a data warehousing platform developed on top of the Hadoop ecosystem. Structured raw data files can be converted to Hive tables using a variety of Hive software. Hive also has SQL built-in, such as Hive Query Language (HQL), which is SQL-compatible. Hive query statements can be converted into Map-Reduce, Apache Tez, and Spark tasks that can be run on several nodes using Yarn management. As a result, it is compatible with a variety of execution engines. Hive supports a variety of data formats, including text, sequence, RCFile, Avro, Parquet, and Orc. Each structure has its own set of advantages and disadvantages. Hive does not explicitly help indexing data. In HDFS, data is contained in fixed sections (chunks). Hive, on the other hand, uses a partitioning technique to reduce the question range, which may be an easy substitution for indexing. Buckets are another similar indexing method that is used to separate Hive table data into several files or folders.

(2)

www.turkjphysiotherrehabil.org 3557 Fig 1: Hive Architecture

Components of Hive Architecture :

Clients : Apache Hive Serves a number of Java Based Clients Python, Ruby, and ODBC Drivers.

Services : Hive offers a variety of programmes. ( Command Line Interface and Webinterface to perform Queries)

Processing Resource Management : Internally, Hive executes queries using the Hadoop MapReduce code.

Distributed Storage : It uses HDFS for distributed storage underneath.

II. RELATED WORK

Hadoop and its ecosystem are distributed storage systems that are widely used for storing and processing large amounts of data, especially for data mining, log analysis, and business intelligence. Scientists are now constantly in search of low-cost, portable, and distributed data processing systems.

T. Liu and colleagues explored how to enhance Apache Hive's success while dealing with science results. In their 'Processing efficiency on Apache Pig, Apache Hive, and MySQL cluster' report,

Fuad et al. compared Hive and its alternatives for performance and usability. There have recently been studies that use Multiple Query Optimization (MQO) techniques to reorder queries in MapReduce frameworks for unstructured data intelligently.

Dokeroglu and et.al's optimised MapReduce engine reduced overall execution time by batching and combining correlated HiveQL queries before moving them on to the Hive database optimizer, using MQQ techniques.

Aluko's benchmarked BigSQL frameworks, including Apache Hive, using TPC-H and TPCx-BB with 1GB, 100GB, 300GB, and 1TB sizes. However, owing to a misconfiguration, Aluko was unable to use the Tez engine.

Up until 2014,

Huai, Yin, and colleagues published on significant technological developments in Apache Hive. In this paper, we look at optimization strategies and how they affect query results. In comparison to other studies that only use a small range of optimization strategies, we consider all of the available techniques in this analysis, making it more systematic.

III. EXPERIMENTAL SETTUP:

It's challenging to ensure fast queries for any data, but it's complicated with Hive when petabytes of data are involved. Hive scans the whole table without optimization, except for simple queries. To stop checking the entire table, users needed to provide some domain information about the table's attributes and notify Hive about it.

(3)

www.turkjphysiotherrehabil.org 3558 Hive's underlying method for SQL to Map-Reduce conversion assigns each statement a single process (join) and creates a new job for each data manipulation operation in a SQL statement, such as join, group by, table search, and so on. Reducing the amount of Map-Reduce tasks for these workers will result in significant efficiency gains.

Big data and a robust server are needed for performance assessment. For Hive database benchmarking, TPC-DS and related decision support queries were selected. To test Hive's output under different workloads and scenarios, defined query sets are chosen from the typical TPC-DS queries. Program for data generation was downloaded from the open source GitHub repository, and 50 GB of data was produced.

Three Linux servers for tests are configured with two cores KVM64 2.2 GHz for one master and two slaves, each with 16GB RAM. Ubuntu 18.04 Bionic Beaver Edition is the operating system for the virtual servers. Hadoop 3.2.0 and Hive 3.1.1 are the most recent stable versions. The HDFS block size (file break size) is set to 128MB.

The server is segregated to test activities in order to remove possible interruptions during benchmarking by removing intrusion from external points such as OS and network level congestions on shared infrastructure. To determine average reaction times, we repeated our experiments three times with 10-second intervals.

The queries selected from TPC-DS and their corresponding average execution times are given in tabular form in each section with the appropriate optimization technique, as a reminder of this section.

1. HIVE Optimization

In our benchmarking tests, we calculated 6 optimization approaches after a thorough analysis of the literature and user guides for Apache Hive and associated technologies (Apache Hive, Tez, Hortonworks, and Cloudera) for the most recent versions.

1) Hive Tez Execution Engine:

Apache Tez Engine is a MapReduce execution system that focuses on collaborative and high-performance batch data processing. Yarn is in charge of keeping track of it. Tez not only enhanced the MapReduce solution by speeding it up and allowing it to scale to petabytes of data, but it also made it easy to programme.

To Enable the Tez Execution engine below setting has to be set in CMD:

set hive.exe ution.engine=tez;

2) File Formats in Hive:

Hive tables are HDFS folders in which the data for a Hive table is stored in files in the related directory. The file can be stored in a row or column format. RCFile (Record Columnar File), ORC (Optimized Row Columnar), and Parquet file types store in columnar format, while TextFile and SequenceFile file types store in row format. The row storage reading unit is one record, and the columnar storage unit is a row party, which is much larger than one line but much smaller than split. RCFile organises data files into 4 MB row sections. However, the ORC file divides the data file into stripes in a sequential manner, then groups the documents in each strip into 10,000 lined row groups.

Row that has been optimised The row data is contained in a columnar format, and the columnar format stores arrays of rows in one register. As a result, compression is easy, lowering storage costs. The use of ORC files increases Hive's efficiency when reading, writing, and manipulating large amounts of data. To boost query accuracy, it employs techniques such as predicate push-down, compression, and others. Additionally, it supports data compression through the ZLIB and Snappy methods.

3) Hive Partitioning:

On HDFS, data is partitioned and saved in various individual files. It can scan partitioned datasets rather than the whole dataset. Partitioning is a technique for separating the Hive table into sections based on the values of specific columns. The aim of partitioning is to split the execution load horizontally. The following configuration parameters must be set on Hive shell before partitioning can be performed:

aset hive.exe .dynami .partition = true;

(4)

www.turkjphysiotherrehabil.org 3559 set hive.exe .dynami .partition.mode = nonstri t;

4) Bucketing Hive:

The productive efficiency of results decreases as the number of partitions increases. As a result, Hive provides the Bucketing solution to solve the issue of over-partitioning. The hash function determines the bucket as mod: % in a bucketed column Number of buckets in all. In the Hive HDFS folder, buckets serve as a file. As compared to non-bucketed tables, bucketed tables have more accurate sampling. Querying on a subset of data is feasible thanks to sampling. The following Hive environment is related:

set hive.enfor e.bu keting = true;

5) Vetorizzation Hive:

The method of modifying an algorithm's action such that the Query Statement acts on a number of values rather than a single value at a time is known as vectorization. Vector processes, in which a single instruction is used to process several data points (SIMD), are explicitly supported by modern CPUs. The Apache Hive query execution engine, by default, only processes one row of a table at a time. Processing a single row at a time causes one argument to wait for the next, making the process inefficient. Data rows are batched together and interpreted as a series of column vectors in vectorized database execution.

You must store the data in ORC format and set the following variable in Hive SQL to use vectorized query execution:

set hive.ve torized.exe ution.enabled=true;

6) Cost Based Optimization in Hive (CBO):

Before executing a query, Hive optimises the physical and logical execution plans. This, though, has little to do with the expense of the question in Hive's first edition. The Cost-Based Optimization technique's fundamental goal is to generate efficient query execution schedules. CBO examines the tables and conditions based on the query's character, reducing query execution times and maximising resource use. Calcite is essential as a schedule pruner because it can choose the most cost-effective query plan.

Hive transforms queries to a physical operating tree, which is then transformed and streamlined into Tez or MapReduce jobs, which are then executed on Hadoop. This operation also includes SQL parsing and transformation, in addition to operator-tree optimization. Query optimizers are designed to have the greatest impact on efficiency. As a result, preparing an optimal execution schedule will save you anywhere from seconds to hours when it comes to question execution time. Set the following parameters at the start of the question to use cost-based optimization:

set hive. bo.enable=true;

set hive. ompute.query.using.stats=true;

set hive.stats.fet h. olumn.stats=true;

For performing relational requests, the order in which the join optimization is performed is crucial. The aim is to find a join order that is less expensive. The CBO's first effort is to look up the join order that results in the most significant reduction of mediator rows as early in the plan tree as possible.

IV. RESULTS AND DISCUSSION:

The first step in this study was to identify a reliable data set for benchmarking. This data collection not only helps us to equate our results to those of other researchers, but it also allows us to have our published experiments confirmed by other field researchers. The next step was to determine the best data size for producing reliable and accurate timings. We confirmed that Big Data benchmarking necessitates the generation and loading of at least 50 GB of data into the Hive database. Using larger sample sets does not substantially alter our results, but it does

(5)

www.turkjphysiotherrehabil.org 3560 lengthen the research process. Furthermore, the best results were obtained when HDFS was hosted on several commodity hardware and Hive was designed to use this setup.

Based on these findings, we may conclude that the Tez engine has better query efficiency than the MapReduce technique, as seen in Fig. 2. Tez engine queries run up to four times quicker than MapReduce queries. It's worth noting the Hive is designed to store and query data in petabytes. On both large and small data sets, the MapReduce engine uses about the same amount of energy. Hive tables are stored in the Hadoop file system and can not be smaller than a specific block size. Since Hadoop's master node keeps track of all file locations, the more files it has to keep track of, the more memory it requires on the master node and the longer it takes to create a job execution plan.

Despite the fact that the data in our sample (50 GB) was not extensive, the order of output gain from one technique to the next still holds.

Fig 2: Map_Reduce Vs Tez

Working with binary file formats in Hive has benefits, as it increases query efficiency when combined with compression techniques. Because of the resource specifications, specific queries could not be addressed by MapReduce while using the uncompressed ORC format. We found that the ORC format is not as effective as intended in MapReduce; however, Fig. 3 and Fig. 4 indicate that when Tez is used in the ORC file format, significant changes are made. When Hive data is loaded using the ORC file format, queries run up to 5-10 times faster than when using the Tez engine for the text file format, as seen in Fig. 4. Tez and MapReduce performed similarly in nested query forms such as Q39 (Fig. 4).

Fig 3: Map Reduce Using File Formats Performance

(6)

www.turkjphysiotherrehabil.org 3561 Fig 4: Tez Using FileFormats Performance

Hive generates several small partitions based on column values when partitioning is allowed. When bucketing is allowed, Hive may be set to limit the amount of buckets available for storage. Hive scripts will set the bucket count when creating tables. We couldn't reconfigure Hive tables without partitioning and bucketing because the TPC-DS dataset produces table build and data load scripts. As a result, we were unable to assess the impact of partitioning and bucketing on Hive results. It is necessary to create a new dataset generator with these options configurable in order to evaluate the effect of partitioning and bucketing. It was also required to add or upgrade corresponding queries. Since partitioning and bucketing are similar to an RDMS's indexing method, one can expect better query results.

We ran several queries in the vectorization segment and concluded that significant performance improvements are not possible because the functionality mentioned below is not supported on vectorized data;

 DDL queries,

 DML queries other than single table,

 Read-only queries

 Formats other than Optimized Row Columnar (ORC).

According to Fig. 5, using the vectorization technique will speed up the execution of specific queries by up to three times. If such parameters, such as accepted column data-types and expressions, are met, a query in Hive is vectorized. According to query forms, Q19, 52, 55, 58, and 63 display a negative impact on results.

Fig 6 : Tez Vectorization

CBO is a valuable tool for improving efficiency. It creates a query execution strategy based on advanced planning algorithm statistics and data. CBOs play a critical role in utilising services through Hadoop clusters. Some of the closer findings may be attributed to the form of questionnaire, as shown by our studies, which indicate that CBO is up to 5-6 times quicker. For example, Q19, Q22, and Q93 show this.

(7)

www.turkjphysiotherrehabil.org 3562 Fig 7 : Cost Based Optimization with vectorization

Owing to time constraints, the optimization methods discussed here are only evaluated on a small number of queries. However, we are secure in the relevance of our findings because they match the user guides and white papers. With many similar questions, we were also able to make swift observations. The proof of concept of the techniques is shown in the study's experimental section. It goes without saying that under various hardware and specialised configurations, more promising outcomes can be obtained. As a result of the measurements, we discovered that Hive and Hadoop need additional memory changes. Due to scarce resources (CPU, RAM etc), we were only able to publish a few papers for queris.

V. CONCLUSION

Big Data analytics is becoming increasingly mainstream. The open source Hadoop technology is the most widely used tool for Big Data analytics. Hadoop is lightweight and efficient for processing massive datasets using the MapReduce method, but it is still challenging to set up and programme. Users and developers who are familiar with SQL will find Hive to be a simple to use technology. However, once correctly configured, it performs poorly, especially in queries that use JOINs for aggregation and filtering. We identified multiple configuration options and benchmarked them on select Hive and related technology stack versions in this report. We found a number of configuration options and tested them on a few different queries. We discovered from our research that most of these benchmark cases do not behave uniformly across runs if the data is not large enough. Furthermore, our test will run on datasets that are terabytes in scale. Since distributed architectures like Hadoop are fundamentally distributed. As a result, such large data structures need to be set up in a robust cluster structure. As a result, the queries can produce more productive results. Aside from that, data generation is time consuming.

Although the TPC dataset generator automates this process, query results can be collected without effect or for a limited time. Benchmarking is often a complex analysis since it involves multiple variables that can influence the experiments depending on the data set used to set up the environment. This is an additional problem to remember, since even massive seconds in the big data world are crucial.

REFERENCE

1 Y. Mohana Roopa, and A. Rama Mohan Reddy ―SPARROW Algorithm for Clustering Software Components‖, International Journal of Engineering Research and Development, Vol. 10, Issue 6, pp. 20-24, July 2014. (ISSN: 2278-067X (Online), ISSN: 2278-800X (Print), Google Scholar Indexed) 2 H. Paulheim and C. Bizer, ―Type inference on noisy RDF data,‖ in Proc. ISWC, Sydney, NSW, Australia, 2013, pp. 510–525.

3 Y. Mohana Roopa and B. Rajani, ―Software Component Capturing Using Clustering Algorithm‖, International Journal of Systems Algorithms and Applications, Vol. 2, Issue 4, pp. 121-126, June 2013

4 J. Cheng, C. Liu, M. C. Zhou, Q. Zeng, and A. Ylä-Jääski, "Automatic Composition of Semantic Web services based on fuzzy predicate Petri nets,"

IEEE Trans. Autom. Sci. Eng., Nov. 2013, to be published.

5 D. Kourtesis, J. M. Alvarez-Rodriguez, and I. Paraskakis, "Semantic-based QoS management in cloud systems: Current status and future challenges,

"Future Gener. Comput. Syst., vol. 32, pp. 307–323, Mar. 2014.

6 M. Nagy and M. Vargas-Vera, "Multiagent ontology mapping framework for the Semantic Web," IEEE Trans. Syst., Man, Cybern. A Syst., Humans, vol. 41, no. 4, pp. 693–704, Jul. 2011.

7 J. Weaver and J. Hendler, ―Parallel materialization of the finite RDFS closure for hundreds of millions of triples,‖ in Proc. ISWC, Chantilly, VA, USA, 2009, pp. 682–697.

8 Mohamed Morsey et al., "DBpedia SPARQL Benchmark – Performance Assessment with Real Queries on Real Data," ISWC 2011, Part I, LNCS 7031, pp. 454–469, 2011.

9 J. Urbani, S. Kotoulas, E. Oren, and F. Harmelen, "Scalable distributed reasoning using MapReduce," in Proc. 8th Int. Semantic Web Conf., Chantilly, VA, USA, Oct. 2009, pp. 634–649.

10 Mohamed Morsey et al., "DBpedia SPARQL Benchmark – Performance Assessment with Real Queries on Real Data," ISWC 2011, Part I, LNCS 7031, pp. 454–469.

11 C. Anagnostopoulos and S. Hadjiefthymiades, "Advanced inference in situation-aware computing," IEEE Trans. Syst., Man, Cybern. A Syst., Humans, vol. 39, no. 5, pp. 1108–1115, Sep. 2009.

(8)

www.turkjphysiotherrehabil.org 3563 12 Y. Mohana Roopa, and A. Rama Mohan Reddy ―SPARROW Algorithm for Clustering Software Components‖, International Journal of Engineering

Research and Development, Vol. 10, Issue 6, pp. 20-24, July 2014. (ISSN: 2278-067X (Online), ISSN: 2278-800X (Print), Google Scholar Indexed) 13 [13] Y. Mohana Roopa and B. Rajani, ―Software Component Capturing using Clustering Algorithms‖, International Conference on International

Conference on Communications, Electrical, Electronics and Computer Engineering, Kuala Lumpur, Malaysia, 9 – 10 May 2013.