
Query Processing Using Big Data Technology in Mobile Environment: Performance Perspective

Mr. Ajay P. Chendke

DCST, DCPE, HVPM, Amravati, India

Dr. Mrs. S. S. Sherekar

Department of Computer Science

SGBAU, Amravati, India

Dr. V. M. Thakare

Department of Computer Science

SGBAU, Amravati, India

Abstract: The scope of databases has grown exponentially in the past and is expected to grow further, as storage costs fall steadily while storage capacity rises rapidly. Some years ago, a terabyte database was considered huge; today such databases are sometimes regarded as insignificant, and the daily volumes of data added to some databases are measured in terabytes. In the future, petabyte, exabyte, and zettabyte databases will be common. With such volumes of data, the sequential processing paradigm cannot survive; for example, even assuming a data rate of 1 gigabyte per second, reading through a petabyte database takes more than 11 days. To manage such volumes effectively, multiple resources must be allocated to the task, very often on a massive scale. Besides the massive volume of data to be processed in a database, some data is distributed across the globe in a grid environment. Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common and should be handled automatically by the framework.

This paper evaluates query performance in terms of how quickly, effectively, and efficiently data can be retrieved from multiple sites, using Hadoop for storage and MapReduce for processing in a mobile environment. It also outlines future trends in query processing techniques that use big data technologies to retrieve data in minimum time in a mobile computing environment.

Keywords: query processing, mobile computing, query optimization, mobile environment, mobile database.

I. INTRODUCTION

Advanced query processing for the mobile environment is a basic need for fast retrieval of big data. The main aim of query processing is to minimize the cost of each query execution, where the cost may take the form of time or space complexity. Different query processing algorithms therefore aim to reduce the size of intermediate and final results as well as the processing cost. Query processing in a mobile environment involves joint processing among different sites, which include static servers and mobile computers. Effective query processing maximizes usability by giving fast access to global data in a mobile computing environment [1].

Recently, wireless communication and mobile computing have become some of the most popular and rapidly spreading fields. A mobile device is expected to fulfill two main objectives: efficiency and convenience. It is successful only if end users accept it as a helpful tool that increases their productivity and makes daily life more convenient. Mobile devices have many limitations, including connectivity problems, quality of mobile service, and security during communication and transmission. The use of portable laptop computers, palmtops, and personal digital assistants with integrated communication capabilities facilitates mobile computing [2]. Massive data centers are also part of the emergence of cloud computing, where data access has shifted from local machines to powerful servers hosting web applications and services, making data access across the internet through standard web browsers universal. This adds another dimension to such systems. Finally, the cost of each operation must be estimated. This estimate depends critically on two factors: the statistical information about relations that the database must maintain, and the estimated statistics for intermediate results, which are needed to compute the cost of complex expressions [3].

II. ROLE OF QUERY PROCESSING

The main objective is to investigate the performance improvement of mobile query processing, focusing on the server and client sides. Server-side query processing covers single-cell and multi-cell queries, where a cell is the service area in which a single stationary host communicates with a static network [4]. A quick response to a mobile query is important, because mobile users invariably move to another location while awaiting the query result. To handle such a dynamic situation, there are two solutions for answering single-cell and multi-cell queries. The solutions for processing single-cell queries are divided into static and dynamic query scopes and angle of movement. The static and dynamic query scopes are extended to process multi-cell queries [5]. Furthermore, another solution is added to deal with situations where the areas of several base stations are either disjoint or overlapping. Finally, it also handles disconnections that occur while a query result is transmitted from a base station to the mobile user. Indexing mechanisms are important for speeding up query processing, especially for multi-cell queries [6].

• Query Processing: Query processing is the process of selecting the best execution strategy to use in response to a database request. In the query processing section, different algorithms and query processing strategies are implemented to maximize performance for the user [7].


… checks the syntax of the query. In this decision box, the syntax is verified and the query is forwarded to the next block.

• Relational algebra expression: Relational algebra resembles a programming language. It is a procedural manipulation language made up of a collection of operations on relations. The operations fall broadly into two types: 1) set-oriented operations, such as union, intersection, minus, and Cartesian product; 2) relation-oriented operations, such as projection, selection, join, and division [8].

• Optimizer: It essentially enumerates a certain set of plans and chooses the plan with the least estimated cost. The intent of the query optimizer is to find a good evaluation plan for a given query, that is, an efficient execution plan for evaluating it. It generates alternative plans and estimates the cost of each. The optimizer uses the system catalog to retrieve information about data types, data sizes, and access paths, decomposes the query into smaller blocks, and handles a single block at a time. Most important is the decision block, which decides the direction of the query and holds all the information about it [9].

• System Catalog: It contains statistics about the data, also called the history of the database [10].

• Database: The database is a huge collection of information that websites serve to local users [11].

• Execution Plan: It is the blueprint for evaluating a query submitted by the user.

• Evaluation Engine: It decides on the execution plan and includes a query evaluator that understands and evaluates relational operators, along with query parsing and query optimization. It works with two types of data: index files and data files [12].
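As a rough illustration of the relation-oriented operations listed above, the following Python sketch implements selection, projection, and an equi-join over relations stored as lists of dictionaries. The relations `users` and `calls`, their attributes, and the helper names are hypothetical and chosen only for this example; a real evaluation engine would work on index and data files rather than in-memory lists.

```python
def select(relation, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, attributes):
    """Projection: keep only the named attributes, dropping duplicate tuples."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result

def join(left, right, attribute):
    """Equi-join: combine tuples that agree on the shared attribute."""
    return [{**l, **r} for l in left for r in right if l[attribute] == r[attribute]]

# Hypothetical sample relations.
users = [{"uid": 1, "city": "Amravati"}, {"uid": 2, "city": "Pune"}]
calls = [{"uid": 1, "cell": "A"}, {"uid": 1, "cell": "B"}, {"uid": 2, "cell": "A"}]

# Cells used by users from Amravati: project(select(join(users, calls))).
result = project(
    select(join(users, calls, "uid"), lambda row: row["city"] == "Amravati"),
    ["cell"],
)
```

Applying the selection before the join would give the same answer with a smaller intermediate result, which is the kind of rewrite an optimizer looks for.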

III. LAYERS OF QUERY PROCESSING

The problem of query processing can itself be decomposed into several subproblems, corresponding to various layers. The input is a query on distributed data expressed in relational calculus. Four main layers are involved in mapping the distributed query into an optimized sequence of local operations, each acting on a local database. The first three layers are performed by a central site and use global information; the local sites perform the fourth [13].

Query optimization:

An optimizer essentially enumerates a certain set of plans and chooses the plan with the least estimated cost. Query processing is broadly divided into the following four layers [14]:

Query Decomposition: The first layer decomposes the distributed calculus query into an algebraic query on global relations. The information needed for this transformation is found in the global conceptual schema describing the global relations [15].

Data Localization: The input to the second layer is an algebraic query on distributed relations. The main role of the second layer is to localize the query’s data using data distribution information. Relations are fragmented and stored in disjoint subsets called fragments, each being stored at a different site [16].

Global Query Optimization: The input to the third layer is a fragment query, that is, an algebraic query on fragments. The goal of query optimization is to find an execution strategy for the query, which is close to optimal. An execution strategy for a distributed query can be described with relational algebra operations and communication primitives (send/receive operations) for transferring data between sites [17].

Local Query Optimization: The last layer is performed by all the sites that hold fragments involved in the query. Each sub-query executing at one site, called a local query, is then optimized using the local schema of that site [18].
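The idea of enumerating plans and choosing the one with the least estimated cost can be sketched as follows. This is a toy model under stated assumptions, not the optimizer of any particular system: it considers only left-deep join orders, and the relation cardinalities, the single join selectivity, and the cost formula (the sum of estimated intermediate result sizes) are all invented for illustration.

```python
from itertools import permutations

def plan_cost(order, cardinality, selectivity):
    """Estimated cost of a left-deep join order: sum of intermediate result sizes."""
    size = cardinality[order[0]]
    cost = 0.0
    for name in order[1:]:
        # Estimated size of joining the running result with the next relation.
        size = size * cardinality[name] * selectivity
        cost += size
    return cost

def best_plan(cardinality, selectivity):
    """Enumerate every join order and return the one with the least estimated cost."""
    return min(
        permutations(cardinality),
        key=lambda order: plan_cost(order, cardinality, selectivity),
    )

# Hypothetical catalog statistics: tuples per relation and one shared selectivity.
cardinality = {"R": 1000, "S": 10, "T": 100}
selectivity = 0.01

plan = best_plan(cardinality, selectivity)
```

With these numbers the cheapest orders join the two small relations first and the large relation last, so the intermediate results stay small. Exhaustive enumeration grows factorially with the number of relations, which is why practical optimizers restrict the set of plans they consider.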

IV. ROLE OF HADOOP TECHNOLOGY

Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters assembled from commodity hardware. All the modules in Hadoop are designed with the basic assumption that hardware failures are common and should be handled automatically by the framework.

The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code to the nodes, which process in parallel the data assigned to them. This approach takes advantage of data locality: each node manipulates the data to which it has access. This allows the dataset to be processed faster and more efficiently than in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking [19].

The base Apache Hadoop framework is composed of the following modules:

• Hadoop Common: It contains libraries and utilities needed by other Hadoop modules;

• Hadoop Distributed File System (HDFS): It is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;

• Hadoop YARN: It is a resource-management platform responsible for managing computing resources in clusters and using them for scheduling users' applications;

• Hadoop MapReduce: It is an implementation of the MapReduce programming model for large-scale data processing.

The term Hadoop has come to refer not just to the base modules above, but also to the ecosystem, or collection of extra software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, and Apache HBase.

The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program. The Hadoop ecosystem exposes richer user interfaces, as follows [20]:

Hadoop distributed file system:


three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX-compliant, because the requirements for a POSIX file-system differ from the target goals for a Hadoop application. The trade-off of not having a fully POSIX-compliant file-system is increased performance for data throughput and support for non-POSIX operations such as Append [21].
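The placement pattern mentioned above (two copies on one rack, one copy on a different rack) can be sketched in a few lines. This is only an illustration of the pattern, not HDFS's actual placement code: the function name, the rack and node names, and the rule of simply taking the first suitable nodes are assumptions made for this example.

```python
def place_replicas(nodes_by_rack):
    """Pick three nodes: two on one rack and one on a different rack.

    nodes_by_rack maps a rack name to the list of node names on that rack.
    """
    for rack, nodes in nodes_by_rack.items():
        if len(nodes) < 2:
            continue  # this rack cannot hold two of the copies
        for other_rack, other_nodes in nodes_by_rack.items():
            if other_rack != rack and other_nodes:
                return [nodes[0], nodes[1], other_nodes[0]]
    raise ValueError("need one rack with two nodes and a second non-empty rack")

# Hypothetical cluster layout.
cluster = {"rack1": ["n1", "n2"], "rack2": ["n3"]}
replicas = place_replicas(cluster)
```

Keeping one copy on a separate rack means the data survives the loss of an entire rack, while the two same-rack copies keep most replication traffic off the backbone network.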

The HDFS file system is not restricted to MapReduce jobs. It can be used for other applications, many of which are under development at Apache. Hadoop can in theory be used for any sort of work that is batch-oriented rather than real-time, is very data-intensive, and benefits from parallel processing of data. It can also be used to complement a real-time system.

The commercial applications of Hadoop include:

• Log and/or clickstream analysis of various kinds

• Marketing analytics

• Machine learning and/or sophisticated data mining

• Image processing

• Processing of XML messages

• Web crawling and/or text processing

• General archiving, including of relational/tabular data

V. WORKING OF HADOOP AND MAPREDUCE TECHNOLOGY

The HDFS file system includes a so-called secondary namenode, a misleading name that some might incorrectly interpret as a backup namenode for when the primary namenode goes offline. In fact, the secondary namenode regularly connects with the primary namenode and builds snapshots of the primary namenode's directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary namenode without having to replay the entire journal of file-system actions and then edit the log to create an up-to-date directory structure. Because the namenode is the single point for storage and management of metadata, it can become a bottleneck when supporting a huge number of files, especially a large number of small files. HDFS Federation, a new addition, aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate namenodes [22].

An advantage of using HDFS is data awareness between the job tracker and task tracker. The job tracker schedules map or reduce jobs to task trackers with an awareness of the data location. For example, if node A contains data (x,y,z) and node B contains data (a,b,c), the job tracker schedules node B to perform map or reduce tasks on (a,b,c) and node A to perform map or reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. When Hadoop is used with other file systems, this advantage is not always available. This can have a significant impact on job-completion times, which has been demonstrated when running data-intensive jobs.

JobTracker and TaskTracker

Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware file system, the JobTracker knows which node contains the data and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes its JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser [23].
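The preference order described above (the node that holds the data, then a node in the same rack, then any other node) can be sketched as a small selection function. This is an illustrative model rather than Hadoop's scheduler code; the function name `pick_tracker`, the node names, and the `rack_of` mapping are hypothetical.

```python
def pick_tracker(block_node, free_trackers, rack_of):
    """Choose a TaskTracker for a task whose input block lives on block_node.

    Preference order: the node holding the data, then a node on the same
    rack, then any free node. Returns None when no tracker is free.
    """
    if block_node in free_trackers:
        return block_node  # node-local: no network transfer needed
    for tracker in free_trackers:
        if rack_of[tracker] == rack_of[block_node]:
            return tracker  # rack-local: traffic stays off the backbone
    return free_trackers[0] if free_trackers else None

# Hypothetical topology: nodes A and B share a rack, C is on another rack.
rack_of = {"A": "rack1", "B": "rack1", "C": "rack2"}

node_local = pick_tracker("A", ["C", "A"], rack_of)
rack_local = pick_tracker("A", ["C", "B"], rack_of)
off_rack = pick_tracker("A", ["C"], rack_of)
```

Note that this sketch, like the scheme it models, looks only at free slots and data location and ignores the current load of the chosen machine.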

Known limitations of this approach are:

• The allocation of work to TaskTrackers is very simple. Every TaskTracker has a number of available slots (such as "4 slots"). Every active map or reduce task takes up one slot. The JobTracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence its actual availability.

• If one TaskTracker is very slow, it can delay the entire MapReduce job, especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes.

Scheduling

By default, Hadoop uses FIFO scheduling, and optionally five scheduling priorities, to schedule jobs from a work queue [24].

Fair scheduler

The goal of the fair scheduler is to provide fast response times for small jobs and QoS for production jobs. The fair scheduler has three basic concepts.

1. Jobs are grouped into pools.

2. Each pool is assigned a guaranteed minimum share.

3. Excess capacity is split between jobs.

By default, jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, and a limit on the number of running jobs.

Capacity scheduler

The capacity scheduler supports several features that are similar to the fair scheduler.

• Queues are allocated a fraction of the total resource capacity.

• Free resources are allocated to queues beyond their total capacity.

• Within a queue a job with a high level of priority has access to the queue's resources.
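The fair scheduler's pool concepts described earlier (a guaranteed minimum share per pool, with excess capacity shared out) can be illustrated with a simplified allocation routine. This is a sketch under stated assumptions, not the scheduler's real algorithm: it works on whole pools rather than individual jobs, it hands out excess slots one at a time in round-robin order, and it assumes the minimum shares together do not exceed the total number of slots. The pool names and numbers are hypothetical.

```python
def fair_allocate(total_slots, pools):
    """Allocate slots to pools.

    pools maps a pool name to (min_share, demand). Each pool first receives
    its guaranteed minimum (capped by its demand); the excess is then handed
    out one slot at a time to pools that still have unmet demand.
    """
    alloc = {name: min(min_share, demand) for name, (min_share, demand) in pools.items()}
    remaining = total_slots - sum(alloc.values())
    while remaining > 0:
        hungry = [name for name, (_, demand) in pools.items() if alloc[name] < demand]
        if not hungry:
            break  # every pool is satisfied; leftover slots stay idle
        for name in hungry:
            if remaining == 0:
                break
            alloc[name] += 1
            remaining -= 1
    return alloc

# Hypothetical pools: a production pool with a guaranteed share and a default pool.
pools = {"default": (0, 10), "prod": (4, 6)}
allocation = fair_allocate(10, pools)
```

Here the production pool receives its guaranteed four slots first, and the six remaining slots are shared until its demand of six is met; the default pool takes the rest.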

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") coordinates the processing by handling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.


the scalability and fault-tolerance achieved for a variety of applications by optimizing the execution engine once. As such, a single-threaded implementation of MapReduce will usually not be faster than a traditional implementation; any gains are usually only seen with multi-threaded implementations. The use of this model is beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerance features of the MapReduce framework come into play. Optimizing the communication cost is essential to a good MapReduce algorithm.

MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation that has support for distributed shuffles is part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology, but has since been generalized. By 2014, Google was no longer using MapReduce as a big data processing model, and development on Apache Mahout had moved on to more capable and less disk-oriented mechanisms that incorporated full map and reduce capabilities.

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a file system (unstructured) or in a database (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted.

• "Map" step: Each worker node applies the "map()" function to the local data, and writes the output to temporary storage. A master node ensures that, of the redundant copies of input data, only one is processed.

• "Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.

• "Reduce" step: Worker nodes now process each group of output data, per key, in parallel.
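The three steps above can be imitated on a single machine to show how the user-supplied functions fit together. The sketch below is a minimal in-memory model, not a distributed implementation: `run_mapreduce` is a hypothetical helper, and the sample student names are invented. It reproduces the earlier example of counting students by first name.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle, and reduce sequentially over an in-memory list."""
    # Map step: apply map_fn to every record, collecting (key, value) pairs.
    mapped = [pair for record in records for pair in map_fn(record)]
    # Shuffle step: bring all values that share a key together.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce step: each key's group is processed independently of the others.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical input: full names of students.
students = ["Asha Rao", "Ravi Kale", "Asha Patil"]

name_counts = run_mapreduce(
    students,
    map_fn=lambda full_name: [(full_name.split()[0], 1)],  # key = first name
    reduce_fn=lambda name, ones: sum(ones),                # count per name
)
```

Because each key's group is reduced independently, the reduce calls could run on different worker nodes; only the shuffle needs data to cross between them.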

MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of "reducers" can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle; a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.

Another way to look at MapReduce is as a five-step parallel and distributed computation:

1. Prepare the Map() input: The "MapReduce system" designates Map processors, assigns the input key value K1 that each processor would work on, and provides that processor with all the input data associated with that key value.

2. Run the user-provided Map() code: Map() is run exactly once for each K1 key value, generating output organized by key values K2.

3. "Shuffle" the Map output to the Reduce processors: The MapReduce system designates Reduce processors, assigns the K2 key value each processor should work on, and provides that processor with all the Map-generated data associated with that key value.

4. Run the user-provided Reduce() code: Reduce() is run exactly once for each K2 key value produced by the Map step.

5. Produce the final output: The MapReduce system collects all the Reduce output, and sorts it by K2 to produce the final outcome.

These five steps can be logically thought of as running in sequence (each step starts only after the previous step is completed), although in practice they can be interleaved as long as the final result is not affected.

In many situations, the input data might already be distributed among many different servers, in which case step 1 could sometimes be greatly simplified by assigning Map servers that would process the locally present input data. Similarly, step 3 could sometimes be sped up by assigning Reduce processors that are as close as possible to the Map-generated data they need to process.

The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are as follows [25]:

• Input reader

• Map function

• Partition function

• Compare function

• Reduce function

• Output writer

Input reader:

The input reader divides the input into appropriate size 'splits' (in practice typically 64 MB to 128 MB) and the framework assigns one split to each Map function. The input reader reads data from stable storage (typically a distributed file system) and generates key/value pairs.

Map function:

The Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can be different from each other. If the application is doing a word count, the map function would break the line into words and output a key/value pair for each word. Each output pair would contain the word as the key and the number of instances of that word in the line as the value.

Partition function:

Each Map function output is allocated to a particular reducer by the application's partition function for sharding purposes. The partition function is given the key and the number of reducers and returns the index of the desired reducer.


Between the map and reduce stages, the data is shuffled in order to move the data from the map node that produced it to the shard in which it will be reduced. The shuffle can sometimes take longer than the computation time depending on network bandwidth, CPU speeds, data produced and time taken by map and reduce computations.
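A partition function with the interface described above (key and number of reducers in, reducer index out) might look like the following sketch. Hashing the key and taking the result modulo the number of reducers is an assumption made here for illustration; the text does not prescribe a particular function. CRC32 from Python's standard `zlib` module is used because Python's built-in `hash()` of strings can differ between runs, which would send the same key to different reducers.

```python
import zlib

def partition(key, num_reducers):
    """Return the index of the reducer responsible for this key.

    The same key always maps to the same index, so all values for a key
    end up on one reducer.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Hypothetical keys spread over four reducers.
shards = {key: partition(key, 4) for key in ["hadoop", "query", "mobile"]}
```

A function that spreads keys unevenly would overload some reducers while others sit idle, so the choice of partition function has a direct effect on shuffle and reduce time.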

Comparison function:

The input for each Reduce is pulled from the machine where the Map ran and sorted using the application's comparison function.

Reduce function:

The framework calls the application's Reduce function once for each unique key in the sorted order. The Reduce can iterate through the values that are associated with that key and produce zero or more outputs. In the word count example, the Reduce function takes the input values, sums them and generates a single output of the word and the final sum.

Output writer:

The Output Writer writes the output of the Reduce to stable storage.
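The word count example that runs through the hot spots above can be put together in one short sketch. It is a single-machine illustration with hypothetical function names, not Hadoop code: the map function emits each word with its number of occurrences in the line, sorting by key stands in for the comparison function, and the reduce function sums the values for each word.

```python
from itertools import groupby

def map_words(line):
    """Map: emit one (word, occurrences in this line) pair per distinct word."""
    words = line.lower().split()
    return [(word, words.count(word)) for word in sorted(set(words))]

def reduce_counts(word, counts):
    """Reduce: sum the per-line counts for one word."""
    return (word, sum(counts))

def word_count(lines):
    pairs = [pair for line in lines for pair in map_words(line)]
    # Comparison function: order the pairs by key so equal words are adjacent.
    pairs.sort(key=lambda pair: pair[0])
    # Reduce is called once per unique key, in sorted key order.
    return [
        reduce_counts(word, [count for _, count in group])
        for word, group in groupby(pairs, key=lambda pair: pair[0])
    ]

totals = word_count(["big data big", "data query"])
```

In a real job the input reader would supply the lines from a distributed file system and the output writer would store `totals` back to it; here both are replaced by plain Python lists.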

VI. ANALYSIS AND DISCUSSION

This section analyzes and discusses the performance of query processing optimization in a mobile environment. The critical analysis covers the various query processing methods and techniques, their layers and optimization strategies, and big data technologies: Hadoop for storage, MapReduce for processing, and the hot spots (input reader, map function, partition function, compare function, reduce function, and output writer).

Performance considerations

MapReduce programs are not guaranteed to be fast. The main benefit of this programming model is that it exploits the optimized shuffle operation of the platform, so that only the Map and Reduce parts of the program have to be written. In practice, however, the author of a MapReduce program has to take the shuffle step into consideration; in particular, the partition function and the amount of data written by the Map function can have a large impact on performance. Additional modules such as the Combiner function can help to reduce the amount of data written to disk and transmitted over the network.
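The effect of a Combiner can be seen in a small word count sketch. The functions below are hypothetical and run on one machine; they only show that pre-aggregating a mapper's output locally shrinks the number of pairs that would have to be written and shuffled, without changing the totals the reducers eventually compute.

```python
from collections import Counter

def map_line(line):
    """Map: emit (word, 1) for every word occurrence."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: sum one mapper's counts per word before the shuffle."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

raw = map_line("to be or not to be")   # six pairs leave the map function
combined = combine(raw)                # four pairs remain after combining
```

This works for word count because addition is associative and commutative; a combiner is only safe when applying it early cannot change the final reduce result.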

When designing a MapReduce algorithm, the author needs to choose a good tradeoff between the computation and the communication costs. Communication cost often dominates the computation cost, and many MapReduce implementations are designed to write all communication to distributed storage for crash recovery.

For processes that complete quickly and where the data fits into the main memory of a single machine or a small cluster, using a MapReduce framework is usually not effective. These frameworks are designed to recover from the loss of whole nodes during the computation. This crash recovery is expensive and only pays off when the computation involves many computers and a long runtime. A task that completes in seconds can just be restarted in the case of an error, and the likelihood of at least one machine failing grows quickly with the cluster size. On such problems, implementations that keep all data in memory and simply restart the computation on node failure, or, when the data is small enough, non-distributed solutions, will often be faster than a MapReduce system.
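The remark that the likelihood of at least one machine failing grows quickly with cluster size can be made concrete with a short calculation. The sketch assumes, purely for illustration, that machines fail independently and each with the same probability p during the run, so the chance that at least one of n machines fails is 1 - (1 - p)^n. The failure probability used below is a made-up figure, not a measurement.

```python
def prob_any_failure(p_single, n_machines):
    """Probability that at least one of n independent machines fails."""
    return 1.0 - (1.0 - p_single) ** n_machines

# Hypothetical per-machine failure probability of 1% during one job.
p = 0.01
small_cluster = prob_any_failure(p, 10)     # roughly 0.10
large_cluster = prob_any_failure(p, 1000)   # very close to 1
```

On a handful of machines a failure during a short job is unlikely, so restarting is cheap; on a thousand machines some failure is almost certain, which is when built-in crash recovery starts to pay off.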

Distribution and reliability

MapReduce achieves reliability by parceling out a number of operations on the set of data to each node in the network. Each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node records the node as dead and sends out the node's assigned work to other nodes. Individual operations use atomic operations for naming file outputs as a check to ensure that there are no parallel conflicting threads running. When files are renamed, it is possible to also copy them to another name in addition to the name of the task. The reduce operations operate in much the same way. Because of their inferior properties with regard to parallel operations, the master node attempts to schedule reduce operations on the same node.
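The failure handling described above (a node that stays silent past the reporting interval is recorded as dead and its work is sent to other nodes) can be sketched as follows. The function names, the timestamp representation, and the round-robin reassignment rule are assumptions made for this illustration; `reassign` also assumes that at least one live node is available.

```python
def dead_nodes(last_report, now, timeout):
    """Return the nodes whose last report is older than the timeout."""
    return sorted(node for node, seen in last_report.items() if now - seen > timeout)

def reassign(assignments, dead, live):
    """Move the tasks of dead nodes onto live nodes in round-robin order."""
    updated = {node: list(tasks) for node, tasks in assignments.items() if node not in dead}
    orphaned = [task for node in dead for task in assignments.get(node, [])]
    for index, task in enumerate(orphaned):
        updated.setdefault(live[index % len(live)], []).append(task)
    return updated

# Hypothetical state: n2 last reported 65 time units ago, with a timeout of 30.
last_report = {"n1": 100, "n2": 40}
assignments = {"n1": ["t1"], "n2": ["t2", "t3"]}

dead = dead_nodes(last_report, now=105, timeout=30)
new_assignments = reassign(assignments, dead, live=["n1"])
```

A short timeout reschedules work sooner but risks declaring a merely slow node dead, so the interval is a trade-off rather than a fixed constant.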

VII. CONCLUSION

This paper has evaluated query performance in terms of how quickly, effectively, and efficiently data can be retrieved from multiple sites using query processing optimization techniques in a mobile environment. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code to the nodes, which process in parallel the data assigned to them. This approach takes advantage of data locality: the nodes manipulate the data they have access to, which allows the dataset to be processed faster and more efficiently than in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking. MapReduce is useful in a wide range of applications, including distributed pattern-based searching, distributed sorting, web link-graph reversal, singular value decomposition, web access log statistics, inverted index construction, document clustering, machine learning, and statistical machine translation. Moreover, the MapReduce model has been adapted to several computing environments, such as multi-core and many-core systems, desktop grids, volunteer computing environments, dynamic cloud environments, and mobile environments. MapReduce's stable inputs and outputs are usually stored in a distributed file system. The transient data is usually stored on local disk and fetched remotely by the reducers.

Finally, we conclude that the query optimization techniques developed to date are significant, but from a future perspective there is a need for further advances in query optimization techniques for the mobile environment.

References

[1] S. Sangeetha, S. Dhanabal, Vishnu Kumar Kaliappan, "Optimization of K-NN Query Processing in Road Networks Using Frequent Query Retrieval Table", World Congress on Computing and Communication Technologies, 2014.

[2] Jayant R. Haritsa, "Query Optimizer Plan Diagrams: Production, Reduction and Applications", ICDE Conference, pp. 1374-1377, 2011.

[3] Chun-Nan Hsu, Craig A. Knoblock, "Semantic Query Optimization for Query Plans of Heterogeneous Multidatabase Systems", IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 6, pp. 959-78, November/December 2000.

[4] Girish Welling, B. R. Badrinath, "An Architecture for Exporting Environment Awareness to Mobile Computing Applications", IEEE Transactions on Software Engineering, vol. 24, no. 5, pp. 391-400, May 1998.

[6] Rajeswari Malladi, Karen C. Davis, "Applying Multiple Query Optimization in Mobile Databases", Proceedings of the 36th Hawaii International Conference on System Sciences, 2003.

[7] Pauray S. M. Tsai, Arbee L. P. Chen, "Optimizing Queries with Foreign Functions in a Distributed Environment", IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 4, pp. 809-824, July/August 2002.

[8] Dan Chalmers, Morris Sloman, "A Survey of Quality of Service in Mobile Computing Environments", IEEE Communications Surveys, http://www.comsoc.org/pubs/surveys, Second Quarter 1999.

[9] Shaila Pervin, Joarder Kamruzzaman, Gour Karmakar, "Quality Adjustable Query Processing Framework for Wireless Sensor Networks", IEEE International Symposium on Network Computing and Applications, pp. 354-358, 2011.

[10] Mihaela A. Bornea, Vasilis Vassalos, Yannis Kotidis, Antonios Deligiannakis, "Adaptive Join Operators for Result Rate Optimization on Streaming Inputs", IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 8, pp. 1110-1125, August 2010.

[11] Pedro Bizarro, Nicolas Bruno, David J. DeWitt, "Progressive Parametric Query Optimization", IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 4, pp. 582-594, April 2009.

[12] Wen-Chih Peng, Ming-Syan Chen, "Query Processing in a Mobile Computing Environment: Exploiting the Features of Asymmetry", IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 7, pp. 982-969, July 2005.

[13] Giedrius Slivinskas, Christian S. Jensen, Richard T. Snodgrass, "A Foundation for Conventional and Temporal Query Optimization Addressing Duplicates and Ordering", IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 1, pp. 21-49, January/February 2001.

[14] Sangeetha Seshadri, Vibhore Kumar, Brian Cooper, Ling Liu, "A Distributed Stream Query Optimization Framework through Integrated Planning and Deployment", IEEE Transactions on Parallel and Distributed Systems, vol. 20, no. 10, pp. 1439-1453, October 2009.

[15] John Grant, Jarek Gryz, Jack Minker, Louiqa Raschid, "Logic-Based Query Optimization for Object Databases", IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 4, pp. 529-547, July/August 2000.

[16] Jens Claussen, Alfons Kemper, Guido Moerkotte, Klaus Peithner, Michael Steinbrunn, "Optimization and Evaluation of Disjunctive Queries", IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 2, pp. 238-260, March/April 2000.

[17] Sukheja Deepak, Singh Umesh Kumar, Mishra Durgesh, Pandya Bhupendra, "Query Processing and Optimization of Parallel Database System in Multi Processor Environments", Sixth Asia Modelling Symposium, pp. 191-194, 2012.

[18] Pankti Doshi, Vijay Raisinghani, "Review of Dynamic Query Optimization Strategies in Distributed Database", IEEE Transactions on Knowledge and Data Engineering, pp. 145-149, 2011.

[19] Jing Zhao, Ruisheng Zhang, Zhili Zhao, Dianwei Chen, Lujie Hou, "Hadoop MapReduce Framework to Implement Molecular Docking of Large-Scale Virtual Screening", IEEE, pp. 350-353, 2012.

[20] Altintas I., "Workflow-Driven Programming Paradigms for Distributed Analysis of Biological Big Data", IEEE, pp. 1-1, 2013.

[21] Dixin Tang, Taoying Liu, Hong Liu, Wei Li, "RHJoin: A Fast and Space-Efficient Join Method for Log Processing in MapReduce", IEEE, pp. 975-980, 2014.

[22] Kc. K., Chin-Jung Hsu, Freeh V. W., "Evaluation of MapReduce in a Large Cluster", IEEE, pp. 461-468, 2015.

[23] Jin Niu, Bai S., Khosravi E., Seung-Jong Park, "A Hadoop Approach to Advanced Sampling Algorithms in Molecular Dynamics Simulation on Cloud Computing", IEEE, pp. 452-455, 2013.

[24] Cattaneo G., Roscigno G., Petrillo U. F., "A Scalable Approach to Source Camera Identification over Hadoop", IEEE, pp. 366-373, 2014.

[25] H. Chih, A. Dasdan, R. L. Hsiao, D. S. Parker, "Map-reduce-merge: