• No results found

MAPREDUCE: INSIGHT ANALYSIS OF BIG DATA VIA PARALLEL DATA PROCESSING USING JAVA PROGRAMMING, HIVE AND APACHE PIG

N/A
N/A
Protected

Academic year: 2020

Share "MAPREDUCE: INSIGHT ANALYSIS OF BIG DATA VIA PARALLEL DATA PROCESSING USING JAVA PROGRAMMING, HIVE AND APACHE PIG"

Copied!
5
0
0

Loading.... (view fulltext now)

Full text

(1)

Abst is a s huge Java, Progr MapR save backe Index 1.1 B shop appli chan amou a hig any struc are th a) b) c) Orga beca MS A and easy Peop form quer Pica gene Mac abou This mob devic smar sea, toget hour very may very gene YouT data.

MAPRE

PROC

tract: Digitalda software framew

volume of dat , Python and R ram, Apache Pi Reduce jobs. In the result in o end process in M

x Terms— Big

Big Data: In t p, office, h ication, webs nging with ex unt of data, w gh velocity. B

huge amoun ctured data tha

hree sources o Organization People genera Machine gene anization gen ause all the da ACCESS, Or column. To y task as the da

ple generated m like Faceboo

ry, pictures on sa. Social me erate terabyte/p

hine generate ut 85% - 90% data is gener ile; devices li ces. The wid rt phone, sma

oceans, and s ther producin rs and days. In y high, every

be in terabyt y high veloci erated about 3 Tube. Data ca .

Interna

EDUCE:

CESSING

ata which come work to store a ta. MapReduce Ruby. The main ig and Hive.Hiv n this research p output file.At th MapReduce job

Data, Hadoop,

I. INTROD

today’s world hospital, coll sites, digital d xtremely high which comes in Big data is a nt of structure

at has the to b of big data:

generated dat ated data erated data nerated data: ata is stored in racle, MS SQL

search any in ata already in

data: is in sem ok messages, n Instagram, edia is main petabyte of da d data: This is of the total d rated by vario ike sleeping m despread avail

art cars, smar sky also conn ng a huge am n big data the

minute total te or petabyte ity for examp

50GB of data an be Batch, n

Volu

ational Jou

Av

INSIGHT

USING J

Lec

e from different and process this is a programm n objective of th

ve and Apache paper we write he end we perf b.

, MapReduce, H

DUCTION

d data is every leges, unive devices, etc.T h rate and pro

n a variety of growing term ed , semi stru be used for ana

ta

is the most n different dat L etc. in form nformation in particular sha mi-structured twitters tweet platform wh ata every day. s the major so data is generat ous real time s monitor, heart lability of sm rt homes, sma nected to this mount of data quantity or v

amount of da e. This data is ple every m a, 72 hours vid

near time, rea

ume 9, No.

urnal of Ad

RESE

vailable On

T ANALY

JAVA PRO

Dr cturer (I.T.), S

Salalah

t sources like o s enormous am ming model init his paper is to Pig working o MapReduce pr form the opera

Hive. Apache P

ywhere, either ersities, mob Technologies a

oducing a hu form along w m that describ uctured and u alysis [1]. The

structured da tabases either m of tables, row

n database is ape.

or un-structur t, Google sear

here people w

ource of big da ted by machin sensors, came

beat monitori mart devices li art camera, no

devises and a every minu volume of data ata is generat s generated a inute Facebo deo uploaded al time or strea

1, January

dvanced Re

EARCH PA

nline at ww

YSIS OF B

OGRAMM

. Ujjwal Agar Salalah Colleg h, Sultanate o

ffice, school, h mount of data. H

tiated by Googl describe the co on top layer of H

rogram in Java ation in HiveQL

ig r in bile are uge with bes un- ere ata r in ws an red rch will ata ne. era, ing ike ow all ute, a is ted at a ook on am A str or 1. w pr m fr w co fr an la (2 co Sy H na co da Th si st re 1. w Ec fo str fil di pr ap ab

y-February

esearch in C

APER

ww.ijarcs.in

BIG DAT

MING, H

rwal

ge of Technolo f Oman

hospital, social m Hadoop is using

le which can be oncepts of Map Hadoop ecosys to find anagram L (Hive query

As data is co ructure of big r un-structure.

2 Hadoop: A which is used t rocess this d model. It

om commodit way that alwa ommon and

amework. [2] nalysis of bi aunched on 10 2.7.3) came in omponent of ystem). HDF HDFS works ame node an ontroller/mast ata node. It sto

his metadata ze and file pe ore in differe eplication facto

3 Hadoop E which is used f

cosystem tied or structured

ructured and u le system w ifferent racks rovide the par pplication is bstraction like

y 2018

Computer

nfo

TA VIA PA

HIVE AND

ogy,

media or machi g HDFS and M e written in dif pReduce and sh tem and provid m words from i language) and

oming from g data varies it

.

Apache Hadoo o store large a data by usin

consists ty hardware. ays it assum

it should be . Hadoop is a ig data. The 0th December

nto existence Hadoop is H S works on on Master-sla nd multiple d

er of the syst ores the metad includes nam ermission. The ent clusters ac

or of any data

Ecosystem: H for storage an d up with ma data like sqo un-structured which respons

. Above that rallel data pro working w e Pig, Hive or

Science

ISSN No

ARALLEL

D APACH

ine generated d MapReduce to st fferent program howing the ope de the level of a input files, grou d Pig Latin Scr

various sour t can be struct

op is an open-amount of dat ng the MapRe of compute Hadoop is d es hardware

automaticall a major platfo first release 2011 and the on 25th Augu HDFS (Hadoo

Name Node ave architectu data node. N tem. Name no data of all the me, location of

e data that we cross the nod a is 3 on data n

Hadoop is a so nd analysis of any different oop and hive . In the core H sible to store

MapReduce ocessing and which provid

Sqoop etc.

o. 0976‐5697

L DATA

HE PIG

data. Apache H tore and proces mming language eration by using abstraction to ru up them togethe ript and showin

rces therefore ture, semi-stru

-source frame ta using HDFS educeprogram er clusters designed in su

failures are y handled by orm for storing e of Hadoop

first stable ve ust 2016. The op Distributed e and Data N

ure as it has Name Node i ode spreads da

files in the H f each block, b

store in Hado des. By defau

node.

oftware frame Big Data. Ha applications e, some for Hadoop distrib e all the da

(2)

Map proc Hado langu has t perfo RED anoth Map into brok The Map value The outp from the s job i

We c calle a) M b) M

By d only

Fig

II MA

pReduce [1, essing. This m oop can run uage, like Jav two different orms. The f DUCE, but b her steps call p which takes

another set ken down into next step is s p step and so

e) pairs. last sand fina ut from sort m Shuffling ph

[image:2.595.45.270.54.141.2] [image:2.595.38.261.427.489.2] [image:2.595.39.258.575.709.2]

sequence of t is always perfo

Figure

can divide Ma ed:-

MapReduce wit MapReduce wi

Figure 2.2

default numb y one reducer t

gure 1.1 Hado

APREDUCE

2] is a pro model is very MapReduce va, Python and

operations th first step is between map

led sort and s the group of of data, wh tuples (key/v sort and shuff ort and shuffl

al step is reduc and shuffle a hase and retur the name Ma formed after th

2.1 Architectu

apReduce task

th single redu ith multiple re

2 MapReduce

er of reducer task will be ex

oop Ecosystem

FRAMEWO

ogramming m simple but st program writ d Ruby. The te hat Apache H MAP and a and reduce shuffle. The f data or files here individua

alue pairs). fle, which tak le the result,

cing step,redu as input and c rns a single o apReduce imp he map job.[3]

ure of MapRe

k as two differ

uce task educe task

e Single Reduc

rs is set to 1 xecuted as sho

m

ORK

model for da till too comple tten in differe erm MapRedu Hadoop progra another step we have tw first step is o

and converts al elements a

kes the output based on (ke

uce job takes t combines valu output value. A plies, the redu

], [4] [5], [6]

educe

rent categories

ce Task

, means alwa own below.

ata ex. ent uce am is wo our s it are

of ey,

the ues As uce

s

ays

M in us fo

Jo jo

C co

M w ta th in

Th by W im so M re

M R

Th A pr str fil di

B ec str M C M th

Many reducers ndependently.

ser.We can c ollowing comm

ob job = new J ob.setNumRed

onfiguration c onf.set("mapre

Figure 2

MapReduce F working on <K akes the input he output again nput for Reduc

he key and the y the framew Writable interf

mplement the orting by the MapReduce jo educe -><k3, v

Inpu Map Job <key educe <key

III MA

here are differ A) The first w

rogram, these ructured and les. A MapR ifferent classe a) Mappe b) Reduc c) Driver ) Apache Pig cosystem and ructure data a MapReduce. [7 ) The Hive MapReduce is

he form of row

s can run in The number change the nu

mands:

Job(conf); duceTasks(3);

conf = new Co educe.job.redu

2.3 MapReduc

Framework: Key,Value> p t job as a pair n in the form ce job.

e value classe work and he face. Addition Writable-Co e framework. ob: (Input) <k v3> (Output).

ut Value y1, value1> y2, List (value2

Tab

APREDUCE

rent ways to e way to execu programs ca un-structured educe program s:

er Class cer Class r Class g is working

it is used to and used to pr 7]

Query Lang used for anal w and columns

parallel, as r of reducers umber of red

// 3 Reducers or

onfiguration() uces", "5"); //5

ce Multiple R

The MapRe pair. Firstly t

r of <Key, Va <key, Value>

es should be in ence, need t nally, the ke mparable inte

Input and O k1, v1> -> m

Output List <k 2) > List (K

ble 2.1

IMPLEMEN

execute MapR ute by using an be used for d data like .csv m usually co

g on the top analysis of s rocess the scri

guage (HiveQ lysis of structu s. [8]

they are wo is decided b ucer by usin

s

);

5 Reducer

Reduce Task

educe frame the Map oper alue> and pro > which will b

n serialized ma to implement

y classes hav erface to faci Output types map -><k2, v

t Value key2, value2> Key3, value3>

NTATION

Reduce jobs:-Java MapRe r structured, v, .json, or D nsists of the

layer of Ha structure and

ipting approac

QL or HQL ured data whi

orking y the g the

ework ration oduce be the

anner t the ve to ilitate

of a v2>->

educe semi-DBMS

three

adoop semi-ch for

(3)

3.1. takin of ch word rearr vary grou Map

Anag

inpu each Hado aba a key a

publ thro

{ Syste Strin Strin Strin

whil

{ Strin

s1=(

char

Arra

sorte conte

} }

Anag

and c

publ

key,I

thro

{ Strin Strin

for ( {

str=s

}

conte

}

Anag

exec call Anag file.

publ Clas

jobjo job.s

To implemen ng the exampl haracter forme d or characters ranged into " ying string, w up them toge pReduce we ar a) Anagram b) Anagram c) Anagram

gramMapper

ut file line by l h toke to form

oop will crea and baa will g as aab with gr

lic void map(T

owsIOExceptio

em.out.println ng keystring = ngTokenizerst

ng sorted = nu le(st.countTok

ng s1 = newSt String) st.nex

r[] chars =s1.t ays.sort(chars)

ed = new Strin

ext.write(new

gramReducer

combine them

licvoid reduce Iterable<Text

owsIOExceptio

ngBufferstr=n

ng finalkey =v

(Text t : value

str.append(t.to

ext.write(new

gramDriver

cution of prog the Anagr gramReducer

ic static void sNotFoundEx

ob = new Job( setJarByClass

nt the MapRed le of an anagra ed by rearran s, For exampl adhpoo".Inpu we need to id ether.For solv re usingthree c mMapper mReducer mDriver

Class(SortK

line.It will cre m a key with ate a group of

get sorted as a roup {aba,baa

Text key, Tex on, Interrupte

n("In mapper"

= key.toString(

t = newtringTo

ull ; kens()!=0)

tring(); xtElement();

toLowerCase( );

ng(chars);

w Text(sorted)

rClass: It will m into single st

e(Text t>values,Conte

on,Interrupted

ewStringBuff

values.iterator(

s)

oString()+" ");

w Text(finalkey

Class: The d ram by callin ramMapper

class and fin

main(String[] xception, Inter

();

(AnagramDri

duce program am. Anagram nging the lette le, the world “ ut files contai dentify anagra ving this pro

classes:-

KeyMapper):

eate a token fr h original va f similar word aab. So reduce a}

xt value, Conte dException

"); ();

okenizer(keys

().toCharArray

, new Text(s1

read each me tring separate

extcontext) dException

fer(); ().next().toStr

;

y), newext(str

driver progra ng both the cla class then nally write the

args) throws rruptedExcept

ver.class);

m in Java we a m is a word or

ers of a differe “hadoop” can in n number ams words a oblem by usi

This will re om line and s alue.In this w

ds.For examp er will receive

ext context)

string," ");

y();

1));

ember of group d by tab.

ring();

r.toString()));

am controls t asses first it w it will c e results in ne

IOException, tion{

are set ent be of and ing

ead ort way ple: e a

p

the will call ew

,

jo

Fi Pa w

Fi Pa w

jo jo jo jo jo jo jo

Sy }

3. pr an ec

H tra ex Tr an

To w em ob ta

H H

ob.setJobName

ileInputForma ath("C:\\Users workspace\\An

ileOutputForm ath("C:\\Users workspace\\An

ob.setInputFor ob.setMapperC ob.setReducerC ob.setMapOutp ob.setMapOutp ob.setOutputK ob.setOutputV

ystem.exit(job

2: Hive: Hiv rocess and ana nd databases. cosystem. Hiv

a) Data s b) Query c) Analy Hive supports anslate SQL l xecute the qu racker pass th nd Reduce tas

F

o exhibit the w we take the

mployee datab bjective is to able”. We will

Hive > use emp Hive >Select C

e("Ana");

at.addInputPat s\\ujjwal\\eclip

agram1\\hado

mat.setOutputP s\\ujjwal\\eclip agram1\\hado

rmatClass(Key Class(Anagram Class(Anagram putKeyClass( putValueClas KeyClass(Text ValueClass(Tex

b.waitForCom

ve is data wa alysis of struc Hive is wor ve has mainly summarization y

sis of data s query lang

[image:3.595.349.522.443.590.2]

ike query into ery it will pa he job to Tas k will execute

Figure 3.2.1Ar

working struc databasecalled base we have

fine the tota execute the f

ployee; Count (*) from

th(job, new

pse-oop\\anagram.

Path(job, new

pse-oop\\out_ana")

yValueTextIn mMapper.clas mReducer.cla Text.class); s(Text.class); .class); xt.class);

mpletion(true)?

arehouse tool ctured data in rking on top three basic fu n

guage called o the MapRedu ass to Job Tra

sk Tracker an e and fetch the

rchitecture of

cture of MapR d “employee e table called al number of following Hive

m emp;

txt"));

w

));

nputFormat.cla ss);

ass);

?0:1);

which is us the form of t layer of Ha unctions:-

HiveQL, w uce jobs. Onc acker and then nd finally the e data from HD

Hive

Reduce in Hive e” and inside “emp”. Our rows in the eQL query:-

ass);

ed to tables adoop

which ce we n Job Map DFS.

e first e the

(4)

To e show seco

3.3 Map data Map oper langu parts

The Map

$ pig

$ pig

Gran stude , city

Usin stand

[image:4.595.314.526.54.213.2]

Gran

Figure 3.2.2

execute this qu wn in 3.2.2 and

nd.

Apache Pig pReduce. Pig i set witho pReduce.We

rations in Had uage called P s:-

[image:4.595.37.278.54.206.2] [image:4.595.47.218.364.555.2]

a) Loading b) Transform c) Dumping

Figure 3

following co pReduce execu

g –version

g –x mapreduc

nt>A=LOAD ent.txt' USING y:chararray);/*

ng dump meth dard output.

nt >Dump A;

2 MapReduce

uery, Hive per d the total tim

g: Apache P is used for an out using

can perform doop using Ap Pig Latin. E

ming g of the data.

3.3.1 Architec

ommand we ution on Apac

ce

'hdfs://localho G PigStorage( *Load File */

hod, the proces

e job for HiveQ

rforms the Ma me required is

Pig is an ab nalysis of larg

writing the all the dat pache Pig by Every pig pro

cture of Apach

will execute che Pig:-

ost:9000/pig_d (',') as (id:int, n

ssed data is di

QL query

apReduce job 1 min and 22

bstraction ov ge or distribut

program ta manipulati y using scripti ogram has thr

he Pig

to display t

data/

name:chararra

isplayed on th as

ver ted of ion ing ree

the

ay

e

To as se

Th ec M vo M ex th do de of M w qu ex on ab bu fo M

[1

[2

[3

[4

[5

[6

Figure 3.

o execute this s shown in 3.2 econd.

his paper e cosystem and MapReduce pr olume of d MapReduce bre

xecute then Re his feature in H

own as per < efault there is f reducer as MapReduce job we execute the

uery and sho xecuted by M n top layer of bstraction, as u ut just he writ or Pig its Pig L MapReduce Job

] Jefry Dean an Processin 53, Issuse 2]Jefry Dean an

processin Volume 5 ] M. Dhavapri and Solut (IJCST) – 4] Dr. Urmila Mapreduc (AJER), V 5]AbdelrahmanE Sharkawi Direction Engineeri 6]Shafali Agarw

Expansion Computer

3.2 MapRedu

s Pig Script,P 2.2 and the tot

IV CON

exploits an d deeply expl

rogramming m data by usin

eak the job in educe job wil Hadoop ecosy <key,value> s one reducer

per need. W b in Java pro e HiveQL and owing that at apReduce job f Hadoop ecos

user no need t te the SQL lik Latin Script, th

bs.

V REF

nd Sanjay Ghem ng Tool, Comm

e.1,January 201 nd Sanjay hem ng on large clus

51 pp. 107–113 iya, N. Yasodh tions Using Ha – Volume 4 Issu a R. Pol, Big ce, American Volume-5, Issu Elsayed, Osam i, MapReduce ns, International ing, Vol. 6, No. wal, Map Red n, (IJACSA) I r Science and A

uce job for Pig

Pigperforms th tal time requir

NCLUSION

overview o ained MapRe model used t ng parallel n two groups ll perform. Ma ystem so that given by the but we can c We solve ana gramming lan d Pig Latin Sc t backend th b. Hive and Ap

system and pr to write the co ke script in hi

hat automatic

FERENCE

mwat, MapRed munications of 0, pp 72-77. mwat,.MapRedu sters, Communi

, 2008 ha, Big Data A adoop, Map Re ue 1, Jan - Feb g Data Analy

Journal of En ue-6, pp-146-15 ma Ismail, and

: State-of-the-l JournaState-of-the-l of Com

. 1, February 20 duce: A Surve

International Jo Applications, V

g Latin Script

he MapReduc red is 3 min an

fbigdata, Ha educe architec

to deal with data proces first Map job apReduce pro data can be b e programmer change the nu agram problem

nguage and fu cript to execut he whole que pache Pig wo rovide the lev ode of MapRe ive its HiveQL ally converted

duce:A Flexible f the ACM, Vo

uce: Simplified ications of the A

Analytics: Chall educe and Big T

2016 ysis Using H

ngineering Res 1

d Mohamed E -Art and Res mputer and Elec 014

ey Paper on R ournal of Adv ol. 6, No. 8, 20

ce job nd 20

adoop cture. huge ssing. b will ovides break r. By umber

m by urther

te the ery is orking

vel of educe L and d into

e Data olume

d data ACM,

lenges Table,

adoop search

E. El-search ctrical

Recent vanced

(5)

[7] P

[8] H

Auth

intern Big D langu Resea

Pig Latin: A N Christopher Yahoo! Rese

Hive – A Peta AshishThuso Prasad Chak Raghotham M

hors Profile

D S M w S m p national journals Data Analytics, M uage. He has more arch Experience.

Not-So-Foreign Olstonכ Yaho earch UtkarshSr

abyte Scale D oo, JoydeepSen kka, Ning Zhan Murthy, Facebo

VI PRO

Dr. Ujjwal Agar Science), M.Ph MSc(Information working in Salal Sultanate of Oma member in man published more t and conferences. Machine Learning

e then 12 years o

Language for oo! Research B rivastava ‡ Yah

Data Warehouse nSarma, Namit ng, Suresh Anto

ook Data Infras

OFILE

rwal completed hil. (Computer n Technology).

ah College of T an as a Lecturer ny International

than 12 research . His main resear g using Python a of teaching experi

Data Processin Benjamin Reed hoo! Research

e Using Hado Jain, Zheng Sh ony, Hao Liu a structure Team

his PhD(Compu r Science) a He is curren Technology, Salal (IT) alogwith he Journals. He h papers in repu rch work focuses and R Programm ience and 6 years

ng, d †

op, hao, and

Figure

Figure 2.22 MapReducee Single Reducce Task
Figure 3.2.1ArFrchitecture of Hive
Figure 3.2.22 MapReduce e job for HiveQQL query

References

Related documents

World Health Organization and the European research Organization on Genital Infection and Neoplasia in the year 2000 mentioned that HPV testing showed

The theoretical concerns that should be addressed so that the proposed inter-mated breeding program can be effectively used are as follows: (1) the minimum sam- ple size that

According to the aforementioned issues, seven research questions were raised in this study, in which the emphasis of each components of sustainable development, such as the

There is also the action of what is known as pancreatic lipase on the food, which breaks the fats into smaller components of diglycerides, monoglycerides or free fatty acids..

In this paper, we propose a mobility pattern based location tracking scheme based, which efficiently reduces the location updates and searching cost in the

In dit artikel zal ik betogen dat Proclus' gebeden voor gezondheid niet begrepen moeten worden als achteloze gemeenplaatsen in zijn gebeden, noch als tekenen dat Proclus als

Figure 5: The Kelvin source pin helps avoid the inductance in the gate driver loop and reduces switching energy loss... However, there is no effect on the low-frequency noise or

The mother liquor was pipetted off and the colourless solid was removed by washing with diethyl ether (3 × 10 mL). The crystals were recrystallized with the same composition of