Cloudera_Academic_Partnership_8.pdf

Academic year: 2021


© Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Apache Hadoop – A course for undergraduates

Hadoop Tools for Data Acquisition

§ How to load data from an existing RDBMS into HDFS using Sqoop
§ How to manage real-time data such as log files using Flume

Chapter Topics

Hadoop Tools for Data Acquisition

§ Loading Data into HDFS from an RDBMS Using Sqoop

Importing Data From an RDBMS to HDFS

§ Typical scenario: data stored in an RDBMS is needed in a MapReduce job
    – Lookup tables
    – Legacy data
§ Possible to read directly from an RDBMS in your Mapper
    – Can lead to the equivalent of a distributed denial of service (DDoS) attack on your RDBMS
    – In practice – don't do it!
§ Better idea: use Sqoop to import the data into HDFS beforehand

§ Sqoop: open source tool originally written at Cloudera
    – Now a top-level Apache Software Foundation project
§ Imports tables from an RDBMS into HDFS
    – Just one table
    – All tables in a database
    – Just portions of a table
        – Sqoop supports a WHERE clause
§ Uses MapReduce to actually import the data
    – 'Throttles' the number of Mappers to avoid DDoS scenarios
        – Uses four Mappers by default
        – Value is configurable
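Why a small, fixed number of Mappers limits the load on the database can be sketched as follows. This is not Sqoop's code: it is a minimal Python model that assumes the table is divided into contiguous ranges of an integer key, one range per Mapper. The function name `split_ranges` is hypothetical.

```python
def split_ranges(min_id, max_id, num_mappers=4):
    """Divide the inclusive key range [min_id, max_id] into contiguous slices.

    Illustrative model only: each slice stands for the rows one Mapper
    would import over its own database connection.
    """
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        # The first `extra` slices take one additional row each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # fewer rows than Mappers: leave this Mapper idle
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


if __name__ == "__main__":
    for low, high in split_ranges(1, 1000):
        print(f"one Mapper imports rows with id between {low} and {high}")
```

With four Mappers the database sees at most four concurrent readers, however large the table is.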

§ Imports data to HDFS as delimited text files or SequenceFiles
    – Default is a comma-delimited text file
§ Can be used for incremental data imports
    – First import retrieves all rows in a table
    – Subsequent imports retrieve just rows created since the last import
§ Generates a class file which can encapsulate a row of the imported data
    – Useful for serializing and deserializing data in subsequent MapReduce jobs

§ Cloudera has partnered with other organizations to create custom Sqoop connectors
    – Use a database's native protocols rather than JDBC
    – Provides much faster performance
§ Current systems supported by custom connectors include:
    – Netezza
    – Teradata
    – Oracle Database (connector developed with Quest Software)
§ Others are in development
§ Custom connectors are not open source, but are free
    – Available from the Cloudera Web site

Sqoop: Basic Syntax

§ Standard syntax:

sqoop tool-name [tool-options]

§ Tools include:

import
import-all-tables
list-tables

§ Options include:

--connect
--username
--password

Sqoop: Example

§ Example: import a table called employees from a database called personnel in a MySQL RDBMS

$ sqoop import --username fred --password derf \
    --connect jdbc:mysql://database.example.com/personnel \
    --table employees

§ Example: as above, but only records with an ID greater than 1000

$ sqoop import --username fred --password derf \
    --connect jdbc:mysql://database.example.com/personnel \
    --table employees \

Importing An Entire Database with Sqoop

§ Import all tables from the database (fields will be tab-delimited)

$ sqoop import-all-tables \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --fields-terminated-by '\t' \

Incremental Imports with Sqoop

§ What if new records are added to the database?
    – Could re-import all records, but this is inefficient
§ Sqoop's incremental append mode imports only new records
    – Based on value of last record in specified column

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table orders \
    --incremental append \
    --check-column order_id \
    --last-value 6713821
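The idea behind incremental append can be modelled in a few lines. The sketch below is not how Sqoop is implemented: it is a minimal Python illustration using an in-memory SQLite table, assuming that "new" rows are those whose check column exceeds the last imported value. The helper `import_new_rows` is a hypothetical name.

```python
import sqlite3


def import_new_rows(conn, table, check_column, last_value):
    """Return rows whose check column exceeds last_value, plus the new high-water mark.

    Assumption for this sketch: table and column names come from trusted
    configuration, so they are interpolated directly into the query.
    """
    query = f"SELECT * FROM {table} WHERE {check_column} > ? ORDER BY {check_column}"
    cursor = conn.execute(query, (last_value,))
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    index = columns.index(check_column)
    # If nothing new arrived, the high-water mark stays where it was.
    new_last = rows[-1][index] if rows else last_value
    return rows, new_last


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, item TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(6713821, "already imported"), (6713822, "new"), (6713823, "newer")],
    )
    rows, last = import_new_rows(conn, "orders", "order_id", 6713821)
    print(rows, last)
```

The returned high-water mark is what would be passed as the last value on the next run.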

Sqoop: Other Options

§ Sqoop can take data from HDFS and insert it into an already-existing table in an RDBMS with the command

$ sqoop export [options]

§ For general Sqoop help:

$ sqoop help

§ For help on a particular command:


§ Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced
    – Ideally suited to gathering logs from multiple systems and inserting them into HDFS as they are generated
§ Flume is Open Source
    – Initially developed by Cloudera
§ Flume's design goals:
    – Reliability
    – Scalability
    – Extensibility

Flume: High-Level Overview

[Figure: several tiers of Flume Agents passing data downstream; labels on the flows: compress, encrypt, batch]

§ Optionally process incoming data: perform transformations, suppressions, metadata enrichment
§ Each agent can be configured with an in-memory or durable channel
§ Writes to multiple HDFS file formats (text, SequenceFile, JSON, Avro, others)
§ Parallelized writes across many collectors – as much write throughput as required

§ Each Flume agent has a source, a sink and a channel
§ Source
    – Tells the node where to receive data from
§ Sink
    – Tells the node where to send data to
§ Channel
    – A queue between the Source and Sink
    – Can be in-memory only or 'Durable'
    – Durable channels will not lose data if power is lost

§ Channels provide Flume's reliability
§ Memory Channel
    – Data will be lost if power is lost
§ File Channel
    – Data stored on disk
    – Guarantees durability of data in face of a power loss
§ Data transfer between Agents and Channels is transactional
    – A failed data transfer to a downstream agent rolls back and retries
§ Can configure multiple Agents with the same task
    – e.g., two Agents doing the job of one "collector" – if one agent fails then upstream agents would fail over
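The "rolls back and retries" behaviour can be illustrated with a toy channel. This is not Flume's API: it is a minimal Python sketch of take/commit/rollback semantics, and the names `MemoryChannel` and `deliver` are hypothetical, chosen only for this example.

```python
from collections import deque


class MemoryChannel:
    """A toy in-memory queue with take/commit/rollback semantics (illustrative only)."""

    def __init__(self):
        self._queue = deque()
        self._in_flight = []  # events taken but not yet committed

    def put(self, event):
        self._queue.append(event)

    def take(self):
        event = self._queue.popleft()
        self._in_flight.append(event)
        return event

    def commit(self):
        # Delivery confirmed: the in-flight events can be forgotten.
        self._in_flight.clear()

    def rollback(self):
        # Return uncommitted events to the front of the queue in their original order.
        while self._in_flight:
            self._queue.appendleft(self._in_flight.pop())

    def __len__(self):
        return len(self._queue)


def deliver(channel, send, batch_size=2):
    """Try to send one batch downstream; return True on success, False if rolled back."""
    batch = [channel.take() for _ in range(min(batch_size, len(channel)))]
    try:
        send(batch)
    except IOError:
        channel.rollback()  # nothing is lost; the batch can be retried later
        return False
    channel.commit()
    return True
```

Because events leave the channel only on commit, a failed send never drops data; the caller simply calls `deliver` again.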

§ Scalability
    – The ability to increase system performance linearly by adding more resources to the system
    – Flume scales horizontally
        – As load increases, more machines can be added to the configuration

§ Extensibility
    – The ability to add new functionality to a system
§ Flume can be extended by adding Sources and Sinks to existing storage layers or data platforms
    – General Sources include data from files, syslog, and standard output from a process
    – General Sinks include files on the local filesystem or HDFS
    – Developers can write their own Sources or Sinks

§ Flume is typically used to ingest log files from real-time systems such as Web servers, firewalls and mailservers into HDFS
§ Currently in use in many large organizations, ingesting millions of events per day
    – At least one organization is using Flume to ingest over 200 million events per day
§ Flume is typically installed and configured by a system administrator
    – Check the Flume documentation if you intend to install it yourself

§ Sqoop is a tool to load data from a database into HDFS
§ Flume is a tool for managing real-time data
    – e.g. importing data from log files into HDFS

The following offer more information on topics discussed in this chapter

§ Incremental importing is described in the Sqoop documentation:
    http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_incremental_imports

An Introduction to Oozie

§ What is Oozie?
§ Creating Oozie workflows

Chapter Topics

An Introduction to Oozie

§ Introduction to Oozie

The Motivation for Oozie (1)

§ Many problems cannot be solved with a single MapReduce job
§ Instead, a workflow of jobs must be created
§ Simple workflow:
    – Run Job A
    – Use output of Job A as input to Job B
    – Use output of Job B as input to Job C
    – Output of Job C is the final required output
§ Easy if the workflow is linear like this
    – Can be created as standard Driver code

[Figure: Start Data → Job A → Job B → Job C → Final Result]

§ If the workflow is more complex, Driver code becomes much more difficult to maintain
§ Example: running multiple jobs in parallel, using the output from all of those jobs as the input to the next job
§ Example: including Hive or Pig jobs as part of the workflow

§ Oozie is a 'workflow engine'
§ Runs on a server
    – Typically outside the cluster
§ Runs workflows of Hadoop jobs
    – Including Pig, Hive, Sqoop jobs
    – Submits those jobs to the cluster based on a workflow definition
§ Workflow definitions are submitted via HTTP
§ Jobs can be run at specific times
    – One-off or recurring jobs
§ Jobs can be run when data is present in a directory


§ Oozie workflows are written in XML
§ Workflow is a collection of actions
    – MapReduce jobs, Pig jobs, Hive jobs etc.
§ A workflow consists of control flow nodes and action nodes
§ Control flow nodes define the beginning and end of a workflow
    – They provide methods to determine the workflow execution path
        – Example: Run multiple jobs simultaneously
§ Action nodes trigger the execution of a processing task, such as
    – A MapReduce job
    – A Hive query
    – A Sqoop data import job

§ Simple example workflow for WordCount:

Simple Oozie Example (2)

<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.myorg.WordCount.Map</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>org.myorg.WordCount.Reduce</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'>
    <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
  </kill>
  <end name='end'/>
</workflow-app>

Simple Oozie Example (3)

A workflow is wrapped in the workflow-app entity

Simple Oozie Example (4)

The start node is the control node which tells Oozie which workflow node should be run first. There must be one start node in an Oozie workflow. In our example, we are telling Oozie to start by

Simple Oozie Example (5)

The wordcount action node defines a


Simple Oozie Example (7)

We specify what to do if the action ends successfully, and what to do if it fails. In this example, if the job is successful we go to the end node. If it fails we go to the kill node.

Simple Oozie Example (9)

If the workflow reaches a kill node, it will kill all running actions and then terminate with an error. A workflow can have zero or more kill nodes.

Simple Oozie Example (8)

Every workflow must have an end node. This indicates that the workflow has completed successfully.
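The way start, action, kill, and end nodes fit together can be modelled as a small state machine. The sketch below is not Oozie: it is a hypothetical Python model in which each action names an `ok` and an `error` transition, mirroring the WordCount workflow. The names `run_workflow` and `WORDCOUNT_WF` are assumptions made for illustration.

```python
def run_workflow(nodes, start, actions):
    """Follow ok/error transitions from the start node until an end or kill node.

    `nodes` maps node names to their definitions; `actions` maps action
    names to callables that raise an exception on failure.
    """
    current = start
    path = []
    while True:
        path.append(current)
        node = nodes[current]
        if node["type"] == "end":
            return "SUCCEEDED", path
        if node["type"] == "kill":
            return "KILLED", path
        try:
            actions[current]()
            current = node["ok"]      # action succeeded: take the ok transition
        except Exception:
            current = node["error"]   # action failed: take the error transition


# A model of the WordCount workflow: one action, one kill node, one end node.
WORDCOUNT_WF = {
    "wordcount": {"type": "action", "ok": "end", "error": "kill"},
    "kill": {"type": "kill"},
    "end": {"type": "end"},
}


if __name__ == "__main__":
    print(run_workflow(WORDCOUNT_WF, "wordcount", {"wordcount": lambda: None}))
```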

§ A decision control node allows Oozie to determine the workflow execution path based on some criteria
    – Similar to a switch-case statement
§ fork and join control nodes split one execution path into multiple execution paths which run concurrently
    – fork splits the execution path
    – join waits for all concurrent execution paths to complete before proceeding
    – fork and join are used in pairs
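The fork/join pairing has a direct analogue in ordinary concurrent code. The sketch below is a plain Python illustration, not Oozie itself: threads stand in for cluster jobs, and `fork_join` is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor


def fork_join(branches, next_step):
    """Run all branches concurrently (fork), wait for every result (join), then continue.

    Assumes `branches` is a non-empty list of zero-argument callables.
    """
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = [pool.submit(branch) for branch in branches]
        # result() blocks, so this line completes only when every branch has finished.
        results = [future.result() for future in futures]
    return next_step(results)


if __name__ == "__main__":
    total = fork_join([lambda: 1, lambda: 2, lambda: 3], sum)
    print(total)
```

The step after the join sees the output of all branches, which matches the earlier example of feeding several parallel jobs into the next job.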

Node Name     Description
map-reduce    Runs either a Java MapReduce or Streaming job
fs            Create directories, move or delete files or directories
java          Runs the main() method in the specified Java class as a single-Map, Map-only job on the cluster
pig           Runs a Pig script
hive          Runs a Hive query
sqoop         Runs a Sqoop job
email         Sends an e-mail message

Submitting an Oozie Workflow

§ To submit an Oozie workflow using the command-line tool:

$ oozie job -oozie http://<oozie_server>/oozie \
    -config config_file -run

§ Oozie can also be called from within a Java program
    – Via the Oozie client API

More on Oozie

Information                                             Resource
Oozie installation and configuration                    CDH Installation Guide: http://docs.cloudera.com
Oozie workflows and actions                             https://oozie.apache.org
The procedure of running a MapReduce job using Oozie    https://cwiki.apache.org/OOZIE/map-reduce-cookbook.html
Oozie examples                                          Oozie examples are included in the Oozie distribution. Instructions for running them: http://oozie.apache.org/docs/

§ Oozie is a workflow engine for Hadoop
§ Supports Java and Streaming MapReduce jobs, Sqoop jobs, Hive queries, Pig scripts, and HDFS file manipulation

 

The following offer more information on topics discussed in this chapter

§ "Introduction to Oozie" article
    http://www.infoq.com/articles/introductionOozie

Introduction to Pig

§ The key features Pig offers
§ How to use Pig for data processing and analysis
§ How to use Pig interactively and in batch mode

Chapter Topics

Introduction to Pig

§ What is Pig?
§ Pig's Features
§ Pig Use Cases

§ Apache Pig is a platform for data analysis and processing on Hadoop
    – It offers an alternative to writing MapReduce code directly
§ Originally developed as a research project at Yahoo
    – Goals: flexibility, productivity, and maintainability
    – Now an open-source Apache project

The Anatomy of Pig

§ Main components of Pig
    – The data flow language (Pig Latin)
    – The interactive shell where you can type Pig Latin statements (Grunt)
    – The Pig interpreter and execution engine

Pig Latin Script

AllSales = LOAD 'sales' AS (cust, price);
BigSales = FILTER AllSales BY price > 100;
STORE BigSales INTO 'myreport';

Pig Interpreter / Execution Engine

    – Preprocess and parse Pig Latin
    – Check data types
    – Make optimizations
    – Plan execution
    – Generate MapReduce jobs
    – Submit job(s) to Hadoop
    – Monitor progress

MapReduce Jobs
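For readers new to Pig Latin, the three-line script reads as load, filter, store. The sketch below expresses the same logic in plain Python purely for comparison; it runs on one machine and says nothing about how Pig generates MapReduce jobs. It assumes tab-separated input records of the form (cust, price), and `big_sales` is a hypothetical name.

```python
import csv


def big_sales(lines, threshold=100):
    """Yield (cust, price) records whose price exceeds the threshold.

    Assumes each input line holds exactly two tab-separated fields.
    """
    reader = csv.reader(lines, delimiter="\t")
    for cust, price in reader:
        if float(price) > threshold:
            yield cust, float(price)


if __name__ == "__main__":
    sales = ["alice\t250", "bob\t40", "carol\t100.5"]
    for record in big_sales(sales):
        print(record)
```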

§ CDH (Cloudera's Distribution including Apache Hadoop) is the easiest way to install Hadoop and Pig
    – A Hadoop distribution which includes core Hadoop, Pig, Hive, Sqoop, HBase, Oozie, and other ecosystem components
    – Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
    – Simple installation
    – 100% free and open source
§ Installation is outside the scope of this course
    – Cloudera offers a training course for System Administrators, Cloudera Administrator Training for Apache Hadoop


§ Pig is an alternative to writing low-level MapReduce code
§ Many features enable sophisticated analysis and processing
    – HDFS manipulation
    – UNIX shell commands
    – Relational operations
    – Positional references for fields
    – Common mathematical functions
    – Support for custom functions and data formats
    – Complex data structures


§ Many organizations use Pig for data analysis
    – Finding relevant records in a massive data set
    – Querying multiple data sets
    – Calculating values from input data
§ Pig is also frequently used for data processing
    – Reorganizing an existing data set
    – Joining data from multiple sources to produce a new data set

Use Case: Web Log Sessionization

§ Pig can help you extract valuable information from Web server log files

[Figure: "Process Logs" turns Web server log data into "Recent Activity for John Smith", with entries dated May 3, 2013 and May 12, 2013: Search for 'Widget', Widget Results, Details for Widget X, Order Widget X, Track Order, Contact Us, Send Complaint]

Web Server Log Data

10.174.57.241 - - [03/May/2013:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129"
...
10.174.57.241 - - [03/May/2013:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129"
10.174.57.241 - - [03/May/2013:17:58:25 -0500] "GET /detail?w=41 HTTP/1.1" 200 8584 "http://www.example.com/wres.html" "WebTV 1.2" "U=129"
10.174.57.241 - - [03/May/2013:17:59:36 -0500] "GET /order.do HTTP/1.1" 200 964 "http://www.example.com/detail?w=41" "WebTV 1.2" "U=129"
10.174.57.241 - - [03/May/2013:17:59:47 -0500] "GET /confirm HTTP/1.1" 200 964 "http://www.example.com/order.do" "WebTV 1.2" "U=129"
10.218.46.19 - - [03/May/2013:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)"
10.32.51.237 - - [03/May/2013:17:58:04 -0500] "GET /os.html HTTP/1.1" 404 955 "http://www.example.com/s?q=VMS" "Mozilla/1.0b (Win3.11)"
10.157.96.181 - - [03/May/2013:17:58:26 -0500] "GET /mp3.html HTTP/1.1" 404 955 "http://www.example.com/s?q=Zune" "Mothra/2.77" "U=3622"
...
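What sessionization means for log lines like these can be sketched in a few lines of Python. This is a single-machine illustration, not a Pig script. It makes two assumptions that the slide does not state: visitors are identified by client IP address, and a session ends after 30 minutes of inactivity. The names `LOG_PATTERN` and `sessionize` are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Captures client IP, timestamp, HTTP method, and request path from an access-log line.
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)')


def sessionize(lines, gap=timedelta(minutes=30)):
    """Group requests by client IP into sessions separated by more than `gap` of inactivity."""
    hits = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that do not look like access-log entries
        ip, stamp, _method, path = match.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        hits.append((ip, when, path))
    hits.sort(key=lambda hit: (hit[0], hit[1]))  # by visitor, then by time

    sessions = []
    for ip, when, path in hits:
        last = sessions[-1] if sessions else None
        if last and last["ip"] == ip and when - last["end"] <= gap:
            last["paths"].append(path)  # same visitor, still active: extend the session
            last["end"] = when
        else:
            sessions.append({"ip": ip, "start": when, "end": when, "paths": [path]})
    return sessions
```

Each resulting session lists the pages one visitor requested in order, which is the raw material for an activity timeline like the one in the figure.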

Use Case: Data Sampling

§ Sampling can help you explore a representative portion of a large data set
    – Allows you to examine this portion with tools that do not scale well
    – Supports faster iterations during development of analysis jobs

[Figure: a 100 TB data set reduced to a 50 MB sample]
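One common way to draw such a sample is to keep each record independently with a fixed probability. The sketch below shows that idea in plain Python; it is an illustration of the technique rather than Pig code, and the `sample` function and its `seed` parameter are assumptions introduced for the example.

```python
import random


def sample(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`.

    Passing a seed makes the selection repeatable, which helps when
    iterating on an analysis job during development.
    """
    rng = random.Random(seed)
    return [record for record in records if rng.random() < fraction]


if __name__ == "__main__":
    subset = sample(range(1_000_000), 0.0005, seed=7)
    print(len(subset), "records kept out of 1000000")
```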

Use Case: ETL Processing

§ Pig is also widely used for Extract, Transform, and Load (ETL) processing

[Figure: data from Operations, Accounting, and Call Center flows through Pig jobs running on a Hadoop cluster (validate data, fix errors, remove duplicates, encode values) into a Data Warehouse]
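The four transformations named in the figure can be made concrete with a short example. The sketch below is plain Python, not a Pig job, and the record layout (id, region, amount) and the `region_codes` lookup are hypothetical, invented only to give each step something to act on.

```python
def etl(records, region_codes):
    """Validate, clean, de-duplicate, and encode (id, region, amount) records."""
    seen = set()
    clean = []
    for rec_id, region, amount in records:
        try:
            amount = float(amount)        # validate data: amount must be numeric
        except (TypeError, ValueError):
            continue
        region = region.strip().lower()   # fix errors: normalize spacing and case
        if region not in region_codes:
            continue                      # drop records with an unknown region
        if rec_id in seen:
            continue                      # remove duplicates by record id
        seen.add(rec_id)
        clean.append((rec_id, region_codes[region], amount))  # encode values
    return clean


if __name__ == "__main__":
    codes = {"north": 1, "south": 2}
    raw = [(1, " North ", "10.5"), (1, "north", "10.5"), (2, "south", "abc")]
    print(etl(raw, codes))
```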


Using Pig Interactively

§ You can use Pig interactively, via the Grunt shell
    – Pig interprets each Pig Latin statement as you type it
    – Execution is delayed until output is required
    – Very useful for ad hoc data inspection
§ Example of how to start, use, and exit Grunt

$ pig
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';

§ Can also execute a Pig Latin statement from the UNIX shell via the -e option
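"Execution is delayed until output is required" means that LOAD and FILTER only describe work; nothing is read until STORE asks for a result. The sketch below models that behaviour in plain Python. It is an analogy, not Pig's implementation, and the `Relation` class is a hypothetical name.

```python
class Relation:
    """A named step that records what to do but does nothing until stored."""

    def __init__(self, source, predicate=None):
        self.source = source        # either a loader function or an upstream Relation
        self.predicate = predicate  # optional row filter applied at execution time

    def filter(self, predicate):
        # Builds a new step on top of this one; no data is touched here.
        return Relation(self, predicate)

    def _rows(self):
        if isinstance(self.source, Relation):
            rows = self.source._rows()
        else:
            rows = self.source()    # the load happens only now
        return [row for row in rows if self.predicate is None or self.predicate(row)]

    def store(self):
        # Output is required, so the whole chain finally executes.
        return self._rows()


if __name__ == "__main__":
    def load_sales():
        print("reading input...")
        return [("alice", 250), ("bob", 40)]

    allsales = Relation(load_sales)
    bigsales = allsales.filter(lambda row: row[1] > 100)
    print("nothing has been read yet")
    print(bigsales.store())
```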

Interacting with HDFS

§ You can manipulate HDFS with Pig, via the fs command

grunt> fs -mkdir sales/;
grunt> fs -put europe.txt sales/;
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';

Interacting with UNIX

§ The sh command lets you run UNIX programs from Pig

grunt> sh date;
Fri May 10 13:05:31 PDT 2013
grunt> fs -ls;    -- lists HDFS files
grunt> sh ls;     -- lists local files

Running Pig Scripts

§ A Pig script is simply Pig Latin code stored in a text file
    – By convention, these files have the .pig extension
§ You can run a Pig script from within the Grunt shell via the run command
    – This is useful for automation and batch execution
§ It is common to run a Pig script directly from the UNIX shell

$ pig salesreport.pig

MapReduce and Local Modes

§ As described earlier, Pig turns Pig Latin into MapReduce jobs
    – Pig submits those jobs for execution on the Hadoop cluster
§ It is also possible to run Pig in 'local mode' using the -x flag
    – This runs MapReduce jobs on the local machine instead of the cluster
    – Local mode uses the local filesystem instead of HDFS
    – Can be helpful for testing before deploying a job to production

$ pig -x local    -- interactive

§ If a job fails, Pig may produce a log file to explain why
    – These log files are typically produced in your current working directory
    – On the local (client) machine

§ Pig offers an alternative to writing MapReduce code directly
    – Pig interprets Pig Latin code in order to create MapReduce jobs
    – It then submits these MapReduce jobs to the Hadoop cluster
§ You can execute Pig Latin code interactively through Grunt
    – Pig delays job execution until output is required
§ It is also common to store Pig Latin code in a script for batch execution
    – Allows for automation and code reuse

The following offer more information on topics discussed in this chapter

§ Apache Pig Web Site
    http://pig.apache.org/
§ Process a Million Songs with Apache Pig
    http://tiny.cloudera.com/dac03a
§ Powered By Pig
    http://tiny.cloudera.com/dac03b
§ LinkedIn: User Engagement Powered By Apache Pig and Hadoop
    http://tiny.cloudera.com/dac03c

§ Programming Pig (book)
    http://tiny.cloudera.com/dac03d
§ The original paper on Pig published by Yahoo in 2008:
    http://www.research.yahoo.com/files/sigmod08.pdf
