© Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Apache Hadoop – A course for undergraduates
Hadoop Tools for Data Acquisition
• How to load data from an existing RDBMS into HDFS using Sqoop
• How to manage real-time data such as log files using Flume
Chapter Topics
Hadoop Tools for Data Acquisition
• Loading Data into HDFS from an RDBMS Using Sqoop
Importing Data From an RDBMS to HDFS
• Typical scenario: data stored in an RDBMS is needed in a MapReduce job
  – Lookup tables
  – Legacy data
• Possible to read directly from an RDBMS in your Mapper
  – Can lead to the equivalent of a distributed denial of service (DDoS) attack on your RDBMS
  – In practice – don’t do it!
• Better idea: use Sqoop to import the data into HDFS beforehand
• Sqoop: open source tool originally written at Cloudera
  – Now a top-level Apache Software Foundation project
• Imports tables from an RDBMS into HDFS
  – Just one table
  – All tables in a database
  – Just portions of a table
    – Sqoop supports a WHERE clause
• Uses MapReduce to actually import the data
  – ‘Throttles’ the number of Mappers to avoid DDoS scenarios
  – Uses four Mappers by default
    – Value is configurable
• Imports data to HDFS as delimited text files or SequenceFiles
  – Default is a comma-delimited text file
• Can be used for incremental data imports
  – First import retrieves all rows in a table
  – Subsequent imports retrieve just rows created since the last import
• Generates a class file which can encapsulate a row of the imported data
  – Useful for serializing and deserializing data in subsequent MapReduce jobs
• Cloudera has partnered with other organizations to create custom Sqoop connectors
  – Use a database’s native protocols rather than JDBC
  – Provides much faster performance
• Current systems supported by custom connectors include:
  – Netezza
  – Teradata
  – Oracle Database (connector developed with Quest Software)
• Others are in development
• Custom connectors are not open source, but are free
  – Available from the Cloudera Web site
Sqoop: Basic Syntax
• Standard syntax:
    sqoop tool-name [tool-options]
• Tools include:
    import
    import-all-tables
    list-tables
• Options include:
    --connect
    --username
    --password
Sqoop: Example
• Example: import a table called employees from a database called personnel in a MySQL RDBMS
    $ sqoop import --username fred --password derf \
        --connect jdbc:mysql://database.example.com/personnel \
        --table employees
• Example: as above, but only records with an ID greater than 1000
    $ sqoop import --username fred --password derf \
        --connect jdbc:mysql://database.example.com/personnel \
        --table employees \
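The filtered import can be sketched in Python as a small helper that assembles the `sqoop import` argument list. This is only an illustration: it assumes Sqoop's `--where` and `--num-mappers` options, the column name `id` is hypothetical (the slides do not name the ID column), and the password option is deliberately left out.

```python
import shlex


def build_sqoop_import(connect, username, table, where=None, num_mappers=None):
    """Return the argument list for a 'sqoop import' invocation."""
    args = ["sqoop", "import",
            "--connect", connect,
            "--username", username,
            "--table", table]
    if where is not None:
        # --where restricts the import to rows matching the condition
        args += ["--where", where]
    if num_mappers is not None:
        # --num-mappers overrides the default of four parallel Mappers
        args += ["--num-mappers", str(num_mappers)]
    return args


cmd = build_sqoop_import(
    connect="jdbc:mysql://database.example.com/personnel",
    username="fred",
    table="employees",
    where="id > 1000",   # hypothetical column name
    num_mappers=2,
)
# Print a shell-quoted command line that could be pasted into a terminal
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps the WHERE condition as one argument, so its spaces and comparison operator are not reinterpreted by the shell.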
Importing An Entire Database with Sqoop
• Import all tables from the database (fields will be tab-delimited)
    $ sqoop import-all-tables \
        --connect jdbc:mysql://localhost/company \
        --username twheeler --password bigsecret \
        --fields-terminated-by '\t' \
Incremental Imports with Sqoop
• What if new records are added to the database?
  – Could re-import all records, but this is inefficient
• Sqoop’s incremental append mode imports only new records
  – Based on value of last record in specified column
    $ sqoop import \
        --connect jdbc:mysql://localhost/company \
        --username twheeler --password bigsecret \
        --warehouse-dir /mydata \
        --table orders \
        --incremental append \
        --check-column order_id \
        --last-value 6713821
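The selection rule behind incremental append mode can be illustrated with a short Python sketch. It is a simplified model, not Sqoop's implementation: rows are plain dictionaries, the sample order rows are made up, and the function name `incremental_append` is hypothetical.

```python
def incremental_append(rows, check_column, last_value):
    """Return (new_rows, new_last_value) for one incremental import."""
    # Only rows whose check column exceeds the previous high-water mark are new
    new_rows = [r for r in rows if r[check_column] > last_value]
    # The next import should start from the largest value seen so far
    new_last = max((r[check_column] for r in new_rows), default=last_value)
    return new_rows, new_last


# Made-up rows standing in for the 'orders' table
orders = [
    {"order_id": 6713820, "total": 12.50},
    {"order_id": 6713821, "total": 40.00},
    {"order_id": 6713822, "total": 7.25},
    {"order_id": 6713823, "total": 99.99},
]

imported, last = incremental_append(orders, "order_id", 6713821)
print([r["order_id"] for r in imported], last)
```

The returned high-water mark plays the role of the value passed to `--last-value` on the following run.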
Sqoop: Other Options
• Sqoop can take data from HDFS and insert it into an already-existing table in an RDBMS with the command
    $ sqoop export [options]
• For general Sqoop help:
    $ sqoop help
• For help on a particular command:
    $ sqoop help tool-name
• Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced
  – Ideally suited to gathering logs from multiple systems and inserting them into HDFS as they are generated
• Flume is Open Source
  – Initially developed by Cloudera
• Flume’s design goals:
  – Reliability
  – Scalability
  – Extensibility
Flume: High-Level Overview
[Figure: tiers of Flume agents passing data from one agent to the next, with optional compress, encrypt, and batch steps between tiers]
• Optionally process incoming data: perform transformations, suppressions, metadata enrichment
• Each agent can be configured with an in-memory or durable channel
• Writes to multiple HDFS file formats (text, SequenceFile, JSON, Avro, others)
• Parallelized writes across many collectors – as much write throughput as required
• Each Flume agent has a source, a sink and a channel
• Source
  – Tells the node where to receive data from
• Sink
  – Tells the node where to send data to
• Channel
  – A queue between the Source and Sink
  – Can be in-memory only or ‘Durable’
  – Durable channels will not lose data if power is lost
• Channels provide Flume’s reliability
• Memory Channel
  – Data will be lost if power is lost
• File Channel
  – Data stored on disk
  – Guarantees durability of data in face of a power loss
• Data transfer between Agents and Channels is transactional
  – A failed data transfer to a downstream agent rolls back and retries
• Can configure multiple Agents with the same task
  – e.g., two Agents doing the job of one “collector” – if one agent fails then upstream agents would fail over
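The roll-back-and-retry behaviour of a transactional channel can be modelled with a toy Python queue. This is a conceptual sketch only and does not use Flume's API: the `Channel` class, the `drain` helper, the batch size, and the retry limit are all hypothetical.

```python
from collections import deque


class Channel:
    """Toy FIFO channel with take/commit/rollback semantics."""

    def __init__(self, events):
        self._queue = deque(events)
        self._in_flight = []

    def take(self, batch_size):
        # Move up to batch_size events into the open transaction
        while self._queue and len(self._in_flight) < batch_size:
            self._in_flight.append(self._queue.popleft())
        return list(self._in_flight)

    def commit(self):
        # Delivery succeeded: the events leave the channel for good
        self._in_flight.clear()

    def rollback(self):
        # Delivery failed: put the events back at the head, in original order
        self._queue.extendleft(reversed(self._in_flight))
        self._in_flight.clear()

    def __len__(self):
        return len(self._queue)


def drain(channel, send, batch_size=2, max_attempts=3):
    """Deliver all events, retrying a failed batch up to max_attempts times."""
    delivered = []
    attempts = 0
    while len(channel) and attempts < max_attempts:
        batch = channel.take(batch_size)
        try:
            send(batch)
        except IOError:
            channel.rollback()
            attempts += 1
            continue
        channel.commit()
        delivered.extend(batch)
        attempts = 0
    return delivered


calls = {"n": 0}


def flaky_send(batch):
    # Simulated downstream agent that is unavailable on the first call only
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("downstream agent unavailable")


channel = Channel(["e1", "e2", "e3"])
delivered = drain(channel, flaky_send)
print(delivered)
```

Because events are only removed on commit, the failed first attempt loses nothing: the same batch is offered again and arrives in its original order.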
• Scalability
  – The ability to increase system performance linearly by adding more resources to the system
  – Flume scales horizontally
    – As load increases, more machines can be added to the configuration
• Extensibility
  – The ability to add new functionality to a system
• Flume can be extended by adding Sources and Sinks to existing storage layers or data platforms
  – General Sources include data from files, syslog, and standard output from a process
  – General Sinks include files on the local filesystem or HDFS
  – Developers can write their own Sources or Sinks
• Flume is typically used to ingest log files from real-time systems such as Web servers, firewalls and mailservers into HDFS
• Currently in use in many large organizations, ingesting millions of events per day
  – At least one organization is using Flume to ingest over 200 million events per day
• Flume is typically installed and configured by a system administrator
  – Check the Flume documentation if you intend to install it yourself
• Sqoop is a tool to load data from a database into HDFS
• Flume is a tool for managing real-time data
  – e.g. importing data from log files into HDFS
The following offer more information on topics discussed in this chapter
• Incremental importing is described in the Sqoop documentation:
  – http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_incremental_imports
An Introduction to Oozie
• What is Oozie?
• Creating Oozie workflows
Chapter Topics
An Introduction to Oozie
• Introduction to Oozie
The Motivation for Oozie (1)
• Many problems cannot be solved with a single MapReduce job
• Instead, a workflow of jobs must be created
• Simple workflow:
  – Run Job A
  – Use output of Job A as input to Job B
  – Use output of Job B as input to Job C
  – Output of Job C is the final required output
• Easy if the workflow is linear like this
  – Can be created as standard Driver code
[Figure: Start Data → Job A → Job B → Job C → Final Result]
• If the workflow is more complex, Driver code becomes much more difficult to maintain
• Example: running multiple jobs in parallel, using the output from all of those jobs as the input to the next job
• Example: including Hive or Pig jobs as part of the workflow
• Oozie is a ‘workflow engine’
• Runs on a server
  – Typically outside the cluster
• Runs workflows of Hadoop jobs
  – Including Pig, Hive, Sqoop jobs
  – Submits those jobs to the cluster based on a workflow definition
• Workflow definitions are submitted via HTTP
• Jobs can be run at specific times
  – One-off or recurring jobs
• Jobs can be run when data is present in a directory
• Oozie workflows are written in XML
• Workflow is a collection of actions
  – MapReduce jobs, Pig jobs, Hive jobs etc.
• A workflow consists of control flow nodes and action nodes
• Control flow nodes define the beginning and end of a workflow
  – They provide methods to determine the workflow execution path
  – Example: Run multiple jobs simultaneously
• Action nodes trigger the execution of a processing task, such as
  – A MapReduce job
  – A Hive query
  – A Sqoop data import job
• Simple example workflow for WordCount:
Simple Oozie Example
<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.myorg.WordCount.Map</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>org.myorg.WordCount.Reduce</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'>
    <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
  </kill>
  <end name='end'/>
</workflow-app>
A workflow is wrapped in the workflow-app entity.
The start node is the control node which tells Oozie which workflow node should be run first. There must be one start node in an Oozie workflow. In our example, we are telling Oozie to start by running the wordcount workflow node.
The wordcount action node defines a map-reduce action.
We specify what to do if the action ends successfully, and what to do if it fails. In this example, if the job is successful we go to the end node. If it fails we go to the kill node.
If the workflow reaches a kill node, it will kill all running actions and then terminate with an error. A workflow can have zero or more kill nodes.
Every workflow must have an end node. This indicates that the workflow has completed successfully.
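The way control passes from the start node through the action to either the end node or the kill node can be traced with a short Python sketch. It parses a trimmed copy of the WordCount workflow (the map-reduce configuration is omitted, so this copy is for illustration and is not a complete workflow definition). The `trace` function and its `action_succeeds` callback are hypothetical stand-ins for what the Oozie server does.

```python
import xml.etree.ElementTree as ET

NS = "{uri:oozie:workflow:0.1}"   # namespace prefix ElementTree adds to tags

# Trimmed copy of the WordCount workflow: job configuration left out
WORKFLOW = """
<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce/>
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'>
    <message>Something went wrong</message>
  </kill>
  <end name='end'/>
</workflow-app>
"""


def trace(xml_text, action_succeeds):
    """Return the list of named nodes visited, in order."""
    root = ET.fromstring(xml_text.strip())
    # Index every named node (action, kill, end) by its name attribute
    nodes = {n.get("name"): n for n in root if n.get("name")}
    path = []
    current = root.find(NS + "start").get("to")
    while True:
        node = nodes[current]
        path.append(current)
        if node.tag != NS + "action":
            return path   # reached an end or kill node
        # Follow the ok or error transition depending on the outcome
        outcome = "ok" if action_succeeds(current) else "error"
        current = node.find(NS + outcome).get("to")


print(trace(WORKFLOW, lambda name: True))
print(trace(WORKFLOW, lambda name: False))
```

The first call follows the `ok` transition and the second follows the `error` transition, which mirrors the two outcomes described for the wordcount action.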
• A decision control node allows Oozie to determine the workflow execution path based on some criteria
  – Similar to a switch-case statement
• fork and join control nodes split one execution path into multiple execution paths which run concurrently
  – fork splits the execution path
  – join waits for all concurrent execution paths to complete before proceeding
  – fork and join are used in pairs
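The fork and join pairing is analogous to starting several tasks concurrently and then waiting for all of them before continuing. The Python sketch below shows that pattern with a thread pool. It is an analogy rather than Oozie code, and `run_job` with its job names is a hypothetical stand-in for submitting a real job.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(name):
    # Stand-in for submitting a job and waiting for it to finish
    return f"{name}-output"


def fork_join(job_names):
    """Run all jobs concurrently and return their results in submission order."""
    with ThreadPoolExecutor(max_workers=len(job_names)) as pool:
        # fork: each job starts on its own execution path
        futures = [pool.submit(run_job, n) for n in job_names]
        # join: block until every path has completed
        return [f.result() for f in futures]


outputs = fork_join(["jobA", "jobB"])
# The next step only runs once both outputs are available
final = " + ".join(outputs)
print(final)
```

Collecting the results in submission order keeps the combined output stable even though the jobs themselves may finish in any order.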
Node Name     Description
map-reduce    Runs either a Java MapReduce or Streaming job
fs            Create directories, move or delete files or directories
java          Runs the main() method in the specified Java class as a single-Map, Map-only job on the cluster
pig           Runs a Pig script
hive          Runs a Hive query
sqoop         Runs a Sqoop job
email         Sends an e-mail message
Submitting an Oozie Workflow
• To submit an Oozie workflow using the command-line tool:
    $ oozie job -oozie http://<oozie_server>/oozie \
        -config config_file -run
• Oozie can also be called from within a Java program
  – Via the Oozie client API
More on Oozie
Information                                            Resource
Oozie installation and configuration                   CDH Installation Guide: http://docs.cloudera.com
Oozie workflows and actions                            https://oozie.apache.org
The procedure of running a MapReduce job using Oozie   https://cwiki.apache.org/OOZIE/map-reduce-cookbook.html
Oozie examples                                         Oozie examples are included in the Oozie distribution. Instructions for running them: http://oozie.apache.org/docs/
• Oozie is a workflow engine for Hadoop
• Supports Java and Streaming MapReduce jobs, Sqoop jobs, Hive queries, Pig scripts, and HDFS file manipulation
The following offer more information on topics discussed in this chapter
• “Introduction to Oozie” article
  – http://www.infoq.com/articles/introductionOozie
Introduction to Pig
• The key features Pig offers
• How to use Pig for data processing and analysis
• How to use Pig interactively and in batch mode
Chapter Topics
Introduction to Pig
• What is Pig?
• Pig’s Features
• Pig Use Cases
• Apache Pig is a platform for data analysis and processing on Hadoop
  – It offers an alternative to writing MapReduce code directly
• Originally developed as a research project at Yahoo
  – Goals: flexibility, productivity, and maintainability
  – Now an open-source Apache project
The Anatomy of Pig
• Main components of Pig
  – The data flow language (Pig Latin)
  – The interactive shell where you can type Pig Latin statements (Grunt)
  – The Pig interpreter and execution engine
[Figure: a Pig Latin script is passed to the Pig interpreter / execution engine, which produces MapReduce jobs]
Pig Latin Script
    AllSales = LOAD 'sales' AS (cust, price);
    BigSales = FILTER AllSales BY price > 100;
    STORE BigSales INTO 'myreport';
Pig Interpreter / Execution Engine
  – Preprocess and parse Pig Latin
  – Check data types
  – Make optimizations
  – Plan execution
  – Generate MapReduce jobs
  – Submit job(s) to Hadoop
  – Monitor progress
MapReduce Jobs
• CDH (Cloudera’s Distribution including Apache Hadoop) is the easiest way to install Hadoop and Pig
  – A Hadoop distribution which includes core Hadoop, Pig, Hive, Sqoop, HBase, Oozie, and other ecosystem components
  – Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
  – Simple installation
  – 100% free and open source
• Installation is outside the scope of this course
  – Cloudera offers a training course for System Administrators, Cloudera Administrator Training for Apache Hadoop
• Pig is an alternative to writing low-level MapReduce code
• Many features enable sophisticated analysis and processing
  – HDFS manipulation
  – UNIX shell commands
  – Relational operations
  – Positional references for fields
  – Common mathematical functions
  – Support for custom functions and data formats
  – Complex data structures
• Many organizations use Pig for data analysis
  – Finding relevant records in a massive data set
  – Querying multiple data sets
  – Calculating values from input data
• Pig is also frequently used for data processing
  – Reorganizing an existing data set
  – Joining data from multiple sources to produce a new data set
Use Case: Web Log Sessionization
• Pig can help you extract valuable information from Web server log files
[Figure: Web server log data is processed into a "Recent Activity for John Smith" report listing page visits by date (May 3, 2013 and May 12, 2013), such as Search for 'Widget', Widget Results, Details for Widget X, Order Widget X, Track Order, Contact Us, and Send Complaint]
Web Server Log Data
    10.174.57.241 - - [03/May/2013:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129"
    ...
    10.174.57.241 - - [03/May/2013:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129"
    10.174.57.241 - - [03/May/2013:17:58:25 -0500] "GET /detail?w=41 HTTP/1.1" 200 8584 "http://www.example.com/wres.html" "WebTV 1.2" "U=129"
    10.174.57.241 - - [03/May/2013:17:59:36 -0500] "GET /order.do HTTP/1.1" 200 964 "http://www.example.com/detail?w=41" "WebTV 1.2" "U=129"
    10.174.57.241 - - [03/May/2013:17:59:47 -0500] "GET /confirm HTTP/1.1" 200 964 "http://www.example.com/order.do" "WebTV 1.2" "U=129"
    10.218.46.19 - - [03/May/2013:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)"
    10.32.51.237 - - [03/May/2013:17:58:04 -0500] "GET /os.html HTTP/1.1" 404 955 "http://www.example.com/s?q=VMS" "Mozilla/1.0b (Win3.11)"
    10.157.96.181 - - [03/May/2013:17:58:26 -0500] "GET /mp3.html HTTP/1.1" 404 955 "http://www.example.com/s?q=Zune" "Mothra/2.77" "U=3622"
    ...
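To make the idea of sessionization concrete, here is a rough Python sketch that parses log lines in the format shown above and groups each visitor's requests into sessions. Several choices are assumptions rather than anything stated in the slides: visitors are identified by IP address, a session ends after 30 minutes of inactivity, and the names `parse` and `sessionize` are hypothetical. In the course itself this kind of processing would be written in Pig Latin.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Matches the leading fields of each log line: IP, timestamp, request, status, size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)


def parse(line):
    """Return (ip, timestamp, request) or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return m.group("ip"), when, m.group("request")


def sessionize(lines, gap=timedelta(minutes=30)):
    """Group hits per IP, starting a new session after 'gap' of inactivity."""
    by_ip = defaultdict(list)
    for line in lines:
        parsed = parse(line)
        if parsed:
            ip, when, request = parsed
            by_ip[ip].append((when, request))
    sessions = []
    for ip, hits in by_ip.items():
        hits.sort()
        current = [hits[0]]
        for hit in hits[1:]:
            if hit[0] - current[-1][0] > gap:
                sessions.append((ip, current))
                current = [hit]
            else:
                current.append(hit)
        sessions.append((ip, current))
    return sessions


log_lines = [
    '10.174.57.241 - - [03/May/2013:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617',
    '10.174.57.241 - - [03/May/2013:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741',
    '10.218.46.19 - - [03/May/2013:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955',
]
sessions = sessionize(log_lines)
for ip, hits in sessions:
    print(ip, [request for _, request in hits])
```

The log lines also carry a user cookie such as "U=129", which would be a more reliable visitor key than the IP address where it is present.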
Use Case: Data Sampling
• Sampling can help you explore a representative portion of a large data set
  – Allows you to examine this portion with tools that do not scale well
  – Supports faster iterations during development of analysis jobs
[Figure: a 100 TB data set reduced to a 50 MB sample]
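A simple way to picture sampling is to keep each record independently with a fixed probability. The Python sketch below does this on an in-memory list. The function name `sample`, the 1% fraction, and the seed are illustrative choices, and a real job would apply the same idea to data stored in HDFS.

```python
import random


def sample(records, fraction, seed=None):
    """Keep each record independently with probability 'fraction'."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


# 100,000 stand-in records; keep roughly 1% of them
records = range(100_000)
subset = sample(records, 0.01, seed=42)
print(len(subset))
```

Fixing the seed makes the sample repeatable, which helps when iterating on an analysis job against the same subset.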
Use Case: ETL Processing
• Pig is also widely used for Extract, Transform, and Load (ETL) processing
[Figure: data from Operations, Accounting, and Call Center systems passes through Pig jobs running on the Hadoop cluster (validate data, fix errors, remove duplicates, encode values) and into the data warehouse]
Using Pig Interactively
• You can use Pig interactively, via the Grunt shell
  – Pig interprets each Pig Latin statement as you type it
  – Execution is delayed until output is required
  – Very useful for ad hoc data inspection
• Example of how to start, use, and exit Grunt
    $ pig
    grunt> allsales = LOAD 'sales' AS (name, price);
    grunt> bigsales = FILTER allsales BY price > 100;
    grunt> STORE bigsales INTO 'myreport';
• Can also execute a Pig Latin statement from the UNIX shell via the -e option
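The point that execution is delayed until output is required can be illustrated with Python generators, which behave in a loosely similar way. This is an analogy and not how Pig is implemented: the functions `load`, `filter_by`, and `store`, and the sample sales rows, are hypothetical.

```python
loaded = []   # records which input rows have actually been read


def load(rows):
    # Generator: nothing is read until a consumer asks for rows
    for row in rows:
        loaded.append(row)
        yield row


def filter_by(relation, predicate):
    # Also lazy: builds a pipeline stage without running it
    return (row for row in relation if predicate(row))


def store(relation):
    # Requesting output is what finally drives the pipeline
    return list(relation)


sales = [("alice", 250), ("bob", 40), ("carol", 120)]   # made-up data

allsales = load(sales)
bigsales = filter_by(allsales, lambda row: row[1] > 100)
rows_read_before_store = len(loaded)   # still 0: no work done yet

report = store(bigsales)
print(rows_read_before_store, len(loaded), report)
```

Defining `allsales` and `bigsales` only describes the data flow; the input is read when `store` asks for results, much as the STORE statement triggers the work in Grunt.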
Interacting with HDFS
• You can manipulate HDFS with Pig, via the fs command
    grunt> fs -mkdir sales/;
    grunt> fs -put europe.txt sales/;
    grunt> allsales = LOAD 'sales' AS (name, price);
    grunt> bigsales = FILTER allsales BY price > 100;
    grunt> STORE bigsales INTO 'myreport';
Interacting with UNIX
• The sh command lets you run UNIX programs from Pig
    grunt> sh date;
    Fri May 10 13:05:31 PDT 2013
    grunt> fs -ls;    -- lists HDFS files
    grunt> sh ls;     -- lists local files
Running Pig Scripts
• A Pig script is simply Pig Latin code stored in a text file
  – By convention, these files have the .pig extension
• You can run a Pig script from within the Grunt shell via the run command
  – This is useful for automation and batch execution
• It is common to run a Pig script directly from the UNIX shell
    $ pig salesreport.pig
MapReduce and Local Modes
• As described earlier, Pig turns Pig Latin into MapReduce jobs
  – Pig submits those jobs for execution on the Hadoop cluster
• It is also possible to run Pig in ‘local mode’ using the -x flag
  – This runs MapReduce jobs on the local machine instead of the cluster
  – Local mode uses the local filesystem instead of HDFS
  – Can be helpful for testing before deploying a job to production
    $ pig -x local    -- interactive
• If a job fails, Pig may produce a log file to explain why
  – These log files are typically produced in your current working directory
  – On the local (client) machine
• Pig offers an alternative to writing MapReduce code directly
  – Pig interprets Pig Latin code in order to create MapReduce jobs
  – It then submits these MapReduce jobs to the Hadoop cluster
• You can execute Pig Latin code interactively through Grunt
  – Pig delays job execution until output is required
• It is also common to store Pig Latin code in a script for batch execution
  – Allows for automation and code reuse
The following offer more information on topics discussed in this chapter
• Apache Pig Web Site
  – http://pig.apache.org/
• Process a Million Songs with Apache Pig
  – http://tiny.cloudera.com/dac03a
• Powered By Pig
  – http://tiny.cloudera.com/dac03b
• LinkedIn: User Engagement Powered By Apache Pig and Hadoop
  – http://tiny.cloudera.com/dac03c
• Programming Pig (book)