Cloudera_Academic_Partnership_8.pdf


Apache Hadoop – A course for undergraduates

Hadoop Tools for Data Acquisition

§ How to load data from an existing RDBMS into HDFS using Sqoop
§ How to manage real-time data such as log files using Flume

Chapter Topics

Hadoop Tools for Data Acquisition

§ Loading Data into HDFS from an RDBMS Using Sqoop

Importing Data From an RDBMS to HDFS

§ Typical scenario: data stored in an RDBMS is needed in a MapReduce job
  – Lookup tables
  – Legacy data
§ Possible to read directly from an RDBMS in your Mapper
  – Can lead to the equivalent of a distributed denial of service (DDoS) attack on your RDBMS
  – In practice: don’t do it!
§ Better idea: use Sqoop to import the data into HDFS beforehand

§ Sqoop: open source tool originally written at Cloudera
  – Now a top-level Apache Software Foundation project
§ Imports tables from an RDBMS into HDFS
  – Just one table
  – All tables in a database
  – Just portions of a table
  – Sqoop supports a WHERE clause
§ Uses MapReduce to actually import the data
  – ‘Throttles’ the number of Mappers to avoid DDoS scenarios
  – Uses four Mappers by default
  – Value is configurable
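To picture why a small, fixed number of Mappers limits the load on the database, it may help to see how an import could be divided among them. The sketch below is only an illustration under stated assumptions: it supposes the table has a numeric key with a known minimum and maximum, and cuts that range into contiguous slices, one per Mapper. The helper name `split_key_range` and the sample bounds are hypothetical, and Sqoop's own splitting logic may differ in detail.

```python
def split_key_range(lo, hi, num_mappers=4):
    """Split the inclusive key range [lo, hi] into contiguous slices.

    Hypothetical helper: returns one (start, end) pair per Mapper, so at
    most `num_mappers` queries hit the database at the same time.
    """
    total = hi - lo + 1
    base, extra = divmod(total, num_mappers)
    slices = []
    start = lo
    for i in range(num_mappers):
        # The first `extra` slices take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # fewer keys than Mappers: leave this Mapper idle
        end = start + size - 1
        slices.append((start, end))
        start = end + 1
    return slices


if __name__ == "__main__":
    # Assumed key range 1..1000, default of four Mappers.
    for start, end in split_key_range(1, 1000):
        print(f"Mapper reads keys {start}..{end}")
```

Raising `num_mappers` shortens each slice but opens more concurrent connections to the RDBMS, which is the trade-off the throttling is meant to control.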

§ Imports data to HDFS as delimited text files or SequenceFiles
  – Default is a comma-delimited text file
§ Can be used for incremental data imports
  – First import retrieves all rows in a table
  – Subsequent imports retrieve just rows created since the last import
§ Generates a class file which can encapsulate a row of the imported data
  – Useful for serializing and deserializing data in subsequent MapReduce jobs

§ Cloudera has partnered with other organizations to create custom Sqoop connectors
  – Use a database’s native protocols rather than JDBC
  – Provide much faster performance
§ Current systems supported by custom connectors include:
  – Netezza
  – Teradata
  – Oracle Database (connector developed with Quest Software)
§ Others are in development
§ Custom connectors are not open source, but are free
  – Available from the Cloudera Web site

Sqoop: Basic Syntax

§ Standard syntax:

sqoop tool-name [tool-options]

§ Tools include:

import
import-all-tables
list-tables

§ Options include:

--connect
--username
--password

Sqoop: Example

§ Example: import a table called employees from a database called personnel in a MySQL RDBMS

$ sqoop import --username fred --password derf \
    --connect jdbc:mysql://database.example.com/personnel \
    --table employees

§ Example: as above, but only records with an ID greater than 1000

$ sqoop import --username fred --password derf \
    --connect jdbc:mysql://database.example.com/personnel \
    --table employees \
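The filtered import relies on Sqoop's `--where` option, which restricts the rows that are read. As a hedged illustration, the sketch below assembles such a command line from Python. The helper `build_sqoop_import` is hypothetical, and the column name `id` is an assumption, since the slide does not name the ID column.

```python
import shlex


def build_sqoop_import(connect, username, password, table, where=None):
    """Return the argument list for a `sqoop import` invocation.

    Hypothetical helper for illustration; it only builds the command,
    it does not run Sqoop.
    """
    args = [
        "sqoop", "import",
        "--username", username, "--password", password,
        "--connect", connect,
        "--table", table,
    ]
    if where:
        # --where limits the import to rows matching the condition.
        args += ["--where", where]
    return args


if __name__ == "__main__":
    cmd = build_sqoop_import(
        "jdbc:mysql://database.example.com/personnel",
        "fred", "derf", "employees",
        where="id > 1000",  # assumed column name
    )
    # shlex.join quotes the condition so the shell passes it as one argument.
    print(shlex.join(cmd))
```

Keeping the arguments as a list until the last moment avoids quoting mistakes when the condition contains spaces or comparison operators.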

Importing An Entire Database with Sqoop

§ Import all tables from the database (fields will be tab-delimited)

$ sqoop import-all-tables \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --fields-terminated-by '\t' \

Incremental Imports with Sqoop

§ What if new records are added to the database?
  – Could re-import all records, but this is inefficient
§ Sqoop’s incremental append mode imports only new records
  – Based on value of last record in specified column

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table orders \
    --incremental append \
    --check-column order_id \
    --last-value 6713821
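The idea behind append mode can be sketched in a few lines: keep only rows whose check column is greater than the last value seen, then remember the new high-water mark for the next run. This is a simplified model rather than Sqoop's implementation; the function name `incremental_append` and the sample rows are hypothetical.

```python
def incremental_append(source_rows, check_column, last_value):
    """Return (new_rows, new_last_value) for one incremental run.

    Simplified model: a row is "new" if its check column exceeds
    the value recorded at the end of the previous import.
    """
    new_rows = [row for row in source_rows if row[check_column] > last_value]
    # If nothing new arrived, the high-water mark stays where it was.
    new_last = max((row[check_column] for row in new_rows), default=last_value)
    return new_rows, new_last


if __name__ == "__main__":
    # Hypothetical contents of the orders table.
    orders = [
        {"order_id": 6713820, "total": 19.99},
        {"order_id": 6713821, "total": 5.00},
        {"order_id": 6713822, "total": 42.50},
        {"order_id": 6713823, "total": 7.25},
    ]
    rows, last = incremental_append(orders, "order_id", 6713821)
    print(f"imported {len(rows)} rows, next --last-value is {last}")
```

This only detects appended rows. Rows that were updated in place keep their old key and would need a different strategy.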

Sqoop: Other Options

§ Sqoop can take data from HDFS and insert it into an already-existing table in an RDBMS with the command

$ sqoop export [options]

§ For general Sqoop help:

$ sqoop help

§ For help on a particular command:


§ Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced
  – Ideally suited to gathering logs from multiple systems and inserting them into HDFS as they are generated
§ Flume is Open Source
  – Initially developed by Cloudera
§ Flume’s design goals:
  – Reliability
  – Scalability
  – Extensibility

Flume: High-Level Overview

[Figure: tiers of Flume agents passing data downstream; labels on the links: compress, encrypt, batch]

• Optionally process incoming data: perform transformations, suppressions, metadata enrichment
• Each agent can be configured with an in-memory or durable channel
• Writes to multiple HDFS file formats (text, SequenceFile, JSON, Avro, others)
• Parallelized writes across many collectors: as much write throughput as required

§ Each Flume agent has a source, a sink and a channel
§ Source
  – Tells the node where to receive data from
§ Sink
  – Tells the node where to send data to
§ Channel
  – A queue between the Source and Sink
  – Can be in-memory only or ‘Durable’
  – Durable channels will not lose data if power is lost
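The source, channel and sink roles can be modeled with a simple queue. The sketch below is a toy model, not Flume code: the names `MemoryChannel` and `run_agent` are hypothetical, the source is a list of lines, and the sink is a Python list standing in for HDFS.

```python
from collections import deque


class MemoryChannel:
    """In-memory queue between a source and a sink.

    Like any memory-only buffer, its contents vanish if the process dies.
    """

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        return self._events.popleft() if self._events else None


def run_agent(source_lines, channel, sink):
    """Move every line from the source, through the channel, to the sink."""
    # Source side: receive incoming data and queue it on the channel.
    for line in source_lines:
        channel.put(line.rstrip("\n"))
    # Sink side: drain the channel in arrival order and deliver each event.
    delivered = 0
    while (event := channel.take()) is not None:
        sink.append(event)
        delivered += 1
    return delivered


if __name__ == "__main__":
    hdfs_stand_in = []  # placeholder for a real destination
    count = run_agent(["GET /a\n", "GET /b\n"], MemoryChannel(), hdfs_stand_in)
    print(count, hdfs_stand_in)
```

Because the source and sink only talk to the channel, either side can be swapped out without touching the other, which is the decoupling the three-part design provides.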

§ Channels provide Flume’s reliability
§ Memory Channel
  – Data will be lost if power is lost
§ File Channel
  – Data stored on disk
  – Guarantees durability of data in face of a power loss
§ Data transfer between Agents and Channels is transactional
  – A failed data transfer to a downstream agent rolls back and retries
§ Can configure multiple Agents with the same task
  – e.g., two Agents doing the job of one “collector”: if one agent fails then upstream agents would fail over
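Rollback, retry and failover can be illustrated with a small model. This is a sketch under stated assumptions, not Flume's transaction API: `TransactionalChannel` and `deliver` are hypothetical names, a downstream agent is modeled as a callable that raises `ConnectionError` when it is unavailable, and the retry limit of three is arbitrary.

```python
from collections import deque


class TransactionalChannel:
    """Queue whose taken event can be put back if delivery fails."""

    def __init__(self, events):
        self._events = deque(events)

    def take(self):
        return self._events.popleft() if self._events else None

    def rollback(self, event):
        # Restore the event at the head so ordering is preserved.
        self._events.appendleft(event)


def deliver(channel, downstream, max_attempts=3):
    """Drain the channel, failing over between downstream agents."""
    delivered = []
    failed_rounds = 0
    while (event := channel.take()) is not None:
        for send in downstream:  # primary first, then the alternates
            try:
                send(event)
                delivered.append(event)  # transfer succeeded: commit
                failed_rounds = 0
                break
            except ConnectionError:
                continue  # fail over to the next agent
        else:
            # Every agent failed: roll back so the event is not lost.
            channel.rollback(event)
            failed_rounds += 1
            if failed_rounds >= max_attempts:
                raise RuntimeError("all downstream agents unavailable")
    return delivered


if __name__ == "__main__":
    received = []

    def broken_collector(event):
        raise ConnectionError("collector is down")

    channel = TransactionalChannel(["event-1", "event-2"])
    print(deliver(channel, [broken_collector, received.append]))
```

An event leaves the channel for good only after some downstream agent has accepted it, so a failure costs time but not data.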

§ Scalability
  – The ability to increase system performance linearly by adding more resources to the system
  – Flume scales horizontally
  – As load increases, more machines can be added to the configuration

§ Extensibility
  – The ability to add new functionality to a system
§ Flume can be extended by adding Sources and Sinks to existing storage layers or data platforms
  – General Sources include data from files, syslog, and standard output from a process
  – General Sinks include files on the local filesystem or HDFS
  – Developers can write their own Sources or Sinks

§ Flume is typically used to ingest log files from real-time systems such as Web servers, firewalls and mailservers into HDFS
§ Currently in use in many large organizations, ingesting millions of events per day
  – At least one organization is using Flume to ingest over 200 million events per day
§ Flume is typically installed and configured by a system administrator
  – Check the Flume documentation if you intend to install it yourself

§ Sqoop is a tool to load data from a database into HDFS
§ Flume is a tool for managing real-time data
  – e.g. importing data from log files into HDFS

The following offer more information on topics discussed in this chapter

§ Incremental importing is described in the Sqoop documentation:
  – http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_incremental_imports

An Introduction to Oozie

§ What is Oozie?
§ Creating Oozie workflows

Chapter Topics

An Introduction to Oozie

§ Introduction to Oozie

The Motivation for Oozie (1)

§ Many problems cannot be solved with a single MapReduce job
§ Instead, a workflow of jobs must be created
§ Simple workflow:
  – Run Job A
  – Use output of Job A as input to Job B
  – Use output of Job B as input to Job C
  – Output of Job C is the final required output
§ Easy if the workflow is linear like this
  – Can be created as standard Driver code

[Figure: Start Data → Job A → Job B → Job C → Final Result]

§ If the workflow is more complex, Driver code becomes much more difficult to maintain
§ Example: running multiple jobs in parallel, using the output from all of those jobs as the input to the next job
§ Example: including Hive or Pig jobs as part of the workflow

§ Oozie is a ‘workflow engine’
§ Runs on a server
  – Typically outside the cluster
§ Runs workflows of Hadoop jobs
  – Including Pig, Hive, Sqoop jobs
  – Submits those jobs to the cluster based on a workflow definition
§ Workflow definitions are submitted via HTTP
§ Jobs can be run at specific times
  – One-off or recurring jobs
§ Jobs can be run when data is present in a directory


§ Oozie workflows are written in XML
§ Workflow is a collection of actions
  – MapReduce jobs, Pig jobs, Hive jobs etc.
§ A workflow consists of control flow nodes and action nodes
§ Control flow nodes define the beginning and end of a workflow
  – They provide methods to determine the workflow execution path
  – Example: Run multiple jobs simultaneously
§ Action nodes trigger the execution of a processing task, such as
  – A MapReduce job
  – A Hive query
  – A Sqoop data import job

§ Simple example workflow for WordCount:

Simple Oozie Example (2)

<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.myorg.WordCount.Map</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>org.myorg.WordCount.Reduce</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'>
    <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
  </kill>
  <end name='end'/>
</workflow-app>

Simple Oozie Example (3)

A workflow is wrapped in the workflow-app entity

Simple Oozie Example (4)

The start node is the control node which tells Oozie which workflow node should be run first. There must be one start node in an Oozie workflow. In our example, we are telling Oozie to start by

Simple Oozie Example (5)

The wordcount action node defines a

Simple Oozie Example (6)

Simple Oozie Example (7)

We specify what to do if the action ends successfully, and what to do if it fails. In this example, if the job is successful we go to the end node. If it fails we go to the kill node.

Simple Oozie Example (9)

If the workflow reaches a kill node, it will kill all running actions and then terminate with an error. A workflow can have zero or more kill nodes.

Simple Oozie Example (8)

Every workflow must have an end node. This indicates that the workflow has completed successfully.
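The way the start, ok, error, kill and end elements steer execution can be traced with a short script. The sketch below is an illustration only: it parses an abbreviated copy of the WordCount workflow (the `map-reduce` body is emptied for brevity, so this copy is not meant to be submitted to Oozie) and follows the transitions by hand. The function `trace` and its `action_succeeds` flag are hypothetical and merely simulate the outcome of the action.

```python
import xml.etree.ElementTree as ET

NS = {"wf": "uri:oozie:workflow:0.1"}

# Abbreviated copy of the WordCount workflow, for illustration only.
WORKFLOW_XML = """
<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce/>
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'>
    <message>Something went wrong</message>
  </kill>
  <end name='end'/>
</workflow-app>
"""


def trace(xml_text, action_succeeds=True):
    """Return the names of the nodes visited, in order."""
    root = ET.fromstring(xml_text)
    # Index every named node (action, kill, end) by its name attribute.
    nodes = {child.get("name"): child for child in root if child.get("name")}
    path = []
    current = root.find("wf:start", NS).get("to")
    while True:
        path.append(current)
        node = nodes[current]
        tag = node.tag.split("}")[1]  # drop the "{namespace}" prefix
        if tag in ("end", "kill"):
            return path  # both node types terminate the workflow
        # An action names its successor for success and for failure.
        branch = "wf:ok" if action_succeeds else "wf:error"
        current = node.find(branch, NS).get("to")


if __name__ == "__main__":
    print(trace(WORKFLOW_XML, action_succeeds=True))
    print(trace(WORKFLOW_XML, action_succeeds=False))
```

The two runs differ only in which transition the action takes, which mirrors how the ok and error elements split the path after a job finishes.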

§ A decision control node allows Oozie to determine the workflow execution path based on some criteria
  – Similar to a switch-case statement
§ fork and join control nodes split one execution path into multiple execution paths which run concurrently
  – fork splits the execution path
  – join waits for all concurrent execution paths to complete before proceeding
  – fork and join are used in pairs
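The fork and join pairing has a close analogue in ordinary concurrent code: start several tasks at once, then wait for all of them before moving on. The sketch below uses threads purely as an analogy. The name `fork_join` is hypothetical, and real Oozie actions run as jobs on the cluster rather than as local threads.

```python
from concurrent.futures import ThreadPoolExecutor


def fork_join(jobs, next_job):
    """Run `jobs` concurrently, then pass all their results to `next_job`."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        # fork: one execution path per job, all started together
        futures = [pool.submit(job) for job in jobs]
        # join: block until every path has completed
        results = [future.result() for future in futures]
    return next_job(results)


if __name__ == "__main__":
    # Three stand-in jobs, each returning a record count.
    jobs = [lambda: 10, lambda: 20, lambda: 30]
    print(fork_join(jobs, sum))
```

The step after the join sees the output of every branch, which is the situation described earlier where one job consumes the results of several parallel jobs.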

Node Name     Description
map-reduce    Runs either a Java MapReduce or Streaming job
fs            Create directories, move or delete files or directories
java          Runs the main() method in the specified Java class as a single-Map, Map-only job on the cluster
pig           Runs a Pig script
hive          Runs a Hive query
sqoop         Runs a Sqoop job
email         Sends an e-mail message

Submitting an Oozie Workflow

§ To submit an Oozie workflow using the command-line tool:

$ oozie job -oozie http://<oozie_server>/oozie \
    -config config_file -run

§ Oozie can also be called from within a Java program
  – Via the Oozie client API

More on Oozie

Information                                            Resource
Oozie installation and configuration                   CDH Installation Guide, http://docs.cloudera.com
Oozie workflows and actions                            https://oozie.apache.org
The procedure of running a MapReduce job using Oozie   https://cwiki.apache.org/OOZIE/map-reduce-cookbook.html
Oozie examples                                         Oozie examples are included in the Oozie distribution. Instructions for running them: http://oozie.apache.org/docs/

§ Oozie is a workflow engine for Hadoop
§ Supports Java and Streaming MapReduce jobs, Sqoop jobs, Hive queries, Pig scripts, and HDFS file manipulation

The following offer more information on topics discussed in this chapter

§ “Introduction to Oozie” article
  – http://www.infoq.com/articles/introductionOozie

Introduction to Pig

§ The key features Pig offers
§ How to use Pig for data processing and analysis
§ How to use Pig interactively and in batch mode

Chapter Topics

Introduction to Pig

§ What is Pig?
§ Pig’s Features
§ Pig Use Cases

§ Apache Pig is a platform for data analysis and processing on Hadoop
  – It offers an alternative to writing MapReduce code directly
§ Originally developed as a research project at Yahoo
  – Goals: flexibility, productivity, and maintainability
  – Now an open-source Apache project

The Anatomy of Pig

§ Main components of Pig
  – The data flow language (Pig Latin)
  – The interactive shell where you can type Pig Latin statements (Grunt)
  – The Pig interpreter and execution engine

Pig Latin Script

AllSales = LOAD 'sales' AS (cust, price);
BigSales = FILTER AllSales BY price > 100;
STORE BigSales INTO 'myreport';

Pig Interpreter / Execution Engine

• Preprocess and parse Pig Latin
• Check data types
• Make optimizations
• Plan execution
• Generate MapReduce jobs
• Submit job(s) to Hadoop
• Monitor progress

MapReduce Jobs

§ CDH (Cloudera’s Distribution including Apache Hadoop) is the easiest way to install Hadoop and Pig
  – A Hadoop distribution which includes core Hadoop, Pig, Hive, Sqoop, HBase, Oozie, and other ecosystem components
  – Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
  – Simple installation
  – 100% free and open source
§ Installation is outside the scope of this course
  – Cloudera offers a training course for System Administrators, Cloudera Administrator Training for Apache Hadoop


§ Pig is an alternative to writing low-level MapReduce code
§ Many features enable sophisticated analysis and processing
  – HDFS manipulation
  – UNIX shell commands
  – Relational operations
  – Positional references for fields
  – Common mathematical functions
  – Support for custom functions and data formats
  – Complex data structures


§ Many organizations use Pig for data analysis
  – Finding relevant records in a massive data set
  – Querying multiple data sets
  – Calculating values from input data
§ Pig is also frequently used for data processing
  – Reorganizing an existing data set
  – Joining data from multiple sources to produce a new data set

Use Case: Web Log Sessionization

§ Pig can help you extract valuable information from Web server log files

[Figure: "Process Logs" turns the Web server log data below into a "Recent Activity for John Smith" report. Report labels: May 3, 2013; Search for 'Widget'; Widget Results; Details for Widget X; Order Widget X; May 12, 2013; Track Order; Contact Us; Send Complaint]

Web Server Log Data

10.174.57.241 - - [03/May/2013:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129"
...
10.174.57.241 - - [03/May/2013:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129"
10.174.57.241 - - [03/May/2013:17:58:25 -0500] "GET /detail?w=41 HTTP/1.1" 200 8584 "http://www.example.com/wres.html" "WebTV 1.2" "U=129"
10.174.57.241 - - [03/May/2013:17:59:36 -0500] "GET /order.do HTTP/1.1" 200 964 "http://www.example.com/detail?w=41" "WebTV 1.2" "U=129"
10.174.57.241 - - [03/May/2013:17:59:47 -0500] "GET /confirm HTTP/1.1" 200 964 "http://www.example.com/order.do" "WebTV 1.2" "U=129"
10.218.46.19 - - [03/May/2013:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)"
10.32.51.237 - - [03/May/2013:17:58:04 -0500] "GET /os.html HTTP/1.1" 404 955 "http://www.example.com/s?q=VMS" "Mozilla/1.0b (Win3.11)"
10.157.96.181 - - [03/May/2013:17:58:26 -0500] "GET /mp3.html HTTP/1.1" 404 955 "http://www.example.com/s?q=Zune" "Mothra/2.77" "U=3622"
...
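To make the idea of sessionization concrete, the sketch below groups log lines by the user cookie in the last quoted field and starts a new session after a period of inactivity. It is written in Python rather than Pig Latin, purely to show the logic. The regular expression is fitted to the sample lines shown here, the 30-minute gap is an assumed convention, and the names `LOG_PATTERN` and `sessionize` are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Pattern fitted to the sample log lines; the trailing cookie is optional.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
    r'(?: "(?P<cookie>[^"]*)")?$'
)


def sessionize(lines, gap=timedelta(minutes=30)):
    """Return {cookie: [session, ...]}; a session is a list of (time, path)."""
    hits_by_user = defaultdict(list)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match or not match.group("cookie"):
            continue  # skip unparsable lines and requests with no cookie
        when = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        path = match.group("request").split()[1]
        hits_by_user[match.group("cookie")].append((when, path))

    sessions = {}
    for user, hits in hits_by_user.items():
        hits.sort()  # chronological order
        user_sessions = []
        for when, path in hits:
            # Open a new session on the first hit or after a long pause.
            if not user_sessions or when - user_sessions[-1][-1][0] > gap:
                user_sessions.append([])
            user_sessions[-1].append((when, path))
        sessions[user] = user_sessions
    return sessions


if __name__ == "__main__":
    sample = [
        '10.174.57.241 - - [03/May/2013:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" '
        '200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129"',
        '10.174.57.241 - - [03/May/2013:17:58:03 -0500] "GET /wres.html HTTP/1.1" '
        '200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129"',
    ]
    for user, user_sessions in sessionize(sample).items():
        print(user, [[path for _, path in s] for s in user_sessions])
```

Requests without a cookie are dropped in this sketch. A fuller treatment might fall back to the IP address and user agent to identify a visitor.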

Use Case: Data Sampling

§ Sampling can help you explore a representative portion of a large data set
  – Allows you to examine this portion with tools that do not scale well
  – Supports faster iterations during development of analysis jobs

[Figure: a 100 TB data set sampled down to 50 MB]

Use Case: ETL Processing

§ Pig is also widely used for Extract, Transform, and Load (ETL) processing

[Figure: data from Operations, Accounting, and Call Center passes through Pig jobs running on a Hadoop cluster (validate data, fix errors, remove duplicates, encode values) and is loaded into a Data Warehouse]


Using Pig Interactively

§ You can use Pig interactively, via the Grunt shell
  – Pig interprets each Pig Latin statement as you type it
  – Execution is delayed until output is required
  – Very useful for ad hoc data inspection
§ Example of how to start, use, and exit Grunt

$ pig
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';

§ Can also execute a Pig Latin statement from the UNIX shell via the -e option
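The point that execution is delayed until output is required can be mimicked with Python generators. The sketch below is an analogy rather than a description of Pig's internals: `load`, `filter_by` and `store` are hypothetical stand-ins for the three statements in the Grunt session, and the sales records are made up.

```python
def load(rows, log):
    """Lazily yield rows, noting each read in `log` so laziness is visible."""
    for row in rows:
        log.append(f"read {row[0]}")
        yield row


def filter_by(relation, predicate):
    """Describe a filter without running it yet."""
    return (row for row in relation if predicate(row))


def store(relation):
    """Output is required here, so the whole pipeline finally runs."""
    return list(relation)


if __name__ == "__main__":
    read_log = []
    sales = [("alice", 50), ("bob", 150), ("carol", 300)]  # made-up data

    allsales = load(sales, read_log)
    bigsales = filter_by(allsales, lambda row: row[1] > 100)
    print("reads before store:", len(read_log))  # nothing has been read yet

    report = store(bigsales)
    print("reads after store:", len(read_log))
    print(report)
```

Defining `allsales` and `bigsales` only describes the work. The data is touched once `store` asks for results, much as the Grunt session does no processing until the STORE statement.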

Interacting with HDFS

§ You can manipulate HDFS with Pig, via the fs command

grunt> fs -mkdir sales/;
grunt> fs -put europe.txt sales/;
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';

Interacting with UNIX

§ The sh command lets you run UNIX programs from Pig

grunt> sh date;
Fri May 10 13:05:31 PDT 2013
grunt> fs -ls;    -- lists HDFS files
grunt> sh ls;     -- lists local files

Running Pig Scripts

§ A Pig script is simply Pig Latin code stored in a text file
  – By convention, these files have the .pig extension
§ You can run a Pig script from within the Grunt shell via the run command
  – This is useful for automation and batch execution
§ It is common to run a Pig script directly from the UNIX shell

$ pig salesreport.pig

MapReduce and Local Modes

§ As described earlier, Pig turns Pig Latin into MapReduce jobs
  – Pig submits those jobs for execution on the Hadoop cluster
§ It is also possible to run Pig in ‘local mode’ using the -x flag
  – This runs MapReduce jobs on the local machine instead of the cluster
  – Local mode uses the local filesystem instead of HDFS
  – Can be helpful for testing before deploying a job to production

$ pig -x local    -- interactive

§ If a job fails, Pig may produce a log file to explain why
  – These log files are typically produced in your current working directory
  – On the local (client) machine

§ Pig offers an alternative to writing MapReduce code directly
  – Pig interprets Pig Latin code in order to create MapReduce jobs
  – It then submits these MapReduce jobs to the Hadoop cluster
§ You can execute Pig Latin code interactively through Grunt
  – Pig delays job execution until output is required
§ It is also common to store Pig Latin code in a script for batch execution
  – Allows for automation and code reuse

The following offer more information on topics discussed in this chapter

§ Apache Pig Web Site
  – http://pig.apache.org/
§ Process a Million Songs with Apache Pig
  – http://tiny.cloudera.com/dac03a
§ Powered By Pig
  – http://tiny.cloudera.com/dac03b
§ LinkedIn: User Engagement Powered By Apache Pig and Hadoop
  – http://tiny.cloudera.com/dac03c
§ Programming Pig (book)
  – http://tiny.cloudera.com/dac03d
§ The original paper on Pig published by Yahoo in 2008:
  – http://www.research.yahoo.com/files/sigmod08.pdf
