
MapReduce Job Processing

Jeffrey Jestes

April 17, 2012

Background: Hadoop Distributed File System (HDFS)

Hadoop requires a Distributed File System (DFS); we utilize the Hadoop Distributed File System (HDFS).

Record ID   User ID   Object ID   ...
1           1         12872       ...
2           8         19832       ...
3           4         231         ...
...         ...       ...         ...

[Figure: a file R, holding the records above, stored on an HDFS cluster of one NameNode and several DataNodes.]

[Figure: R's DataChunks (splits), 64MB each, stored on the DataNodes.]
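As a rough illustration of how a file such as R divides into chunks, the sketch below computes the byte range of each fixed-size chunk. The 64MB chunk size comes from the figure; the function name `chunk_ranges` and the 200MB example file are assumptions made for illustration, and real HDFS block placement involves more than this arithmetic.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB, as in the figure


def chunk_ranges(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    ranges = []
    offset = 0
    while offset < file_size:
        # The last chunk may be shorter than chunk_size.
        length = min(chunk_size, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges


if __name__ == "__main__":
    # Hypothetical 200MB file: three full chunks plus one 8MB remainder.
    for offset, length in chunk_ranges(200 * 1024 * 1024):
        print(offset, length)
```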

Background: Hadoop Core

Hadoop Core consists of one master JobTracker and several TaskTrackers.

We assume one TaskTracker per physical machine.

[Figure: the JobTracker performs job scheduling, Mapper task scheduling, and Reduce task scheduling; the TaskTrackers run the Mappers and Reducers.]

Background: Hadoop Cluster

In a Hadoop cluster, one machine typically runs both the NameNode and JobTracker tasks and is called the master.

The other machines run DataNode and TaskTracker tasks and are called slaves.

[Figure: the master runs the NameNode and JobTracker; the slaves run the DataNodes and TaskTrackers.]

Background: MapReduce Job Overview

Next we look at an overview of a typical MapReduce job.

[Figure: the JobTracker, Job Configuration, and Distributed Cache alongside the Map Phase, in which four Mappers each read one input split (split 1 to split 4).]

Job-specific variables are first placed in the Job Configuration, which is sent to each Mapper task by the JobTracker.

Large data such as files or libraries are then put in the Distributed Cache, which is copied to each TaskTracker by the JobTracker.

The JobTracker next assigns each InputSplit to a Mapper task on a TaskTracker; we assume m Mappers and m InputSplits. Each split supplies its Mapper with (k1, v1) pairs.

Each Mapper maps a (k1, v1) pair to an intermediate (k2, v2) pair and partitions by k2, i.e., hash(k2) = p_i for i ∈ [1, r], where r = |reducers|.
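The map-and-partition step can be sketched as follows. This is a minimal illustration, not Hadoop's implementation: `zlib.crc32` stands in for whatever hash function the framework applies, and the names `partition`, `map_and_partition`, and `emit_user` are hypothetical. The sample records reuse the (Record ID, User ID, Object ID) rows from the HDFS example, with the record ID as k1.

```python
import zlib


def partition(k2, r):
    """Map an intermediate key k2 to a partition index in [1, r]."""
    # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
    digest = zlib.crc32(str(k2).encode("utf-8"))
    return digest % r + 1


def map_and_partition(records, map_fn, r):
    """Apply map_fn to each (k1, v1) and bucket the emitted (k2, v2) pairs by partition."""
    partitions = {i: [] for i in range(1, r + 1)}
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            partitions[partition(k2, r)].append((k2, v2))
    return partitions


def emit_user(k1, v1):
    # Hypothetical map function: emit (user ID, 1) for every record.
    user_id, _object_id = v1
    yield (user_id, 1)


if __name__ == "__main__":
    split = [(1, (1, 12872)), (2, (8, 19832)), (3, (4, 231))]
    print(map_and_partition(split, emit_user, r=2))
```

Because the partition index depends only on k2, every pair with the same key ends up in the same partition, and therefore at the same Reducer.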

An optional Combiner is executed over (k2, list(v2)).

The Combiner aggregates the v2 values for each k2, and the resulting (k2, v2) pair is written to a partition on disk.
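A Combiner of this kind can be sketched as below. The aggregation function is an assumption: summation is used here because it gives the same final result whether it is applied once or in stages, which a Combiner generally needs. The name `combine` is hypothetical.

```python
from collections import defaultdict


def combine(pairs, combine_fn=sum):
    """Group one Mapper's (k2, v2) pairs by key and aggregate each list(v2)."""
    grouped = defaultdict(list)
    for k2, v2 in pairs:
        grouped[k2].append(v2)
    # One (k2, v2) pair per distinct key, in key order.
    return [(k2, combine_fn(values)) for k2, values in sorted(grouped.items())]


if __name__ == "__main__":
    mapper_output = [(1, 1), (8, 1), (1, 1), (4, 1), (1, 1)]
    print(combine(mapper_output))  # [(1, 3), (4, 1), (8, 1)]
```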

In the Shuffle/Sort Phase, the JobTracker assigns two TaskTrackers to run the Reducers; each Reducer copies and sorts its inputs from the corresponding partitions.

Combiners reduce communication overhead!
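To make the shuffle/sort step and the effect of Combiners concrete, the sketch below has Reducer i collect partition i from every Mapper, sort by k2, and group the values into (k2, list(v2)). It also counts how many pairs cross the network. The data layout (one dict of partitions per Mapper) and the names `shuffle_sort` and `records_shuffled` are assumptions for illustration only.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_sort(mapper_outputs, i):
    """Collect partition i from every Mapper, sort by k2, and group into (k2, list(v2))."""
    copied = []
    for partitions in mapper_outputs:  # one {partition index: [(k2, v2), ...]} per Mapper
        copied.extend(partitions.get(i, []))
    copied.sort(key=itemgetter(0))  # stable sort: values keep Mapper order within a key
    return [(k2, [v2 for _, v2 in group])
            for k2, group in groupby(copied, key=itemgetter(0))]


def records_shuffled(mapper_outputs):
    """Count the (k2, v2) pairs that the Reducers must copy in total."""
    return sum(len(pairs) for partitions in mapper_outputs for pairs in partitions.values())


if __name__ == "__main__":
    # Two Mappers, two partitions, without and with a summing Combiner.
    raw = [{1: [("a", 1), ("a", 1), ("a", 1)], 2: [("b", 1)]},
           {1: [("a", 1), ("a", 1)], 2: []}]
    combined = [{1: [("a", 3)], 2: [("b", 1)]},
                {1: [("a", 2)], 2: []}]
    print(records_shuffled(raw), shuffle_sort(raw, 1))
    print(records_shuffled(combined), shuffle_sort(combined, 1))
```

In this toy input the Combiner halves the number of pairs copied (6 versus 3), while the values reaching the Reducer for key "a" still sum to 5.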

In the Reduce Phase, each Reducer reduces a (k2, list(v2)) to a single (k3, v3) and writes the results to a DFS file o_i, for i ∈ [1, r].
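The reduce step can be sketched in the same style. A local temporary directory stands in for the DFS here, and the names `run_reducer` and `total`, the tab-separated output format, and the `o1` file naming are all assumptions made for illustration.

```python
import os
import tempfile


def run_reducer(i, grouped, reduce_fn, out_dir):
    """Reduce each (k2, list(v2)) to one (k3, v3) and write the results to the file o_i."""
    path = os.path.join(out_dir, f"o{i}")
    with open(path, "w", encoding="utf-8") as out:
        for k2, values in grouped:
            k3, v3 = reduce_fn(k2, values)
            out.write(f"{k3}\t{v3}\n")
    return path


def total(k2, values):
    # Hypothetical reduce function: sum the partial values for a key.
    return k2, sum(values)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        path = run_reducer(1, [(1, [3, 2]), (4, [1])], total, out_dir)
        with open(path, encoding="utf-8") as f:
            print(f.read())
```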
