• No results found

How To Write A Mapreduce Program On An Ipad Or Ipad (For Free)

N/A
N/A
Protected

Academic year: 2021

Share "How To Write A Mapreduce Program On An Ipad Or Ipad (For Free)"

Copied!
23
0
0

Loading.... (view fulltext now)

Full text

(1)

Course NDBI040: Big Data Management and NoSQL Databases

MapReduce

Martin Svoboda

Practice 01:

(2)

NDBI040: Big Data Management and NoSQL Databases | Practice 01: MapReduce | 20. 10. 2015 2

MapReduce: Overview

MapReduce

 Programming model for processing of large data sets

with a parallel and distributed algorithm on a cluster

Map function

‒ Processes input data in order to emit a set of intermediate

key/value pairs

‒ (k 1 , v 1 ) → list(k 2 , v 2 )

Reduce function

‒ Merges (all) intermediate values associated with the same

intermediate key

‒ (k 2 , list(v 2 )) → (k 2 , possibly smaller list(v 2 ))

(3)

MapReduce: Example

Word frequency – map and reduce functions

map(String key, String value):

// key: document name

// value: document contents

for each word w in value:

Emit(w, "1");

reduce(String key, Iterator values):

// key: a word

// values: a list of counts

int result = 0;

for each v in values:

result += ParseInt(v);

Emit(key, AsString(result));

(4)

NDBI040: Big Data Management and NoSQL Databases | Practice 01: MapReduce | 20. 10. 2015 4

MapReduce: Example

Word frequency – data flow

(5)

Hadoop: Overview

Apache Hadoop

 Open-source software framework allowing to execute

applications on large clusters of commodity hardware

 Modules

‒ Hadoop Common

Hadoop Distributed File System (HDFS)

• Distributed, scalable, and portable file-system

‒ Hadoop YARN

• Job scheduling and cluster resource management

Hadoop MapReduce

• Implementation of the MapReduce programming model

(6)

Apache Hadoop Tutorial

HDFS

MapReduce

(7)

Hadoop Client

SSH and SFTP access

acheron.ms.mff.cuni.cz:20104

 Login: SIS login, lower case (e.g. mylogin)

 Password: SIS login, upper/lower case (e.g. MyLoGiN)

• Tools

PuTTY

‒ http://www.chiark.greenend.org.uk/~sgtatham/putty/

WinSCP (Windows Secure Copy)

‒ http://winscp.net/

(8)

NDBI040: Big Data Management and NoSQL Databases | Practice 01: MapReduce | 20. 10. 2015 8

Hadoop Client

• Open PuTTY connection

• Change your password

passwd

• Open WinSCP connection

• Explore directories

 /home/mylogin/

‒ Personal directory with private content

 /home/NDBI040/

‒ Shared directory with all the tutorial files

(9)

First Steps

• Get familiar with basic Hadoop commands

hadoop

‒ Help for all Hadoop commands

hadoop fs

‒ Hadoop file system commands

hadoop jar

‒ MapReduce jobs launching

(10)

NDBI040: Big Data Management and NoSQL Databases | Practice 01: MapReduce | 20. 10. 2015 10

Word Count Example

• Create your local working directory

mkdir WordCount

• Make a copy of the sample java source file

cd WordCount

cp /home/NDBI040/WordCount.java .

• Compile it

 mkdir classes

javac –classpath /usr/lib/hadoop/hadoop-common-

2.4.1.jar:/usr/lib/hadoop-mapreduce/hadoop-

mapreduce-client-core-2.4.1.jar -d classes/

WordCount.java

jar -cvf WordCount.jar -C classes/ .

(11)

Word Count Example

• Create your HDFS working directory

hadoop fs -mkdir /tmp/mylogin

• Prepare sample input data

hadoop fs -mkdir /tmp/mylogin/input1

hadoop fs -copyFromLocal

/home/NDBI040/file01 /tmp/mylogin/input1/

hadoop fs -copyFromLocal

/home/NDBI040/file02 /tmp/mylogin/input1/

• Explore the file system (using a web browser)

http://acheron.ms.mff.cuni.cz:20106

‒ (HDFS name node)

(12)

NDBI040: Big Data Management and NoSQL Databases | Practice 01: MapReduce | 20. 10. 2015 12

Word Count Example

• Run the prepared sample job

hadoop jar WordCount.jar org.myorg.WordCount

/tmp/mylogin/input1 /tmp/mylogin/output1

• Follow the execution progress (using a web browser)

http://acheron.ms.mff.cuni.cz:20105

‒ (MapReduce JobTracker)

(13)

Word Count Example

• Look at the job result

hadoop fs -copyToLocal

/tmp/mylogin/output1/part-00000 ./result.txt

• Clean the output directory

‒ At least in case you want to run the job once again

hadoop fs -rmr /tmp/mylogin/output1/

(14)

NDBI040: Big Data Management and NoSQL Databases | Practice 01: MapReduce | 20. 10. 2015 14

Bigger Example

• Shakespeare

hadoop fs -mkdir /tmp/mylogin/input2

hadoop fs -copyFromLocal

/home/NDBI040/shake.txt

/tmp/mylogin/input2/shake.txt

hadoop jar WordCount.jar org.myorg.WordCount

/tmp/mylogin/input2 /tmp/mylogin/output2

(15)

Useful Commands

• List identifiers of all jobs

mapred job –list all

• Kill a particular job

mapred job -kill <job-id>

• Print status counters of a job

mapred job -status <job-id>

(16)

NDBI040: Big Data Management and NoSQL Databases | Practice 01: MapReduce | 20. 10. 2015 16

NetBeans Project

• Launch NetBeans IDE

• Create a new project

Select Java application as project type

• Add Hadoop libraries

 Make local copies of libraries from /home/NDBI040/

‒ hadoop-common-2.4.1.jar

‒ hadoop-mapreduce-client-core-2.4.1.jar

Follow Add JAR/Folder and add both these libraries

• Build project to create jar distribution

(17)

MapReduce Source Files

• Map and reduce functions

class Map extends MapReduceBase

implements Mapper {…}

class Reduce extends MapReduceBase

implements Reducer {…}

• Key/values pairs

 output.collect(key, value);

 WritableComparable for keys

Writable for values

 Available classes: Text , IntWritable , …

(18)

NDBI040: Big Data Management and NoSQL Databases | Practice 01: MapReduce | 20. 10. 2015 18

MapReduce Source Files

Job configuration

 Job name

conf.setJobName("WordCount");

 Map, reduce and combine functions

conf.setMapperClass(Map.class);

conf.setCombinerClass(Reduce.class);

conf.setReducerClass(Reduce.class);

 Types of keys and values

conf.setOutputKeyClass(Text.class);

conf.setOutputValueClass(IntWritable.class);

 Input and output format

conf.setInputFormat(TextInputFormat.class);

conf.setOutputFormat(TextOutputFormat.class);

 Number of reducers

conf.setNumReduceTasks()‏ ;

(19)

MapReduce Source Files

• Input and output files

FileInputFormat.setInputPaths(conf, new Path(args[0]));

FileOutputFormat.setOutputPath(conf, new Path(args[1]));

• Job execution

JobClient.runJob(conf);

‒ Blocks and waits until the job is finished

JobClient.submitJob(conf);

‒ Invokes the job execution but returns immediately

‒ Poll for status to make running decisions

‒ Or provide a URI to be invoked when the job finishes

conf.setJobEndNotificationURI()

(20)

NDBI040: Big Data Management and NoSQL Databases | Practice 01: MapReduce | 20. 10. 2015 20

Local Cluster Network

• RDP (Remote Desktop Protocol) access

vdr.ms.mff.cuni.cz

 Login and password as for the client

• Tools

Remote Desktop Connection

• Cluster

Client: 10.20.86.182:22

HDFS name node: http://10.20.86.180:50070/

MapReduce JobTracker: http://10.20.86.177:8088/

(21)

References

MapReduce tutorial

 http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/

hadoop-mapreduce-client-core/MapReduceTutorial.html

Commands manual

 https://hadoop.apache.org/docs/current/hadoop-project-dist/

hadoop-common/CommandsManual.html

File system shell commands

 https://hadoop.apache.org/docs/current/hadoop-project-dist/

hadoop-common/FileSystemShell.html

JavaDoc documentation

 https://hadoop.apache.org/docs/current/api/

(22)

Assignment 1

MapReduce

(23)

Assignment 1

Select and implement any computation problem

that can be solved using MapReduce

• Store all the related information into your

working subdirectory in /home/NDBI040/

 Sample input + output data

 Java source code + compiled jar

 Readme file: topic and how-to-run description

• Send an e-mail as a confirmation

[email protected]

References

Related documents

Two components are important to achieve the multi-party risk efficiency: recalling a sense of shame through straightforward presentation for honest listening and reconsidering

Artinya bahwa pada wilayah yang dipengaruhi ENSO, akurasi model ramalan produksi padi (dengan variabel prediktor SST Nino 3.4, DMI, rasio LT/LB) akan tinggi, dan sebaliknya

The dynamic model of the grid consists of turbine governors (TG), automatic voltage regulators (AVR) as well as wind turbines, solar power units and energy storage units1.

In a nut shell, Apache Hadoop uses a programming model called MapReduce for processing and generating large data sets using parallel computing.. The MapReduce programming model

In a nut shell, Apache Hadoop uses a programming model called MapReduce for processing and generating large data sets using parallel computing.. MapReduce

counseling was feasible to implement in outpatient commu- nity-based substance abuse treatment settings, was effective in producing modest abstinence rates and strong reductions

In 2009, the Obama Administration launched the Making Home Affordable Program, aimed at helping distressed borrowers and servicers reach successful loan modifications.. Despite

At Key Stages 2, learners at Llansanffraid Primary School are given opportunities to build on the experiences gained during the Foundation Phase, and to promote their