Hadoop HP Tutorial

www.hpottech.com

Last Updated: 3rd Nov 2012

All software is in D:\Software.

Verify that VM Player is installed; if it is not, install it first.

Start your VM player and start the Hadoop VM Node master

Logon with the following credentials:

root / root123

Create the directory /hadoop:

# mkdir /hadoop

The required steps for setting up a single-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Red Hat Linux.

Red Hat Linux

Hadoop 1.0.3, released May 2012

Prerequisites

Sun Java 6 – Verify java as below.

# java -version

java version "1.6.0_20"

Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)

If Java is not installed, install it: change directory to /hadoop and run the following

# sh /mnt/hgfs/Hadoopsw/jdk-6u17-linux-i586.bin

Configuring SSH

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it.

#su - root

# ssh-keygen -t rsa -P ""

Generating public/private rsa key pair.

Enter file in which to save the key (/home/root/.ssh/id_rsa):

Created directory '/home/root/.ssh'.

Your identification has been saved in /home/root/.ssh/id_rsa.

Your public key has been saved in /home/root/.ssh/id_rsa.pub.

The key fingerprint is:

9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 root@ubuntu

The key's randomart image is:

[...snipp...]

The second line will create an RSA key pair with an empty password.

Second, you have to enable SSH access to your local machine with this newly created key.

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
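Running the append command more than once adds the same key to authorized_keys each time. The following is a minimal sketch of a repeatable variant; the function name add_key_once is hypothetical (it is not part of OpenSSH or Hadoop), and it only assumes standard grep, cat and chmod.

```shell
#!/bin/sh
# add_key_once PUBKEY_FILE AUTH_FILE
# Hypothetical helper: appends the public key to AUTH_FILE only if that exact
# line is not already present, so the step can be repeated safely.
add_key_once() {
    pub="$1"
    auth="$2"
    mkdir -p "$(dirname "$auth")"
    touch "$auth"
    chmod 600 "$auth"
    # -F: fixed strings, -x: whole-line match, -f: read the pattern from PUBKEY_FILE
    if ! grep -qxFf "$pub" "$auth"; then
        cat "$pub" >> "$auth"
    fi
}

# Usage (same files as in the step above):
#   add_key_once "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh/authorized_keys"
```

The whole-line comparison keeps a second run from duplicating the entry while still allowing other keys in the same file.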

The final step is to test the SSH setup by connecting to your local machine with the root user. The step is also needed to save your local machine's host key fingerprint to the root user's known_hosts file.

$ ssh localhost

The authenticity of host 'localhost (::1)' can't be established.

RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.

Are you sure you want to continue connecting (yes/no)? yes

Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux

Ubuntu 10.04 LTS

[...snipp...]

# ssh localhost

Start the VM and follow the following steps

Hadoop

Installation

Extract the contents of the Hadoop package to /hadoop.

$ cd /hadoop

$ tar xzf hadoop-1.0.0.tar.gz

$ mv hadoop-1.0.0 hadoop

$ chown -R hduser:hadoop hadoop (optional)

Update $HOME/.bashrc

Add the following lines to the end of the $HOME/.bashrc file of user root. If you use a shell other than bash, you should of course update its appropriate configuration files instead of .bashrc.

# Set Hadoop-related environment variables

export HADOOP_HOME=/hadoop/hadoop

(Change the directory as appropriate for your installation.)

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)

export JAVA_HOME=/hadoop/jdk1.6.0_17

# Add Hadoop bin/ directory to PATH

export PATH=$PATH:$HADOOP_HOME/bin

Hadoop Distributed File System (HDFS)

Configuration

Our goal in this tutorial is a single-node setup of Hadoop

hadoop-env.sh

The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME.

Open /hadoop/hadoop/conf/hadoop-env.sh in the editor of your choice and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.

Change

# The java implementation to use. Required.

# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to

# The java implementation to use. Required.

export JAVA_HOME=/hadoop/jdk1.6.0_17/

**conf/*-site.xml**

Now we create the directory and set the required ownerships and permissions:

Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.

In file conf/core-site.xml:

<property>

<name>hadoop.tmp.dir</name>

<value>/app/hadoop/tmp</value>

<description>A base for other temporary directories.</description>

</property>

<property>

<name>fs.default.name</name>

<value>hdfs://localhost:9000</value>

<description>The name of the default file system </description>

</property>

In file conf/mapred-site.xml:

<property>

<name>mapred.job.tracker</name>

<value>localhost:9001</value>

<description>The host and port that the MapReduce job tracker runs at.</description>

</property>

In file conf/hdfs-site.xml:

<property>

<name>dfs.replication</name>

<value>1</value>

<description>Default block replication. </description>

</property>

Formatting the HDFS filesystem via the NameNode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your "cluster".

Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS).

To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command

hduser@ubuntu:~$ /hadoop/hadoop/bin/hadoop namenode -format

The output will look like this:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format

10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:

/************************************************************

STARTUP_MSG: Starting NameNode

STARTUP_MSG: host = ubuntu/127.0.1.1

STARTUP_MSG: args = [-format]

STARTUP_MSG: version = 0.20.2

STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010

************************************************************/

10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop

10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup

10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true

10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.

10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.

10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1

************************************************************/

hduser@ubuntu:/usr/local/hadoop$

Starting your single-node cluster

Run the command:

# /hadoop/hadoop/bin/start-all.sh

This will start up a NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker on your machine.

The output will look like this:

hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh

starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out

localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out

localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out

starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out

localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out

hduser@ubuntu:/usr/local/hadoop$

A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun’s Java since v1.5.0).

hduser@ubuntu:$ jps

(If jps is not found, go to /Software/JDK/bin and execute ./jps.)

2287 TaskTracker

2149 JobTracker

1938 DataNode

2085 SecondaryNameNode

2349 Jps

1788 NameNode
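Reading the jps listing by eye is easy to get wrong when one daemon has silently died. As a rough sketch, the check can be scripted; missing_daemons is a hypothetical helper name, and the list of five daemons is taken from the output shown here for a Hadoop 1.x single-node setup.

```shell
#!/bin/sh
# missing_daemons: reads `jps` output on stdin and prints the name of every
# expected Hadoop 1.x daemon that is NOT listed. Prints nothing if all five run.
missing_daemons() {
    listing=$(cat)
    for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # jps prints "<pid> <name>"; compare the second field exactly so that
        # NameNode does not match SecondaryNameNode
        if ! printf '%s\n' "$listing" | awk -v n="$d" '$2 == n {found=1} END {exit !found}'; then
            echo "$d"
        fi
    done
}

# Usage:
#   jps | missing_daemons
```

An empty result means all expected processes are up; any printed name points to the log file to inspect first.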

You can also check with netstat if Hadoop is listening on the configured ports.

hduser@ubuntu:~$ netstat -plten | grep java

tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 1001 9236 2471/java

tcp 0 0 0.0.0.0:50010 0.0.0.0:* LISTEN 1001 9998 2628/java

tcp 0 0 0.0.0.0:48159 0.0.0.0:* LISTEN 1001 8496 2628/java

tcp 0 0 0.0.0.0:53121 0.0.0.0:* LISTEN 1001 9228 2857/java

tcp 0 0 127.0.0.1:54310 0.0.0.0:* LISTEN 1001 8143 2471/java

tcp 0 0 127.0.0.1:54311 0.0.0.0:* LISTEN 1001 9230 2857/java

tcp 0 0 0.0.0.0:59305 0.0.0.0:* LISTEN 1001 8141 2471/java

tcp 0 0 0.0.0.0:50060 0.0.0.0:* LISTEN 1001 9857 3005/java

tcp 0 0 0.0.0.0:49900 0.0.0.0:* LISTEN 1001 9037 2785/java

tcp 0 0 0.0.0.0:50030 0.0.0.0:* LISTEN 1001 9773 2857/java

hduser@ubuntu:~$

If there are any errors, examine the log files in the /logs/ directory.

Stopping your single-node cluster

Run the command

# /hadoop/hadoop/bin/stop-all.sh

to stop all the daemons running on your machine.

Example output:

hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh

stopping jobtracker

localhost: stopping tasktracker

stopping namenode

localhost: stopping datanode

Lab objectives

In this lab you will practice HDFS command line interface.

Lab instructions

This lab has been developed as a tutorial. Simply execute the commands provided, and analyze the results.

Basic Hadoop Filesystem commands

1. To work with HDFS you use the hadoop fs command. For example, to list the / and /app directories, enter the following commands:

> hadoop fs -ls /
> hadoop fs -ls /app

> hadoop fs -mkdir test

Now let's see the directory we've created:

> hadoop fs -ls /
> hadoop fs -ls /user/root

You will notice that the test directory got created under the /user/root directory. This is because as the root user, your default path is /user/root and thus if you don't specify an absolute path all HDFS commands work out of /user/root (this will be your default working directory).

3. You can pipe (using the | character) the output of any HDFS command into Linux shell tools. For example, you can easily use grep with HDFS by doing the following:

> hadoop fs -mkdir /user/root/test2
> hadoop fs -ls /user/root | grep test

As you can see, the grep command only returned the lines which had test in them (thus removing the "Found x items" line and the oozie-root directory from the listing).

Copy pg20417.txt from the software folder to /hadoop, then put it into HDFS:

> hadoop fs -put /hadoop/pg20417.txt pg20417.txt
> hadoop fs -ls /user/root

You should now see a new file called /user/root/pg20417.txt listed. In order to view the contents of this file we will use the -cat command as follows:

> hadoop fs -cat pg20417.txt

You should see the contents of the pg20417.txt file (that is stored in HDFS). We can also use the Linux diff command to see if the file we put on HDFS is actually the same as the original on the local filesystem. You can do this as follows:

> diff <( hadoop fs -cat pg20417.txt) /hadoop/pg20417.txt

Since the diff command produces no output we know that the files are the same (the diff command prints all the lines in the files that differ).

Some more Hadoop Filesystem commands

1. To run an HDFS command recursively, you generally add an "r" to the command (in the Linux shell this is generally done with the "-R" argument). For example, to do a recursive listing we'll use the -lsr command rather than just -ls. Try this:

> hadoop fs -lsr /user

2. In order to find the size of files you need to use the -du or -dus commands. Keep in mind that these commands return the file size in bytes. To find the size of the pg20417.txt file use the following command:

> hadoop fs -du pg20417.txt

To find the size of all files individually in the /user/root directory use the following command:

> hadoop fs -du /user/root

To find the size of all files in total of the /user/root directory use the following command:

> hadoop fs -dus /user/root
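Because -du and -dus report sizes in bytes, larger numbers are hard to read at a glance. The small sketch below converts a byte count to mebibytes; bytes_to_mib is a hypothetical helper, not a Hadoop command, and the size is passed in by hand rather than parsed from the command output.

```shell
#!/bin/sh
# bytes_to_mib BYTES
# Hypothetical helper: converts a byte count (as printed by -du / -dus)
# to mebibytes with two decimals.
bytes_to_mib() {
    # LC_ALL=C keeps the decimal point independent of the system locale
    LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.2f MiB\n", b / (1024 * 1024) }'
}

# Usage: pass the size column from the -du or -dus output, e.g.
#   bytes_to_mib 1048576
```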

3. If you would like to get more information about a given command, invoke -help as follows:

> hadoop fs -help

For example, to get help on the dus command you'd do the following:

> hadoop fs -help dus

--- This is the end of this lab ---

System -> Network Config -> DNS –

Hostname - Activate

/etc/sysconfig/network

bin/hadoop dfsadmin -report

bin/hadoop dfsadmin -metasave hadoop.txt

This will save some of the NameNode's metadata into its log directory under the given filename (hadoop.txt here). In this metadata, you'll find lists of blocks waiting for replication, blocks being replicated, and blocks awaiting deletion. For replication, each block will also have a list of DataNodes being replicated to. Finally, the metasave file will also have summary statistics on each DataNode.

Go to log folder:

# cd /hadoop/hadoop/logs

# ls

hadoop dfsadmin -safemode get

hadoop dfsadmin -safemode enter

hadoop fsck /

Installation

To run the fair scheduler in your Hadoop installation, you need to put it on the CLASSPATH.

Copy hadoop-fairscheduler-2.0.0-mr1-cdh4.1.0.jar from /hadoop/hadoop-2.0.0-mr1-cdh4.1.0/contrib/fairscheduler to $HADOOP_HOME/lib using the following command:

cp hadoop-fairscheduler-2.0.0-mr1-cdh4.1.0.jar $HADOOP_HOME/lib

Edit HADOOP_CONF_DIR/mapred-site.xml to have Hadoop use the fair scheduler:

<property>

<name>mapred.jobtracker.taskScheduler</name>

<value>org.apache.hadoop.mapred.FairScheduler</value>

</property>

Once you restart the cluster, you can check that the fair scheduler is running by going to http://<jobtracker URL>/scheduler on the JobTracker's web UI.

http://192.168.1.5:50030/scheduler

Run the MapReduce job from two telnet windows and observe the display at the link above.

Create two different folders for input and output.

1) Verifying File System Health

a. bin/hadoop fsck /

Hadoop Web Interfaces

Hadoop comes with several web interfaces which are by default (see conf/core-site.xml) available at these locations:

http://192.168.80.133:50030/ – web UI for MapReduce job tracker(s)

http://192.168.80.133:50060/ – web UI for task tracker(s)

http://192.168.80.133:50070/ – web UI for HDFS name node(s)

These web interfaces provide concise information about what's happening in your Hadoop cluster. You might want to give them a try.

MapReduce Job Tracker Web Interface

The job tracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the "local machine's" Hadoop log files (the machine on which the web UI is running).

By default, it's available at http://localhost:50030/.

A screenshot of Hadoop's Job Tracker web interface.

Task Tracker Web Interface

The task tracker web UI shows you running and non-running tasks. It also gives access to the "local machine's" Hadoop log files.

By default, it's available at http://localhost:50060/.

A screenshot of Hadoop's Task Tracker web interface.

HDFS Name Node Web Interface

The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the "local machine's" Hadoop log files.

A screenshot of Hadoop's Name Node web interface.

Start the Hadoop VM Master

Base Directory:- /hadoop

Java Installation

1. Java 1.6.x (from Sun) is installed on /usr/jre16.

Pig Installation

#Set the PIG_HOME environment variable (vi $HOME/.bashrc)

#export PIG_HOME=/hadoop/pig-0.10.0

Start the hadoop cluster.

#bash

$ cd /hadoop/pig-0.10.0/

$ bin/pig -x local

Enter the following commands in the Grunt shell:

log = LOAD '/hadoop/pig-0.10.0/tutorial/data/excite-small.log' AS (user, timestamp, query);

grpd = GROUP log BY user;

cntd = FOREACH grpd GENERATE group, COUNT(log);

STORE cntd INTO 'output';

# quit

file:///hadoop/pig-0.10.0/tutorial/data/output

1 Install: HBase

1) Get the HBase software from /software.

2) Untar the software in /hadoop

$ tar xfz hbase-0.92.1.tar.gz -C /hadoop
$ cd /hadoop/hbase-0.92.1

3) Edit conf/hbase-site.xml and set the directory for HBase to write to

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///hadoop/hbase</value>
  </property>
</configuration>

2 Edit conf/hbase-env.sh, uncommenting the JAVA_HOME line and pointing it to your Java install.

3 Start hadoop

Start HBase

$ ./bin/start-hbase.sh

starting Master, logging to logs/hbase-user-master-example.org.out

Shell Exercises

Connect to your running HBase via the shell.

$ ./bin/hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands. Type "exit<RETURN>" to leave the HBase Shell

Version: 0.90.0, r1001068, Fri Sep 24 13:55:42 PDT 2010

4 Create a table named test with a single column family named cf. Verify its creation by listing all tables and then insert some values.

hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds

hbase(main):003:0> list 'test'
..
1 row(s) in 0.0550 seconds

hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds

hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds

hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds

5 Above we inserted 3 values, one at a time. The first insert is at row1, column cf:a with a value of value1. Columns in HBase are comprised of a column family prefix (cf in this example) followed by a colon and then a column qualifier.

6 Verify the data insert.

Run a scan of the table by doing the following

hbase(main):007:0> scan 'test'
ROW        COLUMN+CELL
row1       column=cf:a, timestamp=1288380727188, value=value1
row2       column=cf:b, timestamp=1288380738440, value=value2
row3       column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds

7 Get a single row as follows

hbase(main):008:0> get 'test', 'row1'
COLUMN     CELL
cf:a       timestamp=1288380727188, value=value1
1 row(s) in 0.0400 seconds

8 Now, disable and drop your table. This will clean up everything done above.

hbase(main):012:0> disable 'test'
0 row(s) in 1.0930 seconds

hbase(main):013:0> drop 'test'
0 row(s) in 0.0770 seconds

Exit the shell by typing exit.

hbase(main):014:0> exit

Stopping HBase

Stop your hbase instance by running the stop script.

$ ./bin/stop-hbase.sh

Installing Hive

Set the environment variable HIVE_HOME in $HOME/.bashrc to point to the installation directory:

$ export HIVE_HOME=/hadoop/hive-0.9.0

Finally, add $HIVE_HOME/bin to your PATH:

Running Hive

Hive uses Hadoop, which means:

• you must have hadoop in your path OR

• export HADOOP_HOME=<hadoop-install-dir>

In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) in HDFS and make them group-writable (chmod g+w) before a table can be created in Hive. Start the Hadoop service first.

Commands to perform this setup

$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse

$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp

$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
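The four setup commands can be wrapped in a small loop so they are easy to repeat on another machine. This is only a sketch: prepare_hive_dirs and the HADOOP_CMD override are hypothetical names introduced for illustration, and the function issues mkdir and chmod per directory rather than in the order listed above, which has the same effect.

```shell
#!/bin/sh
# prepare_hive_dirs: creates the HDFS directories Hive needs and makes them
# group-writable. HADOOP_CMD is an assumed override hook; by default the
# function calls $HADOOP_HOME/bin/hadoop as in the commands above.
prepare_hive_dirs() {
    hadoop_cmd="${HADOOP_CMD:-$HADOOP_HOME/bin/hadoop}"
    for dir in /tmp /user/hive/warehouse; do
        $hadoop_cmd fs -mkdir "$dir" || echo "mkdir failed for $dir (it may already exist)" >&2
        $hadoop_cmd fs -chmod g+w "$dir"
    done
}

# Usage:
#   prepare_hive_dirs
# Dry run that only prints the arguments instead of calling hadoop:
#   HADOOP_CMD=echo prepare_hive_dirs
```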

Error log: /tmp/<user.name>/hive.log

Type hive:

#bash
#hive

Creates a table called pokes with two columns, the first being an integer and the other a string

hive> SHOW TABLES;

hive> DESCRIBE invites;

hive> LOAD DATA LOCAL INPATH '/hadoop/hive-0.9.0/examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

#hadoop fs -cat /tmp/hdfs_out/000000_0

Dropping tables:

hive> DROP TABLE invites;

Start Hadoop cluster – start-all.sh

Start hive.

copy extend1502.log to /hadoop

Create table from the Hive console.

hive > CREATE TABLE weblog (mdate STRING , mtime STRING, ssitename STRING, scomputername STRING, sip STRING, csmethod STRING,

csuristem STRING, suriquery STRING, sport STRING, csusername STRING, cip STRING, csversion STRING, csUserAgent STRING, csCookie STRING,

csReferer STRING, cshost STRING, scstatus STRING, scsubstatus STRING, scwin32status STRING, scbytes STRING, csbytes STRING, timetaken

STRING)

hive> SET hive.enforce.bucketing = true;

package com.hp.types;

// == JobBuilder
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;

public class JobBuilder {

  private final Class<?> driverClass;
  private final Job job;
  private final int extraArgCount;
  private final String extrArgsUsage;

  private String[] extraArgs;

  public JobBuilder(Class<?> driverClass) throws IOException {
    this(driverClass, 0, "");
  }

  public JobBuilder(Class<?> driverClass, int extraArgCount, String extrArgsUsage) throws IOException {
    this.driverClass = driverClass;
    this.extraArgCount = extraArgCount;
    this.job = new Job();
    this.job.setJarByClass(driverClass);
    this.extrArgsUsage = extrArgsUsage;
  }

  // vv JobBuilder
  // Builds a Job from exactly two arguments: <input path> <output path>.
  public static Job parseInputAndOutput(Tool tool, Configuration conf, String[] args) throws IOException {
    if (args.length != 2) {
      printUsage(tool, "<input> <output>");
      return null;
    }
    Job job = new Job(conf);
    job.setJarByClass(tool.getClass());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job;
  }

  public static void printUsage(Tool tool, String extraArgsUsage) {
    System.err.printf("Usage: %s [genericOptions] %s\n\n",
        tool.getClass().getSimpleName(), extraArgsUsage);
    GenericOptionsParser.printGenericCommandUsage(System.err);
  }
  // ^^ JobBuilder

  public JobBuilder withCommandLineArgs(String... args) throws IOException {
    Configuration conf = job.getConfiguration();
    GenericOptionsParser parser = new GenericOptionsParser(conf, args);
    String[] otherArgs = parser.getRemainingArgs();
    // Reject too few or too many arguments (the original "&&" could never be true)
    if (otherArgs.length < 2 || otherArgs.length > 3 + extraArgCount) {
      System.err.printf("Usage: %s [genericOptions] [-overwrite] <input path> <output path> %s\n\n",
          driverClass.getSimpleName(), extrArgsUsage);
      GenericOptionsParser.printGenericCommandUsage(System.err);
      System.exit(-1);
    }
    int index = 0;
    boolean overwrite = false;
    if (otherArgs[index].equals("-overwrite")) {
      overwrite = true;
      index++;
    }
    Path input = new Path(otherArgs[index++]);
    Path output = new Path(otherArgs[index++]);

    // Keep any remaining arguments for the caller
    if (index < otherArgs.length) {
      extraArgs = new String[otherArgs.length - index];
      System.arraycopy(otherArgs, index, extraArgs, 0, otherArgs.length - index);
    }

    if (overwrite) {
      output.getFileSystem(conf).delete(output, true);
    }

    FileInputFormat.addInputPath(job, input);
    FileOutputFormat.setOutputPath(job, output);
    return this;
  }

  public Job build() {
    return job;
  }

  public String[] getExtraArgs() {
    return extraArgs;
  }
}

package com.hp.types;

// cc SmallFilesToSequenceFileConverter A MapReduce program for packaging a collection of small files as a single SequenceFile
import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

//vv SmallFilesToSequenceFileConverter
public class SmallFilesToSequenceFileConverter extends Configured implements Tool {

  static class SequenceFileMapper
      extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {

    private Text filenameKey;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      // Use the path of the file behind this split as the output key
      InputSplit split = context.getInputSplit();
      Path path = ((FileSplit) split).getPath();
      filenameKey = new Text(path.toString());
    }

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
        throws IOException, InterruptedException {
      // Emit the file name as key and the whole file content as value
      context.write(filenameKey, value);
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }
    job.setInputFormatClass(WholeFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(BytesWritable.class);
    job.setMapperClass(SequenceFileMapper.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // Hard-coded paths for a local run; remove for cluster submission
    args = new String[2];
    args[0] = "input";
    args[1] = "output" + System.currentTimeMillis();
    int exitCode = ToolRunner.run(new SmallFilesToSequenceFileConverter(), args);
    System.exit(exitCode);
  }
}

// cc WholeFileInputFormat An InputFormat for reading a whole file as a record
import java.io.IOException;

import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.*;

//vv WholeFileInputFormat
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // Never split: each file must reach the mapper as one record
    return false;
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    WholeFileRecordReader reader = new WholeFileRecordReader();
    reader.initialize(split, context);
    return reader;
  }
}

// cc WholeFileRecordReader The RecordReader used by WholeFileInputFormat for reading a whole file as a record
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

//vv WholeFileRecordReader
class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

  private FileSplit fileSplit;
  private Configuration conf;
  private BytesWritable value = new BytesWritable();
  private boolean processed = false;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    this.fileSplit = (FileSplit) split;
    this.conf = context.getConfiguration();
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!processed) {
      // Read the entire file into one value; only one record per file
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }
    return false;
  }

  @Override
  public NullWritable getCurrentKey() throws IOException, InterruptedException {
    return NullWritable.get();
  }

  @Override
  public BytesWritable getCurrentValue() throws IOException, InterruptedException {
    return value;
  }

  @Override
  public float getProgress() throws IOException {
    return processed ? 1.0f : 0.0f;
  }

  @Override
  public void close() throws IOException {
    // do nothing
  }
}

Copy the following files:

Submit the jobs to cluster

Comment out the path initialization as follows:

/* args = new String[2]; args[0]="input";

args[1]="output"+System.currentTimeMillis();*/

Export the jar and run the following command

#hadoop fs -mkdir smallfiles/

#hadoop fs -copyFromLocal /hadoop/data/a smallfiles/

#hadoop fs -copyFromLocal /hadoop/data/b smallfiles/

#hadoop fs -copyFromLocal /hadoop/data/c smallfiles/

#hadoop fs -copyFromLocal /hadoop/data/d smallfiles/

#hadoop fs -copyFromLocal /hadoop/data/e smallfiles/

#hadoop fs -copyFromLocal /hadoop/data/f smallfiles/
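The six copy commands differ only in the file name, so they can be expressed as a loop. This is a sketch under stated assumptions: upload_small_files and the HADOOP_CMD override are hypothetical names, and the file names a to f and the /hadoop/data path are the ones used above.

```shell
#!/bin/sh
# upload_small_files: copies the local files a..f into the HDFS directory
# smallfiles/ one by one. HADOOP_CMD is an assumed override hook for dry runs;
# by default the function calls the hadoop command found on the PATH.
upload_small_files() {
    hadoop_cmd="${HADOOP_CMD:-hadoop}"
    $hadoop_cmd fs -mkdir smallfiles/
    for name in a b c d e f; do
        $hadoop_cmd fs -copyFromLocal "/hadoop/data/$name" smallfiles/
    done
}

# Usage:
#   upload_small_files
# Dry run that only prints the arguments instead of calling hadoop:
#   HADOOP_CMD=echo upload_small_files
```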

From two single-node clusters to a multi-node cluster

We will build a multi-node cluster using two Red Hat boxes in this tutorial. The best way to do this for starters is to install, configure and test a "local" Hadoop setup for each of the two RH boxes, and in a second step to "merge" these two single-node clusters into one multi-node cluster in which one RH box will become the designated master (but also act as a slave with regard to data storage and processing), and the other box will become only a slave. It's much easier to track down any problems you might encounter due to the reduced complexity of doing a single-node cluster setup first on each machine.

Prerequisites

Configure single-node clusters first on both VMs, using the earlier tutorial.

Now that you have two single-node clusters up and running, we will modify the Hadoop configuration to make one RH box the "master" (which will also act as a slave) and the other RH box a "slave".

We will call the designated master machine just the master from now on and the slave-only machine the slave. We will also give the two machines these respective hostnames in their networking setup, most notably in /etc/hosts. If the hostnames of your machines are different (e.g. node01) then you must adapt the settings in this tutorial as appropriate.

Start both VMs.

Change the new VM's host name:

System -> Administration -> Network -> DNS – hadoopslave

Verify the IP address as below:

Networking

Both machines must be able to reach each other over the network.

Update /etc/hosts on both machines with the following lines:

# vi /etc/hosts (for master AND slave)

10.72.47.42 master
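Editing /etc/hosts by hand on two machines invites duplicate or inconsistent entries. The sketch below adds an entry only when the host name is not mapped yet; add_host_entry is a hypothetical helper, and the optional third argument exists only so the function can be tried on a scratch file before touching /etc/hosts.

```shell
#!/bin/sh
# add_host_entry IP NAME [HOSTS_FILE]
# Hypothetical helper: appends "IP NAME" to the hosts file unless a
# non-comment line already maps NAME. HOSTS_FILE defaults to /etc/hosts.
add_host_entry() {
    ip="$1"
    name="$2"
    hosts="${3:-/etc/hosts}"
    # Look for NAME as a whole field after the address column; skip comments
    if awk -v n="$name" '
        /^[ \t]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$hosts"; then
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$hosts"
}

# Usage (run as root on both machines, with the address shown above):
#   add_host_entry 10.72.47.42 master
```

The function does not change an existing mapping, so a stale address still has to be corrected by hand.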

SSH access

The root user on the master must be able to connect

a) to its own user account on the master (i.e. ssh master in this context, and not necessarily ssh localhost), and

b) to the root user account on the slave via a password-less SSH login.

You have to add root@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of root@slave (in this user's $HOME/.ssh/authorized_keys).

You can do this manually or use the following SSH command:

$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub root@slave

This command will prompt you for the login password for user root on slave, then copy the public SSH key for you, creating the correct directory and fixing the permissions as necessary.

The final step is to test the SSH setup by connecting as user root from the master to the user account root on the slave. This step is also needed to save slave’s host key fingerprint to root@master’s known_hosts file.

So, connecting from master to master…

$ ssh master

And from master to slave:

$ ssh slave

Hadoop

Cluster Overview

The master node will run the “master” daemons for each layer: NameNode for the HDFS storage layer, and JobTracker for the MapReduce processing layer. Both machines will run the “slave” daemons: DataNode for the HDFS layer, and TaskTracker for the MapReduce processing layer. Basically, the “master” daemons are responsible for coordination and management of the “slave” daemons while the latter will do the actual data storage and data processing work.

Masters vs. Slaves

Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively. These are the actual “master nodes”. The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the slaves or “worker nodes”.

Configuration

conf/masters (master only)

On master, update conf/masters so that it looks like this:

conf/slaves (master only)

On master, update conf/slaves so that it looks like this:

master

slave

conf/*-site.xml (all machines)

Note: As of Hadoop 0.20.x and 1.x, the configuration settings previously found in hadoop-site.xml were moved to conf/core-site.xml (fs.default.name), conf/mapred-site.xml (mapred.job.tracker) and conf/hdfs-site.xml (dfs.replication).

Assuming you configured each machine as described in the single-node cluster tutorial, you will only have to change a few variables.

Important: You have to change the configuration files conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml on ALL machines as follows.

First, we have to change the fs.default.name variable (in conf/core-site.xml) which specifies the NameNode (the HDFS master) host and port. In our case, this is the master machine.

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
  <description>The name of the default file system.</description>
</property>
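To double-check what this value means, the short Java sketch below splits an hdfs:// address into host and port using only java.net.URI. It is an illustration and not part of Hadoop; the class name FsDefaultNameCheck and the method hostAndPort are made up for this example.

```java
import java.net.URI;

// Hypothetical helper, not a Hadoop class: checks the shape of an fs.default.name value.
class FsDefaultNameCheck {

    /** Returns "host:port" for a filesystem URI such as hdfs://master:9000. */
    static String hostAndPort(String fsDefaultName) {
        URI uri = URI.create(fsDefaultName);
        // Require the hdfs scheme, a host name and an explicit port
        if (!"hdfs".equals(uri.getScheme()) || uri.getHost() == null || uri.getPort() == -1) {
            throw new IllegalArgumentException("expected hdfs://host:port but got " + fsDefaultName);
        }
        return uri.getHost() + ":" + uri.getPort();
    }

    public static void main(String[] args) {
        // The value used in this tutorial
        System.out.println(hostAndPort("hdfs://master:9000"));
    }
}
```

Running it prints master:9000. A value without a port, such as hdfs://master, is rejected by this sketch with an IllegalArgumentException.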

Second, we have to change the mapred.job.tracker variable (in conf/mapred-site.xml) which specifies the JobTracker (MapReduce master) host and port. Again, this is the master in our case.

<property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
</property>

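Third, we have to set the following properties in conf/hdfs-site.xml: dfs.name.dir and dfs.data.dir, which specify where the NameNode and the DataNodes keep their data on the local filesystem, and dfs.replication, the default block replication. With two DataNodes (master and slave), a value of 2 keeps one copy of each block on each machine.

<property>
  <name>dfs.name.dir</name>
  <value>/hadoop/hdfs/name</value>
  <final>true</final>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/hadoop/hdfs/data</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.</description>
</property>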

Create the necessary folder structure on both nodes:

# mkdir -p /hadoop/hdfs/name

# mkdir -p /hadoop/hdfs/data

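Formatting the HDFS filesystem via the NameNode

Before starting the cluster for the first time, format the HDFS filesystem on master. Formatting erases any data already stored in HDFS, so do it only once, when setting up the cluster:

# hadoop namenode -format

Background: The HDFS name table is stored on the NameNode’s (here: master) local filesystem in the directory specified by dfs.name.dir. The name table is used by the NameNode to store tracking and coordination information for the DataNodes.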

Starting the multi-node cluster

HDFS daemons

Run the command bin/start-dfs.sh on the machine you want the (primary) NameNode to run on. This will bring up HDFS with the NameNode running on the machine you ran the command on, and DataNodes on the machines listed in the conf/slaves file.

In our case, we will run bin/start-dfs.sh on master:

On slave, you can examine the success or failure of this command by inspecting the DataNode log file in the logs/ directory.

As slave’s log output shows, the DataNode automatically formats its storage directory (specified by dfs.data.dir) if it is not formatted already. It also creates the directory if it does not exist yet.

At this point, the following Java processes should run on master…

…and the following on slave.

MapReduce daemons

Run the command bin/start-mapred.sh on the machine you want the JobTracker to run on. This will bring up the MapReduce cluster with the JobTracker running on the machine you ran the command on, and TaskTrackers on the machines listed in the conf/slaves file.

In our case, we will run bin/start-mapred.sh on master:

On slave, you can examine the success or failure of this command by inspecting the log file logs/hadoop-root-tasktracker-hadoopslave.log.

At this point, the following Java processes should run on master…

$ jps

…and the following on slave.

Stopping the multi-node cluster

Like starting the cluster, stopping it is done in two steps. The workflow is the opposite of starting, however. First, we stop the MapReduce daemons: the JobTracker is stopped on master, and TaskTracker daemons are stopped on all slaves (here: master and slave). Second, we stop the HDFS daemons: the NameNode daemon is stopped on master, and DataNode daemons are stopped on all slaves (here: master and slave).

MapReduce daemons

Run the command bin/stop-mapred.sh on the JobTracker machine. This will shut down the MapReduce cluster by stopping the JobTracker daemon running on the machine you ran the command on, and TaskTrackers on the machines listed in the conf/slaves file.

In our case, we will run bin/stop-mapred.sh on master:

(Note: The output above might suggest that the JobTracker was running and stopped on slave, but you can be assured that the JobTracker ran on master.)

At this point, the following Java processes should run on master…

…and the following on slave.

$ jps

HDFS daemons

Run the command bin/stop-dfs.sh on the NameNode machine. This will shut down HDFS by stopping the NameNode daemon running on the machine you ran the command on, and DataNodes on the machines listed in the conf/slaves file.

In our case, we will run bin/stop-dfs.sh on master:

At this point, only the following Java processes should run on master…

$ jps

…and the following on slave.

Running a MapReduce job

Just follow the steps described in the section Running a MapReduce job of the single-node cluster tutorial. Copy the input data into HDFS before running the job.

Here is example output on master…

…and on slave for its datanode…

…and on slave for its tasktracker.

# from logs/hadoop-root-tasktracker-hadoopslave.log on slave

package com.hp.join;

// == JobBuilder
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;

public class JobBuilder {

  private final Class<?> driverClass;
  private final Job job;
  private final int extraArgCount;
  private final String extrArgsUsage;

  private String[] extraArgs;

  public JobBuilder(Class<?> driverClass) throws IOException {
    this(driverClass, 0, "");
  }

  public JobBuilder(Class<?> driverClass, int extraArgCount, String extrArgsUsage) throws IOException {
    this.driverClass = driverClass;
    this.extraArgCount = extraArgCount;
    this.job = new Job();
    this.job.setJarByClass(driverClass);
    this.extrArgsUsage = extrArgsUsage;
  }

  // vv JobBuilder
  public static Job parseInputAndOutput(Tool tool, Configuration conf, String[] args) throws IOException {
    if (args.length != 2) {
      printUsage(tool, "<input> <output>");
      return null;
    }
    Job job = new Job(conf);
    job.setJarByClass(tool.getClass());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job;
  }

  public static void printUsage(Tool tool, String extraArgsUsage) {
    System.err.printf("Usage: %s [genericOptions] %s\n\n",
        tool.getClass().getSimpleName(), extraArgsUsage);
    GenericOptionsParser.printGenericCommandUsage(System.err);
  }
  // ^^ JobBuilder

  public JobBuilder withCommandLineArgs(String... args) throws IOException {
    Configuration conf = job.getConfiguration();
    GenericOptionsParser parser = new GenericOptionsParser(conf, args);
    String[] otherArgs = parser.getRemainingArgs();
    // Reject too few OR too many arguments (with && the check could never be true)
    if (otherArgs.length < 2 || otherArgs.length > 3 + extraArgCount) {
      System.err.printf("Usage: %s [genericOptions] [-overwrite] <input path> <output path> %s\n\n",
          driverClass.getSimpleName(), extrArgsUsage);
      GenericOptionsParser.printGenericCommandUsage(System.err);
      System.exit(-1);
    }
    int index = 0;
    boolean overwrite = false;
    if (otherArgs[index].equals("-overwrite")) {
      overwrite = true;
      index++;
    }
    Path input = new Path(otherArgs[index++]);
    Path output = new Path(otherArgs[index++]);

    // Anything after the input and output paths is kept as extra arguments
    if (index < otherArgs.length) {
      extraArgs = new String[otherArgs.length - index];
      System.arraycopy(otherArgs, index, extraArgs, 0, otherArgs.length - index);
    }

    // Delete an existing output directory only when -overwrite was given
    if (overwrite) {
      output.getFileSystem(conf).delete(output, true);
    }

    FileInputFormat.addInputPath(job, input);
    FileOutputFormat.setOutputPath(job, output);
    return this;
  }

  public Job build() {
    return job;
  }

  public String[] getExtraArgs() {
    return extraArgs;
  }
}

package com.hp.join;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class JoinRecordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, TextPair, Text> {

  private NcdcRecordParser parser = new NcdcRecordParser();

  public void map(LongWritable key, Text value,
      OutputCollector<TextPair, Text> output, Reporter reporter) throws IOException {
    parser.parse(value);
    // Tag weather records with "1" so they sort after the station record (tagged "0")
    output.collect(new TextPair(parser.getStationId(), "1"), value);
  }
}

package com.hp.join;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleInputs;
import org.apache.hadoop.util.*;

@SuppressWarnings("deprecation")
public class JoinRecordWithStationName extends Configured implements Tool {

  // Partitions on the station id only, so both record types for a station reach the same reducer
  public static class KeyPartitioner implements Partitioner<TextPair, Text> {
    @Override
    public void configure(JobConf job) {}

    @Override
    public int getPartition(TextPair key, Text value, int numPartitions) {
      return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 3) {
      JobBuilder.printUsage(this, "<ncdc input> <station input> <output>");
      return -1;
    }

    JobConf conf = new JobConf(getConf(), getClass());
    conf.setJobName("Join record with station name");

    Path ncdcInputPath = new Path(args[0]);
    Path stationInputPath = new Path(args[1]);
    Path outputPath = new Path(args[2]);

    MultipleInputs.addInputPath(conf, ncdcInputPath, TextInputFormat.class, JoinRecordMapper.class);
    MultipleInputs.addInputPath(conf, stationInputPath, TextInputFormat.class, JoinStationMapper.class);
    FileOutputFormat.setOutputPath(conf, outputPath);

    conf.setPartitionerClass(KeyPartitioner.class);
    conf.setOutputValueGroupingComparator(TextPair.FirstComparator.class);
    conf.setMapOutputKeyClass(TextPair.class);
    conf.setReducerClass(JoinReducer.class);
    conf.setOutputKeyClass(Text.class);

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // The command-line arguments are replaced with fixed HDFS paths for this exercise
    args = new String[3];
    args[0] = "inputncdc";
    args[1] = "inputstation";
    args[2] = "output" + System.currentTimeMillis();
    int exitCode = ToolRunner.run(new JoinRecordWithStationName(), args);
    System.exit(exitCode);
  }
}
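The getPartition method above masks the hash code with Integer.MAX_VALUE before taking the remainder. The following standalone sketch (plain Java, no Hadoop classes; PartitionSketch and its methods are hypothetical names, and the station id is just an example value) shows why: the mask clears the sign bit, so the result always falls between 0 and numPartitions - 1. Because only the first part of the key is hashed, records tagged "0" and "1" for the same station go to the same reducer.

```java
// Hypothetical illustration of the arithmetic used by KeyPartitioner.getPartition.
class PartitionSketch {

    /** Same arithmetic as KeyPartitioner.getPartition, applied to a raw hash code. */
    static int partitionForHash(int hash, int numPartitions) {
        // Clearing the sign bit keeps the remainder non-negative
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    /** Partition for a station id; the tag ("0" or "1") is deliberately not hashed. */
    static int partitionFor(String stationId, int numPartitions) {
        return partitionForHash(stationId.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int reducers = 3;
        // A plain remainder is negative for a negative hash code
        System.out.println("plain remainder:  " + (Integer.MIN_VALUE % reducers));
        System.out.println("masked remainder: " + partitionForHash(Integer.MIN_VALUE, reducers));
        // Example station id in the form built by NcdcRecordParser
        System.out.println("station partition: " + partitionFor("011990-99999", reducers));
    }
}
```

A negative partition number would not be valid, which is why a plain hashCode() % numPartitions is not enough here.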

package com.hp.join;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class JoinReducer extends MapReduceBase
    implements Reducer<TextPair, Text, Text, Text> {

  public void reduce(TextPair key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // The first value is the station name (tag "0" sorts first); copy it before reading further values
    Text stationName = new Text(values.next());
    while (values.hasNext()) {
      Text record = values.next();
      Text outValue = new Text(stationName.toString() + "\t" + record.toString());
      output.collect(key.getFirst(), outValue);
    }
  }
}
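JoinReducer relies on the station record arriving before the weather records of the same station. The plain-Java sketch below imitates that flow without Hadoop, as one way to see the idea: records are sorted by the composite key (station id, tag) and then grouped by station id only, so the value tagged "0" comes first in each group. The names ReduceSideJoinSketch, Tagged and join are hypothetical, and the sample data is invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory imitation of the sort, group and reduce steps of the join.
class ReduceSideJoinSketch {

    /** One map output record: composite key (stationId, tag) plus a value. */
    static final class Tagged {
        final String stationId;
        final String tag;
        final String value;

        Tagged(String stationId, String tag, String value) {
            this.stationId = stationId;
            this.tag = tag;
            this.value = value;
        }
    }

    static List<String> join(List<Tagged> mapOutput) {
        List<Tagged> sorted = new ArrayList<>(mapOutput);
        // Sort on the full composite key: station id first, then tag ("0" before "1")
        sorted.sort(Comparator.comparing((Tagged t) -> t.stationId).thenComparing(t -> t.tag));

        List<String> joined = new ArrayList<>();
        String currentId = null;
        String stationName = null;
        for (Tagged t : sorted) {
            if (!t.stationId.equals(currentId)) {
                // New group (grouping ignores the tag): the first value is taken as the station name
                currentId = t.stationId;
                stationName = t.value;
            } else {
                joined.add(t.stationId + "\t" + stationName + "\t" + t.value);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Tagged> mapOutput = Arrays.asList(
            new Tagged("S1", "1", "record-a"),
            new Tagged("S2", "1", "record-b"),
            new Tagged("S1", "0", "Station One"),
            new Tagged("S2", "0", "Station Two"),
            new Tagged("S1", "1", "record-c"));
        for (String line : join(mapOutput)) {
            System.out.println(line);
        }
    }
}
```

As in JoinReducer, the first value of a group is taken as the station name without checking its tag, so a station that has weather records but no metadata entry would have its first record treated as the name.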

package com.hp.join;

import java.io.IOException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class JoinStationMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, TextPair, Text> {

  private NcdcStationMetadataParser parser = new NcdcStationMetadataParser();

  public void map(LongWritable key, Text value,
      OutputCollector<TextPair, Text> output, Reporter reporter) throws IOException {
    if (parser.parse(value)) {
      // Tag station records with "0" so they sort before the weather records (tagged "1")
      output.collect(new TextPair(parser.getStationId(), "0"), new Text(parser.getStationName()));
    }
  }
}

import java.math.*;

import org.apache.hadoop.io.Text;

public class MetOfficeRecordParser {

  private String year;
  private String airTemperatureString;
  private int airTemperature;
  private boolean airTemperatureValid;

  public void parse(String record) {
    // Reset the flag so a previous record cannot make this one look valid
    airTemperatureValid = false;
    if (record.length() < 18) {
      return;
    }
    year = record.substring(3, 7);
    if (isValidRecord(year)) {
      airTemperatureString = record.substring(13, 18);
      if (!airTemperatureString.trim().equals("---")) {
        // Multiply by ten so the temperature can be stored as an int
        BigDecimal temp = new BigDecimal(airTemperatureString.trim());
        temp = temp.multiply(new BigDecimal(BigInteger.TEN));
        airTemperature = temp.intValueExact();
        airTemperatureValid = true;
      }
    }
  }

  private boolean isValidRecord(String year) {
    try {
      Integer.parseInt(year);
      return true;
    } catch (NumberFormatException e) {
      return false;
    }
  }

  public void parse(Text record) {
    parse(record.toString());
  }

  public String getYear() {
    return year;
  }

  public int getAirTemperature() {
    return airTemperature;
  }

  public String getAirTemperatureString() {
    return airTemperatureString;
  }

  public boolean isValidTemperature() {
    return airTemperatureValid;
  }
}
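The parser stores a temperature as a whole number of tenths by multiplying the decimal field by ten with BigDecimal. A minimal standalone sketch of that conversion follows; TenthsSketch and toTenths are hypothetical names and the sample values are invented. Because BigDecimal works in exact decimal arithmetic, intValueExact either returns the exact integer or throws an ArithmeticException if a fraction would be lost.

```java
import java.math.BigDecimal;

// Hypothetical helper that mirrors the conversion inside MetOfficeRecordParser.parse.
class TenthsSketch {

    /** Converts a decimal field such as " 12.3" into tenths (123). */
    static int toTenths(String field) {
        BigDecimal value = new BigDecimal(field.trim());
        // Exact decimal multiplication, then a conversion that refuses to drop a fraction
        return value.multiply(BigDecimal.TEN).intValueExact();
    }

    public static void main(String[] args) {
        System.out.println(toTenths(" 12.3"));
        System.out.println(toTenths("-0.5"));
        try {
            toTenths("1.25"); // more than one decimal place cannot be stored as whole tenths
        } catch (ArithmeticException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Storing tenths as an int keeps later comparisons and sums in integer arithmetic, at the cost of rejecting fields with more than one decimal place.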

import java.text.*;
import java.util.Date;

import org.apache.hadoop.io.Text;

public class NcdcRecordParser {

  private static final int MISSING_TEMPERATURE = 9999;

  private static final DateFormat DATE_FORMAT = new SimpleDateFormat("yyyyMMddHHmm");

  private String stationId;
  private String observationDateString;
  private String year;
  private String airTemperatureString;
  private int airTemperature;
  private boolean airTemperatureMalformed;
  private String quality;

  public void parse(String record) {
    stationId = record.substring(4, 10) + "-" + record.substring(10, 15);
    observationDateString = record.substring(15, 27);
    year = record.substring(15, 19);
    airTemperatureMalformed = false;
    // Remove leading plus sign as parseInt doesn't like them
    if (record.charAt(87) == '+') {
      airTemperatureString = record.substring(88, 92);
      airTemperature = Integer.parseInt(airTemperatureString);
    } else if (record.charAt(87) == '-') {
      airTemperatureString = record.substring(87, 92);
      airTemperature = Integer.parseInt(airTemperatureString);
    } else {
      // No sign character: flag the record instead of parsing a stale or missing string
      airTemperatureMalformed = true;
    }
    quality = record.substring(92, 93);
  }