• No results found

CHAPTER 4 th Hadoop

4.5 Hadoop Example

Distinction based on the columns. Suppose that we want to create two data sets from patent metadata: one containing time-related information (e.g. publication date) for each patent and another one containing geographical information (e.g. country of invention).

These two data sets may be of different output formats and different types of data for keys and values.

Public class MultiFile extends Configured implements Tool { public static class Mapclass extends MapReduceBase

Implements Mapper<LongWritable, Text, NullWritable, Text> { private MultipleOutputs mos;

private OutputCollector<NullWritable, Text> collector;

Public void configure(JobConf conf) { mos = new MultipleOutputs(conf);

}

Public void map(LongWritable key, Text value, OutputCollector<NullWritable, Text> output, Reporter reporter) throws IOException {

String[] arr = value.toString().split(“,”,-1);

String chrono = arr[0] + “,” + arr[1] + “,” + arr[2];

String geo = arr[0] + “,” + arr[4] + “,”+ arr[5];

collector = mos.getCollector(“chrono”, reporter);

collector.collect(NullWritable.get(), new Text(chromo));

collector.collect(NullWritable.get(), new Text(geo));

}

Public void close() throws IOException { mos.close();

} }

Public int run(String[] args) throws Exception { Configuration conf=getConf();

JobConf job = new JobConf(conf, MultiFile.class);

Path in = new Path(args[0]);

Path out= new Path(args[1]);

FileInputFormat.setInputPath(job, in);

FileOutputFormat.setOutputPath(job, out);

job.setJobName(“MultiFile”);

job.setMapperClass(MapClass.class);

job.setInputFormat(TextInputFormat.class);

job.setOutputKeyclass(NullWritable.class);

job.setOutputValueClass(Text.class);

job.setNumReduceJobs(0);

MultipleOutputs.addNamedOutput(job,“chrono”, TextOutpuFormat.class, NullWritable.class, Text.class);

MultipleOutputs.addNamedOutput(job, “geo”,

TextOutputFormat.class, NullWritable.class, Text.class);

JobClient.runJob(job);

return 0;

}

Public static void main(String[] args) throws Exception { Int res = ToolRunner.run(new Configuration(),

new MultiFile(), args)

System.exit(res) }

}

The MultipleOutput class of MapReduce is used, which is used to set up the output collectors it expects that will be used. Their creation involves a call to the addNamedOutput static method of the MultipleOutputs class. We have created an output collector called "chrono"

and second one called "geo". They have both been created so as to use the TextOutputFormat and have the same key/value types, but we can choose to have different output formats or data types. After that, we call the MultipleOutputs object that tracks the collectors when the mapper becomes active in the call of a configure method. This object has to be available throughout the duration of the job. In the map() routine, the getCollector method is called by the MultipleObjecs object, and then it returns the chrono and geo collectors to output. Then, we will write different data that is appropriate for each of the two collectors. We have given a name to each of the two collectors in the MultipleOutputs routine and then the routine will automatically generate the output file names. In the following text, we can see how MultipleOutputs generates the output names:

ls –l output/

total 101896

-rwxwxrwx 1 Administrator None 9672703 Jul 31 06:28 chrono-m-0000 -rwxwxrwx 1 Administrator None 7752888 Jul 31 06:29 chrono-m-0001 -rwxwxrwx 1 Administrator None 9428951 Jul 31 06:28 geo-m-0000 -rwxwxrwx 1 Administrator None 7464690 Jul 31 06:29 geo-m-0001 -rwxwxrwx 1 Administrator None 0 Jul 31 06:28 part-0000 -rwxwxrwx 1 Administrator None 0 Jul 31 06:28 part-0000 -rwxwxrwx 1 Administrator None 0 Jul 31 06:29 part-0001 -rwxwxrwx 1 Administrator None 0 Jul 31 06:29 part-0001

These files can become records using the OutputCollector routine which is called through the map() method. The records could be:

head output/chrono-m-0000

“PATENT”,”GYEAR”, “GDATE”

3070801,1963,1096 3070802,1963,1096

head output/geo-m-0000

 Thirty million users update their status at least once everyday

 Over one billion photographs are uploaded every month

 Over ten million videos are uploaded every month

 Over one billion posts (web links, news stories, blog posts, notes, photos etc.) are shared every week

Daily statistics:

 Four TB compressed new data is added every day.

 One hundred thirty five TB of compressed data is scanned every day.

 Seven point five thousand Hive Jobs everyday per productive node.

 Eighty K hours of computer time every day.

Where is this data stored?

 Four thousand eight hundred processors, 5.5 Petabytes

 12 TB per node

 Two level network topology

 1Gbit/sec from node to rack switch

 4 Gbit/sec to top level rack switch

Facebook is one of Hadoop and big data’s biggest champions and it claims to operate the largest single Hadoop Distributed Filesystem (HDFS) cluster anywhere, with more than 100 petabytes of disk space in a single system as of July 2012. The sites stores more than 250 billion photos, with 350 million new ones uploaded every day. According to owner of Facebook is used in every Facebook product and in a variety of ways. Users actions such as a “like” or a status update are stored in a highly distributed, customized MySQL database, but applications such as Facebook Messaging run on top of HBase, Hadoop’s NoSQL database framework. All messages sent on desktop or mobile are persisted to HBase. Additionally the company uses Hadoop and Hive to generate reports for third-party developers and advertisers who need to track the success of their applications or campaigns.

Hive, the data warehousing infrastructure Facebook helped develop to run on top of Hadoop, is central to meeting the company's reporting needs. Facebook must balance the need for rapid results in features such as its graph tools with simplicity and ease of reporting, so it is working on another contribution to Hive that will improve the speed of queries. Improving Hive's speed is important, as the scalability that makes the tool central to the social network's needs can come at the expense of low latency.

4.7 Hadoop communication with a conventional database

Although Hadoop is useful for processing large volumes of data, relational databases remain the workhorse of many data processing applications. Hadoop will often need to communicate