Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

(1)

Lecture 32 Big Data

1. Big Data problem

2. Why the excitement about big data

3. What is MapReduce

4. What is Hadoop

5. Get started with Hadoop

(2)

2

(3)

Big Data Problems

• Data explosion

– Data from users on social networks – Data from mobile devices

– Data from of “things”

• What is big data?

Data at large quantity (terabytes), captured at a rapid

rate, structured or unstructured, stored or hold at various

machines and locations, or some combination of the

above

• Problems?

(4)

Why all the excitement

• There are many factors contributing to the hype around Big Data

– Challenges of the problems

– The appearance of a cost effective practical solutions – The expectation on Internet of Things

• Bringing compute and storage together on commodity hardware. i.e.

cloud computing

• Price performance: The Hadoop big data technology provides

significant cost savings with significant performance improvements

• Linear Scalability: Every parallel technology makes claims about scale up

• Full access to unstructured data: A highly scalable data store with a good parallel programming model, MapReduce, has been a

challenge for the industry for some time, until MapReduce system

like Hadoop appears ⁴

(5)

MapReduce

• A programming model for processing large data sets with a parallel and distributed algorithm on a cluster

• A MapReduce program is composed of two core procedures: Map() and Reduce()

– Map() performs filtering and sorting (such as sorting students by first name into queues, one queue for each name)

– Reduce() performs a summary operation (such as counting the number of students in each queue, yielding name frequencies)

• A "MapReduce System" (also called "infrastructure" or "framework") runs the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing redundancy and fault tolerance

• A well-established open-source MapReduce system is Apathe Hadoop

(6)

Solve the word count example by MapReduce

Word count problem

Input: a several text files, or one big file Output: the words and their frequencies E.g.

file00: Hello World Bye World

file01: Hello Hadoop Goodbye Hadoop Solve the problem by MapReduce scheme

6

master

Map()

Reduce()

file01

file00 < Bye, 1>

< Hello, 1>

< World, 2>

< Goodbye,1>

< Hadoop, 2>

< Hello, 1>

< Bye, 1>

< Goodbye, 1>

< Hadoop, 2>

< Hello, 2>

< World, 2>

(7)

Hadoop architecture

(8)

MapReduce layer

• Jobtracker manages MapReduce jobs, hands out tasks

to the slave nodes, schedules tasks, monitoring them

and re-executes the failed tasks. There is exactly one

JobTracker in each cluster

• Tasktracker is a slaves that carry out map and reduce

tasks, usually associated with Datanode

8

(9)

(10)

HDFS layer

• Namenode manages the namespace, file system

metadata, and access control. There is exactly one

Namenode in each cluster

• Datanode holds application input/output data files and

map and reduce programs

• Client is an application launcher to create MapReduce

job with provided application specific input data files and

map and reduce programs

• Hadoop launch application from client program: split data

file into input chunks, map input chunks and programs to

Datanodes.

10

(11)

(12)

Hadoop ecosystem

12

(13)

Components in Hadoop echosystem

The Apache Hadoop project has two core components

1. the file store called Hadoop Distributed File System (HDFS) 2. the programming framework called MapReduce

Other components

1. Hadoop Streaming: A utility to enable MapReduce code in any

language: C, Perl, Python, C++, Bash, etc. The examples include a Python mapper and an AWK reducer.

2. Hive and Hue: Hive convert SQL to a MapReduce job. Hue gives a browser-based graphical interface to do Hive work.

3. Pig: A higher-level programming environment to do MapReduce coding. The Pig language is called Pig Latin.

(14)

5. Oozie: Manages Hadoop workflow.

6. HBase: A super-scalable key-value store. It works very much like a persistent hash-map

7. FlumeNG: A real time loader for streaming your data into Hadoop.

8. Whirr: Cloud provisioning for Hadoop. You can start up a cluster in just a few minutes with a very short configuration file.

9. Mahout: Machine learning for Hadoop. Used for predictive analytics and other advanced analysis.

10.Fuse: Makes the HDFS system look like a regular file system so you can use ls, rm, cd, and others on HDFS data

11.Zookeeper: Used to manage synchronization for the cluster.

14

(15)

Get started with Hadoop

Cloud computing and big data lab Lab tasks

1. create a private cloud, Ubuntu virtual machines 2. Install and test Hadoop

https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

(16)

Hadoop business

• For the executives: Hadoop is an Apache open source software project to get value from the incredible volume/velocity/variety of data about your organization. Use the data instead of throwing most of it away

• For the technical managers: An open source suite of software that mines the structured and unstructured Big Data about your company. It integrates with your existing Business Intelligence ecosystem.

• Legal: An open source suite of software that is packaged and supported by multiple suppliers. Please see the Resources section regarding IP indemnification.

• Engineering: A massively parallel, shared nothing, Java-based map-reduce execution environment. Think hundreds to thousands of computers working on the same

problem, with built-in failure resilience. Projects in the Hadoop ecosystem provide data loading, higher-level languages, automated cloud deployment, and other capabilities.

• Security: A Kerberos-secured software suite.

16