Lecture 32 Big Data
1. Big Data problem
2. Why the excitement about big data
3. What is MapReduce
4. What is Hadoop
5. Get started with Hadoop
Big Data Problems
• Data explosion
– Data from users on social networks
– Data from mobile devices
– Data from the Internet of Things
• What is big data?
Data in large quantities (terabytes), captured at a rapid
rate, structured or unstructured, stored or held on various
machines and at various locations, or some combination of
the above
• Problems?
Why all the excitement
• There are many factors contributing to the hype around Big Data
– Challenges of the problems
– The appearance of cost-effective, practical solutions
– The expectations for the Internet of Things
• Bringing compute and storage together on commodity hardware, i.e.
cloud computing
• Price performance: The Hadoop big data technology provides
significant cost savings with significant performance improvements
• Linear scalability: Every parallel technology makes claims about scaling up; Hadoop scales out by adding more commodity nodes to the cluster
• Full access to unstructured data: Pairing a highly scalable data store with a good parallel programming model was a
long-standing challenge for the industry, until MapReduce systems
such as Hadoop appeared
MapReduce
• A programming model for processing large data sets with a parallel and distributed algorithm on a cluster
• A MapReduce program is composed of two core procedures: Map() and Reduce()
– Map() performs filtering and sorting (such as sorting students by first name into queues, one queue for each name)
– Reduce() performs a summary operation (such as counting the number of students in each queue, yielding name frequencies)
• A "MapReduce System" (also called "infrastructure" or "framework") runs the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing redundancy and fault tolerance
• A well-established open-source MapReduce system is Apache Hadoop
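The student example above can be sketched in plain Python. This is a single-process illustration of the programming model only, not Hadoop code: the helper names (map_phase, shuffle, reduce_phase, name_frequencies) and the student list are hypothetical, and a real MapReduce system would run the map and reduce calls on different machines.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply map_fn to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle(pairs):
    """Group values by key, like placing students into one queue per first name."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn to each key and its list of values, in key order."""
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

def map_student(student):
    """Map(): emit (first name, 1) for one student record."""
    first_name = student.split()[0]
    return [(first_name, 1)]

def count_students(name, ones):
    """Reduce(): count the students in one queue."""
    return sum(ones)

def name_frequencies(students):
    """Chain the three phases: map, then group by key, then reduce."""
    return reduce_phase(shuffle(map_phase(students, map_student)), count_students)

# Hypothetical input data, invented for this sketch.
students = ["Ann Lee", "Bob Ray", "Ann Wu", "Cid Poe", "Bob Kim", "Ann Diaz"]
frequencies = name_frequencies(students)
print(frequencies)  # {'Ann': 3, 'Bob': 2, 'Cid': 1}
```

The grouping step between Map() and Reduce() is written out explicitly here; in a MapReduce system it is part of the data transfer that the framework manages.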
Solve the word count example by MapReduce
Word count problem
Input: several text files, or one big file
Output: the words and their frequencies
E.g.
file00: Hello World Bye World
file01: Hello Hadoop Goodbye Hadoop
Solve the problem with the MapReduce scheme
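[Figure: a master, two Map() tasks (one per input file), and one Reduce() task]
Map() output for file00: < Bye, 1> < Hello, 1> < World, 2>
Map() output for file01: < Goodbye, 1> < Hadoop, 2> < Hello, 1>
Reduce() output: < Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>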
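The word count example can be reproduced with a short Python sketch. It runs in one process and only mimics the data flow; the function names map_with_combiner and reduce_counts are hypothetical. The map step here sums counts within each file before emitting them, which is an assumption chosen to match map outputs such as < World, 2> for file00.

```python
from collections import Counter

def map_with_combiner(text):
    """Map one file: emit (word, count) pairs, already summed within that file."""
    return sorted(Counter(text.split()).items())

def reduce_counts(mapped_outputs):
    """Reduce: add up the per-file counts for each word."""
    totals = Counter()
    for pairs in mapped_outputs:
        for word, count in pairs:
            totals[word] += count
    return sorted(totals.items())

files = {
    "file00": "Hello World Bye World",
    "file01": "Hello Hadoop Goodbye Hadoop",
}

# One map call per input file, then a single reduce over all map outputs.
mapped = {name: map_with_combiner(text) for name, text in files.items()}
word_counts = reduce_counts(mapped.values())

for name, pairs in mapped.items():
    print(name, pairs)
print(word_counts)
# [('Bye', 1), ('Goodbye', 1), ('Hadoop', 2), ('Hello', 2), ('World', 2)]
```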
Hadoop architecture
MapReduce layer
• JobTracker manages MapReduce jobs: it hands out tasks
to the slave nodes, schedules and monitors them, and
re-executes failed tasks. There is exactly one
JobTracker in each cluster
• TaskTracker is a slave that carries out map and reduce
tasks, usually running on the same machine as a Datanode
HDFS layer
• Namenode manages the namespace, file system
metadata, and access control. There is exactly one
Namenode in each cluster
• Datanode holds application input/output data files and
map and reduce programs
• Client is an application launcher that creates a MapReduce
job from the provided application-specific input data files
and map and reduce programs
• Hadoop launches the application from the client program: it
splits the data file into input chunks, then maps the input
chunks and programs to Datanodes.
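The launch step above (split the input into chunks, then map chunks to Datanodes) can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the chunk size is counted in lines, the node names are invented, and the round-robin placement is a stand-in. Hadoop itself splits files by size in bytes and takes the location of the stored data into account, which this version ignores.

```python
def split_into_chunks(lines, chunk_size):
    """Cut the input into consecutive chunks of at most chunk_size lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

def assign_chunks(chunks, datanodes):
    """Assign chunks to nodes in round-robin order (illustrative policy only)."""
    assignment = {node: [] for node in datanodes}
    for index, chunk in enumerate(chunks):
        node = datanodes[index % len(datanodes)]
        assignment[node].append(chunk)
    return assignment

# Hypothetical input file of seven records and two hypothetical nodes.
lines = [f"record {n}" for n in range(7)]
chunks = split_into_chunks(lines, chunk_size=3)
placement = assign_chunks(chunks, ["datanode-a", "datanode-b"])

for node, node_chunks in placement.items():
    print(node, node_chunks)
```

With seven records and a chunk size of three, the last chunk holds a single record, so chunks need not be equal in size.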
Hadoop ecosystem
Components in the Hadoop ecosystem
The Apache Hadoop project has two core components
1. the file store called Hadoop Distributed File System (HDFS)
2. the programming framework called MapReduce
Other components
1. Hadoop Streaming: A utility to enable MapReduce code in any
language: C, Perl, Python, C++, Bash, etc. The examples include a Python mapper and an AWK reducer.
2. Hive and Hue: Hive converts SQL into MapReduce jobs. Hue gives a browser-based graphical interface to do Hive work.
3. Pig: A higher-level programming environment to do MapReduce coding. The Pig language is called Pig Latin.
5. Oozie: Manages Hadoop workflow.
6. HBase: A super-scalable key-value store. It works very much like a persistent hash-map
7. FlumeNG: A real time loader for streaming your data into Hadoop.
8. Whirr: Cloud provisioning for Hadoop. You can start up a cluster in just a few minutes with a very short configuration file.
9. Mahout: Machine learning for Hadoop. Used for predictive analytics and other advanced analysis.
10. Fuse: Makes the HDFS system look like a regular file system, so you can use ls, rm, cd, and other commands on HDFS data.
11. Zookeeper: Used to manage synchronization for the cluster.
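As an illustration of the Hadoop Streaming item in the list above, here is a sketch of a word count mapper and reducer in the streaming style, where records are lines of text of the form key, tab, value, and the reducer sees its input sorted by key. Both parts are written in Python here for brevity, rather than the Python mapper and AWK reducer mentioned above. The function names and the run_local helper are hypothetical; in a real streaming job the mapper and reducer would be separate scripts that read standard input and write standard output.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same word.

    Assumes the input is sorted by key, so equal words are adjacent.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

def run_local(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))

result = run_local(["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"])
for line in result:
    print(line)
```

The sorted() call in run_local plays the role of the sort that the framework performs between the map and reduce steps.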
Get started with Hadoop
Cloud computing and big data lab
Lab tasks
1. Create a private cloud of Ubuntu virtual machines
2. Install and test Hadoop
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Hadoop business
• For the executives: Hadoop is an Apache open source software project to get value from the incredible volume/velocity/variety of data about your organization. Use the data instead of throwing most of it away
• For the technical managers: An open source suite of software that mines the structured and unstructured Big Data about your company. It integrates with your existing Business Intelligence ecosystem.
• Legal: An open source suite of software that is packaged and supported by multiple suppliers. Please see the Resources section regarding IP indemnification.
• Engineering: A massively parallel, shared-nothing, Java-based MapReduce execution environment. Think hundreds to thousands of computers working on the same problem, with built-in failure resilience. Projects in the Hadoop ecosystem provide data loading, higher-level languages, automated cloud deployment, and other capabilities.
• Security: A Kerberos-secured software suite.