Big Data Workshop
About Praveen
Has more than15 years of experience working on various technologies.
Is a Cloudera Certified Developer for Apache Hadoop CDH4 (CCD-410) with 95% score and got through the first part Cloudera Certified Data Scientist Certification (DS-200). Is a Sun Certified Java Professional.
Has been active in the Big Data/Hadoop community Blog (http://www.thecloudavenue.com/ and
http://www.dattamsha.com/blog/) Stack Overflow
(http://stackoverflow.com/users/614157/praveen-sripati)
Extensive experience in consulting, conducting training and workshops around Big Data.
Agenda
Job market around Big Data What is Big Data?
Big Data use cases Big Data Ecosystem What is HDFS?
What is MapReduce? Pig, Hive etc
NoSQL Databases
Why we are here?
http://www.indeed.com/jobtrends
http://www.indeed.com/jobtrends
Why we are here?
http://www.nasscom.in/big-data-next-big-thing
IDC predicts that the market for big data will
reach $16.1 billion in 2014, growing 6 times faster than the overall IT market.
http://www.nasscom.in/big-data-next-big-thing
Big Data Certifications
http://university.cloudera.com/certification.html
http://hortonworks.com/hadoop-training/
Expectations from the workshop
To make each one of you get familiar with the different aspects of Big Data which will help you to explore further the various
opportunities around Big Data.
The work shop is to make each one of you motivated to work on the Big Data in the future.
The workshop is not to make you an expert in Big Data given the time frame.
Scale of data
http://en.wikipedia.org/wiki/Metric_prefix
Some interesting facts about Data
Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.
Walmart handles more than 1 million customer transactions every hour, which is imported into databases estimated to contain more than 2.5 PB of data
Twitter generates 12 TB of data every day.
Airbus A380 generates 10 TB every 30 minutes of flight. NYSE generates a TB of data every month.
What do we do we so much amount of data?
Ignore or use it.
Lets jump into Big Data
10
The model has changed …
Old Model – only a few companies were generating data (like news outlets), all others are consuming data
New Model – all of us are generating data, and all of us are consuming data
11
Who’s generating data?
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology
(measuring all kinds of data)
The progress and innovation is no longer hindered by the ability to collect data. But, by the ability to manage, analyze, summarize, visualize, and discover knowledge from the
collected data in a timely manner and in a scalable fashion. 12
MapMyRide iPhone App and Cycling
13
How much data is getting generated?
http://www.nasscom.in/big-data-next-big-thing
http://www.nasscom.in/big-data-next-big-thing
What's Big Data?
In information technology, Big Data is a collection of data sets
so large and complex that it becomes difficult to process and
store using on-hand existing tools and technologies.
Converting raw data into information
Raw data from various
sources Info Machine Learning Data Mining Analytics Store Process Action Increase the bottom line or the profits of the company dattamsha.com
Interesting fact of Neo (a graph db)
me
Friends (depth 2)
Friends of Friends (depth 3)
This is a Social Graph
Interesting fact of Neo (a graph db)
http://www.neotechnology.com/how-much-faster-is-a-graph-database-really/
With a database of 1,000,000 users, finding the friends of friends (fof) at different levels and the results are striking.
Execution Time is in seconds, for 1,000 users.
Depth Execution Time
-MySQL Execution Time –Neo4J How much faster
2 0.016 0.010 1.6 3 30.267 0.168 180.2 4 1543.505 1.359 1135.8 5 Not finished in 1 hr 2.132 NA dattamsha.com
1-Scale (Volume)
http://www.nasscom.in/big-data-next-big-thing
Various formats, types, and structures Text, numerical, images, audio, video,
sequences, time series, social media data, multi-dim arrays, etc…
Static data vs. streaming data A single application can be
generating/collecting many types of data
20
2-Complexity (Variety)
Data is begin generated fast and need to be processed fast. Late decisions will lead to missing opportunities.
E-Promotions: send promotions right now for store next to customer, based on your current location, your purchase history, what the customer likes.
Healthcare monitoring : Any abnormal sensor measurements in the body would require immediate reaction
Fraud detection : Credit Card company should detect frauds as soon as possible and take immediate action.
21
3-Speed (Velocity)
Some Make it 4V’s
22
General use cases of Big Data
23
Recommendation
Customer segmentation and targeted promotions Fraud detection
Network intrusion detection Reporting and analytics
Sentimental analysis
http://www.bigdata-madesimple.com/category/sectors/
Predict the future
24
Use Case - Recommendations
B1 B2 B3 U1 B1 B2 B4 U2 B6 B7 B8 U3 This is called Collaborative Filtering – a type of machine learning algorithm dattamsha.comUse Case – Sentimental Analysis
Twitter Facebook Blogs Natural Language Processing Based on the feedback in the different social media outlets identify if the feedback has been positive, negative or neutral.
This can be used to improve the service/product or how it has been marketed.
Use Case – Reporting & Predictive Analytics
How many smart phones have been sold in each of the store this year and the previous year?
How many customers
have booked a hotel room in the past and how many will book in the next few months?
Did providing an incentive / offer to the customer increase the sales of a particular gadget?
How much of a particular flu medicine should be manufactured in the next 6 months?
Use Case – Classification
Classify an email as a spam or a non-spam automatically as in the case of GMail.
Given a particular medical report. Automatically
classify if the report is positive or negative for a particular symptom.
Based on the different features of a customer. Separate the potential customer who would be interested in a particular service from the rest.
Classify weather a taxi
booked will be cancelled by the customer or not.
Use Case – Customer segmentation &
targeted advertising
Separate the customers into different segments based on the features like age, salary, gender, location, spending habits.
Target each of the customer by sending a customized
email with some offers. This will increase the prospects of the customer opting for a particular service.
This is much better than sending the same email in bulk to all the customers.
Use Case – Fraud detection
Based on the nature of the transaction like time, location,
spending habits etc identify if a particular credit card transaction is a genuine
transaction or not. Identifying the
fraudulent transactions as soon as possible will help in mitigating or minimizing the losses.
Use Case – Flu Trends
Identify the trends in the search terms for flu
related topics.
And then manufacture the appropriate amount of flu related drugs.
Manufacturing more flu related drugs will result in excess inventory. By manufacturing less, the demand won’t be met. So, the appropriate
amount of flu related drugs have to be
manufactured.
http://www.google.org/flutrends/
Use Case – Wiki Analysis
Download the entire Wikipedia dump and identify the importance of the different pages based on the PageRank algorithm. Create an index for the entire Wikipedia
similar to an index at the end of the book. By combining the above two data sets, a
search engine can be created on top of Wikipedia, similar to Google.
http://en.wikipedia.org/wiki/PageRank
http://en.wikipedia.org/wiki/Wikipedia:Database_download
Challenges with Big Data
33
Big Data technology is like a moving target. Security had been an after thought
Privacy of the data Lack of skills
Sharing of data across silos
Risks of storing huge data sets within a single system Regulations in the different domains (HIPPA for Health
Care)
Lets jump into technology aspects of Big
Data
34
What is ASF?
Apache Software is a Non Profit Organization.
Provides a platform or an environment for the development of different softwares.
The development happens in an open way. The code can be obtained from http://svn.apache.org/repos/asf/.
Different individuals and companies can work in a collaborative fashion.
There are a lot of Big Data and Non Big Data related projects. The softwares are free to use with very few or no restrictions.
Different softwares under ASF
Non Big Data softwares Ant Maven Tomcat Log4J JUnit Apache Http
Torque (ORM Tool)
Big Data softwares Hadoop Hive Pig HBase Cassandra Sqoop Oozie https://projects.apache.org/indexes/alpha.html dattamsha.com
Ecosystem around Big Data
s
How they fit together?
Parallels between Linux and Big Data
GNU Linux
Fedora CentOS RHEL Ubuntu
RedHat Canonical
Base
Derivatives
Company
Parallels between Linux and Big Data
Apache
CDH HDP M3/M5/M7 HDInsights
Cloudera Hortonworks MapR Microsoft
Base
Derivative
Company
About the Big Data Virtual Machine
Laptop/Desktop
Windows Mac Linux
Oracle Virtual Box Ubuntu
Hadoop Hive Pig
This is what has been shared in the DVD along
with the installation instructions.
http://www.thecloudavenue.com/2013/01/virtual-machine-for-learning-hadoop.html
Getting started with Big Data Virtual
Machine (VM)
Installation of VirtualBox
Configuring VM image in VirtualBox Use the different frameworks in the VM
Ubuntu running on Windows 8
43
Introduction to Linux
http://www.ee.surrey.ac.uk/Teaching/Unix/
Vertical & Horizontal scaling of Big Data
systems
45 Traditional Systems Big Data SystemsBy adding more and more commodity machines to
the cluster, more and more of the data is stored
and processed in parallel Traditional systems are
expensive to scale and by design difficult to distribute. High end machine Desktop grade machine Commodity machines dattamsha.com
How a datacenter looks like?
46
Google leading the Big Data revolution
47
http://www.thecloudavenue.com/2012/10/google-driving-big-data-space.html
What is Hadoop?
Framework/Software for storage and analysis at large scale Can process and store structured as well as unstructured data Addresses many of the challenges with distributed processing and
storing
HDFS for Storage and MapReduce for Analysis HDFS and MapReduce based on Google's papers.
Is free and is developed from Apache Software Foundation Commercial support is available from vendors like Cloudera,
HortonWorks, Amazon and others
Deployments are in the range of thousands of nodes in production
Hadoop handles the different failure
scenarios
49
A rack/node going down. A process dying.
A particular process running slow. Problems in the network.
A Hard Disk Drive crashing.
….. and it also handles any changes to the network topology like adding a new rack or a new node.
What is HDFS?
50
http://research.google.com/archive/gfs-sosp2003.pdf
Distributed file system for redundant storage based on Google’s GFS Paper.
Designed to reliably store data on commodity hardware. Built to expect hardware failures.
Performs best with a modest number of large files Sits on top of native file system
How does HDFS store the data?
51 http://research.google.com/archive/gfs-sosp2003.pdf b1 b2 b3 b4 m/c1 m/c2 m/c3 m/c4Original file is split into blocks
b1 b2 b2 b3 b3 b4 b4 b1
The blocks are placed on different machines with redundancy to address fault tolerance.
Processing of the data on different machines
52
m/c1
b1
b2
Once the blocks are placed in different machines, they can processed in parallel by different processes.
p1 m/c3 b3 b4 p3 m/c2 b2 b3 p2 m/c4 b4 b1 p4 dattamsha.com
Different models for processing of the data
HDFS
MapReduce RDD MPI BSP Graph
Hadoop Spark Open MPI Hama Giraph
53 Distributed computing
models
Softwares implementing the distributed computing
models
What is MapReduce?
54
http://research.google.com/archive/mapreduce.html
Programming model for distributed computations at a massive scale.
A method for distributing a task across multiple nodes Execution framework for organizing and performing such
computations
Each node processes data stored on that node.
Provides
Automatic parallelization and distribution Fault tolerance
Status and monitoring tools
A clean abstraction for programmers
Map Reduce Data Flow
55 Hadoop is easy Hadoop is cool m2 m1 (Hadoop,1)(is,1) (easy,1) (Hadoop,1) (is,1) (cool,1) r (Hadoop,2) (is,2) (easy,1) (cool,1)Mapper doing the transformation by splitting the line
into words.
Reducer doing the aggregation by counting the numbers of occurrence of a particular word. dattamsha.com
WordCount in Java (Mapper code)
56
http://wiki.apache.org/hadoop/WordCount
MapReduce code can be written in Java and
non-Java languages.
WordCount in Java (Reducer code)
57
Apache Spark
Spark was developed AMPlab in University of
Berkeley
http://spark.apache.org/
WordCount in Python (Spark Code)
59
from pyspark import SparkContext
logFile = "hdfs://localhost:9000/user/bigdatavm/input"
sc = SparkContext("spark://bigdata-vm:7077", "WordCount") textFile = sc.textFile(logFile)
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
wordCounts.saveAsTextFile("hdfs://localhost:9000/user/bigdatavm/output")
Spark code is not only faster, but is much
easy to write
http://spark.apache.org/
Putting it all together
60
Apache Pig
High-level data flow language Made of two components
Data processing language Pig Latin
Compiler to translate Pig Latin to MapReduce
Abstracts you from specific details and allows you to focus on data processing.
Apache Pig (find top 10 urls visited by users)
users = LOAD 'users.txt' USING PigStorage(',') AS (name, age); pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url); filteredUsers = FILTER users BY age >= 18 and age <=50; joinResult = JOIN filteredUsers BY name, pages by user; grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) as clicks; sorted = ORDER summed BY clicks desc;
top10 = LIMIT sorted 10;
STORE top10 INTO 'top10sites';
Same thing in MapReduce
users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
filteredUsers = FILTER users BY age >= 18 and age <=50; joinResult = JOIN filteredUsers BY name, pages by user; grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) as clicks;
sorted = ORDER summed BY clicks desc; top10 = LIMIT sorted 10;
STORE top10 INTO 'top10sites';
MapReduce requires a lot more coding when
compared to Pig.
Apache Hive (find top 3 urls visited by users)
Data Warehousing Layer on top of Hadoop
Allows analysis and queries using a SQL-like language Hive is best for data analysts familiar with SQL
who need to do dynamic queries, summarization and data analysis.
Apache Hive (find top 10 urls visited by
users)
CREATE TABLE users(name STRING, age INT); CREATE TABLE pages(user STRING, url STRING);
LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE 'users'; LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE 'pages'; SELECT pages.url, count(*) AS clicks FROM users JOIN pages ON
(users.name = pages.user) WHERE users.age >= 18 AND users.age <= 50 GROUP BY pages.url SORT BY clicks DESC LIMIT 10;
What is NoSQL?
66
Different types of RDBMS
67
RDBMS have scalability problems
68
RDBMS
ACID Compliant Codd’s rule
Designed 15-20 years back
NoSQL
BASE Compliant Brewers Cap Theorem
Schema less Horizontally scalable 125+ NoSQL databases RDBMS have scalability issues dattamsha.com
Different types of NoSQL database
69
Building online applications
70 RDBMS NoSQL Online application dattamsha.comDemo of the Cloudera Cluster
71
Books
72
Hadoop – The Definitive Guide (3rd Edition)
http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/
HBase – The Definitive Guide
http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100/
NoSQL Distilled
http://www.amazon.com/NoSQL-Distilled-Emerging-Polyglot-Persistence/dp/0321826620/
Seven Database in Seven Weeks
http://www.amazon.com/Seven-Databases-Weeks-Modern-Movement/dp/1934356921/
Books are old and technology is moving
fast.
Blogs
73 http://blog.cloudera.com/blog/ http://hortonworks.com/blog/ https://www.mapr.com/blog https://databricks.com/blog http://www.datastax.com/blog http://planetbigdata.com/ Use an RSS aggregator like feedly.com to keepupdated.
74
There is data every where around us.
The data needs to be analyzed for better things customer satisfaction
better health better sales
Big Data will be the next big thing.
Need to scale up with proper knowledge of Big Data to grab the opportunities
Proper training Practice
Reading books
75