Big Data Workshop. dattamsha.com

(1)

Big Data Workshop

(2)

About Praveen

 Has more than15 years of experience working on various technologies.

 Is a Cloudera Certified Developer for Apache Hadoop CDH4 (CCD-410) with 95% score and got through the first part Cloudera Certified Data Scientist Certification (DS-200).  Is a Sun Certified Java Professional.

 Has been active in the Big Data/Hadoop community  Blog (http://www.thecloudavenue.com/ and

http://www.dattamsha.com/blog/)  Stack Overflow

(http://stackoverflow.com/users/614157/praveen-sripati)

 Extensive experience in consulting, conducting training and workshops around Big Data.

(3)

Agenda

 Job market around Big Data  What is Big Data?

 Big Data use cases  Big Data Ecosystem  What is HDFS?

 What is MapReduce?  Pig, Hive etc

 NoSQL Databases

(4)

Why we are here?

http://www.indeed.com/jobtrends

(5)

Why we are here?

http://www.nasscom.in/big-data-next-big-thing

IDC predicts that the market for big data will

reach $16.1 billion in 2014, growing 6 times faster than the overall IT market.

(6)

Big Data Certifications

http://university.cloudera.com/certification.html

http://hortonworks.com/hadoop-training/

(7)

Expectations from the workshop

 To make each one of you get familiar with the different aspects of Big Data which will help you to explore further the various

opportunities around Big Data.

 The work shop is to make each one of you motivated to work on the Big Data in the future.

 The workshop is not to make you an expert in Big Data given the time frame.

(8)

Scale of data

http://en.wikipedia.org/wiki/Metric_prefix

(9)

Some interesting facts about Data

 Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.

 Walmart handles more than 1 million customer transactions every hour, which is imported into databases estimated to contain more than 2.5 PB of data

 Twitter generates 12 TB of data every day.

 Airbus A380 generates 10 TB every 30 minutes of flight.  NYSE generates a TB of data every month.

What do we do we so much amount of data?

Ignore or use it.

(10)

Lets jump into Big Data

10

(11)

The model has changed …

Old Model – only a few companies were generating data (like news outlets), all others are consuming data

New Model – all of us are generating data, and all of us are consuming data

11

(12)

Who’s generating data?

Social media and networks

(all of us are generating data)

Scientific instruments

(collecting all sorts of data)

Mobile devices

(tracking all objects all the time)

Sensor technology

(measuring all kinds of data)

The progress and innovation is no longer hindered by the ability to collect data. But, by the ability to manage, analyze, summarize, visualize, and discover knowledge from the

collected data in a timely manner and in a scalable fashion. 12

(13)

MapMyRide iPhone App and Cycling

13

(14)

How much data is getting generated?

(15)

What's Big Data?

In information technology, Big Data is a collection of data sets

so large and complex that it becomes difficult to process and

store using on-hand existing tools and technologies.

(16)

Converting raw data into information

Raw data from various

sources Info Machine Learning Data Mining Analytics Store Process Action Increase the bottom line or the profits of the company dattamsha.com

(17)

Interesting fact of Neo (a graph db)

me

Friends (depth 2)

Friends of Friends (depth 3)

This is a Social Graph

(18)

Interesting fact of Neo (a graph db)

http://www.neotechnology.com/how-much-faster-is-a-graph-database-really/

With a database of 1,000,000 users, finding the friends of friends (fof) at different levels and the results are striking.

Execution Time is in seconds, for 1,000 users.

Depth Execution Time

-MySQL Execution Time –Neo4J How much faster

2 0.016 0.010 1.6 3 30.267 0.168 180.2 4 1543.505 1.359 1135.8 5 Not finished in 1 hr 2.132 NA dattamsha.com

(19)

1-Scale (Volume)

(20)

Various formats, types, and structures  Text, numerical, images, audio, video,

sequences, time series, social media data, multi-dim arrays, etc…

 Static data vs. streaming data  A single application can be

generating/collecting many types of data

20

2-Complexity (Variety)

(21)

Data is begin generated fast and need to be processed fast. Late decisions will lead to missing opportunities.

 E-Promotions: send promotions right now for store next to customer, based on your current location, your purchase history, what the customer likes.

 Healthcare monitoring : Any abnormal sensor measurements in the body would require immediate reaction

 Fraud detection : Credit Card company should detect frauds as soon as possible and take immediate action.

21

3-Speed (Velocity)

(22)

Some Make it 4V’s

22

(23)

General use cases of Big Data

23

 Recommendation

 Customer segmentation and targeted promotions  Fraud detection

 Network intrusion detection  Reporting and analytics

 Sentimental analysis

http://www.bigdata-madesimple.com/category/sectors/

(24)

Predict the future

24

(25)

Use Case - Recommendations

B1 B2 B3 U1 B1 B2 B4 U2 B6 B7 B8 U3 This is called Collaborative Filtering – a type of machine learning algorithm dattamsha.com

(26)

Use Case – Sentimental Analysis

Twitter Facebook Blogs Natural Language Processing

 Based on the feedback in the different social media outlets identify if the feedback has been positive, negative or neutral.

 This can be used to improve the service/product or how it has been marketed.

(27)

Use Case – Reporting & Predictive Analytics

 How many smart phones have been sold in each of the store this year and the previous year?

 How many customers

have booked a hotel room in the past and how many will book in the next few months?

 Did providing an incentive / offer to the customer increase the sales of a particular gadget?

 How much of a particular flu medicine should be manufactured in the next 6 months?

(28)

Use Case – Classification

 Classify an email as a spam or a non-spam automatically as in the case of GMail.

 Given a particular medical report. Automatically

classify if the report is positive or negative for a particular symptom.

 Based on the different features of a customer. Separate the potential customer who would be interested in a particular service from the rest.

 Classify weather a taxi

booked will be cancelled by the customer or not.

(29)

Use Case – Customer segmentation &

targeted advertising

 Separate the customers into different segments based on the features like age, salary, gender, location, spending habits.

 Target each of the customer by sending a customized

email with some offers. This will increase the prospects of the customer opting for a particular service.

 This is much better than sending the same email in bulk to all the customers.

(30)

Use Case – Fraud detection

 Based on the nature of the transaction like time, location,

spending habits etc identify if a particular credit card transaction is a genuine

transaction or not.  Identifying the

fraudulent transactions as soon as possible will help in mitigating or minimizing the losses.

(31)

Use Case – Flu Trends

 Identify the trends in the search terms for flu

Use Case – Wiki Analysis

 Download the entire Wikipedia dump and identify the importance of the different pages based on the PageRank algorithm.  Create an index for the entire Wikipedia

similar to an index at the end of the book.  By combining the above two data sets, a

search engine can be created on top of Wikipedia, similar to Google.

http://en.wikipedia.org/wiki/PageRank

http://en.wikipedia.org/wiki/Wikipedia:Database_download

(33)

Challenges with Big Data

33

 Big Data technology is like a moving target.  Security had been an after thought

 Privacy of the data  Lack of skills

 Sharing of data across silos

 Risks of storing huge data sets within a single system  Regulations in the different domains (HIPPA for Health

Care)

(34)

Lets jump into technology aspects of Big

Data

34

(35)

What is ASF?

 Apache Software is a Non Profit Organization.

 Provides a platform or an environment for the development of different softwares.

 The development happens in an open way. The code can be obtained from http://svn.apache.org/repos/asf/.

 Different individuals and companies can work in a collaborative fashion.

 There are a lot of Big Data and Non Big Data related projects.  The softwares are free to use with very few or no restrictions.

(36)

Different softwares under ASF

Non Big Data softwares  Ant  Maven  Tomcat  Log4J  JUnit  Apache Http

 Torque (ORM Tool)

Big Data softwares  Hadoop  Hive  Pig  HBase  Cassandra  Sqoop  Oozie https://projects.apache.org/indexes/alpha.html dattamsha.com

(37)

Ecosystem around Big Data

s

(38)

How they fit together?

(39)

Parallels between Linux and Big Data

GNU Linux

Fedora CentOS RHEL Ubuntu

RedHat Canonical

Base

Derivatives

Company

(40)

Parallels between Linux and Big Data

Apache

CDH HDP M3/M5/M7 HDInsights

Cloudera Hortonworks MapR Microsoft

Base

Derivative

Company

(41)

About the Big Data Virtual Machine

Laptop/Desktop

Windows Mac Linux

Oracle Virtual Box Ubuntu

Hadoop Hive Pig

This is what has been shared in the DVD along

with the installation instructions.

http://www.thecloudavenue.com/2013/01/virtual-machine-for-learning-hadoop.html

(42)

Getting started with Big Data Virtual

Machine (VM)

 Installation of VirtualBox

 Configuring VM image in VirtualBox  Use the different frameworks in the VM

(43)

Ubuntu running on Windows 8

43

(44)

Introduction to Linux

http://www.ee.surrey.ac.uk/Teaching/Unix/

(45)

Vertical & Horizontal scaling of Big Data

systems

45 Traditional Systems Big Data Systems

By adding more and more commodity machines to

the cluster, more and more of the data is stored

and processed in parallel Traditional systems are

expensive to scale and by design difficult to distribute. High end machine Desktop grade machine Commodity machines dattamsha.com

(46)

How a datacenter looks like?

46

(47)

Google leading the Big Data revolution

47

http://www.thecloudavenue.com/2012/10/google-driving-big-data-space.html

(48)

What is Hadoop?

 Framework/Software for storage and analysis at large scale  Can process and store structured as well as unstructured data  Addresses many of the challenges with distributed processing and

storing

 HDFS for Storage and MapReduce for Analysis  HDFS and MapReduce based on Google's papers.

 Is free and is developed from Apache Software Foundation  Commercial support is available from vendors like Cloudera,

HortonWorks, Amazon and others

 Deployments are in the range of thousands of nodes in production

(49)

Hadoop handles the different failure

scenarios

49

 A rack/node going down.  A process dying.

 A particular process running slow.  Problems in the network.

 A Hard Disk Drive crashing.

….. and it also handles any changes to the network topology like adding a new rack or a new node.

(50)

What is HDFS?

50

http://research.google.com/archive/gfs-sosp2003.pdf

 Distributed file system for redundant storage based on Google’s GFS Paper.

 Designed to reliably store data on commodity hardware.  Built to expect hardware failures.

 Performs best with a modest number of large files  Sits on top of native file system

(51)

How does HDFS store the data?

51 http://research.google.com/archive/gfs-sosp2003.pdf b1 b2 b3 b4 m/c1 m/c2 m/c3 m/c4

Original file is split into blocks

b1 b2 _b2 b3 b3 b4 b4 b1

The blocks are placed on different machines with redundancy to address fault tolerance.

(52)

Processing of the data on different machines

52

m/c1

b1

b2

Once the blocks are placed in different machines, they can processed in parallel by different processes.

p1 m/c3 b3 b4 p3 m/c2 b2 b3 p2 m/c4 b4 b1 p4 dattamsha.com

(53)

Different models for processing of the data

HDFS

MapReduce RDD MPI BSP Graph

Hadoop Spark Open MPI Hama Giraph

53 Distributed computing

models

Softwares implementing the distributed computing

models

(54)

What is MapReduce?

54

http://research.google.com/archive/mapreduce.html

 Programming model for distributed computations at a massive scale.

 A method for distributing a task across multiple nodes  Execution framework for organizing and performing such

computations

 Each node processes data stored on that node.

Provides

 Automatic parallelization and distribution  Fault tolerance

 Status and monitoring tools

 A clean abstraction for programmers

(55)

Map Reduce Data Flow

55 Hadoop is easy Hadoop is cool m2 m1 (Hadoop,1)(is,1) (easy,1) (Hadoop,1) (is,1) (cool,1) r (Hadoop,2) (is,2) (easy,1) (cool,1)

Mapper doing the transformation by splitting the line

into words.

Reducer doing the aggregation by counting the numbers of occurrence of a particular word. dattamsha.com

(56)

WordCount in Java (Mapper code)

56

http://wiki.apache.org/hadoop/WordCount

MapReduce code can be written in Java and

non-Java languages.

(57)

WordCount in Java (Reducer code)

57

(58)

Apache Spark

Spark was developed AMPlab in University of

Berkeley

http://spark.apache.org/

(59)

WordCount in Python (Spark Code)

59

from pyspark import SparkContext

logFile = "hdfs://localhost:9000/user/bigdatavm/input"

sc = SparkContext("spark://bigdata-vm:7077", "WordCount") textFile = sc.textFile(logFile)

wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)

wordCounts.saveAsTextFile("hdfs://localhost:9000/user/bigdatavm/output")

Spark code is not only faster, but is much

easy to write

http://spark.apache.org/

(60)

Putting it all together

60

(61)

Apache Pig

 High-level data flow language  Made of two components

 Data processing language Pig Latin

 Compiler to translate Pig Latin to MapReduce

 Abstracts you from specific details and allows you to focus on data processing.

(62)

Apache Pig (find top 10 urls visited by users)

users = LOAD 'users.txt' USING PigStorage(',') AS (name, age); pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url); filteredUsers = FILTER users BY age >= 18 and age <=50; joinResult = JOIN filteredUsers BY name, pages by user; grouped = GROUP joinResult BY url;

summed = FOREACH grouped GENERATE group, COUNT(joinResult) as clicks; sorted = ORDER summed BY clicks desc;

top10 = LIMIT sorted 10;

STORE top10 INTO 'top10sites';

(63)

Same thing in MapReduce

users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);

pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);

filteredUsers = FILTER users BY age >= 18 and age <=50; joinResult = JOIN filteredUsers BY name, pages by user; grouped = GROUP joinResult BY url;

summed = FOREACH grouped GENERATE group, COUNT(joinResult) as clicks;

sorted = ORDER summed BY clicks desc; top10 = LIMIT sorted 10;

STORE top10 INTO 'top10sites';

MapReduce requires a lot more coding when

compared to Pig.

(64)

Apache Hive (find top 3 urls visited by users)

 Data Warehousing Layer on top of Hadoop

 Allows analysis and queries using a SQL-like language  Hive is best for data analysts familiar with SQL

who need to do dynamic queries, summarization and data analysis.

(65)

Apache Hive (find top 10 urls visited by

users)

CREATE TABLE users(name STRING, age INT); CREATE TABLE pages(user STRING, url STRING);

LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE 'users'; LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE 'pages'; SELECT pages.url, count(*) AS clicks FROM users JOIN pages ON

(users.name = pages.user) WHERE users.age >= 18 AND users.age <= 50 GROUP BY pages.url SORT BY clicks DESC LIMIT 10;

(66)

What is NoSQL?

66

(67)

Different types of RDBMS

67

(68)

RDBMS have scalability problems

68

RDBMS

ACID Compliant Codd’s rule

Designed 15-20 years back

NoSQL

BASE Compliant Brewers Cap Theorem

Schema less Horizontally scalable 125+ NoSQL databases RDBMS have scalability issues dattamsha.com

(69)

Different types of NoSQL database

69

(70)

Building online applications

70 RDBMS NoSQL Online application dattamsha.com

(71)

Demo of the Cloudera Cluster

71

(72)

Books

72

Hadoop – The Definitive Guide (3rd _Edition)

http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/

HBase – The Definitive Guide

http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100/

NoSQL Distilled

http://www.amazon.com/NoSQL-Distilled-Emerging-Polyglot-Persistence/dp/0321826620/

Seven Database in Seven Weeks

http://www.amazon.com/Seven-Databases-Weeks-Modern-Movement/dp/1934356921/

Books are old and technology is moving

fast.

(73)

Blogs

73 http://blog.cloudera.com/blog/ http://hortonworks.com/blog/ https://www.mapr.com/blog https://databricks.com/blog http://www.datastax.com/blog http://planetbigdata.com/ Use an RSS aggregator like feedly.com to keep

updated.

(74)

74

 There is data every where around us.

 The data needs to be analyzed for better things  customer satisfaction

 better health  better sales

 Big Data will be the next big thing.

 Need to scale up with proper knowledge of Big Data to grab the opportunities

 Proper training  Practice

 Reading books

(75)

75

(76)