BIG DATA: What is it and how to use it?

Lauri Ilison, PhD

Data Scientist

21.11.2014

Big Data definition?

There is no clear definition of “BIG DATA”.

“BIG DATA” is more of a concept than a precise term.

BIG DATA concept versions

1. Unstructured vs structured - Big Data focuses on unstructured data

2. Big Data could be a volume issue - petabyte-scale data (1 PB = 1 million GB)

3. The 3 Vs of Big Data - Volume, Velocity, Variety

[Figure: the 3 Vs of BIG DATA. Volume: MB to GB to TB of data. Velocity: gigabytes of data transported every hour. Variety: social media, purchases, calls, scripts, weather, logs.]

It all started with …

Google Whitepapers

In 2003, 2004 and 2006 Google published three academic papers describing Google’s technology for massive data processing:

1.  Google File System (GFS): how Google stores all web content

2.  Map-Reduce: how Google calculates PageRank and the web search index

3.  BigTable: how Google stores crawl data and the data for Analytics, Earth and Personalized Search in a column-oriented database

Hadoop – historical background

In 2004/5 Doug Cutting was developing Nutch, an open-source web search engine that struggled with processing huge amounts of data.

Doug implemented an analog of the Google File System and named it HADOOP.

Since 2006 Hadoop has been an Apache Software Foundation project.

Hadoop file system (HDFS)

HDFS:

•  is a file system that can store very large data sets

•  scales out across a cluster of hosts

•  is optimized for throughput rather than latency

•  achieves high availability through data replication instead of redundant hardware

•  treats node failures as the norm rather than the exception

[Figure: HDFS architecture. An HDFS client talks to the name node (metadata, block management) and to the data nodes (data read and write).]

http://static.googleusercontent.com/media/research.google.com/et//archive/gfs-sosp2003.pdf
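The replication idea from the HDFS bullets can be illustrated with a small Python sketch. It is not how HDFS itself places blocks: the function names, the node names, the block size, the replication factor of 3 and the round-robin placement are all assumptions made for this example.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into fixed-size blocks and assign every block to
    `replication` distinct nodes (round-robin, for illustration only)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(n_blocks):
        # Each block starts one node further along and wraps around the cluster
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a healthy node."""
    return [block for block, holders in placement.items()
            if any(holder not in failed_nodes for holder in holders)]


# Hypothetical cluster of four data nodes
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(file_size=1000, block_size=256, nodes=nodes)
print(placement)
print(readable_blocks(placement, failed_nodes={"node2"}))
```

With three replicas per block, losing one node still leaves every block readable, which is why node failures can be treated as normal events.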

MAP-REDUCE concept

[Figure: a huge job is split into jobs 1 to 4. Worker 1 does job 1, worker 2 does job 2, worker 3 does job 3 and worker 4 does job 4. The results of jobs 1 and 2 are combined, the results of jobs 3 and 4 are combined, and the combined results are processed into the huge job result.]

MAP-REDUCE is a framework for processing huge jobs:

1.  Split the huge job between workers

2.  Combine the workers' results into a single result

TASK → MAP → REDUCE → RESULT
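The split-and-combine idea can be sketched in Python. This is a single-machine illustration only: threads stand in for the workers, and the job itself (summing numbers) and all function names are assumptions chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, n_workers):
    """Cut the huge job into at most n_workers roughly equal jobs."""
    if not items:
        return []
    size = -(-len(items) // n_workers)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def do_job(job):
    """Work done by one worker: here, summing its share of the numbers."""
    return sum(job)


def run_huge_job(items, n_workers=4):
    jobs = split(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(do_job, jobs))
    # Combine the workers' results into a single result
    return sum(partial_results)


huge_job = list(range(1, 101))
print(run_huge_job(huge_job))  # 5050
```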

How it works? Step 1 – MAP

DATA: 5 baskets of apples, oranges and pears

Task: Find the number of apples, oranges and pears that I have

Initial data: the baskets are spread over Server 1 to Server 5

MAP: in each basket we count the apples, oranges and pears

How it works? Step – Shuffle

[Figure: Shuffle]

How it works? Step 2 – Reduce

[Figure: Reduce. Final result: 50, 42 and 31, one total per fruit type.]
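The three steps can be walked through in plain Python. The basket contents below are hypothetical; they are chosen so that the totals come out as 50, 42 and 31, and which total belongs to which fruit is an assumption, since the slide shows the fruits only as pictures.

```python
from collections import Counter, defaultdict

# Hypothetical basket contents, one basket per server
baskets = [
    ["apple"] * 10 + ["orange"] * 9 + ["pear"] * 6,
    ["apple"] * 12 + ["orange"] * 7 + ["pear"] * 5,
    ["apple"] * 8 + ["orange"] * 10 + ["pear"] * 7,
    ["apple"] * 11 + ["orange"] * 8 + ["pear"] * 6,
    ["apple"] * 9 + ["orange"] * 8 + ["pear"] * 7,
]


def map_step(basket):
    """MAP: count each fruit type within one basket."""
    return list(Counter(basket).items())


def shuffle_step(mapped):
    """SHUFFLE: group the per-basket counts by fruit type."""
    grouped = defaultdict(list)
    for pairs in mapped:
        for fruit, count in pairs:
            grouped[fruit].append(count)
    return dict(grouped)


def reduce_step(grouped):
    """REDUCE: sum the counts of each fruit type."""
    return {fruit: sum(counts) for fruit, counts in grouped.items()}


mapped = [map_step(basket) for basket in baskets]
grouped = shuffle_step(mapped)
totals = reduce_step(grouped)
print(totals)  # {'apple': 50, 'orange': 42, 'pear': 31}
```

In a real cluster each `map_step` call would run on the server that holds the basket, and only the small per-basket counts would travel over the network during the shuffle.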

Hadoop + MAP-REDUCE

The Hadoop file system combined with MAP-REDUCE is a distributed grid with both storage and processing power.

[Figure: Hadoop = processing power + storage]

Hadoop has been adopted!

[Figure: timeline from 2003 to 2010, marking the Google whitepaper (2003) and the Google File System reimplementation]

Hadoop ecosystem

Components on top of the core platform (SQL, non-relational DBMS, fine-grained data handling, scripting, machine learning):

Hive

Data warehouse that provides an SQL interface; the data structure is projected ad hoc onto the underlying unstructured data

HBase

Column-oriented, schema-less, distributed database modeled after Google’s BigTable. Random real-time read/write

Pig

Platform for manipulating and analyzing large data sets; scripting language for analysis

Mahout

Machine learning libraries for recommendations, clustering, classification and item sets

Hadoop Core Platform

HDFS

Distributes and replicates data across machines

MapReduce

Distributes and monitors tasks, restarts failed tasks

Big Data technical stack

[Figure: Big Data technical stack, from data sources to output]

•  Data sources

•  Data integration: batch data integration, streaming data flow

•  Hadoop cluster of hosts: HDFS; processing by batch/MapReduce, script, SQL, on-line database, real-time and machine learning; metadata management (HCatalog); cluster management / monitoring (Ambari)

•  Other stores: key-value stores, columnar databases, document databases, search, in-memory

•  Data mining / modeling tools: data mining, data modeling

•  Business analytics tools: business analytics, forecasting, business intelligence

•  Output

Relational Data vs BIG DATA

Relational data management (structure first):

DATA → Apply data schema → Store in relational database → Apply analytics

BIG DATA management (structure later, schema on READ):

DATA → Store data → Apply data schema → Apply analytics
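A minimal Python sketch of schema on read: the raw records are stored untouched and a schema is applied only when the data is read. The record contents, field names and the `read_with_schema` helper are hypothetical and serve only to illustrate the idea.

```python
import json

# Raw events stored as-is; no schema is enforced at write time
raw_store = [
    '{"user": "u1", "amount": "12.50", "ts": "2014-11-21"}',
    '{"user": "u2", "amount": "7", "channel": "web"}',
    'not even JSON',
]


def read_with_schema(lines, schema):
    """Apply a schema while reading: keep only the requested fields,
    cast them to the requested types, skip records that do not fit."""
    for line in lines:
        try:
            record = json.loads(line)
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (ValueError, KeyError, TypeError):
            continue  # this record does not match the schema


# Two different analyses read the same stored data with different schemas
purchases = list(read_with_schema(raw_store, {"user": str, "amount": float}))
timestamps = list(read_with_schema(raw_store, {"user": str, "ts": str}))
print(purchases)
print(timestamps)
```

The same stored data serves both reads; a relational system would have had to reject or reshape the second and third records at write time.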

Machine Learning

How to find the value in data?

Supervised learning: we have previous knowledge about the sample cases that are the basis for learning

•  Classification

•  Regression

•  Decision Trees

Unsupervised learning: we do not have any previous knowledge about the sample cases that are the basis for learning

•  Clustering

•  Hidden Markov Chains

•  Dimensionality reduction

How does it work – Linear Regression?

Example: Linear Regression

TASK: find the price of a 46 m² apartment

To find the price of a 46 m² apartment we fit a linear relation to the samples.

y = ax + b

1. We assume a linear relation: Price = a * Size + b

2. We calculate the distance of each sample from the line

3. We search for the line equation (the blue line in the plot) with the minimal total distance from the samples

4. Knowing the line function, we calculate the price of the 46 m² apartment

Result: apartment size 46 m², price 56K

[Plot: Price (vertical axis) against Size (horizontal axis), with the samples and the fitted line]
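The four steps can be sketched in Python. Two assumptions are made here: the "total distance" is taken to be the sum of squared vertical distances (ordinary least squares), and the sample apartments are hypothetical, chosen so that the prediction lands near the 56K shown above.

```python
def fit_line(sizes, prices):
    """Least-squares fit of price = a * size + b."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(prices) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b


def total_squared_distance(a, b, sizes, prices):
    """Sum of squared vertical distances of the samples from the line."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(sizes, prices))


# Hypothetical samples: apartment size in m2, price in thousands
sizes = [30, 40, 50, 60, 70]
prices = [41, 49, 60, 71, 79]

a, b = fit_line(sizes, prices)
predicted = a * 46 + b
print(a, b, predicted)
```

For these samples the fitted line is roughly Price = 0.98 * Size + 11, which gives a price of about 56K for 46 m².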

Example: Customer churn

Customer historical data (Churn?) → Decision TREE algorithm → Customer churn prediction rules → Actionable insights for enterprise

Gender   Customer age   Card type   Brand    Sales total (EUR)   Purchase frequency   Purchase No   Churn
Male     37             type1       brand1   62                  1                    123           no
Female   49             type2       brand1   15                  125                  6             no
Male     64             type4       brand1   12                  4                    8             no
Female   47             type2       brand7   31                  97                   3             yes
Male     30             type3       brand2   35                  162                  6             yes
…

Churn prediction rules (excerpt of the decision tree output):

….

purchace.freq.sdev <= 165:

:...purchase.no > 7: no

purchase.no <= 7:

:...purchace.freq.sdev > 86:

:...purchase.no > 4:

: :...purchace.freq.sdev <= 126:

: : :...purchase.no > 5: no

: : : purchase.no <= 5:

: : : :...brand in {brand1,brand2,brand4}: no

: : : brand = brand3: yes

: : purchace.freq.sdev > 126:

: : :...purchase.no <= 6: yes

: : purchase.no > 6:

: : :...purchace.freq.sdev <= 139: no

: : purchace.freq.sdev > 139: yes

….
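Rules of the form "purchase.no > 7" come from the tree algorithm searching for thresholds that separate churners from non-churners. A minimal Python sketch of one such search follows. The slide does not say which split criterion was used, so Gini impurity is an assumption here, and the customer data is hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 'yes'/'no' labels (0 means pure)."""
    if not labels:
        return 0.0
    p_yes = labels.count("yes") / len(labels)
    return 2 * p_yes * (1 - p_yes)


def best_split(values, labels):
    """Try every threshold 'value <= t' and return (t, impurity) for the
    split with the lowest weighted Gini impurity of the two groups."""
    best = None
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Hypothetical customers: number of purchases and whether they churned
purchase_no = [1, 2, 3, 4, 6, 8, 9, 12]
churn = ["yes", "yes", "yes", "yes", "no", "yes", "no", "no"]

threshold, impurity = best_split(purchase_no, churn)
print(threshold, impurity)  # 4 0.1875
```

A full tree repeats this search on each resulting group and over all attributes, which produces nested rules like the excerpt above.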

Example 3: Predict loan payment default?

Example: Bank loan decision

TASK: Find the probability of default for an applicant

To predict the probability of default we use multivariate logistic regression.

1. Logistic function

2. We create a model based on historical data that predicts default from 16 factors (input parameters)

Historical loan application data: 3000 samples. Target: No Default = 0, Default = 1

3. To test the model we split the dataset randomly into a training set (80%) and a test set (20%)

Input parameters (16 columns) and target T (last column):

3 5 6 7 8 9 4 2 3 4 5 2 4 6 2 2   0
2 4 5 6 3 2 5 3 2 5 7 3 6 3 7 2   1
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2   0
3 5 6 7 8 9 4 2 3 4 5 2 4 6 2 2   1
2 4 5 6 3 2 5 3 2 5 7 3 6 3 7 2   1
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2   1
3 5 6 7 8 9 4 2 3 4 5 2 4 6 2 2   1
2 4 5 6 3 2 5 3 2 5 7 3 6 3 7 2   0
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2   1
3 5 6 7 8 9 4 2 3 4 5 2 4 6 2 2   0
2 4 5 6 3 2 5 3 2 5 7 3 6 3 7 2   1
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2   1
3 5 6 7 8 9 4 2 3 4 5 2 4 6 2 2   1
2 4 5 6 3 2 5 3 2 5 7 3 6 3 7 2   1
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2   0
…
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2   0
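The random 80/20 split of the 3000 samples can be sketched in Python. The function name, the fixed seed and the use of integers as stand-ins for loan applications are assumptions made for the example.

```python
import random


def train_test_split(samples, train_fraction=0.8, seed=42):
    """Shuffle a copy of the samples and cut it into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 3000 historical loan applications
samples = list(range(3000))
train, test = train_test_split(samples)
print(len(train), len(test))  # 2400 600
```

The model is fitted on the training set only; the test set is held back so that the error matrix measures performance on applications the model has not seen.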

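Logistic function:

$f(x) = \frac{1}{1 + e^{-x}}$

Error matrix (rows: actual, columns: predicted; class 0, No Default, is treated as the positive class):

            Predicted 0       Predicted 1
Actual 0    True positive     False negative
Actual 1    False positive    True negative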
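A minimal Python sketch of the logistic function and the error matrix. The weights, the bias and the 0.5 decision threshold are hypothetical; in practice the weights would be fitted on the training set. The feature row is the first row of the sample data on this slide.

```python
import math


def logistic(x):
    """f(x) = 1 / (1 + e^(-x)), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_default(features, weights, bias, threshold=0.5):
    """Probability of default from a weighted sum of the input parameters,
    plus the 0/1 decision obtained by thresholding that probability."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    probability = logistic(score)
    return probability, int(probability >= threshold)


def error_matrix(actual, predicted):
    """Counts indexed as matrix[actual][predicted] for the labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


print(logistic(0))  # 0.5

# One applicant with 16 input parameters; weights and bias are made up
features = [3, 5, 6, 7, 8, 9, 4, 2, 3, 4, 5, 2, 4, 6, 2, 2]
probability, decision = predict_default(features, weights=[0.1] * 16, bias=-7.0)
print(probability, decision)

actual = [0, 1, 0, 1, 1, 0]
predicted = [0, 1, 1, 1, 0, 0]
print(error_matrix(actual, predicted))  # [[2, 1], [1, 2]]
```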

Big Data ….

1.  Technology invented by Google, further developed by all big internet companies

2.  Linear scalability, open-source

3.  Decreased costs – low-cost hardware, no licenses

4.  Increased capabilities – schema on read, massive analytics

5.  Machine Learning to discover value in the data

4-step approach to Big Data problems

STEP 1: Knowledge creation. Seminars, workshops, real-life examples

STEP 2: IDEAs discovery. Find potentially valuable data; apply short validation, test

STEP 3: Plan and prototype. Minimalistic prototypes; setup and business value validation

STEP 4: Implementation. Implement fast, low risk; integrate with existing processes

Where to start?

•  Look at tutorials on the internet

•  Read some books about BIG DATA and Machine Learning

•  Participate in online courses (Coursera.org or similar)

•  Experiment with tools – sandboxes, sample setups

•  Participate in online competitions (like Kaggle.com)
