BIG DATA What it is and how to use?

Academic year: 2021

Lauri Ilison, PhD

Data Scientist

21.11.2014

BIG DATA

What it is and how to use?

Big Data definition?

There is no clear definition for “BIG DATA”.

“BIG DATA” is more of a concept than a precise term.

BIG DATA concept versions

1. Unstructured vs structured - Big Data focuses on unstructured data

2. Big Data could be a volume issue - petabyte-scale data (1 PB = 1 million GB)

3. The 3 Vs of Big Data - Volume, Velocity, Variety

•  Volume: MB to GB to TB of data

•  Velocity: GBs of data transported every hour

•  Variety: social media, purchases, calls, scripts, weather, logs

It all started with …

Google Whitepapers

In 2003, 2004 and 2006 Google published three academic papers describing Google’s technology for massive data processing:

1.  Google File System (GFS) - how Google stores all web content

2.  MapReduce - how Google calculates PageRank and the web search index

3.  Bigtable - how Google stores crawl data and data for Analytics, Earth and Personalized Search in a column-oriented database

Hadoop – historical background

In 2004/5 Doug Cutting was developing Nutch, an open-source web search engine that struggled with processing huge amounts of data.

Doug implemented an analog of the Google File System and named it Hadoop.

Since 2006 Hadoop has been an Apache Software Foundation project.

Hadoop file system (HDFS)

HDFS:

•  is a file system that can store very large data sets

•  scales out across a cluster of hosts

•  is optimized for throughput instead of latency

•  achieves high availability by replicating data across hosts instead of relying on redundant hardware

•  expects node failures to be the norm rather than the exception

[Figure: HDFS architecture. Components: HDFS client, name node (metadata, block management), three data nodes (data read and write).]

http://static.googleusercontent.com/media/research.google.com/et//archive/gfs-sosp2003.pdf
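The replication idea in the HDFS bullet points can be illustrated with a small simulation. This is only a sketch under stated assumptions: the 128 MB block size and the replication factor of 3 are assumed typical settings, the node names are hypothetical, and the round-robin placement is a simplification (real HDFS placement is rack-aware).

```python
BLOCK_SIZE_MB = 128   # assumed block size, not taken from the slides
REPLICATION = 3       # assumed replication factor, not taken from the slides


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file is cut into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Build name-node-style metadata: block index -> nodes holding a copy.

    Simplified round-robin placement; each block lands on distinct nodes.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    return {
        block: [data_nodes[(block + r) % len(data_nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has a copy on a node that did not fail."""
    return all(
        any(node != failed_node for node in nodes)
        for nodes in placement.values()
    )


# Hypothetical cluster of four data nodes storing one 300 MB file.
nodes = ["datanode1", "datanode2", "datanode3", "datanode4"]
blocks = split_into_blocks(300)
placement = place_replicas(len(blocks), nodes)
survives = all(readable_after_failure(placement, node) for node in nodes)
```

With three copies of every block on distinct nodes, losing any single node leaves the whole file readable, which is why node failures can be treated as routine.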

MAP-REDUCE concept

MapReduce is a framework for processing huge jobs:

1.  Split the huge job between workers

2.  Combine the workers' results into a single result

[Figure: a huge job is split into jobs 1 to 4; workers 1 to 4 each do one job; the results of jobs 1 and 2 are combined, as are the results of jobs 3 and 4; the combined results are processed into the huge job result. Stages: TASK, MAP, REDUCE, RESULT.]
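The split-and-combine idea can be sketched in a few lines of Python. This is an illustration only, not Hadoop's API: threads stand in for cluster workers, the "huge job" is assumed to be summing a list of numbers, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(job, parts):
    """Cut the huge job (a list) into `parts` nearly equal sub-jobs."""
    size, extra = divmod(len(job), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(job[start:end])
        start = end
    return chunks


def combine(left, right):
    """Combine two partial results into one."""
    return left + right


def run_huge_job(job, workers=4):
    sub_jobs = split(job, workers)
    # Each worker does one sub-job (here: summing its chunk).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(sum, sub_jobs))
    # Combine results pairwise until a single result remains.
    while len(partial) > 1:
        paired = [combine(partial[i], partial[i + 1])
                  for i in range(0, len(partial) - 1, 2)]
        if len(partial) % 2:
            paired.append(partial[-1])
        partial = paired
    return partial[0]


huge_job = list(range(1, 101))   # hypothetical workload: the numbers 1..100
total = run_huge_job(huge_job)
```

The pairwise loop mirrors the figure: results 1 and 2 are combined, results 3 and 4 are combined, and the two intermediate results are then merged into the final one.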

How it works? Step 1 – MAP

Data: 5 baskets of apples, oranges, pears, one basket on each of Server 1 to Server 5

Task: Find the number of apples, oranges and pears that I have

In each basket we count apples, oranges, pears

How it works? Step – Shuffle

The per-basket counts are regrouped so that all counts for one kind of fruit end up together

How it works? Step 2 – Reduce

The counts for each kind of fruit are added up

Final result: ×50, ×42, ×31 (one total per kind of fruit)
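The three steps of the fruit example (map, shuffle, reduce) can be sketched as plain Python functions. The basket contents below are made up for illustration and do not reproduce the totals from the slide; the function names are hypothetical, and everything runs in one process rather than on five servers.

```python
from collections import Counter, defaultdict

# Hypothetical baskets, one per server.
baskets = [
    ["apple", "orange", "apple"],
    ["pear", "apple"],
    ["orange", "orange", "pear"],
    ["apple"],
    ["pear", "orange", "apple"],
]


def map_basket(basket):
    """MAP: count each kind of fruit inside a single basket."""
    return list(Counter(basket).items())


def shuffle(mapped):
    """SHUFFLE: group the per-basket counts by kind of fruit."""
    grouped = defaultdict(list)
    for pairs in mapped:
        for fruit, count in pairs:
            grouped[fruit].append(count)
    return dict(grouped)


def reduce_counts(grouped):
    """REDUCE: add up the counts for each kind of fruit."""
    return {fruit: sum(counts) for fruit, counts in grouped.items()}


mapped = [map_basket(basket) for basket in baskets]
grouped = shuffle(mapped)
totals = reduce_counts(grouped)
```

The map step needs only one basket at a time, so it can run on every server independently; only the small per-basket counts travel across the network during the shuffle.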

Hadoop + MAP-REDUCE

The Hadoop file system together with MapReduce forms a distributed grid that provides both storage and processing power.

Hadoop has been adopted!

[Figure: timeline from 2003 to 2010, with the labels "Google whitepaper" (2003) and "Google File System reimplementation".]

Hadoop ecosystem

Categories: non-relational DBMS, fine-grained data handling, scripting, machine learning

Hive

Data warehouse that provides an SQL interface; the data structure is projected ad hoc onto the underlying unstructured data

HBase

Column-oriented, schemaless, distributed database modeled after Google’s Bigtable. Random, real-time read/write

Pig

Platform for manipulating and analyzing large data sets, with a scripting language for analysis

Mahout

Machine learning libraries for recommendations, clustering, classification and item sets

Hadoop Core Platform

HDFS

Distributes and replicates data across machines

MapReduce

Distributes and monitors tasks, restarts failed tasks

Big Data technical stack

•  Data sources

•  Data integration: batch data integration, streaming data flow

•  Hadoop cluster of hosts: HDFS; batch/MapReduce, script, SQL, on-line database, real-time, machine learning; metadata management (HCatalog); cluster management / monitoring (Ambari)

•  Also shown: key-value stores, columnar databases, document databases, search, in-memory

•  Output: data mining / modeling (data mining, data modeling) and business analytics (forecasting, business intelligence), served by data mining / modeling tools and business analytics tools

Relational Data vs BIG DATA

Relational data management (structure first): data → apply data schema → store in relational database → apply analytics

BIG DATA management (structure later, schema on READ): data → store data → apply data schema → apply analytics
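A minimal sketch of schema on read, assuming raw comma-separated purchase records: the records, field names and schemas below are hypothetical. The data is stored exactly as it arrived, and a schema is applied only when a particular analysis reads it.

```python
import csv
import io

# Raw data stored as-is, with no schema applied at write time (made-up records).
RAW = "2014-11-21,anna,12.50\n2014-11-21,bert,7.00\n2014-11-22,anna,3.25\n"

# A schema is chosen by the reader: (field name, type conversion) per column.
PURCHASE_SCHEMA = [("date", str), ("customer", str), ("amount_eur", float)]


def read_with_schema(raw_text, schema):
    """Apply the schema while reading; columns beyond the schema are ignored."""
    for row in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


# Analysis 1: total spend per customer.
total_by_customer = {}
for purchase in read_with_schema(RAW, PURCHASE_SCHEMA):
    customer = purchase["customer"]
    total_by_customer[customer] = total_by_customer.get(customer, 0.0) + purchase["amount_eur"]

# Analysis 2: a different, narrower schema over the same stored data.
DAILY_SCHEMA = [("date", str)]
days = sorted({row["date"] for row in read_with_schema(RAW, DAILY_SCHEMA)})
```

Because the stored data never changes, a new question only needs a new schema, not a reload of the data.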

Machine Learning

How to find the value in data?

Supervised learning: we have previous knowledge about the sample cases that are the basis for learning

•  Classification

•  Regression

•  Decision Trees

Unsupervised learning: we do not have any previous knowledge about the sample cases that are the basis for learning

•  Clustering

•  Hidden Markov Models

•  Dimensionality reduction

How does it work – Linear Regression?

Example: Linear Regression

TASK: find the price for a 46 m² apartment

To find the price of a 46 m² apartment, we find the linear relation between the samples.

y = ax + b

1. We assume a linear relation: Price = a * Size + b

2. We calculate each sample's distance from the line

3. We search for the line equation (the blue line in the figure) with the minimal total distance from the samples

4. Knowing the line function, we calculate the price for the 46 m² apartment

Apartment size   Price

46 m²   56K

[Figure: scatter plot of the samples with the fitted line; axes: Size (horizontal), Price (vertical).]
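The four steps can be sketched with ordinary least squares, which takes "total distance" to mean the sum of squared vertical distances between the samples and the line; that choice is an assumption, since the slide does not name the exact distance measure. The sample sizes and prices are made up and are not the data behind the slide's 56K answer.

```python
def fit_line(sizes, prices):
    """Least-squares fit of Price = a * Size + b; returns (a, b)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(prices) / n
    # Covariance of size and price, and variance of size (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict(a, b, size):
    """Price predicted by the fitted line for a given apartment size."""
    return a * size + b


# Hypothetical samples: apartment size in m2 and price in thousands.
sizes = [30, 40, 50, 60]
prices = [41, 49, 61, 69]

a, b = fit_line(sizes, prices)
price_46 = predict(a, b, 46)
```

For these made-up samples the fitted slope is 0.96 and the intercept 11.8, so the predicted price for 46 m² is about 55.96 (thousand).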

Example: Customer churn

Customer historical data (Churn?) → Decision TREE algorithm → Customer churn prediction rules → Actionable insights for enterprise

Gender   Customer age   Card type   Brand    Sales total (EUR)   Purchase frequency   Purchase no.   Churn
Male     37             type1       brand1   62                  1                    123            no
Female   49             type2       brand1   15                  125                  6              no
Female   38             type3       brand3   116                 31                   5              no
Male     64             type4       brand1   12                  4                    8              no
Female   30             type5       brand6   47                  21                   43             no
Female   30             type4       brand1   25                  82                   16             no
Female   47             type2       brand7   31                  97                   3              yes
Male     30             type3       brand2   35                  162                  6              yes
Female   51             type1       brand3   24                  88                   73             no
Female   30             type3       brand2   31                  32                   22             no
Male     42             type4       brand3   57                  279                  3              yes
Female   30             type1       brand1   25                  175                  11             no
Female   30             type3       brand2   54                  5                    40             no
Male     30             type2       brand7   44                  467                  3              yes
…        ..             ..          ..       .                   ...                  ..             ..
Female   30             type3       brand1   46                  150                  3              no

….

purchace.freq.sdev <= 165:
:...purchase.no > 7: no
    purchase.no <= 7:
    :...purchace.freq.sdev > 86:
        :...purchase.no > 4:
        :   :...purchace.freq.sdev <= 126:
        :   :   :...purchase.no > 5: no
        :   :   :   purchase.no <= 5:
        :   :   :   :...brand in {brand1,brand2,brand4}: no
        :   :   :       brand = brand3: yes
        :   :   purchace.freq.sdev > 126:
        :   :   :...purchase.no <= 6: yes
        :   :       purchase.no > 6:
        :   :       :...purchace.freq.sdev <= 139: no
        :   :           purchace.freq.sdev > 139: yes

….
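To show how such rules become actionable, here is one reading of the visible part of the tree written as a Python function. The nesting is an interpretation of the printed rules, the function and parameter names are hypothetical, and every branch that the excerpt elides returns `None` instead of a guess.

```python
def predict_churn(freq_sdev, purchase_no, brand):
    """Return "yes" or "no" for the visible rules, or None where the tree is elided.

    freq_sdev   -> purchace.freq.sdev in the printed rules
    purchase_no -> purchase.no in the printed rules
    """
    if freq_sdev > 165:
        return None                      # branch not shown in the excerpt
    if purchase_no > 7:
        return "no"
    if freq_sdev <= 86:
        return None                      # branch not shown in the excerpt
    if purchase_no <= 4:
        return None                      # branch not shown in the excerpt
    if freq_sdev <= 126:
        if purchase_no > 5:
            return "no"
        if brand in {"brand1", "brand2", "brand4"}:
            return "no"
        if brand == "brand3":
            return "yes"
        return None                      # other brands are not listed
    if purchase_no <= 6:
        return "yes"
    return "no" if freq_sdev <= 139 else "yes"


# Hypothetical customers checked against the rules.
examples = [
    predict_churn(100, 8, "brand1"),
    predict_churn(100, 5, "brand3"),
    predict_churn(130, 6, "brand1"),
    predict_churn(150, 7, "brand2"),
]
```

Each path from the root to a leaf is a plain if/then rule, which is what makes decision trees easy to hand over to a business team.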

Example 3: Predict loan payment default?

Example: Bank loan decision

TASK: Find the probability of default for an applicant

To predict the probability of default we use multivariate logistic regression:

1. Logistic function

2. We build a model from historical data that predicts default from 16 factors (parameters)

3. To test the model, we randomly split the dataset into a training set (80%) and a test set (20%)

Historical loan application data: 3000 samples

Target: No Default = 0, Default = 1

Input parameters (16 factors) and target T:

3 5 6 7 8 9 4 2 3 4 5 2 4 6 2 2 | 0
2 4 5 6 3 2 5 3 2 5 7 3 6 3 7 2 | 1
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2 | 0
3 5 6 7 8 9 4 2 3 4 5 2 4 6 2 2 | 1
2 4 5 6 3 2 5 3 2 5 7 3 6 3 7 2 | 1
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2 | 1
3 5 6 7 8 9 4 2 3 4 5 2 4 6 2 2 | 1
2 4 5 6 3 2 5 3 2 5 7 3 6 3 7 2 | 0
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2 | 1
3 5 6 7 8 9 4 2 3 4 5 2 4 6 2 2 | 0
2 4 5 6 3 2 5 3 2 5 7 3 6 3 7 2 | 1
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2 | 1
3 5 6 7 8 9 4 2 3 4 5 2 4 6 2 2 | 1
2 4 5 6 3 2 5 3 2 5 7 3 6 3 7 2 | 1
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2 | 0
… … … … ..
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2 | 0

Error matrix

             Predicted 0      Predicted 1
Actual 0     True positive    False negative
Actual 1     False positive   True negative

Logistic function:

$$f(x) = \frac{1}{1 + e^{-x}}$$
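The building blocks named in this example (the logistic function, a random 80/20 split and the error matrix) can be sketched with the Python standard library. This is not the model from the talk: no training is shown, the function names are hypothetical, and the weights passed to `default_probability` would have to come from a fitting procedure that is outside this sketch.

```python
import math
import random


def logistic(x):
    """Logistic function f(x) = 1 / (1 + e^(-x)); maps any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def default_probability(weights, bias, factors):
    """Probability of default for one applicant, given already fitted weights."""
    z = bias + sum(w * f for w, f in zip(weights, factors))
    return logistic(z)


def train_test_split(samples, train_share=0.8, seed=0):
    """Randomly split the samples into a training set and a test set."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


def error_matrix(actual, predicted):
    """2x2 matrix of counts: rows are actual classes, columns predicted classes."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# 3000 placeholder sample ids, split 80/20 as in the example.
train, test = train_test_split(list(range(3000)))

# Made-up labels to show how the error matrix is filled in.
matrix = error_matrix(actual=[0, 0, 1, 1, 1], predicted=[0, 1, 1, 1, 0])
```

Evaluating on the held-out 20% shows how the model behaves on applications it has not seen, which is what matters for a real loan decision.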

Big Data ….

1.  Technology invented by Google, further developed by all big internet companies

2.  Linear scalability, open-source

3.  Decreased costs – low-cost hardware, no licenses

4.  Increased capabilities – schema on read, massive analytics

5.  Machine Learning to discover value in the data

4-step approach for Big Data problems

STEP 1: Knowledge creation. Seminars, workshops; real-life examples

STEP 2: IDEAs discovery. Find potentially valuable data; apply short validation, test

STEP 3: Plan and prototype. Minimalistic prototypes; setup and business value validation

STEP 4: Implementation. Implement fast, low risk; integrate with existing processes

Where to start?

•  Look at tutorials on the internet

•  Read some books about BIG DATA and Machine Learning

•  Participate in on-line courses (Coursera.org or similar)

•  Experiment with tools – sandboxes, sample setups

•  Participate in online competitions (like Kaggle.com)

If you are interested?

Nortal has interesting Big Data and

Machine Learning tasks to solve!

Lauri Ilison, PhD

email: [email protected]
