Lauri Ilison, PhD
Data Scientist
21.11.2014
BIG DATA
What is it and how to use it?
Big Data definition?
There is no clear definition for
“BIG DATA”
“BIG DATA” is more of a concept
than a precise term
BIG DATA concept versions
1. Unstructured vs structured - Big Data focuses on unstructured data
2. Big Data as a volume issue - petabyte-scale data (1 PB = 1 million GB)
3. The 3 Vs of Big Data - Volume, Velocity, Variety
[Figure: the 3 Vs of BIG DATA. Volume: MB to GB to TB of data. Velocity: gigabytes of data transported every hour. Variety: social media, purchases, calls, scripts, weather, logs.]
It all started with …
Google Whitepapers
In 2003, 2004 and 2006 Google released three academic papers describing
Google’s technology for massive data processing:
1. Google File System (GFS) - how Google stores all web content
2. Map-Reduce - how Google calculates PageRank and the web search index
3. BigTable - how Google stores web crawl data and the data behind Analytics, Earth and
Personalized Search in a column-oriented database
Hadoop – historical background
In 2004/5 Doug Cutting was developing Nutch,
an open-source web search engine that struggled
with processing huge amounts of data.
Doug implemented an analog of the Google File System
and named it HADOOP.
Since 2006 Hadoop has been an
Apache Software Foundation project.
Hadoop file system (HDFS)
HDFS:
• is a file system that can store very large data sets
• scales out across a cluster of hosts
• is optimized for throughput instead of latency
• achieves high availability through replication of data across hosts instead of redundant hardware
• treats node failures as the norm rather than the exception
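The replication idea in the list above can be illustrated with a small Python sketch. This is not HDFS code: the block size, the replication factor of 3, the data node names and the round-robin placement are all assumptions made for illustration only.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign every block to `replication` distinct nodes.

    Round-robin placement is a simplification chosen for this sketch.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + k) % len(nodes)] for k in range(replication)
        ]
    return placement


def readable_after_failure(placement: dict[int, list[str]], failed_node: str) -> bool:
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical file and cluster.
data = b"x" * 1000
blocks = split_into_blocks(data, block_size=256)
nodes = ["datanode-1", "datanode-2", "datanode-3", "datanode-4"]
placement = place_replicas(len(blocks), nodes, replication=3)
```

Because every block lives on several hosts, losing any single host in this sketch still leaves the whole file readable, which is the sense in which node failure is treated as normal.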
[Figure: HDFS architecture. The HDFS client talks to the name node (metadata, block management) and reads and writes data on the data nodes.]
http://static.googleusercontent.com/media/research.google.com/et//archive/gfs-sosp2003.pdf
MAP-REDUCE concept
[Figure: a huge job is split into jobs 1 to 4. Worker 1 does job 1, worker 2 does job 2, worker 3 does job 3 and worker 4 does job 4. The results of jobs 1 and 2 are combined, as are the results of jobs 3 and 4. The combined results are then processed into the huge job result. Stages: TASK, MAP, REDUCE, RESULT.]
MapReduce is a framework for processing huge jobs:
1. Split the huge job between workers
2. Combine the workers' results into a single result
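The split-and-combine idea above can be sketched in a few lines of Python. This is an illustration on one machine, not Hadoop: the job (summing the numbers 1 to 100), the four workers and the helper names `split_job`, `worker` and `combine_pairwise` are assumptions, and a thread pool stands in for the cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_job(items, n_workers):
    """Split the huge job into at most n_workers roughly equal parts."""
    size = max(1, -(-len(items) // n_workers))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def worker(chunk):
    """Each worker processes only its own part of the job."""
    return sum(chunk)


def combine_pairwise(results):
    """Combine neighbouring results until a single result is left.

    Expects at least one partial result.
    """
    results = list(results)
    while len(results) > 1:
        results = [sum(results[i:i + 2]) for i in range(0, len(results), 2)]
    return results[0]


huge_job = list(range(1, 101))
chunks = split_job(huge_job, n_workers=4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(worker, chunks))
total = combine_pairwise(partial_results)
```

The pairwise combination mirrors the figure, where results 1 and 2 and results 3 and 4 are merged before the final step.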
How does it work? Step 1 – MAP
DATA: 5 baskets of apples, oranges and pears
TASK: Find the number of apples, oranges and pears that I have
[Figure: initial data, one basket on each of Server 1 to Server 5]
In each basket we count the apples, oranges and pears
[Figure: per-basket counts on Server 1 to Server 5, followed by the shuffle]
How does it work? Shuffle step
How does it work? Step 2 – Reduce
[Figure: the shuffled counts on Server 1 to Server 5 are reduced to the final result: x 50, x 42, x 31]
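The map, shuffle and reduce steps of the basket example can be sketched in Python as follows. The basket contents are invented for illustration (they are not the data behind the counts on the slide), and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Five baskets, one per server. Contents are made up for this sketch.
baskets = [
    ["apple", "orange", "apple"],
    ["pear", "apple"],
    ["orange", "orange", "pear"],
    ["apple"],
    ["pear", "orange", "apple"],
]


def map_basket(basket):
    """MAP: count the fruits inside one basket, emit (fruit, count) pairs."""
    return list(Counter(basket).items())


def shuffle(mapped):
    """SHUFFLE: bring all counts for the same fruit together."""
    grouped = defaultdict(list)
    for pairs in mapped:
        for fruit, count in pairs:
            grouped[fruit].append(count)
    return dict(grouped)


def reduce_counts(grouped):
    """REDUCE: add up the per-basket counts of each fruit."""
    return {fruit: sum(counts) for fruit, counts in grouped.items()}


mapped = [map_basket(b) for b in baskets]   # runs independently per basket
grouped = shuffle(mapped)
final_result = reduce_counts(grouped)
```

Only the map step touches the raw baskets. The shuffle and reduce steps work on the much smaller per-basket counts, which is what makes the approach practical on a cluster.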
Hadoop + MAP-REDUCE
The Hadoop file system together with MapReduce is a
distributed grid that provides both
storage and processing power.
[Figure: Hadoop = processing power + storage]
Hadoop has been adopted!
[Figure: adoption timeline from 2003 to 2010, starting with the Google whitepaper (2003) and the Google File System reimplementation]
Hadoop ecosystem
Categories shown: non-relational DBMS, fine-grained data handling, scripting, machine learning
• Hive: data warehouse that provides an SQL interface; the data structure is projected ad hoc onto the underlying unstructured data
• HBase: column-oriented, schema-less, distributed database modeled after Google’s BigTable; random real-time read/write
• Pig: platform for manipulating and analyzing large data sets; scripting language for analysis
• Mahout: machine learning libraries for recommendations, clustering, classification and item sets
Hadoop Core Platform
• HDFS: distributes and replicates data across machines
• MapReduce: distributes and monitors tasks, restarts failed tasks
Big Data technical stack
[Figure: Big Data technical stack, from data sources to output]
• Data sources: …
• Data integration: batch data integration, streaming data flow
• Hadoop cluster of hosts: HDFS
• Processing on the cluster: batch/MapReduce, script, SQL, on-line database, real-time, machine learning
• Metadata management (HCatalog)
• Cluster management / monitoring (Ambari)
• Other components shown: key value stores, columnar databases, document databases, search, in-memory
• Data mining / modeling: data mining, data modeling, business analytics, forecasting, business intelligence
• Output: business analytics tools, data mining / modeling tools
Relational Data vs BIG DATA
Starting point for both: DATA
Relational data management: structure first
1. Apply data schema
2. Store in relational database
3. Apply analytics
BIG DATA management: structure later (schema on READ)
1. Store data
2. Apply data schema
3. Apply analytics
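A minimal Python sketch of "structure later": the raw lines are stored untouched and a schema is applied only when they are read for analysis. The record layout, the field names and the `read_with_schema` helper are assumptions made for this example.

```python
# Raw data is stored exactly as it arrived, with no schema enforced on write.
raw_store = [
    "2014-11-21;customer-17;62.50",
    "2014-11-21;customer-04;15.00",
    "not a purchase record",  # stored anyway; nothing is rejected on write
]

# The schema is defined by the reader, at analysis time.
PURCHASE_SCHEMA = (("date", str), ("customer", str), ("amount_eur", float))


def read_with_schema(raw_lines, schema, sep=";"):
    """Project the schema onto raw lines, skipping lines that do not fit."""
    for line in raw_lines:
        fields = line.split(sep)
        if len(fields) != len(schema):
            continue  # wrong number of fields for this schema
        try:
            yield {name: cast(value) for (name, cast), value in zip(schema, fields)}
        except ValueError:
            continue  # a field could not be converted to the declared type


records = list(read_with_schema(raw_store, PURCHASE_SCHEMA))
total_eur = sum(r["amount_eur"] for r in records)
```

A different analysis could read the same stored lines with a different schema, which is not possible once data has been forced into a relational table on write.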
Machine Learning
How to find the value in data?
Supervised learning: we have previous knowledge about the sample cases that are the basis for learning
• Classification
• Regression
• Decision Trees
Unsupervised learning: we do not have any previous knowledge about the sample cases that are the basis for learning
• Clustering
• Hidden Markov models
• Dimensionality reduction
How does it work? Linear regression
Example: Linear Regression
TASK: find the price of a 46 m2 apartment
To find the price of a 46 m2 apartment, we fit a linear relation to the samples.
y = ax + b
1. We assume a linear relation: Price = a * Size + b
2. We calculate each sample's distance from the line
3. We search for the line (blue in the plot) with the minimal total distance from the samples
4. Knowing the line's equation, we calculate the price of the 46 m2 apartment
Apartment size  Price
46 m2           56K
[Figure: samples plotted as Price against Size, with the fitted line]
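The four steps can be sketched in plain Python. The sketch assumes ordinary least squares, that is, the "distance" being minimised is the sum of squared vertical distances between the samples and the line. The sample sizes and prices are invented for illustration and are not the data behind the slide.

```python
def fit_line(sizes, prices):
    """Return (a, b) of the least-squares line Price = a * Size + b."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(prices) / n
    # Closed-form least-squares solution for a single input variable.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
    s_xx = sum((x - mean_x) ** 2 for x in sizes)
    a = s_xy / s_xx
    b = mean_y - a * mean_x
    return a, b


def total_squared_distance(a, b, sizes, prices):
    """The quantity that the fitted line minimises."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(sizes, prices))


# Hypothetical samples: apartment size in m2, price in thousands.
sizes = [30, 40, 50, 60, 70]
prices = [38, 52, 59, 71, 80]

a, b = fit_line(sizes, prices)
predicted_price = a * 46 + b  # step 4: price of a 46 m2 apartment
```

Any other line, for example one with a slightly different slope, gives a larger `total_squared_distance` on the same samples.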
Example: Customer churn
[Figure: customer historical data (churn?) is fed to the decision tree algorithm, which produces customer churn prediction rules, which give actionable insights for the enterprise]
Gender  Customer age  Card type  Brand   Sales total (EUR)  Purchase frequency  Purchase no  Churn
Male    37            type1      brand1  62                 1                   123          no
Female  49            type2      brand1  15                 125                 6            no
Female  38            type3      brand3  116                31                  5            no
Male    64            type4      brand1  12                 4                   8            no
Female  30            type5      brand6  47                 21                  43           no
Female  30            type4      brand1  25                 82                  16           no
Female  47            type2      brand7  31                 97                  3            yes
Male    30            type3      brand2  35                 162                 6            yes
Female  51            type1      brand3  24                 88                  73           no
Female  30            type3      brand2  31                 32                  22           no
Male    42            type4      brand3  57                 279                 3            yes
Female  30            type1      brand1  25                 175                 11           no
Female  30            type3      brand2  54                 5                   40           no
Male    30            type2      brand7  44                 467                 3            yes
…
Female  30            type3      brand1  46                 150                 3            no
….
purchace.freq.sdev <= 165:
:...purchase.no > 7: no
    purchase.no <= 7:
    :...purchace.freq.sdev > 86:
        :...purchase.no > 4:
        :   :...purchace.freq.sdev <= 126:
        :   :   :...purchase.no > 5: no
        :   :   :   purchase.no <= 5:
        :   :   :   :...brand in {brand1,brand2,brand4}: no
        :   :   :       brand = brand3: yes
        :   :   purchace.freq.sdev > 126:
        :   :   :...purchase.no <= 6: yes
        :   :       purchase.no > 6:
        :   :       :...purchace.freq.sdev <= 139: no
        :   :           purchace.freq.sdev > 139: yes
….
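The printed rules can be read as nested if/else conditions. The Python sketch below encodes only the branches visible in the excerpt and returns `None` wherever the tree is elided. The function name `predict_churn` and its argument names are assumptions; they are not part of the decision tree tool's output.

```python
from typing import Optional


def predict_churn(freq_sdev: float, purchase_no: int, brand: str) -> Optional[str]:
    """Follow the printed rules; None means the branch is not in the excerpt."""
    if freq_sdev > 165:
        return None  # elided in the excerpt
    if purchase_no > 7:
        return "no"
    # From here on purchase_no <= 7.
    if freq_sdev <= 86:
        return None  # elided in the excerpt
    if purchase_no <= 4:
        return None  # elided in the excerpt
    # From here on 86 < freq_sdev <= 165 and purchase_no is 5, 6 or 7.
    if freq_sdev <= 126:
        if purchase_no > 5:
            return "no"
        if brand in {"brand1", "brand2", "brand4"}:
            return "no"
        if brand == "brand3":
            return "yes"
        return None  # other brands are not listed in the excerpt
    # From here on freq_sdev > 126.
    if purchase_no <= 6:
        return "yes"
    return "no" if freq_sdev <= 139 else "yes"
```

Rules in this form are what makes a decision tree actionable: each path from the root to a leaf is a readable condition that a business user can check against a customer record.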
Example 3: Predict loan payment default?
Example: Bank loan decision
TASK: Find the probability of default for an applicant
To predict the probability of default we use multivariate logistic regression.
1. Logistic function
2. We create a model, based on historical data, that predicts default from 16 factors (parameters)
3. To test the model, we split the dataset randomly into a training set (80%) and a test set (20%)
Historical loan application data: 3000 samples
Target: No Default = 0, Default = 1
Input parameters (16 per row) and target T (last column)
3 5 6 7 8 9 4 2 3 4 5 2 4 6 2 2   0
2 4 5 6 3 2 5 3 2 5 7 3 6 3 7 2   1
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2   0
3 5 6 7 8 9 4 2 3 4 5 2 4 6 2 2   1
2 4 5 6 3 2 5 3 2 5 7 3 6 3 7 2   1
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2   1
3 5 6 7 8 9 4 2 3 4 5 2 4 6 2 2   1
2 4 5 6 3 2 5 3 2 5 7 3 6 3 7 2   0
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2   1
3 5 6 7 8 9 4 2 3 4 5 2 4 6 2 2   0
2 4 5 6 3 2 5 3 2 5 7 3 6 3 7 2   1
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2   1
3 5 6 7 8 9 4 2 3 4 5 2 4 6 2 2   1
2 4 5 6 3 2 5 3 2 5 7 3 6 3 7 2   1
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2   0
…
3 7 5 4 2 4 7 6 2 5 2 6 5 4 7 2   0
Logistic function: $f(x) = \frac{1}{1 + e^{-x}}$
Error matrix
          Predicted 0     Predicted 1
Actual 0  True positive   False negative
Actual 1  False positive  True negative
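The building blocks named in this example (the logistic function, the random 80/20 split and the error matrix) can be sketched in plain Python. This is not the model from the slide: no model is fitted here, the row identifiers and labels are invented, and the helper names are assumptions.

```python
import math
import random


def logistic(x: float) -> float:
    """f(x) = 1 / (1 + e^(-x)), written to avoid overflow for very negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and cut it into (training set, test set)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def error_matrix(actual, predicted):
    """Count outcomes as matrix[actual][predicted] for the classes 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# 3000 hypothetical row identifiers standing in for the loan applications.
train_rows, test_rows = train_test_split(range(3000))

# Made-up labels, only to show how the error matrix is filled in.
actual = [0, 0, 1, 1, 1]
predicted = [0, 1, 1, 1, 0]
matrix = error_matrix(actual, predicted)
```

In a fitted logistic regression, `logistic` would be applied to a weighted sum of the 16 input parameters, and the resulting probability would be thresholded to get the predicted class that goes into the error matrix.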
Big Data ….
1. Technology invented by Google and further developed by
all the big internet companies
2. Linear scalability, open source
3. Decreased costs – low-cost hardware, no licenses
4. Increased capabilities – schema on read, massive
analytics
5. Machine learning to discover value in the data
4-step approach for Big Data problems
STEP 1 - Knowledge creation: seminars, workshops, real-life examples
STEP 2 - Ideas discovery: find potentially valuable data, apply short validation, test
STEP 3 - Plan and prototype: minimalistic prototypes, setup and business value validation
STEP 4 - Implementation: implement fast and with low risk, integrate with existing processes