Bringing Big Data Modelling into the
Hands of Domain Experts
David Willingham
Senior Application Engineer MathWorks
“Data is the sword of the 21st century, those who
wield it the samurai.”
Big data — how to create
it, manipulate it, and put it
to good use.
“If you want to work at
Google, make sure you
can use MATLAB.”
Who is looking to do Big Data Analytics now?
Software Developers
– Programmers using languages such as Java / .NET
– They may not be domain experts
End Users (non-technical)
– Non programmers
– They may or may not be domain experts
Domain Users
– Scientists, Engineers, Analysts, etc.
Key Industries
Aerospace and defense
Automotive
Biotech and pharmaceutical
Communications
Education
Electronics and semiconductors
Energy production
Financial services
Industrial automation and machinery
Medical devices
Tesco uses supply chain analytics to
save £100m a year.
4 years of sales data held in
a Teradata data warehouse.
MATLAB is used to forecast
the optimum stock levels.
“We can run 100 stores for 100 days in about half an hour.
We can figure out quickly whether what we are doing is
Preventing lethal ship collisions with whales
Cornell University collected
terabytes of ocean acoustic
data.
Crowdsourced algorithms
to detect and classify
animal signals in the
presence of noise.
“A data set that would have taken months to process can
Earth’s topography on a Miller cylindrical projection, created with MATLAB and Mapping Toolbox.
MathWorks Today
More than 1 million users worldwide.
3500 universities
21 AU universities have MATLAB campus-
Challenges of Big Data
“Any collection of data sets so large and complex that it becomes difficult to process using … traditional data processing applications.”
(Wikipedia)
Getting started
Rapid data exploration
Development of scalable algorithms
Big Data Capabilities in MATLAB
Memory and Data Access
64-bit processors
Memory Mapped Variables
Disk Variables
Databases
Datastores
Platforms
Desktop (Multicore, GPU)
Clusters
Programming Constructs
Streaming
Block Processing
Parallel-for loops
GPU Arrays
SPMD and Distributed Arrays
MapReduce
MATLAB and Memory
Best Practices for Memory Usage
Use the appropriate data storage
– Use only the precision you need
– Sparse Matrices
– Categorical Arrays
– Be aware of overhead of cells and structures
Minimize Data Copies
– In-place operations, if possible
– Use nested functions
– If using objects, consider handle classes
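The storage choices above can be compared directly with whos. The following is a minimal sketch, not taken from the slides: the variable names and array sizes are made up for illustration, and exact byte counts for sparse, cell, and categorical data may vary by MATLAB release.

```matlab
% Compare the memory footprint of different storage choices.
% All variable names and sizes here are hypothetical.
n = 1000;
xDouble = ones(n);              % double precision: 8 bytes per element
xSingle = ones(n, 'single');    % single precision: 4 bytes per element
dense   = eye(n);               % stores all n^2 entries, mostly zeros
sparseI = speye(n);             % stores only the n nonzero entries

labelsCell = repmat({'Car'; 'SUV'}, 5000, 1);   % cell array of char vectors
labelsCat  = categorical(labelsCell);           % integer codes plus 2 categories

% whos returns one struct per variable, including its size in bytes.
w = whos('xDouble', 'xSingle', 'dense', 'sparseI', 'labelsCell', 'labelsCat');
bytes = containers.Map({w.name}, {w.bytes});

for name = keys(bytes)
    fprintf('%-12s %10d bytes\n', name{1}, bytes(name{1}));
end
```

On a typical installation the single-precision array should take half the space of the double-precision one, and the sparse and categorical variables should be much smaller than their dense and cell counterparts.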
Considerations for Choosing an Approach
Data characteristics
– Size, type and location of your data
Compute platform
– Single desktop machine or cluster
Analysis Characteristics
– Embarrassingly Parallel
– Analyze sub-segments of data and aggregate results – Operate on entire dataset
Techniques for Big Data in MATLAB
[Chart: techniques ordered by complexity, from embarrassingly parallel to non-partitionable, for out-of-memory and in-memory data]
Load, Analyze, Discard
– parfor, datastore
MapReduce
Distributed Memory
– SPMD and distributed arrays
Parallel Computing with MATLAB
[Diagram: MATLAB Desktop (Client) connected to a pool of workers]
Demo: Determining Land Use
Using Parallel for-loops (parfor)
Data
– Aerial images of agricultural land
– 24 TIF files
Analysis
– Find and measure irrigation fields
– Determine which irrigation circles are in use (by color)
– Calculate area under irrigation
When to Use parfor
Data Characteristics
– Can be of any format (e.g. text, images) as long as it can be broken into pieces
– The data for each iteration must fit in memory
Compute Platform
– Desktop (Parallel Computing Toolbox)
– Cluster (MATLAB Distributed Computing Server)
Analysis Characteristics
– Each iteration of your loop must be independent
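A parfor loop that meets these conditions might look like the sketch below. It is not the demo code: the 24 synthetic matrices stand in for the TIF files, and the pixel count is only a stand-in for the real irrigation-field measurement. Without Parallel Computing Toolbox the loop still runs, just serially.

```matlab
% Hypothetical stand-in for the 24 image files: small synthetic images.
numImages = 24;
irrigatedArea = zeros(1, numImages);   % sliced output, one slot per iteration

parfor k = 1:numImages
    % Each iteration builds and analyzes its own image. Nothing is read
    % from or written to another iteration, so the order does not matter.
    img = zeros(64, 64);
    img(1:k, 1:k) = 1;                     % k-by-k block of "irrigated" pixels
    irrigatedArea(k) = nnz(img > 0.5);     % pixel count as a proxy for area
end

totalArea = sum(irrigatedArea);            % aggregate after the loop
```

Only the per-image result is kept, so memory use is bounded by one image per worker rather than by the whole collection.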
Access Big Data
datastore
Easily specify data set
– Single text file (or collection of text files)
Preview data structure and format
Select data to import using column names
Incrementally read subsets of the data
airdata = datastore('*.csv');
airdata.SelectedVariableNames = {'Distance', 'ArrDelay'};
Demo: Vehicle Registry Analysis
Using a datastore
Data
– Massachusetts Vehicle Registration Data from 2008-2011
– 16M records, 45 fields
Analysis
– Examine hybrid adoptions
– Calculate % of hybrids registered by quarter
– Fit growth to predict further adoption
When to Use datastore
Data Characteristics
– Text data in files, databases or stored in the Hadoop Distributed File System (HDFS)
Compute Platform
– Desktop
Analysis Characteristics
– Supports Load, Analyze, Discard workflows
– Incrementally read chunks of data, process within a while loop
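The Load, Analyze, Discard loop can be sketched as follows. This is an illustration under stated assumptions: the CSV file and its contents are generated here so the example is self-contained, and the ReadSize property is assumed to be available (early releases used a different property name for the chunk size).

```matlab
% Create a small CSV file so the sketch is self-contained (hypothetical data).
csvFile = [tempname '.csv'];
T = table((1:1000)', mod((1:1000)', 7), ...
    'VariableNames', {'Distance', 'ArrDelay'});
writetable(T, csvFile);

ds = datastore(csvFile);
ds.SelectedVariableNames = {'ArrDelay'};   % import only the column we need
ds.ReadSize = 100;                         % rows per chunk (assumed property)

total = 0;
count = 0;
while hasdata(ds)
    chunk = read(ds);                      % Load: next chunk as a table
    total = total + sum(chunk.ArrDelay);   % Analyze: update running totals
    count = count + height(chunk);
end                                        % Discard: chunk is overwritten each pass

meanDelay = total / count;
delete(csvFile);
```

Only the running totals survive between iterations, which is what lets the same loop handle files far larger than memory.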
Reading in Part of a Dataset from Files
Text file, ASCII file
– datastore
MAT file
– Load and save part of a variable using matfile
Binary file
– Read and write directly to/from file using memmapfile
– Maps address space to file
Databases
–
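As a rough illustration of partial reads from MAT and binary files, the sketch below writes two small temporary files and then reads back only a slice of each. The file names and contents are hypothetical; real data sets would be far larger.

```matlab
% --- MAT-file: load part of a variable with matfile ---
matName = [tempname '.mat'];
X = magic(8);
save(matName, 'X', '-v7.3');       % v7.3 format supports efficient partial loading
m = matfile(matName);
block = m.X(1:2, 1:3);             % reads only this 2-by-3 block from disk

% --- Binary file: map the file into memory with memmapfile ---
binName = [tempname '.bin'];
fid = fopen(binName, 'w');
fwrite(fid, 1:100, 'double');      % 100 doubles written as raw bytes
fclose(fid);
mm = memmapfile(binName, 'Format', 'double');
part = mm.Data(11:20);             % only the touched region is paged in

% Release the mapping before deleting the files.
clear mm
delete(matName);
delete(binName);
```

In both cases indexing drives the I/O: the full variable or file never has to be in memory at once.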
Analyze Big Data
mapreduce
Use the powerful MapReduce programming technique to analyze big data
– mapreduce uses a datastore to process data in small chunks that individually fit into memory
– Useful for processing multiple keys, or when intermediate results do not fit in memory
mapreduce on the desktop
– Increase compute capacity (Parallel Computing Toolbox)
– Access data on HDFS to develop algorithms for use on Hadoop
mapreduce with Hadoop
– Run on Hadoop using MATLAB Distributed Computing Server
********************************
* MAPREDUCE PROGRESS *
********************************
Map 0% Reduce 0%
Map 20% Reduce 0%
Map 40% Reduce 0%
Map 60% Reduce 0%
Map 80% Reduce 0%
Map 100% Reduce 25%
Map 100% Reduce 50%
Map 100% Reduce 75%
Map 100% Reduce 100%
mapreduce
Stages: Datastore, Map, Shuffle and Sort, Reduce

Datastore (input records)
Veh_typ  Q3_08  Q4_08  Q1_09  Hybrid
Car      1      1      1      0
SUV      0      1      1      1
Car      1      1      1      1
Car      0      0      1      1
Car      0      1      1      1
Car      1      1      1      1
Car      0      0      1      1
SUV      0      1      1      0
Car      1      1      1      0
SUV      1      1      1      1
Car      0      1      1      1
Car      1      0      0      0

Map, then Shuffle and Sort
[Diagram: Hybrid values emitted by the map phase, grouped under the keys Q3_08, Q4_08 and Q1_09]

Reduce (output)
Key    % Hybrid (Value)
Q3_08  0.4
Q4_08  0.67
Q1_09  0.75
Demo: Vehicle Registry Analysis
Using MapReduce
Data
– Massachusetts Vehicle Registration Data from 2008-2011
– 16M records, 45 fields
Analysis
– Examine hybrid adoptions
– Calculate % of hybrids registered by quarter and by regional area
– Create map of results
The Big Data Platform
[Diagram: a datastore reads data held on HDFS nodes; Map and Reduce tasks run across the nodes]
Explore and Analyze Data on Hadoop
[Diagram: MATLAB MapReduce code uses a datastore and MATLAB Distributed Computing Server to run Map and Reduce tasks on the HDFS data nodes]
Deployed Applications with Hadoop
[Diagram: a deployed application built on the MATLAB runtime uses a datastore to run Map and Reduce tasks on the HDFS data nodes]
When to Use mapreduce
Data Characteristics
– Text data in files, databases or stored in the Hadoop Distributed File System (HDFS)
– Dataset will not fit into memory
Compute Platform
– Desktop
– Scales to run within Hadoop MapReduce on data in HDFS
Analysis Characteristics
– Must be able to be partitioned into two phases (map and reduce)
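A two-phase analysis of this kind might be sketched as below. This is not the demo code: the data, the column names Quarter and Hybrid, and the functions hybridMapper and hybridReducer are all made up for illustration. It runs serially on the desktop via mapreducer(0), and it assumes a release that allows local functions in scripts (R2016b or later).

```matlab
% Hypothetical data: one row per registration, with quarter and hybrid flag.
csvFile = [tempname '.csv'];
quarter = repmat({'Q3_08'; 'Q3_08'; 'Q4_08'; 'Q1_09'}, 25, 1);
hybrid  = repmat([0; 1; 1; 1], 25, 1);
writetable(table(quarter, hybrid, 'VariableNames', {'Quarter', 'Hybrid'}), csvFile);

ds = datastore(csvFile);
ds.ReadSize = 20;              % small chunks, so the mapper is called several times

mr = mapreducer(0);            % 0 selects serial execution on the desktop
outds = mapreduce(ds, @hybridMapper, @hybridReducer, mr, ...
    'OutputFolder', tempname, 'Display', 'off');
result = readall(outds);       % table with Key and Value columns
delete(csvFile);

function hybridMapper(data, ~, intermKVStore)
    % Map phase: for each quarter in this chunk, emit [hybrids, registrations].
    [keys, ~, idx] = unique(data.Quarter);
    for k = 1:numel(keys)
        inQuarter = (idx == k);
        add(intermKVStore, keys{k}, [sum(data.Hybrid(inQuarter)), nnz(inQuarter)]);
    end
end

function hybridReducer(key, valueIter, outKVStore)
    % Reduce phase: combine the partial counts for one key into a fraction.
    totals = [0 0];
    while hasnext(valueIter)
        totals = totals + getnext(valueIter);
    end
    add(outKVStore, key, totals(1) / totals(2));
end
```

The mapper emits partial counts rather than fractions so that the reducer can combine chunks of different sizes correctly. Swapping mapreducer(0) for a parallel pool or Hadoop configuration changes where the phases run, not the map and reduce functions themselves.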
Parallel Computing – Distributed Memory
Using More Computers (RAM)
[Diagram: several machines, each with four cores and its own RAM, holding different columns of one large array]
Distributed Arrays
Available from Parallel Computing Toolbox
MATLAB Distributed Computing Server
spmd blocks
spmd
% single program across workers
end
Mix parallel and serial code in the same function
Run on a pool of MATLAB resources
Single Program runs simultaneously across
workers
Multiple Data spread across multiple workers
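The two ideas above, distributed arrays and spmd blocks, can be sketched together. This is a hedged illustration, not code from the talk. It assumes Parallel Computing Toolbox is installed and a parallel pool can be opened, and it uses spmdIndex, spmdSize and spmdPlus, which assume R2022b or later (older releases used labindex, numlabs and gplus).

```matlab
% Distributed array: the data is spread across the workers in the pool.
x = 1:1000;
D = distributed(x);
distTotal = gather(sum(D));      % sum runs on the workers, gather returns it

% spmd: the same program runs on every worker, each on its own slice.
spmd
    myIdx   = spmdIndex:spmdSize:1000;   % this worker's share of the indices
    partial = sum(myIdx);                % local computation
    total   = spmdPlus(partial);         % combine partial sums across workers
end
spmdTotal = total{1};            % total is a Composite; read worker 1's copy
```

Serial code before and after the spmd block runs on the client as usual, which is what allows parallel and serial sections to be mixed in one function.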
Example: Airline Delay Analysis
Data
– BTS/RITA Airline On-Time Statistics
– 123.5M records, 29 fields
Analysis
– Calculate delay patterns
– Visualize summaries
– Estimate & evaluate predictive models
When to Use Distributed Memory
Data Characteristics
– Data must fit in the collective memory across machines
Compute Platform
– Prototype (subset of data) on desktop
– Run on a cluster or cloud
Analysis Characteristics
– Consists of:
Parts that can be run on data in memory (spmd)
Big Data on the Desktop
Expand workspace
64-bit processor support – increased in-memory data set handling
Access portions of data too big to fit into memory
Memory mapped variables – huge binary file
Datastore – huge text file or collections of text files
Database – query portion of a big database table
Variety of programming constructs
System Objects – analyze streaming data
MapReduce – process text files that won’t fit into memory
Increase analysis speed
Further Scaling Big Data Capacity
MATLAB supports a number of programming
constructs for use with clusters
General compute clusters
Parallel for-loops – embarrassingly parallel algorithms
SPMD and distributed arrays – distributed memory
Hadoop clusters
MapReduce – analyze data stored in the Hadoop Distributed File System
Use these constructs Migrate to a
Learn More
MATLAB Documentation
– Strategies for Efficient Use of Memory
– Resolving "Out of Memory" Errors
Big Data with MATLAB
– www.mathworks.com/discovery/big-data-matlab.html
MATLAB MapReduce and Hadoop
– www.mathworks.com/discovery/matlab-mapreduce-hadoop.html
EXTRA SLIDES
Running into Big Data Issues?
Performance
– Takes too long to process all of your data
“Out of memory”
– Running out of address space
Slow processing (swapping)
– Data too large to be efficiently managed between RAM and virtual memory
MATLAB and Memory
Memory and Your Operating System
32-bit operating systems
– 4GB of addressable memory per process
– Part of it is reserved by the OS, leaving the application < 4GB
64-bit operating systems
– 8TB of addressable memory per process
– Limited by the amount of memory available on the computer
[Diagram: memory per process, split into the part available to the process and the part reserved by the operating system]
Independent Tasks or Iterations
Ideal problem for parallel computing (parfor)
No dependencies or communications between tasks
[Diagram: tasks laid out along a time axis]
Block Processing Images
blockproc automatically divides an
image into blocks for processing
Reduces memory usage
– Read and write block directly from image file
Processes arbitrarily large images
Available from
Image Processing Toolbox
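A minimal blockproc sketch is shown below, assuming Image Processing Toolbox is installed. The random 512-by-512 matrix is a hypothetical stand-in for a large image, and the per-block mean is just a placeholder for a real analysis.

```matlab
% Hypothetical stand-in for a large image.
I = rand(512, 512);

% blockproc passes each 64-by-64 block to the function as a struct;
% the pixel values are in the .data field.
blockMeans = blockproc(I, [64 64], @(blk) mean(blk.data(:)));

% One value per block: an 8-by-8 summary of the 512-by-512 input.
disp(size(blockMeans));
```

For images too large to load, the first argument can be a file name instead of a matrix, so that blocks are read from disk one at a time.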
System Objects
A class of MATLAB objects that
support streaming workflows
Simplifies data access for streaming applications
– Manages flow of data from files or network
– Handles data indexing and buffering
Contain algorithms to work with streaming data
– Manages algorithm state
– Available for Signal Processing, Communications, Video Processing, and Phased Array Applications
Available from
DSP System Toolbox
Communications System Toolbox
Computer Vision System Toolbox
Phased Array System Toolbox
Image Acquisition Toolbox
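How a System object manages algorithm state across frames can be illustrated with a small sketch. It assumes DSP System Toolbox is installed. The signal, the frame size and the two-tap filter are made up for illustration.

```matlab
% Two-tap averaging filter as a System object (requires DSP System Toolbox).
fir = dsp.FIRFilter('Numerator', [0.5 0.5]);

signal    = (1:12)';             % hypothetical stream of samples
frameSize = 4;
streamed  = zeros(size(signal));

% Feed the signal one frame at a time, as if it arrived from a file or device.
for k = 1:frameSize:numel(signal)
    rows = k:k + frameSize - 1;
    streamed(rows) = step(fir, signal(rows));   % filter state carries over
end
release(fir);

% Reference: filter the whole signal in one call.
batch   = filter([0.5 0.5], 1, signal);
maxDiff = max(abs(streamed - batch));
```

Because the object keeps the filter state between calls, the frame-by-frame result should match the batch result, even though only one frame is in memory at a time.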
Summary: Options for Handling Large Data
Platform   Desktop Only         Desktop + Cluster     Desktop + Hadoop
Data Size  100’s MB - 10’s GB   100’s MB - 100’s GB   100’s GB - PBs
[Diagrams: MATLAB Desktop (Client) alone; MATLAB Desktop (Client) with a cluster and its scheduler; MATLAB Desktop (Client) with a Hadoop cluster and the Hadoop scheduler]
Performance Gain with More Hardware
Using More Cores (CPUs)
Using GPUs
[Diagram: a four-core CPU with cache beside a GPU with many cores and its own device memory]
Scaling Up Your Computations
MATLAB Distributed Computing Server
Transition from desktop to cluster with
minimal code changes
Supports both interactive and
batch workflows
Integrates with commonly
used schedulers
[Diagram: a computer cluster managed by a scheduler]
MapReduce Programming Model
[Diagram: input files pass through MAP and REDUCE to output files, inside an execution environment]
mapreducer: Controls the execution environment of the MapReduce operation
– Serial
– Local pool (multi-core CPU)
– Hadoop cluster
datastore: Stores information regarding input files and output files
mapreduce: Executes the MapReduce algorithm
– Input datastore object
– Map function