• No results found

Bringing Big Data Modelling into the Hands of Domain Experts

N/A
N/A
Protected

Academic year: 2021

Share "Bringing Big Data Modelling into the Hands of Domain Experts"

Copied!
55
0
0

Loading.... (view fulltext now)

Full text

(1)

Bringing Big Data Modelling into the

Hands of Domain Experts

David Willingham

Senior Application Engineer MathWorks

[email protected]

(2)

“Data is the sword of the 21st century, those who

wield it the samurai.”

Big data — how to create

it, manipulate it, and put it

to good use.

“If you want to work at

Google, make sure you

can use MATLAB.”

(3)

Who is looking to do Big Data Analytics now?

Software Developers

– Programmers using languages such as Java / .NET – They may not be domain experts

End Users (non technical)

– Non programmers

– They may or may not be domain experts

Domain Users

– Scientists, Engineers, Analysts, etc..

(4)

Key Industries

Aerospace and defense

Automotive

Biotech and pharmaceutical

Communications

Education

Electronics and semiconductors

Energy production

Financial services

Industrial automation and machinery

Medical devices

(5)

Tesco uses supply chain analytics to

save £100m a year.

4 years of sales data held in

a Teradata data warehouse.

MATLAB is used to forecast

the optimum stock levels.

“We can run 100 stores for 100 days in about half an hour.

We can figure out quickly whether what we are doing is

(6)

Preventing lethal ship collisions with whales

Cornell University collected

terabytes of ocean acoustic

data.

Crowdsourced algorithms

to detect and classify

animal signals in the

presence of noise.

“A data set that would have taken months to process can

(7)

Earth’s topography on a Miller cylindrical projection, created with MATLAB and Mapping Toolbox.

MathWorks Today

More than 1 million users worldwide.

3500 universities

21 AU universities have MATLAB campus-

(8)

Challenges of Big Data

“Any collection of data sets so large and complex that it becomes difficult to process using … traditional data processing applications.”

(Wikipedia)

Getting started

Rapid data exploration

Development of scalable algorithms

(9)

Big Data Capabilities in MATLAB

Memory and Data Access

64-bit processors

Memory Mapped Variables

Disk Variables

Databases

Datastores

Platforms

Desktop (Multicore, GPU)

Clusters

Programming Constructs

Streaming

Block Processing

Parallel-for loops

GPU Arrays

SPMD and Distributed Arrays

MapReduce

(10)

MATLAB and Memory

Best Practices for Memory Usage

Use the appropriate data storage

– Use only the precision your need – Sparse Matrices

– Categorical Arrays

– Be aware of overhead of cells and structures

Minimize Data Copies

– In place operations, if possible – Use nested functions

– If using objects, consider handle classes

(11)

Considerations for Choosing an Approach

Data characteristics

– Size, type and location of your data

Compute platform

– Single desktop machine or cluster

Analysis Characteristics

– Embarrassingly Parallel

– Analyze sub-segments of data and aggregate results – Operate on entire dataset

(12)

Techniques for Big Data in MATLAB

Complexity

Embarrassingly Parallel

Non- Partitionable

MapReduce

Distributed Memory SPMD and distributed arrays Load,

Analyze, Discard

parfor, datastore,

out-of-memory in-memory

(13)

Techniques for Big Data in MATLAB

Embarrassingly Parallel

Non- Partitionable

MapReduce

Distributed Memory SPMD and distributed arrays Load,

Analyze, Discard

parfor, datastore,

out-of-memory in-memory

Complexity

(14)

Parallel Computing with MATLAB

MATLAB Desktop (Client)

Worker

Worker Worker

Worker

Worker

Worker

(15)

Demo: Determining Land Use

Using Parallel for-loops (parfor)

Data

– Arial images of agriculture land – 24 TIF files

Analysis

– Find and measure irrigation fields

– Determine which irrigation circles are in use (by color)

– Calculate area under irrigation

(16)

When to Use parfor

Data Characteristics

– Can be of any format (i.e. text, images) as long as it can be broken into pieces – The data for each iteration must

fit in memory

Compute Platform

– Desktop (Parallel Computing Toolbox)

– Cluster (MATLAB Distributed Computing Server)

Analysis Characteristics

– Each iteration of your loop must be independent

(17)

Techniques for Big Data in MATLAB

Embarrassingly Parallel

Non- Partitionable

MapReduce

Distributed Memory SPMD and distributed arrays Load,

Analyze, Discard

parfor, datastore,

out-of-memory in-memory

Complexity

(18)

Access Big Data

datastore

Easily specify data set

– Single text file (or collection of text files)

Preview data structure and format

Select data to import

using column names

Incrementally read

subsets of the data

airdata = datastore('*.csv');

airdata.SelectedVariables = {'Distance', 'ArrDelay‘};

(19)

Demo: Vehicle Registry Analysis

Using a DataStore

Data

– Massachusetts Vehicle Registration Data from 2008-2011

– 16M records, 45 fields

Analysis

– Examine hybrid adoptions

– Calculate % of hybrids registered by quarter

– Fit growth to predict further adoption

(20)

When to Use datastore

Data Characteristics

– Text data in files, databases or stored in the Hadoop Distributed File System (HDFS)

Compute Platform

– Desktop

Analysis Characteristics

– Supports Load, Analyze, Discard workflows – Incrementally read chunks of data,

process within a while loop

(21)

Big Data Capabilities in MATLAB

Memory and Data Access

64-bit processors

Memory Mapped Variables

Disk Variables

Databases

Datastores

Platforms

Desktop (Multicore, GPU)

Clusters

Programming Constructs

Streaming

Block Processing

Parallel-for loops

GPU Arrays

SPMD and Distributed Arrays

MapReduce

(22)

Reading in Part of a Dataset from Files

Text file, ASCII file

datastore

MAT file

– Load and save part of a variable using the matfile

Binary file

– Read and write directly to/from file using memmapfile – Maps address space to file

Databases

(23)

Techniques for Big Data in MATLAB

Complexity

Embarrassingly Parallel

Non- Partitionable

MapReduce

Distributed Memory SPMD and distributed arrays Load,

Analyze, Discard

parfor, datastore,

out-of-memory in-memory

(24)

Analyze Big Data

mapreduce

Use the powerful MapReduce programming technique to analyze big data

– mapreduce uses a datastore to process data in small chunks that individually fit into memory Useful for processing multiple keys, or when

Intermediate results do not fit in memory

mapreduce on the desktop

Increase compute capacity (Parallel Computing Toolbox)

Access data on HDFS to develop algorithms for use on Hadoop

mapreduce with Hadoop

Run on Hadoop using MATLAB Distributed Computing Server

********************************

* MAPREDUCE PROGRESS *

********************************

Map 0% Reduce 0%

Map 20% Reduce 0%

Map 40% Reduce 0%

Map 60% Reduce 0%

Map 80% Reduce 0%

Map 100% Reduce 25%

Map 100% Reduce 50%

Map 100% Reduce 75%

Map 100% Reduce 100%

(25)

mapreduce

Data Store Map Reduce

Veh_typ Q3_08 Q4_08 Q1_09 Hybrid

Car 1 1 1 0

SUV 0 1 1 1

Car 1 1 1 1

Car 0 0 1 1

Car 0 1 1 1

Car 1 1 1 1

Car 0 0 1 1

SUV 0 1 1 0

Car 1 1 1 0

SUV 1 1 1 1

Car 0 1 1 1

Car 1 0 0 0

Key: Q3_08 0

1 1 0 1 1 1

Key: Q4_08

0 1 1 1 1

Key: Q1_09 Hybrid

0 0

Key % Hybrid (Value)

Q3_08 0.4

Q4_08 0.67

Q1_09 0.75

Key: Q3_08

Key: Q3_08 0

1 1

0 1 1 1

Key: Q4_08

0 1 Hybrid

0 0

0 1

Shuffle and Sort

(26)

Demo: Vehicle Registry Analysis

Using MapReduce

Data

– Massachusetts Vehicle Registration Data from 2008-2011

– 16M records, 45 fields

Analysis

– Examine hybrid adoptions

– Calculate % of hybrids registered

By Quarter

By Regional Area

– Create map of results

(27)

Datastore

HDFS

The Big Data Platform

Reduce

Node Node

Node Data

Data

Data

Map

Reduce

MapMap Reduce

Map Reduce

(28)

Datastore

Explore and Analyze Data on Hadoop

MATLAB MapReduce

Code

HDFS Node Data

MATLAB Distributed Computing

Server

Node Data

Node Data

Map Reduce

Map Reduce

Map Reduce

(29)

Deployed Applications with Hadoop

Datastore

HDFS Node Data

Node Data

Node Data

Map Reduce

Map Reduce

Map Reduce

MATLAB runtime

(30)

When to Use mapreduce

Data Characteristics

– Text data in files, databases or stored in the Hadoop Distributed File System (HDFS) – Dataset will not fit into memory

Compute Platform

– Desktop

– Scales to run within Hadoop MapReduce on data in HDFS

Analysis Characteristics

– Must be able to be Partitioned into two phases

(31)

Techniques for Big Data in MATLAB

Complexity

Embarrassingly Parallel

Non- Partitionable

MapReduce

Distributed Memory SPMD and distributed arrays Load,

Analyze, Discard

parfor, datastore,

out-of-memory in-memory

(32)

Parallel Computing – Distributed Memory

Core 1

Core 3 Core 4 Core 2

RAM

Using More Computers (RAM)

Core 1

Core 3 Core 4 Core 2

RAM

(33)

T

OOLBOXES

B

LOCKSETS

11 26 41 12 27 42 13 28 43 14 29 44 15 30 45 16 31 46 17 32 47 17 33 48 19 34 49 20 35 50 21 36 51 22 37 52

Distributed Arrays

Available from

 Parallel Computing Toolbox

 MATLAB Distributed Computing Server

(34)

spmd blocks

spmd

% single program across workers

end

Mix parallel and serial code in the same function

Run on a pool of MATLAB resources

Single Program runs simultaneously across

workers

Multiple Data spread across multiple workers

(35)

Example: Airline Delay Analysis

Data

– BTS/RITA Airline On-Time Statistics

– 123.5M records, 29 fields

Analysis

– Calculate delay patterns – Visualize summaries – Estimate & evaluate

predictive models

(36)

When to Use Distributed Memory

Data Characteristics

– Data must be fit in collective memory across machines

Compute Platform

– Prototype (subset of data) on desktop – Run on a cluster or cloud

Analysis Characteristics

– Consists of:

Parts that can be run on data in memory (spmd)

(37)

Big Data on the Desktop

Expand workspace

64 bit processor support – increased in-memory data set handling

Access portions of data too big to fit into memory

Memory mapped variables – huge binary file

Datastore – huge text file or collections of text files

Database – query portion of a big database table

Variety of programming constructs

System Objects – analyze streaming data

MapReduce – process text files that won’t fit into memory

Increase analysis speed

(38)

Further Scaling Big Data Capacity

MATLAB supports a number of programming

constructs for use with clusters

General compute clusters

Parallel for-loops – embarrassingly parallel algorithms

SPMD and distributed arrays – distributed memory

Hadoop clusters

MapReduce – analyze data stored in the Hadoop Distributed File System

Use these constructs Migrate to a

(39)

Learn More

MATLAB Documentation

– Strategies for Efficient Use of Memory – Resolving "Out of Memory" Errors

Big Data with MATLAB

www.mathworks.com/discovery/big-data-matlab.html

MATLAB MapReduce and Hadoop

www.mathworks.com/discovery/matlab-mapreduce-hadoop.html

(40)

EXTRA SLIDES

(41)

Running into Big Data Issues?

Performance

– Takes too long to process all of your data

“Out of memory”

– Running out of address space

Slow processing (swapping)

– Data too large to be efficiently managed between RAM and virtual memory

(42)

MATLAB and Memory

Memory and Your Operating System

32-bit operating systems

– 4GB of addressable memory per process – Part of it is reserved by the OS,

leaving the application < 4GB

64-bit operating systems

– 8TB of addressable memory per process

– Limited by the amount of memory available on the computer

Memory per Process

Available for Process Reserved by Operating System

(43)

MATLAB and Memory

Best Practices for Memory Usage

Use the appropriate data storage

– Use only the precision your need – Sparse Matrices

– Categorical Arrays

– Be aware of overhead of cells and structures

Minimize Data Copies

– In place operations, if possible – Use nested functions

– If using objects, consider handle classes

(44)

No dependencies or communications between tasks

Ideal problem for parallel computing (parfor)

Independent Tasks or Iterations

Time Time

(45)

Block Processing Images

blockproc automatically divides an

image into blocks for processing

Reduces memory usage

– Read and write block directly from image file

Processes arbitrarily large images

Available from

 Image Processing Toolbox

(46)

System Objects

A class of MATLAB objects that

support streaming workflows

Simplifies data access for streaming applications

– Manages flow of data from files or network – Handles data indexing and buffering

Contain algorithms to work with streaming data

– Manages algorithm state

– Available for Signal Processing, Communications, Video Processing, and Phased Array Applications

Available from

 DSP System Toolbox

 Communications System Toolbox

 Computer Vision System Toolbox

 Phased Array System Toolbox

 Image Acquisition Toolbox

(47)

Platform Desktop Only Desktop + Cluster Desktop + Hadoop

Data Size 100’s MB -10’s GB 100’s MB -100’s GB 100’s GB – PBs

Summary: Options for Handling Large Data

MATLAB Desktop (Client)

Hadoop Cluster

Hadoop Scheduler

… …

..…

..…

..…

MATLAB Desktop (Client)

Cluster

Scheduler

… …

..…

..…

..…

MATLAB Desktop (Client)

(48)

Performance Gain with More Hardware

Using GPUs

Device Memory GPU cores

Device Memory Core 1

Core 3 Core 4 Core 2

Cache

Using More Cores (CPUs)

(49)

Scaling Up Your Computations

MATLAB Distributed Computing Server

Transition from desktop to cluster with

minimal code changes

Supports both interactive and

batch workflows

Integrates with commonly

used schedulers

Cluster Computer Cluster

Scheduler

(50)

MapReduce Programming Model

mapreducer

datastore

MAP REDUCE

Execution Environment

Output Files Input Files

(51)

mapreducer

mapreducer: Control the Execution Environment

of the MapReduce operation

MAP REDUCE

Output Files Input Files

Execution Environment

(52)

mapreducer

Serial Local Pool Hadoop Cluster

Multi-core CPU

(53)

datastore

datastore: Stores information regarding

input files and output files

MAP REDUCE

Execution Environment

Output Files Input Files

(54)

mapreduce

MAP REDUCE

Output Files Input Files

Execution Environment

mapreduce: Executes the MapReduce algorithm

(55)

mapreduce

mapreduce: - Input datastore object

- Map function

MAP REDUCE

Output Files Input Files

Execution Environment

References

Related documents

plied in substance the “machine or transformation” test, and determined that the invention before the Court was not substantially the software, but rather the totality of an

Examples are pulse width modulation (PWM) converters and cycloconverters. Interharmonics generated by them may be located anywhere in the spectrum with respect to

Even at this early stage, Duke was getting noticed for his different style of music and by 1930, Duke and his band were famous..

Cisco Devices Generic DoS 367 ICMP Remote DoS Vulnerabilities 367 Malformed SNMP Message DoS Vulnerability 369. Examples of Specific DoS Attacks Against Cisco Routers 370 Cisco

This study, therefore, recommended 25% inclusion level of ANLE in boar semen extension up to 48 h as indicated by observed mean values of all parameters, which

The majority (58%) of parents of children aged 12 and younger reported that their children always wear helmets when riding a bicycle1. Low rates of helmet use among children

• Prior to joining Whitman Howard, Araminta was a buy-side analyst with a specialist small and mid-cap

HDFS - Hadoop Distributed File System (HDFS) is a file system that spans all the nodes in a Hadoop cluster for data storage.. It links together the file systems on many