Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

(1)


Dr. Liangxiu Han

Future Networks and Distributed Systems Group (FUNDS), School of Computing, Mathematics and Digital Technology,

Manchester Metropolitan University

http://www2.docm.mmu.ac.uk/STAFF/L.Han/

June, 2014

(2)

Outline

Data tsunami

What is big data?

Value of big data

Challenges of big data

Technologies for big data

Data exploration for future roadmap

@Funds.MMU

(3)

Data tsunami

Increased capability of generating and capturing data (e.g. petascale simulations, experimental devices, the Internet, sensors, etc.)

300m photos, 2.5m contents shared per day

caBIG: 4.7+ million for cancers

(4)

Data tsunami

Gene expression data in GEO and ArrayExpress: over 1 million

Climate data from NASA: 32 Petabytes (1 Petabyte = 2⁵⁰ bytes)

SKA (The Square Kilometre Array):

‘The data collected by the SKA in a single day would take nearly two million years to playback on an ipod.’

(5)

Slide Credit: Intel

(6)

Data tsunami

Data-intensive era -- big data/data-rich/data-centric/data-driven era

[Chart: Data Volumes by year (2010, 2011, 2020); plotted values 1.3, 2 and 35 on an axis from 0 to 40]

(7)

Data size representation

What is big data?

Binary digit (bit)

Byte (B): 8 bits

Kilobyte (KB): 2¹⁰ bytes

Megabyte (MB): 2²⁰ bytes

Gigabyte (GB): 2³⁰ bytes

Terabyte (TB): 2⁴⁰ bytes

Petabyte (PB): 2⁵⁰ bytes

Exabyte (EB): 2⁶⁰ bytes

Zettabyte (ZB): 2⁷⁰ bytes

Yottabyte (YB): 2⁸⁰ bytes
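As a rough illustration of the binary units above, the following sketch converts between unit names and byte counts. The names `UNITS`, `to_bytes` and `humanize` are made up for this example and do not come from the slides.

```python
# Binary (power-of-two) data-size units, in increasing order.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a size in the given unit to bytes (1 KB = 2**10 bytes)."""
    exponent = 10 * UNITS.index(unit)
    return value * 2 ** exponent

def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1024

print(to_bytes(32, "PB"))   # 32 Petabytes expressed in bytes (32 * 2**50)
print(humanize(2 ** 70))    # 2**70 bytes expressed as "1 ZB"
```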

(8)

What is big data?

Timeline

1998 - the origins of ‘big data’: John R. Mashey (Chief Scientist, SGI), "Big Data ... and the Next Wave of InfraStress" (slides dated 4/25/98)

2001 - "3D Data Management: Controlling Data Volume, Velocity, and Variety" by Doug Laney

2010 - widespread in the Economist ("Data, data everywhere: A special report on managing information")

2012 - Gartner, IBM, Cisco, Microsoft, etc.

(9)

What is big data?

A relative term (don’t define it in terms of size being larger than a certain number of terabytes or petabytes)

Larger, more complex and harder to access, organise and analyse than the existing tools can handle (varying by sector)

The data volume, velocity or variety/complexity (the 3 Vs) limits the ability to perform effective analysis using traditional approaches

(10)

What is big data?

Big data is about pushing limits!

(11)

What is big data?

Volume (Data at rest)

The size and scale of the data

By 2015, it will reach 8 Zettabytes (IDC)

(12)

What is big data?

Velocity (data in motion)

Real time capture and analytics/streaming processing and analytics

Stock exchange, fraud analysis/customer churn predictions

(13)

What is big data?

Variety/complexity (data in many forms)

Various formats, types and structures

Structured data, e.g. data defined by a schema, relational databases, or semi-structured data (XML)

Unstructured data, e.g. free-form text, emails, logs, images, audio, video, social media data (e.g. graphs)

(14)

What is big data?

Two more ‘Vs’

Value: business value to be derived

Veracity (data in doubt): the quality and understandability of the data

[Diagram: Big Data: Velocity, Variety, Value, Volume, Veracity]

(15)

Value of big data

Next frontier for innovation, competition and productivity:

Commerce and economy

Scientific discovery in almost every science and engineering discipline for addressing societal challenges (health, food, energy, environment, etc.)

(16)

Value of big data

Source: wikibon

(17)

Value of big data

New paradigm – Big Data leads science discovery

(18)

Challenges of big data

Bottleneck in Technology: IT infrastructures

Source: samsung

(19)

Challenges of big data

Bottleneck in technical skills: professionals to handle big data

[Figure: Big Data opportunities]

Source: eskills

(20)

Technologies for big data

What kinds of big data technologies come to mind?

Cloud computing?

...

(21)

Big data processing and analytics

Architectures for efficiently processing big data

Data analytics for filtering, analysing and generating actionable insights

[Diagram: Data processing and analysis; labels: Data Volume, Variety/complexity, Data Value; High, Low]

Technologies for big data

(22)

Traditional approaches, for example,

OLTP (online transaction processing)

OLAP (online analytical processing): data warehouse

[Diagram labels: OLTP, OLAP; business process, operations; business data warehouse, data analytics; information, decision making, business strategy]
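The contrast between the two traditional styles can be sketched with Python's built-in `sqlite3` module. This is only an illustration under simplifying assumptions: the `orders` table and its rows are hypothetical, and a real OLAP setup would normally query a separate data warehouse rather than the operational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# OLTP-style work: many small transactions, each touching a few rows.
for region, amount in [("north", 20.0), ("south", 35.0), ("north", 15.0)]:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (region, amount) VALUES (?, ?)", (region, amount)
        )

# OLAP-style work: one analytical query that aggregates over all rows.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)
print(totals)
```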

Architectures

(23)

Issues

Relational databases (RDBMS):

deal with structured data only

do not support complex analytics

lack scalability and performance

Architectures

(24)

Current solutions

Apache Hadoop: an open-source framework for storage and large-scale processing of data sets (both structured and unstructured data (NoSQL)); major components: HDFS, MapReduce, Hadoop Common, Hadoop YARN

Google File System and MapReduce

Apache Spark: combines SQL, streaming, complex analytics and in-memory computing

Architectures

(25)

Parallel and distributed computing for data processing

[Diagram panels: Distributed, Parallel]

Architectures

(26)

Parallelisation: a sought-after solution for speeding up an application, particularly for data-intensive applications

Three considerations:

How to distribute workloads or decompose an algorithm into parts

How to map the tasks onto various computing nodes and execute subtasks in parallel

How to coordinate the subtasks on those computing nodes and communicate between them
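The three considerations can be sketched on a toy problem, a sum of squares. This is a minimal illustration, not a prescribed design: the function names are invented, and worker threads stand in for computing nodes (a real deployment would use separate processes or machines).

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, parts):
    """Decompose: cut the workload into roughly equal chunks."""
    if not data:
        return []
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_sum(chunk):
    """The subtask that each worker runs on its own chunk."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    chunks = split(data, workers)
    # Map: hand each chunk to a worker and run the subtasks in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Coordinate: collect the partial results and combine them.
    return sum(partials)

print(parallel_sum_of_squares(list(range(1, 101))))
```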

Architectures

(27)

Data parallelism: workloads are distributed across different computing nodes, and the same task is executed on different subsets of the data simultaneously

Task parallelism: tasks are independent and can be executed purely in parallel

Pipelining: a task consists of many stages that are chained and executed in order, where the output of one stage is the input of the next. Pipelining can be implemented with or without streaming
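A small sketch of the three patterns, assuming threads as stand-ins for computing nodes and Python generators as a simple streaming mechanism. All function names here are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

def square_chunk(chunk):
    return [square(x) for x in chunk]

def data_parallel(chunks):
    """Data parallelism: the same task runs on different subsets of the data."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(square_chunk, chunks))

def task_parallel(data):
    """Task parallelism: independent tasks run at the same time."""
    with ThreadPoolExecutor() as pool:
        total = pool.submit(sum, data)
        largest = pool.submit(max, data)
        return total.result(), largest.result()

# Pipelining with streaming: each stage consumes the previous stage's output
# one item at a time.
def read(values):
    for v in values:
        yield v

def clean(stream):
    for v in stream:
        if v >= 0:  # drop negative readings
            yield v

def transform(stream):
    for v in stream:
        yield square(v)

def pipeline(values):
    return list(transform(clean(read(values))))
```

For example, `pipeline([2, -1, 3])` passes each value through the read, clean and transform stages in order, so the negative value is dropped before squaring.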

Architectures

(28)

Programming models for parallel and distributed computing (e.g. MPI, MapReduce, POSIX Threads, OpenMP, etc)

Bridging the gap between the underlying

hardware and the supporting layers of software available to applications

Independent of programming languages and API

Architectures

(29)

MapReduce: a programming model for processing large-scale datasets

Implementations (e.g. Google DFS, Apache Hadoop)

Architectures

(30)

Map and Reduce functions

Map: apply a function to each individual value of a data set and return a new list of values

Given a dataset: A={1, 2, 3}, Map function: Square = X*X. After the Map process, it returns {1, 4, 9}

Architectures

(31)

Reduce: perform a function that combines the values in a data set

Given a dataset: B={2,4,9}, Reduce function: sum = X1+X2+X3. After Reduce process, it returns 15
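The two functions can be tried directly with Python's built-in `map` and `functools.reduce`. This is a single-machine sketch of the idea, not a distributed MapReduce implementation; the helper names `square` and `add` are chosen for this example.

```python
from functools import reduce

def square(x):
    """Map function: applied to each value independently."""
    return x * x

def add(x1, x2):
    """Reduce function: combines two values into one."""
    return x1 + x2

A = [1, 2, 3]
mapped = list(map(square, A))   # squares of 1, 2, 3
print(mapped)

B = [2, 4, 9]
total = reduce(add, B)          # 2 + 4 + 9
print(total)

# Chaining the two steps: reduce the output of the map step.
print(reduce(add, mapped))      # 1 + 4 + 9
```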

[Diagram: Large Data, Map, Reduce, Output; <key, value> pairs]

Architectures

(32)

Architectures

MapReduce

The question: count the occurrences of the words "hello" and "big data" in a big data set

Map: <Hello, 2>, <big data, 3>, <Hello, 4>, <big data, 5>

Reduce: <Hello, 6>, <big data, 8>
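The counting example can be reproduced with a small single-machine sketch. The two input partitions below are hypothetical, constructed only so that the map step yields the counts shown on the slide (2 and 3, then 4 and 5); `map_count` and `reduce_counts` are names invented for this illustration.

```python
from collections import Counter
from functools import reduce

TERMS = ("hello", "big data")

def map_count(partition):
    """Map: count each term within one partition of the input text."""
    text = partition.lower()
    return Counter({term: text.count(term) for term in TERMS})

def reduce_counts(left, right):
    """Reduce: merge two partial counts by adding them per term."""
    return left + right

# Hypothetical input, split into two partitions.
partitions = [
    "Hello " * 2 + "big data " * 3,
    "Hello " * 4 + "big data " * 5,
]

mapped = [map_count(p) for p in partitions]
total = reduce(reduce_counts, mapped)
print(dict(total))
```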

(33)

Comparison of RDBMS-based approaches, Spark and MapReduce

Architectures

(34)

Data analytics: discovery of useful, possibly unexpected, patterns in data; automation of data exploration and analysis

Statistical analysis

Machine learning/data mining, for example:

classification, clustering, association rules, regression, graph mining, ...

Data analytics

(35)

Classification

AMD Diabetic

Retinopathy

Data analytics

(36)

Clustering

Clustering of the fish industry in UK

[Figure: two clusters of Scotland using the k-means clustering algorithm]

[Figure: fishing industry]
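As a rough idea of how a clustering algorithm groups data, here is a minimal k-means sketch in plain Python. k-means is used purely as an illustration. The 2-D points are made up and are not the fish industry data, and the initial centroids are simply taken to be the first k points.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; initial centroids are the first k points."""
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [
                (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                )
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```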

Data analytics

(37)

Graph mining .... and so on

Source: http://wisonets.wordpress.com/

Community detection in Facebook friends

Data analytics

(38)

Big data technology =

Architectures for data

processing and management

+Data analytics

(39)

Future development: programming abstractions need to be developed to support and facilitate big data processing and analytics

Technologies for big data

NoSQL

RDBMS DFS ...

Programming abstractions to support big data processing and analytics

Apps

(40)

We focus on:

-- both fundamental and applied research in large-scale data processing and analytics

-- intelligent management and optimisation of large-scale networked distributed systems (challenges: reliability, scalability, security, resilience, autonomy and self-adaptation)

Data exploration for future


http://www.scmdt.mmu.ac.uk/research/funds/

(41)

Data exploration for future


People

Energy

Manufacturing

Planning Food Health

sustainability society Future

(42)

BBSRC

EPSRC-DHPA

Sustainability Society Network+

Amazon

MRC HGU

MMU Optos Fera

Acknowledgement &

Collaboration

(43)

Outside:

MRC HGU, University of Edinburgh

University of Manchester

University of Strathclyde

Heriot-Watt University

Loughborough University

University of Melbourne

University of Glasgow ...

Optos, Fera ...

Acknowledgement &

Collaboration

(44)

Inside:

School of Science and Environment

Business School

School of Engineering

Department of Sociology ...

Acknowledgement &

Collaboration
