Surfing the Data Tsunami: A New Paradigm
for Big Data Processing and Analytics
Dr. Liangxiu Han
Future Networks and Distributed Systems Group (FUNDS), School of Computing, Mathematics and Digital Technology,
Manchester Metropolitan University
http://www2.docm.mmu.ac.uk/STAFF/L.Han/
June, 2014
Outline
Data tsunami
What is big data?
Value of big data
Challenges of big data
Technologies for big data
Data exploration for future roadmap
@Funds.MMU
Data tsunami
Increased capability of generating and capturing data (e.g. petascale simulations, experimental devices, the Internet, sensors, etc.)
300m photos, 2.5m contents shared per day
caBIG: 4.7+ million for cancers
Data tsunami
Gene expression data in GEO and ArrayExpress: over 1 million
Climate data from NASA: 32 Petabytes (1 Petabyte = 2^50 bytes)
SKA(The Square Kilometre Array):
‘The data collected by the SKA in a single day would take nearly two
million years to playback on an ipod.’
Slide Credit: Intel
Data tsunami
Data-intensive era: big data/data-rich/data-centric/data-driven era
[Figure: Data Volumes by year: 2010: 1.3; 2011: 2; 2020: 35]
Data size representation
What is big data?
Binary digit (bit)
Byte (B): 8 bits
Kilobyte (KB): 2^10 bytes
Megabyte (MB): 2^20 bytes
Gigabyte (GB): 2^30 bytes
Terabyte (TB): 2^40 bytes
Petabyte (PB): 2^50 bytes
Exabyte (EB): 2^60 bytes
Zettabyte (ZB): 2^70 bytes
Yottabyte (YB): 2^80 bytes
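The unit ladder above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the binary convention used on this slide (each unit is 2^10 times the previous one); the helper names `unit_size` and `human_readable` are hypothetical and chosen only for illustration.

```python
# Binary unit names in increasing order; each step is a factor of 2**10.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def unit_size(unit: str) -> int:
    """Return the number of bytes in one `unit` (binary convention)."""
    return 2 ** (10 * UNITS.index(unit))


def human_readable(n_bytes: int) -> str:
    """Format a byte count with the largest unit that keeps the value below 1024."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


print(unit_size("PB") == 2 ** 50)            # True
print(human_readable(32 * unit_size("PB")))  # 32.0 PB
```

For example, the 32 Petabytes quoted earlier for NASA climate data corresponds to 32 * 2^50 bytes under this convention.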
What is big data?
Timeline
1998 - the origins of ‘big data’: John R. Mashey (Chief Scientist, SGI), ‘Big Data ... and the Next Wave of InfraStress’ (4/25/98)
Technology waves: NOT technology for technology’s sake, IT’S WHAT YOU DO WITH IT (OK!). But if you don’t understand the trends, IT’S WHAT IT WILL DO TO YOU (Uh-oh!)
2001 - ‘3D Data Management: Controlling Data Volume, Velocity, and Variety’ by Doug Laney
2010 - widespread, e.g. The Economist special report on managing information, ‘Data, data everywhere’
2012 - Gartner, IBM, Cisco, Microsoft, etc.
What is big data?
A relative term (don’t define it in terms of size being larger than a certain number of terabytes or petabytes)
Data that are larger, more complex and harder to access, organise and analyse than existing tools can handle (varies by sector)
The data volume, velocity or variety/complexity (the 3 Vs) limits the ability to perform effective analysis using traditional approaches
What is big data?
Big data is about pushing limits!
What is big data?
Volume (Data at rest)
The size and scale of the data
By 2015, it will reach 8 Zettabytes (IDC)
What is big data?
Velocity (data in motion)
Real time capture and analytics/streaming processing and analytics
Stock exchange, fraud analysis/customer churn predictions
What is big data?
Variety/complexity (data in many forms)
Various formats, types and structures
Structured data, e.g. data defined by a schema (relational databases)
Semi-structured data, e.g. XML
Unstructured data, e.g. free-form text, emails, logs, images, audio, video, social media data (e.g. graphs)
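A small sketch may help to contrast the forms of data listed on this slide. It uses only the Python standard library; all sample records (the name "Alice", the date, the field names) are hypothetical and made up for illustration.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: rows follow a fixed schema (here a CSV header stands in for it).
structured = "id,name,age\n1,Alice,34\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing tags, but no rigid table schema.
semi_structured = "<patient id='1'><name>Alice</name><note>stable</note></patient>"
root = ET.fromstring(semi_structured)

# Unstructured: free-form text; any structure has to be extracted, e.g. by pattern.
unstructured = "Alice emailed on 2014-06-01 to say she felt better."
dates = re.findall(r"\d{4}-\d{2}-\d{2}", unstructured)

print(rows[0]["name"])        # field access by column name
print(root.findtext("name"))  # field access by tag
print(dates)                  # values recovered by pattern matching
```

The further the data is from a schema, the more work the analysis code has to do before any value can be extracted.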
What is big data?
Two more ‘Vs’
Value: business value to be derived
Veracity (data in doubt): the quality and understandability of the data
[Figure: Big Data and its five Vs: Velocity, Variety, Value, Volume, Veracity]
Value of big data
Next frontier for innovation, competition and productivity:
Commerce and economy
Scientific discovery in almost every science and engineering discipline for addressing societal challenges (health, food, energy, environment, etc.)
Value of big data
Source: wikibon
Value of big data
New paradigm – Big Data leads science discovery
Challenges of big data
Bottleneck in Technology: IT infrastructures
Source: samsung
Challenges of big data
Bottleneck in technical skills: professionals to handle big data
[Figure: Big Data Opportunities]
Source: eskills
Technologies for big data
What kinds of big data technologies do you have in mind?
Cloud computing?
...
Big data processing and analytics
Architectures for efficiently processing big data
Data analytics for filtering, analysing and generating actionable insights
[Figure: Data Processing and Analysis; Data Volume, Variety, Complexity; Data Value (scales from Low to High)]
Technologies for big data
Traditional approaches, for example,
OLTP (online transaction processing)
OLAP (online analytical processing): data warehouse
[Figure: OLTP and OLAP. Labels: business process, operations, business data warehouse, data analytics, information, decision making, business strategy]
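The difference between the two traditional workloads can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: SQLite is neither a production OLTP system nor a data warehouse, and the `sales` table, its columns and the sample values are hypothetical.

```python
import sqlite3

# In-memory database standing in for an operational store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)


def record_sale(region: str, amount: float) -> None:
    """OLTP-style operation: one short transaction that writes a single row."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO sales (region, amount) VALUES (?, ?)", (region, amount)
        )


def sales_by_region() -> dict:
    """OLAP-style operation: a read-only aggregate over the whole table."""
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(cursor.fetchall())


for region, amount in [("north", 10.0), ("south", 5.0), ("north", 2.5)]:
    record_sale(region, amount)

print(sales_by_region())  # {'north': 12.5, 'south': 5.0}
```

The write path supports day-to-day operations, while the aggregate query is the kind of summary that feeds decision making.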
Architectures
Issues
Relational databases (RDBMS) deal with structured data only
They don’t support complex analytics
They lack scalability and performance
Architectures
Current solutions
Apache Hadoop: an open-source framework for storage and large-scale processing of data sets (both structured and unstructured data (NoSQL)); major components:
HDFS, MapReduce, Hadoop Common, Hadoop YARN
Inspired by Google File System and MapReduce
Apache Spark: combines SQL, streaming, complex analytics and in-memory computing
Architectures
Parallel and distributed computing for data processing
[Figure: Distributed vs. Parallel]
Architectures
Parallelisation: a sought-after solution for speeding up an application, particularly for data-intensive applications
Three considerations:
How to distribute workloads or decompose an algorithm into parts
How to map the tasks onto various computing nodes and execute subtasks in parallel
How to coordinate subtasks and communicate between them on those computing nodes
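The three considerations can be sketched on a single machine. This is a minimal sketch, assuming worker threads stand in for computing nodes (in CPython, threads illustrate the structure rather than give a CPU speed-up); the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(data, n_parts):
    """Consideration 1: split the workload into roughly equal chunks."""
    size = max(1, -(-len(data) // n_parts))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def subtask(chunk):
    """The work done independently on each chunk: a partial sum of squares."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=4):
    chunks = decompose(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Consideration 2: map each subtask onto a worker and run them concurrently.
        partials = list(pool.map(subtask, chunks))
    # Consideration 3: coordinate by collecting and combining the partial results.
    return sum(partials)


print(parallel_sum_of_squares(list(range(10))))  # 285
```

In a real distributed setting, the third step also covers network communication and failure handling, which this sketch leaves out.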
Architectures
Data parallelism: the workload is distributed across different computing nodes, and the same task is executed on different subsets of the data simultaneously
Task parallelism: tasks are independent and can be executed purely in parallel
Pipelining: a task consists of many stages chained and executed in order, where the output of one stage is the input of the next. Pipelining can be implemented with or without streaming
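Task parallelism and pipelining can be sketched as follows. This is an illustration only: threads stand in for computing nodes, Python generators stand in for a streaming framework, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


# Task parallelism: two independent tasks run side by side on the same input.
def count_words(text):
    return len(text.split())


def count_chars(text):
    return len(text)


def run_independent_tasks(text):
    with ThreadPoolExecutor(max_workers=2) as pool:
        words = pool.submit(count_words, text)
        chars = pool.submit(count_chars, text)
        return words.result(), chars.result()


# Pipelining: each stage consumes the output of the previous stage.
def parse(lines):
    for line in lines:
        yield int(line)


def square(numbers):
    for n in numbers:
        yield n * n


def pipeline_streaming(lines):
    """Streaming variant: items flow through the stages one at a time."""
    return sum(square(parse(lines)))


def pipeline_batch(lines):
    """Non-streaming variant: each stage finishes before the next one starts."""
    parsed = [int(line) for line in lines]
    squared = [n * n for n in parsed]
    return sum(squared)


print(run_independent_tasks("big data is big"))  # (4, 15)
print(pipeline_streaming(["1", "2", "3"]))       # 14
print(pipeline_batch(["1", "2", "3"]))           # 14
```

The streaming variant never holds a full intermediate list in memory, which matters when the input is much larger than this toy example.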
Architectures
Programming models for parallel and distributed computing (e.g. MPI, MapReduce, POSIX Threads, OpenMP, etc.)
Bridging the gap between the underlying hardware and the supporting layers of software available to applications
Independent of programming languages and APIs
Architectures
MapReduce: a programming model for processing large-scale datasets
Implementations (e.g. Google MapReduce, Apache Hadoop)
Architectures
Map and Reduce functions
Map: perform a function on each individual value of a data set and return a new list of values
Given a dataset A = {1, 2, 3} and the Map function Square(X) = X*X, the Map process returns {1, 4, 9}
Architectures
Reduce: perform a function that combines the values in a data set
Given a dataset B = {1, 4, 9} and the Reduce function Sum = X1 + X2 + X3, the Reduce process returns 14
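The two functions can be tried directly in Python. This is a minimal single-machine sketch using the built-in `map` and `functools.reduce`, which are stand-ins for the functional idea only and are not the API of any MapReduce framework. It squares the dataset A = {1, 2, 3} and then sums the squares.

```python
from functools import reduce

A = [1, 2, 3]

# Map: apply the same function to every value independently.
squared = list(map(lambda x: x * x, A))

# Reduce: combine all values into one result by repeated pairwise addition.
total = reduce(lambda acc, x: acc + x, squared)

print(squared)  # [1, 4, 9]
print(total)    # 14
```

Because each Map call depends only on its own input value, the calls can run on different nodes; only the Reduce step needs to bring results together.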
[Figure: MapReduce data flow, with labels including Data, Output and (Key, Value)]
Architectures
Architectures
MapReduce
The question: count the occurrences of “hello” and “big data” in big data
Map: <Hello, 2> <big data, 3>; <Hello, 4> <big data, 5>
Reduce: <Hello, 6> <big data, 8>
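The counting example can be reproduced with a small single-machine sketch. The assumptions are labelled: the two input splits are synthetic and built so that the map outputs match the pairs on the slide, and the function names `map_phase`, `shuffle` and `reduce_phase` are hypothetical, not the API of Hadoop or any other framework.

```python
from collections import defaultdict

KEYS = ["hello", "big data"]


def map_phase(split):
    """Map: emit one (key, count) pair per key for a single input split."""
    text = split.lower()
    return [(key, text.count(key)) for key in KEYS]


def shuffle(pairs):
    """Group the intermediate values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the counts collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Synthetic splits: 2 and 4 occurrences of "hello", 3 and 5 of "big data".
splits = [
    "hello " * 2 + "big data " * 3,
    "hello " * 4 + "big data " * 5,
]

mapped = [pair for split in splits for pair in map_phase(split)]
result = reduce_phase(shuffle(mapped))

print(mapped)  # [('hello', 2), ('big data', 3), ('hello', 4), ('big data', 5)]
print(result)  # {'hello': 6, 'big data': 8}
```

In a real deployment each split would be mapped on a different node, and the shuffle would move the pairs across the network to the reducers.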
Comparison of RDBMS-based approaches, Spark and MapReduce
Architectures
Data analytics: discovery of useful, possibly unexpected, patterns in data; automation of data exploration and analysis
Statistical analysis
Machine learning/data mining, for example
Classification, clustering, association rules, regression, graph mining, ...
Data analytics
Classification
AMD
Diabetic Retinopathy
Data analytics
Clustering
Clustering of the fish industry in the UK
[Figure: Two clusters of Scotland using the K-means clustering algorithm]
[Figure: Fishing industry]
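Clustering can be illustrated with k-means, one common clustering algorithm. The sketch below is a plain-Python version under stated assumptions: the six 2-D points are made up (they are not the fishing industry data), and the centroids are initialised naively with the first k points, which real implementations usually avoid.

```python
import math


def kmeans(points, k, n_iter=20):
    """Cluster `points` (tuples of coordinates) into k groups."""
    centroids = list(points[:k])  # assumption: naive initialisation
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centroids, labels


# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)

print(labels)  # [0, 0, 0, 1, 1, 1]
```

The two steps repeat until the assignments stop changing; a fixed iteration count is used here only to keep the sketch short.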
Data analytics
Graph mining .... and so on
Source: http://wisonets.wordpress.com/
Community detection in Facebook friends
Data analytics
Big data technology = Architectures for data processing and management + Data analytics
Future development: programming abstractions need to be developed to support and facilitate big data processing and analytics
Technologies for big data
[Figure: Apps; programming abstractions to support big data processing and analytics; NoSQL, RDBMS, DFS, ...]
We focus on:
Both fundamental and applied research in large-scale data processing and analytics
Intelligent management and optimisation of large-scale networked distributed systems (challenges: reliability, scalability, security, resilience, autonomy and self-adaptation)
Data exploration for future
http://www.scmdt.mmu.ac.uk/research/funds/
Data exploration for future
[Figure: People, Energy, Manufacturing, Planning, Food, Health, Sustainability, Society, Future]
[Logos: BBSRC, EPSRC-DHPA, Sustainability Society Network+, Amazon, MRC HGU, MMU, Optos, Fera]
Acknowledgement &
Collaboration
Outside:
MRC HGU, University of Edinburgh
University of Manchester
University of Strathclyde
Heriot-Watt University
Loughborough University
University of Melbourne
University of Glasgow ...
Optos
Fera ...
Acknowledgement &
Collaboration
Inside:
School of Science and Environment
Business School
School of Engineering
Department of Sociology ...