Big Data Analytics
Chances and Challenges
Volker Markl | DIMA | BDOD Big Data Chances and Challenges
Volker Markl
Professor and Chair, Database Systems and Information Management (DIMA), Technische Universität Berlin
More and more data is available to science and business!
Big Data Analytics | Volker Markl | BDOD Big Data – Chances and Challenges
Drivers:
Cloud Computing
Internet of Services
Internet of Things
Cyberphysical Systems
Underlying Trends:
Connectivity
Collaboration
Computer generated data
video streams
web archives
sensor data
audio streams
RFID data
simulation data
Data are becoming increasingly complex!
Size (volume)
Freshness (velocity)
Format/Media Type (variability)
Uncertainty/Quality (veracity)
etc.
Data and analyses are becoming increasingly complex!
Data
Size (volume)
Freshness (velocity)
Format/Media Type (variability)
Uncertainty/Quality (veracity)
etc.
Analyses
Selection/Grouping (map/reduce)
Relational Operators (Join/Correlation)
Extraction & Integration (map/reduce or dataflow systems)
Data Mining (R, S+, Matlab)
Predictive Models (R, S+, Matlab)
etc.
Data-driven applications will revolutionize decision making in business and the sciences!
lifecycle management
home automation
health
water management
market research
traffic management
energy management
information marketplaces
The general consensus seems to be that people have (or would like to have) lots of data that they would like to analyze. How do we go about that?
Running in Circles?
[Figure: labels arranged in a cycle: SQL, NoMapReduce, scripting, wrong platform?, XQuery?, SQL--, column store++, a query plan, scalable parallel sort]
What is Wrong with this Picture?
Big Data Technologies Worldwide
■ MapReduce and Tools: Hadoop, Cloudera Impala (USA), InfoSphere BigInsights (IBM)
■ Data Mining and Information Extraction: SystemT (IBM)
■ Statistical Packages: R (University of Auckland), Matlab (MathWorks; USA), SAS, SPSS
■ DBMS: Aster (Teradata), Greenplum (EMC), DB2 (IBM), SQL Server (Microsoft), Oracle Exadata (Oracle)
■ Business Analytics/Intelligence: SAS Analytics (SAS), Vertica (HP), SPSS (IBM)
■ Scripting Languages: Pig, Hive, JAQL
■ Graph Databases: Neo4j (USA), AllegroGraph (Franz Inc.), Giraph (Apache)
■ Visualization: Tableau (Tableau Software; USA)
■ Search / Indexing: Lucene (Apache), Solr (Apache)
System                     | Type                            | Stakeholder
ParStream                  | DBMS (in-memory)                | ParStream
SAP Hana                   | DBMS (in-memory)                | SAP
Hyper                      | DBMS (in-memory)                | TU München
BigMemory                  | DBMS (in-memory)                | Terracotta Inc./Software AG
Scalaris                   | Storage (parallel)              | Konrad-Zuse-Zentrum für Informationstechnik
RapidMiner                 | Data mining toolkit             | Rapid-I
ImageMaster                | Enterprise Content Management   | T-Systems
SalesIntelligence          | Market research                 | Implisense
Predictive-Analytics-Suite | Business Analytics              | Blue Yonder
Stratosphere               | Scalable Data Analytics System  | TU Berlin
3 Big Trends in Big Data
■ Solving complex data analysis problems
  □ Predictive analytics
  □ Infinite data streams
  □ Distributed data
  □ Dealing with uncertainty
  □ Bringing the human in the loop: Visual analytics
■ Producing results in near-real time
  □ First results fast
  □ Low latency
  □ Exploiting new hardware (manycore, OpenGL/CUDA, FPGA, heterogeneous processors, NUMA, remote memory)
■ Hiding complexity
  □ Declarative data analysis programs
    ► aggregation
    ► relational algebra
    ► iterative algorithms
    ► distributed state
  □ Fault tolerance (pessimistic/optimistic)
Value increase
Another “V“ - Value:
Big Data in the Cloud - the Information Economy
A major new trend in information processing will be the trading of
original and enriched data, effectively creating an information economy.
Cloud-Computing Stack
The Market situation
MIA, Datamarket, …
MIA, Azure IMR, etc. …
Salesforce.com, Office 2010 WebApps
Microsoft Azure, Google App Engine
Amazon Elastic Compute Cloud
Data as a Service
SaaS
PaaS
Information as a Service
IaaS
End users
System administrators
Application Developers
Analysts
Corporations
“When hardware became commoditized, software was valuable. Now that software is being commoditized, data is valuable.” (Tim O'Reilly)
“The important question isn't who owns the data. Ultimately, we all do. A better question is, who owns the means of analysis?” (A. Croll, Mashable, 2011)
Information Marketplaces:
http://www.mia-marktplatz.de
[Figure: marketplace architecture. Labels: Marketplace; Trust; Users; Data providers; Algorithms; German Web; Index; Distributed Data Storage; Massively Parallel Infrastructure; Infrastructure as a Service. Example services: Social Media Monitoring, Media Publisher Services, SEO. Flows: Data & Aggregation, Revenue Sharing, Technology Licensing, Queries, Analytical results.]
Data analysis program
Program compiler
Dataflow
Runtime operators: hash- and sort-based out-of-core operator implementations, memory management
Stratosphere optimizer: automatic selection of parallelization, shipping and local strategies, operator order and placement
Execution plan
Parallel execution engine: task scheduling, network data transfers, resource allocation, checkpointing
Job graph
Execution graph
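The "sort-based out-of-core operator" idea named above can be illustrated with a minimal external merge sort. This is only a sketch in Python, not Stratosphere's implementation: the function name `external_sort` and the `run_size` parameter are hypothetical, and the input is an in-memory list of integers purely to keep the example short. The technique is to spill sorted runs that fit in memory to temporary files and then merge the runs while streaming.

```python
import heapq
import tempfile


def external_sort(values, run_size):
    """Sort integers via sorted runs spilled to temp files plus a k-way merge.

    `run_size` stands in for the amount of data that fits in memory
    (an assumption made for this illustration).
    """
    runs = []
    for start in range(0, len(values), run_size):
        # Sort one memory-sized chunk and spill it to a temporary file.
        run = sorted(values[start:start + run_size])
        spill = tempfile.TemporaryFile(mode="w+")
        spill.writelines(f"{v}\n" for v in run)
        spill.seek(0)
        runs.append(spill)
    try:
        # Stream every run back lazily and merge them; only one record
        # per run needs to be held in memory during the merge.
        streams = [(int(line) for line in spill) for spill in runs]
        return list(heapq.merge(*streams))
    finally:
        for spill in runs:
            spill.close()


result = external_sort([5, 3, 9, 1, 4, 8, 2], run_size=3)
```

A real engine would read its input from a stream and hand the merged output to the next operator instead of materializing a list.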
Big Data looks Tiny from the Stratosphere
Stratosphere is a European declarative, massively parallel Big Data analytics system, funded by DFG and EIT and available as open source under the Apache License.
Germany is on the Map!
“Scalable Machine Learning for Big Data”, tutorial by Microsoft at the IEEE International Conference
5 Mistakes Data Scientists Make When Putting Big Data To Work
Selecting a platform that scales
  □ Performance (runtime, first results fast)
  □ People
Selecting the right data
  □ Information Quality
  □ Timeliness
Choosing the right model
  □ Assumptions about the data
Reproducibility
  □ Numerical stability
Trusting your results
  □ Confidence and statistical significance
  □ Lineage
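The "numerical stability" point can be made concrete with a small sketch. The Python code below is an illustration added here, not material from the talk, and both function names are hypothetical. It compares the textbook one-pass variance formula E[x²] − E[x]² with Welford's incremental update on values that carry a large common offset, a situation in which the textbook formula loses almost all significant digits.

```python
def naive_variance(xs):
    """Population variance via E[x^2] - E[x]^2 (numerically fragile)."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return total_sq / n - (total / n) ** 2


def welford_variance(xs):
    """Population variance via Welford's incremental update (stable)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in xs:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return m2 / count


# The offsets 4, 7, 13, 16 have population variance 22.5;
# adding 1e9 to every value does not change the true variance.
data = [1e9 + v for v in (4.0, 7.0, 13.0, 16.0)]
stable = welford_variance(data)
fragile = naive_variance(data)
```

Here `stable` recovers 22.5, while `fragile` is off by more than one unit because the squared terms exceed what double precision can resolve.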
“Big Data” is Interdisciplinary
Legal Dimension: Copyright, Privacy, Liability
Social Dimension: User Behaviour, Societal Impact, Collaboration
Economic Dimension: Business Models, Benchmarking, Impact of Open Source, Deployment, Pricing
Technology Dimension: Scalable Data Processing, Signal Processing, Statistics/ML, Linguistics, HCI/Visualization
Application Dimension: Digital Preservation, Decision Making, Verticals
Call to Action
■ Educate Data Scientists to create the required talent
  □ T-shaped students
  □ Information “literacy”
  □ Data Analytics Curriculum
■ Research Big Data Analytics Technologies
  □ Data management (uncertainty, query processing under near real-time constraints, information extraction)
  □ Programming models
  □ Machine learning and statistical methods
  □ Systems architectures
  □ Information Visualization
■ Innovate to maintain competitiveness
  □ Demonstrate flagship use-cases to raise awareness
  □ Promote startups in the area of data analytics
  □ Transfer technologies to enterprises, in particular SMEs
  □ Determine legal frameworks and business models
for Big Data Analytics
1: Thou shall use declarative languages
All-pairs shortest paths using recursive doubling in Stratosphere’s Scala front-end
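The slide's Scala program is not reproduced in this text, so the following is only a minimal Python sketch of the recursive doubling technique itself, not the Stratosphere front-end code. The function names and the adjacency-matrix input format are assumptions made for illustration. The idea: squaring the distance matrix under the (min, +) product doubles the maximum number of edges a path may use, so about log2(n) squarings suffice for n vertices (assuming no negative cycles).

```python
import math

INF = math.inf


def min_plus_square(dist):
    """One doubling step: combine every pair of known paths i -> k -> j."""
    n = len(dist)
    return [
        [min(dist[i][k] + dist[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def all_pairs_shortest_paths(weights):
    """weights[i][j] is the edge weight from i to j, or INF if there is no edge."""
    n = len(weights)
    # Paths of at most one edge; staying at a vertex costs nothing.
    dist = [[0 if i == j else weights[i][j] for j in range(n)] for i in range(n)]
    max_edges = 1
    while max_edges < n - 1:
        dist = min_plus_square(dist)  # now covers paths with up to 2 * max_edges edges
        max_edges *= 2
    return dist


# Hypothetical example graph: 0 -> 1 (1), 1 -> 2 (2), 2 -> 3 (1), 0 -> 3 (10).
graph = [
    [INF, 1, INF, 10],
    [INF, INF, 2, INF],
    [INF, INF, INF, 1],
    [INF, INF, INF, INF],
]
shortest = all_pairs_shortest_paths(graph)
```

In a dataflow system, each squaring step corresponds to a join of the distance relation with itself on the intermediate vertex, followed by a minimum aggregation per vertex pair.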
Beyond MapReduce
3: Thou shall use rich primitives
Map
Reduce
Cross
Match
CoGroup
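To make the five primitives above concrete, here is a simplified in-memory sketch in Python. It is not the Stratosphere API: the `pact_*` names, the use of lists of (key, value) pairs, and the callback signatures are all assumptions chosen for illustration. Each primitive is a second-order function that decides how records are grouped or paired before a user-defined function (UDF) is called.

```python
from collections import defaultdict
from itertools import product


def group_by_key(records):
    """Collect the values of (key, value) records per key."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def pact_map(udf, records):
    """Map: call the UDF on every record independently."""
    return [udf(record) for record in records]


def pact_reduce(udf, records):
    """Reduce: call the UDF once per key with all values of that key."""
    return [udf(key, values) for key, values in group_by_key(records).items()]


def pact_cross(udf, left, right):
    """Cross: call the UDF on every pair of the Cartesian product."""
    return [udf(l, r) for l, r in product(left, right)]


def pact_match(udf, left, right):
    """Match: call the UDF on every pair of records with equal keys (equi-join)."""
    right_groups = group_by_key(right)
    return [udf(k, lv, rv) for k, lv in left for rv in right_groups.get(k, [])]


def pact_cogroup(udf, left, right):
    """CoGroup: call the UDF once per key with the value groups of both inputs."""
    lg, rg = group_by_key(left), group_by_key(right)
    return [udf(k, lg.get(k, []), rg.get(k, [])) for k in sorted(set(lg) | set(rg))]


# Hypothetical example data.
orders = [("alice", 30), ("bob", 20), ("alice", 5)]
cities = [("alice", "Berlin"), ("carol", "Munich")]

totals = pact_reduce(lambda k, vs: (k, sum(vs)), orders)
joined = pact_match(lambda k, amount, city: (k, amount, city), orders, cities)
counts = pact_cogroup(lambda k, a, c: (k, len(a), len(c)), orders, cities)
```

Match and CoGroup differ in what the UDF sees: Match yields nothing for keys present in only one input, while CoGroup still calls the UDF with an empty group for the other side.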
2: Thou shall accept external (dynamic) sources
4: Thou shall deeply embed UDFs
Concise and flexible
6: Thou shall iterate
Needed for most interesting analysis cases
Pregel as a Stratosphere query plan with comparable performance.
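As a rough illustration of why iteration matters, the sketch below runs a vertex-centric computation in the style of Pregel as a bulk iteration: the same step function is applied to the whole state until nothing changes. It is plain Python written for this text, not Stratosphere or Pregel code; `bulk_iterate`, `connected_components`, and the `max_iterations` guard are hypothetical names. The example computes connected components by letting every vertex repeatedly adopt the smallest label among itself and its neighbors.

```python
def bulk_iterate(step, state, max_iterations=100):
    """Apply `step` to the full state until a fixpoint or the iteration limit."""
    for _ in range(max_iterations):
        new_state = step(state)
        if new_state == state:
            return new_state
        state = new_state
    return state


def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    def superstep(labels):
        # Each vertex takes the minimum of its own label and its neighbors' labels.
        return {
            v: min([labels[v]] + [labels[u] for u in neighbors[v]])
            for v in labels
        }

    return bulk_iterate(superstep, {v: v for v in vertices})


components = connected_components(
    vertices=[1, 2, 3, 4, 5, 6],
    edges=[(1, 2), (2, 3), (4, 5)],
)
```

A plain MapReduce job has no loop construct, so each superstep would have to be scheduled as a separate job; treating the loop as part of the plan avoids that overhead.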
7: Thou shall use a scalable and efficient execution engine
Pipeline and data parallelism, flexible checkpointing, optimized network data transfers
Meteor program
Meteor compiler
Pact program
Runtime: hash- and sort-based out-of-core operator implementations, memory management
Stratosphere optimizer: picks data shipping and local strategies, operator order
Execution plan
Nephele execution engine: task scheduling, network data transfers, resource allocation, checkpointing
Stratosphere bulk iterations
commandments
Currently making the optimizer iteration-aware and exposing PACT-level
PACT4S: Embedding of PACTs in Scala