• No results found

How To Get The Most Out Of Big Data

N/A
N/A
Protected

Academic year: 2021

Share "How To Get The Most Out Of Big Data"

Copied!
29
0
0

Loading.... (view fulltext now)

Full text

(1)

Big Data Analytics

Chances and Challenges

Volker Markl | DIMA | BDOD Big Data Chances and Challenges

Volker Markl

Professor and Chair

Database Systems and Information Management

(DIMA), Technische Universität Berlin

(2)

More and more data is available to science and business!

Big Data Analytics | Volker Markl | BDOD Big Data – Chances and Challenges

Drivers:

Cloud Computing

Internet of Services

Internet of Things

Cyberphysical Systems

Underlying Trends:

Connectivity

Collaboration

Computer generated data

video streams

web archives

sensor data

audio streams

RFID data

simulation data

(3)

Data are becoming increasingly complex!

Size

(volume)

Freshness

(velocity)

Format/Media Type

(variability)

Uncertainty/Quality

(veracity)

etc.

(4)

Data and analyses are becoming increasingly complex!

Big Data Analytics | Volker Markl | BDOD Big Data – Chances and Challenges

Size

(volume)

Freshness

(velocity)

Format/Media Type

(variability)

Uncertainty/Quality

(veracity)

etc.

Data

Selection/Grouping

(map/reduce)

Relational Operators

(Join/Correlation)

Extraction & Integration,

(map/reduce or dataflow systems)

Data Mining

(R, S+, Matlab)

Predictive Models

(R, S+, Matlab)

etc.

(5)

Data-driven applications …

… will revolutionize decision making in business and the sciences!

lifecycle management

home automation

health

water management

market research

traffic management

energy management

information marketplaces

(6)

Big Data Analytics | Volker Markl | BDOD Big Data – Chances and Challenges

General consensus seems

to be that people

(would like to) have lots of

data, which they would like

to analyze. How do we go

about that?

(7)

Running in Circles?

SQL

NoMapReduce

(8)

Big Data Analytics | Volker Markl | BDOD Big Data – Chances and Challenges

scripting

wrong

platform?

XQuery?

SQL--column

store++

a query

plan

scalable

parallel sort

What is Wrong with this Picture?

(9)

Big Data Technologies Worldwide

MapReduce und Tools: Hadoop, Cloudera Impala (USA), InfoSphere

BigInsights (IBM)

Data Mining und Informationsextraktion: SystemT (IBM),

Statistical Packages: R (Universität Auckland), Matlab (Mathworks;USA), SAS,

SPSS

DBMS: Aster (Teradata), Greenplum (EMC), DB2 (IBM), SQLServer (Microsoft)

Oracle Exadata (Oracle)

Business Analytics/Intelligence: SAS Analytics (SAS), Vertica (HP), SPSS IBM)

Script Sprachen: Pig, Hive, JAQL

Graph Databases: Neo4j (USA), AllegroGraph (Franz Inc.; Giraph (Apache)

Visualization: Tableau (Tableau Software; USA),

Suche / Indexing: Lucene (Apache), Solr (Apache)

(10)

Big Data Analytics | Volker Markl | BDOD Big Data – Chances and Challenges

System

Type

Stakeholder

ParStream

DBMS (in-memory)

ParStream

SAP Hana

DBMS(in-memory)

SAP

Hyper

DBMS(in-memory)

TU München

BigMemory

DBMS(in-memory)

Terracotta Inc./Software AG

Scalaris

Storage (parallel)

Konrad-Zuse-Zentrum für

Informationstechnik

RapidMiner

Data mining toolkit

Rapid-I

ImageMaster

Enterprise Content Management

T-Systems

SalesIntelligence

Market research

Implisense

Predictive-Analytics-Suite

Business Analytics

Blue Yonder

Stratosphere

Scalable Data Analytics System

TU Berlin

(11)

3 Big Trends in Big Data

Solving complex data analysis problems

Predictive analytics

Infinite data streams

Distributed data

Dealing with uncertainty

Bringing the human in the loop: Visual analytics

Producing results in near-real time

First results fast

Low latency

Exploiting new Hardware

(manycore, OpenGL/CUDA, FPGA, heterogeneous processors, NUMA, remote memory)

Hiding complexity

Declarative data anaylsis programs

aggregation

relational algebra

iterative algorithms

distributed state

Fault tolerance (pessimistic/optimistic)

(12)

V

a

lu

e

i

n

c

r

e

a

s

e

Another “V“ - Value:

Big Data in the Cloud - the Information Economy

A major new trend in information processing will be the trading of

original and enriched data, effectively creating an information economy.

Cloud-Computing Stack

The Market situation

MIA, Datamarket, …

MIA, Azure IMR, etc …

Salesforce.com,

Office 2010 WebApps

Microsoft Azure,

Google App Engine

Amazon Elastic

Compute Cloud

Data as a Service

SaaS

PaaS

Information as a Service

IaaS

End users

System

administrators

Application

Developers

Analysts

Corporations

„When hardware became commoditized, software was valuable.

Now that software is being commoditized, data is valuable.“

(T

IM

O‘R

EILLY

)

„The important question isn’t who owns the data. Ultimately, we all do.

A better question is, who owns the means of analysis?“

(A. C

ROLL

, M

ASHABLE

, 2011)

(13)

Marketplace

T

ru

s

t

Infrastructure as a Service

e.g., Social Media

Monitoring e.g., Social Media

Monitoring

e.g., Media Publisher Services

e.g., Media Publisher

Services e.g., SEOe.g., SEO

Distributed Data Storage

Distributed Data Storage

Massively Parallel Infrastructure

Massively Parallel Infrastructure

Index

Index

Users

Dataproviders

Algorithms

Data & Aggregation Revenue

Sharing Technology Licensing Queries

Analytical results

German Web

http://www.mia-marktplatz.de

Information Marketplaces:

(14)

Data analysis program

Big Data Analytics | Volker Markl | BDOD Big Data – Chances and Challenges

Program

compiler

Dataflow

Runtime operators

Hash- and sort-based out-of-core

operator implementations,

memory management

Stratosphere optimizer

Automatic selection of parallelization,

shipping and local strategies, operator

order and placement

Execution plan

Parallel execution engine

Task scheduling, network data transfers,

resource allocation, checkpointing

Job graph

Execution graph

Big Data

looks

Tiny

from the Stratosphere

Stratosphere is a European declarative, massively-parallel Big Data Analytics System,

funded by DFG and EIT, available open-source with Apache License

(15)

Germany is on the Map!

“Scalable Machine Learning for Big Data”

Tutorial by Microsoft at the

IEEE International Conference

(16)

5 Mistakes

Data Scientists Make When Putting Big Data To Work

Selecting a platform that scales

Performance (runtime, first-results fast)

People

Selecting the right data

Information Quality

Timeliness

Choosing the right model

Assumptions about the data

Reproducability

Numerical stability

Trusting your results

Confidence and statistical significance

Lineage

(17)

Legal

Dimension

Social

Dimension

Economic

Dimension

Technology

Dimension

Application

Dimension

Business Models

Benchmarking

Impact of Open Source

Deployment

Pricing

Scalable Data Processing

Signal Processing

Statistics/ML

Linguistics

HCI/Visualization

„Big Data“ is Interdisciplinary

Copyright

Privacy

Liability

User Behaviour

Societal Impact

Collaboration

Digital Preservation

Decision Making

Verticals

(18)

Call to Action

Big Data Analytics | Volker Markl | BDOD Big Data – Chances and Challenges

Educate Data Scientists to create the required talent

T-shaped students

Information „literacy“

Data Analytics Curriculum

Research Big Data Analytics Technologies

Data management (uncertainty, query processing under near real-time constraints,

information extraction)

Programming models

Machine learning and statistical methods

Systems architectures

Information Visualization

Innovate to maintain competetiveness

Demonstrate flagship use-cases to raise awareness

Promote startups in the area of data analytics

Transfer technologies to enterprises, in particular SMEs

Determine legal frameworks and business models

(19)

for

Big

Data

Analytics

(20)

1: Thou shall use declarative languages

Big Data Analytics | Volker Markl | BDOD Big Data – Chances and Challenges

All-pairs shortest paths using recursive doubling in

Stratosphere’s Scala front-end

(21)

Beyond MapReduce

3: Thou shall use rich primitives

Map

Reduce

Cross

Match

CoGroup

2: Thou shall accept

external (dynamic)

sources

(22)

4: Thou shall deeply

embed UDFs

Big Data Analytics | Volker Markl | BDOD Big Data – Chances and Challenges

Concise and flexible

(23)

6: Thou shall iterate

Needed for most

interesting

analysis cases

Pregel as a Stratosphere query plan

with comparable performance.

(24)

7: Thou shall use a scalable and efficient execution engine

Big Data Analytics | Volker Markl | BDOD Big Data – Chances and Challenges

Pipeline and data parallelism, flexible

checkpointing, optimized network data

transfers

(25)
(26)

Meteor program

Big Data Analytics | Volker Markl | BDOD Big Data – Chances and Challenges

Meteor compiler

Pact program

Runtime

Hash- and sort-based out-of-core

operator implementations, memory

management

Stratosphere optimizer

Picks data shipping and local

strategies, operator order

Execution plan

Nephele execution engine

Task scheduling, network data t

ransfers,

resource allocation, checkpointing

(27)

Stratosphere bulk iterations

commandments

Currently making the optimizer

iteration-aware and exposing PACT-level

(28)

PACT4S: Embedding of PACTs in Scala

Big Data Analytics | Volker Markl | BDOD Big Data – Chances and Challenges

Master thesis developed language and Scala compiler plug-in, ongoing

Master thesis will integrate in query optimizer, still in (very) prototypical

mode

(29)

Ingredients of a Big Data Project

database systems,

scalable platforms

machine learning/statistics/

optimization

real-world

analysis problem

visualization

NLP

signal processing

technology

(data preparation,

scalable processing)

mathematical

References

Related documents

Recently advocacy groups have put these two strands of the literature together to argue that locating affordable housing near TOD, by providing locations for low-income persons

Recently, however, several agile frameworks designed for large-scale settings have emerged of which the most commonly adopted is the Scaled Agile Framework (SAFe) according to

From 1990 through 1999 almost 3.2 billion guilders from the Netherlands’ budget for development assistance were spent on relief of the external debt of developing countries. A

We believe that the book will be read by the people with a common inter- est in geospatial techniques, remote sensing, sustainable water resource develop- ment, applications and

For example, it may be known that every patient from a given specialty will be admitted to the ward after surgery; however, the number of patients who receive surgery during 1 OR

After training volunteers help with feeding, cleaning, all round care for wildlife and now is the best time for training before we get busy. Volunteer trades people

The two Reports, together with the University’s Reports and Financial Statements for the year ended 31 July 2014 (which are also published in this issue) will be brought forward for

-v- Trinidad Cement Limited Hearing 09:30 w/day CASE MANAGEMENT