(1)

Big Data Analytics

Prof. Dr. Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL)

Institute of Computer Science

University of Hildesheim, Germany

33rd meeting of the Arbeitskreis Informationstechnologie (Information Technology Working Group), Hildesheim

[based on original slides by Lucas Rego Drumond, ISMLL 2014]

Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(2)

Outline

1. What is Big Data?

2. Where to Find Big Data?

3. Big Data Applications

4. How to analyze Big Data?

5. Big Data at ISMLL, University of Hildesheim

6. Conclusions

(4)

What is Big Data?

(5)

What is Big Data?

Some definitions:

- “A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”
  http://en.wikipedia.org/wiki/Big_data

- “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
  www.gartner.com/it-glossary/big-data/

(6)

Big Data — Dimensions (the “4 Vs”)

(7)

What is Big Data?

Big Data is about:

- Storing and accessing large amounts of (unstructured/complex) data
  - querying
- Processing high-volume data streams
- Decision support based on large data
  - visualization, reporting
  - navigation / query interfaces
  - contextualization / sense making
- Building predictive models trained on large data

(9)

Big Data in Physics (CERN)

- The Large Hadron Collider has collected data from over 300 trillion proton-proton collisions
- Approx. 25 Petabytes per year

(10)

Big Data in Genetics (Ensembl)

- The Ensembl database contains the genomes of humans and 50 other species
- “only” 250 GB

Source: http://www.ensembl.org/

(11)

Big Data in the Web (Google)

- 3.3 billion searches per day (on average)
- 30 trillion unique URLs identified on the Web
- 20 billion sites crawled per day
- In 2008, Google processed more than 20 Petabytes of data per day

Sources:
- http://searchengineland.com/google-search-press-129925
- Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.

(12)

Big Data in Social Media (Facebook)

- 1.28 billion users (1.23 billion monthly active in January 2014)
- Size of user data stored by Facebook: 300 Petabytes
- Average amount of data that Facebook takes in daily: 600 Terabytes
- Size of Facebook’s Graph Search database: 700 Terabytes

Source: http://allfacebook.com/orcfile_b130817

(13)

Big Data in Social Media (Twitter)

- Average number of tweets per day: 58 million
- Number of Twitter search engine queries every day: 2.1 billion
- Total number of active registered Twitter users: 645,750,000

Source: http://www.statisticbrain.com/twitter-statistics/

(14)

Big Data — Public Datasets

Dataset                 Content                                     Size
1000 Genomes Project    DNA of 1700 humans                          200 TB
Common Crawl Corpus     5G web pages                                81 TB
Wikipedia / Freebase    1.9G subject/predicate/object triples       250 GB
Million Song Dataset    audio features of 1M songs                  280 GB
OpenStreetMap           a map of earth                              90 GB
2000 US Census          US census data                              200 GB
PubChem library         biological activities of small molecules    230 GB
NCDC weather data       daily measurements from 9000 stations       20 GB
Open Library            metadata of 20M books                       7 GB
Twitter                 1.6G tweets                                 0.6 GB

For comparison: CD 700 MB, DVD 4.7–17 GB, Blu-ray 25–100 GB, hard disk 3–4 TB.

(15)

How Large is 1 Petabyte (PB)?

- 35.7M years if counted at one byte per second
- 254 years listening to music stored in CD quality (500 MB/h)
- 25 years watching DVDs
  - 223,000 DVDs (at 4.7 GB each)
  - but there are only 74,000 TV movies on IMDb (1.8M including TV episodes)
- can be stored on 341 hard disks at 3 TB / 90 € each (about 30,000 €)
- 96 days to read sequentially from standard hard disks (1030 Mbit/s)
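
The arithmetic behind several of these petabyte comparisons can be sketched in a few lines. The slide does not state its unit conventions; the sketch below assumes binary prefixes (1 PB = 2^50 bytes, 1 Mbit = 2^20 bits) and a 365.25-day year, which are assumptions chosen here for illustration.

```python
# Back-of-the-envelope check of some "1 petabyte" comparisons.
# Assumptions (not stated on the slide): binary prefixes and a 365.25-day year.
GB = 2**30
TB = 2**40
PB = 2**50
SECONDS_PER_YEAR = 365.25 * 24 * 3600
SECONDS_PER_DAY = 24 * 3600

# Counting one byte per second.
years_counting = PB / SECONDS_PER_YEAR

# Number of 4.7 GB DVDs and 3 TB hard disks needed to hold 1 PB.
dvds = PB / (4.7 * GB)
disks = PB / (3 * TB)

# Sequential read at 1030 Mbit/s, converted to bytes per second.
read_rate_bytes_per_s = 1030 * 2**20 / 8
days_reading = PB / read_rate_bytes_per_s / SECONDS_PER_DAY

print(f"{years_counting / 1e6:.1f}M years counting one byte per second")
print(f"{dvds:,.0f} DVDs, {disks:.1f} hard disks")
print(f"{days_reading:.1f} days of sequential reading")
```

Under these assumptions the results land close to the quoted 35.7M years, 223,000 DVDs, 341 hard disks and 96 days; decimal prefixes would give somewhat different values.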

(17)

What to do with Big Data?

Application examples:

- Test complex scientific hypotheses (physics, genetics)
- Index the web, including relevance feedback by users (web)
- Online personalized advertising (social media, esp. Google)
- Recommender systems (e-commerce, esp. Amazon)
- Media analysis, sentiment analysis, market research (social media)
  - e.g., Obama campaign 2012

(18)

More Applications & Case Studies

- T-Mobile USA: integrated Big Data across multiple IT systems to combine customer transaction and interaction data in order to better predict customer defections
  - By leveraging social media data along with transaction data from CRM and billing systems, customer defections are said to have been cut in half in a single quarter.
- US Xpress: collects data elements ranging from fuel usage to tire condition to truck engine operations to GPS information
  - Optimal fleet management
- McLaren’s Formula One racing team: real-time car sensor data during car races
  - Real-time identification of issues with its racing cars

(20)

How to handle Big Data? — The BI Approach

Data Warehouse

- Static databases (snapshots)
- Structured data
- Centralized approaches

(21)

How to handle Big Data? — The Distributed Approach

- Massive parallelism
- Heterogeneous data sources
- Unstructured data
- Data streams

(22)

Challenges

Challenges in dealing with large volumes of data:

- Store and query large amounts of data efficiently in a distributed environment
  - distributed file systems
  - NoSQL databases, distributed databases
- Process large distributed data efficiently
  - execution environments
- Scale / distribute machine learning techniques
  - distributed learning algorithms (message passing)
  - ML execution environments
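
To make the idea of a distributed learning algorithm based on message passing more concrete, here is a minimal single-process sketch of one common scheme, synchronous gradient averaging. The slides do not say which algorithm is meant; the toy data, the function names `local_gradient` and `train`, and the choice of a one-parameter linear model are all assumptions made for illustration.

```python
# Toy simulation of data-parallel learning: each "worker" holds one shard of
# the data, computes a gradient locally, and only gradients are exchanged.
# Hypothetical example; a real system would run workers on separate nodes.

def local_gradient(w, shard):
    """Gradient of the mean of 0.5 * (w*x - y)**2 over one worker's shard."""
    return sum(x * (w * x - y) for x, y in shard) / len(shard)

def train(shards, learning_rate=0.1, steps=20):
    w = 0.0
    for _ in range(steps):
        # Each worker looks only at its own shard.
        gradients = [local_gradient(w, shard) for shard in shards]
        # "Message passing" step: workers agree on the average gradient.
        average_gradient = sum(gradients) / len(gradients)
        w -= learning_rate * average_gradient
    return w

# Two workers, data generated from y = 2 * x.
shards = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
]
w = train(shards)
print(f"learned slope: {w:.6f}")
```

Only the gradients cross worker boundaries here, never the raw data, which is the property that lets this pattern scale to data too large for one machine.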

(23)

Execution Environments: MapReduce
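
As a rough illustration of the MapReduce programming model, the classic word-count example can be simulated in a single process. This is a sketch of the map, shuffle and reduce phases only, not Hadoop's API; the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data", "big clusters"])
print(counts)
```

In a real execution environment the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves data across the network so that each reducer sees all values for its keys.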

(24)

Execution Environments: GraphLab
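
GraphLab is built around vertex-centric programs on graphs. The following is a simplified, synchronous, single-machine sketch of that style using PageRank as the example; it does not use GraphLab's actual API, and the function name `pagerank` and the tiny example graph are assumptions for illustration.

```python
def pagerank(out_edges, damping=0.85, iterations=50):
    """Vertex-centric PageRank on a graph given as {vertex: [successors]}.

    Assumes every vertex has at least one outgoing edge.
    """
    vertices = list(out_edges)
    # Build the reverse adjacency so each vertex can gather from its in-neighbours.
    in_edges = {v: [] for v in vertices}
    for source, targets in out_edges.items():
        for target in targets:
            in_edges[target].append(source)

    rank = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(iterations):
        new_rank = {}
        for v in vertices:
            # Gather: collect rank contributions from in-neighbours.
            gathered = sum(rank[u] / len(out_edges[u]) for u in in_edges[v])
            # Apply: update this vertex's own value.
            new_rank[v] = (1 - damping) / len(vertices) + damping * gathered
        rank = new_rank
    return rank

# A three-vertex cycle: a -> b -> c -> a.
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
print(ranks)
```

Each vertex update depends only on its neighbours, which is what allows a graph-parallel environment to distribute the per-vertex work across machines.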

(26)

Usage of Big Data in Cooperative Research Projects

- transportation
  - REDUCTION: Danish taxi fleet (14,000 vehicles, 1.5 million trips, 2.2 billion GPS measurements)
- e-commerce / recommender systems
  - Netflix dataset: 100 million transactions
  - Rossmann Online / Compra
- technology-enhanced learning
  - Whizz Education (1200 exercises, 250,000 students, 30 million interactions)
- engineering data mining
  - Rolls-Royce: jet engine vibration
  - Detectino: ground-penetrating radar

(27)

ISMLL Compute Cluster

- 61 compute nodes
- 840 cores, 1288 threads
- 2.2 TB RAM
- 183 TB hard disk capacity
- 10 TFlops
- special nodes:
  - database server
  - coprocessor compute node (240 cores, 960 threads, 4 TFlops)
- software:
  - simple scheduler (Sun Grid Engine)
  - MapReduce (Hadoop)

(28)

Data Analytics — A New Study Programme in 2015

Term  Module                              Type      ECTS
1     Advanced Machine Learning           lecture   6
      Modern Optimization Techniques      lecture   6
      Programming Machine Learning        lab       6
      Seminar Data Analytics I            seminar   4
      application module I                misc.     6
2     Advanced Database Technologies      lecture   6
      Data and Privacy Protection         lecture   3
      methodological specialization I     lecture   6
      Distributed Machine Learning        lab       6
      Seminar Data Analytics II           seminar   4
      Project (part I)                    project   6
3     Planning and Optimal Control        lecture   6
      methodological specialization II    lecture   6
      Project (part II)                   project   9
      Seminar Data Analytics III          seminar   4
      application module II               misc.     6
4     Master thesis and colloquium        thesis    30
      Total                                         120

(29)

Data Analytics — A New Study Programme in 2015

- solid, state-of-the-art education in all fundamental aspects of data analytics
  - machine learning
  - analytical database technology & execution models
  - planning and control
- several methodological specializations:
  - Bayesian Networks, Computer Vision, etc.
- integrated application area
  - media systems, software engineering, environmental sciences, computer linguistics, information sciences, psychology (several still requested)
- hands-on lab courses
- a deep and fun two-term integrated group project
- fully internationally targeted (completely in English)

(31)

Big Data — Chances . . .

(32)

Big Data — . . . and Risks

(33)

Conclusions

- Big Data Analytics addresses the analysis of large and complex data (volume, variety, velocity, veracity)
  - at petabyte scale (1 petabyte = 1024 terabytes)
- Big Data emerges naturally in many different domains (science, web, e-commerce, social media, robotics)
- Big Data requires massively parallel infrastructure to be analyzed in a timely manner
  - data centers with hundreds or thousands of compute nodes
  - distributed databases, NoSQL databases
  - execution environments (MapReduce)
- To exploit big data in a principled and optimal way, machine learning methods have to be scaled and distributed,
  - requiring innovations at the level of the learning algorithms,
  - also requiring special ML execution environments.
