• No results found

A Berkeley View of Big Data

N/A
N/A
Protected

Academic year: 2021

Share "A Berkeley View of Big Data"

Copied!
27
0
0

Loading.... (view fulltext now)

Full text

(1)

A Berkeley View of Big Data

Ion Stoica UC Berkeley

BEARS

(2)

Big Data is Massive…

• Facebook:

– 130TB/day: user logs

– 200-400TB/day: 83 million pictures

• Google: > 25 PB/day processed

data

• Data generated by LHC: 1 PB/sec

• Total data created in 2010:

1

.

ZettaByte (1,000,000 PB)/year

– ~60% increase every year

(3)

…and Grows Bigger and Bigger!

• More and more devices

• More and more people

• Cheaper and cheaper storage

– ~50% increase in GB/$ every year

(4)

…and Grows Bigger and Bigger!

• Log everything!

– Don’t always know what question you’ll need to answer

• Hard to decide what to delete

– Thankless decision: people know only when you are wrong!

– “Climate Research Unit (CRU) scientists admit they threw away key data used in global warming calculations”

(5)

What is Big Data?

• You don’t need to be big to have big data problem!

– Inadequate tools to analyze data

– Data management may dominate infrastructure cost

5

Data that is expensive to manage,

and hard to extract value from

(6)

Big Data is not Cheap!

• Storing and managing 1PB

data: $500K-$1M/ year

– Facebook: 200 PB/year

• “Typical” cloud-based

service startup (e.g.,

Conviva)

– Log storage dominates

infrastructure cost 0% 20% 40% 60% 80% 100% 2007 2008 2009 2010 Storage cluster Other

In fr a s tr u c tu re c o s t ~1PB storage capacity

(7)

Hard to Extract Value from Data!

• Data is

– Diverse, variety of sources

– Uncurated, no schema, inconsistent semantics, syntax – Integration a huge challenge

• No easy way to get answers that are

– High-quality

– Timely

• Challenge: maximize value from data by getting

best possible answers

(8)

Requires Multifaceted Approach

• Three dimensions to improve data

analysis

– Improving scale, efficiency, and quality of algorithms (Algorithms)

– Scaling up datacenters (Machines)

– Leverage human activity and intelligence (People)

• Need to adaptively and flexibly combine all

three dimensions

(9)

Algorithms, Machines, People

• Today’s apps: fixed point in solution space

9

Algorithms

Machines

People

Need techniques to dynamically pick best

operating point

search Watson/IBM

(10)

The AMP Lab

search Watson/IBM

Machines Algorithms

Make sense of data at scale by tightly

(11)

AMP Faculty and Sponsors

• Faculty

– Alex Bayen (mobile sensing platforms) – Armando Fox (systems)

– Michael Franklin (databases): Director

– Michael Jordan (machine learning): Co-director – Anthony Joseph (security & privacy)

– Randy Katz (systems)

– David Patterson (systems)

– Ion Stoica (systems): Co-director – Scott Shenker (networking)

• Sponsors:

(12)

Algorithms

• State-of-art Machine Learning (ML)

algorithms do not scale

– Prohibitive to process all data points

How do you know when to stop? true answer E s ti ma te # of data points

(13)

Algorithms

• Given any problem, data and a budget

– Immediate results with continuous improvement – Calibrate answer: provide error bars

13

Error bars on every answer! E s ti ma te # of data points true answer

(14)

Algorithms

• Given any problem, data and a time budget

– Immediate results with continuous improvement – Calibrate answer: provide error bars

Stop when error

smaller than a given threshold E s ti ma te # of data points true answer

(15)

Algorithms

• Given any problem, data and a time budget

– Automatically pick the best algorithm

15 E s ti ma te time pick

sophisticated pick simple

error too high

true answer

sophisticated

(16)

Machines

• “The datacenter as a computer” still in its

infancy

– Special purpose clusters, e.g., Hadoop cluster – Highly variable performance

– Hard to program – Hard to debug

(17)

Machines

• Make datacenter a real computer!

17 Node OS (e.g. Linux) Node OS (e.g. Windows) Node OS (e.g. Linux)

Datacenter “OS” (e.g., Mesos)

• Share datacenter between multiple cluster computing

apps

• Provide new abstractions and services

AMP stack

Existing stack

(18)

Machines

• Make datacenter a real computer!

Node OS (e.g. Linux) Node OS (e.g. Windows) Node OS (e.g. Linux)

Datacenter “OS” (e.g., Mesos)

Hadoop M PI Hy per tbal e Ca ssa nd ra

Hive Support existing cluster computing apps

AMP stack

Existing stack

(19)

Machines

• Make datacenter a real computer!

19 Node OS (e.g. Linux) Node OS (e.g. Windows) Node OS (e.g. Linux) Sp ark SCADS

Datacenter “OS” (e.g., Mesos)

Hadoop M PI Hy per tbal e Ca ssa nd ra Hive PIQL Support interactive and iterative data analysis (e.g., ML algorithms) Consistency adjustable data store Predictive & insightful query language AMP stack Existing stack

(20)

Machines

• Make datacenter a real computer!

Node OS (e.g. Linux) Node OS (e.g. Windows) Node OS (e.g. Linux) Sp ark SCADS

Datacenter “OS” (e.g., Mesos) Applications, tools Hadoop M PI Hy per tbal e Ca ssa nd ra Hive PIQL • Advanced ML algorithms • Interactive data mining

• Collaborative visualization AMP stack

Existing stack

(21)

People

• Humans can make sense of messy data!

(22)

People

• Make people an integrated part of

the system!

– Leverage human activity

– Leverage human intelligence (crowdsourcing):

• Curate and clean dirty data • Answer imprecise questions • Test and improve algorithms

• Challenge

– Inconsistent answer quality in all dimensions (e.g., type of question, time, cost) Machines + Algorithms dat a, a c tiv it y Q ues ti ons An s we rs

(23)

Real Applications

• Mobile Millennium Project

Alex Bayen, Civil and Environment Engineering, UC Berkeley

• Microsimulation of urban development

Paul Waddell, College of

Environment Design, UC Berkeley

• Crowd based opinion formation

Ken Goldberg, Industrial

Engineering and Operations Research, UC Berkeley

• Personalized Sequencing

Taylor Sittler, UCSF

(24)
(25)

Sequencing Microsimulation

Mobile Millennium

The AMP Lab

25

Machines

People

Algorithms

Make sense of data at scale by tightly

(26)

Big Data in 2020

Almost Certainly: • Create a new

generation of big data scientist

• A real datacenter OS • ML becoming an

engineering discipline • People deeply

integrated in big data analysis pipeline

If We’re Lucky:

• System will know what to throw away • Generate new

knowledge that an individual person cannot

(27)

Summary

• Goal: Tame Big Data Problem

– Get results with right quality at the right time

• Approach: Holistically integrate

A

lgorithms,

M

achines, and

P

eople

• Huge research issues across many

domains

27 27

References

Related documents

In conclusion, for the studied Taiwanese population of diabetic patients undergoing hemodialysis, increased mortality rates are associated with higher average FPG levels at 1 and

The authors, thus, designed and conducted this prospective cohort study to compare the morbidity and mortality between moderate preterm and term babies; and evaluate

Among many TCM medical and philosophical concepts, I specifically focus on the healing, the silence and the miracle cure and how they are embodied and co-constructed by

In this paper, we propose a mobility pattern based location tracking scheme based, which efficiently reduces the location updates and searching cost in the

To the best of our knowledge, this is the first study to report prevalence rates, prevalence of comorbid depression, and the impact of depression on all-cause and Irritable

In this dissertation we will create the new kind of environment inspired by the Pinsky’s environment, “have your own cookie and eat it” environment, called “ k th Left Jump

By means of ATP luminometry, all strains showed growth curves with log, stationary and death phases, and culture colour change (from the original red to orange) firstly detected

As we are testing Active Safety Systems, the test scenario will be designed to use the Test Targets to force a particular behaviour from the VUT however we may inadvertently