• No results found

an Abstraction for Large-Scale Computation

N/A
N/A
Protected

Academic year: 2021

Share "an Abstraction for Large-Scale Computation"

Copied!
43
0
0

Loading.... (view fulltext now)

Full text

(1)

Experiences with MapReduce,

an Abstraction for Large-Scale Computation

Jeff Dean

Google, Inc.

(2)

Outline

Overview of our computing environment

MapReduce

overview, examples

implementation details

usage stats

Implications for parallel program development

(3)

Problem: lots of data

Example: 20+ billion web pages x 20KB = 400+ terabytes

One computer can read 30-35 MB/sec from disk – ~four months to read the web

~1,000 hard drives just to store the web

Even more to do something with the data

(4)

Solution: spread the work over many machines

Good news: same problem with 1000 machines, < 3 hours

Bad news: programming work – communication and coordination

recovering from machine failure

status reporting

debuggingoptimization

locality

Bad news II: repeat for every problem you want to solve

(5)

Computing Clusters

Many racks of computers, thousands of machines per cluster

Limited bisection bandwidth between racks

(6)

Machines

2 CPUs

Typically hyperthreaded or dual-core

Future machines will have more cores

1-6 locally-attached disks – 200GB to ~2 TB of disk

4GB-16GB of RAM

Typical machine runs:

Google File System (GFS) chunkserver

Scheduler daemon for starting user tasks

One or many user tasks

(7)

Implications of our Computing Environment

Single-thread performance doesn’t matter

We have large problems and total throughput/$ more important than peak performance

Stuff Breaks

If you have one server, it may stay up three years (1,000 days)

If you have 10,000 servers, expect to lose ten a day

“Ultra-reliable” hardware doesn’t really help

At large scales, super-fancy reliable hardware still fails, albeit less often

software still needs to be fault-tolerant

commodity machines without fancy hardware give better perf/$

How can we make it easy to write distributed programs?

(8)

MapReduce

A simple programming model that applies to many large-scale computing problems

Hide messy details in MapReduce runtime library:

automatic parallelization

load balancing

network and disk transfer optimization

handling of machine failures

robustness

improvements to core library benefit all users of library!

(9)

Typical problem solved by MapReduce

• Read a lot of data

• Map: extract something you care about from each record

• Shuffle and Sort

• Reduce: aggregate, summarize, filter, or transform

• Write the results

Outline stays the same,

map and reduce change to fit the problem

(10)

More specifically…

• Programmer specifies two primary methods:

map(k, v) <k', v'>*

reduce(k', <v'>*) <k', v'>*

• All v' with same k' are reduced together, in order.

• Usually also specify:

partition(k’, total partitions) -> partition for k’

often a simple hash of the key

allows reduce operations for different k’ to be parallelized

(11)

Example: Word Frequencies in Web Pages

A typical exercise for a new engineer in his or her first week

Input is files with one document per record

Specify a map function that takes a key/value pair key = document URL

value = document contents

Output of map function is (potentially many) key/value pairs.

In our case, output (word, “1”) once per word in the document

“document1”, “to be or not to be”

“to”, “1”

“be”, “1”

“or”, “1”

(12)

Example continued: word frequencies in web pages

MapReduce library gathers together all pairs with the same key (shuffle/sort)

The reduce function combines the values for a key In our case, compute the sum

Output of reduce (usually 0 or 1 value) paired with key and

saved “be”, “2”

“not”, “1”

“or”, “1”

“to”, “2”

key = “or”

values = “1”

“1”

key = “be”

values = “1”, “1”

“2”

key = “to”

values = “1”, “1”

“2”

key = “not”

values = “1”

“1”

(13)

Example: Pseudo-code

Map(String input_key, String input_value):

// input_key: document name

// input_value: document contents for each word w in input_values:

EmitIntermediate(w, "1");

Reduce(String key, Iterator intermediate_values):

// key: a word, same for input and output // intermediate_values: a list of counts int result = 0;

for each v in intermediate_values:

result += ParseInt(v);

Emit(AsString(result));

Total 80 lines of C++ code including comments, main()

(14)

Widely applicable at Google

Implemented as a C++ library linked to user programs

Can read and write many different data types

Example uses:

web access log stats web link-graph reversal inverted index construction statistical machine translation

distributed grep

distributed sort

term-vector per host document clustering machine learning ...

(15)

Queries containing “eclipse” Queries containing “world series”

Example: Query Frequency Over Time

Queries containing “summer olympics”

Queries containing “Opteron”

Queries containing “full moon”

Queries containing “watermelon”

(16)

Example: Generating Language Model Statistics

Used in our statistical machine translation system

need to count # of times every 5-word sequence occurs in large corpus of documents (and keep all those where count >= 4)

Easy with MapReduce:

map: extract 5-word sequences => count from document

reduce: combine counts, and keep if count large enough

(17)

Example: Joining with Other Data

Example: generate per-doc summary, but include per-host information (e.g. # of pages on host, important terms on host)

per-host information might be in per-process data structure, or

might involve RPC to a set of machines containing data for all sites

map: extract host name from URL, lookup per-host info, combine with per-doc data and emit

reduce: identity function (just emit key/value directly)

(18)

MapReduce Programs in Google’s Source Tree

0 1000 2000 3000 4000 5000 6000

Jan-03 Jul-03 Jan-04 Jul-04 Jan-05 Jul-05 Jan-06 Jul-06

(19)

New MapReduce Programs Per Month

0 100 200 300 400 500 600

Jan-03 Jul-03 Jan-04 Jul-04 Jan-05 Jul-05 Jan-06 Jul-06

Summer intern effect

(20)

MapReduce: Scheduling

One master, many workers

Input data split into M map tasks (typically 64 MB in size) Reduce phase partitioned into R reduce tasks

Tasks are assigned to workers dynamically Often: M=200,000; R=4,000; workers=2,000

Master assigns each map task to a free worker Considers locality of data to worker when assigning task Worker reads task input (often from local disk!)

Worker produces R local files containing intermediate k/v pairs

Master assigns each reduce task to a free worker Worker reads intermediate k/v pairs from map workers

Worker sorts & applies user’s Reduce op to produce the output

(21)

Parallel MapReduce

        





 

    

    

    

   

  

 



(22)

Task Granularity and Pipelining

Fine granularity tasks: many more map tasks than machines – Minimizes time for fault recovery

Can pipeline shuffling with map execution

Better dynamic load balancing

Often use 200,000 map/5000 reduce tasks w/ 2000 machines

(23)

MapReduce status: MR_Indexer-beta6-large-2003_10_28_00_03

(24)

MapReduce status: MR_Indexer-beta6-large-2003_10_28_00_03

(25)

MapReduce status: MR_Indexer-beta6-large-2003_10_28_00_03

(26)

MapReduce status: MR_Indexer-beta6-large-2003_10_28_00_03

(27)

MapReduce status: MR_Indexer-beta6-large-2003_10_28_00_03

(28)

MapReduce status: MR_Indexer-beta6-large-2003_10_28_00_03

(29)

MapReduce status: MR_Indexer-beta6-large-2003_10_28_00_03

(30)

MapReduce status: MR_Indexer-beta6-large-2003_10_28_00_03

(31)

MapReduce status: MR_Indexer-beta6-large-2003_10_28_00_03

(32)

MapReduce status: MR_Indexer-beta6-large-2003_10_28_00_03

(33)

MapReduce status: MR_Indexer-beta6-large-2003_10_28_00_03

(34)

Fault tolerance: Handled via re-execution

On worker failure:

Detect failure via periodic heartbeats

Re-execute completed and in-progress map tasks

Re-execute in progress reduce tasks

Task completion committed through master On master failure:

State is checkpointed to GFS: new master recovers &

continues

Very Robust: lost 1600 of 1800 machines once, but finished fine

(35)

Refinement: Backup Tasks

Slow workers significantly lengthen completion time – Other jobs consuming resources on machine

Bad disks with soft errors transfer data very slowly

Weird things: processor caches disabled (!!)

Solution: Near end of phase, spawn backup copies of tasks – Whichever one finishes first "wins"

Effect: Dramatically shortens job completion time

(36)

Refinement: Locality Optimization

Master scheduling policy:

Asks GFS for locations of replicas of input file blocks

Map tasks typically split into 64MB (== GFS block size)

Map tasks scheduled so GFS input block replica are on same machine or same rack

Effect: Thousands of machines read input at local disk speed

Without this, rack switches limit read rate

(37)

Refinement: Skipping Bad Records

Map/Reduce functions sometimes fail for particular inputs

Best solution is to debug & fix, but not always possible

On seg fault:

Send UDP packet to master from signal handler

Include sequence number of record being processed

If master sees K failures for same record (typically K set to 2 or 3) :

Next worker is told to skip the record

Effect: Can work around bugs in third-party libraries

(38)

Other Refinements

Optional secondary keys for ordering

Compression of intermediate data

Combiner: useful for saving network bandwidth

Local execution for debugging/testing

User-defined counters

(39)

Using 1,800 machines:

MR_Grep scanned 1 terabyte in 100 seconds

MR_Sort sorted 1 terabyte of 100 byte records in 14 minutes

Rewrote Google's production indexing system

a sequence of 7, 10, 14, 17, 21, 24 MapReductions

simpler

more robust

faster

more scalable

Performance Results & Experience

(40)

MR_Sort

(41)

Usage Statistics Over Time

426 55 3,351 1.2 157 193 758 3,288 217 634 Aug, ‘04 29,423

2345 147 3,836 5.0 268 2,970 6,743 52,254 2,002 874 Mar, ‘06 171,834

3,097 Average map tasks per job

411 Unique map/reduce combinations

144 Average reduce tasks per job

1.9 Average worker deaths per job

232 Average worker machines

941 Output data written (TB)

2,756 Intermediate data (TB)

12,571 Input data read (TB)

981 Machine years used

934 Average completion time (secs)

Mar, ‘05 72,229 Number of jobs

(42)

Implications for Multi-core Processors

Multi-core processors require parallelism, but many programmers are uncomfortable writing parallel programs

MapReduce provides an easy-to-understand programming model for a very diverse set of computing problems

users don’t need to be parallel programming experts

system automatically adapts to number of cores & machines available

Optimizations useful even in single machine, multi-core environment

locality, load balancing, status monitoring, robustness, …

(43)

Conclusion

MapReduce has proven to be a remarkably-useful abstraction

Greatly simplifies large-scale computations at Google

Fun to use: focus on problem, let library deal with messy details

Many thousands of parallel programs written by hundreds of different programmers in last few years

Many had no prior parallel or distributed programming experience

Further info:

MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, OSDI’04

http://labs.google.com/papers/mapreduce.html (or search Google for [MapReduce])

References

Related documents

A recombinant gene encoding α-ʟ-arabinofuranosidase (An abfA ) amplified from the genomic deoxyribonucleic acid (DNA) of Aspergillus niger ATCC 120120 was

She said the fact that the Legislature did not enact that bill (HB 6351 (Doc 2014-21347)) and instead made elimination of the compact election prospective underscores the Michigan

By analyzing the vast amounts of data on customer behavior on the different platforms, identifying value pockets and communicating actionable insights to the Product, Marketing

Customer layers &amp; story paths S/W Team MEs Self Algorithm designers Scientist EEs “Volt Mon” “Volt Disp”. S/W is customer for ‘Volt Mon’, EEs are customer

The purpose of articulating an export model where working capital is required for production is to show how productivity and cash interact to jointly determine export status of

Hitherto, the RPA framework has not been used to assess breast cancer leaflets to determine the presence or absence of the RPA constructs (risk (including

Although online networking is useful in bridging individuals to new contacts, our findings suggest that for rural business owners, face-to-face interactions are necessary in order

Jane Addams, Martin Luther King, Jr., and César Chávez were ordinary Americans?. They lived at different times and in