2008 NSF Data-Intensive Scalable Computing in Education Workshop

(1)

2008 NSF Data-Intensive Scalable Computing in

Education Workshop

Module II: Hadoop-Based Course

Components

(2)

Overview

• University of Washington Curriculum

– Teaching Methods – Reflections

– Student Background

– Course Staff Requirements

• MapReduce Algorithms

(3)

UW: Course Summary

• Course title: “Problem Solving on Large Scale Clusters”

• Primary purpose: developing large-scale problem solving skills

• Format: 6 weeks of lectures + labs, 4

week project

(4)

UW: Course Goals

• Think creatively about large-scale

problems in a parallel fashion; design parallel solutions

• Manage large data sets under memory, bandwidth limitations

• Develop a foundation in parallel algorithms

for large-scale data

(5)

Lectures

• 2 hours, once per week

• Half formal lecture, half discussion

• Mostly covered systems & background

• Included group activities for reinforcement

(6)

Classroom Activities

• Worksheets included pseudo-code

programming, working through examples

– Performed in groups of 2—3

• Small-group discussions about engineering and systems design

– Groups of ~10

– Course staff facilitated, but mostly open-

(7)

Readings

• No textbook

• One academic paper per week

– E.g., “Simplified Data Processing on Large Clusters”

– Short homework covered comprehension

• Formed basis for discussion

(8)

Lecture Schedule

• Introduction to Distributed Computing

• MapReduce: Theory and Implementation

• Networks and Distributed Reliability

• Real-World Distributed Systems

• Distributed File Systems

• Other Distributed Systems

(9)

Intro to Distributed Computing

• What is distributed computing?

• Flynn’s Taxonomy

• Brief history of distributed computing

• Some background on synchronization and

memory sharing

(10)

MapReduce

• Brief refresher on functional programming

• MapReduce slides

– More detailed version of module I

• Discussion on MapReduce

(11)

Networking and Reliability

• Crash course in networking

• Distributed systems reliability

– What is reliability?

– How do distributed systems fail?

– ACID, other metrics

• Discussion: Does MapReduce provide

reliability?

(12)

Real Systems

• Design and implementation of Nutch

• Tech talk from Googler on Google Maps

(13)

Distributed File Systems

• Introduced GFS

• Discussed implementation of NFS and

AndrewFS for comparison

(14)

Other Distributed Systems

• BOINC: Another platform

• Broader definition of distributed systems

– DNS

– One Laptop per Child project

(15)

Labs

• Also 2 hours, once per week

• Focused on applications of distributed systems

• Four lab projects over six weeks

(16)

Lab Schedule

• Introduction to Hadoop, Eclipse Setup, Word Count

• Inverted Index

• PageRank on Wikipedia

• Clustering on Netflix Prize Data

(17)

Design Projects

• Final four weeks of quarter

• Teams of 1—3 students

• Students proposed topic, gathered data, developed software, and presented

solution

(18)

Example: Geozette

(19)

Example: Galaxy Simulation

(20)

Other Projects

• Bayesian Wikipedia spam filter

• Unsupervised synonym extraction

• Video collage rendering

(21)

Ongoing research: traceroutes

• Analyze time-stamped traceroute data to model changes in Internet router topology

– 4.5 GB of data/day * 1.5 years

– 12 billion traces from 200 PlanetLab sites

• Calculates prevalence and persistence of

routes between hosts

(22)

Ongoing research: dynamic program traces

• Dynamic memory trace data from

simulators can reach hundreds of GB

• Existing work focuses on sampling

• New capability: record all accesses and

post-process with Hadoop

(23)

Common Features

• Hadoop!

• Used publicly-available web APIs for data

• Many involved reading papers for algorithms and translating into

MapReduce framework

(24)

Background Topics

• Programming Languages

• Systems:

– Operating Systems – File Systems

– Networking

• Databases

(25)

Programming Languages

• MapReduce is based on functional programming map and fold

• FP is taught in one quarter, but not reinforced

– “Crash course” necessary

– Worksheets to pose short problems in terms of map and fold

– Immutable data a key concept

(26)

Multithreaded programming

• Taught in OS course at Washington

– Not a prerequisite!

• Students need to understand multiple

copies of same method running in parallel

(27)

File Systems

• Necessary to understand GFS

• Comparison to NFS, other distributed file

systems relevant

(28)

Networking

• TCP/IP

• Concepts of “connection,” network splits, other failure modes

• Bandwidth issues

(29)

Databases

• Concept of shared consistency model

• Consensus

• ACID characteristics

– Journaling

– Multi-phase commit processes

(31)

Course Staff

• Instructor (me!)

• Two undergrad teaching assistants

– Helped facilitate discussions, directed labs

• One student sys admin

– Worked only about three hours/week

(32)

Preparation

• Teaching assistants had taken previous iteration of course in winter

• Lectures retooled based on feedback from that quarter

– Added reasonably large amount of background material

• Ran & solved all labs in advance

(33)

The Course: What Worked

• Discussions

– Often covered broad range of subjects

• Hands-on lab projects

• “Active learning” in classroom

• Independent design projects

(34)

Things to Improve: Coverage

• Algorithms were not reinforced during lecture

– Students requested much more time be spent on “how to parallelize an iterative algorithm”

• Background material was very fast-paced

(35)

Things to Improve: Projects

• Labs could have used a

moderated/scripted discussion component

– Just “jumping in” to the code proved difficult – No time was devoted to Hadoop itself in

lecture

– Clustering lab should be split in two

• Design projects could have used more

time

(36)

Part 2: Algorithms

(37)

Algorithms for MapReduce

• Sorting

• Searching

• Indexing

• Classification

• TF-IDF

• PageRank

• Clustering

(38)

MapReduce Jobs

• Tend to be very short, code-wise

– IdentityReducer is very common

• “Utility” jobs can be composed

• Represent a data flow, more so than a

procedure

(39)

Sort: Inputs

• A set of files, one value per line.

• Mapper key is file name, line number

• Mapper value is the contents of the line

(40)

Sort Algorithm

• Takes advantage of reducer properties:

(key, value) pairs are processed in order by key; reducers are themselves ordered

• Mapper: Identity function for value

(k, v)  (v, _)

• Reducer: Identity function (k’, _) -> (k’, “”)

(41)

Sort: The Trick

• (key, value) pairs from mappers are sent to a particular reducer based on hash(key)

• Must pick the hash function for your data such that k₁ < k₂ => hash(k₁) < hash(k₂)

(42)

Final Thoughts on Sort

• Used as a test of Hadoop’s raw speed

• Essentially “IO drag race”

• Highlights utility of GFS

(43)

Search: Inputs

• A set of files containing lines of text

• A search pattern to find

• Mapper key is file name, line number

• Mapper value is the contents of the line

• Search pattern sent as special parameter

(44)

Search Algorithm

• Mapper:

– Given (filename, some text) and “pattern”, if

“text” matches “pattern” output (filename, _)

• Reducer:

– Identity function

(45)

Search: An Optimization

• Once a file is found to be interesting, we only need to mark it that way once

• Use Combiner function to fold redundant (filename, _) pairs into a single one

– Reduces network I/O

(46)

Indexing: Inputs

• A set of files containing lines of text

• Mapper key is file name, line number

• Mapper value is the contents of the line

(47)

Inverted Index Algorithm

• Mapper: For each word in (file, words), map to (word, file)

• Reducer: Identity function

(48)

Index: MapReduce

map(pageName, pageText):

foreach word w in pageText:

emitIntermediate(w, pageName);

done

reduce(word, values):

foreach pageName in values:

(49)

Index: Data Flow

(50)

An Aside: Word Count

• Word count was described in module I

• Mapper for Word Count is (word, 1) for each word in input line

– Strikingly similar to inverted index

– Common theme: reuse/modify existing mappers

(51)

Bayesian Classification

• Files containing classification instances are sent to mappers

• Map (filename, instance)  (instance, class)

• Identity Reducer

(52)

Bayesian Classification

• Existing toolsets exist to perform Bayes classification on instance

– E.g., WEKA, already in Java!

• Another example of discarding input key

(53)

TF-IDF

• Term Frequency – Inverse Document Frequency

– Relevant to text processing

– Common web analysis algorithm

(54)

The Algorithm, Formally

(55)

Information We Need

• Number of times term X appears in a given document

• Number of terms in each document

• Number of documents X appears in

• Total number of documents

(56)

Job 1: Word Frequency in Doc

• Mapper

– Input: (docname, contents) – Output: ((word, docname), 1)

• Reducer

– Sums counts for word in document – Outputs ((word, docname), n)

• Combiner is same as Reducer

(57)

Job 2: Word Counts For Docs

• Mapper

– Input: ((word, docname), n) – Output: (docname, (word, n))

• Reducer

– Sums frequency of individual n’s in same doc – Feeds original data through

– Outputs ((word, docname), (n, N))

(58)

Job 3: Word Frequency In Corpus

• Mapper

– Input: ((word, docname), (n, N))

– Output: (word, (docname, n, N, 1))

• Reducer

– Sums counts for word in corpus

– Outputs ((word, docname), (n, N, m))

(59)

Job 4: Calculate TF-IDF

• Mapper

– Input: ((word, docname), (n, N, m))

– Assume D is known (or, easy MR to find it) – Output ((word, docname), TF*IDF)

• Reducer

– Just the identity function

(60)

Working At Scale

• Buffering (doc, n, N) counts while

summing 1’s into m may not fit in memory

– How many documents does the word “the”

occur in?

• Possible solutions

– Ignore very-high-frequency words – Write out intermediate data to a file

(61)

Final Thoughts on TF-IDF

• Several small jobs add up to full algorithm

• Lots of code reuse possible

– Stock classes exist for aggregation, identity

• Jobs 3 and 4 can really be done at once in same reducer, saving a write/read cycle

• Very easy to handle medium-large scale,

but must take care to ensure flat memory

(62)

PageRank: Random Walks Over The Web

• If a user starts at a random web page and surfs by clicking links and randomly

entering new URLs, what is the probability that s/he will arrive at a given page?

• The PageRank of a page captures this notion

– More “popular” or “worthwhile” pages get a higher rank

(63)

PageRank: Visually

(64)

PageRank: Formula

Given page A, and pages T

₁

through T

_n

linking to A, PageRank is defined as:

PR(A) = (1-d) + d (PR(T

₁

)/C(T

₁

) + ... +

PR(T

_n

)/C(T

_n

))

(65)

PageRank: Intuition

• Calculation is iterative: PR

_i+1

is based on PR

_i

• Each page distributes its PR

_i

to all pages it links to. Linkees add up their awarded rank fragments to find their PR

_i+1

• d is a tunable parameter (usually = 0.85)

encapsulating the “random jump factor”

(66)

Graph Representations

• The most straightforward representation of

graphs uses references from each node to

its neighbors

(67)

Direct References

• Structure is inherent to object

• Iteration requires

linked list “threaded through” graph

• Requires common view of shared

memory

class GraphNode {

Object data;

Vector<GraphNode>

out_edges;

GraphNode iter_next;

}

(68)

Adjacency Matrices

• Another classic graph representation. M[i]

[j]= '1' implies a link from node i to j.

• Naturally encapsulates iteration over nodes

1 1

0 1

2 1 0

1 0

1 4 3

2

1

(69)

Adjacency Matrices: Sparse Representation

• Adjacency matrix for most large graphs

(e.g., the web) will be overwhelmingly full of zeros.

• Each row of the graph is absurdly long

• Sparse matrices only include non-zero

elements

(70)

Sparse Matrix Representation

1: (3, 1), (18, 1), (200, 1)

2: (6, 1), (12, 1), (80, 1), (400, 1) 3: (1, 1), (14, 1)

…

(71)

Sparse Matrix Representation

1: 3, 18, 200

2: 6, 12, 80, 400 3: 1, 14

…

(72)

PageRank: First Implementation

• Create two tables 'current' and 'next' holding the PageRank for each page. Seed 'current' with initial PR values

• Iterate over all pages in the graph,

distributing PR from 'current' into 'next' of linkees

• current := next; next := fresh_table();

(73)

Distribution of the Algorithm

• Key insights allowing parallelization:

– The 'next' table depends on 'current', but not on any other rows of 'next'

– Individual rows of the adjacency matrix can be processed in parallel

– Sparse matrix rows are relatively small

(74)

Distribution of the Algorithm

• Consequences of insights:

– We can map each row of 'current' to a list of PageRank “fragments” to assign to linkees

– These fragments can be reduced into a single PageRank value for a page by summing

– Graph representation can be even more

compact; since each element is simply 0 or 1, only transmit column numbers where it's 1

(75)

(76)

Phase 1: Parse HTML

• Map task takes (URL, page content) pairs and maps them to (URL, (PR

_init

, list-of-urls))

– PR_init is the “seed” PageRank for URL

– list-of-urls contains all pages pointed to by URL

• Reduce task is just the identity function

(77)

Phase 2: PageRank Distribution

• Map task takes (URL, (cur_rank, url_list))

– For each u in url_list, emit (u, cur_rank/|url_list|)

– Emit (URL, url_list) to carry the points-to list along through iterations

PR(A) = (1-d) + d (PR(T₁)/C(T₁) + ... + PR(T_n)/C(T_n))

(78)

Phase 2: PageRank Distribution

• Reduce task gets (URL, url_list) and many (URL, val) values

– Sum vals and fix up with d

– Emit (URL, (new_rank, url_list))

PR(A) = (1-d) + d (PR(T₁)/C(T₁) + ... + PR(T_n)/C(T_n))

(79)

Finishing up...

• A subsequent component determines

whether convergence has been achieved (Fixed number of iterations? Comparison of key values?)

• If so, write out the PageRank lists - done!

• Otherwise, feed output of Phase 2 into

another Phase 2 iteration

(80)

PageRank Conclusions

• MapReduce runs the “heavy lifting” in iterated computation

• Key element in parallelization is

independent PageRank computations in a given step

• Parallelization requires thinking about

minimum data partitions to transmit (e.g.,

compact representations of graph rows)

(81)

Clustering

• What is clustering?

(82)

Google News

They didn’t pick all 3,400,217 related articles by hand…

Or Amazon.com

Or Netflix…

(83)

Other less glamorous things...

• Hospital Records

• Scientific Imaging

– Related genes, related stars, related sequences

• Market Research

– Segmenting markets, product positioning

• Social Network Analysis

• Data mining

• Image segmentation…

(84)

The Distance Measure

• How the similarity of two elements in a set is determined, e.g.

– Euclidean Distance – Manhattan Distance – Inner Product Space – Maximum Norm

– Or any metric you define over the space…

(85)

• Hierarchical Clustering vs.

• Partitional Clustering

Types of Algorithms

(86)

Hierarchical Clustering

(87)

Partitional Clustering

(88)

Partitional Clustering

(89)

K-Means Clustering

• Simple Partitional Clustering

• Choose the number of clusters, k

• Choose k points to be cluster centers

• Then…

(90)

K-Means Clustering

iterate {

Compute distance from all points to all k- centers

Assign each point to the nearest k-center

Compute the average of all points assigned to all specific k-centers

Replace the k-centers with the new averages }

(91)

But!

• The complexity is pretty high:

– k * n * O ( distance metric ) * num (iterations)

• Moreover, it can be necessary to send tons of data to each Mapper Node.

Depending on your bandwidth and memory

available, this could be impossible.

(92)

Furthermore

• There are three big ways a data set can be large:

– There are a large number of elements in the set.

– Each element can have many features.

– There can be many clusters to discover

• Conclusion – Clustering can be huge, even

when you distribute it.

(93)

Canopy Clustering

• Preliminary step to help parallelize computation.

• Clusters data into overlapping Canopies using super cheap distance metric.

• Efficient

• Accurate

(94)

Canopy Clustering

While there are unmarked points {

pick a point which is not strongly marked call it a canopy center

mark all points within some threshold of it as in it’s canopy

strongly mark all points within some stronger threshold

}

(95)

After the canopy clustering…

• Resume hierarchical or partitional clustering as usual.

• Treat objects in separate clusters as being

at infinite distances.

(96)

MapReduce Implementation:

• Problem – Efficiently partition a large data set (say… movies with user ratings!) into a fixed number of clusters using Canopy

Clustering, K-Means Clustering, and a

Euclidean distance measure.

(97)

The Distance Metric

• The Canopy Metric ($)

• The K-Means Metric ($$$)

(98)

Steps!

• Get Data into a form you can use (MR)

• Picking Canopy Centers (MR)

• Assign Data Points to Canopies (MR)

• Pick K-Means Cluster Centers

• K-Means algorithm (MR)

– Iterate!

(99)

Selecting Canopy Centers

(100)

(101)

(102)

(103)

(104)

(105)

(106)

(107)

(108)

(109)

(110)

(111)

(112)

(113)

(114)

(115)

Assigning Points to Canopies

(116)

(117)

(118)

(119)

(120)

(121)

K-Means Map

(122)

(123)

(124)

(125)

(126)

(127)

(128)

(129)