(1)

Map-Reduce for Machine Learning on Multicore

Chu, et al.

Tuesday, August 30, 11

(2)

Problem

• The world is going multicore

• New computers: dual-core to 12+ cores

• Shift to more concurrent programming paradigms and languages

• Erlang, Scala, Occam, Go, among others

- The paper discusses other languages, but:

1) I don't know any of them

2) These languages are more up to date and usable.

(3)

• Machine Learning frameworks on multicore

• How to parallelize algorithms?

• Highly specialized, non-obvious solutions

• Often applicable to very few algorithms

Example 1: Caragea et al. - Restricted to decision trees

Example 2: Jin, Agrawal - Shared-memory machines only

(4)

Goal

Develop general techniques for parallelizing many machine learning algorithms

• Goal of “throw more cores at it” optimization

• Algorithms must fit the Statistical Query Model

- Rather than the traditional approach of algorithm-specific tweaks

(5)

Statistical Query Model

• (Kearns, 1999)

Permit the learning algorithm to access the learning problem only through a statistical oracle

• Oracle returns an estimate of the expectation of f(x, y), averaged over all data instances

- The expectation averaged over the training/test distribution
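
As a rough illustration of the oracle idea, here is a minimal sketch in which the learner never touches individual instances and only asks for the average of some function f(x, y). The name `statistical_query` and the toy data are assumptions made for this example, not something taken from the paper.

```python
def statistical_query(f, data):
    """Hypothetical oracle: return the average of f(x, y) over all instances."""
    total = 0.0
    for x, y in data:
        total += f(x, y)
    return total / len(data)


# Toy data set of (x, y) pairs, invented for illustration.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# The learner only sees these aggregate answers, never the raw pairs.
mean_y = statistical_query(lambda x, y: y, data)
mean_xy = statistical_query(lambda x, y: x * y, data)
```

Because the answer is a plain average over instances, it can be computed from per-chunk partial sums, which is what makes this model a good fit for multicore.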

(6)

All algorithms that compute sufficient statistics can be expressed as a summation over all data points

• This trait lends itself to batch summations over all data points

• Why do we care?

Sufficient statistics -

A statistic is sufficient for a family of probability distributions if the sample from which it is calculated gives no more information than the statistic does about which of those probability distributions the sample was drawn from.

Pr(X = x | T(X) = t, Θ) = Pr(X = x | T(X) = t)

That is, the conditional probability does not depend on the underlying parameter Θ.

(Source: Wikipedia article)

(7)

Multicoring Algorithms

• Algorithms that sum over data points allow for partitioning

• Chunk data into subgroups

• Process and aggregate results
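
The chunk, process, and aggregate pattern can be sketched with a very small example: computing a mean and variance from per-chunk partial sums. The helper names (`partition`, `partial_stats`, `aggregate`) are hypothetical, and the chunks are processed sequentially here; on a multicore machine each call to `partial_stats` could run on its own core.

```python
def partition(data, num_chunks):
    """Split the data into num_chunks interleaved subgroups."""
    return [data[k::num_chunks] for k in range(num_chunks)]


def partial_stats(chunk):
    """Process one chunk: count, sum, and sum of squares."""
    return (len(chunk), sum(chunk), sum(v * v for v in chunk))


def aggregate(partials):
    """Combine the per-chunk results into a mean and (population) variance."""
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    ss = sum(p[2] for p in partials)
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, variance = aggregate([partial_stats(c) for c in partition(values, 3)])
```

The result does not depend on how the data is split, because every quantity being combined is a sum over data points.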

(8)

• Example: Linear Least Squares

• Fits a model of the form $y = \theta^T x$

• Solve $\theta^* = \arg\min_{\theta} \sum_{i=1}^{m} (\theta^T x_i - y_i)^2$

• Letting $X \in \mathbb{R}^{m \times n}$ be the instances

• $\vec{y} = [y_1, \ldots, y_m]$ be the target values

• Solve $\theta^* = (X^T X)^{-1} X^T \vec{y}$

(9)

• To Summation form:

• Two-phase algorithm

• Compute sufficient statistics

• Aggregate statistics and solve

• Solve to get $\theta^* = A^{-1} b$

- Compute sufficient statistics by summing over the data

(10)

• Divide and conquer using the multiple cores

• What programming architecture can help us do this task?

$A = X^T X = \sum_{i=1}^{m} \vec{x}_i \vec{x}_i^T$

$b = X^T \vec{y} = \sum_{i=1}^{m} \vec{x}_i y_i$
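
The two-phase least squares computation can be sketched as follows: each chunk contributes partial sums for A and b, the partials are added together, and the small n-by-n system is solved once. This is a plain-Python sketch under assumptions: the function names, the toy data, and the Gaussian elimination solver are all chosen for illustration, and the map step runs sequentially here rather than on separate cores.

```python
from functools import reduce


def map_chunk(chunk):
    """Map step: partial A = sum of x x^T and partial b = sum of x * y."""
    n = len(chunk[0][0])
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for x, y in chunk:
        for r in range(n):
            b[r] += x[r] * y
            for c in range(n):
                A[r][c] += x[r] * x[c]
    return A, b


def reduce_pair(left, right):
    """Reduce step: add two partial (A, b) results element-wise."""
    (A1, b1), (A2, b2) = left, right
    A = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    b = [p + q for p, q in zip(b1, b2)]
    return A, b


def solve(A, b):
    """Solve A theta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    theta = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(M[r][c] * theta[c] for c in range(r + 1, n))
        theta[r] = (M[r][n] - known) / M[r][r]
    return theta


def fit_least_squares(data, num_chunks=2):
    """Partition the data, sum the partial statistics, then solve."""
    chunks = [data[k::num_chunks] for k in range(num_chunks)]
    chunks = [c for c in chunks if c]
    A, b = reduce(reduce_pair, map(map_chunk, chunks))
    return solve(A, b)


# Toy data lying exactly on y = 2 + 3t, with x = [1, t].
data = [([1.0, float(t)], 2.0 + 3.0 * t) for t in range(6)]
theta = fit_least_squares(data, num_chunks=3)
```

Only the map step touches the m data points; the final solve works on an n-by-n matrix, so it is cheap when n is much smaller than m.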

(11)

Map-Reduce

• Google-inspired software framework

• Specialized for big data on clusters

• Based on functional programming

Map: Give sub-problems to worker nodes

Reduce: Combine all sub-problem results to obtain answer

- The Map and Reduce functions are not the same as in functional programming

- Nodes are cores in a single-machine implementation, and both machines and cores in a cluster setup
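
A single-machine version of this idea can be sketched with the standard library. The `map_reduce` helper below is hypothetical, not an API from any framework: it hands each chunk to a worker and then folds the partial results together. It uses a thread pool to keep the sketch short; in CPython, threads generally do not run CPU-bound Python code in parallel, so a process pool would be the more realistic choice for real multicore speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_reduce(data, mapper, reducer, num_workers=4):
    """Run mapper on each chunk in a worker, then combine with reducer.

    Assumes data is non-empty, so at least one partial result exists.
    """
    chunks = [data[k::num_workers] for k in range(num_workers)]
    chunks = [c for c in chunks if c]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(mapper, chunks))
    return reduce(reducer, partials)


# Sum the integers 1..100: each worker sums its chunk, the reducer adds the sums.
total = map_reduce(list(range(1, 101)), mapper=sum, reducer=lambda a, b: a + b)
```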

(12)

If a mapper or reducer requires scalar information, it can obtain it from the algorithm (1.1.1.1 and 1.1.3.2)

You can also have multiple reducers (for load balancing and fault tolerance). This is ideal in a cluster setting, but it increases communication overhead

(13)

Example: k-means

• Given a set of data points

• Compute initial k centroids

• Cluster data points to these centroids

• Calculate the new centroid

• Repeat process till convergence

We will use Euclidean distance (easy to think about)

(14)

Left image: The dataset we want to run k-means on

Right image: Clustering and calculating the new centroid

Old centroids - Square

New centroids - Triangle

(15)

Example: k-means

Mappers

• Cluster: Compute distance between points and centroids

• Centroid: Split up data, compute partial sums of vectors

Reduce:

• Cluster: Choose the closest centroid and apply it to that data point

• Centroid: Aggregate partial sums and calculate centroids

Iterative Clustering process...

Eventually settle and have our clusters

(16)

Example

Data Point   Centroid   Distance
D1           1          0.023
D1           2          0.459
D1           3          0.724

Each image here represents a different map job

The map job is responsible for calculating distance between each data point and the centroid specified for the mapper

Each point will have k distances, one for each centroid

Can anyone think of a different way to run this mapping job?

(17)

Reduce Step

[Figure with labels I, II, III]

Reduce step can take the values for each point, and select the min value

(18)

Map Reduce

[Figure with labels I, II, III, IV]

Different map jobs for calculating the new centroids - Calculate the partial sum for each centroid

Reduce jobs aggregate and average these partial sums to form new centroid points
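
One k-means iteration in this style can be sketched as below. Note one simplification that differs from the slides: instead of separate map jobs for distances and for centroid sums, each mapper here assigns its points to the nearest centroid and accumulates the partial sums in a single pass. All function names and the toy points are assumptions for illustration, and the mappers run sequentially.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def map_chunk(chunk, centroids):
    """Map step: per-centroid partial vector sums and point counts for one chunk."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in chunk:
        j = nearest(p, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def reduce_partials(partials, centroids):
    """Reduce step: aggregate partial sums and average them into new centroids."""
    new_centroids = []
    for j, old in enumerate(centroids):
        count = sum(c[j] for _, c in partials)
        if count == 0:
            # No points were assigned; keep the old centroid unchanged.
            new_centroids.append(list(old))
            continue
        new_centroids.append(
            [sum(s[j][d] for s, _ in partials) / count for d in range(len(old))]
        )
    return new_centroids


def kmeans_step(points, centroids, num_chunks=2):
    """Run one map-reduce iteration of k-means."""
    chunks = [points[k::num_chunks] for k in range(num_chunks)]
    partials = [map_chunk(c, centroids) for c in chunks if c]
    return reduce_partials(partials, centroids)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = kmeans_step(points, [[1.0, 1.0], [9.0, 9.0]])
```

Repeating `kmeans_step` until the centroids stop moving gives the full iterative process described on the earlier slide.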

(19)

Example: PCA

• Computing covariance matrix Σ

• $\Sigma = \frac{1}{m} \left( \sum_{i=1}^{m} x_i x_i^T \right) - \mu \mu^T$

• With $\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$

• How would we map this? Reduce?

Ans: Map each sum to one of the cores. How to map it further??

Reduce: perform final calculation (subtract
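
One possible answer, sketched under assumptions: each mapper computes the count, the sum of the vectors, and the sum of the outer products for its chunk, and the reducer adds these up and forms the covariance matrix from the totals. The function names and toy data are hypothetical, and only the covariance computation is shown, not the eigenvector step of PCA.

```python
def map_chunk(chunk):
    """Map step: count, sum of x, and sum of x x^T for one chunk."""
    n = len(chunk[0])
    vec_sum = [0.0] * n
    outer_sum = [[0.0] * n for _ in range(n)]
    for x in chunk:
        for r in range(n):
            vec_sum[r] += x[r]
            for c in range(n):
                outer_sum[r][c] += x[r] * x[c]
    return len(chunk), vec_sum, outer_sum


def reduce_covariance(partials):
    """Reduce step: total the partial sums, then form the mean and covariance."""
    m = sum(p[0] for p in partials)
    n = len(partials[0][1])
    mu = [sum(p[1][r] for p in partials) / m for r in range(n)]
    cov = [
        [sum(p[2][r][c] for p in partials) / m - mu[r] * mu[c] for c in range(n)]
        for r in range(n)
    ]
    return mu, cov


data = [(1.0, 2.0), (3.0, 6.0), (5.0, 10.0), (7.0, 14.0)]
chunks = [data[:2], data[2:]]
mu, cov = reduce_covariance([map_chunk(c) for c in chunks])
```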

(20)

Results

• Theoretical Time Complexity speedup of P’ on P cores

• Experiment using two versions of algorithm

• For dual-core processors, at least 1.9 times speedup for parallelized algorithms

• Linear speedup with number of cores

• Viable method for parallelizing machine learning algorithms

P’ approximately equals P

Many different datasets used

Original algorithms don't utilize all CPU cycles

Speedup slope is below 1, due to increasing communication overhead

(21)

Frameworks

• Influenced a variety of researchers and developers

• Drost, Ingersoll - Apache Mahout

• Ghoting et al. - IBM SystemML

• Zaharia et al. - Berkeley Spark

(22)

Frameworks: Mahout

• Build scalable machine learning libraries

• Four use cases

• Recommendation

• Clustering

• Classification

• Frequent item set mining

- Developed with Hadoop in mind, though

Recommendation - Given a set of features for a user, find items that the user may like

Clustering - Given a set of data, group the items based on some set of features

Classification - Assign labels to unlabeled data based on patterns learned from training data

Frequent item set mining - Given items in a set, find items that are most frequently associated with elements in that item set

(23)

Still new

• current release 0.5/0.6

• Many algorithms proposed in paper yet to be implemented

• Ones implemented scale well

• Compatibility

• Works with/without Apache Hadoop

• Works in single- or multi-node setups

Many clustering techniques available, some classification techniques, SVD, Similarity, and Recommenders available, with others either open or awaiting integration.

(24)

Benefits

• (Relatively) Easy to use with Apache Lucene

• Text analysis

• Algorithms available

• Plug in data and play

Apache Lucene - Full-text search framework

Can pass term vectors created in Lucene to Mahout for processing (e.g., running an LDA job on the term vectors for topic modeling, or calculating pairwise document similarity with cosine similarity)

(25)

• Compatibility

• Available to use right now and open source

• Detractions

• Setting up Mahout not always easy

• What you want might not be implemented

• Very young, and code/usability shows it

(26)

Discussion

• What are some of the drawbacks to Map-Reduce?

(27)

References

http://en.wikipedia.org/wiki/MapReduce

http://hadoop.apache.org/

http://en.wikipedia.org/wiki/Naive_Bayes_classifier

http://en.wikipedia.org/wiki/Expectation-maximization_algorithm

http://mahout.apache.org/

http://en.wikipedia.org/wiki/Sufficient_statistic

http://en.wikipedia.org/wiki/K-means_clustering

https://researcher.ibm.com/researcher/files/us-ytian/systemML.pdf

http://www.spark-project.org/

http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf

K-means images generated in Matlab
