• No results found

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

N/A
N/A
Protected

Academic year: 2021

Share "COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers"

Copied!
12
0
0

Loading.... (view fulltext now)

Full text

(1)

COMP 598 – Applied Machine Learning

Lecture 21: Parallelization methods for

large-scale machine learning

!

Instructor: Joelle Pineau ([email protected])

TAs: Pierre-Luc Bacon ([email protected]) Ryan Lowe ([email protected])

Class web page: www.cs.mcgill.ca/~jpineau/comp598

Big Data by the numbers

•  20+ billion web pages x 20KB = 400+TB

–  1 computer reads 30-35 MB/sec from disk. ~4 months to read the web. ~1000 hard drives to store the web.

•  In 2011: An estimated 1M machines used by Google.

–  An expected 1000 machines fail every day!

•  In 2012: An estimated 100PB of data on Facebook’s cluster. •  In 2014: An estimated 455PB of data stored on Yahoo! cluster. •  Copying large amounts of data over a network takes time.

(2)

2

Joelle Pineau

3

Big Data

•  Big data is large number of features, or large number of datapoints, or both.

•  Major issues for machine learning:

–  Need distributed data storage / access.

–  Can’t use algorithms that are superlinear in the number of examples.

–  Large number of features causes overfitting.

–  Need parallelized algorithms.

–  Often need to deal with unbalanced datasets.

–  Need to represent high-dimensional data.

COMP-598: Applied Machine Learning

Joelle Pineau 4

An era of supercomputers

COMP-598: Applied Machine Learning

(3)

Joelle Pineau 5

An era of supercomputers

COMP-598: Applied Machine Learning

https://computecanada.ca/

Hadoop

•  Open-source software framework for datacentric computing:

Distributed File System + MapReduce + Advanced components.

MapReduce = The system that assigns jobs to nodes in the cluster.

•  Allows distributed processing of large datasets across clusters. •  High degree of fault tolerance:

–  When nodes go down, jobs are re-distributed.

–  Data is replicated multiple times, in “small” chunks.

(4)

4

Joelle Pineau

7

Hadoop Distributed File System (HDFS)

•  All data is stored multiple times:

–  Robust to failure. Don’t need to backup a Hadoop cluster.

–  Multiple access points for any one piece of data.

•  Files are stored in 64MB chunks (“shards”).

–  Sequential reads are fast. 100MB/s disk requires 0.64s to read a chunk but only 0.01s to start reading it.

–  Not made for random reads.

–  Not efficient for accessing small files.

–  No support for file modification.

COMP-598: Applied Machine Learning

Joelle Pineau 8

MapReduce

(5)

Joelle Pineau 9

MapReduce

•  Programming framework for applying parallel algorithms to large datasets, across a cluster of computers. Developed by Google.

•  Map: Partition the data across nodes, perform computation (independently) on each partition.

•  Reduce: Merge values from nodes to produce global output.

COMP-598: Applied Machine Learning

Example

(6)

6

Joelle Pineau

11

Word count example

• 

COMP-598: Applied Machine Learning

Joelle Pineau 12

Execution

(7)

Joelle Pineau 13

Parallel Execution

• 

COMP-598: Applied Machine Learning

Task scheduling

•  Typically many more tasks than machines.

(8)

8

Joelle Pineau

15

Fault tolerance

•  Handled via re-execution.

•  Worker failure:

–  Detect via periodic checking.

–  Re-execute completed and in-progress map tasks.

–  Re-execute in-progress reduce tasks. •  Master failure:

–  Could handle; highly unlikely.

•  Slow workers significantly lengthen completion time:

–  Near end of job, spawn backup copies of in-progress tasks.

COMP-598: Applied Machine Learning

Joelle Pineau 16

Growing applicability

• 

(9)

Joelle Pineau 17

Properties

•  Advantages:

–  Automatic parallelization and distribution.

–  Fault-tolerance.

–  I/O scheduling

–  Status and monitoring

•  Limitations:

–  Cannot share data across jobs.

COMP-598: Applied Machine Learning

MapReduce for machine learning

•  Many ML algorithms can be re-written in “summation form”. •  Consider linear regression:

Err(w) = ∑i=1:n ( yi - wTxi)2ŵ = (XTX)-1 XT Y Let A = (XTX), B = (XTY) A = ∑

i=1:n (xixiT), B = ∑i=1:n (xiyi) Now the computation of A and B can be split between different nodes.

•  Other algorithms that are amenable to this form:

–  Naïve Bayes, LDA/QDA, K-means, logistic regression, neural networks, mixture of Gaussians (EM), SVMs, PCA.

(10)

10

Joelle Pineau

19

Parallel linear learning

•  Given 2.1 TB of data, how can you learn a good linear

predictor? How long does it take?

–  17B examples, 16M parameters, 1K computation nodes.

•  Stochastic gradient descent will take a long time going through all this data!

•  Answer: Can be done using parallelization in 70 minutes = 500M features/second.

–  Faster than the I/O bandwidth of a single machine.

COMP-598: Applied Machine Learning

Joelle Pineau 20

Parallel learning for parameter search

•  One of the most common use of parallelization in ML.

•  Map: Create separate jobs, exploring different subsets of parameters. Train separate models, calculate performance on validation set.

•  Reduce: Select subset with the best validation set performance.

(11)

Joelle Pineau 21

Parallel learning of decision trees

•  Usually, all data examples are used to select each subtree split.

–  If data is not partitioned properly, may produce poor trees.

•  When parallelizing, each processor can handle either:

–  Subset of the features: Easier to efficiently parallelize tree.

–  Subset of the training data: More complex to properly parallelize tree.

•  Easier to parallelize learning of Random Forests (true of bagging in general).

COMP-598: Applied Machine Learning

PLANET

(Panda et al., VLDB’09)

•  Parallel Learner for Assembling Numerous Ensemble Trees. •  Insight: Build the tree level by level.

•  Map: Consider a number of possible splits on subsets of the data.

–  Intermediate values = partial statistics on # points.

•  Reduce: Determine the best split from the partial statistics. •  Repeat. One MapReduce step builds one level of the tree

(12)

12

Joelle Pineau

23

Final notes

•  Mini-project #2 evaluations are available on CMT. •  Mini-project #3 peer reviews due Nov. 25.

•  Mini-project #4:

–  Has everyone started thinking about a topic, dataset, and team?

–  Oral presentations, Dec.2, during class.

•  Midterm: Friday November 20, 11:30am-1pm.

 Divided in 2 rooms. Last name A-K: Burnside 1B36 Last name L-Z: MAASS 10

 Closed book. No calculator.

 You can bring 1 page (2-sided) of hand-written notes.

 Covers lectures 1-21.

•  Course evaluations now available on Minerva. Please fill out!

COMP-598: Applied Machine Learning

Joelle Pineau 24

Today’s lecture

•  Significant material (pictures, text, equations) for these slides was taken from:

–  http://www.stanford.edu/class/cs246

–  http://cilvr.cs.nyu.edu/doku.php?id=courses:bigdata:slides:start

– 

http://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0002.html

–  http://machinelearning.wustl.edu/mlpapers/paper_files/

NIPS2006_725.pdf

References

Related documents

We believe that the book will be read by the people with a common inter- est in geospatial techniques, remote sensing, sustainable water resource develop- ment, applications and

Research Question 1: To what extent does a composite placement score, based on the ACCUPLACER scores for reading comprehension and mathematics, statistically and significantly

“The market for outsourcing in the commercial healthcare sector grows by 6-8% per year” (Edison research, Bloomberg).. The healthcare industry (along with the

The Westenberger-Kallrath problem has been understood as a typical scheduling problem occurring in process industry including the major characteristics of a real batch

Students who are LVNs and who are electing the Non-Degree 30 Unit Option, as well as students who do not complete the requirements for the Associate in Arts (AA) or the Associate

Where the employer is licenced by another body other than the Trade & Business Licencing Board, proof of current licence or copy of the receipt of payment for the renewal

I will limit myself to state that autogestión can be conceived as the manifestation of analectical marronage inasmuch as it consists of a social and political praxis

We then discuss optimal student credit arrangements that balance three important objectives: (i) providing credit for students to access college and finance consumption while in