Parallel & Distributed Optimization. Based on Mark Schmidt's slides

Academic year: 2021

(1)

Parallel & Distributed Optimization

(2)

Motivation for parallel & distributed optimization

● Performance

○ Computational throughput has grown exponentially over time (Moore’s law)

○ But only so many transistors can fit in a limited space (atomic size)

○ Serial computation throughput has plateaued (Moore’s law is coming to an end)

● Space

(4)

Introduction

● Parallel Computing

○ One machine

■ Multiple processors (Quad-Core, GPU)

● Distributed Computing

(6)

Distributed optimization

● Strategy

○ Each machine handles a subset of the dataset

● Issues

○ link failures between machines

■ devise algorithms that limit communication

■ decentralize optimization

● Synchronization

○ wait for the slowest machine when all machines depend on the coordinates of the parameter vector ‘x’

(7)

Distributed optimization

● Strategy

○ Each machine handles a subset of the dataset

● Issues

○ link failures between machines

○ synchronization

■ wait for the slowest machine to finish processing its data

● Solutions

○ devise algorithms that limit communication

○ decentralize optimization

(9)

Straightforward distributed optimization

● Run different algorithms/strategies on different machines/cores

○ First one that finishes wins

■ Gauss-Southwell Coordinate Descent

■ Randomized Coordinate Descent

■ Gradient Descent

■ Stochastic Gradient Descent
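
The "first one that finishes wins" strategy can be sketched as a race between solvers. This is only an illustration under stated assumptions: the toy quadratic, the two solver functions, and the `race` helper are hypothetical names that do not come from the slides, and threads stand in for separate machines or cores.

```python
import random
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

# Toy separable quadratic f(x) = 0.5 * sum_j C[j] * (x[j] - T[j])**2.
# The problem is an assumption made only to keep the example self-contained.
C = [1.0, 2.0, 4.0]
T = [3.0, -1.0, 0.5]

def grad(x):
    return [c * (xj - tj) for c, xj, tj in zip(C, x, T)]

def gradient_descent(tol=1e-8):
    x = [0.0] * len(C)
    step = 1.0 / max(C)  # 1/L for this quadratic
    while max(abs(g) for g in grad(x)) > tol:
        x = [xj - step * gj for xj, gj in zip(x, grad(x))]
    return "gradient descent", x

def randomized_coordinate_descent(tol=1e-8, seed=0):
    rng = random.Random(seed)
    x = [0.0] * len(C)
    while max(abs(g) for g in grad(x)) > tol:
        j = rng.randrange(len(C))
        x[j] -= grad(x)[j] / C[j]  # exact minimization along coordinate j
    return "randomized coordinate descent", x

def race(solvers):
    """Run every solver concurrently and return the result of the first to finish."""
    with ThreadPoolExecutor(max_workers=len(solvers)) as pool:
        futures = [pool.submit(solver) for solver in solvers]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        # The losers are not cancelled here; the pool simply lets them finish.
        return next(iter(done)).result()

winner, solution = race([gradient_descent, randomized_coordinate_descent])
```

Which solver wins depends on the problem and on scheduling, which is the point of the strategy: no choice has to be made in advance.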

(10)

Parallel first-order methods

● Synchronized deterministic Gradient Descent

● The gradient can be broken into separable components: with n samples split across m machines, each machine computes the gradient over its own samples
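
The formula on this slide did not survive extraction, so the following is a minimal sketch of the idea rather than the slide's own method. It assumes a least-squares loss, uses threads in place of machines, and all function names are illustrative. Each worker sums the gradients of its own samples; the partial sums are then added, and the step cannot proceed until every worker has reported.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_gradient(x, rows):
    """Sum of least-squares gradients (a.x - b) * a over one machine's samples."""
    g = [0.0] * len(x)
    for a, b in rows:
        r = sum(aj * xj for aj, xj in zip(a, x)) - b
        for j, aj in enumerate(a):
            g[j] += r * aj
    return g

def parallel_gradient(x, data, m):
    """Split the n samples over m workers, then average the summed partial gradients."""
    chunks = [data[k::m] for k in range(m)]
    with ThreadPoolExecutor(max_workers=m) as pool:
        # list(map(...)) returns only once every worker has finished: the sync point.
        parts = list(pool.map(lambda rows: partial_gradient(x, rows), chunks))
    n = len(data)
    return [sum(p[j] for p in parts) / n for j in range(len(x))]

def gradient_descent(data, dim, m, step=0.1, iters=300):
    x = [0.0] * dim
    for _ in range(iters):
        g = parallel_gradient(x, data, m)
        x = [xj - step * gj for xj, gj in zip(x, g)]
    return x

# n = 4 samples generated from x_true = [1, -2]; m = 2 workers.
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -2.0), ([1.0, 1.0], -1.0), ([1.0, -1.0], 3.0)]
x_hat = gradient_descent(data, dim=2, m=2)
```

Because the partial sums add up to the full gradient, the iterates are the same as in serial gradient descent, up to floating-point rounding; only the work per step is divided.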

(14)

Parallel first-order methods

● Synchronized deterministic Gradient Descent

● Splitting the samples across m machines allows optimal linear speedups

(15)

Parallel first-order methods

● Synchronized deterministic Gradient Descent

● Issue

○ What if one of the computers is very slow?

○ What if one of the links fails?

(16)

Centralized Gradient descent (HogWild)

● Update ‘x’ asynchronously, which saves a lot of time

● Stochastic gradient method on shared memory
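
A minimal sketch of the asynchronous shared-memory pattern, assuming a least-squares loss and sparse samples. The function name `hogwild_sgd` and the toy data are illustrative, not taken from the slides. Threads read and write the shared vector without locks; in CPython this shows the update pattern rather than a real speedup.

```python
import random
import threading

def hogwild_sgd(data, dim, n_threads=4, steps_per_thread=5000, step=0.2):
    """Lock-free SGD on a shared parameter list for least squares.

    Each thread repeatedly picks a random sample (a, b) and applies
    x -= step * (a.x - b) * a directly to the shared x, without any lock.
    """
    x = [0.0] * dim  # shared memory: every thread reads and writes this list

    def worker(seed):
        rng = random.Random(seed)
        for _ in range(steps_per_thread):
            a, b = rng.choice(data)
            # The residual may be computed from values another thread is changing.
            residual = sum(aj * xj for aj, xj in zip(a, x)) - b
            for j, aj in enumerate(a):
                if aj != 0.0:  # only touch coordinates this sample uses
                    x[j] -= step * residual * aj  # unsynchronized write

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x

# Sparse, consistent toy data generated from x_true (an assumption for the demo).
x_true = [1.0, -2.0, 0.5]
rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
        [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
data = [(a, sum(aj * xj for aj, xj in zip(a, x_true))) for a in rows]
x_hat = hogwild_sgd(data, dim=3)
```

The sparser the samples, the less often two threads touch the same coordinate at the same time, which is why skipping the locks tends to be tolerable in this setting.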

(22)

Centralized Coordinate descent

● Communicating parameters ‘x’ can be expensive

● Use coordinate descent to transmit one coordinate update at a time

(23)

Centralized Coordinate descent

● Communicating parameters ‘x’ can be expensive

(Figure: mini-batch 1, mini-batch 2, mini-batch 3. Sending and receiving the `x` parameters is expensive.)

(26)

Centralized Coordinate descent

● Communicating parameters ‘x’ can be expensive

● Use coordinate descent to transmit one coordinate update at a time

(Figure: only the coordinates that have changed are transmitted.)
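
Transmitting one coordinate update at a time can be sketched as a sparse message that carries only the entries that changed. The helper names `changed_coordinates` and `apply_update` are hypothetical and not part of any library.

```python
def changed_coordinates(old, new, tol=0.0):
    """Return {index: new_value} for coordinates that moved by more than tol."""
    return {j: v for j, (u, v) in enumerate(zip(old, new)) if abs(v - u) > tol}

def apply_update(x, update):
    """Apply a sparse {index: value} message to a local copy of x, in place."""
    for j, v in update.items():
        x[j] = v
    return x

# A worker performs one coordinate-descent step on coordinate 2 only.
server_x = [0.0, 0.0, 0.0, 0.0, 0.0]
worker_x = list(server_x)
worker_x[2] = 1.5

message = changed_coordinates(server_x, worker_x)  # {2: 1.5}: one entry, not five
apply_update(server_x, message)
```

After a single coordinate step the message has one entry regardless of the dimension of `x`, whereas a gradient step would generally change every coordinate.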

(27)

Decentralized Coordinate descent for sparse datasets

● Least square problem

● Update rule

● Doesn’t seem separable at first sight.

○ But with a sparse A, most entries will be unnecessary

Sparse A (rows are samples, columns are coordinates):

0 1 0 0
1 0 0 0
1 0 0 1
0 0 1 0
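
The formulas for the least squares problem and the update rule did not survive extraction. As a hedged reconstruction, the standard forms are given below; the 1/2 scaling, the step size 1/L_j, and the entry notation a_{ij} (row i is a sample, column j is a coordinate) are assumptions, not taken from the slides.

```latex
% Assumed standard least-squares objective
\min_{x \in \mathbb{R}^d} \; f(x) = \frac{1}{2}\,\lVert Ax - b \rVert^2

% Coordinate-descent update for coordinate j (step size 1/L_j is an assumption)
x_j \leftarrow x_j - \frac{1}{L_j}\,\nabla_j f(x),
\qquad L_j = \sum_i a_{ij}^2

% Only samples i with a_{ij} \neq 0 contribute to the partial derivative
\nabla_j f(x) = \sum_{i \,:\, a_{ij} \neq 0} a_{ij}\Bigl(\sum_k a_{ik}\,x_k - b_i\Bigr)
```

Under these assumptions, updating coordinate j needs only the samples with a non-zero entry in column j, and from those samples only the coordinates they actually use.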

(31)

Decentralized Coordinate descent for sparse datasets

● Least square problem

● Update rule

Coordinates and the samples they touch in Sparse A:

Coordinates 1 & 4: Samples 2 & 3
Coordinate 2: Sample 1
Coordinate 3: Sample 4
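
The grouping of coordinates and samples can be recovered mechanically from the sparsity pattern. The sketch below assumes the 16 matrix entries on the slide are read row by row as a 4 × 4 matrix with samples as rows; `independent_blocks` is an illustrative helper, not part of the slides.

```python
def independent_blocks(A):
    """Group coordinates (columns) and samples (rows) of A into blocks
    that share no non-zero entries with each other."""
    n_cols = len(A[0])
    parent = list(range(n_cols))  # union-find over coordinates

    def find(j):
        while parent[j] != j:
            parent[j] = parent[parent[j]]  # path halving
            j = parent[j]
        return j

    # Coordinates that appear in the same sample must live in the same block.
    for row in A:
        cols = [j for j, v in enumerate(row) if v != 0]
        for j in cols[1:]:
            parent[find(j)] = find(cols[0])

    blocks = {}
    for j in range(n_cols):
        block = blocks.setdefault(find(j), {"coordinates": [], "samples": []})
        block["coordinates"].append(j + 1)  # 1-based, as on the slide
    for i, row in enumerate(A):
        cols = [j for j, v in enumerate(row) if v != 0]
        if cols:
            blocks[find(cols[0])]["samples"].append(i + 1)
    return sorted(blocks.values(), key=lambda blk: blk["coordinates"])

# Sparse A, read row by row (rows = samples, columns = coordinates).
A = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
blocks = independent_blocks(A)
```

Each block can be assigned to its own machine and optimized without exchanging anything with the other blocks, provided the block's samples fit on that machine.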

(33)

Decentralized Coordinate descent for sparse datasets

● Update rule

● Say, you can only fit one sample in a machine

Coordinates and the samples they touch in Sparse A, one sample per machine:

Coordinates 1 & 4: Sample 2, Sample 3
Coordinate 2: Sample 1
Coordinate 3: Sample 4

(35)

Decentralized Gradient Descent

● Distribute the data across machines.

● We may not want to update a ‘centralized’ vector x

○ Each machine has its own data samples

Sparse A:

0 1 0 0
1 0 0 0
1 0 0 1
0 0 1 0

(37)

Decentralized Gradient Descent

● Distribute the data across machines.

● We may not want to update a ‘centralized’ vector x

○ Each machine has its own data samples

○ Each machine has its own parameter vector

Machine 1: Sample 1
Machine 2: Sample 2
Machine 3: Sample 3
Machine 4: Sample 4

(44)

Decentralized Gradient Descent

● Distribute the data across machines.

● We may not want to update a ‘centralized’ vector x

○ Each machine has its own data samples

○ Each machine has its own parameter vector

○ Update rule

(Figure: Sparse A with Sample 1 to Sample 4 on machine 1 to machine 4. Machine 1 and machine 4 need no communication; machine 3 communicates with machine 2.)

(45)

Decentralized Gradient Descent

● Distribute the data across machines.

● We may not want to update a ‘centralized’ vector x

○ Each machine has its own data samples

○ Each machine has its own parameter vector

○ Update rule

○ Convergence is similar to gradient descent with central communication

■ The rate depends on the sparsity of the dataset
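
The update rule itself did not survive extraction. The sketch below uses one common form of decentralized gradient descent, stated here as an assumption: each machine averages its parameter vector with those of its neighbours, then takes a gradient step on its own sample. Treating two machines as neighbours when their samples share a non-zero coordinate is also an assumption, chosen to match the sparse A used on the earlier slides; `decentralized_gd` is an illustrative name.

```python
def decentralized_gd(A, b, step=0.2, iters=3000):
    """Each machine c holds one sample (row c of A) and its own copy of x."""
    n, d = len(A), len(A[0])
    support = [{j for j, v in enumerate(row) if v != 0} for row in A]
    # Machines are neighbours when their samples share a non-zero coordinate
    # (every machine is also its own neighbour).
    nei = [[k for k in range(n) if k == c or support[c] & support[k]]
           for c in range(n)]
    xs = [[0.0] * d for _ in range(n)]  # one parameter vector per machine

    for _ in range(iters):
        new_xs = []
        for c in range(n):
            # Residual of the local loss 0.5 * (a_c . x_c - b_c)**2
            r = sum(A[c][j] * xs[c][j] for j in range(d)) - b[c]
            new_xs.append([
                sum(xs[k][j] for k in nei[c]) / len(nei[c]) - step * r * A[c][j]
                for j in range(d)
            ])
        xs = new_xs  # synchronous rounds, for simplicity of the sketch
    return xs, nei

A = [[0, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]]
x_true = [1.0, 2.0, 3.0, 4.0]
b = [sum(a * x for a, x in zip(row, x_true)) for row in A]  # consistent targets
xs, nei = decentralized_gd(A, b)
```

With this matrix, machines 1 and 4 have no neighbours besides themselves, while machines 2 and 3 exchange vectors because both samples use coordinate 1. No central copy of x is ever formed; each machine ends up with correct values for the coordinates its own sample touches.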

(46)

Summary

● Using parallel and distributed systems is important for speeding up optimization for big data

● Synchronized Deterministic Gradient Descent

○ Optimization halts with link failure or when a machine is slow at processing its data

● Centralized Asynchronous Gradient Descent

○ Communicating vector ‘x’ is costly

● Centralized Asynchronous Coordinate descent

○ Centralization adds communication overhead

● Decentralized Asynchronous Coordinate descent

○ Helpful for sparse datasets

○ No communication between machines

● Decentralized Gradient descent

○ Helpful for sparse datasets
