Parallel & Distributed Optimization. Based on Mark Schmidt s slides

(1)

Parallel & Distributed Optimization

(2)

Motivation behind using parallel & Distributed optimization

● Performance

○ Computational throughput have increased exponentially in linear time (Moore’s law)

○ But only so many transistors can fit in limited space (atomic size)

○ Serial computation throughput plateaued (Moore’s law coming to an end)

● Space

(3)

Introduction

● Parallel Computing

○ One machine

(4)

Introduction

○ One machine

■ Multiple processors (Quad-Core, GPU)

● Distributed Computing

(5)

Introduction

○ One machine

■ Multiple processors (Quad-Core, GPU)

● Distributed Computing

(6)

Distributed optimization

● Strategy

○ Each machine handles a subset of the dataset

● Issues

○ link failures between machines

■ devise algorithms that limits communication ■ decentralize optimization

● Synchronization

○ wait for slowest machine when the all machines depend on the variable parameter ‘x’ coordinates

(7)

Distributed optimization

● Strategy

○ Each machine handles a subset of the dataset

● Issues

○ link failures between machines ○ synchronization

■ wait for the slowest machine to complete processing its data

● Solutions

○ devise algorithms that limits communication

○ decentralize optimization

(8)

Straightforward distributed optimization

● Run different algorithms/strategies on different machines/cores

○ First one that finishes wins

■ Gauss-Southwell Coordinate Descent ■ Randomized Coordinate Descent ■ Gradient Descent

(9)

Straightforward distributed optimization

● Run different algorithms/strategies on different machines/cores

○ First one that finishes wins

■ Gauss-Southwell Coordinate Descent ■ Randomized Coordinate Descent ■ Gradient Descent

■ Stochastic Gradient Descent

Gauss-Southwell Coordinate Descent

Randomized Coordinate Descent

(10)

Parallel first-order methods

● Synchronized deterministic Gradient Descent

● Can be broken into separable components

n samples m machines

(11)

Parallel first-order methods

(12)

Parallel first-order methods

(13)

Parallel first-order methods

n samples

(14)

Parallel first-order methods

m machines

● These allow optimal linear speedups

(15)

Parallel first-order methods

m machines

● Issue

○ What if one of the computers is very slow ? ○ What if one of the links failed ?

(16)

Centralized Gradient descent (HogWild)

● Update ‘x’ asynchronously - saves a lot of time ● Stochastic gradient method on shared memory

(17)

Centralized Gradient descent (HogWild)

(18)

Centralized Gradient descent

(19)

Centralized Gradient descent

(20)

Centralized Gradient descent

(21)

Centralized Gradient descent

(22)

Centralized Coordinate descent

● Communicating parameters ‘x’ can be expensive

● Use coordinate descent to transmit one coordinate update at a time

(23)

Centralized Coordinate descent

mini-batch 1 mini-batch 2 mini-batch 3

Sending and receiving `x` parameters is expensive

(24)

Centralized Coordinate descent

(25)

Centralized Coordinate descent

(26)

Centralized Coordinate descent

Coordinates that have changed _{Coordinates that have changed}

(27)

Decentralized Coordinate descent for sparse datasets

● Least square problem

● Update rule

● Doesn’t seem separable at first sight.

○ But, can have many non-zero entries - most entries in will be unnecessary 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 Sparse A

(28)

Decentralized Coordinate descent for sparse datasets

● Update rule 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 Sparse A

(29)

Decentralized Coordinate descent for sparse datasets

(30)

Decentralized Coordinate descent for sparse datasets

(31)

Decentralized Coordinate descent for sparse datasets

Coordinates 1 & 4 Coordinate 2 Coordinate 3 Samples 2 & 3 Sample 1 Sample 4

(32)

Decentralized Coordinate descent for sparse datasets

Coordinates 1 & 4 Coordinate 2 Coordinate 3 Samples 2 & 3 Sample 1 Sample 4 ● Say, you can only fit one sample in the machine

(33)

Decentralized Coordinate descent for sparse datasets

Coordinates 1 & 4 Coordinate 2 Coordinate 3

Samples 2 Sample 3

Sample 1 Sample 4 ● Say, you can only fit one sample in a machine

(34)

Decentralized Coordinate descent for sparse datasets

Coordinates 1 & 4 Coordinate 2 Coordinate 3

Samples 2 Sample 3

Sample 1 Sample 4 ● Say, you can only fit one sample in a machine

(35)

Decentralized Gradient Descent

● Distribute the data across machines.

● We may not want to update a ‘centralized’ vector x

● Decentralized Gradient Descent

○ Each machine has its own data samples