• No results found

Distributed Machine Learning and Big Data

N/A
N/A
Protected

Academic year: 2022

Share "Distributed Machine Learning and Big Data"

Copied!
76
0
0

Loading.... (view fulltext now)

Full text

(1)

Distributed Machine Learning and Big Data

Sourangshu Bhattacharya

Dept. of Computer Science and Engineering, IIT Kharagpur.

http://cse.iitkgp.ac.in/~sourangshu/

August 21, 2015

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 1 / 76

(2)

Outline

1 Machine Learning and Big Data Support Vector Machines Stochastic Sub-gradient descent

2 Distributed Optimization ADMM

Convergence

Distributed Loss Minimization Results

Development of ADMM

3 Applications and extensions Weighted Parameter Averaging Fully-distributed SVM

(3)

What is Big Data ?

6 Billionweb queries per day.

10 Billiondisplay advertisements per day.

30 Billiontext ads per day.

150 Millioncredit card transactions per day.

100 Billionemails per day.

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 3 / 76

(4)

Machine Learning and Big Data

Machine Learning on Big Data

Classification - Spam / No Spam - 100B emails.

Multi-label classification - image tagging - 14M images 10K tags.

Regression - CTR estimation - 10B ad views.

Ranking - web search - 6B queries.

Recommendation - online shopping - 1.7B views in the US.

(5)

Classification example

Email spam classification.

Features (ui)): Vector of counts of all words.

No. of Features (d ): Words in vocabulary (∼ 100,000).

No. of non-zero features: 100.

No. of emails per day: 100 M.

Size of training set using 30 days data: 6 TB(assuming 20 B per data)

Time taken to read the data once: 41.67 hrs(at 20 MB per second)

Solution:use multiple computers.

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 5 / 76

(6)

Machine Learning and Big Data

Big Data Paradigm

3V’s - Volume, Variety, Velocity.

Distributed system.

Chance of failure:

Computers 1 10 100

Chance of a failure in an hour 0.01 0.09 0.63 Communication efficiency - Data locality.

Many systems: Hadoop, Spark, Graphlab, etc.

Goal: Implement Machine Learning algorithms on Big data systems.

(7)

Binary Classification Problem

A set of labeled datapoints (S) = {(ui,vi),i = 1, . . . , n}, ui ∈ Rd and vi ∈ {+1, −1}

Linear Predictor function: v = sign(xTu) Error function: E =Pn

i=11(vixTui ≤ 0)

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 7 / 76

(8)

Machine Learning and Big Data

Logistic Regression

Probability of v is given by:

P(v |u, x) = σ(v xTu) = 1 1 + e−v xTu Learning problem is:

Given dataset S, estimatex.

Maximizing the regularized log likelihood:

x=argminx

n

X

i=1

log(1 + e−vixTui) + λ 2xTx

(9)

Convex Function

f is a Convex function: f (tx1+ (1 − t)x2) ≤tf (x1) + (1 − t)f (x2)

Objective function for logistic regression is convex.

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 9 / 76

(10)

Machine Learning and Big Data

Convex Optimization

Convex optimization problem

minimizex f (x )

subject to: gi(x ) ≤ 0, ∀i = 1, . . . , k where:

f , gi are convex functions.

For convex optimization problems, local optima are also global optima.

(11)

Optimization Algorithm: Gradient Descent

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 11 / 76

(12)

Machine Learning and Big Data Support Vector Machines

Classification Problem

(13)

SVM

Separating hyperplane: xTu = 0

Parallel hyperplanes (developing margin): xTu = ±1

Margin (perpendicular distance between parallel hyperplanes): kxk2 Correct classification of training datapoints: vixTui ≥ 1, ∀i

Allowing error (slack), ξi: vixTui≥ 1 − ξi, ∀i Max-margin formulation:

min

x,ξ

1

2kxk2+C

n

X

i=1

ξi

subject to: vixTui ≥ 1 − ξi, ξi ≥ 0 ∀i = 1, . . . , n

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 13 / 76

(14)

Machine Learning and Big Data Support Vector Machines

SVM: dual

Lagrangian:

L = 1

2xTx + C

n

X

i=1

ξi+

n

X

i=1

αi(1 − ξi − vixTui) +

n

X

i=1

µiξi

Dual problem: (x, α, µ) =maxα,µminxL(x, α, µ)

For strictly convex problem, primal and dual solutions are same (Strong duality).

KKT conditions:

x =

n

X

i=1

αiviui C = αi+ µi

(15)

SVM: dual

The dual problem:

maxα n

X

i=1

αi−1 2

n,n

X

i=1,j=1

αiαjvivjuTi uj subject to: 0 ≤ αi ≤ C, ∀i

The dual is a quadratic programming problem in n variables.

Can be solved even if kernel function, k (ui,uj) =uTi uj are given.

Dimension agnostic.

Many efficient algorithms exist for solving it, e.g. SMO (Platt99).

Worst case complexity is O(n3), usually O(n2).

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 15 / 76

(16)

Machine Learning and Big Data Support Vector Machines

SVM

A more compact form: minxPn

i=1max(0, 1 − vixTui) + λkxk22 Or: minxPn

i=1l(x, ui,vi) + λΩ(x)

(17)

Multi-class classification

There are m classes. vi ∈ {1, . . . , m}

Most popular scheme: vi =argmaxv ∈{1,...,m}xTvui Given example (ui,vi),xTv

iui ≥ xTj ui ∀j ∈ {1, . . . , m}

Using a margin of at least 1, loss

l(ui,vi) =maxj∈{1,...,vi−1,vi+1,...,m}{0, 1 − (xTviui − xTj ui)}

Given dataset D, solve the problem

x1min,...,xm

X

i∈D

l(ui,vi) + λ

m

X

j=1

kxjk2

This can be extended to many settings e.g. sequence labeling, learning to rank, etc.

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 17 / 76

(18)

Machine Learning and Big Data Support Vector Machines

General Learning Problems

Support Vector Machines:

minx n

X

i=1

max{0, 1 − vixTui} + λkxk22

Logistic Regression:

minx n

X

i=1

log(1 + exp(−vixTui)) + λkxk22

General form:

minx n

X

i=1

l(x, ui,vi) + λΩ(x) l: loss function, Ω: regularizer.

(19)

Sub-gradient Descent

Sub-gradient for a non-differentiable convex function f at a point x0is a vectorv such that:

f (x) − f (x0) ≥vT(x − x0)

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 19 / 76

(20)

Machine Learning and Big Data Stochastic Sub-gradient descent

Sub-gradient Descent

Randomly initializex0

Iteratexk =xk −1− tkg(xk −1), k = 1, 2, 3, . . . . Where g is a sub-gradient of f .

tk = 1

k.

xbest(k ) = mini=1,...,kf (xk)

(21)

Sub-gradient Descent

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 21 / 76

(22)

Machine Learning and Big Data Stochastic Sub-gradient descent

Stochastic Sub-gradient Descent

Convergence rate is: O(1

k).

Each iteration takes O(n) time.

Reduce time by calculating the gradient using a subset of examples - stochastic subgradient.

Inherently serial.

Typical O(12)behaviour.

(23)

Stochastic Sub-gradient Descent

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 23 / 76

(24)

Distributed Optimization

Distributed gradient descent

Divide the dataset into m parts. Each part is processed on one computer. Total m.

There is one central computer.

All computers can communicate with the central computer via network.

Define loss(x) =Pm j=1

P

i∈Cjli(x) + λΩ(x), where li(x) = l(x, ui,vi) The gradient (in case of differentiable loss):

∇loss(x) =

m

X

j=1

∇(X

i∈Cj

li(x)) + λΩ(x)

Compute ∇lj(x) =P

i∈Cj∇li(x) on the jthcomputer. Communicate to central computer.

(25)

Distributed gradient descent

Compute ∇loss(x) =Pm

j=1∇lj(x) + Ω(x) at the central computer.

The gradient descent update: xk +1=xk − α∇loss(x).

αchosen by a line search algorithm (distributed).

For non-differentiable loss functions, we can use distributed sub-gradient descent algorithm.

Slow for most practical problems.

For achieving  tolerance,

Gradient descent (Logistic regression): O(1/) iterations.

Sub-gradient descent (Stochastic Sub-gradient descent): O(1

2) iterations.

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 25 / 76

(26)

Distributed Optimization ADMM

Alternating Direction Method of Multipliers

Problem

minimizex ,z f (x ) + g(z) subject to: Ax + Bz = c Algorithm

Iterate till convergence:

xk +1=argminxf (x ) +ρ2kAx + Bzk− c + ukk22 zk +1=argminzg(z) + ρ2kAxk +1+Bz − c + ukk22 uk +1=uk +Axk +1+Bzk +1− c

(27)

Stopping criteria

Stop when primal and dual residuals small:

krkk2≤ pri and kskk2≤ dual Hence, krkk2→ 0 and kskk2→ 0 as k → ∞

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 27 / 76

(28)

Distributed Optimization ADMM

Observations

x - update requires solving an optimization problem minx f (x ) +ρ

2kAx − v k22 with, v = Bzk − c + uk

Similarly for z-update.

Sometimes has a closed form.

ADMM is a meta optimization algorithm.

(29)

Convergence of ADMM

Assumption 1: Functions f : Rn→ R and g : Rm→ R are closed, proper and convex.

Same as assuming epif = {(x , t) ∈ Rn× R|f (x) ≤ t} is closed and convex.

Assumption 2: The unaugmented Lagrangian L0(x , y , z) has a saddle point (x ∗, z∗, y ∗):

L0(x ∗, z∗, y ) ≤ L0(x ∗, z∗, y ∗) ≤ L0(x , z, y ∗)

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 29 / 76

(30)

Distributed Optimization Convergence

Convergence of ADMM

Primal residual: r = Ax + Bz − c

Optimal objective: p =infx ,z{f (x) + g(z)|Ax + Bz = c}

Convergence results:

Primal residual convergence: rk → 0 as k → ∞ Dual residual convergence: sk → 0 as k → ∞ Objective convergence: f (x ) + g(z) → pas k → ∞ Dual variable convergence: yk → yas k → ∞

(31)

Decomposition

If f is separable:

f (x ) = f1(x1) + · · · +fN(xN), x = (x1, . . . ,xN) A is conformably block separable; i.e. ATA is block diagonal.

Then, x -update splits into N parallel updates of xi

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 31 / 76

(32)

Distributed Optimization Distributed Loss Minimization

Consensus Optimization

Problem:

minx f (x ) =

N

X

i=1

fi(x )

ADMM form:

minxi,z N

X

i=1

fi(xi)

s.t. xi− z = 0, i = 1, . . . , N Augmented lagrangian:

Lρ(x1, . . . ,xN,z, y ) =

N

X

i=1

(fi(xi) +yiT(xi− z) + ρ

2kxi− zk22)

(33)

Consensus Optimization

ADMM algorithm:

xik +1=argminx

i(fi(xi) +yikT(xi− zk) + ρ

2kxi− zkk22) zk +1= 1

N

N

X

i=1

(xik +1+1 ρyik) yik +1=yik+ ρ(xik +1− zk +1) Final solution is zk.

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 33 / 76

(34)

Distributed Optimization Distributed Loss Minimization

Consensus Optimization

z-update can be written as:

zk +1= ¯xk +1+1 ρ¯yk +1 Averaging the y -updates:

k +1= ¯yk+ ρ(¯xk +1− zk +1)

Substituting first into second: ¯yk +1=0. Hence zk = ¯xk. Revised algorithm:

xik +1=argminxi(fi(xi) +yikT(xi− ¯xk) + ρ

2kxi− ¯xkk22) yik +1=yik + ρ(xik +1− ¯xk +1)

Final solution is zk.

(35)

Distributed Loss minimization

Problem:

minx l(Ax − b) + r (x ) Partition A and b by rows:

A =

 A1

... AN

, b =

 b1

... bN

, where, Ai ∈ Rmi×mand bi ∈ Rmi

ADMM formulation:

minxi,z N

X

i=1

li(Aixi− bi) +r (z)

s.t.: xi− z = 0, i = 1, . . . , N

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 35 / 76

(36)

Distributed Optimization Distributed Loss Minimization

Distributed Loss minimization

ADMM solution:

xik +1=argminx

i(li(Aixi− bi) + ρ

2kxi− zk+ukik22) zk +1=argminz(r (z) +Nρ

2 kz − ¯xk +1+ ¯ukk22) uik +1=uik+xik +1− zk +1

(37)

ADMM Results

Logistic Regression using the loss minimization formulation (Boyd et al.):

minx n

X

i=1

log(1 + exp(−vixTui)) + λkxk22

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 37 / 76

(38)

Distributed Optimization Results

ADMM Results

Logistic Regression using the loss minimization formulation (Boyd et al.):

minx n

X

i=1

log(1 + exp(−vixTui)) + λkxk22

(39)

Other Machine Learning Problems

Ridge Regression.

Lasso.

Multi-class SVM.

Ranking.

Structured output prediction.

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 39 / 76

(40)

Distributed Optimization Results

ADMM Results

Lasso Results (Boyd et al.):

(41)

ADMM Results

SVM primal residual:

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 41 / 76

(42)

Distributed Optimization Results

ADMM Results

SVM Accuracy:

(43)

Results

Risk and Hyperplane

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 43 / 76

(44)

Distributed Optimization Development of ADMM

Dual Ascent

Convex equality constrained problem:

minx f (x )

subject to: Ax = b Lagrangian: L(x , y ) = f (x ) + yT(Ax − b) Dual function: g(y ) = infxL(x , y )

Dual problem: maxyg(y )

Final solution: x =argminxL(x , y )

(45)

Dual Ascent

Gradient descent for dual problem: yk +1=yk+ αkykg(yk)

ykg(yk) =A˜x − b, where ˜x = argminxL(x , yk) Dual ascent algorithm:

xk +1=argminxL(x , yk) yk +1=yk+ αk(Axk +1− b) Assumptions:

L(x , yk)is strictly convex. Else, the first step can have multiple solutions.

L(x , yk)is bounded below.

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 45 / 76

(46)

Distributed Optimization Development of ADMM

Dual Decomposition

Suppose f is separable:

f (x ) = f1(x1) + · · · +fN(xN), x = (x1, . . . ,xN)

L is separable in x : L(x , y ) = L1(x1,y ) + · · · + LN(xN,y ) − yTb, where Li(xi,y ) = fi(xi) +yTAixi

x minimization splits into N separate problems:

xik +1=argminx

iLi(xi,yk)

(47)

Dual Decomposition

Dual decomposition:

xik +1=argminx

iLi(xi,yk), i = 1, . . . , N yk +1=yk+ αk(

N

X

i=1

Aixi− b)

Distributed solution:

Scatter yk to individual nodes

Compute xi in the ithnode (distributed step) Gather Aixi from the ithnode

All drawbacks of dual ascent exist

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 47 / 76

(48)

Distributed Optimization Development of ADMM

Method of Multipliers

Make dual ascent work under more general conditions Useaugmented Lagrangian:

Lρ(x , y ) = f (x ) + yT(Ax − b) + ρ

2kAx − bk22 Method of multipliers:

xk +1=argminxLρ(x , yk) yk +1=yk+ ρ(Axk +1− b)

(49)

Methods of Multipliers

Optimality conditions (for differentiable f ):

Primal feasibility: Ax− b = 0 Dual feasibility: ∇f (x) +ATy=0 Since xk +1minimizes Lρ(x , yk)

0 = ∇xLρ(xk +1,yk)

= ∇xf (xk +1) +AT(yk+ ρ(Axk +1− b))

= ∇xf (xk +1) +ATyk +1

Dual update yk +1=yk+ ρ(Axk +1− b) makes (xk +1,yk +1)dual feasible

Primal feasibility is achieved in the limit: (Axk +1− b) → 0

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 49 / 76

(50)

Distributed Optimization Development of ADMM

Alternating direction method of multipliers

Problem with applying standard method of multipliers for distributed optimization:

there is no problem decomposition even if f is separable.

due to square term ρ2kAx − bk22

(51)

Alternating direction method of multipliers

ADMM problem:

minx ,z f (x ) + g(z)

subject to: Ax + Bz = c Lagrangian:

Lρ(x , z, y ) = f (x ) + g(z) + yT(Ax + Bz − c) +ρ2kAx + Bz − ck22 ADMM:

xk +1=argminxLρ(x , zk,yk) zk +1=argminzLρ(xk +1,z, yk) yk +1=yk + ρ(Axk +1+Bzk +1− c)

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 51 / 76

(52)

Distributed Optimization Development of ADMM

Alternating direction method of multipliers

Problem with applying standard method of multipliers for distributed optimization:

there is no problem decomposition even if f is separable.

due to square term ρ2kAx − bk22

The above technique reduces to method of multipliers if we do joint minimization of x and z

Since we split the joint x , z minimization step, the problem can be decomposed.

(53)

ADMM Optimality conditions

Optimality conditions (differentiable case):

Primal feasibility: Ax + Bz − c = 0

Dual feasibility: ∇f (x ) + ATy = 0 and ∇g(z) + BTy = 0 Since zk +1minimizes Lρ(xk +1,z, yk):

0 = ∇g(zk +1) +BTyk + ρBT(Axk +1+Bzk +1− c)

= ∇g(zk +1) +BTyk +1

So, the dual variable update satisfies the second dual feasibility constraint.

Primal feasibility and first dual feasibility are satisfied asymptotically.

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 53 / 76

(54)

Distributed Optimization Development of ADMM

ADMM Optimality conditions

Primal residual: rk =Axk +Bzk− c Since xk +1minimizes Lρ(x , zk,yk):

0 = ∇f (xk +1) +ATyk+ ρAT(Axk +1+Bzk − c)

= ∇f (xk +1) +AT(yk + ρrk +1+ ρB(zk − zk +1)

= ∇f (xk +1) +ATyk +1+ ρATB(zk− zk +1) or,

ρATB(zk− zk +1) = ∇f (xk +1) +ATyk +1

Hence, sk +1= ρATB(zk− zk +1)can be thought as dual residual.

(55)

ADMM with scaled dual variables

Combine the linear and quadratic terms Primal feasibility: Ax + Bz − c = 0

Dual feasibility: ∇f (x ) + ATy = 0 and ∇g(z) + BTy = 0 Since zk +1minimizes Lρ(xk +1,z, yk):

0 = ∇g(zk +1) +BTyk + ρBT(Axk +1+Bzk +1− c)

= ∇g(zk +1) +BTyk +1

So, the dual variable update satisfies the second dual feasibility constraint.

Primal feasibility and first dual feasibility are satisfied asymptotically.

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 55 / 76

(56)

Applications and extensions Weighted Parameter Averaging

Distributed Support Vector Machines

Training dataset partitioned into M partitions (Sm, m = 1, . . . , M).

Each partition has L datapoints: Sm= {(xml,yml)},l = 1, . . . , L.

Each partition can be processed locally on a single computer.

Distributed SVM training problem [?]:

wminm,z M

X

m=1 L

X

l=1

loss(wm; (xml,yml)) +r (z) s.t.wm− z = 0, m = 1, · · · , M, l = 1, . . . , L

(57)

Parameter Averaging

Parameter averaging, also called “mixture weights” proposed in [?], for logistic regression.

Results hold true for SVMs with suitable sub-derivative.

Locally learn SVM on Sm:

wˆm =argminw1 L

L

X

l=1

loss(w; xml,yml) + λkwk2, m = 1, . . . , M

The final SVM parameter is given by:

wPA= 1 M

M

X

m=1

wˆm

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 57 / 76

(58)

Applications and extensions Weighted Parameter Averaging

Problem with Parameter Averaging

PA with varying number of partitions - Toy dataset.

(59)

Weighted Parameter Averaging

Final hypothesis is a weighted sum of the parameters ˆwm.

w =

M

X

m=1

βmw^m

Also proposed in [?].

How to get βm?

Notation: β = [β1, · · · , βM]T, ^W = [ ˆw1, · · · , ˆwM] w = ^Wβ

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 59 / 76

(60)

Applications and extensions Weighted Parameter Averaging

Weighted Parameter Averaging

Find the optimal set of weights β which attains the lowest regularized hinge loss:

minβ,ξ λk ^Wβk2+ 1 ML

M

X

m=1 L

X

i=1

ξmi

subject to: ymiTW^Txmi) ≥1 − ξmi, ∀i, m ξmi ≥ 0, ∀m = 1, . . . , M, i = 1, . . . , L W is a pre-computed parameter.ˆ

(61)

Distributed Weighted Parameter Averaging

Distributed version of primal weighted parameter averaging:

minγm

1 ML

M

X

m=1 L

X

l=1

loss( ˆW γm;xml,yml) +r (β) s.t. γm− β = 0, m = 1, · · · , M,

r (β) = λk ˆWβk2, γm weights for mthcomputer, β consensus weight.

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 61 / 76

(62)

Applications and extensions Weighted Parameter Averaging

Distributed Weighted Parameter Averaging

Distributed algorithm using ADMM:

γmk +1:=argminγ(loss(Aiγ) + (ρ/2)kγ − βk+ukmk22) βk +1:=argminβ(r (β) + (Mρ/2)kβ − γk +1− ukk22) uk +1m =ukm+ γmk +1− βk +1.

um are the scaled Lagrange multipliers, γ = M1 PM

m=1γm and u = M1 PM

m=1um.

(63)

Toy Dataset - PA and WPA

PA (left) and WPA (right) with varying number of partitions - Toy dataset.

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 63 / 76

(64)

Applications and extensions Weighted Parameter Averaging

Toy Dataset - PA and WPA

Accuracy of PA and WPA with varying number of partitions - Toy dataset.

(65)

Real World Datasets

Epsilon (2000 features, 6000 datapoints) test set accuracy with varying number of partitions.

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 65 / 76

(66)

Applications and extensions Weighted Parameter Averaging

Real World Datasets

Gisette (5000 features, 6000 datapoints) test set accuracy with varying number of partitions.

(67)

Real World Datasets

Real-sim (20000 features, 3000 datapoints) test set accuracy with varying number of partitions.

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 67 / 76

(68)

Applications and extensions Weighted Parameter Averaging

Real World Datasets

Convergence of test accuracy with iterations (200 partitions).

(69)

Real World Datasets

Convergence of primal residual with iterations (200 partitions).

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 69 / 76

(70)

Applications and extensions Fully-distributed SVM

Distributed SVM on Arbitrary Network

Motivations:

Sensor Networks.

Corporate networks.

Privacy.

Assumptions:

Data is available at nodes of network

Communication is possible only along edges of the network.

(71)

Distributed SVM on Arbitrary Network

SVM optimization problem:

w ,b,ξmin 1

2kw k2+C

J

X

j=1 nj

X

n=1

ξjn

s.t.: yjn(wtxjn+b) ≥ 1 − ξjn, ∀j ∈ J, n = 1, . . . , Nj ξjn ≥ 0, ∀j ∈ J, n = 1, . . . , Nj

Node j has a copy of wj,bj. Distributed formulation:

min

{wj,bjjn}

1 2

J

X

j=1

kwjk2+JC

J

X

j=1 nj

X

n=1

ξjn

s.t.: yjn(wjtxjn+b) ≥ 1 − ξjn, ∀j ∈ J, n = 1, . . . , Nj ξjn≥ 0, ∀j ∈ J, n = 1, . . . , Nj

wj =wi, ∀j, i ∈ Bj

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 71 / 76

(72)

Applications and extensions Fully-distributed SVM

Algorithm

Using vj = [wjTbj]T, Xj = [[xj1, . . . ,xjNj]T1j]and Yj =diag([yj1, . . . ,yjNj]):

{vjminjnji}

1 2

J

X

j=1

r (vj) +JC

J

X

j=1 nj

X

n=1

ξjn

s.t.: YjXjvj ≥ 1 − ¯ξj, ∀j ∈ J ξ¯j ≥ 0, ∀j ∈ J

vj = ωji, vi = ωji, ∀j, i ∈ Bj

Surrogate augmented Lagrangian:

L({vj}, { ¯ξj}, {ωji}, {αijk}) = 1 2

J

X

j=1

r (vj) +JC

J

X

j=1 nj

X

n=1

ξjn

+

J

X

j=1

X

i∈Bj

Tij1(vj− ωji) + αTij2(vi− ωji)) + ηX

i∈Bj

(kvj− ωjik2+ kvi− ωjik2)

(73)

Algorithm

ADMM based algorithm:

{vjt+1, ξjnt+1} = argmin{vj, ¯ξj}∈WL({vj}, { ¯ξj}, {ωtji}, {αtijk}) {ωt+1ji } = argminωjiL({vj}t+1, { ¯ξt+1j }, {ωji}, {αtijk})

αt+1ji1 = αtji1+ η(vjt+1− ωt+1ji ) αt+1ji2 = αtji2+ η(ωjit+1− vit+1) From the second equation:

ωjit+1= 1

2η(αtji1− αtji2) +1

2(vjt+1+vit+1)

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 73 / 76

(74)

Applications and extensions Fully-distributed SVM

Algorithm

Hence:

αt+1ji1 = 1

2(αtji1+ αtji2) +η

2(vjt+1− vit+1) αt+1ji2 = 1

2(αtji1+ αtji2) +η

2(vjt+1− vit+1)

Substituting ωjit+1= 12(vjt+1+vit+1)into surrogate augmented lagrangian, the third term becomes:

J

X

j=1

X

i∈Bj

αTij1(vj− vi) =

J

X

j=1

X

i∈Bj

vjTtji1− αtij1)

Substitute αtj =P

i∈Bjtji1− αtij1)

(75)

Algorithm

The final algorithm:

{vjt+1, ξjnt+1} = argmin{v

j, ¯ξj}∈WL({vj}, { ¯ξj}, {αtj}) αt+1j = αtj

2 X

i∈Bj

(vjt+1− vit+1)

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 75 / 76

(76)

Applications and extensions Fully-distributed SVM

Thank you ! Questions ?

References

Related documents

The purpose of this study is to meta-analyze the findings of studies of postsecondary students comparing learning performance and course withdrawal rates between open and commer-

Despite the demonstrated benefits both at macroeconomic and individual levels, the prEA sector contribution is still hampered by four main factors: unjustified regula-

Judge Milner has taught classes for the North Texas Council of Governments, City of Arlington Police Academy, Texas Court Clerks Association, the Building Professional

یهاگآ نهذ ،یتخانش نامرد ،دنزرف.. 2 / لعجرب ی هرود ،يراتفر مولع هلجم 7 هرامش ، 1 ، راهب 1312 همدقم دودح 91 یم راگزاسان ار دوخ ناناوجون یعون هب نیدلاو زا دصرد تبسن نیدلاو

Our primary concerns are good coupling of the dipole modes through the manifolds, and the use of the microwave signals from the manifold for measurement of the

Practical information for teens, including free pamphlets on various mental health topics and referrals to mental health centers, hotlines and treatment facilities throughout

Recently, Google rolled out Place Search, which reorganizes how local results are displayed on the search engine results page (SERP).. Place Search shows when Google predicts that

If your area monitors appear in the Supervisor Server Utility, use either the web interface (firmware versions 39801N12 and later, page 89) or the Model 375 Setup Utility