Distributed Machine Learning and Big Data

(1)

Distributed Machine Learning and Big Data

Sourangshu Bhattacharya

Dept. of Computer Science and Engineering, IIT Kharagpur.

http://cse.iitkgp.ac.in/~sourangshu/

August 21, 2015

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 1 / 76

(2)

Outline

1 Machine Learning and Big Data Support Vector Machines Stochastic Sub-gradient descent

2 Distributed Optimization ADMM

Convergence

Distributed Loss Minimization Results

Development of ADMM

3 Applications and extensions Weighted Parameter Averaging Fully-distributed SVM

(3)

What is Big Data ?

6 Billionweb queries per day.

10 Billiondisplay advertisements per day.

30 Billiontext ads per day.

150 Millioncredit card transactions per day.

100 Billionemails per day.

(4)

Machine Learning and Big Data

Machine Learning on Big Data

Classification - Spam / No Spam - 100B emails.

Multi-label classification - image tagging - 14M images 10K tags.

Regression - CTR estimation - 10B ad views.

Ranking - web search - 6B queries.

Recommendation - online shopping - 1.7B views in the US.

(5)

Classification example

Email spam classification.

Features (u_i)): Vector of counts of all words.

No. of Features (d ): Words in vocabulary (∼ 100,000).

No. of non-zero features: 100.

No. of emails per day: 100 M.

Size of training set using 30 days data: 6 TB(assuming 20 B per data)

Time taken to read the data once: 41.67 hrs(at 20 MB per second)

Solution:use multiple computers.

(6)

Big Data Paradigm

3V’s - Volume, Variety, Velocity.

Distributed system.

Chance of failure:

Computers 1 10 100

Chance of a failure in an hour 0.01 0.09 0.63 Communication efficiency - Data locality.

Many systems: Hadoop, Spark, Graphlab, etc.

Goal: Implement Machine Learning algorithms on Big data systems.

(7)

Binary Classification Problem

A set of labeled datapoints (S) = {(u_i,v_i),i = 1, . . . , n}, u_i ∈ R^d and v_i ∈ {+1, −1}

Linear Predictor function: v = sign(x^Tu) Error function: E =Pn

i=11(v_ix^Tu_i ≤ 0)

(8)

Logistic Regression

Probability of v is given by:

P(v |u, x) = σ(v x^Tu) = 1 1 + e^{−v x}^T^u Learning problem is:

Given dataset S, estimatex.

Maximizing the regularized log likelihood:

x^∗=argmin_x

n

X

i=1

log(1 + e^−vⁱ^x^T^uⁱ) + λ 2x^Tx

(9)

Convex Function

f is a Convex function: f (tx₁+ (1 − t)x₂) ≤tf (x₁) + (1 − t)f (x₂)

Objective function for logistic regression is convex.

(10)

Convex Optimization

Convex optimization problem

minimizex f (x )

subject to: g_i(x ) ≤ 0, ∀i = 1, . . . , k where:

f , g_i are convex functions.

For convex optimization problems, local optima are also global optima.

(11)

Optimization Algorithm: Gradient Descent

(12)

Machine Learning and Big Data Support Vector Machines

Classification Problem

(13)

SVM

Separating hyperplane: x^Tu = 0

Parallel hyperplanes (developing margin): x^Tu = ±1

Margin (perpendicular distance between parallel hyperplanes): _kxk² Correct classification of training datapoints: v_ix^Tu_i ≥ 1, ∀i

Allowing error (slack), ξ_i: v_ix^Tu_i≥ 1 − ξ_i, ∀i Max-margin formulation:

min

x,ξ

1

2kxk²+C

n

X

i=1

ξ_i

subject to: v_ix^Tu_i ≥ 1 − ξ_i, ξ_i ≥ 0 ∀i = 1, . . . , n

(14)

SVM: dual

Lagrangian:

L = 1

2x^Tx + C

n

X

i=1

ξi+

n

X

i=1

αi(1 − ξ_i − vix^Tu_i) +

n

X

i=1

µiξi

Dual problem: (x^∗, α^∗, µ^∗) =max_α,µmin_xL(x, α, µ)

For strictly convex problem, primal and dual solutions are same (Strong duality).

KKT conditions:

x =

n

X

i=1

α_iv_iu_i C = α_i+ µ_i

(15)

SVM: dual

The dual problem:

maxα n

X

i=1

α_i−1 2

n,n

X

i=1,j=1

α_iα_jv_iv_ju^T_i u_j subject to: 0 ≤ α_i ≤ C, ∀i

The dual is a quadratic programming problem in n variables.

Can be solved even if kernel function, k (u_i,u_j) =u^T_i u_j are given.

Dimension agnostic.

Many efficient algorithms exist for solving it, e.g. SMO (Platt99).

Worst case complexity is O(n³), usually O(n²).

(16)

SVM

A more compact form: min_xPn

i=1max(0, 1 − v_ix^Tu_i) + λkxk²₂ Or: minxPn

i=1l(x, u_i,v_i) + λΩ(x)

(17)

Multi-class classification

There are m classes. v_i ∈ {1, . . . , m}

Most popular scheme: v_i =argmaxv ∈{1,...,m}x^T_vu_i Given example (u_i,v_i),x^T_v

iu_i ≥ x^T_j u_i ∀j ∈ {1, . . . , m}

Using a margin of at least 1, loss

l(u_i,v_i) =max_j∈{1,...,v_i_−1,v_i_+1,...,m}{0, 1 − (x^T_v_iu_i − x^T_j u_i)}

Given dataset D, solve the problem

x₁min,...,xm

X

i∈D

l(u_i,v_i) + λ

m

X

j=1

kxjk²

This can be extended to many settings e.g. sequence labeling, learning to rank, etc.

(18)

General Learning Problems

Support Vector Machines:

minx n

X

i=1

max{0, 1 − v_ix^Tu_i} + λkxk²₂

Logistic Regression:

minx n

X

i=1

log(1 + exp(−v_ix^Tu_i)) + λkxk²₂

General form:

minx n

X

i=1

l(x, u_i,v_i) + λΩ(x) l: loss function, Ω: regularizer.

(19)

Sub-gradient Descent

Sub-gradient for a non-differentiable convex function f at a point x₀is a vectorv such that:

f (x) − f (x₀) ≥v^T(x − x₀)

(20)

Machine Learning and Big Data Stochastic Sub-gradient descent

Sub-gradient Descent

Randomly initializex₀

Iteratex_k =x_{k −1}− t_kg(x_{k −1}), k = 1, 2, 3, . . . . Where g is a sub-gradient of f .

t_k = ^√¹

k.

x_best(k ) = min_i=1,...,kf (x_k)

(21)

Sub-gradient Descent

(22)

Machine Learning and Big Data Stochastic Sub-gradient descent

Stochastic Sub-gradient Descent

Convergence rate is: O(^√¹

k).

Each iteration takes O(n) time.

Reduce time by calculating the gradient using a subset of examples - stochastic subgradient.

Inherently serial.

Typical O(¹2)behaviour.

(23)

Stochastic Sub-gradient Descent

(24)

Distributed Optimization

Distributed gradient descent

Divide the dataset into m parts. Each part is processed on one computer. Total m.

There is one central computer.

All computers can communicate with the central computer via network.

Define loss(x) =Pm j=1

P

i∈Cjl_i(x) + λΩ(x), where l_i(x) = l(x, u_i,v_i) The gradient (in case of differentiable loss):

∇loss(x) =

m

X

j=1

∇(X

i∈C_j

l_i(x)) + λΩ(x)

Compute ∇l_j(x) =P

i∈C_j∇li(x) on the j^thcomputer. Communicate to central computer.

(25)

Distributed gradient descent

Compute ∇loss(x) =Pm

j=1∇l_j(x) + Ω(x) at the central computer.

The gradient descent update: x^{k +1}=x^k − α∇loss(x).

αchosen by a line search algorithm (distributed).

For non-differentiable loss functions, we can use distributed sub-gradient descent algorithm.

Slow for most practical problems.

For achieving tolerance,

Gradient descent (Logistic regression): O(1/) iterations.

Sub-gradient descent (Stochastic Sub-gradient descent): O(¹

²) iterations.

(26)

Distributed Optimization ADMM

Alternating Direction Method of Multipliers

Problem

minimize_{x ,z} f (x ) + g(z) subject to: Ax + Bz = c Algorithm

Iterate till convergence:

x^{k +1}=argmin_xf (x ) +^ρ₂kAx + Bz^k− c + u^kk²₂ z^{k +1}=argmin_zg(z) + ^ρ₂kAx^{k +1}+Bz − c + u^kk²₂ u^{k +1}=u^k +Ax^{k +1}+Bz^{k +1}− c

(27)

Stopping criteria

Stop when primal and dual residuals small:

kr^kk₂≤ ^pri and ks^kk₂≤ ^dual Hence, kr^kk₂→ 0 and ks^kk₂→ 0 as k → ∞

(28)

Distributed Optimization ADMM

Observations

x - update requires solving an optimization problem minx f (x ) +ρ

2kAx − v k²₂ with, v = Bz^k − c + u^k

Similarly for z-update.

Sometimes has a closed form.

ADMM is a meta optimization algorithm.

(29)

Convergence of ADMM

Assumption 1: Functions f : Rⁿ→ R and g : R^m→ R are closed, proper and convex.

Same as assuming epif = {(x , t) ∈ Rⁿ× R|f (x) ≤ t} is closed and convex.

Assumption 2: The unaugmented Lagrangian L₀(x , y , z) has a saddle point (x ∗, z∗, y ∗):

L₀(x ∗, z∗, y ) ≤ L₀(x ∗, z∗, y ∗) ≤ L₀(x , z, y ∗)

(30)

Distributed Optimization Convergence

Convergence of ADMM

Primal residual: r = Ax + Bz − c

Optimal objective: p^∗ =infx ,z{f (x) + g(z)|Ax + Bz = c}

Convergence results:

Primal residual convergence: r^k → 0 as k → ∞ Dual residual convergence: s^k → 0 as k → ∞ Objective convergence: f (x ) + g(z) → p^∗as k → ∞ Dual variable convergence: y^k → y^∗as k → ∞

(31)

Decomposition

If f is separable:

f (x ) = f₁(x₁) + · · · +f_N(x_N), x = (x₁, . . . ,x_N) A is conformably block separable; i.e. A^TA is block diagonal.

Then, x -update splits into N parallel updates of x_i

(32)

Distributed Optimization Distributed Loss Minimization

Consensus Optimization

Problem:

minx f (x ) =

N

X

i=1

f_i(x )

ADMM form:

minx_i,z N

X

i=1

f_i(x_i)

s.t. x_i− z = 0, i = 1, . . . , N Augmented lagrangian:

L_ρ(x₁, . . . ,x_N,z, y ) =

N

X

i=1

(f_i(x_i) +y_i^T(x_i− z) + ρ

2kxi− zk²₂)

(33)

Consensus Optimization

ADMM algorithm:

x_i^{k +1}=argmin_x

i(f_i(x_i) +y_i^kT(x_i− z^k) + ρ

2kx_i− z^kk²₂) z^{k +1}= 1

N

X

i=1

(x_i^{k +1}+1 ρy_i^k) y_i^{k +1}=y_i^k+ ρ(x_i^{k +1}− z^{k +1}) Final solution is z^k.

(34)

Consensus Optimization

z-update can be written as:

z^{k +1}= ¯x^{k +1}+1 ρ¯y^{k +1} Averaging the y -updates:

y¯^{k +1}= ¯y^k+ ρ(¯x^{k +1}− z^{k +1})

Substituting first into second: ¯y^{k +1}=0. Hence z^k = ¯x^k. Revised algorithm:

x_i^{k +1}=argmin_x_i(f_i(x_i) +y_i^kT(x_i− ¯x^k) + ρ

2kx_i− ¯x^kk²₂) y_i^{k +1}=y_i^k + ρ(x_i^{k +1}− ¯x^{k +1})

Final solution is z^k.

(35)

Distributed Loss minimization

Problem:

minx l(Ax − b) + r (x ) Partition A and b by rows:

A =





 A₁

... A_N





, b =





 b₁

... b_N





, where, A_i ∈ R^mⁱ^×mand b_i ∈ R^mⁱ

ADMM formulation:

minxi,z N

X

i=1

l_i(A_ix_i− b_i) +r (z)

s.t.: x_i− z = 0, i = 1, . . . , N

(36)

Distributed Loss minimization

ADMM solution:

x_i^{k +1}=argmin_x

i(l_i(A_ix_i− b_i) + ρ

2kx_i− z^k+u^k_ik²₂) z^{k +1}=argmin_z(r (z) +Nρ

2 kz − ¯x^{k +1}+ ¯u^kk²₂) u_i^{k +1}=u_i^k+x_i^{k +1}− z^{k +1}

(37)

ADMM Results

Logistic Regression using the loss minimization formulation (Boyd et al.):

minx n

X

i=1

(38)

Distributed Optimization Results

ADMM Results

Logistic Regression using the loss minimization formulation (Boyd et al.):

minx n

X

i=1

(39)

ADMM Results

Lasso Results (Boyd et al.):

(41)

ADMM Results

SVM primal residual:

(42)

ADMM Results

SVM Accuracy:

(43)

Results

Risk and Hyperplane

(44)

Distributed Optimization Development of ADMM

Dual Ascent

Convex equality constrained problem:

minx f (x )

subject to: Ax = b Lagrangian: L(x , y ) = f (x ) + y^T(Ax − b) Dual function: g(y ) = inf_xL(x , y )

Dual problem: max_yg(y )

Final solution: x^∗ =argmin_xL(x , y )

(45)

Dual Ascent

Gradient descent for dual problem: y^{k +1}=y^k+ α^k∇_ykg(y^k)

∇_ykg(y^k) =A˜x − b, where ˜x = argmin_xL(x , y^k) Dual ascent algorithm:

x^{k +1}=argmin_xL(x , y^k) y^{k +1}=y^k+ α^k(Ax^{k +1}− b) Assumptions:

L(x , y^k)is strictly convex. Else, the first step can have multiple solutions.

L(x , y^k)is bounded below.

(46)

Dual Decomposition

Suppose f is separable:

f (x ) = f₁(x₁) + · · · +f_N(x_N), x = (x₁, . . . ,x_N)

L is separable in x : L(x , y ) = L₁(x₁,y ) + · · · + L_N(x_N,y ) − y^Tb, where L_i(x_i,y ) = f_i(x_i) +y^TA_ix_i

x minimization splits into N separate problems:

x_i^{k +1}=argmin_x

iL_i(x_i,y^k)

(47)

Dual Decomposition

Dual decomposition:

x_i^{k +1}=argmin_x

iL_i(x_i,y^k), i = 1, . . . , N y^{k +1}=y^k+ α^k(

N

X

i=1

A_ix_i− b)

Distributed solution:

Scatter y^k to individual nodes

Compute x_i in the i^thnode (distributed step) Gather A_ix_i from the i^thnode

All drawbacks of dual ascent exist

(48)

Method of Multipliers

Make dual ascent work under more general conditions Useaugmented Lagrangian:

L_ρ(x , y ) = f (x ) + y^T(Ax − b) + ρ

2kAx − bk²₂ Method of multipliers:

x^{k +1}=argmin_xL_ρ(x , y^k) y^{k +1}=y^k+ ρ(Ax^{k +1}− b)

(49)

Methods of Multipliers

Optimality conditions (for differentiable f ):

Primal feasibility: Ax^∗− b = 0 Dual feasibility: ∇f (x^∗) +A^Ty^∗=0 Since x^{k +1}minimizes L_ρ(x , y^k)

0 = ∇xL_ρ(x^{k +1},y^k)

= ∇xf (x^{k +1}) +A^T(y^k+ ρ(Ax^{k +1}− b))

= ∇xf (x^{k +1}) +A^Ty^{k +1}

Dual update y^{k +1}=y^k+ ρ(Ax^{k +1}− b) makes (x^{k +1},y^{k +1})dual feasible

Primal feasibility is achieved in the limit: (Ax^{k +1}− b) → 0

(50)

Alternating direction method of multipliers

Problem with applying standard method of multipliers for distributed optimization:

there is no problem decomposition even if f is separable.

due to square term ^ρ₂kAx − bk²₂

(51)

Alternating direction method of multipliers

ADMM problem:

minx ,z f (x ) + g(z)

subject to: Ax + Bz = c Lagrangian:

L_ρ(x , z, y ) = f (x ) + g(z) + y^T(Ax + Bz − c) +^ρ₂kAx + Bz − ck²₂ ADMM:

x^{k +1}=argmin_xL_ρ(x , z^k,y^k) z^{k +1}=argmin_zL_ρ(x^{k +1},z, y^k) y^{k +1}=y^k + ρ(Ax^{k +1}+Bz^{k +1}− c)

(52)

Alternating direction method of multipliers

Problem with applying standard method of multipliers for distributed optimization:

there is no problem decomposition even if f is separable.

due to square term ^ρ₂kAx − bk²₂

The above technique reduces to method of multipliers if we do joint minimization of x and z

Since we split the joint x , z minimization step, the problem can be decomposed.

(53)

ADMM Optimality conditions

Optimality conditions (differentiable case):

Primal feasibility: Ax + Bz − c = 0

Dual feasibility: ∇f (x ) + A^Ty = 0 and ∇g(z) + B^Ty = 0 Since z^{k +1}minimizes L_ρ(x^{k +1},z, y^k):

0 = ∇g(z^{k +1}) +B^Ty^k + ρB^T(Ax^{k +1}+Bz^{k +1}− c)

= ∇g(z^{k +1}) +B^Ty^{k +1}

So, the dual variable update satisfies the second dual feasibility constraint.

Primal feasibility and first dual feasibility are satisfied asymptotically.

(54)

ADMM Optimality conditions

Primal residual: r^k =Ax^k +Bz^k− c Since x^{k +1}minimizes L_ρ(x , z^k,y^k):

0 = ∇f (x^{k +1}) +A^Ty^k+ ρA^T(Ax^{k +1}+Bz^k − c)

= ∇f (x^{k +1}) +A^T(y^k + ρr^{k +1}+ ρB(z^k − z^{k +1})

= ∇f (x^{k +1}) +A^Ty^{k +1}+ ρA^TB(z^k− z^{k +1}) or,

ρA^TB(z^k− z^{k +1}) = ∇f (x^{k +1}) +A^Ty^{k +1}

Hence, s^{k +1}= ρA^TB(z^k− z^{k +1})can be thought as dual residual.

(55)

ADMM with scaled dual variables

Combine the linear and quadratic terms Primal feasibility: Ax + Bz − c = 0

Dual feasibility: ∇f (x ) + A^Ty = 0 and ∇g(z) + B^Ty = 0 Since z^{k +1}minimizes L_ρ(x^{k +1},z, y^k):

0 = ∇g(z^{k +1}) +B^Ty^k + ρB^T(Ax^{k +1}+Bz^{k +1}− c)

= ∇g(z^{k +1}) +B^Ty^{k +1}

So, the dual variable update satisfies the second dual feasibility constraint.

Primal feasibility and first dual feasibility are satisfied asymptotically.

(56)

Applications and extensions Weighted Parameter Averaging

Distributed Support Vector Machines

Training dataset partitioned into M partitions (S_m, m = 1, . . . , M).

Each partition has L datapoints: S_m= {(x_ml,y_ml)},l = 1, . . . , L.

Each partition can be processed locally on a single computer.

Distributed SVM training problem [?]:

wminm,z M

X

m=1 L

X

l=1

loss(wm; (x_ml,y_ml)) +r (z) s.t.wm− z = 0, m = 1, · · · , M, l = 1, . . . , L

(57)

Parameter Averaging

Parameter averaging, also called “mixture weights” proposed in [?], for logistic regression.

Results hold true for SVMs with suitable sub-derivative.

Locally learn SVM on Sm:

wˆm =argmin_w1 L

L

X

l=1

loss(w; x_ml,y_ml) + λkwk², m = 1, . . . , M

The final SVM parameter is given by:

w_PA= 1 M

M

X

m=1

wˆm

(58)

Problem with Parameter Averaging

PA with varying number of partitions - Toy dataset.

(59)

Weighted Parameter Averaging

Final hypothesis is a weighted sum of the parameters ˆw_m.

w =

M

X

m=1

βmw^m

Also proposed in [?].

How to get β_m?

Notation: β = [β₁, · · · , β_M]^T, ^W = [ ˆw₁, · · · , ˆw_M] w = ^Wβ

(60)

Weighted Parameter Averaging

Find the optimal set of weights β which attains the lowest regularized hinge loss:

minβ,ξ λk ^Wβk²+ 1 ML

M

X

m=1 L

X

i=1

ξ_mi

subject to: y_mi(β^TW^^Tx_mi) ≥1 − ξ_mi, ∀i, m ξ_mi ≥ 0, ∀m = 1, . . . , M, i = 1, . . . , L W is a pre-computed parameter.ˆ

(61)

Distributed Weighted Parameter Averaging

Distributed version of primal weighted parameter averaging:

minγm,β

1 ML

M

X

m=1 L

X

l=1

loss( ˆW γ_m;x_ml,y_ml) +r (β) s.t. γ_m− β = 0, m = 1, · · · , M,

r (β) = λk ˆWβk², γm weights for m^thcomputer, β consensus weight.

(62)

Distributed Weighted Parameter Averaging

Distributed algorithm using ADMM:

γ_m^{k +1}:=argmin_γ(loss(A_iγ) + (ρ/2)kγ − β^k+u^k_mk²₂) β^{k +1}:=argmin_β(r (β) + (Mρ/2)kβ − γ^{k +1}− u^kk²₂) u^{k +1}_m =u^k_m+ γ_m^{k +1}− β^{k +1}.

u_m are the scaled Lagrange multipliers, γ = _M¹ PM

m=1γm and u = _M¹ PM

m=1um.

(63)

Toy Dataset - PA and WPA

PA (left) and WPA (right) with varying number of partitions - Toy dataset.

(64)

Toy Dataset - PA and WPA

Accuracy of PA and WPA with varying number of partitions - Toy dataset.

(65)

Real World Datasets

Epsilon (2000 features, 6000 datapoints) test set accuracy with varying number of partitions.

(66)

Real World Datasets

Gisette (5000 features, 6000 datapoints) test set accuracy with varying number of partitions.

(67)

Real World Datasets

Real-sim (20000 features, 3000 datapoints) test set accuracy with varying number of partitions.

(68)

Real World Datasets

Convergence of test accuracy with iterations (200 partitions).

(69)

Real World Datasets

Convergence of primal residual with iterations (200 partitions).

(70)

Applications and extensions Fully-distributed SVM

Distributed SVM on Arbitrary Network

Motivations:

Sensor Networks.

Corporate networks.

Privacy.

Assumptions:

Data is available at nodes of network

Communication is possible only along edges of the network.

(71)

Distributed SVM on Arbitrary Network

SVM optimization problem:

w ,b,ξmin 1

2kw k²+C

J

X

j=1 nj

X

n=1

ξ_jn

s.t.: y_jn(w^tx_jn+b) ≥ 1 − ξ_jn, ∀j ∈ J, n = 1, . . . , N_j ξ_jn ≥ 0, ∀j ∈ J, n = 1, . . . , N_j

Node j has a copy of w_j,b_j. Distributed formulation:

min

{wj,b_j,ξ_jn}

1 2

J

X

j=1

kwjk²+JC

J

X

j=1 n_j

X

n=1

ξjn

s.t.: y_jn(w_j^tx_jn+b) ≥ 1 − ξ_jn, ∀j ∈ J, n = 1, . . . , N_j ξ_jn≥ 0, ∀j ∈ J, n = 1, . . . , N_j

w_j =w_i, ∀j, i ∈ B_j

(72)

Algorithm

Using v_j = [w_j^Tb_j]^T, X_j = [[x_j1, . . . ,x_jN_j]^T1_j]and Y_j =diag([y_j1, . . . ,y_jN_j]):

{vjmin,ξ_jn,ω_ji}

1 2

J

X

j=1

r (vj) +JC

J

X

j=1 nj

X

n=1

ξ_jn

s.t.: Y_jX_jv_j ≥ 1 − ¯ξ_j, ∀j ∈ J ξ¯_j ≥ 0, ∀j ∈ J

vj = ω_ji, vi = ω_ji, ∀j, i ∈ Bj

Surrogate augmented Lagrangian:

L({vj}, { ¯ξ_j}, {ω_ji}, {α_ijk}) = 1 2

J

X

j=1

r (vj) +JC

J

X

j=1 nj

X

n=1

ξ_jn

+

J

X

j=1

X

i∈Bj

(α^T_ij1(v_j− ω_ji) + α^T_ij2(v_i− ω_ji)) + ηX

i∈Bj

(kv_j− ω_jik²+ kv_i− ω_jik²)

(73)

Algorithm

ADMM based algorithm:

{v_j^t+1, ξ_jn^t+1} = argmin_{v_j_{, ¯}_ξ_j_}∈WL({v_j}, { ¯ξ_j}, {ω^t_ji}, {α^t_ijk}) {ω^t+1_ji } = argmin_ω_jiL({v_j}^t+1, { ¯ξ^t+1_j }, {ω_ji}, {α^t_ijk})

α^t+1_ji1 = α^t_ji1+ η(v_j^t+1− ω^t+1_ji ) α^t+1_ji2 = α^t_ji2+ η(ω_ji^t+1− v_i^t+1) From the second equation:

ω_ji^t+1= 1

2η(α^t_ji1− α^t_ji2) +1

2(v_j^t+1+v_i^t+1)

(74)

Algorithm

Hence:

α^t+1_ji1 = 1

2(α^t_ji1+ α^t_ji2) +η

2(v_j^t+1− v_i^t+1) α^t+1_ji2 = 1

2(α^t_ji1+ α^t_ji2) +η

2(v_j^t+1− v_i^t+1)

Substituting ω_ji^t+1= ¹₂(v_j^t+1+v_i^t+1)into surrogate augmented lagrangian, the third term becomes:

J

X

j=1

X

i∈Bj

α^T_ij1(v_j− v_i) =

J

X

j=1

X

i∈Bj

v_j^T(α^t_ji1− α^t_ij1)

Substitute α^t_j =P

i∈Bj(α^t_ji1− α^t_ij1)

(75)

Algorithm

The final algorithm:

{v_j^t+1, ξ_jn^t+1} = argmin_{v

j, ¯ξj}∈WL({v_j}, { ¯ξ_j}, {α^t_j}) α^t+1_j = α^t_j +η

2 X

i∈Bj

(v_j^t+1− v_i^t+1)

(76)

Thank you ! Questions ?

Distributed Machine Learning and Big Data