BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION

(1)


Ş. İlker Birbil

Sabancı University

Ali Taylan Cemgil¹, Hazal Koptagel¹, Figen Öztoprak², Umut Şimşekli¹
1: Boğaziçi University, 2: Bilgi University

Nottingham University, March 2015

Ş. İlker Birbil (Sabancı University) Big Data Optimization

(2)

LARGE-SCALE OPTIMIZATION AND MACHINE LEARNING

F. Öztoprak

(3)

DATA SCIENCE

(4)

GRADUATE COURSES

(5)

NONLINEAR OPTIMIZATION

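Typically, a nonlinear optimization problem is defined as

$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad c_i(x) = 0,\ i \in \mathcal{E}, \qquad c_i(x) \ge 0,\ i \in \mathcal{I}, \tag{1}$$

where $f: \mathbb{R}^n \to \mathbb{R}$ is the objective function and $c_i: \mathbb{R}^n \to \mathbb{R}$ for $i \in \mathcal{E} \cup \mathcal{I}$ are the constraint functions. At least one of these functions is nonlinear.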

Nonlinear Programming (NLP) Problem

Covers optimization problems

$$\min_{x \in X} f(x)$$

where $X \subseteq \mathbb{R}^n$ is the feasible set defined through the constraint functions $g: \mathbb{R}^n \to \mathbb{R}^m$, and $f: \mathbb{R}^n \to \mathbb{R}$ and $g$ are continuous and not necessarily linear.

(6)

ROLE OF NONLINEAR OPTIMIZATION

[Figure: nonlinear optimization (core NLP) and its neighbours. Related fields: statistics, computer science, applied mathematics, operations research. Subfields: large-scale NLP, PDE-constrained optimization, global optimization, derivative-free optimization, convex optimization, mixed-integer nonlinear programming, stochastic programming. Applications: machine learning (image recovery), molecular biology (protein folding), finance (risk management), health (cancer treatment), engineering design (machining), production (chemical complex design). Credit: F. Öztoprak]

(7)

OUR RESEARCH GROUP

• Three faculty members, four PhD students, three MSc students
• (Coupled) tensor or matrix factorization
• Distributed and parallel algorithms:
    • Bayesian inference
    • Nonlinear optimization

[Figure: Processors 1 to 3; Processor 1 is shown with Memory 1 and Cores 1 to 4]


(9)

LINK PREDICTION VIA TENSOR FACTORIZATION

• $X_1(i, j, k)$: whether user $i$ visits location $j$ and performs activity $k$
• $X_2(i, m)$: frequency of user $i$ visiting location $m$
• $X_3(j, n)$: points of interest for location $j$

(10)

TENSOR FACTORIZATION

• Tensor ≡ multidimensional array ($X_{i,j,k,\dots}$)
• Extension of matrix factorizations to higher-order tensors
• Used to extract the underlying factors in higher-order data sets

$$X(i, j, k) \approx \sum_{r} Z_1(i, r)\, Z_2(j, r)\, Z_3(k, r)$$

Source: Cemgil, Probabilistic Latent Tensor Factorisation, IFG19, Sabancı University
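The factorization formula above can be read as a recipe for rebuilding the tensor from its factors. The following is a minimal sketch in plain Python; the function name `cp_reconstruct` and the small rank-2 factors are hypothetical and only serve to illustrate the sum over $r$.

```python
def cp_reconstruct(Z1, Z2, Z3):
    """Return X_hat with X_hat[i][j][k] = sum_r Z1[i][r] * Z2[j][r] * Z3[k][r]."""
    rank = len(Z1[0])
    return [[[sum(Z1[i][r] * Z2[j][r] * Z3[k][r] for r in range(rank))
              for k in range(len(Z3))]
             for j in range(len(Z2))]
            for i in range(len(Z1))]

# Hypothetical rank-2 factors for a 2 x 3 x 2 tensor (illustration only).
Z1 = [[1.0, 0.0], [0.0, 2.0]]
Z2 = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
Z3 = [[1.0, 1.0], [0.5, 2.0]]

X_hat = cp_reconstruct(Z1, Z2, Z3)
print(X_hat[1][2][1])  # only the r = 1 term is nonzero here: 2 * 3 * 2
```

Each factor matrix has one row per index of its mode and one column per latent factor $r$, so the number of columns has to agree across $Z_1$, $Z_2$ and $Z_3$.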

(11)

MATRIX FACTORIZATION

An inverse problem: estimate $Z_1$ and $Z_2$ given the data matrix $X$, assuming $X \approx Z_1 Z_2$:

$$X(\nu, \tau) \approx \sum_{i} Z_1(\nu, i)\, Z_2(i, \tau)$$

• Minimize a suitable error function subject to constraints (e.g., nonnegativity, orthogonality):

$$(Z_1, Z_2)^* = \arg\min_{Z_1, Z_2} D(X \,\|\, Z_1 Z_2) + R(Z_1, Z_2)$$

Overall optimization problem

$$\text{minimize } \|X - Z_1 Z_2\|_F^2 \quad \text{subject to } Z_1, Z_2 \in \mathcal{Z},$$

where $\mathcal{Z}$ is the feasible region. When $\mathcal{Z}$ is the first (nonnegative) orthant, we have the nonnegative matrix factorization problem.
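To make the nonnegative case concrete, the sketch below evaluates $\|X - Z_1 Z_2\|_F^2$ and applies a plain projected gradient step. This is a textbook baseline swapped in for illustration, not the algorithm proposed in these slides; the helper names, the toy data and the step size are all assumptions.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def residual(X, Z1, Z2):
    """Entrywise X - Z1 Z2."""
    P = matmul(Z1, Z2)
    return [[x - p for x, p in zip(x_row, p_row)] for x_row, p_row in zip(X, P)]

def objective(X, Z1, Z2):
    """Squared Frobenius norm ||X - Z1 Z2||_F^2."""
    return sum(r * r for row in residual(X, Z1, Z2) for r in row)

def projected_gradient_step(X, Z1, Z2, step):
    """One gradient step on both factors, then projection onto the nonnegative orthant."""
    R = residual(X, Z1, Z2)
    G1 = matmul(R, transpose(Z2))  # gradient w.r.t. Z1 is -2 * G1
    G2 = matmul(transpose(Z1), R)  # gradient w.r.t. Z2 is -2 * G2
    new_Z1 = [[max(0.0, z + 2.0 * step * g) for z, g in zip(z_row, g_row)]
              for z_row, g_row in zip(Z1, G1)]
    new_Z2 = [[max(0.0, z + 2.0 * step * g) for z, g in zip(z_row, g_row)]
              for z_row, g_row in zip(Z2, G2)]
    return new_Z1, new_Z2

# Hypothetical rank-1 data and starting factors (illustration only).
X = [[1.0, 2.0], [2.0, 4.0]]
Z1 = [[1.0], [1.0]]
Z2 = [[1.0, 1.0]]

initial = objective(X, Z1, Z2)
for _ in range(200):
    Z1, Z2 = projected_gradient_step(X, Z1, Z2, step=0.01)
final = objective(X, Z1, Z2)
print(initial, final)
```

The projection is just an entrywise `max(0, ·)`, which is what makes the first orthant a convenient feasible region.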

(12)

MOVIE RECOMMENDATION

$$\text{minimize } \|X - Z_1 Z_2\|_F^2 \quad \text{subject to } Z_1 \ge 0,\ Z_2 \ge 0$$

(13)

DISTRIBUTED IMPLEMENTATION

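[Figure: $X \approx Z_1 \times Z_2$, with $X$ partitioned into a $3 \times 3$ grid of blocks $X_{ij}$, $Z_1$ into row blocks $Z_1(1,:), Z_1(2,:), Z_1(3,:)$ and $Z_2$ into column blocks $Z_2(:,1), Z_2(:,2), Z_2(:,3)$. In each time slot (Time Slot 1, 2, 3, 4, ...), three blocks that share no row or column block are processed in parallel on processors P1, P2 and P3: $\{X_{11}, X_{22}, X_{33}\}$, $\{X_{12}, X_{23}, X_{31}\}$ or $\{X_{13}, X_{21}, X_{32}\}$.]

Perform the block factorizations $X_{12} \approx Z_1(1,:)Z_2(:,2)$, $X_{23} \approx Z_1(2,:)Z_2(:,3)$ and $X_{31} \approx Z_1(3,:)Z_2(:,1)$ on separate processors by employing IPA.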
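The three block groups in the figure are consistent with a cyclic shift of the column block per time slot. The sketch below generates such a schedule; the function name `block_schedule`, the order of the time slots and the assignment of row block $p$ to processor P$p$ are assumptions made for illustration, not details taken from the slides.

```python
def block_schedule(num_blocks):
    """For each time slot, list (processor, row_block, col_block), all 1-indexed.

    Processor p keeps row block p and takes column block p + slot (cyclically),
    so no two processors touch the same part of Z1 or Z2 in a slot.
    """
    schedule = []
    for slot in range(num_blocks):
        assignments = [(p + 1, p + 1, (p + slot) % num_blocks + 1)
                       for p in range(num_blocks)]
        schedule.append(assignments)
    return schedule

schedule = block_schedule(3)
for slot, assignments in enumerate(schedule, start=1):
    blocks = ", ".join(f"P{p}: X{i}{j}" for p, i, j in assignments)
    print(f"Time slot {slot}: {blocks}")
```

Because the blocks in a slot share no rows of $Z_1$ and no columns of $Z_2$, the processors can update their factor blocks without locking, and every block of $X$ is covered after three slots.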

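(14)

REFORMULATION

[Figure: the entries of $Z_1$ and $Z_2$ are stacked into a single vector $z$]

$$\text{minimize } \|X - Z_1 Z_2\|_F^2 \quad \text{subject to } Z_1, Z_2 \in \mathcal{Z}$$

GENERIC PROBLEM

$$\text{minimize } \sum_{i \in \{1, \dots, m\}} f_i(z) \quad \text{subject to } z \in \zeta$$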
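One way to see why matrix factorization fits the generic form is that the squared Frobenius error splits over blocks of $X$. The sketch below checks this numerically, assuming each component $f_i$ is the error on one data block; the matrices and the $2 \times 2$ block layout are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def sq_error(X, P):
    """Sum of squared entrywise differences between X and P."""
    return sum((x - p) ** 2 for x_row, p_row in zip(X, P) for x, p in zip(x_row, p_row))

# Hypothetical 4 x 4 data matrix with rank-2 factors (illustration only).
X = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 3.0, 2.0],
     [2.0, 0.0, 1.0, 1.0],
     [1.0, 1.0, 0.0, 4.0]]
Z1 = [[1.0, 0.0], [0.5, 1.0], [1.0, 1.0], [0.0, 2.0]]   # 4 x 2
Z2 = [[1.0, 0.5, 0.0, 1.0], [0.0, 1.0, 1.0, 0.5]]       # 2 x 4
row_groups = [[0, 1], [2, 3]]
col_groups = [[0, 1], [2, 3]]

# One component function per data block: f_i(z) = ||X_block - Z1_block Z2_block||_F^2.
components = []
for rows in row_groups:
    for cols in col_groups:
        X_blk = [[X[r][c] for c in cols] for r in rows]
        Z1_blk = [Z1[r] for r in rows]
        Z2_blk = [[z_row[c] for c in cols] for z_row in Z2]
        components.append(sq_error(X_blk, matmul(Z1_blk, Z2_blk)))

full = sq_error(X, matmul(Z1, Z2))
print(sum(components), full)
```

Each component depends only on the rows of $Z_1$ and the columns of $Z_2$ that belong to its block, which is what the block schedule exploits.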

(15)

DISTRIBUTED OPTIMIZATION

[Figure: the $3 \times 3$ block schedule over time slots and processors P1, P2, P3, together with the stacking of $Z_1$ and $Z_2$ into the vector $z$]

$$\text{minimize } \sum_{i \in \{1, \dots, m\}} f_i(z) \quad \text{subject to } z \in \zeta$$

• At each time slot $k$, we solve over a subset $S_k$ of the component functions $f_i$, $i \in \{1, 2, \dots, m\}$
• We make sure that every data block has been visited after $c$ passes ($c = 3$ in the figure)

(16)

INCREMENTAL QUASI-NEWTON ALGORITHM

• Unlike gradient-based methods, the proposed algorithm uses second-order information through a Hessian approximation (L-BFGS quasi-Newton method)
• The proposed algorithm visits the subsets of component functions in the same order in every pass (incremental and deterministic)
• We do not assume convexity of the objective function (so matrix factorization is covered)

CORE STEP

Solve a quadratic approximation of the (partial) objective function:

$$Q_k^t(z) = (z - z_k)^\top \nabla f_{S_k}(z_k) + \tfrac{1}{2}(z - z_k)^\top H_t (z - z_k) + \tfrac{1}{2}\beta_t \|z - z_k\|^2.$$
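For the unconstrained case $\zeta = \mathbb{R}^n$, the quadratic model can be minimized in closed form. The short derivation below assumes that $H_t$ is symmetric and that $H_t + \beta_t I$ is positive definite, so the stationary point is the unique minimizer.

```latex
\begin{align*}
\nabla Q_k^t(z) &= \nabla f_{S_k}(z_k) + (H_t + \beta_t I)(z - z_k) = 0 \\
\Longrightarrow \quad z_{k+1} &= z_k - (H_t + \beta_t I)^{-1} \nabla f_{S_k}(z_k)
\end{align*}
```

Read this way, $\beta_t$ acts as a damping term: a larger $\beta_t$ shrinks the step taken along the quasi-Newton direction.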

(17)

INCREMENTAL QUASI-NEWTON ALGORITHM (CONT'D)

$$Q_k^t(z) = (z - z_k)^\top \nabla f_{S_k}(z_k) + \tfrac{1}{2}(z - z_k)^\top H_t (z - z_k) + \tfrac{1}{2}\beta_t \|z - z_k\|^2.$$

Algorithm 1: HAMSI
input: $y_0$, $\beta_1$
for $t = 0, 1, 2, \dots$ do
    $z_1 = y_t$
    Compute $H_t$
    for $k = 1, 2, \dots, c$ do
        Choose a subset $S_k \subset \{1, \dots, m\}$
        Compute $\nabla f_{S_k}(z_k)$
        $z_{k+1} = \arg\min_{z \in \zeta} Q_k^t(z)$
    end
    $y_{t+1} = z_{c+1}$
    Set $\beta_{t+1} \le \beta_t$
end
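The loop structure of Algorithm 1 can be sketched in a few lines of Python under strong simplifying assumptions: $\zeta = \mathbb{R}^n$, the L-BFGS matrix $H_t$ is replaced by a fixed scalar multiple of the identity, $\beta_t$ is held constant, and the component functions are toy least-squares terms. The names `grad_subset` and `hamsi_sketch` and all data are hypothetical, so this illustrates the incremental pattern rather than the authors' implementation.

```python
def grad_subset(z, subset, A, b):
    """Gradient of sum_{i in subset} 0.5 * (A[i] . z - b[i])**2."""
    g = [0.0] * len(z)
    for i in subset:
        r = sum(a * zj for a, zj in zip(A[i], z)) - b[i]
        for j, a in enumerate(A[i]):
            g[j] += r * a
    return g

def hamsi_sketch(A, b, subsets, y0, h=1.0, beta=1.0, outer_iters=50):
    """Incremental loop with H_t = h * I (assumption; the slides use L-BFGS)."""
    y = list(y0)
    for _ in range(outer_iters):
        z = list(y)                    # z_1 = y_t
        scale = 1.0 / (h + beta)       # (H_t + beta_t I)^{-1} for H_t = h * I
        for subset in subsets:         # k = 1, ..., c, always in the same order
            g = grad_subset(z, subset, A, b)
            # Closed-form minimizer of the quadratic model when zeta = R^n.
            z = [zj - scale * gj for zj, gj in zip(z, g)]
        y = z                          # y_{t+1} = z_{c+1}
        # beta is kept constant here, which is compatible with the update rule.
    return y

# Toy consistent system whose solution is z = (1, 2) (illustration only).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
b = [1.0, 2.0, 3.0, -1.0]
subsets = [[0, 1], [2, 3]]

y = hamsi_sketch(A, b, subsets, y0=[0.0, 0.0])
print(y)
```

A faithful version would rebuild $H_t$ from L-BFGS curvature pairs at each outer iteration and would handle the constraint set $\zeta$ in the inner minimization.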

(18)

CONVERGENCE ANALYSIS (ζ = Rⁿ)

ASSUMPTIONS

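1. Hessians of the component functions and $(H_t + \beta_t I)$ are uniformly bounded:

$$\Big\| \sum_{i \in S_k} \nabla^2 f_i(y_t) \Big\| \le L_t \le L \quad \forall S_k, y_t.$$

2. The smallest eigenvalue of $(H_t + \beta_t I)$ is bounded away from zero:

$$U_t \le \|(H_t + \beta_t I)^{-1}\| \le M_t \quad \forall t.$$

3. The gradient norms are uniformly bounded:

$$\|\nabla f_{S_k}(y_t)\| \le C \quad \forall S_k, y_t.$$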

(19)

CONVERGENCE ANALYSIS (CONT'D)

LEMMA

At each outer iteration $t$ of Algorithm 1 and for $k = 1, \dots, c$, we have

$$\delta_k = \|\nabla f_{S_k}(z_k) - \nabla f_{S_k}(y_t)\| \le L_t M_t \sum_{j=1}^{k-1} (1 + L_t M_t)^{k-1-j}\, \|\nabla f_{S_j}(y_t)\|.$$

THEOREM

Consider the iterates $y_t$ produced by Algorithm 1. Then, all accumulation points of $\{y_t\}$ are stationary points of the generic problem.

COROLLARY

Algorithm 1 solves the matrix factorization problem.


(21)

PRELIMINARY EXPERIMENTS - SETUP

• Linux cluster with 15 nodes
• Each node has eight Intel Xeon 2.50 GHz processors and 16 GB RAM
• This setting allows 120 tasks to run in parallel (15 nodes × 8 processors)
• The MovieLens 1M data set is used for our preliminary experiments

(22)

PRELIMINARY EXPERIMENTS

FIGURE: Objective function values

(23)

PRELIMINARY EXPERIMENTS (CONT'D)

FIGURE: Root mean square error

(24)

CONCLUDING REMARKS

SUMMARY

• A promising research path at the intersection of operations research and computer science
• A new distributed and parallel implementation for matrix factorization
• A generic analysis that could be used for showing convergence of other algorithms

FUTURE RESEARCH

• Extensive computational study
• Stochastic version of the proposed algorithm
• Quasi-Newton-based Bayesian inference
