
BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION



Ş. İlker Birbil

Sabancı University

Ali Taylan Cemgil¹, Hazal Koptagel¹, Figen Öztoprak², Umut Şimşekli¹ (1: Boğaziçi University, 2: Bilgi University)

Nottingham University, March 2015

Ş. İlker Birbil (Sabancı University) Big Data Optimization 1 / 22

LARGE-SCALE OPTIMIZATION AND MACHINE LEARNING

[Figure by F. Öztoprak; section headings visible in the figure: Introduction, Exploiting the Structure, Need for Parallel Algorithms]

DATA SCIENCE

GRADUATE COURSES

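NONLINEAR OPTIMIZATION

Typically, a nonlinear optimization problem is defined as

\[
\begin{aligned}
\underset{x \in \mathbb{R}^n}{\text{minimize}} \quad & f(x) \\
\text{subject to} \quad & c_i(x) = 0, \quad i \in \mathcal{E}, \\
& c_i(x) \ge 0, \quad i \in \mathcal{I},
\end{aligned}
\tag{1}
\]

where $f : \mathbb{R}^n \to \mathbb{R}$ is the objective function and $c_i : \mathbb{R}^n \to \mathbb{R}$ for $i \in \mathcal{E} \cup \mathcal{I}$ are the constraint functions. At least one of these functions is nonlinear.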

[Embedded slide by F. Öztoprak, "Nonlinear Programming (NLP) Problem": covers optimization problems $\min_{x \in X} f(x)$, where the feasible set $X \subseteq \mathbb{R}^n$ is defined through a constraint function $g : \mathbb{R}^n \to \mathbb{R}^m$, and the functions $g$ and $f : \mathbb{R}^n \to \mathbb{R}$ are continuous and not necessarily linear. The accompanying figure marks a point $x^*$.]

ROLE OF NONLINEAR OPTIMIZATION

[Figure by F. Öztoprak: core NLP at the center, linked to neighboring disciplines (Statistics, Computer Science, Applied Mathematics, Operations Research), to NLP subfields (large-scale, PDE-constrained, global, derivative-free, convex, mixed-integer nonlinear and stochastic programming), and to application areas (machine learning: image recovery; molecular biology: protein folding; finance: risk management; health: cancer treatment; engineering design: machining; production: chemical complex design).]

OUR RESEARCH GROUP

• Three faculty members, four PhD students, three MSc students
• (Coupled) tensor or matrix factorization
• Distributed and parallel algorithms:
    • Bayesian inference
    • Nonlinear optimization

[Figure: three processors (Processor 1, 2, 3), each with its own memory and four cores.]


LINK PREDICTION VIA TENSOR FACTORIZATION

$X_1(i, j, k)$: whether user $i$ visits location $j$ and performs activity $k$

$X_2(i, m)$: frequency of user $i$ visiting location $m$

$X_3(j, n)$: points of interest for location $j$

TENSOR FACTORIZATION

• Tensor ≡ multidimensional array ($X_{i,j,k,\dots}$)
• Extension of matrix factorizations to higher-order tensors
• Used to extract the underlying factors in higher-order data sets

\[
X(i, j, k) \approx \sum_{r} Z_1(i, r)\, Z_2(j, r)\, Z_3(k, r)
\]

[Embedded slide: Cemgil, Probabilistic Latent Tensor Factorisation, IFG19, Sabancı University]
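As a rough illustration of the factorization model above, the following sketch builds a small three-way array from three factor matrices with NumPy and checks one entry against the explicit sum over $r$. The sizes, the rank and the random factors are assumptions made up for the example; they do not come from the talk.

```python
import numpy as np

def cp_reconstruct(Z1, Z2, Z3):
    """Return X with X[i, j, k] = sum_r Z1[i, r] * Z2[j, r] * Z3[k, r]."""
    return np.einsum("ir,jr,kr->ijk", Z1, Z2, Z3)

rng = np.random.default_rng(0)
I, J, K, R = 4, 3, 2, 2          # illustrative sizes and rank (assumed)
Z1 = rng.random((I, R))
Z2 = rng.random((J, R))
Z3 = rng.random((K, R))

X = cp_reconstruct(Z1, Z2, Z3)

# Compare one entry with the explicit sum over the latent index r.
i, j, k = 1, 2, 0
explicit = sum(Z1[i, r] * Z2[j, r] * Z3[k, r] for r in range(R))
print(X.shape, np.isclose(X[i, j, k], explicit))
```

In a factorization task the direction is reversed: $X$ is given and the factors are the unknowns to be estimated.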

MATRIX FACTORIZATION

An inverse problem: estimate $Z_1$ and $Z_2$ given the data matrix $X$, assuming $X \approx Z_1 Z_2$:

\[
X(\nu, \tau) \approx \sum_{i} Z_1(\nu, i)\, Z_2(i, \tau)
\]

• Minimize a suitable error function subject to constraints (e.g., nonnegativity, orthogonality):

\[
(Z_1, Z_2) = \arg\min_{Z_1, Z_2} \; D(X \,\|\, Z_1 Z_2) + R(Z_1, Z_2)
\]

[Embedded slide: Cemgil, Probabilistic Latent Tensor Factorisation, IFG19, Sabancı University]

Overall optimization problem

\[
\begin{aligned}
\text{minimize} \quad & \|X - Z_1 Z_2\|_F^2 \\
\text{subject to} \quad & Z_1, Z_2 \in \mathcal{Z},
\end{aligned}
\]

where $\mathcal{Z}$ is the feasible region. When $\mathcal{Z}$ is the first (nonnegative) orthant, we have the nonnegative matrix factorization problem.
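To make the nonnegative case concrete, here is a minimal sketch that minimizes $\|X - Z_1 Z_2\|_F^2$ with plain projected gradient descent. This is a textbook baseline chosen for brevity, not the algorithm proposed in the talk; the function names, step size, iteration count and synthetic data are all assumptions.

```python
import numpy as np

def objective(X, Z1, Z2):
    """Squared Frobenius norm of the residual X - Z1 @ Z2."""
    return np.linalg.norm(X - Z1 @ Z2, "fro") ** 2

def projected_gradient_nmf(X, rank, steps=200, lr=1e-2, seed=0):
    """Projected gradient descent on ||X - Z1 Z2||_F^2 with Z1, Z2 >= 0."""
    rng = np.random.default_rng(seed)
    Z1 = rng.random((X.shape[0], rank))
    Z2 = rng.random((rank, X.shape[1]))
    history = [objective(X, Z1, Z2)]
    for _ in range(steps):
        R = Z1 @ Z2 - X                    # residual
        G1 = 2.0 * R @ Z2.T                # gradient with respect to Z1
        G2 = 2.0 * Z1.T @ R                # gradient with respect to Z2
        # Gradient step, then projection onto the nonnegative orthant.
        Z1 = np.maximum(Z1 - lr * G1, 0.0)
        Z2 = np.maximum(Z2 - lr * G2, 0.0)
        history.append(objective(X, Z1, Z2))
    return Z1, Z2, history

# Synthetic nonnegative data of rank 2 (assumed, for illustration only).
rng = np.random.default_rng(1)
X = rng.random((6, 2)) @ rng.random((2, 5))
Z1, Z2, history = projected_gradient_nmf(X, rank=2)
print(history[0], history[-1])
```

The projection is a simple elementwise clamp because the feasible region is an orthant; other choices of $\mathcal{Z}$ would need a different projection.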

MOVIE RECOMMENDATION

\[
\begin{aligned}
\text{minimize} \quad & \|X - Z_1 Z_2\|_F^2 \\
\text{subject to} \quad & Z_1 \ge 0, \; Z_2 \ge 0
\end{aligned}
\]

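DISTRIBUTED IMPLEMENTATION

[Figure: the data matrix X is partitioned into 3 × 3 blocks X11, ..., X33; Z1 is split into row blocks Z1(1,:), Z1(2,:), Z1(3,:) and Z2 into column blocks Z2(:,1), Z2(:,2), Z2(:,3). The blocks are processed in groups over successive time slots (Time Slot 1, 2, 3, 4, ...): {X12, X23, X31}, {X11, X22, X33} and {X13, X21, X32}. The blocks within a group share no row block of Z1 and no column block of Z2.]

In one time slot, the factorizations X12 ≈ Z1(1,:) Z2(:,2), X23 ≈ Z1(2,:) Z2(:,3) and X31 ≈ Z1(3,:) Z2(:,1) are each performed on a separate processor (P1, P2, P3) by employing IPA.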
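The grouping of blocks in the figure can be generated with a cyclic shift, as in the sketch below. The function name is hypothetical, and which group is assigned to which time slot is an assumption; only the three groups themselves are taken from the figure.

```python
def block_schedule(num_blocks):
    """For each shift s, list the (row_block, col_block) pairs processed in parallel.

    Shift s pairs row block b with column block (b + s) % num_blocks, so no two
    blocks in the same group touch the same part of Z1 or of Z2.
    """
    return [[(b, (b + s) % num_blocks) for b in range(num_blocks)]
            for s in range(num_blocks)]

schedule = block_schedule(3)
for s, blocks in enumerate(schedule):
    # Use 1-based indices to match the X11, X12, ... labels.
    labels = ["X%d%d" % (r + 1, c + 1) for r, c in blocks]
    print("shift", s, labels)
# shift 0 ['X11', 'X22', 'X33']
# shift 1 ['X12', 'X23', 'X31']
# shift 2 ['X13', 'X21', 'X32']
```

Because the blocks in a group involve disjoint parts of the factors, the processors can update them at the same time without write conflicts.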

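REFORMULATION

[Figure: the blocks of Z1 and Z2, numbered 1 to 6, are stacked into a single vector z.]

\[
\begin{aligned}
\text{minimize} \quad & \|X - Z_1 Z_2\|_F^2 \\
\text{subject to} \quad & Z_1, Z_2 \in \mathcal{Z}
\end{aligned}
\]

GENERIC PROBLEM

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i \in \{1, \dots, m\}} f_i(z) \\
\text{subject to} \quad & z \in \zeta
\end{aligned}
\]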
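One way to read the reformulation is that each component function $f_i$ is the squared error on a single data block, so that the block errors add up to the full objective. The sketch below checks this numerically. The sizes, the 3 × 3 partition (giving $m = 9$) and the helper name `f_block` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, rank, B = 6, 9, 2, 3      # illustrative sizes (assumed)
X = rng.random((n_rows, n_cols))
Z1 = rng.random((n_rows, rank))
Z2 = rng.random((rank, n_cols))

# Index sets of the B row blocks and B column blocks.
row_parts = np.array_split(np.arange(n_rows), B)
col_parts = np.array_split(np.arange(n_cols), B)

def f_block(rows, cols):
    """Component function: squared error on one data block."""
    resid = X[np.ix_(rows, cols)] - Z1[rows, :] @ Z2[:, cols]
    return np.sum(resid ** 2)

total = np.linalg.norm(X - Z1 @ Z2, "fro") ** 2
blockwise = sum(f_block(r, c) for r in row_parts for c in col_parts)
print(np.isclose(total, blockwise))
```

Each `f_block` depends only on one row block of `Z1` and one column block of `Z2`, which is the structure a block-wise distributed scheme can exploit.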

DISTRIBUTED OPTIMIZATION

[Figure: the block schedule and the stacked vector z from the two preceding slides, shown together with the generic problem.]

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i \in \{1, \dots, m\}} f_i(z) \\
\text{subject to} \quad & z \in \zeta
\end{aligned}
\]

• At each time slot $k$, we work on a subset $S_k$ of the component functions $f_i$, $i \in \{1, 2, \dots, m\}$
• We make sure that every data block has been visited after $c$ passes ($c = 3$ in the figure)

INCREMENTAL QUASI-NEWTON ALGORITHM

• Unlike gradient-based methods, the proposed algorithm uses second-order information through a Hessian approximation (L-BFGS quasi-Newton method)
• The proposed algorithm visits the subsets of component functions in the same order in every pass (incremental and deterministic)
• We do not assume convexity of the objective function (so the matrix factorization problem is covered)

CORE STEP

Solve a quadratic approximation of the (partial) objective function:

\[
Q_t^k(z) = (z - z_k)^\top \nabla_{S_k} f(z_k) + \frac{1}{2} (z - z_k)^\top H_t (z - z_k) + \frac{1}{2} \beta_t \|z - z_k\|^2 .
\]
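For the unconstrained case $\zeta = \mathbb{R}^n$, the minimizer of this quadratic model has a closed form. The short derivation below assumes that $H_t$ is symmetric and $H_t + \beta_t I$ is positive definite, and it reads the last term of the model as $\tfrac{1}{2}\beta_t \|z - z_k\|^2$, which matches the matrix $H_t + \beta_t I$ that appears in the convergence assumptions.

```latex
\begin{align*}
Q_t^k(z) &= (z - z_k)^\top \nabla_{S_k} f(z_k)
          + \tfrac{1}{2} (z - z_k)^\top H_t (z - z_k)
          + \tfrac{1}{2} \beta_t \|z - z_k\|^2, \\
\nabla Q_t^k(z) &= \nabla_{S_k} f(z_k) + (H_t + \beta_t I)(z - z_k) = 0 \\
\Longrightarrow \quad z_{k+1} &= z_k - (H_t + \beta_t I)^{-1} \nabla_{S_k} f(z_k).
\end{align*}
```

So each inner step is a quasi-Newton step on the partial gradient, with $\beta_t$ acting as a regularization of the Hessian approximation.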

INCREMENTAL QUASI-NEWTON ALGORITHM (CONT'D)

Algorithm 1: HAMSI

input: initial point $y_0$, initial value of $\beta$
for $t = 0, 1, 2, \dots$ do
    $z_1 = y_t$
    Compute $H_t$
    for $k = 1, 2, \dots, c$ do
        Choose a subset $S_k \subset \{1, \dots, m\}$
        Compute $\nabla_{S_k} f(z_k)$
        $z_{k+1} = \arg\min_{z \in \zeta} Q_t^k(z)$
    end
    $y_{t+1} = z_{c+1}$
    Set $\beta_{t+1} \le \beta_t$
end
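The loop structure of Algorithm 1 can be sketched in a few lines for the unconstrained case. This is not the authors' implementation: $H_t$ is fixed to the identity as a placeholder for the L-BFGS approximation, $\beta$ is kept constant (which satisfies $\beta_{t+1} \le \beta_t$), and the least-squares test problem, the function names and the choice of $\beta$ are all assumptions for illustration.

```python
import numpy as np

def hamsi_sketch(grad_subset, subsets, y0, beta, outer_iters=200):
    """Unconstrained sketch of the outer/inner loops of Algorithm 1.

    grad_subset(S, z) returns the gradient of sum_{i in S} f_i at z.
    """
    y = np.asarray(y0, dtype=float)
    n = y.size
    for t in range(outer_iters):
        z = y.copy()
        H = np.eye(n)                    # placeholder for "Compute H_t"
        M = H + beta * np.eye(n)
        for S in subsets:                # same order in every outer iteration
            g = grad_subset(S, z)
            z = z - np.linalg.solve(M, g)    # minimizer of the quadratic model
        y = z
    return y

# Hypothetical test problem: f_i(z) = 0.5 * ||A[i] z - b[i]||^2, consistent data.
rng = np.random.default_rng(0)
m, rows_per_f, n = 6, 10, 3
A = rng.standard_normal((m, rows_per_f, n))
z_true = rng.standard_normal(n)
b = A @ z_true                           # shape (m, rows_per_f)

def grad_subset(S, z):
    return sum(A[i].T @ (A[i] @ z - b[i]) for i in S)

subsets = [(0, 1), (2, 3), (4, 5)]       # c = 3 subsets of component functions
# Choose beta from the largest subset curvature so that every step is stable.
beta = max(np.linalg.norm(np.vstack([A[i] for i in S]), 2) ** 2 for S in subsets)
y = hamsi_sketch(grad_subset, subsets, np.zeros(n), beta=beta)
print(np.linalg.norm(y - z_true))
```

With the identity in place of $H_t$, each inner step reduces to a scaled gradient step; substituting a genuine quasi-Newton approximation changes only the line that builds `H`.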

CONVERGENCE ANALYSIS ($\zeta = \mathbb{R}^n$)

ASSUMPTIONS

1. Hessians of the component functions and $(H_t + \beta_t I)$ are uniformly bounded:

\[
\Big\| \sum_{i \in S_k} \nabla^2 f_i(y_t) \Big\| \le L_t \le L \quad \forall S_k, y_t .
\]

2. The smallest eigenvalue of $(H_t + \beta_t I)$ is bounded away from zero:

\[
U_t \le \| (H_t + \beta_t I)^{-1} \| \le M_t \quad \forall t .
\]

3. The gradient norms are uniformly bounded:

\[
\| \nabla_{S_k} f(y_t) \| \le C \quad \forall S_k, y_t .
\]

CONVERGENCE ANALYSIS (CONT'D)

LEMMA

At each outer iteration $t$ of Algorithm 1 and for $k = 1, \dots, c$, we have

\[
\delta_k = \| \nabla_{S_k} f(z_k) - \nabla_{S_k} f(y_t) \| \le L_t M_t \sum_{j=1}^{k-1} (1 + L_t M_t)^{k-1-j} \, \| \nabla_{S_j} f(y_t) \| .
\]

THEOREM

Consider the iterates $y_t$ produced by Algorithm 1. Then, all accumulation points of $\{y_t\}$ are stationary points of the generic problem.

COROLLARY

Algorithm 1 solves the matrix factorization problem.
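To see what the lemma gives in concrete cases, the bound can be evaluated for small $k$ and then combined with Assumption 3. The lines below are a direct calculation from the stated inequality, assuming $L_t M_t > 0$ and using only the geometric sum formula.

```latex
\begin{align*}
k = 1: &\quad \delta_1 \le 0 \quad \text{(empty sum; indeed } z_1 = y_t\text{)}, \\
k = 2: &\quad \delta_2 \le L_t M_t \, \|\nabla_{S_1} f(y_t)\|, \\
\text{general } k: &\quad \delta_k \le L_t M_t \, C \sum_{j=1}^{k-1} (1 + L_t M_t)^{k-1-j}
   = C \left[ (1 + L_t M_t)^{k-1} - 1 \right].
\end{align*}
```

The last line shows that the gradient error accumulated within one pass is controlled by the product $L_t M_t$ and by the number of inner steps.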


PRELIMINARY EXPERIMENTS - SETUP

• Linux cluster with 15 nodes
• Each node has 8 Intel Xeon 2.50 GHz processors with 16 GB RAM
• This setting allows 120 tasks to run in parallel (15 × 8)
• The MovieLens 1M data set is used for our preliminary experiments

PRELIMINARY EXPERIMENTS

FIGURE: Objective function values

PRELIMINARY EXPERIMENTS (CONT'D)

FIGURE: Root mean square error

CONCLUDING REMARKS

SUMMARY

• A promising research path at the intersection of operations research and computer science
• A new distributed and parallel implementation for matrix factorization
• A generic analysis that could be used for showing convergence of other algorithms

FUTURE RESEARCH

• Extensive computational study
• Stochastic version of the proposed algorithm
• Quasi-Newton-based Bayesian inference
