BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION
Ş. İlker Birbil
Sabancı University
Ali Taylan Cemgil (1), Hazal Koptagel (1), Figen Öztoprak (2), Umut Şimşekli (1)
(1) Boğaziçi University, (2) Bilgi University
Nottingham University, March 2015
Ş. İlker Birbil (Sabancı University) Big Data Optimization 1 / 22
LARGE-SCALE OPTIMIZATION AND MACHINE LEARNING
[Figure; credit: F. Öztoprak]
DATA SCIENCE
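GRADUATE COURSES: NONLINEAR OPTIMIZATION
Typically, a nonlinear optimization problem is defined as
\[
\begin{aligned}
\underset{x \in \mathbb{R}^n}{\text{minimize}} \quad & f(x) \\
\text{subject to} \quad & c_i(x) = 0, \quad i \in \mathcal{E}, \\
& c_i(x) \ge 0, \quad i \in \mathcal{I},
\end{aligned}
\tag{1}
\]
where $f : \mathbb{R}^n \to \mathbb{R}$ is the objective function and $c_i : \mathbb{R}^n \to \mathbb{R}$ for $i \in \mathcal{E} \cup \mathcal{I}$ are the constraint functions. At least one of these functions is nonlinear.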
Nonlinear Programming (NLP) Problem
Covers optimization problems
\[ \min_{x \in X} f(x) \]
where $X = \{x \in \mathbb{R}^n : g(x) \le 0\}$, and the functions $g : \mathbb{R}^n \to \mathbb{R}^m$ and $f : \mathbb{R}^n \to \mathbb{R}$ are continuous and not necessarily linear.
[Figure showing $x^*$; credit: F. Öztoprak]
ROLE OF NONLINEAR OPTIMIZATION
[Figure: diagram centered on "Core NLP"; credit: F. Öztoprak]
Disciplines: Statistics, Computer Science, Applied Mathematics, Operations Research
Application areas: Machine Learning (Image Recovery), Molecular Biology (Protein Folding), Finance (Risk Management), Health (Cancer Treatment), Engineering Design (Machining), Production (Chemical Complex Design)
Other labels: Large Scale Optimization, PDE-Constrained Optimization, Global Optimization, Derivative-Free Optimization, Convex NLP, Mixed Integer Nonlinear, Stochastic Prog.
OUR RESEARCH GROUP
• Three faculty members, four PhD students, three MSc students
• (Coupled) Tensor or matrix factorization
• Distributed and parallel algorithms:
  • Bayesian inference
  • Nonlinear optimization
[Figure: Processors 1 to 3, each with its own memory (Memory 1 to 3) and four cores (Core 1 to Core 4)]
LINK PREDICTION VIA TENSOR FACTORIZATION
• $X_1(i, j, k)$: indicates whether user $i$ visits location $j$ and performs activity $k$
• $X_2(i, m)$: frequency of user $i$ visiting location $m$
• $X_3(j, n)$: points of interest for location $j$
TENSOR FACTORIZATION
• Tensor ≡ multidimensional array ($X_{i,j,k,\dots}$)
• Extension of matrix factorizations to higher-order tensors
• Used to extract the underlying factors in higher-order data sets
\[ X(i, j, k) \approx \sum_{r} Z_1(i, r)\, Z_2(j, r)\, Z_3(k, r) \]
[Source: Cemgil, Probabilistic Latent Tensor Factorisation, IFG19, Sabanci University]
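The factorization model above can be illustrated with a few lines of plain Python. This is only a minimal sketch, assuming small dense factors stored as nested lists; the function name `cp_reconstruct` and the toy factor values are hypothetical and do not come from the talk.

```python
def cp_reconstruct(Z1, Z2, Z3):
    """Return X_hat with X_hat[i][j][k] = sum_r Z1[i][r] * Z2[j][r] * Z3[k][r]."""
    R = len(Z1[0])  # number of latent factors (assumed equal for Z1, Z2, Z3)
    return [[[sum(Z1[i][r] * Z2[j][r] * Z3[k][r] for r in range(R))
              for k in range(len(Z3))]
             for j in range(len(Z2))]
            for i in range(len(Z1))]


# Toy factors (hypothetical values): 2 users, 3 locations, 2 activities, R = 2.
Z1 = [[1.0, 0.0], [0.5, 2.0]]
Z2 = [[1.0, 1.0], [0.0, 3.0], [2.0, 0.0]]
Z3 = [[1.0, 0.5], [2.0, 1.0]]
X_hat = cp_reconstruct(Z1, Z2, Z3)
```

Each entry of `X_hat` is a sum over the shared factor index `r`, which is what couples the three modes of the tensor.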
MATRIX FACTORIZATION
An inverse problem: estimate $Z_1$ and $Z_2$ given the data matrix $X$, assuming $X \approx Z_1 Z_2$, that is,
\[ X(\nu, \tau) \approx \sum_{i} Z_1(\nu, i)\, Z_2(i, \tau). \]
• Minimise a suitable error function subject to constraints (e.g., nonnegativity, orthogonality):
\[ (Z_1, Z_2)^* = \arg\min_{Z_1, Z_2} \; D(X \,\|\, Z_1 Z_2) + R(Z_1, Z_2) \]
Overall optimization problem
\[ \text{minimize } \|X - Z_1 Z_2\|_F^2 \quad \text{subject to } Z_1, Z_2 \in \mathcal{Z}, \]
where $\mathcal{Z}$ is the feasible region. When $\mathcal{Z}$ is the first orthant (all entries nonnegative), we have the nonnegative matrix factorization problem.
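To make the nonnegative case concrete, the sketch below evaluates the objective $\|X - Z_1 Z_2\|_F^2$ and applies a plain projected gradient step. Projected gradient is used here only as a simple stand-in and is not the algorithm proposed in the talk; the helper names, the step size and the toy data are all assumptions.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(X, Z1, Z2):
    """Objective value ||X - Z1 Z2||_F^2."""
    P = matmul(Z1, Z2)
    return sum((x - p) ** 2 for rx, rp in zip(X, P) for x, p in zip(rx, rp))


def projected_gradient_step(X, Z1, Z2, step):
    """One gradient step on both factors, then projection onto Z1, Z2 >= 0."""
    P = matmul(Z1, Z2)
    R = [[x - p for x, p in zip(rx, rp)] for rx, rp in zip(X, P)]  # residual X - Z1 Z2
    G1 = matmul(R, transpose(Z2))  # gradient w.r.t. Z1 is -2 * G1
    G2 = matmul(transpose(Z1), R)  # gradient w.r.t. Z2 is -2 * G2
    new_Z1 = [[max(0.0, z + 2.0 * step * g) for z, g in zip(rz, rg)]
              for rz, rg in zip(Z1, G1)]
    new_Z2 = [[max(0.0, z + 2.0 * step * g) for z, g in zip(rz, rg)]
              for rz, rg in zip(Z2, G2)]
    return new_Z1, new_Z2


# Toy rank-1 data (hypothetical values).
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
Z1 = [[1.0], [1.0], [1.0]]
Z2 = [[1.0, 1.0]]
initial_error = frobenius_error(X, Z1, Z2)
for _ in range(200):
    Z1, Z2 = projected_gradient_step(X, Z1, Z2, step=0.01)
final_error = frobenius_error(X, Z1, Z2)
```

The projection is the elementwise `max(0.0, ...)`, which is exactly the constraint $Z_1 \ge 0$, $Z_2 \ge 0$ of the nonnegative problem.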
MOVIE RECOMMENDATION
\[ \text{minimize } \|X - Z_1 Z_2\|_F^2 \quad \text{subject to } Z_1 \ge 0, \; Z_2 \ge 0 \]
DISTRIBUTED IMPLEMENTATION
[Figure: the data matrix $X$ split into 3 × 3 blocks, with factor blocks $Z_1(1,:)$, $Z_1(2,:)$, $Z_1(3,:)$ and $Z_2(:,1)$, $Z_2(:,2)$, $Z_2(:,3)$. The block groups $\{X_{11}, X_{22}, X_{33}\}$, $\{X_{12}, X_{23}, X_{31}\}$ and $\{X_{13}, X_{21}, X_{32}\}$ are processed in successive time slots (Time Slot 1, 2, 3, 4, ...).]
Perform $X_{12} = Z_1(1,:)\,Z_2(:,2)$ on P1, $X_{23} = Z_1(2,:)\,Z_2(:,3)$ on P2, and $X_{31} = Z_1(3,:)\,Z_2(:,1)$ on P3 by employing IPA.
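The block groups listed on this slide follow a cyclic pattern: within one time slot no two blocks share a row block or a column block, so the processors touch disjoint parts of $Z_1$ and $Z_2$. The sketch below generates such a schedule. The function name `block_schedule`, the choice of the diagonal blocks for the first slot, and the rule that processor $p$ keeps row block $p$ are assumptions made for illustration.

```python
def block_schedule(num_blocks, time_slot):
    """Return the (row, column) blocks, 1-indexed, handled in one time slot.

    Entry p-1 of the result is the block assigned to processor p.
    ASSUMPTION: processor p always keeps row block p, and the column block
    is shifted cyclically by one position from one time slot to the next.
    """
    shift = (time_slot - 1) % num_blocks
    return [(p, (p - 1 + shift) % num_blocks + 1) for p in range(1, num_blocks + 1)]


for slot in range(1, 5):
    print("Time Slot", slot, block_schedule(3, slot))
```

With three blocks per dimension, the second time slot yields the blocks (1, 2), (2, 3) and (3, 1), and the fourth slot repeats the first.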
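REFORMULATION
[Figure: the entries of $Z_1$ and $Z_2$ are numbered and stacked into a single vector $z$]
\[ \text{minimize } \|X - Z_1 Z_2\|_F^2 \quad \text{subject to } Z_1, Z_2 \in \mathcal{Z} \]
GENERIC PROBLEM
\[ \text{minimize } \sum_{i \in \{1, \dots, m\}} f_i(z) \quad \text{subject to } z \in \zeta \]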
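The reformulation can be sketched as follows: the two factors are flattened into one vector $z$, and the squared Frobenius error is split into component functions whose sum is the original objective. In this sketch there is one component per matrix entry, which is an assumption (the slides group entries into data blocks), and the helper names `stack`, `unstack` and `component` are hypothetical.

```python
def stack(Z1, Z2):
    """Stack the rows of Z1, then the rows of Z2, into one flat vector z."""
    return [v for row in Z1 for v in row] + [v for row in Z2 for v in row]


def unstack(z, n_rows, rank, n_cols):
    """Inverse of stack: recover Z1 (n_rows x rank) and Z2 (rank x n_cols)."""
    Z1 = [z[i * rank:(i + 1) * rank] for i in range(n_rows)]
    offset = n_rows * rank
    Z2 = [z[offset + r * n_cols:offset + (r + 1) * n_cols] for r in range(rank)]
    return Z1, Z2


def component(z, X, i, j, rank):
    """Component function (X[i][j] - sum_r Z1[i][r] * Z2[r][j]) ** 2."""
    Z1, Z2 = unstack(z, len(X), rank, len(X[0]))
    prediction = sum(Z1[i][r] * Z2[r][j] for r in range(rank))
    return (X[i][j] - prediction) ** 2


# Toy data (hypothetical values).
X = [[1.0, 2.0], [3.0, 4.0]]
Z1 = [[1.0], [2.0]]
Z2 = [[1.0, 1.0]]
z = stack(Z1, Z2)
total = sum(component(z, X, i, j, rank=1) for i in range(2) for j in range(2))
```

Summing all components reproduces $\|X - Z_1 Z_2\|_F^2$, so the matrix problem is an instance of the generic problem with $m$ equal to the number of components.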
DISTRIBUTED OPTIMIZATION
[Figure: the block schedule and the stacked vector $z$ from the two preceding slides, shown together]
\[ \text{minimize } \sum_{i \in \{1, \dots, m\}} f_i(z) \quad \text{subject to } z \in \zeta \]
• At each time slot $k$, we work with a subset $S_k$ of the component functions $f_i$, $i \in \{1, 2, \dots, m\}$
• We make sure that each data block is visited after $c$ passes ($c = 3$ in the figure)
INCREMENTAL QUASI-NEWTON ALGORITHM
• Unlike gradient-based methods, the proposed algorithm uses second-order information through a Hessian approximation (L-BFGS quasi-Newton method)
• The proposed algorithm visits each subset of component functions in the same order (incremental and deterministic)
• We do not assume convexity of the function (so matrix factorization can be solved)
CORE STEP
Solve a quadratic approximation of the (partial) objective function:
\[ Q_k^t(z) = (z - z_k)^\top \nabla_{S_k} f(z_k) + \tfrac{1}{2} (z - z_k)^\top H_t (z - z_k) + \tfrac{1}{2} \beta_t \|z - z_k\|^2. \]
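For the unconstrained case the core step has a closed form. The derivation below is a sketch under two assumptions that the slide does not state at this point: $\zeta = \mathbb{R}^n$, and $H_t + \beta_t I$ is positive definite so that the quadratic has a unique minimizer.

```latex
\nabla Q_k^t(z) = \nabla_{S_k} f(z_k) + (H_t + \beta_t I)(z - z_k) = 0
\quad \Longrightarrow \quad
z_{k+1} = z_k - (H_t + \beta_t I)^{-1} \nabla_{S_k} f(z_k).
```

Read this way, the step is a quasi-Newton step on the partial gradient, with $\beta_t$ acting as a damping term added to the Hessian approximation.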
INCREMENTAL QUASI-NEWTON ALGORITHM (CONT'D)
Algorithm 1: HAMSI
input: $y_0$, $\beta_1$
for $t = 0, 1, 2, \dots$ do
    $z_1 = y_t$
    Compute $H_t$
    for $k = 1, 2, \dots, c$ do
        Choose a subset $S_k \subset \{1, \dots, m\}$
        Compute $\nabla_{S_k} f(z_k)$
        $z_{k+1} = \arg\min_{z \in \zeta} Q_k^t(z)$
    end
    $y_{t+1} = z_{c+1}$
    Set $\beta_{t+1} \le \beta_t$
end
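The loop structure of Algorithm 1 can be sketched on a toy least-squares problem. This is not the authors' implementation: it assumes $\zeta = \mathbb{R}^n$, component functions $f_i(z) = \tfrac{1}{2}(a_i^\top z - b_i)^2$, a diagonal $H_t$ in place of the L-BFGS approximation, and a decay factor of 0.9 for $\beta_t$. The function names and the data are hypothetical.

```python
def subset_gradient(z, A, b, subset):
    """Gradient of sum_{i in subset} 0.5 * (A[i] . z - b[i]) ** 2 at z."""
    grad = [0.0] * len(z)
    for i in subset:
        residual = sum(a * v for a, v in zip(A[i], z)) - b[i]
        for d, a in enumerate(A[i]):
            grad[d] += a * residual
    return grad


def hamsi_sketch(A, b, subsets, y0, beta1, outer_iters):
    """Outer and inner loops of Algorithm 1 with zeta = R^n and diagonal H_t."""
    y, beta = list(y0), beta1
    for _ in range(outer_iters):
        z = list(y)  # z_1 = y_t
        # ASSUMPTION: H_t is the diagonal of the full Hessian sum_i a_i a_i^T,
        # a simple stand-in for the L-BFGS approximation named in the slides.
        H = [sum(row[d] ** 2 for row in A) for d in range(len(z))]
        for subset in subsets:  # k = 1, ..., c, always in the same order
            g = subset_gradient(z, A, b, subset)  # gradient over S_k at z_k
            # Closed-form minimizer of the quadratic model for diagonal H_t.
            z = [zd - gd / (hd + beta) for zd, gd, hd in zip(z, g, H)]
        y = z  # y_{t+1} = z_{c+1}
        beta = 0.9 * beta  # ASSUMPTION: one way to keep beta_{t+1} <= beta_t
    return y


# Toy consistent system (hypothetical data) whose solution is z = (1, 2).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
b = [1.0, 2.0, 3.0, -1.0]
subsets = [[0, 2], [1, 3]]  # S_1 and S_2, so c = 2
y_final = hamsi_sketch(A, b, subsets, y0=[0.0, 0.0], beta1=1.0, outer_iters=100)
```

Because the toy system is consistent, every component function is minimized at the same point, which keeps the example short; noisy data would call for more care in choosing $\beta_t$.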
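CONVERGENCE ANALYSIS ($\zeta = \mathbb{R}^n$)
ASSUMPTIONS
1. Hessians of the component functions and $(H_t + \beta_t I)$ are uniformly bounded:
\[ \Big\| \sum_{i \in S_k} \nabla^2 f_i(y_t) \Big\| \le L_t \le L \quad \forall S_k, y_t. \]
2. The smallest eigenvalue of $(H_t + \beta_t I)$ is bounded away from zero:
\[ U_t \le \|(H_t + \beta_t I)^{-1}\| \le M_t \quad \forall t. \]
3. The gradient norms are uniformly bounded:
\[ \|\nabla_{S_k} f(y_t)\| \le C \quad \forall S_k, y_t. \]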
CONVERGENCE ANALYSIS (CONT'D)
LEMMA
At each outer iteration $t$ of Algorithm 1 and for $k = 1, \dots, c$, we have
\[ \delta_k = \|\nabla_{S_k} f(z_k) - \nabla_{S_k} f(y_t)\| \le L_t M_t \sum_{j=1}^{k-1} (1 + L_t M_t)^{k-1-j} \|\nabla_{S_j} f(y_t)\|. \]
THEOREM
Consider the iterates $y_t$ produced by Algorithm 1. Then, all accumulation points of $\{y_t\}$ are stationary points of the generic problem.
COROLLARY
Algorithm 1 solves the matrix factorization problem.
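As a sanity check, the lemma can be worked out by hand for $k = 2$. The sketch assumes $\zeta = \mathbb{R}^n$, so that the core step has the closed form $z_2 = z_1 - (H_t + \beta_t I)^{-1} \nabla_{S_1} f(z_1)$ with $z_1 = y_t$, and it reads Assumption 1 as saying that each $\nabla_{S_k} f$ is Lipschitz continuous with constant $L_t$.

```latex
\|z_2 - y_t\| = \|(H_t + \beta_t I)^{-1} \nabla_{S_1} f(y_t)\| \le M_t \|\nabla_{S_1} f(y_t)\|,
\qquad
\delta_2 = \|\nabla_{S_2} f(z_2) - \nabla_{S_2} f(y_t)\| \le L_t \|z_2 - y_t\| \le L_t M_t \|\nabla_{S_1} f(y_t)\|.
```

This agrees with the right-hand side of the lemma for $k = 2$, where the sum has the single term $j = 1$ with factor $(1 + L_t M_t)^0 = 1$. For $k = 1$ the sum is empty and $\delta_1 = 0$, since $z_1 = y_t$.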
PRELIMINARY EXPERIMENTS - SETUP
• Linux cluster with 15 nodes
• Each node has 8 Intel Xeon 2.50 GHz processors with 16 GB RAM
• This setting allows 120 tasks to run in parallel
• MovieLens data (1M) is used for our preliminary experiments
PRELIMINARY EXPERIMENTS
FIGURE: Objective function values
PRELIMINARY EXPERIMENTS (CONT'D)
FIGURE: Root mean square error
CONCLUDING REMARKS
SUMMARY
• A promising research path at the intersection of operations research and computer science
• A new distributed and parallel implementation for matrix factorization
• A generic analysis that could be used for showing convergence of other algorithms
FUTURE RESEARCH
• Extensive computational study
• Stochastic version of the proposed algorithm
• Quasi-Newton-based Bayesian inference