Dimensionality reduction techniques for large-scale optimization
Coralia Cartis (University of Oxford) Joint with
Jari Fowkes, Estelle Massart, Adilet Otemissov, Zhen Shao (Oxford) Lindon Roberts (ANU Canberra), Jan Fiala (NAG Ltd)
Research supported by the Alan Turing Institute for Data Science, NAG Ltd and NPL
Workshop on Mathematical Foundations of Optimization in Data Science November 24, 2020 (online)
Cantab Capital Institute for the Mathematics of Information
Johnson-Lindenstrauss Lemma and Random Embeddings
Let $A \in \mathbb{R}^{n\times d}$ be a(ny) real matrix, $n \gg d$, and $\epsilon_s \in (0,1]$.
Then $S \in \mathbb{R}^{m\times n}$ (so that $SA \in \mathbb{R}^{m\times d}$) is an $\epsilon_s$-subspace embedding for $A$ if
$$(1-\epsilon_s)\|Ax\|_2^2 \le \|SAx\|_2^2 \le (1+\epsilon_s)\|Ax\|_2^2 \quad \text{for all } x \in \mathbb{R}^d.$$
Johnson-Lindenstrauss Lemma: [Woodruff,’14]
If $S$ is a scaled Gaussian matrix with $m = \mathcal{O}(d\,|\log\delta_s|\,\epsilon_s^{-2})$, then $S$ is an (oblivious) $\epsilon_s$-subspace embedding for $A$ with probability at least $1-\delta_s$.
But note the high cost of forming $SA$ ⟹ $\mathcal{O}(nd^2)$.
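The subspace embedding property can be checked numerically. The sketch below is only an illustration under assumptions not in the slides: the helper names, the sizes n, d, m and the random test matrix are all hypothetical. It uses the fact that, for an orthonormal basis Q of the range of A, the squared singular values of SQ bound the ratio ‖SAx‖²/‖Ax‖².

```python
import numpy as np

def gaussian_sketch(m, n, rng):
    """Scaled Gaussian matrix: i.i.d. N(0, 1/m) entries (assumed scaling)."""
    return rng.standard_normal((m, n)) / np.sqrt(m)

def embedding_distortion(A, S):
    """Smallest eps with (1-eps)||Ax||^2 <= ||SAx||^2 <= (1+eps)||Ax||^2 for all x.

    Assumes A has full column rank, so Q spans range(A).
    """
    Q, _ = np.linalg.qr(A)
    sv = np.linalg.svd(S @ Q, compute_uv=False)
    return max(sv.max() ** 2 - 1.0, 1.0 - sv.min() ** 2)

rng = np.random.default_rng(0)
n, d, m = 2000, 10, 400            # illustrative sizes only
A = rng.standard_normal((n, d))
S = gaussian_sketch(m, n, rng)
eps = embedding_distortion(A, S)
print(f"observed distortion: {eps:.3f}")
```

Forming `S @ A` here costs a dense matrix product, which is the expense the slide points out.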
Sparse Random Embeddings
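Moving away from Gaussian sketching: uniformly sampling rows of $A$ ⟶ fast, preserves sparsity.
$A \in \mathbb{R}^{n\times d}$, $SA \in \mathbb{R}^{m\times d}$
But it does not work sometimes…
[Figure: example matrix $A$ with entries of magnitude $\mathcal{O}(1)$ and $\mathcal{O}(10^{-6})$]
Chance of missing first row: $(n-m)/n$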
Sampling provides an embedding when A has low coherence
Definition [Leverage score, coherence]: Given $A = U\Sigma V^T$, the leverage score of row $i$ is $\|U_i\|_2$ (row norm). The coherence of $A$, $\mu(A)$, is the maximum of the leverage scores; $d/n \le \mu(A) \le 1$.
If $\mu(A) \ll 1$, the $\|U_i\|_2$ are similar in magnitude; intuitively, the rows of $A$ are similarly important in determining the solution.
[Drineas et al’10,’11, Tropp’11]: If $S$ is a random sampling matrix with $m = \mathcal{O}(\mu(A)^2 d \log d\,|\log\delta_s|\,\epsilon_s^{-2})$, then $S$ is an $\epsilon_s$-subspace embedding for $A$ with probability at least $1-\delta_s$.
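To make the definition concrete, the following sketch computes leverage scores as the row norms of U from a thin SVD, and compares a Gaussian matrix with one that has a single dominant row. The function names and both test matrices are hypothetical, chosen only to mimic the situation where uniform sampling can miss the important row.

```python
import numpy as np

def leverage_scores(A):
    """Row norms ||U_i||_2 of the left singular factor U of A (thin SVD)."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return np.linalg.norm(U, axis=1)

def coherence(A):
    """Maximum leverage score of A."""
    return leverage_scores(A).max()

rng = np.random.default_rng(0)
n, d = 1000, 5                                   # illustrative sizes only

A_incoherent = rng.standard_normal((n, d))       # rows of similar importance
A_coherent = 1e-6 * rng.standard_normal((n, d))  # tiny entries everywhere ...
A_coherent[0, :] = 1.0                           # ... except one dominant row

mu_lo = coherence(A_incoherent)
mu_hi = coherence(A_coherent)
print(f"coherence, Gaussian matrix:     {mu_lo:.3f}")
print(f"coherence, one dominant row:    {mu_hi:.3f}")
```

In the second matrix the first row carries almost all the information, so a uniform row sample that skips it loses the embedding.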
Sparse Random Embeddings
Hashing: sparse sketching for dense and sparse matrices
Sampling: one non-zero per row
[Illustration: sampling matrix $S$ with a single entry equal to 1 in each row]
Hashing: one non-zero per column
$$S = \begin{pmatrix} 0 & 1 & 0 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 0 & 0 & 0 & \cdots & 1 \\ 0 & 0 & 1 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 0 & 0 & 1 & \cdots & 0 \end{pmatrix}$$
$A \in \mathbb{R}^{n\times d}$, $SA \in \mathbb{R}^{m\times d}$
Action of Hashing $S$:
$$SA = \sum_{i=1}^{n} s_i a_i, \qquad s_i: \text{$i$th column of } S, \quad a_i: \text{$i$th row of } A.$$
Compared to sampling, hashing uses every row of $A$.
• Expect better robustness
Can also consider $s$ nonzeros per column: $s$-hashing. More robustness is achieved.
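A small sketch of building an s-hashing matrix and checking the column/row form of SA. Two choices here are assumptions rather than something stated on the slide: the nonzeros are random signs scaled by 1/sqrt(s) (a common convention, whereas the slide's illustration shows entries equal to 1), and the matrix is stored densely for readability. The name `hashing_matrix` is hypothetical.

```python
import numpy as np

def hashing_matrix(m, n, s=1, rng=None):
    """Dense m x n s-hashing matrix: exactly s nonzeros in every column.

    Assumed convention: nonzeros are +-1/sqrt(s) with random signs.
    """
    rng = np.random.default_rng() if rng is None else rng
    S = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)   # s distinct rows
        S[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return S

rng = np.random.default_rng(0)
m, n, d, s = 8, 50, 3, 2                 # illustrative sizes only
S = hashing_matrix(m, n, s, rng)
A = rng.standard_normal((n, d))

# SA as a sum of outer products (i-th column of S) x (i-th row of A)
SA_as_sum = sum(np.outer(S[:, i], A[i, :]) for i in range(n))
print(np.allclose(S @ A, SA_as_sum))
```

Because every column of S is nonzero, every row of A contributes to SA, which is the robustness argument made above. In practice S would be stored in a sparse format.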
Sparse Random Embeddings
Sketching with hashing matrices: theoretical results. [Shao, C, Fiala’20]
Sampling has better embedding properties when coherence of $A$ is low.
Is this true for hashing?
When $\mu(A)$ is sufficiently small, hashing provides an $\epsilon_s$-subspace embedding with an optimal dimensionality reduction bound, $\mathcal{O}(d)$, better than the $\mathcal{O}(d \log d)$ bound for sampling.

Result                   Coherence of A              Size of sketching S
[Meng & Mahoney’13]      -                           $\Theta(d^2\,|\log\delta_s|\,\epsilon_s^{-2})$
[Bourgain et al’15]      $\mathcal{O}(\log^{-3} d)$  $\mathcal{O}(d \log^2 d\,|\log\delta_s|\,\epsilon_s^{-2})$
[C, Fiala & Shao,’20]    $\mathcal{O}(d^{-1})$       $\mathcal{O}(d\,|\log\delta_s|\,\epsilon_s^{-2})$

Throughout, $d$ can be replaced by rank of $A$.
Using sketching for optimization?
How can we use sketching for improving efficiency and scalability of optimization algorithms?
Sketching in the observational domain (subsampling, batch) reduces number of observations/measurements/data points
• linear least squares solver (Solver: Ski-LLS [C, Fiala, Shao’20])
• nonlinear least squares - derivative-based Gauss-Newton methods [C, Scheinberg’20]
• nonlinear least squares - derivative-free Gauss-Newton methods [C, Ferguson, Roberts’20]
Sketching in the variable domain (block-coordinate, subspace methods) reduces the number of parameters/variables
• Gauss-Newton variants for derivative-based and derivative-free
• Functions with low effective dimensionality, global optimization
Today
Nonlinear least squares:
derivative-based methods
Gauss-Newton method for Non-linear Least Squares (NNLS)
$$\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{2}\|r(x)\|_2^2 = \frac{1}{2}\sum_{i=1}^{n} (r_i(x))^2$$
where $r : \mathbb{R}^d \to \mathbb{R}^n$ smooth and possibly nonconvex; $J$: $n \times d$ Jacobian matrix of first derivatives of $r$.
Gauss-Newton method(s): state-of-the-art for NNLS
At iterate $x_k$, calculate direction $s_k \in \mathbb{R}^d$ by approximately minimizing a regularized/constrained/unconstrained variant of the convex quadratic local model, over $s \in \mathbb{R}^d$,
$$q_k(s) = \frac{1}{2}\|J(x_k)s + r(x_k)\|_2^2 = f(x_k) + \langle J(x_k)^T r(x_k), s\rangle + \frac{1}{2}\langle s, J(x_k)^T J(x_k) s\rangle.$$
Regularization, trust-region and linesearch variants have been successfully developed.
We will look at:
Sketching in the variable domain (subspace methods)
Randomised Subspace Gauss-Newton (R-SGN) methods: variable sketching
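R-SGN with quadratic regularisation/trust region: at iteration $k$, [C,Fowkes,Shao’20]
randomly draw $p \times d$ sketching matrix $S_k$; calculate the subspace-Jacobian $J(x_k)S_k^T$ and the reduced local quadratic model, $\hat{s} \in \mathbb{R}^p$, $p \ll d$,
$$q_k(\hat{s}) = \frac{1}{2}\|J(x_k)S_k^T\hat{s} + r(x_k)\|_2^2 = f(x_k) + \langle S_k J(x_k)^T r(x_k), \hat{s}\rangle + \frac{1}{2}\langle \hat{s}, S_k J(x_k)^T J(x_k) S_k^T \hat{s}\rangle.$$
solve the reduced subproblem (inexactly) to find $\hat{s}_k \in \mathbb{R}^p$,
$$\min_{\hat{s} \in \mathbb{R}^p} q_k(\hat{s}) \left(+ \frac{\sigma_k}{2}\|S_k^T\hat{s}\|_2^2\right) \quad \text{or} \quad \left(\text{such that } \|S_k^T\hat{s}\| \le \Delta_k\right).$$
compute the ratio
$$\rho_k = \frac{f(x_k) - f(x_k + S_k^T\hat{s}_k)}{f(x_k) - q_k(\hat{s}_k)}.$$
set $x_{k+1} = x_k + S_k^T\hat{s}_k$ and $\sigma_{k+1} < \sigma_k$ ($\Delta_{k+1} > \Delta_k$) if $\rho_k \ge \eta_1$; else, $x_{k+1} = x_k$ and $\sigma_{k+1} > \sigma_k$ ($\Delta_{k+1} < \Delta_k$).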
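The R-SGN iteration with quadratic regularisation can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function `rsgn`, its parameters (`eta1`, `gamma`, the initial `sigma`), the simple halving/doubling update of the regularisation weight, the exact subproblem solve, and the mildly nonlinear test problem are all assumptions made for the example. For readability the full Jacobian is formed and then multiplied by the sketch, whereas the method only needs the n x p product.

```python
import numpy as np

def rsgn(r, J, x0, p, sigma=1.0, eta1=0.1, gamma=2.0, max_iter=300, seed=0):
    """Illustrative R-SGN loop with quadratic regularisation, Gaussian S_k."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    d = x.size
    for _ in range(max_iter):
        rx = r(x)
        f = 0.5 * rx @ rx
        S = rng.standard_normal((p, d)) / np.sqrt(p)   # sketch S_k, p x d
        Jhat = J(x) @ S.T                              # subspace Jacobian, n x p
        # Exact minimiser of q_k(s) + (sigma/2) * ||S_k^T s||^2
        H = Jhat.T @ Jhat + sigma * (S @ S.T)
        s_hat = np.linalg.solve(H, -Jhat.T @ rx)
        q = 0.5 * np.sum((Jhat @ s_hat + rx) ** 2)     # model value q_k(s_hat)
        x_trial = x + S.T @ s_hat
        r_trial = r(x_trial)
        f_trial = 0.5 * r_trial @ r_trial
        rho = (f - f_trial) / max(f - q, np.finfo(float).tiny)
        if rho >= eta1:      # successful: accept step, decrease sigma
            x, sigma = x_trial, max(sigma / gamma, 1e-8)
        else:                # unsuccessful: keep x, increase sigma
            sigma *= gamma
    return x

# Hypothetical zero-residual test problem: r(x) = g(Ax) - b, g(t) = t + 0.1 sin(t)
rng = np.random.default_rng(1)
n, d, p = 100, 20, 10
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * np.sin(A @ x_true)

def residual(x):
    t = A @ x
    return t + 0.1 * np.sin(t) - b

def jacobian(x):
    return (1.0 + 0.1 * np.cos(A @ x))[:, None] * A

x0 = np.zeros(d)
x_final = rsgn(residual, jacobian, x0, p)
f0 = 0.5 * np.sum(residual(x0) ** 2)
f_final = 0.5 * np.sum(residual(x_final) ** 2)
print(f"f(x0) = {f0:.3e}, f(x_final) = {f_final:.3e}")
```

Each iteration solves only a p x p linear system, which is where the per-iteration saving over a full-space Gauss-Newton step comes from.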
Global rates of convergence for R-SGN methods
Assumptions:
$r, J$ are Lipschitz continuous; [smoothness]
Let $\epsilon_s, \delta_s \in (0,1)$. At each iterate $x_k$, with probability at least $1 - \delta_s$,
$\|S_k \nabla f(x_k)\|_2^2 \ge (1 - \epsilon_s)\|\nabla f(x_k)\|_2^2$, and $\|S_k\|_2 \le S_{\max}$. [sketching accuracy]
Typical inexact model minimization conditions for quadratic regularisation/trust-region
Theorem[R-SGN]: Let $\epsilon > 0$, and $\delta \in (0,1)$ such that $(1 - \delta_s)\delta > c$, where $c$ is a user-chosen parameter*. Then the R-SGN algorithm takes at most
$$N \le \left[(1 - \delta_s)\delta - c\right]^{-1} \mathcal{O}\left(f(x_0)(1 - \epsilon_s)^{-1}\epsilon^{-2}\right)$$
iterations and evaluations of the residual and sketched Jacobian such that
$\min_{k \le N} \|\nabla f(x_k)\|_2 \le \epsilon$, with probability at least $1 - e^{-\frac{(1-\delta)^2}{2}(1-\delta_s)N}$.
[* The constant $c$ connects the updates of the regularisation/trust-region parameter]
This bound matches deterministic complexity bounds for first-order and Gauss-Newton methods despite having only partial Jacobian information available at each iteration. [C,Fowkes,Shao’20]
Global rates of convergence for R-SGN methods
Proof Idea: uses techniques from probabilistic models complexity analyses [Gratton et al’18; C, Scheinberg’18]
True/false iterations, successful/unsuccessful iterations
There can be at most $C[f(x_0) - f_*]\epsilon^{-2}$ true and successful iterations (from sufficient decrease condition) and $f_* = 0$.
Sketching accuracy assumption gives: for any $\delta \in (0,1)$,
$$\mathbb{P}(T < \delta N) \le e^{-(1-\delta)^2 N},$$
where $T$ and $N$ are the total number of true and total iterations, respectively.
Global rates of convergence for R-SGN methods
Satisfying the sketching accuracy assumption
$S_k$: $p \times d$ matrix with iid scaled Gaussian entries with $p = \mathcal{O}(|\log\delta_s|\,\epsilon_s^{-2})$
$S_k$: $p \times d$ $s$-hashing matrix with $p = \mathcal{O}(|\log\delta_s|\,\epsilon_s^{-2})$
Sufficient for each $S_k$ to be a (one-sided) $\epsilon_s$-subspace embedding for one-dimensional vectors, so that the gradient can be embedded correctly.
$S_k$: $p \times d$ sampling matrix: need non-uniformity dependent subspace embeddings for vectors $y$ with $\|y\|_\infty \cdot \|y\|_2^{-1} \le \nu_s$. Then $p = \mathcal{O}(d\,\nu_s^2\,|\log\delta_s|\,\epsilon_s^{-2})$. This implies that sampling $S_k$ embeds correctly the gradient whenever $\|\nabla f(x)\|_\infty \cdot \|\nabla f(x)\|_2^{-1} \le \nu_s$, that is, the gradient components are similar in magnitude.
Comparison with probabilistic models
Our sketching assumption is weaker than probabilistic model conditions [Bandeira, Scheinberg, Vicente’13]: one-sided length preservation of gradient; not required to embed subspace; numerical example [C,Fowkes,Shao’20]
Block-Coordinate Gauss-Newton (BC-GN) methods
BC-GN = R-SGN with sampling matrix $S_k$
When $S_k$ is a sampling matrix, $J(x_k)S_k^T$ is a random subset of size $p$ of the columns $\partial r / \partial x_i$ of $J(x_k)$.
Theorem[R-SGN] ⟹ global rate of convergence of BC-GN (with quadratic regularisation or trust region) with high probability, provided the gradient has similar components in magnitude
Under more general assumptions, we can obtain a weaker global rate analysis for BC-GN with fixed and arbitrary block size.
Assume that each coordinate block $B_k$ of size $p$ is drawn with probability $P_k$ (with replacement or from a partition). Then
$$\mathbb{E}_{B_k}\|\nabla_{B_k} f(x_k)\|^2 \ge P_{\min} R\, \|\nabla f(x_k)\|^2, \qquad \left[\nabla_{B_k} f(x_k) = J_{B_k}(x_k)^T r(x_k)\right]$$
where each coordinate appears $R$ times in the set of all possible blocks.
Theorem[BC-GN]: Assume $r, J$ Lipschitz continuous. Then the number of BC-GN iterations/evaluations until $\mathbb{E}(\|\nabla f(x_k)\|_2) \le \epsilon$ is at most $\mathcal{O}\left(f(x_0)(P_{\min}R)^{-1}\epsilon^{-2}\right)$. In particular, when blocks are drawn uniformly at random, such as from a partition, then $P_{\min}R = p\,d^{-1}$.
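Two facts from this slide can be checked on a toy example: a sampling matrix picks out columns of the Jacobian, and for blocks drawn uniformly from a partition the expected squared block-gradient norm is p/d times the full squared gradient norm. The matrices below are random stand-ins for J(x_k) and r(x_k), and the helper `sampling_matrix` is a hypothetical name.

```python
import numpy as np

def sampling_matrix(block, d):
    """p x d matrix whose rows are unit vectors e_i^T for i in `block`."""
    S = np.zeros((len(block), d))
    S[np.arange(len(block)), block] = 1.0
    return S

rng = np.random.default_rng(0)
n, d, p = 8, 12, 3                  # illustrative sizes, p divides d
J = rng.standard_normal((n, d))     # stand-in for J(x_k)
r = rng.standard_normal(n)          # stand-in for r(x_k)
grad = J.T @ r                      # full gradient J^T r

# A sampling sketch selects p columns of J
block = rng.choice(d, size=p, replace=False)
S = sampling_matrix(block, d)
columns_match = np.allclose(J @ S.T, J[:, block])

# Uniform draw from a partition into d/p blocks: here R = 1, P_min = p/d
blocks = np.arange(d).reshape(d // p, p)
expected_sq = np.mean([np.sum((J[:, B].T @ r) ** 2) for B in blocks])
ratio = expected_sq / np.sum(grad ** 2)
print(columns_match, ratio, p / d)
```

For a partition the inequality holds with equality, since the block gradients together make up the full gradient exactly once.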
R-SGN/BC-GN methods: numerical experiments
[C,Fowkes,Shao’20; WPaper 20]
BC-GN with TR on logistic regression for chemotherapy
dataset (Python code)
R-SGN/BC-GN methods: numerical experiments
[C,Fowkes,Shao’20]
BC-GN with TR on logistic regression on gisette dataset
Nonlinear least squares:
derivative-free methods
Subspace derivative-free Gauss-Newton methods for NNLS
Sketching DFO-GN/DFO-LS in $d$ (number of variables/size of interpolation set)
Fewer evaluations and lower linear algebra cost per iteration. Global efficiency?
[Roberts, PhD Thesis’19; C, Roberts’20]
Use interpolation set $\{x_k, y_1, \dots, y_p\}$ for $p < d$, then solve
$$\begin{bmatrix} (y_1 - x_k)^T \\ \vdots \\ (y_p - x_k)^T \end{bmatrix} \hat{J}_k^T = \begin{bmatrix} (r(y_1) - r(x_k))^T \\ \vdots \\ (r(y_p) - r(x_k))^T \end{bmatrix}$$
Underdetermined system ⟹ take minimal norm solution.
Computational Cost = factorization + solve = $\mathcal{O}(dp^2) + \mathcal{O}(np^2) \approx \mathcal{O}(np^2)$
Evaluation cost: only need $p$ evaluations of $r$ on first iteration and a small number/multiple of $p$ subsequently
Choose $p$ based on computational resources/evaluation cost
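The underdetermined interpolation system and its minimal-norm solution can be illustrated with a least-squares solve. This is a sketch under assumptions: the function `subspace_jacobian`, the linear test residual (whose true Jacobian is the matrix A below) and the random interpolation points are hypothetical, and NumPy's `lstsq` is used as a generic way to obtain the minimal-norm solution rather than the factorization used in DFO-LS or DFBGN.

```python
import numpy as np

def subspace_jacobian(x_k, Y, r):
    """Minimal-norm J (n x d) with (y_i - x_k)^T J^T = (r(y_i) - r(x_k))^T."""
    W = Y - x_k                                  # p x d, rows (y_i - x_k)^T
    R = np.array([r(y) for y in Y]) - r(x_k)     # p x n, rows (r(y_i) - r(x_k))^T
    Jt, *_ = np.linalg.lstsq(W, R, rcond=None)   # minimal-norm solution, d x n
    return Jt.T

rng = np.random.default_rng(0)
n, d, p = 15, 10, 3                 # illustrative sizes, p < d
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def r(x):
    """Linear test residual; its exact Jacobian is A."""
    return A @ x - b

x_k = rng.standard_normal(d)
Y = x_k + 0.1 * rng.standard_normal((p, d))      # p interpolation points
J_hat = subspace_jacobian(x_k, Y, r)

W = Y - x_k
# J_hat should agree with A along the interpolation directions
interp_error = np.max(np.abs(J_hat @ W.T - A @ W.T))
print(f"interpolation error: {interp_error:.2e}")
```

Only p + 1 residual evaluations are used, and the approximate Jacobian carries information only along the p directions spanned by the interpolation set.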
Subspace derivative-free Gauss-Newton methods for NNLS
DFBGN (Derivative-Free Block Gauss-Newton) Algorithm
Build low-dimensional model and calculate trust-region step, $\hat{s}_k \in \mathbb{R}^p$,
$$\min_{\hat{s} \in \mathbb{R}^p} \frac{1}{2}\|r(x_k) + \hat{J}_k \hat{s}\|^2 \quad \text{s.t. } \|\hat{s}\| \le \Delta_k$$
Evaluate $f(x_k + Q_k\hat{s}_k)$, accept/reject step, and update $\Delta_k$ (usual DFO choices)
(where $Q_k$ is basis of interpolation set $\mathcal{Y}_k = \{y_1 - x_k, \dots, y_p - x_k\}$)
Add $x_k + Q_k\hat{s}_k$ to interpolation set and remove $p_{drop} \ge 2$ points from the interpolation set
Add random orthogonal directions $x_k + \Delta_k d$ for $d \perp \mathcal{Y}_k$ until we have $p + 1$ interpolation points
Comments:
Linear algebra cost $\mathcal{O}(np^2 + dp^2 + p^3)$ vs full space method $\mathcal{O}(nd^2 + d^3)$
Choosing points to remove uses Lagrange polynomials (geometry-aware)
Choice of $p_{drop}$: $p_{drop} = 2$ on successful iterations, $p/10$ on unsuccessful iterations
Subspace derivative-free Gauss-Newton methods for NNLS
Numerical results for DFBGN algorithm
Choose test set CUTEst with $d \approx 1000$, max 12hrs per problem
DFBGN outperforms DFO-LS for low accuracy solutions …because it does not time out!
[Figure: proportion of problems solved vs. budget / min budget of any solver, relative accuracy = 0.1; curves: DFO-LS, DFO-LS (init n/100), DFBGN (p = n), DFBGN (p = n/2), DFBGN (p = n/10), DFBGN (p = n/100); in figures, n → d]
Solver and timeout:
DFOLS            93%
DFBGN (d/100)    35%
DFBGN (d/10)     74%
DFBGN (d/2)      82%
Subspace derivative-free Gauss-Newton methods for NNLS
Numerical results for DFBGN algorithm
Other advantage: DFBGN makes progress after $p \ll d$ evaluations (especially important when $d$ large)
[Figures: normalized objective value vs. budget (in gradients), 12hr timeout; curves: DFO-LS, DFO-LS (init n/100), p = n, p = n/2, p = n/10, p = n/100; panels: ARWHDNE, d=2000 and CHANDHEQ, d=2000; in figures, n → d]
Random embeddings for global optimization
Global optimization of functions with low effective dimensionality
Global optimization is generally NP-hard. Can global optimization algorithms be made efficient for 'simpler' problems? What is problem/data 'simplicity'? Can algorithms adapt to data (without knowing it a priori)?
$$\min_{x} f(x) \quad \text{subject to } x \in \mathcal{X} = [-1,1]^d$$
Problem simplicity: Functions which do not vary along certain linear subspaces.
Alternative names: low effective dimensionality, (multi-)ridge, planar waves, active subspaces
Applications: hyper-parameter optimization; complex engineering simulations; parametric, stochastic PDEs; over-parametrized DNNs?
Global optimization of functions with low effective dimensionality
Challenging set-up: The objective function is black box. The orientation of the important subspace is not known.
Solution: Random embeddings [Ziyu Wang et al. Bayesian optimization in a billion dimensions via random embeddings. J. Artif. Int. Res., 55(1), 2016.]
Random embeddings ⟶ lower dimensional problems
Replace $f(x)$ by $f(S^Ty)$, where $S$ is a $p \times d$ Gaussian matrix, and $p \ll d$.
Functions with low effective dimensionality [Wang et al.’13]: $f : \mathbb{R}^d \to \mathbb{R}$ has effective dimensionality $d_e \le d$ if there exists a linear subspace $\mathcal{T}$ of dimension $d_e$ such that $f(x_\top + x_\perp) = f(x_\top)$ for all vectors $x_\top$ in $\mathcal{T}$ and $x_\perp$ in $\mathcal{T}^\perp$. [$d_e$ is the smallest integer satisfying these properties]. Dimensions of interest: $d_e \le p \le d$.
Global optimization of functions with low effective dimensionality
Reduced optimization problems [C, Otemissov’20; C, Massart, Otemissov’20]
(R): $\min_{y \in \mathbb{R}^p} f(S^Ty + u)$ s.t. $y \in Y = [-a, a]^p$
(AR): $\min_{y \in \mathbb{R}^p} f(S^Ty + u)$ s.t. $S^Ty + u \in \mathcal{X}$
REGO algorithm: (single random embedding)
$u = 0$; solves (R) once (using any global solver), $f(S^Ty_*) \approx f_*$, $S^Ty_*$ unconstrained solution
AREGO algorithm: (multiple random embedding)
solves (AR) multiple times (with any global solver), updates $u$ to best point found so far, $f(S^Ty_*) \approx f_*$, $S^Ty_* \in \mathcal{X}$
Assumption: $p \ge d_e$.
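A toy example of why a random embedding with p at least d_e can still reach the global minimum value. Everything specific here is an assumption made for illustration: the quadratic test function, the basis U of the effective subspace, the target z_star and the sizes. The reduced problem is not solved with a global solver; instead the minimal-norm y that reproduces the minimiser's component in the effective subspace is computed directly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_e, p = 100, 2, 5               # illustrative sizes, d_e <= p << d

# Orthonormal basis of the (hypothetical) effective subspace T
U, _ = np.linalg.qr(rng.standard_normal((d, d_e)))
z_star = np.array([0.3, -0.2])      # minimiser in effective coordinates

def f(x):
    """Varies only along T = range(U); global minimum value f* = 0."""
    z = U.T @ x
    return float(np.sum((z - z_star) ** 2))

# Invariance: adding a component from T^perp does not change f
x = rng.uniform(-1.0, 1.0, d)
x_perp = rng.standard_normal(d)
x_perp -= U @ (U.T @ x_perp)        # project onto T^perp
invariance_gap = abs(f(x + x_perp) - f(x))

# Random embedding: minimal-norm y with U^T S^T y = z_star
S = rng.standard_normal((p, d))     # p x d Gaussian matrix
y_star, *_ = np.linalg.lstsq(U.T @ S.T, z_star, rcond=None)
f_reduced = f(S.T @ y_star)
print(f"invariance gap: {invariance_gap:.1e}, f(S^T y*): {f_reduced:.1e}")
```

The point S^T y* found this way need not lie in the box [-1, 1]^d; handling that is what distinguishes the bound constraints of (R) from the constraint of (AR).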
Theoretical analysis
REGO: best-known probability of success of (R) and suitable choices of $a$, depends only on $p, d_e$, not on ambient dimension $d$
AREGO: probability of success of (AR) and convergence of AREGO, depends on $d$ algebraically, not exponentially
Numerical experiments: confirm the theoretical findings; include replacing global solvers with local ones
Global optimization of functions with low effective dimensionality
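(R) reduced subproblem and REGO algorithm
$$\min_{y \in \mathbb{R}^p} f(S^Ty) \quad \text{s.t. } y \in Y = [-a, a]^p, \qquad y_2^* = \arg\min_{y:\, f(S^Ty) = f^*} \|y\|_2.$$
We can show that
$$\frac{\|x_\top^*\|_2^2}{\|y_2^*\|_2^2} \sim \chi^2_{p - d_e + 1} \implies \mathbb{P}(\text{(R) is successful}) \ge 1 - C(q)\left(1 + \frac{q}{2} e^{-c^2/2}\right)\left(\frac{c^2}{2}\right)^{q/2}$$
where $q = p - d_e + 1$, $c = \|x_\top^*\|/a$.
[Figure: BARON on GO problems with low effective dimensionality, (R); in figures, D → d and d → p]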
Global optimization of functions with low effective dimensionality
(AR) reduced subproblem and AREGO algorithm
$$\min_{y \in \mathbb{R}^p} f(S^Ty + u) \quad \text{s.t. } S^Ty + u \in \mathcal{X}$$
$$\mathbb{P}(\text{(AR) is successful}) \ge \mathbb{P}(-1 \le S^Ty_2^* + u \le 1) > \tau(d) > 0 \quad \longrightarrow \text{t-distribution}$$
Convergence of AREGO, with prob one, proved to a neighbourhood of global minimum of original problem; multiple embeddings $S_k$ used.
[Figure: BARON, Local KNITRO on (AR); same tests as for REGO, functions with low effective dimensionality; in figures, D → d and d → p]