Dimensionality reduction techniques for large-scale optimization

(1)

Coralia Cartis (University of Oxford)

Joint with Jari Fowkes, Estelle Massart, Adilet Otemissov, Zhen Shao (Oxford), Lindon Roberts (ANU Canberra), Jan Fiala (NAG Ltd)

Research supported by the Alan Turing Institute for Data Science, NAG Ltd and NPL

Workshop on Mathematical Foundations of Optimization in Data Science, November 24, 2020 (online)

Cantab Capital Institute for the Mathematics of Information

(2)

Johnson-Lindenstrauss Lemma and Random Embeddings

Let $A \in \mathbb{R}^{n \times d}$ be a(ny) real matrix, $n \gg d$, and $\epsilon_s \in (0,1]$.

Then $S \in \mathbb{R}^{m \times n}$ is an $\epsilon_s$-subspace embedding for $A$ (with sketch $SA \in \mathbb{R}^{m \times d}$) if

$$(1 - \epsilon_s)\|Ax\|_2^2 \le \|SAx\|_2^2 \le (1 + \epsilon_s)\|Ax\|_2^2 \quad \text{for all } x \in \mathbb{R}^d.$$

Johnson-Lindenstrauss Lemma [Woodruff,’14]: If $S$ is a scaled Gaussian matrix with $m = \mathcal{O}(d\,|\log\delta_s|\,\epsilon_s^{-2})$, then $S$ is an (oblivious) $\epsilon_s$-subspace embedding for $A$ with probability at least $1 - \delta_s$.

But note the high cost of forming $SA$ $\Longrightarrow$ $\mathcal{O}(nd^2)$.
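The embedding inequality can be checked numerically. The sketch below is only an illustration: the helper names `gaussian_sketch` and `embedding_distortion`, the problem sizes, and the choice of N(0, 1/m) entries as the "scaled Gaussian" are assumptions, not taken from the talk. It measures the smallest $\epsilon_s$ for which a given $S$ embeds the column space of $A$.

```python
import numpy as np

def gaussian_sketch(m, n, rng):
    """Scaled Gaussian matrix S (m x n); entries i.i.d. N(0, 1/m) (assumed scaling)."""
    return rng.standard_normal((m, n)) / np.sqrt(m)

def embedding_distortion(A, S):
    """Smallest eps with (1-eps)||Ax||^2 <= ||SAx||^2 <= (1+eps)||Ax||^2 for all x.

    With Q an orthonormal basis of range(A) (A assumed full column rank),
    the extreme values of ||SAx||^2 / ||Ax||^2 are the squared extreme
    singular values of S Q.
    """
    Q, _ = np.linalg.qr(A)                        # reduced QR, Q is n x d
    sv = np.linalg.svd(S @ Q, compute_uv=False)   # singular values of S Q
    return max(abs(sv.max() ** 2 - 1.0), abs(1.0 - sv.min() ** 2))

rng = np.random.default_rng(0)
n, d = 2000, 10
A = rng.standard_normal((n, d))
for m in (40, 160, 640):
    eps = embedding_distortion(A, gaussian_sketch(m, n, rng))
    print(f"m = {m:4d}: observed distortion {eps:.3f}")
```

Forming `S @ A` densely is exactly the $\mathcal{O}(nd^2)$ cost noted above when $m = \mathcal{O}(d)$, which motivates the sparse alternatives.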

(3)

Sparse Random Embeddings

Moving away from Gaussian sketching: uniformly sampling rows of $A$ $\longrightarrow$ fast, preserves sparsity.

But it does not work sometimes….

[Figure: sampling rows of $A \in \mathbb{R}^{n \times d}$ to obtain $SA \in \mathbb{R}^{m \times d}$, for a matrix $A$ with entries of size $\mathcal{O}(1)$ and $\mathcal{O}(10^{-6})$. Chance of missing first row: $(n-m)/n$.]

Sampling provides an embedding when $A$ has low coherence

Definition [Leverage score, coherence]: Given $A = U\Sigma V^T$, the leverage score of row $i$ is $\|U_i\|_2$ (row norm).

The coherence of $A$, $\mu(A)$, is the maximum of the leverage scores; $d/n \le \mu(A) \le 1$.

If $\mu(A) \ll 1$, the $\|U_i\|_2$ are similar in magnitude; intuitively, the rows of $A$ are similarly important in determining the solution.

[Drineas et al’10,’11, Tropp’11]: If $S$ is a random sampling matrix with $m = \mathcal{O}(\mu(A)^2\, d \log d\, |\log\delta_s|\, \epsilon_s^{-2})$, then $S$ is an $\epsilon_s$-subspace embedding for $A$ with probability at least $1 - \delta_s$.
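To make the definition concrete, the following sketch computes leverage scores and coherence from a thin SVD. The function names and the two test matrices are hypothetical, and $A$ is assumed to have full column rank. The second matrix mimics the failure case above: one row of size $\mathcal{O}(1)$, all others of size $\mathcal{O}(10^{-6})$.

```python
import numpy as np

def leverage_scores(A):
    """Row norms ||U_i||_2 of U in the thin SVD A = U Sigma V^T (full column rank assumed)."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return np.linalg.norm(U, axis=1)

def coherence(A):
    """mu(A): the largest leverage score."""
    return leverage_scores(A).max()

rng = np.random.default_rng(0)
n, d = 1000, 5

# Rows of comparable importance: low coherence, uniform sampling is reasonable.
A_incoherent = rng.standard_normal((n, d))

# One dominant row: coherence close to 1, uniform sampling is likely to miss it.
A_coherent = 1e-6 * rng.standard_normal((n, d))
A_coherent[0, :] = 1.0

print("coherence (Gaussian rows):    ", coherence(A_incoherent))
print("coherence (one dominant row): ", coherence(A_coherent))
```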

(4)

Sparse Random Embeddings

Hashing: sparse sketching for dense and sparse matrices

Sampling: one non-zero per row

[Matrix: 0/1 sampling matrix with a single entry 1 in each row.]

Hashing: one non-zero per column

$$S = \begin{pmatrix} 0 & 1 & 0 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 0 & 0 & 0 & \cdots & 1 \\ 0 & 0 & 1 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 0 & 0 & 1 & \cdots & 0 \end{pmatrix}$$

Action of Hashing $S$ on $A \in \mathbb{R}^{n \times d}$, giving $SA \in \mathbb{R}^{m \times d}$:

$$SA = \sum_{i=1}^{n} s^i a_i, \qquad s^i: i\text{th column of } S, \quad a_i: i\text{th row of } A.$$

Compared to sampling, Hashing uses every row of $A$.

• Expect better robustness

Can also consider $s$ non-zeros per column: $s$-hashing. More robustness is achieved.
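A possible construction of an $s$-hashing matrix is sketched below. It assumes the common convention of random signs and a $1/\sqrt{s}$ scaling for the non-zeros (the slide shows entries equal to 1), stores $S$ densely for readability, and uses the hypothetical name `s_hashing_matrix`. It also checks the column-times-row expansion of $SA$.

```python
import numpy as np

def s_hashing_matrix(m, n, s, rng):
    """m x n matrix with exactly s non-zeros per column, each +-1/sqrt(s) (assumed convention)."""
    S = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)   # s distinct rows for column j
        S[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return S

rng = np.random.default_rng(0)
n, d, m, s = 50, 4, 12, 3
A = rng.standard_normal((n, d))
S = s_hashing_matrix(m, n, s, rng)

# SA as a sum over columns s^i of S and rows a_i of A: every row of A contributes.
SA_sum = sum(np.outer(S[:, i], A[i, :]) for i in range(n))
print("max |SA - sum_i s^i a_i| =", np.abs(S @ A - SA_sum).max())
```

In practice a sparse format would be used so that forming $SA$ costs on the order of $s$ times the number of non-zeros of $A$; the dense array here is only for clarity.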

(5)

Sparse Random Embeddings

Sketching with hashing matrices: theoretical results. [Shao, C, Fiala’20]

Sampling has better embedding properties when coherence of $A$ is low.

Is this true for hashing?

When $\mu(A)$ is sufficiently small, hashing provides an $\epsilon_s$-subspace embedding with an optimal dimensionality reduction bound, $\mathcal{O}(d)$, better than the $\mathcal{O}(d \log d)$ bound for sampling.

Result                    Coherence of A              Size of sketching S
[Meng & Mahoney’13]       -                           $\Theta(d^2\,|\log\delta_s|\,\epsilon_s^{-2})$
[Bourgain et al’15]       $\mathcal{O}(\log^{-3} d)$  $\mathcal{O}(d \log^2 d\,|\log\delta_s|\,\epsilon_s^{-2})$
[C, Fiala & Shao,’20]     $\mathcal{O}(d^{-1})$       $\mathcal{O}(d\,|\log\delta_s|\,\epsilon_s^{-2})$

Throughout, $d$ can be replaced by rank of $A$.

(6)

Using sketching for optimization?

How can we use sketching for improving efficiency and scalability of optimization algorithms?

Sketching in the observational domain (subsampling, batch) reduces number of observations/measurements/data points

• linear least squares solver (Solver: Ski-LLS [C, Fiala, Shao’20])

• nonlinear least squares - derivative-based Gauss-Newton methods [C, Scheinberg’20]

• nonlinear least squares - derivative-free Gauss-Newton methods [C, Ferguson, Roberts’20]

Today: Sketching in the variable domain (block-coordinate, subspace methods) reduces the number of parameters/variables

• Gauss-Newton variants for derivative-based and derivative-free

• Functions with low effective dimensionality, global optimization

(7)

Nonlinear least squares:

derivative-based methods

(8)

Gauss-Newton method for Non-linear Least Squares (NNLS)

$$\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{2}\|r(x)\|_2^2 = \frac{1}{2}\sum_{i=1}^{n} (r_i(x))^2$$

where $r : \mathbb{R}^d \to \mathbb{R}^n$ smooth and possibly nonconvex; $J$: $n \times d$ Jacobian matrix of first derivatives of $r$.

Gauss-Newton method(s): state-of-the-art for NNLS

At iterate $x_k$, calculate direction $s_k \in \mathbb{R}^d$ by approximately minimizing a regularized/constrained/unconstrained variant of the convex quadratic local model, over $s \in \mathbb{R}^d$,

$$q_k(s) = \frac{1}{2}\|J(x_k)s + r(x_k)\|_2^2 = f(x_k) + \langle J(x_k)^T r(x_k), s\rangle + \frac{1}{2}\langle s, J(x_k)^T J(x_k) s\rangle.$$

Regularization, trust-region and linesearch variants have been successfully developed.
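As an illustration of the full-space step, here is a minimal sketch that minimises the local model $q_k$, optionally with a quadratic regularisation term $\tfrac{\sigma}{2}\|s\|^2$. The function name `gauss_newton_step` and the Rosenbrock-type residual used as a test problem are assumptions made for this example.

```python
import numpy as np

def gauss_newton_step(r, J, x, sigma=0.0):
    """Minimise 0.5*||J(x)s + r(x)||^2 + 0.5*sigma*||s||^2 over s."""
    Jx, rx = J(x), r(x)
    if sigma > 0:
        # Regularised normal equations: (J^T J + sigma I) s = -J^T r.
        d = Jx.shape[1]
        return np.linalg.solve(Jx.T @ Jx + sigma * np.eye(d), -Jx.T @ rx)
    # Unregularised model: linear least squares in s.
    return np.linalg.lstsq(Jx, -rx, rcond=None)[0]

# Hypothetical zero-residual test problem (Rosenbrock written as least squares).
def r(x):
    return np.array([10.0 * (x[1] - x[0] ** 2), 1.0 - x[0]])

def J(x):
    return np.array([[-20.0 * x[0], 10.0], [-1.0, 0.0]])

x = np.array([-1.2, 1.0])
for _ in range(2):
    x = x + gauss_newton_step(r, J, x)
print("iterate after two undamped steps:", x)
```

For this square, zero-residual example the undamped iteration coincides with Newton's method applied to $r(x) = 0$ and reaches the solution $(1, 1)$ after two steps; in general a regularisation, trust-region or linesearch safeguard is needed.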

We will look at:

Sketching in the variable domain (subspace methods)

(9)

Randomised Subspace Gauss-Newton (R-SGN) methods: variable sketching

R-SGN with quadratic regularisation / trust region [C,Fowkes,Shao’20]: at iteration $k$,

• randomly draw $p \times d$ sketching matrix $S_k$, $p \ll d$; calculate the subspace-Jacobian $J(x_k)S_k^T$ and the reduced local quadratic model, for $\hat{s} \in \mathbb{R}^p$,

$$q_k(\hat{s}) = \frac{1}{2}\left\|\left(J(x_k)S_k^T\right)\hat{s} + r(x_k)\right\|_2^2 = f(x_k) + \langle S_k J(x_k)^T r(x_k), \hat{s}\rangle + \frac{1}{2}\langle \hat{s}, S_k J(x_k)^T J(x_k) S_k^T \hat{s}\rangle.$$

• solve the reduced subproblem (inexactly) to find $\hat{s}_k \in \mathbb{R}^p$,

$$\min_{\hat{s} \in \mathbb{R}^p} q_k(\hat{s}) \left(+ \frac{\sigma_k}{2}\|S_k^T \hat{s}\|_2^2\right) \quad \text{or} \quad \left(\text{such that } \|S_k^T \hat{s}\| \le \Delta_k\right).$$

• compute the ratio

$$\rho_k = \frac{f(x_k) - f(x_k + S_k^T \hat{s}_k)}{f(x_k) - q_k(\hat{s}_k)}.$$

• set $x_{k+1} = x_k + S_k^T \hat{s}_k$ and $\sigma_{k+1} < \sigma_k$ ($\Delta_{k+1} > \Delta_k$) if $\rho_k \ge \eta_1$; else, $x_{k+1} = x_k$ and $\sigma_{k+1} > \sigma_k$ ($\Delta_{k+1} < \Delta_k$).
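The iteration above can be sketched as follows for the quadratic-regularisation variant with a scaled Gaussian $S_k$. This is an illustrative sketch only, not the authors' implementation: the function name `rsgn`, the parameter defaults (`sigma0`, `eta1`, `gamma`), the exact subproblem solve and the simple halving/doubling update of $\sigma_k$ are all assumptions.

```python
import numpy as np

def rsgn(r, J, x0, p, sigma0=1.0, eta1=0.1, gamma=2.0,
         max_iter=200, tol=1e-10, seed=0):
    """Randomised subspace Gauss-Newton sketch (quadratic regularisation)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    sigma = sigma0
    d = x.size
    for _ in range(max_iter):
        rx = r(x)
        fx = 0.5 * rx @ rx
        if fx <= tol:
            break
        S = rng.standard_normal((p, d)) / np.sqrt(p)   # sketching matrix S_k (p x d)
        # A practical code would evaluate only J(x_k) S_k^T; the full Jacobian
        # is formed here purely for simplicity.
        JS = J(x) @ S.T                                 # subspace Jacobian (n x p)
        # Reduced subproblem: min 0.5*||JS s + r||^2 + 0.5*sigma*||S^T s||^2.
        H = JS.T @ JS + sigma * (S @ S.T)
        s_hat = np.linalg.solve(H, -JS.T @ rx)
        step = S.T @ s_hat
        predicted = fx - 0.5 * np.linalg.norm(JS @ s_hat + rx) ** 2
        r_trial = r(x + step)
        actual = fx - 0.5 * r_trial @ r_trial
        if predicted > 0 and actual / predicted >= eta1:
            x = x + step                                # successful: accept step
            sigma = max(sigma / gamma, 1e-8)            # relax regularisation
        else:
            sigma *= gamma                              # unsuccessful: reject step
    return x
```

A call such as `rsgn(r, J, x0, p=5)` only ever solves $p \times p$ linear systems, which is the point of sketching in the variable domain.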

(10)

Global rates of convergence for R-SGN methods

Assumptions:

• $r$, $J$ are Lipschitz continuous; [smoothness]

• Let $\epsilon_s, \delta_s \in (0,1)$. At each iterate $x_k$, with probability at least $1 - \delta_s$,
$\|S_k \nabla f(x_k)\|_2^2 \ge (1 - \epsilon_s)\|\nabla f(x_k)\|_2^2$, and $\|S_k\|_2 \le S_{\max}$. [sketching accuracy]

• Typical inexact model minimization conditions for quadratic regularisation/trust-region

Theorem[R-SGN]: Let $\epsilon > 0$, and $\delta_1 \in (0,1)$ such that $(1 - \delta_s)\delta_1 > c$, where $c$ is a user-chosen parameter (*). Then the R-SGN algorithm takes at most

$$N \le \left[(1 - \delta_s)\delta_1 - c\right]^{-1} \mathcal{O}\!\left(f(x^0)(1 - \epsilon_s)^{-1}\epsilon^{-2}\right)$$

iterations and evaluations of the residual and sketched Jacobian such that

$$\min_{k \le N} \|\nabla f(x_k)\|_2 \le \epsilon, \quad \text{with probability at least } 1 - e^{-\frac{(1-\delta_1)^2}{2}(1-\delta_s)N}.$$

[(*) The constant $c$ connects the updates of the regularisation/trust-region parameter]

This bound matches deterministic complexity bounds for first-order and Gauss-Newton methods despite having only partial Jacobian information available at each iteration. [C,Fowkes,Shao’20]

(11)

Global rates of convergence for R-SGN methods

Proof Idea: uses techniques from probabilistic models complexity analyses [Gratton et al’18; C, Scheinberg’18]

• True/false iterations, successful/unsuccessful iterations

• There can be at most $C[f(x_0) - f_*]\epsilon^{-2}$ true and successful iterations (from sufficient decrease condition) and $f_* = 0$.

• Sketching accuracy assumption gives: for any $\delta \in (0,1)$,

$$\mathbb{P}(T < \delta N) \le e^{-(1-\delta)^2 N},$$

where $T$ and $N$ are the total number of true and total iterations, respectively.

(12)

Global rates of convergence for R-SGN methods

Satisfying the sketching accuracy assumption

Sufficient for each $S_k$ to be a (one-sided) $\epsilon_s$-subspace embedding for one-dimensional vectors, so that the gradient can be embedded correctly.

• $S_k$: $p \times d$ matrix with iid scaled Gaussian entries with $p = \mathcal{O}(|\log\delta_s|\,\epsilon_s^{-2})$

• $S_k$: $p \times d$ $s$-hashing matrix with $p = \mathcal{O}(|\log\delta_s|\,\epsilon_s^{-2})$

• $S_k$: $p \times d$ sampling matrix: need non-uniformity dependent subspace embeddings for vectors $y$ with $\|y\|_\infty \cdot \|y\|_2^{-1} \le \nu_s$. Then $p = \mathcal{O}(d\,\nu_s^2\,|\log\delta_s|\,\epsilon_s^{-2})$. This implies that sampling $S_k$ embeds correctly the gradient whenever $\|\nabla f(x)\|_\infty \cdot \|\nabla f(x)\|_2^{-1} \le \nu_s$, the gradient components are similar in magnitude.

Comparison with probabilistic models

Our sketching assumption is weaker than probabilistic model conditions [Bandeira, Scheinberg, Vicente’13]: one-sided length preservation of gradient; not required to embed subspace; numerical example

[C,Fowkes,Shao’20]
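The one-sided condition $\|S_k g\|_2^2 \ge (1 - \epsilon_s)\|g\|_2^2$ can be estimated empirically for a fixed vector $g$ standing in for the gradient. The sketch below is an assumption-laden illustration: the helper names, the $\sqrt{d/p}$ scaling of the sampling matrix, and the two test vectors (a "flat" one with equal components and a "spiky" one with a single non-zero) are all hypothetical choices.

```python
import numpy as np

def gaussian_sketch(p, d, rng):
    """p x d matrix with i.i.d. N(0, 1/p) entries."""
    return rng.standard_normal((p, d)) / np.sqrt(p)

def sampling_sketch(p, d, rng):
    """p distinct rows of the identity, scaled by sqrt(d/p) (assumed scaling)."""
    idx = rng.choice(d, size=p, replace=False)
    S = np.zeros((p, d))
    S[np.arange(p), idx] = np.sqrt(d / p)
    return S

def success_rate(make_sketch, g, eps_s, trials, rng):
    """Fraction of draws S with ||S g||^2 >= (1 - eps_s) ||g||^2."""
    p_target = (1.0 - eps_s) * (g @ g)
    hits = 0
    for _ in range(trials):
        S = make_sketch(rng)
        hits += int(np.linalg.norm(S @ g) ** 2 >= p_target)
    return hits / trials

rng = np.random.default_rng(0)
d, p, eps_s, trials = 100, 10, 0.5, 200
g_flat = np.ones(d)                      # components similar in magnitude
g_spiky = np.zeros(d); g_spiky[0] = 1.0  # one dominant component

for name, g in (("flat", g_flat), ("spiky", g_spiky)):
    rate_g = success_rate(lambda R: gaussian_sketch(p, d, R), g, eps_s, trials, rng)
    rate_s = success_rate(lambda R: sampling_sketch(p, d, R), g, eps_s, trials, rng)
    print(f"{name:5s} gradient: Gaussian {rate_g:.2f}, sampling {rate_s:.2f}")
```

For the spiky vector, sampling succeeds only when the single non-zero coordinate is drawn, which happens with probability $p/d$; the Gaussian sketch does not depend on how the mass of $g$ is distributed.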

(13)

Block-Coordinate Gauss-Newton (BC-GN) methods

BC-GN = R-SGN with sampling matrix $S_k$

When $S_k$ is a sampling matrix, $J(x_k)S_k^T$ is a random subset of size $p$ of the columns $\partial r / \partial x_i$ of $J(x_k)$.

Theorem[R-SGN] $\Longrightarrow$ global rate of convergence of BC-GN (with quadratic regularisation or trust region) with high probability, provided the gradient has similar components in magnitude

Under more general assumptions, we can obtain a weaker global rate analysis for BC-GN with fixed and arbitrary block size.

Assume that each coordinate block $B_k$ of size $p$ is drawn with probability $P_k$ (with replacement or from a partition). Then

$$\mathbb{E}_{B_k}\|\nabla_{B_k} f(x_k)\|^2 \ge P_{\min} R\, \|\nabla f(x_k)\|^2,$$

where each coordinate appears $R$ times in the set of all possible blocks. [$\nabla_{B_k} f(x_k) = J_{B_k}(x_k)^T r(x_k)$]

Theorem[BC-GN]: Assume $r$, $J$ Lipschitz continuous. Then the number of BC-GN iterations/evaluations until $\mathbb{E}(\|\nabla f(x_k)\|^2) \le \epsilon^2$ is at most $\mathcal{O}\!\left(f(x^0)(P_{\min}R)^{-1}\epsilon^{-2}\right)$. In particular, when blocks are drawn uniformly at random, such as from a partition, then $P_{\min}R = p\,d^{-1}$.
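The uniform-partition case can be verified directly: with $d/p$ disjoint blocks, each drawn with probability $p/d$ and each coordinate appearing in exactly one block ($R = 1$), the expectation equals $(p/d)\|\nabla f\|^2$. The sketch below uses hypothetical helper names and an arbitrary vector `g` in place of the gradient, and assumes $p$ divides $d$.

```python
import numpy as np

def partition_blocks(d, p):
    """Consecutive coordinate blocks of size p (p is assumed to divide d)."""
    return [np.arange(start, start + p) for start in range(0, d, p)]

def expected_block_gradient_sq(g, blocks, probs):
    """E_B ||g_B||^2 = sum over blocks of P(B) * ||g_B||^2."""
    return sum(P * np.sum(g[B] ** 2) for B, P in zip(blocks, probs))

rng = np.random.default_rng(0)
d, p = 12, 3
g = rng.standard_normal(d)                 # stands in for the gradient of f at x_k
blocks = partition_blocks(d, p)
probs = [p / d] * len(blocks)              # uniform draw from the partition

lhs = expected_block_gradient_sq(g, blocks, probs)
rhs = (p / d) * np.sum(g ** 2)             # P_min * R * ||g||^2 with R = 1
print("E||g_B||^2 =", lhs, "  (p/d)||g||^2 =", rhs)
```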

(14)

R-SGN/BC-GN methods: numerical experiments

[C,Fowkes,Shao’20; WPaper 20]

BC-GN with TR on logistic regression for chemotherapy dataset (Python code)

(15)

R-SGN/BC-GN methods: numerical experiments

[C,Fowkes,Shao’20]

BC-GN with TR on logistic regression on gisette dataset

(16)

Nonlinear least squares:

derivative-free methods

(17)

Subspace derivative-free Gauss-Newton methods for NNLS

Sketching DFO-GN/DFO-LS in $d$ (number of variables/size of interpolation set)

Fewer evaluations and lower linear algebra cost per iteration. Global efficiency?

[Roberts, PhD Thesis’19; C, Roberts’20]

• Use interpolation set $\{x_k, y_1, \ldots, y_p\}$ for $p < d$, then solve

$$\begin{pmatrix} (y_1 - x_k)^T \\ \vdots \\ (y_p - x_k)^T \end{pmatrix} \hat{J}_k^T = \begin{pmatrix} (r(y_1) - r(x_k))^T \\ \vdots \\ (r(y_p) - r(x_k))^T \end{pmatrix}$$

• Underdetermined system $\Longrightarrow$ take minimal norm solution.

• Computational Cost = factorization + solve = $\mathcal{O}(dp^2) + \mathcal{O}(np^2) \approx \mathcal{O}(np^2)$

• Evaluation cost: only need $p$ evaluations of $r$ on first iteration and a small number/multiple of $p$ subsequently

• Choose $p$ based on computational resources/evaluation cost
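The minimal-norm interpolation solve can be sketched as below. The function name `interpolation_jacobian` is hypothetical, and `numpy.linalg.lstsq` (which returns the minimum 2-norm solution of an underdetermined system) is used for convenience in place of the factorization-plus-solve the slide has in mind.

```python
import numpy as np

def interpolation_jacobian(r, x_k, Y_points):
    """Minimal-norm J_hat (n x d) interpolating r on {x_k, y_1, ..., y_p}, p < d.

    Y_points is a (p, d) array whose rows are the points y_1, ..., y_p.
    """
    r_k = r(x_k)
    W = Y_points - x_k                               # rows (y_i - x_k)^T, p x d
    R = np.array([r(y) - r_k for y in Y_points])     # rows (r(y_i) - r(x_k))^T, p x n
    J_hat_T, *_ = np.linalg.lstsq(W, R, rcond=None)  # min-norm solution of W J^T = R
    return J_hat_T.T

# Hypothetical linear residual r(x) = A x + b, so the true Jacobian is A.
rng = np.random.default_rng(0)
n, d, p = 6, 8, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
r = lambda x: A @ x + b

x_k = rng.standard_normal(d)
Y_points = x_k + rng.standard_normal((p, d))
J_hat = interpolation_jacobian(r, x_k, Y_points)
W = Y_points - x_k
print("interpolation error:", np.abs(W @ J_hat.T - W @ A.T).max())
```

In this linear example `J_hat` agrees with `A` on the span of the directions $y_i - x_k$ and is zero on its orthogonal complement, which is what "minimal norm" buys with only $p$ extra evaluations.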

(18)

Subspace derivative-free Gauss-Newton methods for NNLS

DFBGN (Derivative-Free Block Gauss-Newton) Algorithm

• Build low-dimensional model and calculate trust-region step, $\hat{s}_k \in \mathbb{R}^p$,

$$\min_{\hat{s} \in \mathbb{R}^p} \frac{1}{2}\|r(x_k) + \hat{J}_k \hat{s}\|^2 \quad \text{s.t.} \quad \|\hat{s}\| \le \Delta_k$$

• Evaluate $f(x_k + Q_k \hat{s}_k)$, accept/reject step, and update $\Delta_k$ (usual DFO choices)
(where $Q_k$ is basis of interpolation set $\mathcal{Y}_k = \{y_1 - x_k, \ldots, y_p - x_k\}$)

• Add $x_k + Q_k \hat{s}_k$ to interpolation set and remove $p_{\text{drop}} \ge 2$ points from the interpolation set

• Add random orthogonal directions $x_k + \Delta_k d$ for $d \perp \mathcal{Y}_k$ until we have $p + 1$ interpolation points

Comments:

• Linear algebra cost $\mathcal{O}(np^2 + dp^2 + p^3)$ vs $\mathcal{O}(nd^2 + d^3)$ full space method

• Choosing points to remove uses Lagrange polynomials (geometry-aware)

• Choice of $p_{\text{drop}}$: $p_{\text{drop}} = 2$ on successful iterations, $p/10$ on unsuccessful iterations
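One way to generate the random orthogonal directions used to refill the interpolation set is sketched below. This is an assumed construction (Gaussian draws, projection, then QR), not necessarily the one used in DFBGN, and the name `random_orthogonal_directions` is hypothetical.

```python
import numpy as np

def random_orthogonal_directions(Q, k, rng):
    """Return a d x k matrix with orthonormal columns, all orthogonal to span(Q).

    Q is d x q with orthonormal columns (a basis of the current interpolation
    directions); q + k <= d is assumed.
    """
    d = Q.shape[0]
    G = rng.standard_normal((d, k))
    G -= Q @ (Q.T @ G)          # remove the components lying in span(Q)
    D, _ = np.linalg.qr(G)      # orthonormalise the remaining directions
    return D

rng = np.random.default_rng(0)
d, q, k = 20, 4, 3
Q, _ = np.linalg.qr(rng.standard_normal((d, q)))   # stand-in for the basis Q_k
D = random_orthogonal_directions(Q, k, rng)

x_k, delta_k = np.zeros(d), 0.5
new_points = x_k + delta_k * D.T                   # rows are x_k + Delta_k * d_j
print("max |Q^T D| =", np.abs(Q.T @ D).max())
```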

(19)

Subspace derivative-free Gauss-Newton methods for NNLS

Numerical results for DFBGN algorithm

Choose test set CUTEst with $d \approx 1000$, max 12hrs per problem

Relative accuracy=0.1 vs budget

[Figure: proportion of problems solved vs. budget / min budget of any solver. Curves: DFO-LS, DFO-LS (init n/100), DFBGN (p = n), DFBGN (p = n/2), DFBGN (p = n/10), DFBGN (p = n/100). In figures, n → d.]

Solver            Timeout
DFOLS             93%
DFBGN (d/100)     35%
DFBGN (d/10)      74%
DFBGN (d/2)       82%

DFBGN outperforms DFO-LS for low accuracy solutions …because it does not time out!

(20)

Subspace derivative-free Gauss-Newton methods for NNLS

Numerical results for DFBGN algorithm

Other advantage: DFBGN makes progress after $p \ll d$ evaluations (especially important when $d$ large)

[Figure: normalized objective value vs. budget (in gradients), 12hr timeout. Panels: ARWHDNE, d=2000 and CHANDHEQ, d=2000. Curves: DFO-LS, DFO-LS (init n/100), DFBGN with p = n, p = n/2, p = n/10, p = n/100. In figures, n → d.]

(21)

Random embeddings for global optimization

(22)

Global optimization of functions with low effective dimensionality

$$\min_{x} f(x) \quad \text{subject to} \quad x \in \mathcal{X} = [-1,1]^d$$

Global optimization is generally NP-hard. Can global optimization algorithms be made efficient for 'simpler' problems? What is problem/data 'simplicity'? Can algorithms adapt to data (without knowing it a priori)?

Problem simplicity: Functions which do not vary along certain linear subspaces.

Alternative names: low effective dimensionality, (multi-)ridge, planar waves, active subspaces

Applications: hyper-parameter optimization; complex engineering simulations; parametric, stochastic PDEs; over-parametrized DNNs?

(23)

Global optimization of functions with low effective dimensionality

Functions with low effective dimensionality [Wang et al.’13]: $f : \mathbb{R}^d \to \mathbb{R}$ has effective dimensionality $d_e \le d$ if there exists a linear subspace $\mathcal{T}$ of dimension $d_e$ such that $f(x_\top + x_\perp) = f(x_\top)$ for all vectors $x_\top$ in $\mathcal{T}$ and $x_\perp$ in $\mathcal{T}^\perp$. [$d_e$ is the smallest integer satisfying these properties]. Dimensions of interest: $d_e \le p \le d$.

Challenging set-up: The objective function is black box. The orientation of the important subspace is not known.

Solution: Random embeddings [Ziyu Wang et al. Bayesian optimization in a billion dimensions via random embeddings. J. Artif. Int. Res., 55(1), 2016.]

Random embeddings $\longrightarrow$ lower dimensional problems

Replace $f(x)$ by $f(S^T y)$, where $S$ is a $p \times d$ Gaussian matrix, and $p \ll d$.
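A small numerical illustration of both ideas follows. The test function, its dimensions and all names (`U`, `z_star`, `embed_minimiser`) are assumptions made for this example: `f` depends on `x` only through `U.T @ x`, so it has effective dimensionality `d_e`, and for a Gaussian `S` with `p >= d_e` there is a `y` with `f(S.T @ y)` equal to the global minimum.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_e, p = 100, 3, 5

# Orthonormal basis U (d x d_e) of the effective subspace T (unknown in practice).
U, _ = np.linalg.qr(rng.standard_normal((d, d_e)))
z_star = np.array([0.3, -0.2, 0.1])   # minimiser in subspace coordinates

def f(x):
    """Depends on x only through U^T x, so it is constant along T-perp; min value 0."""
    return float(np.sum((U.T @ x - z_star) ** 2))

def embed_minimiser(S):
    """Minimal 2-norm y with U^T S^T y = z_star, hence f(S^T y) = 0."""
    return np.linalg.pinv(U.T @ S.T) @ z_star

# Invariance along the orthogonal complement of T.
x = rng.standard_normal(d)
v = rng.standard_normal(d)
x_perp = v - U @ (U.T @ v)
print("f(x + x_perp) - f(x) =", f(x + x_perp) - f(x))

# A random p x d Gaussian embedding still reaches the global minimum value.
S = rng.standard_normal((p, d))
y_star = embed_minimiser(S)
print("f(S^T y_star) =", f(S.T @ y_star), " with ||y_star||_2 =", np.linalg.norm(y_star))
```

The construction works because `U.T @ S.T` is a `d_e x p` Gaussian matrix, which has full row rank with probability one when `p >= d_e`.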

(24)

Global optimization of functions with low effective dimensionality

Reduced optimization problems [C, Otemissov’20; C, Massart, Otemissov’20]

$$\text{(R)} \quad \min_{y \in \mathbb{R}^p} f(S^T y + u) \quad \text{s.t.} \quad y \in Y = [-a, a]^p$$

$$\text{(AR)} \quad \min_{y \in \mathbb{R}^p} f(S^T y + u) \quad \text{s.t.} \quad S^T y + u \in \mathcal{X}$$

Assumption: $p \ge d_e$.

REGO algorithm: (single random embedding)

$u = 0$; solves (R) once (using any global solver); $f(S^T y_*) \approx f_*$, with $S^T y_*$ an unconstrained solution

AREGO algorithm: (multiple random embedding)

solves (AR) multiple times (with any global solver); updates $u$ to best point found so far; $f(S^T y_*) \approx f_*$, $S^T y_* \in \mathcal{X}$

Theoretical analysis

REGO: best-known probability of success of (R) and suitable choices of $a$, depends only on $p, d_e$, not on ambient dimension $d$

AREGO: probability of success of (AR) and convergence of AREGO, depends on $d$ algebraically, not exponentially

Numerical experiments: confirm the theoretical findings; include replacing global solvers with local ones

(25)

Global optimization of functions with low effective dimensionality

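(R) reduced subproblem and REGO algorithm

$$\min_{y \in \mathbb{R}^p} f(S^T y) \quad \text{s.t.} \quad y \in Y = [-a, a]^p, \qquad y_2^* = \arg\min_{y:\, f(S^T y) = f_*} \|y\|_2.$$

We can show that

$$\frac{\|x_\top^*\|_2^2}{\|y_2^*\|_2^2} \sim \chi^2_{p - d_e + 1} \;\Longrightarrow\; \mathbb{P}(\text{(R) is successful}) \ge 1 - C(q)\left(1 + \frac{q}{2} e^{-c^2/2}\right)\left(\frac{c^2}{2}\right)^{q/2},$$

where $q = p - d_e + 1$, $c = \|x_\top^*\|/a$.

[Figure: BARON on GO problems with low effective dimensionality, using (R). In figures, D → d and d → p.]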

(26)

Global optimization of functions with low effective dimensionality

(AR) reduced subproblem and AREGO algorithm

$$\min_{y \in \mathbb{R}^p} f(S^T y + u) \quad \text{s.t.} \quad S^T y + u \in \mathcal{X}$$

$$\mathbb{P}(\text{(AR) is successful}) \ge \mathbb{P}(-1 \le S^T y_2^* + u \le 1) > \tau(d) > 0 \quad (\longrightarrow \text{t-distribution})$$

Convergence of AREGO, with prob one, proved to a neighbourhood of global minimum of original problem; multiple embeddings $S_k$ used.

[Figure: (AR) solved with BARON and with local KNITRO. Same tests as for REGO, functions with low effective dimensionality. In figures, D → d and d → p.]