Machine Learning from a Continuous Viewpoint. Weinan E. Princeton University

(1)

Machine Learning from a Continuous Viewpoint

Weinan E

Princeton University

Joint work with:

Chao Ma, Lei Wu

(2)

Examples

Standard ML problems for which we are given the dataset:

1 Supervised learning: Given $S = \{(x_j, y_j = f^*(x_j)),\ j \in [n]\}$, learn $f^*$.

Objective: Minimize the population risk over the “hypothesis space”:

$$R(f) = \mathbb{E}_{x\sim\mu}\,(f(x) - f^*(x))^2$$

2 Dimension reduction: Given $S = \{x_j,\ j \in [n]\} \subset \mathbb{R}^D$ sampled from $\mu$, find a mapping $\Phi: \mathbb{R}^D \to \mathbb{R}^d$ ($d \ll D$) that best preserves all important features of $\mu$. “Auto-encoder”: Minimize the reconstruction error $x - \tilde{x}$, where $\tilde{x} = \Psi(z)$, $z = \Phi(x)$.

(3)

Non-standard ML problem, no dataset given beforehand:

1 Ground state of quantum many-body problem: Let $H = -\frac{\hbar^2}{2m}\Delta + V$ be the Hamiltonian operator of the system.

$$I(\phi) = \frac{(\phi, H\phi)}{(\phi, \phi)} = \mathbb{E}_{x\sim\mu_\phi}\frac{\phi(x)H\phi(x)}{\phi(x)^2}, \qquad \mu_\phi(dx) = \frac{1}{(\phi,\phi)}\,\phi^2(x)\,dx$$

subject to the constraint imposed by the Pauli exclusion principle.

2 Stochastic control problems:

$$s_{t+1} = s_t + b_t(s_t, a_t) + \xi_{t+1},$$

$s_t$ = state at time $t$, $a_t$ = control at time $t$, $\xi_t$ = i.i.d. noise

$$L(\{a_t\}_{t=0}^{T-1}) = \mathbb{E}_{\{\xi_t\}}\Big\{\sum_{t=0}^{T-1} c_t(s_t, a_t(s_t)) + c_T(s_T)\Big\}$$

(4)

Remark: High dimensionality

Benchmark: High dimensional integration

$$I(g) = \int_{X=[0,1]^d} g(x)\,d\mu, \qquad I_m(g) = \frac{1}{m}\sum_j g(x_j)$$

Grid-based quadrature rules:

$$I(g) - I_m(g) \sim \frac{C(g)}{m^{\alpha/d}}$$

Appearance of $1/d$ in the exponent of $m$: Curse of dimensionality (CoD)! If we want $m^{-\alpha/d} = 0.1$, then $m = 10^{d/\alpha} = 10^d$ if $\alpha = 1$.

Monte Carlo: {x_j, j ∈ [m]} is uniformly distributed in X.

$$\mathbb{E}(I(g) - I_m(g))^2 = \frac{\mathrm{var}(g)}{m}, \qquad \mathrm{var}(g) = \int_X g^2(x)\,dx - \Big(\int_X g(x)\,dx\Big)^2$$
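The two error rates can be compared numerically. The following is a minimal sketch, not part of the original slides: the integrand `g` (mean of the squared coordinates, whose exact integral over $[0,1]^d$ is $1/3$), the dimension `d = 10`, and the choice of the midpoint rule are all illustrative assumptions. With the same budget of $m = 2^{10}$ nodes, the tensor-product grid can afford only two points per axis, while the Monte Carlo error is governed by $\mathrm{var}(g)/m$.

```python
import itertools
import numpy as np

def g(x):
    # Hypothetical test integrand: mean of squared coordinates.
    # Its exact integral over [0,1]^d is 1/3 for every d.
    return np.mean(x**2, axis=-1)

def midpoint_grid(d, k):
    # Tensor-product midpoint rule with k cells per axis, i.e. m = k**d nodes.
    pts_1d = (np.arange(k) + 0.5) / k
    nodes = np.array(list(itertools.product(pts_1d, repeat=d)))
    return g(nodes).mean()

def monte_carlo(d, m, rng):
    # Plain Monte Carlo with m points drawn uniformly from [0,1]^d.
    return g(rng.random((m, d))).mean()

rng = np.random.default_rng(0)
d, k = 10, 2
m = k**d                      # same number of nodes for both methods
exact = 1.0 / 3.0

grid_err = abs(midpoint_grid(d, k) - exact)
mc_errs = [abs(monte_carlo(d, m, rng) - exact) for _ in range(20)]
mc_rmse = np.sqrt(np.mean(np.square(mc_errs)))

print(f"m = {m}, grid error = {grid_err:.4f}, Monte Carlo RMSE = {mc_rmse:.4f}")
```

Increasing `d` at fixed `k` makes the grid cost $k^d$ grow exponentially without improving its accuracy for this integrand, whereas the Monte Carlo estimate depends on `d` only through $\mathrm{var}(g)$.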

(5)

Overall strategy:

Formulate a “nice” continuous problem, then

discretize to get concrete models/algorithms.

For PDEs, “nice” = well-posed.

For calculus of variations problems, “nice” = “convex”, lower semi-continuous.

(6)

How do we represent a function? An illustrative example

Traditional approach for Fourier transform:

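$$f(x) = \int_{\mathbb{R}^d} a(\omega)e^{i(\omega,x)}\,d\omega, \qquad f_m(x) = \frac{1}{m}\sum_j a(\omega_j)e^{i(\omega_j,x)}$$

$\{\omega_j\}$ is a fixed grid, e.g. uniform.

$$\|f - f_m\|_{L^2(X)} \le C_0\, m^{-\alpha/d}\,\|f\|_{H^\alpha(X)}$$

“New” approach: Let $\pi$ be a probability distribution and

$$f(x) = \int_{\mathbb{R}^d} a(\omega)e^{i(\omega,x)}\,\pi(d\omega) = \mathbb{E}_{\omega\sim\pi}\,a(\omega)e^{i(\omega,x)}$$

Let $\{\omega_j\}$ be an i.i.d. sample of $\pi$, $f_m(x) = \frac{1}{m}\sum_{j=1}^m a(\omega_j)e^{i(\omega_j,x)}$. Then

$$\mathbb{E}|f(x) - f_m(x)|^2 = \frac{\mathrm{var}(f)}{m}$$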
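A small numerical illustration of the sampling-based representation, as a sketch under assumptions that are not in the slides: $\pi$ is taken to be the standard Gaussian on $\mathbb{R}^d$ and $a \equiv 1$, in which case the expectation is the Gaussian characteristic function, $f(x) = e^{-\|x\|^2/2}$. The dimension `d = 20` and the test points are arbitrary choices.

```python
import numpy as np

def f_exact(x):
    # E_{omega ~ N(0, I)} exp(i <omega, x>) = exp(-|x|^2 / 2)
    return np.exp(-0.5 * np.sum(x**2, axis=-1))

def f_m(x, omegas):
    # (1/m) sum_j a(omega_j) exp(i <omega_j, x>) with a = 1.
    # The exact value is real, so only the real part of the average is kept.
    return np.mean(np.exp(1j * x @ omegas.T), axis=-1).real

rng = np.random.default_rng(0)
d = 20
x = rng.normal(size=(200, d)) / np.sqrt(d)   # test points with |x| close to 1

rmse = {}
for m in (100, 10_000):
    omegas = rng.normal(size=(m, d))         # i.i.d. sample of pi
    rmse[m] = np.sqrt(np.mean((f_m(x, omegas) - f_exact(x)) ** 2))
    print(f"m = {m:6d}, RMSE over test points = {rmse[m]:.4f}")
```

The error should shrink roughly like $m^{-1/2}$, with no dependence on `d` in the exponent, in contrast to the grid-based bound.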

(7)

Integral transform-based representation

Let $\sigma$ be a scalar nonlinear function (activation function), e.g. $\sigma$ = ReLU. Consider functions represented in the form:

$$f(x;\theta) = \int_{\mathbb{R}^d} a(w)\sigma(w^Tx)\,\pi(dw) = \mathbb{E}_{w\sim\pi}\,a(w)\sigma(w^Tx) = \mathbb{E}_{(a,w)\sim\rho}\,a\sigma(w^Tx)$$

θ = parameter to be optimized:

θ = a(·) corresponds to a feature-based model

θ = ρ corresponds to a two-layer neural network-like model.

(8)

Discretize

Fourier method: $\pi \sim \frac{1}{m}\sum_j \delta_{w_j}$, where $\{w_j\}$ lives on a uniform lattice. Optimize $a(\cdot)$:

$$f(x;\theta) \sim f_m(x) = \frac{1}{m}\sum_j a(w_j)\sigma(w_j^Tx)$$

Neural network-based method: $\rho \sim \frac{1}{m}\sum_j \delta_{(a_j,w_j)}$ ($\{w_j\}$ is also optimized):

$$f(x;\theta) \sim f_m(x) = \frac{1}{m}\sum_j a_j\sigma(w_j^Tx)$$

then optimize (say, using L-BFGS). This is more in line with traditional numerical analysis (e.g. nonlinear finite element or meshless methods).

(9)

For truly large datasets, we need to use stochastic algorithms

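The objective functions are all expressed as expectations:

$$R(\theta) = \mathbb{E}_{x\sim\mu}(f(x;\theta) - f^*(x))^2$$

$$R(\theta_1,\theta_2) = \mathbb{E}_{x\sim\mu}(x - \Psi(\Phi(x;\theta_1);\theta_2))^2$$

$$I(\theta) = \mathbb{E}_{x\sim\mu_\theta}\frac{\phi(x;\theta)H\phi(x;\theta)}{\phi(x;\theta)^2}$$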

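Gradient descent (GD) can be readily converted to stochastic gradient descent (SGD). Let $F = F(\theta) = \mathbb{E}_{x\sim\mu}\,g(\theta,x)$ be the objective function:

GD: $\theta_{k+1} = \theta_k - \eta\nabla_\theta\mathbb{E}_{x\sim\mu}\,g(\theta_k,x)$

SGD: $\theta_{k+1} = \theta_k - \eta\nabla_\theta g(\theta_k,x_k)$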
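The GD and SGD updates can be placed side by side on a toy objective. This is a sketch under assumed choices that do not come from the slides: $g(\theta,x) = (\theta^Tx - \theta_*^Tx)^2$ with $x \sim N(0,I)$, so that $F(\theta) = \|\theta - \theta_*\|^2$ is available in closed form, and a fresh sample $x_k$ is drawn from $\mu$ at every SGD step. The names `theta_star`, `grad_g` and `grad_F` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta_star = rng.normal(size=d)    # hypothetical minimizer of F

def grad_g(theta, x):
    # Gradient in theta of g(theta, x) = (theta.x - theta_star.x)^2
    return 2.0 * (theta @ x - theta_star @ x) * x

def grad_F(theta):
    # Exact gradient of F(theta) = E_x g(theta, x) = |theta - theta_star|^2
    return 2.0 * (theta - theta_star)

eta, steps = 0.01, 2000
theta_gd = np.zeros(d)
theta_sgd = np.zeros(d)
for k in range(steps):
    # GD: uses the gradient of the full expectation
    theta_gd = theta_gd - eta * grad_F(theta_gd)
    # SGD: replaces the expectation by a single fresh sample x_k ~ mu
    x_k = rng.normal(size=d)
    theta_sgd = theta_sgd - eta * grad_g(theta_sgd, x_k)

print("GD  error:", np.linalg.norm(theta_gd - theta_star))
print("SGD error:", np.linalg.norm(theta_sgd - theta_star))
```

In this noiseless example the stochastic gradient vanishes at the minimizer, so SGD also converges with a constant step size; with noisy targets one would typically need a decaying step size or averaging.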

(10)

Optimization: Defining gradient flows

“Free energy” = $R(f) = \mathbb{E}_{x\sim\mu}(f(x) - f^*(x))^2$

$$f(x) = \int a(w)\sigma(w^Tx)\,\pi(dw) = \mathbb{E}_{w\sim\pi}\,a(w)\sigma(w^Tx)$$

Follow Halperin and Hohenberg (1977):

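$a$ = non-conserved, use “model A” dynamics (Allen-Cahn):

$$\frac{\partial a}{\partial t} = -\frac{\delta R}{\delta a}$$

$\pi$ = conserved (probability density), use “model B” (Cahn-Hilliard):

$$\frac{\partial \pi}{\partial t} + \nabla\cdot J = 0, \qquad J = \pi v, \quad v = -\nabla V, \quad V = \frac{\delta R}{\delta \pi}$$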

(11)

Gradient flow for the feature-based model

Fix π, optimize a.

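$$\partial_t a(w,t) = -\frac{\delta R}{\delta a}(w,t) = -\int a(\tilde w,t)K(w,\tilde w)\,\pi(d\tilde w) + \tilde f(w)$$

$$K(w,\tilde w) = \mathbb{E}_x[\sigma(w^Tx)\sigma(\tilde w^Tx)], \qquad \tilde f(w) = \mathbb{E}_x[f^*(x)\sigma(w^Tx)]$$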

This is an integral equation with a symmetric positive definite kernel.
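For reference, the form of the right-hand side can be checked by computing the first variation of $R$ directly. The sketch below assumes two conventions that are not stated on the slide: the variational derivative is taken with respect to the $\pi$-weighted inner product, and the constant factor 2 is absorbed into the time scale. The perturbation direction $b$ is a hypothetical test function.

```latex
\begin{aligned}
R(a) &= \mathbb{E}_x\Big(\int a(w)\sigma(w^Tx)\,\pi(dw) - f^*(x)\Big)^2, \\
\frac{d}{d\epsilon}R(a+\epsilon b)\Big|_{\epsilon=0}
  &= 2\,\mathbb{E}_x\Big[(f(x)-f^*(x))\int b(w)\sigma(w^Tx)\,\pi(dw)\Big] \\
  &= 2\int b(w)\Big(\int a(\tilde w)K(w,\tilde w)\,\pi(d\tilde w) - \tilde f(w)\Big)\pi(dw), \\
\int\!\!\int b(w)K(w,\tilde w)b(\tilde w)\,\pi(dw)\,\pi(d\tilde w)
  &= \mathbb{E}_x\Big(\int b(w)\sigma(w^Tx)\,\pi(dw)\Big)^2 \;\ge\; 0 .
\end{aligned}
```

The last line shows why the kernel is symmetric and positive (semi-)definite: the quadratic form is the second moment of a random variable.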

Decay estimates due to convexity: Let $f^*(x) = \mathbb{E}_{w\sim\pi}\,a^*(w)\sigma(w^Tx)$,

$$I(t) = \frac{1}{2}\|a(\cdot,t) - a^*(\cdot)\|^2 + t\,(R(a(t)) - R(a^*))$$

Then we have

$$\frac{dI}{dt} \le 0, \qquad R(a(t)) \le \frac{C_0}{t}$$

(12)

Conservative gradient flow

Optimize ρ:

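$$f(x) = \mathbb{E}_{u\sim\rho}\,\phi(x,u)$$

Example: $u = (a,w)$, $\phi(x,u) = a\sigma(w^Tx)$

$$\partial_t\rho = \nabla\cdot(\rho\nabla V)$$

$$V(u) = \frac{\delta R}{\delta\rho}(u) = \mathbb{E}_x[(f(x) - f^*(x))\phi(x,u)] = \int K(u,\tilde u)\,\rho(d\tilde u) - \tilde f(u)$$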

This is the mean-field equation derived by Chizat and Bach (2018), Mei, Montanari and Nguyen (2018), Rotskoff and Vanden-Eijnden (2018), and Sirignano and Spiliopoulos (2018) by studying the continuum limit of two-layer neural networks. It does not satisfy displacement convexity.

(13)

Mixture model

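Optimize $(a,\pi)$ ($a$ = non-conservative, $\pi$ = conservative):

$$\partial_t a(w,t) = -\frac{\delta R}{\delta a}(w,t) = -\int a(\tilde w,t)K(w,\tilde w)\,\pi(d\tilde w,t) + \tilde f(w)$$

$$\partial_t\pi = \nabla\cdot(\pi\nabla\tilde V), \qquad \tilde V(w) = \frac{\delta R}{\delta\pi}(w)$$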

(14)

Discretizing the gradient flows

Discretizing the population risk (into the empirical risk) using data

Discretizing the gradient flow:

  particle method: the dynamic version of Monte Carlo
  smoothed particle method: analog of the vortex blob method
  spectral method: very effective in low dimensions

We will see that the gradient descent algorithm (GD) for random feature and neural network models is simply the particle method discretization of the gradient flows discussed before.

(15)

Particle method for the feature-based model

$$\partial_t a(w,t) = -\frac{\delta R}{\delta a}(w) = -\int a(\tilde w,t)K(w,\tilde w)\,\pi(d\tilde w) + \tilde f(w)$$

$$\pi(dw) \sim \frac{1}{m}\sum_j \delta_{w_j}, \qquad a(w_j,t) \sim a_j(t)$$

Discretized version:

$$\frac{d}{dt}a_j(t) = -\frac{1}{m}\sum_k K(w_j,w_k)\,a_k(t) + \tilde f(w_j)$$

This is exactly the GD for the random feature model.

$$f(x) \sim f_m(x) = \frac{1}{m}\sum_j a_j\sigma(w_j^Tx)$$
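The correspondence between the particle discretization and gradient descent for the random feature model can be checked numerically. The following is a sketch under assumptions not in the slides: the expectations over $x$ are replaced by averages over a synthetic dataset, the target is a single ReLU neuron, time is discretized by forward Euler with step `dt`, and the factor 2 from differentiating the square is absorbed into the learning rate, giving `eta = dt * m / 2`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 3, 50

def relu(t):
    return np.maximum(t, 0.0)

X = rng.normal(size=(n, d))          # data x_i
w_star = rng.normal(size=d)
y = relu(X @ w_star)                 # hypothetical target: a single neuron
W = rng.normal(size=(m, d))          # fixed particles w_j sampled from pi
S = relu(X @ W.T)                    # S[i, j] = sigma(w_j^T x_i)

K = S.T @ S / n                      # empirical kernel K(w_j, w_k)
f_tilde = S.T @ y / n                # empirical f~(w_j)

def risk(a):
    # Empirical risk of f_m(x) = (1/m) sum_j a_j sigma(w_j^T x)
    return np.mean((S @ a / m - y) ** 2)

def grad_risk(a):
    # Gradient of the empirical risk with respect to the coefficients a_j
    return 2.0 / m * S.T @ (S @ a / m - y) / n

dt = 0.2
eta = dt * m / 2.0                   # learning rate matching the Euler step
a_flow = np.zeros(m)
a_gd = np.zeros(m)
risk0 = risk(a_flow)
for _ in range(500):
    # Forward Euler on da_j/dt = -(1/m) sum_k K(w_j, w_k) a_k + f~(w_j)
    a_flow = a_flow + dt * (-(K @ a_flow) / m + f_tilde)
    # Plain gradient descent on the empirical risk
    a_gd = a_gd - eta * grad_risk(a_gd)

print("max difference between iterates:", np.max(np.abs(a_flow - a_gd)))
print("risk: initial", risk0, "final", risk(a_flow))
```

The two sequences of iterates agree up to rounding, which is the discrete counterpart of the statement that GD for the random feature model is the particle discretization of the flow.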

(16)

Particle method for the conservative flow

$$\partial_t\rho = \nabla\cdot(\rho\nabla V) \qquad (1)$$

Particle method discretization:

$$\rho(t,du) \sim \frac{1}{m}\sum_j \delta_{u_j(t)}$$

Define the loss function

$$I(u_1,\cdots,u_m) = R(f_m), \qquad f_m(x) = \frac{1}{m}\sum_j \phi(x,u_j)$$

Lemma: Given a set of initial data $\{u_j^0,\ j \in [m]\}$, the solution of (1) with initial data $\rho(0) = \frac{1}{m}\sum_{j=1}^m \delta_{u_j^0}$ is given by

$$\rho(t) = \frac{1}{m}\sum_{j=1}^m \delta_{u_j(t)}$$

where the particles $\{u_j(\cdot),\ j \in [m]\}$ solve:

(17)

Comparison with conventional NN models

Continuous viewpoint (in this case same as mean-field): $f_m(x) = \frac{1}{m}\sum_j a_j\sigma(w_j^Tx)$

Conventional NN models: $f_m(x) = \sum_j a_j\sigma(w_j^Tx)$

Figure: Test errors plotted against $\log_{10}(m)$ and $\log_{10}(n)$. (Left) continuous viewpoint; (Right) conventional NN models. Target function is a single neuron.

(18)

Flow-based representation

Continuous dynamical system viewpoint (E (2017), Haber and Ruthotto (2017), “Neural ODEs” (Chen et al, 2018))

$$\frac{dz}{d\tau} = g(\tau,z), \qquad z(0) = x$$

The flow-map at time 1: $x \to z(x,1)$. Trial functions:

$$f = \alpha^Tz(1)$$

(19)

How do we choose the form of $g$?

The correct form of g is given by (E, Ma and Wu, 2019):

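$$g(\tau,z) = \mathbb{E}_{w\sim\pi_\tau}\,a(w,\tau)\sigma(w^Tz)$$

where $\{\pi_\tau\}$ is a family of probability distributions.

$$\frac{dz}{d\tau} = \mathbb{E}_{w\sim\pi_\tau}\,a(w,\tau)\sigma(w^Tz) = \mathbb{E}_{(a,w)\sim\rho_\tau}\,a\sigma(w^Tz)$$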

Discretize: We obtain the “residual neural network” model:

$$z_{l+1} = z_l + \frac{1}{LM}\sum_{j=1}^M a_{j,l}\sigma(z_l^Tw_{j,l}), \quad l = 0, 1, \cdots, L-1, \qquad z_0 = V\tilde x$$
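The residual recursion above is a forward Euler step of size $1/L$ combined with an $M$-sample average per layer. The sketch below implements the forward pass and runs a simple sanity check under assumptions that are not in the slides: $z_0 = x$ (that is, $V$ is taken to be the identity), and $a$, $w$ are drawn as independent standard Gaussians, so that $\mathbb{E}\,a\sigma(w^Tz) = 0$ and the limiting flow is $dz/d\tau = 0$. In that case the network output should stay close to $x$, with fluctuations that shrink as $LM$ grows.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def resnet_forward(z0, A, W):
    """z_{l+1} = z_l + 1/(L*M) * sum_j a_{j,l} * relu(w_{j,l}^T z_l).

    A and W have shape (L, M, d): layer l, unit j holds a_{j,l} and w_{j,l}.
    """
    L, M, _ = A.shape
    z = np.array(z0, dtype=float)
    for l in range(L):
        pre = W[l] @ z                         # entries w_{j,l}^T z_l, shape (M,)
        z = z + A[l].T @ relu(pre) / (L * M)   # Euler step with an M-sample average
    return z

rng = np.random.default_rng(0)
d, M = 4, 10
x = np.ones(d)

deviation = {}
for L in (10, 1000):
    A = rng.normal(size=(L, M, d))   # assumed: a independent of w, mean zero
    W = rng.normal(size=(L, M, d))
    deviation[L] = np.linalg.norm(resnet_forward(x, A, W) - x)
    print(f"L = {L:5d}, |z_L - x| = {deviation[L]:.4f}")
```

With a distribution for which the mean field does not vanish, the same function would approximate the flow map of the corresponding ODE instead of the identity.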

(20)

Compositional law of large numbers

Consider the following compositional scheme: $z_{0,L}(x) = x$,

$$z_{l+1,L}(x) = z_{l,L}(x) + \frac{1}{LM}\sum_{k=1}^M a_{l,k}\sigma(w_{l,k}^Tz_{l,L}(x)),$$

where $(a_{l,k}, w_{l,k})$ are pairs of vectors sampled i.i.d. from a distribution $\rho$.

Theorem (E, Ma and Wu 2019)

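Assume that

$$\mathbb{E}_\rho\,\big\|\,|a|\,|w^T|\,\big\|_F^2 < \infty$$

where for a matrix or vector $A$, $|A|$ means taking the element-wise absolute value of $A$. Define $z$ by

$$z(x,0) = x, \qquad \frac{d}{d\tau}z(x,\tau) = \mathbb{E}_{(a,w)\sim\rho}\,a\sigma(w^Tz(x,\tau)).$$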

(21)

The optimal control problem

In a slightly more general form

$$\frac{dz}{d\tau} = \mathbb{E}_{u\sim\rho_\tau}\,\phi(z,u), \qquad z(x,0) = x$$

$z$ = state, $\rho_\tau$ = control at time $\tau$.

The objective : Minimize R over {ρτ}

$$R(\{\rho_\tau\}) = \mathbb{E}_{x\sim\mu}(f(x) - f^*(x))^2 = \int_{\mathbb{R}^d}(f(x) - f^*(x))^2\,d\mu$$

where

(22)

Pontryagin’s maximum principle

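Define the Hamiltonian $H: \mathbb{R}^d \times \mathbb{R}^d \times \mathcal{P}_2(\Omega) \to \mathbb{R}$ as

$$H(z,p,\mu) = \mathbb{E}_{u\sim\mu}[p^T\phi(z,u)].$$

The solutions of the control problem must satisfy:

$$\rho_\tau = \operatorname{argmax}_\rho\,\mathbb{E}_x[H(z_\tau^{t,x}, p_\tau^{t,x}, \rho)], \quad \forall\tau \in [0,1],$$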

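and for each $x$, $(z_\tau^{t,x}, p_\tau^{t,x})$ are defined by the forward/backward equations:

$$\frac{dz_\tau^{t,x}}{d\tau} = \nabla_pH = \mathbb{E}_{u\sim\rho_\tau(\cdot;t)}[\phi(z_\tau^{t,x},u)]$$

$$\frac{dp_\tau^{t,x}}{d\tau} = -\nabla_zH = -\mathbb{E}_{u\sim\rho_\tau(\cdot;t)}[\nabla_z^T\phi(z_\tau^{t,x},u)\,p_\tau^{t,x}].$$

$$f(x) = \mathbf{1}^Tz(x,1)$$

with the boundary conditions:

$$z_0^{t,x} = x$$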

(23)

Gradient flow for flow-based models

Define the Hamiltonian $H: \mathbb{R}^d \times \mathbb{R}^d \times \mathcal{P}_2(\Omega) \to \mathbb{R}$ as

$$H(z,p,\mu) = \mathbb{E}_{u\sim\mu}[p^T\phi(z,u)].$$

The gradient flow for {ρτ} is given by

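$$\partial_t\rho_\tau(u,t) = \nabla\cdot(\rho_\tau(u,t)\nabla V(u;\rho)), \quad \forall\tau \in [0,1],$$

where

$$V(u;\rho) = \mathbb{E}_x\Big[\frac{\delta H}{\delta\rho}\big(z_\tau^{t,x}, p_\tau^{t,x}, \rho_\tau(\cdot;t)\big)\Big],$$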

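and for each $x$, $(z_\tau^{t,x}, p_\tau^{t,x})$ are defined by the forward/backward equations:

$$\frac{dz_\tau^{t,x}}{d\tau} = \nabla_pH = \mathbb{E}_{u\sim\rho_\tau(\cdot;t)}[\phi(z_\tau^{t,x},u)]$$

$$\frac{dp_\tau^{t,x}}{d\tau} = -\nabla_zH = -\mathbb{E}_{u\sim\rho_\tau(\cdot;t)}[\nabla_z^T\phi(z_\tau^{t,x},u)\,p_\tau^{t,x}].$$

with the boundary conditions:

$$z_0^{t,x} = x$$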

(24)

Discretize the gradient flow

forward Euler for the flow in τ variable, step size 1/L.

particle method for the GD dynamics, M samples in each layer

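$$z_{l+1}^{t,x} = z_l^{t,x} + \frac{1}{LM}\sum_{j=1}^M \phi(z_l^{t,x}, u_l^j(t)), \quad l = 0,\ldots,L-1$$

$$p_l^{t,x} = p_{l+1}^{t,x} + \frac{1}{LM}\sum_{j=1}^M \nabla_z\phi(z_{l+1}^{t,x}, u_{l+1}^j(t))\,p_{l+1}^{t,x}, \quad l = 0,\ldots,L-1$$

$$\frac{du_l^j(t)}{dt} = -\mathbb{E}_x[\nabla_w^T\phi(z_l^{t,x}, u_l^j(t))\,p_l^{t,x}].$$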

This recovers the gradient descent algorithm (with back-propagation) for the ResNet:

$$z_{l+1} = z_l + \frac{1}{LM}\sum_{j=1}^M a_{j,l}\sigma(z_l^Tw_{j,l})$$
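The link between the backward (co-state) recursion and back-propagation can be verified on a small example. This is a sketch under assumptions not in the slides: $\phi(z,u) = a\tanh(w^Tz)$ with $u = (a,w)$ (tanh rather than ReLU, so that a finite-difference check is well behaved), $f(x) = \mathbf{1}^Tz_L$, a single sample with squared loss, and the exact discrete adjoint of the forward recursion, which evaluates $z$ and $p$ at slightly different layer indices than the formulas on the slide. The helper names `forward`, `backward` and `fd` are hypothetical.

```python
import numpy as np

def forward(x, A, W):
    # z_{l+1} = z_l + 1/(L*M) * sum_j a_{l,j} * tanh(w_{l,j}^T z_l)
    L, M, d = A.shape
    zs = [x]
    for l in range(L):
        z = zs[-1]
        zs.append(z + A[l].T @ np.tanh(W[l] @ z) / (L * M))
    return zs

def loss(x, y, A, W):
    # Squared loss for the trial function f(x) = 1^T z_L
    return (np.sum(forward(x, A, W)[-1]) - y) ** 2

def backward(x, y, A, W):
    L, M, d = A.shape
    zs = forward(x, A, W)
    p = 2.0 * (np.sum(zs[-1]) - y) * np.ones(d)    # terminal co-state p_L
    gA, gW = np.zeros_like(A), np.zeros_like(W)
    for l in reversed(range(L)):
        z = zs[l]
        s = np.tanh(W[l] @ z)                      # sigma(w_j^T z_l)
        ds = 1.0 - s**2                            # sigma'(w_j^T z_l)
        ap = A[l] @ p                              # a_j^T p_{l+1}
        gA[l] = np.outer(s, p) / (L * M)           # d loss / d a_{l,j}
        gW[l] = np.outer(ds * ap, z) / (L * M)     # d loss / d w_{l,j}
        p = p + W[l].T @ (ds * ap) / (L * M)       # p_l from p_{l+1}
    return gA, gW

rng = np.random.default_rng(0)
L, M, d = 4, 3, 2
A = rng.normal(size=(L, M, d))
W = rng.normal(size=(L, M, d))
x, y = rng.normal(size=d), 0.5
gA, gW = backward(x, y, A, W)

def fd(param, idx, eps=1e-6):
    # Central finite difference of the loss in one parameter entry
    orig = param[idx]
    param[idx] = orig + eps
    up = loss(x, y, A, W)
    param[idx] = orig - eps
    down = loss(x, y, A, W)
    param[idx] = orig
    return (up - down) / (2 * eps)

fd_A, fd_W = fd(A, (1, 2, 0)), fd(W, (0, 1, 1))
print("a-gradient:", gA[1, 2, 0], "finite difference:", fd_A)
print("w-gradient:", gW[0, 1, 1], "finite difference:", fd_W)
```

A gradient descent step then updates each $u_l^j = (a_{l,j}, w_{l,j})$ along the negative of these gradients, averaged over the data.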

(25)

Max principle-based training method

Qianxiao Li, Long Chen, Cheng Tai and Weinan E (2017):

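Basic “method of successive approximation” (MSA):

Initialize: $\theta^0 \in U$

For $k = 0, 1, 2, \cdots$:

  Solve

  $$\frac{dz_\tau^k}{d\tau} = \nabla_pH(z_\tau^k, p_\tau^k, \theta_\tau^k), \qquad z_0^k = Vx$$

  Solve

  $$\frac{dp_\tau^k}{d\tau} = -\nabla_zH(z_\tau^k, p_\tau^k, \theta_\tau^k), \qquad p_1^k = 2(f(x;\theta^k) - f^*(x))\mathbf{1}$$

  Set $\theta_\tau^{k+1} = \operatorname{argmax}_{\theta\in\Theta} H(z_\tau^k, p_\tau^k, \theta)$ for each $\tau \in [0,1]$

Extended MSA: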

(26)

(27)

Comparison between GD and maximum principle

Maximum principle:

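$$\rho_\tau = \operatorname{argmax}_\rho\,\mathbb{E}_x[H(z_\tau^{t,x}, p_\tau^{t,x}, \rho)], \quad \forall\tau \in [0,1],$$

GD:

$$\partial_t\rho_\tau(u,t) = \nabla\cdot\Big(\rho_\tau(u,t)\nabla\mathbb{E}_x\Big[\frac{\delta H}{\delta\rho}\big(z_\tau^{t,x}, p_\tau^{t,x}, \rho_\tau(\cdot;t)\big)\Big]\Big), \quad \forall\tau \in [0,1],$$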

Hybrid:

Introducing a different time scale for optimization step: One time forward/backward propagation every k steps of optimization.

k = 1, usual GD or SGD

(28)

“Mean-field” vs. “continuous”

They sometimes give rise to the same models (e.g. two-layer NNs).

They are DIFFERENT viewpoints.

mean field: discrete → continuous by taking the limit (more like interacting particles in stat phys)

continuous formulation: continuous → discrete by discretization (more like the usual numerical analysis situation)

“Continuous” formulation tries to formulate the “first principles” of ML. It allows us to think “outside the box” about ML.

alternative discretization (spectral)

alternative model

(29)

Flow-based random feature model

$$\frac{dz}{d\tau} = \mathbb{E}_{w\sim\pi_\tau}\,a(w,\tau)\sigma(w^Tz)$$

(30)

Summary

Continuous formulation tries to formulate the “first principles” of ML. It gives rise to new variational and PDE-like problems.

The performance of these models/algorithms is more stable with respect to the choice of hyper-parameters (no “phase transition”).

It gives rise to new models and new algorithms, more suited to the specific application.


The crucial point is the representation of functions as some form of expectations.

integral transform-based

flow-based

Other representations?