
Machine Learning from a Continuous Viewpoint. Weinan E. Princeton University


Academic year: 2021


(1)

Machine Learning from a Continuous Viewpoint

Weinan E

Princeton University

Joint work with:

Chao Ma, Lei Wu

(2)

Examples

Standard ML problems for which we are given the dataset:

1 Supervised learning: Given $S = \{(x_j, y_j = f^*(x_j)),\ j \in [n]\}$, learn $f^*$.

Objective: Minimize the population risk over the “hypothesis space”: $R(f) = \mathbb{E}_{x \sim \mu}(f(x) - f^*(x))^2$

2 Dimension reduction: Given $S = \{x_j,\ j \in [n]\} \subset \mathbb{R}^D$ sampled from $\mu$, find a mapping $\Phi : \mathbb{R}^D \to \mathbb{R}^d$ ($d \ll D$) that best preserves all important features of $\mu$. “Auto-encoder”: Minimize the reconstruction error $x - \tilde{x}$, where $\tilde{x} = \Psi(z)$, $z = \Phi(x)$.

(3)

Non-standard ML problems, with no dataset given beforehand:

1 Ground state of quantum many-body problem: Let $H = -\frac{\hbar^2}{2m}\Delta + V$ be the Hamiltonian operator of the system.

$$I(\phi) = \frac{(\phi, H\phi)}{(\phi, \phi)} = \mathbb{E}_{x \sim \mu_\phi} \frac{\phi(x) H\phi(x)}{\phi(x)^2}, \qquad \mu_\phi(dx) = \frac{1}{(\phi, \phi)}\,\phi^2(x)\,dx$$

subject to the constraint imposed by the Pauli exclusion principle.

2 Stochastic control problems:

$$s_{t+1} = s_t + b_t(s_t, a_t) + \xi_{t+1},$$

$s_t$ = state at time $t$, $a_t$ = control at time $t$, $\xi_t$ = i.i.d. noise

$$L(\{a_t\}_{t=0}^{T-1}) = \mathbb{E}_{\{\xi_t\}} \Big\{ \sum_{t=0}^{T-1} c_t(s_t, a_t(s_t)) + c_T(s_T) \Big\}$$

(4)

Remark: High dimensionality

Benchmark: High-dimensional integration

$$I(g) = \int_{X = [0,1]^d} g(x)\,d\mu, \qquad I_m(g) = \frac{1}{m} \sum_j g(x_j)$$

Grid-based quadrature rules:

$$I(g) - I_m(g) \sim \frac{C(g)}{m^{\alpha/d}}$$

Appearance of $1/d$ in the exponent of $m$: Curse of dimensionality (CoD)! If we want $m^{-\alpha/d} = 0.1$, then $m = 10^{d/\alpha} = 10^d$, if $\alpha = 1$.

Monte Carlo: $\{x_j,\ j \in [m]\}$ is uniformly distributed in $X$.

$$\mathbb{E}(I(g) - I_m(g))^2 = \frac{\mathrm{var}(g)}{m}, \qquad \mathrm{var}(g) = \int_X g^2(x)\,dx - \Big( \int_X g(x)\,dx \Big)^2$$
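The Monte Carlo error formula can be checked numerically. The following is a minimal sketch, not part of the original slides: the test integrand `g` is a hypothetical choice, normalized so that its exact integral is 0 and its variance is 1 in every dimension, which means `m * MSE` should stay close to 1 whatever `d` is.

```python
import numpy as np

def g(x):
    """Hypothetical test integrand on [0,1]^d: exact integral 0, variance 1 for every d."""
    d = x.shape[1]
    return np.sqrt(12.0 / d) * (x - 0.5).sum(axis=1)

def mc_estimate(g, d, m, rng):
    """I_m(g): average of g over m i.i.d. uniform points in [0,1]^d."""
    return g(rng.random((m, d))).mean()

def mean_squared_error(d, m, trials, rng):
    """Empirical E (I(g) - I_m(g))^2 over independent repetitions (here I(g) = 0)."""
    return np.mean([mc_estimate(g, d, m, rng) ** 2 for _ in range(trials)])

rng = np.random.default_rng(0)
for d in (2, 20, 200):
    mse = mean_squared_error(d, m=1000, trials=100, rng=rng)
    # m * MSE estimates var(g), which equals 1 for this integrand in any dimension
    print(f"d = {d:3d}   m * MSE = {1000 * mse:.2f}")
```

By contrast, a tensor-product grid with only 10 points per axis would already need $10^d$ evaluations, which is out of reach for the larger values of `d` used here.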

(5)

Overall strategy:

Formulate a “nice” continuous problem, then discretize to get concrete models/algorithms.

For PDEs, “nice” = well-posed.

For calculus of variations problems, “nice” = “convex”, lower semi-continuous.

(6)

How do we represent a function? An illustrative example

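Traditional approach for Fourier transform:

$$f(x) = \int_{\mathbb{R}^d} a(\omega) e^{i(\omega, x)}\,d\omega, \qquad f_m(x) = \frac{1}{m} \sum_j a(\omega_j) e^{i(\omega_j, x)}$$

$\{\omega_j\}$ is a fixed grid, e.g. uniform.

$$\|f - f_m\|_{L^2(X)} \le C_0\, m^{-\alpha/d}\, \|f\|_{H^\alpha(X)}$$

“New” approach: Let $\pi$ be a probability distribution and

$$f(x) = \int_{\mathbb{R}^d} a(\omega) e^{i(\omega, x)}\,\pi(d\omega) = \mathbb{E}_{\omega \sim \pi}\, a(\omega) e^{i(\omega, x)}$$

Let $\{\omega_j\}$ be an i.i.d. sample of $\pi$, $f_m(x) = \frac{1}{m} \sum_{j=1}^m a(\omega_j) e^{i(\omega_j, x)}$,

$$\mathbb{E}|f(x) - f_m(x)|^2 = \frac{\mathrm{var}(f)}{m}$$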
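A concrete instance of the expectation-based representation, as a hedged sketch: the Gaussian $f(x) = e^{-|x|^2/2}$ is chosen here only because its representation is known in closed form, namely $f(x) = \mathbb{E}_{\omega \sim N(0, I)} \cos(\omega^T x)$, which corresponds to $a \equiv 1$ and $\pi = N(0, I)$ after taking real parts. The target function, the dimension and the helper names are assumptions made for illustration.

```python
import numpy as np

def f_exact(x):
    """Target f(x) = exp(-|x|^2 / 2) = E_{omega ~ N(0, I)} cos(omega^T x)."""
    return np.exp(-0.5 * np.dot(x, x))

def f_m(x, omegas):
    """Monte Carlo approximation (1/m) sum_j cos(omega_j^T x); omegas has shape (m, d)."""
    return np.cos(omegas @ x).mean()

def pointwise_variance(x):
    """Variance of cos(omega^T x) under N(0, I): (1 + exp(-2|x|^2)) / 2 - exp(-|x|^2)."""
    r2 = np.dot(x, x)
    return 0.5 * (1.0 + np.exp(-2.0 * r2)) - np.exp(-r2)

def mse_at(x, m, trials, rng):
    """Empirical E |f(x) - f_m(x)|^2 over independent samples of {omega_j}."""
    errs = [(f_m(x, rng.standard_normal((m, x.size))) - f_exact(x)) ** 2
            for _ in range(trials)]
    return np.mean(errs)

rng = np.random.default_rng(0)
for d in (2, 20, 100):
    x = np.ones(d) / np.sqrt(d)          # a point with |x| = 1 in dimension d
    ratio = 500 * mse_at(x, m=500, trials=200, rng=rng) / pointwise_variance(x)
    print(f"d = {d:3d}   m * MSE / var = {ratio:.2f}")  # expected to be near 1
```

The ratio does not degrade as `d` grows, which is the point of sampling $\{\omega_j\}$ from $\pi$ instead of placing them on a fixed grid.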

(7)

Integral transform-based representation

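Let $\sigma$ be a scalar nonlinear function (activation function), e.g. $\sigma$ = ReLU. Consider functions represented in the form:

$$f(x;\theta) = \int_{\mathbb{R}^d} a(w)\sigma(w^T x)\,\pi(dw) = \mathbb{E}_{w \sim \pi}\, a(w)\sigma(w^T x) = \mathbb{E}_{(a,w) \sim \rho}\, a\sigma(w^T x)$$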

θ = parameter to be optimized:

θ = a(·) corresponds to a feature-based model

θ = ρ corresponds to a two-layer neural network-like model.

(8)

Discretize

Fourier method: $\pi \sim \frac{1}{m} \sum_j \delta_{w_j}$, where $\{w_j\}$ lives on a uniform lattice. Optimize $a(\cdot)$:

$$f(x;\theta) \sim f_m(x) = \frac{1}{m} \sum_j a(w_j)\sigma(w_j^T x)$$

Neural network-based method: $\rho \sim \frac{1}{m} \sum_j \delta_{(a_j, w_j)}$ ($\{w_j\}$ is also optimized):

$$f(x;\theta) \sim f_m(x) = \frac{1}{m} \sum_j a_j \sigma(w_j^T x)$$

then optimize (say, using L-BFGS). This is more in line with traditional numerical analysis (e.g. nonlinear finite element or meshless methods).

(9)

For truly large datasets, we need to use stochastic algorithms

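The objective functions are all expressed as expectations:

$$R(\theta) = \mathbb{E}_{x \sim \mu}(f(x;\theta) - f^*(x))^2$$

$$R(\theta_1, \theta_2) = \mathbb{E}_{x \sim \mu}(x - \Psi(\Phi(x;\theta_1);\theta_2))^2$$

$$I(\theta) = \mathbb{E}_{x \sim \mu_\theta} \frac{\phi(x;\theta) H\phi(x;\theta)}{\phi(x;\theta)^2}$$

Gradient descent (GD) can be readily converted to stochastic gradient descent (SGD). Let $F = F(\theta) = \mathbb{E}_{x \sim \mu}\, g(\theta, x)$ be the objective function:

GD: $\theta_{k+1} = \theta_k - \eta \nabla_\theta \mathbb{E}_{x \sim \mu}\, g(\theta_k, x)$

SGD: $\theta_{k+1} = \theta_k - \eta \nabla_\theta g(\theta_k, x_k)$, with $x_k$ sampled from $\mu$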
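The two update rules can be compared on a toy problem. This is an illustrative sketch under assumptions not in the slides: a linear model $f(x;\theta) = \theta^T x$, a linear target with parameter `theta_star`, and $x \sim N(0, I)$, so that $F(\theta) = |\theta - \theta^*|^2$ and its exact gradient is available in closed form for GD.

```python
import numpy as np

def grad_g(theta, x, theta_star):
    """Gradient in theta of g(theta, x) = (theta^T x - theta_star^T x)^2."""
    return 2.0 * np.dot(theta - theta_star, x) * x

def gd(theta0, theta_star, eta, steps):
    """GD on F(theta) = E_x g(theta, x) = |theta - theta_star|^2 for x ~ N(0, I)."""
    theta = theta0.copy()
    for _ in range(steps):
        full_grad = 2.0 * (theta - theta_star)   # exact gradient of the expectation
        theta = theta - eta * full_grad
    return theta

def sgd(theta0, theta_star, eta, steps, rng):
    """SGD: replace the expectation by one fresh sample x_k at every step."""
    theta = theta0.copy()
    for _ in range(steps):
        x_k = rng.standard_normal(theta.shape)
        theta = theta - eta * grad_g(theta, x_k, theta_star)
    return theta

rng = np.random.default_rng(0)
theta_star = np.arange(1.0, 6.0)                 # hypothetical target parameters
theta0 = np.zeros(5)
print("GD  error:", np.linalg.norm(gd(theta0, theta_star, 0.05, 2000) - theta_star))
print("SGD error:", np.linalg.norm(sgd(theta0, theta_star, 0.05, 2000, rng) - theta_star))
```

In this particular example the stochastic gradient vanishes at the minimizer, so SGD with a constant step size also converges; in general the SGD iterates fluctuate around the minimizer at a scale set by $\eta$.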

(10)

Optimization: Defining gradient flows

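“Free energy” = $R(f) = \mathbb{E}_{x \sim \mu}(f(x) - f^*(x))^2$

$$f(x) = \int a(w)\sigma(w^T x)\,\pi(dw) = \mathbb{E}_{w \sim \pi}\, a(w)\sigma(w^T x)$$

Follow Halperin and Hohenberg (1977):

$a$ = non-conserved, use “model A” dynamics (Allen-Cahn):

$$\frac{\partial a}{\partial t} = -\frac{\delta R}{\delta a}$$

$\pi$ = conserved (probability density), use “model B” (Cahn-Hilliard):

$$\frac{\partial \pi}{\partial t} + \nabla \cdot J = 0$$

$$J = \pi v, \qquad v = -\nabla V, \qquad V = \frac{\delta R}{\delta \pi}$$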

(11)

Gradient flow for the feature-based model

Fix π, optimize a.

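$$\partial_t a(w, t) = -\frac{\delta R}{\delta a}(w, t) = -\int a(\tilde{w}, t) K(w, \tilde{w})\,\pi(d\tilde{w}) + \tilde{f}(w)$$

$$K(w, \tilde{w}) = \mathbb{E}_x[\sigma(w^T x)\sigma(\tilde{w}^T x)], \qquad \tilde{f}(w) = \mathbb{E}_x[f^*(x)\sigma(w^T x)]$$

This is an integral equation with a symmetric positive definite kernel.

Decay estimates due to convexity: Let $f^*(x) = \mathbb{E}_{w \sim \pi}\, a^*(w)\sigma(w^T x)$,

$$I(t) = \frac{1}{2}\|a(\cdot, t) - a^*(\cdot)\|^2 + t\,(R(a(t)) - R(a^*))$$

Then we have

$$\frac{dI}{dt} \le 0, \qquad R(a(t)) \le \frac{C_0}{t}$$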
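The decay estimate follows from a short computation, sketched here under stated assumptions: the norm and inner product are those of $L^2(\pi)$, $\delta R/\delta a$ denotes the corresponding gradient of $R$ (constant factors absorbed by rescaling $R$), the flow is $\partial_t a = -\delta R/\delta a$, and $R$ is convex in $a$, which holds because $R$ is quadratic in $a$.

```latex
\begin{align*}
\frac{dI}{dt}
  &= \Big\langle a - a^*,\, \partial_t a \Big\rangle
     + \big(R(a) - R(a^*)\big)
     + t\,\Big\langle \frac{\delta R}{\delta a},\, \partial_t a \Big\rangle \\
  &= -\Big\langle a - a^*,\, \frac{\delta R}{\delta a} \Big\rangle
     + R(a) - R(a^*)
     - t\,\Big\| \frac{\delta R}{\delta a} \Big\|^2 \\
  &\le -t\,\Big\| \frac{\delta R}{\delta a} \Big\|^2 \;\le\; 0,
\end{align*}
% the inequality uses convexity:
%   R(a^*) \ge R(a) + \langle \delta R/\delta a,\; a^* - a \rangle .
\begin{align*}
t\,\big(R(a(t)) - R(a^*)\big) \le I(t) \le I(0) = \tfrac{1}{2}\|a(\cdot,0) - a^*\|^2
\quad\Longrightarrow\quad
R(a(t)) \le \frac{\|a(\cdot,0) - a^*\|^2}{2t},
\end{align*}
% using R(a^*) = 0, since f^* is represented exactly by a^*.
```

Under these assumptions the constant can be taken as $C_0 = \frac{1}{2}\|a(\cdot, 0) - a^*\|^2$.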

(12)

Conservative gradient flow

Optimize ρ:

f(x) = Eu∼ρφ(x, u)

Example: u = (a, w), φ(x,u) = aσ(wTx)

$$\partial_t \rho = \nabla \cdot (\rho \nabla V)$$

$$V(u) = \frac{\delta R}{\delta \rho}(u) = \mathbb{E}_x[(f(x) - f^*(x))\phi(x, u)] = \int K(u, \tilde{u})\,\rho(d\tilde{u}) - \tilde{f}(u)$$

This is the mean-field equation derived by Chizat and Bach (2018), Mei, Montanari and Nguyen (2018), Rotskoff and Vanden-Eijnden (2018), Sirignano and Spiliopoulos (2018), by studying the continuum limit of two-layer neural networks. Does not satisfy displacement convexity.

(13)

Mixture model

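Optimize $(a, \pi)$ ($a$ = non-conservative, $\pi$ = conservative):

$$\partial_t a(w, t) = -\frac{\delta R}{\delta a}(w, t) = -\int a(\tilde{w}, t) K(w, \tilde{w})\,\pi(d\tilde{w}, t) + \tilde{f}(w)$$

$$\partial_t \pi = \nabla \cdot (\pi \nabla \tilde{V}), \qquad \tilde{V}(w) = \frac{\delta R}{\delta \pi}(w)$$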

(14)

Discretizing the gradient flows

Discretizing the population risk (into the empirical risk) using data

Discretizing the gradient flow

particle method – the dynamic version of Monte Carlo

smoothed particle method – analog of vortex blob method

spectral method – very effective in low dimensions

We will see that the gradient descent (GD) algorithms for random feature and neural network models are simply the particle method discretizations of the gradient flows discussed before.

(15)

Particle method for the feature-based model

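$$\partial_t a(w, t) = -\frac{\delta R}{\delta a}(w) = -\int a(\tilde{w}, t) K(w, \tilde{w})\,\pi(d\tilde{w}) + \tilde{f}(w)$$

$$\pi(dw) \sim \frac{1}{m} \sum_j \delta_{w_j}, \qquad a(w_j, t) \sim a_j(t)$$

Discretized version:

$$\frac{d}{dt} a_j(t) = -\frac{1}{m} \sum_k K(w_j, w_k)\, a_k(t) + \tilde{f}(w_j)$$

This is exactly the GD for the random feature model.

$$f(x) \sim f_m(x) = \frac{1}{m} \sum_j a_j \sigma(w_j^T x)$$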
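The claimed equivalence can be verified numerically. The sketch below is an illustration under assumptions not in the slides: expectations over $x$ are replaced by averages over `n` synthetic Gaussian samples, $\sigma$ is ReLU, the target is a single neuron, and time is discretized by forward Euler with step `dt`. Differentiating the empirical risk gives $\partial R / \partial a_j = \frac{2}{m}\big(\frac{1}{m}\sum_k K(w_j, w_k) a_k - \tilde{f}(w_j)\big)$, so one Euler step of the particle system should coincide with one GD step of size `dt * m / 2`.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def setup(n=200, d=5, m=50, seed=0):
    """Synthetic data (hypothetical): x ~ N(0, I), fixed unit-norm features w_j, single-neuron target."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    W = rng.standard_normal((m, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    y = relu(X @ (np.ones(d) / np.sqrt(d)))
    return X, y, W

def empirical_risk(a, X, y, W):
    """R(a) = mean_i (f_m(x_i) - y_i)^2 with f_m(x) = (1/m) sum_j a_j sigma(w_j^T x)."""
    return np.mean((relu(X @ W.T) @ a / len(a) - y) ** 2)

def particle_flow(a0, X, y, W, dt, steps):
    """Forward Euler for da_j/dt = -(1/m) sum_k K(w_j, w_k) a_k + f_tilde(w_j)."""
    m, n = len(a0), len(X)
    S = relu(X @ W.T)            # S[i, j] = sigma(w_j^T x_i)
    K = S.T @ S / n              # empirical kernel K(w_j, w_k)
    f_tilde = S.T @ y / n        # empirical f_tilde(w_j)
    a = a0.copy()
    for _ in range(steps):
        a = a + dt * (-(K @ a) / m + f_tilde)
    return a

def gradient_descent(a0, X, y, W, lr, steps):
    """Plain GD on the empirical risk of the random feature model."""
    m, n = len(a0), len(X)
    S = relu(X @ W.T)
    a = a0.copy()
    for _ in range(steps):
        residual = S @ a / m - y
        a = a - lr * 2.0 * (S.T @ residual) / (n * m)
    return a

X, y, W = setup()
a0 = np.zeros(W.shape[0])
a_flow = particle_flow(a0, X, y, W, dt=1.0, steps=200)
a_gd = gradient_descent(a0, X, y, W, lr=1.0 * len(a0) / 2, steps=200)
print("max |a_flow - a_gd| =", np.abs(a_flow - a_gd).max())
print("risk before/after   =", empirical_risk(a0, X, y, W), empirical_risk(a_flow, X, y, W))
```

The factor `m / 2` in the learning rate is just the time rescaling between the two formulations; it is part of this sketch's bookkeeping, not a statement from the slides.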

(16)

Particle method for the conservative flow

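$$\partial_t \rho = \nabla \cdot (\rho \nabla V) \qquad (1)$$

Particle method discretization:

$$\rho(t, du) \sim \frac{1}{m} \sum_j \delta_{u_j(t)}$$

Define the loss function

$$I(u_1, \cdots, u_m) = R(f_m), \qquad f_m(x) = \frac{1}{m} \sum_j \phi(x, u_j)$$

Lemma: Given a set of initial data $\{u_j^0,\ j \in [m]\}$, the solution of (1) with initial data $\rho(0) = \frac{1}{m} \sum_{j=1}^m \delta_{u_j^0}$ is given by

$$\rho(t) = \frac{1}{m} \sum_{j=1}^m \delta_{u_j(t)}$$

where the particles $\{u_j(\cdot),\ j \in [m]\}$ solve: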

(17)

Comparison with conventional NN models

Continuous viewpoint (in this case same as mean-field): $f_m(x) = \frac{1}{m} \sum_j a_j \sigma(w_j^T x)$

Conventional NN models: $f_m(x) = \sum_j a_j \sigma(w_j^T x)$

Figure: Test errors as a function of $\log_{10}(m)$ and $\log_{10}(n)$. (Left) continuous viewpoint; (Right) conventional NN models. Target function is a single neuron.

(18)

Flow-based representation

Continuous dynamical system viewpoint (E (2017), Haber and Ruthotto (2017), “Neural ODEs” (Chen et al, 2018))

$$\frac{dz}{d\tau} = g(\tau, z), \qquad z(0) = x$$

The flow-map at time 1: $x \to z(x, 1)$. Trial functions:

$$f = \alpha^T z(x, 1)$$

(19)

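How do we choose the form of $g$?

The correct form of $g$ is given by (E, Ma and Wu, 2019):

$$g(\tau, z) = \mathbb{E}_{w \sim \pi_\tau}\, a(w, \tau)\sigma(w^T z)$$

where $\{\pi_\tau\}$ is a family of probability distributions.

$$\frac{dz}{d\tau} = \mathbb{E}_{w \sim \pi_\tau}\, a(w, \tau)\sigma(w^T z) = \mathbb{E}_{(a,w) \sim \rho_\tau}\, a\sigma(w^T z)$$

Discretize: We obtain the “residual neural network” model:

$$z_{l+1} = z_l + \frac{1}{LM} \sum_{j=1}^M a_{j,l}\,\sigma(z_l^T w_{j,l}), \quad l = 0, 1, \cdots, L-1, \qquad z_0 = V\tilde{x}$$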
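The residual update is a forward Euler step of size $1/L$ for the flow, with the expectation replaced by an average over $M$ samples per layer. A minimal sketch of the forward pass follows; the array layout `(L, M, d)`, the choice $\sigma$ = ReLU and the simplification $z_0 = x$ (no input map $V$) are assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def resnet_forward(x, A, W):
    """Apply z_{l+1} = z_l + 1/(L M) sum_j a_{j,l} sigma(w_{j,l}^T z_l) for l = 0..L-1.

    A and W have shape (L, M, d): A[l, j] is the vector a_{j,l}, W[l, j] is w_{j,l}.
    """
    L, M, _ = A.shape
    z = x.copy()
    for l in range(L):
        activations = relu(W[l] @ z)             # shape (M,): sigma(w_{j,l}^T z_l)
        z = z + (activations @ A[l]) / (L * M)   # Euler step of size 1/L
    return z

rng = np.random.default_rng(0)
d, M = 4, 16
a = 0.5 * rng.standard_normal((M, d))            # hypothetical depth-independent parameters
w = 0.5 * rng.standard_normal((M, d))
x = rng.standard_normal(d)

# With parameters shared across layers, increasing L refines the Euler discretization
# of the same ODE dz/dtau = (1/M) sum_j a_j sigma(w_j^T z) on [0, 1].
for L in (10, 100, 1000):
    z_L = resnet_forward(x, np.tile(a, (L, 1, 1)), np.tile(w, (L, 1, 1)))
    print(L, z_L)
```

The $1/L$ factor is what keeps the output bounded as depth grows; without it the same loop would accumulate $L$ full-size updates.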

(20)

Compositional law of large numbers

Consider the following compositional scheme: $z_{0,L}(x) = x$,

$$z_{l+1,L}(x) = z_{l,L}(x) + \frac{1}{LM} \sum_{k=1}^M a_{l,k}\,\sigma(w_{l,k}^T z_{l,L}(x)),$$

where $(a_{l,k}, w_{l,k})$ are pairs of vectors sampled i.i.d. from a distribution $\rho$.

Theorem (E, Ma and Wu 2019)

Assume that

$$\mathbb{E}_\rho \big\| |a|\,|w^T| \big\|_F^2 < \infty$$

where for a matrix or vector $A$, $|A|$ means taking the element-wise absolute value of $A$. Define $z$ by

$$z(x, 0) = x, \qquad \frac{d}{d\tau} z(x, \tau) = \mathbb{E}_{(a,w) \sim \rho}\, a\sigma(w^T z(x, \tau)).$$

(21)

The optimal control problem

In a slightly more general form

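$$\frac{dz}{d\tau} = \mathbb{E}_{u \sim \rho_\tau}\, \phi(z, u), \qquad z(x, 0) = x$$

$z$ = state, $\rho_\tau$ = control at time $\tau$.

The objective: Minimize $R$ over $\{\rho_\tau\}$

$$R(\{\rho_\tau\}) = \mathbb{E}_{x \sim \mu}(f(x) - f^*(x))^2 = \int_{\mathbb{R}^d} (f(x) - f^*(x))^2\,d\mu$$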

where

(22)

Pontryagin’s maximum principle

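Define the Hamiltonian $H : \mathbb{R}^d \times \mathbb{R}^d \times \mathcal{P}_2(\Omega) \to \mathbb{R}$ as

$$H(z, p, \mu) = \mathbb{E}_{u \sim \mu}[p^T \phi(z, u)].$$

The solutions of the control problem must satisfy:

$$\rho_\tau = \operatorname{argmax}_\rho\, \mathbb{E}_x[H(z_\tau^{t,x}, p_\tau^{t,x}, \rho)], \quad \forall \tau \in [0, 1],$$

and for each $x$, $(z_\tau^{t,x}, p_\tau^{t,x})$ are defined by the forward/backward equations:

$$\frac{dz_\tau^{t,x}}{d\tau} = \nabla_p H = \mathbb{E}_{u \sim \rho_\tau(\cdot;t)}[\phi(z_\tau^{t,x}, u)]$$

$$\frac{dp_\tau^{t,x}}{d\tau} = -\nabla_z H = -\mathbb{E}_{u \sim \rho_\tau(\cdot;t)}[\nabla_z^T \phi(z_\tau^{t,x}, u)\, p_\tau^{t,x}].$$

$$f(x) = 1^T z(x, 1)$$

with the boundary conditions:

$$z_0^{t,x} = x$$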

(23)

Gradient flow for flow-based models

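Define the Hamiltonian $H : \mathbb{R}^d \times \mathbb{R}^d \times \mathcal{P}_2(\Omega) \to \mathbb{R}$ as

$$H(z, p, \mu) = \mathbb{E}_{u \sim \mu}[p^T \phi(z, u)].$$

The gradient flow for $\{\rho_\tau\}$ is given by

$$\partial_t \rho_\tau(u, t) = \nabla \cdot (\rho_\tau(u, t) \nabla V(u; \rho)), \quad \forall \tau \in [0, 1],$$

where

$$V(u; \rho) = \mathbb{E}_x\Big[\frac{\delta H}{\delta \rho}\big(z_\tau^{t,x}, p_\tau^{t,x}, \rho_\tau(\cdot;t)\big)\Big],$$

and for each $x$, $(z_\tau^{t,x}, p_\tau^{t,x})$ are defined by the forward/backward equations:

$$\frac{dz_\tau^{t,x}}{d\tau} = \nabla_p H = \mathbb{E}_{u \sim \rho_\tau(\cdot;t)}[\phi(z_\tau^{t,x}, u)]$$

$$\frac{dp_\tau^{t,x}}{d\tau} = -\nabla_z H = -\mathbb{E}_{u \sim \rho_\tau(\cdot;t)}[\nabla_z^T \phi(z_\tau^{t,x}, u)\, p_\tau^{t,x}].$$

with the boundary conditions:

$$z_0^{t,x} = x$$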

(24)

Discretize the gradient flow

forward Euler for the flow in the $\tau$ variable, step size $1/L$.

particle method for the GD dynamics, $M$ samples in each layer

$$z_{l+1}^{t,x} = z_l^{t,x} + \frac{1}{LM} \sum_{j=1}^M \phi(z_l^{t,x}, u_l^j(t)), \quad l = 0, \ldots, L-1$$

$$p_l^{t,x} = p_{l+1}^{t,x} + \frac{1}{LM} \sum_{j=1}^M \nabla_z^T \phi(z_{l+1}^{t,x}, u_{l+1}^j(t))\, p_{l+1}^{t,x}, \quad l = 0, \ldots, L-1$$

$$\frac{du_l^j(t)}{dt} = -\mathbb{E}_x[\nabla_w^T \phi(z_l^{t,x}, u_l^j(t))\, p_l^{t,x}].$$

This recovers the gradient descent algorithm (with back-propagation) for the ResNet:

$$z_{l+1} = z_l + \frac{1}{LM} \sum_{j=1}^M a_{j,l}\,\sigma(z_l^T w_{j,l})$$
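To see the backward (co-state) recursion act as back-propagation, the sketch below runs a forward pass, propagates $p$ backward, and returns parameter gradients that can be compared with finite differences. Several choices here are assumptions for illustration: $\phi(z, u) = a \tanh(w^T z)$ (a smooth activation, so the finite-difference comparison is clean), a single sample $x$ with loss $(1^T z_L - y)^2$, and an indexing in which `p` holds the gradient of the loss with respect to the output of the layer currently being processed.

```python
import numpy as np

def forward(x, A, W):
    """Return [z_0, ..., z_L] with z_{l+1} = z_l + 1/(L M) sum_j a_{l,j} tanh(w_{l,j}^T z_l)."""
    L, M, _ = A.shape
    zs = [x]
    for l in range(L):
        zs.append(zs[-1] + (np.tanh(W[l] @ zs[-1]) @ A[l]) / (L * M))
    return zs

def loss(x, y, A, W):
    """Squared error of the trial function f(x) = 1^T z_L against a scalar target y."""
    return (forward(x, A, W)[-1].sum() - y) ** 2

def backward(x, y, A, W):
    """Gradients of the loss w.r.t. A and W via the discrete backward recursion for p."""
    L, M, d = A.shape
    zs = forward(x, A, W)
    p = 2.0 * (zs[-1].sum() - y) * np.ones(d)     # terminal condition: d loss / d z_L
    gA, gW = np.zeros_like(A), np.zeros_like(W)
    for l in reversed(range(L)):
        pre = W[l] @ zs[l]
        s, ds = np.tanh(pre), 1.0 - np.tanh(pre) ** 2
        ap = A[l] @ p                              # a_{l,j}^T p for each j
        gA[l] = np.outer(s, p) / (L * M)           # gradient w.r.t. a_{l,j}
        gW[l] = np.outer(ds * ap, zs[l]) / (L * M) # gradient w.r.t. w_{l,j}
        p = p + (W[l].T @ (ds * ap)) / (L * M)     # p <- p + (1/LM) sum_j grad_z phi^T p
    return gA, gW

rng = np.random.default_rng(0)
L, M, d = 4, 3, 2
A, W = rng.standard_normal((L, M, d)), rng.standard_normal((L, M, d))
x, y = rng.standard_normal(d), 0.3
gA, gW = backward(x, y, A, W)

# Central finite difference on one entry of W as a sanity check.
eps, idx = 1e-6, (2, 1, 0)
Wp, Wm = W.copy(), W.copy()
Wp[idx] += eps
Wm[idx] -= eps
print(gW[idx], (loss(x, y, A, Wp) - loss(x, y, A, Wm)) / (2 * eps))
```

A gradient descent step would then update `A -= lr * gA` and `W -= lr * gW`, which is the forward Euler discretization in $t$ of the particle dynamics for one sample $x$.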

(25)

Max principle-based training method

Qianxiao Li, Long Chen, Cheng Tai and Weinan E (2017):

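Basic “method of successive approximation” (MSA):

Initialize: $\theta^0 \in U$

For $k = 0, 1, 2, \cdots$:

Solve

$$\frac{dz_\tau^k}{d\tau} = \nabla_p H(z_\tau^k, p_\tau^k, \theta_\tau^k), \qquad z_0^k = Vx$$

Solve

$$\frac{dp_\tau^k}{d\tau} = -\nabla_z H(z_\tau^k, p_\tau^k, \theta_\tau^k), \qquad p_1^k = 2(f(x;\theta^k) - f^*(x))\,1$$

Set $\theta_\tau^{k+1} = \operatorname{argmax}_{\theta \in \Theta} H(z_\tau^k, p_\tau^k, \theta)$, for each $\tau \in [0, 1]$

Extended MSA: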

(26)
(27)

Comparison between GD and maximum principle

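Maximum principle:

$$\rho_\tau = \operatorname{argmax}_\rho\, \mathbb{E}_x[H(z_\tau^{t,x}, p_\tau^{t,x}, \rho)], \quad \forall \tau \in [0, 1],$$

GD:

$$\partial_t \rho_\tau(u, t) = \nabla \cdot \Big( \rho_\tau(u, t)\, \nabla\, \mathbb{E}_x\Big[\frac{\delta H}{\delta \rho}\big(z_\tau^{t,x}, p_\tau^{t,x}, \rho_\tau(\cdot;t)\big)\Big] \Big), \quad \forall \tau \in [0, 1],$$

Hybrid:

Introducing a different time scale for the optimization step: one forward/backward propagation every $k$ steps of optimization.

$k = 1$: usual GD or SGD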

(28)

“Mean-field” vs. “continuous”

They sometimes give rise to the same models (e.g. two-layer NNs).

They are DIFFERENT viewpoints.

mean field: discrete → continuous by taking the limit (more like interacting particles in stat phys)

continuous formulation: continuous → discrete by discretization (more like the usual numerical analysis situation)

“Continuous” formulation tries to formulate the “first principles” of ML. It allows us to think “outside the box” about ML.

alternative discretization (spectral)

alternative model

(29)

Flow-based random feature model

$$\frac{dz}{d\tau} = \mathbb{E}_{w \sim \pi_\tau}\, a(w, \tau)\sigma(w^T z)$$

(30)

Summary

Continuous formulation tries to formulate the “first principles” of ML. It gives rise to new variational and PDE-like problems.

The performance of these models/algorithms is more stable with respect to the choice of hyper-parameters (no “phase transition”).

It gives rise to new models and new algorithms, more suited to the specific application.

(31)

The crucial point is the representation of functions as some form of expectations.

integral transform-based

flow-based

Other representations?
