
Academic year: 2020


CHAPTER 4

STOCHASTIC APPROXIMATION FOR ROOT FINDING IN NONLINEAR MODELS

• Organization of chapter in ISSO

– Introduction and potpourri of examples

  • Sample mean

  • Quantile and CEP

  • Production function (contrast with maximum likelihood)

– Convergence of the SA algorithm

– Asymptotic normality of SA and choice of gain sequence

– Extensions to standard root-finding SA

  • Joint parameter and state estimation

  • Higher-order methods for algorithm acceleration

  • Iterate averaging

  • Time-varying functions

Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

Introduction

Stochastic approximation (SA) is a collection of iterative stochastic optimization algorithms that attempt to find zeroes or extrema of functions

– Functions cannot (usually) be computed directly, but only estimated via noisy observations

• Original SA methods are Robbins-Monro (1951) and Kiefer-Wolfowitz (1952) algorithms

• Focus of Chaps. 4 and 5 is Robbins-Monro and related stochastic gradient algorithms

• Simple-minded (non-SA) way to remove noise is to compute average of function measurements at each iteration

• Innovation in SA is to remove noise across iterations while not averaging at each iteration


Stochastic Root-Finding Problem

• Focus is on finding $\theta$ (i.e., $\theta^*$) such that $g(\theta) = 0$

– $g(\theta)$ is typically a nonlinear function of $\theta$ (contrast with Chapter 3 in ISSO)

• Assume only noisy measurements of $g(\theta)$ are available: $Y_k(\theta) = g(\theta) + e_k(\theta)$, $k = 0, 1, 2, \ldots$

• Above problem arises frequently in practice

– Optimization with noisy measurements ($g(\theta)$ represents gradient of loss function) (see Chapter 5 of ISSO)

– Quantile-type problems

– Equation solving in physics-based models

– Machine learning (see Chapter 11 of ISSO)


Core Algorithm for Stochastic Root-Finding

• Basic algorithm published in Robbins and Monro (1951)

• Algorithm is stochastic analogue to steepest descent when used for optimization

– Noisy measurement $Y_k(\theta)$ replaces exact gradient $g(\theta)$

• Generally wasteful to average measurements at given value of $\theta$

– Average across iterations (changing $\theta$)

• Core Robbins-Monro algorithm for unconstrained root-finding is

$$\hat\theta_{k+1} = \hat\theta_k - a_k Y_k(\hat\theta_k), \quad a_k > 0$$

• Constrained version of algorithm also exists
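The unconstrained recursion can be sketched in a few lines of Python. This is a minimal illustration, not code from ISSO: the test function $g(\theta) = \theta - \cos\theta$, the noise level, the starting point, and the helper names `robbins_monro` and `noisy_g` are all assumptions chosen for the example, and the gain uses the form $a_k = a/(k+1+A)^\alpha$.

```python
import numpy as np

def robbins_monro(noisy_g, theta0, n_iter, a=1.0, A=0.0, alpha=1.0):
    """Iterate theta_{k+1} = theta_k - a_k * Y_k(theta_k), a_k = a/(k+1+A)**alpha."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float)).copy()
    for k in range(n_iter):
        a_k = a / (k + 1 + A) ** alpha
        # One noisy measurement per iteration; no averaging at a fixed theta.
        theta = theta - a_k * noisy_g(theta, k)
    return theta

rng = np.random.default_rng(0)

def noisy_g(theta, k):
    # Hypothetical example: g(theta) = theta - cos(theta), root near 0.739085.
    # Measurement noise e_k is i.i.d. N(0, 0.5^2) (an assumption).
    return theta - np.cos(theta) + rng.normal(0.0, 0.5, size=theta.shape)

theta_hat = robbins_monro(noisy_g, theta0=[0.0], n_iter=5000)
print("final iterate:", theta_hat[0])
```

The function here has a bounded slope, which keeps the early iterations stable with $a = 1$; a steeper $g$ would likely need a smaller $a$ or a positive $A$.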


Circular Error Probable (CEP): Example of Root-Finding (Example 4.3 in ISSO)

• Interested in estimating radius of circle about target such that half of impacts lie within circle ($\theta$ is scalar radius)

• Define success variable

$$s_k(\hat\theta_k) = \begin{cases} 1 & \text{if } \|X_k\| \le \hat\theta_k \text{ (success)} \\ 0 & \text{otherwise (nonsuccess)} \end{cases}$$

• Root-finding algorithm becomes

$$\hat\theta_{k+1} = \hat\theta_k - a_k \underbrace{\left(s_k(\hat\theta_k) - 0.5\right)}_{Y_k(\hat\theta_k)}$$

• Figure on next slide illustrates results for one study

True and estimated CEP: 1000 impact points with impact mean differing from target point
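A small simulation shows how the CEP recursion behaves. It is only a sketch under stated assumptions: impacts are drawn from a bivariate normal with mean offset (1, 1) from a target at the origin, $X_k$ is taken to be the impact position relative to the target, and the gain is $a_k = a/(k+1)$ with $a = 3$. None of these values come from Example 4.3, and `estimate_cep` is a hypothetical helper name.

```python
import numpy as np

def estimate_cep(distances, theta0=1.0, a=3.0):
    """Root-finding SA for the radius containing half of the impacts."""
    theta = theta0
    for k, d in enumerate(distances):
        s_k = 1.0 if d <= theta else 0.0   # success indicator
        a_k = a / (k + 1)                  # assumed gain sequence
        theta = theta - a_k * (s_k - 0.5)  # Y_k = s_k - 0.5
    return theta

rng = np.random.default_rng(1)
# Hypothetical impact model: target at origin, impact mean offset to (1, 1).
impacts = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(1000, 2))
distances = np.linalg.norm(impacts, axis=1)

cep_sa = estimate_cep(distances)
cep_median = np.median(distances)  # batch estimate, for comparison only
print("SA estimate:", cep_sa, "sample median distance:", cep_median)
```

The sample median of the miss distances is a batch reference point; the SA estimate uses each impact once, in sequence.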


Convergence Conditions

• Central aspect of root-finding SA is the set of conditions for formal convergence of the iterate to a root $\theta^*$

– Provides rigorous basis for many popular algorithms (LMS, backpropagation, simulated annealing, etc.)

• Section 4.3 of ISSO contains two sets of conditions:

– “Statistics” conditions based on classical assumptions about $g(\theta)$, noise, and gains $a_k$

– “Engineering” conditions based on connection to deterministic ordinary differential equation (ODE)

• Convergence and stability of ODE $dZ(\tau)/d\tau = -g(Z(\tau))$ closely related to convergence of SA algorithm ($Z(\tau)$ represents $p$-dimensional time-varying function and $\tau$ denotes time)

• Neither the statistics nor the engineering conditions are a special case of the other


Fundamental Convergence Theorem

• Theorem 4.1 (Sect. 4.3 of ISSO) shows almost sure (a.s.) convergence of standard root-finding (Robbins-Monro) SA to $\theta^*$

– Theorem requires “Statistics” conditions (“A” conditions, A.1–A.4) or “Engineering” conditions (“B” conditions, B.1–B.5)

• Statement of theorem appears on p. 108 of ISSO


ODE Convergence Paths for Nonlinear Problem

4 2 2

Z1

2 2

Z2

• Plot below from nonlinear g() in Example 4.6 in ISSO

• Each line depicts path of Z() = [Z1(),Z2()]Tover time from particular initial condition Z(0)

• Plot shows case satisfying relevant ODE (“B”) conditions

‒ Asymptotic stability and global domain of attraction
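Paths of this kind can be approximated numerically with forward Euler steps on $dZ(\tau)/d\tau = -g(Z(\tau))$. The sketch below uses a hypothetical nonlinear $g$ chosen so that the origin is globally asymptotically stable; it is not the function from Example 4.6, and the step size, horizon, and starting points are likewise assumptions.

```python
import numpy as np

def g(z):
    # Hypothetical nonlinear g with root at the origin (NOT Example 4.6).
    z1, z2 = z
    return np.array([z1 + z2 + z1**3, -z1 + z2 + z2**3])

def ode_path(z0, step=0.01, n_steps=2000):
    """Forward-Euler approximation to dZ/dtau = -g(Z), starting at Z(0) = z0."""
    path = np.empty((n_steps + 1, 2))
    path[0] = z0
    for i in range(n_steps):
        path[i + 1] = path[i] - step * g(path[i])
    return path

starts = [(-3.0, -3.0), (-3.0, 3.0), (3.0, -3.0), (3.0, 3.0), (0.0, 2.5)]
paths = [ode_path(np.array(s)) for s in starts]
final_norms = [np.linalg.norm(p[-1]) for p in paths]
print("distance from root at final time:", final_norms)
```

For this $g$, $V(Z) = \|Z\|^2/2$ decreases along every path, since $Z^T g(Z) = \|Z\|^2 + Z_1^4 + Z_2^4 > 0$ away from the origin, so every starting point is drawn to the root.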

Gain Selection

• Choice of the gain sequence $a_k$ is critical to the performance of SA

• Famous conditions for convergence are

$$\sum_{k=0}^{\infty} a_k = \infty \quad \text{and} \quad \sum_{k=0}^{\infty} a_k^2 < \infty$$

• Common practical choice of gain sequence is

$$a_k = \frac{a}{(k + 1 + A)^{\alpha}},$$

where $1/2 < \alpha \le 1$, $a > 0$, and $A \ge 0$

• Strictly positive $A$ (“stability constant”) allows for larger $a$ (possibly faster convergence) without risking unstable behavior in early iterations

• $\alpha$ and $A$ can usually be pre-specified; critical coefficient $a$ usually chosen by “trial-and-error”
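The effect of the stability constant is easy to see numerically. The following sketch evaluates $a_k = a/(k+1+A)^\alpha$ for two settings; the particular values $a = 5$, $A = 100$, $\alpha = 0.7$ and the helper name `gain_sequence` are illustrative assumptions only.

```python
import numpy as np

def gain_sequence(n, a=1.0, A=0.0, alpha=1.0):
    """Return a_k = a / (k + 1 + A)**alpha for k = 0, ..., n-1."""
    if not (0.5 < alpha <= 1.0) or a <= 0 or A < 0:
        raise ValueError("need 1/2 < alpha <= 1, a > 0, A >= 0")
    k = np.arange(n)
    return a / (k + 1 + A) ** alpha

gains_no_A = gain_sequence(10_000, a=5.0, A=0.0, alpha=0.7)
gains_with_A = gain_sequence(10_000, a=5.0, A=100.0, alpha=0.7)

# A > 0 shrinks the early gains but leaves the late gains almost unchanged.
print("first gains:", gains_no_A[0], gains_with_A[0])
print("last gains: ", gains_no_A[-1], gains_with_A[-1])
```

With the same $a$, a positive $A$ damps the first steps while the two sequences nearly coincide for large $k$, which is why a larger $a$ becomes usable.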


Asymptotic Normality

• No known finite-sample ($k < \infty$) distribution for SA iterate

• Asymptotic distribution provides approximate description of uncertainty and variability of iterate in finite samples

• Using standard gain sequence form (previous slide) and under conditions similar to those for convergence (conditions “A” or “B”), can show asymptotic normality:

$$k^{\alpha/2}\,(\hat\theta_k - \theta^*) \xrightarrow{\text{dist.}} N(0, \Sigma),$$

where $\alpha$ governs decay rate for $a_k$ and $\Sigma$ depends on gain sequence $a_k$ and on Jacobian matrix of $g(\theta)$ (i.e., derivative of $g$ w.r.t. $\theta$)

• Asymptotic normality above implies rate of convergence of iterate maximized at $\alpha = 1$

‒ Practical values of $\alpha < 1$ usually better in finite samples
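A Monte Carlo check makes the scaling concrete in one simple case. The sketch assumes a scalar linear $g(\theta) = \theta - \theta^*$, i.i.d. $N(0, \sigma^2)$ noise, and gain $a_k = 1/(k+1)$ (so $\alpha = 1$); these choices are illustrative and not taken from ISSO. In this special case the iterate equals $\theta^*$ minus the running mean of the noise, so $\sqrt{k}\,(\hat\theta_k - \theta^*)$ has variance exactly $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_star, sigma = 1.5, 1.0   # hypothetical root and noise level
n_iter, n_rep = 500, 2000      # iterations per run, independent runs

theta = np.zeros(n_rep)        # each entry is one independent SA run
for k in range(n_iter):
    y = (theta - theta_star) + rng.normal(0.0, sigma, size=n_rep)
    theta = theta - y / (k + 1)          # a_k = 1/(k+1), i.e. alpha = 1

scaled_error = np.sqrt(n_iter) * (theta - theta_star)
print("mean of scaled error:", scaled_error.mean())
print("variance of scaled error:", scaled_error.var())
```

A histogram of `scaled_error` should look close to a normal density centered at zero; other choices of $g$, $a$, or $\alpha$ change the limiting variance.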


Extensions to Basic Root-Finding SA (Section 4.5 of ISSO)

• Joint Parameter and State Evolution

– There exists state vector $x_k$ related to system being optimized

– E.g., state-space model governing evolution of $x_k$, where model depends on values of $\theta$

• Adaptive Estimation and Higher-Order Algorithms

– Adaptively estimating gain ak

– SA analogues of fast Newton-Raphson search

• Iterate Averaging

– See slides to follow

• Time-Varying Functions


Iterate Averaging

• Iterate averaging is important and relatively recent development in SA

• Provides means for achieving optimal asymptotic performance without using optimal gains $a_k$

• Basic iterate average uses following sample mean as final estimate:

$$\bar\theta_k = (k+1)^{-1} \sum_{j=0}^{k} \hat\theta_j$$

• Results in finite-sample practice are mixed

• Success relies on large proportion of individual iterates hovering in some balanced way around $\theta^*$

– Many practical problems have iterate approaching $\theta^*$ in roughly monotonic manner

– Monotonicity not consistent with good performance of iterate averaging; see plot on following slide
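The contrast between balanced and monotonic iterate paths can be illustrated with two deterministic toy sequences. These are not SA output; they are hypothetical paths with $\theta^* = 0$, chosen only to show when the sample mean of the iterates helps. The helper name `iterate_average` is likewise an assumption.

```python
import numpy as np

def iterate_average(iterates):
    """Running mean (k+1)^{-1} * sum_{j=0}^{k} theta_j for every k."""
    iterates = np.asarray(iterates, dtype=float)
    counts = np.arange(1, len(iterates) + 1)
    counts = counts.reshape((-1,) + (1,) * (iterates.ndim - 1))
    return np.cumsum(iterates, axis=0) / counts

k = np.arange(1000)
# Hypothetical path hovering around theta* = 0 in a balanced way.
balanced = (-1.0) ** k / np.sqrt(k + 1)
# Hypothetical path approaching theta* = 0 monotonically from one side.
monotonic = 1.0 / np.sqrt(k + 1)

avg_balanced = iterate_average(balanced)
avg_monotonic = iterate_average(monotonic)
print("balanced:  last iterate", abs(balanced[-1]), "average", abs(avg_balanced[-1]))
print("monotonic: last iterate", abs(monotonic[-1]), "average", abs(avg_monotonic[-1]))
```

For the balanced path the average is much closer to the root than the last iterate; for the monotonic path the average lags behind because it keeps weight on early, distant iterates.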

Contrasting Search Paths for Typical $p = 2$


Theoretical Guidance for Iterate Averaging

• Sometimes possible to provide theory for when iterate averaging is effective or not effective relative to standard SA (no averaging)

– Results apply in setting of finite number of iterations

• For example: Guidance available in special case of root-finding for optimization: Setting $g(\theta) = 0$, where $g(\theta)$ is gradient of quadratic loss function. Leads to linear stochastic gradient algorithm (Chap. 5)

• See Appendix below


Time-Varying Functions

• In some problems, the root-finding function varies with iteration: $g_k(\theta)$ (rather than $g(\theta)$)

– Adaptive control with time-varying target vector

– Experimental design with user-specified input values

– Signal processing based on Markov models (Subsection 4.5.1 of ISSO)

• Let $\theta_k^*$ denote the root to $g_k(\theta) = 0$

• Suppose that $\theta_k^* \to \bar\theta^*$ for some fixed value $\bar\theta^*$ (equivalent to fixed $\theta^*$ in conventional root-finding)

– In such cases, much standard theory continues to apply

– Plot on following slide shows case when $g_k(\theta)$ represents a gradient function with scalar $\theta$

• General case where $\theta_k^*$ not “settling down” considered in:

Zhu, J. and Spall, J. C. (2018), “Probabilistic Bounds in Tracking a Discrete-Time Varying Process,” Proc. of the IEEE Conference on Decision and Control, Miami Beach, FL, 17–19 Dec. 2018, pp. 3146–3151. http://dx.doi.org/10.1109/CDC.2016.7798957


Time-Varying $g_k(\theta) = \partial L_k(\theta)/\partial\theta$ for Loss Functions with Limiting Minimum
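A scalar example with a drifting but settling root can be sketched as follows. The loss $L_k(\theta) = \tfrac{1}{2}(\theta - \theta_k^*)^2$, the root sequence $\theta_k^* = 2 + 3/(k+1)$, the noise level, and the gain $a_k = 1/(k+1)^{0.8}$ are all assumptions made for illustration; they are not the functions behind the plot in the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_bar = 2.0                      # hypothetical limiting root

def root_k(k):
    # Time-varying root that settles down to theta_bar as k grows.
    return theta_bar + 3.0 / (k + 1)

theta = 0.0
n_iter = 5000
for k in range(n_iter):
    # g_k(theta) = dL_k/dtheta = theta - root_k(k), observed with noise.
    y_k = (theta - root_k(k)) + rng.normal(0.0, 0.5)
    a_k = 1.0 / (k + 1) ** 0.8       # assumed gain sequence
    theta -= a_k * y_k

print("final iterate:", theta, "limiting root:", theta_bar)
```

Because the roots converge, the recursion behaves much like ordinary root-finding once the drift becomes small relative to the gain.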

Appendix: Iterate Averaging for Linear Model

• Provide theoretical guidance for when iterate averaging (IA) (Sect. 4.5.3) is effective or not effective relative to standard SA (no averaging)

• Results apply in setting of finite samples (non-asymptotic)

• Consider special case of optimization with quadratic $L(\theta)$, leading to linear stochastic gradient algorithm (Chap. 5)

– Generalization of LMS algorithm (Chap. 3)

• Results documented in:

Feng, C., Wang, K., and Yang, K. (2017), “Finite-Sample Analysis of Iterate Averaging Method for Stochastic Approximation with Quadratic Loss Function,” Proc. 51st Conf. on Information Sciences and Systems, Baltimore, MD, 22−24 March 2017. http://dx.doi.org/10.1109/CISS.2017.7926146


Appendix: Iterate Averaging for Linear Model (cont’d)

• Consider IA for following linear model:

$$\hat\theta_{k+1} = \hat\theta_k - a_k Y_k(\hat\theta_k), \qquad Y_k(\hat\theta_k) = H(\hat\theta_k - \theta^*) + e_k(\hat\theta_k),$$

where $H$ is Hessian matrix of quadratic loss $L(\theta)$ ($H$ not a function of $\theta$)

• Above model includes LMS (Chap. 3) as special case with particular form of $H$ and $e_k$

• Establish necessary and sufficient condition under which IA has lower MSE than non-IA (i.e., standard stochastic gradient, with MSE $E\|\hat\theta_k - \theta^*\|^2$) at any finite iteration

‒ Make simplifying assumption on initial estimate $\hat\theta_0$ (not critical assumption)

• Necessary and sufficient conditions rely on three aspects: (1) Eigenvalues of $H$; (2) cov($e_k$) for all $k$; and (3) gains $\{a_k\}$
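The finite-sample comparison can be explored empirically with a small Monte Carlo experiment. This is only a sketch and does not reproduce the conditions from the cited paper: it assumes the linear measurement model $Y_k = H(\hat\theta_k - \theta^*) + e_k$ with i.i.d. Gaussian noise, gains of the form $a_k = a/(k+1)^{0.501}$, and hypothetical values for $H$, $\theta^*$, $\hat\theta_0$, and $a$. The function name `mse_ia_vs_non_ia` is also an assumption.

```python
import numpy as np

def mse_ia_vs_non_ia(H, theta_star, theta0, n_iter, a, noise_sd, n_rep, seed=0):
    """Monte Carlo MSE of the averaged iterate and of the last iterate."""
    rng = np.random.default_rng(seed)
    H = np.asarray(H, dtype=float)
    theta_star = np.asarray(theta_star, dtype=float)
    p = theta_star.size
    theta = np.tile(np.asarray(theta0, dtype=float), (n_rep, 1))  # (n_rep, p)
    running_sum = theta.copy()                 # sum starts with theta_0
    for k in range(n_iter):
        e_k = rng.normal(0.0, noise_sd, size=(n_rep, p))
        y_k = (theta - theta_star) @ H.T + e_k  # Y_k = H(theta - theta*) + e_k
        a_k = a / (k + 1) ** 0.501
        theta = theta - a_k * y_k
        running_sum += theta
    theta_bar = running_sum / (n_iter + 1)     # iterate average over k = 0..n_iter
    mse_ia = np.mean(np.sum((theta_bar - theta_star) ** 2, axis=1))
    mse_non_ia = np.mean(np.sum((theta - theta_star) ** 2, axis=1))
    return mse_ia, mse_non_ia

# Hypothetical problem instance (all values are illustrative).
mse_ia, mse_non_ia = mse_ia_vs_non_ia(
    H=np.diag([1.0, 2.0]), theta_star=[0.0, 0.0], theta0=[1.0, 1.0],
    n_iter=200, a=0.5, noise_sd=1.0, n_rep=500,
)
print("MSE with IA:", mse_ia, "MSE without IA:", mse_non_ia)
```

Which estimator wins depends on the eigenvalues of $H$, the noise covariance, the gains, and the starting point, so varying these inputs is the intended use of the sketch.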


Appendix: Iterate Averaging for Linear Model (cont’d)

• Can give sufficient conditions under which IA or non-IA has lower MSE

• Figure below provides example sufficient condition when $a_k = a/(k+1)^{0.501}$

• Condition holds when a0m located in either white or dark
