Problems in Bayesian statistics relating to discontinuous phenomena, catastrophe theory and forecasting

(1)

University of Warwick institutional repository: http://go.warwick.ac.uk/wrap

A Thesis Submitted for the Degree of PhD at the University of Warwick

http://go.warwick.ac.uk/wrap/4089

This thesis is made available online and is protected by original copyright. Please scroll down to view the document itself.

(2)

PROBLEMS IN BAYESIAN STATISTICS RELATING _TO

DISCONTINUOUS P1 NONEMA, CATASTROPHE

THEORY, AND FORECASTING

J. Q. SMITH'

Ph. D.

raj

ifý ý` a

University

_{of Warwick}

(3)

BEST COPY

AVAILABLE

Poor text in the original

thesis.

Some text bound close to

the spine.

(4)

N-1

6 DEC

1918

UNIVERSITÄTSBIBLIOTHEK DER TECHNISCHEN UNIVERSITÄT HANNOVER UND TECHNISCHE INFORMATIONSBIBLIOTHEK

The Library

University _{of Warwick}

Coventry CV4 7A

Great Britain

Your Ref.: Our Ref.: 1.1.4/le

Dissertations _Acquisition

Universitätsbibliothek _{und TIB} Welfengarten IB

D 3000 _Hannover 1 (rerderal _Republic

of Germany)

Telfon Date

(0511)762-3419 _November _30,1978

ATTENTION PLEASE ! This is _{a claim} for the b. m. thesis, which has

been _requested from _us _at first _{on July} 26, but till today we haven't received any copy or answer from. Can you please proove, what has

happend _with _our _request?

Dear _Ladies

and Gentlemen,

We kindly _ask _you to _send _{us a 35mm microfilm/xerox/micro-}

fiche _copy (copies) _of the following thesis (theses) for #3ýý}iaY3ýNaiXXfXfXýriX.

purchase.

The title _of thesis (theses):

AUTHOR _: Smith _,JQ: Thesis 1977

TITLE Problems in bayesian statistics relating to discontinuours

phenomena, catastrophe theory and forecasting.

VERIFIED: _--

Thanking _{you very} much in advance for your kind attention to _our letter, I remain

Your faith' _ul _,ý

i. A. /r

(Miss) _Anette _Lueck

(5)

TABLE OF CONTENTS

Page No.

CHAPTER-1 INTRODUCTION TO THESIS ₁

S

CHAPTER 2A DISCUSSION OF BAYESIAN 'INFERENCE

2.1. Introduction 3

2.2. _{A choice} _of topologies _{on F4}

2.3. Topologies _{on FlCn8}

2.4. Restrictions _{on prior} _{to posterior} _analysis 10

2.5. Stability _of Bayes Estimates 18

2.6. A Preview _of the Kernel _of the Thesis 26

2.7. Natural Parametrisations 29

Summary 34-

CHAPTER 3 SOME' PROPERTIES"OF ESTIMATES-MADE UNDER BOUNDED LOSS

3.1. Introduction 36,

3.2. The Step loss function _{and its-properties} 40

3.3. A Classification of Bayes Decisions using

Step loss functions 44

3.4. A new representation of a posterior distribution

function 48

3.5. Examples of Standard 4(b) functions 57

3.6. " Expanding _'families of distribution functiq' s 62

3.7. A link _with the prior likelihood approach 64

3.8. On to Bimodal distributions-The _symmetric

product 66

3.9. Convex Utilities 71

(6)

Page No.

CHAPTER 4 _{CATASTROPHE THEORY IN STATISTICS}

4.1. _Introduction ₈₃

4.2. _-Rules _in _Catastrophe _Theory ₉₆

4.3. _Bayes _Statistical _Rules ₁₀₂

-Summary 110

CHAPTER 5 _{SOME COMMON CATASTROPHES OCCURRING IN STATISTICS'}

5'. 1. _Bivariate _Normal ₁₁₁

5.2. Simple Heirachical Normal Model ₁₁₅

5.3. The Normal Sample distribution _{- Student} _t

. prior distribution model 117

5.4. The Student _prior _{and sample} distribution ₁₂₀

5.5. _{The t-distribution} _under _asymmetric _step _loss ₁₂₄

5.6. _{The t-product} ₁₂₆

Summary ₁₂₈

CHAPTER'6 A CLASSIFICATION OF-EXPECTED LOSS ARISING FROM

GENERAL DISTRIBUTION FUNCTIONS

6.1. Topology

_{of Bayes Decisions}

_arising

from

unimodal distributions 129

6.2. Topology _{of Expected} loss for _multimodal

distributions 143

Summary

151

CHAPTER 7 CATASTROPHES ARISING FROM MIXTURES

Theorem _7.2 (The Cusp Mixture) 152

Theorem _7.4 (The Butterfly Mixture) ₁₇₀

(7)

. Page No.

CHAPTER 8'

GENERALISED BAYES FORECASTING

8.1. _{-Introduction}

8.2. _Normal

_{Bayes Forecasting}

8.3. _{Some Difficulties} _{in Formalising} _and

Generalisation

8.4. _{The Steady} _Model

8.5. _{The General} Steady Model

8.6. The Power Steady Model

8.7. Examples _of _the _Simple Power Steady Model

8.8. Multivariate _Simple Power Steady Models

8.9. A Discussion _of _{Some of the Drawbacks} _of

Box-Jenkins _Modelling Summary

List _of _References

176 176

178 180 181 185 192 199

208 213

(8)

Abbreviations

Unless _otherwise _stated _{n(8, V) represented-a} _normal _distribution

with mean 0 and-variance V. In Chapter 2I _{use P(8)} to denote _the posterior distribution of 8 when P(8) is the'prior. ` All _other

abbreviations are defined in the text.

List _of _{Fig ures"} _{Page No.}

2.1. _' _{Two Bayes} decision ₂₇

2.2. _{An evolution}

_{of expected}

loss

₂₈

2.3. The function k(n) ₃₂

3.1. _- Two loss densities 37'

3.2. The functions _r(b) _{and q(b)} ₅₄

Table 1 fi(b) for-some _standard distributions 59

3.3. Lognormal _{and Gamma fi(b)} _functions ₆₁

3.4. Convex _utility 72

3.5. Convex _loss 72

3.6. Ramp loss 74

3.7.

A

The function

L(bl-b2')(s)

78

3.8. A Catastrophe

79

4.1. The Fold Catastrophe Manifold 89

4.2. Some Fold Potential functions 89

4.3. The Cusp Manifold

90

4.4. The Bifurcation Set 90

4.5. The Swallowtail

Control

Space

92

4.6. A Swallowtail Potential 93

4.7. The Butterflys Catastrophe's _projected

fold

_points

94

4.8. The Restricted Behaviour Space 95

4.9. A path on the control

_{space of the Cusp}

(9)

Page No.

4.10. Typical Time Series _generated _{by cusp} ₁₀₀

4.11. The Hysteresis Loop ₁₀₁

4.12. The-Expected Loss function _with Cost _{on Change} ₁₀₃

4.13. Cost _of Change _effect _{on Control} Space _of _Cusp ₁₀₄

. 4.13a. Introversion and driving-. 105

4.14. Estimated Speed 105

4.15. Loss functions _of _subjects 106

4.16. Expected loss functions _of _subjects 107

4.17. Model Breakdowns

108

4.18. The true _expected loss _{and an approximation} 110

5.1. Bivariate _normal likelihood 112

5.2. Bivariate _normal _manifold 113

5.3. The Bivariate _normal _cusp 114

5.4. The heirachical _normal fold-with boundary 116

5.5. The t-normal _cusp _with trajectories 119

5.6. The txt _cusp _with trajectory. '121

5.7. The t-expected loss (assymetrical case) 125

5.8. The txt cusp on expected loss 127

6.1. A loss function causing bifurcation 138

7.1. Graph of E(d) and some of its derivatives 154

7.2. Section of Manifold where a= 156

7.3. An evolution of expected loss 157

7.4. The normal cusp 161

7.5. Quadratic loss_v conjugate loss 162

7.6. The expected

loss

167

7.7. The cusp 169

8.1. A density-causing difficulty 185

8.2. Box Jenkins

drift

210

(10)

ACKNOWLEDGEMENTS

I wish to thank all

those

in the Department

_{of Statistics}

at Warwick University whose help and enthusiasm contributed to this thesis. In particular I should like to thank Jeff Harrison. _for _his bombardment _with _conceptual ideas _{and for} _checking the thesis before

typing. _{I need also} thank Keith Ord and Robin--Reed-for help in'--

some of the proofs in Chapters 2 and 3, Tom Leonard for his general encouragement and references and Tony O'Hagan-for directing my

attention to the t-normal product given in Chapter 5; the first page of working here is his.

I should like _to _thank _{Pam for} _putting _{up with} _numerous

neurotic storms during my writing up and Terri for making such a good job _of the typing. Finally, I thank the _referee's in _advance for

(11)

DECLARATION

All _work in this thesis is to the best of my knowledge

original, _-except for the page of algebra on the t-normal product

which as mentioned in the acknowledgements is some research of

(12)

(13)

SUMMITRY

The aim of this

thesis

is to generalise

Bayesian

Forecasting

processes to models where normality assumptions are, not appropriate.

In particular

I develop

_{models. that}

_{can change their}

_minds

_{and I}

utilise

Catastrophe

Theory

in their

description.

Under. squared-error

loss

types

of criteria

the estimates

will

be smoothed out, so for model description

and-prediction

I need

to _{use bounded} loss functions. Unfortunately the induced types of estimators have not been investigated very fully and so two chapters

of the thesis represent an attempt to develop theory up to a necessar level to be used on. Times Series _models of the above kind.

An introduction to Catastrophe Theory is then given.

Catastrophe Theory is basically _{a classification} of Cm-potential functions _{and since} the _expected loss function is in fact itself

a potential function, I can use the classification on them. Chapters 6 and 7 relate the topologies _of the _posterior distribution and loss

function to'the topologies _of the posterior expected loss hence a

Bayes classification

of posterior

distributions

is possible.

In Chapter 8,1 _relate these results to the forecasting of non-stationary time series obtaining. ' models which are very much

akin to the simple weighted moving average processes under which lies this firm mathematical foundation. From this i can generate pleasing models which adjust in a "Catastrophic" way to changes

(14)

1.

1). Introduction to, Thesis

y: ay. I take the earliest opportunity-to _apologise for the

style in which this thesis is written. I write in the first person singular for-three quite inadequate reasons. The first is that I find the _passive tense difficult to, read, the _second is that the

word "we" is a

. trifle regal

for _{me, to} feel _comfortable _using _it

and the-third, is that for a scientist who believes only in subjective reality it expresses his basic philosophy succinctly and frequently.

The reason I-have included this note is that I feel it

important _to _outline _the _layout _of _the _rest _of _the _thesis _{so that} the _reader knows _which _parts _are the most important.

Chapters 6,7 _{and 8 represent} the _core _of _results _obtained

and-are the central chapters of the thesis. Chapters 6 and 7 deal with Bayes estimation problems relating the topology of the posterior

distribution to the topology _of the _posterior _expected loss. Chapter 8 gives a generalisation _of the Steady Model defined in Harrison and Stevens (1) for _general distributions.

Chapter _{4 and 5 give} _{an introduction} to Catastrophe Theory and examples-of its uses in getting to understand the topology of well known likelihoods, posterior distributions and expected loss

functions. These two chapters therefore make much less-intense

reading

than the rest

of the thesis.

Chapter 2 is _at the beginning of this. wvork because its

subject matter is extremely important conceptually to the mood of the _rest _of the thesis. I use a stability rather than an axiomatic

approach to show that to act sensibly in a Bayesian framework I must work _with bounded likelihoods and loss functions. Having come to

the latter _conclusion, which I must remark-is not new, though it is _{a point} _never _emphasised by theoretical Bayesians, I realised

(15)

Z. "

Bayesians are bad. It _{was therefore} _necessary to _develop _{a theory} around bounded loss. This is what Chapter 3 is _all _about.

Thus Chapters 2 and 3 are reference _chapters _{and only} _the

results--therein contained are of importance to-the rest _of the thesis. Finally may I point _out that there-is _{a summary} _at _the _end

of each chapter which can be used by the reader to get an idea _of the'main`points _covered in them.

.

(16)

3.

2. A DISCUSSION OF'BAYESIAN INFERENCE

2.1. Introduction

Before _using any inferential _procedure _{` it'} is important _to

understand how and when-that procedure can be used. In the following chapter'the author explores when it is likely that a Bayesian

analysis will give reasonable and coherent results (the word "coherent' being _used in _the _{non-technical} _sense). There _are two _ways in-which

such an exploration can begin.

(i)

_{An axiomatic}

base

(ii) Identification _of _various "counterexamples" to the _system.

The axiomatic basis for Bayesian _statistics is _well documented " (See, for _example, _{De Groot} (1), De Finetti (1), Raiffa (1)). 1

must admit to feeling that although these axioms give a good feel

for _what _{one is} doing _{when making} inference _about distributions, the implications _of _seemingly innocuous _assumptions _{can often} be much

more than one would suppose. It follows that they may induce a type of structure that is clearly undesirable.

Rather than _get lost in these _obscurities (to which I will

refer throughout), I would like to start with the conclusion of the axiomatic systems and criticise the building rather than the bricks. The concluding statement from the axiomatic systems is:

"I _{can express} _all _{my prior} opinions about a particular

parameter 6 e3R in terms of a probability distribution P(8), my

prior belief that 0 lies in a Borel set A being represented by the

number P(A)".

Now there is a gap here, an-inferential step, that many seem to have _missed. How am I to express my prior beliefs in terms of

the _measure P, that. is, do 'I substitute what L conceive heuristically

(17)

'4.

If _there _is _{to be any meaning} _{to Bayes} _inference _{I obviously}

must do the former. For this link _{to be made I feel} _that _it _is _at _least necessary. (but by no means sufficient) that the following _two _criteril

are met.

Criteria

1). _If _{2 prior} _{distributions} _P1(8), _P2(0) _are _"close", _{- then'their} posterior distributions P1(0), P2(6) are close

A

i. e. M. : P(O) _{-> P(O)} is _continuous.

2). If _{2 priors} P1(8) P2(9) _are--"close" then their _associated

decisions (or _estimates) _{dl, d2 respectively} _must _{be "close"}

i. e. 11 : Pk(6) _{-ý dk is} _{a continuous} _map.

If these-criteria _did'not _hold, _then _prior _{probabilities}

would lose their intuitive _meaning _since the _exact form _of the _prior

would have to be known for _{any sense} to _{come from} the inference. But how can one be sure of the-exact form? Obviously one could not

since

the intuitive,

subjective

idea

of probability

is necessarily

fuzzy.

Of course, in the. _usual _statistical tradition (see Wilkinson (1)) I have _not _specified _what I mean precisely by the _above two

criteria.

Firstly

the word "close"

must be defined.

This

is done

by; introducing

_{a topology}

_onto

the distribution

functions

_which

is

to the class

of probability

measures

in some direct

sense.

2.2. _{A choice} of topologies _{on F. (the} _class _of distribution functi

Clearly there _are _{many topologies} that _{one could} _use _to _give

this

idea

_{of. closeness.}

Perhaps

the weakest

_{such topology}

is

(18)

5.

Definitions

Define the distance in _Levy _metric _{PL(F, G) between} _two distribution- functions _{F and G to be the} _infimum _of _all _h>0

such that

F(O-h) _-hs G(O) _{s F(6+h)} _{+ h.} _for _all _0c _1R. The usefulness _of this topology-is that

w

Fn _-*F _properly (i. e. properly in distribution) iff _PL(Fn*F)-O (See Moran (1)).

Although this_1s, _{a useful} _property for _our _purposes it should be noted that it linearly varies its measure of distance as the parameter 8 is transformed linearly.

The uniform _metric p0(F, G) is defined by

po(F, G) = Sup IF-GI 0 eJR

The variation _metric _{pV(F, G) is} defined by

pv(F, G) = Sup fgý UdF -f UdGýI' }

{uEC0:

Mull =1}

Lemma 2.1..

(i), The Levy _metric is-at least _{as weak " as "the} Uniform metric

(ii)

The Uniform

_metric

is at least

_{as weak as the Variation}

_metric

Proof. (i) is _{an obvious} _consequence of the fact that the L4vy metric convergence is equivalent to convergence in distribution.

(ii) Let X, Y has distribution functions 'P _{and G respectively}

(19)

6.

where _{X[a, b](X)} _{_} {1 Xe [a, b]

to _otherwise

and

p (F, G)

_-{uECup

II ulI

U( ) -- ]E U(Y) I

0

which since COIF is dense-in F under the usual _metric

=

Sup

IIE U(X) -1E U(Y) l

{u measurable:

11u1l = 1}

po(F, G). Q

In fact this _weakness is _strict. Consider these two _{counterexamples.}

Counterexample 1.

Let F {O if _xsÖ _{and Fn =} {O if xsn

11

otherwise _.{1 otherwise

Then _{pL(F, Ff)-'} 0asn. _->yet _{po(F, Fn)=1} for _all _n.

Counterexample 2.

It

_{can be shown that}

if

_{F and G are absolutely}

_continuous

CX)

pv(F, G) =f if(x) g(x)Jdx (See Feller (1)).

00

Let F be distributed

_{rectangularly}

on [0,1]

and Gm(x) defined

by

m

GG(x) mE. J(m)(x)

n=1

where

3(m)

_{1xE

_1

_.+Z,

`z)

{m

{

{OxE (-co, n-1I

{m

{ increasing _elsewhere

(20)

7.

Then GG is _{an absolutely} _continuous distribution _on (0,1] _and

po(F, Gm) sm yet it'is easily checked that

m

If(x)

- gm(x)I dx > for all m>M

where U is a suitably large constant.

Hencelp

(F, G )+0

_{and inf}

p (F, Gin) >

as in +m.

0m

_m>M

°

Now I shall introduce-a little notation.

Let _Ti '>' _t. _signify that p. induces a weaker topology Ti than p does TJ.

T signify that the induced topologies are equivalent.

Then the _above _comments _{can be summarised} by Weak convergence '=-', T _L. '>' To '>' Tv

°`' I. shall _{now proceed} as follows. If loan show an inferential

procedure to satisfy Criteria 1 and 2 with respect to weak topologies (in _particular the'ones induced by pL) then I'will provisionally

accept it. If, however, no matter how strong the topology which I put on the distribution fucntions Criteria 1 and 2 are still not

satisfied I will reject the inference as fatuous if used in this general setting.

To show that*any counterexamples are zot extrodinary in some sense I will assume that to reject an inference a Cn(IR) counterexamplel

(i. e. an nx

differentiable

distribution

where n is an arbitrary

(21)

8.

2.3. _Topologies _{"on FCC'(---)}

Let _T(n) _{be the} _topology _induced _{by the} _basis _B(n)(F,

e) where

B(n)(F,

_{c) _ {G}

_{E FICn(-°°,}

_{(_): Sup(IDnF(9)-DrG(8),}

_{< E}}

6 EE

and Tn be the topology

induced

by the basis

_{Bn(F, e), where}

B (F, e) = {G -u F1Cn -Sup'{max-I _{Di F(O)-D1G(@)f} _{< e}}

n (-ý'ý)" ý0 _{1t Osisn}

So, for _example _Tý1ý _close _says _that _the _{p. d. f.} _of _{G are uniformly} close to F

t3 _close'says _that _the _distribution _functions,

p. d. f and its derivative, of G are uniformly close to F,

, and so on.

Clearly _{as n becomes} _large, _the _topology _{Tn becomes} _extremely strong and hence _should _satisfy _anyone.

Using the _notation _of _the _previous _section _the _{following-diagram} is true in FjCnCO)

weak convergence '. ' TL ý_ý To t>r Tv '>' TP) 1_1 r1t>t t2'>* '>, TJ

To prove this, -it is necessary to show

(i)

Convergence

in To

_{'weak convergence}

in FICn(O2CO)

(ii) Convergence in _TMý Convergence in _variation,

(22)

9.

Lemma 2.2.

(i)

Convergence

in To---=; > weak convergence

in FICn(_CO.

()

(ii) Convergence-in _T(12 Convergence in _variation.

Proof: (i) Note that _since F(e) is _{a continuous} distribution function it-is--uniformly _continuous _oniR..

Hence Sup JF(O) _{- F(8+6)1} <n(S) _where n(8) ;0 as 161 }O

0 OR

So Sup *{F(e-e) +e< Fn(8) < F(9+E) _{- E} can be arranged} in the 6 eJR

form

= Sup {t1(e) < Fn(e) _{- F(6)} < T2(e)}: where T1(e) T2(£) -º 0

ea

as c -ý 0

convergence weakly convergence in L4vy metric, so from the

definition _of the Levy _metric

PL(Fn(0), F(S)) _-Oe}O

T1(E), T2(E) +

=ý p0(e)(Fn(A), F(e)) }0 as required

(ii) Let GE B(1)(F, S)

Then f(x)-g(x)Idx sJ f(x)-g(x)ldx +

1Ixj>S(x)dx

+

J

_00

iXl<a'i

f

g(x)dx

xl'a

Note that

lIlxl>a-f(x)dx

=J1 f(x)dx +

!!

lIlxl>s(f(x)-g(x))dx.

2.3.1 1

(23)

10.

s2

{I

<

If(x)

- g(x)ldx

+1+

f ýx( S

<2 S' + i(S) _where _n(S) _=1- T(S-, _{+ F(-5-1)} 0

as 69O.

The result

follows.

0

The'reader _will be pleased to hear that I can make some comments on Bayesian inference now.

2.4. Restricti6ns'on _prior to _posterior _analysis 0 Restriction 1. The Likelihood function 2. (8x) is _measurable.

Comment. This is in fact _{an extremely} _mild _condition _{on ß,} in _fact

it is difficult to _construct _a'problem _{when this} is _not the _case. However it does _emphasise _{one point,} _namely that the family _of

sample distributions that have been _chosen to represent the experiment have to be "sensible" in some way.

Restriction 2. The likelihood function _must be bounded.

Comment. This is _{a restriction} that does not seem to be commonly realised and often "occurs" in practice. To demonstrate why this

does not satisfy

Criteria

1 suppose Z (O t) +w as 6+0

and is

continuous on (O, k) where k>O. 1 can without loss of generality consider

R1(e t) _{= 1. (9 jx)} I (O, k) since I can assume that it is

appropriate to put prior measure zero on 6 outside this range. For construction purposes transform the interval (O, k) smoothly to

(-m,

Co)

(24)

11.

I now have a likelihood 2(6) _x) _on]R _such _that:

R(OIx)

_{; (D}

_{as 6+ _co}

Q(0Ix)

is defined

_{and finite}

_elsewhere.

Consider _the _distribution funcfion F1(OIa): _f1(OJa) _{a n(a, 1)} _aa _3R. Then Sup (Sup _(max IDkF1(8ýa)ý))

amt O¬]R 1sksn

= Sup (max ID'Fl(0IO)J) Mn EIR

O Jt 1sksn

2.4.1.

"i

since Dkf(OIO) = Pk(9) exp {-JO 2} where Pk(0) is a polynomial.

Now suppose I have _chosen _{a prior} F(8) _{and let} G(O) be defined _by G(Ola, _{a) = (1'c`)F(O)} _{+ OF1(61a)}

Then it is _easy to _check, _using _the _above _comment (2.4.1) that for each c>0 there is _{an A such} that if _a<A, ae IR

G(9 _{a, a) E Bn(F, c),} _provided _{Sup (max_ Dk(F(x)))} _sM xQR 1sksn

However, if G and F represent _the _posterior distributions _using respective priors G and, ',

G(O) (1-a) _F*(6) _+a F1(91a)

e.

where F*(61a) _JR, (elx)dF(A)

F1(e1a) _=J _{L(olx)dFl(ela)}

-

00

henc

Ae (1-a)F*(-)F(e) + aF1(-Ia)F1(Ota) G(g)

(1-a)F*(ý) _{+ aF1(cola)}

*

A

_ (1-a*)

F(6)

+a

F1(8Ia)

-1 where a= [1 + (a-1-1)F*(°°)(F (ýýa))-1]

2.4.2. 2.4.3.

2.4.4.

2.4.5.

2.4.6.

(25)

12.

providing it _exists. Since R is unbounded _at _-o, _{as a -> -}

F*(oja) _{-} m.} _It _follows _that _for _each _fixed _{a there} _is _{an R(a)} such that for _all _a< R(a)

(a-1-1)F*(ý)

<1i. _e. _a>

2' 2.4.8.

F*(ýý _a)

Finally _note _that F(a) _}0 _{as a-} _-- _whereas

F1(a _{a) =J} for _all _ae IR _2.4.9.

, Hence there is an R1 (depending on a and hence e) such that if

a< R1(e)

IG(ala)

- F(a)j = la* F1(ala) _- (1-a*)F(a)I >

1.

., 2.4.10.

Hence for _all _priors _{F and all} _{e >0} _there _is _aG _such _that

GE _{Bn(F, e) and yet}

1 G _{Bo(F, n)} _if _n< _3"

In words this means that _{any prior} distribution I choose _must'be

specified exactly, otherwise the posterior distribution and hence my consequent inference will be more or less arbitrary. Hence

Criteria 1 is _not _met. 0

It follows _that _in _general _{I cannot} _{make sensible} _inference in _{a Bayesian} _setting (or for that _matter _{m. l. e. approach} _see

Edwards (1)) _using _unbounded likelihoods. The difficulty is _only.

an apparent one, however, for the following reasons. In any experiment (and _here, _{I. echo Bartlett} _et _al (1)) _{I;, can only} _take-a _measurement

within a tolerance governed by my measuring instrument. So rather

than _take _{an observation} _xI take _{an observation} _x±c. _Hence _{I will,} in _general, _observe _{a set., uf_ positive} _measure

rather than _{a point.} ' It will-follow that-the _{corresponding} _likelihood

(26)

13.

Since _the form _of _the _tolerance _region _(x _-e) _{taken 4 äre} _often communicated by the scaling in which the _observations _are _given'. _I

feel-that the _'invariance _principle _applied _to _the _scaling _in _the'

sample distribution

of the data

is a bad Criteriönfor

discrimination

between

inferences.

_-

Restriction 3

My; prior F must be such that

J

£(9lx)dF(8) # o.

Comment

This _means _that _apriori _{I must have the Prob} (k(8! _x) _{> 0) >0} Even _{a Bayesian} is _going _{to be confused} if'every _value _of _{e he thinks}

a priori is possible his data tells him is not and values of 6 he diened impossible-his _data _tells _{him. have} _positive _likelihood.

This highlights _indirectly _{an important} _deficiency' _of

Bayesian _analysis. _If one remembers _that _{A is} _just _{a label} _for

a particular family F(A) _of _sampling distributions then on putting

a prior on 6 we automatically _put zero measure on all other sampling distributions. So I cannot _{see when my data} (whilst _not _rejecting

as above) seems to contradict the family.

For _example _suppose _that _{I 'assume} that the _random variable X is _normally _'distributed _with _{mean p and unit} _variance _{and I have}

prior on 0 which is normal mean 10 variance 1. If I now take 1,999 observations of which

999 have _value 10.0000 1000 have _value _-10.000.

Then my posterior

distribution

(which

I quote

"contains

_all

_the

(27)

14.

But _what _rubbish is this! _{I am almost} _positive' _that _the _points

have _{come from} _{a n(O, 1) distribution!} _This _is _obviously _ridiculous,

but without

sidestepping

the formalism-there

is no way out of the

problem. The-more-data points I have the more information there

is _{contradicting} _my-prior _choice _of distribution _functions, _but _I cannot adapt to-more likely ones since I have put measure zero _on them.

Ofýcourse this _problem _arises from the fact that I am not putting a prior "distribution" across function space in a sensible way. At the moment I am researching into-more sensible methods

but it _should _{be noted} that the inferential procedures I'have found

so far, contradict De Groots 5th axiom (1), that_is, that I-oän compare the _chance _of _each distribution function being _right _with _{a uniform}

distribution _{on [0,1].} _For _interesting _analagous _problems _see

Ferguson (1), Leonard (4).

Henceforth _write F(6) _{as the} _posterior _distribution _of ₀

given

data

Restriction 4

Let T(. Z(Ofx)) _{be the} _set _of discontinuities _of Z. Then

T(. Z(O! x))

must have measure

zero with

respect

to the prior

distribution F. Comment

First I will _give _{an example} _of the restriction.

Example

Let £(OIx) ₌

10 _on

cA

[0,1)

1 otherwise

The prior T(8) is _{a rectangular} distribution _{on [0,1]} _{and the}

(28)

15.

Fn(e)

_{_}

"{

0x<O

{m/n _x _{[m, m+1)} _{0 -s ms ni-1)}

{1 _otherwise

Then pL(F, Fn) =n -ý 0 as n -ý

But F(O) is _{a rectangular} distribution _{on [0,1]}

AA

Fn(6)is {0 _{x. < i} if _{pL(F, Fn)} ₌₁ for _all _n.

{

{1 _xz1

Hence _if _{I have}

a , family of _sampling distributions labelled,

by 0, _{I must arrange} the family in _such _{a way that} the _priors _are ordered in a natural continuous _{way with} respect to their labels,

i. e. the closeness _of distribution _functions is _mimicked _by

closeness in 0. This is _{a topic} that _{I will} _{go into} _more detail about in a later _section.

Having _{now discussed} _the _{restrictions'I} _{can show that} _under them Criteria 1 is _satisfied.

Preservation under L6vy _norm

Lemma 2.3. If (i)-Q(0jx) is _measurable

(ii) _0n w '0 _{and P(T(R(0jx))} ₌₀ (where T is

defined _above)

then 2(0n) w R(8)

Proof [See Billingsey (13)] Q

Lemma 2.4. If R, (Gfx) is bounded above, then

x(ed w` Z(e) -: ý, iE(9,

_{(0 )) -> IE(Q(O))}

(29)

16.

t? 'ß'`01 _.1--. '

Theorem

Let Pn -ý P in L6vy metric,

_{and ! t(O) be a likelihood,}

=then provided

(i)

k(O)

is-bounded

and measurable

(ii) P(T(2)) _{=. O} 00

(iii) f

_-L(e)dP(G)

>

0,

then Pn P in Levy _metric. _.

Proof. As a consequence _of Lemmas fand 2 and assumption (iii)

J R(8) dPn(0) _+f2, (0) dP(8) _{= P*(6)} >02.4.11.

3p. 3R

where convergence is in Levy metric Also for _each t

pn(t)

_{_i}

CO

_{X(-°°, t]}

_£(0)dZ

_{(0) -> I}

:

_X(--,

tln(e)dP(e)

_{= P*(t)}

2.4.12,00

n.

₁₁

CO

_{as n -ºco}

at all continuity points-of P, by replacing ß(8) by X(-, t) £(A) and using Lemmas 2.3 and 2.4.

,.

P(t)

But P (t) _=n, P(t) ₌ P* t) _{, which} by (2.4.11) and (2.4.12)

n _P*n(-) _P*(°°)

. gives that Pn(t) + P(t) at. all points of continuity of P(t)

(Since _P*(-) _{> 0)}

The result

follows.

0

-To conclude this section, consider-the following theorem.

Theorem 2.5.

Suppose that

(i)

2, (6)

is bounded

above by U and is

measurable

(30)

17.

Then Fn -F in _{Tv topology} implies _{Fn ;F} _in _{zv topology} _in _the class absolutely continuous priors.

Proof Firstly, _{as in} _{the, previous} _theorem, _{by premise} _(ii). _it is _sufficient to _prove that _-

Fn-, -ý- F Fn -} F* using the _notation" _above.

Well, _since k(O) is _measurable f* _{= £f} _{and fn} _{= Rfn} _are Lebesgue

measurable as a consequence of the D. C. T, it follows, that F and Fn, are absolutely continuous.

00

So P(F*Fn)

₌

J-

JL(O)f(e)

_{- £(e) f-(e)IdO}

.

oo

sMJ 1ß('o)-ßn(e)Ide

_ Mpv(F, Ff). The result follows

by a comment in _the previous section

0

So ifI _work in the _class of _all _absolutely _continuous distribution

functions, in fact Restriction 4 isýno longer _needed _provided _the p; -metric-is used. Perhaps the moral of"this story is-that _when

dealing _with Bayesian inference, _-the Levy topology is"a little too weak'and that it might be more sensible to restrict oneself to

absolutely continuous priors when"considering continuous phenomena. MMy personal feelings are that Restriction 4 is much more difficult

to justify than the _restrictions imposed above.

Note that there is-still _{a need} for _{a natural,} _ordering _of

parametrisation

of the family

of sample distributions

since

2

likelihoods £l' and £2 equal a. s. will _give the _{same inference.} The set of measure zero on which they are different _{may contain} the

(31)

18.

2.5. Stability _{of Bayes} Estimates

The reader will be familiar _with the _{way a Bayes} decision/ estimate d is made. Without loss of generality I can assume that my utility is linear (by absorbing it into the loss function

(see Chapter _{3) so my- Bayes} decision _corresponds to the infimum _of the _expected loss (with--respect to prior F and loss function. L)

written E(L, F, d) (where L and'F may be omitted if no confusion arises. )

Since _{my concern} is _estimation _rather than _general decision making I will assume I have a loss function of the form: L(d-A).

Unbounded loss functions

Definition An unbounded loss, function L(d-e) has. the _property that

Sup L(d-8) _{_} _for _all dEI..

6ES

where S is the extended _support of the posterior distribution F (i. e. the smallest:

-open

interval _containing _all _points _such that f(6) _>0 (see Chapter 3))

Until _{now I think} _that it is _safe, _{to.. say that} the bulk _of

Bayes _estimation have _corresponded to such loss functions. The-, fact that this is theoretically absurd is apparent when one sees that I

am taking the infimum of a function E(F, d) which may or may not occur depending _{on the} particular convention I employ to obtain F. I will

elucidate this point.

Claim

The rate

of convergence

to zero of the tails

of my

(32)

19.

Justification

Suppose that I` have _{a family} 'of _sample _{distributions} G parametrised by -O and have observed a measurable' set'

T= (x _-6, _x+ _e)..

Following the _{same sort} _of _argument _{as for} Criteria 1, the family

G= {G8(x) _;6e ]R}

of sample distributions cannot be specified precisely, each must

be considered as a representative of itself. and distributions "close" to it in _{any practical} _situation. Again I must define "close"

so I shall use the strong definition in the section preceding this, since I am looking for counterexamples.

Now consider _{an alternative} family _of _sample distributions G* defined by

ß* _{ (1-a )GB(x) _{+ aF1(s} a) a _.ý],, 0sas 1}

where F1(xja) is a normal distribution with mean'a and variance 1. In _{an exactly} _analogous _{way to} the _example in Restriction 2 in the previous section, it can be, shown: that for all C>0 there. is an

Asuch that for _all Osa: A and a, A ¬]R

GE G*-

GE B(G0(x),

_c)"-

porovided Sup { max D1 G0(x)} sH for some ME 3R.,

xcS(G0) 15iSn

In particular

putting

a=x

where x

is defined

above,

the likelihood

induced

by G(O, a, a),

£, (9)

is-such

that

L* (0 ) -º a as 10 1 _-} ' provided that the

original family with likelihood £( g) has the property A, (e) _., 0

as

tel

-} 110

_.I

leave

the reader

to check that

_small

_{perturbations}

(33)

20.

The point of this claim is to _{show that} _{perturbations} _of my prior distributions in the tail under Tn topology can induce

the _{same perturbations} in the tails _{of, the} _posterior distribution

if the _problem is _changed _{an indissemable} _amount, _since by wiggling my family of sample distriubtions a bit I can make the likelihood

constant in the tails. Hence without loss of generality I can

consider perturbations of the posterior distribution rather than the _prior, _when-the _{perturbations} are in the tail of the posterior.

I can now return to the original problem, _{- that} of constructing counterexamples to using unbounded loss functions.

Lemma: 2.6.

Suppose that _{a function} L(9) _{as 8+} and finite

elsewhere in ]R, _0. Then there exists a distribution function F such

that F2 E FICnsuch that

00

1

L(O)'-dF2(0) _{_ co}

, where Sup {max DiF2 (9)} ýM _n

-00

t 1sisn

Proof-

Since L(6) _{} o, there*exists} _points t1.

-.. tn such that L(A) _>2

L(A) _{> 24}

L(9) _{> 2j}

Let F2(6)

_{_}

Co ₁

Ji(6) E2

i1

'0E

A1=

e EA2=

(ti-1,

t1+1]

[t2-1,

t2+11.

e E, A [t-1, t+1]

where Ai nA it je IN.

where J1(0)

function _wi- (ti-l, _t1+1)

Sup'{max _D

15isn

is some chosen Cý distributio

th p. d. f. having

support

in

and such that

(34)

21.

Then (i) T(0) is _{a Cc* distribution} _function

(ii) I L(8)dF(O) _>21+ _22.1 ₊

.., = CO

Now suppose I do a Bayesian analysis _using _{a loss}

function L(d-O) _{and a posterior} distribution _F, _giving _rise _to a Bayes decision d, the minima of expected loss, its associated expected loss being E(L, F, d ). Perturbing"F slightly in Pn

topology by replacing it by

G= (1-a)F _{+ aF2} _where F2 is defined in the preceeding lemma,

I see that the _associated _expected loss _{no longer} _exists _at d i. e. this is _in _fact _{a worst} decision. _{So my posterior} decision

depends _crucially _{on the} (usually _conjugate) form I have _chosen _to approximate it. This is _obviously _{unacceptable.}

Consider _this _even _more _stunning _{counterexample} _when _{I use} convex loss functions (advocated by De Groot (2)) and including

the _squared _error loss _currently in vogue.

Let (6-d) _=S _{and suppose} _L(s) is differentiable _and L'(s) is _strictly _monotone.

Lemma 2.7.

If

L is defined

_{as above and E(L, F, d) exists}

for

_all

d EiR,

then it

has exactly

_{one minima,}

provided

F'(0)

# 0,0

eIR.

Proof

It is _obviously _sufficient to prove _that _{E(L, F, d) has exactly}

one stationary point since E(L, F, d) as d+ _-so this

(35)

22.

1e11 E'(P, L, d) - E' (F, L, O) =

L'(s)dF(s+d)

-

L'(s)dF(s)

]R

_1R.

_

J(L'(s)_L'(s_d))dF(s) 3R

which is strictly monotone in d since L'(s) is strictly _monotone in its _argument (by definition). Hence E'(F, _{L, d) is} _strictly

monotone. It follows that it will cut the axis d=0 at most

once.

11

Theorem 2.8.

Let L(8-d) _{be a convex} loss function _with its derivative

L'(s) _continuous _{and my original} _posterior distribution _function _F is-such that

Sup, { max ID F(A)J} _sM _where f(6) ý0 _on]R.

8 1t Osksr

Then for

_all

_c(small)

_>0

_A'(large)

_>0

_{and nEN}

there

_{is a}

posterior

distribution

function

G such that

Pn(F, G) <e _and Id(F) _{- d(G)t} > A*

where d(F), and d(G) are the Bayes decisions corresponding to F and 0,, respectively.

Proof. Define F1(a, O) as n(a, 1) and

(1-a)F(O) _{+ aP(a, O).} Then it has

previously been shown that for e >O, there is an a>0 such that for _all _aa< _{a* and ae} IR

(36)

23.

Let E(d, G)-represent the _expected loss _with _respect to decision d and distribution function L. Then d(G) such that E'(d(G), G) =0

gives the (unique by Lemma 2.7) Bayes decision.

Well, E'(ýG(a, _a))

j,

L? (s) ((1-a) I(s+d)+af1(a''s+d))ds ₌₀

implies E'(d, F) =a E'(d, F1(a)). 1-a

Fix _{an arbitary} d E1R. Then it is easily checked that

E'(d, F1(a)) as a

E' (d, F1(a) _{-ý -ý} _as _{a -ý} _-ý since L' is increasing _and _unbounded.

It follows that for _all de 1R there is an a(d) E iR such that

E'(d, _{F) =a} E'(d, F(a)), i. e. such that d is the 1-a

unique Bayes decision with respect to G(a, a(d), 6). _.

The result is _{now clear.} 0

.I hope that the reader is now satisfied that such contortions

of a proper Bayesian analysis are just not on. The question remains "Is Criteria 2 satisfied by a proper Bayesian analysis? " (i. e. one

where bounded loss functions are used). The answer is almost. First a definition.

Definition

A decision d(F) is _said to be ; table within J with respect to topology induced by the metric p if'for all n>0 there is an e>0

such that,

A

p(F, G)< e( C(G) _{- d(G)} I<n

where C is _{some point} in J. A

(37)

24.

Theorem 2.9.

Suppose (i) L(s) _=L (s) _+L _(s)

where L(s) _-= {L(s), s>0L _{; (s)} _ {L(s) s<0

{{

{ 0, _otherwise,

- { 0, otherwise

and such that L+'} are right continuous increasing} and bounded by M

L- } left decreasing} _M2

(ii) _J _{L+ (e-d)dF(6)} _and JmL(8_d)dF(O) _are _continuous.

then E(L, Fn, d) _{-ý E(L, F, d) uniformly---'as'} _{pL(Fn, F)} 0 Proof

E(L, G, d) =f

IR

+JL (6-d)dG(8)

fIR

gt

= M11

JH(d_O)dG(O)

_{- 1121}

JH

(d-e)dG(6)

where k+and`H^ are distribution functions. Each of the above integrals is _therefore _{a convolution.}

E(L, G, d) = M1' UG1)(d)

where UG1) and UG2 are the It is _{a well} known _result t:

PL(F, Fn) "' 0

Hence

+ M21 U(2)(d)

convolutions mentioned above. hat (See Feller (1))

pL(UGi), U('))+ 0i=1,2: n

so provided

U(1)

and U(2)

are continuous

by Lemma 2.2.

pL(F, Fn) -} 0

po(U(i),

U(i)-}0

i=1,2.

n

So in particular

E(L, Fn, d)-+ E(L, F,, d)^, uniformly _{as. pL(Fn, F) -º 0.}

_cl

(38)

25.

Comments

The conditions on this theorem _need _{some discussion.} _Firstly if F is _continuous (see Feller (1) _{p. 147)} _condition _(ii) _is _auto-

matically _satisfied _{and even} ii L+ and L _are _not _respectively _right. and left _continuous the _proof holds. So inthis _case the theorem is true for-all bounded loss functions _of the form L(s). _Secondly

if _{I allow} _{F to be discontinuous} _{and L continuous"in} _{S, the} _conditions of the theorem _are _also met. So unless my loss function and

distribution _function _are _very discontinuous _the theorem holds. Finally _{note that} _if _pL(Lf, _{L) -)- 0, E(Ln, F, d) ; E(L, F, d)}

by symmetry of the _convolution _operator. Putting these two results together _gives the following _Corollary.

Corollary 2.9.1.

If _the _conditions _of _Theorem _2.9 _are _{met for} _all _{Ln and}

Max {PL(Ln, L), _{PL(Fn, F)}} _} 0, then

.

E(Ln, Fn, d) } E(L, F, d) uniformly. 0

Thus if Loss _functions _{and distributions} _are _close, _{so is}

the _expected loss function. This is the _most important _result for Bayesians. However _the _Bayes _estimate is _{one step} _{away from} _this,

since I am interested in the infimum of such functions.

Theorem 2.10

Let D be the _set _{of minima} _of _expected loss _with _respect to the _originally _chosen _posterior distribution E(L, F, d). Suppose

there is _{no sequence} dD such that

lim _{E(L, F, d) } E(L, F, d(F))} Jý00

(39)

26.

Since E(L, Fnd) _{}, E(L, F, d) uniformly}

inf E(L, F

, d) ; inf E(L, F, d) (since E -> 0 by definition)

dot nd _FýR

Then if _{E(L, Fn, d) - E(L, F, d) uniformly} d(F) is _stable _within J where

Proof

J= _set _of _original Bayes decisions.

Suppose _to _the _contrary there _exists _{a. sequence} _of distributions such that

E(L, Fn, d) ; E(L, F-, d) and yet

lim Id(Fn)'-- d(F)l >c _where d(F) EJ and d(Fn) is a

Bayes decision _{w. r. t.} E(L, Fnd).

. Hence. E(L, F,, d(Fn)) ; E(L, F, d) hypothesis. 0

as, n -> co -contradicting _-the

Since this _condition _is _extremely _weak (since it is _{on the} posterior I end up with) I have-in fact proved that condition 2

is _satisfied _provided _{I use a bounded} loss function in the Bayesian setting.

A related piece _of _work has subsequently emerged from Kadane & Chuang (1) dealing in _{a weaker} sort of way with general loss

functions _of the form L(d, O). _{-They seem} however to completely miss the _point that the infimum of an expected loss function has no

meaning if that expected loss function-does 'not exist.

2.6. A preview of the Kernel of the Thesis

The reader may be wondering what this has to do with Catastrophe Theory _{and Time} Series. The answer is that Catastrophe Theory is _a

classification theorem about families of potential functions THE EXPECTED LOSS FUNCTION IS A POTENTIAL _FUNCTION

(40)

27.

since the behaviour (or decisions). are governed by its minima. _-Thus- theorems _classifying _minima _of _potentials _{are-applicable.}

Now if-convex loss functions _are _{used, Lemma} _2.7 _{shows: that} only one minima can arise on the expected loss function. Since

Catastrophe Theory is'a _{classification} in terms _of the _number _of

minima, it would be redundant. But I have shown that such abortions of. Bayesian analysis admit ridiculous and unacceptable results. I must work with bounded loss functions so Catastrophe Theory is

applicable.

Now consider the last theorem, but this time, _rather than

interpret F _{as the} _original _posterior distribution let it _represent the "best" _{representation} _{of my posterior} beliefs. Suppose the

corresponding expected loss function E(L, F, d) is smooth and has 2

minima and 1 maxima _where the 2 minima ml, m2 are each Bayes decision with respect to F. _-

E(

Fig. 2.1.

If _{F1 and F2 are 2 perturbations} of F, i. e. 2 approximations _of the

"best" _{representation} _{of my posterior} beliefs, then _{no matter} how good my approximations, the Bayes decision d(F1), d(F2) of F1 and FZ

respectively could be very different and unique (e. g. d(F1) _near _ml near _m2) Hence a bifurcation (classified in Catastrophe Theory) is

observed.

(41)

28.

A

Certainly in _a _one-off _situation _such _{as F (with} _two

identical _minima) _would be an extreme _oddity. (see for _example Morse Theory). But _suppose {F} is in fact _{a parametrised} family

{F(t) _:t _c]R} _where _{t might} for _example _represent _time. _Then _one often finds that the family {E(L, F(t), d) :te 3R) goes through

evolution @ pictured below,. as t increases.

E(L, F(t), d)

d, ý d(F(t_) )

E(L, F(t1), d)

d

E(L, r(t3),

d)'

d _t1 _{< t2} _{< t3} d(F(t3))

Fig. 2.2.

--, 1-

(42)

29.

The-Bayes decision. _thus _experiences _{a jump} _from _{a point} _in a neighbourhood _of d(F(t1)) to _{a point,} _near d(F(t3)). Hence,

Catastrophes _occur _(in _{a Bayesian} _sense) _readily _in: Time Series

Sequential Analysis

I Removal of nuisance parameters and many other

fields. The first _of _these is _the _most _striking _{and so'I} _shall concentrate on it in. this thesis.,

nx Differentiability of the expected loss follows if either: (i) L is _nX _{differentiable} _{w. r.}

_t. d and bounded, in some sense (see Burrill (1) for details),

(ii) _{the-distribution} _function _used _is _{sufficiently-smooth.}

Since _such _loss _{functions/distributions} _functions _will _be dense (under _{PL metric)} _in _the _space _of _all _bounded _loss functions/F, the last theorem _{and the} _previous _Corollary _{mean that} _without _loss of generality I can make assumptions. _of _smoothness _of E(d).

Since _no-one _{has yet developed} _Bayesian inference _with bounded loss functions far _enough _{I must} devote _{a chapter} to these _problems

(Chapter 3) before _going _{on to} _using Catastrophe Theory _{on them.} For this I need the concept _of _natural _{parametrisation.}

2.7. Natural Parametrisations

It does _not _{seem to be widely} _realised _that _Bayesian _inference

using

a loss

function

L is. invariant

under

transformations

of the

parameter, even though the posterior moments/mode _are not, _since

(43)

30.

The transformation

is just

_absorbed

into-the

loss

function.

Therefore

the _most _sensible _{way in which} to define _{a "natural"} _{parametrisation}

is _{by specifying} the loss _structure _and-then _changing the _{parametrisatio}

of 6 and d such that

L-is

in-a

simple

form.

In most problems

of an

inferential _nature it is _possible to _convert the _original loss function I into the form L(6-d) by an appropriate choice of parametrisation _of 0.

Note _that _this implies that before I make a-Bayes _estimate I

must say what it is I am looking for. (i. e. fix L(O-d). This seems sensible but in an inferential setting. the appropriate loss function

may at first seem difficult to find: many statist. icians, (inclu ding

Bayesians) don't like to _state _explicitly _what they are looking for, for fear that their _results _may _seem _subjective. The following

suggestions are for those at a loss choosing their right parametrisat - i. e. when the loss structure is not immediately obvious to them.

Definition

Call _{a parametrisation} _{6 of a family} F0 of sample distributions p-continuous if