An investigation into the properties of Bayesian forecasting models

(1)

University of Warwick institutional repository: http://go.warwick.ac.uk/wrap

A Thesis Submitted for the Degree of PhD at the University of Warwick

http://go.warwick.ac.uk/wrap/34685

This thesis is made available online and is protected by original copyright. Please scroll down to view the document itself.

(2)

AN INVESTIGATION _{INTO THE PROPERTIES} _OF

BAYESIAN _{FORECASTING MODELS}

by

Nick S. Cantarelis

A thesis _submitted for the degree _{of Doctor} _{of Philosophy}

School _of Industrial _{and Business} Studies, University _of Warwick,

Coventry CV4 7AL

(3)

TABLE OF CONTENTS

Page

CHAPTER 1 I1'TRODUCTION

1.1 Definition _of time _series 1

1.2 _{The role} _of _short _term _sales forecasting

in management 2

1.3 Historical development _{of time} _series

methods 2

1.4 _{The case for} Bayesian forecasting 9

1.5 Objectives _and _structure _of _the _thesis 19

1.6 Notation _{and abbreviations} 22

CHAPTER 2 THE BAYESIAN APPROACH TO FORECASTING

2.1 Foreword 23

2.2 The Kalman Filter

2.3 Properties _of _the _constant _variance

Steady State Model (SSM) 23

2.4 Properties _of _the _constant _variance Linear

Growth Model (LGM) 31

CHAPTER 3 THE MULTI STATE MODEL (MSM)

3.1 Foreword 39

3.2 _Modelling discontinuities 39

3.2.1 _Model _of _{a "no} _change" _state _{: Mtl)} 44

3.2.2 Model _of _{an "outlier"} _state Mt2) 45

3.2.3 _Model _of _{a "growth} _change" _state _{: itt3)} 47

(4)

(ii)

Page

3.3 _{The updating} _procedure 50

3.4 Supplying the initial information 60

3.4.1 _Introduction 60

3.4.2 _Starting _values for _m0

, C0

and pM 61

0

3.4.3 _{Normal noise variance} 63

3.4.4 _{System parameters} 64

CHAPTER 4: MEASURES OF PERFORMANCE

4.1 _{The problem} 67

4.2 Measuring _the _goodness _of RW 70

4.2.1 _Response _{to state} 1- "no change"

R(1) 72

4.2.2 _Response _{to state} 2- "outlier"

R(2) 75

4.2.3 _Response _to _state 3- "growth _change"

R(3) 76

4.2.4 _Response _to _state 4- "step _change"

R(4) 79

4.3 _Consistency _{of performance} 79

4.3.1 _Foreword 79

4.3.2 _Experiment 81

4.3.3 _Discussion _{of results} 86

4.3.4 _Experiment 88

(5)

(iii)

Page

CHAPTER 5: UNDERSTANDING AND CONTROLLING THE

RESPONSE OF THE SYSTEM

5.1. Introduction 98

5.2 Interactions between _{H's and} X's 99

5.2.1 _Controlling _R(1) ₁₀₃

5.2.2 _Controlling _R(2) ₁₀₄

5.2.3 _Controlling _R(3) ₁₀₄

5.2.4 Controlling R(4) 109

5.2.5 Concluding _remarks 111

(j)

5.3 _Relationship between _{X's and R} 112

5.3.1 _{MSM (1,2)} ₁₁₈

5.3.2 _{MSM (1,3)} 119

5.3.3 _{MSM (1,4)} ₁₂₃

5.3.4 _{MSM (1,2,3)} 126

5.3.5 _{MSM (1,2,4)} 130

5.3.6 _{NSM (1,3,4)} 134

5.3.7 _{MSM (1,2,3,4)} 142

5.3.8 Concluding _remarks 151

5.4 Robustness _{of the system to the nominated}

noise variance VE _N 152

5.4.1 _{Underestimation} _{of V} 155

C

5.4.2 Overestimation _{of V} 158

E

(6)

(iv)

CHAPTER 6: ON LINE ESTIMATION OF V IN THE MSM C

6.1 _{The Fixed} _{Range Method} (FRM)

6.2 _Illustration _{of the FRM}

6.2.1 _{Data used for experimentation}

6.2.2 _Selection _{of V range} _{and prior}

probabilities

6.2.3 _Effect _{of the choice} _{of K}

6.2.4 _Effect _of _process _{discontinuities}

6.2.5 _Effect _of _{a discontinuity} _{in VE}

6.2.6 _Illustration _of _{FRM on 3M Data} _and B/J Data

6.3 _{The Variable} Range Method (VRM)

6.4 Illustration _of _the VRM

6.4.1 _Foreword

6.4.2 _{Specification} _of _the _variable _{V range}

6.4.3 _Speed _of initial _convergence

6.4.4 _Effect _of _process discontinuities

6.4.5 _Effect _of _{a discontinuity} _{in V}

6.5 Concluding _remarks

CHAPTER 7: ESTIMATION OF VARIANCES IN THE STEADY STATE DLM

7.1 Foreword

7.2 On line _estimation _{of VC} in the SSM

7.2.1 _{FRM in} _the _SSM

7.2.2 _Numerical _Illustration

7.2.3 _{VRM in the SSM}

7.2.4 Numerical Illustration

(7)

Page

7.3 _{On line} _estimation _{of V} _{and r}

E

in the SSM 244

7.3.1 _Numerical _Illustration ₂₅₁

7.4 _{On line} _estimation _{of V} _{and r}

eu

when both are assumed to be constant 262

7.4.1 _Optimality _conditions for _the

likelihood function 263

7.4.2 _{CVM basic} _concepts 264

7.4.3 _{CVM updating} _procedure ₂₆₆

7.4.4 _{Unconditional} _posterior _estimates

and prediction 269

7.4.5 _Numerical _Illustration 270

7.5 _Concluding _remarks 274

CHAPTER 8: ALTERNATIVE FORMULATIONS OF THE MSM

8.1 _Foreword 276

8.2 _Expansion _{of matrix} _operations 277

8.3 _{The I-} _Point _{System (1PS)} 280

8.3.1 _Comparison _{of 1PS and MSM} 282

8.4 _{The 2-} _Point _System (2PS) ₂₉₃

8.4.1 _{2PS Kalman updating} _{and Collapsing} 296

8.4.2 _Numerical _Illustration _of 2PS 300

8.5 _{The "three-three"} System (33S) 313

8.5.1 _Numerical _Illustration _{of 33S} 316

8.6 _Concluding _remarks 331

(v)

,I

(8)

(vi) Page

APPENDIX A: Notation _{- Abbreviations.} _{-. Operators} A. 1

B: The derivation _{of the Kalman Filter}

equations A. 6

C: The incorporation _{of seasonal} _and

promotional effects A"10

D: Variance _{of the one step ahead forecast}

in the SSM

error e A. 25

t

E Mixed distributions in the _collapsing process A. 27

F: Generation _{of random Normal} deviates A. 30

G: Figures G. 1 and G. 2 A. 31

H: Computer _programs A. 51

I: Figures 1.1,1.2 _{and 1.3} A. 66

J: Results _referred to from _section 6.2.2. A. 68

K: Relative importance _{of VE N and r kN} in

ft, A. 72

the SSM

L: Table L. 1 A. 78

M Tables M. l. a., M. l. b., M. l. c and

M. 2. a., M. 2. b., M. 2. c. _.A. 82

N: Tables _{N. 1, N. 2 and N. 3} A. 88

P: Tables P. 1 to P. 12 A. 94

(9)

(vii)

LIST OF TABLES

Table Page Table Page

1.1 16 6.6 183

3.1 65 6.7 184

3.2 66 6.8 186

4.1 68 6.9 188

4.2 76 6.10 189

4.3 84 6.11 189

4.4 84 6.12 1,92

5.1 101 6.13 193

5.2 102 6.14 196

5.3 103 6.15 198

5.4 ₁₀₄ _6.16 ₂₀₉

5.5 105 6.17 212

5.6 107 6.18 216

5.7 108 6.19 218

5.8 111 6.20 219

5.9 114 7.1 232

5.10 118 7.2 235

5.11 120 7.3 241

5.12 123 7.4 253

5.13 127 7.5 255

5.14 131 7.6 258

5.15 135 7.7 260

5.16 144 7.8 272

5.17 156 8.1 282

5.18 156 8.2 287

5.19 157 8.3 289

5.20 162 8.4 _. 301

5.21 163 8.5 304

6.1 176 8.6 308

6.2 177 8.7 317

6.3 179 8.8 320

6.4 180 8.9 324

6.5 181 8.10

(10)

(viii)

LIST OF FIGURES

Figure Page Figure Page

3.1 40 5.12a 139

3.2 46 5.12b 140

3.3 58 5.12c 141

4.1 69 5.13a 145

4.2 74 5.13b 146

4.3 74 5.13c 147

4.4 78 5.14a 148

4.5 78 5.14b 149

4.6 80 5.14c 150

4.7 80 5.14d 153

4.8 82 5.14e 154

4.9 92 5.15 . 160

4.10 93 5.16 161

4.11 94 6.1 197

4.12

95

6.2

199

5.1 106. 6.3 214

5.2 106 8.1 280

5.3 110 8.1a 283

5.4 110 8. lb 284

5.5 115 8.2 286

5.6 116 8.3 294

5.7a 121 8.4 295

5.7b 122 8.5 299

5.8a 124 8.6a 305

5.8b 125 8.6b 306

5.9a 128 8.7a 309

5.9b 129 8.7b 310

5.10a 132 8.8a 311

5.10b 133 8.8b 312

5. lla 136 8.9a 321

5. llb 137 8.9b 322

5. llc 138 8.10a 327

8.10b 328

8. lla 329

(11)

(ix)

ACKNOWLEDGEMENTS

I am grateful to Dr. F. R. Johnston for his

numerous invaluable suggestions, advice and encouragement

in _every _stage _of _this _research. _{I also} _wish _to _thank

the C. I. E. B. R. the Umbrako fund _{and the} S. I. B. S. for

their _contribution towards the _costs _of the _work. Finally

I wish to thank the _members _of the Computer Unit _staff _and

in _particular _Keith _Halstead, Tim Clark, Gerry Sheridan _and

Ruth Raymond.

(12)

(X)

SUMMARY

In the early 70's, Harrison _{and Stevens} [17,18,191,

made a major contribution to the area of statistical forecasting.

They adopted a Bayesian approach in conjunction _with _{a fundamental}

model, first used by Kalman [24) and called the Dynamic Linear

Model (DLM).

This _thesis is concerned _with _{the investigation} _of

the properties of different Bayesian forecasting _models _and

in particular of the Multi State Model (MSM). The latter is

important _{in postulating} _that _{no single} _{DLM can adequately} _describe _a

process with discontinuities and consequently the system defines a

number of models characterising the most likely process states.

A small number of _parameters are shown to govern the behaviour of the MSM and the relationship between the _choice _of these _parameters. and performance has been examined.. It is. shown that for a process

exhibiting discontinuities, traditional forecasting criteria such

as the mean square error are no longer appropriate. An alternative

set of _performance measures is _proposed and used as the main language

of understanding the variety of responses of the MSM to different types

and sizes of discontinuities. The parameter representing the noise

variance of the _process is shown to be critical to the _performance

of both the MSIl and other single state Bayesian models. A number

of on line variance estimation methods are proposed and tested on

artificial and real data. The methods are shown to be robust and

lead _{to improved} _performance _{not only} _{of the MSM but the other}

Bayesian _single _state _models _which _{of course} _require _{a noise} _variance

estimate.

Finally, _alternative formulations _of _the _{MSM are proposed,}

leading _to _significant _reduction in _the _{computational} _{and storage}

(13)

(xi)

To

(14)

1

CHAPTER 1

INTRODUCTION

1.1. Definition _{of time} _series

The analysis _{of time} _series is an important _{area of}

statistics with application in a variety of fields such as economics,

engineering, production planning, quality control and many others.

A time _series is defined _{as a collection} _of _observations

yl y2 "" yn made sequentially in time and the analysis of time

series is essentially concerned with identifying the stochastic model

which might have generated these observations.

Khlintchine 126,27], _showed _that _{a time} _series _{may be viewed}

as a random process of the variable Yt sampled over time periods

t=1,2, _.... _n. Hence yt for t=1,2 _.... _{n may be thought}

of as a particular realisation of an infinite set of time series _which

might have been observed by sampling from Yt.

Similarly Kolmogorov [231 viewed it as a sample from

an n-dimensional probability distribution with each time period corresponding

to one dimension of this distribution.

A major objective of time series analysis is to predict

future _values _{of the series} _{and this} _{is an important} _task in sales

(15)

2

1.2. _{The role} _{of short} _{term sales} _forecasting _{in management}

Mathematical _methods for _producing _routine _short _{term sales}

forecasts _{have slowly} _{become accepted} _{as an important} _source _of

information _for _effective _decision _making.

In inventory _control _systems forecasts _{of customer} _demand

are needed each week or month to be used for stock control, ordering

raw materials and packages, _production _planning _etc.

The United States National Industrial Conference Board

reporting its study [38] on Business Policies in Inventory Management

states that "nearly 30% of the working _capital _{of the average} business

is tied _{up in inventories".}

Lewis [311 reports _that _{the interest} lost _{on money invested}

in stock can add approximately _{20% to the works prime cost.}

If improved _methods _{of producing} forecasts _{can reduce} _the

uncertainties of future demand then lower stock holding, scheduling

and production planning can be achieved.

1.3. _Historical development _{of time} _series _methods

Historically _{the theory} _{of time} _series developed from

Statistics. It _therefore became concerned _with _stationary _time _series,

i. e. where the mean or level of the series is constant and the variability

retains the same pattern through time. A more mathematical definition

(16)

3

A time series is stationary if _{the joint} _probability

distribution of Yt+1 _, Yt+2 _{Yt+n is the same with} _that _of

Yt+k+1 _' Yt+k+2 _'"" Yt+k+n for _all t, _{n and k.}

Real data _series _{are usually} _{non-stationary} _{but by}

suitable transformations or differencing starionarity can sometimes

be achieved.

The interest in _time _series _started in _{1927 with} _the _work

of Yule X467 who introduced the concept of autoregression. His

approach was further developed in 1931 by Walker [39] _{who defined} the

general autoregressive (AR) _process of (1.3.1). This _assumes that'an

observation yt can be represented as the sum of a weighted linear

combination of old observations _{and some chance} _element:

P

yt =Ey+E (1.3.1)

t-i t

2

where _et -N (0 _{, of} ) for _all t.

In 1937 Siutsky [36] _presented _{the moving average} (MA)

process _{of (1.3.2.} ):

q

yt

=E0e+e+u

_j=1 _j (1.3.2)

t-j t

(17)

4

A major advance in the theory took _place in 1938 when

Wold [44] developed _{a comprehensive} _theory _{of mixed} _{autoregressive/}

moving averages (ARMA) processes. He proved that any discrete

stationary process can be expressed as the sum of two uncorrelated processes,

one purely deterministic (e. g. a sinusoidal process) and one purely

indeterministic (e. g. MA and AR processes).

The first _complete _solution to the least _squares _prediction

problem was given in 1941 by Kolmogorov [29] for discrete stationary

time _series. The problem was to predict a future value yt+k using

a linear predictor yt+k where

yttk _-,_{j yt-J}

such that E(y _t+k -y)2i _t+k sa minimum.

Mann and Wold [33] in 1943 solved the problem of estimating the

unknown parameters in stationary autoregressive series using maximum

likelihood _methods _{and Whittle} (41), Durbin [12] _{and Walker} 140]

extended their results to more general processes such as ARMA and multiple ti

series.

During World War II Wiener [49] _working in the field _of

control and communication pioneered the estimation problem for continuous

filters (the _continuous _counterpart _{of Kolmogorov's} least _squares

prediction problem). He developed new techniques for filtering

(18)

5

some random noise process.

Wiener. _showed _that _the _prediction _of _random _signals _and

their _separation from _random _noise _{can be achieved} by solving the

so-called Wiener-Hopf integral equation and he gave a method for its

solution in the practically important special case of stationary

processes.

Before 1960 many extensions _{and generalisations} followed

Wiener's basic _{work but all} _{of them were subject} _{to a number of}

limitations _which _seriously _curtailed _their _practical _usefulness.

In 1960-61 however, Kalman _{and Bucy} X24,25), _proposed _{a new approach}

to the _standard filtering _{and prediction} _problem for _multivariate

non stationary Markov processes. They realised that rather than to

attack the Wiener-Hopf integral equation directly, it is better to

convert it. into a non linear differential equation whose solution

yields the covariance matrix of the minimum filtering error, which in

turn _contains _all _necessary information for the design _of the _optimal

filter.

The result of their work was the Kalman filter recursive

equations, which are easily updated on line on receipt of additional

observations.

Details _{of the Kalman Filter} _will be given later _{on since}

(19)

6

r

Exponential _smoothing _{models were introduced} _around 1960

by operational researchers such as Brown [6,71, Holt [22] and Winters [43].

They are special cases of autoregressive models with the weights placed

on past observations decreasing exponentially. Their main contribution

has been in short term sales forecasting and stock control.

In _the late 60's _{Box and Jenkins} [3,5) _{made an important}

contribution to the theory of time series forecasting. Their procedure

utilises the concepts of MA, AR, ARMA processes but its major advantage is _that _{a class} _{of models} is _proposed _together _with _{a strategy} _{by which}

for _{any given} _series _{a particular} _model is _chosen from _this _class

according to the properties of the particular time series in question.

Thus _{as Newbold} [35] _puts it, _the form _of _the _eventual forecast function

is _dictated, _to _{a large} _extent _{by the data} _-a _principle _known _as

"letting _the _data _speak for itself. "

In _contrast _to _the _exponential _smoothing _models the Box-Jenkins

procedure is not fully automatic in the sense that given a data series, a

computer program can not be used to produce. forecasts without manual

intervention, _since _{a single} _{model has to be chosen from the large}

class of autoregressive integrated moving average (ARIMA) models.

Very briefly _{an ARIMA} (p, d, _{q) process} is defined _as

follows:

Consider _{a process} {yt _;t=1,2 _.... } which can be non

(20)

7

C

by successive differencing and that an integer d exists such that

wt= (1-B)dYt

is stationary, where Bay =y

t t-j

It is further _{assumed that} _{wt can be represented} _{as an ARMA model:}

PP Wt

i=1

ý1 _Wt-i +E

=1

03 et-j. _{+ Et}

and this is known as an ARIMA (p, d q) model where:

ýi _{are the autoregressive} _parameters

03 _{are the moving} _average _parameters

2 and ct is a white noise sequence distributed as N(0, aE )

(1.3.3)

Box and Jenkins (B/J) fit ARIMA models to a set of data

by an iterative three _stage _cycle _{of identification,} _estimation and

diagnostic _checking. First _tentative _values _{of p, d, q are chosen which}

allows the estimation of ýi and 0. for i=1,2 _... p and j=1,2 _... q.

Finally, _{the adequacy of the fitted} _{model is checked} _{and any inadequacies}

found _are _expected _to _suggest _alternative _values for _p, d, _{q and hence}

(21)

8

Clearly _the B/J _procedure is more versatile _than _exponential

smoothing models _{many of which} _are _simply _subsets _of the _general _class

considered by B/J. However, Chatfield 14-81 argues that in _practice

some exponential smoothing models are not special cases. of the B/J

procedure and points out that the multiplicative Holt-Winters forecasting

procedure does not have an ARIMA equivalent. B/J methods are of course

much more expensive and considerable experience is required to identify

an appropriate ARIMA model. Often (e. g. see Chatfield [11)) several

such models may be found to fit the data equally well but can produce

different forecasts. Recently _considerable ingenuity has gone into

removing the subjective element in the first stage of the cycle (i. e.

identification _of _p, _d, _q) _assuming _{a statinnary} _process. _Akaike [2)

is _the _main _{name associated} _with _this _{and in} _engineering _circles _{a well} _known

approach, State Space Forecasting as presented by Mehra [347 for example, is _equivalent _to _{an automatic} _technique for _selecting _p, d, _q. _This

simplifies the B/J procedure for the user without denying its. subtlety but _the _major _objection is its _reliance _{on relative} _stability in _the

process which in socioeconomic. time series is certainly quite unrealistic.

Finally _{a serious} drawback _of the B/J _procedure is the _requirement _of _a

moderately long series of data for its satisfactory application.

In the early 70's almost parallel to the work of B/J,

Harrison _{and Stevens} (H/S) [17,18,19], developed _{a Bayesian} _approach

pioneering the use of Kalman filters in time series forecasting.

Following _{the filtering} _results _{of Kalman they} _{showed how previously}

accepted techniques can be restructured using a Markov formulation

of time series. This structural change covers all existing linear time

series models and with the introduction of likelihood methods the

Bayesian _approach _provides _{a means of learning} _{and communication.} The

advantages offered by this approach are numerous and an attempt is

(22)

9

1.4. _{The case for} _Bayesian forecasting

In _order _to _appreciate fully _the facilities _and

advantages offered by the Bayesian approach it is first necessary

to introduce the Dynamic Linear Model (DLM) _which is _{a special} _case

of the state space characterisation of linear systems developed by

Kalman. _{The general} form _of _the _univariate DLN is _{characterised}

by two equations:

Observation _eqn _: _yt _{= Ft} Ot + et _; _{Et -} N(0

, Vt) (1.4.1)

System _{eq= :} Ot Ot_l _{+ wt} _; _{wt " N(0} Wt) (1.4.2)

where yt is the process observation made at time t

Ot is an (n x 1) unobservable _vector _{of process} _parameters

at time t.

Ft is a (1 x n) vector _{of independent} variables known at, time

t and defines how the process is observed.

G is _an (n x n) known system matrix _whose function. is to

represent deterministic changes in Ot from one time period

to the next.

Et is a stochastic noise element representing errors inherent

in observing the true 0t

Wt is _an (n % 1) stochastic disturbance _vector _representing _random

(23)

10 1

Vt = E(et2 ) _{assumed known at all} _times

Wt = E(wt w') assumed known _at all times

Equation (1.4.1) _specifies _{how yt}

, the process observation

at time t, is stochastically dependent on the current _process _parameter

vector 0t while equation (1.4.2) specifies how the process parameters

evolve in time, both stochastically and deterministically through wt and

G respectively.

It follows from (1.4.2) _that _{0t evolves} _according _to

a stochastic process in the presence of random shocks and hence it _{can never}

be known _exactly. Thus _at _time _{t the} information _about 0t is _expressed

in terms of a normal distribution:

(0t 1Dt)N (nºt

'C

where Dt is the set of all the data up to and including time t, i. e.

Dt = {Do _{$ Y1 '} F1 _{' y2} _' F2 _.... yt _' Ft} m {D

t-1 'yt ' Ft}

where Do represents the information supplied at the start prior to

(24)

, 11

C

Given _{an additional} _observation (yt+l

' Ft+l) _' 0t is

updated using the Kalman filter recurrence relations which will be

given in the next chapter.

We can now discuss a number of advantages of the Bayesian approach

in an attempt to illustrate its importance in the forecasting field.

a) Communication

This is a consequence _{of the parametric} _structural

representation of time series by the DLM.

Consider for _example _the following deterministic linear

growth process yt for t=1,2 _... _.

(1.4.3)

yt - 2yt-1

+ yt-2

=0

The DLM representation _of this _process is:

yt=ut

pt = Pt-1 + at _(1.4.4)

at ßt-1

where the _parameters _ut _, ßt are easily interpreted as the "level"

of the _process _at time _{t and the} incremental "growth" in level in the

(25)

12

Hence _the Markovian _structure _of _the _{DLM allows} _{an easier}

interpretation _of _the _model _parameters _with _the _result _that

subjective knowledge is more easily assimilated into the _scheme. In

this _respect the Bayesian _approach has _{a distinct} _advantage _over

ARIMA models. The forecaster _{can now impart} his _views _{on the}

parameters reflecting information external to the data history.

This is _not _only _useful initially _{when no data} history _exists but it is

also important in anticipating and dealing with major process changes. The point is that _{an environmental} _change _which invalidates _the _whole

representation usually only invalidates one part or a limited _number

of parts and often does not affect the overall model concept which in

our previous example is captured by both equations (1.4.3) and the system

(1.4.4). _{The difference} _lies _{in the fact} _that _if _{the overall}

model fails at time t because we think that for some reason the level

of the process has changed but the growth is the same, then this subjective

information _{can easily} _{be communicated} _{to the DLM by changing} _{the current}

estimate of the level without disturbing the growth estimate. At this

point in time it is natural that our uncertainty has also increased

as a result of this environmental change and this can also be communicated

by simply increasing the variance estimates associated with the process

level _{and possibly} _{of growth} _too.

A critical distinction is made by HIS [18] between a

statistical forecasting method and a forecasting system. The former

transforms input data into _output information in _{a purely} _mechanical _way

while the latter includes people; the person responsible for the forecasts

as well as all the people concerned with using the forecasts and

(26)

13

b) _{Recursiveness}

A major weakness of all traditional forecasting methods has

been _their inability _{to deal} _adequately _with _{a new time} _series _such _as

occur at product launch or even when there is a major change in the

market for a product.

The recursive nature of the Kalman filter however implies

that _{the current} _posterior distribution

(0t 1 Dt) _-N (Mt

, Ct)

may be calculated from

(i) _{the most recent} _observation, _yt

(ii) _{the current} _observation _{and disturbance}

variance Vt and Wt

and (iii) the posterior distribution

(0t-1 I _Dt-1) _{N (mt-1}

' gt-1)

The great advantage of this is that, for the first time,

we have a rational, coherent framework for forecasting in situations

where there is little or no prior data history and for developing

systems _which require structured models to facilitate communication

(27)

14 1

c) Uncertainty _about _{the underlying} _model

Another disadvantage _of _traditional _forecasting _methods _is

their inability _{to reflect} _varying _uncertainty _not _only _with _regard

to _parametric _estimates but _also _with _{a range} _of _possible _models _and

their _complete _absence _of _{any form} of model learning.

The Bayesian _approach however is _geared _towards learning,

change and probability descriptions _{and provides} _elegant _ways _of

extending the DLM, thus overcoming these _problems.

A DLM is _{characterised} by Ft

,G, Vt and lit so that

a particular model MM _{can be specified} _as:

M(j) =- (F(J)

' GiJ) _' ViJ) WSJ) )

t-t _-t

If _{a number of such models} _{is considered} for j=1,2

... n

with associated initial _prior _{probabilities} _p( then Bayes'

theorem _{can be used to update} _these _{probabilities} _with _{ever_y. new}

observation, leading to an on line _{model identification} _procedure.

H/S [18] _{term this} _{model set "multi} _process _model" _and

distinguish _between _{two possibilities:}

(I) _Class _{I multi} _process _model:

Here it is assumed that a single _{unknown model adequately}

(28)

15

c

the learning logic _{of Bayes'} _theorem is used

to select the single _{most appropriate} _model

for _{the process} _under _observation,

or

forecasts _{and decisions} _{are based on a weighted'}

combination of the whole set of models so that at any

time _{the selected} _{model is not necessarily} _{one of the}

original models but an "interpolated" _model.

(II) _Class _II _multi _process _model:

The postulate here is _that _at different _periods _of _time,

different _models _or _combinations _{of models} _{appropriately}

describe _the _context. _{In other} _words _the _process _shifts

from being best _represented _{by one model} _{to being} _best

represented by another.

H/S [191 describe _{a situation} _with four different _process

states namely "no change", "outlier", "growth _change", _{and "step} _change". (jThese

states are characterised by different _variances Vý. W( ýý

in each of four linear growth stochastic generating _{models MCJ) of the}

form:

+ E(i)

yt - ut

_t

++

(j)

u t= u t-1

t

du

_t

ß=ß ý] )

t t-1 + 60 t

E

(i)

-N (0, V (j) ₎

te

au(J) - _{N (0,} _{V(. 7) )}

tu

aß

(i)

-N

(0, V(J)

(29)

16

where riQ) for j=1,2,3,4 is used to model the four different

process states. The variance magnitudes necessary to _characterise

these _states _{are shown below and further} _explanation _{and justification}

for _these _will _{be given} _{in Chapter} _3.

TABLE 1.1.

j _{STATE j} _VW

E VM u V(D B

1 "no _change" Normal _Nil _Nil

2 "outlier" _Large _Nil _Nil

3 "growth _change" _Normal _Nil _Large

4 "step _change" _Normal _Large _Nil

Allowing _models _{to continually} _{wax and wane in a}

probabilistic fashion necessitates statement s on model transition

probabilities and an obvious choice is a very high probability for

the "no change" model and low probabilities for the "intruding" _models.

This _system has _the _advantage _of _recognising _{and responding}

appropriately to sudden changes of level, slope and outliers which are

major _problems for all _previous forecasting methods. Because of the.

different _states _modelled _{by this} _multi _process _Class _{II model,} it _will

(30)

17

The learning _capability _of _this _approach is _obvious.

Because of the Bayesian foundations _{of the DLM it} _{is possible}

to start forecasting immediately at time _{zero when the forecasts} _will be

based entirely on the initial prior _{probabilities} _{of parameters} and

models. However, as time _progresses the probabilities _{are continually}

modified by the data and hence there is no clear distinction, _{as in}

conventional methods, between start up, model selection, parameter

estimation and actual forecasting, since this is done on line.

d) _Decision _making

Traditional forecasting _methods _{are not capable} _of

communicating varying degrees of uncertainty. Yet in practice this is

always the case: _{at product} launch or when a sudden change has

taken _place there is a lot of uncertainty _which decreases _{as data}

becomes available and a stable _pattern is established.

This _changing _uncertainty is automatically _reflected

in a MSM for example, through _{the changing} _posterior _{probabilities}

of the various models which are combined to produce realistic measures

of the forecast error variance for the purpose of decision making.

In addition probabilistic information on the process parameters

at any given time means that decisions are not based on a single

figure forecast _{but a probability} distribution.

e) Stationarity

Virtually _all _previous forecasting _methods _{were based}

(31)

18

c

reduced to _stationarity by a suitable transformation _{and can then}

be used for model identification. This assumption is however too

strong since, in identification, this implies that because something

never happened in the past it will never happen in the future which is

not realistic. In the Bayesian approach there is no stationarity

requirement since the evolution of the future is not wholly tied to that of

past observations nor to a single model.

This list _{of advantages} _although _{not exhaustive} _gives

sufficient information for the potentials of the Bayesian approach to

(32)

19

1.5. Objectives _{and structure} _{of the thesis}

The prime objective of this thesis has been to investigate

the _properties _{and examine} _methods _of _overcoming _{a number} _of

theoretical _{and practical} difficulties _which _arise in _the _application

of Bayesian models. Most of the work however is concentrated on the MSM

since it is undoubtably the _most important _model relieving the

forecaster _of _the _burden _{of model} identification _while _being

computationally cheaper than B/J _procedures for example, since it does

not require any non linear search routines.

The following _topics _{are examined:}

(ij _Definition _{of forecasting} _performance: _Chapter _{4 looks}

at the _problem _{of measuring} the _performance _of a forecasting

system since the traditional mean square error (MSE) criterion

is _shown _{to be inappropriate} _{and misleading.} _This _is

especially obvious when the MSM is used to model data series

exhibiting a number of discontinuities. Instead a mixture of

both _qualitative _{and quantitative} _measures _are _proposed _and

justified _{by a number} _of _arguments _{and empirical} _evidence.

(ii) _Sensitivity _{and robustness} _of _the _{MSM : At time} _period

zero the MSM requires information to be supplied in terms

of a number of parameters. Some of these parameter

(33)

20

L

The objective of Chapter 5 is to determine _{the sensitivity}

and robustness _{of the MSM in relation} _{to the choice}

of those _parameters _which _remain fixed throughout _since

their _choice _controls _{the response} _{of the system.}

(iii) _Estimation _{of variance} _{: One problem} _{of the Bayesian}

approach is as follows. The Kalman filter _{assumes an}

exact knowledge _{of the noise} _{and disturbance} _variances

which must be known outside the system. This is a crucial

problem to the successful _practical application _{of the}

approach but very little _attention has been paid to it

in the literature _{and especially} _{to the important} _case

of on line variance _estimation. Godolphin _{and Harrison} [15]

recognise this _explicitly _pointing _{out that} _parameter _search

can give much difficulty to the inexperienced _worker _using the

DLM Markov representation _since it _takes _{him into} _the

relatively _unfamiliar territory _{of variance} _rather than

coefficient _estimation.

The problem arises because if the models operate with

fixed _estimates _{of these} _variances _then _{the predictive} _{distributional}

information _available _from _{the Kalman filter} (which _{are very}

important _for _decision _making) _{can be misleading.} _{In addition} _an

estimate of the "Normal" variability as defined earlier in Table 1.1 is of

special importance to the MSM which in essence interpretes deviations

from forecast _{in the light} _{of prediction} _error. _If for _example

this "Normal" _variability is _{overestimated} then the _prediction _error

will be similarly overestimated and the _system will respond slower to a

genuine sudden change in the environment. This is because the larger

(34)

s

as "no change" in the light _{of our over-estimated} _expected

forecast _errors.

The objective of Chapters 6 and 7 is to investigate this

variance _problem and devise _methods for its _{on line} _estimation.

(iv) _{Computational} _efficiency _{of the MSM and its} _speed

21

of response to growth _changes: As pointed _{out earlier}

it is _true _that in _relative _terms _the _{MSM is} _{computationally}

cheap. However _{as will} be seen in Chapter 2 it is _possible

to model the _effects _of _seasonality, _promotional _activity _etc.

by incorporating _{a suitable} _number _of _parameters _into _the

process parameter vector Ot. Suppose for example that 0t

consists of the usual level and growth components _{as well} _{as 13}

seasonal factors _{and possibly} 13 promotional _ones.

Then the MSM operates with vectors _{and square matrices} _{of order} 28.

Under such circumstances as well _{as when the MSM is to be used}

for forecasting _{a large} _number _of items, _the _{computational}

and storage requirements _{may well} be too high for _many

potential users of the system. With this in mind Chapter 8

looks _{at ways} _{of making} _the _{MSM more computationally} _efficient

by excluding a number of unlikely state transitions.

Another _objective in that Chapter _{was to investigate}

alternative formulations of the MSM in order to improve its _performance

and particularly its speed of response to abrupt changes in growth. The

problem of growth response improvement was one of great interest to a

(35)

22

V

but had difficulties in "tuning" the _system _{so as to respond}

faster _to _growth _changes _without _at _the _{same time} becoming _over-

sensitive during "quiet" periods.

Chapter 9 summarises _{our conclusions} _{and a number of}

proofs, graphs and computer program listings are given in the

Appendices.

Finally, Chapters 2 and 3 describe _{some fundamental}

concepts of the Bayesian approach thus completing the necessary

background.

1.6. NOTATION AND ABBREVIATIONS

The notation will be carefully defined in each Chapter

and for all the abbreviations used the term will be given in full

at the first instance followed-by the abbreviation in brackets.

A list _{of symbols} _{and abbreviations} together _with _{the section} _{when they}

were first used or defined is given in Appendix A and can be used

as a general reference whenever the meaning of a symbol or term is

ambiguous.

Subsection _{z of section} _{y of chapter} _{x will} be _referred

to as x. y. z. and the nth equation of x. y: z will be referred to as

(x. _y. _z. _n).

The kth reference will be denoted by 1k].

Finally, "Table _x. _y" _refers to the _yth table _which _can

(36)

I

CHAPTER 2

THE BAYESIAN APPROACH TO FORECASTING

2.1. _Foreword

The aim of this chapter is to summarise some basic

theoretical _results _of the Bayesian _approach. The Kalman filter

parameter estimation and updating procedure is first described and

the _principle _of _{superposition} is introduced _which has important implications _{in model building.} _This is followed _{by two sections}

giving the DLM representation and properties of two widely used

models. These will be used for reference from several parts of

the thesis.

2.2. The Kalman filter

Recall _{the general} form of the DLM as defined in section 1.4:

Observation _ein _{yt = Ft} 0t + et _{et `N} (0, Vt) (2.2.1)

System eqn : Ot = 2t-1 + wt wt " N(O, -Wt) (2.2.2)

23

Suppose that _{at time} t-l _{our posterior} information _about

the process parameter vector 0t takes the form of a Normal distribution,

(0t-1 1 Dt-1) _{ti N} (mt-1

(37)

9

then as soon as yt and Ft become available the estimation problem

is to make inferences about the current value _{of the unknown parameter}

vector O. This is exactly what the Kalman filter does and it

can be shown (see Appendix B) that as a consequence of the DLM

representation and the distribution of (0t-1 I Dt-1)' the posterior

distribution _{of 0t given} _{Dt is also Norma l}

(0t 1Dt) ti u (mt

, Ct) (2.2.4)

where mt and Ct are calculated recursively from the following

Kalman filter _equations first _given by Kalman [25].

yt =E (Yt 1, Dt-1) = Ft G mt-1

et = yt

yt

Et = Var (Ot I Dt_1) _=G Ct_1 G' _{+ Wt}

Yt = Var (Yt I Dt_1) = Ft 4 Ftt + Vt

At (Yt)-1 Rt Fit

(2.2.5)

mt =E (Ot I Dt) =G mt-1 + At et

(2.2.6)

Q= Var (0t I Dt) _{= Rt - At} Yt At (2.2.7)

The analogy of At in (2.2.6) with traditional fixed

smoothing constants is obvious since At controls the amount of the

forecast _error fed back into _revising _the _parameter _estimates. The

24

(38)

25

t

but varies with time in a manner determined by the relative uncertain-

ties.

Of course at time t=0 _prior _{to any observations} the Kalman

filter _requires _{an initial} _prior distribution,

(E)

t1 DD) 'L N(Eb , Co)

utilising information based on experience, analogy, judgement etc.

This initial _subjective information _requirement is an important feature

of the approach and it can be argued that it is a step which many

practitioners are not only willing to take but do take on the many

occasions in real life where decisions implying forecasts must be made

prior to the occurence of the first data point.

Having _established _{the procedure} _{of inferring} (0t I Dt ) we

now simply state the results derived by H/S [18] which can be used to

make inferences about future values of the variable under observation

in the important _{case of} Ft being _constant _{or varying} in a predictable

way for example on account of periodic functions. Using the observation

and system equations of the DLM it can be shown that the prediction mean

and variance of the k-step ahead process observation can be obtained from:

yt+k = Ft+k mt+k

_r

t+k Ft+k ýt+k Ft+k +

Vt+k

(2.2

(39)

" 26

where _mt+k and Ct+k are recursively obtained from:

llt+k E(0t+k 1 Dt) = G mt+k-1

-t+k Var(0 t+k

I D)

t =GC+W _t+k-1 _t+k

for k=1,2,3... When k=1, _{mt+k_1} _and Ct+k_1 _{are in fact} _the

latest Kalman filter _estimates _of 0t _{and its} _uncertainty:

mt+k_1 = mt = E(0t 1 Dtv = _mt

gt+k-I ₌ St ₌ Var(O I Dt) = Ct

Finally _{we comment on the principle} _{of superposition} _which has

far _reaching implications for _{the construction} _{of large} _{sophisticated}

models. The principle simply states that a linear combination of

a number of simpler linear models is itself a linear model. Thus the

model builder might fix his attention in simple sub-systems which he

can readily describe in DLM form and then bring them together into a

single DLM. For example, in a sales forecasting situation the fore-

caster might construct a sub-model DLM describing the effect of advert-

ising _promotions, _another _{DLM to describe} _{the seasonal} _variation _and

finally _{to describe} _{the wandering} _{and effect} _{of unrecorded} factors _as

a linear growth time series DLM.

The power of the principle of superposition can be illustrated

in terms of the following simple example. Suppose that we are operating

with a model of the general DLM form given by (2.2.1) and (2.2.2) but

(40)

27

I

et ti N(0, V

would be better described by a correlated time series'such as an

expontentially weighted moving average or ARIrIA (0,1,1) process.

As will be seen in the next section such a process for ct can be

represented as follows:

Et = Qt + Et Et ti N(0, VE)

ýt

_{= ßt_1 + aft}

apt ti N(o, v

Using _{the principle} _{of superposition} _{the complete} _{model may be}

expressed _{as follows:}

Yt = FEt _, ýj 1t { et

rt1

-G0

Ot_1 _Wt

t01 ßt_1 Sit

which with an obvious redefinition of Ft at etc. can be seen to be

another DLM of the general form.

In sections C. 1 and C. 2 of Appendix C an attempt is made to

specify the way in which the principle _{of superposition} can be used to

(41)

28

includes _seasonality _factors _{in the parameter} _vector _Ot _while _that _of

C. 2 incorporates _promotion _{and price} _effects in addition _{to seasonality.}

Both represent actual practical applications of Bayesian systems and

have formed _parts _{of M. Sc. projects} _prepared for _{two large} _{manufacturing}

companies based in London and Liverpool respectively.

2.3. _Properties _{of the constant} _variance Steady State Model (SSM)

The SSM is widely used in practice to model a process which

can be generated by a simple random walk. In sales forecasting for

example the model can be used to predict the demand for a product which

is established and reasonably stable. It assumes that the observed

figure _{of demand is the sum of some underlying} _true level _{and some}

unpredictable random disturbance. Further the underlying true level

at any period is equal to that in the preceeding period plus some

small random perturbation. Such a process can be written as:

yt = ut + st et ti N(0, VE) (2.3.1)

Pt = ut_1 + dot - dut ti N(0, Vu) (2.3.2)

where pt = unknown true process level at time t

et = observation noise at time t

6pt ₌ level disturbance _{at time} _t

and VE _, Vu are unknown true constant variances of the uncorrelated

random variables et

(42)

I

simplest DLM representation:

Observation _eqq _: _yt Cut] ₊ _st

System eq= Ptl = f1] {pJ + (öutl

Comparing _this _{representation} _with _{the general} _{DLM formulation} _given in

Section 1.4 it _{can be seen that} for _{the SSM,}

n=1 Ot = Imt]

G= [l] Vt ₌ VE

and wt = [ouJ

Ft = [ýJ

wt _ [Vill

(2.3.4)

At time t=0 information _about _u0 is described in the form of a

Normal distribution,

(UC 1 DC) _{I 'u}

, CO)

and estimates for the true variances VE _, VP are nominated and desig-

nated VE _N and Vu _N. Alternatively VE _{N and} _{ru N can} be nomin-

>>

ated where rp)N is our estimate of the true process variance ratio

ru = (Ve/VU) which _{can be viewed} _{as a measure of the time} dependence

of ut. Using this notation and substituting the SSM values of

Qt, Ft, G, Vt _and Wt into _{the general} Kalman filter _equations

(43)

30

to a limiting form with the limiting _values _of Rt. Yt, At _and Ct

given by RN, YN' AN and CN respectively :

RN=

vE,

N

AN /

(1 - AN)

YN = VE

N/ (1 - AN)

[-1 +(1+4r.

N )Z1 / (2r N)

u, J u,

CN = AN Ve, _N

(2.3.5)

The posterior distribution for _{the process} _{mean is given} _{by :}

(Ut _Dt) _N(mt

' CN)

where _mt = mt-1 + AN

t (2.3.6)

et = yt yt

yt ₌ _mt-1

The model in this limiting _{form is equivalent} _{to an ARIMA} (0,1,1)

model and an exponentially weighted moving average (EWMA) model operating

with a smoothing constant a equal to AN. Hence through AN, the

nominated ratio r

U9N is directly related to a and controls the up-

dating _of _mt. _It _{can be seen from the expression} _for AN for _example

that _r= 20 _corresponds _{to an EWMA a=0.2.}

u, N

(44)

I

be of the following form:

00 mtAN

i=0

(1 - Ar)

1 yt-i

(2.3.7)

which implies that in the limit _{rp, N} is the only factor (except the

actual data) _affecting the updating _{of the process} level _{and consequently}

the limiting _value _{of the MSE is a function} _{of rp,}

N alone and does not

depend on VE, _{N .} It _{can in fact} _{be shown (see Appendix} _{D) that} _the

one step ahead forecast error et is distributed normally with zero mean

and variance given by,

Var (e

t) = E(et2) = limit (MSE)t =

2-IVVe + Vu

(2.3.8)

t -* co AN 2-AN

t

where (MSE)

t=tE _i=l eil

This _result _also _{shows clearly} _that _{the limiting} _value _of

MSE depends _only _{on ru, N and not on the actual} _value _{of Ve}

N'

2.4. _Properties _of _the _constant _variance _Linear _Growth _Model (LGM)

The LGM extends the SSM by the addition of a growth component

ßt and is highly relevant to applications. As an extension _of the SSM

of (2.3.1) and (2.3.2) it can be written as follows:

Observation _eqn _{Yt = ut} _{+ Et}

System eqn lit - ut-1 + ßt + dut

at ßt-1 + 60t

et " N(0, VE) (2.4.1)

Sii _{" N(0,} _V

(2.4.2)

6ßt ti N(O, Va )

(45)

I

32

If _{the equation} for ßt is _substituted in the _ut _equation _then _the

DLM representation of the LGM is obvious.

Yt [l

, 0]

1'R'

t+ et

t

ut

_

1 _ut-1

}

sut _} _aßt

LJ

01 _ßt-1 _aßt

and by comparison _{to the general} DLM formulation

of Section 1.4 it _{can be seen that} for _{the LGM}

we have:

n=2 Ft ₌ rl

, 01 0= Ut

ßt

10' 1 _öut ₊ _aßt

G=w=

1 -t _sßt

Vt ₌ _Var(F-

t) = VE and

Wt ₌ _{E(wt wt')} ₌ _Vu ₊ _Vß _Vß

Vß Vß

(46)

e

Let _{the information} _about _{Ot_1} _{at time} _t-1 _{and prior} _to

yt be of the form:

ut-1

([:

_1;:::

N (2.4.4)

t-1c22, _t-1

where m= best estimate of the process level given Dt-1

t-1

b= it it if It It _growth

t-1

cll,

t-1 = variance on the estimate of the level

C

22, t-1

it it it 11 11 it

growth

, t-1

c12

t-1 c21 t-1 covariance of the level and growth

s _estimates

Substituting _the _{LGri values} _of Ft, G, Vt _and Wt into

the _general Kalman filter _equations (2.2.5), (2.2.6) _and (2.2.7) it is

straightforward to arrive _at the following _posterior _estimates, _given

the latest _observation _at time t, _yt:

t

bt

rc11,

t c12, t

(2.4.5) [ci2,

t c22, t