Recursive Estimation
Lemma 2.2.2 Recursive Calculation for $\sigma^2(t)$
1. Forgetting profile
current data. Thus the estimates obtained are more representative of the current state of the system. The usual LS function (2.2.1) is modified to
(2.3.1) $V_t(\theta) = \sum_{s=1}^{t} \lambda_t(s)\,(y(s) - \theta' x(s))^2$
where $\lambda_t(s)$ is typically increasing in $s$ for given $t$. Of course, this is simply generalized least squares, and the minimizing value of $V_t(\theta)$ is
(2.3.2) $\hat\theta(t) = \left[\sum_{s=1}^{t} \lambda_t(s)\, x(s) x(s)'\right]^{-1} \sum_{s=1}^{t} \lambda_t(s)\, x(s) y(s)$
However, the expression (2.3.2) is the off-line estimate, and this sequence of off-line estimates can only be computed recursively if a certain structure is imposed on $\lambda_t(s)$. Thus, if we assume that
(2.3.3a) $\lambda_t(s) = \lambda(t)\,\lambda_{t-1}(s), \quad 1 \le s \le t-1$
which can also be written as
(2.3.3b) $\lambda_t(s) = \left[\prod_{j=s+1}^{t} \lambda(j)\right] \lambda_s(s)$
a recursive calculation of (2.3.2) can be effected. (Typically $\lambda_s(s) = 1$.) We omit the derivation (see Ljung and Söderström, 1983, section 2.6.2), but note that $\lambda(s) = 1$, $\forall s$, gives the standard LSE given in (2.2.2).
The function $\lambda_t(s)$ defined by (2.3.3) is referred to as the forgetting profile. To be appropriate for adaptive estimation we require $\lambda(s) < 1$, and often an exponential forgetting profile is used where $\lambda(s) = \lambda < 1$, $\forall s$. Then,
(2.3.4) $\lambda_t(s) = \lambda^{t-s} \lambda_s(s)$
and $\lambda$ is referred to as the forgetting factor. In this case, the $P(t+1)$ in (A2.2.1) is replaced by $\lambda^{-1}[P(t) - \{\lambda + x(t+1)' P(t) x(t+1)\}^{-1} P(t) x(t+1) x(t+1)' P(t)]$, taking $\lambda_t(t) = 1$, so that the modified RLS equations become as given in (A2.3.1) below. Note that in the factorized form of (A2.2.2), adaptive estimation is effected by using $\lambda[R(t) \mid n(t)]$, but in (A2.2.3) only $\lambda D(t)$ is required, which involves the least number of additional calculations.
A2.3.1 Adaptive RLS - Exponential Forgetting
If exponential forgetting is used in (2.3.1), then the LSE $\hat\theta(t)$ in (2.3.2) is recursively calculated as
(2.3.5a) $\hat\theta(t+1) = \hat\theta(t) + K(t+1)\,(y(t+1) - \hat\theta(t)' x(t+1))$
where
(2.3.5b) $K(t+1) = \delta(t+1)\, P(t)\, x(t+1)$
$\delta(t+1) = [\lambda + x(t+1)' P(t) x(t+1)]^{-1}$
$P(t+1) = \lambda^{-1}[P(t) - \delta(t+1)\, P(t) x(t+1) x(t+1)' P(t)]$
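The recursion (2.3.5) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names, the initial value of $P$, the noise level and the test signal (a parameter that jumps halfway through the record) are all assumptions made for the example and are not taken from the text.

```python
import random

def rls_forgetting_step(theta, P, x, y, lam=0.95):
    """One step of (2.3.5): returns the updated (theta, P)."""
    n = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]  # P(t) x(t+1)
    delta = 1.0 / (lam + sum(x[i] * Px[i] for i in range(n)))       # delta(t+1)
    err = y - sum(theta[i] * x[i] for i in range(n))                # y(t+1) - theta(t)'x(t+1)
    theta_new = [theta[i] + delta * Px[i] * err for i in range(n)]  # K(t+1) = delta * Px
    # P(t) is symmetric, so P x x' P is the outer product of Px with itself.
    P_new = [[(P[i][j] - delta * Px[i] * Px[j]) / lam for j in range(n)]
             for i in range(n)]
    return theta_new, P_new

def track_jump(lam, n_obs=400, seed=0):
    """Hypothetical test signal: the true parameter jumps halfway through."""
    rng = random.Random(seed)
    theta, P = [0.0, 0.0], [[100.0, 0.0], [0.0, 100.0]]  # assumed start-up values
    for t in range(n_obs):
        true = (1.0, -0.5) if t < n_obs // 2 else (2.0, 0.5)
        x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        y = true[0] * x[0] + true[1] * x[1] + 0.1 * rng.gauss(0.0, 1.0)
        theta, P = rls_forgetting_step(theta, P, x, y, lam)
    return theta

print(track_jump(lam=0.95))  # forgetting: estimate follows the post-jump regime
print(track_jump(lam=1.0))   # standard RLS: estimate mixes both regimes
```

Setting `lam=1.0` recovers the standard RLS step, which makes it easy to compare the two behaviours on the same data.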
However, because the past has been weighted down in an exponential manner, the information contributed by the past also decreases exponentially. Although this reduction is desirable from the point of view that such data may no longer be relevant, it does introduce a certain contrariety: the balance between adequate damping and adaptive response time, which can be rather sensitive. As $\lambda$ is decreased, the response time becomes quicker, but at the same time random fluctuations, due to noise for example, become amplified in the current data. The response therefore becomes more sensitive to random perturbations, which can result in very erratic behaviour of the estimates and may sometimes lead to wild oscillations. (This is equivalent to what W. E. Deming, in the context of Quality Control, referred to as "hunting the system".) Conversely, as $\lambda$ nears unity, the response time becomes increasingly slow and the estimates do not adapt fast enough to be useful.
Clearly, $\lambda$ will need to be experimentally tuned to the application at hand, but for the simulation studies conducted in chapter 5, values of $\lambda > 0.9$ provided the best results. The choice of $\lambda$ also affects the calculation of quantities such as the prediction variance $\hat\sigma_h^2(t)$, since, here, some notion of an effective sample size is required. This can be considered to be
(2.3.6) $\tau(t) = \sum_{s=1}^{t} \lambda_t(s) = \lambda(t)\,\tau(t-1) + 1$
(assuming $\lambda_t(t) = 1$). Hence, if $\hat\sigma_h^2(t)$ is to be obtained from
(2.3.7) $s_h(t)^2 = \lambda(t)\, s_h(t-1)^2 + \delta_h(t)\, e_h(t)^2$
where $s_h(t)^2$ is now the correspondingly weighted sum of squares, then $\hat\sigma_h^2(t) = \tau(t)^{-1} s_h(t)^2$. Note that when exponential forgetting is used, $\tau(t) \to (1-\lambda)^{-1}$, so that, in practice, $\tau(t)$ could be approximated by this quantity after some large enough $t$.
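The limiting behaviour of the effective sample size can be checked numerically. The sketch below simply iterates the recursion of (2.3.6) from an assumed starting value of zero; the function name and the choice of a forgetting factor of 0.95 are illustrative assumptions.

```python
def effective_sample_size(lams):
    """Iterate tau(t) = lambda(t) * tau(t-1) + 1, starting from tau(0) = 0."""
    tau, out = 0.0, []
    for lam in lams:
        tau = lam * tau + 1.0
        out.append(tau)
    return out

taus = effective_sample_size([0.95] * 200)   # exponential forgetting
limit = 1.0 / (1.0 - 0.95)                   # the limit (1 - lambda)^(-1)
print(taus[0], taus[-1], limit)
```

With a forgetting factor of unity the same recursion just counts the observations, which is the usual sample size of standard RLS.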
In the above, we have primarily discussed the use of exponential forgetting, since it avoids the problem of choosing $\lambda(t)$ at each $t$. This implies that a certain amount of pretuning is required, and in some situations (for example, in adaptive control) it may be better to tune $\lambda$ online. However, we cannot properly deal with such procedures here and refer the reader to, for example, Goodwin and Sin (1983), where the subject of adaptive control is comprehensively treated.
As indicated in section 2.1, adaptive estimation can also be used in non-time-varying systems. By taking $\lambda(t) < 1$ for, say, $t < 50$, the effect of these initial observations in starting up the recursion will be significantly reduced. For $t > 50$, $\lambda(t)$ can be set to unity or simply removed from subsequent recursions by switching back to standard RLS. Alternatively, one might allow $\lambda(t)$ to slowly increase to unity (eg. X( t ) =( t - l ) X( t - l ) / 1). We point out, however, that using forgetting profiles in stationary systems will mean that the parameter estimates will be biased.
2. Truncation
Instead of artificially emphasizing the effect of the current data, an alternative is to base an estimate on only a finite number of observations. That is, to use a sliding window $y_m$, which we defined in section 2.1. Thus, $m$ represents the active number of points used to calculate the estimates at time $t$, so that $y(t-m)$ is completely discarded or truncated. Let,
(2.3.8) $P_{t-m}(t) = \left[\sum_{s=t-m}^{t} x(s) x(s)'\right]^{-1}$
Then, with extended notation, the sliding window RLS algorithm is recursively calculated as follows.
A2.3.2 RLS - Sliding Window
Let y(t), x(t) be observed. Then the RLS updating equations (A2.2.1), for a frame of size m and using (2.3.8) become:
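(2.3.9a)
$P_{t-m}(t) = P_{t-m}(t-1) - \delta(t)\, P_{t-m}(t-1) x(t) x(t)' P_{t-m}(t-1)$, with $\delta(t) = [1 + x(t)' P_{t-m}(t-1) x(t)]^{-1}$
$\hat\theta_{t-m}(t) = \hat\theta_{t-m}(t-1) + \delta(t)\, P_{t-m}(t-1) x(t)\,(y(t) - \hat\theta_{t-m}(t-1)' x(t))$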
To maintain a frame size m, y(t-m), x(t-m) are now discarded giving the RLS adjustment equations,
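(2.3.9b)
$P_{t-m+1}(t) = \{I + \delta_{t-m}(t)\, P_{t-m}(t) x(t-m) x(t-m)'\}\, P_{t-m}(t)$
$\hat\theta_{t-m+1}(t) = \hat\theta_{t-m}(t) - K_{t-m}(t)\,(y(t-m) - \hat\theta_{t-m}(t)' x(t-m))$
where $\delta_{t-m}(t) = [1 - x(t-m)' P_{t-m}(t) x(t-m)]^{-1}$ and $K_{t-m}(t) = \delta_{t-m}(t)\, P_{t-m}(t) x(t-m)$.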
Similarly, the factorised forms of $P(t)$ can be written in the form of (2.3.9), with inverse Givens transformations used in (2.3.9b). The algorithm requires that the last $m$ observations $\{y(s), x(s)\}$, $t-m \le s \le t$, be stored and clearly, at each $t$, twice the number of calculations are required as compared to standard RLS. Like $\lambda$, the frame size $m$ needs to be specified a priori or from pretesting some data.
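The two-stage structure of the sliding window algorithm (an updating step for the new observation followed by an adjustment step that discards the oldest one) can be sketched as follows. This is a minimal illustration in unfactorised form, not the factorised implementation referred to above; the function names, the start-up value `p0` and the simulated data are assumptions made for the example.

```python
import random
from collections import deque

def rank_one_step(theta, P, x, y, sign):
    """sign=+1 adds the pair (x, y) to the fit; sign=-1 removes it again."""
    n = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    delta = 1.0 / (1.0 + sign * sum(x[i] * Px[i] for i in range(n)))
    err = y - sum(theta[i] * x[i] for i in range(n))
    theta = [theta[i] + sign * delta * Px[i] * err for i in range(n)]
    P = [[P[i][j] - sign * delta * Px[i] * Px[j] for j in range(n)]
         for i in range(n)]
    return theta, P

def sliding_window_rls(data, m, p0=1e4):
    """data: sequence of (x, y) pairs; returns the estimate from the final frame."""
    data = list(data)
    n = len(data[0][0])
    theta = [0.0] * n
    P = [[p0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    frame = deque()                                    # stored observations
    for x, y in data:
        theta, P = rank_one_step(theta, P, x, y, +1)   # updating step
        frame.append((x, y))
        if len(frame) > m:                             # keep the frame at size m
            x_old, y_old = frame.popleft()
            theta, P = rank_one_step(theta, P, x_old, y_old, -1)  # adjustment step
    return theta

rng = random.Random(1)
data = []
for t in range(300):
    slope = 1.0 if t < 150 else 3.0    # hypothetical jump in the true slope
    x = [1.0, rng.gauss(0.0, 1.0)]     # intercept plus one regressor
    data.append((x, 0.5 + slope * x[1] + 0.1 * rng.gauss(0.0, 1.0)))
print(sliding_window_rls(data, m=50))  # based on the last 50 observations only
```

Each new observation costs two rank-one steps once the frame is full, which is the doubling of computation noted above, and the `deque` holds the stored observations that must be kept for later removal.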
Although the data contained in the active window is not modified, the increased amount of computation required over the forgetting methodology may make this method less desirable than forgetting. Also, the method is possibly better suited to processes where the time variation in the parameters is slow, since if sudden changes occur, the window would, for a time, contain a mixture of possibly distinctly different processes. Consequently, the adaptation may be ineffective and the use of a forgetting profile may be more appropriate. Such a situation is considered in chapter 5.
In some cases it may be plausible to model the time variation in $\theta(t)$ by a stochastic equation. In this type of situation the Extended Kalman Filter may be appropriate (see Anderson and Moore, 1979). However, we do not deal with that situation here.
§2.4 Recursive PLR Algorithms
In the off-line PLR algorithm, an iterative least squares approach is used to obtain improved parameter estimates. The conversion to an online procedure is done by replacing the iterative LS updating with RLS time updating. However, as in section 1.4, the RLS algorithm needs to be complemented with a residual recursion in order to construct the regressor x(t) since this will now contain unobserved inputs.
The recursive calculation of estimates of the unobserved inputs can be done in a number of ways and following (A2.4.1), we go on to describe three methods which derive from the basic RLS algorithm,
(2.4.1) $\hat\theta(t) = \hat\theta(t-1) + P(t)\, x(t)\, e(t)$,
where
(2.4.2) $e(t) = y(t) - \hat\theta(t-1)' x(t)$
In the following, it is assumed that (2.4.1) is to be used in the online estimation of ARMA systems of fixed order, but for notational simplicity, the dependence of the quantities in (2.4.1) on $\theta$ is suppressed. In each case we need only specify $y(t)$, $x(t)$ and $\theta(t)$, so that the use of factorization and adaptive estimation is taken as given.
A2.4.1 AR
If the underlying process is assumed to be an autoregression of order m, then
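(2.4.3) $x(t)'$ is formed from the $m$ lagged observations $y(t-1), \ldots, y(t-m)$, and $\theta(t)'$ from the corresponding $m$ autoregressive coefficients.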
The asymptotic properties of $\hat\theta(t)$ will be the same as for the off-line estimate. Because no residual estimation is required for $x(t)$, no additional computation is needed for (2.4.1). Hence, AR approximations to more general processes (such as ARMA) are often used in online estimation, particularly in high frequency sampling applications where the amount of computation needed can be critical. Experience also shows that a good description of the data can be obtained in many situations.
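As a rough illustration of the AR case, the sketch below applies the recursion (2.4.1), (2.4.2) with a regressor built directly from lagged observations. It assumes the regression form $y(t) = \theta' x(t) + e(t)$ with $x(t)' = (y(t-1), \ldots, y(t-m))$, which fixes a sign convention that may differ from the one intended in (2.4.3); the function name, the start-up value `p0` and the simulated AR(2) series are likewise assumptions.

```python
import random

def ar_rls(y, m, p0=1e4):
    """Recursive LS fit of an AR(m) model; returns the final coefficient estimate."""
    theta = [0.0] * m
    P = [[p0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for t in range(m, len(y)):
        x = [y[t - j] for j in range(1, m + 1)]            # lagged observations only
        Px = [sum(P[i][j] * x[j] for j in range(m)) for i in range(m)]
        delta = 1.0 / (1.0 + sum(x[i] * Px[i] for i in range(m)))
        e = y[t] - sum(theta[i] * x[i] for i in range(m))  # prediction error (2.4.2)
        # The updated P(t) satisfies P(t) x(t) = delta * P(t-1) x(t), giving (2.4.1).
        theta = [theta[i] + delta * Px[i] * e for i in range(m)]
        P = [[P[i][j] - delta * Px[i] * Px[j] for j in range(m)]
             for i in range(m)]
    return theta

rng = random.Random(2)
y = [0.0, 0.0]
for _ in range(3000):                                      # simulated stable AR(2)
    y.append(1.2 * y[-1] - 0.5 * y[-2] + rng.gauss(0.0, 1.0))
print(ar_rls(y, m=2))   # should lie near the simulated coefficients (1.2, -0.5)
```

No residuals are stored or recomputed here, which is the computational saving described above.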
However, where explicit estimates of the ARMA parameters are desired, a residual recursion is required. The underlying process is now assumed to be ARMA.
A2.4.2 AML
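Here, the regressor $x(t)$ of (2.4.4) is constructed with the lagged residuals $\hat\varepsilon(t-j)$, $j > 0$, standing in for the unobserved inputs, where
(2.4.5) $\hat\varepsilon(t) = y(t) - \hat\theta(t)' x(t)$
This can be computed at time $t$ since the $\hat\varepsilon(t-j)$, $j > 0$, are available to compute $e(t)$. Of course, $e(t)$ still needs to be formed for (2.4.1).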
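The interplay between the prediction error of (2.4.2) and the residual of (2.4.5) is perhaps easiest to see in code. The sketch below treats the simplest case, an ARMA(1,1) process written as $y(t) = a\,y(t-1) + \varepsilon(t) + c\,\varepsilon(t-1)$; this parameterisation, the regressor layout, the function name, the start-up value `p0` and the simulated series are all assumptions made for illustration.

```python
import random

def aml_arma11(y, p0=100.0):
    """AML/ELS sketch for an ARMA(1,1) model; returns [a_hat, c_hat]."""
    theta = [0.0, 0.0]
    P = [[p0, 0.0], [0.0, p0]]
    resid = 0.0                                   # residual from the previous step
    for t in range(1, len(y)):
        x = [y[t - 1], resid]                     # residual replaces the unobserved input
        Px = [P[0][0] * x[0] + P[0][1] * x[1],
              P[1][0] * x[0] + P[1][1] * x[1]]
        delta = 1.0 / (1.0 + x[0] * Px[0] + x[1] * Px[1])
        e = y[t] - (theta[0] * x[0] + theta[1] * x[1])       # (2.4.2), uses theta(t-1)
        theta = [theta[0] + delta * Px[0] * e,
                 theta[1] + delta * Px[1] * e]               # (2.4.1)
        P = [[P[i][j] - delta * Px[i] * Px[j] for j in range(2)]
             for i in range(2)]
        resid = y[t] - (theta[0] * x[0] + theta[1] * x[1])   # (2.4.5), uses theta(t)
    return theta

rng = random.Random(3)
a, c = 0.7, 0.4                                   # hypothetical true parameters
y, eps_prev = [0.0], 0.0
for _ in range(10000):
    eps = rng.gauss(0.0, 1.0)
    y.append(a * y[-1] + eps + c * eps_prev)
    eps_prev = eps
print(aml_arma11(y))                              # estimates of (a, c)
```

Replacing the last line of the loop by `resid = e` would give the cheaper variant that reuses the prediction error of (2.4.2) in the regressor.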
AML stands for Approximate Maximum Likelihood, which is somewhat misleading since the method is not based on the maximum likelihood principle. The term Extended Least Squares (ELS) is used for AML by engineers and is more appropriate. However, convention has meant that the term AML has stayed, and so we shall use it. Of course, the $e(t)$ from (2.4.2) could be used instead of the residual $\hat\varepsilon(t)$ from (2.4.5) to construct $x(t)$ in (2.4.4), which would avoid the extra computation needed in (2.4.5). Then the procedure is sometimes referred to as RML1. However, it is with (2.4.5) that convergence of AML can be established. The following theorem is due to Solo (1979).