6.4 Sparse Bayesian identification of polynomial NARX models
6.5.1 Numerical example 1: A non-linear benchmark
The SVB-NARX algorithm is demonstrated by applying it to the structure detection of the generative system given by
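\[
y_k = \theta_1 y_{k-2} + \theta_2 y_{k-1} u_{k-1} + \theta_3 u_{k-2}^2 + \theta_4 y_{k-1}^3 + \theta_5 y_{k-2} u_{k-2}^2 + e_k \tag{6.82}
\]
\[
\theta = [\theta_1, \theta_2, \theta_3, \theta_4, \theta_5]^T = [-0.5, 0.7, 0.6, 0.2, -0.7]^T
\]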
where $e_k$ is a white noise sequence drawn from the normal distribution $\mathcal{N}(e_k \mid 0, \sigma^2)$. The system is simulated for $N = 1000$ data samples with $\sigma^2 = 0.004$, corresponding to an SNR of approximately 20 dB. The input, $u_k$, is drawn from a uniform distribution on the range $[-1, 1]$. This system was first used in [80] and again as a benchmark in [99] and [8]. It demonstrates a situation in which the FRO algorithm fails to select the correct terms, necessitating further identification steps.
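The data generation described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the code used to produce the results reported here: the function name `simulate_benchmark`, the random seed and the zero initial conditions are assumptions introduced for the example.

```python
import numpy as np

def simulate_benchmark(n_samples=1000, noise_var=0.004, seed=0):
    """Simulate the system of Equation (6.82) for a uniform random input."""
    rng = np.random.default_rng(seed)          # seed is an assumption
    theta = np.array([-0.5, 0.7, 0.6, 0.2, -0.7])
    u = rng.uniform(-1.0, 1.0, n_samples)      # input drawn from U[-1, 1]
    e = rng.normal(0.0, np.sqrt(noise_var), n_samples)  # white Gaussian noise
    y = np.zeros(n_samples)                    # zero initial conditions (assumed)
    for k in range(2, n_samples):
        y[k] = (theta[0] * y[k - 2]
                + theta[1] * y[k - 1] * u[k - 1]
                + theta[2] * u[k - 2] ** 2
                + theta[3] * y[k - 1] ** 3
                + theta[4] * y[k - 2] * u[k - 2] ** 2
                + e[k])
    return u, y

u, y = simulate_benchmark()
```

The noise enters the recursion directly, so it propagates through the lagged output terms, as in the generative model.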
Figure 6.5: The SVB-NARX algorithm iteratively prunes terms that fall below a threshold. Model term index is plotted against ARD value. The log(ARD) value of each model term is shown by the black stems; true model terms are painted red. The dashed red line indicates the threshold, $r$. At each iteration of the algorithm, terms with an ARD value below the threshold are pruned; these are painted blue. The correct model is found at iteration 58, although the algorithm continues until only one term remains in the model.

Assuming no prior knowledge of the model structure, the polynomial model order is conservatively set to $n_p = 4$ and the dynamic order in both the input and the output is set to $n_u = n_y = 4$. These assumptions lead to a superset of $M = 495$ model terms in which to search (including the DC term). The SVB-NARX algorithm requires initialisation of the hyper-parameters associated with the prior distributions, namely $a_0$, $b_0$, $c_0$ and $d_0$, as well as the resolution variable, $r$. The hyper-parameters were chosen as $a_0 = c_0 = 1\times10^{-2}$ and $b_0 = d_0 = 1\times10^{-4}$, so as to produce uninformative prior distributions. The mean of the inverse-Gamma distribution implied on $\tau^{-1}$ is undefined at these values, but its mode is $b_0/(a_0+1) \approx 1\times10^{-4}$. This implies that the most likely variance on $\theta$ will be small a priori. The a priori variance on $\tau^{-1}$ is also undefined at these values; however, the variance on $\tau$ can be computed as $a_0/b_0^2 = 1\times10^{6}$. It can hence be concluded that, although the prior distribution indicates a preference for $\theta$ to take small values, its effect on the inference will be minimal because the distribution is broad. The same reasoning can be applied to the prior distribution on $\alpha$.
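The size of the candidate set and the prior summaries quoted above can be checked numerically. The sketch below enumerates every monomial of degree at most $n_p$ in the lagged signals and compares the count with the closed form $\binom{n_u + n_y + n_p}{n_p}$. The function name `candidate_terms` and the lag convention (lags 1 to 4 for both input and output) are assumptions made for the example; any convention with eight lagged variables gives the same count.

```python
from itertools import combinations_with_replacement
from math import comb

def candidate_terms(n_u=4, n_y=4, n_p=4):
    """List every monomial of degree <= n_p in the lagged outputs and inputs."""
    # Assumed lag convention: y[k-1..k-n_y] and u[k-1..k-n_u].
    lagged = ([f"y[k-{i}]" for i in range(1, n_y + 1)]
              + [f"u[k-{i}]" for i in range(1, n_u + 1)])
    terms = [()]  # the empty product is the DC (constant) term
    for degree in range(1, n_p + 1):
        terms.extend(combinations_with_replacement(lagged, degree))
    return terms

terms = candidate_terms()
n_terms = len(terms)               # size of the candidate superset M
closed_form = comb(4 + 4 + 4, 4)   # C(n_u + n_y + n_p, n_p)

# Prior summaries for a0 = 1e-2, b0 = 1e-4.
a0, b0 = 1e-2, 1e-4
mode_inv_tau = b0 / (a0 + 1)       # mode of the distribution on 1/tau
var_tau = a0 / b0 ** 2             # variance of the Gamma distribution on tau
```

Both counts evaluate to 495, and the two prior summaries reproduce the values of roughly $1\times10^{-4}$ and $1\times10^{6}$ given in the text.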
The FRO, LASSO and SEMP algorithms are applied to the same data set to provide a benchmark. LASSO is applied to give a comparison with available sparse methods and to demonstrate the resulting over-parametrisation. The FRO and SEMP algorithms provide benchmarks against standard identification methods. Both are terminated when the ERR or SRR value, respectively, falls below a threshold, which was set to 0.01 in both cases. The LASSO parameter estimates are found by minimising the cost function given by Equation (3.34) using a gradient descent algorithm [39]. The parameters are estimated for values of λ in the range [0, 1]. 10-fold cross-validation is then used to estimate the MSPE of the parameter set at each value of λ. The final model is selected as the model with the greatest MSPE within one standard deviation of the minimum MSPE found, see Figure 6.6C.

Table 6.1: The SVB-NARX algorithm selects the correct model structure. Terms selected by the SVB-NARX, FRO, LASSO and SEMP algorithms for the system given by Equation (6.82).

SVB-NARX
Basis function               ARD (×10^3)   Parameter estimate   Correct term?
y_{k-2} u_{k-2}^2            1.1821        -0.7053              ✓
y_{k-1} u_{k-1}              1.2037         0.6990              ✓
u_{k-2}^2                    0.8708         0.5999              ✓
y_{k-2}                      0.6064        -0.5006              ✓
y_{k-1}^3                    0.1047         0.2079              ✓

FRO
Iteration   Basis function        ERR      Parameter estimate   Correct term?
1           y_{k-4} u_{k-2}^2     0.3792   -0.0035              ✗
2           u_{k-2}^2             0.1576    0.6006              ✓
3           y_{k-2}               0.2681   -0.5006              ✓
4           y_{k-1} u_{k-1}       0.1600    0.6990              ✓
5           y_{k-2} u_{k-2}^2     0.0236   -0.7077              ✓
6           y_{k-1}^3             0.0070    0.2028              ✓

LASSO
Basis function               Parameter estimate   Correct term?
y_{k-1}                      -0.4669              ✓
y_{k-1} u_{k-1}               0.6250              ✓
u_{k-1}^2                     0.5610              ✓
y_{k-1}^3                     0.1316              ✓
y_{k-1} u_{k-1}^2            -0.6137              ✓
y_{k-1} u_{k-1}^2            -0.0371              ✗
y_{k-1} u_{k-1}^2            -0.0160              ✗
y_{k-1} u_{k-1}^2            -0.0036              ✗
y_{k-1} u_{k-1}^2             0.0236              ✗
u_{k-1} u_{k-1}^2             0.0025              ✗
y_{k-1}^2 u_{k-1}^2           0.0547              ✗
y_{k-1} u_{k-1} u_{k-1}^2     0.0742              ✗

SEMP
Iteration   Basis function        SRR      Parameter estimate   Correct term?
1           u_{k-1}^2             0.3226    0.5904              ✓
2           y_{k-1}               0.2001   -0.5164              ✓
3           y_{k-1} u_{k-1}       0.2344    0.6970              ✓
4           y_{k-1} u_{k-1}^2     0.1525   -0.6556              ✓
5           y_{k-1}^3             0.0888    0.2021              ✓
The SVB-NARX algorithm correctly identifies the model structure at iteration 58, indicated by the maximum of the variational lower bound recorded at each iteration, see Figure 6.6A. A total of 63 iterations were required before only one term remained in the model, terminating the algorithm, see Figure 6.5. The number of iterations can be reduced drastically by selecting a smaller value for $r$ while still obtaining the correct model structure; however, in a different scenario relevant terms could be incorrectly pruned. The Bayesian framework allows for a natural calculation of the probability distributions over the model parameters, see Figure 6.7. All of the parameters are estimated to within their 95% confidence intervals, see Table 6.2 and Figure 6.7.
The FRO algorithm selects an incorrect term at the first iteration, see Table 6.1. The incorrect term selection is suggested to be a result of the local nature of the search performed by the algorithm [112]. This problem can be solved by an extension to the FRO algorithm that adds a pruning step at every iteration [99]. The SEMP algorithm identifies the correct model structure in 5 iterations.
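To make the greedy, term-by-term nature of forward regression concrete, the following sketch implements a generic forward orthogonal regression with the error reduction ratio (ERR) as selection criterion, using classical Gram-Schmidt orthogonalisation. It is an illustration under stated assumptions and not the FRO implementation used for Table 6.1: the function name `forward_regression_err`, the convention of stopping before adding a term whose ERR is below the threshold, and the toy data are all introduced here.

```python
import numpy as np

def forward_regression_err(P, y, threshold=0.01):
    """Greedy forward orthogonal regression driven by the ERR criterion."""
    n_terms = P.shape[1]
    yy = y @ y
    selected, errs, W = [], [], []
    for _ in range(n_terms):
        best = (-1.0, None, None)
        for j in range(n_terms):
            if j in selected:
                continue
            w = P[:, j].astype(float).copy()
            for q in W:  # orthogonalise against the terms already selected
                w -= (q @ P[:, j]) / (q @ q) * q
            ww = w @ w
            if ww < 1e-12:  # candidate is (numerically) redundant
                continue
            err = (w @ y) ** 2 / (ww * yy)
            if err > best[0]:
                best = (err, j, w)
        err, j, w = best
        if j is None or err < threshold:  # assumed stopping convention
            break
        selected.append(j)
        errs.append(err)
        W.append(w)
    return selected, errs

# Toy data (invented for illustration): columns 1 and 4 generate the output.
rng = np.random.default_rng(0)
P = rng.normal(size=(200, 6))
y = 2.0 * P[:, 1] - 1.5 * P[:, 4] + 0.05 * rng.normal(size=200)
selected, errs = forward_regression_err(P, y)
```

Each term is chosen to maximise the ERR given the terms already in the model, and an earlier choice is never revisited. This is the local search behaviour referred to above, and it is why a pruning step can correct an early wrong selection.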
The LASSO parameter estimate resulting in the minimum MSPE is found at λ = 0.0043, with a sample standard deviation of 3.0443×10^{-4}. The selected model is then taken as the one with the maximum MSPE within one standard deviation of this minimum, and is found at λ = 0.0090. The resulting model has 12 non-zero parameter estimates, incorrectly including 7 terms that are not in the true model structure, and is therefore greatly over-parametrised. The true model structure is not recovered at any value of λ.
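The selection rule used for λ can be written compactly. The sketch below takes the cross-validated mean MSPE and its sample standard deviation at each λ, finds the minimum, and then picks the model with the greatest MSPE that is still within one standard deviation of that minimum. The function name `select_lambda` and the toy numbers are assumptions made for illustration; they are not the cross-validation results reported here.

```python
import numpy as np

def select_lambda(lambdas, mspe_mean, mspe_std):
    """Return (lambda at minimum MSPE, lambda chosen by the one-std rule)."""
    lambdas = np.asarray(lambdas, dtype=float)
    mspe_mean = np.asarray(mspe_mean, dtype=float)
    mspe_std = np.asarray(mspe_std, dtype=float)
    i_min = int(np.argmin(mspe_mean))
    limit = mspe_mean[i_min] + mspe_std[i_min]   # one std above the minimum
    within = np.flatnonzero(mspe_mean <= limit)  # admissible models
    i_sel = within[np.argmax(mspe_mean[within])] # greatest MSPE among them
    return lambdas[i_min], lambdas[i_sel]

# Toy cross-validation curve (invented values).
lambdas = [0.0, 0.002, 0.004, 0.006, 0.008, 0.010]
mspe_mean = [5.0, 4.2, 4.0, 4.1, 4.25, 4.6]
mspe_std = [0.3, 0.3, 0.3, 0.3, 0.3, 0.3]
lam_min, lam_sel = select_lambda(lambdas, mspe_mean, mspe_std)
```

When the MSPE increases with λ to the right of the minimum, as in the toy curve, this rule favours a more heavily regularised, and therefore sparser, model than the one at the minimum.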
6.5.2 Numerical example 2: Effect of noise on algorithm performance