4.2 Online Efficient Regularization Update
4.2.3 Simulations with Time Invariant Dynamical Systems
The purpose of this section is to give a preliminary evaluation of the performance of Algorithm 2 in a setting simpler than the one proposed in Section 4.1, since only time-invariant systems (a particular case of time-varying systems) are considered. A further aim is to show which of the methods proposed in Section 4.2.1 for computing the single step of the marginal likelihood optimization outperforms the others.
Data
The experiment consisted of 200 Monte Carlo runs; in each of them a random SISO discrete-time system has been generated through the Matlab routine drmodel.m. The system orders have been randomly chosen in the range [5, 10], while the system poles
78 Online Gaussian Regression
are all inside a circle of radius 0.95.
The input signal is a unit variance band-limited Gaussian signal with normalized band [0, 0.8]. A zero mean white Gaussian noise, with variance adjusted so that the Signal to Noise Ratio (SNR) is always equal to 5, has been added to the output data. For each Monte Carlo run, a data set of N = 5000 input-output pairs has been generated, while the length of the online upcoming datasets Di has been chosen to be T = 10.
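The data generation described above can be sketched in Python as follows. This is only an illustrative sketch, not the code used for the experiments: the function random_stable_siso is a hypothetical stand-in for the Matlab routine, the band [0, 0.8] is assumed to be expressed as a fraction of the Nyquist frequency, the band-limited input is obtained with a Butterworth low-pass filter (an arbitrary choice), and the SNR is assumed to be the ratio between the variance of the noiseless output and the noise variance.

```python
import numpy as np
from scipy import signal

def random_stable_siso(order, rng, max_radius=0.95):
    """Hypothetical stand-in for the Matlab routine: random SISO transfer
    function whose poles all lie inside the circle of radius max_radius."""
    n_pairs = order // 2
    # Poles drawn in the upper half-disc, then mirrored to get conjugate pairs.
    radii = max_radius * np.sqrt(rng.uniform(size=n_pairs))
    angles = rng.uniform(0.0, np.pi, size=n_pairs)
    half = radii * np.exp(1j * angles)
    poles = np.concatenate([half, half.conj()])
    if order % 2:  # odd order: add one real pole
        poles = np.append(poles, rng.uniform(-max_radius, max_radius))
    den = np.real(np.poly(poles))                               # monic denominator
    num = np.concatenate([[0.0], rng.standard_normal(order)])   # one-sample delay
    return num, den

def simulate_run(rng, N=5000, snr=5.0, band=0.8):
    """One Monte Carlo run: random system, band-limited input, noisy output."""
    order = int(rng.integers(5, 11))          # system order in [5, 10]
    num, den = random_stable_siso(order, rng)
    # Band-limited Gaussian input: low-pass filtered white noise,
    # rescaled to unit variance (assumed construction).
    b, a = signal.butter(8, band)             # cutoff as a fraction of Nyquist
    u = signal.lfilter(b, a, rng.standard_normal(N))
    u /= u.std()
    y0 = signal.lfilter(num, den, u)          # noiseless output
    noise_std = np.sqrt(y0.var() / snr)       # assumed SNR = var(y0) / var(noise)
    y = y0 + noise_std * rng.standard_normal(N)
    return u, y, y0, den

rng = np.random.default_rng(0)
u, y, y0, den = simulate_run(rng)
```

In an online experiment, the pairs (u, y) would then be split into consecutive datasets Di of length T = 10.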
Estimators
The procedures that perform only one iteration of the iterative algorithms SGP, BB, BFGS and EM (illustrated in Algorithms 4-7) are compared with the standard iterative approach, which estimates the hyperparameters by running the optimization algorithm until convergence. In the following, the former procedures are denoted 1-STEP, while the latter is referred to as OPT. The OPT procedure uses the SGP algorithm to maximize the marginal likelihood.
The OPT procedure corresponds to the so-called “batch” procedure, equipped with an ad hoc initialization of the optimization problem (4.7), provided by the previous hyperparameter estimate of the online procedure SGP, and with the recursive update of the data-dependent matrices (see Algorithm 2, steps 1-3) to reduce the computational time.
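The benefit of updating the data-dependent matrices recursively can be illustrated with a minimal sketch. Algorithm 2 is not reproduced here, so the construction below is an assumption: it uses a FIR model, accumulates the hypothetical quantities R = ΦᵀΦ and Yt = Φᵀy dataset by dataset, and computes the regularized estimate for fixed hyperparameters and a known noise variance sigma2. For simplicity the full regressor matrix is built once and sliced, whereas an online implementation would build only the new rows.

```python
import numpy as np

def fir_regressors(u, n):
    """Phi[t, k] = u[t - 1 - k], with zeros where the index is negative."""
    N = len(u)
    Phi = np.zeros((N, n))
    for k in range(n):
        Phi[k + 1:, k] = u[:N - k - 1]
    return Phi

rng = np.random.default_rng(1)
n, N, T, sigma2 = 20, 200, 10, 0.1        # illustrative sizes, not those of the text
u = rng.standard_normal(N)
g_true = 0.8 ** np.arange(1, n + 1)       # toy impulse response
Phi = fir_regressors(u, n)
y = Phi @ g_true + np.sqrt(sigma2) * rng.standard_normal(N)

# TC kernel (4.40) with lambda = 1 and beta = 0.8, indices starting at 1 (assumed).
idx = np.arange(1, n + 1)
K = np.minimum.outer(0.8 ** idx, 0.8 ** idx)

R = np.zeros((n, n))                      # running Phi^T Phi
Yt = np.zeros(n)                          # running Phi^T y
for start in range(0, N, T):
    Phi_i, y_i = Phi[start:start + T], y[start:start + T]
    R += Phi_i.T @ Phi_i                  # cost independent of the data seen so far
    Yt += Phi_i.T @ y_i
    # Regularized estimate (K R + sigma2 I)^{-1} K Phi^T y, no inversion of K.
    g_hat = np.linalg.solve(K @ R + sigma2 * np.eye(n), K @ Yt)

# Batch estimate on all data, for comparison.
g_batch = np.linalg.solve(K @ Phi.T @ Phi + sigma2 * np.eye(n), K @ Phi.T @ y)
```

After the last dataset, the recursive and the batch estimates coincide up to rounding, while each recursive step only touches the T new rows.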
In the experiments, a zero-mean Gaussian prior with a covariance matrix given by the so-called TC-kernel, see Chen et al. (2012), is adopted:
$$K_{\eta}^{TC}(k, j) = \lambda \min(\beta^{k}, \beta^{j}) \qquad (4.40)$$
where λ ≥ 0 and 0 ≤ β ≤ 1 are the hyperparameters collected in η = [λ, β]. The length n of the estimated impulse responses has been set to 80.
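As an illustration, the kernel matrix in (4.40) can be built as in the following sketch. The function name tc_kernel and the numerical values of λ and β are hypothetical, and the indices k, j are assumed to run from 1 to n.

```python
import numpy as np

def tc_kernel(n, lam, beta):
    """n x n TC kernel: K[k-1, j-1] = lam * min(beta**k, beta**j), k, j = 1..n."""
    powers = beta ** np.arange(1, n + 1)
    return lam * np.minimum.outer(powers, powers)

# Illustrative hyperparameter values (not taken from the experiments).
K = tc_kernel(n=80, lam=2.0, beta=0.9)
```

Since 0 ≤ β ≤ 1, the entry min(β^k, β^j) equals β^max(k, j), so the prior variance decays exponentially along the impulse response.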
To explore solutions with a lower computational time for the online updates, two versions of BFGS, SGP, BB and EM are proposed.
• Update λ and β. Both the hyperparameters in η are updated whenever a new dataset Di becomes available.
• Update only λ. Only the scaling factor λ is updated, keeping β fixed to its initial value. This setting reflects the framework in which the theoretical results of Section 4.2.2 have been derived.
Clearly, the second case allows a faster computation, at the expense of a less precise impulse response estimator.
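When only λ is updated, two variants of the EM update are considered:
• EM1, where $\hat{\lambda}^{(i+1)} = \frac{1}{n}\, \hat{g}^{(i)\top} K_{\hat{\beta}}^{-1} \hat{g}^{(i)}$, which is the current approximation of the asymptotically optimal value.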
• EM2, where the update corresponds to (4.31).
The aim is to compare the asymptotic theory with the EM update, see e.g. Bottegal, Aravkin, Hjalmarsson, and Pillonetto (2014); notice that the second term of (4.31) tends to zero when the number of data tends to infinity.
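The EM1 update of λ can be sketched as follows. Here the matrix K with subscript β̂ is assumed to denote the TC kernel with unit scaling (λ = 1) evaluated at the fixed β; the function names are hypothetical. The EM2 update is not sketched, since it relies on (4.31).

```python
import numpy as np

def tc_kernel_unit(n, beta):
    """TC kernel with lambda = 1 (assumed meaning of K with subscript beta)."""
    powers = beta ** np.arange(1, n + 1)
    return np.minimum.outer(powers, powers)

def em1_lambda_update(g_hat, beta):
    """EM1 update: lambda = (1/n) * g_hat^T K_beta^{-1} g_hat."""
    n = len(g_hat)
    K_beta = tc_kernel_unit(n, beta)
    # Solve a linear system instead of forming the explicit inverse.
    return g_hat @ np.linalg.solve(K_beta, g_hat) / n

# Toy usage with an exponentially decaying impulse response estimate.
g_hat = 0.8 ** np.arange(1, 21)
lam_next = em1_lambda_update(g_hat, beta=0.8)
```

A quick sanity check of the formula: if the impulse response were drawn from a zero-mean Gaussian with covariance λ K, the expected value of this update would be λ itself.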
Performance
As a first comparison, the adherence of the impulse response estimate to the true one is evaluated. For each estimated system and for each procedure, the impulse response fit is computed:
$$F(\hat{g}) = 100 \cdot \left(1 - \frac{\|g - \hat{g}\|_2}{\|g\|_2}\right) \qquad (4.41)$$
where g and ĝ are the true and the estimated impulse responses of the considered system, respectively.
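The fit index (4.41) translates directly into code; a minimal sketch follows, with a hypothetical function name and toy values.

```python
import numpy as np

def impulse_response_fit(g, g_hat):
    """Fit index (4.41): 100 * (1 - ||g - g_hat||_2 / ||g||_2)."""
    g = np.asarray(g, dtype=float)
    g_hat = np.asarray(g_hat, dtype=float)
    return 100.0 * (1.0 - np.linalg.norm(g - g_hat) / np.linalg.norm(g))

g = 0.8 ** np.arange(1, 81)                      # toy "true" impulse response
fit_perfect = impulse_response_fit(g, g)         # estimate equal to the truth
fit_zero = impulse_response_fit(g, np.zeros(80)) # all-zero estimate
```

A perfect estimate gives a fit of 100 and the all-zero estimate gives 0; the index becomes negative when the estimation error exceeds the norm of the true impulse response.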
Figure 4.1 shows the impulse response fits (4.41) achieved in the Monte Carlo simulations as the number of observed data increases. The OPT procedure is compared with the 1-STEP SGP, BB, BFGS and EM procedures. The results obtained by optimizing both hyperparameters in η are reported on the left-hand side, while those on the right-hand side are obtained by updating only λ.
All the 1-STEP procedures which update both hyperparameters perform remarkably well, with a fit index almost equivalent to the one obtained with the OPT procedure. This suggests that the full optimization of problem (4.7) does not bring any particular advantage in terms of fit in the online setting. Notice that this is a sort of worst-case approximation, since the optimization algorithm is stopped after only one step: more advanced techniques could be considered (e.g. an early stopping criterion, Yao, Rosasco, and Caponnetto (2007)). The 1-STEP updates optimizing only λ, as expected, perform worse than those updating both hyperparameters, with a larger variance and a slightly lower median. However, their behaviour is comparable to that obtained when both hyperparameters are updated; depending on the application, this technique can therefore be taken into consideration. The only exception is EM1, which achieves inferior fits, although this update is also expected to reach the same performance when the number of data tends to infinity.
The second comparison is done in terms of the cumulative computational time of the procedures, see Figure 4.2 and Table 4.1.
Figure 4.1: Monte Carlo results. Left: Boxplots of the impulse response fit obtained updating both hyperparameters in η (OPT, SGP, BB, BFGS, EM). Right: Boxplots of the impulse response fit obtained updating only λ (SGP, BB, BFGS, EM2, EM1).
                Update λ and β                       Update only λ
        OPT     SGP    BB     BFGS   EM       SGP    BB     BFGS   EM2    EM1
mean    163.1   0.56   0.93   1.19   0.57     0.31   0.60   0.45   0.18   0.30
std     18.45   0.13   0.16   0.36   0.11     0.06   0.13   0.25   0.06   0.92
Table 4.1: MC results. Mean and standard deviation (std) of the cumulative computational time [s] after N = 5000 data have been used.

Figure 4.2: Monte Carlo results. Boxplots of the cumulative computational time (vertical axis: Time [s]). Each row of plots corresponds to the cumulative time after a given number of data has been processed. Left: OPT procedure. Middle: 1-STEP optimization of both hyperparameters. Right: 1-STEP optimization of λ only (β is fixed).

The OPT procedure, as expected, is much slower than the 1-STEP procedures. This suggests that the 1-STEP procedures are excellent candidates for real-time applications. Indeed, these techniques perform comparably to the OPT procedure in terms of fit, while requiring a computational time that is two to three orders of magnitude smaller; furthermore, the gap in computational time grows in favour of the 1-STEP procedures as the number of data seen increases. Among the 1-STEP procedures, SGP and EM provide the fastest updates: this is a surprisingly positive result for the EM update, since only λ has a closed-form update, while β is the solution of a maximization problem; indeed, in the right-hand side of Figure 4.2, where only λ is updated, EM1 and EM2 outperform SGP. The BB update is a particular case of SGP with D(i) = I (see Section 4.2.2), but it is significantly slower: this is due to the backtracking loop at Step 8 in Algorithm 3. The right-hand side of Figure 4.2 shows the advantage of updating only λ: the cumulative computational time is lower.

Finally, Figure 4.3 reports the evolution of the fit and of the hyperparameter estimates, for a single system, when new datasets of different lengths arrive. In this experiment, datasets Di of lengths T = 1, 10, 50 are considered. It is of interest to observe that, in terms of both fit and hyperparameter updates, the performance of the 1-STEP techniques closely matches that of the OPT procedure. The graph is cut after 3000 data to highlight the transient behaviour. As expected, the transient is longer and more pronounced in the case of T = 50, particularly in the behaviour of λ. However, this does not significantly affect the fit performance.
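The statement on the orders of magnitude separating OPT from the 1-STEP procedures can be checked against the mean cumulative times of Table 4.1 (update of both λ and β). The short sketch below only re-uses the values reported in the table; the variable names are hypothetical.

```python
import math

# Mean cumulative times [s] from Table 4.1, update of both lambda and beta.
mean_time = {"OPT": 163.1, "SGP": 0.56, "BB": 0.93, "BFGS": 1.19, "EM": 0.57}

# Speed-up of each 1-STEP procedure with respect to OPT, and its order of magnitude.
speedup = {name: mean_time["OPT"] / t for name, t in mean_time.items() if name != "OPT"}
orders = {name: math.log10(ratio) for name, ratio in speedup.items()}

for name in speedup:
    print(f"{name}: speed-up {speedup[name]:.0f}x, log10 = {orders[name]:.2f}")
```

All the resulting base-10 logarithms lie between 2 and 3, in agreement with the claim made in the text for N = 5000.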
4.3 Time-Varying Dynamical Systems 83