6.4 Discussion
6.4.6 Limitations and future work
7.1.2.3 Comparison of ALS algorithms
Three algorithms are compared in this section: ALS without the update step, i.e. Algorithm 7.3 where the update steps are substituted by vector rescaling steps (u ← u/‖Xu‖₂ and v ← v/‖Yv‖₂); ALS with the update step, but with a fixed value of δ; and the final proposed algorithm with the update step and with the update on the value of δ, as described in Algorithm 7.3.
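The rescaling step used by the first variant can be illustrated with a minimal sketch. It is not taken from Algorithm 7.3; the helper names `matvec` and `rescale` and the toy matrix are assumptions made for illustration only. The sketch divides a weight vector by the Euclidean norm of its projection, so that the projection Xu has unit norm.

```python
import math

def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rescale(A, w):
    """Return w / ||A w||_2, so that the projection A w has unit Euclidean norm."""
    norm = math.sqrt(sum(t * t for t in matvec(A, w)))
    if norm == 0.0:
        raise ValueError("A w is the zero vector; the weight vector cannot be rescaled")
    return [wi / norm for wi in w]

# Toy data matrix (3 samples, 2 features); the values are purely illustrative.
X = [[1.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]

u = rescale(X, [3.0, 4.0])
proj_norm = math.sqrt(sum(t * t for t in matvec(X, u)))
```

After the call, `proj_norm` equals 1 up to floating-point error, which is the normalisation the rescaling step enforces.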
The results presented below summarise the performance during the hyper-parameter selection steps for all 10 hold-out data splits, i.e. each hyper-parameter combination was evaluated using 1000 data splits.
Figure 7.6 shows the results obtained when using the original dataset without the addition of noisy features ({X, Y}). By comparing how ALS performed before (red) and after (green) the addition of the update step, one can see that there was a slight improvement in the fraction of times it converged (Figure 7.6(a)). The update of the value of δ provided even better results (blue). Despite this improvement, one can see that the impact on the test correlations was negligible (Figure 7.6(b)), which seems to support our initial hypothesis that, by not forcing the algorithm to converge to a result using the proposed ALS method, there is the risk that it will oscillate between a set of very similar solutions.
132 Chapter 7. Alternating Least Squares (ALS) method for SCCA and SPLS
[Figure 7.6, panels (a) Fraction of times algorithm converged and (b) Average test correlation; horizontal axes p_u and q_v; series: No update step, No delta update, Delta update.]
Figure 7.6: Comparison of ALS algorithms using original dataset {X, Y}. Both plots show results for hyper-parameter selection across the 10 hold-out splits. Red: ALS without the update step; Green: ALS with the update step, but with a fixed δ; Blue: proposed algorithm, i.e. ALS with update step and update on the value of δ.
Figure 7.7 shows the same results as Figure 7.6, but using the dataset with added noisy features ({X0, Y0}). In this case, one can see that the algorithms behave quite differently. Figure 7.7(a) shows that for the optimal hyper-parameter combination, the addition of the update step (green) improved the fraction of times the algorithm converged from 0.4530 to 0.5180, which was further improved by the update on the value of δ (blue) from 0.5180 to 0.9820. However, this was not the case for most hyper-parameter combinations, where the original ALS formulation (red) converged as often as the proposed formulation (blue).
Figure 7.7(b) shows that, unlike the results with the original dataset (Figure 7.6(b)), the correlation obtained on the test sets changed with the ALS algorithm. For most of the hyper-parameter combinations, the proposed method (blue) performed worse than the original ALS (red) or the ALS with the update step but without the update on the value of δ (green). This result may be due to an implementation issue, where the proposed ALS formulation assumes too soon that the algorithm is oscillating and prematurely decreases the value of δ, forcing ALS to converge to a sub-optimal solution.
7.1. SCCA using ALS 133
Despite these results, the difference between the correlation obtained by the three different ALS algorithms for the optimum hyper-parameter combination was negligible: there was a small increase from 0.6714 (red) to 0.6717 (blue). Therefore, the proposed ALS algorithm will still choose the same hyper-parameter combination as the other two ALS algorithms. This result leads us to believe that using the proposed ALS algorithm is the best option, as it will provide comparable solutions, while converging more often. This last property is especially attractive, since it will decrease the computational time, by preventing the algorithm from spending time oscillating between very similar solutions.
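The two quantities compared above (fraction of runs that converged and average test correlation per hyper-parameter combination) can be computed along the following lines. This is a minimal sketch: the record layout `(p_u, q_v, converged, test_corr)`, the function names, and the toy values are all hypothetical and are not results from this chapter. It also assumes that the optimum combination is the one with the highest average test correlation.

```python
from collections import defaultdict
from statistics import mean

def summarise(runs):
    """Group per-split runs by (p_u, q_v) and compute the two summary statistics.

    runs: iterable of (p_u, q_v, converged, test_corr) tuples, one per data split.
    """
    grouped = defaultdict(list)
    for p_u, q_v, converged, test_corr in runs:
        grouped[(p_u, q_v)].append((converged, test_corr))
    summary = {}
    for combo, results in grouped.items():
        summary[combo] = {
            # Fraction of data splits on which the algorithm converged.
            "frac_converged": mean(1.0 if c else 0.0 for c, _ in results),
            # Average correlation obtained on the test sets.
            "mean_test_corr": mean(r for _, r in results),
        }
    return summary

def best_combination(summary):
    """Pick the combination with the highest average test correlation (assumed criterion)."""
    return max(summary, key=lambda combo: summary[combo]["mean_test_corr"])

# Hypothetical toy records, for illustration only.
toy_runs = [
    (10, 2, True, 0.60), (10, 2, False, 0.62),
    (50, 4, True, 0.70), (50, 4, True, 0.66),
]
summary = summarise(toy_runs)
best = best_combination(summary)
```

Running `summarise` once per ALS variant and comparing the output of `best_combination` is one way to check whether the variants select the same hyper-parameter combination.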
[Figure 7.7, panels (a) Fraction of times algorithm converged and (b) Average test correlation; horizontal axes p_u and q_v; series: No update step, No delta update, Delta update.]
Figure 7.7: Comparison of ALS algorithms using dataset with added noise {X0, Y0}. Both plots show results for hyper-parameter selection across the 10 hold-out splits. Red: ALS without the update step; Green: ALS with the update step, but with a fixed δ; Blue: proposed algorithm, i.e. ALS with update step and update on the value of δ.
7.1.3 Conclusion
This section has served mainly as an introduction to the remaining work presented in this thesis (Section 7.2 and Chapter 8). However, it also proposes a different formulation of the ALS algorithm [Golub and Zha, 1992]. This differs from earlier ALS formulations [Lykou and Whittaker, 2010, Chi et al., 2013, Wilms and Croux, 2015, Polajnar, 2015] by introducing an extra update step on the weight vectors, in order to prevent the algorithm from oscillating between similar solutions.
The algorithm was tested on a dementia dataset containing ROI information and clinical/demographic information ({X, Y}), and on a higher dimensional dataset where p, q > n and most features consisted of noise ({X0, Y0}). This was done to assess whether the ALS algorithm could be used for SCCA when the problem was ill-conditioned. The results have shown that the ALS was indeed able to provide statistically significant results, which were robust to perturbations in the data.
There were still a few cases in which ALS did not converge because the constraints could not be satisfied. This was especially common when the constraints on the number of features were very strict, i.e. only one or two features were allowed. In some of these cases, the glmnet package was not able to provide a solution which obeyed the constraints. Nevertheless, SCCA was able to compute weight vector pairs that generalised well to test data.
One of the possible limitations of the ALS is that its computational time did not scale well with the number of features. This scalability depends on the algorithm used to solve each LASSO regression step. As mentioned in Section 7.1.1.1, glmnet was chosen because it was the best performing algorithm in preliminary experiments. However, the computational time of ALS can be further reduced if the LASSO regression steps are solved using a faster algorithm.
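The dependence on the LASSO solver can be made concrete with a sketch in which the solver is passed to the ALS loop as an argument, so that a faster routine could be swapped in without touching the loop. This is not Algorithm 7.3: it is the plain variant with rescaling only (no update step and no δ), it uses a penalty λ rather than constraints on the number of selected features, and it replaces glmnet with a small pure-Python coordinate descent. All names (`lasso_cd`, `als_scca`, `lam_u`, `lam_v`) and the toy data are assumptions made for illustration.

```python
import math

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rescale(A, w):
    """Return w / ||A w||_2 (the vector rescaling step)."""
    norm = math.sqrt(sum(t * t for t in matvec(A, w)))
    if norm == 0.0:
        raise ValueError("A w is zero; the penalty is probably too strong")
    return [wi / norm for wi in w]

def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(A, y, lam, n_sweeps=200):
    """Coordinate descent for min_w 0.5 * ||y - A w||^2 + lam * ||w||_1."""
    n, p = len(A), len(A[0])
    w = [0.0] * p
    resid = list(y)  # residual y - A w, with w = 0 initially
    col_sq = [sum(A[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the residual that excludes w_j.
            rho = sum(A[i][j] * resid[i] for i in range(n)) + col_sq[j] * w[j]
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= A[i][j] * delta
                w[j] = new_wj
    return w

def als_scca(X, Y, lam_u, lam_v, solver=lasso_cd, max_iter=100, tol=1e-8):
    """Alternate two LASSO regressions, rescaling after each; returns (u, v, converged)."""
    v = rescale(Y, [1.0] * len(Y[0]))
    u = None
    for _ in range(max_iter):
        u_new = rescale(X, solver(X, matvec(Y, v), lam_u))
        v_new = rescale(Y, solver(Y, matvec(X, u_new), lam_v))
        change = max(abs(a - b) for a, b in zip(v_new, v))
        if u is not None:
            change = max(change, max(abs(a - b) for a, b in zip(u_new, u)))
        u, v = u_new, v_new
        if change < tol:
            return u, v, True
    return u, v, False

# Toy data: the first columns of X and Y are identical, the second are unrelated.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
Y = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 1.0]]
u, v, converged = als_scca(X, Y, lam_u=0.1, lam_v=0.1)
```

On this toy problem both weight vectors select only the first feature. Any function with the same `(A, y, lam)` signature can be passed as `solver`, which is where a faster LASSO implementation would plug in.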
Despite the encouraging results, there are still questions that should be investigated, more specifically: how do these results compare with SPLS; is it possible to use other types of penalties besides the LASSO; and is it possible to solve SPLS using ALS. These shall be addressed in Section 7.2.