3.4 Results
3.4.6 SUMO site random forest models
The SUMO site sequences were converted into 33 input variables for the random forest models. Unlike the SIM data, using 3 principal component dimensions rather than 5 resulted in better performance, suggesting that SUMO site prediction is less complex than SIM prediction. The differences between type I and II SUMO sites were notable and fit well with the current understanding of these motifs (see Figure 3.22). For type I SUMO sites, the hydrophobic residue upstream of the central lysine is the most important feature, followed by the [DE] feature at position 2, with other amino acids contributing a small amount of information. For the type II sites, the [DE] feature is the most important, but the strong hydrophobic feature upstream of the central lysine is absent and the variable importance is spread over the other positions. The central lysine contributes no information to the random forest predictors because sequences are preselected to have a lysine residue at this position; no variables for this position are therefore used in the predictor.
The optimisation algorithm was applied to the SUMO site data. As in the SIM optimisation, a performance maximum was found, after which adding more variables reduced the performance of the model as measured by AUC. The performance drop, however, was much smaller than for the SIM models. The optimal number of variables used at each tree node, m, was 1, the same as for most of the SIM models. This is in contrast to the random forest predictor by Teng et al. (2012), which was found to have optimal performance with m = 6, though their predictors used hundreds of input variables, while the optimal number found in this work was 11 and 6 for type I and II respectively. For random forests using a higher number of variables, the optimal value for m starts to increase; this is especially apparent for the type I random forest with 33 variables in Figure 3.23, where m = 1 had the worst performance.
[Figure 3.22 image: Random forest variable importance. Panels: Type I SUMO sites; Type II SUMO sites. x-axis: amino acid position in motif (−5 to 5). y-axis: principal component analysis dimension (1 to 3). Legend: mean Gini decrease (0.5 to 2.5).]
Figure 3.22: SUMO site predictor variable importance. The central SUMOylated lysine is at position 0 in the figure. Position 0 has no importance as sequences are prescreened to have a lysine residue at this position and so this position contributes no information to the prediction models.
[Figure 3.23 image: SUMO site model parameter optimisation. Panels: Type I SUMO sites (y-axis range 0.975 to 0.990); Type II SUMO sites (y-axis range 0.76 to 0.86). x-axis: number of variables in model (0 to 30). y-axis: OOB estimate of ROC AUC. Legend: variables used at each tree node (1 to 10).]
Figure 3.23: SUMO site model parameter optimisation. Performance of random forest models was assessed by adding 1 variable at a time and testing different numbers of variables sampled at each tree node. Shaded areas show 95% confidence intervals, n = 25.
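The search summarised in Figure 3.23 can be sketched as a nested loop over the number of variables and the number of variables sampled at each node. The following is a minimal illustration only, not the code used in this work: it assumes the variables are added in a fixed, pre-ranked order, and the names `optimise_model` and `toy_evaluate` are hypothetical. `toy_evaluate` is a contrived stand-in for "train a random forest and return its OOB ROC AUC" and does not reproduce any result reported here.

```python
from typing import Callable, Iterable, Sequence, Tuple


def optimise_model(
    ranked_variables: Sequence[str],
    mtry_values: Iterable[int],
    evaluate: Callable[[Sequence[str], int], float],
) -> Tuple[int, int, float]:
    """Return (n_variables, mtry, score) for the best-scoring combination.

    Variables are added one at a time in the order given (assumed to be a
    ranking chosen beforehand); every candidate mtry is tried for each subset.
    """
    candidates = list(mtry_values)
    best = (0, 0, float("-inf"))
    for k in range(1, len(ranked_variables) + 1):
        subset = ranked_variables[:k]
        for m in candidates:
            if m > k:
                continue  # cannot sample more variables than the model holds
            score = evaluate(subset, m)
            if score > best[2]:
                best = (k, m, score)
    return best


def toy_evaluate(subset: Sequence[str], m: int) -> float:
    # Hypothetical scoring function with a single maximum, used only to
    # exercise the loop; a real run would train a forest and return OOB AUC.
    return 0.99 - 0.0005 * abs(len(subset) - 11) - 0.001 * (m - 1)


variables = [f"v{i}" for i in range(1, 34)]  # 33 placeholder variable names
best = optimise_model(variables, range(1, 11), toy_evaluate)
```

In practice each call to `evaluate` would be repeated (here, 25 times) so that a confidence interval can be drawn around every point of the curve.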
Once optimal parameters for the SUMO site predictors were identified, random forests were trained with 2000 trees, the point beyond which the OOB estimate of error could not be improved any further. For each SUMO site type, 25 random forests were trained and their performance was assessed by AUC values. As with the SIM random forest predictors, the variance of the AUC values between the random forests was very small. The performance was also compared with SUMOsp (Ren et al., 2009) and seeSUMO (Teng et al., 2012) by querying these two predictors with the training data and using the resulting score values to calculate AUC values and ROC curves. The AUC values for our model, known as HyperSUMO, and for the other predictors are shown in Table 3.5, and the ROC curves are shown in Figure 3.24. The results show that, as expected, the type I predictor (AUC = 0.986) outperforms the type II predictor (AUC = 0.842) and that the SUMO site predictors greatly outperform the SIM predictors (Table 3.4). The better performance of the SUMO site predictors is at least partly due to the much larger size of the training dataset, but may also be influenced by the quality of the data and the complexity of the problem being addressed.
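For the comparison with the published predictors, an AUC value has to be derived from nothing more than a list of scores and the known labels. A minimal sketch of one way to do this is given below, using the rank interpretation of ROC AUC (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half). The function name `roc_auc` and the label coding (1 for a SUMOylated lysine, 0 otherwise) are assumptions made for illustration; the example scores are invented and are not output from any predictor.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as P(score of a positive > score of a negative); ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented scores for five lysines, two of which are labelled as SUMOylated.
example_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
example_labels = [1, 0, 1, 0, 0]
example_auc = roc_auc(example_scores, example_labels)
```

The pairwise loop is quadratic in the number of sequences, which is acceptable for a sketch; a sort-based rank sum gives the same value more efficiently on large datasets.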
Model               SUMO types   ROC AUC (±95% CI)
HyperSUMO type I    I            0.986 ± 0.00075
HyperSUMO type II   II           0.842 ± 0.0039
SUMOsp type I       I            0.731
SUMOsp type II      II           0.725
seeSUMO             I & II       0.705
Table 3.5: Comparison of ROC AUC values for various SUMO site predictors. Mean ROC AUC values for the random forest models were obtained by repeatedly training random forests on the same data set (n = 25). Random forests with 2000 trees were grown for each iteration.
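A mean and 95% confidence interval of the kind reported in Table 3.5 can be computed from the 25 repeated AUC values as sketched below. This is an illustration under stated assumptions: the text does not specify how the interval was calculated, so the use of Student's t distribution (critical value 2.064 for 24 degrees of freedom) is an assumption, the name `mean_ci95` is hypothetical, and the AUC values are invented placeholders rather than results from this work.

```python
import math
import statistics
from typing import Sequence, Tuple

# Two-sided 95% critical value of Student's t for 24 degrees of freedom (n = 25).
T_CRIT_DF24 = 2.064


def mean_ci95(values: Sequence[float], t_crit: float = T_CRIT_DF24) -> Tuple[float, float]:
    """Return (mean, half_width) so the interval is mean ± half_width.

    The default t_crit is only appropriate for n = 25; pass the matching
    critical value for any other sample size.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a standard error")
    mean = statistics.fmean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return mean, t_crit * standard_error


# Invented AUC values standing in for 25 repeated random forest runs.
placeholder_aucs = [0.985 + 0.0001 * (i % 5) for i in range(25)]
auc_mean, auc_half_width = mean_ci95(placeholder_aucs)
```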
SUMOsp and seeSUMO were trained with the same data as was used to build our predictors, and because technical restrictions made cross validation impossible, the resulting AUC values likely overestimate the performance of these predictors. Despite this possible overestimation, our model, HyperSUMO, greatly outperforms both, even with the less accurate type II predictor. The results for seeSUMO disagree with those published by the authors, who estimated a ROC AUC value of 0.920 for their best performing predictor, while our results give a value of 0.705, a large discrepancy. One of the major differences between the HyperSUMO and seeSUMO estimates of AUC is the validation dataset used: our method used all of the 8318 training sequences, while Teng et al. (2012) used a separate set with a total of 1338 sequences, of which 48 were SUMOylated. There may be something inherently different between these datasets that accounts for the discrepancy in estimated AUC values, where the larger dataset gives a lower value and the smaller a higher value. Factors that could contribute to this difference include a different ratio of type I and II sites or a different accuracy of the data resulting from the methods used to identify the SUMO sites. The smaller dataset used for evaluation by Teng et al. (2012) was curated from publications after January 2010, whereas the data used in this work were from before this date.
[Figure 3.24 image: SUMO site predictor performance. x-axis: false positive rate (0.0 to 1.0). y-axis: sensitivity (0.0 to 1.0). Legend (model): HyperSUMO type I, HyperSUMO type II, SUMOsp type I, SUMOsp type II, seeSUMO.]
Figure 3.24: ROC curves of the performance of various SUMO site predictors. The published predictors SUMOsp and seeSUMO have comparable performance. HyperSUMO greatly outperforms both published predictors for type I and type II sites, with markedly better performance for type I prediction. To calculate performance for HyperSUMO, FPR and TPR were obtained by OOB estimation using the full training dataset. For the other two models, the predictors were queried with the training data with the threshold set to 0 so that a score was generated for every lysine. SUMOsp and seeSUMO were trained with the same datasets, but cross validation was not possible due to technical restrictions; the ROC curves generated for seeSUMO and SUMOsp therefore overestimate their performance. Despite this, HyperSUMO outperforms these predictors.
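The ROC curves in Figure 3.24 are built by sweeping a decision threshold over the scores, which is why a score is needed for every lysine. The sketch below shows one way to generate the (FPR, TPR) points from scores and labels; it is an illustration rather than the plotting code used for the figure, and the name `roc_points`, the label coding (1 for SUMOylated, 0 otherwise) and the example values are all assumptions.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) pairs from the strictest to the most lenient threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    # Highest score first, so each step lowers the threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every item sharing this score so ties yield a single point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points


# Invented scores for five lysines, two of which are labelled as SUMOylated.
curve = roc_points([0.9, 0.8, 0.4, 0.3, 0.1], [1, 0, 1, 0, 0])
```

If a predictor only reports scores above a non-zero cut-off, the low-scoring lysines are missing from `scores` and the curve cannot reach the point (1, 1) without further assumptions.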