
3.4 Results

3.4.6 SUMO site random forest models

The SUMO site sequences were converted into 33 input variables for the random forest models. Unlike the SIM data, using 3 principal component dimensions rather than 5 resulted in better performance, suggesting that SUMO site prediction is less complex than SIM prediction. The differences between type I and II SUMO sites were notable and fit well with the current understanding of these motifs (Figure 3.22). For type I SUMO sites, the hydrophobic residue upstream of the central lysine is the most important feature, followed by the [DE] feature at position 2, with other amino acids contributing a small amount of information. For the type II sites, the [DE] feature is most important but the strong hydrophobic feature upstream of the central lysine is absent, with the variable importance spread over the other positions. The central lysine contributes no information to the random forest predictors because sequences are preselected to have a lysine residue at this position; thus no variables for this position are used in the predictor.
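The encoding step described above can be illustrated with a minimal sketch. It assumes that each residue in a sequence window is replaced by its scores on a fixed number of principal component dimensions, and that the scores are concatenated position by position. The function name `encode_window` and the table `TOY_PC_SCORES` are hypothetical, and the numbers in the table are placeholders for illustration only, not values from this work.

```python
from typing import Dict, List, Sequence

# Hypothetical descriptor table: amino acid -> principal component scores.
# These numbers are placeholders for illustration, not values from this work.
TOY_PC_SCORES: Dict[str, Sequence[float]] = {
    "K": (0.0, 0.0, 0.0),
    "D": (1.0, -0.5, 0.2),
    "E": (0.9, -0.4, 0.3),
    "V": (-1.0, 0.6, -0.1),
    "I": (-1.1, 0.7, -0.2),
    "A": (-0.2, 0.1, 0.0),
}


def encode_window(
    window: str,
    pc_scores: Dict[str, Sequence[float]],
    n_dims: int = 3,
) -> List[float]:
    """Concatenate the first n_dims scores of every residue in the window."""
    features: List[float] = []
    for residue in window:
        # A KeyError here means the residue is missing from the table.
        features.extend(pc_scores[residue][:n_dims])
    return features


# An 11-residue window with 3 dimensions per residue gives 33 values.
example = encode_window("AAAVIKAEAAA", TOY_PC_SCORES)
print(len(example))
```

Under this assumed layout, an 11-residue window with 3 dimensions per position yields 33 values. Whether this is exactly how the 33 variables in this work are arranged is not stated in the text.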

The optimisation algorithm was applied to the SUMO site data. As in the SIM optimisation, a performance maximum was found, after which adding more variables reduced the performance of the model as measured by AUC. The performance drop, however, was much smaller than for the SIM models. The optimal number of variables sampled at each tree node, m, was 1, the same as for most of the SIM models. This is in contrast to the random forest predictor by Teng et al. (2012), which was found to have optimal performance with m = 6, though their predictors used hundreds of input variables while the optimal number found in this work was 11 and 6 for type I and II respectively. For random forests using a higher number of variables, the optimal value for m starts to increase; this is especially apparent for the type I random forest with 33 variables in Figure 3.23, where m = 1 had the worst performance.
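The search over model size and m can be sketched as a simple grid, assuming variables are added one at a time in a fixed order supplied by the caller and that every candidate model is scored by a single number such as the OOB estimate of ROC AUC. The names `optimise_parameters` and `score_fn` are hypothetical; this is an illustration of the idea, not the optimisation algorithm used in this work.

```python
from typing import Callable, List, Sequence, Tuple


def optimise_parameters(
    ordered_variables: Sequence[str],
    m_values: Sequence[int],
    score_fn: Callable[[List[str], int], float],
) -> Tuple[float, int, int]:
    """Return (best_score, number_of_variables, m) over the grid.

    score_fn is assumed to train a model on the given variables with m
    variables sampled at each tree node and to return its score.
    """
    best = None
    for k in range(1, len(ordered_variables) + 1):
        subset = list(ordered_variables[:k])  # add one variable at a time
        for m in m_values:
            if m > k:
                continue  # cannot sample more variables than the model holds
            score = score_fn(subset, m)
            # Strict comparison keeps the smallest model on ties.
            if best is None or score > best[0]:
                best = (score, k, m)
    if best is None:
        raise ValueError("no valid (variables, m) combination to evaluate")
    return best
```

In practice `score_fn` would train a random forest on the chosen variables and return its OOB AUC. Keeping it as a callback leaves the sketch independent of any particular random forest library.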

[Figure 3.22 image: "Random forest variable importance". Panels: Type I SUMO sites, Type II SUMO sites. x-axis: amino acid position in motif (−5 to 5); y-axis: principal component analysis dimension (1 to 3); scale: mean Gini decrease (0.5 to 2.5).]

Figure 3.22: SUMO site predictor variable importance. The central SUMOylated lysine is at position 0 in the figure. Position 0 has no importance as sequences are prescreened to have a lysine residue at this position and so this position contributes no information to the prediction models.

[Figure 3.23 image: "SUMO site model parameter optimisation". Panels: Type I SUMO sites, Type II SUMO sites. x-axis: number of variables in model (0 to 30); y-axis: OOB estimate of ROC AUC (0.975 to 0.990 for type I, 0.76 to 0.86 for type II); legend: variables used at each tree node (1 to 10).]

Figure 3.23: SUMO site model parameter optimisation. Performance of random forest models was assessed by adding 1 variable at a time and testing a different number of variables sampled at each tree node. Shaded areas show 95% confidence intervals, n = 25.

Once optimal parameters for the SUMO site predictors were identified, random forests were trained with 2000 trees, the point at which the OOB estimate of error could not be improved any further. For each SUMO site type, 25 random forests were trained and their performance was assessed by AUC values. As with the SIM random forest predictors, the variance of the AUC values between the random forests was very small. The performance was also compared with SUMOsp (Ren et al., 2009) and seeSUMO (Teng et al., 2012) by querying these two predictors with the training data and using the resulting score values to calculate AUC values and ROC curves. The AUC values for our model, known as HyperSUMO, and the other predictors are shown in Table 3.5 and the ROC curves are shown in Figure 3.24. The results show that, as expected, the type I predictor (AUC = 0.986) outperforms the type II predictor (AUC = 0.842) and the SUMO site predictors greatly outperform the SIM predictors (Table 3.4). The better performance of the SUMO site predictors is at least partly due to the much larger size of the training dataset but may also be influenced by the quality of the data and the complexity of the problem being addressed.
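As an illustration of how an AUC value can be obtained from predictor scores, the sketch below uses the rank interpretation of ROC AUC: the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half. The function name `roc_auc` is hypothetical, and this is not necessarily the routine used to produce the values reported here.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of positive/negative pairs ranked correctly.

    labels: 1 for a SUMOylated lysine, 0 otherwise.
    scores: predictor output, where higher means more likely SUMOylated.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute an AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Three of the four positive/negative pairs are ranked correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

The pairwise loop is quadratic in the number of sequences. That is acceptable for a sketch, although a rank-based implementation would scale better to larger datasets.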

Model               SUMO types   ROC AUC (±95% CI)
HyperSUMO type I    I            0.986 ± 0.00075
HyperSUMO type II   II           0.842 ± 0.0039
SUMOsp type I       I            0.731
SUMOsp type II      II           0.725
seeSUMO             I & II       0.705

Table 3.5: Comparison of ROC AUC values for various SUMO site predictors. Mean ROC AUC values for the random forest models were obtained by repeatedly training random forests on the same data set (n = 25). Random forests with 2000 trees were grown for each iteration.
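A mean with a 95% confidence interval over repeated runs, as reported in Table 3.5, can be computed along the following lines. The text does not state how the intervals were derived, so the normal approximation used here is an assumption, and the function name `mean_with_ci` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def mean_with_ci(values: Sequence[float], level: float = 0.95) -> Tuple[float, float]:
    """Return (mean, half_width) of a confidence interval for the mean.

    Assumes a normal approximation: half_width = z * s / sqrt(n), where s is
    the sample standard deviation and z is the two-sided normal quantile.
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed")
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level = 0.95
    half_width = z * stdev(values) / sqrt(n)
    return mean(values), half_width
```

For n = 25, an interval based on the t distribution would be slightly wider than this normal approximation.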

SUMOsp and seeSUMO were themselves trained on the same data used to build our predictors, and technical restrictions made cross validation impossible, so the resulting AUC values likely overestimate the performance of these predictors. Despite this possible overestimation, our model, HyperSUMO, greatly outperforms SUMOsp and seeSUMO, even for the less accurate type II predictor. The results for seeSUMO disagree with those published by its authors, who estimated a ROC AUC value of 0.920 for their best performing predictor, while our results give a value of 0.705, which is an enormous discrepancy. One of the major differences between the AUC estimates for HyperSUMO and seeSUMO is the validation dataset used: our method used all of the 8318 training sequences, while Teng et al. (2012) used a separate set with a total of 1338 sequences, of which 48 were SUMOylated. There may be something inherently different between these datasets that accounts for the discrepancy in estimated AUC values, where the larger dataset gives a lower value and the smaller a higher value. Factors that could contribute to this difference include a different ratio of type I and II sites or different accuracy of the data resulting from the methods used to identify the SUMO sites. The smaller dataset used for evaluation by Teng et al. (2012) was curated from publications after January 2010, whereas the data used in this work was from before this date.

[Figure 3.24 image: "SUMO site predictor performance". x-axis: false positive rate (0.0 to 1.0); y-axis: sensitivity (0.0 to 1.0); legend (Model): HyperSUMO type I, HyperSUMO type II, SUMOsp type I, SUMOsp type II, seeSUMO.]

Figure 3.24: ROC curves of the performance of various SUMO site predictors. The published predictors SUMOsp and seeSUMO have comparable performance. HyperSUMO greatly outperforms the other two published predictors for both type I and type II sites, with type I prediction performing markedly better. To calculate performance for HyperSUMO, FPR and TPR were calculated by OOB estimation using the full training dataset. For the other two models, the training data were queried against the predictors with the threshold set to 0 so that a score was generated for every lysine. SUMOsp and seeSUMO were trained with the same datasets but cross validation was not possible due to technical restrictions; therefore the ROC curves generated for seeSUMO and SUMOsp overestimate their performance. Despite this, HyperSUMO outperforms these predictors.
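The FPR and TPR values behind a ROC curve such as Figure 3.24 can be derived from per-lysine scores by sweeping a threshold from the highest score to the lowest. The sketch below shows one way to do this; the function name `roc_points` is hypothetical and the code is an illustration rather than the procedure used to draw the figure.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points from the strictest threshold to the loosest.

    labels: 1 for a SUMOylated lysine, anything else for a negative.
    scores: predictor output, where higher means more likely SUMOylated.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are needed to draw a ROC curve")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Lower the threshold past every sequence sharing this score.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points
```

Sequences with identical scores are grouped into a single step, so tied scores produce one diagonal segment instead of an arbitrary ordering.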

In document An investigation into the role and mechanism of action of small ubiquitin like modifier interacting motifs in Arabidopsis thaliana proteins (Page 106-110)