
3.3 Materials and methods

3.3.7 Random forest predictors

Training data collected for SIMs and SUMO sites were used to train random forest predictors. For each feature, the data were divided into multiple subsets and a separate random forest model was trained on each subset. The R implementation of the random forest algorithm by Breiman (2001) was used (Liaw & Wiener, 2002).

3.3.7.1 Data treatment

The processed SIM data were split into six training sets, by species (Arabidopsis and human) and by the three SIM types, A, B and R. The synthetic SIM sequences were sampled from a limited set of peptides whose amino acid distributions differ from those of natural peptide sequences. To prevent the models being biased by this, random sequences with natural amino acid frequencies were added to the training sets as negative data. For each SIM type, random sequences were generated by using natural amino acid frequencies as probabilities to randomly sample each amino acid in a 13-mer sequence. The number of randomly generated sequences added to each SIM type set was equal to the number of non-interacting peptides in that set. For the B type SIM random sequences, positions 5 and 6, corresponding to the highly conserved DL motif, were constrained to these amino acids, because sequences tested with the random forest predictors for this motif would be prescreened to contain the DL motif. If the random negative set were not constrained to DL at positions 5 and 6, the random forest would select these positions as the most important variables, which would lead to poor performance because the important features outside the DL motif would not be prioritised.
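The generation of random negative sequences described above can be sketched as follows. This is a minimal illustration rather than the code used in this work: the function name `random_peptides`, its parameters and the uniform placeholder frequencies are assumptions, and real natural amino acid frequencies would be supplied in place of the placeholder.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_peptides(n, freqs, length=13, fixed=None, seed=None):
    """Sample n peptides; `fixed` maps 1-based positions to forced residues."""
    rng = random.Random(seed)
    residues = list(freqs)
    weights = [freqs[aa] for aa in residues]
    fixed = fixed or {}
    peptides = []
    for _ in range(n):
        # Draw each position independently with the supplied frequencies
        seq = rng.choices(residues, weights=weights, k=length)
        # Overwrite constrained positions (e.g. the DL motif of type B SIMs)
        for pos, aa in fixed.items():
            seq[pos - 1] = aa
        peptides.append("".join(seq))
    return peptides

# Placeholder only: uniform frequencies stand in for natural frequencies
uniform = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
type_b_negatives = random_peptides(5, uniform, fixed={5: "D", 6: "L"}, seed=1)
```

Passing `fixed=None` gives the unconstrained sequences that would be used for the A and R type sets.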

For the SUMO site data, the positive training data were partitioned into type I and II sites, and all the negative data were added to each set. The SUMO site type was not known and was determined by performing k-means clustering on the positive data, with k = 2. To perform the clustering, a distance matrix was calculated by taking three amino acids downstream and upstream of the central lysine (K±3) and converting these to the numerical values from the first dimension of the PCA of amino acid features. The Euclidean distance was then calculated for each pair of vectors to generate the distance values. To determine the SUMO site type of each cluster, the resulting groups were analysed for conformity to the canonical ΨKx[E/D] motif, with the group conforming most closely to the motif being designated as type I.
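As an illustration of the clustering step, the sketch below runs Lloyd's k-means algorithm with k = 2 on short numeric vectors, using Euclidean distance. It is not the implementation used in this work: the function `two_means`, the deterministic farthest-point initialisation and the toy vectors are assumptions, and the toy values are invented rather than taken from the PCA of amino acid features.

```python
import math

def two_means(vectors, max_iter=100):
    """Lloyd's algorithm with k = 2 and a deterministic farthest-point start."""
    first = vectors[0]
    second = max(vectors, key=lambda v: math.dist(first, v))
    centroids = [list(first), list(second)]
    labels = None
    for _ in range(max_iter):
        # Assign each vector to the nearest centroid (Euclidean distance)
        new_labels = [
            0 if math.dist(v, centroids[0]) <= math.dist(v, centroids[1]) else 1
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of its members
        for c in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Invented 6-dimensional vectors, one value per residue of the K±3 flanks
toy = [
    [0.9, 1.0, 1.1, 0.8, 1.0, 0.9],
    [1.1, 0.9, 1.0, 1.0, 0.9, 1.1],
    [-1.0, -0.9, -1.1, -1.0, -0.8, -1.0],
    [-0.9, -1.1, -1.0, -0.9, -1.0, -1.1],
]
labels, centroids = two_means(toy)
```

The cluster labels are arbitrary, which is why a separate comparison against the ΨKx[E/D] motif is needed to decide which cluster corresponds to type I.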

Next the datasets were prepared for the random forest models by converting the amino acid factors into numeric vectors from the PCA of amino acid indices. Each amino acid in the 13-mer SIM sequences was converted into 5 PCA dimensions resulting in a vector of 65 dimensions for each SIM sequence in all of the SIM subsets. For the SUMO site data, 11-mers consisting of five amino acids downstream and upstream of the central lysine (K±5) were converted into the 3 PCA dimensions, resulting in vectors of 33 dimensions for each SUMO site sequence in each subgroup.
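The conversion of peptides into numeric vectors can be sketched as below. The helper `encode_peptide` and the table `placeholder_pca` are hypothetical; the placeholder maps every residue to zeros purely to show the resulting dimensions, and the real PCA scores of the amino acid indices would be used instead.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_peptide(peptide, pca_table, n_dims):
    """Concatenate the first n_dims PCA values of each residue, in order."""
    vector = []
    for aa in peptide:
        vector.extend(pca_table[aa][:n_dims])
    return vector

# Placeholder table: zeros stand in for the real PCA scores of each residue
placeholder_pca = {aa: [0.0] * 5 for aa in AMINO_ACIDS}

# 13-mer SIM sequence with 5 PCA dimensions per residue
sim_vector = encode_peptide("ACDEDLFGHIKLM", placeholder_pca, n_dims=5)
# 11-mer SUMO site sequence (K at position 6) with 3 PCA dimensions per residue
sumo_vector = encode_peptide("ACDEFKGHILM", placeholder_pca, n_dims=3)
```

With this layout, variable i of the vector belongs to residue position i // n_dims + 1 (using 0-based i), which is convenient when mapping variable importance back to sequence positions.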

3.3.7.2 Building and optimising random forest predictors

A multistep approach was taken to build the random forests for each data subset. Since each sub-dataset contained many times more negative training data than positive (for both the SIM and SUMO site data), subsampling of the data was required to prevent the random forest models optimising for the prediction of the negative data at the expense of positive data, i.e. the random forest models would tend to a specificity of 100% and a sensitivity of 0% without subsampling. The internal random forest sampling method was used, maintaining the integrity of the OOB error estimation. An approximate sampling ratio was determined by exhaustive searching to find the ratio where the OOB error estimate of positives and negatives was the most similar, which was generally close to a ratio of 1:1.
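The search for the sampling ratio can be sketched as follows. The function `pick_sampling_ratio` is hypothetical, and `oob_class_errors` is an assumed callback that would train a forest at a given ratio (here taken to mean negatives sampled per positive) and return the OOB error of the positive and negative classes. The `toy_errors` function is an invented stand-in so that the example runs without a random forest library.

```python
def pick_sampling_ratio(candidate_ratios, oob_class_errors):
    """Return the ratio whose positive and negative OOB errors are closest."""
    best_ratio, best_gap = None, float("inf")
    for ratio in candidate_ratios:
        err_pos, err_neg = oob_class_errors(ratio)
        gap = abs(err_pos - err_neg)
        if gap < best_gap:
            best_ratio, best_gap = ratio, gap
    return best_ratio

def toy_errors(ratio):
    # Invented behaviour: more negatives per positive worsens the positive error
    err_pos = min(1.0, 0.2 * ratio)
    err_neg = max(0.0, 0.4 - 0.2 * ratio)
    return err_pos, err_neg

ratios = [0.5, 1.0, 2.0, 4.0]
best = pick_sampling_ratio(ratios, toy_errors)
```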

Once the optimal subsampling ratios had been determined, a very large random forest with 10 000 trees was trained on the full dataset for each training subset with the purpose of calculating variable importance. The resulting variable importance, measured in mean decrease in the Gini coefficient, was used to rank the importance of each variable in the input vector.

Next, parameter selection was performed with another exhaustive search. A parameter search with two variables, v and m, was performed with each data subset. The variable v is the number of highest ranked variables to be used and m is the number of variables used at each node in each tree. The range for m was 1 to 10, and for v 1 to the maximum number of variables in the training data, where m ≤ v in all cases. A small random forest of 250 trees was trained and the performance of the random forest was estimated by calculating the OOB estimate of the AUC using the R package ROCR (Sing et al., 2005). The score parameter used in the AUC calculation was the proportion of trees predicting a positive value. This was then repeated 25 times and the mean AUC value with the 95% confidence interval for each combination of m and v was calculated. Algorithm 2 details this process. Ten performance curves were generated for each training data subset, with a specific curve for each m value. The m and v pairs that generated the maximum mean AUC values were then taken as the optimal parameters for each data subset to use to train each random forest predictor.

Algorithm 2 Algorithm for finding optimal random forest parameters. RF is the random forest function; v is the number of variables with a maximum of N_v; m is the number of variables to use at each tree node with a maximum of N_m and T is a matrix of training data.

For each training data subset perform the following to calculate a vector of means and confidence intervals.

for v in 1 to N_v do
    for m in 1 to min(v, N_m) do
        µ_{v,m} = mean AUC of RF(T, v, m)
        CI_{v,m} = 95% confidence interval of RF(T, v, m)
    end
end
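The parameter search can be sketched in Python as below. This is an illustration under stated assumptions, not the original R code: `grid_search` is a hypothetical name, `oob_auc` is an assumed callback that would train a 250-tree forest on the v highest ranked variables with m variables per node and return its OOB AUC, and the confidence interval uses a normal approximation because the text does not state how the interval was computed. The `toy_oob_auc` function is invented so the example is self-contained.

```python
import math
import random
import statistics

def grid_search(T, n_v, n_m, oob_auc, repeats=25):
    """Return {(v, m): (mean AUC, 95% CI half-width)} and the best (v, m)."""
    results = {}
    for v in range(1, n_v + 1):
        for m in range(1, min(v, n_m) + 1):
            aucs = [oob_auc(T, v, m) for _ in range(repeats)]
            mean = statistics.mean(aucs)
            # Normal-approximation interval: an assumption of this sketch
            half_width = 1.96 * statistics.stdev(aucs) / math.sqrt(repeats)
            results[(v, m)] = (mean, half_width)
    best = max(results, key=lambda vm: results[vm][0])
    return results, best

rng = random.Random(0)

def toy_oob_auc(T, v, m):
    # Invented stand-in for training a forest and reading its OOB AUC
    return 0.9 - 0.01 * abs(v - 4) - 0.01 * abs(m - 2) + rng.gauss(0.0, 0.001)

results, best = grid_search(T=None, n_v=6, n_m=3, oob_auc=toy_oob_auc)
```

Plotting the mean AUC against v separately for each m would reproduce the kind of per-m performance curves described in the text.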

The final random forest predictors were then trained with the calculated optimal m value and the v most important variables, adding trees until the OOB estimate of error could not be improved any further, which in all cases was around 2000 trees. The resulting RF models, along with metadata about the sequence features, were encapsulated into a sequence feature object. The metadata included which variables were used, the type of sequence feature, the indices of the core and a search mask. These metadata were used to correctly configure the predictor for each sequence feature. For SIMs the core corresponds to the central hydrophobic patch and in SUMO sites to the central lysine. The mask is a short regular expression that determines which subsequences of a full length protein are tested with the sequence feature predictor. For the SUMO sites, the mask only allowed sequences with a lysine at position 6 within the subsequence, the same position as the SUMOylatable lysine in the training data. For the SIM prediction models the mask matches core features: the SIM type A and R masks match three hydrophobic residues in the 4 residue cores (ΨΨxΨ or ΨxΨΨ), while for SIM type B the mask matches the immutable DL amino acid pair. The masks were used to decide which subsequences to test within a protein sequence, and this method dramatically reduces the number of subsequences tested, reducing the computational time required for the predictor to run. Sequences that do not contain the core features in these masks are very unlikely to be SIMs.
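The use of a search mask to preselect subsequences can be sketched as follows. The function `candidate_windows`, the two regular expressions and the protein sequence are assumptions made for illustration; the text gives the positions of the lysine and the DL pair but not the exact expressions that were used.

```python
import re

def candidate_windows(protein, mask, width):
    """Yield (0-based start, subsequence) for each window fully matching mask."""
    pattern = re.compile(mask)
    for start in range(len(protein) - width + 1):
        window = protein[start:start + width]
        if pattern.fullmatch(window):
            yield start, window

# Assumed mask forms derived from the positions stated in the text
SUMO_MASK = r".{5}K.{5}"    # lysine at position 6 of an 11-mer
SIM_B_MASK = r".{4}DL.{7}"  # DL at positions 5 and 6 of a 13-mer

protein = "MAKTDLSEVKAAGDLPQRKSTEEIVKLM"  # invented sequence
sumo_hits = list(candidate_windows(protein, SUMO_MASK, 11))
sim_b_hits = list(candidate_windows(protein, SIM_B_MASK, 13))
```

Only the windows returned here would be encoded and passed to the random forest, so lysines or DL pairs too close to either end of the protein to fill a complete window are skipped in this sketch.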

3.3.7.3 Quantifying and comparing random forest performance

Once optimal random forest model parameters had been found for each data subset, the performance of those models was assessed by calculating the OOB ROC statistics. RFs were trained 25 times and the mean OOB AUC and 95% confidence intervals were calculated. The SUMO site predictors were compared with the other published predictors, SUMOsp 2 and seeSUMO. The full training set used to build the random forests was queried against both predictors, using their web interfaces. For SUMOsp, the training data were split into type I and II sites, as the output of this model distinguishes between these two types of site. The score thresholds were set to their lowest values so that the predictors would return score values for all training data. The resulting score values and their corresponding interaction values were used to generate ROC curves and calculate the AUC for each predictor.
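For reference, the AUC of a set of scores can be computed directly from its rank interpretation, as in the sketch below. This is not the ROCR implementation; the function `auc` is a hypothetical helper that uses the equivalent definition of the AUC as the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half. The example scores and labels are invented.

```python
def auc(scores, labels):
    """AUC as P(positive score > negative score); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute an AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented scores (e.g. proportion of trees voting positive) and true labels
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
value = auc(scores, labels)
```

The same function applies to scores returned by any predictor, provided every sequence receives a score, which is why the thresholds of the external predictors were set to their lowest values.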

The score used from the random forest models is the proportion of trees giving a positive prediction for a given peptide sequence. These score values do not, however, provide any useful information about the accuracy of the prediction these models give, and using OOB ROC data the false positive rate (FPR) was modelled using an inverse sigmoid for each random forest model. The FPR function is given by

\[
\mathrm{FPR}(\mathrm{score}) = \frac{1}{1 + e^{(\alpha \cdot \mathrm{score} + \beta)}}
\]

with the constants α and β estimated by performing Gauss-Newton non-linear least squares regression on the estimated ROC data. An inverse FPR model was used to calculate score cutoffs for the random forest models. This function is given by

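\[
\mathrm{score}(\mathrm{FPR}) = \frac{\ln\left(\frac{1}{\mathrm{FPR}} - 1\right) - \beta}{\alpha},
\qquad
\mathrm{FPR} =
\begin{cases}
\dfrac{1}{e^{\beta} + 1} & \text{if } \mathrm{FPR} > \dfrac{1}{e^{\beta} + 1}, \\[2ex]
\mathrm{FPR} & \text{if } \dfrac{1}{e^{(\alpha + \beta)} + 1} \le \mathrm{FPR} \le \dfrac{1}{e^{\beta} + 1}, \\[2ex]
\dfrac{1}{e^{(\alpha + \beta)} + 1} & \text{if } \mathrm{FPR} < \dfrac{1}{e^{(\alpha + \beta)} + 1},
\end{cases}
\]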

and uses the same α and β constant values calculated for the FPR model.
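The FPR model, its inverse and the parameter fit can be sketched together as below. This is an illustration, not the original R code. It assumes α > 0, so that the bounds 1/(e^(α+β) + 1) and 1/(e^β + 1) are the FPR values at scores 1 and 0. The function names are hypothetical, and the starting values for the Gauss-Newton iteration are taken from a straight-line fit on the logit scale, which is an assumption because the text does not state how starting values were chosen. The example data are generated from invented parameter values.

```python
import math

def fpr(score, alpha, beta):
    """FPR model from the text: 1 / (1 + exp(alpha * score + beta))."""
    return 1.0 / (1.0 + math.exp(alpha * score + beta))

def score_cutoff(target_fpr, alpha, beta):
    """Inverse model; the target is clamped to the FPR range of scores 0..1."""
    upper = fpr(0.0, alpha, beta)  # 1 / (e^beta + 1)
    lower = fpr(1.0, alpha, beta)  # 1 / (e^(alpha + beta) + 1)
    clamped = min(max(target_fpr, lower), upper)
    return (math.log(1.0 / clamped - 1.0) - beta) / alpha

def logit_start(scores, fprs):
    """Starting values from a line fit of ln(1/FPR - 1) against score."""
    z = [math.log(1.0 / y - 1.0) for y in fprs]
    s_mean, z_mean = sum(scores) / len(scores), sum(z) / len(z)
    sxx = sum((s - s_mean) ** 2 for s in scores)
    sxz = sum((s - s_mean) * (zi - z_mean) for s, zi in zip(scores, z))
    alpha = sxz / sxx
    return alpha, z_mean - alpha * s_mean

def fit_fpr(scores, fprs, iterations=50):
    """Gauss-Newton least squares for alpha and beta (2x2 normal equations)."""
    alpha, beta = logit_start(scores, fprs)
    for _ in range(iterations):
        a11 = a12 = a22 = g1 = g2 = 0.0
        for s, y in zip(scores, fprs):
            f = fpr(s, alpha, beta)
            d_beta = -f * (1.0 - f)   # derivative of the model w.r.t. beta
            d_alpha = d_beta * s      # derivative of the model w.r.t. alpha
            r = y - f                 # residual
            a11 += d_alpha * d_alpha
            a12 += d_alpha * d_beta
            a22 += d_beta * d_beta
            g1 += d_alpha * r
            g2 += d_beta * r
        det = a11 * a22 - a12 * a12
        if abs(det) < 1e-18:
            break
        step_a = (a22 * g1 - a12 * g2) / det
        step_b = (a11 * g2 - a12 * g1) / det
        alpha, beta = alpha + step_a, beta + step_b
        if abs(step_a) + abs(step_b) < 1e-12:
            break
    return alpha, beta

# Invented example: noise-free data generated from known parameters
true_alpha, true_beta = 5.0, -2.0
grid = [i / 10 for i in range(11)]
observed = [fpr(s, true_alpha, true_beta) for s in grid]
alpha_hat, beta_hat = fit_fpr(grid, observed)
cutoff = score_cutoff(0.05, alpha_hat, beta_hat)
```

The clamping in `score_cutoff` mirrors the three cases of the inverse function: a requested FPR outside the range the model can reach for scores between 0 and 1 is replaced by the nearest reachable value, so the returned cutoff always lies between 0 and 1.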