A combined approach to domain assignment using PSI-BLAST sequence alignment and DomSSEA

5.3.1 Variation of DPS parameters

By varying these parameters, a default set could be chosen that gave a reasonable trade-off between domain number and domain boundary prediction. Choosing parameters can be rather subjective, as the selection depends largely on the intended use of the method. In this case, parameters were chosen to assign domain boundaries as accurately as possible, whilst also retaining a useful domain number prediction accuracy.

5.3.1.1 Domain content prediction

The DPS algorithm was developed in order to assign domain boundaries from sequence comparison using PSI-BLAST local alignments. Three main parameters were varied: the E-value, the Z-score and the length of the smoothing window (section 5.2.3). The analysis was carried out using the data set of single-domain and multi-domain chains (section 5.2.1). Whilst the study was mainly focused upon the assignment of continuous domains, discontinuous domain assignment has also been addressed.

Table 5.1 shows the results for the prediction of the domain content of the test sequences for Z-score cut-offs of 1.0, 1.5 and 2.0, together with E-value cut-offs of 1, 10^-5 and 10^-10. All values were calculated with an initial fixed smoothing window length of 15 residues. For the multi-domain chains, the percentage of sequences correctly predicted to contain more than one domain is shown, i.e. the percentage of true-positives. The true-positive values are also shown separately for continuous and discontinuous multi-domain chains. The number of single-domain chains incorrectly predicted to be composed of more than one domain is shown as the percentage of false-positives. Increasing the Z-score, and therefore the distance by which a peak in the termini-profile must deviate from the mean, decreases the number of correctly assigned multi-domain chains, but with a corresponding decrease in the false-positive prediction of single-domain chains as multi-domain. Furthermore, it can be seen from Table 5.1 that decreasing the E-value, and therefore increasing the significance required of a PSI-BLAST alignment hit for inclusion in the termini-profile, has a smaller effect on the true- and false-positive rates of domain content prediction. However, the general trend is for smaller E-values to give fewer false-positives, with fewer true-positives. From these results, a reasonable trade-off between the true- and false-positive rates of domain content prediction for this study is given by a Z-score of 1.5.
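The roles of the two cut-offs can be illustrated with a minimal sketch. It is not the DPS implementation (which is described in section 5.2.3): the hit tuple layout, the function names and the use of a population standard deviation are assumptions made for illustration, and smoothing and any special handling of chain termini are omitted.

```python
from statistics import mean, pstdev


def termini_profile(seq_len, hits, max_evalue):
    """Count alignment termini per residue for hits passing the E-value cut-off.

    `hits` is a list of (start, end, evalue) tuples with 1-based positions;
    this layout is hypothetical, chosen only for the sketch.
    """
    profile = [0.0] * seq_len
    for start, end, evalue in hits:
        if evalue <= max_evalue:
            profile[start - 1] += 1
            profile[end - 1] += 1
    return profile


def positions_above_z(profile, z_cutoff):
    """Return 1-based positions whose value lies at least z_cutoff
    standard deviations above the profile mean."""
    mu = mean(profile)
    sigma = pstdev(profile)
    if sigma == 0:
        return []  # flat profile: no position stands out
    return [i + 1 for i, v in enumerate(profile) if (v - mu) / sigma >= z_cutoff]


# Toy example: the hit with E-value 0.5 is excluded by a 0.01 cut-off.
hits = [(3, 12, 1e-20), (3, 12, 1e-8), (3, 12, 0.5), (13, 25, 1e-3)]
profile = termini_profile(30, hits, max_evalue=0.01)
lenient = positions_above_z(profile, z_cutoff=1.0)
strict = positions_above_z(profile, z_cutoff=2.0)
```

In this toy case the stricter Z-score keeps only the positions supported by two hits, which mirrors the trend described above: a higher cut-off yields fewer candidate positions, and so fewer true- and false-positive multi-domain calls.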

                   % predicted as multi-domain
Z-score  E-value   single domain  all multi-domain  continuous multi-domain  discontinuous multi-domain
                   (FPs)          (TPs)             (TPs)                    (TPs)
1.0      1         23.8           55.3              62.0                     38.7
1.0      10^-5     21.1           52.6              57.4                     38.7
1.0      10^-10    19.5           49.3              54.0                     37.7
1.5      1         14.6           45.8              50.2                     34.9
1.5      10^-5     13.6           43.1              48.7                     29.2
1.5      10^-10    12.5           40.7              45.6                     28.3
2.0      1         11.7           37.7              43.0                     24.5
2.0      10^-5     10.0           34.7              39.5                     22.6
2.0      10^-10     8.9           31.7              36.9                     18.9

Table 5.1 Prediction of domain content by DPS varying Z-score and E-value cut-offs

Domain content was predicted using DPS with varying Z-score and E-value cut-offs. Predictions were made for 369 multi-domain chains (263 continuous domain chains, 106 containing one or more discontinuous domains). Results are shown as percentage correct multi-domain predictions (TP = true positives), and percentage incorrect multi-domain predictions (FP = false positives), i.e. single-domain chains predicted to be multi-domain.
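As an illustration of how the two percentages in Table 5.1 are defined, the following sketch computes them from pairs of true and predicted domain counts. The function name and input layout are hypothetical, and the example counts are invented for the demonstration rather than taken from the data set.

```python
def content_prediction_rates(chains):
    """Return (TP %, FP %) for domain content prediction.

    `chains` is a list of (true_domain_count, predicted_domain_count) pairs.
    TP %: multi-domain chains predicted to contain more than one domain.
    FP %: single-domain chains predicted to contain more than one domain.
    """
    multi = [pred for true, pred in chains if true > 1]
    single = [pred for true, pred in chains if true == 1]
    if not multi or not single:
        raise ValueError("need at least one multi-domain and one single-domain chain")
    tp = 100.0 * sum(pred > 1 for pred in multi) / len(multi)
    fp = 100.0 * sum(pred > 1 for pred in single) / len(single)
    return tp, fp


# Invented toy input: four multi-domain and four single-domain chains.
toy = [(2, 2), (2, 1), (3, 2), (2, 1), (1, 1), (1, 2), (1, 1), (1, 1)]
tp_rate, fp_rate = content_prediction_rates(toy)
```

Note that under this definition a multi-domain chain counts as a true-positive whenever more than one domain is predicted, even if the predicted number of domains differs from the true number (as for the three-domain chain in the toy input).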

Figure 5.1 considers the effect of a wider range of E-values, from 10^-10 to 1, with a fixed Z-score of 1.5 and a fixed smoothing window size of 15 residues. It can be seen, as for the values in Table 5.1, that the percentage of correctly assigned multi-domain chains increases only slightly over this range, by 5%, with an increase in false-positive multi-domain assignments of 3%. Therefore, an E-value of 0.01 was chosen as a cut-off for use in this analysis, since it gives the highest multi-domain prediction accuracy, whilst retaining a false-positive assignment rate similar to that given by smaller E-value cut-offs.

Correspondingly, Figure 5.2 shows the multi-domain prediction accuracy for an E-value of 0.01 and Z-scores ranging from 1 to 5, with a fixed smoothing window size of 15 residues. Although increasing the Z-score decreases the number of false-positive multi-domain assignments, it also gives a rapid decrease in the number of true-positives. A Z-score of 1.5 appears to give a reasonable trade-off between prediction rate and reliability, giving a high true-positive prediction rate whilst assigning domain boundaries to as few single-domain chains as possible.

Finally, in order to assess the effect of different window sizes on prediction accuracy, smoothing window sizes between 7 and 19 residues were used, in steps of 2 residues, as the window length must be an odd number to allow a central residue (Figure 5.3). Figure 5.3 shows domain content prediction for the different smoothing window lengths with a fixed Z-score of 1.5 and E-value of 0.01. A length of 15 was chosen as it gives few false-positive multi-domain predictions without decreasing the number of correct assignments to too large an extent.
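The requirement for an odd window length can be seen in a short sketch of centred smoothing. This assumes a simple unweighted moving average that is truncated at the ends of the chain; the smoothing actually used by DPS is defined in section 5.2.3 and may differ, and the names below are hypothetical.

```python
def smooth(profile, window):
    """Centred moving average over an odd-length window.

    An odd window gives the same number of residues on either side of the
    central residue. Near the chain ends the window is truncated
    (an assumption made for this sketch).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window length must be a positive odd number")
    half = window // 2
    n = len(profile)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        smoothed.append(sum(profile[lo:hi]) / (hi - lo))
    return smoothed


# Window lengths examined in the text: 7, 9, ..., 19 residues.
WINDOW_SIZES = list(range(7, 20, 2))

# A single sharp peak is spread over the neighbouring residues.
example = smooth([0, 0, 3, 0, 0], 3)
```

A longer window spreads each peak over more residues and lowers its height, which is consistent with window length affecting how many positions exceed a fixed Z-score cut-off.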

5.3.1.2 Domain boundary prediction

So far, the parameters have been optimised for domain content prediction. For a perfect domain assignment method, in all cases where a correct multi-domain assignment is made, the corresponding boundary predictions will be true domain boundaries (in this case as assigned structurally by CATH). However, as has been previously found, structural domain boundaries are not always equivalent to those found in sequence (Marchler-Bauer et al., 2002). Table 5.2 shows the effect that different Z-score cut-offs have on the success of domain boundary assignment for multi-domain chains. Shown separately are the results for continuous and chains

[Figure: percentage (vertical axis) plotted against E-value cut-off from 10^-10 to 1.0 (horizontal axis)]

In document Analysis and prediction of protein domains (Page 138-142)