
5.2 The Domain Prediction Pipeline

5.2.2 Step 1: Finding Potential Domain Boundaries

Before describing the algorithms in detail, we define several sets that make the domain prediction process easier to follow:

Centers: The set of all centers of coil regions on the target sequence with respect to the predicted secondary structure, with the exception of the leading and trailing coils. For those, we include the first and the last position of the target sequence instead.

Domains: The template database of known domains used by our algorithm, namely the ASTRAL 95 after exclusion of genetic domains (i.e. domains which span different chains).

Images: This set collects the highest-scoring representatives of the templates stored in the set Domains as found by SSEA (see below for details).

Figure 5.2: Histogram of domain lengths observed in the ASTRAL distribution (x-axis: domain length; y-axis: number of domains). The histogram was cut at length 600 for better readability, though some domains in the ASTRAL 1.65 distribution are longer than 600. The vertical line shows the mean length of the whole distribution.

Regions: This set contains all potential domain regions on the target sequence.

The first step of our method deals with finding positions on the target sequence where boundaries between domains may be located. We regard all centers of predicted coil regions on the target sequence t as potential boundaries. Since the number of boundaries may affect the complexity of the method quadratically (see Step 2), we employ a heuristic to select only a reasonably small number of these centers for further evaluation as described in Algorithm 2.
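The construction of Centers and Regions can be illustrated with a minimal Python sketch. It assumes a three-state secondary structure string in which coil is written as "C", 0-based positions, and a coil center rounded down to the nearest position; the function names are hypothetical and not part of the original method.

```python
from itertools import combinations
from typing import List, Tuple


def coil_centers(ss: str, coil: str = "C") -> List[int]:
    """Boundary candidates for a predicted secondary structure string.

    Interior coil runs contribute their center position; the leading and
    trailing coils are replaced by the first and last sequence position.
    """
    n = len(ss)
    if n == 0:
        return []
    centers = {0, n - 1}  # ends of the target sequence
    i = 0
    while i < n:
        if ss[i] == coil:
            j = i
            while j + 1 < n and ss[j + 1] == coil:
                j += 1
            # coil run covers [i, j]; skip runs touching either sequence end
            if i > 0 and j < n - 1:
                centers.add((i + j) // 2)  # assumption: round down
            i = j + 1
        else:
            i += 1
    return sorted(centers)


def candidate_regions(centers: List[int]) -> List[Tuple[int, int]]:
    """All pairs (ci, cj) with ci < cj, i.e. the index pairs behind Regions."""
    return list(combinations(centers, 2))


centers = coil_centers("CCHHHHCCCEEEECC")
regions = candidate_regions(centers)
```

With k centers there are k(k-1)/2 candidate regions, which is the quadratic growth that motivates keeping the number of boundaries small.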

Finding Domain Images using SSEA

First of all, we collect all centers of predicted coil regions on t together with the start and the end of t in the set Centers. For each template domain sequence d in our template database Domains, we align d against all subsequences rij between coil centers ci, cj ∈ Centers using SSEA.

Definition (Domain Image): We collect the highest-scoring rij for each template domain d as so-called domain images in the set Images.

In other words, all subsequences between coil centers are scanned for secondary structure similarity to the template domains using SSEA. The stored domain images for our template domains will be used to assign a score to each center and then to select potential domain boundaries based on these scores.
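The scan for domain images can be sketched as follows. SSEA itself is not implemented here: score_fn is a placeholder parameter, and toy_score is a deliberately simple stand-in (fraction of identical positions) used only to make the example runnable. All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Region = Tuple[int, int]


def find_domain_images(
    templates: Dict[str, str],
    target_ss: str,
    regions: List[Region],
    score_fn: Callable[[str, str], float],
) -> Dict[str, Tuple[Region, float]]:
    """For each template, keep the highest-scoring region on the target."""
    images: Dict[str, Tuple[Region, float]] = {}
    for name, template_ss in templates.items():
        best = None
        for ci, cj in regions:
            # region rij = t[ci..cj], both ends inclusive
            s = score_fn(template_ss, target_ss[ci:cj + 1])
            if best is None or s > best[1]:
                best = ((ci, cj), s)
        if best is not None:
            images[name] = best
    return images


def toy_score(a: str, b: str) -> float:
    """Stand-in for SSEA (NOT the real algorithm): identical positions
    divided by the longer of the two lengths."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


target = "CCHHHHCCCEEEECC"
regions = [(0, 7), (0, 14), (7, 14)]
templates = {"helix": "CCHHHHCC", "sheet": "CEEEECC"}
images = find_domain_images(templates, target, regions, toy_score)
```

In this toy setting the helical template maps to the first half of the target and the strand template to the second half; a real run would pass an SSEA scorer instead of toy_score.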

Length Filtering of Templates

Instead of aligning each template domain against all possible subsequences, we apply a simple length filter that selects only subsequences of similar length.

Definition (Length Filter): For a subsequence on the target to be evaluated as a candidate domain image for a domain template d, its length may differ from |d| by at most 5%. In the following, we write this property as |d| ≈ |rij|.

As we directly make use of ASTRAL domains as templates, we chose the threshold of 5% based on a simple evaluation on all ASTRAL domain sequences of version 1.65: The distribution of the lengths of these domains is shown in Figure 5.2. Our analysis shows a mean length of 188 with a large standard deviation of 118. We further computed the mean coil length at either end of a domain according to DSSP [Kabsch and Sander, 1983] applied to the coordinate files provided by ASTRAL, which was found to be about 4.5 amino acids.

Using a threshold of 5%, for a potential region of length 188 (the mean domain length), we allow templates to differ by the average coil length at either end, i.e. by 9 amino acids in total. In addition, because the threshold scales with length, we assume that the possible length differences between homologous domains increase with domain length.
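As a small illustration of the length filter |d| ≈ |rij|, the following sketch checks the 5% criterion relative to the template length |d|, using integer arithmetic to avoid rounding issues. The function name and the percentage parameter are hypothetical.

```python
def similar_length(template_len: int, region_len: int, tolerance_percent: int = 5) -> bool:
    """True if the region length differs from the template length |d|
    by at most tolerance_percent percent of |d|."""
    return abs(region_len - template_len) * 100 <= tolerance_percent * template_len


# For the mean domain length of 188, 5% corresponds to 9.4 positions,
# so differences of up to 9 amino acids pass the filter.
accepted = [n for n in range(170, 210) if similar_length(188, n)]
```

Here accepted covers the region lengths 179 to 197, i.e. 188 plus or minus 9.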

Significance Filtering of Domain Images

For filtering out unlikely domain images, we compare the SSEA score of a hit smax(d) against a threshold sthresh(d) derived from the all-against-all SSEA alignment score distribution of the fold class the template belongs to. These distributions were computed for each fold class by aligning all members against each other, based on ASTRAL 95. Only hits having a score higher than the mean of the corresponding distribution are accepted and thus added to the set of domain images (Images). For classes having only one member, we use the mean of all computed means as threshold.
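The threshold rule can be sketched in Python as below. The input layout (a mapping from fold class to its list of all-against-all scores, with an empty list for single-member classes) and the fold class labels are assumptions made for illustration, and the sketch presumes that at least one class has more than one member.

```python
from statistics import mean
from typing import Dict, List


def class_thresholds(class_scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Per-class threshold: mean of the class's all-against-all scores.

    Classes without any pairwise score (single member) fall back to the
    mean of all computed class means.
    """
    means = {c: mean(s) for c, s in class_scores.items() if s}
    fallback = mean(means.values())  # raises if no class has any scores
    return {c: means.get(c, fallback) for c in class_scores}


def is_significant(score: float, fold_class: str, thresholds: Dict[str, float]) -> bool:
    """Accept a hit only if it scores strictly above its class threshold."""
    return score > thresholds[fold_class]


# Hypothetical score distributions for three fold classes.
scores = {"a.1": [0.4, 0.6], "b.1": [0.7, 0.9], "c.1": []}
thresholds = class_thresholds(scores)
```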

Accumulative Boundary Scores

Now, given the set of domain images for our template domains, we can derive a score for each coil center, which will then enable us to select only the few most probable ones as potential domain boundaries for the next stages. In particular, the score of each of the 100 top-scoring accepted domain images is added to the corresponding coil centers, i.e. the score of a coil center ci is the sum of the scores of all adjacent domain images in this set.

Definition (Potential Boundaries): For the next stages, we then select the ends of the target sequence as well as the 4 top-scoring coil centers with respect to this accumulative score as domain boundaries.

The number of boundaries was determined in the parameter calibration process described in section 5.2.1.
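The accumulative scoring and the final selection can be sketched as follows. The sketch assumes that the two sequence ends are always selected and that the top-scoring centers are chosen among the remaining ones; the source does not state how ties are broken, so none are handled specially. Function and parameter names are hypothetical.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Tuple

Image = Tuple[Tuple[int, int], float]  # ((ci, cj), score)


def select_boundaries(
    images: List[Image],
    seq_len: int,
    top_images: int = 100,
    n_boundaries: int = 4,
) -> List[int]:
    """Sum image scores onto their flanking coil centers and return the
    sequence ends plus the n_boundaries best-scoring interior centers."""
    best = heapq.nlargest(top_images, images, key=lambda im: im[1])
    center_score: Dict[int, float] = defaultdict(float)
    for (ci, cj), s in best:
        center_score[ci] += s
        center_score[cj] += s
    ends = {0, seq_len - 1}
    # assumption: the ends are kept unconditionally and do not compete
    interior = [c for c in center_score if c not in ends]
    top = heapq.nlargest(n_boundaries, interior, key=lambda c: center_score[c])
    return sorted(ends | set(top))


# Toy example with smaller limits than the 100 images and 4 centers in the text.
images = [((0, 30), 0.9), ((30, 80), 0.8), ((30, 99), 0.7), ((50, 80), 0.6), ((0, 50), 0.2)]
boundaries = select_boundaries(images, seq_len=100, top_images=4, n_boundaries=2)
```

In the toy example, center 30 is adjacent to three of the four retained images and therefore receives the highest accumulated score.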

Algorithm 2 Domain Boundary Search (Step 1)

// initialization
Centers ← centers of coil regions predicted on target t
Regions ← {rij = t[ci..cj] | ci, cj ∈ Centers ∧ ci < cj}
Images ← {}
Domains ← ASTRAL95

// generation of domain images
for all template domains d ∈ Domains do
    // get highest-scoring region of similar length
    smax(d) ← max over rij ∈ Regions ∧ |rij| ≈ |d| of SSEA(d, rij)

    // significance filtering: score high enough?
    if smax(d) > sthresh(d) then
        add corresponding region rij to Images
        with score(rij) ← smax(d)
    end if
end for

// accumulative scoring of coil centers
∀c ∈ Centers : score(c) ← 0
for the top-scoring rij ∈ Images do
    score(ci) ← score(ci) + score(rij)
    score(cj) ← score(cj) + score(rij)
end for
select the top-scoring coil centers