
3. Material and methods


programming matrix are crossed out. Then, we get the start and end positions of all alignments that scored above the threshold in the prefiltering. For each alignment we calculate the diagonals that bound the alignment at the bottom and at the top and add an additional shading space to these diagonals (default is 200, see blue lines in figure 3.11). These diagonals and the shading space left and right of the alignment define the region in which the cells of the matrix are activated. For average proteins with a length between 200 and 400 amino acids, nearly the full matrix is activated. But especially for long proteins, which result in long runtimes, the search space of this time-consuming comparison can be reduced dramatically. A similar tube shading is used in the realign step, where all good matches are realigned with the more accurate MAC (Maximum ACcuracy) algorithm. Here, the activated cells in the dynamic programming matrix are defined by the previously calculated Viterbi alignments.
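The tube shading can be illustrated with a minimal sketch. It is not the HHblits implementation: the function name `activated_cells`, the tuple layout of an alignment, and the simplification that the bounding diagonals are those through the alignment's start and end positions are all assumptions made for illustration.

```python
def activated_cells(query_len, template_len, alignments, shading=200):
    """Return a boolean matrix; True marks cells the HMM-HMM comparison may use.

    alignments: hypothetical list of (q_start, q_end, t_start, t_end) tuples,
    0-based and inclusive, taken from the prefilter alignments.
    """
    active = [[False] * template_len for _ in range(query_len)]
    for q_start, q_end, t_start, t_end in alignments:
        # A diagonal is identified by d = j - i (template index minus query index).
        # Assumption: the alignment is bounded by the diagonals through its
        # start and end positions.
        d_start = t_start - q_start
        d_end = t_end - q_end
        d_min = min(d_start, d_end) - shading  # widen the band by the shading space
        d_max = max(d_start, d_end) + shading
        for i in range(query_len):
            j_lo = max(0, i + d_min)
            j_hi = min(template_len - 1, i + d_max)
            for j in range(j_lo, j_hi + 1):
                active[i][j] = True
    return active
```

With the default shading of 200, a band around a single alignment already covers most of a matrix for proteins of a few hundred residues, whereas for much longer proteins the band is a small fraction of the matrix.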

3.4.3. Avoid aligning the same templates in multiple iterations

The last additional filter prevents HMMs that were found in one iteration of HHblits with an E-value below the inclusion threshold from being aligned again in later iterations. To this end, the names of all identified matches in one iteration are stored in a list of previous hits. If the same HMM passes the prefilter in a later iteration, it is not aligned again by the HMM-HMM comparison. This reduces the runtime and prevents the alignment from deteriorating when the query profile has become too diverse, i.e., if the database match is closely related to the query sequence and the query profile contains the information of evolutionarily more distantly related sequences. On the other hand, the scores and therefore the E-values of matches found in previous iterations could improve with a better query profile. For that reason, in the last iteration we re-score all matches that were identified in previous iterations but keep their first alignment.
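The bookkeeping behind this filter can be sketched as follows. The function names, the use of a Python set for the list of previous hits, and the default threshold value are illustrative assumptions, not the HHblits code.

```python
def select_for_alignment(prefilter_hits, previous_hits):
    """Split prefilter hits into HMMs to align now and HMMs that are skipped.

    previous_hits: set of HMM names accepted in earlier iterations (hypothetical).
    Skipped HMMs keep their first alignment and are only re-scored
    in the last iteration.
    """
    to_align = [name for name in prefilter_hits if name not in previous_hits]
    skipped = [name for name in prefilter_hits if name in previous_hits]
    return to_align, skipped


def record_hits(previous_hits, aligned, e_values, inclusion_threshold=1e-3):
    """Remember every aligned HMM whose E-value is below the inclusion threshold."""
    for name in aligned:
        if e_values[name] < inclusion_threshold:
            previous_hits.add(name)
```

A set gives constant-time membership tests, which matters when many HMMs pass the prefilter in each iteration.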

3.5. P-value calculation with neural networks

The interpretation of the results of an HMM-HMM comparison depends strongly on the statistics of the similarity scores. The scores are calculated during the Viterbi algorithm, but for a meaningful interpretation of the data, measures of the confidence for homology are needed. Most methods report E-values or P-values for all of their matches. The P-value of a match with score S is defined as the probability of obtaining a chance hit with score ≥ S in a pairwise comparison. To compute it, the score distribution for non-homologous sequences is needed. In the case of HHsearch and HHblits, the score distribution of non-homologous sequences follows an EVD (extreme value distribution) as shown in figure 3.12.
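As a minimal sketch of this definition, the tail probability of an EVD has a closed form if one assumes the standard Gumbel parametrization with a scale parameter λ and a location parameter µ. The function name `evd_p_value` and the overflow guard are illustrative choices, not part of HHblits.

```python
import math


def evd_p_value(score, lam, mu):
    """P(chance score >= score) under a Gumbel-type EVD (assumed parametrization).

    The Gumbel cumulative distribution is exp(-exp(-lam * (s - mu))),
    so the P-value is one minus that quantity.
    """
    x = -lam * (score - mu)
    if x > 700.0:
        # exp(x) would overflow; the tail probability is 1 to machine precision.
        return 1.0
    # -expm1(-t) equals 1 - exp(-t) but stays accurate for very small t,
    # which is the regime of significant (high-scoring) matches.
    return -math.expm1(-math.exp(x))
```

For scores far above µ the result approaches exp(−λ(S − µ)), so the P-value decays exponentially with the score.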

Figure 3.12: (A) The score distribution of non-homologous sequences follows an EVD (extreme value distribution). The P-value of a score S is defined as the area under the curve to the right of this score. (B) Example of the score distribution of non-homologous sequences of an HHblits search. The red line is the fitted EVD probability density function.

If such a score distribution is given, the P-value can be estimated by calculating the area under the curve to the right of S. From this P-value P, an E-value E can be computed. The E-value is defined as the expected number of chance hits with a score ≥ S in a search of a database with $N_{\mathrm{DB}}$ sequences: $E = N_{\mathrm{DB}} P$. In HHblits we have the problem that there are two search steps (the prefiltering and the Viterbi alignment) whose scores are partially correlated, which influences the E-value calculation. We can easily define the E-value for the two extreme cases: (1) If the prefiltering and Viterbi scores are perfectly correlated, the E-value can be calculated by $E = N_{\mathrm{DB}} P$. (2) If the scores are completely uncorrelated, the E-value can be calculated by $E = N_{\mathrm{pre}} P$, where $N_{\mathrm{pre}}$ is the number of HMMs that pass the prefilter. We handle the partially correlated case by introducing an empirical correlation factor $(E_{\mathrm{pre}}/N_{\mathrm{DB}})^{\alpha}$ in the E-value formula:

$$E = N_{\mathrm{DB}}\, P \times \left(\frac{E_{\mathrm{pre}}}{N_{\mathrm{DB}}}\right)^{\alpha} \tag{3.10}$$

$E_{\mathrm{pre}}$ is the E-value threshold of the prefilter. $\alpha$ is a measure of the degree of correlation ($\alpha = 0$: perfect correlation, $\alpha = 1$: no correlation) and is defined as $\alpha = 0.4 + 0.02 \times (N_{\mathrm{eff},T} - 1) \times (1 - 0.1 \times (N_{\mathrm{eff},Q} - 1))$. Here, $N_{\mathrm{eff},Q}$ and $N_{\mathrm{eff},T}$ are the numbers of effective sequences (for the definition see appendix A.2) in the query and template HMMs, respectively. The three coefficients were optimized to yield accurate E-values (see chapter 4.1.5).
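Equation 3.10 and the definition of α translate directly into a few lines of code. The sketch below is only an illustration; the function and parameter names are hypothetical, and no default value for the prefilter threshold is assumed.

```python
def correlation_exponent(neff_query, neff_template):
    """alpha from the text: 0 means perfect correlation, 1 means no correlation."""
    return 0.4 + 0.02 * (neff_template - 1.0) * (1.0 - 0.1 * (neff_query - 1.0))


def hhblits_e_value(p_value, n_db, e_pre, neff_query, neff_template):
    """E-value of Equation 3.10: E = N_DB * P * (E_pre / N_DB) ** alpha."""
    alpha = correlation_exponent(neff_query, neff_template)
    return n_db * p_value * (e_pre / n_db) ** alpha
```

The two extreme cases can be read off the formula: with α = 0 the correction factor is 1 and E = N_DB · P, while with α = 1 the database size cancels and E = E_pre · P.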

The E-value is the most widely used significance measure. Also, some thresholds in HHblits are given as E-values, e.g., the inclusion threshold with a value of $10^{-3}$ means that by chance one non-homologous protein passes this threshold in 1000 HHblits runs. In HHsearch and HHblits, a probability for a homologous relationship is given to the user as an additional significance measure. It includes the secondary structure score and is based on the score distributions for non-homologs and homologs in an all-against-all comparison of the SCOP database:

$$P(\mathrm{pos} \mid S) = \frac{P(S \mid \mathrm{pos})\, P(\mathrm{pos})}{P(S \mid \mathrm{pos})\, P(\mathrm{pos}) + P(S \mid \mathrm{neg})\, P(\mathrm{neg})} \tag{3.11}$$

In the previous version of HHsearch, the parameters of the EVD were estimated for each query in an additional calculation step. For this purpose, an HHsearch run was performed against a calibration database that consists of one HMM for every fold in SCOP; hence, all HMMs in this calibration database should be unrelated. Then, the score distribution of all non-homologous proteins was calculated by excluding the three best-scoring folds. This was done to be sure that only non-homologous matches contributed to the score distribution. The λ and µ parameters of this EVD were determined by maximum likelihood and stored in the query HMM. These parameters were then used in the actual search for the P-value calculation. For HHblits we must find another way to determine the parameters, because an additional calibration search in every search iteration is too slow.

We now determine the parameters λ and µ by a machine learning approach and train two neural networks, one for the calculation of λ and one for the calculation of µ (Sadreyev and Grishin, 2008). A neural network can model complex relationships between the inputs and the output. It is an interconnected network of nodes, usually arranged in three layers: the input nodes, the hidden nodes and the output nodes. Each node has an activation function, which generates the output of the node from the weighted input. In a learning phase, the weights w of all edges in the network are changed to model the given training data (input and corresponding output) with a minimal error.

In our case, we use the same neural network topology for both parameters (see figure 3.13): 4 input nodes k ∈ {1,2,3,4} that are fully connected with 4 hidden nodes l ∈ {5,6,7,8}, and one output node o = 9. As inputs we take the lengths of the query and database HMMs and the diversities of these HMMs. All inputs are normalized to a value between 0 and 1, which results in a slightly better performance of the neural network. In the hidden layer we use a logistic function to generate the output $H_l$ of the hidden node l:

$$H_l = \frac{1}{1 + \exp\left(-\sum_{k=1}^{4} w_{kl} I_k\right)} \tag{3.12}$$

$I_k$ is the input value of the input node k and $w_{kl}$ is the weight of the edge connecting

Figure 3.13: Neural network topology for the EVD parameters λ and µ. The network consists of 4 input nodes, 4 hidden nodes and one output node. Inputs are the lengths of the query and the database HMMs and their diversities. All edges in this network have individual weights, which were trained with the backpropagation learning algorithm.
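A forward pass through the 4-4-1 topology can be sketched as follows. The hidden layer follows Equation 3.12. The linear output node, the weight layout and all function names are assumptions made for illustration, since the activation of the output node is not given in this excerpt; the trained weights are not reproduced here.

```python
import math


def logistic(x):
    """Logistic activation used in the hidden layer (Equation 3.12)."""
    return 1.0 / (1.0 + math.exp(-x))


def normalize(value, lowest, highest):
    """Scale an input to [0, 1]; the bounds are hypothetical training-set limits."""
    return (value - lowest) / (highest - lowest)


def hidden_outputs(inputs, w_hidden):
    """Compute H_l for every hidden node.

    w_hidden[l][k] is the weight of the edge from input node k to hidden node l.
    """
    return [logistic(sum(w * x for w, x in zip(weights, inputs)))
            for weights in w_hidden]


def predict(inputs, w_hidden, w_out):
    """Network output for one EVD parameter (lambda or mu).

    Assumption: the output node is a plain weighted sum of the hidden outputs.
    """
    hidden = hidden_outputs(inputs, w_hidden)
    return sum(w * h for w, h in zip(w_out, hidden))
```

In use, the four inputs would be the normalized lengths and diversities of the query and database HMMs, and one weight set would be kept for λ and another for µ.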