A.2 Methods
A.2.1 Developing the Beta Contact Evaluator (BCeval)
The evaluation of beta contacts can be tackled as a prediction problem, similar to contact map prediction. We must (i) define the training data, (ii) find a suitable representation of the input and output data and (iii) set up a proper architecture and learning algorithm(s). All the described modules have been implemented using the GAME framework(8), written in Java 6.0. Following the philosophy of GAME, the experiments and the combination of experts have been set up using the graphical interface. The resulting system is called BCeval.
A.2.1.1 Why not use a Generic Contact Map Predictor?
In principle, it might be possible to use a generic contact predictor to evaluate a contact between beta strands. Various contact predictors have been proposed, such as those by Cheng & Baldi(90) and Tegge et al.(91). However, the accuracy of these tools in not high, owing to the difficulty of predicting all possible contacts occurring in a
A.2 Methods
----AA---AA---BBBB---CCCCC---CCCCC---BBBB--- ----12---12---1234---12345---54321---4321--- QSPVDIDTHTAKYDPSLKPLSVSYDQATSLRILNNGHAFNVEFDDSQDKAVLKGGPLDGT CCCCEECCCCCEECCCCCCEEEECCCCCEEEEEECCCCEEEEECCCCCCCEEEECCCCCC Figure A.3: An example chain indicating the residues in contact. The letters in the first line indicate the beta strand pairs. The numbers in the second line indicate the residues in contact within the same pair. For example, the two residues labelled B1 form a contact.
protein (including between α-helices). Even more specific predictors specialized in beta contacts report accuracies below 50%(184).
Fortunately, rather than the general problem of predicting contacts between β- strands, we already know which strands are in contact and we can concentrate on small shifts around a given position. Thus, we developed a new system specialized in recognizing a contact from the ‘shifted’ versions that could be identified from an alignment procedure.
A.2.1.2 Data Representation
The beta-pairs must be represented in a fixed-length vector in order to obtain an input suitable for a neural network. The main requirement for the input vector is to contain the residues of the two strands involved in the pairing. Additionally, the shifted versions of the same pair must be clearly recognizable.
Figure A.3 shows an example fragment of protein chain, in which the contacting residues belonging to different beta strands are indicated. Clearly the length of the β- segments is variable, while a fixed-length vector is needed for the data representation. Choosing to encode a window of N residues, that window would be perfectly suited to strands of length N , while information would necessarily be lost for pairings of longer strands. On the other hand, using larger windows would include residues which are not involved in contacts.
In addition, one must account for correct pairing of residues in both parallel and anti-parallel strands. For instance, taking a window of four residues along the anti- parallel B strands in Figure A.3, the data representation must indicate that the leucine at the first position in the first strand is in contact with the glycine in the last po- sition of the second strand and not the valine in the first position. The different
hydrogen-bonding patterns observed in parallel and anti-parallel sheets also result in different propensities in the contacts between residues, as shown by Fooks et al.(180) and Hutchinson et al.(183). For these reasons, a ‘mixture of experts’ approach has been adopted: one expert deals with only strands of one type, without any adjustment in the encoding phase.
Profiles, obtained after three iterations of a PSI-BLAST search on the whole protein, have been used to encode the residues in the window. PSI-BLAST is run against the data repository uniref901, with the inclusion threshold set to 10−3 and the number of iterations set to 3. The default values of the NCBI stand-alone version of blastpgp were used for the remaining parameters.
A simple position-independent coding of the residues was also tried, but profiles lead to better performance.
A.2.1.3 The Architecture
The core of BCeval consists of a mixture of 13 neural networks, each one specialized for a specific length (1,2,3,4,5,6,7+) and type (parallel or anti-parallel) of beta pair- ing. Figure A.4 shows the architecture of BCeval. The 13 experts (bridge contacts of one residue do not distinguish between parallel and anti-parallel) are combined at the first level and constitute a ‘core evaluation module’. The window length includes all (and only) the residues involved in each pairing, such that each neural network has a fixed-length vector as input, representing the residues involved in the contact. Simpler architectures with only one neural network and fixed input length (1, 2, 3) were tried first, but the described architecture improved accuracy. The final output is obtained by averaging three core evaluation modules trained separately.
A.2.1.4 Training data Composition
The composition of the training examples determines the ability of the trained machine learning system to fit the problem. A reference test set, TESTDOM, was built by selecting 10% of the total codes in the CATH database(21) at the homologue level, and taking the corresponding domains. A subset of the domains in TESTDOM, prevalently far homologues, have been used to build a set of pairs of domains to be aligned. The
A.2 Methods Parallel Length
+
ANN ANN ANN 1 2 Y N Parallel ANN ANN 3 Y N Parallel ANN ANN 7+ N 4,5,6 0.1 0.5 … 0.6 0.0 … 0.2 0.1 … 0.0 0.5 … 0.0 0.0 … … 0.2 0.3 … 0.4 0.1 … 0.2 0.1 … 0.7 0.0 … … M E R P Y … C P V E … input profile ENCODE Y beta pair OUTFigure A.4: The BCeval architecture. The guarding functions ensure that only one neural network is activated at a time. The ‘parallel’ guard is able to distinguish between parallel and anti-parallel strand pairs, while the ‘length’ guard dispatches based on the length. In the example, an anti-parallel pair of length 3 is given so activating the path shown in bold. Three core units consisting of independently trained neural networks are averaged to obtain the final evaluation.
resulting set, TESTALIGN, consists of 743 proteins, which have been used to test the alignment algorithms. A set of protein chains, TRAINCH, consisting of protein chains from a dataset with identity < 25%1, selected in order not to include any chain containing the domains in TESTDOM, was used as a starting point to obtain contacts; whole chains were preferred to domains in order to use DSSP(185) outputs from the EBI2 directly. Contacts included in the training of BCeval have been obtained from within the chains of TRAINCH by parsing the DSSP files, which include the position of the contacts between pairing residues in β-strands. Negative examples (i.e. pairings not observed to be in a beta contact) have been obtained by a synthetic sampling around the actual contacts. A Gaussian distribution (µ = 0, σ =√5) around the positive examples was used to perform the sampling. The negative and positive samples are balanced (half
1
http://bio-cluster.iis.sinica.edu.tw/~bioapp/hyprosp2/dataset_8297.txt
positive, half negative), without taking into account the observed distribution. As seen in Figure A.2, the extent of the observed shifts depends greatly on the sequence identity, making it hard to model the observed distribution correctly and, in any case, balanced inputs generally result in better learning. This partial synthetic sampling has been preferred to sampling the real data in order to obtain more, and more varied, samples. In practice, the negative data are randomly generated at each training iteration, so improving the diversity given to the training algorithm.
A.2.1.5 Training Technique and Parameter Setting
Each expert is based on a 3-layer feed-forward neural network, trained with a variant of the back-propagation algorithm, with learning rate = 0.001 (initial value) and momen- tum = 0.1. The learning rate is adjusted between iterations with an inverse-proportion law. The number of hidden neurons is 75 for each neural network. The number of input neurons is 20 · N , where N is the size of the input window. A single output neuron indicates whether the given input is a contact or not. With the goal of improving the ability of the training algorithm to escape from local minima, the training set is ran- domly shuffled at each iteration, so that the learning algorithm is fed with the same set of proteins given in a different order. Furthermore, each protein provides only a subset of its inputs, according to a random choice performed in accordance with a specific parameter, n. In particular, a random value k is extracted in the range [0, n − 1] and the inputs with index k, k + n, k + 2n, . . . are provided to the learning algorithm.
To prevent the training process from stopping with a local oscillation of accuracy (evaluated on a validation set consisting of a 10% of TRAINCH, not used in the back- propagation process), weights are recorded when a minimum is encountered on the validation set, but the training continues until the error on the validation set increases for 10 consecutive iterations.