
Predicting structure of a given sequence

If the free energy change of an RNA structure relative to its unfolded form can be calculated from its sequence, then finding the minimum free energy (MFE) structure of an arbitrary sequence becomes a computational problem. In principle, one could calculate the free energy of every possible structure and pick the one with the lowest value, but this approach is computationally very expensive. This problem was solved by Zuker and Stiegler (1981), who showed that the global minimum free energy structure for an RNA sequence could be computed relatively efficiently using a dynamic programming algorithm.

The principle of this algorithm is to divide the sequence into smaller subsequences, starting with all possible 5 nt subsequences, to calculate the energies of these subsequences, and to reuse the results in a dynamic programming scheme that builds up to the full sequence. First, the free energies of all subsequences in all permissible configurations (i.e., for all permissible base pairs: A-U, G-C and G-U) are computed. The first set of subsequences would therefore be nucleotides 1-5, 2-6, 3-7, and so on. The length of 5 nt is chosen because 3 nt is considered to be the smallest possible loop, so a 5 nt subsequence has two possible configurations: 5 unpaired bases, or a structure with 2 paired bases (one base pair) closing a 3 nt loop.

For a longer subsequence that runs between two arbitrary nucleotides i and j, there are two possibilities: either nucleotide j is unpaired or it is paired. If j is unpaired, the free energy of the subsequence is that of the subsequence i to j-1, plus the energy of the 3’ overhang j. As smaller subsequences are calculated before larger ones, the previously calculated value for the subsequence i to j-1 is used and adjusted for the 1 nt overhang. If j is paired, it can be paired to i or to i’, where the order of the nucleotides is i < i’ < j. When j is paired to i, the free energy of the base pair i-j is added to that of the substructure i+1 to j-1, which has already been calculated. When j is paired to i’, the same process is applied to the substructure i’ to j, and the result is added to the free energy of i to i’-1, which has already been computed.

This process is repeated for all subsequences one base longer, and so on until the last subsequence, which is the whole sequence. This yields the free energy of the full sequence, but also that of all the substructures that compose it. The final structure is then built up by tracing back through the matrix of previously calculated values. The process uses experimentally measured free energy values for the small subsequences, while the energies of larger subsequences are obtained by summing the values already calculated for the subsequences they contain.
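The fill-and-trace-back procedure described above can be sketched in a few lines of Python. This is a minimal illustration under strong assumptions, not the Zuker and Stiegler energy model: every permissible base pair is given the same hypothetical energy of -1 unit, unpaired bases (including the 3’ overhang) contribute 0, and stacking and loop energies are ignored. The names `fold`, `PAIR_ENERGY` and `MIN_LOOP` are invented for this example.

```python
# Simplified energy-minimisation sketch (assumption: constant energy per pair,
# no stacking or loop terms; NOT the experimentally derived energy model).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3      # smallest permitted loop, in nucleotides
PAIR_ENERGY = -1  # hypothetical energy of one base pair, arbitrary units


def fold(seq):
    """Return (minimum energy, dot-bracket structure) for an RNA sequence."""
    seq = seq.upper()
    n = len(seq)
    # energy[i][j] = lowest energy of subsequence i..j (0-based, inclusive).
    # Subsequences shorter than 5 nt cannot pair, so they stay at 0.
    energy = [[0] * n for _ in range(n)]

    # Fill shorter subsequences before longer ones.
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            best = energy[i][j - 1]            # case 1: j is unpaired
            for k in range(i, j - MIN_LOOP):   # case 2: j pairs with k (k = i or i < k < j)
                if (seq[k], seq[j]) in PAIRS:
                    left = energy[i][k - 1] if k > i else 0
                    best = min(best, left + PAIR_ENERGY + energy[k + 1][j - 1])
            energy[i][j] = best

    # Trace back through the matrix to recover one optimal set of base pairs.
    structure = ["."] * n
    stack = [(0, n - 1)] if n else []
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_LOOP:
            continue
        if energy[i][j] == energy[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - MIN_LOOP):
            if (seq[k], seq[j]) in PAIRS:
                left = energy[i][k - 1] if k > i else 0
                if energy[i][j] == left + PAIR_ENERGY + energy[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break

    mfe = energy[0][n - 1] if n else 0
    return mfe, "".join(structure)


print(fold("GGGAAACCC"))
```

Both nested loops run over pairs of positions and the inner loop over candidate pairing partners, so the sketch takes time cubic in the sequence length, in contrast to the exhaustive enumeration of all structures.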

The method was adapted to permit the calculation of suboptimal structures within a certain percentage of the MFE structure (Zuker, 1989), and a comparison of these suboptimal structures with the MFE structure allows an assessment of the probability that a given base pair will be present in the real structure. For example, a base pair present in the MFE structure and in all suboptimal structures within 10% of the MFE would be assigned a high score, and a base pair present only in the MFE structure would be assigned a low one. When tested against a variety of RNA sequences up to 800 nt long, models that use this type of dynamic programming algorithm have a sensitivity of about 74% and a positive predictive value (PPV) of 64%; that is, they identify 74% of the base pairs that form in reality, and 64% of the base pairs that they identify really form (Mathews et al., 1999). Generally, the prediction tends to get worse for longer RNAs.
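The idea of scoring a base pair by how consistently it appears in the MFE and suboptimal structures can be illustrated as follows. This is only a sketch: it assumes the score is simply the fraction of structures containing the pair, which is not necessarily the scoring used by Zuker (1989), and the function name `pair_support` and the example structures are hypothetical.

```python
from collections import Counter


def pair_support(structures):
    """Fraction of structures in which each base pair occurs.

    `structures` is a list of structures, each given as an iterable of
    (i, j) base pairs; the MFE structure is treated like any other entry.
    The fraction-based score is an assumption made for illustration.
    """
    counts = Counter()
    for pairs in structures:
        counts.update(set(pairs))  # count each pair at most once per structure
    total = len(structures)
    return {pair: count / total for pair, count in counts.items()}


# Hypothetical MFE structure followed by two suboptimal structures.
mfe = [(0, 8), (1, 7)]
suboptimal = [[(0, 8)], [(0, 8), (2, 6)]]
support = pair_support([mfe] + suboptimal)
print(support)
```

In this toy input the pair (0, 8) occurs in every structure and receives the highest score, while (1, 7) occurs only in the MFE structure and receives a low one.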

$$\text{Sensitivity} = \frac{\text{True positive pairs}}{\text{True positive pairs} + \text{False negative pairs}}$$

$$\text{PPV} = \frac{\text{True positive pairs}}{\text{True positive pairs} + \text{False positive pairs}}$$
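As a small worked example of these two measures, the following sketch compares a predicted set of base pairs with a known set. The function name `sensitivity_and_ppv` and the example pairs are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def sensitivity_and_ppv(predicted, known):
    """Compute sensitivity and PPV from predicted and known base pairs."""
    predicted, known = set(predicted), set(known)
    true_pos = len(predicted & known)    # predicted pairs that really form
    false_neg = len(known - predicted)   # real pairs that were missed
    false_pos = len(predicted - known)   # predicted pairs that do not form
    sensitivity = true_pos / (true_pos + false_neg) if known else 0.0
    ppv = true_pos / (true_pos + false_pos) if predicted else 0.0
    return sensitivity, ppv


# Hypothetical example: 4 known pairs, 3 predicted pairs, 2 of them correct.
known = {(0, 20), (1, 19), (2, 18), (5, 12)}
predicted = {(0, 20), (1, 19), (3, 17)}
print(sensitivity_and_ppv(predicted, known))
```

Here two of the four known pairs are recovered (sensitivity 0.5) and two of the three predicted pairs are correct (PPV of about 0.67).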

The predictions of such models can be strengthened by the use of a partition function that calculates the probability that a base will be paired at equilibrium, taking into account suboptimal structures (McCaskill, 1990; Mathews, 2004). Of the bases that pair with a probability of 99% in the predicted minimum free energy structure, 91% also exist in the known structures derived from comparative sequence analysis. Overall, such energy minimisation algorithms can predict the structure of short RNAs quite well, and can also provide an assessment of the quality of the prediction. The longer the RNA gets, the less accurate these predictions become.

One important drawback of the dynamic programming algorithm presented above is that it explicitly excludes the presence of pseudoknots in the sequence (Lorenz et al., 2011). A pseudoknot is formed if any base pairs a-b and a’-b’ form such that a<a’<b<b’ in the sequence. Pseudoknots are known to exist in longer RNA sequences and are important in many catalytic RNAs such as ribozymes. Finally, with all its advantages and drawbacks, this method aims to calculate the minimum free energy structure, which would be expected to be the dominant structure at equilibrium. Whether this is a physiologically relevant structure will depend on the kinetics of RNA folding, as well as on interactions with other cellular components.
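The pseudoknot condition a<a’<b<b’ given above can be checked directly on a list of base pairs. The following is a minimal sketch; the function name `has_pseudoknot` is hypothetical, and it tests every pair of base pairs rather than using a faster method.

```python
from itertools import combinations


def has_pseudoknot(pairs):
    """Return True if any two base pairs a-b and a'-b' satisfy a < a' < b < b'."""
    # Store each pair with its 5' position first, then sort by that position
    # so that the first pair of each combination has the smaller 5' position.
    ordered = sorted(tuple(sorted(pair)) for pair in pairs)
    for (a, b), (a2, b2) in combinations(ordered, 2):
        if a < a2 < b < b2:
            return True
    return False


print(has_pseudoknot([(0, 8), (1, 7)]))  # nested pairs
print(has_pseudoknot([(0, 5), (3, 9)]))  # crossing pairs
```

Nested or side-by-side pairs never satisfy the condition, so every structure produced by the dynamic programming recursion described above passes this check as pseudoknot-free.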

1.1 The sequence-structure-function relationship in RNA