CHAPTER 4 PATTERN ANALYSIS ON RNA STRUCTURES
4.2 EXPERIMENTS ON RNA STRUCTURES
4.2.3 Data Preprocessing
An RNA structure is divided into structure segments by moving a sliding window along the backbone of the RNA sequence, one base at a time. The sequence inside the window is called a subsequence, and the structure inside the window is called a structure segment. These structure segments are the input to pattern analysis with clustering algorithms. Each structure segment in a sliding window is described using a distance matrix. The stability of RNA three dimensional structures is determined by the base pairing interactions of nucleic acids. Base pairing is therefore one of the most important pieces of information about RNA three dimensional structures and is indispensable for their pattern analysis. The distance matrix and the classification of RNA base pairs are defined as follows.
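The windowing step described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function name `sliding_windows` is hypothetical, and it assumes that each nucleotide is represented by a single backbone coordinate (which atom is used is not specified here).

```python
from typing import Iterator, List, Sequence, Tuple

Coord = Tuple[float, float, float]


def sliding_windows(sequence: str, coords: Sequence[Coord],
                    w: int) -> Iterator[Tuple[int, str, List[Coord]]]:
    """Yield (start, subsequence, structure_segment) for each window of size w.

    Assumption: coords[i] is one representative backbone coordinate
    for nucleotide i of the sequence.
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if w < 1:
        raise ValueError("window size must be at least 1")
    # The window advances one base at a time along the backbone.
    for start in range(len(sequence) - w + 1):
        end = start + w
        yield start, sequence[start:end], list(coords[start:end])


if __name__ == "__main__":
    demo_coords = [(float(i), 0.0, 0.0) for i in range(5)]
    for start, subseq, segment in sliding_windows("GGAUC", demo_coords, 3):
        print(start, subseq, segment)
```

A sequence of length n yields n - w + 1 windows, so a five-base sequence with w = 3 produces three structure segments.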
4.2.3.1 Distance Matrix
In this work, structural clustering groups similar structure segments obtained from RNA three dimensional structures into clusters. The key question is how to evaluate the similarity of structure segments. Since structure segments are three dimensional, a structure segment can be viewed as a three dimensional graph, and various algorithms are available for three dimensional graph comparison. However, three dimensional graph comparison is computationally expensive and unsuitable for efficient processing of large datasets. A structure segment can also be converted into a tree or a two dimensional graph, and much research effort has gone into tree comparison and graph similarity evaluation. However, some information is lost when a three dimensional structure is converted into a tree or a two dimensional graph, and tree-based algorithms are still complex and computationally expensive. In this work, we prefer a descriptor that represents the structure segment without adding much overhead to the clustering. We find that the distance matrix is a suitable descriptor that meets these requirements. It records the Euclidean distance between any two nucleic acids in a sliding window. The distance matrix is defined as follows:
Let the subsequence in a sliding window be S = X1X2…Xw. The coordinates of the RNA backbone are known. The distance matrix (DM) is

$$
DM_{ij} =
\begin{cases}
d_{ij}, & \text{if } i < j \le w \\
0, & \text{if } i \ge j
\end{cases}
\tag{4.1}
$$

where $d_{ij}$ is the Euclidean distance between nucleic acids i and j, and w is the sliding window size.
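As an illustration of Equation (4.1), the sketch below builds the distance matrix of one structure segment. The function name `distance_matrix` is hypothetical, and the code assumes one three dimensional coordinate per nucleic acid. Indices are 0-based in the code, whereas the equation counts from 1.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def distance_matrix(segment: Sequence[Coord]) -> List[List[float]]:
    """Return the w x w matrix DM of Equation (4.1).

    DM[i][j] is the Euclidean distance d_ij when i < j and 0 otherwise.
    """
    w = len(segment)
    dm = [[0.0] * w for _ in range(w)]
    for i in range(w):
        for j in range(i + 1, w):
            # Only the strict upper triangle is filled; the rest stays 0.
            dm[i][j] = math.dist(segment[i], segment[j])
    return dm


if __name__ == "__main__":
    demo = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
    for row in distance_matrix(demo):
        print(row)
```

Filling only the upper triangle avoids storing each distance twice, since the distance from i to j equals the distance from j to i.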
The DM can be further simplified into a vector which only contains $d_{ij}$ where $i < j \le w$. Consequently, a structure segment is finally described as a vector as follows:

$$
V = \{\, d_{ij} \mid 0 < i < j \le w \,\}
\tag{4.2}
$$
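The flattening in Equation (4.2) can be sketched as below. The name `dm_to_vector` is hypothetical, and the row-by-row ordering of the entries is an assumption, since the definition fixes only which entries are kept and not their order.

```python
from typing import List, Sequence


def dm_to_vector(dm: Sequence[Sequence[float]]) -> List[float]:
    """Collect the entries d_ij with i < j into a flat vector.

    Assumption: entries are taken row by row (i ascending, then j ascending).
    """
    w = len(dm)
    return [dm[i][j] for i in range(w) for j in range(i + 1, w)]


if __name__ == "__main__":
    demo_dm = [
        [0.0, 5.0, 13.0],
        [0.0, 0.0, 12.0],
        [0.0, 0.0, 0.0],
    ]
    print(dm_to_vector(demo_dm))
```

The resulting vector has w(w - 1)/2 entries, so its length grows quadratically with the window size: 3 entries for w = 3 and 190 entries for w = 20.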
4.2.3.2 RNA Base Pair Classification
The base pair information of each structure segment must be considered in pattern analysis, mainly because an RNA sequence folds on itself to form secondary structures through base pairing interactions. RNA secondary structure is an important intermediate step in the formation of a functional three dimensional structure. Consequently, base pair information should be taken into account in the structure pattern analysis of RNA three dimensional structures.
The RNA structures available to date show a great diversity of base pairing interactions (Batey et al., 1999; Nagaswamy et al., 2002). Leontis and Westhof (Leontis and Westhof, 2001) gave criteria for the classification of base pairs. This classification has been adopted by NDB (Berman et al., 1993) and is widely used in the literature. It is based on two major requirements: (a) the planar edge-to-edge hydrogen bond interactions between two bases involve one of the three distinct edges: the Watson-Crick edge, the Hoogsteen edge, and the Sugar edge (see Figure 4.2); (b) a base pair has at least two hydrogen bonds. The relative orientation of the two bases is also considered in the classification, as trans or cis. A line is drawn parallel to and between the two connecting H-bonds. The relative orientation of the two bases is called trans if the glycosidic bonds of the interacting nucleotides lie on opposite sides of the line. Otherwise it is called cis. This is illustrated in Figure 4.2.
Figure 4.2. Base pair classification. (a) Hydrogen bond edges in RNA bases. (b) Cis versus trans orientation of glycosidic bonds. The three edges are Watson-Crick, Hoogsteen and Sugar (Leontis and Westhof, 2001).
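The cis and trans rule described above can be sketched geometrically. The following is a simplified two dimensional illustration under several assumptions that are not part of the source: the atoms are already projected onto the base pair plane, each H-bond is given as a pair of points, each glycosidic bond is represented by a single point, and the name `classify_orientation` is hypothetical.

```python
from typing import Tuple

Point = Tuple[float, float]
Bond = Tuple[Point, Point]


def classify_orientation(hbond1: Bond, hbond2: Bond,
                         glyco1: Point, glyco2: Point) -> str:
    """Return 'trans' if the two glycosidic points lie on opposite sides
    of a line drawn parallel to and between the two H-bonds, else 'cis'.

    Assumption: all points are 2D coordinates in the base pair plane.
    """
    (a1, b1), (a2, b2) = hbond1, hbond2
    d1 = (b1[0] - a1[0], b1[1] - a1[1])
    d2 = (b2[0] - a2[0], b2[1] - a2[1])
    # Make both H-bond directions point the same way before averaging.
    if d1[0] * d2[0] + d1[1] * d2[1] < 0:
        d2 = (-d2[0], -d2[1])
    direction = (d1[0] + d2[0], d1[1] + d2[1])
    # A point between the two H-bonds: the mean of their four end points.
    origin = ((a1[0] + b1[0] + a2[0] + b2[0]) / 4.0,
              (a1[1] + b1[1] + a2[1] + b2[1]) / 4.0)

    def side(p: Point) -> float:
        # Sign of the 2D cross product tells which side of the line p is on.
        return (direction[0] * (p[1] - origin[1])
                - direction[1] * (p[0] - origin[0]))

    return "trans" if side(glyco1) * side(glyco2) < 0 else "cis"


if __name__ == "__main__":
    h1 = ((-1.0, 0.5), (1.0, 0.5))
    h2 = ((-1.0, -0.5), (1.0, -0.5))
    print(classify_orientation(h1, h2, (-3.0, -2.0), (3.0, -2.0)))
    print(classify_orientation(h1, h2, (-3.0, -2.0), (3.0, 2.0)))
```

For real structures the coordinates would first have to be projected onto the base pair plane, a step this sketch leaves out.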
In this dissertation, we use the classification defined by Leontis and Westhof (Leontis and Westhof, 2001) when analyzing RNA structure segments. The classification gives 12 classes of base pairs, distinguished by interaction edges and base pair orientation, as Table 4.2 shows.
Table 4.2. 12 families of base pairs.

Base pair relative orientation    Interaction edges
cis                               Watson-Crick/Watson-Crick
cis                               Watson-Crick/Hoogsteen
cis                               Watson-Crick/Sugar
cis                               Hoogsteen/Hoogsteen
cis                               Hoogsteen/Sugar
cis                               Sugar/Sugar
trans                             Watson-Crick/Watson-Crick
trans                             Watson-Crick/Hoogsteen
trans                             Watson-Crick/Sugar
trans                             Hoogsteen/Hoogsteen
trans                             Hoogsteen/Sugar
trans                             Sugar/Sugar

In Table 4.2, the first column is the relative orientation of the glycosidic bonds of the interacting bases and the second column is the interaction edges.
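The count of 12 families follows from combining the two orientations with the six unordered pairs of the three edges. The short sketch below enumerates them; the names `EDGES`, `ORIENTATIONS` and `base_pair_families` are chosen here for illustration only.

```python
from itertools import combinations_with_replacement, product
from typing import List, Tuple

EDGES = ("Watson-Crick", "Hoogsteen", "Sugar")
ORIENTATIONS = ("cis", "trans")


def base_pair_families() -> List[Tuple[str, str]]:
    """Enumerate (orientation, 'edge/edge') for all 12 families."""
    # Unordered pairs of edges, with an edge allowed to pair with itself.
    edge_pairs = list(combinations_with_replacement(EDGES, 2))
    return [(orientation, f"{first}/{second}")
            for orientation, (first, second) in product(ORIENTATIONS, edge_pairs)]


if __name__ == "__main__":
    for orientation, edges in base_pair_families():
        print(f"{orientation:<6}{edges}")
```

Three edges give 3 + 2 + 1 = 6 unordered edge pairs, and each occurs in cis and in trans orientation, which yields 2 x 6 = 12 families.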
4.2.4 Experiment Setup
The major parameters of the improved K-means algorithm and the criteria for determining a suitable number of clusters were introduced in section 3.6.2.3. In this experiment, the same criteria are used to identify a suitable number of clusters. The number of clusters (K) is varied from 2 to 50 to observe how the mean similarity changes with K. For each K, 20 iterations of the similarity calculation are performed, and the mean of the 20 similarities is computed; this mean indicates the stability of the clustering under cluster number K. In this experiment, F, the fraction of the dataset used for sampling, is 0.7. Window sizes from 3 to 20 are tested.
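The parameter sweep described above can be outlined as follows. This is only a sketch of the loop structure. The improved K-means algorithm and its similarity measure are defined in section 3.6.2.3 and are not reproduced here, so they are represented by a hypothetical callback `cluster_similarity(sample, k)`. The function name `stability_scan` and the fixed random seed are likewise assumptions.

```python
import random
from statistics import mean
from typing import Callable, Dict, Iterable, List, Sequence


def stability_scan(data: Sequence,
                   cluster_similarity: Callable[[List, int], float],
                   ks: Iterable[int] = range(2, 51),
                   runs: int = 20,
                   fraction: float = 0.7,
                   seed: int = 0) -> Dict[int, float]:
    """Return the mean similarity over `runs` sampled runs for each K.

    cluster_similarity is a placeholder (assumption) for clustering a
    sample with K clusters and returning one similarity score.
    """
    rng = random.Random(seed)
    # F = 0.7: each run draws 70% of the dataset without replacement.
    sample_size = max(1, round(fraction * len(data)))
    mean_similarity = {}
    for k in ks:
        scores = []
        for _ in range(runs):
            sample = rng.sample(list(data), sample_size)
            scores.append(cluster_similarity(sample, k))
        mean_similarity[k] = mean(scores)
    return mean_similarity


if __name__ == "__main__":
    # Dummy scorer used only to show the call shape; it does no clustering.
    result = stability_scan(list(range(100)), lambda sample, k: 1.0 / k)
    print(result[2], result[50])
```

To cover the tested window sizes, the same scan would be repeated once per window size from 3 to 20, each time on the segment vectors produced with that window size.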