7.3 Features based on absent words
7.3.1 Design
In this section, we study the performance of features based on absent words, i.e. k-mers that are possible over the alphabet but do not occur in the sequences, focusing on minimal absent words (MAW).
We used the algorithm and code described in [145] to compute the set of MAW of each genome sequence. We extract MAW with lengths ranging from 1 to 1,000; this upper bound is much larger than the length of most MAW in a sequence. The benefits of doing so are two-fold. First, we would like to extract as many MAW as possible at a reasonable computational cost. Second, a large maximum length makes it likely that the longest MAW are captured, and these have been shown to discriminate between sequences better than short ones [118].
In addition, we compute the MAW of each sequence with its Reverse Complement (RC) concatenated, and compare these to the MAW computed without it (noRC). The potential benefit of RC is that it accounts for words that occur in the reverse complement strand but are absent from the direct strand. For example, given a sequence ACCGTA, the input sequence for extracting MAW in the noRC setting is the original one, ACCGTA; in the RC setting, the original sequence ACCGTA is concatenated with its reverse complement TACGGT, and the input sequence is ACCGTA$TACGGT (the $ sign flags artificial words formed at the boundary, and any MAW containing it is removed).
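The noRC and RC settings can be illustrated with a brute-force sketch on the toy sequence above. This is not the algorithm of [145]; it enumerates substrings directly and is only practical for short inputs. The function names are hypothetical. A word is treated as a MAW when it is absent but both its longest proper prefix and longest proper suffix occur. In the RC setting, collecting the substrings of each strand separately is assumed to be equivalent to concatenating with $ and discarding every word that contains $.

```python
from typing import Set

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def factors(seq: str, max_len: int) -> Set[str]:
    """All substrings of seq of length 0..max_len (the empty word included)."""
    found = {""}
    for start in range(len(seq)):
        longest_end = min(start + max_len, len(seq))
        for end in range(start + 1, longest_end + 1):
            found.add(seq[start:end])
    return found


def minimal_absent_words(seq: str, max_len: int = 1000,
                         use_rc: bool = False,
                         alphabet: str = "ACGT") -> Set[str]:
    """Brute-force MAW of seq (hypothetical helper, toy inputs only)."""
    present = factors(seq, max_len)
    if use_rc:
        # Words spanning the '$' boundary never enter the set, because the
        # two strands are scanned separately.
        present |= factors(reverse_complement(seq), max_len)
    maws = set()
    for prefix in present:
        if len(prefix) >= max_len:
            continue
        for letter in alphabet:
            word = prefix + letter
            # prefix occurs by construction; also require the longest
            # proper suffix to occur and the word itself to be absent.
            if word not in present and word[1:] in present:
                maws.add(word)
    return maws


norc = minimal_absent_words("ACCGTA")
rc = minimal_absent_words("ACCGTA", use_rc=True)
```

On this toy input, GG is a MAW in the noRC setting but not in the RC setting, because GG occurs in the reverse complement TACGGT. AA is a MAW in both settings.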
7.3.2 MAW performs well
Table 7.11 shows that MAW give respectable performance in both the Baltimore Class and ICTV Order experiments. Classification errors for ICTV Orders are lower than those for Baltimore Classes, which is consistent with the results from other features in our study. However, the best performing combination of feature and difference measure is noRC with JD, which is the opposite of the results in [121], where RC outperforms noRC and LWI∩ outperforms JD. The main cause of this inconsistency could be the datasets used. Experiments in [121] use a small dataset whose sequence lengths vary from 86 to 105 base pairs (for details of the dataset see [146]). In contrast, the dataset we use is much larger, with sequence lengths varying significantly from 859 to 2,473,870 base pairs. RC can be redundant given the long sequences in our dataset, and the significant variation in sequence length tends to bias the LWI∩ measure. Typically, the intersection between the MAW sets of two long genome sequences contains more elements than that of two short ones, hence long sequences give a smaller LWI∩, suggesting a lower level of difference. JD alleviates this problem by using the ratio between $|\mathrm{MAW}_{S_1} \cap \mathrm{MAW}_{S_2}|$ and $|\mathrm{MAW}_{S_1} \cup \mathrm{MAW}_{S_2}|$ (see Section 4.2.6 for difference measures).
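As a minimal sketch of why a ratio-based measure is less sensitive to set size, the snippet below computes JD in the standard Jaccard form, one minus the intersection-to-union ratio. This form is an assumption here, since the exact definition is the one given in Section 4.2.6. The word labels are hypothetical placeholders, not real MAW.

```python
from typing import Set


def jaccard_distance(maw_a: Set[str], maw_b: Set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B| (assumed standard form of JD)."""
    union = maw_a | maw_b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(maw_a & maw_b) / len(union)


# Hypothetical MAW sets: a small pair and a pair 100 times larger,
# both with the same proportion of shared words.
small_a = {f"w{i}" for i in range(0, 4)}
small_b = {f"w{i}" for i in range(2, 6)}
large_a = {f"w{i}" for i in range(0, 400)}
large_b = {f"w{i}" for i in range(200, 600)}

jd_small = jaccard_distance(small_a, small_b)  # intersection 2, union 6
jd_large = jaccard_distance(large_a, large_b)  # intersection 200, union 600
```

The large pair shares 100 times more words than the small pair, yet both pairs receive the same distance because the union grows in proportion.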
Baltimore Classes
Feature    Difference Measure
           LWI∩           JD
noRC       0.193±0.007    0.027±0.006
RC         0.228±0.014    0.030±0.004

ICTV Orders
Feature    Difference Measure
           LWI∩           JD
noRC       0.159±0.010    0.014±0.005
RC         0.193±0.012    0.017±0.006

Table 7.11: Classification error rate of different features and difference measures using MAW.
7.4 Features based on compression
7.4.1 Design
The purpose of this section is to study the classification performance of predicting Baltimore Classes and ICTV Orders using features derived from compression methods. The assumption for this study is that similar sequences contain similar patterns and therefore tend to have similar compression ratios for a given tool. Hence, the features will be representative of a sequence, and the distance between features reflects the distance between the original sequences.
Features for a genome sequence consist of compression ratios obtained using different compression tools. We first treat each genome sequence as plain text and compress it using general-purpose compression tools: bzip2, gzip, xz and zip. Then, we use features derived using reference-free DNA-specific compression tools: DELIMINATE [133], MFCompress [134] and LEON [136]. All the features are summarised in Table 4.4. For each tool, the parameters are set to achieve the best compression ratio.
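The general-purpose part of this design can be sketched with Python's standard-library codecs standing in for the command-line tools. This is an illustration under assumptions, not the pipeline used in the study: the ratio is taken to be compressed size divided by original size, the highest compression level is used as a proxy for "best compression ratio", and the function name and the archive member name are hypothetical. In-process codecs can also produce slightly different sizes from the command-line tools because of header differences.

```python
import bz2
import gzip
import io
import lzma
import zipfile
from typing import Dict


def compression_ratios(sequence: str) -> Dict[str, float]:
    """Compressed size / original size for four general-purpose codecs."""
    data = sequence.encode("ascii")
    if not data:
        raise ValueError("sequence must not be empty")

    # zip needs an archive container; build it in memory.
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "w",
                         compression=zipfile.ZIP_DEFLATED,
                         compresslevel=9) as archive:
        archive.writestr("sequence.txt", data)  # hypothetical member name

    compressed_sizes = {
        "bzip2": len(bz2.compress(data, compresslevel=9)),
        "gzip": len(gzip.compress(data, compresslevel=9)),
        "xz": len(lzma.compress(data, preset=9)),
        "zip": len(zip_buffer.getvalue()),
    }
    return {tool: size / len(data) for tool, size in compressed_sizes.items()}
```

Under this definition a highly repetitive sequence yields a much smaller ratio than a random one, which is the property the features rely on.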
7.4.2 General-purpose compression
The distribution of compression ratios from each tool is shown in Figs. 5.19 and 5.21. We construct a feature by combining the compression ratios of bzip2, gzip, xz and zip (CRGP). Since bzip2 gives the best compression performance, we explore two other features related to its ratio. One uses its ratio as a single-variable feature (CRB), and the other is a 2D feature that combines its ratio with the log-transformed genome length (CRBL). For details of the features, see Table 4.4.
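A minimal sketch of how the three feature vectors could be assembled from per-tool ratios is given below. The function name and the ratio values are hypothetical, and the base-10 logarithm is an assumption; the definitive specification is the one in Table 4.4.

```python
import math
from typing import Dict, List

GENERAL_PURPOSE_TOOLS = ("bzip2", "gzip", "xz", "zip")


def general_purpose_features(ratios: Dict[str, float],
                             genome_length: int) -> Dict[str, List[float]]:
    """Build CRGP (4D), CRB (1D) and CRBL (2D) from compression ratios."""
    return {
        "CRGP": [ratios[tool] for tool in GENERAL_PURPOSE_TOOLS],
        "CRB": [ratios["bzip2"]],
        # Log base 10 is assumed here; the source only says "log transformed".
        "CRBL": [ratios["bzip2"], math.log10(genome_length)],
    }


# Hypothetical ratios for one genome of 1,000 base pairs.
example_ratios = {"bzip2": 0.27, "gzip": 0.30, "xz": 0.28, "zip": 0.31}
features = general_purpose_features(example_ratios, genome_length=1000)
```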
The classification performance is shown in Table 7.12. The best performance for both Baltimore Classes and ICTV Orders is achieved using the feature CRGP with the SVM classifier, with error rates of 0.139 and 0.092 respectively.
Baltimore Classes
Classifier CRGP CRB CRBL
kNN 0.154±0.008 0.228±0.009 0.187±0.014
SVM 0.139±0.006 0.229±0.009 0.177±0.013
ICTV Orders
Classifier CRGP CRB CRBL
kNN 0.101±0.011 0.183±0.007 0.139±0.012
SVM 0.092±0.007 0.183±0.009 0.117±0.011
Table 7.12: Classification performance using features based on compression ratios of general-purpose compression tools.
7.4.3 DNA-specific compression
The distribution of compression ratios from each tool is shown in Figs. 5.20 and 5.22. We construct a feature by combining the compression ratios of DELIMINATE, MFCompress and LEON (CRDNA). Since LEON gives the best compression performance, we explore two other features related to its ratio. One uses its ratio as a single-variable feature (CRL), and the other combines its ratio with the log-transformed genome length (CRLL). In addition, we construct two features that combine general-purpose and DNA-specific tools. The first (CRA) is a 7D vector consisting of the compression ratios of all seven tools (four general-purpose tools and three DNA-specific tools). The second (CRLB) is a 2D feature that combines the ratios of the best general-purpose tool and the best DNA-specific tool. For details of the features, see Table 4.4.
The classification performance is shown in Table 7.13. The best performance for both Baltimore Classes and ICTV Orders is achieved using the feature CRA with the SVM classifier, with error rates of 0.113 and 0.073 respectively.
Baltimore Classes
Classifier CRDNA CRL CRLL CRA CRLB
kNN 0.178±0.007 0.218±0.015 0.198±0.012 0.128±0.014 0.174±0.009
SVM 0.162±0.006 0.221±0.014 0.184±0.012 0.113±0.012 0.162±0.008
ICTV Orders
Classifier CRDNA CRL CRLL CRA CRLB
kNN 0.125±0.008 0.205±0.014 0.132±0.011 0.089±0.014 0.124±0.008
SVM 0.119±0.006 0.206±0.014 0.125±0.012 0.073±0.011 0.102±0.008
Table 7.13: Classification performance using features based on the compression ratios of DNA-specific compression tools.