Features based on absent words - Genome sequence-based virus taxonomy using machine learning

7.3 Features based on absent words

7.3.1 Design

In this section, we study the performance of features based on absent words, i.e. k-mers that are possible but missing in the sequences, focusing on minimal absent words (MAW).

We used the algorithm and code described in [145] to compute the set of MAW of each genome sequence. We extract MAW with lengths ranging from 1 to 1,000; this upper bound is much larger than the length of the majority of MAW in a sequence. The benefits of doing so are two-fold. First, we would like to extract as many MAW as possible at a reasonable computational cost. Second, setting a large maximum length makes it likely that the longest MAW are identified, and these have been shown to distinguish between sequences better than short ones [118].

In addition, we compute the MAW of each sequence with its Reverse Complement (RC) concatenated, and compare these to the MAW computed without it (noRC). The potential benefit of RC is that it accounts for words that occur in the reverse complement strand but are absent from the direct strand. For example, given a sequence ACCGTA, the input sequence for extracting MAW in the noRC setting is the original one, ACCGTA; in the RC setting, the original sequence ACCGTA is concatenated with its reverse complement TACGGT, and the input sequence is ACCGTA$TACGGT (the $ sign is used to flag artificial words formed at the boundary, and any MAW containing it is removed).
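The noRC and RC inputs described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the brute-force MAW search is a stand-in for the algorithm of [145], which it does not reproduce and which would be needed for real genomes. It assumes the usual definition of a MAW as a word that is absent from the input while all of its proper factors occur.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def maw_input(seq, use_rc):
    """Build the string from which MAW are extracted (noRC or RC setting)."""
    if use_rc:
        return seq + "$" + reverse_complement(seq)
    return seq


def minimal_absent_words(text, max_len, alphabet="ACGT"):
    """Brute-force MAW up to max_len (hypothetical helper, tiny inputs only)."""
    # Collect every substring of length <= max_len, plus the empty word.
    present = {""}
    for i in range(len(text)):
        for j in range(i + 1, min(i + max_len, len(text)) + 1):
            present.add(text[i:j])

    # A single letter is a MAW if it never occurs in the input.
    maws = {a for a in alphabet if a not in present}

    # A longer word a+core+b is a MAW if it is absent while both
    # a+core and core+b occur. Cores containing "$" are skipped, so no
    # returned word contains the boundary marker.
    for core in present:
        if "$" in core or len(core) + 2 > max_len:
            continue
        for a, b in product(alphabet, repeat=2):
            word = a + core + b
            if word not in present and a + core in present and core + b in present:
                maws.add(word)
    return maws


seq = "ACCGTA"
no_rc = minimal_absent_words(maw_input(seq, use_rc=False), max_len=4)
rc = minimal_absent_words(maw_input(seq, use_rc=True), max_len=4)
```

For this toy sequence, GG is a MAW in the noRC setting but not in the RC setting, because GG occurs in the reverse complement TACGGT.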

7.3.2 MAW performs well

Table 7.11 shows that MAW give respectable performance in both the Baltimore Class and ICTV Order experiments. Classification errors for ICTV Orders are lower than those for Baltimore Classes, which is consistent with the results from other features in our study. However, the best performing combination of feature and difference measure is noRC with JD, which is the opposite of the results in [121], where RC outperforms noRC and LWI_∩ outperforms JD. The main cause of this inconsistency could be the datasets used. Experiments in [121] use a small dataset whose sequence lengths vary from 86 to 105 base pairs (for details of the dataset see [146]). In contrast, the dataset we use is much larger, with sequence length varying significantly from 859 to 2,473,870 base pairs. RC can be redundant given the long sequences in our dataset, and the significant variation in sequence length tends to bias the LWI_∩ measure. Typically, the intersection between the MAW sets of two long genome sequences tends to contain more elements than that of two short ones, hence long sequences give a smaller LWI_∩, suggesting a lower level of difference. JD can alleviate this problem by using the ratio between |MAW_S₁ ∩ MAW_S₂| and |MAW_S₁ ∪ MAW_S₂|, which normalises for the size of the sets (see Section 4.2.6 for difference measures).
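To make the normalisation concrete, the following minimal sketch computes a set-based distance from two MAW sets. It assumes that JD denotes the standard Jaccard distance, 1 − |A ∩ B| / |A ∪ B|; the definition actually used is the one in Section 4.2.6, which may differ in detail. The function name and the toy sets are illustrative only.

```python
def jaccard_distance(maw_a, maw_b):
    """Jaccard distance between two sets of words (assumed form of JD)."""
    union = maw_a | maw_b
    if not union:
        # Two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(maw_a & maw_b) / len(union)


# Toy MAW sets, invented for illustration.
s1 = {"AA", "GG", "CT"}
s2 = {"AA", "TT"}
distance = jaccard_distance(s1, s2)  # 1 shared word out of 4 distinct words
```

Because the size of the intersection is divided by the size of the union, two long sequences that share many MAW simply because they have many MAW are not automatically judged to be close.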

Baltimore Classes

Feature    LWI_∩          JD
noRC       0.193±0.007    0.027±0.006
RC         0.228±0.014    0.030±0.004

ICTV Orders

Feature    LWI_∩          JD
noRC       0.159±0.010    0.014±0.005
RC         0.193±0.012    0.017±0.006

Table 7.11: Classification error rate of different features and difference measures using MAW.

7.4 Features based on compression

7.4.1 Design

The purpose of this section is to study the classification performance of predicting Baltimore Classes and ICTV Orders using features derived from compression methods. The assumption underlying this study is that similar sequences contain similar patterns and tend to have similar compression ratios for a given tool. Hence, the features are representative of a sequence, and the distance between features reflects the distance between the original sequences.

Features for a genome sequence consist of compression ratios obtained using different compression tools. We first treat each genome sequence as a piece of text and compress it using general-purpose compression tools: bzip2, gzip, xz and zip. Then, we use features derived from reference-free DNA-specific compression tools: DELIMINATE [133], MFCompress [134] and LEON [136]. All the features are summarised in Table 4.4. For each tool, the parameters are set to achieve the best compression ratio.

7.4.2 General-purpose compression

The distribution of compression ratios from each tool is shown in Figs. 5.19 and 5.21. We construct features by combining the compression ratios of bzip2, gzip, xz and zip (CRGP). Since bzip2 gives the best compression performance, we explore two other features related to its ratio. One uses its ratio as a single-variable feature (CRB), and the other is a 2D feature that combines its ratio with log-transformed genome length (CRBL). For details of features, see Table 4.4.
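The three features above can be sketched with Python's standard-library compressors. Several points here are assumptions rather than details taken from the text: the library codecs stand in for the command-line tools, the compression ratio is taken to be compressed size divided by original size, the logarithm is base 10, and the function names and component order are illustrative. The definitions that were actually used are those in Table 4.4.

```python
import bz2
import gzip
import io
import lzma
import math
import zipfile


def zip_size(data):
    """Size in bytes of a ZIP archive holding `data` as a single member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED,
                         compresslevel=9) as zf:
        zf.writestr("sequence.txt", data)
    return len(buf.getvalue())


def general_purpose_features(sequence):
    """Build CRGP, CRB and CRBL for one non-empty genome sequence."""
    data = sequence.encode("ascii")
    n = len(data)
    # Assumed definition: ratio = compressed size / original size.
    ratios = {
        "bzip2": len(bz2.compress(data, compresslevel=9)) / n,
        "gzip": len(gzip.compress(data, compresslevel=9)) / n,
        "xz": len(lzma.compress(data, preset=9)) / n,
        "zip": zip_size(data) / n,
    }
    return {
        "CRGP": [ratios[t] for t in ("bzip2", "gzip", "xz", "zip")],  # 4D
        "CRB": [ratios["bzip2"]],                                    # 1D
        "CRBL": [ratios["bzip2"], math.log10(n)],                    # 2D
    }


# Toy input: a highly repetitive 2,000-base sequence.
features = general_purpose_features("ACGT" * 500)
```

Setting each codec to its highest compression level mirrors the choice of parameters that achieve the best compression ratio, although the command-line tools may expose further options.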

The classification performance is shown in Table 7.12. The best performance for predicting both Baltimore Classes and ICTV Orders is achieved using the feature CRGP with SVM, with error rates of 0.139 and 0.092 respectively.

Baltimore Classes

Classifier    CRGP           CRB            CRBL
kNN           0.154±0.008    0.228±0.009    0.187±0.014
SVM           0.139±0.006    0.229±0.009    0.177±0.013

ICTV Orders

Classifier    CRGP           CRB            CRBL
kNN           0.101±0.011    0.183±0.007    0.139±0.012
SVM           0.092±0.007    0.183±0.009    0.117±0.011

Table 7.12: Classification performance using features based on compression ratios of general-purpose compression tools.

7.4.3 DNA-specific compression

The distribution of compression ratios from each tool is shown in Figs. 5.20 and 5.22. We construct features by combining the compression ratios of DELIMINATE, MFCompress and LEON (CRDNA). Since LEON gives the best compression performance, we explore two other features related to its ratio. One uses its ratio as a single-variable feature (CRL), and the other combines its ratio with log-transformed genome length (CRLL). In addition, we construct two features that combine general-purpose and DNA-specific tools. The first (CRA) is a 7D vector consisting of the compression ratios of all seven tools (four general-purpose tools and three DNA-specific tools). The second (CRLB) is a 2D feature that combines the ratios of the best general-purpose tool and the best DNA-specific tool. For details of features, see Table 4.4.
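The composition of these five features can be illustrated as follows. The sketch takes the per-tool ratios as given, since the DNA-specific tools are external programs; the function name, the order of the components, the base-10 logarithm and the numeric values are all assumptions made for illustration and are not results from this study.

```python
import math

GENERAL_TOOLS = ("bzip2", "gzip", "xz", "zip")
DNA_TOOLS = ("DELIMINATE", "MFCompress", "LEON")


def combined_features(ratios, genome_length):
    """Assemble CRDNA, CRL, CRLL, CRA and CRLB from per-tool ratios."""
    crdna = [ratios[t] for t in DNA_TOOLS]
    return {
        "CRDNA": crdna,                                       # 3D
        "CRL": [ratios["LEON"]],                              # 1D
        "CRLL": [ratios["LEON"], math.log10(genome_length)],  # 2D
        "CRA": [ratios[t] for t in GENERAL_TOOLS] + crdna,    # 7D
        "CRLB": [ratios["LEON"], ratios["bzip2"]],            # 2D
    }


# Placeholder ratios, invented purely to exercise the function.
example_ratios = {
    "bzip2": 0.28, "gzip": 0.31, "xz": 0.29, "zip": 0.32,
    "DELIMINATE": 0.24, "MFCompress": 0.23, "LEON": 0.22,
}
example = combined_features(example_ratios, genome_length=10_000)
```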

The classification performance is shown in Table 7.13. The best performance for predicting both Baltimore Classes and ICTV Orders is achieved using the feature CRA with SVM, with error rates of 0.113 and 0.073 respectively.

Baltimore Classes

Classifier    CRDNA          CRL            CRLL           CRA            CRLB
kNN           0.178±0.007    0.218±0.015    0.198±0.012    0.128±0.014    0.174±0.009
SVM           0.162±0.006    0.221±0.014    0.184±0.012    0.113±0.012    0.162±0.008

ICTV Orders

Classifier    CRDNA          CRL            CRLL           CRA            CRLB
kNN           0.125±0.008    0.205±0.014    0.132±0.011    0.089±0.014    0.124±0.008
SVM           0.119±0.006    0.206±0.014    0.125±0.012    0.073±0.011    0.102±0.008

Table 7.13: Classification performance using features based on the compression ratios of DNA-specific compression tools.
