2.2 Genomic sequence analysis methods
2.2.2 Alignment-free methods
The alignment-free methods have been proposed as an alternative to address situations where
alignment-based methods are computationally inefficient or fail [11, 20, 21]. They have follow-
ing advantages: (i) alignment-free methods are capable of recognizing homology even when the loss of contiguity is beyond the possibility of alignment [20]. (ii) With alignment-free methods, similarities can be found that can’t be discovered through edit distances (counting the minimum number of operations required to transform one string into the other), which are used in alignment-based methods [24]. (iii) ability to compare unrelated sequences. There are a variety of alignment-free methods proposed over the last few decades.
Random walk [25, 26] was one of the first alignment-free methods that were proposed. It generates two-dimensional graphical representations of genomic sequences and compares
them using Manhattan and Euclidean distances. More specifically, the four nucleotidesT, A, C,
Gare encoded by four possible moves corresponding to the directionsup, down, left, rightre-
spectively, to generate a graphical representation in a plane. Susceptible to degeneracy, initially this method was considered unsuitable for genomic analysis. The method was later improved [27, 28] by using the geometric center of the points in the walk for sequence comparison. Modified versions of the random walk technique have been used to produce the similarity ma-
trices from the first exon of the β-globin gene of several mammals [29, 30, 31, 32, 33] and
to generate the phylogenetic trees for primate mitochondrial DNA [30], coronaviruses [34], etc. The random walk technique has also been used to analyze proteins [35, 36, 37], bacteria [38] and yeast [39]. In the random walk technique, the plotting of the current point depends
on the preceding points. Randicet al. [40, 41] proposed an alternative representation, called
“cell” representation, where the plotting of points is independent of the preceding points. They proposed the construction of a 12-component vector by using the leading eigenvalues of the
L/L matrix (Length by Length matrix) for the comparison of the first exon ofβ-globin region
of 11 mammals. The elements of the L/L matrix are defined as the quotient of the Euclidean
pair of dots measured along the curve. Various modifications were proposed following this study [42, 43, 44], but these techniques failed to receive attention because the representation
construction is computationally inefficient.
Qi et al. proposed a graph theory based method [45], where for each DNA sequence a
weighted directed graph with four vertices (one vertex for each nucleotide) is constructed. Each edge of the graph represents a unique dinucleotide and graph has sixteen edges in total. The edge weights are updated based on both ordering and frequency of nucleotides, and an
adjacency matrix of size 4×4 corresponding to the edge weights is constructed. The dissim-
ilarity between any two DNA sequences is measured by computing a distance between their respective adjacency matrices.
Over the years, other alignment-free methods have been proposed which used different
approaches. Markov models have been used to cluster coding DNA sequences [46], to study
intra-genomic variations for viruses and some animals [47], and to build phylogenies of S.
flexneri, E. Coli [48], Hepatitis-E virus [49] and HIV-1 [50]. Thermal melting profiles have
been used to classify several mammalian species usingβ-globin andαchain class II MHC genes
[51]. Lempel-Ziv complexity has been used to cluster protein families into functional subtypes [52]. This method has also been used to build phylogenetic trees of fungi using ribosomal DNA
sequences [53], perennial plant genusGalanthususing nuclear and chloroplast DNA sequences
[54], and HEV and mammals using DNA sequences [55, 56, 57, 58] .
Another popular category of alignment-free methods makes use of word frequencies [59,
60, 61]. The difference between the two sequences can be obtained by computing the k-mer
(subsequences of lengthk) frequencies first and then distance between them. The word-based
alignment-free technique was first used to construct accurate phylogenetic trees for mammalian
alpha- and beta-globin genes [62]. Baoet al. [63] proposed a Category-Position-Frequency
(CPF) model, which utilized word frequency and position information of nucleotides in DNA sequences. The main disadvantage of this method is that the adjacent word matches are de-
frequencies to address the problem of dependency on adjacent word matches. This method used spaced-words, defined by patterns of ‘match’ and ‘don't care’ positions, for alignment-
free sequence comparison. Sims et al. [65] proposed a k-mer vectors based method called
Feature Frequency Profiles (FFP). FFP has been used for phylogenetic analysis using a variety of sequences including intron sequences of mammals [65], mitochondrial DNA sequences of primates and nuclear DNA sequences of plants [66], and bacterial genomes [67]. Many au- thors [68, 69, 70, 71, 72, 73, 74, 75, 76, 77] have used Chaos game Representation (CGR)
[78] fork-mer-based sequence analysis. CGR is a two-dimensional graphical representation of
DNA sequence, and the details of the CGR construction are given in Section 2.3.1. CGR has been used in literature on a variety of sequences e.g. to build phylogenies using mitochondrial DNA sequences [71, 72], nuclear DNA sequences [73, 75], bacterial sequences [76], and viral sequences [77, 70].
In recent years, Genomic Signal Processing (GSP) [79] based alignment-free methods have also been proposed. GSP-based methods apply techniques of Digital Signal Processing (DSP) to genomic data. GSP-based methods have been successfully used for a variety of applications, e.g., to distinguish introns from exons [80, 81, 82], for complete genome phylogenetic analysis of primates, bacteria and influenza [83], and for classification of whole bacterial genomes
[84]. Borraya et al. [85] proposed a GSP-based method for the computation of alignment-
free distances between DNA sequences, where DNA sequences were mapped to numerical
sequences based on the nucleotide doublet values (0− 15 for all possible 16 combinations).
The analysis was done on relatively small dataset composed of the ribosomalS18 subunit gene.
Yinet al. [86] proposed another alignment-free method that encoded each DNA sequence to
four binary indicator sequences and applied Discrete Fourier Transform (DFT) to compute the power spectra. The Euclidean distance of full DFT power spectra of the DNA sequences was used as a dissimilarity measure. Other DSP techniques have also been used for genome similarity analysis, e.g. comparing the phase spectra of the DFT of digital signals of full mtDNA genomes [87, 88].
Though existing alignment-free methods have successfully addressed most of the limita- tions of the alignment-based methods, they have some disadvantages of their own. Zielezinski
et al. [11] reviewed the majority of existing alignment-free methods and highlighted the fol-
lowing limitations:
(i) A majority of existing alignment-free methods are still exploring the technical founda- tions and lack software implementation, so it is not possible to compare their perfor-
mance on common datasets. Without comparison or existing proven results, it is difficult
for users to pick one method for their specific application.
(ii) Most of the existing alignment-free methods that have software implementations avail- able are tested using very small real-world datasets or simulated sequences. Their appli- cability to a variety of applications is untested.
(iii) Though alignment-free methods have lower time-complexity, their memory consump-
tion is still an issue, at least for k-mer based methods. The use of longer k-mers for
multigenome data can cause possible memory overhead.
We propose a novel alignment-free GSP-based methodology that addresses the limitations of the existing alignment-free methods in addition to the alignment-based methods, see Section 2.3 for details.Though our proposed approach addresses the previously identified limitations of both alignment-based and alignment-free algorithms, high memory use remains an issue when
CGR, ak-mer dependent numerical representation, is used. The high memory use is because
of the length of sequences, and large size of datasets. In particular, high memory use is un- avoidable if the required analysis demands the use of full genomes. Another notable limitation of our methodology is inherited from the use of supervised machine learning algorithms. More specifically, our approach can only predict the label of an unknown new sequence by assigning a label from the available labels in the training set. In case the actual label is missing from the training set, our approach assigns a closest available label (the label of the most similar sequence in the training set).