New Contig Creation Algorithm for the de novo DNA Assembly Problem


Mohammad Goodarzi

Computer Science

A thesis submitted in partial fulfilment of the requirements for the degree of

Master of Science in Computer Science

Department of Computer Science, Brock University, St. Catharines, Ontario


DNA assembly is among the most fundamental and difficult problems in bioinformatics. Near-optimal assembly solutions are available for bacterial and other small genomes. However, assembling large and complex genomes, especially the human genome, using Next-Generation-Sequencing (NGS) technologies has proven very difficult because of the highly repetitive and complex nature of the human genome, short read lengths, uneven data coverage and tools that are not specifically built for human genomes. Moreover, many algorithms do not scale to human genome datasets containing hundreds of millions of short reads. The DNA assembly problem is usually divided into several sub-problems, including DNA data error detection and correction, contig creation, scaffolding and contig orientation; each can be seen as a distinct research area. This thesis focuses on creating contigs from short reads and combining them with outputs from other tools in order to obtain better results. Three assemblers, SOAPdenovo [Li09], Velvet [ZB08] and Meraculous [CHS+11], are selected for comparative purposes in this thesis.

The results show that this work produces contigs comparable to those of other assemblers, and that combining our contigs with outputs from other tools produces the best results, outperforming all other investigated assemblers.


I would like to thank my supervisor, Dr. Sheridan Houghten, whose support in many ways helped me achieve my research goals. I would also like to thank Dr. Ping Liang for his guidance and suggestions throughout my research, which helped me better understand the area, and for access to his laboratory resources.

I am very grateful to my family, who have supported me in continuing my higher education. Finally, this thesis would not have been possible without the help and sacrifices of my beloved wife, Nazanin.


Abstract

1 Introduction to Genomes and Genome Assembly
1.1 DNA Molecule and Structure
1.2 DNA Sequencing Technologies
1.3 Summary
1.4 Organization of Thesis

2 Review of de novo Genome Assembly Algorithms
2.1 Overlap-Layout-Consensus (OLC) Methods
2.2 De Bruijn Graph (DBG) Methods
2.3 Greedy Graph Methods
2.4 Summary

3.2.1 Input Data Loading and Reads/K-Mer Class Structures
3.2.2 Removing Less Frequent k-mers
3.2.3 Finding k-mer Overlaps
3.2.4 Contig Creation
3.3 Multi k-mer Assembly Solution
3.3.1 Contigs Merging
3.4 External Contigs Expansion
3.5 Summary

4 Experimental Results
4.1 Experimental Results Terminology
4.1.1 Datasets
4.2 Results
4.2.1 N50 Results
4.2.2 External Contigs Expansion Results
4.2.3 Computation Time Results
4.3 Summary

5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work

Appendix B

List of Figures

List of Tables


Chapter 1

Introduction to Genomes and Genome Assembly

1.1 DNA Molecule and Structure

The functions, activities and development of all living organisms are defined by a molecule called DNA. DNA is a macromolecule built from simpler chemical units, and it encodes the genetic instructions that define how a living organism functions. Finding and analysing the sequence of chemical units in a DNA molecule is considered key to understanding how living organisms work and to finding cures for many genetic diseases. The importance of genetics and DNA analysis has created vast research areas, both in biology, to determine DNA structure, and in computer science, to analyse the massive amounts of data generated in biology labs in order to reveal important information about genetic codes. Bioinformatics is the general area of research that approaches biological problems from the computer science point of view. This thesis focuses on one of the most fundamental problems in bioinformatics, the “de novo DNA assembly problem”. Before going deeper into the main problem, this chapter presents an introduction to DNA structure, DNA sequencing technologies and genome assembly.

DNA consists of two long biopolymers made of simpler chemical units called nucleotides. These two long chains of nucleotides are connected to each other at every nucleotide location and can be imagined as a ladder. Each long chain is called a strand. There are four different nucleotides that are the basic building blocks of the DNA molecule: Adenine, Cytosine, Guanine and Thymine, abbreviated by the letters A, C, G and T respectively. Figure 1.1 shows the chemical structure of these nucleotides. Each nucleotide in the DNA is commonly called a base, and two nucleotides facing each other on opposite strands form a base-pair. There is generally no restriction on which bases may follow one another within a strand, but bases at equivalent locations on opposite strands must be complementary to each other: “A” is always paired with “T” and “C” is always paired with “G” [WC+53]. Figure 1.2 shows a very simple view of the DNA molecule structure. For more detailed information about the DNA molecule and its structure, refer to [Nai07].
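The complementarity rule can be illustrated with a short sketch. The function names `complement` and `reverse_complement` are illustrative choices and do not come from the thesis; the reverse complement reflects the standard convention that the two strands are read in opposite directions.

```python
# Watson-Crick pairing: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base-by-base complement of a strand."""
    return "".join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own direction."""
    return complement(strand)[::-1]

print(complement("ATCG"))          # TAGC
print(reverse_complement("ATCG"))  # CGAT
```

Because every base determines its partner, storing one strand is enough to recover the other.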


Figure 1.1: Chemical structure of nucleotides. (A): Adenine, (B): Cytosine, (C): Guanine, (D): Thymine. (Image sources: http://en.wikipedia.org/wiki/Adenine, http://en.wikipedia.org/wiki/Cytosine, http://en.wikipedia.org/wiki/Guanine, http://en.wikipedia.org/wiki/Thymine)


Figure 1.2: DNA structure. (Image source: http://www.chemguide.co.uk/organicprops/aminoacids/doublehelix.gif)


1.2 DNA Sequencing Technologies

Finding the sequence of base-pairs in a given DNA molecule is not an easy task. No existing approach can read the complete DNA sequence of a chromosome or a genome in one continuous piece, mainly because DNA molecules are extremely large. For example, in mammalian genomes, including the human genome, a single chromosomal DNA molecule can consist of hundreds of millions of base-pairs. However, knowledge of the DNA sequence of a genome is fundamental for other research areas in biology to progress. The first method to detect the precise order of base-pairs in a DNA molecule was devised by Frederick Sanger in 1977 [SNC77], and it is still the most accurate method for DNA sequencing. Sanger-based sequencing technologies are able to read fragments of the chromosomal DNA with a maximum length of around 1000 bp. DNA sequencing is an error-prone process that may report incorrect bases. Two important problems with Sanger technology are its slow run time for large genomes and its cost. These limitations led to new technologies addressing the speed and price challenges. Next-Generation-Sequencing (NGS) technologies [DSC+10, MPC+09, HBB+08] were proposed from 1996 onwards with the aim of reducing the cost and increasing the speed of the DNA sequencing process. Since their invention there have been numerous improvements in NGS technology, and it is now feasible to determine the DNA sequence of a genome comparatively quickly and cost-effectively.


However, NGS technologies also have several drawbacks:

• They produce much shorter sequence reads than Sanger sequencing. Currently the maximum length of DNA fragments produced by most NGS technologies is below 400 bp.

• They are more error-prone than Sanger-based sequencing, especially at the start and end of fragments.

Illumina Genome Analyzer [DSC+10], Applied Biosystems SOLiD System [MPC+09], Helicos BioScience HeliScope [HBB+08], 454 Life Sciences [MEA+05] and Ion Torrent [RHR+11] are the current leaders in Next-Generation-Sequencing technology.

Because it is not possible to sequence an entire DNA molecule in one attempt, researchers make many copies of the large DNA molecule, break them into chunks and sequence every chunk separately in parallel, thereby obtaining sequences for all parts of the genome. The obtained sequences must then be merged to produce one continuous sequence of base-pairs for the original DNA molecule. Shotgun Sequencing [Pop04] is the technique that divides the DNA molecule into smaller parts in order to make whole-genome sequencing possible. The smaller DNA chunks produced by shotgun sequencing from random locations are called “Reads”. Shotgun sequencing tries to produce random reads from all over the genome with an even distribution, and is thus able to produce the whole DNA sequence for a genomic DNA.

Figure 1.3: Shotgun sequencing. Small reads are created from random locations in the genome. Reads overlap with each other, making it possible to assemble them later into one contiguous sequence called an “Assembly”. (Image source: https://wiki.cebitec.uni-bielefeld.de/brf-software/images/2/2e/WholeGenomeShotgun.png)

Sequences obtained from random locations of the genome need to be processed in order to create one unique and continuous sequence expressing the original DNA sequence. Finding overlaps between the reads, merging the correct links together and extending the reads into larger sequences is the main task of “DNA Assembly” algorithms. This process is called “de novo” when no other DNA sequence information about the species being sequenced is available. Sanger sequencing technology creates reads of sufficiently high quality and length for DNA assembly algorithms to extract the final assembly, but NGS technologies demand drastically different strategies for DNA sequence assembly. The DNA assembly problem can be solved if there are enough high-quality reads from all over the genome to resolve all complex repeating structures in the genome. Longer reads make it easier to find correct overlaps and lead to better results.


Coverage (read depth) is the average number of reads representing a given nucleotide in the genome. It can be calculated from the length of the original genome (G), the number of reads (N) and the average read length (L) as $(N \times L)/G$ [MGG10].
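As a worked example of the coverage formula, the sketch below computes the read depth for a made-up dataset. The function name `coverage` and the numbers are hypothetical and chosen only for illustration.

```python
def coverage(num_reads: int, avg_read_length: float, genome_length: int) -> float:
    """Average read depth, computed as (N * L) / G."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * avg_read_length / genome_length

# Hypothetical dataset: 30 million reads of 100 bp from a 100 Mbp genome.
print(coverage(30_000_000, 100, 100_000_000))  # 30.0
```

A value of 30.0 means that each position of the genome is covered by 30 reads on average; individual positions may be covered by far fewer or far more reads.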

Currently, shotgun sequencing is used along with NGS technologies to sequence new species with large genomes. This produces hundreds of millions of reads that need to be processed. Dealing with this huge amount of data requires carefully designed algorithms, so conventional DNA assembly algorithms designed for Sanger sequencing data can no longer be used. Assembling the DNA sequences of large genomes with complex repeating patterns, such as the human genome, is currently not completely possible using NGS technologies. Assembly results obtained from NGS data are far less accurate than Sanger sequencing assemblies, even though the algorithms are more complex and better developed. In addition, rapid improvements in NGS technologies have generated much interest in sequencing the DNA of new species, yet there is no DNA assembly algorithm that reliably produces high-quality results, especially in the de novo case of working on a new species without any prior knowledge of the resulting DNA sequence. Therefore, there is much demand for new DNA assembly algorithms, fast techniques and methods to check the quality of DNA assemblers.


1.3 Summary

This chapter covered basic information about the DNA molecule, its structure and its basic building blocks, together with a brief introduction to DNA sequencing technologies and the two types of currently available sequencing methods: Sanger sequencing and Next-Generation-Sequencing. The specifications and limitations of each method were presented, and the shotgun sequencing technique used to create datasets for the DNA assembly problem was explained. The next chapter presents the DNA assembly problem in detail and introduces current approaches to solving it.

1.4 Organization of Thesis

The remainder of the thesis is organized as follows:

In chapter 2, different approaches to the Genome Assembly Problem including OLC, de Bruijn and Greedy methods are introduced.

Chapter 3 discusses the details about our new algorithm to solve the Genome Assembly Problem and introduces the new methods that we use compared to other assembly tools investigated in this thesis.

Chapter 4 includes the experimental results for running our new algorithm on several datasets and compares the performance of our algorithm to other DNA assemblers.

Finally, Chapter 5 concludes the work done in the thesis and introduces the next steps and future work for this research.


Chapter 2

Review of de novo Genome Assembly Algorithms

Genome assembly is the process of finding the single contiguous sequence of a DNA molecule from its set of reads, which contain smaller sequences from random locations of the genome. For better understanding, DNA assembly can be compared to having many copies of a book written with only four characters (A, C, G, T), passing each copy through a shredder with different cutters, and aiming to obtain one clean copy of the book from the shredded parts [NSW+13]. Besides the obvious difficulties of the problem, more hidden issues should also be considered: the original book may contain repeated paragraphs, some shreds are modified during the shredding process and therefore contain typos, and shredded parts may


not the book example). A full DNA assembler capable of solving the problem for any input dataset is in demand among researchers; however, to the best of our knowledge, such a system has not yet been created. This inability stems from several reasons:

• Different sequencing technologies have different characteristics [SJ08]. Some produce longer reads, making it easier for assemblers to detect overlaps, while others produce shorter reads with considerably high coverage, making the assemblers’ work more difficult since they must deal with massive inputs of short reads. Moreover, noise distributions differ among sequencing technologies [KSS+10]. Some technologies tend to produce noise at the start and end of reads, and some tend to generate noise in regions containing special sequences, such as long homopolymer runs. Currently, creating a framework that addresses all of these situations and performs well for every sequencing technology seems impossible.

• Different species, or even different individuals of the same species, have different genomes. Genomes can be straightforward to assemble or extremely complex. Repeating patterns are the most important factor defining the complexity of a genome. If repeats are shorter than the read length, there is a good chance that reads spanning the repeat can resolve it; however, complex genomes have repeats far longer than the actual read length, making them very difficult to resolve [MKS10]. Moreover, repeats can occur inside one another. Figure 2.1 shows two different types of repeating patterns in a genome. Assemblers are usually tuned heuristically to target particular types of genomes with known repeating patterns, making them incapable of solving the DNA assembly problem for an arbitrary newly sequenced genome, particularly in the “de novo” setting where no information about the genome is available.

• Some assembly methods that work for small sequencing projects do not scale to large sequencing projects dealing with very large genomes and hundreds of millions of reads [LLS+11, LZR+10].

Figure 2.1: Two types of repeats in a genome. The sequence ATCGTGTGC, marked as R1, is repeated four times throughout the genome and also resides inside a larger repeat pattern, GTTATCGTGTGCGGTTGATCGTGTGCGCCCAT, marked as R2.

Conventional DNA assembly algorithms were designed to work with Sanger-based sequencing reads. Sanger reads are more accurate than NGS reads and are long enough to ease the assembly process. Many assembly algorithms dealing with Sanger reads use the Overlap-Layout-Consensus (OLC) approach, which is explained in Section 2.1. However, with the invention of NGS technologies, sequencing new species has become available


genome and are not very accurate compared to Sanger reads. New methods have been devised specifically to address the assembly of NGS reads. Using de Bruijn graphs as a data structure is the most commonly used technique to tackle the DNA assembly problem and was first proposed by Pavel Pevzner in 2001 [PTW01]. This chapter covers three general techniques for solving the DNA assembly problem. Section 2.1 describes the Overlap-Layout-Consensus (OLC) approach, Section 2.2 explains the de Bruijn graph approach and Section 2.3 presents greedy graph algorithms.

2.1 Overlap-Layout-Consensus (OLC) Methods

The Overlap-Layout-Consensus method is considered the first approach proposed to solve the de novo DNA assembly problem. It was widely used in the Sanger era and was designed with the characteristics of Sanger sequencing in mind. Celera Assembler [MSD+00], Arachne [BJS+02, JBG+03], CAP and PCAP [HY05] are among the most widely used OLC DNA assemblers. It is argued in [Pop09] that the OLC approach may not scale to NGS data, mainly because the overlap phase is very time- and memory-intensive.


Figure 2.2: Two different scenarios are conceivable when two reads overlap: (i) the overlap is true, denoting a correct connection between the reads; (ii) the overlap stems from a repeating pattern and does not express a direct connection between the reads. Detecting which condition the overlap denotes is usually not possible. (Image source: [MSD+00])

• Overlap: Find overlaps between all pairs of reads in the input dataset. These overlaps form the main graph data structure to work on: graph nodes represent reads and edges represent overlaps between reads. Overlap criteria, such as minimum length and similarity percentage, vary between assemblers. Overlap computation is the most time-intensive phase of OLC approaches, requiring time proportional to the square of the number of reads in the worst case (each read must be compared to all other reads, leading to $n^2$ operations) [Pop09]. However, there are techniques to reduce the running time by parallelizing the computation on multi-processor machines [Pop09]. Figure 2.2 shows two different scenarios in which two reads can overlap and Figure 2.3 shows a simple overlap graph for a set of reads.


Figure 2.3: (A): set of reads with indentations showing overlaps between them. (B): overlap graph created for the read set, as usually used by OLC methods. (Image source: http://genome.cshlp.org/content/20/9/1165)

to merge nodes that have unique overlaps, so the graph becomes smaller without losing any information. This phase is called Layout. By performing Layout algorithms, some graph nodes are merged together and unique sequences from the genome, called Contigs, are created. The output graph can still be seen as an overlap graph, but between contigs. Figure 2.4 shows a Layout scenario and the formation of contigs.

• Consensus: The consensus phase aims to convert the whole graph into a single continuous sequence, called a Scaffold, representing the sequence of base-pairs that the input set expresses. This task can be done by finding a Hamiltonian path, which is a path that visits every vertex (node) in the graph exactly once. Deciding whether a Hamiltonian path exists in a graph is NP-Complete [GJT76], which is a drawback of using OLC methods for DNA assembly. Scaffolds connect contigs together by using mate-pair information and can contain gaps. Locations between the contigs are filled with gaps representing unknown bases if they cannot be determined. Therefore, if there is not enough mate-pair information, it is not possible to obtain one single scaffold.

Figure 2.4: Layout scenario. Reads whose connections are determined are merged together, and only nodes facing fork situations are left. Contigs are created by merging the nodes together. (Image source: http://gcat.davidson.edu/phast/olc.html)

The Overlap-Layout-Consensus technique is described in more depth in [MKS10, Bat05, PPDS04].
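The overlap phase can be sketched as follows. This is a minimal illustration that assumes exact suffix-prefix matches and a hypothetical `min_overlap` threshold; real OLC assemblers tolerate mismatches and use indexing to avoid comparing every pair. The names `overlap_length` and `overlap_graph` are illustrative only.

```python
from itertools import permutations

def overlap_length(a: str, b: str, min_overlap: int) -> int:
    """Longest suffix of a equal to a prefix of b, or 0 if shorter than min_overlap."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def overlap_graph(reads, min_overlap=3):
    """Map each ordered read pair (a, b) with a sufficient overlap to its length."""
    graph = {}
    # All ordered pairs are compared, hence the quadratic cost noted above.
    for a, b in permutations(reads, 2):
        length = overlap_length(a, b, min_overlap)
        if length:
            graph[(a, b)] = length
    return graph

reads = ["ATCGGA", "GGATTC", "TTCAAG"]
print(overlap_graph(reads))
# {('ATCGGA', 'GGATTC'): 3, ('GGATTC', 'TTCAAG'): 3}
```

In this toy graph every node has a unique successor, so a layout step could merge the three reads into one contig without any ambiguity.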

2.2 De Bruijn Graph (DBG) Methods

In general, a de Bruijn graph is a directed graph representing overlaps between sequences of symbols. The idea of using de Bruijn graphs to solve the DNA assembly problem was first proposed by Pavel Pevzner in 2001 [PTW01]. Currently, de Bruijn graphs are the most commonly used technique for solving the DNA assembly problem for NGS data, and several DNA assemblers are built on the de Bruijn graph structure, each with its own implementation.


Pevzner [PTW01] defines the de Bruijn graph used for the DNA assembly problem as follows: Given a set of reads $S = \{s_1, s_2, \ldots, s_n\}$, the de Bruijn graph $G(S_l)$ with vertex set $S_{l-1}$ (the set of all $(l-1)$-tuples from $S$) is defined as follows. An $(l-1)$-tuple $v \in S_{l-1}$ is joined by a directed edge with an $(l-1)$-tuple $w \in S_{l-1}$ if $S_l$ contains an $l$-tuple for which the first $l-1$ nucleotides coincide with $v$ and the last $l-1$ nucleotides coincide with $w$. With this definition, if $S$ contains only one sequence $s_1$, then the assembly is obtained by a path visiting each edge of the de Bruijn graph, a Chinese Postman Path [Fle90]. The Chinese Postman Path problem can then be translated to finding a path visiting every edge of a graph exactly once, an Eulerian Path Problem [Pev00]. This transformation happens by introducing multiplicities of edges in the de Bruijn graph. For example, every edge in the de Bruijn graph can be substituted by $k$ parallel edges for every $l$-tuple repeating $k$ times in $s_1$ [PTW01]. In real situations the de Bruijn graph becomes very large, and errors in sequenced reads make the graph even more complicated. Even in error-free cases, the graph becomes very complicated. Thus the information about which $l$-tuples belong to the same reads is used again to define Read-Paths and Eulerian SuperPaths, introduced in [PTW01]. More information about the theory and detailed specification of de Bruijn graphs used for the DNA assembly problem can be found in [PTW01].


De Bruijn graphs have two significant advantages over the OLC technique that make them practical for large genome projects:

• There is no need to explicitly calculate overlaps between all pairs of reads.

• As proposed by Pavel Pevzner [PTW01], the DNA assembly problem can be solved by finding an Eulerian path instead of a Hamiltonian path. An Eulerian path is a path that visits every edge in a graph exactly once. This has a huge impact on the DNA assembly problem, because efficient polynomial-time algorithms exist to calculate Eulerian paths in graphs [AIS84, AV84, UTK88].
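To illustrate why Eulerian paths are tractable, the sketch below uses Hierholzer's algorithm, a standard textbook method that runs in time linear in the number of edges. It is not taken from the thesis or from the cited references, and the function name `eulerian_path` is hypothetical. The sketch assumes that an Eulerian path exists in the given directed multigraph.

```python
from collections import defaultdict

def eulerian_path(edges):
    """Return a node sequence that uses every directed edge exactly once.

    Assumes that the (source, target) pairs admit an Eulerian path.
    """
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for source, target in edges:
        out_edges[source].append(target)
        balance[source] += 1
        balance[target] -= 1

    # Start at the node with one surplus outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])

    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())             # dead end: emit the node
    return path[::-1]

edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]
print(eulerian_path(edges))  # ['A', 'B', 'C', 'A', 'D']
```

Each edge is pushed and popped once, in contrast to the Hamiltonian path problem, for which no polynomial-time algorithm is known.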

De Bruijn graph assemblers do not explicitly calculate every overlap between all pairs of reads in the input dataset. They work with k-mers instead of read overlaps. All reads are first processed to find all overlapping substrings of length k, which are called k-mers. All k-mers from all reads in the dataset are extracted and each distinct k-mer is stored in memory only once, although it may occur in several reads. Fast data structures such as hash tables can be used to store and retrieve k-mers. A de Bruijn graph is then created from the k-mer set. Graph edges are the actual k-mers, which are substrings of length k within the reads, and graph nodes represent substrings of length (k-1) within the reads. A directed edge is established from one node to another when the last (k-2) characters of the first node equal the first (k-2) characters of the second, and the corresponding k-mer occurs in the reads. Figure 2.5 shows a de Bruijn graph with k = 4 for a sample consensus sequence.


Figure 2.5: Simple de Bruijn graph with k = 4 for a set of reads that creates the consensus sequence “ACCCAACCAC” (Image source: http://gcat.davidson.edu/phast/debruijn.html)

The above definition creates the basic de Bruijn graph for DNA assembly, however different assemblers may have slightly different structures, definitions and assumptions to build the graph.
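The basic construction can be sketched in a few lines. The two reads below are hypothetical and were chosen so that together they cover the consensus sequence “ACCCAACCAC” of Figure 2.5; the function names are illustrative and the sketch ignores reverse complements, k-mer frequencies and sequencing errors.

```python
from collections import defaultdict

def kmers(read: str, k: int):
    """Yield every overlapping substring of length k in the read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def de_bruijn_graph(reads, k: int):
    """Build {(k-1)-mer node: set of successor nodes}; each distinct k-mer is one edge."""
    # A set (hash table) stores each distinct k-mer only once.
    distinct_kmers = {kmer for read in reads for kmer in kmers(read, k)}
    graph = defaultdict(set)
    for kmer in distinct_kmers:
        graph[kmer[:-1]].add(kmer[1:])  # prefix node -> suffix node
    return graph

reads = ["ACCCAAC", "CAACCAC"]  # hypothetical reads
graph = de_bruijn_graph(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
# AAC -> ['ACC']
# ACC -> ['CCA', 'CCC']
# CAA -> ['AAC']
# CCA -> ['CAA', 'CAC']
# CCC -> ['CCA']
```

The shared k-mer CAAC appears in both reads but contributes a single edge, which is why the graph size depends on the number of distinct k-mers rather than on the number of reads.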

Because reads are not nodes in the de Bruijn graph and each unique k-mer is stored only once, de Bruijn graphs grow at most linearly with the input dataset size, making the DNA assembly problem tractable for large genomes. K-mers are usually stored in fast hash table structures in order to make the graph creation process as fast as possible. Moreover, k-mers are represented by graph edges rather than nodes, so the final sequence can be extracted by finding an Eulerian path traversing all edges of the graph instead of a Hamiltonian path traversing all nodes. This has a huge impact on the DNA assembly problem, because efficient algorithms exist to calculate Eulerian paths.


Figure 2.6: Tips and bulges in de Bruijn assembly graphs shown in red. Tips are branches in the graph that end without connecting to other parts of the graph. Bulges are branches from a node that come back to the main path after passing several edges. Bulges can be small, large or complex containing other bulges. (Image source: http://www.homolog.us/)


As with overlap graphs in OLC methods, de Bruijn graphs become very large, with millions of nodes, for large genome assemblies. Because de Bruijn graphs are based on k-mers, errors and noisy base-pairs in the dataset have a significant influence on the graph, as they produce different k-mers. In other words, de Bruijn graphs are more sensitive to sequencing errors than overlap graphs, which makes error detection very important. One should also keep in mind that DBG methods are usually used with data generated by NGS technologies, which are normally more error-prone. Assemblers usually define different types of errors and try to detect them after creating the graph. Errors include base insertions, base deletions and base substitutions. Errors that occur at the ends of reads usually create tips in the de Bruijn graph, which are branches that end in a dead end. Errors that occur in the middle of reads usually create bulges in the graph. These two types of graph structures are detected by assemblers and resolved before finding the Eulerian path in the graph. Distinguishing errors from repeat structures with certainty is usually not possible, so assemblers try to detect noisy parts using heuristics. Figure 2.6 shows tips and bulges in a de Bruijn graph.
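Tip detection can be sketched as a backward walk from every dead-end node. This is a simplified illustration, not the procedure of any assembler discussed here: the function name `find_tips` and the length threshold `max_tip_length` are assumptions, and practical tools typically combine such a length cutoff with coverage information.

```python
from collections import defaultdict

def find_tips(graph, max_tip_length=2):
    """Return dead-end branches (lists of nodes) of at most max_tip_length nodes."""
    predecessors = defaultdict(set)
    nodes = set(graph)
    for node, successors in graph.items():
        for nxt in successors:
            predecessors[nxt].add(node)
            nodes.add(nxt)

    tips = []
    for node in nodes:
        if graph.get(node):  # the node has outgoing edges, so it is not a dead end
            continue
        branch, current = [node], node
        # Walk backwards while the branch is short and unambiguous.
        while len(branch) <= max_tip_length and len(predecessors[current]) == 1:
            parent = next(iter(predecessors[current]))
            if len(graph[parent]) > 1:  # reached the fork the branch hangs off
                tips.append(branch[::-1])
                break
            branch.append(parent)
            current = parent
    return tips

# Main path A -> B -> C -> D -> E with a one-node tip X hanging off B.
graph = {"A": {"B"}, "B": {"C", "X"}, "C": {"D"}, "D": {"E"}}
print(find_tips(graph))  # [['X']]
```

The end of the main path (C, D, E) is not reported because it is longer than the threshold, which shows how strongly the result depends on this heuristic choice.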

After the graph simplification phase, an Eulerian path in the graph defines the resulting sequence for the assembly. However, the graph may contain fork situations, which are nodes with an out-degree greater than one, and these may give rise to more than one Eulerian path. Not all Eulerian paths in the


Figure 2.7: Two different Eulerian paths are conceivable for one set of reads. (Image source: http://sourceforge.net/apps/mediawiki/contrail-bio/index.php?title=Contrail)

the path which expresses the correct assembly. The heuristics used by different assemblers vary in this phase, which makes every assembler somewhat unique. Figure 2.7 shows two possible Eulerian paths in a de Bruijn graph created for a single set of reads. This happens because some nodes in the graph can have more than one outgoing edge, and the resulting branches may not converge later (see the node labeled CTG in Figure 2.7). Such nodes cause more than one Eulerian path to exist in the graph, although only one Eulerian path correctly represents the genome. For more information about de Bruijn graph techniques, refer to [PTW01, MKS10, Pop09].


2.3 Greedy Graph Methods

Greedy methods for the DNA assembly problem are based on one objective: choosing the best overlap match at the current state of the algorithm. The reads with the highest overlap score are selected and merged together [Pop09]. The process continues until no more overlaps can be found. Large genomes sequenced using NGS technologies, such as the human genome, have been shown to be very complex to assemble, and greedy algorithms alone cannot solve them: they usually get stuck in local maxima and are not able to provide complete assemblies in complex situations. However, they have little overhead in computation and time and are usually very fast. TIGR [SWAK95] and CAP3 [HM99] are among the first assemblers using greedy methods, and SSAKE [WSJH07], SHARCGS [DLBH07] and VCAKE [JRB+07] are among the newer attempts to solve the DNA assembly problem with a greedy approach.
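The greedy strategy can be sketched as repeatedly merging the pair of sequences with the longest overlap. This is a toy illustration that assumes error-free reads, exact suffix-prefix overlaps and a hypothetical `min_overlap` threshold; it does not reproduce any of the assemblers named above, and the function names are illustrative.

```python
def overlap_length(a: str, b: str, min_overlap: int) -> int:
    """Longest suffix of a equal to a prefix of b, or 0 if shorter than min_overlap."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads, min_overlap=3):
    """Merge the best-overlapping pair until no sufficient overlap remains."""
    sequences = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(sequences):
            for j, b in enumerate(sequences):
                if i != j:
                    length = overlap_length(a, b, min_overlap)
                    if length > best[0]:
                        best = (length, i, j)
        length, i, j = best
        if length == 0:
            return sequences  # no overlaps left: these are the contigs
        merged = sequences[i] + sequences[j][length:]
        sequences = [s for idx, s in enumerate(sequences) if idx not in (i, j)]
        sequences.append(merged)

print(greedy_assemble(["ATCGGA", "GGATTC", "TTCAAG"]))  # ['ATCGGATTCAAG']
```

Because each merge is final, an overlap caused by a repeat can lock in a wrong join that later steps cannot undo, which is the local-maximum behaviour described above.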

Recently there has been renewed interest in using greedy methods in different parts of the DNA assembly problem, with significant progress. For example, Chikhi et al. [CL11] use a greedy algorithm in their localized assembly method to create scaffolds directly from reads. They unified contig creation from reads and scaffold creation from contigs into a single phase that creates scaffolds from reads. The Meraculous


huge computational overheads. This thesis also tries to improve assembly results by taking a greedy view of the problem, which is explained in Chapter 3.

2.4 Summary

This chapter presented a literature review of the de novo DNA assembly problem. It first introduced the DNA assembly problem, its specifications and the current limitations that assemblers deal with. Three different approaches to the DNA assembly problem were described: Overlap-Layout-Consensus (OLC), de Bruijn Graph (DBG) and greedy methods. The specifications and limitations of each method were also presented for the two major types of DNA sequencing technologies. The next chapter presents our contig creation algorithm for NGS reads.


Chapter 3

New Contig Creation Algorithm

As discussed in the previous chapter, the DNA assembly problem is generally solved with heuristics. These include de Bruijn graph simplification and greedy techniques that decide on the correctness of graph edges heuristically. Different heuristics result in fragmented assemblies from different locations of the genome. By applying different heuristics and simplification methods, various assemblies can be generated for one genome, and the problem becomes worse when it is infeasible to accurately select the best result. This is mainly because in de novo assembly there is no reference genome to match the results against. Results with higher length-based metric values, such as the N50 parameter, are currently considered better assemblies.


The N50 value is a statistical measure of a set of numbers: it is the largest value in the set such that the elements greater than or equal to it together cover at least half of the sum of all elements [MKS10]. N50 is used in DNA assembly as a metric of result quality; larger N50 values indicate larger contigs.
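The definition translates directly into a short computation over contig lengths. The function name `n50` and the example lengths are hypothetical and serve only as an illustration of the definition above.

```python
def n50(lengths):
    """Largest length such that contigs at least that long cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")

# Hypothetical contig lengths with a total of 300.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
```

Here the two longest contigs (80 and 70) already sum to 150, which is half of the total, so the N50 is 70.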

However, there are experimental results [MPC+13, BFA+] showing that larger contigs do not necessarily mean improved results and can be misleading when they are not correctly assembled. For instance, a new technique for evaluating genome assemblers [MPC+13] first splits the contigs/scaffolds at locations where the left and right pieces map onto distant locations in the reference genome, and then calculates the N50 based on the split contigs, leading to more accurate measurements by discounting false positive links in assemblies. Such techniques essentially prevent the results from being biased by heuristics that accept many false positives during the assembly process. In this thesis, we use a similar technique: we first split the contigs at locations that do not map to nearby locations in the test reference genome and then calculate the N50 values.

3.1 Objectives

There are three main objectives in this thesis:

(1) Assemble fragments of the genome with the highest probability of correctness by avoiding the use of aggressive heuristics. Whenever there is more than one way to extend contigs based on the k-mers, instead of selecting one direction and continuing the process, we terminate the contig creation procedure to be sure about the contigs’ quality and correctness. With this behaviour, we end up with smaller contigs in some datasets compared to other assemblers, but we can be certain that our contigs are perfectly matched with the target genome. We compensate for the small contig size by running the algorithm in parallel for multiple k values and combining the results from the different runs at the end in order to obtain better lengths. This objective is thoroughly explained in section 3.2.

(2) Provide the ability to run the algorithm with different k parameters. As described in chapter 2, reads are split into overlapping segments of length k to create k-mers, which are the main inputs of the assembly algorithm. The k parameter has significant influence on assembly results and, due to a variety of reasons including uneven data coverage, noisy data and varying repeat structures in different genome locations, a single value for parameter k does not necessarily give the optimal result for all locations in the genome. Having a very large value for k results in false positive links in fragments, while a small value for k results in tangled graphs which make the problem impractical to solve [BNA+12]. Running the algorithms is very time- and space-consuming and it is not feasible to run multiple instances of the algorithm with dedicated memories in parallel. Trying to devise structures for multi k-mer assembly is a possible key to solving this problem.

This objective is thoroughly explained in section 3.3, and experimental results in chapter 4 show the influence of using multiple values for k on the quality of results.

(3) By generating contigs with different k values from the genome locations that are usually left out by other assemblers (because of using only one k value), there is a good chance of expanding contigs that are generated by other tools in order to obtain better results. Investigating the possibility of linking other tools’ contigs to generate high quality contigs is the main target of this objective.

This objective is thoroughly explained in section 3.3.1 and results from merging contig sets together are presented in section 4.3.

3.2 Producing Contigs

One of the most challenging problems in de novo DNA assembly is to find a good metric to measure the quality of created contigs. For new species that have not been sequenced before, no reference sequence is available for verification purposes. In the absence of a reference genome, there are length-based metrics such as the N50 value, which is widely used by assemblers to express the quality of results.

It is worth noting the critique in [MPC+13, BFA+] that larger contigs, which lead to better N50 values, do not necessarily mean better results in terms of accuracy, and that there can be many false positive links in generated contigs.

We also use the N50 value to measure the quality of our results. A detailed explanation of how to calculate more realistic N50 values is presented in chapter 4.

This thesis focuses on using methods which are more conservative in expanding contigs and do not attempt to create larger contigs by lowering the certainty of contigs. The same idea is also proposed by [CHS+11]. Our approach is based on the method first proposed in [CHS+11] and it improves the results significantly through changes to the algorithm flow and a new implementation, which are all described in this chapter and the appendices. Moreover, we use our generated contigs to improve results from other tools by importing their outputs to our system.

The assembly process can be described as follows:

(1) Stream the input data files to memory, store reads and pairing information. Fill in data structures for reads and k-mers and load the

Figure 3.1: Reverse-complemented reads are generated by processing the original read backwards and changing each base character to its complementary base (A ↔ T, C ↔ G).

(2) Extract k-mers while processing each read from the input files using a pre-defined k value. Use hash-tables to store k-mers and their occurrence positions in the read set. Because it is not possible to determine which DNA strand the reads and corresponding k-mers belong to, all reads are processed to generate their reverse-complement as well. This doubles the input data space but boosts the quality of the results significantly. Contigs that are the reverse-complement of each other are filtered at the end of the assembly process by assuming that they express the same location in the genome. Figure 3.1 depicts an example read and its reverse-complement.

(3) Detect “noisy” k-mers, which are hash-table entries that occur less often than a fixed threshold number in the whole set of reads. The assumption behind this noise detection technique is that the input reads are randomly distributed through the genome with roughly even coverage, therefore all k-mers should be seen at least some minimum number of times, and entries that are seen less often than the threshold value can be assumed to be noise. This threshold value can be estimated to be lower than, but close to, the genome coverage depth of the input dataset, because it is assumed that each base-pair in the genome is seen roughly C times, where C is the coverage depth. This part is explained in more detail in section 3.2.2.

Figure 3.2: For a read of length n, the right overlap (postfix) is a read whose base-pairs at positions 1 to n−1 match the original read’s base-pairs at positions 2 to n. Also, its left overlap (prefix) is a read whose base-pairs at positions 2 to n match the original read’s base-pairs at positions 1 to n−1.

(4) Find overlaps of length (k-1) between all k-mers and link k-mers that can be the prefix or postfix of each other. An example of prefix and postfix k-mers (left and right links) is shown in figure 3.2. This part is explained in more detail in section 3.2.3.


base information to create contigs based on k-mers. This part is also explained in more detail in section 3.2.3.

(6) Create contigs based on qualified k-mers with unique extensions until reaching dead-end or fork situations. This part is explained in more detail in section 3.2.4.

(7) Analyse generated contigs from different k values (which can be run in parallel) to find any promising overlap between them. Because sequences from both DNA strands are present in the input set, contigs are made from both strands in this step. Therefore reverse-complement contigs should be detected and only one of them should be kept. This part is explained in more detail in section 3.3.

(8) Import external contigs from other tools and analyse them, aiming to expand them even more by finding if they overlap with our generated contigs. This part is explained in more detail in section 3.4.
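Steps (2) and (3) of the list above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: reads are plain strings over A, C, G, T, quality scores and read positions are ignored, and the function names and the `min_occurrences` parameter are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Read the sequence backwards and swap A<->T and C<->G."""
    return seq.translate(COMPLEMENT)[::-1]


def count_kmers(reads, k):
    """Count every k-mer in the reads and in their reverse-complements."""
    counts = Counter()
    for read in reads:
        for strand in (read, reverse_complement(read)):
            for i in range(len(strand) - k + 1):
                counts[strand[i:i + k]] += 1
    return counts


def drop_noisy_kmers(counts, min_occurrences):
    """Keep only the k-mers seen at least `min_occurrences` times."""
    return {kmer: n for kmer, n in counts.items() if n >= min_occurrences}
```

A Python `Counter` plays the role of the hash-table here; a realistic implementation would also record where each k-mer occurs, as the step description requires.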

Figure 3.3 depicts a high level view of the proposed assembly algorithm. One of the most important aspects of our algorithm is the extensive use of quality scores during the assembly process. Also, this algorithm does not rely on external error detection and correction tools. Many DNA error detection tools use these quality scores to prune the data and detect noise before starting the assembly algorithm. However there are also some


Figure 3.4: One unique k-mer may appear in more than one read. k-mers that are seen fewer times than a pre-defined threshold can be treated as noise and filtered out.

use a method first described in [CHS+11] that handles the noisy parts of data based on the occurrence frequency of k-mers in input reads. We believe that, in addition to saving time and space by not running error detection tools, this approach also leads to better and more accurate results, which the experiments in section 4.2.1 also support. The minimum acceptable frequency of k-mers in reads can be adjusted by the user. Figure 3.4 shows how k-mers may appear in more than one read.

Our algorithm obtained its basic idea from the research in [CHS+11] and works to improve the quality of results. The differences between our algorithm (and implementation) and [CHS+11] can be summarized as follows:


unique extensions in the contig creation process. Although this is correct and generates very high quality contigs, it can be improved by adding k-mers that express a unique extension with a probability higher than a threshold value; we can therefore use a majority vote on the base-pair extensions, and the number of trusted k-mers increases. This consequently leads to larger contigs while keeping the quality of contigs very high. There are also situations in which one end of a k-mer is in a “harsh fork” situation, which cannot be resolved even by majority voting, while the other end is resolved. This will be discussed further in section 3.2.3. These k-mers are also not used by the Meraculous package but can be added to the trusted k-mers list in our implementation because they help to create larger contigs whose quality remains high compared to other tools.

• Different data structures and hash functions are used in our tool to produce better results in comparison to the Meraculous assembler’s implementation. Section 4.2.2 shows our tool’s improvements in comparison to the Meraculous package.

• The Meraculous assembler is not capable of running the algorithm for different k values in parallel, thus it has difficulties creating sufficiently large contigs from all genome locations on the datasets in our experiments. Experimental results support this idea and show the improvement when using multiple k values.

• Our tool is also designed to accept other assemblers’ contigs in order to analyse and expand them. There is no feature similar to this in the Meraculous package.

3.2.1 Input Data Loading and Reads/K-Mer Class Structures

All assemblers should be able to deal with large input files. It is assumed in this thesis that inputs come with pair information showing which two reads are connected as pairs. Algorithms are designed for Illumina technology reads and input sequence data must be in .fastq file format; however, other file formats can also be easily supported by adding appropriate parser code for them. In the case of the .fastq file format, there is an even number of files, each including read information for one set of pairs. Two files that present pair information must have an equal number of reads. Figure 3.5 shows a sample configuration file that includes addresses for .fastq files and Appendix A shows a sample set of input files in .fastq format.

Read objects are created by processing the input data. Quality scores are also stored and pairing information is set for all reads. The main algorithm does not work directly with these sequences and they are only used once to


Figure 3.6: Class diagram showing the Read and K-mer class structures. Each K-mer holds a list of the Read objects to which it belongs. K-mer ending labels are also represented with an Enumeration class.

the k-mer creation process. Figure 3.6 shows the class diagram for the Read and K-mer classes. Appendix B provides complete information regarding the class hierarchies, structures and implementation details.

The main algorithm can work with multiple k values. Each k value has its own k-mer set, which is created based on the input reads. .NET framework hash-table structures are used to store k-mer sets. The hash function used in our tool is the algorithm presented by Jon Skeet [Ske13] for generating hash codes for byte arrays, shown in Algorithm 1. In a single pass over the reads, all k-mers and their occurrence counts are extracted and stored in the hash-table. Each k-mer also keeps track of the reads that contain it. This


hash = 17;
// Cycle through each element in the array.
foreach (byte b in bytes)
{
    // Update the hash.
    hash = hash * 23 + b.GetHashCode();
}
return hash;

Algorithm 1: Jon Skeet’s hashing algorithm used in this thesis.
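For readers who do not use C#, the same multiply-and-add scheme can be sketched in Python. Two assumptions are made here: the C# arithmetic wraps around on 32-bit overflow (unchecked context), and the hash code of a byte is its numeric value. The function name `byte_array_hash` is hypothetical.

```python
def byte_array_hash(data):
    """Hash a bytes object with the 17/23 multiply-and-add scheme."""
    h = 17
    for b in data:
        # Keep only the low 32 bits to mimic wrapping 32-bit arithmetic.
        h = (h * 23 + b) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed integer.
    return h - 0x100000000 if h >= 0x80000000 else h
```

For the two-byte input (1, 2) the steps are 17 · 23 + 1 = 392 and 392 · 23 + 2 = 9018.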

Figure 3.7: Paired k-mers are two k-mers in two paired reads. The k-mer pairing relation is not unique. The k-mer CGTTG is assumed to be paired with the k-mer GTACC considering the left read pair, but the k-mer CGTTG can be seen in another read like the right read pair and it is assumed to be paired with


In the same way that reads keep pairing information, k-mers also keep pair information with other k-mers, but with a slight difference. Pairing between reads is unique, but a single k-mer may have more than one pair k-mer because k-mers occur in more than one location in different reads. Basically it is assumed that if read A has a pair read B, then the first k-mer of read A pairs with the first k-mer of read B, and so on. As depicted in Figure 3.7, this relationship is generally not unique. This type of information is also stored in data structures (described in Appendix B) and it may be useful in further analyses that use mate-pair information for the scaffolding problem, which is briefly presented in section 2.1.

In order to reduce memory usage, we store read and k-mer information as compactly as possible. This is achieved by reserving only 2 bits for each base-pair in sequences, as there are only four possible base-pair characters. “A” base-pairs are stored as 00, “C” base-pairs are stored as 10, “G” base-pairs are stored as 01 and “T” base-pairs are stored as 11. Therefore, each byte, which would normally hold only one base-pair, actually stores four base-pairs in our program, reducing memory usage by 75%. A library including encoding, decoding and other useful functions for compressing DNA sequences is implemented in our tool and is presented in Appendix B.
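The 2-bit encoding can be sketched as follows. The bit codes are the ones stated above; the order of bases within a byte (first base in the two highest bits) and the function names `pack` and `unpack` are assumptions made for this illustration, not details taken from Appendix B.

```python
CODE = {"A": 0b00, "C": 0b10, "G": 0b01, "T": 0b11}
BASE = {code: base for base, code in CODE.items()}


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        shift = 6 - 2 * (i % 4)  # first base goes into the two highest bits
        packed[i // 4] |= CODE[base] << shift
    return bytes(packed)


def unpack(packed, length):
    """Recover the first `length` bases from a packed sequence."""
    bases = []
    for i in range(length):
        shift = 6 - 2 * (i % 4)
        bases.append(BASE[(packed[i // 4] >> shift) & 0b11])
    return "".join(bases)
```

The sequence length has to be stored next to the packed bytes, because the padding bits of the last byte are indistinguishable from “A” bases.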

3.2.2 Removing Less Frequent k-mers

Many assemblers use error detection/correction techniques to find and resolve noise in input data, and then run the assembly algorithm on the corrected data. It is shown (e.g. in [KSS+10]) that using error detection/correction techniques improves the assembly results; however, there are some problems with these methods:

• Error detection/correction tools may filter out correct data because of lower coverage or other complexity in the data. However, this is inevitable and, to our knowledge, there is currently no other approach that addresses this problem.

• Error detection/correction tools are time demanding and their run time increases drastically when working with large inputs, e.g. the human genome, even though they only need to be run once.

Therefore in this research, we follow the idea from [CHS+11] of not using any error detection/correction tool beforehand and instead handling the noisy data in the middle of the assembly algorithm when creating contigs. In addition to a faster running time, it has also been shown that this approach can lead to better and more accurate results [CHS+11].

By creating the k-mer set, the occurrence number of every single k-mer in the read set is counted and stored in the hash-table structure. A minimum threshold can also be set by the user that defines the minimum occurrence number of k-mers in the input set. All entries that have fewer occurrences


Figure 3.8: Each node represents a k-mer and each edge defines an overlap between two k-mers. Nodes that have only one edge going in/out of them are considered qualified and will be detected by our algorithm. Some nodes, such as the one labelled in red, are in fork situations, meaning the algorithm cannot decide which k-mer succeeds it without using heuristics. Heuristics used to resolve these fork situations have a drastic influence on assemblers’ performance. These fork situations are the ones that could not be resolved by majority voting or other techniques.

3.2.3 Finding k-mer Overlaps

As described in the previous chapter, de Bruijn graph-based assembly methods create overlap graphs with each node containing a k-mer and each edge defining an overlap between two k-mer nodes, thus defining sequences of length (k+1). Using the whole k-mer set, this creates a very large and memory intensive de Bruijn graph whose many nodes and edges prevent the algorithms from effectively simplifying the graph when an inappropriate k value is used. The effect of using inappropriate k values in large de Bruijn graphs is explained in the previous chapter. In this research we follow the idea of not using the whole k-mer set to create the de Bruijn graph (as in [CHS+11, CL11]) and only consider k-mers which are not involved in fork locations. Figure 3.8 shows qualifying k-mers and a k-mer in a fork situation


Figure 3.9: Overlapping k-mers connect together and create larger fragments of DNA. This simple example only shows how k-mers can have right and left overlaps and does not show repeat structures in the genome, therefore in this example one final unique sequence can be achieved.

In order to find qualifying k-mers, first all overlaps should be detected. The basic idea of detecting the overlaps is to find which two k-mers share a prefix and a suffix sub-string of length (k-1). If k-mer A has a prefix (the sub-string from the first element to the one before the last element) equal to k-mer B’s suffix (the sub-string from the second element to the last element), then k-mer A overlaps k-mer B on the right and the two together make a (k+1)-mer. This extension can be checked from both ends to create left and right overlaps for all k-mers in the set. Only overlaps whose quality score in the overlapping base-pairs is higher than a defined threshold are considered in this step, which ensures that noisy data is skipped. Figure 3.9 shows how k-mers connect together.
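The overlap search can be sketched in Python by indexing k-mers on their (k-1)-length prefixes and suffixes. This is a simplified illustration: the quality-score filter described above is left out, and the function names are hypothetical.

```python
from collections import defaultdict


def right_overlaps(kmers):
    """Map each k-mer to the k-mers that overlap it on the right.

    B is a right overlap of A when the last k-1 bases of A equal
    the first k-1 bases of B.
    """
    by_prefix = defaultdict(list)
    for kmer in kmers:
        by_prefix[kmer[:-1]].append(kmer)
    return {kmer: sorted(by_prefix.get(kmer[1:], [])) for kmer in kmers}


def left_overlaps(kmers):
    """Map each k-mer to the k-mers that overlap it on the left."""
    by_suffix = defaultdict(list)
    for kmer in kmers:
        by_suffix[kmer[1:]].append(kmer)
    return {kmer: sorted(by_suffix.get(kmer[:-1], [])) for kmer in kmers}
```

With the k-mers ACG, CGT, CGA and GTT, the k-mer ACG has two right overlaps (CGA and CGT), which is the kind of fork the following labels are meant to describe.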


• (Resolved) All left/right overlapped k-mers agree on a unique extension base-pair. Figure 3.10 shows a Resolved state scenario.

Figure 3.10: Resolved State. All high quality extensions agree on base-pair A, selecting it as the true extension for the k-mer.

• (Dead-End) There is no left/right overlap.

• (Majority-Voted) Overlapped k-mers do not all agree on a unique extension base-pair, but the majority of entries vote for one extension with a probability higher than a defined threshold. Figure 3.11 shows a Majority-Voted state scenario.

• (Unresolved) If none of the above labels apply, the k-mer’s end is labelled as “unresolved”, which indicates a fork situation. Figure 3.12 shows an Unresolved state scenario.

k-mers that have a unique extension in both their right and left overlaps (“Resolved” or “Majority-Voted” labels) are considered “qualified”.
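The four labels and the “qualified” test can be sketched as below. It is assumed here that each end of a k-mer comes with a list holding one proposed extension base per high-quality observation; the 0.9 default for the majority threshold and the function names are hypothetical, since the thesis leaves the threshold to the user.

```python
from collections import Counter


def label_end(extension_bases, min_majority=0.9):
    """Label one end of a k-mer from the bases its overlaps propose.

    Returns (label, chosen_base); chosen_base is None when no base is chosen.
    """
    if not extension_bases:
        return "Dead-End", None
    votes = Counter(extension_bases)
    base, top = votes.most_common(1)[0]
    if len(votes) == 1:
        return "Resolved", base
    if top / len(extension_bases) > min_majority:
        return "Majority-Voted", base
    return "Unresolved", None


def is_qualified(left_label, right_label):
    """A k-mer is qualified when both ends have a unique extension."""
    unique = {"Resolved", "Majority-Voted"}
    return left_label in unique and right_label in unique
```

For example, 19 observations of A and one of C give a 0.95 share for A, so that end would be labelled Majority-Voted under the assumed threshold.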

The idea of having labels including “Resolved”, “Dead-End” and

Figure 3.11: Majority-Voted State. Not all high quality extensions agree on a unique base-pair, but most of them agree on base-pair A, selecting it as the unique base-pair extension. The minimum probability for Majority-Vote can be set by the user.

Figure 3.12: Unresolved State. Not all high quality extensions agree on a unique base-pair and none can be selected as a majority.

“Majority-Voted” label is a contribution of this research to the community.

Qualified k-mers can build unique and uncrossed paths through the large de Bruijn graph that do not have any forking nodes, therefore the algorithm does not need much time and memory space in comparison to other tools that create the full de Bruijn graph at the first step. In this way the most important information is obtained from the de Bruijn graph without any need


datasets like the human genome.

3.2.4 Contig Creation

Qualified k-mers are the base information used in the contig creation process. Each qualified k-mer has a unique single-base extension in both its right and left links. Thus, two overlapping k-mers can be created from one starting k-mer. Newly created k-mers are looked up in the qualified k-mer set and, if they exist, the base k-mer is extended by one base-pair (the k-mers are merged) and the process continues by following the extensions of the newly added k-mer. The contig creation process terminates when both ends of the contig reach a dead-end or unresolved situation with nothing to match from the qualified k-mer set. Selecting the base k-mer to start from is not important and can be done randomly. New contigs are generated until the qualified k-mer set runs out of elements. Algorithm 2 shows the procedure of creating contigs from qualified k-mers.
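The walk described in this section can be sketched in Python as follows. It is an illustration rather than the thesis implementation: the qualified set is assumed to be a dictionary mapping each k-mer (with k ≥ 2) to its pair of left and right extension bases, and the right end is extended fully before the left end instead of alternating between them.

```python
def build_contigs(qualified):
    """Walk unique extensions to build contigs.

    `qualified` maps each qualified k-mer to a (left_base, right_base)
    pair holding its unique single-base extensions.
    """
    remaining = dict(qualified)  # work on a copy, leave the input intact
    contigs = []
    while remaining:
        seed, (left_base, right_base) = remaining.popitem()
        k = len(seed)
        contig = seed
        # Extend to the right while the next k-mer is still qualified.
        while True:
            candidate = contig[-(k - 1):] + right_base
            if candidate not in remaining:
                break
            contig += right_base
            right_base = remaining.pop(candidate)[1]
        # Extend to the left in the same way.
        while True:
            candidate = left_base + contig[:k - 1]
            if candidate not in remaining:
                break
            contig = left_base + contig
            left_base = remaining.pop(candidate)[0]
        contigs.append(contig)
    return contigs
```

Because every used k-mer is removed from the working set, each qualified k-mer ends up in exactly one contig, and the choice of the starting k-mer does not change the contig it belongs to.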

3.3 Multi k-mer Assembly Solution

Many current assembly algorithms consider a fixed value for k and this parameter has a significant role in obtaining the best results. There are methods to analyse the input data and find the most appropriate k value for the given input [SWJ+09, BMK+08], however, to the best of our knowledge,


while qualifiedKmers is not empty do
    cntg ← instantiate a new contig object
    firstKmer ← pick and remove first element from qualifiedKmers
    rightExtension ← firstKmer’s rightExtension
    leftExtension ← firstKmer’s leftExtension
    cntg ← firstKmer
    rightTruncated ← false
    leftTruncated ← false
    finish ← false
    while finish is not true do
        finish ← true
        n ← length(cntg)
        // last k−1 bases of the contig followed by the right extension
        rightOverlapKmer ← cntg[n−k+1 : n] + rightExtension
        // left extension followed by the first k−1 bases of the contig
        leftOverlapKmer ← leftExtension + cntg[0 : k−1]
        if rightTruncated ≠ true then
            if qualifiedKmers contains rightOverlapKmer then
                cntg ← cntg + rightExtension
                rightExtension ← rightOverlapKmer’s rightExtension
                remove rightOverlapKmer from qualifiedKmers
                finish ← false
            else
                rightTruncated ← true
            end if
        end if
        if leftTruncated ≠ true then
            if qualifiedKmers contains leftOverlapKmer then
                cntg ← leftExtension + cntg
                leftExtension ← leftOverlapKmer’s leftExtension
                remove leftOverlapKmer from qualifiedKmers
                finish ← false
            else
                leftTruncated ← true
            end if
        end if
    end while
    add cntg to contigs
end while

Algorithm 2: Creating contigs from qualified k-mers.


data and calculate a single k value for the data set; this is not always correct, especially for human genome data, because of its size and complexity in repeat patterns. Moreover, repeating patterns in the genome have different characteristics and they play the most important role in the quality of assembly results. Different k values result in either resolving repeat structures, or being stuck in the middle of the contig creation process, and there is no single k value that can work for all locations of the genome. Small k values make the de Bruijn graph very tangled and messy, thus the paths are not fully detectable and the quality of results decreases. On the other hand, large k values may resolve repeat patterns of length less than k but may fail to detect overlaps between reads, particularly in low coverage regions, making the graph more fragmented [BNA+12].

There have been attempts in assemblers like [ZB08] to find the most appropriate k value and run the algorithms for multiple k values, but the assemblers themselves do not try to improve the overall results based on outputs from multiple k values.

In this research, the most important goal is to produce qualified contigs from all over the genome using different k values. The idea of using multiple k values in order to build contigs is also proposed by other assembly tools (e.g. in [BNA+12, MPC+11]) but we claim to have a very simple way of doing this without any complicated mathematics or complex structures that bring overhead.


contigs from all locations of the genome are being created even though they are from different runs. Therefore it is feasible to obtain larger contigs by analysing the results from different runs and trying to merge the overlapping parts. However, a significant portion of contigs from different k values cover the same locations in the genome, therefore repeating parts should be detected and removed at the end.

3.3.1 Contigs Merging

Contigs are contiguous portions of the genome that the assembler successfully constructs. Because there is no information regarding which strand the base reads belong to, contigs are created on both strands, which brings two versions of each contig (the contig itself and its reverse-complement) to the contig set. However, contigs do not have any overlap of length more than k with each other, because if they had, it would have been detected in previous steps of the assembly algorithm, unless they come from different k runs. Therefore attempting to merge contigs all generated from one fixed k value does not improve the results, but the idea of merging works when dealing with contigs generated from different k values.

Different k values generate different contigs with different lengths through the genome. In some assemblies, more repeats may be resolved and different locations of the genome may be constructed. The main reason behind this is already discussed in section 3.3. Some locations of the genomes which do


List<Contig> oldContigs
List<Contig> newContigs
for all cntg in oldContigs do
    n ← length(cntg)
    newCntg ← instantiate new contig object
    for i = n−1 downto 0 do
        newCntg[n−i−1] ← complementBP(cntg[i])
    end for
    add cntg to newContigs
    add newCntg to newContigs
end for
return newContigs

Algorithm 3: Creating Reverse-Complement Contigs.

every reasonable k value. Thus, contigs from different assembly runs do have overlaps and applying a merging technique should improve the results. The first step to merge contigs is to find overlaps between all of the input contigs. As contigs can belong to either of the genome strands, reverse-complements are generated for all of them at the first step. By doubling the dataset, we can be sure to find the overlap between two contigs that construct the same location in the genome but from different strands. Algorithm 3 shows how the contig set doubles in size when creating reverse-complement versions. In order to find extensions for the contigs, an algorithm is needed to check whether there is any overlap between two input contigs. There are three situations in which two contigs can be linked together:

• (1): The first contig’s ending base-pairs match the second contig’s starting base-pairs, thus the first contig can be linked to the second contig from the left. The algorithm to check this condition is presented as Algorithm 4.

• (2): The first contig is completely repeated in the second contig, thus the second contig is the merging result. The algorithm to check this condition is presented as Algorithm 5.

• (3): The first contig’s starting base-pairs match the second contig’s ending base-pairs, thus the first contig can be linked to the second contig from the right. The algorithm to check this condition is presented as Algorithm 6.

p ← L1 − 1
while p ≥ CONTIGS_MIN_OVERLAP do
    match ← true
    for i = 0 to p − 1 do
        if cntg1[L1 − p + i] ≠ cntg2[i] then
            match ← false
            break
        end if
    end for
    if match then
        return cntg1 + cntg2.substr(p)
    end if
    p ← p − 1
end while
return null

Algorithm 4: Contigs left link check algorithm

p ← 0
while p + L1 ≤ L2 do
    match ← true
    for i = 0 to L1 − 1 do
        if cntg1[i] ≠ cntg2[i + p] then
            match ← false
            break
        end if
    end for
    if match then
        return cntg2
    end if
    p ← p + 1
end while
return null

Algorithm 5: Contigs sub-string check algorithm

p ← L1 − 1
while p ≥ CONTIGS_MIN_OVERLAP do
    match ← true
    for i = 0 to p − 1 do
        if cntg1[i] ≠ cntg2[L2 − p + i] then
            match ← false
            break
        end if
    end for
    if match then
        return cntg2 + cntg1.substr(p)
    end if
    p ← p − 1
end while
return null

Algorithm 6: Contigs right link check algorithm

Contig cntg1  // cntg1 is always the smaller contig
Contig cntg2
L1 ← length(cntg1)
L2 ← length(cntg2)
consensus ← RightLinkCheck(cntg1, cntg2)
if consensus ≠ null then
    return consensus
end if
consensus ← LeftLinkCheck(cntg1, cntg2)
if consensus ≠ null then
    return consensus
end if
consensus ← SubStringCheck(cntg1, cntg2)
if consensus ≠ null then
    return consensus
end if
return null

Algorithm 7: Contigs overlap check algorithm

Algorithm 7 shows the procedure of finding the overlap between two input contigs (the consensus sequence). It calls the procedures presented in Algorithm 4, Algorithm 5 and Algorithm 6 to check all conditions in which two contigs can generate a consensus sequence. The minimum overlap length between contigs can be set in the assembler’s configuration file and is usually equal to the minimum k value considered. By being able to merge any two input contigs, an iterative procedure can be devised to merge and extend contigs until no more extension is possible. Algorithm 8 shows this procedure.
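The three link conditions and the iterative merge can be sketched compactly in Python. This is a simplified illustration, not the thesis code: the constant `MIN_OVERLAP` stands in for the configured minimum overlap and its value is hypothetical, containment is tested before the two link checks, and each base contig is merged with the first overlapping partner only.

```python
MIN_OVERLAP = 3  # hypothetical value; must be at least 1


def left_link(c1, c2, min_overlap=MIN_OVERLAP):
    """Join c1 and c2 when a suffix of c1 equals a prefix of c2."""
    for p in range(min(len(c1), len(c2)) - 1, min_overlap - 1, -1):
        if c1[-p:] == c2[:p]:
            return c1 + c2[p:]
    return None


def merge_pair(a, b, min_overlap=MIN_OVERLAP):
    """Return the consensus of two contigs, or None if they do not overlap."""
    c1, c2 = sorted((a, b), key=len)  # c1 is the shorter contig
    if c1 in c2:  # containment: the longer contig is the result
        return c2
    return left_link(c2, c1, min_overlap) or left_link(c1, c2, min_overlap)


def merge_all(contigs, min_overlap=MIN_OVERLAP):
    """Repeatedly merge overlapping contigs until nothing changes."""
    pending = list(contigs)
    final = []
    while pending:
        base = pending.pop(0)
        for i, other in enumerate(pending):
            merged = merge_pair(base, other, min_overlap)
            if merged is not None:
                pending[i] = merged  # the consensus replaces the partner
                break
        else:
            final.append(base)  # no partner found: base is finished
    return final
```

In this sketch the contigs ACGTAC and TACGG share the three bases TAC and merge into ACGTACGG. Short minimum overlaps make accidental matches in repeat regions more likely, which is the false positive risk discussed in section 3.4.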

3.4 External Contigs Expansion

Contigs created using the approach described in this thesis are assumed to express certain fragments in the genome with high probability. Running


finalContigs ← empty list
while contigs is not empty do
    baseContig ← contigs[0]
    remove baseContig from contigs
    overlapFound ← false
    List<Contig> newlyAddedContigs
    for all cntg in contigs do
        consensus ← ContigsOverlaped(baseContig, cntg)
        if consensus ≠ null then
            remove cntg from contigs
            add consensus to newlyAddedContigs
            overlapFound ← true
            if consensus == cntg then
                break
            end if
        end if
    end for
    add newlyAddedContigs to contigs
    if overlapFound == false then
        add baseContig to finalContigs
    end if
end while
return finalContigs

Algorithm 8: Iterative contig merging and extension.


different runs usually leads to better results. While merging results from different runs of our own assembly algorithm is useful, importing contigs from other tools can also be very beneficial. The same set of expansion and merging algorithms can be performed on imported contigs too. However, this also creates false positive links between the contigs due to sequences in repeating regions. Currently, we detect the false links after the contig creation process by aligning and comparing the fragments to the human reference genome, and only consider the correctly aligned fragments for evaluating the algorithm. Devising techniques to prevent false positives during the merging algorithm is part of our future work for this research.

There are definitely some areas in the genome that are covered by other assemblers. Also, different assemblers can construct different locations of one genome because they use different heuristics and assumptions. Therefore merging results from different assemblies should lead to better contigs. By having all contigs which are built from different k values, there is a better chance of creating larger contigs from state of the art algorithms while not reducing the contigs’ correctness. The procedure of merging external contigs with our generated result is the same as the algorithm described in section 3.3.1. The experimental results in section 4.2.2 show that importing other tools’ contigs to our system and performing the expansion algorithm can help obtain significantly better results.


3.5 Summary

This chapter described the main algorithms used in this thesis to create contigs from input short reads. Methods to load input data into memory, store it in the designed data structures and run the algorithms that create contigs were presented. Moreover, running the assembly algorithm for multiple k values in parallel was described in section 3.3. Finally, we proposed a method to merge contigs from different assembly runs, together with the ability to utilize external contigs from other tools in order to improve the quality of their results.


Chapter 4

Experimental Results

4.1 Experimental Results Terminology

To the best of our knowledge, the de novo DNA assembly problem for the human genome is still an open problem. It is discussed in [MPC+13] that when dealing with complex genomes, using different available assemblers may not help to obtain better results unless there is better input data with less noise, better coverage, longer reads, etc. Therefore, it is believed that currently the most important problem is the data and not the algorithms. However, algorithms also vary significantly: some are not even scalable to human genomes, and others that are capable obtain limited results compared to results from Sanger data.


in chapter 2. A larger N50 value shows that larger contigs are created, which can primarily be considered a better result. However, N50 values can sometimes be misleading when the generated contigs are not accurate. Unfortunately deciding if a contig is correct or not is currently i