Analyzing genomic datasets by finding overlaps between coordinates in different reference genomes


Long Hoang Ho

29th August 2016


Abstract

A reference genome is a data structure that represents the genome of a given species. For simplicity, a reference genome can be viewed as a linear sequence: a long line of letters stored as text strings. The position of a single letter is referred to as a genomic location.

The similarity between two genomes can be examined with the liftOver tool, which takes a genomic location in genome A and lifts it over to genome B. The problem with this analysis is that a majority of the genomic locations in genome A cannot be lifted over to genome B, because some areas are not defined in genome B due to lacking coverage of the input data.

In this thesis we have explored a better way of performing this comparison, using a graph-based representation. We have created a tool for performing the analysis, which is available online. We compared our tool against an existing method and found a higher number of matches when analyzing genomes across species. Additionally, we have described the weaknesses of our method in order to lay a foundation for future improvements.


Acknowledgements

I want to thank my supervisors Lex Nederbragt and Geir Kjetil Sandve for providing guidance through the process. I also want to extend a special thanks to PhD candidate Ivar Grytten and Esten Høyland Leonardsen for helping me along the way. Lastly, I want to thank my friends and family for providing moral support and encouragement for the last months when it was needed.


List of Figures

2.1 A graph with completely separate paths for every input sequence
2.2 A graph with a path for every possible nucleotide sequence
2.3 Example of an axt format
2.4 Example of size for the two first chromosomes in an axt
2.5 Example of a complete BED format
2.6 Genome Browser’s translation of BED score values
2.7 Lift over a genomic location in mouse genome to a human genome
3.1 Node with variables
3.2 Example of axt format (Primary chr1)
3.3 Example of axt format (Primary chr2)
3.4 Chromosome size for primary chromosome
3.5 Chromosome size for aligning chromosome
3.6 Genomic locations in primary chromosome
3.7 Genomic locations in aligning chromosome
3.8 Extracted from both axt formats
3.9 Black node 0 is appended to the graph
3.11 Blue node 7 and black node 2 are appended to the graph
3.16 Table for primary chromosome
3.17 Primary chromosome is completed in the graph
3.18 Extracted with sorting on aligning chromosome corresponding to its start
3.19 Red node 12 is appended to the graph and black node 0 is updated
3.20 Edge connection between black node 0 and 2. In addition, black node 2 is updated
3.21 Red node 13 is appended to the graph
3.22 Black node 3 is updated
3.23 Red node 14 is appended to the graph and black node 1 is updated
3.24 Black node 4 is updated
3.25 Red node 15 is appended and black node 5 is updated
3.26 Last red node 16 is appended to the graph
3.27 Table for figure 3.28
3.28 Complete graph of example datasets
3.29 Genomic locations in primary chromosome
3.30 Genomic locations in the aligning chromosome
3.31 Small part from the example graph
4.1 Initialisation of GBLiftOver tool in terminal
4.2 Result on a real datasets
4.3 Intersection of the result with max step = 40000
4.4 Two different genomes with a common representation


List of Tables

3.1 Result table of the example datasets


Chapter 1 Introduction

DNA is a biomolecule that encodes how an organism is built. Its structure comprises two twisted strands of nucleic acid carrying genetic information. As species evolve, this genetic information is passed down from a parent organism to its descendants. The inheritable units are called genes [9].

The sum of an organism’s DNA, including its genes, is referred to as a genome [10][16, Chapter 1]. The genomes of many species have been studied and structured as data, referred to as reference genomes. This genetic information has been collected and stored in databases, where each species has its own reference genome and every reference genome defines a certain species. The current human reference genome, GRCh38, is assembled from several individuals and serves as a sample that represents the entire human genome [11].

When using the data from a reference genome in research, there is a need for simplification prior to comparison. A linear sequence is a common way of depicting a reference genome. It can be viewed as a long line of bases stored as text strings. Features and events along the line can in turn be viewed as points and intervals. The position of one base is referred to as a genomic location. Genomic locations in one reference genome can be compared with genomic locations in other reference genomes. This type of comparison can determine whether two genomes are similar and whether the species are related.

This linear model facilitates the study of genetic information. But as research progresses and we discover more and more variation within reference genomes, a higher level of precision and a better foundation become necessary to represent this variation. A graph representation has been proposed as a new way of modelling reference genomes [5]. A graph representation is able to depict more complex relationships than the standard linear models.

Additionally, when analysing a linear sequence, one commonly uses the UCSC liftOver tool to "translate" the genomic locations in one genome and transfer them into another genome. The main problem with using the liftOver tool is that the majority of genomic locations in one genome cannot be directly "translated", because the same genomic locations may not exist in the second genome.

1.1 Aims of the thesis

Firstly, the aim of this thesis is to develop an improved tool for comparing genomic locations between two genomes by utilizing graph representation.

Secondly, we wish to test whether the new method can provide similar or better results without using the liftOver tool. This thesis will present our own tool, called the "Graph Based LiftOver" tool (GBLiftOver tool). The design of the algorithm will be presented in chapter 3 as we demonstrate how the algorithm works.

The algorithm is implemented in the GBLiftOver tool, which is available online through GitHub. Instructions for retrieving and using the program can be found in Appendix A.

The early stages of the thesis were done in cooperation with another master’s student [1] working on a graph-related problem. The section covering the background is to some degree the result of this team effort.


Chapter 2 Background

This chapter covers the fundamental theoretical knowledge needed to understand the thesis. We begin with a brief introduction to general biology. Thereafter we turn to bioinformatics and present graph representations, as well as the datasets needed in the description of our tool.

2.1 Genetics

Every living organism possesses genetic information. It is stored in the molecule deoxyribonucleic acid (DNA) [7]. The information is encoded by chaining together smaller building blocks called nucleotides, which are joined through a phosphate-sugar backbone into a strand. Each nucleotide contains one of the nitrogenous bases Adenine (A), Cytosine (C), Guanine (G) or Thymine (T). Each base has a complementary base, A pairs with T and C pairs with G, and the two bind to form a base pair (bp). Larger numbers of base pairs are usually expressed with the standard SI prefixes¹. The two ends of a strand differ in structure, which gives the strand a direction; the ends are called the 5’ end and the 3’ end. A single strand is conventionally written with the 5’ end on the left and the 3’ end on the right, also called upstream and downstream, respectively. The DNA molecule consists of two reverse complementary strands that are connected in a double helix structure [16, Chapter 1]. The two strands have opposing directions, and every base in one strand is paired with its complement in the other. Since each strand is the complement of the other, DNA is usually represented by only one of them. DNA can therefore be viewed as a linear sequence of discrete units, written with the four letters that represent the nucleotides. The complete set of DNA in an organism is called a genome.
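The base-pairing rule can be illustrated with a short Python sketch. It is only an illustration and not part of the tool presented in this thesis; the function name `reverse_complement` is our own choice. Given one strand written from 5’ to 3’, it returns the opposite strand, also written from 5’ to 3’.

```python
# Complementary bases: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5' to 3' direction."""
    # The opposite strand runs in the reverse direction, so we walk backwards
    # through the input and replace every base by its complement.
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


print(reverse_complement("AAGC"))  # GCTT
```

This is why storing a single strand is sufficient: the other strand can always be derived from it.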

2.1.1 The central dogma

The central dogma of molecular biology describes the process of transforming genetic information into a functional product through transcription and translation. Transcription is the process in which the information contained in a particular DNA sequence is copied into a new form, called messenger RNA (mRNA), which is later translated into proteins. The mRNA is a sequence of nucleotides with the bases A, C and G, and Uracil (U) instead of T. To decode the mRNA into proteins, it is divided into triplets of nucleotides called codons [16, Chapter 1]. The cell decodes the codons and creates chains of amino acids, which are transformed into functional proteins. The relationship between codons and amino acids can be looked up in a table called the standard genetic code [13, Chapter 1]. A protein is a sequence comprised of varying types and numbers of amino acids. Only a fraction of the nucleotides in DNA act as coding regions for proteins; these are called exons. The remaining non-coding regions of the genetic sequence are called introns. In humans, coding regions make up about 1.3 % of the genome [13, Chapter 4].

¹ 1,000 bp = 1 kb, 1,000,000 bp = 1 Mb, 1,000,000,000 bp = 1 Gb
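As a minimal sketch of transcription and of the division into codons, the Python snippet below rewrites a DNA string as mRNA and splits it into triplets. It assumes that the input is the strand whose letters the mRNA repeats (with U in place of T); the function names are hypothetical and the lookup in the genetic code table is left out.

```python
def transcribe(dna: str) -> str:
    """Return the mRNA with the same letters as the DNA strand, using U instead of T."""
    return dna.upper().replace("T", "U")


def codons(mrna: str) -> list:
    """Split an mRNA string into consecutive triplets, dropping an incomplete tail."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


mrna = transcribe("ATGGCCTAA")
print(mrna)          # AUGGCCUAA
print(codons(mrna))  # ['AUG', 'GCC', 'UAA']
```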

2.1.2 Chromosome

The human genome is divided into 23 pairs of chromosomes [16, Chapter 1]. In each cell, the DNA is coiled around proteins called histones. Together, they compose the structure called a chromosome.

At the chromosome’s center lies the centromere, which divides the chromosome into two parts. The upper part is called the p arm and is considered the short arm, while the lower part is called the q arm and is considered the long arm [4]. The centromere gives the chromosome a characteristic shape, which can be used to pinpoint the locations of specific genes.

2.1.3 A reference genome

Various methods have been developed to collect genetic information and analyse it as data. A reference genome is a data structure which contains genetic information for a given species. The reference genome contains a set of contiguous nucleotide sequences, called contigs, which are combined into larger scaffolds [15]. These scaffolds are combined to form the genome of the species. The first reference genome was created from samples of many individuals, merged into a linear consensus sequence that was representative of the species as a whole. The consensus sequence consists of the most frequently occurring nucleotides across different individuals [6]. A position in a reference genome is referred to as a locus (plural: loci).

Alternative loci refer to positions on a reference genome for which a different sequence is known to persist in some individuals of that species.


2.1.4 The human genome

The human genome contains almost 3 billion base pairs, which are spread over 46 chromosomes and are assumed to contain about 23,000 genes [13]. The latest human reference genome version is called GRCh38 [11] and contains 261 alternate loci.

2.2 Graph based genome representations

In the introduction, we briefly mentioned how a linear model simplifies the understanding of genetic information. But as we gradually discover more and more variation in the reference genome, the linear model becomes insufficient in its accuracy.

This section will present a graph representation as an alternative to the linear model. Graphs have greater flexibility, making it possible to describe more complex genetic relationships.

We will present the graph model used in the next chapter and not go into more detail than necessary. Readers interested in further details about graphs, both in general and in bioinformatics, are referred to the bibliography [18, Chapter 9][12].

2.2.1 Graph model

The graphs are presented as a set of nodes connected by edges. The graphs show the genetic data of the reference genome. The nodes represent the nucleotides. Two nodes are connected with an edge if they originate from a consecutive pair of nucleotides in the input data. A sequence of nodes where there exists an edge for every pair of consecutive nodes is called a path. If the input data exhibits variations, we need to convey this information. This is of great importance as the input data may show different types of variations. A graph can capture this through the possibility of several paths through a variable region.

Due to the divergences, the graph structure needs to be flexible and tolerate variability. The risk, however, of added flexibility is the loss of consistency.

Figure 2.1 illustrates a graph with little variation, while figure 2.2 illustrates a graph with greater variation.

START

A G G T C

A G C T C

A T C T C

END

Figure 2.1: A graph with completely separate paths for every input sequence

START

A

C

G

T

END

Figure 2.2: A graph with a path for every possible nucleotide sequence
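The graph model can be sketched in a few lines of Python. The sketch below builds a graph in the style of figure 2.1, with one separate path per input sequence between shared START and END nodes, and checks the definition of a path. The integer node ids, the dictionary layout and the function names are assumptions made for this illustration only.

```python
from collections import defaultdict


def build_separate_paths(sequences):
    """Build a graph with one path per input sequence between START and END.

    Nodes are integer ids; labels[id] holds the nucleotide (or 'START'/'END').
    """
    labels = {0: "START", 1: "END"}
    edges = defaultdict(set)
    next_id = 2
    for sequence in sequences:
        previous = 0
        for base in sequence:
            labels[next_id] = base
            edges[previous].add(next_id)  # edge between consecutive nucleotides
            previous = next_id
            next_id += 1
        edges[previous].add(1)  # the last nucleotide connects to END
    return labels, edges


def is_path(edges, nodes):
    """A node sequence is a path if every consecutive pair is joined by an edge."""
    return all(b in edges[a] for a, b in zip(nodes, nodes[1:]))


labels, edges = build_separate_paths(["AGGTC", "AGCTC", "ATCTC"])
print(len(labels))                            # 17: START, END and 3 x 5 nucleotides
print(is_path(edges, [0, 2, 3, 4, 5, 6, 1]))  # True: the first input sequence
print(is_path(edges, [0, 2, 8]))              # False: nodes from different inputs
```

A graph in the style of figure 2.2 would instead share one node per nucleotide, which allows more paths than the input contains; this is the trade-off between flexibility and consistency described above.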

2.2.2 Aligning sequences

An alignment is used to compare sequences. It can determine whether the sequences are homologous, that is, whether they share a common evolutionary ancestry. The process identifies the most likely alignment, aiming to discover a correspondence between the sequences. Additionally, it can determine whether the similarity between the sequences is more significant than a random coincidence [16, Chapter 2].

The evolutionary relationship between the sequences may change as evolution progresses. There are four possible types of nucleotide transformation involved in evolution:

• No change of the nucleotide
• A substitution of a nucleotide A with a nucleotide B
• A deletion of a nucleotide
• An insertion of a nucleotide

An alignment can be given a score that indicates how well two sequences are aligned. Two sequences with a higher score are evolutionarily closer than two sequences with a lower score. The score is based on a scoring system², which defines the scores for matches and mismatches.
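To make the idea of an alignment score concrete, the following Python sketch scores two already aligned sequences column by column. The values +1 for a match, -1 for a mismatch and -2 for a gap are arbitrary assumptions chosen for the example; real scoring systems such as the one referenced in the footnote use a full substitution matrix.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score two aligned sequences of equal length; '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap        # insertion or deletion
        elif x == y:
            score += match      # no change
        else:
            score += mismatch   # substitution
    return score


# Four matches, one substitution (G/C) and one gap: 4 - 1 - 2 = 1
print(alignment_score("ACGT-A", "ACCTTA"))  # 1
```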

2.2.3 Mapping

One of the operations that a graph model must support is mapping. A mapping describes the positions of nucleotides in a sequence and indicates the distance between two nucleotides. It is the process of finding a relation between a single nucleotide in one string and a nucleotide in another string. Two nucleotides from two different sequences can be in the same position or in different positions.

2.3 Data File Format

When a genome sequence is used for analysis, displaying the DNA sequence efficiently is critical. A plain file with billions of letters of genomic DNA provides too much information to be helpful. The "UCSC Genome Browser" addresses this by collecting different types of information and combining all relevant information in one site.

The "UCSC Genome Bioinformatics" is a website³ which holds the reference sequences for a large collection of genomes. It is developed and maintained by the Genome Bioinformatics Group, a cross-departmental team within the UC Santa Cruz Genomics Institute at the University of California Santa Cruz (UCSC). It contains several different types of general datasets, and each dataset is available in multiple formats for different purposes.

There are about twenty different file formats at UCSC. Many of them have their own benefits, but we will focus on three: the axt, BED and sizes formats. This thesis will not cover the remaining formats; interested readers are encouraged to visit the UCSC website⁴.

² http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt
³ http://genome.ucsc.edu/index.html
⁴ https://genome.ucsc.edu/FAQ/FAQformat.html


2.3.1 Axt format

The axt format is essential for building the graph. It is produced by Blastz, an alignment tool from Webb Miller’s lab at Penn State University. An axt file describes alignments between two chromosomes in a compact form. Each alignment block consists of three lines: a summary line and two sequence lines, one for each of the two chromosomes. The blocks are separated by a blank line. Figure 2.3 shows the first two blocks of an axt file⁵.

0 chr19 3001012 3001075 chr11 70568380 70568443 - 3500
TCAGCTCATAAATCACCTCCTGCCACAAGCCTGGCCTGGTCCCAGGAGAGTGTCCAGGCTCAGA
TCTGTTCATAAACCACCTGCCATGACAAGCCTGGCCTGTTCCCAAGACAATGTCCAGGCTCAGA

1 chr19 3008279 3008357 chr11 70573976 70574054 - 3900
CACAATCTTCACATTGAGATCCTGAGTTGCTGATCAGAATGGAAGGCTGAGCTAAGATGAGCGACGAGGCAATGTCACA
CACAGTCTTCACATTGAGGTACCAAGTTGTGGATCAGAATGGAAAGCTAGGCTATGATGAGGGACAGTGCGCTGTCACA

Figure 2.3: Example of an axt format

The first line of a block is called the information line; it identifies the primary chromosome and the aligning chromosome. The second and third lines contain the sequences of the primary and the aligning chromosome, respectively. After the sequences, a blank line indicates the end of the block.

Reading the information line in figure 2.3 from left to right:

Alignment number: Starts at 0 and increases by 1 for each block.
Chromosome (Primary): Chromosome name.
Alignment start (Primary): Start position.
Alignment end (Primary): End position.
Chromosome (Aligning): Chromosome name.
Alignment start (Aligning): Start position.
Alignment end (Aligning): End position.
Strand (Aligning): If the strand value is ’-’, the aligning start and end fields are relative to the reverse-complemented coordinates of its chromosome.
Blastz score: Each organism has its own blastz scoring matrix.

⁵ https://genome.ucsc.edu/goldenPath/help/axt.html

The first line in figure 2.3 holds all the information necessary for this thesis.

The last two lines will not be used and the sequences in the axt format will therefore not be relevant.
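Reading the information line can be sketched as follows in Python. The nine fields are taken in the order listed above; the class name `AxtHeader` and its attribute names are our own and do not necessarily match the implementation of the tool.

```python
from typing import NamedTuple


class AxtHeader(NamedTuple):
    number: int
    primary_chrom: str
    primary_start: int
    primary_end: int
    aligning_chrom: str
    aligning_start: int
    aligning_end: int
    strand: str
    score: int


def parse_axt_header(line: str) -> AxtHeader:
    """Parse the information line of one axt alignment block."""
    f = line.split()
    if len(f) != 9:
        raise ValueError("an axt information line has nine fields")
    return AxtHeader(int(f[0]), f[1], int(f[2]), int(f[3]),
                     f[4], int(f[5]), int(f[6]), f[7], int(f[8]))


header = parse_axt_header("0 chr19 3001012 3001075 chr11 70568380 70568443 - 3500")
print(header.primary_chrom, header.primary_start, header.primary_end)
# chr19 3001012 3001075
```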

2.3.2 Sizes format

The sizes format contains the lengths of the primary and aligning chromosomes from the axt format. Since the axt format involves two sets of chromosomes, primary and aligning, we need two sizes files, because the chromosomes differ in size. Each line consists of two columns separated by a tab: the first column contains the sequence name and the second column contains its length. Figure 2.4 shows an example of the sizes format.

chr1 2492621
chr2 2431373

Figure 2.4: Example of size for the two first chromosomes in an axt
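A sizes file maps naturally onto a dictionary from sequence name to length. The Python sketch below is a minimal illustration with a hypothetical function name; it splits on any whitespace, so it accepts the tab-separated layout described above.

```python
def parse_sizes(text: str) -> dict:
    """Return {sequence name: length} from the contents of a sizes file."""
    sizes = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length = line.split()
        sizes[name] = int(length)
    return sizes


print(parse_sizes("chr1\t2492621\nchr2\t2431373\n"))
# {'chr1': 2492621, 'chr2': 2431373}
```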

2.3.3 BED format

The BED (Browser Extensible Data) format provides the genomic locations we want to investigate in the primary and aligning chromosomes.

A BED line has three required fields and nine optional fields, separated by tabs. The number of fields must be consistent across the lines of a file. Figure 2.5 shows an example of a complete BED format.


track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601

Figure 2.5: Example of a complete BED format

The first line in figure 2.5 is the track header line, which is not relevant in this thesis. Reading the remaining lines in figure 2.5 from left to right:

chrom: The name of the chromosome.
chromStart: The starting position of the feature in the chromosome.
chromEnd: The ending position of the feature in the chromosome.
name: The name of the BED line.
score: A score between 0 and 1000. If the track line attribute "useScore" is set to 1, the score determines the level of grey in which the feature is displayed; a higher number gives a darker grey. Figure 2.6 shows how the UCSC website⁶ translates score values into shades of grey.

Figure 2.6: Genome Browser’s translation of BED score values

strand: Either ’+’ or ’-’.
thickStart: The starting position at which the feature is drawn thickly, for example a start codon. If there is no thick part, thickStart is normally set to the chromStart position.
thickEnd: The ending position at which the feature is drawn thickly, for example a stop codon. If there is no thick part, thickEnd is normally set to the chromStart position.
itemRgb: An RGB value of the form R,G,B. If the track line attribute "itemRgb" is set to "on", the RGB value determines the display colour of the data contained in the BED line.
blockCount: The number of blocks in the BED line.
blockSizes: A comma-separated list of the block sizes. The number of items should correspond to blockCount.
blockStarts: A comma-separated list of block starts. All positions should be calculated relative to chromStart. The number of items should correspond to blockCount.

⁶ https://genome.ucsc.edu/FAQ/FAQformat.html#format1

The first three columns are the required fields, while the rest are optional. The optional fields are not used in this thesis.
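Since only the three required fields are used, reading a BED file reduces to the short Python sketch below. The function name is hypothetical, and skipping lines that start with "track", "browser" or "#" is our assumption about how header lines should be handled.

```python
def parse_bed(text: str) -> list:
    """Return (chrom, chromStart, chromEnd) for every data line of a BED file."""
    locations = []
    for line in text.splitlines():
        # Skip blank lines and header lines; only data lines carry locations.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        locations.append((fields[0], int(fields[1]), int(fields[2])))
    return locations


bed_text = """track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
"""
print(parse_bed(bed_text))
# [('chr22', 1000, 5000), ('chr22', 2000, 6000)]
```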

2.4 UCSC LiftOver

In the introduction, the UCSC liftOver tool [14] was presented as a tool which "translates" or "lifts" the genomic locations in one genome and transfers them to another genome. After the lift, one can compare the distance between the genomic locations to see whether the genomes share similarities. The main problem with the liftOver tool is that a large fraction of the genomic locations may not get lifted, because the genomic locations in the first genome are not present in the second genome.

One way to avoid the problem is to extend the genomic locations to intervals, so that each interval contains a sequence instead of a single nucleotide. After the extension, the intervals in the first genome are lifted to very similar intervals in the second genome. Some parts may not be lifted over, but users can choose the minimum ratio of the interval that has to exist in the other genome. We will not go into detail about how this works, since we are only interested in the result from the liftOver tool. Figure 2.7 illustrates these steps.


[Figure 2.7 shows four steps, where x marks a genomic position in mouse and o a genomic position in human: (1) a genomic position in the mouse genome; (2) the genomic position is extended to an interval; (3) the interval is lifted over to human; (4) the distance between the interval and the genomic locations in human is computed.]

Figure 2.7: Lift over a genomic location in mouse genome to a human genome


Chapter 3 The algorithm "Graph Based LiftOver" tool

This chapter will introduce the "Graph Based Liftover" tool (GBLiftOver for short). In order to do this, the chapter is divided into four sections.

The first section will contain definitions of the elements and structures involved. This description will give a conceptual overview of the construction of the graph.

After the overview, the next two sections will present the datasets and the implementation of the GBLiftOver tool. For simplicity we will use example datasets rather than real datasets. Following the example datasets is the implementation of the algorithm, which presents in detail how the graph is built. These two sections are necessary to give readers a better intuition of how the tool operates and what they can expect to see in the result. The GBLiftOver tool can be found in appendix A.

The last section introduces the behaviour of the algorithm together with a set of principles for finding the genomic locations in the graph. This is not primarily an attempt to show whether the GBLiftOver tool behaves correctly; that will be discussed in the next chapter. Instead, it shows the reader what we consider correct behaviour when finding the genomic locations.

3.1 Define the graph

To build a graph, the system begins with an empty graph. A node can be appended to the graph and can connect its edge to another node if there exists a path between them.

The node is identified by a unique identifier represented by an integer. The integers start from 0 and increase by one for every node appended to the graph. Each node in the graph can store up to two regions. A region defines a small part of a chromosome and is stored with a start index and an end index. A second region is stored in an already existing node if the new region has an alignment with the region in that node. This represents an overlap between the coordinates. We will call this a relation between the regions. To distinguish between the two regions, we will use the expressions presented in the background chapter: the primary chromosome and the aligning chromosome.

The primary chromosome belongs to one specific genome and the aligning chromosome to a different genome. Their names are also added to the node for chromosome identification. Each chromosome region has a start and an end, which also need to be defined in the node. Since a region defines only a small part of a primary chromosome, the remaining regions of the primary chromosome are stored in separate nodes. These regions have an order, and to connect the nodes in the same order, the nodes have edges that indicate the previous and the next region. Figure 3.1 illustrates the variables in a node.

id
Primary chromosome id
Alignment start
Alignment end
Aligning chromosome id
Alignment start
Alignment end
Previous edges
Next edges

Figure 3.1: Node with variables
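The variables of figure 3.1 could be written down as a small data structure. The Python sketch below is one possible layout under our own naming; grouping chromosome id, start and end into a `Region` class and storing edges as lists of node ids are assumptions, not a description of the actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Region:
    chrom: str  # chromosome id, for example "chr1"
    start: int  # alignment start
    end: int    # alignment end


@dataclass
class Node:
    id: int
    primary: Optional[Region] = None   # region in the primary chromosome, if any
    aligning: Optional[Region] = None  # region in the aligning chromosome, if any
    previous_edges: List[int] = field(default_factory=list)  # ids of preceding nodes
    next_edges: List[int] = field(default_factory=list)      # ids of following nodes

    def has_relation(self) -> bool:
        """True if the node stores two regions, i.e. the regions are aligned."""
        return self.primary is not None and self.aligning is not None


node = Node(0, primary=Region("chr1", 1, 10), aligning=Region("chr1", 15, 20))
print(node.has_relation())  # True
```

A node that stores only one of the two regions describes a part of a chromosome without an alignment in the other genome.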

3.2 The datasets

Having defined the graph, we now present the example datasets we will use in the rest of this chapter. There are five files: one axt format, two sizes files and two BED files. Three of them, the axt format and the two sizes files, are required to construct the graph.

The axt format is considered to be the essential file. It is the one which will impact the graph appearance.


3.2.1 Example axt format

One reference genome is represented by the primary chromosomes, the other by the aligning chromosomes. The primary side of an axt file covers exactly one chromosome, which means that each axt file defines one primary chromosome. In the example datasets, we have two axt files, with chromosome 1 and chromosome 2 as the primary chromosome. Each axt file is already sorted by the alignment start of the primary chromosome. Figures 3.2 and 3.3 show the two axt files.

Every information line in each of the blocks shows the alignment between the primary and aligning chromosomes. This means that a region in the primary and aligning chromosomes in each line have a relation.

The primary chromosome in the example datasets may have unrepresented regions between the blocks. These regions are left out of the axt format because they have no alignment with the aligning chromosome. This also means that the two axt files do not represent chromosome 1 and chromosome 2 as a whole. We will include the unrepresented regions in the graph to specify that some regions of the primary chromosome have no alignment with the aligning chromosome, and vice versa.

An unrepresented region can be viewed as an empty "space" between two regions. For example, in figure 3.2 the region between the second and the third block is unrepresented, that is, the region from 21 to 29. We assume that every chromosome begins at position 1.

The strand and blastz score fields in the axt format are not used in this thesis. The strand field holds the reverse-complement information, while the blastz score holds the scoring information. Both could nevertheless affect the result, which will be discussed in the next chapter.


0 chr1 1 10 chr1 15 20 - 1748
TAAATCACCC
TA-----CCC

1 chr1 11 20 chr2 20 30 - 16775
GACTACTT-G
GACTACTTAG

2 chr1 30 40 chr1 21 30 - 11032
TAGAGTGTAC
TAGAGTGT-C

Figure 3.2: Example of axt format (Primary chr1)

0 chr2 2 11 chr2 1 10 - 1748
TAAATCACCC
TAAATC-CCC

1 chr2 12 19 chr2 31 40 - 16775
GACTACT--T
GACTAC-ACT

2 chr2 30 40 chr2 50 60 - 11032
TAGAGTGTAT
TAGAGTGTAT

3 chr2 50 60 chr1 40 50 - -162
ACATGATCTC
ACATGATCTC

Figure 3.3: Example of axt format (Primary chr2)


3.2.2 Example sizes format

One sizes file contains the sizes of the chromosomes in one reference genome. Since we have two reference genomes, one primary and one aligning, we need two sizes files.

Figures 3.4 and 3.5 show the sizes files we will be using. The primary chromosomes, chromosome 1 and chromosome 2, correspond to the entries of the primary sizes file. The main purpose of the sizes files is to assist the axt format when we want to check for an unrepresented region at the end of a chromosome. In figure 3.3, the primary chromosome in the last block ends at 60, but the sizes file shows that the region from 61 to 62 is unrepresented.

chr1 40
chr2 62

Figure 3.4: Chromosome size for primary chromosome

chr1 50
chr2 62

Figure 3.5: Chromosome size for aligning chromosome
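The unrepresented regions follow directly from the sorted blocks and the chromosome size. The Python sketch below computes them for the primary chromosomes of the example datasets; it assumes 1-based inclusive coordinates and that every chromosome begins at position 1, and the function name is hypothetical.

```python
def unrepresented_regions(blocks, chrom_size):
    """Return the (start, end) gaps not covered by the blocks.

    blocks: (start, end) pairs sorted by start, 1-based and inclusive.
    chrom_size: length of the chromosome taken from the sizes file.
    """
    gaps = []
    expected = 1  # next position that should be covered
    for start, end in blocks:
        if start > expected:
            gaps.append((expected, start - 1))
        expected = max(expected, end + 1)
    if expected <= chrom_size:
        gaps.append((expected, chrom_size))  # gap at the end of the chromosome
    return gaps


# Primary chr1 (figure 3.2) with size 40 (figure 3.4)
print(unrepresented_regions([(1, 10), (11, 20), (30, 40)], 40))
# [(21, 29)]

# Primary chr2 (figure 3.3) with size 62 (figure 3.4)
print(unrepresented_regions([(2, 11), (12, 19), (30, 40), (50, 60)], 62))
# [(1, 1), (20, 29), (41, 49), (61, 62)]
```

The last gap of chromosome 2, from 61 to 62, can only be found with the help of the sizes file.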

3.2.3 Example BED format

As mentioned, we will use three fields: the chromosome name, the start position and the end position. Figures 3.6 and 3.7 show the BED format we will be using. The BED format will not be put to use until the construction of the graph is complete. We therefore return to the BED format after the graph has been constructed.

chr1 5 12

chr1 14 20

chr1 25 27

chr1 29 30

chr1 33 34

chr1 36 50

chr1 55 66

chr1 67 75

chr1 79 87

chr1 90 105

chr2 12 15

chr2 18 20

Figure 3.6: Genomic locations in primary chromosome

chr1 1 9

chr1 15 23

chr1 25 40

chr2 4 13

chr2 15 18

chr2 32 42

chr3 12 17

chr3 55 57

chr3 79 80

chr4 20 29

chr4 72 74

chr4 87 90

chr5 40 59

chr5 95 98

chr5 101 108

Figure 3.7: Genomic locations in aligning chromosome

3.3 Construct the graph

Having presented the example datasets, we will in this section present the algorithm that constructs the graph. The algorithm reads the axt files, and it is crucial that they are read in order of chromosome number, because the tool expects the files to be in order. It then parses the axt files, where the information line is the only relevant line.

Every node must have a unique identifier, which is determined by a running counter. Figure 3.8 shows the extraction from the axt files, where the alignment number is the running counter, starting at 0 and adding 1 for each information line found. Since the sequences in the axt format are not relevant, we will not use them further. The start and end of each region already delimit the sequence, and we only want the tool to check whether the desired genomic locations are within a given distance of each other.

Additionally, the strand and blastz score columns are not relevant and are left out of the figures, so that we can focus on the columns we will be using.


0 chr1 1 10 chr1 15 20
1 chr1 11 20 chr2 20 30
2 chr1 30 40 chr1 21 30
3 chr2 2 11 chr2 1 10
4 chr2 12 19 chr2 31 40
5 chr2 30 40 chr2 50 60
6 chr2 50 60 chr1 40 50

Figure 3.8: Extracted from both axt formats
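The extraction step can be sketched as follows. This is a minimal illustration and not the GBLiftOver source: the names `Alignment` and `extract_alignments` are hypothetical, and the sketch assumes that an information line consists of nine whitespace-separated fields (alignment number, primary chromosome, start, end, aligning chromosome, start, end, strand, score), as in the example axt data.

```python
from typing import Iterable, List, NamedTuple


class Alignment(NamedTuple):
    number: int    # running counter, later used as the black node id
    p_chrom: str   # primary chromosome name
    p_start: int
    p_end: int
    a_chrom: str   # aligning chromosome name
    a_start: int
    a_end: int


def extract_alignments(axt_lines: Iterable[str]) -> List[Alignment]:
    """Keep only the information lines and renumber them with a running counter."""
    alignments: List[Alignment] = []
    for line in axt_lines:
        fields = line.split()
        # Sequence lines and blank lines do not have nine fields and are skipped.
        if len(fields) != 9 or not fields[0].isdigit():
            continue
        _, p_chrom, p_start, p_end, a_chrom, a_start, a_end, _strand, _score = fields
        alignments.append(
            Alignment(len(alignments), p_chrom, int(p_start), int(p_end),
                      a_chrom, int(a_start), int(a_end))
        )
    return alignments


# Two alignments taken from two different axt files; both are numbered 0 in
# their own file, but the running counter gives them the ids 0 and 1.
example_lines = [
    "0 chr1 1 10 chr1 15 20 - 1748",
    "TAAATCACCC",
    "TA-----CCC",
    "",
    "0 chr2 2 11 chr2 1 10 - 1748",
    "TAAATCACCC",
    "TAAATC-CCC",
]
extracted = extract_alignments(example_lines)
```

The strand and score are read but discarded, matching the columns shown in figure 3.8.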

Each axt file is already sorted by the start of the primary chromosome region. Since the axt files are read in order of chromosome number, the whole extraction is sorted by primary chromosome and start. We want this ordering because it makes the order of the chromosomes explicit and shows where one chromosome ends and the next one begins. It is also more natural to process the chromosomes in increasing order, each sorted by start.

Each line in the extraction contains a relation between a primary and an aligning chromosome region, which we will refer to as a common node. An unrepresented region between two related regions will be referred to as a unique node. We divide the nodes into three types:

Black: Common node
Blue: Unique node for primary chromosome
Red: Unique node for aligning chromosome

The colour black, the common node, represents a region with a relation.

A unique node is either blue or red. Blue indicates a region in the primary chromosome which has no relation to a region in the aligning chromosome. Similarly, red indicates a region in the aligning chromosome which has no relation to the primary chromosome. In the rest of this thesis we will use black, blue and red nodes to refer to common nodes and the two kinds of unique nodes.

After parsing the axt files, the next step is to read the extraction line by line. The graph contains a list, where each index in the list stores a node. We want the list to be ordered by node type: black, then blue, then red. From the extraction, we know that every line gives rise to a black node. Therefore, we reserve indices 0 to 6 for the black nodes. Index 7 and onwards is reserved for the unique nodes.
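One possible layout for the node list is sketched below. The class `Node`, its field names and the helper `new_graph` are assumptions made for illustration; the thesis does not specify how the tool stores its nodes internally.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end)


@dataclass
class Node:
    node_id: int
    colour: str                          # "black", "blue" or "red"
    primary: Optional[Region] = None     # set for black and blue nodes
    aligning: Optional[Region] = None    # set for black and red nodes
    next_ids: List[int] = field(default_factory=list)      # outgoing edges
    previous_ids: List[int] = field(default_factory=list)  # incoming edges


def new_graph(number_of_alignments: int) -> List[Optional[Node]]:
    """Reserve one slot per black node; unique nodes are appended afterwards."""
    return [None] * number_of_alignments


graph = new_graph(7)                                    # indices 0 to 6
graph[0] = Node(0, "black", primary=("chr1", 1, 10))    # alignment number 0
# The first unique node lands on index 7 because seven slots are reserved.
graph.append(Node(len(graph), "blue", primary=("chr1", 21, 29)))
```

Reserving the black slots up front means the alignment number can be used directly as both the node id and the list index.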

Having defined the node types, we can now read the extracted information. Starting with alignment number 0 in the primary chromosome column, we read the start and end of the region as 1 and 10. Since the start is 1, there is no unrepresented region before alignment number 0. A black node is created which stores primary chromosome 1 with start 1 and end 10. This black node 0 is placed at index 0 in the graph. Figure 3.9 illustrates the black node appended to the graph, where the alignment number is the identifier of the black node. The left part of the figure shows the line we are currently reading. The aligning chromosome is not read until we are finished with the primary chromosome. In this part of the section we therefore focus only on the primary chromosome column, and the start and end refer to the primary chromosome region.

0 chr1 1 10 chr1 15 20

0

Figure 3.9: Black node 0 is appended to the graph

We continue with the next line, alignment number 1, where the start is 11. Since the previous line ends at 10, the regions of alignment numbers 0 and 1 are adjacent and there is no unrepresented region between them. Therefore black node 1 is created with primary chromosome 1, start 11 and end 20. Figure 3.10 illustrates the second step, where the black node is placed at index 1. Since the two regions are adjacent, we add an edge from node 0 to node 1.

1 chr1 11 20 chr2 20 30

0 1

Thereafter, alignment number 2 is read. In this case the start is 30, whereas alignment number 1 ends at 20. There is an unrepresented region, which has to be handled first. Since this region has no relation to a region in the aligning chromosome, a blue node is created for primary chromosome 1, with start 21 and end 29. The id of the blue node is 7, since the first seven indices in the graph (0 to 6) are reserved for the black nodes.

Because the blue node covers the region that starts right after alignment number 1, an edge is added from black node 1 to blue node 7. Once the blue node has been connected, we continue with alignment number 2. Black node 2 is created and stores primary chromosome 1, starting at 30 and ending at 40. Figure 3.11 illustrates the third step, where blue node 7 and black node 2 are placed at index 7 and 2, respectively.

The figure shows the graph as connected by its edges, not the order of the indices in the list.

2 chr1 30 40 chr1 21 30

0 1 7 2

Figure 3.11: Blue node 7 and black node 2 are appended to the graph

The next line to read is alignment number 3. This line has a different chromosome from the previous line, which indicates that the previous chromosome is finished. We need to check whether there is an unrepresented region between the end of alignment number 2 and the end of primary chromosome 1. In this case, the size of chromosome 1 (from figure 3.4) equals the end of alignment number 2. Therefore there is no unrepresented region, and all of primary chromosome 1 is now represented in the graph.

We continue with alignment number 3, where the start is 2. A region is unrepresented and thus blue node 8 is created. This blue node stores primary chromosome 2 with start 1 and end 1. Black node 2 will not have an edge to blue node 8.

This is because the two nodes belong to different chromosomes, which have no association with each other, and such an edge would have no biological purpose. This is a restriction between chromosomes that we use throughout the thesis, and we call it a crossing restriction.

After blue node 8 is done, black node 3 is created. It stores primary chromosome 2 with start 2 and end 11. An edge is added from blue node 8 to black node 3. Both nodes are added to the graph, where index 8 contains the blue node and index 3 contains black node 3. Figure 3.12 illustrates the graph so far.

3 chr2 2 11 chr2 1 10

0 1 7 2 8

3

Figure 3.12: Blue node 8 and black node 3 are appended to the graph

Alignment number 4 is read next, with start 12 and end 19. The previous alignment ended at 11, meaning there is no unrepresented region between alignment numbers 3 and 4. Black node 4 is created, storing primary chromosome 2 with start 12 and end 19. An edge is added from black node 3 to black node 4, and black node 4 is stored at index 4 in the graph. The graph is illustrated in figure 3.13.

4 chr2 12 19 chr2 31 40

0 1 7 2 8

3 4

Alignment number 5 is the second to last to be read. Its start is 30, while the previous alignment ended at 19, so there is an unrepresented region between alignment numbers 4 and 5. Blue node 9 is created, storing primary chromosome 2 with start 20 and end 29. We then create black node 5, which stores primary chromosome 2 with start 30 and end 40. An edge is added from black node 4 to blue node 9, and another from blue node 9 to black node 5. Index 9 in the graph stores blue node 9, while index 5 stores black node 5. Figure 3.14 illustrates the graph with the two new nodes.

5 chr2 30 40 chr2 50 60

0 1 7 2 8

4 3 9

5

The last line is alignment number 6. The previous alignment ended at 40, but alignment number 6 starts at 50. An unrepresented region is found and blue node 10 is created. It stores primary chromosome 2 with start 41 and end 49. An edge is added from black node 5 to blue node 10, and another from blue node 10 to black node 6. Black node 6 stores primary chromosome 2 with start 50 and end 60. Index 6 stores the last black node in the graph, while blue node 10 is stored at index 10. Figure 3.15 illustrates the current graph.

6 chr2 50 60 chr1 40 50

0 1 7 2 8

3 4

9 5

10

6

After reading the last line in the extraction, we know there are no more regions in primary chromosome 2. Even so, we still have to check for an unrepresented region at the end of primary chromosome 2. From figure 3.4, primary chromosome 2 has a size of 62, while the last line in the extraction ended at 60. This means there is an unrepresented region at the end of primary chromosome 2, and blue node 11 is created. It stores primary chromosome 2 with start 61 and end 62. The chromosome size is needed here because it marks where the primary chromosome ends. Index 11 in the graph stores blue node 11.

We have now filled indices 0 to 6 with the black nodes, while indices 7 to 11 contain the blue nodes. This concludes the reading of the primary chromosome column. Figure 3.17 illustrates the graph constructed so far, and the table in figure 3.16 shows the corresponding information for the nodes.

Node id   Primary (start-end)   Primary chromosome
0         1-10                  chr1
1         11-20                 chr1
7         21-29                 chr1
2         30-40                 chr1
8         1-1                   chr2
3         2-11                  chr2
4         12-19                 chr2
9         20-29                 chr2
5         30-40                 chr2
10        41-49                 chr2
6         50-60                 chr2
11        61-62                 chr2

Figure 3.16: Table for primary chromosome

Figure 3.17: Primary chromosome is completed in the graph
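The walk through the primary chromosome column can be condensed into the following sketch. It is an illustration under stated assumptions rather than the tool's implementation: the function `build_primary_pass` is hypothetical, chromosomes are assumed to start at position 1, and the edge from the last black node of a chromosome to its trailing blue node (6 to 11) is assumed, since the text does not state it explicitly.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str, int, int]        # (alignment number, chromosome, start, end)
NodeInfo = Tuple[str, str, int, int]   # (colour, chromosome, start, end)


def build_primary_pass(rows: List[Row], sizes: Dict[str, int]):
    """Create black nodes, blue gap nodes and primary edges from sorted rows."""
    nodes: Dict[int, NodeInfo] = {}
    edges: List[Tuple[int, int]] = []
    next_unique_id = len(rows)          # unique nodes come after the black slots
    prev_id, prev_chrom, prev_end = None, None, 0

    def add_blue(chrom: str, start: int, end: int) -> None:
        nonlocal next_unique_id, prev_id
        nodes[next_unique_id] = ("blue", chrom, start, end)
        if prev_id is not None:
            edges.append((prev_id, next_unique_id))
        prev_id = next_unique_id
        next_unique_id += 1

    for number, chrom, start, end in rows:
        if chrom != prev_chrom:
            # Close the previous chromosome with a trailing blue node if needed.
            if prev_chrom is not None and prev_end < sizes[prev_chrom]:
                add_blue(prev_chrom, prev_end + 1, sizes[prev_chrom])
            prev_id, prev_end = None, 0   # crossing restriction: no edge across
        if start > prev_end + 1:          # unrepresented region before this row
            add_blue(chrom, prev_end + 1, start - 1)
        nodes[number] = ("black", chrom, start, end)
        if prev_id is not None:
            edges.append((prev_id, number))
        prev_id, prev_chrom, prev_end = number, chrom, end

    if prev_chrom is not None and prev_end < sizes[prev_chrom]:
        add_blue(prev_chrom, prev_end + 1, sizes[prev_chrom])
    return nodes, edges


primary_rows = [
    (0, "chr1", 1, 10), (1, "chr1", 11, 20), (2, "chr1", 30, 40),
    (3, "chr2", 2, 11), (4, "chr2", 12, 19), (5, "chr2", 30, 40),
    (6, "chr2", 50, 60),
]
primary_sizes = {"chr1": 40, "chr2": 62}
nodes, edges = build_primary_pass(primary_rows, primary_sizes)
```

With the example extraction this yields the blue nodes 7 to 11 with the regions described above, and no edge between black node 2 and blue node 8.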

Having read the primary chromosome column in the extraction, we can now read the aligning chromosome column. From here on, the start and the end of a region refer to the aligning chromosome.

Looking at the aligning chromosome column in figure 3.8, we notice that it is not sorted by aligning chromosome and start. This is because the original axt files are sorted by primary chromosome, and the aligning column only shows which region is related to each primary region. We could have read the aligning chromosome column alongside the primary chromosome column, but that would have been less convenient. Instead, we sort the extraction by aligning chromosome and start. Figure 3.18 shows the result.

0 chr1 1 10 chr1 15 20
2 chr1 30 40 chr1 21 30
6 chr2 50 60 chr1 40 50
3 chr2 2 11 chr2 1 10
1 chr1 11 20 chr2 20 30
4 chr2 12 19 chr2 31 40
5 chr2 30 40 chr2 50 60

Figure 3.18: Extracted with sorting on aligning chromosome corresponding to its start
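The re-sorting can be sketched with a sort key on the aligning columns. The helper `chrom_key` is a hypothetical addition: it orders chromosome names by their number so that, for real data, chr2 comes before chr10, which a plain string comparison would not guarantee.

```python
from typing import Tuple

Row = Tuple[int, str, int, int, str, int, int]
# (alignment number, primary chrom, start, end, aligning chrom, start, end)


def chrom_key(name: str) -> Tuple[int, int, str]:
    """Order numbered chromosomes numerically, then named ones alphabetically."""
    suffix = name[3:] if name.startswith("chr") else name
    if suffix.isdigit():
        return (0, int(suffix), "")
    return (1, 0, suffix)


extraction = [
    (0, "chr1", 1, 10, "chr1", 15, 20),
    (1, "chr1", 11, 20, "chr2", 20, 30),
    (2, "chr1", 30, 40, "chr1", 21, 30),
    (3, "chr2", 2, 11, "chr2", 1, 10),
    (4, "chr2", 12, 19, "chr2", 31, 40),
    (5, "chr2", 30, 40, "chr2", 50, 60),
    (6, "chr2", 50, 60, "chr1", 40, 50),
]

# Sort by aligning chromosome first, then by aligning start.
by_aligning = sorted(extraction, key=lambda row: (chrom_key(row[4]), row[5]))
order = [row[0] for row in by_aligning]
```

The alignment numbers are kept while sorting, so each line can still be matched to the black node created in the primary pass.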

Since the extraction has been re-sorted, we start reading it from the beginning again. The first line is alignment number 0, where the start is 15 and the end is 20. At the beginning we find an unrepresented region in aligning chromosome 1. This time we create a red node instead of a blue node, to indicate that the aligning chromosome has an unrepresented region with no relation to a region in the primary chromosome. The red node stores aligning chromosome 1 with start 1 and end 14. Again, we assume that a chromosome begins at position 1. The ids of the red nodes continue from the last blue node id, which was 11, so this red node gets id 12.

Black node 0 was already created when we read the primary chromosome column. Since the region in aligning chromosome 1 is related to black node 0, we only need to update the node. In other words, black node 0 is updated to also store aligning chromosome 1 with start 15 and end 20. Since red node 12 is a region that lies before black node 0, an edge is added from red node 12 to black node 0. Index 12 in the graph stores red node 12. Figure 3.19 illustrates the graph after the first red node is appended and black node 0 is updated.

0 1 7 2

12

Figure 3.19: Red node 12 is appended to the graph and black node 0 is updated

The next line to be read is alignment number 2. Its start is 21, while the previous alignment ended at 20. There is no unrepresented region between alignment numbers 0 and 2, so no red node is created. Black node 2 already exists in the graph and is updated with aligning chromosome 1, start 21 and end 30. An edge is added from black node 0 to black node 2. Figure 3.20 illustrates the edge added to the graph between black nodes 0 and 2.

0 1 7 2

12

Figure 3.20: Edge connection between black node 0 and 2. In addition, black node 2 is updated

Continuing, we read alignment number 6. The previous alignment number 2 ended at 30, but alignment number 6 starts at 40, so we have found an unrepresented region. Red node 13 is created and stores aligning chromosome 1 with start 31 and end 39. An edge is added from black node 2 to red node 13. Next, black node 6 is updated to store aligning chromosome 1 with start 40 and end 50, and an edge is added from red node 13 to black node 6. Index 13 in the graph stores red node 13. Figure 3.21 illustrates red node 13 appended to the graph, and figure 3.22 illustrates the edge from red node 13 to black node 6.

0 1 7 2

13 12

Figure 3.21: Red node 13 is appended to the graph

The next line is alignment number 3, where a new chromosome is detected. This means aligning chromosome 1 is finished, and we have to check whether there is an unrepresented region between the end of alignment number 6 and the end of aligning chromosome 1. From figure 3.5 we can see that the end of alignment number 6 equals the size of aligning chromosome 1. Thus there is no unrepresented region and no red node is created.

In alignment number 3 the start is 1, so there is no unrepresented region at the beginning of aligning chromosome 2. Therefore black node 3 is updated to store aligning chromosome 2 with start 1 and end 10. In addition, no edge is added from black node 6 to black node 3. The two nodes belong to different aligning chromosomes, which have no association with each other, so the crossing restriction applies between black nodes 6 and 3. In this step no nodes were appended to the graph, but figure 3.22 illustrates the current graph.

8 3 4 9 5 10 6 11

13

Figure 3.22: Black node 3 is updated

Continuing on aligning chromosome 2, we read alignment number 1. The previous alignment ended at 10, while alignment number 1 starts at 20. We have found a new unrepresented region, and red node 14 is created. It stores aligning chromosome 2 with start 11 and end 19. An edge is added from black node 3 to red node 14. We also update black node 1 to store aligning chromosome 2 with start 20 and end 30. At the end of this step, an edge is added from red node 14 to black node 1. Index 14 in the graph stores red node 14. Figure 3.23 illustrates the graph.

0 1 7 2 8 3 4

13 14

12

Figure 3.23: Red node 14 is appended to the graph and black node 1 is updated

The second to last line to be read is alignment number 4. Its start is 31, while the previous alignment number 1 ended at 30, so there is no unrepresented region between alignment numbers 1 and 4. Black node 4 is therefore updated to store aligning chromosome 2 with start 31 and end 40. Furthermore, an edge is added from black node 1 to black node 4. Figure 3.24 illustrates the current graph.

0 1 7 2 8 3 4

13 14

12

Figure 3.24: Black node 4 is updated

The last line is alignment number 5. The previous alignment number 4 ended at 40, and alignment number 5 starts at 50. There is an unrepresented region, and red node 15 is created. It stores aligning chromosome 2 with start 41 and end 49. An edge is added from black node 4 to red node 15. Next, black node 5 is updated to store aligning chromosome 2 with start 50 and end 60, and an edge is added from red node 15 to black node 5. Figure 3.25 illustrates the current graph.

8 3 4 9 5 10 6 15

13

Figure 3.25: Red node 15 is appended and black node 5 is updated

After reading the last line in the extraction, we know there are no more regions in aligning chromosome 2. Even so, we still have to check for an unrepresented region at the end of aligning chromosome 2. From figure 3.5, aligning chromosome 2 has a size of 62. A region is unrepresented at the end of aligning chromosome 2 and thus red node 16 is created. Red node 16 is the last node to be appended to the graph, and it stores aligning chromosome 2 with start 61 and end 62. Index 16 in the graph stores red node 16.

This concludes the reading of the aligning chromosome column. The current graph is illustrated in figure 3.26.

8 3 4 9 5 10 6 15

16 13

Figure 3.26: Last red node 16 is appended to the graph
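The red nodes found in the aligning pass can be checked with a small gap finder. This is a sketch with a hypothetical function name, `unrepresented_regions`, and it assumes the regions are already sorted by aligning chromosome and start, and that each chromosome begins at position 1.

```python
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end)


def unrepresented_regions(regions: List[Region], sizes: Dict[str, int]) -> List[Region]:
    """Return the gaps between sorted regions, including leading and trailing gaps."""
    gaps: List[Region] = []
    prev_chrom, prev_end = None, 0
    for chrom, start, end in regions:
        if chrom != prev_chrom:
            # Trailing gap of the chromosome that was just finished.
            if prev_chrom is not None and prev_end < sizes[prev_chrom]:
                gaps.append((prev_chrom, prev_end + 1, sizes[prev_chrom]))
            prev_chrom, prev_end = chrom, 0
        if start > prev_end + 1:
            gaps.append((chrom, prev_end + 1, start - 1))
        prev_end = end
    if prev_chrom is not None and prev_end < sizes[prev_chrom]:
        gaps.append((prev_chrom, prev_end + 1, sizes[prev_chrom]))
    return gaps


aligning_regions = [
    ("chr1", 15, 20), ("chr1", 21, 30), ("chr1", 40, 50),
    ("chr2", 1, 10), ("chr2", 20, 30), ("chr2", 31, 40), ("chr2", 50, 60),
]
aligning_sizes = {"chr1": 50, "chr2": 62}

# Red ids continue after the last blue id, which was 11 in the example.
red_nodes = dict(enumerate(unrepresented_regions(aligning_regions, aligning_sizes), start=12))
```

For the example data this gives the five red nodes 12 to 16 with the same regions as in the walk-through.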

This concludes the construction of the graph. The example datasets used in this chapter were chosen to give the reader a simple understanding of how the graph is constructed, and for each step we gave the motivation for the choices we made. The complete graph resulting from these choices is illustrated in figure 3.28, and the table in figure 3.27 lists the information stored in each node.

Node id   Primary (start-end)   Aligning (start-end)   Primary chromosome   Aligning chromosome
0         1-10                  15-20                  chr1                 chr1
1         11-20                 20-30                  chr1                 chr2
7         21-29                 -                      chr1                 -
2         30-40                 21-30                  chr1                 chr1
8         1-1                   -                      chr2                 -
3         2-11                  1-10                   chr2                 chr2
4         12-19                 31-40                  chr2                 chr2
9         20-29                 -                      chr2                 -
5         30-40                 50-60                  chr2                 chr2
10        41-49                 -                      chr2                 -
6         50-60                 40-50                  chr2                 chr1
11        61-62                 -                      chr2                 -
12        -                     1-14                   -                    chr1
13        -                     31-39                  -                    chr1
14        -                     11-19                  -                    chr2
15        -                     41-49                  -                    chr2
16        -                     61-62                  -                    chr2

Figure 3.27: Table for figure 3.28

Figure 3.28: Complete graph of example datasets

3.4 Algorithm for finding the genomic locations

In the previous section we showed the steps for constructing the graph from the example datasets. We continue to use the complete graph from that section. Here we present the second part of the algorithm, which finds genomic locations within a given distance. We also present the BED data that is used to find the genomic locations, and explain the reasons behind our choices.

The algorithm for searching for genomic locations is recursive. We chose recursion because we need to visit the nodes in the graph to check all possible paths. A path is defined by travelling between nodes in the graph through the edges. The algorithm uses the primary chromosome as the "main" path. This means that when we travel through an edge, a primary chromosome edge is picked before an aligning chromosome edge whenever one is available. A primary chromosome edge goes from one primary chromosome region to the next primary chromosome region. We made this choice in the implementation because the primary chromosome is constructed first. We will refer to travelling via a primary chromosome edge as following a blue path, and travelling via an aligning chromosome edge as following a red path.

The search starts from a genomic location in the primary chromosome. If a black node has no blue edges left, or they have all been checked, and a red path is available, the search is allowed to continue from the black node along the red path. This is possible because a black node holds overlapping coordinates for both the primary and the aligning chromosome, so the search can travel from a black node to a red node. Suppose the search has switched to a red path and is still on it. It can then continue along another red edge only if the aligning chromosome of the current node equals the chromosome of the next red node. This is a choice we made in the implementation.

If the search has already visited a node, or there was a shorter path to it, we do not want to search from it again unnecessarily. To avoid such futile travel, we keep a list of the nodes visited and the distance travelled. Each node visited during the search from a genomic location is appended to this list and marked with the length of the path used to reach it. If a longer path is about to visit a node that was already reached by a shorter path, the search skips the node and checks other paths instead.
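The pruning rule can be sketched as a recursive search that records the shortest known distance per node. This is a simplified model and not the GBLiftOver code: the function `reachable_within` is hypothetical, and each edge is given a single precomputed cost instead of the step computation described later in this section.

```python
from typing import Dict, List, Tuple

# Adjacency list: node id -> list of (neighbour id, cost of moving there).
Graph = Dict[int, List[Tuple[int, int]]]


def reachable_within(graph: Graph, start: int, max_steps: int) -> Dict[int, int]:
    """Return the shortest distance found to every node reachable within max_steps."""
    best: Dict[int, int] = {}

    def visit(node: int, steps: int) -> None:
        if steps > max_steps:
            return                      # over the allowed distance
        if node in best and best[node] <= steps:
            return                      # already reached by a path at least as short
        best[node] = steps
        for neighbour, cost in graph.get(node, []):
            visit(neighbour, steps + cost)

    visit(start, 0)
    return best


toy_graph: Graph = {
    0: [(1, 5), (2, 1)],
    2: [(1, 1)],
    1: [(3, 10)],
}
distances = reachable_within(toy_graph, start=0, max_steps=8)
```

In the toy graph, node 1 is first reached with distance 5 and later with distance 2 through node 2; only the shorter visit is kept, and node 3 is never recorded because it lies beyond the maximum.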

3.4.1 Read the genomic locations in the BED formats

We will now use the BED data defined in section 3.2.3. A BED file contains intervals which are to be located in the other genome. An interval covers the bases from its start position to its end position. Since we want to find positions rather than intervals, we will only use the start positions and not the end positions. The start positions will be referred to as genomic locations.

After the graph has been constructed, the algorithm reads the BED files and parses the genomic locations into dictionaries. The primary chromosome and the aligning chromosome each have their own dictionary, where the key is the chromosome name and the value is a list of genomic locations. A variation would have been to keep searching for every genomic location within the distance rather than stopping at the first one. Figures 3.29 and 3.30 illustrate the outcome.

'chr1': [5, 14, 25, 29, 33, 36, 55, 67, 79, 90]

'chr2': [12, 18]

Figure 3.29: Genomic locations in primary chromosome

'chr1': [1, 15, 25]

'chr2': [4, 15, 32]

'chr3': [12, 55, 79]

'chr4': [20, 72, 87]

'chr5': [40, 95, 101]

Figure 3.30: Genomic locations in the aligning chromosome
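Reading the BED data into such a dictionary can be sketched as below. The function name `read_genomic_locations` is hypothetical, and the sketch assumes whitespace-separated lines with at least three fields, skipping the optional `track`, `browser` and comment lines that BED files may contain.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def read_genomic_locations(bed_lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each chromosome name to the list of start positions found in the BED data."""
    locations: Dict[str, List[int]] = defaultdict(list)
    for line in bed_lines:
        if line.startswith(("track", "browser", "#")):
            continue                      # header or comment line
        fields = line.split()
        if len(fields) < 3:
            continue                      # blank or incomplete line
        chrom, start = fields[0], int(fields[1])
        locations[chrom].append(start)    # the end position is not used
    return dict(locations)


primary_locations = read_genomic_locations(
    ["chr1 5 12", "chr1 14 20", "chr2 12 15", "chr2 18 20"]
)
```

Only the second field of each line is kept, which mirrors the decision to treat the start position as the genomic location.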

We are only interested in the first genomic location found within the distance, not in how many there are. For example, if the search from genomic location 5 in primary chromosome 1 finds a genomic location in an aligning chromosome, it has found a match and the search moves on to genomic location 14. A match is defined as a genomic location in the primary chromosome for which a genomic location in the aligning chromosome is found within the distance. This will be discussed further in the next chapter.

3.4.2 Steps-form algorithm

Having introduced the BED data, we can now find genomic locations within a distance. We start in the primary chromosome, because the primary chromosome is the "main" path. A step inside a node is defined as moving from one position to another in the same region, where the region is the sequence from start to end stored in the node. When the search reaches the boundary of a node and is about to cross an edge, it has to check whether the edge is a blue or a red path. On a blue path the primary chromosome region is used to compute the steps, and on a red path the aligning chromosome region is used. If both paths are available, the blue path is picked first, as mentioned. In addition, crossing an edge counts as one step.

The distance is a value given by the user which denotes the maximum number of steps allowed, and thus has to be a positive integer. We keep a counter of matches, which is increased by 1 for each genomic location that finds a match within the max step.

We will first give the main principle of the search method and afterwards show an example. The recursive method consists of a backward search and a forward search. The backward search follows the incoming edges of the start node, and the forward search follows its outgoing edges.

The backward search is run first. It can only move to nodes reached through previous edges. If the backward search arrives back at the start node, it is denied access, since the forward search will take care of the nodes reached through next edges. When the recursion of the backward search has returned to the start node, it has tried every possible path, which also means that it could not find any genomic location.

The forward search starts right after the backward search has finished. It operates like the backward search, the only difference being that nodes are reached through next edges. When the recursion of the forward search has returned to the start node, it has tried every possible path, and no genomic location was found within the max steps. If the backward search finds a genomic location, the forward search is skipped, since a match has already been found. The recursion is designed this way because of problems encountered during the development of the GBLiftOver tool, which led it to evolve into two separate search methods.

Node 0: primary 1-10, aligning 15-20
Node 1: primary 11-20, aligning 20-30
Node 7: primary 21-29
Node 2: primary 30-40, aligning 21-30

Figure 3.31: Small part from the example graph

We now show an example of how a path is computed, to give the reader a visualisation of how the computation works. Figure 3.31 shows a small part of the example graph. In addition to the nodes, the figure shows coloured regions representing the primary and aligning chromosome.

The genomic location in the primary chromosome is at 25 in chromosome 1, which lies in blue node 7. The genomic locations in the aligning chromosome will be at 15 and 21. We set the max step to 18. We refer to the left as the start of a region, the right as the end of a region and the length as the difference between the end and the start. Which region is meant depends on the current path:

1. Check if a genomic location is on the blue node 7. That is not the case.
2. Backward search first. Go to the left of blue node 7. Step: 4 (25 - 21)
3. Crossing over the edge via blue path and the length of black node 1. Step: 14 (4 + 1 + 9)
4. Step counter is lower than 18, keep going
5. Check if a genomic location is on the black node 1. That is not the case.
6. Crossing over the edge via blue path and the length of black node 0. Step: 24 (14 + 1 + 9)
7. Step counter is higher than 18, stop
8. Check if a genomic location is on the black node 0, we found one.
9. Take the current step 24 and subtract the length used in node 0. Step: 15 (24 - 9)
10. Change the blue path to a red path
11. Calculate the step to genomic location 15. Step: 20 (15 + 5)
12. Step 20 is not within the max step. Could not find a match
13. Check node in next. Found node 1, but already visited it.
14. Crossing over the edge via red path and the length of black node 2. Step: 25 (15 + 1 + 9)
15. Step counter is higher than 18, stop
16. Check if a genomic location is on the black node 2, we found one.
17. Calculate the step to genomic location 25. Step: 17 (25 - 9 + 1)
18. Step counter is lower than 18, found a match
19. Recurse all the way back to blue node 7 since a match was found
20. End forward search
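The step counter on the blue path in steps 2 to 7 above can be reproduced with a few lines. The function `backward_steps` is a hypothetical helper written for this illustration; it takes the length of a node as end minus start and adds one step per edge crossed, as in the list, and it does not model the switch to the red path in the later steps.

```python
from typing import List, Tuple


def backward_steps(position: int,
                   regions: List[Tuple[int, int]],
                   max_steps: int) -> List[int]:
    """Accumulate the step counter while walking backwards along the blue path.

    regions holds (start, end) of the current node first, followed by the
    nodes reached through previous edges.
    """
    first_start, _ = regions[0]
    steps = position - first_start          # walk to the left of the start node
    trace = [steps]
    for start, end in regions[1:]:
        if steps > max_steps:
            break                           # counter exceeded, stop walking
        steps += 1 + (end - start)          # one step for the edge, then the node length
        trace.append(steps)
    return trace


# Blue node 7 (21-29), black node 1 (11-20), black node 0 (1-10), max step 18.
trace = backward_steps(25, [(21, 29), (11, 20), (1, 10)], max_steps=18)
```

The trace holds the counter after each node, so the last value shows where the walk exceeds the max step of 18.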

3.4.3 Result from the example datasets

Having illustrated how the algorithm in the GBLiftOver tool works, we can now show the result for the example datasets, obtained by running the GBLiftOver tool on them. We will not show the individual steps, but rather give an overview of the output. Table 3.1 shows the result for each genomic location in the BED data.

Chr (mouse)   Position (mouse)   Found chr (Human)   Position found   Path by node ID
chr1          5                  chr1                15               0
chr1          14                 chr1                15               1→0
chr1          25                 chr1                25               7→1→0→2
chr1          29                 chr1                32               7→1→4
chr1          33                 chr1                25               2
chr1          36                 chr1                25               2
chr1          55                 -                   -                -
chr1          67                 -                   -                -
chr1          79                 -                   -                -
chr1          90                 -                   -                -
chr2          12                 chr2                32               4
chr2          18                 chr2                32               4

Table 3.1: Result table of the example datasets

In this section we defined the search algorithm and gave the motivation for the choices we made in designing it. The example datasets will not be used further in the next chapter.

Chapter 4 Results and discussion

In the last chapter we described the algorithm we developed and how we chose to implement it. In this chapter we will analyse the results of the GBLiftOver tool and compare them to the results from the UCSC liftOver tool, using a real dataset.

This chapter is divided into three sections. The first section presents the analysis described in the previous chapter and the result of the GBLiftOver tool. The second section presents the result of the liftOver tool, analysed by Ivar Grytten. In the last section we discuss the result of the GBLiftOver tool.

4.1 GBLiftOver result

The real dataset is taken from the UCSC website¹. We chose to analyse and compare a mouse genome with a human genome, with the mouse genome as the primary genome. Which genome is chosen as primary is arbitrary for our algorithm, but because the liftOver analysis used the mouse genome as primary, we decided to do so as well. The dataset contains two sizes files, two BED files and nineteen axt files.

This time we will show how the GBLiftOver tool is run in the terminal. The file run.py starts the whole tool. By default it runs a method that uses the example datasets from the last chapter. This can be turned off in run.py, and another method that reads the real datasets can be turned on instead. The instructions are given in Appendix A. The steps to run the tool are:

1. Run the run.py program

2. Provide a positive integer for the max steps

1http://hgdownload.soe.ucsc.edu/goldenPath/mm10/vsHg19/axtNet/

The run.py program automatically finds the files it needs, provided the datasets are in the right folder. It takes around one minute to read the nineteen axt files and store them in the graph. When the initialisation is finished, the user can enter the desired max step to make the tool start searching for matches.

The verbose option in run.py is set to true, which means the tool prints the aligning chromosome corresponding to a genomic location when a match is found. This was originally an optional function, but in the final version of the tool verbose is always true. To give the user feedback, the tool prints each axt file it has read and reports when the construction of the graph is finished. In addition, we implemented a timer to measure how long the GBLiftOver tool takes to run for a given max step. Figure 4.1 shows the process before running the GBLiftOver tool.

Figure 4.1: Initialisation of GBLiftOver tool in terminal

After the max step is given, the algorithm starts analysing the genomic locations. The BED files we will be using contain 3899 genomic locations for the mouse genome and 3875 genomic locations for the human genome.

max step = 40000

Time: 208.41 minutes Result: 1860

Figure 4.2: Result on a real datasets

The terminal output after running the GBLiftOver tool is presented in figure 4.2. The tool found 1860 matches within the max step. A match is defined as a genomic location in the mouse genome for which a genomic location in the human genome is found within the max steps. To get a better view of which genomic locations in the BED data were matched, we present a Venn diagram in figure 4.3.

2039 1860 2015

Figure 4.3: Intersection of the result with max step = 40000

Figure 4.3 shows the matches found and the genomic locations for which no match was found. Of the 3899 genomic locations in the mouse genome, 2039 did not find a match, and of the 3875 genomic locations in the human genome, 2015 did not find a match. This result depends on the choices we made when designing the algorithm.
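The three numbers in the Venn diagram follow directly from the dataset sizes and the match count. The short sketch below reproduces the arithmetic, assuming that each match accounts for exactly one mouse location and one human location; the function name `venn_counts` is introduced here for illustration only.

```python
def venn_counts(total_a, total_b, matches):
    """Return (only in A, matched, only in B) for a two-set Venn diagram.

    Assumes every match uses exactly one location from each dataset.
    """
    if matches < 0 or matches > min(total_a, total_b):
        raise ValueError("matches must be between 0 and the smaller dataset size")
    return total_a - matches, matches, total_b - matches


if __name__ == "__main__":
    # 3899 mouse locations, 3875 human locations, 1860 matches.
    mouse_only, matched, human_only = venn_counts(3899, 3875, 1860)
    print(mouse_only, matched, human_only)  # prints: 2039 1860 2015
```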

Therefore, some factors could have affected the result. We discuss this further in the last section of the chapter.
