Developing a tool for comparing annotations across genomes

University of Oslo, Department of Informatics

May 16, 2012

Abstract

Annotations on genomes are an important part of bioinformatics. As the amount of information grows, the need for analytical tools for this information has become apparent. This essay first introduces the web framework Hyperbrowser for reproducible analysis of such annotations, and the concept of annotation tracks. The background for creating a tool for comparing such annotation tracks is given. Based on a master's thesis on developing such a tool, a few challenges are described. These challenges are divided into two categories: implementation challenges and design challenges. Test-driven development and array programming are proposed as possible solutions to the implementation challenges. The design choices that have to be made when developing quantitative methods for comparing annotations across genomes have no proposed solutions, but a road map for resolving them is presented. The problem of verifying results from the tool is described, and the Monte Carlo method as well as the reproduction of published results from other methods are proposed as means of verification.

1 Background

The amount of available genomic sequence (DNA) in public databases has grown. This has led to more analysis and interpretation of raw DNA. Large open projects like Hyperbrowser aim to make it easy for life scientists to analyse this data and test hypotheses. An important part of validating and judging research is reproducing the results. Hyperbrowser makes this possible by supplying an open interface with several tools and data sets to do analysis on. [3] Hyperbrowser is web-based and does not require more than a shallow programming background.

The framework gives access to genomic sequences and tools to use on these sequences. Tools are coded in Python, and once uploaded and released anyone can use them for their research. Using such tools as part of your research helps ensure reproducibility of your results, the lack of which is a known problem in our field of study. [5] As sequencing technology improves and more genomic sequences become available, more research is being done on these sequences. This leads to more insight into the properties and purpose of different parts of the genome. This information is stored in annotation tracks. Comparing annotations across chromosomes and genomes is a new field that has not yet been fully explored. Comparative analysis of annotations would make it easier to discover overlaps and differences in genomes across species. Work in this direction was presented in a 2010 article in Science on the evolutionary dynamics of transcription factor binding. [9] However, no general method for quantitative comparison of annotation tracks has been presented. A Hyperbrowser tool for quantitative comparison of annotation tracks would open up a wide array of possibilities for reproducible hypothesis testing in evolutionary research.

2 Annotating genome sequences

This chapter gives a description of comparative analysis of annotations, including some definitions based on the work done by Eivind Gard Lund in his master's thesis "An extensible framework for comparative analysis of annotations". [4]

2.1 Annotation tracks

Annotation tracks are used in large genomic databases to store empirical data on the DNA. An annotation track consists of track elements that may carry information about specific basepairs of the given DNA. Explained simply, annotations are meta-information on the DNA. A track element may, for example, mark a basepair as part of a specific gene or of a transcription factor binding site, or point to similar parts of other genomes. Since such meta-information usually spans several basepairs, track elements carrying it occur grouped in intervals on the annotation track in many implementations. Annotation tracks are thus organized by marking intervals of basepairs with the relevant information. Usually there are only one or a few types of information on each track, and extracting just one type of information from such tracks is trivial. With only one type of information possible for the intervals in an annotation track, the intervals in essence become boolean values: a basepair is either part of an interval with information or it is not. One way of visualizing such annotation tracks is as intervals on a track, as shown in Figure 1. As the figure shows, an annotation track can be divided into several disjoint chromosome annotation tracks.

Figure 1: Visualisation of an annotation track, from An Extensible Framework For Comparative Analysis of Annotations [4]
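The boolean view of a single-type annotation track described in this section can be illustrated with a minimal sketch. The function name `intervals_to_mask` and the 0-based, half-open interval convention are assumptions made for illustration only; they are not taken from Hyperbrowser or from the thesis.

```python
def intervals_to_mask(intervals, length):
    """Return one boolean per basepair, True where the track is annotated.

    Assumption: intervals are (start, end) pairs, 0-based and half-open.
    """
    mask = [False] * length
    for start, end in intervals:
        # Clip to the track so that out-of-range intervals are harmless.
        for pos in range(max(start, 0), min(end, length)):
            mask[pos] = True
    return mask


# Hypothetical track of 10 basepairs with two annotated intervals.
track = intervals_to_mask([(1, 4), (6, 8)], length=10)
covered = sum(track)  # number of annotated basepairs
```

In this representation every basepair answers a single yes/no question, which is what makes comparisons between tracks of one information type straightforward.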

2.2 Coupling annotations

One kind of meta-information that track elements might contain is information on similar or identical track elements in different genomes. Taken across an annotation track, such information forms a mapping from one genome to another, showing where two genomes share properties. A track element with this information is called a coupled track element, and an annotation track consisting only of coupled track elements is called a coupled annotation track. Coupled annotation tracks can be extracted by excluding non-coupled track elements. What to do with non-coupled track elements is an important question covered later in this essay. Together, two (or more) corresponding coupled annotation tracks are called connected coupled annotation tracks, consisting of connecting coupled track elements. These track elements may overlap, as more than one basepair of a genome might be mapped to a basepair on the other.
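As a rough illustration of the definitions above, a coupled annotation track can be obtained by filtering out the non-coupled track elements. The `TrackElement` structure and its `coupled_to` field are hypothetical and chosen only for this sketch; the thesis may represent couplings differently.

```python
from collections import namedtuple

# Hypothetical representation: an element is an interval that optionally
# lists the intervals it is coupled to on the other genome.
TrackElement = namedtuple("TrackElement", ["start", "end", "coupled_to"])


def coupled_track(track):
    """Keep only the elements that are coupled to at least one partner."""
    return [element for element in track if element.coupled_to]


track_a = [
    TrackElement(0, 50, coupled_to=[(100, 150)]),
    TrackElement(60, 80, coupled_to=[]),  # non-coupled, excluded below
    TrackElement(90, 120, coupled_to=[(200, 230), (400, 430)]),
]
coupled_a = coupled_track(track_a)
```

Note that the last element is coupled to two partners, which is the situation treated under multiple overlaps later in the essay.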

2.3 Annotation track formats

Figure 2: Visualisation of annotation tracks mapped to each other

There are several different formats for annotation tracks. An article in BMC Bioinformatics by Gundersen et al. [2] discusses these file formats in depth. Since annotation tracks often aim to describe vast amounts of data, choosing a flexible and compact format is important. The most used current formats are General Feature Format (GFF), Browser Extensible Data (BED) and Wiggle Track Format (WIG). All of these formats lack some functionality and are not mutually compatible, but they are very similar in that they are tabular and have a narrow focus. Two new proposed track formats are introduced in the article:

• GTrack

GTrack is a tabular format with specific headers for type of content, making it possible to convert any of 15 different formats to GTrack by setting the relevant information in the header. This makes GTrack very powerful for future use. GTrack allows only one kind of feature to be in focus in the annotation track. This means that we will have to create several annotation track files for the same genome.

• BioXSD

BioXSD is an XML-based format which allows features with overlapping intervals etc. to be combined in a single file. This format requires more processing, which is unfortunate when dealing with massive datasets.

3 Developing annotation comparisons

The coupled annotation tracks connect annotation tracks to each other. The next important step is to actually compare other kinds of annotations on the corresponding annotation tracks with each other. Creating such a tool poses several challenges, as shown in the earlier work by Gard Lund. [4] In this chapter these challenges are defined, and some suggestions for overcoming them are made. The challenges can be divided into two different types:

• Implementation challenges

These are challenges that affect how the implementation is done but in principle should not affect how the finished tool works. Making the right decisions will, however, have a major effect on productivity and on the time spent on development.

• Design challenges

These are decisions that impact the results given by the final tool. It might be possible to present a variety of options to the user, but the decisions still have to be considered when developing the tool.

3.1 Implementation challenges

There are a few general implementation challenges any programmer faces in a project. The goal is to avoid bugs and breakdowns and to arrive at an effective solution to the problem at hand. In the case of creating a tool for comparing annotation tracks, running time is of importance as well.

Discovering and avoiding bugs early

Software development methodology aims to increase productivity and decrease the potential for bugs and breakdowns. Research on how to develop applications effectively is a growing field, driven in part by the private sector. One popular philosophy is so-called test-driven development. A 2005 metastudy published in Computer suggests that test-driven development might increase productivity and recommends its use in academia. [6] Another study, assessing test-driven development at IBM, reports great improvements. [7] Test-driven programming is a way of developing software that forces the developer to test the code very often. The development cycle is described in the book "Test-Driven Development: By Example" [1] as follows:

• Add a test

• Run all tests and see if the new one fails

• Write some code

• Run the automated tests and see them succeed

• Refactor code

• Repeat

Test-driven programming is a way of forcing the developer to make sure that what has been produced so far works, and to continually think through the design. It is based on a bottom-up design process where you start by simplifying your problem drastically and then expand on it. The process is also distinctive in that the tests have to be written first and the code afterwards. Using test-driven programming should make it easier to identify problem areas and assess our results. This way of thinking might be especially beneficial when developing new algorithms and models. By having to test often, the developer is forced to think through the model itself from early on, so that mistakes in the model are more likely to be found early. One of the problems encountered earlier when trying to develop a tool for comparing annotation tracks was exactly that of pinpointing mistakes. [4]
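One iteration of the cycle above might look as follows, using Python's standard `unittest` module. This is only a sketch: the helper `count_overlap` and the drastically simplified problem (two boolean tracks of equal length) are assumptions made for illustration, not the tool's actual interface.

```python
import unittest


def count_overlap(track_a, track_b):
    """Count positions annotated in both boolean tracks (hypothetical helper)."""
    return sum(1 for a, b in zip(track_a, track_b) if a and b)


class OverlapTest(unittest.TestCase):
    # In the cycle, these tests are written first and fail until
    # count_overlap has been implemented.
    def test_no_overlap(self):
        self.assertEqual(count_overlap([True, False], [False, True]), 0)

    def test_partial_overlap(self):
        self.assertEqual(
            count_overlap([True, True, False], [False, True, True]), 1
        )


# Run the suite programmatically so that the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(OverlapTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each later extension of the model would start by adding a new failing test to such a suite before any code is changed.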

Running time

Array programming is a good option for optimizing code. The NumPy package implements array programming in Python. Gard Lund's work showed that using arrays to represent our annotation tracks is possible and beneficial for comparisons. Use of array programming often requires the developer to split problems into subproblems. [4] Another challenge that arises when using NumPy is the fact that the package is better suited for interactive sessions than for being part of programs. This creates a demand for extra attention when utilising methods from this package.
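To give an impression of the array programming style, the sketch below compares two tracks stored as NumPy boolean arrays, one entry per basepair. That representation, the variable names and the half-open interval convention are assumptions for this example; the thesis may lay out its arrays differently.

```python
import numpy as np

# Assumption: each track is a boolean array with one entry per basepair.
track_a = np.zeros(12, dtype=bool)
track_b = np.zeros(12, dtype=bool)
track_a[2:7] = True   # annotated interval [2, 7)
track_b[5:10] = True  # annotated interval [5, 10)

# An elementwise AND over the whole array replaces an explicit Python loop.
both = track_a & track_b
overlap_bp = int(np.count_nonzero(both))

# Interval boundaries are recovered by locating the positions where the
# zero-padded 0/1 signal changes value.
padded = np.concatenate(([0], both.astype(np.int8), [0]))
edges = np.flatnonzero(np.diff(padded))
overlap_intervals = [(int(s), int(e)) for s, e in zip(edges[::2], edges[1::2])]
```

The second half shows the kind of splitting into subproblems mentioned above: finding intervals becomes "pad, difference, locate non-zero entries" rather than a single loop over basepairs.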

3.2 Design challenges

The problem of comparing annotation tracks can be split into several subproblems, some that are fundamental to the problem and others that can be seen as an expansion of it. The challenges that have been identified are outlined in this section.

Base unit: interval or base pair

The earlier work on developing tools for comparing annotation tracks across genomes is based on a representation of intervals. [4] This might seem like a comfortable choice, considering that comparisons can then be reduced to set problems. However, complex problems arise during implementation, including overlaps in the mapping and secluded elements. Representing individual base pairs instead of whole intervals of base pairs might make problems clearer and the code more flexible during development.

Identifying overlaps

There are several possible sources of error in our data, stemming from either the underlying genomic sequence or the annotations themselves. This means we will have to be somewhat lenient when identifying overlaps in annotation tracks. This is one of the challenges that are given extra attention in Gard Lund's work. Two methods are mentioned [4]:

• Projection

By simply combining the coupled track elements into one larger annotation track we can use already developed methods to solve our problem. However, this raises the problem of what to do with the track elements not projected.

• Quantitative comparison

By first doing analysis on the environment surrounding coupled track elements we can include a larger part of each annotation track in our final analysis.

Quantitative comparison divides the problem into two subproblems and is the method used for five-vertebrate comparisons in the 2010 Science article. [9] Another interesting subproblem when identifying overlaps is identifying turnovers, i.e. not direct overlaps, but instances where the overlap is in the vicinity of the mapping. This is in itself an expansion of the problem that will have to be addressed at a later stage. When starting development, a conservative approach would be to ignore non-mapped track elements and count only overlapping coupled track elements, so-called strict overlaps.

Multiple overlaps

As mentioned earlier, a track element might be coupled several times in a coupled annotation track. A tool will therefore have to deal with cases where a track element maps to more than one other track element. If we are supposed to pick one, a fully functional tool will have to do analysis on the environment of each coupled track element. Such features should not be a requirement when starting development, as they increase complexity vastly. Throwing away a random overlap might be the best way to start off when developing the tool. This lets us test the basic features before tackling this complex challenge.
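One possible reading of this starting point is to keep a single, randomly chosen partner for every track element and discard the rest. The dictionary structure and the function name below are hypothetical, and passing in a seeded random generator is a choice made here so that the behaviour can be repeated in tests.

```python
import random


def resolve_multiple_couplings(couplings, rng):
    """Keep exactly one partner per element, chosen at random.

    Assumption: `couplings` maps an element id to a non-empty list of
    partner ids on the other genome.
    """
    return {
        element: rng.choice(partners)
        for element, partners in couplings.items()
    }


couplings = {"a1": ["b1"], "a2": ["b2", "b7"], "a3": ["b3", "b4", "b5"]}
resolved = resolve_multiple_couplings(couplings, random.Random(0))
```

Because the choice is random, results that depend on it should be interpreted with care until the analysis of each element's environment has been implemented.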

Figure 3: Possible road map for developing a tool comparing annotation tracks

Non-mapped elements

The notion of computing the shortest distance between elements across annotation tracks is also introduced by Gard Lund [4], but it is noted that it is not clear how meaningful such metrics are. Including non-mapped elements is a way of expanding our data set. This is an exercise that requires much thought. Expanding the coupled annotation tracks might give larger margins of error and might not make sense biologically.

Multiple annotation tracks

It is interesting to compare more than two annotation tracks at once. This is solved in Gard Lund's thesis by combining the results of comparing two annotation tracks at a time. Expanding the tool to compare more than two annotation tracks is therefore not needed until the very end of development.
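The idea of combining pairwise results could be sketched as below. The function `compare_pair` is a deliberately trivial placeholder, and the tracks are given as sets of annotated positions purely for brevity; neither reflects how the thesis actually compares two tracks.

```python
from itertools import combinations


def compare_pair(track_a, track_b):
    """Placeholder pairwise comparison: number of shared annotated positions."""
    return len(track_a & track_b)


def compare_all(tracks):
    """Apply the pairwise comparison to every unordered pair of tracks."""
    return {
        (name_a, name_b): compare_pair(tracks[name_a], tracks[name_b])
        for name_a, name_b in combinations(sorted(tracks), 2)
    }


# Hypothetical tracks given as sets of annotated positions.
tracks = {"human": {1, 2, 3, 8}, "mouse": {2, 3, 9}, "dog": {8}}
pairwise = compare_all(tracks)
```

With such a wrapper, support for more than two tracks only depends on the two-track comparison being correct, which is why it can be left to the end.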

The roadmap

The road map in Figure 3 illustrates a possible order for these challenges, which is in line with the test-driven development philosophy. Some of the challenges presented can be postponed in early development. This will help ensure that the core features of our model are correct. The development should start by writing test cases for identifying strict overlaps in a single coupled annotation track with no multiple overlaps. When these tests pass, the tool will be expanded by requiring the identification of partial overlaps. Then the notion of multiple overlaps and non-mapped elements will be introduced. In the end, functionality for more than two annotation tracks can be added as described above.

4 Verifying results

One important lesson from Gard Lund's master's thesis is that it is often difficult to know what results to expect from a comparative analysis of annotations. [4] This means that verifying the results can be quite hard, especially when the methods we will design have no counterpart. There are some means of assessing correctness, which we describe in the following sections.

4.1 Monte Carlo method

By computing the expected amount of overlap across two annotation tracks we get a rough benchmark to compare our results against. However, our genomes are not uniformly distributed random samples, so finding an analytical model for the expected amount of overlap might be far-fetched.

Monte Carlo methods are a group of algorithms that use random sampling to compute their results. They were initially developed by scientists at the Manhattan Project [8] and are often used for simulating systems with several degrees of freedom, especially in economics. In this case, Monte Carlo might be used by picking random track elements of our coupled annotation tracks and checking them for overlap, and from them computing how many overlaps we should expect in total. This method does not provide information on where exactly our overlaps are situated, but gives a benchmark to compare the total number of overlaps with.
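A minimal sketch of this sampling idea is given below. It assumes that both tracks have already been expressed as half-open intervals in a shared coordinate system, and that elements are sampled with replacement; the function names are hypothetical.

```python
import random


def overlaps(element, other_track):
    """True if the half-open interval overlaps any interval in other_track."""
    start, end = element
    return any(start < o_end and o_start < end for o_start, o_end in other_track)


def estimate_total_overlaps(track_a, track_b, n_samples, rng):
    """Estimate how many elements of track_a overlap track_b by sampling."""
    sample = [rng.choice(track_a) for _ in range(n_samples)]
    hits = sum(overlaps(element, track_b) for element in sample)
    # Scale the observed hit rate up to the size of the whole track.
    return len(track_a) * hits / n_samples


# Hypothetical tracks: every second element of track_a has a match in track_b.
track_a = [(i * 10, i * 10 + 5) for i in range(100)]
track_b = [(i * 20, i * 20 + 5) for i in range(50)]
estimate = estimate_total_overlaps(track_a, track_b, 2000, random.Random(1))
```

In this constructed example exactly half of the 100 elements overlap, so the estimate should land close to 50; on real data the true value is unknown and the estimate serves only as the benchmark described above.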

4.2 Reproducing known results

An article published in Science [9] presents a comparison of transcription factor binding sites across three genomes. Reproducing these results will be an important step in assessing our work. The article itself does not describe how this part of the work was done. This might be because the code written is very dependent on that use case. If the tool is able to reproduce the results published in the article, we have an important indicator that the tool is working. The tool created and described in Gard Lund's thesis got significantly lower results than the article. If this is repeated it will be important to be able to find an explanation for the deviation, which will hopefully be possible through a strict test-driven development methodology.

5 Conclusion

Creating a tool for comparison of annotation tracks will be of great use to life scientists. However, as shown in Gard Lund's master's thesis [4], this is not an easy feat. A bottom-up, test-driven approach to the problem will lessen the probability of failure. In addition there are several design challenges to address. To avoid large complexity in the code, each of these design challenges will be tackled individually. A road map has been proposed. Further work, including getting a broader biological understanding of these challenges, has to be done to solve some of them.

References

[1] Kent Beck. Test-Driven Development: By Example. Addison-Wesley Professional, 2003.

[2] Gundersen et al. Identifying elemental genomic track types and representing them uniformly. BMC Bioinformatics, 12(1):494, 2011.

[3] Sandve et al. The Genomic HyperBrowser: inferential genomics at the sequence level. Genome Biology, 11(12), 2010.

[4] E. Gard Lund. An extensible framework for comparative analysis of annotations. Master's thesis, University of Oslo, 2011.

[5] J.P.A. Ioannidis, D.B. Allison, C.A. Ball, I. Coulibaly, X. Cui, A.C. Culhane, M. Falchi, C. Furlanello, L. Game, G. Jurman, et al. Repeatability of published microarray gene expression analyses. Nature Genetics, 41(2):149–155, 2008.

[6] D. Janzen and H. Saiedian. Test-driven development concepts, taxonomy, and future direction. Computer, 38(9):43–50, 2005.

[7] E.M. Maximilien and L. Williams. Assessing test-driven development at IBM. In Software Engineering, 2003. Proceedings. 25th International Conference on, pages 564–569. IEEE, 2003.

[8] N. Metropolis. The beginning of the Monte Carlo method. Los Alamos Science, 15(584):125–130, 1987.

[9] D. Schmidt, M. D. Wilson, B. Ballester, P. C. Schwalie, G. D. Brown, A. Marshall, C. Kutter, et al. Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science (New York, NY), 2010.