Data File Format - Analyzing genomic datasets by finding overlaps between coordinates in differ

When a genome sequence is used for analysis, displaying the DNA sequence efficiently is critical. A plain file with billions of letters of genomic DNA provides too much information to be helpful. The "UCSC Genome Browser" addresses this by collecting different types of information and combining everything relevant on one site.

The "UCSC Genome Bioinformatics" is a website³ which holds the reference sequences for a large collection of genomes. It is developed and maintained by the Genome Bioinformatics Group, a cross-departmental team within the UC Santa Cruz Genomics Institute at the University of California Santa Cruz (UCSC). It contains several different types of general datasets, and each dataset is available in multiple formats for different purposes.

There are about twenty different file formats in UCSC. Many of them have their own benefits, but we will focus on three: the axt, BED and sizes formats. This thesis won’t cover the remaining formats, but interested readers are encouraged to visit the UCSC format documentation⁴.

² http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt

³ http://genome.ucsc.edu/index.html

⁴ https://genome.ucsc.edu/FAQ/FAQformat.html

2.3.1 Axt format

The axt format is essential for building the graph. It is produced by Blastz, an alignment tool from Webb Miller’s lab at Penn State University. An axt file holds alignments between two chromosomes in a compact form. Each alignment block consists of three lines: a summary line and two sequence lines, one for each of the two chromosomes. Blocks are separated by a blank line. Figure 2.3 shows the first two blocks of an axt file⁵.

0 chr19 3001012 3001075 chr11 70568380 70568443 - 3500
TCAGCTCATAAATCACCTCCTGCCACAAGCCTGGCCTGGTCCCAGGAGAGTGTCCAGGCTCAGA
TCTGTTCATAAACCACCTGCCATGACAAGCCTGGCCTGTTCCCAAGACAATGTCCAGGCTCAGA

1 chr19 3008279 3008357 chr11 70573976 70574054 - 3900
CACAATCTTCACATTGAGATCCTGAGTTGCTGATCAGAATGGAAGGCTGAGCTAAGATGAGCGACGAGGCAATGTCACA
CACAGTCTTCACATTGAGGTACCAAGTTGTGGATCAGAATGGAAAGCTAGGCTATGATGAGGGACAGTGCGCTGTCACA

Figure 2.3: Example of an axt format

The first line of a block is called the information line; it describes the primary chromosome and the aligning chromosome. The second and third lines contain the sequences of the primary chromosome and the aligning chromosome, respectively. After the sequences, a blank line indicates the end of the block.

Reading the information line in figure 2.3 from left to right:

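Alignment number

The alignment number starts at 0 and increases by 1 for each block.

Chromosome (Primary)

Name of the primary chromosome.

Alignment start (Primary)

Start position.

Alignment end (Primary)

End position.

Chromosome (Aligning)

Name of the aligning chromosome.

Alignment start (Aligning)

Start position.

Alignment end (Aligning)

End position.

Strand (Aligning)

If the strand value is ’-’, the aligning start and end fields are given relative to the reverse-complemented coordinates of its chromosome.

Blastz score

Different organisms use different blastz scoring matrices.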

The information line of each block in figure 2.3 holds all the information necessary for this thesis. The two sequence lines will not be used, so the sequences in the axt format are not relevant.
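As a minimal sketch of how only the information lines could be read from an axt file, the Python snippet below walks through the blocks and skips the sequence lines. The names `AxtInfo` and `read_axt_info` are hypothetical and not part of the thesis, and skipping lines that start with `#` is an assumption about comment lines that some axt files carry.

```python
from typing import Iterable, Iterator, NamedTuple


class AxtInfo(NamedTuple):
    """The nine fields of an axt information line (hypothetical container)."""
    number: int
    primary_chrom: str
    primary_start: int
    primary_end: int
    aligning_chrom: str
    aligning_start: int
    aligning_end: int
    strand: str
    score: int


def read_axt_info(lines: Iterable[str]) -> Iterator[AxtInfo]:
    """Yield the information line of each block and ignore the sequences."""
    at_block_start = True  # true at the top of the file and after a blank line
    for raw in lines:
        line = raw.strip()
        if not line:
            at_block_start = True  # a blank line ends the current block
            continue
        if line.startswith("#"):
            continue  # assumption: '#' marks a comment line
        if at_block_start:
            f = line.split()
            yield AxtInfo(int(f[0]), f[1], int(f[2]), int(f[3]),
                          f[4], int(f[5]), int(f[6]), f[7], int(f[8]))
            at_block_start = False  # the next two lines are sequences


# The two blocks from figure 2.3.
example = """\
0 chr19 3001012 3001075 chr11 70568380 70568443 - 3500
TCAGCTCATAAATCACCTCCTGCCACAAGCCTGGCCTGGTCCCAGGAGAGTGTCCAGGCTCAGA
TCTGTTCATAAACCACCTGCCATGACAAGCCTGGCCTGTTCCCAAGACAATGTCCAGGCTCAGA

1 chr19 3008279 3008357 chr11 70573976 70574054 - 3900
CACAATCTTCACATTGAGATCCTGAGTTGCTGATCAGAATGGAAGGCTGAGCTAAGATGAGCGACGAGGCAATGTCACA
CACAGTCTTCACATTGAGGTACCAAGTTGTGGATCAGAATGGAAAGCTAGGCTATGATGAGGGACAGTGCGCTGTCACA
"""

blocks = list(read_axt_info(example.splitlines()))
for b in blocks:
    print(b.number, b.primary_chrom, b.primary_start, b.primary_end,
          b.aligning_chrom, b.strand, b.score)
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the example list, so a large axt file never has to be held in memory.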

2.3.2 Sizes format

The sizes format contains the lengths of the primary and aligning chromosomes from the axt format. Since an axt file involves two sets of chromosomes, we need two sizes files, because chromosome lengths differ. Each line consists of two columns separated by a tab: the first column contains the sequence name and the second column contains its length. Figure 2.4 shows an example of the sizes format.

chr1	2492621
chr2	2431373

Figure 2.4: Example of sizes for the first two chromosomes in an axt file
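A sizes file can be loaded into a simple name-to-length lookup. The sketch below also shows one plausible use of that lookup, which the text does not spell out: converting axt coordinates that are reported on the ’-’ strand back to forward-strand coordinates. The function names `read_sizes` and `to_forward_strand` are hypothetical, and the conversion assumes the 1-based, inclusive coordinates used by axt.

```python
from typing import Dict, Iterable, Tuple


def read_sizes(lines: Iterable[str]) -> Dict[str, int]:
    """Map each sequence name to its length."""
    sizes: Dict[str, int] = {}
    for line in lines:
        if line.strip():  # ignore empty lines
            name, length = line.split()[:2]
            sizes[name] = int(length)
    return sizes


def to_forward_strand(start: int, end: int, strand: str,
                      chrom_size: int) -> Tuple[int, int]:
    """Convert 1-based inclusive coordinates to the forward strand.

    On the '-' strand, position p counts from the other end of the
    chromosome, so it corresponds to chrom_size - p + 1 on the '+' strand.
    """
    if strand == "+":
        return start, end
    return chrom_size - end + 1, chrom_size - start + 1


# The two lines from figure 2.4.
sizes = read_sizes(["chr1\t2492621", "chr2\t2431373"])

# Illustrative interval (not taken from the thesis) on the '-' strand of chr2.
print(to_forward_strand(101, 200, "-", sizes["chr2"]))  # (2431174, 2431273)
```

The converted interval keeps its length, since only the direction of counting changes.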

2.3.3 BED format

The BED (Browser Extensible Data) format specifies the genomic locations we want to investigate in the main and aligning chromosomes.

There are three required fields and nine optional fields, separated by tabs. The number of fields must stay consistent for every line in a file. Figure 2.5 shows an example of a complete BED format.

track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601

Figure 2.5: Example of a complete BED format

The first line in figure 2.5 is a header (track) line, which is not relevant in this thesis. Reading a data line in figure 2.5 from left to right:

chrom

The name of the chromosome.

chromStart

The starting position of the feature on the chromosome.

chromEnd

The ending position of the feature on the chromosome.

name

The name of the BED line.

score

A score between 0 and 1000. If the track line attribute "useScore" is set to 1, the score determines the level of grey in which the feature is displayed; higher numbers give darker grey. Figure 2.6 shows how score values translate into grey levels on the UCSC website⁶.

Figure 2.6: Genome Browser’s translation of BED score values

strand

The strand is either ’+’ or ’-’.

thickStart

The starting position at which the feature is drawn thickly, for example a start codon. If there is no thick part, thickStart is normally set to the chromStart position.

thickEnd

The ending position at which the feature is drawn thickly, for example a stop codon. If there is no thick part, thickEnd is normally set to the chromStart position.

⁶ https://genome.ucsc.edu/FAQ/FAQformat.html#format1

itemRgb

An RGB value of the form R,G,B. If the track line attribute "itemRgb" is set to "on", the RGB value determines the display colour of the data contained in the BED line.

blockCount

The number of blocks in the BED line.

blockSizes

A comma-separated list of the block sizes. The number of items should correspond to blockCount.

blockStarts

A comma-separated list of block starts. All the positions should be calculated relative to chromStart. The number of items should correspond to blockCount.

The first three columns are the required fields, while the rest are optional. The optional fields are not used in this thesis and are therefore not relevant.
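Since only the three required fields matter here, a BED reader can be kept very small. The following sketch uses the hypothetical names `BedInterval` and `read_bed`; it skips track and browser header lines and ignores all optional columns. Treating chromStart as 0-based and chromEnd as exclusive follows the UCSC format description and is not stated in the text above.

```python
from typing import Iterable, Iterator, NamedTuple


class BedInterval(NamedTuple):
    """The three required BED fields (hypothetical container)."""
    chrom: str
    start: int  # chromStart: 0-based, per the UCSC format description
    end: int    # chromEnd: not included in the feature


def read_bed(lines: Iterable[str]) -> Iterator[BedInterval]:
    """Yield chrom, chromStart and chromEnd for every data line."""
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # empty line
        if fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue  # header or comment line, not a data line
        if len(fields) < 3:
            raise ValueError(f"BED line needs at least 3 fields: {raw!r}")
        yield BedInterval(fields[0], int(fields[1]), int(fields[2]))


# The three lines from figure 2.5.
example = [
    'track name=pairedReads description="Clone Paired Reads" useScore=1',
    "chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512",
    "chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601",
]

intervals = list(read_bed(example))
for iv in intervals:
    print(iv.chrom, iv.start, iv.end)
```

Splitting on any whitespace rather than on tabs alone is a deliberate simplification, so the example from figure 2.5 parses whether its columns are separated by tabs or spaces.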

In document Analyzing genomic datasets by finding overlaps between coordinates in different reference genomes (Page 21-25)