When a genome sequence is used for analysis, displaying the DNA efficiently becomes critical. A plain file with billions of letters of genomic DNA provides too much information to be helpful. The UCSC Genome Browser addresses this by collecting different types of information and combining everything relevant on one site.
UCSC Genome Bioinformatics is a website³ that holds the reference sequences for a large collection of genomes. It is developed and maintained by the Genome Bioinformatics Group, a cross-departmental team within the UC Santa Cruz Genomics Institute at the University of California Santa Cruz (UCSC). It contains several types of general datasets, and each dataset is available in multiple formats for different purposes.
UCSC uses about twenty different file formats. Many of them have their own benefits, but we will focus on three: the axt, BED and sizes formats. This thesis does not cover the remaining formats; interested readers are encouraged to visit the UCSC format documentation⁴.
² http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt
³ http://genome.ucsc.edu/index.html
⁴ https://genome.ucsc.edu/FAQ/FAQformat.html
2.3.1 Axt format
The axt format is essential for building the graph. It is produced by Blastz, an alignment tool from Webb Miller’s lab at Penn State University. An axt file holds the alignments between two chromosomes in a compact layout. Each alignment block consists of three lines: a summary line and two sequence lines, one for each of the two chromosomes. Blocks are separated by a blank line. Figure 2.3 shows the first two blocks of an axt file⁵.
0 chr19 3001012 3001075 chr11 70568380 70568443 - 3500
TCAGCTCATAAATCACCTCCTGCCACAAGCCTGGCCTGGTCCCAGGAGAGTGTCCAGGCTCAGA
TCTGTTCATAAACCACCTGCCATGACAAGCCTGGCCTGTTCCCAAGACAATGTCCAGGCTCAGA

1 chr19 3008279 3008357 chr11 70573976 70574054 - 3900
CACAATCTTCACATTGAGATCCTGAGTTGCTGATCAGAATGGAAGGCTGAGCTAAGATGAGCGACGAGGCAATGTCACA
CACAGTCTTCACATTGAGGTACCAAGTTGTGGATCAGAATGGAAAGCTAGGCTATGATGAGGGACAGTGCGCTGTCACA
Figure 2.3: Example of an axt format
The first line of a block is called the information line; it describes the primary chromosome and the aligning chromosome. The second and third lines contain the sequences of the primary chromosome and the aligning chromosome, respectively. After the sequences, a blank line marks the end of the block.
Reading the information line in figure 2.3 from left to right:
Alignment number
The alignment numbers start at 0 and increase by 1 for each block.
Chromosome (Primary)
Alignment start (Primary)
Start position.
Alignment end (Primary)
End position.
Chromosome (Aligning)
Alignment start (Aligning)
Start position.
Alignment end (Aligning)
End position.
Strand (Aligning)
If the strand value is ’-’, the aligning start and end fields are given relative to the reverse-complemented coordinates of its chromosome.
Blastz score
Different organisms use different blastz scoring matrices.
The information line of each block in figure 2.3 holds everything this thesis needs from the axt format. The two sequence lines are not used, so the sequences themselves are not relevant here.
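Since only the information lines are needed, reading an axt file reduces to picking the first line of every block and skipping the two sequence lines. The following Python sketch illustrates one way to do this. The names `AxtSummary` and `read_axt_summaries` are hypothetical, the treatment of lines starting with `#` as comments is an assumption, and the sequences in the example are shortened versions of those in figure 2.3.

```python
from typing import Iterable, Iterator, NamedTuple


class AxtSummary(NamedTuple):
    """One axt information line; the field names are illustrative only."""
    number: int
    primary_chrom: str
    primary_start: int
    primary_end: int
    aligning_chrom: str
    aligning_start: int
    aligning_end: int
    strand: str
    score: int


def read_axt_summaries(lines: Iterable[str]) -> Iterator[AxtSummary]:
    """Yield the information line of each block and skip both sequence lines."""
    expect_summary = True
    for raw in lines:
        line = raw.strip()
        if not line:
            expect_summary = True  # a blank line ends the current block
            continue
        if line.startswith("#") or not expect_summary:
            continue  # comment line (assumed) or sequence line
        fields = line.split()
        if len(fields) != 9:
            raise ValueError(f"expected 9 fields, got {len(fields)}: {line!r}")
        yield AxtSummary(
            int(fields[0]), fields[1], int(fields[2]), int(fields[3]),
            fields[4], int(fields[5]), int(fields[6]), fields[7], int(fields[8]),
        )
        expect_summary = False


# The two blocks of figure 2.3, with the sequences shortened for readability.
example = """\
0 chr19 3001012 3001075 chr11 70568380 70568443 - 3500
TCAGCTCATAAATCACC
TCTGTTCATAAACCACC

1 chr19 3008279 3008357 chr11 70573976 70574054 - 3900
CACAATCTTCACATTGA
CACAGTCTTCACATTGA
"""
summaries = list(read_axt_summaries(example.splitlines()))
```

Passing an open file object instead of `example.splitlines()` works the same way, because the function only iterates over lines.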
2.3.2 Sizes format
The sizes format contains the lengths of the primary and aligning chromosomes from the axt format. Since the axt format refers to two sets of chromosomes, primary and aligning, and every chromosome has its own length, we need two sizes files. Each file has two columns separated by a tab: the first column contains the sequence name and the second column contains its length. Figure 2.4 shows an example of the sizes format.
chr1	2492621
chr2	2431373
Figure 2.4: Example of a sizes file for the first two chromosomes in an axt file
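One place where the chromosome lengths matter is the ’-’ strand case from section 2.3.1, where the aligning coordinates refer to the reverse-complemented chromosome. The Python sketch below reads a sizes file and maps such an interval back to the forward strand. It assumes that axt coordinates are 1-based with an inclusive end; the function names and the interval 100 to 200 are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Tuple


def read_sizes(lines: Iterable[str]) -> Dict[str, int]:
    """Map each sequence name to its length (two whitespace-separated columns)."""
    sizes: Dict[str, int] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        name, length = line.split()[:2]
        sizes[name] = int(length)
    return sizes


def to_forward_strand(start: int, end: int, chrom_size: int) -> Tuple[int, int]:
    """Convert a reverse-strand interval to forward-strand coordinates.

    Assumes 1-based positions with an inclusive end, so position p on the
    reverse strand corresponds to chrom_size - p + 1 on the forward strand.
    """
    return chrom_size - end + 1, chrom_size - start + 1


# The two lines of figure 2.4.
sizes = read_sizes(["chr1\t2492621", "chr2\t2431373"])

# Hypothetical reverse-strand interval on chr2, for illustration only.
forward = to_forward_strand(100, 200, sizes["chr2"])
print(forward)  # (2431174, 2431274)
```

The conversion swaps the roles of start and end, so the interval keeps its length and the start stays smaller than the end.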
2.3.3 BED format
The BED (Browser Extensible Data) format specifies the genomic locations we want to investigate in the primary and aligning chromosomes.
A BED line has three required fields and nine optional fields, separated by tabs. The number of fields must be the same on every line of a file. Figure 2.5 shows an example of a complete BED file that uses all twelve fields.
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
Figure 2.5: Example of a complete BED format
The first line in figure 2.5 is the track (header) line, which is not relevant in this thesis. Reading one of the data lines in figure 2.5 from left to right:
chrom
The name of the chromosome.
chromStart
The starting position of the feature on the chromosome.
chromEnd
The ending position of the feature on the chromosome.
name
The name of the BED line.
score
A score between 0 and 1000. If the track line attribute "useScore" is set to 1, the score determines the level of grey in which the line is displayed; higher numbers give darker grey. Figure 2.6 shows how the UCSC website⁶ translates score values into shades of grey.
Figure 2.6: Genome Browser’s translation of BED score values
strand
The strand is either ’+’ or ’-’.
thickStart
The starting position at which the feature is drawn thickly, for example at a start codon. If there is no thick part, thickStart is normally set to the chromStart position.
thickEnd
The ending position at which the feature is drawn thickly, for example at a stop codon. If there is no thick part, thickEnd is normally set to the chromStart position.
⁶ https://genome.ucsc.edu/FAQ/FAQformat.html#format1
itemRgb
An RGB value of the form R,G,B. If the track line attribute "itemRgb" is set to "on", the RGB value determines the display colour of the data contained in the BED line.
blockCount
The number of blocks in the BED line.
blockSizes
A comma-separated list of the block sizes. The number of items should correspond to blockCount.
blockStarts
A comma-separated list of block starts. All the positions should be calculated relative to chromStart. The number of items should correspond to blockCount.
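To make the relation between chromStart, blockSizes and blockStarts concrete, the following Python sketch computes the absolute position of each block for the two data lines in figure 2.5. The function name `block_intervals` is hypothetical, and the sketch assumes that BED intervals are half-open, meaning that the end position itself is not part of the block.

```python
from typing import List, Tuple


def block_intervals(chrom_start: int, block_sizes: str,
                    block_starts: str) -> List[Tuple[int, int]]:
    """Return absolute (start, end) pairs for the blocks of one BED line.

    Both lists are comma-separated and may carry a trailing comma, as in
    figure 2.5, so empty items are dropped before converting to integers.
    """
    sizes = [int(s) for s in block_sizes.split(",") if s]
    starts = [int(s) for s in block_starts.split(",") if s]
    if len(sizes) != len(starts):
        raise ValueError("blockSizes and blockStarts differ in length")
    # Block starts are relative to chromStart, so shift them before use.
    return [(chrom_start + s, chrom_start + s + n) for s, n in zip(starts, sizes)]


clone_a = block_intervals(1000, "567,488,", "0,3512")
clone_b = block_intervals(2000, "433,399,", "0,3601")
print(clone_a)  # [(1000, 1567), (4512, 5000)]
print(clone_b)  # [(2000, 2433), (5601, 6000)]
```

In both lines the last block ends exactly at chromEnd (5000 and 6000), which is a useful consistency check when reading block data.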
The first three columns are the required fields, while the rest are optional. This thesis uses only the required fields, so the optional fields are not relevant here.
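Because only the three required fields are used, a BED file can be reduced to a list of (chrom, chromStart, chromEnd) triples. The Python sketch below shows one possible way to do this; `BedRegion` and `read_bed_regions` are hypothetical names, and skipping lines that begin with `track`, `browser` or `#` is an assumption about which lines are headers. Fields are split on any whitespace, so both tab-separated files and the space-separated layout of figure 2.5 are accepted.

```python
from typing import Iterable, Iterator, NamedTuple


class BedRegion(NamedTuple):
    """The three required BED fields."""
    chrom: str
    start: int
    end: int


def read_bed_regions(lines: Iterable[str]) -> Iterator[BedRegion]:
    """Yield the required fields of each data line and ignore the rest."""
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # blank line
        if fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue  # header or comment line (assumed)
        if len(fields) < 3:
            raise ValueError(f"BED line needs at least 3 fields: {raw!r}")
        yield BedRegion(fields[0], int(fields[1]), int(fields[2]))


# The three lines of figure 2.5.
bed_example = [
    'track name=pairedReads description="Clone Paired Reads" useScore=1',
    "chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512",
    "chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601",
]
regions = list(read_bed_regions(bed_example))
print(regions[0])  # BedRegion(chrom='chr22', start=1000, end=5000)
```

The optional columns are simply left unread, which keeps the reader valid for BED files with anywhere from three to twelve fields.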