Aligning sequences to the POG - Gyper: A graph-based HLA genotyper using aligned DNA sequences

When aligning sequences to our graph the goal is to determine which pair of alleles can explain the most reads. Our assumption is that this pair of alleles is the individual’s true allele pair.

We have a backtracker which keeps track both of whether the read matches and of the read’s path through the graph. The match is simply an array of boolean values whose length equals the length of the read. All values in this array are initially false (0) and are set to true when a match is found. The backtracker also stores, for each match, which node precedes it, so we know how the match traversed the graph. Both arrays have the same length as the read we are aligning. In our alignment we are free to start anywhere and end anywhere: it is a semi-global alignment in which both ends of the reference are free. In general the two arrays would need one more element than the read, but since we can start anywhere the first boolean would always be true, so there is no need to store it.

Since the graph is acyclic we can always find a topological sorting of the nodes, meaning that if a node n1 depends on the results of node n2, then n2 will never depend on n1 or on any node that depends on n1. Also there is one, and only one, node that does not depend on any other node. That node is the first node in our topological sort.
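As an illustration of the ordering described above, the following is a minimal sketch of a topological sort using Kahn's algorithm. The text does not say which sorting method Gyper uses, so the algorithm choice, the function name `topological_sort`, and the edge-list representation are all assumptions made for this example.

```python
from collections import deque

def topological_sort(edges, num_nodes):
    """Return the nodes of a directed acyclic graph in topological order.

    Hypothetical helper: nodes are the integers 0..num_nodes-1 and
    edges is a list of (source, target) tuples.
    """
    successors = [[] for _ in range(num_nodes)]
    in_degree = [0] * num_nodes
    for source, target in edges:
        successors[source].append(target)
        in_degree[target] += 1

    # Nodes without dependencies can be emitted immediately.
    ready = deque(n for n in range(num_nodes) if in_degree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for target in successors[node]:
            in_degree[target] -= 1
            if in_degree[target] == 0:
                ready.append(target)

    if len(order) != num_nodes:
        raise ValueError("graph contains a cycle")
    return order

# Toy graph: node 0 is the single node without dependencies.
toy_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
toy_order = topological_sort(toy_edges, 5)
```

In this toy graph every node appears after all of the nodes it depends on, which is the property the aligner relies on.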

3.3.1 Algorithm

We use a dynamic programming algorithm to align sequences to the graph, as shown in Algorithm 3. Our algorithm requires O(nm) time in the worst case, where n is the number of edges and m is the length of the sequence. It visits every edge of the graph and compares the bases of the sequence with the DNA base of the edge’s target node.

While traversing the graph in topological order, it is guaranteed that the current node’s dependencies have already been calculated. When a match is found we only need to set a boolean value in the array and store a reference to the previous node. Because a sequence can align to more than one location, the aligner returns a list of nodes where the sequence was matched. If the two reads of a read pair align to locations that are very far from each other, we discard that read pair.

The highest distance between reads allowed is arbitrarily chosen to be 800 base pairs. If the read is aligned to multiple locations we need to choose the best one to use. We chose the best distance between two reads to be 350 base pairs. These two values were estimated from the 99.99% highest and the most common insert sizes of deCODE’s BAM files, respectively.
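The two thresholds above could be applied as in the following sketch. The values 800 and 350 come from the text; everything else is an assumption for illustration: that alignment locations are integer coordinates, that distance is their absolute difference, and that "best" means the candidate pair whose distance is closest to 350. The function name `choose_pair` is hypothetical.

```python
MAX_DISTANCE = 800   # largest allowed distance between the two reads (from the text)
BEST_DISTANCE = 350  # preferred distance between the two reads (from the text)

def choose_pair(first_positions, second_positions):
    """Pick the pair of alignment positions whose distance is closest to BEST_DISTANCE.

    Returns None when every candidate pair is farther apart than MAX_DISTANCE,
    in which case the read pair would be discarded.
    """
    best = None
    best_error = None
    for p1 in first_positions:
        for p2 in second_positions:
            distance = abs(p2 - p1)  # assumption: distance is a coordinate difference
            if distance > MAX_DISTANCE:
                continue
            error = abs(distance - BEST_DISTANCE)
            if best_error is None or error < best_error:
                best, best_error = (p1, p2), error
    return best

# The first read aligns at 100 or 5000, the second at 460 or 5900.
chosen = choose_pair([100, 5000], [460, 5900])
```

Here only the pair (100, 460) lies within 800 base pairs, so it is the one kept; a pair such as positions 0 and 1000 would be discarded.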

3.3 Aligning sequences to the POG

In the example of Figure 3.7, the aligner matches the read ACAT to nodes 4 (A), 5 (no base), 7 (C), 8 (A), and 9 (T). Here, the 9th node is the only node in the output list, so when the alignment has finished we backtrack from that node only. Backtracking has a complexity of O(m) in the worst case.

3.3.2 Backtracking

The backtracking algorithm uses the backtracker to determine which reference alleles can explain the aligned read. It picks a node returned by the alignment algorithm and starts backtracking from there. Initially it holds a bit string whose length equals the number of reference alleles, with all bits set to 1. As the backtracking algorithm travels backwards through the graph, it performs a bitwise AND with the bit string of every edge that carries one, that is, every edge traversed inside an exon.

What we end with is a bit string whose bits are set only if the read followed the corresponding reference allele exactly. In other words, those reference alleles explain the read. If, however, we are traversing an intron, we do not know which reference allele created that edge of the graph, so we would rather say that any reference can explain the read, for the reasons we discussed before.

Continuing with the previous example (Figure 3.6) the backtracking of the sequence ACAT would generate the following calculations:

111 AND 111 AND 100 AND 100 = 100

The convention for bit strings is that the rightmost bit is the first one, so the bit string 100 means that only the third reference explains the read. The exon of the third reference was CATA, which ACAT overlaps. ACAT does not overlap the other two exons.

If, however, we aligned the read TTA, the read would map to nodes 1, 3, and 4. Since none of the edges between these nodes carries a bit string, no AND operations are required and we simply keep the bit string:

111
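The two calculations above can be reproduced with a short sketch. It assumes that each edge bit string is stored as an integer and that edges without a bit string are simply skipped; the function names `explain` and `explaining_references` are hypothetical and not taken from Gyper.

```python
from functools import reduce

def explain(edge_bits, num_references):
    """AND together the bit strings of the edges visited while backtracking.

    Starts from a bit string with one bit per reference allele, all set to 1.
    """
    all_references = (1 << num_references) - 1  # e.g. 0b111 for three references
    return reduce(lambda acc, bits: acc & bits, edge_bits, all_references)

def explaining_references(bits, num_references):
    """List the 1-based reference numbers whose bit is set (rightmost bit is the first)."""
    return [i + 1 for i in range(num_references) if (bits >> i) & 1]

# ACAT: the initial 111 is ANDed with the edge bit strings 111, 100 and 100.
acat_bits = explain([0b111, 0b100, 0b100], 3)

# TTA: no visited edge carries a bit string, so nothing is ANDed.
tta_bits = explain([], 3)
```

With these inputs `acat_bits` is `0b100`, so only the third reference explains ACAT, while `tta_bits` stays `0b111` and any of the three references can explain TTA.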

3 Methods

Input: A sequenced read from an individual who is being genotyped for the gene gene.

Output: A backtracker we can use to find all references that explain the read, and an array of nodes where alignments end.

graph ← a partial order graph for gene
order ← TopologicalSort(graph)
backtracker ← array for each node in graph storing both match (true or false) and previousNode
nodes ← empty array
for source in order do
    for edge in edges directed from source do
        target ← edge’s target
        if target stores dna then
            if read[0] == target.dna then
                backtracker[target].match[0] ← true
                backtracker[target].previousNode[0] ← source
            end
            pos ← 1
            while pos is smaller than the length of the read do
                if backtracker[source].match[pos - 1] and read[pos] == target.dna then
                    backtracker[target].match[pos] ← true
                    backtracker[target].previousNode[pos] ← source
                end
                Increment pos by 1
            end
            if backtracker[target].match[Length(read) - 1] then
                Add target to nodes
            end
        else
            backtracker[target].match ← array of true values
            backtracker[target].previousNode ← array of source
        end
    end
end
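The pseudocode above can be sketched in Python as follows. This is only an illustration under stated assumptions, not Gyper's implementation: the graph is passed in as a precomputed topological order, a dictionary of successors, and a dictionary mapping each node to its base (or `None` for a node without a base). The names `align`, `successors`, and `dna` are hypothetical. The sketch reads the match condition as requiring that the source node matched the previous read position, and it treats nodes without a base exactly as the pseudocode does.

```python
def align(read, order, successors, dna):
    """Align a non-empty read to an acyclic graph without gaps or mismatches.

    Returns (match, previous, end_nodes): match[node][pos] is True when
    read[:pos + 1] can end at node, previous[node][pos] is the node visited
    just before, and end_nodes lists nodes where a full alignment ends.
    """
    m = len(read)
    if m == 0:
        raise ValueError("read must not be empty")
    match = {node: [False] * m for node in order}
    previous = {node: [None] * m for node in order}
    end_nodes = []
    for source in order:
        for target in successors.get(source, []):
            base = dna[target]
            if base is None:
                # As in the pseudocode: a node without a base matches everywhere.
                match[target] = [True] * m
                previous[target] = [source] * m
                continue
            # Free start: the first base of the read may match any node.
            if read[0] == base:
                match[target][0] = True
                previous[target][0] = source
            # Extend alignments that ended at the source one position earlier.
            for pos in range(1, m):
                if match[source][pos - 1] and read[pos] == base:
                    match[target][pos] = True
                    previous[target][pos] = source
            if match[target][m - 1] and target not in end_nodes:
                end_nodes.append(target)
    return match, previous, end_nodes

# Toy graph (hypothetical): start node 0 has no base, then A, a C/G branch, then T.
toy_dna = {0: None, 1: "A", 2: "C", 3: "G", 4: "T"}
toy_successors = {0: [1], 1: [2, 3], 2: [4], 3: [4]}
toy_order = [0, 1, 2, 3, 4]
_, toy_previous, toy_ends = align("CT", toy_order, toy_successors, toy_dna)
```

In the toy graph the read CT ends at node 4 after passing through node 2, while the read AT does not align at all, because doing so would require a gap.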

Figure 3.7: Alignment of the sequence ACAT to the graph from Figure 3.6. The numbers below each node denote their topological sort order, and blue edges are the path of the alignment.

3.3.3 Genotyping constraints

When genotyping, we estimate how likely it is that an individual has a particular allele from how many reads that allele can explain. Both the read and its complement are aligned to the graph, since we do not know the read’s direction relative to the reference. We use the following constraints:

• Each individual has either one or two different alleles. Everyone has two copies of chromosome 6, so each individual can have two different variations.

• A read needs to be continuous, meaning we never add gaps to it, or to the reference, while aligning.

• Under some strict circumstances we may allow a mismatch between the read and the reference. We allow this but no other type of error, since the most frequent errors in Illumina read data are mismatches [Hoffmann et al., 2009].

• Since both reads in a read pair come from the same chromosome, we perform a bitwise AND operation on the two reads’ bit strings. All non-paired reads are discarded.
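The read-pair constraint, together with the stated goal of finding the pair of alleles that explains the most reads, can be sketched as below. This is a deliberate simplification and not the model described in the thesis: it assumes a read pair counts as explained by an allele pair when at least one of the two alleles has its bit set, and it ignores mismatches and any trained parameters. The function names `combine_read_pair` and `best_allele_pair` are hypothetical.

```python
from itertools import combinations_with_replacement

def combine_read_pair(first_bits, second_bits):
    """Both reads come from the same chromosome, so AND their bit strings."""
    return first_bits & second_bits

def best_allele_pair(read_pair_bits, num_references):
    """Return the allele pair (1-based) that explains the most read pairs.

    Assumption: a read pair is explained by an allele pair when at least
    one of the two alleles has its bit set in the combined bit string.
    """
    best_pair, best_count = None, -1
    for a, b in combinations_with_replacement(range(num_references), 2):
        mask = (1 << a) | (1 << b)
        count = sum(1 for bits in read_pair_bits if bits & mask)
        if count > best_count:
            best_pair, best_count = (a + 1, b + 1), count
    return best_pair, best_count

# Four toy read pairs over three reference alleles.
toy_pairs = [
    combine_read_pair(0b111, 0b100),
    combine_read_pair(0b101, 0b001),
    combine_read_pair(0b011, 0b001),
    combine_read_pair(0b100, 0b110),
]
toy_best = best_allele_pair(toy_pairs, 3)
```

In this toy data two read pairs are explained only by the first reference and two only by the third, so the pair of the first and third alleles explains all four. Using `combinations_with_replacement` also covers the case where an individual has the same allele twice.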

We believe that with these constraints we can create a model that accurately predicts the correct genotype from Illumina next-generation sequencing data. To further improve the model we include some parameters that we wish to train using in-house tools, which will be discussed later.