3.2 Constructing a reference partial order graph
Sequences are typically aligned to a single reference genome. For genes with high structural and sequence diversity, such as the HLA genes, this can lead to poor characterization of the region. To represent such genes we used a partial order graph (POG).
3.2.1 Graph implementation
The POG stores both nodes (vertices) and directed edges. We first perform a multiple sequence alignment (MSA) so that all sequences of a given feature have the same length. The features are then added to the graph one by one. Each node stores an integer level and a single DNA base: A, T, G, or C. Each feature is bracketed at both ends by a node that has no DNA base.
The level corresponds to the position of the base in the aligned sequence. No two nodes connected by a directed path can have the same level. If n1 and n2 are two connected nodes in the graph, the edge is always directed towards the node with the higher level. This means that the last node of the graph, the one with no outgoing edges, has the highest level. The first node of the graph, the one with no incoming edges, has level 0.
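The level rules above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the names `Node` and `directed_edge` are hypothetical, and a boundary node is assumed to be represented by a missing base.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    level: int                  # position of the base in the aligned sequence
    base: Optional[str] = None  # "A", "T", "G", "C", or None for a boundary node


def directed_edge(n1: Node, n2: Node) -> Tuple[Node, Node]:
    """Return (source, target) so that the edge points to the higher level."""
    if n1.level == n2.level:
        # Two connected nodes may never share a level.
        raise ValueError("connected nodes must be on different levels")
    return (n1, n2) if n1.level < n2.level else (n2, n1)
```

For example, `directed_edge(Node(1, "G"), Node(0))` returns the pair ordered from the level 0 node to the G node, regardless of the argument order.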
Each edge stores a reference to both nodes it connects. Furthermore, if the edge is inside an exon it also stores a bit string whose length equals the number of reference sequences used. The purpose of these bit strings is discussed in section 3.3, where we align sequences to the graph and genotype. When an edge is created, all bits are initialized to 0 except for the one representing the reference sequence that is being added.
When we add another sequence that shares part of its path with an earlier sequence, we set the corresponding bit to 1 on each shared edge. Stored this way, an edge never needs more than ceil(r/8) bytes of memory, where r is the number of references used in the graph. If, for example, we had 1000 references and 10,000 edges, each edge would need 125 bytes and we would need only 1.25 megabytes in total to store this information. Additionally, we only need to store it for exons. Algorithm 2 shows the pseudocode behind this method and figure 3.5 shows an example of how it creates a graph for three exon sequences: GATA, -AT-, and CATA.
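The memory estimate and the bit operations can be checked with a short Python sketch. The function names are hypothetical, and a Python integer stands in for the packed bit string, with bit 0 (the rightmost digit) assumed to represent the first reference, as in the labels of figure 3.5.

```python
def bytes_per_edge(num_references: int) -> int:
    """ceil(r / 8): bytes needed to hold one bit per reference."""
    return (num_references + 7) // 8


def total_bytes(num_references: int, num_edges: int) -> int:
    """Memory for the bit strings of all exon edges."""
    return bytes_per_edge(num_references) * num_edges


def set_reference(bits: int, ref_index: int) -> int:
    """Mark that reference ref_index passes through this edge."""
    return bits | (1 << ref_index)


def label(bits: int, num_references: int) -> str:
    """Render the bit string the way the figure labels do."""
    return format(bits, f"0{num_references}b")


print(total_bytes(1000, 10_000))  # 1250000 bytes, i.e. 1.25 megabytes

edge = set_reference(0, 0)        # edge created by the first reference
edge = set_reference(edge, 1)     # second reference shares the edge
print(label(edge, 3))             # 011
```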
We add exon and intron sequences separately because edge creation is handled differently: exon edges have a bit string while intron edges do not. Compared to the exons, a high proportion of the intron data in the IMGT/HLA database is missing or unreliable. So instead of trying to reuse data from other introns, we simply allow reads to align freely within the intron. With such free alignment there is no need for the bit strings, so they are omitted on introns.
3 Methods
Input: FASTA file with aligned sequences of some feature.
Output: Partial order graph we can use as a reference.
graph ← empty partial order graph;
sequences ← read sequences from FASTA file;
startNode ← new node with level 0;
endNode ← new node with level Length(sequences[0]) + 1;
Add startNode and endNode to graph;
for i in 0 .. Length(sequences) − 1 do
    sequence ← sequences[i];
    previous ← startNode;
    for pos in 0 .. Length(sequence) − 1 do
        if sequence[pos] is a gap then
            continue;
        end
        next ← node with letter sequence[pos] and level pos + 1;
        if next does not exist in graph then
            Add next to graph;
        end
        if no edge exists from previous to next then
            Add edge from previous to next with only bit i set;
        else
            Set bit i on the edge from previous to next;
        end
        previous ← next;
    end
    if no edge exists from previous to endNode then
        Add edge from previous to endNode with only bit i set;
    else
        Set bit i on the edge from previous to endNode;
    end
end
Algorithm 2: Creating a reference partial order graph for a single exon. Creating a graph for an intron is similar, except that the edges store no bit strings, so there is no need to create or modify them.
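As an illustration, Algorithm 2 can be written as a short Python function. This is a sketch under stated assumptions rather than the implementation used in this work: nodes are represented as (level, base) tuples, edges as a dictionary from node pairs to integer bit strings, bit i stands for the i-th sequence, and the already aligned sequences are passed in as a list instead of being read from a FASTA file. The name `build_exon_graph` is hypothetical.

```python
def build_exon_graph(sequences):
    """Build a reference POG for one exon from equally long aligned sequences.

    Returns (nodes, edges): nodes is a set of (level, base) tuples and
    edges maps (source, target) to an integer used as a bit string.
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must have the same length")

    start = (0, None)           # boundary node without a base
    end = (length + 1, None)    # boundary node without a base
    nodes = {start, end}
    edges = {}

    for i, seq in enumerate(sequences):
        previous = start        # every sequence starts at the first node
        for pos, base in enumerate(seq):
            if base == "-":     # gap: stay on the previous node
                continue
            node = (pos + 1, base)
            nodes.add(node)     # no effect if the node already exists
            key = (previous, node)
            edges[key] = edges.get(key, 0) | (1 << i)
            previous = node
        key = (previous, end)
        edges[key] = edges.get(key, 0) | (1 << i)
    return nodes, edges


nodes, edges = build_exon_graph(["GATA", "-AT-", "CATA"])
for (src, dst), bits in sorted(edges.items(), key=lambda e: (e[0][0][0], e[0][1][0])):
    print(src, "->", dst, format(bits, "03b"))
```

For the three example sequences this yields seven nodes and nine edges, and the edge from A (level 2) to T (level 3) carries the bit string 111 because all three sequences pass through it.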
[Figure 3.5 (graph drawing, panels a to d): nodes G, A, T, A, and in panel d also C, placed on levels 0 to 5, with a bit string on every edge.]
Figure 3.5: Creating a reference partial order graph from three example exon sequences: GATA, -AT-, and CATA. Blue edges are the edges traversed while adding the sequence, and green labels mark new or changed bit strings on edges. Red and yellow nodes represent new and old nodes, respectively. a) Two nodes are created: an initial node on level 0 and a final node on level Length(sequences[0]) + 1, which is 5 here. b) The sequence GATA is added to the graph. c) The sequence -AT- is added to the graph. Note that no new nodes need to be created, only edges. We change the bit string of the edge going from A to T so that it includes this sequence. d) The sequence CATA is added to the graph. The new C node is on the same level as the G node.
3.2.2 Extending the POG
Figure 3.6 shows how the partial order graph can be extended with three intron sequences: TTA, -TA, and GTA. In our implementation we extend the graph by adding sequences that connect to the lowest-level node. So when creating a graph for the HLA genes we add features in reverse order: first the 3' untranslated region of the allele, then the last exon, then the last intron, and so on until we have added the 5' untranslated region.
The nodes connecting the features can always be traversed freely, so they do not need to store any DNA base. We keep track of the levels of these nodes. During alignment we can check the level of the node we are aligning to, so at any point we know whether we are aligning to an exon or an intron.
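One way to realize this lookup is a binary search over the levels of the connecting nodes. The sketch below is an assumption about how this could be done, not a description of our implementation: the function `feature_at`, the list layout, and the level numbering in the example (an intron of length 3 followed by an exon of length 4, numbered from 0) are all hypothetical.

```python
from bisect import bisect_right


def feature_at(level, connector_levels, feature_kinds):
    """Return the kind of feature that a node on `level` belongs to.

    connector_levels: ascending levels of the base-less connecting nodes.
    feature_kinds[i]: kind of the feature lying between
                      connector_levels[i] and connector_levels[i + 1].
    """
    if level in connector_levels:
        return "connector"
    i = bisect_right(connector_levels, level) - 1
    if i < 0 or i >= len(feature_kinds):
        raise ValueError("level lies outside the graph")
    return feature_kinds[i]


# Hypothetical layout: intron on levels 1-3, exon on levels 5-8.
connectors = [0, 4, 9]
kinds = ["intron", "exon"]
print(feature_at(2, connectors, kinds))  # intron
print(feature_at(6, connectors, kinds))  # exon
print(feature_at(4, connectors, kinds))  # connector
```

Since only the connector levels are stored, the lookup costs a logarithmic number of comparisons in the number of features and needs no per-node flag.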
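[Figure 3.6 (graph drawing): the exon graph from figure 3.5 d) with its bit-string edge labels, preceded by the new intron nodes T, T, G, and A, whose edges carry no labels.]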
Figure 3.6: The graph from figure 3.5 after extension. Here we have added three intron references to the graph. The previous initial node serves as the final node of the new extension. The red nodes are the new intron nodes. The intron sequences we have added are TTA, GTA, and -TA. Note that, unlike the edges between exon nodes, the edges on the new intron nodes do not store a bit string.