Alternative Approaches to the Haplotype Inference
5.2 Algorithm for finding cliques in compatibility graph
We can learn one last thing from the running example introduced in the previous section. We repeat it here for reference: find a maximally parsimonious phasing for g_1 = 0120222, g_2 = 0222012, g_3 = 0100012. When we examined the resolution of g_1, we showed how a wrong choice of haplotype can lead to suboptimal solutions. One could also implement a heuristic that tries to resolve g_1 with a haplotype that is also compatible with a neighbor of g_1 in the compatibility graph. Indeed, this heuristic choice is made by both algorithms presented in this thesis (see the lower-level resolution in Section 3.2.1 and the slave procedure in Section 4.3.1). Nevertheless, a suboptimal selection is still possible; for instance, if we picked haplotype 0110011, which is compatible with both g_1 and g_2, the final solution would have cardinality 5 instead of 4. We could certainly extend this heuristic to consider not only pairs of compatible genotypes but also triplets; notice that such triplets correspond to triangles, or 3-cliques, in the compatibility graph. Iterating this reasoning, it is convenient to find haplotypes compatible with all the genotypes that form a clique in the compatibility graph.
To this end, in this section we outline an algorithm that finds all cliques in a compatibility graph, together with the set of haplotypes compatible with all the genotypes in each clique.
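As a minimal sketch of the objects involved, the following Python snippet builds the compatibility graph of the running example. It assumes the usual site-wise notion of compatibility (two strings are compatible unless one has a 0 where the other has a 1); the names `compatible` and `compatibility_graph` are illustrative and do not come from the thesis.

```python
from itertools import combinations

def compatible(a: str, b: str) -> bool:
    """Site-wise check: a 0 may never face a 1; a 2 matches anything."""
    return all(x == y or "2" in (x, y) for x, y in zip(a, b))

def compatibility_graph(genotypes):
    """Return the edges of the compatibility graph as pairs of 0-based indices."""
    return {(i, j) for i, j in combinations(range(len(genotypes)), 2)
            if compatible(genotypes[i], genotypes[j])}

genotypes = ["0120222", "0222012", "0100012"]   # g_1, g_2, g_3
edges = compatibility_graph(genotypes)

# The haplotype discussed in the text, checked against each genotype.
h = "0110011"
h_matches = [compatible(h, g) for g in genotypes]
```

Under these assumptions the three genotypes form a triangle, and haplotype 0110011 is compatible with g_1 and g_2 but not with g_3.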
To further stress the importance of finding cliques in the compatibility graph, we point out that a solution of value k + 1 can easily be found for a k-clique of genotypes. Such a solution has a single haplotype with coverage k and k haplotypes with coverage 1 (for the definition of coverage see Section 2.3). This consideration might be important for tackling Haplotype Inference by Minimum Entropy. Moreover, finding haplotypes that resolve a whole clique could be an effective way to bootstrap Clark’s Rule (Section 3.1).
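The k + 1 construction can be sketched as follows for the running example, where the three genotypes form a 3-clique. The shared haplotype 0100010 is chosen here purely for illustration (it is compatible with all three genotypes), and the helper name `complement` is hypothetical: it returns the haplotype that, paired with the shared one, resolves a given genotype.

```python
def complement(genotype: str, haplotype: str) -> str:
    """Haplotype that resolves `genotype` together with `haplotype`."""
    out = []
    for g, h in zip(genotype, haplotype):
        if g == "2":
            # Heterozygous site: the two haplotypes must differ.
            out.append("1" if h == "0" else "0")
        elif g == h:
            # Homozygous site: both haplotypes carry the genotype's value.
            out.append(g)
        else:
            raise ValueError("haplotype is not compatible with genotype")
    return "".join(out)

clique = ["0120222", "0222012", "0100012"]   # g_1, g_2, g_3
shared = "0100010"                           # illustrative choice, coverage k = 3
partners = [complement(g, shared) for g in clique]
solution = {shared, *partners}
```

Here `solution` contains the shared haplotype plus one partner per genotype, that is, k + 1 = 4 haplotypes.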
5.2.1 Preliminary definitions
It is useful to introduce the concept of schema, taken straight from genetic algorithms, because it helps to generalise properties of haplotypes and genotypes.
A schema s is a string over the alphabet Σ = {0, 1, ∗}, where ∗ is called a wildcard. An m-length schema compactly represents a set of binary strings of length m: if we instantiate all the wildcards in a schema in every possible way, we obtain all the strings it represents. The number of strings a schema s represents is 2^{w(s)}, where w(s) is the number of wildcards in s. For example, the schema s = 10∗∗11 compactly represents the set {100011, 100111, 101011, 101111}.
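The definition can be made concrete with a short Python sketch that enumerates the strings a schema represents. The function name `expand` is illustrative, and the wildcard is written as an ASCII `*` for convenience.

```python
from itertools import product

def expand(schema: str):
    """List every binary string represented by a schema over {0, 1, *}."""
    # Each wildcard contributes both symbols; fixed positions contribute one.
    slots = [("0", "1") if c == "*" else (c,) for c in schema]
    return ["".join(bits) for bits in product(*slots)]

schema = "10**11"
strings = expand(schema)
count = 2 ** schema.count("*")   # 2^{w(s)}
```

For the schema of the example, `strings` holds the four strings listed above and `count` equals their number.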
The resemblance of schemata to genotypes is clear: if we replace wildcards with 2’s we can transform a schema into a genotype and vice versa. Interestingly, a genotype is a compact representation of the set of haplotypes that can resolve it. Because of this property we can use schemata and genotypes interchangeably.
In a similar way, a schema without wildcards is isomorphic to a haplotype.
As we did for genotypes, we can introduce compatibility between schemata, whose definition is the same as Definition 3 with 2’s replaced by wildcards. If we identify schemata with the sets they represent, we can say that two schemata are compatible if their set intersection is non-empty. We still write s_1 ∼ s_2 to indicate compatibility. For example, the schemata {01∗∗0∗, ∗101∗0, 0∗∗10∗} are pairwise compatible, while the schemata {01∗10, 0∗1∗1} are not.
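The two views of compatibility (the site-wise rule and the non-empty set intersection) can be compared on the examples above with a small Python sketch. The names `compatible` and `represented` are illustrative, wildcards are written as ASCII `*`, and the site-wise rule is assumed to be that a 0 may never face a 1.

```python
from itertools import combinations, product

def compatible(s1: str, s2: str) -> bool:
    """Site-wise rule: equal length, and no position pairs a 0 with a 1."""
    return len(s1) == len(s2) and all(
        a == b or "*" in (a, b) for a, b in zip(s1, s2))

def represented(schema: str) -> set:
    """Set of binary strings that a schema stands for."""
    slots = [("0", "1") if c == "*" else (c,) for c in schema]
    return {"".join(bits) for bits in product(*slots)}

compatible_group = ["01**0*", "*101*0", "0**10*"]
incompatible_pair = ["01*10", "0*1*1"]

# Both views should agree on every pair.
agree = all(
    compatible(a, b) == bool(represented(a) & represented(b))
    for group in (compatible_group, incompatible_pair)
    for a, b in combinations(group, 2))
all_compatible = all(compatible(a, b)
                     for a, b in combinations(compatible_group, 2))
pair_compatible = compatible(*incompatible_pair)
```

The second pair fails because the two schemata carry a 0 and a 1, respectively, in the last position.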
Since schemata represent sets, it is useful to introduce an intersection operation. Schema intersection is defined only if the operands are compatible, and its definition is straightforward: the intersection of two compatible schemata s_1 and s_2, written s_1 ∩ s_2, is a schema whose i-th element is 0 (resp. 1) if the corresponding element in s_1 or s_2 is 0 (resp. 1); otherwise it is a wildcard.¹ For example, 01∗1∗ ∩ ∗101∗ = 0101∗.
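A minimal Python sketch of schema intersection is given below, with wildcards written as ASCII `*`. The function name `intersect` is illustrative; the sketch raises an error on incompatible operands, which is one possible way to reflect the fact that the operation is undefined in that case.

```python
def intersect(s1: str, s2: str) -> str:
    """Intersection of two compatible schemata of equal length."""
    if len(s1) != len(s2):
        raise ValueError("schemata must have the same length")
    out = []
    for a, b in zip(s1, s2):
        if a == "*":
            out.append(b)            # take whatever the other operand fixes
        elif b == "*" or a == b:
            out.append(a)            # a is fixed and b does not contradict it
        else:
            raise ValueError("incompatible schemata: 0 faces 1")
    return "".join(out)

result = intersect("01*1*", "*101*")
```

On this five-symbol example the result keeps every fixed symbol from either operand and leaves a wildcard only where both operands have one.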
5.2.2 Algorithm for finding cliques in compatibility graph
The algorithm we are about to present finds all cliques in a compatibility graph and, at the same time, the haplotypes compatible with all the genotypes in such cliques.
The rationale behind the algorithm is simple. Suppose that we have a compatibility graph G_c; the idea is to construct another undirected graph G_c' whose vertices are schemata obtained in the following way: for every edge (g_i, g_j) ∈ G_c, G_c' has a vertex s' = g_i ∩ g_j.² Now every vertex in G_c' corresponds to a set of haplotypes that can resolve both g_i and g_j. In addition, G_c' has an edge (s'_i, s'_j) iff s'_i ∼ s'_j: in practice, G_c' is a higher-order compatibility graph. If we iterate this procedure, we obtain compatibility graphs of increasingly higher order, whose vertices represent sets of haplotypes compatible with potentially many genotypes.
In the following we formalise the algorithm and show that it terminates and that it finds all cliques in the compatibility graph.
Description of the algorithm. Let G = {g_1, ..., g_n} be a genotype set. The initial step of our algorithm (iteration l = 1) consists in building an auxiliary graph data structure G_s^1,³ which is a variant of the compatibility graph G_c: its vertices are the schemata {s_1^1, ..., s_n^1}, each one corresponding to its homologous genotype g_i; it has an edge (s_i^1, s_j^1) iff s_i^1 ∼ s_j^1; and each vertex is associated with a set σ(s_i^1) = {g_i}. The generic step at the (l + 1)-th iteration is as follows. Let G_s^l be the graph data structure built at the previous iteration.
If G_s^l contains only isolated vertices (every vertex has degree 0), the algorithm stops; otherwise it produces a new graph G_s^{l+1} and continues. G_s^{l+1} has a vertex s^{l+1} = s_i^l ∩ s_j^l for every edge (s_i^l, s_j^l) ∈ G_s^l (duplicates are eliminated), whose associated set is σ(s^{l+1}) = σ(s_i^l) ∪ σ(s_j^l). G_s^{l+1} also has an edge (s_i^{l+1}, s_j^{l+1}) iff s_i^{l+1} ∼ s_j^{l+1}.
The algorithm returns a list of L graphs G_s^1, ..., G_s^L, one for each iteration performed.
¹ The compatibility prerequisite excludes cases where one operand has a 0 and the other a 1 in the i-th position.
² Here genotypes are interpreted as schemata.
³ In the following description, superscripts denote iteration counters.
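The iterative construction described above can be sketched in Python as follows. This is an illustrative reading of the description, not the thesis implementation, and it rests on a few assumptions: genotypes are strings over {0, 1, 2} with 2 acting as the wildcard; each σ set stores 0-based genotype indices; identical genotypes collapse into a single initial vertex; and when two edges produce the same schema, the duplicate is eliminated by merging the two σ sets. The name `clique_levels` is hypothetical.

```python
from itertools import combinations

WILD = "2"  # genotypes are read as schemata, with 2 acting as the wildcard

def compatible(a, b):
    """No site pairs a 0 with a 1."""
    return all(x == y or WILD in (x, y) for x, y in zip(a, b))

def intersect(a, b):
    """Intersection of two compatible schemata of equal length."""
    return "".join(y if x == WILD else x for x, y in zip(a, b))

def clique_levels(genotypes):
    """Return [G_s^1, ..., G_s^L], each level as a (vertices, edges) pair.

    `vertices` maps a schema to sigma, a frozenset of genotype indices.
    """
    level = {}
    for i, g in enumerate(genotypes):
        # Assumption: identical genotypes collapse into one vertex.
        level[g] = level.get(g, frozenset()) | {i}
    levels = []
    while True:
        edges = [(a, b) for a, b in combinations(sorted(level), 2)
                 if compatible(a, b)]
        levels.append((level, edges))
        if not edges:                # only isolated vertices: stop
            return levels
        nxt = {}
        for a, b in edges:
            s = intersect(a, b)
            # Assumption: duplicate schemata merge their sigma sets.
            nxt[s] = nxt.get(s, frozenset()) | level[a] | level[b]
        level = nxt

levels = clique_levels(["0120222", "0222012", "0100012"])
final_vertices, final_edges = levels[-1]
```

On the running example the sketch performs three iterations and ends with the single schema 0100012, associated with all three genotypes. Because two distinct compatible schemata always intersect in a schema with fewer wildcards than at least one of them, the largest wildcard count in a level shrinks at every iteration, so the loop in this sketch stops after at most m + 1 levels for genotypes of length m.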
Cliques and haplotypes. By construction, zero-degree vertices in any G_s^l, 1 ≤ l ≤ L, correspond to maximal cliques in the compatibility graph. Let s^l be one of these vertices; σ(s^l) contains the genotypes in the clique and |σ(s^l)| is, of course, its order; in addition, the set described by s^l contains all the haplotypes that are compatible with every genotype in σ(s^l). Finally, G_s^L is a graph whose vertices all have degree zero; its vertices correspond to maximum cliques in G_s^1 (or, equivalently, in G_c).
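Reading cliques and haplotypes off the returned graphs can be sketched as below. The three levels are written out by hand for the running example (g_1 = 0120222, g_2 = 0222012, g_3 = 0100012, indexed 0 to 2), under the assumption that duplicate schemata merge their σ sets; the helper names `isolated` and `haplotypes` are hypothetical.

```python
from itertools import product

def compatible(a, b):
    """No site pairs a 0 with a 1; 2 acts as the wildcard."""
    return all(x == y or "2" in (x, y) for x, y in zip(a, b))

def haplotypes(schema):
    """All haplotypes described by a schema written with 2 as wildcard."""
    slots = [("0", "1") if c == "2" else (c,) for c in schema]
    return ["".join(bits) for bits in product(*slots)]

def isolated(vertices):
    """Zero-degree vertices of one level (schema -> sigma)."""
    return {s: sigma for s, sigma in vertices.items()
            if not any(compatible(s, t) for t in vertices if t != s)}

# Hand-derived levels for the running example (an assumption of this sketch).
levels = [
    {"0120222": {0}, "0222012": {1}, "0100012": {2}},
    {"0100012": {0, 1, 2}, "0120012": {0, 1}},
    {"0100012": {0, 1, 2}},
]

report = [(l, s, sorted(sigma), haplotypes(s))
          for l, vertices in enumerate(levels, start=1)
          for s, sigma in isolated(vertices).items()]
```

In this instance the only zero-degree vertex appears at the last level: it describes the clique formed by all three genotypes and the two haplotypes compatible with each of them.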
Termination and complexity. The algorithm does not diverge because the genotype set G is finite: during an iteration the algorithm computes at most one set σ(·), and therefore one vertex, for every non-empty subset of G. The number of vertices computed is thus bounded from above by 2^{|G|} − 1. Moreover, the algorithm does indeed converge. By construction, G_s^L contains a vertex for every maximum clique in G_c: since cliques are finite in number and their order is finite, the algorithm must converge. The complexity of the algorithm is, in the worst case, exponential (of course, this algorithm can solve the Maximum Clique Problem, whose decision version is NP-complete and for which no polynomial-time algorithm is known). A worst-case instance can be generated this way: G = {g_1, g_2, ..., g_n}, where genotype g_i has only heterozygous sites except the i-th, which is a 0; this way we obtain a complete graph and exponentially many cliques.
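The worst-case family can be generated and inspected with a short Python sketch. The function name `worst_case` is illustrative; the snippet checks, for a small n, that the compatibility graph is complete and that every non-empty subset of genotypes yields a distinct schema, which matches the 2^{|G|} − 1 bound.

```python
from functools import reduce
from itertools import combinations

def worst_case(n):
    """g_i is heterozygous everywhere except site i, which is 0."""
    return ["".join("0" if j == i else "2" for j in range(n))
            for i in range(n)]

def compatible(a, b):
    """No site pairs a 0 with a 1; 2 acts as the wildcard."""
    return all(x == y or "2" in (x, y) for x, y in zip(a, b))

def intersect(a, b):
    """Intersection of two compatible schemata of equal length."""
    return "".join(y if x == "2" else x for x, y in zip(a, b))

n = 4
genotypes = worst_case(n)
edges = [(a, b) for a, b in combinations(genotypes, 2) if compatible(a, b)]

# One schema per non-empty subset of genotypes (every subset is a clique here).
schemata = {reduce(intersect, subset)
            for k in range(1, n + 1)
            for subset in combinations(genotypes, k)}
```

For n = 4 the graph has all 6 possible edges and the 15 non-empty subsets give 15 different schemata, since the intersection over a subset has a 0 exactly at the sites indexed by that subset.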
5.3 Conclusions and discussion
In this chapter we described two novel approaches to the Haplotype Inference problem that have the potential to improve the performance of the metaheuristics described in previous chapters. In Section 5.1 we provided an extension to the plain binary haplotype model that integrates CP techniques. This model, based on generalised haplotypes, addresses the restrictiveness of early commitment and the inadequacy of the usual definition of complementarity when dealing with unknowns. Section 5.2 presents a worst-case exponential algorithm able to find all cliques in the compatibility graph of an instance. Such cliques correspond to (ordinary) haplotypes which have high coverage and therefore can represent a good starting point for constructive techniques, which also include the simple Clark’s Rule. Although this algorithm is exponential, we conjecture that it has much lower complexity in typical, real-world cases.