
Alternative Approaches to the Haplotype Inference

5.2 Algorithm for finding cliques in compatibility graph

We can learn one last thing from the running example introduced in the previous section. We restate it here for reference: find a maximally parsimonious phasing for g1 = 0120222, g2 = 0222012, g3 = 0100012. When we examined the resolution of g1, we showed how a wrong choice of haplotype can lead to suboptimal solutions. One could also attempt to implement a heuristic that tries to resolve g1 with a haplotype that is also compatible with a neighbor of g1 in the compatibility graph. Indeed, this heuristic choice is performed by both algorithms presented in this thesis (see the lower-level resolution in Section 3.2.1 and the slave procedure in Section 4.3.1). Nevertheless, a suboptimal selection could still be made; for instance, if we picked haplotype 0110011, which is compatible with both g1 and g2, the final solution would have cardinality 5 instead of 4. We could certainly extend this heuristic to include not only pairs of compatible genotypes but also triplets; notice that such triplets correspond to triangles, or 3-cliques, in the compatibility graph. Iterating this reasoning, it appears convenient to find haplotypes compatible with all genotypes that form a clique in the compatibility graph.

To this end, in this section we outline an algorithm that finds all cliques in a compatibility graph, together with the set of haplotypes compatible with all genotypes in each clique.

To further stress the importance of finding cliques in the compatibility graph, we point out that a solution with value k + 1 can easily be found for a k-clique of genotypes. Such a solution would have a single haplotype with coverage k and k haplotypes with coverage 1 (for the definition of coverage see Section 2.3). This consideration might be important for tackling Haplotype Inference by Minimum Entropy. Moreover, finding haplotypes that resolve a whole clique could be an effective way to bootstrap Clark's Rule (Section 3.1).
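The k + 1 construction can be illustrated with a minimal sketch. The function names `complement` and `clique_phasing` are hypothetical, and the shared haplotype 0100010 is simply one haplotype that happens to be compatible with all three genotypes of the running example; the sketch assumes the usual encoding in which 2 marks a heterozygous site.

```python
def complement(genotype, haplotype):
    """Return the haplotype that, paired with `haplotype`, resolves `genotype`."""
    out = []
    for g, h in zip(genotype, haplotype):
        if g == "2":
            # Heterozygous site: the two haplotypes must differ here.
            out.append("1" if h == "0" else "0")
        elif g == h:
            # Homozygous site: both haplotypes carry the genotype's value.
            out.append(g)
        else:
            raise ValueError("haplotype is not compatible with genotype")
    return "".join(out)


def clique_phasing(genotypes, shared):
    """Resolve every genotype of a clique with the same shared haplotype."""
    return {g: (shared, complement(g, shared)) for g in genotypes}


clique = ["0120222", "0222012", "0100012"]
phasing = clique_phasing(clique, "0100010")
haplotypes = {h for pair in phasing.values() for h in pair}
print(len(haplotypes))  # k + 1 = 4 distinct haplotypes for this 3-clique
```

Here the shared haplotype has coverage 3 and each complement has coverage 1, which matches the cardinality 4 quoted for the running example.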

5.2.1 Preliminary definitions

It is useful to introduce the concept of schema, taken straight from genetic algorithms, because it helps to generalise properties of haplotypes and genotypes.

A schema s is a string over the alphabet Σ = {0, 1, ∗}, where ∗ is called a wildcard. An m-length schema compactly represents a set of binary strings of length m: if we instantiate all wildcards in a schema in every possible way, we obtain all the strings it represents. The number of strings a schema s represents is 2^{w(s)}, where w(s) is the number of wildcards. For example, the schema s = 10∗∗11 compactly represents the set {100011, 100111, 101011, 101111}.
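As a small illustration of how a schema stands for a set of strings, the following sketch enumerates every instantiation of the wildcards. The function name `expand` is hypothetical, and the ASCII asterisk is assumed to stand for the wildcard symbol.

```python
from itertools import product


def expand(schema, wildcard="*"):
    """List every binary string represented by `schema`."""
    slots = [i for i, c in enumerate(schema) if c == wildcard]
    strings = []
    # Each wildcard can be 0 or 1 independently: 2**len(slots) combinations.
    for bits in product("01", repeat=len(slots)):
        chars = list(schema)
        for i, b in zip(slots, bits):
            chars[i] = b
        strings.append("".join(chars))
    return strings


print(expand("10**11"))  # ['100011', '100111', '101011', '101111']
```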

The resemblance of schemata to genotypes is clear: if we replace wildcards with 2's we can transform a schema into a genotype, and vice versa. Interestingly, a genotype is a compact representation of the set of haplotypes that can resolve it. Because of this property we can use schemata and genotypes interchangeably.

In a similar way, a schema without wildcards is isomorphic to a haplotype.

As we did for genotypes, we can introduce compatibility between schemata, whose definition is the same as Definition 3, simply replacing 2's with wildcards. If we identify schemata with the sets they represent, we can say that two schemata are compatible if their set intersection is non-empty. We still write s1 ∼ s2 to indicate compatibility. For example, the schemata in {01∗∗0∗, ∗101∗0, 0∗∗10∗} are pairwise compatible, while those in {01∗10, 0∗1∗1} are not.
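The compatibility test can be sketched position by position: two schemata clash only where one holds a 0 and the other a 1. The function name `compatible` is hypothetical, and the ASCII asterisk is again assumed to stand for the wildcard.

```python
from itertools import combinations


def compatible(s1, s2, wildcard="*"):
    """True if no position holds a 0 in one schema and a 1 in the other."""
    if len(s1) != len(s2):
        raise ValueError("schemata must have the same length")
    return all(a == b or wildcard in (a, b) for a, b in zip(s1, s2))


group = ["01**0*", "*101*0", "0**10*"]
print(all(compatible(a, b) for a, b in combinations(group, 2)))  # True
print(compatible("01*10", "0*1*1"))  # False: the last positions are 0 and 1
```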

Since schemata represent sets, it is useful to introduce an intersection operation. Schema intersection is defined only if the operands are compatible, and its definition is straightforward: the intersection of two compatible schemata s1 and s2, denoted s1 ∩ s2, is a schema whose i-th element is 0 (resp. 1) if the corresponding element in s1 or s2 is 0 (resp. 1); otherwise it is a wildcard.¹ For example, 01∗1∗ ∩ ∗101∗ = 0101∗.
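A minimal sketch of schema intersection follows. The function name `intersect` is hypothetical, and raising an error on incompatible operands is one possible way to reflect the fact that the operation is undefined for them.

```python
def intersect(s1, s2, wildcard="*"):
    """Intersect two compatible schemata of equal length."""
    if len(s1) != len(s2):
        raise ValueError("schemata must have the same length")
    out = []
    for a, b in zip(s1, s2):
        if a == wildcard:
            out.append(b)  # take whatever the other operand fixes (or a wildcard)
        elif b == wildcard or a == b:
            out.append(a)  # a is a fixed symbol that b does not contradict
        else:
            raise ValueError("schemata are not compatible")
    return "".join(out)


print(intersect("01*1*", "*101*"))  # 0101*
```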

5.2.2 Algorithm for finding cliques in compatibility graph

The algorithm we are about to present finds all cliques in a compatibility graph and, at the same time, the haplotypes compatible with all genotypes in such cliques.

The rationale behind the algorithm is simple. Suppose that we have a compatibility graph G_c; the idea is to construct another undirected graph G'_c whose vertices are schemata obtained in the following way: for every edge (g_i, g_j) ∈ G_c, G'_c has a vertex s' = g_i ∩ g_j.² Every vertex in G'_c thus corresponds to a set of haplotypes that can resolve both g_i and g_j. In addition, G'_c has an edge (s'_i, s'_j) iff s'_i ∼ s'_j: in practice, G'_c is a higher-order compatibility graph. If we iterate this procedure, we obtain compatibility graphs of increasingly higher order, whose vertices represent sets of haplotypes compatible with potentially many genotypes.

In the following we formalise the algorithm and show that it terminates and that it finds all cliques in the compatibility graph.

Description of the algorithm. Let G = {g_1, ..., g_n} be a genotype set. The initial step of our algorithm (iteration l = 1) consists in building an auxiliary graph data structure,³ G_s^1, which is a variant of the compatibility graph G_c: its vertices are the schemata {s^1_1, ..., s^1_n}, each corresponding to its homologous genotype g_i; it has an edge (s^1_i, s^1_j) iff s^1_i ∼ s^1_j; and each vertex is associated with a set σ(s^1_i) = {g_i}. The generic step at the (l+1)-th iteration is described in what follows. Let G_s^l be the graph data structure built at the previous iteration.

If G_s^l contains only isolated vertices (every vertex has degree 0), the algorithm stops; otherwise it produces a new graph G_s^{l+1} and continues. G_s^{l+1} has a vertex s^{l+1} = s^l_i ∩ s^l_j for every edge (s^l_i, s^l_j) ∈ G_s^l (duplicates are eliminated), whose associated set is σ(s^{l+1}) = σ(s^l_i) ∪ σ(s^l_j). G_s^{l+1} also has an edge (s^{l+1}_i, s^{l+1}_j) iff s^{l+1}_i ∼ s^{l+1}_j.

¹ The compatibility prerequisite excludes cases where one operand has a 0 and the other a 1 in the i-th position.

² Here genotypes are interpreted as schemata.

³ In the following description, superscripts denote iteration counters.

The algorithm returns a list of L graphs G_s^1, ..., G_s^L, one for each iteration performed.
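The iteration described above can be sketched as follows. This is an illustrative reading rather than a reference implementation: the names `find_cliques` and `isolated` are hypothetical, each graph is stored as a pair (vertices, edges) with `vertices` mapping a schema to its set σ, and it is an assumption of this sketch that when two edges yield the same schema the duplicate is merged and the σ sets are united.

```python
from itertools import combinations

WILDCARD = "*"


def compatible(s1, s2):
    """True if no position holds a 0 in one schema and a 1 in the other."""
    return all(a == b or WILDCARD in (a, b) for a, b in zip(s1, s2))


def intersect(s1, s2):
    """Intersection of two compatible schemata: fixed symbols win over wildcards."""
    return "".join(b if a == WILDCARD else a for a, b in zip(s1, s2))


def find_cliques(genotypes):
    """Return one (vertices, edges) pair per iteration performed."""
    # Iteration 1: one schema per genotype, with sigma = {genotype}.
    vertices = {g.replace("2", WILDCARD): frozenset([g]) for g in genotypes}
    levels = []
    while True:
        edges = [(a, b) for a, b in combinations(vertices, 2) if compatible(a, b)]
        levels.append((vertices, edges))
        if not edges:  # only isolated vertices are left: stop
            return levels
        nxt = {}
        for a, b in edges:
            s = intersect(a, b)
            # Assumption: duplicate schemata are merged, uniting their sigma sets.
            nxt[s] = nxt.get(s, frozenset()) | vertices[a] | vertices[b]
        vertices = nxt


def isolated(vertices, edges):
    """Zero-degree vertices of one graph, with their sigma sets."""
    touched = {s for edge in edges for s in edge}
    return {s: sigma for s, sigma in vertices.items() if s not in touched}


levels = find_cliques(["0120222", "0222012", "0100012"])
for l, (vertices, edges) in enumerate(levels, start=1):
    print(l, isolated(vertices, edges))
```

On the running example this sketch stops after three iterations, and the last graph holds the single schema 010001* associated with all three genotypes.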

Cliques and haplotypes. By construction, zero-degree vertices in any G_s^l, 1 ≤ l ≤ L, correspond to maximal cliques in the compatibility graph. Let s^l be one of these vertices: σ(s^l) contains the genotypes in the clique and |σ(s^l)| is its order; in addition, the set described by s^l contains all haplotypes that are compatible with the genotypes in σ(s^l). Finally, G_s^L is a graph whose vertices all have degree zero; its vertices correspond to maximum cliques of G_s^1 (or, equivalently, of G_c).

Termination and complexity. The algorithm does not diverge because the genotype set G is finite: during an iteration the algorithm calculates, at most, a set σ(·), and therefore a vertex, for every non-empty subset of G. The number of vertices computed is thus bounded from above by 2^{|G|} − 1. Moreover, the algorithm does indeed converge. By construction, G_s^L contains a vertex for every maximum clique in G_c: since cliques are finite in number and their order is finite, the algorithm must converge. The complexity of the algorithm is, in the worst case, exponential (this is unsurprising, since the algorithm can solve the NP-hard Maximum Clique Problem, for which only worst-case exponential algorithms are known). A worst-case instance can be generated this way: G = {g_1, g_2, ..., g_n}, where genotype g_i has only heterozygous sites except the i-th, which is a 0; this way we obtain a complete graph and exponentially many cliques.
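The worst-case instance can be generated and checked for small n with a short sketch. The function names are hypothetical, and the check only counts the distinct schemata obtained by intersecting every non-empty subset of genotypes, assuming (as holds for this instance) that all genotypes are mutually compatible.

```python
from functools import reduce
from itertools import combinations


def worst_case_instance(n):
    """g_i is heterozygous (2) everywhere except site i, which is 0."""
    return ["2" * i + "0" + "2" * (n - i - 1) for i in range(n)]


def intersect(g1, g2):
    """Intersection in genotype notation, with 2 playing the wildcard role."""
    out = []
    for a, b in zip(g1, g2):
        if a == "2":
            out.append(b)
        elif b == "2" or a == b:
            out.append(a)
        else:
            raise ValueError("genotypes are not compatible")
    return "".join(out)


def distinct_clique_schemata(genotypes):
    """Schemata obtained by intersecting every non-empty subset of genotypes."""
    return {
        reduce(intersect, subset)
        for k in range(1, len(genotypes) + 1)
        for subset in combinations(genotypes, k)
    }


instance = worst_case_instance(4)
print(instance)                                 # ['0222', '2022', '2202', '2220']
print(len(distinct_clique_schemata(instance)))  # 2**4 - 1 = 15
```

Every non-empty subset of this instance is a clique with its own schema (zeros exactly at the sites of the chosen genotypes), which is why the count reaches the 2^{|G|} − 1 bound.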

5.3 Conclusions and discussion

In this chapter we described two novel approaches to the Haplotype Inference problem that have the potential to improve the performance of the metaheuristics described in previous chapters. In Section 5.1 we provided an extension to the plain binary haplotype model that integrates CP techniques. This model, based on generalised haplotypes, addresses the restrictiveness of early commitment and the inadequacy of the usual definition of complementarity when dealing with unknowns. Section 5.2 presents a worst-case exponential algorithm able to find all cliques in the compatibility graph of an instance. Such cliques correspond to (ordinary) haplotypes which have high coverage and can therefore represent a good starting point for constructive techniques, which also include the simple Clark's Rule. Although this algorithm is exponential, we conjecture that it has much lower complexity in typical, real-world cases.

Chapter 6

The Founder Sequence
