
that $F_\alpha(M_{\mathrm{heur}}) \geq F_\alpha(M_{\mathrm{MWM}})$. Then conserved candidate adjacency pair $c = \{\{g_1^a, g_2^b\}, \{h_1^a, h_2^b\}\}$ can be discarded if

$$F_\alpha(M_{\mathrm{heur}}) > \mathrm{edg}(M_{\mathrm{RMWM}}) \qquad (3.19)$$

without losing optimality in solving problem FF-Adjacencies for gene similarity graph $B_\circ$.

Proof: Recall that, according to Lemma 2, the score $F_\alpha(M)$ of any solution $M \subseteq E$ to problem FF-Adjacencies cannot exceed $\mathrm{edg}(M_{\mathrm{MWM}})$, i.e., the sum of edge weights of a maximum weight matching $M_{\mathrm{MWM}} \subseteq E$. Now, assume for contradiction that there exists an optimal solution $M'$ to problem FF-Adjacencies for $G$ and $H$ in which conserved candidate adjacency pair $c$ is established between $G_{M'}$ and $H_{M'}$. Since no edge in $E_D$ can be part of $M'$, it holds that

$$F_\alpha(M') \leq \mathrm{edg}(M_{\mathrm{RMWM}}) < F_\alpha(M_{\mathrm{heur}}),$$

where the second inequality is the premise of Equation (3.19). Hence $M'$ scores strictly lower than $M_{\mathrm{heur}}$, contradicting the optimality of $M'$.
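The test behind Equation (3.19) amounts to one matching computation and one comparison. The following Python sketch illustrates this under stated assumptions: the function names `max_weight_matching_score` and `can_discard` are hypothetical, the deleted edge set $E_D$ is passed in explicitly rather than derived from $c$, and the matching is found by brute-force enumeration, which is feasible only for toy graphs and merely stands in for an efficient maximum weight matching algorithm.

```python
from typing import Dict, FrozenSet, Set, Tuple

Edge = Tuple[str, str]  # (gene of G, gene of H); gene names are assumed unique


def max_weight_matching_score(weights: Dict[Edge, float]) -> float:
    """Weight of a maximum weight matching, by exhaustive search (toy sizes only)."""
    edges = list(weights.items())

    def best(i: int, used: FrozenSet[str]) -> float:
        if i == len(edges):
            return 0.0
        (u, v), w = edges[i]
        score = best(i + 1, used)  # option 1: leave edge i unmatched
        if u not in used and v not in used:
            # option 2: match edge i, which blocks both of its endpoints
            score = max(score, w + best(i + 1, used | {u, v}))
        return score

    return best(0, frozenset())


def can_discard(weights: Dict[Edge, float],
                deleted_edges: Set[Edge],
                heuristic_score: float) -> bool:
    """Check the inequality of Equation (3.19) on the graph without the deleted edges."""
    remaining = {e: w for e, w in weights.items() if e not in deleted_edges}
    return heuristic_score > max_weight_matching_score(remaining)


# Hypothetical toy instance: establishing c is assumed to delete gene g2,
# so both edges incident to g2 form the deleted edge set.
toy = {("g1", "h1"): 0.9, ("g2", "h2"): 0.8, ("g3", "h3"): 0.7, ("g2", "h3"): 0.3}
deleted = {("g2", "h2"), ("g2", "h3")}
print(can_discard(toy, deleted, heuristic_score=2.0))  # True
print(can_discard(toy, deleted, heuristic_score=1.5))  # False
```

In the toy instance, the remaining graph admits a matching of weight at most 0.9 + 0.7, so a heuristic solution scoring 2.0 rules out the candidate, whereas one scoring 1.5 does not.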

Computing a maximum weight matching in gene similarity graph $B_\circ$ of genomes $G$ and $H$ is achievable within $O(|E(B_\circ)| \sqrt{|G| + |H|})$ time [39]; however, an implementation of Duan and Su’s algorithm was not publicly available at the time of writing. The number of possible conserved candidate adjacencies of $B_\circ$ is in the order of $O(2^{|E(B_\circ)|})$, rendering the remaining subgraph test inefficient when applied in an exhaustive procedure. Nevertheless, if edges of conserved candidate adjacency $c$ have maximal weight and the inclusion of $c$ leads to a suboptimal solution to problem FF-Adjacencies, then any conserved candidate adjacency spanning $c$ will also lead to a suboptimal solution. Thus, it is often not necessary to apply the remaining subgraph test for all conserved candidate adjacencies. We propose a preprocessing algorithm that executes the remaining subgraph test only for reasonably large deleted subgraphs, where the edges of the tested conserved candidate adjacency $c$ are artificially set to maximal edge weight. If the test then concludes that the establishment of $c$ leads to a suboptimal solution, then all spanning conserved candidate adjacencies can also be omitted from the solution space.

We will see in Section 3.9.2 that the utilized implementation of a maximum weight matching algorithm was too slow to apply the remaining subgraph test in practice.

3.8 A heuristic solution to problem FF-Adjacencies

With increasing genome size, it may become infeasible in practice to solve problem FF-Adjacencies exactly. Moreover, even when genome sizes are small, but the studied genomes accumulated many genome rearrangements so that no or only an insufficient number of anchors can be found, the running time and size of program FFAdj-2G inflate to a point where exact solutions can no longer be obtained by today’s computational means. In this section, we present heuristic FFAdj-MCS as described by Algorithm 2. FFAdj-MCS is an adaptation of the heuristic IILCS of Angibaud et al. [6].

Algorithm 2 Algorithm FFAdj-MCS

Input: Two genomes $G$, $H$, their corresponding gene similarity graph $B_\circ = (U, V, E)$, and $\alpha \in [0, 1]$

Output: Matching $M \subseteq E$

1: $\mathit{Unseen} \leftarrow E$
2: Initialize empty list $L$
3: while $\mathit{Unseen} \neq \emptyset$ do
4:     Find an MCS $S \subseteq \mathit{Unseen}$ with highest score $F_\alpha(S)$
5:     $\mathit{Unseen} \leftarrow \mathit{Unseen} \setminus S$
6:     Remove all edges incident to edges of $S$ in $\mathit{Unseen}$
7:     Remove all singleton vertices of $U$ and $V$
8:     Elongate MCSs in $L$ and possibly remove further edges from $\mathit{Unseen}$ and further vertices from $U$ and $V$, respectively
9:     Append $S$ to $L$
10: end while
11: $M = \bigcup_{S \in L} S$
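The greedy loop of Algorithm 2 can be illustrated by a strongly simplified Python sketch. It rests on several assumptions that are not taken from the text above: each genome is a single linear chromosome whose genes are identified by their positions, gene orientations are ignored, an MCS is modeled as a run of edges between consecutive positions in direct or reversed order, and the objective is assumed to be $F_\alpha(S) = \alpha \cdot \mathrm{adj}(S) + (1 - \alpha) \cdot \mathrm{edg}(S)$, with each adjacency scored by the geometric mean of its two edge weights. Steps 7 and 8 of Algorithm 2 (removal of singleton vertices and elongation of earlier MCSs) are omitted, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (gene position in G, gene position in H)


def score(run: List[Edge], w: Dict[Edge, float], alpha: float) -> float:
    """Assumed objective: alpha * adjacency score + (1 - alpha) * edge weight sum."""
    edg = sum(w[e] for e in run)
    adj = sum(math.sqrt(w[a] * w[b]) for a, b in zip(run, run[1:]))
    return alpha * adj + (1 - alpha) * edg


def maximal_runs(unseen: Dict[Edge, float]) -> List[List[Edge]]:
    """All maximal runs of edges along direct (d=1) or reversed (d=-1) diagonals."""
    runs = []
    for (i, j) in unseen:
        for d in (1, -1):
            if (i - 1, j - d) in unseen:
                continue  # (i, j) is not the first edge of its run in direction d
            run = [(i, j)]
            while (run[-1][0] + 1, run[-1][1] + d) in unseen:
                run.append((run[-1][0] + 1, run[-1][1] + d))
            runs.append(run)
    return runs


def ffadj_mcs_sketch(weights: Dict[Edge, float], alpha: float) -> List[Edge]:
    unseen = dict(weights)          # step 1
    chosen: List[List[Edge]] = []   # step 2
    while unseen:                   # step 3
        # step 4: with positive weights, a best run is always a maximal one
        best = max(maximal_runs(unseen), key=lambda r: score(r, weights, alpha))
        chosen.append(best)         # step 9
        used_g = {i for i, _ in best}
        used_h = {j for _, j in best}
        # steps 5 and 6: drop the run itself and every edge sharing a gene with it
        unseen = {(i, j): x for (i, j), x in unseen.items()
                  if i not in used_g and j not in used_h}
    return sorted(e for run in chosen for e in run)  # step 11


toy = {(0, 0): 0.4, (1, 1): 0.4, (0, 1): 0.9}
print(ffadj_mcs_sketch(toy, alpha=0.0))  # [(0, 1)]
print(ffadj_mcs_sketch(toy, alpha=1.0))  # [(0, 0), (1, 1)]
```

The toy instance shows the role of $\alpha$ in this simplified setting: with $\alpha = 0$ the single heavy edge wins, while with $\alpha = 1$ the two lighter edges that form an adjacency are preferred.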

IILCS computes the number of adjacencies between two genomes when gene families are known, under an exemplar, intermediate, or maximum matching model. It is a greedy algorithm based on the idea of solving the longest common substring (LCS) problem: Given two strings S and T, find a longest string X that is a substring of both S and T. Solving a gene family-based problem, IILCS iteratively identifies LCSs in strings drawn from the alphabet of gene family identifiers representing chromosomal sequences and subsequently matches their corresponding genes until a satisfying matching is constructed.
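For illustration, the LCS problem itself can be solved by standard dynamic programming. The Python sketch below treats chromosomes as lists of gene family identifiers; the function name and the identifiers are hypothetical, and the sketch covers only the LCS subproblem, not the iterative matching of IILCS or the handling of gene orientations.

```python
from typing import List, Sequence


def longest_common_substring(s: Sequence[str], t: Sequence[str]) -> List[str]:
    """Return one longest run of consecutive symbols occurring in both s and t."""
    best_len, best_end = 0, 0
    prev = [0] * (len(t) + 1)  # prev[j]: common suffix length of s[:i-1] and t[:j]
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the common suffix by one symbol
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return list(s[best_end - best_len:best_end])


chromosome_g = ["f1", "f2", "f3", "f4"]
chromosome_h = ["f9", "f2", "f3", "f4", "f1"]
print(longest_common_substring(chromosome_g, chromosome_h))  # ['f2', 'f3', 'f4']
```

The table is filled row by row, so the sketch needs time proportional to the product of the two sequence lengths and memory proportional to the length of the second sequence.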

This strategy can no longer be applied for problem FF-Adjacencies: Connected components in the gene similarity graph of two genomes $G$ and $H$ do not form gene families in general. Hence, their genes cannot be represented by a single gene family identifier in a string representation of genomes $G$ and $H$ without losing information. However, if edges are unweighted, chromosomes can be represented as indeterminate strings. Indeterminate strings, which will be further discussed in Chapter 5, are a class of strings that have one or more characters per position. Yet, we aim to address the general case with arbitrary edge weights; therefore, we cannot make use of existing algorithms for the identification of LCSs in indeterminate strings such as those described in [56]. Given the gene similarity graph $B_\circ$ of two genomes $G$ and $H$, our heuristic FFAdj-MCS matches in each iteration a maximal sequence of consecutive conserved candidate adjacencies that locally maximizes the objective function $F_\alpha$, i.e., the convex combination of weights of conserved adjacen-
