that $F_\alpha(M_{heur}) \geq F_\alpha(M_{MWM})$. Then conserved candidate adjacency pair $c = \{\{g_1^a, g_2^b\}, \{h_1^a, h_2^b\}\}$ can be discarded if
$$F_\alpha(M_{heur}) > edg(M^R_{MWM}) \qquad (3.19)$$
without losing optimality in solving problem FF-Adjacencies for gene similarity graph B◦.
Proof: Recall that according to Lemma 2, the score $F_\alpha(M)$ of any solution $M \subseteq E$ to problem FF-Adjacencies cannot exceed $edg(M_{MWM})$, i.e., the sum of edge weights of a maximum weight matching solution $M_{MWM} \subseteq E$. Now, assume for contradiction that there exists an optimal solution $M'$ to problem FF-Adjacencies for G and H in which conserved candidate adjacency pair c is established between $G_{M'}$ and $H_{M'}$. Since no edge in $E_D$ can be part of $M'$, it holds that
$$F_\alpha(M') \leq edg(M^R_{MWM}) < F_\alpha(M_{heur}),$$
where the second inequality is Equation (3.19). Hence $M_{heur}$ scores strictly higher than $M'$, contradicting the optimality of $M'$.
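The test in Equation (3.19) can be illustrated with a minimal Python sketch. It assumes that the remaining graph is obtained by removing the edges $E_D$ of the deleted subgraph, and it replaces an efficient maximum weight matching algorithm with an exhaustive search that is only usable on tiny graphs. The edge layout and the names `max_weight_matching_value` and `can_discard` are hypothetical and serve illustration only.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical edge layout: (gene of G, gene of H, edge weight).
Edge = Tuple[str, str, float]


def max_weight_matching_value(edges: List[Edge]) -> float:
    """Return edg(M) of a maximum weight matching by exhaustive search.

    Exponential running time; a stand-in for an efficient algorithm.
    """
    def best(i: int, used_g: FrozenSet[str], used_h: FrozenSet[str]) -> float:
        if i == len(edges):
            return 0.0
        g, h, w = edges[i]
        # Option 1: leave edge i out of the matching.
        skip = best(i + 1, used_g, used_h)
        if g in used_g or h in used_h:
            return skip
        # Option 2: take edge i, which blocks both of its endpoints.
        take = w + best(i + 1, used_g | {g}, used_h | {h})
        return max(skip, take)

    return best(0, frozenset(), frozenset())


def can_discard(edges: List[Edge],
                deleted_edges: Set[Tuple[str, str]],
                heuristic_score: float) -> bool:
    """Remaining subgraph test: True if F_alpha(M_heur) > edg(M^R_MWM)."""
    # Assumption: the remaining graph is the input graph without E_D.
    remaining = [e for e in edges if (e[0], e[1]) not in deleted_edges]
    return heuristic_score > max_weight_matching_value(remaining)


if __name__ == "__main__":
    graph = [("g1", "h1", 1.0), ("g2", "h2", 0.5), ("g3", "h3", 1.0)]
    # Establishing the candidate adjacency would delete g2 and h2.
    print(can_discard(graph, {("g2", "h2")}, heuristic_score=2.25))
```

In this toy instance the remaining graph admits a matching of weight 2.0, which is below the assumed heuristic score of 2.25, so the candidate would be discarded.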
Computing a maximum weight matching solution in gene similarity graph B◦ of genomes G and H is achievable within $O(|E(B◦)| \sqrt{|G| + |H|})$ time [39]. However, an implementation of Duan and Su's algorithm was not publicly available at the time of writing. The number of possible conserved candidate adjacencies of B◦ is in the order of $O(2^{|E(B◦)|})$, rendering the remaining subgraph test inefficient when applied in an exhaustive procedure. Nevertheless, if edges of conserved candidate adjacency c have maximal weight and the inclusion of c leads to a suboptimal solution to problem FF-Adjacencies, then any conserved candidate adjacency spanning c will also lead to a suboptimal solution. Thus, it is often not necessary to apply the remaining subgraph test to all conserved candidate adjacencies. We propose a preprocessing algorithm that executes the remaining subgraph test only for reasonably large deleted subgraphs, where the edges of the tested conserved candidate adjacency c are artificially set to maximal edge weight. If the test then concludes that the establishment of c leads to a suboptimal solution, then all spanning conserved candidate adjacencies can also be omitted from the solution space.
We will see in Section 3.9.2 that the utilized implementation of a maximum weight matching algorithm was too slow to apply the remaining subgraph test in practice.
3.8 A heuristic solution to problem FF-Adjacencies
With increasing genome size, it may become infeasible in practice to solve problem FF-Adjacencies exactly. Moreover, even when genome sizes are small, if the studied genomes have accumulated so many genome rearrangements that no anchors or only an insufficient number of anchors can be found, the running time and size of program FFAdj-2G inflate to a point where exact solutions can no longer be obtained by today's computational means. In this section, we present heuristic FFAdj-MCS as described by Algorithm 2. FFAdj-MCS is an adaptation of the heuristic IILCS of Angibaud et al. [6].
Algorithm 2: Algorithm FFAdj-MCS
Input: Two genomes G, H, their corresponding gene similarity graph B◦ = (U, V, E), and α ∈ [0, 1]
Output: Matching M ⊆ E
 1: Unseen ← E
 2: Initialize empty list L
 3: while Unseen ≠ ∅ do
 4:     Find an MCS S ⊆ Unseen with highest score $F_\alpha(S)$
 5:     Unseen ← Unseen \ S
 6:     Remove all edges incident to edges of S in Unseen
 7:     Remove all singleton vertices of U and V
 8:     Elongate MCSs in L and possibly remove further edges from Unseen and further vertices from U and V, respectively
 9:     Append S to L
10: end while
11: $M = \bigcup_{S \in L} S$
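The control flow of Algorithm 2 can be sketched in Python as follows. This is a simplified illustration, not the FFAdj-MCS implementation: the search for a best-scoring MCS (line 4) is passed in as a hypothetical callable `find_best`, the elongation step (line 8) is omitted, and singleton vertices are handled implicitly because only edges are stored. The function name `greedy_matching` and the edge dictionary layout are assumptions made for this example.

```python
from typing import Callable, Dict, List, Set, Tuple

Edge = Tuple[str, str]  # hypothetical layout: (gene of G, gene of H)


def greedy_matching(edges: Dict[Edge, float],
                    find_best: Callable[[Dict[Edge, float]], List[Edge]]
                    ) -> Set[Edge]:
    """Greedy loop in the spirit of Algorithm 2 (elongation step omitted)."""
    unseen = dict(edges)           # line 1: Unseen <- E
    chosen: List[List[Edge]] = []  # line 2: empty list L
    while unseen:                  # line 3
        s = find_best(unseen)      # line 4: stand-in for the MCS search
        if not s:
            break                  # guard against a finder that returns nothing
        used_g = {g for g, _ in s}
        used_h = {h for _, h in s}
        # Lines 5-6: drop S itself and every edge sharing a vertex with S.
        unseen = {e: w for e, w in unseen.items()
                  if e[0] not in used_g and e[1] not in used_h}
        chosen.append(s)           # line 9
    # Line 11: the matching is the union of all selected edge sets.
    return set().union(*chosen)


def heaviest_edge(unseen: Dict[Edge, float]) -> List[Edge]:
    """Trivial stand-in finder: treat the single heaviest edge as S."""
    return [max(unseen, key=unseen.get)]


if __name__ == "__main__":
    weights = {("g1", "h1"): 0.9, ("g1", "h2"): 0.8,
               ("g2", "h2"): 0.7, ("g2", "h1"): 0.3}
    print(sorted(greedy_matching(weights, heaviest_edge)))
```

With the trivial finder the loop degenerates to a greedy weighted matching. An actual MCS search would return several edges per iteration and score them with $F_\alpha$.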
IILCS allows computing the number of adjacencies between two genomes when gene families are known, under an exemplar, intermediate, or maximum matching model. It is a greedy algorithm based on the idea of solving the longest common substring (LCS) problem: Given two strings S and T, find a longest string X that is a substring of both S and T. Solving a gene family-based problem, IILCS iteratively identifies LCSs in strings drawn from the alphabet of gene family identifiers representing chromosomal sequences and subsequently matches their corresponding genes until a satisfying matching is constructed.
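The LCS problem stated above can be solved with a standard dynamic program. The following Python sketch shows only this building block and not IILCS itself. It works on any pair of sequences, so a chromosome may be given as a list of gene family identifiers. The function name `longest_common_substring` and the example identifiers are chosen for illustration.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def longest_common_substring(s: Sequence[T], t: Sequence[T]) -> Sequence[T]:
    """Return one longest contiguous run of symbols shared by s and t.

    Uses O(len(s) * len(t)) time and O(len(t)) memory.
    """
    best_len, best_end = 0, 0
    # prev[j]: length of the common suffix of s[:i-1] and t[:j].
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # Extend the common suffix that ended one position earlier.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return s[best_end - best_len:best_end]


if __name__ == "__main__":
    # Chromosomes as sequences of hypothetical gene family identifiers.
    chrom_g = ["f1", "f2", "f3", "f4", "f5"]
    chrom_h = ["f9", "f2", "f3", "f4", "f7"]
    print(longest_common_substring(chrom_g, chrom_h))
```

The sketch ignores gene orientation, so reversed chromosomal segments would have to be handled separately.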
This strategy can no longer be applied for problem FF-Adjacencies: Connected components in the gene similarity graph of two genomes G and H do not form gene families in general. Hence, their genes cannot be represented by a single gene family identifier in a string representation of genomes G and H without losing information. However, if edges are unweighted, chromosomes can be represented as indeterminate strings. Indeterminate strings, which will be further discussed in Chapter 5, are a class of strings that have one or more characters per position. Yet, we aim to address the general case with arbitrary edge weights; therefore, we cannot make use of existing algorithms for the identification of LCSs in indeterminate strings such as those described in [56]. Given the gene similarity graph B◦ of two genomes G and H, our heuristic FFAdj-MCS matches in each iteration a maximal sequence of consecutive conserved candidate adjacencies that locally maximizes the objective function $F_\alpha$, i.e., the convex combination of weights of conserved adjacen-