We now study the upper bound of objective functionFα which is a convex com-
bination of the two measures edg and adj. Observe that the number of conserved adjacencies of any matchingM in B◦ is at most its size n= |M|. This case occurs if the gene order between G and H is perfectly conserved and both contain only circular chromosomes. In order to establish an upper bound of adjacency measure adjGH(M) for any solution M, we assume that all n edges ei ∈ M, 1 ≤ i ≤ n, are contained in two conserved adjacencies, respectively. W.l.o.g. genomes G and H comprise each a single circular chromosome. We follow the natural ordering of these edges, such that every pair of consecutive edges is contained in a conserved adjacency. Then the upper bound of adjacency measure adjGH(M)is given by
adjGH(M) ≤ d1 2ne
∑
i=1 q w(e2i−1)·w(e2i) + b1 2nc∑
i=1 q w(e2i)·w(e2i+1), (3.6) where we define the(n+1)stedge to be equivalent to the first, i.e., en+1 ≡e1, which allows us to sum over the weights of all possible conserved adjacencies in a matching M. Applying the Arithmetic Mean-Geometric Mean Inequality we obtain
n
∑
i=1 w(ei)≥ d1 2ne∑
i=1 q w(e2i−1)·w(e2i) + b1 2nc∑
i=1 q w(e2i)·w(e2i+1) (3.7) ⇔ edg(M) ≥adjGH(M), (3.8)which holds true for any matching M in any gene similarity graph B◦. Further, because the objective function Fα is a convex combination of edg and adj, it holds
that Fα(M) ≤ edg(M) for any M, independent of α. This leads to the following
lemma:
Lemma 2 Given two genomes G and H and gene similarity measure σ, any solution
MMW M to the maximum weighted matching problem in gene similarity graph B◦ of G
and H, is a 1−1α approximation of problem FF-Adjacencies for α <1. Furthermore, for any matchingMthat is a solution to problem FF-Adjacencies,
(1−α)·edg(MMW M)≤ Fα(M) ≤edg(M) ≤edg(MMW M).
3.6 An exact solution to problem FF-Adjacencies
In the following, we describe program FFAdj-2G that solves problem FF-Adjacencies exactly in pairwise comparisons. The presented results are based on previous work of Angibaud et al. [6]. The idea is to translate the problem FF-Adjacencies into a 0-1 linear program. That means we define a set of constraints, which are solely linear inequalities, whose variables are binary (i.e., they take on values 0 or 1) and an objective function (maximization or minimization of a linear formula). 0-1 linear
Algorithm 1 Program FFAdj-2G is an ILP for finding a solution to problem FF- Adjacencies (Problem 1) for two genomes G and H.
Objective: Maximize α·
∑
{ga 1,gb2}∈A?(G), {ha 1,hb2}∈A?(H) s(ga1, g2b, h1a, hb2)·c(g1a, gb2, ha1, hb2) + (1−α)·∑
g∈C◦(G), h∈C◦(H) σ(g, h)·a(g, h) Constraints: (C.01) for all g∈ C◦(G),∑
h∈C◦(H) a(g, h) =b(g) for all h∈ C◦(H),∑
g∈C◦(G) a(g, h) =b(h)(C.02) for all {g1a, g2} ∈ Ab ?(G)and for all{ha1, hb2} ∈ A?(H)with a, b∈ {h, t}
a(g1a, hb1) +a(g2a, hb2)−2·c(ga1, gb2, h1a, h2b)≥0,
if{g1a, g2b} 6∈ A◦(G), then for all g in{g3, . . . , gn} ⊆ C(G) such that{{ga 1, g a3 3 },{g b3 3 , g a4 4 }, . . . ,{g bn n, gb2}} ⊆ A◦(G), b(g) +c(ga 1, gb2, ha1, h2b)≤1,
if{ha1, hb2} 6∈ A◦(H), then for all h∈ {h3, . . . , hn} ⊆ C(H) such that{{h1a, ha3 3 },{h b3 3, h a4 4 }, . . . ,{h bn n, hb2}} ⊆ A◦(H), b(h) +c(g1a, gb2, ha1, hb2)≤1 Domains:
(D.01) for all g∈ C◦(G)and for all h∈ C◦(H), a(g, h)∈ {0, 1}
(D.02) for all g∈ C◦(G), b(g)∈ {0, 1},
for all h∈ C◦(H), b(h)∈ {0, 1}
(D.03) for all {g1a, g2} ∈ Ab ?(G)and for all{ha1, hb2} ∈ A?(H)with a, b∈ {h, t}
c(g1a, g2b, ha1, hb2)∈ {0, 1}
programs are a special form of Integer Linear Programs (ILPs). We subsequently use a solver to assign a value for each variable such that the constraints are verified and the objective is optimized.
Program FFAdj-2G requires as input two genomes G and H, a number α∈ [0, 1], and a gene similarity measure σ. The objective, the variables, and the constraints are defined in Algorithm 1. Therein, we explore the set of conserved adjacencies of
3.6. An exact solution to problem FF-Adjacencies
all possible subgenomes of G and H, by means of the sets of candidate adjacencies of G and H.
Definition 9 (candidate adjacencies set) For a given genome G,
A?(G)≡ [
G0⊆G
A◦(G0)
denotes the set of candidate adjacencies over all subgenomes G0 ⊆ G.
We will now discuss program FFAdj-2G, as presented in Algorithm 1, in detail, thereby making use of gene similarity graph B◦ of genomes G and H, which is implicitly used in our program in order to obtain a matching M between genes of genomes G and H.
FFAdj-2Gcontains three types of 0-1 variables, which are listed in domain spec- ifications (D.01), (D.02), and (D.03) in Algorithm 1. Variables a(g, h) indicate if edge {g, h} in gene similarity graph B◦ is part of the anticipated matching M. That is, iff variable a(g, h) is assigned value 1, then {g, h} ∈ M. Similarly b(g), g ∈ C◦(G), and b(h), h ∈ C◦(H), encode if the corresponding vertices of genes or
telomeres g and h in gene similarity graph B◦ are incident to an edge inM. Lastly, variables c(ga
1, gb2, h1a, h2b) indicate if gene extremities ga1, gb2 of subgenome GM and ha1, hb2 of subgenome HM, with a, b ∈ {h, t}, induced by matching M can possibly form conserved adjacencies, i.e.,{ga1, gb2} ∈ A◦(GM)and{ha1, hb2} ∈ A◦(HM).
Program FFAdj-2G contains two types of constraints, where constraint (C.01) represents matching constraints, and constraint (C.02) defines the rules of forming valid conserved adjacencies. For the former, it is straightforward to see that if the sum of edges that are part of the matching and that are incident to a single ver- tex equals one, then the resulting outcome will be a valid matching M. To this end, we set the sum of variables a(g, h) for a given gene g of genome G over all genes h of genome H to be equal to b(g), and do so analogously for each gene h and its corresponding variable b(h). Since a and b are 0-1 variables, constraint (C.01) fulfills two purposes: Next to ensuring a valid matching, it acts as switch for variables b, that indicate if a gene g is incident to an edge in solution M. These variables are used in the second constraint, (C.02), to ensure that there are no genes g ∈ C(GM) and h ∈ C(GM) that are located within a conserved adjacency pair ({ga
1, gb2},{h1a, h2b}). In the trivial case, no gene is located within adjacency {ga1, g2b} in genome G in the first place, i.e., {ga1, gb2} ∈ A◦(G), and nothing needs
to be done. Yet, whenever, subgenome GM contains an adjacency {ga1, g2}b , where genes g1 and g2 are further apart in G, i.e., there exists a sequence of adjacencies {{ga1, ga3 3 },{g b3 3, g a4 3 }, . . . ,{g bn n, gb2}} ⊆ A◦(G), where {ai, bi} = {h, t}with 3≤ i≤ n, containing genes{g3, . . . , gn} ∈ C(G), then each of these genes g3, . . . , gn must not be contained in M-induced subgenome GM. Constraint (C.02) ensures this by (i) either allowing conserved adjacency pair ({g1a, gb2},{ha1, hb2}), or (ii) any genes of
{g3, . . . gn} to be part of GM, or (iii) neither of them to be true. This is achieved
by constraint b(g) +c(g1a, gb2, ha1, hb2) ≤ 1 for each gene g ∈ {g3, . . . , gn}, and analo- gously for adjacency {ha1, hb2}of subgenome HM.