• No results found

MURPAR: Combining Pairwise Solutions using ILP

5.2 Background

5.3.3 MURPAR: Combining Pairwise Solutions using ILP

as an estimate of the solution of the Set-wise HGT Inference problem. An advan- tage of this approach is that fast exact algorithms and heuristics exist for obtaining SP R(ST, gt), as described above, and taking the union of pairwise reconciliations is very simple computationally. Indeed, the algorithm yields good performance on simulated data sets [3]. Nonetheless, a careful inspection of the algorithm raises some issues that need to be resolved for accurate estimates.

First, it is possible that the optimal solutions for SP R(ST, gt) do not lead to the optimal solution for SP R(ST, G ), and cause overestimation in a “global” estimate of reticulations. For example, consider the HGT scenarios shown in Fig. 5.2. If we take

A B C D E F G 1 4 3 2 X N A B C D E F G A B C D E F G A B C D E F G A B C DE F G GT1 GT2 GT3 GT4

Figure 5.2 : A phylogenetic network with four independent HGT scenarios. The species tree ST in this case is the network N without the four HGT edges. The gene whose tree is GTi(1 ≤ i ≤ 3) underwent HGT event (i), and the gene whose tree is GT4underwent

HGT events (1) and (4). The combined effect of HGTs (1) and (4) on the gene tree topology is the same as the combined effect of HGTs (1), (2), and (3).

the union of the four pairwise solutions computed by SP R(ST, GTi), for 1 ≤ i ≤ 4,

we obtain a set of four HGT edges: Ξ1 = {[B → C], [E → D], [F → E], [F → X]}.

However, a smallest solution for the Set-wise HGT Inference problem given the species tree ST and the four gene trees contains three HGT edges, which is the set

Ξ2 = {[B → C], [E → D], [F → E]}. In this case, the HGT edge [F → X] is

not needed, since its effect can be simulated by the two HGT edges [E → D] and [F → E], once the HGT edge [B → C] is applied. However, notice that in this case, the solution that truly reflects what happened is Ξ1, since the HGT event denoted

by [F → X] did occur, even though its effect on the topologies of gene trees can be simulated by the other three HGT edges. In other words, while the union of pairwise solutions may not provide a smallest global solution, it may provide a solution that is closer to the true HGT scenarios that took place at the genomic level. Further, under these scenarios, where a smallest solution is a proper subset of the union of pairwise solutions, post-processing via gradual elimination of members of the union can yield a smallest solution. However, it is not guaranteed that the smallest solution is a subset of the union of pairwise reconciliations.

Second, SP R(ST, gt) may not be unique; in fact, the number of possible solu- tions to the Pairwise HGT Inference problem can be exponential in the size of a solution [135]. To account for this issue, we need to consider all solutions, or a large number of them when obtaining all is computationally infeasible, for each pair of trees. Without accounting for multiple solutions, methods such as M2 [3] that agglomerate pairwise solutions would obtain biased estimates.

A third issue that requires special attention is the following: while solutions to the Pairwise HGT Inference problem may be acyclic (that is, the inferred HGT events, when added to the species tree, do not result in cycles), taking the union of solutions cannot guarantee acyclicity. When this occurs, the solution is not a phylogenetic network as given by Definition 2.2.

As mentioned above, the former method M2 [3] uses the approach given by Eq. (5.2), but it does not address the last two issues raised above. For the first

issue, let ST be a species tree and G ={GT1, . . . , GTk} be a collection of gene trees.

Also let Si = {si,1, si,2, . . . , si,mi} be the set of all optimal pairwise solutions on the

pair �ST, GTi�. Then, M2 counts the frequency with which each potential reticu-

lation edge appears throughout si,j, 1 ≤ i ≤ k, 1 ≤ j ≤ mi, and calculates the set

of reticulation edges such that 1) it covers at least one solution for all trees, and 2) it maximizes its frequency value, in the assumption that the frequency reflects how often a reticulation edge is used in each solution and maximizing the frequency would result in a smallest set of reticulation edges.

However, with multiple solutions (e.g., obtained by using RIATA-HGT [39, 40]), a reticulation edge occurring multiple times in si,j,∀1 ≤ j ≤ mi, contributes more to the

frequency than those occurring once. As a result, a solution would be biased towards the edges occurring multiple times in solutions. In this section, we propose our heuristic MURPAR (MUlti-tree Reconciliation using PAirwise Reconciliations) for addressing these issues. With the definition of S =�1≤i≤k,1≤j≤misi,j MURPAR seeks

the smallest set S� ⊆ S that satisfies the property [∀1 ≤ i ≤ k, ∃s

i,j ∈ Si s.t. si,j ⊆

S�]. MURPAR solves this problem using integer linear programming (ILP). We define

binary variables as follows:

(A) Bs, ∀s ∈ S. Bs will take value 1 if SPR move s is selected as an element of S�,

and 0 otherwise.

(B) Pij, ∀1 ≤ i ≤ k, ∀1 ≤ j ≤ mi. Pij will take value 1 if all SPR moves in the

optimal pairwise solution sij are selected, and 0 otherwise.

Finally, the ILP program is: minimize �Bs

– Pij = � ∧y∈sijBy � ,∀1 ≤ i ≤ k, ∀1 ≤ j ≤ mi – [1≤j≤miPij] = 1,∀1 ≤ i ≤ k

Here, ∧ and ∨ represent logical ‘and’ and logical ‘or’, respectively. All these con- straints can be turned into linear constraints by introducing auxiliary variables as follows:

• a = (b1∧ b2 ∧ · · · ∧ bp), where all variables are binary, can be turned into the

linear inequalities −1 ≤ 2b1+ 2b2+· · · + 2bp− 2pa ≤ 2p − 1.

• (b1 ∨ b2∨ · · · ∨ bp) = 1, where all variables are binary, can be turned into the

linear inequality b1+ b2+· · · + bp ≥ 1.

When a species tree ST is not given, we repeat MURPAR with GTi, 1≤ i ≤ k as

the species tree and choose the smallest S� i as S�.

As discussed above, the solution thus far may result in cyclic graphs, which are not phylogenetic networks. To address this issue, MURPAR post-processes the results to avoid the solutions with cycles using a straightforward cycle detection algorithm. If a minimal solution is found to have a cycle, MURPAR skips it and inspects the next minimal solution (minimal in terms of the number of reticulation nodes). While all solution candidates found by MURPAR may have cycles, and thus MURPAR returns no solution, we found through extensive simulations that this was never the case. Similarly, it was shown in [137] and confirmed in [138] that cycles may not be a major concern for reticulation detection algorithms when run on real data sets or synthetic data sets that are generated under realistic models.