


2.4 Repeat aware scaffolding

2.4.1 Repeat and Short Contig Filtering Problem

BATISCAF is a novel repeat-aware scaffolding tool. The high-level idea behind it consists of three steps:

1. Filtering out potential repeats via ILP

2. Constructing backbone scaffolding for potentially unique contigs

3. Insertion of multiple copies of potential repeats into backbone scaffolds

Repeat and Short Contig Filtering Problem. Let $C$ be the set of contigs. For each contig $c_i \in C$ we produce two strands: the sequence of the first strand $c_i^+$ coincides with the contig sequence, and the sequence of the second strand $c_i^-$ coincides with the reverse complement of $c_i$. We connect the corresponding vertices $c_i^+$ and $c_i^-$ with an intra-contig edge $e_d = (c_i^+, c_i^-)$ of weight $\infty$ (a big enough number) (see Figure 2.13(a)). The set of all intra-contig edges is denoted as $E_d$. Note that each contig $c_i \in C$ has one and only one corresponding intra-contig edge $e_d \in E_d$.

We say that a paired-end read $r$ (e.g., Illumina technology) connects strands $c_i^{s_i}$ and $c_j^{s_j}$, where $s_i, s_j \in \{+, -\}$ and $i \neq j$, of two distinct contigs if the forward read of $r$ is aligned to $c_i^{s_i}$ and the reverse read of $r$ is aligned to $c_j^{s_j}$. The vertices corresponding to these strands are connected with an inter-contig edge $e = (c_i^{s_i}, c_j^{s_j})$ of weight $\omega_{ij}$ if and only if they are connected with $\omega_{ij}$ paired-end reads supporting a similar gap estimate (a statistically inferred value for the distance in base pairs; gap estimation in BATISCAF follows [74]) between the contigs $c_i$ and $c_j$. Let $E$ denote the set of such edges $e$. Finally, the scaffolding graph $G = (V = C^+ \cup C^-, E \cup E_d, \omega)$ consists of vertices $V = C^+ \cup C^-$ corresponding to the contig strands, connected with weighted intra- and inter-contig edges.
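The construction above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the input format (a dictionary of contig lengths and a list of strand pairs, one per paired-end read) and the function name `build_scaffolding_graph` are hypothetical, and the grouping of reads by similar gap estimate is skipped, so every read joining the same two strands counts toward the same edge weight.

```python
from collections import defaultdict

INF = float("inf")  # stands in for the "big enough" intra-contig weight


def build_scaffolding_graph(contig_lengths, links):
    """Build the scaffolding graph as edge dicts frozenset({u, v}) -> weight.

    contig_lengths: dict contig name -> length in bp (hypothetical format)
    links: iterable of ((contig_i, strand_i), (contig_j, strand_j)), one
           entry per paired-end read connecting two strands
    """
    vertices = [(c, s) for c in contig_lengths for s in "+-"]
    # One intra-contig edge per contig, joining its two strands.
    intra = {frozenset({(c, "+"), (c, "-")}): INF for c in contig_lengths}
    inter = defaultdict(int)
    for u, v in links:
        if u[0] != v[0]:  # only reads joining two distinct contigs count
            inter[frozenset({u, v})] += 1  # weight = number of supporting reads
    return vertices, intra, dict(inter)


vertices, intra, inter = build_scaffolding_graph(
    {"A": 5000, "B": 3000},
    [(("A", "-"), ("B", "+")), (("B", "+"), ("A", "-"))],
)
```

Edges are stored as unordered pairs because the scaffolding graph is undirected; orientation is only decided later, when paths are turned into scaffolds.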

Figure 2.13 (a) The scaffolding graph G. Each contig is represented by two vertices (+ and −) corresponding to forward and backward strands. The intra-contig edges are dashed, the inter-contig edges are solid. (b) The scaffolding graph corresponding to a valid scaffold. The graph is a chain of alternating intra- and inter-contig edges. (c) The chain of contigs corresponding to the scaffolding graph of a scaffold. Each contig is either in the original orientation (+) or inverted (−).

Figure 2.14 (a) A confusion triple: the same strand of contig A is connected with two strands of contigs B and C. The two possible scenarios causing the confusion: (b) contig A is a repeat and another copy of contig A is connected with C; (c) the connection from A to C jumps over the short contig B.

Any valid scaffold corresponds to a scaffolding graph in which each vertex (strand) is incident to exactly one intra-contig edge and at most one inter-contig edge (see Figure 2.13(b)). Therefore, if a vertex in G is incident to at least two inter-contig edges, then either one or both of them should be disregarded. Assuming no contig mis-assemblies, such confusing edges are caused by either repeats or short contigs (see Figure 2.14). If there is no such confusion, the scaffolding graph G is a set of valid scaffolds (with potentially missing links). So, in order to avoid confusion and keep only reliable contigs, we need to delete suspected repeat and short contigs. Usually, repeat contigs are also short (which is defined by the repeat monomer length of 150-400 bp [64]). Thus, the problem of repeat and short contig removal can be formulated as follows.

Repeat and Short Contig Filtering (RSCF) Problem. Given a scaffolding graph $G = (V, E \cup E_d, \omega)$, find a minimum total length subset of contigs $X \subseteq V$ such that in the subgraph $G'$ induced by $V \setminus X$, any vertex $v$ is incident to at most one inter-contig edge.

The solution of the RSCF problem represents a set of contigs whose removal from G leaves a set of simple alternating paths and/or cycles consisting of intra-contig edges interspersed with inter-contig ones. The paths can be easily transformed into scaffolds. If there are no cyclic chromosomes, the cycles must also be transformed into paths, so we remove the least-weight edge from each cycle. The resulting alternating paths can then be transformed into a set of scaffolds using a procedure similar to [62] (see Figure 2.13(c)). Such scaffolds are highly reliable since there is no confusion during their extraction. Clearly, solving the RSCF problem does not guarantee that all repeated contigs are removed from C. Conversely, a contig whose strands have a high degree in the scaffolding graph is a candidate for removal even though it is not necessarily a repeat. The objective of the RSCF problem ensures that the removed contigs have minimum total length.
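The cycle-breaking and path-to-scaffold step can be illustrated with the following Python sketch. It is not the procedure of [62], only a minimal stand-in under stated assumptions: every strand is incident to at most one inter-contig edge (the RSCF condition already holds), strands are `(contig, "+"/"-")` tuples, and a contig's orientation is taken to be the sign of the strand through which the chain leaves it. The names `flip`, `walk`, and `extract_scaffolds` are hypothetical.

```python
def flip(v):
    contig, strand = v
    return (contig, "-" if strand == "+" else "+")


def walk(exit_vertex, partner):
    """Follow inter-contig edges starting from a contig's exit strand.

    Returns the exit strands visited and whether the walk closed a cycle.
    """
    chain, v = [exit_vertex], exit_vertex
    while v in partner:
        v = flip(partner[v])  # enter the next contig, leave by its other strand
        if v == exit_vertex:
            return chain, True
        chain.append(v)
    return chain, False


def extract_scaffolds(contigs, inter):
    """inter: dict frozenset({u, v}) -> weight, each strand in at most one edge."""
    partner = {}
    for edge in inter:
        u, v = tuple(edge)
        partner[u], partner[v] = v, u
    # Break every cycle at its least-weight inter-contig edge.
    for c in contigs:
        chain, is_cycle = walk((c, "+"), partner)
        if is_cycle:
            u = min(chain, key=lambda x: inter[frozenset({x, partner[x]})])
            w = partner.pop(u)
            partner.pop(w)
    # Read each remaining path from one end to the other.
    scaffolds, seen = [], set()
    for c in contigs:
        if c in seen:
            continue
        v = (c, "-")  # rewind to the first contig of this path
        while v in partner:
            v = flip(partner[v])
        chain, _ = walk(flip(v), partner)
        seen.update(name for name, _ in chain)
        scaffolds.append([name + strand for name, strand in chain])
    return scaffolds
```

Each scaffold is reported in one of its two equivalent readings; the reverse reading (reversed order with all orientations flipped) describes the same scaffold.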

The RSCF problem can be viewed as a set cover problem in which the elements correspond to confusion triples of contigs, i.e., contig triples with two E-edges connecting a single strand with two different strands of other contigs (see Figure 2.14), and the sets correspond to contigs: each contig c covers all confusion triples containing c. Therefore, this is a set cover problem in which each element belongs to at most three sets. Although such a problem is NP-complete, it can be 3-approximated with a primal-dual algorithm [90].
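To make the set cover view concrete, the sketch below enumerates confusion triples and applies the textbook primal-dual scheme for covering problems in which every element lies in at most three sets. It is a generic illustration and not necessarily the exact algorithm of [90]; the function names and the input format (inter-contig edges as unordered strand pairs, contig lengths as integers) are assumptions.

```python
from itertools import combinations


def confusion_triples(inter):
    """inter: iterable of frozenset({(contig, strand), (contig, strand)})."""
    neighbours = {}
    for edge in inter:
        u, v = tuple(edge)
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    triples = set()
    for v, nbrs in neighbours.items():
        # Every pair of inter-contig edges at the same strand is a confusion.
        for a, b in combinations(sorted(nbrs), 2):
            triples.add(frozenset({v[0], a[0], b[0]}))
    return triples


def primal_dual_cover(lengths, triples):
    """Pick contigs to remove so that every confusion triple loses a member."""
    slack = dict(lengths)  # remaining "budget" of each contig
    removed = set()
    for triple in sorted(triples, key=sorted):  # fixed order for reproducibility
        if triple & removed:
            continue  # this confusion is already resolved
        # Raise the dual variable of the triple until some contig becomes tight.
        delta = min(slack[c] for c in triple)
        for c in triple:
            slack[c] -= delta
            if slack[c] == 0:
                removed.add(c)
    return removed


inter = [
    frozenset({("A", "+"), ("B", "-")}),
    frozenset({("A", "+"), ("C", "-")}),
]
removed = primal_dual_cover(
    {"A": 5000, "B": 300, "C": 4000}, confusion_triples(inter)
)
```

In this toy instance the single confusion triple {A, B, C} is resolved by removing the shortest contig B.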

The RSCF problem can be solved efficiently if the scaffolding graph does not contain cycles. Therefore, as an alternative to solving this problem exactly or approximately, we also propose a fast heuristic (BATISCAF-MST): find the maximum spanning tree T(G) of the scaffolding graph G (note that the edges connecting the strands of the same contig always belong to T) and then find the exact solution for T.
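The first half of this heuristic, the maximum spanning tree, can be sketched with Kruskal's algorithm as below. The choice of Kruskal with a union-find structure is an assumption made for illustration (the text does not say which algorithm is used), the exact solution on the tree is not shown, and on a disconnected graph the result is a spanning forest.

```python
def maximum_spanning_tree(vertices, edges):
    """edges: dict frozenset({u, v}) -> weight; returns the set of kept edges."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = set()
    # Heaviest edges first, so infinite-weight intra-contig edges are always kept.
    for edge in sorted(edges, key=edges.get, reverse=True):
        u, v = tuple(edge)
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.add(edge)
    return tree


INF = float("inf")
vertices = [(c, s) for c in "ABC" for s in "+-"]
edges = {frozenset({(c, "+"), (c, "-")}): INF for c in "ABC"}
edges[frozenset({("A", "+"), ("B", "-")})] = 5
edges[frozenset({("B", "+"), ("C", "-")})] = 4
edges[frozenset({("C", "+"), ("A", "-")})] = 1  # closes a cycle
tree = maximum_spanning_tree(vertices, edges)
```

Here the weakest link, the weight-1 edge that closes the cycle, is the only edge left out of the tree.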

ILP Formulation for the RSCF Problem. Since the RSCF problem is NP-hard, we propose to find the exact optimal solution using an Integer Linear Programming (ILP) approach.

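Let the length (in bp) of contig $u$ be denoted as $l_u$. We formulate the following ILP:

\[
\begin{aligned}
& \sum_{v} l_v x_v \to \min \quad \text{s.t.} \\
& x_u + x_v \ge y_{uv}, && \forall e = (u, v) \in E \cup E_d && \text{(a)} \\
& y_{uv} \ge x_u, && \forall e = (u, v) \in E \cup E_d && \text{(b)} \\
& y_{uv} \ge x_v, && \forall e = (u, v) \in E \cup E_d && \text{(c)} \\
& \sum_{v : (u, v) \in E \cup E_d} y_{uv} \ge \deg(u) - 2, && \forall u \in V && \text{(d)} \\
& \Bigl( \sum_{v : (u, v) \in E \cup E_d} y_{uv} \Bigr) - x_u \le \deg(u) - 1, && \forall u \in V && \text{(e)} \\
& x_u - x_v = 0, && \forall e = (u, v) \in E_d && \text{(f)} \\
& x_u \in \{0, 1\}, && \forall u \in V && \text{(g)} \\
& y_{uv} \in \{0, 1\}, && \forall e = (u, v) \in E \cup E_d && \text{(h)}
\end{aligned}
\tag{2.4}
\]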

Binary variables $x_u \in \{0, 1\}$ have the following meaning:

\[
x_u =
\begin{cases}
1, & \text{if contig } u \text{ is marked as either a repeat or a possibly skipped short contig}, \\
0, & \text{otherwise.}
\end{cases}
\tag{2.5}
\]

Definition. Let $u$ be a contig belonging to the scaffolding graph $G$. If by solving the ILP (2.4) we obtain $x_u = 1$, we call the contig untrusted (otherwise, trusted). The set of all untrusted contigs is denoted as $U$.

Binary variables $y_{uv} \in \{0, 1\}$ have the following meaning:

\[
y_{uv} =
\begin{cases}
1, & \text{if the link } e = (u, v) \text{ is incident to an untrusted contig (either } u \text{ or } v\text{)}, \\
0, & \text{otherwise.}
\end{cases}
\tag{2.6}
\]

Accordingly, a link is called untrusted if it is incident to at least one untrusted contig.

The meaning of all the constraints is the following:

• Conditions (a), (b), (c) from the ILP (2.4) ensure that edges incident to an untrusted contig are also marked as untrusted.

• Condition (d) means that the degree of a trusted contig $u$ is at most 2 in the resulting scaffolds, i.e. no more than two edges which are not marked as untrusted are incident to it.

• Condition (e) means that if all the edges incident to a node $u$ are marked as untrusted, then $u$ is untrusted as well.

• Condition (f) guarantees that the two strands of any contig are either both marked as trusted or both marked as untrusted.

Finally, the objective of the ILP (2.4) is to minimize the total length of the untrusted contigs while requiring that all the trusted contigs be connected into chains or cycles (i.e. their degree in the graph G\U is either 0, 1, or 2).
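The constraints of ILP (2.4) can be checked on a toy instance by plain enumeration, which is only feasible for a handful of vertices and is no substitute for an ILP solver. The sketch below relies on one observation: for binary variables, constraints (a), (b), and (c) together force $y_{uv} = \max(x_u, x_v)$, so only the $x$ variables need to be enumerated. The function name and input format are hypothetical, and lengths are given per vertex (strand), as in the objective.

```python
from itertools import product


def solve_rscf_ilp_bruteforce(length, edges, intra):
    """Enumerate all 0/1 assignments of x and keep the cheapest feasible one.

    length: dict vertex -> l_v; edges: set of frozenset({u, v}) (E and E_d);
    intra: the subset of edges that are intra-contig (E_d).
    """
    vertices = sorted(length)
    incident = {v: [e for e in edges if v in e] for v in vertices}
    best, best_cost = None, float("inf")
    for bits in product((0, 1), repeat=len(vertices)):
        x = dict(zip(vertices, bits))
        # (a)-(c) force y_uv = max(x_u, x_v) for binary variables.
        y = {e: max(x[v] for v in e) for e in edges}
        ok = all(x[u] == x[v] for u, v in map(tuple, intra))  # (f)
        for u in vertices:
            s, deg = sum(y[e] for e in incident[u]), len(incident[u])
            ok = ok and s >= deg - 2 and s - x[u] <= deg - 1  # (d), (e)
        cost = sum(length[v] * x[v] for v in vertices)
        if ok and cost < best_cost:
            best, best_cost = x, cost
    return best, best_cost


# Toy confusion triple: strand A+ is linked to both B- and C-.
length = {
    (c, s): l for c, l in {"A": 5000, "B": 300, "C": 4000}.items() for s in "+-"
}
intra = {frozenset({(c, "+"), (c, "-")}) for c in "ABC"}
inter = {
    frozenset({("A", "+"), ("B", "-")}),
    frozenset({("A", "+"), ("C", "-")}),
}
x, cost = solve_rscf_ilp_bruteforce(length, intra | inter, intra)
```

On this instance constraint (d) at strand A+ (degree 3) requires at least one untrusted incident edge, and marking both strands of the short contig B as untrusted is the cheapest way to achieve that.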

The solution of ILP (2.4) represents a set of untrusted contigs $U$. After we remove them from the scaffolding graph we obtain a graph $T = G \setminus U$ with all its connected components being either paths or cycles. We remove the least-weight edge from each cycle. As a result, we obtain a set of alternating paths which can be translated into scaffolds following a procedure similar to [62]. In each scaffold $S$, the relative ordering and orientation of each contig are established.

Most Likely Repeat and Short Contig Insertion. The second stage of our algorithm, constructing the scaffolding for the contigs left after removing confusing contigs (see Figure 2.13(c)), is trivial. The third stage of our approach is to insert the removed contigs from the set $U$ back into the scaffolds. Potential repeats identified in the first stage are inserted into the scaffolds as many times as can be inferred from the scaffolding graph structure. For each backbone scaffold $S$ we create a surrounding graph $G_S = (S \cup N, E)$, which is a subgraph of the scaffolding graph $G$ on the set of nodes representing contig strands in $S$ and all their neighboring nodes $N$, i.e. $\forall n \in N, \exists u \in S$ such that $e = (n, u) \in E(G)$ (see Figure 2.15a). In $G_S$, the orientation of each contig in $S$ is known and the relative order between contigs in $S$ is established. The same information is to be determined for the contigs in $N$. Surrounding graphs corresponding to different scaffolds may share neighboring nodes, i.e. $N_k \cap N_p \neq \emptyset$ for some scaffolds $s_k$ and $s_p$ with neighbor sets $N_k$ and $N_p$. This fact determines the copy number of each repeated contig.

Next, we build a directed surrounding graph $\vec{G}_S$ where nodes represent oriented contigs and arcs encode the relative ordering of neighboring contigs. The orientation of each contig in $N$, as well as the relative ordering between any contig $n \in N$ and its neighbors from $S$ in $G_S$, can be determined in the following way. Let $e$ be an edge in the surrounding graph $G_S$ between a strand of contig $n \in N$ and a strand of a trusted contig $u \in S$. Suppose the orientation of $u$ is determined to be “+” and $e$ is incident to the negative strand of $u$. Then, if $e$ is also incident to the positive strand of $n$, the orientation of $n$ is assigned to be “+” (otherwise “−”), and a new arc from $n$ to $u$ is added to $\vec{G}_S$. In the same manner, the direction of each arc and the orientation of each contig is determined. For example, in the graph $G_S$ depicted in Figure 2.15a, the positive strand of the contig $X_1 \in N$ is connected to the negative strand of contig $B \in S$ (which has orientation “+”). Therefore, in the graph $\vec{G}_S$ depicted in Figure 2.15b, contig $X_1$ is assigned orientation “+” and there is an outgoing arc connecting it with $B$.
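The orientation rule can be written as a small function. The text spells out only the case where the edge touches the strand through which the chain enters $u$; the remaining cases below are a symmetric generalization and should be read as an assumption, as should the convention that an oriented contig is entered through the strand opposite to its orientation sign and left through the strand equal to it. The function name `orient_neighbour` is hypothetical.

```python
def flip(strand):
    return "-" if strand == "+" else "+"


def orient_neighbour(u, u_orientation, u_strand, n, n_strand):
    """Return (orientation of n, arc) for an edge joining u_strand and n_strand.

    u_orientation: known orientation of the trusted contig u ("+" or "-")
    u_strand, n_strand: the strands of u and n that the edge is incident to
    """
    if u_strand != u_orientation:
        # The edge enters u, so n precedes u and is left through n_strand.
        return n_strand, (n, u)
    # The edge leaves u, so n follows u and is entered through n_strand.
    return flip(n_strand), (u, n)


# The example from the text: X1's positive strand meets B's negative strand,
# and B has orientation "+".
orientation, arc = orient_neighbour("B", "+", "-", "X1", "+")
```

For the example from the text this yields orientation “+” for X1 and the arc from X1 to B.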

The directed graph $\vec{G}_S$ may not be acyclic because some of the contigs newly introduced into the scaffold (from the set $N$) are repeats. We have to identify the set of repeated contigs $R$ and replace each contig $r \in R$ with several copies of itself. The resulting graph is acyclic, i.e. it represents a partial order. A minimal set of repeated contigs in $\vec{G}_S$ can be determined by solving the Minimum Feedback Vertex Set problem or any of its weighted versions (for example, using contig lengths as weights). It is known that this problem is APX-hard on directed graphs [26], i.e. it does not admit a polynomial-time approximation scheme (PTAS) unless P = NP. Therefore, we apply a simple greedy heuristic to determine the feedback vertex set. Namely, we randomly pick a cycle $C$ in $\vec{G}_S$, find the shortest contig $c \in C$, and remove $c$ from the graph. We assign a copy of $c$ to each of its neighbors in $\vec{G}_S$. We continue this procedure until the graph is acyclic.
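A minimal sketch of this greedy heuristic is given below. Two simplifications are assumptions of the sketch: the cycle is the first one found by a depth-first search rather than a randomly chosen one, and copies of a removed contig are named by appending `#1`, `#2`, and so on, a purely illustrative convention. The graph is a dictionary mapping every node to the set of its successors.

```python
def find_cycle(graph):
    """Return the nodes of some directed cycle, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in graph}
    stack = []

    def dfs(v):
        colour[v] = GREY
        stack.append(v)
        for w in sorted(graph[v]):
            if colour[w] == GREY:  # back arc: the stack suffix is a cycle
                return stack[stack.index(w):]
            if colour[w] == WHITE:
                cycle = dfs(w)
                if cycle:
                    return cycle
        colour[v] = BLACK
        stack.pop()
        return None

    for v in sorted(graph):
        if colour[v] == WHITE:
            cycle = dfs(v)
            if cycle:
                return cycle
    return None


def break_cycles(graph, length):
    """Greedy feedback vertex set: split the shortest contig on each cycle."""
    graph = {v: set(ws) for v, ws in graph.items()}
    repeats = []
    while (cycle := find_cycle(graph)) is not None:
        c = min(cycle, key=lambda v: length[v])
        preds = [v for v in graph if c in graph[v] and v != c]
        succs = graph.pop(c) - {c}
        repeats.append(c)
        # Each neighbour receives its own copy of the removed contig.
        for i, p in enumerate(preds, 1):
            graph[p].discard(c)
            graph[p].add(f"{c}#{i}")
            graph[f"{c}#{i}"] = set()
        for j, s in enumerate(sorted(succs), len(preds) + 1):
            graph[f"{c}#{j}"] = {s}
    return graph, repeats


graph = {"A": {"X"}, "X": {"B"}, "B": {"X", "C"}, "C": set()}
length = {"A": 5000, "X": 200, "B": 4000, "C": 3000}
dag, repeats = break_cycles(graph, length)
```

Every copy has either a single incoming or a single outgoing arc, so copies can never lie on a cycle and the loop terminates.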

Figure 2.15 Insertion procedure. Contigs belonging to a backbone scaffold $S$ are green; contigs which are candidates for insertion are blue. (a) A fragment of the surrounding graph $G_S$ with the chain of trusted contigs A, B, C and neighboring contigs $X_1$-$X_4$. (b) The directed surrounding graph $\vec{G}_S$ corresponding to $G_S$. (c) The scaffold $S$ augmented with contigs $X_1$, $X_2$, $X_3$, and $X_4$.

We refer to the transitive reduction of the directed acyclic graph (DAG) $\vec{G}_S$ as the spine. The spine consists of all the nodes in the scaffold $S$ and some or all of the nodes in $N$.

We define a slot $\mathcal{S} = (u, v)$ as the set of nodes between a pair of articulation nodes $u$ and $v$ in the spine of $\vec{G}_S$ which does not contain any other articulation point. For a slot $\mathcal{S}$ there can be only two cases (or a combination of them):

1. It is composed of a set of directed paths from $u$ to $v$ (e.g., in Figure 2.15b, the slot $(A, B)$ comprises two paths: $A \to X_1 \to B$ and $A \to X_2 \to B$);

2. It contains dangling nodes (e.g., in Figure 2.15b, the slot $(B, C)$ contains a dangling node $X_4$ attached to $B$ and another dangling node $X_3$ attached to $C$).

In the first case, for all the contigs belonging to $\mathcal{S}$ we identify their relative ordering by sorting them according to the distance from either $u$ or $v$. In the second case, a “dangling” node $D$ may not necessarily belong to $\mathcal{S}$. For example, in Figure 2.15b, contig $X_3$ may belong to either slot $(B, C)$ or $(A, B)$ depending on the distance estimates between $X_3$ and $C$ and between $B$ and $C$. Without loss of generality, let $D$ be connected with a contig $S_i \in S$ by an outgoing arc (e.g. contig $X_3$ has an outgoing arc to $C$). Contig $D$ may be inserted into one of the slots $(S_{i-1}, S_i)$, $(S_{i-2}, S_{i-1})$, etc. For all such slots $\mathcal{S}_k = (S_{i-k-1}, S_{i-k})$, we estimate the probability $P_{\mathcal{S}_k} = P(D \in \mathcal{S}_k)$ defined as

\[
P_{\mathcal{S}_k} = F\bigl(x \le d(D, S_i) \le x + y\bigr),
\]

where

\[
x = \sum_{p=0}^{k-1} \bigl( d(S_{i-p-1}, S_{i-p}) + l(S_{i-p}) \bigr), \qquad
y = d(S_{i-k-1}, S_{i-k}) - l(D),
\]

$l(z)$ is the length of contig $z$, $d(z_1, z_2)$ is the distance estimate between two contigs $z_1$ and $z_2$, and $F$ is the normal distribution $N(\mu, \sigma^2)$ with $\mu$ and $\sigma$ being the mean and standard deviation of the library insert size.

The dangling contig $D$ is assigned to the slot $\mathcal{S}_{k'}$, where $k' = \arg\max_k P_{\mathcal{S}_k}$. After all the contigs in $N$ are assigned to the corresponding slots, we get the set of scaffolds $S'$ augmented with repeated and short contigs (see Figure 2.15c).
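The slot selection can be illustrated as follows. The sketch computes $x$ and $y$ exactly as in the formulas above and scores each slot by the probability mass that a normal distribution puts on the interval $[x, x + y]$. How the mean and standard deviation of that distribution are instantiated is deliberately left to the caller, since the text admits more than one reading; the function names, the list-based scaffold representation, and all numbers in the example are hypothetical.

```python
import math


def normal_cdf(t, mean, sd):
    return 0.5 * (1.0 + math.erf((t - mean) / (sd * math.sqrt(2.0))))


def slot_interval(k, i, gap, length, dangling_length, scaffold):
    """Return (x, y) for the slot (S_{i-k-1}, S_{i-k}).

    scaffold: list of contig names S_0..S_m; gap[(a, b)]: distance estimate
    between consecutive contigs a and b; length[c]: contig length in bp.
    """
    x = sum(
        gap[(scaffold[i - p - 1], scaffold[i - p])] + length[scaffold[i - p]]
        for p in range(k)
    )
    y = gap[(scaffold[i - k - 1], scaffold[i - k])] - dangling_length
    return x, y


def best_slot(i, gap, length, dangling_length, scaffold, mean, sd):
    """Pick the k maximising the normal probability mass on [x, x + y]."""
    scores = {}
    for k in range(i):  # slots exist while i - k - 1 >= 0
        x, y = slot_interval(k, i, gap, length, dangling_length, scaffold)
        scores[k] = normal_cdf(x + y, mean, sd) - normal_cdf(x, mean, sd)
    return max(scores, key=scores.get), scores


scaffold = ["A", "B", "C"]
gap = {("A", "B"): 1000, ("B", "C"): 100}
length = {"A": 5000, "B": 2000, "C": 3000}
# A 300 bp dangling contig attached to C (index 2); made-up mean and sd.
k_best, scores = best_slot(2, gap, length, 300, scaffold, mean=3400.0, sd=200.0)
```

In this example the gap between B and C (100 bp) is too small for the 300 bp contig, so $y$ is negative for $k = 0$ and the contig is placed in the slot $(A, B)$, i.e. $k = 1$.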

2.4.2 Results

Datasets with repeats. We used the five B split datasets described in the previous part of this chapter (see Validation results in 2.3.3). The following Illumina paired-end read datasets were used: S. aureus - read length 37, insert size 3600; R. sphaeroides - read length 101, insert size 3700; M. fijiensis - read length 100, insert size 1800; H. sapiens (chr14) - read length 101, insert size 2700; M. graminicola - read length 100, insert size 1800. Table 2.13 presents the basic characteristics of the contig datasets.

Table 2.13 The basic characteristics of the simulated contig datasets: “avg len” - average contig length, “# unique” - the number of unique contigs (no copies counted), “# total” - the total number of contigs including copies, “# repeats” - the number of repeated contigs, “# max CN” - the maximum copy number, i.e. the number of times the most abundant contig is encountered in the dataset.

Dataset              avg len   # unique   # total   # repeats   # max CN
S. aureus            13.9 K    203        244       22          5
R. sphaeroides       7.2 K     612        652       30          5
H. sapiens (chr14)   1.7 K     44350      45035     613         5
M. graminicola       4.7 K     6875       7261      280         9
M. fijiensis         2.0 K     17781      19357     995         28

Performance metrics. We used the following evaluation metrics in our comparison:

• number of correct contig links;

• sensitivity (or recall) and PPV (positive predictive value) - two scaffolding quality metrics introduced in [62] and used in [56]. They are defined as $TPR = \frac{TP}{P}$ and $PPV = \frac{TP}{TP + FP}$, where $TP$ is the number of correct contig joins in the output of the scaffolder (true positives), $FP$ is the number of erroneous joins (false positives), and $P$ is the number of potential contig joins (equal to the number of contigs minus the number of reference sequences). We also report the F-score, equal to the harmonic mean of $TPR$ and $PPV$;

• corrected N50, which is the length of contigs in the smallest corrected scaffold necessary to cover 50% of all contigs [76].
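The join-based metrics follow directly from their definitions, as the short sketch below shows. The function name and the example counts are made up for illustration and do not come from the reported experiments.

```python
def scaffolding_metrics(tp, fp, n_contigs, n_references):
    """Return (TPR, PPV, F-score) for a scaffolder's output.

    tp, fp: numbers of correct and erroneous contig joins
    n_contigs, n_references: used to derive the number P of potential joins
    """
    p = n_contigs - n_references
    tpr = tp / p
    ppv = tp / (tp + fp) if tp + fp else 0.0
    # The F-score is the harmonic mean of TPR and PPV.
    f_score = 2 * tpr * ppv / (tpr + ppv) if tpr + ppv else 0.0
    return tpr, ppv, f_score


# Illustrative counts only: 101 contigs from a single reference sequence.
tpr, ppv, f_score = scaffolding_metrics(tp=80, fp=20, n_contigs=101, n_references=1)
```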

Validation results. We compared our tool with five other state-of-the-art stand-alone scaffolders: OPERA-LG, SSPACE, BESST, ScaffMatch, and BOSS. On each of the five datasets BATISCAF outperforms all other tools (Table 2.15). Notably, a large gap between BATISCAF and the remaining tools is observed on the GAGE datasets S. aureus, R. sphaeroides, and H. sapiens (chr14) in terms of both F-score and corrected N50. Indeed, on S. aureus it identified 33%, on R. sphaeroides 23%, and on H. sapiens (chr14) 34% more correct contig links than the next top competitor (ScaffMatch on the first two datasets and OPERA-LG on the third one) (see Table 2.15).

BATISCAF scaffolds for the fungi datasets are also of better quality, although it did not join contigs as aggressively as on the GAGE datasets. The improvement over ScaffMatch in terms of F-score is small, but it is complemented by more contiguous scaffolds, as suggested by the corrected N50 results (see Table 2.15).

The runtime of BATISCAF on the largest dataset, H. sapiens (chr14), containing 44350 distinct contigs, is reasonable (≈ 80 minutes) and comparable with the runtime of the other tools; the wall clock time spent on solving the ILP (2.4) is 16 seconds. We used CPLEX (version 12.7) for solving the ILP (2.4). All the experiments were run on 2.5 GHz 16-core AMD Opteron 6380 processors with 256 GB RAM running Ubuntu 16.04 LTS.

Further, we also used the well-established standard evaluation framework [40] to confirm the advantage of BATISCAF over the competitor tools. As we mentioned previously, it does not take into account repeats and chooses only the “best” placement of each contig in the reference ground truth scaffolding. We generated artificial contigs for the same five datasets
