We now define the dual string pointer rules. These rules will be used to characterize the effect of flip operations on the underlying legal string. For all $p, q \in \Pi$ with $p \neq q$, we define
• the dual string positive rule for $p$ by $\mathrm{dspr}_p(u_1 p u_2 p u_3) = u_1 p \bar{u}_2 p u_3$,
• the dual string double rule for $p, q$ by $\mathrm{dsdr}_{p,q}(u_1 p u_2 q u_3 \bar{p} u_4 \bar{q} u_5) = u_1 p u_4 q u_3 \bar{p} u_2 \bar{q} u_5$,
where $u_1, u_2, \ldots, u_5$ are arbitrary (possibly empty) strings over $\Pi$. Notice that the dual string pointer rules are self-inverse.
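The two dual rules can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the formal development: a string over $\Pi$ is stored as a tuple of nonzero integers, $-p$ stands for $\bar{p}$, and the names `invert`, `dspr` and `dsdr` are chosen here for illustration.

```python
from typing import Tuple

# Assumption: a pointer is a nonzero int, and -p encodes the barred pointer.
PointerString = Tuple[int, ...]


def invert(u: PointerString) -> PointerString:
    """Inversion of a string: reverse it and bar every pointer."""
    return tuple(-x for x in reversed(u))


def dspr(p: int, u: PointerString) -> PointerString:
    """u1 p u2 p u3 -> u1 p inv(u2) p u3 (p must occur twice with the same sign)."""
    positions = [i for i, x in enumerate(u) if x == p]
    if len(positions) != 2:
        raise ValueError("dspr is not applicable: p does not occur twice")
    i, j = positions
    return u[: i + 1] + invert(u[i + 1 : j]) + u[j:]


def dsdr(p: int, q: int, u: PointerString) -> PointerString:
    """u1 p u2 q u3 -p u4 -q u5 -> u1 p u4 q u3 -p u2 -q u5."""
    if p == q or any(x not in u for x in (p, q, -p, -q)):
        raise ValueError("dsdr is not applicable: pattern not present")
    i, j, k, l = u.index(p), u.index(q), u.index(-p), u.index(-q)
    if not i < j < k < l:
        raise ValueError("dsdr is not applicable: wrong order of pointers")
    # Swap u2 = u[i+1:j] with u4 = u[k+1:l]; the block q u3 -p stays in place.
    return u[: i + 1] + u[k + 1 : l] + u[j : k + 1] + u[i + 1 : j] + u[l:]


u = (2, 4, 3, 4, -2, 5, -3, 5)
print(dspr(4, u))
print(dsdr(2, 3, u))
```

Applying either function twice with the same arguments returns the original tuple, which mirrors the remark that the dual string pointer rules are self-inverse.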
The names of these rules are due to their strong similarities with two of the three types of string rewriting rules of a specific model of gene assembly, called the string pointer reduction system (SPRS) [12]. In this model, gene assembly is performed by three types of recombination (splicing) operations that are subsequently modeled as string rewriting rules. For convenience we now recall these string rewriting rules.
For all $p, q \in \Pi$ with $p \neq q$, we define
• the string negative rule for $p$ by $\mathrm{snr}_p(u_1 p p u_2) = u_1 u_2$,
• the string positive rule for $p$ by $\mathrm{spr}_p(u_1 p u_2 \bar{p} u_3) = u_1 \bar{u}_2 u_3$,
• the string double rule for $p, q$ by $\mathrm{sdr}_{p,q}(u_1 p u_2 q u_3 p u_4 q u_5) = u_1 u_4 u_3 u_2 u_5$,
where $u_1, u_2, \ldots, u_5$ are arbitrary (possibly empty) strings over $\Pi$.
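The three SPRS rules recalled above admit an equally short sketch. As before, this is an illustration only: strings over $\Pi$ are assumed to be tuples of nonzero integers with $-p$ standing for $\bar{p}$, and the function names are hypothetical.

```python
from typing import Tuple

# Assumption: a pointer is a nonzero int, and -p encodes the barred pointer.
PointerString = Tuple[int, ...]


def invert(u: PointerString) -> PointerString:
    """Inversion of a string: reverse it and bar every pointer."""
    return tuple(-x for x in reversed(u))


def snr(p: int, u: PointerString) -> PointerString:
    """u1 p p u2 -> u1 u2 (the two occurrences of p must be adjacent)."""
    for i in range(len(u) - 1):
        if u[i] == p and u[i + 1] == p:
            return u[:i] + u[i + 2 :]
    raise ValueError("snr is not applicable")


def spr(p: int, u: PointerString) -> PointerString:
    """u1 p u2 -p u3 -> u1 inv(u2) u3."""
    if p not in u or -p not in u or u.index(p) > u.index(-p):
        raise ValueError("spr is not applicable")
    i, j = u.index(p), u.index(-p)
    return u[:i] + invert(u[i + 1 : j]) + u[j + 1 :]


def sdr(p: int, q: int, u: PointerString) -> PointerString:
    """u1 p u2 q u3 p u4 q u5 -> u1 u4 u3 u2 u5."""
    ps = [i for i, x in enumerate(u) if x == p]
    qs = [i for i, x in enumerate(u) if x == q]
    if len(ps) != 2 or len(qs) != 2 or not ps[0] < qs[0] < ps[1] < qs[1]:
        raise ValueError("sdr is not applicable")
    (i, k), (j, l) = ps, qs
    return u[:i] + u[k + 1 : l] + u[j + 1 : k] + u[i + 1 : j] + u[l + 1 :]


print(snr(2, (3, 2, 2, -3)))
print(spr(2, (2, 3, -2, 3)))
print(sdr(2, 3, (2, 4, 3, 4, 2, 5, 3, 5)))
```

Unlike the dual rules, each of these functions returns a strictly shorter tuple, since the pointers it acts on are removed.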
Notice the strong similarities between $\mathrm{dspr}$ and $\mathrm{spr}$, and between $\mathrm{dsdr}$ and $\mathrm{sdr}$. Both $\mathrm{dspr}_p$ and $\mathrm{spr}_p$ invert the substring between the two occurrences of $p$ or $\bar{p}$. However, $\mathrm{dspr}_p$ is applicable when $p$ is negative, while $\mathrm{spr}_p$ is applicable when $p$ is positive. Also, $\mathrm{spr}_p$ removes the occurrences of $p$ and $\bar{p}$, while $\mathrm{dspr}_p$ does not. A similar comparison can be made between $\mathrm{dsdr}$ and $\mathrm{sdr}$.
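The domain of a dual string pointer rule $\rho$, denoted by $\mathrm{dom}(\rho)$, is defined by $\mathrm{dom}(\mathrm{dspr}_p) = \{p\}$ and $\mathrm{dom}(\mathrm{dsdr}_{p,q}) = \{p, q\}$ for $p, q \in \Pi$. For a composition $\varphi = \rho_n \cdots \rho_2 \rho_1$ of such rules $\rho_1, \rho_2, \ldots, \rho_n$, the domain, denoted by $\mathrm{dom}(\varphi)$, is $\mathrm{dom}(\rho_1) \cup \mathrm{dom}(\rho_2) \cup \cdots \cup \mathrm{dom}(\rho_n)$. Also, we define $\mathrm{odom}(\varphi) = \bigoplus_{1 \leq i \leq n} \mathrm{dom}(\rho_i)$, where $\oplus$ denotes symmetric difference. Thus, $\mathrm{odom}(\varphi) \subseteq \mathrm{dom}(\varphi)$ consists of the pointers that are used an odd number of times. We call $\varphi$ reduced if every $p \in \mathrm{dom}(\varphi)$ is used exactly once, i.e., $\mathrm{dom}(\rho_i) \cap \mathrm{dom}(\rho_j) = \emptyset$ for all $1 \leq i < j \leq n$. Note that if $\varphi$ is reduced, then $\mathrm{dom}(\varphi) = \mathrm{odom}(\varphi)$.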
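The notions of domain, odd domain and reducedness can be illustrated with a small Python sketch. The encoding is an assumption made here: a rule is a tuple of integers, `(p,)` for $\mathrm{dspr}_p$ and `(p, q)` for $\mathrm{dsdr}_{p,q}$, and the domain is recorded without bars (via `abs`), so that it can be compared with subsets of $\mathrm{dom}(u)$.

```python
from typing import FrozenSet, Sequence, Tuple

# Assumption: (p,) encodes dspr_p and (p, q) encodes dsdr_{p,q}.
Rule = Tuple[int, ...]


def rule_dom(rule: Rule) -> FrozenSet[int]:
    """Domain of a single rule; bars are ignored (an assumption of this sketch)."""
    return frozenset(abs(x) for x in rule)


def dom(phi: Sequence[Rule]) -> FrozenSet[int]:
    """Union of the domains of all rules in the sequence."""
    return frozenset().union(*(rule_dom(r) for r in phi))


def odom(phi: Sequence[Rule]) -> FrozenSet[int]:
    """Symmetric difference of the domains: pointers used an odd number of times."""
    acc: FrozenSet[int] = frozenset()
    for r in phi:
        acc = acc ^ rule_dom(r)
    return acc


def is_reduced(phi: Sequence[Rule]) -> bool:
    """True if the domains of the rules are pairwise disjoint."""
    seen = set()
    for r in phi:
        d = rule_dom(r)
        if seen & d:
            return False
        seen |= d
    return True


phi = [(2,), (-2, 3), (4,)]
print(sorted(dom(phi)), sorted(odom(phi)), is_reduced(phi))
```

In this example pointer 2 is used twice, so it belongs to `dom(phi)` but not to `odom(phi)`, and the sequence is not reduced.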
Definition 30
Let $u$ and $v$ be legal strings. We say that $u$ and $v$ are dual, denoted by $u \approx_d v$, if there is a (possibly empty) sequence $\varphi$ of dual string pointer rules applicable to $u$ such that $\varphi(u) \approx v$.
Notice that $\approx_d$ is an equivalence relation. Clearly, $\approx_d$ is reflexive. It is symmetric since dual string pointer rules are self-inverse, and it is transitive by function composition: if $\varphi_1(u) \approx v$ and $\varphi_2(v) \approx w$, then $(\varphi_2 \varphi_1)(u) \approx w$.
Since $\mathrm{dspr}_p$ is applicable when $p$ is negative in $u$ and $\mathrm{dsdr}_{p,q}$ is applicable when $p$ and $q$ are positive and overlapping, the following result is a direct corollary to Lemma 28.
Corollary 31
Let $u$ be a legal string, and let $D \subseteq \mathrm{dom}(u)$ be nonempty. If $\mathrm{flip}_D(M_u) \in \mathrm{CON}_u$, then there is a dual string pointer rule $\rho$ with $\mathrm{dom}(\rho) \subseteq D$ applicable to $u$.
Let $G = \mathcal{B}(E_1, E_2, E_3)$ be an extended abstract reduction graph, and let $D \subseteq \mathrm{dom}(G)$. Then we define $\mathrm{flip}_D(G) = \mathcal{B}(E_1, E_2, \mathrm{flip}_{G',D}(E_3))$, where $G' = \mathcal{B}(E_1, E_2)$.
Lemma 32
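Let $u$ be a legal string, and let $\varphi$ be a sequence of dual string pointer rules applicable to $u$. Then $\mathcal{E}_{\varphi(u)} \approx \mathrm{flip}_D(\mathcal{E}_u)$ with $D = \mathrm{odom}(\varphi)$. Consequently, $\mathcal{R}_{\varphi(u)} \approx \mathcal{R}_u$.
Proof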
It suffices to prove the result for the case $\varphi = \mathrm{dspr}_p$ with $p \in \Pi$ and for the case $\varphi = \mathrm{dsdr}_{p,q}$ with $p, q \in \Pi$. We first prove the case where $\varphi = \mathrm{dspr}_p$ for some $p \in \Pi$ is applicable to $u$. Then by the second figure in the proof of Lemma 26 we see that the inversion of the substring between the two occurrences of $p$ in $u$ accomplished by $\varphi$ faithfully simulates the corresponding effect of $\mathrm{flip}_p$ on $\mathcal{E}_u$. We only need to verify that $p$ is negative in $\mathrm{flip}_p(\mathcal{E}_u)$. To do this, we depict $\mathcal{E}_u$ such that the vertices are represented by their identity instead of their label:
s · · · v1 v2 · · · v3 v4 · · · t
where the vertices $v_i$, $i \in \{1, 2, 3, 4\}$, are labelled by $p$. Then $\mathrm{flip}_p(\mathcal{E}_u)$ is
s · · · v1 v3 · · · v2 v4 · · · t
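Therefore $p$ is indeed negative in $\mathrm{flip}_p(\mathcal{E}_u)$, and consequently $\mathcal{E}_{\varphi(u)} \approx \mathrm{flip}_p(\mathcal{E}_u)$.
We now prove the case where $\varphi = \mathrm{dsdr}_{p,q}$ with $p, q \in \Pi$. Let $\mathcal{E}_u = \mathcal{B}(E_1, E_2, E_3)$, then $\mathcal{E}_u$ has the following form
[Figure: $\mathcal{E}_u$]
where we omitted the edges in $E_2$. Since $p$ and $q$ are positive in $u$, $\mathrm{flip}_{\{p,q\}}(\mathcal{E}_u)$ has the following form: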
s · · · p p · · · q q · · · p p · · · q q · · · t
where we again omitted the edges in $E_2$. Thus, we see that interchanging the substring in $u$ between $p$ and $q$ and the substring in $u$ between $\bar{p}$ and $\bar{q}$ accomplished by $\varphi$ faithfully simulates the corresponding effect of $\mathrm{flip}_{p,q}$ on $\mathcal{E}_u$. We only need to verify that both $p$ and $q$ are positive in $\mathrm{flip}_{p,q}(\mathcal{E}_u)$. To do this, we depict $\mathcal{E}_u$ such that the vertices are represented by their identity instead of their label:
s · · · v1 v2 · · · w1 w2 · · · v3 v4 · · · w3 w4 · · · t
where the vertices $v_i$ and $w_i$, $i \in \{1, 2, 3, 4\}$, are labelled by $p$ and $q$, respectively. Then $\mathrm{flip}_{p,q}(\mathcal{E}_u)$ is
s · · · v1 v4 · · · w3 w2 · · · v3 v2 · · · w1 w4 · · · t
Therefore both $p$ and $q$ are indeed positive in $\mathrm{flip}_{p,q}(\mathcal{E}_u)$, and consequently $\mathcal{E}_{\varphi(u)} \approx \mathrm{flip}_{p,q}(\mathcal{E}_u)$.
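Thus, if $\varphi_1$ and $\varphi_2$ are sequences of dual string pointer rules applicable to a legal string $u$ with $\mathrm{odom}(\varphi_1) = \mathrm{odom}(\varphi_2)$, then $\mathcal{E}_{\varphi_1(u)} \approx \mathcal{E}_{\varphi_2(u)}$ and thus $\varphi_1(u) \approx \varphi_2(u)$.
Lemma 33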
Let $u$ be a legal string, and let $D \subseteq \mathrm{dom}(u)$. There is a reduced sequence $\varphi$ of dual string pointer rules applicable to $u$ such that $\mathrm{dom}(\varphi) = D$ iff $\mathrm{flip}_D(M_u) \in \mathrm{CON}_u$.
Proof
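The forward implication follows directly from Lemma 32. We now prove the reverse implication. If $D = \emptyset$, we have nothing to prove. Let $D \neq \emptyset$. By Corollary 31, there is a dual string pointer rule $\rho_1$ with $D_1 = \mathrm{dom}(\rho_1) \subseteq D$ applicable to $u$. By Lemma 32, $\mathcal{E}_{\rho_1(u)} \approx \mathrm{flip}_{D_1}(\mathcal{E}_u)$ and $D_1 = \mathrm{odom}(\rho_1) = \mathrm{dom}(\rho_1)$. Thus, $\mathrm{flip}_{D \setminus D_1}(M_{\rho_1(u)}) \in \mathrm{CON}_{\rho_1(u)}$. Now by iteration, there is a reduced sequence $\varphi$ of dual string pointer rules applicable to $u$ such that $\mathrm{dom}(\varphi) = D$.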
It follows from Lemmata 32 and 33 that reduced sequences of dual string pointer rules are a normal form of sequences of dual string pointer rules. Indeed, by Lemma 32, if $\varphi$ is a sequence of dual string pointer rules applicable to a legal string $u$ with $D = \mathrm{odom}(\varphi)$, then $\mathrm{flip}_D(M_u) \in \mathrm{CON}_u$. By Lemma 33, there is a reduced sequence $\varphi'$ of dual string pointer rules applicable to $u$ such that $\mathrm{dom}(\varphi') = \mathrm{odom}(\varphi') = D$. By the paragraph below Lemma 32, we have $\varphi(u) \approx \varphi'(u)$.
We are now ready to prove the second (and final) main result of this chapter. It shows that the fiber $\mathcal{R}^{-1}(\mathcal{R}_u)$ for each legal string $u$ is the ‘orbit’ of $u$ under the dual string pointer rules. Hence, the partition of the set of all legal strings induced by the fibers under $\mathcal{R}$, and the one induced by $\approx_d$ coincide. Equivalently, the legal strings obtained from $u$ by applying dual string pointer rules are exactly those legal strings that have the same reduction graph as $u$ (up to isomorphism).
Theorem 34
Let $u$ and $v$ be legal strings. Then $\mathcal{R}_u \approx \mathcal{R}_v$ iff $u \approx_d v$.
Proof
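The reverse implication follows directly from Lemma 32. We now prove the forward implication. Let $\mathcal{R}_u \approx \mathcal{R}_v$. By Corollary 11, there is an $E \in \mathrm{CON}_u$ such that $\mathcal{E}_v \approx \mathcal{B}(E_1, E_2, E)$ with $\mathcal{R}_u = \mathcal{B}(E_1, E_2)$. By Theorem 15, $E = \mathrm{flip}_D(M_u)$ for some $D \subseteq \mathrm{dom}(u)$. Since $\mathrm{flip}_D(M_u) \in \mathrm{CON}_u$, by Lemma 33, there is a reduced sequence $\varphi$ of dual string pointer rules applicable to $u$ such that $\mathrm{dom}(\varphi) = D$. Now by Lemma 32, $\mathcal{E}_{\varphi(u)} \approx \mathrm{flip}_D(\mathcal{E}_u) \approx \mathcal{E}_v$, and therefore, by Theorem 10, $\varphi(u) \approx v$.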
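The orbit side of Theorem 34 can be explored experimentally with a breadth-first search. The sketch below is an illustration under assumptions made here: strings over $\Pi$ are tuples of nonzero integers with $-p$ standing for $\bar{p}$, the names `successors` and `orbit` are hypothetical, and strings are compared literally rather than modulo $\approx$. It does not construct reduction graphs, so it illustrates only how the orbit of a legal string under the dual string pointer rules can be enumerated.

```python
from collections import deque
from itertools import permutations
from typing import List, Set, Tuple

# Assumption: a pointer is a nonzero int, and -p encodes the barred pointer.
PointerString = Tuple[int, ...]


def invert(u: PointerString) -> PointerString:
    """Inversion of a string: reverse it and bar every pointer."""
    return tuple(-x for x in reversed(u))


def successors(u: PointerString) -> List[PointerString]:
    """All strings obtained from u by one applicable dual string pointer rule."""
    out = []
    symbols = set(u)
    # dspr_p: p occurs twice with the same sign; invert the part in between.
    for p in symbols:
        if u.count(p) == 2:
            i, j = [n for n, x in enumerate(u) if x == p]
            out.append(u[: i + 1] + invert(u[i + 1 : j]) + u[j:])
    # dsdr_{p,q}: pattern p .. q .. -p .. -q; swap the first and third gaps.
    for p, q in permutations(symbols, 2):
        if -p in symbols and -q in symbols:
            i, j, k, l = u.index(p), u.index(q), u.index(-p), u.index(-q)
            if i < j < k < l:
                out.append(
                    u[: i + 1] + u[k + 1 : l] + u[j : k + 1] + u[i + 1 : j] + u[l:]
                )
    return out


def orbit(u: PointerString) -> Set[PointerString]:
    """Breadth-first enumeration of everything reachable from u."""
    seen = {u}
    queue = deque([u])
    while queue:
        w = queue.popleft()
        for x in successors(w):
            if x not in seen:
                seen.add(x)
                queue.append(x)
    return seen


for w in sorted(orbit((2, 3, 2, 3))):
    print(w)
```

The search terminates because the rules preserve the length of the string and the set of pointers (up to bars), so only finitely many strings are reachable.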
4.12 Discussion
This chapter characterizes, letting $\mathcal{R}$ be the function that assigns to each legal string $u$ its reduction graph $\mathcal{R}_u$, the range of $\mathcal{R}$ (Theorem 24) and each fiber $\mathcal{R}^{-1}(\mathcal{R}_u)$ (Theorem 34) modulo graph isomorphism.
The first characterization corresponds to a computationally efficient algorithm that determines whether or not a graph $G$ is isomorphic to a reduction graph. Moreover, if this is the case, then the algorithm given below Theorem 24 allows for an efficient determination of a legal string $u$ such that $G \approx \mathcal{R}_u$. The first characterization relies on the notion of merge-legal edges and its flip operation introduced in this chapter. In particular, the connected components in the subgraph induced by the reality edges and the merge-legal edges and the flip operation turn out to be relevant in this context.
The second characterization determines, given $u$, the whole set $\mathcal{R}^{-1}(\mathcal{R}_u)$. It turns out that $\mathcal{R}^{-1}(\mathcal{R}_u)$ is the orbit of $u$ under the dual string pointer rules. Moreover, any two legal strings $u$ and $v$ in such a fiber can be transformed into each other by a sequence $\varphi$ of dual string pointer rules without using any pointer more than once. Therefore, the number of dual string pointer rules in $\varphi$ can be bounded by the size of the domain of $u$ (and $v$). Surprisingly, the dual string pointer rules are very similar to those used in a specific model of gene assembly called SPRS.
The second characterization has additional uses for SPRS. The reduction graph of a legal string $u$ in a certain sense retains all information regarding applicability of string negative rules (defined in SPRS) in transformations of $u$ to its end result, while discarding almost all other information regarding the applicability of the other rules, see [4]. Therefore, the fibers in a sense provide the equivalence classes of legal strings having the same properties regarding the application of string negative rules.
From a biological point of view, the first characterization provides requirements on the structure of MAC genes, while the second characterization determines which types of MIC genes obtain the same MAC structure.
How Overlap Determines Reduction Graphs for Gene Assembly
Abstract
Ciliates are unicellular organisms having two types of functionally different nuclei: micronucleus and macronucleus. Gene assembly transforms a micronucleus into a macronucleus, thereby transforming each gene from its micronuclear form to its macronuclear form. Within a formal intramolecular model of gene assembly based on strings, the notion of reduction graph represents the macronuclear form of a gene, including byproducts, given only a description of the micronuclear form of that gene. For a more abstract model of gene assembly based on graphs, one cannot, in general, define the notion of reduction graphs. We show that if we restrict ourselves to the so-called realistic overlap graphs (which correspond to genes occurring in nature), then the notion of reduction graph can be defined in a manner equivalent to the string model. This allows one to carry over from the string model to the graph model several results that rely on the notion of reduction graph.
5.1 Introduction
Gene assembly is a process that takes place in unicellular organisms called ciliates, which have two types of functionally different nuclei: micronucleus (MIC) and macronucleus (MAC). Gene assembly transforms the genome of the MIC into the genome of the MAC. The two genomes are dramatically different in both the global form of their chromosomes and in the local form of single genes. During gene assembly each gene in its MIC form gets transformed into the same gene in its MAC form. See [12] for a detailed account of the biology of gene assembly.
In this chapter we consider only intramolecular models of gene assembly; thus we do not consider the intermolecular models initiated by Landweber and Kari [20], and further developed by Daley et al. [10, 9]. Among the formal models of intramolecular gene assembly the string pointer reduction system (SPRS) and the graph pointer reduction system (GPRS), see [12], are of interest for this chapter. The SPRS consists of three types of string rewriting rules operating on so-called legal strings, while the GPRS consists of three types of graph rewriting rules operating on so-called overlap graphs. The GPRS is an abstraction of the SPRS: some information present in the SPRS is lost in the GPRS.
Realistic strings are strings that represent genes in their micronuclear form. Legal strings are an abstraction of realistic strings. The reduction graph, which is defined for legal strings, is a notion that describes the gene corresponding to the legal string in its macronuclear form (along with its waste products: the substrings “spliced out” in the process) – it is unique for a given legal string. It has been shown that the reduction graph retains the information needed to characterize which string negative rules (one of the three types of string rewriting rules) can be used during the transformation of a MIC form of a gene to its MAC form [6, 4]. Therefore it would be useful to have a notion of the reduction graph also for the GPRS. However, this is not so straightforward. We will demonstrate that, since the GPRS loses some information concerning the application of string negative rules, in general there is no unique reduction graph for a given overlap graph, cf. Example 6. However, as we will show, when we restrict ourselves to “realistic” overlap graphs then one gets a unique reduction graph. These overlap graphs are called realistic since they correspond to (micronuclear) genes. In this chapter, we explicitly define the notion of reduction graph for realistic overlap graphs (within the GPRS) and show that it is equivalent to the notion of reduction graph for legal strings (within the SPRS). Finally, we give a number of direct corollaries of this equivalence, including an answer to an open problem formulated in Chapter 13 in [12].
In Section 5.2 we recall some basic notions and notation concerning sets, strings and graphs. In Section 5.3 we recall notions used in models for gene assembly, such as legal strings, realistic strings and overlap graphs. In Section 5.4 we recall the notion of reduction graph within the framework of SPRS and we prove some elementary properties of this graph for legal strings. In particular we establish a calculus for the sets of overlapping pointers between vertices of the reduction graph. In Section 5.5 we prove properties of the reduction graph for a more restricted type of legal strings, the realistic strings. It is shown that reduction graphs of realistic strings have a subgraph of a specific structure, the root subgraph. Moreover, we show (using the calculus from Section 5.4) that the existence of the other edges in the reduction graph depends directly on the overlap graph. In Section 5.6 we provide a convenient function for reduction graphs that allows one to simplify reduction graphs without losing any information. In Section 5.7 we define the reduction graph for realistic overlap graphs, and prove the main theorem of this chapter: the equivalence of reduction graphs defined for realistic strings with the reduction graphs defined for realistic overlap graphs. In Section 5.8 we discuss some immediate consequences of this theorem. A conference version of this chapter, which does not contain any proofs, was presented at FCT ’07 [7].