Mapping strings - Restricted inputs - Programming Using Automata and Transducers

5.6 Restricted inputs

5.6.1 Mapping strings

Suppose we restrict the inputs to contain only internal symbols, that is, strings over Σ. Then the STT cannot use its stack, and we can assume that the set P of stack symbols is the empty set. This restricted transducer can still map strings to nested words (or trees) over Γ with interesting hierarchical structure, and hence is called a string-to-tree transducer. This leads to the following definition: a streaming string-to-tree transducer (SSTT) S from input alphabet Σ to output alphabet Γ consists of a finite set of states Q; an initial state q₀ ∈ Q; a finite set of typed variables X together with a conflict relation η; a partial output function F : Q ↦ E₀(X, Γ) such that for each state q, a variable x appears at most once in F(q); a state-transition function δ : Q × Σ ↦ Q; and a variable-update function ρ : Q × Σ ↦ A(X, X, η, Γ). Configurations of such a transducer are of the form (q, α), where q ∈ Q is a state and α is a type-consistent valuation for the variables X. The semantics ⟦S⟧ of such a transducer is a partial function from Σ* to W₀(Γ). We notice that in this setting the copyless restriction is enough to capture MSO completeness, since the model is closed under regular look-ahead (i.e. a reflexive η is enough).

Theorem 5.16 (Copyless SSTTs are closed under RLA). A string-to-tree transduction is definable by an SSTT iff it is definable by a copyless SSTT.

Proof. The ⇐ direction is immediate. For the ⇒ direction, using Theorem 5.9 we consider the input to be a copyless SSTT with regular look-ahead. Given a DFA A = (R, r₀, δ_A) over the alphabet Σ, and a copyless string-to-tree STT S = (Q, q₀, X, F, δ, ρ) over R, we construct an equivalent copyless multi-parameter STT S′ = (Q′, q′₀, X′, Π, ϕ, F′, δ′, ρ′) over Σ. Thanks to Theorem 5.11 this implies the existence of a copyless SSTT.

Auxiliary notions. Given a finite set U, we inductively define the following sets:

F(U): the set of forests over U, defined as follows: ε ∈ F(U), and if s₀, …, s_n ∈ U and f₀, …, f_n ∈ F(U), then s₀(f₀) … s_n(f_n) ∈ F(U);

SF(f′): given a forest f′ ∈ F(U), the set of sub-forests of f′, for short SF(f′), is defined as follows:

• if f′ ≡ s′₀(t′₀) … s′_m(t′_m), and there exists i ≤ m such that f ∈ SF(t′_i), then f ∈ SF(f′);

• the empty forest ε belongs to SF_M(f′);

• let f ≡ s₀(t₀) … s_n(t_n) and f′ ≡ s′₀(t′₀) … s′_m(t′_m). If there exists 0 ≤ i ≤ m − n such that s₀ … s_n = s′_i … s′_{i+n}, and for every 0 ≤ j ≤ n, t_j ∈ SF_M(t′_{i+j}), then f ∈ SF(f′).

Intuitively, the first rule in the definition of SF allows moving down in the forest until an exact match in SF_M is found by the other rules. Given two forests f₁, f₂ ∈ F(U), we write S(f₁, f₂) ≡ SF(f₁) ∩ SF(f₂) for the set of shared sub-forests of f₁ and f₂. Finally, the set of maximal shared sub-forests is defined as

M(t₁, t₂) = {f | f ∈ S(t₁, t₂) ∧ ¬∃f′ ∈ S(t₁, t₂). f′ ≠ f ∧ f ∈ SF(f′)}.

State components and invariants. The transition of the STT S at a given step depends on the state of A after reading the reverse of the suffix. Since the STT S′ cannot determine this value based on the prefix, it needs to simulate S for every possible choice. We denote by X_i the type-i variables of X. Every state q ∈ Q′ is a tuple (l, f, g) where

• l : R ↦ R keeps track, for every possible state r ∈ R, of what would be the state of A after processing the string REV(w), where w is the string read so far;

• f : R ↦ Q keeps track, for every possible state r ∈ R, of what would be the state of S after processing the string w_r, where w is the string read so far;

• g : (R × X) ↦ F(X′ ∪ {?}) keeps track of how the variables X′ of S′ need to be combined in order to obtain the value of a variable of S.

State summarization invariants. We first discuss the invariants of the first two components l and f of a state, and how they are preserved by the transition function δ′ of S′. After reading a word w, S′ is in state (l, f, g) where

• for every look-ahead state r ∈ R, l(r) = r′, where δ*_A(r, REV(w)) = r′;

• for every look-ahead state r ∈ R, f(r) = q, where δ*(q₀, w_r) = q.

At the beginning S′ is in state (l₀, f₀, g₀), where for every r ∈ R, l₀(r) = r and f₀(r) = q₀. The component g₀ is discussed later.

Next we describe the transition function δ′. We assume S′ to be in state (l, f, g) and to be reading the input symbol a ∈ Σ; we denote with l′, f′ the new values of the state components, and we only write the parts that change. For every look-ahead state r ∈ R, if δ_A(r, a) = r′, f(r) = q, and δ(q, r′) = q′, then l′(r) = l(r′) and f′(r) = q′.
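The update of the components l and f can be illustrated with a minimal Python sketch. It transcribes the rule stated above literally; the dictionary encoding of δ_A and δ, the helper run_dfa, and the three-state toy automaton are assumptions made for illustration and do not come from the construction itself.

```python
def run_dfa(delta_A, r, word):
    """Extended transition function: state of A after reading word from r."""
    for symbol in word:
        r = delta_A[(r, symbol)]
    return r


def step_summaries(l, f, delta_A, delta, a):
    """Return (l', f') after reading input symbol a, following the rule in the text."""
    new_l, new_f = {}, {}
    for r in l:
        r_next = delta_A[(r, a)]        # r' = delta_A(r, a)
        q = f[r]                        # q  = f(r)
        new_l[r] = l[r_next]            # l'(r) = l(r')
        new_f[r] = delta[(q, r_next)]   # f'(r) = delta(q, r')
    return new_l, new_f


# Hypothetical toy instance: a three-state look-ahead DFA A and a one-state S.
R = [0, 1, 2]
delta_A = {(0, "a"): 1, (1, "a"): 1, (2, "a"): 0,
           (0, "b"): 2, (1, "b"): 0, (2, "b"): 2}
delta = {("q", r): "q" for r in R}

l = {r: r for r in R}       # l_0(r) = r
f = {r: "q" for r in R}     # f_0(r) = q_0
for a in "ab":
    l, f = step_summaries(l, f, delta_A, delta, a)

# Invariant on l: l(r) is the state of A after reading REV("ab") = "ba" from r.
print(l == {r: run_dfa(delta_A, r, "ba") for r in R})
```

On this toy instance the final check compares the incrementally maintained l with a direct run of A on the reversed input, which is the invariant the first state component is meant to preserve.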

Variable summarization. Next, we describe how S′ keeps track of the variable values. The natural approach for this problem would be that of keeping, for each state r ∈ R and variable x ∈ X, a variable g(r, x) containing the value of x in S, assuming the prefix read so far was read by A starting in state r. This natural approach, however, would cause the machine not to be copyless. Consider, for example, the following scenario. Let r, r₁ and r₂ be look-ahead states in R such that, for some a ∈ Σ, δ_A(r₁, a) = δ_A(r₂, a) = r. Assume S only has one state q ∈ Q and one variable x ∈ X. If S updates ρ(q, r, x) to x, in order to perform the corresponding update in S′ we would have to assign g(r, x) to both g(r₁, x) and g(r₂, x), and this assignment is not copyless.

Our solution to this problem relies on a symbolic representation of the update and a careful analysis of sharing. In the previous example, a possible way to represent such an update is by storing the content of g(r, x) into a variable z, and then remembering in the state the fact that both g(r₁, x) and g(r₂, x) now contain z as a value. In the construction, the above update is reflected by updating the state, without touching the variable values.
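The difference between the two approaches can be made concrete with a small Python sketch. It assumes the usual reading of copylessness for a reflexive conflict relation (every variable occurs at most once across all right-hand sides of an update); the function is_copyless and the string names for variables are hypothetical and serve only as an illustration.

```python
def is_copyless(update):
    """update maps each assigned variable to the list of variables on its
    right-hand side; copyless means no variable is read more than once."""
    used = [v for rhs in update.values() for v in rhs]
    return len(used) == len(set(used))


# Naive approach: the value of g(r, x) is copied into two variables.
naive = {"g(r1,x)": ["g(r,x)"], "g(r2,x)": ["g(r,x)"]}

# Symbolic approach: the value is moved once into z, and the sharing is
# recorded in the (finite) state as a shape for each pair (r, x).
symbolic = {"z": ["g(r,x)"]}
shape = {("r1", "x"): ["z"], ("r2", "x"): ["z"]}

print(is_copyless(naive), is_copyless(symbolic))
```

In the symbolic version the two entries of shape refer to the same variable z, but this is bookkeeping in the state rather than a variable assignment, so no value is duplicated.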

The set of variables X′ contains |R|(4|X₀||R|) type-0 variables and |R|(4|X₁||R|) type-1 variables. The set of parameters Π of S′ is {π_i | 0 ≤ i ≤ 4|X||R|}. We will show later how these numbers are obtained.

Variables semantics and invariants. We next describe how we can recover the value of a variable in X from the corresponding shape function g(x). Intuitively, the value of a variable x is recovered by merging together the variables appearing in the shape of x. We call this operation unrolling.

We define the unrolling u : F(X′) ↦ E(X′, Σ) of a symbolic variable representation as follows. Given a forest f = s₀(f₀) … s_n(f_n) ∈ F(X′), the unrolling of f is defined as u(f) ≡ u_t(s₀, f₀) … u_t(s_n, f_n), where for every s ∈ X′ and forest g₀ … g_m ∈ F(X′),

u_t(s, g₀ … g_m) ≡ s[π₀ ↦ u(g₀), …, π_m ↦ u(g_m)].
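The unrolling can be sketched in Python as follows. The encoding is an assumption made for illustration: a forest is a list of (symbol, children) pairs, the resulting expression is rendered as a string, a node without children is rendered as the bare symbol, and the hole ? is treated as an ordinary leaf.

```python
from typing import List, Tuple

Tree = Tuple[str, "Forest"]   # (symbol, child forest)
Forest = List[Tree]


def unroll(forest: Forest) -> str:
    """u(f): concatenate the unrollings of the trees of the forest."""
    return " ".join(unroll_tree(symbol, kids) for symbol, kids in forest)


def unroll_tree(symbol: str, kids: Forest) -> str:
    """u_t(s, g_0 ... g_m): substitute u(g_i) for the parameter pi_i of s."""
    if not kids:
        return symbol
    bindings = ", ".join(
        f"π{i} ↦ {unroll([kid])}" for i, kid in enumerate(kids)
    )
    return f"{symbol}[{bindings}]"


# Hypothetical shape z3(z5(?)) followed by a leaf z7.
shape = [("z3", [("z5", [("?", [])])]), ("z7", [])]
print(unroll(shape))   # z3[π0 ↦ z5[π0 ↦ ?]] z7
```

The i-th child of a node is bound to the parameter π_i of that node's variable, which is why the number of parameters has to cover the maximal number of children a shape node can have.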

After reading the i-th symbol a_i of an input word w, S′ is in a configuration ((l, f, g), α) iff for every look-ahead state r ∈ R and variable x ∈ X, if δ*(q₀, REV((a₁ … a_i)_r)) = (q, α₁) and u(g(r, x)) = s, then α₁(x) = α(s).

Counting argument invariants. Next, we describe how we keep the shape function g compact, allowing us to use a finite number of variables while updating them in a copyless manner. The shape function g maintains the following invariants.

Single-Use: each shape g(r, x) is repetition-free: no variable x′ ∈ X′ appears twice in g(r, x).

Sharing Bound: for all states r, r′ ∈ R, ∑_{x,y ∈ X} |M(g(r, x), g(r′, y))| ≤ |X|.

Hole Placement: for every type-1 variable x ∈ X₁ and state r ∈ R, there exists exactly one occurrence of ? in g(r, x), and it does not have any children.

Horizontal compression: for every f f′ ∈ SF(g(r, x)) such that ? ∉ f f′, there must be a shape g(r′, x′), with (r′, x′) ≠ (r, x), such that either f ∈ SF(g(r′, x′)) and f f′ ∉ SF(g(r′, x′)), or f′ ∈ SF(g(r′, x′)) and f f′ ∉ SF(g(r′, x′)).

Vertical compression: for every s(f) ∈ SF(g(r, x)) such that ? ∉ s(f), there must be a shape g(r′, x′), with (r′, x′) ≠ (r, x), such that either s() ∈ SF(g(r′, x′)) and s(f) ∉ SF(g(r′, x′)), or f ∈ SF(g(r′, x′)) and s(f) ∉ SF(g(r′, x′)).

The first invariant ensures the bounded size of shapes. The second invariant limits the amount of sharing between shapes, and it implies that, for each r and for each x ≠ y, g(r, x) and g(r, y) are disjoint. The second invariant also implies that, for every state r, the trees g(r, x), for all x cumulatively, can have a total of |X||R| maximal shared sub-forests with respect to all other strings. The third invariant says that the tree of each type-1 variable contains exactly one hole and that this hole appears as a leaf. This helps us deal with variable substitution. The fourth and fifth compression invariants guarantee that the shapes use only the minimum necessary number of variables. Together they imply that the sum ∑_{x ∈ X} |g(r, x)| is bounded by 4|X||R|. This is due to the fact that a shape can be updated only in three ways: 1) on the left, 2) on the right, and 3) below the ?. As a result it suffices to have |R|(4|X||R|) variables in X′.

Variable and state updates. Next we show how g and the variables in X′ are updated and initialized.

The initial value g₀ in the initial state, and each variable in X′, are defined as follows: let X′₀ = {z₁, …, z_k}, X′₁ = {z′₁, …, z′_k}, X₀ = {x₁, …, x_i}, X₁ = {x′₁, …, x′_j}. For each type-0 variable x_i ∈ X₀, for each type-1 variable x′_j ∈ X₁, and look-ahead state r ∈ R, we have that g₀(r, x_i) = z_i, z_i = ε, g₀(r, x′_j) = z′_j(?), and z′_j = π₀.
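A minimal Python sketch of this initialization follows. The encoding is assumed for illustration: a shape is a list of (symbol, children) pairs, variable values are kept as strings, and the names z1, z'1 and the function initial_shapes are hypothetical.

```python
def initial_shapes(R, X0, X1):
    """Build g_0 and the initial values of the variables of S'.

    R  -- look-ahead states
    X0 -- type-0 variables of S
    X1 -- type-1 variables of S
    """
    shapes, values = {}, {}
    for i, x in enumerate(X0, start=1):
        z = f"z{i}"
        values[z] = "ε"                          # z_i = ε
        for r in R:
            shapes[(r, x)] = [(z, [])]           # g_0(r, x_i) = z_i
    for j, x in enumerate(X1, start=1):
        z = f"z'{j}"
        values[z] = "π0"                         # z'_j = π0
        for r in R:
            shapes[(r, x)] = [(z, [("?", [])])]  # g_0(r, x'_j) = z'_j(?)
    return shapes, values


shapes, values = initial_shapes(R=["r0", "r1"], X0=["x"], X1=["y"])
print(shapes[("r0", "y")], values["z'1"])
```

Note that in this sketch the shapes of a given variable coincide for all look-ahead states, so initially every variable of S′ that is in use is shared across all r ∈ R.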

We assume S′ to be in state (l, f, g) and to be reading the input symbol a ∈ Σ. We denote with g′ the new value of g. We assume we are given a look-ahead state r ∈ R such that δ_A(r, a) = r′, and that x and y are the variables to be updated.

{x := w}: where, without loss of generality, w = ⟨a ? b⟩. We first assume there exists an unused variable z_f. We update g′(r, x) = z_f and set z_f := ⟨a ? b⟩. Since the counting invariants are preserved, there must have existed an unused variable z_f.

{x := xy, y := ε}: we perform the following update: g′(r, x) = g(r′, x) g(r′, y). Let g(r′, x) = s₁(f₁) … s_n(f_n) and g(r′, y) = s′₁(f′₁) … s′_m(f′_m). We now have two possibilities:

• there exists (r₁, x₁) ≠ (r, x) such that g(r₁, x₁) contains s_n or s′₁, but not s_n s′₁; or

• there does not exist (r₁, x₁) ≠ (r, x) such that g(r₁, x₁) contains s_n or s′₁, but not s_n s′₁. In this case we can compress: if f_n = t₁ … t_i and f′₁ = t′₁ … t′_k, then s_n := s_n s′₁[π₀ ↦ π_i, …, π_k ↦ π_{k+i}] and g′(r, x) = s₁(f₁) … s_n(f_n f′₁) s′₂(f′₂) … s′_m(f′_m). In this new assignment to s_n, the parameters in s′₁ have been shifted by i to reflect the concatenation with s_n.

In both cases, due to the preserved counting invariant, we can take an unused variable z_f and use it to update g(r, y): z_f := ε, and g′(r, y) = z_f.

{x := x[y], y := ?}: without loss of generality let x and y be type-1 variables. Let s(?) be the subtree of g(r′, x) containing ?. We perform the following update: g′(r, x) = g(r′, x){?/g(r′, y)}, where a{b/c} replaces the node b of a with c. Let g(r′, y) = s′₁(f′₁) … s′_m(f′_m). For every s_l, we now have two possibilities:

• there exists p ≤ m, and (r₁, x₁) ≠ (r, x), such that g(r₁, x₁) contains s or s′_p, but not s(s′_p); or

• there does not exist p ≤ m, and (r₁, x₁) ≠ (r, x), such that g(r₁, x₁) contains s or s′_p; in this case we can compress: assume f′_p = t₁ … t_k, then s := s[π_i ↦ s′_p[π₀ ↦ π_i, …, π_k ↦ π_{k+i}], π_{i+1} ↦ π_{i+1+k}, …, π_{n+k+1}], and g′(r, x) = g′(r, x){s(s′_p)/s}.

In both cases, due to the preserved counting invariant, we can take an unused variable z_f and use it to update g(r, y): z_f := π₀, and g′(r, y) = z_f(?). Figure 5.3 shows an example of such an update.

{x := y, y := x}: we symbolically reflect the swap: g′(r, x) = g(r′, y) and g′(r, y) = g(r′, x). Similarly to before, we compress if necessary.

By inspection of the variable assignments, it is easy to see that S′ is copyless.

Figure 5.3 shows an example of an update of the form x := x[y].

[Figure 5.3: the shapes g(r₁, x), g(r₂, x), g(r₁, y), g(r₂, y) before the update x := x[y], y := ?, and the shapes g′(r₁, x), g′(r₂, x), g′(r₁, y), g′(r₂, y) after it, together with the variable updates z₂ := z₂[π₀ ↦ π₀π₁], z₃ := z₃[π₀ ↦ z₅], z₉ := z₉[π₀ ↦ z₁₀], z₅ := π₀, z₁₀ := π₀.]

FIGURE 5.3: Example of symbolic variable assignment.

In the figure, each […] reading the symbol a such that δ_A(r₁, a) = r₁ and δ_A(r₂, a) = r₁. Before reading a (on the left), the variables z₂, z₄, z₆, and z₇ are shared between the two representations of the variables at r₁ and r₂. In the new shape g′ the hole ? in g(r₁, x) (respectively g(r₂, x)) is replaced by g(r₁, y) (respectively g(r₂, y)). However, since the sequence z₃z₅ (respectively z₉z₁₀) is not shared, we can compress it into a single variable z₃ (respectively z₉), and reflect such a compression in the variable update z₃ := z₃[π₀ ↦ z₅] (respectively z₉ := z₉[π₀ ↦ z₁₀]). Now the variable z₅ (respectively z₁₀) is unused and we can therefore use it to update g′(r₁, y) (respectively g′(r₂, y)).

The output function F′ of S′ simply applies the unrolling function. For example, let us assume S′ ends in state (l, f, g) ∈ Q′, with l(r₀) = r, f(r) = q, and F(q) = xy. We have that F′(l, f, g) = u(g(r₀, x) g(r₀, y)). This concludes the proof.
