5.6 Restricted inputs
5.6.1 Mapping strings
Suppose we restrict the inputs to contain only internal symbols, that is, strings over Σ. Then the STT cannot use its stack, and we can assume that the set P of stack symbols is the empty set. This restricted transducer can still map strings to nested words (or trees) over Γ with interesting hierarchical structure, and hence is called a string-to-tree transducer. This leads to the following definition: a streaming string-to-tree transducer (SSTT) from input alphabet Σ to output alphabet Γ consists of a finite set of states Q; an initial state q0 ∈ Q; a finite set of typed variables X together with a conflict relation η; a partial output function F : Q ↦ E0(X, Γ) such that for each state q, a variable x appears at most once in F(q); a state-transition function δ : Q × Σ ↦ Q; and a variable-update function ρ : Q × Σ ↦ A(X, X, η, Γ). Configurations of such a transducer are of the form (q, α), where q ∈ Q is a state and α is a type-consistent valuation for the variables X. The semantics [[S]] of such a transducer is a partial function from Σ∗ to W0(Γ). We notice that in this setting the copyless restriction is enough to capture MSO completeness, since the model is closed under regular look-ahead (i.e., a reflexive η is enough).
Theorem 5.16 (Copyless SSTTs are closed under RLA). A string-to-tree transduction is
definable by an SSTT iff it is definable by a copyless SSTT.
Proof. The ⇐ direction is immediate. For the ⇒ direction, using Theorem 5.9 we consider the input to be a copyless SSTT with regular look-ahead. Given a DFA A = (R, r0, δA) over the alphabet Σ, and a copyless string-to-tree STT S = (Q, q0, X, F, δ, ρ) over R, we construct an equivalent copyless multi-parameter STT S′ = (Q′, q′0, X′, Π, ϕ, F′, δ′, ρ′) over Σ. Thanks to Theorem 5.11 this implies the existence of a copyless SSTT.
Auxiliary notions. Given a finite set U, we inductively define the following sets:
F(U): the set of forests over U, defined as: ε ∈ F(U), and if s0, ..., sn ∈ U and f0, ..., fn ∈ F(U), then s0(f0) ... sn(fn) ∈ F(U);
SF(f′): given a forest f′ ∈ F(U), the set of sub-forests of f′, for short SF(f′), is defined as follows:
• if f′ ≡ s′0(t′0) ... s′m(t′m), and there exists i ≤ m such that f ∈ SF(t′i), then f ∈ SF(f′);
• the empty forest ε belongs to SF_M(f′);
• let f ≡ s0(t0) ... sn(tn) and f′ ≡ s′0(t′0) ... s′m(t′m). If there exists 0 ≤ i ≤ m − n such that s0 ... sn = s′i ... s′i+n, and for every 0 ≤ j ≤ n, tj ∈ SF_M(t′i+j), then f ∈ SF(f′).
Intuitively, the first rule in the definition of SF allows us to move down in the forest until an exact match in SF_M is found by the other rules. Given two forests f1, f2 ∈ F(U), we write S(f1, f2) ≡ SF(f1) ∩ SF(f2) for the set of shared sub-forests of f1 and f2. Finally, the set of maximal shared sub-forests is defined as
M(t1, t2) = {f | f ∈ S(t1, t2) ∧ ¬∃f′ ∈ S(t1, t2). f′ ≠ f ∧ f ∈ SF(f′)}.
State components and invariants. The transition of the STT S at a given step depends on the state of A after reading the reverse of the suffix. Since the STT S′ cannot determine this value based on the prefix, it needs to simulate S for every possible choice. We denote by Xi the type-i variables of X. Every state q ∈ Q′ is a tuple (l, f, g) where
• l : R ↦ R keeps track, for every possible state r ∈ R, of what would be the state of A after processing the string REV(w), where w is the string read so far;
• f : R ↦ Q keeps track, for every possible state r ∈ R, of what would be the state of S after processing the string wr, where w is the string read so far;
• g : (R × X) ↦ F(X′ ∪ {?}) keeps track of how the variables X′ of S′ need to be combined in order to obtain the value of a variable of S.
State summarization invariants. We first discuss the invariants of the first two components l and f of a state, and how they are preserved by the transition function δ′ of S′. After reading a word w, S′ is in state (l, f, g) where
• for every look-ahead state r ∈ R, l(r) = r′, where δ∗A(r, REV(w)) = r′;
• for every look-ahead state r ∈ R, f(r) = q, where δ∗(q0, wr) = q.
At the beginning S′ is in state (l0, f0, g0), where for every r ∈ R, l0(r) = r and f0(r) = q0. The component g0 is discussed later.
Next we describe the transition function δ′. We assume S′ to be in state (l, f, g), and to be reading the input symbol a ∈ Σ; we denote with l′, f′ the new values of the state components, and we only write the parts that change. For every look-ahead state r ∈ R, if δA(r, a) = r′, f(r) = q, and δ(q, r′) = q′, then l′(r) = l(r′) and f′(r) = q′.
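The bookkeeping for the component l can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not part of the construction itself: the DFA A is encoded as a dictionary `delta_a` keyed by (state, symbol), states are integers, and the names `run_reverse`, `step_summary`, and `summarize` are hypothetical helpers introduced only for this example. The sketch covers l only; it compares the incremental rule l′(r) = l(δA(r, a)) with running A directly on REV(w).

```python
from typing import Dict, Iterable, Tuple

State = int
Delta = Dict[Tuple[State, str], State]  # assumed encoding of delta_A


def run_reverse(delta_a: Delta, r: State, w: str) -> State:
    """Reference semantics: state of A, started in r, after reading REV(w)."""
    for a in reversed(w):
        r = delta_a[(r, a)]
    return r


def step_summary(l: Dict[State, State], delta_a: Delta, a: str) -> Dict[State, State]:
    """One input symbol: l'(r) = l(delta_A(r, a)) for every look-ahead state r."""
    return {r: l[delta_a[(r, a)]] for r in l}


def summarize(states: Iterable[State], delta_a: Delta, w: str) -> Dict[State, State]:
    """Fold step_summary over w, starting from the identity l_0(r) = r."""
    l = {r: r for r in states}
    for a in w:
        l = step_summary(l, delta_a, a)
    return l


# Hypothetical three-state DFA over {a, b}, used only for illustration.
STATES = [0, 1, 2]
DELTA_A: Delta = {
    (0, "a"): 1, (1, "a"): 2, (2, "a"): 0,
    (0, "b"): 0, (1, "b"): 0, (2, "b"): 2,
}

summary = summarize(STATES, DELTA_A, "ab")
```

The summary is computed left to right, one symbol at a time, even though it describes runs of A on the reversed prefix; this is what lets a streaming machine carry it in its finite state.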
Variable summarization. Next, we describe how S′ keeps track of the variable values. The natural approach to this problem would be to keep, for each state r ∈ R and variable x ∈ X, a variable g(r, x) containing the value of x in S, assuming the prefix read so far was read by A starting in state r. This natural approach, however, would cause the machine not to be copyless. Consider, for example, the following scenario. Let r, r1 and r2 be look-ahead states in R such that, for some a ∈ Σ, δA(r1, a) = δA(r2, a) = r. Assume S has only one state q ∈ Q and one variable x ∈ X. If S updates ρ(q, r, x) to x, then in order to perform the corresponding update in S′ we would have to assign g(r, x) to both g(r1, x) and g(r2, x), and this assignment is not copyless.
Our solution to this problem relies on a symbolic representation of the update and a careful analysis of sharing. In the previous example, a possible way to represent such an update is by storing the content of g(r, x) into a variable z, and then remembering in the state the fact that both g(r1, x) and g(r2, x) now contain z as a value. In the construction, the above update is reflected by updating the state, without touching the variable values.
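The scenario above can be illustrated with a small Python sketch. It rests on simplifying assumptions that are not in the text: shapes are encoded as lists of (variable name, child forest) pairs, variable values are plain strings rather than nested words, δA(r, a) = r is assumed so that the step is total, and `symbolic_identity_step` is a hypothetical helper name. The point is only that the update x := x is reflected in the shapes kept in the state, while the store of variable values is left alone.

```python
from typing import Dict, List, Tuple

Tree = Tuple[str, list]                 # (name of a variable of S', child forest)
Forest = List[Tree]
Shapes = Dict[Tuple[str, str], Forest]  # (look-ahead state, variable of S) -> shape


def symbolic_identity_step(
    g: Shapes, delta_a: Dict[Tuple[str, str], str], a: str
) -> Shapes:
    """Reflect the update x := x for every x: g'(r, x) = g(delta_A(r, a), x).

    Only the shapes stored in the state change; no variable value is read or written.
    """
    return {(r, x): g[(delta_a[(r, a)], x)] for (r, x) in g}


# delta_A(r1, a) = delta_A(r2, a) = r as in the text; delta_A(r, a) = r is assumed.
delta_a = {("r", "a"): "r", ("r1", "a"): "r", ("r2", "a"): "r"}
store = {"z": "abc", "z1": "", "z2": ""}  # values of the variables of S'
g = {
    ("r", "x"): [("z", [])],
    ("r1", "x"): [("z1", [])],
    ("r2", "x"): [("z2", [])],
}

store_before = dict(store)
g_new = symbolic_identity_step(g, delta_a, "a")
```

After the step, both g_new[("r1", "x")] and g_new[("r2", "x")] refer to z, yet the value of z was never duplicated in the store.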
The set of variables X′ contains |R|(4|X0||R|) type-0 variables and |R|(4|X1||R|) type-1 variables. The set of parameters Π of S′ is {πi | 0 ≤ i ≤ 4|X||R|}. We will show later how these numbers are obtained.
Variables semantics and invariants. We next describe how we can recover the value of a variable in X from the corresponding shape function g(x). Intuitively, the value of a variable x is recovered by merging together the variables appearing in the shape of x. We call this operation unrolling.
We define the unrolling u : F(X′) ↦ E(X′, Σ) of a symbolic variable representation as follows. Given a forest f = s0(f0) ... sn(fn) ∈ F(X′), the unrolling of f is defined as u(f) ≡ ut(s0, f0) ... ut(sn, fn), where for every s ∈ X, and g0 ... gm ∈ X∗, ut(s, g0 ... gm) ≡ s[π0 ↦ u(g0), ..., πm ↦ u(gm)].
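The unrolling can be sketched in Python as below. This is an illustration under assumptions that are not in the text: a shape is a list of (variable name, child forest) pairs, the value of a variable of S′ is a hypothetical `Template` (a list whose entries are either literal text or an integer i standing for the parameter πi), the hole ? unrolls to itself, and the result is evaluated directly to a string instead of being kept as an expression. The names `unroll` and `unroll_tree` mirror u and ut.

```python
from typing import Dict, List, Tuple, Union

Tree = Tuple[str, list]            # (variable name, child forest)
Forest = List[Tree]
Template = List[Union[str, int]]   # literal text, or index i standing for pi_i

HOLE = "?"


def unroll(forest: Forest, alpha: Dict[str, Template]) -> str:
    """u(f) = ut(s0, f0) ... ut(sn, fn)."""
    return "".join(unroll_tree(s, kids, alpha) for s, kids in forest)


def unroll_tree(s: str, kids: Forest, alpha: Dict[str, Template]) -> str:
    """ut(s, g0 ... gm) = s[pi_0 -> u(g0), ..., pi_m -> u(gm)]."""
    if s == HOLE:
        return HOLE
    # The i-th child tree supplies the argument for parameter pi_i.
    args = [unroll_tree(c, grandkids, alpha) for c, grandkids in kids]
    pieces = []
    for piece in alpha[s]:
        pieces.append(args[piece] if isinstance(piece, int) else piece)
    return "".join(pieces)


# Hypothetical valuation and shape, used only for illustration.
alpha = {
    "z1": ["<a", 0, "a>"],
    "z2": ["b"],
    "z3": ["<c", 0, 1, "c>"],
    "z4": ["d"],
}
shape = [("z1", [("z2", [])]), ("z3", [("z4", []), (HOLE, [])])]
value = unroll(shape, alpha)
```

Here `value` is the string "<aba><cd?c>": z2 fills π0 of z1, while z4 and the hole fill π0 and π1 of z3.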
After reading the i-th symbol ai of an input word w, S′ is in a configuration ((l, f, g), α) iff for every look-ahead state r ∈ R and variable x ∈ X, if δ∗(q0, REV((a1 ... ai)r)) = (q, α1) and u(g(r, x)) = s, then α1(x) = α(s).
Counting argument invariants. Next, we describe how we keep the shape function g compact, allowing us to use a finite number of variables while updating them in a copyless manner. The shape function g maintains the following invariants.
Single-Use: each shape g(r, x) is repetition-free: no variable x′ ∈ X′ appears twice in g(r, x).
Sharing Bound: for all states r, r′ ∈ R, ∑x,y∈X |M(g(r, x), g(r′, y))| ≤ |X|.
Hole Placement: for every type-1 variable x ∈ X1 and state r ∈ R, there exists exactly one occurrence of ? in g(r, x), and it does not have any children.
Horizontal compression: for every f f′ ∈ SF(g(r, x)) such that ? ∉ f f′, there must be a shape g(r′, x′), with (r′, x′) ≠ (r, x), such that either f ∈ SF(g(r′, x′)) and f f′ ∉ SF(g(r′, x′)), or f′ ∈ SF(g(r′, x′)) and f f′ ∉ SF(g(r′, x′)).
Vertical compression: for every s(f) ∈ SF(g(r, x)) such that ? ∉ s(f), there must be a shape g(r′, x′), with (r′, x′) ≠ (r, x), such that either s() ∈ SF(g(r′, x′)) and s(f) ∉ SF(g(r′, x′)), or f ∈ SF(g(r′, x′)) and s(f) ∉ SF(g(r′, x′)).
The first invariant ensures the bounded size of shapes. The second invariant limits the amount of sharing between shapes, and it implies that, for each r and for each x ≠ y, g(r, x) and g(r, y) are disjoint. The second invariant also implies that, for every state r, the trees g(r, x), for all x cumulatively, can have a total of |X||R| maximal shared sub-forests with respect to all other strings. The third invariant says that the tree of each type-1 variable contains exactly one hole and that this hole appears as a leaf. This helps us deal with variable substitution. The fourth and fifth invariants (the compression invariants) guarantee that the shapes use only the minimum necessary number of variables. Together they imply that the sum ∑x∈X |g(r, x)| is bounded by 4|X||R|. This is due to the fact that a shape can be updated only in three ways: 1) on the left, 2) on the right, and 3) below the ?. As a result it suffices to have |R|(4|X||R|) variables in Z.
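Two of the invariants and the resulting variable count are easy to check mechanically. The Python sketch below is only an illustration: it assumes shapes encoded as lists of (variable name, child forest) pairs, and the helper names `is_single_use`, `hole_placement_ok`, and `variable_budget` are hypothetical. The Sharing Bound and the two compression invariants are not modeled here.

```python
from typing import List, Tuple

Tree = Tuple[str, list]   # (variable name or "?", child forest)
Forest = List[Tree]
HOLE = "?"


def symbols(forest: Forest) -> List[str]:
    """All node labels of a forest, in depth-first order."""
    out: List[str] = []
    for s, kids in forest:
        out.append(s)
        out.extend(symbols(kids))
    return out


def is_single_use(forest: Forest) -> bool:
    """Single-Use: no variable appears twice in the shape."""
    names = [s for s in symbols(forest) if s != HOLE]
    return len(names) == len(set(names))


def hole_placement_ok(forest: Forest) -> bool:
    """Hole Placement: exactly one '?' occurs, and it has no children."""
    def holes(f: Forest) -> List[bool]:
        found: List[bool] = []
        for s, kids in f:
            if s == HOLE:
                found.append(len(kids) == 0)
            found.extend(holes(kids))
        return found
    return holes(forest) == [True]


def variable_budget(num_vars: int, num_states: int) -> int:
    """|R| * (4 * |X| * |R|), the number of variables stated in the text."""
    return num_states * (4 * num_vars * num_states)
```

For instance, with |X| = 2 and |R| = 3 the budget is 3 · (4 · 2 · 3) = 72 variables.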
Variable and state updates. Next we show how g and the variables in X′ are updated and initialized.
The initial value g0 in the initial state, and each variable in X′, are defined as follows: let X′0 = {z1, ..., zk}, X′1 = {z′1, ..., z′k}, X0 = {x1, ..., xi}, X1 = {x′1, ..., x′j}. For each type-0 variable xi ∈ X0, for each type-1 variable x′j ∈ X1, and look-ahead state r ∈ R, we have that g0(r, xi) = zi, zi = ε, g0(r, x′j) = z′j(?), and z′j = π0.
We assume S′ to be in state (l, f, g), and to be reading the input symbol a ∈ Σ. We denote with g′ the new value of g. We assume we are given a look-ahead state r ∈ R such that δA(r, a) = r′, and that x and y are the variables to be updated.
{x := w}: where without loss of generality w = ⟨a ? b⟩. We first assume there exists an unused variable zf. We update g′(r, x) = zf, and set zf := ⟨a ? b⟩. Since the counting invariants are preserved, there must have existed an unused variable zf.
{x := xy, y := ε}: we perform the following update: g′(r, x) = g(r′, x)g(r′, y). Let g(r′, x) = s1(f1) ... sn(fn) and g(r′, y) = s′1(f′1) ... s′m(f′m). We now have two possibilities:
• there exists (r1, x1) ≠ (r, x) such that g(r1, x1) contains sn or s′1, but not sn s′1; or
• there does not exist (r1, x1) ≠ (r, x) such that g(r1, x1) contains sn or s′1, but not sn s′1. In this case we can compress: if fn = t1 ... ti and f′1 = t′1 ... t′k, then sn := sn s′1[π0 ↦ πi, ..., πk ↦ πk+i] and g′(r, x) = s1(f1) ... sn(fn f′1) s′2(f′2) ... s′m(f′m). In this new assignment to sn, the parameters in s′1 have been shifted by i to reflect the concatenation with sn.
In both cases, due to the preserved counting invariant, we can take an unused variable zf and use it to update g(r, y): zf := ε, and g′(r, y) = zf.
{x := x[y], y := ?}: without loss of generality let x and y be type-1 variables. Let s(?) be the subtree of g(r′, x) containing ?. We perform the following update: g′(r, x) = g(r′, x){?/g(r′, y)}, where a{b/c} replaces the node b of a with c. Let g(r′, y) = s′1(f′1) ... s′m(f′m). For every sl, we now have two possibilities:
• there exists p ≤ m′, and (r1, x1) ≠ (r, x) such that g(r1, x1) contains s or s′p, but not s(s′p); or
• there does not exist p ≤ m′, and (r1, x1) ≠ (r, x) such that g(r1, x1) contains s or s′p; in this case we can compress: assume f′p = t1 ... tk, then s := s[πi ↦ s′p[π0 ↦ πi, ..., πk ↦ πk+i], πi+1 ↦ πi+1+k, ..., πn+k+1], and g′(r, x) = g′(r, x){s(s′p)/s}.
In both cases, due to the preserved counting invariant, we can take an unused variable zf and use it to update g(r, y): zf := π0, and g′(r, y) = zf(?). Figure 5.3 shows an example of such an update.
{x := y, y := x}: we symbolically reflect the swap: g′(r, x) = g(r′, y), and g′(r, y) = g(r′, x). Similarly to before, we compress if necessary.
By inspection of the variable assignments, it is easy to see that S′ is copyless.
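The shape-level part of the substitution and swap cases can be sketched in Python as follows. This is a simplified illustration, not the full update: shapes are assumed to be lists of (variable name, child forest) pairs, the compression step and the choice of the unused variable zf are not modeled, and `substitute_hole` and `swap_shapes` are hypothetical helper names. The example shapes are made up and are not those of Figure 5.3.

```python
from typing import List, Tuple

Tree = Tuple[str, list]   # (variable name or "?", child forest)
Forest = List[Tree]
HOLE = "?"


def substitute_hole(shape_x: Forest, shape_y: Forest) -> Forest:
    """Return shape_x{?/shape_y}: each '?' node is replaced by the forest shape_y.

    Under the Hole Placement invariant shape_x contains exactly one '?', a leaf.
    """
    result: Forest = []
    for symbol, children in shape_x:
        if symbol == HOLE:
            result.extend(shape_y)
        else:
            result.append((symbol, substitute_hole(children, shape_y)))
    return result


def swap_shapes(shape_x: Forest, shape_y: Forest) -> Tuple[Forest, Forest]:
    """Shape-level reflection of {x := y, y := x}: no variable value is touched."""
    return shape_y, shape_x


# Hypothetical shapes for g(r', x) and g(r', y).
g_x = [("z1", [("z3", [(HOLE, [])])]), ("z4", [])]
g_y = [("z5", [(HOLE, [])])]
new_x = substitute_hole(g_x, g_y)
```

In this example the hole below z3 is replaced by the shape of y, so `new_x` nests z5 (with its own hole) under z3. If neither z3 nor z5 were shared with another shape, the construction would then compress them into a single variable as described in the second bullet of the substitution case.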
Figure 5.3 shows an example of an update of the form x := x[y], performed while reading the symbol a such that δA(r1, a) = r1 and δA(r2, a) = r1. Before reading a (on the left), the variables z2, z4, z6, and z7 are shared between the two representations of the variables at r1 and r2. In the new shape g′ the hole ? in g(r1, x) (respectively g(r2, x)) is replaced by g(r1, y) (respectively g(r2, y)). However, since the sequence z3z5 (respectively z9z10) is not shared, we can compress it into a single variable z3 (respectively z9), and reflect such a compression in the variable update z3 := z3[π0 ↦ z5] (respectively z9 := z9[π0 ↦ z10]). Now the variable z5 (respectively z10) is unused and we can therefore use it to update g′(r1, y) (respectively g′(r2, y)).
[Figure 5.3: the shapes g(r1, x), g(r2, x), g(r1, y), g(r2, y) before the update x := x[y], y := ?, and the shapes g′(r1, x), g′(r2, x), g′(r1, y), g′(r2, y) after it, with the variable updates z2 := z2[π0 ↦ π0π1], z3 := z3[π0 ↦ z5], z9 := z9[π0 ↦ z10], z5 := π0, z10 := π0.]
FIGURE 5.3: Example of symbolic variable assignment.
The output function F′ of S′ simply applies the unrolling function. For example, let us assume S′ ends in state (l, f, g) ∈ Q′, with l(r0) = r, f(r) = q, and F(q) = xy. We have that F′(l, f, g) = u(g(r0, x)g(r0, y)). This concludes the proof.