4.4 Recognizable tree languages
4.4.2 Regular tree grammars
Another way to define the recognizable tree languages is by regular tree grammars, first studied, in a more general form, by Brainerd (1969). These generating devices naturally generalize the regular grammars of the Chomsky hierarchy from strings to trees.
Formally, they are defined as follows.
Definition 4.4.8. A regular ΣX-tree grammar (RTG) is a device RT = (N, Σ, X, P, S) specified as follows.
(1) N is a finite non-empty set of nonterminal symbols such that N ∩ (Σ ∪ X) = ∅.
(2) Σ is a ranked alphabet and X is a leaf alphabet that together form the terminal alphabet of RT.
(3) P is a finite set of productions of the form A → r, where A ∈ N and r ∈ TΣ(X ∪ N).
(4) S ∈ N is the distinguished start symbol.
For any s, t ∈ TΣ(X ∪ N), s ⇒RT t means that there exist a (Σ ∪ N)X-context c and a production A → r in P such that s = c(A) and t = c(r) (i.e., t can be obtained from s by replacing one occurrence of A by r). The derivation relation ⇒*RT and the n-step derivation relation ⇒ⁿRT are defined as usual (see Section 2.3.2). The ΣX-tree language generated by RT is the set
T(RT) := {t ∈ TΣ(X) | S ⇒*RT t}.
Note that any regular ΣX-tree grammar may be viewed as a CFG with the terminal alphabet Σ ∪ X ∪ Z, where Z consists of the parentheses ( and ) and the comma. Thus, the tree languages generated by RTGs are special CFLs when trees are treated as strings. An example is given next.
Example 4.4.9. Let Σ = {f/3, g/2} and X = {x, y}. The system
RT = ({S, B}, Σ, X, {S → f(x, S, B), S → g(x, B), B → y}, S)
is an RTG generating the ΣX-tree language
T(RT) = {t ∈ TΣ(X) | t = f(x, ξ, y)ⁿ(g(x, y)), n ∈ ℕ}.
A sample derivation in RT is
S ⇒RT f(x, S, B) ⇒RT f(x, f(x, S, B), B) ⇒RT f(x, f(x, g(x, B), B), B)
⇒³RT f(x, f(x, g(x, y), y), y).
Note that the yield language of T(RT) is the CFL {xⁿyⁿ | n ≥ 1}.
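The derivation relation and Example 4.4.9 can be traced with a small sketch. It assumes that trees are encoded as nested Python tuples with plain strings as leaves; the helper names (rewrite_leftmost, generate, tree_yield) are illustrative and do not come from the text. Only the leftmost nonterminal is rewritten in each step, which loses no terminal trees because rewriting steps at different leaves are independent of each other.

```python
# Trees are nested tuples (label, child_1, ..., child_k); leaves are strings.
# Productions of Example 4.4.9, grouped by left-hand side.
PRODUCTIONS = {
    "S": [("f", "x", "S", "B"), ("g", "x", "B")],
    "B": ["y"],
}

def rewrite_leftmost(tree):
    """Apply every production to the leftmost nonterminal leaf of `tree`.

    Returns the list of resulting trees, or None if `tree` contains no
    nonterminal (i.e. it is already a terminal tree)."""
    if isinstance(tree, str):
        return list(PRODUCTIONS[tree]) if tree in PRODUCTIONS else None
    label, *children = tree
    for i, child in enumerate(children):
        results = rewrite_leftmost(child)
        if results is not None:
            return [(label, *children[:i], r, *children[i + 1:]) for r in results]
    return None

def generate(start, max_steps):
    """Terminal trees derivable from `start` in at most `max_steps` steps."""
    done, frontier = set(), {start}
    for _ in range(max_steps):
        next_frontier = set()
        for tree in frontier:
            successors = rewrite_leftmost(tree)
            if successors is None:
                done.add(tree)
            else:
                next_frontier.update(successors)
        frontier = next_frontier
    done.update(t for t in frontier if rewrite_leftmost(t) is None)
    return done

def tree_yield(tree):
    """Concatenate the leaf labels of a terminal tree from left to right."""
    if isinstance(tree, str):
        return tree
    return "".join(tree_yield(child) for child in tree[1:])

trees = generate("S", 8)
yields = {tree_yield(t) for t in trees}
```

A tree with n occurrences of f needs 2n + 2 derivation steps, so a bound of 8 steps reaches the trees with n ≤ 3, and their yields are the words xⁿ⁺¹yⁿ⁺¹ for those n.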
A more complicated example of an RTG that models linguistic phenomena can be found in Graehl et al. (2008, Figure 4); see also May (2010, Section 2.2).
We recall the following (cf. Gécseg and Steinby, 1984, Lemma II.3.4).
Theorem 4.4.10. Every regular ΣX-tree grammar is effectively equivalent to a regular ΣX-tree grammar RT = (N, Σ, X, P, S) in which each production is of the form
(1) A → d, where A ∈ N and d ∈ Σ0 ∪ X, or of the form
(2) A → f(A1, . . . , Am), where m > 0, f ∈ Σm and A, A1, . . . , Am ∈ N.
Such a tree grammar is said to be in normal form.
Example 4.4.11. Let Σ = {f/2, g/1} and X = {x, y}. Then RT = ({S, A}, Σ, X, P, S) with P consisting of the productions
S → f(A, A) | g(A), and A → f(S, S) | g(S) | x | y
is a ΣX-tree grammar in normal form. It generates exactly the ΣX-trees in which every leaf occurs at odd depth; in particular, every generated tree has odd height.
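As a sanity check on Example 4.4.11, the following sketch builds the trees derivable from S and from A bottom-up for a few rounds and inspects the depths of their leaves. The tuple encoding of trees and the function names are assumptions made for illustration only.

```python
from itertools import product

# Normal-form productions of Example 4.4.11: a right-hand side is either a
# leaf symbol (str) or a tuple (symbol, A_1, ..., A_m) of nonterminals.
PRODUCTIONS = {
    "S": [("f", "A", "A"), ("g", "A")],
    "A": [("f", "S", "S"), ("g", "S"), "x", "y"],
}

def trees_after(rounds):
    """Trees derivable from each nonterminal, built bottom-up in `rounds` rounds."""
    lang = {nt: set() for nt in PRODUCTIONS}
    for _ in range(rounds):
        new = {nt: set(trees) for nt, trees in lang.items()}
        for nt, rhss in PRODUCTIONS.items():
            for rhs in rhss:
                if isinstance(rhs, str):
                    new[nt].add(rhs)
                else:
                    symbol, *args = rhs
                    # Combine already known subtrees for the argument nonterminals.
                    for kids in product(*(lang[a] for a in args)):
                        new[nt].add((symbol, *kids))
        lang = new
    return lang

def leaf_depths(tree, depth=0):
    """Set of depths at which the leaves of `tree` occur."""
    if isinstance(tree, str):
        return {depth}
    return set().union(*(leaf_depths(child, depth + 1) for child in tree[1:]))

lang = trees_after(4)
all_odd = all(d % 2 == 1 for t in lang["S"] for d in leaf_depths(t))
```

In this bounded sample, every tree derived from S has all of its leaves at odd depth (so all_odd is true), while the trees derived from A have all of their leaves at even depth.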
On the other hand, the RTG RT of Example 4.4.9 is equivalent to the RTG in normal form ({S, A, B}, Σ, X, {S → f(A, S, B), S → g(A, B), A → x, B → y}, S).
Any regular ΣX-tree grammar in normal form can easily be converted to an equivalent ndT ΣX-tree recognizer by turning the nonterminals into states and taking the productions as rules, and conversely. Hence it is clear that the regular tree grammars generate exactly the recognizable tree languages (see Gécseg and Steinby (1984, Theorem II.3.6) or Gécseg and Steinby (1997, Proposition 6.2), for example).
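The conversion just described can be sketched for the normal-form grammar equivalent to Example 4.4.9: the nonterminals serve as states and each production is read as a top-down transition rule. The tuple encoding and the name accepts are hypothetical; the recursion simply tries every rule of the current state, which is one straightforward way to simulate a nondeterministic top-down run.

```python
# Rules of the recognizer, read off the normal-form productions
# S → f(A, S, B), S → g(A, B), A → x, B → y; nonterminals act as states.
RULES = {
    "S": [("f", "A", "S", "B"), ("g", "A", "B")],
    "A": ["x"],
    "B": ["y"],
}

def accepts(state, tree):
    """Is there a successful top-down run on `tree` that starts in `state`?"""
    for rhs in RULES.get(state, []):
        if isinstance(rhs, str):
            # Rule of type (1): the state accepts exactly this leaf.
            if tree == rhs:
                return True
            continue
        if isinstance(tree, str):
            continue  # a rule of type (2) cannot match a leaf
        symbol, *states = rhs
        label, *children = tree
        # Rule of type (2): match the root symbol, then send one state
        # down to each child.
        if symbol == label and len(states) == len(children) and all(
            accepts(q, child) for q, child in zip(states, children)
        ):
            return True
    return False

sample = ("f", "x", ("f", "x", ("g", "x", "y"), "y"), "y")
recognized = accepts("S", sample)
```

Here the tree is accepted when the run starts in the state S that corresponds to the start symbol, mirroring the derivation of the same tree in the grammar.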
4.4.2.1 Tree substitution grammars
We now consider a restricted type of regular tree grammar with applications in linguistics, namely tree-substitution grammars (Schabes, 1990; Frank, 2000; Eisner, 2003; Shieber, 2004), by giving a new formal definition that will be further used and explained in Section 5.3.1. In the formulation of Shieber (2004), a tree-substitution grammar is essentially a set P of trees in TΣ(N), where Σ is the ranked alphabet of terminal symbols and N := {f↓ | f ∈ Σ \ Σ0}. Each derivation begins with a tree in P whose root is labeled with the symbol in Σ that has been chosen as the start symbol. The idea is that if a leaf is labeled with a symbol f↓ in N, then we may substitute for that leaf any tree in P whose root is labeled by f. Derivations are defined via derivation trees, but we may define them equivalently in the usual way by regarding the symbols f↓ ∈ N as nonterminals and stipulating that if f↓ is the left-hand side of a production, then the right-hand side of the production is a tree r ∈ P such that root(r) = f. The possibility of having multiple copies of trees in P and the ordering of the N-labeled leaves of the trees in P postulated by Shieber (2004) have no effect on the generative power, and their intended uses will be achieved otherwise when we define the synchronous version of tree-substitution grammars. On the other hand, the set of productions should obviously be finite, and hence we arrive at the following definition.
Definition 4.4.12. A tree-substitution grammar (TSG) is a system TS = (N, Σ, X, P, S) specified as follows.
(1) Σ and X are the terminal alphabets.
(2) N := {f↓ | f ∈ Σ \ Σ0} is the set of nonterminal symbols, where N ∩ (Σ ∪ X) = ∅.
(3) P is a finite set of productions of the form f↓ → r, where f↓ ∈ N, r ∈ TΣ(X ∪ N) and root(r) = f.
(4) S ∈ N is the start symbol (and hence S = f↓ for some f ∈ Σ \ Σ0).
The one-step derivation relation ⇒TS is defined as usual: if s, t ∈ TΣ(X ∪ N), then s ⇒TS t if and only if there are a context c ∈ CΣ(X ∪ N) and a rule f↓ → r in P such that s = c(f↓) and t = c(r). Then, the ΣX-tree language generated by TS is the set T(TS) := {t ∈ TΣ(X) | S ⇒*TS t}. Furthermore, let T[TSG] denote the class of all the tree languages generated by TSGs.
As Frank (2000) observed, TSGs are adequate systems to produce appropriate structural descriptions for clausal complementation and to generate sentences containing adjunction structures such as adverbial modifiers and relative clauses. Next, we give an example of such a tree grammar.
Example 4.4.13. The system
TS = ({S↓}, {S/2}, {x, y}, {S↓ → S(x, y), S↓ → S(x, S(S↓, y))}, S↓)
is a TSG. A sample derivation in TS is
S↓ ⇒TS S(x, S(S↓, y)) ⇒TS S(x, S(S(x, S(S↓, y)), y))
⇒TS S(x, S(S(x, S(S(x, y), y)), y)),
and TS generates the tree language T(TS) = {S(x, S(ξ, y))ⁿ(S(x, y)) | n ∈ ℕ}.
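Example 4.4.13 can be replayed with a short sketch. It assumes the nested-tuple encoding of trees used for illustration here, writes the nonterminal f↓ as the string f + "↓", and uses hypothetical helper names (is_tsg, expand, nth_tree). The check in is_tsg corresponds to the requirement root(r) = f in the definition of a TSG.

```python
from itertools import product

DOWN = "↓"

# Productions of Example 4.4.13, grouped by left-hand side.
PRODUCTIONS = {
    "S" + DOWN: [("S", "x", "y"), ("S", "x", ("S", "S" + DOWN, "y"))],
}

def is_tsg(productions):
    """Every production f↓ → r must satisfy root(r) = f."""
    return all(
        not isinstance(r, str) and r[0] + DOWN == lhs
        for lhs, rhss in productions.items()
        for r in rhss
    )

def expand(tree, depth):
    """Terminal trees obtained from `tree` with substitutions nested at most `depth` deep."""
    if isinstance(tree, str):
        if tree not in PRODUCTIONS:
            return [tree]  # terminal leaf
        if depth == 0:
            return []      # substitution budget exhausted
        return [t for r in PRODUCTIONS[tree] for t in expand(r, depth - 1)]
    label, *children = tree
    options = [expand(child, depth) for child in children]
    return [(label, *kids) for kids in product(*options)]

def nth_tree(n):
    """The tree S(x, S(ξ, y))ⁿ(S(x, y)) of the example."""
    tree = ("S", "x", "y")
    for _ in range(n):
        tree = ("S", "x", ("S", tree, "y"))
    return tree

generated = set(expand("S" + DOWN, 3))
```

With a nesting bound of 3, the generated set consists of nth_tree(0), nth_tree(1) and nth_tree(2); the last of these is the tree reached by the sample derivation.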
It is immediately clear from the definition that every TSG is a special regular tree grammar and hence that T[TSG] ⊆ Rec. However, it is also obvious that not every recognizable tree language can be generated by a TSG. First of all, the root of any tree in the tree language generated by a TSG must be labeled by the symbol that corresponds to the start symbol. Secondly, the number of nonterminals is limited by the size of the ranked alphabet.
Moreover, the dT-recognizable tree language {x} cannot be generated by any TSG, because the root of every tree generated by a TSG is labeled by a symbol of positive rank. On the other hand, the TSG ({f↓}, {f/2}, {x, y}, {f↓ → f(x, y), f↓ → f(y, x)}, f↓) generates the tree language {f(x, y), f(y, x)}, which is not dT-recognizable by Theorem 4.4.5. Hence, we have the following result.
Theorem 4.4.14. T[TSG] ⊂ Rec and T[TSG] ∥ DRec.
On the other hand, we may note the following fact (cf. Schabes, 1990).
Proposition 4.4.15. Every context-free language is the yield of a tree language generated by a tree-substitution grammar.
Proof. If L ⊆ X* is a CFL, it is generated by a CFG CF = (N, X, P, S) in CNF (cf. Theorem 2.4.8). Assuming first that L ⊆ X⁺, we define the TSG TS = (N′, Σ, X, P′, S′), where Σ := {e/0} ∪ {A/2 | A ∈ N}, N′ := {A↓ | A ∈ N}, S′ := S↓, and
P′ := {A↓ → A(B↓, C↓) | A → BC ∈ P} ∪ {A↓ → A(x, e) | A → x ∈ P}.
Clearly, TS generates exactly the usual derivation trees of CF (cf. Sudkamp, 1997, for example) in which inner nodes are labeled with nonterminal symbols (now binary symbols in Σ), but here any leaf labeled with a terminal symbol x ∈ X is replaced with a subtree A(x, e). Hence, yd(T(TS)) = L(CF) = L. If ε ∈ L, the construction is modified by adding the production S↓ → S(e, e) to P′. Note that S does not appear on the right-hand side of any production in P because CF is in CNF.
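The construction in the proof can be sketched for a concrete grammar. The CFG below, a CNF grammar for {aⁿbⁿ | n ≥ 1}, is a hypothetical example that does not appear in the text, as are the tuple encoding of trees and the function names. The sketch assumes that the nullary symbol e is not a terminal of the CFG and that the yield reads e as the empty word.

```python
from itertools import product

DOWN = "↓"

# Hypothetical CFG in CNF: S → AB | AC, C → SB, A → a, B → b.
CNF_RULES = [
    ("S", ("A", "B")), ("S", ("A", "C")), ("C", ("S", "B")),
    ("A", ("a",)), ("B", ("b",)),
]

def cnf_to_tsg(rules):
    """Build the TSG productions from the productions of a CFG in CNF.

    A → BC becomes A↓ → A(B↓, C↓), and A → x becomes A↓ → A(x, e)."""
    productions = {}
    for lhs, rhs in rules:
        if len(rhs) == 2:
            tree = (lhs, rhs[0] + DOWN, rhs[1] + DOWN)
        else:
            tree = (lhs, rhs[0], "e")
        productions.setdefault(lhs + DOWN, []).append(tree)
    return productions

def expand(tree, depth, productions):
    """Terminal trees obtained with substitutions nested at most `depth` deep."""
    if isinstance(tree, str):
        if tree not in productions:
            return [tree]  # terminal leaf or the symbol e
        if depth == 0:
            return []
        return [t for r in productions[tree]
                for t in expand(r, depth - 1, productions)]
    label, *children = tree
    options = [expand(child, depth, productions) for child in children]
    return [(label, *kids) for kids in product(*options)]

def tree_yield(tree):
    """Yield of a terminal tree, with the nullary symbol e read as the empty word."""
    if isinstance(tree, str):
        return "" if tree == "e" else tree
    return "".join(tree_yield(child) for child in tree[1:])

tsg = cnf_to_tsg(CNF_RULES)
words = {tree_yield(t) for t in expand("S" + DOWN, 6, tsg)}
```

With a nesting bound of 6, the yields collected in words are ab, aabb and aaabbb, which are the shortest words of the language generated by the assumed CFG.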
The closure properties of T[TSG] are discussed by Maletti (2014). In particular, this class of tree languages is not closed under union, intersection, complement, or alphabetic tree homomorphism.