Count-based State Merging for Probabilistic Regular Tree Grammars

Toni Dietze

Faculty of Computer Science, Technische Universität Dresden

01062 Dresden, Germany [email protected]

Mark-Jan Nederhof

School of Computer Science, University of St Andrews

KY16 9SX, UK

Abstract

We present an approach to obtain language models from a tree corpus using probabilistic regular tree grammars (prtg). Starting with a prtg only generating trees from the corpus, the prtg is generalized step by step by merging nonterminals. We focus on bottom-up deterministic prtg to simplify the calculations.

1 Introduction

Constituent parsing plays an important role in natural language processing (nlp). One can easily read off a pcfg from a tree corpus and use it for parsing. This might work quite well (Charniak, 1996), but it can be even more fruitful to introduce a state behaviour that is not visible in the corpus (Klein and Manning, 2003). The Expectation-Maximization Algorithm (Dempster et al., 1977) can be used to train probabilities if the state behaviour is fixed (Matsuzaki et al., 2005). This can be improved by adapting the state behaviour automatically by cleverly splitting and merging states (Petrov et al., 2006).

More generally, finding a grammar by examining terminal objects is one of the problems investigated in the field of grammatical inference. There are many results for the string case, e.g., on how to learn deterministic stochastic finite (string) automata from text (Carrasco and Oncina, 1994; Carrasco and Oncina, 1999). For the tree case, there are, e.g., results for identifying function distinguishable regular tree languages from text (Fernau, 2002). There is also a generalization of n-grams to trees including smoothing techniques (Rico-Juan et al., 2000; Rico-Juan et al., 2002).

The mentioned results for deterministic stochastic finite (string) automata were generalized to an algorithm that learns stochastic deterministic tree automata from trees (Carrasco et al., 2001). Given a tree corpus, this approach yields a single grammar. The authors experimentally showed that if the corpus is too small, the grammar tends to be too general. In contrast, the split-merge approach of Petrov et al. (2006) produces a sequence of different grammars. One can use cross validation to select a grammar from that sequence that suitably abstracts away from the training corpus. Because of the intricate combination of splitting and merging, however, the behavior is very difficult to analyse theoretically. Our approach is similar to the split-merge procedure in that a sequence of grammars is obtained. However, it differs by operating without splitting and relying on merging alone, thereby obtaining a simpler framework.

Our goal is to create a sequence of probabilistic regular tree grammars (prtg) from a corpus such that every prtg abstracts away from the corpus more than its predecessors in the sequence (cf. Algorithm 1). We start with a prtg that admits no more and no less than the trees in the corpus. The rules of the prtg are then changed step by step to make the grammar more general.

We generalize a prtg by merging nonterminals, which means we replace several nonterminals by a single new one. The weights of the resulting prtg are assigned by maximum likelihood estimation on the corpus.

To make the approach easier, we only consider bottom-up deterministic prtg. On the one hand, this simplifies our calculations, e.g., the maximum likelihood estimate; on the other hand, this simplifies the application of the prtg as a language model, since the search for the most probable tree for a given yield does not have to consider several derivations for a single tree.

2 Preliminaries

Let $X$ be a set and $(\equiv) \subseteq X \times X$ an equivalence relation. We write $x \equiv y$ instead of $(x, y) \in (\equiv)$ for $x, y \in X$. Note that the identity relation is an equivalence relation. Let $x \in X$. The equivalence class of $x$ (induced by $(\equiv)$), denoted by $[x]_{\equiv}$ (or just $[x]$ if $(\equiv)$ is clear from the context), is defined as $\{y \in X \mid x \equiv y\}$. The quotient set of $X$ by $(\equiv)$, denoted by $X/{\equiv}$, is defined as $\{[x] \mid x \in X\}$. Note that $X/{\equiv}$ is a partition of $X$ and that for every partition $P$ of $X$ there is an equivalence relation $(\equiv')$ such that $P = X/{\equiv'}$. The set of equivalence relations over $X$ forms a complete lattice ordered by set inclusion.

We denote the set of natural numbers including $0$ by $\mathbb{N}$. An alphabet (of symbols) is a finite, non-empty set. A ranked alphabet is an alphabet $\Sigma$ where we associate a rank $\mathrm{rk}(\sigma) \in \mathbb{N}$ with every $\sigma \in \Sigma$. Let $\Sigma$ be a ranked alphabet. The set of trees over $\Sigma$, denoted by $T_\Sigma$, is the smallest set $T$ such that $\sigma(t_1, \ldots, t_{\mathrm{rk}(\sigma)}) \in T$ for every $\sigma \in \Sigma$ and $t_1, \ldots, t_{\mathrm{rk}(\sigma)} \in T$. Additionally, we define $T_\emptyset = \emptyset$. Let $t = \sigma(t_1, \ldots, t_k) \in T_\Sigma$. The set of positions of $t$, denoted by $\mathrm{pos}(t)$, is defined as $\{\varepsilon\} \cup \{iw \mid i \in \{1, \ldots, k\}, w \in \mathrm{pos}(t_i)\}$. Let $w \in \mathrm{pos}(t)$. The symbol of $t$ at $w$, denoted by $t(w)$, is defined as $\sigma$ if $w = \varepsilon$, and as $t_i(w')$ if $w = iw'$. The subtree of $t$ at $w$, denoted by $t|_w$, is defined as $t$ if $w = \varepsilon$, and as $t_i|_{w'}$ if $w = iw'$.
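The recursive definitions of positions, symbols, and subtrees translate almost literally into code. The following is only a minimal sketch under an assumed encoding that is not part of the paper: a tree $\sigma(t_1, \ldots, t_k)$ is the Python tuple `(σ, t1, ..., tk)`, and a position is a tuple of 1-based child indices, with the empty tuple standing for $\varepsilon$. The function names are hypothetical.

```python
# Assumed encoding (not from the paper): the tree σ(t1, ..., tk) is the tuple
# (σ, t1, ..., tk); a position is a tuple of 1-based child indices, () is ε.

def positions(t):
    """pos(t) = {ε} ∪ {iw | 1 <= i <= k, w in pos(t_i)}."""
    result = {()}
    for i, child in enumerate(t[1:], start=1):
        result |= {(i,) + w for w in positions(child)}
    return result


def subtree_at(t, w):
    """t|_w: t itself if w = ε, otherwise t_i|_w' for w = iw'."""
    # t[0] is the root symbol, so the i-th child is stored at index i.
    return t if not w else subtree_at(t[w[0]], w[1:])


def symbol_at(t, w):
    """t(w): the root symbol of the subtree of t at w."""
    return subtree_at(t, w)[0]


example_tree = ("f", ("a",), ("g", ("b",)))  # the tree f(a, g(b))
print(sorted(positions(example_tree)))
print(symbol_at(example_tree, (2, 1)))
print(subtree_at(example_tree, (2,)))
```

Using tuples keeps trees hashable, so they can later serve directly as dictionary keys or as nonterminals.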

Let $X$ be a set and $f\colon X \to \mathbb{R}_{\geq 0}$ a mapping. The support of $f$, denoted by $\mathrm{supp}(f)$, is defined as $\{x \in X \mid f(x) \neq 0\}$. The size of $f$, denoted by $|f|$, is defined as $\sum_{x \in \mathrm{supp}(f)} f(x)$. We call $f$ a corpus (over $X$) if $\mathrm{supp}(f)$ is finite and non-empty. We call $f$ a probability distribution (over $X$) if $|f| = 1$. We denote the set of all probability distributions over $X$ by $\mathrm{Pd}(X)$. Note that $f$ may be both a corpus and a probability distribution. Sometimes we refer to the values of a corpus by the word counts.

Definition 1. A regular tree grammar (rtg) is a tuple $(N, \Sigma, I, R)$ where

• $N$ is an alphabet (of nonterminals),

• $\Sigma$ is a ranked alphabet (of terminals) such that $N \cap \Sigma = \emptyset$,

• $I \subseteq N$ is a non-empty set (of initial nonterminals), and

• $R$ is a finite set of rules of the form $A_0 \to \sigma(A_1, \ldots, A_{\mathrm{rk}(\sigma)})$ where $\sigma \in \Sigma$ and $A_0, \ldots, A_{\mathrm{rk}(\sigma)} \in N$.

Our definition above corresponds to a normal form of rtg with regard to a more general definition, which was given by, for example, Gécseg and Steinby (1984, Chapter II, Section 3).

Let $G = (N, \Sigma, I, R)$ be an rtg. For a rule $r \in R$ of the form $A_0 \to \sigma(A_1, \ldots, A_{\mathrm{rk}(\sigma)})$ we define $\mathrm{lhs}(r) = A_0$ and $\mathrm{rhs}(r) = \sigma(A_1, \ldots, A_{\mathrm{rk}(\sigma)})$. We associate with $r$ the rank $\mathrm{rk}(r) = \mathrm{rk}(\sigma)$, hence we may view $R$ as a ranked alphabet if $R$ is non-empty. We call $G$ bottom-up deterministic if $\mathrm{rhs}(r_1) = \mathrm{rhs}(r_2)$ implies $r_1 = r_2$ for every $r_1, r_2 \in R$.

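Let $t \in T_\Sigma$. A derivation tree of $G$ for $t$ is a tree $d \in T_R$ such that $\mathrm{pos}(d) = \mathrm{pos}(t)$, $\mathrm{lhs}(d(\varepsilon)) \in I$, and for every $w \in \mathrm{pos}(t)$ we have $\mathrm{rhs}(d(w)) = t(w)\bigl(\mathrm{lhs}(d(w1)), \ldots, \mathrm{lhs}(d(w\,\mathrm{rk}(t(w))))\bigr)$. The set of all derivation trees of $G$ for $t$ is denoted by $D_G(t)$. The language of trees generated by $G$, denoted by $\llbracket G \rrbracket$, is defined as $\{t \in T_\Sigma \mid D_G(t) \neq \emptyset\}$. Note that if $G$ is bottom-up deterministic, $D_G(t)$ has at most one element. We denote this element, if it exists, by $d^t_G$.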
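Because a bottom-up deterministic rtg has at most one rule per right-hand side, the unique derivation tree can be found in a single bottom-up pass. The sketch below assumes an encoding that is not part of the paper: trees are tuples `(σ, t1, ..., tk)`, and the rule set is a dictionary from right-hand sides `(σ, A1, ..., Ak)` to left-hand sides, which enforces bottom-up determinism by construction. The names `derivation` and `_run` are hypothetical.

```python
def _run(rules, t):
    """Bottom-up pass: return a derivation tree for t without checking the
    initial-nonterminal condition, or None if some rule is missing."""
    children = [_run(rules, child) for child in t[1:]]
    if any(child is None for child in children):
        return None
    # The root label of each child derivation is a rule (lhs, rhs).
    rhs = (t[0],) + tuple(child[0][0] for child in children)
    if rhs not in rules:
        return None
    rule = (rules[rhs], rhs)
    return (rule,) + tuple(children)


def derivation(rules, initial, t):
    """Return the unique derivation tree of t, or None if t is not generated.

    rules maps a right-hand side (σ, A1, ..., Ak) to its left-hand side A0.
    """
    d = _run(rules, t)
    if d is None or d[0][0] not in initial:
        return None
    return d


example_rules = {("a",): "A", ("f", "A"): "F"}  # A -> a,  F -> f(A)
example_initial = {"F"}
print(derivation(example_rules, example_initial, ("f", ("a",))))
print(derivation(example_rules, example_initial, ("a",)))
```

The second call returns `None` because the root nonterminal `A` is not initial, so the tree `a` is not in the generated language.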

Definition 2. A probabilistic regular tree grammar (prtg) is a tuple $(G, \iota, \rho)$ where

• $G = (N, \Sigma, I, R)$ is an rtg,

• $\iota\colon I \to [0, 1]$ is a mapping (initial weights), and

• $\rho\colon R \to [0, 1]$ is a mapping (rule weights).

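Let $P = (G, \iota, \rho)$ be a prtg. Terminology for rtg is carried over to prtg, e.g., $P$ is bottom-up deterministic iff $G$ is bottom-up deterministic. We call $P$ proper if $\iota$ is a probability distribution and $\sum_{r \in R\colon \mathrm{lhs}(r) = A} \rho(r) = 1$ for every $A \in N$. Let $t \in T_\Sigma$ and $d \in D_P(t)$. The weight of $d$ (induced by $P$), denoted by $\llbracket P \rrbracket(d)$, is defined as $\iota(\mathrm{lhs}(d(\varepsilon))) \cdot \prod_{w \in \mathrm{pos}(t)} \rho(d(w))$. The weight of $t$ (induced by $P$), denoted by $\llbracket P \rrbracket(t)$, is defined as $\sum_{d \in D_P(t)} \llbracket P \rrbracket(d)$. If $\llbracket P \rrbracket$ is a probability distribution over $T_\Sigma$, then $P$ is called consistent.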
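For a bottom-up deterministic prtg the sum over derivation trees has at most one summand, so the weight of a tree is a single product. The following sketch illustrates this under assumed encodings that are not from the paper: trees are tuples `(σ, t1, ..., tk)`, `rules` maps right-hand sides to left-hand sides, `iota` maps initial nonterminals to weights, and `rho` maps rules `(lhs, rhs)` to weights. The name `tree_weight` is hypothetical.

```python
def tree_weight(rules, iota, rho, t):
    """Weight of t under a bottom-up deterministic prtg (0.0 without derivation)."""

    def run(s):
        # Returns (nonterminal deriving s, product of rule weights) or None.
        weight = 1.0
        states = []
        for child in s[1:]:
            result = run(child)
            if result is None:
                return None
            states.append(result[0])
            weight *= result[1]
        rhs = (s[0],) + tuple(states)
        if rhs not in rules:
            return None
        lhs = rules[rhs]
        return lhs, weight * rho[(lhs, rhs)]

    result = run(t)
    if result is None:
        return 0.0
    # Nonterminals outside I contribute no derivation tree, hence weight 0.
    return iota.get(result[0], 0.0) * result[1]


example_rules = {("a",): "A", ("b",): "A", ("f", "A"): "S"}
example_rho = {("A", ("a",)): 0.25, ("A", ("b",)): 0.75, ("S", ("f", "A")): 1.0}
example_iota = {"S": 1.0}
print(tree_weight(example_rules, example_iota, example_rho, ("f", ("a",))))
print(tree_weight(example_rules, example_iota, example_rho, ("a",)))
```

The example grammar is proper, and the weights of its two generated trees `f(a)` and `f(b)` sum to one.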

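Let $c$ be a corpus over $T_\Sigma$. The canonical rtg of $c$ and the canonical prtg of $c$ are defined as $G = (N, \Sigma, I, R)$ and $P = (G, \iota, \rho)$, respectively, where

• $N = \{t|_w \mid t \in \mathrm{supp}(c), w \in \mathrm{pos}(t)\}$,

• $I = \mathrm{supp}(c)$,

• $R = \{t|_\varepsilon \to t(\varepsilon)(t|_1, \ldots, t|_{\mathrm{rk}(t(\varepsilon))}) \mid t \in N\}$,

• $\iota(t) = \frac{c(t)}{|c|}$ for every $t \in I$, and

• $\rho(r) = 1$ for every $r \in R$.

Note that every canonical prtg is bottom-up deterministic, proper, and consistent, and that $\llbracket P \rrbracket(t) = \iota(t)$ for every $t \in \mathrm{supp}(\llbracket P \rrbracket)$; hence, $\llbracket P \rrbracket$ represents the relative frequencies of the trees in $c$.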
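The canonical prtg can be read off a corpus directly. The sketch below is an illustration under assumed encodings that are not from the paper: trees are tuples `(σ, t1, ..., tk)`, a corpus is a `Counter` from trees to counts, and a rule is a pair `(lhs, rhs)`. The function names are hypothetical.

```python
from collections import Counter


def subtrees(t):
    """Yield t|_w for every position w of t."""
    yield t
    for child in t[1:]:
        yield from subtrees(child)


def canonical_prtg(c):
    """Return (N, I, R, iota, rho) of the canonical prtg of the corpus c."""
    support = [t for t, count in c.items() if count > 0]
    size = sum(c[t] for t in support)
    nonterminals = {s for t in support for s in subtrees(t)}
    initial = set(support)
    # One rule t -> t(ε)(t|_1, ..., t|_k) per nonterminal t. The nonterminals
    # on the right-hand side are the direct subtrees themselves, so in this
    # encoding the right-hand side tuple coincides with t.
    rules = {(t, (t[0],) + t[1:]) for t in nonterminals}
    iota = {t: c[t] / size for t in initial}
    rho = {r: 1.0 for r in rules}
    return nonterminals, initial, rules, iota, rho


example_corpus = Counter({("f", ("a",), ("a",)): 2, ("a",): 1})
N, I, R, iota, rho = canonical_prtg(example_corpus)
print(len(N), len(R), iota)
```

Since every nonterminal is a subtree with exactly one rule, the grammar generates exactly the trees in the support of the corpus.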

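Let $c$ be a corpus over a set $X$ and $p \in \mathrm{Pd}(X)$. The likelihood of $c$ given $p$ is defined as

$$L(c \mid p) = \prod_{t \in \mathrm{supp}(c)} p(t)^{c(t)}.$$

Let $M \subseteq \mathrm{Pd}(X)$. The maximum likelihood estimate from $M$ for $c$, denoted by $\mathrm{mle}_M(c)$, is defined as

$$\mathrm{mle}_M(c) = \operatorname*{argmax}_{p \in M} L(c \mid p).$$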

If $M = \mathrm{Pd}(T_\Sigma)$, then $\mathrm{mle}_M(c)$ maps a tree to its relative frequency in $c$ (Prescher, 2004, Theorem 1). Hence, the canonical prtg represents the corpus perfectly.
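As a small illustration of the likelihood and of the relative-frequency estimate, the following sketch works in log-space to avoid underflow. It assumes that a corpus is a `Counter` and a distribution is a plain dictionary; these encodings and all function names are assumptions for illustration, not part of the paper.

```python
import math
from collections import Counter


def size(c):
    """|c|: the sum of all counts."""
    return sum(c.values())


def log_likelihood(c, p):
    """log L(c | p) = sum over t in supp(c) of c(t) * log p(t)."""
    total = 0.0
    for t, count in c.items():
        if count == 0:
            continue  # t is not in supp(c)
        if p.get(t, 0.0) == 0.0:
            return -math.inf  # a corpus tree with probability 0
        total += count * math.log(p[t])
    return total


def relative_frequency(c):
    """The unrestricted maximum likelihood estimate: t -> c(t) / |c|."""
    n = size(c)
    return {t: count / n for t, count in c.items() if count > 0}


example_corpus = Counter({("a",): 3, ("f", ("a",)): 1})
estimate = relative_frequency(example_corpus)
uniform = {t: 0.5 for t in example_corpus}
print(log_likelihood(example_corpus, estimate))
print(log_likelihood(example_corpus, uniform))
```

On this corpus the relative-frequency estimate attains a higher log-likelihood than the uniform distribution over the two trees, as expected.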

Let $G = (N, \Sigma, I, R)$ be an rtg, $M = \{\llbracket P \rrbracket \mid P = (G, \iota, \rho) \text{ is a consistent prtg}\}$, and $c$ a corpus over $T_\Sigma$. Then we also write $\mathrm{mle}_G(c)$ instead of $\mathrm{mle}_M(c)$.

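Let $G = (N, \Sigma, I, R)$ be a bottom-up deterministic rtg and $c$ a corpus over $T_\Sigma$ such that $\mathrm{supp}(c) \subseteq \llbracket G \rrbracket$. Note that there is exactly one derivation tree $d^t_G$ of $G$ for every tree $t \in \mathrm{supp}(c)$. Based on $G$, we derive three corpora from $c$:

$$c^{\mathrm{R}}_G\colon R \to \mathbb{R}_{\geq 0}\colon r \mapsto \sum_{t \in \mathrm{supp}(c)} c(t) \cdot \bigl|\{w \in \mathrm{pos}(t) \mid r = d^t_G(w)\}\bigr|,$$

$$c^{\mathrm{N}}_G\colon N \to \mathbb{R}_{\geq 0}\colon A \mapsto \sum_{t \in \mathrm{supp}(c)} c(t) \cdot \bigl|\{w \in \mathrm{pos}(t) \mid A = \mathrm{lhs}(d^t_G(w))\}\bigr|,$$

$$c^{\mathrm{I}}_G\colon N \to \mathbb{R}_{\geq 0}\colon A \mapsto \sum_{\substack{t \in \mathrm{supp}(c)\colon \\ A = \mathrm{lhs}(d^t_G(\varepsilon))}} c(t).$$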
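The three count corpora can be accumulated in one bottom-up pass over each corpus tree. The sketch below assumes encodings that are not from the paper: trees are tuples `(σ, t1, ..., tk)`, `rules` maps right-hand sides `(σ, A1, ..., Ak)` to left-hand sides, and every tree in the support is assumed to have a derivation tree whose root nonterminal is initial. The name `count_corpora` is hypothetical.

```python
from collections import Counter


def count_corpora(rules, c):
    """Return (cR, cN, cI) for a bottom-up deterministic rtg and a corpus c."""
    cR, cN, cI = Counter(), Counter(), Counter()

    def visit(t, weight):
        # Returns the nonterminal deriving t and counts each rule occurrence
        # weight times, i.e. once per position and per occurrence of the tree.
        rhs = (t[0],) + tuple(visit(child, weight) for child in t[1:])
        lhs = rules[rhs]
        cR[(lhs, rhs)] += weight
        cN[lhs] += weight
        return lhs

    for t, weight in c.items():
        if weight > 0:
            cI[visit(t, weight)] += weight
    return cR, cN, cI


example_rules = {("a",): "A", ("f", "A", "A"): "S"}
example_corpus = Counter({("f", ("a",), ("a",)): 2, ("a",): 1})
cR, cN, cI = count_corpora(example_rules, example_corpus)
print(dict(cN), dict(cI))
```

In the example, `A` occurs at two positions of the tree with count 2 and at one position of the tree with count 1, giving a nonterminal count of 5.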

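Now $\mathrm{mle}_G(c) = \llbracket (G, \iota, \rho) \rrbracket$ where for every $A \in I$ and $r \in R$ (Prescher, 2004, Theorem 10):

$$\iota(A) = \frac{c^{\mathrm{I}}_G(A)}{|c|} \quad\text{and}\quad \rho(r) = \frac{c^{\mathrm{R}}_G(r)}{c^{\mathrm{N}}_G(\mathrm{lhs}(r))}.$$

Note that $c^{\mathrm{N}}_G(A) = \sum_{r \in R\colon A = \mathrm{lhs}(r)} c^{\mathrm{R}}_G(r)$.

3 Count-Based State Merging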

Algorithm 1 summarizes the idea of our approach. We detail the used notions in what follows.

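Algorithm 1 Count-Based State Merging
Input: corpus $c$ over $T_\Sigma$
Output: sequence of bottom-up deterministic prtg $P_0, \ldots, P_n$, for some $n \in \mathbb{N}$, such that $\mathrm{supp}(\llbracket P_0 \rrbracket) \subseteq \ldots \subseteq \mathrm{supp}(\llbracket P_n \rrbracket)$
1: $P_0 = (G_0, \iota_0, \rho_0) \leftarrow$ canonical prtg of $c$
2: $i \leftarrow 0$
3: while there exists a non-trivial $G_i$-merger do
4:     $\pi \leftarrow \mathrm{BESTMERGER}(G_i, c)$
5:     $i \leftarrow i + 1$
6:     $G_i \leftarrow \pi(G_{i-1})$
7:     let $P_i$ be a prtg such that $\mathrm{mle}_{G_i}(c) = \llbracket P_i \rrbracket$

Let $G = (N, \Sigma, I, R)$ be an rtg and $(\equiv)$ an equivalence relation on $N$. The $G$-merger w.r.t. $(\equiv)$ is the overloaded expression $\pi_\equiv$ for

• nonterminals: $\pi_\equiv(A) = [A]$ for every $A \in N$,

• rules: $\forall \sigma \in \Sigma\colon \forall A_0, \ldots, A_{\mathrm{rk}(\sigma)} \in N\colon \pi_\equiv(A_0 \to \sigma(A_1, \ldots, A_{\mathrm{rk}(\sigma)})) = [A_0] \to \sigma([A_1], \ldots, [A_{\mathrm{rk}(\sigma)}])$,

• sets of nonterminals or rules by applying $\pi_\equiv$ elementwise, and

• rtg: $\pi_\equiv(G) = (N/{\equiv}, \Sigma, \pi_\equiv(I), \pi_\equiv(R))$.

Note that $\llbracket G \rrbracket \subseteq \llbracket \pi_\equiv(G) \rrbracket$, because by replacing each rule $r$ in a derivation tree of $G$ by $\pi_\equiv(r)$ we get a derivation tree of $\pi_\equiv(G)$. We call $\pi_\equiv$ non-trivial if $(\equiv)$ is not the identity relation. We say $\pi_\equiv$ merges $A_1$ and $A_2$ iff $A_1 \equiv A_2$. We carry over the lattice structure of the set of equivalence relations over $N$ to the set of $G$-mergers in order to identify minimal and least $G$-mergers with certain properties.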
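Applying a merger is a purely symbolic renaming of nonterminals. The sketch below assumes encodings that are not from the paper: a rule is a pair `(lhs, rhs)` with `rhs = (σ, A1, ..., Ak)`, and a merger is a dictionary that maps each nonterminal to its equivalence class, represented as a `frozenset`. The function names are hypothetical.

```python
def merge_rtg(nonterminals, initial, rules, pi):
    """Apply a merger pi (dict: nonterminal -> equivalence class) to an rtg."""

    def merge_rule(rule):
        lhs, rhs = rule
        return (pi[lhs], (rhs[0],) + tuple(pi[a] for a in rhs[1:]))

    return ({pi[a] for a in nonterminals},
            {pi[a] for a in initial},
            {merge_rule(r) for r in rules})


def is_bottom_up_deterministic(rules):
    """rhs(r1) = rhs(r2) must imply r1 = r2."""
    right_hand_sides = [rhs for _, rhs in rules]
    return len(right_hand_sides) == len(set(right_hand_sides))


example_N = {"A", "B", "S", "T"}
example_I = {"S", "T"}
example_R = {("A", ("a",)), ("B", ("b",)), ("S", ("f", "A")), ("T", ("f", "B"))}

# The merger that merges A and B and leaves S and T alone.
classes = [frozenset({"A", "B"}), frozenset({"S"}), frozenset({"T"})]
example_pi = {a: cls for cls in classes for a in cls}

merged_N, merged_I, merged_R = merge_rtg(example_N, example_I, example_R, example_pi)
print(is_bottom_up_deterministic(example_R))
print(is_bottom_up_deterministic(merged_R))
```

In this example the original rtg is bottom-up deterministic, but after merging `A` and `B` the rules for `S` and `T` share a right-hand side, so the merged rtg is not.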

To deal with prtg, we fix a corpus $c$ over $T_\Sigma$ such that $\mathrm{supp}(c) \subseteq \llbracket G \rrbracket$. We repeatedly use the maximum likelihood estimate to assign weights to an rtg. That is, the weights are not manipulated during merging itself, but they are used to choose a $G$-merger: Let $\Pi$ be a set of $G$-mergers. The best $G$-merger from $\Pi$ w.r.t. $c$ is defined as

$$\operatorname*{argmax}_{\pi \in \Pi} L\bigl(c \mid \llbracket \mathrm{mle}_{\pi(G)}(c) \rrbracket\bigr). \tag{1}$$

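So far, the presented notions are defined for general rtg. Now let $G = (N, \Sigma, I, R)$ be a bottom-up deterministic rtg. Assuming $0^0 = 1$, we then have the following, which is proven in Appendix A:

$$L\bigl(c \mid \llbracket \mathrm{mle}_G(c) \rrbracket\bigr) = \frac{\prod_{A \in N} c^{\mathrm{I}}_G(A)^{c^{\mathrm{I}}_G(A)}}{|c|^{|c|}} \cdot \frac{\prod_{r \in R} c^{\mathrm{R}}_G(r)^{c^{\mathrm{R}}_G(r)}}{\prod_{A \in N} c^{\mathrm{N}}_G(A)^{c^{\mathrm{N}}_G(A)}}. \tag{2}$$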

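Equation 2 depends only on the counts $c^{\mathrm{I}}_G$, $c^{\mathrm{R}}_G$, and $c^{\mathrm{N}}_G$, so we can evaluate Expression 1 without explicitly calculating $\mathrm{mle}_{\pi(G)}(c)$. This is the reason for calling our approach “count-based”, and this is the idea of BESTMERGER in Algorithm 1.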
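Equation 2 can be evaluated from the counts alone. The sketch below computes its logarithm, which avoids the very large powers; the convention $0^0 = 1$ becomes $0 \cdot \log 0 = 0$. The dictionary-based encoding of the counts and the function names are assumptions for illustration.

```python
import math


def xlogx(x):
    """x * log(x), with 0 * log(0) = 0 (the convention 0^0 = 1)."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood_from_counts(cR, cN, cI):
    """Logarithm of Equation 2, computed from the three count corpora."""
    # |c| equals the sum of the initial counts, since each corpus tree
    # contributes its count to exactly one root nonterminal.
    size = sum(cI.values())
    return (sum(xlogx(v) for v in cI.values()) - xlogx(size)
            + sum(xlogx(v) for v in cR.values())
            - sum(xlogx(v) for v in cN.values()))


# Counts for the corpus {f(a, a): 2, a: 1} and the rules A -> a, S -> f(A, A).
example_cR = {("A", ("a",)): 5, ("S", ("f", "A", "A")): 2}
example_cN = {"A": 5, "S": 2}
example_cI = {"S": 2, "A": 1}
print(math.exp(log_likelihood_from_counts(example_cR, example_cN, example_cI)))
```

For these counts the maximum likelihood estimate has initial weights 2/3 and 1/3 and rule weights 1, so the likelihood is $(2/3)^2 \cdot (1/3) = 4/27$, which the count-based formula reproduces.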

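Improving efficiency. Let $\pi \in \Pi$ and $G' = (N', \Sigma, I', R') = \pi(G)$. We need $c^{\mathrm{R}}_{G'}$, $c^{\mathrm{N}}_{G'}$, and $c^{\mathrm{I}}_{G'}$ to calculate $L(c \mid \llbracket \mathrm{mle}_{G'}(c) \rrbracket)$. Bottom-up determinism allows us to derive $d^t_{G'}$ for every $t \in \mathrm{supp}(c)$ by replacing every rule $r$ in $d^t_G$ by $\pi(r)$. Hence,

$$\forall x' \in X'\colon\; c^{\mathrm{X}}_{G'}(x') = \sum_{x \in X\colon x' = \pi(x)} c^{\mathrm{X}}_G(x), \tag{3}$$

where $X$ is any of $R$, $N$, or $I$ (with or without prime, italic or roman). So we can reuse the corpora related to $G$ to calculate $L(c \mid \llbracket \mathrm{mle}_{G'}(c) \rrbracket)$.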
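Equation 3 amounts to summing the old counts of all items that are mapped to the same merged item. A minimal sketch follows, assuming that counts are dictionaries and that the merger is given as a Python function; the names `merge_counts`, `pi_N`, and `pi_R` are hypothetical.

```python
from collections import Counter


def merge_counts(counts, pi):
    """Equation 3: the count of x' is the sum of the counts of all x with pi(x) = x'."""
    merged = Counter()
    for x, value in counts.items():
        merged[pi(x)] += value
    return merged


def pi_N(a):
    """Example merger on nonterminals: merge A and B into the class 'AB'."""
    return {"A": "AB", "B": "AB"}.get(a, a)


def pi_R(rule):
    """The same merger, lifted to rules (lhs, rhs)."""
    lhs, rhs = rule
    return (pi_N(lhs), (rhs[0],) + tuple(pi_N(a) for a in rhs[1:]))


example_cN = {"A": 5, "B": 3, "S": 2}
example_cR = {("A", ("a",)): 5, ("B", ("b",)): 3, ("S", ("f", "A", "B")): 2}
merged_cN = merge_counts(example_cN, pi_N)
merged_cR = merge_counts(example_cR, pi_R)
print(dict(merged_cN))
print(dict(merged_cR))
```

No corpus tree has to be revisited: the merged counts are obtained from the existing counts in time linear in the number of nonterminals and rules.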

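We may rewrite Expression 1 by dividing the likelihood by $L(c \mid \llbracket \mathrm{mle}_G(c) \rrbracket)$. Then, for many instantiations of $\pi$, many factors in the fraction cancel out. In detail: Let $(\equiv)$ be the equivalence relation underlying $\pi$, and

$$\hat{N} = \{A \in N \mid |[A]| > 1\}, \quad \hat{N}' = \pi(\hat{N}), \quad \hat{R} = \{r \in R \mid |[r]| > 1\}, \quad\text{and}\quad \hat{R}' = \pi(\hat{R}),$$

where $(\equiv)$ is extended to rules such that $r_1 \equiv r_2$ iff $\pi(r_1) = \pi(r_2)$ for every $r_1, r_2 \in R$. Then, with Equation 2 and $G' = \pi(G)$:

$$\frac{L\bigl(c \mid \llbracket \mathrm{mle}_{G'}(c) \rrbracket\bigr)}{L\bigl(c \mid \llbracket \mathrm{mle}_G(c) \rrbracket\bigr)} = \frac{\prod_{A \in \hat{N}'} c^{\mathrm{I}}_{G'}(A)^{c^{\mathrm{I}}_{G'}(A)}}{\prod_{A \in \hat{N}} c^{\mathrm{I}}_G(A)^{c^{\mathrm{I}}_G(A)}} \cdot \frac{\prod_{r \in \hat{R}'} c^{\mathrm{R}}_{G'}(r)^{c^{\mathrm{R}}_{G'}(r)}}{\prod_{r \in \hat{R}} c^{\mathrm{R}}_G(r)^{c^{\mathrm{R}}_G(r)}} \cdot \frac{\prod_{A \in \hat{N}} c^{\mathrm{N}}_G(A)^{c^{\mathrm{N}}_G(A)}}{\prod_{A \in \hat{N}'} c^{\mathrm{N}}_{G'}(A)^{c^{\mathrm{N}}_{G'}(A)}}. \tag{4}$$

Yet, finding the maximum is still expensive.

Heuristics. Assume $\hat{N} = \{A_1, A_2\}$, $\hat{R} = \emptyset$, and $\hat{N} \cap I = \emptyset$. Then, using Equation 3, the right-hand side of Equation 4 is equal to $f(c^{\mathrm{N}}_G(A_1), c^{\mathrm{N}}_G(A_2))$ where

$$f(x, y) = \frac{x^x \cdot y^y}{(x + y)^{x + y}}. \tag{5}$$

For positive $x$ and $y$, $f(x, y)$ is monotonically decreasing (cf. Appendix B). Hence, with our assumption, it is best to merge nonterminals with the least counts w.r.t. $c^{\mathrm{N}}_G$.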
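The function $f$ of Equation 5 is cheap to evaluate in log-space, which makes it usable as a ranking key for candidate pairs. The sketch below is illustrative only; the function names and the `limit` parameter are assumptions, and the brute-force enumeration of all pairs is chosen for brevity, not efficiency.

```python
import math
from itertools import combinations


def log_f(x, y):
    """log f(x, y) for f(x, y) = x^x * y^y / (x + y)^(x + y), with x, y > 0."""
    return x * math.log(x) + y * math.log(y) - (x + y) * math.log(x + y)


def rank_pairs(cN, limit):
    """Pairs of nonterminals ordered by decreasing f, truncated to limit pairs."""
    pairs = combinations(sorted(cN), 2)
    ranked = sorted(pairs, key=lambda p: log_f(cN[p[0]], cN[p[1]]), reverse=True)
    return ranked[:limit]


example_cN = {"A": 1, "B": 1, "C": 4}
print(math.exp(log_f(1, 1)))      # f(1, 1) = 1/4
print(math.exp(log_f(1, 4)))      # f(1, 4) = 256/3125
print(rank_pairs(example_cN, 1))
```

Here $f(1, 1) = 1/4$ exceeds $f(1, 4) = 256/3125$, so the pair of nonterminals with the lowest counts is ranked first.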

This may be used to guide the search for the best merger. Recall that we want to generalize the canonical (p)rtg step by step. We want to take the smallest steps possible w.r.t. loss of likelihood (cf. Expression 1), so we consider only minimal non-trivial mergers, i.e., mergers that merge exactly two nonterminals. Note that the application of larger mergers may be decomposed into a sequential application of several minimal non-trivial mergers. We can easily sort the minimal non-trivial mergers by the value of $f$ for the counts of the merged nonterminals. Note that the merger which merges the nonterminals with the lowest counts comes first. We choose a beam width $n > 0$ and select the first $n$ mergers for further investigation, assuming that the best merger is among them.

Unfortunately, a minimal non-trivial merger $\pi_\equiv$ does not necessarily result in a bottom-up deterministic rtg, but there is a least (i.e. unique minimal) merger $\pi_{\equiv'}$ such that $(\equiv) \subseteq (\equiv')$ which does (cf. Lemma 2 in Appendix C). We use Algorithm 2 to calculate this merger for each of the $n$ chosen mergers and determine the best merger from the results w.r.t. Equation 4.

Algorithm 2
Input: rtg $G = (N, \Sigma, I, R)$ and equivalence relation $(\equiv_0)$ on $N$
Output: the least $G$-merger $\pi_\equiv$ such that $\pi_\equiv(G)$ is bottom-up deterministic and $(\equiv_0) \subseteq (\equiv)$
1: $(\equiv) \leftarrow (\equiv_0)$
2: while $\pi_\equiv(G)$ is not bottom-up deterministic do
3:     find rules $r_1$ and $r_2$ in $G$ such that $\mathrm{rhs}(\pi_\equiv(r_1)) = \mathrm{rhs}(\pi_\equiv(r_2))$, but $\mathrm{lhs}(\pi_\equiv(r_1)) \neq \mathrm{lhs}(\pi_\equiv(r_2))$
4:     let $A_1 = \mathrm{lhs}(r_1)$ and $A_2 = \mathrm{lhs}(r_2)$
5:     let $(\equiv')$ be the equivalence relation s.t. $N/{\equiv'} = \bigl(N/{\equiv} \setminus \{[A_1], [A_2]\}\bigr) \cup \{[A_1] \cup [A_2]\}$
6:     replace $(\equiv)$ by $(\equiv')$
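The loop of Algorithm 2 can be sketched with a union-find structure, where each forced merge of two left-hand sides is one union operation. This is only an illustrative sketch under assumed encodings (rules as pairs `(lhs, rhs)`, the initial relation given as a list of pairs to merge); the union-find representation and the function name are choices made here, not taken from the paper.

```python
def least_deterministic_merger(nonterminals, rules, pairs):
    """Sketch of Algorithm 2.

    rules: list of (lhs, rhs) with rhs = (σ, A1, ..., Ak);
    pairs: pairs of nonterminals identified by the initial relation.
    Returns a dict mapping each nonterminal to a class representative.
    """
    parent = {a: a for a in nonterminals}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in pairs:
        parent[find(a)] = find(b)

    changed = True
    while changed:  # repeat until the merged rtg is bottom-up deterministic
        changed = False
        seen = {}  # merged right-hand side -> merged left-hand side
        for lhs, rhs in rules:
            key = (rhs[0],) + tuple(find(a) for a in rhs[1:])
            other = seen.setdefault(key, find(lhs))
            if other != find(lhs):
                # Same merged right-hand side, different left-hand sides:
                # merging the two left-hand sides is forced.
                parent[find(lhs)] = other
                changed = True
                break  # representatives changed, so rescan all rules
    return {a: find(a) for a in nonterminals}


example_N = {"A", "B", "S", "T"}
example_R = [("A", ("a",)), ("B", ("b",)), ("S", ("f", "A")), ("T", ("f", "B"))]
example_merger = least_deterministic_merger(example_N, example_R, [("A", "B")])
print(example_merger)
```

In the example, merging `A` and `B` makes the right-hand sides of the rules for `S` and `T` coincide, so the procedure also merges `S` and `T`. Since every union is forced by a determinism violation, the sketch never merges more than necessary.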

To restrict the number of considered mergers even more, it may be useful to only merge nonterminals which produce the same terminal. Note that Algorithm 2 preserves this property.

4 Practical results and outlook

There is generally a very large number of nonterminals with count 1 in the canonical rtg. Our current heuristic yields the same value for every merger which considers two such nonterminals. Especially in the first iterations of Algorithm 1, the number of mergers with the same (lowest) heuristic value far exceeds the beam width. This means it is arbitrary which mergers lie within the beam. We hope to improve the heuristic by comparing the trees which are produced by the merged nonterminals.

Using the generated grammars for parsing, we are currently only able to process sentences consisting of words seen in the training data. Even for this limited subset of sentences we are not able to improve precision or recall of brackets (Sekine and Collins, 1997) in comparison to the probabilistic context-free grammar straightforwardly obtained from the corpus. We hope this will improve with a better heuristic.

We have restricted our attention to bottom-up deterministic regular tree grammars. Thanks to this, the conceptual framework could remain relatively simple. What is unclear at this time is whether the bottom-up determinism per se restricts the potential accuracy of the models, in relation to the split-merge framework, which allows nondeterministic regular tree grammars.

References

Rafael C. Carrasco and Jose Oncina. 1994. Learning stochastic regular grammars by means of a state merging method. In Rafael C. Carrasco and Jose Oncina, editors, Grammatical Inference and Applications, volume 862 of Lecture Notes in Computer Science, pages 139–152. Springer Berlin Heidelberg.

Rafael C. Carrasco and Jose Oncina. 1999. Learning deterministic regular grammars from stochastic samples in polynomial time. RAIRO – Theoretical Informatics and Applications, 33(1):1–19.

Rafael C. Carrasco, Jose Oncina, and Jorge Calera-Rubio. 2001. Stochastic inference of regular tree languages. Machine Learning, 44(1-2):185–197.

Eugene Charniak. 1996. Tree-bank grammars. In Proc. of AAAI/IAAI 1996, volume 2, pages 1031–1036.

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38.

Henning Fernau. 2002. Learning tree languages from text. In Jyrki Kivinen and Robert H. Sloan, editors, Computational Learning Theory, volume 2375 of Lecture Notes in Computer Science, pages 153–168. Springer Berlin Heidelberg.

Ferenc Gécseg and Magnus Steinby. 1984. Tree Automata. Akadémiai Kiadó, Budapest, Hungary.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proc. of ACL 2003, volume 1, pages 423–430.

Takuya Matsuzaki, Yusuke Miyao, and Jun’ichi Tsujii. 2005. Probabilistic CFG with latent annotations. In Proc. of ACL 2005, pages 75–82.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proc. of COLING/ACL 2006, pages 433–440.

Detlef Prescher. 2004. A tutorial on the expectation-maximization algorithm including maximum-likelihood estimation and EM training of probabilistic context-free grammars. CoRR, abs/cs/0412015.

Juan Ramón Rico-Juan, Jorge Calera-Rubio, and Rafael C. Carrasco. 2000. Probabilistic k-testable tree languages. In Arlindo L. Oliveira, editor, Grammatical Inference: Algorithms and Applications, volume 1891 of Lecture Notes in Computer Science, pages 221–228. Springer Berlin Heidelberg.

Juan Ramón Rico-Juan, Jorge Calera-Rubio, and Rafael C. Carrasco. 2002. Stochastic k-testable tree languages and applications. In Pieter Adriaans, Henning Fernau, and Menno van Zaanen, editors, Grammatical Inference: Algorithms and Applications, volume 2484 of Lecture Notes in Computer Science, pages 199–212. Springer Berlin Heidelberg.

A The likelihood of the maximum likelihood estimate in the bottom-up deterministic case (Equation 2)

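Let $G = (N, \Sigma, I, R)$ be a bottom-up deterministic rtg and $c$ a corpus over $T_\Sigma$ such that $\mathrm{supp}(c) \subseteq \llbracket G \rrbracket$. Let $P = (G, \iota, \rho)$ such that $\llbracket P \rrbracket = \mathrm{mle}_G(c)$. We can transform $L(c \mid \llbracket P \rrbracket)$ as follows, assuming $0^0 = 1$:

$$
\begin{aligned}
L(c \mid \llbracket P \rrbracket)
&= \prod_{t \in \mathrm{supp}(c)} \llbracket P \rrbracket(t)^{c(t)}
  && \text{(def. of } L\text{)} \\
&= \prod_{t \in \mathrm{supp}(c)} \Bigl( \sum_{d \in D_P(t)} \llbracket P \rrbracket(d) \Bigr)^{c(t)}
  && \text{(def. of } \llbracket P \rrbracket\text{)} \\
&= \prod_{t \in \mathrm{supp}(c)} \llbracket P \rrbracket(d^t_G)^{c(t)}
  && \text{(assumptions for } G \text{ and } c\text{)} \\
&= \prod_{t \in \mathrm{supp}(c)} \Bigl( \iota(\mathrm{lhs}(d^t_G(\varepsilon))) \cdot \prod_{w \in \mathrm{pos}(t)} \rho(d^t_G(w)) \Bigr)^{c(t)}
  && \text{(def. of } \llbracket P \rrbracket\text{)} \\
&= \prod_{t \in \mathrm{supp}(c)} \Bigl( \iota(\mathrm{lhs}(d^t_G(\varepsilon)))^{c(t)} \cdot \prod_{w \in \mathrm{pos}(t)} \rho(d^t_G(w))^{c(t)} \Bigr)
  && \text{(distributivity)} \\
&= \Bigl( \prod_{t \in \mathrm{supp}(c)} \iota(\mathrm{lhs}(d^t_G(\varepsilon)))^{c(t)} \Bigr) \cdot \Bigl( \prod_{t \in \mathrm{supp}(c)} \prod_{w \in \mathrm{pos}(t)} \rho(d^t_G(w))^{c(t)} \Bigr)
  && \text{(commutativity)} \\
&= \Bigl( \prod_{A \in N} \prod_{\substack{t \in \mathrm{supp}(c)\colon \\ A = \mathrm{lhs}(d^t_G(\varepsilon))}} \iota(A)^{c(t)} \Bigr) \cdot \Bigl( \prod_{r \in R} \prod_{t \in \mathrm{supp}(c)} \prod_{\substack{w \in \mathrm{pos}(t)\colon \\ r = d^t_G(w)}} \rho(r)^{c(t)} \Bigr)
  && \text{(commutativity)} \\
&= \Bigl( \prod_{A \in N} \iota(A)^{c^{\mathrm{I}}_G(A)} \Bigr) \cdot \prod_{r \in R} \rho(r)^{\sum_{t \in \mathrm{supp}(c)} \sum_{w \in \mathrm{pos}(t)\colon r = d^t_G(w)} c(t)}
  && (b^c \cdot b^d = b^{c+d},\ 0^0 = 1, \text{ def. of } c^{\mathrm{I}}_G) \\
&= \Bigl( \prod_{A \in N} \iota(A)^{c^{\mathrm{I}}_G(A)} \Bigr) \cdot \prod_{r \in R} \rho(r)^{\sum_{t \in \mathrm{supp}(c)} c(t) \cdot |\{w \in \mathrm{pos}(t) \mid r = d^t_G(w)\}|}
  && \text{(distributivity)} \\
&= \Bigl( \prod_{A \in N} \iota(A)^{c^{\mathrm{I}}_G(A)} \Bigr) \cdot \prod_{r \in R} \rho(r)^{c^{\mathrm{R}}_G(r)}
  && \text{(def. of } c^{\mathrm{R}}_G\text{)} \\
&= \frac{\prod_{A \in N} c^{\mathrm{I}}_G(A)^{c^{\mathrm{I}}_G(A)}}{\prod_{A \in N} |c|^{c^{\mathrm{I}}_G(A)}} \cdot \frac{\prod_{r \in R} c^{\mathrm{R}}_G(r)^{c^{\mathrm{R}}_G(r)}}{\prod_{r \in R} c^{\mathrm{N}}_G(\mathrm{lhs}(r))^{c^{\mathrm{R}}_G(r)}}
  && \text{(def. of } \iota \text{ and } \rho\text{, comm., distr.)} \\
&= \frac{\prod_{A \in N} c^{\mathrm{I}}_G(A)^{c^{\mathrm{I}}_G(A)}}{|c|^{|c|}} \cdot \frac{\prod_{r \in R} c^{\mathrm{R}}_G(r)^{c^{\mathrm{R}}_G(r)}}{\prod_{A \in N} c^{\mathrm{N}}_G(A)^{c^{\mathrm{N}}_G(A)}}
  && (b^c \cdot b^d = b^{c+d}, \text{ def. of } c^{\mathrm{I}}_G \text{ and } c^{\mathrm{N}}_G\text{, comm.)}
\end{aligned}
$$

This proves Equation 2.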

B The function in Equation 5 is monotonically decreasing

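We may transform Equation 5 as follows:

$$f(x, y) = \frac{x^x \cdot y^y}{(x + y)^{x + y}} = \Bigl(\frac{x}{x + y}\Bigr)^{x} \cdot \Bigl(\frac{y}{x + y}\Bigr)^{y}.$$

For positive arguments, the partial derivatives of $f$ are

$$\frac{\partial f(x, y)}{\partial x} = \ln\Bigl(\frac{x}{x + y}\Bigr) \cdot \Bigl(\frac{x}{x + y}\Bigr)^{x} \cdot \Bigl(\frac{y}{x + y}\Bigr)^{y}, \quad\text{and}$$

$$\frac{\partial f(x, y)}{\partial y} = \ln\Bigl(\frac{y}{x + y}\Bigr) \cdot \Bigl(\frac{x}{x + y}\Bigr)^{x} \cdot \Bigl(\frac{y}{x + y}\Bigr)^{y}.$$

For $x, y > 0$ the fractions are smaller than one. This means the logarithms are negative, and hence so are both partial derivatives. So $f$ is monotonically decreasing.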

C Properties of mergers regarding bottom-up determinism

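Lemma 1. Let $G = (N, \Sigma, I, R)$ be an rtg, and let $(\equiv_1)$ and $(\equiv_2)$ be equivalence relations over $N$ such that $\pi_{\equiv_1}(G)$ and $\pi_{\equiv_2}(G)$ are bottom-up deterministic. Let $(\equiv) = (\equiv_1) \cap (\equiv_2)$. Then $\pi_\equiv(G)$ is also bottom-up deterministic.

Proof. Assume $\pi_\equiv(G)$ is not bottom-up deterministic. Then there are two rules $A_0 \to \sigma(A_1, \ldots, A_{\mathrm{rk}(\sigma)})$ and $B_0 \to \sigma(B_1, \ldots, B_{\mathrm{rk}(\sigma)})$ in $R$ such that $A_i \equiv B_i$ for every $1 \leq i \leq \mathrm{rk}(\sigma)$, but $A_0 \not\equiv B_0$. Hence, $A_i \equiv_1 B_i$ and $A_i \equiv_2 B_i$ for every $1 \leq i \leq \mathrm{rk}(\sigma)$, and therefore, since $\pi_{\equiv_1}(G)$ and $\pi_{\equiv_2}(G)$ are bottom-up deterministic, $A_0 \equiv_1 B_0$ and $A_0 \equiv_2 B_0$. This implies $A_0 \equiv B_0$, which is a contradiction, so $\pi_\equiv(G)$ is bottom-up deterministic. q.e.d.

Lemma 2. Let $G = (N, \Sigma, I, R)$ be an rtg, and let $(\equiv)$ be an equivalence relation over $N$. Then there is a least (i.e. unique minimal) $(\hat{\equiv})$ such that $(\equiv) \subseteq (\hat{\equiv})$ and $\pi_{\hat{\equiv}}(G)$ is bottom-up deterministic.

Proof. Existence: Consider $(\equiv')$ such that $\forall A_1, A_2 \in N\colon A_1 \equiv' A_2$. Then $(\equiv')$ satisfies the conditions. Uniqueness: Assume there is a minimal $(\equiv') \neq (\hat{\equiv})$ satisfying the conditions. Then, by Lemma 1, $(\equiv') \cap (\hat{\equiv})$ would also satisfy the conditions, which contradicts that $(\equiv')$ and $(\hat{\equiv})$ are minimal. Hence,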