A Note on Sequential Rule Based POS Tagging

(1)

A Note on Sequential Rule-Based POS Tagging

Sylvain Schmitz

LSV, ENS Cachan & CNRS, Cachan, France

Abstract

Brill’s part-of-speech tagger is defined through a cascade of leftmost rewrite rules. We revisit the compilation of such rules into a single sequential transducer given by Roche and Schabes (Comput. Ling. 1995) and provide a direct construction of the minimal sequential transducer for each individual rule.

Keywords. Brill Tagger; Sequential Trans-ducer; POS Tagging

1 Introduction

Part-of-speech (POS) tagging consists in assigning the appropriate POS tag to a word in the context of its sentence. The program that performs this task, the POS tagger, can be learned from an annotated corpus in case of supervised learning, typically us-ing hidden Markov model-based or rule-based tech-niques. The most famous rule-based POS tagging technique is due to Brill (1992). He introduced a three-parts technique comprising:

1. alexical tagger, which associates a unique POS tag to each word from an annotated training corpus. This lexical tagger simply associates to each known word its most probable tag ac-cording to the training corpus annotation, i.e. a unigram maximum likelihood estimation; 2. anunknown word tagger, which attempts to tag

unknown words based on suffix or capitaliza-tion features. It works like the contextual tag-ger, using the presence of a capital letter and bounded sized suffixes in its rules: for instance in English, a-ablesuffix usually denotes an ad-jective;

3. acontextual tagger, on which we focus in this paper. It consists of a cascade of string rewrite rules, calledcontextual rules, which correct tag assignments based on some surrounding con-texts.

In this note, we revisit the proof that contextual rules can be translated into sequential transducers1 proposed by Roche and Schabes (1995): whereas Roche and Schabes give a separate proof of sequen-tiality and exercise it to show that their constructed non-sequential transducer can be determinized (at the expense of a worst-case exponential blow-up), we give a direct translation of a contextual rule into the minimal normalized sequential transducer, by adapting Simon (1994)’s string matching automa-ton to the transducer case. Our resulting sequential transducers are of linear size (before their composi-tion). A similar construction can be found in (Mi-hov and Schultz, 2007), but no claim of minimality is made there.

2 Contextual Rules

We start with an example by Roche and Schabes (1995): Let us suppose the following sentences were tagged by the lexical tagger (using the Penn Tree-bank tagset):

∗_{Chapman/NNP killed/VBN John/NNP Lennon/NNP} ∗_{John/NNP Lennon/NNP was/VBD shot/VBD by/IN}

Chapman/NNP

He/PRP witnessed/VBD Lennon/NNP killed/VBN by/IN Chapman/NNP

1_{Historically, what we call here “sequential” used to be} called “subsequential” (Sch¨utzenberger, 1977), but we follow the more recent practice initiated by Sakarovitch (2009).

(2)

There are mistakes in the first two sentences: killed

should be tagged as a past tense form “VBD”, and

shotas a past participle form “VBN”.

The contextual tagger learns contextual rules over some tagsetΣof formuav→ubv(ora→b / u v using phonological rule notations (Kaplan and Kay, 1994)), meaning that the tag arewrites to b in the context of u v, where the context is of length

|uv| bounded by some fixed k + 1; in practice, k = 2ork = 3(Brill (1992) and Roche and Sch-abes (1995) use slightly differenttemplatesthan the one parametrized by k we present here). For in-stance, a first contextual rule could be “nnp vbn→

nnp vbd” resulting in a new tagging

Chapman/NNP killed/VBD John/NNP Lennon/NNP ∗_{John/NNP Lennon/NNP was/VBD shot/VBD by/IN}

Chapman/NNP

∗_{He/PRP witnessed/VBD Lennon/NNP killed/VBD} by/IN Chapman/NNP

A second contextual rule could be “vbd in →

vbn in” resulting in the correct tagging

Chapman/NNP killed/VBD John/NNP Lennon/NNP John/NNP Lennon/NNP was/VBD shot/VBN by/IN

Chapman/NNP

He/PRP witnessed/VBD Lennon/NNP killed/VBN by/IN Chapman/NNP

As stated before, our goal is to compile the entire sequence of contextual rules learned from a corpus into a single sequential function.

Let us first formalize the semantics we will em-ploy in this note for Brill’s contextual rules.2 _Let

C=r1r2· · ·rnbe a finite sequence of string rewrite rules inΣ∗_×_Σ∗ _with_Σ_{a POS tagset of}_{fixed size}_.

In practice the rules constructed in Brill’s contextual tagger arelength-preservingand1-change-bounded, i.e. they modify a single letter, but this is not a useful consideration for our transducer construction. Each rule_ri ri =ui →videfines aleftmost rewrite relation

=⇒

lm defined by

w=ri⇒

lm w

0 _iff_∃_{x, y}_∈_Σ∗_{, w}₌_xu

iy∧w0 =xviy

∧ ∀z, z0_∈_Σ∗_{, w}₆₌_zu

iz0∨x≤pref z 2_{This is not exactly the semantics assumed by either Brill} nor Roche and Schabes, who used iterated-application seman-tics, resp. contextual and non contextual, instead of the single-application semantics we use here. This has little practical con-sequence.

wherex≤pref zdenotes thatxis a prefix ofz. Note that the domain of=ri⇒

lm isΣ

∗_·_u_i_·_Σ∗_{. The}_behavior

of a single rule is then the relationJriKincluded in

Σ∗_×_Σ∗ _{defined by}_J_r

iK = =_lmri⇒ ∪ IdΣ∗_\_(Σ∗_·_ui_·_Σ∗₎,

i.e. it applies=ri⇒

lm onΣ

∗_·_u_i_·_Σ∗ _{and the identity on}

its complementΣ∗_\_(Σ∗_·_u_i_·_Σ∗₎_{. The behavior of}

C is then the composition JCK = Jr1K # Jr2K #· · ·#

JrnK. Note that this behavior doesnot employ the transitive closure of the rewriting rules.

A naive implementation ofCwould try to match each ui at every position of the input string w in

Σ∗_{, resulting in an overall complexity of} _O₍_|_w_{| ·}

P

i|ui|). One often faces the problem of tag-ging asetof sentences{w1, . . . , wm}, which yields O((P_i|ui|)·(Pj|wj|)). As shown in Roche and Schabes’ experiments, compilingCinto a single se-quential transducerT results in practice in huge sav-ings, with overall complexities inO(|w|+|T |)and O(|T |+P_j|wj|)respectively.

EachJriKis a rational function, being the union of two rational functions over disjoint domains. Let|ri| be the length|uivi| ≤k. Roche and Schabes (1995, Sec. 8.2) provide a construction of an exponential-sized transducer Tri for each JriK, and compute their compositionTC of size|TC| = O(Qn_i₌₁2|ri|).

As they show that each JriK is actually a sequen-tial function, their composition JCKis also sequen-tial, and TC can be determinized to yield a

se-quential transducer T of size doubly exponential inPn_i₌₁|ri| ≤ nk (see Roche and Schabes, 1995, Sec. 9.3). By contrast, our construction directly yields linear-sized minimal sequential transducers for each JriK, resulting in a final sequential trans-ducer of sizeO(Qn_i₌₁|ri|) =O(2nlogk).

3 Sequential Transducer of a Rule

Intuitively, the sequential transducer for JriKis re-lated to the string matching automaton (Simon, 1994; Crochemore and Hancart, 1997) for ui, i.e. the automaton for the languageΣ∗_u

i. This insight yields adirectconstruction of the minimal sequen-tial transducer of a contextual rule, with at most

(3)

3.1 Preliminaries

Overlaps, Borders (see e.g. Crochemore and Hancart, 1997, Sec. 6.2). The overlap ov(u, v) of two wordsuandvis the longest suffix ofuwhich is simultaneously a prefix ofv. A worduis aborder

of a wordvif it is both a prefix and a suffix ofv, i.e. if there existv1,v2inΣ∗ such thatv =uv1 =v2u. Forv 6= ε, the longest border ofvdifferent fromv itself is denotedbord(v).

Fact 1. For allu, vinΣ∗ _and_a_in_Σ_,_ov(_{ua, v}_{) =} ov(u, v)·aifov(u, v)·a≤pref vandov(ua, v) =

bord(ov(u, v)·a)otherwise.

Sequential Transducers (see e.g. Sakarovitch, 2009, Sec.V.1.2). Formally, a sequential transducer fromΣto∆is a tupleT = hQ,Σ,∆, q0, δ, η, ι, ρi

whereδ :Q×Σ→Qis a partial transition function, η:Q×Σ→∆∗_{a partial transition output function}

with the same domain asδ, i.e.dom(δ) = dom(η), ι ∈ ∆∗ _{is an initial output, and} _ρ _: _Q _→ _∆∗

is a partial final output function. T defines a par-tial sequential function JTK : Σ∗ _→ _∆∗ _with

JTK(w) =ι·η(q0, w)·ρ(δ(q0, w))for allwinΣ∗for which δ(q0, w) andρ(δ(q0, w))are defined, where η(q, ε) = εandη(q, wa) = η(q, w)·η(δ(q, w), a)

for allwinΣ∗_and_a_in_Σ_.

Let us noteT₍_q₎for the sequential transducer with qfor initial state. We writeu∧vfor the longest com-mon prefix of stringsu andv; the longest common prefix of all the outputs from state q can be writ-ten formally asV_v_∈_Σ∗JT₍_q₎K(v). A sequential

trans-ducer is normalizedif this value isεfor allq ∈ Q such that dom(JT₍_q₎K) 6= ∅, i.e. if the transducer outputs symbols as soon as possible; any sequen-tial transducer can be normalized. The translation

of a sequential function f by a word w in Σ∗ _is

the sequential function w−1_f _with_dom(_w−1_f_{) =} w−1_dom(_f₎ _and _w−1_f₍_u_{) = (}V

v∈Σ∗f(wv))−1 ·

f(wu) for all u in dom(w−1_f₎_{. As in the finite} automata case where minimal automata are isomor-phic with residual automata, the minimal sequen-tial transducer for a sequensequen-tial functionf is defined as thetranslation transducerhQ,Σ,∆, q0, δ, η, ι, ρi, where Q = {w−1_f _| _w _∈ _Σ∗_} _{(which is finite),}

q0 =ε−1f,ι=Vv∈Σ∗f(v)ifdom(f)6=∅andι=

εotherwise,δ(w−1_{f, a}_{) = (}_wa₎−1_f_,_η₍_w−1_{f, a}_{) =} V

v∈Σ∗(w−1f)(av) if dom((wa)−1f) 6= ∅ and η(w−1_{f, a}_{) =} _ε _{otherwise, and} _ρ₍_w−1_f_{) =}

(w−1_f₎₍_ε₎_if_ε_∈_dom(_w−1_f₎_{, and is otherwise} un-defined.

3.2 Main Construction

Here is the definition of our transducer for a contex-tual rule (see Fig.1):

Definition 2(Transducer of a Contextual Rule). The sequential transducer Tr associated with a contex-tual rule r = u → v with u 6= ε is defined as

Tr=hpref(u),Σ,Σ, ε, δ, η, ε, ρiwith the set of pre-fixes of u as state set, ε as initial state and initial output, and for allainΣandwinpref(u),

δ(w, a)=

    

wa ifwa≤prefu

w ifw=u

bord(wa) otherwise

ρ(w)=

    

ε ifw≤pref (u∧v)

(u∧v)−1_w _if₍_u_∧_v₎_<_pref _{w <}_pref _u ε otherwise, i.e. ifw=u

η(w, a)=

                    

a ifwa≤pref (u∧v) ε if(u∧v)<prefwa<pref u

(u∧v)−1_v _if_wa₌_u

a ifw=u

ρ(w)a·ρ(bord(wa))−1 otherwise.

It remains to show that this sequential transducer is indeed the minimal normalized sequential trans-ducer forJrK.

Proposition 3(Correctness). Letr = u → v with

u6=ε. ThenJTrK=JrK.

Proof. Let us first consider the case of input words inΣ∗_\_(Σ∗_·_u_·_Σ∗₎_:

Claim3.1. For allwinΣ∗_\_(Σ∗_·_u_·_Σ∗₎_,_δ₍_{ε, w}_{) =} ov(w, u)andη(ε, w) =w·ρ(ov(w, u))−1_.

By induction onw: sinceu 6= ε, the base case is w = εwithδ(ε, ε) = ε = ov(ε, u) andη(ε, ε) =

(4)

ε ε

ε

a

ε

ab

ε

aba

a

abab

ab

ababb

ε b:b

a:a a:a

b:b b:b

a:ε a:aa

b:ε b:bbb a:ab

[image:4.612.126.489.59.150.2]

a:a, b:b

Figure 1: The sequential transducer constructed forababb→abbbb.

andainΣ:

δ(ε, wa) =δ(δ(ε, w), a)i.h.= δ(ov(w, u), a)

Fact 1_{= ov(}_{wa, u}₎

η(ε, wa) =η(ε, w)·η(δ(ε, w), a)

i.h.₌ _w_·_ρ₍_δ₍_{ε, w}₎₎−1_·_η₍_δ₍_{ε, w}₎_{, a}₎

=w·ρ(w0₎−1_·_η₍_w0_{, a}_{) ;}

(by settingw0 ₌_δ₍_{ε, w}₎₎

we need to do a case analysis for this last equation:

Casew0_a_6≤

pref u Then η(w0, a) = ρ(w0) · a ·

ρ(border(w0_a₎₎−1_{, which yields} _η₍_{ε, wa}_{) =} w·ρ(w0₎−1_·_ρ₍_w0₎_·_a_·_ρ₍_δ₍_{ε, wa}₎₎−1 ₌_wa_· ρ(δ(ε, wa))−1_.

Casew0_{a <}

pref u Then δ(ε, wa) = w0a, and

we need to further distinguish between several cases:

w0_a_≤

pref (u∧v) then ρ(w0) = ε,

η(w0_{, a}_{) =} _{a, and} _ρ₍_w0_a_{) =} _ε,

thus η(ε, wa) = wa = wa · ε−1 ₌ wa·ρ(w0_a₎−1_,

w0 _{= (}_u_∧_v₎ _then_ρ₍_w₀_{) =} _ε,_η₍_w₀_{, a}_{) =}_ε,

and ρ(w0_a_{) = (}_u _∧ _v₎−1 _· _w0_a ₌ _a,

η(ε, wa) = w = wa · a−1 ₌ _wa _· ρ(w0_a₎−1_,

(u∧v) <pref w0 thenρ(w0) = (u∧v)−1 ·

w0_, _η₍_w0_{, a}_{) =} _{ε, and} _ρ₍_w0_a_{) = (}_u _∧

v)−1_·_w0_{a, thus}_η₍_{ε, wa}_{) =}_w_·₍₍_u_∧_v₎−1_· w0₎−1 ₌_wa_·_a−1_·₍₍_u_∧_v₎−1_·_w0₎−1 ₌ wa·ρ(w0_a₎−1_.

The claim yields thatJTrKcoincides with JrKon words in Σ∗_\_(Σ∗ _·_u_·_Σ∗₎_{, i.e. is the identity over} Σ∗_\_(Σ∗ _·_u _·_Σ∗₎_{. Then, since} _u ₆₌ _{ε, a word in} Σ∗_·_u_·_Σ∗_{can be written as}_waw0_with_w_in_Σ∗_\_(Σ∗_·

u·Σ∗₎_,_a_in_Σ_with_wa_in_Σ∗_·_{u, and}_w0_in_Σ∗_{. Let}

u = u0_{a; the claim implies that} _δ₍_{ε, w}_{) =} _u0 _and

η(ε, w) = w·ρ(u0₎−1_{. Thus, by definition of}_T r, δ(ε, wa) =u0_a₌ _u_{and thus}_η₍_{ε, wa}_{) =} _η₍_{ε, w}₎_·

η(u0_{, a}_{) =}_w_·_ρ₍_u0₎−1_·₍_u_∧_v₎−1_·_v;

if(u∧v) <pref u0 η(ε, wa) = w·((u∧v)−1·

u0₎−1_·₍_u_∧_v₎−1_·_v₌_w_·_u0−1_·_v₌_wa_·_u−1_·_v;

otherwise i.e. ifu0 _{= (}_u_∧_v₎_:_η₍_{ε, wa}_{) =}_w_·_u0−1_· v=wa·u−1_·_v.

Thus in all casesJTrK(wa) =JrK(wa), and sinceTr starting in stateu(i.e.Tr(u)) implements the identity overΣ∗_{, we have more generally}_J_T

rK=JrK.

Lemma 4(Normality). Letr =u→ v. ThenTris

normalized.

Proof. Let w ∈ Prefix(u) be a state of Tr; let us show thatVJT_r₍_w₎K(Σ∗_{) =}_ε.

If(u∧v) <pref w <pref u let u0 =

w−1_u _∈ _Σ+_{, and consider the two} out-putsJT_r₍_w₎K(u0_{) =}_η₍_{w, u}0₎_ρ₍_u_{) = (}_u_∧_v₎−1_v and JTr(w)K(ε) = ρ(w) = (u ∧ v)−1w. Since (u ∧ v) <pref u we can write u as

(u ∧v)au00_u0_{, and either} _v _{= (}_u _∧_v₎_bv0 _or

v = u ∧v, for some a 6= b inΣ and u00_{, v}0

inΣ∗_{; this yields} _w _{= (}_u _∧_v₎_au00 _{and thus}

JT_r₍_w₎K(u0₎_∧_J_T

r(w)K(ε) =ε.

otherwise ρ(w) =ε, which yields the lemma.

Proposition 5 (Minimality). Letr = u → v with

u6=εandu6=v. ThenTris the minimal sequential

transducer forJrK.

Proof. Let w <pref w0 be two different states in

Prefix(u); we proceed to prove that Jw−1_T rK 6=

Jw0−1_T_r_K_{, hence that no two states of} _T_r _can be merged. By Thm. 4 it suffices to prove that

(5)

such thatJTr(w)K(x) =6 JTr(w0₎K(x). We perform a

case analysis:

ifw0 _≤

pref (u∧v) then w <pref (u ∧ v) thus

JT_r₍_w₎K(x) = xfor all x 6∈ w−1_·_Σ∗_·_u_·_Σ∗_;

considerJT_r₍_w₎K(w0−1_u_{) =}_w0−1_u₆₌_w0−1_v₌

JTr(w0₎K(w0−1u);

ifw≤pref (u∧v)andw0 =u then

JTr(w0₎K(x) = x for all x and we

con-sider JT_r₍_w₎K(w−1_u_{) =} _w−1_v ₆₌ _w−1_v ₌

JT_r₍_w0₎K(w−1u);

otherwise that is ifw≤pref (u∧v)and(u∧v)<pref w0 _<_pref _{u, or}₍_u_∧_v₎_<_pref _{w <}_pref _w0 _≤_pref

u, we have ρ(w) 6= ρ(w0₎ _thus _J_T

r(w)K(ε) 6=

JT_r₍_w0₎K(ε).

4 Conclusion

The results of the previous section yield (the cases u=εandu=vare trivial):

Theorem 6. Given a contextual rule r = u → v, one can construct directly the minimal normalized sequential transducerTrof sizeO(|r|)forJrK.

The remaining question is whether we can ob-tain better upper bounds on the size of the sequen-tial transducer TC for a cascade C = r1· · ·rn than O(2nlogk₎_{. It turns out that there are cascades of} length n for which no sequential transducer with a subexponential (inn) number of states can exist, thus our construction is close to optimal.

References

Eric Brill. 1992. A simple rule-based part of speech tag-ger. InANLP ’92, pages 152–155. ACL Press. Maxime Crochemore and Christophe Hancart. 1997.

Au-tomata for matching patterns. In Grzegorz Rozenberg and Arto Salomaa, editors,Handbook of Formal Lan-guages, volume 2. Linear Modeling: Background and Application, chapter 9, pages 399–462. Springer. Ronald M. Kaplan and Martin Kay. 1994. Regular

mod-els of phonological rule systems. Comput. Linguist., 20(3):331–378.

Stoyan Mihov and Klaus U. Schultz. 2007. Effi-cient dictionary-based text rewriting using subsequen-tial transducers.Nat. Lang. Eng., 13(4):353–381. Emmanuel Roche and Yves Schabes. 1995.

Determinis-tic part-of-speech tagging with finite-state transducers.

Comput. Linguist., 21(2):227–253.

Jacques Sakarovitch. 2009. Elements of Automata The-ory. Cambridge University Press.

Marcel-Paul Sch¨utzenberger. 1977. Sur une variante des fonctions s´equentielles. Theor. Comput. Sci., 4(1):47– 57.