Processing Structured Input with Skipping Nested Automata


Proceedings of the 11th International Conference on Finite State Methods and Natural Language Processing, pages 30–34,


Dominika Pawlik, University of Warsaw, Institute of Mathematics

Aleksander Zabłocki, University of Warsaw, Institute of Informatics

Bartosz Zaborowski, Polish Academy of Sciences, Institute of Computer Science

{dominika, olekz}@mimuw.edu.pl, [email protected]

Abstract

We propose a new kind of finite-state automata, suitable for structured input characters corresponding to unranked trees of small depth. As a motivating application, we consider executing morphosyntactic queries on a richly annotated text corpus.

1 Introduction

Efficient lookup in natural language corpora becomes increasingly important as they grow in size. We focus on complex queries, involving regular expressions over segments and specifying their syntactic attributes (e.g. “find all sequences of five nouns or gerunds in a row”). For such queries, indexing the corpus is in general not sufficient; finite-state devices are then the natural tool to use.

The linguistic annotation of text seems to be growing ever more complex. For example, German articles in the IMS Workbench are annotated with sets of readings, i.e. possible tuples of grammatical case, number and gender (an example is shown in Fig. 1a). Several other big corpora are stored in XML files, making it easy to extend their annotation in the future if so desired.

Hence, we consider a general setting where the segments are tree-shaped feature structures of fixed type, with list-valued attributes allowed (see Fig. 1a). A corpus query in this model should be a regular expression over segment specifications, which are in turn Boolean combinations of attribute specifications, with quantifiers used in the case of list-valued attributes. We may also wish to allow specifying string-valued attributes by regular expressions. For instance, der Tisch (‘the table’, masculine) is a match for the expression¹

$$\bigl((\mathrm{POS} \neq \mathrm{N} \,\lor\, \mathrm{WORD} = \texttt{".*sch"}) \,\land\, \exists i \in \mathrm{READ}\,(i.\mathrm{GEND} = \mathrm{m} \land i.\mathrm{NUMB} = \mathrm{sg})\bigr)^{*}$$

describing sequences of segments having a masculine-singular reading, in which all nouns end with sch.

¹ This example query is rather useless by itself; however, it compactly demonstrates several features which (variously combined) have been found needed in NLP applications.
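To make the semantics of the example query concrete, the following minimal sketch evaluates it naively on segments encoded as nested dictionaries. The encoding, the names `segment_ok` and `matches`, and the sample readings listed for Tisch and Tasse are illustrative assumptions, not part of the proposed method.

```python
import re

# Hypothetical dict encoding of segments (cf. Fig. 1a); readings are abridged.
der = {"WORD": "der", "POS": "D",
       "READ": [{"GEND": "m", "CASE": "N", "NUMB": "sg"},
                {"GEND": "f", "CASE": "G", "NUMB": "sg"}]}
tisch = {"WORD": "Tisch", "POS": "N",
         "READ": [{"GEND": "m", "CASE": "N", "NUMB": "sg"}]}
tasse = {"WORD": "Tasse", "POS": "N",
         "READ": [{"GEND": "f", "CASE": "N", "NUMB": "sg"}]}


def segment_ok(seg):
    """(POS != N or WORD matches ".*sch") and some reading is masculine singular."""
    noun_rule = seg["POS"] != "N" or re.fullmatch(r".*sch", seg["WORD"]) is not None
    has_m_sg = any(r["GEND"] == "m" and r["NUMB"] == "sg" for r in seg["READ"])
    return noun_rule and has_m_sg


def matches(segments):
    """The outer star: every segment of the sequence must satisfy the specification."""
    return all(segment_ok(seg) for seg in segments)
```

With these sample data, `matches([der, tisch])` holds, while any sequence containing `tasse` is rejected because that noun neither ends with sch nor has a masculine reading.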

In this paper, we propose an adjustment of existing finite-state solutions suited for queries in such a model. Our crucial assumption is that the input, although unranked, has a reasonably bounded depth, e.g. by 10. This is the case in morphosyntactic analysis (but often not in syntactic parsing).

2 Related Work

There are two existing approaches which seem promising for regex matching over structured alphabets. As we will see, each has an advantage over the other.

FSAP model. The first model relies on finite-state automata with predicates assigned to their edges (FSAP; van Noord and Gerdemann, 2001). (An FSAP runs as follows: in a single step, it tests which of the edges leaving the current states are labeled with a predicate satisfied by the current input symbol, and non-deterministically follows these edges.) In our case, pattern matching can be realized by an FSAP over the infinite alphabet of segments: one segment becomes one input symbol, and segment specifications become predicates.

As shown by van Noord and Gerdemann, FSAPs are potentially efficient as they admit determinization. However, this involves some Boolean operations on the predicates used; as a result, testing predicates for segments might become involved. For example, a non-optimized (purely syntax-driven) evaluation of

$$\exists x \in \mathrm{READ}\;\, x.\mathrm{CASE} = \mathrm{N} \;\land\; \exists y \in \mathrm{READ}\;\, y.\mathrm{CASE} = \mathrm{G} \tag{1}$$

[Figure 1: diagrams only partly recoverable from the extraction; the recoverable content of the panels is listed below.]
(a) segment [ WORD: list ⟨d, e, r⟩; POS: D; READ: list { reading [GEND: m, CASE: N, NUMB: sg], reading [GEND: f, CASE: G, NUMB: sg] } ]
(b) The same structure drawn as a tree, with J at every inner node and K as the last child of each.
(c) Prefix order: J J d e r K D J J m N sg K J f G sg K K K #
(d) Breadth-first order, by depth: depth 0: J #; depth 1: J D J K; depth 2: d e r K J J K; depth 3: m N sg K f G sg K

Figure 1: A sample segment for the ambiguous German article der, which can be masculine nominative as well as feminine genitive (for brevity, its other readings are ignored), presented as a typed feature structure (a) and as an unranked tree (b). Note that strings are treated as lists, which ensures finiteness of the alphabet. We adjust the tree representation by labeling every non-leaf with J and appending a new leaf labeled with K to its children. Parts (c) and (d) show respectively the prefix-order and breadth-first linearizations of the tree. For legibility, we display them stratified wrt. the original depth.

scans the list of readings twice, once for each quantifier. As we will see, the other approach is free of this problem.

VPA model. Instead of treating segments as single symbols, we might treat our input as a bounded-depth ordered unranked tree over a finite alphabet (see Fig. 1b), which is a common perspective in the theory of XML processing and tree automata. Recently, several flavours of determinizable tree automata for unranked trees have been proposed, see (Comon et al., 2007), (Murata, 2001) and (Gauwin et al., 2008). Under our assumptions, all these models are practically covered by the elegant notion of visibly pushdown automata (VPA; see (Alur and Madhusudan, 2004)). As explained in (Alur and Madhusudan, 2009), executing a VPA on a given tree may be roughly understood as processing its prefix-order linearization (see Fig. 1b–c) with a suitably enhanced classical FSA. We omit the details as they are not important for the scope of this paper.

Discussion. FSAPs and VPAs both have potential advantages over each other. For the example rule (1), a naive FSAP will scan the input two times while a deterministic VPA will do it only once. On the other hand, the VPA will read all input symbols, including the irrelevant values of POS, GEND and NUMB, while an FSAP will simply skip over them.

We would like to combine these two advantages, that is, design a flavour of tree automata allowing both determinization and skipping in the above sense (see Fig. 2b for an example). Although this might be seen as a minor adjustment of the FSAP model (or one of the tree-theoretic models cited above), it appears to us that it has not been mentioned so far in the literature.

Finally, we mention concepts which are only seemingly related. Some authors consider jumping or skipping for automata in different meanings, e.g. simply moving from a state to another, or compressing the input to the list of gap lengths between consecutive occurrences of a given symbol (Wang, 2012), which is somewhat related but far less general. In some important finite-state frameworks, like XFST (Beesley and Karttunen, 2003) and NooJ, certain substrings (like +Verb) can be logically treated as single input symbols. However, what we need is nesting such clusters, combined with a choice for an automaton whether to inspect the contents of a cluster or to skip over it.

3 Skipping nested automata

Let $\Sigma$ be a finite alphabet, augmented with additional symbols J, K, # (see Fig. 1). We follow the definitions and notation for unranked $\Sigma$-trees from (Comon et al., 2007, Sec. 8); in particular, we identify a $\Sigma$-tree $t$ with its labeling function $t : \mathbb{N}^* \supseteq Pos(t) \to \Sigma$. For $p \in Pos(t)$, we denote its depth (i.e. its length as a word) by $d(p)$. Let $p \in Pos(t)$, $d \le d(p) + 1$ and $s > 0$. We define the $(d, s)$-successor of $p$, denoted $p[d, s]$, to be the $s$-th vertex at depth $d$ following $p$ in the prefix order³, and leave it undefined if this vertex does not exist. We also set $p[d(p), 0] = p$.
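The successor operation can be illustrated on the prefix-order linearization of Fig. 1c. The sketch below is only a naive linear scan under the assumption that the tree is given as a list of labels with J and K acting as brackets; the function names `depths_of` and `successor` and the 0-based indexing are choices made here for illustration.

```python
def depths_of(labels):
    """Depth of every symbol in a prefix-order linearization bracketed by J/K."""
    depths, depth = [], 0
    for label in labels:
        depths.append(depth)
        if label == "J":
            depth += 1   # children of a J node live one level deeper
        elif label == "K":
            depth -= 1   # K is the last child; afterwards we are back at the parent level
    return depths


def successor(depths, p, d, s):
    """Index of p[d, s], or None if undefined. Assumes d <= depths[p] + 1."""
    if s == 0:
        return p if d == depths[p] else None
    seen = 0
    for j in range(p + 1, len(depths)):
        if depths[j] < d:
            return None          # we left the enclosing subtree: no more candidates
        if depths[j] == d:
            seen += 1
            if seen == s:
                return j
    return None


# The segment of Fig. 1c (0-based indices).
LABELS = "J J d e r K D J J m N sg K J f G sg K K K #".split()
DEPTHS = depths_of(LABELS)
```

For instance, `successor(DEPTHS, 0, 1, 3)` is the third child of the root (the J opening the READ list), and `successor(DEPTHS, 7, 2, 4)` is undefined because that list has only three children, counting the closing K.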

A skipping nested automaton (SNA) over $\Sigma$ is a tuple $A = (Q, Q_I, Q_F, \delta, d)$, where $Q$ (resp. $Q_I$, $Q_F$) is the set of all (resp. initial, final) states, $d : Q \to \mathbb{N}$ is the depth function and $\delta \subseteq Q \times \Sigma \times Q \times \mathbb{N}_+$ is a finite set of transitions, such that

$$d(q) = 0 \ \text{ for } q \in Q_I, \qquad d(q') \le d(q) + 1 \ \text{ for } (q, \sigma, q', s) \in \delta.$$

(The intuitive meaning of $d(q)$ is the depth of the next symbol to be read when $q$ is the current state, which leads to some kind of skipping. Introducing $s$ will allow performing more general skips.)

A run of $A$ on a $\Sigma$-tree $t$ is a sequence $\rho = (p_i, q_i)_{i=0}^{n} \subseteq Pos(t) \times Q$ such that $p_0 = \varepsilon$ (the root position), $q_0 \in Q_I$ and for every $i < n$ there is $s \in \mathbb{N}$ such that $(q_i, t(p_i), q_{i+1}, s) \in \delta$ and $p_{i+1} = p_i[d(q_{i+1}), s]$. We say that $\rho$ reads $p_i$ and skips over all positions between $p_i$ and $p_{i+1}$ in the prefix order. We say that $\rho$ is accepting if $q_n \in Q_F$. A tree is accepted by $A$ if there is an accepting run on it. An example is shown in Fig. 2.

A run $\rho = (p_i, q_i)_{i=0}^{n}$ is crashing if there is $(q_n, t(p_n), q_{n+1}, s) \in \delta$ such that $p_n[d(q_{n+1}), s]$ is undefined. (Intuitively, this means jumping to a non-existent node.) We say that $A$ processes $t$ safely if none of its runs on $t$ is crashing. This property turns out to be important for determinization (see Section 4). A tree $t$ is accepted safely if it is processed safely and accepted.
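The notions of run, acceptance and crashing can be checked mechanically by exploring all runs. The following sketch does so on a prefix-order linearization; the automaton `TINY` (states q0 to q3, with q3 assumed final) and the input JaK# are a small illustrative example with one accepting branch and one crashing branch, and the helper names are assumptions of this sketch.

```python
def depths_of(labels):
    depths, depth = [], 0
    for label in labels:
        depths.append(depth)
        depth += (label == "J") - (label == "K")
    return depths


def successor(depths, p, d, s):
    """Index of the s-th vertex at depth d after p (s > 0), or None if undefined."""
    seen = 0
    for j in range(p + 1, len(depths)):
        if depths[j] < d:
            return None
        if depths[j] == d:
            seen += 1
            if seen == s:
                return j
    return None


def explore(labels, delta, depth, initial, final):
    """Return (accepted, crashed), considering every run of the SNA on the input."""
    depths = depths_of(labels)
    accepted = crashed = False
    todo, seen = [(0, q) for q in initial], set()
    while todo:
        p, q = todo.pop()
        if (p, q) in seen:
            continue
        seen.add((p, q))
        accepted = accepted or q in final       # every prefix of a run is itself a run
        for (q1, sym, q2, s) in delta:
            if q1 == q and sym == labels[p]:
                target = successor(depths, p, depth[q2], s)
                if target is None:
                    crashed = True              # jump to a non-existent node
                else:
                    todo.append((target, q2))
    return accepted, crashed


# Illustrative automaton: one branch jumps too far (s = 2), the other accepts.
DEPTH = {"q0": 0, "q1": 1, "q2": 1, "q3": 0}
TINY = {("q0", "J", "q1", 1), ("q1", "a", "q2", 2), ("q1", "a", "q3", 1)}
RESULT = explore("J a K #".split(), TINY, DEPTH, {"q0"}, {"q3"})
```

Here `RESULT` is `(True, True)`: the tree is accepted, but not accepted safely. Dropping the transition into q2 removes the crashing run.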

In practice, safe processing of trees coming from typed feature structures (as in Fig. 1) can be ensured with the aid of analyzing the types. For example, the SNA shown in Fig. 2 can assume state $q_2$ only at (the start of) a reading, which must have four children; hence the skipping transition from $q_2$ to $q_3$ is safe. On the other hand, we cannot use e.g. a transition from $q_1$ to $q_2$ with $s = 3$ because a list (here, of readings) may turn out to have only one child. We omit a general formal treatment of this issue since it trivialises in our intended applications.

³ This is the $s$-th child of $p$ if $d = d(p) + 1$, and the $s$-th right sibling of the $(d(p) - d)$-fold parent of $p$ otherwise. (Note that in the second case $d(p) - d$ must be non-negative; for $d(p) - d = 0$, the “0-fold parent” of $p$ means simply $p$.)

4 (Quasi-)determinization

An SNA $A = (Q, Q_I, Q_F, \delta, d)$ is deterministic if, for every $q$ and $\sigma$, there is at most one $(q, \sigma, q', s) \in \delta$. By quasi-determinization we mean building a deterministic SNA $\overline{A} = (\overline{Q}, \overline{Q}_I, \overline{Q}_F, \overline{\delta}, \overline{d})$ which accepts safely the same trees as $A$ does.

Let $S$ denote the highest value of $s$ appearing in $A$. We say that $(q, r) \in Q \times [0, S]$ is an option for $A$ at $p$ wrt. $p'$ if there is a run $\rho$ of $A$ on $t$ which ends in $(p'[d(q), r], q)$ such that $p$ either is the final position of $\rho$ or is skipped over by $\rho$ in its last step. A state of $\overline{A}$ will be a set of possible options for $A$ at a given position wrt. itself.

We define:

$$\overline{Q} = 2^{Q \times [0,S]}, \qquad \overline{Q}_I = \bigl\{\{(q, 0) : q \in Q_I\}\bigr\},$$

$$\overline{Q}_F = \{X \in \overline{Q} : X \cap (Q_F \times \{0\}) \neq \emptyset\},$$

$$\overline{d}(X) = \max_{(q,r) \in X} d(q) \quad \text{for } X \in \overline{Q}.$$

It remains to define $\overline{\delta}$. For each $X \in \overline{Q}$, $\sigma \in \Sigma$, it shall contain a tuple $(X, \sigma, X'', s)$, where $X''$, $s$ are computed by setting $d = \overline{d}(X)$, computing

$$X' = \{(q, r) \in X : d(q) < d \,\lor\, r > 0\} \;\cup\; \{(q', r') : (q, \sigma, q', r') \in \delta,\ (q, 0) \in X,\ d(q) = d\}$$

(intuitively, this is the set of options for $A$ at $p' = p[d(p), 1]$ wrt. $p$, provided that $p'$ exists), then setting $d' = \overline{d}(X')$ and finally

$$s = \min\{r : (q, r) \in X',\ d(q) = d'\},$$

$$X'' = \{(q, r) \in X' : d(q) < d'\} \;\cup\; \{(q, r - s) : (q, r) \in X',\ d(q) = d'\}.$$

(Explanation: $p'' = p[d', s]$ is the nearest (wrt. the prefix order) ending position of any of the runs corresponding to the options from $X'$; hence $\overline{A}$ may jump directly to $p''$; the options at $p''$ wrt. $p$ are the same as at $p'$, i.e. $X'$; hence, the target $X''$ is obtained from $X'$ by “re-basing” from $p$ to $p''$.)
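The construction of a single transition of the determinized automaton can be transcribed almost literally. The sketch below computes $X''$ and $s$ for one pair of an option set $X$ and a symbol $\sigma$; the function name `det_step`, the tuple encoding of transitions, and returning `None` when no option survives are assumptions of this sketch. The sample automaton is a small illustrative one with two competing transitions on a.

```python
def det_step(X, sigma, delta, depth):
    """One step of quasi-determinization: map (X, sigma) to (X'', s).

    X is a non-empty set of options (q, r); delta is a set of tuples
    (q, symbol, q', s); depth maps each state to its depth.
    Returns None when no option survives (a choice made for this sketch).
    """
    d = max(depth[q] for q, _ in X)
    # Options that are still pending, plus the ones opened by reading sigma.
    X1 = {(q, r) for (q, r) in X if depth[q] < d or r > 0}
    X1 |= {(q2, s) for (q, sym, q2, s) in delta
           if sym == sigma and (q, 0) in X and depth[q] == d}
    if not X1:
        return None
    d1 = max(depth[q] for q, _ in X1)
    s = min(r for (q, r) in X1 if depth[q] == d1)
    # Re-base the deepest options by the length s of the jump actually taken.
    X2 = {(q, r) for (q, r) in X1 if depth[q] < d1}
    X2 |= {(q, r - s) for (q, r) in X1 if depth[q] == d1}
    return frozenset(X2), s


# Illustrative automaton: q1 has two transitions on "a".
DEPTH = {"q0": 0, "q1": 1, "q2": 1, "q3": 0}
DELTA = {("q0", "J", "q1", 1), ("q1", "a", "q2", 2), ("q1", "a", "q3", 1)}
FIRST = det_step({("q0", 0)}, "J", DELTA, DEPTH)
SECOND = det_step({("q1", 0)}, "a", DELTA, DEPTH)
```

In this example `SECOND` consists of the option set {(q2, 0), (q3, 1)} together with s = 2: the deterministic automaton commits to the jump of length 2 at depth 1 and keeps q3 as a pending option at depth 0.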

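While the trees accepted safely by $A$ and by the determinized automaton $\overline{A}$ coincide, this is not true for simply accepted trees. For example, the tree $t$ corresponding (in the prefix order) to JaK# is accepted by $A$ defined as

[diagram: states $q_0$, $q_1$, $q_2$, $q_3$ of depths 0, 1, 1, 0, with transitions $(q_0, \mathrm{J}, q_1, 1)$, $(q_1, a, q_2, 2)$ and $(q_1, a, q_3, 1)$]

(there is a run ending in $q_3$) but not by $\overline{A}$, because for $X = \{(q_1, 0)\}$ and $\sigma = a$ we obtain $\overline{d}(X'') = 1$ and $s = 2$, so $\overline{A}$ has to jump to the second vertex at depth 1 following a, which does not exist.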

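[Figure 2: diagrams not recoverable from the extraction. Panel (a) shows states $q_0$ to $q_5$ with depths 0, 1, 2, 3, 0, 0 and arcs labeled J ($s = 3$), J, J ($s = 2$), Σ \ {K, G}, G and #. Panel (b) shows the prefix-order linearization of Fig. 1c, stratified by depth 0 to 3 and annotated with the states $q_0$ to $q_5$.]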

Figure 2: An SNA accepting a segment followed by # (end of input) and having a genitive reading (a), and its run on the segment from the previous figure (b). In part (a), numbers above states are their depths; numbers above dotted arcs are the values of s (not shown when s = 1). In part (b), the black symbols are read while the gray symbols are skipped over; this corresponds to the continuous and dotted arcs.

5 Practical use

In order to make SNAs applicable, we should explain how to build SNAs corresponding to typical corpus queries, and also how skipping should be efficiently performed in practice.

It is straightforward to build a non-skipping SNA for a regular expression $e$ over $\Sigma$ provided that $e$ does not contain J or K inside ?, ∗ or +, and all their occurrences are well-matched inside every branch of |. Under our assumption of bounded input depth, every regular expression can be transformed to an equivalent one of that form.

Skipping SNAs in our applications are defined by an additional regex construct _, matching any single sub-tree of the input. This is compiled into a two-state automaton with a single transition from s to u labeled Σ (with $d \equiv 0$), which makes a skip if the input starts with J. To enable even longer skips, patterns of the form _ⁿ and J e _∗ K are suitably optimized. For example, the SNA of Fig. 2 recognizes J _ _ J _∗ J _ G _∗ K _∗ K _∗ K #.

Proceeding to skipping in practice, we assume that, as a result of pre-processing, we are given the breadth-first linearization $\tilde{t} \in \Sigma^*$ of the input (see Fig. 1d), stored physically as an array (for a given $i$, accessing $\tilde{t}[i]$ takes constant time), and that any occurrence of J at position $i$ is equipped with the pointer $L(i)$ to its left-most child.⁴ Moreover, we equip a deterministic SNA $A$ with an array $S$ such that, when $A$ stays at $p$ of depth $d$, $S[i]$ should point to the $(i, 1)$-successor of $p$ for all $i < d$.⁵ In this setting, the $(d, s)$-successor of the current position has index $S[d] + (s - 1)$, which is computable in constant time. Hence, we are able to run $A$ efficiently (Fig. 3 shows an example). In particular, the running time is independent of the number of input symbols skipped over.
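The array-based execution can be sketched as follows. The code builds the breadth-first array and the pointers L from a prefix-order linearization, then runs a deterministic SNA while maintaining the stack S. Several points are assumptions of this sketch rather than statements of the paper: indices are 0-based, jumps to a child or along the same depth are computed as L[i] + (s − 1) and i + s respectively, the automaton is one reading of Fig. 2 with its last step (reading #) omitted, and the input is assumed to be processed safely, so no bounds are checked.

```python
def depths_of(labels):
    depths, depth = [], 0
    for label in labels:
        depths.append(depth)
        depth += (label == "J") - (label == "K")
    return depths


def breadth_first(labels):
    """Breadth-first array and left-most-child pointers L for every J."""
    depths = depths_of(labels)
    order = sorted(range(len(labels)), key=depths.__getitem__)   # stable sort
    pos = {pre: bf for bf, pre in enumerate(order)}
    array = [labels[pre] for pre in order]
    # The first child of a J is the next symbol in prefix order (at least K exists).
    L = {pos[pre]: pos[pre + 1] for pre, lab in enumerate(labels) if lab == "J"}
    return array, L


def run(array, L, delta, depth, q):
    """Run a deterministic SNA; each step costs O(1) regardless of the skip length."""
    i, S, visited = 0, [], []
    while (q, array[i]) in delta:
        visited.append(i)
        q2, s = delta[(q, array[i])]
        d, d2 = depth[q], depth[q2]
        if d2 == d + 1:                       # descend to the s-th child
            S, i = S + [i + 1], L[i] + s - 1
        elif d2 == d:                         # stay at the same depth
            i = i + s
        else:                                 # climb up using the stack
            S, i = S[:d2], S[d2] + s - 1
        q = q2
    return q, i, visited


PREFIX = "J J d e r K D J J m N sg K J f G sg K K K #".split()
ARRAY, L = breadth_first(PREFIX)
DEPTH = {"q0": 0, "q1": 1, "q2": 2, "q3": 3, "q4": 0}
DELTA = {("q0", "J"): ("q1", 3), ("q1", "J"): ("q2", 1),
         ("q2", "J"): ("q3", 2), ("q3", "G"): ("q4", 1)}
DELTA.update({("q3", a): ("q2", 1) for a in set(PREFIX) - {"K", "G"}})
STATE, INDEX, VISITED = run(ARRAY, L, DELTA, DEPTH, "q0")
```

On the segment for der, this run reads only six of the 21 symbols (three J brackets on the way down, the CASE values N and G, and one more J) before reaching q4 at the position of #.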

⁴ Note that this is the way in which the original structure would be stored in memory by a standard C implementation.

⁵ Upkeeping this requires only one memory access for every J processed; cf. (Alur and Madhusudan, 2004).

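[Figure 3: diagram only partly recoverable from the extraction. It shows the breadth-first array of Fig. 1d at positions 1 to 21, with pointers 3, 7, 11, 14 and 18 above the successive J symbols, and frames giving the current state and the stack S: state 0 with an empty stack; state 1 with S = (2); state 2 with S = (2, 6), twice; state 3 with S = (2, 6, 12) and later with S = (2, 6, 13); finally states 4 and 5 at #.]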

Figure 3: A run of the SNA from Fig. 2 on the input from Fig. 1d. The numbers above J are pointers to their left-most children. The bottom number in each frame indicates the current state; the remaining ones show the stack S (S[i] appears at depth i).

6 Evaluation and conclusion

A preliminary version of our method has been used for finding the matches of relatively complex hand-written patterns aimed at shallow parsing and correcting errors in the IPI Corpus of Polish (Przepiórkowski, 2004). As a result, although the input size grew by about 30% due to introducing J and K nodes, over 75% of the obtained input was skipped, leading to an overall speed-up by about 50%. Clearly, empirical results may depend heavily on the particular input and queries. Hence, our solution may turn out to be narrowly scoped as well as to be useful in various aspects of XML processing. Note that, although the expressive power of SNAs as presented is rather weak, it seems to be easily extendable by integrating our main idea with the general VPA model.

References


Control, 19(5):439–475.

Rajeev Alur and P. Madhusudan. 2004. Visibly pushdown languages. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 202–211, New York, NY, USA. ACM.

Rajeev Alur and P. Madhusudan. 2009. Adding nesting structure to words. J. ACM, 56(3):16:1–16:43.

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite State Morphology. CSLI Studies in Computational Linguistics. CSLI Publications.

Mikołaj Bojańczyk and Thomas Colcombet. 2006. Tree-walking automata cannot be determinized. Theor. Comput. Sci., 350(2-3):164–173.

H. Comon, M. Dauchet, R. Gilleron, C. Löding, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi. 2007. Tree automata techniques and applications. http://www.grappa.univ-lille3.fr/tata. Release October 12th, 2007.

Olivier Gauwin, Joachim Niehren, and Yves Roos. 2008. Streaming tree automata. Inf. Process. Lett., 109(1):13–17.

Makoto Murata. 2001. Extended path expressions for XML. In Peter Buneman, editor, PODS. ACM.

Gertjan van Noord and Dale Gerdemann. 2001. Finite state transducers with predicates and identities. Grammars, 4(3):263–286.

Adam Przepiórkowski. 2004. Korpus IPI PAN. Wersja wstępna. Institute of Computer Science, Polish Academy of Sciences, Warsaw.