EM Training for Weighted Aligned Hypergraph Bimorphisms


Frank Drewes

Department of Computing Science, Umeå University

S-901 87 Umeå, Sweden [email protected]

Kilian Gebhardt and Heiko Vogler

Department of Computer Science, Technische Universität Dresden

D-01062 Dresden, Germany

[email protected] [email protected]

Abstract

We develop the concept of weighted aligned hypergraph bimorphism where the weights may, in particular, represent probabilities. Such a bimorphism consists of an $\mathbb{R}_{\geq 0}$-weighted regular tree grammar, two hypergraph algebras that interpret the generated trees, and a family of alignments between the two interpretations. Semantically, this yields a set of bihypergraphs each consisting of two hypergraphs and an explicit alignment between them; e.g., discontinuous phrase structures and non-projective dependency structures are bihypergraphs. We present an EM-training algorithm which takes a corpus of bihypergraphs and an aligned hypergraph bimorphism as input and generates a sequence of weight assignments which converges to a local maximum or saddle point of the likelihood function of the corpus.

1 Introduction

In natural language processing, alignments play an important role. For instance, in machine translation they show up as hidden information when training probabilities of dictionaries (Brown et al., 1993) or when considering pairs of input/output sentences derived by a synchronous grammar (Lewis and Stearns, 1968; Chiang, 2007; Shieber and Schabes, 1990; Nederhof and Vogler, 2012). As another example, in language models for discontinuous phrase structures and non-projective dependency structures they can be used to capture the connection between the words in a natural language sentence and the corresponding nodes of the parse tree or dependency structure of that sentence.

In (Nederhof and Vogler, 2014) the generation of discontinuous phrase structures has been formalized by the new concept of hybrid grammar. Much as in the mentioned synchronous grammars, a hybrid grammar synchronizes the derivations of nonterminals of a string grammar, e.g., a linear context-free rewriting system (LCFRS) (Vijay-Shanker et al., 1987), and of nonterminals of a tree grammar, e.g., regular tree grammar (Brainerd, 1969) or simple definite-clause programs (sDCP) (Deransart and Małuszyński, 1985). Additionally it synchronizes terminal symbols, thereby establishing an explicit alignment between the positions of the string and the nodes of the tree. We note that LCFRS/sDCP hybrid grammars can also generate non-projective dependency structures.

In this paper we focus on the task of training an LCFRS/sDCP hybrid grammar, that is, assigning probabilities to its rules given a corpus of discontinuous phrase structures or non-projective dependency structures. Since the alignments are first class citizens, we develop our approach in the general framework of hypergraphs and hyperedge replacement (HR) (Habel, 1992). We define the concepts of bihypergraph (for short: bigraph) and aligned HR bimorphism. A bigraph consists of hypergraphs $H_1$, $\lambda$, and $H_2$, where $\lambda$ represents the alignment between $H_1$ and $H_2$. A bimorphism $\mathcal{B} = (g, \mathcal{A}_1, \Lambda, \mathcal{A}_2)$ consists of a regular tree grammar $g$ generating trees over some ranked alphabet $\Sigma$, two $\Sigma$-algebras $\mathcal{A}_1$ and $\mathcal{A}_2$ which interpret each symbol in $\Sigma$ as an HR operation (thus evaluating every tree to two hypergraphs), and a $\Sigma$-indexed family $\Lambda$ of alignments between the two interpretations of each $\sigma \in \Sigma$. The semantics of $\mathcal{B}$ is a set of bigraphs.

For instance, each discontinuous phrase structure or non-projective dependency structure can be represented as a bigraph $(H_1, \lambda, H_2)$ where $H_1$ and $H_2$ correspond to the string component and the tree component, respectively. Fig. 1 shows an example of a bigraph representing a non-projective dependency structure. We present each LCFRS/sDCP hybrid grammar as a particular aligned HR bimorphism; this establishes an initial algebra semantics (Goguen et al., 1977) for hybrid grammars.

[Figure 1 graphic omitted. Panel (a): the sentence "A hearing is scheduled on the issue today ." with its dependency structure. Panel (b): the hypergraphs $H_1$ and $H_2$ and the alignment $\lambda$.]

Figure 1: (a) A sentence with non-projective dependencies is represented in (b) by a bigraph $(H_1, \lambda, H_2)$. Both hypergraphs $H_1$ and $H_2$ contain a distinct hyperedge (box) for each word of the sentence. $H_1$ specifies the linear order on the words. $H_2$ describes parent-child relationships between the words, where children form a list to whose start and end the parent has a tentacle. The alignment $\lambda$ establishes a one-to-one correspondence between the (input vertices of the) hyperedges in $H_1$ and $H_2$.

The flexibility of aligned HR bimorphisms goes well beyond hybrid grammars as they generalize the synchronous HR grammars of (Jones et al., 2012), making it possible to synchronously generate two graphs connected by explicit alignment structures. Thus, they can for instance model alignments involving directed acyclic graphs like Abstract Meaning Representations (Banarescu et al., 2013) or Millstream systems (Bensch et al., 2014).

Our training algorithm takes as input an aligned HR bimorphism $\mathcal{B} = (g, \mathcal{A}_1, \Lambda, \mathcal{A}_2)$ and a corpus $c$ of bigraphs. It is based on the dynamic programming variant (Baker, 1979; Lari and Young, 1990; Prescher, 2001) of the EM-algorithm (Dempster et al., 1977) and thus approximates a local maximum or saddle point of the likelihood function of $c$.

In order to calculate the significance of each rule of $g$ for the generation of a single bigraph $(H_1, \lambda, H_2)$ occurring in $c$, we proceed as usual, constructing the reduct of $\mathcal{B}$ with respect to $(H_1, \lambda, H_2)$, which generates the singleton $\{(H_1, \lambda, H_2)\}$ via the same derivation trees as $\mathcal{B}$ and preserves the probabilities. We show that the complexity of constructing the reduct is polynomial in the size of $g$ and $(H_1, \lambda, H_2)$ if $\mathcal{B}$ is an LCFRS/sDCP hybrid grammar. However, as the algorithm itself is not limited to this situation, we expect it to be useful in other cases as well.

2 Preliminaries

Basic mathematical notation We denote the set of natural numbers (including $0$) by $\mathbb{N}$ and the set $\mathbb{N} \setminus \{0\}$ by $\mathbb{N}_+$. For $n \in \mathbb{N}$, we denote $\{1, \ldots, n\}$ by $[n]$. An alphabet $A$ is a finite set of symbols. We denote the set of all strings over $A$ by $A^*$, the empty string by $\varepsilon$, and $A^* \setminus \{\varepsilon\}$ by $A^+$. We denote the length of $s \in A^*$ by $|s|$ and, for each $i \in [|s|]$, the $i$th item in $s$ by $s(i)$, i.e., $s$ is identified with the function $s\colon [|s|] \to A$ such that $s = s(1) \cdots s(|s|)$. We denote the range $\{s(1), \ldots, s(|s|)\}$ of $s$ by $[s]$. The powerset of a set $A$ is denoted by $\mathcal{P}(A)$. The canonical extensions of a function $f\colon A \to B$ to $f\colon \mathcal{P}(A) \to \mathcal{P}(B)$ and to $f\colon A^* \to B^*$ are defined as usual and denoted by $f$ as well. We denote the restriction of $f\colon A \to B$ to $A' \subseteq A$ by $f|_{A'}$.

For an equivalence relation $\sim$ on $B$ we denote the equivalence class of $b \in B$ by $[b]_\sim$ and the quotient of $B$ modulo $\sim$ by $B/{\sim}$. For a function $f\colon A \to B$ we define the function $f/{\sim}\colon A \to B/{\sim}$ by $f/{\sim}(a) = [f(a)]_\sim$. In particular, for a string $s \in B^*$ we let $s/{\sim} = [s(1)]_\sim \cdots [s(|s|)]_\sim$.

Terms, regular tree grammars, and algebras A ranked alphabet is a pair $(\Sigma, \mathrm{rk})$ where $\Sigma$ is an alphabet and $\mathrm{rk}\colon \Sigma \to \mathbb{N}$ is a mapping associating a rank with each symbol of $\Sigma$. Often we just write $\Sigma$ instead of $(\Sigma, \mathrm{rk})$. We abbreviate $\mathrm{rk}^{-1}(k)$ by $\Sigma_k$. In the following let $\Sigma$ be a ranked alphabet.

Let $A$ be an arbitrary set. We let $\Sigma(A)$ denote the set of strings $\{\sigma(a_1, \ldots, a_k) \mid k \in \mathbb{N}, \sigma \in \Sigma_k, a_1, \ldots, a_k \in A\}$ (where the parentheses and commas are special symbols not in $\Sigma$). The set of well-formed terms over $\Sigma$ indexed by $A$, denoted by $T_\Sigma(A)$, is defined to be the smallest set $T$ such that $A \subseteq T$ and $\Sigma(T) \subseteq T$. We abbreviate $T_\Sigma(\emptyset)$ by $T_\Sigma$ and write $\sigma$ instead of $\sigma()$ for $\sigma \in \Sigma_0$.

A regular tree grammar (RTG)¹ (Gécseg and Steinby, 1984) is a tuple $g = (\Xi, \Sigma, \xi_0, R)$ where $\Xi$ is an alphabet (nonterminals), $\Xi \cap \Sigma = \emptyset$, elements in $\Sigma$ are called terminals, $\xi_0 \in \Xi$ (initial nonterminal), $R$ is a ranked alphabet (rules); each rule in $R_k$ has the form $\xi \to \sigma(\xi_1, \ldots, \xi_k)$ where $\xi, \xi_1, \ldots, \xi_k \in \Xi$, $\sigma \in \Sigma_k$. We denote the set of all rules with left-hand side $\xi$ by $R_\xi$ for each $\xi \in \Xi$.

Since RTGs are particular context-free grammars, the concepts of derivation relation and generated language are inherited. The language of the RTG $g$ is the set of all well-formed terms in $T_\Sigma$ generated by $g$; this language is denoted by $L(g)$.

We define the $\Xi$-indexed family $(D_g^\xi \mid \xi \in \Xi)$ of mappings $D_g^\xi\colon T_\Sigma \to \mathcal{P}(T_R)$; for each term $t \in T_\Sigma$, $D_g^\xi(t)$ is the set of $t$'s derivation trees in $T_R$ which start with $\xi$ and yield $t$. Formally, for each $\xi \in \Xi$ and $\sigma(t_1, \ldots, t_k) \in T_\Sigma$, the set $D_g^\xi(\sigma(t_1, \ldots, t_k))$ contains each term $\rho(d_1, \ldots, d_k)$ where $\rho = (\xi \to \sigma(\xi_1, \ldots, \xi_k))$ is in $R$ and $d_i \in D_g^{\xi_i}(t_i)$ for each $i \in [k]$. We define $D_g(t) = \bigcup_{\xi \in \Xi} D_g^\xi(t)$ and $D_g^\xi(T_\Sigma) = \bigcup_{t \in T_\Sigma} D_g^\xi(t)$. Finally, $D_g^{\xi_0}(T_\Sigma, \xi)$ is the set of all $\zeta \in T_R(\{\xi\})$ such that there is a $\zeta' \in D_g^{\xi_0}(T_\Sigma)$ which has a subtree whose root is in $R_\xi$, and $\zeta$ is obtained from $\zeta'$ by replacing exactly one of these subtrees by $\xi$.

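Example 2.1. Let $\Sigma = \Sigma_0 \cup \Sigma_2$ where $\Sigma_0 = \{\sigma_2, \sigma_4, \sigma_5\}$ and $\Sigma_2 = \{\sigma_1, \sigma_3\}$. Let $g$ be an RTG with initial nonterminal $S$ and the following rules:

$S \to \sigma_1(A, B)$, $A \to \sigma_2$, $B \to \sigma_3(C, D)$, $C \to \sigma_4$, $D \to \sigma_5$.

We observe that $S \Rightarrow_g^* \sigma_1(\sigma_2, \sigma_3(\sigma_4, \sigma_5))$. Let $\eta$, $\zeta$, and $\zeta'$ be the following trees (in order), written here in term notation:

$\eta = (B \to \sigma_3(C, D))\bigl(C \to \sigma_4,\; D \to \sigma_5\bigr)$

$\zeta = (S \to \sigma_1(A, B))\bigl(A \to \sigma_2,\; B\bigr)$

$\zeta' = (S \to \sigma_1(A, B))\bigl(A \to \sigma_2,\; \eta\bigr)$

Then $\zeta \in D_g^S(T_\Sigma, B)$ because $\zeta' \in D_g^S(T_\Sigma)$ and the left-hand side of the root of $\eta$ is $B$.

¹ In this context we use "tree" and "term" as synonyms.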
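The recursive definition of the derivation-tree sets can be illustrated with a minimal Python sketch. It encodes the rules of Example 2.1 and enumerates all derivation trees that start with a given nonterminal and yield a given term. The rule names `r1` to `r5`, the ASCII symbol names `s1` to `s5`, and the tuple encoding of terms are assumptions of this sketch, not notation from the paper.

```python
from itertools import product

# Rules of the RTG from Example 2.1 under hypothetical names.
# Each rule is (left-hand side, terminal symbol, right-hand-side nonterminals).
RULES = {
    "r1": ("S", "s1", ("A", "B")),
    "r2": ("A", "s2", ()),
    "r3": ("B", "s3", ("C", "D")),
    "r4": ("C", "s4", ()),
    "r5": ("D", "s5", ()),
}


def derivations(xi, term):
    """Return all derivation trees that start with nonterminal xi and yield term.

    A term is a tuple (symbol, child, ...); a derivation tree is a tuple
    (rule_name, child_derivation, ...).
    """
    symbol, children = term[0], term[1:]
    result = []
    for name, (lhs, sigma, rhs) in RULES.items():
        if lhs != xi or sigma != symbol or len(rhs) != len(children):
            continue
        # Combine every choice of sub-derivations for the children.
        options = [derivations(n, c) for n, c in zip(rhs, children)]
        for combo in product(*options):
            result.append((name, *combo))
    return result


# The term sigma1(sigma2, sigma3(sigma4, sigma5)) of Example 2.1.
t = ("s1", ("s2",), ("s3", ("s4",), ("s5",)))
trees = derivations("S", t)
```

Since the grammar of Example 2.1 is unambiguous, `trees` contains exactly one derivation tree, and asking for derivations that start with any other nonterminal yields the empty list.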

A $\Sigma$-algebra is a pair $\mathcal{A} = (A, (\sigma_\mathcal{A} \mid \sigma \in \Sigma))$ where $A$ is a set and $\sigma_\mathcal{A}$ is a $k$-ary operation on $A$ for every $k \in \mathbb{N}$ and $\sigma \in \Sigma_k$. As usual, we will sometimes use $\mathcal{A}$ to refer to its carrier set $A$ or, conversely, denote $\mathcal{A}$ by $A$ (and thus $\sigma_\mathcal{A}$ by $\sigma_A$) if there is no risk of confusion. The $\Sigma$-term algebra is the $\Sigma$-algebra $T_\Sigma$ with $\sigma_{T_\Sigma}(t_1, \ldots, t_k) = \sigma(t_1, \ldots, t_k)$ for every $k \in \mathbb{N}$, $\sigma \in \Sigma_k$, and $t_1, \ldots, t_k \in T_\Sigma$. For each $\Sigma$-algebra $\mathcal{A}$ there is exactly one $\Sigma$-homomorphism, denoted by $[\![.]\!]_\mathcal{A}$, from the $\Sigma$-term algebra to $\mathcal{A}$ (Wechler, 1992).

Hypergraphs and hyperedge replacement In the following let $\Gamma$ be a finite set of labels. A $\Gamma$-hypergraph is a tuple $H = (V, E, \mathrm{att}, \mathrm{lab}, \mathrm{ports})$, where $V$ is a finite set of vertices, $E$ is a finite set of hyperedges, $\mathrm{att}\colon E \to V^* \setminus \{\varepsilon\}$ is the attachment of hyperedges to vertices, $\mathrm{lab}\colon E \to \Gamma$ is the labeling of hyperedges, and $\mathrm{ports} \in V^*$ is a sequence of (not necessarily distinct) ports. The set of all $\Gamma$-hypergraphs is denoted by $\mathcal{H}_\Gamma$. The vertices in $V \setminus [\mathrm{ports}]$ are also called internal vertices and denoted by $\mathrm{int}(H)$.

For the sake of brevity, we shall in the following simply call $\Gamma$-hypergraphs and hyperedges graphs and edges, respectively. We illustrate a graph in figures as follows (cf., e.g., $\mathrm{graph}((\sigma_2)_\mathcal{A})$ in Fig. 2a). A vertex $v$ is illustrated by a circle, which is filled and labeled by $i$ in case that $\mathrm{ports}(i) = v$. An edge $e$ with label $\gamma$ and $\mathrm{att}(e) = v_1 \ldots v_n$ is depicted as a $\gamma$-labeled rectangle with $n$ tentacles, lines pointing to $v_1, \ldots, v_n$ which are annotated by $1, \ldots, n$. (We sometimes drop these annotations.)

If we are not interested in the particular set of labels $\Gamma$, then we also call a $\Gamma$-graph simply graph and write $\mathcal{H}$ instead of $\mathcal{H}_\Gamma$. In the following, we will refer to the components of a graph $H$ by indexing them with $H$ unless they are explicitly named. Let $H$ and $H'$ be graphs. $H$ and $H'$ are disjoint …

[Figure 2 graphic omitted. Panel (a): the graphs $\mathrm{graph}((\sigma_1)_\mathcal{A}), \ldots, \mathrm{graph}((\sigma_5)_\mathcal{A})$. Panel (b): the stepwise evaluation of $[\![\sigma_1(\sigma_2, \sigma_3(\sigma_4, \sigma_5))]\!]_\mathcal{A}$.]

Figure 2: (a) The $(\Sigma, \Gamma)$-HR algebra $\mathcal{A}$ and (b) the evaluation of the term $\sigma_1(\sigma_2, \sigma_3(\sigma_4, \sigma_5))$ in $\mathcal{A}$.

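Let $E = E_H \cap E_{H'}$. If $\mathrm{att}_H|_E = \mathrm{att}_{H'}|_E$ and $\mathrm{lab}_H|_E = \mathrm{lab}_{H'}|_E$, then the union of the graphs $H$ and $H'$ is the graph

$$H \cup H' = (V_H \cup V_{H'},\; E_H \cup E_{H'},\; \mathrm{att}_H \cup \mathrm{att}_{H'},\; \mathrm{lab}_H \cup \mathrm{lab}_{H'},\; \mathrm{ports}_H\,\mathrm{ports}_{H'}).$$

For every $F \subseteq E_H$ let $H \setminus F = (V_H, E, \mathrm{att}_H|_E, \mathrm{lab}_H|_E, \mathrm{ports}_H)$ where $E = E_H \setminus F$.

Let $k \in \mathbb{N}$ and $H, H_1, \ldots, H_k \in \mathcal{H}$ be pairwise disjoint. Let $e_1, \ldots, e_k \in E_H$ be pairwise distinct edges, called variables. Let $I$ be the graph $H \setminus \{e_1, \ldots, e_k\} \cup H_1 \cup \cdots \cup H_k$. The hyperedge replacement (HR) of $e_1$ by $H_1$, ..., and $e_k$ by $H_k$ in $H$ (Bauderon and Courcelle, 1987; Habel and Kreowski, 1987) yields the graph

$$H[e_1/H_1, \ldots, e_k/H_k] = (V_I/{\sim},\; E_I,\; \mathrm{att}_I/{\sim},\; \mathrm{lab}_I,\; \mathrm{ports}_H/{\sim})$$

where $\sim$ is the least equivalence relation on $V_I$ such that $\mathrm{att}_H(e_i, j) \sim \mathrm{ports}_{H_i}(j)$ for every $i \in [k]$ and $j \in [\min(|\mathrm{att}_H(e_i)|, |\mathrm{ports}_{H_i}|)]$. In the following, we call $\sim$ the equivalence relation involved in the HR that yields $H[e_1/H_1, \ldots, e_k/H_k]$.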
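The definition of hyperedge replacement can be sketched operationally. The following Python sketch represents graphs as plain dictionaries, generates the equivalence relation with a union-find structure, and uses one representative vertex per equivalence class instead of the class itself. The dictionary layout, the function name `replace_hyperedges`, and the example graphs are assumptions of this sketch; all graphs passed in are assumed to be pairwise disjoint, as the definition requires.

```python
def replace_hyperedges(host, replacements):
    """Compute H[e1/H1, ..., ek/Hk] on dict-based graphs.

    A graph is a dict with keys "V" (set of vertices), "att" (edge -> tuple of
    vertices), "lab" (edge -> label) and "ports" (tuple of vertices).
    `replacements` maps each variable edge of `host` to its replacement graph.
    """
    parent = {v: v for v in host["V"]}
    for g in replacements.values():
        parent.update({v: v for v in g["V"]})

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Generate ~: the j-th attachment vertex of e_i is identified with the
    # j-th port of H_i; zip stops at the shorter of the two sequences.
    for e, g in replacements.items():
        for a, p in zip(host["att"][e], g["ports"]):
            parent[find(p)] = find(a)

    att, lab = {}, {}
    for g in [host, *replacements.values()]:
        for e in g["att"]:
            if g is host and e in replacements:
                continue  # variable edges are removed from the host
            att[e] = tuple(find(v) for v in g["att"][e])
            lab[e] = g["lab"][e]
    return {
        "V": {find(v) for v in parent},
        "att": att,
        "lab": lab,
        "ports": tuple(find(v) for v in host["ports"]),
    }


# Hypothetical host graph with one variable edge e1 and one terminal edge f.
host = {
    "V": {"u", "v", "w"},
    "att": {"e1": ("u", "v"), "f": ("v", "w")},
    "lab": {"e1": "bot", "f": "b"},
    "ports": ("u", "w"),
}
# Hypothetical replacement graph: a path p - r - q with ports p and q.
sub = {
    "V": {"p", "q", "r"},
    "att": {"g1": ("p", "r"), "g2": ("r", "q")},
    "lab": {"g1": "a", "g2": "a"},
    "ports": ("p", "q"),
}
result = replace_hyperedges(host, {"e1": sub})
```

In the example, the ports `p` and `q` of the replacement graph are merged with the attachment vertices `u` and `v` of `e1`, while the internal vertex `r` survives unchanged.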

We assume that each variable $e_i$ is labeled by a distinguished symbol $\bot \in \Sigma$ and depict $e_i$ by a box labeled $e_i$ instead of $\bot$. Throughout this paper, we will not distinguish between isomorphic graphs, i.e., graphs that are identical up to a bijective renaming of vertices and edges. However, since hyperedge replacement is defined on concrete graphs and requires that $H, H_1, \ldots, H_k$ are pairwise disjoint, we may choose isomorphic copies of the involved graphs, i.e., rename edges or vertices. To avoid the cumbersome conversion between abstract and concrete graphs, we assume that this renaming is opaque. In this sense, we may define an HR operation as a total function from $\mathcal{H}^k$ to $\mathcal{H}$ as follows.

Let $H$ be a graph. For pairwise distinct edges $e_1, \ldots, e_k \in E_H$, the HR operation $H_{\langle e_1 \ldots e_k \rangle}\colon \mathcal{H}^k \to \mathcal{H}$ is given by

$$H_{\langle e_1 \ldots e_k \rangle}(H_1, \ldots, H_k) = H[e_1/H_1, \ldots, e_k/H_k]$$

for all graphs $H_1, \ldots, H_k \in \mathcal{H}$.

A $(\Sigma, \Gamma)$-HR algebra (Courcelle, 1991) is a $\Sigma$-algebra $\mathcal{A} = (\mathcal{H}_\Gamma, (\sigma_\mathcal{A})_{\sigma \in \Sigma})$ where, for every $k \in \mathbb{N}$ and $\sigma \in \Sigma_k$, we have $\sigma_\mathcal{A} = H_{\langle e_1 \ldots e_k \rangle}$ for some $H \in \mathcal{H}_\Gamma$ and pairwise distinct $e_1, \ldots, e_k \in E_H$. Then we denote $H$ by $\mathrm{graph}(\sigma_\mathcal{A})$. An HR algebra is a $(\Sigma, \Gamma)$-HR algebra for some $\Sigma$ and $\Gamma$.

An example of a $(\Sigma, \Gamma)$-HR algebra $\mathcal{A}$ and the application of $[\![.]\!]_\mathcal{A}$ to a term are given in Fig. 2.
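The unique homomorphism from the term algebra into a $\Sigma$-algebra amounts to bottom-up evaluation of a term. The following Python sketch illustrates this for two simple stand-in algebras instead of an HR algebra: a hypothetical algebra `SIZE` on the natural numbers that counts symbol occurrences, and the term algebra `TERM` itself. The names `evaluate`, `SIZE`, `TERM`, and the ASCII symbols `s1` to `s5` are assumptions of this sketch.

```python
def evaluate(term, algebra):
    """Evaluate a term bottom-up in a Sigma-algebra.

    term    : tuple (symbol, child_term, ...)
    algebra : dict mapping each symbol to a function whose arity equals the
              rank of the symbol
    """
    symbol, children = term[0], term[1:]
    return algebra[symbol](*(evaluate(c, algebra) for c in children))


# Hypothetical algebra on the natural numbers: every symbol adds one to the
# sum of the values of its children.
SIZE = {
    "s1": lambda x, y: 1 + x + y,
    "s2": lambda: 1,
    "s3": lambda x, y: 1 + x + y,
    "s4": lambda: 1,
    "s5": lambda: 1,
}

# The term algebra interprets every symbol by itself, so that evaluation in
# it is the identity on terms.
TERM = {name: (lambda *ts, name=name: (name, *ts)) for name in SIZE}

# The term sigma1(sigma2, sigma3(sigma4, sigma5)) evaluated in Fig. 2b.
t = ("s1", ("s2",), ("s3", ("s4",), ("s5",)))
size_of_t = evaluate(t, SIZE)
```

An HR algebra would fit the same scheme by mapping each symbol to an HR operation on graphs in place of the arithmetic functions used here.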

3 Bigraphs and Aligned HR Bimorphisms

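Now we formally introduce our central notions of bigraph and aligned HR bimorphism. A case study with examples follows in the next section. Throughout this section let $\Delta$ and $\Omega$ be alphabets.

Definition 3.1. A bigraph of type $(\Delta, \Omega)$ is a triple $B = (H_1, \lambda, H_2)$ where $H_1 \in \mathcal{H}_\Delta$ and $H_2 \in \mathcal{H}_\Omega$ are disjoint, and $\lambda$ is an alignment of $H_1$ and $H_2$, i.e., a graph with $V_\lambda = V_{H_1} \cup V_{H_2}$, $E_\lambda \cap E_{H_1} = E_\lambda \cap E_{H_2} = \emptyset$, $\mathrm{att}(e) \in (V_{H_1})^+ \cdot (V_{H_2})^+$ for each …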

[Figure 3 graphic omitted.]

Figure 3: The graphs (a) $H^\sigma_{IO}$, (b) $H^{\sigma_1}$, (c) $\llparenthesis s_1^{(0)} \rrparenthesis$, and (d) $\llparenthesis s_1^{(1)} \rrparenthesis$. The large circles and the dotted lines in (a) and (b) visualize the underlying term structure; e.g., in (b) $\sigma_1$ has two children because $\sigma_1 \in \Sigma_2$.

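Definition 3.2. An aligned HR bimorphism of type $(\Sigma, \Delta, \Omega)$ is a tuple $\mathcal{B} = (g, \mathcal{A}_1, \Lambda, \mathcal{A}_2)$, where $g$ is an RTG over $\Sigma$ and $\mathcal{A}_1$, $\mathcal{A}_2$ are a $(\Sigma, \Delta)$-HR algebra and a $(\Sigma, \Omega)$-HR algebra, resp., such that $\mathrm{graph}(\sigma_{\mathcal{A}_1})$ and $\mathrm{graph}(\sigma_{\mathcal{A}_2})$ are disjoint for each $\sigma \in \Sigma$. Further, $\Lambda$ is a $\Sigma$-indexed family $(\Lambda_\sigma \mid \sigma \in \Sigma)$, each $\Lambda_\sigma$ being an alignment of $\mathrm{graph}(\sigma_{\mathcal{A}_1})$ and $\mathrm{graph}(\sigma_{\mathcal{A}_2})$.

In the following, for each term $t \in T_\Sigma$, we assume (w.l.o.g.) that $[\![t]\!]_{\mathcal{A}_1}$ and $[\![t]\!]_{\mathcal{A}_2}$ are disjoint.

Definition 3.3. Let $\mathcal{B} = (g, \mathcal{A}_1, \Lambda, \mathcal{A}_2)$ be an aligned HR bimorphism. The semantics of $\mathcal{B}$, denoted by $L(\mathcal{B})$, is the set of bigraphs defined as follows.

First, let the $\mathcal{B}$-alignment be the $T_\Sigma$-indexed family $\Lambda_\mathcal{B} = (\Lambda_\mathcal{B}(t) \mid t \in T_\Sigma)$ where $\Lambda_\mathcal{B}(t)$ is the alignment of $[\![t]\!]_{\mathcal{A}_1}$ and $[\![t]\!]_{\mathcal{A}_2}$ defined inductively on $t$ as follows. Let $t = \sigma(t_1, \ldots, t_k) \in T_\Sigma$. For $j \in [2]$, suppose that $\sim_j$ is the equivalence relation involved in the hyperedge replacement that yields $\mathrm{graph}(\sigma_{\mathcal{A}_j})[e_1/[\![t_1]\!]_{\mathcal{A}_j}, \ldots, e_k/[\![t_k]\!]_{\mathcal{A}_j}]$. We assume that $\Lambda_\sigma, \Lambda_\mathcal{B}(t_1), \ldots, \Lambda_\mathcal{B}(t_k)$ have pairwise disjoint sets of edges. Then we define the $\mathcal{B}$-alignment of $[\![t]\!]_{\mathcal{A}_1}$ and $[\![t]\!]_{\mathcal{A}_2}$ to be the graph

$$\Lambda_\mathcal{B}(t) = (\Lambda_\sigma \cup \Lambda_\mathcal{B}(t_1) \cup \cdots \cup \Lambda_\mathcal{B}(t_k)) / ({\sim_1} \cup {\sim_2})$$

and we let $[\![t]\!]_\mathcal{B} = ([\![t]\!]_{\mathcal{A}_1}, \Lambda_\mathcal{B}(t), [\![t]\!]_{\mathcal{A}_2})$. Finally, we define $L(\mathcal{B}) = \{[\![t]\!]_\mathcal{B} \mid t \in L(g)\}$.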
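As a hedged illustration of the inductive definition, the $\mathcal{B}$-alignment of the term $t = \sigma_1(\sigma_2, \sigma_3(\sigma_4, \sigma_5))$ from Example 2.1 can be unfolded step by step. The sketch assumes that for a nullary symbol no edge is replaced, so that both equivalence relations are identities and the quotient returns the same graph up to isomorphism. The names $\sim_1'$ and $\sim_2'$ for the equivalence relations of the inner replacement step are introduced only for this illustration.

```latex
\begin{align*}
\Lambda_{\mathcal{B}}(\sigma_i) &= \Lambda_{\sigma_i}
  \qquad (i \in \{2, 4, 5\}) \\
\Lambda_{\mathcal{B}}(\sigma_3(\sigma_4, \sigma_5)) &=
  (\Lambda_{\sigma_3} \cup \Lambda_{\sigma_4} \cup \Lambda_{\sigma_5})
  / ({\sim_1'} \cup {\sim_2'}) \\
\Lambda_{\mathcal{B}}(t) &=
  (\Lambda_{\sigma_1} \cup \Lambda_{\sigma_2} \cup
   \Lambda_{\mathcal{B}}(\sigma_3(\sigma_4, \sigma_5)))
  / ({\sim_1} \cup {\sim_2})
\end{align*}
```

Each step only adds the alignment edges of the current symbol and then identifies vertices exactly as the two hyperedge replacements in $\mathcal{A}_1$ and $\mathcal{A}_2$ do.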

4 Case Study: Hybrid Grammars

We show how an LCFRS/sDCP hybrid grammar (Nederhof and Vogler, 2014) can be represented as an aligned HR bimorphism. These grammars deal with sequence terms; hence, we first recall their definition and show how to view sequence terms as particular graphs.

4.1 A Graph View on Sequence Terms

Let $\Gamma$ be a ranked alphabet and $Y$ be a set disjoint from $\Gamma$. The sets of terms and sequence-terms (s-terms) over $\Gamma$ indexed by $Y$ (Seki and Kato, 2008) are denoted by $T_\Gamma(Y)$ and $T_\Gamma^*(Y)$, respectively, and defined inductively as follows:

1. $Y \subseteq T_\Gamma(Y)$,

2. if $k \in \mathbb{N}$, $\gamma \in \Gamma_k$ and $s_i \in T_\Gamma^*(Y)$ for each $i \in [k]$, then $\gamma(s_1, \ldots, s_k) \in T_\Gamma(Y)$, and

3. if $n \in \mathbb{N}$ and $t_i \in T_\Gamma(Y)$ for each $i \in [n]$, then $\langle t_1, \ldots, t_n \rangle \in T_\Gamma^*(Y)$.

Let $s \in T_\Gamma^*(Y)$. We say that $s$ is linear if every $y \in Y$ occurs at most once in $s$. In the following we only consider linear s-terms. We note that, if $\Gamma = \Gamma_0$, then $s$ is essentially a string over $\Gamma_0$ and $Y$. If $\Gamma = \Gamma_1$, then $s$ corresponds to a sequence of ordinary (unranked) terms over $\Gamma_1$ indexed by $Y$.
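The mutually recursive definition of terms and s-terms, together with the linearity condition, can be sketched with nested Python values. In this sketch a variable is a string, a term $\gamma(s_1, \ldots, s_k)$ is a pair of a symbol and a list of child s-terms, and an s-term is a list of terms. This encoding, the function names, and the variable names `x` and `y` are assumptions made for illustration.

```python
from collections import Counter


def variables(node):
    """Yield every variable occurrence in a term or an s-term.

    Encoding: a variable is a str, a term gamma(s1, ..., sk) is a tuple
    (gamma, [s1, ..., sk]), and an s-term <t1, ..., tn> is a list [t1, ..., tn].
    """
    if isinstance(node, str):
        yield node
    elif isinstance(node, tuple):
        _, child_sequences = node
        for s in child_sequences:
            yield from variables(s)
    else:  # a list, i.e., an s-term
        for t in node:
            yield from variables(t)


def is_linear(sterm):
    """An s-term is linear if every variable occurs at most once in it."""
    return all(n <= 1 for n in Counter(variables(sterm)).values())


# The s-term < is(<x, y>), .(<>) > over symbols of rank 1.
s = [("is", [["x", "y"]]), (".", [[]])]
# A non-linear s-term: the variable x occurs twice.
s_bad = [("a", [["x"]]), "x"]
```

Here `is_linear(s)` holds, whereas `is_linear(s_bad)` fails because `x` occurs both below `a` and at the top level.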

Every linear s-term $s$ can be represented as a graph $\llparenthesis s \rrparenthesis$: it has two distinct ports inp and out, representing the start and end of $s$, resp. For each variable $y \in Y$, $\llparenthesis s \rrparenthesis$ has two distinct ports $y^{\mathrm{inp}}$ and $y^{\mathrm{out}}$. For each occurrence of a symbol $\gamma \in \Gamma_k$ in $s$, there is a $\gamma$-labeled edge with $2k + 2$ tentacles in $\llparenthesis s \rrparenthesis$. The $(2i-1)$-th and $2i$-th tentacle ($i \in [k]$) point to the start and end vertex, respectively, of the $i$-th child sequence of $\gamma$. The last two tentacles point towards the predecessor and the successor of $\gamma$, respectively: this may be a vertex separating two symbols, the start or end vertex of a (sub-)sequence, or the port realizing $y^{\mathrm{inp}}$ or $y^{\mathrm{out}}$ for some $y \in Y$.

For instance, the s-term

$$s_1^{(0)} = \langle \mathrm{is}(\langle x_1^{(1)}, x_1^{(2)} \rangle), .(\langle\rangle) \rangle$$

in $T_\Gamma^*(\{x_1^{(1)}, x_1^{(2)}, x_2^{(2)}\})$ is represented by $\llparenthesis s_1^{(0)} \rrparenthesis$ in Fig. 3c. The ports $(x_2^{(2)})^{\mathrm{inp}}$ and $(x_2^{(2)})^{\mathrm{out}}$ are depicted as filled circles to the left and the right of $x_2^{(2)}$, and similarly for the other variables. Note that $(x_1^{(1)})^{\mathrm{out}}$ and $(x_1^{(2)})^{\mathrm{inp}}$ coincide because $x_1^{(1)}$ is succeeded by $x_1^{(2)}$ in $s_1^{(0)}$.

4.2 LCFRS, sDCP, and LCFRS/sDCP Hybrid Grammars

Here we formalize LCFRS/sDCP hybrid grammars as particular aligned HR bimorphisms, where the algebras $\mathcal{A}_1$ and $\mathcal{A}_2$ are an LCFRS algebra and an sDCP algebra, resp. Since both LCFRS and sDCP can be viewed as particular types of attribute grammars (AG), we first define the concept of AG algebra and, in a second step, instantiate it to LCFRS algebra and sDCP algebra.

For each $\sigma \in \Sigma_k$, let $\mathrm{syn}^\sigma_\mathcal{A} = (n_0, \ldots, n_k) \in \mathbb{N}^{k+1}$ and $\mathrm{inh}^\sigma_\mathcal{A} = (m_0, \ldots, m_k) \in \mathbb{N}^{k+1}$ be tuples defining the sets $I = \{y_j^{(0)} \mid j \in [n_0]\} \cup \{y_j^{(i)} \mid i \in [k], j \in [m_i]\}$ and $O = \{x_r^{(0)} \mid r \in [m_0]\} \cup \{x_r^{(\ell)} \mid \ell \in [k], r \in [n_\ell]\}$ of inside and outside attributes, resp. (The abbreviations stem from the AG notions synthesized attributes and inherited attributes.) The definition of an AG algebra $\mathcal{A}$ follows the two-phase approach in (Engelfriet and Heyker, 1992). In the first phase, for each symbol $\sigma$ in $\Sigma_k$, we define a graph $H^\sigma_{IO}$ of type $\mathrm{syn}^\sigma_\mathcal{A}, \mathrm{inh}^\sigma_\mathcal{A}$ as shown in Fig. 3a: there is a pair of vertices for each inside attribute $y_j^{(i)}$ and each outside attribute $x_r^{(\ell)}$. For each inside attribute $y_j^{(i)}$ there is an edge $e_j^{(i)}$ which connects $y_j^{(i)}$ with all outside attributes. For the edge $e_1^{(0)}$ the tentacles are shown completely in Fig. 3a; for the other edges the tentacles to outside attributes are abridged for clarity. The edges $e_1, \ldots, e_k$ correspond to the $k$ successors of $\sigma$.

In the second phase, we choose an $I$-indexed family of s-terms $(s_j^{(i)} \in T_\Gamma^*(O) \mid y_j^{(i)} \in I)$ such that each $x_r^{(\ell)}$ in $O$ occurs exactly once in all $s_j^{(i)}$ together (single syntactic use restriction). Then we replace each edge $e_j^{(i)}$ by the graph $\llparenthesis s_j^{(i)} \rrparenthesis$; this specifies a particular information flow. Formally, we set $\sigma_\mathcal{A} = H^\sigma_{\langle e_1 \ldots e_k \rangle}$ where

$$H^\sigma = H^\sigma_{IO}[e_j^{(i)} / \llparenthesis s_j^{(i)} \rrparenthesis \mid y_j^{(i)} \in I].$$

For instance, for $\sigma_1 \in \Sigma_2$ we have $\mathrm{syn}^{\sigma_1}_\mathcal{A} = (1, 1, 2)$ and $\mathrm{inh}^{\sigma_1}_\mathcal{A} = (0, 1, 0)$, and so we construct $H^{\sigma_1}_{IO}$ accordingly. Next, we choose $\llparenthesis s_1^{(0)} \rrparenthesis$ and $\llparenthesis s_1^{(1)} \rrparenthesis$ as depicted in Fig. 3c and 3d, respectively. Then $H^{\sigma_1} = H^{\sigma_1}_{IO}[e_1^{(0)} / \llparenthesis s_1^{(0)} \rrparenthesis, e_1^{(1)} / \llparenthesis s_1^{(1)} \rrparenthesis]$ is the graph in Fig. 3b where dashed lines indicate the identification of vertices. Note that $H^{\sigma_1}$ equals $\mathrm{graph}((\sigma_1)_\mathcal{A})$ in Fig. 2a.

A $(\Sigma, \Gamma)$-HR algebra $\mathcal{A}$ is a $(\Sigma, \Gamma)$-attribute grammar algebra ($(\Sigma, \Gamma)$-AG algebra), if each symbol $\sigma$ in $\Sigma$ is interpreted as described above. For instance, $\mathcal{A}$ of Fig. 2a is a $(\Sigma, \Gamma)$-AG algebra.

We observe that $(\Sigma, \Gamma)$-AG algebras have the following property: for every edge $e \in E_{H^\sigma}$, if $\mathrm{lab}_{H^\sigma}(e) \in \Gamma_k$ then $e$ has $2k + 2$ tentacles. We call the vertex $\mathrm{att}_{H^\sigma}(e)(k + 1)$ the input vertex of $e$ and denote it by $\mathrm{inp}(e)$. Note that no two terminal edges in $H^\sigma$ have the same input vertex. This single-input property will be crucial for an efficient representation of subgraphs during the reduct construction (cf. Sec. 5.2).

Next we instantiate the concept of AG algebra to LCFRS algebras and to sDCP algebras. An LCFRS does not have inherited attributes:

Definition 4.1. Let $\Delta = \Delta_0$ be a ranked alphabet. A $(\Sigma, \Delta)$-AG algebra $\mathcal{A}$ is a $(\Sigma, \Delta)$-LCFRS algebra, if $\mathrm{inh}^\sigma_\mathcal{A} = (0, \ldots, 0)$ for all $\sigma \in \Sigma$.

Definition 4.2. Let $\Omega = \Omega_1$ be a ranked alphabet and let $\mathcal{A}$ be a $(\Sigma, \Omega)$-AG algebra. We say that $\mathcal{A}$ is a $(\Sigma, \Omega)$-sDCP algebra.

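Then the graph view on an LCFRS/sDCP hybrid grammar is an HR bimorphism $\mathcal{B} = (g, \mathcal{A}_1, \Lambda, \mathcal{A}_2)$, where

• $g = (\Xi, \Sigma, \xi_0, R)$ is an RTG,

• $\mathcal{A}_1$ is a $(\Sigma, \Delta)$-LCFRS algebra,

• $\mathcal{A}_2$ is a $(\Sigma, \Omega)$-sDCP algebra, $\Delta = \Omega$ (regarding $\Delta$ and $\Omega$ as sets of symbols), and

• there are functions $\mathrm{fan}\colon \Xi \to \mathbb{N}$, $\mathrm{inh}\colon \Xi \to \mathbb{N}$, and $\mathrm{syn}\colon \Xi \to \mathbb{N}$ such that $\mathrm{fan}(\xi_0) = 1$, $\mathrm{inh}(\xi_0) = 0$, $\mathrm{syn}(\xi_0) = 1$, and for every $(\xi \to \sigma(\xi_1, \ldots, \xi_k)) \in R$, it holds that

  – $(\mathrm{fan}(\xi), \mathrm{fan}(\xi_1), \ldots, \mathrm{fan}(\xi_k)) = \mathrm{syn}^\sigma_{\mathcal{A}_1}$ and

  – $(\mathrm{inh}(\xi), \mathrm{inh}(\xi_1), \ldots, \mathrm{inh}(\xi_k)) = \mathrm{inh}^\sigma_{\mathcal{A}_2}$ and $(\mathrm{syn}(\xi), \mathrm{syn}(\xi_1), \ldots, \mathrm{syn}(\xi_k)) = \mathrm{syn}^\sigma_{\mathcal{A}_2}$.

Moreover, we require the following: Let $\sigma \in \Sigma$ and $H_j = \mathrm{graph}(\sigma_{\mathcal{A}_j})$ for $j \in [2]$. For each $e \in E_{\Lambda_\sigma}$ we have $\mathrm{att}_{\Lambda_\sigma}(e) = \mathrm{inp}(e_1)\,\mathrm{inp}(e_2)$ where $e_1 \in E_{H_1}$, $e_2 \in E_{H_2}$, and $\mathrm{lab}_{H_1}(e_1) = \mathrm{lab}_{H_2}(e_2)$.

Example 4.3. Let $g$ be as in Ex. 2.1 and consider the LCFRS/sDCP hybrid grammar $\mathcal{B} = (g, \mathcal{A}_1, \Lambda, \mathcal{A}_2)$, where $\mathcal{A}_1$, $\Lambda$, and $\mathcal{A}_2$ are as specified in Fig. 4. Then the bigraph in Fig. 1b equals …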

[Figure 4 graphic omitted. For each $i \in [5]$ it shows the graphs of $(\sigma_i)_{\mathcal{A}_1}$ and $(\sigma_i)_{\mathcal{A}_2}$ together with the alignment $\Lambda_{\sigma_i}$.]

Figure 4: The interpretation of $\sigma_1, \ldots, \sigma_5$ in $\mathcal{A}_1$, $\Lambda$, and $\mathcal{A}_2$.

5 EM Training

We present a training algorithm which takes as input a weighted aligned HR bimorphism and a finite, non-empty corpus $c$ of bigraphs. It is essentially the same as the training algorithm for probabilistic context-free grammars (PCFG) (Baker, 1979; Lari and Young, 1990; Nederhof and Satta, 2008). As shown in (Prescher, 2001), this algorithm is a dynamic programming variant of the EM-algorithm (Dempster et al., 1977). Thus, our algorithm generates a sequence of probability assignments which converges to a probability assignment $\hat{p}$; the likelihood of $c$ under $\hat{p}$ is a local maximum or saddle point of the likelihood function of $c$.

5.1 Weighted Aligned HR Bimorphisms

We define weighted RTG in a similar way as PCFG was defined in (Nederhof and Satta, 2006).

A weighted regular tree grammar (WRTG) is a pair $(g, p)$ where $g = (\Xi, \Sigma, \xi_0, R)$ is an RTG and $p\colon R \to \mathbb{R}_{\geq 0}$ is the weight assignment. A weight assignment $p$ is a probability assignment if $\sum_{\rho \in R_\xi} p(\rho) = 1$ for each $\xi \in \Xi$. We extend $p$ to the mapping $p'\colon D_g(T_\Sigma) \to \mathbb{R}_{\geq 0}$ on derivations as follows: for each $d = \rho(d_1, \ldots, d_k)$ in $D_g(T_\Sigma)$ we define $p'(d) = p(\rho) \cdot \prod_{i=1}^k p'(d_i)$. For each $t \in T_\Sigma$ we define $p''(t) = \sum_{d \in D_g^{\xi_0}(t)} p'(d)$. We define the mappings $\mathrm{in}\colon \Xi \to \mathbb{R}_{\geq 0} \cup \{\infty\}$ (inside weight) and $\mathrm{out}\colon \Xi \to \mathbb{R}_{\geq 0} \cup \{\infty\}$ (outside weight) for each $\xi \in \Xi$ by

$$\mathrm{in}(\xi) = \sum_{d \in D_g^\xi(T_\Sigma)} p'(d) \qquad\qquad \mathrm{out}(\xi) = \sum_{d \in D_g^{\xi_0}(T_\Sigma, \xi)} p'''(d)$$

where $p'''\colon D_g^{\xi_0}(T_\Sigma, \xi) \to \mathbb{R}_{\geq 0}$ is defined in the same way as $p'$, with the addition that $p'''(\xi) = 1$. As usual, we will drop the primes from $p'$, $p''$, and $p'''$.
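For a non-recursive WRTG the sums defining the inside and outside weights are finite and can be computed by the usual recursions. The following Python sketch does this for the nonterminals of Example 2.1. The weights $1/2$ for $A$ and $2/5$ for $C$ are hypothetical values chosen so that the result is not trivially $1$; they do not form a probability assignment. Terminal symbols are omitted because the weights depend only on the nonterminals of each rule. The recursion would not terminate on a recursive grammar, which is an explicit limitation of this sketch.

```python
from fractions import Fraction
from functools import lru_cache
from math import prod

# Rules as (left-hand side, right-hand-side nonterminals, weight).
# The weights are hypothetical and deliberately unnormalized.
RULES = [
    ("S", ("A", "B"), Fraction(1)),
    ("A", (), Fraction(1, 2)),
    ("B", ("C", "D"), Fraction(1)),
    ("C", (), Fraction(2, 5)),
    ("D", (), Fraction(1)),
]
START = "S"


@lru_cache(maxsize=None)
def inside(xi):
    """Sum over all rules xi -> sigma(xi_1..xi_k) of p(rule) * prod_i in(xi_i)."""
    return sum(
        (w * prod((inside(c) for c in rhs), start=Fraction(1))
         for lhs, rhs, w in RULES if lhs == xi),
        start=Fraction(0),
    )


@lru_cache(maxsize=None)
def outside(xi):
    """Total weight of all derivation contexts with exactly one hole labeled xi."""
    # The initial nonterminal contributes the empty context of weight 1.
    total = Fraction(1) if xi == START else Fraction(0)
    for lhs, rhs, w in RULES:
        for i, child in enumerate(rhs):
            if child != xi:
                continue
            siblings = prod(
                (inside(c) for j, c in enumerate(rhs) if j != i),
                start=Fraction(1),
            )
            total += outside(lhs) * w * siblings
    return total
```

In this grammar every nonterminal occurs exactly once in the single derivation tree, so the product of inside and outside weight equals the inside weight of the initial nonterminal for each nonterminal.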

Definition 5.1. A weighted aligned HR bimorphism is a pair $(\mathcal{B}, p) = ((g, \mathcal{A}_1, \Lambda, \mathcal{A}_2), p)$ where $(g, \mathcal{A}_1, \Lambda, \mathcal{A}_2)$ is an aligned HR bimorphism and $(g, p)$ is a WRTG.

5.2 Reduct Construction

Given a weighted aligned HR bimorphism $(\mathcal{B}, p) = ((g, \mathcal{A}_1, \Lambda, \mathcal{A}_2), p)$ and a bigraph $(H_1, \lambda, H_2)$, we restrict $g$ to an RTG $g'$ such that only trees $t \in L(g)$ satisfying $[\![t]\!]_\mathcal{B} = (H_1, \lambda, H_2)$ are in $L(g')$. Also, we show that if $\mathcal{B}$ is an LCFRS/sDCP hybrid grammar, then $g'$ can be constructed in time polynomial in the size of $\mathcal{B}$ and $(H_1, \lambda, H_2)$.

Definition 5.2. Let $m \in \mathbb{N}$ and $H \in \mathcal{H}$. A graph $H'$ is an $m$-subgraph of $H$ if $\mathrm{int}(H') \subseteq \mathrm{int}(H)$, $E_{H'} \subseteq E_H$, $\mathrm{lab}_{H'} = \mathrm{lab}_H|_{E_{H'}}$, and $|\mathrm{ports}_{H'}| \leq m$. Moreover, we require that there is a mapping $\phi\colon V_{H'} \to V_H$, called vertex mapping, such that $\phi(\mathrm{att}_{H'}(e)) = \mathrm{att}_H(e)$ for each $e \in E_{H'}$, $\phi(v) = v$ for each $v \in \mathrm{int}(H')$, and $\phi(\mathrm{int}(H')) \cap \phi([\mathrm{ports}_{H'}]) = \emptyset$. Moreover, for each $v \in \mathrm{int}(H')$, if $v \in [\mathrm{att}_H(e)]$ for some $e \in E_H$, then $e \in E_{H'}$. The set of all $m$-subgraphs of $H$ is denoted by $\mathcal{H}^m_S(H)$.

For instance, $\mathrm{graph}((\sigma_4)_\mathcal{A})$ in Fig. 2a is a 2-subgraph of the last graph in Fig. 2b. If a graph $H$ is the result of applying an HR operation to graphs $H_1, \ldots, H_k$, then each $H_i$ is a $|\mathrm{ports}_{H_i}|$-subgraph of $H$. (The vertex mapping is needed, because some of the ports of $H_i$ may be identified with each other in $H$.) Hence, for the reduct we consider only $m$-subgraphs of $H$, where $m$ is the maximal port length of HR operations in $\mathcal{A}_1$ or $\mathcal{A}_2$. We observe that $\mathcal{H}^m_S(H)$ is finite because we identify isomorphic graphs.

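Definition 5.3. Let $(\mathcal{B}, p) = ((g, \mathcal{A}_1, \Lambda, \mathcal{A}_2), p)$ be a weighted aligned HR bimorphism with $g = (\Xi, \Sigma, \xi_0, R)$ and let $(H_1, \lambda, H_2)$ be a bigraph. We define the reduct of $(\mathcal{B}, p)$ with respect to $(H_1, \lambda, H_2)$ to be the weighted aligned HR bimorphism $((g', \mathcal{A}_1, \Lambda, \mathcal{A}_2), p')$ where $g'$ and $p'$ are defined as follows.

If $[\![.]\!]_\mathcal{B}^{-1}(H_1, \lambda, H_2) = \emptyset$, then $g' = (\{\xi_0'\}, \Sigma, \xi_0', \emptyset)$ and $p' = \emptyset$. Otherwise, let $m \in \mathbb{N}$ be the maximum of all $|\mathrm{ports}_{\mathrm{graph}(\sigma_{\mathcal{A}_1})}|$ and $|\mathrm{ports}_{\mathrm{graph}(\sigma_{\mathcal{A}_2})}|$ where $\sigma \in \Sigma$. Now, we construct $g' = (\Xi', \Sigma, \xi_0', R')$ where we abbreviate $\mathcal{H}^m_S(H_1) \times \mathcal{H}^m_S(\lambda) \times \mathcal{H}^m_S(H_2)$ by $\mathcal{H}^m_S(H_1, \lambda, H_2)$:

• $\Xi' = \Xi \times (\mathcal{H}^m_S(H_1, \lambda, H_2) \cap [\![T_\Sigma]\!]_\mathcal{B})$,

• $\xi_0' = (\xi_0, H_1, \lambda, H_2)$, and

• for every rule $\rho = (\xi \to \sigma(\xi_1, \ldots, \xi_k)) \in R$ and $(s, \eta, r), (s_1, \eta_1, r_1), \ldots, (s_k, \eta_k, r_k)$ in $\mathcal{H}^m_S(H_1, \lambda, H_2) \cap [\![T_\Sigma]\!]_\mathcal{B}$ we have

$$\rho' = (\xi, s, \eta, r) \to \sigma((\xi_1, s_1, \eta_1, r_1), \ldots, (\xi_k, s_k, \eta_k, r_k)) \in R'$$

if $s = \sigma_{\mathcal{A}_1}(s_1, \ldots, s_k)$, $\eta = \Lambda_\sigma \cup \eta_1 \cup \ldots \cup \eta_k$, and $r = \sigma_{\mathcal{A}_2}(r_1, \ldots, r_k)$.

We define $p'(\rho') = p(\rho)$.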

Theorem 5.4. In Def. 5.3 the following hold:

1. $L(g') = [\![.]\!]_\mathcal{B}^{-1}(H_1, \lambda, H_2) \cap L(g)$.

2. There is a deterministic tree relabeling $\varphi$ from $D_{g'}$ to $D_g$ such that for all $t \in L(g')$ and $d' \in D_{g'}(t)$, $\varphi|_{D_{g'}(t)}$ is a bijection between $D_{g'}(t)$ and $D_g(t)$, and $p'(d') = p(\varphi(d'))$.

Proof. If $[\![.]\!]_\mathcal{B}^{-1}(H_1, \lambda, H_2) = \emptyset$, then $L(g') = \emptyset$ by construction, and thus, both statements of the theorem hold. Otherwise, the first statement follows from the following claim.

Claim (*) For every $n \in \mathbb{N}$, $\xi \in \Xi$, $t \in T_\Sigma$, and $(s, \eta, r) \in \mathcal{H}^m_S(H_1, \lambda, H_2) \cap [\![T_\Sigma]\!]_\mathcal{B}$ it holds that $(\xi, s, \eta, r) \Rightarrow^n_{g'} t$ iff $\xi \Rightarrow^n_g t$ and $[\![t]\!]_{\mathcal{A}_1} = s$ and $\Lambda_\mathcal{B}(t) = \eta$ and $[\![t]\!]_{\mathcal{A}_2} = r$.

For the proof of the second statement we define $\varphi((\xi, s, \eta, r)) = \xi$ for each $(\xi, s, \eta, r) \in \Xi'$, and extend $\varphi$ in the canonical way to derivation trees. Then the statement is an immediate consequence of the constructions of $R'$ and $\varphi$ and Claim (*).

Complexity We determine the complexity of the reduct construction for the special case of LCFRS/sDCP hybrid grammars. We assume that the maximal length $m$ of ports in the HR operations is fixed and not part of the input. In preparation, we determine an upper bound on $|\Xi'|$.

Let $H \in \mathcal{H}$ be such that from each vertex there is an (undirected) path to a port. Given $H$, each $H' \in \mathcal{H}^m_S(H) \cap [\![T_\Sigma]\!]_\mathcal{A}$ is uniquely determined by its boundary representation (Lautemann, 1990; Chiang et al., 2013; Groschwitz et al., 2015), which consists of (a) the pair $(\mathrm{ports}_{H'}, \phi(\mathrm{ports}_{H'}))$, (b) the set of boundary edges $E^B_{H'}$ of $H'$ consisting of all $e \in E_{H'}$ such that $\mathrm{att}_{H'}(e) \cap [\mathrm{ports}_{H'}] \neq \emptyset$, and (c) a function $\mathrm{att}'\colon E^B_{H'} \to ([\mathrm{ports}_{H'}] \cup \{\bot\})^*$ such that $\mathrm{att}'(e)(i) = \mathrm{att}_{H'}(e)(i)$ if $\mathrm{att}_{H'}(e)(i) \in [\mathrm{ports}_{H'}]$, and $\bot$ otherwise. Note that (c) is necessary because $\phi|_{[\mathrm{ports}_{H'}]}$ might not be injective.

Now, for an arbitrary $(\Sigma, \Gamma)$-AG algebra $\mathcal{A}$ and $H \in \llparenthesis T_\Gamma^*(\emptyset) \rrparenthesis$ the following holds. Due to the single-input property of $H$ and of the involved HR operations, for each $m$-subgraph $H'$ the information (b) and (c) can be inferred from (a). There are $S_{m,k}$ port sequences of length $m$ with $k$ distinct vertices, where $S_{m,k}$ is the Stirling number of the second kind. For each port we choose a vertex in $V_H$. Thus, we obtain that

$$|\mathcal{H}^m_S(H) \cap [\![T_\Sigma]\!]_\mathcal{A}| \leq \sum_{k=0}^m S_{m,k} \cdot |V_H|^k \leq m^m \cdot |V_H|^m.$$
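The second inequality of this bound can be checked numerically for small values. The following Python sketch computes Stirling numbers of the second kind by their standard recurrence and compares the sum with the coarser bound $m^m \cdot |V_H|^m$. The function names and the tested ranges are assumptions of this sketch, and the check is a sanity test for small parameters, not a proof.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(m, k):
    """Stirling number of the second kind S(m, k)."""
    if m == k:
        return 1
    if k == 0 or k > m:
        return 0
    # Standard recurrence: S(m, k) = k * S(m-1, k) + S(m-1, k-1).
    return k * stirling2(m - 1, k) + stirling2(m - 1, k - 1)


def subgraph_bound(m, n):
    """The sum over k = 0..m of S(m, k) * n**k, where n stands for |V_H|."""
    return sum(stirling2(m, k) * n**k for k in range(m + 1))


def coarse_bound(m, n):
    """The coarser bound m**m * n**m."""
    return m**m * n**m


# Compare both bounds on a small grid of port lengths m and vertex counts n.
holds = all(
    subgraph_bound(m, n) <= coarse_bound(m, n)
    for m in range(0, 7)
    for n in range(1, 20)
)
```

For example, with port length $m = 2$ and $|V_H| = 3$ the sum evaluates to $12$, whereas the coarse bound is $36$.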

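Next, we analyze the $\mathcal{B}$-alignments of an LCFRS/sDCP hybrid grammar $\mathcal{B}$. Let $(s, \eta, r) \in \mathcal{H}^m_S(H_1, \lambda, H_2)$ and $t \in T_\Sigma$ with $[\![t]\!]_\mathcal{B} = (s, \eta, r)$. Then $V_\eta = V_s \cup V_r$. Let $e \in E_\lambda$ and $e_1 \in E_{H_1}$ and $e_2 \in E_{H_2}$ with $\mathrm{att}_\lambda(e) = \mathrm{inp}(e_1)\,\mathrm{inp}(e_2)$. (i) Assume that there is $t' \in T_\Sigma$ with $[\![t']\!]_\mathcal{B} = (H_1, \lambda, H_2)$ and $t$ is a subtree of $t'$. Then $e_1$ and $e_2$ are unique by the single-input property. Hence, $e_1 \in E_s$ iff $e_2 \in E_r$ iff $e \in E_\eta$ because $t$ is a subtree of $t'$ and by the structure of $\Lambda_\sigma$. (ii) If there is no such $t'$, then the equivalences under (i) may be violated, in which case $(s, \eta, r)$ can safely be pruned.

Thus, for each LCFRS/sDCP hybrid grammar $\mathcal{B}$ and each $(H_1, \lambda, H_2)$ with $H_1 \in \llparenthesis T_\Delta^*(\emptyset) \rrparenthesis$ and $H_2 \in \llparenthesis T_\Omega^*(\emptyset) \rrparenthesis$ we obtain the upper bound $|\Xi| \cdot m^{2m} \cdot |V_{H_1}|^m \cdot |V_{H_2}|^m$ on $|\Xi'|$. This bound can be refined to $|\Xi| \cdot m^{2m} \cdot |V_{H_1}|^{2 \cdot \mathrm{fan}^*} \cdot |V_{H_2}|^{2 \cdot (\mathrm{syn}^* + \mathrm{inh}^*)}$, where $f^* = \max_{\xi \in \Xi} f(\xi)$ for $f \in \{\mathrm{fan}, \mathrm{syn}, \mathrm{inh}\}$.

Constructing $\Xi'$ and $R'$ simultaneously with a … a worst-case time complexity in $O(|\Xi'|^k)$ where $k$ is the maximum rank of $\Sigma$.

5.3 EM Training Algorithm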

In the first step of our training algorithm, a corpus $c'\colon R \to \mathbb{R}_{\geq 0}$ is computed as follows. After initialization (line 3), each bigraph $B$ occurring in $c$ is considered (line 4), the reduct of $(\mathcal{B}, p_i)$ with respect to $B$ is built (line 5), the inside/outside weights of the new WRTG $(g', p')$ are calculated (line 6), and according to these weights and the current weight assignment $p_i$ the count $c'(\rho)$ of each rule is incremented (lines 8–9). In the second step, the corpus $c'$ is normalized (lines 10–14) and the result is the next probability assignment $p_{i+1}$ (line 15).

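Algorithm 5.1 EM-training algorithm for weighted aligned HR bimorphisms.

Input: weighted aligned HR bimorphism $(\mathcal{B}, p_0) = ((g, \mathcal{A}_1, \Lambda, \mathcal{A}_2), p_0)$ with $g = (\Xi, \Sigma, \xi_0, R)$, and a finite, non-empty corpus $c$ of bigraphs.

Output: sequence $p_1, p_2, p_3, \ldots$ of improved probability assignments for $R$.

 1: $i \leftarrow 0$
 2: while true do
 3:     initialize counts $c'\colon R \to \mathbb{R}_{\geq 0}$: $c'(\rho) \leftarrow 0$
 4:     for $B = (H_1, \lambda, H_2)$ s.t. $c(B) > 0$ do
 5:         $((g', \mathcal{A}_1, \Lambda, \mathcal{A}_2), p') \leftarrow$ reduct of $(\mathcal{B}, p_i)$ with respect to $B$, with RTG $g' = (\Xi', \Sigma, \xi_0', R')$ and det. tree rel. $\varphi$ // cf. Thm 5.4
 6:         compute out and in for the WRTG $(g', p')$
 7:         if $\mathrm{in}(\xi_0') \neq 0$ then
 8:             for $\rho = (\xi \to \sigma(\xi_1, \ldots, \xi_k)) \in R$ do
 9:                 $c'(\rho) \leftarrow c'(\rho) + c(B) \cdot \mathrm{in}(\xi_0')^{-1} \cdot \sum_{\rho' \in R' \colon \varphi(\rho') = \rho \,\wedge\, \rho' = (\xi' \to \sigma(\xi_1', \ldots, \xi_k'))} \mathrm{out}(\xi') \cdot p_i(\rho) \cdot \prod_{j=1}^k \mathrm{in}(\xi_j')$
10:     for $\xi \in \Xi$ do
11:         $s \leftarrow \sum_{\rho \in R_\xi} c'(\rho)$
12:         for $\rho \in R_\xi$ do
13:             if $s = 0$ then $p_{i+1}(\rho) \leftarrow p_i(\rho) \cdot |R_\xi|^{-1}$
14:             else $p_{i+1}(\rho) \leftarrow s^{-1} \cdot c'(\rho)$
15:     output $p_{i+1}$ and $i \leftarrow i + 1$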

Acknowledgment

We thank the referees for their careful reading of the manuscript.

References

James K. Baker. 1979. Trainable grammars for speech recognition. In Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, pages 547–550.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proc. 7th Linguistic Annotation Workshop, ACL 2013 Workshop.

Michel Bauderon and Bruno Courcelle. 1987. Graph expressions and graph rewriting. Mathematical Systems Theory, 20:83–127.

Suna Bensch, Frank Drewes, Helmut Jürgensen, and Brink van der Merwe. 2014. Graph transformation for incremental natural language analysis. Theoretical Computer Science, 531:1–25.

Walter S. Brainerd. 1969. Tree generating regular systems. Inform. and Control, 14:217–231.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311.

David Chiang, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, Bevan Jones, and Kevin Knight. 2013. Parsing graphs with hyperedge replacement grammars. In Proc. of the 51st Annual Meeting of the Association for Computational Linguistics, pages 924–932.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

Bruno Courcelle. 1991. The monadic second-order logic of graphs V: on closing the gap between definability and recognizability. Theoretical Computer Science, 80:153–202.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38.

Pierre Deransart and Jan Małuszynski. 1985. Relating logic programs and attribute grammars. J. Logic Programming, 2:119–155.

Joost Engelfriet and Linda Heyker. 1992. Context-free hypergraph grammars have the same term-generating power as attribute grammars. Acta Informatica, 29(2):161–210.

Joseph A. Goguen, James W. Thatcher, Eric G. Wagner, and Jesse B. Wright. 1977. Initial algebra semantics and continuous algebras. J. ACM, 24:68–95.

Jonas Groschwitz, Alexander Koller, and Christoph Teichmann. 2015. Graph parsing with s-graph grammars. In Proc. of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1481–1490.

Annegret Habel and Hans-Jörg Kreowski. 1987. May we introduce to you: Hyperedge replacement. In Proc. of the Third Intl. Workshop on Graph Grammars and Their Application to Computer Science, pages 15–26.

Annegret Habel. 1992. Hyperedge Replacement: Grammars and Languages, volume 643 of Lecture Notes in Computer Science. Springer.

Bevan Jones, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, and Kevin Knight. 2012. Semantics-based machine translation with hyperedge replacement grammars. In M. Kay and C. Boitet, editors, Proc. 24th Intl. Conf. on Computational Linguistics (COLING 2012): Technical Papers, pages 1359–1376.

Karim Lari and Steve J. Young. 1990. The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language, 4:35–56.

Clemens Lautemann. 1990. The complexity of graph languages generated by hyperedge replacement. Acta Inf., 27(5):399–421.

Philip M. Lewis and Richard E. Stearns. 1968. Syntax-directed transduction. J. ACM, 15(3):465–488.

Mark-Jan Nederhof and Giorgio Satta. 2006. Probabilistic parsing strategies. J. ACM, 53(3):406–436.

Mark-Jan Nederhof and Giorgio Satta. 2008. Probabilistic parsing. In Gemma Bel-Enguix, M. Dolores Jiménez-López, and Carlos Martín-Vide, editors, New Developments in Formal Languages and Applications, volume 113 of Studies in Computational Intelligence, pages 229–258. Springer.

Mark-Jan Nederhof and Heiko Vogler. 2012. Synchronous context-free tree grammars. In Proc. of the 11th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+11), pages 55–63.

Mark-Jan Nederhof and Heiko Vogler. 2014. Hybrid grammars for discontinuous parsing. In Proc. of 25th International Conference on Computational Linguistics (COLING 2014), pages 1370–1381.

Detlef Prescher. 2001. Inside-outside estimation meets dynamic EM. In Proc. of the 7th International Workshop on Parsing Technologies, pages 241–244.

Hiroyuki Seki and Yuki Kato. 2008. On the generative power of multiple context-free grammars and macro grammars. IEICE - Transactions on Information and Systems, E91-D(2):209–221.

Stuart M. Shieber and Yves Schabes. 1990. Synchronous tree-adjoining grammars. In Proc. of the 13th International Conference on Computational Linguistics, volume 3, pages 253–258.

Stuart M. Shieber, Yves Schabes, and Fernando C. N. Pereira. 1995. Principles and implementation of deductive parsing. J. Logic Programming, 24(1–2):3–36.

Krishnamurti Vijay-Shanker, David J. Weir, and Aravind K. Joshi. 1987. Characterizing structural descriptions produced by various grammatical formalisms. In Proc. of the 25th Annual Meeting of the Association for Computational Linguistics, pages 104–111.