EM-Training for Weighted Aligned Hypergraph Bimorphisms
Frank Drewes
Department of Computing Science Ume˚a University
S-901 87 Ume˚a, Sweden [email protected]
Kilian Gebhardt and Heiko Vogler Department of Computer Science
Technische Universit¨at Dresden D-01062 Dresden, Germany
[email protected] [email protected]
Abstract
We develop the concept of weighted aligned hypergraph bimorphism where the weights may, in particular, represent proba-bilities. Such a bimorphism consists of an
R≥0-weighted regular tree grammar, two
hypergraph algebras that interpret the gen-erated trees, and a family of alignments between the two interpretations. Seman-tically, this yields a set of bihypergraphs each consisting of two hypergraphs and an explicit alignment between them; e.g., discontinuous phrase structures and non-projective dependency structures are bihy-pergraphs. We present an EM-training al-gorithm which takes a corpus of bihyper-graphs and an aligned hypergraph bimor-phism as input and generates a sequence of weight assignments which converges to a local maximum or saddle point of the likelihood function of the corpus.
1 Introduction
In natural language processing alignments play an important role. For instance, in machine translation they show up as hidden information when training probabilities of dictionaries (Brown et al., 1993) or when considering pairs of input/output sentences derived by a synchronous grammar (Lewis and Stearns, 1968; Chiang, 2007; Shieber and Schabes, 1990; Nederhof and Vogler, 2012). As another example, in language models for discontinuous phrase structures and non-projective dependency structures they can be used to capture the connec-tion between the words in a natural language sen-tence and the corresponding nodes of the parse tree or dependency structure of that sentence.
In (Nederhof and Vogler, 2014) the generation of discontinuous phrase structures has been
for-malized by the new concept of hybrid grammar. Much as in the mentioned synchronous grammars, a hybrid grammar synchronizes the derivations of nonterminals of a string grammar, e.g., a lin-ear context-free rewriting system (LCFRS) (Vijay-Shanker et al., 1987), and of nonterminals of a tree grammar, e.g., regular tree grammar (Brainerd, 1969) or simple definite-clause programs (sDCP) (Deransart and Małuszynski, 1985). Additionally it synchronizes terminal symbols, thereby establish-ing an explicit alignment between the positions of the string and the nodes of the tree. We note that LCFRS/sDCP hybrid grammars can also generate non-projective dependency structures.
In this paper we focus on the task of training an LCFRS/sDCP hybrid grammar, that is, assigning probabilities to its rules given a corpus of discon-tinuous phrase structures or non-projective depen-dency structures. Since the alignments are first class citizens, we develop our approach in the general framework of hypergraphs and hyperedge replacement (HR) (Habel, 1992). We define the concepts ofbihypergraph(for short: bigraph) and
aligned HR bimorphism. A bigraph consists of hypergraphs H1, λ, and H2, where λrepresents
the alignment betweenH1andH2. A bimorphism
B= (g,A1,Λ,A2)consists of a regular tree
gram-marggenerating trees over some ranked alphabet
Σ, twoΣ-algebrasA1andA2which interpret each
symbol inΣas an HR operation (thus evaluating every tree to two hypergraphs), and aΣ-indexed familyΛof alignments between the two interpreta-tions of eachσ ∈Σ. The semantics ofBis a set of bigraphs.
For instance, each discontinuous phrase struc-ture or non-projective dependency strucstruc-ture can be represented as a bigraph(H1, λ, H2)whereH1and
H2correspond to the string component and the tree
component, respectively. Fig. 1 shows an example of a bigraph representing a non-projective
(a)
A hearing is scheduled on the issue today . is
hearing
A on
issue the
scheduled
today .
(b)
(y(0) 1 )inp
is . (y
(0) 1 )out
hearing scheduled
A on today
issue
the
(y(0) 1 )inp
A hearing is scheduled on the issue today .
(y(0) 1 )out
H2
H1
λ
Figure 1: (a) A sentence with non-projective dependencies is represented in (b) by a bigraph(H1, λ, H2).
Both hypergraphsH1 and H2 contain a distinct hyperedge (box) for each word of the sentence. H1
specifies the linear order on the words.H2describes parent-child relationships between the words, where
children form a list to whose start and end the parent has a tentacle. The alignmentλestablishes a one-to-one correspondence between the (input vertices of the) hyperedges inH1andH2.
dency structure. We present each LCFRS/sDCP hybrid grammar as a particular aligned HR bimor-phism; this establishes an initial algebra semantics (Goguen et al., 1977) for hybrid grammars.
The flexibility of aligned HR bimorphisms goes well beyond hybrid grammars as they generalize the synchronous HR grammars of (Jones et al., 2012), making it possible to synchronously gen-erate two graphs connected by explicit alignment structures. Thus, they can for instance model align-ments involving directed acyclic graphs like Ab-stract Meaning Representations (Banarescu et al., 2013) or Millstream systems (Bensch et al., 2014). Our training algorithm takes as input an aligned HR bimorphismB= (g,A1,Λ,A2)and a corpus
cof bigraphs. It is based on the dynamic program-ming variant (Baker, 1979; Lari and Young, 1990; Prescher, 2001) of the EM-algorithm (Dempster et al., 1977) and thus approximates a local maximum or saddle point of the likelihood function ofc.
In order to calculate the significance of each rule of g for the generation of a single bigraph
(H1, λ, H2) occurring inc, we proceed as usual,
constructing the reduct B (H1, λ, H2) which
generates the singleton(H1, λ, H2)via the same
derivation trees as Band preserves the
probabil-ities. We show that the complexity of construct-ing the reduct is polynomial in the size ofgand
(H1, λ, H2)ifBis an LCFRS/sDCP hybrid
gram-mar. However, as the algorithm itself is not limited to this situation, we expect it to be useful in other cases as well.
2 Preliminaries
Basic mathematical notation We denote the set of natural numbers (including0) byNand the set
N\ {0}byN. Forn∈N, we denote{1, . . . , n} by[n]. An alphabetA is a finite set of symbols. We denote the set of all strings overAbyA∗, the
empty string byε, andA∗\ {ε}byA+. We denote
the length ofs∈A∗by|s|and, for eachi∈[|s|],
theith item insbys(i), i.e.,sis identified with the functions: [|s|]→Asuch thats=s(1)· · ·s(|s|). We denote the range{s(1), . . . , s(|s|)}ofsby[s]. The powerset of a setAis denoted byP(A). The canonical extension of a functionf:A → B to f:P(A) → P(B) and to f: A∗ → B∗ are
de-fined as usual and denoted byf as well. We denote the restriction off:A→B toA0 ⊆Abyf|
A0.
For an equivalence relation∼onB we denote the equivalence class of b ∈ B by[b]∼ and the
[image:2.595.91.501.65.311.2]B we define the function f/∼: A → B/∼ by
f/∼(a) = [f(a)]∼. In particular, for a strings∈
B∗we lets/∼= [s(1)]∼· · ·[s(|s|)]∼.
Terms, regular tree grammars, and algebras A ranked alphabet is a pair (Σ,rk) where Σ is an alphabet andrk: Σ→Nis a mapping associat-ing a rank with each symbol ofΣ. Often we just writeΣinstead of(Σ,rk). We abbreviaterk−1(k)
byΣk. In the following letΣbe a ranked alphabet.
LetAbe an arbitrary set. We letΣ(A)denote the set of strings {σ(a1, . . . , ak) | k ∈ N, σ ∈ Σk, a1, . . . , ak ∈ A}(where the parentheses and
commas are special symbols not inΣ). Theset of well-formed terms overΣindexed byA, denoted byTΣ(A), is defined to be the smallest setT such
thatA⊆T andΣ(T)⊆T. We abbreviateTΣ(∅)
byTΣ and writeσinstead ofσ()forσ ∈Σ0.
A regular tree grammar (RTG)1 (G´ecseg and
Steinby, 1984) is a tupleg= (Ξ,Σ, ξ0, R)where Ξis an alphabet (nonterminals), Ξ∩Σ = ∅, el-ements inΣ are called terminals,ξ0 ∈ Ξ(initial
nonterminal),Ris a ranked alphabet (rules); each rule inRkhas the formξ →σ(ξ1, . . . , ξk)where
ξ, ξ1, . . . , ξk∈Ξ,σ∈Σk. We denote the set of all
rules with left-hand sideξbyRξfor eachξ∈Ξ.
Since RTGs are particular context-free gram-mars, the concepts of derivation relation and gen-erated language are inherited. Thelanguage of the RTGg is the set of all well-formed terms inTΣ
generated byg; this language is denoted byL(g). We define the Ξ-indexed family (Dgξ | ξ ∈ Ξ) of mappings Dξg: TΣ → P(TR); for each
term t ∈ TΣ, Dξg(t) is the set of t’s
deriva-tion trees in TR which start with ξ and yield t.
Formally, for each ξ ∈ Ξ and σ(t1, . . . , tk) ∈
TΣ, the setDgξ(σ(t1, . . . , tk))contains each term
%(d1, . . . , dk) where % = (ξ → σ(ξ1, . . . , ξk))
is in R and di ∈ Dξgi(ti) for each i ∈ [k].
We defineS Dg(t) = Sξ∈ΞDgξ(t) andDξg(TΣ) = t∈TΣD
ξ
g(t). Finally,Dξg0(TΣ, ξ)is the set of all
ζ ∈ TR({ξ})such that there is a ζ0 ∈ Dgξ0(TΣ)
which has a subtree whose root is inRξ, andζ is
obtained fromζ0 by replacing exactly one of these
subtrees byξ.
Example 2.1. Let Σ = Σ0 ∪Σ2 where Σ0 =
{σ2, σ4, σ5}andΣ2 ={σ1, σ3}. Letgbe an RTG
1in this context we use “tree” and “term” as synonyms
with initial nonterminalSand the following rules: S→σ1(A, B) A→σ2
B→σ3(C, D) C→σ4 D→σ5
We observe thatS ⇒∗
g σ1(σ2, σ3(σ4, σ5)). Letη,
ζ, andζ0be the following trees (in order):
B→σ3(C, D)
C→σ4 D→σ5
S→σ1(A, B)
A→σ2 B
S→σ1(A, B)
A→σ2 η
Thenζ ∈DS
g(TΣ, B)becauseζ0 ∈ DgS(TΣ)and
the left-hand side of the root ofηisB.
A Σ-algebrais a pairA = (A,(σA | σ∈Σ))
whereAis a set andσAis ak-ary operation onA
for everyk ∈ Nandσ ∈ Σk. As usual, we will
sometimes use A to refer to its carrier set A or, conversely, denoteAbyA (and thusσA byσA)
if there is no risk of confusion. TheΣ-term alge-brais the Σ-algebraTΣ withσTΣ(t1, . . . , tk) =
σ(t1, . . . , tk) for every k ∈ N, σ ∈ Σk, and
t1, . . . , tk ∈ TΣ. For each Σ-algebraAthere is
exactly oneΣ-homomorphism, denoted by [[.]]A,
from theΣ-term algebra toA(Wechler, 1992). Hypergraphs and hyperedge replacement In the following letΓ be a finite set oflabels. A Γ -hypergraphis a tupleH= (V, E,att,lab,ports), whereV is a finite set ofvertices,Eis a finite set of
hyperedges,att: E →V∗\ {ε}is theattachment
of hyperedges to vertices,lab:E →Γis the label-ingof hyperedges, andports∈V∗is a sequence
of (not necessarily distinct)ports. The set of all
Γ-hypergraphs is denoted byHΓ. The vertices in
V \[ports] are also calledinternal vertices and
denoted byint(H).
For the sake of brevity, we shall in the following simply callΓ-hypergraphs and hyperedgesgraphs
andedges, respectively. We illustrate a graph in fig-ures as follows (cf., e.g.,graph((σ2)A)in Fig. 2a).
A vertexvis illustrated by a circle, which is filled and labeled byiin case thatports(i) =v. An edge ewith labelγandatt(e) =v1. . . vnis depicted as
aγ-labeled rectangle withntentacles, lines point-ing tov1, . . . , vnwhich are annotated by1, . . . , n.
(We sometimes drop these annotations.)
If we are not interested in the particular set of labelsΓ, then we also call aΓ-graph simplygraph
and writeHinstead ofHΓ. In the following, we
will refer to the components of a graphHby index-ing them withHunless they are explicitly named. LetHandH0be graphs.HandH0aredisjoint
(a)
1
is . 2
e1 e2
3 4 3 4
3 4 1 2
1 2
2 4
1 3
1 2
graph((σ1)A)
3
hearing 4
A
1 2
3 4
3 4
1 2
1 2
graph((σ2)A)
1
e1 2
3
e2 4
1 2
1 2
graph((σ3)A)
1
scheduled 2
today
3 4
3 4
1 2
1 2
graph((σ4)A)
1
on 2
issue the
3 4
3 4
3 4
1 2
1 2
1 2
graph((σ5)A)
(b)
[[σ1(σ2, σ3(σ4, σ5))]]A=
1
is . 2
[[σ2]]A [[σ3(σ4, σ5)]]A
3 4 3 4
3 4 1 2
1 2
2 4
1 3
1 2 =
1
is . 2
hearing [[σ3(σ4, σ5)]]A
A
3 4 3 4
3 4 1 2
3 4
1 2
1 2
3 4
1 2
1 2
=
1
is . 2
hearing [[σ4]]A
A [[σ5]]A
3 4 3 4
3 4 1 2
3 4 1 2
1 2
1 2
1 2
1 2
=
1
is . 2
hearing scheduled
A on today
1 2
1 2 · · · 1 2
3 4 3 4
3 4 3 4
3 4 3 4 3 4
1 2
1 2
1 2
1 2
Figure 2: (a) The(Σ,Γ)-HR algebraAand (b) the evaluation of the termσ1(σ2, σ3(σ4, σ5))inA.
LetE =EH ∩EH0. IfattH|E = attH0|E and
labH|E = labH0|E, then the unionof the graphs
HandH0is the graphH∪H0 = (V
H∪VH0, EH∪
EH0,attH∪attH0,labH∪labH0,portsHportsH0).
For every F ⊆ EH let H \ F = (VH, E,
attH|E,labH|E,portsH)whereE=EH\F.
Let k ∈ N and H, H1, . . . , Hk ∈ H be
pair-wise disjoint. Let e1, . . . , ek ∈ EH be pairwise
distinct edges, calledvariables. LetIbe the graph H\ {e1, . . . , ek} ∪H1∪ · · · ∪Hk. Thehyperedge
replacement(HR)ofe1 byH1, . . . , andekbyHk
inH(Bauderon and Courcelle, 1987; Habel and Kreowski, 1987) yields the graph
H[e1/H1, . . . , ek/Hk]
= (VI/∼, EI,attI/∼,labI,portsH/∼)
where∼is the least equivalence relation onVIsuch
that attH(ei, j) ∼ portsHi(j) for everyi ∈ [k]
andj ∈ [min(|attH(ei)|,|portsHi|)]. In the
fol-lowing, we call∼theequivalence relation involved in the HR that yieldsH[e1/H1, . . . , ek/Hk].
We assume that each variableeiis labeled by a
distinguished symbol⊥ ∈Σand depicteiby ei
instead of ⊥. Throughout this paper, we will not distinguish between isomorphic graphs, i.e., graphs that are identical up to a bijective renaming of ver-tices and edges. However, since hyperedge replace-ment is defined on concrete graphs and requires thatH, H1, . . . , Hkare pairwise disjoint, we may
choose isomorphic copies of the involved graphs, i.e., rename edges or vertices. To avoid the cum-bersome conversion between abstract and concrete
graphs, we assume that this renaming isopaque. In this sense, we may define an HR operation as a total function fromHktoHas follows.
Let H be a graph. For pairwise distinct edges e1, . . . , ek ∈ EH, the HR operation
Hhe1...eki:Hk → His given by
Hhe1...eki(H1, . . . , Hk) =H[e1/H1, . . . , ek/Hk]
for all graphsH1, . . . , Hk∈ H.
A(Σ,Γ)-HR algebra(Courcelle, 1991) is aΣ -algebraA= (HΓ,(σA)σ∈Σ)where, for everyk∈ Nandσ ∈Σk, we haveσA =Hhe1...ekifor some
H ∈ HΓ and pairwise distinct e1, . . . , ek ∈ EH.
Then we denoteHbygraph(σA). AnHR algebra
is a(Σ,Γ)-HR algebra for someΣandΓ.
An example of a(Σ,Γ)-HR algebraAand the application of[[.]]Ato a term are given in Fig. 2.
3 Bigraphs and Aligned HR Bimorphisms
Now we formally introduce our central notions of bigraph and aligned HR bimorphism. A case study with examples follows in the next section. Throughout this section let∆andΩbe alphabets. Definition 3.1. Abigraph of type(∆,Ω)is a triple B = (H1, λ, H2)whereH1 ∈ H∆andH2 ∈ HΩ
are disjoint, andλis analignment ofH1andH2,
i.e., a graph withVλ =VH1 ∪VH2,Eλ∩EH1 =
Eλ∩EH2 =∅,att(e)∈(VH1)+·(VH2)+for each
[image:4.595.78.517.61.288.2](a) x(0)1 . . . x(0)m0 σ y1(0) . . . y(0)n0
y(1)1 . . . ym(1)1 x(1)1 . . . x(1)n1
e1
y(1k) . . . y(mkk) x (k)
1 . . . x(nkk)
ek · · ·
e(0)1 e(0)n0
e(1)1 e(1)m1 e
(n)
1 e(mnk)
(b)
σ1 (y (0) 1 )inp
is . (y
(0) 1 )out
y1(1) x(1)1
e1
x(2)1 x(2)2
e2
(c) inp
is . out
x(1)1 x(2)1 x(2)2
(d) inp
x(2)2
out
x(1)1
[image:5.595.72.519.66.198.2]x(2) 1
Figure 3: The graphs (a)Hσ
IO, (b)Hσ1, (c)Ls1(0)M, and (d)Ls(1)1 M. The large circles and the dotted lines in
(a) and (b) visualize the underlying term structure; e.g., in (b)σ1has two children becauseσ1∈Σ2.
Definition 3.2. Analigned HR bimorphism of type
(Σ,∆,Ω) is a tuple B = (g,A1,Λ,A2), where
gis an RTG overΣ andA1,A2 are a(Σ,∆)-HR
algebra and a(Σ,Ω)-HR algebra, resp., such that
graph(σA1)andgraph(σA2)are disjoint for each
σ∈Σ. Further,Λis aΣ-indexed family(Λσ |σ∈ Σ), each Λσ being an alignment of graph(σA1)
andgraph(σA2).
In the following, for each termt ∈TΣ, we
as-sume (w.l.o.g.) that[[t]]A1 and[[t]]A2 are disjoint.
Definition 3.3. Let B = (g,A1,Λ,A2) be an
aligned HR bimorphism. Thesemantics ofB, de-noted by L(B), is the set of bigraphs defined as follows.
First, let the B-alignment be the TΣ-indexed
family ΛB = (ΛB(t) | t ∈ TΣ) where ΛB(t)
is the alignment of[[t]]A1 and[[t]]A2 defined
induc-tively ontas follows. Lett=σ(t1, . . . , tk)∈TΣ.
For j ∈ [2], suppose that∼j is the equivalence
relation involved in the hyperedge replacement that yieldsgraph(σAj)[e1/[[t1]]Aj, . . . , ek/[[tk]]Aj].
We assume thatΛσ,ΛB(t1), . . . ,ΛB(tk)have
pair-wise disjoint sets of edges. Then we define the B-alignment of[[t]]A1 and[[t]]A2 to be the graph
ΛB(t) = (Λσ∪ΛB(t1)∪· · ·∪ΛB(tk))/(∼1∪∼2)
and we let[[t]]B = ([[t]]A1,ΛB(t),[[t]]A2).Finally,
we defineL(B) ={[[t]]B|t∈ L(g)}.
4 Case Study: Hybrid Grammars
We show how an LCFRS/sDCP hybrid grammar (Nederhof and Vogler, 2014) can be represented as an aligned HR bimorphism. These grammars deal with sequence terms; hence, we first recall their definition and show how to view sequence terms as particular graphs.
4.1 A Graph View on Sequence Terms LetΓbe a ranked alphabet andY be a set disjoint fromΓ. The sets oftermsandsequence-terms( s-terms) overΓindexed byY (Seki and Kato, 2008) are denoted by TΓ(Y) and TΓ∗(Y), respectively,
and defined inductively as follows: 1. Y ⊆ TΓ(Y),
2. ifk ∈ N, γ ∈ Γk andsi ∈ TΓ∗(Y)for each
i∈[k], thenγ(s1, . . . , sk)∈ TΓ(Y), and
3. ifn ∈ Nand ti ∈ TΓ(Y) for each i ∈ [n],
thenht1, . . . , tni ∈ TΓ∗(Y).
Lets ∈ T∗
Γ(Y). We say thatsislinearif every
y ∈Y occurs at most once ins. In the following we only consider linear s-terms. We note that, if
Γ= Γ0, thensis essentially a string over Γ0and
Y. IfΓ=Γ1, thenscorresponds to a sequence of
ordinary (unranked) terms overΓ1indexed byY.
Every linear s-term scan be represented as a graphLsM: it has two distinct portsinp and out, representing the start and end ofs, resp. For each variabley ∈Y,LsMhas two distinct portsyinpand
yout. For each occurrence of a symbolγ ∈ Γkin
s, there is aγ-labeled edge with2k+ 2tentacles inLsM. The(2i−1)-th and2i-th tentacle (i∈[k]) point to the start and end vertex, respectively, of thei-th child sequence ofγ. The last two tentacles point towards the predecessor and the successor of γ, respectively: this may be a vertex separating two symbols, the start or end vertex of a (sub-)sequence, or the port realizingyinporyoutfor somey∈Y.
For instance, the s-term
s(0)1 =his(hx(1)1 , x(2)1 i),.(hi)i
inT∗
Γ({x(1)1 , x(2)1 , x(2)2 })is represented byLs(0)1 Min
Fig. 3c. The ports(x(2)2 )inpand(x(2)
are depicted as filled circles to the left and the right ofx(2)2 , and similarly for the other variables. Note that(x(1)1 )out and(x(2)
1 )inpcoincide becausex(1)1
is succeeded byx(2)1 ins(0)1 .
4.2 LCFRS, sDCP, and LCFRS/sDCP Hybrid Grammars
Here we formalize LCFRS/sDCP hybrid grammars as particular aligned HR bimorphisms, where the algebrasA1andA2 are an LCFRS algebra and an
sDCP algebra, resp. Since both LCFRS and sDCP can be viewed as particular types of attribute gram-mars (AG), we first define the concept of AG alge-bra and, in a second step, instantiate it to LCFRS algebra and sDCP algebra.
For eachσ ∈ Σk, letsynσA = (n0, . . . , nk) ∈ Nk+1 andinhσ
A = (m0, . . . , mk) ∈ Nk+1 be
tu-ples defining the setsI ={y(0)j |j∈[n0]}∪{y(ji)|
i ∈ [k], j ∈ [mi]} and O = {x(0)r | r ∈
[m0]} ∪ {x(r`) | ` ∈ [k], r ∈ [n`]} ofinsideand
outside attributes, resp. (The abbreviations stem from the AG notionssynthesized attributesand in-herited attributes.) The definition of an AG algebra Afollows the two-phase approach in (Engelfriet and Heyker, 1992). In the first phase, for each symbol σ inΣk, we define a graphHIOσ of type
synσ
A,inhσA as shown in Fig. 3a: there is a pair
of vertices for each inside attributey(ji) and each outside attributex(r`). For each inside attributey(ji)
there is a edgee(ji)which connectsy(ji)with all out-side attributes. For the edgee(0)1 the tentacles are shown completely in Fig. 3a; for the other edges the tentacles to outside attributes are abridged for clarity. The edgese1, . . . , ekcorrespond to thek
successors ofσ.
In the second phase, we choose an I-indexed family of s-terms(s(ji) ∈ T∗
Γ(O) |yj(i) ∈I)such
that eachx(r`)inOoccurs exactly once in alls(ji)
together (single syntactic use restriction). Then we replace each edgee(ji)by the graphLs(ji)M; this specifies a particular information flow. Formally, we setσA =Hhσe1...ekiwhere
Hσ =HIOσ [e(ji)/Lsj(i)M|yj(i)∈I] .
For instance, for σ1 ∈ Σ2 we have synσA1 =
(1,1,2) and inhσ1
A = (0,1,0), and so we
con-struct Hσ1
IO accordingly. Next, we choose Ls(0)1 M
and Ls(1)1 Mas depicted in Fig. 3c and 3d,
respec-tively. ThenHσ1 = Hσ1
IO[e(0)1 /Ls(0)1 M, e(1)1 /Ls(1)1 M]
is the graph in Fig. 3b where dashed lines indicate the identification of vertices. Note thatHσ1 equals
graph((σ1)A)in Fig. 2a.
A (Σ,Γ)-HR algebra A is a (Σ,Γ)-attribute grammar algebra ((Σ,Γ)-AG algebra), if each symbolσ in Σ is interpreted as described above. For instance,Aof Fig. 2a is a(Σ,Γ)-AG algebra.
We observe that (Σ,Γ)-AG algebras have the following property: for every edge e ∈ EHσ, if
labHσ(e) ∈ Γkthenehas 2k+ 2tentacles. We
call the vertex attHσ(e)(k+ 1) the input vertex
of e and denote it by inp(e). Note that no two terminal edges inHσhave the same input vertex.
Thissingle-input propertywill be crucial for an effi-cient representation of subgraphs during the reduct construction (cf. Sec. 5.2).
Next we instantiate the concept of AG-algebra to LCFRS algebras and to sDCP algebras. An LCFRS does not have inherited attributes:
Definition 4.1. Let∆=∆0be a ranked alphabet.
A(Σ,∆)-AG algebraAis a(Σ,∆)-LCFRS algebra, ifinhσA= (0, . . . ,0)for allσ∈Σ.
Definition 4.2. LetΩ=Ω1 be a ranked alphabet and letAbe a(Σ,Ω)-AG algebra. We say thatA
is a(Σ,Ω)-sDCP algebra.
Then the graph view on an LCFRS/sDCP hybrid grammar is an HR bimorphism B = (g,A1,Λ,A2), where
• g= (Ξ,Σ, ξ0, R)is an RTG,
• A1is a(Σ,∆)-LCFRS algebra,
• A2is a(Σ,Ω)-sDCP algebra,∆=Ω
(regard-ing∆andΩas sets of symbols), and
• there are functionsfan: Ξ → N , inh: Ξ →
N, and syn: Ξ → N such thatfan(ξ0) = 1,
inh(ξ0) = 0,syn(ξ0) = 1, and for every(ξ →
σ(ξ1, . . . , ξk))∈R, it holds that
– (fan(ξ),fan(ξ1), . . . ,fan(ξk)) = synσA1 and
– (inh(ξ),inh(ξ1), . . . ,inh(ξk)) = inhσA2 and
(syn(ξ),syn(ξ1), . . . ,syn(ξk)) = synσA2.
Moreover, we require the following: Letσ∈Σand
Hj = graph(σAj)forj ∈[2]. For eache∈EΛσ
we haveattΛσ(e) = inp(e1) inp(e2)wheree1 ∈
EH1,e2 ∈EH2, andlabH1(e1) = labH2(e2).
Example 4.3. Let g be as in Ex. 2.1 and con-sider the LCFRS/sDCP hybrid grammar B = (g,A1,Λ,A2), whereA1,Λ, andA2are as
speci-fied in Fig. 4. Then the bigraph in Fig. 1b equals
(y(0) 1 )inp
is . (y
(0) 1 )out
e1 e2 (x(1)
1 )inp (x(2)1 )out (x(1)
1 )out,(x(2)1 )inp
(y(1) 1 )out,(x(2)2 )out
(y(1) 1 )inp,(x(2)2 )inp
(y(0) 1 )inp
e1 is e2 .
(y(0) 1 )out
(σ1)A2
(σ1)A1
Λσ1
(y(0) 1 )inp
hearing (y
(0) 1 )out
A
(x(0)1 )inp (x(0)1 )out
(y(0) 1 )inp
A hearing
(y(0) 1 )out
(σ2)A2
(σ2)A1
Λσ2
(y(0) 1 )inp
on (y
(0) 1 )out
issue
the
(y(0) 1 )inp
on the issue
(y(0) 1 )out
(σ5)A2
(σ5)A1
Λσ5
(y(0) 1 )inp
e1 (y (0)
1 )out (y2(0))inp
e2 (y (0) 2 )out
(y(0)
1 )inp,(x(1)1 )inp (x(1)1 )out,(x(2)1 )inp (x(2)1 )out,(x(1)2 )inp (x(1)2 )out,(y(0)1 )out
e1 e2
(σ3)A2
(σ3)A1
Λσ3
(y(0) 1 )inp
scheduled (y
(0) 1 )out
today
(y(0) 1 )inp
scheduled
(y(0)
1 )out (y(0)2 )inp
today
(y(0) 2 )out
(σ4)A2
(σ4)A1
[image:7.595.75.526.59.271.2]Λσ4
Figure 4: The interpretation ofσ1, . . . , σ5inA1,Λ, andA2.
5 EM Training
We present a training algorithm which takes as in-put a weighted aligned HR bimorphism and a finite, non-empty corpuscof bigraphs. It is essentially the same as the training algorithm for probabilistic context-free grammars (PCFG) (Baker, 1979; Lari and Young, 1990; Nederhof and Satta, 2008). As shown in (Prescher, 2001), this algorithm is a dy-namic programming variant of the EM-algorithm (Dempster et al., 1977). Thus, our algorithm gener-ates a sequence of probability assignments which converges to a probability assignmentp; the like-ˆ
lihood ofcunderpˆis a local maximum or saddle point of the likelihood function ofc.
5.1 Weighted Aligned HR Bimorphisms We define weighted RTG in a similar way as PCFG was defined in (Nederhof and Satta, 2006).
A weighted regular tree grammar (WRTG) is a pair(g, p) whereg = (Ξ,Σ, ξ0, R) is an RTG
and p: R → R≥0 is the weight assignment. A weight assignmentP pis aprobability assignmentif
ρ∈Rξp(ρ) = 1for eachξ ∈ Ξ. We extendpto
the mappingp0:D
g(TΣ)→R≥0on derivations as
follows: for eachd=%(d1, . . . , dk)inDg(TΣ)we
definep0(d) =p(%)·Qk
i=1p0(di). For eacht∈TΣ
we definep00(t) =P
d∈Dξg(t)p
0(d). We define the
mappingsin: Ξ → R≥0 ∪ {∞} (inside weight) andout: Ξ → R≥0∪ {∞} (outside weight) for
eachξ ∈Ξby
in(ξ) = X
d∈Dξg(TΣ)
p0(d) out(ξ) = X d∈Dξg0(TΣ,ξ)
p000(d)
where p000:Dξ0
g (TΣ, ξ) → R≥0 is defined in the
same way asp0, with the addition thatp000(ξ) = 1.
As usual, we will drop the primes fromp0,p00, and
p000.
Definition 5.1. A weighted aligned HR bimor-phismis a pair(B, p) = ((g,A1,Λ,A2), p)where
(g,A1,Λ,A2) is an aligned HR bimorphism and
(g, p)is a WRTG.
5.2 Reduct Construction
Given a weighted aligned HR bimorphism(B, p) = ((g,A1,Λ,A2), p)and a bigraph(H1, λ, H2), we
restrictgto an RTGg0such that only treest∈ L(g)
satisfying[[t]]B = (H1, λ, H2)are inL(g0). Also,
we show that ifBis an LCFRS/sDCP hybrid gram-mar, theng0can be constructed in time polynomial
in the size ofBand(H1, λ, H2).
Definition 5.2. Let m ∈ N and H ∈ H. A graph H0 is an m-subgraph of H if int(H0) ⊆
int(H), EH0 ⊆ EH, labH0 = labH|EH0, and
|portsH0| ≤ m. Moreover, we require that there
is a mappingϕ:VH0 → VH, called vertex
map-ping, such thatϕ(attH0(e)) = attH(e) for each
e ∈ EH, ϕ(v) = v for each v ∈ int(H0), and
ϕ(int(H0))∩ϕ([ports
H0]) = ∅. Moreover, for
each v ∈ int(H0), if v ∈ [att
H(e)] for some
e∈EH, thene∈EH0. The set of allm-subgraphs
ofHis denoted byHm
S(H).
For instance, graph((σ4)A) in Fig. 2a is a 2
-subgraph of the last graph in Fig. 2b. If a graphH is the result of applying an HR operation to graphs H1, . . . , Hk, then eachHiis a|portsHi|-subgraph
needed, because some of the ports ofHi may be
identified with each other inH.) Hence, for the reduct we consider onlym-subgraphs ofH, where m is the maximal port length of HR operations in A1 or A2. We observe that HmS(H) is finite
because we identify isomorphic graphs.
Definition 5.3. Let (B, p) = ((g,A1,Λ,A2), p)
be a weighted aligned HR bimorphism withg = (Ξ,Σ, ξ0, R)and let(H1, λ, H2)be a bigraph.
We define (B, p) (H1, λ, H2), the reduct of (B, p) with respect to (H1, λ, H2), to
be the weighted aligned HR bimorphism
((g0,A
1,Λ,A2), p0)whereg0 andp0 are defined as
follows.
If[[.]]B−1(H1, λ, H2) = ∅, theng0 = ({ξ00},Σ,
ξ0
0,∅) and p0 = ∅. Otherwise, let m ∈ N
be the maximum of all |portsgraph(σA1)| and
|portsgraph(σA
2)| where σ ∈ Σ. Now, we
con-struct g0 = (Ξ0,Σ, ξ0
0, R0) where we abbreviate
Hm
S(H1)×HmS(λ)×HSm(H2)byHmS(H1, λ, H2):
• Ξ0 =Ξ×(Hm
S(H1, λ, H2)∩[[TΣ]]B),
• ξ0
0= (ξ0, H1, λ, H2), and
• for every rule %= (ξ→σ(ξ1, . . . , ξk))∈R
and (s, η, r),(s1, η1, r1), . . . ,(sk, ηk, rk) in
Hm
S (H1, λ, H2)∩[[TΣ]]Bwe have
%0 = (ξ, s, η, r)→σ((ξ
1, s1, η1, r1), . . . ,
(ξk, sk, ηk, rk))∈R0
ifs=σA1(s1, . . . , sk),η=Λσ∪η1∪. . .∪ηk,
andr =σA2(r1, . . . , rk).
We definep0(%0) =p(%).
Theorem 5.4. In Def. 5.3 the following hold: 1. L(g0) = [[.]]
B−1(H1, λ, H2)∩ L(g).
2. There is a deterministic tree relabelingφfrom Dg0 to Dg such that for all t ∈ L(g0) and
d0 ∈ D
g0(t), φ|D
g0(t) is a bijection between
Dg0(t)andDg(t), andp0(d0) =p(φ(d0)).
Proof. If [[.]]B−1(H1, λ, H2) = ∅, then L(g0) =
∅ by construction, and thus, both statements of the theorem hold. Otherwise, the first statement follows from the following claim.
Claim (*)For everyn∈N,ξ∈Ξ,t∈TΣ, and
(s, η, r)∈(Hm
S(H1, λ, H2))∩[[TΣ]]B it holds that
(ξ, s, η, r) ⇒n
g0 tiff ξ ⇒ng t and[[t]]A1 = sand ΛB(t) =ηand[[t]]A2 =r.
For the proof of the second statement we define φ((ξ, s, η, r)) = ξ for each(ξ, s, η, r) ∈ Ξ0, and
extendφin the canonical way to derivation trees.
Then the statement is an immediate consequence of the constructions ofR0andφand Claim (*).
Complexity We determine the complexity of the reduct construction for the special case of LCFRS/sDCP hybrid grammars. We assume that the maximal lengthmof ports in the HR operations is fixed and not part of the input. In preparation, we determine an upper bound on|Ξ0|.
LetH ∈ Hbe such that from each vertex there is an (undirected) path to a port. GivenH each H0 ∈ Hm
S (H)∩[[TΣ]]A is uniquely determined
by itsboundary representation(Lautemann, 1990; Chiang et al., 2013; Groschwitz et al., 2015), which consists of (a) the pair (portsH0, ϕ(portsH0)),
(b) the set ofboundary edgesEB
H0 ofH0consisting
of alle∈EH0 such thatattH0(e)∩[portsH0]6=∅,
and (c) a functionatt0:EB
H0 →([portsH0]∪{⊥})∗
such thatatt0(e)(i) = att
H0(e)(i) ifattH0(e)(i)
∈[portsH0], and⊥otherwise. Note that (c) is
nec-essary becauseϕ|[portsH0]might not be injective.
Now, for an arbitrary(Σ,Γ)-AG algebraAand H ∈ LT∗
Γ(∅)M the following holds. Due to the
single-input property ofHand of the involved HR operations, for eachm-subgraphH0 the
informa-tion (b) and (c) can be inferred from (a). There are Sm,k port sequences of lengthm withkdistinct
vertices, whereSm,k is theStirling number of the
second kind. For each port we choose a vertex in VH. Thus, we obtain that|HSm(H)∩[[TΣ]]A| ≤
Pm
k=0Sm,k· |VH|k≤mm· |VH|m.
Next, we analyze the B-alignments of an LCFRS/sDCP hybrid grammarB. Let(s, η, r)∈ Hm
S(H1, λ, H2)andt∈TΣ with[[t]]B = (s, η, r).
Then Vη = Vs ∪ Vr. Let e ∈ Eλ and
e1 ∈ EH1 and e2 ∈ EH2 with attλ(e) =
inp(e1) inp(e2). (i) Assume that there ist0 ∈TΣ
with[[t0]]B = (H1, λ, H2)andtis a subtree oft0.
Thene1ande2are unique by the single input
prop-erty. Hence, e1 ∈ Es iff e2 ∈ Er iff e ∈ Eη
becausetis a subtree oft0 and by the structure of Λσ. (ii) If there is no sucht0, then the equivalences
under (i) may be violated, in which case(s, η, r)
can safely be pruned.
Thus, for each LCFRS/sDCP hybrid grammar Band each(H1, λ, H2) withH1 ∈ LT∆∗(∅)Mand
H2 ∈ LTΩ∗(∅)M we obtain the upper bound |Ξ| ·
m2m· |VH
1|m· |VH2|mon|Ξ0|. This bound can be
refined to|Ξ|·m2m·|V H1|2·fan
∗
·|VH2|2·(syn
∗+inh∗)
, wheref∗= maxξ∈Ξf(ξ)forf ∈ {fan,syn,inh}.
ConstructingΞ0andR0simultaneously with a
a worst-case time complexity inO(|Ξ0|k)wherek
is the maximum rank ofΣ. 5.3 EM Training Algorithm
In the first step of our training algorithm, a cor-pusc0:R → R≥0 is computed as follows. After
initialization (line 3), each bigraph B occurring incis considered (line 4), the reduct(B, pi)B
is built (line 5), the inside/outside weights of the new WRTG(g0, p0)are calculated (line 6), and
ac-cording to these weights and the current weight assignmentpithe countc0(%)of each rule is
incre-mented (lines 8–9). In the second step, the corpus c0is normalized (lines 10–14) and the result is the
next probability assignmentpi+1(line 15).
Algorithm 5.1 EM-training algorithms for weighted aligned HR bimorphisms.
Input: weighted aligned HR bimorphism
(B, p0) = ((g,A1,Λ,A2), p0)
withg= (Ξ,Σ, ξ0, R), and a finite,
non-empty corpuscof bigraphs. Output: sequencep1, p2, p3, . . . of improved
probability assignments forR. 1: i←0
2: whiletruedo
3: initialize countsc0 :R→R≥0:c0(%)←0
4: forB = (H1, λ, H2)s.t.c(B)>0do 5: ((g0,A1,Λ,A2), p0)←(B, pi)B
with RTGg0 = (Ξ0,Σ, ξ0
0, R0)and
det. tree rel.φ// cf. Thm 5.4 6: computeoutandinfor the WRTG
(g0, p0)
7: ifin(ξ00)6= 0then
8: for%= (ξ →σ(ξ1, . . . , ξk))∈Rdo 9: c0(%)←c0(%) + c(B) · in(ξ0
0)−1·
P %0∈R0:
φ(%0)=%∧%0=(ξ0→σ(ξ0
1,...,ξ0k))
out(ξ0)·p
i(%)· k Q j=1in(ξ
0 j)
10: forξ∈Ξdo 11: s←P%∈Rξc0(%) 12: for%∈Rξdo
13: ifs = 0then pi+1(%)←pi(%)·|Rξ|−1 14: else pi+1(%)←s−1·c0(%) 15: outputpi+1andi←i+ 1
Acknowledgment
We thank the referees for their careful reading of the manuscript.
References
James K. Baker. 1979. Trainable grammars for speech recognition. In Speech Communication Papers for the 97th Meeting of the Acoustical Society of Amer-ica, pages 547–550.
Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. InProc. 7th Linguistic Annotation Workshop, ACL 2013 Workshop.
Michel Bauderon and Bruno Courcelle. 1987. Graph expressions and graph rewriting. Mathematical Sys-tems Theory, 20:83–127.
Suna Bensch, Frank Drewes, Helmut J¨urgensen, and Brink van der Merwe. 2014. Graph transformation for incremental natural language analysis. Theoreti-cal Computer Science, 531:1–25.
Walter S. Brainerd. 1969. Tree generating regular sys-tems. Inform. and Control, 14:217–231.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311.
David Chiang, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, Bevan Jones, and Kevin Knight. 2013. Parsing graphs with hyperedge replacement grammars. InProc. of the 51st Annual Meeting of the Association for Computational Linguistics, pages 924–932.
David Chiang. 2007. Hierarchical phrase-based trans-lation.Computational Linguistics, 33(2):201–228.
Bruno Courcelle. 1991. The monadic second-order logic of graphs V: on closing the gap between de-finability and recognizability. Theoretical Computer Science, 80:153–202.
Arthur P. Dempster, Nan M. Laird, and Donald B. Ru-bin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Sta-tistical Society, Series B, 39:1–38.
Pierre Deransart and Jan Małuszynski. 1985. Relat-ing logic programs and attribute grammars. J. Logic Programming, 2:119–155.
Joost Engelfriet and Linda Heyker. 1992. Context-free hypergraph grammars have the same term-generating power as attribute grammars. Acta Infor-matica, 29(2):161–210.
Joseph A. Goguen, James W. Thatcher, Eric G. Wagner, and Jesse B. Wright. 1977. Initial algebra semantics and continuous algebras. J. ACM, 24:68–95.
Jonas Groschwitz, Alexander Koller, and Christoph Teichmann. 2015. Graph parsing with s-graph grammars. In Proc. of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1481–1490.
Annegret Habel and Hans-J¨org Kreowski. 1987. May we introduce to you: Hyperedge replacement. In
Proc. of the Third Intl. Workshop on Graph Gram-mars and Their Application to Computer Science, pages 15–26.
Annegret Habel. 1992. Hyperedge Replacement: Grammars and Languages, volume 643 ofLecture Notes in Computer Science. Springer.
Bevan Jones, Jacob Andreas, Daniel Bauer,
Karl Moritz Hermann, and Kevin Knight. 2012. Semantics-based machine translation with hy-peredge replacement grammars. In M. Kay and C. Boitet, editors, Proc. 24th Intl. Conf. on Com-putational Linguistics (COLING 2012): Technical Papers, pages 1359–1376.
Karim Lari and Steve J. Young. 1990. The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language, 4:35–56.
Clemens Lautemann. 1990. The complexity of
graph languages generated by hyperedge replace-ment.Acta Inf., 27(5):399–421.
Philip M. Lewis and Richard E. Stearns. 1968. Syntax-directed transduction. J. ACM, 15(3):465–488. Mark-Jan Nederhof and Giorgio Satta. 2006.
Proba-bilistic parsing strategies. J. ACM, 53(3):406–436. Mark-Jan Nederhof and Giorgio Satta. 2008.
Prob-abilistic parsing. In Gemma Bel-Enguix, M. Do-lores Jim´enez-L´opez, and Carlos Mart´ın-Vide, edi-tors, New Developments in Formal Languages and Applications, volume 113 of Studies in Computa-tional Intelligence, pages 229–258. Springer.
Mark-Jan Nederhof and Heiko Vogler. 2012. Syn-chronous context-free tree grammars. In Proc. of the 11th International Workshop on Tree Adjoin-ing Grammars and Related Formalisms (TAG+11), pages 55–63.
Mark-Jan Nederhof and Heiko Vogler. 2014. Hy-brid grammars for discontinuous parsing. In Proc. of 25th International Conference on Computational Linguistics (COLING 2014), pages 1370–1381. Detlef Prescher. 2001. Inside-outside estimation meets
dynamic EM. InProc. of the 7th International Work-shop on Parsing Technologies, pages 241–244.
Hiroyuki Seki and Yuki Kato. 2008. On the gener-ative power of multiple context-free grammars and macro grammars.IEICE - Transactions on Informa-tion and Systems, E91-D(2):209–221.
Stuart M. Shieber and Yves Schabes. 1990. Syn-chronous tree-adjoining grammars. In Proc. of the 13th International Conference on Computational Linguistics, volume 3, pages 253–258.
Stuart M. Shieber, Yves Schabes, and Fernando C. N. Pereira. 1995. Principles and implementation of de-ductive parsing. J. Logic Programming, 24(1–2):3 – 36.
Krishnamurti Vijay-Shanker, David J. Weir, and Ar-avind K. Joshi. 1987. Characterizing structural descriptions produced by various grammatical for-malisms. InProc. of the 25th Annual Meeting of the Association for Computational Linguistics, pages 104–111.