Using Universal Linguistic Knowledge to Guide Grammar Induction

(1)

Using Universal Linguistic Knowledge to Guide Grammar Induction

Tahira Naseem, Harr Chen, Regina Barzilay

Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

{tahira, harr, regina} @csail.mit.edu

Mark Johnson

Department of Computing Macquarie University [email protected]

Abstract

We present an approach to grammar induc-tion that utilizes syntactic universals to im-prove dependency parsing across a range of languages. Our method uses a single set of manually-specified language-independent rules that identify syntactic dependencies be-tween pairs of syntactic categories that com-monly occur across languages. During infer-ence of the probabilistic model, we use pos-terior expectation constraints to require that a minimum proportion of the dependencies we infer be instances of these rules. We also auto-matically refine the syntactic categories given in our coarsely tagged input. Across six lan-guages our approach outperforms state-of-the-art unsupervised methods by a significant mar-gin.1

1 Introduction

Despite surface differences, human languages ex-hibit striking similarities in many fundamental as-pects of syntactic structure. These structural corre-spondences, referred to assyntactic universals, have been extensively studied in linguistics (Baker, 2001; Carnie, 2002; White, 2003; Newmeyer, 2005) and underlie many approaches in multilingual parsing. In fact, much recent work has demonstrated that learning cross-lingual correspondences from cor-pus data greatly reduces the ambiguity inherent in syntactic analysis (Kuhn, 2004; Burkett and Klein, 2008; Cohen and Smith, 2009a; Snyder et al., 2009; Berg-Kirkpatrick and Klein, 2010).

1

The source code for the work presented in this paper is available at http://groups.csail.mit.edu/rbg/code/dependency/

Root→Auxiliary Noun→Adjective

Root→Verb Noun→Article

Verb→Noun Noun→Noun

Verb→Pronoun Noun→Numeral Verb→Adverb Preposition→Noun

Verb→Verb Adjective→Adverb

[image:1.612.327.525.237.336.2]

Auxiliary→Verb

Table 1: The manually-specified universal dependency rules used in our experiments. These rules specify head-dependent relationships between coarse (i.e., unsplit) syntactic categories. An explanation of the ruleset is pro-vided in Section 5.

In this paper, we present an alternative gram-mar induction approach that exploits these struc-tural correspondences by declaratively encoding a small set of universal dependency rules. As input to the model, we assume a corpus annotated with coarse syntactic categories (i.e., high-level part-of-speech tags) and a set of universal rules defined over these categories, such as those in Table 1. These rules incorporate the definitional properties of syn-tactic categories in terms of their interdependencies and thus are universal across languages. They can potentially help disambiguate structural ambiguities that are difficult to learn from data alone — for example, our rules prefer analyses in which verbs are dependents of auxiliaries, even though analyz-ing auxiliaries as dependents of verbs is also consis-tent with the data. Leveraging these universal rules has the potential to improve parsing performance for a large number of human languages; this is par-ticularly relevant to the processing of low-resource

(2)

languages. Furthermore, these universal rules are compact and well-understood, making them easy to manually construct.

In addition to these universal dependencies, each specific language typically possesses its own id-iosyncratic set of dependencies. We address this challenge by requiring the universal constraints to only hold in expectation rather than absolutely, i.e., we permit a certain number of violations of the con-straints.

We formulate a generative Bayesian model that explains the observed data while accounting for declarative linguistic rules during inference. These rules are used as expectation constraints on the posterior distribution over dependency structures. This approach is based on the posterior regular-ization technique (Grac¸a et al., 2009), which we apply to a variational inference algorithm for our parsing model. Our model can also optionally re-fine common high-level syntactic categories into per-language categories by inducing a clustering of words using Dirichlet Processes (Ferguson, 1973). Since the universals guide induction toward linguis-tically plausible structures, automatic refinement be-comes feasible even in the absence of manually an-notated syntactic trees.

We test the effectiveness of our grammar induc-tion model on six Indo-European languages from three language groups: English, Danish, Portuguese, Slovene, Spanish, and Swedish. Though these lan-guages share a high-level Indo-European ancestry, they cover a diverse range of syntactic phenomenon. Our results demonstrate that universal rules greatly improve the accuracy of dependency parsing across all of these languages, outperforming current state-of-the-art unsupervised grammar induction meth-ods (Headden III et al., 2009; Berg-Kirkpatrick and Klein, 2010).

2 Related Work

Learning with Linguistic Constraints Our work is situated within a broader class of unsupervised ap-proaches that employ declarative knowledge to im-prove learning of linguistic structure (Haghighi and Klein, 2006; Chang et al., 2007; Grac¸a et al., 2007; Cohen and Smith, 2009b; Druck et al., 2009; Liang et al., 2009a). The way we apply constraints is

clos-est to the latter two approaches of posterior regular-ization and generalized expectation criteria.

In the posterior regularization framework, con-straints are expressed in the form of expectations on posteriors (Grac¸a et al., 2007; Ganchev et al., 2009; Grac¸a et al., 2009; Ganchev et al., 2010). This de-sign enables the model to reflect constraints that are difficult to encode via the model structure or as pri-ors on its parameters. In their approach, parame-ters are estimated using a modified EM algorithm, where the E-step minimizes the KL-divergence be-tween the model posterior and the set of distributions that satisfies the constraints. Our approach also ex-presses constraints as expectations on the posterior; we utilize the machinery of their framework within a variational inference algorithm with a mean field approximation.

Generalized expectation criteria, another tech-nique for declaratively specifying expectation con-straints, has previously been successfully applied to the task of dependency parsing (Druck et al., 2009). This objective expresses constraints in the form of preferences over model expectations. The objective is penalized by the square distance between model expectations and the prespecified values of the ex-pectation. This approach yields significant gains compared to a fully unsupervised counterpart. The constraints they studied are corpus- and language-specific. Our work demonstrates that a small set of language-independent universals can also serve as effective constraints. Furthermore, we find that our method outperforms the generalized expectation ap-proach using corpus-specific constraints.

(3)

em-ploy variational methods for inference. However, our goal is to apply category refinement to depen-dency parsing, rather than to PCFGs, requiring a substantially different model formulation. While Liang et al. (2007) demonstrated empirical gains on a synthetic corpus, our experiments focus on unsu-pervised category refinement on real language data.

Universal Rules in NLP Despite the recent surge of interest in multilingual learning (Kuhn, 2004; Co-hen and Smith, 2009a; Snyder et al., 2009; Berg-Kirkpatrick and Klein, 2010), there is surprisingly little computational work on linguistic universals. On the acquisition side, Daum´e III and Campbell (2007) proposed a computational technique for dis-covering universal implications in typological fea-tures. More closely related to our work is the posi-tion paper by Bender (2009), which advocates the use of manually-encoded cross-lingual generaliza-tions for the development of NLP systems. She ar-gues that a system employing such knowledge could be easily adapted to a particular language by spe-cializing this high level knowledge based on the ty-pological features of the language. We also argue that cross-language universals are beneficial for au-tomatic language processing; however, our focus is on learning language-specific adaptations of these rules from data.

3 Model

The central hypothesis of this work is that unsu-pervised dependency grammar induction can be im-proved using universal linguistic knowledge. To-ward this end our approach is comprised of two components: a probabilistic model that explains how sentences are generated from latent dependency structures and a technique for incorporating declar-ative rules into the inference process.

We first describe the generative story in this sec-tion before turning to how constraints are applied during inference in Section 4. Our model takes as input (i.e., as observed) a set of sentences where each word is annotated with a coarse part-of-speech tag. Table 2 provides a detailed technical descrip-tion of our model’s generative process, and Figure 1 presents a model diagram.

For each observed coarse symbols:

1. Draw top-level infinite multinomial over subsymbolsβs∼GEM(γ).

2. For each subsymbolzof symbols:

(a) Draw word emission multinomial

φsz∼Dir(φ0).

(b) For each context valuec:

i. Draw child symbol generation multinomialθszc∼Dir(θ0).

ii. For each child symbols0:

A. Draw second-level infinite multinomial over subsymbols

πs0_szc∼DP(α, β_s0).

For each tree node i generated in context c by parent symbols0 and parent subsymbolz0:

1. Draw coarse symbolsi ∼Mult(θs0_z0).

2. Draw subsymbolzi ∼Mult(πsis0z0c).

3. Draw wordxi ∼Mult(φsizi).

Table 2: The generative process for model parameters and parses. In the above GEM, DP, Dir, and Mult refer respectively to the stick breaking distribution, Dirichlet process, Dirichlet distribution, and multinomial distribu-tion.

Generating Symbols and Words We describe how a single node of the tree is generated before discussing how the entire tree structure is formed. Each node of the dependency tree is comprised of three random variables: an observed coarse symbol

s, a hidden refined subsymbol z, and an observed word x. In the following let the parent of the cur-rent node have symbols0and subsymbolz0; the root node is generated from separate root-specific distri-butions. Subsymbol refinement is an optional com-ponent of the full model and can be omitted by de-terministically equating sandz. As we explain at the end of this section, without this aspect the gener-ative story closely resembles the classic dependency model with valence (DMV) of Klein and Manning (2004).

(4)

s - coarse symbol (observed)

z - refined subsymbol

x - word (observed)

θszc - distr over child coarse symbols for

each parentsandzand contextc βs - top-level distr over subsymbols fors πss0_z0_c - distr over subsymbols for eachs,

[image:4.612.79.539.54.201.2]

parents0andz0, and contextc φsz - distr over words forsandz

Figure 1: Graphical representation of the model and a summary of the notation. There is a copy of the outer plate for each distinct symbol in the observed coarse tags. Here, node 3 is shown to be the parent of nodes 1 and 2. Shaded variables are observed, square variables are hyperparameters. The elongated oval aroundsandzrepresents the two variables jointly. For clarity the diagram omits some arrows fromθto eachs,πto eachz, andφto eachx.

distribution with parameters θs0_z0_c. As the indices

indicate, we have one such set of multinomial pa-rameters for every combination of parent symbol

s0 and subsymbolz0 along with acontext c. Here the context of the current node can take one of six values corresponding to every combination of di-rection (left or right) and valence (first, second, or third or higher child) with respect to its parent. The prior (base distribution) for eachθs0_z0_cis a

symmet-ric Disymmet-richlet with hyperparameterθ0.

Next we draw the refined syntactic category sub-symbolzfrom an infinite multinomial with parame-tersπss0_z0_c. Here the selection ofπis indexed by the

current node’s coarse symbol s, the symbol s0 and subsymbolz0 of the parent node, and the contextc

of the current node.

For each unique coarse symbolswe tie together the distributions πss0_z0_c for all possible parent and

context combinations (i.e.,s0,z0, andc) using a Hi-erarchical Dirichlet Process (HDP). Specifically, for a singles, each distributionπss0_z0_cover subsymbols

is drawn from a DP with concentration parameter

α and base distributionβs over subsymbols. This base distributionβsis itself drawn from a GEM prior with concentration parameterγ. By formulating the generation of z as an HDP, we can share parame-ters for a single coarse symbol’s subsymbol distribu-tion while allowing for individual variability based on node parent and context. Note that parameters are not shared across different coarse symbols, pre-serving the distinctions expressed via the coarse tag

annotations.

Finally, we generate the word x from a finite multinomial with parametersφsz, wheresandzare the symbol and subsymbol of the current node. The

φdistributions are drawn from a symmetric Dirich-let prior.

Generating the Tree Structure We now consider how the structure of the tree arises. We follow an approach similar to the widely-referenced DMV model (Klein and Manning, 2004), which forms the basis of the current state-of-the-art unsuper-vised grammar induction model (Headden III et al., 2009). After a node is drawn we generate children on each side until we produce a designated STOP symbol. We encode more detailed valence informa-tion than Klein and Manning (2004) and condiinforma-tion child generation on parent valence. Specifically, af-ter drawing a node we first decide whether to pro-ceed to generate a child or to stop conditioned on the parent symbol and subsymbol and the current con-text (direction and valence). If we decide to gener-ate a child we follow the previously described pro-cess for constructing a node. We can combine the stopping decision with the generation of the child symbol by including a distinguished STOP symbol as a possible outcome in distributionθ.

(5)

gener-ation of zis obviated and word x is drawn from a word distributionφs indexed solely by coarse sym-bols. The resulting simplified model closely resem-bles DMV (Klein and Manning, 2004), except that it 1) explicitly generate wordsxrather than only part-of-speech tagss, 2) encodes richer context and va-lence information, and 3) imposes a Dirichlet prior on the symbol distributionθ.

4 Inference with Constraints

We now describe how to augment our generative model of dependency structure with constraints de-rived from linguistic knowledge. Incorporating arbi-trary linguistic rules directly in the generative story is challenging as it requires careful tuning of either the model structure or priors for each constraint. In-stead, following the approach of Grac¸a et al. (2007), we constrain the posterior to satisfy the rules in ex-pectation during inference. This effectively biases the inference toward linguistically plausible settings. In standard variational inference, an intractable true posterior is approximated by a distribution from a tractable set (Bishop, 2006). This tractable set typ-ically makes stronger independence assumptions be-tween model parameters than the model itself. To in-corporate the constraints, we further restrict the set to only include distributions that satisfy the specified expectation constraints over hidden variables.

In general, for some given model, let θ denote the entire set of model parameters andz andx de-note the hidden structure and observations respec-tively. We are interested in estimating the posterior

p(θ, z | x). Variational inference transforms this problem into an optimization problem where we try to find a distributionq(θ, z)from a restricted setQ

that minimizes the KL-divergence between q(θ, z) andp(θ, z |x):

KL(q(θ, z)kp(θ, z|x))

=

Z

q(θ, z) log q(θ, z)

p(θ, z, x)dθdz+ logp(x).

Rearranging the above yields:

logp(x) = KL(q(θ, z)kp(θ, z|x)) +F,

whereF is defined as

F ≡ Z

q(θ, z) logp(θ, z, x)

q(θ, z) dθdz. (1)

ThusFis a lower bound on likelihood. Maximizing this lower bound is equivalent to minimizing the KL-divergence betweenp(θ, z|x)andq(θ, z). To make this maximization tractable we make a mean field assumption thatqbelongs to a setQof distributions that factorize as follows:

q(θ, z) =q(θ)q(z).

We further constrainqto be from the subset ofQ

that satisfies the expectation constraintEq[f(z)]≤b wheref is a deterministically computable function of the hidden structures. In our model, for exam-ple, f counts the dependency edges that are an in-stance of one of the declaratively specified depen-dency rules, while b is the proportion of the total dependencies that we expect should fulfill this con-straint.2

With the mean field factorization and the expec-tation constraints in place, solving the maximization ofF in (1) separately for each factor yields the fol-lowing updates:

q(θ) = argmin q(θ)

KL q(θ)kq0(θ), (2)

q(z) = argmin q(z)

KL q(z)kq0(z)

s.t. E_q₍_z₎[f(z)]≤b, (3)

where

q0(θ)∝expEq(z)[logp(θ, z, x)], (4)

q0(z)∝expEq(θ)[logp(θ, z, x)]. (5)

We can solve (2) by settingq(θ) toq0(θ) — since

q(z)is held fixed while updatingq(θ), the expecta-tion funcexpecta-tion of the constraint remains constant dur-ing this update. As shown by Grac¸a et al. (2007), the update in (3) is a constrained optimization problem and can be solved by performing gradient search on its dual:

argmin λ

λ>b+ logX z

q0(z) exp(−λ>f(z)) (6)

For a fixed value of λ the optimal q(z) ∝

q0(z) exp(−λ>f(z)). By updating q(θ) and q(z) as in (2) and (3) we are effectively maximizing the lower boundF.

2

(6)

4.1 Variational Updates

We now derive the specific variational updates for our dependency induction model. First we assume the following mean-field factorization of our varia-tional distribution:

q(β, θ, π, φ, z)

=q(z)·Y

s0

q(βs0)·

T

Y

z0₌₁

q(φs0_z0)·

Y

c

q(θs0_z0_c)· Y

s

q(πss0_z0_c), (7)

wheres0varies over the set of unique symbols in the observed tags,z0denotes subsymbols for each sym-bol, c varies over context values comprising a pair of direction (left or right) and valence (first, second, or third or higher) values, andscorresponds to child symbols.

We restrict q(θs0_z0_c) andq(φ_s0_z0) to be Dirichlet

distributions and q(z) to be multinomial. As with prior work (Liang et al., 2009b), we assume a de-generateq(β)≡δβ∗(β)for tractability reasons, i.e.,

all mass is concentrated on some singleβ∗. We also assume that the top level stick-breaking distribution is truncated atT, i.e.,q(β)assigns zero probability to integers greater thanT. Because of the truncation ofβ, we can approximateq(πss0_z0_c)with an

asym-metric finite dimensional Dirichlet.

The factors are updated one at a time holding all other factors fixed. The variational update forq(π) is given by:

q(πss0_z0_c) = Dir π_ss0_z0_c;αβ+E_q₍_z₎[C_ss0_z0_c(z)],

where term E_q₍_z₎[Css0_z0_c(z)] is the expected count

w.r.t. q(z) of child symbol s and subsymbol z in context cwhen generated by parent symbol s0 and subsymbolz0.

Similarly, the updates forq(θ)andq(φ)are given by:

q(θs0_z0_c) = Dir θ_s0_z0_c;θ₀+E_q₍_z₎[C_s0_z0_c(s)],

q(φs0_z0) = Dir φ_s0_z0;φ₀+E_q₍_z₎[C_s0_z0(x)],

whereCs0_z0_c(s)is the count of child symbolsbeing

generated by the parent symbols0and subsymbolz0

in contextcandCs0_z0_xis the count of wordxbeing

generated by symbols0and subsymbolz0.

The only factor affected by the expectation con-straints isq(z). Recall from the previous section that the update forq(z)is performed via gradient search on the dual of a constrained minimization problem of the form:

q(z) = argmin q(z)

KL(q(z)kq0(z)).

Thus we first compute the update forq0(z):

q0(z)∝

N

Y

n=1

len(n)

Y

j=1

(expEq(φ)[logφsnjznj(xnj)]

×expE_q₍_θ₎[logθsh(nj)zh(nj)cnj(snj)]

×expEq(π)[logπsnjsh(nj)zh(nj)cnj(znj)]),

where N is the total number of sentences, len(n) is the length of sentencen, and indexh(nj) refers to the head of the jth node of sentence n. Given thisq0(z)a gradient search is performed using (6) to find the optimalλand thus the primal solution for updatingq(z).

Finally, we update the degenerate factor q(βs) with the projected gradient search algorithm used by Liang et al. (2009b).

5 Linguistic Constraints

Universal Dependency Rules We compile a set of 13 universal dependency rules consistent with vari-ous linguistic accounts (Carnie, 2002; Newmeyer, 2005), shown in Table 1. These rules are defined over coarse part-of-speech tags: Noun, Verb, Adjec-tive, Adverb, Pronoun, Article, Auxiliary, Preposi-tion, Numeral and Conjunction. Each rule specifies a part-of-speech for the head and argument but does not provide ordering information.

(7)

1. Identify non-recursive NPs:

• All nouns, pronouns and possessive marker are part of an NP.

• All adjectives, conjunctions and deter-miners immediately preceding an NP are part of the NP.

2. The first verb or modal in the sentence is the headword.

3. All words in an NP are headed by the last word in the NP.

4. The last word in an NP is headed by the word immediately before the NP if it is a preposition, otherwise it is headed by the headword of the sentence if the NP is be-fore the headword, else it is headed by the word preceding the NP.

5. For the first word set its head to be the head-word of the sentence. For each other head-word set its headword to be the previous word.

Table 3: English-specific dependency rules.

English-specific Dependency Rules For English, we also consider a small set of hand-crafted depen-dency rules designed by Michael Collins3for deter-ministic parsing, shown in Table 3. Unlike the uni-versals from Table 1, these rules alone are enough to construct a full dependency tree. Thus they allow us to judge whether the model is able to improve upon a human-engineered deterministic parser. Moreover, with this dataset we can assess the additional benefit of using rules tailored to an individual language as opposed to universal rules.

6 Experimental Setup

Datasets and Evaluation We test the effective-ness of our grammar induction approach on English, Danish, Portuguese, Slovene, Spanish, and Swedish. For English we use the Penn Treebank (Marcus et al., 1993), transformed from CFG parses into

depen-3_{Personal communication.}

dencies with the Collins head finding rules (Collins, 1999); for the other languages we use data from the 2006 CoNLL-X Shared Task (Buchholz and Marsi, 2006). Each dataset provides manually annotated part-of-speech tags that are used for both training and testing. For comparison purposes with previ-ous work, we limit the cross-lingual experiments to sentences of length 10 or less (not counting punc-tuation). For English, we also explore sentences of length up to 20.

The final output metric is directed dependency ac-curacy. This is computed based on the Viterbi parses produced using the final unnormalized variational distributionq(z)over dependency structures.

Hyperparameters and Training Regimes Un-less otherwise stated, in experiments with rule-based constraints the expected proportion of dependencies that must satisfy those constraints is set to 0.8. This threshold value was chosen based on minimal tun-ing on a stun-ingle language and ruleset (English with universal rules) and carried over to each other ex-perimental condition. A more detailed discussion of the threshold’s empirical impact is presented in Sec-tion 7.1.

Variational approximations to the HDP are trun-cated at 10. All hyperparameter values are fixed to 1 exceptαwhich is fixed to 10.

We also conduct a set ofNo-Splitexperiments to evaluate the importance of syntactic refinement; in these experiments each coarse symbol corresponds to only one refined symbol. This is easily effected during inference by setting the HDP variational ap-proximation truncation level to one.

For each experiment we run 50 iterations of vari-ational updates; for each iteration we perform five steps of gradient search to compute the update for the variational distribution q(z) over dependency structures.

7 Results

(8)

DMV PGI No-Split HDP-DEP

English 47.1 62.3 71.5 71.9(0.3)

Danish 33.5 41.6 48.8 51.9(1.6)

Portuguese 38.5 63.0 54.0 71.5(0.5)

Slovene 38.5 48.4 50.6 50.9(5.5)

Spanish 28.0 58.4 64.8 67.2(0.4)

[image:8.612.314.540.52.223.2]

Swedish 45.3 58.3 63.3 62.1 (0.5)

Table 4: Directed dependency accuracy using our model with universal dependency rules (No-Split and HDP-DEP), compared to DMV (Klein and Manning, 2004) and PGI (Berg-Kirkpatrick and Klein, 2010). The DMV re-sults are taken from Berg-Kirkpatrick and Klein (2010). Bold numbers indicate the best result for each language. For the full model, the standard deviation in performance over five runs is indicated in parentheses.

7.1 Main Cross-Lingual Results

Table 4 shows the performance of both our full model (HDP-DEP) and its No-Split version using universal dependency rules across six languages. We also provide the performance of two baselines — the dependency model with valence (DMV) (Klein and Manning, 2004) and the phylogenetic grammar induction (PGI) model (Berg-Kirkpatrick and Klein, 2010).

HDP-DEP outperforms both DMV and PGI across all six languages. Against DMV we achieve an average absolute improvement of 24.1%. This improvement is expected given that DMV does not have access to the additional information provided through the universal rules. PGI is more relevant as a point of comparison, since it is able to lever-age multilingual data to learn information similar to what we have declaratively specified using universal rules. Specifically, PGI reduces induction ambigu-ity by connecting language-specific parameters via phylogenetic priors. We find, however, that we out-perform PGI by an average margin of 7.2%, demon-strating the benefits of explicit rule specification.

An additional point of comparison is the lexi-calized unsupervised parser of Headden III et al. (2009), which yields the current state-of-the-art un-supervised accuracy on English at 68.8%. Our method also outperforms this approach, without em-ploying lexicalization and sophisticated smoothing as they do. This result suggests that combining the complementary strengths of their approach and ours

English

Rule Excluded Acc Loss Gold Freq

Preposition→Noun 61.0 10.9 5.1

Verb→Noun 61.4 10.5 14.8

Noun→Noun 64.4 7.5 10.7

Noun→Article 64.7 7.2 8.5

Spanish

Rule Excluded Acc Loss Gold Freq

Preposition→Noun 53.4 13.8 8.2

Verb→Noun 61.9 5.4 12.9

Noun→Noun 62.6 4.7 2.0

[image:8.612.72.298.54.139.2]

Root→Verb 65.4 1.8 12.3

Table 5: Ablation experiment results for universal depen-dency rules on English and Spanish. For each rule, we evaluate the model using the ruleset excluding that rule, and list the most significant rules for each language. The second last column is the absolute loss in performance compared to the setting where all rules are available. The last column shows the percentage of the gold dependen-cies that satisfy the rule.

can yield further performance improvements. Table 4 also shows theNo-Splitresults where syn-tactic categories are not refined. We find that such refinement usually proves to be beneficial, yielding an average performance gain of 3.7%. However, we note that the impact of incorporating splitting varies significantly across languages. Further understand-ing of this connection is an area of future research.

Finally, we note that our model exhibits low vari-ance for most languages. This result attests to how the expectation constraints consistently guide infer-ence toward high-accuracy areas of the search space.

Ablation Analysis Our next experiment seeks to understand the relative importance of the various universal rules from Table 1. We study how accu-racy is affected when each of the rules is removed one at a time for English and Spanish. Table 5 lists the rules with the greatest impact on performance when removed. We note the high overlap between the most significant rules for English and Spanish.

(9)

Span-50  55  60  65  70  75 

Gold  70  75  80  85  90 

Accur

acy 

Constraints Threshold 

[image:9.612.320.533.51.236.2]

Average  English 

Figure 2: Accuracy of our model with different threshold settings, on English only and averaged over all languages. “Gold” refers to the setting where each language’s thresh-old is set independently to the proportion of gthresh-old depen-dencies satisfying the rules — for English this proportion is 70%, while the average proportion across languages is 63%.

ish, is not the most frequent rule in either language. This result suggests that some rules are harder to learn than others regardless of their frequency, so their presence in the specified ruleset yields stronger performance gains.

Varying the Constraint Threshold In our main experiments we require that at least 80% of the ex-pected dependencies satisfy the rule constraints. We arrived at this threshold by tuning on the basis of En-glish only. As shown in Figure 2, for EnEn-glish a broad band of threshold values from 75% to 90% yields re-sults within 2.5% of each other, with a slight peak at 80%.

To further study the sensitivity of our method to how the threshold is set, we perform post hoc ex-periments with other threshold values on each of the other languages. As Figure 2 also shows, on average a value of 80% is optimal across languages, though again accuracy is stable within 2.5% between thresh-olds of 75% to 90%. These results demonstrate that a single threshold is broadly applicable across lan-guages.

Interestingly, setting the threshold value indepen-dently for each language to its “true” proportion based on the gold dependencies (denoted as the “Gold” case in Figure 2) does not achieve optimal

Length

≤10 ≤20

Universal Dependency Rules

1 HDP-DEP 71.9 50.4

No Rules (Random Init)

2 HDP-DEP 24.9 24.4

3 Headden III et al. (2009) 68.8 -English-Specific Parsing Rules

4 Deterministic (rules only) 70.0 62.6

5 HDP-DEP 73.8 66.1

Druck et al. (2009) Rules

6 Druck et al. (2009) 61.3

-7 HDP-DEP 64.9 42.2

Table 6: Directed accuracy of our model (HDP-DEP) on sentences of length 10 or less and 20 or less from WSJ with different rulesets and with no rules, along with vari-ous baselines from the literature. Entries in this table are numbered for ease of reference in the text.

performance. Thus, knowledge of the true language-specific rule proportions is not necessary for high accuracy.

7.2 Analysis of Model Properties

We perform a set of additional experiments on En-glish to gain further insight into HDP-DEP’s behav-ior. Our choice of language is motivated by the fact that a wide range of prior parsing algorithms were developed for and tested exclusively on En-glish. The experiments below demonstrate that 1) universal rules alone are powerful, but language-and dataset-tailored rules can further improve per-formance; 2) our model learns jointly from the rules and data, outperforming a rules-only deter-ministic parser; 3) the way we incorporate posterior constraints outperforms the generalized expectation constraint framework; and 4) our model exhibits low variance when seeded with different initializations. These results are summarized in Table 6 and dis-cussed in detail below; line numbers refer to entries in Table 6. Each run of HDP-DEP below is with syntactic refinement enabled.

[image:9.612.85.290.63.222.2]

(10)

As lines 1 and 5 of Table 6 show, language-specific rules yield better performance. For sentences of length 10 or less, the difference between the two rulesets is a relatively small 1.9%; for longer sen-tences, however, the difference is a substantially larger 15.7%. This is likely because longer sen-tences tend to be more complex and thus exhibit more language-idiosyncratic dependencies. Such dependencies can be better captured by the refined language-specific rules.

We also test model performance when no linguis-tic rules are available, i.e., performing unconstrained variational inference. The model performs substan-tially worse (line 2), confirming that syntactic cat-egory refinement in a fully unsupervised setup is challenging.

Learning Beyond Provided Rules Since HDP-DEP is provided with linguistic rules, a legitimate question is whether it improves upon what the rules encode, especially when the rules are complete and language-specific. We can answer this question by comparing the performance of our model seeded with the English-specific rules against a determin-istic parser that implements the same rules. Lines 4 and 5 of Table 6 demonstrate that the model out-performs a rules-only deterministic parser by 3.8% for sentences of length 10 or less and by 3.5% for sentences of length 20 or less.

Comparison with Alternative Semi-supervised Parser The dependency parser based on the gen-eralized expectation criteria (Druck et al., 2009) is the closest to our reported work in terms of tech-nique. To compare the two, we run HDP-DEP using the 20 rules given by Druck et al. (2009). Our model achieves an accuracy of 64.9% (line 7) compared to 61.3% (line 6) reported in their work. Note that we do not rely on rule-specific expectation information as they do, instead requiring only a single expecta-tion constraint parameter.4

Model Stability It is commonly acknowledged in the literature that unsupervised grammar induc-tion methods exhibit sensitivity to initializainduc-tion. As in the previous section, we find that the

pres-4_{As explained in Section 5, having a single expectation}

pa-rameter is motivated by our focus on parsing with universal rules.

ence of linguistic rules greatly reduces this sensitiv-ity: for HDP-DEP, the standard deviation over five randomly initialized runs with the English-specific rules is 1.5%, compared to 4.5% for the parser de-veloped by Headden III et al. (2009) and 8.0% for DMV (Klein and Manning, 2004).

8 Conclusions

In this paper we demonstrated that syntactic uni-versals encoded as declarative constraints improve grammar induction. We formulated a generative model for dependency structure that models syntac-tic category refinement and biases inference to co-here with the provided constraints. Our experiments showed that encoding a compact, well-accepted set of language-independent constraints significantly improves accuracy on multiple languages compared to the current state-of-the-art in unsupervised pars-ing.

While our present work has yielded substantial gains over previous unsupervised methods, a large gap still remains between our method and fully su-pervised techniques. In future work we intend to study ways to bridge this gap by 1) incorporat-ing more sophisticated lincorporat-inguistically-driven gram-mar rulesets to guide induction, 2) lexicalizing the model, and 3) combining our constraint-based ap-proach with richer unsupervised models (e.g., Head-den III et al. (2009)) to benefit from their comple-mentary strengths.

Acknowledgments

(11)

References

Mark C. Baker. 2001. The Atoms of Language: The Mind’s Hidden Rules of Grammar. Basic Books.

Emily M. Bender. 2009. Linguistically na¨ıve != lan-guage independent: Why NLP needs linguistic typol-ogy. In Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Compu-tational Linguistics: Virtuous, Vicious or Vacuous?, pages 26–32.

Taylor Berg-Kirkpatrick and Dan Klein. 2010. Phylo-genetic grammar induction. In Proceedings of ACL, pages 1288–1297.

Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Information Science and Statis-tics. Springer.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL, pages 149–164.

David Burkett and Dan Klein. 2008. Two languages are better than one (for syntactic parsing). InProceedings of EMNLP, pages 877–886.

Andrew Carnie. 2002. Syntax: A Generative Introduc-tion (Introducing Linguistics). Blackwell Publishing. Ming-Wei Chang, Lev Ratinov, and Dan Roth.

2007. Guiding semi-supervision with constraint-driven learning. InProceedings of ACL, pages 280– 287.

Shay B. Cohen and Noah A. Smith. 2009a. Shared lo-gistic normal distributions for soft parameter tying in unsupervised grammar induction. In Proceedings of NAACL/HLT, pages 74–82.

Shay B. Cohen and Noah A. Smith. 2009b. Variational inference for grammar induction with prior knowl-edge. In Proceedings of ACL/IJCNLP 2009 Confer-ence Short Papers, pages 1–4.

Michael Collins. 1999. Head-driven statistical models for natural language parsing. Ph.D. thesis, University of Pennsylvania.

Hal Daum´e III and Lyle Campbell. 2007. A bayesian model for discovering typological implications. In Proceedings of ACL, pages 65–72.

Gregory Druck, Gideon Mann, and Andrew McCal-lum. 2009. Semi-supervised learning of dependency parsers using generalized expectation criteria. In Pro-ceedings of ACL/IJCNLP, pages 360–368.

Thomas S. Ferguson. 1973. A bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230.

Jenny Rose Finkel, Trond Grenager, and Christopher D. Manning. 2007. The infinite tree. InProceedings of ACL, pages 272–279.

Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. 2009. Dependency grammar induction via bitext pro-jection constraints. In Proceedings of ACL/IJCNLP, pages 369–377.

Kuzman Ganchev, Jo˜ao Grac¸a, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for struc-tured latent variable models. Journal of Machine Learning Research, 11:2001–2049.

Jo˜ao Grac¸a, Kuzman Ganchev, Ben Taskar, and Fernando Pereira. 2009. Posterior vs. parameter sparsity in la-tent variable models. InAdvances in NIPS, pages 664– 672.

Jo˜ao Grac¸a, Kuzman Ganchev, and Ben Taskar. 2007. Expectation maximization and posterior constraints. InAdvances in NIPS, pages 569–576.

Aria Haghighi and Dan Klein. 2006. Prototype-driven grammar induction. In Proceedings of ACL, pages 881–888.

William P. Headden III, Mark Johnson, and David Mc-Closky. 2009. Improving unsupervised dependency parsing with richer contexts and smoothing. In Pro-ceedings of NAACL/HLT, pages 101–109.

Dan Klein and Christopher Manning. 2004. Corpus-based induction of syntactic structure: Models of de-pendency and constituency. In Proceedings of ACL, pages 478–485.

Jonas Kuhn. 2004. Experiments in parallel-text based grammar induction. In Proceedings of ACL, pages 470–477.

Percy Liang, Slav Petrov, Michael Jordan, and Dan Klein. 2007. The infinite PCFG using hierarchical Dirichlet processes. InProceedings of EMNLP/CoNLL, pages 688–697.

Percy Liang, Michael I. Jordan, and Dan Klein. 2009a. Learning from measurements in exponential families. InProceedings of ICML, pages 641–648.

Percy Liang, Michael I. Jordan, and Dan Klein. 2009b. Probabilistic grammars and hierarchical Dirichlet pro-cesses. The Handbook of Applied Bayesian Analysis. Mitchell P. Marcus, Beatrice Santorini, and Mary Ann

Marcinkiewicz. 1993. Building a large annotated cor-pus of english: The penn treebank. Computational Linguistics, 19(2):313–330.

Frederick J. Newmeyer. 2005. Possible and Probable Languages: A Generative Perspective on Linguistic Typology. Oxford University Press.

Slav Petrov and Dan Klein. 2007. Learning and infer-ence for hierarchically split PCFGs. InProceeding of AAAI, pages 1663–1666.

Benjamin Snyder, Tahira Naseem, and Regina Barzilay. 2009. Unsupervised multilingual grammar induction. InProceedings of ACL/IJCNLP, pages 73–81. Lydia White. 2003. Second Language Acquisition and