M-Step (subroutine for Algorithm 1) - Covariance in Unsupervised Learning of Probabilistic Gram

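Estimate $\mu^{(t)}$ and $\Sigma^{(t)}$ using the following maximum likelihood closed-form solution:

$$\mu^{(t)}_{k,i} \leftarrow \frac{1}{M} \sum_{m=1}^{M} \tilde{\mu}^{(t)}_{m,k,i}$$

$$\left[\Sigma^{(t)}_{k}\right]_{i,j} \leftarrow \frac{1}{M} \left( \sum_{m=1}^{M} \left( \tilde{\mu}^{(t)}_{m,k,i}\, \tilde{\mu}^{(t)}_{m,k,j} + (\tilde{\sigma}^{(t)}_{m,k,i})^{2}\, \delta_{i,j} \right) + M \mu^{(t)}_{k,i}\, \mu^{(t)}_{k,j} - \mu^{(t)}_{k,j} \sum_{m=1}^{M} \tilde{\mu}^{(t)}_{m,k,i} - \mu^{(t)}_{k,i} \sum_{m=1}^{M} \tilde{\mu}^{(t)}_{m,k,j} \right),$$

where $\delta_{i,j} = 1$ if $i = j$ and 0 otherwise.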
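The closed-form update can be sketched in a few lines of plain Python for a single multinomial k. This is only an illustration, not the released implementation: the function name `m_step` and the nested-list layout (`mu_tilde[m][i]`, `sigma_tilde[m][i]`) are assumptions made for the example. Expanding the products shows that the covariance update equals the empirical covariance of the variational means, plus the average variational variance on the diagonal.

```python
def m_step(mu_tilde, sigma_tilde):
    """Closed-form M-step for one multinomial k (illustrative sketch).

    mu_tilde[m][i]    : variational mean for example m, coordinate i
    sigma_tilde[m][i] : variational standard deviation for example m, coordinate i
    Returns (mu, Sigma) as a list and a list of lists.
    """
    M = len(mu_tilde)
    N = len(mu_tilde[0])

    # mu_{k,i} <- (1/M) * sum_m mu_tilde_{m,k,i}
    col_sums = [sum(mu_tilde[m][i] for m in range(M)) for i in range(N)]
    mu = [s / M for s in col_sums]

    Sigma = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            total = 0.0
            for m in range(M):
                total += mu_tilde[m][i] * mu_tilde[m][j]
                if i == j:  # the delta_{i,j} term
                    total += sigma_tilde[m][i] ** 2
            total += M * mu[i] * mu[j]
            total -= mu[j] * col_sums[i]
            total -= mu[i] * col_sums[j]
            Sigma[i][j] = total / M
    return mu, Sigma


if __name__ == "__main__":
    # Two training examples (M = 2), two coordinates (N = 2); toy numbers.
    mu, Sigma = m_step([[1.0, 2.0], [3.0, 6.0]], [[1.0, 0.0], [1.0, 2.0]])
    print(mu)
    print(Sigma)
```

The triple loop mirrors the formula term by term; a vectorized version would be preferable for realistic grammar sizes.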

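Then, note that:

$$L = \operatorname*{argmin}_{q_m(y_m)} D_{\mathrm{KL}}\left( q_m(y_m) \,\middle\|\, r_m(y_m \mid x_m, e^{\tilde{\psi}_m}) \right), \qquad (15)$$

where $D_{\mathrm{KL}}$ denotes the KL divergence. To see that, combine the definition of KL divergence with the fact that

$$\sum_{k=1}^{K} \sum_{i=1}^{N_k} f_{m,k,i}(x, y)\, \tilde{\psi}_{m,k,i} - \log Z_m(\tilde{\psi}_m) = \log r_m(y_m \mid x_m, e^{\tilde{\psi}_m}),$$

where $\log Z_m(\tilde{\psi}_m)$ does not depend on $q_m(y_m)$. Equation 15 is minimized when $q_m = r_m$.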
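The step from the identity to the KL form can be written out as follows. This is a sketch under an assumption not shown in this excerpt: that the part of the bound depending on $q_m$ is the expected score under $q_m$ plus the entropy $H(q_m)$.

$$
\begin{aligned}
\mathbb{E}_{q_m}\!\left[ \sum_{k=1}^{K} \sum_{i=1}^{N_k} f_{m,k,i}(x, y)\, \tilde{\psi}_{m,k,i} \right] + H(q_m)
&= \mathbb{E}_{q_m}\!\left[ \log r_m(y_m \mid x_m, e^{\tilde{\psi}_m}) \right] + \log Z_m(\tilde{\psi}_m) + H(q_m) \\
&= -D_{\mathrm{KL}}\left( q_m(y_m) \,\middle\|\, r_m(y_m \mid x_m, e^{\tilde{\psi}_m}) \right) + \log Z_m(\tilde{\psi}_m).
\end{aligned}
$$

Under that assumption, maximizing the left-hand side over $q_m$ is the same as minimizing the KL divergence, because $\log Z_m(\tilde{\psi}_m)$ is constant in $q_m$. The KL divergence is non-negative and equals zero exactly when $q_m = r_m$.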

The above lemma demonstrates that the minimizing $q_m(y_m)$ has the same form as the probabilistic grammar G, only without sum-to-one constraints on the weights (leading to the required normalization constant $Z_m(\tilde{\psi}_m)$). As in classic EM with probabilistic grammars, we never need to represent $q_m(y_m)$ explicitly; we need only $\tilde{f}_m$, which can be calculated as expected feature values under $r_m(y_m \mid x_m, e^{\tilde{\psi}_m})$ using dynamic programming.
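To make the quantity concrete, the sketch below computes expected feature values under a distribution of this unnormalized log-linear form. It uses brute-force enumeration over an explicit list of candidate derivations in place of dynamic programming, which is only feasible for toy inputs. The function name `expected_features`, the dictionary representation of feature counts, and the feature names are assumptions made for the example.

```python
import math


def expected_features(derivations, psi):
    """Expected feature counts under r(y) proportional to exp(sum_f count_f(y) * psi_f).

    derivations : list of dicts mapping feature name -> count, one per candidate y
    psi         : dict mapping feature name -> unnormalized log-weight
    Returns (probs, expected), where probs[n] is r(y_n).
    """
    # Unnormalized log-score of each derivation.
    scores = [sum(c * psi[f] for f, c in d.items()) for d in derivations]

    # log Z via the log-sum-exp trick for numerical stability.
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))

    probs = [math.exp(s - log_z) for s in scores]

    # Expected count of each feature: sum_y r(y) * count_f(y).
    expected = {}
    for p, d in zip(probs, derivations):
        for f, c in d.items():
            expected[f] = expected.get(f, 0.0) + p * c
    return probs, expected


if __name__ == "__main__":
    # Two hypothetical derivations of one sentence, each using one rule once.
    probs, expected = expected_features(
        [{"rule_a": 1}, {"rule_b": 1}],
        {"rule_a": math.log(3.0), "rule_b": 0.0},
    )
    print(probs, expected)
```

Because the weights need not sum to one, the explicit normalizer `log_z` plays the role of $\log Z_m(\tilde{\psi}_m)$. A real implementation would obtain the same expectations from inside-outside charts rather than from an enumerated list.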

Variational inference for model II is done similarly to model I. The main difference is that instead of having variational parameters for each $q_m(\eta_m)$, we have a single distribution $q(\eta)$, and the sufficient statistics from the inside-outside algorithm are pooled across all training examples to update it during variational inference.

Appendix C. Variational EM for Logistic-Normal Probabilistic Grammars

The algorithm for variational inference with probabilistic grammars using a logistic normal prior is defined in Algorithms 1–3 (see footnote 10). Since the updates for $\tilde{\zeta}^{(t)}_{k}$ are fast, we perform them after each optimization routine in the E-step (suppressed for clarity). There are variational parameters for each training example, indexed by m. We denote by B the variational bound in Equation 14. Our stopping criterion relies on the likelihood of a held-out set (Section 5) using a point estimate of the model.

10. An implementation of the algorithm is available at http://www.ark.cs.cmu.edu/DAGEEM. For simplicity, we give the vanilla logistic normal version of the algorithm in this appendix. The full version requires more careful indexing and can be derived using the equations from Appendix B.
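The outer loop described in this appendix can be pictured with the skeleton below. It is a generic sketch, not the released code: `e_step`, `m_step`, and `heldout_loglik` are hypothetical callables supplied by the caller, and the exact rule used here (stop as soon as held-out likelihood fails to improve) is an assumption, since the text states only that stopping relies on held-out likelihood.

```python
def variational_em(e_step, m_step, heldout_loglik, params, max_iters=100):
    """Alternate E- and M-steps until held-out likelihood stops improving.

    e_step(params)         -> variational statistics for all training examples
    m_step(stats)          -> updated model parameters
    heldout_loglik(params) -> held-out log-likelihood under a point estimate
    All three callables are placeholders for illustration.
    """
    best_params = params
    best_ll = heldout_loglik(params)
    for _ in range(max_iters):
        stats = e_step(params)   # per-example variational updates
        params = m_step(stats)   # closed-form update of the prior's parameters
        ll = heldout_loglik(params)
        if ll <= best_ll:        # assumed rule: stop on first non-improvement
            break
        best_params, best_ll = params, ll
    return best_params, best_ll


if __name__ == "__main__":
    # Toy stand-ins: the "model" is one number and held-out likelihood peaks at 3.
    result = variational_em(
        e_step=lambda p: p,
        m_step=lambda s: s + 1,
        heldout_loglik=lambda p: -(p - 3) ** 2,
        params=0,
    )
    print(result)
```

Returning the best parameters seen so far, rather than the last ones, keeps the final model at the held-out optimum under this stopping rule.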

