Experiments in Dependency Parsing - Learning Chinese language structures with multiple views

The previous experiments on shallow parsing evaluated the impact of word clustering on parsing in the constituency formalism. Both the bracketing and the labeling tasks benefit from word clusters. Another important type of syntactic structure is the bilexical dependency structure. In this section, we evaluate the impact of the MKCLS clusters on dependency parsing.

7.5.1 Cluster-based Features

Principled feature engineering is important for applying word clusters to dependency parsing. In our experiments, we incorporate word clusters essentially as fine-grained POS tags: we copy every feature that involves a real POS tag and replace the POS tag with the corresponding word cluster.
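The copy-and-substitute idea can be illustrated with a minimal sketch. The feature templates, the cluster lexicon, the `c-UNK` fallback and all function names below are hypothetical and chosen only for illustration; the actual templates of the second-order parser are not listed in this section.

```python
from typing import Dict, List

# Hypothetical cluster lexicon: word -> cluster id (e.g. as produced by MKCLS).
CLUSTERS: Dict[str, str] = {"Beijing": "c17", "visit": "c42"}


def tag_features(head_tag: str, dep_tag: str, direction: str) -> List[str]:
    """Toy first-order templates that involve tags (illustrative only)."""
    return [
        f"h={head_tag}",
        f"d={dep_tag}",
        f"h,d={head_tag},{dep_tag}",
        f"h,d,dir={head_tag},{dep_tag},{direction}",
    ]


def arc_features(head: str, head_pos: str, dep: str, dep_pos: str,
                 direction: str, clusters: Dict[str, str]) -> List[str]:
    """POS-based features plus a copy in which clusters replace the POS tags."""
    pos_feats = ["POS:" + f for f in tag_features(head_pos, dep_pos, direction)]
    # Assumption: words missing from the lexicon share one fallback cluster.
    head_c = clusters.get(head, "c-UNK")
    dep_c = clusters.get(dep, "c-UNK")
    # Same templates, different prefix, so the two copies get separate weights.
    cluster_feats = ["CLU:" + f for f in tag_features(head_c, dep_c, direction)]
    return pos_feats + cluster_feats


features = arc_features("visit", "VV", "Beijing", "NR", "R", CLUSTERS)
```

In this sketch every tag-based template yields exactly one cluster-based copy, so the feature set doubles in size for the tag-related part.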

7.5.2 Experiments and Analysis

7.5.2.1 Main Results

To evaluate the helpfulness of cluster-based features, we conduct dependency parsing experiments on the CoNLL 2009 shared task data, i.e. the same data setting as the parsing experiments in Chapter 5. As in the chunking experiments, we run two sets of experiments, based on the supervised POS tagger and the semi-supervised tagger respectively. In this chapter, we use a second-order graph-based dependency parsing model [Che et al., 2009; Li et al., 2011] for the experiments.¹ This parser obtained the best parsing result in the CoNLL shared task. Table 7.13 summarizes the experimental results. These results show that word clustering is helpful for dependency parsing. The total number of clusters influences the quality of dependency parsing: as the number of clusters increases, both the UAS and the LAS increase.

¹ We would like to thank Zhenghua Li for providing his implementation and Meishan Zhang for helping ...

Tagger                   Features    Cluster          UAS     LAS
Supervised               Supervised  - -              82.98%  78.65%
Supervised               +c100       MKCLS+1991-2004  83.60%  79.41%
Supervised               +c500       MKCLS+1991-2004  84.01%  79.85%
Supervised               +c1000      MKCLS+1991-2004  84.16%  79.99%
+c500(MKCLS)+1991-2004   +c100       MKCLS+1991-2004  79.87%  80.01%
+c500(MKCLS)+1991-2004   +c500       MKCLS+1991-2004  84.22%  80.11%
+c500(MKCLS)+1991-2004   +c1000      MKCLS+1991-2004  84.57%  80.46%
+Clustering+Bagging      +c1000      MKCLS+1991-2004  84.80%  80.82%

Table 7.13: Dependency parsing UAS/LAS with different feature configurations on the development data.

7.5.2.2 Two-fold Effect

Word clustering derives paradigmatic relational information from unlabeled data and contributes to dependency parsing by (1) abstracting context information and (2) fighting the data sparseness problem. To analyze this two-fold effect, we restrict the clustering lexicon so that it contains only IV words. Using this constrained lexicon, we train a new “+c1000(MKCLS)+1991-2004” model and report its predictive power in Table 7.14. Note that the POS information is provided by the supervised tagger. The gap between the baseline and the +IV clustering model measures the first contribution, while the gap between the +IV clustering and +All clustering models measures the second. This result indicates that the improved accuracy comes partly from the new interpretation of a word through its cluster, and partly from the model's memory of OOV words that appear in the unlabeled data.
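Restricting the lexicon to IV words can be sketched as follows. This is only an illustration under two assumptions: the cluster lexicon is a word-to-cluster mapping, and IV words are those that occur in the labeled training sentences. The function name and the toy data are hypothetical.

```python
from typing import Dict, Iterable, List


def restrict_to_iv(cluster_lexicon: Dict[str, str],
                   training_sentences: Iterable[List[str]]) -> Dict[str, str]:
    """Keep only the lexicon entries whose word occurs in the training data."""
    iv_words = {word for sentence in training_sentences for word in sentence}
    return {w: c for w, c in cluster_lexicon.items() if w in iv_words}


# Toy data: "Xanadu" occurs only in the unlabeled corpus used for clustering.
full_lexicon = {"we": "c3", "visit": "c42", "Beijing": "c17", "Xanadu": "c17"}
treebank = [["we", "visit", "Beijing"]]

iv_lexicon = restrict_to_iv(full_lexicon, treebank)
```

With the IV-only lexicon, a cluster can still abstract over the contexts of known words, but it can no longer link an unseen word such as "Xanadu" to a known word in the same cluster. That is the difference the +IV clustering and +All clustering settings are meant to isolate.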

Tagger      Features         UAS     LAS
Supervised  Supervised       82.98%  78.65%
Supervised  +IV clustering   83.45%  79.24%
Supervised  +All clustering  84.16%  79.99%

Table 7.14: Dependency performance with IV clustering on the development data.

7.5.2.3 Impact on the Prediction of OOV Words

Word clustering fights the sparse data problem by relating low-frequency words to high-frequency words through their classes. Table 7.15 shows the prediction accuracy for different types of dependencies. We report four types: (1) both the dependent and the head are IV words; (2) the dependent is an IV word while the head is an OOV word; (3) the dependent is an OOV word while the head is an IV word; (4) both the dependent and the head are OOV words. The semi-supervised model evaluated here is the best system available. The table shows a clear gap in predictive power between IV and OOV words. Interestingly, dependencies with OOV dependents are harder to recognize than those with OOV heads. We compare the improvements for OOV and IV words and find that the error reduction for OOV words is higher. This confirms our motivation: exploiting knowledge of paradigmatic relations among words helps the recognition and disambiguation of OOV words.

                   Supervised               Semi-supervised
Dependent ← Head   P       R       F        P       R       F
IV ← IV            84.09%  83.81%  83.95    85.42%  85.12%  85.27
IV ← OOV           78.16%  79.65%  78.90    80.18%  81.77%  80.97
OOV ← IV           72.74%  73.46%  73.10    74.94%  75.57%  75.26
OOV ← OOV          69.84%  64.26%  66.94    74.92%  69.81%  72.28

Table 7.15: Dependency prediction accuracy relative to word type (OOV or IV).
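A breakdown of this kind can be computed roughly as in the sketch below. It is an illustration, not the evaluation script used for Table 7.15. It assumes unlabeled arcs given as (dependent index, head index) pairs, treats the artificial root at index 0 as IV, and types each predicted arc by its predicted head, which is why precision and recall can differ within one type. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Arc = Tuple[int, int]  # (dependent index, head index); index 0 is the root


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def arc_type(words: List[str], arc: Arc, vocab: Set[str]) -> str:
    """Label an arc as 'IV <- IV', 'IV <- OOV', 'OOV <- IV' or 'OOV <- OOV'."""
    def kind(i: int) -> str:
        # Assumption: the artificial root counts as an IV word.
        return "IV" if i == 0 or words[i] in vocab else "OOV"
    dep, head = arc
    return f"{kind(dep)} <- {kind(head)}"


def prf_by_type(words: List[str], gold: Set[Arc], pred: Set[Arc],
                vocab: Set[str]) -> Dict[str, Tuple[float, float, float]]:
    n_gold, n_pred, n_ok = Counter(), Counter(), Counter()
    for arc in gold:
        n_gold[arc_type(words, arc, vocab)] += 1
    for arc in pred:
        t = arc_type(words, arc, vocab)
        n_pred[t] += 1
        if arc in gold:
            n_ok[t] += 1
    scores = {}
    for t in set(n_gold) | set(n_pred):
        p = n_ok[t] / n_pred[t] if n_pred[t] else 0.0
        r = n_ok[t] / n_gold[t] if n_gold[t] else 0.0
        scores[t] = (p, r, f_score(p, r))
    return scores


words = ["<ROOT>", "we", "visit", "Xanadu"]
vocab = {"we", "visit"}               # words seen in the training data
gold = {(1, 2), (2, 0), (3, 2)}
pred = {(1, 2), (2, 0), (3, 1)}       # wrong head for the OOV word
scores = prf_by_type(words, gold, pred, vocab)
```

The same `f_score` helper reproduces the F column of Table 7.15 from its P and R columns, for example for the supervised IV ← IV row.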

7.5.2.4 Final Results

Table 7.16 shows the performance of different dependency models on the test data. The first line shows the best result reported in the CoNLL 2009 shared task. The cluster-based features result in relative error reductions of 7.2% and 6.9% in UAS and LAS, respectively, over our baseline.

Tagger                       Parser                    UAS     LAS
CoNLL 09 [Che et al., 2009]  -                         -       75.49%
Supervised                   Supervised                83.27%  78.64%
+c500(MKCLS)+1991-2004       +c1000(MKCLS)+1991-2004   84.48%  80.11%

Table 7.16: Dependency parsing performance on the test data.
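The relative error reductions quoted for the test data follow directly from the scores in Table 7.16. The short sketch below shows the arithmetic; the function name is ours, and the formula is the usual definition of relative error reduction.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Percentage of the baseline's errors removed (accuracies in percent)."""
    baseline_error = 100.0 - baseline_acc
    return 100.0 * (new_acc - baseline_acc) / baseline_error


# Scores of the supervised baseline and the cluster-based model (Table 7.16).
uas_reduction = relative_error_reduction(83.27, 84.48)
las_reduction = relative_error_reduction(78.64, 80.11)
print(f"UAS: {uas_reduction:.1f}%  LAS: {las_reduction:.1f}%")
```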

7.6 Conclusion and Discussion

In this chapter, we evaluate the helpfulness of unsupervised word clustering for supervised parsing. Our work is motivated by (1) the importance of rich lexical information for parsing and (2) the performance gap between supervised and unsupervised NLP methods. Our semi-supervised approach, which is based on feature induction, achieves substantial improvements over competitive baseline systems for Chinese parsing. The experimental results confirm that capturing paradigmatic relations is essential to analyzing syntagmatic relations.

Despite this success, there are several ways in which our work might be improved. We demonstrate the helpfulness of word clustering for shallow chunking and dependency parsing. A natural area for future work is applying word clustering to full constituency parsing. The main difficulty is that most successful constituency parsers are based on generative models, into which rich features are hard to incorporate.

Recall that the popular Brown and MKCLS clustering algorithms are based on a bigram language model. Intuitively, there is a mismatch between the kind of lexical information captured by Brown/MKCLS clustering and the kind of lexical information modeled in supervised POS tagging, chunking and dependency parsing. A natural avenue for further research is to exploit other types of lexical knowledge that reflect the syntactic behavior of words.

Part III

Chapter 8 Full and Partial Parsing Based Semantic Chunking

State-of-the-art Chinese semantic role labeling (SRL) systems leverage full parsing to find arguments and classify their semantic types. To better utilize syntactic information, which is crucial to the success of SRL, we propose a semantic chunking method together with linguistically rich syntactic features. Our system achieves an F-score of 93.41, which is significantly better than the best reported performance, 92.0. We also empirically analyze the effect of full parsing in Chinese SRL. To develop a complementary method, we study an alternative lightweight solution that makes use of partial syntactic parses only. Furthermore, we present a comparative analysis of the two categories of methods. This analysis could be exploited to improve SRL accuracy through system ensembles.

The rich syntactic features used in the full-parsing-based SRL system are introduced in [Sun, 2010a], and the partial-parsing-based method is introduced in [Sun et al., 2009a]. For a fair comparison, we repeat the experiments with slight modifications relative to the original papers.
