Background - Learning Chinese language structures with multiple views

8.1.1 The Problem

In the last decade, there has been increasing interest in semantic role labeling (SRL) for several languages. The task consists of recognizing the arguments of the predicates in a given sentence and labeling their semantic types. Typical semantic classes include Agent, Patient, Source, Goal, and so forth, which are core arguments to a predicate, as well as Location, Time, Manner, Cause, and so on, which are adjuncts. To indicate exactly what semantic relations hold among a given predicate and its associated participants and properties, the role-bearing constituents must be identified and assigned their correct semantic role labels, as in:

• [警察]Agent[正在]Time[详细]Manner[调查]Predicate[事故原因]Patient

• [The police]Agent are [thoroughly]Manner [investigating]Predicate [the cause of the accident]Patient.

In the given example, the predicate in question is “调查/investigate”. All arguments and adjuncts of “调查” are represented as groups of words between square brackets. A tag next to the closing bracket denotes the role of the argument (or adjunct). For example, the tag Agent indicates the doer of the investigation event, “警察/police,” since they initiate and sustain the action; the tag Manner indicates an adjunct of the target verb, since it describes how the event is carried out.
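The bracket notation used in the example can be read mechanically. The following is a minimal sketch, assuming the annotation is written as `[text]Label` with purely alphabetic labels; the function name `parse_bracketed` and the regular expression are illustrative and not part of any annotation toolkit.

```python
import re
from typing import List, Tuple

# Assumed notation: a bracketed group of words immediately followed by its role tag.
BRACKETED_ROLE = re.compile(r"\[([^\]]+)\]([A-Za-z]+)")


def parse_bracketed(annotated: str) -> List[Tuple[str, str]]:
    """Return (role, text) pairs for every bracketed group, in sentence order."""
    return [(role, text) for text, role in BRACKETED_ROLE.findall(annotated)]


chinese = "[警察]Agent[正在]Time[详细]Manner[调查]Predicate[事故原因]Patient"
english = ("[The police]Agent are [thoroughly]Manner "
           "[investigating]Predicate [the cause of the accident]Patient.")

chinese_roles = parse_bracketed(chinese)
english_roles = parse_bracketed(english)
```

Words outside any bracket, such as "are" in the English sentence, carry no role tag and are simply skipped by this sketch.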

Such sentence-level semantic analysis of text is concerned with the characterization of events and is therefore important for understanding the essential meaning of the original input sentences – who did what to whom, for whom or what, how, where, when, and why? Unlike many other information representation formalisms, this shallow semantic interpretation is independent of domains and more robust for many application purposes. SRL abstracts important semantic information away from syntactic structure and may potentially benefit many deep NLP tasks such as question answering, textual entailment, and complex information extraction.

8.1.2 The Annotation Data

Since the seminal work of [Gildea and Jurafsky, 2002], statistical and machine learning approaches have been the predominant research paradigm in SRL, as in many other subfields of NLP. A prerequisite for statistical and machine learning approaches to SRL is the availability of a significant amount of semantically interpreted corpora from which automatic systems can learn. Recent activity in SRL has in large part been driven by the availability of semantically annotated corpora such as the FrameNet [Baker et al., 1998], PropBank [Palmer et al., 2005], and NomBank [Meyers et al., 2004] projects for English; the tectogrammatical layer for Czech; and the Salsa Project [Burchardt et al., 2006] for German; the Chinese PropBank [Xue and Palmer,

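[Figure: constituency tree for 警方/police 正在/now 详细/thoroughly 调查/investigate 事故/accident 原因/cause. The subject NP (警方) is labeled A0, the ADVP 正在 is labeled AM-TMP, the ADVP 详细 is labeled AM-MNR, the verb 调查 (VV) is labeled Rel, and the object NP (事故 原因) is labeled A1.]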

Figure 8.1: An example sentence for CTB and CPB: The police are thoroughly investigating the cause of the accident.

The Chinese PropBank (CPB) is a popular semantically annotated corpus for research on Chinese SRL. It adds a layer of predicate-argument structures to the Chinese TreeBank (CTB). It assigns semantic role labels to syntactic constituents (rather than to the head words in a dependency structure) in a sentence. Each verb is annotated with a fixed number of arguments, and each argument plays a role with regard to the verb. The arguments of a predicate are labeled with a contiguous sequence of integers, in the form AN (where N is a natural number); the adjuncts are annotated with the label AM followed by a secondary tag. For the core arguments, CPB uses a set of predicate-specific semantic role labels. Predicates vary in the number of core arguments they take, but generally the total number of core arguments does not exceed six. Secondary tags for the adjuncts provide semantic information such as location, manner, and time. This information is not specific to a particular verb or even a particular class of verbs, and the tags are defined by a general set of guidelines. The assignment of semantic roles is illustrated in Figure 8.1, where the predicate is the verb “调查/investigate”. For example, the NP “事故原因/the cause of the accident” is labeled A1, meaning that it is the Patient, and the ADVP “正在/now” is labeled AM-TMP, indicating a temporal component.
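The labeling scheme described above (AN for core arguments, AM plus a secondary tag for adjuncts) can be made concrete with a small helper. This is a sketch under stated assumptions: the function name `describe_label` is hypothetical, the `Rel` label for the predicate is taken from Figure 8.1, and the pattern for secondary tags (upper-case letters) is an assumption rather than a rule from the CPB guidelines.

```python
import re
from typing import Optional, Tuple

CORE_LABEL = re.compile(r"^A(\d+)$")          # e.g. A0, A1
ADJUNCT_LABEL = re.compile(r"^AM-([A-Z]+)$")  # e.g. AM-TMP, AM-MNR (assumed shape)


def describe_label(label: str) -> Tuple[str, Optional[object]]:
    """Split a CPB-style label into its kind and its index or secondary tag."""
    core = CORE_LABEL.match(label)
    if core:
        return ("core", int(core.group(1)))
    adjunct = ADJUNCT_LABEL.match(label)
    if adjunct:
        return ("adjunct", adjunct.group(1))
    if label == "Rel":  # predicate marker as it appears in Figure 8.1
        return ("predicate", None)
    raise ValueError(f"unrecognized label: {label!r}")


figure_labels = ["A0", "AM-TMP", "AM-MNR", "Rel", "A1"]
described = [describe_label(label) for label in figure_labels]
```

Separating the core index from the adjunct tag in this way mirrors the distinction in the text: the meaning of A0 or A1 depends on the predicate, whereas a secondary tag such as TMP means the same thing for every verb.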

With the advent of this supporting resource, Chinese SRL has become a well-defined task with a substantial body of work and comparative evaluation.

8.1.3 Successful Methods for English SRL

The work on SRL has included a broad spectrum of statistical and machine learning approaches to the task. Most SRL research takes an approach requiring training on role-annotated data. In this chapter, we only focus on supervised approaches. Given a sentence and a designated verb, the SRL task consists of identifying the boundaries of the arguments of the verb predicate and classifying them with semantic roles. The most common architecture for state-of-the-art SRL systems consists of the following steps.

• The first step in SRL is full syntactic parsing, which provides rich syntactic information for semantic processing.

• The second step typically consists of linguistically motivated pruning of the set of argument candidates for a given predicate.

• The third step consists of a local classification of argument candidates. By “local,” we mean that candidates are usually treated independently of each other.

• The last step is to apply joint inference in order to combine the predictions of local scorers to produce a good structure of all labeled arguments for the predicate.
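The four steps above can be sketched end to end on the sentence from Figure 8.1. Everything in this sketch is an illustrative assumption rather than a system described in this chapter: the parse is built by hand instead of produced by a parser (the root label `IP` is assumed), the pruning rule is a simplified "collect the sisters of every node on the path from the predicate to the root" heuristic, the local scores come from hand-written toy rules and a toy adverb lexicon instead of a trained classifier, and joint inference is a greedy pass that only forbids duplicate core labels.

```python
class Node:
    """A constituent; leaves carry a word, internal nodes carry children."""

    def __init__(self, label, children=(), word=None):
        self.label, self.children, self.word = label, list(children), word
        self.parent = None
        for child in self.children:
            child.parent = self

    def text(self):
        if self.word is not None:
            return self.word
        return "".join(child.text() for child in self.children)


def build_example_tree():
    """Step 1 stand-in: a hand-built parse following Figure 8.1."""
    verb = Node("VV", word="调查")
    obj = Node("NP", [Node("NN", word="事故"), Node("NN", word="原因")])
    root = Node("IP", [
        Node("NP", [Node("NN", word="警方")]),
        Node("VP", [
            Node("ADVP", [Node("AD", word="正在")]),
            Node("ADVP", [Node("AD", word="详细")]),
            Node("VP", [verb, obj]),
        ]),
    ])
    return root, verb


def prune_candidates(predicate):
    """Step 2: keep only sisters of nodes on the predicate-to-root path."""
    candidates, current = [], predicate
    while current.parent is not None:
        siblings = current.parent.children
        position = siblings.index(current)
        for i, sister in enumerate(siblings):
            if i != position:
                candidates.append((sister, i < position))  # (node, precedes predicate?)
        current = current.parent
    return candidates


ADVERB_ROLES = {"正在": "AM-TMP", "详细": "AM-MNR"}  # hypothetical toy lexicon


def score_locally(node, is_before):
    """Step 3: score one candidate independently of all the others."""
    if node.label == "NP":
        return {"A0": 0.8, "A1": 0.2} if is_before else {"A0": 0.3, "A1": 0.7}
    if node.label == "ADVP":
        return {ADVERB_ROLES.get(node.text(), "AM-ADV"): 0.9}
    return {}  # not an argument candidate in this toy model


def joint_inference(scored):
    """Step 4: greedy assignment in which each core label is used at most once."""
    options = [(score, label, node)
               for node, scores in scored for label, score in scores.items()]
    options.sort(key=lambda option: -option[0])
    labeled, used_core = {}, set()
    for _, label, node in options:
        if id(node) in labeled or label in used_core:
            continue
        labeled[id(node)] = (node.text(), label)
        if label[0] == "A" and label[1:].isdigit():
            used_core.add(label)
    return dict(labeled.values())


def label_arguments(predicate):
    scored = [(node, score_locally(node, before))
              for node, before in prune_candidates(predicate)]
    return joint_inference(scored)


root, verb = build_example_tree()
roles = label_arguments(verb)
```

On this one sentence the local scores already agree with the constraint, so the joint step changes nothing; its purpose is to resolve cases in which two candidates would independently prefer the same core label.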

A variety of approaches have been proposed to capture different characteristics of SRL. More recent approaches for English SRL have involved calibrating features [Gildea and Jurafsky, 2002; Xue and Palmer, 2004], analyzing the complex input – syntax trees [Liu and Sarkar, 2007; Moschitti, 2004], exploiting the complicated output – the predicate-argument structure [Punyakanok et al., 2004; Toutanova et al., 2005], system combination [Punyakanok et al., 2004; Surdeanu et al., 2007], as well as capturing paradigmatic relations between predicates [Gordon and Swanson, 2007].

8.1.4 Previous Work on Chinese SRL

Previous work on Chinese SRL mainly focuses on how to implement SRL methods that are successful on English, such as [Ding and Chang, 2008; Sun and Jurafsky, 2004; Xue, 2008; Xue and Palmer, 2005; Zhuang and Zong, 2010]. Full-parsing-based SRL methods that are successful on English are adopted to resolve Chinese SRL. Sun and Jurafsky [2004] did the preliminary work on Chinese SRL without any large semantically annotated corpus of Chinese. They adopted the English SRL methods presented in [Gildea and Jurafsky, 2002; Pradhan et al., 2003] and evaluated ten specified verbs on a small collection of Chinese sentences. This work made the first attempt at Chinese SRL and produced promising results.

After the CPB was built, [Xue and Palmer, 2005] and [Xue, 2008] carried out more complete and systematic research on Chinese SRL. Their work shows that when gold parses are available, the performance of Chinese SRL is fairly encouraging, achieving an F-score of 92.0. When an automatic parser is used, however, performance degrades sharply, to an F-score of only 71.9. This indicates the importance of syntactic parsing for high-performing SRL systems. In [Ding and Chang, 2008], semantic role classification (SRC) is divided into two sequential sub-tasks: each argument is first determined to be a core argument or an adjunct, and is then classified into fine-grained categories. However, carefully designed features are more important, and our experiments suggest that with rich features a better SRC solver can be trained directly, without a hierarchical architecture.
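For reference, the F-scores quoted above can be read as the harmonic mean of precision and recall over labeled arguments. The sketch below assumes the common convention that an argument counts as correct only if its span and label both match the gold annotation exactly; the chapter does not spell out the scoring script, so the function `f_score` and the `(start, end, label)` representation are illustrative. The toy numbers are invented for the example and are not results from any of the cited systems.

```python
from typing import Iterable, Tuple

Argument = Tuple[int, int, str]  # (start token, end token, role label)


def f_score(gold: Iterable[Argument], predicted: Iterable[Argument]) -> float:
    """Balanced F-score (in percent) over exactly matching labeled arguments."""
    gold_set, predicted_set = set(gold), set(predicted)
    correct = len(gold_set & predicted_set)
    precision = correct / len(predicted_set) if predicted_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 100.0 * 2.0 * precision * recall / (precision + recall)


# Toy data: token positions in 警方 正在 详细 调查 事故 原因
gold_arguments = [(0, 0, "A0"), (1, 1, "AM-TMP"), (2, 2, "AM-MNR"), (4, 5, "A1")]
# A hypothetical system output that gets the object boundary wrong.
system_arguments = [(0, 0, "A0"), (1, 1, "AM-TMP"), (2, 2, "AM-MNR"), (5, 5, "A1")]

toy_f = f_score(gold_arguments, system_arguments)
```

Because a wrong boundary makes the whole argument incorrect under this convention, parsing errors that shift constituent boundaries translate directly into lost precision and recall, which is consistent with the gap between gold-parse and automatic-parse results reported above.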

Dependency structure is another popular formalism for representing syntactic and semantic information in NLP. For the CoNLL 2008 shared task, Surdeanu et al. [2008] proposed a unified dependency-based formalism to model both syntactic dependencies and semantic roles for English. The CoNLL 2009 shared task extends CoNLL 2008 and is dedicated to the joint parsing of syntactic and semantic dependencies in multiple languages [Hajič et al., 2009]. As a sub-task of CoNLL 2009, Chinese semantic dependency parsing methods have been well evaluated.
