
3 TOOL FOR THE AUTOMATIC ANALYSIS OF SYNTACTIC

3.1 NLP Processes

3.1.2 Constituency parsing

Constituency parsing is essential for automatic syntactic analysis because it identifies phrasal and clausal boundaries and distinguishes independent from subordinate clauses. The notion of grammatical constituency, or the idea that groups of words can together function as a constituent or grammatical unit, has been around for about a hundred years (Wundt, 1900). Constituency explains how strings of words such as “the linguist” and “the fashionable linguist” can occur in similar contexts in a sentence and serve similar purposes (i.e., they are both noun phrases). Chomsky (1965) used the idea of constituency as a basis for developing a model of how syntactic systems work. Chomsky theorized that language systems could be described via a number of phrase-structure rules that account for constituencies.

Although many formalisms have been derived from Chomsky’s theories, computer scientists tend to use phrase structure rules written in Chomsky Normal Form (CNF), in which each rule includes a single structure on the left (e.g., NP) and either one or two structures on the right (Jurafsky & Manning, 2008). Our first example, “the linguist”, which is a noun phrase (NP), can be accounted for in CNF via the rule NP -> determiner (DT) noun (N). To account for our second example, “the fashionable linguist”, we will need two rules, NP -> DT NP and NP -> adjective (ADJ) N. It is important to note that these rules do not include lexical items, and therefore, theoretically, any ADJ and any N could be used to create a grammatical NP. This has led to describing such a grammar as a “context-free grammar” (CFG). A CFG presents a promising starting point for computational syntactic analysis because it can theoretically use a finite number of rules to describe an infinite number of lexical combinations (sentences).

A syntactic parse, then, is a hierarchical representation of the phrase structure rules that account for a particular sentence, and is often called a parse tree. The parse tree for the sentence “The linguist climbs rocks”, for example, is (S (NP (DT The) (NN linguist)) (VP (VBZ climbs) (NP (NNS rocks)))), which can alternatively be represented as in Figure 3.1.
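The relationship between a bracketed parse and the phrase structure rules it instantiates can be illustrated with a short sketch. The Python below is only an illustration, not part of the tool described in this chapter: the helper names `read_tree` and `phrase_rules` are hypothetical, and the bracketing for “The linguist climbs rocks” is written in conventional Penn Treebank style.

```python
def read_tree(text):
    """Parse a bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "(", "a constituent must start with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())      # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                              # consume ")"
        return (label, children)

    return parse()


def phrase_rules(tree):
    """List the non-lexical rules (LHS, RHS) used in the tree, top-down."""
    label, children = tree
    found = []
    if any(isinstance(child, tuple) for child in children):
        rhs = tuple(child[0] for child in children if isinstance(child, tuple))
        found.append((label, rhs))
    for child in children:
        if isinstance(child, tuple):
            found.extend(phrase_rules(child))
    return found


tree = read_tree("(S (NP (DT The) (NN linguist)) (VP (VBZ climbs) (NP (NNS rocks))))")
rules = phrase_rules(tree)
for lhs, rhs in rules:
    print(lhs, "->", " ".join(rhs))
```

Because lexical rules such as DT -> The are filtered out, the printed rules contain only category labels, which mirrors the point that the grammar itself makes no reference to particular words.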

Figure 3.1 A visual representation of the parse tree for the sentence “The linguist climbs rocks”.

Phrase structure rules can of course be handwritten, but they can also be automatically derived from large repositories of hand-annotated sentences such as the Penn Treebank (Marcus et al., 1993). Despite the theoretical advantages of CFGs, there is at least one major drawback: linguistic ambiguity. Syntactic ambiguity can be demonstrated in many ways and at many levels. One classic type of example, which is outlined in Fromkin et al. (2013), is prepositional phrase (PP) attachment. In the sentence “The boy saw the man with the telescope”, for example, the attachment of the PP “with the telescope” is ambiguous. It is not clear whether the PP is directly attached to the VP as a sister of “saw” (e.g., VP -> V PP) or whether it is directly attached to the NP as a sister of “man” (e.g., NP -> N PP): the rules of the grammar allow for both interpretations. While this example is as ambiguous to humans as it is to a computer program, a program based solely on phrase structure rules will have much more difficulty than a human processing the same sentences, especially when we consider that true CFG parsers only use POS tags (not lexical items) as input.

One way structural ambiguity can be resolved is probabilistically. Given hand-tagged corpora, we can calculate the relative probabilities of each sentence parse based on the POS tags, and assign the most probable parse to a sequence of POS tags (i.e., a POS representation of a sentence). Parsers that use probabilities to disambiguate between possible parses are referred to as probabilistic context-free grammar parsers (PCFG parsers).
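A toy example may make the probabilistic approach more concrete. The sketch below runs a small CKY-style chart parser over the POS tags of “The boy saw the man with the telescope” and multiplies rule probabilities to score each parse. Everything in it is an assumption made for illustration: the rule probabilities are invented rather than estimated from a treebank, the rules are binarized (so VP attachment is written VP -> VP PP and NP attachment NP -> NP PP), and the function name `all_parses` is hypothetical.

```python
from collections import defaultdict

# Hypothetical rule probabilities; a real PCFG would estimate them from a treebank.
# Keys are right-hand sides (left child, right child); values list (parent, probability).
RULES = {
    ("NP", "VP"): [("S", 1.0)],
    ("DT", "NN"): [("NP", 0.7)],
    ("NP", "PP"): [("NP", 0.3)],   # PP attached inside the noun phrase
    ("VBD", "NP"): [("VP", 0.6)],
    ("VP", "PP"): [("VP", 0.4)],   # PP attached to the verb phrase
    ("IN", "NP"): [("PP", 1.0)],
}


def all_parses(tags):
    """Return every (probability, tree) that covers the tag sequence as an S."""
    n = len(tags)
    chart = defaultdict(list)  # (start, end, label) -> [(probability, tree), ...]
    for i, tag in enumerate(tags):
        chart[(i, i + 1, tag)].append((1.0, tag))  # POS tags are the input symbols
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for split in range(start + 1, end):
                for (left, right), parents in RULES.items():
                    for lp, ltree in chart.get((start, split, left), []):
                        for rp, rtree in chart.get((split, end, right), []):
                            for parent, p in parents:
                                chart[(start, end, parent)].append(
                                    (p * lp * rp, (parent, ltree, rtree)))
    return chart.get((0, n, "S"), [])


# POS representation of "The boy saw the man with the telescope"
tags = "DT NN VBD DT NN IN DT NN".split()
parses = sorted(all_parses(tags), key=lambda pt: pt[0], reverse=True)
best_prob, best_tree = parses[0]
for prob, tree in parses:
    print(round(prob, 5), tree)
```

With these invented probabilities the chart contains exactly two parses, and the VP-attachment reading scores higher (about 0.082 versus about 0.062). Different rule probabilities would reverse the preference, which is why the estimates a parser learns from its training corpus matter so much.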

The accuracy of PCFG parsers has improved through the addition of various degrees of context. Two ways that PCFG parsers have been enhanced are through recognizing grammatical relations and through the use of lexical information (Jurafsky & Manning, 2008). In the Switchboard corpus, for example, the probability of an NP consisting of a pronoun (NP -> PRP) and the probability of an NP consisting of a determiner and a noun (NP -> DT NN) are very similar. If we consider grammatical relations such as subjects and objects of verbs, however, we see that in the subject position the probability of NP -> PRP is much higher than that of NP -> DT NN, while in the object position the opposite is true (Francis, Gregory, & Michaelis, 1999). PCFGs can achieve much higher accuracies through the use of such information. Modeling lexical preferences (e.g., n-gram frequencies) can also increase parser accuracy, but can lead to extremely large and therefore slow models. Most current parsers, such as the Stanford Parser (Klein & Manning, 2003), tend to use grammatical relations instead of lexical information (Jurafsky & Manning, 2008).
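The effect of conditioning on grammatical position can be sketched with relative frequencies. The counts below are invented for illustration and are not Switchboard figures; the function name `rule_probability` is likewise hypothetical. The point is only that two rules with similar overall probabilities can have very different probabilities once position is taken into account.

```python
from collections import Counter

# Hypothetical counts of NP expansions by grammatical position (not corpus data).
COUNTS = Counter({
    ("subject", "NP -> PRP"): 80,
    ("subject", "NP -> DT NN"): 20,
    ("object", "NP -> PRP"): 25,
    ("object", "NP -> DT NN"): 75,
})


def rule_probability(rule, position=None):
    """Relative frequency of `rule`, optionally conditioned on a position."""
    relevant = {key: n for key, n in COUNTS.items() if position in (None, key[0])}
    total = sum(relevant.values())
    matching = sum(n for (_, r), n in relevant.items() if r == rule)
    return matching / total


for rule in ("NP -> PRP", "NP -> DT NN"):
    print(rule,
          "overall:", rule_probability(rule),
          "subject:", rule_probability(rule, "subject"),
          "object:", rule_probability(rule, "object"))
```

In this toy data the two rules are close overall (0.525 versus 0.475), yet NP -> PRP dominates in subject position (0.8) and NP -> DT NN dominates in object position (0.75).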

The Charniak parser (Charniak & Johnson, 2005; Charniak, 2000) is another popular parser (this chapter describes the configuration outlined in Charniak & Johnson, 2005), but unlike the Stanford Parser, it uses lexical information to obtain an accurate parse. The Charniak parser runs in two stages. First, a text is parsed via a PCFG parser, and the n-best parses (i.e., the n most probable parses) are kept (Charniak & Johnson, 2005, report on a 50-best parse configuration). Lexical and head information is then added to the information available to the probabilistic model, and a maximum entropy (MaxEnt) model is used to choose the best parse. Using this method, the Charniak parser can achieve up to 91.0% accuracy, which is state of the art.
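The two-stage idea can be sketched as a log-linear (MaxEnt-style) reranker over an n-best list. This is a minimal illustration under stated assumptions, not the Charniak parser's actual feature set or code: the candidate parses, the feature names, and the weights are all invented, and a real reranker would learn its weights from annotated data rather than have them set by hand.

```python
import math

# Hypothetical feature weights; a trained reranker would learn these from a treebank.
WEIGHTS = {
    "log_pcfg_prob": 1.0,          # score from the first-stage PCFG parser
    "head=saw,prep=with": 0.8,     # lexical/head feature favouring verb attachment
    "pp_attached_to_noun": -0.2,   # structural feature
}


def rerank(candidates, weights=WEIGHTS):
    """Score each (name, features) candidate and return (best name, probabilities)."""
    scores = [
        sum(weights.get(feature, 0.0) * value for feature, value in feats.items())
        for _, feats in candidates
    ]
    normalizer = sum(math.exp(score) for score in scores)
    probs = [math.exp(score) / normalizer for score in scores]
    best_index = max(range(len(candidates)), key=lambda i: probs[i])
    return candidates[best_index][0], probs


# A 2-best list standing in for the n-best output of the first stage.
n_best = [
    ("NP attachment", {"log_pcfg_prob": math.log(0.06), "pp_attached_to_noun": 1.0}),
    ("VP attachment", {"log_pcfg_prob": math.log(0.05), "head=saw,prep=with": 1.0}),
]
best, probs = rerank(n_best)
print(best, [round(p, 3) for p in probs])
```

In this example the first-stage PCFG score alone would prefer the NP-attachment parse (0.06 versus 0.05), but the added lexical feature shifts the reranked preference to VP attachment.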
