Syntactic Analysis - Knowledge mining over scientific literature and technical documentation

In order to deal effectively with the domain descriptions, the original QA system had to be modified on many levels. We therefore describe again the main workings of the system, focusing here on those aspects that had to be modified.

The syntactic analysis begins with the tokenizer. Sentences are split into the units of analysis that optimize processing: words and sentence boundaries are all identified. Domain descriptions, recognized in a separate phase (Chapter 4, “Extraction of Domain Descriptions”), are now treated using a dedicated approach explained below.

Figure 6.1: Offline processing (extensions in orange, modifications in yellow)

An efficient lookup procedure identifies in the running text the domain descriptions (and their variants) which have been previously stored in the system’s computational thesaurus (see Section 5.3 and Figure 6.1). As the head of a multi-word description controls sentence-level syntactic behaviour, each description is considered as a single unit and assigned the syntactic requirements of the head. As such, descriptions are identified as either singular (DESCRIPTION.s) or plural (DESCRIPTION.p) noun phrases. In parallel to the assignment of syntactic features, a semantic value is assigned to the description, which corresponds to the identifier of the synset to which it belongs.
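The lookup step described above can be illustrated with a minimal sketch. It assumes a greedy longest-match strategy over a small in-memory dictionary; the thesaurus contents, the synset identifiers (`syn_017`, `syn_042`) and the function name `mark_descriptions` are hypothetical and are not taken from the system itself, whose actual lookup procedure is not specified here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical thesaurus: token sequence -> (syntactic type, synset identifier).
# Variants of one description share a synset identifier.
THESAURUS: Dict[Tuple[str, ...], Tuple[str, str]] = {
    ("electrical", "coax", "cable"): ("DESCRIPTION.s", "syn_017"),
    ("coax", "cable"): ("DESCRIPTION.s", "syn_017"),
    ("external", "antenna"): ("DESCRIPTION.s", "syn_042"),
    ("external", "antennas"): ("DESCRIPTION.p", "syn_042"),
}
MAX_LEN = max(len(key) for key in THESAURUS)

Unit = Tuple[str, str, Optional[str]]  # (surface form, type, synset id)


def mark_descriptions(tokens: List[str]) -> List[Unit]:
    """Replace each known multi-word description by a single unit.

    At every position the longest matching entry wins (an assumption of
    this sketch); unmatched tokens are passed through as plain words.
    """
    units: List[Unit] = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in THESAURUS:
                syn_type, synset = THESAURUS[key]
                units.append((" ".join(tokens[i:i + n]), syn_type, synset))
                i += n
                break
        else:
            # No thesaurus entry starts here: keep the token as it is.
            units.append((tokens[i], "WORD", None))
            i += 1
    return units


sentence = "an electrical coax cable connects the external antennas".split()
for unit in mark_descriptions(sentence):
    print(unit)
```

In this sketch the two descriptions become single units typed DESCRIPTION.s and DESCRIPTION.p respectively, each carrying the identifier of its (hypothetical) synset, so that later stages see five units instead of eight tokens.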

In this way, a description (and any other description belonging to the same synset) is treated syntactically as a noun phrase, either singular or plural.¹ At the semantic level, descriptions belonging to the same synset are equated, all being replaced by their synset identifier. It could be argued that the semantic representation of singular and plural nouns should be different; however, in our application we deliberately choose to ignore a number of semantic differences that have no impact on the problem we aim to solve.

A possible alternative approach would be to use the internal structure of descriptions (detected as described in Section 5.2) in the process of building the semantic representation of the sentences. This would however require maintaining a dual representation for each domain description at various levels of processing, once as a frozen syntactic unit (useful for parsing) and once as a compound, where the head carries the syntactic information. At present, we find such an approach to be cumbersome, while the solution that we have adopted provides for a neater flow of information. We do not rule out, however, the possibility of exploiting the internal structure of descriptions at a later stage in our research.

¹ Generally speaking, a description does not necessarily have to be a noun phrase, though in our domain descriptions always are.

Figure 6.2: Examples of LG output

///// a.d electrical coax cable.n4 connects.v062 the.d external antenna.n1 to.o the.d ANT connection.n1 /////

Parsing is based upon the robust, dependency-based Link Grammar (LG) parser [Sleator and Temperley, 1993], which is able to handle a wide range of syntactic structures [Sutcliffe and McElligott, 1996b]. LG uses linkages to describe the syntactic structure of a sentence. Each word carries linking requirements (singular determiners ‘look for’ singular nouns, etc.). A linkage representation of a sentence (Figure 6.2a) connects pairs of words in such a way that the requirements of each word in the sentence are satisfied, the links do not cross, and the words form a connected graph. The ability to predict the syntactic requirements of ‘unknown’ words and to process ungrammatical sentences by optionally ignoring some tokens ensures that an analysis of each sentence is returned. This is vital in the construction of the semantic representation.

In more detail, in the example in Figure 6.2a, the link Wd connects the subject coax cable to the wall.² The wall functions as a dummy word at the beginning of every sentence and has linking requirements like any other word. Ss links the subject, on the left, with the transitive verb connects, the verbal head, on the right. The transitive verb and its direct object external antenna, which acts as the head of a noun phrase, are connected by the Os link. MVp connects the verb to the modifying prepositional phrase. Finally, the link Js connects the preposition to with its object ANT connection.
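The two structural conditions on a linkage (no crossing links, one connected graph) can be checked mechanically. The sketch below encodes the example sentence with each domain description as a single token and tests both conditions. The links Wd, Ss, Os, MVp and Js are those named in the text; the determiner links labelled `Ds` are an assumption added here so that every word is attached, and the function names are hypothetical. This is not the LG parser's own algorithm, only a check on a finished linkage.

```python
from typing import List, Tuple

Link = Tuple[str, int, int]  # (label, index of left word, index of right word)

WORDS = ["LEFT-WALL", "a", "electrical_coax_cable", "connects",
         "the", "external_antenna", "to", "the", "ANT_connection"]

LINKS: List[Link] = [
    ("Wd", 0, 2),   # wall -> subject
    ("Ds", 1, 2),   # determiner -> noun (assumed, not named in the text)
    ("Ss", 2, 3),   # subject -> verb
    ("Os", 3, 5),   # verb -> direct object
    ("Ds", 4, 5),   # assumed determiner link
    ("MVp", 3, 6),  # verb -> modifying preposition
    ("Js", 6, 8),   # preposition -> its object
    ("Ds", 7, 8),   # assumed determiner link
]


def links_cross(a: Link, b: Link) -> bool:
    """Two links cross if exactly one end of one lies strictly inside the other."""
    (_, l1, r1), (_, l2, r2) = a, b
    return l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1


def is_planar(links: List[Link]) -> bool:
    return not any(links_cross(a, b)
                   for i, a in enumerate(links) for b in links[i + 1:])


def is_connected(n_words: int, links: List[Link]) -> bool:
    """Depth-first search from the wall; every word must be reachable."""
    adjacency = {i: set() for i in range(n_words)}
    for _, left, right in links:
        adjacency[left].add(right)
        adjacency[right].add(left)
    seen, stack = {0}, [0]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return len(seen) == n_words


print(is_planar(LINKS), is_connected(len(WORDS), LINKS))
```

Both checks succeed for this linkage. Adding a link from a (index 1) to the (index 4) would cross Wd, and removing Js would leave the final noun phrase detached from the rest of the graph.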

Processing the tokens inside multi-word descriptions individually would introduce additional linking requirements. In the best case, modifiers are all connected to the head (Figure 6.2b), identifying the descriptions as a phrasal unit but offering only a superficial representation of the internal structure. In more complex sentences, such modifiers might also wrongly link to words outside the description, resulting in multiple parses for the given sentence. The single token approach that we have adopted requires only that the linking properties for tokens of the types DESCRIPTION.s and DESCRIPTION.p be added to the LG lexicon.

Exploiting the atomicity of the domain descriptions previously identified during pre-processing blocks the possibility of erroneous parses, and also saves the computational expense needed to disambiguate between the alternatives. Furthermore, the risk of a parse which involves only fragments of a domain description (which should be treated as an indivisible unit) is avoided. Experimental results [Rinaldi et al., 2002a, Rinaldi et al., 2003b] show that, using this approach, the number of possible parses is reduced on average by almost 50%.

² The wall is an artificial constituent introduced by LG as the ‘root’ of the analysis.

Thus, reducing the complexity of the material to be parsed by treating multi-word descriptions as atomic elements reduces both the space and time requirements for the parsing process, and can have a dramatic impact on the automatic processing of technical documentation, as these results apply to all domains and texts with a high frequency of domain-specific descriptions.

The additional effort required for the analysis of the internal structure of the descriptions might be worthwhile if an accurate internal representation of their structure were possible. However, any parser with a sufficiently rich grammar would deliver a number of potential structures, among which disambiguation is extremely difficult. For example, a typical structure assigned by Link Grammar to a domain description is shown in Figure 6.2b: additional modifiers add the link A (adjectival modifier) or the link AN (nominal modifier) to the head of the phrase. Whilst this structure may correctly describe some descriptions (underfuselage off-centered door), arbitrary application to air conditioning system, electrical coax cable or the extension to no smoking/fasten seat belt (ns/fsb) signs fails to capture the more subtle patterns of modification.
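To give a sense of how quickly the number of candidate structures grows, the sketch below counts and enumerates the binary bracketings of a multi-word description. This is a simplifying assumption made for illustration: it considers only binary structures (whose count for n words is the Catalan number C(n-1)), ignores flat analyses and link labels, and is not a model of what Link Grammar itself generates. The function names are hypothetical.

```python
from math import comb
from typing import List


def count_binary_bracketings(n_words: int) -> int:
    """Number of binary bracketings of n words: the Catalan number C(n-1)."""
    if n_words < 1:
        raise ValueError("a description needs at least one word")
    k = n_words - 1
    return comb(2 * k, k) // (k + 1)


def binary_bracketings(words: List[str]) -> List[str]:
    """Enumerate every binary bracketing of the given word sequence."""
    if len(words) == 1:
        return [words[0]]
    results = []
    for split in range(1, len(words)):
        for left in binary_bracketings(words[:split]):
            for right in binary_bracketings(words[split:]):
                results.append(f"[{left} {right}]")
    return results


for structure in binary_bracketings(["air", "conditioning", "system"]):
    print(structure)
# [air [conditioning system]]
# [[air conditioning] system]

print([count_binary_bracketings(n) for n in range(2, 7)])
# [1, 2, 5, 14, 42]
```

Even under this restricted view, a three-word description has two competing analyses and a five-word description has fourteen, which is why choosing among them without domain knowledge is hard.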

In a more traditional parsing approach, a clear distinction is drawn between the grammar, the lexicon and the parsing algorithm. In this case, either the grammar does not have sufficient coverage, and therefore some of the possible structures are missing (for example, few grammars would cater for the analysis “[[underfuselage] [off-centered] door]”), or it overgenerates, leaving the disambiguation problem open. Link Grammar presents the additional difficulty that it conflates grammar and lexicon. As all the grammatical information is coded within lexical entries, it is problematic to provide a general fix for the problem of missing analyses. This is one of the reasons why, in our more recent research (see Chapter 8, “A QA application for biomedical literature”, and Chapter 9, “Relation Mining over Biomedical Literature”), we are moving away from Link Grammar.
