Parsing - Knowledge mining over scientific literature and technical documentation


8.1.3 Parsing

The main innovation at the parsing level is the replacement of the Link Grammar parser with a new statistical broad-coverage parser (Pro3Gres). It is as fast as a probabilistic parser but more deep-linguistic: it delivers grammatical relation structures that are closer to predicate-argument structures than the linkages delivered by Link Grammar, and that can therefore be converted into MLFs more easily. The evaluation reported in [Schneider et al., 2004c] shows that it has state-of-the-art performance.

Figure 8.2 displays the three levels of analysis that are performed on a simple sentence. A process of expansion yields NP3 as a complete candidate description. However, NP1 and NP2 form two distinct, fully expanded noun phrase chunks. Their formation into a noun phrase with an embedded prepositional phrase is recovered from the parser’s syntactic relations, giving the maximally projected noun phrase involving a description: “Arginine methylation of STAT1” (or juxtaposed “STAT1 Arginine methylation”). Finally, the highest-level syntactic relations (subj and obj) identify a transitive predicate relation between these two candidate descriptions.

Figure 8.3 Example usage of the Bio-QA system

The usage of a deep-linguistic dependency parser partly simplifies the construction of MLF. First, the mapping between labeled dependencies and a surface semantic representation is often more direct than across a complex constituency subtree [Schneider, 2003b], and often more accurate [Johnson, 2002a]. Dedicated labels can directly express complex relations, and the lexical participants needed for the construction are more locally available.
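The directness of this mapping can be illustrated with a minimal sketch. It is not the Pro3Gres output format or the actual MLF construction procedure: the tuple layout, the `prep_mod` label, the helper names and the toy sentence (including the verb and object) are all assumptions made for illustration. Only the `subj` and `obj` labels and the phrase “Arginine methylation of STAT1” come from the text above.

```python
# Noun phrase chunks keyed by their head word (hypothetical toy input,
# not real parser output).
chunks = {
    "methylation": "Arginine methylation",
    "STAT1": "STAT1",
    "transcription": "transcription",
}

# Labeled dependencies as (label, head, dependent, preposition-or-None).
# "prep_mod" is an assumed label for a PP attached to a noun.
deps = [
    ("prep_mod", "methylation", "STAT1", "of"),
    ("subj", "modulates", "methylation", None),
    ("obj", "modulates", "transcription", None),
]


def project(head, chunks, deps):
    """Return the maximally projected noun phrase for `head`,
    attaching any prepositional modifiers recursively."""
    text = chunks[head]
    for label, h, dependent, prep in deps:
        if label == "prep_mod" and h == head:
            text += f" {prep} {project(dependent, chunks, deps)}"
    return text


def predicates(chunks, deps):
    """Pair the subj and obj of each verb into (predicate, arg1, arg2)."""
    args = {}
    for label, head, dependent, _ in deps:
        if label in ("subj", "obj"):
            args.setdefault(head, {})[label] = project(dependent, chunks, deps)
    return [
        (verb, a["subj"], a["obj"])
        for verb, a in args.items()
        if "subj" in a and "obj" in a
    ]


print(predicates(chunks, deps))
# [('modulates', 'Arginine methylation of STAT1', 'transcription')]
```

Each argument of the predicate is read off a single labeled edge, and the embedded prepositional phrase is recovered from one further edge, with no traversal of a constituency subtree.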

An example of interaction with the QA system can be seen in Figure 8.3.

8.2 Conclusion

In this chapter we have discussed our initial attempts at adapting our question answering systems to the biomedical domain. Although the resulting system is no more than a limited prototype, our activities and background research into this domain have motivated us to seek new paths of exploitation for the techniques of domain modeling that we had previously developed for the AMM manual.

In particular, we have come to realize that a reliable knowledge base is crucial for a satisfactory question answering system, and that the domain knowledge can be, at least in part, automatically derived from documents.

At the same time we have become aware of the limitations of the existing QA solution.

In our current research, briefly outlined in the next chapter, we are seeking ways to overcome these limitations. The long-term goal is a novel type of QA system that relies on a rich knowledge base which is not manually created but automatically derived from a sufficiently large collection of documents.

Relation Mining over Biomedical Literature

In the process of porting our QA system to the biomedical domain we realized that the amount of domain knowledge provided by domain descriptions and their taxonomic relationships is insufficient to offer a satisfactory QA experience. In particular, a large number of relations which exist among domain descriptions cannot be captured by the techniques that we have so far presented, but would instead require a different type of analysis. In this chapter we introduce techniques that we have recently adopted in order to extract other types of domain relations. While pursuing that goal, we have also come to realize that relation mining over biomedical literature is in itself a major research arena, which has since become the central focus of our activities.

In Section 9.1 we first discuss the role of deep parsing in relation mining, and in particular the benefits of dependency parsing. A precondition for further developing our approach is to be able to rely on a good-quality dependency parser or an annotated dependency corpus. In Section 9.2 we present our work in creating such a corpus, using an internally developed parser, discussed in Section 9.3. Finally, in Section 9.4 we present an evaluation of domain relations extracted from the dependency corpus.

9.1 Deep Parsing for Relation Mining

Full parsing, even of complex, highly technical language, is beginning to be possible due to recent developments in parsing technology. Still, as far as we know, few systems exist that show the feasibility of automated relation extraction directly from genomics scientific literature (for details see Section 7.1.2.2). In our research activities, we show that advanced parsing techniques combining statistics and human knowledge of linguistics have matured enough to be successfully applied in real settings.

As an intermediate step towards question answering for the biomedical domain, we aim at developing and refining methods for the discovery of interactions between biological entities (genes, proteins, pathways, etc.) from the scientific literature, based on a complete syntactic analysis of the articles, using a novel high-precision parsing approach [Schneider, 2003b]. This approach has numerous advantages in comparison to Link Grammar: greater processing speed, fewer erroneous parses, and no fallback into random mode when it encounters an unknown word.

The GENIA corpus (described in detail in the next section) provides a very interesting test bed for practical experimentation with relation extraction. Its main advantage is that it comes with manually annotated domain descriptions, and therefore does not require a separate extraction phase. We have used it in further experiments aimed at showing how relationships among domain entities can be extracted efficiently from a richly annotated corpus [Rinaldi et al., 2006a, Rinaldi et al., 2006b, Rinaldi et al., 2006c].¹ A similar application has been developed in collaboration with a biomedical company for a different corpus focusing on circadian rhythms of Arabidopsis thaliana [Rinaldi et al., 2007b].²

¹ A working prototype can be accessed online at <http://www.ontogene.org/>

Figure 9.1 Dependency Structure visualization (via SVG)
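Because the domain descriptions in such a corpus are already annotated, one plausible way to exploit them is to collapse each annotated multi-word term into a single token before syntactic analysis. The sketch below illustrates only that idea. It is not the GENIA annotation format or the procedure used in the cited experiments, and the function name, span encoding, class names and toy sentence are assumptions.

```python
def collapse_terms(tokens, spans):
    """Merge each annotated token span [start, end) into one token.

    `spans` maps (start, end) token offsets to a semantic class
    (hypothetical encoding). Returns (text, class_or_None) pairs.
    """
    starts = {start: (end, cls) for (start, end), cls in spans.items()}
    out, i = [], 0
    while i < len(tokens):
        if i in starts:
            end, cls = starts[i]
            out.append(("_".join(tokens[i:end]), cls))
            i = end  # skip past the whole annotated term
        else:
            out.append((tokens[i], None))
            i += 1
    return out


# Toy sentence and annotations, invented for illustration.
tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B", "activation"]
spans = {(0, 3): "other_name", (4, 6): "protein_molecule"}

for text, cls in collapse_terms(tokens, spans):
    print(text, cls)
```

With terms treated as atomic units, a downstream parser only has to attach one token per domain description, and any relation it finds already links annotated entities.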
