CHAPTER 3: STEM SENTENCES SELECTION VIA IE

3.1 Unsupervised Surface-based Patterns

3.1.5 Evaluation

3.1.5.1 Experimental data

We used the GENIA corpus as the domain corpus and the British National Corpus (BNC) as the general corpus. The GENIA corpus consists of 2,000 abstracts extracted from MEDLINE, containing 18,421 sentences. For the evaluation phase, we used the GENIA EVENT Annotation corpus29 (Kim et al., 2008). It consists of 1,000 MEDLINE abstracts similar to those in the GENIA corpus and has 9,372 sentences. The main difference between the two corpora is that in the GENIA EVENT corpus events are identified and annotated.

To handle the problem of data sparseness caused by the small size of the GENIA corpus, we developed a WEB corpus (consisting of 132,582 sentences) by collecting MEDLINE articles similar to those in the GENIA corpus from the National Library of Medicine30. The Web corpus was collected using a commercial web crawler, which implements a methodology for collecting a topical corpus similar to the one implemented in tools such as BootCat.31

We preferred the commercial crawler over BootCat for three reasons. First, it has an integrated term extractor: high-quality terms are automatically extracted from the pages being analysed and used to build further queries, whereas BootCat extracts single words. Second, it is fully automated, i.e. the extracted terms do not have to be revised manually after every iteration. Third, it queries multiple search engines (Google, Yahoo and Bing), so the crawling results are not biased towards any particular search engine. Because it uses a term extractor, the commercial crawler is better at crawling highly technical domains, which are best captured by multi-word terms. BootCat, in contrast, was primarily intended to collect language-specific, topic-independent corpora, for which single words are more suitable.

The crawl started from an original set of manually constructed queries built from the GENIA corpus: we manually defined several topical terms (named entities), e.g. protein and DNA, and combined them randomly to create the initial set of queries. The crawler collects web pages by making calls to several popular search engines, extracts topical terminology from the pages, selects the most promising topical terms to create new queries, and uses them to collect more web pages on the topic. It continued collecting web pages in this iterative manner until the desired corpus size was reached. The crawler strips boilerplate content (navigation menus, standard notices etc.) from each page, removes HTML tags, and detects and discards duplicate pages. The GENIA named entity tagger was then used for NER and PoS tagging. The quality of the collected corpus was evaluated using corpus homogeneity and similarity scores.
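The iterative query-bootstrapping procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not the commercial crawler's actual implementation: the function names, the query format (quoted terms joined by spaces), the default of three terms per query, and the injected `search`, `fetch_text` and `extract_terms` callables are all hypothetical stand-ins for the search-engine calls, the boilerplate/HTML stripping and the term extractor.

```python
import random
from typing import Callable, Dict, Iterable, List


def make_queries(terms: List[str], n_queries: int,
                 terms_per_query: int = 3, seed: int = 0) -> List[str]:
    """Combine topical terms at random into distinct search queries."""
    rng = random.Random(seed)
    combos = set()
    attempts = 0
    # Cap the attempts so a short term list cannot cause an endless loop.
    while len(combos) < n_queries and attempts < 50 * n_queries:
        combos.add(tuple(sorted(rng.sample(terms, terms_per_query))))
        attempts += 1
    return [" ".join(f'"{t}"' for t in combo) for combo in sorted(combos)]


def bootstrap_corpus(seed_terms: Iterable[str],
                     search: Callable[[str], Iterable[str]],
                     fetch_text: Callable[[str], str],
                     extract_terms: Callable[[List[str]], List[str]],
                     target_pages: int,
                     n_queries: int = 10,
                     terms_per_query: int = 3,
                     max_rounds: int = 20) -> Dict[str, str]:
    """Collect pages round by round until target_pages is reached.

    search(query) yields URLs, fetch_text(url) returns cleaned page text,
    and extract_terms(texts) returns topical terms for the next round.
    All three are hypothetical callables supplied by the caller.
    """
    pages: Dict[str, str] = {}   # url -> cleaned text
    seen_texts = set()           # used to discard exact duplicate pages
    terms = list(seed_terms)
    for round_no in range(max_rounds):
        if len(terms) < terms_per_query:
            break                # not enough terms to build a query
        for query in make_queries(terms, n_queries, terms_per_query, round_no):
            for url in search(query):
                if url in pages:
                    continue
                text = fetch_text(url)
                if not text or text in seen_texts:
                    continue
                seen_texts.add(text)
                pages[url] = text
                if len(pages) >= target_pages:
                    return pages
        # Re-seed the next round with terms extracted from the pages so far.
        terms = extract_terms(list(pages.values()))
    return pages
```

Passing the search, cleaning and term-extraction steps in as functions keeps the loop itself independent of any particular search engine or extractor; exact-match duplicate detection is the simplest possible choice and a real crawler would likely use near-duplicate detection instead.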

To ensure that the Web corpus is sufficiently on-topic, it is important to know how similar the two corpora are. Corpus similarity also plays a pivotal role when porting an NLP application from one domain with one corpus to another domain with a different corpus. Corpus similarity is a complex issue and there is no generally accepted method of measuring it. Kilgarriff (1997, 2001) and Kilgarriff and Rose (1998) argued that it is most important to determine the homogeneity of a corpus before computing its similarity to another corpus, as the judgement of similarity can become unreliable if a homogeneous corpus is compared with a heterogeneous one. Kilgarriff (1997) presented an overview of various approaches to corpus similarity and proposed a word frequency list approach to measuring corpus similarity and homogeneity. We used the Kilgarriff (1997) approach, as it is considerably easier to count words accurately than syntactic categories.

30 http://www.nlm.nih.gov/

To measure corpus homogeneity, we divided the corpus into two equal parts and produced a word frequency list for each sub-corpus by processing the text with the GENIA tagger and filtering out punctuation and stop words. We then took the 500 most frequent words from each sub-corpus and calculated the chi-square statistic for the difference between the two sub-corpora, as Kilgarriff and Rose (1998) and Kilgarriff (2001) showed that the chi-square statistic performs considerably better than other information-theoretic and statistical measures. To determine the similarity between two corpora, we likewise took the 500 most frequent words from each corpus and calculated the chi-square statistic for the difference between the two corpora. Low chi-square scores indicate homogeneous or highly similar corpora, while high scores indicate heterogeneous or dissimilar corpora.
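The computation just described can be sketched as below. This is a minimal illustration, not the exact procedure used in the experiments: the function names are hypothetical, tokens are assumed to be already produced by a tagger, the comparison vocabulary is assumed to be the union of the two top-500 lists, and the corpus is simply split at its midpoint. Expected counts follow the usual chi-square formulation, in which a word's combined frequency is distributed over the two corpora in proportion to their sizes.

```python
from collections import Counter
from typing import FrozenSet, List


def word_counts(tokens: List[str],
                stop_words: FrozenSet[str] = frozenset()) -> Counter:
    """Frequency list without stop words or pure-punctuation tokens."""
    return Counter(
        t.lower() for t in tokens
        if any(ch.isalnum() for ch in t) and t.lower() not in stop_words
    )


def chi_square(counts_a: Counter, counts_b: Counter, n: int = 500) -> float:
    """Chi-square score over the n most frequent words of each corpus."""
    size_a, size_b = sum(counts_a.values()), sum(counts_b.values())
    if size_a == 0 or size_b == 0:
        raise ValueError("both corpora must contain at least one word")
    total = size_a + size_b
    # Assumption: compare over the union of the two top-n lists.
    vocab = {w for w, _ in counts_a.most_common(n)}
    vocab |= {w for w, _ in counts_b.most_common(n)}
    score = 0.0
    for w in vocab:
        obs_a, obs_b = counts_a[w], counts_b[w]
        # Expected counts if the word were spread evenly across both corpora.
        exp_a = size_a * (obs_a + obs_b) / total
        exp_b = size_b * (obs_a + obs_b) / total
        score += (obs_a - exp_a) ** 2 / exp_a + (obs_b - exp_b) ** 2 / exp_b
    return score


def homogeneity(tokens: List[str],
                stop_words: FrozenSet[str] = frozenset(),
                n: int = 500) -> float:
    """Chi-square score between the two halves of a single corpus."""
    half = len(tokens) // 2
    return chi_square(word_counts(tokens[:half], stop_words),
                      word_counts(tokens[half:], stop_words), n)
```

Calling `chi_square` on the frequency lists of two different corpora gives a similarity score in the same sense, while `homogeneity` applies it to the two halves of one corpus. Because the score is not normalised, it grows with corpus size, so scores are only directly comparable for corpora of similar size.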

Corpus         Chi-Score

GENIA          1379.693
GENIA EVENT    2364.577
WEB            14750.369
BNC            20872371.995

Table 9: Homogeneity scores of corpora

Table 9 shows the homogeneity score between the two sub-corpora of each corpus used in the experiment. The GENIA and GENIA EVENT corpora achieve quite low scores, which shows that both corpora are homogeneous. This is unsurprising, as both were compiled by hand to ensure topic relevance and are generally accepted as benchmark biomedical corpora. The WEB and BNC scores show that these two corpora are more heterogeneous. The BNC exhibits the greatest heterogeneity, which is explained by the fact that it is meant to cover the broadest possible range of domains in general British English. The WEB corpus is much more homogeneous than the BNC, but its chi-square score is still several times greater than those of the GENIA corpora (roughly ten times that of GENIA), reflecting the fact that automatic web collection methods are still incapable of ensuring the same level of topic relevance as manually compiled corpora.

In the next step, we calculated the similarity scores between these corpora using the chi-square statistic. Table 10 shows that the GENIA and GENIA EVENT corpora are quite similar to each other, whereas the high scores for all other corpus pairs indicate that they are quite dissimilar.

               GENIA EVENT    WEB           BNC

GENIA          2137.63        173207.002    23686564.063
GENIA EVENT                   136568.630    23008298.781
WEB                                         28068572.14

Table 10: Similarity scores of corpora

As mentioned earlier, the BNC is a heterogeneous corpus, and this is reflected here in its high chi-square scores against every other corpus. The WEB corpus also scores quite high against the manually compiled GENIA and GENIA EVENT corpora, in line with its higher homogeneity score.

We collected the Web corpus to attain higher recall in our experiments, but as the homogeneity and similarity scores show (Tables 9 and 10), the Web corpus is neither homogeneous nor similar to the GENIA or GENIA EVENT corpus. One possible reason is that GENIA is a very narrow-domain corpus, for which it is hard to collect relevant topical documents automatically.

3.1.5.2 Evaluation method

In order to evaluate the quality of the extracted patterns, we examined their ability to capture pairs of related named entities in the manually annotated evaluation corpus, without recognising the types of the semantic relations. Selecting a certain number of best-ranking patterns, we measured precision, recall and F-score.
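As an illustration of these measures, the sketch below scores a set of extracted entity pairs against a set of annotated pairs. It assumes, hypothetically, that each related pair is represented as a tuple of two entity identifiers and that the F-score is the balanced harmonic mean of precision and recall; the function name is likewise illustrative.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f(extracted: Iterable[Hashable],
                       gold: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-score of extracted pairs.

    extracted: pairs captured by the selected best-ranking patterns.
    gold: pairs of related named entities in the annotated corpus.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

For example, with three extracted pairs of which two appear among four annotated pairs, precision is 2/3, recall is 1/2 and the F-score is 4/7.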

To test the statistical significance of differences between the results of different methods and configurations, we used a paired t-test. We randomly divided the evaluation corpus (GENIA EVENT Annotation corpus) into 20 subsets of equal size, each containing 461 sentences on average. We collected precision, recall and F-score for each of these subsets and then applied the paired t-test to assess the significance of the differences between surface pattern types and between ranking methods under the score-thresholding measure.
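The procedure can be sketched as follows, using only the Python standard library. The function names and the round-robin split after shuffling are assumptions made for illustration. The sketch returns the t statistic and the degrees of freedom rather than a p-value; with 20 subsets there are 19 degrees of freedom, for which the two-tailed critical value at the 0.05 level is 2.093.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def split_into_subsets(items: Sequence, k: int = 20, seed: int = 0) -> List[list]:
    """Shuffle the items and deal them into k subsets of near-equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def paired_t(scores_a: Sequence[float],
             scores_b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for two score lists.

    scores_a[i] and scores_b[i] must come from the same subset i,
    e.g. the F-scores of two ranking methods on that subset.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 scores")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

A difference between two methods would be judged significant at the 0.05 level when the absolute value of the returned t exceeds the critical value for the returned degrees of freedom. Where SciPy is available, `scipy.stats.ttest_rel` performs the same test and also reports the p-value.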