6.1 ANSWER ENTITY EXTRACTION WITHOUT CONTEXTS
6.1.1 Answer Entity Extraction with Named Entity Recognition Tools
Most participants in the TREC 2009 entity retrieval task applied named entity recognition (NER) tools for answer entity extraction. The TREPM model can be interpreted as answer entity extraction without contexts, i.e., p(e|d, q, t). In practice, the Stanford NER tool is the most popular tool for this task. Unfortunately, it can only identify named entities of the types person, organization, and location, but not products. Therefore, teams such as [Zheng et al., 2009] treated proper nouns as candidate product entities. Teams such as [Yang et al., 2009] and [Serdyukov and de Vries, 2009] used an external knowledge base (e.g., Wikipedia) to train a named entity tool for products. Similar approaches were taken by [Vydiswaran et al., 2009] and [McCreadie et al., 2009], relying on a dictionary of company names and a pre-defined set of patterns for product recognition. Most of these researchers applied further entity re-ranking, since the results taken directly from named entity recognition are not promising. Wu and Kashioka specifically evaluated the re-ranking process by calculating the similarities between the input query, support snippets, and related entities [Wu and Kashioka, 2009].
We follow the same idea for answer entity extraction without contexts, i.e., we use a named entity recognition (NER) tool for extraction. The research question is whether NER tools can extract the answer entities from germane documents. A special parser is designed for Wikipedia pages, because we expect that a better parser for pre-processing the webpages can improve the extraction. The experiment therefore also tests whether the HTML parser affects the results of named entity extraction. The answer entities for a topic should take into account all extractions from the germane documents of that topic. We therefore compare the results before and after this aggregation in order to evaluate whether entities extracted from one document and entities extracted from multiple documents can complement each other. Moreover, the evaluation results are reported by entity type (products, persons, and organizations) and by page type (Web pages and Wikipedia pages) in order to test whether these factors affect answer entity extraction.
The experiment is based on the TREC 2009 entity retrieval task. All germane documents are pre-processed into plain text by removing all tags from the HTML pages. The Stanford NER tool identifies organization and person entities, and a noun phrase extractor extracts noun phrases as candidate products. Three groups of experiments are evaluated.
• Experiment 1: Stanford NER extracts the entities from the germane documents. The top 10 results are evaluated.
• Experiment 2: Answer entities are extracted from Wikipedia germane documents that have been pre-processed with the special Wikipedia parser. Wikipedia pages contain a lot of non-relevant content, e.g., category information at the bottom and language links on the left, so a simple parser removes the header and footer parts of each page to reduce this noise. The experiment evaluates whether removing this noise improves the results significantly. The top 10 results are evaluated.
• Experiment 3: After the answer entities have been extracted from every germane document, the algorithm summarizes the results by topic. Whereas the previous two experiments evaluate the extracted answer entities per document, this experiment aggregates the entities across all documents within the same topic. It evaluates whether the answers found in different documents for the same topic can complement each other. The top 10 results are evaluated.
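The difference between the per-document evaluation (Experiments 1 and 2) and the per-topic evaluation (Experiment 3) can be illustrated with a minimal sketch. The text does not state how candidate entities are scored, so ranking by mention frequency is an assumption made here for illustration only; the function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def top_k_per_document(doc_entities, k=10):
    """Rank the entity mentions of ONE document by frequency (assumed score)."""
    counts = Counter(doc_entities)
    return [entity for entity, _ in counts.most_common(k)]


def top_k_per_topic(docs_by_topic, k=10):
    """Pool the mentions of all documents of a topic, then rank by frequency."""
    pooled = defaultdict(Counter)
    for topic, docs in docs_by_topic.items():
        for doc_entities in docs:
            pooled[topic].update(doc_entities)
    return {
        topic: [entity for entity, _ in counts.most_common(k)]
        for topic, counts in pooled.items()
    }


# Toy data: one topic with two germane documents (hypothetical entities).
docs_by_topic = {
    "topic-1": [
        ["Acme", "Acme", "Bolt"],
        ["Acme", "Crane", "Crane", "Crane"],
    ]
}
per_topic = top_k_per_topic(docs_by_topic, k=2)
```

With k=2, "Bolt" appears in the top results of the first document but falls out of the pooled topic ranking, because entities mentioned more often across documents outrank it. Under the frequency assumption, this is the kind of effect that can raise precision while lowering recall.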
The precision, recall, and F-measure results are shown in Table 13. When the answer entities are aggregated within the same topic, the precision and the F-measure improve significantly, from 0.103 to 0.17 and from 0.144 to 0.16 respectively (two-tailed t-test, p < 0.001), but the recall drops from 0.419 to 0.37 (Experiment 1 vs. Experiment 3). This result indicates that answer entities from different documents for the same topic can complement each other and improve the precision, which in turn improves the overall performance (F-measure). The answer entities extracted from different documents corroborate each other and improve the accuracy of the extraction. However, the recall drops because merging the results from different documents by topic removes some rare but relevant answer entities with low scores. Further work is therefore needed to improve the extraction of answer entities that occur rarely.

Table 13: Results of named entity recognition tools for answer entity extraction

                Precision   Recall   F-measure
Experiment 1: baseline, evaluated by documents
Overall         0.1030      0.4190   0.1440
Product         0.0120      0.2959   0.0230
Person          0.2480      0.5460   0.3370
Organization    0.0770      0.4110   0.1110
Web page        0.1148      0.3693   0.1551
Wiki page       0.0830      0.5204   0.1269
Experiment 2: special parser for Wikipedia, evaluated by documents
Overall         0.1083      0.4400   0.1500
Product         0.0127      0.3639   0.0240
Person          0.2588      0.5463   0.3454
Organization    0.0829      0.4241   0.1179
Web page        0.1148      0.3693   0.1551
Wiki page       0.0982      0.5501   0.1426
Experiment 3: evaluated by topics
Overall         0.1700      0.3700   0.1600
Product         0.0293      0.3163   0.0495
Person          0.4449      0.4236   0.3555
Organization    0.1117      0.3646   0.1237
Web page        0.2060      0.2783   0.1704
Wiki page       0.1055      0.5275   0.1439
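The evaluation measures can be sketched as follows. The reported F-measures are not the harmonic mean of the reported precision and recall (for Experiment 1, 0.1030 and 0.4190 would give about 0.165 rather than 0.1440). That is consistent with the scores being computed per document or per topic and then averaged. The averaging scheme and the exact-string matching used below are assumptions, not details stated in the text.

```python
def precision_recall_f1(extracted, gold):
    """Score one evaluation unit (a document or a topic) by exact string match."""
    extracted_set, gold_set = set(extracted), set(gold)
    hits = len(extracted_set & gold_set)
    precision = hits / len(extracted_set) if extracted_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def macro_average(scores):
    """Average precision, recall, and F-measure separately over all units."""
    n = len(scores)
    return tuple(sum(score[i] for score in scores) / n for i in range(3))


# Two hypothetical evaluation units: (extracted top-k entities, gold answers).
units = [
    (["a", "b", "c", "d"], ["a"]),
    (["x", "p"], ["p", "q", "r", "s"]),
]
scores = [precision_recall_f1(extracted, gold) for extracted, gold in units]
avg_p, avg_r, avg_f = macro_average(scores)

# F-measure computed from the averaged precision and recall, for comparison.
f_of_averages = 2 * avg_p * avg_r / (avg_p + avg_r)
```

In this toy example the averaged F-measure is lower than the F-measure computed from the averaged precision and recall, so the two quantities should not be expected to coincide in a table of averaged scores.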
With the special Wikipedia parser removing some of the noise, the results for Wikipedia pages improve significantly (two-tailed t-test, p < 0.001). Precision rises from 0.08 in Experiment 1 to 0.10 in Experiment 2, recall rises from 0.52 to 0.55, and F-measure rises from 0.13 to 0.14. This means that narrowing down the context and removing the noise does help to improve the results. However, the results for Web pages (F-measure of 0.1551 in Experiment 1 and 0.1704 in Experiment 3) are better than those for Wikipedia pages (F-measure of 0.1269 in Experiment 1 and 0.1439 in Experiment 3). In particular, the precision for Web pages is higher in Experiment 1 (0.11 vs. 0.08) and in Experiment 3 (0.21 vs. 0.11), but the recall is lower in Experiment 1 (0.37 vs. 0.52) and in Experiment 3 (0.28 vs. 0.53). The reason is that the Wikipedia germane documents cover more information than the Web page germane documents: they bring in more answer entities, which improves recall, but also more noise, which lowers precision.
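The kind of header and footer removal evaluated in Experiment 2 could look like the following minimal sketch, which keeps only the text inside one main-content element. The text does not describe how its parser locates the header and footer, so selecting the content by the element id `bodyContent` is an assumption, as are the class and function names. The depth counter also assumes that non-void tags are properly closed.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag and must not affect the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MainContentText(HTMLParser):
    """Collect text only from inside the element with the given id."""

    def __init__(self, content_id="bodyContent"):
        super().__init__()
        self.content_id = content_id
        self.depth = 0      # > 0 while inside the main-content element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.content_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_main_text(html, content_id="bodyContent"):
    """Return whitespace-normalized text of the main-content element."""
    parser = MainContentText(content_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


page = (
    '<html><body><div id="header">Languages Deutsch</div>'
    '<div id="bodyContent"><p>Acme makes <b>Bolt</b>.<br></p></div>'
    '<div id="footer">Categories: Companies</div></body></html>'
)
main_text = extract_main_text(page)
```

Only the text of the main-content element reaches the NER step in this sketch; the language links and category lines outside it are dropped before extraction.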
Comparing the performance of the three experiments by entity type, the results indicate that the extraction of organizations and persons (extracted directly by the NER tool) is significantly better than the extraction of products from noun phrases. This means that named entity tools are critical in this step. Treating noun phrases as products brings in too much noise (the precision for products is only 0.01). For organizations, even with trained data and extraction rules, the precision is still very low (0.08). Therefore, further work is needed to investigate answer entity extraction for entity types that the named entity recognizer cannot identify, such as products.
The precision of the NER method is 0.17 (overall precision in Experiment 3), which needs to be improved further. In the next section, we use a knowledge base method to improve the extraction precision by filtering the candidate entities with entity categories from knowledge bases. The recall of the NER method is less than 0.5, which means that it misses more than half of the relevant answer entities. In the next section, we therefore also extract more answer entities from the knowledge base, which is independent of the corpus, to improve the recall of the extraction.