

6.5 EXPERIMENTS ON TREPM MODEL

6.5.3 Topic Analysis on TREC entity retrieval

This dissertation addresses the general entity retrieval task, although the evaluation is based on the TREC entity retrieval task. The current TREC entity retrieval task operates on Web pages and asks general-domain questions, so it can still represent the general entity retrieval task. Furthermore, because the current TREPM model is domain independent, all of the methods and the model can be transferred to specialized domains, such as the medical domain.

In reviewing the TREC 2009 topics, we identify two types of topics according to whether the answers exist in only one document. The first type asks for general knowledge or information, e.g., “products of MedImmune Inc.” Answers for this type of topic can appear repeatedly across multiple germane documents. They are either scattered across several germane documents, so that the answers must be combined in a complementary way, e.g., “carriers that Blackberry makes phones for”, or they appear together in some documents, so that they must be extracted as a whole, e.g., “products of MedImmune Inc.” The second type asks questions whose answers exist in only one document, e.g., “students of Claire Cardie” or “Donors to the Home Depot Foundation.” This type of topic is sensitive to both germane document identification and answer entity extraction: if the system fails to detect the germane document, or fails to detect the answer entities in it, it cannot collect the answer entities at all. This type of topic is therefore harder than the first.

Seven of the twenty topics in the TREC 2009 entity retrieval task have answers that exist uniquely on the Web.

• For topics such as “Students of Claire Cardie”, the answer hints can be found only on the topic entity’s homepage, e.g., that of “Claire Cardie”. If those Web pages have been removed, or if they consist of PDF files or photos, the answer entity extraction component will have difficulty detecting the answer entities. This case is therefore critical for germane document identification.

• The topic of “Chefs with a show on the Food Network” has its unique answers on the network’s website, in the TV show schedule. However, the table structure is represented with embedded HTML lists, so the current extraction method cannot fully extract all answer entities.

• The topic of “Winners of the ACM Athena award” has its answers on the ACM Web page. The difficulty for this topic is that the winners’ names are mixed with those of other awards. Detecting the relevant tables within a germane document is therefore a challenging task.

• The topic of “Authors awarded an Anthony Award at Bouchercon in 2007” is hard because of the year constraint in the query: the answers must be for exactly the year 2007. The answer entity extraction component should therefore distinguish these answers from those for other years.

• The topic of “Sponsors of the Mancuso quilt festivals” has its unique answer set on the festivals’ website. In particular, to protect the sponsors from being maliciously crawled, they are embedded in the Web page as logo images. Therefore, although the germane document is easy to detect, the answer entities are still hard to extract. A similar case is the topic of “Donors to the Home Depot Foundation.”

Thirteen of the twenty topics in the TREC 2009 entity retrieval task ask questions whose answers exist in multiple documents. For example, for the topic of “carriers that Blackberry makes phones for,” the answers are scattered across multiple documents, and the system must crawl them and combine them into the answer set.

• The topic of “professional sports teams in Philadelphia” has its answers in multiple documents. Some documents cover the whole answer set, while others do not. Some answers appear in sentences, while others appear in tables or lists. Similar cases include “products of MedImmune Inc”, “Scotch whisky distilleries on the island of Islay”, “Campuses of Indiana University”, “Members of the band Jefferson Airplane”, “CDs released by the King’s Singers”, “Airlines that currently use Boeing 747 planes”, “Members of The Beaux Arts Trio”, and “Airlines that Air Canada has code share flights with”.

• The topic of “organizations that award Nobel prizes” can easily be confused with the topic of “organizations awarded Nobel prizes”. Therefore, we use the topic entity “Nobel prizes” as the query to find the germane documents for answer entity extraction.

• The topic of “Journals published by the AVMA” includes the abbreviation “AVMA” for “American Veterinary Medical Association”. A similar case is the topic of “Universities that are members of the SEC conference for football”.

• The topic of “Companies that John Hennessy serves on the board of” has its answers scattered across multiple documents. Moreover, no Web page obviously indicates this information, which also makes it a hard topic.

6.6 SUMMARY

This chapter examined answer entity extraction, whose goal is to identify answer entities in germane documents effectively for the entity retrieval task. We considered several approaches to entity extraction: named entity recognition tools, knowledge base (Wikipedia) extraction and entity filtering, table/list extraction, bootstrapping methods, and classification methods.

Named entity recognition (NER) tools used for answer entity extraction work only on grammatical sentences. They treat documents as plain text, so a corpus containing noisy Web pages must be preprocessed by removing the HTML tags. This preprocessing generates many non-grammatical sentences in the corpus, which causes extraction errors. For example, many entities are listed as items in a Web page, and simple parsing has difficulty extracting the answer entities for a query from the germane document. This is why the recall of NER entity extraction is relatively high (about 0.4 on extraction from germane documents), while the precision (about 0.1) and the F-measure (about 0.1) are low. Moreover, this method depends on whether the NER tool can identify the entity type; if it cannot, it fails to extract the entities. For example, the extraction results for the product entity type are worse than those for the person and organization entity types.
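As a reminder of how the measures quoted above relate to one another, the following minimal sketch computes precision, recall, and F-measure for a single topic. The function name and the toy data are assumptions made for illustration; the values are chosen only to mirror the pattern described (recall well above precision) and are not the reported results, which are presumably averaged over topics.

```python
def precision_recall_f1(extracted, gold):
    """Return (precision, recall, F1) for one topic's extracted entity set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical topic: 10 gold answers; 40 candidates extracted, 4 of them correct.
gold = {f"answer_{i}" for i in range(10)}
extracted = {f"answer_{i}" for i in range(4)} | {f"noise_{i}" for i in range(36)}

p, r, f = precision_recall_f1(extracted, gold)
print(round(p, 2), round(r, 2), round(f, 2))  # prints 0.1 0.4 0.16
```

Because the F-measure is a harmonic mean, it stays close to the smaller of the two values, which is why many noisy candidates pull it down even when recall is acceptable.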

A knowledge base was used in two ways to improve the precision and recall of answer entity extraction. One is to mine the answer entities from a knowledge base (e.g., Wikipedia Infobox). The other is to use the knowledge base to filter out non-relevant entities. The results of knowledge base answer entity extraction show that this approach extracts entities with high accuracy, but only for a small set of topics. Knowledge base filtering can significantly improve the accuracy of answer entity extraction. However, both methods are limited by the knowledge base and by how entities are represented in it.
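The filtering idea can be sketched as follows. This is an illustrative simplification and not the implementation used in the experiments: the knowledge base is assumed to be a plain dictionary from normalized entity names to entity types, and `normalize`, `filter_by_knowledge_base`, and all entries except FluMist are hypothetical.

```python
def normalize(name):
    """Lowercase and collapse whitespace so that lookups tolerate formatting noise."""
    return " ".join(name.lower().split())


def filter_by_knowledge_base(candidates, kb_types, target_type):
    """Keep only candidates that the knowledge base lists with the target type."""
    return [c for c in candidates if kb_types.get(normalize(c)) == target_type]


# Toy knowledge base (assumed for illustration): normalized name -> entity type.
kb_types = {
    "flumist": "product",
    "example city": "location",
}

candidates = ["FluMist", "Contact Us", "Example City", "ExampleVax"]
print(filter_by_knowledge_base(candidates, kb_types, "product"))  # prints ['FluMist']
```

The sketch also shows the limitation noted above: "ExampleVax" is dropped even if it were a correct answer, because the knowledge base has no entry for it.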

Tables and lists are considered as symbolic contexts for entity extraction. Our analysis of entity contexts shows that most entities appear in tables or lists. Therefore, an algorithm for extracting entities from tables and lists on the Web was investigated and implemented for answer entity extraction. The results show that this approach is more accurate than the NER system, but it finds only about 30% of the entities. One reason is that some answers are in other media, such as images or PDF files. The complicated representation of tables and lists in Web pages is another reason for extraction failures.
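A minimal sketch of table/list candidate extraction is given below, using only Python's standard `html.parser` module. It is not the algorithm implemented in this work; the class and function names and the sample markup are assumptions. Each `<li>`, `<td>`, and `<th>` element contributes its own text as one candidate, and a stack keeps the text of a nested list separate from that of its parent item.

```python
from html.parser import HTMLParser


class CellTextExtractor(HTMLParser):
    """Collect the text of each <li>, <td>, and <th> element as a candidate entity."""

    CELL_TAGS = {"li", "td", "th"}

    def __init__(self):
        super().__init__()
        self.stack = []        # one text buffer per currently open cell
        self.candidates = []   # finished cells, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in self.CELL_TAGS:
            self.stack.append([])

    def handle_data(self, data):
        # Text belongs to the innermost open cell only.
        if self.stack:
            self.stack[-1].append(data)

    def handle_endtag(self, tag):
        if tag in self.CELL_TAGS and self.stack:
            text = " ".join("".join(self.stack.pop()).split())
            if text:
                self.candidates.append(text)


def extract_candidates(html):
    parser = CellTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.candidates


# Hypothetical schedule page: a nested list followed by a small table.
page = (
    "<ul><li>Monday<ul><li>Chef A</li><li>Chef B</li></ul></li></ul>"
    "<table><tr><th>Host</th></tr><tr><td>Chef C</td></tr></table>"
)
print(extract_candidates(page))
# prints ['Chef A', 'Chef B', 'Monday', 'Host', 'Chef C']
```

The sketch assumes that every cell has an explicit closing tag, which real Web pages often omit, and it returns nothing for entities shown only as images, consistent with the failure cases described above. Headers such as "Monday" and "Host" are returned as candidates too, so a later filtering step would still be needed.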

A semi-supervised learning method, bootstrapping, is considered as a syntactic context for answer entity extraction. The experiment shows that this approach can achieve high recall for some topics, but it depends heavily on the entity seeds and patterns. In this experiment, the method could extract answers for only one topic (out of 20). In future work, I will investigate the impact of more seeds and better patterns on the extraction.
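The dependence on seeds and patterns can be illustrated with a deliberately simple bootstrapping loop. This is a sketch under strong assumptions and not the method evaluated here: a pattern is taken to be just the two words preceding a known entity, a candidate is any capitalized phrase following a pattern, and the sentences and entity names are invented.

```python
import re

# A run of capitalized tokens, e.g. "Beta Wireless" (a simplifying assumption).
CAP_PHRASE = r"([A-Z][\w&'-]*(?:\s+[A-Z][\w&'-]*)*)"


def induce_patterns(sentences, entities, window=2):
    """Use the words immediately left of each known entity as a pattern."""
    patterns = set()
    for sentence in sentences:
        for entity in entities:
            pos = sentence.find(entity)
            if pos > 0:
                left = " ".join(sentence[:pos].split()[-window:])
                if left:
                    patterns.add(left)
    return patterns


def apply_patterns(sentences, patterns):
    """Extract every capitalized phrase that follows one of the patterns."""
    found = set()
    for sentence in sentences:
        for left in patterns:
            regex = re.escape(left) + r"\s+" + CAP_PHRASE
            for match in re.finditer(regex, sentence):
                found.add(match.group(1))
    return found


def bootstrap(sentences, seeds, rounds=2):
    """Alternate pattern induction and extraction until nothing new is found."""
    entities = set(seeds)
    for _ in range(rounds):
        patterns = induce_patterns(sentences, entities)
        new = apply_patterns(sentences, patterns) - entities
        if not new:
            break
        entities |= new
    return entities


# Invented sentences and entity names, for illustration only.
sentences = [
    "The phone is sold by Alpha Mobile.",
    "The phone is sold by Beta Wireless in Europe.",
    "Gamma Telecom announced a new plan.",
]
print(sorted(bootstrap(sentences, {"Alpha Mobile"})))
# prints ['Alpha Mobile', 'Beta Wireless']
```

"Gamma Telecom" is never found because no seed shares its context, which is the kind of seed and pattern sensitivity noted above.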

To compensate for the weaknesses of the above methods, we treated entity extraction as a binary classification problem, using the extraction results of the above methods as features. An experiment was conducted to compare this method with the other answer entity extraction methods. The results indicate that it is significantly better than each individual extraction method. However, because of the low recall of the above extraction methods, the learning-based method could find only half of the answer entities. The reason for the low recall is that the current system treats only noun phrases as candidate answer entities. It therefore misses some answer entities with special characters, such as FluMist®. In the future, more methods should be introduced to