GERMANE DOCUMENT IDENTIFICATION AS CONVENTIONAL DOC-

DOCUMENT RETRIEVAL

If we assume the target entity type t is independent with the query q and the document d, then germane document identification is equal to conventional document search, i.e., p(d|q, t) = p(d|q). Therefore, the first approach treats germane document identification as a conventional document retrieval problem. That is, relevant documents are the same as germane documents with the assumption that these documents contain entities for further

extractions.

In previous studies, various document retrieval methods are applied in germane document identification. Some teams use BM25 for retrieval [Zhai et al., 2009]. More others use language model, such as [Wu and Kashioka, 2009]. Fang applied a language model on the structured retrieval—on document, passage and entity level—to find germane documents [Fang et al., 2009]; McCreadie applied the same idea of structure retrieval but on webpage title and body level[McCreadie et al., 2009]; Zheng used the language model on document and snippet (50-word window size) level [Zheng et al., 2009]. Some other teams consider the query constructions to refine the queries representing users’ information needs. For example, Vydiswaran tried to identify the information need (the narrative part of topic) as a structured query which was represented as a relation including a relation description, an entity of focus, and an entity of interest [Vydiswaran et al., 2009]. Yang also did some query re-constructions by adding the synonym of topic entities into the query for searches [Yang et al., 2009].

In our annotation processes, the assessors find the germane documents for all 20 TREC 2009 topics with some proper queries on search engines. Therefore, we consider to simulate this process: generating the proper queries from the narratives or topics and applying these queries on the search engines, i.e., treating germane document identification as a conventional document retrieval task.

This experiment is based on 20 topics from the TREC 2009 entity retrieval task. Yahoo search engine is used as an indexing and searching system for the corpus. The queries are from the narrative parts of topics. The experiment tests whether the queries from topic narratives can find the germane documents and how to choose the retrieved documents as the germane documents.

The experiment methodology is as follows: the queries generated from the topic narrative part are issued to the Yahoo!Boss search engine API; the top 100 results from Yahoo are evaluated; the performance is evaluated by precision, recall and F-measure. The results are as in Table 6.

According to the F-measure (F@2=0.22), the top two documents are the most valuable germane documents. Although the precision at 1 or 2 is 0.2, this approach can only find

Table 6: Results of germane document identification as conventional document retrieval Rank P R F Top P R F 100 0.008 0.416667 0.015668 10 0.06 0.3 0.098834 90 0.008333 0.391667 0.016281 9 0.066667 0.3 0.107727 80 0.009375 0.391667 0.018264 8 0.075 0.3 0.118384 70 0.010714 0.391667 0.020796 7 0.085714 0.3 0.131389 60 0.0125 0.391667 0.024144 6 0.1 0.3 0.147619 50 0.015 0.391667 0.028776 5 0.1 0.258333 0.141667 40 0.01875 0.391667 0.035609 4 0.125 0.258333 0.165238 30 0.025 0.391667 0.046698 3 0.166667 0.258333 0.198333 20 0.0375 0.391667 0.067824 2 0.225 0.233333 0.223333 1 0.25 0.116667 0.158333

a small part of germane documents with regards to the recall at 100 (recall@100=0.41). That means this approach misses more than half of germane documents. Another finding is that almost half of topics have at least a Wikipedia page as the germane document, which means Wikipedia is a good external source for extracting answer entities. Although germane documents can be found for most topics according to the annotators, the number of answer entities for each topic is various. Therefore, it is not proper to use a simple threshold to cut the number of germane documents for each topic.

Further analyses on the query generation are conducted on the TREC 2009 topics. Two kinds of queries can be generated for germane document identification. The first are queries generated by topic entities (e.g., MedImmune, Inc), and the second are queries generated by descriptions (e.g., products of MedImmune, Inc).

Generating queries from descriptions is the most intuitive way. Because the description part provides more information than the topic entity, it is prone to finding documents with answer entities. There are 9 out of 20 cases in the TREC 2009 data sets.

queries, because the descriptions hurt search results. For example, for the topic of organizations that award Nobel prizes, the description (e.g., organizations that award) for topic entity (e.g., Nobel prizes) can cause the error results from the similar concepts (e.g., Nobel prize awarded organizations), with the assumption of bag-of-words retrieval system, especially when there is no special pages that discuss about the target entities. There are 4 out of 20 cases in the TREC 2009 data sets.

Generating queries from descriptions or topic entities means there are no differ- ences in two kinds of queries because the descriptions fail to provide the additional information than topic entities. A typical case is that the descriptions of topics are the entity type requirements for answer entities, which usually do not appear in the documents, e.g.“students of Claire Cardie”. The “students” is the entity type, which is hard to bring value in the search, especially when there is no special webpage indicating this entity. There are 7 out of 20 cases in the TREC 2009 sets.

Except for the above cases, there are some topics failing at correctly representing the relations between topic entities and answer entities. For example, for the topic of what are some of the spin-off companies from the University of Michigan, the representation of the relation between the topic entity (i.e., the University of Michigan) and the answer entities is “spin-off companies”, which will be more effective if we use their synonyms of “spun of/from/of from”.

These analyses also confirm that germane document identification can only deal with term co-occurrence problems since the retrieval model is built on the assumption of the bag- of-words model(i.e., the independence of terms in a document). Answer entity extraction is an important complement for germane document identification in the entity retrieval task.

In document Searching for Entities: When Retrieval Meets Extraction (Page 64-67)