Table/List Extractions from Knowledge Base

6.2 SYMBOLIC CONTEXTS: TABLE/LIST EXTRACTION

6.2.3 Table/List Extractions from Knowledge Base

A knowledge base can be independent of the original corpus, so extraction from a knowledge base relies only on the knowledge base itself. A knowledge base also organizes its information in tables or lists, but in a more standardized way. Although tables in a knowledge base can be processed with the extraction method described in the previous section, extraction can be more precise here because the knowledge base makes the relation context explicit. The relation context consists of the relation r, the associated topic entity e1, and the answer entity e2. In terms of relation contexts, the query can be interpreted as the topic entity eq1 and the relation rq. Therefore, under the TREPM model, answer entity extraction with relation contexts can be represented as follows:

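\[
\begin{aligned}
p(e \mid d, q, t) &= \sum_{c} p(c \mid d, q, t)\, p(e \mid c, d, q, t) \\
&\approx \sum_{e_1, r, e_2} p(e_1, r, e_2 \mid d, r_q, e_{q1})\, p(e_{q2} \mid e_1, r, e_2, d, r_q, e_{q1}) \\
&\approx \sum_{e_1 = e_{q1},\, r = r_q,\, e_2} p(e_1, t_1, r, e_2 \mid d, e_1, r, e_2)\, p(e_{q2} \mid e_1, r, e_2, d)
\end{aligned}
\]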

Figure 9: A sample of Infobox

Wikipedia Infobox is one such knowledge base and is used here to demonstrate the extraction process. Extraction from it yields high-accuracy answer entities and, because it is not limited by the corpus, increases the recall of answer entity extraction. Figure 9 is a sample of a Wikipedia Infobox.

Noisy knowledge is one of the problems in using knowledge bases for answer entity extraction. Wu and Weld show that a knowledge base like Infobox needs to be cleaned further before extraction [Wu and Weld, 2008]. In the pilot study of extracting company-product pairs from Infobox, in two of the twenty company cases (10%) the "product" field of the Infobox page contains links to other pages instead of the product information itself. Another problem is the incompleteness of the knowledge. For example, three of the twenty company pages (15%) contain no product information: one has no product field in its Infobox, and the other two have no Infobox at all.

Because of the incompleteness and complexity of the knowledge base, the algorithm for answer entity extraction from a knowledge base proceeds in three steps. First, it has to detect the related topic entity. For example, for the query products of MedImmune, Inc., the algorithm needs to find the correct entry for the topic entity, i.e., MedImmune Inc. This implements the detection of e1 = eq1 in the formula. Second, among the attributes of the topic entity, the algorithm identifies whether the entity has attributes associated with the queried relation, e.g., products. This is the detection of r = rq. The last step is to identify the entity instances that act as the values of those attributes, e.g., the value FluMist for the attribute products. If the attribute fields in the Infobox are not directly extractable, further mining steps need to follow the related pages in order to extract the answer entity information. The overall algorithm for mining the Wikipedia Infobox to extract answer entities is shown in Figure 10.
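The three detection steps described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the in-memory `INFOBOXES` dictionary, the `normalize` helper, and the hypothetical `ExampleCorp` entry are assumptions introduced for the example, and real Infobox data would first have to be parsed from Wikipedia.

```python
import re

# Toy stand-in for parsed Infobox data (assumption: a real system would parse
# this from Wikipedia). "ExampleCorp" is a hypothetical entry with no products field.
INFOBOXES = {
    "MedImmune": {"type": "company", "products": ["Synagis", "FluMist"]},
    "ExampleCorp": {"type": "company"},
}

def normalize(name):
    """Lowercase, drop punctuation, and strip a trailing corporate suffix."""
    name = re.sub(r"[^\w\s]", "", name.lower()).strip()
    return re.sub(r"\s+(inc|corp|ltd)$", "", name)

def extract_answers(infoboxes, topic_entity, relation):
    """Return the answer entities e2 for a query (eq1 = topic_entity, rq = relation)."""
    # Step 1: detect the topic entity, i.e. find an entry with e1 = eq1.
    key = normalize(topic_entity)
    entry = next((box for name, box in infoboxes.items()
                  if normalize(name) == key), None)
    if entry is None:
        return []  # the entity has no Infobox at all
    # Step 2: detect the relation, i.e. check for an attribute with r = rq.
    values = entry.get(relation)
    if values is None:
        return []  # the Infobox has no such field
    # Step 3: the attribute values are the answer entities e2.
    return list(values)
```

With the toy data, `extract_answers(INFOBOXES, "MedImmune, Inc.", "products")` returns the two products, while the hypothetical entry without a products field yields an empty list, mirroring the incompleteness cases from the pilot study.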

For each target type (e.g., company) in the knowledge base (e.g., Wikipedia Infobox) {
    Get related part (e.g., Infobox) {
        Get target field (e.g., location) {
            Extract target information {
                If (field is terms), then extract terms as they are (e.g., products)
                If (field needs to be further extracted, e.g., "List of Google Products" or "Yahoo Products")
                Further extraction method {
                    e.g., crawl this page {
                        If the page contains LIST information, then extract them as products;
                        If the page contains links to another page, crawl that page
                    }
                } // end of further extraction method
            } // end of extract target information
        } // end of get target field
    } // end of get related part
} // end of all

Figure 10: The algorithm extracting entities from knowledge bases
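The further extraction branch of Figure 10 can be sketched in Python as below. This is a simplified sketch under stated assumptions: pages are modelled as an in-memory dictionary instead of being crawled, the `ExampleCo` page titles and items are hypothetical, and the depth limit and visited set are safeguards added here that are not part of the figure.

```python
# Hypothetical in-memory stand-in for crawled pages: title -> page content.
PAGES = {
    "List of ExampleCo products": {"items": [], "links": ["ExampleCo hardware"]},
    "ExampleCo hardware": {"items": ["Widget A", "Widget B"], "links": []},
}

def crawl_for_items(pages, title, max_depth=3, seen=None):
    """Follow links from `title` until pages with list items are found."""
    seen = set() if seen is None else seen
    # Stop on exhausted depth, already visited pages, or unknown pages.
    if max_depth < 0 or title in seen or title not in pages:
        return []
    seen.add(title)
    page = pages[title]
    if page["items"]:
        # The page contains list information: extract it as the answer entities.
        return list(page["items"])
    found = []
    for link in page["links"]:
        # The page only links to other pages: crawl those pages in turn.
        found.extend(crawl_for_items(pages, link, max_depth - 1, seen))
    return found

def extract_field(pages, field_value):
    """Terms are extracted as they are; a page reference triggers further extraction."""
    if isinstance(field_value, list):
        return list(field_value)
    return crawl_for_items(pages, field_value)
```

A field that already holds terms is returned unchanged, while a field such as "List of ExampleCo products" is resolved by following its link to the page that holds the list.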

The first experiment evaluating answer entity extraction from the tables/lists of a knowledge base uses the RAP sets. The topics are companies, and the targeted entities are the products and locations of those companies. From the 2008 version of Wikipedia, 265 product entities were extracted for 30 companies. The results are in Appendix B. For the experimental entities, knowledge base entity extraction effectively extracts most answer entities with high precision.

The second experiment is answer extraction for the 20 TREC 2009 topics. Because only three of these topics are related to product retrieval, this experiment uses only those three. Answer entities could be extracted for only one of the three: Synagis and FluMist are extracted as products for the topic products of MedImmune, Inc.

Although this method can extract high-accuracy answer entities that are independent of the noisy corpus, the extraction still relies on the knowledge base itself and on how it represents knowledge. For example, for the topic airlines that currently use Boeing 747 planes, the Wikipedia Infobox of the Boeing 747 page uses the term "primary users" to represent the relation between the topic entity and the answer entities. Such variation in representation makes it difficult to match topic entities and answer entities. Extending the algorithm to extract more entities is left for future work.
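One possible way to cope with varying relation names, shown only as an illustrative sketch and not as part of the method evaluated above, is a hand-built alias table that maps a queried relation to the field names that may express it. The table contents, the relation key `"users"`, and the function name `find_field` are assumptions; only the field name "primary users" comes from the example above.

```python
# Hypothetical alias table: queried relation -> Infobox field names that may express it.
RELATION_ALIASES = {
    "users": ["users", "primary users"],
    "products": ["products"],
}

def find_field(infobox, relation, aliases=RELATION_ALIASES):
    """Return the Infobox field name expressing `relation`, or None if absent."""
    # Compare case-insensitively but return the field name as it appears.
    fields = {name.lower(): name for name in infobox}
    for candidate in aliases.get(relation, [relation]):
        if candidate.lower() in fields:
            return fields[candidate.lower()]
    return None
```

A lookup for the relation "users" in an Infobox whose only related field is "Primary users" then returns that field name, whereas an exact match on the query term alone would fail.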

6.2.4 Discussion

The list/table extraction method can successfully detect answer entities for topics such as chefs with a show on the Food Network. However, some topics remain hard to extract because of the complicated structures of their lists and tables.

1. Tables/lists can be embedded in pictures or photos. For example, for the topic sponsors of the Mancuso quilt festivals, the list on the Web page consists of company logos with links pointing to those companies. This kind of representation is popular on the Web as a way to keep robots from mining the content, but it also makes entity extraction difficult. A similar case is the topic donors to the Home Depot Foundation, whose answer entities are contained in a single picture and cannot be obtained by text extraction.

2. Hierarchical or mixed structures of lists or tables also make extraction difficult. The answer entities for the topic authors awarded an Anthony at Bouchercon in 2007, for example, are listed in lines that combine each author with the title of a work of fiction, which makes it hard to isolate the authors.

3. The variety of formats used to present list structures also causes extraction failures. For example, for the topic sport teams in Philadelphia, the Wikipedia page, http://en.wikipedia.org/wiki/Sports_in_Philadelphia.html, uses HTML headings to represent the answer entities. For the topic donors to the Home Depot Foundation, the Web page http://www.homedepotfoundation.org/donors/2010-complete-donor-list.html presents the answer entities as one donor per line. Therefore, the algorithm should take these varied presentations of table/list structure into account during extraction.

4. Sometimes a document contains multiple tables or lists, but not all of those in the germane document discuss the answer entities; only the tables or lists in certain sections present them. For example, for the topic journals published by AVMA, the germane document http://www.avma.org/journals/default.asp contains the answer entities in the "journals" section, which is mixed with other content. Identifying the answer sections in germane documents is therefore left for future work.

5. Various names are used to represent the relations between answer entities and topic entities. Research on matching the relation names in the header cells of tables against the queries is left for future work.
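The two presentation formats mentioned in item 3, answer entities as HTML headings and as one entity per line, could be handled by format-specific extractors along the following lines. This is a sketch using Python's standard `html.parser` module; the choice of `h2`/`h3` as the relevant heading levels and the class and function names are assumptions, and it does not address image-embedded lists (item 1) or section selection (item 4).

```python
from html.parser import HTMLParser

class HeadingExtractor(HTMLParser):
    """Collect the text of selected heading tags as candidate answer entities."""

    def __init__(self, tags=("h2", "h3")):
        super().__init__()
        self.tags = set(tags)
        self.headings = []
        self._buffer = None  # holds text fragments while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.tags and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.headings.append(text)
            self._buffer = None

def extract_headings(html):
    """Return heading texts in document order."""
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    return parser.headings

def extract_lines(text):
    """Treat each non-empty line as one candidate entity (one entity per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Both functions only produce candidates; deciding which headings or lines are actual answer entities would still require the relation and section matching discussed in items 4 and 5.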