
A MODEL COMBINING DOCUMENT RETRIEVAL AND ENTITY EX-

An entity retrieval question can be stated as follows: does an entity e answer a query q with the target type t? Viewed probabilistically, the question becomes: what is the probability that a candidate entity e is the answer entity, given a query q with the target type t? That is, p(e|q, t). The system ranks candidate entities by this probability and returns the top k candidates as the most probable answers.

Entities exist in Web pages. Therefore, if we consider all documents d, the TREPM is given by Equation 3.1. If we use only the documents that contain answer entities, called germane documents d_germane, to estimate this probability, then the original formula can be approximated as Equation 3.2.

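\begin{align}
p(e \mid q, t) &= \sum_{d} p(e, d \mid q, t) = \sum_{d} p(d \mid q, t)\, p(e \mid d, q, t) \tag{3.1} \\
&\approx \sum_{d_{\mathrm{germane}}} p(d_{\mathrm{germane}} \mid q, t)\, p(e \mid d_{\mathrm{germane}}, q, t) \tag{3.2}
\end{align}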
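As a minimal sketch of Equation 3.2 and the top-k ranking it supports, the following Python function sums p(d|q, t) p(e|d, q, t) over a small set of germane documents. The names `score_entities`, `doc_probs`, and `entity_probs`, the placeholder entities `entity_B` and `entity_C`, and all numeric values are hypothetical and chosen only for illustration; how the two probabilities are actually estimated is not assumed here.

```python
from collections import defaultdict


def score_entities(doc_probs, entity_probs, k=2):
    """Approximate p(e|q,t) as the sum over germane docs of p(d|q,t) * p(e|d,q,t).

    doc_probs:    {doc_id: p(d|q,t)} for the germane documents only (assumed given)
    entity_probs: {doc_id: {entity: p(e|d,q,t)}} (assumed given)
    Returns the top-k (entity, score) pairs, highest score first.
    """
    scores = defaultdict(float)
    for doc_id, p_doc in doc_probs.items():
        for entity, p_ent in entity_probs.get(doc_id, {}).items():
            scores[entity] += p_doc * p_ent
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical probabilities, for illustration only.
doc_probs = {"d1": 0.6, "d2": 0.4}
entity_probs = {
    "d1": {"FluMist": 0.7, "entity_B": 0.3},
    "d2": {"FluMist": 0.5, "entity_C": 0.5},
}
top = score_entities(doc_probs, entity_probs, k=2)
```

With these made-up inputs, `FluMist` accumulates evidence from both documents and therefore ranks first, which is the behavior the summation over documents is meant to capture.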

The TREPM model consists of two parts. The first part of Equation 3.1 is Σ_d p(d|q, t), where p(d|q, t) is the probability of the document d being generated by the query q with the target entity type t. This is germane document identification, conducted by estimating the similarity between a document and a query. The second part of Equation 3.1 is p(e|d, q, t), i.e., the probability of generating entity e with the target type t in the document d given query q, called answer entity extraction.

If we further consider the contexts c of answer entities, answer entity extraction in the TREPM model expands to Equation 3.3. The first quantity is p(c|d, q, t), the generative probability of the context c in a particular document d, given the query q and the target entity type t. The second quantity is p(e|c, d, q, t), the probability that a candidate entity e is an answer given a context c in the document d for the query q with the target entity type t. As with germane document identification, if we use only the highest-probability contexts, called support contexts c_support, to extract the answer entities, we obtain the approximation in Equation 3.4.

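\begin{align}
p(e \mid q, t) &= \sum_{d} p(d \mid q, t)\, p(e \mid d, q, t) = \sum_{d} p(d \mid q, t) \sum_{c} p(c \mid d, q, t)\, p(e \mid c, d, q, t) \tag{3.3} \\
&\approx \sum_{d_{\mathrm{germane}}} p(d_{\mathrm{germane}} \mid q, t) \left( \sum_{c_{\mathrm{support}}} p(c_{\mathrm{support}} \mid d_{\mathrm{germane}}, q, t)\, p(e \mid c_{\mathrm{support}}, d_{\mathrm{germane}}, q, t) \right) \tag{3.4}
\end{align}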
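The nested summation of Equation 3.4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `score_entity`, the input layout, and every probability value are hypothetical, and the probabilities are taken as given rather than estimated.

```python
def score_entity(entity, docs):
    """Approximate p(e|q,t) following the structure of Equation 3.4.

    docs: list of (p_doc, contexts) pairs, one per germane document, where
          p_doc is p(d|q,t) and contexts is a list of (p_ctx, ent_probs)
          pairs for that document's support contexts:
          p_ctx is p(c|d,q,t); ent_probs maps entity -> p(e|c,d,q,t).
    """
    total = 0.0
    for p_doc, contexts in docs:
        # Inner sum over support contexts of this document.
        inner = sum(p_ctx * ent_probs.get(entity, 0.0)
                    for p_ctx, ent_probs in contexts)
        # Outer sum over germane documents, weighted by p(d|q,t).
        total += p_doc * inner
    return total


# Hypothetical values: two germane documents with two and one support contexts.
docs = [
    (0.6, [(0.8, {"FluMist": 0.9}), (0.2, {"FluMist": 0.1})]),
    (0.4, [(1.0, {"FluMist": 0.5})]),
]
flumist_score = score_entity("FluMist", docs)
missing_score = score_entity("not_mentioned", docs)
```

An entity that appears in no support context receives a score of zero, so only entities supported by at least one retained context can be ranked.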

There are two reasons for decomposing the entity retrieval problem into germane document identification and answer entity extraction. The first reason is that the decomposition separates the word-independent factor and the word-dependent factor into two subtasks. The information need in entity retrieval (e.g., what are the products of MedImmune, Inc.?) represents the answer entity (e.g., the company’s product) as a description (e.g., products of MedImmune, Inc.) that expresses the relation between the topic entity (e.g., MedImmune, Inc.) and the answer entities (e.g., FluMist). The word-independent factor assumes that words occur in documents independently, while the word-dependent factor assumes that the meaning of a word influences the interpretation of other words in the document.

For example, under the word-independence assumption, the above query asks for documents containing the terms “Products”, “of”, “MedImmune”, and “Inc.” A matching document may therefore contain either “products of MedImmune Inc” or “MedImmune Inc. buys the computer products from ...” Under this assumption, a document retrieval model provides an effective way to retrieve information at the level of document relevance; to a certain degree, it estimates whether or not a document contains answers to the query. Under the word-dependence assumption, the semantics within a document are analyzed to extract the answer entities for the query. For example, we need to treat MedImmune, Inc. as an entity of type company and find the products of MedImmune, Inc. in the document. Entity extraction is a powerful approach for this task.
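The limitation of the word-independence assumption can be illustrated with a small Python sketch of term matching. The helper names `tokens` and `matches_all_terms` and both example sentences are hypothetical; the second sentence is a made-up completion in the spirit of the off-topic example above, and simple set containment stands in for a real document retrieval model.

```python
import re


def tokens(text):
    """Lowercase a text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matches_all_terms(query, document):
    """Word-independence matching: true if every query term occurs somewhere."""
    return tokens(query) <= tokens(document)


query = "products of MedImmune Inc"
# Hypothetical documents, for illustration only.
doc_relevant = "FluMist is one of the products of MedImmune Inc."
doc_off_topic = "MedImmune Inc. buys the computer products of a local vendor."

relevant_hit = matches_all_terms(query, doc_relevant)
off_topic_hit = matches_all_terms(query, doc_off_topic)
```

Both documents match, because the matcher ignores how the words relate to one another. Telling them apart requires the word-dependent analysis performed in answer entity extraction.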

The second reason for this decomposition is to simplify a global retrieval problem into two locally optimized problems, which lowers the complexity of the problem and reduces the execution time. If we assume the number of documents in the corpus is m and the number of contexts is n, then the time requirement for the entity retrieval task is Θ(m · n), because the system needs to iterate over every document and scan all contexts to detect answer entities. If we use document retrieval to find m′ germane documents (m′ ≪ m) and consider only the n′ most effective contexts (n′ ≪ n), then the time requirement for the TREPM to complete the entity retrieval task is Θ(m′ · n′), which significantly reduces the system's execution time. The space complexity is reduced similarly.

Bron and his entity retrieval group formulate the entity retrieval task as p(e|E, T, R), i.e., the probability of a candidate entity e given the source entity E, the target type T, and the relation R described in the narrative. They calculate the co-occurrence of candidate entities e and source entities E over all documents in the corpus. By their own account, processing the whole corpus takes around two weeks [Bron et al., 2010]. If they considered only germane documents for the candidate entities, the processing time would be significantly shorter.
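The complexity argument above can be made concrete by counting (document, context) evaluations. The sizes below are hypothetical and chosen only to show the scaling, and the count deliberately ignores the cost of the document retrieval step that selects the germane documents.

```python
def evaluations(num_docs, num_contexts):
    """Number of (document, context) pairs examined, one unit of work each."""
    return num_docs * num_contexts


# Hypothetical sizes, for illustration only.
m, n = 1_000_000, 50        # all documents, all contexts
m_prime, n_prime = 100, 5   # germane documents, support contexts

full_cost = evaluations(m, n)
pruned_cost = evaluations(m_prime, n_prime)
speedup = full_cost / pruned_cost
```

Under these assumed sizes the pruned search examines 500 pairs instead of 50,000,000.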

In summary, TREPM considers the relevance between entities and topics on two layers: germane document identification and answer entity extraction. To find answer entities, a retrieval system would need to rank all candidate entities by considering all combinations of documents and contexts. In a large-scale information environment or open-ended corpus such as the Web, however, evaluating all documents and all patterns is infeasible. Therefore, we identify germane documents and support contexts, instead of all documents and all contexts, to estimate answer entities effectively and efficiently.