Document Retrieval Strategy - Other QA Systems

2.1 Document Retrieval Strategy

Our QA system first used the traditional term-based document retrieval strategy to match relevant documents in the collection to the user’s question, so only a document that shares at least one common term with the question can be retrieved. For example, if a user types in the question “what is AI”, the document “Artificial intelligence is the intelligence of machines and the branch of computer science that aims to create it” will never be retrieved by our QA system, because, apart from the stop word “is”, this document shares no common term with the user’s question. However, this document is highly related to the question and could be the best answer to help the user understand the concept; it should be retrieved and returned to the user. This is how we miss the valuable semantic connection between “AI” and “Artificial Intelligence” when we rely only on a term-based document retrieval strategy to search for candidate answers, and it clearly decreases the quality of the answers our QA system produces.
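The mismatch described above can be reproduced with a minimal sketch. It assumes a simple lowercase tokenizer and a small hand-picked stop word list (`STOPWORDS`); both are illustrative assumptions and not the components of our system.

```python
import re

# Hypothetical stop word list, chosen only for this illustration.
STOPWORDS = frozenset({"what", "is", "the", "of", "and", "that", "to", "it"})

def terms(text):
    """Lowercase a text and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def shared_terms(question, document, stopwords=frozenset()):
    """Return the terms that occur in both texts, ignoring the given stop words."""
    return (terms(question) & terms(document)) - stopwords

question = "what is AI"
document = ("Artificial intelligence is the intelligence of machines and the "
            "branch of computer science that aims to create it")

# Without stop word removal the only overlap is the uninformative term "is";
# with it, the overlap is empty and a term-based retriever finds no match.
raw_overlap = shared_terms(question, document)
content_overlap = shared_terms(question, document, STOPWORDS)
```

With an empty `content_overlap`, any retrieval rule that requires at least one shared content term skips the document, even though it answers the question.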

To avoid the situation in which important and related documents are not retrieved by our QA system, we have researched and experimented with different algorithms and strategies. We first used a traditional strategy called term stemming, which helps our system detect more common terms without being distracted by differing term suffixes and prefixes. However, we noticed that term stemming did not change the fact that semantic information was missing during retrieval. Because of the variety of information on the Internet and the unlimited range of questions we deal with, it was a big challenge for our system to dynamically detect the semantic connection between a question and a document in the collection. We are not alone: as discussed in Chapter 1, other researchers in this area have also recognized the effect of this problem and have used different strategies, such as syntactic and semantic analysis of the document collection, to fill the semantic gap between queries and documents. Among them, Mehran and Timothy [63] designed a novel and powerful strategy that tries to solve this problem without integrating complicated and expensive semantic analysis into the system.
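The limitation of stemming noted above can be illustrated with a toy example. The `toy_stem` function below is a deliberately crude suffix stripper written for this sketch; it is an assumption and not the stemmer used in our system, and it ignores prefixes altogether.

```python
import re

def toy_stem(term):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words such as "is" survive.
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def stem_set(text):
    """Tokenize a text and return the set of its stemmed terms."""
    return {toy_stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())}

# Stemming recovers matches that differ only in their suffixes ...
suffix_overlap = stem_set("retrieving documents") & stem_set("retrieved document")

# ... but it cannot connect an abbreviation with its expansion.
semantic_overlap = stem_set("what is AI") & stem_set("artificial intelligence")
```

Here `suffix_overlap` contains two stems, while `semantic_overlap` stays empty, which is the gap the expansion strategy is meant to close.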

In their research, they claimed that in order to match two semantically connected documents that do not share any common terms, such as “AI” and “Artificial Intelligence”, a practical option is to expand each of them with relevant terms. Unexpanded, the two documents have no chance of matching on a common term; with the two expansions, they have a higher chance of sharing one or more overlapping terms. Therefore, a QA system can combine the straightforward and affordable term-based document search strategy with the new document expansions to perform retrieval without losing the semantic information.

How to expand two “irrelevant” documents therefore became the most important topic in Mehran and Timothy’s research. They needed to choose qualified, related terms to expand each document in the collection while keeping the processing time in mind. Using a collection of synonyms to attach related terms to the documents is one option. However, the synonym collection can be out of date, and unimportant terms in the document are also expanded with their synonyms. Those disadvantages lower the quality of the expansions and, eventually, the quality of the answers the QA system retrieves.

Therefore, instead of using a synonym collection, Mehran and Timothy used the search results of the Google search engine. They first send the two “irrelevant” documents to the Google search engine as two independent search queries. For each original document they then get back a few related web pages. The text content of those web pages is saved as the expansion resource of the corresponding original document for the next step. Simply using all of that text to expand the two original documents is not efficient: each original document would be expanded into a large text document, the processing time would increase because more terms in each expanded document must be checked during retrieval, and the document collection of the QA system would grow sharply as well. Therefore, Mehran and Timothy first calculate the TF-IDF value of each term in the web pages returned by Google. Only the 50 terms with the highest TF-IDF values in each returned web page are saved to expand the original document. Thus, if a QA system uses the top 10 web pages returned by the Google search engine to expand a document in the collection, the original document will be expanded with 500 new relevant terms.
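The term selection step can be sketched as follows. This is a minimal illustration, not the implementation described in [63]: it assumes a plain TF-IDF weighting (term frequency times the logarithm of inverse document frequency) and assumes that document frequencies are computed over the returned pages themselves, since the reference corpus is not specified here. The function name `top_tfidf_terms` is hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def top_tfidf_terms(pages, k=50):
    """For each page, return its k highest-scoring terms by TF-IDF."""
    tokenized = [tokenize(page) for page in pages]
    n_pages = len(tokenized)
    # Document frequency: in how many pages does each term occur?
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    selected = []
    for toks in tokenized:
        counts = Counter(toks)
        scores = {
            t: (c / len(toks)) * math.log(n_pages / doc_freq[t])
            for t, c in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected.append(ranked[:k])
    return selected

# Tiny made-up pages; a term found in every page gets a score of zero.
pages = [
    "artificial intelligence research",
    "artificial neural networks",
    "artificial life",
]
top_terms = top_tfidf_terms(pages, k=2)
```

With `k=50` and 10 returned pages, the expansion holds at most 50 * 10 = 500 terms, matching the count given above.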

Moreover, Mehran and Timothy found that instead of using the complete web pages as the expansion material, they could just use the web page snippets shown under the links on the Google search result page. When users search with Google, it shows a snippet under each result link to offer a quick view of that web page. Each snippet consists of a few sentences that share some words with the search query. For example, the snippet of the first result for the query “what is AI” is “WHAT IS ARTIFICIAL INTELLIGENCE… This article for the layman answers basic questions about artificial intelligence. The opinions expressed here are not”. Thus, if we follow Mehran and Timothy’s strategy to expand the document “what is AI”, it will be expanded with terms such as “ARTIFICIAL” and “INTELLIGENCE”. After the expansion, the similarity between the two documents “what is AI” and “Artificial intelligence is the intelligence of machines and the branch of computer science that aims to create it” becomes greater than zero, whereas it was zero before. This is how the semantic connection between “AI” and “artificial intelligence” is detected through the document expansions. Figure 1 further illustrates Mehran and Timothy’s general expansion process.

Figure 1. The expansion process: two documents that are related but share no common term (similarity = 0) are each sent to the Google search engine as a search query; each query returns a number of web page snippets S; the 50 terms with the highest TF-IDF values in each snippet (1 ≤ i ≤ n) are saved; each document is expanded with the 50*n new terms; the two expanded documents possibly share some common terms.
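The effect of the expansion on the “what is AI” example can be sketched as below. The sketch makes several assumptions: the snippet is hard-coded from the text above instead of being fetched from a search engine, only the question is expanded (no snippets for the other document are given here), all snippet terms are added instead of the top 50 TF-IDF terms, the stop word list is hand-picked, and cosine similarity over binary term vectors stands in for whichever term-based measure a system actually uses.

```python
import math
import re

# Hypothetical stop word list, chosen only for this illustration.
STOPWORDS = frozenset({"what", "is", "the", "of", "and", "that", "to", "it",
                       "this", "for", "here", "are", "not", "about"})

def terms(text):
    """Return the set of lowercase content terms in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def cosine(a, b):
    """Cosine similarity between two term sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def expand(text, snippets):
    """Return the terms of a text together with the terms of its snippets."""
    expanded = terms(text)
    for snippet in snippets:
        expanded |= terms(snippet)
    return expanded

question = "what is AI"
document = ("Artificial intelligence is the intelligence of machines and the "
            "branch of computer science that aims to create it")
snippet = ("WHAT IS ARTIFICIAL INTELLIGENCE… This article for the layman answers "
           "basic questions about artificial intelligence. The opinions "
           "expressed here are not")

before = cosine(terms(question), terms(document))
expanded_question = expand(question, [snippet])
after = cosine(expanded_question, terms(document))
```

Before the expansion the similarity is zero; afterwards the expanded question contains “artificial” and “intelligence”, so the similarity becomes positive.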

Since the Google search engine returns better results for shorter text queries, each document we send to Google to acquire relevant web page snippets for expansion should be short as well. This is the reason Mehran and Timothy stated in their paper that their retrieval strategy only works for short queries and small documents. However, this limitation does not affect the test results of their work. Based on their evaluations, the snippet search strategy remarkably raises the similarity between traditionally “irrelevant” documents. Compared with using a synonym collection as the expansion material, the web page snippets help the system detect more semantic connections between two sentences or two short passages. For example, the new search strategy is able to reveal the semantic connection between “java programming” and “applet development” with the help of their corresponding web page snippets from the Google search engine, whereas their experiment showed that the synonym collection failed in this case.

Mehran and Timothy also explained in their paper that their search strategy is suitable for query suggestion systems. A query suggestion system offers users similar and more valuable queries based on their own queries to assist them in acquiring the information they are looking for. For example, if a user types in the question “which laptop should I buy”, the query suggestion system shows the user some similar but more specific queries, such as “2012 best selling laptops”. Ideally, the suggested query will give the user a better search result than the original question. A query suggestion system that uses Mehran and Timothy’s expansion strategy keeps a local pre-processed query collection in which each query has already been expanded with its web page snippets and stored. Thus, more relevant prepared queries can be matched to the user’s question as suggestions. Every time a user types in a new query, the query suggestion system sends only this original query to the Google search engine to be expanded. The expanded query is then matched against the prepared queries in the local collection. Even though the pre-processed query collection may be large, each query in it needs to be expanded only once; later expansions are performed only on the questions received from users. Therefore, Mehran and Timothy described their snippet search strategy as a lazy strategy, meaning that document expansions are executed only when they are necessary.
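The lazy behaviour described above can be sketched as follows. Everything in this block is an illustrative assumption: `fake_snippet_search` is a hypothetical stand-in for a search-engine call that returns made-up snippets, the class `QuerySuggester` is not taken from [63], and matching is done by counting overlapping terms.

```python
import re

def terms(text):
    """Lowercase a text and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def fake_snippet_search(query):
    """Hypothetical search call; returns made-up snippets for known queries."""
    canned = {
        "which laptop should I buy": ["Buying guide: best selling laptops this year"],
        "2012 best selling laptops": ["Top rated laptops ranked by sales in 2012"],
        "easy pasta recipes": ["Quick dinner ideas using pasta"],
    }
    return canned.get(query, [])

class QuerySuggester:
    def __init__(self, prepared_queries, search):
        self.search = search
        self.calls = 0  # number of search calls made so far
        # Each prepared query is expanded exactly once and stored.
        self.index = {q: self._expand(q) for q in prepared_queries}

    def _expand(self, query):
        self.calls += 1
        expanded = terms(query)
        for snippet in self.search(query):
            expanded |= terms(snippet)
        return expanded

    def suggest(self, user_query, top_n=3):
        # Only the incoming query is expanded: one search call per request.
        expanded = self._expand(user_query)
        scored = [(len(expanded & stored), q) for q, stored in self.index.items()]
        ranked = sorted((item for item in scored if item[0] > 0),
                        key=lambda item: (-item[0], item[1]))
        return [q for _, q in ranked[:top_n]]

suggester = QuerySuggester(
    ["2012 best selling laptops", "easy pasta recipes"], fake_snippet_search
)
suggestions = suggester.suggest("which laptop should I buy")
```

In this toy run the two prepared queries cost two search calls up front and the user question costs one more, however many prepared queries it is later compared against.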

However, even though Mehran and Timothy claimed that their snippet search strategy works with different term-based similarity measures, it is neither practical nor efficient to embed this strategy directly into QA systems that, like ours, perform term-based document retrieval on dynamically updated document collections. Those QA systems build a completely new document collection for each new question they receive. In our QA system, each time a user submits a new question, the system sends the question to the Google search engine and retrieves the top 10 web pages. A new document collection for answering this question is then set up by splitting each retrieved web page into its text paragraphs and storing each paragraph as an independent document in the collection. This means that if we directly used Mehran and Timothy’s search strategy to detect the semantic connections between the received query and the corresponding document collection, which usually contains more than 200 documents, we would have to send more than 200 text queries to the Google search engine to expand the documents and the user’s question with more than 2000 web page snippets.
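The cost estimate above follows from simple arithmetic, sketched here under the assumption that every query sent to the search engine returns 10 snippets. The function name `expansion_cost` and that per-query count are illustrative, not figures taken from [63].

```python
def expansion_cost(num_documents, snippets_per_query=10, include_question=True):
    """Return (search queries sent, snippets fetched) for one user question."""
    # One search query per document, plus one for the question itself.
    queries = num_documents + (1 if include_question else 0)
    return queries, queries * snippets_per_query

# A collection of 200 documents plus the user question.
queries_sent, snippets_fetched = expansion_cost(200)
```

For 200 documents this gives 201 search queries and 2010 snippets for a single question, and the whole cost is paid again for the next question because the collection is rebuilt each time.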

Our QA system clearly cannot afford this expensive search strategy. In Section 7, we will explain how we have modified and improved Mehran and Timothy’s strategy to embed it into dynamic term-based QA systems like ours. We used their strategy as the basic concept and inspiration for developing a new document search strategy.
