
Boolean and ranked retrieval

The IR process can be instantiated in different ways. Among the different approaches, we distinguish two broad paradigms that have been the focus of research for many years, namely boolean and ranked retrieval. The main difference between them is that boolean retrieval returns an unranked set of documents, whereas ranked retrieval returns an ordered list of search results. Boolean retrieval was the dominant model in the early days of IR, while ranked retrieval has since taken the lead. We describe each of them below.

2.2.1 Boolean retrieval

Boolean information retrieval dates back to the origins of IR [112]. In this model, queries are expressed in boolean logic. For instance, the query “New AND York” states that both the terms “New” and “York” must appear in the retrieved documents. The query “(London OR Paris) AND NOT Berlin” matches documents that contain the term “London” or the term “Paris”, and none of the retrieved documents contains the term “Berlin”.
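The two example queries can be sketched with plain set operations over a small inverted index. This is only an illustrative sketch: the toy collection, the whitespace tokenizer and the names `DOCS`, `build_index` and `postings` are assumptions made for the example, not part of the model description above.

```python
from collections import defaultdict

# Hypothetical toy collection; ids and texts are illustrative only.
DOCS = {
    1: "New York is a large city",
    2: "London and Paris are European capitals",
    3: "Berlin and Paris are European capitals",
    4: "York is a city in England",
}


def build_index(docs):
    """Map each lowercased term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


INDEX = build_index(DOCS)


def postings(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return INDEX.get(term.lower(), set())


# "New AND York": intersection of the two posting sets.
new_and_york = postings("New") & postings("York")

# "(London OR Paris) AND NOT Berlin": union, then drop documents with "Berlin".
london_or_paris_not_berlin = (postings("London") | postings("Paris")) - postings("Berlin")

print(new_and_york)                # {1}
print(london_or_paris_not_berlin)  # {2}
```

Each operator maps to one set operation (AND to intersection, OR to union, AND NOT to difference), which is why the result is a set with no inherent order.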

In this model of IR, it does not matter how often a term appears in a document; we are only interested in its presence. A document that contains the term “London” 20 times scores the same as a document that contains it once. This approach has the advantage that it returns only documents that match the query exactly, but it also has several disadvantages. We list some of them below.

First, it is sometimes difficult to know the exact words that must appear in the documents, and equally difficult to guess the best logical operators to connect these terms. In fact, most users of the boolean model were expert users; the general public does not necessarily master boolean logic.

Second, the returned documents have no order. If the list of returned documents is long, one would prefer to access first the documents that are more likely to be relevant. The lack of graded scores affects both the documents that match the query and those that do not. For example, take the query “Paris AND London AND Berlin”, issued to compare these cities. Results that mention these terms several times score the same as results that contain each term only once. Conversely, a document that contains both “Paris” and “London” but not “Berlin” scores 0, the same as documents that contain none of the query terms.
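The effect described in this example can be reproduced with a minimal sketch of conjunctive boolean matching. The function `boolean_and_score` and the four sample texts are hypothetical and serve only to make the 0/1 outcome visible.

```python
def boolean_and_score(query_terms, text):
    """Return 1 if every query term occurs in the text, else 0."""
    tokens = set(text.lower().split())
    return int(all(term.lower() in tokens for term in query_terms))


# Terms of the query "Paris AND London AND Berlin".
query = ["Paris", "London", "Berlin"]

# Hypothetical documents, from a detailed comparison to an unrelated text.
docs = {
    "detailed": "Paris London Berlin Paris London Berlin Paris compared at length",
    "passing": "a trip through Paris London Berlin",
    "partial": "Paris and London compared",
    "unrelated": "a recipe for apple pie",
}

scores = {name: boolean_and_score(query, text) for name, text in docs.items()}
print(scores)
# {'detailed': 1, 'passing': 1, 'partial': 0, 'unrelated': 0}
```

The detailed and the passing document are indistinguishable, and so are the partial match and the unrelated text, which is the loss of information that ranking is meant to recover.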

Nowadays, pure boolean IR has become obsolete for most IR applications. Existing techniques have had to integrate some form of ranking.

2.2.2 Ranked retrieval

Ranked retrieval [112] allows users to issue free-text queries, i.e., they type one or more words without any logical or other complex operators. The main difference from boolean IR is that the output is a ranked list rather than a set of documents. Here, results can be listed even when some of the query terms are absent from a document. The ordering of the results aims to keep relevant results at the top. This corresponds to the probability ranking principle: Robertson states that an optimal IR system should order results by their probability of relevance [146].

Typically, results are ranked by scoring functions that combine different features derived from the query and the documents. Instead of the binary presence of a term in a document, ranked retrieval models rely on other weights. The two most popular are the term frequency tf (how often a term occurs in a document) and the inverse document frequency idf (which gives more weight to terms that occur in few documents). Features are also specific to the IR model being used, among which we can mention vector space models [151], probabilistic models [147, 30], language models [140, 31] and fuzzy models [26].
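As an illustration of how tf and idf can be combined, the sketch below ranks a toy collection with one simple weighting, score(q, d) = sum over query terms t of tf(t, d) * log(N / df(t)). This particular formula is only one of many variants and is an assumption of the example, as are the collection `DOCS` and the helper names `tokenize` and `rank`; the models cited above each define their own scoring function.

```python
import math
from collections import Counter

# Hypothetical toy collection; ids and texts are illustrative only.
DOCS = {
    "d1": "london london london travel guide",
    "d2": "london and paris travel guide",
    "d3": "berlin travel guide",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing tf-idf score."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # df(t): number of documents in which term t occurs at least once.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for term in tokenize(query):
            if doc_freq[term] == 0:
                continue  # term unseen in the collection: contributes nothing
            idf = math.log(n_docs / doc_freq[term])
            score += counts[term] * idf  # tf(t, d) * idf(t)
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank("london paris", DOCS)
print([doc_id for doc_id, _ in ranking])  # ['d2', 'd1', 'd3']
```

Unlike a boolean conjunction, `d1` still receives a non-zero score although it lacks "paris", and the rarer term "paris" weighs more than the more common "london", so `d2` is ranked first.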

The next section presents some criticisms of traditional ranked retrieval.

2.2.3 Limits of ranked retrieval

A list of documents presented in a uniform format is not necessarily the best approach for Information Retrieval. A survey on traditional Web search [22] shows that users find relevant search results on the first page of results in 39.9% of cases, and that only 21.2% of the respondents find the results well organized. Below, we list some limits of the ranked retrieval paradigm from different perspectives.

Data sparseness: The relevant information can be scattered across different documents [123]. In these cases the ranked list is inadequate, because users have to scan several documents to satisfy their information need. This can be a time-consuming and burdensome process.

Lack of focus: Ranked retrieval approaches provide a ranked list of uniformly presented results. Typically, each result is a snippet composed of the result title, a link to the document and a summary of the linked document. However, the beginning of a document is not necessarily the best entry point [145]. For queries whose answer is only a part of a document, it might be better to return that part directly. Uniform snippets are not flexible enough for focused retrieval.

Lack of diversity: For some queries, search results should be diverse [47] in terms of both content and presentation. The traditional ranked retrieval approach provides a uniform presentation for all results. The queries “images of Niagara Falls”, “videos of Niagara Falls” and “Niagara Falls” would all be answered with Web page snippets by traditional Web search. Ideally, the first two queries should be answered directly with images and videos respectively, while the third query can be answered with diverse results (images, videos, Web pages, ...). Ranked retrieval should therefore account for diversity in terms of both content and presentation. In fact, the diversification of search results is attracting increasing interest in IR research [45, 8].

Ambiguity: Many queries are ambiguous in terms of information need. The reference example is “Jaguar”, which can refer to a car, an animal, an operating system and so on. Ideally, we should return one answer per query interpretation [161]. This can take the form of multiple ranked lists or linked sets of results.

2.3 Conclusion: Towards aggregated retrieval