Document Representations - Efficient and effective retrieval using Higher-Order proximity model

In order to calculate similarity scores using the models discussed in Section 2.2, a proper representation of the documents in the test collection is necessary. Several preprocessing steps are required before the documents can be represented. Typically, each document in the collection is assigned a unique identifier (doc ID), and each term also has a unique identifier (term ID). A dictionary (or vocabulary) is built while parsing all documents in the corpus. The most straightforward representation preserves each document as a vector of term IDs, without any further processing, and is referred to as a direct file. However, an obvious drawback of this representation is that it may not help reduce the query evaluation cost in the ranked retrieval process. Consider the basic bag-of-words models discussed in Section 2.2.1. All of them fall in the TF×IDF ranking regime, and computing these statistics on-the-fly makes retrieval expensive. Hence a representation that improves query evaluation is required. In this section, the main focus is the inverted file structure, including the one that is built for proximity ranking models. Some other representations are discussed in Section 2.3.2. After constructing a vocabulary, the four documents in Figure 2.2 can be represented using a basic inverted index, as shown in Figure 2.3.

D1: Scalable Vector Graphics is an XML-based vector image format.

D2: The SVG Specification primarily focuses on vector graphics markup language.

D3: Scalable Vector Graphics images can be produced by the use of a vector graphics editor.

D4: All aspects of an SVG document can be accessed and manipulated using scripts in a similar way to HTML.

Figure 2.2: An example document collection consisting of four documents.

Figure 2.3: The basic inverted file structure for the example collection in Figure 2.2.

2.3.1 Inverted Lists

Basic Inverted Index. Consider an example document collection containing four documents, shown in Figure 2.2. A basic inverted file index structure contains two components: the vocabulary and an inverted list for each term [135]. The vocabulary contains all unique terms in the document collection, and the document frequency $f_t$ associated with each term. Each term in the vocabulary also has a pointer to an inverted list that keeps the frequency statistics in the form of $\langle D, f_{t,D} \rangle$ pairs, where $D$ is the ID of a document in which the term appears and $f_{t,D}$ is the number of occurrences of the term in document $D$. In addition to these two components, there is often a separate structure that maintains meta information for all documents. Besides a mapping from original document name to document ID, it may also contain information such as document length for the convenience of query evaluation. It is clear that, with the help of such an index structure, most of the bag-of-words models can be computed efficiently.

Figure 2.4: An example of a combined inverted list and partial nextword index.
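The basic inverted index described above can be illustrated with a minimal sketch over the four documents of Figure 2.2. The tokenizer (lowercasing, whitespace splitting, stripping trailing punctuation), the function names, and the plain TF×IDF weighting are assumptions made for illustration only; they are not the specific ranking functions of Section 2.2.1.

```python
import math
from collections import Counter, defaultdict

# The four example documents of Figure 2.2, keyed by document ID.
DOCS = {
    1: "Scalable Vector Graphics is an XML-based vector image format.",
    2: "The SVG Specification primarily focuses on vector graphics markup language.",
    3: "Scalable Vector Graphics images can be produced by the use of a vector graphics editor.",
    4: "All aspects of an SVG document can be accessed and manipulated using scripts in a similar way to HTML.",
}

def tokenize(text):
    """Assumed tokenizer: lowercase, split on whitespace, strip '.' and ','."""
    return [tok.strip(".,").lower() for tok in text.split() if tok.strip(".,")]

def build_index(docs):
    """Return (vocabulary, inverted_lists, doc_lengths) for a collection."""
    inverted_lists = defaultdict(list)  # term -> [(D, f_{t,D}), ...] in doc ID order
    doc_lengths = {}                    # document metadata: D -> length in terms
    for doc_id in sorted(docs):
        tokens = tokenize(docs[doc_id])
        doc_lengths[doc_id] = len(tokens)
        for term, freq in Counter(tokens).items():
            inverted_lists[term].append((doc_id, freq))
    # The vocabulary keeps the document frequency f_t of every term.
    vocabulary = {term: len(postings) for term, postings in inverted_lists.items()}
    return vocabulary, dict(inverted_lists), doc_lengths

def tf_idf_scores(query, vocabulary, inverted_lists, num_docs):
    """Score documents with an assumed plain f_{t,D} * log(N / f_t) weighting."""
    scores = defaultdict(float)
    for term in tokenize(query):
        f_t = vocabulary.get(term)
        if f_t is None:
            continue  # query term does not occur in the collection
        idf = math.log(num_docs / f_t)
        for doc_id, f_td in inverted_lists[term]:
            scores[doc_id] += f_td * idf
    # Highest score first; ties broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

vocabulary, inverted_lists, doc_lengths = build_index(DOCS)
print(vocabulary["vector"], inverted_lists["vector"])  # 3 [(1, 2), (2, 1), (3, 2)]
print(tf_idf_scores("vector graphics", vocabulary, inverted_lists, len(DOCS)))
```

Only the inverted lists of the query terms are touched during scoring, which is the reason this structure makes bag-of-words evaluation cheap compared with scanning every document.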

There are various ways of organizing such inverted lists. The inverted list of a term can be sorted by document ID, but it can also be sorted by impact or term frequency to improve retrieval efficiency [2]. However, when the ranking model requires phrase or other proximity features, the basic inverted index is no longer sufficient. A natural extension is to add positional information to the inverted list of each term. Instead of keeping only $f_{t,D}$, the positions at which each term occurs are also kept in the inverted list entry, which is the “[$p_i \ldots$]” part shown in Figure 2.3. Including positional information increases the space required, and may not be necessary at all if the retrieval process requires only document-level information. Zobel and Moffat [135] provide a detailed survey of using inverted files for retrieval tasks.
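The positional extension can be sketched as follows. This is an illustrative example under assumptions: positions are 0-based term offsets, the tokenizer is a simple whitespace splitter, and `build_positional_index` and `phrase_match` are hypothetical helper names rather than part of any cited system.

```python
from collections import defaultdict

# The four example documents of Figure 2.2, keyed by document ID.
DOCS = {
    1: "Scalable Vector Graphics is an XML-based vector image format.",
    2: "The SVG Specification primarily focuses on vector graphics markup language.",
    3: "Scalable Vector Graphics images can be produced by the use of a vector graphics editor.",
    4: "All aspects of an SVG document can be accessed and manipulated using scripts in a similar way to HTML.",
}

def tokenize(text):
    """Assumed tokenizer: lowercase, split on whitespace, strip '.' and ','."""
    return [tok.strip(".,").lower() for tok in text.split() if tok.strip(".,")]

def build_positional_index(docs):
    """Map term -> {doc ID: [p_1, p_2, ...]} using 0-based term positions."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, term in enumerate(tokenize(docs[doc_id])):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

def phrase_match(index, phrase):
    """Return {doc ID: number of exact occurrences of the phrase}."""
    terms = tokenize(phrase)
    if not terms or any(term not in index for term in terms):
        return {}
    # Only documents containing every phrase term can match.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    matches = {}
    for doc_id in sorted(candidates):
        later = [set(index[term][doc_id]) for term in terms[1:]]
        # A match starts at `start` if the k-th later term sits at start + k + 1.
        count = sum(
            all(start + k + 1 in later[k] for k in range(len(later)))
            for start in index[terms[0]][doc_id]
        )
        if count:
            matches[doc_id] = count
    return matches

positional_index = build_positional_index(DOCS)
print(positional_index["vector"][1])                      # [1, 6]
print(phrase_match(positional_index, "vector graphics"))  # {1: 1, 2: 1, 3: 2}
```

The within-document frequency is still available as the length of each position list, so the positional index subsumes the basic one at the cost of the extra space noted above.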

Representing Proximity Statistics. As proximity features are query dependent, pre-computing and indexing them can be difficult, especially for higher-order proximity features. However, if only phrases or term-pairs are considered in the proximity-based ranking models, it is possible to build a separate index structure. Besides using the positional index alone, there are alternative ways of representing phrase or term-pair proximity features: a hybrid index structure consisting of both phrase-level and term-level inverted files [13,125], or an approximate index [35].

Phrase Index. A phrase index may also be stored using an inverted file representation in which each entry in the vocabulary is a phrase instead of a single term [125]. However, representing a document collection using only a phrase index is not sufficient, since the term-level statistics are also required. Therefore, a commonly adopted solution is to use a hybrid index structure that consists of both term-level and phrase-level statistical information. The main concern with building a phrase-level index is the index size. Building an inverted index for all phrases is infeasible and unnecessary, so a partial phrase index is a viable alternative. A partial phrase index stores only a set of common phrases, and is smaller than an index that stores everything. However, a phrase that is missing from the stored subset cannot be matched exactly through the phrase index alone, so query evaluation can fail. To tackle this, Williams et al. [125] proposed the nextword index, which organizes terms in a three-level structure and aims at optimizing the retrieval of pairs of terms. It may still be infeasible to build a nextword index for all terms in the vocabulary. Therefore, Williams et al. proposed to build the nextword index only for frequently occurring terms, as shown in Figure 2.4. For rarely used terms, only an inverted file is used in the query evaluation process. The final option of a hybrid index structure includes the partial phrase index in addition to an inverted file and a nextword index. Using a complete nextword index and an inverted file can help evaluate phrase queries efficiently, and a combination of a partial nextword index, a partial phrase index and an inverted file gives the most significant improvement in the evaluation of phrase queries with minimal space overhead.

Also using a partial phrase index combined with an inverted index, Broschart and Schenkel [13] focused on improving the retrieval efficiency for the term-pair proximity model proposed by Büttcher et al. [17]. Because of the proximity features used in the retrieval model, the partial term-pair index is no longer restricted to phrases, so the number of possible entries in a term-pair index is much larger than when only bigrams are stored. Broschart and Schenkel formulated the pruning of the term-pair index as an optimization problem, where the index size is fixed and the optimization goal is to maximize the result quality. They showed that retrieval effectiveness can be improved by using an effectiveness-oriented hybrid index structure, and that retrieval efficiency can be greatly improved with little effectiveness loss by using an efficiency-oriented hybrid index.
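The idea of combining a partial nextword index with an inverted file can be sketched as below. This is a simplified illustration rather than the structure of Williams et al. [125]: the rule for choosing firstwords (the `num_firstwords` most frequent terms in the collection), the nested dictionary layout, and the function names are all assumptions.

```python
from collections import Counter, defaultdict

# The four example documents of Figure 2.2, keyed by document ID.
DOCS = {
    1: "Scalable Vector Graphics is an XML-based vector image format.",
    2: "The SVG Specification primarily focuses on vector graphics markup language.",
    3: "Scalable Vector Graphics images can be produced by the use of a vector graphics editor.",
    4: "All aspects of an SVG document can be accessed and manipulated using scripts in a similar way to HTML.",
}

def tokenize(text):
    """Assumed tokenizer: lowercase, split on whitespace, strip '.' and ','."""
    return [tok.strip(".,").lower() for tok in text.split() if tok.strip(".,")]

def build_indexes(docs, num_firstwords):
    """Build a positional inverted file plus a partial nextword index."""
    tokenized = {doc_id: tokenize(docs[doc_id]) for doc_id in sorted(docs)}
    inverted = defaultdict(dict)  # term -> {doc ID: [positions]}
    counts = Counter()            # collection frequency of each term
    for doc_id, tokens in tokenized.items():
        counts.update(tokens)
        for pos, term in enumerate(tokens):
            inverted[term].setdefault(doc_id, []).append(pos)
    # Assumed selection rule: keep only the most frequent terms as firstwords.
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    firstwords = set(ranked[:num_firstwords])
    nextword = {}  # firstword -> nextword -> {doc ID: [positions of firstword]}
    for doc_id, tokens in tokenized.items():
        for pos in range(len(tokens) - 1):
            first, nxt = tokens[pos], tokens[pos + 1]
            if first in firstwords:
                postings = nextword.setdefault(first, {}).setdefault(nxt, {})
                postings.setdefault(doc_id, []).append(pos)
    return dict(inverted), nextword

def phrase_pair(first, nxt, inverted, nextword):
    """Evaluate a two-word phrase; report which structure answered it."""
    if first in nextword:
        # Frequent firstword: the pair postings are read directly.
        return dict(nextword[first].get(nxt, {})), "nextword"
    # Rare firstword: fall back to intersecting positional inverted lists.
    result = {}
    first_list, next_list = inverted.get(first, {}), inverted.get(nxt, {})
    for doc_id in sorted(first_list.keys() & next_list.keys()):
        following = set(next_list[doc_id])
        hits = [pos for pos in first_list[doc_id] if pos + 1 in following]
        if hits:
            result[doc_id] = hits
    return result, "inverted"

inverted, nextword = build_indexes(DOCS, num_firstwords=2)
print(phrase_pair("vector", "graphics", inverted, nextword))
print(phrase_pair("scalable", "vector", inverted, nextword))
```

In this toy collection the two most frequent terms are "vector" and "graphics", so pairs that start with either of them are answered from the nextword structure, while a pair such as "scalable vector" falls back to the inverted file.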

Approximated Representation of Proximity Statistics. While most research effort has gone into building auxiliary index structures that store exact statistics of proximity features, Elsayed et al. [35] proposed an approximation approach. Based on SDM (Equation 2.6), Elsayed et al. explored the hypothesis: “By using approximated term positions, it is possible to obtain space and efficiency gains with little sacrifice in effectiveness”. The core idea of the proposed approximation method is to divide the ordinal positions of terms into buckets. Instead of storing the exact position list for a term, only the IDs of the buckets in which the term appears are kept. There are two possible bucketing strategies: fixing the bucket width, or fixing the number of buckets. Different bucket creation methods may affect both efficiency and effectiveness. Moreover, the compression mechanism used may also be an important factor in exploring the trade-offs between effectiveness and efficiency. In order to consider both bucketing and compression factors, Elsayed et al. experimented with all four combinations and concluded that creating buckets with a fixed width of 20 obtains the best trade-off between effectiveness and efficiency.

Figure 2.5: An example of direct file representation of documents in Figure 2.2.

Alternatively, Huston et al. [46] proposed a sketch-based index in order to support term-dependency models, which is more robust. The proposed index structure makes use of the COUNTMIN sketch technique [31] in order to estimate n-gram frequency statistics. The main purpose of proposing a sketched index is to explore the space and time trade-offs when n-gram statistics are used in retrieval models. In previous work, such as building a term-pair index [13] or a hybrid phrase index [125], the vocabulary of the n-grams is often required to be stored as a part of the index. While the vocabulary size can be reduced by pruning infrequent phrases based on query logs, using the sketch method can avoid this step, because only hash values instead of actual terms are stored in the index structure. Huston et al. [46] empirically show that, by using the sketch technique, an optimal index structure with low space overhead and small errors can be constructed to support proximity-based models using n-gram features.
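The two approximation ideas in this subsection, bucketed positions and sketched n-gram counts, can be illustrated with the following minimal sketch. It is not the implementation of Elsayed et al. [35] or Huston et al. [46]: the function names, the ceiling-based width for the fixed-count strategy, the SHA-256 based hashing, and the sketch dimensions are all assumptions chosen for readability.

```python
import hashlib
from collections import Counter

def bucket_positions(positions, width=20):
    """Fixed bucket width: keep only the distinct bucket IDs of the positions."""
    return sorted({pos // width for pos in positions})

def bucket_positions_fixed_count(positions, doc_length, num_buckets):
    """Fixed number of buckets: the width is derived from the document length."""
    width = max(1, -(-doc_length // num_buckets))  # ceiling division
    return sorted({pos // width for pos in positions})

class CountMinSketch:
    """Count-Min sketch: depth rows of width counters, one hash per row."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # Assumed hashing scheme: SHA-256 of the row number and the key.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum over the rows
        # is an estimate that never falls below the true count.
        return min(self.table[row][col] for row, col in self._cells(key))

# Exact positions become coarse bucket IDs.
print(bucket_positions([3, 17, 25, 61, 64]))  # [0, 1, 3]
print(bucket_positions_fixed_count([3, 17, 25, 61, 64], doc_length=100, num_buckets=4))  # [0, 1, 2]

# Bigram statistics of document D3 (Figure 2.2), stored without a bigram vocabulary.
tokens = "scalable vector graphics images can be produced by the use of a vector graphics editor".split()
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
true_counts = Counter(bigrams)
sketch = CountMinSketch()
for bigram in bigrams:
    sketch.add(bigram)
print(sketch.estimate("vector graphics"), true_counts["vector graphics"])
```

Both structures trade exactness for space: bucket IDs lose the offset within a bucket, and the sketch may overestimate a frequency when different n-grams collide in every row.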

2.3.2 Representations Other than Inverted Lists

As briefly mentioned before, the naive and straightforward representation of a document is a direct file, which represents each document in a collection as a vector of term IDs, as shown in Figure 2.5. Traversing the term IDs is equivalent to reading the original document. Although this representation is used less often in the retrieval phase, it can be economical during the feature extraction stage. For example, Clarke et al. [24] extract and compute proximity statistics by concatenating all documents in the collection. Asadi and Lin [3] compared both representations for the feature extraction task and argued that a direct file is easier to extend when incorporating rich features such as named entity markups. Also, when a snippet generation task is required, keeping a direct file representation is crucial. A second choice for representing documents is a self-index structure [33,80]. Culpepper et al. proposed a self-index structure for retrieving documents based on bag-of-words retrieval models with high efficiency. Navarro and Mäkinen [77] gave a detailed introduction to self-indexing and its application in information retrieval.
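Returning to the direct file described at the start of this subsection, a minimal sketch of the representation is given below. The tokenizer and the rule for assigning term IDs (in order of first occurrence) are assumptions, so the IDs produced here need not match those in Figure 2.5.

```python
# The four example documents of Figure 2.2, keyed by document ID.
DOCS = {
    1: "Scalable Vector Graphics is an XML-based vector image format.",
    2: "The SVG Specification primarily focuses on vector graphics markup language.",
    3: "Scalable Vector Graphics images can be produced by the use of a vector graphics editor.",
    4: "All aspects of an SVG document can be accessed and manipulated using scripts in a similar way to HTML.",
}

def tokenize(text):
    """Assumed tokenizer: lowercase, split on whitespace, strip '.' and ','."""
    return [tok.strip(".,").lower() for tok in text.split() if tok.strip(".,")]

def build_direct_file(docs):
    """Return (vocabulary, direct_file): term -> term ID and doc ID -> [term IDs]."""
    vocabulary = {}   # term IDs are assigned in order of first occurrence
    direct_file = {}
    for doc_id in sorted(docs):
        direct_file[doc_id] = [
            vocabulary.setdefault(term, len(vocabulary))
            for term in tokenize(docs[doc_id])
        ]
    return vocabulary, direct_file

def decode(doc_id, vocabulary, direct_file):
    """Traverse the term IDs of a document to recover its term sequence."""
    id_to_term = {term_id: term for term, term_id in vocabulary.items()}
    return [id_to_term[term_id] for term_id in direct_file[doc_id]]

vocabulary, direct_file = build_direct_file(DOCS)
print(direct_file[1])                       # [0, 1, 2, 3, 4, 5, 1, 6, 7]
print(decode(1, vocabulary, direct_file))
```

Because the term order is preserved, positions and windows can be read off a document directly, which is what makes this layout convenient for feature extraction and snippet generation.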

Most work using self-indexing to represent documents can only support bag-of-words models, and few approaches can be applied to term-dependency models. In order to support query evaluation with term-dependency models, Petri et al. [80] proposed a hybrid structure consisting of an FM-INDEX

Structure Order Ordered Unordered Approx.

Basic Index Positional Inverted Index Any X X

Broschart and Schenkel [13] Term-pair Inverted Index Two X X

Williams et al. [125] Phrase Index Any X

Williams et al. [125] Nextword Index Any X

Elsayed et al. [35] Bucket-based Inverted Index Two X X X

Huston et al. [47] Sketch-based Inverted Index n X X

Petri et al. [80] Hybrid Self-index n X

Table 2.2: A summary of approaches to representing proximity statistics.

for the index structure, the proposed hybrid approach gives exactly the same top-k documents as an exhaustive retrieval regime.

2.3.3 Summary of Representing Proximity Feature Statistics

Although pre-computing proximity statistics is difficult, especially when higher-order proximity features are considered, indexing phrases or term-pairs is a viable approach. Table 2.2 summarizes different approaches to pre-computing proximity-based statistics. A plain positional index can also support any type of proximity-based retrieval model, but at a higher query evaluation cost than storing pre-computed proximity statistics.

However, supporting arbitrary proximity features is difficult, so most of the pre-computation focuses on a specific type of proximity feature. For example, when considering the order of proximity features, term-pair index structures have been proposed, regardless of whether the pairs are required to appear in an ordered window or not. When considering higher-order proximity features, ordered windows are often used. While one of the main ideas is to avoid the space overhead, other trade-offs have been explored for these index structures. For example, the tunable index proposed by Broschart and Schenkel is based on finding an optimal trade-off point between effectiveness and space. Petri and Moffat [79] compared five different index structures for phrases, including a direct file. Their results showed that efficiency largely depends on two factors: the number of results for a given query, and the smallest document frequency among all query terms. Therefore, no indexing technique is superior to all of the others, and more detailed studies on categorizing the proposed indexing structures are needed.
