CBR
As discussed in the previous section, typical-CBR involves the selection of best-clusters and documents, followed by a result integration stage. Let us assume that the best-clusters set has already been obtained, either automatically (i.e., by query-cluster matching using the centroid IIS) or manually (i.e., by browsing, as in a category-restricted query [30, 31]). Then, a typical ranking query evaluation algorithm, as shown in Algorithm 1, can be employed during the best-document selection stage. Finally, those documents that are not from the best-clusters are discarded from the query output. In this section, we propose different strategies for the best-document selection stage so that result integration can be achieved earlier during processing. We show that these strategies improve performance under certain conditions.
CHAPTER 3. SEARCH USING DOCUMENT GROUPS: TYPICAL-CBR 47
The query processing strategies discussed here differ in how they answer the following two questions: (i) at what point during best-document selection should the cluster-id(s) of a particular document be intersected with the best-cluster ids, and (ii) what kind of data structure should be used to keep the best-cluster ids? Considering the query evaluation shown in Algorithm 1, the cluster ids can be intersected at three different points, yielding three implementation alternatives: (i) before updating the accumulator entry for a document, (ii) before inserting a document into the min-heap, or (iii) after extracting the top scoring documents from the min-heap (i.e., the traditional baseline approach as described in [30, 31]). Two potential data structures for storing best-cluster ids are (i) a sorted array of best-cluster ids, and (ii) a 0/1 mark array in which the entries for best-clusters are 1 and all others are 0. We discuss these alternatives and their efficiency trade-offs in the following.
Intersect Before Update (IBU). In this approach (Algorithm 2), only those accumulator entries that belong to documents from best-clusters are updated. To achieve this, after a posting list is retrieved for a query term, the cluster to which each document in the posting list belongs is determined and intersected with the best-cluster set. If the document’s cluster is found in the best-cluster set, its accumulator entry is updated. Otherwise, there is no need to compute the partial query-document similarity or to update the accumulator for this particular document.
Algorithm 2 The query processing algorithm for the intersect before update (IBU) approach
Input: Query Q, Index I, Best-clusters BestClus, Document-category index I_DC
Output: Top-k best matching documents
1: for each term t in Q do
2:     Retrieve I_t from I
3:     for each posting (d, f_{d,t}) in I_t do
4:         Retrieve I_d from I_DC
5:         if I_d ∩ BestClus ≠ ∅ then
6:             DAcc[d] ← DAcc[d] + PartialSimilarity(d, Q)
7: Build a min-heap H of size k for nonzero DAcc entries
8: Extract top-k best-matching documents from H

Note that this alternative also increases the efficiency of the last two steps of the algorithm (i.e., building the heap and extracting from it, as shown in lines 7-8 of Algorithm 2), since all of the nonzero entries in the accumulator structure belong to documents from best-clusters. On the other hand, the performance of this approach crucially depends on the cost of determining the clusters to which a document belongs (line 4) and of the cluster-id intersection operation (line 5). For the former operation, the algorithm must access the document-cluster (DC) IIS for each element of the posting lists. However, if document-cluster associations are kept in main memory or cached efficiently, this cost can largely be avoided. This seems reasonable, since the DC-IIS can be expected to be relatively small and can be shared among several query processing threads. For instance, assuming that no document appears in more than one cluster, the main memory requirement to cache the entire DC-IIS would be O(N), i.e., on the order of the number of documents. In this study, without loss of generality, we assume that each document belongs to at most one cluster and that the DC-IIS is stored in main memory.
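The IBU strategy can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the names ibu_query, index and doc_cluster are hypothetical, the posting weights stand in for PartialSimilarity(d, Q), and the DC-IIS is assumed to be an in-memory array that maps each document id to its single cluster id.

```python
import heapq
from collections import defaultdict

def ibu_query(query_terms, index, doc_cluster, best_clusters, k):
    """Intersect Before Update: only documents from best clusters are scored."""
    acc = defaultdict(float)  # accumulator structure (DAcc)
    for t in query_terms:
        for d, weight in index.get(t, []):
            # Cluster lookup and intersection precede the accumulator update.
            if doc_cluster[d] in best_clusters:
                acc[d] += weight  # assumption: weight replaces PartialSimilarity
    # Every accumulator entry now belongs to a best cluster, so the
    # top-k selection needs no further filtering.
    return heapq.nlargest(k, acc.items(), key=lambda item: item[1])

# Toy data (hypothetical): term -> list of (doc id, partial score).
index = {"a": [(0, 1.0), (1, 2.0), (2, 0.5)], "b": [(1, 1.0), (3, 3.0)]}
doc_cluster = [0, 1, 0, 2]   # in-memory DC-IIS: doc id -> cluster id
best_clusters = {0, 2}       # hash-based set of best-cluster ids

top = ibu_query(["a", "b"], index, doc_cluster, best_clusters, k=2)
```

In this toy run, document 1 has the highest raw score but is never accumulated, because its cluster is not among the best-clusters.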
Assuming each document belongs to only one cluster, the cost of a cluster-id intersection is O(log S) if a sorted array of size S is used to store the best-cluster ids, and O(1) if a 0/1 mark array is used for this purpose. Note that the data structure for best-clusters can be a sorted array if the memory reserved per query is scarce and/or the total number of clusters is quite large. In this case, the document’s cluster id is searched within the best-clusters using binary search. A 0/1 mark array is obviously more efficient, but it is preferable only if memory is not a concern and/or the number of clusters is relatively small. Finally, if the number of best-clusters is relatively small, which is possible in a practical setup, a hash table can be used instead of a mark array to provide similar look-up efficiency with less space consumption.
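The three look-up structures discussed above can be compared in a short Python sketch. The helper names and the toy cluster ids are hypothetical; the mark array is modeled as a bytearray (one byte per cluster rather than one bit), and Python's built-in set plays the role of the hash table.

```python
from bisect import bisect_left

def in_sorted_array(best_sorted, cluster_id):
    """O(log S) membership test by binary search over a sorted id array."""
    i = bisect_left(best_sorted, cluster_id)
    return i < len(best_sorted) and best_sorted[i] == cluster_id

def make_mark_array(best_clusters, num_clusters):
    """0/1 mark array: O(1) look-up, space proportional to all clusters."""
    marks = bytearray(num_clusters)  # all entries start at 0
    for c in best_clusters:
        marks[c] = 1
    return marks

best_sorted = [3, 17, 42]                   # S = 3 best clusters, sorted
marks = make_mark_array(best_sorted, 100)   # 100 clusters in total
best_set = set(best_sorted)                 # hash table: expected O(1) look-up

hits = (in_sorted_array(best_sorted, 17), marks[17] == 1, 17 in best_set)
misses = (in_sorted_array(best_sorted, 18), marks[18] == 1, 18 in best_set)
```

The sorted array and the set need space proportional to S, whereas the mark array needs space proportional to the total number of clusters, which mirrors the trade-off described in the text.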
Intersect Before Insert (IBI). In this approach, instead of applying the cluster-id intersection for each doc-id in each posting list, we apply it once for each nonzero accumulator entry while building the heap (Algorithm 3). This alternative is preferable if the number of nonzero accumulator entries is expected to be low and/or the cost of cluster-id intersection is high, e.g., when the DC-IIS resides on disk.
Algorithm 3 The query processing algorithm for the intersect before insert (IBI) approach
Input: Query Q, Index I, Best-clusters BestClus, Document-category index I_DC
Output: Top-k best matching documents
1: for each term t in Q do
2:     Retrieve I_t from I
3:     for each posting (d, f_{d,t}) in I_t do
4:         DAcc[d] ← DAcc[d] + PartialSimilarity(d, Q)
5: for each DAcc[d] ≠ 0 do
6:     Retrieve I_d from I_DC
7:     if I_d ∩ BestClus ≠ ∅ then
8:         Insert d into the min-heap H of size k
9: Extract top-k best-matching documents from H
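A minimal Python sketch of the IBI strategy follows, under the same illustrative assumptions as before: ibi_query, index and doc_cluster are hypothetical names, posting weights stand in for PartialSimilarity(d, Q), and each document belongs to exactly one cluster.

```python
import heapq
from collections import defaultdict

def ibi_query(query_terms, index, doc_cluster, best_clusters, k):
    """Intersect Before Insert: filter once per nonzero accumulator entry."""
    acc = defaultdict(float)
    for t in query_terms:
        for d, weight in index.get(t, []):
            acc[d] += weight  # every posting is accumulated, no filtering yet
    heap = []  # min-heap of (score, doc) holding at most k entries
    for d, score in acc.items():
        # One cluster look-up per accumulator entry, not per posting.
        if score == 0 or doc_cluster[d] not in best_clusters:
            continue
        if len(heap) < k:
            heapq.heappush(heap, (score, d))
        elif (score, d) > heap[0]:
            heapq.heapreplace(heap, (score, d))  # evict the current minimum
    return [(d, s) for s, d in sorted(heap, reverse=True)]

# Toy data (hypothetical): term -> list of (doc id, partial score).
index = {"a": [(0, 1.0), (1, 2.0), (2, 0.5)], "b": [(1, 1.0), (3, 3.0)]}
doc_cluster = [0, 1, 0, 2]   # doc id -> cluster id
best_clusters = {0, 2}

top = ibi_query(["a", "b"], index, doc_cluster, best_clusters, k=2)
```

Here document 1 is fully scored across both posting lists and rejected only at heap-insertion time, which is the extra accumulator work IBI accepts in exchange for fewer cluster look-ups.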
Intersect After Extract (IAE). As illustrated in the example in Section 3.2.3.4, this is the simplest result integration approach and is probably the one employed in current systems (e.g., [30, 31]). Roughly, in this approach the best-document selection stage proceeds as in FS, and the elimination of documents that are not from best-clusters is performed at the very end. This approach allows an existing IR system using FS to easily adopt a clustering or classification structure on top of its document collection without any modification; in turn, however, it cannot utilize the best-clusters information while selecting best-documents. We still outline this strategy for the sake of completeness and to use it as a baseline in the evaluation of the strategies that we propose above and in the next section.
Algorithm 4 The query processing algorithm for the intersect after extract (IAE) approach
Input: Query Q, Index I, Best-clusters BestClus, Document-category index I_DC
Output: Top-k best matching documents
1: for each term t in Q do
2:     Retrieve I_t from I
3:     for each posting (d, f_{d,t}) in I_t do
4:         DAcc[d] ← DAcc[d] + PartialSimilarity(d, Q)
5: Build a min-heap H of size L (L ≥ k) for nonzero DAcc entries
6: ResultNum ← 0
7: while ResultNum < k and H is not empty do
8:     Extract d with the highest score from H
9:     Retrieve I_d from I_DC
10:     if I_d ∩ BestClus ≠ ∅ then
11:         Insert d into output, ResultNum ← ResultNum + 1
12: if ResultNum < k then
13:     Set L to some M s.t. M > L, go to Line 5

In this strategy (Algorithm 4), the entire query processing works as in Algorithm 1, and only at the end of the evaluation are the cluster-ids of the top-k documents intersected with the best-clusters. Of course, if some of those k documents are not from the best-clusters, then the build-heap and extraction steps must be repeated. To avoid such a repetition, the initial evaluation can be executed for the top-L documents, where L > k. In this case, the cost of cluster-id intersection is negligible, as it is postponed to the end of processing and L ≪ N. On the other hand, it is important to choose L appropriately: if L is much larger than k (e.g., L = N as an extreme case), the gains in the intersection stage are lost during the build-heap and extraction steps. If L is too small (i.e., very close to k), more than one iteration may be needed to find at least k documents that are in the best-clusters. Thus, the IAE alternative is useful if it can somehow be guaranteed that a small number of the highest scoring documents contains at least k documents from the best-clusters. More specifically, this approach would be better than the previous alternative (IBI) only if cluster intersection is costly, and better than the IBU algorithm if both the intersection test is expensive and too many nonzero accumulator entries arise.
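The effect of the choice of L can be seen in a minimal Python sketch of the IAE strategy. The names are hypothetical, posting weights stand in for PartialSimilarity(d, Q), and doubling L on a retry is an assumed policy; the text only requires that the new value exceed the old one. The sketch also returns the number of selection rounds so that the cost of an undersized L is visible.

```python
import heapq
from collections import defaultdict

def iae_query(query_terms, index, doc_cluster, best_clusters, k, L):
    """Intersect After Extract: filter only the extracted top-L documents."""
    acc = defaultdict(float)
    for t in query_terms:
        for d, weight in index.get(t, []):
            acc[d] += weight  # plain accumulation, no cluster information used
    candidates = [(s, d) for d, s in acc.items() if s != 0]
    rounds = 0
    while True:
        rounds += 1
        top = heapq.nlargest(L, candidates)  # top-L by score, descending
        results = [(d, s) for s, d in top
                   if doc_cluster[d] in best_clusters][:k]
        # Stop when k results are found or all candidates were examined.
        if len(results) >= k or L >= len(candidates):
            return results, rounds
        L *= 2  # assumption: any M > L would do; doubling is one choice

# Toy data (hypothetical): term -> list of (doc id, partial score).
index = {"a": [(0, 1.0), (1, 2.0), (2, 0.5)], "b": [(1, 1.0), (3, 3.0)]}
doc_cluster = [0, 1, 0, 2]   # doc id -> cluster id
best_clusters = {0, 2}

small_L = iae_query(["a", "b"], index, doc_cluster, best_clusters, k=2, L=2)
large_L = iae_query(["a", "b"], index, doc_cluster, best_clusters, k=2, L=4)
```

With L = 2 the first round yields only one document from the best-clusters, so a second selection round is needed; with L = 4 a single round suffices, at the price of a larger selection.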