Minimal Document Set Retrieval
Wei Dai
Department of Computer Science and Engineering
State University of New York at Buffalo Buffalo, New York 14260
[email protected]
Rohini Srihari
Department of Computer Science and Engineering
State University of New York at Buffalo Buffalo, New York 14260
[email protected]
ABSTRACT
This paper presents a novel formulation and approach to the minimal document set retrieval problem. Minimal Document Set Retrieval (MDSR) is a promising information retrieval task in which each query topic is assumed to have different subtopics; the task is to retrieve and rank relevant document sets with maximum coverage but minimum redundancy of subtopics in each set. For this task, we propose three document set retrieval and ranking algorithms: Novelty Based method, Cluster Based method and Subtopic Extraction Based method. In order to evaluate the system performance, we design a new evaluation framework for document set ranking which evaluates both relevance between set and query topic, and redundancy within each set. Finally, we compare the performance of the three algorithms using the TREC interactive track dataset. Experimental results show the effectiveness of our algorithms.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval Models – search process, clustering
General Terms
Algorithms, Experimentation.
Keywords
Information retrieval, Document set retrieval.
1. INTRODUCTION
The conventional ad-hoc information retrieval task is concerned with assimilating and ranking documents based on maximizing relevance to the user query. In reality, however, each query topic usually consists of many different subtopics. As a result, a relevant document may only cover, or be relevant to, one or at best a few subtopics. In order to get full coverage of the query topic, the user has to go through a long list of ranked documents. This is a time-consuming task, especially if the documents are not ordered by subtopic. With the explosion of online information, traditional IR systems which simply offer ranked document lists become increasingly insufficient for satisfying user information needs. Further steps become necessary to allow users to quickly fulfill their search criteria. Organizing retrieval results into semantically related clusters to facilitate browsing is one technique that has been used [14, 13]. In this paper, we explore a different approach by generating ranked minimal document sets which attempt to maximize the number of distinct subtopics related to the query while maintaining minimal redundancy within a set.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CIKM’05, October 31–November 5, 2005, Bremen, Germany.
Copyright 2005 ACM 1-59593-140-6/05/0010 ...$5.00.
The minimal document set ranking strategy, in a sense, tries to compose "big documents" and rank each big document according to both its coverage of, and its redundancy with respect to, all subtopics of the query. For example, a student doing a literature survey on “machine learning” may be most interested in finding documents that cover representative approaches to “machine learning”. Using a traditional ranking strategy, the user may only see a few of the most popular “machine learning” approaches, such as SVM and ANN, in the top ranked documents, and has to scan through most of the ranked documents to discover all representative approaches to “machine learning”. The minimal document set ranking strategy, by implicitly or explicitly modeling the different “machine learning” approaches, instead offers information to the user as a ranked list of document sets, in which each document set tries to cover all “machine learning” approaches while keeping information redundancy within the set minimal. Finally, each document set is ranked by a combined value of its subtopic coverage, subtopic redundancy and relevance to the topic.
Carbonell [2] proposed the Maximal Marginal Relevance (MMR) criterion for combining query-relevance with information novelty in the context of text retrieval and summarization. Zhai [16] explicitly modeled this problem as the subtopic retrieval problem and evaluated several methods for performing subtopic retrieval by using statistical language models and a maximal marginal relevance (MMR) ranking strategy. In the interactive track of TREC-6, TREC-7 and TREC-8, the same problem was explored. They refer to it as Aspect Retrieval, which focuses on studying how an interactive retrieval system can best support users in gathering information about the different aspects of a topic [8, 9, 4]. The recent TREC question-answering (QA) tracks have introduced definition type questions which call for various aspects of a definition [11]. More recently, Zhang et al. [17] proposed a graph and fuzzy set based data mining approach to model the semantic relationships among the document set. The model is applicable to both the text domain and the image domain.
MDSR and all of the above methods attempt to capture information novelty/redundancy among documents. MDSR further assumes that documents must be ranked with respect to an existing set of relevant documents. We design a new document set evaluation framework which is based on traditional relevance-based precision-recall evaluation metrics; we show that this framework subsumes the traditional single document ranked list evaluation method.
In this paper, we propose three document set generation algorithms. The first is the Novelty Based method, which generates document sets from relevant documents according to the novelty score between two documents. The second is the Cluster Based method, which generates clusters from retrieved documents and combines them into document sets. The third, which we call the Subtopic Extraction Based method, explicitly extracts subtopics from retrieved documents and uses those subtopics as dimensions to generate ranked document sets. Experimental results show that Subtopic Extraction Based retrieval gives the best performance.
2. RELATED WORK
There are two areas of information retrieval research that provide the theoretical foundation and empirical techniques for our models: novelty detection and document clustering.
Our novelty based document set model is closely related to Topic Detection and Tracking (TDT) [1] and to Novelty and Redundancy Detection in Adaptive Filtering research [18]. The former monitors a stream of chronologically ordered documents, and the latter addresses the problem of extending an adaptive information filtering system to make decisions about the novelty and redundancy of relevant documents.
Document clustering methods are used to organize collections around topics. Each cluster is assumed to be representative of a topic. Document clustering techniques are also employed in Topic Detection and Tracking research [10, 12]. We base our cluster based model on this research direction. Online clustering is another document clustering research direction, in which search results are clustered into different groups to facilitate the user’s browsing. Vivisimo.com is a real world application of this technique. We adapt some techniques from this research direction into our subtopic extraction based model.
To the best of our knowledge, there has been no research conducted specifically on Minimal Document Set Retrieval (as defined here), so this work represents a pilot study.
3. DATA SET
For the purpose of evaluation, we need a data set which identifies the subtopics of each query topic and provides truth judgments about which document covers which subtopic. We use two datasets to evaluate our system in this study. One is from the TREC interactive datasets. The other is from TREC ad-hoc topics, which are different from the topics in the TREC interactive dataset. We refer to the TREC interactive dataset as TREC-1 and to the second dataset as TREC-2 in this study. For the TREC-1 dataset, we collected all truth judgments for the three years of interactive TREC (TREC-6, TREC-7, TREC-8), the years when this track was running.
The datasets spanning three years contain 210,158 documents from the 1991-1994 Financial Times of London with an average length of roughly 400 words. There are 20 TREC topics in total. Figure 1 shows an example interactive TREC topic.
Number: 431i
Title: robotic technology latest developments use?
Instances:
In the time allotted, please find as many DIFFERENT developments of the sort described above as you can.
Please save at least one document for EACH such DIFFERENT development. If one document discusses several such developments, then you need not save other documents that repeat those, since your goal is to identify as many DIFFERENT developments of the sort described above as possible.
Figure 1: TREC interactive track topic example
For each topic, TREC assessors have identified several instances. Different instances reflect different aspects of the topic. For the above topic, they identified 45 subtopics in the relevant documents. Here are some instances:
1. medical robot helping with human surgery.
2. water-jet cutting robots.
3. robots used in engine assembly.
4. apply metallic paints to parts for a computer.
5. controlling inventory - storage devices.
...
Different topics may have different numbers of instances (subtopics). In this dataset, the number of subtopics ranges between 7 and 56, with an average of 20. For each document, the coverage of each subtopic is represented as a bit vector, as in the following example:
FT911-129 111111000000000000000000000...
FT911-133 000000110000000000000000000...
FT911-135 000100011100000000000000000...
FT941-1242 001101000000000000000000000...
...
The above example indicates that document FT911-129 covers six different instances, and FT911-133 covers two different instances. Different retrieval topics also have different numbers of relevant documents. In this dataset, the number of relevant documents ranges from 5 to 100, with an average of 40 documents per topic. More detailed information about this dataset can be found in the TREC interactive track reports [8, 9, 4].
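The bit-vector judgments described above are straightforward to load. The following is a minimal sketch, assuming one whitespace-separated `doc_id bits` pair per line; the function name `parse_judgments` and the shortened sample vectors are hypothetical and only illustrate the format.

```python
def parse_judgments(lines):
    """Map each document id to the set of 1-based subtopic indices it covers."""
    coverage = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        doc_id, bits = parts
        coverage[doc_id] = {i + 1 for i, b in enumerate(bits) if b == "1"}
    return coverage


# Hypothetical, shortened vectors in the style of the example above.
sample = [
    "FT911-129 111111000000",
    "FT911-133 000000110000",
    "FT911-135 000100011100",
]
judgments = parse_judgments(sample)

# Subtopics covered by at least one of the sample documents.
covered_by_any = set().union(*judgments.values())
```

Representing each document as a set of subtopic indices makes unions (coverage of a document set) and intersections (overlap between a document and a set) one-line operations.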
We created a second dataset (referred to as TREC-2) by asking two students, who were otherwise unaffiliated with our research, to provide truth judgments on a total of ten ad-hoc topics selected from TREC-8 (topics 401-450). The number of judged relevant documents ranges between 65 and 135, with an average of 120 relevant documents per topic; the number of subtopics per topic ranges between 17 and 51, with an average of 26.
4. PROBLEM FORMALIZATION
In order to generate the minimum document set, we need a ranking system which can generate document sets using combined criteria of relevance and redundancy. Here, relevance and redundancy are not two conflicting concepts; rather, they belong to two different dimensions. Relevance is the relationship of the query topic with the retrieved document sets; redundancy is based on the relationship among documents inside each generated document set.

Collection: d1: 111, d2: 100, d3: 010, d4: 001, d5: 000

Evaluation   Document sets                Σ C_optimal   Sub-Recall   Sub-Precision
Rank-1       {d1}, {d2, d3, d4}           2             1            1
Rank-2       {d1, d2, d3, d4}             2             0.5          0.5
Rank-3       {d1}, {d2}, {d3}, {d4}       2             1            0.5
Rank-4       {d1, d5}, {d2, d3, d4}       2             1            0.875
Rank-5       {d1, d2}, {d3, d4, d5}       2             0.83         0.62

Table 1: An example of computing Document Set Evaluation Metrics
We use the following notation throughout this paper:
• subtopic(d|q): the subtopic coverage of document d with respect to query q. When the query is clear from context, we abbreviate this as subtopic(d).
• subS(S|q): the subtopic coverage of document set S with respect to query q.
• relD(r(d|q)): ranked list of relevant documents according to query q and rank function r.
• relS(s(S|q)): ranked list of relevant document sets according to query q and set generation function s.
• redun(d|S, q): redundancy information between the document d and document set S given query topic q.
• redunS(S|q): redundancy information inside document set S given query q.
More precisely, we define our task as: for any given relevant document list relD(r(d|q)), generate a ranked list of relevant minimum document sets relS(s(S|q)) according to the subtopic coverage subS(S|q) of each document set and the redundant information inside each set redunS(S|q).
The key for our task here is to find an appropriate subtopic coverage function subS(S|q) and subtopic redundancy func- tion redunS(S|q).
5. DOCUMENT SET EVALUATION METRICS
We designed our evaluation metrics based on the precision and recall of traditional IR evaluation methods. Our document set evaluation task, however, is more complicated than traditional IR evaluation. Relevance of a document set to a query topic is not a simple binary relation, but a partial relationship. The relevance value is decided by the subtopic coverage of this document set. In order to evaluate the redundancy factor, we also need to add a penalty to inhibit redundancy in a set. Generally, our evaluation metrics need to achieve two goals: to evaluate partial relevance of a document set to a query topic, and to penalize redundancy inside document sets based on the minimality criterion.
Precisely, document set precision and recall measures are defined as follows:
Set Coverage: The subtopic coverage of a set is the fraction of the total number of subtopics (N) that this document set covers:

$$C_s = \frac{\left|\bigcup_{i=1}^{|S|} \mathrm{subtopic}(d_i)\right|}{N} \qquad (1)$$

Subtopic Recall: The sum of the set coverage values of all retrieved document sets, compared to the sum of the optimal set coverage values $C_i$ in the collection:

$$\mathit{Sub\_Recall} = \frac{\sum C_s}{\operatorname{argmax}\left(\sum C_i\right)} \qquad (2)$$

Set Precision: Considering the subtopic redundancy factor inside a set, we define the Set Precision as:

$$P_s = \frac{\left|\bigcup_{i=1}^{|S|} \mathrm{subtopic}(d_i)\right|}{N + \sum_{i=1}^{|S|} \mathrm{cost}(d_i)} \qquad (3)$$

Cost: The cost of adding a document into a set S:

$$\mathrm{cost}(d) = \begin{cases} 1 & \text{if } \mathrm{subtopic}(d) = 0 \\ \tau \cdot \mathrm{redun}(d|S) & \text{otherwise} \end{cases} \qquad (4)$$

Subtopic Precision: The average set precision of the K retrieved document sets:

$$\mathit{Sub\_Precision} = \frac{\sum_{i=1}^{K} P_s}{K} \qquad (5)$$
Adding any document which contains no relevant subtopics or contains relevant but redundant subtopics given the set S will be penalized using this cost function, where τ is a parameter to allow adjusting of redundancy influence. We set τ to 1 in our experiment.
Using the above definitions, our document set evaluation metrics subsume traditional IR single document rank evaluation. The latter is the special case in which each document set contains only one document and there is only one subtopic for each query topic.
6. COMPUTING THE METRICS
In order to compute the subtopic recall of document sets, we need to first calculate the optimal value $\operatorname{argmax}(\sum C_i)$, that is, the sum of maximum set coverage values in all relevant documents. Furthermore, we can see that $\operatorname{argmax}(\sum C_i)$ is the sum of the set coverage values of the single relevant documents. Because any document set containing subtopic overlap will cause relevance value to be lost in $C_i$, the sum of the single relevant document values gives the maximum subtopic recall value. In calculations, we use the number of subtopic overlaps between the subtopic coverage of one document and the subtopic coverage of a document set for redun(d|S) in the cost function.
We show an example of computing the Document Set Evaluation metrics in Table 1. We build a small document collection which has 5 documents. For a given query topic, each document has its subtopic truth judgment represented as a bit vector. For example, d1 covers all three subtopics; d2, d3 and d4 each cover a different subtopic; and d5 is not relevant to this topic. Rank-1 to Rank-5 show different representative document set rankings. Rank-1 generates a perfect document set ranking with full subtopic coverage and no subtopic redundancy in each set; therefore, it has value 1 for both Subtopic Recall and Subtopic Precision. Rank-2 puts all relevant documents into one set, so the subtopic redundancy causes a decrease in both Subtopic Recall and Subtopic Precision. Rank-3 consists of a single ranked list of documents; the poor subtopic coverage of each set results in poor performance on Subtopic Precision. Rank-4 shows that adding any non-relevant document to a document set causes a decrease in Subtopic Precision. Rank-5 is a general imperfect ranking which shows both subtopic redundancy and poor subtopic coverage in each set; therefore it gets poor Subtopic Recall and Subtopic Precision. This example illustrates that our document set evaluation metrics respond sensibly to each of these representative kinds of document set ranking. For the document set generation and ranking task, it should be noted that in reality most relevant documents cannot be composed into perfect, full coverage subtopic sets. For example, if we change the example in Table 1 so that d2, d3 and d4 all cover one and the same subtopic, then the best Subtopic Precision we can get is 0.5. This is an intrinsic difficulty of the dataset. Another nice property of this document set evaluation metric is that it generates the same precision and recall values as the traditional evaluation metrics if we assume a single subtopic and a single document ranking.
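The values in Table 1 can be reproduced with a few lines of code. The following is a minimal sketch, assuming that each document set is an ordered list, that redun(d|S) counts the subtopics of d already covered by the documents placed before it in the set, and that the optimal value is the sum of single-document coverage values as described above. The function names are hypothetical.

```python
def bits_to_set(bits):
    """Convert a bit string such as '101' into the set of covered subtopic indices."""
    return {i for i, b in enumerate(bits) if b == "1"}


def set_coverage(doc_set, judg, n):
    """Equation (1): fraction of the n subtopics covered by the set."""
    covered = set().union(*(judg[d] for d in doc_set))
    return len(covered) / n


def set_precision(doc_set, judg, n, tau=1.0):
    """Equations (3) and (4): coverage penalized by the cost of each document."""
    covered, cost = set(), 0.0
    for d in doc_set:
        subs = judg[d]
        if not subs:
            cost += 1.0                         # non-relevant document
        else:
            cost += tau * len(subs & covered)   # overlap with the set so far
        covered |= subs
    return len(covered) / (n + cost)


def sub_recall(ranking, judg, n):
    """Equation (2), with the optimum taken as the sum of single-document coverage."""
    optimal = sum(len(s) / n for s in judg.values())
    return sum(set_coverage(s, judg, n) for s in ranking) / optimal


def sub_precision(ranking, judg, n, tau=1.0):
    """Equation (5): mean set precision over the retrieved sets."""
    return sum(set_precision(s, judg, n, tau) for s in ranking) / len(ranking)


# The collection and rankings of Table 1.
judgments = {d: bits_to_set(b) for d, b in
             [("d1", "111"), ("d2", "100"), ("d3", "010"), ("d4", "001"), ("d5", "000")]}
rankings = {
    "Rank-1": [["d1"], ["d2", "d3", "d4"]],
    "Rank-2": [["d1", "d2", "d3", "d4"]],
    "Rank-3": [["d1"], ["d2"], ["d3"], ["d4"]],
    "Rank-4": [["d1", "d5"], ["d2", "d3", "d4"]],
    "Rank-5": [["d1", "d2"], ["d3", "d4", "d5"]],
}
results = {name: (sub_recall(r, judgments, 3), sub_precision(r, judgments, 3))
           for name, r in rankings.items()}
```

Under these assumptions, `results` agrees with the Sub-Recall and Sub-Precision rows of Table 1 up to rounding.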
7. MINIMAL DOCUMENT SET RETRIEVAL APPROACHES
A minimum document set retrieval system should include the following three stages:
1. Generate a ranked list of relevant documents.
2. Using this ranked document list, generate every possible document set which satisfies minimum redundancy and the best coverage of subtopics.
3. Rank the document sets according to the combined scores of subtopic coverage and subtopic redundancy of each set.
As can be observed from the above, the first stage is a well defined problem. The focus of our study is the second and the third stages; that is, our major tasks are to evaluate the coverage of subtopics and the redundancy information for each set. In some cases, the third stage is not necessary, since the second stage already performs the ranking.
Different ways of tackling the problem of subtopic and redundancy functions lead us to different approaches for the MDSR task. We propose three methods and detail them as follows. All our proposed methods assume we have a high Precision/Recall of relevant documents from the first stage. We use bag of words as the document representation. The traditional tf-idf word weighting scheme is used to calculate the distance function for the redundancy and cluster similarity measures:

$$w_{i,j} = f_{i,j} \cdot \log\frac{N}{n_i} \qquad (6)$$

The effectiveness and robustness of this measure have been demonstrated in previous studies.
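As an illustration of equation (6), the following minimal sketch computes tf-idf weights for a tokenized collection. It assumes the usual reading of the symbols, which the text does not spell out: f_{i,j} is the frequency of term i in document j, N is the number of documents, and n_i is the number of documents containing term i. The function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n_i: number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)  # f_{i,j}
        vectors.append({t: f * math.log(n_docs / doc_freq[t])
                        for t, f in term_freq.items()})
    return vectors


docs = [
    ["robot", "arm", "robot"],
    ["robot", "paint"],
    ["paint", "shop"],
]
vectors = tfidf_vectors(docs)
```

A term that occurs in every document receives weight zero under this scheme, so it contributes nothing to the similarity between documents.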
7.1 Novelty Based Method
After we retrieve a list of relevant documents from the first stage, the Novelty Based method generates document sets directly using the redundancy function; the document sets are ranked in the order in which their seeds (the first document in each set) are picked. There are many novelty detection methods, such as set difference, geometric distance and distributional similarity [18]. Geometric distance using the Cosine metric has been shown to be a very effective similarity metric for many tasks. We use it as our redundancy function. To adapt it more closely to subtopic novelty detection, we use the local context of the query words to represent the document, and measure the subtopic novelty between two documents by the Cosine distance between their local contexts.
We represent each document d as a vector $\vec{d} = (w_1, w_2, \cdots, w_t)$, where the $w_t$ are the words around the query terms within a certain window size. We calculate the similarity between two documents using the redundancy function:

$$\mathrm{redun}(d_i|d_t) = \cos(d_i, d_t) = \frac{\vec{d_i} \cdot \vec{d_t}}{|\vec{d_i}| \cdot |\vec{d_t}|} \qquad (7)$$
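The local-context representation and equation (7) can be sketched as follows. This is a minimal illustration under several assumptions that the text does not fix: the window is symmetric, query terms themselves are excluded from the context, and raw counts are used as weights in place of the tf-idf weights of equation (6). The helper names are hypothetical.

```python
import math
from collections import Counter


def local_context(tokens, query_terms, window=2):
    """Count the words within `window` positions of each query term occurrence."""
    context = Counter()
    for pos, tok in enumerate(tokens):
        if tok in query_terms:
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            context.update(t for t in tokens[lo:hi] if t not in query_terms)
    return context


def cosine(u, v):
    """Equation (7) for sparse vectors stored as {term: weight} mappings."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty context is treated as sharing nothing
    return dot / (norm_u * norm_v)


ctx_a = local_context("the surgical robot assists doctors".split(), {"robot"}, window=1)
ctx_b = local_context("a welding robot assists workers".split(), {"robot"}, window=1)
redundancy = cosine(ctx_a, ctx_b)
```

In this toy case the two contexts share one of their two words, so the redundancy score lies halfway between fully novel and fully redundant.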
Precisely, Novelty Based Method generates document sets as follows:
1. Generate the ranked list of documents relD(r(d|q)).
2. Pick the top ranked document as the seed and generate the initial set S{d1}.
3. Moving top down, pick the next document $d_t$ and calculate the redundancy of $d_t$ with respect to this set S as:

$$\mathrm{redunS}(d_t|S) = \operatorname{argmax}_{d_i \in S(d)} \mathrm{redun}(d_i|d_t) \qquad (8)$$

If redunS(dt|S) is less than a certain novelty threshold, add dt to set S; otherwise, ignore it. Repeat this step until the end of the relevant document list.
4. Remove all documents in set S from the ranked list, and go to step 2 to generate the next document set.
5. Rank the document sets in the order they are generated.
In step 3, the novelty threshold can be learned from the training data. Because we generate document sets by picking the most relevant document top down, intuitively ranking document sets by the order they are generated gives us the most relevant document sets as the top ranked sets (step 5).
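The five steps above can be sketched as a short greedy loop. This is a minimal illustration rather than the implementation used in the experiments: `novelty_based_sets` is a hypothetical name, documents are identified by hashable ids, and `redun` is any pairwise redundancy function, such as the cosine of equation (7). The redundancy of a candidate with respect to the set is taken as its largest pairwise redundancy with any member, following equation (8).

```python
def novelty_based_sets(ranked_docs, redun, threshold):
    """Greedily build document sets from a ranked list (steps 2 to 5)."""
    remaining = list(ranked_docs)
    doc_sets = []
    while remaining:
        current = [remaining[0]]                      # step 2: seed
        for cand in remaining[1:]:                    # step 3: scan top down
            max_redun = max(redun(member, cand) for member in current)
            if max_redun < threshold:
                current.append(cand)                  # novel enough, keep it
        chosen = set(current)
        remaining = [d for d in remaining if d not in chosen]   # step 4
        doc_sets.append(current)                      # step 5: generation order
    return doc_sets


# Toy redundancy: documents labeled with the same hypothetical topic are redundant.
topic_of = {"d1": "a", "d2": "a", "d3": "b", "d4": "b", "d5": "c"}


def toy_redun(x, y):
    return 1.0 if topic_of[x] == topic_of[y] else 0.0


sets = novelty_based_sets(["d1", "d2", "d3", "d4", "d5"], toy_redun, threshold=0.5)
```

Each pass removes at least the seed from the remaining list, so the loop always terminates, and every document ends up in exactly one set.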
7.2 Cluster Based Method
The cluster based method groups similar documents into clusters, then picks a document from each cluster to generate ranked document sets. To cluster documents, one must establish a pairwise measure of document similarity, then choose a clustering algorithm to group documents based on their similarity measure. There are many different distance measures, such as the Cosine measure, the Dice and Jaccard coefficients, and the overlap coefficient. We opted for the Cosine measure for document similarity in our experiment due to the robust performance reported in many research domains.
Clustering algorithms are usually divided into two classes: partitioning and hierarchical agglomerative clustering. Both of these have been studied in the context of IR [5, 6]. Partitioning algorithms (e.g. K-means) offer a more efficient way to cluster documents, but sacrifice some accuracy in arranging the documents. In contrast to partitioning algorithms, hierarchical agglomerative algorithms explore every possible inter-object distance and create a hierarchical cluster tree. Single linkage, complete linkage, group linkage, group average, centroid and Ward’s algorithm represent most of the agglomerative algorithms. The difference among them lies in how the similarity between clusters is defined. They have been extensively explored in previous studies [7]. We explore all seven of the agglomerative algorithms in our experiments.
Our Cluster Based algorithm consists of the following steps:
1. Generate ranked list of documents relD(r(d|q)).
2. Cluster documents into different subtopics by using the Cosine measure of equation (7) as the similarity measure. Keep the document relevance ranking order in each cluster during the clustering process.
3. Moving top down, pick one document from each cluster to generate a document set; document sets generated earlier are ranked higher.
We use a cluster to represent a subtopic. Granularity of the subtopic will be decided by setting an appropriate cluster threshold.
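Step 3 of the algorithm above can be sketched as follows, assuming the clustering of step 2 has already been done and each cluster is given as a list of document ids in relevance-rank order. The function name `sets_from_clusters` is hypothetical, and the order of documents inside a set (here, cluster order) is an assumption.

```python
def sets_from_clusters(clusters):
    """Build document sets by taking the k-th ranked document of every cluster.

    clusters: list of lists, each holding doc ids in relevance-rank order.
    The first set takes the top document of each cluster, the second set the
    next document of each cluster that still has one, and so on.
    """
    doc_sets = []
    depth = 0
    while True:
        current = [cluster[depth] for cluster in clusters if depth < len(cluster)]
        if not current:
            break
        doc_sets.append(current)
        depth += 1
    return doc_sets


# Three hypothetical clusters, each standing for one subtopic.
clusters = [["d1", "d4"], ["d2"], ["d3", "d5", "d6"]]
sets = sets_from_clusters(clusters)
```

Later sets shrink as small clusters are exhausted, which is one way the implicit one-subtopic-per-document assumption limits coverage.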
7.3 Subtopic Extraction Based Method
The subtopic based method explicitly searches for the subtopics in the ranked document list, then generates document sets and ranks them using these subtopics. We present the subtopic based method in two stages: the subtopic searching stage, and the document set generating and ranking stage.
Many studies have investigated subtopic searching. Zamir and Etzioni [14, 13] designed a Suffix Tree Clustering algorithm which creates a cluster by first identifying a common phrase from a set of documents, then generating a cluster using this common phrase. Zeng [15] took a similar approach, but further calculated several important properties to identify subtopics, and used them to rank subtopics by using machine learning techniques. We adopt Zeng’s method in our subtopic searching algorithm as follows:
1. Generate the ranked list of documents relD(r(d|q)).
2. Generate n-gram (where n <= 3 in our experiments) subtopic candidates by using the local context of query words in the ranked list of documents relD(r(d|q)).
3. Rank the subtopic candidates according to the y score of the following combined properties:
(a) tf*idf
(b) Phrase Independence: A phrase is independent when the entropy of its context is high according to [3] (i.e. the left and right contexts are random enough).

$$Ind = -\sum_{t \in d(w)} \frac{f(t)}{tf} \log \frac{f(t)}{tf} \qquad (9)$$

where f(t) refers to the right/left side term frequency. We calculate both left and right context independence values and average them as:

$$Ind = (Ind_l + Ind_r)/2$$

(c) Phrase length: Len = N
Combine the above properties linearly to generate the y score:

$$y = a \cdot tfidf + b \cdot Ind + c \cdot Len \qquad (10)$$

The intuition behind this formula is to find the best phrases to represent the subtopics. Parameters a, b and c can be learned from training data by using a machine learning approach, as Zeng [15] proposed.
4. Pick the top ranked phrases (in our experiment, the top 200 phrases), and represent each phrase as a vector of document weights $P_j = (w_1, w_2, \cdots, w_i)$, where $w_i$ is the tf.idf weight of phrase $P_j$ in document $d_i$; cluster those phrases using these vector values. This method is similar to similarity thesaurus building by global analysis. Since each subtopic might be expressed by different phrases, clustering the top ranked phrases gives a better representation of a subtopic as a cluster of phrases.

Evaluation Method   TREC-1   TREC-2
Traditional         0.392    0.348
Document Set        0.061    0.052

Table 2: A comparison of the average non-interpolated precision by different evaluation metrics on baseline dataset.
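The phrase scoring of step 3 can be sketched as follows. This is a minimal illustration, assuming that the neighbors of a phrase are given as lists of the words seen immediately to its left and right, so that f(t)/tf in equation (9) becomes the relative frequency of neighbor t. The function names and the default weights for a, b and c are hypothetical; the paper learns these weights from training data.

```python
import math
from collections import Counter


def context_entropy(neighbors):
    """Equation (9): entropy of the words observed on one side of a phrase."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((f / total) * math.log(f / total) for f in counts.values())


def independence(left_neighbors, right_neighbors):
    """Average of the left and right context entropies."""
    return (context_entropy(left_neighbors) + context_entropy(right_neighbors)) / 2


def y_score(tfidf, ind, length, a=1.0, b=1.0, c=1.0):
    """Equation (10): linear combination of the three phrase properties."""
    return a * tfidf + b * ind + c * length


# A phrase always preceded by the same word has zero left-context entropy.
fixed_left = context_entropy(["the", "the", "the"])
varied_left = context_entropy(["the", "a"])
ind = independence(["the", "a"], ["is", "is"])
```

A phrase whose neighbors vary widely scores a high independence value, which matches the intuition that it stands on its own as a unit rather than being a fragment of a longer phrase.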
Given the representation of each subtopic as a cluster of phrases, we continue with our document set generating and ranking algorithm:
1. Re-rank the relevant document list relD(r(d|q)) according to each subtopic; generate a vector of ranked document lists, with each ranked document list corresponding to a subtopic. (The standard VSM is used as the re-ranking function.)
2. Since the same ranking function is used to re-rank the document lists corresponding to each subtopic, the relevance scores over all subtopics are comparable. We pick the document which has the largest relevance value as a document set seed S{d1}. Check the coverage of this set S for every subtopic:

$$\mathrm{Cov}(subtopic|S) = \operatorname{argmax}_{d_i \in S(d)} \mathrm{rel}(d_i|sub) \qquad (11)$$

If the value of Cov(subtopic|S) is less than a certain threshold, we judge this subtopic as not covered by the set S; pick the top ranked document from this subtopic's ranked list and add it to set S. Repeat until all subtopics are covered by the set S, then remove the documents in S from all subtopic ranked lists.
3. Rank the document sets in the order they are generated.
This subtopic ranking strategy allows us to explicitly model the coverage and redundancy with respect to each subtopic. The set ranking also follows naturally from relevance, because each set is seeded with the most relevant remaining document.
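The set generation stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: relevance scores are given as a hypothetical mapping `rel[subtopic][doc_id]`, the process repeats until no documents remain, ties are broken alphabetically, and an uncovered subtopic always receives its best remaining document even if that document's score is itself below the threshold.

```python
def subtopic_based_sets(rel, threshold):
    """rel: {subtopic: {doc_id: score}}; returns document sets in generation order."""
    remaining = {d for scores in rel.values() for d in scores}
    doc_sets = []
    while remaining:
        # Seed: the remaining document with the largest score on any subtopic.
        seed = max(sorted(remaining),
                   key=lambda d: max(s.get(d, 0.0) for s in rel.values()))
        current = [seed]
        for scores in rel.values():
            # Equation (11): coverage is the best score of any member of the set.
            cov = max(scores.get(d, 0.0) for d in current)
            if cov >= threshold:
                continue
            candidates = [d for d in remaining if d not in current and d in scores]
            if candidates:
                current.append(max(sorted(candidates), key=lambda d: scores[d]))
        doc_sets.append(current)
        remaining -= set(current)   # remove the set from every subtopic list
    return doc_sets


# Two hypothetical subtopics with VSM-style relevance scores.
rel = {
    "surgery": {"d1": 0.9, "d2": 0.8, "d3": 0.1},
    "painting": {"d3": 0.7, "d4": 0.6, "d1": 0.05},
}
sets = subtopic_based_sets(rel, threshold=0.5)
```

Because every pass removes at least the seed, the loop terminates, and the earliest sets are built from the highest scoring documents.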
Cluster Method      Avg. Precision on TREC-1   Avg. Precision on TREC-2
Ward                0.122                      0.115
Single Link         0.073                      0.065
Average             0.098                      0.086
Weighted Average    0.089                      0.093
Centroid            0.085                      0.081
Median              0.082                      0.079
Complete Link       0.131                      0.126

Table 4: A comparison of the average non-interpolated precision of the seven cluster methods for document set retrieval.
Algorithm    TREC-1   Improve   TREC-2   Improve
Baseline     0.061    --        0.052    --
Novelty      0.128    +110%     0.10     +92%
Cluster      0.131    +114%     0.126    +142%
Sub-Extra    0.144    +136%     0.149    +186%

Table 3: A comparison of the average non-interpolated precision of the three algorithms for document set retrieval.
[Plot: 11-pt Avg. Recall-Precision by different evaluation metrics on baseline. Axes: Recall, Precision. Curves: Traditional Evaluation Metrics, Document Set Evaluation Metrics.]
Figure 2: Comparison of different evaluation metrics on baseline retrieval result.
[Plot: 11-pt Avg. Recall-Precision on TREC-1 dataset. Axes: Recall, Precision. Curves: Novelty, Clustering, Subtopic Extraction, BaseLine.]
Figure 3: Performance comparison of three document set generation algorithms on TREC-1 dataset
[Plot: 11-pt Avg. Recall-Precision on TREC-2 dataset. Axes: Recall, Precision. Curves: Novelty, Clustering, Subtopic Extraction, BaseLine.]
Figure 4: Performance comparison of three document set generation algorithms on TREC-2 dataset
8. EXPERIMENTS
8.1 Experiment Setup
Document set retrieval experiments use a two stage strategy. The Okapi formula is used for relevant document retrieval in the first stage. As is the practice, we pick the top two hundred documents from the first stage as our second stage data. The document set ranked list is generated using the Novelty Based, Cluster Based and Subtopic Extraction Based methods respectively in the second stage.
In order to evaluate the effectiveness of our proposed methods, we compare the results of the document set generation algorithms with a baseline, which is the relevant document ranking from the first stage, not processed by any document set generation method.
8.2 Threshold Learning
Granularity of the subtopics is task and user dependent; threshold tuning is necessary. For experiments using the Novelty Based method, the novelty threshold must be specified. For experiments using the Cluster Based method, we need to decide the cluster threshold. For experiments using the Subtopic Extraction Based method, all parameters used to rank candidate subtopics have to be learned, the subtopic clustering thresholds have to be decided, and the redundancy threshold for set generation must be decided. In our experiments, we randomly pick half of the topics as training data and tune the thresholds on this set. Then we test the performance on the other half.
8.3 Results and Discussion
We use the 11-point average Recall-Precision figure and the average precision at seen relevant documents as our evaluation measures. The 11-point average Recall-Precision figure allows us to evaluate quantitatively both the quality of the overall answer set and the breadth of the retrieval algorithm. The average precision at seen relevant documents measure favors systems which retrieve relevant documents quickly. Generating high quality document sets at the top ranks is an important property we want the document set generation system to have, so the comparison of different systems will be based on Average precision values.
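For reference, both measures can be sketched for the traditional binary-relevance case. This is a minimal illustration with hypothetical function names; how the measures are adapted to the partial relevance of document sets is defined by the metrics of Section 5 and is not reproduced here.

```python
def average_precision_at_seen(relevance):
    """Mean of the precision values observed at each relevant item in the ranking."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def eleven_point_precision(relevance, n_relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    points = []  # (recall, precision) observed at each relevant item
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / n_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolation: best precision at any recall at or above the level.
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]


ranking = [1, 0, 1, 0]   # relevant items at ranks 1 and 3
ap = average_precision_at_seen(ranking)
curve = eleven_point_precision(ranking, n_relevant=2)
```

Because precision is sampled only at relevant items, a system that places relevant items early is rewarded, which is the property the paragraph above relies on.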
Table 2 and Figure 2 compare the two evaluation methods on the first stage ad-hoc retrieval results. They show a significant difference between the evaluation methods. Due to the intrinsic difficulty of the subtopic problem, the document set evaluation value is much lower than the traditional evaluation metric. Table 3, Figure 3 and Figure 4 show the effectiveness of our document set generation algorithms on both datasets. From Table 3, we can see a significant improvement in Average precision over the baseline, which means that all algorithms effectively retrieved high quality relevant document sets at the top of the list.
We discuss the results and the implications for each document set generation algorithm as follows:
8.3.1 Novelty Based Method
Using a simple algorithm, the Novelty Based method generates document sets by considering the novelty score redunS(di|S) between documents. Experimental results showed that it is less effective than the other two algorithms. Looking at the document sets generated by this algorithm, we found that the top-ranked document sets are all very large. These sets may include highly redundant information, which would cause the decrease in performance. We conjecture that simply using the local context to decide the subtopic novelty between two documents is not robust enough. Since one subtopic may be represented by slightly different words, generating document sets based only on the novelty score redunS(di|S, q) is too sensitive. This is an intrinsic defect of the algorithm, and adjusting the threshold cannot avoid the problem. We also ran experiments using two different document representations: one used the local context of the query words and the other used the entire document. The local context showed better results.
8.3.2 Cluster Based Method
In this set of experiments, seven hierarchical agglomerative clustering algorithms are used for a cross-method comparison. Table 4 summarizes the results of these experiments and shows the average precision score for each clustering algorithm. From Table 4 we can see that, in general, the single link clustering algorithm gave the worst performance. The complete link and Ward clustering algorithms performed better than the others in our experiments.
The single linkage method defines the distance between two clusters as the smallest distance between any two objects, one from each cluster. The complete linkage method uses the largest such distance instead. These two methods represent two extremes, and the remaining methods represent some compromise between them. Because of the well-known chaining effect, single link clustering produces a small number of large, poorly linked clusters, whereas the complete link process produces a much larger number of small, tightly linked groupings. In our document set generation task, since each cluster implicitly represents a subtopic, a high level of similarity among the documents in a cluster is desirable. Because each item in a complete link cluster is guaranteed to resemble all other items in that cluster at the stated similarity level, complete link clustering may be better suited to the subtopic generation task than single link clustering, where similarities between items in a cluster may be very low.
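The chaining effect can be reproduced with a toy example. The sketch below is a naive agglomerative procedure over one-dimensional points, written only to contrast the two linkage definitions; the function names, the `cutoff` parameter, and the data are illustrative assumptions and are unrelated to the experimental setup.

```python
from itertools import combinations


def cluster_distance(a, b, linkage):
    """Single link: smallest cross-cluster distance; complete link: largest."""
    pair_dists = [abs(x - y) for x in a for y in b]
    return min(pair_dists) if linkage == "single" else max(pair_dists)


def agglomerate(points, linkage, cutoff):
    """Merge the closest pair of clusters until the closest pair exceeds cutoff."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), cluster_distance(clusters[i], clusters[j], linkage))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


chain = [0, 1, 2, 3, 4, 5]  # evenly spaced points forming a chain
single = agglomerate(chain, "single", cutoff=1.5)
complete = agglomerate(chain, "complete", cutoff=1.5)
```

With the same cutoff, single link chains every point into one cluster whose end points are far apart, while complete link stops at three tight pairs. If each cluster stands for a subtopic, the second outcome is the one in which all members of a cluster actually resemble each other.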
In general, the Cluster Based method represents the subtopics better and outperforms the Novelty Based method. A problem with the Cluster Based method is that it implicitly assumes each document belongs to only one subtopic, which is not always true. One document can cover many subtopics, as in both TREC datasets. The performance of the Cluster Based method may therefore vary considerably across datasets.
8.3.3 Subtopic Extraction Based Method
Experimental results show that the Subtopic Extraction Based method is the most effective of the three algorithms.
By explicitly extracting the subtopics, this approach allows a user to monitor the quality of the subtopics generated during the process. Since the interactive TREC dataset also offers a description of each instance (which we refer to as a subtopic here), we can compare the extracted subtopics with the instances provided by TREC for each topic. Clustering the top phrases to represent subtopics is another useful property, which offers an effective and robust representation for subtopics. We show an example of phrase clusters representing subtopics as follows:
Number: 431i
Title: robotic technology latest developments use?
Phrase clusters to represent different subtopics:
{surgeon ; surgic robot; patient; medical};
{painting ; robot paint};
{waterjet cut};
{computer control ; warehouse ; maintenance};
{final assemble ; robot install ; honda};
{chip ; device ; mbit ram ; semiconductor};
...
Extracting representative phrases for each subtopic is the key step for the success of this algorithm. Our algorithm performed well on topics like 431i. However, question answering style topics (for example, Number 326i: Any report of a ferry sinking where 100 or more people lost their lives) are very difficult to handle. More study is necessary to handle this kind of topic successfully.
Compared to the other two algorithms, the Subtopic Extraction Based method not only generates subtopics explicitly, but also allows users to specify the novelty threshold when generating document sets. It offers more control to the user, and may produce different results depending on different users’ needs.
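As a rough illustration of how explicit subtopics and a user-set threshold can interact, consider the sketch below. It is not the algorithm evaluated in this paper: we assume a document covers a subtopic when it contains any phrase from that subtopic's cluster, and we interpret the novelty threshold as the minimum fraction of a document's subtopics that must be new to the set. The phrase clusters are a subset of those shown for topic 431i; the document texts and all function names are hypothetical.

```python
def covered_subtopics(text, phrase_clusters):
    """Indices of the phrase clusters with at least one phrase in the text."""
    text = text.lower()
    return {idx for idx, phrases in enumerate(phrase_clusters)
            if any(p in text for p in phrases)}


def generate_set(ranked_texts, phrase_clusters, novelty_threshold):
    """Keep documents whose share of not-yet-covered subtopics meets the threshold."""
    seen, chosen = set(), []
    for i, text in enumerate(ranked_texts):
        topics = covered_subtopics(text, phrase_clusters)
        if not topics:
            continue  # no extracted subtopic matched this document
        novelty = len(topics - seen) / len(topics)
        if novelty >= novelty_threshold:
            chosen.append(i)
            seen |= topics
    return chosen, seen


phrase_clusters = [
    ["surgeon", "surgic robot", "patient", "medical"],
    ["painting", "robot paint"],
    ["waterjet cut"],
    ["chip", "device", "mbit ram", "semiconductor"],
]
ranked_texts = [  # hypothetical texts, for illustration only
    "A surgeon guides the arm while the patient is monitored.",
    "Medical teams and a painting line share the same robot supplier.",
    "The patient recovered after the surgeon finished.",
    "A waterjet cut replaces manual trimming of semiconductor parts.",
]
strict = generate_set(ranked_texts, phrase_clusters, novelty_threshold=1.0)
loose = generate_set(ranked_texts, phrase_clusters, novelty_threshold=0.5)
```

Under these assumptions, the strict setting yields a smaller set that misses the painting subtopic, while the looser setting accepts a partly redundant document and reaches full coverage. This is the kind of trade-off between redundancy and coverage that a user-adjustable threshold exposes.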
9. CONCLUSIONS AND FUTURE WORK
We have presented a novel subtopic-based document set generation and ranking task, Minimal Document Set Retrieval. We defined Minimal Document Set Retrieval as retrieving a ranked list of document sets and evaluating the list by considering the coverage and redundancy of these sets with respect to subtopics.
Three retrieval and ranking algorithms were proposed and discussed in this paper. The Novelty Based algorithm is a straightforward but less effective and less tunable method. The Cluster Based method offers users an implicit representation of each subtopic. The Subtopic Extraction Based method is by far the most effective and flexible method. By explicitly extracting subtopics and then generating document sets based on those subtopics, it allows users to (i) specify subtopic extraction thresholds, and (ii) adjust the redundancy threshold for document set generation. Experimental results demonstrate that all algorithms can effectively generate minimal document sets compared to the baseline.
Another contribution is the new evaluation framework for document set ranking, whose metrics comprise both relevance between a set and the query topic, and redundancy within each set. We believe the document set evaluation metrics form a generalized framework that can be used to subsume traditional relevance-based recall-precision metrics for single document ranking.
There are still many open problems for future research. Recent success in the language modeling approach to IR motivates us to consider applying this model to the document set generation task.
10. ACKNOWLEDGEMENT
This work is sponsored by NSF grant IIS-0325404 and a research grant from the FAA 032-G-009.
11. REFERENCES
[1] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking pilot study. Topic Detection and Tracking Workshop Report, 2001.
[2] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of SIGIR 1998, pages 335–336, 1998.
[3] L. F. Chien. PAT-tree-based adaptive keyphrase extraction for intelligent Chinese information retrieval. In Proceedings of 20th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, 1997.
[4] W. Hersh and P. Over. TREC-8 interactive track report. The Eighth Text Retrieval Conference (TREC-8), pages 57–64, 2000.
[5] A. Leuski and J. Allan. Improving interactive retrieval by combining ranked list and clustering. In Proceedings of RIAO, pages 665–681, 2000.
[6] A. Leuski and W. Croft. An evaluation of techniques for clustering search results. In Technical Report IR-76, 1996.
[7] N. Jardine and C. van Rijsbergen. The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, 1995.
[8] P. Over. TREC-6 interactive track report. The Sixth Text Retrieval Conference (TREC-6), pages 73–82, 1998.
[9] P. Over. TREC-7 interactive track report. The Seventh Text Retrieval Conference (TREC-7), pages 65–72, 1999.
[10] M. Spitters, R. Villa, and C. V. Rijsbergen. TNO at TDT2001: language model-based topic detection. In Topic Detection and Tracking Workshop Report, 2001.
[11] E. M. Voorhees. Overview of the TREC 2003 question answering track. In Proceedings of Text REtrieval Conference, 2003.
[12] J. Yamron, I. Carp, L. Gillick, S. Lowe, and P. V. Mulbregt. Topic tracking in a news stream. In Proceedings of the DARPA Broadcast News Workshop, 1999.
[13] O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In Proceedings of the 19th International ACM SIGIR Conference on Research and Development of Information Retrieval (SIGIR’98), pages 217–240, 1998.
[14] O. Zamir and O. Etzioni. Grouper: A dynamic clustering interface to web search results. In Proceedings of the Eighth International World Wide Web Conference (WWW8), 1999.
[15] H. Zeng, Q. He, Z. Chen, W. Ma, and J. Ma. Learning to cluster web search results. In Proceedings of SIGIR 2004, 2004.
[16] C. Zhai, W. W. Cohen, and J. Lafferty. Beyond independent relevance: Methods and evaluation metrics for subtopic retrieval. In Proceedings of SIGIR 2003, 2003.
[17] R. Zhang, Z. M. Zhang, and S. Khanzode. A data mining approach to modeling relationships among categories in image collection. In Proceedings of ACM KDD 2004, pages 749–754, 2004.
[18] Y. Zhang, J. Callan, and T. Minka. Novelty and redundancy detection in adaptive filtering. In Proceedings of SIGIR 2002, 2002.