
Minimal Document Set Retrieval

Wei Dai

Department of Computer Science and Engineering

State University of New York at Buffalo, Buffalo, New York 14260

Rohini Srihari

Department of Computer Science and Engineering

State University of New York at Buffalo, Buffalo, New York 14260

ABSTRACT

This paper presents a novel formulation and approach to the minimal document set retrieval problem. Minimal Document Set Retrieval (MDSR) is a promising information retrieval task in which each query topic is assumed to have different subtopics; the task is to retrieve and rank relevant document sets with maximum coverage but minimum redundancy of subtopics in each set. For this task, we propose three document set retrieval and ranking algorithms: Novelty Based method, Cluster Based method and Subtopic Extraction Based method. In order to evaluate the system performance, we design a new evaluation framework for document set ranking which evaluates both relevance between set and query topic, and redundancy within each set. Finally, we compare the performance of the three algorithms using the TREC interactive track dataset. Experimental results show the effectiveness of our algorithms.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Retrieval Models – search process, clustering

General Terms

Algorithms, Experimentation.

Keywords

Information retrieval, Document set retrieval.

1. INTRODUCTION

The conventional ad-hoc information retrieval task is concerned with assimilating and ranking documents based on maximizing relevance to the user query. In reality, however, each query topic usually consists of many different subtopics. As a result, a relevant document may only cover, or be relevant to, one or at best a few subtopics. In order to get full coverage of the query topic, the user has to go through a long list of ranked documents. This is a time consuming task, especially if the documents are not ordered by subtopic. With the explosion of online information, traditional IR systems which simply offer ranked document lists become increasingly insufficient for satisfying user information needs. Further steps become necessary to allow users to quickly fulfill their search criteria. Organizing retrieval results into semantically related clusters to facilitate browsing is one technique that has been used [14, 13]. In this paper, we explore a different approach by generating ranked minimal document sets which attempt to maximize the number of distinct subtopics related to the query while maintaining minimal redundancy within a set.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

CIKM’05, October 31–November 5, 2005, Bremen, Germany.

The minimal document set ranking strategy, in a sense, tries to compose big documents and rank each big document according to both its coverage of, and its redundancy over, all subtopics of the query. For example, a student doing a literature survey on “machine learning” may be most interested in finding documents that cover representative approaches to “machine learning”. Using a traditional ranking strategy, the user may see only a few of the most popular “machine learning” approaches, such as SVM and ANN, in the top ranked documents, and has to scan through most of the ranked documents to discover all representative approaches to “machine learning”. In contrast, the minimal document set ranking strategy, by implicitly or explicitly modeling different “machine learning” approaches, offers information to the user as a ranked list of document sets in which each document set tries to cover all “machine learning” approaches while keeping information redundancy within the set minimal. Finally, each document set is ranked by a combined value of its subtopic coverage, subtopic redundancy and relevance to the topic.

Carbonell [2] proposed the Maximal Marginal Relevance (MMR) criterion for combining query-relevance with information novelty in the context of text retrieval and summarization. Zhai [16] explicitly modeled this problem as the subtopic retrieval problem and evaluated several methods for performing subtopic retrieval by using statistical language models and a maximal marginal relevance (MMR) ranking strategy. The same problem was explored in the interactive track of TREC-6, TREC-7 and TREC-8, where it is referred to as Aspect Retrieval and focuses on studying how an interactive retrieval system can best support users in gathering information about the different aspects of a topic [8, 9, 4]. The recent TREC question-answering (QA) tracks have introduced definition type questions which call for various aspects of a definition [11]. More recently, Zhang et al. [17] proposed a graph and fuzzy set based data mining approach to model the semantic relationships among the document set. The model is applicable to both the text domain and image domain.

MDSR and all of the above methods attempt to capture information novelty/redundancy among documents. MDSR further assumes that documents must be ranked with respect to an existing set of relevant documents. We design a new document set evaluation framework which is based on traditional relevance-based precision-recall evaluation metrics; we show that this framework subsumes the traditional single document ranked list evaluation method.

In this paper, we propose three document set generation algorithms. The first is the Novelty Based method, which generates document sets from relevant documents according to the novelty score between two documents. The second is the Cluster Based method, which generates clusters from retrieved documents and combines them into document sets. The third, which we call the Subtopic Extraction Based method, explicitly extracts subtopics from retrieved documents and uses those subtopics as dimensions to generate ranked document sets. Experimental results show that Subtopic Extraction Based retrieval gives the best performance.

2. RELATED WORK

There are two areas of information retrieval research that provide the theoretical foundation and empirical techniques for our models: novelty detection and document clustering.

Our novelty based document set model is closely related to Topic Detection and Tracking (TDT) [1] and Novelty and Redundancy Detection in Adaptive Filtering research [18]. The former monitors a stream of chronologically ordered documents, and the latter addresses the problem of extending an adaptive information filtering system to make decisions about the novelty and redundancy of relevant documents.

Document clustering methods are used to organize collections around topics. Each cluster is assumed to be the representative of a topic. Document clustering techniques are also employed in Topic Detection and Tracking research [10, 12]. We base our cluster based model on this research direction. Online clustering is another document clustering research direction, in which search results are clustered into different groups to facilitate the user's browsing. Vivisimo.com is a real world application of this technique. We adapt some techniques from this research direction into our subtopic extraction based model.

To the best of our knowledge, there has been no research conducted specifically on Minimal Document Set Retrieval (as defined here), so this work represents a pilot study.

3. DATA SET

For the purpose of evaluation, we need a data set which shows the subtopics for each query topic and truth judgments about which document covers which subtopic. We use two datasets to evaluate our system in this study. One is from the TREC interactive datasets. The other is from TREC ad-hoc topics, which are different from the topics in the TREC interactive dataset. We refer to the TREC interactive dataset as TREC-1 and the second dataset as TREC-2 in this study. For the TREC-1 dataset, we collected all truth judgments for three years of interactive TREC (TREC-6, TREC-7, TREC-8), the years when this track was running.

The datasets spanning three years contain 210,158 documents from the 1991-1994 Financial Times of London with an average length of roughly 400 words. There are 20 TREC topics in total. Figure 1 shows an example interactive TREC topic.

Number: 431i

Title: robotic technology latest developments use?

Instances:

In the time allotted, please find as many DIFFERENT developments of the sort described above as you can.

Please save at least one document for EACH such DIFFERENT development. If one document discusses several such developments, then you need not save other documents that repeat those, since your goal is to identify as many DIFFERENT developments of the sort described above as possible.

Figure 1: TREC interactive track topic example

For each topic, TREC assessors have identified several instances. Different instances reflect different aspects of the topic. For the above topic, they identified 45 subtopics in the relevant documents. Here are some instances:

1. medical robot helping with human surgery.

2. water-jet cutting robots.

3. robots used in engine assembly.

4. apply metallic paints to parts for a computer.

5. controlling inventory - storage devices.

...

Different topics may have different numbers of instances (subtopics). In this dataset, the number of subtopics ranges between 7 and 56, with an average of 20. For each document, the coverage of each subtopic is represented as a bit vector, as in the following example:

FT911-129 111111000000000000000000000...

FT911-133 000000110000000000000000000...

FT911-135 000100011100000000000000000...

FT941-1242 001101000000000000000000000...

...

The above example indicates that document FT911-129 covers six different instances, and FT911-133 covers two different instances. Different retrieval topics also have different numbers of relevant documents. In this dataset, the number of relevant documents ranges from 5 to 100, with an average of 40 documents per topic. More detailed information about this dataset can be found in the TREC interactive track reports [8, 9, 4].
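
As an illustration of this representation, the following minimal sketch parses such coverage lines and counts the instances each document covers. The function name `parse_coverage` is hypothetical, and only the visible prefix of each truncated bit vector from the example above is used, so the counts hold for that prefix only.

```python
def parse_coverage(lines):
    """Map each document id to its list of 0/1 subtopic flags."""
    coverage = {}
    for line in lines:
        doc_id, bits = line.split()
        coverage[doc_id] = [int(b) for b in bits]
    return coverage


# Visible prefixes of the example vectors; the originals are truncated.
example = [
    "FT911-129 111111000000000000000000000",
    "FT911-133 000000110000000000000000000",
    "FT911-135 000100011100000000000000000",
    "FT941-1242 001101000000000000000000000",
]

coverage = parse_coverage(example)
# Number of instances (subtopics) covered by each document.
instances_covered = {doc: sum(bits) for doc, bits in coverage.items()}
```

For the first two documents this gives six and two covered instances, matching the description in the text.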

We created a second dataset (referred to as TREC-2) by asking two students, who were otherwise unaffiliated with our research, to provide truth judgments on a total of ten ad-hoc topics selected from TREC-8 (topics 401-450). The number of judged relevant documents ranges between 65 and 135, with an average of 120 relevant documents per topic; the number of subtopics per topic ranges between 17 and 51, with an average of 26.

4. PROBLEM FORMALIZATION

In order to generate the minimum document set, we need a ranking system which can generate document sets using combined criteria of relevance and redundancy. Here, relevance and redundancy are not two conflicting concepts; they belong to two different dimensions. Relevance is the relationship of the query topic with retrieved document sets; redundancy is based on the relationship among documents inside each generated document set.

Collection   Evaluation      Rank-1         Rank-2             Rank-3   Rank-4         Rank-5
d1: 111                      {d1}           {d1, d2, d3, d4}   {d1}     {d1, d5}       {d1, d2}
d2: 100                      {d2, d3, d4}                      {d2}     {d2, d3, d4}   {d3, d4, d5}
d3: 010                                                        {d3}
d4: 001                                                        {d4}
d5: 000
             ΣC_optimal      2              2                  2        2              2
             Sub-Recall      1              0.5                1        1              0.83
             Sub-Precision   1              0.5                0.5      0.875          0.62

Table 1: An example of computing Document Set Evaluation Metrics

We use the following notation throughout this paper:

• subtopic(d|q): the subtopic coverage of document d with respect to query q. When the query is clear from context, we abbreviate this as subtopic(d).

• subS(S|q): the subtopic coverage of document set S with respect to query q.

• relD(r(d|q)): ranked list of relevant documents according to query q and rank function r.

• relS(s(S|q)): ranked list of relevant document sets according to query q and set generation function s.

• redun(d|S, q): redundancy information between the document d and document set S given query topic q.

• redunS(S|q): redundancy information inside document set S given query q.

More precisely, we define our task as: for any given relevant document list relD(r(d|q)), generate a ranked list of relevant minimum document sets relS(s(S|q)) according to the subtopic coverage subS(S|q) of each document set and the redundant information inside each set redunS(S|q).

The key to our task is to find an appropriate subtopic coverage function subS(S|q) and subtopic redundancy function redunS(S|q).

5. DOCUMENT SET EVALUATION METRICS

We designed our evaluation metrics based on the precision and recall of traditional IR evaluation methods. Our document set evaluation task, however, is more complicated than traditional IR evaluation. Relevance of a document set to a query topic is not a simple binary relation, but a partial relationship. The relevance value is decided by the subtopic coverage of the document set. In order to evaluate the redundancy factor, we also need to add a penalty to inhibit redundancy in a set. Generally, our evaluation metrics need to achieve two goals: to evaluate the partial relevance of a document set to a query topic, and to penalize redundancy inside document sets based on the minimality criterion.

Precisely, document set precision and recall measures are defined as follows:

Set Coverage: The value of subtopic coverage of a set is decided by the fraction of the total subtopics (N) this document set covers:

$$C_s = \frac{\left|\bigcup_{i=1}^{|S|} \mathrm{subtopic}(d_i)\right|}{N} \tag{1}$$

Subtopic Recall: Sum of set coverage value of all retrieved document sets compared to sum of optimal set coverage value $C_i$ in the collection:

$$\mathrm{Sub\ Recall} = \frac{\sum C_s}{\operatorname{argmax}\left(\sum C_i\right)} \tag{2}$$

Set Precision: Considering the subtopic redundancy factor inside a set, we define the Set Precision as:

$$P_s = \frac{\left|\bigcup_{i=1}^{|S|} \mathrm{subtopic}(d_i)\right|}{N + \sum_{i=1}^{|S|} \mathrm{cost}(d_i)} \tag{3}$$

Cost: The cost of adding a document into a set S:

$$\mathrm{cost}(d) = \begin{cases} 1 & \text{if } \mathrm{subtopic}(d) = 0 \\ \tau \cdot \mathrm{redun}(d|S) & \text{otherwise} \end{cases} \tag{4}$$

Subtopic Precision: The average set precision of retrieved K document sets:

$$\mathrm{Sub\ Precision} = \frac{\sum_{i=1}^{K} P_{s_i}}{K} \tag{5}$$

Adding any document which contains no relevant subtopics or contains relevant but redundant subtopics given the set S will be penalized using this cost function, where τ is a parameter to allow adjusting of redundancy influence. We set τ to 1 in our experiment.

Using the above definitions, our document set evaluation metrics could be used to subsume traditional IR single document rank evaluation. The latter is a special case when each document set contains only one document, and there is only one subtopic for each query topic.
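
The special case can be checked directly from equations (1) to (5). The short derivation below is a sketch under two assumptions that the text does not state explicitly: τ = 1, and redun(d|S) = 0 when the set S contains only d.

```latex
% Single subtopic (N = 1); every set S = {d} holds exactly one document.
\[
P_s =
\begin{cases}
\dfrac{1}{1 + 0} = 1 & \text{if } d \text{ is relevant (covers the subtopic, no redundancy)} \\[6pt]
\dfrac{0}{1 + 1} = 0 & \text{if } d \text{ is non-relevant } (\mathrm{cost}(d) = 1)
\end{cases}
\]
% Averaging over the top K single-document sets gives precision at K.
\[
\mathrm{Sub\ Precision} = \frac{1}{K}\sum_{i=1}^{K} P_{s_i}
  = \frac{\#\{\text{relevant documents among the top } K\}}{K}
\]
% Each relevant document has C_s = 1, so recall is the usual ratio.
\[
\mathrm{Sub\ Recall} = \frac{\sum C_s}{\operatorname{argmax}\left(\sum C_i\right)}
  = \frac{\#\{\text{relevant documents retrieved}\}}{\#\{\text{relevant documents in the collection}\}}
\]
```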

6. COMPUTING THE METRICS

In order to compute the subtopic recall of document sets, we need to first calculate the optimal value $\operatorname{argmax}(\sum C_i)$, that is, the sum of maximum set coverage values in all relevant documents. Furthermore, we can see that $\operatorname{argmax}(\sum C_i)$ is the sum of set coverage values of single relevant documents. Because any document set containing subtopic overlap will cause the relevance value to be lost in $C_i$, the sum of single relevant document values will have the maximum subtopic recall value. In calculations, we use the number of subtopic overlaps between the subtopic coverage of one document and the subtopic coverage of a document set for redun(d|S) in the cost function.


We show an example of computing Document Set Evaluation metrics in Table 1. We build a small document collection which has 5 documents. For a given query topic, each document has the subtopic truth judgment represented as a bit vector. For example, d1 covers all three subtopics; d2, d3 and d4 each cover a different subtopic, and d5 is not relevant to this topic. Rank-1 to Rank-5 show different representative document set rankings. Rank-1 generates a perfect document set ranking with full subtopic coverage and no subtopic redundancy for each set; therefore, it has value 1 for both Subtopic Recall and Subtopic Precision. Rank-2 puts all relevant documents into one set, so the subtopic redundancy causes a decrease in both Subtopic Recall and Precision. Rank-3 consists of a single ranked list of documents; the poor subtopic coverage results in poor performance on Subtopic Precision. Rank-4 shows that adding any non-relevant document to a document set causes a decrease in Subtopic Precision. Rank-5 is a general imperfect ranking which shows subtopic redundancy and poor subtopic coverage in its sets; therefore it gets poor Subtopic Recall and Precision. With this example, we show that our document set evaluation metrics fairly evaluate all possible document set rankings.

For the document set generation and ranking task it should be noted that in reality, most relevant documents cannot be composed into the perfect, full coverage subtopic set. For example, if we change the above example in Table 1 so that d2, d3 and d4 cover only one and the same subtopic, then the best Subtopic Precision we can get is 0.5. This is an intrinsic difficulty of the dataset. Another nice property of this document set evaluation metric is that it generates the same precision and recall values as traditional evaluation metrics if we assume a single subtopic and single document ranking.
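
The Table 1 values can be reproduced with a short script. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, τ is set to 1 as in the experiments, and redun(d|S) is assumed to be counted against the documents that were added to the set before d.

```python
def covered(docs, truth):
    """Indices of the subtopics covered by the documents in `docs`."""
    return {i for d in docs for i, bit in enumerate(truth[d]) if bit == "1"}


def set_coverage(docs, truth, n):
    """Equation (1): fraction of the n subtopics covered by the set."""
    return len(covered(docs, truth)) / n


def set_precision(docs, truth, n, tau=1.0):
    """Equation (3), with the cost of equation (4) accumulated per document."""
    seen, total_cost = set(), 0.0
    for d in docs:
        subs = covered([d], truth)
        if not subs:
            total_cost += 1.0                     # non-relevant document
        else:
            total_cost += tau * len(subs & seen)  # overlap with earlier members
        seen |= subs
    return len(seen) / (n + total_cost)


def evaluate(ranking, truth, n, tau=1.0):
    """Return (Sub-Recall, Sub-Precision) for a ranked list of document sets."""
    optimal = sum(set_coverage([d], truth, n) for d in truth)
    recall = sum(set_coverage(s, truth, n) for s in ranking) / optimal
    precision = sum(set_precision(s, truth, n, tau) for s in ranking) / len(ranking)
    return recall, precision


# Collection and rankings from Table 1.
truth = {"d1": "111", "d2": "100", "d3": "010", "d4": "001", "d5": "000"}
rankings = {
    "Rank-1": [["d1"], ["d2", "d3", "d4"]],
    "Rank-2": [["d1", "d2", "d3", "d4"]],
    "Rank-3": [["d1"], ["d2"], ["d3"], ["d4"]],
    "Rank-4": [["d1", "d5"], ["d2", "d3", "d4"]],
    "Rank-5": [["d1", "d2"], ["d3", "d4", "d5"]],
}
results = {name: evaluate(r, truth, n=3) for name, r in rankings.items()}
```

Under these assumptions the script yields the Sub-Recall and Sub-Precision rows of Table 1, up to rounding.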

7. MINIMAL DOCUMENT SET RETRIEVAL APPROACHES

A minimum document set retrieval system should include the following three stages:

1. Generate a ranked list of relevant documents.

2. Using this ranked document list, generate every possible document set which satisfies minimum redundancy and the best coverage of subtopics.

3. Rank document set according to the combined scores of subtopic coverage and subtopic redundancy of set.

As observed above, the first stage is a well defined problem. The focus of our study is the second and third stages, that is, how to evaluate the coverage of subtopics and the redundancy information for each set. In some cases, the third step is not necessary since the second step performs the ranking.

Different ways of tackling the problem of subtopic and redundancy functions lead us to different approaches for the MDSR task. We propose three methods and detail them as follows. All our proposed methods assume we have a high Precision/Recall of relevant documents from the first stage. We use bag of words as the document representation.

The traditional tf-idf word weighting scheme is used to calculate the distance function for the redundancy and cluster similarity measures:

$$w_{i,j} = f_{i,j} \cdot \log\frac{N}{n_i} \tag{6}$$

The effectiveness and robustness of the above measure have been demonstrated in previous studies.
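
A minimal sketch of the weighting in equation (6) follows. It assumes the usual reading of the symbols, which the text does not spell out: f_{i,j} is the frequency of term i in document j, N is the number of documents, and n_i is the number of documents containing term i. The natural logarithm and the function name `tfidf_vectors` are also assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n_i: number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)  # f_{i,j} for this document
        vectors.append(
            {t: f * math.log(n_docs / doc_freq[t]) for t, f in term_freq.items()}
        )
    return vectors
```

A term that occurs in every document receives weight zero, so it contributes nothing to the Cosine similarity used later.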

7.1 Novelty Based Method

After we retrieve a list of relevant documents from the first stage, the Novelty Based method generates document sets directly using the redundancy function; document sets are ranked in the order of their seeds, where the seed is the first document used to generate a set. There are many novelty detection methods, such as set difference, geometric distance and distributional similarity [18]. Geometric distance using the Cosine metric has proven to be a very effective similarity metric for many tasks. We use it as our redundancy function. To adapt it more to subtopic novelty detection, we use the local context of query words to represent the document and measure the subtopic novelty between two documents by using the Cosine distance between local contexts.

We represent each document d as a vector $\vec{d} = (w_1, w_2, \cdots, w_t)$, where the $w_t$ are the words around the query terms within a certain window size. We calculate the similarity between two documents using the redundancy function:

$$\mathrm{redun}(d_i|d_t) = \cos(d_i, d_t) = \frac{\vec{d_i} \cdot \vec{d_t}}{|\vec{d_i}| \cdot |\vec{d_t}|} \tag{7}$$

Precisely, the Novelty Based Method generates document sets as follows:

1. Generate ranked list of documents relD(r(d|q)).

2. Pick the top ranked document as the seed and generate the initial set S{d1}.

3. Top down, pick the next document $d_t$ and calculate the redundancy of $d_t$ to this set S as:

$$\mathrm{redunS}(d_t|S) = \operatorname{argmax}_{d_i \in S} \mathrm{redun}(d_i|d_t) \tag{8}$$

If redunS(d_t|S) is less than a certain novelty threshold, add d_t into set S; otherwise, ignore it. Repeat this step until the end of the relevant document list.

4. Remove all documents in set S from the ranked list and go to step 2 to generate the next document set.

5. Rank document sets in the order they are generated.

In step 3, the novelty threshold can be learned from the training data. Because we generate document sets by picking the most relevant documents top down, intuitively ranking document sets by the order they are generated gives us the most relevant document sets as the top ranked sets (step 5).
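
The steps above can be sketched as follows. This is an illustrative sketch rather than the authors' code: documents are assumed to be given as sparse {term: weight} dictionaries built from the local context of the query words, and the names `cosine` and `novelty_based_sets` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors (equation 7)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def novelty_based_sets(ranked_docs, vectors, threshold):
    """Generate document sets from a ranked list; earlier sets rank higher."""
    remaining = list(ranked_docs)
    sets = []
    while remaining:
        current = [remaining[0]]  # step 2: the top remaining document is the seed
        for doc in remaining[1:]:  # step 3: walk the ranked list top down
            # Equation (8): highest similarity to any document already in the set.
            redundancy = max(cosine(vectors[doc], vectors[m]) for m in current)
            if redundancy < threshold:
                current.append(doc)
        # Step 4: remove the set's documents, then start the next set.
        remaining = [d for d in remaining if d not in current]
        sets.append(current)  # step 5: generation order is the ranking
    return sets
```

With a threshold of 0.5, two documents with identical context vectors end up in different sets, while a document with a disjoint context joins the first set.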

7.2 Cluster Based Method

The cluster based method groups similar documents into clusters then picks a document from each cluster to generate ranked document sets. To cluster documents, one must establish a pairwise measure of document similarity, then choose a clustering algorithm to group documents based on their similarity measure. There are many different distance measures, such as the Cosine measure, the Dice and Jaccard coefficients, and the overlap coefficient. We opted for the Cosine measure for document similarity in our experiment due to the robust performance reported in many research domains.


Clustering algorithms are usually divided into two classes: partitioning and hierarchical agglomerative clustering. Both of these have been studied in the context of IR [5, 6]. Partitioning algorithms (e.g. K-means) offer a more efficient way to cluster documents, but sacrifice some accuracy in arranging the documents. In contrast to partitioning algorithms, hierarchical agglomerative algorithms explore every possible inter-object distance and create a hierarchical cluster tree. Single linkage, complete linkage, group linkage, group average, centroid and Ward's algorithm represent most of the agglomerative algorithms. The difference among them lies in how the similarity between clusters is defined. They have been extensively explored in previous studies [7]. We explore all seven of the agglomerative algorithms in our experiments.

Our Cluster Based algorithm consists of the following steps:

1. Generate ranked list of documents relD(r(d|q)).

2. Cluster documents into different subtopics by using the Cosine measure of equation (7) as the similarity measure. Keep the document relevance ranking order in each cluster during the clustering process.

3. Top down, pick one document from each cluster to generate a document set; sets generated earlier rank higher.

We use a cluster to represent a subtopic. Granularity of the subtopic will be decided by setting an appropriate cluster threshold.
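
The set generation step can be sketched as a round-robin pass over the clusters. The sketch assumes the clustering has already been done and that each cluster lists its documents in relevance order; the function name `sets_from_clusters` is hypothetical.

```python
def sets_from_clusters(clusters):
    """clusters: list of document lists, each kept in relevance order.

    Each pass takes the best remaining document from every non-empty
    cluster, so the first generated set holds the top document of each
    cluster and ranks highest.
    """
    queues = [list(c) for c in clusters]  # copy so the input is not modified
    sets = []
    while any(queues):
        sets.append([q.pop(0) for q in queues if q])
    return sets
```

For example, clusters `[["d1", "d4"], ["d2"], ["d3", "d5", "d6"]]` produce the sets `["d1", "d2", "d3"]`, `["d4", "d5"]` and `["d6"]`; later sets cover fewer subtopics as small clusters run out of documents.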

7.3 Subtopic Extraction Based Method

The subtopic based method explicitly searches for the subtopics from the ranked document list, then generates document sets and ranks them using these subtopics. We present the subtopic based method in two stages: the subtopic searching stage, and the document set generating and ranking stage.

Many studies have investigated subtopic searching. Zamir and Etzioni [14, 13] designed a Suffix Tree Clustering algorithm which creates a cluster by first identifying a common phrase from a set of documents, then generates a cluster using this common phrase. Zeng [15] took a similar approach, but further calculated several important properties to identify subtopics, and used them to rank subtopics by using machine learning techniques. We adopt Zeng's method in our subtopic searching algorithm as follows:

1. Generate ranked list of documents relD(r(d|q)).

2. Generate n-gram (where n <= 3 in our experiments) subtopic candidates by using the local context of query words in the ranked list of documents relD(r(d|q)).

3. Rank subtopic candidates according to the y score of the following combined properties:

(a) tf*idf

(b) Phrase Independence: A phrase is independent when the entropy of its context is high according to [3] (i.e. the left and right contexts are random enough).

$$\mathrm{Ind} = -\sum_{t \in d(w)} \frac{f(t)}{tf} \log\frac{f(t)}{tf} \tag{9}$$

where f(t) refers to right/left side term frequency. We calculate both left and right context independence values and average them as:

$$\mathrm{Ind} = (\mathrm{Ind}_l + \mathrm{Ind}_r)/2$$

(c) Phrase length: Len = N

Combine the above properties linearly to generate the y score:

$$y = a \cdot tfidf + b \cdot \mathrm{Ind} + c \cdot \mathrm{Len} \tag{10}$$

The intuition for this formula is to find the best phrases to represent the subtopics. Parameters a, b and c can be learned from training data by using a machine learning approach as Zeng [15] proposed.

4. Pick the top ranked phrases (in our experiment, the top 200 phrases) and represent each phrase as a vector of document weights $P_j = (w_1, w_2, \cdots, w_i)$, where $w_i$ is the tf.idf weight of phrase $P_j$ in document $d_i$; cluster those phrases by using these vector values. This method is similar to similarity thesaurus building by global analysis. Since each subtopic might be composed of different phrases, by clustering top ranked phrases we can get a better representation of a subtopic using a cluster of phrases.

Evaluation Method   TREC-1   TREC-2
Traditional         0.392    0.348
Document Set        0.061    0.052

Table 2: A comparison of the average non-interpolated precision by different evaluation metrics on baseline dataset.
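
The phrase independence score of equation (9) and the linear combination of equation (10) can be sketched as below. The sketch assumes that the context of a phrase is given as the list of terms observed immediately to its left or right, one per occurrence, so that tf equals the length of that list. The function names and the default weights a = b = c = 1 are placeholders, not the learned values.

```python
import math
from collections import Counter


def context_entropy(neighbours):
    """Equation (9): entropy of the terms seen on one side of a phrase."""
    counts = Counter(neighbours)
    total = sum(counts.values())  # plays the role of tf
    return -sum((f / total) * math.log(f / total) for f in counts.values())


def independence(left_terms, right_terms):
    """Average of the left and right context entropies."""
    return (context_entropy(left_terms) + context_entropy(right_terms)) / 2


def y_score(tfidf, ind, length, a=1.0, b=1.0, c=1.0):
    """Equation (10): linear combination of the three phrase properties."""
    return a * tfidf + b * ind + c * length
```

A phrase that always appears next to the same word has zero entropy on that side, which lowers its independence and marks it as a likely fragment of a longer phrase.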

Given the representation of each subtopic as a cluster of phrases, we continue with our document set generating and ranking algorithm:

1. Re-rank the relevant document list relD(r(d|q)) according to each subtopic; generate a vector of ranked document lists, each ranked list corresponding to a subtopic. (A standard VSM is used as the re-ranking function.)

2. Since the same ranking function is used to re-rank the document lists corresponding to each subtopic, the relevance scores over all subtopics are comparable. We pick the document which has the largest relevance value as a document set seed S{d1}. Check the coverage of this set S for every subtopic:

$$\mathrm{Cov}(\mathrm{subtopic}|S) = \operatorname{argmax}_{d_i \in S} \mathrm{rel}(d_i|\mathrm{sub}) \tag{11}$$

If the value of Cov(subtopic|S) is less than a certain threshold, we judge this subtopic as not covered by the set S; pick the top ranked document from this subtopic's ranked list and add it into set S. Repeat until all subtopics are covered by the set S, then remove the documents in S from all subtopic ranked lists.

3. Rank document sets in the order they are generated.

This subtopic ranking strategy allows us to explicitly model the coverage and redundancy for each subtopic. The set ranking also follows naturally from relevance, since sets are generated in decreasing order of seed relevance.
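
The set generation stage can be sketched as follows. This is one possible reading of the procedure, not the authors' implementation: relevance scores are assumed to be given as a nested dictionary `rel[subtopic][doc]`, each set makes a single pass over the subtopics, and the function name is hypothetical.

```python
def subtopic_based_sets(rel, threshold):
    """rel: {subtopic: {doc: relevance score}}, scored by one shared function."""
    # One ranked list per subtopic, best document first.
    lists = {s: sorted(sc, key=sc.get, reverse=True) for s, sc in rel.items()}
    sets = []
    while any(lists.values()):
        # Seed: the remaining document with the largest score on any subtopic.
        seed = max((rel[s][d], d) for s, docs in lists.items() for d in docs)[1]
        current = [seed]
        for s, docs in lists.items():
            # Equation (11): best score of any set member on this subtopic.
            cov = max(rel[s].get(d, 0.0) for d in current)
            if cov < threshold and docs and docs[0] not in current:
                current.append(docs[0])  # cover the subtopic with its top document
        # Remove the set's documents from every subtopic list.
        for s in lists:
            lists[s] = [d for d in lists[s] if d not in current]
        sets.append(current)  # generation order is the ranking
    return sets
```

When no remaining document reaches the threshold for a subtopic, the sketch still adds the best available one, which follows the wording of step 2 but is a design choice the text leaves open.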


Cluster Method     Avg. Precision on TREC-1   Avg. Precision on TREC-2
Ward               0.122                      0.115
Single Link        0.073                      0.065
Average            0.098                      0.086
Weighted Average   0.089                      0.093
Centroid           0.085                      0.081
Median             0.082                      0.079
Complete Link      0.131                      0.126

Table 4: A comparison of the average non-interpolated precision of the seven cluster methods for document set retrieval.

Algorithm   TREC-1   Improve   TREC-2   Improve
Baseline    0.061    -         0.052    -
Novelty     0.128    +110%     0.10     +92%
Cluster     0.131    +114%     0.126    +142%
Sub-Extra   0.144    +136%     0.149    +186%

Table 3: A comparison of the average non-interpolated precision of the three algorithms for document set retrieval.

[Plot: 11-pt Avg. Recall-Precision by different evaluation metrics on baseline. Axes: Recall, Precision. Curves: Traditional Evaluation Metrics, Document Set Evaluation Metrics.]

Figure 2: Comparison of different evaluation metrics on baseline retrieval result.

[Plot: 11-pt Avg. Recall-Precision on TREC-1 dataset. Axes: Recall, Precision. Curves: Novelty, Clustering, Subtopic Extraction, BaseLine.]

Figure 3: Performance comparison of three document set generation algorithms on TREC-1 dataset

[Plot: 11-pt Avg. Recall-Precision on TREC-2 dataset. Axes: Recall, Precision. Curves: Novelty, Clustering, Subtopic Extraction, BaseLine.]

Figure 4: Performance comparison of three document set generation algorithms on TREC-2 dataset

8. EXPERIMENTS

8.1 Experiment Setup

Document set retrieval experiments use a two stage strategy. The Okapi formula is used for relevant document retrieval in the first stage. As is common practice, we pick the top two hundred documents from the first stage as our second stage data. The document set ranked list is generated using the Novelty Based, Cluster Based and Subtopic Extraction Based methods respectively in the second stage.

In order to evaluate the effectiveness of our proposed methods, we compare the results of the document set generation algorithms with a baseline, which is the relevant document ranking from the first stage without processing by any document set generation method.

8.2 Threshold Learning

Granularity of the subtopics is task and user dependent; threshold tuning is necessary. For experiments using the Novelty Based method, the novelty threshold must be specified. For experiments using the Cluster Based method, we need to decide the cluster threshold. For experiments using the Subtopic Extraction Based method, all parameters used to rank candidate subtopics have to be learned, subtopic clustering thresholds have to be decided, and the redundancy threshold for set generation must be decided. In our experiments, we randomly pick half of the topics as training data and tune the parameters on this set. Then we test the performance on the other half.
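
The split-and-tune protocol can be sketched as below. The text does not say how thresholds are searched, so the grid search over `candidates` is an assumption, as are the function name `tune_threshold` and the hypothetical callable `score(topic, threshold)`, which stands for the average precision obtained on one topic.

```python
import random


def tune_threshold(topics, candidates, score, seed=0):
    """Split topics in half at random and pick the best threshold on the training half.

    `score(topic, threshold)` is a hypothetical callable returning the
    evaluation value (for example average precision) for one topic.
    """
    shuffled = list(topics)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]
    # Grid search: keep the candidate with the best mean score on the training topics.
    best = max(candidates, key=lambda th: sum(score(t, th) for t in train) / len(train))
    return best, train, test
```

The returned `test` half is then evaluated once with the selected threshold, so the reported numbers are not tuned on the topics they are measured on.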

8.3 Results and Discussion

We use the 11-point average Recall-Precision figure and the average precision at seen relevant documents value as our evaluation strategy. The 11-point average Recall-Precision figure allows us to evaluate quantitatively both the quality of the overall answer set and the breadth of the retrieval algorithm. The average precision at seen relevant documents measure favors systems which retrieve relevant documents quickly. Generating high quality document sets in the top ranks is an important property we want the document set generation system to have, so comparison of different systems will be based on Average precision values.
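
For reference, the two measures in their traditional binary-relevance form can be sketched as follows. How they are adapted to partially relevant document sets is not detailed at this point in the text, so the sketch covers only the single-document case; the function names are hypothetical.

```python
def average_precision(relevant_flags):
    """Non-interpolated average precision at seen relevant documents."""
    hits, precisions = 0, []
    for rank, is_rel in enumerate(relevant_flags, start=1):
        if is_rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant document
    return sum(precisions) / len(precisions) if precisions else 0.0


def eleven_point_precision(relevant_flags, total_relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    points, hits = [], 0
    for rank, is_rel in enumerate(relevant_flags, start=1):
        hits += bool(is_rel)
        points.append((hits / total_relevant, hits / rank))  # (recall, precision)
    # Interpolation: best precision at any recall at or above the level.
    return [
        max((p for r, p in points if r >= level / 10), default=0.0)
        for level in range(11)
    ]
```

Because early relevant results contribute precision values close to 1, a ranking that places relevant items first scores higher on average precision than one that places the same items later.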

Table 2 and Figure 2 compare the two evaluation methods on the first stage ad-hoc retrieval results. They show a significant difference between the evaluation methods. Due to the intrinsic difficulty of the subtopic problem, the document set evaluation value is much lower than the traditional evaluation metric. Table 3, Figure 3 and Figure 4 show the effectiveness of our document set generation algorithms on both datasets. From Table 3, we can especially see a significant improvement in Average precision value over the baseline, which means all algorithms effectively retrieved high quality relevant document sets at the top of the list.

We discuss the results and the implication for each different document set generation algorithms as follows:

8.3.1 Novelty Based Method

Using a simple algorithm, the Novelty Based method generated document sets by considering the novelty score redunS(di|S) between documents. Experimental results showed it is less effective than the other two algorithms.

By looking at the document sets generated by this algorithm, we found that the top-ranked document sets are all very large. These sets may include highly redundant information, which causes the decrease in performance. We conjecture that simply using local context to decide the subtopic novelty between two documents is not robust enough. Since it is possible that one subtopic is represented by slightly different words, generating document sets based on only this novelty score redunS(di|S, q) is too sensitive. It is an intrinsic defect of this algorithm, and adjusting the threshold cannot avoid this problem. We also ran experiments using different document representations: one used the local context of the query words and another used the entire document. The local context showed better results.

8.3.2 Cluster Based Method

In this set of experiments, seven hierarchical agglomerative clustering algorithms are compared. Table 4 summarizes the results and shows the average precision score for each clustering algorithm. In general, the single link clustering algorithm gave the worst performance, while the complete link and Ward clustering algorithms performed better than the others in our experiments.

The single linkage method defines the distance between two clusters as the smallest distance between two objects, one from each cluster. The complete linkage method uses the largest distance instead. These two methods represent two extremes, and the other methods represent some compromise between them. Because of the well-known chaining effect, single link clustering produces a small number of large, poorly linked clusters, whereas complete link clustering produces a much larger number of small, tightly linked groupings. In our document set generation task, since each cluster implicitly represents a subtopic, a high level of similarity among documents in a cluster is desirable. Because each item in a complete link cluster is guaranteed to resemble all other items in that cluster at the stated similarity level, complete link clustering may be better adapted to the subtopic generation task than single link clustering, where similarities between items may be very low.
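The chaining effect can be reproduced with a small sketch. The code below is an illustrative threshold-based agglomerative procedure over one-dimensional points, not the clustering implementation used in the experiments; the function name, the points, and the distance threshold are assumptions chosen to make the contrast visible.

```python
def agglomerate(points, max_distance, linkage):
    """Merge clusters until no pair is within max_distance under the given linkage.

    Pass `min` as linkage for single link and `max` for complete link.
    """
    clusters = [[p] for p in points]

    def distance(a, b):
        # Cluster distance: smallest (single) or largest (complete) pairwise distance.
        return linkage(abs(x - y) for x in a for y in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        d, i, j = min(
            (distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [0, 1, 2, 3, 4, 5]  # evenly spaced, so each point is close to its neighbor
single = agglomerate(points, max_distance=1, linkage=min)
complete = agglomerate(points, max_distance=1, linkage=max)

print(single)    # [[0, 1, 2, 3, 4, 5]]
print(complete)  # [[0, 1], [2, 3], [4, 5]]
```

With single link, neighboring points chain into one cluster whose most distant members are 5 apart. With complete link, every pair inside a cluster stays within the threshold, which gives the smaller, tighter groups that suit a one-cluster-per-subtopic interpretation.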

In general, the Cluster Based method better represents the subtopics and outperforms the Novelty Based method. A problem with the Cluster Based method is that it implicitly assumes each document belongs to only one subtopic, which is not always true. One document can cover many subtopics, as is the case in both TREC datasets. The performance of the Cluster Based method may therefore vary considerably across datasets.

8.3.3 Subtopic Extraction Based Method

Experimental results show that the Subtopic Extraction Based method is the most effective of the three algorithms. By explicitly extracting the subtopics, this approach allows a user to monitor the quality of the subtopics generated during the process. Since the interactive TREC dataset also offers a description of each instance (which we refer to as a subtopic here), we can compare the extracted subtopics with the instances identified by TREC for each topic.

Clustering the top phrases to represent subtopics is another useful property, offering an effective and robust representation of subtopics. An example of phrase clusters representing subtopics follows:

Number: 431i

Title: robotic technology latest developments use?

Phrase clusters to represent different subtopics:

{surgeon ; surgic robot; patient; medical};

{painting ; robot paint};

{waterjet cut};

{computer control ; warehouse ; maintenance};

{final assemble ; robot install ; honda};

{chip ; device ; mbit ram ; semiconductor};

...

Extracting representative phrases for each subtopic is the key step for the success of this algorithm. Our algorithm performed well on topics like 431i; however, question answering style topics (for example, Number 326i: Any report of a ferry sinking where 100 or more people lost their lives) are very difficult to handle. More study is needed to handle this kind of topic successfully.

Compared to the other two algorithms, the Subtopic Extraction Based method not only explicitly generates subtopics but also allows users to specify the novelty threshold when generating document sets. It offers more control to the user and can produce different results based on different users' needs.
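How a user-specified novelty threshold can change the generated set is sketched below. This is a hypothetical greedy procedure, not the algorithm of Section 7.3: it assumes each document has already been mapped to the extracted subtopics it covers, and it accepts a document when the fraction of its subtopics not yet covered reaches the threshold. The document identifiers and subtopic labels are invented.

```python
def generate_set(ranked_docs, doc_subtopics, novelty_threshold):
    """Greedy sketch: add a document only if enough of its subtopics are new."""
    selected, covered = [], set()
    for doc in ranked_docs:
        subtopics = doc_subtopics.get(doc, set())
        if not subtopics:
            continue  # documents with no extracted subtopic contribute nothing
        new_fraction = len(subtopics - covered) / len(subtopics)
        if new_fraction >= novelty_threshold:
            selected.append(doc)
            covered |= subtopics
    return selected, covered


# Hypothetical mapping from documents to the subtopics they cover.
doc_subtopics = {
    "d1": {"surgery", "painting"},
    "d2": {"surgery"},
    "d3": {"painting", "waterjet"},
    "d4": {"assembly"},
}
ranked = ["d1", "d2", "d3", "d4"]

strict, strict_covered = generate_set(ranked, doc_subtopics, novelty_threshold=1.0)
loose, loose_covered = generate_set(ranked, doc_subtopics, novelty_threshold=0.5)

print(strict)  # ['d1', 'd4']
print(loose)   # ['d1', 'd3', 'd4']
```

In this sketch the strict threshold yields a smaller set with no repeated subtopic but misses "waterjet", while the looser threshold accepts some overlap in exchange for fuller coverage. This is the kind of trade-off a user could steer.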

9. CONCLUSIONS AND FUTURE WORK

We have presented a novel subtopic-based document set generation and ranking task, Minimal Document Set Retrieval. We defined Minimal Document Set Retrieval as retrieving a ranked list of document sets and evaluating the list by considering the coverage and redundancy of these sets with respect to subtopics.

Three retrieval and ranking algorithms were proposed and discussed in this paper. The Novelty Based algorithm is a straightforward but less effective and less tunable method. The Cluster Based method offers users an implicit representation of each subtopic. The Subtopic Extraction Based method is by far the most effective and flexible method. By explicitly extracting subtopics and then generating document sets based on them, it allows users to (i) specify subtopic extraction thresholds and (ii) adjust the redundancy threshold for document set generation. Experimental results demonstrate that all algorithms generate minimal document sets effectively compared to the baseline.

Another contribution is the new evaluation framework for document set ranking, whose metrics capture both the relevance between a set and the query topic and the redundancy within each set. We believe the document set evaluation metrics form a generalized framework that can be used to subsume traditional relevance-based recall-precision single document ranking metrics.

There are still many open problems for future research. The recent success of the language modeling approach to IR has motivated us to consider applying this model to the document set generation task.

10. ACKNOWLEDGEMENT

This work is sponsored by NSF grant IIS-0325404 and a research grant from the FAA 032-G-009.

11. REFERENCES

[1] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking pilot study.

Topic Detection and Tracking Workshop Report, 2001.

[2] J. Carbonell and J. Goldstein. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In Proceedings of SIGIR 1998, pages 335–336, 1998.

[3] L. F. Chien. Pat-tree-based adaptive keyphrase extraction for intelligent chinese information retrieval.

In Proceedings of 20th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, 1997.

[4] W. Hersh and P. Over. Trec-8 interactive track report.

The Seventh Text Retrieval Conference (TREC-8), pages 57–64, 2000.

[5] A. Leuski and J. Allan. Improving interactive retrieval by combining ranked list and clustering. In

Proceedings of RIAO, pages 665–681, 2000.

[6] A. Leuski and W. Croft. An evaluation of techniques for clustering search results. In Technical Report IR-76, 1996.

[7] N.Jardine and C. van Rijsbergen. The use of hierarchic clustering in information retrieval, Information Storage and Retrieval. 1995.

[8] P. Over. Trec-6 interactive track report. The Sixth Text Retrieval Conference (TREC-6), pages 73–82, 1998.

[9] P. Over. Trec-7 interactive track report. The Seventh Text Retrieval Conference (TREC-7), pages 65–72, 1999.

[10] M. Spitters, R. Villa, and C. V. Rijsbergen. Tno at tdt2001: language model-based topic detection. In Topic Detection and Tracking Workshop Report, 2001.

[11] E. M. Voorhees. Overview of the TREC 2003 question answering track. In Proceedings of Text REtrieval Conference, 2003.

[12] J. Yamron, I. Carp, L. Gillick, S. Lowe, and P. V.

Mulbregt. Topic tracking in a news stream. In

Proceedings of the DARPA Broadcast News Workshop, 1999.

[13] O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In Proceedings of the 19th International ACM SIGIR Conference on Research and Development of Information Retrieval

(SIGIR’98), pages 217–240, 1998.

[14] O. Zamir and O. Etzioni. Grouper: A dynamic clustering interface to web search results. In Proceedings of the Eighth International World Wide Web Conference (WWW8), 1999.

[15] H. Zeng, Q. He, Z. Chen, W. Ma, and J. Ma. Learning to cluster web search results. In Proceedings of SIGIR 2004, 2004.

[16] C. Zhai, W. W. Cohen, and J. Lafferty. Beyond independent relevance: Methods and evaluation metrics for subtopic retrieval. In Proceedings of SIGIR 2003, 2003.

[17] R. Zhang, Z. M. Zhang, and S. Khanzode. A data mining approach to modeling relationships among categories in image collection. In Proceedings of ACM KDD 2004, pages 749–754, 2004.

[18] Y. Zhang, J. Callan, and T. Minka. Novelty and redundancy dectection in adaptive filtering. In Proceedings of SIGIR 2002, 2002.
