A Document Expansion Model Based on Markov Network


Available at http://www.joics.com


Jiali ZUO^a,∗, Mingwen WANG^b, Jianyi WAN^b, Genxiu WU^c, Shuixiu WU^b

^a School of Elementary Education, Jiangxi Normal University, Nanchang 330027, China

^b School of Computer and Information Engineering, Jiangxi Normal University, Nanchang 330022, China

^c School of Mathematics and Information Sciences, Jiangxi Normal University, Nanchang 330022, China

Abstract

Information retrieval models still cannot achieve satisfactory performance after decades of development.

Much research has shown that adding useful information to information retrieval can improve retrieval performance. In this paper we propose a document expansion model based on the Markov network, which models the related information using a Markov network and performs document expansion by means of corpus information.

Unlike query expansion, which adds new terms to the query directly, document expansion adds expansion terms to the document and can avoid topic drift. Experimental results on Reuters-21578 show that our model can improve text classification performance.

Keywords: Markov Network; Information Retrieval; Document Expansion

1 Introduction

The retrieval model remains a central problem in information retrieval, although it has been studied by many researchers. Over the decades, no single retrieval model has proven to be the most effective. There are many reasons why most retrieval models cannot return satisfactory results to users; one of them is that some models, such as the vector space model, assume term independence. The vector space model uses vectors to represent documents and queries, which assumes that the appearance and importance of a term are independent of other terms [1]. The most widely used statistical language model is the unigram language model, which also assumes term independence [6]. However, it is obvious that the term independence assumption is not true in most

Project supported by the National Natural Science Foundation of China (No. 60963014, No. 61272212 and No. 61163006) and the Jiangxi Normal University Growing Foundation.

∗Corresponding author.

Email address: [email protected] (Jiali ZUO).


still require more study. Actually, there are some successful applications of Markov networks in information retrieval to model useful information, such as the Markov random field model for term dependencies proposed by Metzler [5,15]. In our former research, we used edges in a Markov network to model term relationships and proposed an information retrieval model based on the Markov network.

The method uses a Markov network to model a document, in which term information and the information retrieval model can be combined in a unified framework [18]. Other research shows that query reformulation based on this model can also obtain satisfactory performance [19].

As query reformulation may cause topic drift, in this paper we construct a document expansion model using a Markov network. Our model uses corpus information to perform document expansion and then performs retrieval with the Markov network information retrieval model.

The paper is organized as follows. Section 2 gives a brief introduction to the Markov network and its application in information retrieval. Section 3 introduces the document expansion model we propose. Section 4 presents the experimental results, and the conclusions are given in the final section.

2 Markov Network Information Retrieval Model

A Markov network is a kind of undirected graphical model that can model complex problems, especially those that contain a large number of variables. A Markov network consists of:

(1) A graph G(V, E), where V is the set of nodes over the whole collection and E is the set of edges between nodes, representing dependencies between random variables.

(2) A set of functions ψ(c), each of which is a non-negative function assigned to a clique of the graph G.

The key property of a Markov network is that any variable in the graph is independent of its non-neighbors given the observed values of its neighbors.

The key problem in using a Markov network for information retrieval is to construct a Markov network G representing the document D and the query Q, to estimate the joint distribution P over this graph, and then to use its value to measure the relevance of D to Q. According to the Markov property, the joint distribution over the graph G can be factorized over the cliques of the graph:


Fig.1: A simple example of the Markov network construction

$$P_G(Q, D) = \frac{1}{Z} \prod_{c \in C(G)} \psi(c). \tag{1}$$

where D = {t₁, t₂, ..., t_n} is a set of terms, Q is the query node, C(G) is the set of cliques of graph G, and the functions ψ(c) are referred to as factor potentials or clique potentials. Z normalizes the distribution and is hard to compute. For ranking purposes, we can infer that:

$$P_G(Q, D) \stackrel{rank}{=} \log P_G(Q \mid D) = \log \frac{P_G(Q, D)}{P_G(D)} \stackrel{rank}{=} \log P_G(Q, D) \stackrel{rank}{=} \sum_{c \in C(G)} \log \psi(c). \tag{2}$$

In formula 2, the general form of the potential function is ψ(c) = exp(λ_c f(c)), where λ_c is the weight of the clique type of c; its definition is decided by the task and by the cliques we consider. The feature function f(c) is a real-valued function over cliques. In practice, feature functions can take various forms, and multiple potential functions can also be defined for one clique. As shown in Figure 1, we consider only two kinds of cliques: C_QT₁ contains a query node Q and one term node that appears in the document, with an edge between the query node Q and the term node t_i (1 ≤ i ≤ n); C_QTₙ contains a query node Q and two or more term nodes t₁...t_n (n is the number of terms), with an edge between every pair of term nodes. In this paper, we set n = 2.
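The two clique types described above can be illustrated with a minimal Python sketch. The function name `build_cliques` and the `related_pairs` input are hypothetical; in particular, how edges between term nodes are selected is an assumption left to the caller, since it is not specified in this section.

```python
from itertools import combinations


def build_cliques(doc_terms, related_pairs):
    """Enumerate the two clique types for one document (n = 2).

    doc_terms:     iterable of terms appearing in the document.
    related_pairs: set of frozensets {t_i, t_j}, one per edge between two
                   term nodes (hypothetical input; edge selection is an
                   assumption of this sketch).
    """
    terms = sorted(set(doc_terms))
    # QT1: the query node Q plus a single term node.
    qt1 = [("Q", t) for t in terms]
    # QT2: the query node Q plus two term nodes joined by an edge.
    qt2 = [("Q", a, b) for a, b in combinations(terms, 2)
           if frozenset((a, b)) in related_pairs]
    return qt1, qt2


qt1, qt2 = build_cliques(
    ["road", "traffic", "city"],
    {frozenset(("road", "traffic"))},
)
```

Here every document term yields one QT1 clique, while only term pairs connected by an edge yield a QT2 clique.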

Then we can obtain the rank function:

$$P_G(Q, D) \stackrel{rank}{=} \sum_{c \in QT_1} \lambda_1 f_{T_1}(c) + \sum_{c \in QT_2} \lambda_2 f_{T_2}(c). \tag{3}$$
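As a rough illustration of formula 3, the ranking score is a weighted sum of feature values over the two clique sets. The sketch below assumes the feature values have already been computed and that the two weights sum to 1; the function name `rank_score` and the default weight 0.7 are hypothetical choices, not values from the paper.

```python
def rank_score(qt1_features, qt2_features, lambda_1=0.7):
    """Weighted sum of clique feature values, following formula 3.

    qt1_features: feature values f_T1(c), one per QT1 clique.
    qt2_features: feature values f_T2(c), one per QT2 clique.
    lambda_1:     weight of the single-term part (hypothetical default);
                  lambda_2 is taken as 1 - lambda_1.
    """
    if not 0.0 <= lambda_1 <= 1.0:
        raise ValueError("lambda_1 must lie in [0, 1]")
    lambda_2 = 1.0 - lambda_1
    return lambda_1 * sum(qt1_features) + lambda_2 * sum(qt2_features)
```

Setting `lambda_1=1.0` reduces the score to the single-term part alone, which is one way to obtain a term-independent baseline from the same function.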

Formula 3 shows that the ranking function has two parts: the information of QT₁, representing the relevance of individual terms, and the information of QT₂, representing the contribution of term relationships. λ₁ and λ₂ are the weights of the two parts respectively, where λ₁ + λ₂ = 1. If λ₁ is larger, model performance is mainly influenced by individual terms; if λ₂ is larger, the model gives higher weight to


3 Document Expansion Model Based on Markov Network

In information retrieval, if the terms in a document and a query are different, for example the query contains 'way' and the document contains only 'road', then the document may be judged irrelevant, although 'way' and 'road' have almost the same meaning. The reason is that the retrieval model sets the relevance of this document to 0. To solve this problem, many studies use query expansion to add the term 'road' and other terms to the query, so that the relevance of this document is no longer 0 and it can be judged relevant. However, because query expansion adds new terms to queries directly, it may add too much unnecessary information and cause topic drift. We therefore perform document expansion rather than query expansion, adding information to the document; the details follow.

3.1 Document expansion model

As mentioned above, we can expand a document by adding terms to it; the problem is which terms to add. In the earlier example, where the query contains 'way' and the document contains only 'road', the term to add to the document is 'way', because 'way' and 'road' have the same meaning. To perform document expansion automatically, we use the corpus to obtain expansion terms. Given a document D and a corpus C, we compute the whole document model according to formula 5:

$$D_c = \frac{1}{N_D} \sum_{j} d_j. \tag{5}$$

In formula 5, all documents are given the same weight. After obtaining D_c, we add it to document D and obtain the new document model D′ using formula 6:

$$D' = \lambda_D D + \lambda_{D_1} D_1. \tag{6}$$

The new document model thus contains two parts, the original document model and the background corpus information, which are weighted by λ_D and λ_D₁ respectively.
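Formulas 5 and 6 can be sketched in a few lines of Python. This is a minimal illustration that assumes each document model is a term-count vector stored as a dictionary; that representation, the function names, and the weights 0.8 and 0.2 are all assumptions made for the example.

```python
from collections import Counter


def corpus_model(docs):
    """Average the term vectors of all N_D documents (formula 5)."""
    total = Counter()
    for doc in docs:
        total.update(doc)  # adds the counts of each document
    n_docs = len(docs)
    return {term: count / n_docs for term, count in total.items()}


def expand_document(doc, background, lambda_d=0.8, lambda_bg=0.2):
    """Interpolate a document with the background model (formula 6).

    lambda_d and lambda_bg are hypothetical weights for the original
    document model and the background corpus information.
    """
    terms = set(doc) | set(background)
    return {t: lambda_d * doc.get(t, 0.0) + lambda_bg * background.get(t, 0.0)
            for t in terms}


docs = [Counter({"road": 2, "car": 1}), Counter({"way": 1, "car": 1})]
background = corpus_model(docs)
expanded = expand_document(docs[0], background)
```

In this toy corpus the first document never mentions 'way', yet its expanded model gives 'way' a small non-zero weight, which is the effect the expansion is meant to achieve.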


3.2 Retrieval

The relevance of a document D to a query Q then becomes the relevance of the document model D′ to the query Q. As D′ contains the original document model and the background corpus information, the ranking function is:

4 Experiment and Analysis

4.1 Feature selection

As feature selection has a great impact on retrieval performance, we use two of the most widely used feature functions in this paper: language models with Dirichlet smoothing [1,19] and BM25 [20].

The feature function using the language model is:

$$f_1(c) = qtf_i \log \frac{tf_{i,D} + \mu \frac{cf_i}{|C|}}{|D| + \mu}. \tag{7}$$

Feature function using BM25 is:

$$f_1(c) = qtf_i \, \frac{(k_1 + 1)\, tf_{i,D}}{k_1 \left( (1 - b) + b \frac{|D|}{|D|_{avg}} \right) + tf_{i,D}} \, \log \frac{N - df_i + 0.5}{df_i + 0.5}. \tag{8}$$

In these formulas, tf_i,D and cf_i are the numbers of occurrences of term t_i in document D and corpus C respectively, df_i is the number of documents in which t_i appears, and N is the number of documents in the collection. |C| and |D| are the lengths of the corpus and the document, and |D|_avg is the average document length in the dataset. qtf_i is the count of t_i in query Q. µ, k₁ and b are the parameters of these two feature functions.
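The two feature functions in formulas 7 and 8 translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and the default values µ = 2000, k₁ = 1.2 and b = 0.75 are commonly used settings assumed here, not values reported in this paper.

```python
import math


def lm_feature(qtf, tf, doc_len, cf, corpus_len, mu=2000.0):
    """Dirichlet-smoothed language model feature (formula 7).

    Requires tf > 0 or cf > 0, otherwise the logarithm is undefined.
    mu is an assumed default, not a value from the paper.
    """
    return qtf * math.log((tf + mu * cf / corpus_len) / (doc_len + mu))


def bm25_feature(qtf, tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """BM25 feature (formula 8): term-frequency part times IDF part.

    k1 and b are assumed defaults, not values from the paper.
    """
    tf_part = (k1 + 1) * tf / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf)
    idf_part = math.log((n_docs - df + 0.5) / (df + 0.5))
    return qtf * tf_part * idf_part
```

Both functions return 0 when qtf is 0, so terms that do not occur in the query contribute nothing to the ranking score.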

For reasons of space, we omit the details of f₂ and f₃, which are similar to f₁. Using the language model as the feature function, the final ranking function is:

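$$\begin{aligned}
P_G(Q \mid D') &\stackrel{rank}{=} \sum_{c \in QT_1} \lambda_{T_1} f_{T_1}(c) + \sum_{c \in QT_2} \lambda_{T_2} f_{T_2}(c) \\
&= \sum_{(t_i, Q) \in QT_1} \lambda_{T_1}\, qtf_i \log\left[\frac{tf_{i,D'} + \mu^t \frac{cf_i}{|C|}}{|D'| + \mu^t}\right] \\
&\quad + \sum_{(t_i, t_j, Q) \in QT_2} \log\left[Rel_D(t_i, t_j) \left(\frac{tf_{i,D} + \mu^t \frac{cf_i}{|C|}}{|D'| + \mu^t}\right)^{qtf_i} \left(\frac{tf_{j,D} + \mu^t \frac{cf_j}{|C|}}{|D'| + \mu^t}\right)^{qtf_j}\right].
\end{aligned} \tag{9}$$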
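To make the structure of formula 9 concrete, the following Python sketch evaluates it for one document. It rests on several assumptions: the term-relationship values Rel_D(t_i, t_j) are supplied by the caller as a dictionary (their definition is not given in this section), the same expanded term frequencies are used in both sums for simplicity, and all names and default parameter values are hypothetical.

```python
import math


def smoothed_prob(tf, doc_len, cf, corpus_len, mu):
    """Dirichlet-smoothed probability of a term in the expanded document."""
    return (tf + mu * cf / corpus_len) / (doc_len + mu)


def score_formula_9(query_tf, doc_tf, doc_len, corpus_cf, corpus_len,
                    related, lambda_t1=0.8, mu=2000.0):
    """Evaluate formula 9 under the assumptions stated in the lead-in.

    query_tf:  term -> qtf (count in the query).
    doc_tf:    term -> tf in the expanded document.
    corpus_cf: term -> cf (count in the corpus).
    related:   (t_i, t_j) -> Rel_D(t_i, t_j) > 0, one entry per QT2 clique
               (hypothetical input).
    """
    def log_p(term):
        return math.log(smoothed_prob(doc_tf.get(term, 0.0), doc_len,
                                      corpus_cf[term], corpus_len, mu))

    # QT1 part: one document term per clique, weighted by its query count.
    score = sum(lambda_t1 * query_tf[t] * log_p(t)
                for t in doc_tf if t in query_tf)
    # QT2 part: log[Rel * p_i^qtf_i * p_j^qtf_j], written as a sum of logs.
    for (ti, tj), rel in related.items():
        score += (math.log(rel)
                  + query_tf.get(ti, 0) * log_p(ti)
                  + query_tf.get(tj, 0) * log_p(tj))
    return score
```

Writing the product inside the logarithm as a sum of logarithms avoids numerical underflow when the smoothed probabilities are small.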

Using the BM25 model as the feature function, the final ranking function is:


$$f(df_i) = \log \frac{N - df_i + 0.5}{df_i + 0.5}. \tag{11}$$

Some other feature functions can also be used and will be discussed in our future work.

4.2 Experiment results

Table 1 gives the experimental results of the document expansion model (denoted MNRED). We use MAP, P@10 and P@20 to measure performance. The results show that the document expansion model outperforms the baseline model. However, we also observe that when a document is too long, the expansion information may be unnecessary; it decreases the weight of the relevant information in the document and affects performance. A long query also affects the performance of document expansion. Furthermore, we estimate the whole document model as the background model for document expansion, which may add too much unnecessary information to the document. In future work, we will use term relationships to perform document expansion, which may further improve retrieval performance.

Table 1: Performance on Med (feature function is language model)

Model          MAP      P@10     P@20
Baseline(BM)   0.4485   0.5167   0.4383
MNRED(BM)      0.4856   0.5930   0.4431
Baseline(LM)   0.3825   0.4600   0.3850
MNRED(LM)      0.4446   0.5481   0.4171
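The relative improvement of MNRED over each baseline can be computed directly from the values in Table 1. The short sketch below only restates the table's numbers; the variable and function names are hypothetical.

```python
# Values copied from Table 1: (MAP, P@10, P@20) per model.
results = {
    "BM": {"baseline": (0.4485, 0.5167, 0.4383), "mnred": (0.4856, 0.5930, 0.4431)},
    "LM": {"baseline": (0.3825, 0.4600, 0.3850), "mnred": (0.4446, 0.5481, 0.4171)},
}
metrics = ("MAP", "P@10", "P@20")


def relative_gain(baseline, expanded):
    """Relative improvement of the expanded model over the baseline, in percent."""
    return 100.0 * (expanded - baseline) / baseline


gains = {
    feature: {m: relative_gain(b, e)
              for m, b, e in zip(metrics, rows["baseline"], rows["mnred"])}
    for feature, rows in results.items()
}

for feature, row in gains.items():
    print(feature, {m: round(g, 2) for m, g in row.items()})
```

Every gain is positive, and the improvement is larger with the language model feature function than with BM25 on all three measures.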

5 Conclusions

This paper proposes a document expansion model based on the Markov network, which uses relationships between terms to construct the Markov network and performs document expansion by means of corpus information. Experiments show that our model can improve the performance of information retrieval.

This method does not add information to queries directly and therefore avoids topic drift, so it could be applied to tasks such as advertisement recommendation and microblog retrieval.

In the future, we will conduct more research on this model and run more experiments to evaluate its performance.

References

[1] C. Zhai, Statistical Language Models for Information Retrieval: A Critical Review, Foundations and Trends in Information Retrieval 2(2008), 137-215.

[2] X. Wang, C. Zhai, Mining Term Association Patterns from Search Logs for Effective Query Reformulation, 17th ACM International Conference on Information and Knowledge Management, 2008, pp. 479-488.

[3] M. Lease, Natural Language Processing for Information Retrieval: the Time is Ripe (again), 2nd Ph.D. Workshop on Information and Knowledge Management (PIKM 07), 2007, pp. 1-8.

[4] S. Liu, F. Lin, C. Yu, W. Meng, An Effective Approach to Document Retrieval via Utilizing Wordnet and Recognizing Phrases, 27th ACM Special Interest Group on Information Retrieval, 2004, pp. 266-272.

[5] T. Brants, Natural Language Processing in Information Retrieval, 16th Meeting of Computational Linguistics in the Netherlands, 2003, pp. 1-13.

[6] J. Gao, J. Nie, G. Wu, G. Cao, Dependence Language Model for Information Retrieval, 27th ACM Special Interest Group on Information Retrieval, ACM, 2004, pp. 170-177.

[7] G. Cao, J. Nie, J. Bai, Integrating Word Relationships into Language Models, 28th ACM Special Interest Group on Information Retrieval, 2005, pp. 298-305.

[8] M. Karimzadehgan, C. Zhai, Estimation of Statistical Translation Models Based on Mutual Information for Ad Hoc Information Retrieval, 33rd ACM Special Interest Group on Information Retrieval, 2010, pp. 323-33.

[9] J. Bai, D. Song, P. Bruza, J. Nie, G. Cao, Query Expansion Using Term Relationships in Language Models for Information Retrieval, 14th ACM International Conference on Information and Knowledge Management, 2005, pp. 688-695.

[10] J. Xu, W. Croft, Query Expansion Using Local and Global Document Analysis, 19th ACM Special Interest Group on Information Retrieval, 1996, pp. 4-11.

[11] Y. Lv, C. Zhai, Positional Relevance Model for Pseudo-Relevance Feedback, 33rd ACM Special Interest Group on Information Retrieval, 2010, pp. 579-586.

[12] V. Dang, W. B. Croft, Query Reformulation Using Anchor Text, 3rd ACM International Conference on Web Search and Data Mining, 2010, pp. 41-50.

[13] D. Metzler, W. B. Croft, A Markov Random Field Model for Term Dependencies, 28th ACM Special Interest Group on Information Retrieval, 2005, pp. 472-479.

[14] D. Metzler, Automatic Feature Selection in the Markov Random Field Model for Information Retrieval, 30th ACM International Conference on Information and Knowledge Management, 2007, pp. 253-262.

[15] D. Metzler, W. B. Croft, Latent Concept Expansion Using Markov Random Fields, 16th ACM Special Interest Group on Information Retrieval, 2007, pp. 311-318.

[16] J. Seo, W. B. Croft, Geometric Representation for Multiple Documents, 33rd ACM Special Interest Group on Information Retrieval, 2010, pp. 251-258.

[17] Jiali Zuo, Mingwen Wang, A Query Reformulation Model Using Markov Graphic Method, 2011 International Conference on Asian Language Processing, 2011, pp. 119-122.
