
IJCSBI.ORG

ISSN: 1694-2108 | Vol. 12, No. 1. APRIL 2014 1

Improving Ranking Web Documents using User’s Feedbacks

Fatemeh Ehsanifar

Department of CSE,

Lorestan Science and Research College, Lorestan, Iran

Hasan Naderi

Assistant Professor Department of CSE,

Iran University of Science and Technology, Tehran, Iran

ABSTRACT

The World Wide Web has become the main environment for developing, distributing, and acquiring knowledge. The most important tools for accessing this vast ocean of information are search engines, in which ranking is one of the main components. Because text-based and link-based ranking methods have known problems, methods based on users’ behavior on the web have been considered. User behavior contains valuable information that can be used to improve the quality of web ranking results. This research proposes a model that, for each query, collects users’ positive and negative feedback on the displayed list of web pages: how many times users have accessed a site, the time spent on the site, the number of successful downloads from the site, and the numbers of positive and negative clicks on the site. The model then calculates the rank of each page using a Multiple Attribute Decision Making method and produces a new ranking of the sites, which can be updated regularly according to subsequent user feedback.

Keywords

User’s feedback, Multiple Attribute Decision Making, User’s Behavior, Search Engine.

1. INTRODUCTION

[...] pages. Most web users do not pay attention to the pages that come after the first search results. It is therefore important for a web search engine to present the results that users are most likely to prefer at the top of the list; otherwise, the search engine cannot be considered effective enough. The role of a ranking algorithm is thus to identify the more valid pages among the numerous web pages and to assign them higher ranks.

The structure of this paper is as follows. Section 2 reviews the related literature. Section 3 describes the TOPSIS algorithm and its characteristics. The proposed method is presented in Section 4, and the simulation is described in Section 5. Finally, Section 6 presents the conclusion and some future work.

2. RELATED WORKS

Ranking is one of the main parts of a search engine. It is the process in which the quality of a page is estimated. Because there can be thousands of relevant pages for a query, it is imperative to prioritize them and present the first 10 or 20 results to the user. Ranking methods can generally be divided into five classes. The first class is text-based ranking, whose most important models are the probabilistic model and the vector space model. In the vector space model, both the document and the query are vectors with as many dimensions as there are words. Each vector is turned into a weight vector, and the cosine of the angle between the two weighted vectors is calculated as their degree of similarity. The most widely used weighting method is TF-IDF, proposed by Salton [1]. The other text-based ranking method is the probabilistic model. A retrieval system based on the probabilistic model ranks documents by the probability that each document is relevant to the user’s query. Thus, unlike the vector space model, this model does not compute a degree of similarity between query and document [2].
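The vector space model described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the toy documents, the function names `tf_idf_vectors` and `cosine`, the particular TF-IDF variant (term frequency divided by document length, times the logarithm of the inverse document frequency), and the choice to treat the query as one more document when counting document frequencies are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy collection and query.
docs = [
    "web search ranking".split(),
    "user feedback ranking".split(),
    "cooking pasta recipe".split(),
]
query = "web ranking".split()

# Assumption: the query is weighted together with the documents.
vectors = tf_idf_vectors(docs + [query])
doc_vecs, query_vec = vectors[:-1], vectors[-1]
scores = [cosine(query_vec, d) for d in doc_vecs]
```

In this toy collection the first document shares two terms with the query, the second shares one, and the third shares none, so the similarity scores decrease in that order.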

The second class is link-based (connection-based) ranking. Unlike the environment of traditional information retrieval, the web has a large heterogeneous structure in which documents are linked together and form a huge graph. Web links carry valuable information, so new ranking algorithms based on links have been created. In general, link-based algorithms are divided into query-dependent and query-independent models [3]. In query-independent models, such as PageRank, ranking is done offline using the overall web graph, so each page has a fixed score for every query. In query-dependent (topic-sensitive) models, such as HITS, ranking is performed on the graph formed by the set of pages related to the user’s query.

[...] which has drawn a lot of attention in recent years. Learning-based ranking methods are divided into three main classes: pointwise, pairwise, and listwise methods. In the pointwise method, a number is assigned to each document-query pair that represents the degree of relevance between them [5]. In pairwise methods, given pairs of objects (the characteristics of the objects and their relative ranking), the aim is to assign each object a rank close to its real rank, and the object pairs are eventually divided into two general classes, “correctly ranked” and “incorrectly ranked”. Most existing learning-based ranking methods are of this type. Listwise methods use lists of ordered objects as training data to predict the order of objects.

The fifth class is based on user behavior. Because of the problems of text-based and link-based methods, methods based on the behavior and judgment of users have been considered extensively, in order to bring justice and democracy to the web. In other words, to improve the web in terms of quality and quantity, the most suitable pages are determined by the users themselves [6]. There are two methods of collecting data from users: the direct (explicit) feedback method and the implicit (indirect) feedback method [7].

In direct feedback methods, the user is asked to judge the presented results, which is burdensome for users. In the indirect method, the user’s behavior during the search process, which is recorded in the logs of search engines, is used; as a consequence, the data can be collected at the lowest possible cost. User behavior during the search process includes the text of the query, how the user clicks on the ordered list of results [8], the content of the clicked pages, the dwell time on each page [9], and other information about events recorded during the search. These recorded events contain invaluable information that can be used to analyze, assess, and model user behavior in order to improve the quality of results.

3. TOPSIS ALGORITHM

4. METHODOLOGY

In this section we present a model that collects five kinds of user feedback (positive and negative) on the list of search results for a given query, which may be submitted by a large number of users. The model calculates a score for each document using the TOPSIS method and finally ranks the documents. At specified intervals, the rankings are updated using newly collected user feedback. The five kinds of feedback are treated as five criteria for assessing web pages (the first four are positive characteristics and the last is a negative one):

• Open click: the number of times each site is opened (clicked) for a given query.

• Dwell time: the time users spend on each site for a given query, measured in hours.

• Download: the number of downloads that occur on each page for a given query.

• Plus click: the set of positive clicks that indicate the user’s satisfaction with the selected page, such as left or right clicks on links in the page.

• Negative click: the set of negative clicks that indicate users’ dissatisfaction with the selected document, such as clicking close.

Implementation steps of proposed method:

First step: The decision matrix D is formed as follows:

\[
D =
\begin{array}{c|ccccc}
 & X_{OC} & X_{DT} & X_{D} & X_{PC} & X_{NC} \\ \hline
A_1 & r_{1,1} & r_{1,2} & r_{1,3} & r_{1,4} & r_{1,5} \\
A_2 & r_{2,1} & r_{2,2} & r_{2,3} & r_{2,4} & r_{2,5} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
A_m & r_{m,1} & r_{m,2} & r_{m,3} & r_{m,4} & r_{m,5}
\end{array}
\]

where A1, A2, …, Am are the m sites to be ranked according to the criteria; XOC, XDT, XD, XPC, XNC are the criteria for assessing the suitability of each site (open click, dwell time, download, plus click, and negative click); and ri,j is the value of the jth criterion for the ith site. This matrix is then normalized (scaled) using the norm method, which yields the normalized matrix ND.

Second step: The relative importance of the criteria is calculated using the Entropy method and balanced using the λ values. The λ values are 0.2 for the number of open clicks, 0.3 for the time spent (which can be more significant than the other criteria), 0.1 for the number of downloads, 0.2 for the number of positive clicks, and 0.2 for the number of negative clicks. The weight vector W is then:

W = {WOC, WDT, WD, WPC, WNC}
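The balancing of entropy weights by the λ values can be sketched as follows, using the rule given in relation (6) of the Appendix (each weight is multiplied by its λ and the products are renormalized to sum to one). The λ values are the ones stated in the text; the entropy weights in the example and the function name `balance_weights` are hypothetical and serve only as an illustration.

```python
# Subjective importance values (lambda) stated in the text, in the order
# open click, dwell time, download, plus click, negative click.
LAMBDAS = [0.2, 0.3, 0.1, 0.2, 0.2]

def balance_weights(entropy_weights, lambdas=LAMBDAS):
    """Combine entropy weights with lambda values and renormalize to sum to 1."""
    products = [lam * w for lam, w in zip(lambdas, entropy_weights)]
    total = sum(products)
    return [p / total for p in products]

# Hypothetical entropy weights (illustrative numbers, not results from the paper).
entropy_w = [0.25, 0.15, 0.20, 0.10, 0.30]
W = balance_weights(entropy_w)
```

With these made-up inputs, the negative-click criterion keeps the largest weight because its entropy weight is high, while the download criterion is reduced by its small λ.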

The weighted normalized matrix V is then obtained as:

\[
V = N_D \cdot W_{5 \times 5} =
\begin{bmatrix}
V_{11} & \cdots & V_{15} \\
\vdots & \ddots & \vdots \\
V_{m1} & \cdots & V_{m5}
\end{bmatrix}
\tag{1}
\]

where ND is the matrix in which the criteria scores have been normalized, and W5×5 is a diagonal matrix in which only the elements of the main diagonal are non-zero.

Third step: Determination of the positive ideal solution according to relation (2) and the negative ideal solution according to relation (3). The positive ideal option (A+) is defined as the best site from the users’ viewpoint, and the negative ideal option (A-) as the worst site from the users’ viewpoint, as follows:

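\[
A^{+} = \left\{ \left( \max_{i} V_{i,j} \mid j \in J \right), \left( \min_{i} V_{i,j} \mid j \in J' \right) \;\middle|\; i = 1, 2, \ldots, m \right\}
= \left\{ V^{+}_{\max OC},\ V^{+}_{\max DT},\ V^{+}_{\max D},\ V^{+}_{\max PC},\ V^{+}_{\min NC} \right\}
\tag{2}
\]

\[
A^{-} = \left\{ \left( \min_{i} V_{i,j} \mid j \in J \right), \left( \max_{i} V_{i,j} \mid j \in J' \right) \;\middle|\; i = 1, 2, \ldots, m \right\}
= \left\{ V^{-}_{\min OC},\ V^{-}_{\min DT},\ V^{-}_{\min D},\ V^{-}_{\min PC},\ V^{-}_{\max NC} \right\}
\tag{3}
\]

\[
J = \{\, j = 1, 2, 3, 4, 5 \mid j \text{ for positive attribute} \,\}
\]

\[
J' = \{\, j = 1, 2, 3, 4, 5 \mid j \text{ for negative attribute} \,\}
\]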
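As a rough illustration of relations (2) and (3), the sketch below takes the column-wise maximum for positive criteria and the column-wise minimum for the negative criterion to build A+, and the reverse to build A-. The matrix values, the `BENEFIT` flags, and the function name `ideal_solutions` are assumptions for this example only.

```python
def ideal_solutions(V, benefit):
    """Return (A+, A-) for a weighted normalized matrix V.

    benefit[j] is True for a positive criterion and False for a negative one.
    """
    columns = list(zip(*V))
    a_pos = [max(c) if b else min(c) for c, b in zip(columns, benefit)]
    a_neg = [min(c) if b else max(c) for c, b in zip(columns, benefit)]
    return a_pos, a_neg

# Criteria order: open click, dwell time, download, plus click, negative click.
BENEFIT = [True, True, True, True, False]

# Hypothetical weighted normalized scores for three sites.
V = [
    [0.10, 0.20, 0.05, 0.12, 0.03],
    [0.08, 0.25, 0.02, 0.15, 0.10],
    [0.12, 0.10, 0.04, 0.05, 0.06],
]
a_pos, a_neg = ideal_solutions(V, BENEFIT)
```

Note that neither A+ nor A- has to coincide with a real site: each is assembled criterion by criterion from the best (or worst) value observed in that column.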

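Fourth step: According to relation (4), the distance of the ith site from the positive ideal site is as follows:

\[
d_i^{+} = \left( \sum_{j=1}^{n} \left( V_{i,j} - V_j^{+} \right)^{2} \right)^{0.5} ; \quad i = 1, 2, \ldots, m
\tag{4}
\]

And according to relation (5), the distance of the ith site from the negative ideal site is as follows:

\[
d_i^{-} = \left( \sum_{j=1}^{n} \left( V_{i,j} - V_j^{-} \right)^{2} \right)^{0.5} ; \quad i = 1, 2, \ldots, m
\tag{5}
\]

Here Vj+ and Vj- denote the jth components of A+ and A-, respectively.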
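The two Euclidean distances of relations (4) and (5) can be sketched in a few lines. The two-criterion matrix below is a deliberately tiny, hypothetical example chosen so that the distances are easy to check by hand; the function name `distances` is likewise an assumption.

```python
import math

def distances(V, a_pos, a_neg):
    """Euclidean distance of every row of V from A+ and from A-."""
    d_pos = [math.sqrt(sum((v - p) ** 2 for v, p in zip(row, a_pos))) for row in V]
    d_neg = [math.sqrt(sum((v - n) ** 2 for v, n in zip(row, a_neg))) for row in V]
    return d_pos, d_neg

# Hypothetical example with two sites and two criteria.
V = [[3.0, 4.0], [0.0, 0.0]]
a_pos = [3.0, 4.0]   # the first site happens to equal the positive ideal
a_neg = [0.0, 0.0]   # the second site happens to equal the negative ideal
d_pos, d_neg = distances(V, a_pos, a_neg)
```

Here the first site has distance 0 from A+ and distance 5 from A-, and the second site has the opposite values.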

Fifth step: Calculation of the relative closeness of each site (Ai) to the ideal site. This relative closeness is defined according to relation (6) as follows:

\[
cl_i^{+} = \frac{d_i^{-}}{d_i^{+} + d_i^{-}}
\tag{6}
\]

If Ai = A+, then di+ = 0 and cli+ = 1; if Ai = A-, then di- = 0 and cli+ = 0. Therefore, the closer option Ai is to the ideal solution (A+), the higher the value of cli+.

Sixth step: The sites are ranked in descending order of cli+.
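The last two steps can be sketched as below: the relative closeness of relation (6) is computed from the two distances, and the sites are sorted in descending order of that value. The distances in the example are hypothetical, the function name `rank_sites` is an assumption, and returning 0 when both distances are zero is an assumed convention for a case the text does not cover.

```python
def rank_sites(d_pos, d_neg):
    """Return (closeness values, site indices sorted from best to worst)."""
    closeness = []
    for dp, dn in zip(d_pos, d_neg):
        total = dp + dn
        # Assumption: if both distances are zero, the closeness is set to 0.
        closeness.append(dn / total if total else 0.0)
    order = sorted(range(len(closeness)), key=lambda i: closeness[i], reverse=True)
    return closeness, order

# Hypothetical distances for three sites.
d_pos = [0.0, 3.0, 1.0]
d_neg = [4.0, 1.0, 3.0]
closeness, order = rank_sites(d_pos, d_neg)
```

In this example the first site coincides with the positive ideal (closeness 1), so it is ranked first, followed by the third site and then the second.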

5. SIMULATION OF PROPOSED MODEL

The proposed model was simulated in MATLAB (version 2012). The simulation first receives three inputs from the user, in the following order:

• The number of users who submit the same query.

• The number of times feedback is collected again from users and a new ranking takes place; in other words, the number of times the ranking of pages is updated.

• The number of pages or sites to be ranked for each query using user feedback and the TOPSIS method.

The following limits are applied to each feedback value: the number of positive clicks per user is at least 0 and at most 20, the number of downloads per visit is at most 15, and the time a user spends visiting a site is at least 5 minutes and at most 3 hours.
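The kind of input such a simulation needs can be sketched in Python (the original simulation was written in MATLAB and is not reproduced here). The sketch draws random feedback within the limits stated above and sums it over users to obtain one row per site. The upper bounds for open clicks and negative clicks, the summation over users, the uniform distributions, and the function name `simulate_feedback` are assumptions, since the text does not specify them.

```python
import random

def simulate_feedback(n_users, n_sites, seed=0):
    """Build a decision matrix with one row per site and five criteria per row."""
    rng = random.Random(seed)
    matrix = []
    for _ in range(n_sites):
        open_click = dwell = downloads = plus = negative = 0
        for _ in range(n_users):
            open_click += rng.randint(0, 5)     # assumed bound, not given in the text
            dwell += rng.uniform(5 / 60, 3.0)   # 5 minutes to 3 hours, in hours
            downloads += rng.randint(0, 15)     # at most 15 downloads per visit
            plus += rng.randint(0, 20)          # 0 to 20 positive clicks per user
            negative += rng.randint(0, 20)      # assumed bound, not given in the text
        matrix.append([open_click, dwell, downloads, plus, negative])
    return matrix

feedback = simulate_feedback(n_users=10, n_sites=4, seed=1)
```

Calling the function again with a different seed would stand in for one update round, in which fresh feedback is collected and the ranking is recomputed.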

5.1 TYPICAL SIMULATION OF PROPOSED MODEL


Figure 1. Typical Simulation of Proposed Model in three steps

5.2 ADVANTAGES AND DISADVANTAGES OF PROPOSED METHOD

[...] the closest option to the ideal option or solution. Briefly, the ideal solution consists of the maximum values of each criterion, while the non-ideal solution consists of the minimum values of each criterion.

• In this method, a considerable number of criteria can be taken into account.

• This method can be applied simply and quickly, and because the volume of calculations in the assessment is low, it can handle a large number of options.

6. CONCLUSION

Search engines provide search results regardless of the user’s preferences and background. As a result, users often come across results that are not interesting to them. In addition, most search engines use algorithms, such as PageRank, that consider the number of incoming and outgoing links of a website, so the user’s behavior pattern is of the utmost significance in ranking websites. In this work we presented a method for ranking web documents that simultaneously uses five kinds of positive and negative user feedback, based on a Multiple Attribute Decision Making model called TOPSIS.

Moreover, this method appears suitable for prioritizing pages because it simultaneously considers the distances from both the positive and the negative ideal options when ranking documents. Another innovation of this model is that it uses several kinds of user feedback simultaneously for ranking. Among these, dwell time deserves particular attention, since it is one of the well-known, recent forms of implicit feedback; researchers believe that the more time a user spends reading a document, the more important the document is for him or her.

REFERENCES

[1] Salton, G. and Buckley, C. 1988. Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24, 5 (August 1988), pp. 513-523.

[2] Robertson, S. E. and Walker, S. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '94), W. Bruce Croft and C. J. van Rijsbergen (Eds.). Springer-Verlag New York, Inc., New York, NY, USA, pp. 232-241.

[3] Jain, R. and Purohit, G. N. 2011. Page Ranking Algorithms for Web Mining. International Journal of Computer Applications (0975 – 8887), Vol. 13, No. 5, January.

[4] Shakery, A. and Zhai, C. 2003. Relevance propagation for topic distillation: UIUC TREC 2003 Web track experiments. In Proceedings of the TREC Conference.

[5] Yeh, J. Y., Lin, J. Y., Ke, H. R., and Yang, W. P. 2007. Learning to Rank for Information Retrieval Using Genetic Programming. Presented at the SIGIR 2007 Workshop, Amsterdam.

[6] Zhao, D., Zhang, M., and Zhang, D. 2012. A Search Ranking Algorithm Based on User Preferences. Journal of Computational Information Systems, pp. 8969-8976.

[7] Attenberg, J., Pandey, S., and Suel, T. 2009. Modeling and predicting user behavior in sponsored search. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '09). ACM, New York, NY, USA, pp. 1067-1076.

[8] Dupret, G. and Liao, C. 2010. A model to estimate intrinsic document relevance from the clickthrough logs of a web search engine. In Proceedings of the third ACM international conference on Web search and data mining (WSDM '10). ACM, New York, NY, USA, pp. 181-190.

[9] Liu, C., White, R. W., and Dumais, S. 2010. Understanding web browsing behaviors through Weibull analysis of dwell time. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval (SIGIR '10). ACM, New York, NY, USA, pp. 379-386.

[10] Yurdakul, M. 2008. Development of a performance measurement model for manufacturing companies using the AHP and TOPSIS approaches. International Journal of Production Research, pp. 4609-4641.

APPENDIX

Normalization (scaling) using norms: To compare criteria that are measured in different units, we must normalize the values, which makes the elements of the transformed criteria (ni,j) dimensionless. There are various normalization methods (such as normalization using the norm, linear normalization, and fuzzy normalization); here we use normalization by the norm. We divide each ri,j of the decision matrix by the norm of the jth column (for criterion xj). That is,

\[
n_{i,j} = \frac{r_{i,j}}{\sqrt{\sum_{i=1}^{m} r_{i,j}^{2}}}
\tag{1}
\]

In this way, all columns of the matrix have the same unit length and their overall comparison becomes simple.
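A minimal sketch of this norm-based normalization follows: every entry is divided by the Euclidean norm of its column, so each normalized column has length 1. The small matrix and the function name `normalize` are hypothetical, and mapping an all-zero column to zeros is an assumed convention for a case the text does not discuss.

```python
import math

def normalize(D):
    """Divide every entry of D by the Euclidean norm of its column."""
    norms = [math.sqrt(sum(x * x for x in col)) for col in zip(*D)]
    # Assumption: a column of zeros stays zero instead of causing a division error.
    return [[x / n if n else 0.0 for x, n in zip(row, norms)] for row in D]

# Hypothetical decision matrix with two sites and two criteria.
D = [[3.0, 1.0], [4.0, 1.0]]
N = normalize(D)
```

The first column has norm 5, so its entries become 0.6 and 0.8; the second column has norm equal to the square root of 2, so both entries become about 0.707.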

Weighting criteria using Entropy Method:

In most MADM problems we need to know the relative importance of the criteria, normalized so that the weights sum to one. This relative importance expresses the degree of preference of each criterion over the others for decision making, and we use the Entropy method to obtain it. In information theory, entropy is a measure of the uncertainty expressed by a discrete probability distribution (Pi). For a decision matrix with m options and n criteria, the information content of the matrix is first calculated as (Pij):

\[
P_{i,j} = \frac{r_{i,j}}{\sum_{i=1}^{m} r_{i,j}}
\tag{2}
\]

And for the entropy Ej of the set of Pij for each criterion we have:

\[
E_j = -k \sum_{i=1}^{m} P_{i,j} \ln P_{i,j} ; \quad \forall j
\tag{3}
\]

where k = 1/ln m, and the degree of deviation (dj) of the information produced for the jth criterion is as follows:

\[
d_j = 1 - E_j ; \quad \forall j
\tag{4}
\]

And finally, for the weights (Wj) of the criteria we have:

\[
w_j = \frac{d_j}{\sum_{j=1}^{n} d_j} ; \quad \forall j
\tag{5}
\]
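The entropy weighting of relations (2) to (5) can be sketched as follows. The example matrix and the function name `entropy_weights` are hypothetical. The sketch assumes at least two options (so that ln m is non-zero), positive column sums, and at least one column that is not perfectly uniform; a term with Pij = 0 is treated as contributing 0 to the sum, which is a common convention and not something stated in the text.

```python
import math

def entropy_weights(D):
    """Entropy-based criterion weights for a decision matrix D (rows are options)."""
    m = len(D)
    k = 1.0 / math.log(m)
    deviations = []
    for col in zip(*D):
        total = sum(col)
        p = [x / total for x in col]                              # relation (2)
        e = -k * sum(pi * math.log(pi) for pi in p if pi > 0)     # relation (3)
        deviations.append(1.0 - e)                                # relation (4)
    s = sum(deviations)
    return [d / s for d in deviations]                            # relation (5)

# Hypothetical matrix: the first criterion is identical for both options,
# the second one differs between them.
D = [[1.0, 1.0], [1.0, 3.0]]
w = entropy_weights(D)
```

In this example the first criterion has maximal entropy and therefore receives a weight of (almost exactly) zero: a criterion on which all options score the same carries no information for distinguishing them.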

Since multiplying the normalized decision matrix (m×n) by the weight vector Wn×1 would not scale each column by its own weight, the weight vector must first be transformed into a diagonal matrix Wn×n (with the weights on the main diagonal). If the decision maker (DM) has a prior subjective judgment λj of the relative importance of the jth criterion, the weight wj calculated through Entropy can be balanced as follows:

\[
w'_j = \frac{\lambda_j w_j}{\sum_{j=1}^{n} \lambda_j w_j} ; \quad \forall j
\tag{6}
\]

This paper may be cited as: