• No results found

Improved PageRank Algorithm Combined User Behavior with Topic Similarity

N/A
N/A
Protected

Academic year: 2020

Share "Improved PageRank Algorithm Combined User Behavior with Topic Similarity"

Copied!
8
0
0

Loading.... (view fulltext now)

Full text

(1)

2016 International Conference on Artificial Intelligence: Techniques and Applications (AITA 2016) ISBN: 978-1-60595-389-2

Improved PageRank Algorithm Combined

User Behavior with Topic Similarity

Hao-dong

ZHU

1,2,*

and

Bao-feng

HE

1

1DepartmentofComputerScience,SiasInternationalUniversity,Xinzheng,Henan, 451150,China

2SchoolofComputerandCommunicationEngineering,ZhengzhouUniversityofLightIndustry,

Zhengzhou,Henan,450002,China

*Correspondingauthor

Keywords: PageRankalgorithm, Similarityweight,PRvalue,Userbehavior.

Abstract. Aiming atshortcomingsofthedrifting themeandthesplitting pageweightoftraditional PageRank algorithm, an improved PageRank algorithm combined user behavior with topic similarity was proposed. Combined with the time factor,the proposed algorithm comprehensively analyzed retweet, comment and mention, measured contribution of three kinds of different behaviorsformicro-bloguserinfluencebymeansofstatisticalanalysis, andusedimprovedTF-IDF calculatetheweightof subjectrelevancetochoose pageswithhighercorrelation degreeandobtain differentPR value. The simulation results show that comparedwith Micro-blog’s common sorting algorithms,theimprovedPageRankalgorithmhasbettersortingeffect.

Introduction

With the rapid development of Internet technology, especially the gradual maturity of search engines, socialnetworks havebecome animportant platformforthe exchangeof usersonline. The size of Chinese Internet users have been expanding, and by the end of 2015, it had reached 6.88 billion. In the face of massive data, the user can quickly and accurately obtain high quality information accordingto theirown needs,which becomesa serious challengeto thesearch engine technology.

The ideal search engine requirementsthat it can provide users with the behavior of the user [1] relatedqueryinformation,whichrequirestheexistingsearchenginetechnologycontinuestomature. PageRankalgorithmisoneofcommonlyusedalgorithminthesearchengine,however,theclassical PageRank algorithm [2] ignores the user's individual needs. The content and structure of the networkareconstantlychanging,andusersoftenignorethetimefactor,whichisnotconsistentwith the law of user behavior.How to get the users’ actual needs and the similarity between the users’ search topics and other factors together? It has become the focus of domestic and foreign that the researchersstudytheproblems.

(2)

Relevant Basic Knowledge

PageRank Algorithm

It will divide a certain weight according to the importance of the page and transfer to the search page, so as to determine the search page using PageRank value (referred to as the PR value)[6]. Finally, it judges the importance according to the web page of the PR value level, improving the qualityoftheusertoquicklysearch[7].

Theresearchersgenerallybelievethatthelinkbetweenthewebpageisestablishedbyhyperlink, then thenumber of pages constitutea super largein the end. Usersrandomly select a webpage to browse from all web pages, There are two options for each web page by using hyperlinks on the webpage[8]:Tothisend. Continue toselectalinkto browse.Therefore,we calculateofaweb page,forexampletherankofwebpageAthatthestandardformulaasthestandard(1).

    n i i i P C P PR d d A PR

1 ( ) ) ( ) 1 ( ) ( (1)

Informula (1), n representsthe totalnumber ofpages, PR (A)representsa webpage (e.g., page A)PR of allthe pages;PR (Pi) representsthe valueof the pageA pageof thelink to thePi ,C(Pi)

represents the number of outbound links page Pi; In order to avoid the emergence of the link betweenthewebandthesteadystateprobabilitycannotconverge,here,theintroductionofdamping coefficient d (0<d<1), the value of 0.85, it means that users continue to visit the current page probability.1-drepresentstheprobabilitythatauserwillrandomlyaccessotherwebpagesfromthe currentwebpage.According tothe traditionalalgorithm, PRvaluea pageofi isonlyrelated toits penetration.

Random Walk Model

It convertsthe behaviorof usersbrowse theweb page intoa web linkstructure, in thestructure of the link according to the direction of random forward, and with a certain probability of random jumptootherpages[9].Theprobabilitythateachpagewillbelinkedtotheuserwilltendtoastable valueafter anumber of visits.Becausethe userrandomly selectsone ofthe links atthe sametime as the equal probability in the current page to browse and repeats the process continuously. The iterativerelationofthealgorithmisshownintheformula(2).

    ) ( | ( )| ) ( 1 ) ( i in

j out j

j PR d n d i

PR (2)

Informula(2),in(i)represents acollectionofpages thatpointtoawebpageI,out(j)represents acollectionofpagesthatpointtoawebpagej.

Haveliwala of the Stanford University in 2002 in Topic-sensitive pagerank proposed PersonalRank algorithm[10]. Wereplace (2) inthe pagenodes with the user and item that we can calculate in the global for each user and item of importance, and thus givethe overall ranking. In fact,weneedtocalculatetheuserbehaviornodewhichisrelativetoausernodemcorrelation.This algorithmcanbeusedtosortforallrelevantuserpersonalizationbehaviors.Theiterativerelationof thealgorithmisshownintheformula(3).

    ) ( | ()| ) ( ) 1 ( ) ( i in j outi

j PR d ri d i

PR ,ri

0 1 m i m i   (3)

In formula (3), ri is replaced by 1/n, whichmeans that the probabilityof starting from different

points is different. m represents the target user node we recommend. Using (3) calculation, the correlationdegreeofallvertexeswithrespecttothetargetpointmisobtained.

(3)

Analysis of CurrentCorrelation Improved Algorithms

G Kleinberg, J Teutte et al.[6] focused on the analysis of the propagation properties of Twitter, through thecomparative analysis of a largenumber of Twitter data, the studyof the impact of the userin thespreadof certaintopics. The resultsshow that,in themicro-blog retweetandquotation, there is no direct relationship on penetration of distribution and user influence order. Conversely, influentialusers playanimportant roleina varietyoftopics. ChenHongchanget al.[7]focusedon the analysisof retweet, commentand threekinds of actsmentioned inmicro-blog, using statistical methods to enumerate users who arein these three aspects related to the number of carries on the contrastanalysis, andmeasured thelevelof theinfluenceof communicationusers, finally,they put forward a kind of based on behavior of weights allocation algorithm; Lu Gang, Li Gengsheng et al.[8]proposedanimprovedalgorithmwhichwasinthetraditionalalgorithmbasedonintothelink similaritybetween webpages andreferenced pagerelease time,improvingthe precision andrecall ofsearchresults,however,nowitseemsthattheyignorethesimilaritybetweenusersofthefactors. Because of the lack of the influence of the traditional page ranking algorithm on the sorting results, T. Havdiwala[9]et al.improvedin thisarea,whichnot onlyhasthenetworktopology asa metric,but alsohastheinfluencefactorssuchascontextcorrelationandsubjectsensitivity. Finally, through a large number of simulation experiments proved their assumptions. Weng et al. [10] proposedtheTRank(TwitterRank)algorithm,whichnotonlyconsidersthefactorssuchasnetwork structure and user interaction, but also analyzed the similarity of the subject. Eventually, a combinationof allthese factorswascalculated andthe influenceofeach twitteruser. Xu Zhiming, LiDong etal. [11] definedthe strengthofthe userrelationship asthesimilarity betweentheusers, and proposes a method for computing the similarity of the users’ attribute information. However, these researchers don’t combine the user browsing behavior characteristic with the similarity of themeofthepagetoresearch.

Improved PageRank Algorithm of This Paper (BSPR)

Basedonthe timefactoranalysis, thispaper usestheimprovedspace vectormodelto calculatethe relative weight of the subject. The improved algorithm extract the relevant high degree of web pagesallowpeopleto freelychoose,finallymaking usobtainthecorresponding PR.Therefore, the problems that Theme drift, emphasis on old web pages, web weight sharing are suppressed to a certainextent. Thus,thispaper proposesBSPR (basedon Behaviorand Similar ThemePageRank) algorithmoftheformula, suchastheformula(4).

) )

( )

, ( (

) 1 ( ) (

) (

t i i j

F v

j i w

i d d B u v BSPR v Wu v W

u BSPR

i u o j

 

   

  

(4)

Amongthem: ui and vi for micro-blogusers,Fo (ui)representsthe final setof fansui fans, BSPR

(ui) onbehalfoftheuseru PR,BSPR(vj)onbehalfofthe uservjPR,BW(ui,vj)asthescalefactor is PR value of user vj assigned to the user ui , if vj users pay attention to page ui for retweeting, interactivetopicandmicro-bloggingmentionedbehavior,andscalingfactorworking,Wtrepresents the time feedback factor of the web page vi.  ,  ,  represent user behavior, topic similarity and time factor accounted for in the calculation of the proportion of the PR value., And to make

1

   

 ,soastomoreaccuratelyestimatethevalueofBSPR,makingmicro-blog'sinfluenceon theorderoftheuserismoretrue. TheinitialPRvalueofeachuserissetto1.

User Behavior Analysis

In micro-blog, there are many factors affecting the ranking of the user influence, but the users’ behaviorexistsbasedonmicro-blogretweet,commentsandmention[11].

(4)

directed weight network, which is forwarded by the user retweet, the user comments, micro-blog mentioned,andthedistributionfactorisgiveninturn,thentheweightformulaisconstructedas(5).

ji ji

ji j

i

w u v R C M

B ( , )   (5)

In(5)theformula,Rjirepresentsmicro-blogforwardweight,Cjirepresentsmicro-blogcomments,

Mji represents micro-blog referred to the weight. In the micro-blog network, when multiple

networks need to be combined,the direct sum of the simple edge weights is not reasonable. Thus theintroductionof ,, distributionfactor,whichrespectively, representstheuserretweet, user comments, micro-blog mentioned three kinds of behavior contribution in the page ranking algorithm.

In general, we adopt the measure standard that the users publish the micro-blog to statistic the impactofthecrowd,soastogettheformula(6).

ij ij ij ij Mn Cn Rn Rn   

(6)

In the formula (6), Rnij represents the forwarding impact, Cnij represents the commented on the impactof people,Mnij representsthemention theimpactof people.Thus,the valuesof  and  canbeobtainedinturn.Theformula(5)isobtainedbynormalizingtheformula(7).

  n k j k j i j i w v u W v u W v u B 1 ) , ( ) , ( ) ,

( (7)

In the formula (7), Bw (ui,vj) on behalf of micro-blog users paying attention to the contribution

sizeoftheuserui.nrepresentsthetotalnumberof usersvjconcernedabouttheobject. krepresents micro-blog retweet the case several times. After the treatment  , ,

will fall between [0,1]. Thenwemakefulluseofscalefactorandtheformula(4)tocalculateallusersofthePRvalue.

Topic Similarity Analysis

In the improved algorithm based on PageRank, we consider only the users’ behavioral factors, neglecting correlation between users, which in some extent solves the topic drift phenomenon. However, problems of PageRank algorithm itself did not really solved. Li Zhiying, Yang Wu and others[12] use theimproved spacevector model,that is,TF-IDFformula,respectively calculating weightsmand nonthetermsof thetiwhereis acorrespondinglinkbetweenthe useruanduserv. Defining Micro-blog content as the vector space model in the vector, the paper uses the cosine vector measurement method, so as to calculate the similarity of the contents of the micro-blog. Therefore,theformulaforcalculatingthesimilarityofthesimilarityisshownin(8).

     n i n i i i v i W u i W v i W u i W v u W 1 2 2 1 ) , ( ' ) , ( ' ) , ( ' ) , ( ' ) ,

( (8)

In formula (8), W(ui,vi) expresses micro-blog user ui and micro-blog user vi content similarity

value, W'(i,u) expresses the word ti in the micro-blog user u in terms of the right value, )

, ( ' i v

W representsthetermtiinmicro-blogusers.

We use the formula (9), to calculate proportion of the similarity of the page content in the formula(4).

  i u i i O k i i i v

u W u v W u k

(5)

Inthe formula (9),

i iv

u

W representsthe weight ofthe contentsimilarity of allthe degreesof the

userviintheuserui,Oui representsacollectionofuseruipointingtotheuservi, kOui.

Time Feedback Factor

In general, the numberof web pages beingsearched by thesearch engine to indicate thelength of timethat thewebpage itself exists,Inviewof theexisting problemswhich layparticularstress on old web pages and ignore new valuable web pages in the PageRank algorithm, the paper puts forwardthetimefeedbackfactortorevisetheformula(4),theformulaas(10).

T e

Wt / (10) Intheformula(10),Trepresentsthenumberoftimesthatthewebpageissearchedbythesearch engine. Wt asaweb pagetimefeedback factor.E asa constant, itisthe valueofe=(1-d)/n,which is relatedto the totalnumber of nthat search engineaccess to theproject, but alsoaffected by the standard formula (1) the impact of d. Generally speaking, the search engine's search cycle is 15 daysto30days.Ifawebpageexistsinthenetworkforalongtime,accordingly,itisineachsearch cycle to be accessed to the greater the probability. In other words, single page exist time is proportionaltothenumberofsearchengineaccess.

FromtheimprovedPageRankalgorithmtomodifytheformulaweknowthatifthelargernumber oftwitterusers,correspondingto,theuserretweet,comment,mentiontothehigherthenumber.The userbehavior related factorsaccounted forthe greaterproportion when we calculatethe PRvalue, therefore, the introduction of correlative coefficient ,, is particularly important. The greater the total number of micro-blog, the more interaction between the user, the higher the similarity betweenthepages, toacertainextent,itavoidstheproblemoftopicdrift.Thealgorithmintroduces thetimefeedbackfactor,whichovercomesthedifficultyofextractingthetimeofthewebpageand can judge the date of the web page more accurately. The size of the PR value is affected by the lengthof time theweb publishing,which is conducivethe old webpage precipitating andthe new pagerapidlyrising.

Improved PageRank Algorithm Description

Accordingto the calculationmethod ofthe BSPR algorithmproposedin this paper,themain steps areasfollows:

①Calculateeachmicro-bloguserspagein-degreeandout-degreenumber.

②Getthelastmodificationtimeofeachpage.

③Constructthenetworkvectorbyusingthepoints,edgesandweightsofthenetwork.

④Statistics the crowd affected by micro-blog and standardizes processing user behavior weight.

⑤Calculate the proportion of the weight of the page similarity by using the cosine vector measurementmethod.

⑥Calculatethetimefeedbackfactorofthewebpage,accordingtothecycletimesT.

⑦By repeated experiments, the balance coefficient  ,, in the proportion of the whole formula, according to the formula (4) to calculate the BSPR(ui), so as to get the ranking of micro-blog'suserinfluence.

This paperfocuses onthe calculationof theinfluence ofmicro-blog users (Step④,Step⑤,Step ⑥ arethecoreofthepapersteps).

Experiment Simulation

Experimental Preparation

(6)

sets Z,micro-blog web pages j, micro-blogtext sets X and other. The main contents of the data

storage:

1)Userinformation: useruid,username,gender,location,createmicro-blogtime,theuserhome URL,fannumber,thenumberofconcerns,micro-blog,thecollectionnumber.

2) Micro-blog information: micro-blog mid, publishing time, micro-blog content, micro-blog sources,forwardingnumber,comments,waspraised,publishedbytheuseruid,micro-blogbelongs tothesubject.

3)Listeningrelationship:useruid,fanid.

4)Concern:theuseruid,concernedabouttheid.

Experiment environment settings: 1) hardware platform: 1 HP servers, Intel i7-4790, 1TB 3.4GHz hard disk and 2 HP 3.6GHz Intel PCprocessor, 4G memory, 500G hard disk.2) software platform:Win7operatingsystem,MATLABsimulationtool,MySQL5.5database.

Due to the restrictions ofmicro-blog Sina, each user canonly get to 200 people at most,so the collectionofgoodfriendsisnotverycomprehensive.Wemergeallofthecollectedmicro-bloguser data, and simple clean it with the crawler tool to remove duplicate entries. The final user data statistics,suchastable1.

Table1.Userdatastatistics.

Projectdata Number

Thetotalnumberofusers 10413

Thetotalnumberofmicro-blog 84,168

Micro-blogretweet 27,759

Theuser'sfriends 1,391,718

Labeling, extraction of the collected data, we get the number of edges and average degree and averagein degree relateddata, which isbased on retweets, userreviews, users refer tothe number ofnodes,asshownintable2.Accordingtotheformula(9),Calculatethepagesimilarityproportion of micro-blog users in the formula (4), and the results will be calculated as a factor of the users’ influenceranking.

Table2.Micro-blogbasicdata.

micro-blog Thenumberofnodes Thenumberofedges Theaverageoutdegree TheaverageIndegree

retweet 10413 90132 18.31 19.13

comment 10413 117846 31.23 32.46

mention 10413 31843 7.33 5.22

Experimental Results and Analysis

Firstly, the probability value of each node is initialized, the initial access probability of the node corresponding to the m is 1, and the initial access probability of the other nodes is 0.After data calculation,thispaperintheformula(5)[8],takes 4/7, 2/7,1/7.

Secondly, after repeated experimental simulation, we calculate  , , three scaling factor valuesintheBSPRalgorithm,theresultsshowninfigure1.

From figure 1, we know when ,, take different values, BSPR algorithm of the PR value with the change, in order to illustrate the influence of  , , on the improved PageRank. When

0.3

(7)

is set to 0.2, 0.4, 0.4, which reduces theuser theme drift phenomenon, makes up for the lackofweightsharingpage,moreconducivetotheuser'sinfluenceranking.

Finally, in order to verify the effect of the article BSPR algorithm on the micro-blog user influenceranking,thispapercomparesitwiththethreemethodscommonlyusedinmicro-blog.The simulationresultsareshowninfigure2.

[image:7.595.84.514.157.313.2]

Figure 1.Valuesof ,, analysischart. Figure2.Comparisonresultsoffivealgorithms.

Infigure2,thehorizontalcoordinates,inthesamequeryconditions,thequerytopK%micro-blog users. The vertical coordinates, user influence coverage ratio. From figure 3, we know that the BSPR algorithm is superior than BWPR algorithm(User behavior through algorithm, retweets, comments, and other factors mentioned in the analysis of user, to calculate the influence order of users), WPR algorithm (The theme correlation algorithm, by constructing a link analysis of micro-blogusers/webcontentbasedonuserinfluenceranking)andTURankalgorithm(Combined with time factor algorithm, the time factor is introduced to observe the influence of users in a certain period of time). It shows that the micro-blog data information filtering and user related behavior,thethemeofsimilarity, thecombination ofthetimefactorand asthebasis fortheweight distributionismoreclosetothereality,andachieveagoodeffect.

Conclusions

AimingattheshortcomingsofthetraditionalPageRankalgorithm,thedriftingtheme,splittingpage weight, new web links less, the user behavior, the theme of the relevant weight, time factor and other factors are introduced into the PageRank algorithm, then an improved PageRank algorithm based on the users’ behavior and topic similarity is proposed, and the algorithm is applied to the ranking of micro-blog users' influence. Using the collected data to simulate the experiment, the improvedalgorithmcangetabettersortingeffect,whichshowsthefeasibilityandsuperiorityofthe BSPRalgorithm.

Acknowledgement

ThisworkwassupportedinpartbytheScienceandTechnologyPlanProjectsofHenanProvinceof ChinaundergrantNo.152102210357andNo.152102210149.

References

[1] X.P.Han, B.H. Wang, T. Zhou, Researchon human behaviordynamics, ComplexSystems and ComplexityScience. 7(2010)132-144.

[2]D.C. Huang,H.C. Qi,PageRankAlgorithmResearch,ComputerEngineering.32(2006)145-146.

(8)

[4] C. Wang, X.H. Ji,Improved PageRank Algorithm Basedon User Interest and topic, Computer Science. v43 (2016) 275-278.

[5]J.M. Kleinberg,R.Kumar,P.Raghavan,et al.The Webas agraph:Measurements,models,and methods, Proc. of the 50th Annual International Conf. on Computing and Combinatorics, Tokyo, Japan,1999,pp.1-17.

[6]Z.M.Xu,D.Li,T.Liu,etal,MeasuringSimilaritybetweenMicro-blogUserandItsApplication, ChineseJournalofComputers.37(2014)208-217.

[7] C. Qi, H.C. Chen, H.T. Yu, et al, Method of evaluating micro-blog users’ influence based on comprehensiveanalysisofuserbehavior,ApplicationResearchofComputers.7(2014)2005-2007.

[8] G.S. Li, G. Lu, An improved PageRank algorithm based on time feedback and classification technology,JournalofBeijingUniversityofChemicalTechnology. 40(2013)87-89.

[9] J. Weng, E. P.Lim, J.Jiang, Twitterrank:finding topic-sensitive influential twitterers, Proc. of thethirdACMinternationalconf.onWebsearchanddatamining,2010, pp.261-270.

[10] W.Gang, Y.M.Wei, Arnoldi versus GMRES for computing pageRank: A theoretical contributiontogoogle'spageRankproblem,Proc.ofthe4thProQuest,2012,pp.192-199.

[11] J.P. Zhang, X.Y. Huang, Research and Application on Text Similarity Algorithm Based on Semantics,ChongqingUniversityofTechnology,Chongqing,2014.

Figure

Figure 1. Values of  ,  ,  analysis chart.

References

Related documents