Identifying Task-Oriented Navigational Distribution for Web Recommendation

6. Discovering Task-Oriented Navigational Distribution for Web Recommendation with PLSA Model

6.3. Identifying Task-Oriented Navigational Distribution for Web Recommendation

As discussed before, each latent factor $z_k$ represents a specific aspect associated with the co-occurrence observations. In this sense, we can utilize the factor-conditional probability estimates generated by the PLSA model to partition Web pages, and to characterize the latent factors by extracting the contents of the “dominant” Web pages whose probabilities exceed a predefined threshold.

6.3.1. Characterizing Latent Factor Space

First, we discuss how to capture the latent factors associated with user navigational behaviours. This aim is accomplished by characterizing the “dominant” pages. Note that $P(p_j|z_k)$ represents the conditional occurrence probability over the page space corresponding to a specific factor, whereas $P(z_k|p_j)$ represents the conditional probability distribution over the factor space corresponding to a specific page, which is expressed in the form of

$$P(z_k|p_j) = \frac{P(p_j|z_k) \cdot P(z_k)}{\sum_{z_{k'} \in Z} P(p_j|z_{k'}) \cdot P(z_{k'})} \qquad (6.1)$$

In such an expression, we may consider that the pages whose conditional probabilities $P(p_j|z_k)$ and $P(z_k|p_j)$ are both greater than a predefined threshold $\mu$ contribute significantly to one particular functionality related to the latent factor. We therefore choose all pages satisfying this condition to form a number of “dominant” page sets. By exploring the contents of these pages, we may characterize the semantic meaning of each factor. In Section 6.4, we will present some examples of latent factors derived from two real data sets. The algorithm to characterize the task-oriented semantic latent factors is described as follows:

[Algorithm 6.2]: Characterize Latent Factors

[Input]: A set of probability estimates $P(p_j|z_k)$ and $P(z_k|p_j)$, a predefined threshold $\mu$.

[Output]: A set of characteristic page base sets $LF = (LF_1, LF_2, \ldots, LF_k)$.

Step 1: $LF_1 = LF_2 = \cdots = LF_k = \varnothing$,

Step 2: For each $z_k$, choose all pages $p_j \in P$,

If $P(p_j|z_k) \ge \mu$ and $P(z_k|p_j) \ge \mu$ then $LF_k = LF_k \cup \{p_j\}$

Else go back to step 2,

Step 3: If there are still pages to be classified, go back to step 2,
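The selection of dominant pages can be illustrated with a minimal Python sketch. It is not the original implementation: the function names `factor_given_page` and `characterize_latent_factors`, the list-of-lists layout (`table[k][j]` for factor $z_k$ and page $p_j$), and the toy probabilities are all assumptions made for illustration. The first helper inverts $P(p_j|z_k)$ into $P(z_k|p_j)$ following (6.1); the second applies the two threshold tests of Algorithm 6.2.

```python
def factor_given_page(p_page_given_factor, p_factor):
    """Invert P(p_j|z_k) into P(z_k|p_j) with Bayes' rule, as in (6.1).

    p_page_given_factor[k][j] holds P(p_j|z_k); p_factor[k] holds P(z_k).
    Returns a table with result[k][j] = P(z_k|p_j).
    """
    n_factors = len(p_factor)
    n_pages = len(p_page_given_factor[0])
    result = [[0.0] * n_pages for _ in range(n_factors)]
    for j in range(n_pages):
        # Denominator of (6.1): sum over all factors of P(p_j|z) * P(z).
        evidence = sum(
            p_page_given_factor[k][j] * p_factor[k] for k in range(n_factors)
        )
        for k in range(n_factors):
            if evidence > 0:
                result[k][j] = p_page_given_factor[k][j] * p_factor[k] / evidence
    return result


def characterize_latent_factors(p_page_given_factor, p_factor_given_page, mu):
    """Return one set of dominant page indices per factor (Algorithm 6.2)."""
    dominant = []
    for k in range(len(p_page_given_factor)):
        # A page is dominant for z_k only if both probabilities reach mu.
        pages = {
            j
            for j in range(len(p_page_given_factor[k]))
            if p_page_given_factor[k][j] >= mu and p_factor_given_page[k][j] >= mu
        }
        dominant.append(pages)
    return dominant


# Toy estimates, invented for illustration: 2 factors, 3 pages.
p_pz = [[0.7, 0.2, 0.1],
        [0.1, 0.3, 0.6]]
p_z = [0.5, 0.5]
p_zp = factor_given_page(p_pz, p_z)
lf = characterize_latent_factors(p_pz, p_zp, mu=0.25)
print(lf)
```

In this toy setting, page 0 is dominant for the first factor and pages 1 and 2 for the second; inspecting the contents of such pages is what gives each factor its semantic label.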

6.3.2. Identifying Web Page Category

Note that the set of $P(z_k|p_j)$ conceptually represents the probability distribution over the latent factor space for a specific Web page $p_j$. We thus construct the page-factor matrix based on the calculated probability estimates to reflect the relationships between Web pages and latent factors, in which each page is expressed as follows:

$$vp_j = (c_{j,1}, c_{j,2}, \ldots, c_{j,k}) \qquad (6.2)$$

where $c_{j,s}$ is the occurrence probability estimate of the page $p_j$ on a factor $z_s$. In this way, the distance between two page vectors may reflect the functionality similarity exhibited by them. We, therefore, define the similarity by applying the well-known cosine similarity as:

$$sim(p_i, p_j) = \frac{(vp_i, vp_j)}{\|vp_i\|_2 \cdot \|vp_j\|_2} \qquad (6.3)$$

where $(vp_i, vp_j) = \sum_{m=1}^{k} c_{i,m} c_{j,m}$ and $\|vp_i\|_2 = \sqrt{\sum_{l=1}^{k} c_{i,l}^2}$.
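As a small illustration of the similarity measure (6.3), the following Python sketch computes the cosine similarity of two page vectors. The function name `cosine_similarity`, the convention of returning 0.0 for an all-zero vector, and the three toy vectors are assumptions introduced here, not part of the original text.

```python
import math


def cosine_similarity(vp_i, vp_j):
    """Cosine similarity of two page vectors, following (6.3)."""
    inner = sum(a * b for a, b in zip(vp_i, vp_j))
    norm_i = math.sqrt(sum(a * a for a in vp_i))
    norm_j = math.sqrt(sum(b * b for b in vp_j))
    if norm_i == 0.0 or norm_j == 0.0:
        # Assumed convention: an all-zero vector is similar to nothing.
        return 0.0
    return inner / (norm_i * norm_j)


# Toy page vectors over three latent factors (invented for illustration).
vp_1 = [0.8, 0.1, 0.1]
vp_2 = [0.7, 0.2, 0.1]
vp_3 = [0.1, 0.1, 0.8]

sim_12 = cosine_similarity(vp_1, vp_2)
sim_13 = cosine_similarity(vp_1, vp_3)
print(sim_12 > sim_13)
```

Pages 1 and 2 concentrate their probability mass on the same factor, so their similarity is higher than that between pages 1 and 3, which load on different factors.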

With the defined page similarity measure (6.3), we propose a clustering algorithm to

partition Web pages into various page categories. The Web page clustering algorithm is

described as follows:

[Algorithm 6.3]: Clustering Web pages

[Input]: A set of $P(z_k|p_j)$, a predefined threshold $\mu$.

[Output]: A set of Web page categories $PCL = \{PCL_1, \ldots, PCL_P\}$ and the set of cluster centroids $Cid = \{Cid_i\}$.

Step 1: Select the first page $p_1$ as the initial cluster $PCL_1 = \{p_1\}$ and the centroid of this cluster $Cid_1 = p_1$,

Step 2: For each page $p_j$, measure the similarity $sim(p_j, Cid_i)$ between $p_j$ and the centroid of each existing cluster. If

$$sim(p_j, Cid_t) = \max_i \left( sim(p_j, Cid_i) \right) > \mu,$$

then insert $p_j$ into the cluster $PCL_t$ and update the centroid of $PCL_t$ as

$$PCL_t = PCL_t \cup \{p_j\} \qquad (6.4)$$

$$Cid_t = \frac{1}{|PCL_t|} \cdot \sum_{j \in PCL_t} vp_j \qquad (6.5)$$

where $|PCL_t|$ is the number of pages in the cluster $PCL_t$.

Otherwise, $p_j$ will create a new cluster by itself and become the centroid of the new cluster,

Step 3: If there are still pages to be classified into one of the existing clusters, or a page that is itself a cluster, go back to step 2 iteratively until it converges (i.e. all clusters’ centroids are no longer changed),

Step 4: Output $PCL = \{PCL_i\}$ and $Cid = \{Cid_i\}$.
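The clustering procedure can be sketched in Python as follows. This is a simplified, single-pass reading of Steps 1 and 2 only: the repeated passes until the centroids stabilize (Step 3) are omitted for brevity. The names `cosine` and `cluster_pages` and the toy page vectors are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors, as in (6.3)."""
    inner = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return inner / norm if norm > 0 else 0.0


def cluster_pages(page_vectors, mu):
    """Single pass over the pages: join the most similar cluster or open a new one."""
    clusters = []   # clusters[t] is the list of page indices in PCL_t
    centroids = []  # centroids[t] is the centroid vector Cid_t
    for j, vp in enumerate(page_vectors):
        best_t, best_sim = -1, -1.0
        for t, cid in enumerate(centroids):
            s = cosine(vp, cid)
            if s > best_sim:
                best_t, best_sim = t, s
        if best_t >= 0 and best_sim > mu:
            # (6.4): add the page; (6.5): recompute the centroid as the mean vector.
            clusters[best_t].append(j)
            members = clusters[best_t]
            centroids[best_t] = [
                sum(page_vectors[m][d] for m in members) / len(members)
                for d in range(len(vp))
            ]
        else:
            # No existing cluster is similar enough: the page seeds a new cluster.
            clusters.append([j])
            centroids.append(list(vp))
    return clusters, centroids


# Toy page vectors over two latent factors (invented for illustration).
pages = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
pcl, cid = cluster_pages(pages, mu=0.9)
print(pcl)
```

With this threshold the first two pages fall into one category and the last two into another. Note that a single pass depends on the order in which pages are presented, which is one reason the algorithm above iterates until the centroids no longer change.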

6.3.3. Web Recommendation Based on Identifying Task Distribution

Generally, the Web recommendation process predicts customized Web content for users according to the navigational interests exhibited by individual users or groups of users.

Suppose that the conditional probabilities are estimated by PLSA model as described

above, we can, in turn, utilize them to identify the user’s underlying access interest or

Since each user session is represented as a sequence of visited pages with different weights, which are determined by the degree of the user's interest in these pages, we can capture the interest-oriented task sequence derived from the clicked pages within the session accordingly. This aim is accomplished by computing the posterior probability of each task with a Bayesian updating approach, under the assumption that the pages are independent of each other given the task. These posterior probabilities associated with the various tasks indicate the likelihood of the user’s underlying intention. The navigational preference, therefore, is characterized as a sequence of tasks with corresponding probabilities. By presetting an appropriate threshold, we can choose all tasks whose posterior probabilities are greater than the threshold as the collection of dominant tasks that reflects the user’s initial intention. Moreover, combining the identified sequence of dominant tasks with the task-based page categories derived in the previous section leads to the page candidates that the user is more likely to visit or be interested in later. The algorithm is described as follows.

[Algorithm 6.4]: Discovering task-oriented navigational distribution for Web recommendation based on PLSA model

[Input]: An active user session $s_i = (p_1^i, p_2^i, \ldots, p_t^i)$, $p_j^i \in P$, a set of estimated conditional probabilities $P(p_j|z_k)$ and a threshold $\mu$.

[Output]: The dominant task sequence $TL = \{z_{i,1}, \ldots, z_{i,t}\}$ corresponding to the user session

Step 1: For each task $z_k \in Z$, with the pages assumed to be independent of each other given the task, calculate the posterior probability of $z_k$ given all pages in $s_i$ by employing a Bayesian updating method [95]:

$$P(z_k|s_i) = \alpha P(z_k) \prod_{p_j^i \in s_i} P(p_j^i|z_k) \qquad (6.6)$$

where $\alpha$ is a constant,

Step 2: Choose all tasks whose posterior probabilities are greater than the preset threshold to form the dominant task sequence corresponding to the user session:

$$TL = \{z_k \mid z_k \in Z,\ P(z_k|s_i) > \mu\} \qquad (6.7)$$

Step 3: For each $z_k$ in $TL$, incorporate the corresponding task-based page category, and then compute the recommendation score for each page $p_j$ as

$$rs(p_j) = \sum_{z_k \in TL} P(z_k|s_i) \cdot wt(p_j, z_k), \quad p_j \in P,\ z_k \in TL \qquad (6.8)$$

where $wt(p_j, z_k)$ denotes the weight of $p_j$ within the $z_k$ page category. Note that the recommendation score will be 0 if the page is already visited in the current session,

Step 4: Sort the computed recommendation scores from step 3 in descending order, i.e. $rs = (rs(p_1^r), \ldots, rs(p_n^r))$, and choose the $N$ pages with the highest scores to construct the top-N recommendation set:

$$RS = \{p_j^r \mid rs(p_j^r) > rs(p_{j+1}^r),\ j = 1, 2, \ldots, N-1\} \qquad (6.9)$$
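The four steps of Algorithm 6.4 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the original implementation. The constant $\alpha$ in (6.6) is taken to be the normalizer that makes the posteriors sum to one. The structure `page_weights` (one dictionary per task, mapping a page index to its weight $wt(p_j, z_k)$ in that task's page category) is hypothetical, as are the function names and the toy numbers. Visited pages are skipped, which is equivalent to giving them a score of 0.

```python
def task_posteriors(session, p_page_given_task, p_task):
    """Step 1, (6.6): P(z_k|s_i) = alpha * P(z_k) * product of P(p_j|z_k)."""
    unnormalized = []
    for k, prior in enumerate(p_task):
        score = prior
        for page in session:
            score *= p_page_given_task[k][page]
        unnormalized.append(score)
    total = sum(unnormalized)
    # Assumption: alpha = 1 / total, so that the posteriors sum to one.
    return [u / total if total > 0 else 0.0 for u in unnormalized]


def recommend(session, p_page_given_task, p_task, page_weights, mu, top_n):
    """Steps 2 to 4: dominant tasks, recommendation scores, top-N pages."""
    posterior = task_posteriors(session, p_page_given_task, p_task)
    # Step 2, (6.7): keep tasks whose posterior exceeds the threshold.
    dominant = [k for k, p in enumerate(posterior) if p > mu]
    visited = set(session)
    scores = {}
    for k in dominant:
        for page, weight in page_weights[k].items():
            if page in visited:
                continue  # already visited pages are never recommended
            # Step 3, (6.8): accumulate P(z_k|s_i) * wt(p_j, z_k).
            scores[page] = scores.get(page, 0.0) + posterior[k] * weight
    # Step 4, (6.9): sort by descending score (ties broken by page index).
    ranked = sorted(scores, key=lambda page: (-scores[page], page))
    return dominant, ranked[:top_n]


# Toy model, invented for illustration: 2 tasks, 4 pages.
p_pz = [[0.5, 0.3, 0.1, 0.1],
        [0.1, 0.1, 0.4, 0.4]]
p_z = [0.5, 0.5]
page_weights = [{0: 0.9, 1: 0.8, 2: 0.3, 3: 0.6},
                {2: 0.9, 3: 0.7}]
session = [0, 1]

posterior = task_posteriors(session, p_pz, p_z)
tl, rs_top = recommend(session, p_pz, p_z, page_weights, mu=0.5, top_n=1)
print(tl, rs_top)
```

In this toy session the first task dominates, so the unvisited page with the largest weight in that task's category is recommended first.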

In document Web Mining Techniques for Recommendation and Personalization (Page 127-132)