
5. Web Usage Mining Using Latent Dirichlet Allocation Model

5.3. Using LDA for Discovering Access Pattern

Just as it captures the underlying topics over the word vocabulary and each document's probability distribution over the topic space, LDA can also be used to discover hidden access topics (i.e. tasks) and user preference mixtures over the uncovered topic space from user surfing history. That is, from the usage data, LDA identifies the latent topics, each in the form of a simplex of Web pages, and characterizes each Web user session as a simplex of these discovered topics. In other words, LDA reveals two aspects of the underlying usage information: the hidden topic space and the topic mixture distribution of each Web user session, which reflect the underlying correlations between Web pages as well as between Web user sessions. With the discovered topic-simplex expression, it is possible to model user access patterns in terms of topic mixture distributions and, in turn, to predict the pages a user is potentially interested in by employing a collaborative recommendation algorithm. In the following parts, we discuss how to discover user access patterns in terms of topic-simplex expressions, as well as the latent topic space, based on the LDA model.

Similar to the document-topic expression in text mining discussed above, viewing Web user sessions as mixtures of topics makes it possible to formulate the problem of identifying the underlying topics/tasks hidden in the usage data. Given $m$ Web user sessions expressing $z$ topics over $n$ distinct pages, we can represent $P(p \mid z)$ with a set of $z$ multinomial distributions $\phi$ over the $n$ pages, such that $P(p \mid z = j) = \phi_p^{(j)}$, and $P(z)$ with a set of $m$ multinomial distributions $\theta$ over the $z$ topics, such that for a page in Web session $s$, $P(z = j) = \theta_j^{(s)}$. To discover the set of topics hidden in a collection of Web pages $p = \{p_1, p_2, \ldots, p_n\}$, where each $p_i$ appears in some Web sessions, our aim is to obtain an estimate of $\phi$ that gives a high probability to the pages in the page collection. Here we use the LDA model described above to estimate the parameters $\phi$ and $\theta$ that result in a maximum log likelihood of the usage data. The complete probability model is as follows:

$$
\begin{aligned}
\theta_{s_i} &\sim \mathrm{Dirichlet}(\alpha) \\
z_k \mid \theta_{s_i} &\sim \mathrm{Discrete}(\theta_{s_i}) \\
\phi_{z_k} &\sim \mathrm{Dirichlet}(\beta) \\
p_j \mid z_k, \phi_{z_k} &\sim \mathrm{Discrete}(\phi_{z_k})
\end{aligned}
\tag{4.17}
$$

Here, $Z = \{z_k\}$ stands for the set of hidden topics, $\theta_{s_i}$ denotes Web session $s_i$'s preference distribution over the topics, and $\phi_{z_k}$ represents topic $z_k$'s association distribution over the page collection. $\alpha$ and $\beta$ are the hyperparameters of the priors of $\theta$ and $\phi$. In this manner, equation (4.15) is re-parameterized as

$$
P(s_i \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{j=1}^{n} \sum_{z_k \in Z} p(z_k \mid \theta)\, p(p_j \mid z_k, \beta) \right) d\theta
\tag{4.18}
$$

where $s_i$ denotes a user session and $n$ is the number of pages. More details regarding the formulation can be found in [67].
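The generative reading of model (4.17) can be made concrete with a small simulation. The following is a minimal sketch, not the procedure used in this chapter: the function names, the symmetric hyperparameter values and the fixed session length are all assumptions chosen for illustration, and Dirichlet draws are obtained by normalizing Gamma draws.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def sample_discrete(probs, rng):
    """Draw one index from the discrete distribution given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_sessions(num_sessions, session_length, num_topics, num_pages,
                      alpha=0.5, beta=0.1, seed=0):
    """Sample synthetic sessions following (4.17).

    alpha, beta and session_length are illustrative assumptions.
    """
    rng = random.Random(seed)
    # phi_z ~ Dirichlet(beta): one distribution over pages per topic
    phi = [sample_dirichlet([beta] * num_pages, rng) for _ in range(num_topics)]
    thetas, sessions = [], []
    for _ in range(num_sessions):
        # theta_s ~ Dirichlet(alpha): topic mixture of this session
        theta = sample_dirichlet([alpha] * num_topics, rng)
        pages = []
        for _ in range(session_length):
            topic = sample_discrete(theta, rng)            # z | theta_s ~ Discrete(theta_s)
            pages.append(sample_discrete(phi[topic], rng))  # p | z ~ Discrete(phi_z)
        thetas.append(theta)
        sessions.append(pages)
    return phi, thetas, sessions

phi, thetas, sessions = generate_sessions(
    num_sessions=5, session_length=8, num_topics=3, num_pages=10)
```

Each generated session is a list of page indices. Inference works in the opposite direction: given only the sessions, it recovers estimates of the topic mixtures and the topic-page distributions.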

We use a variational inference algorithm to estimate each Web session's correlation with multiple topics ($\alpha$) and the associations between the topics and Web pages ($\beta$), with which we can capture the visit preference distribution exhibited by each Web session and identify the semantics of the topic space.

Given a collection of user sessions, we aim to estimate the parameters $\alpha$ and $\beta$ that maximize the log likelihood of the usage data

$$
l(\alpha, \beta) = \sum_{i=1}^{m} \log P(s_i \mid \alpha, \beta)
\tag{4.19}
$$

where $m$ is the number of user sessions.
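To see what the quantities in (4.18) and (4.19) measure, the integral over $\theta$ can be approximated by naive Monte Carlo sampling. This is only an illustrative sketch under stated assumptions: it is not the variational algorithm used in this chapter, the function names and toy values are hypothetical, `beta[k][j]` is taken to be the probability of page `j` under topic `k`, and the product runs over the pages observed in a session.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def session_likelihood(session, alpha, beta, num_samples=2000, seed=0):
    """Monte Carlo estimate of P(s | alpha, beta) from (4.18).

    session: list of page indices visited in the session
    alpha:   Dirichlet parameters, one entry per topic
    beta:    beta[k][j] = p(page j | topic k)  (assumed layout)
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta = sample_dirichlet(alpha, rng)  # theta ~ p(theta | alpha)
        prob = 1.0
        for page in session:
            # sum over topics of p(z_k | theta) * p(p_j | z_k, beta)
            prob *= sum(t * row[page] for t, row in zip(theta, beta))
        total += prob
    return total / num_samples  # sample average approximates the integral

def log_likelihood(sessions, alpha, beta, **kwargs):
    """l(alpha, beta) = sum_i log P(s_i | alpha, beta), as in (4.19)."""
    return sum(math.log(session_likelihood(s, alpha, beta, **kwargs))
               for s in sessions)

# Toy values, chosen only for illustration.
alpha = [0.5, 0.5]
beta = [[0.7, 0.2, 0.1],
        [0.1, 0.2, 0.7]]
sessions = [[0, 0, 1], [2, 2, 1], [0, 2]]
ll = log_likelihood(sessions, alpha, beta)
```

Sampling of this kind becomes inaccurate for long sessions, because the product of many small terms has high variance. This is one reason a variational approximation is preferred in practice.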

The variational EM algorithm [67] proceeds as follows. The E-step updates the optimizing values of the parameters and re-calculates the posterior in equation (4.18); the M-step maximizes the log likelihood with respect to the updated parameters. Iterating these two steps yields the parameters $\alpha$ and $\beta$ that correspond to a maximum likelihood of the usage data.
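The alternation between the two steps can be sketched in a few lines. The code below is a simplified illustration, assuming the standard variational updates for LDA with per-session variational parameters `gamma` and `phi`. It makes several assumptions that are not taken from this chapter: $\alpha$ is held fixed as a symmetric scalar rather than re-estimated, a small smoothing constant keeps every entry of $\beta$ positive, and all function names are hypothetical.

```python
import math
import random

def digamma(x):
    """Digamma function for x > 0: recurrence plus asymptotic expansion."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def e_step(session, alpha, beta, num_iters=50):
    """Variational E-step for one session; returns (gamma, phi)."""
    k, n = len(beta), len(session)
    gamma = [alpha + n / k] * k
    phi = [[1.0 / k] * k for _ in session]
    for _ in range(num_iters):
        for i, page in enumerate(session):
            # phi_ik is proportional to beta[k][page] * exp(digamma(gamma_k))
            weights = [beta[t][page] * math.exp(digamma(gamma[t]))
                       for t in range(k)]
            norm = sum(weights)
            phi[i] = [w / norm for w in weights]
        # gamma_k = alpha + sum_i phi_ik
        gamma = [alpha + sum(row[t] for row in phi) for t in range(k)]
    return gamma, phi

def m_step(sessions, phis, num_topics, num_pages, smoothing=1e-3):
    """Re-estimate beta from expected topic assignments (smoothed)."""
    counts = [[smoothing] * num_pages for _ in range(num_topics)]
    for session, phi in zip(sessions, phis):
        for i, page in enumerate(session):
            for t in range(num_topics):
                counts[t][page] += phi[i][t]
    beta = []
    for row in counts:
        total = sum(row)
        beta.append([v / total for v in row])
    return beta

def variational_em(sessions, num_topics, num_pages, alpha=0.5,
                   num_rounds=30, seed=0):
    """Alternate E- and M-steps; alpha stays fixed (an assumption)."""
    rng = random.Random(seed)
    beta = []
    for _ in range(num_topics):  # random positive start, rows normalized
        row = [rng.random() + 0.1 for _ in range(num_pages)]
        total = sum(row)
        beta.append([v / total for v in row])
    for _ in range(num_rounds):
        phis = [e_step(s, alpha, beta)[1] for s in sessions]
        beta = m_step(sessions, phis, num_topics, num_pages)
    gammas = [e_step(s, alpha, beta)[0] for s in sessions]
    # Normalized gamma serves as the session-topic mixture estimate.
    theta = [[g / sum(gamma) for g in gamma] for gamma in gammas]
    return theta, beta

sessions = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3],
            [4, 5, 3, 5, 4], [0, 1, 1, 2], [5, 4, 3, 3]]
theta, beta = variational_em(sessions, num_topics=2, num_pages=6)
```

Here `theta[i]` approximates the topic mixture of session `i`, and `beta[k]` is the estimated page distribution of topic `k`. A full implementation would also update $\alpha$ in the M-step and monitor the likelihood bound for convergence instead of running a fixed number of rounds.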

Interpreting the contents of the prominent pages related to each topic, based on $\beta$, eventually defines the meaning of each topic. Meanwhile, the topic-oriented user access patterns are constructed by examining each user session's calculated association with multiple topics and aggregating all sessions whose association with a specific topic is greater than a threshold. We describe our approach to discovering the topic-oriented access patterns below.

Given this representation, for each latent topic we can consider the user sessions with $\theta$ exceeding a threshold as "prototypical" user sessions associated with that topic. In other words, the top user sessions that contribute significantly to the topic via their navigational behaviour are used to construct the topic-specific user access pattern. Thus, for each latent topic, we choose all user sessions with $\theta_{s_i}$ exceeding a certain threshold as candidates for this specific access pattern. As a user session is represented by a weighted page vector in the original space of the page collection, we can create an aggregated representation of the selected sessions in the form of a weighted page vector. The algorithm to generate the topic-specific access patterns is described as follows:

[Algorithm 5.1]: Building user access patterns based on the LDA model

[Input]: The calculated session-topic preference distribution $\theta$, the usage data $SP$ and a predefined threshold $\mu$.

[Output]: A set of user access patterns $AP = \{ap_j\}$.

Step 1: For each latent topic $z_j$, choose all user sessions with $\theta_{s, z_j} \geq \mu$ to construct a user session aggregation $R_j$ corresponding to $z_j$.

Step 2: For each latent topic $z_j$, compute the topic-specific aggregated user access pattern of the selected user sessions in $R_j$ by taking the discovered sessions' associations $\theta$ with $z_j$ into account:

$$
ap_j = \frac{\sum_{s \in R_j} \theta_{s, z_j} \cdot s}{|R_j|}
\tag{4.20}
$$

where $|R_j|$ is the number of the selected user sessions in $R_j$.

Step 3: Output a set of topic-oriented user access patterns $AP$ over the $k$ topics, $AP = \{ap_1, ap_2, \ldots, ap_k\}$.
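The three steps of Algorithm 5.1 can be sketched as follows. This is a minimal illustration under stated assumptions: sessions are given as equal-length weighted page vectors, `theta[i][j]` holds the association of session `i` with topic `j`, and a topic for which no session reaches the threshold is simply skipped (the algorithm above does not specify this case). The function name and the toy data are hypothetical.

```python
def build_access_patterns(theta, sessions, mu):
    """Build topic-specific access patterns as in Algorithm 5.1.

    theta:    theta[i][j] = association of session i with topic j
    sessions: sessions[i] = weighted page vector of session i
    mu:       threshold on the session-topic association
    Returns a dict mapping topic index j to its pattern ap_j.
    """
    num_topics = len(theta[0])
    num_pages = len(sessions[0])
    patterns = {}
    for j in range(num_topics):
        # Step 1: R_j holds the sessions whose association with z_j is >= mu
        selected = [i for i in range(len(sessions)) if theta[i][j] >= mu]
        if not selected:
            continue  # assumption: topics with an empty R_j yield no pattern
        # Step 2: theta-weighted sum of session vectors, divided by |R_j| (4.20)
        pattern = [0.0] * num_pages
        for i in selected:
            for p in range(num_pages):
                pattern[p] += theta[i][j] * sessions[i][p]
        patterns[j] = [v / len(selected) for v in pattern]
    # Step 3: the collection of patterns over all topics
    return patterns

# Toy data: three sessions over three pages, two topics.
theta = [[0.9, 0.1],
         [0.8, 0.2],
         [0.2, 0.8]]
sessions = [[1.0, 0.0, 0.0],
            [0.5, 0.5, 0.0],
            [0.0, 0.0, 1.0]]
patterns = build_access_patterns(theta, sessions, mu=0.5)
```

With these toy values, the first two sessions are selected for topic 0 and the third for topic 1, so each pattern is a weighted page vector dominated by the pages that its prototypical sessions visited.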

In document Web Mining Techniques for Recommendation and Personalization (Page 112-115)