5. Web Usage Mining Using Latent Dirichlet Allocation Model
5.3. Using LDA for Discovering Access Pattern
Just as it captures the underlying topics over a word vocabulary and each document’s
probability distribution over the topic space, LDA can also be used to discover
hidden access topics (i.e. tasks) and user preference mixtures over the uncovered topic
space from user surfing history. That is, from the usage data, LDA identifies the
latent topics in the form of a simplex of Web pages, and characterizes each Web user
session as a simplex of these discovered topics. In other words, LDA reveals two aspects
of the underlying usage information: the hidden topic space and the topic
mixture distribution of each Web user session, which together reflect the underlying
correlations between Web pages as well as between Web user sessions. With the discovered
topic-simplex expression, it is possible to model user access patterns in terms of topic
mixture distributions and, in turn, to predict the pages a user is likely to be interested
in by employing a collaborative recommendation algorithm. In the following parts, we
discuss how to discover user access patterns in terms of topic-simplex expressions, as
well as the latent topic space, based on the LDA model.
Similar to the document-topic expression in text mining discussed
above, viewing Web user sessions as mixtures of topics makes it possible to formulate
the problem of identifying the underlying topics/tasks hidden in the usage data. Given m
Web user sessions expressing z topics over n distinct pages, we can represent
$P(p \mid z)$ with a set of z multinomial distributions $\phi$ over the n pages, such that
$P(p \mid z = j) = \phi_p^{(j)}$, and $P(z)$ with a set of m multinomial distributions
$\theta$ over the z topics, such that for a page in Web session s, $P(z = j) = \theta_j^{(s)}$.
To discover the set of topics hidden in a collection of Web pages
$p = \{p_1, p_2, \ldots, p_n\}$, where each $p_i$ appears in some Web sessions, our aim is
to obtain an estimate of $\phi$ that gives a high probability over the pages in the page
collection. Here we use the LDA model described above to estimate the $\phi$ and
$\theta$ parameters that result in a maximum log likelihood of the usage data. The complete
probability model is as follows:
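\[
\begin{aligned}
\theta_{s_i} &\sim \mathrm{Dirichlet}(\alpha) \\
z_i \mid \theta_{s_i} &\sim \mathrm{Discrete}(\theta_{s_i}) \\
\phi_{z_i} &\sim \mathrm{Dirichlet}(\beta) \\
p_j \mid z_i, \phi_{z_i} &\sim \mathrm{Discrete}(\phi_{z_i})
\end{aligned}
\tag{4.17}
\]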
Here, z stands for a set of hidden topics, $\theta_{s_i}$ denotes a Web session $s_i$’s
preference distribution over the topics and $\phi_{z_i}$ represents the specific topic
$z_i$’s association distribution over the page collection. $\alpha$ and $\beta$ are
hyperparameters of the priors of $\theta$ and $\phi$. In this manner, equation (4.15) is
re-parameterized as
\[
P(s_i \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{j=1}^{n} \sum_{z_k \in Z} p(z_k \mid \theta)\, p(p_j \mid z_k, \beta) \right) d\theta
\tag{4.18}
\]
where $s_i$ denotes a user session and n is the number of pages. More details regarding
the formulation can be found in [67].
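To make the generative reading of (4.17) concrete, the following minimal sketch samples synthetic sessions from the model. It is an illustration only, not part of the method in [67]: the function names, the use of symmetric scalar hyperparameters, and the fixed session length are all assumptions introduced here.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_discrete(probs, rng):
    """Draw an index from a discrete distribution given as a probability list."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_sessions(m, n_pages, n_topics, session_len, alpha, beta, seed=0):
    """Sample m sessions following (4.17).

    alpha and beta are symmetric scalar hyperparameters (an assumption),
    and every session has the same length (also an assumption).
    """
    rng = random.Random(seed)
    # phi_z ~ Dirichlet(beta): one page distribution per topic
    phi = [sample_dirichlet([beta] * n_pages, rng) for _ in range(n_topics)]
    sessions, thetas = [], []
    for _ in range(m):
        # theta_s ~ Dirichlet(alpha): the session's topic mixture
        theta = sample_dirichlet([alpha] * n_topics, rng)
        pages = []
        for _ in range(session_len):
            z = sample_discrete(theta, rng)              # z | theta_s ~ Discrete(theta_s)
            pages.append(sample_discrete(phi[z], rng))   # p | z, phi_z ~ Discrete(phi_z)
        sessions.append(pages)
        thetas.append(theta)
    return sessions, thetas, phi
```

Calling `generate_sessions(m=5, n_pages=6, n_topics=3, session_len=4, alpha=0.5, beta=0.5)` returns five sessions of page indices together with the topic mixtures and page distributions that produced them.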
We use a variational inference algorithm to estimate each Web session’s correlation with
multiple topics (α), and the associations between the topics and Web pages (β), with which we can capture the user visit preference distribution exhibited by each Web session
and identify the semantics of the topic space.
Given a collection of user sessions, we aim to estimate the parameters $\alpha$ and $\beta$ that maximize the log likelihood of the usage data
\[
l(\alpha, \beta) = \sum_{i=1}^{m} \log P(s_i \mid \alpha, \beta)
\tag{4.19}
\]
where m is the number of user sessions.
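For intuition about what (4.18) and (4.19) measure, the sketch below approximates the integral over θ by naive Monte Carlo sampling. This is not the variational procedure used in this section; it is a swapped-in estimator chosen only because it is short. It assumes `beta` is given as a topic-by-page matrix of probabilities $p(p_j \mid z_k, \beta)$ and `alpha` as a list of Dirichlet parameters; all function names are hypothetical.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def session_likelihood(pages, alpha, beta, n_samples=2000, seed=0):
    """Monte Carlo estimate of P(s | alpha, beta) in (4.18).

    pages: list of page indices visited in the session.
    beta[k][j]: probability of page j under topic k (assumed layout).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        theta = sample_dirichlet(alpha, rng)   # theta ~ p(theta | alpha)
        prob = 1.0
        for p in pages:
            # inner sum over topics: sum_k p(z_k | theta) * p(p_j | z_k, beta)
            prob *= sum(theta[k] * beta[k][p] for k in range(len(alpha)))
        total += prob
    return total / n_samples

def log_likelihood(sessions, alpha, beta, n_samples=2000, seed=0):
    """Log likelihood of the usage data as in (4.19)."""
    return sum(
        math.log(session_likelihood(s, alpha, beta, n_samples, seed))
        for s in sessions
    )
```

With a single topic the mixture θ is always 1, so the estimate reduces to the plain product of page probabilities, which gives a quick sanity check of the implementation.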
The variational EM algorithm [67] executes as follows. The E-step updates the optimizing
values of the parameters and re-calculates the posterior value of equation (4.18); the
M-step maximizes the log likelihood with respect to the updated parameters. Iterating
these two steps yields the parameters α and β that correspond to a maximum likelihood of the usage data.
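As an illustration of the E-step, the sketch below applies the standard mean-field updates for LDA to a single session. Whether these updates match the exact formulation in [67] is an assumption; the M-step is omitted, `beta` is assumed to be a strictly positive topic-by-page probability matrix, and the names `digamma`, `e_step`, `gamma` and `resp` are introduced here for illustration. Note that `resp` holds per-visit topic responsibilities and is not the page distribution φ of (4.17).

```python
import math

def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        # psi(x) = psi(x + 1) - 1/x shifts the argument into the asymptotic range
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def e_step(pages, alpha, beta, n_iter=50):
    """Mean-field updates for one session.

    pages: list of page indices visited in the session.
    alpha: list of Dirichlet parameters, one per topic.
    beta[k][j]: probability of page j under topic k (assumed strictly positive).
    Returns (gamma, resp): the variational Dirichlet parameters and the
    per-visit topic responsibilities.
    """
    k = len(alpha)
    n = len(pages)
    gamma = [a + n / k for a in alpha]
    resp = [[1.0 / k] * k for _ in pages]
    for _ in range(n_iter):
        weights = [math.exp(digamma(g)) for g in gamma]
        for i, p in enumerate(pages):
            row = [beta[t][p] * weights[t] for t in range(k)]
            norm = sum(row)
            resp[i] = [r / norm for r in row]
        # gamma_t = alpha_t + expected number of visits assigned to topic t
        gamma = [alpha[t] + sum(resp[i][t] for i in range(n)) for t in range(k)]
    return gamma, resp

def topic_mixture(gamma):
    """Normalise gamma into an estimate of the session's topic mixture."""
    total = sum(gamma)
    return [g / total for g in gamma]
```

Normalising `gamma` with `topic_mixture` gives a point estimate of a session's preference distribution over topics, which is the quantity thresholded when building access patterns.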
Interpreting the contents of the prominent pages related to each topic, as ranked by β, eventually defines the meaning of each topic. Meanwhile, the topic-oriented
user access patterns are constructed by examining each user session’s calculated
association with multiple topics and aggregating all sessions whose association with a
specific topic is greater than a threshold. We describe our approach to discovering the
topic-oriented access patterns below.
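Before turning to the access patterns, the topic-interpretation step mentioned above can be illustrated with a short sketch that lists the most prominent pages of each topic. It assumes β is available as a topic-by-page matrix of weights; the function name and the page URLs in the usage note are hypothetical.

```python
def top_pages(beta, page_names, top_n=3):
    """Return, for each topic, the names of its top_n highest-weighted pages.

    beta[k][j]: weight of page j under topic k (assumed layout).
    page_names[j]: a label for page j, for example its URL.
    """
    result = []
    for row in beta:
        # rank page indices by descending weight within this topic
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        result.append([page_names[j] for j in ranked[:top_n]])
    return result
```

For example, with `beta = [[0.1, 0.6, 0.3], [0.5, 0.2, 0.3]]` and the made-up page names `["/home", "/cart", "/faq"]`, calling `top_pages(beta, names, top_n=2)` lists `/cart` and `/faq` for the first topic and `/home` and `/faq` for the second; an analyst would then inspect those pages to label each topic.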
Given this representation, for each latent topic, we can consider user sessions with
$\theta_{s_i}$ exceeding a threshold as “prototypical” user sessions associated with that
topic. In other words, these top user sessions, which contribute significantly to the
topic via their navigational behaviour, are used to construct the topic-specific user
access pattern. Thus, for each latent topic, we choose all user sessions with
$\theta_{s_i}$ exceeding a certain threshold as candidates for this specific access
pattern. As a user session is represented by a weighted page vector in the original space
of the page collection, we can create an aggregated access pattern in the
form of a weighted page vector. The algorithm to generate the topic-specific access pattern
is described as follows:
[Algorithm 5.1]: Building user access pattern based on LDA model
[Input]: The calculated session-topic preference distribution θ, the usage data SP and a
predefined threshold µ.
[Output]: A set of user access patterns $AP = \{ap_k\}$.
Step 1: For each latent topic $z_j$, choose all user sessions with $\theta_{s, z_j} \geq \mu$
to construct a user session aggregation $R_j$ corresponding to $z_j$.
Step 2: For each latent topic $z_j$, compute the topic-specific aggregated user access pattern
of the selected user sessions in $R_j$ by taking the discovered sessions’ associations
$\theta$ with $z_j$ into account:
\[
ap_k = \frac{\sum_{s \in R_j} \theta_{s, z_j} \cdot s}{|R_j|}
\tag{4.20}
\]
where $|R_j|$ is the number of the selected user sessions in $R_j$.
Step 3: Output a set of topic-oriented user access patterns AP over the k topics,
$AP = \{ap_1, ap_2, \ldots, ap_k\}$.
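The three steps of Algorithm 5.1 can be sketched as follows. This is a minimal illustration under stated assumptions: `theta` is a session-by-topic matrix, each session in `sessions` is a weighted page vector of equal length, and a topic whose aggregation is empty yields an all-zero pattern (a choice made here, since the algorithm does not say what to do in that case). The function name is hypothetical.

```python
def build_access_patterns(theta, sessions, mu):
    """Build one aggregated access pattern per latent topic.

    theta[i][j]: association of session i with topic j.
    sessions[i]: weighted page vector of session i (the usage data SP).
    mu: threshold on the session-topic association.
    """
    n_topics = len(theta[0])
    n_pages = len(sessions[0])
    patterns = []
    for j in range(n_topics):
        # Step 1: select the sessions whose association with topic j reaches mu
        selected = [i for i in range(len(sessions)) if theta[i][j] >= mu]
        if not selected:
            # assumption: an empty aggregation gives a zero pattern
            patterns.append([0.0] * n_pages)
            continue
        # Step 2: theta-weighted sum of the selected page vectors, divided by |R_j|
        ap = [0.0] * n_pages
        for i in selected:
            for p in range(n_pages):
                ap[p] += theta[i][j] * sessions[i][p]
        patterns.append([v / len(selected) for v in ap])
    # Step 3: one pattern per topic
    return patterns
```

For instance, with `theta = [[0.5, 0.5], [1.0, 0.0], [0.25, 0.75]]`, `sessions = [[2.0, 0.0], [4.0, 2.0], [0.0, 8.0]]` and `mu = 0.5`, the first topic aggregates sessions 0 and 1 and the second aggregates sessions 0 and 2, giving the patterns `[2.5, 1.0]` and `[0.5, 3.0]`.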