4. Discovering Usage Pattern and Latent Factor with Probabilistic Latent
4.3. Constructing User Access Pattern and Identifying Latent Factor with PLSA
As discussed in section 4.2, each latent factor z do really represent a specific aspect k
associated with the usage co-occurrence activities in nature. In other words, for each
factor, there might exist a task-oriented user access pattern corresponding to it. We, thus,
can utilize the class-conditional probability estimates generated by the PLSA model to
produce the aggregated user profiles for characterizing user navigational behaviours.
Conceptually, each aggregated user profile will be expressed as a collection of pages,
which are accompanied by their corresponding weights indicating the contributions to
such user group made by those pages. Furthermore, analysing the generated user profile
can lead to revealing common user access interests, such as dominant or secondary
4.3.1. Partitioning User Sessions
Firstly, we begin with the probabilistic variable P s z( i k) , which represents the
occurrence probability in the condition of a latent class factor z exhibited by a given k
user session s . On the other hand, the probabilistic distribution over the factor space of a i
specific user session s can reflect the specific user access preference over the whole i
latent factor space, therefore, it may be utilized to uncover the dominant factors by
distinguishing the top probability values. Therefore, for each user session s , we can i
further compute a set of probabilities (P z s over the latent factor space via Bayesian k i)
formula as follows: ( | ) ( ) ( | ) ( | ) ( ) k i k k k i i k k z Z P s z P z P z s P s z P z ∈ =
∑
i i (4.10)Actually, the set of probabilities P z s is tending to be “sparse”, that is, for a given
( )
k i si,typically only a few entries are significantly different from the predefined threshold.
Hence we can classify the user into corresponding clusters based on these probabilities
exceeding the given threshold. Since each user session can be expressed as a pages vector
in the original n-dimensional space, we can create a mixture representation of the
collection of user sessions within same cluster that associated with the factor z in terms k
of a collection of weighted pages. The algorithm for partitioning user sessions is
described as following.
[Input]: A set of calculated probability values of (P z sk i), a user session-page matrix
ij
SP , and a predefined threshold µ.
[Output]: A set of session clusters SCL=
(
SCL SCL1, 2,SCLk)
Step 1: Set SCL1=SCL2 ==SCLk =φ,
Step 2: For eachsi∈S, select P z s
( )
k i , ifP z s( )
k i ≥µ, then SCLk =SCLk∪si, Step 3: If there are still users sessions to be clustered, go back to step 2,Step 4: Output session clusters SCL=
{
SCLk}
.[Algorithm 4.2]: Generating user profile
[Input]: A session cluster set SCL=
{
SCLk}
.[Output]: A set of user profiles UP=
{ }
UPk .Step 1: For each factorzk, choose all candidate sessions in SCLk,
Step 2: Represent each session si as a page vector and compute their centroids in the
form of page vector as:
( | ) i k i i k s P z s UP R =
∑
i (4.11)where R denotes the total number of sessions in the cluster,
Step 3: if there are still user session clusters not to be processed, go back to step 1,
Step 4: Output the centroid of each session cluster in the form of page vector as the
By now, we assign user sessions into the corresponding clusters which can be considered
to represent user navigational patterns based on the calculated conditional probability
distributions from the PLSA model and characterize the representations of the user
profiles in terms of weighted page vector as well. As discussed above, it can be seen that
a particular user session does not only belong to just one cluster, but also to other
different clusters associated with different latent factors. For example, a user session may
exhibit different interests (with different probabilities) on two aspects z1 and z2. This
can be “explained” as that a user may, indeed, perform different tasks during the same
session and really reflect the nature of user access patterns in real world. It can be
implied, in turn, the PLSA model partitions user session-page pairs, which is different
from clustering either user sessions or pages or both. In other words, the user session-
page probabilities in the PLSA model reflect “overlay” of latent factors, while the
conventional clustering model assumes there is just one cluster-specific distribution
contributed by all user sessions in the cluster [66].
4.3.2. Characterizing Latent Semantic Factor
As mentioned in the previous section, the core of PLSA model is the latent factor space.
From this point of view, how to characterize the factor space or explain the semantic
meaning of factors is a crucial issue in PLSA model. Similarly, we can also utilize
another obtained conditional probability distribution (P p zj k)by PLSA model to identify
the semantic meaning of the latent factor by partitioning Web pages into corresponding
For each hidden factor z , we may consider that the pages, whose conditional k
probabilities P p z( j k) are greater than a predefined threshold, can be viewed as
providing similar functional components corresponding to the latent factor. In this way,
we can select all pages with probabilities exceeding a certain threshold to form an aspect-
specific page group. By analysing the URLs of the pages and their weights derived from
the conditional probabilities, which are associated with the specific factor, we may
characterize and explain the semantic meaning of each factor. In section 4.4, we will
present two examples with respect to the discovered latent factors. The algorithm to
generate the factor-oriented Web page group is briefly described as follows:
[Algorithm 4.3]: Characterizing latent semantic factor
[Input]: A set of conditional probabilities, P p z
(
j k)
, a predefined threshold µ.[Output]: A set of latent semantic factors represented by a set of dominant pages.
Step 1: Set PCL1=PCL2 ==PCLk =φ,
Step 2: For eachzk, choose all Web pages such thatP p z
(
j k)
≥µ and (P z pk j)≥µ, then constructPCLk = pj∪PCLk,Step 3: If there are still pages to be classified, go back to step 2,
Step 4: Output PCL=