3. Discovering Web Usage Pattern with Latent Semantic Indexing Approach
3.1. Latent Semantic Indexing Algorithm
In this section, we first introduce the LSI algorithm and its mathematical
background, in particular the Singular Value Decomposition (SVD) operation from
linear algebra, which forms the foundation of the LSI algorithm. On the
transformed semantic space, we then propose a novel similarity function to measure the
distance between two user sessions, which is used in Web clustering.
3.1.1. Web Usage Data Model
The Web usage data is originally collected and stored in the Web server logs of websites, and
is pre-processed for data analysis through data cleaning, page identification, and
related steps. In this work we are mainly interested in the refined usage data instead of
the raw data; more details regarding the data preparation steps can be found in [73].
At this stage, we first review the usage data model described in the previous chapter, and
particularly introduce the concept of the session-page matrix for Web usage mining.
As discussed above, in the context of Web usage mining, we construct two sets of Web
objects: the Web session set $S = \{s_1, s_2, \ldots, s_m\}$ and the Web page set $P = \{p_1, p_2, \ldots, p_n\}$.
Figure 3-1. The schematic structure of a session-page matrix (rows: session set $s_1, \ldots, s_m$; columns: page corpus $p_1, \ldots, p_n$; entry $a_{ij}$)
Each user session is considered as a sequence of page-weight pairs, say
$s_i = \{(p_1, a_{i1}), (p_2, a_{i2}), \ldots, (p_n, a_{in})\}$. For simplicity of expression, each user
session can be rewritten as a sequence of weights over the page space, i.e.
$s_i = \{a_{i1}, a_{i2}, \ldots, a_{in}\}$, where $a_{ij}$ denotes the weight of page $p_j$ in user session $s_i$.
As a result, the whole user session data can be formed as a Web usage data matrix
represented by a session-page matrix $SP_{m \times n} = \{a_{ij}\}$ (Figure 3-1 illustrates the schematic
structure of the session-page matrix).
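The construction of the session-page matrix from page-weight pairs can be sketched as follows. This is only an illustrative sketch, assuming NumPy and a dictionary representation of each session; the function name `build_session_page_matrix` and the sample page IDs and weights are hypothetical and do not come from the original data.

```python
import numpy as np

def build_session_page_matrix(sessions, pages):
    """Build SP (m x n) from sessions given as {page_id: weight} dictionaries.

    `pages` fixes the column order p_1, ..., p_n; pages that a session did
    not visit keep weight 0.
    """
    column = {page: j for j, page in enumerate(pages)}
    sp = np.zeros((len(sessions), len(pages)))
    for i, session in enumerate(sessions):
        for page, weight in session.items():
            sp[i, column[page]] = weight
    return sp

# Hypothetical toy data: two sessions over three pages.
pages = ["p1", "p2", "p3"]
sessions = [{"p1": 2, "p3": 1}, {"p2": 4}]
SP = build_session_page_matrix(sessions, pages)
```

Each row of `SP` is then the weight vector $s_i$ of one session over the page space.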
The entry value $a_{ij}$ in the session-page matrix is usually determined by the number of
hits or the amount of time spent by a specific user on the corresponding page. Generally, in
order to eliminate the influence of relative differences in visiting duration
or hit number, a normalization across the page space is performed within the same user
session. Figure 3-2 illustrates two snapshots of Web log records extracted
from a Web access log, in which each field is separated by a space. Note in particular that
the first and fourth fields are identified as the visitor IP address and the requested URL
respectively, and are used to help collect usage data. Thus, the first field can be
identified as the user session ID and the fourth field is treated as the page ID.
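The step from log records to normalized weights can be sketched as follows. This is a minimal sketch under several assumptions that the text does not fix: the log lines follow the common Apache-style access log layout, the weight is the hit count, and the normalization divides each session's weights by their sum. The regular expression, the function names, and the sample log lines (with documentation-range IP addresses) are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumed layout: IP, two ignored fields, [timestamp], "METHOD URL PROTOCOL".
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:\S+) (\S+)[^"]*"')

def parse_line(line):
    """Return (session_id, page_id) = (visitor IP, requested URL), or None."""
    match = LOG_PATTERN.match(line)
    return (match.group(1), match.group(2)) if match else None

def hit_counts(lines):
    """Count hits per page for every session ID."""
    counts = defaultdict(Counter)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            session_id, page_id = parsed
            counts[session_id][page_id] += 1
    return counts

def normalize(counter):
    """Scale one session's weights so that they sum to 1 (assumed scheme)."""
    total = sum(counter.values())
    return {page: hits / total for page, hits in counter.items()}

# Hypothetical log lines, not taken from the original access log.
lines = [
    '192.0.2.1 - - [01/Feb/2003:00:00:03 +1100] "GET /a.html HTTP/1.1" 200 100',
    '192.0.2.1 - - [01/Feb/2003:00:00:05 +1100] "GET /a.html HTTP/1.1" 200 100',
    '192.0.2.1 - - [01/Feb/2003:00:00:09 +1100] "GET /b.html HTTP/1.1" 200 100',
    '192.0.2.2 - - [01/Feb/2003:00:00:12 +1100] "GET /b.html HTTP/1.1" 200 100',
]
weights = {sid: normalize(c) for sid, c in hit_counts(lines).items()}
```

Dividing by the row sum makes sessions with many hits comparable to short ones; other normalizations (for example, dividing by the row maximum) would fit the description equally well.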
Figure 3-2. Snapshots from a Web access log
Once the usage matrix is constructed, we may apply conventional clustering
algorithms to the user session data to classify user sessions into various groups, within
which the classified sessions share similar access interests. It is intuitive to perform
clustering directly on the row vectors of the usage matrix and to determine the
relatively “close” session clusters by using a similarity-based measure, such as the
commonly adopted cosine similarity from Information Retrieval. In [29], for example, an
algorithm named PACT is proposed to address usage pattern mining based on the
above-mentioned technique. However, this kind of clustering technique only captures the
explicit mutual relationships between session data; it is incapable of revealing the
“deeper” underlying characteristics of usage patterns. In this work, we propose an
algorithm named Latent Usage Information (LUI) to group user sessions semantically by
taking the latent semantic information into account. To better understand the LUI
algorithm, we first discuss some theoretical background of the SVD algorithm.
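The direct, row-wise comparison of sessions with cosine similarity mentioned above can be sketched as follows. This is an illustrative sketch only, assuming NumPy; it is not the PACT algorithm of [29], and the toy matrix and the convention of returning 0 for an all-zero session are assumptions.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two session vectors (0.0 if either is all zeros)."""
    norm_x, norm_y = np.linalg.norm(x), np.linalg.norm(y)
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return float(np.dot(x, y) / (norm_x * norm_y))

# Hypothetical usage matrix: three sessions over three pages.
SP = np.array([
    [1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
])
sim_01 = cosine_similarity(SP[0], SP[1])  # same pages, different magnitude
sim_02 = cosine_similarity(SP[0], SP[2])  # no page in common
```

Sessions 0 and 1 point in the same direction, while sessions 0 and 2 share no page and are therefore orthogonal; a measure of this kind sees only such explicit overlaps.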
202.161.108.167 - - [01/Feb/2003:00:00:03 +1100] "GET /timetables/city/2003s1/cc4logo.gif HTTP/1.1" 206 14102 "http://www.cs.rmit.edu.au/timetables/city/2003s1/cover.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
213.183.13.65 - - [01/Feb/2003:00:00:16 +1100] "GET /~winikoff/palm/dev.html HTTP/1.1" 302 244 "http://www.google.de/search?q=sources+onboardc+examples&ie=UTF-8&oe=UTF-8&hl=de&meta=" "Scooter/3.3"
3.1.2. Singular Value Decomposition Algorithm
The SVD of a matrix is defined as follows [77]. For a real matrix
$A = [a_{ij}]_{m \times n}$, without loss of generality suppose $m \geq n$; then there exists an SVD of $A$:

$$A = U_{m \times m} \Sigma_{m \times n} V_{n \times n}^{T} \qquad (3.1)$$

where $U$ and $V$ are orthogonal matrices. Matrices $U$ and $V$ can be denoted as
$U_{m \times m} = [u_1, u_2, \ldots, u_m]_{m \times m}$ and $V_{n \times n} = [v_1, v_2, \ldots, v_n]_{n \times n}$, where $u_i$ ($i = 1, \ldots, m$) is an
$m$-dimensional vector $u_i = (u_{1i}, u_{2i}, \ldots, u_{mi})^{T}$ and $v_j$ ($j = 1, \ldots, n$) is an $n$-dimensional vector
$v_j = (v_{1j}, v_{2j}, \ldots, v_{nj})^{T}$. Suppose $\mathrm{rank}(A) = r$ and the singular values of $A$ are the diagonal
elements of $\Sigma$ as follows:

$$\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r \geq \sigma_{r+1} = \cdots = \sigma_n = 0 \qquad (3.2)$$

For a given threshold $\varepsilon$ ($0 < \varepsilon < 1$), we choose a parameter $k$ such that
$(\sigma_k - \sigma_{k+1}) / \sigma_k \geq \varepsilon$. Then we denote $U_k = [u_1, u_2, \ldots, u_k]_{m \times k}$,
$V_k = [v_1, v_2, \ldots, v_k]_{n \times k}$, $\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$, and

$$A_k = U_k \Sigma_k V_k^{T}$$

As known from the theorem in algebra [77], $A_k$ is the best rank-$k$ approximation matrix to $A$ and
conveys the latent semantic information among the usage data. This property makes it
possible to find relatively “close” user sessions at the latent semantic level based on
their mutual similarity.
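The truncation step can be sketched as follows. This is a minimal sketch, assuming NumPy's `numpy.linalg.svd`; the text does not say which $k$ to take when several values satisfy the threshold condition, so picking the smallest such $k$ (and keeping all singular values when none qualifies) is an assumption, as are the function names and the toy matrix.

```python
import numpy as np

def choose_k(singular_values, eps):
    """Smallest k with (sigma_k - sigma_{k+1}) / sigma_k >= eps (assumed rule)."""
    s = singular_values
    for k in range(1, len(s)):
        if s[k - 1] > 0 and (s[k - 1] - s[k]) / s[k - 1] >= eps:
            return k
    return len(s)  # fallback when no gap reaches the threshold

def truncated_svd(A, eps):
    """Return U_k, the k leading singular values, V_k, and A_k = U_k Sigma_k V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted descending
    k = choose_k(s, eps)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T
    Ak = Uk @ np.diag(sk) @ Vk.T
    return Uk, sk, Vk, Ak

# Hypothetical 4 x 3 matrix with singular values 3, 2.9 and 0.5.
A = np.array([
    [3.0, 0.0, 0.0],
    [0.0, 2.9, 0.0],
    [0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0],
])
Uk, sk, Vk, Ak = truncated_svd(A, eps=0.5)
approx_error = np.linalg.norm(A - Ak)  # Frobenius norm of the discarded part
```

With this toy matrix the relative gap between the first two singular values is small and the gap after the second is large, so the sketch keeps the two leading singular values and discards the third.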
3.1.3. Representation of User Session in Latent Semantic Space
Once the SVD implementation is completed, we may rewrite user sessions in the
$k$-dimensional latent semantic space. A given session $s_i$ is represented as a coordinate
vector with respect to pages, written as $s_i = \{a_{i1}, a_{i2}, \ldots, a_{in}\}$. The projection of the coordinate vector $s_i$ in the $k$-dimensional latent semantic subspace is re-parameterized as

$$s_i' = s_i V_k = (t_{i1}, t_{i2}, \ldots, t_{ik})$$
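The projection $s_i' = s_i V_k$ can be sketched for all sessions at once as follows. This is an illustrative sketch, assuming NumPy; the toy usage matrix and the fixed value `k = 2` are hypothetical, whereas in the text $k$ follows from the threshold $\varepsilon$.

```python
import numpy as np

# Hypothetical usage matrix: four sessions (rows) over three pages (columns).
A = np.array([
    [1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [0.0, 1.0, 1.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                 # fixed here for illustration only
Vk = Vt[:k, :].T      # n x k matrix of the k leading right singular vectors

# Row i of `projected` is s_i' = s_i V_k, the session in the latent subspace.
projected = A @ Vk
```

Because $A V_k = U_k \Sigma_k$, the same coordinates can also be read off from the left singular vectors scaled by the singular values, which is a convenient consistency check.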