3. Discovering Web Usage Pattern with Latent Semantic Indexing Approach

3.1. Latent Semantic Indexing Algorithm

In this section, we first introduce the LSI algorithm and its mathematical background, in particular the Singular Value Decomposition (SVD) operation from linear algebra, which forms the foundation of the LSI algorithm. On the transformed semantic space, we then propose a novel similarity function that measures the distance between two user sessions and is used in Web clustering.

3.1.1. Web Usage Data Model

Web usage data is originally collected and stored in the Web server logs of websites, and is pre-processed for analysis through steps such as data cleaning and page identification. Since we are mainly interested in the refined usage data rather than the raw data, we refer the reader to [73] for more details on the data preparation steps.

At this stage, we first review the usage data model described in the previous chapter, and particularly introduce the concept of the session-page matrix for Web usage mining.

As discussed above, in the context of Web usage mining, we construct two sets of Web objects: the Web session set $S = \{s_1, s_2, \ldots, s_m\}$ and the Web page set $P = \{p_1, p_2, \ldots, p_n\}$.

Figure 3-1. The schematic structure of a session-page matrix

Each user session is considered as a sequence of page-weight pairs, say $s_i = \{(p_1, a_{i1}), (p_2, a_{i2}), \ldots, (p_n, a_{in})\}$. For simplicity, each user session can be rewritten as a sequence of weights over the page space, i.e. $s_i = \{a_{i1}, a_{i2}, \ldots, a_{in}\}$, where $a_{ij}$ denotes the weight of page $p_j$ in user session $s_i$.

As a result, the whole user session data can be formed as a Web usage data matrix, represented by a session-page matrix $SP_{m \times n} = \{a_{ij}\}$ (Figure 3-1 illustrates the schematic structure of the session-page matrix).
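As a minimal sketch of this data model, the following Python snippet assembles a session-page matrix from sessions given as page-weight pairs. The page names, the weights, and the helper name `build_session_page_matrix` are hypothetical and chosen only for illustration.

```python
def build_session_page_matrix(sessions, pages):
    """Return an m x n list-of-lists matrix whose entry a_ij is the
    weight of page j in session i. Unvisited pages get weight 0."""
    col = {p: j for j, p in enumerate(pages)}
    matrix = [[0.0] * len(pages) for _ in sessions]
    for i, session in enumerate(sessions):
        for page, weight in session:
            matrix[i][col[page]] = float(weight)
    return matrix

# Hypothetical toy data: three pages and two sessions of (page, weight) pairs.
pages = ["p1", "p2", "p3"]
sessions = [
    [("p1", 2), ("p3", 1)],   # s1 hit p1 twice and p3 once
    [("p2", 4)],              # s2 hit only p2
]
SP = build_session_page_matrix(sessions, pages)
```

Each row of `SP` is one session vector over the page space, so the matrix has one row per session and one column per page.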

The entry value $a_{ij}$ in the session-page matrix is usually determined by the number of hits on the corresponding page, or by the amount of time a specific user spends on it. Generally, to eliminate the influence of relative differences in visit duration or hit count, the weights are normalized across the page space within each user session. Figure 3-2 illustrates two snapshots of Web log records extracted from a Web access log, in which each field is separated by a space. In particular, the first and fourth fields are the visitor IP address and the requested URL respectively, and are used to collect the usage data. Thus, the first field can be identified as the user session ID and the fourth field is treated as the page ID.
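The extraction and normalization steps can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this work: it assumes the common access-log layout (IP address, two placeholder fields, bracketed timestamp, quoted request), uses hit counts as weights, and scales each session so that its weights sum to 1. A regular expression is used instead of splitting on spaces because the timestamp and request fields themselves contain spaces. The log lines and function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumed layout: IP, two placeholder fields, [timestamp], "METHOD URL PROTOCOL", ...
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)')

def hits_per_session(log_lines):
    """Count page hits per visitor IP (used here as the session ID)."""
    hits = defaultdict(Counter)
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match:                      # lines that do not fit the layout are skipped
            ip, url = match.groups()
            hits[ip][url] += 1
    return hits

def normalize(counter):
    """Scale one session's weights so that they sum to 1 over the page space."""
    total = sum(counter.values())
    return {page: count / total for page, count in counter.items()}

# Hypothetical log lines in the assumed layout (not taken from a real log).
lines = [
    '10.0.0.1 - - [01/Feb/2003:00:00:03 +1100] "GET /a.html HTTP/1.1" 200 100',
    '10.0.0.1 - - [01/Feb/2003:00:00:09 +1100] "GET /a.html HTTP/1.1" 200 100',
    '10.0.0.1 - - [01/Feb/2003:00:00:12 +1100] "GET /b.html HTTP/1.1" 200 250',
    '10.0.0.2 - - [01/Feb/2003:00:00:16 +1100] "GET /b.html HTTP/1.1" 200 250',
]
weights = {ip: normalize(c) for ip, c in hits_per_session(lines).items()}
```

After normalization, a session with many hits and a session with few hits become comparable, because only the relative distribution of weights over pages remains.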

Figure 3-2. Snapshots from a Web access log

Once the usage matrix is constructed, we may apply conventional clustering algorithms to the user session data to classify user sessions into groups whose members share similar access interests. It is intuitive to run a clustering algorithm directly on the row vectors of the usage matrix and to determine relatively “close” session clusters with a similarity-based measure, such as the cosine similarity commonly adopted in Information Retrieval. In [29], for example, an algorithm named PACT is proposed to address usage pattern mining based on this technique. However, this kind of clustering captures only the explicit mutual relationships between sessions and is incapable of revealing the “deeper” underlying characteristics of usage patterns. In this work, we propose an algorithm named Latent Usage Information (LUI), which groups user sessions semantically by taking latent semantic information into account. To aid understanding of the LUI algorithm, we first discuss the theoretical background of the SVD algorithm.
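To make the direct, row-based comparison concrete, here is a small sketch of cosine similarity between two session vectors. The vectors are hypothetical, and returning 0 for an all-zero session is a convention chosen for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0   # convention chosen here for empty sessions
    return dot / (norm_u * norm_v)

# Hypothetical rows of a session-page matrix over four pages.
s1 = [1.0, 1.0, 0.0, 0.0]
s2 = [2.0, 2.0, 0.0, 0.0]   # same pages as s1, different scale
s3 = [0.0, 0.0, 3.0, 1.0]   # no page in common with s1
sim_12 = cosine_similarity(s1, s2)
sim_13 = cosine_similarity(s1, s3)
```

Two sessions that share no page always score 0 under this measure, however related their pages may be. That is the kind of explicit-only relationship the latent approach is meant to go beyond.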

202.161.108.167 - - [01/Feb/2003:00:00:03 +1100] "GET /timetables/city/2003s1/cc4logo.gif HTTP/1.1" 206 14102 "http://www.cs.rmit.edu.au/timetables/city/2003s1/cover.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"

213.183.13.65 - - [01/Feb/2003:00:00:16 +1100] "GET /~winikoff/palm/dev.html HTTP/1.1" 302 244 "http://www.google.de/search?q=sources+onboardc+examples&ie=UTF-8&oe=UTF-8&hl=de&meta=" "Scooter/3.3"

3.1.2. Singular Value Decomposition Algorithm

The SVD of a matrix is defined as follows [77]. For a real matrix $A = [a_{ij}]_{m \times n}$, suppose without loss of generality that $m \ge n$. Then there exists an SVD of $A$:

$$A = U_{m \times m} \, \Sigma_{m \times n} \, V_{n \times n}^{T} \qquad (3.1)$$

where $U$ and $V$ are orthogonal matrices. Matrices $U$ and $V$ can be denoted as $U_{m \times m} = [u_1, u_2, \ldots, u_m]_{m \times m}$ and $V_{n \times n} = [v_1, v_2, \ldots, v_n]_{n \times n}$, where $u_i$ ($i = 1, \ldots, m$) is an $m$-dimensional vector $u_i = \{u_{1i}, u_{2i}, \ldots, u_{mi}\}^T$ and $v_j$ ($j = 1, \ldots, n$) is an $n$-dimensional vector $v_j = \{v_{1j}, v_{2j}, \ldots, v_{nj}\}^T$. Suppose $\mathrm{rank}(A) = r$ and the singular values of $A$ are the diagonal elements of $\Sigma$ as follows:

$$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_n = 0 \qquad (3.2)$$
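The decomposition in (3.1) can be checked numerically with NumPy. This is a sketch on a hypothetical 4 x 3 usage matrix. Note that `np.linalg.svd` returns the transposed factor $V^T$ directly, and returns the singular values as a vector in descending order rather than as a matrix.

```python
import numpy as np

# Hypothetical 4 x 3 usage matrix (m = 4 sessions, n = 3 pages, so m >= n).
A = np.array([
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 1.0, 1.0],
])

# full_matrices=True returns U (m x m) and V^T (n x n); s holds the singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Embed the singular values in an m x n matrix Sigma to match equation (3.1).
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

# Multiplying the three factors back together should recover A up to rounding.
A_rebuilt = U @ Sigma @ Vt
```

Comparing `A_rebuilt` with `A` using `np.allclose` confirms the factorization, and `U.T @ U` should be close to the identity because $U$ is orthogonal.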

For a given threshold $\varepsilon$ ($0 < \varepsilon < 1$), we choose a parameter $k$ such that $(\sigma_k - \sigma_{k+1}) / \sigma_k \ge \varepsilon$. Then, we denote $U_k = [u_1, u_2, \ldots, u_k]_{m \times k}$, $V_k = [v_1, v_2, \ldots, v_k]_{n \times k}$, $\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$, and

$$A_k = U_k \, \Sigma_k \, V_k^{T}$$

By a theorem in linear algebra [77], $A_k$ is the best rank-$k$ approximation to $A$, and it conveys the latent semantic information among the usage data. This property makes it possible to find relatively “close” user sessions at the latent semantic level based on their mutual similarity.
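The truncation step can be sketched as follows. The text does not say which $k$ to take when several values satisfy the threshold condition, so this example assumes the smallest one and falls back to keeping all non-zero singular values when no gap reaches the threshold. The function names and the matrix are hypothetical.

```python
import numpy as np

def choose_k(singular_values, eps):
    """Return the smallest k (1-based) with (sigma_k - sigma_{k+1}) / sigma_k >= eps.

    Taking the smallest such k is an assumption of this sketch. If no gap
    reaches the threshold, all non-zero singular values are kept.
    """
    s = np.asarray(singular_values, dtype=float)
    for k in range(1, len(s)):
        if s[k - 1] > 0 and (s[k - 1] - s[k]) / s[k - 1] >= eps:
            return k
    return int(np.count_nonzero(s))

def rank_k_approximation(A, k):
    """Return A_k = U_k Sigma_k V_k^T built from the k largest singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Hypothetical 4 x 3 usage matrix.
A = np.array([
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 1.0, 1.0],
])
s = np.linalg.svd(A, compute_uv=False)   # singular values only, descending
k = choose_k(s, eps=0.5)
A_k = rank_k_approximation(A, k)
```

The Frobenius-norm error `np.linalg.norm(A - A_k)` equals the square root of the sum of the squared discarded singular values, which is one way to see how much information the truncation removes.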

3.1.3. Representation of User Session in Latent Semantic Space

Once the SVD has been computed, we may rewrite user sessions in the $k$-dimensional latent semantic space. A given session $s_i$ is represented as a coordinate vector with respect to pages, written as $s_i = \{a_{i1}, a_{i2}, \ldots, a_{in}\}$. The projection of the coordinate vector $s_i$ onto the $k$-dimensional latent semantic subspace is re-parameterized as

$$s_i' = s_i V_k = (t_{i1}, t_{i2}, \ldots, t_{ik})$$
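As an illustration of this projection, the sketch below maps every row of a hypothetical usage matrix into the latent space at once by multiplying the matrix with $V_k$. The value of `k` is fixed by hand here rather than derived from a threshold, and the function name is an assumption of this example.

```python
import numpy as np

def project_sessions(A, k):
    """Map each row s_i of A to s_i' = s_i V_k in the k-dimensional latent space."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    V_k = Vt[:k, :].T          # n x k matrix whose columns are v_1, ..., v_k
    return A @ V_k             # m x k matrix; row i holds (t_i1, ..., t_ik)

# Hypothetical 4 x 3 usage matrix (rows are sessions, columns are pages).
A = np.array([
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 1.0, 1.0],
])
k = 2                          # chosen by hand for this example
T = project_sessions(A, k)
```

Because $A V_k = U_k \Sigma_k$, the same coordinates can also be read off the left singular vectors scaled by the singular values, which avoids the extra matrix product when $U$ is already available.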