Outline of the Dissertation - Web Mining Techniques for Recommendation and Personalization

1. Introduction

1.4. Outline of the Dissertation

This dissertation contains nine chapters. Figure 1-2 illustrates the structure of the thesis chapters, where TO denotes Task-Oriented and UP stands for User Profiling. As the figure shows, after introducing the research problems and the mathematical concepts, formulas and algorithms used in the later chapters, we employ three kinds of latent semantic analysis models to address Web usage mining in chapters 3 to 5; in chapters 6 and 7, we investigate using the discovered usage knowledge for Web recommendations. In chapter 8, two application studies of gait pattern mining are carried out to evaluate the applicability of the developed methodologies and technologies. Finally, in chapter 9, we summarize the claims of the thesis and indicate some future research directions. In summary, we first put forward a mathematical foundation and then conduct various specific knowledge discovery tasks, followed by a knowledge application task, which are logically bunched together.

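Figure 1-2. The structure of the thesis (boxes: 1. Introduction; 2. Theoretical Foundation; 3. LSI Model, 4. PLSA Model, 5. LDA Model; 6. TO Recommendation, 7. UP Recommendation; 8. Case Studies of Gait Pattern Mining)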

Chapter 2 describes the basic concepts and techniques necessary for Web usage mining and Web recommendation. A mathematical usage data model, in the form of a matrix, is presented in this chapter, and the related mathematical background is provided to aid understanding of the algorithms and techniques developed in this dissertation. The existing algorithms and techniques for Web usage mining, latent semantic analysis and Web recommendation, on which this dissertation builds, are also reviewed and discussed. This chapter provides a foundation for the further study of Web usage mining and Web recommendation described in the following chapters.
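
As a rough illustration of what a matrix-based usage data model can look like, the following minimal sketch builds a session-by-page matrix from toy click sequences. The hit-count weighting with per-session normalisation, the function name `build_usage_matrix` and the sample URLs are assumptions made for this example only; the weighting actually adopted in chapter 2 may differ.

```python
from collections import Counter

def build_usage_matrix(sessions):
    """Return (pages, matrix); matrix[i][j] is the weight of page j in session i.

    Assumption: the weight is the share of the session's hits that fall on the page.
    """
    pages = sorted({page for session in sessions for page in session})
    index = {page: j for j, page in enumerate(pages)}
    matrix = []
    for session in sessions:
        counts = Counter(session)
        total = sum(counts.values())
        row = [0.0] * len(pages)
        for page, hits in counts.items():
            row[index[page]] = hits / total  # each non-empty row sums to 1
        matrix.append(row)
    return pages, matrix

# Hypothetical click sequences, one list per user session.
sessions = [
    ["/home", "/products", "/products"],
    ["/home", "/contact"],
]
pages, usage = build_usage_matrix(sessions)
```

Each row of `usage` is one user session and each column one page, so the sessions and the pages can both be treated as vectors in the later analysis.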

In chapter 3, we address Web usage mining by employing the traditional LSI approach. A LUI algorithm is proposed to extract the latent semantic knowledge from the usage data by computing the singular values of the original usage data, which approximates the original semantics hidden in the usage matrix. Unlike other algorithms, such as [29], that apply a standard clustering algorithm to the usage data directly to find the aggregates of user sessions, we develop another algorithm that performs clustering on the transformed usage space to improve Web usage mining. In this algorithm, each user session is represented by a dimensionality-reduced page vector, which conveys the latent semantic relationships among the Web objects in the usage data model. From the revealed relationships, user session aggregates with high semantic similarity are eventually generated. Experiments are conducted to demonstrate the effectiveness of the algorithm for usage pattern mining.
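
The idea of clustering in a transformed usage space can be sketched as follows. This is not the LUI algorithm itself: it is a minimal, assumption-laden illustration that approximates the leading right singular vectors by power iteration, projects each session onto them, and groups sessions with a simple cosine threshold. The helper names, the threshold of 0.9 and the toy matrix are all hypothetical.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def top_right_singular_vectors(a, k, iters=200):
    """Approximate the k leading right singular vectors of `a` by power
    iteration on a^T a, deflating against vectors already found.
    Assumes k does not exceed the rank of `a`."""
    columns = [list(col) for col in zip(*a)]  # rows of a^T
    found = []
    for c in range(k):
        v = [1.0 / (i + c + 1) for i in range(len(columns))]  # deterministic start
        for _ in range(iters):
            for u in found:  # remove components along earlier vectors
                proj = dot(v, u)
                v = [x - proj * y for x, y in zip(v, u)]
            av = [dot(row, v) for row in a]        # a v
            v = [dot(col, av) for col in columns]  # a^T (a v)
            norm = math.sqrt(dot(v, v))
            v = [x / norm for x in v]
        found.append(v)
    return found

def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

def group_sessions(reduced, threshold=0.9):
    """Greedy grouping: a session joins the first cluster whose seed it resembles."""
    clusters = []
    for i, vec in enumerate(reduced):
        for cluster in clusters:
            if cosine(vec, reduced[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Toy session-by-page matrix with two groups of related pages.
usage = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
basis = top_right_singular_vectors(usage, k=2)
reduced = [[dot(row, v) for v in basis] for row in usage]
clusters = group_sessions(reduced)
```

On this toy matrix, sessions 0 and 1 fall into one group and sessions 2 and 3 into another, although the raw cosine similarity within each pair is only 0.8 and would miss the same threshold in the original space.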

Chapter 4 presents an alternative latent semantic analysis model, the PLSA model. In contrast to the traditional LSI approaches, the PLSA model is based on a more solid foundation of statistical analysis. In addition to the traditional latent semantic analysis, it is capable of discovering the latent semantic factor space associated with the usage patterns. In this chapter, a series of equations is formulated based on Bayesian and uncertainty theory, which characterize the associations between Web objects (i.e. Web pages and user sessions) and latent semantic factors. An EM algorithm is then developed to estimate the parameters of the PLSA model that maximize the likelihood of the usage data. The parameters of the PLSA model take the form of a set of conditional probability distributions of Web pages or user sessions over the latent semantic factors, which convey the intrinsic aggregation property of the Web objects. We then utilize these factor-based feature vectors to group Web pages and user sessions, as well as to identify the latent semantic factor space via a probability inference approach. In particular, two sets of similarity measures, one for Web pages and one for user sessions, are proposed.
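
The EM estimation mentioned above can be illustrated with a minimal sketch of the textbook symmetric PLSA formulation, in which the joint probability of a session s and a page p is the sum over factors z of P(z) P(s|z) P(p|z). The function name `plsa`, the random initialisation, the fixed iteration count and the toy count matrix are assumptions for illustration; the equations derived in chapter 4 may be parameterised differently.

```python
import random

def normalise(values):
    total = sum(values)
    return [v / total for v in values]

def plsa(counts, n_factors, iters=100, seed=0):
    """Fit P(z), P(s|z) and P(p|z) to a session-by-page count matrix with EM."""
    rng = random.Random(seed)
    n_s, n_p = len(counts), len(counts[0])
    p_z = [1.0 / n_factors] * n_factors
    p_s_z = [normalise([rng.random() + 0.1 for _ in range(n_s)])
             for _ in range(n_factors)]
    p_p_z = [normalise([rng.random() + 0.1 for _ in range(n_p)])
             for _ in range(n_factors)]
    for _ in range(iters):
        new_z = [0.0] * n_factors
        new_s = [[0.0] * n_s for _ in range(n_factors)]
        new_p = [[0.0] * n_p for _ in range(n_factors)]
        for s in range(n_s):
            for p in range(n_p):
                if counts[s][p] == 0:
                    continue
                # E-step: posterior P(z | s, p), proportional to P(z) P(s|z) P(p|z)
                joint = [p_z[z] * p_s_z[z][s] * p_p_z[z][p]
                         for z in range(n_factors)]
                total = sum(joint)
                for z in range(n_factors):
                    resp = counts[s][p] * joint[z] / total
                    new_z[z] += resp
                    new_s[z][s] += resp
                    new_p[z][p] += resp
        # M-step: re-estimate each distribution from the expected counts
        p_z = normalise(new_z)
        p_s_z = [normalise(row) for row in new_s]
        p_p_z = [normalise(row) for row in new_p]
    return p_z, p_s_z, p_p_z

# Toy counts: sessions 0-1 visit pages 0-1, sessions 2-3 visit pages 2-3.
counts = [[4, 3, 0, 0], [3, 4, 0, 0], [0, 0, 4, 3], [0, 0, 3, 4]]
p_z, p_s_z, p_p_z = plsa(counts, n_factors=2)
```

The rows of `p_p_z` and `p_s_z` are the factor-conditional distributions over pages and sessions; pages or sessions that concentrate their probability on the same factor can then be grouped together.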

In chapter 5, we address Web usage mining by applying a novel generative model, the LDA model, which is also an alternative latent semantic analysis model. We first systematically summarize the evolution of generative models and discuss in depth the strengths of the analytical model employed. We then describe a variational EM algorithm for calculating the parameters of the LDA model. The parameters, in terms of posterior probabilities and Dirichlet values, are used to derive user access patterns. We carry out an experimental analysis to evaluate the effectiveness and efficiency of the proposed analytical models.

We turn to Web recommendation in chapter 6, using the usage knowledge discovered by the above analytical models. We first introduce a top-N weighted scoring scheme, which forms the common base of various Web recommendation algorithms. This scheme computes the recommendation score of each page based on its probability weight, which represents the likelihood of the page being visited. In this chapter, we also present a Web recommendation algorithm that identifies user task preference distributions and integrates the latent task space into the collaborative filtering approach. Analysis of the very first clicks on Web pages captures the task-driven probability distribution of a user session over the task space, which in turn determines the predominant tasks, i.e. those having significant probability values. We eventually calculate the recommendation scores by integrating the predominant tasks with the discovered task representatives. The evaluation is done by a series of experiments on real-world log files.
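
A top-N weighted scoring step of this kind might be sketched as follows. The sketch assumes that each task is represented by a page-weight distribution, that a task counts as predominant when its probability reaches a cut-off, and that a page's score is the sum of task probability times page weight over the predominant tasks. The function name, the cut-off `min_task_prob`, and all task names, pages and numbers are hypothetical; the exact scoring formula of chapter 6 is not given in this outline.

```python
def recommend_top_n(task_probs, page_given_task, visited, n=2, min_task_prob=0.2):
    """Rank unvisited pages by their summed weight under the predominant tasks."""
    scores = {}
    for task, prob in task_probs.items():
        if prob < min_task_prob:  # skip tasks that are not predominant
            continue
        for page, weight in page_given_task[task].items():
            if page in visited:
                continue
            scores[page] = scores.get(page, 0.0) + prob * weight
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [page for page, _ in ranked[:n]]

# Hypothetical task distribution inferred from the first clicks of a session.
task_probs = {"browse": 0.7, "support": 0.25, "careers": 0.05}
# Hypothetical task representatives: page weights per task.
page_given_task = {
    "browse": {"/home": 0.3, "/products": 0.5, "/cart": 0.2},
    "support": {"/faq": 0.6, "/contact": 0.4},
    "careers": {"/jobs": 1.0},
}
recommended = recommend_top_n(task_probs, page_given_task, visited={"/home"})
```

Here the low-probability "careers" task is ignored, the already visited "/home" page is excluded, and the two highest-scoring remaining pages are returned.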

Chapter 7 concentrates on employing a user profiling approach for Web recommendation. In this chapter, we utilize the proposed user profiling approach to deal [...] applied to the collaborative recommendation algorithm to select the best-matched usage pattern and predict the pages most likely to be visited by referring to the visiting histories of other users who exhibit similar navigation preferences. Experimental results on two Web log files show the effectiveness of the proposed algorithms.

We extend the developed technologies and methodologies to two case studies of gait pattern mining in chapter 8. First, we address discovering gait patterns of CP patients, which are represented by attribute vectors of the temporal-distance kinematic gait variables. The CP gait pattern mining algorithm is implemented by employing traditional clustering algorithms, i.e. k-means and hierarchical clustering. Gait patterns are derived from the centroids of the gait clusters and are in turn treated as diagnostic indicators for assessing the walking impairment of CP patients. In the second part of this chapter, we develop a SOM-based clustering algorithm to address gait analysis for monitoring the fall risk of the elderly population. Using a specific gait variable, Minimum Foot Clearance (MFC), to model the walking characteristics of elderly people, we construct a gait data model in terms of various statistical parameters of the MFC variable. We then apply a SOM-based clustering algorithm to a gait dataset consisting of three groups of gait data, i.e. younger subjects, elderly but healthy subjects and elderly subjects with impaired walking ability, to separate these subjects into different gait groups. In the transformed gait SOM grid, various groups of subjects are assigned to different portions of the figure. The locality of the aggregation indicates the gait pattern knowledge, while the centroids of the clusters stand for the characteristics of the gait information. Experimental results are visualized and tabulated to show the [...]

