1. Introduction
1.4. Outline of the Dissertation
This dissertation contains nine chapters. Figure 1-2 illustrates the structure of the thesis
chapters, where TO denotes Task-Oriented and UP stands for User Profiling. As the
figure shows, after introducing the research problems and the mathematical
concepts, formulas and algorithms used in the later chapters, we employ three
kinds of latent semantic analysis models to address Web usage mining in chapters 3
to 5; in chapters 6 and 7, we investigate using the discovered usage knowledge
for Web recommendations. In addition, two application investigations of gait pattern
mining are carried out in chapter 8 to evaluate the applicability of the developed
methodologies and technologies. Finally, we summarize the claims of the
thesis and indicate some future research directions in chapter 9. In summary, we first
put forward a mathematical foundation, then conduct several specific knowledge
discovery tasks, followed by a task of knowledge application, which are logically bunched
Figure 1-2. The structure of the thesis
Chapter 2 describes the basic concepts and techniques necessary for Web usage mining
and Web recommendations. A mathematical usage data model (matrix) is presented in
this chapter, and the related mathematical knowledge and background are provided to
support a better understanding of the algorithms and techniques developed in this
dissertation. Existing algorithms and techniques for Web usage mining, latent semantic
analysis and Web recommendation, on which this dissertation builds, are also reviewed
and discussed. This chapter provides a foundation for the further study of Web usage
mining and Web recommendation described in the following chapters.
In chapter 3, we address Web usage mining by employing the traditional LSI
approach. An LUI algorithm is proposed to extract the latent semantic knowledge from the
usage data by computing the singular value decomposition of the original usage matrix,
which approximates the original semantics hidden in the usage matrix. Unlike other
algorithms, such as [29], that apply a standard clustering algorithm to the usage data
directly to find the aggregates of user sessions, we develop another algorithm that
performs clustering on the transformed usage space to improve Web usage mining.
In this algorithm, each user session is represented by a dimensionality-reduced page
vector, which conveys the latent semantic relationships among the Web objects in the
usage data model. From the revealed relationships, user session aggregates with high
semantic similarity are eventually generated. Experiments are conducted to
demonstrate the effectiveness of the algorithm for usage pattern mining.
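The idea of clustering in the transformed usage space can be illustrated with a
minimal sketch. It is not the LUI algorithm of chapter 3: the toy session-page matrix,
the power-iteration routine, the cosine threshold and the greedy grouping step are all
assumptions chosen for brevity, and every function name is hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def top_right_singular_vectors(A, k, iters=100, seed=0):
    """Approximate the top-k right singular vectors of A by power
    iteration on A^T A with deflation (illustrative, not optimized)."""
    rng = random.Random(seed)
    At = [list(col) for col in zip(*A)]  # transpose: pages x sessions
    found = []
    for _ in range(k):
        v = [rng.random() for _ in range(len(A[0]))]
        for _ in range(iters):
            # Deflation: remove components along vectors already found.
            for u in found:
                c = dot(v, u)
                v = [a - c * b for a, b in zip(v, u)]
            w = matvec(At, matvec(A, v))
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        found.append(v)
    return found

def project_sessions(A, basis):
    """Represent each session by its coordinates in the reduced space."""
    return [[dot(row, b) for b in basis] for row in A]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def group_sessions(vectors, threshold=0.9):
    """Greedy grouping (an assumption): a session joins the first group
    whose first member is similar enough, else it starts a new group."""
    groups = []
    for i, vec in enumerate(vectors):
        for g in groups:
            if cosine(vec, vectors[g[0]]) >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups

# Hypothetical usage matrix: rows are sessions, columns are page weights.
usage = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 2],
]
basis = top_right_singular_vectors(usage, k=2)
reduced = project_sessions(usage, basis)
groups = group_sessions(reduced)
print(groups)
```

In this toy matrix the first two sessions share one set of pages and the last two share
another, so their reduced vectors point in nearly the same direction within each pair.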
Chapter 4 presents an alternative latent semantic analysis model, the PLSA model. In
contrast to the traditional LSI approaches, the PLSA model is based on a more solid
foundation of statistical analysis. It is capable of discovering the latent semantic factor
space associated with the usage patterns in addition to performing the traditional latent
semantic analysis. In this chapter, a series of equations is formulated based on Bayesian
and uncertainty theory to characterize the associations between Web objects (i.e. Web
pages and user sessions) and latent semantic factors. An EM algorithm is then
developed to estimate the parameters of the PLSA model that maximize the
likelihood of the usage data. These parameters take the form of a set of
conditional probability distributions of Web pages or user sessions over the latent
semantic factors, which convey the intrinsic aggregation property of the Web objects. We
then utilize these factor-based feature vectors to group Web pages and user sessions, as
well as to identify the latent semantic factor space via a probability inference approach. In
particular, two sets of similarity measures of Web pages and user sessions are proposed.
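As a rough illustration of the EM estimation described above, the following sketch fits
a standard PLSA model to a small session-page count matrix. It assumes the usual
symmetric parameterization P(session, page) = sum over factors z of
P(z) P(session|z) P(page|z); the toy data, the random initialization and all names are
hypothetical and are not taken from chapter 4.

```python
import math
import random

def normalized(values):
    total = sum(values)
    return [v / total for v in values]

def plsa(counts, n_factors, iters=50, seed=0):
    """Fit P(z), P(session|z) and P(page|z) by EM on a count matrix."""
    rng = random.Random(seed)
    n_s, n_p = len(counts), len(counts[0])
    p_z = [1.0 / n_factors] * n_factors
    p_s_z = [normalized([rng.random() + 0.1 for _ in range(n_s)])
             for _ in range(n_factors)]
    p_p_z = [normalized([rng.random() + 0.1 for _ in range(n_p)])
             for _ in range(n_factors)]
    for _ in range(iters):
        new_z = [0.0] * n_factors
        new_s = [[0.0] * n_s for _ in range(n_factors)]
        new_p = [[0.0] * n_p for _ in range(n_factors)]
        for s in range(n_s):
            for p in range(n_p):
                if counts[s][p] == 0:
                    continue
                # E-step: posterior P(z | session, page) for this entry.
                joint = [p_z[z] * p_s_z[z][s] * p_p_z[z][p]
                         for z in range(n_factors)]
                total = sum(joint)
                # M-step accumulators: expected counts per factor.
                for z in range(n_factors):
                    r = counts[s][p] * joint[z] / total
                    new_z[z] += r
                    new_s[z][s] += r
                    new_p[z][p] += r
        p_z = normalized(new_z)
        p_s_z = [normalized(row) for row in new_s]
        p_p_z = [normalized(row) for row in new_p]
    return p_z, p_s_z, p_p_z

def log_likelihood(counts, p_z, p_s_z, p_p_z):
    ll = 0.0
    for s, row in enumerate(counts):
        for p, n in enumerate(row):
            if n:
                prob = sum(p_z[z] * p_s_z[z][s] * p_p_z[z][p]
                           for z in range(len(p_z)))
                ll += n * math.log(prob)
    return ll

# Hypothetical usage counts: rows are sessions, columns are pages.
usage = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 2, 3],
    [0, 0, 3, 2],
]
p_z, p_s_z, p_p_z = plsa(usage, n_factors=2)
# Dominant factor per page via Bayes' rule: P(z | page) is proportional
# to P(page | z) P(z).
page_factor = [max(range(2), key=lambda z: p_z[z] * p_p_z[z][p])
               for p in range(4)]
print(page_factor, log_likelihood(usage, p_z, p_s_z, p_p_z))
```

The rows of `p_p_z` and `p_s_z` play the role of the factor-based feature vectors
mentioned above; grouping pages by their dominant factor is only one simple way of
using them.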
In chapter 5, we address Web usage mining by applying a novel generative model, i.e. the
LDA model, which is also an alternative latent semantic analysis model. We first
systematically summarize the evolution of generative models and discuss in depth
the strengths of the analytical model employed. We then describe a
variational EM algorithm for estimating the parameters of the LDA model. The parameters,
in terms of posterior probabilities and Dirichlet values, are used to derive user access
patterns. We carry out an experimental analysis to evaluate the effectiveness and
efficiency of the proposed analytical models.
We turn to Web recommendations in chapter 6, using the usage knowledge
discovered by the above analytical models. We first introduce a top-N weighted
scoring scheme, which forms the common base of various Web recommendation
algorithms. This scheme computes the recommendation score of each page from its
probability weight, which represents the likelihood of the page being visited. In this
chapter, we also present a Web recommendation algorithm that identifies user task
preference distributions and integrates the latent task space into the collaborative
filtering approach. Analysis of the very first clicks on Web pages captures the task-driven
probability distribution of a user session over the task space, which in turn determines the
predominant tasks, i.e. those with significant probability values. We eventually calculate the
recommendation scores by integrating the predominant tasks with the discovered task
representatives. The evaluation is done by a series of experiments on real-world log files.
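A top-N weighted scoring step of this kind might look like the sketch below. The
scoring rule (summing P(task | session) times P(page | task) over the predominant
tasks), the probability threshold, the task names and the page URLs are all
illustrative assumptions and not the exact scheme defined in chapter 6.

```python
def recommend(task_probs, page_given_task, visited, n=2, min_task_prob=0.2):
    """Score unvisited pages by summing P(task | session) * P(page | task)
    over the predominant tasks, then return the n highest-scoring pages."""
    scores = {}
    for task, p_task in task_probs.items():
        if p_task < min_task_prob:
            continue  # keep only tasks with a significant probability
        for page, p_page in page_given_task[task].items():
            if page in visited:
                continue  # do not recommend pages already seen
            scores[page] = scores.get(page, 0.0) + p_task * p_page
    # Sort by descending score; break ties by page name for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]

# Hypothetical task distribution of the active session.
task_probs = {"browse": 0.7, "checkout": 0.25, "support": 0.05}
# Hypothetical task representatives: page weights for each task.
page_given_task = {
    "browse": {"/home": 0.5, "/catalog": 0.3, "/item": 0.2},
    "checkout": {"/cart": 0.6, "/pay": 0.3, "/item": 0.1},
    "support": {"/faq": 0.9, "/home": 0.1},
}
top = recommend(task_probs, page_given_task, visited={"/home"})
print(top)
```

Here the "support" task falls below the threshold and is ignored, while "/item"
accumulates weight from both remaining tasks.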
Chapter 7 concentrates on the study of employing a user profiling approach for Web
recommendations. In this chapter, we utilize the proposed user profile approach to deal
applied in the collaborative recommendation algorithm to select the best-matching
usage pattern and predict the pages most likely to be visited by referring to the visiting
histories of other users who exhibit similar navigation preferences. Experimental results
on two Web log files show the effectiveness of the proposed algorithms.
We extend the developed technologies and methodologies to two case studies of gait
pattern mining in chapter 8. First, we address the discovery of gait patterns of CP patients,
which are represented by the attribute vectors of the temporal-distance kinematic gait
variables. The CP gait pattern mining algorithm is implemented by employing
traditional clustering algorithms, i.e. k-means and hierarchical clustering. Gait
patterns are derived from the centroids of the gait clusters and are in turn treated as
diagnostic indicators for assessing the walking impairment of the CP patients. In the second
part of this chapter, we develop a SOM-based clustering algorithm to address gait analysis
for monitoring the fall risk of the elderly population. By using a specific gait variable,
Minimum Foot Clearance (MFC), to model the walking characteristics of elderly people, we
construct a gait data model in terms of various statistical parameters of the MFC variable.
Then we apply a SOM-based clustering algorithm to a gait dataset which consists of three
groups of gait data, i.e. younger subjects, elderly but healthy subjects and elderly subjects
with impaired walking ability, to separate these subjects into different gait groups. The
transformed gait SOM grid shows that different groups of subjects are assigned to
different portions of the grid. The locality of the aggregation indicates the gait pattern
knowledge. Meanwhile, the centroids of the clusters represent the characteristics of the
gait information. Experimental results are visualized and tabulated to show the