1. Introduction
1.3. Claims of the Dissertation
This dissertation focuses on discovering Web usage patterns, in the form of user
profiles and latent semantic factors, from Web log files to support Web recommendation
based on various Latent Semantic Analysis (LSA) models, and on extending the proposed
methodologies and technologies to other data mining applications.
The research approach of this study consists of three stages. The first is to create a
unified mathematical analysis model on which various data mining algorithms can be
employed regardless of the research background. In the second stage, three latent
semantic analysis algorithms are investigated to conduct Web usage mining and to
discover the associated semantic factors, which are used to construct a usage-based Web
navigational base via a user profiling method. In the third stage, the discovered usage
knowledge is used for a further Web application, i.e. Web recommendation. In addition to
Web data mining, the proposed analytical methodology can be extended to other data
mining applications that use similar research techniques and analytical algorithms but
have a different application background. In this study, we extend the developed
methodologies and technologies to the healthcare data mining domain. This thesis
comprises all of the studies and results mentioned above. The research spans a number of
different research communities (Web mining, Web recommendation, clustering, text
retrieval and health data mining), but it centres on a single focus: using intelligent
computational algorithms for knowledge discovery tasks. We believe this is the main
contribution of this thesis.
In particular, a mathematical framework is established for Web usage mining, and a
series of algorithms is proposed to predict Web users' navigational preferences and to
recommend customized Web content to them. Three kinds of latent semantic analysis
models, namely standard LSI, PLSA and LDA, are applied to Web usage mining and Web
recommendation. Two case studies extend the proposed pattern mining methodologies and
algorithms to gait pattern mining, an important topic in the healthcare and
biomechanical data mining domains. The main contributions of the dissertation can be
summarized as follows:
A Mathematical Framework of Web Usage Mining for Web Recommendation Unlike other Web
data models, such as Web content or Web linkage information, this framework gives a
mathematical expression of Web usage information, e.g. the number of clicks or the
visiting duration on various Web pages. Moreover, unlike other usage data
models such as directed graphs or visit sequences [19, 68-70], this usage data model is in
the form of a usage matrix. From the usage matrix, the intrinsic associations among Web
objects, such as Web pages or user sessions, as well as the latent semantic relationships
between the latent task space and the Web objects conveyed by the usage information, are
discovered by a series of matrix operations or generative procedures, such as singular
value decomposition, matrix approximation, Bayesian equation conversion, Bayesian
updating and Expectation-Maximization iteration. Other usage-based
information or expressions, such as task-based Web page similarity, task-based user
session similarity, user task preference distribution, user profile, and top-N recommended
page list, are also represented and derived by various matrix operations and other
algorithms. This framework makes it feasible to analyse the collected Web usage data
systematically and in a unified mathematical way, so that the developed methodologies
and technologies can be extended to other data mining applications with similar data
expressions. As a result, this framework provides a solid mathematical base for
discovering Web usage patterns and making Web recommendations.
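As a rough illustration of the usage matrix described above, the following Python sketch builds a session-by-page matrix from a handful of made-up log records. The record layout `(session_id, page, weight)`, the function names `build_usage_matrix` and `normalize_rows`, and the row normalization step are assumptions made for illustration; they are not taken from the dissertation.

```python
from collections import defaultdict


def build_usage_matrix(records):
    """Build a session-by-page usage matrix from (session_id, page, weight) records.

    `weight` stands for a usage measure such as click count or visiting
    duration. Returns (sessions, pages, matrix), where matrix[i][j] is the
    accumulated weight of page j in session i.
    """
    totals = defaultdict(float)
    for session_id, page, weight in records:
        totals[(session_id, page)] += weight

    sessions = sorted({s for s, _ in totals})
    pages = sorted({p for _, p in totals})
    matrix = [[totals.get((s, p), 0.0) for p in pages] for s in sessions]
    return sessions, pages, matrix


def normalize_rows(matrix):
    """Scale each row so that its entries sum to 1 (all-zero rows stay zero)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


# Hypothetical log records: (session id, page, seconds spent on the page).
records = [
    ("s1", "/home", 10), ("s1", "/products", 30),
    ("s2", "/home", 5), ("s2", "/contact", 15),
    ("s1", "/products", 10),
]
sessions, pages, usage = build_usage_matrix(records)
```

Each row of `usage` is one user session and each column one Web page, which is the shape the matrix operations mentioned above would work on.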
Latent Semantic Analysis Models For Web Usage Mining In this work, we investigate in
depth the use of latent semantic analysis paradigms for discovering Web usage patterns
and making Web recommendations, which include the following:
− Traditional Latent Semantic Indexing (LSI)
− Probabilistic Latent Semantic Analysis (PLSA)
− Latent Dirichlet Allocation (LDA)
Traditional LSI is based on a Singular Value Decomposition (SVD) operation, which
reduces the dimensionality of the original input space while retaining the closest
approximation of the original matrix. The main advantage of LSI is its capability of
uncovering underlying relationships among the observed objects that are not exhibited
explicitly and directly. In this study, we employ traditional LSI analysis on the
usage data matrix to analyse the associations among user sessions in the transformed
vector space resulting from the SVD.
The PLSA model is a variant of the traditional LSI model, which introduces an aspect
space as an intermediate layer between two usage attributes, i.e. user session and Web
page. With the PLSA model, the original usage data is mapped into two new usage vectors,
in which the associations between user sessions and the latent aspect space, and between
Web pages and the latent aspect space, are modelled by estimates of conditional
probabilities. The newly mapped usage vectors, along with the newly defined user session
similarity and Web page similarity, provide a novel Web usage mining method, with which
we can derive usage-based page groups and session aggregates.
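Conditional probability estimates of this kind are commonly obtained with an EM procedure. The sketch below is a minimal NumPy version under stated assumptions: it uses the symmetric PLSA parameterization with P(z), P(s|z) and P(p|z), where s is a session, p a page and z a latent aspect; the function names, the random initialization and the toy matrix are hypothetical and are not the dissertation's own implementation.

```python
import numpy as np


def plsa(usage, n_factors, n_iter=50, seed=0):
    """Fit a PLSA model to a (sessions x pages) count matrix with EM.

    Returns P(z), P(s|z) with shape (sessions, factors) and
    P(p|z) with shape (pages, factors).
    """
    n = np.asarray(usage, dtype=float)
    n_sessions, n_pages = n.shape
    rng = np.random.default_rng(seed)

    # Random initial estimates, normalized into probability distributions.
    p_z = np.full(n_factors, 1.0 / n_factors)
    p_s_z = rng.random((n_sessions, n_factors))
    p_s_z /= p_s_z.sum(axis=0)
    p_p_z = rng.random((n_pages, n_factors))
    p_p_z /= p_p_z.sum(axis=0)

    for _ in range(n_iter):
        # E-step: posterior P(z|s,p) for every (session, page) pair via Bayes' rule.
        joint = p_z[None, None, :] * p_s_z[:, None, :] * p_p_z[None, :, :]
        posterior = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-12)

        # M-step: re-estimate the parameters from the expected counts.
        expected = n[:, :, None] * posterior  # shape (sessions, pages, factors)
        factor_mass = np.maximum(expected.sum(axis=(0, 1)), 1e-12)
        p_s_z = expected.sum(axis=1) / factor_mass
        p_p_z = expected.sum(axis=0) / factor_mass
        p_z = factor_mass / factor_mass.sum()

    return p_z, p_s_z, p_p_z


def log_likelihood(usage, p_z, p_s_z, p_p_z):
    """Log-likelihood of the observed counts under the fitted model."""
    p_sp = (p_s_z * p_z) @ p_p_z.T
    return float(np.sum(np.asarray(usage) * np.log(np.maximum(p_sp, 1e-12))))


# Made-up usage matrix: rows are sessions, columns are pages.
usage = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 3, 4],
    [0, 0, 4, 3],
], dtype=float)
p_z, p_s_z, p_p_z = plsa(usage, n_factors=2)
```

The columns of `p_s_z` and `p_p_z` play the role of the two mapped usage vectors: each session and each page is described by its association with every latent aspect.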
The LDA model is a recently emerging generative model, which reveals the intrinsic
correlation among co-occurrence observations via a generative procedure. Unlike Web
usage pattern mining with the PLSA model, LDA learns the hidden usage knowledge by
computing the Dirichlet value and posterior probability. The discovered usage knowledge
the latter two models is the capability of capturing the aspect space that associates with
the discovered usage knowledge in addition to usage pattern mining itself.
Algorithms for Web Usage Mining and Web Recommendation In this research, we
propose several algorithms and concepts based on each of the three latent semantic
analysis models described above.
For the traditional LSI model, the following algorithms and definitions are proposed:
− Latent Usage Information (LUI) algorithm. This algorithm transforms the
original usage data matrix into a latent usage information space, which not
only maintains the main usage information within a new
dimensionality-reduced usage space, but also reveals semantic relationships
hidden in the usage data. In this algorithm, an SVD operation is performed to
conduct the latent semantic analysis explicitly.
− User session distance in the semantic space. This measure calculates the
distance between two user sessions in the semantic space. Thanks to the
SVD operation, this distance function is based on a low-dimensional but
semantic-based vector space. It is therefore possible to partition user
sessions at the level of semantic analysis, that is, the user sessions in the
same aggregation are more like-minded.
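The two items above can be sketched with NumPy as follows. This is only an illustration under assumptions: the helper names `latent_usage_space` and `cosine_distance`, the choice of cosine distance as the session distance, the number of retained dimensions `k`, and the toy matrix are all hypothetical, since the text here does not fix them.

```python
import numpy as np


def latent_usage_space(usage, k):
    """Project user sessions into a k-dimensional latent space via truncated SVD.

    usage: (sessions x pages) array. Returns (session_vectors, page_basis),
    where session_vectors has shape (sessions, k).
    """
    u, s, vt = np.linalg.svd(np.asarray(usage, dtype=float), full_matrices=False)
    session_vectors = u[:, :k] * s[:k]  # session coordinates in the latent space
    page_basis = vt[:k, :]              # latent factors expressed over the pages
    return session_vectors, page_basis


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0.0:
        return 1.0
    return 1.0 - float(np.dot(a, b) / norm)


# Made-up usage matrix: rows are sessions, columns are pages.
usage = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 3, 4],
    [0, 0, 4, 3],
], dtype=float)

session_vectors, page_basis = latent_usage_space(usage, k=2)
d_same = cosine_distance(session_vectors[0], session_vectors[1])
d_diff = cosine_distance(session_vectors[0], session_vectors[2])
```

In this toy data the first two sessions visit the same pages, so `d_same` comes out much smaller than `d_diff`; a clustering step over such distances would place them in the same aggregation.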
For the PLSA model, the following algorithms and definitions are proposed:
− Expectation-Maximization (EM) algorithm. The EM algorithm is an iterative
procedure that estimates the maximum likelihood value of the co-occurrence
observations. In the proposed EM algorithm for the PLSA model,
distribution of Web objects against the latent factor space based on Bayesian
equation. The EM algorithm starts from an initial input and iteratively executes
the Expectation step, which updates the conditional probability distribution,
and the Maximization step, which re-calculates the likelihood value with the
updated conditional probability distribution, until a local optimum is
reached. The conditional probability distributions corresponding to the optimal
value can be viewed as the final estimates of the relationships between the
Web objects and the latent factors.
− Usage-based Web page similarity and user session similarity measures. Based
on the probability estimates derived via the EM algorithm, we propose two
new similarity measures, for Web pages and user sessions respectively: one
models the common functionality of Web pages, and the other measures the
like-minded navigational preferences of user sessions. With the two proposed
similarity measures, we are able to find aggregations of the Web objects by
utilizing the discovered usage knowledge.
− Usage-based Web page grouping algorithm. In this study, we develop a new
k-means algorithm for grouping Web pages by using the usage-based page
similarity. This clustering algorithm automatically generates Web page
groups based on the mutual distances between pages. The discovered page
groups can, in turn, be viewed as task-driven page aggregations, which can
be used to improve or re-structure the Web site design and organization.
− Determining user task preference distribution algorithm. In this algorithm, we
clicks of the user via a Bayesian updating approach. The dominant task
preferences are determined by selecting those tasks whose corresponding
probability weights exceed a certain threshold. Combining the dominant task
preferences with the corresponding tasks, each characterized by a set of
predominant pages, yields the pages with significant weights as the
recommended page list.
− User profiling algorithm for Web recommendation. In this research, we
propose a novel user profiling algorithm that represents the usage patterns
derived from Web usage mining by using a collaborative filtering approach. The
user sessions are first clustered into a number of session clusters based on the
usage-based session similarity measure, and the centroids of the discovered
user session clusters, in the form of weighted page sequences, are taken as
user profiles. When a new active user session arrives, the best-matching user
profile is selected by measuring the distance between the active user session
and the constructed user profiles, and a weighted scoring scheme is then
applied to select the N pages with the highest weights as the page
recommendation list. In other words, the recommended pages are chosen by
referring to the historic visits of other users who have similar
visiting preferences. In this sense, this algorithm is also called a collaborative
recommendation approach.
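The profile-matching and top-N selection steps of the last item can be sketched as below. This is a simplified illustration, not the proposed algorithm itself: profiles are assumed to be given already as `{page: weight}` dictionaries (e.g. session-cluster centroids), cosine similarity stands in for the usage-based session similarity, and the scoring rule (profile weight multiplied by the match similarity) is an assumed placeholder for the weighted scoring scheme. All names and numbers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {page: weight}."""
    dot = sum(w * b.get(page, 0.0) for page, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(active_session, profiles, top_n=2):
    """Recommend up to top_n unvisited pages from the best-matching profile.

    profiles: {profile_name: {page: weight}}, e.g. session-cluster centroids.
    Returns (profile_name, [(page, score), ...]) sorted by descending score.
    """
    best_name, best_sim = max(
        ((name, cosine_similarity(active_session, profile))
         for name, profile in profiles.items()),
        key=lambda item: item[1],
    )
    # Assumed scoring rule: profile weight scaled by the match similarity.
    scores = {
        page: weight * best_sim
        for page, weight in profiles[best_name].items()
        if page not in active_session
    }
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return best_name, ranked[:top_n]


# Hypothetical profiles (cluster centroids) and an active session.
profiles = {
    "shoppers": {"/home": 0.2, "/products": 0.4, "/cart": 0.3, "/checkout": 0.1},
    "support": {"/home": 0.3, "/faq": 0.4, "/contact": 0.3},
}
active = {"/home": 1.0, "/products": 2.0}
profile_name, recommendations = recommend(active, profiles, top_n=2)
```

Pages already present in the active session are excluded, so the list contains only pages that like-minded earlier visitors viewed and the current user has not.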
For the LDA model, we develop the following algorithms:
− A variational EM algorithm. In this research, we adopt a variational EM
algorithm to find the variational parameters that maximize the log likelihood
of the usage data. The estimates of the variational parameters are the posterior
probability and the Dirichlet value of the data, the former reflecting the
underlying relationships between user sessions and latent factors and the
latter representing the links between Web pages and latent factors.
− Collaborative recommendation algorithm. As with the collaborative
recommendation algorithm proposed for the PLSA model, we incorporate
the usage knowledge derived by the LDA model into a collaborative filtering
algorithm. We first partition user sessions into session clusters based
on the calculated posterior probability values and then view the centroids of
the session clusters as representatives of the usage patterns. Integrating this
usage knowledge into the proposed weighted scoring scheme eventually
generates the recommended page list.
Case Studies of Gait Pattern Mining In this research, we extend the
developed methodologies and technologies to a biomechanical data mining application,
i.e. gait pattern mining. Gait analysis is an important topic in clinical movement
research and its applications for specific populations, such as CP patients or elderly
people. In this study, we conduct the following case studies:
− Case study of CP gait pattern mining using traditional clustering-based
algorithms. We develop standard k-means and hierarchical clustering based
approaches to find CP-specific gait patterns. The gait characteristics of
healthy children and CP patients at different pathological levels are modelled
by different gait vectors of kinematic variables, i.e. temporal-distance
means for researchers or clinicians to monitor the development of CP or
assess the effectiveness of the intervention.
− Case study for monitoring the fall risk of an elderly population using a
SOM-based clustering approach. We employ a SOM-based clustering algorithm to
investigate the locality of the gait in a transformed SOM grid map. The
derived SOM grid offers a visualized representation of gait patterns
for screening the fall risk in an elderly population.
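To illustrate the standard k-means step named in the first case study above, the following sketch clusters a few gait vectors in plain Python. The feature pair (stride length, cadence), the numeric values and the two groups are invented for illustration and are not data from this study; a real analysis would probably also standardize the features first, which this sketch omits.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means (Lloyd's algorithm) on a list of equal-length numeric vectors.

    Returns (labels, centroids).
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)  # -1 means "not assigned yet"

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up gait vectors: (stride length in m, cadence in steps/min).
gait = [
    (1.20, 110.0), (1.25, 112.0), (1.18, 108.0),  # hypothetical group A
    (0.80, 70.0), (0.85, 72.0), (0.78, 68.0),     # hypothetical group B
]
labels, centroids = kmeans(gait, k=2)
```

With two well-separated groups like these, the resulting labels split the vectors into the two groups; the cluster index assigned to each group depends on the random initialization.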