Claims of the Dissertation - Web Mining Techniques for Recommendation and Personalization

1. Introduction

1.3. Claims of the Dissertation

This dissertation focuses on discovering Web usage patterns, in the form of user profiles and latent semantic factors, from Web log files to support Web recommendation based on various Latent Semantic Analysis (LSA) models, and on extending the proposed methodologies and technologies to other data mining applications.

The basic research philosophy of this study consists of three stages. The first is to create a unified mathematical analysis model, on which various data mining algorithms can be employed regardless of the research background addressed. At the second stage, to conduct Web usage mining, three latent semantic analysis algorithms are investigated to discover usage patterns and their associated semantic factors, which are used to construct a usage-based Web navigational base via a user profiling method. At the third stage, the discovered usage knowledge is used for a further Web application, i.e. Web recommendation. In addition to Web data mining, the proposed analytical methodology can also be extended to other data mining applications that use similar research techniques and analytical algorithms but have a different application background. In this study, we extend the developed methodologies and technologies to the healthcare data mining domain. This thesis comprises all of the studies and results mentioned above. Although the research in this thesis spans a number of research communities (Web mining, Web recommendation, clustering, text retrieval and health data mining), it centres on a single focus: using intelligent computational algorithms for knowledge discovery tasks. We believe this is the main contribution of this thesis.

In particular, to conduct these studies, a mathematical framework is established for Web usage mining, and a series of algorithms is proposed to predict Web users' navigational preferences and to recommend customized Web content to them. Three kinds of latent semantic analysis models, namely standard LSI, PLSA and LDA, are applied to Web usage mining and Web recommendation. Two case studies extend the proposed pattern mining methodologies and algorithms to gait pattern mining, an important topic in the healthcare and biomechanical data mining domains. The main contributions of the dissertation can be summarized as follows:

A Mathematical Framework of Web Usage Mining for Web Recommendation

Unlike other Web data models, such as those based on Web content or Web linkage information, this framework expresses Web usage information mathematically, e.g. the number of clicks or the visiting duration on various Web pages. Moreover, unlike other usage data models such as directed graphs or visit sequences [19, 68-70], this usage data model takes the form of a usage matrix. From the usage matrix, the intrinsic associations among Web objects such as Web pages or user sessions, as well as the latent semantic relationships between the latent task space and the Web objects conveyed by the usage information, are discovered by a series of matrix operations or generative procedures, such as singular value decomposition, matrix approximation, Bayesian equation conversion, Bayesian updating and Expectation-Maximization iteration. Other usage-based information and expressions, such as task-based Web page similarity, task-based user session similarity, user task preference distribution, user profile and top-N recommended page list, are also represented and derived by various matrix operations and related algorithms. This framework makes it feasible to analyse the collected Web usage data systematically, using mathematical theory in a unified way, so that the developed methodologies and technologies can be extended to other data mining applications that have similar data expressions. As a result, this framework provides a solid mathematical basis for discovering Web usage patterns and making Web recommendations.
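
As an illustration of the usage matrix described above, the following minimal sketch builds a session-by-page matrix from click records. The record layout `(session_id, page, weight)`, the function names and the row normalisation step are assumptions made for this example; they are not taken from the dissertation.

```python
def build_usage_matrix(records):
    """Aggregate (session_id, page, weight) records into a session-by-page matrix.

    The weight may be a click count or a visiting duration (hypothetical layout).
    """
    records = list(records)
    sessions = sorted({s for s, _, _ in records})
    pages = sorted({p for _, p, _ in records})
    row = {s: i for i, s in enumerate(sessions)}
    col = {p: j for j, p in enumerate(pages)}
    matrix = [[0.0] * len(pages) for _ in sessions]
    for s, p, w in records:
        matrix[row[s]][col[p]] += w  # repeated visits accumulate
    return sessions, pages, matrix


def normalise_rows(matrix):
    """Scale each session row to sum to 1 (an assumed, optional preprocessing step)."""
    out = []
    for r in matrix:
        total = sum(r)
        out.append([v / total if total else 0.0 for v in r])
    return out


# Toy records, invented purely for illustration.
records = [("s1", "/a", 2), ("s1", "/b", 2), ("s2", "/b", 1), ("s1", "/a", 1)]
sessions, pages, usage = build_usage_matrix(records)
```

Each row of `usage` is then a session vector and each column a page vector, which is the form assumed by the matrix operations listed above.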

Latent Semantic Analysis Models for Web Usage Mining

In this work, we aim to investigate intensively the use of latent semantic analysis paradigms for discovering Web usage patterns and making Web recommendations. The paradigms include the following:

− Traditional Latent Semantic Indexing (LSI)

− Probabilistic Latent Semantic Analysis (PLSA)

− Latent Dirichlet Allocation (LDA)

Traditional LSI is based on a Singular Value Decomposition (SVD) operation, which reduces the dimensionality of the original input space while retaining the best approximation of the original matrix. The main advantage of LSI is its capability of uncovering underlying relationships among the observed objects that are not exhibited explicitly and directly. In this study, we apply traditional LSI analysis to the usage data matrix to analyse the associations among user sessions in the transformed vector space resulting from the SVD.
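
In standard notation, the SVD step can be sketched as follows. The symbols are assumptions introduced for this illustration: A denotes the usage matrix with sessions as rows and pages as columns, and k is the number of retained singular values.

```latex
A = U \Sigma V^{\top}, \qquad
A_k = U_k \Sigma_k V_k^{\top}
    = \arg\min_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_F
```

Here U_k, Σ_k and V_k keep only the k largest singular values and their singular vectors, so A_k is the best rank-k approximation of A in the Frobenius norm. Under this notation, row i of U_k Σ_k can serve as the coordinates of session i in the reduced space.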

The PLSA model is a variant of the traditional LSI model, which introduces an aspect space as an intermediate layer between two usage attributes, i.e. user session and Web page. With the PLSA model, the original usage data is mapped into two new usage vectors, in which the associations between user sessions and the latent aspect space, and between Web pages and the latent aspect space, are modelled by estimates of conditional probabilities. The newly mapped usage vectors, along with the newly defined user session similarity and Web page similarity, provide a novel Web usage mining method, with which we can derive usage-based page groups and session aggregates.
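
For orientation, the standard symmetric formulation of PLSA, which the description above appears to follow, can be written as below. The symbols s_i (user session), p_j (Web page), z_k (latent aspect) and K (number of aspects) are assumptions chosen for this sketch.

```latex
P(s_i, p_j) = \sum_{k=1}^{K} P(z_k)\, P(s_i \mid z_k)\, P(p_j \mid z_k)
```

The conditional probabilities P(s_i | z_k) and P(p_j | z_k) play the role of the two mapped usage vectors: one relates sessions to the aspect space and the other relates pages to it.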

The LDA model is a recently emerging generative model, which reveals the intrinsic correlation among co-occurrences via a generative procedure. In contrast to mining Web usage patterns with the PLSA model, LDA learns the hidden usage knowledge by

computing the Dirichlet value and posterior probability. The discovered usage knowledge

the latter two models is the capability of capturing the aspect space that associates with

the discovered usage knowledge in addition to usage pattern mining itself.

Algorithms for Web Usage Mining and Web Recommendation

In this research, we propose several algorithms and concepts based on each of the three latent semantic analysis models described above.

For the traditional LSI model, the following algorithms and definitions are proposed:

− Latent Usage Information (LUI) algorithm. This algorithm transforms the original usage data matrix into a latent usage information space, which not only retains the main usage information within a new dimensionality-reduced usage space, but also reveals semantic relationships hidden in the usage data. In this algorithm, an SVD operation is performed to conduct the latent semantic analysis explicitly.

− User session distance in the semantic space. This measure calculates the distance between two user sessions in the semantic space. Owing to the SVD operation, this distance function is based on a low-dimensional but semantic vector space. It is therefore possible to partition user sessions at the level of semantic analysis, that is, the user sessions in the same aggregation are more like-minded.

For the PLSA model, the following algorithms and definitions are proposed:

− Expectation-Maximization (EM) algorithm. The EM algorithm is an iterative procedure that estimates the maximum likelihood value of the co-occurrence observations. In the proposed EM algorithm for the PLSA model,

distribution of Web objects against the latent factor space based on Bayesian

equation. The EM algorithm starts from an initial input and iteratively executes the Expectation step, which updates the conditional probability distribution, and the Maximization step, which re-calculates the likelihood value with the updated conditional probability distribution, until a local optimum is reached. The conditional probability distributions corresponding to the optimal value can be viewed as the final estimates of the relationships between the Web objects and the latent factors.

− Usage-based Web page similarity and user session similarity measures. Based on the probability estimates derived via the EM algorithm, we propose two new similarity measures, for Web pages and user sessions respectively: one models the common functionality of Web pages and the other measures the like-minded navigational preferences of user sessions. With the two proposed similarity measures, we are able to find aggregations of the Web objects by utilizing the discovered usage knowledge.

− Usage-based Web page grouping algorithm. In this study, we develop a new k-means algorithm for grouping Web pages by using the usage-based page similarity. This clustering algorithm automatically generates Web page groups based on the mutual distance between pages. The discovered page groups can, in turn, be viewed as task-driven page aggregations, which can be used to improve or re-structure the Web site design and organization.

− Determining user task preference distribution algorithm. In this algorithm, we

clicks of the user via a Bayesian updating approach. The dominant task

preferences are determined by selecting those tasks whose corresponding probability weights exceed a certain threshold. Combining the dominant task preferences with the corresponding tasks, each characterized by a set of predominant pages, yields the pages with significant weights as the recommended page list.

− User profiling algorithm for Web recommendation. In this research, we propose a novel user profiling algorithm to represent the usage patterns derived from Web usage mining, using a collaborative filtering approach. The user sessions are first clustered into a number of session clusters based on the usage-based session similarity measure, and the centroids of the discovered user session clusters, in the form of weighted page sequences, are taken as user profiles. When a new active user session arrives, the best-matching user profile is selected by measuring the distance between the active user session and the constructed user profiles, and a weighted scoring scheme is then applied to select the N pages with the highest weights as the page recommendation list. In other words, the recommended pages are chosen by referring to the historic visits of other users who have like-minded visiting preferences. In this sense, this algorithm is also called a collaborative recommendation approach.
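
The EM estimation listed above can be illustrated with a minimal sketch that uses the standard textbook updates for symmetric PLSA. These updates may differ in detail from the procedure proposed in this work, and the names `plsa_em`, `counts` and `n_factors` are hypothetical.

```python
import random


def plsa_em(counts, n_factors, n_iter=50, seed=0):
    """Fit symmetric PLSA by EM; counts[s][p] is the usage weight of page p in session s."""
    rng = random.Random(seed)
    n_s, n_p = len(counts), len(counts[0])

    def normalise(vec):
        total = sum(vec) or 1.0  # guard against an all-zero vector
        return [v / total for v in vec]

    # Random, strictly positive initial estimates of P(z), P(s|z) and P(p|z).
    p_z = [1.0 / n_factors] * n_factors
    p_s_z = [normalise([rng.random() + 0.1 for _ in range(n_s)]) for _ in range(n_factors)]
    p_p_z = [normalise([rng.random() + 0.1 for _ in range(n_p)]) for _ in range(n_factors)]

    for _ in range(n_iter):
        acc_z = [0.0] * n_factors
        acc_s = [[0.0] * n_s for _ in range(n_factors)]
        acc_p = [[0.0] * n_p for _ in range(n_factors)]
        for s in range(n_s):
            for p in range(n_p):
                n = counts[s][p]
                if n == 0:
                    continue
                # E-step: posterior P(z | s, p) via Bayes' rule.
                joint = [p_z[z] * p_s_z[z][s] * p_p_z[z][p] for z in range(n_factors)]
                total = sum(joint)
                for z in range(n_factors):
                    r = n * joint[z] / total
                    acc_z[z] += r
                    acc_s[z][s] += r
                    acc_p[z][p] += r
        # M-step: re-estimate the distributions from the expected counts.
        p_z = normalise(acc_z)
        p_s_z = [normalise(row) for row in acc_s]
        p_p_z = [normalise(row) for row in acc_p]
    return p_z, p_s_z, p_p_z


# Toy usage matrix, invented for illustration: two groups of sessions and pages.
counts = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 3, 6], [0, 0, 5, 4]]
p_z, p_s_z, p_p_z = plsa_em(counts, n_factors=2, n_iter=30)
```

The returned `p_s_z` and `p_p_z` correspond to the session-factor and page-factor conditional probabilities, from which similarity measures of the kind described above could be computed. EM converges only to a local optimum, so results depend on the random initialisation.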

For the LDA model, we develop the following algorithms:

− A variational EM algorithm. In this research, we adopt a variational EM algorithm to find the variational parameters that maximize the log likelihood of the usage data. The estimates of the variational parameters are the posterior probability and the Dirichlet value of the data: the former reflects the underlying relationships between user sessions and latent factors, while the latter represents the links between Web pages and latent factors.

− Collaborative recommendation algorithm. As with the collaborative recommendation algorithm proposed for the PLSA model, we incorporate the usage knowledge derived by the LDA model into a collaborative filtering algorithm. We first partition user sessions into session clusters based on the calculated posterior probability values, and then view the centroids of the session clusters as representatives of the usage patterns. Integrating this usage knowledge into the proposed weighted scoring scheme eventually generates the recommended page list.
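
The profile-matching and top-N selection steps used by both collaborative recommendation algorithms can be sketched as follows. This is only an illustration under stated assumptions: profiles and sessions are represented as page-to-weight dictionaries, cosine similarity stands in for the session distance, and pages already visited in the active session are excluded. None of these choices is confirmed by the text above, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse page-weight vectors (dicts)."""
    dot = sum(w * b.get(p, 0.0) for p, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def recommend(active, profiles, n):
    """Return the top-n unvisited pages from the profile closest to the active session."""
    best = max(profiles, key=lambda prof: cosine(active, prof))
    sim = cosine(active, best)
    # Assumed scoring: profile weight scaled by the session-profile similarity.
    scores = {p: w * sim for p, w in best.items() if p not in active}
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:n]


# Toy profiles (cluster centroids) and active session, invented for illustration.
profiles = [
    {"/home": 1.0, "/products": 0.8, "/cart": 0.6, "/checkout": 0.4},
    {"/home": 0.9, "/docs": 0.9, "/faq": 0.5},
]
active = {"/products": 1.0, "/cart": 1.0}
top_pages = recommend(active, profiles, 2)  # ["/home", "/checkout"]
```

In this toy case the active session overlaps only with the first profile, so the two highest-weighted pages of that profile not yet visited are recommended.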

Case Studies of Gait Pattern Mining

In this research, we aim to extend the developed methodologies and technologies to a biomechanical data mining application, i.e. gait pattern mining. Gait analysis is an important topic in clinical movement research and its applications for specific populations, such as cerebral palsy (CP) patients or elderly people. In this study, we conduct the following case studies:

− Case study of CP gait pattern mining using traditional clustering-based algorithms. We develop standard k-means and hierarchical clustering based approaches to find CP-specific gait patterns. The gait characteristics of healthy children and CP patients at different pathological levels are modelled

by different gait vectors of kinematic variables, i.e. temporal-distance

means for researchers or clinicians to monitor the development of CP or

assess the effectiveness of the intervention.

− Case study for monitoring the fall risk of the elderly population using a SOM-based clustering approach. We employ a self-organizing map (SOM) based clustering algorithm to investigate the locality of gait data in a transformed SOM grid map. The derived SOM grid offers a visual representation of gait patterns for screening fall risk in an elderly population.
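
To make the first case study concrete, the following is a minimal sketch of standard k-means applied to gait vectors. The two features (walking speed and stride length) and all numeric values are invented for illustration and are not data from this study; the deterministic initialisation from the first k points is also an assumption made to keep the example reproducible.

```python
def kmeans(points, k, n_iter=20):
    """Plain k-means with squared Euclidean distance; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each gait vector to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical gait vectors: (walking speed in m/s, stride length in m).
gait_vectors = [(1.20, 1.30), (0.60, 0.70), (1.10, 1.25), (0.55, 0.80)]
labels, centroids = kmeans(gait_vectors, k=2)
```

With these toy values the first and third vectors fall into one cluster and the second and fourth into the other, which mirrors the idea of separating gait patterns by their temporal-distance characteristics.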
