Data Mining - Master Thesis. Personalizing web search and crawling from clickstream data

Data mining is the process of retrieving patterns and useful information from data. For information about data mining, please see [23, 14, 13].

In the context of web search, some data mining techniques are more applicable than others. The most widely used ones in this field are document clustering and document classification. Many of these applications are based on the Vector Space Model. In the next sections we explain the Vector Space Model and the main machine learning and data mining techniques used in the field of information retrieval. For more information about these techniques, please refer to [19, 22].

2.5.1 Vector Space Model

Search engines return their results as a ranking, that is, the documents that best fit the query appear in the top positions. To produce such a ranking, we need a function that measures query-document similarity. For this purpose, each document is usually transformed into a vector, and a vector distance function is then applied.

A common technique to transform a document into a vector is tf-idf, but many more techniques for this purpose exist. These techniques are based on two assumptions:

• The more frequent a term t is in a document d, the more relevant t is in d.

• The more frequent a term t is among a set of documents D, the less relevant t is in d.

Given a set of documents D, the tf-idf technique transforms a document d into a vector v, where v has one component for each term that appears in d.

$$v_t = \mathrm{weight}_{d,t} = \mathrm{tf}_{d,t} \cdot \mathrm{idf}_t$$

where $\mathrm{tf}_{d,t}$ is the term frequency of the term t in the document d, that is, how many times t appears in d; and $\mathrm{idf}_t$ is the inverse document frequency of t in D. $\mathrm{idf}_t$ is computed as:

$$\mathrm{idf}_t = \log \frac{|D|}{|\{d' \in D : t \in d'\}|}$$
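
The weighting above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this thesis: the function name `tf_idf_vectors`, the whitespace tokenization, the natural logarithm, and the sparse dictionary representation are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document.

    Hypothetical helper for illustration; `docs` is a list of token lists.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        # weight = tf * idf, with idf = log(|D| / df); natural log is an assumption.
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


# Toy collection (made up for the example).
docs = [
    "web search engine".split(),
    "web crawler".split(),
    "search query log search".split(),
]
vectors = tf_idf_vectors(docs)
```

Note that a term appearing in every document gets weight 0 under this formula, which matches the second assumption above: such a term does not help to distinguish documents.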

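Then, the cosine distance C or the Euclidean distance E can be used to compute the similarity between two vectors. The following formulas are used to compute the cosine distance and the Euclidean distance.

$$C = \mathrm{cosineDistance}(\vec{c}, \vec{d}) = \arccos \frac{\vec{c} \cdot \vec{d}}{|\vec{c}|\,|\vec{d}|}$$

$$E = |\vec{c} - \vec{d}| = \sqrt{\sum_{i} (c_i - d_i)^2}$$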

This is, after all, a heuristic, but has been intensely studied from theoretical and experimental viewpoints and has become a de facto standard.
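
As a minimal sketch, the two distances between document vectors can be computed as follows. The function names and the dense list representation are assumptions made for this example; both vectors are assumed to be non-zero and of equal length.

```python
import math


def cosine_distance(c, d):
    """Angle (in radians) between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(c, d))
    norm_c = math.sqrt(sum(x * x for x in c))
    norm_d = math.sqrt(sum(y * y for y in d))
    cos = dot / (norm_c * norm_d)
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    return math.acos(max(-1.0, min(1.0, cos)))


def euclidean_distance(c, d):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(c, d)))
```

The cosine distance ignores vector length, so a document and a copy of it repeated twice are at (almost) zero distance, whereas the Euclidean distance between them is not zero unless the vectors are length-normalized first.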

In the Vector Space Model, each document is represented as a vector with one real-valued component, usually a tf-idf weight, for each term. Thus, the document space X, the domain of the classification function, is $\mathbb{R}^{|V|}$, where V is the set of terms (the vocabulary).

Contiguity hypothesis. Documents in the same class form a contiguous region and regions of different classes do not overlap.

Many clustering and classification algorithms use the Vector Space Model to represent documents and therefore rely on the contiguity hypothesis. Some of those algorithms are K-Means, K-Nearest Neighbours and Support Vector Machines.

2.5.2 Clustering

Clustering is the partitioning of a dataset into clusters, so that each cluster is internally coherent but clearly different from the others.

There are many kinds of clustering algorithms. The most important ones are: hierarchical clustering and flat clustering.

2.5.2.1 Hierarchical Clustering

Hierarchical clustering builds up a hierarchy of groups by repeatedly merging the two most similar groups. Each of these groups starts as a single item. In each iteration, the similarity between each pair of groups is computed, and the most similar pair is merged into a new group.

After several iterations (as many as the initial number of items minus 1), all the items belong to the same group.

Finally, after hierarchical clustering is completed, the result can be viewed as a binary tree.
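
The merging procedure described above can be sketched as follows. The text does not fix how the similarity between two groups is measured, so this example assumes single-link Euclidean distance (the distance between the closest pair of members); the function names are hypothetical and the input is assumed to be a non-empty list of numeric tuples.

```python
import math


def group_distance(a, b):
    """Single-link distance (an assumption): closest pair of members across groups."""
    return min(math.dist(x, y) for x in a for y in b)


def hierarchical_clustering(items):
    """Return the merge hierarchy as a nested binary tree of tuples."""
    # Each group starts as a single item, stored as (tree, members).
    groups = [(item, [item]) for item in items]
    while len(groups) > 1:
        # Find the pair of groups that are most similar (closest).
        i, j = min(
            ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
            key=lambda p: group_distance(groups[p[0]][1], groups[p[1]][1]),
        )
        merged = ((groups[i][0], groups[j][0]), groups[i][1] + groups[j][1])
        # Replace the two merged groups by the new one.
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups[0][0]


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
tree = hierarchical_clustering(points)
```

With n items the loop performs n - 1 merges, and each merge adds one internal node to the tree, which is why the result is a binary tree.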

2.5.2.2 Flat Clustering

Flat clustering algorithms determine all the clusters at once.

K-Means is the most important flat clustering algorithm. This algorithm uses the Vector Space Model to represent the documents. Its objective is to minimize the average squared Euclidean distance of documents from their cluster centers, where a cluster center is defined as the mean or centroid $\vec{\mu}$ of the documents in a cluster $\omega$:

$$\vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}$$

The definition assumes that the documents are represented as length-normalized vectors.

A measure of how well the centroids represent the members of their clusters is the Residual Sum of Squares (RSS), the squared distance of each vector from its centroid summed over all vectors:

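$$\mathrm{RSS}_k = \sum_{\vec{x} \in \omega_k} |\vec{x} - \vec{\mu}(\omega_k)|^2, \qquad \mathrm{RSS} = \sum_{k=1}^{K} \mathrm{RSS}_k$$

where $\omega_k$ is the k-th cluster and $\vec{\mu}(\omega_k)$ is its centroid.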

The first step of K-means is to select as initial cluster centers K randomly selected documents (where K has been provided by the user), the seeds. The algorithm then moves the cluster centers around in space in order to minimize RSS. This is done by repeating iteratively two steps:

1. Reassign each document to the cluster with the nearest centroid.

2. Re-compute the centroid of each cluster (based on the documents belonging to the cluster).

This iteration is repeated until a stopping criterion is met. The most usual stopping conditions are:

• A fixed number of iterations has been completed.

• Assignment of clusters does not change between iterations.

• Terminate when RSS falls below a threshold. In practice, we need to combine it with a bound on the number of iterations to guarantee termination.

• Terminate when the decrease in RSS falls below a threshold, which indicates we are close to convergence. Again, we need to combine it with a bound on the number of iterations to prevent very long runtimes.
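
The two-step iteration and some of the stopping criteria above can be sketched as follows. This is a plain textbook-style sketch, not the K-Means variation used in our application. The parameter names (`max_iters`, `tol`, `seed`), the handling of empty clusters, and the use of unnormalized toy vectors are assumptions made for the example.

```python
import math
import random


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(vectors)
    return tuple(sum(component) / n for component in zip(*vectors))


def rss(clusters, centers):
    """Residual sum of squares over all clusters."""
    return sum(
        math.dist(x, c) ** 2 for members, c in zip(clusters, centers) for x in members
    )


def k_means(docs, k, max_iters=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(docs, k)  # K randomly selected documents as seeds
    assignment, prev_rss = None, math.inf
    for _ in range(max_iters):  # bound on the number of iterations
        # Step 1: reassign each document to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(x, centers[j])) for x in docs
        ]
        if new_assignment == assignment:  # assignments did not change
            break
        assignment = new_assignment
        clusters = [[x for x, a in zip(docs, assignment) if a == j] for j in range(k)]
        # Step 2: recompute centroids (an empty cluster keeps its old center).
        centers = [centroid(c) if c else centers[j] for j, c in enumerate(clusters)]
        current_rss = rss(clusters, centers)
        if prev_rss - current_rss < tol:  # decrease in RSS below a threshold
            break
        prev_rss = current_rss
    return assignment, centers


docs = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignment, centers = k_means(docs, k=2)
```

On this toy input the two obvious groups are recovered regardless of which two documents are drawn as seeds; on real data the result generally depends on the seeds, so several runs with different seeds are a common precaution.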

In our application, we have used a K-Means variation as our text clustering algorithm.

2.5.3 Classification

In text classification, we are given a description $d \in X$ of a document, where X is the document space, and a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$. We are given a training set D of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in X \times C$.

Using a learning method or a learning algorithm, we then wish to learn a classifier or a classification function that maps documents to classes.

This type of learning is called supervised learning because the supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process.

2.5.3.1 Naïve Bayes

Naïve Bayes is a probability-based classification algorithm. The probability of a document d being in class c is computed as:

$$P(c \mid d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k \mid c)$$

where $P(t_k \mid c)$ is the conditional probability of term $t_k$ occurring in a document of class c, and $n_d$ is the number of term occurrences in d. We interpret $P(t_k \mid c)$ as a measure of how much evidence $t_k$ contributes that c is the correct class. $P(c)$ is the prior probability of a document occurring in class c. If a document's terms do not provide clear evidence for one class versus another, we choose the one that has a higher prior probability.

Our goal is to find the best class for the document, so we will choose the class for which the highest probability is estimated.
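
A minimal sketch of this decision rule is shown below. The text does not say how the probabilities are estimated, so the example assumes relative frequencies with add-one (Laplace) smoothing, sums logarithms instead of multiplying probabilities to avoid numerical underflow, and ignores terms never seen in training. The function names and the toy training set are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(training_set):
    """training_set: list of (tokens, class_label) pairs."""
    class_counts = Counter(label for _, label in training_set)
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in training_set:
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    n_docs = len(training_set)
    # log P(c): fraction of training documents in class c.
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(term_counts[c].values())
        # log P(t | c) with add-one smoothing (an assumption, not from the text).
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + 1) / (total + len(vocabulary)))
            for t in vocabulary
        }
    return log_prior, log_likelihood


def classify(tokens, log_prior, log_likelihood):
    """Return the class with the highest estimated (log) probability."""
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][t] for t in tokens if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


training = [
    ("cheap pills buy".split(), "spam"),
    ("buy cheap now".split(), "spam"),
    ("meeting agenda today".split(), "ham"),
]
log_prior, log_likelihood = train_naive_bayes(training)
```

When none of a document's terms were seen in training, the scores reduce to the priors, so the class with the higher prior probability is chosen, as described above.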

2.5.3.2 Support Vector Machines

Support Vector Machines, also known as SVMs, are a set of supervised learning methods used for classification. SVMs are not necessarily better than other machine-learning methods (except perhaps in situations with little training data), but they perform at the state-of-the-art level and have much current theoretical and empirical appeal.

SVMs use the Vector Space Model to represent documents. Once a set of training documents is provided by the user, the algorithm tries to find hyperplanes in an extended feature space that separate the different classes.

SVMs are very complex, so we only explain their basics here; deeper details are out of the scope of this thesis. For further details on SVMs, see [19].

2.5.3.3 K-Nearest Neighbours

k-Nearest Neighbours is an algorithm that classifies objects based on their closest training samples in the feature space.

Given a collection of items C, an item I, and a natural number K, the k-Nearest Neighbours algorithm will return a collection of items S containing the K items in C that are closest to I. Thus, it is necessary to have a function dist(a, b) which computes the distance between the items a and b based on their similarity (the more similar, the less distance).

If the selected K value is too large, the algorithm will return items that are too far from I, thus adding noise. On the other hand, if K is too small, the set of items returned may not be representative enough, and may introduce random variations.

The item set S returned by k-Nearest Neighbours can be used to classify I, or to perform regression on I according to the items in S.
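
The following sketch illustrates both uses: retrieving the K closest items, and classifying I from them. The function names are hypothetical, the Euclidean distance (`math.dist`) is only a default choice for `dist`, and the majority vote used for classification is an assumption, since the text does not fix how S is turned into a class.

```python
import math
from collections import Counter


def k_nearest(collection, item, k, dist=math.dist):
    """Return the k items of `collection` that are closest to `item`."""
    return sorted(collection, key=lambda other: dist(item, other))[:k]


def knn_classify(labeled, item, k, dist=math.dist):
    """Classify `item` by majority vote among its k nearest labeled items.

    `labeled` is a list of (vector, label) pairs.
    """
    neighbours = sorted(labeled, key=lambda pair: dist(item, pair[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy labeled data (made up for the example).
labeled = [
    ((0.0, 0.0), "a"),
    ((0.0, 1.0), "a"),
    ((1.0, 0.0), "a"),
    ((5.0, 5.0), "b"),
    ((6.0, 5.0), "b"),
]
```

For regression, the same neighbour set could instead be averaged over a numeric target value rather than voted on.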

For more information about the k-Nearest Neighbours algorithm see [22].

Chapter 3

State of the art

3.1 User Navigation Patterns

There are many approaches for discovering patterns in users' navigation. For example, [6] introduces new algorithms to retrieve a taxonomy of a single web site from the click-streams of its users.

In [8], the authors developed a system to find out how time affects user behavior while surfing a web page. That is, they segment the users' navigation logs into different time intervals, and then they find which time intervals really affect the users' behavior.

The two papers presented above try to find navigation patterns among large groups of users for a single site. In this thesis, we instead consider finding navigation patterns from a single-user approach; moreover, we consider all the pages that the user visits, not a single site. Also note that the previously presented works have been designed to work on the server side of the Web; by contrast, our system has been designed to work on the client side.
