The Development of Web Log Mining Based on Improve-K-Means Clustering Analysis


D. Jin and S. Lin (Eds.): Advances in CSIE, Vol. 2, AISC 169, pp. 613–618. springerlink.com © Springer-Verlag Berlin Heidelberg 2012


TingZhong Wang*

College of Information Technology, Luoyang Normal University, Luoyang, 471022, China [email protected]

Abstract. The main objective of Web log mining is to extract interesting patterns from Web access records. Web log mining has been successfully applied to improving personalized recommendation systems and to business intelligence. This paper presents the development of Web log mining based on improved K-Means clustering analysis. The K-Means clustering algorithm is analyzed, a K-Means clustering algorithm based on a validity index is proposed and verified by experiment, and a method for automatically selecting the initial cluster centers is proposed. Experiments show that this selection method reduces the number of isolated points (outliers) and improves the clustering results.

Keywords: web log mining, K-Means clustering, isolated point clustering.

1 Introduction

Web log mining, also known as Web usage mining, applies data mining techniques to the large volume of usage data (user access records) and other relevant data collected by a website in order to obtain valuable knowledge about how the site is accessed. Its main objective is to extract interesting patterns from Web access records. Web log mining is mainly used in e-commerce: by analyzing Web log records and exploring their regularities, it helps to identify potential customers, to enhance the quality of Internet information services for end users, and to improve the performance and structure of the Web server system. The techniques and tools currently studied in Web usage mining can be divided into two categories: pattern discovery and pattern analysis.

Web log mining has two main research directions: user access pattern tracking and personalized use of the recorded tracks. User access pattern tracking analyzes usage records to understand users' access patterns and tendencies in order to improve the organizational structure of the site [1]. Analyzing these data therefore helps to understand user behavior, to improve the site structure, and to provide users with personalized service.

The most common application of Web access mining is Web log mining: mining the server's log files to derive user access patterns. This article is based on Web log mining for personalized recommendation and presents the development of Web log mining based on improved K-Means clustering analysis. Because the K-Means clustering algorithm is used to cluster users, it is described in detail below.

* About the author: TingZhong Wang (born July 1973), male, Han, Master's degree from Henan University of Science and Technology. Research areas: web mining, data mining, clustering.

2 K-Means Clustering Algorithm and Its Improvement

Cluster analysis, which is used to discover the distribution and patterns of data, is an important research direction in data mining. The clustering problem can be described as follows: a collection of data points is divided into classes (called clusters) such that data points within the same cluster are as similar as possible and data points in different clusters are as dissimilar as possible.

Web logs can be clustered in two ways: user clustering and page clustering. User clustering groups user sessions according to users' access behavior, looking for users with similar behavior patterns. The results of user clustering can serve as the pattern library for the recommendation module of a smart Web site. For example, suppose the Web server determines that user A and user B belong to the same group and that the user profile of the group is {a.html, b.html, c.html}. If user A has visited a.html, the real-time recommendation module of the smart Web site recommends b.html and c.html to user A [2]. If user B has visited b.html, the module recommends a.html and c.html to user B, that is, Equation 1:

[Equation 1: the formula is garbled in the source; the surviving symbols are θ, φ, κ, β_n, R, cos, and sin.]    (1)
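The recommendation example with users A and B can be illustrated with a minimal Python sketch. The function name `recommend` and the list-based profile are assumptions made for illustration; the paper does not specify how the recommendation module is implemented.

```python
def recommend(group_profile, visited):
    """Return the pages of the group's profile that the user has not visited yet.

    group_profile: pages shared by the user's cluster (hypothetical representation).
    visited: pages the user has already accessed in the current session.
    """
    seen = set(visited)
    return [page for page in group_profile if page not in seen]


# Profile of the group that contains both user A and user B.
profile = ["a.html", "b.html", "c.html"]

print(recommend(profile, ["a.html"]))  # user A -> ['b.html', 'c.html']
print(recommend(profile, ["b.html"]))  # user B -> ['a.html', 'c.html']
```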

K-Means clustering is a common partition-based clustering method and a widely used algorithm. Such a clustering optimizes an objective criterion for the classification (often called the similarity function, e.g., a distance or a similarity coefficient). In this article we use the distance between data objects, which is relatively simple and commonly used, to describe similarity: the greater the distance, the smaller the similarity, and vice versa.

Its core idea is to cluster the data objects iteratively so as to minimize the objective function, making the generated clusters as compact and as independent as possible. The iterative relocation process is repeated until the objective function is minimized, that is, until the clusters no longer change. The objective function (error function) generally uses the mean square error as the standard measure function, as in Equation 2.

$$E = \sum_{j=1}^{k} \sum_{l_i \in c_j} \lVert l_i - w_j \rVert^2 \qquad (2)$$
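As a small worked illustration of the squared-error objective of Equation 2, the sketch below sums the squared Euclidean distances between each object and the center of its cluster. The function name `squared_error` and the toy data are assumptions chosen for illustration only.

```python
def squared_error(clusters, centers):
    """Sum of squared Euclidean distances from every object to its cluster center."""
    total = 0.0
    for members, center in zip(clusters, centers):
        for point in members:
            total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Two toy clusters and their centers (the mean of each cluster).
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (10.0, 12.0)]]
centers = [(1.0, 0.0), (10.0, 11.0)]

print(squared_error(clusters, centers))  # 4.0: each object is at squared distance 1
```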

In general, it is very difficult to determine the value of the clustering parameter k in advance; k should therefore be obtained from the data set and a clustering criterion. Ray and Turi proposed a validity index (the "effective index" of a clustering) that measures the distance within clusters and the distance between clusters, and applied it to image processing. The validity index is given by Equations (3), (4), and (5).

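$$\mathrm{Validity}(k) = \frac{\mathrm{Intra}(k)}{\mathrm{Inter}(k)} \qquad (3)$$

$$\mathrm{Intra}(k) = \frac{1}{N} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - Z_i \rVert^2 \qquad (4)$$

$$\mathrm{Inter}(k) = \min_{i \neq j} \left( \lVert Z_i - Z_j \rVert^2 \right) \qquad (5)$$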
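The index of Equations (3) to (5) can be computed directly from a clustering, as in the following sketch. It assumes that objects are numeric tuples, that Z_i is the center of cluster C_i, that N is the total number of objects, and that the minimum in Equation (5) runs over pairs of distinct centers; the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def validity(clusters, centers):
    """Intra(k) / Inter(k): smaller values mean compact, well-separated clusters."""
    n = sum(len(members) for members in clusters)
    # Intra(k): average squared distance between an object and its own center.
    intra = sum(sq_dist(x, z) for members, z in zip(clusters, centers) for x in members) / n
    # Inter(k): smallest squared distance between two distinct cluster centers.
    inter = min(
        sq_dist(centers[i], centers[j])
        for i in range(len(centers))
        for j in range(i + 1, len(centers))
    )
    return intra / inter


clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (10.0, 12.0)]]
centers = [(1.0, 0.0), (10.0, 11.0)]

# Intra = 4 / 4 = 1.0 and Inter = 9**2 + 11**2 = 202.0, so the index is 1.0 / 202.0.
print(validity(clusters, centers))
```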

This article combines the validity index with the K-Means clustering algorithm and proposes a K-Means clustering algorithm based on the validity index. The algorithm does not require the user to determine the clustering parameter k in advance; k is determined automatically, but an upper limit Kmax on the number of clusters is required. Under normal circumstances, the clustering parameter is much smaller than the number of objects (k << n). The algorithm is described as follows.

Algorithm idea: the algorithm combines the validity index with the K-Means clustering algorithm, which clusters objects based on the mean of the objects in each cluster;

Input: a data set of n objects, where each object has m attributes;

Output: the number of clusters k and the set of k clusters that minimize the validity index of the clustering.

i. For k = 2 to Kmax, increasing k by a chosen step, perform steps ii to vii;

ii. Randomly select k objects as the initial cluster centers: c_1(1), c_2(1), ..., c_k(1);

iii. Reassign each object to the cluster whose center is closest to it;

iv. Update the cluster means, computing the mean of the objects in each cluster by Equation 6:

$$c_j(t+1) = \frac{1}{N_j} \sum_{x \in C_j(t)} x \qquad (6)$$

where j = 1, 2, ..., k, t is the iteration number, and N_j is the number of objects in cluster C_j(t);

v. Repeat steps iii and iv until the cluster centers no longer change, that is, until for all j = 1, 2, ..., k

$$c_j(t+1) = c_j(t)$$

Once the cluster centers no longer change, go to the next step;

vi. Compute the validity index Validity(k) for the current number of clusters k according to Equations (3) to (5);

vii. Compare Validity(k) with the previous value Validity(k-1) and retain the k with the smaller validity value;

viii. End of the algorithm: output the best number of clusters k, the k cluster centers, and the clusters C1, C2, C3, ..., Ck.
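Steps i to viii can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: objects are numeric tuples, distance is squared Euclidean distance, an empty cluster keeps its previous center, and the names `k_means`, `validity`, `best_k`, `k_max`, `step`, and `seed` are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points (Equation 6)."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def k_means(data, k, rng, max_iter=100):
    centers = rng.sample(data, k)                      # step ii: random initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in data:                                 # step iii: nearest-center assignment
            j = min(range(k), key=lambda c: sq_dist(x, centers[c]))
            clusters[j].append(x)
        # step iv: update means; an empty cluster keeps its old center (assumption)
        new_centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
        if new_centers == centers:                     # step v: centers no longer change
            break
        centers = new_centers
    return clusters, centers


def validity(clusters, centers):
    """Intra(k) / Inter(k), following Equations (3) to (5)."""
    n = sum(len(c) for c in clusters)
    intra = sum(sq_dist(x, z) for c, z in zip(clusters, centers) for x in c) / n
    inter = min(
        sq_dist(centers[i], centers[j])
        for i in range(len(centers))
        for j in range(i + 1, len(centers))
    )
    # Coinciding centers give Inter = 0; treat that clustering as the worst case.
    return intra / inter if inter > 0 else float("inf")


def best_k(data, k_max, step=1, seed=0):
    rng = random.Random(seed)
    best = None
    for k in range(2, k_max + 1, step):                # step i: try k = 2 .. Kmax
        clusters, centers = k_means(data, k, rng)
        v = validity(clusters, centers)                # step vi
        if best is None or v < best[0]:                # step vii: keep the smaller index
            best = (v, k, clusters, centers)
    return best                                        # step viii


if __name__ == "__main__":
    blob_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
    blob_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    v, k, clusters, centers = best_k(blob_a + blob_b, k_max=4)
    print("selected k:", k, "validity:", v)
```

Because the initial centers are random, the selected k can vary with the seed; a larger `step` trades precision in k for fewer K-Means runs.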


3 Web Log Mining Technology

Web log mining has been successfully applied to improving personalized recommendation systems and to business intelligence. Commercial sites in particular accumulate large amounts of user access log data, and businesses can use these data to provide users with personalized services and so improve customer trust and service quality [3]. The most common application of Web access mining is Web log mining: mining the server's log files to derive user access patterns. This article is based on Web log mining for personalized recommendation, as shown by Equation 7.

$$u(0) = \sum_{i=1}^{N} \alpha_i e_i + \sum_{i=N+1}^{M} \alpha_i e_i \qquad (7)$$

Web log mining can be divided into three phases: data preprocessing, pattern mining, and analysis of the mined patterns. A Web server access log generally includes the IP address, the request time, the method (e.g., GET, POST), the URL of the requested file, the HTTP version number, the return code, and the number of bytes transmitted. Table 1 lists several access log entries of the Web server http://lpqf.haust.edu.cn. The first entry in Table 1 shows that a user at IP address 192.168.2.174 sent a GET request for /comment/list4js.asp; the request successfully transferred 93 bytes of data, and the return code 200 indicates a successful response.

Table 1. Content of Web-server’s Access Log

IP Address      Time                  Method/URL                     Status  Size

192.168.2.174   2006-10-16 00:23:40   GET /comment/list4js.asp       200     93
192.168.2.222   2006-10-16 00:25:02   GET /include/PageCount.asp     200     188
192.168.2.174   2006-10-16 00:25:48   POST /comment/comment.asp      200     242
192.168.2.233   2006-10-16 00:27:21   GET /include/functionhit.asp   188     231
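As an illustration of the data preprocessing phase, the sketch below parses one log entry laid out like the columns of Table 1 (IP address, date, time, method, URL, status, size). The whitespace-separated layout and the names `LogEntry` and `parse_line` are assumptions made for illustration; real server log formats differ and usually carry additional fields.

```python
from datetime import datetime
from typing import NamedTuple


class LogEntry(NamedTuple):
    ip: str
    time: datetime
    method: str
    url: str
    status: int
    size: int


def parse_line(line):
    """Parse a whitespace-separated entry: ip date time method url status size."""
    ip, date, clock, method, url, status, size = line.split()
    timestamp = datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M:%S")
    return LogEntry(ip, timestamp, method, url, int(status), int(size))


# First entry of Table 1.
entry = parse_line("192.168.2.174 2006-10-16 00:23:40 GET /comment/list4js.asp 200 93")
print(entry.ip, entry.method, entry.url, entry.status, entry.size)
```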

The approach is mainly based on the idea of automatic evaluation: if a user accesses a site or page for a long time or with high frequency, the user's interest in that site or page is high. Access time and frequency can therefore be used as weights that measure interest. The algorithm is as follows: i. compute the frequency with which the user accesses a URL, obtained by counting the number of times the URL is accessed. Because the data cleaning stage removes pages that are visited only occasionally, one can require that the number of accesses to a URL within a fixed time period be greater than or equal to a set value.
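The frequency step can be sketched as follows. The sketch assumes that the log of one fixed time period has already been reduced to (user, url) pairs, that frequency means the number of accesses by a user to a URL, and that the threshold `min_count` plays the role of the set value; these names and the interpretation are illustrative assumptions.

```python
from collections import Counter


def url_frequencies(visits, min_count=2):
    """Count accesses per (user, url) pair and drop occasional visits.

    visits: iterable of (user, url) pairs from one fixed time period.
    min_count: minimum number of accesses for a pair to be kept (assumed threshold).
    """
    counts = Counter(visits)
    return {pair: n for pair, n in counts.items() if n >= min_count}


visits = [
    ("u1", "/a"), ("u1", "/a"), ("u1", "/b"),
    ("u2", "/a"), ("u2", "/a"), ("u2", "/a"),
]

# ("u1", "/b") is visited only once and is filtered out.
print(url_frequencies(visits))  # {('u1', '/a'): 2, ('u2', '/a'): 3}
```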

4 Experimental Results and Analysis

In order to verify the validity of the algorithm, the log data were processed by data cleaning, user identification, page recognition, and other steps to produce two sets of data: the first set consists of 201 users and 81 links; the second set includes 792 different users and 1644 links. The validity index values are shown in Figure 1.

It can be seen from the test results on the two data sets that the algorithm attains the minimum validity index at k = 62. At this k the clustering is compact within clusters and maximally separable between clusters, so the experiment achieves the desired result. When the clustering algorithm is used, an appropriate step should be selected based on the size and number of the data objects to be clustered. When the amount of data is small, choosing a smaller step improves the precision of clustering; when the amount of data is large, increasing the step size reduces the computation, and for a large amount of data the impact of a larger step on the accuracy of the algorithm is negligible.

Fig. 1. Validity versus Cluster Number

For comparison, the conventional k-means method with randomly selected initial points was tested against the automatic initial cluster center selection algorithm by clustering 2658 test users (nine clusters). The test results are shown in Figures 2 and 3. They show that automatic selection of the initial cluster centers is superior to random selection.

Fig. 2. Stochastic selection of clustering initialization points and result of clustering

Cluster analysis was performed on the 2658 users. The results show that automatic selection of the initial cluster centers better solves the problem of isolated points; the comparison is shown in Figure 4. The figure clearly shows that random selection of the initial cluster centers results in more isolated points than the automatic cluster center selection algorithm.

Fig. 4. Comparisons of isolated point

This paper analyzes the k-means clustering algorithm and, to address the initial value problem of the traditional clustering algorithm, proposes an improved K-Means clustering algorithm based on a validity index, which is validated through experiments. Random selection of the initial points produces more isolated points; to reduce outliers, the initial cluster centers are selected automatically, and the experiments show that this selection method reduces outliers and improves the clustering effect.

5 Summary

Web log mining has been successfully applied to improving personalized recommendation systems and to business intelligence. K-Means clustering is a common partition-based clustering method and a widely used algorithm. This paper presents the development of Web log mining based on improved K-Means clustering analysis, in which the K-Means clustering algorithm is used to cluster users and is therefore described in detail.

References

1. Mobasher, B., Cooley, R., Srivastava, J.: Automatic personalization based on web usage mining. Communications of the ACM 43(8), 142–151 (2000)

2. Huang, Z.: Extensions to the K-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 2, 283–304 (1998)

3. Srivastava, J., Cooley, R., Deshpande, M., et al.: Web usage mining: discovery and application of usage patterns from web data. SIGKDD Explorations 1(2), 12–23 (2000)