DISCOVERING SEQUENTIAL RULES FOR WEB USAGE ANALYSIS


Ms. P.C. Saritha 1, Dr. T. Senthil Prakash 2, Mr. M. Rajesh 3, Ms. K. Remya 4

II Year M.E (CSE) 1, Professor & HOD 2, Assistant Professor 3, II Year M.E (CSE) 4

Shree Venkateshwara Hi-Tech Engg. College, Gobi, Tamilnadu, India 1,2,3,4 [email protected] 1, [email protected], [email protected] 4

ABSTRACT

Web mining techniques are used to analyze web resource details. Content mining, structure mining and usage mining are the main types of web mining. Content mining and structure mining methods are used to analyze web page contents. User access details are analyzed using usage mining methods. Association rule mining techniques are used to mine hidden knowledge from large data sets. Candidate sets combine attribute names and values, and item sets are built from candidate sets. Support and confidence values are used in the association rule mining process. Sequential pattern mining discovers subsequences common to multiple sequences. A sequential rule (SR) is also called an episode rule, temporal rule or prediction rule. Partially Ordered Sequential Rules (POSR) are rules whose antecedent and consequent items are unordered. The RuleGrowth algorithm is adapted to discover POSR. A rule expansion approach is applied to generate sequential rules and to carry out the search. RuleGrowth is easily extended with constraints for specific applications. The TRuleGrowth algorithm finds rules occurring within a sliding-window constraint. Left expansion and right expansion operations are carried out to update the rules in the RuleGrowth algorithm.

1. Introduction

Predicting the next element(s) of a sequence is an important research problem with wide applications such as stock market analysis, consumer product recommendation, weather forecasting, text generation and web link recommendation.

Techniques for sequence prediction can be categorized according to the types of sequences to which they are applied. There are two main types. Time series are sequences of numeric data typically recorded at equal time intervals. Symbolic sequences are sequences of events or nominal data generally recorded at unequal time intervals [4]. Previous research on sequence prediction has mainly focused on predicting time series. This is usually done by applying statistical methods to find mathematical functions that fit the data. These functions are then used to make predictions [3]. In this paper, we are interested in the case of symbolic sequences, which has many applications. Because the data in symbolic sequences is not numeric, techniques for time-series forecasting cannot be applied to this problem.


sequential rules [5], [6], [7] that we have proposed in previous works. We compare the prediction accuracy of these rules with standard sequential rules for the task of webpage recommendation for large click stream datasets. Experimental results show that using the new type of sequential rules can greatly improve prediction accuracy, while requiring a smaller training set.

2. Related Work

Current research on DUST detection can be classified into two main families of methods: content-based and URL-based. In content-based DUST detection, the similarity of two URLs is determined by comparing their contents using syntactic or semantic evidence such as shingles, text signatures, pair-wise similarities, sentence pair-wise similarities, and semantic graphs [12]. To infer whether two distinct URLs correspond to duplicates or near duplicates, content-based methods must fetch and inspect the whole content of the corresponding pages. To avoid this waste of resources, several URL-based methods have been proposed to determine duplicate URLs without examining the associated contents. For a comprehensive review of the literature, we refer the reader to [2], which describes both content-based and URL-based methods. In the following paragraphs, we focus on URL-based methods, including the ones that reported the best results in the literature.

The first URL-based method proposed was DustBuster. In their work, the authors addressed the DUST detection problem as a problem of finding normalization rules able to transform a given URL into another likely to have similar content. The rules consist of substring substitutions learned from crawl logs or web logs. Rules are selected if (a) they have large support, (b) they do not come from large groups and (c) URLs matched by them have similar sketches or compatible sizes in the training log. Redundant rules are eliminated based on their support information. By evaluating their method on four websites, the authors found that up to 90 percent of the top 10 rules were valid, 47 percent of the duplicate URLs were recognized and the crawl was reduced by up to 26 percent.

The authors use some heuristics to generalize the generated rules. In particular, they attempt to infer the false positive rate of the rules in order to select the most precise ones. To accomplish this, they verify whether the number of values that a certain URL component assumes is greater than a threshold value N, a heuristic which they call fanout-N. Their best results were obtained with N = 10. In this work, we refer to this method as Rfanout-10. By applying the set of rules found by Rfanout-10 in a number of large scale experiments on real data, the authors were able to reduce the number of duplicate URLs by 60 percent, whereas substitution rules alone achieved a 22 percent reduction.

The authors evaluated their method on a set of about 8 million URLs, achieving a deduplication reduction rate of about 42 percent using the top 9 percent of precise rules. In a subsequent paper [9], they implemented their algorithm using a distributed framework and extended the URL and rule representations to include two additional patterns: tokens inside path components and more complex irrelevant components. The authors used a very simple alignment heuristic to deal with irrelevant components. They evaluated the method with 3 billion URLs, showing its scalability. Compared with Rfanout-10, their method achieved twice the reduction using 56 percent of the rules. The proposed method is not publicly available and was not described in enough detail to be implemented.


conflicts and redundancy. They evaluated their approach on a collection of 70 million URLs and showed that their method was able to outperform Rfanout-10, achieving about twice the reduction using 46 percent of the rules and consuming half of the learning time. In this work, we refer to it as Rtree. Since, as far as we know, Rtree is the method which reported the best results in the literature, we adopt it as our second baseline.

In this article we continue the preliminary work presented in [10], where we first proposed the use of multiple alignment as a way to avoid the problems of simple pairwise rule extraction. The main differences between this work and the previous one are (1) the handling of large dup-clusters; (2) the adoption of new methods for intracluster generalization and alignment penalization; (3) the elimination of a hierarchical clustering step with the reduction of the number of generated rules; and (4) the simplification of the algorithm, by supporting fewer kinds of tokens. We also include a theoretical and empirical performance analysis of rules and algorithms, a new web dataset where duplicates were identified by means of canonical tags, and a more detailed set of experiments that allowed us to draw conclusions regarding the use of our method.

3. Mining Partially-Ordered Sequential Rules

Sequential pattern mining is an important data mining task with wide applications. It consists of discovering subsequences that are common to multiple sequences. Several algorithms have been proposed for this task such as GSP, PrefixSpan, SPADE and CM-SPADE. Sequential patterns found by these algorithms are often misleading for the user. The reason is that patterns are found solely on the basis of their support. For instance, consider the sequential pattern {Vivaldi}, {Handel}, {Berlioz} meaning that customer(s) bought the music of Vivaldi, Handel and Berlioz in that order. This sequential pattern is said to have a support of 50 percent because it appears in sequences 1, 2 and 4 of the following sequence database containing six sequences.

1: {Vivaldi}, {Mozart}, {Handel}, {Berlioz}

2: {Mozart}, {Bach}, {Paganini}, {Vivaldi}, {Handel}, {Berlioz}

3: {Handel}, {Vivaldi}, {Mozart}, {Ravel}, {Berlioz}

4: {Vivaldi}, {Mozart}, {Handel}, {Bach}, {Berlioz}

5: {Mozart}, {Bach}, {Vivaldi}, {Handel}

6: {Vivaldi}, {Handel}, {Mozart}, {Bach}

This pattern is misleading because, although it appears in 50 percent of the sequences, there are also two sequences where {Vivaldi}, {Handel} are not followed by {Berlioz}. Anyone making decisions on the basis of this pattern could be led to wrong decisions. A solution to this problem would be to add a measure of the confidence or probability that a pattern will be followed. But adding this information to sequential patterns is not straightforward because they can contain multiple items and sequential pattern mining algorithms have not been designed for that. An alternative that considers the confidence of a sequential pattern is sequential rule (SR) mining. A sequential rule indicates that if some event(s) occur, some other event(s) are likely to follow with a given confidence or probability. Sequential rule mining has been applied in several domains such as drought management, stock market analysis, weather observation, reverse engineering, learning and e-commerce. Algorithms for sequential rule mining are designed to discover rules appearing in a single sequence, across sequences, or common to multiple sequences.
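The support figures quoted above can be checked with a short sketch. It assumes that each sequence is represented as a list of itemsets and that a pattern occurs in a sequence when its itemsets appear in order, with gaps allowed; the names DB, contains and support are ours and are only illustrative.

```python
# Sequence database of the running example: one list of itemsets per sequence.
DB = [
    [{"Vivaldi"}, {"Mozart"}, {"Handel"}, {"Berlioz"}],
    [{"Mozart"}, {"Bach"}, {"Paganini"}, {"Vivaldi"}, {"Handel"}, {"Berlioz"}],
    [{"Handel"}, {"Vivaldi"}, {"Mozart"}, {"Ravel"}, {"Berlioz"}],
    [{"Vivaldi"}, {"Mozart"}, {"Handel"}, {"Bach"}, {"Berlioz"}],
    [{"Mozart"}, {"Bach"}, {"Vivaldi"}, {"Handel"}],
    [{"Vivaldi"}, {"Handel"}, {"Mozart"}, {"Bach"}],
]


def contains(sequence, pattern):
    """True if the itemsets of `pattern` occur in `sequence` in order (gaps allowed)."""
    pos = 0
    for itemset in sequence:
        # Greedy left-to-right matching is sufficient for subsequence containment.
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def support(db, pattern):
    """Fraction of sequences of `db` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in db) / len(db)


sup_pattern = support(DB, [{"Vivaldi"}, {"Handel"}, {"Berlioz"}])
sup_prefix = support(DB, [{"Vivaldi"}, {"Handel"}])

# Sequences where {Vivaldi}, {Handel} occurs but is not followed by {Berlioz}.
not_followed = round((sup_prefix - sup_pattern) * len(DB))

print(sup_pattern, sup_prefix, not_followed)
```

The gap between the support of the prefix and the support of the full pattern is exactly the information that a support-only measure hides.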

In this article, we are interested in the task of mining sequential rules common to multiple sequences, which is analogous to sequential pattern mining and is also applied to sequence databases. It consists of finding rules of the form X → Y in a


{Berlioz}. It means that customer(s) who bought the music of Vivaldi, Mozart and Handel in that order have then bought the music of Berlioz. This rule has a support of 33 percent because it is found in two sequences out of six. Moreover, the rule is said to have a confidence of 100 percent because in each sequence where {Vivaldi}, {Mozart}, {Handel} appears in that order it is followed by {Berlioz}. Mining such rules can be useful to make recommendations, predictions or to analyze customers’ behavior. Besides, there are many other applications such as:

• Web traversal patterns. Sequential rules can be mined in sequences of web pages visited by users, to make recommendations.

• Educational data. For example, in e-learning, sequential rules have been mined to understand and predict the behavior of learners and to discover patterns common to several learners’ solutions.

• Bioinformatics. In this field, many different kinds of sequential data need to be analyzed. Sequential rule mining can be applied, for example, to discover sequential relationships between gene expressions of various patients using data from microarray experiments.

We note three important problems with the definition of a sequential rule as a relationship between two sequential patterns:

1) Rules may have many variations with different item ordering. Because sequential patterns specify a strict ordering between items, there might be several rules with the same items but a different ordering. For example, there are 23 variations of {Vivaldi}, {Mozart}, {Handel} → {Berlioz} with the same items ordered differently, such as the following rules denoted as

R1: {Vivaldi}, {Mozart}, {Handel} → {Berlioz},

R2: {Mozart}, {Vivaldi}, {Handel} → {Berlioz},

R3: {Handel}, {Vivaldi}, {Mozart} → {Berlioz},

R4: {Handel, Vivaldi}, {Mozart} → {Berlioz},

R5: {Handel}, {Vivaldi, Mozart} → {Berlioz},

R6: {Handel, Vivaldi, Mozart} → {Berlioz}.

But all these variations describe the same situation.

2) Rules and their variations may have important differences in how they are rated by the algorithms. For example, rules R1, R2 and R3 respectively have support/confidence of 33 percent/100 percent, 16 percent/50 percent and 16 percent/100 percent, and R4, R5 and R6 do not appear in the database. These differences in how variations of the same rule are rated can give the user a wrong impression of the sequential relationships contained in the database. In fact, if all the variations of the same rule were taken as a whole, their support and confidence could be much higher. For example, none of the previous rules has a support higher than 33 percent. But taken as a whole they appear in four sequences out of six (66 percent).

3) Rules are less likely to be useful. Because rules are very specific, each rule is less likely to match a new sequence to make predictions. For example, consider a new customer who buys {Vivaldi}, {Handel}, {Mozart} in that order. None of the previous rules would match that sequence to predict that the customer may buy {Berlioz} next. If partial matching were used, one problem would be choosing between rules R1, R2 and R3, because they are rated quite differently.
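The ratings discussed in problems 1) and 2) can be reproduced with a small sketch on the six-sequence database of this section. It assumes that a standard sequential rule X → Y holds in a sequence when the ordered pattern X occurs and the ordered pattern Y occurs after it, and that confidence is the number of sequences where the rule holds divided by the number of sequences containing X. The helper names match_end and rule_stats are ours.

```python
DB = [
    [{"Vivaldi"}, {"Mozart"}, {"Handel"}, {"Berlioz"}],
    [{"Mozart"}, {"Bach"}, {"Paganini"}, {"Vivaldi"}, {"Handel"}, {"Berlioz"}],
    [{"Handel"}, {"Vivaldi"}, {"Mozart"}, {"Ravel"}, {"Berlioz"}],
    [{"Vivaldi"}, {"Mozart"}, {"Handel"}, {"Bach"}, {"Berlioz"}],
    [{"Mozart"}, {"Bach"}, {"Vivaldi"}, {"Handel"}],
    [{"Vivaldi"}, {"Handel"}, {"Mozart"}, {"Bach"}],
]


def match_end(sequence, pattern, start=0):
    """Index just past the earliest in-order match of a non-empty `pattern`
    in sequence[start:], or None if there is no match."""
    pos = 0
    for i in range(start, len(sequence)):
        if pattern[pos] <= sequence[i]:
            pos += 1
            if pos == len(pattern):
                return i + 1
    return None


def rule_stats(db, antecedent, consequent):
    """Return (support, confidence, ids of sequences where the rule holds)."""
    with_antecedent, matched = 0, set()
    for seq_id, seq in enumerate(db, start=1):
        end = match_end(seq, antecedent)
        if end is None:
            continue
        with_antecedent += 1
        # The consequent must occur strictly after the antecedent.
        if match_end(seq, consequent, end) is not None:
            matched.add(seq_id)
    support = len(matched) / len(db)
    confidence = len(matched) / with_antecedent if with_antecedent else 0.0
    return support, confidence, matched


berlioz = [{"Berlioz"}]
rules = {
    "R1": [{"Vivaldi"}, {"Mozart"}, {"Handel"}],
    "R2": [{"Mozart"}, {"Vivaldi"}, {"Handel"}],
    "R3": [{"Handel"}, {"Vivaldi"}, {"Mozart"}],
    "R4": [{"Handel", "Vivaldi"}, {"Mozart"}],
}
stats = {name: rule_stats(DB, ante, berlioz) for name, ante in rules.items()}

# Sequences covered by at least one variation of the rule.
covered = set().union(*(matched for _, _, matched in stats.values()))

for name, (sup, conf, matched) in stats.items():
    print(name, round(sup, 2), round(conf, 2), sorted(matched))
print("covered by any variation:", sorted(covered))
```

Each variation is rated on its own sequences only, which is why no single rule reflects the four sequences that the variations cover together.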

4. Problem Statement

Sequential pattern mining discovers


RuleGrowth algorithm. The following drawbacks are identified from the existing systems.

• Rule reduction and compression operations are not integrated

• Sequence priority is not considered

• Prediction process is not performed

• Rule summarization is not supported

5. Sequential Rule Discovery Mechanism

Facing these issues, in this article, we explore the idea of mining “partially ordered sequential rules” (POSR), a more general form of sequential rules common to multiple sequences such that items in the antecedent and in the consequent of each rule are unordered. This definition has the benefit of summarizing several rules by a single rule. For example, the rule {Mozart, Vivaldi, Handel} → {Berlioz} replaces all the previous rules and has a support of 66 percent (it holds in four of the six sequences) and a confidence of 66 percent.
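The support and confidence of a partially ordered rule can be computed on the six-sequence database of Section 3 with a sketch like the following. It assumes that a rule X → Y holds in a sequence when every item of X appears, in any order, before every item of Y, and that confidence is measured against the sequences containing all items of X. The function names are ours, and the code only evaluates one given rule; it is not the RuleGrowth algorithm.

```python
DB = [
    [{"Vivaldi"}, {"Mozart"}, {"Handel"}, {"Berlioz"}],
    [{"Mozart"}, {"Bach"}, {"Paganini"}, {"Vivaldi"}, {"Handel"}, {"Berlioz"}],
    [{"Handel"}, {"Vivaldi"}, {"Mozart"}, {"Ravel"}, {"Berlioz"}],
    [{"Vivaldi"}, {"Mozart"}, {"Handel"}, {"Bach"}, {"Berlioz"}],
    [{"Mozart"}, {"Bach"}, {"Vivaldi"}, {"Handel"}],
    [{"Vivaldi"}, {"Handel"}, {"Mozart"}, {"Bach"}],
]


def items_of(itemsets):
    """Union of all items in a list of itemsets."""
    items = set()
    for itemset in itemsets:
        items |= itemset
    return items


def posr_occurs(sequence, antecedent, consequent):
    """True if all antecedent items appear (in any order) before all consequent items."""
    seen = set()
    for k, itemset in enumerate(sequence):
        seen |= itemset
        if antecedent <= seen:
            # Earliest point where the antecedent is complete leaves the
            # longest possible suffix in which to find the consequent.
            return consequent <= items_of(sequence[k + 1:])
    return False


def posr_stats(db, antecedent, consequent):
    """Return (support, confidence) of the partially ordered rule."""
    with_antecedent = sum(antecedent <= items_of(seq) for seq in db)
    with_rule = sum(posr_occurs(seq, antecedent, consequent) for seq in db)
    support = with_rule / len(db)
    confidence = with_rule / with_antecedent if with_antecedent else 0.0
    return support, confidence


sup, conf = posr_stats(DB, {"Mozart", "Vivaldi", "Handel"}, {"Berlioz"})
print(round(sup, 2), round(conf, 2))
```

Because the antecedent is a set rather than an ordered pattern, the new customer sequence {Vivaldi}, {Handel}, {Mozart} from problem 3) also matches this single rule.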

To discover POSR, we propose an efficient algorithm named RuleGrowth. It uses a novel approach named “rule expansions” to generate sequential rules and includes several strategies to perform the search efficiently. RuleGrowth is easily extendable. Constraints can be added to the algorithm for the needs of specific applications. For example, we present an extension named TRuleGrowth that finds rules occurring with a sliding-window constraint. This constraint is important for many real applications because users often only wish to discover patterns occurring within a maximum amount of time.
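The effect of a sliding-window constraint can be illustrated with a brute-force sketch on the six-sequence database of Section 3. It assumes that the window is measured in consecutive itemsets and that a rule satisfies the constraint in a sequence when it holds inside at least one window of that length. This is only an illustration of the constraint; TRuleGrowth itself integrates the check into the search, and the names used here are ours.

```python
DB = [
    [{"Vivaldi"}, {"Mozart"}, {"Handel"}, {"Berlioz"}],
    [{"Mozart"}, {"Bach"}, {"Paganini"}, {"Vivaldi"}, {"Handel"}, {"Berlioz"}],
    [{"Handel"}, {"Vivaldi"}, {"Mozart"}, {"Ravel"}, {"Berlioz"}],
    [{"Vivaldi"}, {"Mozart"}, {"Handel"}, {"Bach"}, {"Berlioz"}],
    [{"Mozart"}, {"Bach"}, {"Vivaldi"}, {"Handel"}],
    [{"Vivaldi"}, {"Handel"}, {"Mozart"}, {"Bach"}],
]


def posr_occurs(sequence, antecedent, consequent):
    """True if all antecedent items appear (in any order) before all consequent items."""
    seen = set()
    for k, itemset in enumerate(sequence):
        seen |= itemset
        if antecedent <= seen:
            remaining = set().union(*sequence[k + 1:])
            return consequent <= remaining
    return False


def occurs_in_window(sequence, antecedent, consequent, window_size):
    """True if the rule holds inside some window of `window_size` consecutive itemsets."""
    return any(
        posr_occurs(sequence[start:start + window_size], antecedent, consequent)
        for start in range(len(sequence))
    )


X, Y = {"Mozart", "Vivaldi", "Handel"}, {"Berlioz"}
# Number of sequences in which the rule holds, for several window sizes.
counts = {
    w: sum(occurs_in_window(seq, X, Y, w) for seq in DB)
    for w in (4, 5, 6)
}
print(counts)
```

Smaller windows admit fewer occurrences, which is consistent with the constraint reducing the number of rules that reach a given support threshold.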

To evaluate the proposed algorithms, we conduct an extensive performance study comparing their performance with two baseline algorithms on four real-life datasets representing different types of data. Results show that RuleGrowth outperforms the baseline algorithms in all situations in terms of execution time and memory consumption. Results also show that the execution time and the number of rules discovered can be reduced by several orders of magnitude when the sliding-window constraint is used. Experiments also show that the proposed algorithms have excellent scalability. Furthermore, we report a real application where results have shown that the prediction accuracy obtained using POSR can be much higher than using sequential rules.

The sequential pattern mining process is enhanced with rule reduction and compression techniques. Sequence prediction operations are integrated with the sequential pattern mining process. Representative subset patterns are identified with minimal antecedent and maximal consequent. A rule summarization process is performed to support decision making. User web browsing behavior is identified from user access logs. The system is designed to perform sequential pattern extraction and webpage prediction.

Sequential pattern identification is performed with a sliding-window mechanism. The system is divided into three major modules: Log Optimizer, Session Analysis and Rule Mining Process. The Log Optimizer module performs access log preprocessing operations. The Session Analysis module constructs the sequential databases. The Rule Mining module performs pattern discovery and prediction.

5.1. Log Optimizer

The web page requests are maintained in the access log files. Session ID, IP address, page URL and request time details are recorded in the log files. The data population process transfers the log details into the database. Redundant page requests are removed from the log files.
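A minimal sketch of the cleaning step is given below. The record layout (session ID, IP address, page URL, request time) follows the fields listed above, but the Request type, the sample values and the redundancy criterion are assumptions made for illustration: here a request is treated as redundant when it repeats the URL of the previous request in the same session, as happens with a page reload.

```python
from typing import NamedTuple


class Request(NamedTuple):
    session_id: str
    ip: str
    url: str
    timestamp: int  # hypothetical unit: seconds since the start of the log


def remove_redundant(requests):
    """Drop requests that repeat the previous URL of the same session."""
    last_url = {}
    kept = []
    for req in sorted(requests, key=lambda r: (r.session_id, r.timestamp)):
        if last_url.get(req.session_id) != req.url:
            kept.append(req)
        last_url[req.session_id] = req.url
    return kept


# Hypothetical log entries, not taken from a real access log.
log = [
    Request("s1", "10.0.0.1", "/home", 0),
    Request("s1", "10.0.0.1", "/home", 2),       # reload of the same page
    Request("s1", "10.0.0.1", "/products", 10),
    Request("s2", "10.0.0.2", "/home", 5),
    Request("s1", "10.0.0.1", "/home", 20),      # later return visit is kept
]
cleaned = remove_redundant(log)
print([(r.session_id, r.url) for r in cleaned])
```

Other criteria, such as filtering requests for images or style sheets, could be applied in the same pass; the source does not specify which ones the module uses.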

5.2. Session Analysis

The session analysis is carried out to construct the sequential database. The sequential database is built from the access log details. Page requests are arranged in the order in which they were requested. The page requests are grouped into sessions.
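The construction of the sequential database can be sketched as follows, assuming that each cleaned log entry is a (session ID, page URL, request time) tuple and that every page request becomes a single-item itemset. The function name and the sample entries are hypothetical.

```python
from collections import defaultdict


def build_sequence_db(requests):
    """Group requests by session and order them by request time.

    Returns a dict mapping each session ID to a sequence of itemsets.
    """
    sessions = defaultdict(list)
    for session_id, url, _timestamp in sorted(requests, key=lambda r: (r[0], r[2])):
        sessions[session_id].append({url})
    return dict(sessions)


# Hypothetical cleaned log entries: (session ID, page URL, request time).
entries = [
    ("s2", "/home", 5),
    ("s1", "/products", 10),
    ("s1", "/home", 0),
    ("s2", "/cart", 9),
    ("s1", "/cart", 15),
]
sequence_db = build_sequence_db(entries)
print(sequence_db)
```

The resulting sequences have the same shape as the sequence database used in Section 3, so the same support and confidence computations apply to them.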

5.3. Rule Mining Process


from the item set collections. Support and confidence values are used in the prediction process.
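One way support and confidence could drive the prediction step is sketched below. The rule list, its values and the ranking policy (highest confidence first, ties broken by support) are assumptions made for illustration and are not taken from the system described here.

```python
# Hypothetical mined rules: (antecedent pages, consequent pages, support, confidence).
RULES = [
    ({"/home", "/products"}, {"/cart"}, 0.30, 0.80),
    ({"/home"}, {"/products"}, 0.50, 0.60),
    ({"/products"}, {"/reviews"}, 0.20, 0.80),
    ({"/cart"}, {"/checkout"}, 0.25, 0.90),
]


def predict_next(visited, rules, top_n=1):
    """Rank the consequents of rules whose antecedent pages were all visited."""
    visited = set(visited)
    candidates = [
        (confidence, support, sorted(consequent))
        for antecedent, consequent, support, confidence in rules
        # Skip rules that do not match or that predict only pages already seen.
        if antecedent <= visited and not consequent <= visited
    ]
    candidates.sort(key=lambda c: (-c[0], -c[1]))
    return [pages for _, _, pages in candidates[:top_n]]


prediction = predict_next(["/home", "/products"], RULES, top_n=2)
print(prediction)
```

Because the antecedents are unordered sets, the order in which the user visited the pages does not affect which rules match.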

6. Conclusion

Sequential rule mining methods are employed to discover frequent patterns in sequential data. The RuleGrowth algorithm is applied to discover Partially Ordered Sequential Rules (POSR). The RuleGrowth algorithm is enhanced with rule reduction and compression techniques. Rule summarization and prediction operations are integrated with the system. The sequential rule mining model is constructed to discover user request sequences in web sites. Rule reduction and compression techniques are applied to limit the number of derived rules. Prediction methods are adapted to estimate the next action in the sequence. The system increases the accuracy of the sequential rule discovery process.

REFERENCES

[1] Kaio Rodrigues and Altigran da Silva, “Removing DUST Using Multiple Alignment of Sequences”, IEEE Transactions On Knowledge And Data Engineering, Vol. 27, No. 8, August 2015

[2] B. S. Alsulami, M. F. Abulkhair and F. E. Eassa, “Near Duplicate Document Detection Survey,” Int. J. Comput. Sci. Commun. Netw., vol. 2, no. 2, 2012.

[3] Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 3rd edn. Morgan Kaufmann, San Francisco, 2011.

[4] P. Fournier-Viger, T. Gueniche and V. S. Tseng, “Using Partially Ordered Sequential Rules To Generate More Accurate Sequence Prediction,” in Proc. 8th Int. Conf. Adv. Data Mining Appl., 2012.

[5] Fournier-Viger, V.S., Nkambou, R.: Mining Sequential Rules Common to Several Sequences with the Window Size Constraint. In: Kosseim, L., Inkpen, D. (eds.) Canadian AI 2012. LNCS, vol. 7310, Springer, Heidelberg, 2012.

[6] Fournier-Viger, P., Tseng, V.S.: Mining Top-K Sequential Rules. In: Tang, J., King, I., Chen, L., Wang, J. (eds.) ADMA 2011, Part II. LNCS, vol. 7121, pp. 180–194. Springer, Heidelberg, 2011.

[7] Fournier-Viger, P., Nkambou, R., Tseng, V.S.: RuleGrowth: Mining Sequential Rules Common to Several Sequences by Pattern-Growth. ACM Press, 2011.

[8] Pitman, Zanker, M.: An Empirical Study of Extracting Multidimensional Sequential Rules for Personalization and Recommendation in Online Commerce. In: 10th Intern. Conf. on Wirtschaftsinformatik, 2011.

[9] H. S. Koppula, K. P. Leela, A. Agarwal, K. P. Chitrapura, S. Garg and A. Sasturkar, “Learning url patterns for webpage deduplication,” in Proc. 3rd ACM Int. Conf. Web Search Data Mining, 2010.

[10] K. W. L. Rodrigues and A. S. de Silva, “Learning url normalization rules using multiple alignment of sequences,” in Proc. 20th Int. Symp. String Process. Inf. Retrieval, 2013.

[11] T. Lei, R. Cai, X. Fan and L. Zhang, “A pattern tree-based approach to learning url normalization rules,” in Proc. 19th Int. Conf. World Wide Web, 2010.