
International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 3, Issue 9, September 2013)


Synchronous Pattern Mining From Web Log Data Using Association and Correlation Rules

P. Kavitha¹, Dr. G. N. K. SureshBabu²

¹Research Scholar, Bharathiar University, Coimbatore, India. Asst. Professor, Sree Arumugham Teacher Training College-B.Ed, Tholudur, India.

²Assoc. Professor, GKM College of Engineering and Technology, Chennai, India.

Abstract-- This paper introduces the concepts of synchronous patterns, association rules, and correlation rules, and examines how they can be mined efficiently. Synchronous pattern mining is a rich topic, and this paper is dedicated to methods of synchronous itemset mining. Synchronous itemsets can be found in large amounts of web log data, whether transactional or relational, using association and correlation rules. Such rules can also be mined in multilevel and multidimensional space. Among the patterns considered, association and correlation rules are the most interesting. The techniques presented here may be extended to more advanced forms of synchronous pattern mining. Finding synchronous patterns plays an essential role in mining associations, correlations, and many other interesting relationships among web log data. It also helps in web log data classification, clustering, and other web log data mining tasks. Synchronous pattern mining has thus become an important web log data mining task and a focus of web log data mining research.

Keywords-- Association rules, clustering, correlation rules, data mining, itemset mining, multidimensional data, multilevel, web logs.

I. INTRODUCTION

This paper presents the basic concepts of synchronous pattern mining for the discovery of interesting associations and correlations between itemsets in transactional and relational web log databases. Synchronous pattern mining searches for recurring relationships in a given web log data set. The basic concepts of mining synchronous patterns and associations provide a road map to the different kinds of synchronous patterns, association rules, and correlation rules that can be mined.

II. MARKET BASKET ANALYSIS

Market basket analysis uses customer purchases to develop associations between the different items purchased. This allows a retailer to produce more effective:

• Product placement and catalog design (for example, computer games placed near computers). Note that product placement in a store, in a catalog, and online may all differ, because different types of people shop in different ways.

• Cross-marketing strategies: promoting products and services of other companies that complement your own (a seller of computers only might partner with a company that sells computer games).

• Customer shopping behavior analyses.

The following terms are used throughout:

• Itemset: a set of one or more items.

• k-itemset: an itemset of k items, X = {x1, …, xk}.

• (Absolute) support, or support count, of X: the frequency, or number of occurrences, of the itemset X.

• (Relative) support s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X).

• An itemset X is synchronous if the support of X is no less than a minimum support threshold, minsup.

III. ASSOCIATION RULE MINING

Given a web log data set, find the items in the web log data that are associated with each other. Association is measured as frequency of occurrence in the same context. The definition of context depends on the application. Association should not be confused with more meaningful relationships such as a causal relation. Find all the rules X -> Y with minimum support and confidence threshold.

Support s is the probability that a transaction contains X ∪ Y. Confidence c is the conditional probability that a transaction containing X also contains Y.

Example: let minsup = 50% and minconf = 50%, and consider the five transactions tabulated below.

Synchronous (frequent) patterns: {Apple}:3, {Nuts}:3, {Beans}:4, {Eggs}:3, {Apple, Beans}:3

Association rules (among many more), written as (support, confidence):

Apple -> Beans (60%, 100%)

Beans -> Apple (60%, 75%)


Tid    Items bought
10     Apple, Nuts, Beans
20     Apple, Coffee, Beans
30     Apple, Beans, Eggs
40     Nuts, Eggs, Milk
50     Nuts, Coffee, Beans, Eggs, Milk
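The support and confidence figures quoted for this table can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `support_count`, `support`, and `confidence` are illustrative choices rather than names used by the authors, and the transactions are copied from the table.

```python
# Transactions copied from the table (Tid -> set of items bought).
TRANSACTIONS = {
    10: {"Apple", "Nuts", "Beans"},
    20: {"Apple", "Coffee", "Beans"},
    30: {"Apple", "Beans", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Beans", "Eggs", "Milk"},
}


def support_count(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions.values() if items <= t)


def support(itemset, transactions=TRANSACTIONS):
    """Relative support: fraction of transactions containing `itemset`."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Confidence of antecedent -> consequent, i.e. P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support_count(both, transactions) / support_count(antecedent, transactions)


print(support({"Apple", "Beans"}))        # 0.6
print(confidence({"Apple"}, {"Beans"}))   # 1.0
print(confidence({"Beans"}, {"Apple"}))   # 0.75
```

Itemsets are passed as sets of item names; Coffee and Milk each occur in only two of the five transactions, so they fall below the 50% threshold.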

Synchronous itemsets, closed itemsets, and association rules

• Let I = {I1, I2, …, Im} be a set of items.

• Let D, the task-relevant data, be a set of database transactions, where each transaction T is a set of items such that T ⊆ I.

• Each transaction is associated with an identifier, called its TID.

• Let A be a set of items.

• A transaction T is said to contain A if and only if A ⊆ T.

The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., both A and B). This is taken to be the probability P(A ∪ B). The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability P(B | A). That is,

$\mathrm{support}(A \Rightarrow B) = P(A \cup B)$

$\mathrm{confidence}(A \Rightarrow B) = P(B \mid A)$

Combining the two definitions gives

$\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) = \dfrac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)} = \dfrac{\mathrm{support\_count}(A \cup B)}{\mathrm{support\_count}(A)}$

In general, association rule mining can be viewed as a two step process:

1. Find all synchronous itemsets: by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count, minsup.

2. Generate strong association rules from the synchronous itemsets: by definition, these rules must satisfy minimum support and minimum confidence.

A long pattern contains a combinatorial number of sub-patterns. For example, {a1, …, a100} contains

$\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$

sub-patterns. Solution: mine closed patterns and max patterns instead.
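The size of this count is easy to confirm numerically. The short sketch below sums the binomial coefficients directly and compares the result with the closed form; `count_subpatterns` is an illustrative helper name, not something defined in the paper.

```python
from math import comb


def count_subpatterns(n):
    """Number of nonempty sub-patterns of an n-item pattern, summed size by size."""
    return sum(comb(n, k) for k in range(1, n + 1))


total = count_subpatterns(100)
print(total == 2**100 - 1)   # True
print(f"{total:.3e}")        # 1.268e+30
```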

An itemset X is closed if X is synchronous and there exists no super-pattern Y ⊃ X with the same support as X.

An itemset X is a max pattern if X is synchronous and there exists no synchronous super-pattern Y ⊃ X.

Closed patterns are a lossless compression of the synchronous patterns.
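Both definitions can be checked on the synchronous itemsets of the market basket example (minsup = 50% of five transactions, i.e. a support count of at least 3). The sketch below is illustrative only; `closed_itemsets` and `max_itemsets` are assumed helper names, and the input is a dictionary from itemset to support count.

```python
# Synchronous itemsets and support counts from the market basket example.
SUPPORT = {
    frozenset({"Apple"}): 3,
    frozenset({"Nuts"}): 3,
    frozenset({"Beans"}): 4,
    frozenset({"Eggs"}): 3,
    frozenset({"Apple", "Beans"}): 3,
}


def closed_itemsets(support):
    """Itemsets with no proper superset that has the same support count."""
    return {
        x for x, sx in support.items()
        if not any(x < y and sx == sy for y, sy in support.items())
    }


def max_itemsets(support):
    """Itemsets with no proper superset among the synchronous itemsets."""
    return {x for x in support if not any(x < y for y in support)}


print(sorted(sorted(x) for x in closed_itemsets(SUPPORT)))
print(sorted(sorted(x) for x in max_itemsets(SUPPORT)))
```

Here {Apple} is not closed, because {Apple, Beans} has the same support count of 3, and {Beans} is closed but not a max pattern, because its superset {Apple, Beans} is still synchronous. Searching only among synchronous itemsets is sufficient for the closedness test, since any superset with the same support is itself synchronous.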

Mining closed patterns and max patterns therefore reduces the number of patterns and rules.

Scalable Synchronous Itemset Mining

This section describes methods for mining the simplest form of synchronous patterns, namely single-dimensional, single-level, Boolean synchronous itemsets, such as those discussed for market basket analysis. We begin by presenting the Apriori algorithm, the basic algorithm for finding synchronous itemsets.

Apriori Algorithm (finding synchronous itemsets using candidate generation)

Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining synchronous itemsets for Boolean association rules.

The join step: to find $L_k$, a set of candidate k-itemsets $C_k$ is generated by joining $L_{k-1}$ with itself. Assume the items in each itemset of $L_{k-1}$ are listed in a fixed order.


The prune step: Scan the web log data set D and compare the support count of each candidate in $C_k$ with the minimum support count. Remove the candidate itemsets whose support count is less than the minimum support count, resulting in $L_k$.

• Initially, scan the database once to get the synchronous 1-itemsets.

• Generate length-(k+1) candidate itemsets from the length-k synchronous itemsets.

• Prune the length-(k+1) candidate itemsets using the Apriori property: all nonempty subsets of a synchronous itemset must also be synchronous.

• Test the candidates against the database.

• Terminate when no synchronous or candidate set can be generated.

IV. ALGORITHM

input: D, a web log database of transactions; minsup, the minimum support count threshold.
output: L, the synchronous itemsets in D.

begin
    L1 = findFrequent1Itemsets(D)
    for (k = 2; Lk-1 ≠ ∅; k++) do
        Ck = apriori_gen(Lk-1)
        for each transaction t ∈ D do
            Ct = subset(Ck, t)    /* get the candidates contained in t */
            for each candidate c ∈ Ct do
                c.count++
            end
        end
        Lk = { c ∈ Ck | c.count ≥ minsup }
    end
    return L = ∪k Lk
end

Procedure apriori_gen(input: Lk-1, synchronous (k-1)-itemsets)
output: Ck, candidate k-itemsets

begin
    for each itemset l1 ∈ Lk-1 do
        for each itemset l2 ∈ Lk-1 do
            if (l1[1] = l2[1]) ∧ … ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then
                c = l1 ⋈ l2    /* join step: generate candidates */
                if has_insynchronous_subset(c, Lk-1) then
                    delete c    /* prune step */
                else
                    add c to Ck
                end
            end
        end
    end
    return Ck
end

Procedure has_insynchronous_subset(input: c, candidate k-itemset; Lk-1, synchronous (k-1)-itemsets)
output: TRUE or FALSE

begin
    for each (k-1)-subset s of c do
        if s ∉ Lk-1 then
            return TRUE
        end
    end
    return FALSE
end
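The pseudocode above can be turned into a short runnable Python sketch. It is not the authors' implementation: it assumes transactions are given as collections of sortable item names, takes the minimum support as an absolute count (`min_count`), and counts candidates with a plain subset test instead of a dedicated `subset` routine. The procedure names follow the pseudocode.

```python
from itertools import combinations


def has_insynchronous_subset(candidate, prev_level, k):
    """True if some (k-1)-subset of `candidate` is not in `prev_level`."""
    return any(frozenset(s) not in prev_level
               for s in combinations(candidate, k - 1))


def apriori_gen(prev_level, k):
    """Join and prune: build candidate k-itemsets from synchronous (k-1)-itemsets."""
    ordered = sorted(tuple(sorted(s)) for s in prev_level)
    candidates = set()
    for i, l1 in enumerate(ordered):
        for l2 in ordered[i + 1:]:
            # Join step: the first k-2 items must agree; because the tuples
            # are sorted, l1's last item is then smaller than l2's last item.
            if l1[:k - 2] == l2[:k - 2]:
                c = frozenset(l1) | frozenset(l2)
                # Prune step: drop c if any (k-1)-subset is not synchronous.
                if not has_insynchronous_subset(c, prev_level, k):
                    candidates.add(c)
    return candidates


def apriori(transactions, min_count):
    """Return {itemset: support count} for every synchronous itemset."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:  # first scan: count single items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(current)
    k = 2
    while current:
        candidates = apriori_gen(set(current), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:  # one scan of D per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_count}
        result.update(current)
        k += 1
    return result


# Transactions from the market basket example; minsup 50% of 5 means count >= 3.
D = [
    {"Apple", "Nuts", "Beans"},
    {"Apple", "Coffee", "Beans"},
    {"Apple", "Beans", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Beans", "Eggs", "Milk"},
]
for itemset, count in sorted(apriori(D, 3).items(), key=lambda p: sorted(p[0])):
    print(sorted(itemset), count)
```

On these five transactions the sketch finds the same five synchronous itemsets listed in the association rule example: {Apple}, {Nuts}, {Beans}, {Eggs}, and {Apple, Beans}.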

Generating association rules from synchronous itemsets

Once the synchronous itemsets from the transactions in a web log database D have been found, it is straightforward to generate strong association rules, that is, rules that satisfy both:

• minimum support

• minimum confidence

The relation between support and confidence is

$\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) = \dfrac{\mathrm{support\_count}(A \cup B)}{\mathrm{support\_count}(A)}$

where support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support_count(A) is the number of transactions containing the itemset A.

Association rules can be generated as follows:

• For each synchronous itemset L, generate all nonempty proper subsets of L.

• For every nonempty proper subset S of L, output the rule “S ⇒ (L − S)” if support_count(L) / support_count(S) ≥ min_conf, where min_conf is the minimum confidence threshold.
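As a rough illustration of this rule generation step, the following Python sketch enumerates the nonempty proper subsets of each synchronous itemset and keeps the rules that reach the confidence threshold. The function name `generate_rules` and the dictionary input format are assumptions made for the example; the support counts are those of the market basket example.

```python
from itertools import combinations


def generate_rules(support_counts, min_conf):
    """Return (antecedent, consequent, confidence) for each rule S => (L - S)
    whose confidence is at least `min_conf`.

    `support_counts` maps every synchronous itemset (a frozenset) to its
    support count; by the Apriori property it also contains all subsets.
    """
    rules = []
    for itemset, count in support_counts.items():
        # size ranges over 1 .. |L| - 1, so only proper subsets are used
        for size in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), size):
                s = frozenset(subset)
                conf = count / support_counts[s]
                if conf >= min_conf:
                    rules.append((s, itemset - s, conf))
    return rules


SUPPORT_COUNTS = {
    frozenset({"Apple"}): 3,
    frozenset({"Nuts"}): 3,
    frozenset({"Beans"}): 4,
    frozenset({"Eggs"}): 3,
    frozenset({"Apple", "Beans"}): 3,
}

for lhs, rhs, conf in generate_rules(SUPPORT_COUNTS, min_conf=0.5):
    print(sorted(lhs), "=>", sorted(rhs), f"confidence={conf:.0%}")
```

With min_conf = 50% this yields the two rules from the example, Apple ⇒ Beans (100%) and Beans ⇒ Apple (75%).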

V. MINING A VARIETY OF ASSOCIATION RULES

To address additional application requirements, the mining of synchronous itemsets and association rules is extended here to include multilevel association rules, multidimensional association rules, and quantitative association rules in transactional and/or relational web log databases and web log data warehouses.

Mining Multilevel Association Rules

Multilevel association rules involve concepts at different levels of abstraction. Strong associations discovered at high levels of abstraction may represent commonsense knowledge. Moreover, what represents common sense to one user may seem novel to another. Therefore, web log data mining systems should provide capabilities for mining association rules at multiple levels of abstraction.

Mining Multilevel Association Rules (1)

• Web log data mining systems should provide capabilities for mining association rules at multiple levels of abstraction.

• Exploration of shared multi-level mining (Agrawal & Srikant, VLDB'95; Han & Fu, VLDB'95).

Mining Multilevel Association Rules (2)

• For each level, any algorithm for discovering synchronous itemsets may be used, such as Apriori or its variations:

– Using a standardized minimum support for all levels (referred to as standardized support)

– Using a compact minimum support at lower levels (referred to as compact support)
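The difference between the two threshold policies can be sketched in Python. Everything in this example is hypothetical: the two-level concept hierarchy `PARENT`, the five transactions, and the thresholds are invented for illustration, and for brevity only single items (1-itemsets) are mined at each level rather than full itemsets.

```python
# Hypothetical concept hierarchy: level-2 item -> level-1 ancestor.
PARENT = {
    "IBM laptop computer": "laptop computer",
    "Dell laptop computer": "laptop computer",
    "HP printer": "printer",
    "Canon printer": "printer",
}

# Hypothetical transactions recorded at the lowest (most specific) level.
TRANSACTIONS = [
    {"IBM laptop computer", "HP printer"},
    {"Dell laptop computer", "HP printer"},
    {"IBM laptop computer", "Canon printer"},
    {"Dell laptop computer"},
    {"HP printer"},
]


def synchronous_items(transactions, minsup):
    """Items whose relative support is at least `minsup`, with their support."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    return {item: c / n for item, c in counts.items() if c / n >= minsup}


def mine_levels(transactions, minsup_by_level):
    """Mine level 1 (ancestors) and level 2 (specific items) separately."""
    level1 = [{PARENT[item] for item in t} for t in transactions]
    return {
        1: synchronous_items(level1, minsup_by_level[1]),
        2: synchronous_items(transactions, minsup_by_level[2]),
    }


# Same threshold at every level versus a lower threshold at the lower level.
print(mine_levels(TRANSACTIONS, {1: 0.5, 2: 0.5}))
print(mine_levels(TRANSACTIONS, {1: 0.5, 2: 0.3}))
```

With a single 50% threshold, only "HP printer" survives at the lower level; lowering the level-2 threshold to 30% also keeps the two specific laptop items, which is the motivation for using a smaller minimum support at lower levels.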


Mining Multilevel Association Rules (3)

Side effect: several redundant rules may be generated across multiple levels of abstraction because of the ancestor relationships among items. For example:

buys(X, “laptop computer”) ⇒ buys(X, “HP printer”) [support = 8%, confidence = 70%]

buys(X, “IBM laptop computer”) ⇒ buys(X, “HP printer”) [support = 2%, confidence = 72%]

Multidimensional association rules

Multidimensional association rules involve more than one dimension or predicate, for example rules relating what a customer buys to the customer's age.

• Single-dimensional rules: buys(X, “milk”) ⇒ buys(X, “bread”)

• Multidimensional rules: two or more dimensions or predicates

– Inter-dimension association rules (no repeated predicates): age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)

– Hybrid-dimension association rules (repeated predicates): age(X, “19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)

Quantitative association rules

Quantitative association rules involve numeric attributes that have an implicit ordering among values, for example age.

• ARCS (Association Rule Clustering System), proposed by Lent, Swami, and Widom (ICDE'97): cluster adjacent rules to form general association rules using a 2-D grid, for example

age(X, “34-35”) ∧ income(X, “31-50K”) ⇒ buys(X, “high resolution TV”)

Constraint based association mining

A good heuristic is to have users specify their intuition or expectations as constraints that confine the search space. This approach is known as constraint-based mining. The constraints can fall into the following categories:

Knowledge type constraints:

These specify the type of knowledge to be mined, such as association or correlation.

Web log data constraints:

These specify the set of task relevant web log data.

Dimension/level constraints:

These specify the desired dimensions of the web log data, or levels of the concept hierarchies, to be used in mining.

Interestingness constraints:

These specify thresholds on statistical measures of rule interestingness, such as support, confidence, and correlation.

Rule constraints:

These specify the form of rules to be mined. Such constraints may be expressed as metarules, as the maximum or minimum number of predicates that can occur in the rule antecedent or consequent, or as relationships among attributes, element values, and/or aggregates.
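Interestingness constraints and rule constraints can be pictured as filters applied to a set of rules. The sketch below is one possible illustration, not a method from the paper: the `Rule` type, the `apply_constraints` function, and its parameters are hypothetical, and the two sample rules are the Apple and Beans rules from the market basket example. A real constraint-based miner would push such constraints into the search itself rather than filter afterwards.

```python
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]
    support: float
    confidence: float


def apply_constraints(rules, min_support=0.0, min_confidence=0.0,
                      max_antecedent_size=None, required_consequent=None):
    """Keep only the rules that satisfy every supplied constraint."""
    kept = []
    for r in rules:
        # Interestingness constraints: thresholds on support and confidence.
        if r.support < min_support or r.confidence < min_confidence:
            continue
        # Rule constraint: maximum number of items in the antecedent.
        if max_antecedent_size is not None and len(r.antecedent) > max_antecedent_size:
            continue
        # Rule constraint: an item that must appear in the consequent.
        if required_consequent is not None and required_consequent not in r.consequent:
            continue
        kept.append(r)
    return kept


RULES = [
    Rule(frozenset({"Apple"}), frozenset({"Beans"}), 0.60, 1.00),
    Rule(frozenset({"Beans"}), frozenset({"Apple"}), 0.60, 0.75),
]

print(apply_constraints(RULES, min_confidence=0.8))
print(apply_constraints(RULES, required_consequent="Apple"))
```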

VI. CONCLUSION

This paper has discussed the discovery of synchronous patterns, associations, and correlation relationships among huge amounts of web log data, which is useful in selective marketing, result analysis, and business organization. A popular area of application is market basket analysis, which studies the buying habits of consumers by searching for sets of items that are frequently purchased together. Association rule mining consists of first finding synchronous itemsets, from which strong association rules of the form A ⇒ B are generated.

REFERENCES

[1] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, Second Edition. Morgan Kaufmann Publishers.

[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.

[3] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.

[4] J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00.

[5] J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic Projection. KDD'02.

[6] J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without Minimum Support. ICDM'02.

[7] J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets. KDD'03.

[8] G. Liu, H. Lu, W. Lou, and J. X. Yu. On Computing, Storing and Querying Frequent Patterns. KDD'03.

[9] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. DAMI'97.

[10] M. J. Zaki and C. J. Hsiao. CHARM: An Efficient Algorithm for Closed Itemset Mining. SDM'02.


About The Author

P. Kavitha is a Research Scholar at Bharathiar University, Coimbatore, India. She graduated from the Faculty of Education, Sree Arumugham Teacher Training College-B.Ed, Tholudur, Tamil Nadu, India, and is currently working as an Assistant Professor in the Department of Education at the same college. Her fields of interest are computer graphics, web log data mining, and computer networks.
