High Utility Itemset Mining


International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 8, August 2012)


Sudip Bhattacharya¹, Deepty Dubey²

¹ Research Scholar, Dept. of Computer Science & Engg., Chhatrapati Shivaji Institute of Technology, Durg, India

² Asst. Professor, Dept. of Computer Science & Engg., Chhatrapati Shivaji Institute of Technology, Durg, India

Abstract: Data mining can be defined as an activity that extracts new, nontrivial information contained in large databases. Traditional data mining techniques have focused largely on detecting statistical correlations between the items that occur most frequently in transaction databases. Also termed frequent itemset mining, these techniques rest on the rationale that itemsets which appear more frequently must be of more importance to the user from a business perspective. In this paper we discuss an emerging area called utility mining, which considers not only the frequency of the itemsets but also the utility associated with them. The term utility refers to the importance or usefulness of the appearance of an itemset in transactions, quantified in terms such as profit, sales or any other user preference. In high utility itemset mining the objective is to identify itemsets whose utility values are at or above a given utility threshold. We present a literature review of the present state of research and of the various algorithms for high utility itemset mining.

Keywords: Utility mining, High utility itemsets, Constraint based itemset mining, Frequent itemset mining.

I. INTRODUCTION

A. Data Mining

Data mining is concerned with the analysis of large volumes of data to automatically discover interesting regularities or relationships, which in turn leads to a better understanding of the underlying processes [16]. The primary goal is to discover hidden patterns and unexpected trends in the data. Data mining uses a combination of techniques from database technology, statistics, artificial intelligence and machine learning.

The term is frequently misused to mean any form of large-scale data or information processing. The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns.

Over the last two decades data mining has emerged as a significant research area. This is primarily due to the interdisciplinary nature of the subject and the diverse range of application domains in which data mining based products and techniques are being employed. These include bioinformatics, genetics, medicine, clinical research, education, retail and marketing research.

Data mining has been used extensively in the analysis of customer transactions in retail research, where it is termed market basket analysis. Market basket analysis has also been used to identify the purchase patterns of alpha consumers, the people who play a key role in connecting with the concept behind the inception and design of a product.

B. Frequent Itemset Mining

An itemset can be defined as a non-empty set of items. An itemset with k different items is termed a k-itemset. For example, {bread, butter, milk} may denote a 3-itemset in a supermarket transaction. The notion of frequent itemsets was introduced by Agrawal et al. [1]. Frequent itemsets are the itemsets that appear frequently in the transactions. The goal of frequent itemset mining is to identify all the frequent itemsets in a transaction dataset [21]. Frequent itemset mining plays an essential role in the theory and practice of many important data mining tasks, such as mining association rules [1, 2, 17], long patterns [5], emerging patterns [10] and dependency rules [26]. It has been applied in the fields of telecommunications [3], census analysis [5] and text analysis [26].

The criterion for being frequent is expressed in terms of the support value of an itemset, which is the percentage of transactions that contain the itemset. An itemset is frequent when its support is at least a user-specified minimum support threshold.

EXAMPLE 1. Consider a small transaction database representing sales data (Table I), together with the profit associated with the sale of each unit of each item (Table II).


TABLE I
TRANSACTION DATABASE

                  Quantity of Item Sold in Transaction
Transaction ID    Item A    Item B    Item C
T1                2         0         1
T2                4         0         2
T3                4         1         0
T4                0         1         1
T5                5         1         2
T6                10        1         5
T7                4         0         2
T8                1         0         0
T9                3         0         0
T10               5         0         0

Table II represents the unit profit associated with the sale of individual items.

TABLE II
UNIT PROFIT ASSOCIATED WITH ITEMS

Item Name    Unit Profit (in INR)
Item A       5
Item B       100
Item C       40

Now consider the itemset AB. Only 3 of the 10 transactions (T3, T5 and T6) contain this itemset, so its support is

Support (AB) = 3 / 10 * 100 = 30 %
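The support calculation can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original paper: the dictionary layout of Table I, the name `support` and the convention of writing an itemset as a string of single-letter item names are all assumptions made for the example.

```python
# Table I as a mapping: transaction ID -> quantity of each item sold.
# Items with a quantity of 0 are simply left out (an assumed encoding).
TRANSACTIONS = {
    "T1": {"A": 2, "C": 1},
    "T2": {"A": 4, "C": 2},
    "T3": {"A": 4, "B": 1},
    "T4": {"B": 1, "C": 1},
    "T5": {"A": 5, "B": 1, "C": 2},
    "T6": {"A": 10, "B": 1, "C": 5},
    "T7": {"A": 4, "C": 2},
    "T8": {"A": 1},
    "T9": {"A": 3},
    "T10": {"A": 5},
}


def support(itemset, transactions):
    """Return the percentage of transactions containing every item of `itemset`."""
    items = set(itemset)
    containing = sum(
        1 for quantities in transactions.values() if items.issubset(quantities)
    )
    return 100.0 * containing / len(transactions)


print(support("AB", TRANSACTIONS))  # 30.0
```

Note that support ignores the quantities altogether: a transaction counts once whether it holds one unit of an item or ten.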

T3 contains 4 units of item A and 1 unit of item B, so the profit earned by the sale of the itemset AB in transaction T3 is

profit(AB, T3) = 4 * profit(A) + 1 * profit(B) = 4*5 + 1*100 = 120

Since AB appears in transactions T3, T5 and T6, the total profit associated with itemset AB over the complete set of 10 transactions is

profit(AB) = profit(AB, T3) + profit(AB, T5) + profit(AB, T6)
           = (4*5 + 1*100) + (5*5 + 1*100) + (10*5 + 1*100) = 395
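The calculation for AB generalizes to any itemset. In the sketch below the symbols are assumptions introduced for illustration and do not appear in the original: D is the set of transactions, q(i, T) is the quantity of item i in transaction T (Table I), p(i) is the unit profit of item i (Table II), and S ⊆ T means that every item of S has a non-zero quantity in T.

```latex
\begin{align}
\mathrm{profit}(S, T) &= \sum_{i \in S} q(i, T)\, p(i), \qquad S \subseteq T, \\
\mathrm{profit}(S)    &= \sum_{T \in D,\; S \subseteq T} \mathrm{profit}(S, T).
\end{align}
```

For S = AB the inner sum gives 120, 125 and 150 for T3, T5 and T6 respectively, and the outer sum gives 395, in agreement with the worked example.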

Similarly, we can calculate the support of every itemset and the profit obtained from the sale of that itemset across all ten transactions, as shown in Table III.

TABLE III
SUPPORT AND PROFIT FOR ALL ITEMSETS

Itemset    Support (%)    Profit (INR)
A          90             190
B          40             400
C          60             520
AB         30             395
AC         50             605
BC         30             620
ABC        20             555
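Table III can be reproduced mechanically from Tables I and II. The following is a minimal sketch under assumed names (`support`, `profit` and `table_iii` are illustrative and not taken from the paper); it enumerates every non-empty combination of the three items and applies the two calculations shown above.

```python
from itertools import combinations

# Table II: unit profit per item (INR).
UNIT_PROFIT = {"A": 5, "B": 100, "C": 40}

# Table I: one dict per transaction, item -> quantity (zero quantities omitted).
TRANSACTIONS = [
    {"A": 2, "C": 1}, {"A": 4, "C": 2}, {"A": 4, "B": 1}, {"B": 1, "C": 1},
    {"A": 5, "B": 1, "C": 2}, {"A": 10, "B": 1, "C": 5}, {"A": 4, "C": 2},
    {"A": 1}, {"A": 3}, {"A": 5},
]


def support(itemset, transactions):
    """Percentage of transactions that contain every item of `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items.issubset(t))
    return 100.0 * hits / len(transactions)


def profit(itemset, transactions, unit_profit):
    """Total profit of `itemset` over the transactions that contain all of it."""
    items = set(itemset)
    total = 0
    for t in transactions:
        if items.issubset(t):
            total += sum(t[i] * unit_profit[i] for i in items)
    return total


def table_iii(transactions, unit_profit):
    """Map each non-empty itemset name to its (support %, profit) pair."""
    items = sorted(unit_profit)
    rows = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            rows["".join(combo)] = (
                support(combo, transactions),
                profit(combo, transactions, unit_profit),
            )
    return rows


for name, (sup, prof) in table_iii(TRANSACTIONS, UNIT_PROFIT).items():
    print(f"{name:<4} {sup:>5.0f} {prof:>5}")
```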

If we take the minimum support to be 40%, there are 4 itemsets, A, B, C and AC, which qualify as frequent itemsets because their support is at least the minimum support threshold. But if we consider the associated profit, we find that of the 4 most profitable itemsets, i.e. C, AC, BC and ABC, only two (C and AC) are also frequent. Itemsets BC and ABC are not frequent, yet they fetch more profit than frequent itemsets such as A or B. This is because of the difference in the unit profits of the items: one unit of item B fetches much more profit than one unit of item A or item C.
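The comparison between the frequent itemsets and the most profitable ones can be checked directly against the figures in Table III. The sketch below hard-codes that table; the function names are illustrative assumptions.

```python
# Table III: itemset -> (support in %, total profit in INR).
TABLE_III = {
    "A": (90, 190), "B": (40, 400), "C": (60, 520),
    "AB": (30, 395), "AC": (50, 605), "BC": (30, 620), "ABC": (20, 555),
}


def frequent_itemsets(table, min_support):
    """Itemsets whose support is at least `min_support` percent."""
    return {s for s, (sup, _) in table.items() if sup >= min_support}


def most_profitable(table, n):
    """The `n` itemsets with the highest total profit, best first."""
    return sorted(table, key=lambda s: table[s][1], reverse=True)[:n]


frequent = frequent_itemsets(TABLE_III, min_support=40)
top4 = most_profitable(TABLE_III, n=4)

print(sorted(frequent))                         # ['A', 'AC', 'B', 'C']
print(top4)                                     # ['BC', 'AC', 'ABC', 'C']
print([s for s in top4 if s not in frequent])   # ['BC', 'ABC']
```

The last line lists the itemsets that a purely frequency-based miner would miss despite their high profit.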


In reality a retail business may be interested in identifying its most valuable customers (customers who contribute a major fraction of the profits to the business). These are the customers who may buy full priced items or high margin items, which may be absent from a large number of transactions because most customers do not buy these items frequently [34].

The practical usefulness of frequent itemset mining is limited by the significance of the discovered itemsets [32]. While the mining literature has been exclusively focused on frequent itemsets, in many practical situations rare ones are of higher interest [24]. For example, in medical databases rare combinations of symptoms might provide useful insights for physicians about the cause of a disease. So during the mining process we should not be biased towards identifying either frequent or rare itemsets; our aim should be to identify the itemsets that are most useful to us. In other words, our aim should be to identify itemsets which have comparatively higher utilities in the database, no matter whether these itemsets are frequent, rare or neither. This leads to a new approach in data mining, based on the concept of itemset utility, called utility mining.

C. Utility Mining

The limitations of frequent or rare itemset mining motivated researchers to conceive a utility based mining approach, which allows a user to conveniently express his or her perspective on the usefulness of itemsets as utility values and then find itemsets with utility values higher than a threshold [32]. In utility based mining the term utility refers to the quantitative representation of user preference, i.e. the utility value of an itemset is a measure of the importance of that itemset from the user's perspective. For example, if a sales analyst involved in retail research needs to find out which itemsets earn the maximum sales revenue for the stores, he or she will define the utility of an itemset as the monetary profit that the store earns by selling each unit of that itemset.

Note that the sales analyst is not interested in the number of transactions that contain the itemset; he or she is concerned only with the revenue generated collectively by all the transactions containing the itemset. In practice the utility value of an itemset can be profit, popularity, page-rank, a measure of some aesthetic aspect such as beauty or design, or some other measure of the user's preference.

Formally, an itemset S is useful to a user if it satisfies a utility constraint, i.e. any constraint of the form u(S) >= minutil, where u(S) is the utility value of the itemset and minutil is a utility threshold defined by the user [32]. In our example, if we take the utility of an itemset to be the total profit associated with the sale of that itemset (Table III) and set the utility threshold minutil = 500, then the itemset ABC, with a utility value of 555, is of interest to the user even though its support is just 20%. When computing the total utility of an itemset S, we multiply the utility value of each item in S by the quantity of that item in each transaction that contains S and sum the results. The utility based mining approach can therefore be said to measure the significance of an itemset along two dimensions: the first is the support of the itemset, i.e. its frequency, and the second is the semantic significance of the itemset as measured by the user.
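As an illustration of the utility constraint, the sketch below mines the example database by brute force: it computes u(S) for every non-empty itemset and keeps those with u(S) >= minutil. This is a naive baseline written for this example, not an algorithm from the surveyed literature, and the names `utility` and `high_utility_itemsets` are assumptions. With n items it examines 2^n - 1 candidates, which is the cost that the pruning strategies reviewed later try to avoid.

```python
from itertools import combinations

# Table II: unit profit per item (INR).
UNIT_PROFIT = {"A": 5, "B": 100, "C": 40}

# Table I: one dict per transaction, item -> quantity (zero quantities omitted).
TRANSACTIONS = [
    {"A": 2, "C": 1}, {"A": 4, "C": 2}, {"A": 4, "B": 1}, {"B": 1, "C": 1},
    {"A": 5, "B": 1, "C": 2}, {"A": 10, "B": 1, "C": 5}, {"A": 4, "C": 2},
    {"A": 1}, {"A": 3}, {"A": 5},
]


def utility(itemset, transactions, unit_profit):
    """u(S): profit of the items of S, summed over transactions containing S."""
    items = set(itemset)
    return sum(
        sum(t[i] * unit_profit[i] for i in items)
        for t in transactions
        if items.issubset(t)
    )


def high_utility_itemsets(transactions, unit_profit, minutil):
    """Exhaustively test every non-empty itemset against u(S) >= minutil."""
    items = sorted(unit_profit)
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            u = utility(combo, transactions, unit_profit)
            if u >= minutil:
                result["".join(combo)] = u
    return result


print(high_utility_itemsets(TRANSACTIONS, UNIT_PROFIT, minutil=500))
# {'C': 520, 'AC': 605, 'BC': 620, 'ABC': 555}
```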

Recent work [8, 9, 15, 18, 19] has highlighted the importance of constraint based itemset mining, in which the user can specify his or her preferences by defining constraints that capture the semantic significance of the itemset in the intended application domain.


Temporal data mining is concerned with data mining of large sequential data sets [16]. Sequential data refers to data that is ordered with respect to some index, such as time series, gene sequences or the list of moves in a chess game. Since many business domains generate temporal data, high utility itemset mining is prominent in temporal data mining as well. One of the most essential features desired of temporal data mining algorithms is an incremental approach, which reuses previous data structures and mining results in order to avoid unnecessary and redundant calculation as a sliding window moves along the time index. Researchers have also devised algorithms for high utility itemset mining in incremental and temporal databases.

II. LITERATURE REVIEW

In this section we present a brief overview of the various algorithms, concepts and approaches that have been defined in various research publications.

Agrawal et al. in [1, 2] studied the mining of association rules for finding the relationships between data items in large databases. Association rule mining techniques use a two-step process. The first step uses algorithms such as Apriori to identify all the frequent itemsets based on their support values. Apriori uses the downward closure property of itemsets (every subset of a frequent itemset is itself frequent) to prune, at an early stage, itemsets that cannot qualify as frequent. The second step is the generation of association rules from the frequent itemsets using the support-confidence model.
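The first step can be sketched as a level-wise search. The code below is a compact, unoptimized illustration of the Apriori idea applied to the item sets of Table I (quantities are ignored, since only presence matters for support); it is not the implementation from [2], and the names `apriori` and `min_count` are assumptions.

```python
from itertools import combinations

# Items present in each transaction of Table I (T1..T10).
TRANSACTIONS = [
    {"A", "C"}, {"A", "C"}, {"A", "B"}, {"B", "C"}, {"A", "B", "C"},
    {"A", "B", "C"}, {"A", "C"}, {"A"}, {"A"}, {"A"},
]


def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    singles = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(singles).items() if n >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step (downward closure): drop any candidate that has an
        # infrequent (k-1)-subset, without counting it in the database.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        frequent = {c: n for c, n in count(candidates).items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result


# A minimum support of 40% of 10 transactions is a count of 4.
found = apriori(TRANSACTIONS, min_count=4)
print({"".join(sorted(s)): n for s, n in found.items()})
```

On this database the search stops after the second level: AB and BC are infrequent, so ABC is never generated as a candidate.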

Chan et al. in [9] observe that the candidate set pruning strategy based on the anti-monotone property used in the Apriori algorithm does not hold for utility mining. The work introduces the idea of top-k objective-directed data mining, which focuses on mining the top-k high utility closed patterns that directly support a given business objective. Yao et al. in [31] define the problem of utility mining formally. The work defines the terms transaction utility and external utility of an itemset. The mathematical model of utility mining is then defined based on the two properties of utility bound and support bound.

The utility bound property provides an upper bound on the utility value of any itemset. It can be used as a heuristic measure for pruning, at an early stage, itemsets that are not expected to qualify as high utility itemsets.

Yao et al. in [32] define the utility mining problem as one of the cases of constraint mining. This work shows that the downward closure property used in the standard Apriori algorithm and the convertible constraint property are not directly applicable to the utility mining problem. The authors also present two pruning strategies to reduce the cost of finding high utility itemsets. Yao et al. in [33] classify utility measures into three categories, namely item level, transaction level and cell level, and define a unified utility function to represent all existing utility-based measures.
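The failure of downward closure for utility can be seen on the running example. The sketch below uses the support and profit columns of Table III with minutil = 500; the function names are illustrative assumptions. It lists every pair in which a high utility itemset has a proper subset that is not high utility, and confirms that support, by contrast, never increases when an itemset grows.

```python
# Table III columns: support (%) and total profit (INR) per itemset.
SUPPORT = {"A": 90, "B": 40, "C": 60, "AB": 30, "AC": 50, "BC": 30, "ABC": 20}
UTILITY = {"A": 190, "B": 400, "C": 520, "AB": 395, "AC": 605, "BC": 620, "ABC": 555}


def closure_violations(utility, minutil):
    """Pairs (subset, superset): superset is high utility, proper subset is not."""
    return sorted(
        (sub, sup)
        for sub in utility
        for sup in utility
        if set(sub) < set(sup) and utility[sub] < minutil <= utility[sup]
    )


def support_is_anti_monotone(support):
    """True if no itemset has higher support than any of its proper subsets."""
    return all(
        support[sub] >= support[sup]
        for sub in support
        for sup in support
        if set(sub) < set(sup)
    )


print(closure_violations(UTILITY, minutil=500))
# [('A', 'ABC'), ('A', 'AC'), ('AB', 'ABC'), ('B', 'ABC'), ('B', 'BC')]
print(support_is_anti_monotone(SUPPORT))  # True
```

Because B alone falls below the threshold while BC exceeds it, an Apriori-style miner that discarded B early would never find BC.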

High utility frequent itemsets contribute the most to a predefined utility, objective function or performance metric [13]. Hu et al. in [13] present an algorithm for frequent itemset mining that identifies high utility item combinations. The algorithm is designed to find segments of data, defined through combinations of a few items (rules), which satisfy certain conditions as a group and maximize a predefined objective function. The authors formulate the task as an optimization problem, present an efficient approximation to solve it through specialized partition trees called high-yield partition trees, and investigate the performance of various splitting techniques.

Li et al. in [14] propose two efficient one-pass algorithms, MHUI-BIT and MHUI-TID, for mining high utility itemsets from data streams within a transaction-sensitive sliding window. Liu et al. in [34] propose a Two-phase algorithm for finding high utility itemsets. Tseng et al. in [29] propose a novel method, THUI (Temporal High Utility Itemsets)-Mine, for mining temporal high utility itemsets. The novel contribution of THUI-Mine is that it can effectively identify the temporal high utility itemsets by generating fewer candidate sets, and thus has lower costs in terms of execution time.
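This review does not describe the internals of the Two-phase algorithm. It is commonly presented as relying on the transaction-weighted utilization (TWU) of an itemset: the sum of the total utilities of all transactions that contain it. TWU never underestimates the true utility and, unlike utility, it is downward closed, so it can be used for Apriori-style pruning. The sketch below illustrates that idea on the running example under these assumptions. For clarity it enumerates all itemsets rather than searching level by level, so it shows the two filtering steps and not the published algorithm. The threshold minutil = 600 and all function names are hypothetical.

```python
from itertools import combinations

# Table II: unit profit per item (INR).
UNIT_PROFIT = {"A": 5, "B": 100, "C": 40}

# Table I: one dict per transaction, item -> quantity (zero quantities omitted).
TRANSACTIONS = [
    {"A": 2, "C": 1}, {"A": 4, "C": 2}, {"A": 4, "B": 1}, {"B": 1, "C": 1},
    {"A": 5, "B": 1, "C": 2}, {"A": 10, "B": 1, "C": 5}, {"A": 4, "C": 2},
    {"A": 1}, {"A": 3}, {"A": 5},
]


def transaction_utility(t, unit_profit):
    """Total profit of everything sold in one transaction."""
    return sum(q * unit_profit[i] for i, q in t.items())


def utility(itemset, transactions, unit_profit):
    """Exact utility: profit of the itemset's own items in containing transactions."""
    items = set(itemset)
    return sum(
        sum(t[i] * unit_profit[i] for i in items)
        for t in transactions if items.issubset(t)
    )


def twu(itemset, transactions, unit_profit):
    """Overestimate: whole-transaction utility of every containing transaction."""
    items = set(itemset)
    return sum(
        transaction_utility(t, unit_profit)
        for t in transactions if items.issubset(t)
    )


def two_step_mine(transactions, unit_profit, minutil):
    items = sorted(unit_profit)
    itemsets = [c for k in range(1, len(items) + 1) for c in combinations(items, k)]
    # Step 1: keep only itemsets whose TWU overestimate reaches the threshold.
    candidates = [c for c in itemsets if twu(c, transactions, unit_profit) >= minutil]
    # Step 2: compute the exact utility of the surviving candidates only.
    exact = {"".join(c): utility(c, transactions, unit_profit) for c in candidates}
    high = {name: u for name, u in exact.items() if u >= minutil}
    return ["".join(c) for c in candidates], high


candidates, high = two_step_mine(TRANSACTIONS, UNIT_PROFIT, minutil=600)
print(candidates)  # ['A', 'B', 'C', 'AB', 'AC', 'BC']
print(high)        # {'AC': 605, 'BC': 620}
```

Here ABC is discarded in the first step because its TWU (555) is already below the threshold, while B survives the first step (TWU 815) even though its exact utility (400) does not qualify.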

Ahmed et al. in [4] propose three novel tree structures to efficiently perform incremental and interactive high utility pattern mining.


The authors also suggest a technique to generate different types of itemsets such as High Utility and High Frequency (HUHF), High Utility and Low Frequency (HULF), Low Utility and High Frequency (LUHF) and Low Utility and Low Frequency (LULF).
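The four categories can be illustrated on the running example by comparing each itemset of Table III against a support threshold and a utility threshold. This is only a sketch of the categorization itself, not of the generation technique the authors propose; the thresholds (40% and 500) are the ones used earlier in this paper and the function name `classify` is an assumption.

```python
# Table III: itemset -> (support in %, total profit in INR).
TABLE_III = {
    "A": (90, 190), "B": (40, 400), "C": (60, 520),
    "AB": (30, 395), "AC": (50, 605), "BC": (30, 620), "ABC": (20, 555),
}


def classify(support, utility, minsup, minutil):
    """Return one of 'HUHF', 'HULF', 'LUHF' or 'LULF'."""
    utility_part = "HU" if utility >= minutil else "LU"
    frequency_part = "HF" if support >= minsup else "LF"
    return utility_part + frequency_part


categories = {
    name: classify(sup, util, minsup=40, minutil=500)
    for name, (sup, util) in TABLE_III.items()
}
print(categories)
# {'A': 'LUHF', 'B': 'LUHF', 'C': 'HUHF', 'AB': 'LULF',
#  'AC': 'HUHF', 'BC': 'HULF', 'ABC': 'HULF'}
```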

Pillai et al. in [20] present a new foundational approach to temporal weighted itemset mining in which item utility values are allowed to be dynamic within a specified period of time, unlike traditional approaches in which the values are static within those periods. The authors incorporate a fuzzy model in which item utilities can be assumed to be fuzzy values. Erwin et al. in [11] observe that the conventional candidate-generate-and-test approach for identifying high utility itemsets is not suitable for dense data sets. Their work proposes a novel algorithm, CTU-Mine, that mines high utility itemsets using the pattern growth approach. A similar argument is presented by Yu et al. in [36]: existing algorithms for high utility mining are column enumeration based, adopting an Apriori-like candidate set generation-and-test approach, and are thus inadequate for datasets with high dimensions [36].

Some researchers consider utility-frequent itemset mining a special case of utility mining that, in addition to a utility threshold, also considers a support threshold. Podpecan et al. in [22] propose a novel algorithm, FUFM (Fast Utility-Frequent Mining), which finds all utility-frequent itemsets within the given utility and support constraints. Yeh et al. in [35] present a bottom-up two-phase algorithm, BU-UFM, for efficiently mining utility-frequent itemsets.

Recent works on the subject include Sandhu et al. [28], who present an efficient approach based on a weight factor and a utility factor for mining significant association rules between data itemsets. Ramaraju et al. in [23] present a novel algorithm, CHUT (Conditional High Utility Tree), to mine high utility itemsets in two steps. The first step compresses the transaction database to reduce the search space. The second step uses a newly proposed algorithm, HU-Mine, to mine the complete set of high utility itemsets.

III. CONCLUSION

Frequent itemset mining is based on the rationale that the itemsets which appear more frequently in transaction databases are of more importance to the user. However, the practical usefulness of mining frequent itemsets by considering only their frequency of appearance is challenged in many application domains, such as retail research. It has been observed that in many real applications the itemsets that contribute the most in terms of some user defined utility function (e.g. profit) are not necessarily frequent itemsets.

Utility mining attempts to bridge this gap by using item utilities as an indicative measurement of the importance of an item from the user's perspective. Utility mining is a comparatively new area of research, and most of the literature focuses on reducing the search space while searching for high utility itemsets.

In this paper we have presented a brief review of the various approaches and algorithms for mining of high utility itemsets. In the next paper we will present a deeper insight into the different pruning techniques used to detect and prune unnecessary candidate itemsets early in the search for high utility itemsets.

REFERENCES

[1] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1993, pp. 207-216.

[2] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, 1994, pp. 487-499.

[3] K. Ali, S. Manganaris, R. Srikant, Partial classification using association rules, in: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Newport Beach, California, 1997, pp. 115-118.

[4] C.F. Ahmed, S.K. Tanbeer, B.-S. Jeong, Y.-K. Lee, Efficient tree structures for high utility pattern mining in incremental databases, IEEE Transactions on Knowledge and Data Engineering 21(12) (2009).

[5] R.J. Bayardo, Efficiently mining long patterns from databases, in: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, 1998, pp. 85-93.


[7] B. Barber, H.J. Hamilton, Extracting share frequency itemsets with infrequent subsets, Data Mining and Knowledge Discovery 7(2) (2003) 153-185.

[8] C.H. Cai, A.W.C. Fu, C.H. Cheng, W.W. Kwong, Mining association rules with weighted items, in: Proceedings of IEEE International Database Engineering and Applications Symposium, Cardiff, United Kingdom, 1998, pp. 68-77.

[9] Chan, Q. Yang, Y.D. Shen, Mining high utility itemsets, in: Proceedings of the 3rd IEEE International Conference on Data Mining, Melbourne, Florida, 2003, pp. 19-26.

[10] G. Dong, J. Li, Efficient mining of emerging patterns: discovering trends and differences, in: Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, San Diego, 1999, pp. 43-52.

[11] A. Erwin, R.P. Gopalan, N.R. Achuthan, Efficient mining of high utility itemsets from large datasets, in: Advances in Knowledge Discovery, Springer Lecture Notes in Computer Science, Vol. 5012/2008, pp. 554-561.

[12] J. Han, J. Pei, Y. Yin, R. Mao, Mining frequent patterns without candidate generation: a frequent-pattern tree approach, Data Mining and Knowledge Discovery 8(1) (2004) 53-87.

[13] J. Hu, A. Mojsilovic, High-utility pattern mining: A method for discovery of high-utility itemsets, Pattern Recognition 40 (2007) 3317-3324.

[14] H.F. Li, H.Y. Huang, Y. Cheng Chen, Y. Liu, S. Lee, Fast and memory efficient mining of high utility itemsets in data streams, in: Eighth International Conference on Data Mining, 2008.

[15] S. Lu, H. Hu, F. Li, Mining weighted association rules, Intelligent Data Analysis 5(3) (2001) 211-225.

[16] S. Laxman, P.S. Sastry, A survey of temporal data mining, Sadhana, Vol. 31, Part 2, April 2006, pp. 173-198.

[17] H. Mannila, H. Toivonen, Levelwise search and borders of theories in knowledge discovery, Data Mining and Knowledge Discovery 1(3) (1997) 241-258.

[18] J. Pei, J. Han, Can we push more constraints into frequent pattern mining?, in: Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining, Boston, 2000, pp. 350-354.

[19] J. Pei, J. Han, L.V.S. Lakshmanan, Pushing convertible constraints in frequent itemset mining, Data Mining and Knowledge Discovery 8(3) (2004) 227-252.

[20] J. Pillai, O.P. Vyas, S. Soni, M. Muyeba, A conceptual approach to temporal weighted itemset utility mining, International Journal of Computer Applications (0975-8887), Volume 1, No. 28, 2010.

[21] J. Pillai, O.P. Vyas, Overview of itemset utility mining and its applications, International Journal of Computer Applications (0975-8887), Volume 5, No. 11, August 2010.

[22] V. Podpecan, N. Lavrac, I. Kononenko, A fast algorithm for mining utility-frequent itemsets, in: Workshop on Constraint-Based Mining and Learning at ECML/PKDD, 2007, pp. 9-20.

[23] C. Ramaraju, N. Savarimuthu, A conditional tree based novel algorithm for high utility itemset mining, in: International Conference on Recent Trends in Information Technology (ICRTIT), 2011, pp. 701-706.

[24] L. Szathmary, A. Napoli, P. Valtchev, Towards rare itemset mining, in: Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence, 2007, Volume 1, pp. 305-312.

[25] A. Silberschatz, A. Tuzhilin, On subjective measures of interestingness in knowledge discovery, in: Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining, Menlo Park, California, 1995, pp. 275-281.

[26] C. Silverstein, S. Brin, R. Motwani, Beyond market basket: generalizing association rules to dependence rules, Data Mining and Knowledge Discovery 2(1) (1998) 39-68.

[27] S. Shankar, T.P. Purusothoman, S. Jayanthi, N. Babu, A fast algorithm for mining high utility itemsets, in: Proceedings of IEEE International Advance Computing Conference (IACC 2009), Patiala, India, pp. 1459-1464.

[28] P.S. Sandhu, D.S. Dhaliwal, S.N. Panda, A. Bisht, An improvement in apriori algorithm using profit and quantity, in: Proceedings of the Second International Conference on Computer and Network Technology (ICCNT), 2010, pp. 3-7.

[29] V.S. Tseng, C.J. Chu, T. Liang, Efficient mining of temporal high utility itemsets from data streams, in: Proceedings of the Second International Workshop on Utility-Based Data Mining, August 20, 2006.

[30] K. Wang, S.Q. Zhou, J. Han, Profit mining: from patterns to actions, in: Advances in Database Technology, 8th International Conference on Extending Database Technology, Prague, pp. 70-87.

[31] H. Yao, H.J. Hamilton, C.J. Butz, A foundation approach to mining itemset utilities from databases, in: Proceedings of the Third SIAM International Conference on Data Mining, Orlando, Florida, 2004, pp. 482-486.

[32] H. Yao, H.J. Hamilton, Mining itemset utilities from transaction databases, Data and Knowledge Engineering 59 (2006) 603-626.

[33] H. Yao, H.J. Hamilton, L. Geng, A unified framework for utility-based measures for mining itemsets, in: Proceedings of the ACM International Conference on Utility-Based Data Mining Workshop (UBDM), 2007, pp. 28-37.

[34] Y. Liu, W. Liao, A. Choudhary, A fast high utility itemsets mining algorithm, in: Proceedings of the Utility-Based Data Mining Workshop, August 2005.

[35] J.-S. Yeh, Y.-C. Li, C.-C. Chang, Two-phase algorithms for a novel utility-frequent model, in: Emerging Trends in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science, Vol. 4819/2007, pp. 433-444.