Data mining for qualitative dataset Using association rules: A review

Academic year: 2020

Abstract— In today's electronic world, vast amounts of knowledge are stored within many datasets and databases. The default format of this data means that the knowledge within is not immediately accessible; rather, it has to be mined and extracted, which requires automated tools that are both effective and efficient. Association rule mining is one approach to obtaining the knowledge stored within datasets; it finds frequent patterns and association rules between the items of a dataset with varying levels of strength. This is also association rule mining's downside: the number of association rules that can be found is usually enormous. In order to use association rules effectively, their number needs to be kept manageable, so some method is necessary to reduce the number of association rules for classification. However, we do not want to lose knowledge by discarding rules, which is why the idea of non-redundant association rule mining arose. A second issue with association rule mining is determining which rules are interesting; the standard approach has been to use support and confidence.

Index Terms—association rule, cross dataset, qualitative data.

I. INTRODUCTION

Currently the world has a wealth of data, stored all over the planet, but we need to understand that data. It has been stated that the amount of data doubles approximately every twenty months [1]. This has been especially true since the widespread use of computers and electronic database packages. The quantity of data easily exceeds what a human can comprehend unaided, and thus if we wish to use and understand as much data as possible we need tools to help us. From this situation, the field of data mining has taken off and become widely utilized.

Association rule mining is one of the dominant data mining technologies. It is a process for finding associations or relations between data items or attributes in large datasets. It allows frequent patterns, associations, correlations, or relationships among patterns to be found with minimal human effort, bringing important information to the surface for use. Association rule mining has proven to be a successful technique for extracting useful information from large datasets. Various algorithms and models have been developed, many of which have been applied in application domains that include telecommunication networks, market analysis, risk management, inventory control and many others. The success of applying the extracted rules to solving real-world problems is very often restricted by the quality of the rules. However, the quality of the extracted rules has not drawn adequate attention [2], [3]. Measuring the quality of association rules is also difficult, and current methods appear to be unsuitable, especially when multi-level and cross-level rules are involved. Moreover, most successful applications are restricted to cases where the datasets involve only a single concept level, and the success of the application is heavily dependent on the quality of the discovered rule set. Mining quality non-redundant multi-level association rules from multi-level datasets is a challenge still needing work, and it is a desired goal for helping to solve real-world problems.

II. LITERATURE REVIEW

The purpose of this section is to present an in-depth review of the topics, areas and works related to the research presented here. Firstly, we conduct a brief but comprehensive review of association rule mining techniques and approaches, followed by a look at interestingness and quality, and the redundancy issues related to association rule mining. This review sets the groundwork for our research and the proposals made here.

A. Association Rule Mining

In this section association rule mining is introduced and covered in depth, including a review of multi-level and cross-level association rule mining. Many different approaches and algorithms are then reviewed. Association rule mining was first presented by Agrawal et al. in [4]. Association rules are generally of the form X => Y, where X and Y are items or itemsets wholly contained within a dataset/database and X ∩ Y = ∅. X is the antecedent and Y is the consequent. The rule implies that whenever X is present, Y will also be present; that is, X implies Y.

Because most datasets are large and the user does not always want all the rules but only those that are of interest or importance, measures are needed so that the uninteresting rules can be removed. Traditionally, there are two basic measures used in association rule mining: support and confidence. Support is a measure that defines the fraction of records/entries in the dataset that contain X ∪ Y relative to the total number of records/entries. The basic method for determining the support does not take into account the quantity of an item within a record, and the support value serves as a measure of the statistical significance of the rule [5]. Confidence is a measure that defines the fraction of records/entries in the dataset that contain X ∪ Y relative to the number of records/entries that contain just X. The confidence value serves as a measure of the strength or precision of the rule [5]. Association rule mining is the process of finding those rules that satisfy preset minimum support and confidence thresholds in a dataset, and it is a two-step process. The first step is to find frequent itemsets whose frequency exceeds a specified threshold, and the second step is to generate rules from those frequent itemsets using the constraints of minimum support and confidence. The simplest technique to generate the rules is to take each frequent itemset and generate a rule with all but one item in the antecedent (with the last item in the consequent) and check that this rule meets the threshold. Then the last item of the antecedent is moved into the consequent and the new rule checked. This is repeated until the antecedent becomes empty [5].
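As an illustrative sketch (not taken from any of the papers reviewed), the two measures and the simple rule-generation procedure just described can be written in a few lines of Python over a toy transaction set:

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(X U Y) / support(X) for the rule X => Y."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

def rules_from_itemset(itemset, transactions, min_conf):
    """Generate rules X => Y from one frequent itemset, moving items
    from the antecedent into the consequent as described above."""
    items = frozenset(itemset)
    rules = []
    for k in range(len(items) - 1, 0, -1):       # antecedent shrinks
        for antecedent in combinations(items, k):
            consequent = items - frozenset(antecedent)
            c = confidence(antecedent, consequent, transactions)
            if c >= min_conf:
                rules.append((frozenset(antecedent), consequent, c))
    return rules

# Toy dataset, invented for illustration.
transactions = [frozenset(t) for t in
                [{"milk", "bread"}, {"milk", "bread", "butter"},
                 {"bread", "butter"}, {"milk", "butter"}]]
print(support({"milk", "bread"}, transactions))   # 0.5
```

Here confidence({"milk"}, {"bread"}) is 0.5 / 0.75 ≈ 0.67, so with a minimum confidence of 0.6 both rules generated from {"milk", "bread"} survive.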

It is the first step that influences and determines the performance of the rule mining process. Because of this, there has been a large focus on improving this step and much work has been proposed [5, 6]. This has led to a lesser focus on the second stage of the process. It is here that the actual rules are generated and it is this second stage that needs improvement in order to deal with the issues of the rule set generated. While the focus on the first stage has led to improvements in efficiency, it has not led to improvements in the size, quality or clarity/understanding of the rules. This is why more work is needed which focuses on the second stage. Indeed, the focus has started to shift and some work is being done which focuses on improving the outcome of the rules [7, 8].

The problem of finding frequent itemsets is fundamental in multi-level association rule mining, and fast algorithms for solving it are needed. The work in [9] presents an efficient version of the Apriori algorithm for mining multi-level association rules in large databases, finding maximum frequent itemsets at lower levels of abstraction. These methods reduce the time taken and increase throughput, but the redundancy in the rule set is left unaddressed.

In this review several different approaches to association rule mining will be presented, starting with the traditional approaches, followed by multi-level and cross-level approaches. After a review of the different approaches, other issues related to association rule mining will be presented.

B. Single Level Association Rule Mining

In this section the review introduces several of the more traditional association rule mining algorithms. These approaches are intra-transactional algorithms, usually for single-abstract-level and single-dimensional datasets. Most of the proposed approaches focus only on the first step, finding frequent itemsets, as this has always been considered to be the step needing the most work. Pre-processing of data prior to the use of these approaches is not covered here, but for some approaches it may be necessary.

C. AIS Algorithm

The AIS algorithm is the first approach that was proposed for association rule mining. It was developed by Agrawal, Imielinski & Swami [4] and it focused on improving database quality and introducing functionality needed to process decision support queries and the finding of relations/associations within the data.

The efficiency of the AIS algorithm was improved by the addition of an estimation method to prune candidate itemsets that had no chance of being above the support threshold (which allows the algorithm to avoid having to unnecessarily determine the support of the itemset) [4, 5]. Also two approaches to memory management were added in case it was not possible to hold the candidate and frequent itemset list in memory.

The AIS algorithm has several drawbacks. The first is that it can only generate rules with a single item in the consequent (thus it cannot find rules where the presence of X implies the presence of Y when Y contains multiple items). Another drawback is that AIS generates too many candidate itemsets that are never frequent, thus requiring space and time to store and generate useless information [5], [10]. This candidate generation process slows down the algorithm and can make the time it takes to process a large database/dataset unacceptable. Also, the algorithm makes many passes over the database/dataset, which again is inefficient and time-consuming.

D. Apriori Algorithm

The AIS algorithm was a straightforward approach but had several drawbacks needing improvement. The Apriori algorithm was that improvement and is considered a major step in association rule mining [5], [11]. This approach avoids the problem AIS had of generating candidate itemsets that are known to have too low a support value. The candidate itemsets are generated by joining itemsets already determined to be frequent in a level-wise manner, and the candidates are pruned according to the Apriori property, an anti-monotone heuristic that reduces the search space [5]. This has the effect of reducing the number of candidate itemsets whose support must be determined, thus reducing computation, I/O and memory costs when compared against AIS [5]. The argument by Agrawal et al. is the basic intuition that for a k-itemset to be frequent, all of its (k-1)-subsets must also be frequent.
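The level-wise join-and-prune step can be sketched as follows (an illustrative Python rendering, not Agrawal et al.'s original pseudocode):

```python
from itertools import combinations

def apriori_gen(frequent_kminus1):
    """Join frequent (k-1)-itemsets, then prune any candidate that has
    an infrequent (k-1)-subset (the anti-monotone Apriori property)."""
    candidates = set()
    freq = set(frequent_kminus1)
    for a in freq:
        for b in freq:
            union = a | b
            if len(union) == len(a) + 1:            # join step
                if all(frozenset(s) in freq         # prune step
                       for s in combinations(union, len(a))):
                    candidates.add(union)
    return candidates

# Frequent 2-itemsets from some hypothetical scan.
L2 = [frozenset(s) for s in [{"A", "B"}, {"A", "C"},
                             {"B", "C"}, {"B", "D"}]]
print(apriori_gen(L2))
```

Of the possible 3-itemsets, only {A, B, C} survives: {A, B, D} and {B, C, D} each contain an infrequent 2-subset ({A, D} and {C, D}), so their supports never need counting.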

The Apriori approach still has two major drawbacks or bottlenecks, of which one has been inherited from the AIS approach. The inherited drawback is that it still has to scan the entire database/dataset multiple times as it builds the list of frequent itemsets. The second major drawback is that the candidate generation process is complex and resource (time and memory) consuming [5, 11].

Attempts have been made to deal with the multiple-scan drawback, and there are two major approaches. The first is to reduce the number of passes over the whole database/dataset, or to replace the whole database/dataset with a portion of it based on the current frequent itemsets. The second is to utilize different pruning approaches and techniques to reduce the number of candidate itemsets.

E. Apriori-Based Algorithms

Many Apriori-based algorithms have been proposed to address one or both of these drawbacks, and some of these modified approaches are presented here.

Apriori-TID extends the original Apriori by removing the need for multiple database/dataset scans by constructing a counting-base set during the first pass through of the database/dataset. This counting-base set is then used later on during the determination of frequent itemsets and thus the database/dataset is not needed [6, 10].

DHP (Direct Hashing and Pruning) introduced the concept of possible frequent itemsets in an effort to optimise the generation of the candidate itemsets, which was expensive under Apriori. In effect, it accumulates information about the set of (k+1)-itemsets in advance and stores it in a hash table, before it is needed. The use of the hash table reduces the size of the candidate set, but because hash collisions are possible, it is still necessary to check each entry's support. This approach also uses progressive dataset pruning to remove and discard items/objects that are of no further use later on. The use of hashing and pruning is said to result in a significant speedup over the original Apriori [6], [12].
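The hash-based filtering idea can be sketched roughly as follows; the bucket function, bucket count and data are illustrative choices, not those of the DHP paper:

```python
from itertools import combinations

def build_hash_counts(transactions, n_buckets=7):
    """While scanning for frequent 1-itemsets, also count hash buckets
    for all 2-itemsets seen in each transaction."""
    buckets = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    return buckets

def may_be_frequent(pair, buckets, min_count, n_buckets=7):
    """A pair can only be frequent if its bucket reaches the threshold.
    Collisions mean this test can pass for infrequent pairs, so the
    surviving candidates still need an exact support count."""
    return buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_count
```

The point is that buckets below the threshold rule out every pair hashed into them, shrinking the candidate 2-itemset list before any exact counting is done.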

F. FP-Tree Algorithm

Because the Apriori approach has two major drawbacks, work to fix them has been conducted. One of the results of this work is the design of tree structures for use in association rule mining. The Frequent Pattern Tree (FP-Tree) was first introduced by Han & Pei and is an approach that requires only two passes/scans through a database/dataset to generate the frequent itemsets, and does so without the need to generate candidate itemsets [13].

Two advantages of FP-Tree are the avoidance of a candidate generation process and the small number of passes/scans of the database/dataset needed (it only scans the database/dataset twice); because of these, the FP-Tree approach is considered to be faster than Apriori by an order of magnitude [5].
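A minimal sketch of the two-pass, candidate-free construction follows; the structure is simplified (a real FP-Tree also keeps a header table linking nodes of the same item), and the data is invented:

```python
from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_count):
    """Pass 1: count item frequencies. Pass 2: insert each transaction's
    frequent items, in descending frequency order, into a prefix tree,
    so transactions with common prefixes share nodes."""
    counts = Counter(i for t in transactions for i in t)
    order = {i: c for i, c in counts.items() if c >= min_count}
    root = Node(None)
    for t in transactions:
        items = sorted((i for i in t if i in order),
                       key=lambda i: (-order[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

tree = build_fp_tree([{"A", "B"}, {"A", "B", "C"}, {"A", "C"}], 2)
print(tree.children["A"].count)   # 3 -- all transactions share the A prefix
```

Sorting by descending frequency maximises prefix sharing, which is what keeps the tree compact when items are strongly correlated.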

FP-Tree has limitations. Firstly, it is difficult to use this approach in an interactive mining process where it is possible to change the support threshold [5]. The second limitation is that the FP-Tree algorithm is not suitable for incremental rule mining. There are other limitations to the original FP-Tree algorithm, the majority of which relate to finding more sophisticated patterns [12]. Also, as the number of different/unique items increases, the size of the tree typically expands at an exponential rate due to the reduction in common prefix sharing. Due to these limitations and problems, several variations on the original FP-Tree approach have been developed over the years.

G. FP-Growth/FP-Tree Based Algorithms

While the FP-Tree approach was shown to be effective and efficient, as mentioned previously the size of the tree usually increases exponentially as the number of unique items increases. A number of FP-Tree/FP-Growth based algorithms have been developed to address this shortcoming. One such variant, developed by Grahne & Zhu [14], uses an extra array-based structure to decrease the number of traversals of the tree that are required during the analysis. This saves time during general traversal of the tree and also enables direct initialization of the next level of the FP-Tree(s).

H. RARM (Rapid Association Rule Mining) Algorithm

Rapid Association Rule Mining (RARM) is an approach that also uses a tree structure to represent the database/dataset and does not utilize a candidate generation process. It was first introduced by Das, Ng & Woon [15] with the focus of being faster than the existing algorithms. The approach taken in developing RARM is to build the 1-itemset and 2-itemset lists quickly, without needing a candidate generation process to get the frequent 2-itemsets. Apriori's main bottleneck is said to be the candidate generation process for 2-itemsets, and it is claimed that by avoiding it, RARM is up to two orders of magnitude faster than Apriori. By limiting the tree to just two levels and not using a candidate generation process for 2-itemsets, the RARM algorithm is fast, efficient and scalable. The Apriori approach is used to discover itemsets of order greater than two, because using the initial process for larger itemsets is actually inefficient due to the memory requirements [15].

The RARM algorithm, while fast, has several drawbacks. Because it uses a tree structure just like FP-Tree, it suffers the same limitations as the FP-Tree algorithm [5]. Additionally, because the TrieIT structure stores the support counts individually, it requires more memory; thus the RARM approach may use more memory than FP-Tree.

I. Non-Derivable Itemset Algorithm

Another approach to association rule mining has been proposed in the form of non-derivable itemsets and rules. In this approach, itemsets are removed if their support can be derived [16], since derivability is monotone. The Non-Derivable Itemset (NDI) approach is based on the Apriori approach; however, it does not try to find all of the frequent itemsets. It instead focuses on obtaining a complete set of deduction rules in order to derive 'tight' intervals on the support value of a candidate itemset [16]. Three variations of the proposed NDI approach were compared against the original. These alternatives included NDI-All, where the DeriveAll algorithm was applied after the NDI algorithm; NDI-hamc, where the NDI approach used the halving optimization; and NDI-hamc-All, where the NDI approach used the halving optimization followed by the DeriveAll approach.

J. Closed Itemset Algorithm

An itemset is said to be closed if and only if no proper superset of the itemset has the same support the itemset has. For a given support threshold, knowing all frequent closed itemsets is sufficient to generate all the frequent itemsets and their supports without accessing the dataset. The use of frequent closed itemsets offers a clear promise of reducing the number of extracted rules while also providing a concise representation of association rules [17].

The number of frequent closed itemsets is typically greatly reduced when compared to the number of frequent itemsets [17].
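Given the supports of all frequent itemsets, the closedness condition can be expressed directly; the supports below are toy values invented for illustration:

```python
def closed_itemsets(supports):
    """Keep only itemsets that have no proper superset with the same
    support. `supports` maps frozenset -> support count."""
    return {s: c for s, c in supports.items()
            if not any(s < t and c == supports[t] for t in supports)}

supports = {frozenset({"A"}): 3, frozenset({"B"}): 2,
            frozenset({"A", "B"}): 2, frozenset({"A", "B", "C"}): 1}
closed = closed_itemsets(supports)
# {"B"} is not closed: its superset {"A","B"} has the same support (2),
# so {"B"}'s support is recoverable from the closed itemsets alone.
```

This is why the closed set is a lossless, and usually much smaller, representation of all frequent itemsets.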

K. Multi-Level & Cross-Level Association Rule Mining

Traditionally, association rule mining has been performed at a single concept or abstract level, usually either a low abstract/primitive level or a high abstract/concept level [J. Han, 1995; J. Han & Y. Fu, 1995]. It is widely accepted that single-level rule mining has two major problems: firstly, it is difficult to find strong associations at a low or primitive level due to the sparseness of data; secondly, mining at high levels may result in common-knowledge rules being presented which are already known and are of little use or interest [J. Han & M. Kamber, 2006; J. Han & M. Kamber, 2001; J. Han, 1995; J. Han & Y. Fu, 1995]. It is quite possible that a given database that can be mined by a single-level algorithm is not in fact flat, but contains data in a hierarchical format. While this structure may be present, it has been argued that few algorithms use or take advantage of it [R. Páircéir, S. McClean & B. Scotney, 2000]. Therefore alternatives were investigated, and both multi-level rule mining and cross-level rule mining came about. A general process to assist in mining rules from hierarchical databases/datasets has been proposed by Psaila & Lanzi through the exploitation of both implicit and explicit concept hierarchies that often feature in data warehouses [G. Psaila & P. L. Lanzi, 2000]. Despite this, however, most association rule mining has focused on single-level techniques and discovery. Figure 2.1 shows an example of part of the Amazon hierarchy for their book collection, where each level has a different degree of abstraction.

One of the major arguments for the use of multi-level or cross-level rule mining is that it has the potential for undiscovered knowledge to be discovered. Such knowledge could not be found by the single level approach and this new knowledge may be highly relevant or interesting to a given user [18]. In fact, it has been stated that multi-level rule mining is useful in discovering new knowledge missed by conventional algorithms and thus if a database/dataset has a hierarchical structure (with multiple levels of conception/abstraction) a multi-level or cross-level approach would be a benefit. Aside from the advantage of new knowledge, multi-level or cross-level rule mining has other advantages.

Multi-level rules span multiple levels of abstraction, but the items within any one rule come from the same concept/abstract level. This means rules can sit at different levels and contain more general or more specific information than single-level rules, and the intermediate results from high levels of abstraction can be used to help mine lower abstract levels and refine the process.

Despite the advantages that multi-level rule mining has, it also has drawbacks. The biggest is that a concept hierarchy/ontology/taxonomy is needed, may have to be dynamically built or altered during rule mining, and the rules discovered will be dependent on the taxonomy built/used [J. Han, 1995; J. Han & Y. Fu, 1995]. Thus a suitable ontology/taxonomy for the database must first be discovered so that the rules obtained actually make sense and represent usable knowledge, as well as allowing the generalisation of concepts from low levels to higher levels [19].

The second drawback is directly related to the support threshold(s) that will be used to determine frequent itemsets for each level. It is argued that the simplest approach, to use a uniform minimum support for all levels is unsuitable as it will either miss interesting itemsets (as they have a low support) or will suffer from the itemset generation bottleneck [20].

To overcome this problem, alternative techniques using different approaches for the support thresholds were developed. One approach is 'reduced support', where each level of abstraction has its own value/threshold. Under this approach, the deeper the level (the lower in abstraction), the lower the minimum support threshold is [20]. This approach usually relies on the user selecting the best threshold for each level, which may not be an easy task. It can also still suffer from the same problem as a uniform-support approach: if the wrong threshold is used at any level, either interesting itemsets may be missed or excessive itemsets (the generation bottleneck) may be discovered. However, the advantage is that such a mistake is usually limited to just that level of the dataset.
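The reduced-support scheme amounts to a per-level threshold lookup; the thresholds, items and supports below are invented purely for illustration:

```python
def frequent_by_level(itemsets, thresholds):
    """itemsets: list of (itemset, level, support) triples. Under reduced
    support, each level of abstraction has its own minimum-support
    threshold, decreasing with depth in the hierarchy."""
    return [(s, lvl, sup) for s, lvl, sup in itemsets
            if sup >= thresholds[lvl]]

# Level 0 = most abstract; deeper levels get lower thresholds.
thresholds = {0: 0.10, 1: 0.05, 2: 0.02}
data = [({"food"}, 0, 0.40),
        ({"bread"}, 1, 0.06),
        ({"rye bread"}, 2, 0.03)]
print(frequent_by_level(data, thresholds))   # all three pass
```

Under a uniform 0.10 threshold, only the top-level itemset would survive, illustrating how uniform support misses the sparser low-level itemsets.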

L. Apriori-Based Algorithms

Han presented an Apriori-based approach to multi-level rule mining that shares data structures and intermediate results between the different concept levels [21].

Han also presented three different ways to go about multi-level mining. The first is the Progressive Deepening approach [21], which starts at a high abstraction level and finds the strong associations there. It then selectively moves down to lower levels of abstraction (towards the primitive level) and finds associations there. The second is a Progressive Generalization approach, which starts at a low/primitive level and finds associations there; it then uses these results and generalizes them to higher levels. The third is an Interactive-up-and-down approach [22], which involves the user interacting with the algorithm to direct it in stepping up or stepping down through the concept levels.
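The Progressive Deepening idea can be sketched as a top-down loop; `mine_level` and `is_strong` here are placeholder callables standing in for the actual mining and strength tests, not functions from [21]:

```python
def progressive_deepening(levels, mine_level, is_strong):
    """Top-down sketch: mine the most abstract level first, then descend,
    restricting each deeper pass to descendants of the items found strong
    one level up."""
    frequent = {}
    restrict = None                   # no restriction at the top level
    for level in levels:
        found = [i for i in mine_level(level, restrict) if is_strong(i)]
        frequent[level] = found
        restrict = found              # only descend under these items
    return frequent
```

The key property is that itemsets whose ancestors were not strong are never examined at deeper levels, pruning the search space level by level.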

Apriori uses a uniform support threshold, that is, the same support requirement for all itemsets regardless of length (1-itemset, 2-itemset, etc.) or whatever concept level they are located at. Wang, He & Han argue that this means Apriori either misses interesting patterns (because they have low support) or finds too many patterns and suffers a bottleneck during its itemset generation. Their algorithm, known as Adaptive Apriori, defines the best support threshold for each group/bin of items and schema/concept level individually through the use of a support-based specification.

Work has also been undertaken on discovering both multi-level and cross-level frequent itemsets from multi-level datasets. The approach taken is a top-down progressive deepening method built upon existing algorithms used in mining single-level and multi-level rules. The main difference between this approach and other similar approaches is the pruning that takes place. Thakur, Jain & Pardasani's approach uses a reduced uniform minimum support, where each level has its own support threshold (used for all items at that level, hence uniform) and the support threshold is reduced as the algorithm works down the levels (so deeper levels have a lower support threshold). For each level, after the frequent itemsets are discovered, the dataset is filtered/pruned so that any items that are not frequent at the current level and are not descendants of a frequent item at the current level are removed [19]. In presenting their approach, the authors do not consider how the user should determine and select the support thresholds for each level, so this aspect remains quite subjective and based largely on the user's opinion or belief.

M. Other Algorithms

Apriori and FP-Growth based approaches are the most widely used and developed when it comes to multi-level rule mining. However, some work has been done which is not completely based on one of these two approaches. One approach proposed is based on statistics and others are based on the idea of using fuzzy set theory.

Another technique proposed for multi-level association rule mining is the use of fuzzy set theory. Hong, Lin & Chien proposed the FDM (fuzzy data mining) algorithm, which combines fuzzy set theory with linguistic terms and works on finding patterns and rules from quantitative databases/datasets. In this work, the assumption is made that the membership functions for the items are already known and do not need to be discovered or made during mining. This assumption does limit the use of FDM. The approach is limited to finding association rules for items that are not on the same path in the database/dataset hierarchy, but it can be modified to include this, and can also be modified to degrade into a more traditional non-fuzzy algorithm by assigning a constant membership value (e.g., 1) instead of a variable value based on quantity [23].

Hong, Lin & Wang also developed a fuzzy Apriori-based approach for mining multi-level association rules using similar techniques to FDM. It too works on quantitative databases/datasets and also relies on knowing the membership function(s) in advance. These approaches may not generate the complete set of rules, and thus information may be lost, but they should generate the most important ones because they include the most important fuzzy terms for the items.
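As a rough illustration of the kind of membership functions these fuzzy approaches assume are given in advance (the shapes and linguistic terms here are invented, not taken from FDM):

```python
def triangular(x, a, b, c):
    """Triangular membership function rising from a, peaking at b,
    falling back to zero at c (all parameters illustrative)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Assumed linguistic terms for a purchased quantity.
terms = {"low": (0, 1, 5), "medium": (2, 6, 10), "high": (7, 12, 16)}
quantity = 4
memberships = {t: triangular(quantity, *p) for t, p in terms.items()}
# quantity 4 is mostly 'medium' (0.5), partly 'low' (0.25), not 'high'
```

Each quantitative value thus contributes fractionally to several linguistic terms, and the fuzzy algorithms mine rules over those terms rather than over the raw quantities.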

Kaya & Alhajj [24] proposed an approach based on the work of Han & Fu and Hong et al., in that they used fuzzy set theory, weighted mining and linguistic terms, and used not just support and confidence to measure the strength of rule interestingness, but also item importance. By using these approaches it is claimed that the rules are more meaningful and more understandable to users. Although the proposal by Kaya & Alhajj performs well, and it has been said that the use of their weighting method has produced consistent and meaningful results, it has only been tested on a synthetic database/dataset and not a 'real-world' set, and thus how well it performs on non-synthetic sets is unknown [24].

III. INTERESTINGNESS, PRESENTATION & REDUNDANCY

The aim of association rule mining is to uncover associations, correlations and/or relationships between data items, and often there are many. At the same time, some rules may not be important, relevant, correct or even interesting to the user; it is even possible for rules to be misleading [20], and they may be of low or poor quality. Association rule mining has also been said to produce too many rules, and even redundant rules. This has been caused by the focus being on improving the efficiency of generating the rules rather than on the content, quality, presentation or makeup of the rules produced. Although work has been undertaken in the areas of interestingness/importance, quality and the evaluation of quality, there still appears to be a lack of agreement over what interestingness/importance is, what 'good' association rules are and how best to evaluate the 'goodness' of a rule or rule set. Because of this there is some confusion over the differences between interestingness and 'goodness' and how the two are related. The following subsections will look at interest/importance measures for rules and the associated issues, along with a brief look at the presentation of the produced rule set and redundancy in the rules.

A. Interestingness and Measures

In some cases it is considered that if a rule is interesting then it is a good rule, and in other cases they are treated as separate but related issues. In this section the review will look at several of the different interestingness measures that can be applied to association rules.

B. Interestingness

Determining and measuring the interestingness (and also the quality) of association rules is an important and active area. Much work has been done, yet there is no formal definition, or even a widely accepted agreement, in the context of association rule mining [5]. It seems that interestingness is a broad concept that is based on and emphasizes the following aspects [5]:

Conciseness. A rule is considered to be concise if it contains few attribute-value pairs, and a rule set is concise if it contains few rules. A concise rule or set of rules is easy to understand and easily added to a user's existing knowledge.

Generality/Coverage. A rule is considered to be general if it covers a large subset of the dataset. Generality measures the comprehensiveness of, or the fraction of instances covered by, the rule or set of rules, and rules that 'cover' or characterize more of the dataset are usually seen as more interesting. Generality can be used to form the basis of pruning techniques and strategies. Often generality coincides with conciseness, as concise rules tend to have larger coverage.

Reliability. A rule is considered to be reliable if the condition or relationship described by it occurs often (in a high number of applicable cases). Usually for association rules, a rule is considered reliable if it has a high confidence.

Peculiarity. A rule is considered to be peculiar if it is distant from other discovered rules based on some distance measure. Peculiar rules are usually generated from data outliers, which are relatively rare, few in number and significantly different from the rest of the dataset. Often they may be unknown to the user and thus may be interesting.

Diversity. A rule can be considered diverse if the attributes that make it up are significantly different from each other. A set of rules is diverse if the rules are significantly different from each other. A diverse rule or rule set can be interesting because, in the absence of knowledge, a user may assume that a uniform distribution holds in a rule or rule set.

Novelty. A rule is considered to be novel if a user did not know it beforehand nor could infer it from other rules. Novelty cannot be measured explicitly with reference to a user's knowledge or ignorance, because no system represents everything a user knows or does not know. Novelty is subjective and is usually measured by having the user identify a rule as novel, or by noticing that it cannot be inferred from other rules and does not contradict them.

Surprisingness. A rule is surprising (or unexpected) if it contradicts a user's existing or current knowledge. Surprising rules can be considered interesting because they may identify failings in existing/previous knowledge.

Utility. A rule is of utility if it can be used by someone to contribute to reaching a goal. The utility of a rule or rule set is subjective and depends on the domain, user and goal. Different users can have divergent goals concerning what knowledge/information can be obtained from the dataset.

Actionability (or Applicability). A rule is actionable in a domain if a user is able to make a decision about future actions in the domain from the information in the rule. Actionability can be associated with pattern/rule selection strategies. The actionability of a rule or rule set is subjective and depends on the domain, dataset and application.

From these criteria it can be seen that interestingness is difficult to define, and therefore that it is difficult to measure (or evaluate) rules for it. It is equally difficult to define 'goodness', as the two are interlinked: many of these criteria could be applied directly to the concept of goodness, so measuring how good a rule or rule set is is also difficult. In some cases, evaluating the 'goodness' of a rule or rule set appears to be the same as measuring its interestingness.

C. Measures

There are many measures available for assessing the interestingness or 'goodness' of association rules. All of them can broadly be placed in three categories: objective, subjective and semantic. Objective measures determine the interestingness of association rules purely from the raw data, while subjective measures use both the raw data and knowledge about the user. Semantic measures determine the interestingness of a rule from the semantic meaning(s) and explanation of the patterns. Subjective measures are more difficult to define and implement because information about the user must be taken into account (either by direct input of the user's knowledge or by learning about it), and what one user finds interesting another may not, owing to differences in knowledge, understanding and expectations. Objective measures are currently the most popular, and some are reviewed here.

Support-confidence


In the support-confidence approach, support measures the coverage of the rule and confidence measures its precision/accuracy. Support was chosen because it represents statistical significance [4], and users are usually only interested in rules whose support/significance is above a certain threshold. The calculation of support assumes statistical independence, and the support-confidence approach is targeted at finding qualitative rules.
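As an illustration, both measures can be computed directly from a transaction list. The following is a minimal sketch (not from the paper); the transactions and the {bread} -> {butter} rule are hypothetical examples:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent with consequent) / support(antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Hypothetical transactions; the rule examined is {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

s = support({"bread", "butter"}, transactions)       # 2 of 4 transactions -> 0.5
c = confidence({"bread"}, {"butter"}, transactions)  # 0.5 / 0.75 = 2/3
```

Thresholding `s` and `c` against user-chosen minimum support and minimum confidence values is exactly how the support-confidence approach keeps only "significant" and "reliable" rules.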

Correlation

Another reliability-based approach is correlation. Instead of basing interestingness on support and confidence values alone, this approach measures the correlation between the items in a rule. One advantage is that both positive and negative relationships can be discovered; a positive correlation means that the presence of one item implies the presence of another.
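One common objective correlation measure is lift, P(X and Y) / (P(X) P(Y)): values above 1 indicate a positive correlation between the items and values below 1 a negative one. The paper does not name a specific measure, so the sketch below assumes lift, with hypothetical transactions:

```python
def lift(x, y, transactions):
    """P(x and y together) / (P(x) * P(y)); >1 positive, <1 negative correlation."""
    n = len(transactions)
    p_xy = sum(1 for t in transactions if x <= t and y <= t) / n
    p_x = sum(1 for t in transactions if x <= t) / n
    p_y = sum(1 for t in transactions if y <= t) / n
    return p_xy / (p_x * p_y)

# Hypothetical transactions (same style as before).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

l = lift({"bread"}, {"butter"}, transactions)  # 0.5 / (0.75 * 0.5) = 4/3 > 1
```

Here bread and butter co-occur more often than independence would predict, so the rule is positively correlated; a value below 1 would flag a negative relationship that support-confidence alone cannot express.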

D. Redundancy

As already mentioned, the number of rules discovered is often quite large, and not all of them are necessarily unique. Often there are redundant rules, especially when a support-confidence based approach has been used. Zaki has stated that redundancy was known to exist, but that "the extent of redundancy is a lot larger than previously suspected" [25]. Redundant rules give little if any new information or knowledge to the user and often make it more difficult to find new knowledge (and can even help to overwhelm the user when it comes to finding high quality interesting / important rules). This problem is claimed to become more crucial when the data is dense or correlated. Suppressing the redundant rules, or removing them from the final result set, makes it easier for the user to handle, process and understand the remaining rules, which actually contain the new and unique information. Pasquier et al. have argued that there is a further reason to suppress or remove redundant rules: they can be misleading [17]. They also argue that support-confidence information is important when characterising redundant rules, and propose generating a condensed rule set.
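As a sketch of the general idea (not the condensed-basis method of Pasquier et al. or Zaki's closed-itemset approach), a rule can be dropped when a more general rule exists with no worse support and confidence, since the general rule conveys at least as much information. The rule tuples below are hypothetical:

```python
def is_redundant(rule, rules):
    """A rule is redundant here if some other rule with a smaller antecedent
    and a larger-or-equal consequent has support and confidence at least
    as high."""
    ant, con, sup, conf = rule
    for ant2, con2, sup2, conf2 in rules:
        if (ant2, con2) == (ant, con):
            continue
        if ant2 <= ant and con <= con2 and sup2 >= sup and conf2 >= conf:
            return True
    return False

def prune(rules):
    """Keep only the non-redundant rules."""
    return [r for r in rules if not is_redundant(r, rules)]

# Hypothetical rules as (antecedent, consequent, support, confidence).
rules = [
    (frozenset({"a"}), frozenset({"b", "c"}), 0.4, 0.8),  # the general rule
    (frozenset({"a", "d"}), frozenset({"b"}), 0.3, 0.8),  # implied by the first
]

pruned = prune(rules)  # only the general rule remains
```

The second rule says less (smaller consequent) while demanding more (larger antecedent), so removing it loses no information; condensed representations formalize this so that the full rule set can be regenerated on demand.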

IV. PROBLEM DEFINITION

Data mining, and the use of the discovered data and knowledge, is a major field of research whose results are often applied to real-world scenarios. To improve the usage of data and the knowledge it contains, better techniques need to be developed. Improvements to rule mining, as one of the major data mining technologies, would benefit many applications. Hence the development of novel techniques to discover and mine high quality association rules from multi-level datasets, and to use them effectively, is important. Furthermore, to ensure that high quality rules can be identified, it is equally important to have a way of measuring a rule's usefulness or interestingness; with this achieved, the applications that utilize association rules can be improved. In most of the work on association rule mining, the primary focus has been on the efficiency of the approach (how quickly it can derive the rules), with the quality of the derived rules (and the evaluation process and measures used to determine that quality) receiving less emphasis.

Often a huge number of rules can be derived from a dataset, but many of them (in some cases a significant number) are redundant to other rules and thus useless in practice. This extremely large number of rules makes it difficult for end users to comprehend, and therefore effectively use, the discovered rules, which significantly reduces the effectiveness of rule mining algorithms. If the extracted knowledge cannot be used effectively to solve real-world problems, the effort of extracting it is worth little. This is a serious problem that has not yet been solved satisfactorily. Some approaches aiming at reducing the number of extracted rules and eliminating redundancy have been proposed [2], [3], [17], [25], but none of them deals with multi-level datasets; all have focused solely on datasets with a single level. The traditional approaches do not take into account that there may be a data hierarchy, so they find associations at just one level of the hierarchy and fail to discover associations and rules for data in other hierarchical levels, or rules that span two or more levels. For such multi-level datasets the issue of deriving rules is even more serious, since the presence of multiple concept levels means that items, and therefore patterns, exist at multiple levels, which inevitably increases the number of rules that can be discovered.
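To make the multi-level setting concrete, one simple technique is to extend each transaction with the taxonomy ancestors of its items, so that frequent patterns (and hence rules) can be mined at any concept level or across levels. This is a sketch of the general idea, not the exact encoding of Han & Fu [21]; the two-level taxonomy below is hypothetical:

```python
# Hypothetical two-level taxonomy mapping each item to its parent category.
taxonomy = {
    "skim_milk": "milk",
    "whole_milk": "milk",
    "white_bread": "bread",
    "wheat_bread": "bread",
}

def extend(transaction, taxonomy):
    """Add every taxonomy ancestor of each item, so patterns can be found
    at any concept level or spanning levels."""
    items = set(transaction)
    for item in transaction:
        parent = taxonomy.get(item)
        while parent is not None:
            items.add(parent)
            parent = taxonomy.get(parent)
    return items

t = extend({"skim_milk", "white_bread"}, taxonomy)
# t contains the leaf items plus their categories "milk" and "bread"
```

After this extension a standard single-level miner can find a cross-level rule such as {milk} -> {white_bread}, but the extension also multiplies the number of items per transaction, which is exactly why the rule explosion described above is worse for multi-level datasets.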

V. CONCLUSIONS

With today's reliance on electronic datasets and databases, being able to effectively extract usable information and knowledge is extremely important. Association rule mining is one technique developed for this situation. However, datasets have gone from being flat to being highly structured, with a well-defined hierarchy / taxonomy, and the number of association rules that can be derived has also increased. This makes it harder to extract information or knowledge, and harder to use it. Another issue facing association rule mining is determining which rules are useful or interesting to the user or application. Most of the time support and confidence are used, but these do not take into account the hierarchical structure within a multi-level dataset. Finally, one application that uses such datasets is the recommender system, in which an automated system recommends 'items' to a user based on their past actions and the actions / interests of other users in the dataset. Recommender systems, however, can only work when they know the user: new users are an unknown quantity, so recommender systems struggle to make quality recommendations for them because their interests, likes and dislikes are not known. From the research reviewed here, the following list summarizes the main findings.


 Association rules from a multi-level dataset that were deemed hierarchically redundant can be recovered without loss through the use of the non-redundant basis set and the sets of frequent closed itemsets and generators, allowing all exact and all approximate rules to be recovered from the basic rule sets.

 Few interestingness measures are designed for association rules derived from multi-level datasets, and none appear to use the hierarchy / taxonomy structure within such a dataset. There also appears to be no existing measure that can determine the diversity of such association rules.

REFERENCES

[1] Goethals, B. (2005). In O. Maimon & L. Rokach (Eds.), The Data Mining and Knowledge Discovery Handbook (pp. 377-397). New York: Springer Science + Business Media.

[2] Xu, Y., Li, Y. (2006, 28-30 Nov) “Mining for Useful Association Rules Using the ATMS”, Paper presented at the International Conference on Computational Intelligence for Modelling, Control and Automation, Vienna, Austria.

[3] Xu, Y., & Li, Y. (2007, 6-8 Nov). Generating Concise Association Rules. Paper presented at the 16th ACM Conference on Information and Knowledge Management (CIKM'07), Lisbon, Portugal.

[4] Agrawal, R., Imieliński, T., & Swami, A. (1993). Mining Association Rules between Sets of Items in Large Databases. Paper presented at the ACM SIGMOD International Conference on Management of Data (SIGMOD'93), Washington D.C., USA.

[5] Zhao, Q., & Bhowmick, S. S. (2003). Association Rule Mining: A Survey. Nanyang Technological University, Singapore.

[6] Ceglar, A., & Roddick, J. F. (2006). Association Mining. ACM Computing Surveys (CSUR), 38(2).

[7] Xu, Y., & Li, Y. (2007, 6-8 Nov). Generating Concise Association Rules. Paper presented at the 16th ACM Conference on Information and Knowledge Management (CIKM'07), Lisbon, Portugal.

[8] Xu, Y., & Li, Y. (2007). Mining Non-Redundant Association Rules Based on Concise Bases. International Journal of Pattern Recognition and Artificial Intelligence, 21(5), 659-675.

[9] Gautam, P., & Pardasani, K. R. (2010). Algorithm for Efficient Multilevel Association Mining. International Journal on Computer Science and Engineering, 2(5), 1700-1704.

[10] Agrawal, R., & Srikant, R. (1994, Sep). Fast Algorithms for Mining Association Rules in Large Databases. Paper presented at the 20th International Conference on Very Large Data Bases, Santiago, Chile.

[11] Mao, R. (2001). Adaptive-FP: An Efficient and Effective Method for Multi-Level Multi-Dimensional Frequent Pattern Mining. Simon Fraser University.

[12] Park, J. S., Chen, M.-S., & Yu, P. S. (1997). Using a Hash-Based Method with Transaction Trimming for Mining Association Rules. IEEE Transactions on Knowledge and Data Engineering, 9(5), 813-825.

[13] Han, J., & Pei, J. (2000). Mining Frequent Patterns by Pattern-Growth: Methodology and Implications. ACM SIGKDD Explorations Newsletter, 2(2), 14-20.

[14] Grahne, G., & Zhu, J. (2003). Efficiently Using Prefix-trees in Mining Frequent Itemsets. Paper presented at the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03), Melbourne, FL.

[15] Das, A., Ng, W.-K., & Woon, Y.-K. (2001). Rapid Association Rule Mining. Paper presented at CIKM'01.

[16] Calders, T., & Goethals, B. (2007). Non-Derivable Itemset Mining. Data Mining and Knowledge Discovery, 14(1), 171-206

[17] Pasquier, N., Taouil, R., Bastide, Y., Stumme, G., & Lakhal, L. (2005). Generating a Condensed Representation for Association Rules. Journal of Intelligent Information Systems, 24(1), 29-60.

[18] Han, J., & Fu, Y. (1995, 11-15 Sep). Discovery of Multiple-Level Association Rules from Large Databases. Paper presented at the 21st International Conference on Very Large Databases, Zurich, Switzerland.

[19] Thakur, R. S., Jain, R. C., & Pardasani, K. R. (2006). Mining Level-Crossing Association Rules from Large Databases. Journal of Computer Science, 2(1), 76-81.

[20] Han, J., & Kamber, M. (2006). Mining Frequent Patterns, Associations, and Correlations. In D. D. Cerra (Ed.), Data Mining: Concepts and Techniques (2nd ed., pp. 227-283). San Francisco, USA: Morgan Kaufmann Publishers.

[21] Han, J., & Fu, Y. (1999). Mining Multiple-Level Association Rules in Large Databases. IEEE Transactions on Knowledge and Data Engineering, 11(5), 798 - 805.

[22] Han, J. (1995). Mining Knowledge at Multiple Concept Levels. Paper presented at the Conference on Information and Knowledge Management.

[23] Hong, T.-P., Lin, K.-Y., & Chien, B.-C. (2003). Mining Fuzzy Multiple-Level Association Rules from Quantitative Data. Applied Intelligence, 18(1), 79-90.

[24] Kaya, M., & Alhajj, R. (2004, 22-24 Jun). Mining multi-cross-level fuzzy weighted association rules. Paper presented at the 2nd International IEEE Conference on Intelligent Systems

[25] Zaki, M. J. (2004). Mining Non-Redundant Association Rules. Data Mining and Knowledge Discovery, 9, 223-248.

[26] Ganter, B., & Wille, R. (1999). Formal Concept Analysis: Mathematical Foundations. Springer-Verlag.

Amit Kumar Chandanan received his B.E. (Computer Science and Engineering) from IT GGU, Bilaspur, India, and his M.Tech (IT) from SATI, Vidisha, India, and is pursuing a PhD in Computer Science & Information Technology at Jayoti Vidyapeeth Women's University. He has published three papers in international conferences of the IEEE, the IEEE Computer Society and Springer. He is presently working as Assistant Professor and Head of the Department of Computer Science & Engineering at Hitkarini College of Engineering & Technology, Jabalpur (MP), India.

Manoj Kr. Shukla is Associate Professor in the Department of Computer Science & Engineering, SDGI. Dr. Shukla obtained his B.Tech, M.Tech and Ph.D in the area of Computer Science. He is a member of ISTE, CSI, IETE, IANG, UACEE, WSEAS, IACSIT and ACM, among others. He has been an editorial board member for a number of premier conferences and journals, including the American Journal of Database Theory and Application, the International Journal of Scientific and Engineering Research, IJAIS, IJCIIS, IJCST, IJETTCS, IJCSE, WASET, CSJEERS and IJACT.
