Post Mining based Pattern Filtering Scheme with Attribute
Association
Ms. N.M. Indumathi1, Ms. C. Nithya2
ME, Assistant Professor, Department of Computer Science
1,2 Nandha Engineering College (Autonomous), Erode, Tamilnadu, India
ABSTRACT
Association rule mining methods are used to discover frequent patterns in large databases. Support and confidence measures are compared during pattern discovery, and frequent patterns are identified using minimum support and minimum confidence values. A huge number of patterns is derived from high-dimensional data. Post mining methods are applied to reduce the collection of patterns produced by the pattern discovery process. Weighted rule mining methods are adapted to mine frequent patterns or rules using attribute weight values, and the post mining process is tuned to filter the patterns produced by weighted rule mining. An ontology is employed to discover patterns with concept relationship values. The system is tuned to analyze binary, categorical and continuous data values. Data validation and verification methods are integrated with the Association Rule Interactive post-Processing using Schemas and Ontologies (ARIPSO) mechanism. The system is upgraded to support weighted pattern discovery and filtering, and the post mining mechanism also handles pattern ranking.
1. Introduction
In statistical and data analysis, it is essential that the business analyst knows what the variables are before analysis can start. The business analyst needs knowledge discovery technology and tools. Knowledge discovery has its roots in artificial intelligence and machine learning. Some definitions of knowledge discovery are given below:
Knowledge discovery may be a nontrivial extraction of implicit, previously unknown and potentially useful information from the data. Knowledge discovery may be the data search process, without stating in advance a hypothesis or question and still finding either unexpected or interesting information in relationships and patterns among its data elements or important business rules in the full data searched and analyzed. Knowledge discovery may mean to uncover previously unknown business facts in the gigabytes of data in the data warehouse or data mart.
Business managers and analysts are always seeking new and additional business insights so that crucial business decisions, which have significant impact on the health of a business, can be improved. Using the traditional techniques of business queries and data analysis requires asking the right questions. Knowledge discovery technology determines by itself the questions to ask and then keeps on asking questions, digging deeper to unearth the nuggets of knowledge the business seeks. Business analysts do not have the time, attention span and stamina to ferret out all the implicit relationships and patterns that exist in the data warehouse.
Knowledge discovery is aimed at sifting through the vast amount of data in the data warehouse, searching for frequently recurring patterns, detecting trends and unearthing facts. Knowledge discovery systems try to discover facts or knowledge with minimal, if any, guidance or direction from the analyst, all in the shortest amount of time. In knowledge discovery, large amounts of data warehouse or data mart data are inspected, and the facts or knowledge discovered are presented to the business analyst. The business analyst then exercises business knowledge and domain expertise to discern useful facts from those that are not useful. This is the ideal combination of people and computers.
Visualization and browse tools that aid in exploring and analyzing previously mined data further enhance the value of the knowledge discovery effort. A knowledge discovery activity is successful when an experienced business executive or manager staring at a newly discovered business fact says, "I didn't know that."
2. Related Works
The work on entity resolution (ER) can be broadly divided into three categories.
Pairwise ER. Most works on ER focus on record matching, which involves comparing record pairs and identifying whether they match. A major part of the work on record matching focuses on similarity functions [2]. To capture string variations, a transformation-based framework for record matching has been proposed. Some machine-learning-based approaches can identify matching strings that are syntactically far apart. Similarity based on record relationships has also been proposed to solve the people identification problem.
Since records are not compared with each other in our work, our work is orthogonal to record matching [9]. String similarity functions can be applied to the fuzzy match operator in ER-rules. For example, given a string s, we say s ≈ "wei wang" if the edit distance between s and "wei wang" is smaller than a given threshold. Decision trees are employed to learn record matching rules. However, decision trees cannot be used to discover ER-rules, because the domain of the right-hand side of record matching rules is {yes, no}, while the domain of the right-hand side of ER-rules is an entity set.
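The fuzzy match operator described above can be illustrated with a short sketch. It assumes the edit distance is the Levenshtein distance (unit-cost insertions, deletions and substitutions); the function names `edit_distance` and `fuzzy_match` are hypothetical and are not taken from the cited work.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_match(s: str, target: str, threshold: int) -> bool:
    """True when s is within (strictly less than) `threshold` edits of target."""
    return edit_distance(s, target) < threshold
```

With a threshold of 2, a string that differs from "wei wang" by one character matches, while a string with two swapped characters (two substitutions under this distance) does not.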
Non-pairwise ER. The research on non-pairwise ER includes clustering strategies [7] and classifiers. Most strategies solve ER based on the relationship graph among records, modeling the records as nodes and the relationships as edges. Machine learning approaches that use global information have also been proposed to solve ER effectively. However, these methods are not suitable for massive data because of efficiency issues. We choose a representative work for comparison.
Scaling. Some other works treat the ER algorithm as a black box and focus on developing a scalable framework for ER. Indexing techniques used for ER have been surveyed by Christen, and Whang and Garcia-Molina [4] focus on how to update ER results efficiently when the ER logic evolves. These techniques are orthogonal to our work and can be applied to accelerate our rule-based ER algorithm.
Note that, among the existing works on pairwise ER, rule-based approaches [8] are closer to our work. These rules differ from our work in that they focus on determining whether two records refer to the same entity, while our work focuses on determining whether a record refers to an existing entity. Our preliminary work [5] proposed rule-based ER and rule discovery strategies. The preliminary work only proposed a heuristic method for rule discovery, without efficiency or accuracy guarantees. In this paper, we propose a new definition of rules and effective algorithms for rule discovery.
3. Rule Selection using Post Mining Approach
Association rule mining is one of the most important tasks in Knowledge Discovery in Databases. It aims at discovering implicative tendencies among sets of items in transaction databases that can be valuable information for the decision-maker.
An association rule is defined as the implication $X \rightarrow Y$, described by two interestingness measures, support and confidence, where $X$ and $Y$ are sets of items and $X \cap Y = \emptyset$.
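The two measures can be made concrete with a minimal sketch. It assumes the standard definitions: the support of a rule X → Y is the fraction of transactions containing every item of X ∪ Y, and its confidence is that support divided by the support of X. The toy transactions and function names are hypothetical.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]

# Hypothetical toy database of four market-basket transactions.
TRANSACTIONS: List[Transaction] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Transaction]) -> float:
    """support(X u Y) / support(X); defined as 0.0 when X never occurs."""
    support_x = support(x, transactions)
    if support_x == 0:
        return 0.0
    return support(frozenset(x) | frozenset(y), transactions) / support_x
```

For the rule bread → milk in this toy database, the support is 2/4 and the confidence is (2/4) / (3/4) = 2/3.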
Apriori is the first algorithm proposed in the association rule mining field, and many other algorithms were derived from it. Starting from a database, it extracts all association rules satisfying minimum thresholds of support and confidence. It is well known that mining algorithms can discover a prohibitive number of association rules [11]; for instance, thousands of rules are extracted from a database of several dozen attributes and several hundred transactions. Furthermore, as suggested by Silberschatz and Tuzhilin, valuable information is often represented by rare (low-support) and unexpected association rules which are surprising to the user.
It is crucial to help the decision-maker with an efficient technique for reducing the number of rules.
To overcome this drawback, several methods were proposed in the literature. On the one hand, different algorithms were introduced to reduce the number of itemsets by generating closed, maximal or optimal itemsets, and several algorithms reduce the number of rules by using nonredundant rules or pruning techniques. On the other hand, post processing methods can improve the selection of discovered rules. Different complementary post processing methods may be used, such as pruning, summarizing, grouping, or visualization. Pruning consists of removing uninteresting or redundant rules. In summarizing, concise sets of rules are generated. Groups of rules are produced in the grouping process, and visualization improves the readability of a large number of rules by using adapted graphical representations.
Most existing post processing methods are based on statistical information in the database. Since rule interestingness strongly depends on user knowledge and goals, these methods do not guarantee that interesting rules will be extracted [1]. For instance, if the user looks for unexpected rules, all the already known rules should be pruned. Or, if the user wants to focus on specific schemas of rules, only this subset of rules should be selected. Moreover, rule post processing methods must be based on strong interactivity with the user.
The representation of user knowledge is an important issue. The more flexible, expressive, and accurate the formalism in which the knowledge is represented, the more efficient the rule selection. In the Semantic Web field, ontology is considered the most appropriate representation to express the complexity of user knowledge, and several specification languages have been proposed.
An interactive post processing approach, ARIPSO (Association Rule Interactive post-Processing using Schemas and Ontologies), is proposed to prune and filter discovered rules. First, Domain Ontologies are used to strengthen the integration of user knowledge in the post processing task. Second, the Rule Schema formalism is introduced by extending the specification language proposed by Liu et al. for user beliefs and expectations toward the use of ontology concepts. Furthermore, an interactive and iterative framework is designed to assist the user throughout the analysis task. The interactivity of the approach relies on a set of rule mining operators defined over the Rule Schemas in order to describe the actions that the user can perform.
4. Problem Statement
Post mining schemes are used to filter derived rules. Pruning, summarizing, grouping, and visualization techniques are used in the post mining process. Uninteresting or redundant rules are removed by pruning. Concise sets of rules are generated by summarizing. Grouping produces groups of rules. Visualization presents rules in a graphical format. The ARIPSO mechanism, together with ontologies, is used to prune and filter discovered rules in the post mining process. The existing system has the following drawbacks: quantitative attributes are not considered, weighted rule mining is not supported, no rule ranking scheme is available, and rule validation is not provided.
5. Weighted Rule Mining and Rule Selection Process
The ontology based rule selection scheme is designed to perform rule filtering and weighted rule mining. The system supports both frequency based and weight based rule mining. The association rule mining model is used to detect rules based on the attribute frequencies. The weighted rule mining model uses preassigned weights for the attribute values. Support and confidence values are estimated in the association rule mining process. Weighted support and weighted confidence are used in the weighted rule mining process. Attribute weight and frequency values are used to estimate the weighted support and weighted confidence.
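The text does not give formulas for weighted support and weighted confidence, so the sketch below adopts one common definition purely as an assumption: each transaction is weighted by the mean of its item weights, and the weighted support of an itemset is the share of total transaction weight carried by the transactions that contain it. The item weights, data and function names are hypothetical, and the authors' system may compute these values differently.

```python
from typing import Dict, FrozenSet, Iterable, List

Transaction = FrozenSet[str]

# Hypothetical preassigned item weights and toy transactions (all non-empty).
WEIGHTS: Dict[str, float] = {"bread": 0.2, "milk": 0.4, "butter": 0.9}
TRANSACTIONS: List[Transaction] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def transaction_weight(t: Transaction, weights: Dict[str, float]) -> float:
    """Assumed definition: mean weight of the items in a non-empty transaction."""
    return sum(weights[i] for i in t) / len(t)


def weighted_support(itemset: Iterable[str], transactions: List[Transaction],
                     weights: Dict[str, float]) -> float:
    """Weight of transactions containing the itemset over total weight."""
    items = frozenset(itemset)
    total = sum(transaction_weight(t, weights) for t in transactions)
    covered = sum(transaction_weight(t, weights)
                  for t in transactions if items <= t)
    return covered / total


def weighted_confidence(x: Iterable[str], y: Iterable[str],
                        transactions: List[Transaction],
                        weights: Dict[str, float]) -> float:
    """weighted_support(X u Y) / weighted_support(X); 0.0 if X never occurs."""
    ws_x = weighted_support(x, transactions, weights)
    if ws_x == 0:
        return 0.0
    both = frozenset(x) | frozenset(y)
    return weighted_support(both, transactions, weights) / ws_x
```

Under these assumed weights, the heavily weighted item butter has a weighted support of 0.6, compared with a frequency based support of 0.5, which shows how weights shift priority toward important attributes.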
with the semantic weight values. The semantic weights are used to filter the rules. The rule ranking process is also performed using the semantic weights. The system filters the rules that are selected with the minimum support and minimum confidence values.
The original formulation of association rules introduced the support-confidence measurement framework and reduced association rule mining to the discovery of frequent item sets. The following year a fast mining algorithm, Apriori, was proposed. Much effort has been dedicated to the classical association rule mining problem since then. Numerous algorithms have been proposed to extract the rules more efficiently. These algorithms strictly follow the classical measurement framework and produce the same results once the minimum support and minimum confidence are given. Weighted association rule mining (WARM) generalizes the traditional model to the case where items have weights. Ramkumar et al. introduced weighted support of association rules based on the costs assigned to both items and transactions. An algorithm called WIS was proposed to derive the rules that have a weighted support larger than a given threshold. Cai et al. defined weighted support in a similar way, except that they only took item weights into account. The definition broke the downward closure property. As a result, the proposed mining algorithm became more complicated and time consuming. Tao et al. provided another definition to retain the "weighted downward closure property."
Wang and Su proposed a novel approach to item ranking. A directed graph is created where nodes denote items and links represent association rules. A generalized version of HITS is applied to the graph to rank the items, where all nodes and links are allowed to have weights. However, the model is limited in that it only ranks items and does not provide a measure like weighted support to evaluate an arbitrary item set. Nevertheless, it may be the first successful attempt to apply link-based models to association rule mining.
6. Post Mining based Pattern Filtering Scheme
The concept relationship based weighted rule mining and filtering system is designed to perform post mining on derived rules. The ARIPSO scheme is enhanced with validation methods. The weighted rule mining and filtering process can be integrated with the ARIPSO scheme. Rank based concept relationship analysis can be provided to improve the post mining process. The system is designed to perform rule mining and rule selection [4]. An ontology is used to reduce the rules based on concept relationships. A weighted rule mining scheme is also integrated with the system. A rule validation process is included to verify the mined rules. The system is divided into five modules: Data Preprocess, Rule Mining Process, Weighted Rule Mining Process, Ontology Analysis and Rule Selection Process.
The data preprocess module is designed to perform noise elimination and candidate set preparation. Ontology construction and attribute analysis are performed in the ontology analysis module. Interesting rule selection is carried out in the rule mining process. Weighted rule mining is applied with attribute weight values. Post mining operations are carried out in the rule selection process.
6.1. Data Preprocess
The data preprocess module is used to normalize the data values. Noise elimination is performed to reduce redundant data values. Attribute names and values are extracted to build candidate sets. The frequency of each candidate set value is then estimated.
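A minimal sketch of this preprocessing step is given below. It makes several assumptions not stated in the text: records arrive as attribute/value dictionaries, normalization means trimming and lower-casing, noise elimination means dropping missing values, and each candidate item is encoded as an `attribute=value` string. All names and data are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Optional

# Hypothetical raw records with inconsistent casing, padding and a missing value.
RAW_ROWS: List[Dict[str, Optional[str]]] = [
    {"Outlook": " Sunny ", "Windy": "no"},
    {"Outlook": "sunny", "Windy": "NO"},
    {"Outlook": "rain", "Windy": "yes"},
    {"Outlook": "rain", "Windy": None},
]


def to_items(row: Dict[str, Optional[str]]) -> FrozenSet[str]:
    """Normalize one record into a set of 'attribute=value' candidate items."""
    items = set()
    for attr, value in row.items():
        if value is None or not str(value).strip():
            continue  # treat missing or blank values as noise and skip them
        items.add(f"{attr.strip().lower()}={str(value).strip().lower()}")
    return frozenset(items)


def preprocess(rows: List[Dict[str, Optional[str]]]) -> List[FrozenSet[str]]:
    """Convert raw records to transactions, dropping records left empty."""
    transactions = [to_items(row) for row in rows]
    return [t for t in transactions if t]


def item_frequencies(transactions: List[FrozenSet[str]]) -> Counter:
    """Count in how many transactions each candidate item occurs."""
    return Counter(item for t in transactions for item in t)


transactions = preprocess(RAW_ROWS)
frequencies = item_frequencies(transactions)
```

Duplicate records are deliberately kept, since repeated transactions contribute to the frequency counts that later drive support estimation.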
6.2. Rule Mining Process
The association rule mining tasks are carried out in the rule mining process. Candidate generation is performed with the attribute names and attribute values of each transaction. The item sets are prepared from the candidate set information. Frequency values are estimated for each item set, and support and confidence values are estimated from these frequencies. The interesting rule selection process is carried out on the estimated support and confidence values. Minimum support and minimum confidence values are used to filter the relevant rules.
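The steps above can be sketched as a small level-wise miner in the style of Apriori. This is an illustrative sketch, not the authors' implementation: the function names, toy data and thresholds are hypothetical, and no performance optimizations are attempted.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]

TRANSACTIONS: List[Itemset] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def frequent_itemsets(transactions: List[Itemset],
                      min_support: float) -> Dict[Itemset, float]:
    """Return every itemset whose support reaches min_support."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]  # level-1 candidates
    frequent: Dict[Itemset, float] = {}
    k = 1
    while current:
        level = {}
        for cand in current:
            sup = sum(1 for t in transactions if cand <= t) / n
            if sup >= min_support:
                level[cand] = sup
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep candidates whose (k-1)-subsets are all frequent.
        current = [c for c in joined
                   if all(frozenset(s) in level
                          for s in combinations(c, k - 1))]
    return frequent


def generate_rules(frequent: Dict[Itemset, float], min_confidence: float
                   ) -> List[Tuple[Itemset, Itemset, float, float]]:
    """Build rules (lhs, rhs, support, confidence) from frequent itemsets."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs_items in combinations(sorted(itemset), r):
                lhs = frozenset(lhs_items)
                conf = sup / frequent[lhs]  # subsets of frequent sets are frequent
                if conf >= min_confidence:
                    rules.append((lhs, itemset - lhs, sup, conf))
    return rules


frequent = frequent_itemsets(TRANSACTIONS, min_support=0.5)
rules = generate_rules(frequent, min_confidence=0.8)
```

On the toy data, five itemsets reach the minimum support of 0.5, and only the rule butter → bread passes the minimum confidence of 0.8.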
6.3. Weighted Rule Mining Process
Attribute weight and frequency values are used to estimate weighted support and weighted confidence values. The minimum support and minimum confidence values are used to filter the weighted rules.
6.4. Ontology Analysis
The ontology is a repository used to maintain the relationships between concepts and terms. Ontologies are maintained as XML documents, and the Resource Description Framework (RDF) is used to manage ontology values. The ontology is used to analyze attribute relationships: the attribute names of the transaction table are matched against ontology elements, and the relationships and their levels are extracted from this analysis.
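Extracting relationships and their levels from an XML-encoded ontology might look like the sketch below. The document layout is a simplified, hypothetical one (nested `concept` elements with a `name` attribute), not real RDF/XML, and the concept names are invented for illustration; only Python's standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

# Hypothetical ontology: nesting encodes the "is-a" relationship.
ONTOLOGY_XML = """
<ontology>
  <concept name="Product">
    <concept name="Food">
      <concept name="bread"/>
      <concept name="milk"/>
    </concept>
    <concept name="Household">
      <concept name="soap"/>
    </concept>
  </concept>
</ontology>
"""


def load_concepts(xml_text: str) -> Dict[str, Tuple[Optional[str], int]]:
    """Map each concept name to (parent name, level); roots have level 0."""
    concepts: Dict[str, Tuple[Optional[str], int]] = {}

    def walk(element: ET.Element, parent: Optional[str], level: int) -> None:
        for child in element.findall("concept"):  # direct children only
            name = child.get("name")
            concepts[name] = (parent, level)
            walk(child, name, level + 1)

    walk(ET.fromstring(xml_text), None, 0)
    return concepts


def ancestors(name: str,
              concepts: Dict[str, Tuple[Optional[str], int]]) -> List[str]:
    """List the concepts above `name`, nearest parent first."""
    chain = []
    parent = concepts[name][0]
    while parent is not None:
        chain.append(parent)
        parent = concepts[parent][0]
    return chain


concepts = load_concepts(ONTOLOGY_XML)
```

Two attributes can then be considered related when their ancestor chains share a concept, with the level indicating how specific that shared concept is.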
6.5. Rule Selection Process
The rule selection process is performed with ontology analysis and a pruning model. User assisted rule selection is also carried out to filter the rules. The system selects the rules in the post mining process. The ontology is used to extract relationships between the attributes. The rules are ranked with reference to the concept weight values.
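One way to picture concept based selection and ranking is the sketch below. The text does not define the scoring formula or the schema format, so everything here is an assumption: a schema is a pair of concepts (one for each side of the rule), a rule matches when each side contains at least one item of the requested concept, and a rule's score is the mean concept weight of its items. All names, mappings and weights are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str]]  # (left-hand side, right-hand side)

# Hypothetical item-to-concept mapping and concept weights.
ITEM_CONCEPT: Dict[str, str] = {"bread": "Food", "milk": "Food", "soap": "Household"}
CONCEPT_WEIGHT: Dict[str, float] = {"Food": 0.8, "Household": 0.3}

RULES: List[Rule] = [
    (frozenset({"soap"}), frozenset({"milk"})),
    (frozenset({"bread", "soap"}), frozenset({"milk"})),
    (frozenset({"bread"}), frozenset({"milk"})),
    (frozenset({"milk"}), frozenset({"soap"})),
]


def rule_score(rule: Rule) -> float:
    """Assumed score: mean concept weight over all items in the rule."""
    items = rule[0] | rule[1]
    return sum(CONCEPT_WEIGHT[ITEM_CONCEPT[i]] for i in items) / len(items)


def matches_schema(rule: Rule, lhs_concept: str, rhs_concept: str) -> bool:
    """True if each side has at least one item of the requested concept."""
    lhs, rhs = rule
    return (any(ITEM_CONCEPT[i] == lhs_concept for i in lhs)
            and any(ITEM_CONCEPT[i] == rhs_concept for i in rhs))


def select_rules(rules: List[Rule], lhs_concept: str,
                 rhs_concept: str) -> List[Rule]:
    """Keep rules matching the schema, ranked by descending score."""
    kept = [r for r in rules if matches_schema(r, lhs_concept, rhs_concept)]
    return sorted(kept, key=rule_score, reverse=True)


selected = select_rules(RULES, "Food", "Food")
```

With the schema Food → Food, two of the four rules are kept, and the rule containing only Food items ranks first because its mean concept weight is highest.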
7. Conclusion
Association rule mining techniques are used to mine the relationships between itemsets. Post mining models are used to filter the mined rules, and the rules are reduced in the rule selection process. The association rule mining technique uses frequency values, and the frequency based model assigns equal priority to all attributes. Weight values are assigned to attributes according to their importance. The frequency and weight values are used in the weighted rule mining process. The rule filtering scheme is constructed with the ontology for concept relationship analysis.
The optimal rule selection scheme is introduced to filter the rules using concept relationships. The post mining scheme is adapted to the weighted rule mining model. The support and confidence values are used to find the rules. In weighted rule mining, rules are selected using the weighted support and weighted confidence values. The concept relationships are analyzed using the ontology in the rule selection process.
The association rule mining and weighted rule mining results are compared in the experimental analysis. Weighted rule mining produces more accurate results than association rule mining. The rule selection process is also analyzed for both the association rule mining and weighted rule mining models. Weighted rule mining combined with the rule selection method filters redundant rules efficiently.
References
1. C. Marinica and F. Guillet, "Knowledge-based interactive postmining of association rules using ontologies," IEEE Trans. Knowl. Data Eng., vol. 22, no. 6, June 2010.
2. X. Fan, J. Wang, X. Pu, L. Zhou, and B. Lv, "On graph-based name disambiguation," J. Data Inf. Quality, vol. 2, no. 2, p. 10, 2011.
3. R. Vibhor, N. D. Nilesh, and N. G. Minos, "Large-scale collective entity matching," Proc. VLDB Endowment, vol. 4, no. 4, pp. 208–218, 2011.
4. S. E. Whang and H. Garcia-Molina, "Entity resolution with evolving rules," Proc. VLDB Endowment, vol. 3, no. 1, pp. 1326–1337, 2010.
5. L. Li, J. Li, H. Wang, and H. Gao, "Context-based entity description rule for entity resolution," in Proc. 20th ACM Int. Conf. Inf. Knowl. Manag., 2011, pp. 1725–1730.
6. H. Kopcke, A. Thor, and E. Rahm, "Evaluation of entity resolution approaches on real-world match problems," Proc. VLDB Endowment, vol. 3, no. 1, pp. 484–493, 2010.
7. I. Bhattacharya and L. Getoor, "Collective entity resolution in relational data," Proc. VLDB Endowment, vol. 3, no. 1, p. 5, 2010.
8. M. Herschel, F. Naumann, S. Szott, and M. Taubert, "Scalable iterative graph duplicate detection," IEEE Trans. Knowl. Data Eng., vol. 24, no. 11, pp. 2094–2108, Nov. 2011.
Multiple Sequences," IEEE Trans. Knowl. Data Eng., vol. 27, no. 8, Aug. 2015.
10. A. Calzada, J. Liu, H. Wang, and A. Kashyap, "A new dynamic rule activation method for extended belief rule-based systems," IEEE Trans. Knowl. Data Eng., vol. 27, no. 4, Apr. 2015.