
Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble

Dr. Hongwei “Patrick” Yang

Educational Policy Studies & Evaluation, College of Education

University of Kentucky, Lexington, KY

Presented at the 2014 Modern Modeling Methods conference

Overview

• The study demonstrates predictive data mining models under model ensemble in the context of analyzing large data

• Data mining is usually defined as the data-driven process of discovering meaningful hidden patterns in large amounts of data through automatic as well as manual means

• Many industries use data mining to address business problems, such as bankruptcy prediction, risk management, and fraud detection

• Such applications typically rely on predictive data mining models as learning machines, with a primary focus on making good predictions

• Among the many types of predictive data mining models are decision trees, neural networks, and (traditional) regression models:
  • Decision tree: Identify the most significant split of the outcome at each layer
  • Neural network: Model nonlinear associations

• For each of the models/learning machines presented above, the outcome can be either categorical or numerical
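The decision-tree idea of searching for the best split at each layer can be sketched for a numerical outcome as follows. This is a minimal illustration, not the study's procedure: it assumes a single predictor and scores each candidate split by the reduction in the sum of squared errors, whereas the slides describe choosing the "most significant" split, which usually implies a statistical test. The function names and the data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, sse_after_split) for the best binary split x <= threshold.

    Assumption: squared-error reduction stands in for a significance test.
    """
    pairs = sorted(zip(xs, ys))
    best = (None, sse(ys))  # baseline: no split at all
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical data: two clearly separated groups of cases
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.5, 4.5, 20.0, 21.0, 19.0]
threshold, err = best_split(xs, ys)
```

A full tree would apply the same search recursively within each resulting branch, over all available predictors.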

• Model ensemble techniques, on the other hand, have recently become popular thanks to the growing power of computation

• Bagging and boosting are two of the most popular ensemble techniques
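As a rough illustration of bagging (Breiman, 1996), the sketch below draws bootstrap samples of the cases, fits one base model per sample, and averages the resulting predictions. It is a generic sketch under stated assumptions, not the procedure used in the study: the base learner is a hypothetical one-predictor least-squares line, and the data, the number of models, and the seed are made up.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample: every x is identical
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def bagged_fit(xs, ys, n_models=25, seed=0):
    """Fit one base model per bootstrap sample of the cases."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw n case indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagged_predict(models, x):
    """Average the predictions of all base models at x."""
    return sum(a + b * x for a, b in models) / len(models)

# Hypothetical noisy data scattered around y = 2x + 1
xs = list(range(10))
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]
models = bagged_fit(xs, ys)
prediction = bagged_predict(models, 5.0)
```

Bagging tends to help most with unstable base learners such as decision trees; a straight line is used here only to keep the example short.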

• Model ensemble techniques are designed to create a model ensemble/committee containing multiple component/base models

• The predictions of the committee of models are averaged or pooled in a certain manner to improve the stability and accuracy of predictions
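Two common ways of pooling a committee's predictions are sketched below: averaging for a numerical outcome and majority voting for a categorical one. The slides do not say which pooling rule the study used, so both functions, and the three sets of base-model predictions, are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean

def average_predictions(per_model_preds):
    """Pool numerical predictions: average across models, case by case."""
    return [mean(case) for case in zip(*per_model_preds)]

def majority_vote(per_model_labels):
    """Pool categorical predictions: most frequent label across models."""
    return [Counter(case).most_common(1)[0][0] for case in zip(*per_model_labels)]

# Hypothetical predictions from three base models for four cases
reg_pred = [10.0, 12.0, 9.0, 15.0]
tree_pred = [11.0, 13.0, 8.0, 14.0]
nn_pred = [12.0, 11.0, 10.0, 16.0]
ensemble_pred = average_predictions([reg_pred, tree_pred, nn_pred])

# Hypothetical class labels from the same three models for three cases
reg_lab = ["yes", "no", "no"]
tree_lab = ["yes", "yes", "no"]
nn_lab = ["no", "no", "yes"]
ensemble_lab = majority_vote([reg_lab, tree_lab, nn_lab])
```

Averaging predicted probabilities instead of voting on labels is another option for a categorical outcome; it keeps a continuous score that ranking-based fit statistics can use.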

• Model ensemble techniques can be incorporated into many types of predictive models/learning machines (tree, neural network, regression, etc.)

• Ensemble-based modeling can also be combined with common feature/subset selection procedures (genetic algorithm, stepwise method, all-possible-subsets, etc.)

Numerical examples

• To demonstrate the effectiveness of predictive data mining in discovering meaningful information from large data, the study selects three commonly used types of predictive models and applies them to two large-scale applications

• To further improve the predictions from each type of model, model ensemble is implemented during the modeling process to pool predictions from individual component models

• For comparison purposes, all models are also fitted without creating any model ensemble

• The models are also each evaluated for goodness-of-fit and performance at the final stage using various fit statistics, including average squared error, ROC index, misclassification rate, Gini coefficient, and K-S statistic, as applicable

• The entire analysis is performed in SAS Enterprise Miner 7.1

• Example one: Physicochemical properties of protein tertiary structure data
  • A numerical outcome: 45,730 cases
• Example two: Bank marketing data
  • A categorical outcome: 41,188 cases

• Both data sets are retrieved from the UC Irvine (UCI) Machine Learning Repository

Example one: Numerical outcome

Table 1. Comparison of Models based on Training Data under a Numerical Outcome.

Model Description  Average Squared Error  Root Average Squared Error  Maximum Absolute Error
EnRegTreeNN        21.338                 4.619                       15.000
EnReg              22.874                 4.783                       14.818
EnNN               23.122                 4.809                       16.556
EnTree             25.193                 5.019                       16.131
NN                 23.591                 4.857                       19.663
Reg                23.574                 4.855                       19.668
Tree               24.103                 4.910                       17.412
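The three fit statistics in Table 1 can be computed from actual and predicted values roughly as follows. This is a sketch that assumes the average squared error divides the sum of squared errors by the number of cases; the function name and the toy data are hypothetical.

```python
import math

def fit_statistics(actual, predicted):
    """Average squared error, its square root, and the maximum absolute error."""
    errors = [a - p for a, p in zip(actual, predicted)]
    ase = sum(e * e for e in errors) / len(errors)  # assumed divisor: number of cases
    return {
        "ase": ase,
        "rase": math.sqrt(ase),
        "max_abs_error": max(abs(e) for e in errors),
    }

# Hypothetical actual and predicted values for four cases
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 9.0, 9.0]
stats = fit_statistics(actual, predicted)

# Consistency check against Table 1: the root average squared error is the
# square root of the average squared error (EnRegTreeNN row).
table_check = round(math.sqrt(21.338), 3)
```

The same square-root relationship holds for every row of Table 1, so the first two columns always rank the models identically; only the maximum absolute error can order them differently.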

• Ensemble models tend to be more effective in reducing errors, although this is not guaranteed (in Table 1, EnTree has a higher average squared error than Tree)
  • Average squared error: Lower is better
  • Root average squared error: Lower is better
  • Maximum absolute error: Lower is better

Example two: Categorical outcome

Table 2. Comparison of Models based on Training Data under a Categorical Outcome.

Model              Root Average   Misclassification  ROC    Gini         Kolmogorov-Smirnov  Bin-Based Two-Way                      Cumulative  Cumulative Percent
Description        Squared Error  Rate               Index  Coefficient  Statistic           Kolmogorov-Smirnov Statistic  Gain     Lift        Captured Response
EnRegTreeNN        0.237          0.078              0.947  0.894        0.780               0.772                         504.305  6.043       60.541
EnReg              0.241          0.081              0.935  0.871        0.719               0.717                         455.744  5.557       55.676
EnNN               0.252          0.086              0.919  0.838        0.682               0.681                         428.767  5.288       52.973
EnTree             0.270          0.101              0.801  0.602        0.579               0.576                         395.325  4.953       49.623
Tree               0.254          0.090              0.900  0.800        0.697               0.692                         441.595  5.416       54.179
NN                 0.261          0.098              0.912  0.823        0.675               0.670                         400.087  5.001       50.027
Reg                0.261          0.097              0.912  0.823        0.668               0.666                         408.710  5.087       50.889

• Ensemble models typically show better discriminatory power than the individual models, as indicated by each criterion (EnTree is an exception in Table 2)
  • Misclassification rate: Lower is better
  • ROC index: Higher is better
  • Gini coefficient: Higher is better
  • K-S statistic: Higher is better
  • Cumulative lift: Higher is better
  • Cumulative percent captured response: Higher is better
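Several of these criteria can be sketched for a binary outcome as below. The sketch assumes each model outputs a score where higher means more likely to be an event, uses a 0.5 cutoff for classification, and computes the Gini coefficient as 2 × ROC index − 1, a relationship that the values in Table 2 are consistent with (for example, 2 × 0.947 − 1 = 0.894). The function names and the data are hypothetical, and the pairwise loops are written for clarity rather than for very large data sets.

```python
def roc_index(labels, scores):
    """Share of (event, non-event) pairs in which the event scores higher; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ks_statistic(labels, scores):
    """Largest gap between the cumulative score distributions of events and non-events."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(1 for y, s in zip(labels, scores) if y == 1 and s <= t) / n_pos
        cdf_neg = sum(1 for y, s in zip(labels, scores) if y == 0 and s <= t) / n_neg
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

def misclassification_rate(labels, scores, cutoff=0.5):
    """Share of cases whose predicted class (score >= cutoff) differs from the label."""
    wrong = sum((s >= cutoff) != (y == 1) for y, s in zip(labels, scores))
    return wrong / len(labels)

# Hypothetical labels (1 = event) and model scores for eight cases
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
auc = roc_index(labels, scores)
gini = 2 * auc - 1
ks = ks_statistic(labels, scores)
error_rate = misclassification_rate(labels, scores)
```

Because the Gini coefficient is a linear function of the ROC index under this definition, the two columns rank the models identically; the K-S statistic and the misclassification rate can order them differently.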


Conclusions

• The study presents some initial evidence for the effectiveness of model ensemble in improving the performance of an individual learning machine (model) of a given type

• The study needs to be supplemented with additional information on the use of (real) bagging and boosting in improving the performance of individual learning machines

• The study provides applied researchers with more options beyond traditional regression modeling when reliable predictions are needed in their research

• The study serves as the foundation for a future research topic which adds feature selection to predictive data mining modeling under model ensemble for analyzing very large data sets


References

Ao, S. (2008). Data mining and applications in Genomics. Berlin, Heidelberg, Germany: Springer Science+Business Media.

Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Barutcuoglu, Z., & Alpaydin, E. (2003). A comparison of model aggregation methods for regression. In O. Kaynak, E. Alpaydin, E. Oja, & L. Xu (Eds.), Artificial Neural Networks and Neural Information Processing - ICANN/ICONIP 2003 (pp. 76–83). NYC, NY: Springer.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.

Cerrito, P. B. (2006). Introduction to data mining: Using SAS Enterprise Miner. Cary, NC: SAS Institute Inc.

Drucker, H. (1997). Improving regressors using boosting techniques. Proceedings of the 14th International Conference on Machine Learning, 107-115.

Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121, 256-285.

Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference, 148-156.

Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55, 119-139.

Hill, C. M., Malone, L. C., & Trocine, L. (2004). Data mining and traditional regression. In H. Bozdogan (Ed.), Statistical data mining and knowledge discovery (pp. 233-249). London, UK: Chapman and Hall/CRC.

Larose, D. T. (2005). Discovering knowledge in data: An introduction to data mining. Hoboken, NJ: John Wiley & Sons, Inc.

Liu, B., Cui, Q., Jiang, T., & Ma, S. (2004). A combinational feature selection and ensemble neural network method for classification of gene expression data. BMC Bioinformatics, 5, 136.

Oza, N. C. (2005). Ensemble Data Mining Methods. In J. Wang (Ed.), Encyclopedia of Data Warehousing and Mining (pp. 448-453). Hershey, PA: Information Science Reference.

Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5, 197-227.

Schapire, R. E. (2002). The boosting approach to machine learning: An overview. In D. D. Denison, M. H. Hansen, C. C.
