Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble
Dr. Hongwei “Patrick” Yang
Educational Policy Studies & Evaluation, College of Education
University of Kentucky, Lexington, KY
Presented at the 2014 Modern Modeling Methods conference
Overview
The study demonstrates predictive data mining models under model ensemble in the context of analyzing large data sets
Data mining is usually defined as the data-driven process of discovering meaningful hidden patterns in large amounts of data through automatic as well as manual means
Overview
Many industries use data mining to address business problems such as bankruptcy prediction, risk management, and fraud detection
Such applications typically rely on predictive data mining models as learning machines, with a primary focus on making good predictions
Modern Modeling Methods Conference 3
2014
Overview
Predictive data mining models include decision trees, neural networks, and (traditional) regression models:
Decision tree: Identifies the most significant split of the outcome at each layer
Neural network: Models nonlinear associations
For each of the models/learning machines presented above, the outcome can be either categorical or numerical
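The split search that a decision tree performs at each layer can be sketched for a numerical outcome and a single numerical input. This is a minimal illustration, not the procedure used in the study: the criterion here (reduction in the sum of squared errors) is a swapped-in assumption standing in for whatever significance test the software applies, and the function names and toy data are hypothetical.

```python
# Sketch: find the best single split of one numeric input for a numeric outcome.
# Assumption: "best" means the lowest total sum of squared errors (SSE) in the
# two child nodes; the study's own split criterion is not stated in the source.

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, sse_after_split) minimizing the children's total SSE."""
    pairs = sorted(zip(x, y), key=lambda p: p[0])
    best_threshold, best_sse = None, sse(y)
    for i in range(1, len(pairs)):
        left_x, right_x = pairs[i - 1][0], pairs[i][0]
        if left_x == right_x:
            continue  # no threshold can separate equal input values
        threshold = (left_x + right_x) / 2
        left = [yi for _, yi in pairs[:i]]
        right = [yi for _, yi in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_threshold, best_sse = threshold, total
    return best_threshold, best_sse


# Toy data (hypothetical): the outcome jumps once x exceeds 3.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 5.1, 4.8, 5.0]
threshold, remaining_sse = best_split(x, y)
print(threshold, round(remaining_sse, 4))
```

A full tree would repeat this search over every input and recurse into each child node until a stopping rule is met.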
Overview
Model ensemble techniques, on the other hand, have recently become popular thanks to growing computational power
Bagging and boosting are two of the most popular ensemble techniques
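As a rough illustration of bagging, the sketch below fits one simple regression line per bootstrap resample and averages the resulting predictions. It is a sketch under stated assumptions rather than the study's implementation: the base learner (a one-input least-squares line), the number of component models, and the toy data are all hypothetical. Boosting is not shown; it differs in that component models are fitted sequentially, with later models concentrating on the cases earlier models predicted poorly.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for one input; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        return mean_y, 0.0  # degenerate resample: fall back to the mean
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def bagged_lines(x, y, n_models=25, seed=0):
    """Fit one line per bootstrap resample (cases drawn with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([x[i] for i in idx], [y[i] for i in idx]))
    return models


def bagged_predict(models, x_new):
    """Average the component predictions (the pooling step of bagging)."""
    preds = [intercept + slope * x_new for intercept, slope in models]
    return sum(preds) / len(preds)


# Toy data (hypothetical): y is roughly 2 * x.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0.1, 2.2, 3.9, 6.1, 8.0, 9.8, 12.2, 13.9]
models = bagged_lines(x, y)
prediction = bagged_predict(models, 4.0)
print(round(prediction, 2))
```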
Overview
Model ensemble techniques are designed to create a model ensemble/committee containing multiple component/base models
The predictions of the models in the committee are averaged or otherwise pooled to improve the stability and accuracy of the final prediction
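The pooling step can be sketched as a simple average across component models. This is one possible pooling rule, assumed here for illustration: numerical predictions are averaged case by case, and for a categorical outcome the predicted event probabilities are averaged and then classified at a cutoff. The function names, the 0.5 cutoff, and the toy predictions are hypothetical.

```python
def pool_numeric(predictions_by_model):
    """Average numeric predictions case by case across component models."""
    n_models = len(predictions_by_model)
    return [sum(case) / n_models for case in zip(*predictions_by_model)]


def pool_probabilities(probabilities_by_model, cutoff=0.5):
    """Average event probabilities across models, then classify at a cutoff."""
    averaged = pool_numeric(probabilities_by_model)
    labels = [int(p >= cutoff) for p in averaged]
    return averaged, labels


# Hypothetical numeric predictions from three component models for four cases.
reg_pred = [10.0, 12.0, 8.0, 15.0]
tree_pred = [11.0, 11.0, 9.0, 14.0]
nn_pred = [9.0, 13.0, 7.0, 16.0]
pooled = pool_numeric([reg_pred, tree_pred, nn_pred])

# Hypothetical event probabilities from the same three models for three cases.
averaged, labels = pool_probabilities(
    [[0.9, 0.2, 0.6], [0.7, 0.4, 0.3], [0.8, 0.3, 0.5]]
)
print(pooled, labels)
```

Unweighted averaging treats every component model equally; weighted averaging or voting are alternative pooling rules.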
Overview
Model ensemble techniques can be incorporated into many types of predictive models/learning machines (tree, neural network, regression, etc.)
Ensemble-based modeling can also be combined with common feature/subset selection procedures (genetic algorithm, stepwise method, all-possible-subsets, etc.)
Numerical examples
To demonstrate the effectiveness of predictive data mining in discovering meaningful information from large data sets, the study selects three commonly used types of predictive models and applies them to two large-scale applications
Numerical examples
To further improve the predictions from each type of model, model ensemble is implemented during the modeling process to pool predictions from the individual component models
For comparison purposes, all models are also fitted without creating any model ensemble
Numerical examples
In addition, each model is evaluated for goodness-of-fit and performance at the final stage using fit statistics including average squared error, ROC index, misclassification rate, Gini coefficient, and K-S statistic, as applicable
The entire analysis is performed in SAS Enterprise Miner 7.1
Numerical examples
Example one: Physicochemical properties of protein tertiary structure data
A numerical outcome: 45,730 cases
Example two: Bank marketing data
A categorical outcome: 41,188 cases
Both data sets are retrieved from the UC Irvine (UCI) Machine Learning Repository
Example one: Numerical outcome
Table 1. Comparison of Models Based on Training Data under a Numerical Outcome.
Model Description   Average Squared Error   Root Average Squared Error   Maximum Absolute Error
EnRegTreeNN         21.338                  4.619                        15.000
EnReg               22.874                  4.783                        14.818
EnNN                23.122                  4.809                        16.556
EnTree              25.193                  5.019                        16.131
NN                  23.591                  4.857                        19.663
Reg                 23.574                  4.855                        19.668
Tree                24.103                  4.910                        17.412
Example one: Numerical outcome
Ensemble models tend to be more effective in reducing errors, although an improvement is not guaranteed (e.g., EnTree has a higher average squared error than Tree in Table 1)
Average squared error: Lower is better
Root average squared error: Lower is better
Maximum absolute error: Lower is better
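The three error measures listed above can be computed as in the sketch below. It assumes that the average squared error divides the sum of squared errors by the number of cases and that the root average squared error is its square root; the values in Table 1 are consistent with the latter (for example, the square root of 21.338 is about 4.619). The function names and toy data are hypothetical.

```python
import math


def average_squared_error(actual, predicted):
    """Mean of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_average_squared_error(actual, predicted):
    """Square root of the average squared error (same units as the outcome)."""
    return math.sqrt(average_squared_error(actual, predicted))


def maximum_absolute_error(actual, predicted):
    """Largest single absolute prediction error."""
    return max(abs(a - p) for a, p in zip(actual, predicted))


# Toy data (hypothetical): errors are 1, 0, -2, 1.
actual = [3.0, 5.0, 8.0, 10.0]
predicted = [2.0, 5.0, 10.0, 9.0]
ase = average_squared_error(actual, predicted)        # (1 + 0 + 4 + 1) / 4
rase = root_average_squared_error(actual, predicted)
max_err = maximum_absolute_error(actual, predicted)
print(ase, round(rase, 4), max_err)
```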
Example two: Categorical outcome
Table 2. Comparison of Models Based on Training Data under a Categorical Outcome.
Model Description   RASE    MISC    ROC     Gini    KS      BinKS   Gain      Lift    Captured
EnRegTreeNN         0.237   0.078   0.947   0.894   0.780   0.772   504.305   6.043   60.541
EnReg               0.241   0.081   0.935   0.871   0.719   0.717   455.744   5.557   55.676
EnNN                0.252   0.086   0.919   0.838   0.682   0.681   428.767   5.288   52.973
EnTree              0.270   0.101   0.801   0.602   0.579   0.576   395.325   4.953   49.623
Tree                0.254   0.090   0.900   0.800   0.697   0.692   441.595   5.416   54.179
NN                  0.261   0.098   0.912   0.823   0.675   0.670   400.087   5.001   50.027
Reg                 0.261   0.097   0.912   0.823   0.668   0.666   408.710   5.087   50.889
RASE = Root Average Squared Error; MISC = Misclassification Rate; ROC = ROC Index; Gini = Gini Coefficient; KS = Kolmogorov-Smirnov Statistic; BinKS = Bin-Based Two-Way Kolmogorov-Smirnov Statistic; Lift = Cumulative Lift; Captured = Cumulative Percent Captured Response
Example two: Categorical outcome
Ensemble models typically show better discriminatory power than the individual models, as indicated by each criterion, although EnTree trails Tree on every criterion in Table 2
Misclassification rate: Lower is better
ROC index: Higher is better
Gini coefficient: Higher is better
K-S statistic: Higher is better
Cumulative lift: Higher is better
Cumulative percent captured response: Higher is better
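The criteria listed above can be sketched for a binary outcome coded 1 (event) and 0 (non-event). This is an illustrative sketch, not a reproduction of the SAS Enterprise Miner calculations: the 0.5 classification cutoff, the depth used for lift, the function names, and the toy data are assumptions, and the bin-based K-S variant is not shown. The Gini coefficient is taken as 2 × ROC index − 1, which agrees with Table 2 up to rounding (for example, 2 × 0.947 − 1 = 0.894).

```python
import math


def misclassification_rate(actual, scores, cutoff=0.5):
    """Share of cases whose predicted class (score >= cutoff) is wrong."""
    wrong = sum(int(s >= cutoff) != a for a, s in zip(actual, scores))
    return wrong / len(actual)


def roc_index(actual, scores):
    """Area under the ROC curve: share of (event, non-event) pairs in which
    the event has the higher score; ties count one half.
    Pairwise comparison is quadratic, so this suits small examples only."""
    events = [s for a, s in zip(actual, scores) if a == 1]
    non_events = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(events) * len(non_events))


def gini_coefficient(actual, scores):
    """Gini coefficient as a rescaling of the ROC index."""
    return 2 * roc_index(actual, scores) - 1


def ks_statistic(actual, scores):
    """Largest gap between the cumulative score distributions of
    events and non-events."""
    events = [s for a, s in zip(actual, scores) if a == 1]
    non_events = [s for a, s in zip(actual, scores) if a == 0]
    gaps = []
    for t in sorted(set(scores)):
        cdf_events = sum(s <= t for s in events) / len(events)
        cdf_non_events = sum(s <= t for s in non_events) / len(non_events)
        gaps.append(abs(cdf_events - cdf_non_events))
    return max(gaps)


def cumulative_lift(actual, scores, depth=0.1):
    """Return (lift, percent_captured) for the top `depth` share of cases."""
    ranked = sorted(zip(scores, actual), key=lambda p: p[0], reverse=True)
    n_top = max(1, math.ceil(depth * len(ranked)))
    top_events = sum(a for _, a in ranked[:n_top])
    total_events = sum(actual)
    lift = (top_events / n_top) / (total_events / len(actual))
    return lift, 100 * top_events / total_events


# Toy data (hypothetical): three events among eight cases.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
misc = misclassification_rate(actual, scores)
roc = roc_index(actual, scores)
gini = gini_coefficient(actual, scores)
ks = ks_statistic(actual, scores)
lift, captured = cumulative_lift(actual, scores, depth=0.25)
print(misc, round(roc, 3), round(gini, 3), round(ks, 3), round(lift, 3))
```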
Conclusions
The study presents some initial evidence for the effectiveness of model ensemble in improving the performance of an individual learning machine (model) of a given type
The study needs to be supplemented with additional information on the use of (real) bagging and boosting in improving the performance of individual learning machines
Conclusions
The study provides applied researchers with more options beyond traditional regression modeling when reliable predictions are needed in their research
The study serves as the foundation for a future research topic that adds feature selection to predictive data mining modeling under model ensemble for analyzing very large data sets