ABSTRACT
DE LA FUENTE GALLEGOS, RODRIGO ANDRÉS. Simulation Metamodeling with Gaussian Process: A Numerical Study. (Under the direction of Stephen Roberts.)
After studying the metamodeling literature with focus on applications from inside and outside Industrial Engineering, it was determined that no systematic in-depth compari-son of modern metamodeling techniques has been applied to real simulation problems. Additionally, even though efforts have been made to incorporate modern methods such as Gaussian processes, several misconceptions have caused confusion about how to use these tools. The aim of this dissertation is to provide an in-depth comparison of metamodels for two simulation paradigms, namely, systems dynamics and discrete event simulation, and to introduce some new machine learning techniques for improving Gaussian processes.
In simulation metamodeling, several studies have been reported regarding support vector machines and Gaussian processes, along with some neural network applications (often dismissed in engineering due to the difficulty of tunning them). Little research has been done using state-of-the-art tree-based methods. Additionally, applications of Gaussian processes to simulation metamodeling are dominated by the use of a constant intercept and smooth covariance functions, while outside engineering more attention has been given to elaborated non-stationary mean responses and the Matèrn family of covariance functions in both separable and non-separable form.There is also an apparent lack of methodology and reproducibility when comparing metamodels.
simulation (DES) model (DMV service system). These techniques include Support Vector Regression (SVR), Random Forest (RF), Gradient Boosted Regression Trees (GBRT), Multilayer Perceptron (MLP) and Gaussian Processes (GP). GP used the classic squared exponential covariance and the Matèrn once and twice differentiable in separable and non-separable form. A total of 33 responses (25 SD and 8 DES) were tested with the criteria of goodness of fit, interpretability and fitting time. The main findings for both types of simulations were that GPs using non-separable Matèrn covariances were far superior to the squared exponential covariance and their separable counterparts in mean squared prediction error. Furthermore, while MLP and GBRT were also competitive, GP was both more accurate and robust across datasets and types of simulation. As to interpretability, GBRT was the only method that could provide insight to investigators and all the others techniques had no or limited interpretation. With regard to time, most of the models that fit fast yielded less accurate predictions.
© Copyright 2016 by Rodrigo Andrés De la Fuente Gallegos
Simulation Metamodeling with Gaussian Process: A Numerical Study
by
Rodrigo Andrés De la Fuente Gallegos
A dissertation submitted to the Graduate Faculty of North Carolina State University
in partial fulfillment of the requirements for the Degree of
Doctor of Philosophy
Industrial Engineering
Raleigh, North Carolina 2016
APPROVED BY:
Joseph Guinness Jeffrey Joines
Reha Uzsoy Stephen Roberts
DEDICATION
BIOGRAPHY
ACKNOWLEDGEMENTS
I would like to thank to my adviser Dr. Stephen D. Roberts for his support and en-couragement during these five years. To the committee members, I wish to thank: Dr. Joseph Guinness for his contributions that greatly enhanced this work, Dr. Jeff Joines for devoting his precious time in helping me to get started with Chapter 4, and last, but not least, Dr. Reha Uzsoy for his dedicated reading and correcting the dissertation and providing valuable feedback.
Many graduate student colleagues sheared their experiences with me and helped me, in some way or another, in getting started with my dissertation. A special mention to my friend Raymond Smith who played a crucial role at the beginning of this work. To my friends Andrés and Pedro for the invaluable conversations about research we had during lunch breaks.
My family has always been there cheering me up from the other side of the world. Sofía and Rockie, provided their unconditional company during the last two years of this adventure.
TABLE OF CONTENTS
LIST OF TABLES . . . viii
LIST OF FIGURES . . . xiii
ABBREVIATIONS . . . xvi
Chapter 1 Metamodeling in Engineering . . . 1
1.1 Introduction . . . 1
1.2 Literature Review in Simulation Metamodeling . . . 2
1.2.1 Non-Gaussian Processes Metamodels . . . 3
1.2.2 Gaussian Process Metamodeling . . . 9
1.3 Discussion . . . 15
1.4 Dissertation Outline . . . 18
Chapter 2 Metamodeling: A State-of-the-Art Comparison . . . 20
2.1 Introduction . . . 20
2.2 Modern Machine Learning Techniques . . . 22
2.2.1 Support Vector Regression (SVR) . . . 22
2.2.2 Multilayer Perceptron (MLP) . . . 25
2.2.3 Random Forest Regression (RF) . . . 28
2.2.4 Gradient Boosted Regression Tree (GBRT) . . . 31
2.2.5 Gaussian Process . . . 36
2.3 Experimental Setup . . . 39
2.4 Applications . . . 43
2.4.1 Experiment 1 . . . 44
2.4.2 Experiment 2 . . . 60
2.5 Discussion . . . 73
2.6 Conclusion . . . 76
Chapter 3 A Two-stage Model Selection Gaussian Process . . . 77
3.1 Introduction . . . 77
3.2 A Review of Variable Selection Methods . . . 79
3.2.1 Model Selection . . . 82
3.3 Loose Coupling of GP and adaptive-LASSO . . . 91
3.3.1 The AdaLassoGP Class . . . 92
3.4 Experimental Results . . . 97
3.4.1 Hospital 1 Metadata . . . 98
3.4.2 Hospital 2 Metadata . . . 107
3.6 Conclusion . . . 120
Chapter 4 Using Machine Learning Algorithms to Metamodel Discrete Event Simulations: DMV Charlotte Case Study . . . 122
4.1 Introduction . . . 122
4.2 Overview of the DES Model Under Study . . . 124
4.3 Experimental Setup . . . 128
4.4 Training Results . . . 132
4.5 Gaussian Process Results . . . 138
4.5.1 Gaussian Process with Constant Mean - Separable and Non-separable Covariances . . . 138
4.5.2 Gaussian Process with Adaptive Lasso Mean - Separable and Non-separable Covariances . . . 141
4.5.3 Select Best Gaussian Process Models . . . 143
4.5.4 Effect of Number of Replications in Model Accuracy . . . 144
4.6 All Techniques Comparison Results . . . 146
4.6.1 Coverage Analysis . . . 153
4.6.2 Discussion . . . 159
4.7 Conclusion . . . 160
Chapter 5 Conclusions and Future Work . . . 162
5.1 Conclusions of Dissertation Work . . . 162
5.2 Future Work . . . 164
BIBLIOGRAPHY . . . 166
APPENDICES . . . 176
Appendix A Numerical Results Chapter 2 . . . 177
A.1 Hospital 1 Dataset - Results . . . 177
A.2 All Techniques Comparison - Hospital 1 Dataset . . . 182
A.3 Hospital 2 Dataset - Results . . . 186
A.4 All Techniques Comparison - Hospital 2 Dataset . . . 190
Appendix B Numerical Results Chapter 3 . . . 193
B.1 Motivation Chapter 3 . . . 193
B.2 Hospital 1 Dataset - Results . . . 196
B.2.1 Adaptive Lasso Mean - Minimum and One Standard Deviation Cross-validation Error . . . 196
B.2.2 Gaussian Process Best of the Best . . . 200
B.3 Hospital 1 Dataset - All Techniques Comparison . . . 202
B.4.1 Adaptive Lasso Mean - Minimum and One Standard Deviation
Cross-validation Error . . . 206
B.4.2 Gaussian Process Best of the Best . . . 210
B.5 Hospital 2 Dataset - All Techniques Comparison . . . 213
Appendix C Numerical Results Chapter 4 . . . 215
C.1 Experimental Design . . . 215
C.2 Sliced Latin Hypercube Design Experiment . . . 218
C.2.1 Constant Mean - Sliced Latin Hypercube Design . . . 218
C.2.2 Adaptive Lasso Mean - Sliced Hypercube Design . . . 223
C.2.3 All Techniques Comparison . . . 229
C.3 K-Nearly Orthogonal Latin Hypercube Experiment . . . 231
C.3.1 Constant Mean - K-Nearly Orthogonal Design . . . 231
C.3.2 Adaptive Lasso Mean - K-Nearly Orthogonal Design . . . 235
C.3.3 All Techniques Comparison . . . 241
LIST OF TABLES
Table 1.1 Most cited metamodeling techniques . . . 16 Table 1.2 Current Gaussian process practices in Spatial Statistics and
Com-puter Experiments . . . 17 Table 2.1 Transfer functions . . . 26 Table 2.2 Most used covariance kernels . . . 37
Table 2.3 Platform information: interpreter, operative system and hardware . 43
Table 2.4 Covariates for the Hospital 1 dataset . . . 46 Table 2.5 Responses evaluated in Hospital 1 dataset . . . 47
Table 2.6 Best parameters randomized grid search cross-validation Hospital 1 48
Table 2.7 Covariates for the Hospital 2 dataset . . . 62 Table 2.8 Responses evaluated in Hospital 2 dataset . . . 63
Table 2.9 Best parameters randomized grid search cross-validation Hospital 2 64
Table 3.1 Most important covariates for each response of Hospital 1 dataset . 106 Table 3.2 Most important covariates for each response of Hospital 2 dataset . 116 Table 4.1 Covariates for the DMV dataset . . . 128 Table 4.2 Responses evaluated in DMV dataset . . . 133 Table 4.3 Best parameters randomized grid search cross-validation K-NOLHD 135 Table 4.4 Best parameters randomized grid search cross-validation SLHD . . 136 Table 4.5 Adaptive Lasso - GP, Interpretability - K-NOLHD . . . 151 Table 4.6 Adaptive Lasso - GP, Interpretability - SLHD . . . 151 Table 4.7 Confidence interval for the mean prediction expressed as percentage
of the times the simulated mean was within the predicted interval . 155 Table 4.8 Welch’s t test for difference of means number of non-statistically
different means out of 900 samples . . . 158 Table A.1 Gaussian Process with separable and non-separable covariances Mean
squared prediction error Hospital 1 - five folds cross-validation . . 178 Table A.2 Gaussian Process with separable and non-separable covariances R2
Hospital 1 - five folds cross-validation . . . 179 Table A.3 Gaussian Process with separable and non-separable covariances
Max-imum absolute deviation Hospital 1 - five folds cross-validation . . 180 Table A.4 Gaussian Process with separable and non-separable covariances Time
Hospital 1 - five folds cross-validation . . . 181 Table A.5 Gaussian Process with separable and non-separable covariances
Table A.6 All techniques comparison Mean squared prediction error Hospital 1 - five folds cross-validation . . . 183 Table A.7 All techniques comparison Mean squared prediction error Hospital 1
- five folds cross-validation . . . 184 Table A.8 All techniques comparison Maximum absolute deviation Hospital 1
-five folds cross-validation . . . 185 Table A.9 Gaussian Process with separable and non-separable covariances Mean
squared prediction error Hospital 2 - five folds cross-validation . . 187 Table A.10 Gaussian Process with separable and non-separable covariances R2
Hospital 2 - five folds cross-validation . . . 188 Table A.11 Gaussian Process with separable and non-separable covariances
Max-imum absolute deviation Hospital 2 - five folds cross-validation . . 189 Table A.12 Gaussian Process with separable and non-separable covariances
Fit-ting time Hospital 2 - five folds cross-validation . . . 190 Table A.13 Gaussian Process with separable and non-separable covariances
Like-lihood visits Hospital 2 - five folds cross-validation . . . 191 Table A.14 All techniques comparison Mean squared prediction error Hospital 2
- five folds cross-validation . . . 191 Table A.15 All techniques comparison Maximum absolute deviation Hospital 2
-five folds cross-validation . . . 192 Table A.16 All techniques comparison R2 Hospital 2 - five folds cross-validation 192
Table B.1 Constant vs quadratic using non-separable covariances Mean squared prediction error Hospital 1 - five folds cross-validation . . . 194 Table B.2 Constant vs quadratic using non-separable covariances Mean squared
prediction error Hospital 2 - five folds cross-validation . . . 195 Table B.3 Adaptive Lasso and Gaussian Process - separable vs non-separable
covariances Mean squared prediction error Hospital 1 - five folds minimum cross-validation . . . 197 Table B.4 Adaptive Lasso and Gaussian Process - separable vs non-separable
covariances Mean squared prediction error Hospital 1 - five folds one standard deviation cross-validation . . . 198 Table B.5 Adaptive Lasso and Gaussian Process - separable vs non-separable
covariances Fitting time Hospital 2 - five folds one standard deviation cross-validation . . . 199 Table B.6 Adaptive Lasso and Gaussian Process - separable vs non-separable
covariances Likelihood visits Hospital 1 - five folds one standard deviation cross-validation . . . 199 Table B.7 Adaptive Lasso coupled with Gaussian Process - best of the best
Table B.8 Adaptive Lasso coupled with Gaussian Process - best of the best Maximum absolute deviation Hospital 1 - five folds cross-validation 201 Table B.9 Adaptive Lasso coupled with Gaussian Process - best of the best R2
Hospital 1 - five folds cross-validation . . . 201 Table B.10 All techniques comparison - best of the best Mean squared prediction
error Hospital 1 - five folds cross-validation . . . 203 Table B.11 All techniques comparison - best of the best Maximum absolute
deviation Hospital 1 - five folds cross-validation . . . 204 Table B.12 All techniques comparison - best of the best R2 Hospital 1 - five
folds cross-validation . . . 205 Table B.13 Adaptive Lasso coupled with Gaussian Process - separable vs
non-separable covariances Mean squared prediction error Hospital 2 - five folds minimum cross-validation . . . 207 Table B.14 Adaptive Lasso coupled with Gaussian Process - separable vs
non-separable covariances Mean squared prediction error Hospital 2 - five folds one standard deviation cross-validation . . . 208 Table B.15 Adaptive Lasso coupled with Gaussian Process - separable vs
non-separable covariances Fitting time Hospital 2 - five folds one standard deviation cross-validation . . . 209 Table B.16 Adaptive Lasso and Gaussian Process - separable vs non-separable
covariances Likelihood visits Hospital 2 - five folds one standard deviation cross-validation . . . 209 Table B.17 Adaptive Lasso coupled with Gaussian Process - best of the best
Mean squared prediction error Hospital 2 - five folds cross-validation 210 Table B.18 Adaptive Lasso coupled with Gaussian Process - best of the best
Maximum absolute deviation Hospital 2 - five folds cross-validation 211 Table B.19 Adaptive Lasso and Gaussian Process - best of the best R2 Hospital
2 - five folds cross-validation . . . 212 Table B.20 All techniques comparison - best of the best Mean squared prediction
error Hospital 2 - five folds cross-validation . . . 213 Table B.21 All techniques comparison - best of the best Maximum absolute
deviation Hospital 2 - five folds cross-validation . . . 213 Table B.22 All techniques comparison - best of the best R2 Hospital 2 - five
folds cross-validation . . . 214 Table C.1 Variable description for the DMV model - DES input . . . 215 Table C.2 Schedule combinations . . . 216 Table C.3 Randomized covariates Nearly orthogonal Latin hypercube design
-(NO)3311 Example . . . 217
Table C.4 Constant mean and separable and non-separable covariances R2
Table C.5 Constant mean and separable and non-separable covariances Mean squared prediction error Three datasets . . . 220 Table C.6 Constant mean and separable and non-separable covariances
Maxi-mum absolute deviation Three datasets . . . 221 Table C.7 Constant mean and separable and non-separable covariances
Likeli-hood visits Three datasets . . . 222 Table C.8 Adaptive Lasso mean and separable and non-separable covariances
R2 Three datasets . . . 224
Table C.9 Adaptive Lasso mean and separable and non-separable covariances Mean squared prediction error Three datasets . . . 225 Table C.10 Adaptive Lasso mean and separable and non-separable covariances
Maximum absolute deviation Three datasets . . . 226 Table C.11 Adaptive Lasso mean and separable and non-separable covariances
Likelihood visits Three datasets . . . 227 Table C.12 Best Gaussian Process for SLHD Mean squared prediction error
Three datasets . . . 228 Table C.13 Best Gaussian Process for SLHD Maximum absolute deviation Three
datasets . . . 228
Table C.14 All techniques comparison for SLHD R2 Three datasets . . . 229
Table C.15 All techniques comparison for SLHD Mean squared prediction error Three datasets . . . 229 Table C.16 All techniques comparison for SLHD Maximum absolute deviation
Three datasets . . . 230 Table C.17 Tuning and fitting time DMV - SLHD . . . 230
Table C.18 Constant mean and separable and non-separable covariances R2
Three datasets . . . 232 Table C.19 Constant mean and separable and non-separable covariances Mean
squared prediction error Three datasets . . . 233 Table C.20 Constant mean and separable and non-separable covariances
Maxi-mum absolute deviation Three datasets . . . 234 Table C.21 Constant mean and separable and non-separable covariances
Likeli-hood visits Three datasets . . . 235 Table C.22 Adaptive Lasso mean and separable and non-separable covariances
R2 Three datasets . . . 236
Table C.23 Adaptive Lasso mean and separable and non-separable covariances Mean squared prediction error Three datasets . . . 237 Table C.24 Adaptive Lasso mean and separable and non-separable covariances
Maximum absolute deviation Three datasets . . . 238 Table C.25 Adaptive Lasso mean and separable and non-separable covariances
Table C.26 Best Gaussian Process for K-NOLHD Mean squared prediction error
Three datasets . . . 239
Table C.27 Best Gaussian Process for K-NOLHD Maximum absolute deviation Three datasets . . . 240
Table C.28 All techniques comparison for K-NOLHD R2 Three datasets . . . . 241
Table C.29 All techniques comparison for K-NOLHD Mean squared prediction error Three datasets . . . 241
Table C.30 All techniques comparison for K-NOLHD Maximum absolute devia-tion Three datasets . . . 242
Table C.31 Tuning and fitting time DMV - K-NOLHD . . . 242
Table C.32 Effect of the number of replications per point on model accuracy R2 Three datasets . . . 243
Table C.33 Effect of the number of replications per point in model accuracy Mean squared prediction error Three datasets . . . 244
Table C.34 Effect of the number of replications in model accuracy Maximum absolute deviation Three datasets . . . 245
Table C.35 All technique variance prediction R2 . . . 246
Table C.36 All technique variance prediction MSPE . . . 246
Table C.37 Average MSPE five-folds . . . 247
LIST OF FIGURES
Figure 2.1 MLP diagram for response 7 experiment 2. . . 25
Figure 2.2 Weight and activation function . . . 27
Figure 2.3 Regression Tree for response 4 Hospital 1 data . . . 29
Figure 2.4 Random Forest for response 4 Hospital 1 data . . . 31
Figure 2.5 Random Forest for response 4 Hospital 1 data . . . 32
Figure 2.6 Gradient Boosted Regression Trees (GBRT) variable importance for response 7 Hospital 1 data . . . 35
Figure 2.7 GBRT partial dependency plots for response 7 Hospital 1 data . . 35
Figure 2.8 Chapter 2 experimental sequence . . . 40
Figure 2.9 Five-folds cross-validation . . . 42
Figure 2.10 Responses histograms Hospital 1 dataset . . . 45
Figure 2.11 Mean squared prediction error - Hospital 1 dataset five-folds cross-validation . . . 51
Figure 2.12 Maximum absolute deviation - Hospital 1 dataset five-folds cross-validation . . . 52
Figure 2.13 All techniques comparison Hospital 1 dataset R2 - five-folds cross-validation . . . 54
Figure 2.14 All techniques comparison Hospital 1 dataset MSPE - five-folds cross-validation . . . 55
Figure 2.15 All techniques comparison Hospital 1 dataset MAX - five-folds cross-validation . . . 56
Figure 2.16 Partial dependency plots for selected responses fitted with GBRT for Hospital 1 data . . . 58
Figure 2.17 Tuning and fitting time Hospital 1 dataset . . . 59
Figure 2.18 Responses histograms Hospital 2 dataset . . . 61
Figure 2.19 Mean squared prediction error - Hospital 2 dataset five-folds cross-validation . . . 67
Figure 2.20 Maximum absolute deviation - Hospital 2 dataset five-folds cross-validation . . . 68
Figure 2.21 All Techniques comparison Hospital 2 dataset R2 - five-folds cross-validation . . . 69
Figure 2.22 All techniques comparison Hospital 2 dataset MSPE - five-folds cross-validation . . . 70
Figure 2.23 All techniques comparison Hospital 2 dataset MAX - five-folds cross-validation . . . 71
Figure 2.24 Partial dependency plots for selected responses fitted with GBRT for Hospital 2 data . . . 72
Figure 2.26 Conclusions for Hospital 1 dataset . . . 74
Figure 2.27 Conclusions for Hospital 2 dataset . . . 75
Figure 3.1 Constant vs quadratic mean - MSPE . . . 78
Figure 3.2 UML diagram for the AdaLassoGP class . . . 93
Figure 3.3 LASSO path response 4 - Hospital 1 data . . . 95
Figure 3.4 Fraction of deviance explained response 4 - Hospital 1 data . . . . 95
Figure 3.5 Five-folds cross-validation response 4 - Hospital 1 data . . . 96
Figure 3.6 Experimental sequence . . . 97
Figure 3.7 Adaptive-Lasso mean and non-separable covariances minimum vs. one standard deviation cross-validation - MSPE . . . 98
Figure 3.8 Best cross-validation cut . . . 99
Figure 3.9 All techniques comparison - R2 five-folds cross-validation . . . 100
Figure 3.10 All techniques comparison - MSPE five-folds cross-validation . . . 101
Figure 3.11 All techniques comparison - MAX five-folds cross-validation . . . 102
Figure 3.12 Tuning and fitting time Hospital 1 dataset . . . 107
Figure 3.13 Adaptive-Lasso mean and non-separable covariances minimum vs. one standard deviation cross-validation - MSPE . . . 109
Figure 3.14 Best cross-validation cut . . . 110
Figure 3.15 All techniques comparison - R2 five-folds cross-validation . . . 111
Figure 3.16 All techniques comparison - MSPE five-folds cross-validation . . . 111
Figure 3.17 All techniques comparison - MAX five-folds cross-validation . . . 112
Figure 3.18 Tuning and fitting time Hospital 2 dataset . . . 117
Figure 3.19 General results for Hospital 1 dataset . . . 118
Figure 3.20 General results for Hospital 2 dataset . . . 119
Figure 4.1 Sensitivity of different inputs parameters on the output Lange (2015)127 Figure 4.2 Experiment overview . . . 129
Figure 4.3 Experimental sequence . . . 130
Figure 4.4 Column correlation K-NOLHD and SLHD DMV dataset . . . 131
Figure 4.5 Responses histograms for K-NOLHD and SLHD experiments DMV dataset . . . 133
Figure 4.6 MSPE - constant mean with separable and non-separable covariances139 Figure 4.7 MAX - constant mean with separable and non-separable covariances140 Figure 4.8 MSPE - Ada-Lasso mean with separable and non-separable covariances142 Figure 4.9 MAX - Ada-Lasso mean with separable and non-separable covariances143 Figure 4.10 Constant vs Ada-Lasso mean best results . . . 144
Figure 4.11 Effect of the number of replications on model accuracy - R2 . . . 145
Figure 4.12 Effect of the number of replications on model accuracy . . . 146
Figure 4.13 All techniques comparison - R2 . . . 147
ABBREVIATIONS
ABM Agent Based Modeling. 21
ALGP Adaptive-Lasso Gaussian Process. 97, 98, 100–102, 105–107, 110, 112, 117–120,
126, 143, 144, 149, 150, 153, 154, 159–161, 164
ALGPE Adaptive-LASSO Gaussian process with squared exponential covariance
func-tion. 107, 108, 116, 141, 142, 153, 159, 160, 164
ALGPM32 Adaptive-LASSO Gaussian process with Matèrn once differentiable
covari-ance function. 107, 108, 141
ALGPM32NS Adaptive-LASSO Gaussian process with Non-Separable Matèrn once
differentiable covariance function. 107, 108, 141
ALGPM52 Adaptive-LASSO Gaussian process with Matèrn twice differentiable
covari-ance function. 107
ALGPM52NS Adaptive-LASSO Gaussian process with Non-Separable Matèrn twice
differentiable covariance function. 108, 141, 142
ANN Artificial Neural Networks. 7
BLUP Best Linear Unbiased Predictor. 76
CART Classification and Regression Trees. 8
COSSO Component Selection and Smoothing Operator. 3
CRN Common Random Numbers. 15
CV Cross-validation. 39, 40, 42, 50, 66
DACE Design and Analysis of Computer Experiments. 12
DES Discrete Event Simulation. 2, 7, 14, 19, 21, 42, 120–125, 144, 154, 160, 163, 164
DOE Design of Experiments. 13, 84
GAMLSS Generalized Additive Models for Location, Scale and Shape. 7
GBRT Gradient Boosted Regression Trees. xiii, 17, 18, 32–35, 39, 50, 53–55, 57, 58, 66,
68–70, 72, 74–78, 80, 81, 102, 107, 110, 112, 116, 118–120, 124, 126, 137, 146–150, 153–155, 159–164
GP Gaussian Process. 2, 3, 5, 6, 8, 10, 11, 13, 15, 16, 18, 19, 39, 40, 43, 50, 52–55, 57, 59,
66–70, 72–78, 80, 81, 91, 92, 100, 101, 105, 106, 110, 112, 117–120, 123, 124, 129, 138, 142–149, 154, 155, 157, 158, 160, 162–164
GPCE Gaussian Process with constant intercept and squared exponential covariance.
51, 52, 59, 66, 138–140
GPCM32 Gaussian Process with constant intercept and Matèrn once differentiable
covariance. 51, 52, 59, 66, 67, 139
GPCM32NS Gaussian Process with constant intercept and Non-Separable Matèrn once
differentiable covariance. 50–52, 59, 66, 67, 77, 138–140
GPCM52 Gaussian Process with constant intercept and Matèrn twice differentiable
covariance. 59
GPCM52NS Gaussian Process with constant intercept and Non-Separable Matèrn twice
differentiable covariance. 51, 52, 59, 66, 138–140
GPQM32NS Gaussian Process with quadratic mean response and Non-Separable
Matèrn once differentiable covariance. 77
IDW Inverse Distance Weighting. 8
IE Industrial Engineering. 1, 2, 9, 13, 18, 42, 122, 162, 165
KED Kriging with External Drift. 11
LARS Least Angle Regression Selector. 87–89, 94
LASSO Least Absolute Shrinkage and Selection Operator. 85–91, 94
LHD Latin Hypercube Design. 6, 13, 14, 44, 60, 84, 132
MARS Multivariate Adaptive Regression Splines. 5, 6
MAX Maximum Absolute Error. 5, 18, 19, 21, 39, 43, 55, 67, 70, 73–76, 99, 100, 109,
110, 112, 118, 142, 144, 148, 159, 160, 162, 164
MLP Multilayer Perceptron. 3, 15, 16, 18, 25, 39, 53–55, 57, 59, 66, 68–70, 72, 74, 76,
77, 84, 100, 101, 105, 118, 124, 137, 146–148, 153–155, 157, 159–161, 163, 164
MLS Moving Least Squares. 8
MSPE Mean Squared Prediction Error. 9, 10, 14, 18, 21, 39, 43, 51–54, 67, 68, 73–80,
94, 98–100, 109, 110, 118, 124, 138, 141, 143–145, 147, 154, 159, 160, 162, 164
NOLHD Nearly Orthogonal Latin Hypercube Design. 13, 128–132, 137, 138, 141, 147,
149, 153, 154, 164
OK Ordinary Kriging. 8, 10–15
RBF Radial Basis Functions. 5, 6, 24, 56
REML Restricted Maximum Likelihood. 9, 10, 17
RF Random Forest. 3, 4, 6, 7, 15, 17, 18, 30, 32–34, 39, 53, 57–59, 76, 124, 137, 146, 147,
153, 161–163
RIDW Regression Inverse Distance Weighting. 8
RK Regression Kriging. 10–12
RMSPE Root Mean Squared Prediction Error. 5, 7, 19
RSM Response Surface Methodology. 3, 13, 91
SD Systems Dynamic. 2, 21, 42–44, 60, 73, 74, 77, 122, 164
SK Stochastic Kriging. 8, 12, 15
SLHD Sliced Latin Hypercube Design. 129, 131, 132, 137, 140, 141, 147, 150, 165
SVM Support Vector Machine. 3, 4, 6, 7, 15, 16, 18
SVR Support Vector Regression. 22, 24, 25, 39, 53, 56–58, 76, 121
UK Universal Kriging. 11, 13, 14
UML Unified Modeling Language. 92
Chapter 1
Metamodeling in Engineering
1.1
Introduction
A simulation is an abstraction of reality used to provide a simplified representation of our perception of the salient features of a system, which in most cases, is complex and nonlinear. Large simulations are still very time consuming and expensive to run; thus, when the dimensionality of the problem gets large, it is not recommended that the simulation model be used to perform sensitivity analysis or optimization. To overcome this problem the analyst has to employ another level of abstraction and build a metamodel, which provides faster execution and better comprehension of the simulated system. The metamodel provides (ideally) a sort of “gray-box” for the I/O relationships; however, the ability of the metamodel to predict reality depends completely upon the accuracy of the simulation models. Once fitted, a metamodel can be used for several purposes 1) model approximation, 2) model exploration, 3) problem formulation, 4) global or multi-objective optimization, etc. (Wang and Shan, 2007).
The main objectives of this dissertation are:
separable covariance functions on the prediction performance of Gaussian Process (GP) will be tested (GP is the currently most-used technique in engineering), which
is generally overlooked in simulation metamodeling.
• Propose an improvement in how to incorporate more information into the non-stationary trend component of a Gaussian process to metamodel simulations with a large number of parameters. The effect of loose coupling of modern covariate selection techniques with GP will be studied with the aim of improving the predictive performance of GP applied to deterministic and stochastic simulation.
• Contribute to metamodeling applications in the field of IE by applying metamodeling to two different simulation paradigms, namely, Systems Dynamic (SD) and Discrete Event Simulation (DES). First, the efforts are focused on comparing different metamodeling techniques applied to two SD hospital simulations done to determine the most accurate metamodels when responses are highly non-linear and their distributions non-normal. Second, study the impact of stochastic simulation on different metamodeling approaches using a DES of the Division of Motor Vehicles of NC.
1.2
Literature Review in Simulation Metamodeling
In this section different metamodeling methods applied to engineering problems are explored. An effort has been made to separate non-GP1 techniques from GP models.
1.2.1
Non-Gaussian Processes Metamodels
The main idea behind metamodeling is to approximate a complex system with a simple, parametric or nonparametric, and less computationally expensive method to explore how a set of covariates are related to one or more response variables. The work of Box and Wilson (1951) introduced the concept of Response Surface Methodology (RSM) and the use of low order polynomials to approximate the underlying relationship among variables (Box and Draper, 1987) . This section reviews different modern metamodeling techniques
applied to simulation in engineering.
better because of their interpolation properties, but when the number of samples get large machine learning methods such as SVM, RF and Radial Basis functions produce better results. With regard to execution time the authors pointed out that Random Forest provides the best compromise of speed and accuracy.
Can and Heavey (2012) compared Genetic Programming, which has the capability to evolve programs using symbolic regression, and Artificial Neural Networks as metamodel-ing techniques for three Industrial Engineermetamodel-ing problems 1) Automated Material Handlmetamodel-ing System (Kuo et al., 2007), 2) the (s, S) Inventory Control System (Biles et al., 2007), and 3) Production Line (Papadopolous et al., 1993). The main evaluation criteria were (i) training and test performance and (ii) computational effort. Their conclusions were that Genetic Programming showed a better performance in out-of-sample prediction; however, its main drawback was that it is more expensive to fit even considering the tuning time Artificial Neural Networks require to avoid over-fitting.
validation, pointing out that the popular leave-one-out cross-validation does not convey the information needed to assert metamodel accuracy; instead, leave-k-out cross-validation must be used and a good rule of thumb for k is k = 0.1N or k = √N, where N is the sample size (Wang and Shan, 2007, p. 12). In regard to the evaluation criteria, the authors used the three most common validation methods, presented below:
RM SP E =
s PN
i=1(yi−yˆi) 2
N (1.1)
M AX = max|yi−yˆi|, i= 1, . . . , N (1.2)
R2 = 1−
PN
i=1(yi−yˆi) 2
PN
i=1(yi−y¯)
2 (1.3)
The Root Mean Squared Prediction Error (RMSPE) is a measurement used to evaluate general accuracy; while, the Maximum Absolute Error (MAX) gives information about local accuracy and R2 provides spatial proximity between the answers produced by the
simulation model and the metamodel.
Jin, Chen, and Simpson (2001) compared Second Order Polynomial Regression, Mul-tivariate Adaptive Regression Splines (MARS), Radial Basis Functions (RBF) and GP with Gaussian Correlation. Their evaluation criteria were accuracy, efficiency, robustness, transparency and simplicity. They evaluated 14 test problems classified according to three features: 1) problem scale, which is given by the number of covariates, 2) nonlinearity, defined as an R2 ≤ 0.99 when a low order polynomial model is fitted, and 3) noisy vs
smooth surface. They concluded that MARS, RBF, and GP with Gaussian correlation
is the most suitable model. Finally, they pointed out that for small sample sizes RBF outperforms the other methods and that GP was the least efficient when fitting time was considered.
Li et al. (2010) compared RBF (Gaussian function with Ridge regularization), SVM with Gaussian kernel function, Neural networks, GP with Gaussian correlation and un-known constant, and MARS. The test bed was comprised of 16 stochastic problems (functions with added noise). They divided the data into training, validation and testing sets, but no cross-validation was performed. They used the average value of the repeated runs of the model to develop the metamodel and Latin Hypercube Design (LHD) to gener-ate the designs. The evaluation criteria were accuracy, robustness (the algorithm does not deteriorate when tested on different data), and efficiency. Li et al. (2010) concluded that SVM provided the best compromise of accuracy and robustness, but for more complicated models (higher dimensionality and heterogeneous errors) RBF was more efficient. They also pointed out the importance of interpretability, stating that only MARS has a simple interpretation and all the remaining methods, although they are accurate and robust, are difficult to interpret based on the underlying relationships among the input covariates and the response.
even though Boosting and SVM have better performance than RF, SVM was a much more expensive procedure. Finally, they pointed out the ability of RF and Boosting to find simpler, faster, and more interpretable (the understandability of why the model is true or how it is induced from the data) models (Ogutu, Piepho, and Schulz-Streeck, 2011, p. 8).
Boutselis and Ringrose (2013) compared Artificial Neural Networks (ANN) and Generalized Additive Models for Location, Scale and Shape (GAMLSS) as potential metamodeling techniques applied to computer combat simulation. They stated that when the relationship between variables is too complicated to be predefined parametrically, the flexible methods of ANN and GAMLSS could be used. Regarding GAMLSS they said, “[I]t is a very flexible approach to modeling not just the mean but also the vari-ance, skewness and kurtosis of the response variable using flexible (e.g. spline) models” (Boutselis and Ringrose, 2013, p. 6088). The authors used RMSPE as the key indicator of out-of-sample prediction. They concluded that both methods produce quite similar out-of-sample prediction performance, with GAMLSS more accurate for skewed data; in addition, GAMLSS requires more involvement in model development and selection which could be useful in understanding of the data and the results.
Hsieh, Chang, and Chien (2014) proposed a response surface method built on a second order polynomial to capture the relationship between the cycle time of normal lots and the percentage of hot lots in semiconductor manufacturing. Their model consisted of a DES experimented by a simplex lattice design. Then, a second order polynomial was fitted and if the resulting model was not satisfactory (based on R2) more points were
reliable and easy-to-use analytical model.
Salemi, Nelson, and Staum (2014) studied the applicability of Moving Least Squares (MLS) regression with anisotropic weight function for high-dimensional stochastic
sim-ulation. They compared MLS, Classification and Regression Trees (CART), Stochastic Kriging (SK) with Gaussian correlation, Weighted Least Squares (WLS) regression. They reported that MLS produced results even when the sample size was larger than 5,000; whereas SK and WLS were unable to obtain a result for that sample size since the computer ran out of memory. They concluded that their MLS outperformed the other methods forecasting the M/G/1 queue with 5, 25, and 75 dimensions.
Joseph and Kang (2011) demonstrated that the Inverse Distance Weighting (IDW) interpolation method can be significantly improved when coupled with a linear regression method. Their work was focused on reducing the computational burden of GP methods, while maintaining prediction accuracy. The proposed Regression Inverse Distance Weight-ing (RIDW) consists of fittWeight-ing a penalized regression model to obtain a sparse global trend approximation of the response; then, the residuals are used for interpolation based on a modified IDW that allows for anisotropy (different importance for each covariate) and truncated neighborhood (include just a specific number of points around a new prediction site). They compared RIDW with Ordinary Kriging (OK), concluding that the former has comparable prediction accuracy to the latter, but requires less computational effort (there is no need to invert the correlation matrix). Finally, they justify their use of global
1.2.2
Gaussian Process Metamodeling
In this subsection some relevant work in Kriging literature is presented. Firstly, a review of works outside IE is given. Secondly, the most relevant works in the IE field are summarized.
1.2.2.1 Some applications outside Industrial Engineering
Kriging is a general technique used for interpolation that originated in the field of Geo-statistics (Kbiob, 1951; Matheron, 1963). Its applications span a wide range of fields from geology and agriculture to economics and engineering. The following is a broad overview of the current practice outside engineering.
Mardia (2007) pointed out that the method of Maximum Likelihood (ML) used in computer experiments compares favorably to the use of variograms -the current practice in geostatistics- allowing the integrated estimation of the mean response vector and the covariance hyper-parameters. However, he also stated that the ML method is computa-tionally demanding for two reasons: the matrix factorization and the multidimensional optimization of the autocorrelation parameters. He also said that ML should be preferred over Restricted Maximum Likelihood (REML) because the latter is more sensitive to the effect of the mean response having higher Mean Squared Prediction Error (MSPE) in some cases. Pollice and Bilancia (2002) studied Kriging as a linear mixed effect model comprised of a mean part, a spatially correlated component, and random noise. They mention that the use of fixed mean effect and REML produces biased covariance estimates in an application to the prediction of soil properties.
small samples but high-dimensional inputs. They explained that the autocorrelation func-tion weights are governed not only by the trend specificafunc-tion -ranging from linear models to Fourier series- but also the choice of autocorrelation function; the Gaussian covariance is generally preferred because of its simplicity and symmetric properties. Moreover, Gins-bourger et al. (2009) noted that Kriging with a constant, also known as OK, is the most commonly used version of Kriging, but it has some flaws when the response is highly non-linear. They studied the effect of linear and quadratic mean responses, noticing that they produced poor results for extrapolation. Finally, they tested a non-linear additive model for the trend and ML for Kriging the residuals, but the resulting MSPE did not improve as expected.
Bachoc et al. (2014) applied Kriging to a thermal-hydraulic system. Their main goal was to improve model prediction by conditioning the mean response and the autocor-relation function through the use of priors. They reported the use of cross-validation to validate their results and used different autocorrelation functions, namely, Gaussian, Exponential, and Matèrn, once and twice differentiable in their separable form. They concluded that the use of a two stage Gaussian process model, whose first stage computes the autocorrelation function of the residuals using REML and then the second recalibrates the mean response, produced better mean squared prediction error than the regular GP.
dimension to be included in the model, while other authors have set the minimum at
2p+ 20 where p represents the number of predictors. Finally, they concluded that using RK the analyst can get “the best” of the data and that the “logit transformation” applied to the response variable can be useful to model non-linear relationships (Hengl, Heuvelink, and Stein, 2003).
Hung (2011) proposed the Iterative Reweighted Least Angle Regression (IRLARS) algorithm for Gaussian Processes and presents an application for circuit simulations. The method first computes adaptive Lasso (Least Absolute Shrinkage and Selection Operator) which perform simultaneous estimation and variable selection and then GP, repeating that sequence until convergence. The main drawback of his method is that it is iterative, which means the optimization process has to be run several times to achieve convergence. This becomes intractable for large GP models, and get worse if cross-validation and several random starting points are needed.
Finally, including a trend component allows for a broader interpretation of the Gaus-sian process because the analyst can have separate information about the underlying interpolation components. Despite the broad utilization of RK in geostatistics, the method has not extended through computer experiments, where the general trend is OK or its heteroskedastic version SK. A detailed explanation of this method and other advanced topics in the field of spatial statistics can be found in Cressie (2015), Illian et al. (2008), Isaaks and Srivastava (1989) and Gelfand et al. (2010).
1.2.2.2 Kriging in Engineering
The application of Kriging to engineering problems started with the work of Sacks et al. (1989). Later, Lophaven, Nielsen, and Sondergaard (2002) implemented the Design and
Analysis of Computer Experiments (DACE) using Matlab, making it a popular tool for
Kriging metamodeling. The implementation consists of efficient computations based on Cholesky decomposition for the autocorrelation matrix and LU transformations for the mean response, making it the cornerstone for newer Gaussian process software (Pedregosa et al., 2011; Couckuyt, Dhaene, and Demeester, 2014). An implementation of DACE using
object oriented paradigm in Matlab was provided by Couckuyt, Dhaene, and Demeester
(2014), extending the number of autocorrelation functions available to include the Matèrn family in its separable form. In this dissertation we use thesklearnGaussian process class
available in Python (Rossum, 1995) that according to its authors is a “shameless copy”
of the original DACE (Pedregosa et al., 2011). The interested reader can see Roustant, Ginsbourger, and Deville (2012, p. 3) for a detailed analysis of the current implementations
inR andMatlab. In summary, keeping track of the available software to fit Kriging models
and also define the availability for general use because R and Python are free but Matlab
is commercial.
Kleijnen et al. (2015) reviewed how RSM and Kriging relate to different Design of Experiments (DOE). They emphasized that each metamodel defines a specific DOE which is more appropriate for sensitivity analysis and optimization. They related fractional
fractorial designs with different resolutions and central composite design with RSM and
LHD or Nearly Orthogonal Latin Hypercube Design (NOLHD) with Kriging methods; additionally, they propose that the analyst should usesequential bifurcation to identify the most important parameters. They gave a detailed explanation of OK and UK applied to both deterministic (homoskedastic error) or stochastic (heteroskedastic error) simulation. Finally, the author made a statement that reflects most of the applications of Kriging in IE: “The disadvantage of UK is that UK requires the estimation of additional parameters:
beside µ=β0. We conjecture that the estimation of these q−1 extra parameters explains why UK has a higher MSE. In practice, most Kriging models do not use UK but OK”
(Kleijnen et al., 2015, p. 20).
optimization. They experimented with a small (s,S) inventory example using LHD and OK with Gaussian correlation function.
Mehdad and Kleijnen (2014) proposed a variant of Kriging known as Intrinsic Kriging (IK) which is based on the idea of using integrated random functions as means of filtering trend, equivalent to an integrated process in time series. They compared IK versus OK and UK with a fixed order polynomial of degreep(homoskedastic and heteroskedastic versions) concluding that IK gives smaller MSPE than OK and UK but requires more work selecting covariance parameters. Kleijnen (2009) provided an introduction to Kriging for the field of simulation and discussed the use of LHD and sequential design for simulation experiments taking advantage of the Gaussian process variance at non-simulated points. He explained both deterministic and stochastic Kriging framed as OK formulas. Some of his conclusions were: 1) Kriging studies need to be carried on more realistic simulations that the classical
M/M/1 queuing and the (s, S) inventory model which are simple academic examples,
and 2) more studies are required to fill the gap between Kriging and simulations with multivariate output. Dellino (2007) proposed a robust optimization framework which combines Taguchi methods and Kriging, but the main drawback is that the entire work is supported by only two examples based on the Economic Order Quantity (EOQ) and the (s,S) inventory problem making the results difficult to generalize to larger simulations.
Ankenman, Nelson, and Staum (2010) introduced the concept of Stochastic Kriging
‘heteroskedastic nugget effect’. They demonstrated that using this additional information provided by the replications, the analyst can get a more accurate model than deterministic
OK. Staum (2009) presented examples for the application of SK to the M/M/1problem
and explained why it is important to consider the effect of local variability produced in stochastic modeling. Moreover, he emphasized that misspecification of the regression basis produces poor predictions justifying the use of SK with just a intercept term for the mean and relaying mostly in the interpolation power of the random part. Chen, Ankenman, and Nelson (2013) showed how incorporating gradient information improves surface prediction. Additionally, Chen, Ankenman, and Nelson (2012) study the effect of Common Random Numbers (CRN), concluding that when the aim of the metamodel is to provide accurate prediction values as in financial risk, CRN is not recommended. However, CRN are recommended when better parameter estimates are required for sensitivity analysis.
Finally, the interested reader is encouraged to see Rasmussen and Bro (2012) and Rasmussen (2006) for a full treatment of Gaussian processes, and Santner, Williams, and Notz (2013), Fang, Li, and Sudjianto (2005), and Sacks et al. (1989) for applications of Kriging in computer experiments.
1.3
Discussion
not seem to be popular in other engineering fields. Table 10.1 in Hastie et al. (2009, p. 351) presents the most popular “off-the-shelf” data mining techniques. They pointed out that SVM and MLP on the one hand have good predictive power and can handle linear combination of covariates but, on the other hand, have bad scalability for large samples and poor interpretability, being extremely sensitive to input transformations. Even though they did not consider GP in the table, it can be said that GP closely follows SVM and MLP characteristics.
Table 1.1: Most cited metamodeling techniques
Technique Study
SVR
Villa-Vialaneix et al. (2012)
Ogutu, Piepho, and Schulz-Streeck (2011) Li et al. (2010)
Wang and Shan (2007)
Jin, Chen, and Simpson (2001)
MLP
Boutselis and Ringrose (2013) Villa-Vialaneix et al. (2012) Can and Heavey (2012) Li et al. (2010)
Wang and Shan (2007)
RF Villa-Vialaneix et al. (2012)
Ogutu, Piepho, and Schulz-Streeck (2011)
GP
Salemi, Nelson, and Staum (2014) Villa-Vialaneix et al. (2012) Joseph and Kang (2011) Li et al. (2010)
Wang and Shan (2007)
In regard to methods based on Trees, Hastie et al. (2009) say that their advantages are scalability, robustness to input transformation, and mixing different types of data. The main disadvantages are poor predictive power and difficulty of extracting linear com-binations; additionally, they give a “fair” interpretability score to trees. It is important to point that RF and GBRT are the strongest tree-based techniques and according to Hastie (2014) they have better prediction performance than most of the tree-based methods. He
also said that in general GBRT should perform better than RF.
Finally, Table 1.2 presents the current Gaussian process standard practices in both
spatial statistics and computer experiments. Outside of engineering the practice is to
detrend the surface using low-order polynomials and then compute the covariance structure either using variograms (the preferred choice) or REML to the detrended errors. On the other hand, computer experiments use a flat mean response and relies on an anisotropic parameter search to estimate the covariance structure in an integrated way with the mean response using ML.
Table 1.2: Current Gaussian process practices in Spatial Statistics and Computer Experiments
Area Mean response Covariance Separability θparameters
Spatial Statistics Polynomial function Matèrn No Variogram, RMLE
1.4
Dissertation Outline
Based on the previous review, we observed a lack of an integrated comparison of the state-of-the-art techniques with applications to simulation problems. Moreover, an im-portant number of the comparisons were done with either artificial data or data that is not representative of the behavior of complex simulated systems found in the IE models. Another important issue addressed by this work is the comparison between separable (product of one dimensional correlations) and non-separable (geometrically anisotropic) correlation functions Additionally, more importance should be given to the mean response when using Gaussian processes in computer experiments, especially for high-dimensional problems, so studying its importance for both deterministic and stochastic simulation will be an important part of this work.
The remainder of this dissertation is organized as follows:
• Chapter 2 presents a state-of-the-art comparison of current machine learning al-gorithms, namely, GP, SVM, MLP, RF and GBRT. Two different simulations are analyzed using three criteria: MSPE, R2, and MAX. Special emphasis is given to
reproducible research providing all the information needed to replicate the results presented. In order to assess the quality of MSPE,R2 and MAX, cross-validation is
used. Another important issue addressed in this chapter is the interpretability of each technique, which is generally overlooked in engineering applications.
• Chapter 3 studies the effect of different specifications of trend functions in GP. A two-stage method is proposed coupling GLMNET, the state-of-the-art model selection
adequate specification of the global trend for both prediction accuracy and model interpretation. The results from this chapter are contrasted with those obtained
in Chapter 2 following the same criteria RMSPE, R2 and MAX and degree of
interpretability.
• Chapter 4 explores the effect of the loose coupling proposed in Chapter 3 on a stochastic simulation. A DES of a DMV office in Charlotte, North Carolina is metamodeled in order to determine how the modern machine learning techniques respond to simulations with noise due to replications. Two experimental designs are analyzed conjointly with the proposed techniques in order to determine how sensitive the techniques are to the locations of points in the experimental region. The effect that the number of replications per point has on the accuracy of the responses when using GP is also explored. Finally, a complete analysis of coverage is presented.
Chapter 2
Metamodeling: A State-of-the-Art
Com-parison
2.1
Introduction
In this dissertation the variable definitions used in Hastie et al., 2009 will be employed. An input variable is given byX, whose components are {Xj}p1, wherepis the total number of
features. A quantitative output is given by Y. Sample values xi are written in lowercase, where xi is theith observed value ofX. Bold uppercase letters represent matrices. The set of samples, e.g. {xi}N1 would be theN ×p matrixX. Vectors withN elements are bold;
thus y and xj represent all observation on the response and the Xj variable respectively. Finally, xT
i is the transpose of xi since all vectors are assumed to be column vectors and
xi is the ith row of X.
To date, there are three major paradigms in simulation: SD, DES, and Agent Based Modeling (ABM). SD is a high level abstraction technique that models the dynamics of the systems defining accumulators (stocks) and flow rates (flows) that are connected by feedback loops. DES has a lower level of abstraction since it mirrors the process the modeler observes in real life; thus, it is a process-oriented technique that focus on modeling the sequence of operations done with entities. Finally, ABM is a bottom-up approach used when there is information about how the individual elements of a system behave, but no general understanding of how the general process behaves (Borshchev, 2013).
In one way or another, all these techniques, regardless of their abstractions, can still be very complicated and time consuming to analyze. To handle this problem another level of abstraction called “metamodel” (a model of the simulation) is used in place of the simulation to mimic the responses Y observed in the real world. Because the metamodel is a simplification of the simulation it adds additional noise m, so the real world response can be approximated by Y =fˆˆ(X) +sim+m.
2.2
Modern Machine Learning Techniques
Modern machine learning methods have overtaken simple linear models because of their better prediction accuracy when the relationship between the covariates and the response are non-linear, cannot be defined or both (Breiman et al., 2001). In this section a general overview of the five methods used in this chapter is given. For a complete and detailed derivation of each method the interested reader is encouraged to see Hastie et al. (2009) and Bishop (2006) who provide a deeper overview of each model.
2.2.1
Support Vector Regression (SVR)
Introduced by Cortes and Vapnik (1995) theε-SV regression finds a functionf(x)that has at mostε deviation from all training pointsyi; no deviation larger thanε will be accepted. Let Equation 2.1 be the function that approximates the relationship of the covariates and the response, where φ(·) represents a feature mapping to a higher dimension e.g.
xT = (x
1, x2)−→φ(x)T = (x1, x2, x21, x1x2, x22)
f(x) = β0+φ(x)Tβ (2.1)
Under Support Vector Regression (SVR) the optimal parameters for β0 and β are
found by minimizing Equation 2.2, whereC is a tuning parameter to control the trade-off between flatness off(·)and the amount of error above to be tolerated. This condition is needed because most of the time it is not possible to approximate all responses within
precision (Smola and Schölkopf, 2004, p. 200). λ is a penalty parameter applied to the
Equation 2.3 is the -insensitive loss function that ignores errors smaller than. Basak, Pal, and Patranabis (2007), Smola and Schölkopf (2004), and Hastie et al. (2009) show the step-by-step procedure for solving the dual optimization problem of Equation 2.2.
H(β, β0) = C
N
X
i=1
V(yi, f(xi)) +
λ
2||β||
2 (2.2)
where
V(yi, f(xi)) =
|f(xi)−yi| − if |f(xi)−yi| ≥
0 otherwise
(2.3)
They also show the optimal solution is given by Equations 2.4 and 2.5.
ˆ
β=
N
X
i=1
(ˆαi∗−αˆi)φ(xi) (2.4)
ˆ
f(x) = N
X
i=1
(ˆαi∗−αˆi)hφ(x), φ(xi)i+β0
ˆ
f(x) = N
X
i=1
(ˆαi∗−αˆi)k(xi, x) +β0 (2.5)
The important tuning parameters are C, ε and the type of kernel k(·,·). A kernel is essentially a covariance function that is symmetric and positive definite, and maps from input to feature space (Genton, 2002). In this work we follow Cherkassky and Ma (2004) who proposed an automatic hyper-parameter estimation from the data, reducing
that, according to the authors, includes the effects of outliers in the data.
C = max (|y¯+ 3σy|,|y¯−3σy|) (2.6)
For computingthe author proposed Equation 2.7 which requires the estimation of two additional parameters τ and σ. According to his experiments τ = 3 was a robust choice for different datasets. They proposed to estimateσ from the data using the residuals of a
high-order polynomial as shown in Equation 2.8, where N represents the total number of
training samples and pthe number of covariates being estimated, and yˆare the predicted values from a high-order polynomial (second-order in our case).
=τ σ
r
ln(N)
N (2.7)
ˆ
σ2 = N
N −p
PN
i=1(yi−yˆi)2
N
!
(2.8)
In addition,linear,RBF, andsecond and third-order polynomial kernels are used based on the fact that the linear kernel is less prone to over-fitting in high-dimensions, but less flexible when the relationship between the regressors and the response is non-linearly separable; whereas, RBF and polynomial kernels exhibit the opposite behavior. Equation 2.9 shows the RBF kernel for any two inputs x and z, where γ = 1/2σ2. Finally, it is
important to mention that Hastie et al. (2009) showed that SVR can be heavily affected bythe curse of dimensionality and noisy factors.
In summary, the hyperparemeters used in SVR were optimized testing different values for: kernel, gamma, C, epsilon and degree.
2.2.2
Multilayer Perceptron (MLP)
A sketch of a MLP is given in Figure 2.1 which illustrates the classic topology of a
feedforward artificial neural network (Svozil, Kvasnicka, and Pospichal, 1997). The input
layer has one node for each of the features in the design matrix and communicates the
external variables to the network. Then, the network shows a set of hidden layers, each comprised by a set of nodes. The number ofhidden layers and their respective nodes must be tuned by the user because incorrect specification of these parameters can induce either under or overfitting. Finally, the output layer consists of just one node for regression problems that uses a linear transfer function (Karsoliya, 2012).
...
... ...
X1
X2
X3
X16
H1,1
H1,8
H2,1
H2,8
Y7
Input layer
1st Hidden layer
2nd Hidden layer
Ouput layer
In general, one or two hidden layers with the same number of nodes are recommended by Karsoliya (2012, p. 716). In regard to the number of nodes per hidden layer, several rules of thumb are available in Karsoliya (2012, p. 716) such as the total number of nodes should be less than twice the number of nodes in the input layer, the number of nodes should be between the number of inputs and outputs nodes, etc. We used Equation 2.10 to determine the number of hidden nodes (Nh). Ni andNo represents the number of input and output nodes respectively,Ns the number of samples in the training set, and k is a scaling factor between two and ten.
Nh =
Ns
k(Ni +No) (2.10)
Another important parameter that is predefined by the user is the transfer function used to induce non-linearity to fit more complex functions. The logistic sigmoid and the
rectifier linear unit are the most commonly used, the later being more popular nowadays
(LeCun, Bengio, and Hinton, 2015; Nair and Hinton, 2010). The functional forms are given in Table 2.1.
Table 2.1: Transfer functions
Name Formula
Rectifier linear unit (Relu) f(x) = log(1 +ex)
Sigmoid (logistic) f(x) = 1+1ex
scaling” method where t denotes time. Next, the initial learning rate must be set to a small number in the interval [0,1]. Modern software allows forL2 regularization in order to
control over-fitting (see the MLP object in Pedregosa et al., 2011, for a detailed example).
X2 w2
Σ
f
Activate function
Z Output
X1 w1
X3 w3
Weights
Bias
w0
Inputs
Figure 2.2: Weight and activation function
by Equation 2.13.
ηj = p
X
i=1
wIiXi+wI0 (2.11)
f(ηj) =log(1 +eηj) (2.12)
Y =
m
X
j=1
wHjf(ηj) +wH0 +ε (2.13)
The main parameters to tune are the number of hidden layers and the number of nodes associated with each one, the transfer function, the initial learning rate values and its behavior during the training, and the regularization parameter. Once the feed forward pass has reached Equation 2.13 the prediction error is computed and if the convergence criterion is not met, the error is back-propagated in order to update thew. This process continues until convergence.
For hyperparameters optimization the following parameters were evaluated: hidden_-layer_sizes, activation_function, alpha, learning_rate, learning_rate_init.
2.2.3
Random Forest Regression (RF)
The Random forest is an ensemble method introduced by Breiman (2001). In simple words, the approach consists of an ensemble of several regression trees (as described in Breiman et al., 1984) which obtains a prediction based on a unitary vote from each tree. In order to understand the idea of RF first it is important to explain what a tree is. Figure 2.3 depicts different views of a regression tree. On the left can be seen a basic binary tree structure that consists of a set of if-then-else statements, that starting from a root
x13 >= 18
x13 >= 22
x13 >= 32
x5 < 124 x5 < 99
x5 < 124
x5 < 101
x13 >= 14
x5 < 70
x13 >= 14
x5 < 125 6.1
7.8 16 9.3
19 29 9.4 21 30 28 47 52 yes no (a) Tree x5 x13
(b) 3D surface
x5
x13
60 80 100 120 140
20
30
40
50
(c) 3D surface
Figure 2.3: Regression Tree for response 4 Hospital 1 data
The algorithm generates {Rm}M1 piece-wise constant regions of the feature space and
the predicted value in each region is the average response of the observations at that region fˆ(x) = PM
i=1y¯mI(x ∈ Rm) where Rm is an hypercube region, y¯m the average
of the responses in the region and I is the indicator function that is 1 if x is in Rm, zero otherwise. At each iteration the algorithm minimizes the node impurity, which in the case of regression is given by e(T) = PN
i=1
h
yi−PMm=1y¯mI(x∈Rm)
i2
finding at each iteration the variable/split combination that minimizes the prediction error e(T) with respect to y¯m and Rm. Once the algorithm has converged it gives a piece-wise con-stant model that is easy to interpret (a good read is Loh, 2011) as depicted in Figure 2.3(a).
either the data or the response, 3) they are sensitive to small perturbations having an impact on out-of-sample predictions and 4) its piece-wise constant structure affects prediction accuracy when the response decays smoothly.
In order to overcome these flaws Breiman (2001) proposed an ensemble of trees -a forest- that using some randomization strategies produces a more accurate and smooth surface than a single tree. The general description of how to generate a RF is presented in Algorithm 1, below:
Algorithm 1 Random forest pseudo-code
1: for m= 1 toM do .M is the total number of trees
.Sample with replacement N samples from X
2: Xm ←bootstrapN(X)
3: Fit a random forest tree Tm(Xm) following these instructions for each terminal
node
4: while Minimum node size is not reached do
5: Randomly selectd out of p features
6: Find the best feature\split combination out of the dreduced set.
7: Generate a binary partition of the feature at the split point
8: end while
9: end for
10: return the ensemble {Tm(Θm)}Mi
The recursive procedure used to fit a random tree using N bootstrapped observations and just m features at each iteration produces more uncorrelated trees which, according to Breiman (2001), improves the variance. The prediction of a random forest is the simple average of the unitary vote of the prediction of each tree in the random_forest as seen
if-then-elserules for the bth tree, B is the total number of trees andxis a new sample.
ˆ
frfM(x) = 1
M
M
X
m=1
T(x: Θm) (2.14)
Algorithm 1 is efficiently implemented in Pedregosa et al. (2011). Hyperparameter opti-mization was experimented on: max_features, max_depth, min_samples_split, min_-samples_leaf, and n_estimators options.
x13 >= 20
x13 >= 28
x5 < 122
x13 >= 23
x5 < 72
x5 < 127
x13 >= 14
x5 < 101 6.4
8.7
17 30
12
26
29 53
49
yes no
x13 >= 20
x13 >= 30
x5 < 122
x13 >= 24
x5 < 99
x13 >= 14
x5 < 127 6.8
8.9
18 32
15
28 43
61
yes no
(a) Boostrapped trees
x5
x13
(b) 3D surface
Figure 2.4: Random Forest for response 4 Hospital 1 data
2.2.4
Gradient Boosted Regression Tree (GBRT)
base learner) to the residual at each iteration.
F(x) = M
X
m=0
βmb(x; Θm) (2.15)
When the base learner is a regression tree, each iteration of GB grows a new tree. The process improves the fitting in regions where the previous trees did not do well. This is achieved because the new tree is fit to the “pseudo” residuals (the error left from the previous fits), generating completely different trees from iteration to iteration as shown in the left plot in Figure 2.5. In general, the trees will not be as deep as RF because the latter needs bushy trees, namely trees with high level interactions, to reduce variance through averaging. The right plot in Figure 2.5 shows that after sequentially fitting shallow trees GBRT produces a smoother surface-response than a simple regression tree (Figure 2.3)
x13 >= 18 x13 >= 22
x13 >= 32 x5 < 124
x5 < 99 x5 < 124
x5 < 101 x13 >= 14 x5 < 70
x13 >= 14 x5 < 125
6.1 7.8 16 9.3 19 29 9.4 21 30 28 47 52
yes no x13 < 12
x5 < 84 x13 >= 14 x13 < 16
x13 >= 20 x5 >= 134
x5 < 135 x5 < 98 x5 >= 101
x5 < 130 x5 >= 102
-5.8
-0.82
-4
-8.5 -0.58 -1.4
0.66 6
-0.81 6.8 13 9.4
yes no
(a) Sequential trees
x5
x13
(b) 3D surface
Figure 2.5: Random Forest for response 4 Hospital 1 data
Algorithm 2 Generalized Boosting
1: Initialize the functionF0(x) = 0
2: for m= 1 toM do
3: βm,Θm ←arg min
β,Θ
PN
i=1L(yi−Fm−1(xi) +β b(xi : Θ))
4: Fm(x)←Fm−1+εβmb(x: Θm)
5: end for
6: return fˆ(x) =FM(x)
linear regression or trees. The process needs an initial value for the function Fo(x), and then sequentially fits the base learner to the residuals. The Generalized Boosting algorithm uses astage-wise procedure to compute the parameters βm andΘm in line 3 of Algorithm 2 (Friedman, 2002; Hastie et al., 2009). According to Friedman (2002) this stage-wise procedure makes tractable a potentially difficult problem, an idea that will be explored in Chapter 3. Line 4 shows the update of the cumulative sequential fitting, whereε is a scalar in the range(0,1] used to slow down the rate of overfitting, by updating the model by a small amount.
A customized implementation of GB for trees is given in Friedman (2002) Algorithm 1, which is presented here in Algorithm 3. The modification for trees starts in Line 6
where GBRT fits a L-terminal nodes tree that produces {Rlm}L
1 disjoint regions. As
explained before, a regression tree predicts a constant value for each region; thus, the update after each iteration just reduces to an estimate a “location” parameter for each region of the mth tree as in Line 8 and then adds it to the cumulative model using the indicator function as in Line 10 (Friedman, 2002). The interested reader is encouraged to read Friedman (2002, p. 2).