Model Evaluation 3:

Full text

(1)Lecture 10. Model Evaluation 3: Cross Validation STAT 479: Machine Learning, Fall 2018. Sebastian Raschka. http://stat.wisc.edu/~sraschka/teaching/stat479-fs2018/. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 1.

(2) Some News. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 2.

(3) https://gradient.paperspace.com/free-gpu Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 3.

(4) Happy Halloween!. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 4.

(5) https://www.deeplearning.ai/thebatch/ Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 5.

(6) Happy Halloween! Visions of Machine Learning AI gone wrong & Malevolent superintelligences. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 6.

(7) 1) AI Goes Rogue •. Code awakens into sentience, will be able to think better than humans (btw. have you heard of Movarek's paradox?). •. Computers already “think” much faster than humans; they remember more information than humans, too. •. Artificial intelligence already manages crucial systems in fields like finance, security, and communications. •. It will enslave or exterminate our species (https://www.nytimes.com/2009/07/26/science/26robot.html). Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 7.

(8) 1) AI Goes Rogue •. Code awakens into sentience, will be able to think better than humans (btw. have you heard of Movarek's paradox?). •. Computers already “think” much faster than humans; they remember more information than humans, too. •. Artificial intelligence already manages crucial systems in fields like finance, security, and communications. •. It will enslave or exterminate our species (https://www.nytimes.com/2009/07/26/science/26robot.html). How scared should we be?. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 8.

(9) 2) Deepfakes Wreak Havoc •. Deepfakes of celebrities (https://www.engadget.com/ 2019/10/11/deepfake-celebrity-impresonations);. •. GPT-2 language model’s ability to churn out faux articles that convince readers they're from the New York Times (https:// www.foreignaffairs.com/articles/2019-08-02/not-your-fathersbots). Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 9.

(10) 2) Deepfakes Wreak Havoc •. Deepfakes of celebrities (https://www.engadget.com/ 2019/10/11/deepfake-celebrity-impresonations);. •. GPT-2 language model’s ability to churn out faux articles that convince readers they're from the New York Times (https:// www.foreignaffairs.com/articles/2019-08-02/not-your-fathersbots). How scared should we be?. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 10.

(11) 2) Deepfakes Wreak Havoc •. Deepfakes of celebrities (https://www.engadget.com/ 2019/10/11/deepfake-celebrity-impresonations);. •. GPT-2 language model’s ability to churn out faux articles that convince readers they're from the New York Times (https:// www.foreignaffairs.com/articles/2019-08-02/not-your-fathersbots). How scared should we be?. https://www.biometricupdate.com/201905/researchers-develop-digital-watermarks-to-detectdeepfakes. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 11.

(12) 3) No Escape from Surveillance •. Artificial intelligence will boost the power of surveillance, effectively making privacy obsolete. •. Smartphone applications track your location, browsing history, and even mine your contact data, thanks to the permissions you give them in exchange for free apps.. •. More than half of all US companies monitor their employees — including email monitoring and biometric tracking (https:// www.gartner.com/smarterwithgartner/the-future-of-employeemonitoring). •. AI surveillance is used by local, state, or national governments in over 40 percent of the world’s countries (https:// carnegieendowment.org/2019/09/17/global-expansion-of-aisurveillance-pub-79847) Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 12.

(13) 3) No Escape from Surveillance •. Artificial intelligence will boost the power of surveillance, effectively making privacy obsolete. •. Smartphone applications track your location, browsing history, and even mine your contact data, thanks to the permissions you give them in exchange for free apps.. How scared should we be? • More than half of all US companies monitor their employees — including email monitoring and biometric tracking (https:// www.gartner.com/smarterwithgartner/the-future-of-employeemonitoring). •. AI surveillance is used by local, state, or national governments in over 40 percent of the world’s countries (https:// carnegieendowment.org/2019/09/17/global-expansion-of-aisurveillance-pub-79847) Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 13.

(14) https://www.nytimes.com/interactive/2019/04/16/opinion/facialrecognition-new-york-city.html Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 14.

(15) FlowSAN: Privacy-enhancing Semi-Adversarial Networks to Confound Arbitrary Face-based Gender Classifiers Improvements to better control the perturbations and enhance the removal of soft-biometric information. Vahid Mirjalili, Sebastian Raschka, Arun Ross (2019) FlowSAN: Privacy-enhancing Semi-Adversarial Networks to Confound Arbitrary Face-based Gender Classifiers IEEE Access 2019, 10.1109/ACCESS.2019.2924619. Sebastian Raschka. ML4MI Seminar, Oct 2019. Sebastian Raschka. STAT 479: Machine Learning. 48. FS 2019. 15.

(16) 4) Biased Data Trains Oppressive AI. •. AI learns from data to reach its own conclusions. •. But training datasets are often gathered from and curated by humans who have social biases.. •. The risk that AI will reinforce existing social biases is rising as the technology increasingly governs education, employment, loan applications, legal representation, and press coverage.. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 16.

(17) http://proceedings.mlr.press/v81/buolamwini18a.html Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 17.

(18) 4) Biased Data Trains Oppressive AI. •. AI learns from data to reach its own conclusions. But trainingscared datasets are often gathered from we and curated by •How should be? humans who have social biases.. •. The risk that AI will reinforce existing social biases is rising as the technology increasingly governs education, employment, loan applications, legal representation, and press coverage.. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 18.

(19) Gender Privacy: An Ensemble of Semi Adversarial Networks for Confounding Arbitrary Gender Classifiers Improvements to construct a more diverse set of SAN models for better generalizability via ensembling G5 G3. G2. Hypothesis space of gender classifiers G2. G4 G1. G3. G4. G5 G1 Non-diverse: Ensemble SAN cannot generalize. Diverse: Ensemble SAN can generalize. Figure 1: Diversity in an ensemble SAN can be enhanced through its auxiliary gender classifiers (see Figure 2). When the auxiliary gender classifiers lack diversity, ensemble SAN cannot generalize well to arbitrary gender classifiers.. Figure Architecture of the original SAN model [31]. group of atFigure 4:2:Face prototypes computed for each tribute labels. The abbreviations at the bottom of each image refer to the prototype attribute-classes, where Y=young, O=old, M=male, F=female, W=white, B=black.. 3.2. Diversity in Autoencoder Ensembles. Vahid Mirjalili, Sebastian Raschka, and Arun Ross (2018) Gender Privacy: An Ensemble of Semi Adversarial groups. For each group, we generate a prototype image, One of for theConfounding key aspects of neural Gender networkClassifiers. ensembles 9th is IEEE Networks Arbitrary International Conference on Biometrics: Theory, diversity among individual network Applications, andthe Systems (BTAS 2018)models [17]. Sevwhich is the average of all face images from the training. eral techniques have been proposed in the literature for en- dataset that belong to that group. Hence, given eight distinct Sebastian Raschka Seminar, Oct 2019 47 hancing diversity among individual networks inML4MI an ensem-. categories or groups, eight different prototypes are computed. Next, an opposite-attribute prototype is defined by flipping one of the binary attribute labels of an input image. For example, input STAT 479: Machine Learning if theFS 2019image had the attribute labels. ble, such as seeding the networks with different random weights, choosing different network architectures, or using bootstrap samples of the training data [42, 15]. In the context of SAN models, autoencoder diversity can Sebastian Raschka. 19.

(20) 5) Machines Take Everyone's Jobs •. AI will exceed human performance at a wide range of activities. Huge populations will become jobless. •. Historically, technology created more jobs than it destroyed. What makes AI different is it threatens to outsource the one thing humans have always relied on for employment: their brains.. •. Massive unemployment in the past have brought severe social disruption. The U.S. Great Depression in the 1930s saw jobless rates above 34 percent.. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 20.

(21) 5) Machines Take Everyone's Jobs •. AI will exceed human performance at a wide range of activities. Huge populations will become jobless. How scared should we be?. •. Historically, technology created more jobs than it destroyed. What makes AI different is it threatens to outsource the one thing humans have always relied on for employment: their brains.. •. Massive unemployment in the past have brought severe social disruption. The U.S. Great Depression in the 1930s saw jobless rates above 34 percent.. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 21.

(22) 5) Machines Take Everyone's Jobs •. AI will exceed human performance at a wide range of activities. Huge populations will become jobless. How scared should we be?. •. Historically, technology created more jobs than it destroyed. What makes AI different is it threatens to outsource the one thing humans have always relied on for employment: their brains.. Lifelong learning is a front-line defense (and a rewarding Education can help ahead of partial Massive unemployment in the pastyou havestay brought severe social • pursuit!). disruption. The U.S. Great Depression in theor 1930s saw lanes if automation in your current profession change joblessyour ratesprofession above 34 percent. is being automated away.. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 22.

(23) 6) AI Winter Sets In. •. Could the flood of hype for artificial intelligence lead to a catastrophic collapse in funding?. •. AI will fail to deliver on promises inflated by businesses and researchers. •. Funding will dry up, research will sputter, and progress will stall. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 23.

(24) 6) AI Winter Sets In. •. Could the flood of hype for artificial intelligence lead to a catastrophic collapse in funding?. •. AI will fail to deliver on promises inflated by businesses and researchers. •. Funding will dry up, research will sputter, and progress will stall. How scared should we be?. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 24.

(25) 6) AI Winter Sets In. How scared should we be? •. Could the flood of hype for artificial intelligence lead to a catastrophic collapse in funding?. •. Aswill AIfail practitioners, shouldinflated strive by to present ourand work AI to deliver onwe promises businesses honestly, criticize one another fairly and openly, and researchers. •. promote projects that demonstrate clear value. Genuine Funding will in dryimproving up, research will sputter, progress will stall progress peoples’ livesand is the best way to ensure that AI enjoys perpetual springtime.. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 25.

(26) https://www.deeplearning.ai/thebatch/ Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 26.

(27) Lecture 10. Model Evaluation 3: Cross Validation STAT 479: Machine Learning, Fall 2018. Sebastian Raschka. http://stat.wisc.edu/~sraschka/teaching/stat479-fs2018/. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 27.

(28) Large dataset Performance estimation Small dataset. ▪. 2-way holdout method (train/test split). ▪. Confidence interval via normal approximation. ▪. (Repeated) k-fold cross-validation without independent test set. ▪. Leave-one-out cross-validation without independent test set. ▪. Large dataset Model selection (hyperparameter optimization) and performance estimation Small dataset. Large dataset Model & algorithm comparison Small dataset. ▪. Confidence interval via 0.632(+) bootstrap. 3-way holdout method (train/validation/test split). ▪. (Repeated) k-fold cross-validation with independent test set. ▪. Leave-one-out cross-validation with independent test set. ▪. Multiple independent training sets + test sets (algorithm comparison, AC). ▪. McNemar test (model comparison, MC). ▪. Cochran’s Q + McNemar test (MC). ▪. Combined 5x2cv F test (AC). ▪. Nested cross-validation (AC). Sebastian Raschka. STAT 479: Machine Learning. These are my. personal recommendations;. MC = model comparison,. AC = algorithm comparison;. Terms that are still unfamiliar. (McNemar's test, 5x2cv F test, etc.) will be covered next lecture.. FS 2019. 28.

(29) Overview. Bias and Variance Basics. Overfitting and Underfitting Holdout method Confidence Intervals. Model Eval Lectures. Repeated holdout. Resampling methods. Empirical confidence intervals Hyperparameter tuning. Cross-Validation. Model selection. This Lecture Algorithm Selection. Statistical Tests Evaluation Metrics. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 29.

(30) What are Hyperparameters?. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 30.

(31) Hyperparameters. feature 2. nonparametric model: k-nearest neighbors. ? k=1. k=1. kk=3 =3 k=5. Hyperparameter. k=5. feature 1. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 31.

(32) Hyperparameters parametric model: logistic regression Hyperparameter. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 32.

(33) 3-Way Holdout. 4. instead of "regular" holdout to avoid "data leakage" during hyperparameter optimization. Training Data. Validation Data. Training Labels. Validation Labels. Test Data. 5. Best Hyperparameter Values. Learning Algorithm. Model. Prediction. Performance. Model. Test Labels. Training Data. Training Labels. 1. Data. Validation Data. Labels. Validation Labels. 6. Best Hyperparameter Values. Data. Learning Algorithm. Labels. Test Data. Final Model. Test Labels. Hyperparameter values. 2. Training Data. Hyperparameter values. Training Labels. Model. Learning Algorithm. Model. Hyperparameter values. Validation Data. Prediction. Model. Validation Labels. Model. Performance. 3. Validation Data. Prediction. Model. Validation Labels. Validation Data. Prediction. Model. Validation Labels. Best Hyperparameter values. Performance. Best Model. Performance. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 33.

(34) Main points why we evaluate the predictive performance of a model: 1. Want to estimate the generalization performance, the predictive performance of our model on future (unseen) data.. 2. Want to increase the predictive performance by tweaking the learning algorithm and selecting the best performing model from a given hypothesis space.. 3. Want to identify the ML algorithm that is best-suited for the problem at hand; thus, we want to compare different algorithms, selecting the best-performing one as well as the best performing model from the algorithm’s hypothesis space.. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 34.

(35) k-Fold Cross-Validation Part 1 Model Evaluation Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 35.

(36) k-Fold Cross-Validation Validation Fold. K Iterations (K-Folds). A. B. Training Fold. 1st. Performance 1. 2nd. Performance 2. 3rd. Performance 3. 4th. Performance 4. 5th. Performance 5. Training Fold Data Training Fold Labels. Performance. =. 1 5. 5. ∑ Performance. i. i =1. C Validation Fold Data. Prediction. Performance. Model Hyperparameter Values. Validation Fold Labels. Learning Algorithm. Model. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 36.

(37) k-Fold Cross-Validation non-overlapping test folds; utilizes all data for testing. overlapping training folds. some variance estimate from different training sets, (but no unbiased estimate). more pessimistic for small k because we withhold data from fitting A. Validation Fold. K Iterations (K-Folds). -. B. Training Fold. 1st. Performance 1. 2nd. Performance 2. 3rd. Performance 3. 4th. Performance 4. 5th. Performance 5. Training Fold Data Training Fold Labels. Performance. =. 1 5. 5. ∑ Performance. C Validation Fold Data. Prediction. Performance. Model Hyperparameter Values. i. i =1. Validation Fold Labels. Learning Algorithm. Model. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 37.

(38) k-Fold CV special cases: k=2 & k=n Training. Evaluation. 1 2 3 4 5 6 7 8 9 10. 1 2 3 4 5 6 7 8 9 10. Holdout Method. 2-Fold Cross-Validation. 1 2 3 4 5 6 7 8 9 10. 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10. Repeated Holdout. 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10. .... Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 38.

(39) k-Fold CV special cases: k=2 & k=n Training Training. Evaluation. 1 2 3 4 5 6 7 8 9 10. evaluation. 1 2 3 4 5 6 7 8 9 10 Holdout Method. 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10. 1 2 3 4 5 6 7 8 9 10. 2-Fold Cross-Validation. 1 2 3 4 5 6 7 8 9 10. 1 2 3 4 5 6 7 8 9 10. 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10. 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10. Repeated Holdout. 1 2 3 4 5 6 7 8 9 10. 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10. 1 2 3 4 5 6 7 8 9 10. 1 2 3 4 5 6 7 8 9 10. .... 1 2 3 4 5 6 7 8 9 10 Sebastian Raschka. STAT 479: Machine Learning. FS 2019. LOOCV 39.

(40) k-Fold Cross-Validation "[...] where available sample sizes are modest, holding back compounds for model testing is ill-advised. This fragmentation of the sample harms the calibration and does not give a trustworthy assessment of fit anyway. It is better to use all data for the calibration step and check the fit by cross-validation, making sure that the cross-validation is carried out correctly. [...] The only motivation to rely on the holdout sample rather than cross-validation would be if there was reason to think the cross-validation not trustworthy -- biased or highly variable. But neither theoretical results nor the empiric results sketched here give any reason to disbelieve the cross-validation results." [1] 1. Hawkins, D. M., Basak, S. C., & Mills, D. (2003). Assessing model fit by cross-validation. Journal of chemical information and computer sciences, 43(2), 579-586.. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 40.

(41) Table 1: Summary of the findings from the LOOCV vs. holdout comparison study conducted Hawkins and others Douglas M Hawkins, Subhash C Basak, and Denise Mills. “Assessing mo t by cross-validation”. In: Journal of Chemical Information and Computer Sciences 43.2 (20 p. 579–586. See text for details.. LOOCV vs Holdout. Experiment True R2 — q 2 True R2 — hold 50 True R2 — hold 20 True R2 — hold 10. Mean 0.010 0.028 0.055 0.123. Standard deviation 0.149 0.184 0.305 0.504. 1. Hawkins, D. M., Basak, S. C., & Mills, D. (2003). Assessing model fit by cross-validation. Journal of chemical information and computer sciences, 43(2), 579-586.. ets and averaging the results. In rows 2-4, the researchers used the holdout method tting models to the 100-example training sets, and they evaluated the performances oldout sets of"mean" sizes 10, 20,toand samples. Each experiment was coefficients repeated 75 times, The reported refers the 50 averaged difference between the true 2 2 2 heofmean column shows the average di↵erence between the estimated R and the true determination (R ) and the coefficients obtained via LOOCV (here called q ) after alues. As we see, theon estimates via LOOCVtraining (q 2 ) are the closest to the true repeating thiscan procedure multiple, obtained different 100-example sets n average. The estimates obtained from the 50-example test set via the holdout met changingthough. the random seedon in LOOCV?) re(why alsonot passable, Based these particular experiments, we may agree with esearchers’ conclusion: Taking the third of these points, if you have 150 or more compounds available, then you can certainly make a random split into 100 for calibration and 50 or more for testing. However it is hard to Machine see why want to do this. Sebastian Raschka STAT 479: Learningyou would FS 2019 41.

(42) Sebastian Raschka. STAT479 FS18. L10: Cross Validation. LOOCV vs Holdout. Page 10. Table 1: Summary of the findings from the LOOCV vs. holdout comparison study conducted by Hawkins and others Douglas M Hawkins, Subhash C Basak, and Denise Mills. “Assessing model fit by cross-validation”. In: Journal of Chemical Information and Computer Sciences 43.2 (2003), pp. 579–586. See text for details.. Experiment True R2 — q 2 True R2 — hold 50 True R2 — hold 20 True R2 — hold 10. Mean 0.010 0.028 0.055 0.123. Standard deviation 0.149 0.184 0.305 0.504. 1. Hawkins, D. M., Basak, S. C., & Mills, D. (2003). Assessing model fit by cross-validation. Journal of chemical information and computer sciences, 43(2), 579-586.. sets and averaging the results. In rows 2-4, the researchers used the holdout method for fitting models to the 100-example training sets, and they evaluated the performances on holdout sets of sizes 10, 20, and 50 samples. Each experiment was repeated 75 times, and the mean column shows the average di↵erence between the estimated R2 and the true R2 2 2 values. As we can see, the estimates obtained via LOOCV (q ) are the closest to the true R The reported "mean" refers to the averaged difference between the true on average. The estimates obtained from the 50-example test set via the holdout method coefficients ofaredetermination (R2 )Based and on thethese coefficients obtainedwe viamay LOOCV (here also passable, though. particular experiments, agree with the called q2) after repeating this procedure on different 100-example training researchers’ conclusion:. In rows 2-4, the researchers used thepoints, holdout formore fitting models to the 100Taking the third of these if youmethod have 150 or compounds available, example training sets, and evaluated the performances holdout then you canthey certainly make a random split into 100 for on calibration andsets 50 or of sizes more for testing. However it is hard to repeated see why you 75 would want toand do this. 10, 20, and 50 samples. Each experiment was times, the mean M Hawkins, Subhash C Basak,the andestimated Denise Mills.R2“Assessing column shows theDouglas average difference between and themodel true R2 fit by cross-validation”. In: Journal of Chemical Information and Computer values. Sciences 43.2 (2003), pp. 579–586 One reason why we may prefer the holdout method may be concerns about computational efficiency, if the dataset is sufficiently large. As Learning a rule of thumb, Sebastian Raschka STAT 479: Machine FS 2019we can say that the. 42.

(43) Problems with LOOCV for Classification. •. While LOOCV is almost unbiased, one downside of using LOOCV over k-fold cross-validation with k < n is the large variance of the LOOCV estimate.. •. LOOCV is "defect" when using a discontinuous loss-function such as the 0-1 loss in classification or even in continuous loss functions such as the mean-squared-error.. •. LOOCV has high variance because the test set only contains one example.. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 43.

(44) Problems with LOOCV for Classification "With k=n, the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but can have high variance because the n "training sets" are so similar to one another." [1]. [1] Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning (Vol. 1, No. 10). New York, NY, USA: Springer series in statistics.. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 44.

(45) Problems with LOOCV for Classification Var. n. n. n. n. Xi = Cov(Xi, Xj) = Var(Xi) + 2 Cov(Xi, Xj) ∑∑ ∑ ∑ (∑ ) i=1 i=1 j=1 i=1 1≤i<j≤n For correlated variables, the variance of their sum is the sum of their covariances. Or in other words, we can attribute the high variance to the fact that the mean of highly correlated variables has a higher variance than the mean of variables that are not highly correlated?. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 45.

(46) Empirical Study and Recommendation % acc 75 % acc 70 Soybean 75. std dev Vehicle 7 Soybean. Estimated Soybean Vehicle. 65 70 Soybean. 65 60 Vehicle. 6 5. Rand. 3. Rand. 50 45 Rand. 2 5. 2. 45. % acc1 5 2 100 % 99.5 acc Mushroom 100 99 99.5 Mushroom Hypo 98.5 99 Chess 98Hypo 98.5 Chess 97.5 98 97 97.5 96.5 97 96.5 96 2. 96. Rand Breast. 4. 55 60 Vehicle. 55 50. C4.5. 1. 2. 10. 10. 20. 20. -5. -2. 50 100. folds. -1 samples. 1 0. Hypo Chess Mushroom 2. std dev. 5. 10. 20. -5. -2. -1. -5. -2. -1. folds. Naive-Bayes. 7 6 5 4. Vehicle Rand Soybean Breast. 3 2 5. 5. 10. 20. -5. -2. 10. 20. 50. 100. folds -1 samples. 1. Chess Hypo Mushroom. 0. 2. 5. 10. 20. folds. Kohavi, R. (1995, August). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Ijcai (Vol. 14, No. 2, pp. 1137-1145).. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 46.

(47) Summarizing k-Fold CV for Model Evaluation What happens if we increase k? • The bias of the performance estimator ____creases (more accurate / more variable?). • The variance of the performance estimators ____creases (more accurate / more variable?). • The computational cost ____creases . (more iterations, larger training sets during fitting). • Exception: decreasing the value of k in k-fold cross-validation to small values (for example, 2 or 3) also ____creases the variance on small datasets due to random sampling effects Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 47.

(48) k-Fold Cross-Validation Part 2 Model Selection. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 48.

(49) Training Data Training Labels. Data. 1. Labels. Test Data Test Labels. Hyperparameter values. 2. Training Data. Hyperparameter values. Training Labels. Performance. Learning Algorithm. Performance. Hyperparameter values. 3. Best Hyperparameter Values. Training Data. Learning Algorithm. Training Labels. Test Data. 4. Sebastian Raschka. Model. Prediction. Performance. Model. 5. Performance. Test Labels. Best Hyperparameter Values. Data Labels. Learning Algorithm. STAT 479: Machine Learning. Final Model. FS 2019. 49.

(50) Training Data. 1. Training Labels. Data Labels. Test Data Test Labels. Hyperparameter values. 2. Training Data Training Labels. Hyperparameter values. Performance. Learning Algorithm. Performance. Hyperparameter values. 3. Training Data Training Labels. Sebastian Raschka. Performance. Best Hyperparameter Values. Model. Learning Algorithm. STAT 479: Machine Learning. FS 2019. 50.

(51) Test Data. 4. 5. Prediction. Performance. Model. Test Labels. Data. Best Hyperparameter Values. Learning Algorithm. Labels. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. Final Model. 51.

(52) The Law of Parsimony. Occam's Razor: "Among competing hypotheses, the one with the fewest assumptions should be selected.". https://en.wikipedia.org/wiki/Occam%27s_razor Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 52.

(53) The Law of Parsimony "Simpler models are more accurate. This belief is sometimes equated with Occam's razor, but the razor only says that simpler explanations are preferable, not why. They're preferable because they're easier to understand, remember, and reason with. Sometimes the simplest hypothesis consistent with the data is less accurate for prediction than a more complicated one. Some of the most powerful learning algorithms output models that seem gratuitously elaborate -sometimes even continuing to add to them after they've perfectly fit the data -- but that's how they beat the less powerful ones.". Pedro Domingos: "Then Myths about Machine Learning" https://medium.com/@pedromdd/ten-myths-about-machine-learning-d888b48334a3. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 53.

(54) The 1-standard error method "... However, if two models perform equally well, the simpler. one seems more likely (among other advantages)" Pedro Domingos: "Then Myths about Machine Learning" https://medium.com/@pedromdd/ten-myths-about-machine-learning-d888b48334a3. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 54.

(55) The 1-standard error method "... However, if two models perform equally well, the simpler. one seems more likely (among other advantages)" Pedro Domingos: "Then Myths about Machine Learning". 1. Consider the numerically optimal estimate and its standard error.. 2. Select the model whose performance is within one standard error of the value obtained in step 1.. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 55.

(56) The 1-standard error method. (Some toy data I generated via scikit-learn) Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 56.

(57) The 1-standard error method Consider a RBF-kernel SVM, where gamma controls the influence of the training points. (don't need to know the details, yet) 2 | | | | K(x , x ) = exp(−γ x − x ), γ > 0. Gaussian/RBF-kernel: i j i j. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 57.

(58) The 1-standard error method Which parameter would you select?. =0. (note: here I used 10-fold CV) Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 58.

(59) The 1-standard error method Which parameter would you select?. = 0.001. = 10.0. = 0.1. (note: here I used 10-fold CV) Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 59.

(60) Code Examples https://github.com/rasbt/stat479-machine-learning-fs19/blob/ master/10_eval3-cv/code/10_eval3-cv__code.ipynb. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 60.

(61) Reading Assignment. L10 Lecture Notes: https://github.com/rasbt/stat479-machine-learning-fs19/blob/ master/10_eval3-cv/10-eval3-cv__notes.pdf. Kohavi, Ron. "A study of cross-validation and bootstrap for accuracy estimation and model selection." IJCAI. Vol. 14. No. 2. 1995. https://ai.stanford.edu/~ronnyk/accEst.pdf. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 61.

(62) Overview Bias and Variance Basics. Overfitting and Underfitting Holdout method Confidence Intervals. Model Eval Lectures. Repeated holdout. Resampling methods. Empirical confidence intervals Hyperparameter tuning. Cross-Validation. Model selection Algorithm Selection. Statistical Tests. Next Lecture. Evaluation Metrics. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 62.

(63)

No results found