Sebastian Raschka STAT 479: Machine Learning FS 2018
Data Preprocessing and
Machine Learning with Scikit-Learn
(Computational Foundations Part 3/3)
Lecture 05
STAT 479: Machine Learning, Fall 2018 Sebastian Raschka
http://stat.wisc.edu/~sraschka/teaching/stat479-fs2018/
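[Figure: Machine learning workflow in four stages: Preprocessing, Learning, Evaluation, Prediction. Labeled raw data is split into a training dataset and a test dataset. Preprocessing covers feature extraction and scaling, feature selection, dimensionality reduction, and sampling. Learning covers model selection, cross-validation, performance metrics, and hyperparameter optimization. The learning algorithm yields a final model, which is evaluated on the test dataset and then predicts labels for new data.]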
Reading a Dataset from a Tabular Text File
[Figure: The three Iris classes: Iris-Versicolor, Iris-Virginica, Iris-Setosa.]
Fisher, R.A. "The use of multiple measurements in taxonomic problems." Annals of Eugenics, 7, Part II, 179-188 (1936); also in
"Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
https://pandas.pydata.org
McKinney, Wes. "Data structures for statistical computing in python."
Proceedings of the 9th Python in Science Conference. Vol. 445. 2010.
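As a minimal sketch of reading a tabular text file with pandas, the snippet below parses a four-row excerpt laid out like the UCI iris.data file (comma-separated, no header row). The in-memory buffer stands in for a file path or URL, and the column names are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Four rows in the same comma-separated layout as the UCI iris.data file.
# The StringIO buffer is a stand-in for a real file path or URL.
csv_text = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# Illustrative column names; the raw file itself has no header row.
column_names = ["sepal_length", "sepal_width",
                "petal_length", "petal_width", "species"]

df = pd.read_csv(io.StringIO(csv_text), header=None, names=column_names)

print(df.shape)   # (4, 5)
print(df.head())
```

Passing `header=None` tells pandas not to treat the first data row as column names.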
Basic Data Handling
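The following is a small sketch of typical data-handling steps before modeling: mapping string class labels to integers and extracting NumPy arrays. The tiny DataFrame and the particular label mapping are assumptions made for illustration.

```python
import pandas as pd

# A tiny illustrative table; real data would come from pd.read_csv.
df = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0, 1.3],
    "petal_width": [0.2, 1.4, 2.5, 0.2],
    "species": ["Iris-setosa", "Iris-versicolor",
                "Iris-virginica", "Iris-setosa"],
})

# Map string class labels to integers (the numbering is arbitrary).
label_map = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
df["label"] = df["species"].map(label_map)

# Extract a 2D feature matrix X and a 1D label vector y as NumPy arrays.
X = df[["petal_length", "petal_width"]].to_numpy()
y = df["label"].to_numpy()

print(X.shape, y.shape)  # (4, 2) (4,)
print(y.tolist())        # [0, 1, 2, 0]
```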
Raschka, Sebastian. "MLxtend: Providing machine learning and data science utilities and extensions to Python’s scientific computing stack."
The Journal of Open Source Software 3.24 (2018).
http://rasbt.github.io/mlxtend/
Python Classes
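A minimal sketch of a Python class, written in the fit/predict style that scikit-learn estimators use. `MajorityClassifier` is a hypothetical toy example, not part of any library.

```python
from collections import Counter


class MajorityClassifier:
    """Toy classifier that always predicts the most frequent training label."""

    def __init__(self):
        # The trailing underscore marks a value learned from data,
        # a naming convention borrowed from scikit-learn.
        self.majority_label_ = None

    def fit(self, X, y):
        # X is ignored; the "model" is just the most common label in y.
        self.majority_label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        if self.majority_label_ is None:
            raise RuntimeError("Call fit before predict.")
        return [self.majority_label_ for _ in X]


clf = MajorityClassifier()
clf.fit([[1.0], [2.0], [3.0]], [0, 1, 1])
predictions = clf.predict([[4.0], [5.0]])
print(predictions)  # [1, 1]
```

The object stores state (`majority_label_`) set by one method and used by another, which is the main reason to use a class rather than a plain function here.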
http://scikit-learn.org
Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python."
Journal of machine learning research 12.Oct (2011): 2825-2830.
Scikit-learn Estimator API
[Figure: (1) est.fit(X_train, y_train) builds a model from the training data and training labels. (2) est.predict(X_test) applies the model to the test data and returns the predicted labels.]
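The two steps of the estimator API can be sketched as follows. The choice of a k-nearest neighbors classifier, the 70/30 split, and the random seed are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

est = KNeighborsClassifier(n_neighbors=3)
est.fit(X_train, y_train)      # step 1: learn from training data and labels
y_pred = est.predict(X_test)   # step 2: predict labels for the test data

print(y_pred.shape)  # (45,)
```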
1.3 Resubstitution Validation and the Holdout Method
The holdout method is inarguably the simplest model evaluation technique; it can be summarized as follows. First, we take a labeled dataset and split it into two parts: a training set and a test set. Then, we fit a model to the training data and predict the labels of the test set. The fraction of correct predictions, which can be computed by comparing the predicted labels to the ground truth labels of the test set, constitutes our estimate of the model's prediction accuracy. Here, it is important to note that we do not want to train and evaluate a model on the same training dataset (this is called resubstitution validation or resubstitution evaluation), since it would typically introduce a very optimistic bias due to overfitting. In other words, we cannot tell whether the model simply memorized the training data, or whether it generalizes well to new, unseen data. (On a side note, we can estimate this so-called optimism bias as the difference between the training and test accuracy.)
Typically, the splitting of a dataset into training and test sets is a simple process of random subsampling. We assume that all data points have been drawn from the same probability distribution (with respect to each class). We then randomly choose 2/3 of these samples for the training set and 1/3 of the samples for the test set. Note that there are two problems with this approach, which we will discuss in the next sections.
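A minimal sketch of the holdout method with a random 2/3 to 1/3 split, comparing resubstitution (training) accuracy with holdout (test) accuracy. The decision tree classifier and the random seed are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Random 2/3 - 1/3 split without stratification, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=1
)

tree = DecisionTreeClassifier(random_state=1)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # resubstitution accuracy
test_acc = tree.score(X_test, y_test)     # holdout accuracy
optimism = train_acc - test_acc           # estimate of the optimism bias

print(train_acc, test_acc, optimism)
```

An unpruned tree can fit the training set almost perfectly, so the training accuracy alone says little about performance on unseen data.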
1.4 Stratification
We have to keep in mind that a dataset represents a random sample drawn from a probability distribution, and we typically assume that this sample is representative of the true population, more or less. Now, further subsampling without replacement alters the statistics (mean, proportion, and variance) of the sample. The degree to which subsampling without replacement affects the statistics of a sample is inversely proportional to the size of the sample. Let us have a look at an example using the Iris dataset [1], which we randomly divide into 2/3 training data and 1/3 test data as illustrated in Figure 1. (The source code for generating this graphic is available on GitHub [2].)
[Figure 1: Distribution of Iris flower classes upon random subsampling into training and test sets. Panels: all samples (n = 150), training samples (n = 100), test samples (n = 50).]
[1] https://archive.ics.uci.edu/ml/datasets/iris
[2] https://github.com/rasbt/model-eval-article-supplementary/blob/master/code/iris-random-dist.ipynb
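The effect of random subsampling on class proportions can be sketched with NumPy alone. The label vector below mimics the class balance of Iris (50 examples per class); the random seed is an arbitrary assumption, so the exact counts differ from those in Figure 1.

```python
import numpy as np

# 150 labels, 50 per class, mirroring the class balance of Iris.
y = np.repeat([0, 1, 2], 50)

rng = np.random.default_rng(123)
shuffled = rng.permutation(y)

# Random split into 100 training and 50 test labels, without replacement.
y_train, y_test = shuffled[:100], shuffled[100:]

train_counts = np.bincount(y_train, minlength=3)
test_counts = np.bincount(y_test, minlength=3)

print("all:  ", np.bincount(y, minlength=3))
print("train:", train_counts)
print("test: ", test_counts)
```

Because the split is done without replacement, any class that is overrepresented in the training set is underrepresented in the test set by the same count.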
Issues with Subsampling
Stratified Split
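A stratified split keeps the class proportions the same in the training and test sets. As a sketch, scikit-learn's `train_test_split` does this when the label array is passed via `stratify`; the 70/30 ratio and the seed are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print(train_counts.tolist())  # [35, 35, 35]
print(test_counts.tolist())   # [15, 15, 15]
```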
Normalization: Min-Max Scaling
$$x_{\text{norm}}^{[i]} = \frac{x^{[i]} - x_{\min}}{x_{\max} - x_{\min}}$$
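A short NumPy sketch of min-max scaling applied to one feature column; the example values are made up for illustration.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 50.0])

# Rescale so that the minimum maps to 0 and the maximum maps to 1.
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_norm.tolist())  # [0.0, 0.25, 0.5, 1.0]
```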
Normalization: Standardization
$$x_{\text{std}}^{[i]} = \frac{x^{[i]} - \mu_x}{\sigma_x}$$
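A short NumPy sketch of standardization; the example values are made up. After the transformation the feature has zero mean and unit standard deviation.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# np.std uses ddof=0 by default, i.e., the population standard deviation.
x_std = (x - x.mean()) / x.std()

print(np.round(x_std, 2).tolist())  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```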
Sample vs Population Standard Deviation
$$s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x^{[i]} - \bar{x}\right)^2}$$
$$\sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(x^{[i]} - \mu_x\right)^2}$$
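In NumPy, the two definitions differ only in the `ddof` ("delta degrees of freedom") argument, as the sketch below shows. Note that `np.std` defaults to `ddof=0` (population), whereas the pandas `std` method defaults to `ddof=1` (sample).

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0])

pop_std = np.std(x, ddof=0)     # divides by n
sample_std = np.std(x, ddof=1)  # divides by n - 1

print(round(float(pop_std), 2), round(float(sample_std), 2))  # 8.16 10.0
```

The gap between the two shrinks as n grows, so the choice matters mostly for small datasets.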
Scaling Validation and Test Sets
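Given 3 training examples:
- example1: 10 cm -> class 2
- example2: 20 cm -> class 2
- example3: 30 cm -> class 1
Estimate from the training examples:
- mean: 20 cm
- standard deviation: 8.2 cm
Standardize (z scores):
- example1: -1.22 -> class 2
- example2: 0.00 -> class 2
- example3: 1.22 -> class 1
Decision rule on the standardized feature:
$$h(z) = \begin{cases} 2 & \text{if } z \leq 0.6 \\ 1 & \text{otherwise} \end{cases}$$
Given 3 new examples:
- example4: 5 cm -> class ?
- example5: 6 cm -> class ?
- example6: 7 cm -> class ?
Incorrect: estimate a "new" mean and standard deviation from the new examples:
- example4: -1.22 -> class 2
- example5: 0.00 -> class 2
- example6: 1.22 -> class 1
Correct: reuse the mean and standard deviation estimated from the training examples:
- example4: -1.84 -> class 2
- example5: -1.71 -> class 2
- example6: -1.59 -> class 2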
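The example above can be reproduced with a few lines of NumPy. This is a sketch under the slide's setup: three training lengths, three new lengths, and the threshold rule h(z); the variable names are chosen for illustration.

```python
import numpy as np

x_train = np.array([10.0, 20.0, 30.0])  # lengths in cm
x_new = np.array([5.0, 6.0, 7.0])

# Scaling parameters are estimated on the training data only.
mu, sigma = x_train.mean(), x_train.std()


def h(z):
    # Decision rule from the slide: class 2 if z <= 0.6, otherwise class 1.
    return np.where(z <= 0.6, 2, 1)


# Incorrect: re-estimate mean and std. on the new data.
z_wrong = (x_new - x_new.mean()) / x_new.std()

# Correct: reuse the training mean and std.
z_right = (x_new - mu) / sigma

print(h(z_wrong).tolist())  # [2, 2, 1]
print(h(z_right).tolist())  # [2, 2, 2]
```

All three new examples are shorter than every training example of class 2, so assigning one of them to class 1 is an artifact of re-estimating the scaling parameters.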
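Scikit-Learn Transformer API
[Figure: (1) est.fit(X_train) estimates the transformation parameters from the training data. (2) est.transform(X_train) returns the transformed training data. (3) est.transform(X_test) applies the same fitted transformation to the test data and returns the transformed test data.]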
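The three steps of the transformer API can be sketched with `StandardScaler`; the toy arrays are assumptions for illustration. The scaler is fit on the training data only and then reused for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[5.0], [6.0], [7.0]])

est = StandardScaler()
est.fit(X_train)                      # step 1: learn mean and std. from training data
X_train_std = est.transform(X_train)  # step 2: transform the training data
X_test_std = est.transform(X_test)    # step 3: reuse the training parameters

print(est.mean_, est.scale_)
```

`StandardScaler` divides by the population standard deviation (n in the denominator), which it stores in `scale_`.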
Categorical: Ordinal
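Ordinal features have a meaningful order, so one common approach is to map each category to an integer that preserves that order. The sketch below assumes a hypothetical `size` column with the ordering M < L < XL.

```python
import pandas as pd

df = pd.DataFrame({"size": ["M", "L", "XL", "M"]})

# Hypothetical ordering chosen for illustration: M < L < XL.
size_order = {"M": 1, "L": 2, "XL": 3}
df["size_encoded"] = df["size"].map(size_order)

print(df["size_encoded"].tolist())  # [1, 2, 3, 1]
```

The mapping is written by hand because the order cannot be inferred from the strings themselves.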
Categorical: Nominal
One-hot Encoding
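One-hot encoding replaces a nominal feature with one binary column per category, which avoids implying an order. A minimal sketch using `pandas.get_dummies`; the `color` column is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({"color": ["green", "red", "blue", "red"]})

# One binary indicator column per category; dtype=int gives 0/1 values.
onehot = pd.get_dummies(df, columns=["color"], dtype=int)

print(onehot.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
print(onehot.to_numpy().tolist())
# [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]
```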
Scikit-Learn Pipelines
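[Figure: A pipeline chains scaling (Step 1) and dimensionality reduction (Step 2) with a learning algorithm. pipeline.fit(...) takes the training set and its class labels, calls .fit(...) and .transform(...) on each of the two transformer steps, and then calls .fit(...) on the learning algorithm to produce the predictive model. pipeline.predict(...) takes the test set, calls .transform(...) on each transformer step and .predict(...) on the predictive model, and returns class labels.]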
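The pipeline in the figure can be sketched with `make_pipeline`. The specific components (standardization, PCA with two components, k-nearest neighbors) and the split settings are assumptions chosen to match the scaling, dimensionality reduction, and learning algorithm boxes.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

pipe = make_pipeline(
    StandardScaler(),                     # step 1: scaling
    PCA(n_components=2),                  # step 2: dimensionality reduction
    KNeighborsClassifier(n_neighbors=3),  # learning algorithm
)

pipe.fit(X_train, y_train)     # fits and applies each transformer, then fits the classifier
y_pred = pipe.predict(X_test)  # transforms X_test with the fitted steps, then predicts
test_acc = pipe.score(X_test, y_test)

print(test_acc)
```

Because the transformers are fit inside `pipe.fit`, the test set never influences the scaling or projection parameters.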
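Model Selection: Simple Holdout Method
[Figure: The original dataset is split into a training set, a validation set, and a test set. The machine learning algorithm is fit on the training set, and the resulting predictive model is evaluated on the validation set; the hyperparameters are then changed and the procedure is repeated. A second row shows the data divided into only a training set and a test set, with the test set used for the final performance estimate.]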
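A minimal sketch of holdout-based model selection: tune a hyperparameter on the validation set, then report the final estimate on the test set. The classifier, the candidate values for k, the split sizes, and the refit on the combined training and validation data are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Split off the test set first, then split the rest into training and validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=30, random_state=123, stratify=y
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=30, random_state=123, stratify=y_temp
)

# Fit on the training set, evaluate on the validation set,
# change the hyperparameter, and repeat.
valid_scores = {}
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    valid_scores[k] = model.score(X_valid, y_valid)

best_k = max(valid_scores, key=valid_scores.get)

# Refit with the selected setting on training + validation data,
# then use the test set once for the final performance estimate.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_temp, y_temp)
test_acc = final_model.score(X_test, y_test)

print(best_k, test_acc)
```

The test set is touched only once, after the hyperparameter has been chosen, so the final estimate is not biased by the selection step.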
Reading Assignments
• Python Machine Learning, 2nd ed.: Ch04 up to "Selecting Meaningful Features" (pg 107-123)
• Python Machine Learning, 2nd ed.: Ch06 up to "Debugging Algorithms with Learning and Validation Curves" (pg 185-194)