Data Preprocessing and Machine Learning with Scikit-Learn

Full text

(1)Lecture 05. Data Preprocessing and Machine Learning with Scikit-Learn (Computational Foundations Part 3/3). STAT 479: Machine Learning, Fall 2019. Sebastian Raschka. http://stat.wisc.edu/~sraschka/teaching/stat479-fs2019/. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 1.

(2) Announcements! 1) Homework 1: Posted soon! 2) Project Group Assignments: TA will send the survey tomorrow. 3) Project Proposal submission due date. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !2.

(3) Where We Currently Are .... Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !3.

(4) Machine Learning Workflow Feature Extraction and Scaling Feature Selection Dimensionality Reduction Sampling. Labels Training Dataset Labels Raw Data. Learning Algorithm. Final Model. New Data. Test Dataset Labels Preprocessing. Learning. Evaluation. Prediction. Model Selection Cross-Validation Performance Metrics Hyperparameter Optimization. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !4.

(5) Reading a Dataset from a Tabular Text File. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 5.

(6) The Iris Dataset. Iris-Setosa. Iris-Versicolor. Iris-Virginica. Fisher, R.A. "The use of multiple measurements in taxonomic problems" Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 6.

(7) Sometimes Useful: Executing "Bash" Terminal Commands Via "!". Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 7.

(8) A DataFrame Library for Data Wrangling. https://pandas.pydata.org. (PANel DAta S) McKinney, Wes. "Data structures for statistical computing in python." Proceedings of the 9th Python in Science Conference. Vol. 445. 2010.. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 8.

(9) https://pandas.pydata.org. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 9.

(10) As a Side Note .... Sebastian Raschka. Biopandas: Working with molecular structures in pandas dataframes. The Journal of Open Source Software, 2(14), jun 2017. doi: 10.21105/joss.00279. URL http://dx.doi.org/10.21105/joss.00279.. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !10.

(11) Basic Data Handling. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 11.

(12) Python Function. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 12.

(13) Regular Function vs Lambda Function. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 13.

(14) Column-based Data Processing via Lambda Functions and ".apply". Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 14.

(15) Column-based Data Processing via Dictionaries and ".map". Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 15.

(16) Quick Inspections via "head" and "tail". Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 16.

(17) Accessing the Uderlying NumPy Array(s) via the ".values" Attribute. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 17.

(18) "Creating*" the Label Vector "y" and Design Matrix "X". * why did I put "Creating" in quotation marks? Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 18.

(19) A Library with Additional Data Science- & Machine Learning-related Functions. MLXTEND. http://rasbt.github.io/mlxtend/. Raschka, Sebastian. "MLxtend: Providing machine learning and data science utilities and extensions to Python’s scientific computing stack." The Journal of Open Source Software 3.24 (2018).. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 19.

(20) Exploratory Data Analysis (EDA). Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 20.

(21) Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 21.



(24) (Later, we will see how to do this more conveniently) Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 24.

(25) Python Classes To get a better understanding of the scikit-learn API, we need to understand the main concepts behind Object Oriented Programming (OOP) & classes in Python. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 25.

(26) Python Classes. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 26.








(34) The "Main" Machine Learning Library for Python. http://scikit-learn.org. Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." Journal of machine learning research 12.Oct (2011): 2825-2830.. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 34.

(35) The Scikit-learn Estimator API (an OOP Paradigm). Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 35.

(36) The Scikit-learn Estimator API Training Data ①. Training Labels. est.fit(X_train, y_train). Model ②. Test Data. est.predict(X_test). Predicted labels Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 36.

(37) A 3-Nearest Neighbor Classifier & 2 Iris Features. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. 37.

(38) a sample is inversely proportional to the size of the sample. Let us have a look at an example using the Iris dataset 1 , which we randomly divide into 2/3 training data and 1/3 test data as illustrated in Figure 1. (The source code for generating this graphic is available on GitHub2 .). Issues with Random Subsampling ... All samples (n = 150). Training samples (n = 100). Test samples (n = 50). Figure 1: Distribution of Iris flower classes upon random subsampling into training and test sets. 1. https://archive.ics.uci.edu/ml/datasets/iris Sebastian Raschka STAT 479: Machine Learning. FS 2019. !38.

(39) Stratified Splits. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !39.

(40) Normalization: Min-Max Scaling. [i] xnorm. Sebastian Raschka. [i]. x − xmin = xmax − xmin. STAT 479: Machine Learning. FS 2019. !40.

(41) Normalization: Min-Max Scaling. [i] xnorm. Sebastian Raschka. [i]. x − xmin = xmax − xmin. STAT 479: Machine Learning. FS 2019. !41.

(42) Normalization: Standardization [i] x − μ x [i] xstd = σx. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !42.

(43) Normalization: Standardization [i]. x − μ x [i] xstd = σx. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !43.

(44) Normalization: Standardization. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !44.

(45) Sample vs Population Standard Deviation. sx =. σx =. Sebastian Raschka. 1 i=1 [i] 2 (x − x¯) n−1∑ n. i=1. 1 (x [i] − μx)2 n∑ n. STAT 479: Machine Learning. FS 2019. !45.

(46) Sample vs Population Standard Deviation. sx =. σx =. Sebastian Raschka. i=1. 1 (x [i] − x¯)2 n−1∑ n. 1 i=1 [i] (x − μx)2 n∑ n. STAT 479: Machine Learning. FS 2019. !46.

(47) Scaling Validation and Test Sets. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !47.

(48) Scaling Validation and Test Sets. Given 3 training examples: - example1: 10 cm -> class 2. - example2: 20 cm -> class 2. - example3: 30 cm -> class 1. Estimate: mean: 20 cm. standard deviation: 8.2 cm. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !48.

(49) Scaling Validation and Test Sets Given 3 training examples: - example1: 10 cm -> class 2. - example2: 20 cm -> class 2. - example3: 30 cm -> class 1. Estimate: mean: 20 cm. standard deviation: 8.2 cm. Standardize: - example1: -1.21 -> class 2. - example2: 0.00 -> class 2. - example3: 1.21 -> class 1 Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !49.

(50) Scaling Validation and Test Sets Given 3 training examples: - example1: 10 cm -> class 2. - example2: 20 cm -> class 2. - example3: 30 cm -> class 1. Estimate: mean: 20 cm. standard deviation: 8.2 cm. Assume you have the classification rule:. Standardize (z scores): - example1: -1.21 -> class 2. - example2: 0.00 -> class 2. - example3: 1.21 -> class 1 Sebastian Raschka. h(z) =. (. class 2 if z  0.6 class 1 otherwise. <latexit sha1_base64="HAx2k5DyXbF25+3p1q+TLdp/cAo=">AAACaHicbVHBTttAEF27tIVAW1NAqOIyImpFL5GdosKlEoILRyo1gBRH0XozTlas19buGBqsiH/sjQ/g0q/oxliIAk9a6enNvJndt0mhpKUwvPX8Vwuv37xdXGotr7x7/yFY/Xhq89II7Ilc5eY84RaV1NgjSQrPC4M8SxSeJRdH8/rZJRorc/2LpgUOMj7WMpWCk5OGwc1k5/or/IA4wbHUlXCj7AxaUCMm/E2VUNxamEEX4EsjgUydcA2xQgg73yGOX3RED4acJmiupMXZvDFGPWpWDYN22AlrwHMSNaTNGpwMgz/xKBdlhprqNf0oLGhQcUNSKDc+Li0WXFzwMfYd1TxDO6jqoGbw2SkjSHPjjiao1ceOimfWTrPEdWacJvZpbS6+VOuXlO4PKqmLklCL+0VpqYBymKcOI2lQkJo6woWR7q4gJtxwQe5vWi6E6OmTn5PTbif61un+3G0fHDZxLLItts12WMT22AE7ZiesxwS785a9dW/D++sH/qb/6b7V9xrPGvsP/vY/S5OzNw==</latexit>. STAT 479: Machine Learning. FS 2019. !50.

(51) Scaling Validation and Test Sets Given 3 training examples:. h(z) =. - example1: 10 cm -> class 2. - example2: 20 cm -> class 2. - example3: 30 cm -> class 1. Estimate: mean: 20 cm. standard deviation: 8.2 cm. Standardize (z scores): - example1: -1.21 -> class 2. - example2: 0.00 -> class 2. - example3: 1.21 -> class 1 Sebastian Raschka. (. class 2 if z  0.6 class 1 otherwise. <latexit sha1_base64="HAx2k5DyXbF25+3p1q+TLdp/cAo=">AAACaHicbVHBTttAEF27tIVAW1NAqOIyImpFL5GdosKlEoILRyo1gBRH0XozTlas19buGBqsiH/sjQ/g0q/oxliIAk9a6enNvJndt0mhpKUwvPX8Vwuv37xdXGotr7x7/yFY/Xhq89II7Ilc5eY84RaV1NgjSQrPC4M8SxSeJRdH8/rZJRorc/2LpgUOMj7WMpWCk5OGwc1k5/or/IA4wbHUlXCj7AxaUCMm/E2VUNxamEEX4EsjgUydcA2xQgg73yGOX3RED4acJmiupMXZvDFGPWpWDYN22AlrwHMSNaTNGpwMgz/xKBdlhprqNf0oLGhQcUNSKDc+Li0WXFzwMfYd1TxDO6jqoGbw2SkjSHPjjiao1ceOimfWTrPEdWacJvZpbS6+VOuXlO4PKqmLklCL+0VpqYBymKcOI2lQkJo6woWR7q4gJtxwQe5vWi6E6OmTn5PTbif61un+3G0fHDZxLLItts12WMT22AE7ZiesxwS785a9dW/D++sH/qb/6b7V9xrPGvsP/vY/S5OzNw==</latexit>. Given 3 NEW examples: - example4: 5 cm -> class ?. - example5: 6 cm -> class ?. - example6: 7 cm -> class ?. Estimate "new" mean and std.: - example5: -1.21 -> class 2 - example6: 0.00 -> class 2 - example7: 1.21 -> class 1. STAT 479: Machine Learning. FS 2019. !51.

(52) Scaling Validation and Test Sets Given 3 training examples:. h(z) =. - example1: 10 cm -> class 2. - example2: 20 cm -> class 2. - example3: 30 cm -> class 1. Estimate: mean: 20 cm. standard deviation: 8.2 cm. Standardize (z scores): - example1: -1.21 -> class 2. - example2: 0.00 -> class 2. - example3: 1.21 -> class 1 Sebastian Raschka. (. class 2 if z  0.6 class 1 otherwise. <latexit sha1_base64="HAx2k5DyXbF25+3p1q+TLdp/cAo=">AAACaHicbVHBTttAEF27tIVAW1NAqOIyImpFL5GdosKlEoILRyo1gBRH0XozTlas19buGBqsiH/sjQ/g0q/oxliIAk9a6enNvJndt0mhpKUwvPX8Vwuv37xdXGotr7x7/yFY/Xhq89II7Ilc5eY84RaV1NgjSQrPC4M8SxSeJRdH8/rZJRorc/2LpgUOMj7WMpWCk5OGwc1k5/or/IA4wbHUlXCj7AxaUCMm/E2VUNxamEEX4EsjgUydcA2xQgg73yGOX3RED4acJmiupMXZvDFGPWpWDYN22AlrwHMSNaTNGpwMgz/xKBdlhprqNf0oLGhQcUNSKDc+Li0WXFzwMfYd1TxDO6jqoGbw2SkjSHPjjiao1ceOimfWTrPEdWacJvZpbS6+VOuXlO4PKqmLklCL+0VpqYBymKcOI2lQkJo6woWR7q4gJtxwQe5vWi6E6OmTn5PTbif61un+3G0fHDZxLLItts12WMT22AE7ZiesxwS785a9dW/D++sH/qb/6b7V9xrPGvsP/vY/S5OzNw==</latexit>. - example4: 5 cm -> class ?. - example5: 6 cm -> class ?. - example6: 7 cm -> class ?. Estimate "new" mean and std.: - example5: -1.21 -> class 2 - example6: 0.00 -> class 2 - example7: 1.21 -> class 1 - example5: -18.37. - example6: -17.15. - example7: -15.92. STAT 479: Machine Learning. FS 2019. !52.

(53) The Scikit-Learn Transformer API Training Data. Test Data. ① est.fit(X_train). ② est.transform(X_train). Model. Transformed Training Data. Sebastian Raschka. est.transform(X_test)③. Transformed Test Data. STAT 479: Machine Learning. FS 2019. !53.

(54) The Scikit-Learn Transformer API. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !54.

(55) Working with Categorical Data. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !55.

(56) Categorical Data -> Ordinal Data. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !56.

(57) Categorical Data -> Nominal Data (Class Labels). Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !57.

(58) One-hot Encoding for Categorical (Nominal) Features. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !58.

(59) One-hot Encoding for Categorical (Nominal) Features. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !59.

(60) Dealing with Missing Data. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !60.



(63) Scikit-Learn Pipelines Class labels. (Step 1). (Step 2) Test set. Training set pipeline.fit(…). pipeline.predict(…). Pipeline .fit(…) & .transform(…). .fit(…) & .transform(…). Scaling Dimensionality Reduction Learning Algorithm. .fit(…). .transform(…). .transform(…). Predictive Model. Class labels .predict(…). Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !63.

(64) Scikit-Learn Pipelines. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !64.

(65) Scikit-Learn Pipelines. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !65.

(66) Scikit-Learn Pipelines Class labels. (Step 1). (Step 2) Test set. Training set pipeline.fit(…). pipeline.predict(…). Pipeline .fit(…) & .transform(…). .fit(…) & .transform(…). Scaling Dimensionality Reduction Learning Algorithm. .fit(…). .transform(…). .transform(…). Predictive Model. Class labels .predict(…). Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !66.

(67) Model Selection: Simple Holdout Method Original dataset. Training set. Training set. Test set. Validation set. Test set. Change hyperparameters and repeat. Machine learning algorithm Evaluate. Fit Predictive model. Sebastian Raschka. Final performance estimate. STAT 479: Machine Learning. FS 2019. !67.

(68) Model Selection: Simple Holdout Method. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !68.




(72) Lecture Notes This time in interactive Jupyter Notebook form:. https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/ 05_preprocessing-and-sklearn/code/05-preprocessing-andsklearn__notes.ipynb. Bonus: Column Transformers for Heterogenous Data https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/ 05_preprocessing-and-sklearn/code/05-bonus-columntransformer.ipynb. Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !72.

(73) Tip If you see this, the Notebook rendering on GitHub is having some hiccups again.. Simply copy and paste the notebook link into the NbViewer available at https://nbviewer.jupyter.org (it always works!). Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !73.

(74) Reading Assignments. •. Python Machine Learning, 2nd ed.: Ch04 up to "Selecting Meaningful Features" (pg 107-123). •. Python Machine Learning, 2nd ed.: Ch06 up to "Debugging Algorithms with Learning and Validation Curves" (pg 185-194). Sebastian Raschka. STAT 479: Machine Learning. FS 2019. !74.

(75)

No results found