A Survey on Data Preparation and Feature Engineering in Machine Learning




Abstract—Data is present in every corner of life. Its volume keeps increasing through the use of smartphones, our interaction with social networks, and much more. This largely generated data can be utilized effectively by means of data mining and machine learning. Machine learning is an integral part of artificial intelligence and is used to design algorithms based on data trends and historical relationships between data. It is applied in fields such as bioinformatics, intrusion detection, information retrieval, game playing, and marketing. This paper describes the step-by-step method adopted for data preparation and the concepts applied for feature engineering in machine learning, using the familiar Rossmann sales data set.

Index Terms— Data, Machine Learning, Feature Engineering, Artificial Intelligence

I. INTRODUCTION

The terms Machine Learning [4], Artificial Intelligence [5], Data Science, and Deep Learning refer to interrelated concepts: although they seem to be different areas, they are highly interconnected. Artificial Intelligence (AI) is the ability of a machine to think and learn by itself without human intervention [1].


Machine learning is a field of Artificial Intelligence that uses statistical methods to give computer systems the ability to learn from data without being explicitly programmed [6]. Deep Learning is a subset of machine learning in which artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data. It requires substantial computing power (i.e., high-performance GPUs). Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured. It is a practical application of all those fields (AI, DL, ML) in a business context.

Figure 1 : Visually Linking DS, ML, AI and DL

T.SEENI SELVI, Department of Computer Science, Hindusthan College of Arts & Science, Bharathiar University, Coimbatore, India 9994373424

M.NIRMALA, Department of Computer Applications, Hindusthan College of engineering & Technology, Anna University, Coimbatore, India, Mobile No 9952881646.

II. MACHINE LEARNING METHODS / TECHNIQUES GROUPED BY LEARNING STYLE


Mitchell [2] defines machine learning as follows: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Figure 2 : Machine Learning Techniques

There are four general machine learning techniques [3][7]:

(1) Supervised

(2) Unsupervised

(3) Semi-supervised

(4) Reinforcement learning methods.


The objectives of machine learning are to enable machines to make predictions, perform clustering, extract association rules, or make decisions from a given dataset.

Figure 3 : Basic Machine Learning Process

1. An object has features, and the outcome associated with those features is called the label. The features together with the label are called training data.

2. The model is trained on this training data.

3. Once the model is well trained, it should be evaluated on test data. Only the features of the test data are given as input, not the labels.

4. Based on what it learned from the training data, the trained model is evaluated on the test data.

5. If the trained model evaluates the test data correctly, it is ready for prediction.
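The train, evaluate and predict cycle described in these steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the data values are invented, and the 1-nearest-neighbour "model" (the hypothetical functions train and predict) is chosen for brevity, not because the paper uses it.

```python
# Toy illustration of the train / evaluate / predict cycle (hypothetical data).

def train(features, labels):
    """'Training' a 1-nearest-neighbour model simply stores the examples."""
    return list(zip(features, labels))

def predict(model, x):
    """Return the label of the stored training example closest to x."""
    nearest = min(model, key=lambda example: abs(example[0] - x))
    return nearest[1]

# Training data = features + labels.
train_x = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
train_y = ["low", "low", "low", "high", "high", "high"]
model = train(train_x, train_y)

# Test data: only the features are shown to the model; the held-back
# labels are used afterwards to score its answers.
test_x = [1.5, 9.5, 2.5]
test_y = ["low", "high", "low"]
predictions = [predict(model, x) for x in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(predictions, accuracy)
```

If the accuracy on the test data is acceptable, the same predict call is then used on new, unlabelled inputs.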

T. SEENISELVI¹, M. NIRMALA²

¹Associate Professor, ²Assistant Professor


Supervised Learning


In supervised learning, the target variable is predicted from labelled training data. The training data consist of an input vector X, called the features, and an output vector Y of labels or tags.

Figure 4 : Supervised Learning

A label or tag in Y is the outcome associated with the combined features of the corresponding input vector X. An X vector together with its Y value is called a training example.

Supervised learning takes labelled data (features and labels) and creates a model that can make predictions given new data.

Training data: $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^d$ and $y_i$ is the label.

Two categories of algorithms come under supervised learning:

1. Regression 2. Classification


if income > θ1 and savings > θ2 then low-risk else high-risk

Figure 5 : Regression

Figure 6 : Classification

Methods of Supervised Learning

Support Vector Machines, neural networks, decision trees, K-nearest neighbors, naive Bayes, etc.
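The threshold rule quoted above for classification (low-risk versus high-risk from income and savings) can be written directly as a function. This is a minimal sketch: the function name credit_risk and the threshold values are hypothetical, and in practice θ1 and θ2 would be learned from labelled data by one of the methods listed.

```python
def credit_risk(income, savings, theta1, theta2):
    """Apply the rule: income > theta1 and savings > theta2 -> low-risk."""
    if income > theta1 and savings > theta2:
        return "low-risk"
    return "high-risk"

# Hypothetical thresholds, chosen only for illustration.
THETA1 = 30000   # income threshold
THETA2 = 5000    # savings threshold

print(credit_risk(45000, 8000, THETA1, THETA2))   # both conditions hold
print(credit_risk(45000, 2000, THETA1, THETA2))   # savings too low
```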

Applications of Supervised Learning

Classification

• Face recognition
• Optical character recognition
• Medical diagnosis
• Speech recognition
• Machine translation
• Biometrics
• Credit scoring

Regression

• Predict the price of a car from its mileage.
• Navigating a car: angle of the steering.
• Kinematics of a robot arm: predict workspace location from angles.
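As an illustration of the first regression application, the sketch below fits a straight line to car price against mileage using ordinary least squares. The data points and the helper name fit_line are assumptions made for the example; real data would not lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: mileage in thousands of km, price in thousands.
mileage = [10, 20, 30, 40, 50]
price = [20, 18, 16, 14, 12]

slope, intercept = fit_line(mileage, price)
predicted_price = slope * 60 + intercept   # prediction for an unseen mileage
print(slope, intercept, predicted_price)
```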

Unsupervised Learning

Unsupervised learning is a type of machine learning used to draw inferences from data that has not been labelled. There is no correct answer to predict. It identifies commonalities in the data and reacts to the presence or absence of such commonalities in each new piece of data.

Training data: “examples” $x_1, \ldots, x_n$, where $x_i \in X \subset \mathbb{R}^n$.

Figure 7 : Unsupervised Learning
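To make the idea of finding commonalities without labels concrete, the following sketch groups one-dimensional points into two clusters with a basic k-means loop. It is an illustrative assumption, not a method taken from the paper: the function two_means, the fixed k = 2 and the data are all invented for the example.

```python
def two_means(points, iterations=10):
    """Split 1-D points into two clusters with a basic k-means loop (k = 2)."""
    centres = [min(points), max(points)]      # simple deterministic start
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for p in points:
            # Assign each point to its nearest centre.
            idx = 0 if abs(p - centres[0]) <= abs(p - centres[1]) else 1
            clusters[idx].append(p)
        # Move each centre to the mean of the points assigned to it.
        centres = [sum(c) / len(c) for c in clusters]
    return centres, clusters

# Unlabelled examples: no target values are supplied.
data = [1, 2, 3, 10, 11, 12]
centres, clusters = two_means(data)
print(centres, clusters)
```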

Applications of Unsupervised Learning

• Learning associations
• Clustering
• Density estimation
• Dimensionality reduction
• Feature selection
• Outlier/novelty detection

Semi Supervised Learning

It is a combination of supervised learning and unsupervised learning methods.

This type of learning is used when:

• there is not enough labelled data;
• there is no way, time, or resources to get more labelled data.

Example: building a model for a bank to detect fraud.

Combining the two techniques can increase the effective size of the labelled training data. Its main advantage is that it reduces the need for labelled data, which is hard to get and expensive.

Figure 8 : Semi supervised Learning

Reinforcement Learning

The reinforcement learning method aims at using observations gathered from the interaction with the environment to take actions that would maximize the reward or minimize the risk.

In order to produce intelligent programs (also called agents), reinforcement learning goes through the following steps:

1. Input state is observed by the agent.

2. Decision making function is used to make the agent perform an action.

3. After the action is performed, the agent receives reward or reinforcement from the environment.


4. The state-action pair information about the reward is stored.
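The four steps above can be sketched as a small agent loop. This is a hedged illustration only: the single-state environment, its fixed rewards, the epsilon-greedy decision function and the name run_agent are all assumptions made for the example and are not taken from the paper.

```python
import random

def run_agent(steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy agent in a hypothetical one-state, two-action environment."""
    rng = random.Random(seed)
    actions = ["left", "right"]
    rewards = {"left": 0.0, "right": 1.0}      # hypothetical environment
    state = "start"
    value = {(state, a): 0.0 for a in actions}  # stored reward estimates
    counts = {(state, a): 0 for a in actions}

    for _ in range(steps):
        # 1. The input state is observed (there is only one state here).
        # 2. Decision function: explore with probability epsilon, else exploit.
        if rng.random() < epsilon:
            action = rng.choice(actions)
        else:
            action = max(actions, key=lambda a: value[(state, a)])
        # 3. The environment returns a reward for the action.
        reward = rewards[action]
        # 4. The running average reward for the state-action pair is stored.
        counts[(state, action)] += 1
        value[(state, action)] += (reward - value[(state, action)]) / counts[(state, action)]
    return value, counts

value, counts = run_agent()
print(value, counts)
```

Once the agent has tried the rewarding action at least once, its stored estimate makes the decision function prefer that action on later steps.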

Figure 9 : Reinforcement Learning


III. DATA PREPARATION AND FEATURE ENGINEERING

Figure 10 : Data Preparation & Feature Engineering

When ML Is Required

A problem is addressed using ML when:

• it cannot be solved using traditional programming methods;
• the system itself needs to learn how to solve the problem;
• the size of the data is large. ML is generally not suitable for small sets of data.

Define Problem

• The first step is defining the problem.
• Be clear about what the model is expected to do.
• Identify data sources.
• Ensure that all the inputs are available during prediction.

Collect Data

• Why should the data be collected?
• What data should be collected?
• Where should the data be collected from?
• What format should the data be in?

Why should the data be collected?

• Data is the most important factor in ML.
• To train a model, data is required.
• The model learns from data.

What data should be collected?

• Features with strong predictive power, and labels, are to be collected.
• The features collected depend mainly on the ML problem.
• Getting features is easy, but labelling is a difficult task.
• Labelling is done by humans or through automated means. It is time consuming and expensive.

Where should the data be collected from?

• The most crucial step in ML is to have good-quality data.
• Data can be collected from various authenticated sources and data set repositories.
• A huge amount of money, time and resources is consumed in collecting data.

What format should the data be in?

• The data should be reliable and trustworthy.
• It should be available at the time of prediction.
• Enough data for all classes should be available.
• Avoid the following in the data:
  - Label errors
  - Noisy data
  - Outliers
  - Duplicate examples
  - Missing values
  - Irrelevant data
  - Bad feature values

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach for data analysis that employs a variety of techniques (both graphical and quantitative) to better understand data.

Understand Data types of features

Figure 11 : Data Types of Features

Categorical Data

Data that can be divided into groups is called categorical data. It is also called qualitative data. Example: Sex, Educational Level

Nominal and Ordinal Data Defined


Nominal data is named data that falls into discrete categories which do not overlap and have no inherent order.

Example: Gender (Male and Female)

Ordinal data is data that is placed into some kind of order or scale.

Example: rating user satisfaction on a scale of 1 to 10

Numerical Data

Data that is measurable or countable is referred to as numerical data. It is also called quantitative data. Example: students' height and weight (measurable)

Discrete Data

Represents items that can be counted. Example: number of students in a class

Continuous Data

Represents items that can be measured. It can take any value within a range.

Example: a person's height or weight
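The four data types can be represented explicitly in pandas, which helps later steps treat each column appropriately. The sketch below is an assumption-laden illustration: the column names and values are invented, and the three-level satisfaction scale stands in for the 1 to 10 rating mentioned above.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],        # nominal
    "satisfaction": ["low", "high", "medium"],     # ordinal
    "class_size": [30, 25, 28],                    # discrete
    "height_cm": [172.5, 160.2, 181.0],            # continuous
})

# Nominal: categories without any order.
df["gender"] = df["gender"].astype("category")

# Ordinal: categories with an explicit order.
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["low", "medium", "high"], ordered=True
)

print(df.dtypes)
print(df["satisfaction"].max())   # meaningful only because the order is declared
```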

Data Visualization

It is the graphical representation of data in the form of charts, diagrams, maps, etc. Data in graphical format helps in understanding patterns, trends, outliers, etc., and is often easier to interpret than a purely numerical representation.

Various Plots for Data Visualization

• Histogram
• Bar Chart
• Line Plot
• Scatter Plot
• Box Plot
• Heatmap

Data Cleaning

Data cleaning is the process that ensures the data is correct and relevant; inaccurate data are removed.

Data cleaning comprises the following steps:

• Removing duplicate and irrelevant data
• Removing or correcting missing values
• Removing incorrect features
• Removing incorrectly labelled data
• Removing outliers
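A few of these cleaning steps can be sketched with pandas as follows. The toy table, the median fill and the 1.5 × IQR outlier rule are assumptions chosen for illustration; the paper does not prescribe a particular imputation or outlier criterion.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a duplicate row, a missing value and an outlier.
raw = pd.DataFrame({
    "store": [1, 1, 2, 3, 4, 5],
    "sales": [5000.0, 5000.0, np.nan, 5200.0, 4800.0, 90000.0],
})

# Remove exact duplicate rows.
clean = raw.drop_duplicates()

# Correct missing values by filling them with the column median.
clean = clean.fillna({"sales": clean["sales"].median()})

# Remove outliers that lie more than 1.5 * IQR outside the quartiles.
q1, q3 = clean["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range]

print(clean)
```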

Feature Engineering

It is a process of transforming data into features to act as inputs to machine learning models.

Two important parts of Feature Engineering:

• Variable Transformation
• Feature Creation

Different types of activities involved in Feature Engineering:

• Create Features
• Remove Unwanted or Unused Features
• Data Transformations

Create Features

Features can be created by:

• Splitting / extracting data in a variable
• Combining two or more variables
• Creating a bucketized column
• Combining sparse classes
• Creating indicator variables
• Rounding off to integers

Remove Unwanted or Unused Features

If a feature does not help the machine to learn, it can be dropped.

Example: ID columns of sales data

Data Transformations

This involves converting everything into numbers. If the input feature values are non-numeric, they should be converted to numeric values.

Two types of transformations are:

• Normalization
• Bucketing / Binning
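The feature creation, feature removal and transformation activities described in this section can be sketched with pandas. Everything specific here is an assumption for illustration: the three sample rows, the derived column names, the bin edges and the choice of min-max scaling as the normalization.

```python
import pandas as pd

# Hypothetical sales rows, loosely shaped like a sales data set.
df = pd.DataFrame({
    "Date": ["2015-07-31", "2015-12-24", "2015-01-05"],
    "Sales": [5263, 9100, 0],
    "Customers": [555, 820, 0],
})

# Splitting / extracting: pull calendar parts out of the date.
dates = pd.to_datetime(df["Date"])
df["Year"] = dates.dt.year
df["Month"] = dates.dt.month

# Combining two variables: sales per customer (0 where there were no customers).
df["SalesPerCustomer"] = (df["Sales"] / df["Customers"]).fillna(0)

# Indicator variable: 1 if the row falls in December.
df["IsDecember"] = (df["Month"] == 12).astype(int)

# Bucketing / binning: group sales into labelled ranges (hypothetical edges).
df["SalesBucket"] = pd.cut(
    df["Sales"], bins=[-1, 0, 6000, 50000], labels=["none", "medium", "high"]
)

# Normalization (min-max): rescale customers into the range [0, 1].
customers = df["Customers"]
df["CustomersNorm"] = (customers - customers.min()) / (customers.max() - customers.min())

# Remove a feature that is no longer needed once its parts are extracted.
features = df.drop(columns=["Date"])
print(features)
```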

IV. METHODOLOGY APPLIED FOR DATA PREPARATION AND FEATURE ENGINEERING

In order to perform feature engineering, the data set used should usually be large. To demonstrate the methodology, the Rossmann store sales data set [8] has been used.

Step 1: Identify the various data sets, such as store.csv, train.csv and test.csv


Details of data provided in the format of CSV file

Table 1 : Rossmann Data Sales Analysis Data set


Table 2 : Data Types and Columns of Rossmann Data set

Step 3: Import the necessary packages for data processing: pandas for data frames, seaborn for easier visualization, numpy for numerical computing, and matplotlib for visualization.

Step 4 : Load the respective .csv files. Here, train.csv and store.csv are loaded.

Step 5: To know the dimensions of the data set, use the shape attribute.

train.csv file dimension is (1017209, 9)

store.csv file dimension is (1115, 10)

Step 6: To display the data types of the dataset.

Store int64

StoreType object

Assortment object

CompetitionDistance float64

CompetitionOpenSinceMonth float64

CompetitionOpenSinceYear float64

Promo2 int64

Promo2SinceWeek float64

Promo2SinceYear float64

PromoInterval object

Step 7 : To display the count of non-null values in each column of the dataset.

Store 1017209

DayOfWeek 1017209

Date 1017209

Sales 1017209

Customers 1017209

Open 1017209

Promo 1017209

StateHoliday 1017209

SchoolHoliday 1017209
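Steps 3 to 7 can be sketched as below. This is an illustration under stated assumptions: a three-row in-memory sample stands in for train.csv so that the snippet runs without the Kaggle files, the sample values are invented, and the aliases sb and plt follow the names used in the later plotting steps.

```python
import io

import numpy as np
import pandas as pd
# The plotting steps later in this section additionally assume:
#   import seaborn as sb
#   import matplotlib.pyplot as plt

# Hypothetical stand-in for train.csv with the same nine column names.
sample_train = io.StringIO(
    "Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday\n"
    "1,5,2015-07-31,5263,555,1,1,0,1\n"
    "2,5,2015-07-31,6064,625,1,1,0,1\n"
    "3,7,2015-07-26,0,0,0,0,0,0\n"
)

# With the real file this would be: df1 = pd.read_csv("train.csv")
df1 = pd.read_csv(sample_train)

print(df1.shape)     # Step 5: (rows, columns)
print(df1.dtypes)    # Step 6: data type of each column
print(df1.count())   # Step 7: non-null count of each column
```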

Step 8: To check whether any columns are empty or contain null values

df1.columns[df1.isnull().any()]

Step 9 : Describe the columns of the data set by displaying the count, mean, standard deviation, min, max, etc., so that the minimum values, maximum values and other relevant information for numeric columns are clearly displayed.

By using the df1["Sales"].describe() command the sales column displays the details as

count 1.017209e+06

mean 5.773819e+03

std 3.849926e+03

min 0.000000e+00

25% 3.727000e+03

50% 5.744000e+03

75% 7.856000e+03

max 4.155100e+04

This shows that the Sales column contains zero values, meaning that no sales happened on those particular days.

saleszerocount = len(df1[df1.Sales == 0])

print("The number of days where sales were zero: " + str(saleszerocount))

Use the above code to identify the number of days on which the sales value was zero (i.e., no sales).

Step 10 : To calculate the number of zero values and the number of nonzero values in the Customers column

customerszerocount = len(df1[df1.Customers == 0])

print("The Total Number of Customers with Zero Value " + str(customerszerocount))

customerscount = len(df1)  # assumed: total number of rows (not defined in the original listing)

nonzerocountcustomers = customerscount - customerszerocount

print("The Total Number of Non Zero Customers are " + str(nonzerocountcustomers))

Step 11: Apply the above approach to all the relevant columns in the given dataset. It identifies the count of zero values in each column.

Step 12 : To identify the specific locations where the data columns are empty, use the code given below.
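The listing for this step does not appear in the text, so the following is only one possible sketch of how empty cells could be located with pandas. The small table is a hypothetical stand-in for store.csv, and the variable names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few store.csv columns.
store = pd.DataFrame({
    "Store": [1, 2, 3],
    "CompetitionDistance": [1270.0, np.nan, 14130.0],
    "Promo2SinceWeek": [np.nan, 13.0, 14.0],
})

# Number of empty cells in each column.
null_counts = store.isnull().sum()

# Rows that contain at least one empty cell.
rows_with_nulls = store[store.isnull().any(axis=1)]

# Exact (row index, column name) location of every empty cell.
row_idx, col_idx = np.where(store.isnull().to_numpy())
null_positions = [(int(r), store.columns[c]) for r, c in zip(row_idx, col_idx)]

print(null_counts)
print(rows_with_nulls)
print(null_positions)
```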


Step 13: After close observation, clear out the null or zero values for specific columns. Here, the rows with zero sales are removed.

df1 = df1[df1.Sales != 0]

Step 14 : Perform Univariate analysis for numerical values or quantitative data to do a quality check using Data visualization tools.[9]


df1.hist(xrot=None, figsize=(15, 15))

plt.show()

Figure 12 : Univariate Analysis on Quantitative Data of train.csv

Step 15 : Perform Univariate analysis for Categorical Columns to do a quality check using Data visualization tools.

Figure 13 : Univariate Analysis on Categorical Column of train.csv


The Univariate analysis helps in performing a quality check by visualizing it using various plots. The date column does not give any relevant information.


Figure 14 : Univariate Analysis on Categorical Column of store.csv

Figure 15 : Univariate Analysis on Quantitative Data of store.csv

Step 16 : Perform bivariate analysis of the target variable against the other numerical columns to do a quality check using data visualization tools.

sb.pairplot(data=df1, x_vars=['Sales'],
            y_vars=['Store', 'DayOfWeek', 'Customers'])

plt.show()

Figure 16 : Bivariate Analysis 1 on Quantitative Data of train.csv

In Figure 16, the target variable Sales plotted against Store and DayOfWeek is not informative, because these columns are discrete identifiers rather than continuous quantities. Only Sales against Customers gives a clear picture: as the number of customers increases, sales also increase. There is an outlier in the diagram, and visualization tools help to identify such outliers in the data set.

sb.pairplot(data=df1, x_vars=['Sales'],
            y_vars=['Open', 'Promo', 'SchoolHoliday'])

plt.show()


Figure 17 : Bivariate Analysis 2 on Quantitative Data of train.csv

Step 17 : Perform bivariate analysis of the target variable against the categorical columns to do a quality check using data visualization tools.

def boxplot(x, y, **kwargs):
    sb.boxplot(x=x, y=y)
    plt.xticks(rotation=90)

f = pd.melt(df1, id_vars=['Sales'], value_vars=traincatfeatures)

g = sb.FacetGrid(f, col="variable", col_wrap=3, sharex=False, sharey=False, height=5)

g = g.map(boxplot, "value", "Sales")

Figure 18 : Bivariate Analysis on Categorical Data of train.csv

In Figure 18, the categorical data do not yield any useful information, as one variable is Date and the other is StateHoliday, which is either 0 or 1.

Step 18 : Perform multivariate analysis with three or more features. The output is a scatter plot in which a third feature is encoded as colour (hue). The hue variable can be changed to get a different visualization perspective.

sb.scatterplot(x="Customers", y="Sales", hue="SchoolHoliday", data=df1)


sb.scatterplot(x="Customers", y="Sales", hue="DayOfWeek", data=df1)

V. CONCLUSION

The paper describes the various methodologies involved in data cleaning and feature engineering. It emphasizes the step-by-step procedure required to perform a feature engineering task on a machine learning problem, and it also describes the basic technical aspects of machine learning to give an introduction to the basics of ML. Statistical insight into the data is obtained using various EDA tools, and univariate, bivariate and multivariate analyses are performed to understand individual features and their relationship with the target variable. The analysis of data may vary depending upon the dataset.

ACKNOWLEDGMENT

My thanks to my Research Guide Prof K.Seeniselvi for encouraging me in my research endeavors, which cover both technological and social aspects.

My Special Thanks to Mr. Rajesh Kumar, Founder | BuzzTech Training Institute for the constant encouragement, support and guidance which helped me to get acquainted with technical information in detail.

REFERENCES

[1] John McCarthy, Reminiscences on the history of time sharing, IEEE Annals of the History of Computing, Vol.14, No.1, pp.19–24, 1992.

[2] https://www.cs.ubbcluj.ro/~gabis/ml/ml-books/McGrawHill%20-%20Machine%20Learning%20-Tom%20Mitchell.pdf

[3] https://towardsdatascience.com/machine-learning-types-and-algorithms-d8b79545a6ec

[4] Smola, Alex, and S.V.N. Vishwanathan. Introduction to Machine Learning. Cambridge University Press, 2008.

[5] https://www.sciencedaily.com/terms/artificial_intelligence.htm

[6] https://www.cs.virginia.edu/~evans/greatworks/samuel.pdf

[7] Chethan Kumar GN. "Machine-learning-types-and-algorithms." https://towardsdatascience.com. N.p., n.d. Web.

[8] "Rossmann Store Sales." N.p., n.d. Web. <https://www.kaggle.com/c/rossmann-store-sales/data>.


T. Seeni Selvi completed her M.Sc. and M.Phil. in Computer Science. She is currently pursuing her Ph.D. in the area of Data Mining. She has guided 11 M.Phil. students. She has presented 2 papers at international conferences and published 12 papers in reputed international journals. She has completed 6 e-learning courses in various Computer Science domains.