Abstract—Data is present in every corner of life. Its volume keeps increasing through the use of smartphones, our interaction with social networks, and much more. This large volume of generated data can be utilized effectively by means of data mining and machine learning. Machine learning is an integral part of artificial intelligence and is used to design algorithms based on data trends and historical relationships between data. It is applied in various fields such as bioinformatics, intrusion detection, information retrieval, game playing, and marketing. This paper describes the step-by-step method adopted for data preparation and the concepts applied for feature engineering in machine learning, using the familiar Rossmann sales data set.
Index Terms— Data, Machine Learning, Feature Engineering, Artificial Intelligence
I. INTRODUCTION
The terms Machine Learning [4], Artificial Intelligence [5], Data Science, and Deep Learning refer to interrelated concepts: although they seem to be different areas, they are highly interconnected. Artificial Intelligence (AI) is the ability of a machine to think and learn by itself without human intervention. [1]
Machine learning is a field of Artificial Intelligence that uses statistical methods to give computer systems the ability to learn from data without being explicitly programmed. [6] Deep Learning is a subset of machine learning in which artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data. It requires substantial computing power, i.e., high-performance GPUs. Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data in various forms, both structured and unstructured. It is a practical application of all those fields (AI, DL, ML) in a business context.
Figure 1 : Visually Linking DS, ML, AI and DL
T.SEENI SELVI, Department of Computer Science, Hindusthan College of Arts & Science, Bharathiar University, Coimbatore, India 9994373424
M.NIRMALA, Department of Computer Applications, Hindusthan College of engineering & Technology, Anna University, Coimbatore, India, Mobile No 9952881646.
II. MACHINE LEARNING METHODS / TECHNIQUES GROUPED BY LEARNING STYLE
Mitchell [2] defines machine learning as follows: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
Figure 2 : Machine Learning Techniques
There are four general machine learning techniques: [3][7]
(1) Supervised
(2) Unsupervised
(3) Semi-supervised
(4) Reinforcement learning methods.
The objectives of machine learning are to enable machines to make predictions, perform clustering, extract association rules, or make decisions from a given dataset.
Figure 3 : Basic Machine Learning Process
1. Each object has features, and the outcome associated with its combined features is called the label. The features together with the label are called the training data.
2. The model is trained on this training data.
3. Once the model is well trained, it should be evaluated on test data. Only the features of the test data are given as input, not the labels.
4. The model trained on the training data is evaluated by comparing its predictions on the test data with the held-back labels.
5. If the trained model predicts the test data correctly, it is ready for prediction.
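The process in Figure 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data, the one-feature threshold model, and the helper names `train_threshold`, `predict`, and `accuracy` are hypothetical and are not part of the method described in this paper.

```python
# Toy sketch of the train -> evaluate -> predict loop (all names hypothetical).

def train_threshold(features, labels):
    """Learn a cut-off halfway between the mean feature value of each class."""
    positives = [x for x, y in zip(features, labels) if y == 1]
    negatives = [x for x, y in zip(features, labels) if y == 0]
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2

def predict(threshold, features):
    """Predict label 1 when the feature exceeds the learned threshold."""
    return [1 if x > threshold else 0 for x in features]

def accuracy(predicted, actual):
    """Fraction of predictions that match the held-back labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Training data: features together with labels (made-up values).
train_x, train_y = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0], [0, 0, 0, 1, 1, 1]
# Test data: only the features go into the model; labels are kept for scoring.
test_x, test_y = [1.5, 2.5, 7.5, 8.5], [0, 0, 1, 1]

threshold = train_threshold(train_x, train_y)
test_accuracy = accuracy(predict(threshold, test_x), test_y)
print(threshold, test_accuracy)
```

A model that scores well on the held-back test data in this way would then be used for prediction on new, unlabelled inputs.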
T. SEENI SELVI (Associate Professor), M. NIRMALA (Assistant Professor)
Supervised Learning
In supervised learning, the target variable is predicted from labelled training data. The training data consist of an input vector X, called the features, and an output vector Y of labels or tags.
Figure 4 : Supervised Learning
A label or tag from vector Y is the outcome associated with the combined features of input vector X. An X vector and its Y vector together are called a training example.
Supervised learning takes labelled data (features and labels) and creates a model that can make predictions given new data.
Training data: $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^d$ and $y_i$ is the label.
Two categories of algorithms come under supervised learning:
1. Regression
2. Classification
Example of a classification rule: if income > θ1 and savings > θ2 then low-risk else high-risk
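The rule above can be written directly as a small function. This is a sketch only: the threshold values standing in for θ1 and θ2 are made up for illustration, whereas a real classifier would learn them from labelled training data.

```python
# Hypothetical thresholds; in practice they would be learned from training data.
THETA_INCOME = 40_000    # stands in for theta_1
THETA_SAVINGS = 10_000   # stands in for theta_2

def credit_risk(income, savings):
    """Apply the two-threshold rule and return the predicted class."""
    if income > THETA_INCOME and savings > THETA_SAVINGS:
        return "low-risk"
    return "high-risk"

print(credit_risk(55_000, 20_000))
print(credit_risk(55_000, 5_000))
```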
Figure 5 : Regression
Figure 6 : Classification
Methods of Supervised Learning
Support Vector Machines, neural networks, decision trees, K-nearest neighbors, naive Bayes, etc.
Applications of Supervised Learning
Classification
Face recognition
Optical character recognition
Medical diagnosis
Speech recognition
Machine translation
Biometrics
Credit scoring
Regression
Predict the price of a car from its mileage.
Navigating a car: angle of the steering.
Kinematics of a robot arm: predict workspace location from angles.
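As an illustration of the first regression application, the sketch below fits a straight line to car price against mileage using ordinary least squares. The data values are invented and exactly linear so that the result is easy to check by hand; `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training examples: mileage in thousands of km, price in thousands.
mileage = [10, 20, 30, 40, 50]
price = [18.0, 16.0, 14.0, 12.0, 10.0]

slope, intercept = fit_line(mileage, price)
predicted_price = slope * 60 + intercept  # prediction for an unseen mileage
print(slope, intercept, predicted_price)
```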
Unsupervised Learning
Unsupervised learning is a type of machine learning used to draw inferences from data that has not been labelled. There is no correct answer to predict. It identifies commonalities in the data and reacts to the presence or absence of such commonalities in each piece of data.
Training data: "examples" $x_1, \ldots, x_n$, where $x_i \in X \subset \mathbb{R}^n$.
Figure 7 : Unsupervised Learning
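Clustering is a typical unsupervised task: the algorithm receives only the examples and groups them by similarity. The sketch below is a minimal one-dimensional k-means, given as an illustration under assumptions (made-up points, hand-picked starting centres, hypothetical function name `kmeans_1d`), not as the method used in this paper.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest centre and re-averaging."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centre.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Unlabelled examples (made-up values) that form two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers, clusters)
```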
Applications of Unsupervised Learning
Learning associations
Clustering
Density estimation
Dimensionality reduction
Feature selection
Outlier/novelty detection
Semi-Supervised Learning
Semi-supervised learning is a combination of supervised and unsupervised learning methods. It is used when not enough labelled data is available and there is no way, time, or resources to obtain more labelled data. Example: building a model for a bank to detect fraud.
The combination of the two techniques can increase the size of the labelled training data.
Its advantage is that labelled data, which is hard to get and expensive, is needed only in small amounts.
Figure 8 : Semi supervised Learning
Reinforcement Learning
The reinforcement learning method aims at using observations gathered from the interaction with the environment to take actions that would maximize the reward or minimize the risk.
In order to produce intelligent programs (also called agents), reinforcement learning goes through the following steps:
1. Input state is observed by the agent.
2. Decision making function is used to make the agent perform an action.
3. After the action is performed, the agent receives reward or reinforcement from the environment.
4. The state-action pair information about the reward is stored.
Figure 9 : Reinforcement Learning
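The four steps above can be sketched with a very small agent. Everything in this example is an assumption made for illustration: the environment has a single state and two actions, only the action "right" is rewarded, and exploration follows a fixed schedule instead of random choice so that the run is reproducible.

```python
ACTIONS = ["left", "right"]

def reward_for(action):
    """Toy environment (assumed): only moving right is rewarded."""
    return 1.0 if action == "right" else 0.0

def choose_action(q_values, step):
    """Decision function: explore on every 5th step, otherwise act greedily."""
    if step % 5 == 0:
        return ACTIONS[(step // 5) % len(ACTIONS)]
    return max(ACTIONS, key=lambda a: q_values[a])

q_values = {a: 0.0 for a in ACTIONS}  # stored value estimate per action
learning_rate = 0.5

for step in range(20):
    # Steps 1-2: the (single) state is observed and an action is chosen.
    action = choose_action(q_values, step)
    # Step 3: the environment returns a reward.
    reward = reward_for(action)
    # Step 4: the reward information for this action is stored.
    q_values[action] += learning_rate * (reward - q_values[action])

best_action = max(ACTIONS, key=lambda a: q_values[a])
print(q_values, best_action)
```

After a few exploratory steps the stored estimate for "right" dominates, so the greedy choice maximizes the reward.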
III. DATA PREPARATION AND FEATURE ENGINEERING
Figure 10 : Data Preparation & Feature Engineering
When ML Is Required
A problem should be addressed using ML if and only if
It cannot be solved using traditional programming methods
The system itself needs to learn how to solve the problem
The size of the data is large; ML cannot be applied effectively to a small data set
Define Problem
The first step is defining the problem. Be clear about what the model is expected to do. Identify the data sources.
Ensure that all the inputs are available during prediction
Collect Data
Why should data be collected?
What data should be collected?
Where should the data be collected from?
What format should the data be in?
Why should data be collected?
Data is the most important factor in ML. To train a model, data is required. The model learns from data.
What data should be collected?
Features with strong predictive power and labels are to be collected.
Features collected are mainly dependent on the ML problem.
Getting features is easy, but labelling is a difficult task.
Labelling is done by humans or through automated means. It is time consuming and expensive.
Where should the data be collected from?
The most crucial step in ML is to have good-quality data.
Data can be collected from various authenticated sources and data set repositories.
A huge amount of money, time, and resources is consumed in collecting data.
What format should the data be in?
The data should be reliable and trustworthy. It should be available at the time of prediction. Enough data for all classes should be available. Avoid the following in the data:
Label errors
Noisy data
Outliers
Duplicate examples
Missing values
Irrelevant data
Bad feature values
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach for data analysis that employs a variety of techniques (both graphical and quantitative) to better understand data.
Understand Data types of features
Figure 11 : Data Types of Features
Categorical Data
Data that can be divided into groups is called categorical data. It is also called qualitative data. Example: sex, educational level
Nominal and Ordinal Data
Nominal data is named data that can be placed into discrete categories which do not overlap.
Example: gender (male and female)
Ordinal data is data that is placed into some kind of order or scale.
Example: rating user satisfaction on a scale of 1 to 10
Numerical Data
Data that is measurable or countable is referred to as numerical data. It is also called quantitative data. Example: students' height and weight (measurable)
Discrete Data
Represents items that can be counted. Example: number of students in a class
Continuous Data
Represents items that can be measured. It can take any value within a range.
Example: a person's height or weight
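The four data types can be made explicit when the data is held in a pandas DataFrame. The sketch below is an illustration with made-up values and hypothetical column names; it shows one possible way to mark a column as nominal or ordinal, not a step taken in this paper.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],       # nominal
    "satisfaction": ["low", "high", "medium"],    # ordinal
    "num_students": [30, 25, 28],                 # discrete
    "height_cm": [170.2, 158.9, 164.5],           # continuous
})

# Nominal: categories without any order.
df["gender"] = df["gender"].astype("category")

# Ordinal: categories with an explicit order, so comparisons are meaningful.
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["low", "medium", "high"], ordered=True
)

highest = df["satisfaction"].max()
print(df.dtypes)
print(highest)
```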
Data Visualization
Data visualization is the graphical representation of data in the form of charts, diagrams, maps, etc. Presenting data graphically helps in understanding patterns, trends, outliers, and so on. It is often more effective than a purely numerical representation of the data.
Various Plots for Data Visualization
Histogram
Bar Chart
Line Plot
Scatter Plot
Box Plot
Heatmap
Data Cleaning
Data cleaning is the process of ensuring that the data is correct and relevant. Inaccurate data is removed.
Data cleaning comprises the following steps:
Removing duplicate and irrelevant data
Removing or correcting missing values
Removing incorrect features
Removing incorrectly labeled data
Removing outliers
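Three of these cleaning steps can be sketched with pandas on a tiny made-up table. The column names, the values, and the choice of the 1.5 × IQR rule for outliers are assumptions made for illustration; the right outlier rule depends on the data set.

```python
import pandas as pd

raw = pd.DataFrame({
    "Store": [1, 1, 2, 3, 4, 5, 6],
    "Sales": [100.0, 100.0, None, 110.0, 120.0, 130.0, 5000.0],
})

# 1. Remove exact duplicate rows.
cleaned = raw.drop_duplicates()

# 2. Remove rows whose Sales value is missing.
cleaned = cleaned.dropna(subset=["Sales"])

# 3. Remove outliers that lie more than 1.5 * IQR outside the quartiles.
q1 = cleaned["Sales"].quantile(0.25)
q3 = cleaned["Sales"].quantile(0.75)
iqr = q3 - q1
within_range = cleaned["Sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = cleaned[within_range].reset_index(drop=True)

print(cleaned)
```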
Feature Engineering
It is the process of transforming data into features that act as inputs to machine learning models.
Two important parts of feature engineering:
Variable Transformation
Feature Creation
Different types of activities involved in feature engineering:
Create Features
Remove Unwanted or Unused Features
Data Transformations
Create Features
Features can be created by
Splitting / extracting data in a variable
Combining two or more variables
Creating a bucketized column
Combining sparse classes
Creating indicator variables
Rounding off to integers
Remove Unwanted or Unused Features
If a feature does not help the machine to learn, it can be dropped.
Example: the ID column of the sales data
Data Transformations
It involves converting everything into numbers. If the input feature values are non-numeric, they should be converted to numeric values.
Two types of transformations are
Normalization
Bucketing / Binning
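The activities listed above can be sketched on a small made-up sales table. This is an illustration under assumptions: the values, the derived column names (`Year`, `Month`, `IsDecember`, `CustomersScaled`, `SalesBucket`), and the bucket boundaries are hypothetical and chosen only to show each technique once.

```python
import pandas as pd

sales = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Date": ["2015-07-31", "2015-08-01", "2015-12-24", "2015-12-25"],
    "Customers": [500, 650, 1200, 0],
    "Sales": [5000, 6100, 15000, 0],
})

# Create features: split the date into its parts.
sales["Date"] = pd.to_datetime(sales["Date"])
sales["Year"] = sales["Date"].dt.year
sales["Month"] = sales["Date"].dt.month

# Create features: an indicator variable (1 when the sale is in December).
sales["IsDecember"] = (sales["Month"] == 12).astype(int)

# Remove unwanted features: an ID column carries no predictive signal.
sales = sales.drop(columns=["Id"])

# Transformation 1: min-max normalization to the range [0, 1].
c_min, c_max = sales["Customers"].min(), sales["Customers"].max()
sales["CustomersScaled"] = (sales["Customers"] - c_min) / (c_max - c_min)

# Transformation 2: bucketing / binning a numeric column into labelled ranges.
sales["SalesBucket"] = pd.cut(
    sales["Sales"], bins=[-1, 0, 6000, 20000], labels=["none", "normal", "high"]
)

print(sales)
```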
IV. METHODOLOGY APPLIED FOR DATA PREPARATION AND FEATURE ENGINEERING
To perform feature engineering, the data set used should usually be large. To demonstrate the methodology, the Rossmann store sales data set [8] has been used.
Step 1: Identify the various data sets, namely store.csv, train.csv, and test.csv.
The details of the data are provided in CSV files.
Table 1 : Rossman Data Sales Analysis Data set
Table 2 : Data Types and Columns of Rossman Data set
Step 3: Import the necessary packages for data processing: pandas for data frames, seaborn for easier visualization, NumPy for numerical computing, and matplotlib for visualization.
Step 4: Load the respective .csv files. Here, train.csv and store.csv are loaded.
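Steps 3 and 4 might look as follows. This is a hedged sketch: `load_csv` is a hypothetical helper, the name `df2` for the store data is an assumption (only `df1` appears in the listings of this paper), and a two-row in-memory sample with illustrative values stands in for the real file so that the example runs without the Kaggle download.

```python
import io

import pandas as pd
# For the plots in later steps one would also import, for example:
#   import seaborn as sb
#   import matplotlib.pyplot as plt

def load_csv(source):
    """Load a CSV file (path or file-like object) into a DataFrame."""
    # low_memory=False reads the file in one pass, which avoids mixed-type
    # warnings on columns that hold both numbers and letters.
    return pd.read_csv(source, low_memory=False)

# With the real files this would be, for example:
#   df1 = load_csv("train.csv")
#   df2 = load_csv("store.csv")
sample = io.StringIO(
    "Store,DayOfWeek,Date,Sales\n"
    "1,5,2015-07-31,5263\n"
    "2,5,2015-07-31,6064\n"
)
df1 = load_csv(sample)
print(df1.head())
```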
Step 5: To know the dimensions of the data set, use the shape attribute.
train.csv file dimension is (1017209, 9)
store.csv file dimension is (1115, 10)
Step 6: Display the data types of the columns in the data set.
Store int64
StoreType object
Assortment object
CompetitionDistance float64
CompetitionOpenSinceMonth float64
CompetitionOpenSinceYear float64
Promo2 int64
Promo2SinceWeek float64
Promo2SinceYear float64
PromoInterval object
Step 7: Display the count of non-null values in each column of the data set.
Store 1017209
DayOfWeek 1017209
Date 1017209
Sales 1017209
Customers 1017209
Open 1017209
Promo 1017209
StateHoliday 1017209
SchoolHoliday 1017209
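The outputs of Steps 5 to 7 come from three standard pandas calls. The sketch below applies them to a tiny made-up frame rather than the real files; the values are illustrative, and on the real data the same calls would be made on the loaded train and store frames.

```python
import pandas as pd

df = pd.DataFrame({
    "Store": [1, 2, 3],
    "StoreType": ["c", "a", "a"],
    "CompetitionDistance": [1270.0, 570.0, None],
})

dimensions = df.shape          # Step 5: (rows, columns)
column_types = df.dtypes       # Step 6: data type of each column
non_null_counts = df.count()   # Step 7: non-null values per column

print(dimensions)
print(column_types)
print(non_null_counts)
```

A column whose count is lower than the number of rows, such as `CompetitionDistance` here, contains missing values.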
Step 8: Check whether any columns are empty or contain null values.
df1.columns[df1.isnull().any()]
Step 9: Describe the columns of the data set by displaying the count, mean, standard deviation, minimum, maximum, etc., so that the minimum values, maximum values, and other relevant information for numeric columns are clearly displayed.
Using the df1["Sales"].describe() command, the details of the Sales column are displayed as
count 1.017209e+06
mean 5.773819e+03
std 3.849926e+03
min 0.000000e+00
25% 3.727000e+03
50% 5.744000e+03
75% 7.856000e+03
max 4.155100e+04
The minimum shows that the Sales column contains zero values, meaning that no sales happened on those particular days.
saleszerocount = len(df1[df1.Sales == 0])
print("The number of days when sales were zero: " + str(saleszerocount))
Use the above code to identify the number of days on which the sales value was zero, i.e., no sales.
Step 10: Calculate the number of zero values and the number of nonzero values in the Customers column.
customerscount = len(df1)  # total number of rows (assumed; not defined in the original listing)
customerszerocount = len(df1[df1.Customers == 0])
print("The Total Number of Customers with Zero Value " + str(customerszerocount))
nonzerocountcustomers = customerscount - customerszerocount
print("The Total Number of Non Zero Customers are " + str(nonzerocountcustomers))
Step 11: Apply the above option to all the relevant columns in the given data set. It identifies the count of zero (empty) values in each column.
Step 12: To identify the specific locations where the data columns are empty, use the code given below.
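The listing for this step is not reproduced in the text. One possible way to locate empty cells with pandas is sketched here as an assumption, not as the authors' code; the frame is a made-up stand-in for store.csv and the variable names are hypothetical.

```python
import pandas as pd

store = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "CompetitionDistance": [1270.0, None, 14130.0, None],
    "Promo2SinceWeek": [None, 13.0, 14.0, None],
})

# How many values are missing in each column.
missing_per_column = store.isnull().sum()

# Row labels where one specific column is empty.
rows_missing_distance = store.index[store["CompetitionDistance"].isnull()].tolist()

# Every (row, column) position that holds a missing value.
missing_locations = [
    (row, col)
    for col in store.columns
    for row in store.index[store[col].isnull()]
]

print(missing_per_column)
print(rows_missing_distance)
print(missing_locations)
```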
Step 13: After close observation, clear out the rows with zero or null values for specific columns. For example, the rows with zero sales are removed as follows.
df1 = df1[df1.Sales != 0]
Step 14: Perform univariate analysis of the numerical (quantitative) columns, using data visualization tools, as a quality check. [9]
df1.hist(xrot=None, figsize=(15,15))
plt.show()
Figure 12 : Univariate Analysis on Quantitative Data of train.csv
Step 15: Perform univariate analysis of the categorical columns, using data visualization tools, as a quality check.
Figure 13 : Univariate Analysis on Categorical Column of train.csv
The univariate analysis helps in performing a quality check by visualizing the data using various plots. The Date column does not give any relevant information.
Figure 14 : Univariate Analysis on Categorical Column of store.csv
Figure 15 : Univariate Analysis on Quantitative Data of store.csv
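Alongside the plots, the distribution of a categorical column can be checked numerically with `value_counts`. The sketch below uses a made-up stand-in for store.csv; the category letters are illustrative values only.

```python
import pandas as pd

store = pd.DataFrame({
    "StoreType": ["a", "c", "a", "d", "a", "b"],
    "Assortment": ["a", "c", "a", "c", "a", "a"],
})

# Absolute frequency of each category, most frequent first.
type_counts = store["StoreType"].value_counts()

# Relative frequency (share of rows) of each category.
type_share = store["StoreType"].value_counts(normalize=True)

print(type_counts)
print(type_share)
```

A category that appears only a handful of times, or an unexpected category value, is a signal to revisit the data cleaning steps.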
Step 16: Perform bivariate analysis of the target variable against all other numerical columns, using data visualization tools, as a quality check.
sb.pairplot(data=df1, x_vars=['Sales'],
            y_vars=['Store', 'DayOfWeek', 'Customers'])
plt.show()
Figure 16 : Bivariate Analysis 1 on Quantitative Data of train.csv
In Figure 16, the plots of the target variable Sales against Store and DayOfWeek cannot be read as relationships between quantitative variables, since Store and DayOfWeek take discrete values rather than continuous ones. Only Sales against Customers gives a clear picture: as the number of customers increases, the sales also increase. There is an outlier in the diagram, and visualization tools help us to identify such outliers in the data set.
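The relationship read off the scatter plot can also be checked numerically with the Pearson correlation coefficient. The values below are invented for illustration and are not taken from train.csv.

```python
import pandas as pd

train = pd.DataFrame({
    "Customers": [100, 200, 300, 400, 500],
    "Sales": [1000, 2100, 2900, 4200, 5000],
})

# Pearson correlation: close to +1 when sales rise with customer count.
correlation = train["Customers"].corr(train["Sales"])
print(correlation)
```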
sb.pairplot(data=df1, x_vars=['Sales'],
            y_vars=['Open', 'Promo', 'SchoolHoliday'])
plt.show()
Figure 17 : Bivariate Analysis 2 on Quantitative Data of train.csv
Step 17: Perform bivariate analysis of the target variable against all categorical columns, using data visualization tools, as a quality check.
def boxplot(x, y, **kwargs):
    sb.boxplot(x=x, y=y)
    x = plt.xticks(rotation=90)
f = pd.melt(df1, id_vars=['Sales'], value_vars=traincatfeatures)
g = sb.FacetGrid(f, col="variable", col_wrap=3, sharex=False, sharey=False, height=5)
g = g.map(boxplot, "value", "Sales")
Figure 18 : Bivariate Analysis on Categorical Data of train.csv
In Figure 18, the categorical data does not yield any useful information, as one variable is Date and the other is StateHoliday, which is either 0 or 1.
Step 18: Perform multivariate analysis with three or more features. The output is a scatter plot in which a third feature is encoded by colour (hue). The hue variable can be changed to get a different visualization perspective.
sb.scatterplot(x="Customers", y="Sales", hue="SchoolHoliday", data=df1)
Figure 19 : Multivariate Analysis on Data of train.csv (hue: SchoolHoliday)
sb.scatterplot(x="Customers", y="Sales", hue="DayOfWeek", data=df1)
Figure 20 : Multivariate Analysis on Data of train.csv (hue: DayOfWeek)
V. CONCLUSION
This paper describes the various methodologies involved in data cleaning and feature engineering. It emphasizes the step-by-step approach required to perform a feature engineering task on a machine learning problem. It also describes the basic technical aspects of machine learning concepts to give a grounding in the basics of ML. Statistical insight into the data is obtained using various EDA tools. Univariate, bivariate, and multivariate analyses are performed to understand individual features and their relationships with the target variable. The analysis of the data may vary depending upon the data set.
ACKNOWLEDGMENT
My thanks to my Research Guide Prof K.Seeniselvi for encouraging me in my research endeavors which covers both technological and social aspects.
My Special Thanks to Mr. Rajesh Kumar, Founder | BuzzTech Training Institute for the constant encouragement, support and guidance which helped me to get acquainted with technical information in detail.
REFERENCES
[1] John McCarthy, Reminiscences on the history of time sharing, IEEE Annals of the History of Computing, Vol.14, No.1, pp.19–24, 1992.
[2] https://www.cs.ubbcluj.ro/~gabis/ml/ml-books/McGrawHill%20-%20Machine%20Learning%20-Tom%20Mitchell.pdf
[3] https://towardsdatascience.com/machine-learning-types-and-algorithms-d8b79545a6ec
[4] Smola, Alex, and S.V.N. Vishwanathan. Introduction to Machine Learning. Cambridge University Press, 2008. N.p., 2008. Web.
[5] https://www.sciencedaily.com/terms/artificial_intelligence.htm
[6] https://www.cs.virginia.edu/~evans/greatworks/samuel.pdf
[7] Chethan Kumar GN. "Machine-learning-types-and-algorithms." https://towardsdatascience.com. N.p., n.d. Web.
[8] "Rossmann Store Sales." N.p., n.d. Web. <https://www.kaggle.com/c/rossmann-store-sales/data>.
T. Seeni Selvi completed her M.Sc. and M.Phil. in Computer Science. She is currently pursuing her Ph.D. in the area of Data Mining. She has guided 11 M.Phil. students. She has published 2 papers in international conferences and 12 papers in reputed international journals. She has completed 6 e-learning courses in various Computer Science domains.