Missing values analysis techniques in Data mining: Review
1*Mohammed Sharik U Zama,2*Somula Ramasubbareddy
Department of Information Technology,VNRVJIET, Hyderabad, Telangana 500090, India.
Abstract
Missing data is a prevalent problem in data analytics. Research studies and surveys often have missing values in their observations, and these can dramatically reduce the quality of a data set. In real-world databases and data warehouses, data is frequently inaccurate, incomplete, and inconsistent. The causes are numerous: human or computer errors during data entry, respondents purposefully submitting incorrect answers, faulty measurements, and many more. Missing data can harm the knowledge discovery process in several ways, for example by producing biased results and invalid conclusions. It also makes analysis arduous, mainly because data mining algorithms perform best on data sets that are consistent and complete. Fortunately, several techniques can be employed in the data preprocessing stage to handle missing data. The purpose of this paper is to compare and classify these methods. The result of the study is a comparison, classification, and contrast of methods for handling missing data, together with the advantages and disadvantages of each method.
I. INTRODUCTION
Data mining can be described as the process of discovering interesting patterns and knowledge from large amounts of data. Data sources are of many types, such as the web, information repositories, data warehouses, and data streamed dynamically into the system [1]. Missing data analysis is a relatively new discipline. Data preparation is the most important step in the investigation of data [2].
Data may be missing from a data set for several reasons, such as measurement errors, manual data entry procedures, and errors in the equipment used. The three major problems associated with missing data are loss of efficiency, complexity in handling and analyzing the data, and bias arising from differences between the missing and the complete data. Data can be called quality data when it is believable, accurate, consistent, timely, and interpretable. Real-world data sets, however, rarely meet all of these criteria. Therefore, the data is preprocessed before it is used in data mining algorithms; this process is called data preprocessing. In data preprocessing, data cleaning routines are employed to fill in missing values, smooth noisy data by identifying outliers, and correct inconsistencies in the data. Missing data is divided primarily into three categories [3]:
• Missing completely at random (MCAR)
• Missing at random (MAR)
• Not missing at random (NMAR)
Simple methods for dealing with missing data include deleting records that have incomplete information and replacing missing values with average values. These methods are easy to implement but have a serious drawback: they can produce a biased model [4]. A missing data treatment should satisfy the following three conditions:
• There should be no bias in the estimation, and the data distribution should not be changed.
• Relationships among the attributes must be retained.
• Complexity and time cost must be minimal.
The maximum likelihood (ML) method, implemented with the expectation-maximization (EM) algorithm, and multiple imputation (MI) offer good statistical properties for missing data analysis. EM [5] [6] [7] [8] requires the conditional expectation of the log-likelihood function to be evaluated repeatedly until the estimates converge to the maximum likelihood solution. The multiple imputation (MI) method, in contrast, needs an explicitly defined prediction model, which is fitted by minimizing an error criterion and is then used to forecast the missing values.
Data mining algorithms usually handle missing values in a very simple manner, relying on traditional techniques such as deleting the data, mean value imputation, maximum likelihood, and other statistical methods. Extensive research is currently under way on missing data analysis and on the use of machine learning techniques for imputation [9] [2] [10]. K-means clustering has been used to predict missing attribute values in a data set [11] [12], whereas K-nearest neighbor (KNN) is widely used for missing data imputation [13] [14] [15] [16]. K-means has been reported to perform poorly in comparison with the KNN classifier in missing value imputation [17]. K-means algorithms have also been studied as an imputation method in their own right [18], and the effect of different k-means clustering algorithms has been surveyed [19]. In addition, fuzzy k-means imputation (FKMI) is performed using the Euclidean distance function [20] [21].
II. TECHNIQUES EMPLOYED FOR REMOVING MISSING DATA
Missing data is data in which some values or fields are missing or incomplete. This incompleteness has several causes, such as human errors, computer errors in the data entry procedure, measurement errors, and respondents intentionally submitting incorrect answers. The three major problems associated with missing data are loss of efficiency, complexity in handling and analyzing the data, and bias arising from differences between the missing and the complete data. Accuracy is compromised because the amount of knowledge in the data is significantly reduced. Missing value analysis helps to resolve these problems.
Missing data is divided primarily into three categories [3]:
• Missing Completely at Random (MCAR):
MCAR has the highest level of randomness. There is no relationship between whether a value is missing and any value in the data set, observed or missing. For example, if data point A is missing, its absence does not depend on any other data point B in the data set, nor on the value of A itself; that is, the missingness of A cannot be predicted from any other data point. The probability of being missing is therefore the same for all cases. The main advantage of this type of missing data is that it is easier for researchers to estimate and compute a given model.
• Missing at Random (MAR):
In MAR, the missingness is related to the observed data but not to the missing values themselves. When a specific data point is found missing, the missingness can be explained by other observed variables on which complete information is available.
• Not Missing at Random (NMAR):
In NMAR, the missingness is not random: it depends on the missing values themselves, so it cannot be explained by, or approximated from, the other data points in the data set.
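The three mechanisms can be illustrated with a small simulation. The following is a minimal Python sketch and is not taken from the reviewed literature: the variables age and income, the missingness probabilities, and the function names are hypothetical choices made only for illustration.

```python
import random

def simulate_missingness(n=2000, seed=0):
    """Generate (age, income) pairs and mask income under three mechanisms."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 70)
        income = 1000 + 40 * age + rng.gauss(0, 300)
        rows.append((age, income))

    mcar, mar, nmar = [], [], []
    for age, income in rows:
        # MCAR: every value has the same 30% chance of being missing.
        mcar.append(None if rng.random() < 0.3 else income)
        # MAR: missingness depends only on the observed variable (age).
        mar.append(None if rng.random() < (0.6 if age > 45 else 0.1) else income)
        # NMAR: missingness depends on the unobserved value itself (income).
        nmar.append(None if rng.random() < (0.6 if income > 2800 else 0.1) else income)
    return rows, mcar, mar, nmar

def observed_mean(values):
    """Mean of the values that are present (None marks a missing value)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

rows, mcar, mar, nmar = simulate_missingness()
true_mean = sum(income for _, income in rows) / len(rows)
print(f"true mean      : {true_mean:.1f}")
print(f"MCAR, observed : {observed_mean(mcar):.1f}")
print(f"MAR,  observed : {observed_mean(mar):.1f}")
print(f"NMAR, observed : {observed_mean(nmar):.1f}")
```

In this sketch the observed mean under MCAR stays close to the true mean, whereas under MAR and NMAR the observed mean is pulled downward because high incomes are missing more often. Under MAR the bias could in principle be corrected using age; under NMAR it could not.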
A few techniques for handling missing values in machine learning and statistical methods are as follows:
• Eliminate from the data set all patterns (records) that contain missing data. This is applicable mainly when only a small part of the data is missing.
• Replace the missing data with estimated values, which is known as imputation.
• Fit a model that suits the data and use it to calculate the missing values.
Because many methods exist for handling missing data, it is crucial to understand the purpose, advantages, and disadvantages of each. According to the studies reviewed, missing data handling methods can be grouped broadly into two approaches: statistics and machine learning. The statistical methods comprise the ignoring technique and the model-based technique. In the machine learning methods, imputation techniques are applied using K-Nearest Neighbors, K-Means Clustering, and Fuzzy C-Means.
III. MISSING DATA HANDLING METHODS
Different strategies exist for handling missing data. Figure 1 shows a few of them. As observed above, there are two categories of methods for missing data handling: the statistical methods and the machine learning methods.
1) Missing Data Ignoring Technique:
The ignoring technique involves two types of deletion: listwise and pairwise. In listwise deletion, any case that has missing data for any of the variables is excluded from the analysis. This is the default in many statistical packages. Pairwise deletion, also known as the available case method, considers each feature independently: for every feature, all observed values are taken into consideration and only the missing values are skipped.
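The difference between the two deletion types can be sketched in a few lines of Python. This is an illustrative sketch only; the function names and the toy data are assumptions, and None is used to mark a missing value.

```python
def listwise_delete(rows):
    """Keep only the cases (rows) with no missing value in any variable."""
    return [row for row in rows if all(v is not None for v in row)]

def pairwise_means(rows):
    """Available-case analysis: each column mean uses every observed value."""
    means = []
    for column in zip(*rows):
        present = [v for v in column if v is not None]
        means.append(sum(present) / len(present))
    return means

data = [
    (1.0, 10.0, 100.0),
    (2.0, None, 200.0),
    (3.0, 30.0, None),
    (4.0, 40.0, 400.0),
]

complete = listwise_delete(data)
print(len(complete))             # number of cases that survive listwise deletion
print(pairwise_means(complete))  # column means from complete cases only
print(pairwise_means(data))      # column means from all available values
```

Listwise deletion discards half of the cases here, although each of the discarded cases lacks only one value; the available-case means use those remaining values.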
Missing Data Imputation Method:
In the imputation method, missing values are replaced with values estimated from the information present in the data set. Many choices exist, from simple mean imputation to more efficient methods based on the relationships between attributes, as given below:
• Mean and Median Substitution:
This method is generally used in sample surveys, where a non-sampled instance is replaced by another sampled instance. It is the most widely used single imputation technique. In mean substitution, the missing values of a variable are replaced by the mean of the observed values of that variable, so the imputed values rely on a single variable. Mean substitution preserves the mean of the variable but generally distorts its other characteristics, such as its variance. Mean or median substitution of covariates and outcome variables is nevertheless still common practice. The method can be improved by stratifying the data into subgroups and using the subgroup mean. In median imputation, the median of the completed data set is similar to that obtained with case deletion, but the variability between individual responses is reduced, and variances and covariances are biased toward zero.
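Mean, median, and subgroup-mean substitution can be sketched as follows. This is a minimal illustration; the function names, the age values, and the two strata are hypothetical.

```python
from statistics import mean, median

def impute(values, statistic=mean):
    """Replace every None with one statistic of the observed values."""
    present = [v for v in values if v is not None]
    fill = statistic(present)
    return [fill if v is None else v for v in values]

def impute_by_group(values, groups):
    """Subgroup-mean substitution: fill with the mean of the same stratum."""
    fills = {}
    for g in set(groups):
        members = [v for v, gg in zip(values, groups) if gg == g and v is not None]
        fills[g] = mean(members)
    return [fills[g] if v is None else v for v, g in zip(values, groups)]

ages = [20, 22, None, 60, 64, None]
group = ["young", "young", "young", "old", "old", "old"]

print(impute(ages, mean))            # both gaps become 41.5
print(impute(ages, median))          # both gaps become 41.0
print(impute_by_group(ages, group))  # gaps become 21 and 62
```

The overall mean places both imputed ages between the two strata, where no observed value lies; the subgroup means keep each imputed value inside its own stratum.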
• Hot deck:
In hot deck imputation, a missing value is replaced with a value drawn from an estimated distribution of the current data.
Hot deck imputation is generally carried out in two stages. In the first stage, the data is partitioned into clusters. In the second stage, each instance with missing data is associated with one cluster. The complete cases in that cluster are then used to fill in the missing values, by evaluating the mean or mode of the attribute within the cluster [4].
• Regression Imputation:
This method uses a predictive model to impute missing data. A regression model is fitted to the observed features, and the values it estimates are used to fill in the missing values [3] [22] [23].
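A minimal sketch of regression imputation with a single predictor is given below. It assumes a simple linear model y = a + b x fitted by ordinary least squares on the complete cases; the function names and data are illustrative and not taken from the cited studies.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def regression_impute(xs, ys):
    """Fill missing ys with predictions from a line fitted on complete cases."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in complete], [y for _, y in complete])
    return [a + b * x if y is None else y for x, y in zip(xs, ys)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, None, 8.1, None]
print(regression_impute(x, y))
```

Because every imputed value lies exactly on the fitted line, this form of imputation tends to overstate the strength of the relationship between the variables.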
• Multiple Imputation:
In the multiple imputation method, several complete data sets are generated, with the missing values filled in by models such as linear regression. The variables used to estimate the missing values should include every variable that will be used in the analysis model for parameter estimation. The overall parameter estimate is the mean of the individual estimates from the imputed data sets. The imputation process itself introduces uncertainty; multiple imputation overcomes this shortcoming of single imputation by adding an extra error term, based on the variation of the parameter estimates across imputations, to account for the imputation error [8] [24] [25] [26].
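The pooling step can be sketched with the combining rules commonly attributed to Rubin, which the text above describes but does not name; that identification is an assumption here. With m imputed data sets, the pooled estimate is the mean of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The numbers below are hypothetical.

```python
from statistics import mean, variance

def pool_estimates(estimates, variances):
    """Combine m per-imputation estimates and their sampling variances."""
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical results from m = 5 imputed data sets.
est = [10.2, 9.8, 10.5, 10.1, 9.9]
var = [0.40, 0.38, 0.42, 0.41, 0.39]
q, t = pool_estimates(est, var)
print(f"pooled estimate = {q:.2f}, total variance = {t:.3f}")
```

The between-imputation term is the additional error that a single imputation cannot provide, since a single completed data set gives no information on how much the estimate varies across plausible imputations.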
Fig. 1. Strategies for handling missing data.
• K-Nearest Neighbor Imputation (KNNI):
This method selects, from the observations in which the attribute is present, the K observations nearest to the incomplete one, that is, those that minimize a distance measure. Once the K nearest neighbors are found, the missing value is replaced by a value estimated from them. The replacement rule depends on the type of data; for example, the mean of the neighbors can be used for quantitative attributes and the mode for qualitative ones. KNNI can therefore be used for both qualitative and quantitative attributes [13] [14] [15] [16] [11] [12] [17].
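A minimal sketch of KNNI for quantitative attributes follows. It is an illustration under stated assumptions rather than the procedure of any cited study: the distance is the Euclidean distance averaged over the attributes observed in both rows, the imputed value is the plain mean of the K donors, and the function name and data are hypothetical.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest donor rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other in rows:
                # A donor must be a different row with attribute j observed.
                if other is row or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Root mean squared difference over the shared attributes.
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:  # leave the gap if no donor exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [0.9, 1.9, 2.8],
    [8.0, 9.0, 10.0],
]
print(knn_impute(data, k=2)[1])
```

With k = 2 the distant fourth row is ignored, and the gap is filled from the two rows that resemble the incomplete one. In practice the attributes would usually be scaled first so that no single attribute dominates the distance.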
• K-Means:
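In k-means, objects are classified into k groups on the basis of their attributes/features, where k is a positive integer. The grouping is carried out by minimizing the sum of squared distances between the data and the corresponding cluster centroids [17] [18] [19]. A missing value can then be predicted from the centroid of the cluster to which the incomplete object is assigned. This method is regarded as a fast way to predict missing values, although KNN has been reported to give better accuracy [17].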
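One way to use k-means for imputation is sketched below. The design is an assumption made for illustration: the centroids are learned from the complete rows only, each incomplete row is assigned to the nearest centroid using its observed attributes, and the gaps are filled from that centroid. Seeding the centroids with the first k complete rows is a simplification.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def kmeans_impute(rows, k=2):
    """Fill None entries from the centroid nearest on the observed attributes."""
    complete = [r for r in rows if None not in r]
    centroids = kmeans(complete, k)
    filled = []
    for row in rows:
        seen = [j for j, v in enumerate(row) if v is not None]
        best = min(centroids,
                   key=lambda c: sum((row[j] - c[j]) ** 2 for j in seen))
        filled.append([best[j] if v is None else v for j, v in enumerate(row)])
    return filled

rows = [
    [1.0, 1.0],
    [9.0, 9.0],
    [1.2, 0.8],
    [8.8, 9.2],
    [1.1, None],
    [None, 9.1],
]
for row in kmeans_impute(rows, k=2):
    print(row)
```

Every incomplete row assigned to the same cluster receives the same imputed value, which is one reason a neighbor-based method can be more accurate.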
• Fuzzy K-Mean Clustering Imputation (FKMI):
In FKMI, membership functions play a significant role. Fuzzy clustering helps to determine whether the clusters are well separated. Unlike hard clustering, it allows an object to belong to several clusters at once, which is useful because an object with missing values often cannot be assigned to a single cluster with certainty. The membership function describes the degree to which each object belongs to each cluster, and the missing values are imputed using these degrees of membership [20] [21].
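The role of the membership degrees can be sketched as follows. The sketch assumes that the cluster centroids have already been obtained by a fuzzy clustering step, uses the standard fuzzy c-means membership formula with a fuzzifier of 2, and imputes each gap as the membership-weighted average of the centroid values. The function names and centroids are hypothetical.

```python
import math

def memberships(row, centroids, fuzzifier=2.0):
    """Fuzzy c-means membership degrees from the observed attributes of row."""
    seen = [j for j, v in enumerate(row) if v is not None]
    dists = [math.sqrt(sum((row[j] - c[j]) ** 2 for j in seen)) for c in centroids]
    hits = dists.count(0.0)
    if hits:  # the row coincides with one or more centroids
        return [1.0 / hits if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (fuzzifier - 1.0)
    return [1.0 / sum((d / other) ** power for other in dists) for d in dists]

def fuzzy_impute(row, centroids, fuzzifier=2.0):
    """Fill None entries with the membership-weighted mean of the centroids."""
    u = memberships(row, centroids, fuzzifier)
    return [sum(w * c[j] for w, c in zip(u, centroids)) if v is None else v
            for j, v in enumerate(row)]

# Centroids assumed to come from a prior fuzzy clustering of the data.
centroids = [[1.0, 1.0], [9.0, 9.0]]
print(memberships([3.0, None], centroids))
print(fuzzy_impute([3.0, None], centroids))
```

The incomplete object lies closer to the first centroid, so it receives a larger membership in that cluster, but the second cluster still contributes to the imputed value.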
2) Missing Data Model-Based Technique:
This technique estimates the parameters of a model for the whole data set.
• Maximum Likelihood:
Maximum likelihood is a method that uses the likelihood function to estimate the parameters of a distribution. The data set is assumed to follow this distribution, and the parameters are chosen to maximize the likelihood of the observed data. To estimate the parameters, the likelihood equation of the observed data is formed; the roots of this equation that globally maximize the likelihood of the observed data give consistent estimates. The Newton-Raphson procedure, Fisher scoring, or a quasi-Newton method can be used to carry out the iterative maximization.
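The iterative maximization can be illustrated with a Newton-Raphson sketch for a deliberately simple case. The exponential model, the function name, and the data are assumptions chosen so that the result can be checked against the closed-form answer (one over the mean of the observed values); the missing values are simply left out of the observed-data likelihood.

```python
def exponential_rate_mle(values, tol=1e-10, max_iter=100):
    """Newton-Raphson maximization of the observed-data log-likelihood
    l(rate) = n * log(rate) - rate * sum(x) for an exponential model."""
    x = [v for v in values if v is not None]  # missing values are skipped
    n, total = len(x), sum(x)
    rate = 1.0 / max(x)                       # start below the maximizer
    for _ in range(max_iter):
        score = n / rate - total              # first derivative of l
        hessian = -n / rate ** 2              # second derivative of l
        step = score / hessian
        rate -= step
        if abs(step) < tol:
            break
    return rate

sample = [0.5, None, 1.5, 2.0, None, 4.0]
print(exponential_rate_mle(sample))  # close to 1 / mean(observed) = 0.5
```

For models without a closed-form solution, the same update is applied to a parameter vector, with the score as the gradient and the Hessian as the matrix of second derivatives.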
• Expectation-Maximization (EM) algorithm:
This is an iterative algorithm used to find maximum likelihood estimates of the parameters of a statistical model when the data is incomplete. The EM algorithm consists of two steps: the E-step and the M-step. In the E-step, the conditional expectation of the log-likelihood function is computed given the observed data and the current parameter estimates; in practice, the current mean and covariance estimates are used to substitute the missing values of each variable. In the M-step, this conditional expectation is maximized: maximum likelihood estimates of the mean vector and covariance matrix are obtained as if there were no missing values. The covariance matrix and regression coefficients resulting from the M-step are then used to re-estimate the missing values. The iterations are repeated until the estimates converge. The EM algorithm needs a large sample size, and the data should be missing at random (MAR) [7] [8].
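The two steps can be sketched for the simplest multivariate case. The sketch below assumes bivariate normal data in which x is fully observed and y is missing at random for some cases; the function name, the fixed number of iterations, and the data are illustrative assumptions.

```python
def em_bivariate(xs, ys, iterations=200):
    """EM estimates of the mean and covariance of (x, y) when some ys are None."""
    n = len(xs)
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    obs = [y for y in ys if y is not None]
    mu_y = sum(obs) / len(obs)                 # starting values
    s_yy = sum((y - mu_y) ** 2 for y in obs) / len(obs)
    s_xy = 0.0
    for _ in range(iterations):
        # E-step: expected y and y^2 for each case under current parameters.
        slope = s_xy / s_xx
        resid_var = s_yy - slope * s_xy
        ey, ey2 = [], []
        for x, y in zip(xs, ys):
            if y is None:
                pred = mu_y + slope * (x - mu_x)
                ey.append(pred)
                ey2.append(pred ** 2 + resid_var)
            else:
                ey.append(y)
                ey2.append(y ** 2)
        # M-step: maximum likelihood updates from the expected statistics.
        mu_y = sum(ey) / n
        s_xy = sum(x * e for x, e in zip(xs, ey)) / n - mu_x * mu_y
        s_yy = sum(ey2) / n - mu_y ** 2
    # Final imputations are the conditional means of the missing ys.
    imputed = [mu_y + (s_xy / s_xx) * (x - mu_x) if y is None else y
               for x, y in zip(xs, ys)]
    return (mu_x, mu_y), (s_xx, s_xy, s_yy), imputed

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.2, 3.9, 6.1, 8.0, None, None]
means, cov, completed = em_bivariate(x, y)
print(means)
print(completed)
```

Note that the E-step adds the residual variance to the expected squares of the missing values; replacing the missing values by their predictions alone would underestimate the variance of y.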
When the data follows a distribution that can be written as a model equation, such as the multivariate normal distribution, a Gaussian mixture, or other types of distribution, the model-based technique using the expectation-maximization algorithm is used. For data sets that do not necessarily follow such a model, the imputation technique is suggested.
For data sets with small sample sizes, statistical methods can be used. For data sets with large sample sizes, imputation with machine learning is recommended. In real-world data sets, imputation techniques are generally easier to apply. Root Mean Squared Error (RMSE) is used as the standard error criterion to determine which method is more efficient: the smaller the RMSE, the more efficient the handling of missing data.
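The RMSE comparison can be sketched as follows. The usual procedure, assumed here, is to mask values whose true content is known, impute them with each method, and compare the imputed values with the truth; the numbers below are hypothetical.

```python
import math

def rmse(true_values, imputed_values):
    """Root mean squared error between true and imputed values."""
    pairs = list(zip(true_values, imputed_values))
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

truth = [3.0, 5.0, 7.0]       # values that were deliberately masked
mean_fill = [5.0, 5.0, 5.0]   # hypothetical output of mean imputation
knn_fill = [3.5, 5.0, 6.0]    # hypothetical output of KNN imputation

print(f"mean imputation RMSE: {rmse(truth, mean_fill):.3f}")
print(f"KNN imputation RMSE : {rmse(truth, knn_fill):.3f}")
```

In this hypothetical comparison the second method has the smaller RMSE and would be preferred.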
REFERENCES
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.
[2] Umathe and V. H. G.C., “A Review on incomplete data and clustering,” Int. J. Comput. Sci. Inf. Technol., vol. 6, no. 2, pp. 1225–1227, 2015.
[3] G. Batista and M. C. Monard, “An Analysis of Four Missing Data Treatment Methods for Supervised Learning,” Applied Artificial Intelligence, vol. 17, pp. 519–533, 2003. [Online]. Available: 10.1080/713827181
[4] M. Soley-Bori, “Dealing with missing data: Key assumptions and methods for applied analysis,” Technical report, 2013.
[5] P. D. Allison, “Handling missing data by maximum likelihood,” SAS Global Forum, Statistics and Data Analysis, 2012.
[6] S. Galbraith, “Applied Missing Data Analysis by Craig K Enders,” Australian & New Zealand Journal of Statistics, vol. 54, no. 2, p. 251, 2012. [Online]. Available: 10.1111/j.1467-842X.2012.00656.x
[7] N. Balakrishnan and D. Kundu, “Hybrid censoring: Models, inferential results and applications,” Computational Statistics & Data Analysis, vol. 57, pp. 166–209, Jan. 2013. [Online]. Available: 10.1016/j.csda.2012.03.025
[8] A. Hansson and R. Wallin, “Maximum likelihood estimation of Gaussian models with missing data - Eight equivalent formulations,” Automatica, vol. 48, pp. 1955–1962, 2012.
[9] B. Suthar, H. Patel, and A. Gosawmi, “A survey classification of imputation methods in data mining,” IJAETAE, pp. 309–312, 2012.
[10] S. Kanchana and D. A. S. Thanamani, “Classification of Efficient Imputation Method for Analyzing Missing Values,” International Journal of Computer Trends and Technology, vol. 12, pp. 193–195, Jun. 2014. [Online]. Available: 10.14445/22312803/IJCTT-V12P138
[11] N. Suguna and K. G. Thanushkodi, “Predicting Missing Attribute Values Using k-Means Clustering,” Journal of Computer Science, vol. 7, pp. 216–224, Feb. 2011. [Online]. Available: 10.3844/jcssp.2011.216.224
[12] S. Baiwal, “Imputation of Missing Values using Association Rule Mining & K-Mean Clustering,” 2016.
[13] Ibrahim Aydilek and A. Arslan, “A novel hybrid approach to estimating missing values in databases using K-nearest neighbors and neural networks,” International Journal of Innovative Computing, Information and Control, vol. 8, Jul. 2012.
[14] M. R. Malarvizhi and D. A. Selvadoss Thanamani, “K-Nearest Neighbor in Missing Data Imputation,” 2012.
[15] W. M. Khedr and A. M. Elshewey, “Pattern Classification for Incomplete Data Using PPCA and KNN,” Journal of Emerging Trends in Computing and Information Sciences, vol. 4, no. 8, pp. 628–632, 2013.
[16] S. Gajawada and D. Toshniwal, “Missing value imputation method based on clustering and nearest neighbours,” International Journal of Future Computer and Communication, vol. 1, no. 2, pp. 206–208, 2012.
[17] M. R. Malarvizhi, “K-NN Classifier Performs Better Than K-Means Clustering in Missing Value Imputation,” IOSR Journal of Computer Engineering, vol. 6, pp. 12–15, Jan. 2012. [Online]. Available: 10.9790/0661-0651215
[18] B. Mehala, K. Vivekanandan, and P. R. J. Thangaiah, “An Analysis on K-Means Algorithm as an Imputation Method to Deal with Missing Values,” Asian Journal of Information Technology, vol. 7, no. 9, pp. 434–441, 2008.
[19] F. A. Bakhsh and K. Maghooli, “Missing Data Analysis: A Survey on the Effect of Different K-Means Clustering Algorithms,” American Journal of Signal Processing, pp. 65–70, 2014.
[20] C.-T. Chang, J. Z. C. Lai, and M.-D. Jeng, “A Fuzzy K-means Clustering Algorithm Using Cluster Center Displacement,” Journal of Information Science and Engineering, vol. 27, pp. 995–1009, May 2011.
[21] S. P. M. J, “A Comparison of Six Methods for Missing Data Imputation,” Journal of Biometrics & Biostatistics, vol. 6, Jan. 2015. [Online]. Available: 10.4172/2155-6180.1000224
[22] A. N. S, “A Comparative Study of Missing Value Imputation Methods on Time Series Data,” International Journal of Technology Innovations and Research (IJTIR), vol. 14, pp. 1–8, 2015.
[23] N. J and K.-P. N, “Efficiency Comparison of Data Mining Techniques for Missing Value Imputation,” J. Ind. Intell. Inf., pp. 305–309, 2015.
[24] Y. Dong and C.-Y. J. Peng, “Principled missing data methods for researchers,” SpringerPlus, vol. 2, no. 1, p. 222, May 2013. [Online]. Available: 10.1186/2193-1801-2-222