2.5 Data
3.1.2 Data Outliers and Missing Values
A visual overview of our data shows outliers as well as missing values in almost all of the data categories, so we have to justify the action taken in each case. In this subsection we briefly explain how we handle outliers and missing values. An outlier is an observation that lies an abnormal distance from the rest of the values in the dataset [25]. We have detected outliers in almost all the collections. These abnormal observations could fall under one of the following categories:
1. Outliers are the result of measurement or recording errors
2. Outliers are the unpremeditated and exact outcome resulting from the recordings
In our case all outliers fall under the second category, because when recording errors occur no results are collected at all, as we explain below in the discussion of missing values. Outliers can contain valuable information, especially when they concern the target, such as the number of arrivals for a state. It is therefore important to treat our outliers as they are recorded, and we assume that these values are correctly reported by the officials. Extremely high Water Drum Prices, for example, are correlated with the conditions within a state. They are not errors that we should ignore, but indicators of movement, either pushing or pulling people from state to state.
In the case of missing data, we can assume that the missing value falls under one of the following categories:
1. The sensors that provide data for two of our categories, rain and rivers, failed to give feedback and were not replaced within the month, so there were not enough readings for us to average the values.
2. The information officers were unable to register arrivals and departures, so we have missing values for the Region in terms of numbers for Current, Before and Future.
Missing values can be treated with one of the following techniques, and we have selected to experiment with two of these because they better serve the purpose of this project:
1. Replace missing values with 0. This would not fit the needs of our case. If we replace missing values of arrivals in a region with zeros, for example, we alter the training set so that it models arrivals for that region as zero, whereas other influential variables might indicate that there were many arrivals for that region that unfortunately were not recorded. We do not want to bias the machine in such a non-rational manner.
2. Replace the missing value with an alternative value, such as the mean, the median, or the value of the previous instance. Again, since the datasets with missing values show no pattern of arrivals or rainfall, we cannot assume that we can guess the missing value. Taking the rainfall in Gedo shown in Table 3.2 as an example, we cannot assume that there was no rainfall in September, or that the rainfall equalled the mean of the series, because the series deviates extremely from both its mean and its median and each row does not follow the value of the row before it.
3. The last option is to exclude the entire row, for all categories, from the training set. This leads to gaps between dates in the training set, but since in both of our machine learning models we want to base prediction on patterns, it makes the training set smaller but a lot more reliable.

Date   4/1/2017   5/1/2017   6/1/2017   7/1/2017   8/1/2017   9/1/2017
Rain   83.2       44.7       0.0        0.0        0.0        (missing)

Date   10/1/2017  11/1/2017  12/1/2017  1/1/2018   2/1/2018   3/1/2018
Rain   65.0       110.0      0.0        0.0        0.0        0.6

Table 3.2: Rainfall of 1 year of data in the Gedo state
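The three techniques above can be sketched on the Gedo rainfall series from Table 3.2. This is a minimal illustration and not the project's actual preprocessing code. It assumes that the missing reading is the September 2017 one, as the text indicates, and all function and variable names are hypothetical.

```python
from statistics import mean, median

# Gedo rainfall from Table 3.2. None marks the missing reading,
# assumed here to be September 2017 (9/1/2017).
dates = ["4/1/2017", "5/1/2017", "6/1/2017", "7/1/2017", "8/1/2017", "9/1/2017",
         "10/1/2017", "11/1/2017", "12/1/2017", "1/1/2018", "2/1/2018", "3/1/2018"]
rain = [83.2, 44.7, 0.0, 0.0, 0.0, None, 65.0, 110.0, 0.0, 0.0, 0.0, 0.6]

observed = [v for v in rain if v is not None]


def fill_constant(values, constant):
    """Techniques 1 and 2: replace every missing entry with one fixed value."""
    return [constant if v is None else v for v in values]


def fill_previous(values):
    """Technique 2: replace a missing entry with the previous instance value.

    A missing entry at the very start of the series stays None.
    """
    filled, last = [], None
    for v in values:
        if v is None:
            v = last
        filled.append(v)
        last = v
    return filled


def drop_missing(row_dates, values):
    """Technique 3: exclude rows whose value is missing."""
    return [(d, v) for d, v in zip(row_dates, values) if v is not None]


zero_filled = fill_constant(rain, 0.0)
mean_filled = fill_constant(rain, mean(observed))
median_filled = fill_constant(rain, median(observed))
previous_filled = fill_previous(rain)
rows_kept = drop_missing(dates, rain)

for name, series in [("zero", zero_filled), ("mean", mean_filled),
                     ("median", median_filled), ("previous", previous_filled)]:
    print(f"{name:>8}: September = {series[5]:.1f}")
print(f"rows kept after exclusion: {len(rows_kept)} of {len(rain)}")
```

The candidate replacements for September disagree with each other (the mean differs strongly from the median and the previous value), which is the reason given above for excluding the row instead of guessing.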
The reason we examine the training set in the next step of our methodology is that research has indicated a relationship between model accuracy and both outliers and missing data. The paper by [26] tests ANNs on multiple datasets with different percentages of missing data and concludes that potentially significant information loss is produced even with small percentages of missing samples. For outliers in the training data, it has been demonstrated that modeling accuracy decreases as the number of outlying points increases [27]. The same paper concludes that when the outliers are less than 15% of the total data, the model's accuracy is statistically significant compared to having no outlier data. This study also shows that variations in the percentage and magnitude of outliers in the test data may affect modeling accuracy.
Given these conclusions of previous researchers, we will also experiment and compare the accuracy of our models with outliers included and with outliers disregarded. More on the training set is explained in the experimental setup and the Results section.
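One way to prepare the two training-set variants for such a comparison is sketched below. The text does not state which rule is used to identify outliers, so the 1.5 x IQR rule used here is an assumption, and the price values, column names and function names are hypothetical placeholders instead of project data.

```python
from statistics import quantiles


def iqr_outlier_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    The IQR rule is an assumed detection rule, not one named in the text.
    """
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v < low or v > high for v in values]


def training_variants(rows, column):
    """Return (rows including outliers, rows with flagged outliers removed)."""
    flags = iqr_outlier_flags([row[column] for row in rows])
    with_outliers = list(rows)
    without_outliers = [row for row, flagged in zip(rows, flags) if not flagged]
    return with_outliers, without_outliers


# Hypothetical monthly prices with one extreme value.
prices = [10, 12, 11, 13, 12, 95, 11, 10]
rows = [{"month": i + 1, "price": p} for i, p in enumerate(prices)]

with_outliers, without_outliers = training_variants(rows, "price")
print(len(with_outliers), len(without_outliers))
```

Each model can then be trained once on each variant and the resulting accuracies compared, keeping everything else in the setup identical.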