2.6 Data Cleaning
2.6.1 Initial Data Investigation
The initial data investigation aims to assess the current data quality. The quality depends on dimensions that include accuracy, completeness, consistency, relevance and reliability, as covered in section 2.1.1.2. Higher data quality enables higher quality analysis.
Before considering the required data cleaning steps, an overview of the current data fitness is required. Data cleaning is typically an iterative process and unique to each data set. The requirements for the preprocessed data can be found in Table 2.3.
2.6.1.1 Completeness
The completeness of data is directly measured by the amount of data present. This is typically the first aspect that suffers when a system experiences technical difficulties. It is also one of the largest contributors to the quality of data before preprocessing and therefore has a large impact on the accuracy and quality of the cleaned data.
Using the Missingno tool, as described in section 2.4.1.1, the data sets were investigated and evaluated for fitness for further analysis. From the sets, one representative with a high level of completeness and one with a low level of completeness were selected for comparison. Starting with the data set which has a high level of completeness, the resultant Missingno matrix plot and bar chart can be seen in Fig. 2.5(a) and 2.5(b). From these figures, it is evident that the data set, specifically the temperature vectors, reaches a completeness of 93 %. As discussed in section 2.2.1.3, it is expected that the temperature data sets will be the most complete as the temperature sensor data is always sent; no data saving is performed on these values. This makes the temperature data completeness representative of the SEC's data completeness. In contrast, the power and hot water usage data values are compressed by not reporting the values if they are the default value. As a result, the power and hot water usage data values have completeness levels of 8.3 % and 8.0 % respectively.
On the other end of the spectrum, besides units that flat out have no reported data values, there are units which have large amounts of data missing. Such a data set is represented in Fig. 2.6(a) and 2.6(b). Evidently there is intermittent data in the temperature data vectors, with a large outage of data experienced during the early stages of operation. The overall data completeness clearly indicates that the data experienced significant losses during operation. The temperature, power and hot water data set completeness for this unit was found to be 51.5 %, 4.9 % and 5.9 % respectively.
(a) High data completeness chart.
(b) High data completeness bar chart.
Figure 2.5: High level of data completeness.
Table 2.4 provides a comparison of the two data sets, the first with a high data completeness and the second with a low data completeness.
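The completeness percentages quoted above are the share of expected observations that carry a value. A minimal sketch of that calculation with pandas follows; the frame `demo`, its values and the helper `completeness_percent` are illustrative assumptions rather than the project's actual data or code. The column labels `T1` and `W` follow Table 2.5.

```python
import numpy as np
import pandas as pd


def completeness_percent(df: pd.DataFrame) -> pd.Series:
    """Percentage of non-missing observations per column."""
    return df.notna().mean() * 100.0


# Hypothetical one-hour excerpt at a 1-minute period (made-up values).
index = pd.date_range("2016-10-07 00:00", periods=60, freq="min")

t1 = np.full(60, 55.0)   # temperature is always sent ...
t1[10:13] = np.nan       # ... apart from three lost samples
w = np.full(60, np.nan)  # power is not reported while at its default value
w[:5] = 3.0              # element switched on for five minutes

demo = pd.DataFrame({"T1": t1, "W": w}, index=index)
completeness = completeness_percent(demo)
print(completeness.round(1))

# With Missingno installed, the same frame can be visualised using
# missingno.matrix(demo) and missingno.bar(demo).
```

Because unreported default values appear as missing, a low figure for the power column in this sketch reflects the reporting scheme rather than data loss, which mirrors the distinction drawn in the text.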
This satisfies the completeness investigation requirement DC1.1 in Table 2.3.
2.6.1.2 Accuracy
The accuracy of a data set directly influences the quality of the analysis of the data and the insight gained from it.
Values: In the context of preprocessing data that originates from a system that was deemed accurate, the main triggers for inaccurate data are seen as unexpected values, typically outliers. Considering the temperature data of an EWH, temperature values up to 60 °C are expected [50] and temperatures exceeding 80 °C are not expected and may indicate erroneous values.
Considering the frequency plot of the outlet, inlet and ambient temperatures, as seen in Fig. 2.7, a clear distribution can be seen for all below 80 °C. There are, however,
(a) Low data completeness chart.
(b) Low data completeness bar chart.
Figure 2.6: Low level of data completeness.
of values. Considering the raw data from the outlet sensor, a significant spike is evident around 2016-10-07, which is unlike any other recorded values. Data cleaning will need to effectively handle outliers such as this.
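One simple way to act on the 80 °C expectation is to mask implausible readings so that later steps treat them as missing. The sketch below assumes a hard upper threshold is acceptable. The name `mask_temperature_outliers` and the sample readings are hypothetical, and this is not necessarily the cleaning method adopted later in this work.

```python
import numpy as np
import pandas as pd

# From the text: temperatures exceeding 80 °C are not expected for an EWH.
MAX_PLAUSIBLE_TEMP_C = 80.0


def mask_temperature_outliers(
    temps: pd.Series, upper: float = MAX_PLAUSIBLE_TEMP_C
) -> pd.Series:
    """Replace readings above `upper` with NaN and leave the rest untouched."""
    return temps.where(temps <= upper)


# Made-up outlet readings containing one spike similar to the one described.
outlet = pd.Series([55.0, 56.0, 100.0, 55.5, np.nan])
n_flagged = int((outlet > MAX_PLAUSIBLE_TEMP_C).sum())
cleaned = mask_temperature_outliers(outlet)

print(n_flagged)
print(cleaned)
```

Masking rather than deleting keeps the time index intact, so the flagged observations simply count against completeness until they are handled.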
Time Stamps: Another aspect to consider is the time accuracy of the data. The SECs are designed to report at a constant period of 1 minute; any deviation from this registers as an inaccurate period of sampling. Considering successive observations as shown in Fig. 2.9(a), a time drift was observed. The drift seems constant, with every observation being recorded about 1 min and 1-2 s after the previous observation. This drift could lead to regular missed observations, reducing the completeness of the data set. Fig. 2.9(b) shows the frequency distribution of the seconds value of the recorded time stamp, indicating that no particular seconds value occurs markedly more often than the others.
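The drift can be quantified from the time stamps alone: the difference between successive stamps gives the actual sampling interval, and a count of the seconds field gives the kind of distribution plotted in Fig. 2.9(b). The sketch below uses synthetic stamps spaced 61 s apart as an assumed stand-in for the observed behaviour. The function names are hypothetical.

```python
import pandas as pd


def sampling_intervals(stamps: pd.DatetimeIndex) -> pd.Series:
    """Seconds elapsed between successive observations."""
    return stamps.to_series().diff().dt.total_seconds().dropna()


def seconds_distribution(stamps: pd.DatetimeIndex) -> pd.Series:
    """How often each seconds value (0-59) occurs in the time stamps."""
    return pd.Series(stamps.second).value_counts().sort_index()


# Synthetic example: 60 observations, each 61 s after the previous one.
start = pd.Timestamp("2016-10-07 00:00:00")
stamps = pd.DatetimeIndex(
    start + pd.to_timedelta([61 * i for i in range(60)], unit="s")
)

intervals = sampling_intervals(stamps)
drift = intervals - 60.0  # deviation from the nominal 1-minute period
dist = seconds_distribution(stamps)

print(drift.describe())
print(dist.head())
```

With a constant drift of about one second per sample, the seconds field sweeps through all values over time, which is consistent with a distribution that shows no dominant seconds value.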
2.6.1.3 Consistency
The provided data had a high level of consistency regarding the particular fields of interest. All the sensor measurements were stored as numerical data types with no erroneous data types present. The consistency was a concern due to the rapid development of the Geasy project and the expected evolution of the database. This contributed to the selection of an offline data set for development and analysis.
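A data type check of this kind can be automated. The following is a minimal sketch, assuming the measurements are loaded into a pandas DataFrame with the labels of Table 2.5. The helper `non_numeric_columns` and the sample frame are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def non_numeric_columns(df: pd.DataFrame, columns: list) -> list:
    """Return the requested columns that are not stored as a numeric dtype."""
    return [name for name in columns if not is_numeric_dtype(df[name])]


# Made-up frame in which the power column was stored as text by mistake.
frame = pd.DataFrame(
    {
        "T1": [55.0, 54.8],
        "T3": [18.2, 18.1],
        "W": ["0", "3.0"],
    }
)

offenders = non_numeric_columns(frame, ["W", "T1", "T3"])
print(offenders)
```

An empty result corresponds to the situation described above, in which all fields of interest are stored as numerical types.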
This satisfies the consistency investigation requirement DC1.3 in Table 2.3.
2.6.1.4 Reliability
The provided data set is assumed to have a high level of reliability and no further alterations, besides improving the data quality, were made.
This satisfies the reliability investigation requirement DC1.4 in Table 2.3.
2.6.1.5 Relevance
The recorded values in the database primarily include all the required values to perform effective analysis of the data. Of particular importance are the sensor values. These sensors are discussed in 2.2.1.3. Of the six sensors, five are relevant to the analysis to be performed in this dissertation. The one sensor that is irrelevant for this study is the outlet far temperature sensor, which was previously used to estimate flow rates and event times. Furthermore, where available, the EWH volume and element ratings are also important metrics. The relevant parameters, along with their units, are listed in Table 2.5.
This satisfies the relevance investigation requirement DC1.5 in Table 2.3.
(a) Outlet temperature.
(b) Inlet temperature.
(c) Ambient temperature.
Figure 2.7: Investigation, through frequency plots, of the temperature outliers suffered by the Geasy SECs.
Table 2.5: Parameter descriptions as used in the data set.
Label   Description           Unit
W       Power                 kW
T1      Outlet temperature    °C
T3      Inlet temperature     °C
T4      Ambient temperature   °C
Figure 2.8: Raw temperature plot showing erroneous spike of 100 °C (vertical axis: temperature in °C).
(a) Time stamp drift experienced for 60 consecutive observations, shown by the recorded seconds value of the time stamp.
(b) Time stamp drift seconds frequency over the period of a day.
2.6.1.6 Summary
This section provided valuable insight into the current state of the data quality, satisfying requirement DC1 from Table 2.3.
2.6.2 Sampling Period Regularisation
The data was observed to have a time stamp drift, as seen in section 2.6.1.2. Data cleaning routines and methods used during further analysis of the data are typically sensitive to non-regular sample periods. For this data set, sampling period regularisation is achieved by the method described in section 2.3.1. As a result of the time grouping, the seconds values are essentially truncated and recorded as 0 for all observations. This provides a regular sampling period of 1 minute. Because of the SEC sampling period drift, the resultant data structure with a regular sampling period contains some observations with no recorded values.
This satisfies the sampling period regularisation requirement DC2.1 in Table 2.3.
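The effect of the regularisation can be illustrated with a short pandas sketch. It assumes that time grouping amounts to truncating each time stamp to the minute, aggregating any observations that share a minute (the mean is used here purely as an assumption) and placing the result on a gap-free 1-minute grid. The method of section 2.3.1 may differ in detail, and `regularise_to_minutes` is a hypothetical name.

```python
import pandas as pd


def regularise_to_minutes(df: pd.DataFrame) -> pd.DataFrame:
    """Truncate time stamps to the minute and return a gap-free 1-minute grid."""
    # Group by the truncated time stamp; the mean is an assumed aggregator.
    grouped = df.groupby(df.index.floor("min")).mean()
    grid = pd.date_range(grouped.index.min(), grouped.index.max(), freq="min")
    # Minutes without any observation become rows of NaN.
    return grouped.reindex(grid)


# Synthetic drifting data: 62 observations, each 61 s after the previous one.
start = pd.Timestamp("2016-10-07 00:00:00")
stamps = start + pd.to_timedelta([61 * i for i in range(62)], unit="s")
raw = pd.DataFrame({"T1": 55.0}, index=stamps)

regular = regularise_to_minutes(raw)
empty_minutes = regular.index[regular["T1"].isna()]

print(len(regular))
print(empty_minutes)
```

In this synthetic case the accumulated drift causes the observations to step over the minute starting at 01:00:00, so that row of the regular grid carries no recorded value. This is the kind of empty observation referred to above.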