K-Means Clustering Algorithm - Theoretical Background

3. Theoretical Background

3.2. K-Means Clustering Algorithm

The K-Means algorithm clusters data into homogeneous subsets. The method is simple, efficient and robust, and it often produces usable clustering results. Rather than breaking down when it encounters data structures it does not support, it ignores intrinsic structures it is unable to handle, such as autocorrelation, and creates clusters nonetheless. Its robustness can be interpreted as a “brute force” clustering approach: regardless of data quality, it delivers a clustering solution in most cases. This robustness can also lead to unintended results if the method is applied haphazardly, and it should not be mistaken for stability of the solution.

The algorithm is described in Table 5 and consists of four steps: 1) the algorithm is initialized by randomly assigning the observations to k clusters; 2) iteratively, each smart meter is assigned to the cluster closest “in distance”; 3) each assignment updates the cluster means; and 4) steps 2 and 3 are repeated until there is no change in the assignment of clusters.

K-Means Clustering Algorithm [46]

1. Randomly assign each observation to one of the clusters k = 1, 2, 3, …, K, where K is defined by the analyst.

2. For a given cluster assignment C, the total cluster variance

$$\min_{C,\,\{m_k\}_1^K} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \lVert x_i - m_k \rVert^2 \qquad (3)$$

is minimized with respect to the means $\{m_1, \dots, m_K\}$, yielding the means of the currently assigned clusters. $N_k$ is the number of observations assigned to cluster $k$, and $x_i$ is the i'th observation vector.

3. Given a current set of means $\{m_1, \dots, m_K\}$, (3) is minimized by assigning each observation to the closest (current) cluster mean:

$$C(i) = \operatorname*{argmin}_{1 \le k \le K} \lVert x_i - m_k \rVert^2 \qquad (4)$$

4. Steps 2 and 3 are iterated until the assignments do not change or the maximum number of iterations is reached. This algorithm can lead to suboptimal local solutions.

Table 5 - K-Means algorithm.
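The four steps of Table 5 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the papers: the names `k_means`, `squared_distance` and `mean_vector` are hypothetical, and the handling of an empty cluster (reseeding it with a randomly chosen observation) is an assumption that Table 5 does not specify.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def k_means(data, K, max_iter=100, seed=0):
    """Illustrative K-Means following the four steps of Table 5."""
    rng = random.Random(seed)
    # Step 1: randomly assign every observation to one of the K clusters.
    assignment = [rng.randrange(K) for _ in data]
    means = [None] * K
    for _ in range(max_iter):
        # Step 2: for the current assignment, the cluster means minimize (3).
        for k in range(K):
            members = [x for x, c in zip(data, assignment) if c == k]
            if members:
                means[k] = mean_vector(members)
            elif means[k] is None:
                # Assumption: an empty cluster is reseeded with a random observation.
                means[k] = list(rng.choice(data))
        # Step 3: assign each observation to its closest current mean, equation (4).
        new_assignment = [
            min(range(K), key=lambda k: squared_distance(x, means[k]))
            for x in data
        ]
        # Step 4: stop as soon as no assignment changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, means
```

Calling `k_means(data, K=2)` on a list of equal-length load profiles returns one cluster label per profile together with the final cluster means. The `max_iter` argument corresponds to the iteration limit mentioned in step 4.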

K-Means is prone to delivering suboptimal, unstable solutions because, depending on the random initialization, the method can get caught in a local optimum. It is therefore advisable to rerun the algorithm with different random initializations and then select the preferred solution. The scikit-learn package applied throughout the papers runs ten random initializations and keeps the best-performing solution.
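The restart strategy can be illustrated with a small, self-contained sketch. It is not the scikit-learn implementation: `lloyd` and `best_of_restarts` are hypothetical names, each run is assumed to start from K randomly chosen observations, and runs are assumed to be compared by their within-cluster sum of squared distances.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def lloyd(data, K, rng, max_iter=100):
    """One K-Means run from a random start; returns (objective, labels)."""
    # Assumption: the initial means are K distinct observations drawn at random.
    means = [list(x) for x in rng.sample(data, K)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(K), key=lambda k: sq_dist(x, means[k])) for x in data
        ]
        if new_labels == labels:
            break  # converged: no observation changed cluster
        labels = new_labels
        for k in range(K):
            members = [x for x, c in zip(data, labels) if c == k]
            if members:  # an empty cluster keeps its previous mean
                means[k] = [sum(col) / len(members) for col in zip(*members)]
    # Within-cluster sum of squared distances, used here to rank the runs.
    objective = sum(sq_dist(x, means[c]) for x, c in zip(data, labels))
    return objective, labels


def best_of_restarts(data, K, n_init=10, seed=0):
    """Run `lloyd` n_init times and keep the run with the lowest objective."""
    rng = random.Random(seed)
    runs = [lloyd(data, K, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[0])
```

The default `n_init=10` mirrors the ten initializations mentioned above. Increasing it lowers the chance of reporting a poor local optimum, at a proportional increase in running time.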

Apart from its random initialization, the K-Means algorithm is a deterministic algorithm whose objective is to minimize the distance from each observation to the cluster centroids. The centroids, which are average values of the members in the cluster, are updated each time a new member is added or removed. The constant updating of the centroids results in members leaving clusters and vice versa. This is continued until convergence is achieved, measured such that no member or centroid changes.

The K-Means algorithm evaluates each variable by itself, disregarding correlation information. For smart-meter data this equates to evaluating each smart-metering time step independently of the other time steps. In many settings this poses no problem for the clustering, but for smart-meter data it means that the temporal structure of the consumption is not used.
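One way to see that the ordering of the time steps plays no role is to reorder the time steps of two meters in the same way and compare the distances before and after. The sketch below uses made-up readings, and the variable names are hypothetical. Because the squared Euclidean distance is a sum over time steps, it is unchanged by any common reordering.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


# Two hypothetical hourly load profiles (kWh) with 24 readings each.
rng = random.Random(0)
meter_a = [rng.uniform(0.0, 5.0) for _ in range(24)]
meter_b = [rng.uniform(0.0, 5.0) for _ in range(24)]

# Apply the same random reordering of the time steps to both meters.
order = list(range(24))
rng.shuffle(order)
shuffled_a = [meter_a[t] for t in order]
shuffled_b = [meter_b[t] for t in order]

original = squared_distance(meter_a, meter_b)
shuffled = squared_distance(shuffled_a, shuffled_b)

# The two distances agree up to floating-point rounding.
print(abs(original - shuffled) < 1e-9)
```

Since K-Means only sees these distances, it would return the same clusters for the original and the reordered data, which is why information carried by the order of the readings cannot influence the result.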

Smart-meter data, as shown in section 3.5 and in papers 2 and 3, contain a time-dependent component, revealed by the autocorrelation in the data. This component carries information about how previous consumption affects current consumption. As shown in Figure 3, K-Means evaluates each value on the x-axis (time step) independently, even though the figure indicates periodicity. K-Means is unable to include this autocorrelation, and hence this information is not conveyed in the clustering solution. Including autocorrelation information could potentially decrease variability. Papers 1, 2 and 3 discuss the implications of excluding temporal information from the clustering.

Figure 3 - Three-day scatter plot of 10 smart meters with hourly resolution (x-axis: 1 to 4 January 2012; y-axis: consumption in kWh). The blue arrow shows the direction of the K-Means computations when clustering; all observations in the same hour are used for the clustering. The red arrow shows the temporal structure in the data; this structure is not included in the K-Means clustering.

As described in paper 1, K-Means is the most prevalent clustering algorithm in smart-meter consumption clustering; its simplicity and widespread availability make it an obvious option for analysts. Paper 1 indicates that very few papers acknowledge the existence of autocorrelation in smart-meter data, and only one of the identified papers deploys time-series methods, by applying a Fourier transformation.

The simplicity of the K-Means algorithm makes it an excellent baseline for clustering. Papers 2 and 3 successfully investigate how careful preprocessing of the input data can enable K-Means to account for autocorrelation without changing the algorithm. It is possible to do this without introducing complexity in the clustering phase by transforming the input data such that the transformed data account for the dependencies. This enables K-Means to account for time dependencies indirectly, thereby including latent information and reducing the variance in the resulting clusters. The preprocessing of the data does not increase the computational cost, as the chosen transformations – autocorrelation feature and wavelet features – are calculated by applying efficient linear time algorithms [47].
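As an illustration of such a transformation, the sketch below computes sample autocorrelation features for a single meter. It is a sketch under assumptions rather than the exact procedure of papers 2 and 3: the function name `autocorrelation_features`, the choice of lags 1 to 24 and the synthetic input series are all hypothetical.

```python
import math


def autocorrelation_features(series, max_lag=24):
    """Sample autocorrelation of `series` at lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    sum_sq = sum(c * c for c in centered)
    if sum_sq == 0:
        # Assumption: a perfectly flat series gets all-zero features.
        return [0.0] * max_lag
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / sum_sq
        for lag in range(1, max_lag + 1)
    ]


# Hypothetical week of hourly readings with a daily cycle (168 values).
week = [2.0 + math.sin(2 * math.pi * hour / 24) for hour in range(168)]
features = autocorrelation_features(week)
```

Each meter is then represented by its vector of lag features instead of its raw readings, so the unchanged K-Means algorithm compares meters by how their consumption depends on earlier consumption. With the assumed 24 lags, a 168-value series is reduced to 24 values per meter.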

The K-Means method is implemented in every major data science software package from proprietary to open source. Its simplicity makes it straightforward to implement, such that analysts can still deploy it even if it is missing from their preferred programming language.

The simplicity of the algorithm makes it easy to evaluate its computational cost. Algorithms are evaluated using O-notation, which expresses an upper bound on the computational cost as an order of magnitude [48]. The worst-case running time for the K-Means algorithm is $O(k^n)$ [49] for k clusters and n observations; in the case of smart meters, n is the number of meter readings and equates to the dimensions of the dataset. The worst-case scenario is the maximum computational effort needed to cluster a given dataset. The best possible running time for the K-Means algorithm is $O(k^{\sqrt{n}})$ [49], a significant reduction of computational effort even for small datasets. For both the upper and lower bound running times there is a significant speed gain to be harvested by reducing the number of observations per meter, i.e. the dimensions of the data. Some of the methods described in sections 3.5 and 3.6 have a significant impact on the running time. Papers 2 and 3 report the following results from the K-Means clustering of different datasets: Table 6 shows the effect on electricity data and Table 7 on district-heating data. The two tables show that dimensionality reduction and data transformation by the autocorrelation features and wavelets described in sections 3.5 and 3.6 significantly reduce the worst-case running time. They also have a positive effect on the best-case running time.

| Processing (Electricity Data) | Normalization | Autocorrelation Features | Wavelet Features |
|---|---|---|---|
| Scaling / Transform | $O(n)$ | $O(n)$ | $O(n)$ |
| Size of input data (n) | 168 x 32k+ | 24 x 32k+ | 42 x 32k+ |
| Best-case running time | $12^{\sqrt{168}}$ | $12^{\sqrt{24}}$ | $12^{\sqrt{42}}$ |
| Worst-case running time | $12^{168}$ | $12^{24}$ | $12^{42}$ |

Table 6 - Runtime comparison table, adapted from paper 2. The normalization and wavelet methods were unable to provide meaningful clusters and are, for comparison, set to twelve clusters, with 25% compression for wavelets. The autocorrelation and wavelet methods reduce the dataset size, with significant impact on the runtime.

| Processing (District Heating Data) | Normalization | Autocorrelation Features | Wavelet Features |
|---|---|---|---|
| Scaling / Transform | $O(n)$ | $O(n)$ | $O(n)$ |
| Size of input data (n) | 744 x 49 | 24 x 49 | 161 x 49 |
| Best-case running time | $4^{\sqrt{744}}$ | $7^{\sqrt{24}}$ | $4^{\sqrt{161}}$ |
| Worst-case running time | $4^{744}$ | $7^{24}$ | $4^{161}$ |

Table 7 - Runtime comparison table, adapted from paper 3. The different scalings and transformations identify different numbers of clusters in the data. In this case the worst-case running time for the autocorrelation feature clustering is better than for the scaled or wavelet-transformed data.
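The bounds in Tables 6 and 7 are too large to compare by eye, so it can help to look at their base-10 logarithms. The sketch below does this for the (k, n) pairs listed in the two tables. The helper name `log10_cost` and the dictionary layout are hypothetical, and the bounds are taken as stated in the text, i.e. k^n for the worst case and k^√n for the best case.

```python
import math


def log10_cost(k, exponent):
    """Base-10 logarithm of k ** exponent, computed without overflow."""
    return exponent * math.log10(k)


# (clusters k, dimensions n) as listed in Table 6 and Table 7.
electricity = {
    "normalization": (12, 168),
    "autocorrelation": (12, 24),
    "wavelet": (12, 42),
}
district_heating = {
    "normalization": (4, 744),
    "autocorrelation": (7, 24),
    "wavelet": (4, 161),
}

for name, table in (("electricity", electricity), ("district heating", district_heating)):
    for method, (k, n) in table.items():
        worst = log10_cost(k, n)             # log10 of k^n
        best = log10_cost(k, math.sqrt(n))   # log10 of k^sqrt(n)
        print(f"{name:16s} {method:15s} worst ~ 10^{worst:.0f}, best ~ 10^{best:.1f}")
```

On this scale the effect of reducing n is direct: the exponent of the worst-case bound shrinks in proportion to n, while the exponent of the best-case bound shrinks with the square root of n.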

The K-Means algorithm is sensitive to differences of scale between variables. Normalization of variables is often a prerequisite for meaningful clustering. Papers 2, 3 and 4 all employ some type of scaling or transformation. Paper 3 evaluates the four different scaling methods presented in Table 8 and their impact on the resulting clusters.

| Scale | Mathematical Description | Intuition |
|---|---|---|
| Normalization | $\frac{x - x_{\min}}{x_{\max} - x_{\min}}$ | Normalization puts all observations on a 0-1 scale compared to the largest reading. Dimensionless. |
| Standardization | $\frac{x - x_{\text{mean}}}{\sigma}$ | Standardization scales all observations compared to the standard deviation of the data. Dimensionless. |
| Mean-Center | $x - x_{\text{mean}}$ | Mean-centering removes the mean from the meter reading. It is equal to shifting on the y-axis. |
| Mean-Divide | $\frac{x}{x_{\text{mean}}}$ | Scales observations relative to the series mean. Does not constrain the y-axis to the interval [0, 1]. Dimensionless. |

Table 8 - Scaling methods applied to K-Means input data. As presented in paper 3.
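The four scaling methods of Table 8 can be written as short functions applied to the readings of one meter. This is a minimal sketch: the function names and the example readings are hypothetical, and the use of the population standard deviation in `standardize` is an assumption, since Table 8 does not state which estimator is used. A constant series would cause a division by zero in `normalize` and `standardize` and is not handled here.

```python
import statistics


def normalize(x):
    """Min-max scaling to the interval [0, 1]."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]


def standardize(x):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.fmean(x)
    sd = statistics.pstdev(x)  # assumption: population standard deviation
    return [(v - mean) / sd for v in x]


def mean_center(x):
    """Subtract the series mean; units are unchanged."""
    mean = statistics.fmean(x)
    return [v - mean for v in x]


def mean_divide(x):
    """Divide by the series mean; the result is not bounded by 1."""
    mean = statistics.fmean(x)
    return [v / mean for v in x]


readings = [1.0, 2.0, 3.0, 6.0]  # hypothetical kWh readings for one meter
print(normalize(readings))
print(mean_center(readings))
print(mean_divide(readings))
```

For these readings the mean is 3.0, so mean-centering shifts every value down by 3.0, while mean-division maps the largest reading to 2.0, which illustrates that this scaling does not constrain values to [0, 1].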

In document Analysis of High Frequency Smart Meter Energy Consumption Data (Page 33-37)