Electricity-usage data for unsupervised data mining

4.4 Contributed datasets

4.4.10 Electricity-usage data for unsupervised data mining

The electrical device data originates from a trial of smart meters in 187 homes across the United Kingdom (see [118, 7]). Smart meters were used to monitor the electricity consumption of each household in watt-hours (Wh) at 15-minute intervals. Each series corresponds to the entire consumption of a household over the duration of the trial. The UK government has mandated that all households must be equipped with smart metering equipment by 2020. As a consequence, there will be very large quantities of data that must be processed in an efficient manner. Figure 4.14 shows the typical consumption of a number of device types.

One confounding factor is that devices of a similar nature have very similar usage profiles. Devices such as fridges and freezers, or computers and televisions, are very difficult to distinguish. In addition, the device-specific data is user-orientated. There is no central control over the devices that are monitored; the consumers have direct access to the monitoring equipment and all device labels are user-specified. Hence, labelling is potentially unreliable. Because of these confounding factors, it would be beneficial to have a reliable, automated method of detecting and identifying specific device use. The algorithms we propose in Chapter 7 are a first step in this direction.

Figure 4.14: Example electricity-usage data for a single house over one day. The graph on the left shows household consumption; the graph on the right is decomposed by device (Computer, Oven/Cooker, Washing Machine, Immersion Heater, Dishwasher, Fridge/Freezer, Kettle).

4.5 Conclusions

We have discussed 75 classification datasets that will be used for experimentation in Chapters 5 and 6, and a synthetic data space and electricity-usage profiles that will be used for unsupervised mining of approximately repeated patterns in Chapter 7.

Time-series Classification using Shapelet-transformed Data

The work in this chapter is published in a number of papers.

Results from the preliminary implementation of the shapelet transform were published in:

J. Lines, L. Davis, J. Hills, and A. Bagnall

A Shapelet Transform for Time Series Classification

Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 289–297, ACM: 2012.

The results from this paper are not included in this thesis, as they were created using a legacy version of the shapelet transform implemented by the lead author of [119].

The accuracy results in this thesis for the shapelet transform with 10N shapelets are to be published in:

A. Bagnall, J. Lines, J. Hills, and A. Bostrom

Time-Series Classification with COTE: The Collective of Transformation-Based Ensembles

https://ueaeprints.uea.ac.uk/id/eprint/49614

This paper is currently under review; all shapelet-related work in the paper is my own.

Some of the work on extensions to the shapelet transform, and some of the results and analysis, is published in:

J. Hills, J. Lines, E. Baranauskas, J. Mapp, and A. Bagnall

Classification of Time Series by Shapelet Transformation

Data Mining and Knowledge Discovery 28 (4), pages 851–881, Springer: 2014.

As lead author, I made the largest contribution to this paper, modifying the original implementation of the shapelet transform, running experiments, extending the transform, analysing results, and writing the paper.

5.1 Introduction

Shapelets are time-series subsequences used for classification. They are intended to provide accurate classification and insight into the problem domain. Our novel contribution to the field is the shapelet transform. The shapelet transform (Section 5.2) improves upon both the accuracy and insight aspects of the shapelet approach.

We improve classification accuracy by dissociating the shapelet-discovery algorithm from the classification algorithm; we use the discovered shapelets to transform the original data into a space of shapelet features. Rather than being anchored to a decision tree (the standard format, see Chapter 3), we employ a diverse ensemble of classifiers on the shapelet-transformed data. Our accuracy results (Section 5.3) are significantly better than any other shapelet-based approach, and significantly better than using 1NN with DTW distance, which is a benchmark for time-series classification [51].
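The idea of transforming series into a space of shapelet features can be illustrated with a minimal sketch. It assumes that each feature is the smallest Euclidean distance between a z-normalised shapelet and any equal-length, z-normalised window of the series; that distance definition, and the function names `z_normalise`, `shapelet_distance` and `shapelet_transform`, are assumptions made for illustration rather than the implementation used in this thesis.

```python
import math


def z_normalise(values):
    """Rescale a sequence to zero mean and unit variance (constant input maps to zeros)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        return [0.0] * n
    return [(v - mean) / std for v in values]


def shapelet_distance(series, shapelet):
    """Smallest Euclidean distance between the shapelet and any equal-length window.

    Assumption: both the shapelet and each window are z-normalised first.
    Returns math.inf if the series is shorter than the shapelet.
    """
    m = len(shapelet)
    target = z_normalise(shapelet)
    best = math.inf
    for start in range(len(series) - m + 1):
        window = z_normalise(series[start:start + m])
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, target)))
        best = min(best, dist)
    return best


def shapelet_transform(dataset, shapelets):
    """Map each series to one feature per shapelet: its distance to that shapelet."""
    return [[shapelet_distance(series, s) for s in shapelets] for series in dataset]
```

Each row of the result is an ordinary fixed-length feature vector, so any standard classifier, or an ensemble of them, can be trained on the transformed data without being tied to a decision tree.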

Shapelets can offer considerable insight into the problem domain (see Section 5.4). We aim to optimise the shapelet approach for providing insight along with classification accuracy.

We improve upon the ability of shapelets to offer insight into the problem domain in a number of ways. First, we eliminate the tree classifier, and its difficult-to-interpret hierarchy of binary splits on shapelets. Second, we focus on increasing interpretability by reducing dimensionality through filtering (Section 5.5) and clustering (Section 5.6) the shapelets. We compare a number of different hierarchical clustering methods to a novel form of clustering based on using the Minimum Description Length measure as a parameterless stopping criterion for the clustering. Our third contribution is to discretise the clustered shapelet data into a set of binary features representing the presence or absence of a particular shapelet in a given series. Combined with a dictionary of shapelets, this approach is entirely interpretable, and could be deployed in environments such as medicine or finance, where professionals must be able to justify the decisions they make to their customers and stakeholders (we explore this issue in more detail in Chapter 6).
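The discretisation into presence/absence features might look like the following sketch. It assumes that a shapelet counts as present when its distance to the series falls at or below a per-shapelet threshold; the thresholds, the shapelet names in the dictionary, and the functions `binarise` and `describe` are hypothetical and chosen purely for illustration.

```python
def binarise(distance_features, thresholds):
    """Turn shapelet distances into 0/1 flags: 1 means the shapelet is judged present.

    Assumption: one threshold per shapelet, and presence means distance <= threshold.
    """
    return [
        [1 if dist <= limit else 0 for dist, limit in zip(row, thresholds)]
        for row in distance_features
    ]


def describe(binary_row, dictionary):
    """List the names of the shapelets flagged as present in one series."""
    return [name for flag, name in zip(binary_row, dictionary) if flag == 1]


# Hypothetical distances for two series against three shapelets.
distances = [[0.2, 1.9, 0.4], [1.5, 0.1, 2.2]]
thresholds = [0.5, 0.5, 0.5]                      # assumed values
dictionary = ["sharp peak", "slow rise", "flat plateau"]  # assumed names

flags = binarise(distances, thresholds)
summary = [describe(row, dictionary) for row in flags]
```

With such a representation, a decision about a series can be explained by naming the shapelets found in it, which is the sense in which the binary features and dictionary together are interpretable.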

The shapelet transform improves accuracy and interpretability over rival TSC algorithms, offering a solution to TSC problems that is both effective and comprehensible to non-experts.
