In this section, I discussed dissimilarity measures between time series that are based on an
α-divergence between the permutation distributions (PD) of time series. The PD is invariant to
all strictly increasing monotonic transformations of the underlying time series. In particular, adding a constant to the time series or multiplying it by a positive constant does not change its PD; hence, the PD is invariant to normalizations, e.g., standardization by subtracting the mean and dividing by the standard deviation. This property relieves the researcher of choosing a normalization as a preprocessing step, a choice that often has a serious impact on clustering with commonly used metric dissimilarity measures. For example, Kalpakis, Gada, and Puttagunta (2001) report a comparison of the Euclidean distance on differently transformed representations of the data, including Fourier coefficients, wavelet coefficients, principal components, and raw data, and they find that normalization improves the clustering performance on some data sets, whereas it decreases the performance on others. Another advantage gained from this property is robustness against drift in a signal, which often occurs due to the physical properties of the measurement device. In electroencephalography (EEG), amplifier drifts are observed (Fisch & Spehlmann, 1999), and a common problem in accelerometry is thermal drift of the sensors. Drifts of these kinds can be thought of as a local offset in the relatively small embedding window, which does not alter the PD. These properties make PDC an interesting complexity-based alternative to other dissimilarity measures for clustering. In conjunction with a divergence between probability distributions, the PD yields a dissimilarity measure, which I proposed applying in a hierarchical clustering scheme to obtain tree-like partitions of a set of multivariate time series. A likelihood ratio test of multinomial distributions provides a statistical criterion for choosing how many distinct clusters appropriately represent the data set. This test requires setting a parameter that determines the critical value of the test.
A variant of the same criterion was derived from an information-theoretic perspective and provides a parameter-free criterion for the determination of the number of clusters.
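The invariance property and the divergence-based dissimilarity can be illustrated with a minimal sketch. The function names, the toy series, and the default embedding dimension are hypothetical, and the α-divergence is written in one common parametrization that may differ from the one used elsewhere in this work.

```python
import math
from collections import Counter
from itertools import permutations


def permutation_distribution(series, m=3, delay=1):
    """Relative frequency of each ordinal pattern of length m (the codebook)."""
    span = (m - 1) * delay
    if len(series) <= span:
        raise ValueError("series is too short for this embedding")
    counts = Counter()
    for start in range(len(series) - span):
        window = [series[start + k * delay] for k in range(m)]
        # Ordinal pattern: indices of the window sorted by value
        # (ties are broken by position because sorted() is stable).
        counts[tuple(sorted(range(m), key=window.__getitem__))] += 1
    total = sum(counts.values())
    return {p: counts[p] / total for p in permutations(range(m))}


def alpha_divergence(p, q, alpha=0.5):
    """Assumed parametrization, valid for 0 < alpha < 1:
    (1 - sum_i p_i^alpha * q_i^(1 - alpha)) / (alpha * (1 - alpha))."""
    overlap = sum(p[k] ** alpha * q[k] ** (1.0 - alpha) for k in p)
    return (1.0 - overlap) / (alpha * (1.0 - alpha))


x = [4.0, 7.0, 9.0, 10.0, 6.0, 11.0, 3.0, 5.0, 8.0, 2.0]
mean = sum(x) / len(x)
sd = math.sqrt(sum((v - mean) ** 2 for v in x) / (len(x) - 1))
z = [(v - mean) / sd for v in x]          # standardized copy of x
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]        # shorter series, pure upward trend

pd_x = permutation_distribution(x)
pd_z = permutation_distribution(z)
pd_y = permutation_distribution(y)

print(pd_x == pd_z)                       # standardization leaves the PD unchanged
print(alpha_divergence(pd_x, pd_y))       # series of different length are comparable
```

Because both series are reduced to distributions over the same m! patterns, the divergence is defined even though x and y differ in length.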
Given the large number of existing clustering algorithms, one might ask why there is a need for yet another one. Indeed, there are several reasons. First, the permutation distribution has important properties for the clustering of time series; most importantly, it does not depend on the moments of a time series and is therefore robust to shifts and scalings. Second, the codebook representation allows the comparison of time series that differ in length. Third, many heuristics for the determination of the number of clusters are developed independently of the time series’ features or the measure of dissimilarity, whereas the distributional model of the codebooks allows a consistent and straightforward formulation of such heuristics based on statistical and information-theoretic tests. Fourth, the PD is efficient to compute. All these points make a strong case for the PDC methodology. Applications to simulated and real data sets are presented in Chapter 5; they demonstrate the versatility of the approach and provide empirical evidence for the MinE criterion, which automatically detects an appropriate embedding dimension.
I have proposed information-theoretic criteria for choosing the embedding dimension and the number of clusters. Importantly, this makes PDC a parameter-free clustering approach. For future applications, it is potentially useful to apply the embedding heuristic separately to each dimension. Choosing different embedding dimensions for different dimensions of the time series might increase the discriminatory power of the clustering and might help in selecting informative dimensions of the data set. An advantage of keeping equal embedding dimensions across different time series dimensions is the possibility of forming cross-product codewords that allow modeling interdependencies between multiple dimensions.
Time series segmentation is a challenging problem in many domains. Beyond clustering a set of different time series, clustering algorithms can generally be employed to detect anomalies in time series (Keogh, Lonardi, & Ratanamahatana, 2004) or to segment time series into different parts by sliding a window over the time series and treating the resulting shorter segments as independent time series (Keogh, Chu, Hart, & Pazzani, 2004). Alternatively, top-down and bottom-up approaches or combinations thereof can be used (Keogh, Chu, et al., 2004). The window approach adds another parameter, the window size w, to the algorithm. Preselecting an appropriate window size is often considered feasible (Keogh, Lonardi, & Ratanamahatana, 2004). In segmentation approaches, the MinE criterion can be applied not only to choose an appropriate embedding dimension but also to determine the combination of embedding dimension and window size that maximizes the distinctiveness of the order pattern representation.
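The window approach to segmentation can be sketched as follows. This is only an illustration under stated assumptions: the helper names, the toy signal, the step size, and the use of the squared Hellinger distance (which corresponds to an α-divergence with α = 1/2 up to scaling in a common parametrization) are hypothetical choices, not the procedure evaluated in this work.

```python
import math
from collections import Counter
from itertools import permutations


def permutation_distribution(series, m=3):
    """Relative frequency of each ordinal pattern of length m (delay 1)."""
    counts = Counter()
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        counts[tuple(sorted(range(m), key=window.__getitem__))] += 1
    total = sum(counts.values())
    return {p: counts[p] / total for p in permutations(range(m))}


def squared_hellinger(p, q):
    """1 - sum_i sqrt(p_i * q_i), used here as a stand-in divergence."""
    return 1.0 - sum(math.sqrt(p[k] * q[k]) for k in p)


def sliding_windows(series, w, step):
    """Cut the series into segments of length w, advancing by `step` samples."""
    return [series[i:i + w] for i in range(0, len(series) - w + 1, step)]


# Toy signal: an upward trend followed by a fast oscillation.
signal = list(range(30)) + [(-1) ** i for i in range(30)]

# Each segment is treated as an independent time series with its own codebook.
segments = sliding_windows(signal, w=10, step=10)
codebooks = [permutation_distribution(s, m=3) for s in segments]
jumps = [squared_hellinger(a, b) for a, b in zip(codebooks, codebooks[1:])]

# The largest dissimilarity between neighbouring segments marks a candidate boundary.
boundary = max(range(len(jumps)), key=jumps.__getitem__)
print(boundary, jumps)
```

In this sketch the window size w and the embedding dimension m are fixed by hand; a criterion such as MinE would be used to select them jointly.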
As a further potential modification of PDC, we can adopt other divergences between probability distributions. Csiszár f-divergences (Csiszár, 1974), sometimes also called Ali–Silvey divergences (Ali & Silvey, 1966), describe another family of divergences that also subsumes the Kullback-Leibler divergence as a special case. Recently, Cichocki, Cruces, and Amari (2011) presented a further generalization of α-divergences to the broad class of α-β-γ-divergences. These might provide further interesting measures of dissimilarity for future research.
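For reference, the f-divergence family can be written out for discrete distributions such as the PD. The first two expressions follow the standard textbook definitions; the α-divergence is given in one common parametrization, which is an assumption here because conventions differ across the literature.

```latex
% f-divergence between discrete distributions P = (p_1, ..., p_n) and
% Q = (q_1, ..., q_n), for a convex function f with f(1) = 0 (assuming q_i > 0):
\[
  D_f(P \,\|\, Q) = \sum_{i=1}^{n} q_i \, f\!\left(\frac{p_i}{q_i}\right)
\]

% Kullback-Leibler divergence as the special case f(t) = t \log t:
\[
  D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}
\]

% alpha-divergence (assumed parametrization, \alpha \notin \{0, 1\}) as the
% special case f(t) = (t^{\alpha} - 1) / (\alpha (\alpha - 1)):
\[
  D_{\alpha}(P \,\|\, Q) = \frac{1}{\alpha(\alpha - 1)}
    \left( \sum_{i=1}^{n} p_i^{\alpha} \, q_i^{1-\alpha} - 1 \right)
\]
```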
One could argue that, in some cases, PDC is not an appropriate representation of the time series. For example, if the moments of the time series are thought to carry discriminative information, this information is discarded by PDC. Although this can be advantageous for some problems, it might be problematic for others. A famous example is the synthetic control chart data set, which was described by Alcock and Manolopoulos (1999). They describe a task that requires discriminating between six noisy patterns. The six basic patterns are a constant, a cyclic pattern, an increasing trend, a decreasing trend, a sudden upward shift, and a sudden downward shift. The final patterns are obtained by adding white noise to the basic patterns. PDC can in principle succeed in determining the constant, the cycle, and the trends. However, the two patterns with a sudden shift in either direction are not captured by PDC, because a shift changes the order patterns only in the few embedding windows that span it. In domains where such a shift is considered irrelevant information, for example because it arises from a sensor misreading, this property is clearly beneficial. Whenever information about the moments should be preserved, nothing speaks against taking into account multiple features for clustering that capture the specific definition of similarity in the given context, that is, extending the dissimilarity function to also consider further features, such as the signal’s mean and variance. Wang, Smith, Hyndman, and Alahakoon (2004) present such a feature-based approach, which also included complexity measures, even though they did not use the permutation distribution. It would be highly interesting to examine the extent to which PDC can improve their cluster results.
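Why a sudden shift is nearly invisible to the PD can be checked with a small sketch. The noise model (integer-valued, so that adding the shift is exact), the seed, and all parameter values are hypothetical and are not taken from the synthetic control chart data set.

```python
import random
from collections import Counter


def pattern_counts(series, m=3):
    """Count ordinal patterns of length m (delay 1); ties are broken by position."""
    counts = Counter()
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        counts[tuple(sorted(range(m), key=window.__getitem__))] += 1
    return counts


rng = random.Random(0)                       # fixed seed, purely for reproducibility
m, n, shift_at, shift_size = 3, 60, 30, 500  # hypothetical values

# Integer-valued noise around a constant level, and a copy with a sudden upward shift.
constant = [rng.randint(-100, 100) for _ in range(n)]
shifted = [v + (shift_size if t >= shift_at else 0) for t, v in enumerate(constant)]

c_const = pattern_counts(constant, m)
c_shift = pattern_counts(shifted, m)

# Only the m - 1 windows that straddle the shift can change their pattern,
# so the two count vectors differ by at most 2 * (m - 1) in total.
changed = sum(abs(c_const[p] - c_shift[p]) for p in set(c_const) | set(c_shift))
mean_gap = sum(shifted) / n - sum(constant) / n

print("pattern counts that moved:", changed)
print("difference in means:", mean_gap)
```

The means of the two series differ substantially while their codebooks are almost identical, which is the reason a moment-based feature would have to be added to separate them.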
Many time series approaches are based on a wavelet decomposition and perform clustering in the frequency space. Again, complexity measures like the PD could be applied to the different resolution levels of such basis transformations.
Hierarchical clustering as used in this work is relatively slow due to its quadratic time complexity. On the other hand, codebook generation from the time series and the calculation of the pair-wise distances are highly parallelizable. When the number of clusters is known beforehand, it might be advantageous to switch to more efficient clustering methods like k-means or spectral clustering while being aware of their particular limitations.
In the next chapter, I turn the focus of my thesis to SEM Trees, which combine SEMs and decision trees into a novel method of exploratory data analysis.
3
Structural Equation Model Trees
SEM Trees are placed in a quite different setting than the clustering approach. In that context, we assumed that neither a model of the observed data nor any additional information is available. However, this is often not the case. Indeed, researchers often hold a large set of prior assumptions about their data and have models at their disposal that formalize their preconceptions. It is interesting to observe how these initial models are refined when they do not perform as expected, e.g., when they explain too little variance in the data or when measures of goodness-of-fit indicate that the model is an inappropriate representation of the sample. In these cases, exploratory approaches can be taken, i.e., approaches that are data-driven and extend models so as to account for the observed phenomena. However, these model selection processes bear the risk of implicit and explicit selection biases. If such biases occur, model refinements represent mere idiosyncrasies of the sample rather than relations that generalize to the population. Whenever a SEM is available along with a set of covariates whose influence on the model is not yet clear, SEM Trees offer a formal approach to finding covariates that partition the data set into groups that maximally differ with respect to the estimated parameters. SEM Trees are a method to systematically explore refinements of initial models in order to guide informed theory building.
SEM Trees are a consolidation of Structural Equation Models (SEMs) and decision trees. In the following, a review of the most important concepts of SEM is given. Afterwards, the SEM concept is united with the decision tree paradigm, resulting in SEM Trees. Important theoretical and practical aspects of SEM Trees are discussed. Applications of SEM Trees are presented in Chapter 5.