2.4.1 Importance of data transformations for clustering
Data transformations are essential prior to clustering because they prepare the data for the clustering process by handling the inconsistencies the data may contain. For that reason, many methods and techniques have been developed for pre-processing. In the socio-economic context, the necessity of unit standardisation and log transformations is emphasised in (Hennig and Liao, 2013), as they deal successfully with noise, outliers and the asymmetry in the distribution of numerical attributes. The importance of standardisation prior to clustering is further supported by (Caruana et al., 2006; Duarte et al., 2010). In addition, dimensionality reduction is considered an essential transformation, both in (Vyas and Kumaranayake, 2006), where PCA (Hotelling, 1933) is used for this purpose, and in (Cortinovis et al., 1993), where attribute elimination strategies are adopted.
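As an illustration of the transformations mentioned above, the following is a minimal sketch, assuming strictly positive, right-skewed numerical attributes. The function names and the synthetic data are hypothetical and are not taken from the cited works; PCA is computed here through a plain SVD of the centred data.

```python
import numpy as np

def log_standardise(X):
    """Log-transform strictly positive columns, then z-score each column."""
    X = np.asarray(X, dtype=float)
    logged = np.log(X)  # assumption: every value is > 0
    return (logged - logged.mean(axis=0)) / logged.std(axis=0)

def pca_scores(Z, n_components):
    """Project centred data onto its leading principal components via SVD."""
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:n_components].T

rng = np.random.default_rng(0)
# Hypothetical skewed attributes measured on very different scales.
raw = np.column_stack([
    rng.lognormal(mean=10.0, sigma=1.0, size=200),
    rng.lognormal(mean=3.0, sigma=0.5, size=200),
    rng.lognormal(mean=0.0, sigma=2.0, size=200),
])
Z = log_standardise(raw)                 # comparable, roughly symmetric columns
scores = pca_scores(Z, n_components=2)   # reduced representation for clustering
```

After this step every attribute has zero mean and unit variance, so no single attribute dominates a distance-based clustering merely because of its unit of measurement.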
Data transformations can also assist in handling mixed data, a very problematic issue for clustering. According to (Ahmad and Dey, 2007), two strategies can be used to deal with mixed datasets: either transform the categorical data into numerical data and apply conventional clustering, or discretise the numeric attributes and apply categorical data clustering. The second strategy results in loss of information and therefore degrades the quality of the clustering results, whereas the first strategy demands a meaningful transformation of the categorical attributes into numerical ones, such that the distance between objects reflects the similarity between data points.
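The two strategies can be sketched on a toy example. This is only an illustration under stated assumptions: the attribute names and values are hypothetical, one-hot encoding stands in as the simplest possible numerical coding (it is not the transformation proposed in the cited work), and quantile binning stands in for discretisation.

```python
import numpy as np

def one_hot(labels):
    """Strategy 1 (naive variant): code a categorical column as 0/1 indicator columns."""
    categories, codes = np.unique(labels, return_inverse=True)
    return categories, np.eye(len(categories))[codes]

def discretise(values, n_bins):
    """Strategy 2: replace a numeric column by quantile-bin labels 0..n_bins-1."""
    inner_edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(values, inner_edges)

colour = np.array(["red", "blue", "red", "green"])  # hypothetical categorical attribute
income = np.array([10.0, 11.0, 50.0, 90.0])         # hypothetical numeric attribute

categories, indicators = one_hot(colour)
bins = discretise(income, n_bins=2)
```

With two bins, the incomes 50 and 90 receive the same label, which is the loss of information attributed to the second strategy. Under the naive indicator coding, every pair of distinct categories lies at the same Euclidean distance, which shows why the first strategy needs a more meaningful numerical representation.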
2.4.2 Homogeneity Analysis
The transformation of categorical attributes into numerical ones can be carried out by Homogeneity Analysis (Homals) (De Leeuw and Mair, 2009, 2007), a non-linear multivariate analysis technique with low computational cost that can handle vast amounts of data. In its strict sense, Homogeneity Analysis can be viewed as a descriptive tool for analysing categorical data, very similar to correspondence analysis (Lebart and Salem, 1988). Instead of using SVD (Singular Value Decomposition), however, it relies on optimising a loss function; it is faster and it capitalises on sparseness in the data, successfully dealing with missing data.
Given a dataset with categorical variables expressing the information across objects, Homals tries to find a low-dimensional space in which objects and categories are positioned in such a way that as much information as possible is retained from the original data. As explained in (Michailidis and de Leeuw, 1998), the goal becomes to construct a low-dimensional joint map of objects and categories in Euclidean space. Low dimensionality is chosen because the map can be plotted and thus interpreted and understood, whereas the choice of Euclidean space stems from its nice properties (projections, triangle inequality) and our familiarity with Euclidean geometry.
The desired properties of this low-dimensional joint space dictate that the category points serve as centres of gravity of the object points that share the same category. The larger the spread between the category points of a variable, the better that variable discriminates; the spread thus indicates how much a variable contributes to the relative loss. The distance between two object scores is related to the "similarity" between their response patterns. A "perfectly homogeneous" solution would imply that all object points coincide with their category points (De Leeuw and Mair, 2009).
Homogeneity Analysis tries to achieve maximum homogeneity by quantifying, or rescaling, the objects and the variables in a joint p-dimensional space where similar objects and categories with similar content are placed close together. It does so by trying to minimise the departure from homogeneity. This departure is measured by the sum of squared distances between the object scores and their corresponding category points. Thus the problem is reduced to minimising these distances in the p-dimensional space. Certain constraints guarantee that the representation is centred and that the object scores are orthogonal.
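The loss described above can be written compactly as follows. This is a sketch for complete data under assumed notation that does not appear in the text: n objects, m variables, X the n x p matrix of object scores, G_j the n x k_j indicator matrix of variable j, and Y_j the k_j x p matrix of its category points. The normalisation shown is one common convention; some presentations scale the loss or the constraint differently.

```latex
\[
\sigma(X; Y_1, \dots, Y_m)
  = \frac{1}{m} \sum_{j=1}^{m}
    \operatorname{tr}\!\left[ (X - G_j Y_j)^{\top} (X - G_j Y_j) \right],
\qquad
\text{subject to } \mathbf{1}^{\top} X = 0, \quad X^{\top} X = n I_p .
\]
\[
\text{For fixed } X: \qquad
\hat{Y}_j = D_j^{-1} G_j^{\top} X,
\qquad D_j = G_j^{\top} G_j .
\]
```

The first constraint centres the representation and the second makes the columns of the object scores orthogonal. The second line states that, for fixed object scores, the optimal category points are the centroids of the objects sharing each category.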
The geometrical properties of the joint representation created by Homogeneity Analysis are ideal for clustering purposes. As the distance between two object points reflects the similarity of the corresponding objects, a clustering algorithm can uncover groups of similar objects when it is performed upon the transformed data.
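To make the transformation concrete, the following is a minimal sketch of an alternating least squares scheme of the kind commonly used to fit homogeneity analysis. It assumes complete categorical data and is not the implementation of the homals package; all function names and the toy dataset are hypothetical.

```python
import numpy as np

def indicator_matrix(column):
    """Return the n x k 0/1 indicator matrix of one categorical variable."""
    _, codes = np.unique(column, return_inverse=True)
    return np.eye(codes.max() + 1)[codes]

def normalise(Z):
    """Centre the columns and rescale so that X'X = n * I."""
    Z = Z - Z.mean(axis=0)
    Q, _ = np.linalg.qr(Z)
    return np.sqrt(Z.shape[0]) * Q

def homals_loss(X, Gs, Ys):
    """Mean over variables of the squared distances between object and category points."""
    return sum(np.sum((X - G @ Y) ** 2) for G, Y in zip(Gs, Ys)) / len(Gs)

def category_points(X, Gs):
    """Category points as centroids of the objects that share the category."""
    return [(G.T @ X) / G.sum(axis=0)[:, None] for G in Gs]

def homals_als(data, p=2, n_iter=100, seed=0):
    """Alternate between updating category points and object scores."""
    n, m = data.shape
    Gs = [indicator_matrix(data[:, j]) for j in range(m)]
    rng = np.random.default_rng(seed)
    X = normalise(rng.standard_normal((n, p)))
    for _ in range(n_iter):
        Ys = category_points(X, Gs)
        # Object scores: average of the category points of each object, renormalised.
        X = normalise(sum(G @ Y for G, Y in zip(Gs, Ys)) / m)
    Ys = category_points(X, Gs)
    return X, Ys, homals_loss(X, Gs, Ys)

# Hypothetical toy data: 6 objects described by 3 categorical variables.
data = np.array([
    ["a", "x", "u"],
    ["a", "x", "u"],
    ["a", "y", "u"],
    ["b", "y", "v"],
    ["b", "z", "v"],
    ["b", "z", "v"],
])
X, Ys, loss = homals_als(data, p=2)
```

Objects with identical response patterns receive identical scores, so the rows of `X` could then be passed to any distance-based clustering algorithm.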
In its broad sense, Homogeneity Analysis can incorporate different scale levels, keeping the order and preserving the distances. This way it can fit more parsimonious models, as it can also handle ordinal and numerical variables, which increases its functionality. With the ability to handle all types of data, Homogeneity Analysis can serve as an optimal scaling and dimensionality reduction technique. Optimal scaling refers to the procedure that transforms the observed response categories according to some specified criterion (loss function) as part of an optimisation process that finds the optimal summarisation of the data. The fact that this functionality extends to categorical data gives Homogeneity Analysis the opportunity to serve as a non-linear PCA or non-linear canonical correlation analysis.
2.4.3 Exploratory Factor Analysis
Another interesting pre-processing technique, which tries to represent the underlying structure of the data through linear constructs called factors, is Exploratory Factor Analysis (EFA). As defined more formally in (Fabrigar and Wegener, 2011), EFA refers to a set of statistical procedures designed to determine the number of distinct constructs, usually referred to as factors, that are needed to account for the pattern of correlations among a set of measures. Factors are considered unobservable constructs that exert linear influences on one or more measured variables in the dataset (Ferguson and Cox, 1993).
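The linear influence of factors on measured variables is usually expressed through the common factor model. The sketch below uses assumed notation that does not appear in the text: x is the vector of centred measured variables, f the vector of factors, Λ the matrix of factor loadings, ε the unique parts (assumed uncorrelated with f and with each other), Φ the factor covariance matrix and Ψ the diagonal matrix of unique variances.

```latex
\[
x = \Lambda f + \varepsilon,
\qquad
\operatorname{Cov}(x) = \Lambda \Phi \Lambda^{\top} + \Psi .
\]
```

When the factors are taken to be uncorrelated with unit variance, Φ is the identity matrix and the covariance structure reduces to ΛΛ' + Ψ.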
To achieve that, Exploratory Factor Analysis aims to understand and represent the structure of correlations among observed scores on a set of variables. The goal of Factor Analysis is therefore to arrive at a relatively parsimonious representation of the structure that summarises the original data and reveals the associations among the variables. It can be seen as a process of clustering the correlation matrix of the variables into factors. Exploratory Factor Analysis can thus serve as a dimensionality reduction technique that reveals potential associations in the data.
(Fabrigar et al., 1999) summarise the five basic steps that are needed in order to apply EFA to a dataset:
1. Choose the variables of the dataset.
2. Decide whether EFA is appropriate.
3. Choose the fitting procedure.
4. Determine the number of factors.
5. Decide which rotation is suitable for representing the factors.
Usually the first and the second steps involve techniques to decide whether the data are suitable for EFA. For that purpose, the Stability Coefficient, which indicates how stable a factor structure is relative to the population from which the sample was drawn, can determine whether there are enough observations, and the KMO (Kaiser-Meyer-Olkin) test, which indicates whether the associations between the variables in the correlation matrix can be accounted for by a smaller set of factors, checks the appropriateness of the correlation matrix. The third step involves a decision upon the algorithm used for EFA, and the fourth step involves a number of techniques to determine the best number of factors to represent the data. The criteria for this decision are statistical utility, interpretability, and stability or robustness. Statistical utility formalises the logic of identifying a suitable number of factors, based on the idea that one more factor would not explain substantially more, while one factor fewer could not explain enough.
Scree plots and parallel analysis are the most common techniques used for this purpose. Interpretability makes it crucial to avoid situations where the resulting factors each express little more than a single variable of the data, which is referred to as overfactoring, and situations where factors explain too many variables of the data, which is referred to as underfactoring. Finally, the choice of rotation for representing the factors, the final decision in EFA, is important because it determines whether the resulting factors may be interdependent or not. That will affect the interpretation of the factors in the end.
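Two of the checks discussed above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the number of simulations, the percentile and the synthetic data (six variables driven by two hypothetical factors) are all choices made for this example, and the overall KMO measure is computed from the correlation matrix and the partial correlations derived from its inverse.

```python
import numpy as np

def kmo(data):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    R = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(R)
    # Partial correlations between each pair of variables, given all the others.
    partial = -inv / np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    off = ~np.eye(R.shape[0], dtype=bool)
    r2 = np.sum(R[off] ** 2)
    p2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + p2)

def sorted_eigenvalues(data):
    """Eigenvalues of the correlation matrix, largest first."""
    return np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]

def parallel_analysis(data, n_sims=200, percentile=95, seed=0):
    """Count leading eigenvalues that exceed those obtained from random data."""
    n, k = data.shape
    observed = sorted_eigenvalues(data)
    rng = np.random.default_rng(seed)
    simulated = np.array([
        sorted_eigenvalues(rng.standard_normal((n, k))) for _ in range(n_sims)
    ])
    threshold = np.percentile(simulated, percentile, axis=0)
    exceeds = observed > threshold
    # Index of the first eigenvalue that fails to exceed its threshold.
    n_factors = k if exceeds.all() else int(np.argmin(exceeds))
    return n_factors, observed, threshold

# Hypothetical data: six measured variables driven by two uncorrelated factors.
rng = np.random.default_rng(1)
factors = rng.standard_normal((500, 2))
loadings = np.array([[0.8, 0.0]] * 3 + [[0.0, 0.8]] * 3)
data = factors @ loadings.T + 0.6 * rng.standard_normal((500, 6))

kmo_value = kmo(data)
n_factors, observed, threshold = parallel_analysis(data)
```

Plotting `observed` against its index would give the scree plot, and overlaying `threshold` shows where the observed eigenvalues stop exceeding what random data would produce.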
2.4.4 Towards Behavioural Data
The most interesting aspect of both pre-processing techniques discussed in this section is their ability to represent the original dataset in a new dimensional space where associations and patterns in the data are clearer. Observing the underlying structure in this new space can help to extract behavioural elements hidden in the original data and to form a new behavioural dataset that expresses well-defined patterns of behaviour and enables Data Mining methods to successfully mine behaviours from complex data.