2.6 Supervised Learning for modelling Consumer Indebtedness
3.2.1 Data Preparation
Data transformations are an essential step of every Data Mining procedure because they deal with inconsistencies in the data and prepare it for the later stages of the Knowledge Discovery process, as seen in Fig. 1.1 in Chapter 1. The two pre-processing techniques presented here, Homogeneity Analysis (Homals) and Factor Analysis, serve not only as summarisation and dimensionality reduction techniques; they can also extract behavioural elements hidden in the data and represent them clearly in a new behavioural space.
3.2.2 Homogeneity Analysis
As shown in Chapter 2, Homogeneity Analysis creates a low-dimensional space to represent the similarities among objects and the similarities between categories. Given a dataset of $n$ objects and $m$ categorical variables, it creates a representation of the object scores and the category quantifications (one per level of each categorical variable) in a joint $p$-dimensional space ($p \ll m$), based on the criterion of minimising the departure from homogeneity, which is expressed by the loss function:
$$\sigma(X; Y_1, Y_2, \ldots, Y_m) = \frac{1}{m} \sum_{j=1}^{m} \operatorname{tr}\left[(X - G_j Y_j)^{\top} M_j (X - G_j Y_j)\right] \qquad (3.1)$$
where $X$ is an $(n \times p)$ matrix containing the object scores, $Y_j$ is a $(K_j \times p)$ matrix containing the category quantifications of the $j$th categorical variable, $G_j$ is an $(n \times K_j)$ indicator matrix with ones in the cells where an object has the corresponding category level and zeros elsewhere, $K_j$ is the number of levels of the $j$th categorical variable, $M_j$ is an $(n \times n)$ diagonal matrix holding the row sums of $G_j$, so that a zero on its diagonal marks a missing value, and tr is the trace function, which sums the diagonal elements of a matrix.
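As an illustration, the loss (3.1) can be evaluated numerically as sketched below. This is a minimal sketch, not the implementation of the homals package: the helper names `indicator_matrix` and `homals_loss`, the use of the code -1 for a missing value and the toy data are all assumptions made for the example.

```python
import numpy as np

def indicator_matrix(codes, n_levels):
    """Build the (n x K_j) indicator matrix G_j from integer category codes.

    Assumption for this sketch: a code of -1 marks a missing value and
    produces an all-zero row.
    """
    codes = np.asarray(codes)
    G = np.zeros((codes.size, n_levels))
    observed = codes >= 0
    G[np.flatnonzero(observed), codes[observed]] = 1.0
    return G

def homals_loss(X, G_list, Y_list):
    """Evaluate (3.1): the mean over variables of tr((X - G_j Y_j)' M_j (X - G_j Y_j))."""
    total = 0.0
    for G, Y in zip(G_list, Y_list):
        m_diag = G.sum(axis=1)            # diagonal of M_j: 1 = observed, 0 = missing
        R = X - G @ Y                     # residuals between objects and their categories
        total += np.sum(m_diag[:, None] * R ** 2)   # equals tr(R' M_j R) for diagonal M_j
    return total / len(G_list)

# Toy data: four objects, two variables, one dimension (p = 1)
X = np.array([[-1.0], [-1.0], [1.0], [1.0]])
G1 = indicator_matrix([0, 0, 1, 1], n_levels=2)
G2 = indicator_matrix([0, 1, 1, -1], n_levels=2)   # last object is missing on variable 2
Y1 = np.array([[-1.0], [1.0]])
Y2 = np.array([[-1.0], [0.0]])
loss = homals_loss(X, [G1, G2], [Y1, Y2])
print(loss)
```

In the toy data the first variable is reproduced exactly by its category points, so only the second variable contributes to the loss, and the object with the missing value is ignored for that variable.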
The loss function measures the sum of squared distances between the object points and their corresponding category points, and therefore the departure from homogeneity. Homogeneity describes the optimal state of the representation, in which all object points coincide with their category points. It is characterised by the following definitions of perfection:
1. The $Y_j$ are perfectly homogeneous if $G_1 Y_1 = G_2 Y_2 = \ldots = G_m Y_m$.
2. $X$ is perfectly discriminated if $X = P_1 X = P_2 X = \ldots = P_m X$, where $P_j = G_j D_j^{-1} G_j^{\top}$ is the orthogonal projector onto the column space of $G_j$ and $D_j = G_j^{\top} G_j$ is a $(K_j \times K_j)$ diagonal matrix with the marginal frequencies of the levels of variable $j$ on its diagonal.
3. $X$ and $Y_j$ are perfectly consistent if $X = G_1 Y_1 = G_2 Y_2 = \ldots = G_m Y_m$.
The loss function sits at the heart of Homogeneity Analysis. To avoid the trivial solution $X = 0$ and $Y_j = 0$, it is minimised subject to the following constraints:
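$$u^{\top} M_{\bullet} X = 0 \qquad (3.2)$$
$$X^{\top} M_{\bullet} X = I \qquad (3.3)$$
where $u$ is a vector of ones and $M_{\bullet}$ is the average of the matrices $M_j$. The first constraint centres the object scores and the second normalises them, which makes the columns of the object score matrix orthogonal.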
To achieve maximum homogeneity, Homals uses the Alternating Least Squares (ALS) algorithm, which minimises the loss function by alternating between updates of the $Y_j$ and updates of $X$. It is an iterative algorithm in which, after a random initialisation, each iteration consists of three steps (steps 2 to 4 below):
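1. Initialise the object scores $X$ with random values.
2. Update the category quantifications: $Y_j = D_j^{-1} G_j^{\top} X$, where $D_j = G_j^{\top} G_j$ is a $(K_j \times K_j)$ diagonal matrix containing the marginal frequencies of the categories of variable $j$ on its diagonal.
3. Update the object scores: $\tilde{X} = M_{\star}^{-1} \sum_{j=1}^{m} G_j Y_j$, where $M_{\star} = \sum_{j=1}^{m} M_j$ is the sum of the matrices $M_j$.
4. Normalise the object scores $\tilde{X}$ to obtain the new $X$.
5. Repeat steps 2 to 4 until convergence.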
The second step of the algorithm expresses the first centroid principle, according to which a category quantification lies at the centroid of the object scores that belong to it. The third step shows that an object score is the average of the quantifications of the categories it belongs to. Hence, this solution guarantees that the resulting representation places objects close to the categories they fall in and categories close to the objects belonging to them. The fourth step ensures that the normalisation constraints (3.2) and (3.3) are satisfied.
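The ALS steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of the homals package: it assumes no missing values (so every $M_j$ is the identity matrix), it assumes every category is observed at least once, and it uses a QR decomposition for the normalisation step. The function names `normalise_scores` and `homals_als` are hypothetical.

```python
import numpy as np

def normalise_scores(X):
    """Step 4: centre the columns and orthonormalise them so that u'X = 0 and X'X = I.

    Assumption: no missing data, so the average of the M_j is the identity.
    """
    X = X - X.mean(axis=0)
    Q, _ = np.linalg.qr(X)      # columns of Q span the same (centred) space as X
    return Q

def homals_als(G_list, p=2, max_iter=200, tol=1e-8, seed=0):
    """Alternate between category quantifications and object scores until the loss settles."""
    rng = np.random.default_rng(seed)
    n, m = G_list[0].shape[0], len(G_list)
    X = normalise_scores(rng.standard_normal((n, p)))        # step 1: random start
    prev_loss = np.inf
    for _ in range(max_iter):
        # Step 2: each category quantification is the centroid of its objects
        Y_list = [(G.T @ X) / G.sum(axis=0)[:, None] for G in G_list]
        loss = sum(np.sum((X - G @ Y) ** 2) for G, Y in zip(G_list, Y_list)) / m
        if abs(prev_loss - loss) < tol:                       # step 5: convergence check
            break
        prev_loss = loss
        # Step 3: each object score is the average of its category quantifications
        X = sum(G @ Y for G, Y in zip(G_list, Y_list)) / m
        X = normalise_scores(X)                               # step 4
    return X, Y_list, loss

# Toy example: three identical variables that split six objects into two groups
G = np.eye(2)[[0, 0, 0, 1, 1, 1]]
scores, quantifications, final_loss = homals_als([G, G, G], p=1)
print(scores.ravel(), final_loss)
```

In the toy example the three variables agree perfectly, so the object scores of each group collapse onto a single point and the loss approaches zero, which corresponds to the state of perfect homogeneity.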
The resulting representation has several properties that are useful for interpretation:
• The distance between two objects indicates their similarity.
• A variable discriminates better if its categories are further apart.
• Objects with identical profiles will receive identical object scores.
• Category points with high marginal frequencies will tend to locate closer to the origin (centre of the representation).
• Objects with a profile similar to the average profile will tend to locate closer to the origin.
• Categories of binary variables are placed on a straight line through the origin and their distance is determined by the marginal frequency.
• The category quantifications of each variable have a weighted sum over categories equal to zero.
Figure 3.1: An object plot of senators based on their votes on twenty issues
Fig. 3.1 shows an example of the resulting representation when Homogeneity Analysis is applied to a dataset containing the votes of senators on twenty different topics. The example is analysed in detail in De Leeuw and Mair (2009); here it demonstrates the ability of Homals to place similar objects together. The objects in this case are U.S. senators, and the plot splits them into two groups: the left part represents the Republican senators and the right part the Democratic senators. Assuming that senators of the same party have similar voting patterns, Homogeneity Analysis identifies these patterns and represents them in a clear and interpretable way.
This summarises the functionality of Homals under its strict definitions. Possible extensions of the described functionality to numerical and ordinal variables are described in De Leeuw and Mair (2007).
In the R environment, Homals is implemented in the “homals” package (De Leeuw and Mair, 2007), which is available from the CRAN repository: https://cran.r-project.org/web/packages/homals/index.html
Figure 3.2: A diagram of Factor Analysis model
3.2.3 Exploratory Factor Analysis
Factor Analysis is a generic term for a family of statistical techniques concerned with describing a set of observable variables in terms of a small number of latent factors. The latent factors exert linear influences on the observed variables and thereby determine their values. As a result, each observed variable can be expressed as a weighted composite of a set of latent variables. This can be seen in Fig. 3.2, which depicts a factor model. There, F1 and F2 are two common factors, Y1, Y2, Y3, Y4 and Y5 are observed variables, and e1, e2, e3, e4 and e5 represent residuals or unique factors, which are assumed to be uncorrelated with each other. According to this model, every variable is a linear combination of the common factors and a unique factor.
To find the latent factors that form the underlying structure of the data, Factor Analysis assigns values to the factor loadings such that their linear combination approximates the correlation matrix of the measured variables. More formally, this is expressed by the following equation:
$$P = \Lambda \Phi \Lambda^{\top} + D_{\psi} \qquad (3.4)$$
where $P$ is the correlation matrix of the observed variables, $\Lambda$ is the factor loading matrix, $\Phi$ is the correlation matrix of the common factors, which equals the identity matrix $I$ when the factors are assumed to be orthogonal, as is usually the case, and $D_{\psi}$ is the covariance matrix of the unique factors, which is diagonal, with the variances of the unique factors on its diagonal, when the unique factors are assumed to be uncorrelated.
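A small numerical example may make equation (3.4) more concrete. The sketch below builds the correlation matrix implied by a set of loadings; the loading values are hypothetical and chosen only to mimic the two-factor, five-variable layout of Fig. 3.2, and the function name `implied_correlation` is an assumption of this example.

```python
import numpy as np

def implied_correlation(Lambda, Phi=None):
    """Return P = Lambda Phi Lambda' + D_psi, with D_psi chosen so that diag(P) = 1."""
    Lambda = np.asarray(Lambda, dtype=float)
    if Phi is None:
        Phi = np.eye(Lambda.shape[1])        # orthogonal common factors
    common = Lambda @ Phi @ Lambda.T         # part of P explained by the common factors
    communalities = np.diag(common)
    D_psi = np.diag(1.0 - communalities)     # unique variances on the diagonal
    return common + D_psi

# Hypothetical loadings: Y1-Y3 load mainly on F1, Y4-Y5 mainly on F2
Lambda = np.array([
    [0.8, 0.0],
    [0.7, 0.1],
    [0.6, 0.3],
    [0.1, 0.7],
    [0.0, 0.8],
])
P = implied_correlation(Lambda)
print(np.round(P, 2))
```

Variables that load on the same factor receive a high implied correlation, whereas variables that share no factor (here the first and the last) are implied to be uncorrelated.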
The most common algorithm for Factor Analysis estimates the reduced correlation matrix $P - D_{\psi}$ iteratively, minimising the residual sum of squared differences. The resulting factors are ordered by the proportion of variance they explain, which guarantees the uniqueness of the solution.
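One way to realise such an iterative estimation is principal axis factoring, sketched below. The text does not name the algorithm, so treating it as principal axis factoring is an assumption of this example, as are the function name `principal_axis_factoring`, the starting values (squared multiple correlations) and the stopping rule.

```python
import numpy as np

def principal_axis_factoring(R, n_factors, max_iter=500, tol=1e-8):
    """Iteratively estimate loadings from the reduced correlation matrix R - D_psi."""
    R = np.asarray(R, dtype=float)
    # Starting communalities: squared multiple correlations (an assumption of this sketch)
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(max_iter):
        reduced = R.copy()
        np.fill_diagonal(reduced, h2)               # reduced correlation matrix
        eigvals, eigvecs = np.linalg.eigh(reduced)  # eigenvalues in ascending order
        top = np.argsort(eigvals)[::-1][:n_factors] # factors ordered by explained variance
        lam = np.clip(eigvals[top], 0.0, None)      # guard against small negative values
        loadings = eigvecs[:, top] * np.sqrt(lam)
        new_h2 = np.sum(loadings ** 2, axis=1)      # updated communalities
        converged = np.max(np.abs(new_h2 - h2)) < tol
        h2 = new_h2
        if converged:
            break
    return loadings, 1.0 - h2                       # loadings and unique variances

# Correlation matrix generated by a single hypothetical factor
true_loadings = np.array([0.8, 0.7, 0.6, 0.5])
R = np.outer(true_loadings, true_loadings)
np.fill_diagonal(R, 1.0)
est_loadings, uniquenesses = principal_axis_factoring(R, n_factors=1)
print(np.round(np.abs(est_loadings[:, 0]), 3))
```

The sign of each estimated factor is arbitrary, so the example prints absolute values. Because the eigenvalues are sorted in descending order, the first column of the loading matrix always belongs to the factor that explains the most variance.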
In Exploratory Factor Analysis (EFA), which makes no a priori assumptions about the relationships among factors, it is important to determine the number of factors that best describes the relationships in the data. Two of the most common techniques for finding the number of factors in applied EFA are the scree test and parallel analysis (Fabrigar et al., 1999). In the scree test, the eigenvalues of the correlation matrix of the variables are plotted in descending order, and the number of factors is chosen as the number of eigenvalues that precede the last substantial drop in the graph. In parallel analysis, the eigenvalues of the same correlation matrix are compared with the eigenvalues of the correlation matrices of randomly generated datasets of the same size as the original. The number of factors is chosen as the number of eigenvalues that are larger than the corresponding eigenvalues of the random data.
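Parallel analysis as described above can be sketched in a few lines. The sketch makes several assumptions that the text does not fix: the random datasets are drawn from a standard normal distribution, the comparison uses the 95th percentile of the random eigenvalues rather than their mean, and counting stops at the first eigenvalue that fails the comparison. The function name `parallel_analysis` and the simulated data are hypothetical.

```python
import numpy as np

def sorted_eigenvalues(data):
    """Eigenvalues of the correlation matrix of the columns, in descending order."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

def parallel_analysis(data, n_sims=100, percentile=95, seed=0):
    """Count the leading eigenvalues that exceed those of random data of the same size."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    observed = sorted_eigenvalues(data)
    random_eigs = np.empty((n_sims, k))
    for s in range(n_sims):
        random_eigs[s] = sorted_eigenvalues(rng.standard_normal((n, k)))
    threshold = np.percentile(random_eigs, percentile, axis=0)
    exceeds = observed > threshold
    # Number of leading eigenvalues above the threshold (index of the first failure)
    n_factors = k if exceeds.all() else int(np.argmin(exceeds))
    return n_factors, observed, threshold

# Simulated data: six variables driven by two independent hypothetical factors
rng = np.random.default_rng(1)
factors = rng.standard_normal((500, 2))
loadings = np.zeros((6, 2))
loadings[:3, 0] = 0.8
loadings[3:, 1] = 0.8
data = factors @ loadings.T + 0.6 * rng.standard_normal((500, 6))
n_factors, observed, threshold = parallel_analysis(data, n_sims=50)
print(n_factors)
```

The `observed` array returned by the function contains the same eigenvalues that a scree plot would display, so it can be plotted against its index to apply the scree test to the same data.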