Statistical Methods in Data Mining
Conceição Amado
Instituto Superior Técnico
Lisboa
Dimensionality Reduction
Summary
1 Dimensionality Reduction
Dimensionality Reduction
The process of reducing the number of random variables under consideration
Mapping input data instances x to real vectors z with a smaller number of dimensions
Some techniques:
Principal Components Analysis (linear)
Independent Components Analysis (linear or nonlinear)
Self-Organizing Maps (nonlinear)
Multi-Dimensional Scaling (nonlinear; allows non-numeric data objects)
Isomap (nonlinear)
Why reduce dimensionality?
Reduces time complexity: less computation
Reduces space complexity: fewer parameters
Simpler models are more robust on small datasets: less over-fitting
More interpretable; simpler explanations
Data visualization (structure, groups, outliers) easier if plotted in 2 or 3 dimensions
PCA - Applications
Interpretation (study structure)
Create a new set of variables (a smaller number that are uncorrelated). These can be used in other procedures (e.g., multiple regression).
Select a sub-set of the original variables to be used in other multivariate procedures.
Detect outliers or clusters of observations.
Check the multivariate normality assumption (before assuming multivariate normality and analysing data with procedures that rely on it)
PCA - Introduction
Principal Component Analysis (PCA): Initially proposed by Pearson (1901), under a different name. In 1933, Hotelling independently obtained (and named) the same results
Only in the 1960s did PCA start to be studied and explored in detail, in parallel with the development of computers
Concern: Explain the associations among a set of variables through linear combinations of these variables
Aims:
(i) Data reduction
(ii) Interpretation
Frequently used as an intermediate step
Population Principal Components
In population principal components, we consider Σ, and the principal components (PCs) are derived from Σ
Two approaches
Geometry of PCA: p - space
Axis Rotation & Best Fit Line
Further Notes regarding PCA
The Algebra of Population PCA
Maximizing the Criteria
Population PC: Result 1
Population PC: Result 2
Proportion of Variance Accounted For
Correlation between $Y_i$ and $X_k$
Population Principal Components
Properties:
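1 $E(Y_i) = \gamma_i^T E(X) = \gamma_i^T \mu$
2 $\mathrm{Var}(Y_i) = \gamma_i^T \Sigma \gamma_i = \lambda_i$
3 $\mathrm{Cov}(Y_i, Y_j) = 0, \ \forall i \neq j$
4 $\mathrm{Var}(Y_1) \geq \mathrm{Var}(Y_2) \geq \dots \geq \mathrm{Var}(Y_p) \geq 0$
5 $\mathrm{Var}(Y) = \Lambda = \mathrm{Diag}(\lambda_1, \dots, \lambda_p)$, where $Y = (Y_1, \dots, Y_p)^T$
6 $\mathrm{tr}(\Sigma) = \sum_{i=1}^{p} \mathrm{Var}(X_i) = \mathrm{tr}(\Lambda) = \sum_{i=1}^{p} \lambda_i$
7 $\det(\Sigma) = \prod_{i=1}^{p} \lambda_i = \det(\Lambda)$
8 $\mathrm{Cov}(X_i, Y_j) = \lambda_j \gamma_{ij}$
9 $\mathrm{Cov}(X, Y) = \Gamma \Lambda$, where $\Gamma = (\gamma_1, \dots, \gamma_p)$ is a (p × p) matrix
10 $\mathrm{Cor}(X_i, Y_j) = \gamma_{ij} \sqrt{\lambda_j} / \sqrt{\sigma_{ii}}$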
Comment: Properties 5. and 6. are the theoretical support for the use of PCA as a data reduction technique
Population Principal Components
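In fact, since $\mathrm{tr}(\Sigma) = \sum_{i=1}^{p} \mathrm{Var}(X_i) = \mathrm{tr}(\Lambda) = \sum_{i=1}^{p} \lambda_i$, the ratio $\lambda_j / \sum_{i=1}^{p} \lambda_i$ is the proportion of population total variance due to the j-th principal component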
Population Principal Components: an example
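Exercise: Given the (symmetric) covariance matrix:
$\Sigma = \begin{pmatrix} 8 & 0 & 1 \\ 0 & 8 & 3 \\ 1 & 3 & 5 \end{pmatrix}$,
1 Compute the eigenvalues $\lambda_1$, $\lambda_2$, and $\lambda_3$ of Σ, and the eigenvectors $\gamma_1$, $\gamma_2$, and $\gamma_3$ of Σ.
Hint: You may use R to compute the eigenvalues and eigenvectors of Σ.
2 Write the population PCs
3 Verify that $\mathrm{Var}(Y_i) = \lambda_i$ and $\mathrm{Cov}(Y_i, Y_j) = 0, \ \forall i \neq j$
4 Show that $\lambda_1 + \lambda_2 + \lambda_3 = \mathrm{tr}(\Sigma)$, where the trace of a matrix equals the sum of its diagonal elements
5 Show that $\lambda_1 \lambda_2 \lambda_3 = \det(\Sigma)$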
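Population Principal Components: an example
Solution:
1 The covariance matrix
$\Sigma = \begin{pmatrix} 8 & 0 & 1 \\ 0 & 8 & 3 \\ 1 & 3 & 5 \end{pmatrix}$
has the following eigenvalues and eigenvectors:
$\Lambda = \mathrm{Diag}(10, 8, 3)$ and $\Gamma = \begin{pmatrix} -0.267 & 0.949 & 0.169 \\ -0.802 & -0.316 & 0.507 \\ -0.535 & 0.000 & -0.845 \end{pmatrix}$
2 The population PCs are:
$Y_1 = -0.267 X_1 - 0.802 X_2 - 0.535 X_3$
$Y_2 = 0.949 X_1 - 0.316 X_2$
$Y_3 = 0.169 X_1 + 0.507 X_2 - 0.845 X_3$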
3. Verify that $\mathrm{Var}(Y_i) = \lambda_i$ and $\mathrm{Cov}(Y_i, Y_j) = 0, \ \forall i \neq j$ (Homework)
4. The trace of the covariance matrix is $\mathrm{tr}(\Sigma) = 8 + 8 + 5 = 21$, and the sum of the eigenvalues is $10 + 8 + 3 = 21$
5. The determinant is $|\Sigma| = 240 = 10 \times 8 \times 3$
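Following the hint in the exercise, the numbers in this solution can be checked in R. The snippet below is a minimal sketch using base R's `eigen()`; the variable names (`Sigma`, `lambda`, `Gamma`, `var_Y`) are illustrative choices, not part of the slides. Eigenvector signs returned by `eigen()` are arbitrary, so they may differ from the signs printed above.

```r
# Covariance matrix from the exercise (symmetric)
Sigma <- matrix(c(8, 0, 1,
                  0, 8, 3,
                  1, 3, 5), nrow = 3, byrow = TRUE)

eig    <- eigen(Sigma, symmetric = TRUE)  # eigenvalues are returned in decreasing order
lambda <- eig$values
Gamma  <- eig$vectors                     # columns are gamma_1, gamma_2, gamma_3

# Var(Y) = Gamma' Sigma Gamma: variances on the diagonal, covariances off the diagonal
var_Y <- t(Gamma) %*% Sigma %*% Gamma

trace_Sigma <- sum(diag(Sigma))
det_Sigma   <- det(Sigma)

print(round(lambda, 3))
print(round(Gamma, 3))
print(round(var_Y, 3))
print(c(trace = trace_Sigma, sum_lambda = sum(lambda)))
print(c(det = det_Sigma, prod_lambda = prod(lambda)))
```

The off-diagonal entries of `var_Y` should vanish up to rounding error, which is the numerical counterpart of item 3.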
Population Principal Components: an example
For the example:
6. Calculate the proportion of the total variance explained by each principal component
7. Obtain the correlation between each principal component and each observed variable
Solution:
6. Given the eigenvalues of Σ, $(10, 8, 3)^T$:

   Quantity                                                      PC1             PC2            PC3
   $\lambda_j$                                                   10              8              3
   $\lambda_j / \sum_{i=1}^{p} \lambda_i$                        10/21 ≈ 0.476   8/21 ≈ 0.381   3/21 ≈ 0.143
   $\sum_{j=1}^{k} \lambda_j / \sum_{i=1}^{p} \lambda_i$         0.476           0.857          1.000

7. Obtain the correlation between each principal component and each observed variable (Homework)
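One possible way to check a hand computation of items 6 and 7 is sketched below in R. It assumes the formula of property 10, $\mathrm{Cor}(X_i, Y_j) = \gamma_{ij}\sqrt{\lambda_j}/\sqrt{\sigma_{ii}}$; the names `prop`, `cum_prop` and `cor_XY` are illustrative.

```r
Sigma <- matrix(c(8, 0, 1,
                  0, 8, 3,
                  1, 3, 5), nrow = 3, byrow = TRUE)
eig    <- eigen(Sigma, symmetric = TRUE)
lambda <- eig$values
Gamma  <- eig$vectors

prop     <- lambda / sum(lambda)  # proportion of total variance per PC
cum_prop <- cumsum(prop)          # cumulative proportion

# Property 10: rows index the variables X_i, columns index the components Y_j.
# Dividing a matrix by a vector of length nrow recycles down the columns,
# so row i is divided by sqrt(sigma_ii).
cor_XY <- (Gamma %*% diag(sqrt(lambda))) / sqrt(diag(Sigma))

print(round(rbind(prop, cum_prop), 3))
print(round(cor_XY, 3))
```

Each row of `cor_XY^2` sums to one, because the p components together account for all of the variance of each $X_i$. The signs of the correlations depend on the arbitrary signs of the eigenvectors.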
When Variances are Very Different
PCs of Standardized Variables
PCs of Standardized versus non-Std Variables
PC obtained from Standardized Variables
Properties (Standardized Variables):
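1 $E(Y_i) = \gamma_i^T E(Z) = 0$
2 $\mathrm{Var}(Y_i) = \gamma_i^T R \gamma_i = \lambda_i$
3 $\mathrm{Cov}(Y_i, Y_j) = 0, \ \forall i \neq j$
4 $\mathrm{Var}(Y_1) \geq \mathrm{Var}(Y_2) \geq \dots \geq \mathrm{Var}(Y_p) \geq 0$
5 $\mathrm{Var}(Y) = \Lambda = \mathrm{Diag}(\lambda_1, \dots, \lambda_p)$, where $Y = (Y_1, \dots, Y_p)^T$
6 $\mathrm{tr}(R) = p = \mathrm{tr}(\Lambda) = \sum_{i=1}^{p} \lambda_i$. Thus, $\lambda_j / p$ is the proportion of population total variance due to the j-th principal component
7 $\det(R) = \prod_{i=1}^{p} \lambda_i = \det(\Lambda)$
8 $\mathrm{Cov}(Z_i, Y_j) = \lambda_j \gamma_{ij}$
9 $\mathrm{Cov}(Z, Y) = \Gamma \Lambda$
10 $\mathrm{Cor}(Z_i, Y_j) = \gamma_{ij} \sqrt{\lambda_j}$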
2.3 PC obtained from Standardized Variables
Exercise (cont.): Given the covariance matrix: (Homework)
$\Sigma = \begin{pmatrix} 8 & 0 & 1 \\ 0 & 8 & 3 \\ 1 & 3 & 5 \end{pmatrix}$
1 Convert the covariance matrix into a correlation matrix, R
2 Compute the eigenvalues $\lambda_1$, $\lambda_2$, and $\lambda_3$ of R, and the eigenvectors $\gamma_1$, $\gamma_2$, and $\gamma_3$ of R
Hint: You may use R to compute the eigenvalues and eigenvectors of R.
3 Write the population PCs obtained from the standardized variables
4 Calculate the proportion of the total variance explained by each principal component obtained from the standardized variables.
5 Obtain the correlation between each principal component and each standardized observed variable
6 Compare the PCs obtained from the original and standardized variables. Are they the same? Which approach would you recommend?
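As a starting point for this homework, the sketch below shows how the computation could be set up in R. It relies on base R's `cov2cor()` and `eigen()`; the names `R_mat`, `lambda_R`, `prop_R` and `cor_ZY` are illustrative, and the interpretation in item 6 is left to the reader.

```r
Sigma <- matrix(c(8, 0, 1,
                  0, 8, 3,
                  1, 3, 5), nrow = 3, byrow = TRUE)

R_mat <- cov2cor(Sigma)                # correlation matrix: r_ij = sigma_ij / sqrt(sigma_ii * sigma_jj)
eig_R <- eigen(R_mat, symmetric = TRUE)
lambda_R <- eig_R$values
Gamma_R  <- eig_R$vectors              # columns are the eigenvectors of R (signs are arbitrary)

prop_R <- lambda_R / ncol(R_mat)       # tr(R) = p, so the proportion is lambda_j / p

# Property 10 for standardized variables: Cor(Z_i, Y_j) = gamma_ij * sqrt(lambda_j)
cor_ZY <- Gamma_R %*% diag(sqrt(lambda_R))

print(round(R_mat, 3))
print(round(lambda_R, 3))
print(round(prop_R, 3))
print(round(cor_ZY, 3))
```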
Additional PC Properties
PCs are not scale invariant: this is a limitation of PCA that has to be taken into account when modelling data
Variables with high variability tend to be more important in the definition of the first PC, which can be misleading
Example:
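$\Sigma = \begin{pmatrix} 8 & 0 & 1 \\ 0 & 8 & 3 \\ 1 & 3 & 5 \end{pmatrix}$
Coefficients of the first eigenvector $\gamma_1$ when the third variable is $X_3$, $2X_3$, $3X_3$ or $10X_3$:

                  X3      2X3     3X3     10X3
   γ11            0.267   0.125   0.074   0.020
   γ21            0.802   0.375   0.223   0.061
   γ31            0.535   0.919   0.972   0.998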
The traditional solution is to use standardized variables
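The lack of scale invariance in this example can be reproduced with a few lines of R. This is a minimal sketch: rescaling $X_3$ by a factor s turns Σ into $D \Sigma D$ with $D = \mathrm{Diag}(1, 1, s)$, and the helper `first_pc()` is a hypothetical name introduced here for illustration. Magnitudes are reported because eigenvector signs are arbitrary.

```r
Sigma <- matrix(c(8, 0, 1,
                  0, 8, 3,
                  1, 3, 5), nrow = 3, byrow = TRUE)

# First eigenvector (in absolute value) after replacing X3 by s * X3
first_pc <- function(s) {
  D <- diag(c(1, 1, s))
  Sigma_s <- D %*% Sigma %*% D          # covariance matrix of (X1, X2, s * X3)
  abs(eigen(Sigma_s, symmetric = TRUE)$vectors[, 1])
}

scales <- c(1, 2, 3, 10)
gamma1 <- sapply(scales, first_pc)      # one column per scaling factor
colnames(gamma1) <- paste0(scales, "*X3")
print(round(gamma1, 3))
```

As the scaling factor grows, the third coefficient approaches one and the first PC becomes essentially the rescaled variable itself.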
Sample Principal Components
Algebra of Sample PC
Geometry of Sample PC
Geometry of Sample continued
Sample Principal Components
Let $x_l$ represent the observed values on the l-th object. Then $\hat{y}_{li} = \hat{\gamma}_i^T x_l$ is called the score of the l-th object on the i-th PC
Sample PCs obtained from standardized variables follow a similar reasoning
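The definition of the scores can be illustrated in R as follows. This is a sketch under stated assumptions: the built-in `USArrests` data set is used purely as a stand-in (it does not appear in the slides), and the sample PCs are taken from the sample covariance matrix S. Note that `prcomp()` centers the data by default, so its scores correspond to $\hat{\gamma}_i^T (x_l - \bar{x})$ rather than to the uncentered definition above.

```r
X <- as.matrix(USArrests)               # n x p data matrix (illustrative data set)
S <- cov(X)                             # sample covariance matrix
eig <- eigen(S, symmetric = TRUE)
Gamma_hat <- eig$vectors                # columns are the sample eigenvectors gamma_hat_i

# Scores as defined above: entry (l, i) is gamma_hat_i' x_l
scores <- X %*% Gamma_hat

# Centered scores, comparable with prcomp() up to the sign of each column
scores_centered <- scale(X, center = TRUE, scale = FALSE) %*% Gamma_hat
pca <- prcomp(X)

print(round(head(scores_centered, 3), 2))
print(round(head(pca$x, 3), 2))
```

For PCs of standardized variables, one would use `cor(X)` in place of `cov(X)`, or `prcomp(X, scale. = TRUE)`.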
Example: Swiss Bank Notes
Swiss Bank Notes: variables
Swiss Bank Notes: sample statistics
Eigenvalues of S
Eigenvectors of Genuine Bank notes
Correlation between measures and PCs
The Latter PCs
PCA as a Preliminary to Other Analysis
Number of PC to retain
Question: How many PC should be retained?
Answer: There is no universally accepted method. Here are some commonly used ones:
Choose k such that the ratio
$\sum_{i=1}^{k} \lambda_i \, / \, \sum_{j=1}^{p} \lambda_j$
is high (e.g. higher than 80%). Problem: deciding the threshold
Choose k such that
$\lambda_i \geq \bar{\lambda}, \ i = 1, \dots, k$, where $\bar{\lambda} = \sum_{i=1}^{p} \lambda_i / p$
Comment: When working with standardized variables $\bar{\lambda} = 1$, since $\sum_{i=1}^{p} \lambda_i = p$
Draw a scree plot, i.e. plot $\lambda_i$ versus i, and look for an elbow in the plot
Use a statistical test. The procedure is then no longer distribution-free
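The first three criteria are easy to compute once the eigenvalues are available. The sketch below applies them to the eigenvalues (10, 8, 3) of the earlier population example; the 80% threshold is the illustrative value mentioned above, and the names `k_ratio` and `k_mean` are hypothetical.

```r
lambda <- c(10, 8, 3)                       # eigenvalues from the earlier example

# Criterion 1: smallest k whose cumulative proportion reaches the threshold
cum_prop <- cumsum(lambda) / sum(lambda)
k_ratio  <- which(cum_prop >= 0.80)[1]

# Criterion 2: keep the components whose eigenvalue is at least the mean eigenvalue
k_mean <- sum(lambda >= mean(lambda))

# Criterion 3: scree plot, to be inspected visually for an elbow
plot(seq_along(lambda), lambda, type = "b",
     xlab = "Component i", ylab = "Eigenvalue", main = "Scree plot")

print(c(k_ratio = k_ratio, k_mean = k_mean))
```

With only three eigenvalues the scree plot is not very informative; it is included to show the mechanics.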
PC Interpretation
The first decisions:
Standardize or not to standardize?
How many PC to retain?
Only afterwards does the problem of interpreting the PCs arise. There are two main strategies:
Loadings: For a given PC, select the loadings with the largest magnitude (in absolute value)
If all the selected coefficients have the same sign, then the PC can be interpreted as a weighted sum (or "index") of the selected original variables
Otherwise, it can be understood as a contrast between the selected variables with positive weights and those with negative weights
Correlations between PCs and $X_i$: Proceed as with the loadings, but using the sample correlations between the PC and the original variables
Problem: The two criteria may lead to different sets of selected variables and so to different interpretations
Validation: Cadima and Jolliffe (1995) suggested a way to validate the subset of original variables selected to interpret the PC.
Sample PC - Example
2013 National Records: R Example
Multidimensional Scaling (MDS)
Underlying data set of n points may be unknown:
Might not know coordinate values of points
Might not even know dimensionality of data!
Given instead a dissimilarity measure $d_{ij}$ between pairs of points
Might not know how distances were calculated (Euclidean, city block, ...)
Multidimensional Scaling (MDS)
Example application domains:
Data visualization
Marketing: Organize consumer products as points in a space according to perceived differences between them
Quantitative social sciences: e.g., mapping the political bent of parliamentary ministers based on voting dissimilarities
Psychology
Multidimensional Scaling (MDS)
MDS is a set of data analysis methods that allow one to infer the dimensions of the perceptual space of subjects.
The raw data entering an MDS analysis are typically a measure of the global similarity or dissimilarity of the objects under investigation
The primary outcome of an MDS analysis is a spatial configuration in which the objects are represented as points
The points in this spatial representation are arranged so that their distances correspond to the similarities of the objects: similar objects are represented by points that are close to each other, dissimilar objects by points that are far apart
How does MDS work?
The goal of an MDS analysis is to find a spatial configuration of objects when all that is known is some measure of their general (dis)similarity
The spatial configuration should provide some insight into how the objects are arranged in a space whose number of dimensions is potentially unknown
MDS methods include:
Classical MDS
Metric MDS
Non-metric MDS
Classical MDS
Consider the problem of the cities: looking at a map showing a number of cities, one is interested in the distances between them.
These distances are easily obtained by measuring them with a ruler.
Apart from that, a mathematical solution is available: knowing the coordinates x and y, the Euclidean distance between two cities a and b is defined by:
$d_{ab} = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}$
Now consider the inverse problem: having only the distances, is it possible to obtain the map?
Classical MDS
Classical MDS, which was first introduced by Torgerson (1952), addresses this problem. It assumes the distances to be Euclidean.
Euclidean distances are usually the first choice for an MDS space
There exist, however, a number of non-Euclidean distance measures, which are limited to very specific research questions (cf. Borg & Groenen, 1997)
In many applications of MDS the data are not distances as measured from a map, but rather proximity data
Classical MDS
When applying classical MDS to proximities it is assumed that the proximities behave like real measured distances
This might hold, e.g., for data derived from correlation matrices, but rarely for direct dissimilarity ratings.
The advantage of classical MDS is that it provides an analytical solution, requiring no iterative procedures.
classical Multidimensional Scaling - theory
Suppose for now we have a Euclidean distance matrix $D = (d_{rs})$
The objective of classical Multidimensional Scaling (cMDS) is to find $X = [x_1, \dots, x_n]$, $x_r \in \mathbb{R}^p$, so that $\|x_r - x_s\| = d_{rs}$, where $\|\cdot\|$ is a vector norm. In classical MDS, this norm is the Euclidean norm.
Such a solution is not unique, because if X is a solution, then $X^* = X + c$, $c \in \mathbb{R}^p$, also satisfies
$\|x_r^* - x_s^*\| = \|(x_r + c) - (x_s + c)\| = \|x_r - x_s\| = d_{rs}$.
Any location c can be used, but the assumption of a centered configuration, i.e.
$\sum_{r=1}^{n} x_{rk} = 0$, for all k,
serves well for the purpose of dimension reduction.
classical Multidimensional Scaling - theory
In short, cMDS finds the centered configuration $x_1, \dots, x_n$ in $\mathbb{R}^p$, for some $p \leq n - 1$, whose pairwise distances are the same as the corresponding distances in D.
We may find the n × n Gram matrix $B = X^T X$, rather than X.
The Gram matrix is the inner product matrix, since X is assumed to be centered or, in general, standardized¹.
¹ It is worth mentioning that centering the variables has the geometric effect of moving the origin of the space (0, 0) to the centroid of the points defined by the means of the variables, but it does not affect the distances in any way. In addition, in MDS analysis, regardless of the type of MDS model used, the configuration is typically centered and standardized, which means that the sum of the coordinates of each dimension is zero and the variance is one.
classical Multidimensional Scaling - theory
Suppose for now we have a Euclidean distance matrix $D = (d_{rs})$
Step 1: Express D in terms of B
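The equation for this step did not survive in the text. A minimal sketch of the standard argument, assuming the Gram matrix entries are $b_{rs} = x_r^T x_s$ (as implied by $B = X^T X$), is:

```latex
\begin{align*}
d_{rs}^2 &= \|x_r - x_s\|^2 = (x_r - x_s)^T (x_r - x_s) \\
         &= x_r^T x_r + x_s^T x_s - 2\, x_r^T x_s \\
         &= b_{rr} + b_{ss} - 2\, b_{rs}.
\end{align*}
```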
classical Multidimensional Scaling - theory
Proof: Sum of a row r in B is zero
classical Multidimensional Scaling - theory
Proof: Sum of a column s in B is zero
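The two proofs are not preserved in the text. A short sketch, assuming $b_{rs} = x_r^T x_s$ and the centering condition $\sum_{r=1}^{n} x_r = 0$ stated earlier, is given below; since B is symmetric, the row and column cases are the same argument.

```latex
\begin{align*}
\sum_{s=1}^{n} b_{rs} &= \sum_{s=1}^{n} x_r^T x_s = x_r^T \Big( \sum_{s=1}^{n} x_s \Big) = 0, \\
\sum_{r=1}^{n} b_{rs} &= \sum_{r=1}^{n} x_r^T x_s = \Big( \sum_{r=1}^{n} x_r \Big)^T x_s = 0.
\end{align*}
```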
classical Multidimensional Scaling - theory
Let us sum the squared distances over r:
classical Multidimensional Scaling - theory
Let us sum the squared distances over s:
classical Multidimensional Scaling - theory
Let us sum the squared distances over r and s:
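The three sums themselves are missing from the text. Assuming the identity $d_{rs}^2 = b_{rr} + b_{ss} - 2 b_{rs}$ (obtained by expanding $\|x_r - x_s\|^2$ with $b_{rs} = x_r^T x_s$) and the zero row and column sums of B, they can be sketched as follows, writing $\mathrm{tr}(B) = \sum_{r} b_{rr}$:

```latex
\begin{align*}
\sum_{r=1}^{n} d_{rs}^2 &= \sum_{r=1}^{n} \left( b_{rr} + b_{ss} - 2 b_{rs} \right) = \mathrm{tr}(B) + n\, b_{ss}, \\
\sum_{s=1}^{n} d_{rs}^2 &= n\, b_{rr} + \mathrm{tr}(B), \\
\sum_{r=1}^{n} \sum_{s=1}^{n} d_{rs}^2 &= 2 n\, \mathrm{tr}(B).
\end{align*}
```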
classical Multidimensional Scaling - theory
Express drs in terms of brs:
classical Multidimensional Scaling - theory
Computing B
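The formulas for this step are not preserved. A sketch of the usual result follows, assuming $d_{rs}^2 = b_{rr} + b_{ss} - 2 b_{rs}$ and a centered configuration, so that $\sum_{s} d_{rs}^2 = n b_{rr} + \mathrm{tr}(B)$, $\sum_{r} d_{rs}^2 = \mathrm{tr}(B) + n b_{ss}$ and $\sum_{r}\sum_{s} d_{rs}^2 = 2n\,\mathrm{tr}(B)$. The symbols $J$ (centering matrix) and $D^{(2)}$ (matrix of squared distances) are notation introduced here, not taken from the slides.

```latex
\begin{align*}
b_{rs} &= -\frac{1}{2} \left( d_{rs}^2
          - \frac{1}{n} \sum_{s'=1}^{n} d_{rs'}^2
          - \frac{1}{n} \sum_{r'=1}^{n} d_{r's}^2
          + \frac{1}{n^2} \sum_{r'=1}^{n} \sum_{s'=1}^{n} d_{r's'}^2 \right), \\
B &= -\frac{1}{2}\, J D^{(2)} J,
\qquad J = I_n - \frac{1}{n} \mathbf{1}\mathbf{1}^T,
\qquad D^{(2)} = \left( d_{rs}^2 \right).
\end{align*}
```

In words, B is obtained by double centering the matrix of squared distances: subtract the row means and the column means, add back the grand mean, and multiply by $-1/2$.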
classical Multidimensional Scaling - theory
Finally we must compute X from $B = X^T X$
To do that we perform an eigenvalue decomposition of B:
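The decomposition is not shown in the text. A sketch of the standard step, assuming B is symmetric and positive semidefinite with p positive eigenvalues (the Euclidean case), with $V_p$ and $\Lambda_p$ as notation introduced here for the leading eigenvectors and eigenvalues:

```latex
\begin{align*}
B &= V \Lambda V^T, \qquad \Lambda = \mathrm{Diag}(\lambda_1, \dots, \lambda_n), \quad \lambda_1 \geq \dots \geq \lambda_n, \\
X &= \Lambda_p^{1/2} V_p^T \quad (p \times n), \qquad \text{so that} \quad X^T X = V_p \Lambda_p V_p^T = B.
\end{align*}
```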
cMDS Algorithm
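The algorithm slide itself is not preserved. The following R sketch implements the usual cMDS steps (square the distances, double center, eigendecompose, scale the leading eigenvectors) and compares the result with base R's `cmdscale()`. The function name `cmds` is hypothetical, and the `swiss` data set is used only because it also appears in the R example at the end of these notes. The function returns the points as rows (an n × k matrix), i.e. the transpose of the p × n matrix X used in the theory.

```r
cmds <- function(D, k = 2) {
  D <- as.matrix(D)
  n <- nrow(D)
  J <- diag(n) - matrix(1 / n, n, n)        # centering matrix J = I - (1/n) 1 1'
  B <- -0.5 * J %*% (D^2) %*% J             # double-centered squared distances
  eig <- eigen(B, symmetric = TRUE)         # eigenvalues in decreasing order
  lambda <- eig$values[1:k]
  V <- eig$vectors[, 1:k, drop = FALSE]
  # Coordinates: leading eigenvectors scaled by the square roots of their eigenvalues
  # (negative eigenvalues, if any, are truncated to zero)
  points <- V %*% diag(sqrt(pmax(lambda, 0)), nrow = k)
  list(points = points, eigenvalues = eig$values)
}

d   <- dist(swiss)                          # Euclidean distances between the rows of swiss
fit <- cmds(d, k = 2)
ref <- cmdscale(d, k = 2)                   # reference implementation in base R

# The two solutions should agree up to the sign of each axis
print(max(abs(abs(fit$points) - abs(ref))))
```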
MDS Algorithm: observations
The distance matrix D might not be Euclidean:
We then cannot reproduce it exactly using X. But this is fine as long as we have some large positive eigenvalues.
In that case matrix B has negative eigenvalues:
You get complex values when you compute their square root
If you have some high positive eigenvalues (at least 2), then we can still plot in the 2D space
The problem in MDS is to construct vectors x that reproduce the distances in D as closely as possible:
We can measure the reproduction error using the root mean square error
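A small R sketch of these observations, assuming the built-in `eurodist` data set (road distances between European cities, which need not be Euclidean) as an illustration that is not part of the slides. The tolerance used to count negative eigenvalues is an arbitrary choice.

```r
fit <- cmdscale(eurodist, k = 2, eig = TRUE)   # classical MDS, keeping the eigenvalues of B

# Count eigenvalues of B that are clearly negative (relative to the largest one)
n_negative <- sum(fit$eig < -1e-8 * max(fit$eig))

# Distances reproduced by the 2D configuration, in the same order as eurodist
d_hat <- dist(fit$points)

# Root mean square error between observed and reproduced distances
rmse <- sqrt(mean((as.vector(eurodist) - as.vector(d_hat))^2))

print(round(fit$eig, 1))
print(n_negative)
print(rmse)
```

Increasing `k` uses more of the positive eigenvalues and can only reduce the reproduction error when the distances are Euclidean; with non-Euclidean input some error remains whatever the dimension.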
cMDS examples: tetrahedron
cMDS examples: circular distances
cMDS examples: Airline distances
2D map of 18 world cities using cMDS. The colors reflect the different continents.
cMDS examples: Airline distances
3D map of 18 world cities using cMDS. The colors reflect the different continents.
Distance Scaling
Classical MDS seeks to find an optimal configuration $x_i$ such that $d_{rs} \approx \hat{d}_{rs} = \|x_r - x_s\|$ as closely as possible.
Distance Scaling
Distance scaling relaxes $d_{rs} \approx \hat{d}_{rs}$ from cMDS by allowing $\hat{d}_{rs} \approx f(d_{rs})$, for some monotone function f.
Called metric MDS if the dissimilarities $d_{rs}$ are quantitative
Called non-metric MDS if the dissimilarities $d_{rs}$ are qualitative (e.g. ordinal).
Unlike cMDS, distance scaling is an optimization process that minimizes a stress function, and it is solved by iterative algorithms.
metric MDS
cMDS vs. Sammon Mapping
Non metric MDS
The assumption that proximities behave like distances might be too restrictive, when it comes to employing MDS for exploring the perceptual space of human subjects
In order to overcome this problem, Shepard (1962) and Kruskal (1964) developed a method known as nonmetric multidimensional scaling
In non metric MDS, only the ordinal information in the proximities is used for constructing the spatial configuration
A monotonic transformation of the proximities is calculated, which yields scaled proximities. Optimally scaled proximities are sometimes referred to as disparities.
In many applications of MDS, dissimilarities are known only by their rank order, and the spacing between successively ranked dissimilarities is of no interest or is unavailable
Non-metric MDS
Given a (low) dimension p, non-metric MDS seeks to find an optimal configuration $X \subset \mathbb{R}^p$ that gives $f(d_{ij}) \approx \hat{d}_{ij} = \|x_i - x_j\|_2$ as closely as possible.
• Unlike metric MDS, here f is much more general and is only implicitly defined.
• $f(d_{ij}) = d^*_{ij}$ are called disparities, which only preserve the order of $d_{ij}$, i.e.,
$d_{ij} < d_{k\ell} \Leftrightarrow f(d_{ij}) \leq f(d_{k\ell}) \Leftrightarrow d^*_{ij} \leq d^*_{k\ell}$   (5)
Kruskal's non-metric MDS
• Kruskal's non-metric MDS minimizes the stress-1
$\text{stress-1}(\hat{d}_{ij}, d^*_{ij}) = \left( \dfrac{\sum_{i<j} (\hat{d}_{ij} - d^*_{ij})^2}{\sum_{i<j} \hat{d}_{ij}^2} \right)^{1/2}$.
• Note that the original dissimilarities are only used in checking (5). In fact only the order $d_{ij} < d_{k\ell} < \dots < d_{mf}$ among dissimilarities is needed.
• The function f works as if it were a regression curve (approximated dissimilarities $\hat{d}_{ij}$ as y, disparities $d^*_{ij}$ as $\hat{y}$, and the order of dissimilarities as explanatory)
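The stress-1 formula can be evaluated in R from a fitted configuration. The sketch below assumes the `MASS` package: `isoMDS()` fits Kruskal's non-metric MDS and `Shepard()` returns the original dissimilarities (`x`), the configuration distances (`y`, playing the role of $\hat{d}_{ij}$) and the monotone-regression fitted values (`yf`, playing the role of the disparities $d^*_{ij}$). The `swiss` data set is an illustrative choice.

```r
library(MASS)

d   <- dist(swiss)                        # dissimilarities between the rows of swiss
fit <- isoMDS(d, k = 2, trace = FALSE)    # Kruskal's non-metric MDS in 2 dimensions

sh <- Shepard(d, fit$points)              # x: dissimilarities, y: distances, yf: disparities

# stress-1 = sqrt( sum (d_hat - d_star)^2 / sum d_hat^2 )
stress1 <- sqrt(sum((sh$y - sh$yf)^2) / sum(sh$y^2))

print(stress1)
print(fit$stress / 100)                   # isoMDS reports its final stress in percent
```

The two printed values are expected to be close, since both are based on the same final configuration. Plotting `sh$x` against `sh$y` with the step function `sh$yf` overlaid gives the regression-curve picture described in the last bullet.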
Example: Letter recognition
Wolford and Hollingsworth (1974) were interested in the confusions made when a person attempts to identify letters of the alphabet viewed for some milliseconds only. A confusion matrix was constructed that shows the frequency with which each stimulus letter was mistakenly called something else. A section of this matrix is shown in the table below.
Is this a dissimilarity matrix?
Example: Letter recognition
• How to deduce dissimilarities from a similarity matrix?
From similarities $\delta_{ij}$, choose a maximum similarity $c \geq \max \delta_{ij}$, so that $d_{ij} = c - \delta_{ij}$ if $i \neq j$, and $d_{ij} = 0$ if $i = j$.
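The conversion rule can be written as a small R helper. This is a sketch: the function name `sim_to_dissim`, the argument `c_max`, and the 3 × 3 similarity matrix are all hypothetical and are not the Wolford and Hollingsworth data. The default mirrors the choice $c = \max \delta_{ij} + 1$ used in this example, with the maximum taken over the off-diagonal entries (an assumption about how the maximum is meant).

```r
# Convert a symmetric similarity matrix S into dissimilarities d_ij = c_max - s_ij (i != j), d_ii = 0
sim_to_dissim <- function(S, c_max = max(S[row(S) != col(S)]) + 1) {
  stopifnot(c_max >= max(S[row(S) != col(S)]))
  D <- c_max - S
  diag(D) <- 0
  D
}

# Hypothetical similarity (confusion count) matrix for three letters
S <- matrix(c(0, 5, 2,
              5, 0, 7,
              2, 7, 0), nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

D <- sim_to_dissim(S)   # here c_max = 7 + 1 = 8
print(D)
```

Changing `c_max` shifts all off-diagonal dissimilarities by the same constant and leaves their rank order unchanged.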
• Which method is more appropriate?
Because we have deduced dissimilarities from similarities, the absolute dissimilarities $d_{ij}$ depend on the personally chosen value of c. This is the case where non-metric MDS makes most sense.
However, we will also see that metric scalings (cMDS and Sammon mapping) do the job as well.
• How many dimensions?
By inspection of the eigenvalues from the cMDS solution.
• First choose $c = 21 = \max \delta_{ij} + 1$.
• Compare MDS with p = 2, from cMDS, Sammon mapping, and non-metric scaling (stress1):
Letter recognition:
• First choose $c = 21 = \max \delta_{ij} + 1$.
• Compare MDS with p = 3, from cMDS, Sammon mapping, and non-metric scaling (stress1):
Letter recognition:
• Do you see any clusters?
• With $c = 21 = \max \delta_{ij} + 1$, the eigenvalues of the Gram matrix B in the calculation of cMDS are:
508.5707 236.0530 124.8229 56.0627 39.7347 -0.0000 -35.5449 -97.1992
• The choice of p = 2 or p = 3 seems reasonable.
Letter recognition
• Second choice of $c = 210 = \max \delta_{ij} + 190$.
• Compare MDS with p = 2, from cMDS, Sammon mapping, and non-metric scaling (stress1):
• Second choice of $c = 210 = \max \delta_{ij} + 190$.
• Compare MDS with p = 3, from cMDS, Sammon mapping, and non-metric scaling (stress1):
Letter recognition:
• With c = 210, the eigenvalues of the Gram matrix B in the calculation of cMDS are:
1.0e+04 * (2.7210 2.2978 2.1084 1.9623 1.9133 1.7696 1.6842 0.0000)
• May need more than 3 dimensions (p > 3).
Letter recognition: Summary
• The structure of the data is appropriate for non-metric MDS.
• Kruskal's non-metric scaling:
1 Appropriate for non-metric dissimilarities (only their order is preserved)
2 Optimization: susceptible to local minima (leading to different configurations)
3 Time-consuming
• cMDS is fast and good overall.
• Sammon mapping fails when c = 210.
Letter recognition: Summary
• Clusters (C, G), (D, Q), (H, M, N, W) are confirmed by a cluster analysis for either choice of c.
Use agglomerative hierarchical clustering with average linkage:
Basics of a non metric MDS algorithm
The core of a non metric MDS algorithm is a twofold optimization process
First the optimal monotonic transformation of the proximities has to be found
Secondly, the points of a configuration have to be optimally arranged, so that their distances match the scaled proximities as closely as possible
MDS - some comments
Some authors classify cMDS as a metric multidimensional scaling method.
Classical MDS is also known as principal coordinates analysis.
Sammon mapping is a nonlinear metric multidimensional scaling method.
Non-metric multidimensional scaling is also known as ordinal MDS.
MDS in R
library(MASS)
# compute the dissimilarity matrix from a dataset
d <- dist(swiss)
# d is a "dist" object holding the lower triangle of the n x n distance matrix
cmdscale(d, k = 2)  # classical MDS
sammon(d, k = 1)    # Sammon mapping
isoMDS(d, k = 2)    # Kruskal's non-metric MDS