Statistical Methods in Data Mining


Conceição Amado

Instituto Superior Técnico

Lisboa


Dimensionality Reduction

Summary

1 Dimensionality Reduction


Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of random variables under consideration.

It maps input data instances x to real vectors z with a smaller number of dimensions.

Some techniques:

Principal Components Analysis (linear)

Independent Components Analysis (linear or nonlinear)

Self-Organizing Maps (nonlinear)

Multi-Dimensional Scaling (nonlinear; allows non-numeric data objects)

Isomap (nonlinear)

Why reduce dimensionality?

Reduces time complexity: less computation

Reduces space complexity: fewer parameters

Simpler models are more robust on small datasets: less over-fitting

More interpretable; simpler explanations

Data visualization (structure, groups, outliers) is easier when the data are plotted in 2 or 3 dimensions

PCA - Applications

Interpretation (study structure)

Create a new set of variables (a smaller number that are uncorrelated). These can be used in other procedures (e.g., multiple regression).

Select a sub-set of the original variables to be used in other multivariate procedures.

Detect outliers or clusters of observations.

Check the multivariate normality assumption (before analysing data with procedures that assume multivariate normality).

PCA - Introduction

Principal Component Analysis (PCA): Initially proposed by Pearson (1901) under a different name. In 1933, Hotelling independently obtained (and named) the same results.

Only in the 1960s did PCA start to be studied and explored in detail, in parallel with the development of computers.

Concern: Explain the associations among a set of variables through linear combinations of these variables

Aims:

(i) Data reduction

(ii) Interpretation

Frequently used as an intermediate step

Population Principal Components

For population principal components, we consider the population covariance matrix Σ; the principal components (PCs) are derived from Σ

Two approaches

Geometry of PCA: p - space


Axis Rotation & Best Fit Line

Further Notes regarding PCA


The Algebra of Population PCA

Maximizing the Criteria


Population PC: Result 1

Population PC: Result 2


Proportion of Variance Accounted For

Correlation between Y_i and X_k

Population Principal Components

Properties:

1 E(Y_i) = γ^t_i E(X) = γ^t_i µ

2 Var(Y_i) = γ^t_i Σ γ_i = λ_i

3 Cov(Y_i, Y_j) = 0, ∀ i ≠ j

4 Var(Y_1) ≥ Var(Y_2) ≥ . . . ≥ Var(Y_p) ≥ 0

5 Var(Y) = Λ = Diag(λ_1, . . . , λ_p), where Y = (Y_1, . . . , Y_p)^t

6 tr(Σ) = ∑_{i=1}^{p} Var(X_i) = tr(Λ) = ∑_{i=1}^{p} λ_i

7 det(Σ) = ∏_{i=1}^{p} λ_i = det(Λ)

8 Cov(X_i, Y_j) = λ_j γ_ij

9 Cov(X, Y) = Γ Λ, where Γ = (γ_1, . . . , γ_p) is a (p × p) matrix

10 Cor(X_i, Y_j) = γ_ij √λ_j / √σ_ii

Comment: Properties 5. and 6. are the theoretical support for the use of PCA as a data reduction technique

Population Principal Components

In fact, since tr(Σ) = ∑_{i=1}^{p} Var(X_i) = tr(Λ) = ∑_{i=1}^{p} λ_i, the ratio

λ_j / ∑_{i=1}^{p} λ_i

is the proportion of population total variance due to the j-th principal component

Population Principal Components: an example

Exercise: Given the covariance matrix:

Σ =

    [ 8  0  1 ]
    [ 0  8  3 ]
    [ 1  3  5 ]

1 Compute the eigenvalues λ1, λ2, and λ3 of Σ, and the eigenvectors γ₁, γ₂, and γ₃ of Σ.

Hint: You may use R to compute the eigenvalues and eigenvectors of Σ.

2 Write the population PCs

3 Verify that Var(Y_i) = λ_i and Cov(Y_i, Y_j) = 0, ∀ i ≠ j

4 Show that λ₁+ λ₂+ λ₃ = tr(Σ), where the trace of a matrix equals the sum of its diagonal elements

5 Show that λ₁λ₂λ₃ = det(Σ)

Population Principal Components: an example

Solution:

1 The covariance matrix

Σ =

    [ 8  0  1 ]
    [ 0  8  3 ]
    [ 1  3  5 ]

has the following eigenvalues and eigenvectors:

Λ = Diag(10, 8, 3) and Γ =

    [ −0.267   0.949   0.169 ]
    [ −0.802  −0.316   0.507 ]
    [ −0.535   0.000  −0.845 ]

2 The population PCs are:

Y₁ = −0.267X₁ − 0.802X₂ − 0.535X₃

Y₂ = 0.949X₁ − 0.316X₂

Y₃ = 0.169X₁ + 0.507X₂ − 0.845X₃

Solution:

3. Verify that Var(Y_i) = λ_i and Cov(Y_i, Y_j) = 0, ∀ i ≠ j (Homework)

4. The trace of the covariance matrix

Σ =

    [ 8  0  1 ]
    [ 0  8  3 ]
    [ 1  3  5 ]

is tr(Σ) = 8 + 8 + 5 = 21, and the sum of the eigenvalues is 10 + 8 + 3 = 21

5. det(Σ) = 240 = 10 × 8 × 3
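The checks in items 1, 3, 4 and 5 can also be done numerically. The following is a minimal R sketch, assuming the covariance matrix of the exercise is symmetric (the lower triangle is filled in from the upper one); the variable names are illustrative only. Note that `eigen()` may return eigenvectors with signs opposite to those printed in Γ, which does not affect the results.

```r
# Covariance matrix from the exercise (symmetric fill of the upper triangle)
Sigma <- matrix(c(8, 0, 1,
                  0, 8, 3,
                  1, 3, 5), nrow = 3, byrow = TRUE)

eig    <- eigen(Sigma, symmetric = TRUE)
lambda <- eig$values    # eigenvalues, in decreasing order
Gamma  <- eig$vectors   # columns are the eigenvectors (signs are arbitrary)

# Item 3: Var(Y) = Gamma^t Sigma Gamma should be Diag(lambda),
# i.e. Var(Y_i) = lambda_i and Cov(Y_i, Y_j) = 0 for i != j
var_Y <- t(Gamma) %*% Sigma %*% Gamma

# Items 4 and 5: trace and determinant identities
trace_Sigma <- sum(diag(Sigma))
det_Sigma   <- det(Sigma)

print(round(lambda, 3))
print(round(var_Y, 3))
print(c(trace = trace_Sigma, sum_lambda = sum(lambda)))
print(c(det = det_Sigma, prod_lambda = prod(lambda)))
```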

For the example:

6. Calculate the proportion of the total variance explained by each principal component

7. Obtain the correlation between each principal component and each observed variable

Solution:

6. Given the eigenvalues of Σ, (10, 8, 3)^t:

    λ_j                                   10              8               3
    λ_j / ∑_{i=1}^{p} λ_i                 10/21 ≈ 0.476   8/21 ≈ 0.381    3/21 ≈ 0.143
    ∑_{j=1}^{k} λ_j / ∑_{i=1}^{p} λ_i     0.476           0.857           1.000

7. Obtain the correlation between each principal component and each observed variable (Homework)
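As a numerical companion to items 6 and 7, the sketch below computes the proportions of variance and applies Property 10, Cor(X_i, Y_j) = γ_ij √λ_j / √σ_ii, to the same Σ. It is an illustration under the assumption that Σ is the symmetric matrix of the exercise; the signs of the correlations depend on the arbitrary signs of the eigenvectors returned by `eigen()`.

```r
Sigma <- matrix(c(8, 0, 1,
                  0, 8, 3,
                  1, 3, 5), nrow = 3, byrow = TRUE)

eig    <- eigen(Sigma, symmetric = TRUE)
lambda <- eig$values
Gamma  <- eig$vectors

# Item 6: proportion and cumulative proportion of total variance
prop     <- lambda / sum(lambda)
cum_prop <- cumsum(prop)

# Item 7 (Property 10): entry (i, j) is gamma_ij * sqrt(lambda_j) / sqrt(sigma_ii)
# Dividing by a length-3 vector recycles it down the rows, so row i is
# divided by sqrt(sigma_ii).
cor_XY <- (Gamma %*% diag(sqrt(lambda))) / sqrt(diag(Sigma))

print(round(prop, 3))
print(round(cum_prop, 3))
print(round(cor_XY, 3))
```

Since Σ = Γ Λ Γ^t, the squared correlations in each row of `cor_XY` add up to one, which gives a quick sanity check.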


When Variances are Very Different

PCs of Standardized Variables


PCs of Standardized versus non-Std Variables

PC obtained from Standardized Variables

Properties (Standardized Variables):

1 E(Y_i) = γ^t_i E(Z) = 0

2 Var(Y_i) = γ^t_i R γ_i = λ_i

3 Cov(Y_i, Y_j) = 0, ∀ i ≠ j

4 Var(Y_1) ≥ Var(Y_2) ≥ . . . ≥ Var(Y_p) ≥ 0

5 Var(Y) = Λ = Diag(λ_1, . . . , λ_p), where Y = (Y_1, . . . , Y_p)^t

6 tr(R) = p = tr(Λ) = ∑_{i=1}^{p} λ_i. Thus, λ_j/p is the proportion of population total variance due to the j-th principal component

7 det(R) = ∏_{i=1}^{p} λ_i = det(Λ)

8 Cov(Z_i, Y_j) = λ_j γ_ij

9 Cov(Z, Y) = Γ Λ

10 Cor(Z_i, Y_j) = γ_ij √λ_j

PC obtained from Standardized Variables

Exercise (cont.): Given the covariance matrix: (Homework)

Σ =

    [ 8  0  1 ]
    [ 0  8  3 ]
    [ 1  3  5 ]

1 Convert the covariance matrix into a correlation matrix, R

2 Compute the eigenvalues λ₁, λ₂, and λ₃ of R, and the eigenvectors γ₁, γ₂, and γ₃ of R

Hint: You may use R to compute the eigenvalues and eigenvectors of R.

3 Write the population PCs obtained from the standardized variables

4 Calculate the proportion of the total variance explained by each principal component obtained from the standardized variables.

5 Obtain the correlation between each principal component and each standardized observed variable

6 Compare the PCs obtained from the original and standardized variables. Are they the same? Which approach would you recommend?

Additional PC Properties

PCs are not scale invariant: this is a limitation of PCA that has to be taken into account when modelling data

Variables with high variability tend to dominate the definition of the first PC, which can be misleading

Example:

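Σ =

    [ 8  0  1 ]
    [ 0  8  3 ]
    [ 1  3  5 ]

Coefficients of γ_1 (in absolute value) when the third variable is rescaled:

    Third variable:   X₃      2X₃     3X₃     10X₃
    X₁                0.267   0.125   0.074   0.020
    X₂                0.802   0.375   0.223   0.061
    X₃                0.535   0.919   0.972   0.998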

The traditional solution is to use standardized variables
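The effect of rescaling can be reproduced numerically. The sketch below is an illustration, assuming the symmetric Σ of the example: multiplying X₃ by a constant c turns Σ into D Σ D with D = Diag(1, 1, c), and the first eigenvector is recomputed for each c. The helper name `first_pc` is hypothetical. The last lines show that the correlation matrix, and hence PCA on standardized variables, is unaffected by the rescaling.

```r
Sigma <- matrix(c(8, 0, 1,
                  0, 8, 3,
                  1, 3, 5), nrow = 3, byrow = TRUE)

# Absolute coefficients of the first PC after multiplying X3 by scale_x3
first_pc <- function(scale_x3) {
  D <- diag(c(1, 1, scale_x3))
  Sigma_scaled <- D %*% Sigma %*% D       # covariance of (X1, X2, c * X3)
  abs(eigen(Sigma_scaled, symmetric = TRUE)$vectors[, 1])
}

gamma1 <- sapply(c(1, 2, 3, 10), first_pc)
dimnames(gamma1) <- list(c("X1", "X2", "X3"), c("X3", "2X3", "3X3", "10X3"))
print(round(gamma1, 3))

# The correlation matrix does not change when X3 is rescaled
D10        <- diag(c(1, 1, 10))
R_original <- cov2cor(Sigma)
R_rescaled <- cov2cor(D10 %*% Sigma %*% D10)
print(all.equal(R_original, R_rescaled))
```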


Sample Principal Components

Algebra of Sample PC


Geometry of Sample PC

Geometry of Sample continued


Geometry of Sample continued


Sample Principal Components

Let x_l represent the observed values on the l-th object. Then ŷ_li = γ̂^T_i x_l is called the score of the l-th object on the i-th PC.

Sample PCs obtained from standardized variables follow a similar reasoning.
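A minimal R sketch of the score computation follows. It uses the built-in `swiss` data purely as an illustration, and it centers the observations before projecting them, which is an assumption made here (it matches the convention of `prcomp()`); the variable names are illustrative.

```r
X <- as.matrix(swiss)                       # 47 objects, 6 variables
S <- cov(X)                                 # sample covariance matrix
eig_S     <- eigen(S, symmetric = TRUE)
Gamma_hat <- eig_S$vectors                  # columns are the sample eigenvectors

# Scores: row l, column i holds gamma_hat_i^T (x_l - x_bar)
X_centered <- scale(X, center = TRUE, scale = FALSE)
scores     <- X_centered %*% Gamma_hat

# The scores are uncorrelated and their variances are the eigenvalues of S
print(round(cov(scores), 3))
print(round(eig_S$values, 3))
```

Up to the sign of each column, `scores` should agree with `prcomp(X)$x`.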

Example: Swiss Bank Notes


Swiss Bank Notes: variables

Swiss Bank Notes: sample statistics


Eigenvalues of S

Eigenvectors of Genuine Bank notes


Correlation between measures and PCs

The Latter PCs


PCA as a Preliminary to Other Analysis

Number of PC to retain

Question: How many PC should be retained?

Answer: There is no universally accepted method. Here are some commonly used ones:

Choose k such that the ratio

∑_{i=1}^{k} λ_i / ∑_{j=1}^{p} λ_j

is high (e.g. higher than 80%)

Problem: deciding the threshold

Number of PC to retain

Choose k such that

λ_i ≥ λ̄, i = 1, . . . , k, where λ̄ = ∑_{i=1}^{p} λ_i / p

Comment: When working with standardized variables λ̄ = 1, since ∑_{i=1}^{p} λ_i = p

Number of PC to retain

Draw a scree plot, i.e. plot λ_i versus i. Look for an elbow in the scree plot

Use a statistical test. The procedure is then no longer distribution-free
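The first three criteria can be sketched in a few lines of R. The example below is only an illustration: it uses the built-in `swiss` data with standardized variables, and the 80% threshold is the arbitrary value quoted above.

```r
pca    <- prcomp(swiss, center = TRUE, scale. = TRUE)  # standardized variables
lambda <- pca$sdev^2                                   # eigenvalues of R

# Criterion 1: smallest k whose cumulative proportion reaches the threshold
cum_prop    <- cumsum(lambda) / sum(lambda)
k_threshold <- which(cum_prop >= 0.80)[1]

# Criterion 2: number of eigenvalues at least as large as the mean eigenvalue
# (the mean is 1 here because the variables are standardized)
k_mean <- sum(lambda >= mean(lambda))

# Criterion 3: scree plot, to be inspected visually for an elbow
plot(lambda, type = "b", xlab = "i", ylab = "lambda_i", main = "Scree plot")

print(round(cum_prop, 3))
print(c(k_threshold = k_threshold, k_mean = k_mean))
```

The criteria need not agree, which is one reason no single rule is universally accepted.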


PC Interpretation

The first decisions:

Standardize or not to standardize?

How many PC to retain?

Only afterwards does the problem of interpreting the PCs arise. There are two main possible strategies:

Loadings: For a given PC, select the loadings with the highest magnitude (in absolute value)

If all the selected coefficients have the same sign, then the PC can be interpreted as a weighted sum (or "index") of the selected original variables

Otherwise, it can be understood as a contrast between the selected variables associated with positive and negative weights

PC Interpretation

Correlations between PCs and X_i: Proceed as before, but using the sample correlations between the PCs and the original variables

Problem: The two criteria may lead to different sets of selected variables and so to different interpretations

Validation: Cadima and Jolliffe (1995) suggested a way to validate the subset of original variables selected to interpret the PC.
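The two strategies can be compared side by side. The sketch below is an illustration on the built-in `swiss` data (unstandardized), not the validation procedure of Cadima and Jolliffe: it extracts the loadings, computes the sample correlations between variables and PCs, and checks them against the sample analogue of Property 10.

```r
X        <- as.matrix(swiss)
pca      <- prcomp(X, center = TRUE, scale. = FALSE)
loadings <- pca$rotation            # column j holds the coefficients of PC j
lambda   <- pca$sdev^2              # sample eigenvalues

# Strategy 2: sample correlations between original variables and PC scores
cor_XY <- cor(X, pca$x)

# Same quantity via gamma_ij * sqrt(lambda_j) / s_i, dividing row i by s_i
cor_formula <- sweep(loadings %*% diag(sqrt(lambda)), 1, sqrt(diag(cov(X))), "/")

# Variable singled out on PC1 by each strategy
top_loading     <- names(which.max(abs(loadings[, 1])))
top_correlation <- names(which.max(abs(cor_XY[, 1])))
print(c(loading = top_loading, correlation = top_correlation))
```

Because the correlations rescale each loading by the standard deviation of its variable, the two rankings can differ when the variances are very different.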


Sample PC - Example

2013 National Records: R Example

Multidimensional Scaling (MDS)


Multidimensional Scaling (MDS)

Underlying data set of n points may be unknown:

Might not know coordinate values of points

Might not even know dimensionality of data!

Given instead a dissimilarity measure d_ij between pairs of points

Might not know how distances were calculated (Euclidean, city block, ...)

Multidimensional Scaling (MDS)

Example application domains:

Data visualization

Marketing: Organize consumer products as points in a space according to perceived differences between them

Quantitative social sciences: e.g. mapping the political bent of parliamentary ministers based on voting dissimilarities

Psychology


Multidimensional Scaling (MDS)

MDS is a set of data analysis methods that allow one to infer the dimensions of the perceptual space of subjects.

The raw data entering into an MDS analysis are typically a measure of the global similarity or dissimilarity of the objects under investigation

The primary outcome of an MDS analysis is a spatial configuration, in which the objects are represented as points

The points in this spatial representation are arranged in such a way that their distances correspond to the similarities of the objects: similar objects are represented by points that are close to each other, dissimilar objects by points that are far apart

How does MDS work?

The goal of an MDS analysis is to find a spatial configuration of objects when all that is known is some measure of their general (dis)similarity

The spatial configuration should provide some insight into how the objects are arranged in that space, which may have a number of potentially unknown dimensions

How does MDS work?

MDS methods include:

Classical MDS

Metric MDS

Non-metric MDS

Classical MDS

Consider the problem of cities on a map: looking at a map showing a number of cities, one is interested in the distances between them.

These distances are easily obtained by measuring them with a ruler.

Apart from that, a mathematical solution is available: knowing the coordinates x and y, the Euclidean distance between two cities a and b is defined by:

d_ab = √((x_a − x_b)² + (y_a − y_b)²)

Now consider the inverse problem: having only the distances, is it possible to obtain the map?

Classical MDS

Classical MDS, which was first introduced by Torgerson (1952), addresses this problem. It assumes the distances to be Euclidean.

Euclidean distances are usually the first choice for an MDS space.

There exist, however, a number of non-Euclidean distance measures, which are limited to very specific research questions (cf. Borg & Groenen, 1997)

In many applications of MDS the data are not distances as measured from a map, but rather proximity data

Classical MDS

When applying classical MDS to proximities it is assumed that the proximities behave like real measured distances

This might hold, e.g., for data that are derived from correlation matrices, but rarely for direct dissimilarity ratings.

The advantage of classical MDS is that it provides an analytical solution, requiring no iterative procedures.


classical Multidimensional Scaling - theory

Suppose for now we have a Euclidean distance matrix D = (d_rs)

The objective of classical Multidimensional Scaling (cMDS) is to find X = [x_1, . . . , x_n], x_r ∈ IR^p, so that ||x_r − x_s|| = d_rs, where || · || is a vector norm. In classical MDS, this norm is the Euclidean norm.

Such a solution is not unique, because if X is a solution, then the translated configuration x^∗_r = x_r + c, c ∈ IR^p, also satisfies

||x^∗_r − x^∗_s|| = ||(x_r + c) − (x_s + c)|| = ||x_r − x_s|| = d_rs.

Any location c can be used, but the assumption of a centered configuration, i.e.

∑_{r=1}^{n} x_rk = 0, for all k,

serves well for the purpose of dimension reduction.

classical Multidimensional Scaling - theory

In short, cMDS finds the centered configuration x_1, . . . , x_n in IR^p, for some p ≤ n − 1, so that their pairwise distances are the same as the corresponding distances in D.

We may find the n × n Gram matrix B = X^TX, rather than X.

The Gram matrix is the inner-product matrix, since X is assumed to be centered or, in general, standardized¹.

¹ It is worth mentioning that centering the variables has the geometric effect of moving the origin of the space (0, 0) to the centroid of the points defined by the means of the variables, but it does not affect the distances in any way. In addition, in MDS analysis, regardless of the type of MDS model used, the configuration is typically centered and standardized, which means that the sum of the coordinates of each dimension is zero and the variance is one.

classical Multidimensional Scaling - theory

Suppose for now we have Euclidean distance matrix D = (drs) Step 1: Express D in terms of B


classical Multidimensional Scaling - theory

Proof: Sum of a row r in B is zero

classical Multidimensional Scaling - theory

Proof: Sum of a column s in B is zero


classical Multidimensional Scaling - theory

Let us sum the squared distances over r:

classical Multidimensional Scaling - theory

Let us sum the squared distances over s:

classical Multidimensional Scaling - theory

Let us sum the squared distances over r and s:

classical Multidimensional Scaling - theory

Express d_rs in terms of b_rs:

classical Multidimensional Scaling - theory

Computing B
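For reference, the standard derivation behind these steps can be summarized as follows. This is a sketch reconstructed from the slide titles, assuming b_rs = x_r^T x_s and a centered configuration (∑_r x_r = 0); the centering matrix H and the notation D^{(2)} for the matrix of squared distances are introduced here for convenience.

```latex
\begin{align*}
d_{rs}^2 &= \lVert x_r - x_s \rVert^2 = b_{rr} + b_{ss} - 2\,b_{rs},
  \qquad \sum_{r=1}^{n} b_{rs} = \sum_{s=1}^{n} b_{rs} = 0, \\
\frac{1}{n}\sum_{r=1}^{n} d_{rs}^2 &= \frac{1}{n}\,\mathrm{tr}(B) + b_{ss},
  \qquad
\frac{1}{n}\sum_{s=1}^{n} d_{rs}^2 = b_{rr} + \frac{1}{n}\,\mathrm{tr}(B),
  \qquad
\frac{1}{n^2}\sum_{r=1}^{n}\sum_{s=1}^{n} d_{rs}^2 = \frac{2}{n}\,\mathrm{tr}(B), \\
b_{rs} &= -\frac{1}{2}\left( d_{rs}^2
  - \frac{1}{n}\sum_{r=1}^{n} d_{rs}^2
  - \frac{1}{n}\sum_{s=1}^{n} d_{rs}^2
  + \frac{1}{n^2}\sum_{r=1}^{n}\sum_{s=1}^{n} d_{rs}^2 \right), \\
B &= -\frac{1}{2}\, H D^{(2)} H,
  \qquad H = I_n - \frac{1}{n}\,\mathbf{1}\mathbf{1}^{T},
  \qquad D^{(2)} = (d_{rs}^2).
\end{align*}
```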

classical Multidimensional Scaling - theory

Finally we must compute X from B = X^TX

To do that we perform an eigenvalue decomposition of B:

cMDS Algorithm
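A minimal R sketch of the classical MDS steps is given below: square the distances, double-center them to obtain B, take the eigendecomposition of B, and scale the leading eigenvectors by the square roots of their eigenvalues. The function name `classical_mds` is hypothetical and the `swiss` data are used only as an illustration. Unlike the notation X = [x_1, . . . , x_n] above, the points are stored here as rows.

```r
classical_mds <- function(d, k = 2) {
  D2 <- as.matrix(d)^2                        # squared dissimilarities
  n  <- nrow(D2)
  H  <- diag(n) - matrix(1 / n, n, n)         # centering matrix
  B  <- -0.5 * H %*% D2 %*% H                 # double-centered Gram matrix
  eig    <- eigen(B, symmetric = TRUE)
  lambda <- eig$values[seq_len(k)]
  stopifnot(all(lambda > 0))                  # need k positive eigenvalues
  # n x k coordinates: leading eigenvectors scaled by sqrt(eigenvalue)
  coords <- eig$vectors[, seq_len(k), drop = FALSE] %*% diag(sqrt(lambda), k)
  list(points = coords, eig = eig$values)
}

d   <- dist(swiss)               # Euclidean distances between the 47 rows
fit <- classical_mds(d, k = 2)
ref <- cmdscale(d, k = 2)        # reference implementation in base R

# Coordinates may differ by a sign flip per axis, so compare distances
print(all.equal(as.vector(dist(fit$points)), as.vector(dist(ref))))
```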

MDS Algorithm: observations

The distance matrix D might have negative eigenvalues:

which means it is not Euclidean. We cannot exactly reproduce it using X. But this is fine as long as we have some large positive eigenvalues.

Matrix B might have negative eigenvalues:

You get complex values when you compute their square roots

If there are at least two large positive eigenvalues, we can still plot the configuration in 2D space

The problem in MDS is to construct vectors x that produce the distances in D as close as possible:

We can measure the reproduction error using root mean square error
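The two diagnostics mentioned here can be sketched in R as follows, again using `dist(swiss)` only as an illustration. The tolerance used to decide that an eigenvalue is negative is an arbitrary choice made for this example.

```r
d   <- dist(swiss)
fit <- cmdscale(d, k = 2, eig = TRUE)   # eig = TRUE also returns the eigenvalues of B

# Eigenvalues of B: clearly negative values indicate a non-Euclidean D
n_negative <- sum(fit$eig < -1e-8 * max(fit$eig))

# Root mean square error between original and reproduced distances
d_hat <- dist(fit$points)
rmse  <- sqrt(mean((as.vector(d) - as.vector(d_hat))^2))

print(round(head(fit$eig), 2))
print(c(n_negative = n_negative, rmse = rmse))
```

For Euclidean input distances no eigenvalue should be clearly negative, and the error only reflects the dimensions dropped by choosing k = 2.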


cMDS examples: tetrahedron

cMDS examples: circular distances


cMDS examples: circular distances

cMDS examples: Airline distances


cMDS examples: Airline distances

2D map of 18 world cities using cMDS. The colors reflect the different continents.


cMDS examples: Airline distances

3D map of 18 world cities using cMDS. The colors reflect the different continents.

Distance Scaling

Classical MDS seeks to find an optimal configuration x_i that gives d_rs ≈ d̂_rs = ||x_r − x_s|| as close as possible.

Distance Scaling

Distance scaling relaxes d_rs ≈ d̂_rs from cMDS by allowing d̂_rs ≈ f(d_rs), for some monotone function f.

Called metric MDS if the dissimilarities d_rs are quantitative

Called non-metric MDS if the dissimilarities d_rs are qualitative (e.g. ordinal).

Unlike cMDS, distance scaling is an optimization process minimizing a stress function, and is solved by iterative algorithms.

metric MDS


cMDS vs. Sammon Mapping

Non metric MDS

The assumption that proximities behave like distances might be too restrictive, when it comes to employing MDS for exploring the perceptual space of human subjects

In order to overcome this problem, Shepard (1962) and Kruskal (1964) developed a method known as nonmetric multidimensional scaling

In non metric MDS, only the ordinal information in the proximities is used for constructing the spatial configuration

A monotonic transformation of the proximities is calculated, which yields scaled proximities. Optimally scaled proximities are sometimes referred to as disparities.


In many applications of MDS, dissimilarities are known only by their rank order, and the spacing between successively ranked dissimilarities is of no interest or is unavailable

Non-metric MDS

Given a (low) dimension p, non-metric MDS seeks to find an optimal configuration X ⊂ R^p that gives f(d_ij) ≈ d̂_ij = ||x_i − x_j||₂ as close as possible.

• Unlike metric MDS, here f is much more general and is only implicitly defined.

• f(d_ij) = d^∗_ij are called disparities, which only preserve the order of d_ij, i.e.,

d_ij < d_kl ⇔ f(d_ij) ≤ f(d_kl) ⇔ d^∗_ij ≤ d^∗_kl    (5)

Kruskal’s non-metric MDS

• Kruskal’s non-metric MDS minimizes the stress-1

stress-1(d̂_ij, d^∗_ij) = ( ∑_{i<j} (d̂_ij − d^∗_ij)² / ∑_{i<j} d̂_ij² )^{1/2}.

• Note that the original dissimilarities are only used in checking (5). In fact only the order d_ij < d_kl < ... < d_mf among dissimilarities is needed.

• The function f works as if it were a regression curve (approximated dissimilarities d̂_ij as y, disparities d^∗_ij as ŷ, and the order of dissimilarities as explanatory)
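The stress-1 criterion can be evaluated by hand in R. The sketch below is an illustration on `dist(swiss)`: `isoMDS()` from the MASS package fits the configuration, and `Shepard()` returns the configuration distances (`y`, playing the role of d̂_ij) together with their monotone fit (`yf`, the disparities d^∗_ij). `isoMDS()` reports its stress in percent, hence the factor of 100 in the comparison.

```r
library(MASS)

d   <- dist(swiss)
fit <- isoMDS(d, k = 2, trace = FALSE)   # Kruskal's non-metric MDS

# Shepard diagram data: x = dissimilarities, y = fitted distances,
# yf = monotone (isotonic) regression of y on the order of x
sh <- Shepard(d, fit$points)

# stress-1 = sqrt( sum (d_hat - d_star)^2 / sum d_hat^2 )
stress1 <- sqrt(sum((sh$y - sh$yf)^2) / sum(sh$y^2))

print(c(manual_percent = 100 * stress1, isoMDS_percent = fit$stress))
```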


Example: Letter recognition

Wolford and Hollingsworth (1974) were interested in the confusions made when a person attempts to identify letters of the alphabet viewed for some milliseconds only. A confusion matrix was constructed that shows the frequency with which each stimulus letter was mistakenly called something else. A section of this matrix is shown in the table below.

Is this a dissimilarity matrix?

Example: Letter recognition

• How to deduce dissimilarities from a similarity matrix?

From similarities δ_ij, choose a maximum similarity c ≥ max δ_ij, so that d_ij = c − δ_ij if i ≠ j, and d_ij = 0 if i = j.

• Which method is more appropriate?

Because we have deduced dissimilarities from similarities, the absolute dissimilarities d_ij depend on the value of the personally chosen c. This is the case where non-metric MDS makes most sense.

However, we will also see that metric scalings (cMDS and Sammon mapping) do the job as well.

• How many dimensions?

By inspection of the eigenvalues from the cMDS solution.
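The conversion d_ij = c − δ_ij can be sketched as below. The 4 × 4 similarity matrix is made up for illustration (it is not the Wolford and Hollingsworth data), and the helper `sim_to_dissim` and its argument `c_max` are hypothetical names. The example also shows why the choice of c matters for metric methods only: the values of d_ij change with c, but their rank order does not.

```r
# Hypothetical symmetric similarity (confusion count) matrix
delta <- matrix(c( 0, 12,  3,  5,
                  12,  0,  7,  2,
                   3,  7,  0, 20,
                   5,  2, 20,  0), nrow = 4, byrow = TRUE,
                dimnames = list(LETTERS[1:4], LETTERS[1:4]))

# d_ij = c - delta_ij for i != j, and 0 on the diagonal
sim_to_dissim <- function(delta, c_max = max(delta) + 1) {
  stopifnot(c_max >= max(delta))
  d <- c_max - delta
  diag(d) <- 0
  as.dist(d)
}

d_small <- sim_to_dissim(delta)                # c = max(delta) + 1
d_large <- sim_to_dissim(delta, c_max = 210)   # a much larger c

print(d_small)
# The rank order of the dissimilarities is the same for both choices of c
print(identical(rank(as.vector(d_small)), rank(as.vector(d_large))))
```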


• First choose c = 21 = max δ_ij + 1.

• Compare MDS with p = 2, from cMDS, Sammon mapping, and non-metric scaling (stress1):

Letter recognition:

• First choose c = 21 = max δ_ij + 1.

• Compare MDS with p = 3, from cMDS, Sammon mapping, and non-metric scaling (stress1):

Letter recognition:

• Do you see any clusters?

• With c = 21 = max δ_ij + 1, the eigenvalues of the Gram matrix B in the calculation of cMDS are:

508.5707 236.0530 124.8229 56.0627 39.7347 -0.0000 -35.5449 -97.1992

• The choice of p = 2 or p = 3 seems reasonable.

Letter recognition

• Second choice of c = 210 = max δ_ij + 190.

• Compare MDS with p = 2, from cMDS, Sammon mapping, and non-metric scaling (stress1):

• Second choice of c = 210 = max δ_ij + 190.

• Compare MDS with p = 3, from cMDS, Sammon mapping, and non-metric scaling (stress1):

Letter recognition:

• With c = 210, the eigenvalues of the Gram matrix B in the calculation of cMDS are:

1.0e+04 * 2.7210 2.2978 2.1084 1.9623 1.9133 1.7696 1.6842 0.0000

• May need more than 3 dimensions (p > 3).

Letter recognition: Summary

• The structure of the data is appropriate for non-metric MDS.

• Kruskal’s non-metric scaling:

1 Appropriate for non-metric dissimilarities (only their order is preserved)

2 Optimization: susceptible to local minima (leading to different configurations)

3 Time-consuming

• cMDS: fast, overall good.

• Sammon mapping fails when c = 210.

Letter recognition: Summary

• Clusters (C, G), (D, Q), (H, M, N, W) are confirmed by a cluster analysis for either choice of c.

Use agglomerative hierarchical clustering with average linkage:
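As an illustration of the clustering step, the sketch below runs average-linkage agglomerative clustering with base R. Since the letter confusion data are not reproduced here, it uses `dist(swiss)` as a stand-in dissimilarity object, and the number of groups (3) is an arbitrary choice for the example.

```r
d  <- dist(swiss)                        # stand-in dissimilarities
hc <- hclust(d, method = "average")      # agglomerative, average linkage

groups <- cutree(hc, k = 3)              # cut the dendrogram into 3 groups
print(table(groups))

plot(hc, cex = 0.6, main = "Average linkage")   # dendrogram
```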


Basics of a non metric MDS algorithm

The core of a non metric MDS algorithm is a twofold optimization process

First the optimal monotonic transformation of the proximities has to be found

Secondly, the points of a configuration have to be optimally arranged, so that their distances match the scaled proximities as closely as possible

MDS - some comments

Some of the literature classifies cMDS as a metric multidimensional scaling method.

Classical MDS is also known as principal coordinates analysis.

Sammon mapping is a nonlinear metric multidimensional scaling method.

Non-metric multidimensional scaling is also known as ordinal MDS.


MDS in R

library(MASS)

# compute the dissimilarity matrix from a dataset
d <- dist(swiss)

# d is a "dist" object storing the n(n-1)/2 lower-triangle distances
cmdscale(d, k = 2)   # classical MDS

sammon(d, k = 1)     # Sammon mapping

isoMDS(d, k = 2)     # Kruskal’s non-metric MDS