Unsupervised Learning Juhan Nam

(1)

Juhan Nam

GCT634/AI613: Musical Applications of Machine Learning (Fall 2021)

Unsupervised Learning

(2)

Introduction

● Traditional machine learning pipeline in classification tasks

○ Select a set of audio features and concatenate for a given task

○ The concatenated features are complementary to each other

[Figure: MFCC, spectral statistics, chroma, ... → concatenated features → classifier → output class]

(3)

Issues: Redundancy and Dimensionality

● Redundancy and dimensionality in the concatenated audio features

○ Adding more features increases the dimensionality of the feature vectors. As a result, the classifier needs more parameters to train

○ The information in the concatenated feature vectors can be redundant

[Figure: concatenated features (chroma, ...) → ? → classifier → output class]

(4)

Issues: Temporal Summary

● Using every frame of the audio features over time makes the input too large

○ Frame-level audio features typically have 10 to 100 frames per second

● Summary methods

○ Temporal pooling (average, standard deviation): orderless summary

○ Concatenating with frame-wise differences (delta and double-delta): capture local temporal changes (e.g., MFCC)

○ DCT over time: take a small number of low-frequency cosine kernels for each feature dimension (1D DCT) or the entire feature dimension (2D DCT)
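The first two summary methods can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `temporal_summary` and the toy data are hypothetical, and the delta here is a simple first-order difference between neighboring frames (audio libraries often use a regression over several frames instead).

```python
import math

def temporal_summary(frames):
    """Summarize a (num_frames x dim) feature sequence as one fixed-length vector."""
    # Frame-wise differences (simple first-order delta; an assumption of this sketch).
    deltas = [[b - a for a, b in zip(prev, cur)]
              for prev, cur in zip(frames, frames[1:])]

    def mean_std(rows):
        cols = list(zip(*rows))                      # one tuple per feature dimension
        means = [sum(c) / len(c) for c in cols]
        stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
                for c, m in zip(cols, means)]
        return means + stds

    # Orderless pooling of the static features, followed by pooling of the deltas.
    return mean_std(frames) + mean_std(deltas)

# Hypothetical 4-frame, 2-dimensional feature sequence.
frames = [[1.0, 0.0], [2.0, 0.0], [3.0, 1.0], [4.0, 1.0]]
summary = temporal_summary(frames)   # length 4 * dim, independent of num_frames
```

The output length depends only on the feature dimension, not on the number of frames, which is what makes the summary usable as a classifier input.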

[Figure: concatenated features (chroma, ...) → ? → ? → classifier]

(5)

Unsupervised Learning

● Principal Component Analysis (PCA)

○ Learn a linear transform so that the transformed features are de-correlated

○ Dimensionality reduction: 2D/3D for visualization

● K-means

○ Learn K cluster centers and determine the membership

○ Map each data point to one of a fixed set of learned vectors (cluster centers): vector quantization and one-hot sparse feature representation

● Gaussian Mixture Models (GMM)

○ Learn K Gaussian distribution parameters and the soft membership

○ Density estimation (likelihood estimation): can be used for classification when estimated for each class

(6)

Principal Component Analysis

● Correlation and Redundancy

○ We can measure the redundancy between two elements in a feature vector by computing their correlation

○ If some of the elements have high correlations, we can remove the redundant elements

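Pearson correlation coefficient between two elements $x_1$ and $x_2$ (the sums run over the data points $i$; $\bar{x}_1$ and $\bar{x}_2$ are the means):

$\rho_{x_1 x_2} = \dfrac{\sum_i (x_1^{(i)} - \bar{x}_1)(x_2^{(i)} - \bar{x}_2)}{\sqrt{\sum_i (x_1^{(i)} - \bar{x}_1)^2}\,\sqrt{\sum_i (x_2^{(i)} - \bar{x}_2)^2}}$

Feature vector: $x = [x_1, x_2, \ldots, x_D]^T$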
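As a small illustration of using correlation to detect redundancy, the sketch below computes the Pearson correlation coefficient for two pairs of feature elements. The helper name `pearson` and the toy values are assumptions made for this example only.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)

# Hypothetical values of three feature elements over four data points.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.1, 3.9, 6.2, 7.8]     # roughly 2 * x1, so highly redundant with x1
x3 = [1.0, -1.0, -1.0, 1.0]   # uncorrelated with x1

r12 = pearson(x1, x2)   # close to 1: x2 adds little beyond x1
r13 = pearson(x1, x3)   # 0: x3 carries different information
```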

(7)

Principal Component Analysis

● Transform the input space (𝑋) into a latent space (𝑍) such that the latent space is de-correlated (i.e., each dimension is orthogonal to each other)

○ Linear transform designed to maximize the variance of the first principal component and minimize the variance of the last principal component

[Figure: data points in the input space $X$ and in the latent space $Z$; the axes of $Z$ are orthogonal vectors (principal components)]

(8)

Principal Component Analysis

● Transform the input space (𝑋) into a latent space (𝑍) such that the latent space is de-correlated (i.e., each dimension is orthogonal to each other)

○ Linear transform designed to maximize the variance of the first principal component and minimize the variance of the last principal component

$Z = WX$

$ZZ^T = N = \mathrm{diag}(n_1, n_2, \ldots, n_D)$

The diagonal elements correspond to the variances of transformed data points on each dimension

(9)

Principal Component Analysis: Eigenvalue Decomposition

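● Eigenvalue decomposition ($Q$: eigenvectors, $\Lambda$: eigenvalue matrix)

$A x_i = \lambda_i x_i$, with $Q = [x_1 \; x_2 \; \cdots \; x_D]$ and $\Lambda = \mathrm{diag}(\lambda_i)$

$AQ = Q\Lambda$

$A = Q \Lambda Q^{-1}$

$A = Q \Lambda Q^{T}$ (if $A$ is symmetric)

● To derive $W$

$ZZ^T = N$

$(WX)(WX)^T = N$

$W X X^T W^T = N$

$W \, \mathrm{Cov}(X) \, W^T = N$

$\mathrm{Cov}(X) = W^T N W$ ($W$ is an orthogonal matrix: $W^{-1} = W^T$)

$\mathrm{Cov}(X) = XX^T = (XX^T)^T = \mathrm{Cov}(X)^T$: the covariance matrix of $X$ is symmetric

Comparing with $A = Q \Lambda Q^T$ for $A = \mathrm{Cov}(X)$: $W = Q^T$ and $N = \Lambda$

The orthogonal matrix $W$ is obtained from the eigenvectors of the covariance matrix of $X$!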
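The derivation can be checked numerically. The following is a minimal NumPy sketch, assuming synthetic 2-D data and the slides' unnormalized covariance $XX^T$; the variable names mirror the symbols in the derivation, and the data itself is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, correlated 2-D data; each column is one data point.
X = rng.normal(size=(2, 500))
X[1] += 0.8 * X[0]
X -= X.mean(axis=1, keepdims=True)     # zero mean, as PCA assumes

cov = X @ X.T                          # Cov(X) = X X^T (unnormalized, as in the slides)
eigvals, Q = np.linalg.eigh(cov)       # eigh handles symmetric matrices, ascending eigenvalues
order = np.argsort(eigvals)[::-1]      # reorder by descending variance
eigvals, Q = eigvals[order], Q[:, order]

W = Q.T                                # rows of W are the principal components
Z = W @ X                              # latent representation
N = Z @ Z.T                            # expected to be diagonal and equal to Lambda
```

Here `N` comes out diagonal (up to floating-point error) with the eigenvalues on its diagonal, and `W @ W.T` is the identity, matching $N = \Lambda$ and $W^{-1} = W^T$.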

(10)

Principal Component Analysis: Eigenvalue Decomposition

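● From the eigenvalue decomposition

$A = Q \Lambda Q^T$

$Q^T A Q = \Lambda$

$Q^T A Q = \Lambda^{1/2} \Lambda^{1/2}$

$(\Lambda^{-1/2} Q^T) \, A \, (Q \Lambda^{-1/2}) = I$

● Set the scaled orthogonal matrix

$\tilde{W} = \Lambda^{-1/2} Q^T = \Lambda^{-1/2} W$, where $\Lambda^{-1/2} = \mathrm{diag}(1/\sqrt{\lambda_1}, 1/\sqrt{\lambda_2}, \ldots, 1/\sqrt{\lambda_D})$

$\tilde{Z} = \tilde{W} X = \Lambda^{-1/2} Z$

● The latent space is normalized to have unit variances

$\tilde{Z}\tilde{Z}^T = (\tilde{W}X)(\tilde{W}X)^T = \tilde{W} X X^T \tilde{W}^T = \Lambda^{-1/2}(WX)(X^T W^T)\Lambda^{-1/2} = \Lambda^{-1/2} Z Z^T \Lambda^{-1/2} = I$

The rows of $\tilde{Z}$ become orthonormal ($\tilde{W}$ itself is a scaled rotation and is no longer an orthogonal matrix)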
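A short numerical check of the whitening step, again as a sketch on made-up data: it assumes all eigenvalues are comfortably positive (in practice a small constant is often added before taking the inverse square root, which is not done here).

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic, correlated 3-D data; each column is one data point.
X = rng.normal(size=(3, 400))
X[2] += 0.5 * X[0] - 0.3 * X[1]
X -= X.mean(axis=1, keepdims=True)

eigvals, Q = np.linalg.eigh(X @ X.T)        # Cov(X) = Q Lambda Q^T
W_tilde = np.diag(eigvals ** -0.5) @ Q.T    # scaled matrix: Lambda^{-1/2} Q^T
Z_tilde = W_tilde @ X                       # whitened latent representation

identity_check = Z_tilde @ Z_tilde.T        # expected to be the identity matrix
```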

(11)

Principal Component Analysis in Practice

● In practice, $X$ is a huge matrix in which each column is a data point

○ Computing the covariance matrix $\mathrm{Cov}(X) = XX^T$ is a bottleneck

○ We often randomly sample the input data (a subset of the columns of $X$) before computing it

(12)

Principal Component Analysis in Practice

● Shift the distribution to have zero mean

● The normalization (scaling) step is optional; when it is applied, the procedure is called PCA whitening

Shifting: $X \rightarrow X' = X - \mathrm{mean}(X)$

Rotation: $X' \rightarrow W X'$

Normalization (scaling): $W X' \rightarrow \Lambda^{-1/2} W X'$

(13)

Dimensionality Reduction Using PCA

● We can remove principal components with small variances

○ Sort the variances in the latent space (the eigenvalues) in descending order and remove the tail

○ One strategy is to accumulate the variances starting from the first principal component. When the running sum reaches 90% or 95% of the total variance, remove the remaining dimensions. This can significantly reduce the dimensionality.

● Note that you can reconstruct the original data with some loss

○ You can use PCA as a data compression method

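[Figure: variances (eigenvalues) sorted in descending order; the leading components that account for 95% of the total variance are kept]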
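The cumulative-variance strategy and the lossy reconstruction can be sketched as follows. The function name `pca_reduce`, the 95% default, and the synthetic data (10 observed dimensions generated from 2 underlying ones) are assumptions for illustration.

```python
import numpy as np

def pca_reduce(X, keep=0.95):
    """X: (dim x num_points), columns are data points. Returns (Z_k, W_k, mean)."""
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean
    eigvals, Q = np.linalg.eigh(Xc @ Xc.T)
    order = np.argsort(eigvals)[::-1]              # descending variance
    eigvals, Q = eigvals[order], Q[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()     # accumulated share of the variance
    # Smallest number of components whose accumulated share reaches the threshold.
    k = min(int(np.searchsorted(ratio, keep)) + 1, len(eigvals))
    W_k = Q[:, :k].T
    return W_k @ Xc, W_k, mean

rng = np.random.default_rng(2)
latent = rng.normal(size=(2, 300))                 # two true degrees of freedom
mixing = rng.normal(size=(10, 2))
X = mixing @ latent + 0.01 * rng.normal(size=(10, 300))

Z, W_k, mean = pca_reduce(X)
X_hat = W_k.T @ Z + mean                           # reconstruction with some loss
err = np.linalg.norm(X - X_hat) ** 2 / np.linalg.norm(X - mean) ** 2
```

The relative squared reconstruction error `err` equals the share of variance in the discarded components, so it stays below 5% when the threshold is 95%.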

(14)

Visualization Using PCA

● Take only the first two or three principal components for 2D or 3D visualization

○ A popular feature visualization method, along with t-SNE, for analyzing the latent feature space of a trained deep neural network

source:https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html

(15)

K-Means Clustering

● Grouping the data points into K clusters

○ Each point has a membership to one of the clusters

○ Each cluster has a cluster center (not necessarily one of the data points)

○ The membership is determined by choosing the nearest cluster center

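○ The cluster center is the mean of the data points that belong to the cluster

This is a dilemma: the memberships depend on the cluster centers, and the cluster centers depend on the memberships!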

(16)

K-Means: Definition

● The loss function to minimize is defined as:

$L = \sum_{i=1}^{N} \sum_{k=1}^{K} r_k^{(i)} \, \| x^{(i)} - \mu_k \|^2$

○ Regarded as a problem that learns cluster centers ($\mu_k$) that minimize the loss

○ $r_k^{(i)}$ is the binary indicator of the membership of each data point:

$r_k^{(i)} = 1$ if $k = \arg\min_j \| x^{(i)} - \mu_j \|^2$, and $r_k^{(i)} = 0$ otherwise

● Taking the derivative of the loss $L$ w.r.t. the cluster center $\mu_k$

$\dfrac{dL}{d\mu_k} = -2 \sum_{i=1}^{N} r_k^{(i)} (x^{(i)} - \mu_k) = 0$

$\mu_k = \dfrac{\sum_{i=1}^{N} r_k^{(i)} x^{(i)}}{\sum_{i=1}^{N} r_k^{(i)}}$

Again, we should know the cluster centers (to determine membership) before computing the cluster centers

(17)

Learning Algorithm

● Iterative learning

○ Initialize the cluster centers with random values (a)

○ Compute the memberships of each data point given the cluster centers (b)

○ Update the cluster centers by averaging the data points that belong to them (c)

○ Repeat the two steps above until convergence (d, e, f)

[Figure 9.1 from the PRML book (p. 426), panels (a) to (i); both axes range from −2 to 2]

Figure 9.1 Illustration of the K-means algorithm using the re-scaled Old Faithful data set. (a) Green points denote the data set in a two-dimensional Euclidean space. The initial choices for centres µ₁ and µ₂ are shown by the red and blue crosses, respectively. (b) In the initial E step, each data point is assigned either to the red cluster or to the blue cluster, according to which cluster centre is nearer. This is equivalent to classifying the points according to which side of the perpendicular bisector of the two cluster centres, shown by the magenta line, they lie on. (c) In the subsequent M step, each cluster centre is re-computed to be the mean of the points assigned to the corresponding cluster. (d)–(i) show successive E and M steps through to final convergence of the algorithm.

(The PRML book)

[Figure 9.2 from the PRML book (p. 427); vertical axis: cost function J from 0 to 1000, horizontal axis: iterations 1 to 4]

Figure 9.2 Plot of the cost function J given by (9.1) after each E step (blue points) and M step (red points) of the K-means algorithm for the example shown in Figure 9.1. The algorithm has converged after the third M step, and the final EM cycle produces no changes in either the assignments or the prototype vectors.

The loss monotonically decreases every iteration
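The iterative procedure and the behavior of the loss can be sketched in NumPy as below. This is one possible implementation, not a reference one: the function name `kmeans`, initialization from randomly chosen data points, the fixed number of iterations, and the toy two-blob data are all assumptions of this example.

```python
import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    """X: (num_points x dim). Returns centers, memberships, and the loss per iteration."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # init from data points
    losses = []
    for _ in range(n_iter):
        # Membership step: nearest center under squared Euclidean distance.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        losses.append(d2[np.arange(len(X)), labels].sum())
        # Center step: mean of the points assigned to each cluster.
        for k in range(K):
            if np.any(labels == k):          # keep the old center if a cluster is empty
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels, losses

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(100, 2)),
               rng.normal(2.0, 0.5, size=(100, 2))])
centers, labels, losses = kmeans(X, K=2)
```

Each of the two steps can only lower the loss or leave it unchanged, so the recorded `losses` never increase from one iteration to the next.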

(18)

Data Compression Using K-means

● Vector Quantization

○ The set of cluster centers is called a “codebook”

○ Each sample vector is encoded as a single scalar, the “codebook index” (membership index)

○ The compressed data can be reconstructed using the codebook

○ Example: speech codec (CELP)

■ A component of the speech sound is vector-quantized and the codebook index is transmitted in the speech communication

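Encoding: $x^{(1)}, x^{(2)}, \ldots \rightarrow 3, 5, \ldots$ (codebook indices)

Decoding: $3, 5, \ldots \rightarrow \mu_3, \mu_5, \ldots$ (code vectors)

[Figure: example of a codebook for a 2D Gaussian with 16 code vectors]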

source:https://wiki.aalto.fi/pages/viewpage.action?pageId=149883153
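Encoding and decoding with a codebook can be sketched in plain Python. The four-entry codebook and the sample vectors below are hypothetical; in practice the codebook would be the cluster centers learned by K-means.

```python
def encode(vectors, codebook):
    """Map each vector to the index of its nearest code vector."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
            for v in vectors]

def decode(indices, codebook):
    """Reconstruct (with loss) by looking each index up in the codebook."""
    return [codebook[k] for k in indices]

# Hypothetical codebook of four 2-D code vectors.
codebook = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
samples = [(0.1, 0.9), (0.8, 0.2), (0.95, 1.1)]

indices = encode(samples, codebook)        # one small integer per sample vector
reconstructed = decode(indices, codebook)  # nearest code vectors, not the originals
```

Only the indices need to be stored or transmitted, provided that the encoder and the decoder share the same codebook.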

(19)

Codebook-based Feature Summarization

● Compute the histogram of codebook index

○ Represent the codebook index as a one-hot vector

■ If K is large, it is regarded as a sparse representation of the features

○ Useful for summarizing a long sequence of frame-level features

■ Often called “a bag of features” (computer vision) or “a bag of words” (NLP)

[Figure: each frame $x^{(1)}, x^{(2)}, \ldots$ is encoded as a K-dimensional one-hot vector (e.g., 0 0 0 0 1 0 … 0); the one-hot vectors are summarized into a histogram, the “bag of features”]

source:https://towardsdatascience.com/bag-of-visual-words-in-a-nutshell-9ceea97ce0fb
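A minimal sketch of the histogram summary: summing one-hot vectors is the same as counting how often each codebook index occurs. The function name `bag_of_features`, the normalization option, and the index sequence are assumptions for illustration.

```python
def bag_of_features(indices, K, normalize=True):
    """Histogram of codebook indices: a K-dimensional summary of a sequence."""
    hist = [0.0] * K
    for k in indices:
        hist[k] += 1.0                 # same as adding the one-hot vector of index k
    if normalize and indices:
        hist = [h / len(indices) for h in hist]
    return hist

# Hypothetical codebook indices of six consecutive frames, with K = 8.
indices = [3, 5, 3, 0, 5, 3]
hist = bag_of_features(indices, K=8)
```

The histogram has length K regardless of the sequence length, and it discards the order of the frames.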

(20)

Gaussian Mixture Model (GMM)

● Fit a set of multivariate Gaussian distributions to the data

○ Similar to K-means clustering but it learns not only the cluster centers (means) but also the covariance of distribution in the clusters

○ The membership is a soft assignment as a multinomial distribution

■ The multinomial distribution is regarded as mapping on a latent space

[Figure: clustering results of K-means and GMM shown side by side]

(21)

Gaussian Mixture Model (GMM)

● Replace the hard assignment with a multinomial distribution

● Replace a single cluster with a multivariate Gaussian distribution

○ Mean and covariance

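Hard assignment (K-means): $r_k \in \{0, 1\}$

Soft assignment (GMM): $\pi_k = P(z_k)$, with $\sum_{k=1}^{K} \pi_k = 1$

Distance to a cluster center (K-means): $d(x, \mu_k) = \| x - \mu_k \|^2$

Gaussian likelihood (GMM), where $D$ is the feature dimensionality:

$P(x \mid z_k) = \dfrac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right)$

[Figure: membership distribution over the clusters 1, 2, 3, 4, ..., K]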

(22)

Gaussian Mixture Model (GMM)

● The likelihood of a data point can be computed as a mixture of Gaussians

● Fit this model to data by maximum likelihood estimation

○ Equivalent to minimizing the negative log likelihood (this is the loss function)

○ This model fitting is called density estimation

● GMM is also called a latent variable model

○ z is a latent variable: regarded as hidden causes of the data distribution

$p(x) = \sum_{z} p(x, z) = \sum_{z} p(z) \, p(x \mid z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$
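To make the mixture likelihood concrete, here is a sketch for the one-dimensional case, where each covariance reduces to a scalar variance. The parameter values and the data are invented for illustration; the helper names are not from any library.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gmm_pdf(x, weights, means, variances):
    """p(x) = sum_k pi_k N(x | mu_k, var_k)."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

def negative_log_likelihood(data, weights, means, variances):
    """The loss minimized by maximum likelihood estimation."""
    return -sum(math.log(gmm_pdf(x, weights, means, variances)) for x in data)

# Hypothetical two-component mixture and a few data points.
weights, means, variances = [0.3, 0.7], [-2.0, 1.0], [0.5, 1.0]
data = [-2.1, -1.8, 0.9, 1.2, 1.5]

nll_good = negative_log_likelihood(data, weights, means, variances)
nll_bad = negative_log_likelihood(data, weights, [5.0, 8.0], variances)
```

Parameters that place the Gaussians near the data give a lower negative log-likelihood (`nll_good`) than parameters that place them far away (`nll_bad`).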

(23)

Learning Algorithm: K-Means

● Iterative learning

(24)

Learning Algorithm: GMM

● Iterative learning

○ Expectation (E step): compute the soft membership given the Gaussian distribution parameters

○ Maximization (M step): update the clusters (the Gaussian distribution parameters) by maximizing the likelihood given the membership

(25)

Learning Algorithm

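● Initialize the parameters $\theta \in \{\pi_k, \mu_k, \Sigma_k\}$

● E-step

○ Evaluate the “soft” membership of samples given the Gaussian distributions

$\gamma_k^{(i)} = \dfrac{\pi_k \, \mathcal{N}(x^{(i)} \mid \mu_k, \Sigma_k)}{\sum_{j} \pi_j \, \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)}$

● M-step

○ Update the parameters that maximize the log-likelihood

$N_k = \sum_{i} \gamma_k^{(i)}$ (soft count of the members of cluster $k$)

$\mu_k = \dfrac{1}{N_k} \sum_{i} \gamma_k^{(i)} x^{(i)}$, $\quad \Sigma_k = \dfrac{1}{N_k} \sum_{i} \gamma_k^{(i)} (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^T$ (Gaussian dist.: mean and covariance of each cluster)

$\pi_k = \dfrac{N_k}{N}$ (multinomial dist.)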
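The E-step and M-step updates can be sketched for the one-dimensional case, where each covariance becomes a scalar variance. This is a simplified illustration: the data, the initial parameters, and the fixed 20 iterations are assumptions, and no safeguard against collapsing variances is included.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_likelihood(data, pis, mus, variances):
    return sum(math.log(sum(p * gaussian_pdf(x, m, v)
                            for p, m, v in zip(pis, mus, variances)))
               for x in data)

def em_step(data, pis, mus, variances):
    K, N = len(pis), len(data)
    # E-step: soft memberships gamma[i][k] given the current Gaussians.
    gammas = []
    for x in data:
        joint = [pis[k] * gaussian_pdf(x, mus[k], variances[k]) for k in range(K)]
        total = sum(joint)
        gammas.append([j / total for j in joint])
    # M-step: re-estimate the parameters from the soft memberships.
    Nk = [sum(g[k] for g in gammas) for k in range(K)]
    mus = [sum(g[k] * x for g, x in zip(gammas, data)) / Nk[k] for k in range(K)]
    variances = [sum(g[k] * (x - mus[k]) ** 2 for g, x in zip(gammas, data)) / Nk[k]
                 for k in range(K)]
    pis = [Nk[k] / N for k in range(K)]
    return pis, mus, variances

# Hypothetical 1-D data with two groups, and a rough initialization.
data = [-2.2, -2.0, -1.9, -1.7, -2.1, 1.8, 2.0, 2.1, 2.3, 1.9, 2.2]
pis, mus, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
lls = []
for _ in range(20):
    lls.append(log_likelihood(data, pis, mus, variances))
    pis, mus, variances = em_step(data, pis, mus, variances)
```

The recorded log-likelihood values in `lls` do not decrease over the iterations, and the means move to the centers of the two groups.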

(26)

Classification Using GMM

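● Training: fit one GMM to each class of data

$P(x \mid y = c_1; \theta_{c_1}), \; P(x \mid y = c_2; \theta_{c_2}), \; P(x \mid y = c_3; \theta_{c_3}), \; \ldots$

● Test: use Bayes’ rule for classification

$\hat{y} = \arg\max_{y} P(y \mid x = x_i) = \arg\max_{y} \dfrac{P(x = x_i \mid y) \, p(y)}{p(x = x_i)}$

$\hat{y} = \arg\max_{y} P(x = x_i \mid y) \, p(y)$, since $p(x = x_i)$ does not depend on $y$

$p(y)$: prior distribution of each class

If you don’t have any information on the prior, you can ignore $p(y)$ by assuming that all classes are equally likely.

[Figure: one GMM fitted to each of the classes $c_1$, $c_2$, $c_3$]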
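The decision rule can be sketched for one-dimensional features with hand-set per-class mixtures. Everything concrete here is hypothetical: the class names, the mixture parameters (which would normally come from fitting one GMM per class), and the priors.

```python
import math

def gmm_log_likelihood(x, gmm):
    """log P(x | y) for a 1-D GMM given as (weights, means, variances)."""
    weights, means, variances = gmm
    p = sum(w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances))
    return math.log(p)

def classify(x, class_gmms, priors=None):
    """argmax over classes of log P(x | y) + log p(y)."""
    classes = list(class_gmms)
    if priors is None:                       # no prior knowledge: uniform p(y)
        priors = {c: 1.0 / len(classes) for c in classes}
    scores = {c: gmm_log_likelihood(x, class_gmms[c]) + math.log(priors[c])
              for c in classes}
    return max(scores, key=scores.get)

# Hypothetical class-conditional GMMs.
class_gmms = {
    "c1": ([0.5, 0.5], [-3.0, -1.0], [0.5, 0.5]),
    "c2": ([1.0], [2.0], [1.0]),
}

pred_uniform = classify(0.5, class_gmms)
pred_prior = classify(0.5, class_gmms, priors={"c1": 0.9, "c2": 0.1})
```

With a uniform prior the point 0.5 is assigned to the class with the higher likelihood, while a strong prior in favor of the other class can change the decision.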