Machine Learning: Clustering Algorithms - a Review


Satyanarayana Medicherla, Vol 7 Issue 8, pp 33-45 August 2019


Satyanarayana Medicherla, Consultant, Canon U.S.A. Inc, 1, Canon Park, New York - 11747, email- [email protected]

Abstract

Clustering is a machine learning technique that is used to group a set of data points. It is one of the most popular and simplest unsupervised machine learning techniques. In unsupervised learning, the data set does not have labels or an outcome variable. We try to group the data into different clusters by finding the similarities of the items [1]. The similarity is computed using a distance function. The goal of a clustering algorithm is to form groups whose items are most similar to each other and different from the items of other groups [2]. We will discuss some of the commonly used clustering techniques. Market segmentation, customer segmentation, document classification and news classification are some of the important applications of clustering.

Introduction

Clustering is an unsupervised learning technique and hence there are no outcome labels attached to the data set. It is a brute-force method in the sense that it has to calculate the distances between the training examples in order to decide the clusters.

Once the clustering process is completed, we label each of our training examples with its computed cluster number. The way a set of items is grouped into clusters differs from one clustering technique to another.

We are going to discuss K-means clustering, K-medoids clustering, K-modes clustering and Hierarchical clustering. We will apply different clustering techniques depending upon the characteristics of the data set.

K-Means Clustering Algorithm

We start the process by initializing K arbitrarily selected centroids. We compute the distance of each of the data points from the cluster centroids and assign each data point to its nearest cluster centroid. Then we re-compute new centroids for each of the K clusters by computing the mean of the feature vectors within each cluster. After computing the new centroids, we again compute the distance of each data point from the newly computed centroids and repeat the cluster assignments. This process is repeated until the cluster assignments no longer change. At that point, the items within each cluster are similar to each other and different from the items of other clusters. We will discuss the elbow method to find the optimal number of clusters.

We are given a set of m training examples, each described by a feature vector of n features. The feature vectors are of the form

$$x_1 = (x_{11}, x_{12}, x_{13}, \ldots, x_{1n})$$
$$x_2 = (x_{21}, x_{22}, x_{23}, \ldots, x_{2n})$$
$$\vdots$$
$$x_m = (x_{m1}, x_{m2}, x_{m3}, \ldots, x_{mn})$$

Let us say we want to find the distance between the examples $x_i$ and $x_j$. We can use any of the following:

The Euclidean distance between the two examples $x_i$ and $x_j$ is

$$D(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$$

or the Cosine Similarity between $x_i$ and $x_j$ is

$$\text{Cosine Similarity}(x_i, x_j) = \frac{\sum_{k=1}^{n} x_{ik}\, x_{jk}}{\sqrt{\sum_{k=1}^{n} x_{ik}^2}\;\sqrt{\sum_{k=1}^{n} x_{jk}^2}}$$

or any other distance metric like Gower’s distance, which is discussed later in this article.
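The Euclidean distance and the cosine similarity described above can be sketched in a few lines of base R. The function names `euclidean_dist` and `cosine_sim` and the two example vectors are illustrative choices, not part of the original article.

```r
# Euclidean distance between two numeric feature vectors of equal length
euclidean_dist <- function(xi, xj) {
  sqrt(sum((xi - xj)^2))
}

# Cosine similarity between two numeric feature vectors of equal length
cosine_sim <- function(xi, xj) {
  sum(xi * xj) / (sqrt(sum(xi^2)) * sqrt(sum(xj^2)))
}

# illustrative vectors (assumed values)
xi <- c(1, 2, 2)
xj <- c(4, 6, 2)

euclidean_dist(xi, xj)  # sqrt(9 + 16 + 0) = 5
cosine_sim(xi, xj)      # 20 / (3 * sqrt(56))
```

Note that the Euclidean distance grows as items become less alike, whereas the cosine similarity grows as they become more alike, so the two cannot be swapped without adjusting the comparison.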

Algorithm Steps [3]

Let us say we have m training examples as mentioned above, $x_1, x_2, x_3, \ldots, x_m \in R^n$. These training examples are assigned to K different clusters. The cluster assignments of these training examples are $c_1, c_2, c_3, \ldots, c_m \in \{1, \ldots, K\}$, one per training example. The cluster centroids are $\mu_1, \mu_2, \mu_3, \ldots, \mu_K \in R^n$.

• repeat

– for i = 1 to m

• $c_i$ = the number of the cluster whose centroid is closest to the i-th training example

– for k = 1 to K

• $\mu_k$ = mean of the training examples assigned to the k-th cluster

• until no change in the cluster assignments
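The steps above can be sketched compactly in base R. This is only a minimal illustration, not the implementation behind R's `kmeans`: the function name `simple_kmeans`, the `max_iter` safeguard, the choice of K random training examples as initial centroids, and the tiny data set are all assumptions made for the example, and it expects K to be at least 2.

```r
# Minimal K-means sketch; X is a numeric matrix with one row per training example
simple_kmeans <- function(X, K, max_iter = 100) {
  X <- as.matrix(X)
  # initialise the centroids with K randomly chosen training examples
  centers <- X[sample(nrow(X), K), , drop = FALSE]
  assignment <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # squared Euclidean distance from every example to every centroid (m x K matrix)
    d2 <- sapply(seq_len(K), function(k) {
      rowSums((X - matrix(centers[k, ], nrow(X), ncol(X), byrow = TRUE))^2)
    })
    # assignment step: index of the closest centroid for each example
    new_assignment <- max.col(-d2, ties.method = "first")
    # stop when no assignment changes
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # update step: move each centroid to the mean of its assigned examples
    for (k in seq_len(K)) {
      members <- X[assignment == k, , drop = FALSE]
      if (nrow(members) > 0) centers[k, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers)
}

# two well-separated groups of three points each (assumed values)
set.seed(42)
X_demo <- rbind(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11), c(11, 10))
fit <- simple_kmeans(X_demo, 2)
fit$cluster
fit$centers
```

The cluster numbers themselves are arbitrary; only the grouping matters, so the first three points may end up as cluster 1 or cluster 2 depending on the random initialization.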

Optimization Objective [4]

$\mu_{c_i}$ is the centroid of the cluster to which our training example $x_i$ is assigned.

Our optimization objective is to minimize the following cost or distance function:

$$J = \frac{1}{m} \sum_{i=1}^{m} \lVert x_i - \mu_{c_i} \rVert^2$$
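As a small worked illustration of the cost function J, the sketch below computes it for a hand-made assignment. The helper name `kmeans_cost` and the four data points are assumptions for the example.

```r
# Cost J: mean squared distance from each example to the centroid of its cluster
kmeans_cost <- function(X, assignment, centers) {
  X <- as.matrix(X)
  # pick, for every example, the centroid of the cluster it is assigned to
  assigned_centers <- as.matrix(centers)[assignment, , drop = FALSE]
  mean(rowSums((X - assigned_centers)^2))
}

# four points, two clusters (assumed values)
X <- cbind(c(0, 0, 4, 4), c(0, 2, 0, 2))
assignment <- c(1, 1, 2, 2)
centers <- rbind(c(0, 1), c(4, 1))

kmeans_cost(X, assignment, centers)  # every point is 1 unit from its centroid, so J = 1
```

The `tot.withinss` value returned by R's `kmeans` is the same sum of squared distances without the division by m, so it equals m times J.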

Clustering for Dataset with all Numeric Features

Cluster Assignments using R - a simple example

Let us take a very simple example and see how R’s K-means clustering works. In this example, we have generated data points that form two different clusters. We will use R’s kmeans function to identify the two clusters and visualize them graphically.

setwd("C:/Coursera/R/MachineLearning")

library(ggplot2)

library(gridExtra)

x =c(1,1,2,2,2,3,3,4,4,5,5,6,6,7)

y =c(1,2,1,2,3,1,2,7,8,7,8,8,9,8)

data <-data.frame(x,y)

plot1 <-ggplot(data) +geom_point(aes(x=x,y=y),

size=2) +

labs(title ="Plot of data set",x="x-values" ,y="y- values")

cluster <-kmeans(data,2)

str(cluster)

## List of 9

## $ cluster : int [1:14] 1 1 1 1 1 1 1 2 2 2 ...

## $ centers : num [1:2, 1:2] 2 5.29 1.71 7.86

## ..- attr(*, "dimnames")=List of 2

## .. ..$ : chr [1:2] "1" "2"

## .. ..$ : chr [1:2] "x" "y"

## $ totss : num 188

## $ withinss : num [1:2] 7.43 10.29

## $ tot.withinss: num 17.7

## $ betweenss : num 170

## $ size : int [1:2] 7 7

## $ iter : int 1

## $ ifault : int 0

## - attr(*, "class")= chr "kmeans"

plot2 <- ggplot(data) +
  geom_point(aes(x = x, y = y), col = cluster$cluster, shape = cluster$cluster, size = 3) +
  geom_point(aes(x = cluster$centers[1, 1], y = cluster$centers[1, 2]), col = I("black"), shape = 15, size = 3) +
  geom_point(aes(x = cluster$centers[2, 1], y = cluster$centers[2, 2]), col = I("red"), shape = 17, size = 3) +
  labs(title = "Plot of data set with Centroids", x = "x-values", y = "y-values")

grid.arrange(plot1, plot2, ncol = 2)

Now let us visualize the cluster graphically.

library(factoextra)

## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ

fviz_cluster(cluster, data = data,
  ellipse.type = "convex",
  palette = "jco",
  ggtheme = theme_minimal())

Manual Cluster Assignments

We will calculate the distance matrix manually and cluster the items manually. We will compare the manual and R’s clustering assignments.

Data set ‘iris’ - description provided by R

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Usage iris

iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

• sepal length in cm

• sepal width in cm

• petal length in cm

• petal width in cm

• class:

– Iris Setosa

– Iris Versicolour

– Iris Virginica

Let us plot the data to show the possible clusters with respect to some of the features of the iris data set.

library(ggplot2)
library(gridExtra)

# species against petal length and petal width
scatter1 <- ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(aes(color = Species, shape = Species)) +
  xlab("Petal Length") + ylab("Petal Width") +
  ggtitle("Species/Petal Len-Wdth")

# species against sepal length and sepal width
scatter2 <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species, shape = Species)) +
  xlab("Sepal Length") + ylab("Sepal Width") +
  ggtitle("Species/Sepal Len-Wdth")

grid.arrange(scatter1, scatter2, ncol = 2)

The plots clearly suggest three possible clusters. Now let us assign clusters to our data set manually. We loop through each data point and find its distance from each of the initial cluster centers. We assign the data point to the closest cluster centroid and then re-calculate the cluster centers. We repeat the same process until there are no re-assignments or until the within-cluster distance changes very little.

data("iris")

# scale all the variables

scaling_function <-function(x)

{

return ((x-min(x))/(max(x)-min(x))) }

iris_norm<-

as.data.frame(lapply(iris[,c(1,2,3,4)],scaling_function ))

# distance calculation function

euclidean_distance <-function(x1,x2) {

return(

sqrt((x1$Sepal.Length -x2$Sepal.Length)^2+

(x1$Sepal.Width -x2$Sepal.Width)^2+

(x1$Petal.Length -x2$Petal.Length)^2+

(x1$Petal.Width -x2$Petal.Width)^2

) ) }

# add and initialize the cluster and distance columns
iris_norm$cluster <- 0
iris_norm$dist <- 999

# initialize the cluster centers
centers <- rbind(iris_norm[1,], iris_norm[75,], iris_norm[150,])

# save the previous cluster assignment
prev_clusters_assignments <- iris_norm

for (l in 1:100)
{
  clusters_changed <- FALSE

  # assign each data point to the closest cluster center
  # note: dist is not reset between passes, so it holds the smallest distance recorded so far
  for (i in 1:150)
  {
    for (j in 1:3)
    {
      dist <- euclidean_distance(centers[j,], iris_norm[i,])
      if (dist < iris_norm[i,]$dist)
      {
        iris_norm[i,]$cluster <- j
        iris_norm[i,]$dist <- dist
      }
    }
  }

  # check if there are any changes in cluster assignments
  # if no changes, then stop
  for (i in 1:150)
  {
    if (iris_norm[i,]$cluster != prev_clusters_assignments[i,]$cluster)
    {
      clusters_changed <- TRUE
      prev_clusters_assignments <- iris_norm
      break
    }
  }

  if (!clusters_changed)
  {
    print("No changes in the cluster assignments")
    flush.console()
    break
  }

  # recompute the cluster centers
  for (j in 1:3)
  {
    centers[j,]$Sepal.Length <- mean(iris_norm[iris_norm$cluster == j,]$Sepal.Length)
    centers[j,]$Sepal.Width <- mean(iris_norm[iris_norm$cluster == j,]$Sepal.Width)
    centers[j,]$Petal.Length <- mean(iris_norm[iris_norm$cluster == j,]$Petal.Length)
    centers[j,]$Petal.Width <- mean(iris_norm[iris_norm$cluster == j,]$Petal.Width)
  }
}

## [1] "No changes in the cluster assignments"

# list the cluster centers

centers

##     Sepal.Length Sepal.Width Petal.Length Petal.Width cluster dist
## 1      0.1961111   0.5950000   0.07830508  0.06083333       0  999
## 75     0.4450113   0.2976190   0.55378762  0.50340136       0  999
## 150    0.6410675   0.4264706   0.76603523  0.80392157       0  999

# add the Species label back to the normalized dataframe

iris_norm$Species <-iris$Species

# generate the comparison table by comparing the existing cluster labels and computed cluster assignments

table(iris_norm$cluster,iris_norm$Species)

##

## setosa versicolor virginica

## 1 50 0 0

## 2 0 45 4

## 3 0 5 46

K-Means on iris data set described above

data("iris")

scaling_function <-function(x)

{

return ((x-min(x))/(max(x)-min(x))) }

iris_norm<-

as.data.frame(lapply(iris[,c(1,2,3,4)],scaling_function ))

iris_scaled <- iris_norm
iris_scaled$Species <- iris$Species

# total within-cluster sum of squares for 2 to 10 clusters
within_cluster_dist <- c()
for (i in 2:10)
{
  results <- kmeans(iris_norm, i)
  within_cluster_dist[i] <- results$tot.withinss
}

data <- data.frame(clusters = 2:10, within_ss = within_cluster_dist[2:10])

# the + must end a line, not start one, for the layers to be added to the plot
ggplot(data, aes(x = clusters, y = within_ss)) +
  geom_point() +
  geom_line() +
  labs(title = "Clusters vs Within SS", x = "Number of Clusters", y = "Within Cluster SS")

We can clearly see an elbow at 3, indicating that the ideal number of clusters for this data set is 3. If we increase the number of clusters beyond 3, the within-cluster sum of squares decreases further, but the rate of decrease slows down.

set.seed(111)

iris_kmeans_3 <- kmeans(iris_norm, 3)

library(factoextra)

fviz_cluster(iris_kmeans_3, data = iris_norm,

ellipse.type ="convex",

palette ="jco",

ggtheme =theme_minimal())

K-Medoids Clustering

K-means algorithm is very sensitive to outliers. As a result, any outlier would change the cluster assignments substantially by affecting the computation of the mean. As we have noticed, K-means clustering involves the computation of distances and the means.

When we have data of mixed data types like numerical and categorical variables, the standard K-means clustering does not work because we cannot compute the distance and mean of those data elements directly.

We will use the K-medoids clustering method to cluster this type of data. We can supply our own function to compute the distances or use R’s daisy function, which can compute Gower’s distance.


For example, if we have a set of numbers (1,3,5,7,9), the medoid is 5 and the mean is also 5. If we have (1,3,5,7,50), the mean is 13.2 but the medoid is still 5, irrespective of the magnitude of the fifth number, which is an outlier. The medoid remains among the bulk of the data points, but the mean has shifted away from them towards the outlier. This is why K-means clustering is very sensitive to outliers while K-medoids clustering is far less so.
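The example above can be checked with a short base R sketch. The helper `medoid_1d` is a hypothetical name; it takes the medoid to be the element with the smallest total absolute distance to all other elements.

```r
# Medoid of a numeric vector: the element whose total distance to all elements is smallest
medoid_1d <- function(x) {
  total_dist <- rowSums(abs(outer(x, x, "-")))
  x[which.min(total_dist)]
}

clean <- c(1, 3, 5, 7, 9)
with_outlier <- c(1, 3, 5, 7, 50)

mean(clean)              # 5
medoid_1d(clean)         # 5
mean(with_outlier)       # 13.2
medoid_1d(with_outlier)  # 5
```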

K-Medoids Clustering Algorithm Steps [5]

In the case of K-means, we randomly initialize K cluster centroids, which need not be actual training examples.

In the case of K-medoids clustering, we randomly pick K training examples from the training set itself as the initial medoids.

• repeat

– for i = 1 to m

• $c_i$ = the number of the cluster whose medoid is closest to the i-th training example

– for each of the non-medoid elements

• compute the total cost after swapping the medoid with this non-medoid element

• if the new cost is less than the current cost, keep this element as the new medoid

• until no change in the cluster assignments

The K-medoids algorithm scales poorly to huge amounts of data: its complexity is of the order of $O(K(N-K)^2)$, because each of the K medoids is tentatively swapped with each of the N-K non-medoid items and the distances are re-computed. We will use Gower’s distance as the distance measure for K-medoids clustering. Let us briefly discuss Gower’s distance function.

Gower’s Distance [6][7]

A distance function such as the Euclidean distance is applicable only to continuous numerical variables; it does not work for categorical or mixed variables. We will use Gower’s distance for mixed variables.

For quantitative variables, the Manhattan distance is used. For ordinal variables, numeric ranks are assigned and then the Manhattan distance is used. For nominal variables, the variable is binary encoded by creating k different columns for the k possible values, and then the Dice coefficient is used.

Simple Matching Similarity, Jaccard Coefficient and Dice Coefficient

Distance based on Simple Matching Similarity

$$D = 1 - \frac{a + d}{a + b + c + d}$$

Distance based on Jaccard Coefficient

$$D = 1 - \frac{a}{a + b + c}$$

Distance based on Dice Coefficient

$$D = 1 - \frac{2a}{2a + b + c}$$

where

a - the number of places where both item-1 and item-2 have 1

b - the number of places where item-1 has 1 and item-2 has 0

c - the number of places where item-1 has 0 and item-2 has 1

d - the number of places where both item-1 and item-2 have 0
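The three distances defined above can be computed together in base R, as in the following sketch. The function name `binary_distances`, the count variable names and the two example vectors are illustrative assumptions.

```r
# Distances between two binary (0/1) vectors of equal length
binary_distances <- function(item1, item2) {
  n_a <- sum(item1 == 1 & item2 == 1)  # a: both have 1
  n_b <- sum(item1 == 1 & item2 == 0)  # b: item-1 has 1, item-2 has 0
  n_c <- sum(item1 == 0 & item2 == 1)  # c: item-1 has 0, item-2 has 1
  n_d <- sum(item1 == 0 & item2 == 0)  # d: both have 0
  c(simple_matching = 1 - (n_a + n_d) / (n_a + n_b + n_c + n_d),
    jaccard         = 1 - n_a / (n_a + n_b + n_c),
    dice            = 1 - 2 * n_a / (2 * n_a + n_b + n_c))
}

# assumed example: a = 2, b = 1, c = 1, d = 1
item1 <- c(1, 1, 0, 0, 1)
item2 <- c(1, 0, 0, 1, 1)
binary_distances(item1, item2)  # 0.4, 0.5 and 1/3
```

The Jaccard and Dice distances ignore the places where both items have 0, so they are undefined (0/0) when neither item contains a 1.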

Let us generate some data of binary digits and calculate the Gower’s distance.

library(cluster)
set.seed(111)

df <- data.frame(replicate(3, sample(c(0,1), 3, replace = TRUE)))

# convert every column to a factor so that daisy treats it as nominal
for (i in 1:ncol(df)) df[,i] = as.factor(df[,i])

df

## X1 X2 X3

## 1 1 0 0

## 2 0 0 0

## 3 1 0 0

daisy(df,metric="gower")

## Dissimilarities :


## 1 2

## 2 0.3333333

## 3 0.0000000 0.3333333

##

## Metric : mixed ; Types = N, N, N

## Number of objects : 3

Since all three variables here are nominal, Gower’s distance between any two data items is calculated as

$$D = 1 - \frac{a + d}{a + b + c + d}$$

where $\frac{a + d}{a + b + c + d}$ is the simple matching similarity, and a, b, c and d are as defined in the previous section.

For data points 1 and 2, we have

a = 0, b = 1, c = 0, d = 2

so the dissimilarity is D = 1 - (a+d)/(a+b+c+d) = 1 - 2/3 = 0.3333333, which matches the value reported by daisy.

Let us use the HR data set that can be downloaded from Kaggle [8].

Here is the structure of the HR data.

hr_data <- read.csv("HR_comma_sep.csv")
str(hr_data)

## 'data.frame': 14999 obs. of 10 variables:
##  $ satisfaction_level   : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int 2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int 3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int 0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int 1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...

As the tables below show, Work_accident, left and promotion_last_5years are binary columns, while number_project and time_spend_company take only a few distinct values and can be treated as categorical variables. Let us convert all of them to factor variables.

table(hr_data$number_project)
##
##    2    3    4    5    6    7
## 2388 4055 4365 2761 1174  256

table(hr_data$Work_accident)
##
##     0     1
## 12830  2169

table(hr_data$left)
##
##     0     1
## 11428  3571

table(hr_data$promotion_last_5years)
##
##     0     1
## 14680   319

table(hr_data$time_spend_company)
##
##    2    3    4    5    6    7    8   10
## 3244 6443 2557 1473  718  188  162  214

hr_data$number_project <-

as.factor(hr_data$number_project)

hr_data$Work_accident <-

as.factor(hr_data$Work_accident)

hr_data$left <-as.factor(hr_data$left)

hr_data$promotion_last_5years <-

as.factor(hr_data$promotion_last_5years)

hr_data$time_spend_company <-

as.factor(hr_data$time_spend_company)

Now those variables are converted to factor variables.

We will choose 1000 rows for our clustering and study the average silhouette width for different numbers of clusters.

set.seed(111)
library(cluster)

hr_sample <- hr_data[sample(1:nrow(hr_data), 1000, replace = FALSE),]

# Gower's distance, since the data contain both numeric and factor columns
dist <- daisy(hr_sample, metric = "gower")

Now we use R’s Partitioning Around Medoids (PAM) function, pam, to cluster the training examples.

sil_width <- c()
for (i in 2:10)
{
  pam_fit <- pam(dist, diss = TRUE, k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}

sil_width
##  [1]         NA 0.09555267 0.12611918 0.16457036 0.15855896 0.17011730
##  [7] 0.18774302 0.19461600 0.18510932 0.18794090

data <- data.frame(clusters = 2:10, sil_width = sil_width[2:10])

ggplot(data, aes(x = clusters, y = sil_width)) +
  geom_point() +
  geom_line() +
  labs(title = "Average Silhouette Width", x = "Number of Clusters", y = "Average Silhouette Width")

The above plot shows a local peak at four clusters. Now let us cluster the HR data into four clusters and visualize the clusters graphically.

km<-pam(hr_data,diss=F,4)

fviz_cluster(km, data = hr_data,palette

="jco",ggtheme =theme_minimal()) ### Ref. [9]

We can observe that 4 is a reasonable number of clusters: the average silhouette width has a local maximum there (0.165). It is slightly higher for six or more clusters, peaking at 0.195 for eight clusters, but the improvement is small.

K-Modes Clustering [10]

K-modes algorithm is very similar to K-means algorithm. Unlike K-means, K-modes makes use of dissimilarities instead of distances and K-modes uses modes instead of means. K-modes clustering cannot handle continuous variables. For example, if we have three vectors A,B and C as follows,

A = (a,b,c,d,e,f)

B = (a,x,c,y,e,z)

C = (b,b,x,z,f,g)

The distance between A and B can be computed as follows [11]

• Compare the corresponding elements of A and B

• Add 0 when the elements are the same

• Add 1 when they are different

So for A and B,

A = (a,b,c,d,e,f)

B = (a,x,c,y,e,z)

(a==a, b==x, c==c, d==y, e==e, f==z)

0+1+0+1+0+1 = 3

A generalization:

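$$x_1 = (x_{11}, x_{12}, x_{13}, \ldots, x_{1n})$$
$$x_2 = (x_{21}, x_{22}, x_{23}, \ldots, x_{2n})$$
$$\vdots$$
$$x_m = (x_{m1}, x_{m2}, x_{m3}, \ldots, x_{mn})$$

The distance between $x_i$ and $x_j$ is

$$d(x_i, x_j) = \sum_{k=1}^{n} \delta(x_{ik}, x_{jk}), \qquad \delta(x_{ik}, x_{jk}) = \begin{cases} 0 & \text{if } x_{ik} = x_{jk} \\ 1 & \text{otherwise} \end{cases}$$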
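The K-modes distance described above amounts to counting the positions at which two vectors differ. A minimal base R sketch follows; the function name `matching_dissimilarity` is an illustrative choice, and the vectors A, B and C are the ones from the example.

```r
# K-modes dissimilarity: number of positions at which two categorical vectors differ
matching_dissimilarity <- function(u, v) {
  sum(u != v)
}

A <- c("a", "b", "c", "d", "e", "f")
B <- c("a", "x", "c", "y", "e", "z")
C <- c("b", "b", "x", "z", "f", "g")

matching_dissimilarity(A, B)  # 3, as computed by hand above
matching_dissimilarity(A, C)  # 5
matching_dissimilarity(B, C)  # 6
```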

The mode M of A, B and C is computed as follows [11]

• Each element of M is obtained from the corresponding elements of the vectors A, B and C

• We pick the value that occurs the most number of times

• When there is a tie, we can pick any one of the tied values

A = (a,b,c,d,e,f)

B = (a,x,c,y,e,z)

C = (b,b,x,z,f,g)

The mode M = (a,b,c,d,e,f)

Let us say that there are k elements in the i-th cluster as follows:

$$x_1 = (x_{11}^i, x_{12}^i, x_{13}^i, \ldots, x_{1n}^i)$$
$$x_2 = (x_{21}^i, x_{22}^i, x_{23}^i, \ldots, x_{2n}^i)$$
$$\vdots$$
$$x_k = (x_{k1}^i, x_{k2}^i, x_{k3}^i, \ldots, x_{kn}^i)$$

Now we will compute each element of the mode from the above as follows:

• for p = 1 to n

– over j = 1 to k, note down the most common value of $x_{jp}$

– the p-th element of the mode M = the above noted most common value
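The mode computation above can be sketched in base R as follows. The names `column_mode` and `cluster_mode` are hypothetical, and ties are broken here by taking the value that appears first in the cluster, which is only one of the permitted choices.

```r
# Most common value in one position; ties go to the value that appears first
column_mode <- function(values) {
  counts <- table(factor(values, levels = unique(values)))
  names(counts)[which.max(counts)]
}

# Mode of a cluster: members is a character matrix with one row per cluster element
cluster_mode <- function(members) {
  apply(members, 2, column_mode)
}

members <- rbind(A = c("a", "b", "c", "d", "e", "f"),
                 B = c("a", "x", "c", "y", "e", "z"),
                 C = c("b", "b", "x", "z", "f", "g"))

cluster_mode(members)  # "a" "b" "c" "d" "e" "f"
```

With this tie-breaking rule the result agrees with the mode M = (a,b,c,d,e,f) given for A, B and C, where the fourth and sixth positions are three-way ties.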

Let us use the Mushrooms data set, which is available on Kaggle [12]. We will cluster this data set using the kmodes function from the klaR package.

library(klaR)

## Loading required package: MASS

mushrooms_data <- read.csv('mushrooms.csv')

dim(mushrooms_data)

## [1] 8124 23

table(mushrooms_data$class)

##

## e p

## 4208 3916

mushrooms_sample <- mushrooms_data[sample(1:nrow(mushrooms_data), 1000, replace = FALSE),]

# remove the class variable, which is the class label, and veil.type
mushrooms_data_clean <- subset(mushrooms_sample, select = -c(class, veil.type))

kmodes_cluster <- kmodes(mushrooms_data_clean, 2, iter.max = 1000, weighted = FALSE)

str(kmodes_cluster)

## List of 6
##  $ cluster   : int [1:1000] 2 1 1 2 2 1 1 2 2 2 ...
##  $ size      : 'table' int [1:2(1d)] 347 653
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ cluster: chr [1:2] "1" "2"
##  $ modes     :'data.frame': 2 obs. of 21 variables:
##   ..$ cap.shape               : Factor w/ 6 levels "b","c","f","k",..: 3 6
##   ..$ cap.surface             : Factor w/ 4 levels "f","g","s","y": 4 3
##   ..$ cap.color               : Factor w/ 10 levels "b","c","e","g",..: 4 5
##   ..$ bruises                 : Factor w/ 2 levels "f","t": 1 2
##   ..$ odor                    : Factor w/ 9 levels "a","c","f","l",..: 3 6
##   ..$ gill.attachment         : Factor w/ 2 levels "a","f": 2 2
##   ..$ gill.spacing            : Factor w/ 2 levels "c","w": 1 1
##   ..$ gill.size               : Factor w/ 2 levels "b","n": 1 1
##   ..$ gill.color              : Factor w/ 12 levels "b","e","g","h",..: 1 6
##   ..$ stalk.shape             : Factor w/ 2 levels "e","t": 1 2
##   ..$ stalk.root              : Factor w/ 5 levels "?","b","c","e",..: 2 2
##   ..$ stalk.surface.above.ring: Factor w/ 4 levels "f","k","s","y": 2 3
##   ..$ stalk.surface.below.ring: Factor w/ 4 levels "f","k","s","y": 2 3
##   ..$ stalk.color.above.ring  : Factor w/ 9 levels "b","c","e","g",..: 7 8
##   ..$ stalk.color.below.ring  : Factor w/ 9 levels "b","c","e","g",..: 7 8
##   ..$ veil.color              : Factor w/ 4 levels "n","o","w","y": 3 3
##   ..$ ring.number             : Factor w/ 3 levels "n","o","t": 2 2
##   ..$ ring.type               : Factor w/ 5 levels "e","f","l","n",..: 3 5
##   ..$ spore.print.color       : Factor w/ 9 levels "b","h","k","n",..: 2 4
##   ..$ population              : Factor w/ 6 levels "a","c","n","s",..: 5 5
##   ..$ habitat                 : Factor w/ 7 levels "d","g","l","m",..: 1 1
##  $ withindiff: num [1:2] 2609 5251
##  $ iterations: int 2
##  $ weighted  : logi FALSE
##  - attr(*, "class")= chr "kmodes"

mushrooms_sample$cluster <- kmodes_cluster$cluster

table(mushrooms_sample$class, mushrooms_sample$cluster)

##

## 1 2

## e 9 480

## p 338 173

table(kmodes_cluster$cluster)

##

## 1 2

## 347 653

# now cluster the full data set, with the class variable included

km <- kmodes(mushrooms_data, 2, iter.max = 1000, weighted = FALSE)

table(mushrooms_data$class,km$cluster)

##

## 1 2

## e 28 4180

## p 3100 816

We can see that most of the mushrooms fall in the cluster that matches their class: 7,280 of the 8,124 mushrooms (about 90%) in the full data set.

Hierarchical Clustering [13]

So far, we have discussed K-means, K-medoids and K-modes clustering, where we have to specify the number of clusters.

Hierarchical clustering builds a hierarchy of clusters and does not require us to specify the number of clusters in advance. There are two types of hierarchical clustering methods. One is agglomerative clustering, where the clustering process works in a bottom-up fashion, and the other is divisive clustering, where the clustering process works in a top-down fashion.

We are going to discuss the agglomerative hierarchical clustering method. It works in a bottom-up fashion: each data item is first put in its own cluster, then the nearest two clusters are combined into one cluster, and so on, until we get a single cluster with all the data elements.

If there are n data points to be clustered, the algorithm works as follows:

• put each data point in its own individual cluster - n single-element clusters are formed

• repeat

– compute the distance between all pairs of clusters - a distance matrix is obtained

– identify the nearest pair of clusters and combine them into one cluster

• until a single cluster is formed with all the n items

Once this whole process is completed, it is represented in the form of a dendrogram.

The merging of adjacent clusters in hierarchical clustering is controlled by the linkage method used. There are four popular linkage methods.

Let us briefly discuss the linkage methods.

Linkage Methods [14]

• Complete-linkage - calculates the maximum distance between the members of two clusters and uses this distance for merging

• Single-linkage - calculates the minimum distance between the members of two clusters and uses this distance for merging

• Average-linkage - calculates the average distance between the members of two clusters and uses this distance for merging

• Centroid-linkage - calculates the distance between the centroids of two clusters and uses this distance for merging
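To make the four definitions above concrete, the following base R sketch computes each linkage distance between two small clusters. The function name `linkage_distances` and the four points are assumptions for the illustration; it shows the plain definitions rather than how hclust computes them internally.

```r
# Linkage distances between two clusters given as numeric matrices (one row per point)
linkage_distances <- function(cluster1, cluster2) {
  n1 <- nrow(cluster1)
  # Euclidean distances between every point of cluster1 and every point of cluster2
  all_dist <- as.matrix(dist(rbind(cluster1, cluster2)))
  pairwise <- all_dist[seq_len(n1), -seq_len(n1), drop = FALSE]
  c(complete = max(pairwise),
    single   = min(pairwise),
    average  = mean(pairwise),
    centroid = sqrt(sum((colMeans(cluster1) - colMeans(cluster2))^2)))
}

# two clusters of two points each (assumed values)
cluster1 <- rbind(c(0, 0), c(0, 2))
cluster2 <- rbind(c(3, 0), c(5, 0))

linkage_distances(cluster1, cluster2)
# complete = sqrt(29), single = 3, centroid = sqrt(17)
```

For the same pair of clusters the single-linkage distance is never larger than the average, which in turn is never larger than the complete-linkage distance, which is one reason the methods can merge clusters in a different order.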

The Y-axis in the Dendrogram shows the distance where merge/split takes place.

Different linkage methods would give different sets of clusters. By default, R’s hclust function uses the complete linkage method. Let us try the average linkage method, where the average distance between clusters is used to merge the clusters.

clusters <- hclust(dist(iris[, 3:4]), method = 'average')

plot(clusters)

The y-axis in the dendrogram shows the height/distance where the clusters get merged or split.

If we cut this dendrogram into three clusters (k = 3 in cutree), we get three clusters similar to those from the K-means clustering that we have already discussed in this article.

clusters_3<-cutree(clusters, 3)

table(clusters_3, iris$Species)

##

## clusters_3 setosa versicolor virginica

## 1 50 0 0

## 2 0 45 1

## 3 0 5 49

The clustering with the average linkage method seems to have worked very well: only 6 of the 150 flowers fall outside the cluster dominated by their species.

References:

1. https://www.youtube.com/watch?v=hDmNF9JG3lo - by Andrew Ng.

2. https://inseaddataanalytics.github.io/INSEADAnalytics/CourseSessions/Sessions45/ClusterAnalysisReading.html

3. https://www.coursera.org/learn/machine-learning/lecture/93VPG/k-means-algorithm - by Andrew Ng.

4. https://www.coursera.org/learn/machine-learning/lecture/G6QWt/optimization-objective - by Andrew Ng.

5. https://www.youtube.com/watch?v=OWpRBCrx5-M

6. https://www.youtube.com/watch?v=AY8kLwDQ-0I

7. https://dpmartin42.github.io/posts/r/cluster-mixed-types

8. https://www.kaggle.com/liujiaqi/hr-comma-sepcsv

9. https://www.datanovia.com/en/lessons/k-medoids-in-r-algorithm-and-practical-examples/

10. https://www.coursera.org/lecture/cluster-analysis/3-5-the-k-medians-and-k-modes-clustering-methods-pShI2

11. https://www.youtube.com/watch?v=b39_vipRkUo

12. https://www.kaggle.com/uciml/mushroom-classification

13. https://www.datacamp.com/community/tutorials/hierarchical-clustering-R

14. https://www.datacamp.com/community/tutorials/hierarchical-clustering-R