
The K-Means algorithm remains a fixture in cluster modeling. It has the advantage of being computationally fast and accurate when there are relatively few clusters. However, it cannot handle non-spherical clusters, and as the number of clusters increases, K-Means becomes far less useful for cluster classification. An alternative approach was therefore developed to handle data sets with numerous clusters: modifying the K-Means algorithm to use Mahalanobis distances rather than Euclidean distances allows the algorithm to take the clusters' variance structures into account.
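To illustrate the modification, the sketch below replaces the Euclidean assignment step of K-Means with an assignment based on each cluster's estimated mean vector and covariance matrix, using base R's mahalanobis() function. The data matrix X, the label vector labels, and the helper name assign_mahalanobis are illustrative and are not taken from the appendix code.

assign_mahalanobis <- function(X, labels){
  ks <- sort(unique(labels))
  # one column of squared Mahalanobis distances per current cluster
  D <- sapply(ks, function(k){
    Xk <- X[labels == k, , drop = FALSE]
    mahalanobis(X, center = colMeans(Xk), cov = var(Xk))
  })
  # reassign each observation to the cluster whose centre is nearest
  # in Mahalanobis distance
  ks[apply(D, 1, which.min)]
}

Iterating this assignment step together with re-estimation of each cluster's mean vector and covariance matrix gives the Mahalanobis K-Means update; degenerate covariance estimates from very small clusters would need to be handled separately, as is done in the appendix code.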

Two different initialization procedures were developed in order to maximize the classification accuracy of Mahalanobis K-Means. The first initialization method uses Euclidean distances to initialize spherically shaped clusters in regions with a high concentration of data points. The second initialization method extends the first and uses the estimated covariance matrix Σ̂ for each cluster to generate 95% confidence ellipsoids, which then capture additional points. The mean vector and covariance matrix are then updated, and this process is repeated until no additional points are captured by the confidence ellipsoids. The Mahalanobis K-Means algorithms associated with these two initialization methods were named MK-Means1 and MK-Means2, respectively.
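A minimal sketch of this growing step is given below, assuming a data matrix X and initial estimates muhat and sigmahat obtained from the spherical starting cluster. The function name grow_ellipsoid is illustrative; the appendix implements the same idea in the cluster() and clusterinit() functions.

grow_ellipsoid <- function(X, muhat, sigmahat, alpha = 0.05){
  p <- ncol(X)
  members <- integer(0)
  repeat{
    # squared Mahalanobis distances to the current centre
    d2 <- mahalanobis(X, center = muhat, cov = sigmahat)
    # points falling inside the (1 - alpha) confidence ellipsoid
    inside <- which(d2 < qchisq(1 - alpha, df = p))
    # stop once the ellipsoid captures no additional points
    if(identical(inside, members)) break
    members <- inside
    muhat <- colMeans(X[members, , drop = FALSE])
    sigmahat <- var(X[members, , drop = FALSE])
  }
  list(members = members, mean = muhat, cov = sigmahat)
}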

The performance of both Mahalanobis K-Means algorithms and the regular K-Means algorithm was evaluated in a simulation study. In general, MK-Means1 performed very poorly because the cluster covariance matrices were poorly estimated in the initialization step. MK-Means2 and K-Means both performed very well in the five-cluster case. However, MK-Means2 vastly outperformed K-Means in the ten-cluster case, with the difference being most pronounced in data sets with lower cluster overlap. MK-Means1 and MK-Means2 both outperformed regular K-Means in correctly classifying observations in the Iris data set, but MK-Means2 was again the best of the three algorithms.


APPENDIX

testfunc <- function(){
  require(MixSim)
  c1 <- matrix(ncol = 4)

  ## Initialization of K Clusters ##
  ###############################################################
  A <- MixSim(MaxOmega = 0.05, K = 10, p = 2, PiLow = 0.05)
  n <- 500
  PI <- A$Pi
  mu <- A$Mu
  s <- A$S
  sim <- simdataset(n, PI, mu, s)
  p <- 2

  ## Single Iteration
  ################################################################
  nstart <- 10
  # one row of selection criteria per restart: Euclidean and Mahalanobis WSS
  # for the grown-cluster run, Mahalanobis WSS for the alternate
  # initialization, and the K-Means total within-cluster SS
  itermat <- matrix(NA, ncol = 4)
  indmat <- matrix(rep(NA, n), ncol = n)
  ind1mat <- matrix(rep(NA, n), ncol = n)
  ind2mat <- matrix(rep(NA, n), ncol = n)

  ## Functions
  #############

  ## Growing Clusters
  ###################
  cluster <- function(data, sigmahat, muhat, alpha){
    n <- dim(data)[1]
    p <- dim(data)[2]
    data0 <- sweep(data, 2, muhat, FUN = "-")
    m0 <- diag(data0 %*% solve(sigmahat, tol = 1e-50) %*% t(data0))
    ind <- which(m0 < qchisq(alpha, df = p, lower.tail = F))
    muhat <- colMeans(data[ind,])
    sigmahat <- var(data[ind,])
    outputs <- list(ind, muhat, sigmahat)
    return(outputs)
  }

  ## Cluster Initialization
  ##########################
  clusterinit <- function(data, alpha, w){
    d <- as.matrix(dist(data, method = "euclidean"))
    n <- dim(data)[1] - length(which(is.na(d)[1,] == TRUE))
    p <- dim(data)[2]
    # Need to remove NA values from dist.list so it does not select an NA point
    # use which(is.na(data[,1])).
    rand1 <- rmultinom(1, 1, (n:1)^2 / sum((1:n)/2))
    randn <- sum((1:n)*rand1)
    # dist.n is not defined in this appendix; from its use here it is assumed
    # to score each point by the distance to its w nearest neighbours
    dist.list <- apply(d, 1, dist.n, w)
    dist.id <- order(dist.list)[randn]
    r <- sort(d[dist.id,])[w]
    sphcluster <- as.matrix(data[which(d[dist.id,] < r),])
    muhat <- as.vector(colMeans(sphcluster))
    sigmahat <- var(sphcluster)
    all.est <- list()
    all.est[[1]] <- list(which(d[dist.id,] < r), muhat, sigmahat)
    all.est[[2]] <- cluster(data, sigmahat, muhat, alpha) # 1st iteration
    iter <- 2  # compare each growing iteration with the previous one
    while(identical(all.est[[iter]][[1]], all.est[[(iter-1)]][[1]]) != TRUE){
      sigmahat <- unlist(all.est[[iter]][[3]])
      muhat <- unlist(all.est[[iter]][[2]])
      iter <- iter + 1
      est <- cluster(data, sigmahat, muhat, alpha)
      all.est[[iter]] <- est
    }
    return(all.est)
  }

  ## Mahalanobis K-Means
  ######################

  mkmeans <- function(data, clusters){
    # drop clusters too small to estimate a covariance matrix
    clusters <- clusters[as.vector(unlist(lapply(1:length(clusters),
      function(i) if(length(clusters[[i]]) > p) i else NULL)))]
    # drop clusters whose covariance matrix is numerically singular
    clusters <- clusters[as.vector(unlist(lapply(1:length(clusters),
      function(i) if(det(var(data[clusters[[i]],])) >= 1e-50) i else NULL)))]
    k <- length(clusters)
    distmatrix <- matrix(unlist(lapply(1:k,
      function(j) mahalanobis(data, apply(data[clusters[[j]],], 2, mean),
                              solve(var(data[clusters[[j]],]), tol = 1e-50),
                              inverted = TRUE))), ncol = k)
    ind <- apply(distmatrix, 1, which.min)
    clusters <- lapply(1:k, function(i) which(ind == i))
    clusters <- lapply(1:k, function(i) unique(clusters[[i]]))
    clusters <- clusters[as.vector(unlist(lapply(1:length(clusters),
      function(i) if(length(clusters[[i]]) > p) i else NULL)))]
    return(clusters)
  }

  ## Regular K-Means Initialization
  #################################
  kinit <- function(data, k){
    centers <- sample(1:dim(data)[1], k)
    d <- as.matrix(dist(data, method = "euclidean"))
    dcenters <- d[,centers]
    ind <- apply(dcenters, 1, which.min)
    clusters <- lapply(1:k, function(i) which(ind == i))
    clusters <- lapply(1:k, function(i) unique(clusters[[i]]))
    return(clusters)
  }

  ## Within Sums of Squares Functions: Mahalanobis and Euclidean
  ##############################################################
  SSm <- function(data, cluster){
    mu0 <- apply(data[cluster,], 2, mean)
    data0 <- sweep(data[cluster,], 2, mu0, FUN = "-")
    s <- var(data[cluster,])
    d <- sqrt(diag(data0 %*% solve(s, tol = 1e-50) %*% t(data0)))
    ss <- sum(d)
    return(ss)
  }

  SS <- function(data, cluster){
    mu0 <- apply(data[cluster,], 2, mean)
    data0 <- sweep(data[cluster,], 2, mu0, FUN = "-")
    d <- sqrt(diag(data0 %*% t(data0)))
    ss <- sum(d)
    return(ss)
  }

  rowreplace <- function(x, list, values){
    x[list,] <- values
    x
  }

  for(m in 1:nstart){
    clusters <- list()
    i <- 2
    n <- dim(sim$X)[1]
    k <- 10
    alpha <- 0.05
    w <- 25
    data <- sim$X
    d <- as.matrix(dist(data, method = "euclidean"))

    ## Grown-cluster initialization via clusterinit(), one cluster at a time
    cl <- clusterinit(data, alpha, w)
    clusters[[1]] <- unlist(cl[[length(cl)]][[1]])
    data <- rowreplace(data, clusters[[1]], NA)
    while(i <= k && length(which(is.na(data[,1]) == F)) > w){
      cl <- clusterinit(data, alpha, w)
      clusters[[i]] <- cl[[length(cl)]][[1]]
      data <- rowreplace(data, clusters[[i]], NA)
      i <- i + 1
    }
    mcl <- list()
    mcl[[1]] <- clusters
    iter <- 2
    repeat{
      mcl[[iter]] <- mkmeans(sim$X, mcl[[(iter-1)]])
      if(identical(mcl[[iter]], mcl[[(iter-1)]]) == FALSE){
        iter <- iter + 1
      } else {
        break
      }
    }
    ind <- rep(NA, dim(sim$X)[1])
    a <- mcl[[length(mcl)]]
    for(i in 1:length(a)){
      for(j in 1:length(a[[i]])){
        ind[a[[i]][[j]]] <- i
      }
    }

    ## Alternate Initialization
    #########################################################
    clusters <- kinit(sim$X, k)
    mcl1 <- list()
    mcl1[[1]] <- clusters
    iter <- 2
    repeat{
      mcl1[[iter]] <- mkmeans(sim$X, mcl1[[(iter-1)]])
      if(identical(mcl1[[iter]], mcl1[[(iter-1)]]) == FALSE){
        iter <- iter + 1
      } else {
        break
      }
    }
    ind1 <- rep(NA, dim(sim$X)[1])
    a1 <- mcl1[[length(mcl1)]]
    for(i in 1:length(a1)){
      for(j in 1:length(a1[[i]])){
        ind1[a1[[i]][[j]]] <- i
      }
    }

    ## Regular K-Means
    ##################
    kmeans <- kmeans(sim$X, k)

    ## Results
    ##############
    SS1 <- 0
    for(i in 1:length(a)){
      SSt <- SS(sim$X, a[[i]])
      SS1 <- SS1 + SSt
    }
    WSS1 <- 0
    for(i in 1:length(a)){
      SSc <- SSm(sim$X, a[[i]])
      WSS1 <- WSS1 + SSc
    }
    WSS2 <- 0
    for(i in 1:length(a1)){
      SSc1 <- SSm(sim$X, a1[[i]])
      WSS2 <- WSS2 + SSc1
    }
    WSS3 <- kmeans$tot.withinss
    # record the selection criteria for this restart
    itermat <- rbind(itermat, c(SS1, WSS1, WSS2, WSS3))
    if(m == 1){
      indmat[m,] <- ind
      ind1mat[m,] <- ind1
      ind2mat[m,] <- kmeans$cluster
    }
    if(m > 1){
      indmat <- rbind(indmat, ind)
      ind1mat <- rbind(ind1mat, ind1)
      ind2mat <- rbind(ind2mat, kmeans$cluster)
    }
  }

  itermat <- itermat[-which(is.na(itermat[,1]) == TRUE),]

  ## My Algorithm using Mahalanobis WSS to determine best clusters
  a10 <- ClassProp(indmat[which.min(itermat[,1]),], sim$id)   ## MK-Means1
  #a11 <- ClassProp(indmat[which.min(itermat[,2]),], sim$id)
  a12 <- ClassProp(ind1mat[which.min(itermat[,3]),], sim$id)  ## MK-Means2
  a13 <- ClassProp(ind2mat[which.min(itermat[,4]),], sim$id)  ## K-Means

  c1 <- rbind(c1, c(a10, NA, a12, a13))
  c1 <- c1[-which(is.na(c1[,1]) == TRUE),]
  return(c1)
}

Algorithm comparison code for Ωmax = 0.05, K = 10, p = 2. This code will produce 10 runs of each algorithm for each data set, with the best cluster configuration chosen to represent the optimal assignments for each algorithm.
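For example, assuming the MixSim package is installed, classification results over many simulated data sets could be collected by calling the function repeatedly. The replication count of 100 and the random seed below are illustrative; the second column is the NA placeholder corresponding to the commented-out selection criterion.

library(MixSim)
set.seed(1)
# each call generates one data set and returns the best run of each algorithm
results <- do.call(rbind, lapply(1:100, function(i) testfunc()))
colnames(results) <- c("MKMeans1", "unused", "MKMeans2", "KMeans")
# average classification proportion per algorithm
colMeans(results[, c("MKMeans1", "MKMeans2", "KMeans")])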
