On K-Means Clustering Using Mahalanobis Distance
Abstract
A problem that arises quite frequently in statistics is that of identifying groups, or clusters, of data within a population or sample. The most widely used procedure to identify clusters in a set of observations is known as K-Means. The main limitation of this algorithm is that it uses the Euclidean distance metric to assign points to clusters. Hence, this algorithm operates well only if the covariance structures of the clusters are nearly spherical and homogeneous in nature. To remedy this shortfall in the K-Means algorithm the Mahalanobis distance metric was used to capture the variance structure of the clusters. The issue with using Mahalanobis distances is that the accuracy of the distance is sensitive to initialization. If this method serves as a signicant improvement over its competitors, then it will provide a useful tool for analyzing clusters.