Topography of Convergence
K-Means is one of the most popular clustering algorithms. The K-means algorithm performs sequential iterations looking to minimise the distance among the datapoints belonging to a pre-defined number of groups or clusters. The algorithm converges consistenly for different initializations
Data visualization could help in reducing the parameter space when the whole spectrum of potential solutions of an algorithm is analyzed. In this dataviz some examples are shown on the phase space of cluster size and distance, on distance types and the outcome in terms of time and distribution. Project developed with the Ecole Centrale de Lyon.
Three different datasets are considered in this exploratory analysis, with a variety of point distribution.
Different initializations converge in the phase space of cluster size and total distance. For a given number of centroids, three in this case, the different initializations form specific paths over which they converge, leaving some regions in the Phase Space without trajectories. Depending on the dataset, these trajectories change.
The Euclidean distance is usually taken for granted in clustering algorithms, although there are other definitions of distances such as Manhattan (or City Block) , Cosine and Max-Min. The contours displayed here show the density of the location of centroids depending on the distance used, in this case Euclidean and Manhattan. Even on this simple datasets there is a slight difference on the topography occupied by the centroids.
The initial location of the centroids, a process known as seeding, has a large influence on the algorithm performance in terms of actual cluster distribution and computing time. Here we can see the computing time and cluster distribution for different number of centroids and different grid on which the initial centroids are placed.