K-means clustering

K-means clustering (e.g. Bow 1984) is a non-hierarchical clustering method. The number of clusters to use is specified by the user, usually according to some hypothesis such as there being two sexes, four geographical regions or three species in the data set.

The cluster assignments are initially random. In an iterative procedure, items are then moved to the cluster which has the closest cluster mean, and the cluster means are updated accordingly. This continues until items are no longer "jumping" to other clusters. The result of the clustering is to some extent dependent upon the initial, random ordering, and cluster assignments may therefore differ from run to run. This is not a bug, but normal behaviour in k-means clustering.
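The iterative procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the program's actual implementation: the function names, the squared Euclidean distance, the handling of empty clusters and the iteration cap are all assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=None, max_iter=100):
    """Return one cluster index (0..k-1) per point.

    points: list of equal-length lists of numbers.
    seed:   optional seed; different seeds may give different results.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Start from random cluster assignments.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iter):
        # Update the mean of every cluster.
        means = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # Assumption: an empty cluster is re-seeded at a random point.
                members = [rng.choice(points)]
            means.append(
                [sum(p[d] for p in members) / len(members) for d in range(dim)]
            )
        # Move every item to the cluster with the closest mean.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, means[c])) for p in points
        ]
        if new_labels == labels:
            break  # no item "jumped", so the procedure has converged
        labels = new_labels
    return labels
```

Calling kmeans(data, 2) several times with different seeds on the same data can return different label numbering, and on poorly separated data different groupings, which is the run-to-run variation mentioned above.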

The cluster assignments may be copied and pasted back into the main spreadsheet. If you then set the column type to Group, corresponding colors (symbols) may be assigned to the items using the 'Row colors/symbols' option in the Edit menu.

Clustering statistics

WGSS: The Within-Group (or Within-Cluster) sum of squares; will decrease with cluster separation and with the number of clusters.

F: Ratio of Between-Group to Within-Group sum of squares; will increase with cluster separation and number of clusters.

Variance explained (percent): Ratio of Between-Group to total sum of squares, multiplied by 100.

Average silhouette: The average silhouette value over all objects (see below).
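As a rough illustration of the first three statistics, the sketch below computes them directly from the definitions as worded here. It is an assumption that the between-group sum of squares is obtained as total minus within-group, and that F is the plain ratio of sums of squares with no degrees-of-freedom scaling; the function name and the returned dictionary keys are hypothetical.

```python
def clustering_stats(points, labels):
    """WGSS, F and variance explained (percent) for a given clustering."""
    dim = len(points[0])

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Total sum of squares around the grand mean.
    grand_mean = mean(points)
    total_ss = sum(sq_dist(p, grand_mean) for p in points)

    # Within-group sum of squares around each cluster mean.
    wgss = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        centre = mean(members)
        wgss += sum(sq_dist(p, centre) for p in members)

    # Between-group sum of squares (assumed: total minus within).
    bgss = total_ss - wgss
    return {
        "WGSS": wgss,
        "F": bgss / wgss if wgss > 0 else float("inf"),
        "variance_explained": 100.0 * bgss / total_ss if total_ss > 0 else 0.0,
    }
```

For the one-dimensional items 0, 2, 10, 12 split into the clusters {0, 2} and {10, 12}, the total sum of squares is 104 and WGSS is 4, so the between-group sum of squares is 100, F is 25 and the variance explained is about 96.2 percent.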

Silhouette plot and table

The silhouette plot (Rousseeuw 1987) gives an indication of how well each object has been classified, on a scale from -1 to 1, where 1 means a perfectly appropriate assignment to a cluster; -1 means the object would have been better placed in another cluster; 0 means the object is on the boundary between two clusters.
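The silhouette value of an object is commonly computed as (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below follows that definition; the use of Euclidean distance, the value 0 for single-member clusters and the function name are assumptions for the example.

```python
import math


def silhouette_values(points, labels):
    """Return one silhouette value in [-1, 1] per object."""
    # Group object indices by cluster label.
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            # Assumed convention: a lone object (or a single cluster) scores 0.
            values.append(0.0)
            continue
        # a: mean distance to the other members of the object's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in idx) / len(idx)
            for c, idx in clusters.items()
            if c != lab
        )
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values
```

The average silhouette is then the mean of the returned list. An object that sits closer to another cluster than to its own gets a negative value.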

Missing data are supported by column average substitution.
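Column average substitution can be illustrated as follows. In this sketch a missing value is represented by None, which is an assumption for the example and not the program's own missing-value code; the function name is likewise hypothetical.

```python
def fill_missing_with_column_means(rows):
    """Replace each None by the mean of the observed values in its column."""
    ncols = len(rows[0])
    means = []
    for d in range(ncols):
        present = [r[d] for r in rows if r[d] is not None]
        # A column with no observed values cannot be filled and stays None.
        means.append(sum(present) / len(present) if present else None)
    return [
        [means[d] if r[d] is None else r[d] for d in range(ncols)]
        for r in rows
    ]
```

For the rows (1, missing), (3, 4) and (missing, 8), the column means of the observed values are 2 and 6, so the filled rows become (1, 6), (3, 4) and (2, 8).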

References

Bow, S.-T. 1984. Pattern Recognition. Marcel Dekker, New York.

Rousseeuw, P.J. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20:53–65.

Published Aug. 31, 2020 10:08 PM - Last modified Mar. 3, 2024 1:34 AM