Clustering

The hierarchical clustering routine produces a 'dendrogram' showing how data points (rows) can be clustered.

Three different algorithms are available:

  • Unweighted pair-group average (UPGMA). Clusters are joined based on the average distance between all members in the two groups.
  • Single linkage (nearest neighbour). Clusters are joined based on the smallest distance between the two groups.
  • Ward's method. Clusters are joined such that the increase in within-group variance is minimized.

No single method is necessarily better than the others, though some authors advise against single linkage. It can be useful to compare the dendrograms given by the different algorithms, to informally assess the robustness of the clusters.
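The UPGMA and single linkage joining rules can be illustrated with a minimal pure-Python sketch. This is not PAST's implementation: the function name agglomerate, the naive search over all cluster pairs and the one-dimensional toy data are all assumptions made for illustration.

```python
from itertools import combinations


def agglomerate(dist, linkage="average"):
    """Naive agglomerative clustering on a full symmetric distance matrix.

    linkage="average" uses the mean of all between-group distances (UPGMA);
    linkage="single" uses the smallest between-group distance.
    Returns a list of merges as (members_a, members_b, height).
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            pair = [dist[i][j] for i in a for j in b]
            d = min(pair) if linkage == "single" else sum(pair) / len(pair)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two joined clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges


# Hypothetical toy data: five points on a line, absolute difference as distance.
points = [0.0, 1.0, 2.5, 4.5, 5.0]
dist = [[abs(p - q) for q in points] for p in points]

single_heights = [h for _, _, h in agglomerate(dist, "single")]
average_heights = [h for _, _, h in agglomerate(dist, "average")]
print("single linkage heights:", single_heights)
print("UPGMA heights:         ", average_heights)
```

On this toy data the two rules join the points in the same order but at different heights. With less clearly separated data the order itself may differ, which is what a comparison of dendrograms would reveal. For real data sets a library routine such as SciPy's scipy.cluster.hierarchy.linkage (methods 'average', 'single' and 'ward') would normally be preferable to this quadratic search.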

For Ward's method, a Euclidean distance measure is inherent to the algorithm. For UPGMA and single linkage, the distance matrix can be computed using 24 different indices, as described under the ‘Similarity and distance indices’ section.
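The link between Ward's method and Euclidean distance can be sketched numerically. The example below assumes the usual formulation in which the cost of joining two groups is the increase in the within-group sum of squared Euclidean distances to the centroid; the function names and the data are hypothetical, and the text above does not say whether PAST scales this quantity differently.

```python
def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def within_ss(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def ward_cost(a, b):
    """Closed-form increase in within-group sum of squares when joining a and b."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


# Hypothetical groups in two dimensions.
group_a = [(0.0, 0.0), (2.0, 0.0)]
group_b = [(5.0, 4.0)]

direct_increase = within_ss(group_a + group_b) - within_ss(group_a) - within_ss(group_b)
formula_increase = ward_cost(group_a, group_b)
print(direct_increase, formula_increase)
```

Both quantities agree up to rounding, and both depend only on squared Euclidean distances, which is why no other distance index can be selected for this algorithm.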

Missing data: The cluster analysis algorithm can handle missing data, coded with question marks (?). This is done using pairwise deletion, meaning that when the distance between two points is calculated, any variable that is missing in either point is ignored in the calculation. For Raup-Crick, missing values are treated as absence. Missing data are not supported for Ward's method, nor for the Rho similarity measure.
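Pairwise deletion can be sketched as follows for the Euclidean distance. The helper names parse_row and euclidean_pairwise are hypothetical. The sketch simply drops variables that are missing in either row; whether PAST additionally rescales the result for the number of variables actually used is not stated here, so no rescaling is applied.

```python
import math


def parse_row(text):
    """Convert a whitespace-separated row, mapping '?' to None (missing)."""
    return [None if tok == "?" else float(tok) for tok in text.split()]


def euclidean_pairwise(x, y):
    """Euclidean distance over the variables observed in both rows."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no variable is observed in both rows")
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs))


row_a = parse_row("1 ? 4 0")
row_b = parse_row("4 2 ? 4")

# Only the first and fourth variables are present in both rows.
d_ab = euclidean_pairwise(row_a, row_b)
print(d_ab)
```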

Two-way clustering: The two-way option allows simultaneous clustering in R-mode and Q-mode, that is, of both the variables (columns) and the data points (rows).

Stratigraphically constrained clustering: This option will allow only adjacent rows or groups of rows to be joined during the agglomerative clustering procedure. It may produce strange-looking (but correct) dendrograms.
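The adjacency constraint can be sketched with a small variant of UPGMA in which only neighbouring clusters are candidates for joining. The function name constrained_upgma and the toy data are assumptions for illustration; this is not PAST's code.

```python
def constrained_upgma(dist):
    """UPGMA in which only clusters adjacent in row order may be joined.

    Returns a list of merges as (members_a, members_b, height).
    """
    clusters = [[i] for i in range(len(dist))]  # kept in stratigraphic order
    merges = []
    while len(clusters) > 1:
        best_k, best_d = None, None
        for k in range(len(clusters) - 1):  # neighbouring pairs only
            a, b = clusters[k], clusters[k + 1]
            d = sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))
            if best_d is None or d < best_d:
                best_k, best_d = k, d
        a, b = clusters[best_k], clusters[best_k + 1]
        merges.append((tuple(a), tuple(b), best_d))
        clusters[best_k:best_k + 2] = [a + b]
    return merges


# Hypothetical single-variable section: rows 0 and 2 are similar, as are 1 and 3,
# but neither pair is adjacent.
values = [1.0, 5.0, 1.5, 6.0]
dist = [[abs(p - q) for q in values] for p in values]

merges = constrained_upgma(dist)
for a, b, height in merges:
    print(a, b, height)
```

In this toy example the second join occurs at a lower height than the first, a reversal that unconstrained UPGMA does not produce. Reversals of this kind may be one reason a constrained dendrogram looks unusual.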

Group-constrained clustering: This option will only allow joining of clusters within the given groups. It may produce strange-looking (but correct) dendrograms.

Bootstrapping: If a number of bootstrap replicates is given (e.g. 100), the columns are subjected to resampling. Type the number in the “Boot N” number box and press Enter to update the value. The percentage of replicates where each node is still supported is given on the dendrogram.
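The bootstrap procedure can be sketched as follows. The sketch assumes that columns are resampled with replacement and that a node counts as supported when exactly the same set of rows forms a cluster in the replicate; the text above does not spell out either detail, so both are assumptions. The function names, the UPGMA/Euclidean choice and the data are likewise hypothetical.

```python
import math
import random
from itertools import combinations


def average_distance(dist, a, b):
    """Mean distance between all members of clusters a and b (UPGMA)."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))


def upgma_groups(rows):
    """Return the non-trivial clusters (frozensets of row indices) found by UPGMA."""
    n = len(rows)
    dist = [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    clusters = [frozenset([i]) for i in range(n)]
    groups = set()
    while len(clusters) > 2:  # the final join contains every row, so skip it
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(dist, *pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        groups.add(a | b)
    return groups


def bootstrap_support(rows, n_boot=100, seed=0):
    """Percentage of column-resampled replicates that recover each original cluster."""
    rng = random.Random(seed)
    n_cols = len(rows[0])
    counts = dict.fromkeys(upgma_groups(rows), 0)
    for _ in range(n_boot):
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]  # with replacement
        resampled = [[row[c] for c in cols] for row in rows]
        for group in upgma_groups(resampled):
            if group in counts:
                counts[group] += 1
    return {tuple(sorted(g)): 100.0 * c / n_boot for g, c in counts.items()}


# Hypothetical data: rows 0-1 and rows 2-3 form two well-separated groups.
rows = [
    [1, 2, 1, 3, 2],
    [2, 1, 1, 2, 3],
    [9, 8, 9, 7, 8],
    [8, 9, 8, 8, 9],
]
support = bootstrap_support(rows, n_boot=100)
print(support)
```

Because the two groups differ strongly in every column, any resampling of the columns recovers them. With noisier data the percentages would fall below 100 for weakly supported nodes.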

Note on Ward’s method: PAST produces Ward’s dendrograms identical to those made by Stata, but somewhat different from those produced by Statistica. The reason for the discrepancy is unknown.

Published Aug. 31, 2020 3:25 PM - Last modified Aug. 31, 2020 3:25 PM