Similarity and distance indices

Computes a number of similarity or distance measures between all pairs of rows. The data can be univariate or (more commonly) multivariate, with variables in columns. The results are given as a symmetric similarity/distance matrix. This module is rarely used, because similarity/distance matrices are usually computed automatically from primary data in modules such as PCO, NMDS, cluster analysis and PerMANOVA in Past.

Mathematical equations are given in the Past manual.

Euclidean

Basic Euclidean distance (the value is adjusted for missing data).

Gower

A distance measure that averages the difference over all variables, each term normalized for the range of that variable. The Gower measure is similar to Manhattan distance (see below) but with range normalization. When using mixed data types (see below), this is the default measure for continuous and ordinal data.
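
As an illustration of the range normalization described above, here is a minimal Python sketch. It assumes the range of each variable is taken over all rows of the data set, and that variables with zero range are skipped; both are assumptions for this example, and the function name gower_distance is hypothetical. The exact equation used by Past is given in the Past manual.

```python
def gower_distance(data, j, k):
    """Gower distance between rows j and k of a list-of-lists data matrix."""
    n_vars = len(data[0])
    total = 0.0
    used = 0
    for v in range(n_vars):
        column = [row[v] for row in data]
        value_range = max(column) - min(column)
        if value_range == 0:
            continue  # assumption: constant variables are skipped
        # Absolute difference, scaled by the range of this variable
        total += abs(data[j][v] - data[k][v]) / value_range
        used += 1
    # Average over the variables that were actually used
    return total / used if used else 0.0


data = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
print(gower_distance(data, 0, 1))  # (2/4 + 20/20) / 2 = 0.75
```

Because each term is divided by its range, a variable measured in large units does not dominate the distance.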

Chord

Euclidean distance between normalized vectors. Commonly used for abundance data.
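
A minimal Python sketch of this definition, assuming that "normalized" means scaling each row to unit Euclidean length. The function name chord_distance is hypothetical, and rows with all zeros are not handled here.

```python
import math


def chord_distance(x, y):
    """Euclidean distance between x and y after scaling each to unit length."""
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return math.sqrt(sum((a / norm_x - b / norm_y) ** 2 for a, b in zip(x, y)))


# Two samples with the same proportions but different totals
print(chord_distance([1, 2], [2, 4]))
# Two samples with no taxa in common
print(chord_distance([1, 0], [0, 1]))
```

The first pair gives a distance of (nearly) zero because only relative abundances matter; the second gives the maximum possible value for non-negative data, the square root of 2.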

Manhattan

The sum of absolute differences over all variables.

Bray-Curtis

Bray-Curtis is a popular similarity index for abundance data (Bray & Curtis 1957). Many authors instead work with a Bray-Curtis distance, which is simply 1-d, where d is the similarity.
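
The equation is not stated in this section, so the following Python sketch assumes the commonly cited form of the Bray-Curtis similarity, one minus the sum of absolute differences divided by the sum of all abundances. The function names are hypothetical, and returning zero for two empty samples is an assumption made for this example.

```python
def bray_curtis_similarity(x, y):
    """Bray-Curtis similarity between two abundance vectors (assumed form)."""
    total = sum(x) + sum(y)
    if total == 0:
        return 0.0  # assumption: two empty samples get similarity zero
    return 1.0 - sum(abs(a - b) for a, b in zip(x, y)) / total


def bray_curtis_distance(x, y):
    """The distance used by many authors: the complement of the similarity."""
    return 1.0 - bray_curtis_similarity(x, y)


x = [2, 0, 4]
y = [1, 3, 4]
print(bray_curtis_similarity(x, y))  # 1 - 4/14
print(bray_curtis_distance(x, y))    # 4/14
```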

Cosine

The inner product of abundances each normalised to unit norm, i.e. the cosine of the angle between the vectors.
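
The definition above can be sketched in a few lines of Python. The function name cosine_similarity is hypothetical, and all-zero rows are not handled in this example.

```python
import math


def cosine_similarity(x, y):
    """Inner product of x and y after scaling each to unit norm."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


print(cosine_similarity([1, 0], [0, 1]))  # orthogonal vectors: 0.0
print(cosine_similarity([3, 4], [3, 4]))  # identical vectors: 1.0
```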

Morisita

For abundance data.

Horn

Horn’s overlap index for abundance data (Horn 1966).

Mahalanobis

A distance measure taking into account the covariance structure of the data.

Correlation

The complement 1-r of Pearson’s r correlation across the variables. Taking the complement makes this a distance measure. See also the Correlation module, where Pearson’s r is given directly and with significance tests.
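
A minimal Python sketch of the 1-r distance between two rows, with r computed across the variables. The function name correlation_distance is hypothetical; a row with zero variance would cause a division by zero and is not handled here.

```python
import math


def correlation_distance(x, y):
    """Return 1 - r, where r is Pearson's correlation between rows x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    return 1.0 - r


print(correlation_distance([1, 2, 3], [2, 4, 6]))  # perfectly correlated: 0.0
print(correlation_distance([1, 2, 3], [3, 2, 1]))  # perfectly anticorrelated: 2.0
```

Since r ranges from -1 to 1, this distance ranges from 0 to 2.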

Rho

The complement 1-rs of Spearman’s rho, which is the correlation coefficient of ranks. See also the Correlation module, where rho is given directly and with significance tests.

Dice

Also known as the Sorensen coefficient. For binary (absence-presence) data, coded as 0 or 1 (any positive number is treated as 1). The Dice similarity puts more weight on joint occurrences than on mismatches. When comparing two rows, a match is counted for all columns with presences in both rows. Using M for the number of matches and N for the total number of columns with presence in just one row, we have d_jk = 2M / (2M+N).

Jaccard

A similarity index for binary data. With the same notation as given for Dice similarity above, we have d_jk = M / (M+N).
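
The Dice and Jaccard formulas can be compared side by side with a short Python sketch. The function names are hypothetical. Positive values are treated as presences, and two all-zero rows are given similarity zero, following the all-zeros convention described later in this section.

```python
def match_counts(x, y):
    """Return (M, N): joint presences, and presences in just one row."""
    px = [v > 0 for v in x]  # any positive number counts as presence
    py = [v > 0 for v in y]
    m = sum(a and b for a, b in zip(px, py))
    n = sum(a != b for a, b in zip(px, py))
    return m, n


def dice(x, y):
    m, n = match_counts(x, y)
    return 2 * m / (2 * m + n) if (m + n) > 0 else 0.0


def jaccard(x, y):
    m, n = match_counts(x, y)
    return m / (m + n) if (m + n) > 0 else 0.0


x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1]
print(match_counts(x, y))  # (2, 2)
print(dice(x, y))          # 4/6
print(jaccard(x, y))       # 0.5
```

With the same M and N, Dice is never smaller than Jaccard, because matches are counted twice.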

Kulczynski

Another similarity index for binary data.

Ochiai

A similarity index for binary data, comparable to the cosine similarity for other data types.

Simpson

The Simpson index (Simpson 1943) is defined simply as M / Nmin, where Nmin is the smaller of the numbers of presences in the two rows. This index treats two rows as identical if one is a subset of the other, making it useful for fragmentary data.
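
The subset property is easy to see in a small Python sketch of M / Nmin. The function name simpson_similarity is hypothetical, and returning zero when either row is empty is an assumption for this example, in line with the all-zeros convention described later in this section.

```python
def simpson_similarity(x, y):
    """M / Nmin for two absence-presence rows."""
    px = [v > 0 for v in x]
    py = [v > 0 for v in y]
    m = sum(a and b for a, b in zip(px, py))  # joint presences
    n_min = min(sum(px), sum(py))             # smaller number of presences
    return m / n_min if n_min > 0 else 0.0


fragmentary = [1, 1, 0, 0]
complete = [1, 1, 1, 1]
print(simpson_similarity(fragmentary, complete))  # subset of the other row: 1.0
```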

Raup-Crick

Raup-Crick index for absence-presence data. This index (Raup & Crick 1979) uses a randomization (Monte Carlo) procedure, comparing the observed number of species occurring in both associations with the distribution of co-occurrences from 1000 random replicates from the pool of samples (not from the pair of samples, as incorrectly assumed by some authors).

Hamming

Hamming distance for categorical data as coded with integers (or sequence data coded as CAGT). The Hamming distance is the number of differences (mismatches), so that the distance between (3,5,1,2) and (3,7,0,2) equals 2. In Past, this is normalised to the range [0,1], which is known to geneticists as "p-distance".
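
The example from the text can be reproduced with a minimal Python sketch. The function names are hypothetical, and normalization by the number of positions is assumed as the way to reach the range [0,1].

```python
def hamming(x, y):
    """Number of positions at which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def p_distance(x, y):
    """Hamming distance divided by the number of positions (assumed scaling)."""
    return hamming(x, y) / len(x)


print(hamming((3, 5, 1, 2), (3, 7, 0, 2)))     # 2
print(p_distance((3, 5, 1, 2), (3, 7, 0, 2)))  # 0.5
print(p_distance("CAGT", "CAGA"))              # 0.25
```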

Jukes-Cantor

Distance measure for genetic sequence data (CAGT). Similar to p (or Hamming) distance, but takes into account probability of reversals.
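
The equation is not given in this section. As a sketch, the standard Jukes-Cantor correction, d = -(3/4) ln(1 - (4/3) p), is assumed here, with p the p-distance (proportion of mismatching sites). The function name jukes_cantor is hypothetical, and the exact form used by Past should be checked in the Past manual.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance from a p-distance (standard form, assumed)."""
    if p >= 0.75:
        # The logarithm is undefined at or beyond saturation
        raise ValueError("p-distance must be below 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.0, 0.1, 0.3):
    print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as p, and the gap grows as sequences diverge, reflecting multiple substitutions at the same site that a raw mismatch count cannot see.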

Kimura

The Kimura 2-parameter distance measure for genetic sequence data (CAGT). Similar to Jukes-Cantor distance, but takes into account different probabilities of nucleotide transitions vs. transversions (Kimura 1980).
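
A minimal Python sketch, assuming the standard Kimura 2-parameter formula d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where P is the proportion of sites showing a transition (A/G or C/T) and Q the proportion showing a transversion. The function name k2p_distance is hypothetical, and the sequences are assumed to be aligned, of equal length, and to contain only the upper-case letters C, A, G and T.

```python
import math

PURINES = {"A", "G"}  # C and T are the pyrimidines


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences (assumed form)."""
    transitions = 0
    transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    n = len(seq1)
    p = transitions / n
    q = transversions / n
    return -0.5 * math.log((1.0 - 2.0 * p - q) * math.sqrt(1.0 - 2.0 * q))


# One transition (A->G) and one transversion (A->C) over ten sites
print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAC"))
```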

Tajima-Nei

Distance measure for genetic sequence data (CAGT). Similar to Jukes-Cantor distance, but does not assume equal nucleotide frequencies.

Tamura

Distance measure for genetic sequence data (CAGT). An extension of the Kimura 2-parameter distance that handles unequal transition/transversion probabilities and also takes into account a possible bias in the G+C frequency.

Geographical

Distance in meters along a great circle between two points on the Earth’s surface. Exactly two variables (columns) are required, with latitudes and longitudes in decimal degrees (e.g. 58 degrees 30 minutes North is 58.5). Coordinates are expected in the WGS84 datum, and distance is calculated with respect to the WGS84 ellipsoid. Use of other datums will result in very slight errors. The accuracy of the algorithm used (Vincenty 1975) is on the order of 1 mm with respect to WGS84.
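
For orientation only, the following Python sketch computes a great-circle distance with the haversine formula on a sphere. This is a deliberate simplification and not the Vincenty (1975) ellipsoidal algorithm described above; the spherical result can differ from the ellipsoidal one by a few tenths of a percent. The function name and the mean-radius constant are assumptions for this example.

```python
import math

EARTH_MEAN_RADIUS_M = 6371008.8  # assumption: spherical Earth, mean radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in meters; inputs in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to guard against rounding slightly above 1 for near-antipodal points
    return 2 * EARTH_MEAN_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


# 58 degrees 30 minutes North is entered as 58.5
print(haversine_m(58.5, 10.0, 59.5, 10.0))
```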

User-defined similarity

Expects a symmetric similarity matrix rather than original data. No error checking is performed.

User-defined distance

Expects a symmetric distance matrix rather than original data. No error checking is performed.

Mixed

This option requires that data types have been assigned to columns (see Entering and manipulating data). A pop-up window will ask for the similarity/distance measure to use for each datatype. These will be combined using an average weighted by the number of variates of each type. The default choices correspond to those suggested by Gower, but other combinations may well work better. The "Gower" option is a range-normalised Manhattan distance.
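
The weighted average can be sketched as follows in Python. It assumes a distance has already been computed separately for each data type for a given pair of rows; the function name, the dictionary layout and the type labels are hypothetical.

```python
def combine_mixed(distance_by_type, n_variates_by_type):
    """Average per-type distances, weighted by the number of variates of each type."""
    total_variates = sum(n_variates_by_type[t] for t in distance_by_type)
    weighted_sum = sum(distance_by_type[t] * n_variates_by_type[t] for t in distance_by_type)
    return weighted_sum / total_variates


# Three continuous columns and one binary column (illustrative values)
distances = {"continuous": 0.2, "binary": 0.5}
counts = {"continuous": 3, "binary": 1}
print(combine_mixed(distances, counts))  # (0.2*3 + 0.5*1) / 4
```

Weighting by the number of variates means that a data type represented by many columns contributes proportionally more to the combined value.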

All-zeros rows: Some similarity measures (Dice, Jaccard, Simpson etc.) are undefined when comparing two all-zero rows. To avoid errors, especially when bootstrapping sparse data sets, the similarity is set to zero in such cases.

Missing data: Most of these measures treat missing data (coded as ‘?’) by pairwise deletion, meaning that if a value is missing in one of the variables in a pair of rows, that variable is omitted from the computation of the distance between those two rows. The exceptions are rho distance, using column average substitution, and Raup-Crick, which treats missing data as zero.
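
Pairwise deletion can be illustrated with a short Python sketch, here applied to a Manhattan-type sum. Missing values are represented by None in place of '?', and the function names are hypothetical. Any additional rescaling that a particular measure applies to compensate for the omitted variables is not shown.

```python
def pairwise_complete(x, y):
    """Keep only the variables that are present in both rows."""
    return [(a, b) for a, b in zip(x, y) if a is not None and b is not None]


def manhattan_pairwise(x, y):
    """Sum of absolute differences over the variables present in both rows."""
    return sum(abs(a - b) for a, b in pairwise_complete(x, y))


x = [1.0, None, 3.0, 4.0]
y = [2.0, 5.0, None, 6.0]
print(pairwise_complete(x, y))   # [(1.0, 2.0), (4.0, 6.0)]
print(manhattan_pairwise(x, y))  # 3.0
```

Note that the deletion applies only to the pair being compared: a variable omitted for one pair of rows is still used for other pairs where both values are present.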

References

Bray, J.R. & J.T. Curtis. 1957. An ordination of the upland forest communities of Southern Wisconsin. Ecological Monographs 27:325-349.

Horn, H.S. 1966. Measurement of overlap in comparative ecological studies. American Naturalist 100:419-424.

Kimura, M. 1980. A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16:111-120.

Raup, D. & R.E. Crick. 1979. Measurement of faunal similarity in paleontology. Journal of Paleontology 53:1213-1227.

Simpson, G.G. 1943. Mammals and the nature of continents. American Journal of Science 241:1-31.

Vincenty, T. 1975. Direct and inverse solutions of geodesics on the ellipsoid with application of nested equations. Survey Review 23(176):88-93.