Compositional data transforms

Multivariate data that sum to a constant by design, such as percentages summing to 100, are called compositional data (Aitchison 1986). Such data contain “spurious” correlations, because when one value increases, the others must decrease to keep the sum constant. Some multivariate analyses and tests, such as PCA, can be negatively affected by this. Past includes three commonly used transforms that can be applied to compositional data before further analysis. The data should be in the usual multivariate format, with variables in columns and items in rows. The values in each row should sum to a constant, e.g. 1 or 100. Negative values are not allowed, and missing data are not supported.
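The input requirements above can be checked before transforming. The following is a minimal sketch in Python, not part of Past; the function name check_composition and the tolerance are assumptions made for illustration.

```python
import math

def check_composition(rows, tol=1e-9):
    """Return the common row sum, or raise ValueError if the data are unsuitable.

    Hypothetical helper: rows is a list of rows, each a list of numbers.
    """
    if not rows:
        raise ValueError("no data")
    totals = []
    for i, row in enumerate(rows):
        for v in row:
            # Missing data are not supported.
            if v is None or math.isnan(v):
                raise ValueError(f"row {i}: missing value")
            # Negative values are not allowed.
            if v < 0:
                raise ValueError(f"row {i}: negative value")
        totals.append(sum(row))
    # Every row should sum to the same constant (e.g. 1 or 100).
    first = totals[0]
    for i, t in enumerate(totals):
        if not math.isclose(t, first, rel_tol=tol):
            raise ValueError(f"row {i}: sum {t} differs from {first}")
    return first

data = [[20.0, 30.0, 50.0],
        [10.0, 0.0, 90.0]]
total = check_composition(data)
```

Zeros pass this check, since they are handled separately before the log transform.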

Additive logratio (ALR)

The input data (one row) is a vector x with N dimensions.

alr(x) = [ln(x1/xN), ..., ln(xN/xN)]

The last element is always zero, since ln(xN/xN) = ln(1) = 0, so the transformed data effectively have N-1 dimensions. Because the ALR is computed relative to the last element xN, it may be a good idea to place a low-noise variable with high values in the last column.
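As an illustration, the ALR of a single row could be computed as follows. This is a sketch rather than Past's own code; the function name alr is hypothetical, and the row is assumed to be strictly positive (zeros already replaced).

```python
import math

def alr(x):
    """Additive logratio of one row, using the last element as the reference."""
    ref = x[-1]
    return [math.log(v / ref) for v in x]

row = [20.0, 30.0, 50.0]
y = alr(row)
# y has N elements, but the last one is ln(xN/xN) = 0 and carries no information.
```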

Centered logratio (CLR)

clr(x) = [ln(x1/g(x)), ..., ln(xN/g(x))]

where g(x) is the geometric mean of the data vector. This is equivalent to log-transforming each value and then subtracting the mean of the log-transformed values in that row.
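The two formulations of the CLR can be compared directly. The sketch below is illustrative only; the function names are hypothetical and the row is assumed to contain no zeros.

```python
import math

def clr(x):
    """CLR of one row: log-transform, then subtract the mean of the logs."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)  # this is ln(g(x))
    return [l - mean_log for l in logs]

def clr_via_geometric_mean(x):
    """CLR of one row, written as ln(xi / g(x))."""
    g = math.exp(sum(math.log(v) for v in x) / len(x))  # geometric mean
    return [math.log(v / g) for v in x]

row = [20.0, 30.0, 50.0]
a = clr(row)
b = clr_via_geometric_mean(row)
```

Both functions give the same values up to rounding, and the CLR values of a row sum to zero.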

Isometric logratio (ILR)

The isometric logratio transform was introduced by Egozcue et al. (2003). It has good theoretical properties, notably that it preserves distances between compositions, but the results are difficult to interpret because the transformed variables are complicated combinations of the original variables. See the Past manual for equations. The transformed data vector has N-1 dimensions. The last column (column N) is filled with zeros for reference, and should not be included in further analysis.
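To give a feel for what the transformed variables look like, the sketch below computes ILR coordinates for one particular orthonormal basis, in which coordinate i contrasts the geometric mean of the first i parts with part i+1. This basis is an assumption chosen for illustration; Past may use a different one and therefore give different values, so the Past manual remains the reference. The function name ilr is hypothetical, and the row is assumed to contain no zeros.

```python
import math

def ilr(x):
    """ILR coordinates of one row for one assumed orthonormal basis.

    Coordinate i (i = 1..N-1) is sqrt(i/(i+1)) * ln(g(x1..xi) / x(i+1)),
    where g is the geometric mean. Returns N-1 values.
    """
    logs = [math.log(v) for v in x]
    coords = []
    running = 0.0
    for i in range(1, len(x)):
        running += logs[i - 1]   # sum of ln(x1) .. ln(xi)
        mean_log = running / i   # ln of the geometric mean of x1..xi
        coords.append(math.sqrt(i / (i + 1)) * (mean_log - logs[i]))
    return coords

row = [20.0, 30.0, 50.0]
y = ilr(row)
```

Unlike the Past output, this sketch returns only the N-1 coordinates and does not append a zero-filled column.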

Treatment of zero values

All three transforms involve log-transforming the original data, so zero values cannot be transformed directly. Past treats zero values following equation (6) in Martin-Fernandez et al. (2003):

rj = d if xj = 0; (1 - z*d/c)*xj if xj > 0

Here, d is a small value approximating the lower detection limit of the measurements. It can be set by the user in the “Zero threshold” box (default 0.01). The value z is the number of zeros in the original data vector, while c is the total sum of the row, computed from the data (e.g. 100 for percentages).
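The replacement can be sketched as below, following the multiplicative replacement of Martin-Fernandez et al. (2003) with the same d for every zero: nonzero values are scaled by (1 - z*d/c), so that the row still sums to c. This illustrates the formula and is not Past's own code; the function name replace_zeros is hypothetical.

```python
def replace_zeros(x, d=0.01):
    """Replace zeros in one row by d and rescale the nonzero values.

    d plays the role of the "Zero threshold" (default 0.01).
    """
    c = sum(x)                       # row total, e.g. 100 for percentages
    z = sum(1 for v in x if v == 0)  # number of zeros in the row
    scale = 1.0 - z * d / c          # shrink nonzero values to keep the total
    return [d if v == 0 else scale * v for v in x]

row = [0.0, 25.0, 75.0]
r = replace_zeros(row)
```

After replacement all values are strictly positive, so the log-based transforms can be applied.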

References

Aitchison, J. 1986. The Statistical Analysis of Compositional Data. Chapman & Hall.

Egozcue, J.J., Pawlowsky-Glahn, V., Mateu-Figueras, G. & Barcelo-Vidal, C. 2003. Isometric logratio transformations for compositional data analysis. Mathematical Geology 35:279-300.

Martin-Fernandez, J.A., Barcelo-Vidal, C. & Pawlowsky-Glahn, V. 2003. Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Mathematical Geology 35:253-278.

Published Aug. 31, 2020 7:49 PM - Last modified Aug. 31, 2020 7:50 PM