Two-sample tests

This module provides a number of classical statistics and tests for comparing two univariate samples, given in two columns. It is also possible to specify the two groups using a single column of values and an additional Group column. Missing data are disregarded. For mathematical details, see the Past manual.

t test and related tests for equal means

Sample statistics

Means and variances are estimated as described under Univariate statistics. The 95% confidence interval for the mean is based on the standard error for the estimate of the mean, and the t distribution. Normal distribution is assumed.

The 95% confidence interval for the difference between the means accepts unequal sample sizes. The confidence interval is computed for the larger mean minus the smaller, i.e. the center of the CI should always be positive. The confidence interval for the difference in means is also estimated by bootstrapping (simple bootstrap), with the given number of replicates (default 9999).
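As an illustration (not Past's own code), the simple bootstrap for the difference in means can be sketched in Python with numpy; the function name and defaults here are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci_diff_means(x, y, n_boot=9999, alpha=0.05):
    """Simple (percentile) bootstrap CI for the difference in means,
    larger mean minus smaller, so the interval centers on a positive value."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if x.mean() < y.mean():
        x, y = y, x                      # larger mean minus smaller
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # resample each group with replacement, same sizes as the originals
        diffs[i] = rng.choice(x, x.size).mean() - rng.choice(y, y.size).mean()
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```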

t test

The t test has null hypothesis

H0: The two samples are taken from populations with equal means.

The t test assumes normal distributions and equal variances.

Unequal variance t test

The unequal variance t test is also known as the Welch test. It can be used as an alternative to the basic t test when equal variances cannot be assumed. Note that this version of the t test is the default in R.
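Both variants are widely available in standard libraries; for example, with scipy (a sketch using made-up data, not output from Past):

```python
import numpy as np
from scipy import stats

x = np.array([23.1, 25.4, 22.8, 26.0, 24.3])        # hypothetical sample 1
y = np.array([20.2, 21.9, 19.8, 22.5, 21.0, 20.7])  # hypothetical sample 2

# Classical t test, assuming equal variances:
t_eq, p_eq = stats.ttest_ind(x, y, equal_var=True)

# Welch's unequal variance t test (the default in R's t.test):
t_w, p_w = stats.ttest_ind(x, y, equal_var=False)
```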

Monte Carlo permutation test

The permutation test for equality of means uses the absolute difference in means as test statistic. This is equivalent to using the t statistic. The permutation test is non-parametric with few assumptions, but the two samples are assumed to be equal in distribution if the null hypothesis is true. The number of permutations can be set by the user. The power of the test is limited by the sample size – significance at the p<0.05 level can only be achieved for n>3 in each sample.
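The procedure can be sketched as follows (our own minimal implementation; Past's internal details, such as tie handling and the exact p value convention, may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_test_means(x, y, n_perm=9999):
    """Monte Carlo permutation test with |mean(x) - mean(y)| as
    the test statistic."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    pooled = np.concatenate([x, y])
    observed = abs(x.mean() - y.mean())
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)              # random reassignment to groups
        diff = abs(pooled[:x.size].mean() - pooled[x.size:].mean())
        if diff >= observed:
            count += 1
    # include the observed labelling itself among the permutations
    return (count + 1) / (n_perm + 1)
```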

Exact permutation test

As for the Monte Carlo permutation test, but all possible permutations are computed. Only available if the sum of the two sample sizes is less than 27.

Bayes factor

The Jeffreys-Zellner-Siow (JZS) Bayes factor is reported in favour of the alternative, i.e. it quantifies the evidence for the hypothesis of unequal means. It is calculated according to Rouder et al. (2009); see also the Past manual for details.

A BF value larger than 3 may be taken as “substantial” evidence for unequal means, but this is relative to the assumed “uninformative” prior (Rouder et al. 2009). Likewise, a BF value smaller than 1/3 may be taken as substantial evidence for equal means, but this rarely happens unless N is very large. BF>10 is reported as “strong evidence”; BF>30 as “very strong”; and BF>100 as “decisive evidence”.
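The JZS Bayes factor reduces to a one-dimensional numerical integral; the following is our own sketch of the formula in Rouder et al. (2009) for the two-sample case, not Past's implementation:

```python
import numpy as np
from scipy import integrate

def jzs_bf10(t, n1, n2):
    """JZS Bayes factor in favour of unequal means (BF10), following
    Rouder et al. (2009), with effective sample size N = n1*n2/(n1+n2)
    and df = n1+n2-2 for the two-sample t statistic t."""
    N = n1 * n2 / (n1 + n2)
    v = n1 + n2 - 2
    # marginal likelihood under H0 (up to a common constant)
    null_like = (1 + t**2 / v) ** (-(v + 1) / 2)
    # marginal likelihood under H1: integrate over the JZS prior on g
    def integrand(g):
        return ((1 + N * g) ** -0.5
                * (1 + t**2 / ((1 + N * g) * v)) ** (-(v + 1) / 2)
                * (2 * np.pi) ** -0.5 * g ** -1.5 * np.exp(-1 / (2 * g)))
    alt_like, _ = integrate.quad(integrand, 0, np.inf)
    return alt_like / null_like
```

For t = 0 the function returns a value below 1 (evidence for equal means), and it grows rapidly with |t|.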

Cohen's d

Cohen’s d (Cohen 1988) is a measure of effect size. Past calculates the classical version without bias correction (this usually does not matter). The textual interpretation of the value for Cohen’s d follows Sawilowsky (2009).
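The classical (uncorrected) version can be sketched as the difference in means divided by the pooled standard deviation:

```python
import numpy as np

def cohens_d(x, y):
    """Classical Cohen's d: difference in means over the pooled
    standard deviation, without small-sample bias correction."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = x.size, y.size
    pooled_var = ((n1 - 1) * x.var(ddof=1)
                  + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)
```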

F test for equal variances

The F test has null hypothesis

H0: The two samples are taken from populations with equal variance.

Normal distribution is assumed. The F statistic is the ratio of the larger variance to the smaller. The significance is two-tailed, with n1-1 and n2-1 degrees of freedom. Monte Carlo and exact permutation tests on the F statistic are computed as for the t test above.
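The procedure can be sketched as follows (our own illustration, not Past's code):

```python
import numpy as np
from scipy import stats

def f_test_variances(x, y):
    """F test for equal variances: larger sample variance over the
    smaller, two-tailed p from the F distribution."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    v1, v2 = x.var(ddof=1), y.var(ddof=1)
    if v1 >= v2:
        F, df1, df2 = v1 / v2, x.size - 1, y.size - 1
    else:
        F, df1, df2 = v2 / v1, y.size - 1, x.size - 1
    p = min(2 * stats.f.sf(F, df1, df2), 1.0)   # two-tailed
    return F, p
```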

Mann-Whitney test for stochastic equality

The two-tailed (Wilcoxon) Mann-Whitney U test is a non-parametric test and does not assume normal distribution. The null hypothesis is

H0: For randomly selected values x and y from the two populations, the probability of x>y is equal to the probability of y>x.

If the distributions are equal except for location, the null hypothesis can be interpreted as equal medians.

For each value in sample 1, count the number of values in sample 2 that are smaller than it (ties count 0.5). The total of these counts is the test statistic U (sometimes called T). If the value of U is smaller when reversing the order of samples, this value is chosen instead (it can be shown that U1+U2=n1n2).
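The counting procedure can be sketched directly (our own illustration; for real use, library routines with tie corrections are preferable):

```python
import numpy as np

def mann_whitney_u(x, y):
    """U by direct counting as described above: for each value in
    sample 1, count the values in sample 2 that are smaller (ties
    count 0.5); return the smaller of U and n1*n2 - U."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    u = sum((y < xi).sum() + 0.5 * (y == xi).sum() for xi in x)
    return min(u, x.size * y.size - u)
```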

The program computes an asymptotic approximation to p based on the normal distribution (two-tailed), which is only valid for large n. It includes a continuity correction and a correction for ties.

A Monte Carlo value based on the given number of random permutations (default 9999) is also given – the purpose of this is mainly as a control on the asymptotic value.

For n1+n2<=26 (e.g. 13 values in each group), an exact p value is given, based on all possible group assignments. If available, always use this exact value. For larger samples, the asymptotic approximation is quite accurate.

Vargha-Delaney A (effect size)

The Vargha-Delaney effect size (Vargha & Delaney 2000) is an estimate of the proportion P(X > Y) + 0.5*P(X = Y). Values close to 0.5 indicate small or no difference; close to 0 indicates X smaller than Y; close to 1 indicates X larger than Y.
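The estimate can be sketched by counting over all pairs (our own illustration):

```python
import numpy as np

def vargha_delaney_a(x, y):
    """Vargha-Delaney A: estimate of P(X > Y) + 0.5*P(X = Y)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    gt = sum((xi > y).sum() for xi in x)   # pairs with x larger
    eq = sum((xi == y).sum() for xi in x)  # tied pairs
    return (gt + 0.5 * eq) / (x.size * y.size)
```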

Brunner-Munzel test for stochastic equality

The Brunner-Munzel test (Brunner & Munzel 2000; Karch 2021) has the same null hypothesis as Mann-Whitney:

H0: For randomly selected values x and y from the two populations, the probability of x>y is equal to the probability of y>x.

However, the Brunner-Munzel test is less sensitive to differences in distribution and variance, and is therefore possibly superior to Mann-Whitney. The Brunner-Munzel test usually gives slightly higher p values than Mann-Whitney. The equations used in Past were taken from Fagerland et al. (2011).
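The test is also available in scipy (made-up data below; scipy's asymptotic version uses a t approximation by default, which may differ slightly from Past's equations):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 30)
y = rng.normal(1, 3, 30)    # hypothetical samples with different spread

# Brunner-Munzel statistic and asymptotic two-tailed p value:
res = stats.brunnermunzel(x, y)
```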

The “phat” value is an estimate of the proportion P(X < Y) + 0.5*P(X = Y).

If there is no overlap between the two samples, the Brunner-Munzel statistic is not defined. Past then sets the p value to 0.

A Monte Carlo value based on the given number of random permutations (default 9999) is also given – the purpose of this is mainly as a control on the asymptotic value.

For n1+n2<=26, an exact p value is given, based on all possible group assignments (Neubert & Brunner 2007). If available, always use this exact value. For larger samples, the asymptotic approximation is quite accurate.

Mood’s median test for equal medians

The median test is an alternative to the Mann-Whitney test for equal medians. The median test has low power, and the Mann-Whitney test is therefore usually preferable. However, there may be cases with strong outliers where Mood's test performs better.

The test simply counts the number of values in each sample that are above or below the pooled median, producing a 2x2 contingency table that is tested with a standard chi-squared test with one degree of freedom, without Yates' correction. For smaller data sets, a Fisher's exact test is also reported.
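An equivalent test is available in scipy (illustration with made-up data, not Past's code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 40)
y = rng.normal(2, 1, 40)    # hypothetical shifted sample

# Counts above/below the pooled median in a 2x2 table, tested with
# chi-squared; correction=False turns off Yates' correction, matching
# the description above:
stat, p, grand_median, table = stats.median_test(x, y, correction=False)
```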

Kolmogorov-Smirnov test for equal distributions

The Kolmogorov-Smirnov test is a nonparametric test for overall equal distribution of two univariate samples. In other words, it does not test specifically for equality of mean, variance or any other parameter. The null hypothesis is H0: The two samples are taken from populations with equal distribution.

In the version of the test provided by Past, both columns must represent samples. You cannot test a sample against a theoretical distribution (one-sample test).

The algorithm is based on Press et al. (1992), with significance estimated after Stephens (1970).

The permutation test uses 10,000 permutations. The permutation p value should be preferred for N&lt;30, and can safely be used in general.
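A comparable two-sample Kolmogorov-Smirnov test is available in scipy (illustration with made-up data; scipy's p value computation may differ in detail from Press et al.):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 50)
y = rng.normal(0, 2, 50)    # same mean, different spread

# Two-sample KS test on the maximum distance between the two
# empirical distribution functions:
res = stats.ks_2samp(x, y)
```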

Anderson-Darling test for equal distributions

The Anderson-Darling test is a nonparametric test for overall equal distribution of two univariate samples. It is an alternative to the Kolmogorov-Smirnov test.

The test statistic A2N is computed according to Pettitt (1976). This statistic is transformed to a statistic called Z according to Scholz & Stephens (1987). The p value is computed by interpolation and extrapolation in Table 1 (m=1) of Scholz & Stephens (1987), using a curve fit to the von Bertalanffy model. The approximation is fairly accurate for p<0.25. For p>0.25, the p values are estimated using a polynomial fit to values obtained by permutation. A p value based on Monte Carlo permutation with N=999 is also provided.
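The k-sample version of Scholz & Stephens (1987) is available in scipy; with two samples this corresponds to the test described here, although scipy's p value interpolation differs in detail from Past's:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 40)
y = rng.normal(0.8, 1, 40)  # hypothetical shifted sample

# k-sample Anderson-Darling test (Scholz & Stephens 1987). Note that
# scipy caps the reported significance level to [0.001, 0.25]:
res = stats.anderson_ksamp([x, y])
```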

Epps-Singleton test for equal distributions

The Epps-Singleton test (Epps & Singleton 1986; Goerg & Kaiser 2009) is a nonparametric test for overall equal distribution of two univariate samples. It is typically more powerful than the Kolmogorov-Smirnov test, and unlike the Kolmogorov-Smirnov it can be used also for non-continuous (i.e. ordinal) data. The null hypothesis is H0: The two samples are taken from populations with equal distribution.

The mathematics behind the Epps-Singleton test are complicated. The test is based on the Fourier transform of the empirical distribution function, called the empirical characteristic function (ECF). The ECF is generated for each sample and sampled at two points (t1=0.4 and t2=0.8, standardized for the pooled semi-interquartile range). The test statistic W2 is based on the difference between the two sampled ECFs, standardized by their covariance matrices. A small-sample correction to W2 is applied if both sample sizes are less than 25. The p value is based on the chi-squared distribution. For details, see Epps & Singleton (1986) and Goerg & Kaiser (2009).
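An implementation is available in scipy (illustration with made-up data); its default sampling points t=(0.4, 0.8) match the description above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 40)
y = rng.normal(0.5, 2, 40)  # hypothetical samples

# Epps-Singleton test based on the empirical characteristic function:
res = stats.epps_singleton_2samp(x, y)
```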

Coefficient of variation (Fligner-Kileen test)

This module tests for equal coefficient of variation in two samples. The coefficient of variation (or relative variation) is defined as the ratio of standard deviation to the mean in percent.

The 95% confidence intervals are estimated by bootstrapping (simple bootstrap), with the given number of replicates (default 9999).
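The CV and its simple bootstrap CI can be sketched as follows (our own illustration; function names and defaults are not Past's):

```python
import numpy as np

rng = np.random.default_rng(0)

def cv_percent(x):
    """Coefficient of variation: standard deviation over mean, in percent."""
    x = np.asarray(x, float)
    return 100 * x.std(ddof=1) / x.mean()

def cv_bootstrap_ci(x, n_boot=9999, alpha=0.05):
    """Simple (percentile) bootstrap CI for the CV."""
    x = np.asarray(x, float)
    cvs = [cv_percent(rng.choice(x, x.size)) for _ in range(n_boot)]
    return tuple(np.percentile(cvs, [100 * alpha / 2, 100 * (1 - alpha / 2)]))
```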

The null hypothesis of the statistical test is:

H0: The samples were taken from populations with the same coefficient of variation.

If the given p(normal) is less than 0.05, equal coefficient of variation can be rejected. Donnelly & Kramer (1999) describe the coefficient of variation and review a number of statistical tests for the comparison of two samples. They recommend the Fligner-Killeen test (Fligner & Killeen 1976), as implemented in Past. This test is both powerful and relatively insensitive to the underlying distribution.

The following statistics are reported:

T: The Fligner-Killeen test statistic, which is a sum of transformed ranked positions of the smaller sample within the pooled sample (see Donnelly & Kramer 1999 for details).
E(T): The expected value for T.
z: The z statistic, based on T, Var(T) and E(T). Note this is a large-sample approximation.
p: The p(H0) value. Both the one-tailed and two-tailed values are given. For the alternative hypothesis of difference in either direction, the two-tailed value should be used. However, the Fligner-Killeen test has been used to compare variation within a sample of fossils with variation within a closely related modern species, to test for multiple fossil species (Donnelly & Kramer 1999). In this case the alternative hypothesis might be that CV is larger in the fossil population; if so, a one-tailed test can be used for increased power.

References

Brunner, E., Munzel, U. 2000. The nonparametric Behrens-Fisher problem: Asymptotic theory and a small-sample approximation. Biometrical Journal 42:17–25.

Cohen, J. 1988. Statistical power analysis for the behavioral sciences (2nd ed.). Academic Press, NY.

Donnelly, S.M. & Kramer, A. 1999. Testing for multiple species in fossil samples: An evaluation and comparison of tests for equal relative variation. American Journal of Physical Anthropology 108:507-529.

Epps, T.W. & Singleton, K.J. 1986. An omnibus test for the two-sample problem using the empirical characteristic function. Journal of Statistical Computation and Simulation 26:177–203.

Fagerland, M.W., Sandvik, L. & Mowinckel, P. 2011. Parametric methods outperformed non-parametric methods in comparisons of discrete numerical variables – Additional File 1: Test statistics. BMC Medical Research Methodology 11:44.

Fligner, M.A. & Killeen, T.J. 1976. Distribution-free two sample tests for scale. Journal of the American Statistical Association 71:210-213.

Goerg, S.J. & Kaiser, J. 2009. Nonparametric testing of distributions – the Epps-Singleton two-sample test using the empirical characteristic function. The Stata Journal 9:454-465.

Karch, J.D. 2021. Psychologists should use Brunner-Munzel's instead of Mann-Whitney's U test as the default nonparametric procedure. Advances in Methods and Practices in Psychological Science 4(2).

Neubert, K., Brunner, E. 2007. A studentized permutation test for the non-parametric Behrens-Fisher problem. Computational Statistics & Data Analysis 51:5192–5204.

Pettitt, A.N. 1976. A two-sample Anderson-Darling rank statistic. Biometrika 63:161-168.

Press, W.H., Teukolsky, S.A., Vetterling, W.T. & Flannery, B.P. 1992. Numerical Recipes in C. 2nd edition. Cambridge University Press.

Rouder, J.N., Speckman, P.L., Sun, D., Morey, R.D. & Iverson, G. 2009. Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review 16:225–237.

Sawilowsky, S. 2009. New effect size rules of thumb. Journal of Modern Applied Statistical Methods 8:467–474.

Scholz, F.W. & Stephens, M.A. 1987. K-sample Anderson–Darling tests. Journal of the American Statistical Association 82:918–924.

Stephens, M.A. 1970. Use of the Kolmogorov-Smirnov, Cramer-von Mises and related statistics without extensive tables. Journal of the Royal Statistical Society, Series B 32:115-122.

Vargha, A. & Delaney, H.D. 2000. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics 25:101–132.

 

Published Aug. 31, 2020 9:56 PM - Last modified June 3, 2024 12:20 AM