ISSN : 1598-7248 (Print)
ISSN : 2234-6473 (Online)
Industrial Engineering & Management Systems Vol.19 No.2 pp.386-397
DOI : https://doi.org/10.7232/iems.2020.19.2.386

# A Study on Robustness of the Paired Sample Tests

Chanseok Park, Min Wang, Wook-Yeon Hwang*
Applied Statistics Laboratory, Department of Industrial Engineering, Pusan National University, Busan, Republic of Korea
Department of Management Science and Statistics, The University of Texas at San Antonio, San Antonio, TX, USA
College of Global Business, Dong-A University, Busan, Republic of Korea
*Corresponding Author, E-mail: wyhwang@dau.ac.kr
October 20, 2019 March 3, 2020 April 8, 2020

## ABSTRACT

The paired sample t-test is one of the most widely used statistical procedures for testing the equality of the means of two paired populations. The basic underlying assumption of the test is that the observations are normally distributed and uncontaminated, but this assumption is easily violated in practice, which can distort the result of the test. Other paired sample tests do not rely on this assumption, so an extensive comparison among them in terms of robustness can serve as a guideline for practitioners. In this regard, we investigate the robustness properties of several widely used paired sample tests and evaluate these tests by comparing their performances under the possibility of (i) data contamination and (ii) normal model departure. These tests include the paired sample t-test, Wilcoxon test, Yuen’s t-test, and two robustified t-tests. It is shown that the robustified t-tests perform well even when the basic underlying assumption is valid, and that they clearly outperform the other tests in the case of data contamination and normal model departure.

## 1. INTRODUCTION

The problem of comparing the means of two paired populations is a staple topic in introductory statistics courses at colleges and universities. It is also a crucial statistical tool for experimental studies in biology, education, and other fields. In many experimental situations, each subject may be measured at two different times or for two related conditions or units, which results in a natural pairing of observations. Let $(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n)$ be $n$ pairs which are independently chosen with $E[X_i] = \mu_1$ and $E[Y_i] = \mu_2$. We shall be interested in testing for their mean difference of the form

$H_0 : \mu_d = \mu_1 - \mu_2 = 0 \quad \text{and} \quad H_1 : \mu_d \neq 0,$

which is indeed a single-sample t-test. Then the test statistic is given by

$t = \frac{\bar{d} - 0}{s_d / \sqrt{n}}$
(1)

where $\bar{d}$ and $s_d$ are the sample mean and sample standard deviation of the differences $(X_i - Y_i)$, respectively.

At the significance level of $\alpha$, the null hypothesis is rejected if $|t| > t_{1-\alpha/2,\, n-1}$, where the critical value $t_{1-\alpha/2,\, n-1}$ is the $(1-\alpha/2)$ quantile of the Student t-distribution with $n-1$ degrees of freedom. The test statistic in (1) relies on the assumption that the observations are normally distributed and uncontaminated, whereas this assumption is easily violated in practice. As will be shown in Section 3, the presence of a single outlying observation could lead to an opposite decision. Thus, a robustified version of this test statistic which is less sensitive to data contamination and model departure is clearly warranted.
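As a concrete illustration of the statistic in (1), the following Python sketch computes $t$ from paired samples using only the standard library. The function name `paired_t_statistic` and the toy data are hypothetical and not taken from the paper; the critical value 2.776 is the 0.975 quantile of the Student t-distribution with 4 degrees of freedom, hard-coded for this toy sample of size 5 rather than computed.

```python
import math
import statistics


def paired_t_statistic(x, y):
    """Return the paired t statistic in (1) for H0: mu_d = 0."""
    d = [xi - yi for xi, yi in zip(x, y)]   # paired differences X_i - Y_i
    n = len(d)
    d_bar = statistics.mean(d)              # sample mean of the differences
    s_d = statistics.stdev(d)               # sample standard deviation (divisor n - 1)
    return (d_bar - 0.0) / (s_d / math.sqrt(n))


# Hypothetical toy data: five subjects measured under two conditions.
x = [12.0, 15.0, 11.0, 14.0, 13.0]
y = [10.0, 12.0, 10.0, 11.0, 12.0]

t = paired_t_statistic(x, y)
T_CRIT = 2.776  # 0.975 quantile of Student t with n - 1 = 4 degrees of freedom
reject_h0 = abs(t) > T_CRIT
print(f"t = {t:.3f}, reject H0 at the 5% level: {reject_h0}")
```

Here the differences are 2, 3, 1, 3, 1, so $\bar{d} = 2$ and $s_d = 1$, which gives $t = 2\sqrt{5}$.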

Yuen and Dixon (1973) extended the approximate behavior of the one-sample trimmed t-statistic (Tukey, 1960) and compared the performance of the trimmed t-statistic with that of the test statistic in (1) when the data set exhibits heavier tails than the normal distribution. More recently, by replacing the sample mean and the sample standard deviation in (1) with robust estimators, Park (2018) and Jeong et al. (2018) proposed two robustified analogues of the test statistic and showed that both test statistics converge to the standard normal distribution. However, it may not be appropriate to use the asymptotic standard normal distribution to make a decision, especially when the sample size is small. For this reason, Park and Wang (2018a) provided the empirical distributions of the two robustified analogues. They also developed the rt.test package (Park and Wang, 2018b) for the R language (R Core Team, 2018) to implement these robustified tests.

It deserves mentioning that Kim et al. (2018) obtained the confidence intervals using these two robustified t-test statistics and illustrated the effects of data contamination by outliers on these test statistics through three real-data examples. However, they did not consider the performance of these statistics when the normality assumption is violated. Here, the normality and non-contamination assumptions are related to the robustness issues of normal model departure and outlier resistance in the statistics literature. For more details, readers are referred to Hampel et al. (1986) and Basu et al. (2011).

The previous research did not compare the performances of several paired sample tests under the possibility of (i) data contamination and (ii) normal model departure, although such a comparison is important in practice. Therefore, in this paper, we carry out extensive Monte Carlo simulations as well as a real data analysis to investigate the impacts of model departure and data contamination on several paired sample test statistics. These tests include the conventional t-test in (1), Wilcoxon test (Wilcoxon, 1945), Yuen’s t-test (Yuen and Dixon, 1973) and two robustified t-tests proposed by Park (2018) and Jeong et al. (2018). The numerical study shows that the robustified t-tests are quite efficient even when the basic underlying assumptions are valid and that they clearly outperform the others in the presence of normal model departure or data contamination.

The remainder of this paper is organized as follows. In Section 2, we review the robustified analogues of the t-test statistic by Park (2018) and Jeong et al. (2018). In Section 3, we provide an illustrative example to show the limitation of the conventional t-test in practice. We conduct extensive simulation studies for evaluating the robustness of the considered tests to data contamination in Section 4 and to model departure from normality in Section 5. Some concluding remarks are provided in Section 6, with the derivation of the theoretical power function deferred to the Appendix.

## 2. REVIEW ON THE ROBUSTIFIED t-TEST STATISTICS

The basic idea of the robustified t-test statistics is to replace the anti-robust estimators ($\bar{d}$ and $s_d$) in (1) with their robust alternatives. By using the sample median and the sample median absolute deviation (MAD), Park (2018) proposed the following statistic

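$T_A = \sqrt{\frac{2n}{\pi}} \, \Phi^{-1}\!\left(\frac{3}{4}\right) \frac{\operatorname{median}_{1 \le i \le n} X_i - \mu}{\operatorname{median}_{1 \le i \le n} \left| X_i - \operatorname{median}_{1 \le i \le n} X_i \right|},$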
(2)

where $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function. A nice property of the above statistic is that it is a pivotal quantity; see Proposition 1 of Park (2018). Also, it converges to the standard normal distribution; see Proposition 2 of Park (2018). It should be noted that the above statistic was recently incorporated into a robustified version of sequential bifurcation, a factor screening method for simulation (Liu et al., 2019).
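A minimal Python sketch of the statistic in (2) is given below, using only the standard library. It is an illustration written for this text, not the rt.test implementation; the function name `t_a` and the sample values are hypothetical, and the statistic is assumed to be applied to the paired differences with $\mu = 0$.

```python
import math
import statistics
from statistics import NormalDist


def t_a(x, mu=0.0):
    """Robustified statistic T_A in (2): sample median scaled by the MAD."""
    n = len(x)
    med = statistics.median(x)
    mad = statistics.median([abs(xi - med) for xi in x])  # median absolute deviation
    z34 = NormalDist().inv_cdf(0.75)                      # Phi^{-1}(3/4)
    return math.sqrt(2.0 * n / math.pi) * z34 * (med - mu) / mad


# Hypothetical differences, and the same sample with one gross outlier.
clean = [2.0, 3.0, 1.0, 3.0, 1.0]
contaminated = [2.0, 300.0, 1.0, 3.0, 1.0]

print(t_a(clean), t_a(contaminated))
```

In this toy sample, replacing one observation by 300 changes neither the median (2) nor the MAD (1), so the two values of $T_A$ coincide. Note that the MAD can be zero for small samples with many ties, in which case the ratio is undefined.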

Later on, by using the Hodges and Lehmann (1963) and Shamos (1976) estimators, Jeong et al. (2018) suggested the following statistic

$T_B = \sqrt{\frac{3n}{2\pi}} \, \Phi^{-1}\!\left(\frac{3}{4}\right) \frac{\operatorname{median}_{i \le j} (X_i + X_j) - 2\mu}{\operatorname{median}_{i \le j} \left( |X_i - X_j| \right)},$
(3)

which is also a pivotal quantity and converges to the standard normal distribution. They showed that hypothesis testing using (3) outperforms that using (2).
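The statistic in (3) can be sketched in the same way. The code below is an illustrative standard-library Python version, not the rt.test implementation; the function name `t_b` and the sample are hypothetical. Following the index set $i \le j$ written in (3), the pairs include the case $i = j$.

```python
import math
import statistics
from itertools import combinations_with_replacement
from statistics import NormalDist


def t_b(x, mu=0.0):
    """Robustified statistic T_B in (3), built from pairwise sums and differences."""
    n = len(x)
    pairs = list(combinations_with_replacement(x, 2))        # all (x_i, x_j) with i <= j
    num = statistics.median([a + b for a, b in pairs]) - 2.0 * mu  # twice (Hodges-Lehmann - mu)
    den = statistics.median([abs(a - b) for a, b in pairs])  # Shamos-type scale
    z34 = NormalDist().inv_cdf(0.75)                         # Phi^{-1}(3/4)
    return math.sqrt(3.0 * n / (2.0 * math.pi)) * z34 * num / den


# Hypothetical differences.
d = [1.0, 2.0, 3.0, 4.0, 5.0]
print(t_b(d))
```

For this sample the median of the 15 pairwise sums is 6 and the median of the pairwise absolute differences is 1. As with the MAD, the denominator can be zero when a small sample has many ties.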

Although the statistics in (2) and (3) have a nice asymptotic property, the convergence rate is somewhat slow. Thus, especially for a small sample size, it is highly recommended to use empirical distributions instead of the asymptotic distributions. Recently, Park and Wang (2018a, 2018b) developed the rt.test R package based on the empirical distributions, which enables one to perform statistical inference more accurately and facilitates the implementation of these statistics in various practical applications. It should be noted that the rt.test R package can handle sample sizes up to $n = 100$, which is enough for most practical uses.

## 3. A REAL EXAMPLE

We considered the following fifteen observations used by Fisher (1966) and Welch (1987):

These observations are the differences in heights of cross- and self-fertilized plants. We performed the paired test of $H_0 : \mu_d = 0$ versus $H_1 : \mu_d \neq 0$ using the conventional t-test, Wilcoxon test, Yuen’s t-test, and the two robustified t-tests. The p-values of the conventional t-test, Wilcoxon test, Yuen’s t-test with trim 0.1 (Yuen (0.1)), Yuen’s t-test with trim 0.29 (Yuen (0.29)), the robustified t-test with $T_A$ (rt-test ($T_A$)), and the robustified t-test with $T_B$ (rt-test ($T_B$)) are 4.97%, 4.126%, 4.926%, 1.399%, 2.677%, and 1.168%, respectively. All of these tests therefore reject $H_0$ at the significance level of 5%.

We evaluated their robustness properties by comparing the changes of their p-values when one observation is contaminated. To this end, we made contaminated data sets by replacing the fifteenth observation (its original value is 75) with a contaminated value, denoted by $d_{15}$, which ranges from 30 to 130. With these contaminated data sets, we performed the paired tests using the aforementioned methods and plotted the resulting p-values in Figure 1.

It can be seen from Figure 1 that the conventional t-test, the Wilcoxon test, and Yuen (0.1) are easily affected by a single contaminated value, so that the decision can be reversed from rejecting $H_0$ to failing to reject $H_0$. In addition, we summarized the ranges of $d_{15}$ that lead to rejecting $H_0$ at the 5% level in Table 1. We observed that Yuen (0.29), rt-test ($T_A$), and rt-test ($T_B$) are robust to data contamination and that these tests should clearly be preferred, especially when some of the observations are contaminated.
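The single-observation contamination experiment can be sketched for the conventional t-test as follows. This is a standard-library Python illustration under two loudly stated assumptions: the fifteen observations are taken to be Darwin's plant-height differences as tabulated in Fisher (1966), sorted in ascending order so that the value 75 is the fifteenth observation, and the critical value 2.1448 is the hard-coded 0.975 quantile of the Student t-distribution with 14 degrees of freedom. The names `DIFFS`, `t_statistic`, and `rejects_h0` are hypothetical.

```python
import math
import statistics

# ASSUMPTION: Darwin's fifteen differences (Fisher, 1966), sorted ascending,
# so that the fifteenth observation is 75 as stated in the text.
DIFFS = [-67, -48, 6, 8, 14, 16, 23, 24, 28, 29, 41, 49, 56, 60, 75]
T_CRIT = 2.1448  # 0.975 quantile of Student t with 14 degrees of freedom


def t_statistic(d):
    """Conventional t statistic in (1) for H0: mu_d = 0."""
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))


def rejects_h0(d15):
    """Replace the fifteenth observation by d15 and test at the 5% level."""
    d = DIFFS[:-1] + [d15]
    return abs(t_statistic(d)) > T_CRIT


# Values of d15 in 30, 31, ..., 130 for which the t-test still rejects H0.
rejecting_values = [v for v in range(30, 131) if rejects_h0(v)]
print(min(rejecting_values), max(rejecting_values))
```

Under these assumptions the original data give $t \approx 2.148$, which only just exceeds the critical value, so moving the fifteenth observation to either end of the range 30 to 130 is enough to lose the rejection.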

Next, we considered a different contamination scheme with the same data above. To be more specific, we contaminated the last two observations as follows.

where $\delta = -50, -49, \ldots, 0, \ldots, 49, 50$. Notice that the value of the sample mean $\bar{d}$ does not change for any value of $\delta$ in this contamination scheme. We performed the same paired test of $H_0 : \mu_d = 0$ versus $H_1 : \mu_d \neq 0$ as above and evaluated the robustness properties of the tests by comparing the changes of their p-values as the value of $\delta$ changes. With these contaminated data sets, we performed the paired tests again and plotted the resulting p-values in Figure 2. In addition, we summarized the ranges of $\delta$ that lead to rejecting $H_0$ at the 5% level in Table 2. We observe that the result is similar to the case of a single contamination discussed above.

In the following sections, we conduct extensive simulation studies to further compare the performances of these tests, including the conventional t-test, Wilcoxon test, Yuen’s t-test with a reasonable trim of 0.29, rt-test ($T_A$), and rt-test ($T_B$). We used a trim value of 0.29 for Yuen’s t-test, which makes its breakdown point the same as that of rt-test ($T_B$). Here, the breakdown point is a widely used criterion for measuring the robustness of an estimator and is defined as the proportion of incorrect observations (i.e., arbitrarily small or large observations) that the estimator can handle before giving estimated values arbitrarily close to zero or infinity. For more details, see Hodges (1967), Donoho and Huber (1983), Subsection 2.2a of Hampel et al. (1986), and Subsection 1.6.1 of Hettmansperger and McKean (2010). The larger the breakdown point of an estimator, the more robust it is. For instance, the sample mean and the sample standard deviation are not robust statistics and have a breakdown point of 0.
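The notion of a breakdown point can be illustrated with a small Python sketch that contrasts the sample mean with the sample median, whose breakdown point of 0.5 is a standard result. The data, the contaminating value `1e6`, and the helper name `contaminate` are hypothetical choices made only for this illustration.

```python
import statistics

clean = list(range(1, 12))  # 1, 2, ..., 11; mean and median are both 6


def contaminate(data, k, bad_value=1e6):
    """Replace the k largest observations with an arbitrarily large value."""
    kept = sorted(data)[: len(data) - k]
    return kept + [bad_value] * k


# The mean is ruined by a single bad value; the median holds until
# more than half of the eleven observations are replaced.
for k in range(0, 7):
    sample = contaminate(clean, k)
    print(k, statistics.mean(sample), statistics.median(sample))
```

With eleven observations, the median stays at 6 for up to five contaminated values and jumps to the contaminating value once six are replaced, whereas the mean is already dominated by one contaminated value.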

## 4. SIMULATION FOR EVALUATING ROBUSTNESS TO DATA CONTAMINATION

We compared the powers of the considered tests for $H_0 : \mu = 0$ versus $H_1 : \mu \neq 0$ under two scenarios: (i) with no contamination and (ii) with data contamination. For each scenario, we generated paired observations $(X_i, Y_i)$ for $i = 1, 2, \ldots, n$ from a bivariate normal distribution

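$\begin{pmatrix} X_i \\ Y_i \end{pmatrix} \sim N\left( \begin{pmatrix} \mu \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_x^2 & \rho \sigma_x \sigma_y \\ \rho \sigma_x \sigma_y & \sigma_y^2 \end{pmatrix} \right),$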

where the value of $\mu$ ranges from −2 to 2 with an increment of 0.1. We took $n = 10$, $\sigma_x = \sigma_y = 1$, and $\rho = 0, 0.8, -0.8$. We obtained the empirical powers of these tests through Monte Carlo simulation with 10,000 iterations.
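The simulation design described above can be sketched in Python for the conventional paired t-test as follows. This is a simplified standard-library illustration, not the code used for the reported results: the function names, the seed, and the construction of correlated normals from two independent standard normals are assumptions of this sketch, and the critical value 2.2622 is the hard-coded 0.975 quantile of the Student t-distribution with 9 degrees of freedom.

```python
import math
import random

T_CRIT_9 = 2.2622  # 0.975 quantile of Student t with n - 1 = 9 degrees of freedom


def simulate_differences(n, mu, rho, rng, sigma_x=1.0, sigma_y=1.0):
    """Draw n pairs from the bivariate normal model and return X_i - Y_i."""
    diffs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x = mu + sigma_x * z1
        y = sigma_y * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)  # Corr(X, Y) = rho
        diffs.append(x - y)
    return diffs


def paired_t(d):
    """Conventional t statistic in (1) for H0: mu = 0."""
    n = len(d)
    d_bar = math.fsum(d) / n
    s2 = math.fsum((v - d_bar) ** 2 for v in d) / (n - 1)
    return d_bar / math.sqrt(s2 / n)


def empirical_power(mu, rho, n=10, iterations=10_000, seed=1):
    """Fraction of simulated data sets in which H0 is rejected at the 5% level."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(iterations):
        if abs(paired_t(simulate_differences(n, mu, rho, rng))) > T_CRIT_9:
            rejections += 1
    return rejections / iterations


for rho in (0.0, 0.8, -0.8):
    print(rho, empirical_power(mu=1.0, rho=rho, iterations=2000))
```

At $\mu = 0$ the empirical power should be close to the nominal level of 5%, which gives a quick sanity check on the simulation before comparing tests.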

In the case of no contamination, we plotted the empirical powers of the considered tests in Figure 3. We also superimposed the theoretical power curve for the paired t-test (blue dots). We observed that the empirical power curve is essentially the same as the theoretical power curve, as expected. The theoretical power function for testing $H_0 : \mu = \mu_d$ versus $H_1 : \mu \neq \mu_d$ is given by

$K_t(\mu) = 1 - \Phi_{\nu, \delta}(t_{\alpha/2}) + \Phi_{\nu, \delta}(-t_{\alpha/2}),$
(4)

where $\delta = (\mu - \mu_d) / (\sigma_d / \sqrt{n})$. See the Appendix for a detailed derivation. Since $\sigma_d^2 = \sigma_x^2 + \sigma_y^2 - 2 \rho \sigma_x \sigma_y$, we have $\sigma_d^2 = 2, 0.4, 3.6$ for $\rho = 0, 0.8, -0.8$, respectively.
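The variance of the differences and the resulting non-centrality parameter can be checked with a few lines of Python. The function names `diff_variance` and `noncentrality` are hypothetical, and the choice $\mu = 1$ with $\mu_d = 0$ is only an example point on the power curve.

```python
import math


def diff_variance(sigma_x, sigma_y, rho):
    """Variance of D = X - Y: sigma_x^2 + sigma_y^2 - 2 * rho * sigma_x * sigma_y."""
    return sigma_x ** 2 + sigma_y ** 2 - 2.0 * rho * sigma_x * sigma_y


def noncentrality(mu, mu_d, sigma_d, n):
    """Non-centrality parameter delta = (mu - mu_d) / (sigma_d / sqrt(n))."""
    return (mu - mu_d) / (sigma_d / math.sqrt(n))


for rho in (0.0, 0.8, -0.8):
    var_d = diff_variance(1.0, 1.0, rho)
    delta = noncentrality(mu=1.0, mu_d=0.0, sigma_d=math.sqrt(var_d), n=10)
    print(f"rho = {rho:+.1f}: sigma_d^2 = {var_d:.1f}, delta = {delta:.3f}")
```

For the same $\mu$ and $n$, a positive correlation shrinks $\sigma_d$ and therefore inflates $\delta$, which is why the power curves differ so much across the three values of $\rho$.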

In the case of data contamination, we consider three different contamination schemes as follows.

• (a) The value of $y_1$ is replaced with $y_1 = 5$. The simulation results (empirical powers) are summarized in Figure 4.

• (b) The values of $y_1$ and $y_2$ are replaced with $y_1 = 5$ and $y_2 = 5$. The simulation results are in Figure 5.

• (c) The values of $y_1$ and $y_2$ are replaced with $y_1 = 5$ and $y_2 = -5$. The simulation results are in Figure 6.

From Figure 3 (no contamination) and Figures 4, 5, 6 (contamination), we can draw the following conclusions.

1. As one expects, the conventional t-test performs the best and its power is close to the theoretical one in the absence of data contamination. It is noteworthy that the power of rt-test ($T_B$) is very close to that of the second-best test, the Wilcoxon test, and is noticeably higher than those of rt-test ($T_A$) and Yuen (0.29).

2. In the absence of data contamination, when $\mu > 0$ the powers of rt-test ($T_A$) and rt-test ($T_B$) are close to each other and are higher than that of Yuen (0.29); when $\mu < 0$, rt-test ($T_B$) performs the best among the three robust tests.

3. When the data set is contaminated, the conventional t-test and the Wilcoxon test lose power substantially, even with a single contaminated value. As one expects, the two robustified tests clearly outperform the conventional t-test and the Wilcoxon test under this scenario.

4. At the same breakdown point of 0.29, the performance of rt-test ($T_B$) is superior to that of Yuen (0.29) in both the presence and the absence of data contamination.

5. The power of the five considered tests is greatly influenced by the correlation coefficient between the paired samples: the closer the correlation is to one, the larger the power of the tests. The main reason is as follows. Considering the term $\sigma_d / \sqrt{n}$ in (4), the sample size needed for the paired t-test is $n \sim (1 - \rho) \sigma^2$. Thus, when the correlation is positive, the paired t-test attains the same power with a smaller sample size. However, if the correlation is negative, the paired t-test can perform very poorly. A very similar phenomenon can be observed for the Wilcoxon test. For more details, see Section 2.12.1 of Hettmansperger and McKean (2010).

## 5. SIMULATION FOR EVALUATING ROBUSTNESS TO MODEL DEPARTURE FROM THE NORMALITY

We compared the powers of the considered tests for $H_0 : \mu = 0$ versus $H_1 : \mu \neq 0$ under the following cases of model departure from normality: (i) logistic distribution, (ii) Laplace distribution, (iii) Student t-distribution with three degrees of freedom, and (iv) uniform distribution.

We generated paired observations $(X_i, Y_i)$ for $i = 1, \cdots, n$ as follows. The second samples from the respective non-normal distributions are generated in such a manner that they have the same mean $\mu$ and the same variance $\sigma^2$. For the logistic distribution, a standard logistic random variable, $Z_i$, has a mean equal to zero and a variance equal to $\pi^2/3$. Therefore, we generated our logistic samples using

$Y_i = \mu + \frac{\sqrt{3}\, \sigma}{\pi} \cdot Z_i.$
(5)

Then we have $E(Y_i) = \mu$ and $\mathrm{Var}(Y_i) = \sigma^2$. A standard Laplace random variable, $Z_i$, has a mean equal to zero and a variance equal to two. Thus, in this case, we generated our samples using

$Y_i = \mu + \frac{\sigma}{\sqrt{2}} \cdot Z_i.$
(6)

For Student t-distribution, we used

$Y_i = \mu + \frac{\sigma}{\sqrt{\nu / (\nu - 2)}} \cdot Z_i,$
(7)

where $Z_i$ has the Student t-distribution with $\nu = 3$ degrees of freedom. Finally, for the uniform distribution, let $Z_i$ be a uniform random variable on $(-1, 1)$, which has a variance of 1/3. We used

$Y_i = \mu + \sqrt{3}\, \sigma Z_i.$
(8)
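The scalings in (5) to (8) can be sketched in Python with the standard library as follows. The function names and the particular ways of drawing the standard variables (inverse CDF for the logistic, a difference of two unit exponentials for the Laplace, and a normal over a scaled chi-square root for the Student t) are assumptions of this illustration, not a description of how the reported simulations were coded.

```python
import math
import random


def logistic_sample(mu, sigma, rng):
    """Equation (5): standard logistic Z has variance pi^2 / 3."""
    u = rng.random()
    while u == 0.0:                      # avoid log(0)
        u = rng.random()
    z = math.log(u / (1.0 - u))          # inverse CDF of the standard logistic
    return mu + math.sqrt(3.0) * sigma / math.pi * z


def laplace_sample(mu, sigma, rng):
    """Equation (6): standard Laplace Z has variance 2."""
    z = rng.expovariate(1.0) - rng.expovariate(1.0)  # difference of two Exp(1)
    return mu + sigma / math.sqrt(2.0) * z


def student_t_sample(mu, sigma, rng, nu=3):
    """Equation (7): Student t with nu > 2 has variance nu / (nu - 2)."""
    z0 = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    z = z0 / math.sqrt(chi2 / nu)
    return mu + sigma / math.sqrt(nu / (nu - 2)) * z


def uniform_sample(mu, sigma, rng):
    """Equation (8): uniform Z on (-1, 1) has variance 1 / 3."""
    z = rng.uniform(-1.0, 1.0)
    return mu + math.sqrt(3.0) * sigma * z


rng = random.Random(2020)
print([round(logistic_sample(0.0, 1.0, rng), 3) for _ in range(3)])
```

Each sampler returns one value of $Y_i$ with mean $\mu$ and variance $\sigma^2$, so that the four distributions differ in shape but not in location or scale.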

Using $n = 10$ paired observations $(X_i, Y_i)$ from each of the non-normal distributions, we performed the test for $H_0 : \mu = 0$ versus $H_1 : \mu \neq 0$. We provide the empirical powers of the paired tests under various non-normal distributions in Figure 7. We observe from Figure 7 that, for all the considered underlying distributions, the powers of rt-test ($T_B$) are similar to those of the conventional t-test and the Wilcoxon test, but rt-test ($T_B$) offers higher power than rt-test ($T_A$) and Yuen (0.29). When the sample size is small ($n = 10$) and the underlying distribution is non-normal, Yuen (0.29) does not turn out to be superior to the conventional t-test, even though the latter depends on the normality assumption of the observations in the data set. Finally, as the sample size increases (not shown here for simplicity), the powers of all the tests increase significantly.

## 6. CONCLUDING REMARKS

In this paper, we conducted extensive simulation studies to investigate the performances of several widely used test statistics for comparing the means of two paired populations in the cases of data contamination and model departure from normality. These tests include the conventional t-test, Wilcoxon test (Wilcoxon, 1945), Yuen’s t-test (Yuen and Dixon, 1973) and the two robustified t-tests proposed by Park (2018) and Jeong et al. (2018). It is shown that the conventional t-test and the Wilcoxon test are extremely sensitive in the presence of data contamination or model departure and that the decision can be easily reversed by a single contaminated value. Consequently, a robustified test statistic which is less sensitive to data contamination and model departure is clearly warranted in practice.

Among the three robust test statistics, Yuen (0.29), rt-test ($T_A$), and rt-test ($T_B$), we recommend rt-test ($T_B$) in practical situations. Although the power of a statistical test usually depends on the level of data contamination, the degree of departure from normality, and the sample size, the numerical studies showed that rt-test ($T_B$) outperforms the others in most cases, regardless of whether data contamination or normal model departure exists in the underlying samples.

## APPENDIX: THEORETICAL POWER FUNCTION

Lemma 1. Let $D_i = X_i - Y_i$ be normally distributed with mean $\mu$ and variance $\sigma_d^2$.

Let $\bar{D} = \sum_{i=1}^{n} D_i / n$ (sample mean) and $S_d^2 = \sum_{i=1}^{n} (D_i - \bar{D})^2 / (n - 1)$ (sample variance). Then the test statistic below

$\frac{\bar{D} - \mu_d}{S_d / \sqrt{n}}$

has a non-central t-distribution with $n-1$ degrees of freedom and non-centrality $\delta = (\mu - \mu_d) / (\sigma_d / \sqrt{n}).$

Proof. By the definition of a non-central t-distribution, the following quantity has a non-central t-distribution with $r$ degrees of freedom and non-centrality $\delta$

$\frac{Z + \delta}{\sqrt{V / r}},$

where $Z \sim N(0, 1)$, $V$ has a chi-square distribution with $r$ degrees of freedom, and $Z$ and $V$ are independent.

Let $V = (n - 1) S_d^2 / \sigma_d^2$ for convenience. Then $V$ has a chi-square distribution with $n-1$ degrees of freedom. We have

$\frac{\bar{D} - \mu_d}{S_d / \sqrt{n}} = \frac{(\bar{D} - \mu_d) / (\sigma_d / \sqrt{n})}{\sqrt{V / (n - 1)}} = \frac{(\bar{D} - \mu) / (\sigma_d / \sqrt{n}) + (\mu - \mu_d) / (\sigma_d / \sqrt{n})}{\sqrt{V / (n - 1)}} = \frac{Z + (\mu - \mu_d) / (\sigma_d / \sqrt{n})}{\sqrt{V / (n - 1)}} = \frac{Z + \delta}{\sqrt{V / (n - 1)}},$

where $\delta = (\mu - \mu_d) / (\sigma_d / \sqrt{n})$ and $Z = (\bar{D} - \mu) / (\sigma_d / \sqrt{n}) \sim N(0, 1).$ Since $S_d^2$ and $\bar{D}$ are independent, $V$ and $Z$ are also independent. This completes the proof.

For convenience, we denote $T = (\bar{D} - \mu_d) / (S_d / \sqrt{n}).$ Then the critical region for testing $H_0 : \mu = \mu_d$ versus $H_1 : \mu \neq \mu_d$ is given by $|T| > t_{\alpha/2}$, and the power function is given by

$K_t(\mu) = P(|T| > t_{\alpha/2}) = P(T > t_{\alpha/2}) + P(T < -t_{\alpha/2}).$

It is immediate from Lemma 1 that $T$ has a non-central t-distribution with $n-1$ degrees of freedom and non-centrality

$\delta = \frac{\mu - \mu_d}{\sigma_d / \sqrt{n}}.$

Thus, we have

$K_t(\mu) = 1 - \Phi_{\nu, \delta}(t_{\alpha/2}) + \Phi_{\nu, \delta}(-t_{\alpha/2}),$
(A6)

where $\Phi_{\nu, \delta}(\cdot)$ is the cumulative distribution function of the non-central t-distribution with $\nu = n - 1$ degrees of freedom and non-centrality $\delta$.

## ACKNOWLEDGMENT

This work was supported by a 2-Year Research Grant of Pusan National University.

## Figure

Figure 1. The p-values of the considered paired tests when the fifteenth observation in the data set is contaminated.

Figure 2. The p-values of the considered paired tests when the last two observations are contaminated with $\delta$.

Figure 3. Empirical powers of the paired tests (with no contamination).

Figure 4. Empirical powers of the paired tests (with contamination: $y_1 = 5$).

Figure 5. Empirical powers of the paired tests (with contamination: $y_1 = 5$ and $y_2 = 5$).

Figure 6. Empirical powers of the paired tests (with contamination: $y_1 = 5$ and $y_2 = -5$).

Figure 7. Empirical powers of the paired tests under various non-normal distributions.

## Table

Table 1. The range of $d_{15}$ for making a decision of rejecting $H_0$ at the level 5%

Table 2. The range of $\delta$ for making a decision of rejecting $H_0$ at the level 5%

## REFERENCES

1. Basu, A., Shioya, H., and Park, C. (2011), Statistical Inference: The Minimum Distance Approach, Monographs on Statistics and Applied Probability, Chapman & Hall.
2. Donoho, D. and Huber, P. J. (1983), The notion of breakdown point. In: P. Bickel, K. Doksum, J. L. Hodges (eds.), A Festschrift for Erich Lehmann, Wadsworth, Belmont, CA, 157-184.
3. Fisher, R. A. (1966), The Design of Experiments (8th ed.), Oliver and Boyd, Edinburgh.
4. Hampel, F. R., Ronchetti, E., Rousseeuw, P. J., and Stahel, W. A. (1986), Robust Statistics: The Approach Based on Influence Functions, John Wiley & Sons, New York.
5. Hettmansperger, T. P. and McKean, J. W. (2010), Robust Nonparametric Statistical Methods (2nd ed.), Chapman & Hall/CRC, Boca Raton, FL.
6. Hodges, J. L. and Lehmann, E. L. (1963), Estimates of location based on rank tests, Annals of Mathematical Statistics, 34(2), 598-611.
7. Hodges Jr., J. L. (1967), Efficiency in normal samples and tolerance of extreme values for some estimates of location, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, 163-186.
8. Jeong, R., Son, S. B., Lee, H. J., and Kim, H. (2018), On the robustification of the z-test statistic, Presented at KIIE Conference, Gyeongju, Korea.
9. Kim, H., Park, C., and Wang, M. (2018), Paired t-test based on robustified statistics, Presented at KIIE Conference, Seoul, Korea.
10. Liu, L., Ma, Y., Park, C., and Byun, J. H. (2019), Robust sequential bifurcation for simulation factor screening under data contamination, Computers & Industrial Engineering, 129, 102-112.
11. Park, C. (2018), Note on the robustification of the Student t-test statistic using the median and the median absolute deviation, Available from: https://arxiv.org/abs/1805.12256, ArXiv e-prints.
12. Park, C. and Wang, M. (2018a), Empirical distributions of the robustified t-test statistics, Available from: https://arxiv.org/abs/1807.02215, ArXiv e-prints.
13. Park, C. and Wang, M. (2018b), rt.test: Robustified t-test, Available from: https://cran.r-project.org/web/packages/rt.test/.
14. R Core Team (2018), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.
15. Shamos, M. I. (1976), Geometry and statistics: Problems at the interface. In: J. F. Traub (ed.), Algorithms and Complexity: New Directions and Recent Results, Academic Press, New York, 251-280.
16. Tukey, J. W. (1960), A survey of sampling from contaminated distributions. In: Contributions to Probability and Statistics, Stanford University Press, Stanford, Calif, 448-485.
17. Welch, W. J. (1987), Rerandomizing the median in matched-pairs designs, Biometrika, 74(3), 609-614.
18. Wilcoxon, F. (1945), Individual comparisons by ranking methods, Biometrics Bulletin, 1(6), 80-83.
19. Yuen, K. K. and Dixon, W. J. (1973), The approximate behaviour and performance of the two-sample trimmed t, Biometrika, 60(2), 369-374.