1. INTRODUCTION
The problem of testing the equality of the means of two paired populations is a staple topic in introductory statistics courses at colleges and universities. It is also a crucial statistical approach for experimental studies in biology, education, and other fields. In many experimental situations, each subject is measured at two different times or under two related conditions, which results in a natural pairing of observations. Let $(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n)$ be n independently chosen pairs with $E[X_i]=\mu_1$ and $E[Y_i]=\mu_2$, and let $\mu_d=\mu_1-\mu_2$ denote the mean of the differences $X_i - Y_i$. We shall be interested in testing for their mean difference of the form
$$H_0: \mu_d = 0 \quad \text{versus} \quad H_1: \mu_d \neq 0,$$
which is indeed a single-sample t-test applied to the differences. Then the test statistic is given by
$$t = \frac{\bar{d}}{s_d/\sqrt{n}}, \tag{1}$$
where $\bar{d}$ and s_{d} are the sample mean and sample standard deviation of the differences (X_{i} − Y_{i}), respectively.
At the significance level of α, the null hypothesis is rejected if $\left|t\right| > t_{1-\alpha/2,\,n-1}$, where the critical value $t_{1-\alpha/2,\,n-1}$ is the (1−α/2) quantile of the Student t-distribution with n−1 degrees of freedom. The test statistic in (1) relies on the assumption that the observations are normally distributed and uncontaminated, whereas this assumption is easily violated in practice. As will be shown in Section 3, the presence of a single outlying observation could lead to an opposite decision. Thus, a robustified version of this test statistic which is less sensitive to data contamination and model departure is clearly warranted.
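As a minimal sketch of the rejection rule above, the following Python snippet computes the paired t statistic from the differences and compares it with a user-supplied critical value. The function names and the toy data are illustrative assumptions; the critical value is taken from a standard t-table because the Python standard library has no t-quantile function.

```python
import math
import statistics


def paired_t_statistic(x, y):
    """Return mean(d) / (sd(d) / sqrt(n)) for the differences d = x - y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    d_bar = statistics.mean(d)
    s_d = statistics.stdev(d)  # sample standard deviation (n - 1 denominator)
    return d_bar / (s_d / math.sqrt(n))


def reject_h0(t_stat, critical_value):
    """Two-sided decision: reject H0 when |t| exceeds the critical value."""
    return abs(t_stat) > critical_value


# Hypothetical toy data, for illustration only.
x = [12.0, 15.0, 11.0, 14.0, 13.0]
y = [10.0, 14.0, 10.5, 11.0, 12.0]
t = paired_t_statistic(x, y)
# 2.776 is the tabulated 0.975 quantile of the t-distribution with 4 degrees of freedom.
decision = reject_h0(t, 2.776)
```

Both the mean and the standard deviation in this statistic use every observation with equal weight, which is why a single extreme difference can move the statistic across the critical value.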
Yuen and Dixon (1973) studied the approximate behavior of the one-sample trimmed t-statistic (Tukey, 1960) and compared its performance with that of the test statistic in (1) when the data have heavier tails than the normal distribution. More recently, by replacing the sample mean and the sample standard deviation in (1) with robust estimators, Park (2018) and Jeong et al. (2018) proposed two robustified analogues of the test statistic and showed that both converge to the standard normal distribution. However, it may not be appropriate to rely on the asymptotic standard normal distribution to make a decision, especially when the sample size is small. For this reason, Park and Wang (2018a) provided the empirical distributions of the two robustified analogues. They also developed the rt.test package (Park and Wang, 2018b) for the R language (R Core Team, 2018) to implement these robustified tests.
It is worth mentioning that Kim et al. (2018) obtained confidence intervals using these two robustified t-test statistics and illustrated the effects of data contamination by outliers on these test statistics through three real-data examples. However, they did not consider the performance of the statistics when the normality assumption is violated. Here, the normality and non-contamination assumptions correspond to the robustness issues of normal model departure and outlier resistance in the statistics literature. For more details, readers are referred to Hampel et al. (1986) and Basu et al. (2011).
Previous research has not compared the performance of several paired-sample tests under the possibility of (i) data contamination and (ii) normal model departure, although such a comparison is important in practice. Therefore, in this paper, we carry out extensive Monte Carlo simulations as well as a real-data analysis to investigate the impact of model departure and data contamination on several paired-sample test statistics. These tests include the conventional t-test in (1), the Wilcoxon test (Wilcoxon, 1945), Yuen’s t-test (Yuen and Dixon, 1973), and the two robustified t-tests proposed by Park (2018) and Jeong et al. (2018). The numerical study shows that the robustified t-tests remain quite efficient when the underlying assumptions hold and clearly outperform the other tests in the presence of normal model departure or data contamination.
The remainder of this paper is organized as follows. In Section 2, we review the robustified analogues of the t-test statistic by Park (2018) and Jeong et al. (2018). In Section 3, we provide an illustrative example to show the limitation of the conventional t-test in practice. We conduct extensive simulation studies to evaluate the robustness of the considered tests to data contamination in Section 4 and to model departure from normality in Section 5. Some concluding remarks are provided in Section 6, with the derivation of the theoretical power function deferred to the Appendix.
2. REVIEW ON THE ROBUSTIFIED t-TEST STATISTICS
The basic idea of the robustified t-test statistics is to replace the non-robust estimators ($\bar{d}$ and s_{d}) in (1) with their robust alternatives. By using the sample median and the sample median absolute deviation (MAD), Park (2018) proposed the following statistic
$$T_A = \sqrt{\frac{2n}{\pi}}\,\Phi^{-1}\!\left(\frac{3}{4}\right)\frac{\mathop{\mathrm{median}}_{i}\, d_i - \mu_d}{\mathop{\mathrm{median}}_{i}\left|d_i - \mathop{\mathrm{median}}_{j}\, d_j\right|}, \tag{2}$$
where $d_i = X_i - Y_i$, $\mu_d$ is the mean of the differences, and Φ^{−1} is the inverse of the standard normal cumulative distribution function. A nice property of the above statistic is that it is a pivotal quantity; see Proposition 1 of Park (2018). Also, it converges to the standard normal distribution; see Proposition 2 of Park (2018). It should be noted that the above statistic has recently been incorporated into the robustification of sequential bifurcation for a simulation factor screening method (Liu et al., 2019).
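Later on, by using the Hodges and Lehmann (1963) and Shamos (1976) estimators of the differences $d_i = X_i - Y_i$, Jeong et al. (2018) suggested the following statistic
$$T_B = \sqrt{\frac{6n}{\pi}}\,\Phi^{-1}\!\left(\frac{3}{4}\right)\frac{\mathop{\mathrm{median}}_{i\le j}\left(\dfrac{d_i+d_j}{2}\right) - \mu_d}{\mathop{\mathrm{median}}_{i<j}\left|d_i - d_j\right|}, \tag{3}$$
which is also a pivotal quantity and converges to the standard normal distribution. They showed that hypothesis testing based on (3) outperforms that based on (2).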
Although the statistics in (2) and (3) have nice asymptotic properties, their convergence is somewhat slow. Thus, especially for small sample sizes, it is highly recommended to use empirical distributions instead of the asymptotic distributions. Park and Wang (2018a, 2018b) developed the rt.test R package based on the empirical distributions, which enables more accurate statistical inference and facilitates the use of these statistics in various practical applications. It should be noted that the rt.test R package can handle sample sizes up to n = 100, which is sufficient for most practical uses.
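The two statistics reviewed in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the rt.test implementation: the normalizing constants $\sqrt{2n/\pi}\,\Phi^{-1}(3/4)$ and $\sqrt{6n/\pi}\,\Phi^{-1}(3/4)$ are the ones implied by a standard normal limit for normal data, the Hodges-Lehmann estimator is taken over pairs with i ≤ j (one of several variants), and the function names `t_a` and `t_b` are hypothetical.

```python
import math
from itertools import combinations, combinations_with_replacement
from statistics import NormalDist, median

PHI_INV_075 = NormalDist().inv_cdf(0.75)  # Phi^{-1}(3/4), about 0.6745


def t_a(d, mu=0.0):
    """Median/MAD statistic (assumed normalization; requires MAD > 0)."""
    n = len(d)
    med = median(d)
    mad = median([abs(v - med) for v in d])  # unscaled median absolute deviation
    return math.sqrt(2 * n / math.pi) * PHI_INV_075 * (med - mu) / mad


def t_b(d, mu=0.0):
    """Hodges-Lehmann/Shamos statistic (assumed normalization and HL variant)."""
    n = len(d)
    # Hodges-Lehmann: median of the Walsh averages (d_i + d_j) / 2 over i <= j.
    hl = median([(a + b) / 2 for a, b in combinations_with_replacement(d, 2)])
    # Shamos: median of the pairwise absolute differences |d_i - d_j| over i < j.
    shamos = median([abs(a - b) for a, b in combinations(d, 2)])
    return math.sqrt(6 * n / math.pi) * PHI_INV_075 * (hl - mu) / shamos


# Hypothetical differences; replacing the largest value by an outlier
# leaves the median and the MAD, and hence t_a, unchanged.
clean = [1, 2, 3, 4, 5]
dirty = [1, 2, 3, 4, 500]
same_under_outlier = t_a(clean) == t_a(dirty)
```

For small samples these values should be referred to the empirical distributions rather than to the standard normal quantiles, as recommended above.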
3. A REAL EXAMPLE
We considered the following fifteen observations used by Fisher (1966) and Welch (1987):
$$-67,\ -48,\ 6,\ 8,\ 14,\ 16,\ 23,\ 24,\ 28,\ 29,\ 41,\ 49,\ 56,\ 60,\ 75.$$
These observations are the differences in heights between cross- and self-fertilized plants. We performed the paired test of H_{0} : μ_{d} = 0 versus H_{1} : μ_{d} ≠ 0 using the conventional t-test, the Wilcoxon test, Yuen’s t-test, and the two robustified t-tests. The p-values of the conventional t-test, the Wilcoxon test, Yuen with trim 0.1 (Yuen (0.1)), Yuen with trim 0.29 (Yuen (0.29)), the robustified t-test with T_{A} (rt-test (T_{A})), and the robustified t-test with T_{B} (rt-test (T_{B})) are 4.97%, 4.126%, 4.926%, 1.399%, 2.677%, and 1.168%, respectively. All of these tests therefore reject H_{0} at the significance level of 5%.
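Two of these results can be checked with the Python standard library alone. The sketch below assumes that the fifteen observations are Darwin's plant-height differences as commonly tabulated, listed in increasing order so that the fifteenth value is 75; the function names are illustrative. It computes the conventional t statistic and the exact two-sided Wilcoxon signed-rank p-value by counting rank subsets.

```python
import math
import statistics

# Assumed data: Darwin's fifteen height differences, sorted in increasing order.
d = [-67, -48, 6, 8, 14, 16, 23, 24, 28, 29, 41, 49, 56, 60, 75]


def t_statistic(d):
    """Conventional paired t statistic computed from the differences."""
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))


def wilcoxon_exact_p(d):
    """Exact two-sided signed-rank p-value; assumes no zeros and no tied |d|."""
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    total = n * (n + 1) // 2
    # counts[s] = number of subsets of the ranks {1, ..., n} that sum to s,
    # which gives the null distribution of the signed-rank sum.
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    w_min = min(w_plus, total - w_plus)
    tail = sum(counts[: w_min + 1])
    return min(1.0, 2 * tail / 2 ** n)


t = t_statistic(d)
p_wilcoxon = wilcoxon_exact_p(d)
```

Under this data assumption the t statistic lies just above the tabulated 5% critical value for 14 degrees of freedom, and the exact Wilcoxon p-value agrees with the 4.126% reported above, so both decisions sit very close to the rejection boundary.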
We evaluated their robustness properties by comparing the changes in their p-values when one observation is contaminated. To this end, we made contaminated data sets by replacing the fifteenth observation (its original value is 75) with a contaminated value, denoted by d_{15}, which ranges from 30 to 130. With these contaminated data sets, we performed the paired tests using the aforementioned methods and plotted the resulting p-values in Figure 1.
Figure 1 shows that the paired tests based on the conventional t-test, the Wilcoxon test, and Yuen (0.1) are easily affected by a single contaminated value, so that the decision can change from rejecting H_{0} to not rejecting H_{0}. In addition, we summarized the ranges of d_{15} for which H_{0} is rejected at the 5% level in Table 1. We observed that Yuen with trim 0.29, rt-test (T_{A}), and rt-test (T_{B}) are robust to data contamination and that these tests should clearly be preferred, especially when some of the observations are contaminated.
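The sensitivity of the conventional t-test to the fifteenth observation can be illustrated with a short sweep. The sketch below assumes the same fifteen Darwin differences as commonly tabulated (sorted, with 75 last) and uses the tabulated 0.975 quantile of the t-distribution with 14 degrees of freedom as the critical value; the names `BASE`, `T_CRIT_14`, and `rejecting_values` are hypothetical.

```python
import math
import statistics

# Assumed data: Darwin's fifteen height differences, sorted so that 75 is last.
BASE = [-67, -48, 6, 8, 14, 16, 23, 24, 28, 29, 41, 49, 56, 60, 75]
T_CRIT_14 = 2.1448  # tabulated 0.975 quantile of t with 14 degrees of freedom


def t_statistic(d):
    """Conventional paired t statistic computed from the differences."""
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))


def rejecting_values(lo=30, hi=130):
    """Integer values of d_15 in [lo, hi] for which the t-test rejects H0 at 5%."""
    keep = []
    for value in range(lo, hi + 1):
        contaminated = BASE[:-1] + [value]  # replace only the fifteenth observation
        if abs(t_statistic(contaminated)) > T_CRIT_14:
            keep.append(value)
    return keep


rejecting = rejecting_values()
```

Moving d_{15} toward either end of the range pushes the t statistic below the critical value: a larger d_{15} raises the mean but inflates the standard deviation even more, which is the mechanism behind the reversed decision.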
Next, we considered a different contamination scheme with the same data above. To be more specific, we contaminated the last two observations as follows.
where $\delta = -50, -49, \dots, 0, \dots, 49, 50$. Notice that the value of $\bar{d}$ does not change for any value of δ in this contamination scheme. We performed the same paired test of H_{0} : μ_{d} = 0 versus H_{1} : μ_{d} ≠ 0 as above and evaluated the robustness of the tests by comparing the changes in their p-values as the value of δ changes. With these contaminated data sets, we performed the paired tests again and plotted the resulting p-values in Figure 2. In addition, we summarized the ranges of δ for which H_{0} is rejected at the 5% level in Table 2. We observe that the result is similar to the case of a single contamination discussed above.
In the following sections, we conduct extensive simulation studies to further compare the performance of these tests, including the conventional t-test, the Wilcoxon test, Yuen’s t-test with a trim of 0.29, rt-test (T_{A}), and rt-test (T_{B}). We used a trim value of 0.29 for Yuen’s t-test so that its breakdown point is the same as that of rt-test (T_{B}). Here, the breakdown point is a widely used criterion for measuring the robustness of an estimator; it is defined as the proportion of incorrect observations (i.e., arbitrarily small or large observations) that an estimator can handle before giving estimated values arbitrarily close to zero or infinity. For more details, see Hodges (1967), Donoho and Huber (1983), Subsection 2.2a of Hampel et al. (1986), and Subsection 1.6.1 of Hettmansperger and McKean (2010). The larger the breakdown point of an estimator, the more robust it is. For instance, the sample mean and the sample standard deviation are not robust and have a breakdown point of 0.
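The notion of a breakdown point can be made concrete with a small experiment. The sketch below is illustrative only: it uses hypothetical data 1, ..., 15, replaces the k largest observations with an arbitrarily large value, and compares the sample mean, a trimmed mean with trim 0.29 (assuming floor(trim × n) observations are dropped from each end), and the sample median.

```python
import statistics


def trimmed_mean(data, trim):
    """Mean after dropping floor(trim * n) observations from each end."""
    xs = sorted(data)
    g = int(trim * len(xs))
    return statistics.mean(xs[g: len(xs) - g])


def contaminate_top(data, k, value=1e9):
    """Replace the k largest observations with an arbitrarily large value."""
    xs = sorted(data)
    return xs[: len(xs) - k] + [value] * k


data = list(range(1, 16))  # hypothetical sample of size n = 15

mean_k1 = statistics.mean(contaminate_top(data, 1))        # one bad value suffices
trim_k4 = trimmed_mean(contaminate_top(data, 4), 0.29)      # 4 of 15 still tolerated
trim_k5 = trimmed_mean(contaminate_top(data, 5), 0.29)      # 5 of 15 breaks it
median_k7 = statistics.median(contaminate_top(data, 7))     # 7 of 15 still tolerated
median_k8 = statistics.median(contaminate_top(data, 8))     # 8 of 15 breaks it
```

With n = 15, the trimmed mean with trim 0.29 discards four observations on each side, so it withstands up to four corrupted values, whereas the sample mean is ruined by a single one.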
4. SIMULATION FOR EVALUATING ROBUSTNESS TO DATA CONTAMINATION
We compared the powers of the considered tests for H_{0} : μ = 0 versus H_{1} : μ ≠ 0 under two scenarios: (i) with no contamination and (ii) with data contamination. For each scenario, we generated paired observations (X_{i}, Y_{i}) for i = 1, 2, …, n from a bivariate normal distribution with standard deviations σ_{x} and σ_{y} and correlation coefficient ρ, where the value of the mean parameter μ ranges from −2 to 2 with an increment of 0.1. We took n = 10, σ_{x} = σ_{y} = 1, and ρ = 0, 0.8, −0.8. We obtained the empirical powers of these tests through extensive Monte Carlo simulation with 10,000 iterations.
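A simulation of this kind can be sketched with the Python standard library for the conventional t-test. The sketch makes several assumptions that are not taken from the paper: X_{i} carries the mean μ and Y_{i} has mean zero, correlated normal pairs are built from two independent standard normal draws, the critical value is the tabulated 0.975 quantile of the t-distribution with 9 degrees of freedom, and all function names are hypothetical.

```python
import math
import random

T_CRIT_9 = 2.2622  # tabulated 0.975 quantile of t with 9 degrees of freedom (n = 10)


def simulate_pairs(n, mu, rho, rng):
    """n pairs with unit variances and correlation rho; X has mean mu (assumed)."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mu + z1
        y = rho * z1 + math.sqrt(1 - rho ** 2) * z2
        pairs.append((x, y))
    return pairs


def paired_t(d):
    """Paired t statistic from a list of differences."""
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)
    return mean / math.sqrt(var / n)


def empirical_power(mu, rho, n=10, reps=10_000, seed=1):
    """Fraction of simulated data sets in which the t-test rejects H0: mu = 0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        d = [x - y for x, y in simulate_pairs(n, mu, rho, rng)]
        if abs(paired_t(d)) > T_CRIT_9:
            rejections += 1
    return rejections / reps


# Example: one point of the power curve for each correlation setting.
powers = {rho: empirical_power(1.0, rho, reps=2000) for rho in (0.0, 0.8, -0.8)}
```

Evaluating `empirical_power` over a grid of μ values traces an empirical power curve; the robust tests would need their own statistics and critical values in place of `paired_t` and `T_CRIT_9`.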
In the case of no contamination, we plotted the empirical powers of the considered tests in Figure 3. We also superimposed the theoretical power curve for the paired t-test (blue dots). We observed that the empirical power curve is essentially the same as the theoretical power, as expected. The theoretical power function for testing H_{0} : μ = μ_{d} versus H_{1} : μ ≠ μ_{d} is given by
$$\text{Power}(\mu) = 1 - \Phi_{\nu,\delta}\left(t_{\alpha/2}\right) + \Phi_{\nu,\delta}\left(-t_{\alpha/2}\right), \tag{4}$$
where $\Phi_{\nu,\delta}(\cdot)$ is the cumulative distribution function of the noncentral t-distribution with ν = n−1 degrees of freedom and noncentrality $\delta = (\mu-\mu_d)/(\sigma_d/\sqrt{n})$, and $t_{\alpha/2}$ is the critical value of the test. See the Appendix for a detailed derivation. Since $\sigma_d^2 = \sigma_x^2 + \sigma_y^2 - 2\rho\sigma_x\sigma_y$, we have $\sigma_d^2 = 2, 0.4, 3.6$ for ρ = 0, 0.8, −0.8, respectively.
In the case of data contamination, we consider three different contamination schemes as follows.

(a) The value of y_{1} is replaced with y_{1} = 5. The simulation results (empirical powers) are summarized in Figure 4.

(b) The values of y_{1} and y_{2} are replaced with y_{1} = 5 and y_{2} = 5. The simulation results are in Figure 5.

(c) The values of y_{1} and y_{2} are replaced with y_{1} = 5 and y_{2} = −5. The simulation results are in Figure 6.
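The three contamination schemes amount to overwriting the first one or two y-values of each simulated sample. A minimal sketch, assuming the pairs are stored as a list of (x, y) tuples and using a hypothetical helper name `contaminate`:

```python
def contaminate(pairs, scheme):
    """Return a copy of the pairs with y_1 (and y_2) overwritten per scheme a, b, or c."""
    replacements = {
        "a": [5.0],         # y_1 = 5
        "b": [5.0, 5.0],    # y_1 = 5 and y_2 = 5
        "c": [5.0, -5.0],   # y_1 = 5 and y_2 = -5
    }
    if scheme not in replacements:
        raise ValueError("scheme must be 'a', 'b', or 'c'")
    out = list(pairs)  # copy so the original sample is left untouched
    for i, y_new in enumerate(replacements[scheme]):
        x_i, _ = out[i]
        out[i] = (x_i, y_new)
    return out


# Hypothetical sample of three pairs, for illustration only.
sample = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)]
scheme_c = contaminate(sample, "c")
```

Schemes (a) and (b) shift the differences in one direction, while scheme (c) places the two outliers on opposite sides.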
From Figure 3 (no contamination) and Figures 4, 5, 6 (contamination), we can draw the following conclusions.

As expected, the conventional t-test performs best in the absence of data contamination, and its power is close to the theoretical one. It is noteworthy that the power of rt-test (T_{B}) is very close to that of the second-best test, the Wilcoxon test, and is noticeably higher than those of rt-test (T_{A}) and Yuen (0.29).

In the absence of data contamination, when μ > 0, the powers of rt-test (T_{A}) and rt-test (T_{B}) are close to each other and are higher than that of Yuen (0.29); when μ < 0, rt-test (T_{B}) performs the best among the three robust tests.

When the data set is contaminated, the conventional t-test and the Wilcoxon test lose power seriously, even with a single contaminated value. As expected, the two robustified tests clearly outperform the conventional t-test and the Wilcoxon test under this scenario.

At the same breakdown point of 0.29, rt-test (T_{B}) performs better than Yuen (0.29) both in the presence and in the absence of data contamination.

The power of the five considered tests is greatly influenced by the correlation coefficient between the paired samples: the closer the correlation is to one, the larger the power of the tests. The reason is as follows. Considering the term $\sigma_d/\sqrt{n}$ in (4), the sample size needed for the paired t-test satisfies $n \sim (1-\rho)\sigma^{2}$. Thus, when the correlation is positive, the paired t-test attains the same power with a smaller sample size. However, if the correlation is negative, the paired t-test can be a disaster. A very similar phenomenon can be observed for the Wilcoxon test. For more details, see Section 2.12.1 of Hettmansperger and McKean (2010).
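The dependence on ρ can be made explicit with a short calculation, assuming equal standard deviations σ_{x} = σ_{y} = σ as in the simulation settings:

```latex
\begin{align*}
\sigma_d^2 &= \sigma_x^2 + \sigma_y^2 - 2\rho\,\sigma_x\sigma_y = 2(1-\rho)\,\sigma^2, \\
\delta &= \frac{\mu-\mu_d}{\sigma_d/\sqrt{n}}
        = (\mu-\mu_d)\sqrt{\frac{n}{2(1-\rho)\,\sigma^2}} .
\end{align*}
```

Holding the noncentrality δ fixed therefore requires n to be proportional to (1 − ρ)σ^{2}. With σ = 1, the variance of the differences is 2, 0.4, and 3.6 for ρ = 0, 0.8, and −0.8, so that, relative to the uncorrelated case, the required sample size is scaled by 0.2 when ρ = 0.8 and by 1.8 when ρ = −0.8.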
5. SIMULATION FOR EVALUATING ROBUSTNESS TO MODEL DEPARTURE FROM THE NORMALITY
We compared the powers of the considered tests for H_{0} : μ = 0 versus H_{1} : μ ≠ 0 under the following cases of model departure from normality: (i) logistic distribution, (ii) Laplace distribution, (iii) Student t-distribution with three degrees of freedom, and (iv) uniform distribution.
We generated paired observations (X_{i}, Y_{i}) for i = 1, ⋯, n as follows. Each of the second samples from the respective non-normal distribution is generated in such a manner that they have the same mean μ and the same variance σ^{2}. For the logistic distribution, a standard logistic random variable, Z_{i}, has mean zero and variance π^{2}/3. Therefore, we generated our logistic samples using
$$Y_i = \mu + \sigma\,\frac{Z_i}{\pi/\sqrt{3}}.$$
Then we have E(Y_{i}) = μ and Var(Y_{i}) = σ^{2}. A standard Laplace random variable, Z_{i}, has mean zero and variance two. Thus, in this case, we generated our samples using
$$Y_i = \mu + \sigma\,\frac{Z_i}{\sqrt{2}}.$$
For the Student t-distribution, we used
$$Y_i = \mu + \sigma\,\frac{Z_i}{\sqrt{\nu/(\nu-2)}},$$
where Z_{i} has the Student t-distribution with ν = 3 degrees of freedom. Finally, for the uniform distribution, let Z_{i} be a uniform random variable on (−1, 1), which has mean zero and variance 1/3. We used
$$Y_i = \mu + \sigma\sqrt{3}\,Z_i.$$
Using n = 10 paired observations (X_{i}, Y_{i}) from each of the non-normal distributions, we performed the test for H_{0} : μ = 0 versus H_{1} : μ ≠ 0. We provide the empirical powers of the paired tests under the various non-normal distributions in Figure 7. We observe from Figure 7 that, for all the underlying distributions considered, the powers of rt-test (T_{B}) are similar to those of the conventional t-test and the Wilcoxon test, and that rt-test (T_{B}) offers higher power than rt-test (T_{A}) and Yuen (0.29). When the sample size is small (n = 10) and the underlying distribution is non-normal, Yuen (0.29) does not turn out to be superior to the conventional t-test, even though the latter depends on the normality assumption. Finally, as the sample size increases (not shown here for simplicity), the powers of all the tests increase significantly.
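The non-normal samples described in this section can be generated with the Python standard library. The sketch below assumes that each standardized variable Z is rescaled as μ + σ Z / sd(Z), which yields mean μ and variance σ^{2}; the function name `draw_y`, the distribution labels, and the sampling constructions (inverse-CDF for the logistic, a difference of two exponentials for the Laplace, a normal divided by a scaled chi-square root for the t) are illustrative choices.

```python
import math
import random


def _open_unit(rng):
    """Uniform draw on the open interval (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def draw_y(dist, mu, sigma, rng):
    """One observation with mean mu and variance sigma^2 from the named distribution."""
    if dist == "logistic":
        u = _open_unit(rng)
        z = math.log(u / (1.0 - u))  # standard logistic, variance pi^2 / 3
        return mu + sigma * z / (math.pi / math.sqrt(3))
    if dist == "laplace":
        z = rng.expovariate(1.0) - rng.expovariate(1.0)  # standard Laplace, variance 2
        return mu + sigma * z / math.sqrt(2)
    if dist == "t3":
        chi2 = rng.gammavariate(1.5, 2.0)  # chi-square with 3 degrees of freedom
        z = rng.gauss(0, 1) / math.sqrt(chi2 / 3)  # Student t with 3 df, variance 3
        return mu + sigma * z / math.sqrt(3)
    if dist == "uniform":
        z = rng.uniform(-1, 1)  # variance 1/3
        return mu + sigma * math.sqrt(3) * z
    raise ValueError("unknown distribution: " + dist)


# Example: ten observations from the standardized Laplace case.
rng = random.Random(2024)
sample = [draw_y("laplace", 0.0, 1.0, rng) for _ in range(10)]
```

Because the t-distribution with three degrees of freedom has an infinite fourth moment, its sample variance converges slowly, so empirical checks of the scaling are more stable for the other three cases.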
6. CONCLUDING REMARKS
In this paper, we conducted extensive simulation studies to investigate the performance of several widely used test statistics for comparing the means of two paired populations in the cases of data contamination and model departure from normality. These tests include the conventional t-test, the Wilcoxon test (Wilcoxon, 1945), Yuen’s t-test (Yuen and Dixon, 1973), and the two robustified t-tests proposed by Park (2018) and Jeong et al. (2018). It is shown that the conventional t-test and the Wilcoxon test are extremely sensitive to data contamination or model departure and that the decision can be easily reversed by a single contaminated value. Consequently, a robustified test statistic which is less sensitive to data contamination and model departure is clearly warranted in practice.
Among the three robust test statistics, Yuen (0.29), rt-test (T_{A}), and rt-test (T_{B}), we recommend rt-test (T_{B}) in practical situations. The power of a statistical test usually depends on the level of data contamination, the degree of departure from normality, and the sample size, and the numerical studies showed that rt-test (T_{B}) outperforms the others in most cases, regardless of whether data contamination or normal model departure is present in the underlying samples.
<APPENDIX> THEORETICAL POWER FUNCTION
Lemma 1. Let $D_i = X_i - Y_i$ be normally distributed with mean μ and variance $\sigma_d^2$.
Let $\overline{D} = \sum_{i=1}^{n} D_i/n$ (sample mean) and $S_d^2 = \sum_{i=1}^{n} (D_i - \overline{D})^2/(n-1)$ (sample variance). Then the test statistic below
$$\frac{\overline{D}-\mu_d}{S_d/\sqrt{n}}$$
has a noncentral t-distribution with n−1 degrees of freedom and noncentrality $\delta = (\mu-\mu_d)/(\sigma_d/\sqrt{n}).$
Proof. By the definition of a noncentral t-distribution, the quantity below has a noncentral t-distribution with r degrees of freedom and noncentrality δ
$$\frac{Z+\delta}{\sqrt{V/r}},$$
where Z ∼ N(0,1), V has a chi-square distribution with r degrees of freedom, and Z and V are independent.
Let $V = (n-1)S_d^2/\sigma_d^2$ for convenience. Then V has a chi-square distribution with n−1 degrees of freedom. We have
$$\frac{\overline{D}-\mu_d}{S_d/\sqrt{n}} = \frac{(\overline{D}-\mu)/(\sigma_d/\sqrt{n}) + (\mu-\mu_d)/(\sigma_d/\sqrt{n})}{\sqrt{V/(n-1)}} = \frac{Z+\delta}{\sqrt{V/(n-1)}},$$
where $\delta = (\mu-\mu_d)/(\sigma_d/\sqrt{n})$ and $Z = (\overline{D}-\mu)/(\sigma_d/\sqrt{n}) \sim N(0,\,1).$ Since $S_d^2$ and $\overline{D}$ are independent, V and Z are also independent. This completes the proof.
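For convenience, we denote $T = (\overline{D}-\mu_d)/(S_d/\sqrt{n}).$ Then the critical region for testing H_{0} : μ = μ_{d} versus H_{1} : μ ≠ μ_{d} is given by $\left|T\right| > t_{\alpha/2}$. Then the power function is given by
$$\text{Power}(\mu) = P\left(\left|T\right| > t_{\alpha/2}\right) = 1 - P\left(-t_{\alpha/2} \le T \le t_{\alpha/2}\right).$$
It is immediate from Lemma 1 that T has a noncentral t-distribution with n−1 degrees of freedom and noncentrality
$$\delta = \frac{\mu-\mu_d}{\sigma_d/\sqrt{n}}.$$
Thus, we have
$$\text{Power}(\mu) = 1 - \Phi_{\nu,\delta}\left(t_{\alpha/2}\right) + \Phi_{\nu,\delta}\left(-t_{\alpha/2}\right),$$
where $\Phi_{\nu,\delta}(\cdot)$ is the cumulative distribution function of the noncentral t-distribution with ν = n−1 degrees of freedom and noncentrality δ.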