Permutation Tests for Two-sample Location Problem Under Extreme Ranked Set Sampling

In this paper, the permutation test for comparing two independent samples is investigated in the context of extreme ranked set sampling (ERSS). Three test statistics are proposed. The statistical power of these new test statistics is evaluated numerically. The results are compared with the statistical power of the classical independent two-sample t-test, the Mann-Whitney U test, and the usual two-sample permutation test under simple random sampling (SRS). In addition, the method of computing a confidence interval for the two-sample permutation problem under ERSS is explained. The performance of this method is compared with the intervals obtained by SRS and Mann-Whitney procedures in terms of empirical coverage probability and expected length. The comparison shows that the proposed statistics outperform their counterparts. Finally, the application of the proposed statistics is illustrated using a real-life example.


Introduction
McIntyre (1952) introduced the concept of ranked set sampling (RSS) as a new sampling scheme for data collection, with the aim of estimating the mean of Australian pasture yields. Due to its importance for a variety of applications in statistics, the paper was republished as McIntyre (2005). As claimed by McIntyre (1952, 2005), the mean of the RSS is an unbiased estimator of the population mean. Also, the variance of the RSS mean is smaller than that of the simple random sampling (SRS) mean based on the same number of measured elements. This sampling scheme is useful when it is difficult to measure a large number of elements but ranking some of them visually (without measurement) is easier. For example, in McIntyre's experiment the yields of pasture plots can be assessed without the laborious process of mowing and weighing the hay for a large number of plots. Moreover, the RSS scheme is highly applicable in instances where measuring the variable of interest is difficult or risky. For example, in studying some diseases such as the yellowing of the body of an infant, one of the main steps is to measure the bilirubin level of the infant from a blood sample. However, taking blood samples is risky and excruciating. It is much easier to rank the babies by taking the measurement of the bilirubin level on their urine samples (Paul and Thomas, 2017). The RSS scheme involves randomly selecting m sets, each of m elements, from a study population (typically m is in the range 2 to 5). The elements of each set are ordered with regard to the variable of interest, visually or by any method of negligible cost, without measurement. Finally, the i-th minimum from the i-th set, i = 1, 2, . . . , m, is identified for measurement. The obtained sample is referred to as a ranked set sample of set size m.
Takahasi and Wakimoto (1968) explained the mathematical theory behind the claims of McIntyre (1952, 2005) by showing that the efficiency of the RSS mean with respect to the SRS mean, defined by the ratio of the variances of the two sample means, is bounded between 1 and (m + 1)/2. Some authors estimate the parameters of a specific distribution using RSS. Bhoj (1997) obtained the estimates of the location and the scale parameters of the extreme value distribution using RSS. Similarly, Abu-Dayyeh et al. (2004) proposed some estimators of the location and the scale parameters of the logistic distribution using SRS, RSS, and some of its other modifications. For more examples, see Lam et al. (1994), Bhoj and Ahsanullah (1996), Chacko and Thomas (2007), Chacko and Thomas (2008), Al-Saleh and Diab (2009), and Sarikavanij et al. (2014), among others. To further improve the efficiency of the estimators, some variations of RSS were introduced. For example (but not limited to): Samawi et al. (1996) introduced a more practical and efficient variation of RSS which is referred to as the extreme RSS (ERSS); Muttlak (1997) proposed median RSS as a modification of RSS to decrease ranking error and to improve the efficiency of the estimators; Al-Saleh and Al-Kadiri (2000) suggested double RSS as a method that improves the efficiency of the RSS estimators while keeping m fixed; and Al-Saleh and Al-Omari (2002) suggested multi-stage RSS as a generalization of double RSS. These variations of RSS were later used to estimate the parameters of some distributions. Shaibu and Muttlak (2004) used ERSS and median RSS to propose linear unbiased estimators and maximum likelihood estimators of the parameters of location-scale families of distributions such as the exponential, normal, and gamma distributions. They showed that their estimators dominate other existing estimators under ERSS and that the estimators are most efficient under median RSS.
For more examples, see Adatia (2000). In the context of testing hypothesis, Koti and Jogesh Babu (1996) derived the exact distribution of the sign test statistic based on RSS. It was reported that the test is more powerful than the counterpart sign test statistic of SRS. Liangyong and Xiaofang (2010) used the sign test statistic of RSS for testing hypotheses about the quantiles of a population distribution. Wolfe (1992, 1994) and Bohn and Wolfe (1994) suggested the RSS analogue of the classical two-sample Wilcoxon test and studied its relative properties under perfect and imperfect judgment. Öztürk (1999) studied the effect of the RSS on two-sample sign test statistic. Öztürk and Wolfe (2000) presented an optimal RSS allocation scheme for a two-sample RSS median test. They derived the exact distribution of the two-sample median test statistic in the context of RSS and tabled it for some sample sizes. Samuh (2012), Samuh (2017), and Amro and Samuh (2017) investigated the two-sample permutation test within the context of RSS and multistage RSS. In this paper, a new testing procedure for the two-sample design within ERSS is investigated. The rest of the paper is structured as follows. The procedure of the ERSS is described in Section 2. The independent two-sample problem is introduced in Section 3. Permutation test for two-sample ERSS with three proposed test statistics is discussed in Section 4. The method of computing a confidence interval for the two-sample permutation problem under ERSS is explained in Section 5. Simulation study that shows the benefits of permutation test of the extreme ranked set two-sample design is reported in Section 6. Illustrative example is used to show the application of this research in Section 7. Finally, summary and concluding remarks are provided in Section 8.

Extreme ranked set sampling scheme
Following Samawi et al. (1996), the procedure of the ERSS is described as follows:
1. Randomly select m sets of size m elements each from the study population. These may be denoted as set 1 = $\{Y^*_{11}, Y^*_{12}, \ldots, Y^*_{1m}\}$, set 2 = $\{Y^*_{21}, Y^*_{22}, \ldots, Y^*_{2m}\}$, and so on till the last set, set m = $\{Y^*_{m1}, Y^*_{m2}, \ldots, Y^*_{mm}\}$.
2. Within each set, identify the lowest and the largest elements. It is assumed that the largest and the lowest elements in each set can be determined visually or by any negligible cost method. This is, of course, a simple and practical approach.
3. If m is even, measure the lowest ranked element from each of the first m/2 sets and the largest ranked element from each of the last m/2 sets. If m is odd, measure the lowest ranked element from each of the first (m − 1)/2 sets and the largest ranked element from each of the next (m − 1)/2 sets; for the remaining set, of the options given by Samawi et al. (1996), we consider in this paper option (b): measure the median ranked element, say $Y_m$.
4. Independently repeat the steps h cycles, if needed, to acquire an ERSS of size n = h × m.
It is worth noting that although $m^2$ elements are sampled in the first step, only m of them are considered for measurement. In case of perfect ranking (no error is made in the ranking mechanism) the measured elements are called the order statistics and they are not ordered (see Navarro et al. (2007) for some examples where order statistics are not ordered); we denote the element measured from the i-th set in the j-th cycle by $Y_{ji}$, i = 1, 2, . . . , m, and j = 1, 2, . . . , h. To this end, the ERSS scheme produces a data set as follows:
$$Y = \begin{pmatrix} Y_{11} & Y_{12} & \cdots & Y_{1m} \\ Y_{21} & Y_{22} & \cdots & Y_{2m} \\ \vdots & \vdots & & \vdots \\ Y_{h1} & Y_{h2} & \cdots & Y_{hm} \end{pmatrix}.$$
If m is even, the first m/2 columns have the distribution of the 1-st order statistic and the last m/2 columns have the distribution of the m-th order statistic. If m is odd, the first (m − 1)/2 columns have the distribution of the 1-st order statistic, the second (m − 1)/2 columns have the distribution of the m-th order statistic, and the last column has the distribution of the ((m + 1)/2)-th order statistic, that is, the median of a set. Therefore, the data in the same column are identically distributed. Also, all the data are mutually independent.
In this paper, we assume perfect judgment ranking in the selection of the data points for the ERSS. A violation of this assumption is quite interesting and a subject of future work.
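The sampling scheme of this section can be sketched in a few lines of Python. This is only a minimal illustration under perfect ranking, using the median option for odd m; the function names `erss_cycle` and `erss`, the `draw` callback, and the standard normal population in the usage line are assumptions made for the example and do not come from Samawi et al. (1996).

```python
import random


def erss_cycle(draw, m, rng):
    """One ERSS cycle of set size m; returns the m measured values.

    draw(rng) returns one unit from the study population (hypothetical
    callback). Ranking is assumed perfect, so sorting stands in for the
    visual ranking step.
    """
    half = m // 2
    measured = []
    for i in range(m):
        ranked = sorted(draw(rng) for _ in range(m))  # one set of m units
        if i < half:
            measured.append(ranked[0])        # lowest ranked element
        elif i < 2 * half:
            measured.append(ranked[-1])       # largest ranked element
        else:
            measured.append(ranked[m // 2])   # odd m only: median of last set
    return measured


def erss(draw, m, h, seed=None):
    """h independent cycles, returned as an h x m list of rows."""
    rng = random.Random(seed)
    return [erss_cycle(draw, m, rng) for _ in range(h)]


# Usage: an ERSS of size n = h * m = 15 from a standard normal population.
data = erss(lambda rng: rng.gauss(0.0, 1.0), m=3, h=5, seed=1)
```

Each row of `data` is one cycle, so values in the same column share a distribution, which is the property the permutation scheme relies on later.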

The independent two-sample design
In this section, the independent two-sample problem is introduced, and the classical independent t-test, the Mann-Whitney U test, and the two-sample permutation test are reviewed. Let us consider the testing problem for one-sided alternative hypotheses as produced by treatments with non-negative effect size δ. In particular, let $X_1 = (X_{11}, X_{12}, \ldots, X_{1n_1})$ and $X_2 = (X_{21}, X_{22}, \ldots, X_{2n_2})$ be independent random samples from $F(x_1)$ and $G(x_2) = F(x_2 - \delta)$, and we wish to test
$$H_0: \delta = 0 \quad \text{against} \quad H_1: \delta > 0. \qquad (1)$$
Under the normality assumption of the underlying distribution, the likelihood ratio test statistic for testing the null hypothesis in Equation 1 is given by
$$T = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{1/n_1 + 1/n_2}},$$
where $\bar{X}_1$ and $\bar{X}_2$ are the sample means of the two samples $X_1$ and $X_2$, respectively, and the pooled standard deviation is
$$S_p = \sqrt{\frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}},$$
with $S_1^2$ and $S_2^2$ the two sample variances. When $H_0$ is true, T is distributed as Student's t distribution with $\nu = n_1 + n_2 - 2$ degrees of freedom. For the one-sided alternative, $H_0$ is rejected when $T > t^{\alpha}_{\nu}$, where $t^{\alpha}_{\nu}$ is the upper α critical value, and α is the level of significance. The statistical power level is given by $W(\delta; n_1, n_2, \alpha) = 1 - F_t(t^{\alpha}_{\nu}; \nu, ncp)$, where $F_t$ is the cumulative distribution function of the noncentral Student t, $\nu = n_1 + n_2 - 2$ is the degrees of freedom, and $ncp = \delta \left[ S_p^2 (1/n_1 + 1/n_2) \right]^{-1/2}$ is the non-centrality parameter. Note that W is a function of δ for given sample sizes $n_1$ and $n_2$ and a preassigned level of significance α. The power level measures how likely it is to get a significant result given that the alternative hypothesis is true; it measures the probability that the true value of δ will be detected by the test.
In a permutation framework, we can relax the normality assumption. Permutation tests form a subclass of nonparametric tests which do not rely on any particular distribution. To obtain an exact permutation solution, the data points are required to satisfy the exchangeability assumption (often referred to as equality in distributions) under the null hypothesis.
This assumption is generally assured by randomly assigning experimental units to treatments in experimental studies. In the case of observational studies, exchangeability under the null hypothesis has to be assumed. For testing the null hypothesis in Equation 1, a suitable test statistic should be chosen such that, without loss of generality, large values of it are considered to be against $H_0$. For more details about the choice of the test statistic in the permutation framework, see page 84 of Pesarin and Salmaso (2010). One may choose $T = \bar{X}_1 - \bar{X}_2$ as a test statistic. To determine the exact p-value, an appropriate reference distribution, called the permutation distribution, is needed. The following steps are used to carry out the permutation test for the two-sample design.
1. For the given two independent samples, $X_1$ and $X_2$, calculate the observed test statistic, $T_0 = T(X_1, X_2)$.
2. Write down the set of all possible permutations of the $n = n_1 + n_2$ observations between the two samples; i.e. the permutation sample space $\mathcal{X}$. The cardinality of this space is n!.
3. For each permutation in $\mathcal{X}$, compute the test statistic, $T^* = T(X^*_1, X^*_2)$. The cardinality of the related space of distinct splits into two samples is $\binom{n}{n_1}$.
4. The true p-value is calculated as the proportion of these splits whose statistic is at least as large as the observed one,
$$\lambda_T = \frac{\#\{T^* \ge T_0\}}{\binom{n}{n_1}}.$$
5. For a given preassigned significance level α, the test is declared to be significant if the p-value is smaller than α.
Since it is tedious to write down and enumerate the whole members of permutation sample space X , conditional Monte Carlo simulation (Algorithm 1) can be used to approximate the p-value at any desired accuracy.

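Algorithm 1 Conditional Monte Carlo (CMC) Simulation
1. For the given two independent samples, $X_1$ and $X_2$, calculate the observed test statistic, $T_0 = T(X_1, X_2)$.
2. Take a random permutation $(X^*_1, X^*_2)$ of the pooled observations and calculate $T^* = T(X^*_1, X^*_2)$.
3. Independently repeat Step 2 a large number of times, say B, giving B values for $T^*$, say $\{T^*_b, b = 1, \ldots, B\}$.
4. The estimated permutation p-value is
$$\hat{\lambda}_T = \frac{1}{B} \sum_{b=1}^{B} I(T^*_b \ge T_0),$$
where I(·) is the indicator function.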
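As an illustration, the conditional Monte Carlo estimate of the permutation p-value can be sketched as follows for the statistic $T = \bar{X}_1 - \bar{X}_2$. The function name `cmc_pvalue`, the default B, and the toy data are assumptions made for this example only.

```python
import random


def cmc_pvalue(x1, x2, B=2000, seed=None):
    """Estimate the one-sided permutation p-value of T = mean(x1) - mean(x2).

    Large values of T are taken as evidence against H0. The estimate is
    the proportion of B random permutations with T* >= T0.
    """
    rng = random.Random(seed)
    n1, n2 = len(x1), len(x2)
    pooled = list(x1) + list(x2)
    t0 = sum(x1) / n1 - sum(x2) / n2           # observed statistic T0
    count = 0
    for _ in range(B):
        rng.shuffle(pooled)                    # random permutation of the pooled data
        t_star = sum(pooled[:n1]) / n1 - sum(pooled[n1:]) / n2
        if t_star >= t0:
            count += 1
    return count / B


# Usage on toy data (hypothetical values, completely separated groups).
treated = [14, 12, 13, 11, 10]
control = [4, 2, 3, 1, 0]
p_hat = cmc_pvalue(treated, control, B=2000, seed=42)
```

With real-valued data, rounding can make a permuted statistic that equals $T_0$ mathematically compare as slightly smaller; adding a small tolerance to the comparison is one way to guard against this.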
Note that $\hat{\lambda}_T$ is an unbiased estimate of the true $\lambda_T$ and, due to the Glivenko-Cantelli theorem (Shorack and Wellner, 1986), it is strongly consistent as B diverges. A 100(1 − α)% approximate confidence interval for $\lambda_T$ is
$$\hat{\lambda}_T \pm z_{\alpha/2} \sqrt{\hat{\lambda}_T (1 - \hat{\lambda}_T) / B},$$
where $\sqrt{\hat{\lambda}_T (1 - \hat{\lambda}_T) / B}$ is the estimated standard error of $\hat{\lambda}_T$, and $z_{\alpha/2}$ is the upper α/2 critical value of the standard normal distribution. In order to properly define the power function, the underlying population distribution must be fully specified, that is, defined in its analytical form together with all its parameters. But this is not the case in the permutation framework. In practice, the power of a permutation test is based on repeated random sampling from some population. The p-value of the permutation test is conditional upon the observations in each sample, and the power is the proportion of p-values that are less than or equal to α. Algorithm 2 is used for evaluating the power based on a standard Monte Carlo simulation.

Algorithm 2 Power Function of Permutation Test
1. For the given samples, $X_1$ and $X_2$, find an estimate of δ, say $\hat{\delta}$. Then, consider the consequent empirical deviates $\hat{Z} = (X_1 - \hat{\delta}, X_2)$.
5. Finally, the estimated power level, $\hat{W}(\delta)$, is given by the proportion of the estimated p-values that are less than or equal to α.
6. To obtain the power as a function of δ, Steps 1-5 are repeated for different values of δ.
For more details see Pesarin and Salmaso (2010) (see also Samuh and Pesarin, 2018). Another nonparametric test, which does not assume a normal distribution, is the Mann-Whitney U test. The test statistic is $U = \min(U_1, U_2)$, where
$$U_i = n_1 n_2 + \frac{n_i (n_i + 1)}{2} - R_i, \quad i = 1, 2,$$
and $R_i$ is the sum of the ranks for sample i in the pooled sample. Since $U_1 + U_2 = n_1 n_2$, a small value of U is considered to be against $H_0$. To evaluate the power of the Mann-Whitney U test, a standard Monte Carlo method can be used.
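For concreteness, the Mann-Whitney statistic can be computed from pooled ranks as in the sketch below. It assumes the convention $U_i = n_1 n_2 + n_i(n_i + 1)/2 - R_i$ and midranks for ties; the function name `mann_whitney_u` is hypothetical, and the other common convention, $U_i = R_i - n_i(n_i + 1)/2$, gives the same minimum because the two values merely swap.

```python
def mann_whitney_u(x1, x2):
    """Return (U1, U2, U) with U = min(U1, U2), using midranks for ties."""
    # Pool the observations, remembering which sample each one came from.
    pooled = sorted((v, g) for g, xs in ((1, x1), (2, x2)) for v in xs)
    rank_sum = {1: 0.0, 2: 0.0}
    n = len(pooled)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                             # pooled[i:j] share the same value
        midrank = (i + 1 + j) / 2.0            # average of ranks i+1, ..., j
        for k in range(i, j):
            rank_sum[pooled[k][1]] += midrank
        i = j
    n1, n2 = len(x1), len(x2)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - rank_sum[1]
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - rank_sum[2]
    return u1, u2, min(u1, u2)


# Usage on toy data: the two samples do not overlap at all.
u1, u2, u = mann_whitney_u([1, 2, 3], [4, 5, 6])
```

A Monte Carlo power estimate is then obtained by repeating the test on many simulated pairs of samples and recording the rejection rate.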

Two-sample permutation test based on ERSS
Let $Y^t = \{Y^t_{ji}\}$ and $Y^c = \{Y^c_{ji}\}$ be two independent ERSS groups. The first denotes the treatment group and the second denotes the control group. In each group, the data are all mutually independent and, besides that, the data in the same column are identically distributed. Therefore, under the null hypothesis of no effect, the exchangeability assumption holds within columns and hence the permutation test can be applied. The data have to be permuted carefully, taking into account whether m is odd or even, to maintain the diversity of distributions. To this end, if m is even then the first m/2 columns of $Y^t$ are permuted with the first m/2 columns of $Y^c$ and the other m/2 columns of $Y^t$ are permuted with the other m/2 columns of $Y^c$. If m is odd, the same rule applies separately to the first (m − 1)/2 columns, the second (m − 1)/2 columns, and the last column. To carry out the permutation test for this extreme ranked set two-sample design, Algorithm 3 is used.
Algorithm 3
1. For the given two ERSS groups, $Y^t$ and $Y^c$, calculate the observed test statistic, $T_0 = T(Y^t, Y^c)$.
2. Pool the two groups row-wise into a single data set Y.
3. Randomly permute Y as explained above to get $Y^*$.
4. Split $Y^*$ into $Y^{t*}$ and $Y^{c*}$ such that $Y^{t*}$ and $Y^{c*}$ contain the same number of rows as in $Y^t$ and $Y^c$, respectively.
5. Compute the test statistic $T^* = T(Y^{t*}, Y^{c*})$.
6. Independently repeat Steps 3-5 a large number of times, say B, giving B test statistics, say $\{T^*_b, b = 1, \ldots, B\}$. The permutation p-value is then estimated from these B values as in Algorithm 1.
To this end, three test statistics are proposed:
1. The first proposal is based on the difference between the overall means of the two groups, $T_1 = \bar{Y}^t - \bar{Y}^c$.
2. The second proposal, $T_2$, is based on the studentized statistic. It is worth noting that the rationale behind this proposal is due to the structure of the data points obtained by ERSS. For example, when m is even, the first m/2 columns of $Y^t$ and $Y^c$ have the distribution of the 1-st order statistic and the last m/2 columns have the distribution of the m-th order statistic. Therefore, the studentized statistic is proposed for more accuracy and to maintain the diversity of distributions.
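3. The third proposal is based on partial tests. If m is even, the null hypothesis in Equation 1 is partitioned into 2 independent sub-hypotheses as follows.
$$H_{0s}: \delta_s = 0 \quad \text{against} \quad H_{1s}: \delta_s > 0, \quad s = 1, 2,$$
where $\delta_1$ ($\delta_2$) is the true difference between the means of the 1-st (m-th) order statistic in the treatment and control groups. If m is odd, the null hypothesis in Equation 1 is partitioned into 3 independent sub-hypotheses as follows.
$$H_{0s}: \delta_s = 0 \quad \text{against} \quad H_{1s}: \delta_s > 0, \quad s = 1, 2, 3,$$
where $\delta_1$ ($\delta_2$, $\delta_3$) is the true difference between the means of the 1-st (m-th, ((m + 1)/2)-th) order statistic in the treatment and control groups. Thus, the suggested test statistic, $T_3$, consists of one partial statistic per sub-hypothesis, each computed from the corresponding columns of the two groups. Now, Algorithm 1 is used for each sub-hypothesis and this leads to 2 or 3 independent p-values. Finally, these p-values have to be combined for testing the overall hypothesis in Equation 1. Let $\lambda_1, \ldots, \lambda_k$ denote the partial p-values, with k = 2 if m is even and k = 3 if m is odd. The following approaches are considered for combining p-values.
(a) The Fisher approach (Fisher, 1934). It is based on the statistic $-2 \sum_{s=1}^{k} \log \lambda_s$. Under the null hypothesis, this statistic follows a chi-square distribution with 2k degrees of freedom. So, the combined p-value is given by $\Pr\left(\chi^2_{2k} \ge -2 \sum_{s=1}^{k} \log \lambda_s\right)$.
(b) The Liptak approach. It is based on the statistic $L = \sum_{s=1}^{k} \Phi^{-1}(1 - \lambda_s)$, where Φ is the standard normal cumulative distribution function. Under the null hypothesis, $L/\sqrt{k} \sim N(0, 1)$. So, the combined p-value is given by $\Pr(Z \ge L/\sqrt{k})$, where Z is the standard normal random variable.
(c) The logistic approach (Mudholkar and George, 1977). It is based on the logit statistic
$$t = \sqrt{\frac{3(5k + 4)}{k \pi^2 (5k + 2)}} \sum_{s=1}^{k} \log \frac{1 - \lambda_s}{\lambda_s}.$$
Under the null hypothesis, t follows approximately a Student's t distribution with (5k + 4) degrees of freedom. Hence, the combined p-value is given by $\Pr(t_{5k+4} \ge t)$.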
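The partial-test idea can be sketched in Python as follows: one permutation p-value per block of identically distributed columns, followed by the Fisher and Liptak combinations. This is a minimal sketch under stated assumptions. The partial statistic is taken to be the within-block difference of means, the p-values are clipped away from 0 and 1 so that the logarithm and the normal quantile stay finite, and all function names and the toy data are hypothetical. The logistic combination is left out because it needs a Student t distribution function, which the Python standard library does not provide.

```python
import math
import random
from statistics import NormalDist


def column_blocks(m):
    """Column indices sharing a distribution: minima, maxima, and (odd m) median."""
    half = m // 2
    blocks = [list(range(half)), list(range(half, 2 * half))]
    if m % 2 == 1:
        blocks.append([m - 1])
    return blocks


def partial_pvalues(yt, yc, m, B=1000, seed=None):
    """One CMC p-value per block; assumed statistic: difference of block means."""
    rng = random.Random(seed)
    pvals = []
    for cols in column_blocks(m):
        t_vals = [row[c] for row in yt for c in cols]
        c_vals = [row[c] for row in yc for c in cols]
        nt, nc = len(t_vals), len(c_vals)
        pooled = t_vals + c_vals
        t0 = sum(t_vals) / nt - sum(c_vals) / nc
        count = 0
        for _ in range(B):
            rng.shuffle(pooled)                # permute within this block only
            count += sum(pooled[:nt]) / nt - sum(pooled[nt:]) / nc >= t0
        pvals.append(count / B)
    return pvals


def _clip(p, eps=1e-10):
    # Assumption of this sketch: keep p strictly inside (0, 1).
    return min(max(p, eps), 1.0 - eps)


def fisher_combine(pvals):
    """Pr(chi-square with 2k df >= -2 * sum(log p)), closed form for even df."""
    k = len(pvals)
    half = -sum(math.log(_clip(p)) for p in pvals)   # statistic divided by 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))


def liptak_combine(pvals):
    """Pr(Z >= L / sqrt(k)) with L = sum of Phi^{-1}(1 - p)."""
    nd = NormalDist()
    big_l = sum(nd.inv_cdf(1.0 - _clip(p)) for p in pvals)
    return 1.0 - nd.cdf(big_l / math.sqrt(len(pvals)))


# Usage on toy 4 x 3 arrays (hypothetical values; treatment shifted upward by 10).
yt = [[10, 30, 20], [11, 31, 21], [12, 32, 22], [13, 33, 23]]
yc = [[0, 20, 10], [1, 21, 11], [2, 22, 12], [3, 23, 13]]
pvals = partial_pvalues(yt, yc, m=3, B=500, seed=7)
p_fisher = fisher_combine(pvals)
p_liptak = liptak_combine(pvals)
```

Permuting only within a block keeps minima with minima and maxima with maxima, which is what preserves exchangeability under the null hypothesis.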

Permutation confidence interval for δ
Suppose that $\hat{\delta}_L$ and $\hat{\delta}_U$ are the lower and upper confidence limits, respectively, for the parameter δ.
The two-sided confidence interval for δ contains all values of $\delta_0$ for which the null hypothesis $H_0: \delta = \delta_0$ versus $H_1: \delta \ne \delta_0$ is not rejected at level α. The one-sided confidence interval for δ contains all values of $\delta_0$ for which the null hypothesis $H_0: \delta = \delta_0$ is not rejected against the corresponding one-sided alternative at level α. Pesarin and Salmaso (2010) provided a method to compute a confidence interval for the usual two-sample permutation problem. The method can be adapted to construct a permutation confidence interval for δ under ERSS. Algorithm 4 summarizes this method. To carry out this algorithm, two tolerance specifications must be defined:
1. An error ε > 0 to control the reliability of the permutation p-value. It must be related to the number of permutations B used in the CMC algorithm; the smaller ε is, the larger B must be.
2. A real number η.

Algorithm 4 Permutation Confidence Interval for δ
1. For the given two ERSS groups, $Y^t$ and $Y^c$, find an estimate of δ, say $\hat{\delta}$.
2. Choose a negative value of η and subtract $(\hat{\delta} + \eta)$ from every value of the treatment group $Y^t$.
3. Use the CMC algorithm to estimate the permutation p-value, $\hat{\lambda}(\eta)$, based on the data sets $Y^t - (\hat{\delta} + \eta)$ and $Y^c$.
4. To obtain the lower confidence limit of δ, repeat Steps 2 and 3 with different negative values for η until the condition $|\hat{\lambda}(\eta) - \alpha/2| < \varepsilon$ is satisfied and then assign $\hat{\delta}_L = \hat{\delta} + \eta$.
5. To obtain the upper confidence limit of δ, repeat Steps 2 and 3 with positive values for η until the condition $|1 - \hat{\lambda}(\eta) - \alpha/2| < \varepsilon$ is satisfied and then assign $\hat{\delta}_U = \hat{\delta} + \eta$.

Two criteria are used to evaluate the performance of this algorithm: (1) the empirical coverage probability of the resulting confidence interval, and (2) the length of the interval. Both are estimated by simulation.

Simulation study
A simulation study is carried out to assess the significance level and the power of the proposed test statistics for the two-sample ERSS design and to compare them with the usual two-sample permutation test, the Mann-Whitney U test, and the classical two-sample t-test. It is worth pointing out that the power levels of the proposed test statistics are obtained for the same generated two-sample ERSS. Moreover, balanced designs are considered in computing the power levels; that is, each sample of the two-sample ERSS has set size m and number of cycles h. Also, the size of each sample in the two-sample SRS is h × m, so that power comparisons between the considered test statistics under ERSS and SRS are made with the same number of observations in both schemes (to ensure that the two schemes have the same measurement cost).
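A small Monte Carlo power comparison of this kind can be sketched as follows. The sketch makes several assumptions that are not taken from the paper: the population is normal with unit variance, ranking is perfect, the statistic is the difference of overall means, and the number of replications R, the number of permutations B, and all function names are illustrative. It samples directly from a specified population rather than resampling empirical deviates.

```python
import random


def erss_blocks(m, h, shift, rng):
    """ERSS values from N(shift, 1), grouped as [minima, maxima, medians]."""
    half = m // 2
    low, high, mid = [], [], []
    for _ in range(h):
        for i in range(m):
            ranked = sorted(rng.gauss(shift, 1.0) for _ in range(m))
            if i < half:
                low.append(ranked[0])
            elif i < 2 * half:
                high.append(ranked[-1])
            else:
                mid.append(ranked[m // 2])     # only reached when m is odd
    return [low, high, mid]


def blockwise_pvalue(blocks, B, rng):
    """CMC p-value for the difference of overall means.

    blocks is a list of (treatment_values, control_values) pairs; values
    are permuted only within their own block.
    """
    pools = [(list(t) + list(c), len(t)) for t, c in blocks]
    n_t = sum(k for _, k in pools)
    n_c = sum(len(p) - k for p, k in pools)

    def stat():
        return (sum(sum(p[:k]) for p, k in pools) / n_t
                - sum(sum(p[k:]) for p, k in pools) / n_c)

    t0 = stat()
    count = 0
    for _ in range(B):
        for p, _ in pools:
            rng.shuffle(p)
        count += stat() >= t0
    return count / B


def power(delta, m, h, design, alpha=0.05, R=200, B=200, seed=0):
    """Proportion of R simulated data sets with estimated p-value <= alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(R):
        if design == "erss":
            blocks = list(zip(erss_blocks(m, h, delta, rng),
                              erss_blocks(m, h, 0.0, rng)))
        else:  # SRS with the same number of measured units, one single block
            n = m * h
            blocks = [([rng.gauss(delta, 1.0) for _ in range(n)],
                       [rng.gauss(0.0, 1.0) for _ in range(n)])]
        rejections += blockwise_pvalue(blocks, B, rng) <= alpha
    return rejections / R


# Usage: estimated rejection rates for m = 3, h = 4 under both designs.
for delta in (0.0, 1.0):
    print(delta, power(delta, 3, 4, "erss", R=100), power(delta, 3, 4, "srs", R=100))
```

At δ = 0 the rejection rate estimates the attained significance level, so the same function serves both purposes of the study.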

Simulation results
The simulation results are reported in Tables 1 and 2 and in Tables 3-8. For fixed m, h, and α, the power levels of the permutation test statistics based on ERSS ($T_1$, $T_2$, $T_3$) are strictly higher than the power levels of the permutation test statistic under SRS and of the Mann-Whitney (MW) test for all given δ, except at very small sample size (h < 4) and small nominal significance level. For all considered test statistics, the power levels increase as the effect size δ increases. Moreover, the power levels based on ERSS increase as m, h, and α increase. It can also be seen that the power levels of $T_1$ and $T_2$ behave the same for the uniform and normal distributions, and $T_2$ is more appropriate for the exponential and gamma distributions (which are asymmetric distributions) than for the uniform and normal distributions (which are symmetric distributions). In addition, $T_2$ behaves better for the exponential distribution than for the gamma distribution, the exponential distribution being the more skewed of the two. Among the considered combining functions, the Liptak combining function is the best for the uniform and normal distributions. All considered combining functions behave almost the same for large sample sizes. Finally, under the normality assumption, Tables 9-11 report the exact power levels for the parametric one-sided two-sample t-test for different values of the sample size. The power levels of the permutation test statistics based on ERSS are higher than those of the parametric t-test. In contrast, the power levels of the permutation test under SRS are equivalent to those of the parametric t-test for a large sample size.
To evaluate the performance of the confidence intervals of δ obtained by Algorithm 4, the empirical coverage probability and the expected length (based on 2000 simulations) of 90% confidence intervals are calculated. Data are simulated from the uniform, normal, exponential, and gamma distributions.
The results are reported in Tables 12 and 13 for different values of m, h, and δ. It can be seen that, comparing ERSS to SRS, the confidence intervals obtained by ERSS are, on average, shorter than those obtained by SRS of equivalent sample size. Intervals obtained by the Mann-Whitney procedure are the shortest. Everything else being fixed, the length of the intervals decreases as the sample size (the set size m and/or the number of cycles h) increases. The coverage probability (percentage of intervals containing δ) is at least the nominal 90% level for the considered test statistics, with one exception: when m = 4, the coverage probability is a bit lower than the nominal 90% level for the intervals obtained by the $T_2$ and $T_3$ statistics for the exponential distribution. Moreover, the coverage probability obtained by $T_1$ does not differ substantially from the one obtained by SRS, but the length of the interval obtained by $T_1$ is shorter than the one obtained by SRS.

Data application
Height is an important factor in basketball and football. The heights of basketball players are in the range of 184-210 cm, and the heights of football players are in the range of 170-188 cm. Talented players may have heights outside these ranges. So, in general, the average height of basketballers is greater than that of footballers. To examine this among Palestinian athletes, an SRS of 15 footballers and 15 basketballers is selected from an athletics club in Palestine. Moreover, an ERSS of 15 footballers and 15 basketballers (m = 3 and h = 5) is selected in the following way. A group of three players is randomly selected, ranked according to their heights, and the height of the shortest player is measured. A second group of three players is randomly selected, ranked according to their heights, and the height of the tallest player is measured. Finally, a third group is randomly selected, ranked according to their heights, and the height of the middle player is measured. This is a single cycle of ERSS of size m = 3. This process is repeated 5 times (h = 5), for both footballers and basketballers, to acquire an ERSS of size m × h = 15. The data are shown in Table 14. To construct a 95% confidence interval for $\delta = \mu_B - \mu_F$, the true difference between the two population means, we set ε = 0.001 and B = 1000. The results are reported in Table 15. It can be seen that, for testing $H_0: \delta = 0$ versus $H_1: \delta \ne 0$, all tests are significant at α = 0.05. In addition, all confidence intervals lie entirely above zero; that is, the expected height of basketballers is greater than that of footballers.

Summary and Conclusion
The two-sample permutation test was previously investigated within the context of RSS and multistage RSS. It was shown to be highly efficient and applicable in the context of RSS. In this paper, we extend the applicability of the permutation test to ERSS. The concept of ERSS is first introduced. Then, we review the tests for the independent two-sample design, namely the classical independent t-test, the Mann-Whitney U test, and the two-sample permutation test. Further, three different permutation test statistics for the two-sample ERSS design are suggested. The first statistic is based on the difference between the overall means of the two groups. The second is based on the studentized statistic, while the third is based on partial tests. The third statistic yields 2 or 3 independent p-values, which are combined into a single p-value for testing the overall hypothesis.
The following methods are explored for combining the p-values: the Fisher approach, the Liptak approach, and the logistic approach. The suggested statistics are compared with the usual permutation test statistic, the Mann-Whitney U test, and the classical t-test. The comparisons are made under different distributions, namely the uniform, normal, exponential, and gamma distributions. In summary, our simulation results show that the power levels of the permutation test statistics using ERSS are higher than the power levels of the classical two-sample independent t-test statistic and the permutation test statistic using SRS. In addition, we compute the confidence interval for the two-sample permutation problem under ERSS and observe that the confidence intervals are, on average, shorter than those obtained under SRS of equivalent sample size. The effect of sample size is considered, and it is observed that the performance of the proposed statistics improves as the sample size increases. A real-life application of this research is shown using an illustrative example. Based on these results, it is recommended to apply the permutation test within the framework of ERSS in lieu of SRS. It is worth noting that we assume perfect judgment ranking in the selection of the data points for the ERSS. A violation of this assumption is quite interesting and a subject of future work.