Mean Estimation under Imputation based on Two-Phase Sampling Design using an Auxiliary Variable

The present article offers more efficient imputation-based estimators of the population mean under the framework of two-phase sampling in the presence of an auxiliary variable. The theoretical conditions under which the proposed estimators are superior, in terms of relative efficiency, to some prevalent competing estimators are established and supported by numerical illustrations based on three different data sets from the classical statistical literature.


Introduction
Sampling units may refuse to participate in the sample study or may not answer some items in the questionnaire, the investigators may be unable to contact all the selected sampling units, or unforeseen factors may accidentally cause the loss of some collected information, leading to an incomplete survey response. Operationally, a value is identified as 'missing' for a variable if a meaningful value for the specific analysis to be performed is hidden (Raghunathan, 2016). For example, survey response options labeled 'don't know' or 'none of the above' in demography, pathology, election or any other sample survey may be treated as missing values in the context of an analysis. The nature and quantum of missing cases/items in the response variable determine the population size of the corresponding auxiliary variable. For example, missing demographic, pathological or other data corresponding to a drop-out respondent are recorded as 'missing'. However, the same information corresponding to a subject who has died after a certain stage would not be classified as missing for the purpose of analysis and is therefore not to be imputed. Imputation consists of replacing the actual missing values in the sample with a plausible set of values to make up a single complete data set. Two possible consequences of analysis based on incomplete or amputed data are loss of efficiency and bias.
A missing data mechanism formalizes the conditional relationship between the missing and the observed values. A partially observed response variable consists of the observed values on the subjects who provided information and the unobserved values on the subjects who did not. The latter are regarded as Missing Completely At Random (MCAR) if every sample subject has the same chance of responding. In other words, the distribution of the response variable is identical for the respondents and the non-respondents, so the observed sample data remain representative of the population even though some data are lost through the non-responding units.
The present article proposes new improved classes of estimators for the population mean of the study variable under the two-phase sampling framework, in the presence of a highly positively correlated auxiliary variable, implicitly assuming MCAR. A two-phase sampling design is more useful, powerful and economical than simple random sampling without replacement (SRSWOR) when the population mean of the auxiliary variable is unknown at the start of the survey, and the present article takes this mean to be unknown in advance. Samples for each phase are selected through SRSWOR, and modifications of combined exponential ratio and product type estimators are proposed for imputing missing data in practice. Expressions for the bias, mean square error (M.S.E.) and minimum M.S.E. of the proposed estimators are derived. The expressions for the bias and M.S.E. are in fact infinite Taylor series whose terms are functions of a variable, and these functions are approximated to varying degrees by the partial sums of the series. For large sample sizes the higher-order terms are negligible; therefore, the present paper considers first-order approximations. Theoretical and empirical efficiency comparisons are carried out, which show that the proposed classes of estimators are more efficient than the existing estimators.
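The sampling scheme described above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name `two_phase_sample` and its parameters are hypothetical, and design-II is read here as drawing the second-phase sample from the units outside the first-phase sample, which is only one possible interpretation. MCAR is modeled by taking the respondents as a simple random subsample of the second-phase sample.

```python
import numpy as np

def two_phase_sample(N, n_first, n, r, design="I", seed=None):
    """Draw unit labels (0..N-1) for a two-phase SRSWOR sample with MCAR response.

    Hypothetical helper: returns (first_phase, second_phase, respondents).
    """
    rng = np.random.default_rng(seed)
    population = np.arange(N)

    # First phase: large preliminary SRSWOR sample S' of size n'.
    first_phase = rng.choice(population, size=n_first, replace=False)

    if design == "I":
        # Design-I: S is a sub-sample of S'.
        second_phase = rng.choice(first_phase, size=n, replace=False)
    elif design == "II":
        # Design-II (assumed reading): S is drawn from the units not in S'.
        remaining = np.setdiff1d(population, first_phase)
        second_phase = rng.choice(remaining, size=n, replace=False)
    else:
        raise ValueError("design must be 'I' or 'II'")

    # MCAR: every unit of S has the same chance of responding, so the
    # r respondents form a simple random subsample of S.
    respondents = rng.choice(second_phase, size=r, replace=False)
    return first_phase, second_phase, respondents


# Sizes taken from Population B in this paper (N=923, n'=360, n=180, r=175).
s1, s2, resp = two_phase_sample(N=923, n_first=360, n=180, r=175, design="I", seed=0)
```

Under design-I the second-phase labels are a subset of the first-phase labels; under the assumed reading of design-II the two sets are disjoint.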

Notations
Let $\Omega = \{1, 2, \ldots, N\}$ be a finite population of size $N$ and let the character under study be $y$. A large preliminary simple random sample (without replacement) $S'$ of size $n'$ is drawn from the population $\Omega$, and a secondary sample $S$ of size $n$ $(n < n')$ is drawn using the concept of two-phase sampling and the mechanism of MCAR, for given $r$, $n$ and $n'$. We have: where,

Reviewing Existing Imputation Methods and Corresponding Estimators
Some commonly adopted estimators for imputing unknown data values under nonresponse in sample surveys are summarized in Table 1.

Proposed Imputation Methods and their Properties
Let $y'_{ji}$ denote the $i$-th available observation for the $j$-th imputation method.
The first proposed method:
Using the above, the imputation-based estimator of the population mean $\bar{Y}$ is:
To illustrate the general results, a particular case of the proposed class of estimators [...] (4) to (7), the bias and M.S.E. of $T'$ under design-I and design-II, to the first degree of approximation, are respectively obtained as
The second proposed method:
Using the above, the imputation-based estimator of the population mean $\bar{Y}$ is:
where the constants involved are suitably chosen scalars.
To illustrate the general results, a particular case of the proposed class of estimators [...] (14) to (17), the bias and M.S.E. of $T'$ under design-I and design-II, to the first degree of approximation, are respectively obtained as
The third proposed method:
Using the above, the imputation-based estimator of the population mean $\bar{Y}$ is:
To illustrate the general result, a particular case of the proposed class of estimators [...]
The fourth proposed method:
Using the above, the imputation-based estimator of the population mean $\bar{Y}$ is:
[...] (34) to (37), the bias and M.S.E. of $T$ under design-I and design-II, to the first degree of approximation, are respectively given as:
Conditions under which the proposed estimators are more efficient than the contemporary estimators are summarized below (for proof, see Appendix II):

Illustrative Examples
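The percentage relative efficiency (P.R.E.) of the various estimators for population A, population B and population C with respect to $T_m$ under the two-phase sampling scheme is computed using the following formula:
$$\mathrm{P.R.E.}(*) = \frac{\mathrm{M.S.E.}(T_m)}{\mathrm{M.S.E.}(*)} \times 100,$$
where (*) represents the respective estimator. The results are summarized in Table 8.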
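A minimal sketch of this computation follows, assuming the usual definition of percentage relative efficiency: the M.S.E. of the reference estimator $T_m$ divided by the M.S.E. of the estimator under comparison, times 100. The function name and the M.S.E. values below are hypothetical placeholders for illustration only; they are not results from Table 8.

```python
def percent_relative_efficiency(mse_reference, mse_estimator):
    """P.R.E. of an estimator relative to a reference estimator (assumed definition).

    Values above 100 mean the estimator has a smaller M.S.E. than the reference.
    """
    if mse_reference <= 0 or mse_estimator <= 0:
        raise ValueError("mean square errors must be positive")
    return 100.0 * mse_reference / mse_estimator


# Hypothetical M.S.E. values, NOT taken from the paper.
mse = {"T_m": 4.0, "existing": 2.5, "proposed": 2.0}

# P.R.E. of every estimator with respect to the mean estimator T_m.
pre = {name: percent_relative_efficiency(mse["T_m"], value)
       for name, value in mse.items()}
```

By construction the reference estimator has a P.R.E. of 100, so any estimator with a P.R.E. above 100 is more efficient than $T_m$ under this definition.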

Conclusions
Imputation is a strategy that provides the nearest substitute for a non-respondent unit in an item survey. Several such strategies for mean imputation of the missing group of data exist in the literature. The estimator under each strategy has an inherent deviation from the true value of the parameter (such as the mean). Research in imputation is always directed towards reducing this gap and improving the estimator, so that the practitioners who are the end users of these theoretical contributions achieve higher precision. The present work thus proposes some new estimators for mean imputation, under two distinct sampling designs, which are more efficient than some of the popularly used mean estimators for imputation.
To sum up, an analytic discussion of some contemporary and some proposed imputed mean estimates for unit non-response in field substitution has been carried out. Data from three different studies, each comprising a positively related auxiliary variable and study variable, have been selected for the purpose of illustration. Table 6 shows the theoretical conditions for efficiency comparisons within the proposed estimators. On the basis of the corresponding minimum M.S.E.'s, the proposed estimators [...] The conditions under which the four proposed estimators are more efficient, under each design, than the ratio, product, exponential ratio, exponential product, combined ratio and combined product estimators are shown in Table 7. Table 7 clearly indicates the conditions for all four proposed estimators to be more efficient than (and hence superior to) the six existing estimators considered in this paper, under both designs.

The present paper is therefore a useful contribution for practitioners in the area of missing data analysis, as it offers estimators that improve on the existing ones for imputing lost or missing data.
For Table 7: comparisons of the mean estimator with each of the four proposed estimators are shown below.
Comparisons of the other existing estimators with the four proposed estimators can be obtained similarly.
The secondary sample $S$ is drawn either as a sub-sample of the preliminary large sample $S'$ (denoted by design-I) or independently of the sample $S'$ (denoted by design-II), without replacing $S'$. Let the number of responding units in the sample of size $n$ be denoted by $r$ $(r < n)$. For every unit $i \in R$ the value $y_i$ is observed, but for the units $i \in R'$ the $y_i$ values are missing and imputed values are derived. The $i$-th value $x_i$ of the auxiliary variate is used as a source of imputation for missing data when $i \in R'$. For $S$, the data [...]
The estimator, bias, M.S.E. and minimum M.S.E. of the proposed estimator under design-I and design-II are obtained as

2: The estimator, bias, M.S.E. and minimum M.S.E. of the proposed estimator under design-I and design-II, up to the first order of approximation, are given by

3: The estimator, bias, M.S.E. and minimum M.S.E. of the proposed estimator under design-I and design-II, up to the first order of approximation, are obtainable as

[...] in (24) to (27), we get the bias and M.S.E. of $T'$ under design-I and design-II, to the first degree of approximation, respectively as

4: The estimator, bias, M.S.E. and minimum M.S.E. of the proposed estimator under design-I and design-II, up to the first order of approximation, are given as
[...] general result, a particular case of the proposed class of estimators [...], an estimator of the population mean is

Pak.j.stat.oper.res. Vol. XII No. 4, 2016, pp. 639-658
Comparison of Estimators
Conditions for superiority among the proposed class of estimators are described below (for proof, see Appendix II):
Population A [Source: Kadilar and Cingi (2006)]: Y represents apple production and X represents the number of apple trees. Other information related to the population is:
Population B [Source: Koyuncu and Kadilar (2009)]: Y is the number of teachers and X is the number of students. Other information related to the population is: $N = 923$, $n = 180$, $\bar{Y} = 436.4345$, $\bar{X} = 11440.498$, $\rho = 0.9543$, $S_y = 749.9394$, $C_y = 1.7183$, $S_x = 21331.131$, $C_x = 1.8645$, $n' = 360$, $r = 175$.
Population C [Source: Murthy (1967)]: Y is the area under winter paddy and X is the corresponding geographical area. The following data are based on the given population:

Three populations from the classical statistical literature are considered for comparison, each of which consistently shows that the proposed estimators are more competent for imputation in practice. We conclude that the proposed estimators outperform all the other contemporary estimators considered, under each design. It is evident that under each design the proposed estimators [...]
Using Taylor's expansion in the above expressions and ignoring terms of $o(n^{-1})$ leads to equations (3), (13), (23) and (33). Now, taking expectation on both sides of equation (

Table 1: Some well-known imputation methods when $i \in R'$. Assume the $j$-th imputation strategy.
Mean method: $\bar{y}_r$
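As an illustration of how such imputation methods produce a point estimator, the sketch below implements the mean method and, as an assumption, the standard ratio method of imputation, in which a missing $y_i$ is replaced by $\hat{b} x_i$ with $\hat{b} = \sum_{i \in R} y_i / \sum_{i \in R} x_i$. The ratio formula is the textbook form and is not taken from Table 1. The function names and the toy data are hypothetical.

```python
import numpy as np

def impute_mean(y, responded):
    """Mean method: replace each missing y_i by the respondent mean."""
    y = np.asarray(y, dtype=float)
    responded = np.asarray(responded, dtype=bool)
    completed = y.copy()
    completed[~responded] = y[responded].mean()
    return completed

def impute_ratio(y, x, responded):
    """Ratio method (assumed textbook form): replace missing y_i by b_hat * x_i."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    responded = np.asarray(responded, dtype=bool)
    # Slope estimated from the responding units only.
    b_hat = y[responded].sum() / x[responded].sum()
    completed = y.copy()
    completed[~responded] = b_hat * x[~responded]
    return completed


# Toy data (hypothetical): two of the five sampled units did not respond.
y = np.array([10.0, 12.0, np.nan, 15.0, np.nan])
x = np.array([5.0, 6.0, 7.0, 8.0, 9.0])
responded = np.array([True, True, False, True, False])

mean_estimate = impute_mean(y, responded).mean()
ratio_estimate = impute_ratio(y, x, responded).mean()
```

After mean imputation, the mean of the completed data equals the respondent mean $\bar{y}_r$. After ratio imputation it equals $\bar{y}_r \bar{x}_n / \bar{x}_r$, where $\bar{x}_n$ is the mean of $x$ over all $n$ sampled units and $\bar{x}_r$ is its mean over the respondents.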

Table 3: M.S.E.'s of the suggested estimator under design-I and design-II

Table 8: P.R.E.s of various estimators with respect to $T_m$ under the two-phase sampling design

Table 8 gives insight into the percentage relative efficiency (P.R.E.) of various estimators under two-phase sampling with respect to $T_m$.