Generalised Model Based Confidence Intervals in Two Stage Cluster Sampling

Chambers and Dorfman (2002) constructed bootstrap confidence intervals in model based estimation for finite population totals, assuming that auxiliary values are available throughout the target population and that the auxiliary values are independent. They also assumed that the cluster sizes are known throughout the target population. We extend this work to two stage sampling in which the cluster sizes are known only for the sampled clusters, and we therefore predict the unobserved part of the population total. Jan and Elinor (2008) have done similar work but, unlike them, we use a general model in which the auxiliary values are not necessarily independent. We demonstrate that the asymptotic properties and coverage rates of our proposed estimator are better than those of the estimator constructed under the model assisted local polynomial regression model.


Introduction

1.1 Background
In specifying a sampling strategy in survey sampling, there are several approaches: the design based approach, the model assisted approach, the model based approach and the randomization-assisted model based approach. For a detailed review of these approaches, see Smith (1976) and Smith (1994). Our concern is the model based approach. Ouma and Wafula (2005) reviewed the work of Chambers and Dorfman (2002) and modified its conditions. However, they limited their work to simple random sampling. Suppose that $P$ is a finite population of $N$ elements and that the values of the survey variable $Y$ are known only for a sample, say $s$, of $n \le N$ of the population elements. One way of characterizing the sample selection is to assume that for every unit $i$ on the sampling frame, a new variable, say $S_i$, takes a value equal to the number of times that unit's value is observed. The distribution of these values defines the design of the sample survey.
Once the sample has been chosen, the values $(Y_i, i \in s)$ are known. Now, let the distribution of $S_i$ depend on the known population values of $X$, and suppose that one wishes to use the sample values together with the known values of $X$ to make an inference about the unknown finite population total. A major concern in the model based approach to statistical survey inference has been finding estimators of the population parameters that are robust to model misspecification.
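The role of the selection variable $S_i$ can be illustrated with a minimal sketch. It assumes, purely for illustration, that units are drawn by simple random sampling with replacement; the function name `selection_counts` and the sizes used are hypothetical and do not come from the paper.

```python
import random
from collections import Counter

def selection_counts(population_size, n_draws, rng):
    """Return [S_1, ..., S_N]: how often each unit is drawn (with replacement)."""
    draws = Counter(rng.randrange(population_size) for _ in range(n_draws))
    return [draws[i] for i in range(population_size)]

rng = random.Random(2024)  # fixed seed so the sketch is reproducible
S = selection_counts(population_size=10, n_draws=4, rng=rng)

# The sample s is the set of units observed at least once.
sample = [i for i, s_i in enumerate(S) if s_i > 0]
print(S, sample)
```

The distribution of the vector `S` over repeated draws is what the text calls the design of the sample survey.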

Outline of the paper
This paper is organised as follows. In Subsections 1.3, 1.4, 1.5 and 1.6, we briefly review model based estimation, local polynomial estimation, confidence intervals and two stage cluster sampling respectively, in each case pointing out gaps that our proposed estimator attempts to fill. In Section 2, we propose an estimator for the finite population total, and in Section 3 we suggest a bootstrap confidence interval for it. In Section 4, we derive the properties of our proposed estimator. We conclude the paper in Section 5 with a simulation experiment and some discussion.

Review of model based estimation
The model based approach to statistical survey sampling has been developed in considerable detail. In particular, we build on the work of Dorfman (1992), who proposed a non-parametric regression estimator for the population total under a model based approach. He illustrated that this estimator performs better than the corresponding design based and linear regression estimators. The model due to Dorfman (1992) relies on the assumption that the regression line passes through the origin and that the auxiliary values $X_i$ are independent. Suppose one or both of these assumptions are incorrect. Will the prediction intervals still retain their nominal properties? And will the estimator of the population total still be design unbiased? To develop the estimator, Dorfman (1992) assumed that the population values were generated by a model defined as

$$Y_i = m(X_i) + e_i, \qquad (1.1)$$
where $m$ is a smooth function and $e_i$ is an independent random variable with mean zero and constant variance. The non-parametric population total estimator due to Dorfman (1992) is defined as
$$\hat{T}_D = \sum_{i \in s} Y_i + \sum_{j \notin s} \hat{m}(X_j), \qquad \hat{m}(X_j) = \sum_{i \in s} w_i(X_j)\, Y_i,$$
where
$$w_i(x) = \frac{K_b(x - X_i)}{\sum_{l \in s} K_b(x - X_l)}$$
is the weight associated with the $i$th unit of the sample. Further, $k(u)$ is a symmetric density function, $b$ a scaling factor and $K_b(u) = b^{-1} k(u/b)$. In his empirical study, Dorfman (1992) illustrated that the estimator $\hat{T}_D$ performs better than the corresponding design based and linear regression estimators. These results were confirmed by Cheng (1994), who applied non-parametric regression to the estimation of population parameters when data are missing. Breidt and Opsomer (2000) also assumed model 1.1 and developed a new class of model assisted non-parametric regression estimators for the population total, based on local polynomial smoothing, a kernel method, with $h$ denoting the bandwidth. In their simulation study, their estimator $\hat{T}_{OB}$ performs better than the Horvitz-Thompson estimator, defined as
$$\hat{T}_{HT} = \sum_{i \in s} \frac{Y_i}{\pi_i},$$
where $\pi_i$ is the inclusion probability of unit $i$. However, the theory developed in Breidt and Opsomer (2000) for the local polynomial regression estimator applies only to direct element sampling designs with auxiliary information available for all elements of the population. Consequently, we offer more insight on the consistency of the coverage rates using a general super population model in two stage sampling.
Ji-Yeon et al. (2009) recently extended the work of Breidt and Opsomer (2000) to two stage cluster sampling, where the estimators are linear combinations of estimators of cluster totals with weights that are calibrated to known control totals. They indicated that the local polynomial regression estimators are constructed by modeling the $M$ points $(x_i, t_i)$ as a realization from an infinite super population model $\zeta$ in which the mean of $t_i$ is a smooth function of $x$ and the variance function $\mathrm{var}(x)$ is also smooth and strictly positive. In equation 1.7, which enters the definition of their estimator, $e_1$ represents the first column of the identity matrix and $X_{si}$ is the corresponding design matrix. From their simulation results they concluded that the estimator 1.6 is more efficient than the Horvitz-Thompson and the linear regression estimators when the mean function of the super population model is non-linear, while being nearly as efficient when the model is linear.
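To make the contrast between a kernel-based prediction estimator of the Dorfman type and the Horvitz-Thompson estimator concrete, the following is a minimal sketch rather than any author's implementation. It assumes normalised kernel weights and, as an illustrative choice of symmetric density, the Epanechnikov kernel; the function names and the toy data are hypothetical.

```python
def epanechnikov(u):
    """Epanechnikov kernel, a symmetric density supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kernel_predict(x0, x_sample, y_sample, b, kernel=epanechnikov):
    """Kernel-weighted average of sample y values around x0, scaling factor b."""
    weights = [kernel((x0 - x) / b) for x in x_sample]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no sample point within one bandwidth of x0")
    return sum(w * y for w, y in zip(weights, y_sample)) / total

def predicted_total(x_sample, y_sample, x_nonsample, b):
    """Observed sample total plus kernel predictions for non-sampled units."""
    predictions = (kernel_predict(x, x_sample, y_sample, b) for x in x_nonsample)
    return sum(y_sample) + sum(predictions)

def horvitz_thompson(y_sample, inclusion_probs):
    """Design based total: sample values weighted by inverse inclusion probabilities."""
    return sum(y / p for y, p in zip(y_sample, inclusion_probs))

# Toy data (hypothetical): three sampled units, two non-sampled units.
x_s, y_s = [0.1, 0.5, 0.9], [1.0, 2.0, 3.0]
print(predicted_total(x_s, y_s, x_nonsample=[0.3, 0.7], b=0.5))
print(horvitz_thompson(y_s, inclusion_probs=[0.6, 0.6, 0.6]))
```

The prediction estimator needs the auxiliary value of every non-sampled unit, whereas the Horvitz-Thompson estimator needs only the inclusion probabilities of the sampled units.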
Recently, Jan and Elinor (2008) considered the problem of estimating the population total in two-stage cluster sampling when cluster sizes are known only for the sampled clusters, making use of a population model arising from a variance component model. They considered the application of the predictive likelihood technique to the estimation of the unknown part of the population total
$$T = \sum_{i=1}^{N} \sum_{j=1}^{m_i} y_{ij},$$
where $N$ is the number of primary sampling units or clusters, each cluster consists of $m_i$ units, which are known only for the sampled clusters, and $y_{ij}$ is the value of the variable of interest for unit $j$ of the $i$th cluster. They assumed a population model that specifies the variances and covariances of the $y_{ij}$, the covariances applying in cases where $j \neq k$, with $\rho \geq 0$. To predict the unobserved value of $Z$ in the estimate of the population total $T$, they developed a partial likelihood for $Z$, $L(z \mid y)$, from the generalized joint likelihood for the unknown quantities $z$ and $\theta$. They applied the design based Horvitz-Thompson estimator of the population total together with the model, where $n_0$ represents the number of primary sampling units selected in the first stage. In their simulation, they considered three coverage measures of $Z$: the model based coverage over the joint distribution of $Y$ and $Z$, the design based coverage over the sampling design, and the unconditional coverage regarding the total sample as a stochastic variable. They concluded that for a small number $n_0$ of sampled clusters the three intervals differ significantly, but for large $n_0$ the three intervals are practically identical.
Further, a comprehensive simulation study of the model based and design based coverage properties of the prediction intervals indicates that for large sample sizes the coverage measures achieve approximately the nominal level $1 - \alpha$, that they are slightly less than $1 - \alpha$ for moderately large samples, and that for small sample sizes they are about $1 - 2\alpha$, being raised to $1 - \alpha$ for a modified interval based on the $t_{n_0 - 2}$ distribution. We note that the models 1.9 and 1.10 assume that the regression line passes through the origin and that the auxiliary values $X_i$ are independent. The questions raised in subsection 1.3 therefore remain unanswered.

Review of confidence intervals in survey sampling
Confidence intervals are usually constructed around point estimators in order to provide a properly scaled measure of the uncertainty associated with the estimator. The conventional method is based on the assumption that the sample size is large enough for the Central Limit Theorem to hold. This is, however, not always true in practice.
As a consequence, Do and Kokic (2001) and Chambers and Dorfman (2002) applied the bootstrap method to develop model based confidence intervals for situations where the sample sizes are not large. They also proposed modifications of the procedure to account for misspecification of the working model. They further noted that successive model refinements and estimators obtained using the bootstrap approach are more efficient than competing estimators. However, the evidence of the extended simulation study on the beef population showed that the procedure did not fully attain its goal. They therefore recommended the construction of sounder confidence intervals using the bootstrap approach. Ouma and Wafula (2005) suggested the use of a general super population model $Y_i = m(X_i) + e_i$, where $m$ is a smooth function and $e_i$ is an independent random variable with mean zero and constant variance. They used a bandwidth of 1.5 and simple random sampling with replacement to generate the values of the survey variable $Y$. In their empirical study, they established that their coverage rates were higher than those of Chambers and Dorfman (2002). We now extend this to two stage cluster sampling.

Review of two stage cluster sampling
Let $U$ be a finite population of $N$ primary sampling units (psu), let $m_i$ be the number of secondary sampling units (ssu) in psu $i$, and let $y_{ij}$ be the value of the response variable $Y$ for the ssu $j$ belonging to the psu $i$. In previous work, it has been assumed that the cluster specific and element specific auxiliary data are known for all clusters and all population elements, respectively.
For our case, we assume that the cluster sizes are known only for the sampled clusters, and therefore the survey values $y_{ij}$ are generated using the model
$$y_{ij} = m(x_{ij}) + e_{ij}, \qquad i = 1, 2, \ldots, N, \quad j = 1, 2, \ldots, m_i. \qquad (1.15)$$

Proposed Estimator for population total.
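Jan and Elinor (2008) used the model 1.15 to define the population total as
$$T = \sum_{i=1}^{N} \sum_{j=1}^{m_i} y_{ij},$$
where $N$ is the number of primary sampling units or clusters, each cluster consists of $m_i$ units, which are known only for the sampled clusters, and $y_{ij}$ is the value of the variable of interest for unit $j$ of the $i$th cluster. Referring to the same model 1.15, we may write
$$T = \sum_{i \in s} \sum_{j \in s_i} y_{ij} + Z, \qquad (2.2)$$
where $s$ is the set of sampled clusters, $s_i$ is the set of units sampled in cluster $i$, and $Z$ is the total of the unobserved values. It follows that the problem is reduced to that of predicting the unobserved value $z$ of the random variable $Z$. To do this, we apply the general model 1.15 to predict the values of the unobserved survey variables. The estimate of the population total is therefore given by
$$\hat{T} = \sum_{i \in s} \sum_{j \in s_i} y_{ij} + \hat{Z},$$
where $\hat{Z}$ is the prediction of $Z$ under model 1.15.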

Proposed Bootstrap Confidence Interval
Under the model based approach, the sampling distribution of the estimator corresponds to the distribution of possible alternative point estimates that could arise, given the selection of the same sample $s$, from populations similar to the actual underlying population of the observed data. To construct a confidence interval for $T$ that reflects the actual finite sample and finite population characteristics of the distribution of $\hat{T}$, we estimate this distribution from the sample data. For our case, we make use of the sample data and the working model 1.15 to generate a sequence of alternative realizations of $Y$ using non-parametric estimates of $m$:
$$y^{*}_{ij} = \hat{m}(x_{ij}) + e^{*}_{ij}. \qquad (3.1)$$
In equation 3.1, $e^{*}_{ij}$ is selected via two stage cluster sampling with replacement from the sample residuals $\{\hat{e}_{ij} = y_{ij} - \hat{m}(x_{ij}) : i = 1, 2, \ldots;\ j = 1, 2, \ldots\}$. The bootstrap confidence interval is then obtained from the resulting bootstrap distribution.
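The bootstrap idea described in this section can be sketched as follows. This is an illustrative simplification under stated assumptions, not the paper's procedure: residuals are resampled by simple resampling with replacement rather than by two stage cluster sampling, the interval is formed from the empirical quantiles of the bootstrap prediction errors, and the function `bootstrap_interval`, its arguments and the toy inputs are all hypothetical.

```python
import random

def bootstrap_interval(t_hat, fitted_s, fitted_r, residuals, estimator,
                       B=1000, alpha=0.05, seed=0):
    """Residual-bootstrap interval for a population total (illustrative only).

    fitted_s, fitted_r: fitted means for sampled and non-sampled units.
    residuals: sample residuals y - fitted.
    estimator: maps a bootstrap sample of y values to an estimated total.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(B):
        y_s = [m + rng.choice(residuals) for m in fitted_s]  # bootstrap sample values
        y_r = [m + rng.choice(residuals) for m in fitted_r]  # bootstrap non-sample values
        t_star = sum(y_s) + sum(y_r)                         # bootstrap population total
        errors.append(estimator(y_s) - t_star)               # bootstrap prediction error
    errors.sort()
    lower_q = errors[int((alpha / 2) * (B - 1))]
    upper_q = errors[int((1 - alpha / 2) * (B - 1))]
    # Invert the error distribution around the point estimate.
    return t_hat - upper_q, t_hat - lower_q

# Toy inputs (hypothetical): the estimator adds fixed predictions for the non-sample.
fitted_s, fitted_r = [1.0, 2.0, 3.0], [1.5, 2.5]
residuals = [-0.2, 0.0, 0.2]
estimator = lambda y_s: sum(y_s) + sum(fitted_r)
low, high = bootstrap_interval(10.0, fitted_s, fitted_r, residuals, estimator, B=200)
print(low, high)
```

In a fuller implementation the estimator would be refitted on each bootstrap sample, and the resampling of residuals would respect the cluster structure.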

Unbiasedness of the model
Considering the model 1.15, the bootstrap estimator is written in terms of the following quantities: $m$ is the bootstrap sample size, $r_i$ is the number of times the $i$th primary sampling unit is selected, $x_{ij}$ is the $j$th observation made from the $i$th cluster, and the initial sampling weight of a secondary sampling unit is equal to the inverse of its selection probability. However, there is considerable benefit and little loss in a convenient choice of the bootstrap sample size $m$. With the initial sampling weights of the secondary sampling units so defined, with $b$ a scaling factor and with the other symbols bearing their usual meanings, it follows under model 1.15 that the estimator is unbiased, which completes the proof that the proposed model is unbiased.

Asymptotic variance of the error term
From subsection 4.1, the variance of the error term is obtained by using equations 4.15 and 4.16 in equation 4.14, which gives equation 4.17. Using the theorem, equation 4.17 can be rewritten, and the asymptotic variance follows on noting that the number, $m_i = n$, of second stage samples becomes large.

Conditional relative bias of the estimator for the population total
The conditional relative bias is defined for the use of $\hat{T}$ as an estimator of $T$. Using equation 4.9 in equation 4.26, it can be seen that as $n \to \infty$ this bias tends asymptotically to zero.

Description of the simulation experiment
Simulation experiments were performed in order to compare the performance of the model based regression estimator with that of the model assisted local polynomial regression estimator in two-stage element sampling due to Ji-Yeon et al. (2009).
To obtain the model based estimator for the population total, the $X_i$ are generated as independent and identically distributed uniform(0, 1) random variables. The population consists of 100 clusters. In stage one, a sample of $n_i = 20$ clusters, which form the primary sampling units, is taken from the $N_i = 100$ clusters using simple random sampling with replacement.
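The sampling design of the experiment can be sketched in a few lines. Only the 100 clusters, the 20 first stage draws with replacement and the uniform(0, 1) auxiliary values come from the text; the common cluster size of 10, the second stage sample size of 5 and the use of simple random sampling with replacement at the second stage are assumptions made for illustration, as are the function names.

```python
import random

def make_population(n_clusters, cluster_size, rng):
    """Auxiliary values x_ij drawn i.i.d. from uniform(0, 1), grouped by cluster."""
    return [[rng.random() for _ in range(cluster_size)] for _ in range(n_clusters)]

def two_stage_sample(population, n_psu, n_ssu, rng):
    """Stage 1: SRSWR of clusters. Stage 2 (assumed): SRSWR of units per drawn cluster."""
    sample = []
    for _ in range(n_psu):
        i = rng.randrange(len(population))  # first stage draw
        units = [rng.randrange(len(population[i])) for _ in range(n_ssu)]  # second stage
        sample.append((i, units))
    return sample

rng = random.Random(1)  # fixed seed for reproducibility
population = make_population(n_clusters=100, cluster_size=10, rng=rng)
sample = two_stage_sample(population, n_psu=20, n_ssu=5, rng=rng)
print(sample[:3])
```

Because the first stage is with replacement, the same cluster index may appear more than once in `sample`.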
In stage two, units are sampled from each selected cluster, where $y_{jk}$ is the $k$th observation made from the $j$th cluster. Using the above bootstrap procedure, the bootstrap estimate of the population total, based on a bootstrap sample of the same size as the parent sample, is calculated. The process is then repeated a large number, $B$, of times. Similarly, we construct the confidence interval for the population total $T$ and compare the performance of the developed model on estimation of $T$ with that due to Ji-Yeon et al. (2009) on local polynomial regression estimation in two-stage sampling.
In computing the model assisted local polynomial regression estimators, we again adopt the method due to Ji-Yeon et al. (2009), as a means of having a realistic comparative study. We therefore apply the Epanechnikov kernel
$$K(u) = \tfrac{3}{4}\,(1 - u^{2})\, I\{|u| \le 1\}$$
for computation of the local polynomial regression estimator of population totals. This helps to compare the bias, mean squared errors and confidence interval lengths of the estimators using both the model based and the model assisted local polynomial regression approaches.

Simulation Results
Table 1 gives the mean squared error of the model based estimator (MSEmb) and of the local linear polynomial regression estimator of the population total in two stage cluster sampling. The confidence intervals generated by the model based method are much tighter than those generated by the local polynomial method at lower bandwidths, but at larger bandwidths both the LPRE and the model based estimators of the population total perform poorly. We note that the best performing confidence interval is one whose coverage rate is close to the nominal level and whose length is small. Consequently, the model based estimators are far better than their local polynomial counterparts.
If, further, the sampled and non-sampled values of $x$ lie in the interval $[c, d]$ and are generated by densities $d_s$ and $d_{p-s}$ respectively, both bounded away from zero on $[c, d]$ and with continuous second derivatives, then for any expression of $Z$ the result of equation 1.4 can be shown explicitly.

Table 1 : Mean Squared Error (MSE)
It can be seen that at lower bandwidths the MSE for the model based estimator is higher than that of the Local Polynomial regression estimator. As the bandwidth increases, the MSEmb falls sharply and then remains low. It is important to note that an increase in the bandwidth does not significantly change the MSE of the Local Polynomial Regression Estimator (LPRE). Generally, the model based estimator is more efficient than the Local Polynomial estimator of the totals. Table 2 is a summary of the bias of the model based estimator of the population total and of the Local Polynomial regression estimator.

Table 2 : Summary Results of Bias
The bias of the model based estimator is much lower than that of the Local Polynomial regression estimator. The large bias associated with the Local Polynomial estimator is reflected in its estimates, which are much lower than the true simulated population total of 99.5078. This can best be attributed to the choice of the variance. The precision of estimation can be improved by choosing a smaller value of the variance. Table 3 presents a summary of the estimated population totals for the model based and the Local Polynomial regression estimators in two stage sampling.

Table 4 now gives the coverage rates and lengths of the 95% confidence intervals for the model based and Local Polynomial Regression models.
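The performance measures compared in the tables (bias, mean squared error, coverage rate and interval length) can be computed from simulation output along the following lines. This is a minimal sketch: the function names and the toy intervals are hypothetical, and only the true simulated total of 99.5078 is taken from the text.

```python
def bias(estimates, true_total):
    """Average estimate minus the true population total."""
    return sum(estimates) / len(estimates) - true_total

def mse(estimates, true_total):
    """Average squared deviation of the estimates from the true total."""
    return sum((t - true_total) ** 2 for t in estimates) / len(estimates)

def coverage_and_length(intervals, true_total):
    """Share of intervals containing the true total, and their mean length."""
    covered = sum(1 for low, high in intervals if low <= true_total <= high)
    mean_length = sum(high - low for low, high in intervals) / len(intervals)
    return covered / len(intervals), mean_length

TRUE_TOTAL = 99.5078  # true simulated population total reported in the text
toy_intervals = [(99.0, 100.0), (98.0, 99.0), (99.5, 101.5), (100.0, 102.0)]
rate, length = coverage_and_length(toy_intervals, TRUE_TOTAL)
print(rate, length)
```

A method is preferred when its coverage rate is close to the nominal 95% level and its mean interval length is small.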