Efficient Rank-Based Analysis of Multilevel Models for the Family of Skew-t Errors

Rank-based analysis of linear models is based on selecting an appropriate score function. The information about the shape of the underlying distribution is necessary for the optimal selection; leading towards asymptotically efficient analysis. In this study, we analyzed the multilevel model with cluster-correlated error terms following a family of skew-t distribution with the rank-based approach based on score function derived for the class of skewnormal distribution. The rank fit is compared with the Restricted Maximum Likelihood (REML) estimation in terms of validity and efficiency for different sample sizes. A Monte Carlo simulation study is carried out over skewed-t and contaminated-t distribution with a range of skewness parameter from moderately to highly skewed. The standard error of regression coefficients is significantly reduced in the rank-based approach and further reduces for a large sample size. Rank-based fit appeared asymptotically efficient than REML for each shape parameter of skewness in skew-t and contaminated-t distribution computed through a calculation of precision. The empirical validity of fixed effects is obtained up to the nominal level 0.95 in REML but not rank-based with skew-normal score function.


Introduction
Several experimental and observational studies often occur resulting in cluster-correlated data, for example in clinical designs with multiple centers known as clusters, repeated measurement designs where individuals are supposed as clusters. In the randomized complete block, design blocking is a random effect, and blocks serve the same as clusters, both terms can be used interchangeably (Auda et al., 2017). One example of these mixed effect models is a multilevel model for incorporating cluster-correlated errors. The observations between clusters are independent and within each cluster dependent. Hence the assumption of the independent error term in linear models is violated. The most common conventional technique to estimate the mixed effect model is Restricted Maximum Likelihood (REML) and methods based on the ordinary least square fitting (OLS). Inferential procedures like testing of hypothesis, confidence intervals, standard error of fixed effects are affected if the underlying likelihood is non-normal due to the presence of outliers (Auda et al., 2017). Neglecting the dependent structure in the model causes underestimating the standard error of the regression coefficient and inflating type-I error (Goldstein, 1995). The only small number of contaminated values can spoil the REML fit leading to biased estimates. The robust estimation methods are least sensitive to outliers and therefore provides an attractive alternative to REML and OLS for linear models. One of the commonly used robust analysis is rank-based fitting of linear models. Auda et al., performed a detailed comparison of the rank-based method with traditional estimation techniques REML of the mixed model. They discussed the power and efficiency of rankbased methods over normal and some non-normal situations (Auda et al., 2017). Kloke, McKean, & Rashid (2009), has extended this rank theory for linear models with cluster-correlated errors. This rank theory provides a comprehensive methodology including general linear testing of hypothesis, confidence intervals, and robust diagnostic measures for model fit (Hettmansperger & McKean, 2010). It replaces the Euclidean norm in the least square fitting with the pseudo norm that produces robust fixed effect estimates . The rank-based theory is generally efficient however, this efficiency can be optimized depending on the distribution of random errors . If the form of the underlying distribution is known, the corresponding selection of the rank-based procedure would provide optimal analysis. This study is focused on rank-based analysis with score function that is suitable for skew-normal (SN) class of distributions of the error term in linear models. This study is focused only on the case of skew-t errors and showed that enhanced efficiency in rank-based method is attained by using suitable score function and the method is far more powerful and efficient as compared to REML. While its application is presented in detail for the skew-t (ST) class of distribution by discussing its validity and efficiency over several different values of the skewness parameter. Skew-normal is a rich class of distribution presented by (Azzalini, 1985). This family of distribution has a shape parameter ,  −     to control the skewness of each distribution in this class.
Because the distribution is symmetric in this class, it can be left or right-skewed depending upon the -ve and +ve value of  respectively. The skew-normal distribution reduces to simple normal-distribution when the shape parameter 0  = . This family of distribution has been used in numerical studies by (Adelchi Azzalini & Capitanio, 2003), (Azzalini & Genton, 2008) and (Azzalini & Capitanio, 2014). This study is intended to facilitate valid and precise inferences for skewed data occurred in various practical situations. Some common examples of positive skewness might include income and expenditure of families, prices, number of accidents claimed by insurance companies, number of children etc. are the reasonable events to achieve low boundary usually at zero and extremely large values are also possible to occur. Negative skewness usually less often occurs in real problems, but some situations like students' scores on easy tests for which students are especially motivated and another example includes results of Olympic long jump by athletes. Also retirement ages may have negative skewness: mostly people retire after the age of 60 or 65 but some professions give retirement in early ages, similarly in developed countries age at death is negatively skewed. One of the popular distributions to handle these kind of data sets is skewed-t distribution.
The student's t-distribution is unable to capture the asymmetry however, it efficiently captures values away from mean and heavy tails (Basalamah, 2017). Several methodologies in literature are proposed to incorporate asymmetry. We considered a widely held approach proposed (Arellano-Valle & Azzalini, 2013). Skew-t distribution from this skewsymmetric rich class of distributions is considered for the generation of random errors. Likewise, the skew-normal distribution, skewness of the skew-t distribution is also controlled by a shape parameter ,  −     . The appropriate score function for skew-normal error terms is developed in package sn ( Azzalini, 2014). This package contains suitable functions to generate and manipulate skew-normal and skew-t random variants. This package can be downloaded at https://CRAN.R-project.org/package=sn. The detail about data examples and functions used in this package are available at http://azzalini.stat.unipd.it/SN. We applied the score function suitable for skew-normal error distribution from this package and skew-t error terms are generated from multilevel models. The rank-based fit is obtained by using R package jrfit and REML estimates are achieved through package lme4. We performed a Monte Carlo simulation study to discuss the robustness and efficiency of rank-based procedures developed for skew-normal and applied over a class of skew-t distributions at different sample sizes.
In this study, we conducted a rank-based analysis for a skew-t family of distribution based on skew-normal score function. A simulation study is performed with the rank-based procedure based on a shape parameter of the skewnormal score function. The analysis for correct  is fully efficient. The rank fit also shows quite similar results with neighborhood values of the correct  . The effect of sample size in the multilevel model is also considered because a larger sample size reduces standard error (SE) leading towards efficient analysis. Thus the analysis is repeated over three different sample sizes small to moderate and the rank-based fit is compared with the REML approach. The findings of this study confirm the outclass performance of skew-normal procedure over the traditional approach REML in the neighborhood of the correct  applied on skew-t level-1 and level-2 random errors in the multilevel model.
Where Y is the dependent variable, X is np  the design matrix,  is the 1 p  vector of slope parameters,  is the intercept, and e is a 1 n vector of residuals that are ..

Random Intercept Model
The general form of the random intercept multilevel model in combined form is presented (Goldstein, 1995): Where ij y is a dependent variable measured at level-1 (i subscript refers to individual-level variation and j refers to group-level variation), ij x is an independent variable measured at level-1, 1  is the fixed slope for each group. ij e is a random error obtained from the level-1 regression equation. 00  is the overall intercept, it is a fixed effect i.e., invariant over the clusters. It shows a unique effect of cluster j on the intercept. Also 0 j u is a random error that shows the deviation of group-level slopes from the overall slope. It is assumed to be normally and independently distributed across individuals with distribution function F and density function f.

Brief Overview of Rank Theory
In rank-based theory, a pseudo norm is used instead of the Euclidean norm in OLS.
This joint rank-based estimator of  is called Jaeckel's dispersion function (Jaeckel, 1972). This is a convex function of  as it is defined in terms of the norm. It is efficient and robust in Y-space (J. D. Kloke & McKean, 2012). Al- Where R denotes the rank of scores sum equal to zero. Some score functions make assumptions about the distribution of the raw scores and few of them do not make any assumptions such as the Wilcoxon score function. If the underlying distribution of the error term is known, Hájek & Sidák, (1967) showed that optimal scores can be found as: Then the pdf of standard skew-t becomes: Parallel to the fact in a normal distribution, positive values of  making the distribution right-skewed and negative values of  forms the distribution left-skewed. When  →Skew-t density approaches to half t density (Basalamah, 2017). In simulation settings values of  in the interval of -12 and +12 are considered, by using the fact of little difference between skew-normal and half normal distribution when  is near 8    All the values of SE of and  reducing significantly in the rank-based method as compared to REML, which justifies the robustness of rank theory. With the increase in sample size, the SE decreases a little bit that confirms the effect of a large sample size. The value of precision is calculated greater than 1 indicates the rank-based fit is more precise than REML. And the value of precision less than 1 means REML is better. For each value of the shape parameter, precision is always greater than 1. The most efficient analysis of fixed effects comes out in rank-based analysis within the skew-t distribution depicted from high values of Because REML performs worse in such a high skewness and rank-based method still performs best. Surprisingly, the rank-based procedure with score function 7 () u  does not produce nominal confidence close to 0.95. Table-2 comprises a parallel analysis for contaminated skewed-t error distribution of error term. Empirical efficiencies are greatly efficient than REML. The value of precision is computed similarly for 7  = and its neighborhood points. For the shape parameters 10 and 12

 =
, the precision is significantly high. For highly skewed and contaminated data rank-based method produces a far better fit than REML. Overall precision is high in skewed-t error cases as compared to contaminated error cases. It means empirical efficiencies outperforms REML in a significant magnitude. For 12  = , in skew-t distribution precision 28.54 is extremely high than for contaminated error case. It has empirical efficiency 2854% more than REML even with score function 7 () u  that is not appropriate for 12  = , the analysis can be further optimized by using 12 () u  . In the contaminated-t error case, for 10  = the rank method is 1098% more efficient than REML. All the rank-based procedures based on skew-normal score function appeared robust and efficient than REML for skew-t and contaminated-t distribution of error term in multilevel models. But the validity in terms of confidence intervals is not improved in the rank-based method.

Summary and Conclusion
The rank-based analysis is a robust and efficient alternative of traditional methods in case of non-normality of error terms in linear models. A suitable selection of score function is necessary to reach optimal analysis. For this purpose, the knowledge about the underlying distribution of the error term is required. Practically, most often Wilcoxon (default) score is used that is suitable for symmetric moderate tailed distribution. For example, rank-based analysis with normal score function applied for normally distributed error terms would produce a completely efficient analysis. In this study, the performance of rank-based analysis based on score function developed for the skew-normal distribution of error in the linear model is evaluated for errors randomly generated from a class of skew-t distributions in multilevel models. And the rank-based approach is compared with the traditional REML in three different sample sizes in terms of validity and efficiency. Two models skew-t and contaminated-t distribution are used with different values of shape parameter in the family of skew-t distribution. The score function is developed for a particular value of the shape parameter. Thus analysis would be optimized for the correct value of . We selected the score function appropriate for 7  = and performed rank analysis for 4,5, 6, 7,8,9,10 and 12 The optimized analysis is obtained for and  inside 3 elements of correct in skew-t and contaminated-t distribution. SE and 95% confidence interval for fixed effect parameters are computed. SE is reduced through each rank-based analysis as compared to REML for both fixed effects and  , its magnitude also decreases with increasing sample size. Validity in terms of confidence interval should reach a nominal level 0.95, but in a rank-based approach, it is a little bit smaller than REML. Precision is calculated to compare REML and rank-based fit for each case in simulation settings. It is always greater than 1 means rank-based is overall better. For contaminated-t error, precision shows quite similar patterns for 7  = and neighborhood points. But for 10,12  = with such a high skewness precision is 1098% more efficient than REML. In skewed-t case, the highest value of precision 2854% is attained when. Because REML produces worse fit in such a highly skewed error. Based on the empirical efficiencies, the score function performed well but not for empirical confidence levels.