Identification of High Leverage Points in Linear Functional Relationship Model

In a standard linear regression model the explanatory variables, X, are considered to be fixed and hence assumed to be free from errors. In reality, however, they are variables and consequently can be subject to errors. The regression literature makes a clear distinction between outliers in the Y-space (errors) and outliers in the X-space. The latter are popularly known as high leverage points: observations whose explanatory variables are subject to gross error or show an unusual pattern. High leverage points often exert too much influence and consequently lead to misleading conclusions about the fit of a regression model, cause multicollinearity problems, and mask and/or swamp outliers. Although a good deal of work has been done on the identification of high leverage points in the linear regression model, this is still a new and unsolved problem in the linear functional relationship model. In this paper, we suggest a procedure for the identification of high leverage points based on the deletion of a group of observations. The usefulness of the proposed method for the detection of multiple high leverage points is studied using a well-known data set and Monte Carlo simulations.


Introduction
The linear functional relationship model (LFRM) is an extension of the linear regression model (LRM) which allows for sampling variability in the measurements of both the response and the explanatory variables. In regression, a model may be poorly fitted because of the presence of outliers. It has been common practice over the years to use residuals for the identification of outliers. Residuals are in fact estimates of the true errors that occur in the Y-space. We anticipate that fitting the LFRM could be even more complicated, because here outliers could occur in the X-space more frequently than in the linear regression model. Outliers in the X-space are called high leverage points in the regression literature since they exert too much weight on the fitting of the model. When we use the ordinary least squares (OLS) or the maximum likelihood (ML) method for fitting a regression line, the resulting residuals are functions of leverages and true errors. Thus high leverage points together with large errors (outliers) may pull the fitted line in such a way that the fitted residuals corresponding to those outliers are too small, and this may cause masking (false negatives) of outliers. For the same reason the residuals corresponding to inliers may be too large, and this may cause swamping (false positives). Peña and Yohai (1995) pointed out that high leverage cases are mainly responsible for masking and swamping of outliers in linear regression. The unfortunate consequences of the presence of high leverage points in linear regression have been studied by many authors. The presence of a high leverage point could increase (often unduly) the value of $R^2$. Chatterjee and Hadi (1988) mentioned the existence of collinearity-influential observations whose presence could induce or break the collinearity structure among the explanatory variables.
Kamruzzaman and Imon (2002) and Imon and Khan (2003a) pointed out that high leverage points may be the prime source of collinearity-influential observations. Imon (2009) pointed out that in the presence of high leverage points the errors not only become heteroscedastic, they might produce big outliers as well. Another way to deal with outliers is to use M-estimators (Zamanzade et al.; Zamanzade et al., 2019; and Zamanzade et al., 2018). The presence of high leverage points could thus make the procedures for the detection of heteroscedasticity very complicated. That is why the identification of high leverage points is essential before making any kind of inference. In this paper our main objective is to identify high leverage points in a linear functional relationship model. Although some efforts have been made on the identification of outliers and influential observations in the LFRM (e.g., Abdullah, 1995; Vidal, 2007; and Wellman, 1991), so far as we know there is no reported work on the identification of high leverage points in the LFRM. Let us consider a simple linear regression model $y_i = \alpha + \beta X_i + \varepsilon_i$ (1), where $y_i$ is the response, $X_i$ is the (supposed) explanatory variable assumed to be constant, and specific assumptions are made on $\varepsilon_i$. We feel that the assumption of $X_i$ being constant in model (1) [...] (2) is called a functional relation. So the main difference between an LRM and an LFRM is that in the LRM the explanatory variable is assumed to be free from error, whereas in the LFRM it is subject to error. In section 2, we discuss different measures of leverage. Since all conventional measures of leverage are based on a fixed explanatory variable, in section 3 we introduce the estimated values of X in the LFRM, which can be considered as fixed. In section 4, we discuss a procedure for the identification of high leverage points in the LFRM. The usefulness of the proposed measure is studied through real-world data in section 5 and through Monte Carlo simulations in section 6.

Measures of Leverages
In regression analysis it is sometimes very important to know whether any set of X-values is exerting too much influence on the fitting of the model. According to Hocking and Pendleton (1983) [...] For a perfectly balanced design, $w_{ii}$ can be written as [...]. Rousseeuw and Leroy (1987) showed that the Mahalanobis distance for each point has a one-to-one relationship with $w_{ii}$ and does not yield any extra information on the leverage structure of a data point. Hadi (1992) pointed out that traditionally used measures of leverage are not sensitive enough to high leverage points. He introduced a single-case-deleted leverage measure, named the potential, which is believed to be more sensitive to high leverage points. Imon and Khan (2003b) showed that in the presence of multiple high leverage points, observations are masked in such a way that even the potential values may not flag all of them. As a remedy to this problem, Imon (2002) [...]
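The two classical measures mentioned in this section can be illustrated with a minimal sketch. It assumes a simple regression with an intercept, takes $w_{ii}$ to be the diagonal of the hat matrix, and uses the standard identity that the single-case-deleted potential equals $w_{ii}/(1 - w_{ii})$. The function names and the data values are hypothetical and chosen only for illustration.

```python
import numpy as np

def hat_diagonals(x):
    """Diagonal elements w_ii of W = Z (Z'Z)^{-1} Z' with Z = [1, x]."""
    x = np.asarray(x, dtype=float)
    Z = np.column_stack([np.ones_like(x), x])
    # Solve (Z'Z) A = Z' rather than forming the inverse explicitly.
    A = np.linalg.solve(Z.T @ Z, Z.T)
    # Row-by-column products give the diagonal of Z @ A.
    return np.einsum("ij,ji->i", Z, A)

def potentials(x):
    """Single-case-deleted potentials p_ii = w_ii / (1 - w_ii)."""
    w = hat_diagonals(x)
    return w / (1.0 - w)

# Illustrative X values (not from the paper); the last one is far from the rest.
x = np.array([20.0, 22.0, 25.0, 27.0, 30.0, 33.0, 35.0, 100.0])
w = hat_diagonals(x)
p = potentials(x)
print(np.round(w, 3))
print(np.round(p, 3))
```

Because $1 - w_{ii}$ shrinks as $w_{ii}$ approaches 1, the potential stretches the upper end of the leverage scale, which is why it reacts more strongly to an isolated extreme point.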

Estimation of the Fixed-X in LFRM
All the leverage measures discussed in section 2 are designed for fixed-X models and hence cannot be readily applied to errors-in-variables models. In this section we obtain the estimated values of X so that these values can be used as fixed-X in the subsequent analysis.
Let us assume [...]. Model (2) is also known as the unreplicated linear functional relationship model when there is only one relationship between the two variables X and Y. There are $(n + 4)$ parameters to be estimated, which are [...] (8), and the likelihood function is given by [...]. Differentiating (10) with respect to the parameters and $X_i$, we proceed to the ML solution [...], which is not consistent. Kendall and Stuart (1979) showed that a consistent estimator of the error variance can be derived by [...]. Using (11) and (13) [...]
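Since the estimating equations of this section did not survive in the text, the following is only a sketch of how fixed-X values can be obtained under one common set of assumptions: the ratio of the two error variances, written here as `lam` (error variance of y divided by error variance of x), is taken as known, and the slope, intercept and fitted X values are the standard closed-form ML solutions for that case. The function name `fit_lfrm` and the default `lam=1.0` are hypothetical; this is not necessarily the paper's formula (17).

```python
import numpy as np

def fit_lfrm(x, y, lam=1.0):
    """ML fit of y = alpha + beta * X with errors in both x and y.

    ASSUMPTION: lam = var(error in y) / var(error in x) is known.
    Returns (alpha, beta, x_hat), where x_hat are the estimated fixed-X values.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    sxx = np.sum((x - xbar) ** 2)
    syy = np.sum((y - ybar) ** 2)
    sxy = np.sum((x - xbar) * (y - ybar))
    if sxy == 0.0:
        raise ValueError("slope is undefined when the sample covariance is zero")
    d = syy - lam * sxx
    beta = (d + np.sqrt(d ** 2 + 4.0 * lam * sxy ** 2)) / (2.0 * sxy)
    alpha = ybar - beta * xbar
    # Each fitted X_i is the projection of (x_i, y_i) onto the fitted line,
    # weighted according to the error-variance ratio.
    x_hat = (lam * x + beta * (y - alpha)) / (lam + beta ** 2)
    return alpha, beta, x_hat

# Illustrative data (not from the paper).
x = np.array([1.0, 2.1, 2.9, 4.2, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])
alpha, beta, x_hat = fit_lfrm(x, y)
print(alpha, beta, np.round(x_hat, 3))
```

The returned `x_hat` values play the role of the fixed-X values to which the leverage measures of section 2 can then be applied.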

Identification of High Leverage Points in LFRM
In this section we suggest a procedure for the identification of high leverage points in the linear functional relationship model. From a set of observed x and y (both assumed to be measured with error), we estimate the fixed-X values using equation (17) [...]. The above formula contains the mean and a sum of squares about the mean, both of which could be very sensitive to high leverage points. For this reason we propose a new formula for the leverages analogous to formula (20), in which the nonrobust components are replaced by their robust alternatives. Hence the formula is [...]. It is easy to show from (20) and (21) that mean($\hat{w}_{ii}$) = median($w_{ii}$) = 2/n.
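Formulas (20) and (21) are not legible in the text, so the sketch below rests on assumptions. The traditional measure is the usual simple-regression leverage, whose mean is exactly 2/n. The robust version is one hypothetical form that replaces the mean by the median and the sum of squares by n times the median squared deviation; it was chosen only because it reproduces the stated property median = 2/n, and it may differ from the authors' formula.

```python
import numpy as np

def traditional_leverage(x_hat):
    """1/n + squared deviation from the mean over the total sum of squares."""
    x_hat = np.asarray(x_hat, dtype=float)
    dev2 = (x_hat - x_hat.mean()) ** 2
    return 1.0 / x_hat.size + dev2 / dev2.sum()

def robust_leverage(x_hat):
    """ASSUMED robust analogue: median location, median squared deviation scale.

    Undefined if more than half of the values coincide (zero median deviation).
    """
    x_hat = np.asarray(x_hat, dtype=float)
    dev2 = (x_hat - np.median(x_hat)) ** 2
    return 1.0 / x_hat.size + dev2 / (x_hat.size * np.median(dev2))

# Illustrative fixed-X values (not from the paper); the last three are extreme.
x_hat = np.array([20.0, 22.0, 25.0, 27.0, 30.0, 33.0, 35.0, 100.0, 105.0, 110.0])
print(np.round(traditional_leverage(x_hat), 3))
print(np.round(robust_leverage(x_hat), 3))
```

With several extreme points, the mean and the sum of squares are both pulled toward them, which flattens the traditional values; the median-based version is not affected in this way as long as fewer than half of the points are extreme.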
We consider several rules for the identification of high leverage points: the twice-the-mean (2M) rule, the thrice-the-mean (3M) rule, and a newly introduced cut-off point. Since it may not be easy to find the theoretical distribution of $w_{ii}$, and excessively high leverage values can affect measures like the mean and standard deviation, we define a confidence-bound-type cut-off point analogous to the forms used by Hadi (1992) and Imon (2002, 2005) [...]. We compare the performance of the above rules in terms of the correct identification of high leverage points and the swamping rate of good leverage points.
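The cut-off rules can be sketched as follows. The 2M and 3M rules use the mean leverage 2/n. The expression for the new cut-off point is missing from the text, so the third function is an assumption: it uses the median plus a multiple `c` of the normalized median absolute deviation, in the spirit of the confidence-bound forms attributed above to Hadi (1992). The constant `c = 3` and all function names are hypothetical.

```python
import numpy as np

def cutoff_2m(n, k=2):
    """Twice-the-mean rule: 2 * (k / n), with k = 2 for a simple regression."""
    return 2.0 * k / n

def cutoff_3m(n, k=2):
    """Thrice-the-mean rule: 3 * (k / n)."""
    return 3.0 * k / n

def cutoff_median_mad(w, c=3.0):
    """ASSUMED confidence-bound form: median(w) + c * MAD(w) / 0.6745."""
    w = np.asarray(w, dtype=float)
    med = np.median(w)
    mad = np.median(np.abs(w - med)) / 0.6745
    return med + c * mad

def flag(w, cutoff):
    """Indices of observations whose leverage exceeds the cut-off."""
    return np.flatnonzero(np.asarray(w, dtype=float) > cutoff)

# Illustrative leverage values (not from the paper).
w = np.array([0.03, 0.035, 0.04, 0.045, 0.05, 0.9])
print(flag(w, cutoff_median_mad(w)))
```

Unlike the 2M and 3M rules, a median-based bound adapts to the observed spread of the leverage values, so it changes from one data set to another even when n is the same.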

Monte Carlo Simulations
In this section we report a Monte Carlo simulation designed to investigate the performance of different measures of leverage in the linear functional relationship model. For four different sample sizes, n = 20, 30, 50 and 100, we generated the X values from Uniform(20, 40). We consider three different percentages of high leverage points: 10%, 20% and 30%. The X value corresponding to the lowest high leverage value is set at 100 and each subsequent value is incremented by 5. To generate a model like (2), we then define [...], where the error term is also N(0, 1). For each sample we apply all five leverage identification rules mentioned in section 4 and compute the correct identification rate (IR) and the swamping rate (SR) as percentages. We run 10,000 simulations for each combination, and the results are presented in Table 1. When no high leverage point exists, we observe from Table 1 that for n = 20 all methods considered in the simulation perform well. However, rule 1, i.e., the traditional leverage measure based on the 2M rule, has a swamping rate of about 5%. The newly proposed rule 4 performs best, as its swamping rate is the lowest, followed by rule 2, rule 5 and rule 3. The performance of all these rules tends to improve as the sample size increases, but rule 1 still has a relatively high swamping rate, which clearly shows that the 2M rule is too prone to declare low leverage points as high leverage points. In [...]
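The design described above can be sketched in a few lines. This is a simplified illustration and not the authors' simulation: the leverages are computed directly on the observed x (true X plus N(0, 1) error) rather than on ML-estimated fixed-X values, only the 3M cut-off is applied, the robust leverage is the hypothetical median-based form, and the number of replications is kept small. All function names are assumptions.

```python
import numpy as np

def traditional_leverage(x):
    """1/n + squared deviation from the mean over the total sum of squares."""
    x = np.asarray(x, dtype=float)
    dev2 = (x - x.mean()) ** 2
    return 1.0 / x.size + dev2 / dev2.sum()

def robust_leverage(x):
    """ASSUMED robust analogue based on the median (not the paper's formula)."""
    x = np.asarray(x, dtype=float)
    dev2 = (x - np.median(x)) ** 2
    return 1.0 / x.size + dev2 / (x.size * np.median(dev2))

def simulate(n=20, pct=0.10, reps=200, seed=0):
    """Return {measure: (IR %, SR %)} under the 3M cut-off 3 * (2 / n)."""
    rng = np.random.default_rng(seed)
    m = int(round(pct * n))              # number of planted high leverage points
    is_high = np.zeros(n, dtype=bool)
    is_high[n - m:] = True
    cutoff = 3 * 2.0 / n
    measures = {"traditional": traditional_leverage, "robust": robust_leverage}
    hits = dict.fromkeys(measures, 0)
    swamps = dict.fromkeys(measures, 0)
    for _ in range(reps):
        X = rng.uniform(20.0, 40.0, size=n)
        X[n - m:] = 100.0 + 5.0 * np.arange(m)   # 100, 105, 110, ...
        x = X + rng.normal(0.0, 1.0, size=n)     # observed x with N(0, 1) error
        for name, leverage in measures.items():
            flagged = leverage(x) > cutoff
            hits[name] += int(np.sum(flagged & is_high))
            swamps[name] += int(np.sum(flagged & ~is_high))
    out = {}
    for name in measures:
        ir = 100.0 * hits[name] / (reps * m) if m else float("nan")
        sr = 100.0 * swamps[name] / (reps * (n - m))
        out[name] = (ir, sr)
    return out

if __name__ == "__main__":
    for pct in (0.10, 0.20, 0.30):
        print(pct, simulate(n=20, pct=pct, reps=200, seed=1))
```

IR is the percentage of planted high leverage points that are flagged, and SR is the percentage of the remaining points that are flagged wrongly; both are pooled over all replications.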

Example
We consider a real-world data set to investigate the performance of our proposed method. In order for the relationship to follow model (2), we assume that measurement error can occur in either variable of this example.

Iron in Slag Data
This example is taken from Hand et al. (1994), which gives 50 results for the iron content of crushed blast furnace slag measured by two different techniques: a chemical test (Y) and a magnetic test (X). The original data together with the X values estimated by the maximum likelihood formula (17) are presented in Table 2. We then compute the leverage values for this data set; these values are presented in Table 2. Here the cut-off point is 0.08 for rules 1 and 3, 0.12 for rules 2 and 4, and 0.0975 for rule 5. We observe from Table 3 that the traditional leverage values $\hat{w}_{ii}$ do not identify any high leverage points, but the 2M rule swamps six good cases.
The newly proposed leverage measures $w_{ii}$ do not identify any high leverage points either, but the 2M rule swamps one good case. The 3M rule does not identify any high leverage point for either of the two leverage measures. We observe exactly the same performance from the rule based on the new cut-off point.
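As a quick check, the 2M and 3M cut-off points quoted for this data set follow from the value 2/n given in section 4 for the mean (or median) leverage, with n = 50. The rule 5 cut-off depends on the observed leverage values and is not reproduced here.

```latex
\[
\text{2M cut-off} = 2 \times \frac{2}{n} = 2 \times \frac{2}{50} = 0.08,
\qquad
\text{3M cut-off} = 3 \times \frac{2}{n} = 3 \times \frac{2}{50} = 0.12 .
\]
```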

Modified Iron in Slag Data
Next we modify the original iron in slag data by inserting a few high leverage points. We consider three different situations. In case 1, five low leverage cases (10%) are replaced by high leverage points. In cases 2 and 3 we replace 20% and 30% of the low leverage points, respectively, by high leverage points. Table 4 gives the first 15 observations of the modified iron in slag data, and the corresponding leverage values are given in Table 5. Here the cut-off point is 0.08 for rules 1 and 3 and 0.12 for rules 2 and 4, as before. The cut-off points for rule 5 are 0.1052, 0.1024 and 0.0845 for 10%, 20% and 30% high leverage points, respectively. We observe from Table 5 that for 10% contamination the traditional leverage values $\hat{w}_{ii}$ identify the high leverage points successfully, but their performance deteriorates as the level of contamination increases. For 20% contamination they fail to identify four of the 10 high leverage cases, and for 30% contamination they fail to identify 10 of the 15 high leverage points. The newly proposed leverage measures $w_{ii}$ perform very well in this regard: all high leverage points are successfully identified irrespective of the level of contamination.

Conclusions
In this paper, our main objective is to propose a new method for the identification of high leverage points in the linear functional relationship model. After obtaining a method for finding the fixed-X values, we propose three different identification rules based on robust measures of leverage. Both the numerical example and the simulation results show that the traditionally used measures may often fail to identify even a single high leverage point when 20% to 30% of the data are high leverage points. The 2M rule based on the traditional leverage measure also has a relatively high swamping rate. The proposed methods, however, perform very well on every occasion. Our study clearly shows that they can correctly identify all high leverage points without swamping low leverage cases.