A Graphical and Numerical Method for Selection of Variables in Linear Models

A model is usually only an approximation of the underlying reality, and research aimed at capturing this reality adequately is in progress worldwide along many lines. Most diagnostic methods used to select the variables retained in the final model are either purely theoretical or purely graphical, which makes model assessment difficult. As a result, the number of regressors in a model may become very large or very small. The researcher therefore has to consider a variety of options and fit many models, and is then left unsure which to select and which to reject. This work introduces a diagnostic procedure for subset selection that reduces the number of possible models to be fitted. The strategy combines graphical and numerical measures; this combination helps considerably in reducing both the number of regressors in the model and the number of models. We also introduce some new approaches, and the considerable reduction in regressors achieved by this method does not prevent the researcher from including regressors of particular interest.


Introduction
In this article, we propose a strategy for the selection of independent variables in any model. It is sometimes assumed that the variables which constitute the equation are chosen in advance, i.e. that the independents in the model are fixed a priori. The whole analytical process then consists of examining the equation to see whether the functional specification and the assumptions about the residuals are satisfied. In many applications of regression analysis, however, the set of independent variables that constitute the model is not assumed in advance. In these situations, previous experience together with the underlying theoretical considerations can help the researcher or analyst to specify the set of independent variables.

Methods and criterion functions for subset selection are critically reviewed by Hocking (1976), and computational algorithms for subset selection are discussed by Miller (1984). The use of log linear polynomials is explained by Ali A (1986). Stepwise directed search, a combination of forward selection and stepwise backward elimination, is described by Broerson (1984), but the problem remains. Usually the problem consists of selecting an appropriate set of independent variables from a set that quite likely includes all the important variables, not all of which are necessary to model the response y adequately. As Montgomery (2003) states, "if the objective is to obtain a good description of a given process or to model a complex system, a search for regression equations with small residual sum of squares is indicated". We have used this fact while formulating our method.

Stepwise regression is used to economize on computational effort. This search method develops a sequence of regression models, at each step adding or deleting an x variable. The criterion for adding or deleting a variable can be stated equivalently in terms of error sum of squares reduction, coefficient of partial correlation, or F statistic, the variable with the largest F-value being considered the candidate for addition at the next stage. We have used the same idea in combination with the ratio of the coefficient of determination to the mean square of residuals, and have thus formulated a new criterion.
Xavier de Luna and Kostas Skouras (2003) used graphical tools on recursive prediction errors in combination with Schwarz's criterion (BIC) and Akaike's information criterion (AIC) and proposed "k" potential strategies. This seems useful, but we concentrate on the initial selection of variables. We do not discuss AIC, BIC and many other popular criteria because almost all of them have an extensive theoretical background. In comparison with such methods, our strategy does not require a demanding theoretical background; we have, however, made comparisons with the popular Cp criterion, because many authors have reported it to be a better criterion than AIC and BIC. Miller (1990), Fahrmeir, L & Tulz, Gerhard (1994), McCullagh et al (1989) and most statistical scientists agree that the number of regressors should be as small as possible and that R² should be relatively large. We have considered all of these in our analysis.
While building a model, consideration should also be given to the functional specification, because it is linked to variable selection. The selection of variables and the choice of their functional form are two problems that should in principle be solved simultaneously; for simplicity, however, they are treated sequentially. At present we confine ourselves to the selection of the variables; the specification is left for further research.
An important situation arises when the investigator has some prior justification for using certain variables (the justification may depend upon several factors, including exploratory data analysis). Thus both model-driven and exploratory analysis should be incorporated. We are therefore interested in screening the potential variables to obtain the model that contains the best subset among them via exploratory analysis. In short, in most problems there is no single regression model that is best in terms of all the evaluation criteria that have been proposed. A great deal of judgment and experience with the system being modeled is usually necessary to select an appropriate set of independent variables for a regression equation.

Variable Selection Strategy
Our strategy is very simple. It concentrates on the strength of correlation of the independent variables (x's) with the dependent variable (y) and on the multicollinearity among the independent variables.

1.
We include only those independents which have a significant correlation (at the 5% or 1% level) with the dependent variable; these are treated as primary variables. We exclude the independents which do not have a significant correlation with the dependent variable but do have a significant correlation with independents already declared primary. These rejected variables are the main cause of the reduction in the total number of models to be fitted.

2.
If two primary variables are correlated, then we treat each of them as a primary variable, but the two cannot appear together in any model.

3.
If any pair of variables is significantly correlated and the pair does not include any of the primary variables, then both are included in combination with the primary variables one at a time, but not both at once, because of the collinearity between them. In this way they form two different sets of models, i.e. each can combine with other variables which are not multicollinear with it. If there are "m" such pairs, they form "m" groups under the same conditions.

4.
We include in the potential models all those variables which have no correlation with the dependent variable or with the other independent variables. These variables are not considered a primary part of the model, but they may be combined with the primary variables. That is, they should not constitute a model on their own, without the primary variables, but only in combination with them.
In the above paragraphs, when we say multicollinearity or correlation, we mean significant correlation between two variables.
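Once the significance of each correlation is known, the four rules can be read as a purely combinatorial filter. The following Python sketch is one possible reading, not the authors' implementation. The function name `candidate_models` and its inputs (a dictionary of significance flags against y and a list of significantly correlated pairs of regressors) are hypothetical, and the significance tests are assumed to have been carried out beforehand. The flags in the usage example follow the description of Hald's data given in this paper.

```python
from itertools import combinations


def candidate_models(sig_with_y, sig_pairs):
    """Enumerate the models allowed by the screening rules (illustrative sketch).

    sig_with_y: dict mapping regressor name -> True if its correlation
                with y is significant (hypothetical input format).
    sig_pairs:  iterable of 2-tuples of regressors whose mutual
                correlation is significant.
    """
    pairs = {frozenset(p) for p in sig_pairs}
    primary = [v for v, sig in sig_with_y.items() if sig]
    # Rule 1: drop non-primary regressors correlated with a primary one.
    secondary = [
        v for v, sig in sig_with_y.items()
        if not sig and not any(frozenset((v, p)) in pairs for p in primary)
    ]
    pool = primary + secondary
    models = []
    for k in range(1, len(pool) + 1):
        for subset in combinations(pool, k):
            # Rule 4: a model must contain at least one primary regressor.
            if not any(v in primary for v in subset):
                continue
            # Rules 2 and 3: correlated regressors never appear together.
            if any(frozenset(pr) in pairs for pr in combinations(subset, 2)):
                continue
            models.append(subset)
    return primary, secondary, models


# Flags as described in the text for Hald's data.
sig_with_y = {"x1": True, "x2": True, "x3": False, "x4": True}
sig_pairs = [("x1", "x3"), ("x2", "x4")]
primary, secondary, models = candidate_models(sig_with_y, sig_pairs)
for m in models:
    print("y on", ", ".join(m))
```

With these flags the sketch returns five admissible models; x3 never enters because it is screened out by rule 1.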
For example, in Hald's data, with four independents (x1, x2, x3, x4), there are sixteen possible models, and many authors, such as Montgomery (2003), have fitted all sixteen models and then searched by different criteria for the most suitable set of independents for the final model. By our strategy we find that, out of these, only two are our target models, so we have reduced sixteen models to only two.

3. y on x1, 4. y on x2 and 5. y on x4. Hence five models in total can be fitted by our strategy, because in any other combination x3 may be present or x2 and x4 may appear together, and all such combinations have already been rejected by our strategy. We have also fitted the full model, for relative comparison only.
We have introduced some other criteria (these are explained in "Explanation of the terms and methods"):

1. C1 criterion
2. D1 criterion
These criteria are motivated by the fact that, for model fitting, R² should be large, MSE should be small, the number of variables should be small and the average gain from the independents should be large.

We have therefore calculated the average gain from the independents, where k represents the total number of independents in the model, and multiplied it by C1. In this way a more precise criterion, D1, is obtained; however, C1 alone can also identify the best model.
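As a rough illustration of the quantities involved, the sketch below computes R², MSE and the ratio of R² to MSE mentioned in the Introduction from the sums of squares of a fitted model. The function name `fit_summary` and the numbers in the example are hypothetical. Whether C1 is exactly this ratio is an assumption, and D1 is not reproduced because its average-gain term is not spelled out here.

```python
def fit_summary(sse, sst, n, p):
    """Return (R2, MSE, R2 / MSE) for a fitted linear model.

    sse: residual (error) sum of squares
    sst: total sum of squares about the mean of y
    n:   number of observations
    p:   number of fitted parameters, including the intercept
    """
    r2 = 1.0 - sse / sst    # coefficient of determination
    mse = sse / (n - p)     # mean square of residuals
    return r2, mse, r2 / mse


# Hypothetical sums of squares, for illustration only.
r2, mse, ratio = fit_summary(sse=10.0, sst=100.0, n=12, p=2)
print(f"R2 = {r2:.3f}, MSE = {mse:.3f}, R2/MSE = {ratio:.3f}")
```

A larger ratio rewards a model that has both a high R² and a small MSE, which is the trade-off described above.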
We have compared our scheme with other standard procedures such as forward selection, backward elimination and stepwise regression. We have also compared our results with those given by Neter et al (1987), Montgomery et al (2003) and Anderson & Bancroft (1952), and found that our strategy is simpler and gives at least the same results as the other well-known schemes. We also used NewR², first introduced by M.J.R. Healy (1994), in our calculations, but it did not lead to any improvement.
In order to explain the selection criteria and the strategy for inclusion of independent variables in any model, we define the following terms. According to our strategy, x1, x2 and x4 are initially the primary variables. The possible set of models excludes x3 because it is correlated with the primary variable x1, and hence the potential variables of the model are x1, x2 and x4. However, x4 is strongly correlated with x2; this means that x1 is compulsory in the model and there is a choice between x2 and x4. Since x2 and x4 should not both be included in the model, because their correlation is significant, the possible set of models consists of only two.

Explanation of the
1. y on x1 and x2.
2. y on x1 and x4.
However, we include the final set of independent variables for further analysis as

While examining the scatter diagrams we see that a linear trend is present only in x1, x2 and x4. The scatter diagrams reject the inclusion of x3 in potential models, so they can be used in the initial selection of the variables in a potential model.

By our method, the most favored variable is x3, which is treated as the primary variable. The candidates which may combine with x3 are now x1, x2 and x4. Here x4 is correlated with x3, so it is left out of the model; now we include x

Yes, the scatter diagrams help as before, and we can say that a linear trend is present only in x3.

NETER's DATA
If we combine the inference from histograms and scatter diagrams, we can say that only x3 can be a member of our final selection. By our strategy we fit only 12 rather than 64 models, and our most favored model must include x2 and x3, so these are treated as primary variables. The other possibilities, x1, x4, x5 and x6, may combine with x2 and x3. Since x5 and x6 are both correlated with x3, which is one of the primary variables, x5 and x6 are excluded from the model. x1 does not have any correlation with x4, so it is included in the model. x4 is not correlated with any other independent variable, so it can be a candidate in possible models. Up to this point there are only four variables in the model, namely x1, x2, x3 and x4. The required possibilities are now only 12, because the models without the combination with x3 are also excluded. Examining the scatter diagrams, we can say clearly that x2, x3 and x4 have a linear trend.
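One reading that reproduces the count of 12 is to enumerate every non-empty subset of the four remaining regressors and keep those that contain at least one primary variable. The short Python sketch below does this. The admissibility rule used here (at least one of x2 or x3 present) is an assumption made for illustration, and other readings of the exclusion rule are possible.

```python
from itertools import combinations

pool = ["x1", "x2", "x3", "x4"]  # regressors left after dropping x5 and x6
primary = {"x2", "x3"}           # primary variables named in the text

# Assumption: a model is admissible if it contains at least one primary variable.
admissible = [
    subset
    for k in range(1, len(pool) + 1)
    for subset in combinations(pool, k)
    if primary & set(subset)
]
print(len(admissible), "admissible models out of", 2 ** 6, "possible")
```

Of the 15 non-empty subsets of four regressors, only the three built from x1 and x4 alone are rejected, which leaves 12.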
If we combine both the histograms and the scatter diagrams, we may fairly say that the model includes x1, x2, x3 and x4, and thus in total 2⁴ models are required to be fitted rather than 2⁶.

Comparing all three tables above, we can say that our strategy is simpler both to apply and to understand, and gives the best possible results when selecting the variables in any model. With a larger number of regressors it is difficult to decide whether to retain a regressor in the model or to drop it, but the strategy remains applicable, and as a result the possible number of models is reduced dramatically.
We have also proposed a graphical method, which is likewise applicable. It is not a new strategy, since most statisticians have suggested it as a primary tool, but it is presented here as an alternative to sophisticated techniques such as forward selection, backward elimination and stepwise regression.
Our graphical strategy is not as powerful, but the numerical one is quite comparable to the sophisticated techniques mentioned earlier.
We can also compare our strategy with the well-known Cp criterion on Hald's data, as discussed by Montgomery (2003). We find that our strategy is better than Cp: with the Cp criterion we have to fit 16 models and then select x1 and x2 as regressors, whereas by our strategy the same result is achieved by fitting only 5 models.
We can make the same comparison on Neter's data and find our strategy even more suitable: Neter selects a model consisting of x1, x2 and x3, with MSE equal to 0.066, from sixteen possible models, whereas the model selected by our strategy consists of x1 and x2 only, with MSE equal to 0.064, from a total of four possible models.
We therefore suggest that the Cp criterion may produce better results if it is applied together with our strategy.

Further research
Although a variety of variable selection methods is in practice today, there is still plenty of work to be done. Our strategy may be improved by considering the following:
i) Outliers will be detected and removed prior to applying our technique.
ii) The mean of the present values will be used in place of missing values, if any occur in the variables.
iii) Adjusted R² may be used rather than R².
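Items ii) and iii) are simple enough to sketch directly. The Python below is a minimal illustration under stated assumptions: the helper names `impute_mean` and `adjusted_r2` are hypothetical, missing values are assumed to be coded as `None`, and p counts all fitted parameters including the intercept. The adjusted R² used is the standard form 1 - (1 - R²)(n - 1)/(n - p).

```python
def impute_mean(values):
    """Replace missing entries (None) by the mean of the present values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def adjusted_r2(r2, n, p):
    """Adjusted R2 for n observations and p parameters (intercept included)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)


# Hypothetical values, for illustration only.
filled = impute_mean([1.0, None, 3.0])
adj = adjusted_r2(r2=0.9, n=11, p=3)
print(filled, round(adj, 3))
```

Unlike R², the adjusted version can fall when a regressor is added, so it penalizes larger models directly.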
Montgomery D. C. (2003) fitted 16 models for the same set of data, using various methods including the BIC, AIC and Cp criteria, and found that the final model consists of x1 and x2. We have selected the same model by fitting only five models. Montgomery D. C. (2003) used the well-known Cp criterion, while our strategy is simpler and easier to apply than the Cp criterion.

Terms and methods
P = number of parameters.
MSE = mean square of residuals.
R² = coefficient of determination.

The example of Hald's data Definition of Variables:
* Correlation here and afterward means Pearson's correlation.
** Correlation is significant at the 0.01 level (2-tailed).

Neter et al (1987) selects the model x1, x2 and x3 by the Cp criterion, but in our analysis it is rejected by all our criteria and also by MSE, because the MSE of our selected model is less than that of Neter's model.

Anderson and Bancroft's data Definition of variables:
*Correlation is significant at the 0.05 level (2-tailed).

Scatter Diagrams Anderson and Bancroft's data
MSE is minimum for the sets of regressors (x1, x2, x3, x4) and (x2, x3, x4), but on the basis of MSE alone we cannot say that the model with the minimum MSE is the best, because in the traditional methods these sets of independent variables are not considered the best either. The well-known method of forward selection also rejects these sets of independent variables, and hence supports our strategy, which is very simple in the form of C1 and D1. The Cp criterion was used on Anderson and Bancroft's data by Ali A & Al Subaihi (2001), along with some other methods; they selected x1, x2 and x6 as the best set of variables, with no other details.