Efficient Model Selection For Moisture Ratio Removal Of Seaweed Using Hybrid Of Sparse And Robust Regression Analysis

The Internet of Things (IoT) consists of networks of physical devices such as sensors, home appliances, electronics, and software. It enables data to be collected and exchanged in several fields. After data are collected from IoT devices, variable selection becomes a major problem because real-life datasets involve many variables. The current study addresses model selection, including interaction terms, for a large dataset. The dataset is taken from a solar drier, with moisture ratio removal (%) as the dependent variable and ambient temperature, chamber temperature, collector temperature, chamber relative humidity, ambient relative humidity, and solar radiation as independent variables. Three hybrid models are used: LASSO with Huber M, LASSO with Hampel M, and LASSO with Bisquare M. These techniques are compared with ridge regression and ordinary least squares (OLS) after a multicollinearity test and a coefficient test. The mean absolute percentage error (MAPE) is used to assess the forecasting accuracy of the selected model, with smaller values indicating a more efficient model. The hybrid of LASSO with Bisquare M is found to be the most efficient model, with the minimum MAPE value. The resulting model, with its selected variables, can therefore be used to predict moisture ratio removal (%) and to determine seaweed drying behavior.


Introduction
The Internet of Things (IoT) involves different types of software, electronics, sensors, and home appliances through which data can be connected and exchanged. With these systems, computer-based systems can be designed to improve economic benefits and efficiency and to reduce human effort (Malavade and Akulwar, 2017). Using IoT devices, data can be collected from sensors to determine factors such as humidity, air speed, temperature, and water irrigation. The population is rising, and nearly 797 million people face food insecurity, which is considered a major problem in the agricultural sector (Ahmed et al., 2017). In agriculture, seaweed is a product used in the food, medical, and manufacturing industries. Seaweed is also a potential source of renewable energy, as it can be converted to fuels such as biofuel and gas. Dissa et al. (2011) considered carrageenan to be the main reason for seaweed extraction.
Pender et al. (2004) estimated a structural economic model for the increase in agricultural productivity. To address the multicollinearity problem, Giacalone and Panarello (2017) introduced Lp-norm estimation methods. Neitsch et al. (2011) used analysis of variance (ANOVA) to assess the yield factor. Xu and Ying (2010) performed variable selection using median regression with a LASSO-type penalty. Zhao et al. (2012) investigated a wavelet-based LASSO for regressing scalars on functions, using simulated data and a real dataset to study asymptotic convergence and finite-sample performance. Zhang et al. (2016) used LASSO, Adaptive LASSO, Adaptive LASSO II, Multitask LASSO, and reweighted LASSO for expression quantitative trait loci (eQTL) analysis in the study of human cancer-related factors. According to Shariff and Ferdaos (2017), ridge regression is one of the methods used in the presence of multicollinearity, but its results are affected by outliers, which may have a negative impact on the results of a data analysis. Gad and Qura (2016) reviewed different types of robust methods for handling outliers. Midi et al. (2011) proposed practical lower bounds (LB) and upper bounds (UB) for the High Leverage Collinearity Influential Measure (HLCIM), which is considered an essential measure for detecting the degree of multicollinearity.
Model selection has also been studied by several researchers. Abdullah et al. (2008) used the eight selection criteria (8SC), listed in Table 2, to obtain the best model among all possible models, and Yahaya et al. (2013) similarly presented a logistic regression model using 8SC. Later, Javaid et al. (2019) studied efficient model selection for the moisture ratio removal of seaweed using a solar drier. For that purpose, multiple linear regression analysis was carried out in four phases.
In that study, the efficient model was selected using the significant factors identified in the analysis. A total of 32 models, with interaction terms up to the third order, were considered. Efficient model selection was made with 8SC, and MAPE was used to assess the predictions of the best model for moisture ratio removal (Javaid et al., 2019). The current study instead presents hybrid models of LASSO and robust regression analysis. All possible models containing the variables and their interaction terms are analyzed, giving a total of 192 models. The hybrid model selection is made using LASSO and robust estimators with interaction terms up to the fifth order. A large dataset is used so that the results can be more efficient in terms of MSE.

Multiple Regression
Stuart (2011) described regression analysis as a widely used technique with $x$ as an independent variable, $y$ as a dependent variable, and $\varepsilon$ as an error term. With the design matrix $X$, the response vector $Y$, and the error vector $\varepsilon$, the classical linear model is $Y = X\beta + \varepsilon$. The aim of least squares estimation is to obtain the estimates of $\beta$ by minimizing
$$S(\beta) = (Y - X\beta)^{T}(Y - X\beta). \tag{1}$$
The least squares estimate $\hat{\beta}$ is the solution to the normal equations $X^{T}X\hat{\beta} = X^{T}Y$, so when $X^{T}X$ is nonsingular the estimate can be computed directly from the data as $\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$.
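As a minimal sketch of how the least squares estimate follows from the normal equations $X^{T}X\hat{\beta} = X^{T}Y$, the following pure-Python example builds $X^{T}X$ and $X^{T}Y$ and solves the system by Gaussian elimination. The function names and the toy data are hypothetical and are not taken from the study.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; least squares solution is not unique")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols_fit(X, y):
    """Return beta_hat solving (X'X) beta = X'y. Rows of X start with 1 for the constant."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Toy data generated without noise from y = 1 + 2*x1 - x2 (illustrative only).
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1 + 2 * row[1] - row[2] for row in X]
beta_hat = ols_fit(X, y)
```

When two predictors are nearly collinear, the pivot check above is where the system becomes ill-conditioned, which motivates the ridge estimator discussed next.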

Ridge Regression
Ridge regression is used when multicollinearity is present. According to Miller (2002), a biased estimator $b(d)$ can be obtained using all variables as
$$b(d) = (X'X + dI)^{-1}X'y,$$
where $d$ is a positive scalar. The predictors should be standardized so that each column of $X$ has zero mean and unit sum of squares, which means that $X'X$ is replaced by the correlation matrix. A plot of $b(d)$ against $d$ is called the ridge trace. According to Miller (2002), the justification for ridge regression is the same as for subset selection, namely the bias-variance trade-off: for small $d$ the bias may be small while the reduction in variance may be substantial. This situation typically arises when the matrix $X'X$ is ill-conditioned. Ridge regression can be examined through the singular-value decomposition (s.v.d.) of $X$, which has $n$ rows and $k$ columns:
$$X_{n\times k} = U_{n\times k}\,\Lambda_{k\times k}\,V'_{k\times k},$$
where the columns of $U$ and $V$ are normalized and orthogonal, so that $U'U = I$ and $V'V = I$, and $\Lambda$ is a diagonal matrix whose diagonal elements $\lambda_i$, $i = 1, 2, \ldots, k$, are the singular values. A small singular value contributes a large variance to the least squares estimate, but with the addition of $d$ this contribution can be reduced significantly.
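The effect of $d$ on each singular direction can be made explicit. The following is a sketch of the standard derivation, assuming uncorrelated errors with common variance $\sigma^{2}$ (an assumption not stated above) and writing $u_i$ and $v_i$ for the $i$th columns of $U$ and $V$.

```latex
\begin{align*}
X'X &= V\Lambda U'U\Lambda V' = V\Lambda^{2}V', \\
b(d) &= (V\Lambda^{2}V' + dI)^{-1}V\Lambda U'y
      = V(\Lambda^{2} + dI)^{-1}\Lambda U'y
      = \sum_{i=1}^{k} \frac{\lambda_i}{\lambda_i^{2} + d}\,(u_i'y)\,v_i, \\
\operatorname{Var}\, b(d) &= \sigma^{2}\sum_{i=1}^{k}
      \frac{\lambda_i^{2}}{(\lambda_i^{2} + d)^{2}}\,v_i v_i'.
\end{align*}
```

At $d = 0$ the variance factor is $\sigma^{2}/\lambda_i^{2}$, which is large when $\lambda_i$ is small. For $d > 0$ the factor $\lambda_i^{2}/(\lambda_i^{2} + d)^{2}$ is at most $1/(4d)$, so the contribution of a small singular value stays bounded.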
As in ridge regression, the constant $\beta_0$ can be re-parameterized by standardizing the predictors, so that the solution for $\hat{\beta}_0$ is $\bar{y}$. The LASSO estimate minimizes the residual sum of squares subject to a bound $t$ on the sum of the absolute values of the coefficients,
$$\hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\beta_j\Big)^{2} \quad \text{subject to} \quad \sum_{j=1}^{k}|\beta_j| \le t,$$
and the LASSO problem can also be written in the equivalent Lagrangian form
$$\hat{\beta}^{lasso} = \arg\min_{\beta}\Big\{\frac{1}{2}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\beta_j\Big)^{2} + \lambda\sum_{j=1}^{k}|\beta_j|\Big\}.$$
Compared with the ridge regression problem, the only change is that the $L_2$ penalty is replaced by the $L_1$ LASSO penalty. This penalty makes the solution nonlinear in the $y_i$, so there is no closed-form solution as there is in ridge regression. Computing the LASSO solution is a quadratic programming problem, although efficient algorithms are available that compute the entire path of solutions as $\lambda$ varies at the same computational cost as ridge regression. Because of the nature of the constraint, making $t$ sufficiently small causes some of the coefficients to be exactly zero, so the LASSO performs a kind of continuous subset selection. If $t$ is chosen larger than $t_0 = \sum_{j=1}^{k}|\hat{\beta}_j|$, where $\hat{\beta}_j = \hat{\beta}_j^{ls}$ are the least squares estimates, then the LASSO estimates are the $\hat{\beta}_j$. On the other hand, for $t = t_0/2$, say, the least squares coefficients are shrunk by about 50% on average. Like the subset size in variable subset selection or the penalty parameter in ridge regression, $t$ should be chosen adaptively to minimize an estimate of the expected prediction error.
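To illustrate how the $L_1$ penalty sets some coefficients exactly to zero, the sketch below minimizes the Lagrangian objective (half the residual sum of squares plus $\lambda$ times the sum of absolute coefficients) by coordinate descent with soft thresholding. This is one common algorithm for the problem; the study does not state which solver was used, and the function names and toy data are hypothetical.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly zero."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize 0.5 * RSS + lam * sum(|beta_j|); the intercept is not penalized."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]  # centred predictors
    resid = [yi - y_mean for yi in y]  # residuals for beta = 0
    z = [sum(row[j] ** 2 for row in Xc) for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if z[j] == 0.0:
                continue  # constant column carries no information
            # Correlation of column j with the partial residual that excludes beta_j.
            rho = sum(Xc[i][j] * (resid[i] + Xc[i][j] * beta[j]) for i in range(n))
            new_beta = soft_threshold(rho, lam) / z[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new_beta
    intercept = y_mean - sum(col_means[j] * beta[j] for j in range(p))
    return intercept, beta


# Toy data: y depends strongly on the first column and only weakly on the second.
X = [[-2, 1], [-1, -2], [0, 0], [1, 2], [2, -1]]
y = [5 + 3 * a + 0.1 * b for a, b in X]
intercept, beta = lasso_coordinate_descent(X, y, lam=2.0)
```

With this penalty the weak second coefficient is removed from the model while the first is only shrunk, which is the continuous subset selection behaviour described above.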

Robust Regression
Robust regression involves different kinds of estimation. The most used are M estimators.
Draper and Smith (1998) state that an M-estimate is found by minimizing $\sum_{i=1}^{n}\rho(e_i/s)$, where $e_i = y_i - x_i^{T}\beta$ is the $i$th residual and $s$ is an estimate of scale. Partial differentiation with respect to each of the $p$ parameters results in a system of $p$ equations,
$$\sum_{i=1}^{n} x_{ij}\,\psi\!\left(\frac{y_i - x_i^{T}\beta}{s}\right) = 0, \qquad j = 1, 2, \ldots, p, \tag{4}$$
where $x_{ij}$ is the value of the $j$th independent variable for observation $i$ and $\psi(u) = \rho'(u)$ is called the score function. Defining the weight function as $w(u) = \psi(u)/u$ yields the following matrix form of (4):
$$X^{T}WX\beta = X^{T}WY,$$
where $W$ is the diagonal matrix of weights. This is very similar to the solution of the least squares estimator, but a weight matrix is introduced to reduce the influence of outliers. In contrast to least squares, the M-estimate cannot be calculated from (4) directly from the data, because $W$ depends on the residuals, which in turn depend on the estimate. An initial estimate and iterations are therefore required until $W$ and the M-estimate of $\beta$ converge. The iteratively reweighted least squares (IRLS) procedure is used to obtain the M-estimates of the regression (Stuart, 2011).
For the data analysis, the following phases are performed to obtain an efficient model selection using the different regression analysis techniques.
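The IRLS iteration for M-estimation described under Robust Regression can be sketched for the three weight functions used in this study. The example is a simplified illustration with a single predictor, not the procedure applied to the seaweed data. The tuning constants (1.345 for Huber, 2, 4, 8 for Hampel, 4.685 for bisquare) and the scale estimate (median absolute deviation divided by 0.6745) are commonly used defaults assumed here, since the study does not report them. All function names and data are hypothetical.

```python
from statistics import median


def huber_weight(u, c=1.345):
    a = abs(u)
    return 1.0 if a <= c else c / a


def hampel_weight(u, a=2.0, b=4.0, c=8.0):
    au = abs(u)
    if au <= a:
        return 1.0
    if au <= b:
        return a / au
    if au <= c:
        return a * (c - au) / (au * (c - b))
    return 0.0


def bisquare_weight(u, c=4.685):
    a = abs(u)
    return (1.0 - (a / c) ** 2) ** 2 if a < c else 0.0


def weighted_line_fit(x, y, w):
    """Weighted least squares for y = b0 + b1 * x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def irls_line(x, y, weight_fn, max_iter=100, tol=1e-8):
    """Iteratively reweighted least squares, starting from the OLS fit."""
    b0, b1 = weighted_line_fit(x, y, [1.0] * len(x))
    for _ in range(max_iter):
        r = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
        med = median(r)
        s = median([abs(ri - med) for ri in r]) / 0.6745  # robust scale (MAD)
        if s < 1e-12:
            break  # residuals are essentially zero for most points
        w = [weight_fn(ri / s) for ri in r]
        new_b0, new_b1 = weighted_line_fit(x, y, w)
        converged = abs(new_b0 - b0) < tol and abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


# Toy data near y = 2 + 0.5 x with one gross outlier at the last point.
x = [float(i) for i in range(10)]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05, 0.1, -0.1]
y = [2 + 0.5 * xi + e for xi, e in zip(x, noise)]
y[9] += 15.0

ols_slope = weighted_line_fit(x, y, [1.0] * len(x))[1]
robust_slopes = {
    "huber": irls_line(x, y, huber_weight)[1],
    "hampel": irls_line(x, y, hampel_weight)[1],
    "bisquare": irls_line(x, y, bisquare_weight)[1],
}
```

The Huber weights only bound the influence of the outlier, while the Hampel and bisquare weights can reduce it to zero, which is why they are described as redescending estimators.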

Phase 1-All Possible Models
According to Khuneswari et al. (2008), obtaining all possible models is a prerequisite for determining the best model, and their number can be derived using the formula
$$N = \sum_{j=1}^{k} j\binom{k}{j},$$
where $N$ is the number of all possible models, $k$ is the total number of independent variables, and $j = 1, 2, \ldots, k$. For the six independent variables in this study, $k = 6$ gives $N = 192$.
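The model count can be checked with a short calculation. The sketch assumes the count takes the form $N = \sum_{j=1}^{k} j\binom{k}{j}$, which reproduces the 192 models reported in this study for six independent variables; the function name is hypothetical.

```python
from math import comb


def count_all_possible_models(k):
    """Number of all possible models for k independent variables, assuming N = sum of j * C(k, j)."""
    return sum(j * comb(k, j) for j in range(1, k + 1))


n_models = count_all_possible_models(6)  # 192 for the six variables in this study
```

The count grows quickly with $k$, which is why automatic variable selection becomes attractive for datasets with many predictors.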
OLS, LASSO, and ridge regression are applied to all possible models, after which the procedure moves to the next phase.
There were no missing values in the dataset. For all six independent variables, 5% of the data was randomly selected and reserved. The Mean Absolute Percentage Error (MAPE) was calculated using this reserved dataset, to be used in Phase 3.

Phase 2-Selected Models
To identify the best model from the selected models, two tests are performed for OLS: the multicollinearity test and the coefficient test. The coefficient test uses a 5% significance level. After these tests have been carried out, a list of selected models is obtained for the OLS regression analysis, while only significant variables are retained in the selected models for ridge and LASSO regression.

Multicollinearity test
Multicollinearity occurs when the model contains collinear predictors. In this case, the results may not be valid (Joseph, 2009). It is therefore important to handle this problem before proceeding with the data analysis.

Coefficient test
Ramanathan (2002) defined the coefficient test as a test applied to each independent variable to check whether its coefficient differs significantly from zero. The hypotheses are $H_0: \beta_j = 0$ and $H_1: \beta_j \neq 0$, where $\beta_j$ denotes the coefficient of the $j$th variable in the model for $j = 1, 2, \ldots, k$. The t test is used at the 5% level of significance.

Phase 3-The Best Model
After the selected models have been obtained, the best model is chosen from among them. Ali et al. (2017) identified eight selection criteria (8SC) for this purpose, and these criteria are used in this study for the best model selection. The formulas for 8SC are listed in Table 2, where $n$ is the total number of observations, $k + 1$ is the number of estimated parameters (including the constant), and SSE is the sum of squared errors.
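As an illustration of how the eight criteria rank competing models, the sketch below assumes the forms commonly given for them (AIC, FPE, GCV, HQ, RICE, SCHWARZ, SGMASQ, and SHIBATA, as in Ramanathan, 2002), each written as SSE/n multiplied by a penalty that increases with $k + 1$. Table 2 remains the authoritative statement of the formulas used in this study. The function name, the candidate models, and their SSE values are hypothetical.

```python
import math


def eight_selection_criteria(sse, n, k):
    """Return the eight criteria for a model with k predictors plus a constant.

    Assumed forms (not copied from Table 2): SSE/n times a penalty in p = k + 1.
    """
    p = k + 1
    if n <= 2 * p:
        raise ValueError("n must exceed 2(k + 1) for RICE to be defined")
    base = sse / n
    return {
        "AIC": base * math.exp(2 * p / n),
        "FPE": base * (n + p) / (n - p),
        "GCV": base * (1 - p / n) ** -2,
        "HQ": base * math.log(n) ** (2 * p / n),
        "RICE": base / (1 - 2 * p / n),
        "SCHWARZ": base * n ** (p / n),
        "SGMASQ": base / (1 - p / n),
        "SHIBATA": base * (n + 2 * p) / n,
    }


# Hypothetical (SSE, k) pairs for two candidate models fitted to n = 100 observations.
candidates = {"model_a": (50.0, 2), "model_b": (49.5, 5)}
scores = {
    name: eight_selection_criteria(sse, 100, k)
    for name, (sse, k) in candidates.items()
}
# For each criterion, the preferred model is the one with the smallest value.
best_by_criterion = {
    crit: min(scores, key=lambda m: scores[m][crit]) for crit in scores["model_a"]
}
```

In this toy comparison the larger model lowers SSE only slightly, so every criterion prefers the smaller model once the penalty for extra parameters is applied.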

Phase 4-Goodness of Fit
Gujarati (2004) defined assumptions for the least squares estimates, such as no perfect multicollinearity and correct specification of the model. Ramanathan (2002) defined the goodness-of-fit test as ensuring that the model fits the dataset well. In this phase, the reserved 5% of the dataset is used to calculate model efficiency through the MAPE value. The residuals are then calculated as the difference between the actual and predicted values using the best model obtained in Phase 3, and they are examined using randomness and normality tests. A nonparametric test, the run test, is used to check the random pattern of the observations. The normality assumption is checked with the Shapiro-Wilk and Kolmogorov-Smirnov tests. Afterwards, a scatter plot, histogram, and box plot of the residuals are obtained as supporting evidence.
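The MAPE calculation on the reserved data can be sketched as follows, using the usual definition of the mean of absolute percentage errors. The function name and the actual and predicted values are hypothetical and are not taken from the seaweed dataset.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical hold-out values of the dependent variable and model predictions.
actual = [100.0, 200.0, 50.0, 80.0]
predicted = [90.0, 220.0, 55.0, 76.0]
holdout_mape = mape(actual, predicted)  # 8.75
```

Because each error is divided by the actual value, MAPE cannot be computed for observations where the actual value is zero, which the function checks explicitly.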

Data Collection and Procedure
The data used in this study were collected over four days from a seaweed dryer, a V-groove hybrid solar dryer. Seven variables are used: Moisture Ratio content (%) as the dependent variable and six independent variables, namely Ambient Temperature ($X_1$), Chamber Temperature ($X_2$), Collector Temperature ($X_3$), Chamber Relative Humidity ($X_4$), Ambient Relative Humidity ($X_5$), and Solar Radiation ($X_6$). See Table 3 for the six variables taken in the analysis.
After grouping, 145 of the 192 models remain. These final models have no multicollinearity problem, and all the variables they include are significant. The procedure is applied to all possible models to obtain the selected models. The best model is selected using the eight selection criteria.
For OLS, the minimum value of the selection criteria was found for model M183.36.3, and its MAPE value, obtained using the reserved data, was 9.781944.
Stuart (2011) explained that least squares estimates do not perform well in the presence of outliers or when the normality assumptions are not met. Moreover, ridge regression performs better than OLS when the model contains highly collinear predictors. Estimates derived from ridge regression are biased, but the mean square error of the ridge estimator is smaller than that of the OLS estimator (Hoerl and Kennard, 1970). Since the data used in this study also suffer from multicollinearity, ridge regression was applied next to all possible models for the purpose of comparison, and the selected models were obtained.
Ridge regression is a type of regularized regression that requires at least two variables in the model, so the analysis starts with model M5.0.0, which contains two variables (Hocking, 1976). Ridge regression can shrink the regression coefficients toward zero but cannot make them exactly zero, because it uses the $L_2$ penalty term in the penalized least squares (Ogutu et al., 2012). For the best model selection, 8SC was used, and the best model is the one with the minimum value among all models. The best model and its coefficients were obtained. OLS and ridge regression cannot select variables, since they keep all variables in the model, so a modified LASSO was applied to handle outliers in the dataset and to select the crucial variables among all variables and their interaction terms. The modified LASSO was performed using three robust M estimators: Huber's M estimator, Hampel's M estimator, and the Bisquare M estimator.
For the sparse regression analysis, LASSO was performed on all possible models and the results were analyzed. A list of 183 models was obtained by grouping identical models, as in OLS. Robust methods were then applied to these 183 models to obtain the selected models for the modified LASSO. With Huber's M estimation, the coefficient test was performed at the 0.05 level of significance to remove the insignificant variables from the models, after which 151 models remained as selected models. The eight selection criteria were applied to these 151 models and the results were observed. The best model M192. 27
From Figures 1-5, the outlier patterns can be observed. OLS and ridge regression show more outliers than the proposed technique. Although the OLS model is selected after the multicollinearity and coefficient tests, it still contains more outliers. This is because OLS cannot deal with outliers, and ridge regression deals only with multicollinearity, so both show more outliers than the proposed technique. In real-life datasets, removing outliers is not always a good option. The observations outside the sigma limits in the standardized residual plots are reported in Table 5. Although ridge regression and OLS have fewer outliers under the three-sigma limits, the SSE value for ridge regression was very high, so ridge regression cannot be considered a good technique. LASSO with Bisquare M has some observations outside the three-sigma limits, but because robust regression is used, the outliers do not affect its coefficients or the model.
The advantage of the proposed technique is that LASSO retains only significant variables in the model, while the robust estimators deal with outliers. The proposed model can therefore handle both multicollinearity and outliers in real-life datasets.

Conclusion
Solar driers are important in agriculture and aquaculture. Many factors affect the efficiency of a solar drier, so it is important to identify them. Using a large dataset, this study examined the variables affecting moisture ratio removal together with their interaction effects. A hybrid of LASSO and robust regression is proposed for identifying the significant factors, including the interaction terms. The results show that the combination of LASSO and Bisquare M provides better results than the other methods. The value of SSE for the hybrid of LASSO and Bisquare M is found to be 8.9689, which is less than 10%. The resulting selected model, with its significant factors, can therefore be used to predict the moisture ratio removal of seaweed.