Penalized Splines Fitting for a Poisson Response Including Outliers

There have been various studies in the literature investigating the relationship between a count response and several covariates. Most researchers model count variables with traditional methods, i.e. generalized linear models (GLM). However, GLM is limited when dealing with outliers and nonlinear relationships. Generalized additive models (GAM) extend GLM by replacing the linear predictor with a sum of smooth functions of the covariates. Our aim is to propose a flexible extension of GLM and to demonstrate the usefulness and performance of GAMs for the analysis of Poisson data sets containing outliers in the response variable, through extensive Monte Carlo simulations and three applications.


Introduction
Count data are non-negative integers recording the number of events within a period of time. There have been various studies in the literature investigating the relationship between a count response and several covariates using traditional methods such as generalized linear models. The ordinary regression model extends to the family of generalized linear models by relating the conditional mean of the counts to a set of variables via a link function. This method, Poisson regression, often leads to biased estimates of the parameters because of its defining property that the mean equals the variance. Negative binomial regression is commonly used to handle overdispersion and underdispersion, but it is not always the best choice for the data. Camarda et al. (2016) modelled the conditional mean in Poisson regression as a sum of smooth components and demonstrated the versatility of the approach in two applications. Currie et al. (2004) showed how p-splines can be used to model two-dimensional Poisson data, introducing a model that describes the expected values as a sum of components. Guisan et al. (2002) published a review paper on the application of GLM and GAM in ecological modelling. Sapra (2012, 2013) studied GAMs for several economic and business datasets, comparing the models in terms of AIC, deviance and adjusted R².
In the literature, different approaches to estimating GAMs have been proposed. Here, the smoothing spline basis approach implemented in the mgcv package is adopted (Wood 2018). In this approach, a basis is chosen for each smooth function together with a wiggliness measure, and model estimation involves estimating the smoothing parameters as well as the model coefficients by maximizing a penalized likelihood. The core aim of the study is to build a flexible extension of generalized linear models and to demonstrate the usefulness of GAMs for the analysis of count data. This paper is organized as follows. First we briefly introduce GLM and GAM. The penalized regression splines used for the estimation of GAMs are presented next. The first simulation study, in which the response variable suffers from outliers and is regressed on one covariate, compares a GLM with GAMs fitted by three penalized spline bases (cubic regression splines, p-splines, thin-plate splines) for a Poisson distributed response. The second simulation study is organized in the same manner, except that the response is regressed on two covariates. The application part illustrates the comparison of GLM and GAMs on commonly known datasets. The last section provides some concluding remarks.

Generalized Linear Model and Generalized Additive Models
GAM is an extension of GLM in which the linear predictor is replaced by a sum of smooth functions of the covariates (Hastie et al. 1990, 2001). GLM was introduced to relax the assumptions of normality and homoscedasticity in ordinary linear regression. In GLM, the distribution of the response variable has to belong to the exponential family, which includes the normal, binomial, exponential, gamma and Poisson distributions (Nelder et al. 1972; Montgomery et al. 2012). Moreover, the shared structure of the exponential family distributions allows the same likelihood-based technique to be used to estimate the model coefficients. The general form of GLM is $g(\mu_i) = \eta_i = x_i^{T}\beta$, where $x_i$ is the $i$th row of the data matrix $X$ and $\beta$ is the unknown parameter vector. The expected response is $\mu_i = E(y_i)$, where $y_i \sim$ Exponential family$(\theta_i, \phi)$ and $g$ is a known monotonic link function connecting the systematic component $\eta_i$ (the linear predictor) to the random component (the response variable) of the model (Keele 2008). When the response variable follows a normal distribution and the identity link is used, the generalized linear model reduces to ordinary linear regression. When the linear predictor $\eta_i$ in the GLM is replaced by a sum of smooth functions of the explanatory variables, the model is called a generalized additive model (GAM); thus GAMs are regarded as extensions of GLMs. In a GAM, the linear predictor becomes $g(\mu_i) = \sum_{j} f_j(x_{ji})$, where $x_{ji}$ is a covariate. The $f_j$'s in generalized additive models are smooth functions, which can be estimated by kernels, local regression (loess) or smoothing splines. In this study we use cubic regression splines, p-splines and thin-plate regression splines. The model is completed by a precise characterisation of what is meant by a smooth function, in the form of a wiggliness (spline) penalty $\int f''(x)^2\,dx$.
Each $f_j$ is a linear basis expansion, $f_j(x) = \sum_{k} \gamma_{jk} b_{jk}(x)$, where the $\gamma_{jk}$ are coefficients to be estimated and the $b_{jk}(x)$ are known basis functions. The shape of the predictor functions is determined by the data; no specific parametric form is assumed for them. With such bases the penalties become quadratic forms, $\gamma^{T} S_j \gamma$, where the $S_j$ are matrices of known coefficients.
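To make the basis-and-penalty idea concrete, the following pure-Python sketch builds a cubic truncated-power spline basis and a penalty matrix $S$ so that the wiggliness penalty is the quadratic form $\gamma^{T} S \gamma$. This is an illustrative construction only; the fits in this paper use the mgcv package in R, whose basis and penalty constructions differ, and the function names and the choice of penalizing only the knot coefficients are assumptions of this sketch.

```python
# Minimal penalized-spline building blocks (illustrative, not mgcv's bases).

def cubic_tp_basis(x, knots):
    """Row of cubic truncated-power basis functions evaluated at x:
    1, x, x^2, x^3, (x - kappa_k)_+^3 for each knot kappa_k."""
    row = [1.0, x, x ** 2, x ** 3]
    row += [max(0.0, x - k) ** 3 for k in knots]
    return row

def penalty_matrix(n_knots):
    """Penalty matrix S that penalizes only the truncated-power (knot)
    coefficients, so the penalty is the quadratic form gamma^T S gamma."""
    p = 4 + n_knots
    S = [[0.0] * p for _ in range(p)]
    for j in range(4, p):
        S[j][j] = 1.0
    return S

def wiggliness(gamma, S):
    """Evaluate the quadratic form gamma^T S gamma."""
    p = len(S)
    return sum(gamma[i] * S[i][j] * gamma[j]
               for i in range(p) for j in range(p))
```

Unpenalized polynomial coefficients and penalized knot coefficients together give a smoother whose flexibility is controlled by a single smoothing parameter multiplying $\gamma^{T} S \gamma$.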
Specifically, for a dependent count variable, let $y_i$ represent the number of independent events that occur during a fixed time period for observation $i$, with $y_i \sim \text{Poisson}(\mu_i)$, where $\mu_i$ is both the mean and the variance of the distribution. This parameter is given by $\mu_i = \exp\big(\sum_{j=1}^{q} f_j(x_{ji})\big)$, where the $x_{ji}$ are the explanatory variables, the $f_j$ are the smooth terms for $j = 1, \ldots, q$, and $\gamma$ is the vector of coefficients. Note that this expression is the expected response.
To penalize overfitting, estimation is carried out by penalized maximum likelihood, $\ell_p(\gamma) = \ell(\gamma) - \tfrac{1}{2}\sum_{j} \lambda_j \gamma^{T} S_j \gamma$. Assuming the smoothing parameters $\lambda_j$ are known, the estimate $\hat{\gamma}$ can be found by maximizing this penalized log-likelihood.
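As a concrete illustration of this objective, the sketch below maximizes a penalized Poisson log-likelihood by Newton's method in pure Python. It is a didactic stand-in for mgcv's penalized iterative fitting, not the package's actual algorithm; `solve` and `fit_penalized_poisson` are names invented for this sketch.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting
    (adequate for the small dense systems arising here)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_penalized_poisson(B, y, S, lam, iters=50):
    """Maximize l(gamma) - (lam/2) * gamma^T S gamma by Newton's method
    for the Poisson model with log link, log mu_i = B_i . gamma."""
    n, p = len(y), len(B[0])
    g = [0.0] * p
    for _ in range(iters):
        eta = [sum(B[i][j] * g[j] for j in range(p)) for i in range(n)]
        mu = [math.exp(e) for e in eta]
        # gradient of the penalized log-likelihood: B^T (y - mu) - lam * S g
        grad = [sum(B[i][j] * (y[i] - mu[i]) for i in range(n))
                - lam * sum(S[j][k] * g[k] for k in range(p))
                for j in range(p)]
        # negative Hessian: B^T diag(mu) B + lam * S
        H = [[sum(B[i][j] * mu[i] * B[i][k] for i in range(n)) + lam * S[j][k]
              for k in range(p)] for j in range(p)]
        g = [gj + dj for gj, dj in zip(g, solve(H, grad))]
    return g
```

With `lam = 0` the fit reduces to ordinary maximum likelihood; as `lam` grows, the penalized coefficients are shrunk toward zero, which is exactly the overfitting control described above.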
In GAM, while the standard linear model's assumption of a linear dependence of $y$ on $x$ is relaxed, the additivity assumption still holds, and it is this additivity that makes GAM easier to interpret than algorithms such as support vector machines (SVM) or neural networks (Keele 2008). The main difference between GLM and GAM is that in GAM the linear predictor is a sum of smooth functions, not restricted to being linear. GAM can handle nonlinear, linear and nonmonotonic relationships between the response and the explanatory variables. Also, GLM emphasizes estimation and inference for the parameters of the model, while GAMs focus on exploring the data nonparametrically (Guisan 2002). Wood (2003, 2006) employs the penalized regression spline approach to fit the model. By default, the degree of smoothness of the fit is chosen internally by the algorithm. Automatic selection of the smoothing parameter has the advantage of avoiding the subjectivity and effort of choosing it by hand; however, it can fail to find the best degree of smoothness, and human intervention may sometimes be needed (Faraway 2006).

Selection of GAM and model parameter estimation
Model evaluation plays a fundamental role in regression analysis; models can be compared to select the better one. In linear regression, the mean square error (MSE) is the building block of most model evaluation techniques; it measures how far the model's estimates are from the actual observations. In GLM and GAM, a quantity is needed that is equivalent in importance and interpretation to the residual sum of squares in ordinary linear modelling (McCullagh et al. 1989).
Just as MSE is minimized in least squares fits, in models fitted by maximum likelihood estimation (MLE) the quantity to be minimized is the deviance: maximizing the likelihood corresponds to minimizing the deviance of the model (Wood 2006). The model deviance is defined as twice the difference in log-likelihood between the saturated model and the fitted model (Montgomery et al. 2012). The deviance can be regarded as the lack of fit between the model and the data points and is used for model adequacy checking; the smaller the deviance, the better the model. Both the Akaike Information Criterion (AIC) and the deviance can be used with GLM and GAM (Sakamoto et al. 1988; Guisan et al. 2002). When a large number of knots is used to represent the smoothers and the model parameters are estimated by maximum likelihood, there is a risk of overfitting. This is why penalized likelihood maximization is preferred over ordinary likelihood maximization for estimating GAMs (Wood 2006).
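For the Poisson case used throughout this paper, these quantities can be computed directly. The Python helpers below (hypothetical names, standard formulas) mirror the deviance, AIC and explained-deviance measures reported in the tables.

```python
import math

def poisson_deviance(y, mu):
    """Twice the log-likelihood gap between the saturated model (mu_i = y_i)
    and the fitted model; the y_i * log(y_i / mu_i) term is 0 when y_i = 0."""
    d = 0.0
    for yi, mi in zip(y, mu):
        d += (yi * math.log(yi / mi) if yi > 0 else 0.0) - (yi - mi)
    return 2.0 * d

def poisson_aic(y, mu, edf):
    """AIC = -2 * log-likelihood + 2 * (effective) degrees of freedom."""
    ll = sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
             for yi, mi in zip(y, mu))
    return -2.0 * ll + 2.0 * edf

def explained_deviance(y, mu):
    """Proportion of the null-model (intercept-only) deviance explained."""
    ybar = sum(y) / len(y)
    return 1.0 - poisson_deviance(y, mu) / poisson_deviance(y, [ybar] * len(y))
```

A perfect fit has zero deviance and an explained deviance of 1; an intercept-only fit has an explained deviance of 0.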

Monte Carlo Simulations
The simulations are performed to explore the performance of the proposed GLM and GAM fits in a variety of situations. Furthermore, the effect on the fitted models of different proportions of outliers in a Poisson response variable is addressed.
Data are generated with different mean functions and with different proportions of outliers (or none), using the procedures discussed in the following sections. The generated data are used to evaluate the performance of the three smoothing splines (cubic regression splines, p-splines and thin-plate splines) and of the generalized linear model. The means of the AIC, residual deviance and explained deviance obtained for each model are reported as goodness-of-fit measures. For the analysis and model building, the mgcv package in the R statistical software is used (R Development Core Team 2018).

Poisson response with one covariate
The response variable was simulated from $y_i \mid x_i \sim \text{Poisson}(\mu(x_i))$, where $\mu(x) = g^{-1}(h(x))$ and $h(x) = 4\cos\big(2\pi(1 - x)^2\big)$. Here $g$ is the log link. The covariate follows a uniform distribution, $x \sim U(0, 1)$. Randomly selected responses were multiplied by $u_1^{u_2}$, where $u_1 \sim U(2, 5)$ and $u_2 \sim U(-1, 1)$, and the result was rounded to the nearest integer. For a specified outlier proportion $p$ (0, 0.1, 0.2), a total of $p \times 100\%$ of the simulated responses was randomly selected to be changed to outliers. Each scenario was repeated 500 times for all combinations of $n$ and $p$. Each of the three smoothing splines was used to fit a GAM to every sample.
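The data-generating mechanism of this scenario can be sketched as follows in pure Python, under the reading of the text above that the outlier factor is $u_1^{u_2}$; the function names and the choice of Poisson sampler are assumptions of this sketch, not the authors' code.

```python
import math
import random

def draw_poisson(mu, rng):
    """Knuth's multiplicative Poisson sampler (fine for moderate means)."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_poisson_outliers(n, p_out, seed=1):
    """One-covariate scenario: y_i ~ Poisson(exp(h(x_i))) with
    h(x) = 4*cos(2*pi*(1 - x)^2) and x ~ U(0, 1); then a proportion p_out
    of the responses is multiplied by u1**u2 (u1 ~ U(2,5), u2 ~ U(-1,1))
    and rounded, to create outliers."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    mu = [math.exp(4.0 * math.cos(2.0 * math.pi * (1.0 - xi) ** 2)) for xi in x]
    y = [draw_poisson(m, rng) for m in mu]
    for i in rng.sample(range(n), int(round(p_out * n))):
        y[i] = int(round(y[i] * rng.uniform(2.0, 5.0) ** rng.uniform(-1.0, 1.0)))
    return x, y
```

Because the multiplier $u_1^{u_2}$ lies between $1/5$ and $5$, contaminated counts are inflated or deflated but remain non-negative integers.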

Poisson response with two additive functions
To illustrate the performance of the proposed methods, two covariates $(x_{1i}, x_{2i})$, $i = 1, \ldots, n$, are considered, and the response variable is generated from $y_i \mid (x_{1i}, x_{2i}) \sim \text{Poisson}\big(\exp(f_1(x_{1i}) + f_2(x_{2i}))\big)$, where $x_{ki} \sim U(0, 1)$, $k = 1, 2$, and $f_1(x) = \exp(\sin(\ldots))$. In each case, sample sizes of 25, 50, 100, 250 and 500 were simulated, with 500 repetitions. Each of the three smoothing splines was used to fit a GAM to every sample.
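The two-covariate generator can be sketched generically in pure Python; since the exact test functions of this scenario are not reproduced here, the smooths $f_1$ and $f_2$ are supplied by the caller, and the function names are assumptions of this sketch.

```python
import math
import random

def draw_poisson(mu, rng):
    """Knuth's multiplicative Poisson sampler (fine for moderate means)."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_poisson_2cov(n, f1, f2, seed=1):
    """Two-covariate scenario: y_i ~ Poisson(exp(f1(x1_i) + f2(x2_i)))
    with x1, x2 ~ U(0, 1); f1 and f2 are caller-supplied smooth functions."""
    rng = random.Random(seed)
    x1 = [rng.random() for _ in range(n)]
    x2 = [rng.random() for _ in range(n)]
    y = [draw_poisson(math.exp(f1(a) + f2(b)), rng) for a, b in zip(x1, x2)]
    return x1, x2, y
```

For example, `simulate_poisson_2cov(100, math.sin, math.cos)` draws counts whose log mean is the additive surface $\sin(x_1) + \cos(x_2)$.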

Results
Means of the deviances and AIC of the models using GLM and GAMs fitted with cubic regression spline (cr), p-spline (ps) and thin-plate spline (tp) bases for one covariate are presented in Table 1, together with the percentage of explained deviance for each model. The results in Table 1 show that increasing the proportion of outliers in the response variable inflates the mean deviances and AIC of all the models. To illustrate, consider one sample size (e.g. $n = 25$) and compare the outcomes for different numbers of outliers. When no outliers are included, the mean deviances of GLM and of the cubic, p-spline and thin-plate spline GAMs are 367.13, 13.06, 13.28 and 12.74, respectively; when outliers are introduced into the response ($p = 0.1$), the respective mean deviances become 461.37, 49.95, 47.11 and 47.93. For $p = 0.2$ as well, the mean deviances increase in all models, and the results obtained for $n = 50$, 100, 250 and 500 support this argument. This comparison indicates that GAMs fitted by p-splines outperformed GLM.
Results from Table 1 demonstrate that for all sample sizes $n$, the p-spline-based models produced a smaller mean deviance than the others at $p = 0.1$ and $0.2$. In addition, the models with the smallest mean AIC are produced by p-splines for all combinations of outlier proportion $p$ and sample size, except for $n = 25$ and $p = 0.0$. Furthermore, the mean proportion of explained deviance achieved by each method shows that GAMs fitted by p-splines outperformed the others.
Results from scenario 2, given in Table 2, compare GLM and GAMs fitted by cubic splines, p-splines and thin-plate splines for two covariates. For $n = 25$, the cubic-spline-based models produced a smaller mean deviance than the others at $p = 0.2$. For sample sizes $n = 50$, 100, 250 and 500, however, GAMs fitted by thin-plate splines performed better for all outlier proportions $p$. In addition, the models with the smallest mean AIC are produced by thin-plate splines for all combinations of outlier proportion $p$ and sample size, except for $n = 25$. Furthermore, the mean proportion of explained deviance achieved by each method provides further evidence that GAMs fitted by thin-plate splines outperformed the others: compared with all models, the GAM using thin-plate regression splines produced the highest explained deviance. This comparison also indicates that GLM produced a poor goodness of fit (Dev.Exp. % 0.01–0.08), while GAMs produced more desirable results when the response contains some proportion of outliers and is regressed on two covariates.
In contrast, results from the univariate case do not show the same trend as the two-covariate case; rather, GAMs fitted by p-splines outperform both the cubic and thin-plate splines regardless of sample size and outlier proportion, except for $n = 25$ and $p = 0.0$. The percentages of explained deviance are almost equal across the spline-based models.

Comparison on Data Sets
In this section, we apply the fitting procedures to several datasets from the literature. Each dataset has a Poisson response with or without outliers. The datasets are described in Table 3, and Table 4 presents the results of the GLM and GAM fits. To assess goodness of fit, the deviance and AIC are evaluated for each model and summarized in the table.
The ships data concern a type of damage caused by waves to the forward section of cargo-carrying vessels (McCullagh et al. 1989). The variables are incidents, service and year, and the data suffer from outliers. To investigate how the expected number of incidents per aggregate months of service depends on year and service, GLM and GAMs fitted by cubic splines, p-splines and thin-plate splines are compared. Notice that the AIC (447.75) and deviance (343.88) obtained by GLM are considerably larger than those of any of the GAMs. Compared with all models, the GAM using cubic regression splines produced the highest explained deviance, 92.9%.
The frisk data come from Gelman and Hill's (2006) study of police stops in New York City. The data give counts of police stops together with past arrests, and also suffer from outliers. The model obtained from GLM produced high AIC and deviance values (64820.00 and 63109.70), whereas the GAM models performed better. Among the GAMs, the p-spline model produced the smallest AIC (38088.04) and the highest explained deviance (70%).
Another dataset comprises the reported cases of AIDS diagnosed between 1983 and 1992 (De Angelis et al. 1994). The variables are the time delay in reporting the diagnosis and the number of AIDS cases, and the response suffers from outliers. The GLM fits these data poorly, whereas the GAM fitted by p-splines produced the smallest AIC (4834.453) and the highest explained deviance (57.6%).

Concluding Remarks

In this paper, we compared the performance of GLM and GAM for a Poisson response variable with and without outliers, regressed on one covariate and on two covariates. For the GAM fits, three penalized regression spline smoothers were evaluated: cubic regression splines, p-splines and thin-plate splines. Three common datasets were also used to compare the performance of GLM and GAM for a Poisson response containing outliers. Using these datasets, GLM was compared with GAMs employing the penalized smoothing spline approach. Our simulation studies and the illustrations using the three datasets from the literature give parallel results: GAMs are more resistant to outliers in a response variable from the Poisson family.