Bootstrap Power of Time Series Goodness-of-Fit Tests

In this article, we examine the power of various versions of the Box-Pierce statistic and of the Cramer-von Mises test. An extensive simulation study is conducted to compare the power of these tests. Algorithms are provided for the power calculations, and the semi-parametric bootstrap methods used for time series are also compared. The results show that the Box-Pierce statistic and its variants have good power against linear time series models but poor power against non-linear models, while the situation is reversed for the Cramer-von Mises test. MSC: 62M10, 91B84


Introduction
Time series model building is both a science and an art. It is generally considered a three-stage iterative procedure consisting of identification, estimation and diagnostic checking (Box and Jenkins, 2008). Diagnostic checking is an important stage, and the residuals obtained by fitting the identified model play an important role in model criticism. The Box and Pierce (1970) test and its several variants are perhaps the most commonly used portmanteau tests (Mainassara et al., 2009). These tests are capable of performing an overall test on an entire set of, say, the first m autocorrelations, assuming that the null model, i.e. the model defined under the null hypothesis, is correct. Moreover, the choice of m strongly affects both the adequacy of the asymptotic distribution and the power of these tests.
In this paper, we numerically study the power of some of the popular time series goodness-of-fit tests. Escanciano (2006) studied the power of various goodness-of-fit tests under the fixed design wild bootstrap. Horowitz et al. (2006) compared the performance of the Box and Pierce (1970) test with some other tests under double block bootstrapping.
The novelty of our study is that we study the size of the tests under the various semi-parametric bootstrap designs described in Section 3.1. We compare the power of these tests with that of the Cramer-von Mises (CvM) statistic (Escanciano, 2007) against various linear and non-linear alternative models. To the best of our knowledge, these tests have not been studied under these settings in the literature.
Our results show that Box-Pierce type tests do well against linear alternatives but fail to perform against non-linear alternatives, while the situation is reversed for the CvM statistic due to Escanciano (2007), i.e. the CvM statistic does well against various non-linear alternatives but much less well against various linear alternatives. Moreover, the dynamic bootstrap methods show better performance than the fixed design bootstrap in our examples. We have not found any clear advantage of using wild residual bootstrapping in this scenario.
The remainder of the paper is organized as follows. In the next section, a review of the literature on available diagnostic tests is given. Section 3 describes the different bootstrap methods in the time series context. Section 4 gives the estimation procedure and the algorithms for the Monte Carlo simulations used to compute the power of the tests. Finally, Section 5 presents the simulation results and their discussion.
In practice, there are many possible linear and non-linear models for a problem under study, e.g. autoregressive, moving average, mixed ARMA and threshold autoregressive models. Box and Jenkins (2008) have described time series model building as a three-stage iterative procedure that consists of identification, estimation and validation.
Identification of the model is partly science and partly art. There is no exact way of identifying the underlying model, though there are some tools, for example the autocorrelation and partial autocorrelation plots, to identify the general class of the underlying model; see Box and Jenkins (2008, p. 196). Importantly, at the identification stage, especially when dealing with complex situations, we identify a class of models that will later be efficiently fitted and then go through the diagnostic checking phase (Box and Jenkins, 2008).
There are rigorous ways to estimate the parameters of autoregressive models, such as the method of least squares and the Yule-Walker estimates. Moving average models can be estimated through the innovations method, see e.g. Brockwell and Davis (1991). The estimates of moving average models and mixed models can also be obtained graphically or through iterative estimation procedures such as non-linear minimization (see e.g. Box and Jenkins, 2008). Time series models should be able to describe the dependence among the observations, see e.g. Li (2004). It is well established that, in time series model criticism, the residuals obtained from fitting a potential model to the observed time series play a vital role and can be used to detect departures from the underlying assumptions (Box and Jenkins, 2008; Li, 2004).
In particular, if the model is a good fit to the observed series, then the residuals should behave somewhat like a white noise process. So, taking into account the effect of estimation, the residuals obtained from a good fit should be approximately uncorrelated. When looking at the significance of residual autocorrelations, one approach is to test the significance of each individual residual autocorrelation, which is quite cumbersome. Another approach is to use a portmanteau test capable of testing the significance of the first, say m, residual autocorrelations (Box and Jenkins, 2008; Li, 2004), an approach we now describe.

Diagnostic Tests
The third stage, diagnostic checking (Box and Jenkins, 2008), provides the practitioner with an opportunity to test the model before using it for forecasting. This stage not only checks the fitted model for inadequacies but can also suggest improvements to the fitted model in the next iteration of the model building procedure. In this section, we review the literature on the available diagnostic tests for fitted time series models.
In a time series context, if the fitted model is good, then it should be able to explain the dependence pattern among successive observations. In other words, all the dependence in terms of autocorrelations and partial autocorrelations of the data generating process (DGP) should be explained by the fitted model, so there should be no significant autocorrelation or partial autocorrelation in successive terms of the residuals.
In practice, the most popular way of diagnostic checking a time series model is the portmanteau test, which tests whether any of the first m autocorrelations of a time series are significantly different from zero. This type of test was first suggested by Box and Pierce (1970), who studied the distribution of residual autocorrelations in ARIMA processes. Based on the autocorrelations of the residuals obtained by fitting an ARMA(p, q) model to $y_t$, they suggested the following portmanteau test

$$Q_m = n \sum_{k=1}^{m} \hat{r}_k^2, \qquad (1)$$

where n is the sample size and $\hat{r}_k$ is the residual autocorrelation at lag k. They suggested that $Q_m$ is approximately distributed as $\chi^2_{m-p-q}$ for moderate values of m when the fitted model is adequate, under conditions on the weights of the MA representation of the process.

Since the Box and Pierce (1970) paper, the portmanteau test has become a vital part of time series diagnostic checking. Several modifications and versions of the Box and Pierce (1970) test have been suggested in the literature, see e.g. Ljung and Box (1978), McLeod and Li (1983), Monti (1994), Katayama (2008), Katayama (2009). These tests are capable of testing the significance of the autocorrelations (partial autocorrelations) up to a finite number of lags. Chand et al. (2012) used the portmanteau tests to criticize fitted models.
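As a minimal illustration of the Box-Pierce statistic, the following Python sketch computes n times the sum of the first m squared residual autocorrelations. The function names are hypothetical, and the autocorrelations are computed without mean-centring, which is an assumption about the definition rather than something stated in the text.

```python
def residual_autocorrelations(resid, m):
    """Return [r_1, ..., r_m], the lag-k autocorrelations of the residuals.

    Assumption: the residuals are treated as mean zero, so no centring is done.
    """
    n = len(resid)
    c0 = sum(e * e for e in resid)
    return [
        sum(resid[t] * resid[t - k] for t in range(k, n)) / c0
        for k in range(1, m + 1)
    ]


def box_pierce(resid, m):
    """Box-Pierce statistic: n times the sum of the first m squared r_k."""
    n = len(resid)
    return n * sum(r * r for r in residual_autocorrelations(resid, m))
```

In an asymptotic test, the value returned by `box_pierce` would be compared with an upper quantile of the chi-squared distribution with m - p - q degrees of freedom; in the bootstrap setting studied here it is compared with a bootstrap percentile instead.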
In the discussion of Prothero and Wallis (1976), Chatfield mentioned the poor power properties of $Q_m$ and recommended focusing on residual autocorrelations at the first few lags and at seasonal lags. Similar suggestions were also made by Davies et al. (1977). In the same discussion of the Prothero and Wallis paper, Chatfield and Newbold also pointed out the poor approximation of the finite-sample distribution of $Q_m$. Prothero and Wallis (1976), in their reply to this discussion, suggested applying the correction factor $(n + 2)/(n - k)$ to $Q_m$. However, this correction factor may inflate the variance of the resulting statistic relative to that of the asymptotic $\chi^2_{m-p-q}$ distribution (see e.g. Davies et al., 1977; Ansley and Newbold, 1979).
An important point to note is that the statistic $Q_m$ has been developed assuming the normality of the white noise process. As the results of Anderson and Walker (1964) suggest, the asymptotic normality of the autocorrelations of a stochastic process does not depend on the normality of the process and requires only the assumption of finite variance, so the portmanteau test is expected to be insensitive to the normality assumption. Ljung and Box (1978) suggested the use of the modified statistic

$$Q_m^* = n(n+2) \sum_{k=1}^{m} \frac{\hat{r}_k^2}{n-k}. \qquad (2)$$

They have shown that the modified portmanteau statistic $Q_m^*$ has a finite-sample distribution which is much closer to $\chi^2_{m-p-q}$, and that it is insensitive to the normality assumption on the error term, $\varepsilon_t$. As pointed out by many researchers, e.g. Davies et al. (1977) and Ansley and Newbold (1979), the true significance levels of $Q_m$ tend to be much lower than predicted by the asymptotic theory, and although the mean of $Q_m^*$ is much closer to that of the asymptotic distribution, this corrected version of the portmanteau test has an inflated variance. However, Ljung and Box (1978) pointed out that the approximate expression for the variance given by Davies et al. (1977) overestimates the variance of $Q_m^*$.
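The modified statistic of Ljung and Box (1978) differs from $Q_m$ only by the weight $(n + 2)/(n - k)$ attached to each squared autocorrelation. A short Python sketch follows; the function name is hypothetical and, as an assumption, the residual autocorrelations are computed without mean-centring.

```python
def ljung_box(resid, m):
    """Ljung-Box statistic: n (n + 2) * sum_{k=1}^m r_k^2 / (n - k)."""
    n = len(resid)
    c0 = sum(e * e for e in resid)
    total = 0.0
    for k in range(1, m + 1):
        # Lag-k residual autocorrelation (no mean-centring: an assumption).
        r_k = sum(resid[t] * resid[t - k] for t in range(k, n)) / c0
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total
```

Because $(n + 2)/(n - k) > 1$ for every lag k >= 1, this statistic is never smaller than the Box-Pierce statistic computed from the same residuals.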
Frequently in the literature, larger values of m have been used in $Q_m$ and $Q_m^*$, and the most commonly suggested value is m = 20 (see e.g. Davies et al., 1977; Ljung and Box, 1978). Ljung (1986) suggests the use of smaller values of m and has shown that, for small values of m, $Q_m^*$ has approximately an $a\chi^2_b$ distribution, where a and b are constants to be determined. Ljung and Box (1978) also studied the empirical significance levels and empirical powers of $Q_m^*$ for various choices of m and showed that the empirical significance levels for an AR(1) process are close to the nominal level for small choices of m, for example when m = 10 or 20, in all cases except when the AR parameter is close to the boundary of the stationarity region. This is a very challenging scenario for the $\chi^2$ approximation. We will look at this issue in our future work.

The partial autocorrelation function is an important tool in determining the order of an autoregressive process. Monti (1994) suggested a portmanteau test, following the idea of Ljung and Box (1978), given as

$$Q_m^*(\hat{\pi}) = n(n+2) \sum_{k=1}^{m} \frac{\hat{\pi}_k^2}{n-k}, \qquad (3)$$

where $\hat{\pi}_k$ is the residual partial autocorrelation at lag k. She showed that this test performs well, especially when the order of the moving average component is understated.
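The statistic of Monti (1994) replaces residual autocorrelations with residual partial autocorrelations. The sketch below obtains the partial autocorrelations from the autocorrelations through the Durbin-Levinson recursion, which is one standard way to compute them and not necessarily the one used in the study; the function names are hypothetical.

```python
def pacf_from_acf(r):
    """Durbin-Levinson recursion: map [r_1..r_m] to [pi_1..pi_m]."""
    pacf = []
    phi_prev = []  # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, len(r) + 1):
        num = r[k - 1] - sum(phi_prev[j] * r[k - 2 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * r[j] for j in range(k - 1))
        phi_kk = num / den
        phi_prev = [
            phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)
        ] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


def monti(resid, m):
    """Monti statistic: n (n + 2) * sum_{k=1}^m pi_k^2 / (n - k)."""
    n = len(resid)
    c0 = sum(e * e for e in resid)
    r = [
        sum(resid[t] * resid[t - k] for t in range(k, n)) / c0
        for k in range(1, m + 1)
    ]
    pacf = pacf_from_acf(r)
    return n * (n + 2) * sum(
        pi * pi / (n - k) for k, pi in enumerate(pacf, start=1)
    )
```

For m = 1 the partial autocorrelation equals the autocorrelation, so the Monti and Ljung-Box statistics coincide; they differ from m = 2 onwards.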
As discussed earlier, the asymptotic distribution of $Q_m$ and $Q_m^*$ has been questioned by several authors in the literature. Though small values of m solve this problem in some situations, this does not work in all cases, for example when the process is close to the boundary of the stationarity region, see Ljung (1986). In a more recent paper, Katayama (2008) suggested a bias-corrected version of the statistic, in which the correction term involves an $m \times (p + q)$ matrix X partitioned into p and q columns; see McLeod (1978) for details. Katayama (2008) showed the importance of this correction term, especially for small values of m and when the roots of the ARMA(p, q) process lie near the boundary of the stationarity region.
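In practice, the optimal choice of m is difficult, as diagnostic checking requires large values of m, which results in less power and unstable size of the test, as noticed by Ljung (1986) and Katayama (2008). Katayama (2009) suggested a multiple portmanteau test to overcome this problem. His suggested test is based on several portmanteau tests for a range of small to medium values of m. He showed, using some numerical examples, that his suggestion leads to a superior test. He also discussed the linkage between his suggested multiple test and the test due to Pena and Rodriguez (2002), and suggested a method based on an iterative procedure to approximate the joint distribution of the multiple test, as the computation of the distribution is very hard due to correlated elements.

McLeod and Li (1983) used the sample autocorrelation of the squared residuals to test for linearity against non-linearity and showed its good power against departures from linearity. Escanciano (2007) proposed diagnostic tests based on the CvM test using the weights suggested by Bierens (1982), given by

$$CvM_{\exp,P} = \frac{1}{\hat{\sigma}^2 n} \sum_{t=1}^{n} \sum_{s=1}^{n} \hat{\varepsilon}_t \hat{\varepsilon}_s \exp\left(-\tfrac{1}{2} \left| I_{t-1,P} - I_{s-1,P} \right|^2\right),$$

where $\hat{\sigma}^2$ is the variance of the residuals and $I_{t-1,P} = (y_{t-1}, \ldots, y_{t-P})$ is the information set. The distance $|I_{t-1,P} - I_{s-1,P}|^2$ increases very fast with P, which results in weights being near 0 when P is relatively large. We have considered the CvM statistic with this weight scheme in our study, as it has shown the good power properties reported in Escanciano (2006).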
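To make the role of the weights concrete, the following Python sketch evaluates a CvM-type statistic with exponential weights, using the double-sum form $(\hat{\sigma}^2 n)^{-1} \sum_t \sum_s \hat{\varepsilon}_t \hat{\varepsilon}_s \exp(-\frac{1}{2}|I_{t-1,P} - I_{s-1,P}|^2)$. This form, the alignment of residuals with observations, and the function name are assumptions made for illustration.

```python
import math


def cvm_exp(y, resid, P):
    """CvM-type statistic with exponential weights on the last P lags.

    Assumptions: resid[t] is the residual for y[t]; only t >= P is used so
    that every lag vector (y[t-1], ..., y[t-P]) is fully observed.
    """
    idx = range(P, len(y))
    lags = [[y[t - j] for j in range(1, P + 1)] for t in idx]
    e = [resid[t] for t in idx]
    n = len(e)
    sigma2 = sum(v * v for v in e) / n  # residual variance, no centring
    total = 0.0
    for a in range(n):
        for b in range(n):
            # Squared Euclidean distance between the two lag vectors.
            d2 = sum((u - v) ** 2 for u, v in zip(lags[a], lags[b]))
            total += e[a] * e[b] * math.exp(-0.5 * d2)
    return total / (sigma2 * n)
```

The double loop costs O(n^2 P) operations. Since each squared distance is a sum of P non-negative terms, the weights shrink towards zero as P grows, which is the effect described above.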

Methodology
We now consider various versions of the statistics defined in (1), (2), (3) and (4). We compare the empirical size and power of these tests against various linear and non-linear classes of models. We mainly compare the dynamic and fixed design bootstrap methods, but we also look at the usefulness of transformed residuals in bootstrap methods.

Bootstrap Methods
For time series data, the dependence structure of the data generating process (DGP) makes it difficult to apply bootstrap methods. In general, there are two main bootstrap methods used for time series, i.e. model-based bootstrap methods and block-resampling bootstrap methods. Generally, the model-based bootstrap methods are called resampling-residuals bootstrap methods.
In block bootstrapping, we divide the sample into overlapping or non-overlapping blocks of a certain length. The performance of block bootstrap methods depends heavily on the block length. Under the stationarity condition, each block should have the same joint probability distribution. In our study we consider only model-based bootstrapping, as model-based bootstrap methods tend to be more accurate than block bootstrap methods (Lahiri, 2003) and as our objective is to compare two model-based bootstrap methods, namely the dynamic bootstrap and the fixed design bootstrap.

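Suppose we have a sample time series $y_1, \ldots, y_n$ generated by the DGP

$$y_t = f(I_{t-1}, \theta) + \varepsilon_t, \qquad (6)$$

where $I_{t-1}$ is the information set defined earlier in (5), $f$ is the mean function of the model and $\theta$ is the vector of model parameters. Suppose the fitted model is

$$\hat{y}_t = f(I_{t-1}, \hat{\theta}),$$

where $\hat{\theta}$ is the estimate of $\theta$. Thus the residuals are

$$\hat{\varepsilon}_t = y_t - \hat{y}_t. \qquad (7)$$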

Semi-parametric time series bootstrap methods
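Under the assumption that the DGP given in (6) is the true model for the given sample time series, the residuals given in (7) will serve the purpose of an IID sample. The following approaches are used in semi-parametric time series bootstrap methods.

In the dynamic bootstrap, the bootstrap series is generated recursively as $y_t^* = f(I_{t-1}^*, \hat{\theta}) + \varepsilon_t^*$, where $f$ is the mean function of the fitted model, $I_{t-1}^*$ is the dynamic bootstrap of the information set defined in (5) and $\varepsilon_t^*$ is selected at random with replacement from the vector of the residuals.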
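For an AR(p) null model, the dynamic bootstrap can be sketched as follows. The function name is hypothetical, the model is taken to have no intercept, and starting the recursion from the first p observed values is an assumption about how the series is initialised.

```python
import random


def dynamic_bootstrap_ar(y, phi, resid, rng):
    """Generate one dynamic bootstrap series from a fitted AR(p) model.

    y     : observed series (its first p values start the recursion)
    phi   : fitted AR coefficients [phi_1, ..., phi_p]
    resid : fitted residuals, used as the resampling pool
    rng   : a random.Random instance
    """
    p = len(phi)
    y_star = list(y[:p])
    for t in range(p, len(y)):
        e_star = rng.choice(resid)  # draw with replacement
        # The lags come from the bootstrap series itself, not from y.
        y_star.append(
            sum(phi[j] * y_star[t - 1 - j] for j in range(p)) + e_star
        )
    return y_star
```

A call such as `dynamic_bootstrap_ar(y, [0.4, -0.2], resid, random.Random(1))` returns a series of the same length as `y` whose dependence structure is that of the fitted model.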

Dynamic wild bootstrap
The dynamic wild bootstrap (DWB) is a simple modification of the dynamic bootstrap. The only difference is that rescaled residuals are used instead of the fitted residuals. The use of such rescaled residuals is usually called the wild bootstrap. Various rescaling schemes have been suggested in the literature, see e.g. Liu (1988) or Stute et al. (1998). The DWB is defined as $y_t^* = f(I_{t-1}^*, \hat{\theta}) + \varepsilon_t^*$, where $f$ is the mean function of the fitted model, $I_{t-1}^*$ is the DWB of the information set defined in (5) and $\varepsilon_t^* = \hat{\varepsilon}_t v_t$, such that the sequence $v_t$ is IID with zero mean, unit variance and finite fourth moment. Liu (1988) suggested a particular choice of $v_t$ for transforming the IID residuals to wild residuals.
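The sketch below illustrates the DWB for an AR(p) model with one widely used two-point distribution for $v_t$, which has zero mean and unit variance. Whether this particular distribution coincides with the schemes referred to above is an assumption; the function names and the position-wise multiplication $\hat{\varepsilon}_t v_t$ are likewise illustrative choices.

```python
import math
import random

SQRT5 = math.sqrt(5.0)
V_LOW = -(SQRT5 - 1.0) / 2.0          # taken with probability P_LOW
V_HIGH = (SQRT5 + 1.0) / 2.0          # taken with probability 1 - P_LOW
P_LOW = (SQRT5 + 1.0) / (2.0 * SQRT5)


def two_point_multiplier(rng):
    """Draw v_t from a two-point law with mean 0 and variance 1."""
    return V_LOW if rng.random() < P_LOW else V_HIGH


def dynamic_wild_bootstrap_ar(y, phi, resid, rng):
    """One DWB series; resid[i] is the fitted residual for y[p + i]."""
    p = len(phi)
    y_star = list(y[:p])
    for t in range(p, len(y)):
        e_star = resid[t - p] * two_point_multiplier(rng)  # wild residual
        y_star.append(
            sum(phi[j] * y_star[t - 1 - j] for j in range(p)) + e_star
        )
    return y_star
```

Unlike plain resampling, each wild residual keeps the magnitude of the residual at its own time point and only randomises its sign and scale.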

The fixed design wild bootstrap
In the fixed design wild bootstrap (FWB), the bootstrap sample is generated from the fixed design $I_{t-1,P}$, i.e. conditionally on the observed information set rather than on its bootstrap counterpart. Moreover, the fitted residuals are transformed to wild residuals using the suggested transformations (see Liu, 1988).
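The difference between the fixed design and the dynamic schemes is easiest to see in code. In the sketch below for an AR(p) model, the fitted values are always computed from the observed lags, and only the residual part is perturbed. The function name and the alignment of `resid` and `v` with the observations are assumptions.

```python
def fixed_design_wild_bootstrap_ar(y, phi, resid, v):
    """One FWB series for an AR(p) model without intercept.

    y     : observed series
    phi   : fitted AR coefficients [phi_1, ..., phi_p]
    resid : resid[i] is the fitted residual for y[p + i]
    v     : multipliers with zero mean and unit variance, same length as resid
    """
    p = len(phi)
    y_star = list(y[:p])
    for t in range(p, len(y)):
        # Fixed design: the lags are the observed y, not the bootstrap y_star.
        fitted = sum(phi[j] * y[t - 1 - j] for j in range(p))
        y_star.append(fitted + resid[t - p] * v[t - p])
    return y_star
```

With all multipliers equal to one, the function reproduces the observed series exactly, which is a convenient sanity check on the alignment of residuals and observations.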

Model Estimation
In this section, we describe the estimation methods used in our study and in the numerical results shown in Section 5. Given an AR(p) model, the parameter estimates are obtained from equations (8), (9) and (10).
In this paper, we mainly look at the power of the diagnostic tests. We use the bootstrap distributions under the semi-parametric bootstrap designs discussed in Section 3.1. We also look at the empirical power of the tests against various alternative models.
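As an illustration of how an AR(p) null model might be estimated, the sketch below fits the coefficients by ordinary least squares through the normal equations. Least squares is one of the methods mentioned in the Introduction; that it is the method behind the estimates used in the study is an assumption, as are the function names and the absence of an intercept.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x


def fit_ar_least_squares(y, p):
    """Return (phi, residuals) for a zero-mean AR(p) fitted by least squares."""
    rows = [[y[t - 1 - j] for j in range(p)] for t in range(p, len(y))]
    target = [y[t] for t in range(p, len(y))]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * z for r, z in zip(rows, target)) for i in range(p)]
    phi = solve_linear_system(xtx, xty)
    resid = [z - sum(c * x for c, x in zip(phi, r)) for r, z in zip(rows, target)]
    return phi, resid
```

The returned residuals have length n - p and are aligned with `y[p:]`, which is the layout expected by a residual-based bootstrap.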

Algorithms
In this section, we give the algorithms for the Monte Carlo method used to compute the empirical size and power of the diagnostic tests defined in Section 2. For each Monte Carlo run, a sample time series is simulated under the model $M$. For the sake of convenience, we denote the statistic of interest as T. For the computation of power, $M$ is the alternative model. We estimate the null model for the simulated sample time series, and T is calculated from the residuals, $\hat{\varepsilon}_t$.

Algorithm 1: Bootstrap sampling procedure
Step 4 Repeat Steps 1-3 for each of the B bootstrap samples.
Algorithm 1 gives the bootstrap procedure used in our numerical study. From this algorithm, we obtain the bootstrap approximation of the distribution of the test statistic. We will use this algorithm to compute power in the following algorithms of our simulation study, consisting of N Monte Carlo runs.
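A bootstrap sampling loop of this kind can be sketched in Python as below. The steps inside the loop (generate a bootstrap series, re-estimate the null model, compute the statistic) are an assumed reading of Algorithm 1, and the AR(1) helpers and the toy statistic are hypothetical stand-ins for the null model and test statistics of the study.

```python
import random


def fit_ar1(y):
    """Least-squares AR(1) fit without intercept; returns (phi, residuals)."""
    phi = sum(y[t] * y[t - 1] for t in range(1, len(y))) / sum(
        v * v for v in y[:-1]
    )
    return phi, [y[t] - phi * y[t - 1] for t in range(1, len(y))]


def simulate_dynamic_ar1(y, phi, resid, rng):
    """Rebuild a series recursively from residuals drawn with replacement."""
    y_star = [y[0]]
    for _ in range(1, len(y)):
        y_star.append(phi * y_star[-1] + rng.choice(resid))
    return y_star


def lag1_statistic(resid):
    """Toy statistic T = n * r_1^2 (a Box-Pierce statistic with m = 1)."""
    r1 = sum(resid[t] * resid[t - 1] for t in range(1, len(resid))) / sum(
        e * e for e in resid
    )
    return len(resid) * r1 * r1


def bootstrap_distribution(y, fit, simulate, statistic, B, rng):
    """Return the B bootstrap replicates [T*_1, ..., T*_B]."""
    theta_hat, resid = fit(y)
    t_star = []
    for _ in range(B):
        y_star = simulate(y, theta_hat, resid, rng)  # bootstrap series
        _, resid_star = fit(y_star)                  # re-estimate null model
        t_star.append(statistic(resid_star))         # bootstrap statistic
    return t_star
```

Passing the fitting, simulation and statistic functions as arguments keeps the loop unchanged when the bootstrap design or the test statistic is swapped.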
The power of a test is the probability of rejecting a false null hypothesis. For empirical power, as mentioned earlier, the sample is generated under the alternative model. Algorithm 2 states the Monte Carlo procedure we use to determine the power of a test.

Algorithm 2: Computation of empirical power
Step 1 Calculate the $100(1 - \alpha)$th percentile, say $c^*_{1-\alpha}$, of the bootstrap distribution of $T^*$ obtained using Algorithm 1.
Step 2 Reject the null model if $T > c^*_{1-\alpha}$.
Step 3 Repeat Steps 1-2 for each of the N Monte Carlo runs.
Step 4 The empirical power, $1 - \hat{\beta}$, is determined as

$$1 - \hat{\beta} = \frac{\#\{T > c^*_{1-\alpha}\}}{N}.$$

In the next section, we look at different examples and compute the power of the diagnostic tests. We now give definitions of some non-linear models which we will study as alternative models in the empirical power study of the portmanteau tests.
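The rejection rule and the power estimate of Algorithm 2 can be sketched as follows. The function names are hypothetical, and taking the percentile as the order statistic with rank ceil((1 - alpha) B) is one common convention, assumed here.

```python
import math


def bootstrap_critical_value(t_star, alpha):
    """100(1 - alpha)th percentile of the bootstrap replicates.

    Assumption: the percentile is the order statistic of rank
    ceil((1 - alpha) * B); a small tolerance guards against rounding error.
    """
    ordered = sorted(t_star)
    k = math.ceil((1.0 - alpha) * len(ordered) - 1e-12)
    return ordered[max(k, 1) - 1]


def empirical_power(statistics, bootstrap_samples, alpha):
    """Share of Monte Carlo runs in which the null model is rejected.

    statistics        : observed T, one per Monte Carlo run
    bootstrap_samples : list of bootstrap replicates T*, one list per run
    """
    rejections = sum(
        1
        for t, t_star in zip(statistics, bootstrap_samples)
        if t > bootstrap_critical_value(t_star, alpha)
    )
    return rejections / len(statistics)
```

When the series are simulated under the null model instead of the alternative, the same function returns the empirical size of the test.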

Threshold Autoregressive model
The threshold autoregressive, TAR(p), model is defined as

$$y_t = \begin{cases} \sum_{i=1}^{p} \phi_i^{(1)} y_{t-i} + \varepsilon_t, & y_{t-1} \le r, \\ \sum_{i=1}^{p} \phi_i^{(2)} y_{t-i} + \varepsilon_t, & y_{t-1} > r, \end{cases}$$

where r is called the threshold; below r the AR parameters are $\phi_i^{(1)}$ and above r they are $\phi_i^{(2)}$ (see e.g. Chatfield, 2004, p. 200). Threshold models, which are basically piecewise linear AR models, were developed and introduced by Tong and Lim (1980). For more discussion on bilinear models, see also Tang and Mohler (1988) and references therein.
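A two-regime TAR(p) series can be simulated with a few lines of Python. In this sketch the regime is selected by the previous observation $y_{t-1}$, the errors are standard normal and a burn-in period is discarded; these choices, the function name and the parameter values in the usage note are assumptions made for illustration.

```python
import random


def simulate_tar(n, phi_low, phi_high, r, rng, burn_in=100):
    """Simulate n observations from a two-regime TAR(p) model.

    phi_low  : AR coefficients used when y[t-1] <= r
    phi_high : AR coefficients used when y[t-1] > r (same length as phi_low)
    rng      : a random.Random instance; errors are N(0, 1) by assumption
    """
    p = len(phi_low)
    y = [0.0] * p
    for _ in range(n + burn_in):
        phi = phi_low if y[-1] <= r else phi_high  # pick the regime
        y.append(
            sum(phi[j] * y[-1 - j] for j in range(p)) + rng.gauss(0.0, 1.0)
        )
    return y[-n:]
```

For example, `simulate_tar(100, [0.6, -0.3], [-0.4, 0.2], 0.0, random.Random(7))` produces a series of length 100 whose dynamics switch whenever the previous value crosses zero.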

Results and Discussion
In this section, we look at some numerical examples to compare the empirical power of the goodness-of-fit tests under the various semi-parametric bootstrap designs discussed in Section 3.1. We present and compare the power against linear and non-linear classes of alternative models under a linear null model. Empirical power results are obtained using Algorithm 2 with 1000 Monte Carlo runs of 200 bootstrap samples. Each bootstrap sample is of size n = 100.

Linear Alternatives
Mixed ARMA models are the most commonly used models in applications. In this section, we compare the power of the tests against several versions of the ARMA(2, 2) model. In this example, we simulate the series from the alternative model, an ARMA(2, 2) process whose MA component is scaled by a constant k. We fit an AR(2) model to this sample, and the power results in the following table give the percentage of times the null model was rejected. Importantly, note that we consider various values of k ranging from 0 to 2. The choice k = 0 corresponds to an AR(2) process, so we expect very low power in this case, in fact as low as the level of significance. On the other hand, as the value of k increases, the MA component in the ARMA process increases in absolute value, and this should result in higher power, reaching a maximum of 100% for some value of k.

Table 1 gives the results for the empirical power of the goodness-of-fit tests. It can be clearly seen that CvM has less power, while the portmanteau tests have better power, against this linear class of alternatives. Our results confirm the results reported in the literature, see e.g. Hong and Lee (2003); Escanciano (2006). Though we have provided the power results for both the dynamic and the fixed design bootstrap methods, we discuss the results for the dynamic bootstrap method only, as dynamic bootstrapping provides the best approximation to the asymptotic distribution, especially for the portmanteau tests, see e.g. Chand (2013).
We can see from these results that, as we increase the value of k, the power of each of the goodness-of-fit tests generally increases, but the increase for $CvM_{\exp,P}$ is not exponential and it attains a maximum power of around 20% even for k = 2. In contrast, the portmanteau tests show an exponential increase in power with an increase in k and nearly reach the maximum power of 100%.
Moreover, it can also be seen that, as the value of m increases, the portmanteau tests generally become less powerful. This result is well known and reported in the literature, see e.g. Hong and Lee (2003), Katayama (2009). The same kind of behaviour can be seen for the $CvM_{\exp,P}$ test, which also shows a decrease in power for larger values of P; this is also reported in Escanciano (2006).

Non Linear Alternatives
In this section, we look at the empirical power of the goodness-of-fit tests against some popular non-linear alternatives. We consider several versions of non-linear EXPAR(2) and TAR(2) models. It has been reported in the literature that the portmanteau tests we are studying have poor power against non-linear alternatives, especially TAR models (Escanciano, 2006). We use the same choices of P and m as in the previous section on power against linear alternatives, i.e. P = 3, 5, 7 for $CvM_{\exp,P}$.

Against the EXPAR(2) alternatives, the power of $CvM_{\exp,P}$ increases exponentially with an increase in k and attains the maximum power of 100% at k = 2, while the power of the portmanteau tests can reach around 43%. These results confirm our earlier findings that power decreases for larger values of P and m.

Now we move to the threshold autoregressive model, another class of non-linear models. Theory suggests that TAR models are more challenging than EXPAR models for the diagnostic tests. We consider a TAR(2) model in which a constant k controls the amount of non-linearity. Importantly, it should be noted that the choice of P and m is crucial and the power results may improve for some smaller values of P and m. Noting the result reported in Escanciano (2006), where $CvM_{\exp,P}$ achieved a power of 81% against a TAR(1) model, we tried smaller values of P, i.e. P = 1, 2. For P = 1, the power of $CvM_{\exp,P}$ decreases even further to around 20%, while for P = 2 it shows an improvement and the power rises to 60%.

Conclusion
Portmanteau tests are powerful against linear alternatives, while the CvM statistic has shown more power against non-linear alternatives. The choice of m for the portmanteau tests and of P for the $CvM_{\exp,P}$ test is important. Our results suggest that the approximation of the asymptotic distribution and the power of these goodness-of-fit tests depend highly on the choice of these parameters, P and m.


[...] are the weights in the MA($\infty$) representation.
[...] suggests the use of smaller values of m and has shown that, for small values of m, $Q_m^*$ has an approximate $a\chi^2_b$ [...]

Stute et al. (1998) suggested the following $v_t$ sequence, [...] each of the N Monte Carlo runs. Step 4 [...] his suggested multiple test and the test due to Pena and Rodriguez (2002). He suggested a method based on an iterative procedure to approximate the joint distribution of the multiple test, as the computation of the distribution is very hard due to correlated elements.
Katayama (2009) [...] and diagnostic checking require large values of m, which results in less power and unstable size of the test, as noticed by Ljung (1986) and Katayama (2008). Katayama (2009) suggested a multiple portmanteau test to overcome this problem. His suggested test is based on several portmanteau tests for a range of small to medium values of m. He showed, using some numerical examples, that his suggestion leads to a superior test.
$\hat{\theta}$ is the estimate of $\theta$. Thus the residuals are [...] will serve the purpose of an IID sample. The following approaches are used in semi-parametric time series bootstrap methods.
[...] or Stute et al. (1998). The DWB is defined as: [...]

Table 2: Power (in %), based on 1000 Monte Carlo runs of 200 bootstrap samples of size 100, for AR(2) against EXPAR(2).

Table 2 reports the empirical power of the diagnostic tests. The situation looks quite opposite to the linear case in the previous section. As we can see, k = 0 corresponds to an AR(2) process and, with an increase in the value of k, the non-linear component in the model becomes dominant.

We can see that, by controlling the value of k, we can control the amount of non-linearity in the model. Lower values of k correspond to low levels of non-linearity, while larger values of k result in a highly non-linear model. We use a range of values of k for which the model does not blow up. We use the same Algorithm 2 to compute the empirical power.

Table 3: Power (in %), based on 1000 Monte Carlo runs of 200 bootstrap samples of size 100, for AR(2) against TAR(2).

Table 3 reports the empirical power of the diagnostic tests for AR(2) against TAR(2) models. These results generally confirm the known fact that threshold models are challenging for goodness-of-fit tests. The portmanteau tests based on residual autocorrelations show very low power against the TAR model. Though $CvM_{\exp,P}$ shows better power, especially for a smaller choice of P, i.e. P = 3, it still cannot achieve the same high power as it did against EXPAR(2) models.