The Bias in Bayes and How to Measure it

Summary. A Bayes prior combined with a likelihood can give approximate confidence and provides a remarkably flexible approach to statistical inference, but it is also known to give inaccurate and perhaps incorrect results. We develop a measure of Bayes bias, first examining a simple Normal model and then progressing to quite general models with scalar and vector parameters. The bias measure can be interpreted as the lateral displacement of the location-standardized likelihood function and thus provides ready access to the effect of a prior on p-values, confidence bounds, and Bayes posterior bounds. The needed computation is comparable to that for the likelihood function and thus provides an initial option for checking the merits of Bayesian computation in high dimensions.


Introduction
The observed likelihood function from a statistical investigation provides substantial information concerning the value of an unknown parameter in the context being investigated. To obtain the observed likelihood function we need a statistical model in the familiar density form

$$f(y; \theta)$$

together with the observed data $y^0$. A powerful tool for eliciting information from this observed likelihood function is provided by the Bayesian methodology. This methodology introduces a weight function $\pi(\theta)$, called a prior, that provides a scaling for parameter values, and treats $\pi(\theta)$ as a probability input to the statistical model. With this acquired status for the prior, the probability lemma for conditional probability produces a posterior density

$$\pi(\theta \mid y^0) = c\,\pi(\theta)\,f(y^0; \theta),$$

viewed as a probability description concerning the otherwise unknown parameter value. For some further discussion see Fraser (2011).
The probability lemma has two inputs, and for validity each must describe a probability process. For the Bayes application the validity condition holds for the model input, assuming of course that the model is taken as a valid approximation for data production. But for the prior the validity condition does not hold when, as in most applications, the prior is introduced as a facilitator of the inference process. For example, in the original Bayes (1763) an additional randomization was hypothesized as a conceptual description for the origins of the $\theta$ value. In the default Bayes case, sometimes called the objective Bayes case, the prior is an artifact, a mathematical object introduced for inference expedience. And even in cases where a known physical source for the $\theta$ value is available, there is no statistical imperative that says the prior must be combined with the model; indeed there may be good grounds for keeping the two sources of information separate, thus letting any end user have the option of combining them if deemed appropriate. Of course, if there is no easy way to extract parameter information from the model with data, then the default Bayes approach has the merit of producing approximate confidence, as demonstrated in Fraser (2011).
In § 2 we consider a simple scalar Normal model and examine the bias coming from the use of a Bayes default prior; we then extend this to the first-order asymptotic model and show that again the Bayes bias is easily calculated and easily interpreted. Then in § 3 we recall the significant Welch-Peers result that the Jeffreys prior with a scalar parameter exponential model converts that model to a location model to second order and thus gives second order equivalence of confidence and Bayes posteriors. Then in § 4 we consider the scalar exponential model and give a general expression for Bayes bias recorded as (2); and in § 5 we extend this to the vector parameter case through the use of a directed location reparameterization (3), obtaining the extended definition of bias as (4). § 6 records some brief discussion.

The pure case and the first order case
There is merit with most procedures in statistics in knowing how they work in the very simplest of situations. Thus, for measuring the bias in Bayes, we first consider a model that is a scale-standardized Normal$(\theta; 1)$ and a prior that is log-linear, $\pi(\theta) = c\exp(a\theta)$.
Very simple calculations show that the posterior is

$$c\exp\{-(\theta - y^0)^2/2 + a\theta\} = c\exp\{-(\theta - y^0 - a)^2/2\},$$

with $c$ a generic constant, which is the Normal$(y^0 + a; 1)$. Thus if $a$ is the slope of the log-prior at the maximum likelihood value $\hat\theta^0$ then the posterior has bias $B = a$, a shift $B$ of the observed likelihood in standardized parameter units. In other words, the posterior is not what is indicated by the likelihood function but is that likelihood, in scale-standardized units, shifted laterally by the amount $B = a$. We call $B$ here the bias in Bayes; the bias $B$ is just the slope $a$ of the log-prior at the maximum likelihood value $\hat\theta^0$, as appropriately scale standardized, and it measures how the prior biases the data information by displacing likelihood. We will see that this is a general phenomenon arising in the use of priors.
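The shift in this pure Normal case can be checked numerically. The following is a minimal sketch, not part of the original development: it grid-integrates the posterior for a Normal$(\theta; 1)$ model with a log-linear prior $\exp(a\theta)$ and reports the posterior mean and standard deviation. The function name `posterior_summary`, the grid settings, and the values of `y0` and `a` are illustrative assumptions.

```python
import math


def posterior_summary(y0, a, half_width=12.0, steps=4800):
    """Grid-integrate the posterior for a Normal(theta; 1) model with data y0
    and log-linear prior exp(a * theta); return (mean, standard deviation)."""
    h = 2.0 * half_width / steps
    thetas = [y0 - half_width + i * h for i in range(steps + 1)]
    # Log-posterior up to a constant; a * y0 is subtracted to keep exp() tame.
    weights = [math.exp(-0.5 * (t - y0) ** 2 + a * (t - y0)) for t in thetas]
    total = sum(weights)
    mean = sum(t * w for t, w in zip(thetas, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(thetas, weights)) / total
    return mean, math.sqrt(var)


if __name__ == "__main__":
    y0, a = 1.3, 0.7  # hypothetical data value and prior slope
    mean, sd = posterior_summary(y0, a)
    print("posterior mean minus y0 (the bias B):", mean - y0)
    print("posterior standard deviation:", sd)
```

Under these assumptions the posterior mean should sit at $y^0 + a$ with unit standard deviation, so the difference between the posterior mean and $y^0$ recovers the prior slope $a$.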
Most statistical packages record first order p-values and first order confidence intervals, typically based on the maximum likelihood value or the score value or the signed likelihood ratio value. These can provide valuable information but are often in mutual conflict except in very simple cases like the preceding Normal example. Nonetheless it seems appropriate to consider Bayes bias initially from a first-order perspective.
For this we need Central Limit type regularity and also need the separation of model information into core information, then first order departure that drops off as $n^{-1/2}$, then second order departure that drops off as $n^{-1}$, and so on. While this might seem like an idle exercise, it has remarkable consequences in enabling higher order and typically very accurate calculations. In this section we consider just the analysis that separates out the core, leaving departures that are first order. We call this first order analysis; it corresponds to just the use of familiar Central Limit type results. Then in the next section we consider the improvements in going to second order for calculations.
For the first-order analysis of a parameter $\theta$ we can view the standardized score $s = \ell_\theta(\theta)\,\hat\jmath^{-1/2}$ and the signed likelihood ratio $r = r(\theta; y)$ as being Normal$(0, 1)$. For this, the score $\ell_\theta(\theta) = (\partial/\partial\theta)\,\ell(\theta)$ is the slope or gradient of log-likelihood at the parameter value of interest, $\hat\jmath = j(\hat\theta)$ is a version of information at the maximum likelihood value, and the signed likelihood root in the scalar $\theta$ case is $r = \mathrm{sign}(\hat\theta - \theta)\,[2\{\ell(\hat\theta) - \ell(\theta)\}]^{1/2}$; variations arise depending on the version of information used and where it is evaluated. Now suppose we examine the Bayes analysis of this first order distribution associated with likelihood. The standardization of the variable and parameter will lead to a standardized prior that is log-linear,

$$\pi(\theta) = c\exp(a\theta),$$

with slope $a$ of order $n^{-1/2}$; this will have no first order effect on the posterior.
But then, just to check how a non-trivial prior would affect the posterior, suppose we consider a prior $\pi(\theta) = c\exp(a\theta)$ targeted on the standardized likelihood function. With $\theta$ in standardized units, the log-posterior would then have the following quadratic form in the exponent, ignoring constants:

$$-(\theta - \hat\theta)^2/2 + a\theta = -(\theta - \hat\theta - a)^2/2 + \text{constant},$$

which implies that the asymptotic posterior distribution of $\theta$ is Normal$(\hat\theta + a; 1)$. Thus this first order prior would give a first order posterior with the acquired Bayes bias $B = a$.

The Welch-Peers linearization
For an exponential model with scalar canonical variable and scalar canonical parameter, Welch & Peers (1963) investigated confidence intervals and corresponding Bayes posterior intervals and showed that they are equivalent to second order, provided the Bayes approach uses the Jeffreys (1946) root-information prior. Thus for the exponential model and relative to the canonical parameter $\varphi = \varphi(\theta)$ we obtain the prior

$$\pi(\varphi)\,d\varphi = j_{\varphi\varphi}^{1/2}(\varphi)\,d\varphi.$$

A perhaps more fundamental result implicit in Welch & Peers (1963) is that an exponential model with scalar canonical variable and parameter can be rewritten to second order as a location model. To show this we can integrate the default prior and obtain a reparameterization

$$\beta(\varphi) = \int^{\varphi} j_{\varphi\varphi}^{1/2}(t)\,dt$$

that functions as a location parameter, thus allowing the model to be rewritten to second order as a location model $g\{\hat\beta - \beta(\varphi)\}$. In fact this extended result is available widely for general asymptotic models with scalar variable and parameter (Cakmak et al., 1998), and makes this version of the root-information prior a widely available procedure for obtaining second order default priors.
A location parameterization such as $\beta(\varphi)$ has the property that the observed information is constant; thus $j_{\beta\beta}(\beta) = 1$. A location parameterization however is not unique, and any constant multiple of $\beta$ is also a location parameter; the advantage of our particular choice $\beta(\varphi)$ is the built-in standardization to have observed information equal to 1, of course to second order in moderate deviations.
The role of the location model as an exemplar in inference does deserve more recognition. Let $g(y - \beta)$ be a scalar variable, scalar parameter location model. Tests and confidence intervals can sometimes be one-sided to the left, or one-sided to the right, or two-sided, or occasionally more specialized, depending often on the interests of some particular end user. To avoid this seeming arbitrariness there is merit in defining a basic or fundamental p-value as the probability left of the data, or just the observed value of the distribution function. As such the p-value records the statistical position of the data $y^0$ with respect to a parameter value $\beta$: thus

$$p(\beta) = \int_{-\infty}^{y^0} g(y - \beta)\,dy,$$

and with a flat prior $\pi(\beta) = c$, trivial calculations show that this p-value and the corresponding posterior survivor value $s(\beta)$ are equal, that is $p(\beta) = s(\beta)$, or that Bayes equals frequentist, or equivalently that the Bayes procedure has frequentist or repetition validity. The location model thus serves as the primal basis for addressing Bayes repetition validity.
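The equality of the p-value and the flat-prior posterior survivor value can be illustrated numerically. The sketch below is an illustration under stated assumptions, not the paper's own computation: it takes a standard Gumbel density as a hypothetical asymmetric error density $g$, computes the probability left of the data for the model $g(y - \beta)$, and compares it with the flat-prior posterior probability right of $\beta$. The helper names `trapezoid`, `p_value`, and `survivor`, the truncation width, and the numerical values are all assumptions made for the example.

```python
import math


def g(z):
    """Standard Gumbel density, a hypothetical asymmetric choice for g."""
    return math.exp(-z - math.exp(-z))


def trapezoid(f, lo, hi, steps=20000):
    """Composite trapezoid rule for f on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h


def p_value(beta, y0, width=30.0):
    """Probability left of the data: integral of g(y - beta) over y <= y0."""
    return trapezoid(lambda y: g(y - beta), y0 - width, y0)


def survivor(beta, y0, width=30.0):
    """Flat-prior posterior probability to the right of beta."""
    post = lambda b: g(y0 - b)  # posterior density up to a constant
    num = trapezoid(post, beta, beta + width)
    den = trapezoid(post, y0 - width, y0 + width)
    return num / den


if __name__ == "__main__":
    beta, y0 = 0.4, 1.0  # hypothetical parameter value and observed data
    print("p-value:", p_value(beta, y0))
    print("posterior survivor value:", survivor(beta, y0))
```

An asymmetric density is used so that the agreement cannot be attributed to symmetry; the two values should coincide up to numerical integration error.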

Measuring Bayes bias to second order: The scalar case
An exponential model provides full third-order generality for statistical inference in the scalar parameter case (Cakmak et al., 1998; Andrews et al., 2005). Thus to measure Bayes bias to second order for a scalar parameter, we restrict attention to an exponential model with canonical parameter $\varphi(\theta)$ and examine the bias from the Bayes use of a prior $\pi(\varphi)$. For this we take the gradient $a$ of the log-prior at the observed maximum likelihood value $\hat\varphi^0$, as in Section 2, but then adjust to obtain the gradient $B$ with respect to the standardized location parameter. For a given initial prior $\pi(\varphi)\,d\varphi$ we first reexpress it in terms of the location parameter $\beta$, and we then take the log-gradient with respect to the location parameter. Since $d\beta = j_{\varphi\varphi}^{1/2}(\varphi)\,d\varphi$, this gives the log-gradient

$$B = \frac{d}{d\beta}\log\frac{\pi(\varphi)}{j_{\varphi\varphi}^{1/2}(\varphi)}\bigg|_{\hat\varphi^0} = \left\{\frac{d}{d\varphi}\log\pi(\varphi) - \frac{d}{d\varphi}\log j_{\varphi\varphi}^{1/2}(\varphi)\right\}\bigg|_{\hat\varphi^0}\, j_{\varphi\varphi}^{-1/2}(\hat\varphi^0), \qquad (2)$$

which gives the special log-gradient $B$ of the prior as the excess of the direct log-gradient over the log-gradient of the Jeffreys prior, all then adjusted to allow intermediate differentiation to take place with respect to the convenient canonical parameter $\varphi$. The interpretation is as in Section 2.
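The verbal description of the scalar bias can be turned into a small numerical sketch. It assumes the reading $B = j_{\varphi\varphi}^{-1/2}(\hat\varphi^0)\,(d/d\varphi)\{\log\pi(\varphi) - \log j_{\varphi\varphi}^{1/2}(\varphi)\}$ evaluated at $\hat\varphi^0$, that is, the excess of the prior's log-gradient over that of the root-information prior, rescaled to the standardized location parameter. The function `bayes_bias`, the central-difference step, and the Poisson illustration (canonical parameter $\varphi = \log\lambda$, information $n e^{\varphi}$ for a sample of size $n$) are assumptions introduced here for illustration only.

```python
import math


def bayes_bias(log_prior, info, phi_hat, eps=1e-5):
    """Standardized log-gradient of a prior relative to the root-information
    (Jeffreys) prior, evaluated at phi_hat by central differences."""
    def excess(phi):
        # log prior minus log of the root-information prior
        return log_prior(phi) - 0.5 * math.log(info(phi))

    slope = (excess(phi_hat + eps) - excess(phi_hat - eps)) / (2.0 * eps)
    # Rescale from the canonical parameter to the standardized location scale.
    return slope / math.sqrt(info(phi_hat))


if __name__ == "__main__":
    n, lam_hat = 25, 4.0                    # hypothetical sample size and MLE
    phi_hat = math.log(lam_hat)             # canonical parameter at the MLE
    info = lambda phi: n * math.exp(phi)    # Poisson information in phi

    priors = {
        "flat in lambda": lambda phi: phi,  # d(lambda) = exp(phi) d(phi)
        "flat in phi": lambda phi: 0.0,
        "root information": lambda phi: 0.5 * math.log(info(phi)),
    }
    for name, log_prior in priors.items():
        print(name, bayes_bias(log_prior, info, phi_hat))
```

In this illustration the root-information prior should give a bias of zero, in line with the Welch-Peers result, while the two flat priors give biases of equal magnitude and opposite sign, shrinking at rate $n^{-1/2}$.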
Example 1. Let $y$ be Normal with mean $m(\theta)$, with data $y^0$ and a proposed prior $\pi(\theta)$. The canonical parameter can always be extracted by sample space differentiation of log-likelihood at the observed data, in an appropriate direction for the variable. Now consider the log-gradient for the example and suppose that $\pi(\theta)$ has been rewritten as $\pi(\varphi)$ in terms of the canonical parameter $\varphi$. The Bayes bias then follows from this log-gradient and, in this standardized notation, directly describes the effective translation of the standardized Normal limiting distribution that comes from the use of the prior for the analysis.

Measuring Bayes bias to second order: The vector case
Now consider an exponential model with vector canonical parameter $\varphi$, and observed data $y^0$ with corresponding maximum likelihood value $\hat\varphi^0$. Second order inference addresses parameter information in moderate deviations about the observed $\hat\varphi^0$ and as such only involves the second derivatives giving the information matrix $j_{\varphi\varphi}(\hat\varphi^0)$. For some background on approximate second-order exponential models, see Fraser et al. (2012). We first examine a directed parameter departure

$$\varphi = \hat\varphi^0 + \delta u,$$

where $u$ is a unit vector defining the direction of a departure and $\delta$ is the magnitude of the departure; we then consider the directed information, say $j_{\delta\delta}(\delta) = u^{\top} j_{\varphi\varphi}(\hat\varphi^0 + \delta u)\,u$. We can consider $j_{\delta\delta}(\delta)$ as being the one-dimensional directed information for $\varphi$, viewed as a scalar parameter $\delta$ for given $u$. We then define a directed location

$$\beta = \beta(\delta) = \int_0^{\delta} j_{\delta\delta}^{1/2}(t)\,dt \qquad (3)$$

and observe that $\beta$ has constant information with the standardized value 1.
We now apply Section 4 conditionally given the direction $u$ and calculate the log-gradient vector $B$ of the prior $\pi(\varphi)$ with respect to the location linear parameterization $\beta$, allowing for the need to reexpress the prior relative to the location parameter as well as the determination of the resulting gradient with respect to the standardized location parameterization; in the direction $u$ this is

$$B = \frac{d}{d\beta}\log\frac{\pi(\hat\varphi^0 + \delta u)}{j_{\delta\delta}^{1/2}(\delta)}\bigg|_{\delta = 0} = \left\{\frac{d}{d\delta}\log\pi(\hat\varphi^0 + \delta u) - \frac{d}{d\delta}\log j_{\delta\delta}^{1/2}(\delta)\right\}\bigg|_{\delta = 0}\, j_{\delta\delta}^{-1/2}(0), \qquad (4)$$

which gives the special standardized log-gradient $B$ of the prior as the excess of the log-gradient with respect to the directed location parameterization $\beta$ over the log-gradient of the directed Jeffreys prior, all then adjusted to allow intermediate differentiation to take place with respect to the directed canonical parameter $\delta$. The interpretation is as in Section 2.
with canonical parameters   case which uses = 1 X  . The negative second derivative of likelihood with respect to after reexpression in terms of       ; and then to have zero Bayes bias we need a zero log-gradient with respect to the local location variable  . In more detail the differential

Discussion
We have examined the effect of a prior on a default Bayes analysis by determining the displacement of the standardized likelihood caused by the presence of the prior. This displacement is the Bayes bias consequent to the introduction of the prior. For an exponential model the technique works with likelihood and thus involves only the dimension of the parameter, and we have illustrated it for parameter dimensions 1 and 2.
For more general regular models with continuous parameters an exponential model approximation, called the tangent exponential model, is needed; for some recent discussion see Reid & Fraser (2010). This exponential model leads to third order analysis and requires sample space differentiation of the log-likelihood function $\ell(\theta; y)$ at the observed data $y^0$; the differentiation is in sensitivity directions representing how parameter change at the observed maximum likelihood value affects the data variable at the observed data point.
In some cases the sample space derivatives may be inaccessible; the log-likelihood can then be replaced by its sample space mean value using $\theta_0 = \hat\theta^0$, and then differentiation with respect to $\theta_0$ at $\hat\theta^0$ gives the canonical parameter $\varphi(\theta)$ that determines the tangent exponential model, but now just to second order. For further details see Reid & Fraser (2010).
These calculations are of order n and are thus comparable to the computation of the likelihood function itself; they thus have accessibility whenever the likelihood function is available.This assures that the second order effect of using a prior is readily available without the calculations needed for the Bayes posterior itself.
The observed likelihood function is just the natural $y^0$-section, or observed data-section, of the model, and is often used in logarithmic form; see Fraser et al. (2012) and Fraser et al. (2010b).
definition of p-value. If we now switch and take a parameter viewpoint, we would correspondingly think of looking right of the parameter and calculate the posterior survivor value $s(\beta)$.
Then if we want a scaled version of the canonical parameter that has observed information equal to the identity, we would replace the initial canonical parameter $\varphi$ and use $\{j_{\varphi\varphi}(\hat\varphi^0)\}^{1/2}\varphi$.
To illustrate (1) we can simplify expressions without loss of generality by using $m(\theta) = \theta$ and thus have $\varphi = \theta$; in particular we then have observed information equal to 1, showing that the model has no curvature as defined in Fraser et al. (2010a).

Example 2.
Consider Then differentiating the log of this location prior gives the Bayes bias