Performance of Bayes procedures
4.2 Frequentist performance: Point estimates
Besides having good Bayesian properties, estimators derived using Bayesian methods can have excellent frequentist and EB properties, and produce improvements over estimates generated by frequentist or likelihood-based approaches. We start with two basic examples and then a generalization, still in a basic framework. These show that the Bayesian approach can be very effective in producing attractive frequentist properties, and suggest that this performance advantage holds for more complex models.
4.2.1 Gaussian/Gaussian model
The familiar Gaussian/Gaussian model provides a simple illustration of how the Bayesian formalism can produce procedures with good frequentist properties, in many situations better than those based on maximum likelihood or unbiasedness theory. We consider the basic model wherein $Y \mid \theta \sim N(\theta, 1)$ and $\theta \sim N(0, \tau^2)$ (that is, the setting of Example 2.1 where $\sigma^2 = 1$ and $\mu = 0$). We compare the two decision rules $\delta_1(y) = y$ and $\delta_2(y) = (1 - B)y$, where $B = 1/(1 + \tau^2)$. Here, $\delta_1$ is the MLE and UMVUE, while $\delta_2$ is the posterior mode, median, and mean (the latter implying it is the Bayes rule under squared error loss). We then have
$$MSE_{\delta_1}(\theta) = E_{y \mid \theta}(y - \theta)^2 = 1,$$
which is exactly the variance of $y$, and also
$$MSE_{\delta_2}(\theta) = E_{y \mid \theta}\left[(1 - B)y - \theta\right]^2 = (1 - B)^2 + B^2\theta^2,$$
which is exactly the variance of $\delta_2$ plus the square of its bias. It is easy to show that $MSE_{\delta_2}(\theta) < MSE_{\delta_1}(\theta)$ if and only if $\theta^2 < (2 - B)/B = 1 + 2\tau^2$. That is, the Bayes rule has smaller frequentist risk provided the true mean, $\theta$, is close to the prior mean, 0.
Figure 4.2 MSE (risk under squared error loss) for three estimators in the Gaussian/Gaussian model.
This situation is illustrated for the case where $B = 0.5$ in Figure 4.2, where the dummy argument $t$ is used in place of $\theta$. The MSE of $\delta_1$ is shown by the solid horizontal line at 1, while that of $\delta_2$ is shown by the dotted parabola centered at 0. For comparison, a dashed line corresponding to the MSE for a third rule, $\delta_3(y) = 2$, is also given. This rather silly rule always estimates $\theta$ to be 2, ignoring the data completely. Clearly its risk is given by $MSE_{\delta_3}(\theta) = (\theta - 2)^2$, which is 0 if $\theta$ happens to actually be 2, but increases much more steeply than $MSE_{\delta_2}(\theta)$ as $\theta$ moves away from 2. This rule is admissible, since (thanks to the 0 value at $\theta = 2$) no other rule could possibly have lower MSE for all $\theta$.
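The risk comparison for the three rules can be checked numerically. The following is a minimal sketch, assuming the unit-variance model $Y \mid \theta \sim N(\theta, 1)$ with prior mean 0, the rules $\delta_1(y) = y$, $\delta_2(y) = (1 - B)y$ and $\delta_3(y) = 2$, and squared error loss. The function names are illustrative only and do not come from the text.

```python
import math
import random


def mse_mle(theta):
    """Risk of delta_1(y) = y when Y | theta ~ N(theta, 1): the variance of y."""
    return 1.0


def mse_bayes(theta, B):
    """Risk of delta_2(y) = (1 - B) * y: variance plus squared bias."""
    return (1.0 - B) ** 2 + (B * theta) ** 2


def mse_constant(theta, c=2.0):
    """Risk of delta_3(y) = c, a rule that ignores the data (pure squared bias)."""
    return (theta - c) ** 2


def superiority_halfwidth(B):
    """delta_2 has smaller risk than delta_1 exactly when |theta| is below this."""
    return math.sqrt((2.0 - B) / B)


def simulated_mse(rule, theta, n_draws=100_000, seed=0):
    """Monte Carlo estimate of E[(rule(Y) - theta)^2] for Y ~ N(theta, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        y = rng.gauss(theta, 1.0)
        total += (rule(y) - theta) ** 2
    return total / n_draws


if __name__ == "__main__":
    B = 0.5
    print("delta_2 beats delta_1 for |theta| <", superiority_halfwidth(B))
    for theta in (0.0, 1.0, 2.0, 3.0):
        print(theta, mse_mle(theta), mse_bayes(theta, B), mse_constant(theta))
```

With $B = 0.5$ the half-width is $\sqrt{3}$, so the shrinkage rule wins on roughly $|\theta| < 1.73$, while the constant rule has zero risk only at $\theta = 2$. The simulation helper is included only as a cross-check of the closed forms.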
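The large penalty that $\delta_3$ pays for other values of $\theta$, however, makes it unattractive for general use. The Bayes rule $\delta_2$ can also be shown to be admissible, but notice that it would be inadmissible if $B < 0$, since then $MSE_{\delta_2}(\theta) > MSE_{\delta_1}(\theta)$ for all $\theta$. Fortunately, $B$ cannot be negative since $\tau^2$ cannot be; the point here is only to show that crossing (shrinking past the prior mean) is a poor idea in this example.
For the more general setting of Example 2.2, with sample size $n$, sampling variance $\sigma^2$, and prior mean $\mu$, the Bayes rule will have smaller MSE than the MLE when
$$\theta \in \mu \pm \sqrt{2\tau^2 + \sigma^2/n}.$$
For the special case where $\tau^2 = \sigma^2$ (i.e., the prior is "worth" one observation), the limits simplify to $\mu \pm \sigma\sqrt{2 + 1/n}$, a broad region of superiority.
4.2.2 Beta/binomial model
As a second example, consider again the estimation of the event probability $\theta$ in a binomial distribution. For the Bayesian analysis we use the conjugate Beta($a$, $b$) prior distribution, and follow Subsection 3.3.2 in reparametrizing from $(a, b)$ to $(\mu, M)$, where $\mu = a/(a + b)$, the prior mean, and $M = a + b$, a measure of prior precision (i.e., increasing $M$ implies decreasing prior variance). Based on squared-error loss, the Bayes estimate is the posterior mean, namely,
$$\hat\theta_{\mu,M} = \frac{M}{M + n}\,\mu + \frac{n}{M + n}\left(\frac{X}{n}\right).$$
Hence $\hat\theta_{\mu,M}$ is a weighted average of the prior mean and the maximum likelihood estimate, $X/n$. Irrespective of the value of $\mu$, the MLE of $\theta$ is $\hat\theta_{\mu,0} = X/n$, obtained by setting $M = 0$. The MSE of the Bayes estimate, $MSE_{\mu,M}(\theta)$, is then given by
$$MSE_{\mu,M}(\theta) = \left(\frac{n}{M + n}\right)^2 \frac{\theta(1 - \theta)}{n} + \left(\frac{M}{M + n}\right)^2 (\mu - \theta)^2. \qquad (4.4)$$
This equation shows the usual variance plus squared bias decomposition of mean squared error.
Figure 4.3 MSE for three estimators in the beta/binomial model with $n = 1$ and $\mu = 0.5$.
Figure 4.4 MSE for three estimators in the beta/binomial model with $n = 20$ and $\mu = 0.5$.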
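The variance plus squared bias decomposition can be verified by brute force. The sketch below assumes the beta/binomial setup with prior mean `mu` and prior precision `M`, so that the posterior mean is `(M * mu + x) / (M + n)`. It compares the closed-form MSE with a direct expectation over the binomial distribution of $X$. The function names are illustrative and not taken from the text.

```python
import math


def bayes_estimate(x, n, mu, M):
    """Posterior mean under a Beta prior with mean mu and precision M = a + b."""
    return (M * mu + x) / (M + n)


def binom_pmf(x, n, theta):
    """Binomial probability of x successes in n trials."""
    return math.comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)


def mse_closed_form(theta, n, mu, M):
    """Variance of the estimate plus its squared bias."""
    w = n / (M + n)  # weight placed on the MLE x / n
    variance = w ** 2 * theta * (1.0 - theta) / n
    squared_bias = (1.0 - w) ** 2 * (mu - theta) ** 2
    return variance + squared_bias


def mse_by_enumeration(theta, n, mu, M):
    """E[(estimate - theta)^2], summing over every possible value of X."""
    return sum(
        binom_pmf(x, n, theta) * (bayes_estimate(x, n, mu, M) - theta) ** 2
        for x in range(n + 1)
    )


if __name__ == "__main__":
    for theta in (0.1, 0.3, 0.5):
        print(theta,
              mse_closed_form(theta, 20, 0.5, 5),
              mse_by_enumeration(theta, 20, 0.5, 5))
```

Setting `M = 0` recovers the MLE and its risk $\theta(1 - \theta)/n$. With `n = 1`, `mu = 0.5` and `M = 1` the two terms combine to a constant risk of 1/16 for every $\theta$.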
Figure 4.3 shows the risk curves for $n = 1$, $\mu = 0.5$, and $M = 0, 1, 2$. If one uses the MLE with $n = 1$, the MLE must be either 0 or 1; no experienced data analyst would use such an estimator. Not surprisingly, the MSE is 0 for $\theta = 0$ or 1, but it rises to .25 at $\theta = .5$. The Bayes rule with $M = 1$ (dotted line) has lower MSE than the MLE for $\theta$ in the interval (.067, .933). When $M = 2$ (dashed line), the region where the Bayes rule improves on the MLE shrinks toward 0.5, but the amount of improvement is greater. This suggests that adding a little bias to a rule in order to reduce variance can pay dividends.
Next, look at Figure 4.4, where $\mu$ is again 0.5 but now $n = 20$ and $M = 0, 1, 5$. Due to the increased sample size, all three estimators have smaller MSE than for $n = 1$, and the MLE performs quite well. The Bayes rule with $M = 1$ produces modest benefits for $\theta$ near 0.5, with little penalty for $\theta$ near 0 or 1, but it takes a larger $M$ (i.e., more weight on the prior mean) for the Bayes rule to be very different from the MLE. This is demonstrated by the curve for $M = 5$, which shows a benefit near 0.5, purchased by a substantial penalty for $\theta$ near 0 or 1. The data analyst (and others who need to be convinced by the analysis) would need to be quite confident that $\theta$ is near 0.5 for this estimator to be attractive.
Using the Bayes rule with "fair" prior mean $\mu = 0.5$ and small precision $M = 1$ pays big dividends when $n = 1$ and essentially reproduces the MLE for $n = 20$. Most would agree that the MLE needs an adjustment if $n = 1$, a smaller adjustment if $n = 2$, and so on. Bayes estimates with diffuse priors produce big benefits for small $n$ (where variance reduction is important).
Figure 4.5 MSE for three estimators in the beta/binomial model with $n = 20$ and $\mu = 0.1$.
Finally, Figure 4.5 shows the costs and benefits of using a Bayes rule with an asymmetric prior having $\mu = 0.1$ when $n = 20$. As in the symmetric prior case ($\mu = 0.5$), setting $M = 1$ essentially reproduces the MLE. With $M = 5$, modest additional benefits accrue for $\theta$ below about 0.44 (and a little above 0), but performance is disastrous for $\theta$ greater than 0.6. Such an estimator might be attractive in some application settings, but would require near certainty that $\theta < 0.5$. We remark that the Bayes pre-posterior risk is the integral of these curves with respect to the prior distribution. These integrals produce the pre-posterior performance of various estimates, a subject to which we return in Section 4.4.
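Both the region where the Bayes rule beats the MLE and the pre-posterior risk can be computed directly from the MSE curves. The sketch below assumes the beta/binomial MSE written as variance plus squared bias, with weight $n/(M + n)$ on the MLE. It finds the superiority region by solving the resulting quadratic inequality, and approximates the pre-posterior risk by midpoint-rule integration against a Beta($a$, $b$) density. All function names, the grid size, and the choice of integration rule are assumptions made for illustration.

```python
import math


def mse_mle(theta, n):
    """Risk of the MLE x / n."""
    return theta * (1.0 - theta) / n


def mse_bayes(theta, n, mu, M):
    """Risk of the posterior mean (M * mu + x) / (M + n)."""
    w = n / (M + n)
    return w ** 2 * theta * (1.0 - theta) / n + (1.0 - w) ** 2 * (mu - theta) ** 2


def superiority_interval(n, mu, M):
    """Endpoints of the theta-interval where the Bayes rule (M > 0) beats the MLE."""
    w = n / (M + n)
    c = (1.0 - w ** 2) / n   # variance reduction per unit of theta * (1 - theta)
    d = (1.0 - w) ** 2       # multiplier on the squared bias (theta - mu)^2
    # d * (theta - mu)^2 < c * theta * (1 - theta) is a quadratic inequality.
    qa, qb, qc = d + c, -(2.0 * d * mu + c), d * mu ** 2
    root = math.sqrt(qb ** 2 - 4.0 * qa * qc)
    return (-qb - root) / (2.0 * qa), (-qb + root) / (2.0 * qa)


def beta_pdf(theta, a, b):
    """Beta(a, b) density, evaluated on the log scale for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(theta)
                    + (b - 1.0) * math.log(1.0 - theta))


def preposterior_risk(mse, a, b, grid=20_000):
    """Midpoint-rule integral of an MSE curve against a Beta(a, b) prior."""
    h = 1.0 / grid
    return h * sum(mse((i + 0.5) * h) * beta_pdf((i + 0.5) * h, a, b)
                   for i in range(grid))


if __name__ == "__main__":
    print(superiority_interval(1, 0.5, 1))   # symmetric prior, n = 1
    print(superiority_interval(20, 0.1, 5))  # asymmetric prior, n = 20
    print(preposterior_risk(lambda t: mse_mle(t, 1), 1.0, 1.0))
    print(preposterior_risk(lambda t: mse_bayes(t, 1, 0.5, 2), 1.0, 1.0))
```

For $n = 1$, $\mu = 0.5$, $M = 1$ the endpoints are $0.5 \pm \sqrt{0.1875}$, which round to .067 and .933. For $n = 20$, $\mu = 0.1$, $M = 5$ the upper endpoint comes out a little above 0.43. The midpoint rule is adequate for smooth priors but converges slowly when $a$ or $b$ is below 1.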
In summary, Bayesian point estimators based on informative priors can be risky from a frequency standpoint, showing that there is no statistical "free lunch." However, even for univariate analyses without compound sampling, the Bayesian formalism with weak prior information produces benefits for the frequentist.
A note on robustness
In both the binomial and Gaussian examples, either an empirical Bayes or hierarchical Bayes approach produces a degree of robustness to prior misspecification, at least within the parametric family, by tuning the hyperparameters to the data. In addition, for exponential sampling distributions with quadratic variance functions, Morris (1983b) has shown broad robustness outside of the conjugate prior family.
4.2.3 Generalization
The foregoing examples address two exponential family models with conjugate prior distributions. Samaniego and Reneau (1994) evaluate these situations more generally for linear Bayes estimators (those where the estimate is a convex combination of the MLE and the prior mean, even if $G$ is not conjugate) based on (iid) data when the MLE is also unbiased (and therefore MVUE). With $\hat\theta$ the MLE, these estimates have the form
$$\hat\theta_G = (1 - w)\,\mu_G + w\,\hat\theta, \qquad (4.5)$$
where $\mu_G$ is the mean of the prior $G$ and $w \in [0, 1)$. If the true prior is $G_0$, this Bayes estimate has smaller pre-posterior MSE than the MLE so long as
$$Var_{G_0}(\theta) + \left(E_{G_0}\theta - \mu_G\right)^2 < \frac{1 + w}{1 - w}\; r(G_0, \hat\theta), \qquad (4.6)$$
where $r(G_0, \hat\theta)$ is the pre-posterior MSE of the MLE under $G_0$.
Since $G_0$ can be a point mass, this relation provides a standard, frequentist evaluation.
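A short derivation sketch may help show where a condition of this kind comes from. It uses notation that is assumed here rather than taken from the source: the linear estimator puts weight $w \in [0, 1)$ on an unbiased MLE $\hat\theta$ and weight $1 - w$ on the prior mean $\mu_G$, and $r_0$ denotes the pre-posterior MSE of the MLE under the true prior $G_0$.

```latex
\begin{align*}
\mathrm{MSE}_{\hat\theta_G}(\theta)
  &= E_{x \mid \theta}\bigl[(1 - w)\mu_G + w\hat\theta - \theta\bigr]^2
   = w^2\,\mathrm{Var}(\hat\theta \mid \theta) + (1 - w)^2 (\theta - \mu_G)^2, \\
E_{G_0}\,\mathrm{MSE}_{\hat\theta_G}(\theta)
  &= w^2 r_0 + (1 - w)^2\, E_{G_0}(\theta - \mu_G)^2, \\
w^2 r_0 + (1 - w)^2\, E_{G_0}(\theta - \mu_G)^2 < r_0
  &\iff E_{G_0}(\theta - \mu_G)^2 < \frac{1 + w}{1 - w}\; r_0, \\
E_{G_0}(\theta - \mu_G)^2
  &= \mathrm{Var}_{G_0}(\theta) + \bigl(E_{G_0}\theta - \mu_G\bigr)^2 .
\end{align*}
```

As a check, take $G_0$ to be a point mass at $\theta$ in the unit-variance Gaussian/Gaussian model of Subsection 4.2.1, with $\mu_G = 0$, $r_0 = 1$ and $w = 1 - B$. The condition becomes $\theta^2 < (2 - B)/B$, which is what direct calculation gives for the rule $(1 - B)y$.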
In addition to producing point estimates with very good frequentist performance, the Bayesian approach can also be used to develop interval estimation procedures with excellent frequentist properties in both basic and complicated modeling scenarios. As in the previous subsection, our strategy will be to use noninformative or minimally informative priors to produce the posterior distribution, which will then be used to produce intervals via either HPD or simpler "equal-tail" methods (see Subsection 2.3.2).
In complicated examples, the Bayesian formalism may be the only way to develop confidence intervals that reflect all uncertainties. We leave intervals arising from the Gaussian/Gaussian model as an exercise, and begin instead with the beta/binomial model.
4.3.1 Beta/binomial model
Again using the notation of Subsection 4.2.2, we recall that the posterior in this case is $\theta \mid x \sim \text{Beta}(a + x,\; b + n - x)$, with $a = M\mu$ and $b = M(1 - \mu)$. We examine how data are processed by the $100(1 - 2\alpha)\%$ Bayesian credible interval. Suppose we adopt the uniform Beta($a = 1$, $b = 1$) prior, and consider the case where $X = 0$. We have the posterior distribution
$$p(\theta \mid x = 0) = (n + 1)(1 - \theta)^n, \quad 0 \le \theta \le 1,$$
that is, a Beta(1, $n + 1$) distribution.
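For the uniform prior with $X = 0$, the credible limits have a closed form, which makes the behavior of the two interval types easy to inspect. The sketch below assumes the posterior is Beta(1, $n + 1$), whose distribution function is $1 - (1 - \theta)^{n+1}$. It computes the equal-tail interval with probability $\alpha$ in each tail and, because this posterior density is decreasing in $\theta$, the HPD interval that starts at 0. The function names are illustrative only.

```python
def posterior_cdf(theta, n):
    """CDF of the Beta(1, n + 1) posterior (uniform prior, X = 0 in n trials)."""
    return 1.0 - (1.0 - theta) ** (n + 1)


def posterior_quantile(p, n):
    """Inverse of posterior_cdf, available in closed form for this posterior."""
    return 1.0 - (1.0 - p) ** (1.0 / (n + 1))


def equal_tail_interval(n, alpha):
    """100(1 - 2 * alpha)% interval with posterior probability alpha in each tail."""
    return posterior_quantile(alpha, n), posterior_quantile(1.0 - alpha, n)


def hpd_interval(n, alpha):
    """100(1 - 2 * alpha)% HPD interval; the density is decreasing, so it starts at 0."""
    return 0.0, posterior_quantile(1.0 - 2.0 * alpha, n)


if __name__ == "__main__":
    n, alpha = 10, 0.025
    print("equal-tail:", equal_tail_interval(n, alpha))
    print("HPD:       ", hpd_interval(n, alpha))
```

The equal-tail lower limit is strictly positive even though no events were observed, whereas the HPD interval includes 0 and is shorter for the same posterior probability.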