Journal of Multivariate Analysis 88 (2004) 19–46
Combining the data from two normal
populations to estimate the mean of one when
their means difference is bounded
Constance van Eeden
and James V. Zidek
University of British Columbia, Vancouver, BC, Canada
Received 22 December 2000
Abstract
In this paper we address the problem of estimating $\theta_1$ when $Y_i \stackrel{\text{ind}}{\sim} N(\theta_i, \sigma_i^2)$, $i = 1, 2$, are observed and $|\theta_1 - \theta_2| \le c$ for a known constant $c$. Clearly $Y_2$ contains information about $\theta_1$. We show how the so-called weighted likelihood function may be used to generate a class of estimators that exploit that information. We discuss how the weights in the weighted likelihood may be selected to successfully trade bias for precision and thus use the information effectively. In particular, we consider adaptively weighted likelihood estimators where the weights are selected using the data. One approach selects such weights in accord with Akaike’s entropy maximization criterion. We describe several estimators obtained in this way. However, the maximum likelihood estimator is investigated as a competitor to these estimators along with a Bayes estimator, a class of robust Bayes estimators and (when $c$ is sufficiently small), a minimax estimator. Moreover we will assess their properties both numerically and theoretically. Finally, we will see how all of these estimators may be viewed as adaptively weighted likelihood estimators. In fact, an over-riding theme of the paper is that the adaptively weighted likelihood method provides a powerful extension of its classical counterpart.
© 2003 Elsevier Inc. All rights reserved.
AMS 2000 subject classifications: 62F30; 62F10; 62C15; 62C20
STMA 2000 subject classifications: 04:170; 04:010; 04:020; 04:035
Keywords: Likelihood; Maximum likelihood; Weighted likelihood; Estimation; Admissibility; Minimaxity; Normal means; Restricted parameter spaces; Relevance weighting
Supported by a grant from the Natural Sciences and Engineering Research Council of Canada.
Corresponding author. Moerland 19, 1151 BH Broek in Waterland, The Netherlands.
E-mail address: [email protected] (C. van Eeden).
0047-259X/03/$ - see front matter © 2003 Elsevier Inc. All rights reserved. doi:10.1016/S0047-259X(03)00049-6
1. Introduction
Prior to the work of van Eeden [31,32] and independently that of Ayer et al. [7] and Brunk [11,12], statistical inference concerned itself with unrestricted parameter spaces; parameters were allowed to range over their natural spaces. Statistical theory thereby achieved mathematical elegance and tractability. Moreover, statistical procedures and their associated measures of reliability, such as standard errors, could often be found in explicit form.
No doubt the mathematical and computational difficulties were recognized by those who considered restricted parameter spaces. That may explain why it was not until the just cited work of van Eeden and Ayer et al. that a systematic development of a theory of inference for restricted parameter spaces began.
Even then, the properties of estimators in restricted parameter spaces received little attention until the papers by Sackrowitz and Strawderman [26] and Moors [24] and the Ph.D. thesis of Charras [14]. They showed that maximum likelihood estimators for such spaces are “almost always” inadmissible for squared error loss. Another important early result in this connection is the minimax estimator for a bounded normal mean of Casella and Strawderman [13]. Recent years have seen a large number of papers in this area.
Clearly in practice, parameter spaces are always restricted. Although the restrictions may be difficult to specify, the gains in performance that may be realized from incorporating the restrictions can be immense, as this paper will demonstrate. Reductions in the mean-squared error (MSE) of estimation of 50% are attainable over the restricted parameter space if the restrictions are judiciously exploited. Thus, the goal of finding statistical methods for such problems seems well worth achieving.
This paper makes a contribution to the theory of estimation for restricted multi-dimensional parameter spaces. In particular, it shows how the classical maximum likelihood estimator (MLE) can be adapted for use in that setting. Moreover, in that setting it addresses broadly the problem of successfully trading off bias for precision in statistical estimation.
That problem arises when an investigator has data from a population other than that of his or her inferential interest. Do these auxiliary data contain information of value for estimating parameters in the population of interest? If so, how can the bias in the auxiliary sample be traded off for precision in the required parameter estimators?
The specific problem we consider is that of estimating the mean $\theta_1$ of a univariate normal population when loss is squared error. Suppose that in addition to an observation $y_1$ drawn from that population, another independent observation $y_2$ is available, drawn from another normal population with mean $\theta_2$, where $|\theta_1 - \theta_2| \le c$ for a known value of $c$. Can $y_2$ be used with $y_1$ to create an estimator that improves on the estimator $y_1$?
Heuristics suggest an affirmative answer; $y_2$ nearly equal to $y_1$ suggests $\theta_1 \approx \theta_2$. That in turn suggests a better estimator of $\theta_1$ (the best linear unbiased estimator obtained assuming equal population means).
In Section 2 we describe a method for operationalizing these heuristics. That method, the “weighted maximum likelihood (WLE) method”, is an extension of its famous classical cousin that, like its predecessor, is a very generally applicable tool for the practitioner’s toolbox. In fact, if we allow the weights to be chosen adaptively with the help of the data, a very large class of estimators obtains. We will show that this class includes not only the MLE but other estimators obtained below: a Bayes estimator (the “Pitman estimator”); all members of a class of robust Bayes estimators; and (when $c$ is small enough) a minimax estimator. An overriding theme of this article is that the adaptively weighted likelihood method is a very powerful extension of its classical predecessor.
By assessing the numerical and theoretical properties of adaptively weighted likelihood we will show that very accurate estimators can be obtained, enabling us to conclude with some confidence that the method can be recommended for use in practice. These estimators can have substantially smaller mean-squared errors (MSEs) than their classical unbiased counterpart $y_1$, particularly when the two population means are nearly identical. Thus, an effective bias-variance trade-off is indeed possible; information in the sample from the second population can help in estimating the mean of the first.
The weighted likelihood (WL) described in Section 2 extends Fisher’s classical likelihood and what Hu [19] introduces and calls the “relevance weighted likelihood (REWL).” Our extension not only permits the weights to depend on the data, but it allows negative weights as well. Negative weights are needed to enable the theory to embrace such estimators as the Pitman. Both the WL and REWL generalize the “local likelihood” of Tibshirani and Hastie [30] defined in the context of non-parametric regression. Their likelihood was extended as a local likelihood by Staniswalis [27] and as a quasi-local-likelihood by Fan et al. [18].
In contrast to the local likelihood, the WL can be a global likelihood and in one of the applications developed by Hu and Zidek [21], it is shown how the celebrated James-Stein estimator can be found as a maximum (relevance weighted) likelihood estimator when the relevance weights are estimated from the data.
The relevance weights allow bias to be traded for precision in the likelihood setting, as bias is traded for variance in the non-parametric regression setting. The need for such a theory has become increasingly important as modern experimental science has grown in its space–time scales thanks to demand (e.g. environmental science) combined with feasibility (e.g. through information technology). On these scales, the replicated experiment seems completely unrealistic as an experimental paradigm, leading to the need for a theory that embraces bias without sacrificing the goals of efficiency and precision enshrined in Fisher’s foundational works.
The theory described in Section 2 enables the restricted two-dimensional parameter space of this problem to be accommodated within the likelihood framework. As well it permits the bias-precision trade-off to be made without relying on a Bayesian approach (see [8]). The latter permits the bias-variance trade-off to be made in a conceptually straightforward manner. Reliance on empirical Bayes methods softens the demands for realistic prior modeling in complex problems.
Efron [17] illustrates the empirical Bayes approach in such problems and uses the term “relevance” in a manner similar to that of Hu [19].
Our theory is proposed as a simpler alternative to the empirical Bayesian approach for use in complex problems. The WL offers such an approach and we will try to demonstrate that in this article. At the same time we gain a theory that formally links a diverse collection of statistical domains such as weighted least squares, non-parametric regression, meta-analysis and shrinkage estimation. Starting with the likelihood in these domains yields new methods and suggests new problems as we will attempt to show. At the same time, the WL comes with an (as yet incomplete) underlying general theory including extensions of Wald’s theory for the maximum likelihood estimator [20].
In Section 2 we present the weighted likelihood, our extension of that of Hu [19]. Also in that section is a class of WLEs along with a method based on Akaike’s entropy criterion for fitting the optimal weights using the data. Several specific WLEs are derived in this way and presented there. They are compared with each other as well as the MLE and other estimators derived directly from them (their “dominators”). The latter have uniformly smaller MSEs over the restricted parameter space than their parent. One of the many estimators, the one we denote by WLEN, is eventually selected on the basis of its performance coupled with its simplicity. It resembles the James-Stein estimator for this case of just $p = 2$ dimensions.
In Section 3 we turn to an alternative approach and find a potential competitor for the WLEN that comes with optimality credentials lacking in WLEN: it is minimax at least when $c$ is sufficiently small. However, numerical comparisons show that its MSE is actually bigger than that of WLEN over much of the restricted parameter space. It only improves slightly on that of the WLEN over the rest of that space, leaving the WLEN as our selected choice at this stage in spite of its lack of “credentials”.
We next use a natural Bayesian approach in Section 4 in quest of an admissible dominator for WLEN. The result we call the “Pitman estimator.” However, a numerical assessment of its performance shows it to be highly non-robust against mis-specification of $c$. Thus we turn in Section 5 to robust Bayes alternatives and find one whose numerical performance and robustness properties resemble those of WLEN. However, it is much more difficult to compute than WLEN and leaves us at the end of the day with the latter as the clear winner based on pragmatic considerations.
Unless explicitly stated to the contrary, the proofs of all theorems and lemmas below appear in the appendix.
2. Weighted likelihood estimation
In this section we describe the weighted likelihood and apply it to the problem addressed in this paper. Assume $\{Y_i\}$ are independently distributed random variables or vectors, each having an associated population distribution with probability density and cumulative distribution (PDF and CDF, respectively) $f_i$ and $F_i$. Let $Y = (Y_1, \ldots, Y_n)$ be the vector or matrix of these measurable attributes.
From each population $i$, $n_i \ge 0$ items are randomly and independently sampled, yielding $Y_i = (Y_{i1}, \ldots, Y_{in_i})$, $Y_{ij}$ representing the $Y_i$ measured on the $j$th item sampled from the $i$th population, $j = 1, \ldots, n_i$, $i = 1, \ldots, n$ (the null vector when $n_i = 0$). Assume the $Y_{ij}$, $j = 1, \ldots, n_i$, are independent as well as identically distributed, each having its associated population distribution. Denote the realization of $Y_i$ by $y_i$, $i = 1, \ldots, n$.
In this paper inferential interest concerns attributes of population 1. However, in general, Hu and Zidek [21] consider other possibilities such as simultaneous inference about parameters of all the populations.
Starting from the Akaike entropy maximization principle [1–6], Hu and Zidek [21] derive the REWL in the non-parametric and parametric cases. To be precise they suppose (when the $Y$ are discrete) that a predictive distribution, say $g$, of $Y_1$ must be chosen to maximize $\int \log g(y)\, dF_1(y)$, where $F_1$ denotes the true “conceptual” population distribution for the first population. This maximization must be done subject to knowledge that $F_1$ resembles each of the other $F_j$, $j \ne 1$, that is, subject to $\int \log g(y)\, dF_j(y) > c_j$, $j \ne 1$, for specified $\{c_j, j \ne 1\}$. A Lagrangian argument then implies that $g$ maximizes a linear combination of the $\int \log g(y)\, dF_j(y)$, $j = 1, \ldots, n$. However, since the $\{F_j\}$ are unknown they are estimated by $\{\hat F_j\}$, their empirical distribution functions. When only one observation $y_j$ is available from population $j = 1, \ldots, n$, the empirical distribution for that population becomes a point mass at that observation.
In any event, with these heuristics the optimum $g$ maximizes the non-parametric relevance likelihood function that, viewed as a function of $g$, is
$$g \to \prod_{i=1}^{n} \prod_{j=1}^{n_i} g^{\lambda_{ij}/n_i}(y_{ij}). \tag{1}$$
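Similar heuristics apply to the case of interest in this paper, i.e. the parametric case, where for the likelihood we have instead
$$\theta \to \prod_{i=1}^{n} \prod_{j=1}^{n_i} f_i^{\lambda_{ij}/n_i}(y_{ij} \mid \theta_i), \tag{2}$$
where $\theta = (\theta_1, \ldots, \theta_n)$. In both cases $\lambda_{ij} \ge 0$ and we take $\lambda_{ij}/n_i = 0$ when $n_i = 0$ for all $i$ and $j$.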
In the adaptively weighted likelihood of this paper, the weights $\{\lambda_{ij}\}$ may depend on the data and even enable the investigator to trade off bias for precision in estimating the likelihood for population 1 using the data from the remaining populations. Ideally, the choice of these weights (equivalently the specification of the $\{c_j\}$ above) will be context dependent. However, Hu and Zidek [21] suggest one very general method for their selection based on a suggestion of Stigler [29]. That method, again based on the maximization of entropy approach with follow-up estimation, is used below in this paper. Rather than describe it in general we demonstrate it below in specific problems.
The WLE for $\theta$ is found by maximizing (2). For the case where the weights are constants, Hu [20] shows that the theory of Wald for the classical MLE extends to the relevance WLE under a suitable adaptation of Wald’s assumptions. In unpublished work, Steven Wang has shown that these results also hold when the weights are allowed to depend on the data, albeit under a somewhat different asymptotic paradigm than that adopted by Hu.
Now let us turn to the application of the parametric WL to the problem at hand. Here we have two normal populations $Y_i \sim N(\theta_i, \sigma_i^2)$ for which the $\{\sigma_i^2\}$ are known, $i = 1, 2$. Assume, without loss of generality, that $n_1 = n_2 = 1$ for the two populations involved and for simplicity denote the relevance weights by $\lambda_{i1} = \lambda_i$, $i = 1, 2$, for those populations. The WLE for $\theta_1$ is easily shown to be
$$Y_1 + W\beta,$$
where $W = Y_2 - Y_1$ and $\beta$ obtains from the relevance weights and remains to be specified. The relevance weight ratio defines $\beta$ through
$$\frac{\lambda_2}{\lambda_1} = \frac{\sigma_2^2}{\sigma_1^2}\,\frac{\beta}{1 - \beta}. \tag{3}$$
However, in general, we may rely on the data to specify that ratio.
The maximization of entropy criterion we refer to above may be applied to find relevance weights. That criterion leads to the minimization of the MSE in this case of normal population distributions. Hence the optimal choice of $\beta$, if $\Delta = \theta_2 - \theta_1$ were known, would be
$$\beta_{\text{optimal}} = \frac{\sigma_1^2}{\sigma_W^2 + \Delta^2},$$
where $\sigma_W^2 = \sigma_1^2 + \sigma_2^2$. However, since $\Delta$ is unknown the optimal weight must in general be estimated. Motivated by these considerations we will allow $\beta$ to depend on $W$ in general and define the WLE to be
$$\delta_{\text{WLE}}(Y_1, Y_2) = Y_1 + W\hat\beta(W).$$
As we will see, all the estimators considered in this paper are of the form $Y_1 + \varphi(W)$ even though not all of them are obtained by appeal to WLE theory. Nevertheless, by letting $\hat\beta(W) = \varphi(W)/W$, we see that in fact all these estimators may be viewed as having been found as adaptively weighted WLEs.
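The fixed-weight calculation behind the optimal weight can be sketched numerically. The Python fragment below is a minimal illustration rather than part of the original development: the function names are hypothetical, and the closed-form MSE it uses for $Y_1 + \beta W$ with a constant $\beta$, namely $(1-\beta)^2\sigma_1^2 + \beta^2(\sigma_2^2 + \Delta^2)$, is an elementary consequence of the independence of $Y_1$ and $Y_2$.

```python
def mse_fixed_weight(beta, delta, var1, var2):
    """MSE of Y1 + beta*(Y2 - Y1) for a constant beta, when theta2 - theta1 = delta."""
    return (1.0 - beta) ** 2 * var1 + beta ** 2 * (var2 + delta ** 2)


def beta_optimal(delta, var1, var2):
    """Weight minimising mse_fixed_weight: sigma_1^2 / (sigma_W^2 + delta^2)."""
    return var1 / (var1 + var2 + delta ** 2)


def wle(y1, y2, beta_hat):
    """Adaptively weighted estimate Y1 + W*beta_hat(W), with W = Y2 - Y1.

    beta_hat is any function of W; phi(W) = W*beta_hat(W) recovers the
    general form Y1 + phi(W).
    """
    w = y2 - y1
    return y1 + w * beta_hat(w)
```

With $\sigma_1^2 = \sigma_2^2 = 1$ and $\Delta = 0$, for instance, the optimal weight is 0.5 and the MSE falls from 1 (for $Y_1$ alone) to 0.5.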
The MSE of estimation provides a natural criterion by which to assess the performance of any estimator of $\theta_1$ including $\delta_{\text{WLE}}(Y_1, Y_2)$. We state that MSE in general in the next theorem.
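Theorem 2.1. The MSE of an estimator of $\theta_1$ of the form $Y_1 + \varphi(W)$ is given by
$$E_\theta\bigl(Y_1 - \theta_1 + \varphi(W)\bigr)^2 = \sigma_1^2 + \frac{2}{1+\tau}\,E_\theta(\Delta - W)\varphi(W) + E_\theta\varphi^2(W) = \sigma_1^2 - \frac{2\sigma_W^2}{1+\tau}\,E_\theta\varphi'(W) + E_\theta\varphi^2(W),$$
where the second equality holds when $\varphi$ is absolutely continuous and $\tau = \sigma_2^2/\sigma_1^2$.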
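Since both expectations in Theorem 2.1 involve only $W \sim N(\Delta, \sigma_W^2)$, the first expression can be evaluated by one-dimensional quadrature. The following Python sketch assumes a plain trapezoidal rule on a wide grid is accurate enough; the function names and grid settings are illustrative choices, not the authors' code.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def expect_w(h, delta, sd_w, n=4001, width=10.0):
    """E h(W) for W ~ N(delta, sd_w^2), trapezoidal rule on delta +/- width*sd_w."""
    lo = delta - width * sd_w
    step = 2.0 * width * sd_w / (n - 1)
    total = 0.0
    for k in range(n):
        w = lo + k * step
        weight = 0.5 if k in (0, n - 1) else 1.0
        total += weight * h(w) * normal_pdf(w, delta, sd_w)
    return total * step


def mse_theorem_2_1(phi, delta, var1, var2):
    """MSE of Y1 + phi(W) from the first expression in Theorem 2.1."""
    tau = var2 / var1
    sd_w = math.sqrt(var1 + var2)
    cross = expect_w(lambda w: (delta - w) * phi(w), delta, sd_w)
    square = expect_w(lambda w: phi(w) ** 2, delta, sd_w)
    return var1 + 2.0 / (1.0 + tau) * cross + square
```

For a linear choice such as `phi = lambda w: 0.5 * w` with unit variances and $\Delta = 0.5$, the result should agree with the elementary value $(1 - 0.5)^2 + 0.5^2(1 + 0.25) = 0.5625$ up to quadrature error.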
A number of particular choices for $\hat\beta(W)$ present themselves. These are listed below as special cases.
Special case 2.1 (WLEU): In this example we let $\Delta \to \infty$ to obtain $\hat\beta(W) \equiv 0$. We thereby obtain the unbiased linear estimator $Y_1$ for $\theta_1$ that comes directly from Fisher’s classical theory. This estimator’s MSE is of course $\sigma_1^2$. It is inadmissible and it can be shown that the maximum likelihood estimator (MLE) of $\theta_1$ is a Brewster–Zidek improvement on it (see [10]).
Explicitly, the MLE referred to above, i.e. the first component of the MLE of $(\theta_1, \theta_2)$ under the restriction $|\theta_1 - \theta_2| \le c$, is given by
$$\delta_{\text{MLE}}(Y_1, Y_2) = Y_1 + \frac{(W - c)\,I(W > c) + (W + c)\,I(W < -c)}{1 + \tau}. \tag{4}$$
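Formula (4) translates directly into code. The sketch below is a hypothetical helper (its name and argument order are assumptions) that returns the first component of the restricted MLE for known variances.

```python
def mle_theta1(y1, y2, c, var1, var2):
    """First component of the MLE of (theta1, theta2) under |theta1 - theta2| <= c.

    Implements (4): Y1 is left alone when |W| <= c and is otherwise moved
    by the excess of W over the bound, scaled by 1/(1 + tau).
    """
    tau = var2 / var1
    w = y2 - y1
    if w > c:
        shift = w - c
    elif w < -c:
        shift = w + c
    else:
        shift = 0.0
    return y1 + shift / (1.0 + tau)
```

With unit variances and $c = 1$, the pair $(y_1, y_2) = (0, 3)$ gives the estimate 1, while any pair with $|y_2 - y_1| \le 1$ returns $y_1$ unchanged.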
Special case 2.2 (WLEM): We may obtain an estimator by evaluating the MSE of the original estimator $Y_1 + W\beta$ as a function of $\beta$ and $\Delta$. It is easily shown that as long as $|\Delta| \le c$ we can achieve a uniformly smaller MSE than that of $Y_1$ by replacing $\Delta^2$ in $\beta_{\text{optimal}}$ by $c^2$. Denote the estimator obtained in that way by WLEM, a kind of “minimax” estimator of $\theta_1$.
Special case 2.3 (WLEN): Still another WLE, WLEN, is obtained by naive “plug-in” estimation, that is, by using $\min\{W^2, c^2\}$ to estimate $\Delta^2$. This naive estimator is an analogue of the James-Stein estimator. The analogy is seen by assuming equal standard deviations, $\sigma_1 = \sigma_2 = \sigma$, and supposing that, unlike the present case of just $p = 2$ populations, the James-Stein estimator obtains in the case of $p > 3$ populations. In that case, the Akaike optimum weight ratio becomes
$$\frac{\sigma^2}{p\sigma^2 + \sum_{i=1}^{p}(\theta_i - \bar\theta)^2}.$$
Now when $\theta_i \equiv \bar\theta$, one would expect the James-Stein estimator to perform well. In that case we find an unbiased estimator of the optimal weight to be
$$\frac{\sigma^2 (p - 3)}{\sum_{i=1}^{p}(Y_i - \bar Y)^2}$$
and thereby obtain the celebrated estimator. However, this approach does not work in the case $p < 4$, and we adopt the naive plug-in estimator instead.
Special case 2.4 (WLED): However, the estimator of $\Delta$ obtained by truncating $W$ at $-c$ or $c$ is dominated by that which truncates instead at $\pm c^{*}$, where $c^{*} = c^{*}(W) = c\tanh(c|W|/\sigma_W^2)$. We denote the estimator obtained by this potential improvement by WLED.
Special case 2.5 (WLE0): Another choice, denoted by WLE0, is obtained by taking $\Delta$ to be zero in the Akaike optimum formula, that is
$$\hat\beta(W) = \hat\beta_0(W) \stackrel{\text{defn}}{=} \frac{\sigma_1^2}{\sigma_W^2}.$$
This choice gives us an unbiased “estimator” of $\beta_{\text{optimal}}$ precisely when the information from population 2 would be expected to contribute the most.
We see in these examples that depending on how we “estimate” $\Delta^2$ in $\beta_{\text{optimal}}$ we can obtain WLEU, WLEM, WLEN, WLED and WLE0. The following theorem gives their (in)admissibility properties, as well as the MLE’s, and compares WLEM, WLE0 and the MLE with WLEU.
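The five special cases differ only in the value substituted for $\Delta^2$ in the optimal weight, which a short sketch makes explicit. The function name and the string labels below are hypothetical. The WLED branch rests on an assumption about the intended construction: it truncates $W^2$ at the square of $c\tanh(c|W|/\sigma_W^2)$, mirroring the way WLEN truncates at $c^2$.

```python
import math


def beta_hat(kind, w, c, var1, var2):
    """Plug-in weight sigma_1^2 / (sigma_W^2 + estimate of Delta^2) for each special case."""
    var_w = var1 + var2
    if kind == "WLEU":
        return 0.0                      # Delta -> infinity: no borrowing from Y2
    if kind == "WLE0":
        delta_sq = 0.0                  # Delta taken to be zero
    elif kind == "WLEM":
        delta_sq = c * c                # worst case allowed by the restriction
    elif kind == "WLEN":
        delta_sq = min(w * w, c * c)    # naive plug-in estimate
    elif kind == "WLED":
        bound = c * math.tanh(c * abs(w) / var_w)
        delta_sq = min(w * w, bound * bound)   # assumed reading of the tanh truncation
    else:
        raise ValueError("unknown estimator: " + kind)
    return var1 / (var_w + delta_sq)
```

The corresponding estimate of $\theta_1$ is then `y1 + w * beta_hat(kind, w, c, var1, var2)` with `w = y2 - y1`.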
Theorem 2.2. The estimators WLEU, WLEM, WLEN, WLED and the MLE are inadmissible. The estimator WLE0 is admissible. Each of the estimators WLEM and MLE dominates WLEU. WLE0 dominates WLEU if and only if $c < \sigma_W$.
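The following theorem gives explicit dominators for each of the inadmissible ones.
Theorem 2.3. The inadmissible ones among the above defined estimators $Y_1 + \varphi(W)$ are dominated by the estimator $\delta$ defined by
$$\delta(Y_1, Y_2) = \frac{\tau Y_1 + Y_2}{1 + \tau} - \delta^{*}(W),$$
where $\tau = \sigma_2^2/\sigma_1^2$ and $\delta^{*}(W)$ is the projection of $W/(1+\tau) - \varphi(W)$ onto the interval
$$\left[-\frac{c}{1+\tau}\tanh\!\left(\frac{c|W|}{\sigma_W^2}\right),\; \frac{c}{1+\tau}\tanh\!\left(\frac{c|W|}{\sigma_W^2}\right)\right].$$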
Theorem 2.4. For the dominators of WLEU and MLE, DWLEU and DMLE respectively, given by the previous theorem, we have DWLEU = DMLE.
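The projection in Theorem 2.3 amounts to clamping one quantity into a data-dependent interval, and Theorem 2.4 can then be checked numerically. The Python sketch below uses hypothetical function names; `phi` stands for the function $\varphi$ of the parent estimator $Y_1 + \varphi(W)$.

```python
import math


def dominator(y1, y2, phi, c, var1, var2):
    """Dominator of Y1 + phi(W) as described in Theorem 2.3."""
    tau = var2 / var1
    var_w = var1 + var2
    w = y2 - y1
    half_width = c / (1.0 + tau) * math.tanh(c * abs(w) / var_w)
    target = w / (1.0 + tau) - phi(w)
    projected = max(-half_width, min(half_width, target))  # projection onto the interval
    return (tau * y1 + y2) / (1.0 + tau) - projected


def mle_phi(w, c, tau):
    """phi(W) for the restricted MLE in (4)."""
    if w > c:
        return (w - c) / (1.0 + tau)
    if w < -c:
        return (w + c) / (1.0 + tau)
    return 0.0
```

Calling `dominator` with `phi = lambda w: 0.0` (WLEU) and with `phi = lambda w: mle_phi(w, c, tau)` (MLE) should return the same value for every data pair, in line with Theorem 2.4.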
The approach used in this section yields a smooth estimator since $\beta$ is “fitted” to the data only after the MSE has been computed. In particular it is a differentiable function of $W$ in contrast to the truncated MLE of $\theta_1$ which is not. However, Hu and Zidek [21] emphasize that, in general, the specification of the relevance weights should best be done in the specific inferential context.
To conclude this section, we compare numerically, for $c = 1$, all the specific estimators given above (except the MLE). Their MSE functions are depicted in Fig. 1. In this and later figures we show the MSEs over the region $(-2, 2)$ even though only $(-1, 1)$ is technically relevant. We do this because we recognize that, in practice, specifying a suitable $c$ will not be easy. We therefore need to examine the performance of our estimators in the wider interval to determine their robustness against the possible underestimation of $c$.
Observe that WLE0 does very well when $\Delta = 0$, as one might expect. That performance comes at a heavy price; the estimator is quite non-robust and its MSE grows rapidly outside the interval $(-1, 1)$. The remaining estimators, apart from the unbiased estimator, have very similar MSEs. On the basis of its simplicity, we have selected the WLEN for further consideration from this group.
Two additional performance comparisons relating to the effect of poor data quality seem in order. In the first comparison the data from population 2 are poorer than those from 1 (more specifically $\sigma_2^2 = 3$ while $\sigma_1^2 = 1$). We compare the MSEs for all the WLE estimates under these circumstances in Fig. 2. In contrast, the data from population 1 are poorer than those from 2 ($\sigma_2^2 = 1$ while $\sigma_1^2 = 3$) for the comparison given in Fig. 3.
Fig. 2 confirms what heuristics suggest, that the MSE performance of all the estimators considered thus far in this section declines relative to the natural benchmark, that of WLEU; the auxiliary information in the sample from population 2 is much less valuable. Note however, that the relative performance seen in Fig. 1 still obtains.
In contrast, we see that the MSE performance relative to the WLEU of all the estimators improves in Fig. 3. Again this seems to accord with our intuition. In summary, the benefit of using $Y_2$ in estimating $\theta_1$ will depend on its quality relative to $Y_1$. A very large improvement can be achieved under favorable circumstances.
We may apply Theorem 2.3 to obtain additional estimators for consideration, that might improve on WLEU, WLEM, WLEN, and MLE. (WLE0, being admissible, is unchanged on application of that theorem.) However, numerical analysis shows only a small uniform improvement (of less than 4% at $\Delta = 0$) for WLEM and WLEN. In contrast, the MSE of the “dominator” of WLEU and MLE proves substantially smaller than that of its “parents”, WLEU and MLE (see Figs. 4 and 5).
Fig. 1. MSE functions for selected weighted likelihood estimators when the population variances are both 1. The minimax value in this case is 0.66 (see text).
Fig. 2. Graphs of normalized MSE functions for selected estimators when population 2 (variance = 3) is overdispersed relative to 1 (variance = 1). The minimax value in this case is 0.80 (see text).
Fig. 3. Graphs of normalized MSE functions for selected estimators when population 1 (variance = 3) is overdispersed relative to 2 (variance = 1). The minimax value in this case is 1.20 (see text).
Thus, of the three estimators obtained by application of Theorem 2.3, only DWLEU = DMLE seems worthy of further consideration. To that end, we numerically compare it with WLEN in Fig. 6.
We find that although DWLEU = DMLE has a slight advantage over WLEN when $\Delta$ is near 1, over most of the range $(-1, 1)$ the opposite is true.
Overall, simplicity and performance point to WLEN as the estimator of choice at this point in the analysis. We are thus led in the sequel to restrict attention to just WLEN, which dominates the WLEU and at the same time is both simple and quite robust.
However, WLEN is not admissible (see Theorem 2.2) and our numerical results show that its dominator from application of Theorem 2.3 has uniformly smaller MSE. Furthermore, without a minimax value for this problem, we are unable to comment on its optimality with respect to the minimax criterion. Our quest for optimum performance points to the desirability of finding a minimax, admissible estimator as a possible competitor to the WLEN. We turn to that problem in the next section.
Fig. 4. MSE functions for WLEU and its dominator from Theorem 2.3 when the population variances are both 1.
Fig. 5. MSE functions for the MLE and its dominator from Theorem 2.3 when the population variances are both 1.
Fig. 6. Comparison of the MSE functions for WLEN and DWLE0 when the population variances are both 1.
3. A minimax estimator
The next theorem gives the minimax value for the case where $|\theta_2 - \theta_1| \le c$ as well as a minimax estimator based on $(Y_1, Y_2)$ for this case. This result holds only for “small” values of $c$.
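Theorem 3.1. The minimax value for the problem of estimating $\theta_1$ based on $(Y_1, Y_2)$ under the restriction $|\theta_2 - \theta_1| \le c$ is, for $c \le m_o\sigma_W$, given by
$$\frac{\sigma_1^2\sigma_2^2}{\sigma_W^2} + \frac{\sigma_1^4}{\sigma_W^2}\,\sup_{|\nu| \le m} E_\nu\bigl(m\tanh(mZ) - \nu\bigr)^2,$$
where $m = c/\sigma_W$, $Z$ is a $N(\nu, 1)$ random variable and $m_o \approx 1.056742$.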
A minimax estimator (MIN) of $\theta_1$ for this problem is given by
$$\frac{\tau Y_1 + Y_2}{1 + \tau} - \frac{c}{1 + \tau}\tanh\!\left(\frac{c}{\sigma_W^2}(Y_2 - Y_1)\right).$$
This minimax estimator is unique (and thus admissible) among estimators of the form $\delta_1\bigl((\tau Y_1 + Y_2)/(1 + \tau)\bigr) - \delta_2(Y_2 - Y_1)$. It dominates WLEU.
The exact definition of the constant $m_o$ in the above Theorem 3.1 can be found in Casella and Strawderman [13]. A table of values of $\sup_{|\nu| \le m} E_\nu\bigl(m\tanh(mZ) - \nu\bigr)^2$ for selected values of $m$ is also given there.
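The estimator MIN and the range condition of Theorem 3.1 are simple to code. In the sketch below the function names are hypothetical, and the constant is the approximate value of $m_o$ quoted in the theorem.

```python
import math

M_O = 1.056742  # approximate value of m_o quoted in Theorem 3.1


def within_minimax_range(c, var1, var2):
    """True when c <= m_o * sigma_W, the condition under which Theorem 3.1 applies."""
    return c <= M_O * math.sqrt(var1 + var2)


def minimax_theta1(y1, y2, c, var1, var2):
    """The estimator MIN: pooled mean minus a tanh-shrunk multiple of W = Y2 - Y1."""
    tau = var2 / var1
    var_w = var1 + var2
    pooled = (tau * y1 + y2) / (1.0 + tau)
    return pooled - c / (1.0 + tau) * math.tanh(c * (y2 - y1) / var_w)
```

With unit variances, $\sigma_W = \sqrt{2}$, so $c = 1$ lies inside the range covered by the theorem while $c = 2$ does not.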
We have thus succeeded in finding a minimax estimator. However, we do not know if this estimator is admissible in the class of all estimators. We now compare this estimator numerically with WLEN in Fig. 7.
That figure suggests that WLEN is not minimax; its maximum MSE is slightly larger than that of MIN, which for a value of $c = 1$ is known to be minimax by the theorem in this section. However, its MSE is close to that of MIN over most of the relevant range in the parameter space and is actually smaller over much of it. Thus for practical purposes it would seem to be a more desirable alternative to MIN. Moreover, the optimality of MIN is known to obtain only when $c$ is small. Our search for a practical competitor to WLEN continues in Section 5 where we seek generalized Bayes competitors to the WLEN in the general case.
4. The Pitman estimator
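The Pitman estimator (PIT) is the first component of the generalized Bayes estimator of $(\theta_1, \theta_2)$ with respect to the uniform prior on the restricted parameter space $\Theta$. van Eeden and Zidek [33] show that this estimator is given by
$$\delta_P(Y_1, Y_2) = Y_1 + \sigma_1^2\,T(W), \tag{5}$$
where
$$T(W) = \frac{1}{\sigma_W}\,\frac{\phi\!\left(\frac{W - c}{\sigma_W}\right) - \phi\!\left(\frac{W + c}{\sigma_W}\right)}{\Phi\!\left(\frac{W + c}{\sigma_W}\right) - \Phi\!\left(\frac{W - c}{\sigma_W}\right)},$$
with $\phi$ and $\Phi$ the standard normal density and distribution function.
As noted in the next theorem, they show that this estimator dominates WLEU and that there exists no estimator with an MSE that is uniformly smaller than that of the Pitman estimator over the parameter space.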
Theorem 4.1. The Pitman estimator $\delta_P$ is admissible and it dominates WLEU.
In Fig. 8 we see in the comparison of WLEN with the Pitman estimator how very non-robust the latter is against the mis-specification of $c$. Indeed, the MSE rises steeply at either extreme when $\Delta$ falls outside the interval $(-1, 1)$. This would lead us to reject PIT as unrealistic in practical applications. Moreover, it points to the need to search for more robust Bayes alternatives.
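A numerical sketch of the Pitman estimator follows. It takes the estimator to have the form $Y_1 + \sigma_1^2 T(W)$, which is what the robust version in Section 5 reduces to when $\lambda = \sigma_W$; treat that form, and the function names, as assumptions of this illustration. Only the Python standard library is used, with the normal distribution function written through `math.erf`.

```python
import math


def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def pitman_t(w, c, sd_w):
    """T(W): ratio of density and distribution-function differences, divided by sigma_W.

    For very large |w| the denominator underflows, so this sketch is only
    meant for moderate values of w.
    """
    num = norm_pdf((w - c) / sd_w) - norm_pdf((w + c) / sd_w)
    den = norm_cdf((w + c) / sd_w) - norm_cdf((w - c) / sd_w)
    return num / (den * sd_w)


def pitman_theta1(y1, y2, c, var1, var2):
    """Assumed form of the Pitman estimate of theta1: Y1 + sigma_1^2 * T(W)."""
    sd_w = math.sqrt(var1 + var2)
    return y1 + var1 * pitman_t(y2 - y1, c, sd_w)
```

A useful sanity check is that $W - \sigma_W^2 T(W)$, the mean of a $N(W, \sigma_W^2)$ distribution truncated to $[-c, c]$, always falls inside $(-c, c)$.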
5. Robust Bayes estimators
In deriving the Pitman estimator in the last section, we imposed our beliefs directly on the difference $\theta_2 - \theta_1$. We could then put a uniform prior on the resulting restricted parameter space to derive the estimator.
However, in doing so we violate Lindley’s version of Cromwell’s rule: “Do not put probability zero on anything.” Clearly $|\theta_2 - \theta_1|$ could be slightly larger than $c$; specifying the latter well will not generally be easy in practice. A small $c$ is obviously desirable for maximizing the benefits of the trade-off. On the other hand, if $c$ is chosen to be too small and in fact the true mean difference exceeds $c$ in absolute value, significant losses would be anticipated. Yet under Cromwell’s rule, that possibility cannot be ignored.
A more robust approach would avoid imposing our beliefs at the first level of a hierarchical Bayesian analysis and instead impose them at the second level, where their imposition would have smaller impact. To be more precise, we assume at the first stage of the analysis that the population means are independently distributed with $\theta_j \sim N(\mu_j, \gamma_j^2)$. Then at the second stage we suppose that $\mu_1 - \mu_2$ lies in the interval $[-c, c]$. It is then easily seen that, conditional on $(Y_1, Y_2)$, the $\theta_j$, $j = 1, 2$, are independent $N\bigl(b_j Y_j + (1 - b_j)\mu_j,\ \sigma_j^2\gamma_j^2/\lambda_j^2\bigr)$, where, for $j = 1, 2$, $\lambda_j^2 = \sigma_j^2 + \gamma_j^2$ and $b_j = \gamma_j^2/\lambda_j^2$. For any proposed estimator, say $\hat\theta_1 = \hat\theta_1(Y_1, Y_2)$, of $\theta_1$ we then find
$$E\bigl((\hat\theta_1 - \theta_1)^2 \mid Y_1, Y_2\bigr) = \bigl(\hat\theta_1 - (b_1 Y_1 + (1 - b_1)\mu_1)\bigr)^2 + \frac{\sigma_1^2\gamma_1^2}{\lambda_1^2}. \tag{6}$$
Then we can, in (6), compute the expectation over the $Y_i$’s to obtain the Bayes risk of $\hat\theta_1$ as a function of the $\mu$’s. In fact, ignoring irrelevant terms and positive factors that do not depend on $\hat\theta_1$ we obtain $E(\hat\mu_1 - \mu_1)^2$, where
$$\hat\theta_1 = b_1 Y_1 + (1 - b_1)\hat\mu_1. \tag{7}$$
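Now note that, marginally, the $Y_j$, $j = 1, 2$, are independent $N(\mu_j, \lambda_j^2)$ with $|\mu_2 - \mu_1| \le c$. The Pitman estimator $\delta_P(Y_1, Y_2)$ of $\mu_1$ for this problem is given by (5) with $\sigma_i^2$ replaced by $\lambda_i^2$. From (7) we then obtain our robust estimator of $\theta_1$, namely
$$\delta_{RP}(Y_1, Y_2) = b_1 Y_1 + (1 - b_1)\,\delta_P(Y_1, Y_2) = Y_1 + \sigma_1^2\,T'(W), \tag{8}$$
where $\lambda^2 = \lambda_1^2 + \lambda_2^2$, $c' = c/\lambda$, $1 - b_1 = \sigma_1^2/\lambda_1^2$, $W' = W/\lambda$ and
$$T'(W) = \frac{1}{\lambda}\,\frac{\phi(W' - c') - \phi(W' + c')}{\Phi(W' + c') - \Phi(W' - c')}.$$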
The following theorem shows that $\delta_{RP}$ is inadmissible when $\lambda > \sigma_W$, but dominates WLEU for all $\lambda$.
Theorem 5.1. The estimator $\delta_{RP}$ is inadmissible when $\lambda > \sigma_W$. It dominates WLEU for all $\lambda$.
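The robust estimator in (8) differs from the Pitman estimator only through the scale used inside $T'$. The sketch below is a hypothetical helper which assumes that $\lambda^2 = \lambda_1^2 + \lambda_2^2$, the marginal variance of $W$ under the two-stage model; setting both prior variances to zero then gives $\lambda = \sigma_W$.

```python
import math


def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def robust_pitman_theta1(y1, y2, c, var1, var2, gamma_sq1, gamma_sq2):
    """delta_RP = Y1 + sigma_1^2 * T'(W), with W and c rescaled by lambda.

    Assumes lambda^2 = (sigma_1^2 + gamma_1^2) + (sigma_2^2 + gamma_2^2).
    """
    lam = math.sqrt(var1 + gamma_sq1 + var2 + gamma_sq2)
    w_s = (y2 - y1) / lam
    c_s = c / lam
    num = norm_pdf(w_s - c_s) - norm_pdf(w_s + c_s)
    den = norm_cdf(w_s + c_s) - norm_cdf(w_s - c_s)
    return y1 + var1 * num / (lam * den)
```

For the RBAY configuration described in the text one would call it with `gamma_sq1 = gamma_sq2 = 0.35`.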
The proof of this theorem, which is given in the appendix, is not a generalization of the proof given in [33] for the special case where $\lambda = \sigma_W$. That proof is based on an application of Kubokawa’s [22] technique and, as far as we can see, that proof does not generalize to the case where $\lambda > \sigma_W$.
In Fig. 9 we compare the WLEN with a robust Bayes (RBAY) alternative to PIT obtained by taking $\gamma_1^2 = \gamma_2^2 = 0.35$. As expected the latter is much more robust to mis-specification of $c$. Moreover, its MSE compares favorably with that of WLEN. However, we are led to choose WLEN on the basis of its greater computational simplicity. Moreover, its character is much more easily discerned.
6. Concluding remarks
In this paper we have shown how an extension to Fisher’s maximum likelihood function, the weighted likelihood (WL) function, can provide estimates for parameters in a restricted parameter space. We have concentrated on the first of two parameters, $\theta_1$, but by symmetry we also obtain estimates for $\theta_2$ when the difference of the two, $\Delta = \theta_2 - \theta_1$, is known to be bounded in absolute value by a specified constant, $c$. In particular, we obtain a Stein-like shrinkage of the classical MLE, $(Y_1, Y_2)$, of $(\theta_1, \theta_2)$. Of course, unlike its famous counterpart in the case when more than three means are being estimated, our estimator does not seem to be minimax based on our numerical values. But unlike its classical counterpart it exploits the known restriction on $\Delta$ to produce large increases in precision over the classical MLE, at least when $c$ is small. Moreover, the simple alternative to that MLE we discover through the use of the WL compares very favorably with a number of competitors introduced in this paper, judging by its MSE of estimation. In particular, its MSE seems almost identical to that of a robust Bayes estimator based on a numerical comparison.
Perhaps the most important message in this paper is that very large gains in precision may be achieved by judiciously exploiting the natural restrictions that will obtain in any realistic problem encountered in practice. Reductions on the order of say 50% in the MSE seem quite realistic in many problems. Thus it seems extraordinary that so little attention has been paid to these restrictions in the development of statistical theory.
One reason for this may be the difficulty in formulating the restrictions. Like the models underlying the inferences to be made, the restrictions will be susceptible to uncertainty and practitioners may be reluctant to impose these restrictions (even though they are prepared to introduce the models). Our next figure (Fig. 10) demonstrates numerically the impact of over- and under-estimating $c$.
Notice in this figure that when $c$ is small ($c = 0.5$) very large reductions in the MSE can be achieved when $\Delta = 0$. However, the analyst pays a very heavy price if she or he has misspecified $c$ and in fact $|\Delta|$ is, say, 1 (when the optimal choice of $c$ would have been $c = 1.0$). At the other extreme, if the investigator conservatively specifies $c = 1.5$ when in fact $|\Delta|$ is, say, 0.5 (and the optimal choice of $c$ would have been 0.5), the reduction in the MSE tends to be rather small over the restricted space where $|\Delta| \le 0.5$. The practical problem of prescribing appropriate restrictions in parametric inference seems to have received little attention, leaving the practitioner with little to guide him or her in making that prescription.

Fig. 9. Comparison of the MSE functions for WLEN and RBAY when the population variances are both 1.
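The sensitivity to the choice of $c$ described above can be explored numerically. The following is only a sketch, not the software used for the figures: it assumes the WLEN adjustment $\varphi(W) = \sigma_1^2 W/(\sigma_W^2 + \min(W^2, c^2))$ given in the Appendix, uses the identity $\mathrm{MSE} = \sigma_1^2 - 2(\sigma_1^2/\sigma_W^2)E[(W-\Delta)\varphi(W)] + E[\varphi(W)^2]$ with $W \sim N(\Delta, \sigma_W^2)$, and evaluates the expectations by a simple trapezoidal rule. The function names and integration settings are illustrative choices.

```python
import math

def wlen_phi(w, c, s1_sq, s2_sq):
    """Adjustment added to Y1 by WLEN: s1^2 * W / (sW^2 + min(W^2, c^2))."""
    sw_sq = s1_sq + s2_sq
    return s1_sq * w / (sw_sq + min(w * w, c * c))

def wlen_mse(delta, c, s1_sq=1.0, s2_sq=1.0, n=4000, width=10.0):
    """MSE of Y1 + phi(W) at true difference delta (trapezoidal integration).

    Integrates over W ~ N(delta, sW^2) on delta +/- width * sW.
    """
    sw_sq = s1_sq + s2_sq
    sw = math.sqrt(sw_sq)
    lo = delta - width * sw
    h = 2.0 * width * sw / n
    total = 0.0
    for i in range(n + 1):
        w = lo + i * h
        dens = math.exp(-0.5 * ((w - delta) / sw) ** 2) / (sw * math.sqrt(2.0 * math.pi))
        p = wlen_phi(w, c, s1_sq, s2_sq)
        integrand = (-2.0 * (s1_sq / sw_sq) * (w - delta) * p + p * p) * dens
        total += integrand * (0.5 if i in (0, n) else 1.0)
    return s1_sq + total * h

# Assumed bound c versus true |Delta|, with both variances equal to 1.
for c in (0.5, 1.0, 1.5):
    print(c, [round(wlen_mse(d, c), 3) for d in (0.0, 0.5, 1.0)])
```

Under these assumptions, a small assumed bound gives the largest gain at $\Delta = 0$, and that gain erodes as the true $|\Delta|$ moves beyond the assumed bound.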
A natural question that may arise at this point concerns the case of a single parameter, say $\theta_1$, restricted to an interval, a problem that has been much investigated by others. The question is: can the WLE offer any assistance here? In fact that proves to be a special case of our analysis for the two means problem, namely the case when $\sigma_2 = 0$, so that when $Y_2$ is observed $\theta_2$ becomes known. Without loss of generality we may assume $\theta_2 = 0$, so that the condition $|\Delta| \le c$ becomes $|\theta_1| \le c$, exactly the restriction studied by Casella and Strawderman [13].
By specializing our results to this case we find a new estimator for $\theta_1$, obtained from WLEN by letting $\sigma_2 = 0$:
$$\hat\theta_1 = (1 - \hat\beta)\,Y_1,$$
where the shrinkage factor is given by
$$(1 - \hat\beta) = \frac{\min\{Y_1^2, c^2\}}{\sigma_1^2 + \min\{Y_1^2, c^2\}}.$$
This estimator WLEN proves to be substantially better than $Y_1$ over the restricted parameter space. However, Theorem 2.3 yields another, DWLEN, that dominates it.
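As a small illustration of the one-parameter WLEN rule, the sketch below shrinks $Y_1$ by the factor $\min\{Y_1^2, c^2\}/(\sigma_1^2 + \min\{Y_1^2, c^2\})$. The function name is hypothetical, and the code covers only the special case $\sigma_2 = 0$, $\theta_2 = 0$ discussed here.

```python
def wlen_single_mean(y1, c, sigma1_sq):
    """WLEN for one bounded normal mean (|theta1| <= c, theta2 = 0 known).

    Multiplies y1 by min(y1^2, c^2) / (sigma1^2 + min(y1^2, c^2)).
    """
    m = min(y1 * y1, c * c)
    return y1 * m / (sigma1_sq + m)

# Observations near zero are shrunk heavily; large ones by c^2 / (sigma1^2 + c^2).
print(wlen_single_mean(0.5, 1.0, 1.0))
print(wlen_single_mean(3.0, 1.0, 1.0))
```

Note that the shrunken value can still fall outside $[-c, c]$ when $|Y_1|$ is large, which is consistent with the estimator being dominated by a projected version.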
Fig. 10. Comparison of the MSE functions for WLEN for various choices of $c$: (1) $c = 0.5$, (2) $c = 1.0$ and (3) $c = 1.5$, when the mean difference is bounded by 1 and the population variances are both 1.
We wondered if we could have obtained similar gains had we applied that theorem directly to the unbiased estimator $Y_1$ and obtained DWLEU. Fig. 11 gives a numerical comparison of these three estimators when $c = 1$, a case for which the Casella–Strawderman estimator is known to be minimax; we have taken $\sigma_2 = 10^{-9}$ to approximate $\sigma_2 = 0$ while still allowing us to use the software constructed for the original problem.
Comparing our estimators' risks against that for the dominator of the estimator WLEU, we find the WLEN is quite competitive. Their minimax values over the relevant interval are very similar. However, WLEN does considerably better around $|\Delta| = |\theta_1| = 0$.
Evidently, DWLEU makes its bias-variance trade-off in quite a different way than does WLEN. In fact, its MSE function seems quite similar in shape to that of the Casella–Strawderman estimator and much different from that of the WLEN. Moreover, Casella and Strawderman [13] give the minimax value for this case, where $\sigma_1^2 = 1$, $\sigma_2^2 = 0$ and $c = 1$, as 0.45. The DWLEU achieves the same minimax value to two decimal places (albeit with $\sigma_2^2$ not quite 0). However, it cannot be minimax when $\sigma_2^2 = 0$ since Casella and Strawderman show their estimator is the unique minimax estimator for this case. More study of this case seems worthwhile and is left for future work. In any case, since the WLEN is quite competitive with the Casella–Strawderman estimator when $c$ is small, the WL seems a valuable tool even in this one-dimensional case.
Fig. 11. Comparison of the MSE functions for WLEN, DWLEN and DWLEU when $\sigma_2$ is about 0, the
Appendix A
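Proof of Theorem 2.1. First note that the conditional distribution of $Y_1$, given $W$, is
$$N\!\left(\frac{\sigma_2^2\theta_1 + \sigma_1^2(\theta_2 - W)}{\sigma_W^2},\ \frac{\sigma_1^2\sigma_2^2}{\sigma_W^2}\right). \tag{A.1}$$
The first expression for the MSE then follows from (A.1) and the fact that
$$E_\theta\,(Y_1 - \theta_1)\varphi(W) = E_\theta\big[\varphi(W)\,E_\theta\big((Y_1 - \theta_1)\mid W\big)\big].$$
The second expression follows from Stein's identity, which says that, for absolutely continuous $\varphi$,
$$E_\theta\,(W - \Delta)\varphi(W) = \sigma_W^2\,E_\theta\,\varphi'(W). \qquad \square$$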
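The following two lemmas are needed in the proofs of inadmissibility. They also provide explicit dominators. The first of these lemmas uses a rotation technique of Blumenthal and Cohen [9] (see also [16]). Let
$$X_1 = \frac{\tau Y_1 + Y_2}{1+\tau},\quad X_2 = \frac{-Y_1 + Y_2}{1+\tau},\quad \mu_1 = E_\theta X_1 = \frac{\tau\theta_1 + \theta_2}{1+\tau},\quad \mu_2 = E_\theta X_2 = \frac{-\theta_1 + \theta_2}{1+\tau}; \tag{A.2}$$
then $|\theta_2 - \theta_1| \le c$ if and only if $|\mu_2| \le c/(1+\tau)$, and $\mu_1$ is unrestricted. Further note that
$$Y_1 = X_1 - X_2,\quad Y_2 = X_1 + \tau X_2,\quad \theta_1 = \mu_1 - \mu_2,\quad \theta_2 = \mu_1 + \tau\mu_2 \tag{A.3}$$
and that $X_1$ and $X_2$ are independent normal random variables.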
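The rotation (A.2) and its inverse (A.3) are easy to check numerically. The sketch below assumes $\tau = \sigma_2^2/\sigma_1^2$, the value at which $\mathrm{Cov}(X_1, X_2)$ vanishes for independent $Y_1$, $Y_2$; that definition is not restated in this appendix, so it is labelled here as an assumption. The function names are illustrative.

```python
def rotate(y1, y2, tau):
    """Map (Y1, Y2) to (X1, X2) as in (A.2)."""
    x1 = (tau * y1 + y2) / (1.0 + tau)
    x2 = (y2 - y1) / (1.0 + tau)
    return x1, x2

def rotate_back(x1, x2, tau):
    """Inverse map (A.3): Y1 = X1 - X2, Y2 = X1 + tau * X2."""
    return x1 - x2, x1 + tau * x2

def cov_x1_x2(sigma1_sq, sigma2_sq, tau):
    """Cov(X1, X2) when Y1 and Y2 are independent with the given variances."""
    return (-tau * sigma1_sq + sigma2_sq) / (1.0 + tau) ** 2

# With tau = sigma2^2 / sigma1^2 (assumed), the covariance is zero.
print(cov_x1_x2(2.0, 3.0, 3.0 / 2.0))
print(rotate_back(*rotate(1.0, 2.0, 0.5), 0.5))
```

Because $X_1$ and $X_2$ are jointly normal, zero covariance gives the independence used in the lemmas that follow.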
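Lemma A.1. The estimator $Y_1 + \varphi(W)$ is inadmissible for estimating $\theta_1$ based on $(Y_1, Y_2)$ under the restriction $|\theta_2 - \theta_1| \le c$ if $\delta_2(X_2) = X_2 - \varphi((1+\tau)X_2)$ is inadmissible for estimating $\mu_2$ based on $X_2$ under the restriction $|\mu_2| \le c/(1+\tau)$. Further, if $\delta_2^*(X_2)$ dominates $\delta_2(X_2)$ for estimating $\mu_2$ based on $X_2$ under the restriction $|\mu_2| \le c/(1+\tau)$, then
$$\frac{\tau Y_1 + Y_2}{1+\tau} - \delta_2^*\!\left(\frac{W}{1+\tau}\right)$$
dominates $Y_1 + \varphi(W)$ for estimating $\theta_1$ based on $(Y_1, Y_2)$ under the restriction $|\theta_2 - \theta_1| \le c$.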
Proof. Using (A.2), we find
$$Y_1 + \varphi(W) = X_1 - \delta_2(X_2),$$
where $X_1$ and $X_2$ are independent. Further, $X_1$ is admissible for estimating the unrestricted $\mu_1$. So, if $\delta_2$ is inadmissible for estimating $\mu_2$ based on $X_2$ under the restriction $|\mu_2| \le c/(1+\tau)$, then $X_1 - \delta_2(X_2)$ is inadmissible for estimating $\mu_1 - \mu_2$ based on $(X_1, X_2)$ under this same restriction. Further, if $\delta_2^*(X_2)$ dominates $\delta_2(X_2)$ as an estimator of $\mu_2$ based on $X_2$ under the restriction $|\mu_2| \le c/(1+\tau)$, then $X_1 - \delta_2^*(X_2)$ dominates $X_1 - \delta_2(X_2)$ as an estimator of $\mu_1 - \mu_2$ based on $(X_1, X_2)$ under this same restriction. □
Dominators $\delta_2^*$ as mentioned in Lemma A.1 can be obtained from the following Lemma A.2. It contains a very special case of a result of Moors ([24] and [25, Theorem 3.9, p. 46]) and gives sufficient conditions for an estimator of a bounded normal mean to be inadmissible. It also gives dominators for such inadmissible estimators.

Lemma A.2. Let $Z$ be a $N(\theta, v^2)$ random variable. Then, for squared-error loss, an estimator $\delta(Z)$ of $\theta$ under the restriction $|\theta| \le m$ is inadmissible if
$$P_\theta\big(\delta(Z) < -m(|Z|) \ \text{or}\ \delta(Z) > m(|Z|)\big) > 0 \quad \text{for some } \theta \in [-m, m],$$
where $m(z) = m\tanh(mz/v^2)$, $0 < m(|z|) < m$ and $m(-|z|) = -m(|z|)$. Further, such inadmissible estimators are dominated by $\delta^*(Z)$, obtained by projecting, for each $Z$, $\delta(Z)$ onto the interval $[-m(|Z|), m(|Z|)]$.
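Proof. Applying Moors [24] to this problem, we have his $h(\theta) = \theta$ and his decision space $A = \{\theta \mid -m \le \theta \le m\}$. The finite group $G = \{g_1, g_2\}$ he needs has $g_1(z) = z$, $g_2(z) = -z$, which gives $\tilde g_1(\theta) = \theta$, $\tilde g_2(\theta) = -\theta$, and he considers estimators $\delta(Z)$ of $\theta$ satisfying $\delta(-Z) = -\delta(Z)$. We now need his sets $A_z \subset [-m, m]$, defined as the closed convex range, for fixed $z$, of the function
$$\tilde g(\theta, z) = \begin{cases} \dfrac{\sum_{i=1}^{2} f(z \mid \tilde g_i(\theta))\,\tilde g_i(\theta)}{S(\theta, z)} & \text{when } S(\theta, z) > 0,\\[2ex] \theta & \text{elsewhere,} \end{cases}$$
where $f(z \mid \theta)$ is the density of $Z$, $S(\theta, z) = \sum_{i=1}^{2} f(z \mid \tilde g_i(\theta))$ and $\theta \in [-m, m]$. In our case this gives
$$\tilde g(\theta, z) = \theta\,\frac{e^{-(z-\theta)^2/(2v^2)} - e^{-(z+\theta)^2/(2v^2)}}{e^{-(z-\theta)^2/(2v^2)} + e^{-(z+\theta)^2/(2v^2)}} = \theta\tanh\!\left(\frac{z\theta}{v^2}\right).$$
This gives
$$A_z = \left[-m\tanh\!\left(\frac{|z|m}{v^2}\right),\ m\tanh\!\left(\frac{|z|m}{v^2}\right)\right]$$
and the result then follows from Moors' Theorem 1. □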
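The dominator described in Lemma A.2 amounts to clipping the estimate to a data-dependent interval. A minimal sketch, with hypothetical function names, for a single observation $z$ of $Z \sim N(\theta, v^2)$ with $|\theta| \le m$:

```python
import math

def moors_bound(z, m, v_sq):
    """Half-width m * tanh(m * |z| / v^2) of the interval used in Lemma A.2."""
    return m * math.tanh(m * abs(z) / v_sq)

def project_estimate(delta_z, z, m, v_sq):
    """Project the estimate delta_z onto [-bound, bound] for the observed z."""
    b = moors_bound(z, m, v_sq)
    return max(-b, min(b, delta_z))

# The unbiased estimate delta(z) = z, projected for m = 1 and v^2 = 1.
for z in (0.2, 1.0, 3.0):
    print(z, project_estimate(z, z, 1.0, 1.0))
```

Any estimate already inside the interval is left unchanged, so the projection alters an estimator only where the lemma says it can be improved.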
Proof of Theorem 2.2. (1) WLEU: It can be shown that the MLE is a Brewster–Zidek (B–Z) improvement (see [10]) of $Y_1$, proving $Y_1$'s inadmissibility. Another way to see that $Y_1$ is inadmissible is to use the fact that WLEM dominates it.
(2) WLEM: It can be shown that the estimator WLEM is dominated by its B–Z improvement, which is given by
$$\delta^*_{WLEM}(Y_1, Y_2) = \begin{cases} Y_1 + \dfrac{W + c}{1+\tau} & \text{when } W < -\dfrac{\sigma_W^2 + c^2}{c},\\[2ex] Y_1 + \dfrac{W - c}{1+\tau} & \text{when } W > \dfrac{\sigma_W^2 + c^2}{c},\\[2ex] Y_1 + \varphi_o(W) & \text{otherwise,} \end{cases} \tag{A.4}$$
where $\varphi_o(W) = \sigma_1^2 W/(\sigma_W^2 + c^2)$. Further, it follows from the first formula in Theorem 2.1 that the MSE of WLEM is given by
$$\sigma_1^2 + \frac{\sigma_1^2\sigma_W^2\,(\Delta^2 - \sigma_W^2 - 2c^2)}{(\sigma_W^2 + c^2)^2\,(1+\tau)}.$$
The fact that $\Delta^2 - \sigma_W^2 - 2c^2 < 0$ for all $|\Delta| \le c$ then shows that WLEM dominates WLEU.
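(3) WLEN: The inadmissibility of $\delta_{WLEN}$ can be shown as follows. The function $\varphi$ for this estimator is given by $\varphi(W) = \sigma_1^2 W(\sigma_W^2 + \min(W^2, c^2))^{-1}$, so (see Lemma A.1)
$$\delta_2(X_2) = X_2\left(1 - \frac{\sigma_1^2(1+\tau)}{\sigma_W^2 + \min(X_2^2(1+\tau)^2,\ c^2)}\right).$$
So, for $X_2 > c(1+\tau)^{-1}$, we have $\delta_2(X_2) = c^2 X_2/(\sigma_W^2 + c^2)$, which shows that
$$P_{\mu_2}\!\left(|\delta_2(X_2)| \le \frac{c}{1+\tau}\right) < 1 \quad \text{for all } \mu_2 \in (-\infty, \infty).$$
The inadmissibility of $\delta_2(X_2)$ as an estimator of $\mu_2$ based on $X_2$ under the restriction $|\mu_2| \le c/(1+\tau)$ then follows from Lemma A.2 as well as from the results of Charras and van Eeden [15].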
(4) WLED: Using rotation (A.2), WLED can be written as $X_1 - \delta_2(X_2)$, where
$$\delta_2(X_2) = X_2\left(1 - \frac{\sigma_1^2(1+\tau)}{\sigma_W^2 + \min\big((1+\tau)^2 X_2^2,\ c^2\tanh^2(c|X_2|/\sigma_1^2)\big)}\right).$$
By Lemma A.1 it is then sufficient to show that $\delta_2(X_2)$ is inadmissible for estimating $\mu_2$ based on $X_2$ under the restriction $|\mu_2| \le c/(1+\tau)$. But this last inadmissibility follows from the fact that, as $X_2 \to \infty$, $\delta_2(X_2) \to \infty$, showing that $\delta_2$ takes, with positive probability for all $\mu_2 \in [-c/(1+\tau), c/(1+\tau)]$, values outside the interval $[-c/(1+\tau), c/(1+\tau)]$.
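(5) MLE: That the MLE is inadmissible can be shown by using Lemmas A.1 and A.2 as follows. From (4) we get
$$\varphi(W) = \frac{(W - c)\,I(W > c) + (W + c)\,I(W < -c)}{1+\tau},$$
so (see Lemma A.1)
$$\delta_2(X_2) = X_2 - \left\{\left(X_2 - \frac{c}{1+\tau}\right) I\!\left(X_2 > \frac{c}{1+\tau}\right) + \left(X_2 + \frac{c}{1+\tau}\right) I\!\left(X_2 < -\frac{c}{1+\tau}\right)\right\}.$$
This gives
$$\delta_2(X_2) = \begin{cases} \dfrac{c}{1+\tau} & \text{for } X_2 > \dfrac{c}{1+\tau},\\[2ex] -\dfrac{c}{1+\tau} & \text{for } X_2 < -\dfrac{c}{1+\tau}, \end{cases}$$
making $\delta_2(X_2)$ inadmissible for estimating $\mu_2$ under the restriction $|\mu_2| \le c/(1+\tau)$ by Lemma A.2 as well as by the results of Charras and van Eeden [15].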
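From (4) and the first expression in Theorem 2.1, it follows that the MSE of the MLE satisfies
$$\begin{aligned} (\mathrm{MSE} - \sigma_1^2)(1+\tau)^2 &= E_\theta\big\{(2\Delta W - W^2 + c^2)\,I(|W| > c) - 2\Delta c\,\big(I(W > c) - I(W < -c)\big)\big\}\\ &= \big((\Delta - c)^2 - \sigma^2\big)\Phi\!\left(\frac{\Delta - c}{\sigma}\right) + \big((\Delta + c)^2 - \sigma^2\big)\Phi\!\left(-\frac{\Delta + c}{\sigma}\right)\\ &\quad + \sigma\left\{(\Delta - c)\,\phi\!\left(\frac{\Delta - c}{\sigma}\right) - (\Delta + c)\,\phi\!\left(\frac{\Delta + c}{\sigma}\right)\right\}. \end{aligned}$$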
The fact that $x\Phi(x) + \phi(x) \ge 0$ for all $x$ then shows that the MSE of the MLE is strictly less than $\sigma_1^2$ for all $|\Delta| \le c$, showing that the MLE dominates WLEU.
(6) WLE0: Suppose WLE0 were inadmissible as an estimator of $\theta_1$ based on $(Y_1, Y_2)$ under the restriction $|\theta_2 - \theta_1| \le c$. Then there would exist another estimator of $\theta_1$, say $\delta'(Y_1, Y_2)$, that differs from WLE0 on a measurable subset of $R^2$ of positive Lebesgue measure and that dominates WLE0. In particular, when $\theta_1 = \theta_2 = \theta$,
$$E_\theta(\delta'(Y_1, Y_2) - \theta)^2 \le E_\theta(\mathrm{WLE0}(Y_1, Y_2) - \theta)^2$$
for all $\theta \in (-\infty, \infty)$. But, as we show below, for the special case of estimating the common mean $\theta$, WLE0 is admissible. Thus
$$E_\theta(\delta'(Y_1, Y_2) - \theta)^2 = E_\theta(\mathrm{WLE0}(Y_1, Y_2) - \theta)^2$$
for all $\theta \in (-\infty, \infty)$. But this is a contradiction since in that case the combined estimator $\delta'' = (\delta' + \mathrm{WLE0})/2$ would have an MSE that is strictly less than that of WLE0 for all $\theta$. That contradicts WLE0's admissibility in this special case, a fact that we now establish.
The special case under consideration is location invariant under the transformations $Y_i \to Y_i + b$ and $\theta_i \to \theta_i + b$ for $b \in (-\infty, \infty)$ and $i = 1, 2$. The maximal invariant is $W = Y_2 - Y_1$. Let $U = \mathrm{WLE0}$. Then it is easily seen that $U$ and $W$ are independent. Moreover $U \sim N(\theta, (\sigma_1^{-2} + \sigma_2^{-2})^{-1})$.
Let us make the one-to-one transformation of $(Y_1, Y_2)$ to $(U, W)$. The induced transformation group operates as follows: $U \to U + b$, $W \to W$. As is easily shown, any equivariant estimator $\delta(U, W)$ must have the form
$$\delta(U, W) = U + \varphi(W).$$
The MSE of any such estimator is easily shown to be
$$E_\theta(\delta(U, W) - \theta)^2 = E_\theta(U - \theta)^2 + E(\varphi(W))^2.$$
Thus the best equivariant (i.e. Pitman's) estimator is given by taking $\varphi \equiv 0$, that is, by letting $\delta(U, W) = \mathrm{WLE0}$. The celebrated result of Stein [28] implies that this estimator is admissible.
Further, from the second expression in Theorem 2.1 it easily follows that the MSE of WLE0 is given by $\sigma_1^2 + \sigma_1^4(\Delta^2 - \sigma_W^2)/\sigma_W^4$. This is less than $\sigma_1^2$ for all $|\Delta| \le c$ if and only if $c < \sigma_W$. □
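The closed-form MSE expressions used in parts (2), (5) and (6) of this proof can be evaluated directly. The sketch below is illustrative only: it assumes $\tau = \sigma_2^2/\sigma_1^2$ and reads the unsubscripted $\sigma$ in the MLE formula as $\sigma_W$, neither of which is restated in this appendix. The function names are hypothetical.

```python
import math

def _pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def mse_wlem(delta, c, s1_sq, s2_sq):
    """MSE of WLEM: s1^2 + s1^4 (Delta^2 - sW^2 - 2c^2) / (sW^2 + c^2)^2."""
    sw_sq = s1_sq + s2_sq
    return s1_sq + s1_sq ** 2 * (delta ** 2 - sw_sq - 2.0 * c ** 2) / (sw_sq + c ** 2) ** 2

def mse_wle0(delta, s1_sq, s2_sq):
    """MSE of WLE0: s1^2 + s1^4 (Delta^2 - sW^2) / sW^4."""
    sw_sq = s1_sq + s2_sq
    return s1_sq + s1_sq ** 2 * (delta ** 2 - sw_sq) / sw_sq ** 2

def mse_mle(delta, c, s1_sq, s2_sq):
    """MSE of the restricted MLE from the closed form in part (5)."""
    sw_sq = s1_sq + s2_sq
    s = math.sqrt(sw_sq)          # assumption: sigma in the formula is sigma_W
    tau = s2_sq / s1_sq           # assumption: tau = sigma2^2 / sigma1^2
    a, b = (delta - c) / s, (delta + c) / s
    val = (((delta - c) ** 2 - sw_sq) * _cdf(a)
           + ((delta + c) ** 2 - sw_sq) * _cdf(-b)
           + s * ((delta - c) * _pdf(a) - (delta + c) * _pdf(b)))
    return s1_sq + val / (1.0 + tau) ** 2

for d in (0.0, 0.5, 1.0):
    print(d, mse_wlem(d, 1.0, 1.0, 1.0), mse_mle(d, 1.0, 1.0, 1.0), mse_wle0(d, 1.0, 1.0))
```

Under these assumptions the three values stay below $\sigma_1^2 = 1$ across $|\Delta| \le c = 1$, and the WLE0 value exceeds $\sigma_1^2$ once $|\Delta|$ is larger than $\sigma_W$.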
Proof of Theorem 2.3. The result follows immediately from Lemmas A.1 and A.2 and the fact that each one of the inadmissible estimators is inadmissible because $\delta_2(X_2)$ takes, with positive probability for some $\mu_2 \in [-c/(1+\tau), c/(1+\tau)]$, values outside the interval $[-c/(1+\tau), c/(1+\tau)]$. □
Proof of Theorem 2.4. Let (see Theorem 2.3) $J(W)$ be the interval
$$\left[-c\tanh\!\left(\frac{c|W|}{\sigma_W^2}\right),\ c\tanh\!\left(\frac{c|W|}{\sigma_W^2}\right)\right].$$
Then it is, by Theorem 2.3, sufficient to prove that, for all $W$, the projection $p_1(W)$ of $W$ onto $J(W)$ equals the projection $p_2(W)$ of (see (4))
$$W - \big((W - c)\,I(W > c) + (W + c)\,I(W < -c)\big)$$
onto $J(W)$. For $-c \le W \le c$ this follows from the fact that each one is the projection of $W$ onto $J(W)$. For $W > c$, $p_1(W)$ is the projection of $W\,(> c)$ onto $J(W)$ and $p_2(W)$ is the projection of $W - (W - c) = c$ onto $J(W)$. These two projections are equal because, for each $W$, $J(W) \subset [-c, c]$. The result for $W < -c$ follows by a symmetry argument. □
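The equality of the two projections in this proof can be checked on a grid. The sketch below uses hypothetical function names and simply compares $p_1(W)$ with $p_2(W)$ for a range of $W$ values; it is a numerical illustration, not a substitute for the argument above.

```python
import math

def j_bound(w, c, sw_sq):
    """Half-width of J(W): c * tanh(c * |W| / sigma_W^2)."""
    return c * math.tanh(c * abs(w) / sw_sq)

def clip(x, b):
    """Project x onto [-b, b]."""
    return max(-b, min(b, x))

def mle_core(w, c):
    """W - ((W - c) I(W > c) + (W + c) I(W < -c)), i.e. W truncated to [-c, c]."""
    excess = (w - c) if w > c else ((w + c) if w < -c else 0.0)
    return w - excess

def p1(w, c, sw_sq):
    """Projection of W onto J(W)."""
    return clip(w, j_bound(w, c, sw_sq))

def p2(w, c, sw_sq):
    """Projection of the truncated W onto J(W)."""
    return clip(mle_core(w, c), j_bound(w, c, sw_sq))

grid = [k / 10.0 for k in range(-50, 51)]
print(max(abs(p1(w, 1.0, 2.0) - p2(w, 1.0, 2.0)) for w in grid))
```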
Proof of Theorem 3.1. Using transformation (A.2), the problem becomes equivalent to showing that
$$\delta(X_1, X_2) = X_1 - \frac{c}{1+\tau}\tanh\!\left(\frac{c}{\sigma_W^2}(1+\tau)X_2\right) \tag{A.5}$$
is a minimax estimator of $\mu_1 - \mu_2$ based on $(X_1, X_2)$ under the restriction $|\mu_2| \le c' = c/(1+\tau)$. To prove this, take a sequence of priors $\lambda_n$ for $(\mu_1, \mu_2)$ with $\mu_1$ and $\mu_2$ independent, $\mu_1 \sim N(0, n)$ and $\mu_2$ with mass 1/2 on each of $\pm c'$. Then, conditionally on $(X_1, X_2)$, $\mu_1$ and $\mu_2$ are independent and their marginal distributions are, respectively, the posterior of $\mu_1$ given $X_1$ and the posterior of $\mu_2$ given $X_2$. Thus the unique Bayes estimator $\delta_n(X_1, X_2)$ of $\mu_1 - \mu_2$ is given by
$$\delta_n(X_1, X_2) = \delta_{n,1}(X_1) - \delta_{n,2}(X_2),$$
where, for $i = 1, 2$, $\delta_{n,i}$ is the unique Bayes estimator of $\mu_i$ based on $X_i$ with respect
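to the marginal prior of $\mu_i$. Now note that $\delta_{n,1}(X_1) = X_1/(1 + (\gamma^2/n))$, where $\gamma^2 = \mathrm{var}(X_1) = \sigma_1^2\sigma_2^2/\sigma_W^2$. So
$$E_{\mu_1}(\delta_{n,1}(X_1) - \mu_1) = -\frac{\gamma^2}{n + \gamma^2}\,\mu_1,$$
which gives
$$E_{\mu_1}(\delta_{n,1}(X_1) - \mu_1)^2 = \frac{n^2}{(n + \gamma^2)^2}\left(\gamma^2 + \mu_1^2\,\frac{\gamma^4}{n^2}\right),$$
which implies that the Bayes risk of $\delta_{n,1}$ converges to $\gamma^2$ as $n \to \infty$. Further, the estimator $\delta_{n,2}$ and its Bayes risk, $r_2$ (say), are independent of $n$, which shows that the Bayes risk of $\delta_n$ converges to $\gamma^2 + r_2$ as $n \to \infty$. To prove minimaxity of $\delta$ it is now, by Theorem 2.2 of Lehmann [23], sufficient to prove that
$$\sup_{|\mu_2| \le c'} E_{\mu_2}(\delta_{n,2}(X_2) - \mu_2)^2 = r_2,$$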
i.e., that, when $c \le m_o\sigma_W$, $\delta_2(X_2) = \delta_{n,2}(X_2)$ is minimax for estimating $\mu_2$ based on $X_2$ under the restriction $|\mu_2| \le c'$. But this result can be obtained from Casella and Strawderman [13] (see also [34]). They show that, when $U \sim N(\nu, 1)$, $|\nu| \le m$, the unique minimax estimator for squared-error loss of $\nu$ based on $U$ is given by
$$\delta_M(U) = m\tanh(mU)$$
when $m \le m_o \approx 1.056742$. To obtain the result we need, let $U = \sigma_W X_2/\sigma_1^2$. Then $U \sim N(\sigma_W\mu_2/\sigma_1^2, 1)$ with $|\nu| = \sigma_W|\mu_2|/\sigma_1^2 \le \sigma_W c/(\sigma_1^2(1+\tau)) = c/\sigma_W$, and our minimax estimator of $\mu_2$ becomes
$$\frac{\sigma_1^2}{\sigma_W}\,\frac{c}{\sigma_W}\tanh\!\left(\frac{c}{\sigma_W}\,\frac{\sigma_W}{\sigma_1^2}\,X_2\right) = \frac{c}{1+\tau}\tanh\!\left(\frac{c}{\sigma_W^2}(1+\tau)X_2\right).$$
The minimax value for this problem is given by
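$$\frac{\sigma_1^2\sigma_2^2}{\sigma_W^2} + \sup_{|\Delta| \le c} E_\Delta\left(\frac{c}{1+\tau}\tanh\!\left(\frac{c}{\sigma_W^2}(Y_2 - Y_1)\right) - \frac{\Delta}{1+\tau}\right)^2, \tag{A.6}$$
where $\sigma_1^2\sigma_2^2/\sigma_W^2$ is the variance of $X_1$. The second term in (A.6) can be written in the form
$$\frac{\sigma_W^2}{(1+\tau)^2}\sup_{|\Delta| \le c} E_\Delta\left(\frac{c}{\sigma_W}\tanh\!\left(\frac{c}{\sigma_W}\,\frac{Y_2 - Y_1}{\sigma_W}\right) - \frac{\Delta}{\sigma_W}\right)^2 = \frac{\sigma_1^4}{\sigma_W^2}\sup_{|\nu| \le c/\sigma_W} E_\nu\left(\frac{c}{\sigma_W}\tanh\!\left(\frac{c}{\sigma_W}Z\right) - \nu\right)^2,$$
where $Z \sim N(\nu, 1)$, which proves the result concerning the minimax value.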
The uniqueness (and thus admissibility) of the minimax estimator among estimators of the form $\delta'_1(X_1) - \delta'_2(X_2)$ follows immediately from the fact that, for $i = 1, 2$, $\delta_i(X_i)$ is unique minimax for estimating $\mu_i$ when $|\mu_2| \le c'$.
That the minimax estimator dominates WLEU follows, e.g., from the fact that WLEM dominates WLEU. □
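In the original coordinates, estimator (A.5) reads $(\tau Y_1 + Y_2)/(1+\tau) - \frac{c}{1+\tau}\tanh(cW/\sigma_W^2)$ with $W = Y_2 - Y_1$. The sketch below computes it and refuses to do so when the condition $c \le m_o\sigma_W$ used in the proof fails. It assumes $\tau = \sigma_2^2/\sigma_1^2$; the function name and the choice to raise an error are illustrative.

```python
import math

M_O = 1.056742  # approximate Casella-Strawderman bound quoted in the proof

def minimax_theta1(y1, y2, c, s1_sq, s2_sq):
    """Estimator (A.5) of theta1 written in terms of (Y1, Y2).

    Raises ValueError when c > m_o * sigma_W, where the proof does not apply.
    """
    sw_sq = s1_sq + s2_sq
    if c > M_O * math.sqrt(sw_sq):
        raise ValueError("c exceeds m_o * sigma_W; minimaxity is not established")
    tau = s2_sq / s1_sq  # assumption: tau = sigma2^2 / sigma1^2
    x1 = (tau * y1 + y2) / (1.0 + tau)
    return x1 - c / (1.0 + tau) * math.tanh(c * (y2 - y1) / sw_sq)

print(minimax_theta1(0.0, 0.4, 1.0, 1.0, 1.0))
```

When $Y_2 - Y_1$ is very large the correction term saturates at $c/(1+\tau)$, so the estimate never moves further than that from the pooled value $X_1$.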
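Proof of Theorem 5.1. We first obtain an expression for the MSE of $\delta_{RP}$. The second expression in Theorem 2.1 gives
$$\mathrm{MSE}(\delta_{RP}) - \sigma_1^2 = -2\sigma_1^4 E_\theta\,\frac{d}{dW}T'(W) + \sigma_1^4 E_\theta(T'(W))^2, \tag{A.7}$$
where
$$\lambda^2[\Phi(W'+c') - \Phi(W'-c')]^2\,\frac{d}{dW}T'(W) = [\Phi(W'+c') - \Phi(W'-c')]\,[(W'+c')\phi(W'+c') - (W'-c')\phi(W'-c')] + [\phi(W'+c') - \phi(W'-c')]^2,$$
so that
$$\frac{d}{dW}T'(W) = -\frac{W'}{\lambda}T'(W) + \frac{c'}{\lambda^2}\,\frac{\phi(W'+c') + \phi(W'-c')}{\Phi(W'+c') - \Phi(W'-c')} + (T'(W))^2. \tag{A.8}$$
From (A.7) and (A.8) we then get
$$\begin{aligned} \mathrm{MSE}(\delta_{RP}) - \sigma_1^2 &= -2\sigma_1^4 E_\theta\left\{-\frac{W'}{\lambda}T'(W) + \frac{c'}{\lambda^2}\,\frac{\phi(W'+c') + \phi(W'-c')}{\Phi(W'+c') - \Phi(W'-c')} + (T'(W))^2\right\} + \sigma_1^4 E_\theta(T'(W))^2\\ &= -\sigma_1^4 E_\theta\left\{-2\frac{W'}{\lambda}T'(W) + \frac{2c'}{\lambda^2}\,\frac{\phi(W'+c') + \phi(W'-c')}{\Phi(W'+c') - \Phi(W'-c')} + (T'(W))^2\right\}. \end{aligned} \tag{A.9}$$
To obtain an expression for $E_\theta W'T'(W)$, note that $W'T'(W) = ((W - \Delta)/\lambda)T'(W) + \Delta'T'(W)$. Then use Stein's identity and (A.8) to get
$$E_\theta W'T'(W) = \frac{\sigma^2}{\lambda}E_\theta\,\frac{d}{dW}T'(W) + \Delta' E_\theta T'(W) = \frac{\sigma^2}{\lambda}E_\theta\left\{-\frac{W'}{\lambda}T'(W) + \frac{c'}{\lambda^2}\,\frac{\phi(W'+c') + \phi(W'-c')}{\Phi(W'+c') - \Phi(W'-c')} + (T'(W))^2\right\} + \Delta' E_\theta T'(W),$$
so that
$$\left(1 + \frac{\sigma^2}{\lambda^2}\right)E_\theta W'T'(W) = \frac{\sigma^2 c'}{\lambda^3}E_\theta\,\frac{\phi(W'+c') + \phi(W'-c')}{\Phi(W'+c') - \Phi(W'-c')} + \frac{\sigma^2}{\lambda}E_\theta(T'(W))^2 + \Delta' E_\theta T'(W).$$
Substituting this into (A.9) then gives
$$\mathrm{MSE}(\delta_{RP}) - \sigma_1^2 = -\sigma_1^4 E_\theta\left\{\frac{\lambda^2 - \sigma^2}{\lambda^2 + \sigma^2}(T'(W))^2 + \frac{2c'}{\lambda^2 + \sigma^2}\,\frac{\phi(W'+c') + \phi(W'-c')}{\Phi(W'+c') - \Phi(W'-c')} - \frac{2\lambda\Delta'}{\lambda^2 + \sigma^2}T'(W)\right\}.$$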
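To prove that $\mathrm{MSE}(\delta_{RP}) - \sigma_1^2 < 0$ for all $|\Delta| \le c$, first note that $\lambda^2 - \sigma^2 \ge 0$. So it is sufficient to show that, for all $|\Delta| \le c$,
$$G(\Delta, c) = c'\,\frac{\phi(W'+c') + \phi(W'-c')}{\Phi(W'+c') - \Phi(W'-c')} - \Delta'\,\frac{\phi(W'-c') - \phi(W'+c')}{\Phi(W'+c') - \Phi(W'-c')} > 0.$$
But
$$G(\Delta, c) = (c' + \Delta')\,\frac{\phi(W'+c')}{\Phi(W'+c') - \Phi(W'-c')} + (c' - \Delta')\,\frac{\phi(W'-c')}{\Phi(W'+c') - \Phi(W'-c')} > 0$$
for all $|\Delta| \le c$ and all $W$, which proves that $\delta_{RP}$ dominates WLEU.
For the inadmissibility, by Lemma A.1, it is sufficient to show that
$$\delta_2(X_2) = X_2 - \sigma_1^2\,T'((1+\tau)X_2)$$
is inadmissible for estimating $\mu_2$ based on $X_2$ under the restriction $|\mu_2| \le c/(1+\tau)$. But this $\delta_2(X_2)$ takes, with positive probability for all $\mu_2 \in I = [-c/(1+\tau), c/(1+\tau)]$, values outside $I$ because, for $\lambda \ne \sigma_W$, $\delta_2(x)$ converges to infinity as $x$ converges to infinity. Thus $\delta_2(X_2)$ is dominated by its projection onto $I$. □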
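The positivity of $G(\Delta, c)$ used in this proof can be illustrated numerically. The sketch below assumes $W' = W/\lambda$, $c' = c/\lambda$ and $\Delta' = \Delta/\lambda$, which is how the decomposition $W'T'(W) = ((W-\Delta)/\lambda)T'(W) + \Delta'T'(W)$ reads. The value of $\lambda$ and the function names are illustrative, and the grid is kept moderate so that the normal probabilities in the denominator do not underflow.

```python
import math

def _pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def g_value(w, delta, c, lam):
    """G(Delta, c) in its first form: c' * (sum ratio) - Delta' * (difference ratio)."""
    wp, cp, dp = w / lam, c / lam, delta / lam
    denom = _cdf(wp + cp) - _cdf(wp - cp)
    ratio_sum = (_pdf(wp + cp) + _pdf(wp - cp)) / denom
    ratio_diff = (_pdf(wp - cp) - _pdf(wp + cp)) / denom
    return cp * ratio_sum - dp * ratio_diff

def g_value_split(w, delta, c, lam):
    """G(Delta, c) in its second form, with weights (c' + Delta') and (c' - Delta')."""
    wp, cp, dp = w / lam, c / lam, delta / lam
    denom = _cdf(wp + cp) - _cdf(wp - cp)
    return ((cp + dp) * _pdf(wp + cp) + (cp - dp) * _pdf(wp - cp)) / denom

ws = [k / 2.0 for k in range(-8, 9)]
deltas = [k / 4.0 for k in range(-4, 5)]
print(min(g_value(w, d, 1.0, 1.5) for w in ws for d in deltas))
```

The second form makes the sign evident: both weights are non-negative when $|\Delta| \le c$, and at least one is strictly positive.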
References
[1] H. Akaike, Information theory and an extension of entropy maximization principle, in: B.N. Petrov, F. Csak (Eds.), Second International Symposium on Information Theory, Akademia, Kiado, 1973, pp. 276–281.
[2] H. Akaike, On entropy maximization principle, in: P.R. Krishnaiah (Ed.), Applications of Statistics, North-Holland, Amsterdam, 1977, pp. 27–41.
[3] H. Akaike, A Bayesian analysis of the minimum AIC procedure, Ann. Inst. Statist. Math. 30A (1978) 9–14.
[4] H. Akaike, On the fallacy of the likelihood principle, Statist. Probab. Lett. 1 (1982) 75–78.
[5] H. Akaike, Information measures and model selection, Proceedings of 44th ISI Session, Vol. 1, 1983, pp. 277–291.
[6] H. Akaike, Prediction and entropy, A Celebration of Statistics, The ISI Centenary Volume, Springer, Berlin, 1985.
[7] M. Ayer, H.D. Brunk, G.M. Ewing, W.T. Reid, E. Silverman, An empirical distribution function for sampling with incomplete information, Ann. Math. Statist. 26 (1955) 641–647.
[8] J.O. Berger, Statistical Decision Theory and Bayesian Analysis, 2nd Edition, Springer, New York, 1985.
[9] S. Blumenthal, A. Cohen, Estimation of the larger translation parameter, Ann. Math. Statist. 39 (1968) 502–516.
[10] J.F. Brewster, J.V. Zidek, Improving on equivariant estimators, Ann. Statist. 2 (1974) 21–38.
[11] H.D. Brunk, Maximum likelihood estimates of monotone parameters, Ann. Math. Statist. 26 (1955) 607–616.
[12] H.D. Brunk, On the estimation of parameters restricted by inequalities, Ann. Math. Statist. 29 (1958) 437–454.
[13] G. Casella, W.E. Strawderman, Estimating a bounded normal mean, Ann. Statist. 9 (1981) 870–878.
[14] A. Charras, Propriété Bayesienne et admissibilité d'estimateurs dans un sousensemble convex de $R^p$, Ph.D. Thesis, Université de Montréal, Montréal, Canada, 1979.
[15] A. Charras, C. van Eeden, Bayes and admissibility properties of estimators in truncated parameter spaces, Canad. J. Statist. 19 (1991) 121–134.
[16] A. Cohen, H. Sackrowitz, Estimation of the last mean of a monotone sequence, Ann. Math. Statist. 41 (1970) 2021–2034.
[17] B. Efron, Empirical Bayes methods for combining likelihoods (with discussion), J. Amer. Statist. Assoc. 91 (1996) 538–565.
[18] J. Fan, N.E. Heckman, W.P. Wand, Local polynomial kernel regression for generalized linear models and quasi-likelihood functions, J. Amer. Statist. Assoc. 90 (1995) 141–150.
[19] F. Hu, Relevance weighted smoothing and a new bootstrap method, Ph.D. Thesis, Department of Statistics, University of British Columbia, 1994.
[20] F. Hu, Asymptotic properties of relevance weighted likelihood estimations, Canad. J. Statist. 25 (1997) 45–60.
[21] F. Hu, J.V. Zidek, The relevance weighted likelihood, 1997, Unpublished.
[22] T. Kubokawa, A unified approach to improving equivariant estimators, Ann. Statist. 22 (1994) 290–299.
[23] E.L. Lehmann, Theory of Point Estimation, John Wiley & Sons, New York, 1983.
[24] J.J.A. Moors, Inadmissibility of linearly invariant estimators in truncated parameter spaces, J. Amer. Statist. Assoc. 76 (1981) 910–915.
[25] J.J.A. Moors, Estimation in truncated parameter spaces, Ph.D. Thesis, Tilburg University, Tilburg, The Netherlands, 1985.
[26] H. Sackrowitz, W. Strawderman, On the admissibility of the M.L.E. for ordered binomial parameters, Ann. Statist. 2 (1974) 822–828.
[27] J.G. Staniswalis, The kernel estimate of a regression function in likelihood-based models, J. Amer. Statist. Assoc. 84 (1989) 276–283.
[28] C. Stein, The admissibility of Pitman’s estimator of a single location parameter, Ann. Math. Statist. 30 (1959) 970–979.
[29] S.M. Stigler, The 1988 Neyman Memorial Lecture: A Galtonian perspective on shrinkage estimators, Statist. Sci. 5 (1990) 147–155.
[30] R. Tibshirani, T. Hastie, Local likelihood estimation, J. Amer. Statist. Assoc. 82 (1987) 559–567.
[31] C. van Eeden, Maximum likelihood estimation of ordered probabilities, Proc. Kon. Nederl. Akad. Wetensch. Ser. A 59 (1956) 444–455.
[32] C. van Eeden, Maximum likelihood estimation of partially or completely ordered parameters, Proc. Kon. Nederl. Akad. Wetensch. Ser. A 60 (1957) 128–136, 201–211.
[33] C. van Eeden, J.V. Zidek, Estimating one of two normal means when their difference is bounded, Statist. Probab. Lett. 51 (2001) 277–284. Correction note: Statist. Probab. Lett. 57 (2001) 111.
[34] E. Zinzius, Minimaxschätzer für den Mittelwert ϑ einer normalverteilten Zufallsgröße mit bekannter