
www.elsevier.com/locate/jmva

Bayesian shrinkage prediction for the regression problem

Kei Kobayashi a,∗, Fumiyasu Komaki b

a Institute of Statistical Mathematics, Japan
b University of Tokyo, Japan

Received 20 January 2007 Available online 9 February 2008

Abstract

We consider Bayesian shrinkage predictions for the Normal regression problem under the frequentist Kullback–Leibler risk function.

Firstly, we consider the multivariate Normal model with an unknown mean and a known covariance. While the unknown mean is fixed, the covariance of future samples can be different from that of training samples. We show that the Bayesian predictive distribution based on the uniform prior is dominated by that based on a class of priors if the prior distributions for the covariance and future covariance matrices are rotation invariant.

Then, we consider a class of priors for the mean parameters depending on the future covariance matrix. With such a prior, we can construct a Bayesian predictive distribution dominating that based on the uniform prior.

Lastly, applying this result to the prediction of response variables in the Normal linear regression model, we show that there exists a Bayesian predictive distribution dominating that based on the uniform prior. Minimaxity of these Bayesian predictions follows from these results.

© 2008 Elsevier Inc. All rights reserved.

AMS subject classifications: primary 62F07; 62F15; secondary 62C10; 62J07

Keywords: Bayesian prediction; Shrinkage estimation; Normal regression; Superharmonic function; Minimaxity; Kullback–Leibler divergence

1. Introduction

Suppose that we have observations $y \sim N_d(y;\mu,\Sigma)$. Here $N_d$ is the density function of the $d$-dimensional multivariate Normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$.

∗ Corresponding author.

E-mail addresses: [email protected] (K. Kobayashi), [email protected] (F. Komaki).

0047-259X/$ - see front matter © 2008 Elsevier Inc. All rights reserved.

We consider the prediction of $\tilde y \sim N_d(\tilde y;\mu,\tilde\Sigma)$ using a predictive density $\hat p(\tilde y\mid y)$. We assume that the mean of the distribution of unobserved (future) samples is the same as that of the observed samples. However, the covariance matrices, $\Sigma$ and $\tilde\Sigma$, are not necessarily the same or proportional to each other. We call a problem with such settings the “problem with changeable covariances”. As we will show below, the changeable covariance is a natural assumption when we consider the linear regression problems.

In the present work, we assume that the mean vector $\mu$ is unknown and the covariance matrix $\Sigma$ is known. We consider cases where the future covariance $\tilde\Sigma$ is known and ones where it is unknown.

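We evaluate predictive densities $\hat p(\tilde y\mid y)$ using the KL loss function

\[
D(\tilde p(\tilde y\mid\theta)\,\|\,\hat p(\tilde y\mid y)) := \int \tilde p(\tilde y\mid\theta)\log\frac{\tilde p(\tilde y\mid\theta)}{\hat p(\tilde y\mid y)}\,d\tilde y \tag{1}
\]

and the (frequentist) risk function

\[
R_{KL}(\hat p,\theta) := \int p(y\mid\theta)\,D(\tilde p(\tilde y\mid\theta)\,\|\,\hat p(\tilde y\mid y))\,dy. \tag{2}
\]

We consider the Bayesian predictive density

\[
p_\pi(\tilde y\mid y) := \frac{\int \tilde p(\tilde y\mid\theta)\,p(y\mid\theta)\,\pi(\theta)\,d\theta}{\int p(y\mid\theta)\,\pi(\theta)\,d\theta}
\]

with prior $\pi(\theta)$. For the Normal model, the Bayesian predictive density with the uniform prior $\pi_I(\mu)=1$ becomes

\[
p_I(\tilde y\mid y;\Sigma,\tilde\Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma+\tilde\Sigma|^{1/2}}\exp\left(-\frac{(\tilde y-y)^\top(\Sigma+\tilde\Sigma)^{-1}(\tilde y-y)}{2}\right),
\]

as we will see in Section 2. Let $p_\pi(\tilde y\mid y)$ denote $p_\pi(\tilde y\mid y;\Sigma,\tilde\Sigma)$ for short.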

When $\tilde\Sigma$ is proportional to $\Sigma$, i.e. $\tilde\Sigma = a\Sigma$ for $a>0$, the problem is reduced to that with $\Sigma = vI_d$ and $\tilde\Sigma = \tilde vI_d$ for positive scalar values $v$ and $\tilde v$. This case with ‘unchangeable covariances’ has been well studied. The Bayesian predictive density

\[
p_I(\tilde y\mid y;\Sigma,\tilde\Sigma) = \frac{1}{\{2\pi(v+\tilde v)\}^{d/2}}\exp\left(-\frac{\|\tilde y-y\|^2}{2(v+\tilde v)}\right)
\]

based on the uniform prior $\pi_I(\mu)=1$ dominates the plug-in density

\[
p(\tilde y\mid\hat\mu) = \frac{1}{\{2\pi\tilde v\}^{d/2}}\exp\left(-\frac{\|\tilde y-y\|^2}{2\tilde v}\right)
\]

with MLE, where $\hat\mu = y$. Moreover, by [12,13], the Bayesian predictive density $p_I(\tilde y\mid y)$ is the best predictive density that is invariant under the translation group. In [11,3], the minimaxity of $p_I$ was proved.
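The dominance of $p_I$ over the plug-in density in this isotropic case can be checked with a short sketch. Both predictive densities are Normal, so the KL loss has a closed form, and averaging over $y \sim N_d(\mu, vI_d)$ gives risks $\frac{d}{2}\log(1+v/\tilde v)$ for $p_I$ and $\frac{dv}{2\tilde v}$ for the plug-in density. These closed forms are our own derivation from (1) and (2), not formulas quoted from the paper, and the function names below are illustrative.

```python
import math
import random


def risk_uniform_prior(d, v, v_future):
    """Closed-form KL risk of p_I when Sigma = v*I_d and Sigma~ = v_future*I_d."""
    return 0.5 * d * math.log(1.0 + v / v_future)


def risk_plugin(d, v, v_future):
    """Closed-form KL risk of the plug-in density N_d(y, v_future*I_d)."""
    return 0.5 * d * v / v_future


def kl_to_isotropic_normal(sq_dist, d, v_future, s):
    """KL( N_d(mu, v_future*I) || N_d(y, s*I) ), where sq_dist = ||y - mu||^2."""
    return 0.5 * (d * v_future / s + sq_dist / s - d + d * math.log(s / v_future))


def monte_carlo_risks(d, v, v_future, n_draws=20000, seed=0):
    """Average the KL loss over draws y ~ N_d(mu, v*I).

    Only ||y - mu||^2 enters the loss, so mu itself is never needed.
    """
    rng = random.Random(seed)
    total_uniform = 0.0
    total_plugin = 0.0
    for _ in range(n_draws):
        sq_dist = sum(rng.gauss(0.0, math.sqrt(v)) ** 2 for _ in range(d))
        total_uniform += kl_to_isotropic_normal(sq_dist, d, v_future, v + v_future)
        total_plugin += kl_to_isotropic_normal(sq_dist, d, v_future, v_future)
    return total_uniform / n_draws, total_plugin / n_draws
```

Since $\log(1+x) < x$ for $x>0$, the first risk is always the smaller one, which is the dominance statement above; `monte_carlo_risks` is only a sanity check on the closed forms.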

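In [8], it was proved that the Bayesian predictive density $p_S(\tilde y\mid y)$ with Stein prior

\[
\pi_S(\mu) := \|\mu\|^{-(d-2)} \tag{3}
\]

dominates $p_I(\tilde y\mid y)$ when $d \ge 3$.

George et al. [3] generalized the result of Komaki [8]. Define the marginal distribution $m_\pi$ by

\[
m_\pi(z;\Sigma) := \int N(z;\mu,\Sigma)\,\pi(\mu)\,d\mu. \tag{4}
\]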

As we will see in Theorem 2.4, George et al. [3] proved a sufficient condition on the prior $\pi(\mu)$ or the marginal distribution $m_\pi$ for $p_\pi(\tilde y\mid y)$ to dominate $p_I(\tilde y\mid y)$ when $\Sigma$ is proportional to $\tilde\Sigma$. In the present work, we generalize the results of Komaki and George et al. [8,3] to the corresponding problem with changeable covariances, considering only finite sample cases. Asymptotic properties of Bayesian prediction are studied in [7,1,9].

2. Prior distributions independent of the future covariance

In this section, we develop and prove our main results concerning properties of $p_\pi(\tilde y\mid y)$ in the problem with changeable covariances.

First we give three lemmas generalizing results proved in [3] for the problem with “unchangeable” variances.

Define the marginal distribution $m_\pi$ by (4). Here, we assume that $\tilde\Sigma$ is nonsingular, though this assumption can be removed as we will see at the end of Section 2.

Lemma 2.1. If $m_\pi(z;\Sigma) < \infty$ for all $z$, then $p_\pi(\tilde y\mid y)$ is a proper probability density. Moreover, the mean of $p_\pi(\tilde y\mid y)$ is equal to the posterior mean $E_\pi[\mu\mid y]$ if it exists.

Let

\[
w := (\Sigma^{-1}+\tilde\Sigma^{-1})^{-1}(\Sigma^{-1}y+\tilde\Sigma^{-1}\tilde y)
\]

and

\[
\Sigma_w := (\Sigma^{-1}+\tilde\Sigma^{-1})^{-1}. \tag{5}
\]

The Bayesian predictive density based on a prior $\pi(\mu)$ can be written in terms of the predictive density based on the uniform prior as follows:

Lemma 2.2.

\[
p_\pi(\tilde y\mid y) = p_I(\tilde y\mid y)\,\frac{m_\pi(w;\Sigma_w)}{m_\pi(y;\Sigma)}.
\]

The following lemma is used for proving minimaxity of $p_\pi(\tilde y\mid y)$.

Lemma 2.3. The Bayesian predictive density $p_I(\tilde y\mid y)$ is minimax under the KL risk function $R_{KL}(\hat p,\mu)$.

Since the proofs of Lemmas 2.1 and 2.3 are almost the same as those of Lemmas 1 and 3 in [3], we omit them. We prove only Lemma 2.2.

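Proof of Lemma 2.2.

\[
\begin{aligned}
p(y\mid\mu,\Sigma)\,p(\tilde y\mid\mu,\tilde\Sigma)
&= \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{(y-\mu)^\top\Sigma^{-1}(y-\mu)}{2}\right)
\frac{1}{(2\pi)^{d/2}|\tilde\Sigma|^{1/2}}\exp\left(-\frac{(\tilde y-\mu)^\top\tilde\Sigma^{-1}(\tilde y-\mu)}{2}\right)\\
&= \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\,\frac{1}{(2\pi)^{d/2}|\tilde\Sigma|^{1/2}}
\exp\left(-\frac{(w-\mu)^\top\Sigma_w^{-1}(w-\mu)}{2}\right)
\exp\left(-\frac{y^\top\Sigma^{-1}y}{2}-\frac{\tilde y^\top\tilde\Sigma^{-1}\tilde y}{2}\right)\\
&\qquad\times\exp\left(\frac{(\Sigma^{-1}y+\tilde\Sigma^{-1}\tilde y)^\top(\Sigma^{-1}+\tilde\Sigma^{-1})^{-1}(\Sigma^{-1}y+\tilde\Sigma^{-1}\tilde y)}{2}\right)\\
&= \frac{1}{(2\pi)^{d/2}|\Sigma_w|^{1/2}}\exp\left(-\frac{(w-\mu)^\top\Sigma_w^{-1}(w-\mu)}{2}\right)
\times\frac{1}{(2\pi)^{d/2}|\Sigma+\tilde\Sigma|^{1/2}}\exp\left(-\frac{(y-\tilde y)^\top(\Sigma+\tilde\Sigma)^{-1}(y-\tilde y)}{2}\right).
\end{aligned}
\tag{6}
\]

In the last equation, we use

\[
\begin{aligned}
\Sigma^{-1}(\Sigma^{-1}+\tilde\Sigma^{-1})^{-1}\Sigma^{-1}-\Sigma^{-1}
&= \Sigma^{-1}(\Sigma^{-1}+\tilde\Sigma^{-1})^{-1}\Sigma^{-1}-\Sigma^{-1}(\Sigma^{-1}+\tilde\Sigma^{-1})^{-1}(\Sigma^{-1}+\tilde\Sigma^{-1})\\
&= -\Sigma^{-1}(\Sigma^{-1}+\tilde\Sigma^{-1})^{-1}\tilde\Sigma^{-1}\\
&= -(\Sigma+\tilde\Sigma)^{-1}
\end{aligned}
\]

and

\[
|\Sigma_w|\,|\Sigma+\tilde\Sigma| = |\Sigma^{-1}+\tilde\Sigma^{-1}|^{-1}\,|\Sigma+\tilde\Sigma| = |\Sigma|\,|\tilde\Sigma|.
\]

From (6), the predictive density with the uniform prior $\pi_I(\mu)=1$ is given by

\[
p_I(\tilde y\mid y) = \frac{\int p(y\mid\mu,\Sigma)\,p(\tilde y\mid\mu,\tilde\Sigma)\,d\mu}{\int p(y\mid\mu,\Sigma)\,d\mu}
= (2\pi)^{-d/2}|\Sigma+\tilde\Sigma|^{-1/2}\exp\left(-\frac{(y-\tilde y)^\top(\Sigma+\tilde\Sigma)^{-1}(y-\tilde y)}{2}\right).
\]

Therefore

\[
p_\pi(\tilde y\mid y) = \frac{\int p(y\mid\mu,\Sigma)\,p(\tilde y\mid\mu,\tilde\Sigma)\,\pi(\mu)\,d\mu}{\int p(y\mid\mu,\Sigma)\,\pi(\mu)\,d\mu}
= p_I(\tilde y\mid y)\,\frac{\int N(w;\mu,\Sigma_w)\,\pi(\mu)\,d\mu}{\int N(y;\mu,\Sigma)\,\pi(\mu)\,d\mu}
= p_I(\tilde y\mid y)\,\frac{m_\pi(w;\Sigma_w)}{m_\pi(y;\Sigma)}.
\]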
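The identity of Lemma 2.2 can be checked numerically in the simplest setting. The sketch below assumes $d=1$ and a hypothetical discrete prior (three atoms with arbitrarily chosen weights), so every integral over $\mu$ becomes a finite sum. Neither the prior nor the function names come from the paper; the lemma itself holds for general priors.

```python
import math


def normal_pdf(z, mean, var):
    """Density of the univariate Normal N(mean, var) at z."""
    return math.exp(-(z - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Hypothetical discrete prior on mu: (location, weight) pairs, for illustration only.
ATOMS = [(-1.0, 0.2), (0.5, 0.5), (2.0, 0.3)]


def marginal(z, var, atoms=ATOMS):
    """m_pi(z; var): the prior-weighted sum of N(z; mu, var) over the atoms."""
    return sum(weight * normal_pdf(z, loc, var) for loc, weight in atoms)


def predictive_direct(y_new, y, var, var_new, atoms=ATOMS):
    """p_pi(y_new | y) computed straight from the definition of the predictive density."""
    numerator = sum(
        weight * normal_pdf(y_new, loc, var_new) * normal_pdf(y, loc, var)
        for loc, weight in atoms
    )
    return numerator / marginal(y, var, atoms)


def predictive_via_lemma(y_new, y, var, var_new, atoms=ATOMS):
    """p_pi(y_new | y) computed as p_I * m_pi(w; var_w) / m_pi(y; var)."""
    var_w = 1.0 / (1.0 / var + 1.0 / var_new)       # scalar version of Sigma_w in (5)
    w = var_w * (y / var + y_new / var_new)         # scalar version of w
    p_uniform = normal_pdf(y_new, y, var + var_new) # p_I for d = 1
    return p_uniform * marginal(w, var_w, atoms) / marginal(y, var, atoms)
```

The two functions should agree up to floating-point rounding for any positive variances, since the factorization (6) holds atom by atom.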

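Next, the difference of the risk functions of the two priors is evaluated. Let

\[
R_{KL}(\pi,\mu) := \int p(y\mid\mu,\Sigma)\,D(p(\tilde y\mid\mu,\tilde\Sigma)\,\|\,p_\pi(\tilde y\mid y))\,dy,
\]

\[
\phi_\pi(\mu,\Sigma) := \int N(z;\mu,\Sigma)\log m_\pi(z;\Sigma)\,dz.
\]

Then from Lemma 2.2,

\[
\begin{aligned}
R_{KL}(\pi,\mu)-R_{KL}(\pi_I,\mu)
&= \int\!\!\int p(y\mid\mu,\Sigma)\,p(\tilde y\mid\mu,\tilde\Sigma)\log\frac{p_I(\tilde y\mid y)}{p_\pi(\tilde y\mid y)}\,dy\,d\tilde y\\
&= \int\!\!\int p(y\mid\mu,\Sigma)\,p(\tilde y\mid\mu,\tilde\Sigma)\log\frac{m_\pi(y;\Sigma)}{m_\pi(w;\Sigma_w)}\,dy\,d\tilde y\\
&= \phi_\pi(\mu,\Sigma)-\phi_\pi(\mu,\Sigma_w).
\end{aligned}
\tag{7}
\]

Now $\Sigma_w = (\Sigma^{-1}+\tilde\Sigma^{-1})^{-1} \prec \Sigma$. In order to prove $R_{KL}(\pi,\mu) < R_{KL}(\pi_I,\mu)$, it suffices to prove $\phi_\pi(\mu,\Sigma) < \phi_\pi(\mu,\Sigma_w)$.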
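The last step of (7) uses the distribution of $w$, which the text does not spell out. A short derivation, under the assumption (implicit in the product $p(y\mid\mu,\Sigma)\,p(\tilde y\mid\mu,\tilde\Sigma)$) that $y$ and $\tilde y$ are independent given $\mu$:

```latex
\begin{aligned}
\mathrm{E}[w] &= \Sigma_w\left(\Sigma^{-1}\mu+\tilde\Sigma^{-1}\mu\right)
              = \Sigma_w\,\Sigma_w^{-1}\mu = \mu,\\
\mathrm{Cov}[w] &= \Sigma_w\left(\Sigma^{-1}\Sigma\,\Sigma^{-1}
                 +\tilde\Sigma^{-1}\tilde\Sigma\,\tilde\Sigma^{-1}\right)\Sigma_w
                = \Sigma_w\left(\Sigma^{-1}+\tilde\Sigma^{-1}\right)\Sigma_w = \Sigma_w .
\end{aligned}
```

Since $w$ is a linear function of jointly Normal vectors, $w \sim N_d(\mu,\Sigma_w)$, so the expectation of $\log m_\pi(w;\Sigma_w)$ equals $\phi_\pi(\mu,\Sigma_w)$. Likewise the expectation of $\log m_\pi(y;\Sigma)$ equals $\phi_\pi(\mu,\Sigma)$. The ordering $\Sigma_w \prec \Sigma$ follows from $\Sigma_w^{-1}-\Sigma^{-1} = \tilde\Sigma^{-1} \succ 0$.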

Before stating the main results for the problem with changeable covariances, we review some results with a special setting, i.e., unchangeable covariances.

An extended real-valued function $\pi(\mu)$ on an open set $R \subset \mathbb{R}^p$ is said to be superharmonic when it satisfies the following properties:

1. $-\infty < \pi(\mu) \le \infty$ and $\pi(\mu) \not\equiv \infty$ on any component of $R$.
2. $\pi(\mu)$ is lower semi-continuous on $R$.
3. If $G$ is an open subset of $R$ with compact closure $\bar G \subset R$, $w(\mu)$ is a continuous function on $\bar G$, $w(\mu)$ is harmonic on $G$, and $\pi(\mu) \ge w(\mu)$ on $\partial G$, then $\pi(\mu) \ge w(\mu)$ on $G$.

If $\pi(\mu)$ is a $C^2$ function, then $\pi(\mu)$ is superharmonic on $R$ if and only if $\Delta\pi(\mu) = \sum_{i=1}^{p}\frac{\partial^2}{\partial\mu_i^2}\pi(\mu) \le 0$ on $R$.

Theorem 2.4 ([8,3]). Assume $d \ge 3$.

(i) If $\pi(\mu)$ is the Stein prior $\pi_S(\mu)$,

\[
v_1 > v_2 > 0 \;\Rightarrow\; \phi_\pi(\mu,v_1I_d) < \phi_\pi(\mu,v_2I_d) \quad\text{for all } \mu.
\]

(ii) If $\pi(\mu)$ is a superharmonic function and $m_\pi(z;vI_d) < \infty$ for any $z$ and $v$,

\[
v_1 > v_2 > 0 \;\Rightarrow\; \phi_\pi(\mu,v_1I_d) \le \phi_\pi(\mu,v_2I_d) \quad\text{for all } \mu.
\]

Furthermore, if $m_\pi(z;vI_d)$ is also not constant for all $v_2 \le v \le v_1$, the inequality holds strictly.

(iii) If $\sqrt{m_\pi(z;vI_d)}$ is a superharmonic function for any $v$ and $m_\pi(z;vI_d) < \infty$ for any $z$ and $v$,

\[
v_1 > v_2 > 0 \;\Rightarrow\; \phi_\pi(\mu,v_1I_d) \le \phi_\pi(\mu,v_2I_d) \quad\text{for all } \mu.
\]

Furthermore, if $m_\pi(z;vI_d)$ is also not constant for any $v_2 \le v \le v_1$, the inequality holds strictly.

We note that (iii) implies (ii) and (ii) implies (i). (i) was proved in [8]. (ii) and (iii) were proved in [3].
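The $C^2$ criterion $\Delta\pi \le 0$ can be illustrated on the family $\pi_a(\mu) = \|\mu\|^{-a}$, which contains the Stein prior (3) as the case $a = d-2$. Away from the origin, the standard identity $\Delta\|\mu\|^{-a} = a\{a-(d-2)\}\|\mu\|^{-a-2}$ shows that the Laplacian is non-positive for $0 \le a \le d-2$ and vanishes for the Stein prior. The sketch below compares this closed form with a central finite-difference Laplacian; the function names and the step size are illustrative choices, not part of the paper.

```python
import math


def power_prior(mu, a):
    """pi_a(mu) = ||mu||^(-a); a = d - 2 gives the Stein prior (3)."""
    return math.sqrt(sum(m * m for m in mu)) ** (-a)


def laplacian_fd(f, mu, h=1e-3):
    """Central finite-difference approximation of the Laplacian of f at mu."""
    base = f(mu)
    total = 0.0
    for i in range(len(mu)):
        up = list(mu)
        down = list(mu)
        up[i] += h
        down[i] -= h
        total += (f(up) - 2.0 * base + f(down)) / (h * h)
    return total


def laplacian_exact(mu, a):
    """Closed-form Laplacian of ||mu||^(-a) away from the origin."""
    d = len(mu)
    r = math.sqrt(sum(m * m for m in mu))
    return a * (a - (d - 2)) * r ** (-a - 2)
```

For $d=5$, evaluating `laplacian_fd(lambda m: power_prior(m, 3.0), mu)` at a point away from the origin should return a value close to zero, while $a=1$ gives a strictly negative Laplacian.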

Theorem 2.5 is a generalization of (ii) of Theorem 2.4 to the problem with changeable covariances. For each prior $\pi(\mu)$, define a rescaled prior with respect to a positive definite $d \times d$ matrix $\Sigma^*$ by

\[
\pi_{\Sigma^*}(\mu) := \pi(\Sigma^{*-1/2}\mu).
\]

In particular, call $\pi_{S;\Sigma^*}(\mu) := \pi_S(\Sigma^{*-1/2}\mu)$ a rescaled Stein prior with respect to $\Sigma^*$.

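We consider Bayes risk with priors $p(\Sigma)$ and $\tilde p(\tilde\Sigma)$:

\[
R_{KL}(\pi,\mu) = \int p(\Sigma)\,\tilde p(\tilde\Sigma)\int p(y\mid\mu,\Sigma)\,D(p(\tilde y\mid\mu,\tilde\Sigma)\,\|\,p_\pi(\tilde y\mid y;\Sigma,\tilde\Sigma))\,dy\,d\Sigma\,d\tilde\Sigma,
\]

where $d\Sigma$ means a Lebesgue measure for a vector space of all components of a matrix $\Sigma$. Define

\[
\varphi_\pi(\mu) := \int p(\Sigma)\,\tilde p(\tilde\Sigma)\,\phi_\pi(\mu,\Sigma)\,d\Sigma\,d\tilde\Sigma
= \int p(\Sigma)\,\tilde p(\tilde\Sigma)\,N(z;\mu,\Sigma)\log m_\pi(z;\Sigma)\,dz\,d\Sigma\,d\tilde\Sigma, \tag{8}
\]

\[
\varphi^w_\pi(\mu) := \int p(\Sigma)\,\tilde p(\tilde\Sigma)\,\phi_\pi(\mu,\Sigma_w)\,d\Sigma\,d\tilde\Sigma
= \int p(\Sigma)\,\tilde p(\tilde\Sigma)\,N(z;\mu,\Sigma_w)\log m_\pi(z;\Sigma_w)\,dz\,d\Sigma\,d\tilde\Sigma. \tag{9}
\]

Then from (7),

\[
R_{KL}(\pi,\mu)-R_{KL}(\pi_I,\mu) = \varphi_\pi(\mu)-\varphi^w_\pi(\mu). \tag{10}
\]

We consider the case where $p(\Sigma)$, $\tilde p(\tilde\Sigma)$, and $\pi(\mu)$ are rotation invariant. Here, a function $f(\Sigma)$ of a matrix $\Sigma \in \mathbb{R}^{d\times d}$ and a function $g(\mu)$ of a vector $\mu \in \mathbb{R}^d$ are said to be rotation invariant if $f(\Sigma) = f(P\Sigma P^\top)$ and $g(\mu) = g(P\mu)$, respectively, for every orthogonal matrix $P \in \mathbb{R}^{d\times d}$.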

Theorem 2.5. Let $d \ge 3$. If $p(\Sigma)$ and $\tilde p(\tilde\Sigma)$ are rotation invariant functions and $\pi$ is a rotation invariant superharmonic prior, then

\[
R_{KL}(\pi_\Sigma,\mu) \le R_{KL}(\pi_I,\mu)
\]

for any $\mu$. In particular, the Bayesian predictive distribution $p_\Sigma(\tilde y\mid y)$ with $\pi_\Sigma$ dominates that based on $\pi_I$ if $\pi$ is also not constant.

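Proof. We note that $m_\pi(z;\Sigma) < \infty$ for every $z \in \mathbb{R}^d$ and positive definite matrix $\Sigma \in \mathbb{R}^{d\times d}$ from Lemma A.1 in the Appendix.

First, we prove invariance of $\varphi_{\pi_\Sigma}(\mu)$ and $\varphi^w_{\pi_\Sigma}(\mu)$ under rotations of $\mu$.

Let $P$ be a $d \times d$ orthogonal matrix; then

\[
\begin{aligned}
\varphi_{\pi_\Sigma}(P\mu)
&= \int p(\Sigma)\,\tilde p(\tilde\Sigma)\,N(z;P\mu,\Sigma)\log\int N(z;\mu',\Sigma)\,\pi_\Sigma(\mu')\,d\mu'\,dz\,d\Sigma\,d\tilde\Sigma\\
&= \int p(\Sigma)\,\tilde p(\tilde\Sigma)\,N(\tilde z;\mu,P^\top\Sigma P)\log\int N(\tilde z;\tilde\mu',P^\top\Sigma P)\,\pi(\Sigma^{-1/2}P\tilde\mu')\,d\tilde\mu'\,d\tilde z\,d\Sigma\,d\tilde\Sigma\\
&= \int p(P\Sigma P^\top)\,\tilde p(\tilde\Sigma)\,N(\tilde z;\mu,\Sigma)\log\int N(\tilde z;\tilde\mu',\Sigma)\,\pi(\Sigma^{-1/2}\tilde\mu')\,d\tilde\mu'\,d\tilde z\,d\Sigma\,d\tilde\Sigma\\
&= \varphi_{\pi_\Sigma}(\mu).
\end{aligned}
\]

The proof of the rotation invariance of $\varphi^w_{\pi_\Sigma}(\mu)$ is nearly the same. We define

\[
\mu^* := \arg\max_{\|\mu'\|=\|\mu\|}\frac{\|\Sigma^{-1/2}\mu'\|}{\|\Sigma_w^{-1/2}\mu'\|}
\quad\text{and}\quad
\tau := \frac{\|\Sigma^{-1/2}\mu^*\|}{\|\Sigma_w^{-1/2}\mu^*\|}. \tag{11}
\]

Note that $0 < \tau < 1$, because $\tilde\Sigma$ is positive definite. Moreover,

\[
\|\tau\Sigma_w^{-1/2}\tilde\mu'\| = \tau\|\Sigma_w^{-1/2}\tilde\mu'\|
\ge \frac{\|\Sigma^{-1/2}\tilde\mu'\|}{\|\Sigma_w^{-1/2}\tilde\mu'\|}\,\|\Sigma_w^{-1/2}\tilde\mu'\|
= \|\Sigma^{-1/2}\tilde\mu'\|
\]

for every $\tilde\mu'$.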

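From the rotation invariance of $\varphi_{\pi_\Sigma}$,

\[
\begin{aligned}
\varphi_{\pi_\Sigma}(\mu) &= \varphi_{\pi_\Sigma}(\mu^*)\\
&= E_{\Sigma,\tilde\Sigma}\left[\int N(z;\mu^*,\Sigma)\log\int N(z;\tilde\mu,\Sigma)\,\pi(\Sigma^{-1/2}\tilde\mu)\,d\tilde\mu\,dz\right]\\
&= E_{\Sigma,\tilde\Sigma}\left[\int N(\tilde z;\Sigma^{-1/2}\mu^*,I_d)\log\int N(\tilde z;\tilde\mu',I_d)\,\pi(\tilde\mu')\,d\tilde\mu'\,d\tilde z\right]\\
&= E_{\Sigma,\tilde\Sigma}\left[\int N(\tilde z;\tau\Sigma_w^{-1/2}\mu^*,I_d)\log\int N(\tilde z;\tilde\mu',I_d)\,\pi(\tilde\mu')\,d\tilde\mu'\,d\tilde z\right]\\
&= E_{\Sigma,\tilde\Sigma}\left[\int N(\tilde z;\Sigma_w^{-1/2}\mu^*,\tau^{-2}I_d)\log\int N(\tilde z;\tilde\mu',\tau^{-2}I_d)\,\pi(\tau\tilde\mu')\,d\tilde\mu'\,d\tilde z\right]\\
&\le E_{\Sigma,\tilde\Sigma}\left[\int N(\tilde z;\Sigma_w^{-1/2}\mu^*,I_d)\log\int N(\tilde z;\tilde\mu',I_d)\,\pi(\tau\tilde\mu')\,d\tilde\mu'\,d\tilde z\right]\\
&= E_{\Sigma,\tilde\Sigma}\left[\int N(\tilde z;\mu^*,\Sigma_w)\log\int N(\tilde z;\tilde\mu',\Sigma_w)\,\pi(\tau\Sigma_w^{-1/2}\tilde\mu')\,d\tilde\mu'\,d\tilde z\right].
\end{aligned}
\tag{12}
\]

Here, inequality (12) is given by Theorem 2.4(ii).

Since every rotation invariant superharmonic function is radially nonincreasing,

\[
\pi(\tau\Sigma_w^{-1/2}\tilde\mu') \le \pi(\Sigma^{-1/2}\tilde\mu').
\]

From this inequality,

\[
\begin{aligned}
&E_{\Sigma,\tilde\Sigma}\left[\int N(\tilde z;\mu^*,\Sigma_w)\log\int N(\tilde z;\tilde\mu',\Sigma_w)\,\pi(\tau\Sigma_w^{-1/2}\tilde\mu')\,d\tilde\mu'\,d\tilde z\right]\\
&\qquad\le E_{\Sigma,\tilde\Sigma}\left[\int N(\tilde z;\mu^*,\Sigma_w)\log\int N(\tilde z;\tilde\mu',\Sigma_w)\,\pi(\Sigma^{-1/2}\tilde\mu')\,d\tilde\mu'\,d\tilde z\right]\\
&\qquad= \varphi^w_{\pi_\Sigma}(\mu^*) = \varphi^w_{\pi_\Sigma}(\mu).
\end{aligned}
\tag{13}
\]

In particular, if $\pi$ is not constant, inequality (12) holds strictly. Therefore, $p_\Sigma$ dominates $p_I$.

From Lemma 2.3, $p_\Sigma$ is proved to be minimax.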

Corollary 2.6. Assume $d \ge 3$. Let $p(\Sigma)$ and $\tilde p(\tilde\Sigma)$ be rotation invariant continuous functions. If $\pi$ is a rotation invariant superharmonic prior, the Bayesian predictive density $p_\Sigma(\tilde y\mid y)$ is minimax under $R_{KL}$.

Theorem 2.5 and Corollary 2.6 can be generalized to the case with a semi-positive definite future covariance matrix $\tilde\Sigma$. Let $\tilde\Sigma$ be a $d$-dimensional semi-positive definite matrix whose rank is $k > 0$. Then there is a $d \times k$ matrix $L$ satisfying $\tilde\Sigma = LL^\top$. Let $\{a_i\}_{i=1}^{d-k}$ be a set of orthogonal normalized vectors that are orthogonal to each column vector of $L$, i.e. $L^\top a_i = 0$ and $a_i^\top a_j = \delta_{ij}$ for $i,j = 1,\dots,d-k$. Define the Normal distribution with semi-positive definite covariance matrix by

\[
N_d(y;\mu,\tilde\Sigma) = \frac{1}{(2\pi)^{k/2}|L^\top L|^{1/2}}\exp\left(-\frac{(y-\mu)^\top\tilde\Sigma^{\dagger}(y-\mu)}{2}\right)\prod_{i=1}^{d-k}\delta(a_i^\top(y-\mu)),
\]

where $\tilde\Sigma^{\dagger}$ is the Moore–Penrose pseudo-inverse of $\tilde\Sigma$.

From the results of functional analysis, $N_d(y;\mu,\tilde\Sigma)$ for any semi-positive definite $\tilde\Sigma$ is equivalent to $\lim_{\epsilon\to0}N_d(y;\mu,\tilde\Sigma+\epsilon I_d)$ as a functional on Schwartz functions of $y$.

Using this equivalence and the bounded convergence theorem, Eq. (7) is valid for a semi-definite future covariance matrix if we define $\Sigma_w := (\Sigma^{-1}+\tilde\Sigma^{\dagger})^{-1}$. Because $\tilde\Sigma^{\dagger} \ne 0$, $\tau$ defined by (11) takes values in $(0,1)$. Therefore, Theorem 2.5 and Corollary 2.6 hold for each semi-definite future covariance matrix $\tilde\Sigma$.

3. Prior distributions depending on the future covariance

In this section, we consider prior distributions depending on the future covariance matrix.

Theorem 3.2 below says that every Bayesian prediction with an adequately metrized prior dominates that based on the uniform prior. Although the assumption that priors can depend on the future covariance may seem strange, this assumption is natural when we consider the linear regression problem, as we will see in Section4.

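First, we generalize Theorem 2.4 to the case with non-identity covariances. Let $\mu$ and $z$ be vectors in $\mathbb{R}^d$ and let $\Sigma \in \mathbb{R}^{d\times d}$ be a positive definite matrix.

Let $\Sigma_1$ and $\Sigma_2$ be positive definite matrices such that $\Sigma_1 \preceq \Sigma_2$. An orthogonal matrix $U$ and a diagonal matrix $\Lambda$ are given by a diagonalization of $\Sigma_1^{1/2}\Sigma_2^{-1}\Sigma_1^{1/2}$, i.e. $\Sigma_1^{1/2}\Sigma_2^{-1}\Sigma_1^{1/2} = U^\top\Lambda U$. Let $A^* := \Sigma_1^{1/2}U^\top(\Lambda^{-1}-I_d)^{1/2}$.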

Proposition 3.1. If $\pi$ is a prior s.t. $\pi(A^*\mu)$ is a superharmonic function of $\mu$, then

\[
\phi_\pi(\mu,\Sigma_1) \ge \phi_\pi(\mu,\Sigma_2) \tag{14}
\]

for any $\mu \in \mathbb{R}^d$. Inequality (14) becomes strict if $\pi$ is not a constant function.

The following theorem is a direct result of Proposition 3.1.

Theorem 3.2. If $\pi(A^*\mu)$ is a superharmonic function of $\mu$, then $R_{KL}(\pi,\mu) \le R_{KL}(\pi_I,\mu)$.

Furthermore, if $\pi$ is not a constant function, the Bayesian predictive distribution $p_\pi$ dominates that with the uniform prior $\pi_I$.

Note that $\pi(A^*\mu)$ can be superharmonic only if $\mathrm{rank}(\Sigma_2-\Sigma_1) \ge 3$.

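Proof of Proposition 3.1 and Theorem 3.2. Assume $0 \prec \Sigma_1 \preceq \Sigma_2$ and let $\Sigma_1^{1/2}\Sigma_2^{-1}\Sigma_1^{1/2} = U^\top\Lambda U$ be a diagonalization. Then,

\[
\phi_\pi(\mu,\Sigma) = \int\log\left\{\int\pi(\nu)\,\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{(x-\nu)^\top\Sigma^{-1}(x-\nu)}{2}\right)d\nu\right\}
\times\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{(x-\mu)^\top\Sigma^{-1}(x-\mu)}{2}\right)dx.
\]

Let $\tilde x := U\Sigma_1^{-1/2}x$, $\tilde\mu := U\Sigma_1^{-1/2}\mu$, and $\tilde\nu := U\Sigma_1^{-1/2}\nu$. By $|\Sigma_2|^{-1/2}|\Sigma_1|^{1/2} = |\Lambda|^{1/2}$,

\[
\begin{aligned}
\phi_\pi(\mu,\Sigma_2)
&= \int\log\left\{\int\pi(\Sigma_1^{1/2}U^\top\tilde\nu)\,\frac{|\Lambda|^{1/2}}{(2\pi)^{d/2}}\exp\left(-\frac{(\tilde x-\tilde\nu)^\top\Lambda(\tilde x-\tilde\nu)}{2}\right)d\tilde\nu\right\}
\times\frac{|\Lambda|^{1/2}}{(2\pi)^{d/2}}\exp\left(-\frac{(\tilde x-\tilde\mu)^\top\Lambda(\tilde x-\tilde\mu)}{2}\right)d\tilde x\\
&= \phi_{\pi(\Sigma_1^{1/2}U^\top\cdot)}(\tilde\mu,\Lambda^{-1}),
\end{aligned}
\tag{15}
\]

where $\pi(\Sigma_1^{1/2}U^\top\cdot)$ is a prior distribution whose density function is represented by $\pi(\Sigma_1^{1/2}U^\top\mu)$ with a prior density $\pi(\mu)$. Putting $\Sigma_2 = \Sigma_1$, we get

\[
\phi_\pi(\mu,\Sigma_1) = \phi_{\pi(\Sigma_1^{1/2}U^\top\cdot)}(\tilde\mu,I_d), \tag{16}
\]

where $I_d$ is the $d$-dimensional identity matrix.

We denote each diagonal component of $\Lambda$ by $\lambda_i$. Now $0 < \lambda_i \le 1$ for each $i$ since $\Sigma_1 \preceq \Sigma_2$. Let $a_i(t) := 1+t(\lambda_i^{-1}-1)$ and $A := \mathrm{diag}(a_i)$. Then

\[
\begin{aligned}
\phi_\pi(\mu,\Sigma_2)-\phi_\pi(\mu,\Sigma_1)
&= \phi_{\pi(\Sigma_1^{1/2}U^\top\cdot)}(\tilde\mu,\Lambda^{-1})-\phi_{\pi(\Sigma_1^{1/2}U^\top\cdot)}(\tilde\mu,I_d)\\
&= \int_{t=0}^{1}\sum_{i=1}^{d}\frac{\partial a_i(t)}{\partial t}\,\frac{\partial}{\partial a_i}\phi_{\pi(\Sigma_1^{1/2}U^\top\cdot)}(\tilde\mu,A)\Big|_{a_i(t)}\,dt\\
&= \int_{t=0}^{1}\sum_{i=1}^{d}\frac{\partial\tilde a_i(t)}{\partial t}\,\frac{\partial}{\partial\tilde a_i}\phi_{\pi(A^*\cdot)}(\hat\mu,\tilde A)\Big|_{\tilde a_i(t)}\,dt\\
&= \int_{t=0}^{1}\sum_{i=1}^{d}\frac{\partial}{\partial\tilde a_i}\phi_{\pi(A^*\cdot)}(\hat\mu,\tilde A)\Big|_{\tilde a_i(t)}\,dt,
\end{aligned}
\]

where $\tilde a_i := (\lambda_i^{-1}-1)^{-1}a_i$, $\tilde A := \mathrm{diag}(\tilde a_i)$, and $\hat\mu := (\Lambda^{-1}-I_d)^{-1/2}\tilde\mu$.

By assumption, $\pi(A^*\cdot)$ for $A^* = \Sigma_1^{1/2}U^\top(\Lambda^{-1}-I_d)^{1/2}$ is superharmonic. Now it is sufficient to prove Lemma 3.3(iii).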

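Lemma 3.3. (i) $\sum_{i=1}^{d}\partial_{a_i}N(x;\mu,A) = \frac{1}{2}\Delta N(x;\mu,A)$, where $\Delta$ is the Laplacian with respect to $\mu$.

(ii) $\int f(x-t)\,d\mu(t)$ is a superharmonic function of $x$ if $f$ is a superharmonic function and $\mu$ is a positive measure on $\mathbb{R}^d$.

(iii) $\sum_{i=1}^{d}\partial_{a_i}\phi_\pi(\mu,A) \le 0$ for any $\mu \in \mathbb{R}^d$, $a_i > 0$, and $A = \mathrm{diag}(a_i)$ for each superharmonic prior $\pi$.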

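Proof of Lemma 3.3. Lemma (i) follows from direct calculation. For a proof of (ii), see Problem 1.7.16 of [10].

\[
\begin{aligned}
\sum_{i=1}^{d}\frac{\partial}{\partial a_i}\phi_\pi(\mu,A)
&= \sum_{i=1}^{d}\frac{\partial}{\partial a_i}\int\log\left\{\int\pi(\nu)N(x;\nu,A)\,d\nu\right\}N(x;\mu,A)\,dx\\
&= \int\frac{\sum_{i=1}^{d}\frac{\partial}{\partial a_i}\int\pi(\nu)N(x;\nu,A)\,d\nu}{\int\pi(\nu)N(x;\nu,A)\,d\nu}\,N(x;\mu,A)\,dx\\
&\qquad+\int\log\left\{\int\pi(\nu)N(x;\nu,A)\,d\nu\right\}\sum_{i=1}^{d}\frac{\partial}{\partial a_i}N(x;\mu,A)\,dx.
\end{aligned}
\tag{17}
\]

Now,

\[
\sum_{i=1}^{d}\frac{\partial}{\partial a_i}\int\pi(\nu)N(x;\nu,A)\,d\nu = \frac{1}{2}\Delta\int\pi(\nu)N(x;\nu,A)\,d\nu \le 0
\]

from Lemma 3.3(i) and (ii). Thus, the first term of the right-hand side of (17) is non-positive. The second term of the right-hand side of (17) becomes

\[
\frac{1}{2}\int\log\left\{\int\pi(\nu)N(x;\nu,A)\,d\nu\right\}\Delta N(x;\mu,A)\,dx
= \frac{1}{2}\int\Delta\log\left\{\int\pi(\nu)N(x;\nu,A)\,d\nu\right\}N(x;\mu,A)\,dx \tag{18}
\]

by (i) and the self-adjoint property of the Laplacian. Since the logarithm of a superharmonic function is superharmonic (see Problem 1.7.16 of [10]), (18) is non-positive from (ii). Thus Lemma 3.3(iii) is proved.

Example 3.4. A rescaled Stein prior

\[
\pi_{S;\Sigma_2-\Sigma_1}(\mu) = \|(\Sigma_2-\Sigma_1)^{-1/2}\mu\|^{-(d-2)}
\]

satisfies the condition of Proposition 3.1 and Theorem 3.2. This is because

\[
\begin{aligned}
\|(\Sigma_2-\Sigma_1)^{-1/2}\mu\|^{-(d-2)}
&= \left(\mu^\top\Sigma_1^{-1/2}(\Sigma_1^{-1/2}\Sigma_2\Sigma_1^{-1/2}-I_d)^{-1}\Sigma_1^{-1/2}\mu\right)^{-(d-2)/2}\\
&= \left(\mu^\top\Sigma_1^{-1/2}U^\top(\Lambda^{-1}-I_d)^{-1}U\Sigma_1^{-1/2}\mu\right)^{-(d-2)/2}.
\end{aligned}
\]

Thus, $\pi_{S;\Sigma_2-\Sigma_1}(A^*\mu) = \pi_S(\mu)$.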

4. Application to the Normal linear regression problem

In this section, we apply the results in the previous section to the Normal linear regression problem.

Consider a Normal linear model

\[
y = X^\top\beta+\varepsilon, \tag{19}
\]

\[
\varepsilon \sim N_p(0,\sigma^2I_p),
\]

where the target variable $y$ is a $p$-dimensional vector, $X$ is a $d \times p$ matrix composed of the explanatory variables, $\sigma^2 > 0$ is an unknown variance, and $\beta$ is an unknown $d$-dimensional vector. When the rightmost column of $X^\top$ is the constant vector $(1,\dots,1)^\top$, the model (19) is a model with an intercept, $y = X^\top\beta+\beta_0+\varepsilon$.

We suppose that a future sample $\tilde y$ is generated by

\[
\tilde y = \tilde X^\top\beta+\tilde\varepsilon, \tag{20}
\]

\[
\tilde\varepsilon \sim N_{\tilde p}(0,\tilde\sigma^2I_{\tilde p}).
\]

In the present work, we assume that $p \ge d$ and $XX^\top$ is regular; however, neither $\tilde p \ge d$ nor regularity of $\tilde X\tilde X^\top$ is necessary.

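We consider the prediction problem for the linear regression model (19) and (20) with the KL risk function

\[
\tilde R_{KL}(\beta,\hat p_\pi,X,\tilde X) := \int p(y\mid X;\beta,\sigma^2)\,D(p(\tilde y\mid\tilde X;\beta,\tilde\sigma^2)\,\|\,p_\pi(\tilde y\mid\tilde X,y,X;\sigma^2,\tilde\sigma^2))\,dy
\]

and partial Bayesian risk function with priors $p(X)$ and $\tilde p(\tilde X)$:

\[
\tilde R_{KL}(\beta,\hat p_\pi) := \int p(X)\,\tilde p(\tilde X)\,\tilde R_{KL}(\beta,\hat p_\pi,X,\tilde X)\,dX\,d\tilde X.
\]

Note that we do not assume any prior for $\beta$.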

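Next, the regression model is reduced to a Normal model discussed in Section 2. Let $y_1 := (XX^\top)^{-1}Xy$ and $y_2 := y-X^\top(XX^\top)^{-1}Xy$. Then

\[
\frac{1}{(2\pi\sigma^2)^{p/2}}\exp\left(-\frac{(X^\top\beta-y)^\top(X^\top\beta-y)}{2\sigma^2}\right)dy
= \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{(y_1-\beta)^\top\Sigma^{-1}(y_1-\beta)}{2}\right)g(y_2;\sigma^2)\,dy_1\,dy_2,
\]

where

\[
\Sigma := \sigma^2(XX^\top)^{-1} \tag{21}
\]

and $g(y_2;\sigma^2)$ is a density function of $y_2$ that is independent of $y_1$ and $\beta$.

When $y$ is given, $y_1$ is a sufficient statistic of $\beta$, the maximum likelihood estimator, and the least-squares estimator of $\beta$. Thus, the regression model (19) is reduced to a Normal model

\[
p(y_1;\beta,\Sigma) = N_d(y_1;\beta,\Sigma). \tag{22}
\]

Similarly, the regression model (20) for the future samples is reduced to a Normal model

\[
\tilde p(\tilde y_1;\beta,\tilde\Sigma) = N_d(\tilde y_1;\beta,\tilde\Sigma) \tag{23}
\]

with semi-positive definite covariance matrix. Here $\tilde y_1 := (\tilde X\tilde X^\top)^{\dagger}\tilde X\tilde y$ and

\[
\tilde\Sigma := \tilde\sigma^2(\tilde X\tilde X^\top)^{\dagger}. \tag{24}
\]

The KL risk of the Bayesian predictive density with a prior $\pi(\beta)$ for the regression problem becomes

\[
\begin{aligned}
\tilde R_{KL}(p_\pi,\beta)
&= \int p(y\mid X;\beta,\sigma^2)\,D(p(\tilde y\mid\tilde X;\beta,\tilde\sigma^2)\,\|\,p_\pi(\tilde y\mid\tilde X,y,X))\,dy\\
&= \int p(y\mid X;\beta,\sigma^2)\int N_d(\tilde y_1;\beta,\tilde\Sigma)\,g(\tilde y_2;\tilde\sigma^2)\\
&\qquad\times\log\frac{N_d(\tilde y_1;\beta,\tilde\Sigma)\,g(\tilde y_2;\tilde\sigma^2)}
{\dfrac{\int N_d(\tilde y_1;\beta,\tilde\Sigma)\,g(\tilde y_2;\tilde\sigma^2)\,N_d(y_1;\beta,\Sigma)\,g(y_2;\sigma^2)\,\pi(\beta)\,d\beta}{\int N_d(y_1;\beta,\Sigma)\,g(y_2;\sigma^2)\,\pi(\beta)\,d\beta}}\,d\tilde y_1\,d\tilde y_2\,dy\\
&= \int N_d(y_1;\beta,\Sigma)\,D(N_d(\tilde y_1;\beta,\tilde\Sigma)\,\|\,q_\pi(\tilde y_1\mid y_1))\,dy_1\\
&= R_{KL}(q_\pi,\beta),
\end{aligned}
\tag{25}
\]

where

\[
q_\pi(\tilde y_1\mid y_1) := \frac{\int N_d(\tilde y_1;\beta,\tilde\Sigma)\,N_d(y_1;\beta,\Sigma)\,\pi(\beta)\,d\beta}{\int N_d(y_1;\beta,\Sigma)\,\pi(\beta)\,d\beta}.
\]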

As a result, the prediction problem for the regression model (19) and (20) is reduced to a prediction problem (22) and (23). Using the result in Section 2, we construct a Bayesian prediction for the Normal regression problem.
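The reduction step can be sketched in code for a small case: $y_1 = (XX^\top)^{-1}Xy$ is the least-squares estimate, $y_2$ is the residual, and $\Sigma = \sigma^2(XX^\top)^{-1}$ as in (21). To stay dependency-free, this minimal sketch assumes $d = 2$, so that $XX^\top$ can be inverted by hand. The helper names are hypothetical, and $\sigma^2$ is passed in as a known number.

```python
def gram_2x2(X):
    """Return X X^T for a 2 x p design matrix X given as a list of two rows."""
    return [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(2)] for i in range(2)]


def inverse_2x2(G):
    """Invert a 2 x 2 matrix; raises if it is singular (X X^T must be regular)."""
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    if det == 0.0:
        raise ValueError("X X^T is singular")
    return [[G[1][1] / det, -G[0][1] / det], [-G[1][0] / det, G[0][0] / det]]


def reduce_regression(X, y, sigma2):
    """Split y into (y1, y2) and return Sigma = sigma2 * (X X^T)^{-1}.

    y1 = (X X^T)^{-1} X y is the least-squares estimate of beta;
    y2 = y - X^T y1 is the residual, which satisfies X y2 = 0.
    """
    G_inv = inverse_2x2(gram_2x2(X))
    Xy = [sum(x * t for x, t in zip(row, y)) for row in X]
    y1 = [G_inv[i][0] * Xy[0] + G_inv[i][1] * Xy[1] for i in range(2)]
    y2 = [t - (X[0][j] * y1[0] + X[1][j] * y1[1]) for j, t in enumerate(y)]
    Sigma = [[sigma2 * g for g in row] for row in G_inv]
    return y1, y2, Sigma
```

For example, with `X = [[1.0, 0.0, 1.0, 2.0], [0.0, 1.0, 1.0, -1.0]]` and the noiseless response `y = [2.0, -1.0, 1.0, 5.0]` generated from $\beta = (2,-1)$, the function should return `y1` equal to $\beta$ and a zero residual, up to rounding.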

Define $\Sigma$, $\tilde\Sigma$, and $\Sigma_w$ by (21), (24), and $\Sigma_w = (\Sigma^{-1}+\tilde\Sigma^{\dagger})^{-1}$, respectively; then the following theorem holds.

Theorem 4.1. Let $\pi_\Sigma(\beta) = \pi(\Sigma^{-1/2}\beta)$. Let $p(X)$ and $\tilde p(\tilde X)$ be rotation invariant continuous functions.

(i) If $\pi$ is a non-constant rotation invariant superharmonic function, then the Bayesian predictive density $p_\Sigma$ with a prior $\pi_\Sigma$ dominates $p_I$ with the uniform prior $\pi_I$ under the risk $\tilde R_{KL}$.

(ii) If $\pi$ is a rotation invariant superharmonic function, then $p_\Sigma$ is minimax under the KL risk $\tilde R_{KL}$.

Proof. If $p(X)$ and $\tilde p(\tilde X)$ are rotation invariant, then the distributions of $\Sigma = \sigma^2(XX^\top)^{-1}$ and $\Sigma_w = (\sigma^{-2}(XX^\top)+\tilde\sigma^{-2}(\tilde X\tilde X^\top))^{-1}$ are also rotation invariant.

From Theorem 2.5 and Corollary 2.6, the theorem is derived directly.

The assumption of rotation invariance of $p(X)$ and $\tilde p(\tilde X)$ is sometimes not realistic. If we consider priors depending on the future explanatory variables, we can construct a Bayesian prediction that dominates the one with the uniform prior and is therefore minimax.

Define an orthogonal matrix $U$ and a diagonal matrix $\Lambda$ by a diagonalization of $\Sigma_w^{1/2}\Sigma^{-1}\Sigma_w^{1/2}$, i.e. $\Sigma_w^{1/2}\Sigma^{-1}\Sigma_w^{1/2} = U^\top\Lambda U$. Let $A^* := \Sigma_w^{1/2}U^\top(\Lambda^{-1}-I_d)^{1/2}$. Then the following theorem is a direct consequence of Theorem 3.2.

Theorem 4.2. (i) If $\pi(A^*\beta)$ is superharmonic w.r.t. $\beta$ and $\pi$ is non-constant, then the Bayesian prediction based on the prior $\pi$ dominates that based on the uniform prior.

(ii) If $\pi(A^*\beta)$ is superharmonic, then the Bayesian prediction based on the prior $\pi$ is minimax.

Note that $\pi(A^*\beta)$ can be superharmonic only if the number of future samples is more than 2.

5. Experimental results

We show several experimental results on the Bayesian prediction with shrinkage priors for regression problems.

Figs. 1 and 2 show examples of the regression problem. We consider the five-dimensional Normal regression models, without an intercept term (Fig. 1) and with an intercept term (Fig. 2). We set the true parameter $\beta = (1,0,\dots,0) \in \mathbb{R}^5$. An explanatory variable $X$ is sampled from the uniform distribution $U([-1,1]^{5\times10})$ and a corresponding target variable $y$ is sampled from $N_{10}(X^\top\beta,I_{10})$. The target variable $\tilde y$ for each explanatory variable $\tilde x = (\tilde x_1,0,\dots,0)$ where $\tilde x_1 \in [0,2]$ is predicted by the Bayesian predictive density based on the uniform prior $\pi_I$ and that based on a rescaled Stein prior $\pi_{S;\Sigma}$ where $\Sigma = XX^\top$.

The two lines in Figs. 1 and 2 are $y = \hat\beta_\pi^\top\tilde x$ for $\pi_I$ and $\pi_{S;\Sigma}$, respectively, where $\hat\beta_\pi$ is the posterior mean with prior $\pi$. In both figures, the slope of the line with the rescaled Stein prior is smaller than that of the one with the uniform prior because the slope parameter $\beta$ is shrunk toward zero. Moreover, in Fig. 2, the intercept parameter is also shrunk.

Fig. 1. An example of the Bayesian prediction based on the uniform prior and a rescaled Stein prior for the Normal regression model without an intercept term.

Fig. 2. An example of the Bayesian prediction based on the uniform prior and a rescaled Stein prior for the Normal regression model with an intercept term $\beta_0 = 1$.

Fig. 3 shows the distribution functions of the predictive density $p_I(\tilde y\mid\tilde x,y,X)$ with $\pi_I$ and $p_{S;\Sigma}(\tilde y\mid\tilde x,y,X)$ with $\pi_{S;\Sigma}$, respectively, for $\beta = \tilde x = e_1 := (1,0,\dots,0) \in \mathbb{R}^5$.

Fig. 4 shows the risk functions of $p_I$ and $p_\Sigma$ for $d = 3,5,7,9$ and $\|\beta\| \in [0,2]$. The model has no intercept term. Here we assume that the columns of $X$ and $\tilde X$ are independently sampled from $N_{10}(0,I_{10})$.

Fig. 3. Distribution functions of $p_I(\tilde y\mid\tilde x,y,X)$ and $p_{S;\Sigma}(\tilde y\mid\tilde x,y,X)$ where $\beta = \tilde x = e_1 := (1,0,\dots,0) \in \mathbb{R}^d$, $X$ is a sample from $U([-1,1]^{d\times p})$, $y$ is a sample from $N_p(y;X^\top\beta,10I_p)$, and $\tilde p(\tilde y\mid\tilde x) = N(\tilde y;\tilde x^\top\beta,10)$. We generate $10^4$ samples of $\tilde y$ from each predictive distribution. The sample means of $P_I$ and $P_{S;\Sigma}$ are 1.3134 and 0.6898, respectively.

Fig. 4. The risk difference of $p_I$ and $p_{S;\Sigma}$ for $d = 3,5,7,9$ and $\|\beta\| \in [0,2]$. We generate $10^4$ independent samples of $X$ and $\tilde X$ from $N_{10}(0,I_{10})$. Each line in the figure represents the sample mean of the risk difference $R_{KL}(\beta,p_I)-R_{KL}(\beta,p_{S;\Sigma})$. Each error bar represents the standard deviation.

Next, we show an example of Bayesian prediction whose prior depends on the explanatory variables of future samples. We set $x_1 = (\sqrt3/2,\,1/2,\,0)^\top$, $x_2 = (\sqrt3/2,\,-1/2,\,0)^\top$, $x_3 = (0,0,1)^\top$, $y_1 = \sqrt3/2+1/2$, $y_2 = \sqrt3/2-1/2$ and $y_3 = 0$. Fig. 5 is a graph of $E_{\pi_{S;A^*}}[\tilde y\mid\tilde x,y,x]$ for each value of $\tilde x = (\tilde x^{(1)},\tilde x^{(2)},0) \in \mathbb{R}\times\mathbb{R}\times\{0\}$ with the rescaled Stein prior $\pi_{S;A^*}$. Here, the Bayesian estimation based on the uniform prior corresponds to the MLE $\hat\beta = (1,1,0)$, i.e.

Fig. 5. An example of Bayesian prediction whose prior depends on the explanatory variables of future samples.

Fig. 6. Comparison of the risk values with five predictive densities: the Bayesian predictive densities based on $p_I$ and $p_\Sigma$, the ridge regression priors with regularization parameters $\lambda = 10$ and $\lambda = \sqrt{10} = 3.16$, and the plug-in density of the MLE. The model is five dimensional and has no intercept term. We generate $10^4$ independent samples of $X$ and $\tilde X$ from $N_{10}(0,I_{10})$. Each line in the figure represents the sample mean of the risk $R_{KL}(\beta,\hat p)$ for the predictive density $\hat p$.

We can see that the amount of shrinkage with the Bayesian prediction increases as the direction of $\tilde x$ becomes closer to $x^{(1)}$ than $x^{(2)}$, i.e. $\tilde x^\top e_1$ becomes larger than $\tilde x^\top e_2$. This is intuitively explained as follows: when explanatory variables of training samples are closer to $x^{(1)}$, an $\tilde x$ whose direction is close to $x^{(1)}$ carries more information than one whose direction is close to $x^{(2)}$. Thus $\tilde x$ close to $x^{(1)}$ need not be shrunk.

Fig. 6 compares five predictive densities: the Bayesian predictive densities based on $p_I$ and $p_{\pi_\Sigma}$, the ridge regression priors with regularization parameters $\lambda \in \{\sqrt{10},10\}$, and the plug-in density of the MLE.

The ridge regression prior is

\[
\pi_{RR}(\beta;\lambda) = \frac{\lambda^{d/2}}{(2\pi)^{d/2}}\exp\left(-\frac{\lambda\|\beta\|^2}{2}\right)
\]

with a regularization parameter $\lambda > 0$. We note that the posterior mean with the ridge regression prior is equivalent to the ridge regression estimator

\[
\hat\beta_{RR} = (XX^\top+\lambda I)^{-1}Xy.
\]
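The ridge estimator above is easy to evaluate directly. The sketch below assumes $d = 2$ (so the $2\times2$ system is solved in closed form) and unit noise variance $\sigma^2 = 1$. The unit variance is our assumption, under which the posterior mean for the prior $\pi_{RR}$ coincides with $(XX^\top+\lambda I)^{-1}Xy$ exactly as written. The function names are illustrative.

```python
import math


def ridge_posterior_mean(X, y, lam):
    """Return (X X^T + lam*I)^{-1} X y for a 2 x p design matrix X (list of two rows).

    lam = 0 recovers the least-squares estimate when X X^T is regular.
    """
    g00 = sum(a * a for a in X[0]) + lam
    g11 = sum(b * b for b in X[1]) + lam
    g01 = sum(a * b for a, b in zip(X[0], X[1]))
    det = g00 * g11 - g01 * g01
    if det == 0.0:
        raise ValueError("X X^T + lam*I is singular")
    xy0 = sum(a * t for a, t in zip(X[0], y))
    xy1 = sum(b * t for b, t in zip(X[1], y))
    # Apply the inverse of the symmetric 2 x 2 matrix [[g00, g01], [g01, g11]].
    return [(g11 * xy0 - g01 * xy1) / det, (g00 * xy1 - g01 * xy0) / det]


def euclidean_norm(v):
    """Euclidean norm of a vector given as a list."""
    return math.sqrt(sum(c * c for c in v))
```

Because $(XX^\top+\lambda I)^{-1}XX^\top$ is symmetric with eigenvalues in $(0,1)$ when $XX^\top$ is regular and $\lambda > 0$, the ridge estimate always has a smaller norm than the least-squares estimate: it shrinks toward the origin regardless of where $\beta$ actually lies.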

When $\|\beta\|$ is close to 0, the center of shrinkage, the risk based on the ridge regression prior $\pi_{RR}$ becomes smaller than that based on $\pi_\Sigma$. However, when $\|\beta\|$ increases, the prediction with $\pi_{RR}$ becomes worse than that with $\pi_I$ and even worse than the plug-in distribution of the MLE.

6. Conclusions and discussion

In this paper, we considered the multivariate Normal model with an unknown mean and a known covariance. The covariance matrix can be changed after the first sampling. We assumed rotation invariant priors of the covariance matrix and the future covariance matrix. We showed that the shrinkage predictive density with the rescaled rotation invariant superharmonic priors is minimax under the Kullback–Leibler risk. Moreover, if the prior is not constant, the Bayesian predictive density based on the prior dominates that with the uniform prior.

In this case, the rescaled priors are independent of the covariance matrix of future samples. Therefore, we can calculate the posterior distribution and the mean of the predictive distribution (i.e. the posterior mean and the Bayesian estimate for quadratic loss) based on some of the rescaled Stein priors without knowledge of future covariance. Since the predictive density with the uniform prior is minimax, that with each rescaled Stein prior is also minimax.

Next we considered Bayesian predictions whose prior can depend on the future covariance. In this case, we proved that the Bayesian prediction based on a rescaled superharmonic prior dominates that with the uniform prior without assuming the rotation invariance.

Applying these results to the prediction of response variables in the Normal regression model, we show that there exists a prior distribution such that the corresponding Bayesian predictive density dominates that based on the uniform prior. Since the prior distribution depends on the future explanatory variables, both the posterior distribution and the mean of the predictive distribution may depend on the future explanatory variables.

We note that [4] developed minimaxity and dominance conditions on prior distributions for the predictive regression problem with changeable covariance independently of our study. They also proposed a shrinkage prediction where only a subset of the regression coefficients are close to zero.

The robustness of some shrinkage methods such as the Stein estimator has been studied (see, for example, the bibliography in [14]). The Stein effect has robustness in the sense that it depends on the loss function rather than the true distribution of the observations. Our result shows that the Stein effect has robustness with respect to the covariance of the true distribution of the future observations.

As the dimension of the model becomes large, the risk improvement from shrinkage with the rescaled Stein prior $\pi_\Sigma$ increases, as in Fig. 4. An important example of a high dimensional model is that of the kernel methods (see [5]). As noted in [2], the feature space for the kernel methods is a reproducing kernel Hilbert space whose dimension is as large as the sample size. Therefore Bayesian prediction based on shrinkage priors could be efficient for kernel methods. This is a future problem.

Acknowledgments

The authors thank Vincent Q. Vu for valuable comments on this paper. The authors would also like to thank the associate editor and two referees for their comments on and suggestions regarding an earlier version of this paper. This work was supported in part by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for JSPS Fellows, 17.10018, 2005.

Appendix. Finiteness of the marginal distribution

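Here, we prove finiteness of the marginal distribution $m_\pi(\mu,\Sigma)$.

Lemma A.1. If $\pi$ is a superharmonic prior density function, the marginal distribution $m_\pi(x,\Sigma)$ is finite for every vector $x \in \mathbb{R}^d$ and positive definite matrix $\Sigma \in \mathbb{R}^{d\times d}$.

Proof. Fix a vector $x \in \mathbb{R}^d$. From the definition of superharmonic functions, $\pi \not\equiv \infty$. Thus, $\exists x_0 \in \mathbb{R}^d$ s.t. $\pi(x_0) < \infty$. If we set $\tilde\pi(\mu) := \pi(\mu+x_0)$, then $\tilde\pi$ is superharmonic and $\tilde\pi(0) < \infty$. Let $\lambda_{\max}$ be the maximal eigenvalue of $\Sigma$ and $r_0 := \|x+x_0\|$; then

\[
\begin{aligned}
m_\pi(x,\Sigma) &\le \int\exp\left(-\frac{\|x+x_0-\mu\|^2}{2\lambda_{\max}}\right)\tilde\pi(\mu)\,d\mu\\
&\le \int_{\|\mu\|\le2r_0}\exp\left(-\frac{\|x+x_0-\mu\|^2}{2\lambda_{\max}}\right)\tilde\pi(\mu)\,d\mu
+\int_{\|\mu\|>2r_0}\exp\left(-\frac{\|\mu\|^2}{8\lambda_{\max}}\right)\tilde\pi(\mu)\,d\mu.
\end{aligned}
\tag{26}
\]

The first term of the right-hand side of (26) is finite because the integral of a superharmonic function over a compact subspace of $\mathbb{R}^d$ is finite (see Theorem 4.10 of [6]).

The second term is also finite because

\[
\sum_{n=2}^{\infty}\int_{nr_0<\|\mu\|\le(n+1)r_0}\exp\left(-\frac{\|\mu\|^2}{8\lambda_{\max}}\right)\tilde\pi(\mu)\,d\mu
\le C\sum_{n=2}^{\infty}\exp\left(-\frac{(nr_0)^2}{8\lambda_{\max}}\right)\tilde\pi(0)\,\{(n+1)r_0\}^d
\]

for a positive constant $C$. Here we used the fact that $\int_{\|\mu\|<r}\tilde\pi(\mu)\,d\mu < C\tilde\pi(0)r^d$, by Theorem 4.9 of [6]. Therefore, $m_\pi(x,\Sigma) < \infty$.

From this lemma, we see that the assumption $m_\pi(z,vI_d) < \infty$ in Theorem 2.4(ii) is redundant.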

References

[1] J.M. Corcuera, F. Giummolè, First-order optimal prediction densities, in: P. Marriott, M. Salmon (Eds.), Applications of Differential Geometry to Econometrics, Cambridge University Press, Cambridge, 2000, pp. 214–229.

[2] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000.

[3] E.I. George, F. Liang, X. Xu, Improved minimax prediction under Kullback–Leibler loss, Annals of Statistics 34 (2006) 78–91.

[4] E.I. George, X. Xu, Predictive density estimation for multiple regression, Econometric Theory 24 (2) (2008) 528–544.

[5] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, in: Springer Series in Statistics, Springer, New York, 2001.

[6] L.L. Helms, Introduction to Potential Theory, Wiley-Interscience, New York, 1969.

[7] F. Komaki, On asymptotic properties of predictive distributions, Biometrika 83 (1996) 299–313.

[8] F. Komaki, A shrinkage predictive distribution for multivariate Normal observables, Biometrika 88 (2001) 859–864.

[9] F. Komaki, Shrinkage priors for Bayesian prediction, Annals of Statistics 34 (2006) 808–819.

[10] E.L. Lehmann, G. Casella, Theory of Point Estimation, 2nd ed., Springer, New York, 1998.

[11] F. Liang, A. Barron, Exact minimax strategies for predictive density estimation, data compression, and model selection, IEEE Transactions on Information Theory 50 (2004) 2708–2726.

[12] G.D. Murray, A note on the estimation of probability density functions, Biometrika 64 (1977) 150–152.

[13] V.M. Ng, On the estimation of parametric density functions, Biometrika 67 (1980) 505–506.
