Minimum Description Length Methods in Bayesian Model Selection: Some Applications


Mohan Delampady

Statistics and Mathematics Unit, Indian Statistical Institute, Bangalore, India Email: [email protected]

Received January 8, 2013; revised February 10, 2013; accepted February 26, 2013

Copyright © 2013 Mohan Delampady. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Computations involved in the Bayesian approach to practical model selection problems are usually very difficult. Computational simplifications are sometimes possible, but are not generally applicable. There is a large literature available on a methodology based on information theory called Minimum Description Length (MDL). It is described here how many of these techniques are either directly Bayesian in nature, or are very good objective approximations to Bayesian solutions. First, connections between the Bayesian approach and MDL are theoretically explored; thereafter a few illustrations are provided to describe how MDL can give useful computational simplifications.

Keywords: Bayesian Analysis; Model Selection; Minimum Description Length; Hierarchical Bayes; Bayesian Computations

1. Introduction

Bayesian computations can be difficult, in particular those in model selection problems. For instance, learning the structure of Bayesian networks is in general NP-complete ([1,2]). It is therefore meaningful to consider alternative computationally simpler solutions which are approximations to Bayesian solutions. Sometimes direct computational simplifications are possible, as shown, for example, in [3], but often approaches arising out of different methodologies are needed. We discuss some aspects of Minimum Description Length (MDL) methods with this point of view. Another important reason for exploring these methods is that there is a substantial literature on this topic available in engineering and computer science with potential applications in statistics. We will not, however, explore certain other aspects of MDL such as the “Normalized Maximum Likelihood (NML)” introduced by [4], which do not seem to be in the spirit of the Bayesian approach that we have taken here.

The discussion below is organized as follows. In Section 2 we briefly describe the MDL principle and then indicate in Sections 3 and 4 how it applies to model fitting and model checking. It is shown that a particular version of MDL is equivalent to the Bayes factor criterion of model selection. Since this is most often computationally difficult, some approximations are desirable, and it is next shown how a different version of MDL can provide such an approximation. Following this discussion, new applications are presented in Section 5. Specifically, the MDL approach to step-wise regression in Section 5.1, wavelet thresholding in Section 5.2 and a change-point problem in Section 5.3 are described.

2. Minimum Description Length Principle

The MDL approach to model fitting can be described as follows (see [5,6]). Suppose we have some data. Consider a collection of probability models for this set of data. A model provides a better fit if it can provide a more compact description for the data. In terms of coding, this means that according to MDL, the best model is the one which provides the shortest description length for the given data. The MDL approach as discussed here is also related to the Minimum Message Length (MML) approach of [7]. See [8,9] for connections to information theory and other related details.

If data $x$ is known to arise from a probability density $p$, then (see [10] or [11]) the optimal code length (in an average sense) is given by $-\log p(x)$. (Here $\log$ is the logarithm to the base 2.) This is the link between description length and model fitting.

The optimal code length of $-\log p(x)$ is valid only in the discrete case. In the continuous case $x$ is discretized, with $\delta$ denoting the precision. In effect we will then be considering

$$P\left(x - \tfrac{\delta}{2} \le X \le x + \tfrac{\delta}{2}\right) = \int_{x-\delta/2}^{x+\delta/2} p(u)\,du \approx \delta\,p(x)$$

instead of $p(x)$ itself as far as coding of $x$ is considered when $x$ is one-dimensional. In the $r$-dimensional case, we will replace the density $p(x)$ by the probability of the $r$-dimensional cube of side $\delta$ containing $x$, namely $p(x)\,\delta^r$, so that the optimal code length changes to $-\log p(x) - r\log\delta$.

3. MDL for Estimation or Model Fitting

Consider data $x^n = (x_1, x_2, \ldots, x_n)$, and suppose

$$F = \{f(x^n \mid \theta) : \theta \in \Theta\}$$

is the collection of models of interest. Further, let $\pi(\theta)$ be a prior density for $\theta$. Given a value of $\theta$ (or a model), the optimal code length for describing $x^n$ is $-\log f(x^n \mid \theta)$, but since $\theta$ is unknown, its description requires a further $-\log\pi(\theta)$ bits on average. Therefore the optimal code length is obtained upon minimizing

$$DL(\theta) = -\log\pi(\theta) - \log f(x^n \mid \theta), \tag{1}$$

so that MDL amounts to seeking that model which minimizes the sum of

• the length, in bits, of the description of the model, and

• the length, in bits, of data when encoded with the help of the model.

Now note that the posterior density of $\theta$ given the data $x^n$ is

$$\pi(\theta \mid x^n) = \frac{f(x^n \mid \theta)\,\pi(\theta)}{m(x^n)}, \tag{2}$$

where $m(x^n)$ is the marginal or predictive density. Therefore, minimizing

$$DL(\theta) = -\log f(x^n \mid \theta) - \log\pi(\theta) = -\log\{f(x^n \mid \theta)\,\pi(\theta)\} = -\log\pi(\theta \mid x^n) - \log m(x^n)$$

over $\theta$ is equivalent to maximizing $\pi(\theta \mid x^n)$. Thus MDL for estimation or model fitting is equivalent to finding the highest posterior density (HPD) estimate of $\theta$. Note, however, that a prior is needed for these calculations. The approach that a Bayesian adopts in specifying the prior is not, in general, what is accepted by practitioners of the MDL approach. Therefore, the equivalence of MDL and HPD approaches is either subject to accepting the same prior, or holds as an asymptotic or similar approximation. MDL mostly prefers an approximately uniform prior when $\Theta \subset R^k$ for some fixed $k$ (same across all models), leading to the maximum likelihood estimate (MLE). The case of having model parameters of different dimensions is different and is interesting. This can be easily seen in the continuous case upon discretization. Now write $\theta_k = (\theta_1, \theta_2, \ldots, \theta_k)$ for the parameter of a model of dimension $k$. Then

$$\begin{aligned}
DL(\theta_k) &= -\log\{\pi(\theta_k)\,\delta_\pi^k\} - \log\{f(x^n \mid \theta_k)\,\delta_f^n\} \\
&= -\log\pi(\theta_k) - k\log\delta_\pi - \log f(x^n \mid \theta_k) - n\log\delta_f .
\end{aligned}$$

Here $\delta_f$ and $\delta_\pi$ are the precisions required to discretize $x$ and $\theta$, respectively. Note that the term $-n\log\delta_f$ is common across all models, so it can be ignored. However, the term $-k\log\delta_\pi$, which involves the dimension of $\theta$ in the model, varies and is influential. According to [6,12], $\delta_\pi = 1/\sqrt{n}$ is optimal (see [13] for details), in which case

$$DL(\theta_k) = -\log f(x^n \mid \theta_k) - \log\pi(\theta_k) + \frac{k}{2}\log n + \text{constant}. \tag{3}$$

Minimizing this will not lead to the MLE even when $\pi(\theta_k)$ is assumed to be approximately constant. In fact, [12] proceeds further and argues that the correct precision $\delta_\pi$ should depend on the Fisher information matrix. This amounts to using a prior which is similar in nature to the Jeffreys’ prior on $\theta$. Jeffreys’ prior is an objective choice and thus this approach to MDL can then be considered a default Bayesian approach.

In spite of these desirable properties, however, MDL leads to the HPD estimate of $\theta$, which is not the usual Bayes estimate. The posterior mean is what is generally preferred, so that the error in estimation has an immediate simple answer in the posterior standard deviation. In summary, therefore, the Bayesian approach does not seem to find attractive solutions in the MDL approach as far as estimation or model fitting is concerned, unless the models under consideration are hierarchical, having parameters of varying dimension. On the other hand, when such hierarchical models are of interest the inference problem usually involves model selection in addition to model fitting. Thus the possible gains from studying the MDL approach are in the context of model selection as described below.
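The penalized description length (3) can be illustrated with a minimal sketch. It assumes a coin-tossing setting that is not part of the original text: a model with no free parameter ($k = 0$, $\theta = 1/2$) is compared with a model in which $\theta$ is estimated ($k = 1$) under a uniform prior, so that $-\log\pi(\theta) = 0$. The function names and the data are hypothetical, and the model-independent constant in (3) is dropped.

```python
import math

def neg_log2_lik_bernoulli(s, n, theta):
    """-log2 f(x^n | theta) for n tosses with s successes."""
    if theta in (0.0, 1.0):
        # At the boundary only perfectly consistent data have positive probability.
        return 0.0 if s == n * theta else math.inf
    return -(s * math.log2(theta) + (n - s) * math.log2(1.0 - theta))

def description_length(s, n, theta, k, neg_log2_prior=0.0):
    """Equation (3) in bits, without the model-independent constant."""
    return (neg_log2_lik_bernoulli(s, n, theta)
            + neg_log2_prior
            + 0.5 * k * math.log2(n))

if __name__ == "__main__":
    n = 100
    for s in (52, 70):
        dl_fixed = description_length(s, n, 0.5, k=0)    # no parameter to code
        dl_free = description_length(s, n, s / n, k=1)   # theta coded to precision 1/sqrt(n)
        print(s, round(dl_fixed, 2), round(dl_free, 2))
```

With 52 successes the extra $(1/2)\log n$ bits needed to code the estimate outweigh the gain in fit, whereas with 70 successes the free-parameter model gives the shorter description.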

4. Model Selection Using MDL

Let us recall the Bayesian approach to model selection and express it in the following form. Let $X^n = (X_1, \ldots, X_n)$. Suppose $X^n$ has density $f(x^n \mid \theta)$. Consider testing

$$M_0: \theta \in \Theta_0 \quad \text{versus} \quad M_1: \theta \in \Theta_1, \tag{4}$$

where $\Theta_i \subset R^{d_i}$, $i = 0, 1$, for some $d_0 \le d_1$ and $\Theta_0 \cap \Theta_1 = \emptyset$. Let $\pi$ be a prior on $\Theta_0 \cup \Theta_1$. Then $\pi$ can be expressed as

$$\pi(\theta) = \pi_0\,g_0(\theta)\,I(\theta \in \Theta_0) + (1 - \pi_0)\,g_1(\theta)\,I(\theta \in \Theta_1), \tag{5}$$

where $\pi_0 = P^\pi(\Theta_0)$ and $g_0$ and $g_1$ are the conditional densities (with respect to some dominating $\sigma$-finite measure) of $\theta$ under $M_0$ and $M_1$ respectively. Then

$$P(M_0 \mid x^n) = \frac{\pi_0\,m_0(x^n)}{\pi_0\,m_0(x^n) + (1 - \pi_0)\,m_1(x^n)} \tag{6}$$

$$\phantom{P(M_0 \mid x^n)} = \frac{\pi_0\,m_0(x^n)}{m_\pi(x^n)}, \tag{7}$$

where

$$m_i(x^n) = \int_{\Theta_i} f(x^n \mid \theta)\,g_i(\theta)\,d\theta, \quad i = 0, 1, \tag{8}$$

and

$$m_\pi(x^n) = \pi_0\,m_0(x^n) + (1 - \pi_0)\,m_1(x^n). \tag{9}$$

Note that $m_i$ is simply the marginal or predictive density of $X^n$ under $M_i$ and $m_\pi$ is the unconditional predictive density obtained upon averaging $m_0$ and $m_1$. Consequently, the posterior odds ratio of $M_0$ relative to $M_1$ is

$$\frac{P(M_0 \mid x^n)}{P(M_1 \mid x^n)} = \frac{\pi_0}{1 - \pi_0}\,\frac{m_0(x^n)}{m_1(x^n)} = \frac{\pi_0}{1 - \pi_0}\,BF_{01}(x^n), \tag{10}$$

with $BF_{01}$ denoting the Bayes factor of $M_0$ relative to $M_1$. When we compare two competing models $M_0$ and $M_1$, we usually take $\pi_0 = 1/2$, and hence settle upon the Bayes factor $BF_{01}$ as the model selection tool. This agrees well with the intuitive notion that the model yielding a better predictive ability must be a better model for the given data.
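Equations (6) and (10) are usually evaluated from log marginal likelihoods, since the marginals themselves underflow for moderate $n$. The following is a minimal sketch under that assumption; the function names and the numerical log marginals are hypothetical and only serve to illustrate the arithmetic.

```python
import math

def posterior_prob_m0(log_m0, log_m1, prior_m0=0.5):
    """P(M0 | x^n) as in equation (6), from natural-log marginal likelihoods."""
    a = math.log(prior_m0) + log_m0
    b = math.log(1.0 - prior_m0) + log_m1
    top = max(a, b)  # subtract the larger log weight to avoid underflow
    wa, wb = math.exp(a - top), math.exp(b - top)
    return wa / (wa + wb)

def log_bayes_factor_01(log_m0, log_m1):
    """log BF_01 = log m0 - log m1, the data-dependent factor in (10)."""
    return log_m0 - log_m1

if __name__ == "__main__":
    # Hypothetical log marginals; with prior_m0 = 1/2 the posterior odds equal BF_01.
    p0 = posterior_prob_m0(-1050.0, -1048.0)
    print(p0, p0 / (1.0 - p0), math.exp(log_bayes_factor_01(-1050.0, -1048.0)))
```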

4.1. Mixture MDL and Stochastic Complexity

Let us consider the MDL principle now for model selection between $M_0$ and $M_1$. Once the conditional prior densities $g_0$ and $g_1$ are agreed upon, MDL will select that model $M_i$ which obtains a smaller value for the code length, $-\log m_i(x^n)$, between the two. This is clearly equivalent to using the Bayes factor as the model selection tool, and hence this version of MDL is equivalent to the Bayes factor criterion. In the MDL literature, this version of MDL is known as “mixture MDL”, and is distinguished from the “two-stage MDL” which separately codes the model and the prior. The two-stage MDL can be derived as an approximation to the mixture MDL as discussed later. See [13] for further details and other interesting comparisons and discussion. Let us consider a few examples before examining the need for other versions of MDL.

Example 1. Suppose $X^n$ is a random sample from $N(\mu, \sigma^2)$ with known $\sigma^2$. We want to test

$$M_0: \mu = 0 \quad \text{versus} \quad M_1: \mu \neq 0.$$

Consider the $N(0, \tau^2)$ prior on $\mu$ with known $\tau^2$ under $M_1$. Then the marginal distribution of $X^n$ is $N_n(0, \sigma^2 I_n)$ under $M_0$ and under $M_1$ it is $N_n(0, \sigma^2 I_n + \tau^2\mathbf{1}\mathbf{1}')$. A continuous model and a continuous prior are considered here. Since the precision of the prior parameter is the same across all models upon discretization we will ignore the distinction and proceed with densities. Then, both the Bayes factor criterion and the MDL principle will select $M_1$ over $M_0$ if and only if

$$-\log m_1(x^n) < -\log m_0(x^n),$$

where $m_1$ and $m_0$ are the corresponding densities. Since we are comparing two logarithms, let us switch to natural logarithms. Then

$$-\log m_0(x^n) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2,$$

and

$$-\log m_1(x^n) = \frac{n}{2}\log(2\pi) + \frac{1}{2}\log\det(\sigma^2 I_n + \tau^2\mathbf{1}\mathbf{1}') + \frac{1}{2}\,(x^n)'(\sigma^2 I_n + \tau^2\mathbf{1}\mathbf{1}')^{-1}x^n.$$

Noting that

$$\det(\sigma^2 I_n + \tau^2\mathbf{1}\mathbf{1}') = \sigma^{2n}\left(1 + \frac{n\tau^2}{\sigma^2}\right)$$

and

$$(\sigma^2 I_n + \tau^2\mathbf{1}\mathbf{1}')^{-1} = \frac{1}{\sigma^2}\left[I_n - \frac{\tau^2}{\sigma^2 + n\tau^2}\,\mathbf{1}\mathbf{1}'\right],$$

we obtain

$$-\log m_1(x^n) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2}\log\left(1 + \frac{n\tau^2}{\sigma^2}\right) + \frac{1}{2\sigma^2}\left[\sum_{i=1}^n x_i^2 - \frac{n^2\tau^2\,\bar{x}^2}{\sigma^2 + n\tau^2}\right].$$

Therefore $M_1$ is preferred over $M_0$ either by the Bayes factor or by the mixture MDL if and only if

$$\frac{n\bar{x}^2}{\sigma^2}\cdot\frac{n\tau^2}{\sigma^2 + n\tau^2} > \log\left(1 + \frac{n\tau^2}{\sigma^2}\right).$$

Example 2. Let us consider the previous example with unknown $\sigma^2$ now. Suppose the prior on $\sigma^2$ is the default $1/\sigma^2$ under both models. The prior on $\mu$ under $M_1$ is now assumed to depend on $\sigma^2$, i.e., $\mu \mid \sigma^2 \sim N(0, c\sigma^2)$, where $c > 0$ is assumed to be a known constant for now. Then, provided $x^n \neq 0$,

$$m_0(x^n) = \int_0^\infty (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2\right\}\frac{1}{\sigma^2}\,d\sigma^2 = \pi^{-n/2}\,\Gamma\!\left(\frac{n}{2}\right)\left(\sum_{i=1}^n x_i^2\right)^{-n/2},$$

and letting $S^2 = \sum_{i=1}^n (x_i - \bar{x})^2$, and $m_1(x^n \mid c, \sigma^2)$ denote the marginal density of $X^n$ given $c$ and $\sigma^2$,

$$\begin{aligned}
m_1(x^n \mid c, \sigma^2) &= \int_{-\infty}^{\infty} (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{S^2 + n(\bar{x} - \mu)^2}{2\sigma^2}\right\}(2\pi c\sigma^2)^{-1/2}\exp\left\{-\frac{\mu^2}{2c\sigma^2}\right\}d\mu \\
&= (2\pi\sigma^2)^{-n/2}\,(1 + nc)^{-1/2}\exp\left\{-\frac{1}{2\sigma^2}\left[S^2 + \frac{n\bar{x}^2}{1 + nc}\right]\right\},
\end{aligned}$$

so that

$$m_1(x^n) = \int_0^\infty m_1(x^n \mid c, \sigma^2)\,\frac{1}{\sigma^2}\,d\sigma^2 = \pi^{-n/2}\,\Gamma\!\left(\frac{n}{2}\right)(1 + nc)^{-1/2}\left[S^2 + \frac{n\bar{x}^2}{1 + nc}\right]^{-n/2}.$$

Thus

$$\begin{aligned}
\log\frac{m_1(x^n)}{m_0(x^n)} &= -\frac{1}{2}\log(1 + nc) - \frac{n}{2}\log\left[\frac{S^2 + n\bar{x}^2/(1 + nc)}{\sum_{i=1}^n x_i^2}\right] \\
&= -\frac{1}{2}\log(1 + nc) - \frac{n}{2}\log\left[1 - \frac{nc}{1 + nc}\cdot\frac{n\bar{x}^2}{S^2 + n\bar{x}^2}\right].
\end{aligned}$$

Therefore, the Bayes factor criterion or the mixture MDL reduces to a criterion which is very similar to that given in the previous example, except that $\sigma^2$ is now replaced by an estimator.

Example 3. (Jeffreys’ Test) This is similar to the problem discussed above, except that $\mu \mid \sigma^2 \sim \text{Cauchy}(0, \sigma)$, with density

$$g_1(\mu \mid \sigma^2) = \frac{1}{\pi\sigma\,(1 + \mu^2/\sigma^2)},$$

the Cauchy prior. The prior on $\sigma^2$ is the same as before under both models: $\pi(\sigma^2) \propto 1/\sigma^2$. This approach suggested by Jeffreys ([14]) is important. It explains how one should proceed when the hypotheses which describe the model selection problem involve only some of the parameters and the remaining parameters are considered to be nuisance parameters. Then Jeffreys’ suggestion is to employ the same noninformative prior on the nuisance parameters under both models, and a proper prior with a low level of information on the parameters of interest. Details on this problem along with this choice of prior can be found in Section 2.7 of [15].

Note that $m_0(x^n)$ is the same as in the previous example, namely

$$m_0(x^n) = \int_0^\infty (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2\right\}\frac{1}{\sigma^2}\,d\sigma^2 = \pi^{-n/2}\,\Gamma\!\left(\frac{n}{2}\right)\left(\sum_{i=1}^n x_i^2\right)^{-n/2},$$

whereas

$$m_1(x^n) = \int_0^\infty\!\int_{-\infty}^{\infty} (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{S^2 + n(\bar{x} - \mu)^2}{2\sigma^2}\right\}\frac{1}{\pi\sigma\,(1 + \mu^2/\sigma^2)}\,d\mu\,\frac{1}{\sigma^2}\,d\sigma^2.$$

No closed form is available for $m_1(x^n)$ in this case. To calculate this one can proceed as follows, as indicated in Section 2.7 of [15]. The Cauchy density $g_1(\mu \mid \sigma^2)$ can be expressed as a Gamma scale mixture of normals,

$$g_1(\mu \mid \sigma^2) = \int_0^\infty \left(\frac{\lambda}{2\pi\sigma^2}\right)^{1/2}\exp\left\{-\frac{\lambda\mu^2}{2\sigma^2}\right\}(2\pi)^{-1/2}\,\lambda^{-1/2}\,e^{-\lambda/2}\,d\lambda,$$

where $\lambda$ is the mixing Gamma variable. Now one can integrate over $\mu$ and $\sigma^2$ in closed form to simplify $m_1(x^n)$. Finally, one has a one-dimensional integral over $\lambda$ left, which can be numerically computed whenever needed.
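The scale-mixture representation used in Example 3 can be verified numerically. The sketch below assumes the parameterization written above, a $N(0, \sigma^2/\lambda)$ density mixed over $\lambda \sim \text{Gamma}(1/2, \text{rate } 1/2)$, and integrates over $\lambda$ with a plain midpoint rule. The function names, the truncation point and the number of steps are illustrative choices, not part of the original.

```python
import math

def cauchy_density(mu, sigma):
    """Cauchy(0, sigma) density evaluated at mu."""
    return 1.0 / (math.pi * sigma * (1.0 + (mu / sigma) ** 2))

def mixture_integrand(lam, mu, sigma):
    """N(mu | 0, sigma^2 / lam) times the Gamma(1/2, rate 1/2) density of lam."""
    normal = math.sqrt(lam / (2 * math.pi * sigma ** 2)) * math.exp(-lam * mu ** 2 / (2 * sigma ** 2))
    gamma = lam ** -0.5 * math.exp(-lam / 2) / math.sqrt(2 * math.pi)
    return normal * gamma

def cauchy_via_mixture(mu, sigma, upper=80.0, steps=20000):
    """Midpoint-rule approximation of the one-dimensional integral over lam."""
    h = upper / steps
    # Midpoints avoid lam = 0, where the Gamma density alone is unbounded.
    return h * sum(mixture_integrand((i + 0.5) * h, mu, sigma) for i in range(steps))

if __name__ == "__main__":
    for mu, sigma in [(0.0, 1.0), (1.5, 2.0), (3.0, 1.0)]:
        print(mu, sigma, cauchy_density(mu, sigma), cauchy_via_mixture(mu, sigma))
```

The same one-dimensional quadrature is what remains for $m_1(x^n)$ once $\mu$ and $\sigma^2$ have been integrated out analytically.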

Now let us note from the examples discussed above that an efficient computation of $m_i(x^n)$ relies on having an explicit functional form for it. This is generally possible only when a conjugate prior is used as in Examples 1 and 2. For other priors, such as in Example 3, some numerical computation or approximation is needed.

From Sections 4.3.1 and 7.1 of [15], assuming that $f$ and $g_i$ are smooth functions, we obtain, for large $n$, the following asymptotic approximation for $m_i(x^n)$ of Equation (8). Let $k_i$ be the dimension of $\theta_{k_i}$, let $-n\,h(\theta_{k_i}) = \log f(x^n \mid \theta_{k_i})$, and let $\Delta_h(\theta_{k_i})$ denote the Hessian of $h$, i.e.,

$$\Delta_h(\theta_{k_i}) = \left(\frac{\partial^2 h(\theta_{k_i})}{\partial\theta_i\,\partial\theta_j}\right)_{k_i \times k_i}.$$

Also, let $\hat\theta_{k_i}$ denote either the MLE or the posterior mode. Then

$$\begin{aligned}
m_i(x^n) &= \int f(x^n \mid \theta_{k_i})\,g_i(\theta_{k_i})\,d\theta_{k_i} \\
&= e^{-n h(\hat\theta_{k_i})}\,(2\pi)^{k_i/2}\,n^{-k_i/2}\,\{\det\Delta_h(\hat\theta_{k_i})\}^{-1/2}\,g_i(\hat\theta_{k_i})\,\left(1 + O(n^{-1})\right),
\end{aligned} \tag{11}$$

so that

$$\begin{aligned}
-\log m_i(x^n) &= n\,h(\hat\theta_{k_i}) - \frac{k_i}{2}\log(2\pi) + \frac{k_i}{2}\log n + \frac{1}{2}\log\det\Delta_h(\hat\theta_{k_i}) - \log g_i(\hat\theta_{k_i}) + O(n^{-1}) \\
&= -\log f(x^n \mid \hat\theta_{k_i}) + \frac{k_i}{2}\log\frac{n}{2\pi} + \frac{1}{2}\log\det\Delta_h(\hat\theta_{k_i}) - \log g_i(\hat\theta_{k_i}) + O(n^{-1}).
\end{aligned} \tag{12}$$

Ignoring terms that stay bounded as $n \to \infty$, [12] suggests using the (approximate) criterion which Rissanen calls stochastic information complexity (or “stochastic complexity” for short),

$$SIC(x^n) = -\log f(x^n \mid \hat\theta_{k_i}) + \frac{1}{2}\log\det\hat\Sigma_n, \tag{13}$$

where

$$\hat\Sigma_n = n\,\Delta_h(\hat\theta_{k_i}), \tag{14}$$

for implementing MDL. See [12,16-18] for further details.
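The quality of the approximation (12), and its relation to (13) and (14), can be seen in a case where the marginal is available exactly. The sketch below assumes a Bernoulli model with a uniform prior, which is not an example from the original text: the exact marginal of a sequence with $s$ successes is the Beta function $B(s+1, n-s+1)$, $k_i = 1$, $g_i \equiv 1$, and $\Delta_h(\hat\theta) = 1/\{\hat\theta(1-\hat\theta)\}$. All function names are hypothetical.

```python
import math

def bernoulli_summaries(s, n):
    """Return (-log f(x^n | theta_hat), Hessian of h at theta_hat); needs 0 < s < n."""
    theta = s / n
    nll = -(s * math.log(theta) + (n - s) * math.log(1 - theta))
    hess = 1.0 / (theta * (1 - theta))
    return nll, hess

def exact_neg_log_marginal(s, n):
    """-log of the integral of theta^s (1-theta)^(n-s) over a uniform prior."""
    return -(math.lgamma(s + 1) + math.lgamma(n - s + 1) - math.lgamma(n + 2))

def laplace_neg_log_marginal(s, n):
    """Equation (12) with k_i = 1 and log g_i = 0, dropping the O(1/n) term."""
    nll, hess = bernoulli_summaries(s, n)
    return nll + 0.5 * math.log(n / (2 * math.pi)) + 0.5 * math.log(hess)

def sic(s, n):
    """Equations (13)-(14): -log f + (1/2) log det(n * Hessian)."""
    nll, hess = bernoulli_summaries(s, n)
    return nll + 0.5 * math.log(n * hess)

if __name__ == "__main__":
    s, n = 60, 200
    print(exact_neg_log_marginal(s, n), laplace_neg_log_marginal(s, n), sic(s, n))
```

In this one-parameter case SIC differs from the Laplace value only by the bounded constant $\frac{1}{2}\log(2\pi)$, which is the term dropped in passing from (12) to (13).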

If $X_1, X_2, \ldots, X_n$ are i.i.d. observations, then we have

$$\Delta_h(\hat\theta_{k_i}) = I(\hat\theta_{k_i}) + O(n^{-1}),$$

where $I(\theta)$ is the Fisher information matrix, and hence

$$-\log m_i(x^n) = -\log f(x^n \mid \hat\theta_{k_i}) + \frac{k_i}{2}\log n - \frac{k_i}{2}\log(2\pi) + \frac{1}{2}\log\det I(\hat\theta_{k_i}) - \log g_i(\hat\theta_{k_i}) + O(n^{-1}). \tag{15}$$

Now ignoring terms that stay bounded as $n \to \infty$, we obtain the Schwarz criterion ([19]), BIC, for $M_i$ given by

$$BIC(x^n) = -\log f(x^n \mid \hat\theta_{k_i}) + \frac{k_i}{2}\log n, \tag{16}$$

which can be seen to be asymptotically equivalent to SIC.

Example 4. [20] discusses a model selection problem for failure time data. The two models considered are exponential and Weibull:

$$M_0: f(x \mid \lambda) = \lambda\exp(-\lambda x), \quad x > 0,$$

versus

$$M_1: f(x \mid \lambda, \alpha) = \lambda\alpha\,x^{\alpha - 1}\exp(-\lambda x^\alpha), \quad x > 0,$$

where $\lambda > 0$ and $\alpha > 0$. The model selection criterion of Rissanen is SIC as described in Equations (13) and (14). However, in this problem a better approximation is employed for the mixture MDL, the mixture being the Jeffreys mixture, i.e., the conditional prior densities under $M_i$ for $i = 0, 1$ are given by

$$g_i(\theta_{k_i}) = B_i^{-1}\,\{\det I(\theta_{k_i})\}^{1/2}, \quad \theta_{k_i} \in K_i,$$

where $K_i$ is a compact subset of the relevant parameter space $\Theta_i$ and $B_i = \int_{K_i}\{\det I(\theta_{k_i})\}^{1/2}\,d\theta_{k_i}$. Consequently, it follows that

$$\begin{aligned}
m_i(x^n) &= B_i^{-1}\int_{K_i} f(x^n \mid \theta_{k_i})\,\{\det I(\theta_{k_i})\}^{1/2}\,d\theta_{k_i} \\
&= \left(\frac{2\pi}{n}\right)^{k_i/2} B_i^{-1}\,f(x^n \mid \hat\theta_{k_i})\,\left(1 + O(n^{-1})\right),
\end{aligned}$$

where $\hat\theta_{k_i}$ is the MLE of $\theta_{k_i}$ under $M_i$. This yields

$$-\log m_i(x^n) = -\log f(x^n \mid \hat\theta_{k_i}) + \frac{k_i}{2}\log\frac{n}{2\pi} + \log\int_{K_i}\{\det I(\theta_{k_i})\}^{1/2}\,d\theta_{k_i} + O(n^{-1}).$$

Compare this with (15) and note that the term involving $\det I(\hat\theta_{k_i})$ vanishes.
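For a concrete feel of the exponential-versus-Weibull comparison, the sketch below applies the simpler BIC criterion (16) rather than the Jeffreys-mixture approximation of Example 4. It assumes the Weibull parameterization $f(x \mid \lambda, \alpha) = \lambda\alpha x^{\alpha-1}\exp(-\lambda x^\alpha)$ and finds $\hat\alpha$ by bisection on the profile score; the failure times, bracket and function names are hypothetical.

```python
import math

def exp_loglik_max(x):
    """Maximized exponential log-likelihood; the MLE is lambda = n / sum(x)."""
    n = len(x)
    return n * math.log(n / sum(x)) - n

def weibull_loglik_max(x, lo=1e-3, hi=50.0, iters=200):
    """Maximized Weibull log-likelihood with the shape alpha found by bisection."""
    n = len(x)
    mean_log = sum(math.log(v) for v in x) / n

    def score(alpha):
        # Profile score in alpha (increasing in alpha); its root is the MLE.
        w = [v ** alpha for v in x]
        return sum(wi * math.log(v) for wi, v in zip(w, x)) / sum(w) - 1.0 / alpha - mean_log

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    alpha = 0.5 * (lo + hi)
    lam = n / sum(v ** alpha for v in x)
    loglik = n * math.log(lam) + n * math.log(alpha) + (alpha - 1) * n * mean_log - n
    return loglik, alpha, lam

def bic(neg_loglik, k, n):
    """Equation (16)."""
    return neg_loglik + 0.5 * k * math.log(n)

if __name__ == "__main__":
    times = [0.8, 1.1, 1.3, 0.9, 1.0, 1.2, 0.7, 1.4, 1.05, 0.95]  # hypothetical data
    n = len(times)
    ll_e = exp_loglik_max(times)
    ll_w, alpha, lam = weibull_loglik_max(times)
    print(bic(-ll_e, 1, n), bic(-ll_w, 2, n), alpha, lam)
```

Since the exponential model is the special case $\alpha = 1$, the Weibull fit can never be worse; the $\frac{k_i}{2}\log n$ term decides whether the improvement justifies the extra parameter.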

We would like to note here that many authors [21,22] define the MDL estimate to be the same as the HPD estimate with respect to the Jeffreys’ prior restricted to some compact set $K$ where its integral is finite:

$$\pi(\theta) = \frac{\{\det I(\theta)\}^{1/2}}{\int_K \{\det I(u)\}^{1/2}\,du}, \quad \text{for } \theta \in K,$$

which is the stochastic complexity approach advocated above.

It must be emphasized that proper priors are being employed to derive the SIC criterion, and hence indeterminacy and inconsistency problems faced by techniques employing improper priors are not a difficulty in this approach. Moreover, this approach can be viewed as an implementable approximation to an objective Bayesian solution.

4.2. Two-Stage MDL

Now consider the two-stage MDL which codes the prior and the likelihood separately and adds the two description lengths. This approach is therefore similar to estimating the parameter $\theta$ with the HPD estimate when there is an informative prior, or with the MLE, but the resulting minimum description length does have interesting features. To see when and how this approach approximates the above mentioned model selection criterion, let us look at some of the specific details in the two stages of coding. See [12,13] for further details. Again, recall the setup in (4) and (5).

Stage 1. Let $\hat\theta_{k_i}$ be an estimate of $\theta_{k_i}$ such as the posterior mean, HPD or MLE under $M_i$. This needs to be coded. Consider the prior density $g_i(\theta_{k_i})$ conditional on $M_i$ being true. Usually MDL would choose a uniform density. Restrict $\theta$ to a large compact subset of the parameter space and discretize it as discussed in Section 3 with a precision of $\delta_\pi = 1/\sqrt{n}$. Then the codelength required for coding $\hat\theta_{k_i}$ is

$$L(\hat\theta_{k_i}) = -\log g_i(\hat\theta_{k_i}) + \frac{k_i}{2}\log n. \tag{17}$$

Stage 2. Now the data $x^n$ is coded using the model density $f(x^n \mid \hat\theta_{k_i})$. Discretization may again be needed, say with precision $\delta_f$. Thus the description length for coding $x^n$ will be

$$-\log f(x^n \mid \hat\theta_{k_i}) - n\log\delta_f. \tag{18}$$

Summing these two codelengths, therefore, we obtain a total description length of

$$-\log f(x^n \mid \hat\theta_{k_i}) - n\log\delta_f - \log g_i(\hat\theta_{k_i}) + \frac{k_i}{2}\log n. \tag{19}$$

Since the second term above, $-n\log\delta_f$, is constant over both $M_0$ and $M_1$, and the third term stays bounded as $n$ increases, these two terms are dropped from the two-stage coding criterion. Thus, for regular parametric models, the two-stage MDL simplifies to the same criterion (for $M_i$) as BIC, namely,

$$-\log f(x^n \mid \hat\theta_{k_i}) + \frac{k_i}{2}\log n. \tag{20}$$

In more complicated model selection problems, the two-stage MDL will involve further steps and may differ from BIC.

It may also be seen upon comparing (19) with (15) that the performance of SIC based MDL should be superior to the simplified two-stage MDL for moderate $n$ since SIC uses a better precision for coding the parameter, namely, one based on the Fisher information.

5. Regression and Function Estimation

Model selection is an important part of parametric and nonparametric regression and smoothing. Variable selection problems in multiple linear regression, order of the spline to fit and wavelet thresholding are some such areas. We will briefly consider these problems to see how MDL methods can provide computationally attractive approximations to the respective Bayesian solutions.

5.1. Variable Selection in Linear Regression

Variable selection is an important and well studied problem in the context of normal linear models. Literature includes [23-32]. We will only touch upon this area with the specific intention of examining useful and computationally attractive approximations to some of the Bayesian methods.

Suppose we have an observation vector $y^n$ on a response variable $Y$ and also measurements $x_1^n, x_2^n, \ldots, x_M^n$ on a set of potential explanatory variables (or regressors). Following [13], we associate with each regressor $x_j$ a binary variable $\gamma_j$. Then the set of available linear models is

$$y^n = \sum_{j:\,\gamma_j = 1}\beta_j\,x_j^n + \varepsilon^n, \tag{21}$$

where $\varepsilon^n \sim N_n(0, \sigma^2 I_n)$. Note that $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_M)$ is, then, a Bernoulli sequence associated with the set of regression coefficients, $\beta = (\beta_1, \beta_2, \ldots, \beta_M)$ also. Let $\beta_\gamma$ denote the vector of non-zero regression coefficients corresponding to $\gamma$, and $X_\gamma$ the corresponding design matrix, which results in the model

$$y^n = X_\gamma\beta_\gamma + \varepsilon^n.$$

Selecting the best model, then, is actually an estimation problem, i.e., find the HPD estimate of $\gamma$ starting with a prior $\pi_1$ on $\gamma$ and a prior $\pi_2$ on $(\beta_\gamma, \sigma^2)$ given $\gamma$. The two-stage MDL, which is the simplest, uses the criterion of minimizing

$$DL(y^n, \gamma) = -\log f(y^n \mid \hat\beta_\gamma, \hat\sigma^2_\gamma, X_\gamma, \gamma) - \log\pi_1(\gamma), \tag{22}$$

where

$$\hat\beta_\gamma = (X_\gamma' X_\gamma)^{-1}X_\gamma'\,y^n, \qquad \hat\sigma^2_\gamma = RSS_\gamma/n = \|y^n - X_\gamma\hat\beta_\gamma\|^2/n. \tag{23}$$

Consider the uniform prior on $\gamma$, all $2^M$ values receiving the same weight $1/2^M$. Using these, and coding the $k_\gamma$ estimated coefficients with precision $1/\sqrt{n}$ as in (17), we can re-write the MDL criterion as the one which minimizes (as in Example 2)

$$\frac{n}{2}\log RSS_\gamma + \frac{k_\gamma}{2}\log n, \tag{24}$$

where $k_\gamma$ is the number of $\gamma_j = 1$.

We can also derive the mixture MDL or stochastic complexity of a given model $\gamma$. If $g(\beta_\gamma, \sigma^2 \mid \gamma)$ is the prior density under $\pi_2$,

$$m(y^n \mid \gamma) = \int f(y^n \mid \beta_\gamma, \sigma^2, X_\gamma, \gamma)\,g(\beta_\gamma, \sigma^2 \mid \gamma)\,d\beta_\gamma\,d\sigma^2. \tag{25}$$

Applying (13), (14) after evaluating the information matrix of the parameters $(\beta_\gamma, \sigma^2)$ and ignoring terms that are irrelevant for model selection (see [13]), one obtains

$$SIC(\gamma) = \frac{n - k_\gamma - 2}{2}\log RSS_\gamma + \frac{k_\gamma}{2}\log n + \frac{1}{2}\log\det(X_\gamma' X_\gamma). \tag{26}$$

If $g$ is chosen to be the conjugate prior density, then the marginal density $m(y^n \mid \gamma)$ can be explicitly derived. Details on this and further simplifications obtained upon using Zellner’s g-prior can be found in [13]. (See also [33,34].)

This method is only useful if one is interested in comparing a few of these models, arising out of some pre-specified subsets. Comparing all $2^M$ models is not a computationally viable option for even moderate values of $M$, since for each model $\gamma$ one has to compute the corresponding $\hat\beta_\gamma$ and $RSS_\gamma$.

We are more interested in a different problem, namely, whether an extra regressor should be added to an already determined model. This is the idea behind the step-wise regression, forward selection method. In this set-up, the model comparison problem can be stated as comparing

$$M_0: y^n = \sum_{j=1}^{k}\beta_j\,x_j^n + \varepsilon^n \quad \text{versus} \quad M_1: y^n = \sum_{j=1}^{k+1}\beta_j\,x_j^n + \varepsilon^n.$$

This is actually a model building method, so we assume that $x_1^n = \mathbf{1}^n$, and hence $\beta_1$ is the intercept, which gives the starting model. Then we decide whether this model needs to be expanded by adding additional regressors. Thus, at step $k$, we have an existing model with regressors $x_1^n, \ldots, x_k^n$ and we fix $x_{k+1}$ to be one of the remaining $M - k$ regressors as the candidate for possible selection. Now the two-stage MDL approach is straightforward. From (22) and (24), we note that $x_{k+1}$ is to be selected if and only if

$$DL(k+1) - DL(k) = \frac{n}{2}\log\frac{RSS_{k+1}}{RSS_k} + \frac{1}{2}\log n < 0, \tag{27}$$

where $DL(j)$ is the description length of the model with regressors $x_1, \ldots, x_j$ and $RSS_j$ is its residual sum of squares as given in (23). A closer look at (27) reveals certain interesting facts. We need the following additional notations involving design matrices and the corresponding projection matrices. We assume that the required matrix inverses exist.

$$X_{(j)} = (x_1^n, x_2^n, \ldots, x_j^n), \qquad P_{(j)} = X_{(j)}\,(X_{(j)}' X_{(j)})^{-1}\,X_{(j)}'.$$

Then we note the following result which may be found, for example, in [35].

$$(X_{(k+1)}' X_{(k+1)})^{-1} = \begin{pmatrix} X_{(k)}' X_{(k)} & X_{(k)}'\,x_{k+1} \\ x_{k+1}'\,X_{(k)} & x_{k+1}'\,x_{k+1} \end{pmatrix}^{-1} = \begin{pmatrix} (X_{(k)}' X_{(k)})^{-1} + \dfrac{u u'}{v} & -\dfrac{u}{v} \\[2mm] -\dfrac{u'}{v} & \dfrac{1}{v} \end{pmatrix}, \tag{28}$$

where

$$u = (X_{(k)}' X_{(k)})^{-1}X_{(k)}'\,x_{k+1} \quad \text{and} \quad v = x_{k+1}'\,(I_n - P_{(k)})\,x_{k+1}.$$

It then follows that

$$P_{(k+1)} = P_{(k)} + \frac{1}{v}\,(I_n - P_{(k)})\,x_{k+1}\,x_{k+1}'\,(I_n - P_{(k)}),$$

and hence

$$RSS_k - RSS_{k+1} = (y^n)'\,(P_{(k+1)} - P_{(k)})\,y^n = \frac{1}{v}\left\{(y^n)'\,(I_n - P_{(k)})\,x_{k+1}\right\}^2,$$

and

$$\frac{RSS_k - RSS_{k+1}}{RSS_k} = \frac{\left\{(y^n)'\,(I_n - P_{(k)})\,x_{k+1}\right\}^2}{\left\{x_{k+1}'\,(I_n - P_{(k)})\,x_{k+1}\right\}\left\{(y^n)'\,(I_n - P_{(k)})\,y^n\right\}} = r^2_{y, x_{k+1} \mid 1, 2, \ldots, k},$$

where $r_{y, x_{k+1} \mid 1, 2, \ldots, k}$ is simply the partial correlation coefficient between $y$ and $x_{k+1}$ conditional on $x_1, \ldots, x_k$. Substituting these in (27), we see that

$$\begin{aligned}
DL(k+1) - DL(k) &= \frac{n}{2}\log\frac{RSS_{k+1}}{RSS_k} + \frac{1}{2}\log n \\
&= \frac{n}{2}\log\left(1 - \frac{RSS_k - RSS_{k+1}}{RSS_k}\right) + \frac{1}{2}\log n \\
&= \frac{n}{2}\log\left(1 - r^2_{y, x_{k+1} \mid 1, 2, \ldots, k}\right) + \frac{1}{2}\log n.
\end{aligned} \tag{29}$$

Therefore, $DL(k+1) < DL(k)$ if and only if

$$\frac{n}{2}\log\left(1 - r^2_{y, x_{k+1} \mid 1, 2, \ldots, k}\right) < -\frac{1}{2}\log n,$$

if and only if

$$r^2_{y, x_{k+1} \mid 1, 2, \ldots, k} > 1 - n^{-1/n}. \tag{30}$$

This method does have some appeal, in that at each step, it tries to select that variable which has the largest partial correlation with the response (conditional on the variables which are already in the model), just like the step-wise regression method. However, unlike the step-wise regression method it does not require any stopping rule to decide whether the candidate should be added. It relies on the magnitude of the partial correlation instead.

One can also apply the stochastic complexity criterion given in (26) above. Then we obtain,

$$\begin{aligned}
SIC(k+1) - SIC(k) &= \frac{n - k - 3}{2}\log RSS_{k+1} - \frac{n - k - 2}{2}\log RSS_k + \frac{1}{2}\log n + \frac{1}{2}\log\frac{\det(X_{(k+1)}' X_{(k+1)})}{\det(X_{(k)}' X_{(k)})} \\
&= \frac{n - k - 3}{2}\log\frac{RSS_{k+1}}{RSS_k} - \frac{1}{2}\log RSS_k + \frac{1}{2}\log n + \frac{1}{2}\log\left\{x_{k+1}'\,(I_n - P_{(k)})\,x_{k+1}\right\} \\
&= \frac{n - k - 3}{2}\log\left(1 - r^2_{y, x_{k+1} \mid 1, 2, \ldots, k}\right) + \frac{1}{2}\log n + \frac{1}{2}\log\frac{x_{k+1}'\,(I_n - P_{(k)})\,x_{k+1}}{(y^n)'\,(I_n - P_{(k)})\,y^n},
\end{aligned} \tag{31}$$

which is related to the step-wise regression approach, but uses more information than just the partial correlation.
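The two-stage selection rule (27)–(30) translates into a short greedy loop. The following is a minimal sketch in plain Python, assuming ordinary least squares via the normal equations and an intercept-only starting model; the helper names (`solve`, `rss`, `forward_select`) and the toy data are hypothetical. A candidate enters only if its partial $r^2$, computed as $1 - RSS_{k+1}/RSS_k$, exceeds $1 - n^{-1/n}$.

```python
import math

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, m):
            f = aug[r][c] / aug[c][c]
            for j in range(c, m + 1):
                aug[r][j] -= f * aug[c][j]
    z = [0.0] * m
    for r in reversed(range(m)):
        z[r] = (aug[r][m] - sum(aug[r][j] * z[j] for j in range(r + 1, m))) / aug[r][r]
    return z

def rss(cols, y):
    """Residual sum of squares of y regressed on the given columns."""
    gram = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(gram, rhs)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def r2_threshold(n):
    """Right-hand side of (30): 1 - n^(-1/n)."""
    return 1.0 - n ** (-1.0 / n)

def forward_select(candidates, y):
    """Greedy two-stage MDL selection; candidates maps a name to a column."""
    n = len(y)
    cols, chosen = [[1.0] * n], []   # start from the intercept-only model
    current = rss(cols, y)
    remaining = dict(candidates)
    while remaining and current > 0.0:
        # Partial r^2 of each candidate given the current model.
        scores = {name: 1.0 - rss(cols + [c], y) / current for name, c in remaining.items()}
        best = max(scores, key=scores.get)
        if scores[best] <= r2_threshold(n):
            break
        cols.append(remaining.pop(best))
        chosen.append(best)
        current = rss(cols, y)
    return chosen

if __name__ == "__main__":
    x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    x2 = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
    noise = [0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5]
    y = [2.0 + 3.0 * a + e for a, e in zip(x1, noise)]
    print(forward_select({"x1": x1, "x2": x2}, y), r2_threshold(33))
```

Under this formula, $n = 33$ gives a threshold of about 0.1005, the value quoted in Example 5. The sketch refits each candidate model from scratch for clarity; the update formula (28) would avoid that cost.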

A full-fledged Bayesian approach using the g-prior can also be implemented as shown below. Note that

$$m_0(y^n) = \int_0^\infty\!\int_{R^k} f(y^n \mid \beta_k, \sigma^2, X_{(k)})\,g_0(\beta_k, \sigma^2)\,d\beta_k\,d\sigma^2,$$

and

$$m_1(y^n) = \int_0^\infty\!\int_{R^{k+1}} f(y^n \mid \beta_{k+1}, \sigma^2, X_{(k+1)})\,g_1(\beta_{k+1}, \sigma^2)\,d\beta_{k+1}\,d\sigma^2, \tag{32}$$

where $g_0$ and $g_1$, respectively, are the prior densities under $M_0$ and $M_1$. Taking these priors to be g-priors, namely,

$$\beta_j \mid \sigma^2, c \sim N_j\!\left(0,\; c\,\sigma^2\,(X_{(j)}' X_{(j)})^{-1}\right), \quad j = k, k+1,$$

along with the density $1/\sigma^2$ for $\sigma^2$ and a (proper prior) density $\pi(c)$ for the hyperparameter $c$, we obtain

$$BF_{01} = \frac{\int_0^\infty m_0(y^n \mid c)\,\pi(c)\,dc}{\int_0^\infty m_1(y^n \mid c)\,\pi(c)\,dc}, \tag{33}$$

where

$$\begin{aligned}
m_0(y^n \mid c) &= a_n\,(1 + c)^{-k/2}\left[RSS_k + \frac{1}{1 + c}\,\hat\beta_k'\,X_{(k)}' X_{(k)}\,\hat\beta_k\right]^{-n/2}, \\
m_1(y^n \mid c) &= a_n\,(1 + c)^{-(k+1)/2}\left[RSS_{k+1} + \frac{1}{1 + c}\,\hat\beta_{k+1}'\,X_{(k+1)}' X_{(k+1)}\,\hat\beta_{k+1}\right]^{-n/2},
\end{aligned} \tag{34}$$

with $a_n = \pi^{-n/2}\,\Gamma(n/2)$ and $\hat\beta_j = (X_{(j)}' X_{(j)})^{-1}X_{(j)}'\,y^n$.

The one-dimensional integrals in (33), however, cannot be obtained in closed form. One could also approximate $BF_{01}$ with $m_0(y^n \mid \hat{c}_k)/m_1(y^n \mid \hat{c}_{k+1})$, where $\hat{c}$ are the ML-II (cf. [36]) estimates of $c$. See [13] for details.
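The conditional marginals in (34) depend on the data only through $RSS_j$ and the fitted sum of squares $\hat\beta_j' X_{(j)}' X_{(j)} \hat\beta_j$, so they are cheap to evaluate on a grid of $c$ values. The sketch below works on the log scale and assumes these two summaries have already been computed; the function names and the numbers are hypothetical, and no particular prior $\pi(c)$ or estimate $\hat{c}$ is implied.

```python
import math

def log_marginal_given_c(rss, fit_ss, n, k, c):
    """log m(y^n | c) from (34); fit_ss stands for beta_hat' X'X beta_hat."""
    log_a_n = -0.5 * n * math.log(math.pi) + math.lgamma(0.5 * n)
    return (log_a_n
            - 0.5 * k * math.log(1.0 + c)
            - 0.5 * n * math.log(rss + fit_ss / (1.0 + c)))

def log_bf01_given_c(rss_k, fit_k, rss_k1, fit_k1, n, k, c0, c1):
    """log of m0(y^n | c0) / m1(y^n | c1) for models with k and k+1 regressors."""
    return (log_marginal_given_c(rss_k, fit_k, n, k, c0)
            - log_marginal_given_c(rss_k1, fit_k1, n, k + 1, c1))

if __name__ == "__main__":
    # Hypothetical summaries; rss + fit_ss equals y'y (here 100) for every model.
    n, k = 33, 2
    for c in (0.0, 1.0, 10.0, 100.0):
        print(c, log_bf01_given_c(10.0, 90.0, 5.0, 95.0, n, k, c, c))
```

At $c = 0$ the prior forces all coefficients to zero, both marginals reduce to $a_n (y'y)^{-n/2}$, and the log ratio is zero, which gives a quick sanity check.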

Example 5. We illustrate the MDL approach to step-wise regression by applying it to the Iowa corn-yield data (see [37,38]). We have not included "year" as a regressor (which is a proxy for technological advance) and instead have considered only the weather-related regressors.

In this data set the variables are: X1 = Year, 1 denoting 1930, X2 = Pre-season precipitation, X3 = May temperature, X4 = June rain, X5 = June temperature, X6 = July rain, X7 = July temperature, X8 = August rain, X9 = August temperature, and Y = X10 = Corn Yield.

As mentioned earlier, we always keep the intercept and check whether this regression should be enlarged by adding more regressors. We first apply the Two-stage MDL criterion. From (30), at step k, we consider only those regressors (which are not already in the model and)

which satisfy $r^2_{y,j\cdot 1,\ldots,k} > 1 - n^{-1/n}$ (= 0.1005 in this example). From this set we pick the one with the largest $r^2_{y,j\cdot 1,\ldots,k}$. The values of $r^2_{y,j\cdot 1,\ldots,k}$ for the relevant steps are listed below.

step   X2      X3      X4      X5      X6      X7      X8      X9
1      0.037   0.010   0.021   0.022   0.338   0.336   0.044   0.118
2      0.066   0.034   0.001   0.015   --      0.162   0.060   0.018
3      0.016   0.011   0.022   0.000   --      --      0.050   0.004

According to our procedure we select X6 first, followed by X7 and the selection ends there.

We consider the SIC criterion next. From (31), at step k, we pick the regressor Xj with the largest value of $v_{j,k}$, a quantity that combines $r^2_{y,j\cdot 1,\ldots,k}$ with the quadratic forms $\mathbf{y}'(I - P_k)\mathbf{y}$ and $\mathbf{x}_j'(I - P_k)\mathbf{x}_j$, provided it is positive. The values of $v_{j,k}$ for the relevant steps are given below.

step   X2      X3      X4      X5      X6      X7      X8      X9
1      −0.002  −0.020  0.043   −0.001  0.371   0.319   0.071   0.106
2      0.013   −0.011  0.012   −0.022  --      0.141   0.073   −0.002
3      −0.040  −0.040  0.032   −0.042  --      --      0.058   −0.022
4      −0.050  −0.042  0.027   −0.039  --      --      --      −0.023
5      −0.043  −0.027  --      −0.028  --      --      --      −0.025

According to SIC our order of selection is X6, X7, X8, X4.

5.2. Wavelet Thresholding

Consider the nonparametric regression problem where we have the following model for the noisy observations $\mathbf{v} = (v_1, v_2, \ldots, v_n)'$:

$$v_i = s(x_i) + \epsilon_i, \quad i = 1, \ldots, n, \ \text{and}\ x_i \in T, \tag{35}$$

where $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$ errors with unknown error variance $\sigma^2$, and $s$ is a function (or signal) defined on some interval $T \subset R^1$. Assuming $s$ is a smooth function satisfying certain regularities (see [15,39,40]), we have the wavelet decomposition of $s$:

$$s(x) = \sum_j \beta_j \psi_j(x), \tag{36}$$

where $\psi_j$ are the wavelet functions and $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots)$ is the corresponding vector of wavelet coefficients. We assume that the normally infinite sum in (36) can be taken to be a finite sum (or at least a very good approximation) as indicated in [39].

Upon applying the discrete wavelet transform (DWT) to $\mathbf{v}$, we get the estimated wavelet coefficients, $\mathbf{x}_n = W\mathbf{v}$. Consider now the equivalent model:

$$\mathbf{x}_n = \boldsymbol{\beta} + \boldsymbol{\epsilon},$$

where $\boldsymbol{\epsilon} \sim N_n(\mathbf{0}, \sigma^2 I_n)$.

The model selection problem here involves determining the number of non-zero wavelet coefficients:

$M_0$: number of non-zero $\beta_j$ is $k_0$ versus $M_1$: number of non-zero $\beta_j$ is $k_1$,

where $k_i \le n$, $i = 0, 1$ are the number of non-zero wavelet coefficients of interest.

The prior distribution on the non-zero $\beta_j$ is assumed to be i.i.d. $N(0, \tau^2)$, $1 \le j \le k_i$ under $M_i$, $i = 0, 1$. Since we have not identified the locations (indices) of the non-zero wavelet coefficients, $\beta_j$, we proceed as follows to describe the prior structure. With each $\beta_j$ we associate a binary variable $\gamma_j$ as in [41] for wavelet regression or as in [13] for variable selection in regression. Then $\boldsymbol{\gamma} = (\gamma_1, \gamma_2, \ldots, \gamma_n)$ is a Bernoulli sequence associated with the set of regression coefficients, $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_n)$. Let $\{\boldsymbol{\gamma} : \sum_{j=1}^{n} \gamma_j = k_i\}$ be the set of such sequences with exactly $k_i$ non-zero entries. Finally, we let $z_j$, $1 \le j \le n$ be i.i.d. $N(0, \tau^2)$ ($g$ be the corresponding joint density), and define the following structure under $M_i$.

$$\pi(\boldsymbol{\gamma} \mid k_i) = \binom{n}{k_i}^{-1}, \quad \text{for all } \boldsymbol{\gamma} : \sum_{j=1}^{n} \gamma_j = k_i; \tag{37}$$

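$$\beta_j \mid \boldsymbol{\gamma}, k_i : \quad \beta_j = \gamma_j z_j, \quad 1 \le j \le n. \tag{38}$$

$$f(\mathbf{x}_n \mid \boldsymbol{\beta}, \sigma^2) = f(\mathbf{x}_n \mid \mathbf{z}, \boldsymbol{\gamma}, k_i, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{j=1}^{n} (x_j - \gamma_j z_j)^2 \right\}. \tag{39}$$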
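To make the hierarchy concrete, the Python sketch below draws one data set from the prior structure and evaluates the logarithm of the likelihood in (39), read here as $(2\pi\sigma^2)^{-n/2}\exp\{-\sum_j (x_j - \gamma_j z_j)^2 / (2\sigma^2)\}$. It is an illustration under stated assumptions, not the author's code: `tau` (the prior standard deviation of the $z_j$), `sigma`, and all function names are hypothetical, and $\boldsymbol{\gamma}$ is drawn uniformly over the binary sequences with exactly $k$ ones, as in (37).

```python
import math
import random

def draw_gamma(n, k, rng):
    """Uniform draw over binary sequences of length n with exactly k ones."""
    ones = set(rng.sample(range(n), k))
    return [1 if j in ones else 0 for j in range(n)]

def draw_from_prior(n, k, tau, sigma, rng):
    """Draw gamma, z, beta = gamma * z, and noisy coefficients x = beta + eps."""
    gamma = draw_gamma(n, k, rng)
    z = [rng.gauss(0.0, tau) for _ in range(n)]
    beta = [g * zj for g, zj in zip(gamma, z)]
    x = [b + rng.gauss(0.0, sigma) for b in beta]
    return gamma, z, beta, x

def log_likelihood(x, z, gamma, sigma):
    """Log of the Gaussian likelihood of x given z, gamma and sigma."""
    n = len(x)
    ss = sum((xj - g * zj) ** 2 for xj, zj, g in zip(x, z, gamma))
    return -0.5 * n * math.log(2.0 * math.pi * sigma ** 2) - ss / (2.0 * sigma ** 2)

if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the draw is reproducible
    gamma, z, beta, x = draw_from_prior(n=16, k=3, tau=2.0, sigma=0.5, rng=rng)
    print(gamma)
    print(log_likelihood(x, z, gamma, 0.5))
```

Only the $k$ positions with $\gamma_j = 1$ carry signal; the remaining coordinates of `x` are pure noise, which is what a thresholding rule is meant to detect.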

The nuisance parameter $\sigma^2$, which is common under both models, is given the prior density $1/\sigma^2$. Then it