A Kullback-Leibler Divergence for Bayesian Model Diagnostics


Chen-Pin Wang¹, Malay Ghosh²

¹Department of Epidemiology and Biostatistics, University of Texas Health Science Center, San Antonio, USA

²Department of Statistics, University of Florida, Gainesville, USA

E-mail: wangc3@uthscsa.edu, [email protected]

Received August 18, 2011; revised September 19, 2011; accepted September 29, 2011

Abstract

This paper considers a Kullback-Leibler distance (KLD) that is asymptotically equivalent to the KLD of Goutis and Robert [1] when the reference model (in comparison to a competing fitted model) is correctly specified and certain regularity conditions hold (ref. Akaike [2]). We derive the asymptotic property of this Goutis-Robert-Akaike KLD under certain regularity conditions. We also examine the impact of this asymptotic property when the regularity conditions are only partially satisfied. Furthermore, the connection between the Goutis-Robert-Akaike KLD and a weighted posterior predictive p-value (WPPP) is established. Finally, both the Goutis-Robert-Akaike KLD and WPPP are applied to compare models using various simulated examples as well as two cohort studies of diabetes.

Keywords: Kullback-Leibler Distance, Model Diagnostic, Weighted Posterior Predictive p-Value

1. Introduction

Information theory provides a general framework for developing statistical techniques for model comparison (Akaike [2]; Shannon [3]; Kullback and Leibler [4]; Lindley [5]; Bernardo [6]; Schwarz [7]). The Kullback-Leibler distance (KLD) is perhaps the most commonly used information criterion for assessing model discrepancy (Akaike [2]; Kullback and Leibler [4]; Lindley [5]; Schwarz [7]). In essence, a KLD is the expected logarithm of the ratio of the probability density functions (p.d.f.s) of two models, one being a fitted model and the other being the reference model, where the expectation is taken with respect to the reference model. Thus a KLD can be viewed as a measure of the information loss in the fitted model relative to that in the reference model. KLDs that are suitable for model comparison in the Bayesian framework typically involve the integrated likelihoods of the competing models, where the integrated likelihood under each model is obtained by integrating the likelihood with respect to the prior distribution of model parameters (e.g., Lindley [5] and Schwarz [7]). KLDs based on the ratio of integrated likelihoods, however, face the challenge of identifying priors that are compatible under the competing models and that yield proper integrated likelihoods. As a way to

reference model is not correctly specified, then the G-R KLD does not necessarily reduce to the KLD between the reference model and the competing fitted model evaluated at its maximum likelihood estimate or MLE (ref. Akaike [2]), referred to as the Goutis-Robert-Akaike KLD or G-R-A KLD, a more tractable model discrepancy measure.

This paper proposes to use the G-R-A KLD for assessing model discrepancy in terms of the fit of a certain statistic $T_n$ that is central to our inference or model diagnostic purpose. That is, we evaluate the G-R-A KLD between the probability density function (p.d.f.) of $T_n$ under the reference model $r$ and that under the assumed model $f$ evaluated at its MLE (see Section 2). We investigate the (asymptotic) property of the G-R-A KLD under certain regularity conditions as well as under the violation of some of these conditions, including the case of non-nested models. Note that unlike the G-R KLD (Goutis and Robert [1]), the G-R-A KLD considered herein does not require the reference model to be the true model, nor does the true model need to be specified. Also, while the G-R KLD has been limited to comparing nested generalized linear models, the G-R-A KLD seems to be more flexible for comparing nested or non-nested models that are broader than generalized linear models (see Sections 3 and 4). Theorem 1 shows that under certain regularity conditions, the asymptotic expression of the posterior estimator of the G-R-A KLD is comprised of a leading term for model discrepancy in the mean of $T_n$, a term for model discrepancy in the variance of $T_n$, and a constant to penalize model complexity, plus a smaller order term. Since the first two leading terms in the G-R-A KLD estimator resemble a measure that differentiates the predictability between models $r$ and $f$, it is natural to study its connection with Bayesian model discrepancy measures based on predictive statistics (Guttman [8], Rubin [9], Gelman et al. [10]). In particular, we consider the posterior predictive check technique using a one-sided weighted posterior predictive p-value (WPPP). The WPPP evaluates the predictive distribution of $T_n$ under $f$ at $T_n^{pred,r}$, where $T_n^{pred,r}$ denotes the prediction of $T_n$ under $r$, the posteriors are derived under $r$, and the weight is used to account for the variation of $T_n^{pred,r}$ under $r$. Theorem 2 explicitly shows, for any $T_n$ satisfying certain asymptotic normality and regularity conditions, how the model discrepancy reflected by the G-R-A KLD is connected to that reflected by WPPP. To verify the results in Theorem 1 and Theorem 2 as well as to evaluate the G-R-A KLD under partial violation of the regularity conditions, we examine the (asymptotic) property of the G-R-A KLD and WPPP via both simulations and real data applications motivated by two cohort studies of diabetes. These examples include the comparison between nested models as well as non-nested models.

The paper is organized as follows. Section 2 studies the G-R-A KLD for (predictive) p.d.f.s of $T_n$ between two competing models. It also derives the relationship between the G-R-A KLD and WPPP. Sections 3 and 4 evaluate model fit using both the G-R-A KLD and WPPP for examples that meet all the regularity conditions required in Theorems 1 and 2 as well as for examples that meet only part of these regularity conditions.

2. A Proposed Kullback-Leibler Divergence

2.1. Notations

Throughout this paper, we assume that the $X_i$'s originate from model $g$, and are i.i.d. with some common p.d.f. $g(x\mid\theta)$ with parameter(s) $\theta$ for $\theta\in\Theta_g$, where $\Theta_g$ is a closed set. Denote $r$ for the reference model and $f$ for the fitted model, both governed by $\theta$, where $\Theta_r$ and $\Theta_f$ are the corresponding parameter spaces. Also, when a capital letter is used to denote a random variable (or an estimator), the corresponding lower case is for its realization (or an estimate). Let $U_n = U(X_1,\ldots,X_n)$ be the base for deriving the posterior density of $\theta$ under model $r$. We shall denote $r(\ast\mid\theta)$ and $f(\ast\mid\theta)$ for the p.d.f.'s of $\ast$ given $\theta$ under model $r$ and $f$, respectively. Also, let $\phi(y) = (2\pi)^{-l/2}\exp(-y'y/2)$, where $y$ is $l$-dimensional.

2.2. Define Kullback-Leibler Divergence

Consider that model adequacy is evaluated based on its fit for a certain statistic $T_n = T(X_1,\ldots,X_n)$ that is pertinent to the inference or model diagnostics. Stemming from Goutis and Robert [1], we assess the relative fit between models using the KLD of the distribution of $T_n$ under $r$ and that under $f$ evaluated at $\hat\theta_f$, the MLE of $\theta$:

$$\int r(t_n\mid\theta)\,\log\frac{r(t_n\mid\theta)}{f(t_n\mid\hat\theta_f)}\,dt_n. \qquad (1)$$

We shall refer to (1) as the Goutis-Robert-Akaike KLD or G-R-A KLD since (1) is asymptotically equivalent to the KLD proposed by Goutis and Robert [1] when the reference model $r$ is the true model (ref. Akaike [2]).

In general, for each $\theta\in\Theta_r$, the G-R-A KLD given in (1) can be regarded as a measure of the minimal information gain in model $r$ from model $f$, since the minimum information loss under $f$ is achieved at $\hat\theta_f$. Note that unlike the G-R KLD (Goutis and Robert [1]), (1) by definition does not require the reference model to be the true model. To understand the utility of (1) in the Bayesian framework, below we derive its Bayesian estimate and the associated (asymptotic) property under certain regularity conditions (see Sections 2.3 and 2.4). We also study the property of (1) when the regularity conditions hold only partially, using simulated data (Section 3) as well as two applications to diabetes studies (Section 4).
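As a rough illustration of definition (1), the sketch below evaluates the integral numerically for the special case in which both the reference density and the fitted density of the statistic are normal, and compares it with the well-known closed form for the KLD between two normal densities. The function names and the choice of normal densities are assumptions made for illustration only; they are not part of the paper.

```python
import math


def normal_pdf(t, mu, sd):
    """Density of N(mu, sd^2) evaluated at t."""
    z = (t - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def kld_numeric(mu_r, sd_r, mu_f, sd_f, half_width=12.0, steps=20000):
    """Trapezoidal approximation of the integral of r(t) * log{r(t)/f(t)} dt."""
    lo = mu_r - half_width * sd_r
    hi = mu_r + half_width * sd_r
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        t = lo + k * h
        # log r(t) - log f(t), written analytically to avoid log(0) in the tails
        log_ratio = (math.log(sd_f / sd_r)
                     - 0.5 * ((t - mu_r) / sd_r) ** 2
                     + 0.5 * ((t - mu_f) / sd_f) ** 2)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * normal_pdf(t, mu_r, sd_r) * log_ratio
    return total * h


def kld_closed_form(mu_r, sd_r, mu_f, sd_f):
    """Closed-form KLD from N(mu_r, sd_r^2) to N(mu_f, sd_f^2)."""
    ratio = (sd_r / sd_f) ** 2
    return 0.5 * ((mu_r - mu_f) ** 2 / sd_f ** 2 + ratio - math.log(ratio) - 1.0)


if __name__ == "__main__":
    print(kld_numeric(0.0, 1.0, 0.5, 2.0), kld_closed_form(0.0, 1.0, 0.5, 2.0))
```

The closed form already displays the two ingredients that recur in this section: a squared mean difference scaled by the fitted variance, and a function of the variance ratio.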

2.3. Bayesian Estimation of the Proposed KLD

Estimating (1) involves approximating the integral with respect to $r(t_n\mid\theta)$ and estimating the unknown model parameters $\theta$. To account for the uncertainty of model parameters, we consider the following estimator:

$$\int\!\!\int r(t_n\mid\theta)\,\log\frac{r(t_n\mid\theta)}{f(t_n\mid\hat\theta_f)}\,\pi_r(\theta\mid U_n)\,dt_n\,d\theta, \qquad (2)$$

where $U_n$, as a function of $(X_1,\ldots,X_n)$, is used for deriving the posterior of $\theta$, and $\pi_r(\theta\mid U_n) = r(U_n\mid\theta)\,\pi_r(\theta)\big/\int r(U_n\mid\theta)\,\pi_r(\theta)\,d\theta$ denotes the posterior density of $\theta$ under model $r$. That is, (2) is the average discrepancy between $r$ and $f$, each being weighted by $\pi_r(\theta\mid U_n)$. We shall denote (2) by $KLD_t(r,f\mid U_n)$. Since (2) is nonnegative for any given $n$, the closer it is to zero, the less is the information loss by fitting $f$, instead of $r$, to statistic $T_n$.
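The posterior averaging in (2) can be sketched by Monte Carlo in a toy setting. The setting below is an assumption chosen only because everything is available in closed form: under $r$ the data are normal with unknown mean and known standard deviation `sd_r`, the prior on the mean is flat (so the posterior is normal around the sample mean), the statistic is the sample mean, and the fitted model $f$ is normal with a different known standard deviation `sd_f`. All function and parameter names are hypothetical.

```python
import math
import random


def q(y):
    """Q(y) = y - log(y) - 1 for y > 0."""
    return y - math.log(y) - 1.0


def kld_given_theta(theta, xbar, n, sd_r, sd_f):
    """KLD (1) between N(theta, sd_r^2/n) and N(xbar, sd_f^2/n)."""
    ratio = (sd_r / sd_f) ** 2
    return 0.5 * (n * (theta - xbar) ** 2 / sd_f ** 2 + q(ratio))


def posterior_kld(xbar, n, sd_r, sd_f, draws=20000, seed=1):
    """Average of (1) over posterior draws of theta, as in estimator (2)."""
    rng = random.Random(seed)
    post_sd = sd_r / math.sqrt(n)  # posterior sd of the mean under a flat prior
    total = 0.0
    for _ in range(draws):
        theta = rng.gauss(xbar, post_sd)
        total += kld_given_theta(theta, xbar, n, sd_r, sd_f)
    return total / draws


if __name__ == "__main__":
    print(posterior_kld(xbar=0.3, n=50, sd_r=1.0, sd_f=2.0))
```

In this toy case the exact posterior average is `0.5 * (ratio + q(ratio))` with `ratio = (sd_r / sd_f) ** 2`, which the Monte Carlo average should approach as the number of draws grows.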

To gain further insight about the utility of (2), we derive its asymptotic properties below. The approximation of (2) is tied to the assumptions on $T_n$ under $r$ and $f$, which will be described in Theorems 1 and 2. In what follows, let the $O_p$ statements be interpreted as "almost sure" statements. Also, define $Q(y) = y - \log(y) - 1$ for $y > 0$, which has a unique minimum attained at $y = 1$. Denote $\theta$ for model parameters, and $\hat\mu_r(U_n)$, $\hat\mu_f(U_n)$, $\hat\sigma_r^2(U_n)$ and $\hat\sigma_f^2(U_n)$ for the posterior means (or the MLE's) of $\mu_r(\theta)$, $\mu_f(\theta)$, $\sigma_r^2(\theta)$, and $\sigma_f^2(\theta)$, respectively.
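The function $Q$ plays the role of a variance-ratio penalty in what follows. A minimal sketch, with hypothetical helper names, that evaluates $Q$ and checks on a grid that its minimum sits at $y = 1$:

```python
import math


def Q(y):
    """Q(y) = y - log(y) - 1, defined for y > 0."""
    if y <= 0:
        raise ValueError("Q is only defined for y > 0")
    return y - math.log(y) - 1.0


def argmin_on_grid(step=0.01, upper=5.0):
    """Grid search for the minimiser of Q on (0, upper]."""
    count = int(round(upper / step))
    grid = [step * k for k in range(1, count + 1)]
    return min(grid, key=Q)


if __name__ == "__main__":
    print(Q(0.5), Q(1.0), Q(2.0), argmin_on_grid())
```

Because $Q$ is zero only at $y = 1$, a positive value of $Q(\hat\sigma_r^2/\hat\sigma_f^2)$ signals that the two models disagree about the variance of the statistic.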

Assume the following regularity conditions for Theorems 1 and 2.

(A1) For each $x$, both $\log r(x\mid\theta)$ and $\log f(x\mid\theta)$ are 3 times continuously differentiable in $\theta$. Further, there exist neighborhoods $N_r(\delta) = (\theta-\delta_r,\ \theta+\delta_r)$ and $N_f(\delta) = (\theta-\delta_f,\ \theta+\delta_f)$ of $\theta$ and integrable functions $H_{\theta,\delta_r}(x)$ and $H_{\theta,\delta_f}(x)$ such that

$$\sup_{\theta'\in N_r(\delta)}\left|\frac{\partial^k}{\partial\theta^k}\log r(x\mid\theta)\Big|_{\theta=\theta'}\right| \le H_{\theta,\delta_r}(x)$$

and

$$\sup_{\theta'\in N_f(\delta)}\left|\frac{\partial^k}{\partial\theta^k}\log f(x\mid\theta)\Big|_{\theta=\theta'}\right| \le H_{\theta,\delta_f}(x)$$

for $k = 1, 2, 3$.

(A2) For all sufficiently large $\lambda > 0$,

$$E_r\left[\sup_{|\theta'-\theta|>\lambda}\log\frac{r(x\mid\theta')}{r(x\mid\theta)}\right] < 0$$

and

$$E_f\left[\sup_{|\theta'-\theta|>\lambda}\log\frac{f(x\mid\theta')}{f(x\mid\theta)}\right] < 0.$$

(A3)

$$E_r\left[\sup_{\theta'\in(\theta-\delta,\ \theta+\delta)}\log r(x\mid\theta')\right] \to E_r\left[\log r(x\mid\theta)\right]$$

as $\delta\to 0$, and

$$E_f\left[\sup_{\theta'\in(\theta-\delta,\ \theta+\delta)}\log f(x\mid\theta')\right] \to E_f\left[\log f(x\mid\theta)\right]$$

as $\delta\to 0$.

(A4) The prior density $\pi(\theta)$ is continuously differentiable in a neighborhood of $\theta$ and $\pi(\theta) > 0$.

(A5) Let $T_n$ be asymptotically normally distributed under both models such that

$$r(T_n\mid\theta) = \sigma_r^{-1}(\theta)\,\phi\!\left(\sqrt{n}\,\frac{T_n-\mu_r(\theta)}{\sigma_r(\theta)}\right) + O(n^{-1/2}) \qquad (3)$$

and

$$f(T_n\mid\theta) = \sigma_f^{-1}(\theta)\,\phi\!\left(\sqrt{n}\,\frac{T_n-\mu_f(\theta)}{\sigma_f(\theta)}\right) + O(n^{-1/2}). \qquad (4)$$

Theorem 1.

$$2\,KLD_t(r,f\mid U_n) - \frac{n\,\{\hat\mu_f(U_n)-\hat\mu_r(U_n)\}^2}{\hat\sigma_f^2(U_n)} = o_p(1) \qquad (5)$$

when $\mu_f(\theta)\ne\mu_r(\theta)$, and

$$2\,KLD_t(r,f\mid U_n) - Q\!\left(\frac{\hat\sigma_r^2(U_n)}{\hat\sigma_f^2(U_n)}\right) = o_p(1) \qquad (6)$$

when $\mu_r(\theta) = \mu_f(\theta)$ but $\sigma_r^2(\theta)\ne\sigma_f^2(\theta)$.

The proof of Theorem 1 is given in Appendix 1. Since model comparison in real applications can rely on the relative fit to a multi-dimensional statistic, it is useful to study whether the results in Theorem 1 are applicable to the multivariate case with a fixed dimension. Suppose that $T_n$ is a $p$-dimensional statistic ($p > 1$ and independent of $n$) with

$$r(T_n\mid\theta) = |\Sigma_r(\theta)|^{-1/2}\,\phi\!\left(\Sigma_r^{-1/2}(\theta)\,\sqrt{n}\,\{T_n-\mu_r(\theta)\}\right) + O(n^{-1/2})$$

and

$$f(T_n\mid\theta) = |\Sigma_f(\theta)|^{-1/2}\,\phi\!\left(\Sigma_f^{-1/2}(\theta)\,\sqrt{n}\,\{T_n-\mu_f(\theta)\}\right) + O(n^{-1/2}).$$

Then following the derivation given in Appendix 1, it can be shown that

$$2\,KLD_t(r,f\mid U_n) - n\,\{\hat\mu_f(U_n)-\hat\mu_r(U_n)\}'\,\hat\Sigma_f^{-1}(U_n)\,\{\hat\mu_f(U_n)-\hat\mu_r(U_n)\} = o_p(1)$$

when $\mu_f(\theta)\ne\mu_r(\theta)$, and

$$2\,KLD_t(r,f\mid U_n) + \log\left|\hat\Sigma_r(U_n)\,\hat\Sigma_f^{-1}(U_n)\right| - \mathrm{tr}\!\left(\hat\Sigma_f^{-1}(U_n)\,\hat\Sigma_r(U_n)\right) + p = o_p(1)$$

when $\mu_r(\theta) = \mu_f(\theta)$ and $\Sigma_r(\theta)\ne\Sigma_f(\theta)$.
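The terms appearing in the univariate and multivariate statements have the same shape as the exact KLD between two normal distributions whose covariances shrink at rate $1/n$. The sketch below computes twice that exact KLD for $p = 2$ using hand-written 2 × 2 matrix helpers. It is an illustration under the assumption of exact normality, not the posterior estimator $KLD_t(r,f\mid U_n)$ itself, and every function name is hypothetical.

```python
import math


def det2(m):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d, m[0][0] / d]]


def matmul2(a, b):
    """Product of two 2 x 2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def two_kld_mvn(n, mu_r, sig_r, mu_f, sig_f):
    """2 * KLD from N(mu_r, sig_r / n) to N(mu_f, sig_f / n), with p = 2."""
    inv_f = inv2(sig_f)
    d = [mu_f[0] - mu_r[0], mu_f[1] - mu_r[1]]
    # mean term: n * (mu_f - mu_r)' Sigma_f^{-1} (mu_f - mu_r)
    quad = n * sum(d[i] * inv_f[i][j] * d[j] for i in range(2) for j in range(2))
    prod = matmul2(inv_f, sig_r)
    trace = prod[0][0] + prod[1][1]
    log_det = math.log(det2(sig_r) / det2(sig_f))
    # covariance term: tr(Sigma_f^{-1} Sigma_r) - p - log|Sigma_r Sigma_f^{-1}|
    return quad + trace - 2.0 - log_det


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(two_kld_mvn(10, [0.0, 0.0], identity, [1.0, 0.0], identity))
```

With diagonal covariances and equal means the covariance term reduces to a sum of $Q$ values of the component-wise variance ratios, which matches the univariate form in (6).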

The result above implies the importance of choosing $T_n$ to yield an effective model diagnostic based on $KLD_t(r,f\mid U_n)$. For example, $T_n$ is typically chosen to be the sufficient statistic of $\theta$ since it contains all the information of $\theta$. Yet, $T_n$ clearly is not the best diagnostic statistic if the estimates of the mean of $T_n$ are unbiased under both models $r$ and $f$, as $KLD_t(r,f\mid U_n) = O_p(1)$. Also, note that both $n\{\hat\mu_f(U_n)-\hat\mu_r(U_n)\}^2/\hat\sigma_f^2(U_n)$ in (5) and $Q\big(\hat\sigma_r^2(U_n)/\hat\sigma_f^2(U_n)\big)$ in (6) can be viewed as a discrepancy between $r$ and $f$ in terms of their posterior predictivity of $T_n$. We show next how $KLD_t(r,f\mid U_n)$ is related to a weighted posterior predictive p-value, a typical approach for assessing model discrepancy regarding the predictivity of $T_n$ in the Bayesian framework (see Rubin [9]; Gelman et al. [10]).

2.4. KLD vs Posterior Predictive p-Value

Consider

$$WPPP_r(U_n) = \int\left[\int\left\{\int_{-\infty}^{t_n} f(y_n\mid\hat\theta_f)\,dy_n\right\} r(t_n\mid\theta)\,dt_n\right]\pi_r(\theta\mid U_n)\,d\theta, \qquad (7)$$

where $r(t_n\mid\theta)$ and $f(y_n\mid\hat\theta_f)$ are the density functions of $T_n$ under $r$ and $f$, respectively. We shall call (7) a one-sided weighted posterior predictive p-value (or WPPP) with respect to model $r$ since, as shown above, it is equivalent to the mean predictive p-value of $T_n$ under $f$ over all possible posterior predicted values of $T_n$ arising from $r$. Thus WPPP can be viewed as a model discrepancy measure since it evaluates the predictivity of $f$ under $r$. Note that the WPPP proposed here differs from the posterior predictive p-value originally proposed in Rubin [9]: WPPP calculates the predictive p-value of $T_n$ under an assumed model $f$ should $T_n$ originate from the reference model $r$, while Rubin's original proposal [9] assesses the predictive p-value of $T_n$ under an assumed model. WPPP is considered here since it is more coherent with the nature of the G-R-A KLD as a discrepancy measure between models $r$ and $f$.
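To make the inner two integrals of (7) concrete, the sketch below fixes the parameter value (so the outer posterior average is skipped) and assumes that the statistic is exactly normal under both models, with variances shrinking at rate $1/n$. Under that assumption the inner average equals $\Phi\big(\sqrt{n}(\mu_r-\mu_f)/\sqrt{\sigma_r^2+\sigma_f^2}\big)$, where $\Phi$ is the standard normal distribution function, and a Monte Carlo average should agree with it. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def wppp_normal(n, mu_r, sd_r, mu_f, sd_f):
    """Closed form of the inner average in (7) under exact normality."""
    z = math.sqrt(n) * (mu_r - mu_f) / math.sqrt(sd_r ** 2 + sd_f ** 2)
    return NormalDist().cdf(z)


def wppp_monte_carlo(n, mu_r, sd_r, mu_f, sd_f, draws=50000, seed=7):
    """Average, over T_n drawn under r, of the predictive p-value under f."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    total = 0.0
    for _ in range(draws):
        t = rng.gauss(mu_r, sd_r / math.sqrt(n))  # T_n generated under r
        # Pr(Y <= t) for Y ~ N(mu_f, sd_f^2 / n), the predictive p-value under f
        total += std_normal.cdf((t - mu_f) * math.sqrt(n) / sd_f)
    return total / draws


if __name__ == "__main__":
    print(wppp_normal(25, 0.2, 1.0, 0.0, 1.5),
          wppp_monte_carlo(25, 0.2, 1.0, 0.0, 1.5))
```

When the two means coincide the closed form equals 0.5 regardless of the variances, which is why WPPP alone cannot flag a discrepancy that lies only in the variance.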

The connection between $WPPP_r(U_n)$ and $KLD_t(r,f\mid U_n)$ is derived in Theorem 2 below. Here we assume regularity conditions (A1)-(A5) as assumed in Theorem 1, and $\Phi$ denotes the standard normal distribution function.

Theorem 2.

$$2\,KLD_t(r,f\mid U_n) - \left\{\Phi^{-1}\big(WPPP_r(U_n)\big)\right\}^2\,\frac{\hat\sigma_r^2(U_n)+\hat\sigma_f^2(U_n)}{\hat\sigma_f^2(U_n)} = o_p(1) \qquad (8)$$

when $\mu_f(\theta)\ne\mu_r(\theta)$; and

$$2\,KLD_t(r,f\mid U_n) - Q\!\left(\frac{\hat\sigma_r^2(U_n)}{\hat\sigma_f^2(U_n)}\right) = o_p(1) \qquad (9)$$

and

$$WPPP_r(U_n) - 0.5 = o_p(1) \qquad (10)$$

when $\mu_r(\theta) = \mu_f(\theta)$ but $\sigma_r^2(\theta)\ne\sigma_f^2(\theta)$.

The proof of the theorem is given in Appendix 2.

Remark. Theorem 2 explicitly shows the asymptotic relationship between $KLD_t(r,f\mid U_n)$ and $WPPP_r(U_n)$. Suppose that $\mu_f(\theta)\ne\mu_r(\theta)$ (i.e., the estimate of the mean of $T_n$ differs between $r$ and $f$). Then both $KLD_t(r,f\mid U_n)$ and $\{\Phi^{-1}(WPPP_r(U_n))\}^2$ are of order $O_p(n)$. Suppose $\mu_r(\theta) = \mu_f(\theta)$ but $\sigma_f^2(\theta)\ne\sigma_r^2(\theta)$ (i.e., the mean of $T_n$ is the same under both $r$ and $f$, but the variance of $T_n$ differs between $r$ and $f$). Then $\Phi^{-1}(WPPP_r(U_n))$ converges to 0, while $KLD_t(r,f\mid U_n)$ converges to a positive quantity of $O_p(1)$.
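The two regimes described in the Remark can be illustrated with population-level quantities. The sketch assumes exact normality of the statistic with variances shrinking at rate $1/n$ and plugs in fixed parameter values instead of posterior estimates, so it shows the shape of the relationship rather than the theorem itself. All names are hypothetical.

```python
import math
from statistics import NormalDist


def q(y):
    """Q(y) = y - log(y) - 1 for y > 0."""
    return y - math.log(y) - 1.0


def two_kld_normal(n, mu_r, var_r, mu_f, var_f):
    """2 * KLD from N(mu_r, var_r / n) to N(mu_f, var_f / n)."""
    return n * (mu_f - mu_r) ** 2 / var_f + q(var_r / var_f)


def probit_wppp(n, mu_r, var_r, mu_f, var_f):
    """Probit transform of the normal-theory WPPP."""
    z = math.sqrt(n) * (mu_r - mu_f) / math.sqrt(var_r + var_f)
    p = NormalDist().cdf(z)
    return NormalDist().inv_cdf(p)


if __name__ == "__main__":
    # Regime 1: means differ, so both quantities grow linearly in n.
    for n in (4, 16, 36):
        print(n, two_kld_normal(n, 0.5, 1.0, 0.0, 1.0),
              probit_wppp(n, 0.5, 1.0, 0.0, 1.0) ** 2)
    # Regime 2: equal means, different variances.
    print(two_kld_normal(100, 0.0, 4.0, 0.0, 1.0), probit_wppp(100, 0.0, 4.0, 0.0, 1.0))
```

The sample size is kept moderate because the normal distribution function saturates at 1 in floating point for large arguments, at which point the probit transform is no longer invertible numerically.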

3. Illustrative Examples

This section demonstrates the utility of the $KLD_t(r,f\mid U_n)$ discussed previously through two sets of examples. Examples 3.1-3.6 are simulated examples that confirm the results proved in Theorems 1 and 2. All six of these examples meet the required regularity conditions, while Examples 3.3, 3.5, and 3.6 involve comparing non-nested models. Example 3.7 studies the departure from Theorems 1 and 2 due to violation of the regularity condition (A5). We let $U_n = (X_1,\ldots,X_n)$ in all calculations below.

Example 3.1 (Nested). Assume

$$X_i\mid\theta \overset{i.i.d.}{\sim} g(x_i\mid\theta) = \theta_2^{-1/2}\,\phi\!\left(\frac{x_i-\theta_1}{\theta_2^{1/2}}\right),$$

where $\theta_2 > 0$. Let $T_n = \sum_{i=1}^n X_i/n$. Suppose that $r = g$ and

$$f(x_i\mid\theta) = \sigma_0^{-1}\,\phi\!\left(\frac{x_i-\theta_1}{\sigma_0}\right)$$

for some known $\sigma_0 > 0$. Then $\mu_r(\theta) = \mu_f(\theta) = \theta_1$, $\sigma_r^2(\theta) = \theta_2$, $\sigma_f^2(\theta) = \sigma_0^2$, $\hat\mu_r(U_n) = \hat\mu_f(U_n) = T_n$, and

$$2\,KLD_t(r,f\mid U_n) - Q\!\left(\hat\theta_2(U_n)/\sigma_0^2\right) = o_p(1).$$

That is, $KLD_t(r,f\mid U_n)$ converges in probability to a positive quantity, which suggests the discrepancy between $r$ and $f$. However, the magnitude of the model discrepancy assessed by $KLD_t(r,f\mid U_n)$ depends on $Q\big(\hat\theta_2(U_n)/\sigma_0^2\big)$. In contrast, $WPPP_r(U_n)$ converges to 0.5, indicating no difference between $r$ and $f$. Thus the results are consistent with Theorems 1 and 2.

Example 3.2 (Nested). Assume

$$X_i\mid\theta \overset{i.i.d.}{\sim} g(x_i\mid\theta) = |\Sigma_g(\theta)|^{-1/2}\,\phi\!\left(\Sigma_g^{-1/2}(\theta)\,\{x_i-\mu_g(\theta)\}\right),$$

where $X_i = (X_{i1}, X_{i2})$, $\mu_g(\theta) = (\theta_1,\ \theta_1+\theta_2)$, and $\Sigma_g(\theta)$ is a 2 × 2 matrix with $\Sigma_{g,11}(\theta) = \Sigma_{g,22}(\theta) = \theta_3$ and $\Sigma_{g,12}(\theta) = \Sigma_{g,21}(\theta) = \theta_3\theta_4$. Suppose that

$$r(x_i\mid\theta) = |\Sigma_r(\theta)|^{-1/2}\,\phi\!\left(\Sigma_r^{-1/2}(\theta)\,\{x_i-\mu_r(\theta)\}\right)$$

and

$$f(x_i\mid\theta) = |\Sigma_f(\theta)|^{-1/2}\,\phi\!\left(\Sigma_f^{-1/2}(\theta)\,\{x_i-\mu_f(\theta)\}\right),$$

where $\mu_r(\theta) = (\theta_1,\ \theta_1+\theta_2)$, $\mu_f(\theta) = (\theta_1,\ \theta_1)$, and both $\Sigma_r(\theta)$ and $\Sigma_f(\theta)$ are 2 × 2 matrices with $\Sigma_{r,11}(\theta) = \Sigma_{r,22}(\theta) = \theta_3$, $\Sigma_{r,12}(\theta) = \Sigma_{r,21}(\theta) = 0$, $\Sigma_{f,11}(\theta) = \Sigma_{f,22}(\theta) = \theta_3$, and $\Sigma_{f,12}(\theta) = \Sigma_{f,21}(\theta) = 0$. Let $T_n = \sum_{i=1}^n (X_{i1}-X_{i2})/n$. Then, for the statistic $T_n$, $\mu_r(\theta) = -\theta_2$, $\mu_f(\theta) = 0$ and $\sigma_r^2(\theta) = \sigma_f^2(\theta) = 2\theta_3$.

It can be shown that $\hat\mu_r(U_n) = T_n$, $\hat\mu_f(U_n) = 0$,

$$\hat\sigma_r^2(U_n) = 2\,\frac{\sum_{i=1}^n \{X_{i1}-\hat\theta_{r1}(U_n)\}^2 + \sum_{i=1}^n \{X_{i2}-\hat\theta_{r2}(U_n)\}^2}{2n},$$

and

$$\hat\sigma_f^2(U_n) = 2\,\frac{\sum_{i=1}^n \{X_{i1}-\hat\theta_{f1}(U_n)\}^2 + \sum_{i=1}^n \{X_{i2}-\hat\theta_{f1}(U_n)\}^2}{2n},$$

where $\hat\theta_{r1}(U_n) = \sum_{i=1}^n X_{i1}/n$, $\hat\theta_{r2}(U_n) = \sum_{i=1}^n X_{i2}/n$, and $\hat\theta_{f1}(U_n) = \big(\sum_{i=1}^n X_{i1} + \sum_{i=1}^n X_{i2}\big)/(2n)$. Thus

$$2\,KLD_t(r,f\mid U_n) - n\,\hat\mu_r^2(U_n)/\hat\sigma_f^2(U_n) = o_p(1).$$

Since $\mu_f(\theta)\ne\mu_r(\theta)$ and $\hat\mu_r(U_n)-\hat\mu_f(U_n) = T_n$, following the arguments given in Appendix 2, we have

$$\Phi^{-1}\big(WPPP_r(U_n)\big) - \frac{\sqrt{n}\,\hat\mu_r(U_n)}{\sqrt{\hat\sigma_r^2(U_n)+\hat\sigma_f^2(U_n)}} = o_p(1).$$

Thus both $KLD_t(r,f\mid U_n)$ and the square of the probit transformed $WPPP_r(U_n)$ suggest the discrepancy between $r$ and $f$ by an $O_p(n)$ term.
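The behaviour described for Example 3.1 can be mimicked with a small simulation. The sketch below draws normal data, takes the sample mean as the statistic, and evaluates the variance term $Q(\hat\theta_2/\sigma_0^2)$, where $\hat\theta_2$ is the variance MLE and `sigma0_sq` stands for the known variance assumed under $f$. The parameter values, the seed and the function names are illustrative assumptions, and the posterior averaging of (2) is replaced by plug-in MLEs.

```python
import math
import random


def q(y):
    """Q(y) = y - log(y) - 1 for y > 0."""
    return y - math.log(y) - 1.0


def example_31_leading_term(n, theta1, theta2, sigma0_sq, seed=11):
    """Return (T_n, Q(theta2_hat / sigma0_sq)) for simulated N(theta1, theta2) data."""
    rng = random.Random(seed)
    xs = [rng.gauss(theta1, math.sqrt(theta2)) for _ in range(n)]
    t_n = sum(xs) / n                                  # sample mean, the statistic T_n
    theta2_hat = sum((x - t_n) ** 2 for x in xs) / n   # variance MLE under r
    return t_n, q(theta2_hat / sigma0_sq)


if __name__ == "__main__":
    print(example_31_leading_term(5000, theta1=1.0, theta2=4.0, sigma0_sq=1.0))
```

With a true variance of 4 and an assumed variance of 1, the variance term settles near $Q(4)$, a strictly positive value, while the mean estimates under both models coincide by construction.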

Example 3.3 (Non-nested). Assume

$$X_i\mid\theta \overset{i.i.d.}{\sim} g(x_i\mid\theta) = (x_i!)^{-1}\exp\!\left(-\frac{\theta}{1-\theta}\right)\left(\frac{\theta}{1-\theta}\right)^{x_i},$$

where $0 < \theta < 1$. Let $T_n = \bar X_n/(1+\bar X_n)$, $r = g$, and $f(x_i\mid\theta) = \theta^{x_i}(1-\theta)$. Then $\mu_r(\theta) = \mu_f(\theta) = \theta$, $\sigma_r^2(\theta) = \theta(1-\theta)^3$, and $\sigma_f^2(\theta) = \theta(1-\theta)^2$. In this case, $\theta$ has the same statistical meaning under both $f$ and $r$ since $\theta = E(X_i)/\{1+E(X_i)\}$. Yet the substantive meaning of $\theta$ under $r$ differs from that under $f$. It can be shown that $\hat\mu_r(\theta) = \hat\mu_f(\theta) = T_n$, and

$$2\,KLD_t(r,f\mid U_n) - Q\big(1-\hat\theta(U_n)\big) = 2\,KLD_t(r,f\mid U_n) - Q(1-T_n) = o_p(1)$$

for $0 < \theta < 1$. Consistent with Theorem 2, $KLD_t(r,f\mid U_n)$, up to an $O_p(n^{-1/2})$ term, converges in probability to a positive quantity (i.e., $Q\big(1-\hat\theta(U_n)\big)$), and $WPPP_r(U_n)$ converges in probability to 0.5 (since $\mu_r(\theta) = \mu_f(\theta)$).
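A quick simulation can illustrate Example 3.3. The sketch draws Poisson counts with mean $\theta/(1-\theta)$, forms $T_n = \bar X_n/(1+\bar X_n)$ and evaluates $Q(1-T_n)$. The Poisson sampler is Knuth's multiplication method, used here only to stay within the standard library; the variance functions come from a delta-method calculation for the Poisson and geometric models. Function names and the seed are illustrative assumptions.

```python
import math
import random


def q(y):
    """Q(y) = y - log(y) - 1 for y > 0."""
    return y - math.log(y) - 1.0


def poisson_draw(rng, lam):
    """One Poisson(lam) draw via Knuth's multiplication method (small lam only)."""
    limit = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sigma2_r(theta):
    """Asymptotic variance of T_n under the Poisson model r."""
    return theta * (1.0 - theta) ** 3


def sigma2_f(theta):
    """Asymptotic variance of T_n under the geometric model f."""
    return theta * (1.0 - theta) ** 2


def example_33(n, theta, seed=5):
    """Return (T_n, Q(1 - T_n)) for data simulated under r."""
    rng = random.Random(seed)
    lam = theta / (1.0 - theta)
    xbar = sum(poisson_draw(rng, lam) for _ in range(n)) / n
    t_n = xbar / (1.0 + xbar)
    return t_n, q(1.0 - t_n)


if __name__ == "__main__":
    print(example_33(20000, 0.5))
```

The ratio `sigma2_r(theta) / sigma2_f(theta)` equals `1 - theta`, which is why the variance term takes the form $Q(1-T_n)$.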

Examples 3.4-3.6 below consider situations where $T_n$ is an unbiased estimate for the parameter of interest under both $f$ and $r$. However, $T_n$ is the MLE of a certain parameter under $f$ but not under $r$. In these examples, since the MLEs of the mean of $T_n$ under both $f$ and $r$ converge to the same limit in probability, $2\,KLD_t(r,f\mid U_n)$ is reduced to $Q\big(\hat\sigma_r^2(U_n)/\hat\sigma_f^2(U_n)\big) + o_p(1)$.

Example 3.4 (Nested). As an extension of Example 3.1, consider

$$X_{ji}\mid\theta \overset{i.i.d.}{\sim} g(x_{ji}\mid\theta) = (\theta_2\theta_{3j})^{-1/2}\,\phi\!\left(\frac{x_{ji}-\theta_1}{(\theta_2\theta_{3j})^{1/2}}\right),\qquad i = 1,\ldots,n_j,$$

where $\theta_{3j} > 0$ for $j = 2,\ldots,J$ with $J > 1$. Let $T_n = \sum_{j=1}^J\sum_{i=1}^{n_j} X_{ji}\big/\sum_{j=1}^J n_j$, $r = g$, and

$$f(x_{ji}\mid\theta) = \theta_2^{-1/2}\,\phi\!\left(\frac{x_{ji}-\theta_1}{\theta_2^{1/2}}\right)$$

for some $\theta_2 > 0$. Thus $\mu_r(\theta) = \mu_f(\theta) = \theta_1$,

$$\sigma_r^2(\theta) = \theta_2\sum_{j=1}^J w_j\theta_{3j}\Big/\sum_{j=1}^J n_j,$$

and $\sigma_f^2(\theta) = \theta_2\big/\sum_{j=1}^J n_j$ for $j = 1,\ldots,J$, where $w_j = n_j\big/\sum_{j=1}^J n_j$. Here $T_n$ is the MLE of $\theta_1$ under $f$, while the MLE of $\theta_1$ under $r$ is

$$\frac{\sum_{j=1}^J\sum_{i=1}^{n_j} X_{ji}\,\{\hat\theta_2(U_n)\,\hat\theta_{3j}(U_n)\}^{-1}}{\sum_{j=1}^J n_j\,\{\hat\theta_2(U_n)\,\hat\theta_{3j}(U_n)\}^{-1}},$$

where $\hat\theta_{3j}(U_n)$ is the MLE of $\theta_{3j}$ under $r$. Also,

$$2\,KLD_t(r,f\mid U_n) - Q\!\left(\hat\sigma_r^2(U_n)/\hat\sigma_f^2(U_n)\right) = o_p(1).$$

We conduct numerical calculations for the following combinations: $\theta_2 = 2.25$, $\theta_{32} = 4, 25$ with $(n_1, n_2) = (5000, 5000)$ and $(n_1, n_2) = (2000, 8000)$. Since the 95% percentiles of $\hat\sigma_r^2(U_n)/\hat\sigma_f^2(U_n)$ do not exceed 1 under all situations (hence $\Pr\{KLD_t(r,f\mid U_n) > 0\} > 0.95$), it implies that in theory KLD should be able to distinguish $f$ from $r$. However, the numerical values of KLD only deviate from 0 by $K\times 10^{-6}$ for $1 < K < 10$. Thus in practice, KLD can hardly be differentiated from $WPPP_r(U_n)$, which converges to 0.5 in probability ($\mu_r(\theta) = \mu_f(\theta)$).

Example 3.5 (Non-nested). Assume

$$X_i\mid\theta \overset{i.i.d.}{\sim} g(x_i\mid\theta) = \frac{\Gamma\{(\theta_2+1)/2\}}{\Gamma(\theta_2/2)\sqrt{\pi\theta_2}}\left\{1+\frac{(x_i-\theta_1)^2}{\theta_2}\right\}^{-(\theta_2+1)/2},$$

where $\theta_2 > 2$. Let $T_n = \bar X_n$. Suppose that $X_1,\ldots,X_n$ are fitted by $r = g$ and $f(x_i\mid\theta) = \phi(x_i-\theta_1)$. Then $\mu_f(\theta) = \mu_r(\theta) = \theta_1$, $\sigma_r^2(\theta) = \theta_2/(\theta_2-2)$, and $\sigma_f^2(\theta) = 1$. Since

$$2\,KLD_t(r,f\mid U_n) - Q\!\left(\frac{\hat\theta_2(U_n)}{\hat\theta_2(U_n)-2}\right) = o_p(1)$$

for all $\theta_2$, with $KLD_t(r,f\mid U_n)$ converging to 0 with probability 1 if and only if $\theta_2 = \infty$, $KLD_t(r,f\mid U_n)$ can detect model discrepancy in the dispersion parameter. In contrast, $WPPP_r(U_n)$ converges to 0.5 in probability, which is not sensitive to the discrepancy in the dispersion parameter.

Example 3.6 (Non-nested). Assume $X_i\mid\theta \overset{i.i.d.}{\sim} g(x_i\mid\theta)$,

where $\theta_2 > 2$, $\theta_3 > 0$, and $0 < \theta_4 < 1$. Suppose that the empirical variance of $\{X_i : i = 1,\ldots,n\}$ is greater than 1. Consider fitting the data by

$$r(x_i\mid\theta) = \frac{\Gamma\{(\theta_2+1)/2\}}{\Gamma(\theta_2/2)\sqrt{\pi\theta_2}}\left\{1+\frac{(x_i-\theta_1)^2}{\theta_2}\right\}^{-(\theta_2+1)/2}$$

or

$$f(x_i\mid\theta) = \left(\frac{\theta_2-2}{\theta_2}\right)^{1/2}\phi\!\left((x_i-\theta_1)\left(\frac{\theta_2-2}{\theta_2}\right)^{1/2}\right),$$

where $\theta_2 > 2$. Let $T_n = \bar X_n$. Then $\hat\mu_f(U_n) = T_n$ and $\hat\mu_r(U_n) = \sum_{i=1}^n w_i(U_n)\,X_i$, where

$$w_i(U_n) = \frac{\left[1+\{X_i-\hat\theta_1(U_n)\}^2/\hat\theta_2(U_n)\right]^{-1}}{\sum_{i=1}^n\left[1+\{X_i-\hat\theta_1(U_n)\}^2/\hat\theta_2(U_n)\right]^{-1}}.$$

Since both $\hat\mu_r(U_n)$ and $\hat\mu_f(U_n)$ are unbiased estimators for $E(T_n)$ and $\mathrm{var}\{\hat\mu_r(U_n)-\hat\mu_f(U_n)\}$ converges to 0 in probability,

$$2\,KLD_t(r,f\mid U_n) - Q\!\left(\hat\sigma_r^2(U_n)/\hat\sigma_f^2(U_n)\right) = o_p(1),$$

where $\hat\sigma_f^2(U_n) = \sum_{i=1}^n (X_i-T_n)^2/n$, while $\hat\sigma_r^2(U_n)$ cannot be obtained in closed form. Thus $KLD_t(r,f\mid U_n)$ is evaluated numerically using eight combinations given in the table below. The results suggest that the degree to which $KLD_t(r,f\mid U_n)$ can distinguish $f$ from $r$ varies by situation: the closer $|\theta_3-1|$ and $\min\{\theta_4,\ 1-\theta_4\}$ are to 0, the closer $KLD_t(r,f\mid U_n)$ is to 0. In contrast, $WPPP_r(U_n)$ converges to 0.5 in probability (see Table 1).

Table 1. Simulation result of Example 3.6.

$(\theta_2, \theta_3)$      $\theta_4 = 0.5$           $\theta_4 = 0.2$
(6, 2)                      $1.58\times 10^{-4}$       $3.08\times 10^{-6}$
(6, 10/3)                   $1.10\times 10^{-1}$       $6.27\times 10^{-5}$
(2.5, 2)                    $1.58\times 10^{-4}$       $1.47\times 10^{-6}$

Example 3.7 below shows that the asymptotic relationship between $KLD_t(r,f\mid U_n)$ and $WPPP_r(U_n)$ does not hold in the sense of Theorem 2 due to violation of the asymptotic normality assumption specified in (3) and (4).



Example 3.7 (Nested). Consider

$$X_i\mid\theta \overset{i.i.d.}{\sim} g(x_i\mid\theta) = \theta^{-1}\exp(-x_i/\theta).$$

Suppose that $\{X_i : i = 1,\ldots,n\}$ are fitted by $r = g$ and $f(x_i\mid\theta) = \exp(-x_i)$. Let $T_n = \min\{X_1,\ldots,X_n\}$. Then $g(t_n\mid\theta) = (n/\theta)\exp(-n t_n/\theta)$ and $f(t_n\mid\theta) = n\exp(-n t_n)$. Then

$$\begin{aligned}
WPPP_r(U_n) &= WPPP_r(\bar X_n) = E_r\!\left\{\Pr\!\left(T_n^f < T_n^{pred}\mid\theta_f = 1\right)\Big|\ \bar X_n\right\}\\
&= \int\!\!\int\!\!\int_0^{t_n} n\exp(-nx)\,dx\ \frac{n}{\theta}\exp(-n t_n/\theta)\,dt_n\ \pi_r(\theta\mid\bar X_n)\,d\theta\\
&= \int\!\!\int \{1-\exp(-n t_n)\}\,\frac{n}{\theta}\exp(-n t_n/\theta)\,dt_n\ \pi_r(\theta\mid\bar X_n)\,d\theta\\
&= \int \frac{\theta}{1+\theta}\,\pi_r(\theta\mid\bar X_n)\,d\theta
\end{aligned}$$

and

$$\begin{aligned}
KLD_t(r,f\mid U_n) &= KLD_t(r,f\mid\bar X_n)\\
&= \int\!\!\int \log\left\{\frac{(n/\theta)\exp(-n t_n/\theta)}{n\exp(-n t_n)}\right\}\frac{n}{\theta}\exp(-n t_n/\theta)\,dt_n\ \pi_r(\theta\mid\bar X_n)\,d\theta\\
&= \int\!\!\int \left\{-\log(\theta) + (1-1/\theta)\,n t_n\right\}\frac{n}{\theta}\exp(-n t_n/\theta)\,dt_n\ \pi_r(\theta\mid\bar X_n)\,d\theta\\
&= \int Q(\theta)\,\pi_r(\theta\mid\bar X_n)\,d\theta.
\end{aligned}$$

That is,

$$WPPP_r(U_n) - \frac{\bar X_n}{1+\bar X_n} = o_p(1)$$

and

$$KLD_t(r,f\mid U_n) - Q(\bar X_n) = o_p(1).$$

In this example, both $KLD_t(r,f\mid U_n)$ and $WPPP_r(U_n)$ can differentiate $r$ from $f$, up to an $O_p(n^{-1/2})$ term, even though the asymptotic relationship between $KLD_t(r,f\mid U_n)$ and $WPPP_r(U_n)$ does not hold in the sense of Theorem 2 due to the violation of the asymptotic normality assumption.
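The two inner integrals in Example 3.7 can be checked by simulation at a fixed value of $\theta$, skipping the posterior average. Under $r$ the minimum of $n$ exponentials with mean $\theta$ is exponential with rate $n/\theta$; under $f$ it is exponential with rate $n$. The sketch below averages the predictive p-value and the log density ratio over draws of the minimum, and the averages should approach $\theta/(1+\theta)$ and $Q(\theta)$ respectively. Function names, the seed and the parameter values are illustrative assumptions.

```python
import math
import random


def q(y):
    """Q(y) = y - log(y) - 1 for y > 0."""
    return y - math.log(y) - 1.0


def example_37_monte_carlo(n, theta, draws=50000, seed=3):
    """Return Monte Carlo estimates of (WPPP, KLD) at a fixed theta."""
    rng = random.Random(seed)
    wppp_sum = 0.0
    kld_sum = 0.0
    for _ in range(draws):
        t = rng.expovariate(n / theta)       # T_n under r: exponential, rate n / theta
        wppp_sum += 1.0 - math.exp(-n * t)   # Pr(T_n^f <= t) under f: rate n
        # log r(t | theta) - log f(t) = -log(theta) + (1 - 1/theta) * n * t
        kld_sum += -math.log(theta) + (1.0 - 1.0 / theta) * n * t
    return wppp_sum / draws, kld_sum / draws


if __name__ == "__main__":
    print(example_37_monte_carlo(10, 2.0), (2.0 / 3.0, q(2.0)))
```

Neither limit depends on $n$, which reflects the fact that the minimum does not satisfy the asymptotic normality condition (A5).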

4. Application of $KLD_t(r,f\mid U_n)$ to Diabetes Studies

In this section, we apply both $KLD_t(r,f\mid U_n)$ and $WPPP_r(U_n)$ to compare non-nested models in two studies of diabetes. In these applications, we assess the model fit to the entire dataset due to its clinical implication. Here the selected diagnostic statistic is a multivariate statistic of $O_p(1)$ variables, and it does not meet all the regularity conditions specified in Section 2.3.

4.1. Study I: Analysis of Change in Glucose in Veterans with Type 2 Diabetes

Study I originated from a clinical cohort of 507 veterans with type 2 diabetes who had poor glucose control (indicated by glycosylated hemoglobin A1c or HbA1c greater than 6.5) at the baseline (fiscal year 1999), and were all treated with metformin as the sole oral glucose-lowering agent. As the literature suggested that the glucose-lowering response due to metformin may vary by an individual's obesity status, the goal of our study was to compare models that assessed whether obesity was associated with the net change in glucose level between baseline and the end of the 5-year follow-up. In this study, the empirical mean of the net change in HbA1c over the 5-year period was similar between the obese and non-obese groups (−0.498 vs. −0.379), yet the empirical variance was greater in the obese group (1.207 vs. 0.865). Also, the distribution of HbA1c was reasonably symmetric. Thus we considered two candidate models for fitting the HbA1c change: a mixture of normals vs. a t-distribution (note that the overall empirical variance of HbA1c change is 1.03); that is,

$$r(x_i\mid\theta) = \theta_4\,(2\pi\theta_2)^{-1/2}\exp\!\left\{-\frac{(x_i-\theta_1)^2}{2\theta_2}\right\} + (1-\theta_4)\,(2\pi\theta_3\theta_2)^{-1/2}\exp\!\left\{-\frac{(x_i-\theta_1)^2}{2\theta_3\theta_2}\right\},$$

where $\theta_2 > 2$, $\theta_3 > 0$, and $\theta_4 = 0.487$ (% in the obesity group) vs.

$$f(x_i\mid\theta) = \frac{\Gamma\{(\theta_2+1)/2\}}{\Gamma(\theta_2/2)\sqrt{\pi\theta_2}}\left\{1+\frac{(x_i-\theta_1)^2}{\theta_2}\right\}^{-(\theta_2+1)/2},$$

where $\theta_2 > 1$. The calculated $KLD_t(r,f\mid U_n)$ was 5.96, suggesting that model $r$ provided a modestly better fit to the data compared to model $f$. This result was also consistent with Figures 1 and 2, which contrasted the empirical quantiles with the predicted quantiles under $r$ and $f$. Note that both $\hat\mu_r(U_n)$ and $\hat\mu_f(U_n)$ are unbiased estimators for $E(T_n)$, where $\hat\mu_f(U_n) = T_n$ and $\hat\mu_r(U_n) = \sum_{i=1}^n w_i(U_n)\,X_i$ with

$$w_i(U_n) = \frac{\left[1+\{X_i-\hat\theta_1(U_n)\}^2/\hat\theta_2(U_n)\right]^{-1}}{\sum_{i=1}^n\left[1+\{X_i-\hat\theta_1(U_n)\}^2/\hat\theta_2(U_n)\right]^{-1}}.$$

Thus the model discrepancy assessed by $KLD_t(r,f\mid U_n)$ is primarily attributed to the difference in the variance assumption between $r$ and $f$ (as evident in Figures 1 and 2). In contrast, WPPP = 0.522 suggested that the overall fit was similar between the two models since the estimated net change in HbA1c was similar between these two models.

Figure 2. Quantile plot for HbA1c for T2DM participants without obesity in VA (open triangle: quantile estimates based on the model assuming a mixture of normal distributions; solid circle: quantile estimates based on the model assuming a t-distribution; solid line: empirical quantiles).

4.2. Study II: Analysis of the Functioning Score in Older Adults with Diabetes

Study II arose from the subset of 119 participants with diabetes in the San Antonio Longitudinal Study of Aging (SALSA), a community-based study of the disablement process in Mexican American and European American older adults. Details of the SALSA study design, sampling approach, recruitment and field procedures have been described previously (Hazuda et al. [11]). The goal of our analyses was to compare models that assessed whether glucose control trajectory class (poorer vs. better) was associated with the lower-extremity physical functional limitation score during the first follow-up period. The lower-extremity physical functional limitation score was measured by the Short Physical Performance Battery or SPPB, which is a well-established, validated measure of physical functioning. The SPPB score is a sum of three items: 8-foot walking time, repeated chair stands, and balance scores, each being a 5-point Likert scale (0-4). Hence the SPPB score is discrete in nature with a range of 0-12. Higher SPPB scores indicate better performance and less functional limitation. Exploratory data analyses suggested that the empirical variance of SPPB (15.60 vs. 14.33) was greater than the mean (7.23 vs. 8.02) in both glucose control classes. Also, due to the left-skewedness of the distribution of SPPB, we considered models fit to the reversed SPPB, i.e., $X_i = 12 - SPPB_i$. Given the nature of the SPPB distribution, we compare two candidate models:

$$r(x_i\mid\theta) = \frac{\Gamma(x_i+\theta_2)}{\Gamma(\theta_2)\,x_i!}\,\theta_1^{x_i}\,(1-\theta_1)^{\theta_2}$$

with $\theta_2 > 0$ and

$$f(x_i\mid\theta) = \exp(-\theta_1)\,\theta_1^{x_i}\,(x_i!)^{-1}.$$

The calculated $KLD_t(r,f\mid U_n)$ is 32.63, suggesting that $r$ has a much better fit than $f$, which is coherent with the quantile plots shown in Figures 3 and 4. Since both $r$ and $f$ yielded similar estimates of $E(X_i)$, the model discrepancy assessed by $KLD_t(r,f\mid U_n)$ could primarily be attributed to the difference in variance estimation between $r$ and $f$ (as evident in Figure 2). In contrast, WPPP = 0.539 suggested similar fit between $r$ and $f$, as expected due to similar estimates of $E(X_i)$ under both $r$ and $f$.

5. Discussion

This paper considers the G-R-A KLD as given in (2). This KLD is appropriate for quantifying information discrepancy regarding n contained in the competing

models and

T

(10)

[image:10.595.58.538.81.517.2]

Figure 3. Quantile plot for SPPB for T2DM participants with good glycemic control in SALSA (open triangle: quantile esti-mates based on the negative binomial model; solid circle: quantile estiesti-mates based on the Poisson model; solid line: empirical quantiles).

rem 2. However, our results would need further refinement when the normality assumptions given in (3) and (4) are not suitable (see Example 3.7). As shown in Section 4, model comparison in medical research may rely on the fit to a multidimensional statistic. Although the results in Theorem 1 hold for a multivariate $T_n$ with a fixed dimension, further investigation is needed to assess the property of our proposed KLD for situations when the dimension of $T_n$ increases with $n$. The KLD proposed herein usually provides the relative fit between competing models. Thus, for the purpose of assessing model adequacy (rather than relative model fit), a KLD should be used in conjunction with absolute model departure indices such as posterior predictive p-values or residuals. Nevertheless, a KLD can be a measure of the absolute fit of model $f$ when the superior model $r$ is the true model, and therefore can be used for checking model adequacy.

6. Acknowledgements

Wang’s research is partially supported by NIDDK K25- DK075092; Ghosh’s research is partially supported by NSF. We thank Dr. Hazuda for providing data from the SALSA study supported by NIA R01-AG10444 and NIA


Figure 4. Quantile plot for SPPB for T2DM participants with poor glycemic control in SALSA (open triangle: quantile estimates based on the negative binomial model; solid circle: quantile estimates based on the Poisson model; solid line: empirical quantiles).

7. References

[1] C. Goutis and C. P. Robert, “Model Choice in Generalised Linear Models: A Bayesian Approach via Kullback-Leibler Projections,” Biometrika, Vol. 85, No. 1, 1998, pp. 29-37. doi:10.1093/biomet/85.1.29

[2] H. Akaike, “A New Look at the Statistical Model Identification,” IEEE Transactions on Automatic Control, Vol. 19, No. 6, 1974, pp. 716-723. doi:10.1109/TAC.1974.1100705

[3] C. E. Shannon, “A Mathematical Theory of Communication,” Bell System Technical Journal, Vol. 27, 1948, pp. 379-423 and pp. 623-656.

[4] S. Kullback and R. A. Leibler, “On Information and Sufficiency,” The Annals of Mathematical Statistics, Vol. 22, No. 1, 1951, pp. 79-86. doi:10.1214/aoms/1177729694

[5] D. V. Lindley, “On a Measure of the Information Provided by an Experiment,” The Annals of Mathematical Statistics, Vol. 27, No. 4, 1956, pp. 986-1005. doi:10.1214/aoms/1177728069

[6] J. M. Bernardo, “Expected Information as Expected Utility,” The Annals of Statistics, Vol. 7, No. 3, 1979, pp. 686-690. doi:10.1214/aos/1176344689

[7] G. Schwarz, “Estimating the Dimension of a Model,” The Annals of Statistics, Vol. 6, No. 2, 1978, pp. 461-464. doi:10.1214/aos/1176344136


yal Statistical Society B, Vol. 29, No. 1, 1967, pp. 83-100.

[9] D. B. Rubin, “Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician,” Annals of Statistics, Vol. 12, No. 4, 1984, pp. 1151-1172. doi:10.1214/aos/1176346785

[10] A. Gelman, J. Carlin, H. S. Stern and D. Rubin, “Bayesian Data Analysis,” Chapman and Hall, London, 1996.

[11] H. P. Hazuda, S. M. Haffner, M. P. Stern and C. W. Eifler, “Effects of Acculturation and Socioeconomic Status on Obesity and Diabetes in Mexican Americans: The San Antonio Heart Study,” American Journal of Epidemiology, Vol. 128, No. 6, 1988, pp. 1289-1301.

[12] S. Ghosal and T. Samanta, “Expansion of Bayes Risk for Entropy Loss and Reference Prior in Nonregular Cases,” Statistics and Decisions, Vol. 15, 1997, pp. 129-140.

[13] I. Ibragimov and R. Hasminskii, “Statistical Estimation: Asymptotic Theory,” Springer-Verlag, New York, 1980.

Appendix 1. Proof of Theorem 1

Under (3) and (4), it can be shown that

$$
\begin{aligned}
2\,KLD^t(r, f \mid U_n)
&= 2 \int\!\!\int \log\left\{\frac{r(t_n \mid \theta)}{\hat f(t_n \mid \theta)}\right\} r(t_n \mid \theta)\, \pi_r(\theta \mid U_n)\, dt_n\, d\theta \\
&= \int \left[ \log\frac{\hat\sigma_f^2(\theta)}{\sigma_r^2(\theta)} + \frac{\sigma_r^2(\theta)}{\hat\sigma_f^2(\theta)} - 1 + \frac{n\{\hat\mu_f(\theta) - \mu_r(\theta)\}^2}{\hat\sigma_f^2(\theta)} + O(n^{-1/2}) \right] \pi_r(\theta \mid U_n)\, d\theta .
\end{aligned}
$$



Applying an argument similar to that in Ghosal and Samanta [12] (see also Ibragimov and Hasminskii [13]), (A1)-(A5) gives

$$
2\,KLD^t(r, f \mid U_n) = \int \left[ Q\!\left(\frac{\sigma_r^2(\theta)}{\hat\sigma_f^2(\theta)}\right) + \frac{n\{\hat\mu_f(\theta) - \mu_r(\theta)\}^2}{\hat\sigma_f^2(\theta)} + O(n^{-1/2}) \right] \pi_r(\theta \mid U_n)\, d\theta . \tag{11}
$$
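The expansion leading to (11) rests on the closed-form divergence between two normal distributions with variances of order $1/n$. The sketch below checks that closed form against direct numerical integration. It assumes that $Q(y) = y - \log y - 1$, which is the form implied by the normal-normal divergence; the paper's own definition of $Q$ lies outside this section. All function names and numerical values are hypothetical.

```python
import math

def q(y):
    # Assumed form of Q: y - log(y) - 1, which vanishes at y = 1.
    return y - math.log(y) - 1.0

def two_kld_closed_form(mu_r, var_r, mu_f, var_f, n):
    # 2 * KL( N(mu_r, var_r / n) || N(mu_f, var_f / n) )
    return q(var_r / var_f) + n * (mu_f - mu_r) ** 2 / var_f

def normal_logpdf(t, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (t - mean) ** 2 / (2.0 * var)

def two_kld_numeric(mu_r, var_r, mu_f, var_f, n, steps=20000, width=12.0):
    # Midpoint rule for 2 * integral of r(t) * log(r(t) / f(t)) dt over mu_r +/- width sd.
    vr, vf = var_r / n, var_f / n
    sd = math.sqrt(vr)
    lo = mu_r - width * sd
    h = 2.0 * width * sd / steps
    total = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * h
        log_r = normal_logpdf(t, mu_r, vr)
        total += math.exp(log_r) * (log_r - normal_logpdf(t, mu_f, vf))
    return 2.0 * total * h

closed = two_kld_closed_form(1.0, 2.0, 1.2, 3.0, 50)
numeric = two_kld_numeric(1.0, 2.0, 1.2, 3.0, 50)
print(closed, numeric)
```

The mean-difference term grows linearly in `n`, while the variance term `q(var_r / var_f)` does not depend on `n`, which is why the former dominates whenever the two means differ.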

When $\mu_r(\theta) \neq \mu_f(\theta)$, $\int n\{\hat\mu_f(\theta) - \mu_r(\theta)\}^2 / \hat\sigma_f^2(\theta)\, \pi_r(\theta \mid U_n)\, d\theta$ is the dominating term in (11). Thus by (A1)-(A5) and the arguments similar to those in Ghosal and Samanta [12], one gets

$$
\begin{aligned}
&\int \frac{n\{\hat\mu_f(\theta) - \mu_r(\theta)\}^2}{\hat\sigma_f^2(\theta)}\, \pi_r(\theta \mid U_n)\, d\theta + O_p(n^{-1/2}) \\
&\quad= \int \left[ \frac{n\{\hat\mu_f(U_n) - \hat\mu_r(U_n)\}^2}{\hat\sigma_f^2(U_n)} + O_p(n^{-1/2}) \right] \pi_r(\theta \mid U_n)\, d\theta + O_p(n^{-1/2}) \\
&\quad= \frac{n\{\hat\mu_f(U_n) - \hat\mu_r(U_n)\}^2}{\hat\sigma_f^2(U_n)} + O_p(n^{-1/2}).
\end{aligned} \tag{12}
$$

If $\mu_f(\theta) = \mu_r(\theta)$ and $\sigma_r^2(\theta) \neq \sigma_f^2(\theta)$, then (11) becomes

$$
\int \left[ Q\!\left(\frac{\sigma_r^2(\theta)}{\hat\sigma_f^2(\theta)}\right) + O(n^{-1/2}) \right] \pi_r(\theta \mid U_n)\, d\theta
= Q\!\left(\frac{\hat\sigma_r^2(U_n)}{\hat\sigma_f^2(U_n)}\right) + O_p(n^{-1/2}). \tag{13}
$$

Therefore,

$$
2\,KLD^t(r, f \mid U_n) = \frac{n\{\hat\mu_f(U_n) - \hat\mu_r(U_n)\}^2}{\hat\sigma_f^2(U_n)} + o_p(1)
$$

when $\mu_r(\theta) \neq \mu_f(\theta)$, and

$$
2\,KLD^t(r, f \mid U_n) = Q\!\left(\frac{\hat\sigma_r^2(U_n)}{\hat\sigma_f^2(U_n)}\right) + o_p(1)
$$

when $\mu_r(\theta) = \mu_f(\theta)$ and $\sigma_r^2(\theta) \neq \sigma_f^2(\theta)$.

Appendix 2. Proof of Theorem 2

Write $z$ for the realization of $Z$, a $N(0,1)$ random variable. Then

$$
\begin{aligned}
\int\!\!\int_{-\infty}^{t_n} \hat f^{*}(y \mid \theta)\, r^{*}(t_n \mid \theta)\, dy\, dt_n
&= \int \left[ \Phi\!\left( \frac{n^{1/2}\{t_n - \hat\mu_f(\theta)\}}{\hat\sigma_f(\theta)} \right) + O(n^{-1/2}) \right] r^{*}(t_n \mid \theta)\, dt_n \\
&= \int \Phi\!\left( \frac{n^{1/2}\{\mu_r(\theta) - \hat\mu_f(\theta)\} + \sigma_r(\theta)\, z}{\hat\sigma_f(\theta)} \right) \phi(z)\, dz + O(n^{-1/2}).
\end{aligned}
$$

The second equality follows from (A1)-(A5) and an argument similar to that used in Ghosal and Samanta [12]. Let $Y$ be a $N(0,1)$ random variable distributed independently of $Z$. Then one can rewrite the above integral as

$$
\begin{aligned}
&\int \Pr\!\left( Y \le \frac{n^{1/2}\{\mu_r(\theta) - \hat\mu_f(\theta)\}}{\hat\sigma_f(\theta)} + \frac{\sigma_r(\theta)}{\hat\sigma_f(\theta)}\, z \right) \phi(z)\, dz + O(n^{-1/2}) \\
&\quad= \Pr\!\left( Y \le \frac{n^{1/2}\{\mu_r(\theta) - \hat\mu_f(\theta)\}}{\hat\sigma_f(\theta)} + \frac{\sigma_r(\theta)}{\hat\sigma_f(\theta)}\, Z \right) + O(n^{-1/2}) \\
&\quad= \Phi\!\left( \frac{n^{1/2}\{\mu_r(\theta) - \hat\mu_f(\theta)\} / \hat\sigma_f(\theta)}{\sqrt{1 + \sigma_r^2(\theta)/\hat\sigma_f^2(\theta)}} \right) + O(n^{-1/2}) \\
&\quad= \Phi\!\left( \frac{n^{1/2}\{\mu_r(\theta) - \hat\mu_f(\theta)\}}{\sqrt{\hat\sigma_f^2(\theta) + \sigma_r^2(\theta)}} \right) + O(n^{-1/2}).
\end{aligned}
$$
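The rewriting step above relies on a standard identity: for independent standard normal $Y$ and $Z$ and constants $a$, $b$, $\Pr(Y \le a + bZ) = \Phi(a/\sqrt{1 + b^2})$, because $Y - bZ \sim N(0, 1 + b^2)$. The following minimal sketch checks this numerically; the function names and the values of `a` and `b` are illustrative assumptions.

```python
import math

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def prob_by_integration(a, b, limit=10.0, steps=20000):
    # Midpoint rule for the integral of Phi(a + b z) * phi(z) over [-limit, limit].
    h = 2.0 * limit / steps
    total = 0.0
    for i in range(steps):
        z = -limit + (i + 0.5) * h
        total += std_normal_cdf(a + b * z) * std_normal_pdf(z)
    return total * h

def prob_closed_form(a, b):
    # Pr(Y - b Z <= a) with Y - b Z ~ N(0, 1 + b^2)
    return std_normal_cdf(a / math.sqrt(1.0 + b * b))

a, b = 0.7, 1.3  # hypothetical standardized mean gap and standard deviation ratio
print(prob_by_integration(a, b), prob_closed_form(a, b))
```

Setting `a = 0` returns 0.5 for any `b`, which corresponds to the case where the two models agree on the mean.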

Therefore, applying the arguments similar to those in Ghosal and Samanta [12] under (A1)-(A5) yields

$$
\begin{aligned}
WPPP_r(U_n)
&= \int\!\!\int\!\!\int_{-\infty}^{t_n} \hat f^{*}(y \mid \theta)\, r^{*}(t_n \mid \theta)\, \pi_r(\theta \mid U_n)\, dy\, dt_n\, d\theta \\
&= \int \left[ \Phi\!\left( \frac{n^{1/2}\{\mu_r(\theta) - \hat\mu_f(\theta)\}}{\sqrt{\hat\sigma_f^2(\theta) + \sigma_r^2(\theta)}} \right) + O(n^{-1/2}) \right] \pi_r(\theta \mid U_n)\, d\theta \\
&= \Phi\!\left( \frac{n^{1/2}\{\hat\mu_r(U_n) - \hat\mu_f(U_n)\}}{\sqrt{\hat\sigma_f^2(U_n) + \hat\sigma_r^2(U_n)}} + O_p(1) \right) + O_p(n^{-1/2})
\end{aligned} \tag{14}
$$

when $\mu_r(\theta) \neq \mu_f(\theta)$; and

1/ 2 1/ 2

( ) = ( ) ( )d ( ) = 0.5 ( )

r n p p