Empirical Bayes method - Predictive Analytics Using R

Empirical Bayes methods are procedures for statistical inference in which the prior distribution is estimated from the data. This approach stands in contrast to standard Bayesian methods, for which the prior distribution is fixed before any data are observed. Despite this difference in perspective, empirical Bayes may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out. Empirical Bayes, also known as maximum marginal likelihood (Bishop, Neural networks for pattern recognition, 2005), represents one approach for setting hyperparameters.

Introduction

Empirical Bayes methods can be seen as an approximation to a fully Bayesian treatment of a hierarchical Bayes model.

In a two-stage hierarchical Bayes model, for example, the observed data $y = \{y_1, y_2, \dots, y_N\}$ are assumed to be generated from an unobserved set of parameters $\theta = \{\theta_1, \theta_2, \dots, \theta_N\}$ according to a probability distribution $p(y \mid \theta)$. In turn, the parameters $\theta$ can be considered samples drawn from a population characterized by hyperparameters $\eta$ according to a probability distribution $p(\theta \mid \eta)$. In the hierarchical Bayes model, though not in the empirical Bayes approximation, the hyperparameters $\eta$ are considered to be drawn from an unparameterized distribution $p(\eta)$. Information about a particular quantity of interest $\theta_i$ therefore comes not only from the properties of those data which directly depend on it, but also from the properties of the population of parameters $\theta$ as a whole, inferred from the data as a whole, summarized by the hyperparameters $\eta$. Using Bayes’ theorem,

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = \frac{p(y \mid \theta)}{p(y)} \int p(\theta \mid \eta)\, p(\eta)\, d\eta.$$

In general, this integral will not be tractable analytically and must be evaluated by numerical methods. Stochastic approximations using, e.g., Markov Chain Monte Carlo sampling or deterministic approximations such as quadrature are common.

Alternatively, the expression can be written as

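$$p(\theta \mid y) = \int p(\theta \mid \eta, y)\, p(\eta \mid y)\, d\eta = \int \frac{p(y \mid \theta)\, p(\theta \mid \eta)}{p(y \mid \eta)}\, p(\eta \mid y)\, d\eta,$$

and the term in the integral can in turn be expressed as

$$p(\eta \mid y) = \int p(\eta \mid \theta)\, p(\theta \mid y)\, d\theta.$$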

These suggest an iterative scheme, qualitatively similar in structure to a Gibbs sampler, to evolve successively improved approximations to 𝑝(𝜃|𝑦) and 𝑝(𝜂|𝑦). First, we calculate an initial approximation to 𝑝(𝜃|𝑦) ignoring the 𝜂 dependence completely; then we calculate an approximation to 𝑝(𝜂|𝑦) based upon the initial approximate distribution of 𝑝(𝜃|𝑦); then we use this 𝑝(𝜂|𝑦) to update the approximation for 𝑝(𝜃|𝑦); then we update 𝑝(𝜂|𝑦); and so on.

When the true distribution 𝑝(𝜂|𝑦) is sharply peaked, the integral determining 𝑝(𝜃|𝑦) may not be changed much by replacing the probability distribution over 𝜂 with a point estimate 𝜂∗ representing the distribution’s peak (or, alternatively, its mean),

$$p(\theta \mid y) \approx \frac{p(y \mid \theta)\, p(\theta \mid \eta^{*})}{p(y \mid \eta^{*})}.$$

With this approximation, the above iterative scheme becomes the EM algorithm.

The term “empirical Bayes” can cover a wide variety of methods, but most can be regarded as an early truncation of either the above scheme or something quite like it. Point estimates, rather than the whole distribution, are typically used for the parameter(s) $\eta$. The estimates $\eta^{*}$ are typically made from the first approximation to $p(\theta \mid y)$ without subsequent refinement, and usually without considering an appropriate prior distribution for $\eta$.
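As a minimal sketch of this plug-in idea, the R code below assumes a Gaussian-Gaussian model that is not specified in the text above: each observation is $y_i \sim N(\theta_i, 1)$ and each parameter is $\theta_i \sim N(\mu, \tau^2)$, so that $\eta = (\mu, \tau^2)$. Under that assumption the marginal is $y_i \sim N(\mu, 1 + \tau^2)$, its maximizer plays the role of $\eta^{*}$, and the posterior mean is evaluated at that point estimate. The function name eb_normal, the unit sampling variance, and the simulated data are all illustrative choices.

```r
# Empirical Bayes sketch for an ASSUMED normal-normal model:
#   y_i | theta_i ~ N(theta_i, 1),  theta_i ~ N(mu, tau2)
# Marginally y_i ~ N(mu, 1 + tau2), so eta = (mu, tau2) can be
# estimated from the data by maximizing the marginal likelihood.
eb_normal <- function(y) {
  mu_hat   <- mean(y)                           # marginal MLE of mu
  tau2_hat <- max(0, mean((y - mu_hat)^2) - 1)  # marginal MLE of tau2, truncated at 0
  shrink   <- tau2_hat / (1 + tau2_hat)         # weight given to the observation
  list(mu = mu_hat, tau2 = tau2_hat,
       theta = mu_hat + shrink * (y - mu_hat))  # E(theta_i | y_i, eta*)
}

set.seed(1)
theta <- rnorm(500, mean = 2, sd = 1.5)   # simulated "true" parameters
y     <- rnorm(500, mean = theta, sd = 1) # one noisy observation per parameter
fit   <- eb_normal(y)

# Mean squared error of the raw observations versus the plug-in estimates
c(raw = mean((y - theta)^2), eb = mean((fit$theta - theta)^2))
```

No prior on $\eta$ enters anywhere in this sketch, which is exactly the truncation described above: the hyperparameters are estimated once from the data and then treated as known.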

Point estimation

Robbins method: non-parametric empirical Bayes (NPEB)

Robbins (1956) considered a case of sampling from a compound distribution, where the probability for each $y_i$ (conditional on $\theta_i$) is specified by a Poisson distribution,

$$p(y_i \mid \theta_i) = \frac{\theta_i^{y_i} e^{-\theta_i}}{y_i!},$$

while the prior on $\theta$ is unspecified except that it is also i.i.d. from an unknown distribution, with cumulative distribution function $G(\theta)$. Compound sampling arises in a variety of statistical estimation problems, such as accident rates and clinical trials. We simply seek a point prediction of $\theta_i$ given all the observed data. Because the prior is unspecified, we seek to do this without knowledge of $G$ (Carlin & Louis, 2000).

Under squared error loss (SEL), the conditional expectation $E(\theta_i \mid Y_i = y_i)$ is a reasonable quantity to use for prediction (Nikulin, 2001).

For the Poisson compound sampling model, this quantity is

$$E(\theta_i \mid y_i) = \frac{\int \left(\theta^{y_i+1} e^{-\theta} / y_i!\right) dG(\theta)}{\int \left(\theta^{y_i} e^{-\theta} / y_i!\right) dG(\theta)}.$$

This can be simplified by multiplying the expression by $(y_i + 1)/(y_i + 1)$, yielding

$$E(\theta_i \mid y_i) = \frac{(y_i + 1)\, p_G(y_i + 1)}{p_G(y_i)},$$

where $p_G$ is the marginal distribution obtained by integrating out $\theta$ over $G$ (Wald, 1971).
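The intermediate step can be written out explicitly. Writing the marginal as an integral over $G$ (the notation follows the text; only the expansion is added here), the factor $(y_i + 1)$ is absorbed into the factorial, which turns the numerator into a marginal probability:

```latex
p_G(y) = \int \frac{\theta^{y} e^{-\theta}}{y!}\, dG(\theta)
\quad\Longrightarrow\quad
(y_i + 1)\, p_G(y_i + 1)
  = (y_i + 1) \int \frac{\theta^{y_i+1} e^{-\theta}}{(y_i + 1)!}\, dG(\theta)
  = \int \frac{\theta^{y_i+1} e^{-\theta}}{y_i!}\, dG(\theta).
```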

To take advantage of this, Robbins (1956) suggested estimating the marginals with their empirical frequencies, yielding the fully non-parametric estimate

$$E(\theta_i \mid y_i) \approx (y_i + 1)\, \frac{\#\{Y_j = y_i + 1\}}{\#\{Y_j = y_i\}},$$

where $\#$ denotes “number of” (Good, 1953).
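The frequency-ratio estimate above can be sketched in a few lines of R. The function name robbins_estimate and the simulated gamma-distributed rates are illustrative assumptions; the estimator itself never uses the distribution that generated the rates.

```r
# Robbins' non-parametric empirical Bayes estimate for Poisson counts:
#   E(theta_i | y_i) ~ (y_i + 1) * #{Y_j = y_i + 1} / #{Y_j = y_i}
robbins_estimate <- function(y) {
  # counts[k + 1] holds the number of observations equal to k, k = 0, 1, ...
  counts <- tabulate(y + 1, nbins = max(y) + 2)
  (y + 1) * counts[y + 2] / counts[y + 1]
}

set.seed(42)
theta <- rgamma(5000, shape = 2, scale = 0.5) # simulated rates; G is unknown to the estimator
y     <- rpois(5000, lambda = theta)          # one observed count per unit

est <- robbins_estimate(y)
# Estimate attached to each distinct observed count
tapply(est, y, mean)
```

The estimate for the largest observed count is always 0, because no unit has a larger count. The ratio is generally unstable wherever counts are sparse, which is one practical motivation for the parametric variants discussed later.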

Example - Accident rates

Suppose each customer of an insurance company has an “accident rate” 𝛩 and is insured against accidents; the probability distribution of 𝛩 is the underlying distribution, and is unknown. The number of accidents suffered by each customer in a specified time period has a Poisson distribution with expected value equal to the particular customer’s accident rate. The actual number of accidents experienced by a customer is the observable quantity. A crude way to estimate the underlying probability distribution of the accident rate 𝛩 is to estimate the proportion of members of the whole population suffering 0, 1, 2, 3, … accidents during the specified time period as the corresponding proportion in the observed random sample. Having done so, we then desire to predict the accident rate of each customer in the sample. As above, one may use the conditional expected value of the accident rate 𝛩 given the observed number of accidents during the baseline period. Thus, if a customer suffers six accidents during the baseline period, that customer’s estimated accident rate is

7 × [the proportion of the sample who suffered 7 accidents] / [the proportion of the sample who suffered 6 accidents].

Note that if the proportion of people suffering $k$ accidents is a decreasing function of $k$, the customer’s predicted accident rate will often be lower than their observed number of accidents. This shrinkage effect is typical of empirical Bayes analyses.
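As a worked illustration with hypothetical counts (not data from the text), suppose 4 customers in the sample suffered six accidents and 1 customer suffered seven. The sample size cancels in the ratio of proportions, so the estimated accident rate for a customer with six accidents is

```latex
\hat{\theta} = 7 \times \frac{\#\{Y_j = 7\}}{\#\{Y_j = 6\}}
             = 7 \times \frac{1}{4}
             = 1.75,
```

which is well below the six accidents actually observed.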

Parametric empirical Bayes

If the likelihood and its prior take on simple parametric forms (such as one- or two-dimensional likelihood functions with simple conjugate priors), then the empirical Bayes problem is only to estimate the marginal $m(y \mid \eta)$ and the hyperparameters $\eta$ using the complete set of empirical measurements. For example, one common approach, called parametric empirical Bayes point estimation, is to approximate the marginal using the maximum likelihood estimate (MLE), or a moments expansion, which allows one to express the hyperparameters $\eta$ in terms of the empirical mean and variance. This simplified marginal allows one to plug the empirical averages into a point estimate for $\theta$. The resulting equation for the estimate of $\theta$ is greatly simplified, as shown below.

There are several common parametric empirical Bayes models, including the Poisson–gamma model (below), the Beta-binomial model, the Gaussian–Gaussian model, the Dirichlet-multinomial model (Johnson, Kotz, & Kemp, 1992), as well as specific models for Bayesian linear regression (see below) and Bayesian multivariate linear regression. More advanced approaches include hierarchical Bayes models and Bayesian mixture models.

Poisson–gamma model

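Returning to the example above, let the likelihood be a Poisson distribution, and let the prior now be specified by the conjugate prior, which is a gamma distribution $G(\alpha, \beta)$ (where $\eta = (\alpha, \beta)$):

$$\rho(\theta \mid \alpha, \beta) = \frac{\theta^{\alpha-1} e^{-\theta/\beta}}{\beta^{\alpha}\, \Gamma(\alpha)}, \quad \text{for } \theta > 0,\ \alpha > 0,\ \beta > 0.$$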

It is straightforward to show the posterior is also a gamma distribution. Write

$$\rho(\theta \mid y) \propto \rho(y \mid \theta)\, \rho(\theta \mid \alpha, \beta),$$

where the marginal distribution has been omitted since it does not depend explicitly on $\theta$. Expanding terms which do depend on $\theta$ gives the posterior as:

$$\rho(\theta \mid y) \propto \left(\theta^{y} e^{-\theta}\right)\left(\theta^{\alpha-1} e^{-\theta/\beta}\right) = \theta^{y+\alpha-1} e^{-\theta(1 + 1/\beta)}.$$

So the posterior density is also a gamma distribution $G(\alpha', \beta')$, where $\alpha' = y + \alpha$ and $\beta' = (1 + 1/\beta)^{-1}$. Also notice that the marginal is simply the integral of the unnormalized posterior (likelihood times prior) over all $\theta$, which turns out to be a negative binomial distribution.
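For reference, carrying out that integral with the Poisson likelihood and the gamma prior defined above gives the negative binomial form explicitly. This is a standard result, written here in the scale parameterization used in this section:

```latex
m(y \mid \alpha, \beta)
  = \int_0^\infty \frac{\theta^{y} e^{-\theta}}{y!}\,
      \frac{\theta^{\alpha-1} e^{-\theta/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}\, d\theta
  = \frac{\Gamma(y + \alpha)}{y!\,\Gamma(\alpha)}
    \left(\frac{\beta}{1 + \beta}\right)^{y}
    \left(\frac{1}{1 + \beta}\right)^{\alpha},
  \qquad y = 0, 1, 2, \dots
```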

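To apply empirical Bayes, we will approximate the marginal using the maximum likelihood estimate (MLE). However, since the posterior is a gamma distribution, the MLE of the marginal turns out to be just the mean of the posterior, which is the point estimate $E(\theta \mid y)$ we need. Recalling that the mean $\mu$ of a gamma distribution $G(\alpha', \beta')$ is simply $\alpha'\beta'$, we have

$$E(\theta \mid y) = \alpha'\beta' = \frac{\bar{y} + \alpha}{1 + 1/\beta} = \frac{\beta}{1 + \beta}\,\bar{y} + \frac{1}{1 + \beta}\,(\alpha\beta),$$

where, for the single observation considered in the posterior above, $\bar{y} = y$.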

To obtain the values of $\alpha$ and $\beta$, empirical Bayes prescribes estimating the mean $\alpha\beta$ and variance $\alpha\beta^2$ using the complete set of empirical data. The resulting point estimate $E(\theta \mid y)$ is therefore like a weighted average of the sample mean $\bar{y}$ and the prior mean $\mu = \alpha\beta$. This turns out to be a general feature of empirical Bayes: the point estimates for the prior (i.e., the mean) look like weighted averages of the sample estimate and the prior estimate (likewise for estimates of the variance).
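A minimal R sketch of this recipe follows. It makes one assumption beyond the text: since only the counts are observed, $\alpha$ and $\beta$ are matched to the moments of the marginal, whose mean is $\alpha\beta$ and whose variance is $\alpha\beta + \alpha\beta^2$ (the prior variance plus the Poisson noise). The function name eb_poisson_gamma and the simulated data are illustrative.

```r
# Parametric empirical Bayes sketch for the Poisson-gamma model.
# ASSUMPTION: alpha and beta are matched to the marginal moments,
#   E(y) = alpha * beta,  Var(y) = alpha * beta + alpha * beta^2.
eb_poisson_gamma <- function(y) {
  m <- mean(y)
  v <- var(y)
  stopifnot(v > m)            # moment matching requires overdispersed counts
  beta  <- (v - m) / m        # solves v = m * (1 + beta)
  alpha <- m / beta           # solves m = alpha * beta
  w     <- beta / (1 + beta)  # weight on the observed count
  list(alpha = alpha, beta = beta,
       theta = w * y + (1 - w) * alpha * beta)  # E(theta_i | y_i)
}

set.seed(7)
theta <- rgamma(2000, shape = 3, scale = 0.8) # simulated rates
y     <- rpois(2000, lambda = theta)          # one observed count per unit
fit   <- eb_poisson_gamma(y)

c(alpha = fit$alpha, beta = fit$beta)
head(cbind(y = y, estimate = fit$theta))
```

Each estimate lies between the unit's own count and the overall mean, which is the weighted-average behaviour described above.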

Bayesian Linear Regression

Bayesian linear regression is an approach to linear regression in which the statistical analysis is undertaken within the context of Bayesian inference. When the regression model has errors that have a normal distribution, and if a particular form of prior distribution is assumed, explicit results are available for the posterior probability distributions of the model’s parameters.

Consider a standard linear regression problem, in which for $i = 1, \dots, n$ we specify the conditional distribution of $y_i$ given a $k \times 1$ predictor vector $\boldsymbol{x}_i$:

$$y_i = \boldsymbol{x}_i^{T} \boldsymbol{\beta} + \varepsilon_i,$$

where $\boldsymbol{\beta}$ is a $k \times 1$ vector, and the $\varepsilon_i$ are independent and identically distributed normal random variables: $\varepsilon_i \sim N(0, \sigma^2)$.

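This corresponds to the following likelihood function:

$$\rho(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{\beta}, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta})^{T} (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta})\right).$$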

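The ordinary least squares solution is to estimate the coefficient vector using the Moore-Penrose pseudoinverse (Penrose, 1955) (Ben-Israel & Greville, 2003):

$$\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^{T}\boldsymbol{y},$$

where $\boldsymbol{X}$ is the $n \times k$ design matrix, each row of which is a predictor vector $\boldsymbol{x}_i^{T}$; and $\boldsymbol{y}$ is the column $n$-vector $[y_1 \cdots y_n]^{T}$.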
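The least squares formula can be checked numerically. The sketch below simulates a small design matrix (the sizes, coefficients, and noise level are arbitrary assumptions), solves the normal equations directly, and compares the result with R's built-in lm().

```r
# Ordinary least squares via the normal equations, checked against lm().
# The simulated design and coefficients are illustrative assumptions.
set.seed(123)
n <- 100
X <- cbind(1, runif(n, -1, 1), runif(n, -1, 1))  # n x k design matrix (k = 3)
beta_true <- c(0.5, 2, -1)
y <- as.vector(X %*% beta_true + rnorm(n, mean = 0, sd = 0.3))

# beta_hat = (X'X)^{-1} X'y; solve(A, b) avoids forming the inverse explicitly
beta_hat <- solve(crossprod(X), crossprod(X, y))

# lm() fits the same model; "- 1" drops its automatic intercept because
# X already contains a column of ones
fit <- lm(y ~ X - 1)
cbind(normal_equations = as.vector(beta_hat), lm = unname(coef(fit)))
```

Both columns should agree to numerical precision. lm() uses a QR decomposition internally, which is generally preferable when the columns of the design matrix are nearly collinear.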

This is a “frequentist” approach (Neyman, 1937), and it assumes that there are enough measurements to say something meaningful about 𝜷. In the Bayesian approach, the data are supplemented with additional information in the form of a prior probability distribution. The prior belief about the parameters is combined with the data’s likelihood function according to Bayes theorem to yield the posterior belief about the parameters 𝜷 and 𝜎. The prior can take different functional forms depending on the domain and the information that is available a priori.

Software

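Several software packages are available that perform empirical Bayes, including the open-source software R with the limma package. To install the package in R, one simply enters the following in the R console at the prompt:

> source("http://bioconductor.org/biocLite.R")
> biocLite("limma")

Current Bioconductor releases have replaced biocLite with the BiocManager package, for which the equivalent command is BiocManager::install("limma").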

Commercial software includes MATLAB, SAS and SPSS.

Example Using R

Model Selection in Bayesian Linear Regression

Consider data generated by $y_i = b_1 x_i + b_3 x_i^3 + \varepsilon_i$, and suppose we wish to fit a polynomial of degree 3 to the data. There are then 4 regression coefficients, namely, the intercept and the three coefficients of the powers of $x$. This yields $2^4 = 16$ possible models for the data. Let $b_1 = 18$ and $b_3 = -0.5$ so that the data looks like this in R:

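> rm(list=ls())                        # start from a clean workspace
> x=runif(200,-10,10)                  # 200 predictor values on [-10, 10]
> a=c(18,0,-0.5,0)                     # coefficients of x, x^2, x^3 and the intercept
> Y=a[1]*x^1+a[2]*x^2+a[3]*x^3+a[4]
> Y=Y+rnorm(length(Y),0,5)             # add N(0, 5^2) noise
> plot(x,Y)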

The code to calculate the log marginal likelihood for the different models appears below. The empty model (no terms at all) is dropped, leaving 15 candidate models.

> p=4
> X=cbind(x,x^2,x^3,1)                 # columns: x, x^2, x^3, intercept
> tf <- c(TRUE, FALSE)
> models <- expand.grid(replicate(p,tf,simplify=FALSE))
> names(models) <- NULL
> models=as.matrix(models)
> models=models[-dim(models)[1],]      # drop the last row (the empty model)
> a_0=100                              # prior hyperparameters
> b_0=0.5
> mu_0=rep(0,p)
> lambda_0=diag(p)
> lml <- function(model){
+   n=length(Y)
+   Y=as.matrix(Y)
+   X=as.matrix(X[,model])             # keep only the columns in this model
+   mu_0=as.matrix(mu_0[model])
+   lambda_0=as.matrix(lambda_0[model,model])
+   XtX=t(X)%*%X
+   lambda_n=lambda_0 + XtX
+   BMLE=solve(XtX)%*%t(X)%*%Y         # least-squares fit (local; not used below)
+   mu_n=solve(lambda_n)%*%(t(X)%*%Y+lambda_0%*%mu_0)
+   a_n = a_0 + 0.5*n
+   b_n=b_0 + 0.5*(t(Y)%*%Y + t(mu_0)%*%lambda_0%*%mu_0 -
+     t(mu_n)%*%lambda_n%*%mu_n)
+   log_mar_lik <- -0.5*n*log(2*pi) +
+     0.5*log(det(lambda_0)) - 0.5*log(det(lambda_n)) +
+     lgamma(a_n) - lgamma(a_0) + a_0*log(b_0) -
+     a_n*log(b_n)
+   return(log_mar_lik)                # was return(mle), an undefined object
+ }
> lml.all=apply(models,1,lml)          # one log marginal likelihood per model
> results=cbind(lml.all, models)
> order=sort(results[,1],index=TRUE,decreasing=TRUE)
> results[order$ix,]
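Read as mathematics, the lml function appears to implement the standard conjugate (normal-inverse-gamma) update for Bayesian linear regression. That identification is an inference from the code rather than something the text states. With prior mean $\mu_0$, prior precision $\Lambda_0$, and inverse-gamma parameters $a_0$, $b_0$, each restricted to the columns of the candidate model, the quantities computed are:

```latex
\begin{aligned}
\Lambda_n &= \Lambda_0 + X^{T}X, \qquad
\mu_n = \Lambda_n^{-1}\left(X^{T}y + \Lambda_0\mu_0\right), \\
a_n &= a_0 + \tfrac{n}{2}, \qquad
b_n = b_0 + \tfrac{1}{2}\left(y^{T}y + \mu_0^{T}\Lambda_0\mu_0 - \mu_n^{T}\Lambda_n\mu_n\right), \\
\log p(y \mid \text{model}) &= -\tfrac{n}{2}\log(2\pi)
  + \tfrac{1}{2}\log\det\Lambda_0 - \tfrac{1}{2}\log\det\Lambda_n \\
  &\quad + \log\Gamma(a_n) - \log\Gamma(a_0)
  + a_0\log b_0 - a_n\log b_n .
\end{aligned}
```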

Model Evaluation

The models are listed in order of descending log marginal likelihood below (the column c marks the intercept):

            lml x x^2 x^3 c
 [1,] -1339.085 1   0   1 0
 [2,] -1341.611 1   0   1 1
 [3,] -1345.397 1   1   1 0
 [4,] -1347.116 1   1   1 1
 [5,] -2188.934 0   0   1 0
 [6,] -2190.195 0   0   1 1
 [7,] -2194.238 0   1   1 0
 [8,] -2196.109 0   1   1 1
 [9,] -2393.395 1   0   0 0
[10,] -2395.309 1   0   0 1
[11,] -2399.188 1   1   0 0
[12,] -2401.248 1   1   0 1
[13,] -2477.084 0   0   0 1
[14,] -2480.784 0   1   0 0
[15,] -2483.047 0   1   0 1

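The least-squares coefficients of the full four-term model are computed at the top level, since the BMLE object inside lml is local to that function:

> BMLE=solve(t(X)%*%X)%*%t(X)%*%Y
> BMLE
          [,1]
x 18.241814068
   0.008942083
  -0.502597759
  -0.398375650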

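The model with the highest log marginal likelihood is the one that includes $x$ and $x^3$ only, matching the terms used to generate the data. The least-squares estimates of the corresponding coefficients, taken from the four-term fit shown above, are 18.241814068 and -0.502597759 for $x$ and $x^3$ respectively, close to the values 18 and -0.5 used in the simulation code.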
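Log marginal likelihoods are easier to interpret once they are converted to posterior model probabilities. The sketch below does this for the 15 values listed above, under the added assumption that every candidate model has equal prior probability. The variable names are illustrative.

```r
# Posterior model probabilities from log marginal likelihoods,
# ASSUMING equal prior probability for every candidate model.
log_ml <- c(-1339.085, -1341.611, -1345.397, -1347.116, -2188.934,
            -2190.195, -2194.238, -2196.109, -2393.395, -2395.309,
            -2399.188, -2401.248, -2477.084, -2480.784, -2483.047)

# Subtract the maximum before exponentiating to avoid numerical underflow
w    <- exp(log_ml - max(log_ml))
prob <- w / sum(w)

round(prob[1:4], 4)         # the four models containing both x and x^3
exp(log_ml[1] - log_ml[2])  # Bayes factor: best model versus runner-up
```

Under that assumption the top model receives roughly 0.92 of the posterior probability, with a Bayes factor of about 12.5 over the runner-up. Every model lacking either $x$ or $x^3$ is effectively ruled out.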

In document Predictive Analytics Using R (Page 84-96)