
Advanced statistical inference. Suhasini Subba Rao


Academic year: 2021


Advanced statistical inference

Suhasini Subba Rao

Email: [email protected]


Chapter 1

Basic Inference

1.1 A review of results in statistical inference

In this section, we review some results that you came across in STAT611 (or equivalent). We will review the Cramer-Rao bound and some properties of the likelihood. In later sections, we will use the likelihood as a means of parameter estimation (i.e. the maximum likelihood estimator, which you will have seen in previous courses) and argue heuristically why the Fisher information (which gives the Cramer-Rao bound) is extremely important.

1.1.1 The likelihood function

Let $\{X_i\}$ be iid random variables with probability function (or probability density function) $f(x;\theta)$, where $f$ is known but the parameter $\theta$ is unknown. The likelihood function is defined as

$$L_T(X;\theta) = \prod_{i=1}^{T} f(X_i;\theta) \qquad (1.1)$$

and the log-likelihood is

$$\mathcal{L}_T(X;\theta) = \log L_T(X;\theta) = \sum_{i=1}^{T} \log f(X_i;\theta). \qquad (1.2)$$

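Example 1.1.1 (i) Suppose that $\{X_t\}$ are iid normal random variables with mean $\mu$ and variance $\sigma^2$. Then the log-likelihood is, up to an additive constant that does not depend on the parameters,

$$\mathcal{L}_T(X;\mu,\sigma^2) = -\frac{1}{2}\left(T\log\sigma^2 + \sum_{t=1}^{T}\frac{(X_t-\mu)^2}{\sigma^2}\right).$$

(ii) Suppose that $\{X_t\}$ are iid binomial random variables, $X_t \sim \mathrm{Bin}(n,\pi)$. Then the log-likelihood is

$$\mathcal{L}_T(X;\pi) = \sum_{t=1}^{T}\log\binom{n}{X_t} + \sum_{t=1}^{T}\left(X_t\log\frac{\pi}{1-\pi} + n\log(1-\pi)\right).$$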

(iii) Suppose that $\{X_t\}$ are independent binomial random variables such that $X_t \sim \mathrm{Bin}(n_t,\pi_t)$, where the regressors $x_t$ influence the mean of $X_t$ through $\pi_t = g(\beta' x_t)$. Then the log-likelihood is

$$\mathcal{L}_T(X;\beta) = \sum_{t=1}^{T}\log\binom{n_t}{X_t} + \sum_{t=1}^{T}\left(X_t\log\frac{g(\beta' x_t)}{1-g(\beta' x_t)} + n_t\log\big(1-g(\beta' x_t)\big)\right).$$

(iv) Suppose that $\{X_t\}$ are independent exponential random variables which have the density $\theta^{-1}\exp(-x/\theta)$. The log-likelihood is

$$\mathcal{L}_T(X;\theta) = \sum_{t=1}^{T}\left(-\log\theta - \frac{X_t}{\theta}\right).$$

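(v) A generalisation of the exponential distribution, which gives more freedom in terms of the shape of the distribution, is the Weibull. Suppose that $\{Y_t\}$ are independent Weibull random variables which have the density $\frac{\alpha y^{\alpha-1}}{\theta^{\alpha}}\exp(-(y/\theta)^{\alpha})$, where $\theta, \alpha > 0$ (in the case that $\alpha = 1$ we have the regular exponential) and $y$ is defined over the positive real line. The log-likelihood is

$$\mathcal{L}_T(Y;\alpha,\theta) = \sum_{t=1}^{T}\left(\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \left(\frac{Y_t}{\theta}\right)^{\alpha}\right).$$

In the case that $\alpha$ is known but $\theta$ is unknown, the log-likelihood is, up to terms that do not depend on $\theta$,

$$\mathcal{L}_T(Y;\theta) = \sum_{t=1}^{T}\left(-\alpha\log\theta - \left(\frac{Y_t}{\theta}\right)^{\alpha}\right).$$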

1.1.2 Bounds for the variance of an unbiased estimator

We require the following assumptions, often called the regularity assumptions. We state the assumptions and the result for scalar $\theta$, but they can easily be extended to the case that $\theta$ is a vector.

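Assumption 1.1.1 (Regularity Conditions 1) Let us suppose that $L_T(\cdot)$ is the likelihood with true parameter $\theta$, and

(i) $\int \frac{\partial \log L_T(x;\theta)}{\partial\theta} L_T(x;\theta)\,dx = 0$ (for iid random variables this is equivalent to $\int \frac{\partial \log f(x;\theta)}{\partial\theta} f(x;\theta)\,dx = 0$).

(ii) $\frac{\partial}{\partial\theta}\int L_T(x;\theta)\,dx = \int \frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0$.

(iii) $\frac{\partial}{\partial\theta}\int g(x) L_T(x;\theta)\,dx = \int g(x)\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx$, where $g$ is any function which is not a function of $\theta$ (for example the estimator of $\theta$).

(iv) $E\left(\frac{\partial \log L_T(X;\theta)}{\partial\theta}\right)^2 > 0$.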

Theorem 1.1.1 (The Cramer-Rao bound) Suppose the likelihood $L_T(X;\theta)$ satisfies the regularity conditions (given in Assumption 1.1.1) and $\tilde\theta(X)$ is an unbiased estimator of $\theta$. Then we have

$$\mathrm{var}(\tilde\theta(X)) \ge \left[E\left(\frac{\partial \log L_T(X;\theta)}{\partial\theta}\right)^2\right]^{-1} = \left[E\left(-\frac{\partial^2 \log L_T(X;\theta)}{\partial\theta^2}\right)\right]^{-1}.$$

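PROOF. Recall that $\tilde\theta(X)$ is an unbiased estimator of $\theta$, therefore

$$\int \tilde\theta(x) L_T(x;\theta)\,dx = \theta.$$

Differentiating both sides with respect to $\theta$ (and using Assumption 1.1.1(iii) to take the derivative inside the integral) gives

$$\int \tilde\theta(x)\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 1.$$

Since $\int \frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0$ we have

$$\int \big\{\tilde\theta(x)-\theta\big\}\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 1.$$

Multiplying and dividing by $L_T(x;\theta)$ gives

$$\int \big\{\tilde\theta(x)-\theta\big\}\frac{1}{L_T(x;\theta)}\frac{\partial L_T(x;\theta)}{\partial\theta} L_T(x;\theta)\,dx = 1. \qquad (1.3)$$

Hence (since $L_T(x;\theta)$ is the density of $X$) we have

$$E\left(\big\{\tilde\theta(X)-\theta\big\}\frac{\partial \log L_T(X;\theta)}{\partial\theta}\right) = 1.$$

Recalling that the Cauchy-Schwarz inequality is $E(UV) \le E(U^2)^{1/2}E(V^2)^{1/2}$ (where equality only arises if $U = aV + b$, where $a$ and $b$ are constants) and applying it to the above we have

$$\mathrm{var}(\tilde\theta(X))\,E\left(\frac{\partial \log L_T(X;\theta)}{\partial\theta}\right)^2 \ge 1. \qquad (1.4)$$

This gives the Cramer-Rao inequality. Finally we need to prove that

$$E\left(\frac{\partial \log L_T(X;\theta)}{\partial\theta}\right)^2 = E\left(-\frac{\partial^2 \log L_T(X;\theta)}{\partial\theta^2}\right).$$

To prove this result we use the fact that $L_T$ is a density to obtain

$$\int L_T(x;\theta)\,dx = 1.$$

Differentiating the above with respect to $\theta$ gives

$$\frac{\partial}{\partial\theta}\int L_T(x;\theta)\,dx = 0.$$

By using Assumption 1.1.1(ii) we have

$$\int \frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0 \;\Rightarrow\; \int \frac{\partial \log L_T(x;\theta)}{\partial\theta} L_T(x;\theta)\,dx = 0.$$

Differentiating again with respect to $\theta$ and taking the derivative inside gives

$$\int \frac{\partial^2 \log L_T(x;\theta)}{\partial\theta^2} L_T(x;\theta)\,dx + \int \frac{\partial \log L_T(x;\theta)}{\partial\theta}\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0$$

$$\Rightarrow\; \int \frac{\partial^2 \log L_T(x;\theta)}{\partial\theta^2} L_T(x;\theta)\,dx + \int \frac{\partial \log L_T(x;\theta)}{\partial\theta}\frac{1}{L_T(x;\theta)}\frac{\partial L_T(x;\theta)}{\partial\theta} L_T(x;\theta)\,dx = 0$$

$$\Rightarrow\; \int \frac{\partial^2 \log L_T(x;\theta)}{\partial\theta^2} L_T(x;\theta)\,dx + \int \left(\frac{\partial \log L_T(x;\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\,dx = 0.$$

Thus

$$-E\left(\frac{\partial^2 \log L_T(X;\theta)}{\partial\theta^2}\right) = E\left(\frac{\partial \log L_T(X;\theta)}{\partial\theta}\right)^2,$$

which gives us the required result.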
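As a numerical illustration of the two expressions for the Fisher information, the following minimal sketch uses the exponential density $\theta^{-1}\exp(-x/\theta)$, for which the information of a sample of size $T$ is $T/\theta^2$ and the sample mean is an unbiased estimator of $\theta$. The parameter values, the seed and the variable names are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, T, n_rep = 2.0, 50, 20000   # illustrative values (assumptions)

# n_rep independent samples of size T from the density (1/theta) exp(-x/theta)
x = rng.exponential(scale=theta, size=(n_rep, T))

# Score and second derivative of the log-likelihood at the true theta
score = np.sum(-1.0 / theta + x / theta**2, axis=1)
second = np.sum(1.0 / theta**2 - 2.0 * x / theta**3, axis=1)

info_from_score = np.mean(score**2)    # estimates E[(d log L / d theta)^2]
info_from_hessian = -np.mean(second)   # estimates E[-d^2 log L / d theta^2]
info_exact = T / theta**2

# The sample mean is unbiased for theta; compare its variance with the bound
var_sample_mean = np.var(x.mean(axis=1))
cr_bound = 1.0 / info_exact

print(info_from_score, info_from_hessian, info_exact)
print(var_sample_mean, cr_bound)
```

In this example both Monte Carlo estimates of the information should be close to $T/\theta^2$, and the variance of the sample mean should be close to the bound, since the sample mean attains it for this model.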

Corollary 1.1.1 (Estimators which attain the C-R bound) Suppose Assumption 1.1.1 is satisfied. Then the estimator $\tilde\theta(X)$ attains the C-R bound only if it can be written as

$$\tilde\theta(X) = a(\theta) + b(\theta)\frac{\partial \log L_T(X;\theta)}{\partial\theta}$$

for some functions $a(\cdot)$ and $b(\cdot)$.

PROOF. The proof follows from the condition under which the Cauchy-Schwarz inequality is an equality in the derivation of the C-R bound.

We mention that there exist some well known distributions which do not satisfy Assumption 1.1.1. These are non-regular distributions. A classical example of a distribution which violates this assumption is the uniform distribution $f(x;\theta) = 1/\theta$ for $x \in [0,\theta]$ and zero elsewhere. Other examples include distributions where the support of the distribution is a function of the parameter. The Cramer-Rao lower bound need not hold, or even exist, for such distributions.

Example 1.1.2 (The classical example of the uniform) Let us consider the uniform distribution, which has the density $f(x;\theta) = \theta^{-1} I_{[0,\theta]}(x)$. Given the iid uniform random variables $\{X_t\}$, the likelihood (it is easier to study the likelihood rather than the log-likelihood) is

$$L_T(X_T;\theta) = \frac{1}{\theta^T}\prod_{t=1}^{T} I_{[0,\theta]}(X_t).$$

Since the support of the density involves the unknown parameter, the derivative of $\log L_T(X_T;\theta)$ is not well defined (what is the derivative of $\log I_{[0,\theta]}(X_t) = \log I_{[X_t,\infty)}(\theta)$ with respect to $\theta$? Observe that $\log 0$ is not well defined and the derivative at $X_t$ is not well defined) and Assumption 1.1.1(ii) is not satisfied. This is a classical example of a density which does not satisfy the regularity conditions. This means that the inverse of the Fisher information does not give a lower bound for the variance of an estimator, and below we show why.

In fact, using $L_T(X_T;\theta)$, the maximum likelihood estimator of $\theta$ is $\hat\theta_T = \max_{1\le t\le T} X_t$ (you can see this by making a plot of $L_T(X_T;\theta)$ against $\theta$). It is well known that the distribution of $\max_{1\le t\le T} X_t$ is

$$P\Big(\max_{1\le t\le T} X_t \le x\Big) = P(X_1 \le x,\ldots,X_T \le x) = \prod_{t=1}^{T} P(X_t \le x) = \left(\frac{x}{\theta}\right)^T, \qquad 0 \le x \le \theta,$$

and the density of $\max_{1\le t\le T} X_t$ is $f_{\hat\theta_T}(x) = T x^{T-1}/\theta^T$.

Exercise: Find the variance of $\hat\theta_T$ defined above.
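The distribution of the maximum stated above can be checked by simulation. The sketch below compares the empirical distribution function of $\max_t X_t$ with $(x/\theta)^T$ on a small grid; the values of $\theta$ and $T$, the seed and the grid are illustrative assumptions, and the code deliberately does not compute the variance asked for in the exercise.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, T, n_rep = 3.0, 10, 50000   # illustrative values (assumptions)

# Each row is one sample of size T from U[0, theta]; the mle is the row maximum
theta_hat = rng.uniform(0.0, theta, size=(n_rep, T)).max(axis=1)

# Compare the empirical distribution function with (x / theta)^T on a grid
grid = np.linspace(0.5 * theta, theta, 6)
empirical_cdf = np.array([(theta_hat <= g).mean() for g in grid])
theoretical_cdf = (grid / theta) ** T
max_gap = np.max(np.abs(empirical_cdf - theoretical_cdf))

print(max_gap)
```

Note that every simulated value of the estimator lies below $\theta$, so the mle is biased downwards in this model.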

Often we want to estimate a function of $\theta$, $\tau(\theta)$. The following corollary is a small generalisation of the Cramer-Rao bound.

Corollary 1.1.2 Suppose the regularity conditions (Assumption 1.1.1) are satisfied and $T(X)$ is an unbiased estimator of $\tau(\theta)$. Then we have

$$\mathrm{var}(T(X)) \ge \frac{(\tau'(\theta))^2}{E\left(\frac{\partial \log L_T(X;\theta)}{\partial\theta}\right)^2}.$$

We now define the notion of sufficiency, which gives us the ingredients for constructing a good estimator (see also Sections 4.2.2, 7.1.1 and 7.1.3 of Davison (2002)).

Definition 1.1.1 (Sufficiency and the factorisation theorem) Suppose that $X = (X_1,\ldots,X_T)$ is a random vector. The statistic $s(X)$ is called a sufficient statistic of the parameter $\theta$ if the conditional distribution of $X$ given $s(X)$ is not a function of $\theta$.

Normally it is extremely hard to obtain the sufficient statistic from its definition. However, the factorisation theorem gives us a way of obtaining the sufficient statistic.

The Factorisation Theorem Suppose that the likelihood function can be partitioned as follows, $L_T(X;\theta) = h(X)\,g(s(X);\theta)$, where $h(X)$ is not a function of $\theta$. Then $s(X)$ is a sufficient statistic of $\theta$.

We see that a sufficient statistic contains all the information in the data about the parameter $\theta$.

Theorem 1.1.2 (Rao-Blackwell Theorem) Suppose $s(X)$ is a sufficient statistic and $\tilde\theta(X)$ is an unbiased estimator of $\theta$. If we define the new unbiased estimator $E(\tilde\theta(X)\,|\,s(X))$, then

$$\mathrm{var}\big(E(\tilde\theta(X)\,|\,s(X))\big) \le \mathrm{var}(\tilde\theta(X)).$$

The Rao-Blackwell theorem tells us that estimators with the smallest variance must be a function of the sufficient statistic. Of course, this raises the question of whether there is a unique estimator with the minimum variance. For this we require completeness of the sufficient statistic. Uniqueness immediately follows from completeness.

Definition 1.1.2 (Completeness) Let $s(X)$ be a sufficient statistic for $\theta$ and let $Z(\cdot)$ be a function of $s(X)$. Then $s(X)$ is a complete sufficient statistic if and only if $E(Z(s(X))) = 0$ for all $\theta$ implies $Z(t) = 0$ for all $t$.

Theorem 1.1.3 (Lehmann-Scheffe Theorem) Suppose that $S(X)$ is a complete sufficient statistic and $\tilde\theta(S(X))$ is an unbiased estimator of $\theta$. Then $\tilde\theta(S(X))$ is the unique minimum variance unbiased estimator of $\theta$.

The theorems above are theoretical, in the sense that under certain conditions they give a lower bound for the variance of a plausible estimator, and practical, in the sense that they tell us that the best estimator should be a function of a sufficient statistic. The natural question to ask is how to construct such estimators.

Among the most popular estimators in statistics are maximum likelihood estimators (mle). The mle of $\theta$ is $\hat\theta_T = \arg\max_{\theta\in\Theta} L_T(\theta)$, where $\Theta$ is the parameter space, which contains all values of $\theta$ with $\int f(x;\theta)\,dx = 1$. There are two reasons that they are so widely used: (i) it can be shown for a wide range of probability distributions, including (under certain conditions) the exponential family of distributions defined below, that the mle is a function of the sufficient statistic, hence the mle is often the minimum variance unbiased estimator; (ii) asymptotically (at least) the mle, under certain conditions, attains the C-R bound.

Of course one can construct examples where the regularity conditions are not satisfied and the mle is not the optimal estimator (examples include estimation of the range in the uniform distribution, where an estimator can be constructed which has a smaller variance than the mle). But for the vast majority of distributions the mle is optimal. It is also worth mentioning that there can exist biased estimators which have a smaller mean squared error than the mle (this intriguing notion is called super-efficiency, which is beyond this course; see Stoica and Ottersten (1996) for a review).


1.1.3 Additional Notes

We will use various distributions in this course. It would be useful if you compiled a list of these distributions and became familiar with them.

Example 1.1.3 (Useful transformations) Question:

The distribution function of the random variable $X_t$ is $F_t(x) = 1-\exp(-\lambda_t x)$.

(i) Give a transformation of $X_t$ such that the transformed variable is uniformly distributed on the interval $[0,1]$.

(ii) Suppose that I observe the independent (but not necessarily identically distributed) random variables $\{X_t\}$, and I want to check whether they have the distribution function $F_t(x) = 1-\exp(-\lambda_t x)$. Using (i), suggest a method for checking this.

Answer:

(i) It is well known that if the random variable $X_t$ has the (continuous) distribution function $F_t(x)$, then the transformed random variable $Y_t = F_t(X_t)$ is uniformly distributed on the interval $[0,1]$. To see this, note that the distribution of $Y_t$ can be evaluated as

$$P(Y_t \le y) = P(F_t(X_t) \le y) = P(X_t \le F_t^{-1}(y)) = F_t(F_t^{-1}(y)) = y, \qquad y \in [0,1].$$

Thus to answer the question, we let $Y_t = 1-\exp(-\lambda_t X_t)$, which has a uniform distribution.

(ii) If we want to check whether $X_t$ follows the distribution $F_t(x) = 1-\exp(-\lambda_t x)$, we can make the transformation $Y_t = 1-\exp(-\lambda_t X_t)$ and use, for example, the Kolmogorov-Smirnov test to check whether $\{Y_t\}$ follows a uniform distribution.
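The check described in (ii) can be sketched as follows, using `scipy.stats.kstest`. The rates $\lambda_t$ are drawn at random purely for illustration (an assumption, as are the sample size and the seed), and a deliberately misspecified transformation is included to show what a rejection looks like.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical rates lambda_t, one per observation (not identically distributed)
lam = rng.uniform(0.5, 3.0, size=500)
x = rng.exponential(scale=1.0 / lam)   # X_t has distribution function 1 - exp(-lam_t x)

# Probability integral transform: Y_t = F_t(X_t) should be U[0, 1]
y = 1.0 - np.exp(-lam * x)
ks_correct = stats.kstest(y, "uniform")

# Transforming with the wrong rates should be detected by the test
y_wrong = 1.0 - np.exp(-2.0 * lam * x)
ks_wrong = stats.kstest(y_wrong, "uniform")

print(ks_correct.statistic, ks_correct.pvalue)
print(ks_wrong.statistic, ks_wrong.pvalue)
```

The test applies here because the rates are treated as known. If the $\lambda_t$ were estimated from the same data, the usual Kolmogorov-Smirnov critical values would no longer be exact.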

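Example 1.1.4 Question

Suppose that $Z$ is a Weibull random variable with density $f(x;\phi,\alpha) = \left(\frac{\alpha}{\phi}\right)\left(\frac{x}{\phi}\right)^{\alpha-1}\exp(-(x/\phi)^{\alpha})$. Show that

$$E(Z^r) = \phi^r\,\Gamma\left(1+\frac{r}{\alpha}\right).$$

Hint: Use

$$\int_0^{\infty} x^a \exp(-x^b)\,dx = \frac{1}{b}\,\Gamma\left(\frac{a}{b}+\frac{1}{b}\right), \qquad a, b > 0.$$

This result may be useful in some of the examples given in this course.

Solution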
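The moment formula can be checked numerically (this is a sanity check, not the derivation the question asks for). The sketch below integrates $x^r f(x;\phi,\alpha)$ with `scipy.integrate.quad` and compares the result with $\phi^r\Gamma(1+r/\alpha)$; the function names and the parameter values are illustrative assumptions.

```python
import numpy as np
from math import gamma
from scipy import integrate

def weibull_pdf(x, phi, alpha):
    """Weibull density (alpha/phi) (x/phi)^(alpha-1) exp(-(x/phi)^alpha)."""
    return (alpha / phi) * (x / phi) ** (alpha - 1) * np.exp(-((x / phi) ** alpha))

def moment_numeric(r, phi, alpha):
    """E(Z^r) by numerical integration over the positive real line."""
    value, _ = integrate.quad(lambda x: x**r * weibull_pdf(x, phi, alpha), 0.0, np.inf)
    return value

def moment_formula(r, phi, alpha):
    """E(Z^r) from the closed-form expression phi^r Gamma(1 + r/alpha)."""
    return phi**r * gamma(1.0 + r / alpha)

for r, phi, alpha in [(1, 2.0, 1.5), (2, 0.7, 3.0)]:
    print(r, phi, alpha, moment_numeric(r, phi, alpha), moment_formula(r, phi, alpha))
```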


Chapter 2

The Bayesian Cramer-Rao

2.1 The Bayesian Cramer-Rao inequality

The classical Cramér-Rao inequality is useful for assessing the quality of a given estimator. But from the derivation we can clearly see that it only holds if the estimator is unbiased.

No such inequality can be derived to include estimators which are biased. For example, this can be a problem in nonparametric regression, where estimators in general will be biased. How does one assess the estimator in such cases? To answer this question we consider the Bayesian Cramer-Rao inequality. This is similar to the Cramer-Rao inequality but does not require that the estimator is unbiased, so long as we place a prior on the parameter space. This inequality is known as the Bayesian Cramer-Rao (or van Trees) inequality.

Suppose $\{X_i\}_{i=1}^{T}$ are random variables with joint density (likelihood) $L_T(X;\theta)$. Let $\tilde\theta(X)$ be an estimator of $\theta$. We now Bayesianise the set-up by placing a prior distribution on the parameter space $\Theta$; the density of this prior we denote as $\lambda$. Let $E(g(X)|\theta) = \int g(x) L_T(x|\theta)\,dx$ and let $E_\lambda$ denote the expectation with respect to the prior density $\lambda$. For example

$$E_\lambda E(\tilde\theta(X)|\theta) = \int_a^b \int_{\mathbb{R}^T} \tilde\theta(x) L_T(x|\theta)\lambda(\theta)\,dx\,d\theta.$$

Assumption 2.1.1 $\theta$ is defined over the compact interval $[a,b]$ and $\lambda(x) \to 0$ as $x \to a$ and $x \to b$ (so $\lambda(a) = \lambda(b) = 0$).

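Theorem 2.1.1 Suppose Assumptions 1.1.1 and 2.1.1 hold. Let $\tilde\theta(X)$ be an estimator of $\theta$. Then we have

$$E_\lambda\Big[E_\theta\big(\tilde\theta(X)-\theta\big)^2\Big] \ge \frac{1}{E_\lambda(I(\theta)) + I(\lambda)},$$

where

$$E_\lambda(I(\theta)) = \int\!\!\int \left(\frac{\partial \log L_T(x;\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta \quad\text{and}\quad I(\lambda) = \int \left(\frac{\partial \log\lambda(\theta)}{\partial\theta}\right)^2 \lambda(\theta)\,d\theta.$$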

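PROOF. We first note that under Assumption 2.1.1 we have

$$\int_a^b \frac{\partial\, L_T(x;\theta)\lambda(\theta)}{\partial\theta}\,d\theta = \Big[L_T(x;\theta)\lambda(\theta)\Big]_a^b = 0.$$

Therefore by using the above we have

$$\int_{\mathbb{R}^T} \tilde\theta(x)\int_a^b \frac{\partial\, L_T(x;\theta)\lambda(\theta)}{\partial\theta}\,d\theta\,dx = 0. \qquad (2.1)$$

Now let us consider $\int_{\mathbb{R}^T}\int_a^b \theta\,\frac{\partial\, L_T(x;\theta)\lambda(\theta)}{\partial\theta}\,d\theta\,dx$. By integration by parts we have

$$\int_{\mathbb{R}^T}\int_a^b \theta\,\frac{\partial\, L_T(x;\theta)\lambda(\theta)}{\partial\theta}\,d\theta\,dx = \int_{\mathbb{R}^T}\Big[\theta L_T(x;\theta)\lambda(\theta)\Big]_a^b dx - \int_{\mathbb{R}^T}\int_a^b L_T(x;\theta)\lambda(\theta)\,d\theta\,dx = -\int_{\mathbb{R}^T}\int_a^b L_T(x;\theta)\lambda(\theta)\,d\theta\,dx. \qquad (2.2)$$

Subtracting (2.2) from (2.1) we have

$$\int_{\mathbb{R}^T}\int_a^b \big(\tilde\theta(x)-\theta\big)\frac{\partial\, L_T(x;\theta)\lambda(\theta)}{\partial\theta}\,d\theta\,dx = \int_{\mathbb{R}^T}\int_a^b L_T(x;\theta)\lambda(\theta)\,d\theta\,dx = 1.$$

Multiplying and dividing by $L_T(x;\theta)\lambda(\theta)$ gives

$$\int_{\mathbb{R}^T}\int_a^b \big(\tilde\theta(x)-\theta\big)\frac{1}{L_T(x;\theta)\lambda(\theta)}\frac{\partial\, L_T(x;\theta)\lambda(\theta)}{\partial\theta}\,L_T(x;\theta)\lambda(\theta)\,d\theta\,dx = 1$$

$$\Rightarrow\; \int_{\mathbb{R}^T}\int_a^b \big(\tilde\theta(x)-\theta\big)\frac{\partial \log L_T(x;\theta)\lambda(\theta)}{\partial\theta}\,L_T(x;\theta)\lambda(\theta)\,d\theta\,dx = 1.$$

Now by using the Cauchy-Schwarz inequality we have

$$1 \le \underbrace{\int_a^b\int_{\mathbb{R}^T} \big(\tilde\theta(x)-\theta\big)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta}_{E_\lambda\left(E((\tilde\theta(X)-\theta)^2|\theta)\right)} \times \int_a^b\int_{\mathbb{R}^T} \left(\frac{\partial \log L_T(x;\theta)\lambda(\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta.$$

Rearranging the above gives

$$E_\lambda\Big[E_\theta\big(\tilde\theta(X)-\theta\big)^2\Big] \ge \left[\int_a^b\int_{\mathbb{R}^T} \left(\frac{\partial \log L_T(x;\theta)\lambda(\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta\right]^{-1}.$$

Finally we want to show that the denominator of the right hand side of the above is

$$\int_a^b\int_{\mathbb{R}^T} \left(\frac{\partial \log L_T(x;\theta)\lambda(\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta = E_\lambda(I(\theta)) + I(\lambda).$$

Since $\log L_T(x;\theta)\lambda(\theta) = \log L_T(x;\theta) + \log\lambda(\theta)$, expanding the square gives

$$\int_a^b\int_{\mathbb{R}^T} \left(\frac{\partial \log L_T(x;\theta)}{\partial\theta} + \frac{\partial \log\lambda(\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta$$

$$= \underbrace{\int_a^b\int_{\mathbb{R}^T} \left(\frac{\partial \log L_T(x;\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta}_{E_\lambda(I(\theta))} + 2\int_a^b\int_{\mathbb{R}^T} \frac{\partial \log L_T(x;\theta)}{\partial\theta}\frac{\partial \log\lambda(\theta)}{\partial\theta} L_T(x;\theta)\lambda(\theta)\,dx\,d\theta + \underbrace{\int_a^b\int_{\mathbb{R}^T} \left(\frac{\partial \log\lambda(\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta}_{I(\lambda)}.$$

We note that the cross term is zero, since

$$\int_a^b\int_{\mathbb{R}^T} \frac{\partial \log L_T(x;\theta)}{\partial\theta}\frac{\partial \log\lambda(\theta)}{\partial\theta} L_T(x;\theta)\lambda(\theta)\,dx\,d\theta = \int_a^b \frac{\partial \log\lambda(\theta)}{\partial\theta}\lambda(\theta)\underbrace{\int_{\mathbb{R}^T}\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx}_{=0}\,d\theta = 0,$$

and, because $L_T(\cdot;\theta)$ integrates to one,

$$\int_a^b\int_{\mathbb{R}^T} \left(\frac{\partial \log\lambda(\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta = \int_a^b \left(\frac{\partial \log\lambda(\theta)}{\partial\theta}\right)^2 \lambda(\theta)\,d\theta = I(\lambda).$$

Therefore the denominator is equal to $E_\lambda(I(\theta)) + I(\lambda)$, as required.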

We will consider applications of the Bayesian Cramer-Rao bound in Section 15.1.4 for obtaining lower bounds of nonparametric density estimators.


Chapter 3

The Exponential Family

3.1 The exponential family of distributions

See also Section 5.2, Davison (2002).

It is possible to derive the properties (e.g. mean, variance and maximum likelihood estimators, to be defined properly later on) for every distribution of interest. However, this can be cumbersome, the algebra can be tedious and we may not see the 'big picture'. Instead, we now consider an 'umbrella' family of distributions which includes several well known distributions. We will derive a general expression for the mean and variance of such distributions (which will be useful when we consider Generalised Linear models later in this course), and use these results to show that the maximum likelihood estimator is a function of the sufficient statistic, and thus is the best unbiased estimator (under the assumption of completeness). In other words, we show that for this family of distributions the maximum likelihood estimator (which we have encountered many times previously) is indeed the best parameter estimator (in terms of minimum variance).

Suppose that the distribution of the random variable $X_t$ can be written in the form

$$f(y;\omega) = \exp\big(s(y)\eta(\omega) - b(\omega) + c(y)\big). \qquad (3.1)$$

If the distribution of $X_t$ (the probability function for discrete random variables, or the probability density function for continuous random variables) has the above representation, then $X_t$ is said to belong to the exponential family of distributions. A large number of well known distributions belong to this family. Hence by understanding the properties of the exponential family, we can draw conclusions on a large number of distributions.

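Example 3.1.1 (a) The exponential distribution $X \sim \mathrm{Exp}(\lambda)$ has the pdf $f(y;\lambda) = \lambda\exp(-\lambda y)$, which can be written as

$$\log f(y;\lambda) = -\lambda y + \log\lambda.$$

Therefore $s(y) = y$, $\eta(\lambda) = -\lambda$ and $b(\lambda) = -\log\lambda$.

(b) The binomial distribution $P(X=y) = \binom{n}{y}\pi^y(1-\pi)^{n-y}$ can be rewritten as

$$\log P(y;\pi) = y\log\left(\frac{\pi}{1-\pi}\right) + n\log(1-\pi) + \log\binom{n}{y}.$$

Therefore $s(y) = y$, $\eta(\pi) = \log\left(\frac{\pi}{1-\pi}\right)$, $b(\pi) = n\log(1-\pi)^{-1} = -n\log(1-\pi)$ and $c(y) = \log\binom{n}{y}$.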

It should be mentioned that it is straightforward to generalise the exponential family to the case that $\theta$ is a vector of dimension greater than one. Suppose that $\theta$ is a $p$-dimensional vector. The order $p$ exponential family is defined as distributions which satisfy

$$f(y;\omega) = \exp\big(s(y)'\theta(\omega) - b(\omega) + c(y)\big),$$

where $s(y) = (s_1(y),\ldots,s_p(y))$ (with $\{s_i\}$ linearly independent) and $\theta(\omega) = (\theta_1(\omega),\ldots,\theta_p(\omega))$.

3.1.1 The natural exponential family

If we let $\theta = \eta(\omega)$ and $\eta$ is an invertible function (hence there is a one-to-one correspondence between the space containing $\omega$ and the space containing $\theta$), then we can rewrite (3.1) as

$$f(y;\theta) = \exp\big(s(y)\theta - \kappa(\theta) + c(y)\big),$$

where $\kappa(\theta) = b(\eta^{-1}(\theta))$. The natural exponential family is when $s(y) = y$.

Now, by transformation, we give examples of distributions which have natural form.

(i) The exponential distribution is already in natural exponential form.

(ii) For the binomial distribution we let $\theta = \log\left(\frac{\pi}{1-\pi}\right)$. Since $\log\left(\frac{\pi}{1-\pi}\right)$ is invertible, with $1-\pi = (1+\exp(\theta))^{-1}$, this gives the log distribution as

$$\log f(y;\theta) = y\theta - n\log\big(1+\exp(\theta)\big) + \log\binom{n}{y},$$

so that $\kappa(\theta) = n\log(1+\exp(\theta))$.

Hence the parameter of interest, $\pi$, has been transformed, and often we fit a model (later in the course) to $\theta$, and transform back to obtain an estimator of $\pi$.

Some properties of the natural exponential family

Distributions which have a natural exponential representation have interesting properties which we now discuss.

Lemma 3.1.1 Suppose that $X$ is a random variable which has the natural exponential representation. Then the moment generating function of $X$ is $E(\exp(tX)) = \exp(\kappa(t+\theta) - \kappa(\theta))$.


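PROOF. Let us suppose that $t$ is sufficiently small such that $f(y;\theta+t)$ is a distribution. The mgf is

$$M_X(t) = E(\exp(tX)) = \int \exp(ty)\exp\big(\theta y - \kappa(\theta) + c(y)\big)\,dy$$

$$= \exp\big(\kappa(\theta+t) - \kappa(\theta)\big)\int \exp\big((\theta+t)y - \kappa(\theta+t) + c(y)\big)\,dy = \exp\big(\kappa(\theta+t) - \kappa(\theta)\big),$$

since $\int \exp\big((\theta+t)y - \kappa(\theta+t) + c(y)\big)\,dy = \int f(y;\theta+t)\,dy = 1$. To obtain the moments we recall that $M_X'(0) = E(X)$ and $\mathrm{var}(X) = M_X''(0) - (M_X'(0))^2$. Therefore

$$M_X'(t) = \kappa'(\theta+t)\exp\big(\kappa(\theta+t) - \kappa(\theta)\big),$$

$$M_X''(t) = \big(\kappa''(\theta+t) + (\kappa'(\theta+t))^2\big)\exp\big(\kappa(\theta+t) - \kappa(\theta)\big).$$

Hence $M_X'(0) = \kappa'(\theta)$ and $M_X''(0) = \kappa''(\theta) + \kappa'(\theta)^2$, which gives $E(X) = \kappa'(\theta)$ and $\mathrm{var}(X) = \kappa''(\theta)$.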
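The relations $E(X) = \kappa'(\theta)$ and $\mathrm{var}(X) = \kappa''(\theta)$ can be checked numerically. The sketch below assumes the binomial written in natural form, for which $\kappa(\theta) = n\log(1+\exp(\theta))$ with $\theta = \log(\pi/(1-\pi))$, and approximates the derivatives of $\kappa$ by central finite differences. The values of $n$, $\pi$ and the step size `h` are illustrative assumptions.

```python
import numpy as np

n, pi = 12, 0.3                       # illustrative values (assumptions)
theta = np.log(pi / (1.0 - pi))       # natural parameter

def kappa(th):
    """Cumulant function of the binomial in natural form: n log(1 + exp(theta))."""
    return n * np.log1p(np.exp(th))

# Central finite differences for kappa' and kappa''
h = 1e-4
kappa_1 = (kappa(theta + h) - kappa(theta - h)) / (2 * h)
kappa_2 = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h**2

# Known mean and variance of Bin(n, pi)
mean_exact, var_exact = n * pi, n * pi * (1.0 - pi)

print(kappa_1, mean_exact)
print(kappa_2, var_exact)
```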

Remark 3.1.1 The mean and variance of the natural exponential family make obtaining the mle quite simple. We derive this later, but we first observe that since $E(X) = \kappa'(\theta)$, the mean of $X$ is a function of $\theta$, hence we can write $\mu(\theta) = \kappa'(\theta)$. Moreover, since $\mathrm{var}(X) = \kappa''(\theta)$, the derivative of $\mu$, $\mu'(\theta) = \kappa''(\theta)$, is strictly positive. In other words, $\mu(\theta)$ $(= \kappa'(\theta))$ is an increasing function of $\theta$. Thus $\mu(\theta)$ is an invertible function, and given $\mu(\theta)$ we can uniquely determine $\theta$. This observation will prove useful later when obtaining the mle of $\theta$.

3.1.2 Maximum likelihood estimation for the exponential family

Suppose that $\{X_t\}$ are iid random variables which have a natural exponential distribution representation. Then the log-likelihood function is

$$\mathcal{L}_T(X;\theta) = \theta\sum_{t=1}^{T} X_t - T\kappa(\theta) + \sum_{t=1}^{T} c(X_t).$$

Hence by using the factorisation theorem we see that the sufficient statistic for $\theta$ is $s(X) = \sum_{t=1}^{T} X_t$. Hence, supposing that Assumption 1.1.1 is satisfied, the minimum variance unbiased estimator of $\theta$ should be a function of $s(X)$. We now obtain the maximum likelihood estimator of $\theta$, and derive conditions under which the mle is a function of $s(X)$ (hence, by the Rao-Blackwell and Lehmann-Scheffe theorems, it is the best estimator).

The mle of $\theta$ is $\hat\theta_T$ where

$$\hat\theta_T = \arg\max_{\theta\in\Theta}\left(\theta\sum_{t=1}^{T} X_t - T\kappa(\theta) + \sum_{t=1}^{T} c(X_t)\right).$$


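The natural way to obtain $\hat\theta_T$ is to find the solution of $\frac{\partial \mathcal{L}_T(X;\theta)}{\partial\theta} = 0$. However, whether $\frac{\partial \mathcal{L}_T(X;\theta)}{\partial\theta}\big|_{\theta=\hat\theta_T} = 0$ holds depends on a few conditions. Before we derive these conditions we first consider the solution of the derivative of $\mathcal{L}_T(X;\theta)$. Differentiating $\mathcal{L}_T(X;\theta)$ gives

$$\frac{\partial \mathcal{L}_T(X;\theta)}{\partial\theta} = \sum_{t=1}^{T} X_t - T\kappa'(\theta).$$

Therefore, since $\mu(\theta) = \kappa'(\theta)$ is an invertible function, $\frac{\partial \mathcal{L}_T(X;\theta)}{\partial\theta} = 0$ when

$$\hat\theta_T = \mu^{-1}\left(\frac{1}{T}\sum_{t=1}^{T} X_t\right).$$

Of course, we need to know under what conditions

$$\mu^{-1}\left(\frac{1}{T}\sum_{t=1}^{T} X_t\right) = \arg\max_{\theta\in\Theta}\left(\theta\sum_{t=1}^{T} X_t - T\kappa(\theta) + \sum_{t=1}^{T} c(X_t)\right).$$

The above really depends on the parameter space $\Theta$.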

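Definition 3.1.1 Let $\Theta$ be the parameter space of $\theta$ and $\mathcal{Y}$ the space of outcomes of the random variable $X$. Let $\mathcal{M} = \{\mu = \mu(\theta);\ \theta \in \Theta\}$ denote the mean space. Let $\bar{\mathcal{Y}}_T = \{y = \frac{1}{T}\sum_{t=1}^{T} x_t;\ x_t \in \mathcal{Y}\}$ denote the sample mean space.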

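Lemma 3.1.2 Suppose that $\{X_t\}$ are iid random variables which have a natural exponential representation. If $\bar{\mathcal{Y}}_T \subset \mathcal{M}$ then

$$\mu^{-1}\left(\frac{1}{T}\sum_{t=1}^{T} X_t\right) = \arg\max_{\theta\in\Theta}\left(\theta\sum_{t=1}^{T} X_t - T\kappa(\theta) + \sum_{t=1}^{T} c(X_t)\right).$$

PROOF. The proof is straightforward. The first derivative of the log-likelihood is zero when $\theta = \mu^{-1}\big(\frac{1}{T}\sum_{t=1}^{T} X_t\big)$, and since the second derivative is $-T\kappa''(\theta) < 0$ the log-likelihood is concave, so this stationary point is the maximum. For this solution to exist as a point in $\Theta$, the sample mean must lie in the mean space $\mathcal{M}$, which is guaranteed when either $\mathcal{M} = \bar{\mathcal{Y}}_T$ or $\bar{\mathcal{Y}}_T \subset \mathcal{M}$.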

Remark 3.1.2 (Minimum variance unbiased estimators) Suppose $X_t$ has a distribution in the natural exponential family, the conditions of the above lemma are satisfied and $s(X)$ is a complete sufficient statistic for $\theta$. Moreover, if $\mu^{-1}\big(\frac{1}{T}\sum_{t=1}^{T} X_t\big)$ is an unbiased estimator of $\theta$, then $\mu^{-1}\big(\frac{1}{T}\sum_{t=1}^{T} X_t\big)$ is the minimum variance unbiased estimator of $\theta$. However, in general, this


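Remark 3.1.3 (Estimating $\omega$) Often we are interested in estimating $\omega$, where $\theta = \eta(\omega)$. By the chain rule,

$$\frac{\partial \ell(X;\theta)}{\partial\omega} = \frac{\partial\theta}{\partial\omega}\times\frac{\partial \ell(X;\theta)}{\partial\theta} = \eta'(\omega)\left(\sum_{t=1}^{T} X_t - T\kappa'(\theta)\right).$$

Then if all conditions regarding the parameter and sample mean spaces are satisfied, the mle of $\omega$ is

$$\hat\omega_T = \eta^{-1}\left(\mu^{-1}\left(\frac{1}{T}\sum_{t=1}^{T} X_t\right)\right).$$

It should be noted that one great advantage of the exponential family of distributions is that the mle is easy to obtain (with explicit expressions!).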
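As a concrete instance of $\hat\omega_T = \eta^{-1}(\mu^{-1}(\bar{X}))$, the sketch below takes iid $\mathrm{Bin}(n,\pi)$ observations, assuming the natural form with $\kappa(\theta) = n\log(1+\exp(\theta))$, so that $\mu(\theta) = n\exp(\theta)/(1+\exp(\theta))$ and $\eta(\pi) = \log(\pi/(1-\pi))$. The helper names `mu_inverse` and `eta_inverse`, the parameter values and the seed are illustrative assumptions. The calculation requires the sample mean to lie strictly between $0$ and $n$, which is the sample-mean-space condition in this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, pi_true, T = 10, 0.35, 5000        # illustrative values (assumptions)
x = rng.binomial(n, pi_true, size=T)

def mu_inverse(m):
    """Inverse of mu(theta) = kappa'(theta) = n exp(theta) / (1 + exp(theta))."""
    return np.log(m / (n - m))

def eta_inverse(theta):
    """Inverse of eta(pi) = log(pi / (1 - pi))."""
    return 1.0 / (1.0 + np.exp(-theta))

theta_hat = mu_inverse(x.mean())      # mle of the natural parameter
pi_hat = eta_inverse(theta_hat)       # mle of the original parameter

print(theta_hat, pi_hat, x.mean() / n)
```

In this case the two inversions collapse to the familiar estimator $\hat\pi = \bar{X}/n$.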

Many of the results above can be generalised to the setting that $\{X_t\}$ are independent but not necessarily identically distributed and there exist regressors $z$ which are known to influence the mean of $X_t$. We will revisit this problem when we consider generalised linear models.


Chapter 4

The Maximum Likelihood Estimator

4.1 The Maximum likelihood estimator

As illustrated in the exponential family of distributions, discussed above, the maximum likelihood estimator of $\theta_0$ (the true parameter) is defined as

$$\hat\theta_T = \arg\max_{\theta\in\Theta} \mathcal{L}_T(X;\theta) = \arg\max_{\theta\in\Theta} \mathcal{L}_T(\theta).$$

Often we find that $\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat\theta_T} = 0$, hence the solution can be obtained by setting the derivative of the log-likelihood (often called the score function) to zero and solving. However, if $\theta_0$ lies on or close to the boundary of the parameter space this will not necessarily be true.

Below we consider the sampling properties of $\hat\theta_T$ when the true parameter $\theta_0$ lies in the interior of the parameter space $\Theta$.

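We note that the likelihood is invariant to transformations of the data. For example, if $X$ has the density $f(\cdot;\theta)$ and we define the transformed random variable $Z = g(X)$, where the function $g$ has an inverse (it is a one-to-one transformation), then it is easy to show that the density of $Z$ is $f(g^{-1}(z);\theta)\left|\frac{\partial g^{-1}(z)}{\partial z}\right|$. Therefore the likelihood of $\{Z_t = g(X_t)\}$ is

$$\prod_{t=1}^{T} f(g^{-1}(Z_t);\theta)\left|\frac{\partial g^{-1}(z)}{\partial z}\right|_{z=Z_t} = \prod_{t=1}^{T} f(X_t;\theta)\left|\frac{\partial g^{-1}(z)}{\partial z}\right|_{z=Z_t}.$$

Hence it is proportional to the likelihood of $\{X_t\}$, and the maximum of the likelihood of $\{Z_t = g(X_t)\}$ is attained at the same value of $\theta$ as the maximum of the likelihood of $\{X_t\}$.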

4.1.1 Evaluating the MLE

Examples

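Example 4.1.1 $\{X_t\}$ are iid random variables which follow a Normal (Gaussian) distribution $N(\mu,\sigma^2)$. The log-likelihood is, up to an additive constant,

$$\mathcal{L}_T(X;\mu,\sigma^2) = -T\log\sigma - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(X_t-\mu)^2.$$

Maximising the above with respect to $\mu$ and $\sigma^2$ gives $\hat\mu_T = \bar{X}$ and $\hat\sigma^2 = \frac{1}{T}\sum_{t=1}^{T}(X_t-\bar{X})^2$.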

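Example 4.1.2 Question:

$\{Y_t\}$ are iid random variables which follow a Weibull distribution, which has the density

$$\frac{\alpha y^{\alpha-1}}{\theta^{\alpha}}\exp\big(-(y/\theta)^{\alpha}\big), \qquad \theta, \alpha > 0.$$

Suppose that $\alpha$ is known, but $\theta$ is unknown (and we need to estimate it). What is the maximum likelihood estimator of $\theta$?

Solution:

The log-likelihood is

$$\mathcal{L}_T(Y;\theta) = \sum_{t=1}^{T}\left(\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \left(\frac{Y_t}{\theta}\right)^{\alpha}\right),$$

which, up to terms that do not depend on $\theta$, is $\sum_{t=1}^{T}\left(-\alpha\log\theta - \left(\frac{Y_t}{\theta}\right)^{\alpha}\right)$.

The derivative of the log-likelihood with respect to $\theta$ is

$$\frac{\partial \mathcal{L}_T}{\partial\theta} = -\frac{T\alpha}{\theta} + \frac{\alpha}{\theta^{\alpha+1}}\sum_{t=1}^{T} Y_t^{\alpha} = 0.$$

Solving the above gives $\hat\theta_T = \big(\frac{1}{T}\sum_{t=1}^{T} Y_t^{\alpha}\big)^{1/\alpha}$.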
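The closed-form solution can be compared with a direct numerical maximisation of the log-likelihood. The sketch below simulates Weibull data with known shape $\alpha$ and uses `scipy.optimize.minimize_scalar` on the negative log-likelihood; the parameter values, the seed and the search interval are illustrative assumptions.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(4)
alpha, theta_true, T = 2.0, 3.0, 2000          # illustrative values (assumptions)
y = theta_true * rng.weibull(alpha, size=T)    # numpy's weibull sampler has unit scale

def neg_log_lik(theta):
    """Negative Weibull log-likelihood in theta (alpha known), constants dropped."""
    return -np.sum(-alpha * np.log(theta) - (y / theta) ** alpha)

# Closed-form mle derived above
theta_closed = np.mean(y**alpha) ** (1.0 / alpha)

# Numerical maximisation over an (assumed) interval containing the maximum
res = optimize.minimize_scalar(neg_log_lik, bounds=(0.1, 20.0), method="bounded")
theta_numeric = res.x

print(theta_closed, theta_numeric)
```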

Example 4.1.3 Notice that if $\alpha$ is given, an explicit solution for the maximum of the likelihood in the above example can be obtained. Consider instead the maximum of the likelihood with respect to $\alpha$ and $\theta$, i.e.

$$\arg\max_{\theta,\alpha}\sum_{t=1}^{T}\left(\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \left(\frac{Y_t}{\theta}\right)^{\alpha}\right).$$

The derivatives of the log-likelihood are

$$\frac{\partial \mathcal{L}_T}{\partial\theta} = -\frac{T\alpha}{\theta} + \frac{\alpha}{\theta^{\alpha+1}}\sum_{t=1}^{T} Y_t^{\alpha} = 0,$$

$$\frac{\partial \mathcal{L}_T}{\partial\alpha} = \frac{T}{\alpha} + \sum_{t=1}^{T}\log Y_t - T\log\theta - \sum_{t=1}^{T}\log\left(\frac{Y_t}{\theta}\right)\left(\frac{Y_t}{\theta}\right)^{\alpha} = 0.$$

It is clear that an explicit expression for the solution of the above does not exist, and we need to find alternative methods for finding a solution. Below we shall describe numerical routines which can be used in the maximisation. In special cases, one can use other methods, such as the profile likelihood (we cover this later on).

Numerical Routines

In an ideal world, to maximise a likelihood we would consider the derivative of the likelihood and solve it ($\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat\theta_T} = 0$), and an explicit expression would exist for this solution. In reality this rarely happens (as we illustrated in the section above).

Usually, we will be unable to obtain an explicit expression for the MLE. In such cases, one has to do the maximisation using alternative, numerical methods. Typically it is relatively straightforward to maximise the likelihood of random variables which belong to the exponential family (numerical algorithms sometimes have to be used, but they tend to be fast and attain the global maximum of the likelihood, not just a local maximum). However, the story becomes more complicated even if we consider mixtures of exponential family distributions. These do not belong to the exponential family, and can be difficult to maximise using conventional numerical routines. We give an example of such a distribution here. Let us suppose that $\{X_t\}$ are iid random variables which follow the classical normal mixture distribution

$$f(y;\theta) = p f_1(y;\theta_1) + (1-p) f_2(y;\theta_2),$$

where $f_1$ is the density of the normal with mean $\mu_1$ and variance $\sigma_1^2$ and $f_2$ is the density of the normal with mean $\mu_2$ and variance $\sigma_2^2$. The log-likelihood is

$$\mathcal{L}_T(X;\theta) = \sum_{t=1}^{T}\log\left(p\frac{1}{\sqrt{2\pi\sigma_1^2}}\exp\Big(-\frac{1}{2\sigma_1^2}(X_t-\mu_1)^2\Big) + (1-p)\frac{1}{\sqrt{2\pi\sigma_2^2}}\exp\Big(-\frac{1}{2\sigma_2^2}(X_t-\mu_2)^2\Big)\right).$$

Studying the above, it is clear there does not exist an explicit solution for the maximum. Hence one needs to use a numerical algorithm to maximise the above likelihood.
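Before any numerical maximisation, the mixture log-likelihood has to be evaluated reliably. The sketch below is one possible way of coding it, using `scipy.special.logsumexp` to avoid underflow when an observation is far from both component means; the function name `mixture_log_lik` and the example values are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mixture_log_lik(x, p, mu1, sigma1, mu2, sigma2):
    """Log-likelihood of the two-component normal mixture, summed over x."""
    log_comp = np.stack([
        np.log(p) + norm.logpdf(x, loc=mu1, scale=sigma1),
        np.log1p(-p) + norm.logpdf(x, loc=mu2, scale=sigma2),
    ])
    # log(exp(a) + exp(b)) evaluated stably for each observation, then summed
    return np.sum(logsumexp(log_comp, axis=0))

# Example evaluation on a few hypothetical observations
x_demo = np.array([-1.0, 0.2, 3.5])
print(mixture_log_lik(x_demo, 0.3, 0.0, 1.0, 3.0, 2.0))
```

A function of this form could then be handed (with a minus sign) to a general-purpose optimiser, keeping in mind that the mixture likelihood can have several local maxima.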


The Newton-Raphson routine. The Newton-Raphson routine is the standard method to numerically maximise the likelihood; this can often be done automatically in R by using the R functions optim or nlm. To apply Newton-Raphson, we have to assume that the derivative of the likelihood exists (this is not always the case, think about the $\ell_1$-norm based estimators!) and the maximum lies inside the parameter space such that $\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat\theta_T} = 0$. We choose an initial value $\theta_1$ and apply the routine

$$\theta_n = \theta_{n-1} - \left(\frac{\partial^2 \mathcal{L}_T(\theta)}{\partial\theta^2}\Big|_{\theta_{n-1}}\right)^{-1}\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\theta_{n-1}}.$$

Where this routine comes from will be clear by using the Taylor expansion of $\frac{\partial \mathcal{L}_T(\theta_{n-1})}{\partial\theta}$ about $\theta_0$ (see Section 4.1.3). If the likelihood has just one global maximum and no other local maxima (for example, when the log-likelihood is concave), then it is quite easy to maximise. If, on the other hand, the likelihood has a few local maxima and the initial value $\theta_1$ is not chosen close enough to the true maximum, then the routine may converge to a local maximum (not good!). In this case it may be a good idea to run the routine several times for several different initial values $\theta_1^{(i)}$. For each convergence value $\hat\theta_T^{(i)}$ evaluate the likelihood $\mathcal{L}_T(\hat\theta_T^{(i)})$ and select the value which gives the largest likelihood. It is best to avoid these problems by starting with an informed choice of initial value.

Implementing a Newton-Raphson routine without any thought can lead to estimators which take an incredibly long time to converge. If one carefully considers the likelihood, one can shorten the convergence time by rewriting the likelihood and using faster methods (often based on the Newton-Raphson).
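A minimal sketch of the Newton-Raphson iteration for a scalar parameter is given below. It uses the Weibull log-likelihood in $\theta$ with the shape $\alpha$ known, because the closed-form maximiser $\big(\frac{1}{T}\sum_t Y_t^{\alpha}\big)^{1/\alpha}$ is then available as a check. The function names, parameter values, starting value (the sample mean) and stopping rule are illustrative assumptions, not a recommended general-purpose implementation.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, theta_true, T = 2.0, 3.0, 1000          # illustrative values (assumptions)
y = theta_true * rng.weibull(alpha, size=T)
s = np.sum(y**alpha)

def score(theta):
    """First derivative of the Weibull log-likelihood with respect to theta."""
    return -T * alpha / theta + alpha * s / theta ** (alpha + 1)

def hessian(theta):
    """Second derivative of the Weibull log-likelihood with respect to theta."""
    return T * alpha / theta**2 - alpha * (alpha + 1) * s / theta ** (alpha + 2)

def newton_raphson(theta_start, tol=1e-10, max_iter=100):
    """Iterate theta_n = theta_{n-1} - score / hessian until the step is tiny."""
    theta = theta_start
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)
        theta = theta - step
        if abs(step) < tol:
            break
    return theta

theta_nr = newton_raphson(theta_start=np.mean(y))
theta_closed = (s / T) ** (1.0 / alpha)

print(theta_nr, theta_closed)
```

For likelihoods with several local maxima, the same function could be called from several starting values and the candidate with the largest log-likelihood retained, as described above.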

Iterative least squares. This is a method that we shall describe later when we consider Generalised linear models. As the name suggests, the algorithm has to be iterated; however, at each step weighted least squares is implemented (see later in the course).

The EM-algorithm. This is done by the introduction of dummy variables, which leads to a new 'unobserved' likelihood which can easily be maximised. In fact, one of the simplest methods of maximising the likelihood of mixture distributions is to use the EM-algorithm. We cover this later in the course.

See Example 4.23 on page 117 in Davison (2002).

The likelihood for dependent data

We mention that the likelihood for dependent data can also be constructed (though often the estimation and the asymptotic properties can be a lot harder to derive). Using the chain rule of conditional probability (i.e. $P(A_1,A_2,\ldots,A_T) = P(A_1)\prod_{i=2}^{T} P(A_i|A_{i-1},\ldots,A_1)$) we have

$$L_T(X;\theta) = f(X_1;\theta)\prod_{t=2}^{T} f(X_t|X_{t-1},\ldots,X_1;\theta).$$

Under certain conditions on $\{X_t\}$ the structure above, $\prod_{t=2}^{T} f(X_t|X_{t-1},\ldots,X_1;\theta)$, can be simplified. For example, if $X_t$ were Markovian then $X_t$ conditioned on the past only depends on the most recent past observation, i.e. $f(X_t|X_{t-1},\ldots,X_1;\theta) = f(X_t|X_{t-1};\theta)$. In this case the above likelihood reduces to

$$L_T(X;\theta) = f(X_1;\theta)\prod_{t=2}^{T} f(X_t|X_{t-1};\theta). \qquad (4.1)$$

Example 4.1.4 A lot of the material we will cover in this class will be for independent observations. However, likelihood methods can also work for dependent observations. Consider the AR(1) time series

$$X_t = aX_{t-1} + \varepsilon_t,$$

where $\varepsilon_t$ are iid random variables with mean zero. We will assume that $|a| < 1$.

We see from the above that the observation $X_{t-1}$ has a linear influence on the next observation and that the series is Markovian, that is, given $X_{t-1}$, the random variable $X_{t-2}$ has no influence on $X_t$ (to see this consider the distribution function $P(X_t \le x|X_{t-1},X_{t-2})$). Therefore by using (4.1) the likelihood of $\{X_t\}_t$ is

$$L_T(X;a) = f(X_1;a)\prod_{t=2}^{T} f_\varepsilon(X_t - aX_{t-1}), \qquad (4.2)$$

where $f_\varepsilon$ is the density of $\varepsilon$ and $f(X_1;a)$ is the marginal density of $X_1$. This means the likelihood of $\{X_t\}$ only depends on $f_\varepsilon$ and the marginal density of $X_1$. We use $\hat{a}_T = \arg\max L_T(X;a)$ as the mle of $a$.

Often we ignore the term $f(X_1;a)$ (because this is often hard to know; try to figure it out, it is relatively easy in the Gaussian case) and consider what is called the conditional likelihood

$$Q_T(X;a) = \prod_{t=2}^{T} f_\varepsilon(X_t - aX_{t-1}). \qquad (4.3)$$

We use $\tilde{a}_T = \arg\max Q_T(X;a)$ as the quasi-mle of $a$.
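The conditional likelihood (4.3) can be maximised numerically once an error density is specified. The sketch below assumes Laplace errors with unit scale (an illustrative choice made here so that it does not overlap with the Gaussian case), starts the simulated series at zero rather than from the stationary distribution, and maximises over $a \in (-1,1)$ with `scipy.optimize.minimize_scalar`. All names, parameter values and the seed are assumptions.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(6)
a_true, T = 0.6, 2000                 # illustrative values (assumptions)

# Simulate X_t = a X_{t-1} + eps_t with Laplace errors (an illustrative choice)
eps = rng.laplace(loc=0.0, scale=1.0, size=T)
x = np.zeros(T)
for t in range(1, T):
    x[t] = a_true * x[t - 1] + eps[t]

def neg_cond_log_lik(a):
    """Minus the log of prod_{t=2}^T f_eps(X_t - a X_{t-1}) for Laplace(0, 1) errors."""
    resid = x[1:] - a * x[:-1]
    return np.sum(np.abs(resid)) + (T - 1) * np.log(2.0)

res = optimize.minimize_scalar(neg_cond_log_lik, bounds=(-0.99, 0.99), method="bounded")
a_hat = res.x

print(a_hat)
```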

Exercise: What is the quasi-likelihood proportional to in the case that $\{\varepsilon_t\}$ are Gaussian random variables with mean zero? It should be mentioned that often the conditional likelihood is derived as if the errors $\{\varepsilon_t\}$ are Gaussian, even if they are not. This is often called the


4.1.2 A quick review of the central limit theorem

In this section we will not endeavour to prove the central limit theorem (which is usually based on showing that the characteristic function - a close cousin of the moment generating function - of the average converges to the characteristic function of the normal distribution). However, we will recall the general statement of the CLT and generalisations of it. The purpose of this section is not to lumber you with unnecessary mathematics but to help you understand when an estimator is close to normal (or not).

Lemma 4.1.1 (The famous CLT) Let us suppose that $\{X_t\}$ are iid random variables, let $\mu = E(X_t) < \infty$ and $\sigma^2 = \mathrm{var}(X_t) < \infty$. Define $\bar X = \frac{1}{T}\sum_{t=1}^{T}X_t$. Then we have

$$\sqrt{T}(\bar X - \mu)\overset{D}{\to}\mathcal N(0,\sigma^2),$$

alternatively, for large $T$, $(\bar X - \mu)$ is approximately $\mathcal N(0,\frac{\sigma^2}{T})$.

What this means is that if we have a large enough sample size and plot the histogram of several replications of the average, this should be close to normal.
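The statement of the lemma can be checked by a small simulation. The sketch below is only an illustration under assumed settings (Exp(1) observations, $T=200$, 5000 replications, all chosen here and not taken from the notes): it standardises each replicated average and reports how often it falls inside $\pm 1.96$.

```python
import math
import random
import statistics

rng = random.Random(1)
T, reps = 200, 5000
mu, sigma = 1.0, 1.0  # mean and standard deviation of the Exp(1) distribution

# standardised averages sqrt(T) * (Xbar - mu) / sigma, one per replication
z = []
for _ in range(reps):
    xbar = sum(rng.expovariate(1.0) for _ in range(T)) / T
    z.append(math.sqrt(T) * (xbar - mu) / sigma)

# proportion of standardised averages inside the central 95% region of N(0, 1)
inside = sum(abs(v) <= 1.96 for v in z) / reps
print(statistics.mean(z), statistics.stdev(z), inside)
```

If the normal approximation is reasonable, the mean should be near zero, the standard deviation near one and the proportion near 0.95.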

Remark 4.1.1 (i) The above lemma appears to be ‘restricted’ to just averages. However, it can be used in several different contexts, because averages arise in many situations and not just as the average of the observations. By judicious algebraic manipulations, one can show that several estimators can be rewritten as an average (or approximately as an average). At first appearance, the MLE of the Weibull parameters given in Example 4.1.3 does not look like an average; however, in the sections below we will consider general maximum likelihood estimators, and show that they can be rewritten as an average, hence the CLT applies to them too.

(ii) The CLT can be extended in several ways.

(a) To random variables whose variances are not all the same (ie. independent but not identically distributed random variables).

(b) Dependent random variables (so long as the dependency ‘decays’ in some way).

(c) To not just averages but weighted averages too (so long as the weights behave in a certain way). However, the weights should be ‘distributed well’ over all the random variables. Ie. suppose that $\{X_t\}$ are iid random variables. Then it is clear that $\frac{1}{10}\sum_{t=1}^{10}X_t$ will never be normal (unless $\{X_t\}$ is normal - observe 10 is fixed!), but it seems plausible that $\frac{1}{n}\sum_{t=1}^{n}\sin(2\pi t/12)X_t$ is close to normal for large $n$ (despite this not being the sum of iid random variables).

There exist several theorems which one can use to prove normality. But really the take home message is, look at your estimator and see whether asymptotic normality looks plausible - you could even check it through simulations.

Example 4.1.5 (Some problem cases) One should think a little before blindly applying the CLT. Suppose that the iid random variables $\{X_t\}$ follow a t-distribution with 2 degrees of freedom, ie. the density function is

$$f(x) = \frac{\Gamma(3/2)}{\sqrt{2\pi}}\Big(1+\frac{x^2}{2}\Big)^{-3/2}.$$

Let $\bar X = \frac{1}{n}\sum_{t=1}^{n}X_t$ denote the sample mean. It is well known that the mean of the t-distribution with two degrees of freedom exists, but the variance does not (it is too thick tailed). Thus, the assumptions required for the CLT in Lemma 4.1.1 are violated and the lemma cannot be used to conclude that $\bar X$ is close to normal (limits of averages of such thick tailed random variables are the subject of the theory of stable laws). Intuitively this is clear: recall that the chance of outliers for a t-distribution with a small number of degrees of freedom is large. This makes it impossible that even averages should be ‘well behaved’ (there is a large chance that an average could also be too large or too small).

To see why the variance is infinite, study the form of the t-distribution (with two degrees of freedom). For the variance to be finite, the tails of the distribution should converge to zero fast enough (in other words the probability of outliers should not be too large). See that the tails of the t-distribution (for large $x$) behave like $f(x)\sim Cx^{-3}$ (make a plot in Maple to check), thus the second moment $E(X^2)\ge\int_{M}^{\infty}Cx^{-3}x^2dx = \int_{M}^{\infty}Cx^{-1}dx$ (for some $C$ and $M$) is clearly not finite! This argument can be made precise.
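The tail argument can also be checked numerically instead of by a plot. The following is a minimal sketch (the function names, the evaluation point $x=100$ and the truncation points $M$ are choices made here, not part of the notes): it evaluates $x^3f(x)$ for a large $x$ and approximates the truncated second moment $\int_{-M}^{M}x^2f(x)dx$ for increasing $M$.

```python
import math

def t2_density(x):
    """Density of the t-distribution with 2 degrees of freedom."""
    return math.gamma(1.5) / math.sqrt(2 * math.pi) * (1 + x * x / 2) ** (-1.5)

def truncated_second_moment(M, steps=200_000):
    """Midpoint-rule approximation of the integral of x^2 f(x) over [-M, M]."""
    h = M / steps
    half = sum(((i + 0.5) * h) ** 2 * t2_density((i + 0.5) * h) for i in range(steps)) * h
    return 2 * half  # the integrand is symmetric about zero

# x^3 f(x) should settle near a constant C if f(x) ~ C x^{-3}
tail_ratio = 100.0 ** 3 * t2_density(100.0)

# the truncated second moment should keep growing (roughly like log M) instead of converging
moments = [truncated_second_moment(M) for M in (10.0, 100.0, 1000.0)]
print(tail_ratio, moments)
```

Each tenfold increase of $M$ adds roughly the same amount to the truncated second moment, which is the logarithmic divergence suggested by $\int_M^\infty Cx^{-1}dx$.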

4.1.3 The Taylor series expansion - the statistician's tool

The Taylor series is used all over the place in statistics and you should be completely fluent with using it. It can be used to prove consistency of an estimator, normality (based on the assumption that averages converge to a normal distribution), obtaining the limiting variance of an estimator etc. We start by demonstrating its use for the log likelihood.

We recall that the mean value theorem (in the univariate case) states that

$$f(x) = f(x_0) + (x-x_0)f'(\bar x_1) \quad\text{and}\quad f(x) = f(x_0) + (x-x_0)f'(x_0) + \frac{(x-x_0)^2}{2}f''(\bar x_2),$$

where $\bar x_1$ and $\bar x_2$ both lie between $x$ and $x_0$. In the case that $f$ is a multivariate function, then we have

$$f(x) = f(x_0) + (x-x_0)'\nabla f(x)\rfloor_{x=\bar x_1}$$

$$f(x) = f(x_0) + (x-x_0)'\nabla f(x)\rfloor_{x=x_0} + \frac{1}{2}(x-x_0)'\nabla^2 f(x)\rfloor_{x=\bar x_2}(x-x_0),$$

where $\bar x_1$ and $\bar x_2$ both lie between $x$ and $x_0$. In the case that $f(x)$ is a vector, then the mean value theorem does not directly work. Strictly speaking we cannot say that

$$f(x) = f(x_0) + (x-x_0)'\nabla f(x)\rfloor_{x=\bar x_1},$$

where $\bar x_1$ lies between $x$ and $x_0$. However, it is quite straightforward to overcome this inconvenience. The mean value theorem does hold pointwise, for every element of the vector $f(x) = (f_1(x),\ldots,f_d(x))$, ie. for every $1\le i\le d$ we have

$$f_i(x) = f_i(x_0) + (x-x_0)'\nabla f_i(x)\rfloor_{x=\bar x_i},$$

where $\bar x_i$ lies between $x$ and $x_0$. Thus if $\nabla f_i(x)\rfloor_{x=\bar x_i}\to\nabla f_i(x)\rfloor_{x=x_0}$, we do have that

$$f(x)\approx f(x_0) + (x-x_0)'\nabla f(x)\rfloor_{x=x_0}.$$

We use the above below.

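• Application 1 (An expression for $\mathcal L_T(\hat\theta_T) - \mathcal L_T(\theta_0)$ in terms of $(\hat\theta_T-\theta_0)$):

The expansion of $\mathcal L_T(\theta_0)$ (where $\theta_0$ is the true parameter) about the mle $\hat\theta_T$ gives

$$\mathcal L_T(\theta_0) - \mathcal L_T(\hat\theta_T) = \frac{\partial\mathcal L_T(\theta)}{\partial\theta}\rfloor_{\hat\theta_T}(\theta_0-\hat\theta_T) + \frac{1}{2}(\theta_0-\hat\theta_T)'\frac{\partial^2\mathcal L_T(\theta)}{\partial\theta^2}\rfloor_{\bar\theta_T}(\theta_0-\hat\theta_T),$$

where $\bar\theta_T$ lies between $\theta_0$ and $\hat\theta_T$. If $\hat\theta_T$ lies in the interior of the parameter space (this is an extremely important assumption here) then $\frac{\partial\mathcal L_T(\theta)}{\partial\theta}\rfloor_{\hat\theta_T} = 0$. Moreover, if it can be shown that $|\hat\theta_T-\theta_0|\overset{P}{\to}0$ (we show this in the section below), then under certain conditions on $\frac{\partial\mathcal L_T(\theta)}{\partial\theta}$ (such as the existence of the third derivative etc.) it can be shown that

$$-\frac{1}{T}\frac{\partial^2\mathcal L_T(\theta)}{\partial\theta^2}\rfloor_{\bar\theta_T}\overset{P}{\to}-\frac{1}{T}E\Big(\frac{\partial^2\mathcal L_T(\theta)}{\partial\theta^2}\rfloor_{\theta_0}\Big) = \frac{1}{T}I(\theta_0).$$

Hence the above is roughly

$$2\big(\mathcal L_T(\hat\theta_T) - \mathcal L_T(\theta_0)\big)\approx(\hat\theta_T-\theta_0)'I(\theta_0)(\hat\theta_T-\theta_0).$$

Note that in many of the derivations below we will use the above convergence of the second derivative. But it should be noted that this is only true if (i) $|\hat\theta_T-\theta_0|\overset{P}{\to}0$ and (ii) $\frac{1}{T}\frac{\partial^2\mathcal L_T(\theta)}{\partial\theta^2}$ converges uniformly (in $\theta$) to its expectation $\frac{1}{T}E\big(\frac{\partial^2\mathcal L_T(\theta)}{\partial\theta^2}\big)$.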

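• Application 2 (An expression for $(\hat\theta_T-\theta_0)$ in terms of $\frac{\partial\mathcal L_T(\theta)}{\partial\theta}\rfloor_{\theta_0}$):

The expansion of the $p$-dimensional vector $\frac{\partial\mathcal L_T(\theta)}{\partial\theta}\rfloor_{\hat\theta_T}$ pointwise about $\theta_0$ (the true parameter) gives (for $1\le i\le p$)

$$\frac{\partial\mathcal L_T(\theta)}{\partial\theta_i}\rfloor_{\hat\theta_T} = \frac{\partial\mathcal L_T(\theta)}{\partial\theta_i}\rfloor_{\theta_0} + \frac{\partial^2\mathcal L_T(\theta)}{\partial\theta_i\partial\theta}\rfloor_{\bar\theta_{i,T}}'(\hat\theta_T-\theta_0),$$

where each $\bar\theta_{i,T}$ lies between $\theta_0$ and $\hat\theta_T$. The left hand side is zero when $\hat\theta_T$ lies in the interior of the parameter space. Now by using the same argument as in Application 1 we have

$$\frac{\partial\mathcal L_T(\theta)}{\partial\theta}\rfloor_{\theta_0}\approx I(\theta_0)(\hat\theta_T-\theta_0).$$

We mention that $U_T(\theta_0) = \frac{\partial\mathcal L_T(\theta)}{\partial\theta}\rfloor_{\theta_0}$ is often called the score or U statistic. And we see that the asymptotic sampling properties of $U_T$ determine the sampling properties of $(\hat\theta_T-\theta_0)$.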
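To see the approximation in Application 2 at work, consider the following small sketch. It assumes, purely for illustration, iid exponential data parametrised by the rate $\lambda$ (density $\lambda\exp(-\lambda x)$, which is not the parametrisation used in Example 1.1.1), so that $\mathcal L_T(\lambda) = T\log\lambda - \lambda\sum_t X_t$, $U_T(\lambda) = T/\lambda - \sum_t X_t$ and $I(\lambda) = T/\lambda^2$. The values $\lambda_0=2$ and $T=5000$ are also assumptions.

```python
import random

rng = random.Random(2)
lam0, T = 2.0, 5000
x = [rng.expovariate(lam0) for _ in range(T)]
s = sum(x)

lam_hat = T / s           # mle of the rate, solves U_T(lam) = 0
score = T / lam0 - s      # U_T(lam0), the derivative of the log-likelihood at lam0
info = T / lam0 ** 2      # I(lam0) = E(-second derivative of the log-likelihood)
approx = score / info     # I(lam0)^{-1} U_T(lam0)

# the two quantities should agree up to a term of smaller order
print(lam_hat - lam0, approx)
```

In this model the gap between the two printed numbers is $(1-\lambda_0\bar X)^2/\bar X$, which is of order $T^{-1}$, while each number is itself of order $T^{-1/2}$.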

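Example 4.1.6 (The Weibull) Evaluate the second derivative of the likelihood given in Example 4.1.3 and take the expectation of this, $I(\theta,\alpha) = -E(\nabla^2\mathcal L_T)$ (we use $\nabla^2$ to denote the second derivative with respect to the parameters $\alpha$ and $\theta$). Exercise: Evaluate $I(\theta,\alpha)$.

Application 2 implies that the maximum likelihood estimators $\hat\theta_T$ and $\hat\alpha_T$ (recalling that no explicit expression for them exists) can be written as

$$\begin{pmatrix}\hat\theta_T-\theta\\ \hat\alpha_T-\alpha\end{pmatrix}\approx I(\theta,\alpha)^{-1}\begin{pmatrix}\sum_{t=1}^{T}\Big(-\frac{\alpha}{\theta}+\frac{\alpha}{\theta^{\alpha+1}}Y_t^{\alpha}\Big)\\ \sum_{t=1}^{T}\Big(\frac{1}{\alpha}-\log Y_t-\log\theta-\frac{\alpha}{\theta}+\log\big(\frac{Y_t}{\theta}\big)\times\big(\frac{Y_t}{\theta}\big)^{\alpha}\Big)\end{pmatrix}$$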

4.1.4 Sampling properties of the maximum likelihood estimator

See also Section 4.4.2 (p118), Davison (2002). These proofs will not be examined, but you should have some idea why Theorem 4.1.2 is true.

We have shown that under certain conditions the maximum likelihood estimator can often be the minimum variance unbiased estimator (for example, in the case of the exponential family of distributions). However, for finite samples, the mle may not attain the C-R lower bound. Hence for finite samples we may have $\mathrm{var}(\hat\theta_T) > I(\theta)^{-1}$. However, it can be shown that asymptotically the variance of the mle attains the Cramer-Rao lower bound. In other words, for large samples, the variance of the mle is close to the Cramer-Rao bound. We will prove the result in the case that $\mathcal L_T$ is the log likelihood of independent, identically distributed random variables. The proof can be generalised to the case of non-identically distributed random variables.

We first state sufficient conditions for this to be true.

Assumption 4.1.1 [Regularity Conditions 2] Let $\{X_t\}$ be iid random variables with density $f(x;\theta)$.

(i) Suppose the conditions in Assumption 1.1.1 hold.

(ii) Almost sure uniform convergence (This is optional)

For every $\varepsilon > 0$ we have

$$P\Big(\lim_{T\to\infty}\sup_{\theta\in\Theta}\Big|\frac{1}{T}\mathcal L_T(X;\theta) - E\Big(\frac{1}{T}\mathcal L_T(X;\theta)\Big)\Big| > \varepsilon\Big) = 0.$$

We mention that directly verifying uniform convergence can be difficult. However, it can be established by showing that the parameter space is compact, pointwise convergence of the likelihood to its expectation and equicontinuity of the likelihood (almost surely or in probability).

(iii) Model identifiability

For every $\theta\in\Theta$, there does not exist another $\tilde\theta\in\Theta$ such that $f(x;\theta) = f(x;\tilde\theta)$ for all $x$.

(iv) The parameter space Θ is finite and compact.

(v) $\sup_{\theta\in\Theta}E\Big|\frac{1}{T}\frac{\partial\mathcal L_T(X;\theta)}{\partial\theta}\Big| < \infty$.

We require Assumption 4.1.1(ii,iii) to show consistency and Assumptions 1.1.1 and 4.1.1(iii-v) to show asymptotic normality.

Theorem 4.1.1 Suppose Assumption 4.1.1(ii,iii) holds. Let $\theta_0$ be the true parameter and $\hat\theta_T$ be the mle. Then we have $\hat\theta_T\overset{a.s.}{\to}\theta_0$ (consistency).

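PROOF. To prove the result we first need to show that the expectation of the log-likelihood is maximised at the true parameter and that this is the unique maximum. In other words we need to show that $E(\frac{1}{T}\mathcal L_T(X;\theta)) - E(\frac{1}{T}\mathcal L_T(X;\theta_0))\le 0$ for all $\theta\in\Theta$. To do this, we have

$$E\Big(\frac{1}{T}\mathcal L_T(X;\theta)\Big) - E\Big(\frac{1}{T}\mathcal L_T(X;\theta_0)\Big) = \int\log\frac{f(x;\theta)}{f(x;\theta_0)}f(x;\theta_0)dx = E\Big(\log\frac{f(X;\theta)}{f(X;\theta_0)}\Big).$$

Now by using Jensen’s inequality we have

$$E\Big(\log\frac{f(X;\theta)}{f(X;\theta_0)}\Big)\le\log E\Big(\frac{f(X;\theta)}{f(X;\theta_0)}\Big) = \log\int f(x;\theta)dx = 0.$$

Thus giving $E(\frac{1}{T}\mathcal L_T(X;\theta)) - E(\frac{1}{T}\mathcal L_T(X;\theta_0))\le 0$. To prove that $E(\frac{1}{T}\mathcal L_T(X;\theta)) - E(\frac{1}{T}\mathcal L_T(X;\theta_0)) = 0$ only when $\theta = \theta_0$ we note the identifiability assumption in Assumption 4.1.1(iii), which means that no other $\theta\in\Theta$ gives the same density as $\theta_0$. Hence $E(\frac{1}{T}\mathcal L_T(X;\theta))$ is uniquely maximum at $\theta_0$.

Finally, we need to show that $\hat\theta_T\overset{a.s.}{\to}\theta_0$. By Assumption 4.1.1(ii) (and also the LLN) we have for all $\theta\in\Theta$ that $\frac{1}{T}\mathcal L_T(X;\theta)\overset{a.s.}{\to}\ell(\theta)$, where $\ell(\theta) = E(\frac{1}{T}\mathcal L_T(X;\theta))$. Therefore, for every mle $\hat\theta_T$ we have

$$\frac{1}{T}\mathcal L_T(X;\theta_0)\le\frac{1}{T}\mathcal L_T(X;\hat\theta_T)\overset{a.s.}{\to}E\Big(\frac{1}{T}\mathcal L_T(X;\hat\theta_T)\Big)\le E\Big(\frac{1}{T}\mathcal L_T(X;\theta_0)\Big), \qquad (4.4)$$

where $E(\frac{1}{T}\mathcal L_T(X;\hat\theta_T))$ denotes $\ell(\theta)$ evaluated at $\theta = \hat\theta_T$. To bound $|E(\frac{1}{T}\mathcal L_T(X;\theta_0)) - \frac{1}{T}\mathcal L_T(X;\hat\theta_T)|$ we note that

$$E\Big(\frac{1}{T}\mathcal L_T(X;\theta_0)\Big) - \frac{1}{T}\mathcal L_T(X;\hat\theta_T) = \Big\{E\Big(\frac{1}{T}\mathcal L_T(X;\theta_0)\Big) - \frac{1}{T}\mathcal L_T(X;\theta_0)\Big\} + \Big\{E\Big(\frac{1}{T}\mathcal L_T(X;\hat\theta_T)\Big) - \frac{1}{T}\mathcal L_T(X;\hat\theta_T)\Big\} + \Big\{\frac{1}{T}\mathcal L_T(X;\theta_0) - E\Big(\frac{1}{T}\mathcal L_T(X;\hat\theta_T)\Big)\Big\}.$$

Now by using (4.4) to bound the last bracket from above and from below we have

$$E\Big(\frac{1}{T}\mathcal L_T(X;\theta_0)\Big) - \frac{1}{T}\mathcal L_T(X;\hat\theta_T)\le\Big\{E\Big(\frac{1}{T}\mathcal L_T(X;\theta_0)\Big) - \frac{1}{T}\mathcal L_T(X;\theta_0)\Big\} + \Big\{E\Big(\frac{1}{T}\mathcal L_T(X;\hat\theta_T)\Big) - \frac{1}{T}\mathcal L_T(X;\hat\theta_T)\Big\} + \Big\{\frac{1}{T}\mathcal L_T(X;\hat\theta_T) - E\Big(\frac{1}{T}\mathcal L_T(X;\hat\theta_T)\Big)\Big\}$$

and

$$E\Big(\frac{1}{T}\mathcal L_T(X;\theta_0)\Big) - \frac{1}{T}\mathcal L_T(X;\hat\theta_T)\ge\Big\{E\Big(\frac{1}{T}\mathcal L_T(X;\theta_0)\Big) - \frac{1}{T}\mathcal L_T(X;\theta_0)\Big\} + \Big\{E\Big(\frac{1}{T}\mathcal L_T(X;\hat\theta_T)\Big) - \frac{1}{T}\mathcal L_T(X;\hat\theta_T)\Big\} + \Big\{\frac{1}{T}\mathcal L_T(X;\theta_0) - E\Big(\frac{1}{T}\mathcal L_T(X;\theta_0)\Big)\Big\}.$$

Therefore, under Assumption 4.1.1(ii) we have

$$\Big|E\Big(\frac{1}{T}\mathcal L_T(X;\theta_0)\Big) - \frac{1}{T}\mathcal L_T(X;\hat\theta_T)\Big|\le 3\sup_{\theta\in\Theta}\Big|E\Big(\frac{1}{T}\mathcal L_T(X;\theta)\Big) - \frac{1}{T}\mathcal L_T(X;\theta)\Big|\overset{a.s.}{\to}0.$$

Since $E(\frac{1}{T}\mathcal L_T(X;\theta))$ has a unique maximum at $\theta_0$ this implies $\hat\theta_T\overset{a.s.}{\to}\theta_0$.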

Hence we have shown consistency of the mle. We now need to show asymptotic normality.

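Theorem 4.1.2 Suppose Assumption 4.1.1 is satisfied.

(i) Then the score statistic is

$$\frac{1}{\sqrt{T}}\frac{\partial\mathcal L_T(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\overset{D}{\to}\mathcal N\Big(0, E\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\Big)^2\Big). \qquad (4.5)$$

(ii) Then the mle is

$$\sqrt{T}\big(\hat\theta_T-\theta_0\big)\overset{D}{\to}\mathcal N\Big(0,\Big[E\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\Big)^2\Big]^{-1}\Big).$$

(iii) The log likelihood ratio is

$$2\big(\mathcal L_T(X;\hat\theta_T) - \mathcal L_T(X;\theta_0)\big)\overset{D}{\to}\chi^2_p.$$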
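Parts (ii) and (iii) of the theorem can be checked by simulation, as suggested earlier for the CLT. The sketch below is an illustration under assumed settings, not part of the notes: iid exponential data parametrised by the rate $\lambda$ (density $\lambda\exp(-\lambda x)$), for which the mle is $\hat\lambda_T = 1/\bar X$ and $E(\partial\log f/\partial\lambda)^2 = 1/\lambda^2$, with $\lambda_0=2$, $T=200$ and 2000 replications.

```python
import math
import random
import statistics

rng = random.Random(3)
lam0, T, reps = 2.0, 200, 2000

scaled_errors, lr_stats = [], []
for _ in range(reps):
    s = sum(rng.expovariate(lam0) for _ in range(T))
    lam_hat = T / s
    scaled_errors.append(math.sqrt(T) * (lam_hat - lam0))
    # 2 (L_T(lam_hat) - L_T(lam0)) with log-likelihood L_T(lam) = T log(lam) - lam * s
    lr_stats.append(2 * (T * math.log(lam_hat / lam0) - (lam_hat - lam0) * s))

# (ii): the variance of sqrt(T)(lam_hat - lam0) should be close to lam0^2
var_ratio = statistics.variance(scaled_errors) / lam0 ** 2
# (iii): about 5% of the statistics should exceed 3.841, the 95% point of chi-squared(1)
reject_rate = sum(w > 3.841 for w in lr_stats) / reps
print(var_ratio, reject_rate)
```

A variance ratio near one and an exceedance rate near 0.05 are what the theorem predicts for large $T$; for small $T$ both can be noticeably off.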

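PROOF. First we will prove (i). We recall that because $\{X_t\}$ are iid random variables,

$$\frac{1}{\sqrt{T}}\frac{\partial\mathcal L_T(X;\theta)}{\partial\theta}\rfloor_{\theta_0} = \frac{1}{\sqrt{T}}\sum_{t=1}^{T}\frac{\partial\log f(X_t;\theta)}{\partial\theta}\rfloor_{\theta_0}.$$

Hence $\frac{\partial\mathcal L_T(X;\theta)}{\partial\theta}\rfloor_{\theta_0}$ is the sum of iid random variables with mean zero and variance $\mathrm{var}\big(\frac{\partial\log f(X_t;\theta)}{\partial\theta}\rfloor_{\theta_0}\big)$. Therefore, by the CLT for iid random variables we have (4.5).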

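We use (i) and the Taylor (mean value) theorem to prove (ii). We first note that by the mean value theorem we have

$$\frac{1}{T}\frac{\partial\mathcal L_T(X;\theta)}{\partial\theta}\rfloor_{\hat\theta_T} = \frac{1}{T}\frac{\partial\mathcal L_T(X;\theta)}{\partial\theta}\rfloor_{\theta_0} + (\hat\theta_T-\theta_0)\frac{1}{T}\frac{\partial^2\mathcal L_T(X;\theta)}{\partial\theta^2}\rfloor_{\bar\theta_T}. \qquad (4.6)$$

Now it can be shown, because $\Theta$ is compact, $|\hat\theta_T-\theta_0|\overset{a.s.}{\to}0$ and the expectation of the third derivative of $\mathcal L_T$ is bounded, that

$$\frac{1}{T}\frac{\partial^2\mathcal L_T(X;\theta)}{\partial\theta^2}\rfloor_{\bar\theta_T}\overset{P}{\to}\frac{1}{T}E\Big(\frac{\partial^2\mathcal L_T(X;\theta)}{\partial\theta^2}\rfloor_{\theta_0}\Big) = E\Big(\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\rfloor_{\theta_0}\Big). \qquad (4.7)$$

Since $\hat\theta_T$ maximises $\mathcal L_T$ and lies in the interior of $\Theta$, the left hand side of (4.6) is zero. Substituting (4.7) into (4.6) therefore gives

$$\sqrt{T}(\hat\theta_T-\theta_0) = -\Big(\frac{1}{T}\frac{\partial^2\mathcal L_T(X;\theta)}{\partial\theta^2}\rfloor_{\bar\theta_T}\Big)^{-1}\frac{1}{\sqrt{T}}\frac{\partial\mathcal L_T(X;\theta)}{\partial\theta}\rfloor_{\theta_0} = -\Big(E\Big(\frac{1}{T}\frac{\partial^2\mathcal L_T(X;\theta)}{\partial\theta^2}\rfloor_{\theta_0}\Big)\Big)^{-1}\frac{1}{\sqrt{T}}\frac{\partial\mathcal L_T(X;\theta)}{\partial\theta}\rfloor_{\theta_0} + o_p(1).$$

We mention that the proof above is for univariate $\frac{\partial^2\mathcal L_T(X;\theta)}{\partial\theta^2}\rfloor_{\bar\theta_T}$, but by redoing the above steps pointwise it can easily be generalised to the multivariate case too. Hence by substituting (4.5) into the above (and recalling that $E\big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\big) = E\big(\frac{\partial\log f(X;\theta)}{\partial\theta}\big)^2$) we have (ii). It is straightforward to prove (iii) by using

$$2\big(\mathcal L_T(X;\hat\theta_T) - \mathcal L_T(X;\theta_0)\big)\approx(\hat\theta_T-\theta_0)'I(\theta_0)(\hat\theta_T-\theta_0),$$

(ii) and the result that if $X\sim\mathcal N(0,\Sigma)$, then $AX\sim\mathcal N(0,A\Sigma A')$.

Example 4.1.7 (The Weibull) By using Example 4.1.6 we have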

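$$\begin{pmatrix}\hat\theta_T-\theta\\ \hat\alpha_T-\alpha\end{pmatrix}\approx I(\theta,\alpha)^{-1}\begin{pmatrix}\sum_{t=1}^{T}\Big(-\frac{\alpha}{\theta}+\frac{\alpha}{\theta^{\alpha+1}}Y_t^{\alpha}\Big)\\ \sum_{t=1}^{T}\Big(\frac{1}{\alpha}-\log Y_t-\log\theta-\frac{\alpha}{\theta}+\log\big(\frac{Y_t}{\theta}\big)\times\big(\frac{Y_t}{\theta}\big)^{\alpha}\Big)\end{pmatrix}.$$

Now we observe that the RHS consists of a sum of iid random variables (this can be viewed as an average). Since the variance of this exists (you can show that it is $I(\theta,\alpha)$), the CLT can be applied and we have that

$$\begin{pmatrix}\hat\theta_T-\theta\\ \hat\alpha_T-\alpha\end{pmatrix}\overset{D}{\to}\mathcal N\big(0, I(\theta,\alpha)^{-1}\big).$$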

Remark 4.1.2 (i) We recall that for iid random variables the Fisher information for sample size $T$ is

$$I(\theta) = E\Big(\frac{\partial\log L_T(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\Big)^2 = TE\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\Big)^2.$$

Hence comparing with the above theorem, we see that for iid random variables (so long as the regularity conditions are satisfied) the MLE, asymptotically, attains the Cramer-Rao bound even if for finite samples this may not be true.

Moreover, since

$$(\hat\theta_T-\theta_0)\approx I(\theta_0)^{-1}\frac{\partial\mathcal L_T(\theta)}{\partial\theta}\rfloor_{\theta_0} = \big(T^{-1}I(\theta_0)\big)^{-1}\frac{1}{T}\frac{\partial\mathcal L_T(\theta)}{\partial\theta}\rfloor_{\theta_0}$$

and $\mathrm{var}\big(\frac{1}{\sqrt{T}}\frac{\partial\mathcal L_T(\theta)}{\partial\theta}\rfloor_{\theta_0}\big) = \frac{1}{T}I(\theta_0)$, it can be seen that $|\hat\theta_T-\theta_0| = O_p(T^{-1/2})$.

(ii) Under suitable conditions a similar result holds true for data which is not iid.

In summary, the MLE (under certain regularity conditions) tends to have the smallest variance, and for large samples, the variance is close to the lower bound, which is the Cramer-Rao bound.

In the case that Assumption 4.1.1 is satisfied, the MLE is said to be asymptotically efficient. This means for finite samples the MLE may not attain the C-R bound but asymptotically it will.

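(iii) A simple application of Theorem 4.1.2 is to the derivation of the distribution of $I(\theta_0)^{1/2}(\hat\theta_T-\theta_0)$. It is clear that by using Theorem 4.1.2 we have

$$I(\theta_0)^{1/2}(\hat\theta_T-\theta_0)\overset{D}{\to}\mathcal N(0,I_p)$$

(where $I_p$ is the identity matrix) and

$$(\hat\theta_T-\theta_0)'I(\theta_0)(\hat\theta_T-\theta_0)\overset{D}{\to}\chi^2_p.$$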

(iv) Note that these results apply when θ0 lies inside the parameter space Θ. Asθ gets closer to


Remark 4.1.3 (Generalised estimating equations) Closely related to the MLE are generalised estimating equations (GEE), which are related to the score statistic. These are estimators not based on maximising the likelihood but on equating the score statistic (the derivative of the likelihood) to zero and solving for the unknown parameters. Often they are equivalent to the MLE but they can be adapted to be useful in themselves (and some adaptations will not be the derivative of a likelihood).

4.1.5 The Fisher information

See also Section 4.3, Davison (2002).

Let us return to the Fisher information. We recall that under certain regularity conditions an unbiased estimator, $\tilde\theta(X)$, of a parameter $\theta_0$ is such that

$$\mathrm{var}(\tilde\theta(X))\ge I(\theta_0)^{-1}, \quad\text{where}\quad I(\theta) = E\Big(\frac{\partial\mathcal L_T(\theta)}{\partial\theta}\Big)^2 = E\Big(-\frac{\partial^2\mathcal L_T(\theta)}{\partial\theta^2}\Big)$$

is the Fisher information. Furthermore, under suitable regularity conditions, the MLE will asymptotically attain this bound. It is reasonable to ask how one can interpret this bound.

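(i) Situation 1. $I(\theta_0) = E\big(-\frac{\partial^2\mathcal L_T(\theta)}{\partial\theta^2}\rfloor_{\theta_0}\big)$ is large (hence the variance of the mle will be small). This means that the gradient of $\frac{\partial\mathcal L_T(\theta)}{\partial\theta}$ is steep. Hence even for small deviations from $\theta_0$, $\frac{\partial\mathcal L_T(\theta)}{\partial\theta}$ is likely to be far from zero. This means the mle $\hat\theta_T$ is likely to be in a close neighbourhood of $\theta_0$.

(ii) Situation 2. $I(\theta_0) = E\big(-\frac{\partial^2\mathcal L_T(\theta)}{\partial\theta^2}\rfloor_{\theta_0}\big)$ is small (hence the variance of the mle will be large). In this case the gradient of $\frac{\partial\mathcal L_T(\theta)}{\partial\theta}$ is flatter and hence $\frac{\partial\mathcal L_T(\theta)}{\partial\theta}\approx 0$ for a large neighbourhood about the true parameter $\theta_0$. Therefore the mle $\hat\theta_T$ can lie in a large neighbourhood of $\theta_0$.

This is one explanation as to why $I(\theta)$ is called the Fisher information. It contains information on how close any estimator of $\theta$ can be to the true parameter.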
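The two situations can be made concrete with a small numerical sketch. It assumes (purely for illustration, this model is not used at this point in the notes) iid $N(\mu,\sigma^2)$ observations with known $\sigma$, for which the Fisher information for $\mu$ is $T/\sigma^2$. The code compares how far the log-likelihood drops when we move a fixed distance `delta` away from the mle, once for a small $\sigma$ (large information) and once for a large $\sigma$ (small information).

```python
import random

def gaussian_loglik(mu, xs, sigma):
    """Log-likelihood of iid N(mu, sigma^2) data as a function of mu (terms without mu dropped)."""
    return -sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2)

rng = random.Random(4)
T, delta = 100, 0.5
drops = {}
for sigma in (1.0, 5.0):
    xs = [rng.gauss(0.0, sigma) for _ in range(T)]
    mu_hat = sum(xs) / T                  # mle of the mean
    info = T / sigma ** 2                 # Fisher information for mu
    # fall in the log-likelihood when moving delta away from the mle
    drop = gaussian_loglik(mu_hat, xs, sigma) - gaussian_loglik(mu_hat + delta, xs, sigma)
    drops[sigma] = (info, drop)
print(drops)
```

In this model the drop equals $I\delta^2/2$ exactly, so a large information gives a sharply peaked likelihood (Situation 1) and a small information gives a flat one (Situation 2).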


Chapter 5

Confidence Intervals

5.1

Confidence Intervals and testing

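We first summarise the results in the previous section (which will be useful in this section). For convenience, we will assume that the likelihood is for iid random variables, whose density is $f(x;\theta_0)$ (though it is relatively simple to see how this can be generalised to general likelihoods - of not necessarily iid rvs). Let us suppose that $\theta_0$ is the true parameter that we wish to estimate. Based on Theorem 4.1.2 we have

$$\sqrt{T}\big(\hat\theta_T-\theta_0\big)\overset{D}{\to}\mathcal N\Big(0,\Big[E\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\Big)^2\Big]^{-1}\Big), \qquad (5.1)$$

$$\frac{1}{\sqrt{T}}\frac{\partial\mathcal L_T}{\partial\theta}\rfloor_{\theta=\theta_0}\overset{D}{\to}\mathcal N\Big(0, E\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\Big)^2\Big) \qquad (5.2)$$

and

$$2\big(\mathcal L_T(\hat\theta_T) - \mathcal L_T(\theta_0)\big)\overset{D}{\to}\chi^2_p, \qquad (5.3)$$

where $p$ is the number of parameters in the vector $\theta$. Using any of (5.1), (5.2) and (5.3) we can construct a 95% CI for $\theta_0$.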

5.1.1 Constructing confidence intervals using the likelihood

See also Section 4.5, Davison (2002).

One of the main reasons that we show asymptotic normality of an estimator (it is usually not possible to derive normality for finite samples) is to construct confidence intervals (CIs) and to test.

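In the case that $\theta_0$ is a scalar (vector of dimension one), it is easy to use (5.1) to obtain

$$\sqrt{T}\Big[E\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\Big)^2\Big]^{1/2}\big(\hat\theta_T-\theta_0\big)\overset{D}{\to}\mathcal N(0,1). \qquad (5.4)$$

Based on the above the 95% CI for $\theta_0$ is

$$\Bigg[\hat\theta_T - \frac{z_{\alpha/2}}{\sqrt{T}\Big[E\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\Big)^2\Big]^{1/2}},\ \hat\theta_T + \frac{z_{\alpha/2}}{\sqrt{T}\Big[E\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\Big)^2\Big]^{1/2}}\Bigg]$$

with $\alpha = 0.05$.

The above, of course, requires an estimate of the (standardised) Fisher information $E\big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\big)^2 = E\big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\rfloor_{\theta_0}\big)$. Usually, we evaluate the second derivative of $\frac{1}{T}\log L_T(\theta) = \frac{1}{T}\mathcal L_T(\theta)$ and replace $\theta$ with the estimator of $\theta$, $\hat\theta_T$.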

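Exercise: Use (5.2) to construct a CI for $\theta_0$ based on the score.

The CI constructed above works well if $\theta$ is a scalar. But beyond dimension one, constructing a CI based on (5.1) (and the $p$-dimensional normal) is extremely difficult. More precisely, if $\theta_0$ is a $p$-dimensional vector then the analogous version of (5.4) is

$$\sqrt{T}\Big[E\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\Big)^2\Big]^{1/2}\big(\hat\theta_T-\theta_0\big)\overset{D}{\to}\mathcal N(0,I_p);$$

using this it is difficult to obtain the CI of $\theta_0$. One way to construct the CI is to ‘square’ $(\hat\theta_T-\theta_0)$ and use

$$\big(\hat\theta_T-\theta_0\big)'\,TE\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\Big)^2\big(\hat\theta_T-\theta_0\big)\overset{D}{\to}\chi^2_p. \qquad (5.5)$$

Based on the above a 95% CI is

$$\Big\{\theta;\ \big(\hat\theta_T-\theta\big)'\,TE\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\Big)^2\big(\hat\theta_T-\theta\big)\le\chi^2_p(0.95)\Big\}. \qquad (5.6)$$

Note that as in the scalar case, this leads to the interval with the smallest length. A disadvantage of (5.6) is that we have to (a) estimate the information matrix and (b) try to find all $\theta$ such that the above holds. This can be quite unwieldy. An alternative method, which is asymptotically equivalent to the above but removes the need to estimate the information matrix, is to use (5.3). By using (5.3), a $100(1-\alpha)\%$ CI for $\theta_0$ is

$$\Big\{\theta;\ 2\big(\mathcal L_T(\hat\theta_T) - \mathcal L_T(\theta)\big)\le\chi^2_p(1-\alpha)\Big\}. \qquad (5.7)$$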

Example 5.1.1 In the case that $\theta_0$ is a scalar the 95% CI based on (5.7) is

$$\Big\{\theta;\ \mathcal L_T(\theta)\ge\mathcal L_T(\hat\theta_T) - \frac{1}{2}\chi^2_1(0.95)\Big\}.$$

Both the 95% CIs in (5.6) and (5.7) will be very close for relatively large sample sizes. However one advantage of using (5.7) instead of (5.6) is that it is easier to evaluate - no need to obtain the second derivative of the likelihood etc.

Another feature which differentiates the CIs in (5.6) and (5.7) is that the CI based on (5.6) is symmetric about $\hat\theta_T$ (recall that $(\bar X - 1.96\sigma/\sqrt{T}, \bar X + 1.96\sigma/\sqrt{T})$ is symmetric about $\bar X$), whereas the symmetry condition may not hold for finite sample sizes when constructing a CI for $\theta_0$ using (5.7). This is a positive advantage of using (5.7) instead of (5.6). A disadvantage of using (5.7) instead of (5.6) is that sometimes the CI based on (5.7) may consist of more than one interval.

As you can see, if the dimension of $\theta$ is large it is quite difficult to evaluate the CI (try it for the simple case that the dimension is two!). Indeed for dimensions greater than three it is extremely hard. However in most cases, we are only interested in constructing CIs for certain parameters of interest; the other unknown parameters are simply nuisance parameters and CIs for them are not of interest. For example, for the normal distribution we may only be interested in CIs for the mean but not the variance.
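To compare the intervals based on (5.1) and (5.7) in the scalar case, here is a minimal sketch for iid exponential data with density $\theta^{-1}\exp(-x/\theta)$, so that $\mathcal L_T(\theta) = -T\log\theta - \sum_t X_t/\theta$ and the information of one observation is $1/\theta^2$. The sample size, the true value $\theta_0 = 2$, the search range and the helper names (`loglik`, `lr_stat`, `bisect`) are assumptions made for the illustration.

```python
import math
import random

def loglik(theta, T, s):
    """Exponential log-likelihood L_T(theta) = -T log(theta) - s / theta, with s = sum of X_t."""
    return -T * math.log(theta) - s / theta

def lr_stat(theta, theta_hat, T, s):
    """Log-likelihood ratio 2 (L_T(theta_hat) - L_T(theta))."""
    return 2 * (loglik(theta_hat, T, s) - loglik(theta, T, s))

def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    flo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == (flo > 0):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

rng = random.Random(5)
T = 30
x = [rng.expovariate(1 / 2.0) for _ in range(T)]  # true mean theta0 = 2 (assumed for the demo)
s = sum(x)
theta_hat = s / T
crit = 3.841                                       # 95% point of chi-squared(1)

# interval (5.7): all theta whose log-likelihood ratio is below the critical value
g = lambda th: lr_stat(th, theta_hat, T, s) - crit
lr_ci = (bisect(g, theta_hat / 20, theta_hat), bisect(g, theta_hat, theta_hat * 20))

# interval from (5.1)/(5.4), with the information 1/theta^2 estimated at theta_hat
half_width = 1.96 * theta_hat / math.sqrt(T)
wald_ci = (theta_hat - half_width, theta_hat + half_width)
print(theta_hat, lr_ci, wald_ci)
```

The interval from (5.4) is symmetric about $\hat\theta_T$ by construction, whereas the one from (5.7) extends further to the right, reflecting the skewness of the log-likelihood for a sample of this size.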

It is clear that directly using the log-likelihood ratio to construct CIs (and also test) will mean also constructing CIs for the nuisance parameters. Therefore below (in Section ??) we construct a variant of the likelihood (called the Profile likelihood), which allows us to deal with nuisance parameters in a more efficient way.

5.1.2 Testing using the likelihood

Let us suppose we wish to test the hypothesis $H_0:\theta=\theta_0$ against the alternative $H_A:\theta\ne\theta_0$. We can use any of the results in (5.1), (5.2) and (5.3) to do the test - they will lead to slightly different p-values, but ‘asymptotically’ they are all equivalent, because they are all based (essentially) on the same derivation.

We now list the three tests that one can use.

The Wald test

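The Wald statistic is based on (5.1). We recall from (5.1) that if the null is true, then we have

$$\sqrt{T}\big(\hat\theta_T-\theta_0\big)\overset{D}{\to}\mathcal N\Big(0,\Big[E\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\Big)^2\Big]^{-1}\Big).$$

Thus we can use as the test statistic

$$T_1 = \sqrt{T}\Big[E\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\Big)^2\Big]^{1/2}\big(\hat\theta_T-\theta_0\big)\overset{D}{\to}\mathcal N(0,1).$$

Let us now consider how the test statistic behaves under the alternative $H_A:\theta=\theta_1$. If the null is not true, then we have that

$$\sqrt{T}(\hat\theta_T-\theta_0) = \sqrt{T}(\hat\theta_T-\theta_1) + \sqrt{T}(\theta_1-\theta_0)\approx\Big[E\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_1}\Big)^2\Big]^{-1}\frac{1}{\sqrt{T}}\sum_{t}\frac{\partial\log f(X_t;\theta_1)}{\partial\theta_1} + \sqrt{T}(\theta_1-\theta_0).$$

Thus the distribution of the test statistic $T_1$ becomes centred about $\sqrt{T}\big[E\big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\big)^2\big]^{1/2}(\theta_1-\theta_0)$. Thus the larger the sample size the more likely we are to reject the null.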

Remark 5.1.1 (Types of alternatives) In the case that the alternative is fixed, it is clear that the power in the test goes to 100%. Therefore, often to see the effectiveness of the test, one lets the alternative get closer to the null as $T\to\infty$. For example:

• Suppose that $\theta_1 = \theta_0 + \frac{1}{T}$, then the centre of $T_1$ is $\frac{1}{\sqrt{T}}\big[E\big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\big)^2\big]^{1/2}\to 0$. Thus the alternative is too close to the null for us to discriminate between the two.

• Suppose that $\theta_1 = \theta_0 + \frac{1}{\sqrt{T}}$, then the centre of $T_1$ is $\big[E\big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\big)^2\big]^{1/2}$. Therefore, the test does have power, but it’s not 100%.

In the case that the dimension of $\theta$ is greater than one, we use the test statistic $\tilde T_1 = \big(\hat\theta_T-\theta_0\big)'\,TE\big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\big)^2\big(\hat\theta_T-\theta_0\big)$ instead of $T_1$, noting that the limiting distribution of $\tilde T_1$ is a chi-squared with $p$ degrees of freedom.

The Score test

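The score test is based on the score. We recall from (??), that under the null the distribution of the score is

$$\frac{1}{\sqrt{T}}\frac{\partial\mathcal L_T}{\partial\theta}\rfloor_{\theta=\theta_0}\overset{D}{\to}\mathcal N\Big(0, E\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\Big)^2\Big).$$

Thus we use as the test statistic

$$T_2 = \frac{1}{\sqrt{T}}\Big[E\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\rfloor_{\theta_0}\Big)^2\Big]^{-1/2}\frac{\partial\mathcal L_T}{\partial\theta}\rfloor_{\theta=\theta_0}\overset{D}{\to}\mathcal N(0,1).$$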

An advantage of this test is that the maximum likelihood estimator (under either the null or alternative) does not have to be calculated.


The log-likelihood ratio test

Probably one of the most popular tests is the log-likelihood ratio test. This test is based on (5.3), and the test statistic is

$$T_3 = 2\big(\mathcal L_T(\hat\theta_T) - \mathcal L_T(\theta_0)\big)\overset{D}{\to}\chi^2_p.$$

An advantage of this test statistic is that it is pivotal, in the sense that the Fisher information etc. does not have to be calculated, only the maximum likelihood estimator.

Exercise: What does the test statistic look like under the alternative?
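The three statistics are easy to compute side by side in a simple model. The sketch below is an illustration under assumptions made here (iid exponential data parametrised by the rate $\lambda$, density $\lambda\exp(-\lambda x)$, null value $\lambda_0=2$, $T=400$, data simulated under the null): there $\mathcal L_T(\lambda) = T\log\lambda - \lambda\sum_t X_t$, the mle is $\hat\lambda_T = 1/\bar X$ and $E(\partial\log f/\partial\lambda)^2 = 1/\lambda^2$.

```python
import math
import random

rng = random.Random(6)
lam0, T = 2.0, 400
x = [rng.expovariate(lam0) for _ in range(T)]   # data generated under H0: lam = lam0
s = sum(x)
lam_hat = T / s                                  # mle of the rate

# information of a single observation, evaluated at the null value
info0 = 1 / lam0 ** 2

wald = math.sqrt(T) * math.sqrt(info0) * (lam_hat - lam0)        # T_1, based on (5.1)
score = (T / lam0 - s) / (math.sqrt(T) * math.sqrt(info0))       # T_2, based on (5.2)
lr = 2 * (T * math.log(lam_hat / lam0) - (lam_hat - lam0) * s)   # T_3, based on (5.3)

# T_1^2, T_2^2 and T_3 are all compared with the chi-squared(1) distribution
print(wald ** 2, score ** 2, lr)
```

The three printed values differ slightly for a finite sample but should be of similar size; note that only $T_2$ can be computed without the mle.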

5.1.3 Applications of the log-likelihood ratio to the multinomial distribution

Example 5.1.2 (The multinomial distribution) This is a generalisation of the binomial distribution. In this case at any given trial there can arise $m$ different events (in the Binomial case $m=2$). Let $Z_i$ denote the outcome of the $i$th trial and assume $P(Z_i=k)=\pi_k$ ($\pi_1+\ldots+\pi_m=1$). Suppose there were $n$ trials conducted and let $Y_1$ denote the number of times event 1 arises, $Y_2$ denote the number of times event 2 arises and so on. Then it is straightforward to show that

$$P(Y_1=k_1,\ldots,Y_m=k_m) = \binom{n}{k_1,\ldots,k_m}\prod_{i=1}^{m}\pi_i^{k_i}.$$

If we do not impose any constraints on the probabilities $\{\pi_i\}$, given $\{Y_i\}_{i=1}^{m}$ it is straightforward to derive the mle of $\{\pi_i\}$ (it is very intuitive too!). Noting that $\pi_m = 1-\sum_{i=1}^{m-1}\pi_i$, the log-likelihood of the multinomial is proportional to

$$\mathcal L_T(\pi) = \sum_{i=1}^{m-1}y_i\log\pi_i + y_m\log\Big(1-\sum_{i=1}^{m-1}\pi_i\Big).$$

Differentiating the above with respect to $\pi_i$ and solving gives the mle $\hat\pi_i = Y_i/n$, which is what we would have expected! We observe that though there are $m$ probabilities to estimate, due to the constraint $\pi_m = 1-\sum_{i=1}^{m-1}\pi_i$ we only have to estimate $(m-1)$ probabilities. We mention that the same estimators can also be obtained by using Lagrange multipliers, that is maximising $\mathcal L_T(\pi)$ subject to the parameter constraint $\sum_{i=1}^{m}\pi_i = 1$. To enforce this constraint, we normally add an additional term to $\mathcal L_T(\pi)$ and include the dummy variable $\lambda$. That is we define the constrained likelihood

$$\tilde{\mathcal L}_T(\pi,\lambda) = \sum_{i=1}^{m}y_i\log\pi_i + \lambda\Big(\sum_{i=1}^{m}\pi_i-1\Big).$$

Now if we maximise $\tilde{\mathcal L}_T(\pi,\lambda)$ with respect to $\{\pi_i\}_{i=1}^{m}$ and $\lambda$ we will obtain the estimators $\hat\pi_i = Y_i/n$ (which is the same as the maximum of $\mathcal L_T(\pi)$).

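To derive the limiting distribution we note that the second derivative is

$$-\frac{\partial^2\mathcal L_T(\pi)}{\partial\pi_i\partial\pi_j} = \begin{cases}\dfrac{y_i}{\pi_i^2} + \dfrac{y_m}{(1-\sum_{r=1}^{m-1}\pi_r)^2} & i=j\\[2ex] \dfrac{y_m}{(1-\sum_{r=1}^{m-1}\pi_r)^2} & i\ne j\end{cases}$$

Hence taking expectations of the above the information matrix is the $(m-1)\times(m-1)$ matrix

$$I(\pi) = n\begin{pmatrix}\frac{1}{\pi_1}+\frac{1}{\pi_m} & \frac{1}{\pi_m} & \ldots & \frac{1}{\pi_m}\\ \frac{1}{\pi_m} & \frac{1}{\pi_2}+\frac{1}{\pi_m} & \ldots & \frac{1}{\pi_m}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{1}{\pi_m} & \frac{1}{\pi_m} & \ldots & \frac{1}{\pi_{m-1}}+\frac{1}{\pi_m}\end{pmatrix}.$$

Provided none of the $\pi_i$ is equal to either 0 or 1 (which would drop the dimension of $m$ and make $I(\pi)$ singular), the asymptotic distribution of the mle is normal with variance $I(\pi)^{-1}$.

Sometimes the probabilities $\{\pi_i\}$ will not be ‘free’ and will be determined by a parameter $\theta$ (where $\theta$ is an $r$-dimensional vector where $r<m$), ie. $\pi_i = \pi_i(\theta)$; in this case the likelihood of the multinomial is

$$\mathcal L_T(\pi) = \sum_{i=1}^{m-1}y_i\log\pi_i(\theta) + y_m\log\Big(1-\sum_{i=1}^{m-1}\pi_i(\theta)\Big).$$

Differentiating the above with respect to $\theta$ and solving will give the mle.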

Pearson’s goodness of Fit test

We now derive Pearson’s goodness of Fit test using the log-likelihood ratio, though Pearson did not use this method to derive his test.

Suppose the null is $H_0:\pi_1=\tilde\pi_1,\ldots,\pi_m=\tilde\pi_m$ (where $\{\tilde\pi_i\}$ are some pre-set probabilities) and $H_A$: the probabilities are not the given probabilities. Hence we are testing the restricted model (where we do not have to estimate anything) against the full model where we estimate the probabilities using $\hat\pi_i = Y_i/n$.

The log-likelihood ratio in this case is

$$W = 2\Big(\max_{\pi}\mathcal L_T(\pi) - \mathcal L_T(\tilde\pi)\Big).$$

Under the null we know that $W = 2\big(\max_{\pi}\mathcal L_T(\pi) - \mathcal L_T(\tilde\pi)\big)\overset{D}{\to}\chi^2_{m-1}$ (because we have to estimate $(m-1)$ parameters). We now show that the Pearson statistic is an approximation of this.

$$\frac{1}{2}W = \sum_{i=1}^{m-1}Y_i\log\Big(\frac{Y_i}{n}\Big) + Y_m\log\Big(\frac{Y_m}{n}\Big) - \sum_{i=1}^{m-1}Y_i\log\tilde\pi_i - Y_m\log\tilde\pi_m = \sum_{i=1}^{m}Y_i\log\Big(\frac{Y_i}{n\tilde\pi_i}\Big).$$

Recall that $Y_i$ is often called the observed, $Y_i = O_i$, and $n\tilde\pi_i$ the expected under the null, $E_i = n\tilde\pi_i$. Then $W = 2\sum_{i=1}^{m}O_i\log\big(\frac{O_i}{E_i}\big)\overset{D}{\to}\chi^2_{m-1}$. For $a$ close to $x$, making a Taylor expansion of $x\log(xa^{-1})$ about $x=a$ we have $x\log(xa^{-1})\approx a\log(aa^{-1}) + (x-a) + \frac{1}{2}(x-a)^2/a$. We let $O = x$ and $E = a$, then assuming the null is true and $E_i\approx O_i$ we have

$$W = 2\sum_{i=1}^{m}Y_i\log\Big(\frac{Y_i}{n\tilde\pi_i}\Big)\approx 2\sum_{i=1}^{m}\Big((O_i-E_i) + \frac{1}{2}\frac{(O_i-E_i)^2}{E_i}\Big).$$

Now we note that $\sum_{i=1}^{m}E_i = \sum_{i=1}^{m}O_i = n$, hence the above reduces to

$$W\approx\sum_{i=1}^{m}\frac{(O_i-E_i)^2}{E_i}\overset{D}{\to}\chi^2_{m-1}.$$

We recall that the above is the Pearson test statistic. Hence this is one method for deriving the Pearson chi-squared test for goodness of fit.

By using a similar argument, we can also obtain the test statistic of the chi-squared test for independence (and an explanation for the rather strange number of degrees of freedom!).
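The quality of the approximation can be seen on a small numerical example. The counts and the null probabilities below are hypothetical, chosen only for illustration; the code computes the log-likelihood ratio statistic $W = 2\sum_i O_i\log(O_i/E_i)$ and the Pearson statistic $\sum_i (O_i-E_i)^2/E_i$ side by side.

```python
import math

observed = [18, 22, 25, 35]              # hypothetical counts O_i
null_probs = [0.25, 0.25, 0.25, 0.25]    # hypothetical pre-set probabilities under H0
n = sum(observed)
expected = [n * p for p in null_probs]   # E_i = n * pi_i under the null

# log-likelihood ratio statistic; cells with O_i = 0 contribute zero to the sum
W = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Pearson statistic, the second-order Taylor approximation of W
pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(W, pearson)
```

Both statistics would be compared with the $\chi^2_{m-1}$ distribution, here with $m-1 = 3$ degrees of freedom; they are close because the observed counts are not far from the expected ones.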
