

Distributions of Maximum Likelihood Estimators

and Model Comparisons

Peter Hingley

Abstract—Experimental data need to be assessed for purposes of model identification, estimation of model parameters and consequences of misspecified model fits. Here the first and third factors are considered via analytic formulations for the distribution of the maximum likelihood estimates. When estimating this distribution with statistics, it is traditional to invert the roles of population quantities and quantities that have been estimated from the observed sample. If the model is known, simulations, normal approximations and p*-formula methods can be used. However, exact analytic methods for describing the estimator density are recommended. One of the methods (TED) can be used when the data generating model differs from the estimation model, which allows for the estimation of common parameters across a suite of candidate models. Information criteria such as AIC can be used to pick a winning model. AIC is, however, approximate and generally only asymptotically correct. For fairly simple models, where expressions remain tractable, the exact estimator density under TED allows for comparisons between models. This is illustrated via a novel information criterion. Three linear models are compared and fitted to econometric data on patent filings.

Keywords: AIC, likelihood, model comparison, patent filing, technique for estimator densities

1 Introduction

In the context of data analysis or data mining, a number of indicators are used for assessing data structure and for the comparison of explanatory models [7]. However, a danger is that such measures may be used without understanding the logic behind their construction. When confronted with data from an experiment that contain errors, some basic questions arise. These include, firstly, what (if any) model can be identified; secondly, how the parameters of a candidate model can best be estimated; and thirdly, what the consequences are for the parameter estimates if the chosen model is wrong.

There are practical considerations that can help to minimise these problems. For example, an experiment should be well designed in a statistical sense by including replication, and the sample should span a useful region of the parameter space. It may well be that the model structure is already known from previous experience and that prior information can somehow be incorporated into the analysis. An experiment can also be strictly designed in advance to discriminate between a set of prespecified models via a formal hypothesis testing procedure.

Peter Hingley is with the European Patent Office, Erhardtstrasse 27, D-80469 Munich, Germany. Email: phingley@epo.org.

But one may nevertheless encounter unexpected results. The data could be unique (e.g. in astronomy), expensive to reproduce (e.g. samples taken from a nuclear reactor), or broadly consistent with several emergent hypotheses (e.g. in social studies). A statistical toolbox is required to deal with these situations. In this paper, some developments will be described. However, this toolbox is not yet complete.

It will be assumed that data are to be analysed by using maximum likelihood estimation under a specified model. The approach will involve studying the distributions of maximum likelihood estimates in terms of their probability density functions (estimator densities). The spectrum of methods that are available for determining these densities will be reviewed. In particular, a method will be highlighted that determines the exact estimator density when the model that generates the data differs from the model that is used for estimation. Then there is a discussion of information criteria for comparing models on observed data, and of associated applications of the exact estimator density.

2 Definitions

Suppose that the members of a sample wᵢ, (i = 1, ..., n) correspond to a model in the form of a probability distribution g(wᵢ|θ), where θ is a (p×1) vector of estimable parameters. The space of w is W, and the space of θ is Θ. g(wᵢ|θ) will be considered to be continuous, though simpler equivalents exist for the results when g(wᵢ|θ) is discrete. In vector notation, write the sample as w (n×1), with likelihood given by the joint density g(w, θ). The MLE is the value θ̂ of θ that maximises g(w, θ).


Inference about the model is then made from the data indirectly, via the MLE or some other calculable statistic. The expectation of a function h of the data is

E[h(w)] = ∫_W h(w) g(w) dw

Useful quantities for inference are obtained from the log likelihood l(θ, w) = log(g(w, θ)). The first derivative wrt θ is termed the observed score l′(θ, w) (p×1), where ′ indicates differentiation wrt θ. The MLE then satisfies the normal equation (l′(θ, w)|θ=θ̂) = 0 (p×1). From the second derivative can be obtained a quantity called the observed information j(θ, w) (p×p) = −l′′(θ, w). These quantities are observed because they depend upon the realisation of a particular sample w. But expected equivalents are also useful, in particular the expected information matrix i(θ) (p×p) = E[j(θ, w)]. A vector of independent variables z can be introduced in the above expressions to represent covariates in the regression situation.

3 Estimator densities from a statistical model

Consider the probability density function of the MLE from a specified statistical model. Inferences about the parameters from sample data can be based on descriptors of this density. For example, the mean of the density gives an estimate of the parameter itself and the variance allows the construction of a confidence interval for the parameter. It is also relevant to describe the density when the data generating mechanism differs from the model that is assumed for estimation. This might happen in a regression setting, where the estimation model is one of a number of possible candidate models for a process. Special techniques can be applied when there is doubt about the form of the data generating model, including distribution free and robust approaches to estimation. But the use of a specific estimation model is usual when the data are presumed to be distributed in a certain way according to a scientific hypothesis. Nevertheless the modeller may accept that alternative models are possible.

In the remainder of this section and in Section 4, it will be assumed that the parameters for the formulae are already known before generating the estimator densities. This is of course not the case when confronted with a real set of data. Now the parameters must be estimated and presumptions made about their true values, so that estimator densities can be generated. This is usually done by a principle of inversion: assume that the parameter estimate is in fact the true value and generate an estimator density on this assumption. Typically, confidence intervals for the true parameter value can be generated around the parameter estimates on the basis of the estimator density. However, this approach becomes questionable if the density is highly asymmetric.

A simple example of an estimator density is that of the mean of a normally distributed variable w, distributed as N(θ0, σ²), with known variance σ². If the same model N(θ, σ²) is fitted with the mean θ as the estimable parameter, then the MLE θ̂ is distributed as N(θ0, σ²/n), where n is the sample size. This result is exact and can be easily proved (see Section 4.1). When the data are from a more intricate model, a distribution for the MLE θ̂ is given by a normal approximation as N(θ0, i(θ)⁻¹) [5]. However, this expression is generally only an approximation to the actual distribution g(θ̂), to O(n⁻¹) accuracy [4]. This means that it is only correct in the asymptotic limit as n → ∞, where almost everything that can be known about the population is present in the sample and θ̂ → θ0 due to consistency of the MLE.
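The exact result for the normal mean can be checked by simulation. The following sketch uses illustrative parameter values (not from the paper) to confirm that the sample mean, which is the MLE of θ under N(θ, σ²), follows the N(θ0, σ²/n) density.

```python
import numpy as np

# Monte Carlo check of the exact result quoted above: for samples of size n
# from N(theta0, sigma^2) with sigma^2 known, the MLE of the mean is
# distributed as N(theta0, sigma^2/n). Parameter values are illustrative.
rng = np.random.default_rng(0)
theta0, sigma, n, reps = 2.0, 1.5, 50, 200_000

samples = rng.normal(theta0, sigma, size=(reps, n))
mle = samples.mean(axis=1)        # the MLE of the mean is the sample average

print(mle.mean())                 # close to theta0 = 2.0
print(mle.var())                  # close to sigma**2 / n = 0.045
```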

Alternative techniques are available. One of these is the p*-formula [2] [9], which is based on ideas of the asymptotic sufficiency of the MLE and can be more accurate than the normal approximation. The p*-formula is exact for a range of common models, although it is in general only asymptotically correct.

g(θ̂|a) = c |j(θ̂; θ̂, a)|^(1/2) exp[l̄(θ0; θ̂, a)],

where c is a normalising constant and | | indicates absolute value. In this formula, the log likelihood is expressed not in terms of the sample vector w, but via (θ̂, a), where a is an ancillary statistic. This means a distribution-constant statistic which, together with the MLE, constitutes a sufficient statistic. Sufficient, in the classical sense, means that the conditional density for the data w does not depend on the true parameter value θ0 [14]. l̄ is the normed form of the likelihood, given by the difference between the log likelihood at θ0 and its value at the MLE θ̂: l̄(θ0; θ̂, a) = l(θ0; θ̂, a) − l(θ̂; θ̂, a). The p*-formula is generally again only an approximation to the actual distribution g(θ̂), to O(n⁻¹) accuracy. But it can be considerably better than the normal approximation, because it takes more of the model structure into account. It is exact for a number of commonly used distributions.

In case an exact distribution is required, the default option is to carry out a simulations based analysis [13]. But simulation is essentially a numerical technique that requires a lot of computation. Some progress has been made in the formulation of exact analytic expressions for estimator densities. The trade off against approximate analytic forms is that the computations become more difficult, though they are less extensive than with simulations and not at all impractical where prewritten routines can be made available. The user should ensure that the initial conditions are fulfilled for the method that is selected.


One exact expression for the estimator density was given by Skovgaard [15]:

g(θ̂) = E[|j(θ, w)| | l′(θ, w) = 0] · gs(0, θ0),   (1)

where gs(l′(θ, w), θ0) is the marginal density of the observed score.

Another formulation for g(θ̂) was proposed by Hillier and Armstrong [8]. This is an integral equation, which can also apply to more general statistics than the MLE. It also depends upon l′(θ, w) and j(θ, w).

g(θ̂) = ∫_{Wθ̂} |j(θ, w)| · |l′(θ, w) [l′(θ, w)]ᵀ|^(−1/2) g(w|θ0) (dWθ̂),   (2)

where (dWθ̂) denotes a volume element on the surface Wθ̂.

In both equations (1) and (2), only the data, the observed score and the observed information are involved. Thus it is not required that an analytic expression be available for θ̂ in terms of the underlying data, which is useful because the MLE can usually only be found by an iterative estimation routine. The formulae however differ from each other in that (2) eschews the use of an expectation term within the formula, rather leaving the whole formula as a kind of extended expectation. Both sets of authors go on to show that their exact techniques can be used to re-establish previously known approximate approaches.

For simple densities, approximate methods such as the p*-formula are at least asymptotically correct. They can also be used to make approximate inferences for small samples where the asymptotic results do not hold. But a trade-off should be applied, with exact techniques to be preferred unless they are too difficult to apply in a particular situation. Approximate techniques are often sufficiently close to exactness to be acceptable when the estimation model is equivalent to the data generating model. However, they are not usually appropriate when models differ, and neither expression (1) nor (2) can be used directly in this situation either. In the next section an alternative formula will be discussed that can operate when the data generating model differs from the estimation model.

4 An exact technique for estimator densities (TED)

In the following, statistical models will be specified in terms of the densities of data that are generated by them. Consider that g0(w) is the true density of w, and g1(w|θ) is a presumed density for estimation of θ. The log likelihood corresponding to g1(w|θ) is l(θ, w). g0 and g1 can also be the same, in which case g0(w) = g1(w|θ0).

When models differ, the estimate obtained by maximising l(θ, w) is technically a quasi maximum likelihood estimate (QMLE) [17]. Nevertheless it will continue to be termed MLE (θ̂) here and is given by the same expression as that for the MLE in the usual situation, (l′(θ, w)|θ=θ̂) = 0. The space of θ̂ is Θ̂, a subspace of Θ.

Consider a (p×1) vector T:

T(θ, θ*, w) = l′(θ*, w) − l′(θ, w),   (3)

where θ* is fixed at an arbitrary value. Under a simple set of regularity conditions, the exact density for θ̂ is given as follows [10]:

g(θ̂) = E_w[|j(θ, w)| | θ = θ̂] · g[T(θ̂, θ*=θ̂, w)](0),   (4)

where j(θ, w) = −l′′(θ, w) is the observed information under the estimation model g1(w|θ), and the second term represents the value of the density g[T(θ̂, θ*, w)](t) for which θ* = θ̂, and hence t = 0 by (3). This is to be derived as the density of a transform T(θ̂, θ*, w) of the data w on the data generating model g0(w). The term E_w[|j(θ, w)| | θ = θ̂] describes a conditional expectation, which is conditional on θ = θ̂ and is taken wrt w over g1(w|θ).

E_w[|j(θ, w)| | θ = θ̂] =
[ ∫_{W^c_θ(v)} |j(θ, w(v))| |θ=θ̂ · g1(w(v)|θ = θ̂) · ||w′(v)|| dv ]
/ [ ∫_{W^c_θ(v)} g1(w(v)|θ = θ̂) · ||w′(v)|| dv ]   (5)

Here, integration is carried out on a manifold W^c_θ(v), which runs over an (n − p) dimensional subset of W. The term ||w′(v)|| indicates the magnitude of the Jacobian from the co-ordinates v that index the manifold to w. In practice ||w′(v)|| dv is equivalent to the volume element dWθ in equation (2). However, equation (4) may be easier to use than equation (2), since the evaluation of expectation (5) involves taking conditional expectations over data sets, which can usually be done without evaluation of the integrals in the expression. This is because of the following plug-in principle. In practice, terms in w can be replaced by E_w[w|θ=θ̂], terms in w² by E_w[w²|θ=θ̂], etc. For example, if the model g1(w|θ) were normal N(r[θ1], θ2), terms proportional to w would be replaced by terms proportional to r[θ̂1], and terms proportional to w² would be replaced by terms proportional to θ̂2 + [r(θ̂1)]². Equation (4) reduces to equation (1) when g0(w) = g1(w|θ0).
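The plug-in replacements above can be checked numerically. In this sketch, the mean function r and the plugged-in values θ̂1 and θ̂2 are arbitrary illustrative choices: the empirical moments of draws from N(r[θ̂1], θ̂2) match the plug-in values r[θ̂1] and θ̂2 + [r(θ̂1)]².

```python
import numpy as np

# Numeric check of the plug-in principle for a normal estimation model
# N(r[theta1], theta2): conditional moments of w are replaced by
# E[w] = r(theta1_hat) and E[w^2] = theta2_hat + r(theta1_hat)**2.
# The mean function r and the plugged-in values are illustrative assumptions.
rng = np.random.default_rng(1)
r = np.exp                            # an arbitrary mean function r[theta1]
theta1_hat, theta2_hat = 0.7, 2.3

w = rng.normal(r(theta1_hat), np.sqrt(theta2_hat), size=1_000_000)
print(w.mean())                       # close to r(theta1_hat)
print((w**2).mean())                  # close to theta2_hat + r(theta1_hat)**2
```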

4.1 An example of the mean of a sample from a normal distribution


TED for determining the pdf of the MLE will now be shown for the simple case of estimating the mean µ of a sample from a normal distribution N(µ0, σ²), with the variance σ² assumed to be known. The argument is given to illustrate how the technique can work in more complex cases, and there are certainly easier ways to establish this particular result, for example via the moment generating function [6].

The data generating and estimation models are written as follows, in vector form, based on the analytic expression for a normal distribution.

g0(w) = exp[−n log(√(2πσ²)) − (1/(2σ²))(wᵀw − 2µ0 wᵀ1 + µ0 1ᵀ1 µ0)]   (6)

g1(w|θ) = exp[−n log(√(2πσ²)) − (1/(2σ²))(wᵀw − 2µ wᵀ1 + µ 1ᵀ1 µ)],   (7)

where ᵀ indicates transposition and 1 is a (n×1) vector of 1s.

Here the parameter vector θ is just a scalar µ because p = 1. The log-likelihood is the logarithm of (7), and its derivative wrt µ is

l′(θ, w) = (1/σ²)[1ᵀ(w − µ1)].

By (3),

T(θ, θ*, w) = (1/σ²)[1ᵀ(w − µ*1)] − (1/σ²)[1ᵀ(w − µ1)].

When l′(θ, w)|θ=θ̂ = 0, µ = µ̂ = wᵀ1/n. Therefore

T(θ̂, θ*, w) = (1/σ²)[1ᵀ(w − µ*1)].   (8)

The observed information is j(θ, w) = −l′′(θ, w) = n/σ². This is a constant here, so the conditional expectation (5) is also n/σ².

In order to develop the density g[T(θ̂, θ*, w)](t), recall that this is derived as the density of T(θ̂, θ*, w) under the data generating model g0(w). In this case, g0(w) is given by equation (6). Equation (8) indicates that T(θ̂, θ*, w) is a linear transform of w, and the use of a Jacobian shows that g[T(θ̂, θ*, w)](t) is N(nµ0/σ² − nµ*/σ², n/σ²). Therefore

g[T(θ̂, θ*=θ̂, w)](0) = (1/√(2π·n/σ²)) · exp[−(1/2)(σ²/n)·(0 − (nµ0/σ² − nµ̂/σ²))²]
                     = (1/√(2π·n/σ²)) · exp[−(n/(2σ²))·(µ̂ − µ0)²]   (9)

Applying (4), g(θ̂) is obtained as the product of n/σ² and (9), which is N(µ0, σ²/n), as required.

The step of calculating g[T(θ̂, θ*, w)](t) sets the limit of tractability for TED. For curved exponential families in general, T(θ̂, θ*, w) is a linear function of w, which facilitates the required calculations.

4.2 Example of the mean of a sample from a negative exponential data generating model, using a normal distribution estimation model

This is an example where the estimation model is not equivalent to the data generating model. The data generating model is given as follows.

g0(w) = (1/µ0ⁿ) · exp[−(1/µ0) wᵀ1],  wᵢ > 0, i = 1, ..., n   (10)

The estimation model remains normal, as in equation (7). T(θ̂, θ*, w) is still given by (8), and E_w[|j(θ, w)| | θ = θ̂] is still n/σ². Use can be made of the fact that the sum of n i.i.d. negative exponential variables has a gamma distribution with shape parameter n and scale parameter µ0 [12]:

g(s) = s^(n−1) exp[−s/µ0] / (µ0ⁿ (n−1)!),  s = wᵀ1 > 0   (11)

Equation (8) shows that T(θ̂, θ*, w) is a linear transform of s, with s = σ²T + nµ*. Therefore, transformation of equation (11) gives

g[T(θ̂, θ*, w)](t) = σ² (σ²t + nµ*)^(n−1) exp[−(σ²t + nµ*)/µ0] / (µ0ⁿ (n−1)!),  σ²t + nµ* > 0   (12)

The density g(µ̂) is obtained by setting t = 0, µ* = µ̂ in equation (12) and multiplying by E_w[|j(θ, w)| | θ = θ̂] = n/σ²:

g(µ̂) = nⁿ µ̂^(n−1) exp[−nµ̂/µ0] / (µ0ⁿ (n−1)!),  µ̂ > 0

This is a gamma distribution with shape parameter n, scale parameter µ0/n and mean µ0.

In this case, the parameter estimate µ̂ is necessarily positive. The distribution of µ̂ is independent of σ², which happens with this pair of models and is related to the fact that the negative exponential has only one fixable parameter while the normal distribution has two. Although the example is simplistic, it shows that it can be possible to pick dominant estimation models wrt the parameter of interest.
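The gamma result can be verified by simulation. The sketch below (with illustrative values for µ0 and n) draws negative exponential samples, forms µ̂ as the sample mean, and compares its first two moments with those of a gamma distribution with shape n and scale µ0/n.

```python
import numpy as np

# Monte Carlo check of the TED result above: with negative exponential data of
# mean mu0 and a normal estimation model, mu_hat = (w^T 1)/n has a gamma
# density with shape n and scale mu0/n, so E[mu_hat] = mu0 and
# Var[mu_hat] = mu0**2/n. Parameter values are illustrative.
rng = np.random.default_rng(2)
mu0, n, reps = 3.0, 20, 200_000

w = rng.exponential(mu0, size=(reps, n))
mu_hat = w.mean(axis=1)

print(mu_hat.mean())              # close to mu0 = 3.0
print(mu_hat.var())               # close to mu0**2 / n = 0.45

# draws from the claimed gamma density have matching moments
gam = rng.gamma(shape=n, scale=mu0 / n, size=reps)
print(gam.mean(), gam.var())
```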

5 Methods for comparing models

While some analysts are happy to assess the robustness of estimation of common parameters throughout a set of candidate models, others want to discriminate between models in order to identify the best one. Akaike [1] suggested an information criterion that can be used for this:

AIC = −2 l(θ̂, w) + 2p.

The best model is considered to be the member of the candidate set that minimises AIC. In practice, the first term in the expression measures the goodness of fit of the model to the data, while the second term is a penalty based on the number of parameters that have to be fitted. The latter term is a guard against overfitting.
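As a concrete illustration of AIC = −2 l(θ̂, w) + 2p, the sketch below fits two nested Gaussian linear models by maximum likelihood on simulated data and computes AIC for each; the data, designs and the helper `aic_gaussian` are illustrative, not from the paper.

```python
import numpy as np

# AIC = -2*loglik(theta_hat) + 2p for Gaussian linear models, with both the
# regression coefficients and sigma^2 estimated by maximum likelihood.
# Data and candidate models are simulated for illustration only.
rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)     # data truly linear in x

def aic_gaussian(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2_hat = resid @ resid / len(y)       # MLE of the error variance
    p = X.shape[1] + 1                        # coefficients plus sigma^2
    loglik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2_hat) + 1)
    return -2 * loglik + 2 * p

X1 = np.column_stack([np.ones(n), x])         # intercept + slope
X2 = np.column_stack([np.ones(n), x, x**2])   # adds an unneeded quadratic term
print(aic_gaussian(X1, y), aic_gaussian(X2, y))  # smaller AIC is preferred
```

Since the models are nested, the larger model's maximised log likelihood is at least as large, so its AIC can exceed the smaller model's by at most 2 per extra parameter; the penalty term is what guards against overfitting.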

The theoretical underpinning of this approach is that AIC/2 can be shown to be asymptotically equivalent to a constant minus the expected value of the Kullback-Leibler information I(g0, g1):

I(g0, g1) = log[g0(w) / g1(w|θ)]   (13)

E[I(g0, g1)] = ∫_W log[g0(w) / g1(w|θ)] g0(w) dw   (14)

For models with the same number of parameters, maximisation of E[I] at the MLE is asymptotically equivalent to minimisation of AIC. I(g0, g1) itself is equivalent to the difference between the (Boltzmann) entropies of the models under consideration, log(g0(w)) and log(g1(w|θ)) respectively.

In the case of a fixed data generating model g0(w), it can be shown that the QMLE θ̂ for a particular estimation model g1(w|θ) minimises I(g0, g1) [17].

AIC provides a benchmark for comparing the adequacies of competing models on real data sets. In case the data generating model g0(w) is unknown, it can be argued that the true model need not be a member of the considered set of models [3]. However, AIC suffers to some extent from the same problem that was alluded to earlier for approximate forms of the estimator density, in that it is absolutely valid only in the asymptotic limit, and it is also subject to problems of inversion. The first problem can be tackled by attempting a more exact approach, while the second problem is inescapable unless one is prepared to carry out an encompassing analysis on theoretical data sets.

Several other related information criteria for comparing models have been proposed as improvements to AIC. One of these is TIC [16]:

TIC = −2 l(θ̂, w) + 2 trace(K(θ̂) (p×p) · [j(θ̂, w)]⁻¹),

where K(θ̂) = E_{g0}[l′(θ, w) l′(θ, w)ᵀ | θ = θ̂].

TIC/2 is a closer approximation than AIC/2 to a constant minus E[I]. For large samples, as n → ∞, TIC and AIC converge, since trace(K(θ̂) · [j(θ̂, w)]⁻¹) → p.

Both AIC and TIC provide rational ways to compare nested or non-nested models of various degrees of complexity, with a balance between the log likelihood and the number of parameters as a guard against overfitting. Otherwise the common practice is to compare models with different numbers of parameters via an analysis of variance based F test, using the sums of squares associated with the additional parameters. This is however only strictly valid for nested linear models. So the use of information criteria that are based on K-L information is attractive.

The TED estimator density (4) can be used for comparing models. One possible way to use the technique is via a robustness index (RI), when the parameters are common in the two models. Hingley [10] explains RI and gives an example of the comparison for a gamma data generating model and a negative exponential estimation model.

But TED can also be used directly to construct new information criteria. For example, since it gives exact densities, it can make an indicator from K-L information without requiring the use of either expectations or asymptotic approximation. However, inversion is still required. A particularly simple form is obtained when comparing the effect of two alternative data generating models under a common estimation model, because then the term E_w[|j(θ, w)| | θ = θ̂] in (5) is the same for both models and cancels out when taking the ratio.

It is possible to calculate two densities for θ̂. For both densities, assume that the estimation model is g1(w|θ). For the first density, the data generating model is g1(w|θ1), and the density is written as g(θ̂) = mA(θ̂|θ1). For the second density, the data generating model is g0(w|θ0), and the density is written as g(θ̂) = mB(θ̂|θ0).

An information criterion can be constructed as follows.

H(mA, mB) = log[mA(θ̂|θ0) / mB(θ̂|θ1)] = log[ gA[T(θ̂, θ*=θ̂, w)](0) / gB[T(θ̂, θ*=θ̂, w)](0) ]   (15)

This takes a fairly simple form, and it is suggested that θ0 should be set to θ̂ for the data under g0(w|θ) as the estimation model. For θ1, it is possible to use the estimate θ̂ from g1(w|θ) as the estimation model. For the particular example that will be given below, it will turn out to be unnecessary to estimate θ1.

As a demonstration, consider a pair of nested linear models:

g1(w|θ1) = MN_w(X1 (n×q) θ1 (q×1), σ² I (n×n))

g0(w|θ0) = MN_w(X0 (n×p) θ0 (p×1), σ² I (n×n)),

where the error variance term σ² is assumed to be known. q and p can be unequal, and it is not prespecified which model is nested in the other. Pairwise comparisons are envisaged, but a larger set of candidate models can be used, with comparisons between all pairs of models in the set.

Under mA, the data generating model and the estimation model are the same, and application of the TED formula gives g(θ̂) as MN_W(θ1, (X1ᵀX1)⁻¹σ²). However, under mB, g(θ̂) is MN_W((X1ᵀX1)⁻¹X1ᵀX0θ0, (X1ᵀX1)⁻¹σ²). Then the information criterion (15) is as follows.

H(mA, mB) = −(1/2) log|X1ᵀX1| + (1/2) log|X0ᵀX0|
  + (1/(2σ²)) [θ̂ − (X1ᵀX1)⁻¹X1ᵀX0θ0]ᵀ (X1ᵀX1) [θ̂ − (X1ᵀX1)⁻¹X1ᵀX0θ0]
  − (1/(2σ²)) (θ̂ − θ1)ᵀ (X1ᵀX1) (θ̂ − θ1)

Now, θ1 can be estimated from the data set under g1(w|θ), and θ0 can be estimated from the same data set under g0(w|θ). These estimates are substituted into the above expression, using the principle of invariance, to make an estimated indicator Ĥ. This simplifies matters by causing the final term to disappear. Hence, as was stated above, in this case θ1 does not actually need to be estimated at all.

Ĥ(mA, mB) = −(1/2) log|X1ᵀX1| + (1/2) log|X0ᵀX0|
  + (1/(2σ²)) [θ̂ − (X1ᵀX1)⁻¹X1ᵀX0θ̂0]ᵀ (X1ᵀX1) [θ̂ − (X1ᵀX1)⁻¹X1ᵀX0θ̂0]
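The estimated indicator Ĥ above can be sketched in code. Here `h_hat` is our own hypothetical helper implementing the expression term by term, with simulated data and illustrative design matrices; σ² is treated as known, as in the text.

```python
import numpy as np

# Sketch of H_hat(mA, mB) for Gaussian linear models with designs X1
# (estimation model) and X0 (alternative generating model), known sigma^2.
def h_hat(X1, X0, y, sigma2):
    theta1 = np.linalg.lstsq(X1, y, rcond=None)[0]   # theta_hat under g1
    theta0 = np.linalg.lstsq(X0, y, rcond=None)[0]   # theta_hat_0 under g0
    G1 = X1.T @ X1
    d = theta1 - np.linalg.solve(G1, X1.T @ X0 @ theta0)
    ld1 = np.linalg.slogdet(G1)[1]                   # log|X1'X1|
    ld0 = np.linalg.slogdet(X0.T @ X0)[1]            # log|X0'X0|
    return -0.5 * ld1 + 0.5 * ld0 + (d @ G1 @ d) / (2 * sigma2)

rng = np.random.default_rng(4)
n = 60
x = rng.uniform(0, 1, n)
y = 0.5 + 1.5 * x + rng.normal(0, 0.3, n)        # illustrative data
X1 = np.column_stack([np.ones(n), x])            # straight-line model
X0 = np.column_stack([np.ones(n), x, x**2])      # quadratic alternative
print(h_hat(X1, X0, y, 0.09))
print(h_hat(X1, X1, y, 0.09))   # identical formulations give 0 (up to rounding)
```

The second call illustrates the property noted below: when the two formulations coincide, the quadratic term vanishes and the log-determinant terms cancel, so Ĥ = 0.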

As an example of a particular application, consider three competing linear models for data on the development of numbers of patent filings per year at the European Patent Office from Germany (DE), Japan (JP) and the USA (US). The models are based on an econometric specification that is used to forecast future patent filing levels [11].

Model i: Y(t,j) = a + b·Y(t−1,j) + c·X(t−5,j) + error(t,j)

Model ii: Y(t,j) = a_j + b·Y(t−1,j) + c·X(t−5,j) + error(t,j)

Model iii: Y(t,j) = a_j + b·Y(t−1,j) + c_j·X(t−5,j) + error(t,j),

where j = 1, 2, 3 indicates DE, JP, US and t = 1980, ..., 2005. The main variables are first standardised between countries by means of the following transformations. Y(t,j) is the number of patent filings from country j in year t, divided by the number of workers in country j in year t. X(t,j) is the discounted research and development expenditure in country j in year t, again divided by the number of workers in country j in year t. Model i assumes common intercepts and regression parameters for the three countries. Model ii allows separate intercepts a_j per country. Model iii further allows separate slopes c_j per country for the standardised research and development expenditure variable. See Figure 1.

The information criteria are calculated pairwise between all combinations of Models i, ii and iii. The fixed value of σ² is estimated from the fit of the data generating model in each case. From the above expression for Ĥ(mA, mB), it can be seen that when the formulations of g0(w|θ0) and g1(w|θ1) are the same, Ĥ(mA, mB) = 0.

                     Model i    Model ii   Model iii
                     estimate   estimate   estimate
Model i generate     0          1.30       3.54
Model ii generate    1.64       0          2.40
Model iii generate   1.25       -0.40      0

Table 1. Values of the information criterion Ĥ(mA, mB) for pairwise combinations of Models i, ii and iii as data generating model and estimation model.

Table 1 shows the 3×3 representation of Ĥ that was found. Minimisation of Ĥ is appropriate. In the case of data generated by Model iii, the criteria suggest that estimation by Model ii is slightly better than estimation by Model iii.

                  SS        DF   Mean SS   Calculated F   P
Model i           35.5066    3
Diff. i to ii      0.0658    2   0.0329    3.2            <0.05
Model ii          35.5724    5
Diff. ii to iii    0.0412    2   0.0206    2.0            >0.05
Model iii         35.6136    7
Resid. to iii      0.7199   70   0.0103
Total             36.3335   77

Table 2. Analysis of variance table to compare the significance of differences ('Diff.') between Model i (nested in) Model ii (nested in) Model iii. SS is Sums of Squares. DF is degrees of freedom. F tests are of Mean SS against the Residual ('Resid.') to Model iii. P values are against null hypotheses that the difference terms are not significant.
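The F values in Table 2 can be reproduced from its sums of squares and degrees of freedom: each difference mean square is divided by the residual mean square for Model iii.

```python
# Recomputing the F statistics in Table 2 from the tabulated sums of
# squares (SS) and degrees of freedom (DF).
ss_diff_i_ii, df_diff_i_ii = 0.0658, 2        # Model i -> Model ii
ss_diff_ii_iii, df_diff_ii_iii = 0.0412, 2    # Model ii -> Model iii
ss_resid, df_resid = 0.7199, 70               # residual to Model iii

ms_resid = ss_resid / df_resid                # ~ 0.0103
f_i_ii = (ss_diff_i_ii / df_diff_i_ii) / ms_resid
f_ii_iii = (ss_diff_ii_iii / df_diff_ii_iii) / ms_resid
print(round(f_i_ii, 1), round(f_ii_iii, 1))   # 3.2 2.0, as in Table 2
```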

Figure 1. Transformed patent filings at the European Patent Office: Y(t,j) vs. year (t) for filings from the three countries (j = Germany, Japan, USA), 1980-2005. Fitted values (lines) by Model iii.

The F tests indicate a significant improvement from Model i to Model ii, but not from fitting Model iii instead of Model ii. This is consistent with Table 1, in that it suggests that Model ii is adequate for the data.

The particular information criterion that has been suggested here could be extended easily to other types of models, particularly to nonlinear regression models. Other designs for information criteria can also be imagined.

6 Conclusion

Analysts should consider using exact densities for maximum likelihood estimates whenever it is practical to do so. The densities are particularly appropriate where alternative models are to be compared and assessed. The example in the previous section used exact densities to construct a trial information criterion. It may also be profitable to use exact methods to complement the range of existing approximate techniques with exact analytic equivalents. Although the exact techniques require more complex calculations, it can turn out that statistics can be created that have a simpler direct interpretation than the approximate ones. Further work is needed on these issues.

References

[1] Akaike, H., "Information theory as an extension of the maximum likelihood principle", in Petrov, B., Csaki, F. (eds), Second International Symposium on Information Theory, Akademiai Kiado, 1973.

[2] Barndorff-Nielsen, O.E., Cox, D.R., Inference and asymptotics, Chapman and Hall, 1994.

[3] Burnham, K.P., Anderson, D.R., Model selection and multimodel inference, Second Edition, Springer, 2002.

[4] Cox, D.R., Hinkley, D.V., Theoretical statistics, Chapman and Hall, 1974.

[5] Davison, A.C., Statistical models, p. 118, Cambridge, 2003.

[6] DeGroot, M., Probability and statistics, Second edition, p. 271, Addison-Wesley, 1986.

[7] Giudici, P., Applied data mining, Wiley, 2003.

[8] Hillier, G., Armstrong, M., "The density of the maximum likelihood estimator", Econometrica, V67, pp. 1459-1470, 1999.

[9] Hingley, P.J., "p*-formula", in Kotz, S., Read, C., Banks, D. (eds), Encyclopedia of statistical sciences, Update volume 3, Wiley, 1999.

[10] Hingley, P.J., "Analytic estimator densities for common parameters under misspecified models", Statistics for Industry and Technology, pp. 119-130, 2004.

[11] Park, W.G., "International patenting at the European Patent Office: aggregate, sectoral and family filings", in Hingley, P., Nicolas, M. (eds), Forecasting innovations, Springer, 2006.

[12] Johnson, N.L., Kotz, S., Balakrishnan, N., Continuous univariate distributions, Volume 1, p. 340, Wiley, 1994.

[13] Morgan, B.J.T., Elements of simulation, Chapman and Hall, 1984.

[14] Schervish, M.J., Theory of statistics, p. 85, Springer, 1995.

[15] Skovgaard, I.M., "On the density of minimum contrast estimators", Annals of Statistics, V18, pp. 779-789, 1990.

[16] Takeuchi, K., "Distribution of informational statistics and a criterion of model fitting", Suri-Kagaku, V153, pp. 12-18, 1976.

