Further aspects of maximum likelihood
7.6 Modified likelihoods
7.6.6 Pseudo-likelihood
We now consider a method for inference in dependent systems in which the dependence between observations is either only incompletely specified or specified only implicitly rather than explicitly, making specification of a complete likelihood function either very difficult or impossible. Many of the most important applications are to spatial processes, but here we give simpler examples, where the dependence is specified one-dimensionally, as in time series.
In any such fully specified model we can write the likelihood in the form
$$ f_{Y_1}(y_1;\theta)\, f_{Y_2\mid Y_1}(y_2, y_1;\theta) \cdots f_{Y_n\mid Y^{(n-1)}}(y_n, y^{(n-1)};\theta), \tag{7.57} $$
where in general $y^{(k)} = (y_1, \ldots, y_k)$. Now in a Markov process the dependence on $y^{(k)}$ is restricted to dependence on $y_k$, and more generally for an $m$-dependent Markov process the dependence is restricted to the last $m$ components.
Suppose, however, that we consider the function
$$ \prod_k f_{Y_k\mid Y_{k-1}}(y_k, y_{k-1};\theta), \tag{7.58} $$
where the one-step conditional densities are correctly specified in terms of an interpretable parameter $\theta$ but where higher-order dependencies are not assumed absent but are ignored. Such a function, which takes no account of certain dependencies, is called a pseudo-likelihood.
If $U_k(\theta)$ denotes the gradient or pseudo-score vector obtained from $Y_k$ then, because each term in the pseudo-likelihood is a probability density normalized to integrate to 1,
$$ E\{U_k(\theta);\theta\} = 0, \tag{7.59} $$
and the covariance matrix of a single vector $U_k$ can be found from the appropriate information matrix. In general, however, the $U_k$ are not independent and so the covariance matrix of the total pseudo-score is not $i(\theta)$, the sum of the separate information matrices. Provided that the dependence is sufficiently weak for standard $n$ asymptotics to apply, the pseudo-maximum likelihood estimate $\tilde\theta$ found from (7.58) is asymptotically normal with mean $\theta$ and covariance matrix
$$ i^{-1}(\theta)\,\mathrm{cov}(U)\,i^{-1}(\theta). \tag{7.60} $$
Here $U = \sum_k U_k$ is the total pseudo-score and its covariance has to be found, for example by applying somewhat empirical time series methods to the sequence $\{U_k\}$.
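As a rough illustration of (7.60) in the scalar case, the following Python sketch estimates cov(U) from the sequence of pseudo-score contributions by a weighted sum of empirical autocovariances and then forms the sandwich. The Bartlett (triangular) weights, the truncation lag `max_lag` and the function names are assumptions made for this sketch; the text asks only for somewhat empirical time series methods and does not prescribe a particular one.

```python
def long_run_variance(u, max_lag):
    """Estimate var(sum of u) for a stationary scalar sequence u.

    Bartlett weights on the empirical autocovariances are an assumption
    of this sketch, not a method prescribed by the text.
    """
    n = len(u)
    mean = sum(u) / n
    d = [x - mean for x in u]

    def autocov(h):
        # Empirical autocovariance at lag h, with divisor n.
        return sum(d[t] * d[t - h] for t in range(h, n)) / n

    total = autocov(0)
    for h in range(1, max_lag + 1):
        weight = 1.0 - h / (max_lag + 1)
        total += 2.0 * weight * autocov(h)
    return n * total


def sandwich_variance(u, info, max_lag):
    """Scalar version of i^{-1} cov(U) i^{-1} in (7.60).

    `info` is the sum of the separate information contributions.
    """
    return long_run_variance(u, max_lag) / info ** 2
```

With `max_lag = 0` the estimate of cov(U) reduces to the sum of squared centred contributions, which is what independence of the contributions would give; larger values let the serial dependence enter.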
Example 7.14. Lag one correlation of a stationary Gaussian time series. Suppose that the observed random variables are assumed to form a stationary Gaussian time series of unknown mean $\mu$, variance $\sigma^2$ and lag one correlation $\rho$, and otherwise to have unspecified correlation structure. Then $Y_k$ given $Y_{k-1} = y_{k-1}$ has a normal distribution with mean $\mu + \rho(y_{k-1} - \mu)$ and variance $\sigma^2(1 - \rho^2)$; this last variance, the innovation variance in a first-order autoregressive process, may be taken as a new parameter. Note that if a term from $Y_1$ is included it will have the marginal distribution of the process, assuming observation starts in a state of statistical equilibrium. For inference about $\rho$ the most relevant quantity is the adjusted score $U_{\rho\cdot\mu,\sigma}$ of (6.62) and its variance.
Some calculation shows that the relevant quantity is the variance of
$$ \sum_k \{(Y_k - \mu)(Y_{k-1} - \mu) - \rho\,(Y_k - \mu)^2\}. \tag{7.61} $$
Here $\mu$ can be replaced by an estimate, for example the overall mean, and the variance can be obtained by treating the individual terms in the last expression as a stationary time series and finding its empirical autocorrelation function.
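The calculation described in this example might be sketched in Python as follows. The sketch replaces the mean by the overall average, solves the pseudo-score equation for the lag one correlation, and returns the individual terms together with their empirical autocorrelation function. Weighting the squared deviation by the estimated correlation, so that the terms sum to zero at the estimate, is an assumption of this sketch, as are the function names.

```python
def lag_one_terms(y):
    """Return the lag one estimate and the per-observation terms of (7.61)."""
    n = len(y)
    mu = sum(y) / n  # overall mean used in place of the unknown mean
    d = [v - mu for v in y]
    num = sum(d[k] * d[k - 1] for k in range(1, n))
    den = sum(d[k] ** 2 for k in range(1, n))
    rho = num / den  # makes the terms below sum to zero
    terms = [d[k] * d[k - 1] - rho * d[k] ** 2 for k in range(1, n)]
    return rho, terms


def empirical_autocorrelation(x, max_lag):
    """Empirical autocorrelations of x at lags 0, 1, ..., max_lag."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x) / n
    return [
        sum((x[t] - m) * (x[t - h] - m) for t in range(h, n)) / (n * c0)
        for h in range(max_lag + 1)
    ]
```

The autocorrelations of `terms`, scaled by their variance, supply the ingredients for the variance of the sum; how many lags to retain is a judgement that the text leaves open.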
Example 7.15. A long binary sequence. Suppose that $Y_1, \ldots, Y_m$ is a sequence of binary random variables all of which are mutually dependent. Suppose that the dependencies are represented by a latent multivariate normal distribution. More precisely, there is an unobserved random vector $W_1, \ldots, W_m$ having a multivariate normal distribution of zero means and unit variances and correlation matrix $P$, and there are threshold levels $\alpha_1, \ldots, \alpha_m$ such that $Y_k = 1$ if and only if $W_k > \alpha_k$. The unknown parameters of interest are typically those determining $P$ and possibly also the $\alpha_k$. The data may consist of one long realization or of several independent replicate sequences.
The probability of any sequence of 0s and 1s, and hence the likelihood, can then be expressed in terms of the $m$-dimensional multivariate normal distribution function. Except for very small values of $m$ there are two difficulties with this likelihood. One is computational in that high-dimensional integrals have to be evaluated. The other is that one might well regard the latent normal model as a plausible representation of low-order dependencies but be unwilling to base high-order dependencies on it.
This suggests consideration of the pseudo-likelihood
$$ \prod_{k>l} P(Y_k = y_k, Y_l = y_l), \tag{7.62} $$
which can be expressed in terms of the bivariate normal integral; algorithms for computing this are available.
The simplest special case is fully symmetric, having $\alpha_k = \alpha$ and all correlations equal, say to $\rho$. In the further special case $\alpha = 0$ of median dichotomy the probabilities can be found explicitly because, for example, by Sheppard's formula,
$$ P(Y_k = Y_l = 0) = \frac{1}{4} + \frac{\sin^{-1}\rho}{2\pi}. \tag{7.63} $$
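For the median dichotomy, Sheppard's formula makes the pairwise pseudo-likelihood explicit enough to maximize by hand. By symmetry the two concordant outcomes have equal probability, so the probability that a pair agrees is $1/2 + \sin^{-1}\rho/\pi$, and equating this to the observed proportion of agreeing pairs gives the estimate. The Python sketch below does this for a list of replicate sequences; it assumes the threshold is known to be zero, and the function names are illustrative only.

```python
import math


def sheppard_p00(rho):
    """P(Y_k = Y_l = 0) under median dichotomy, from (7.63)."""
    return 0.25 + math.asin(rho) / (2.0 * math.pi)


def pairwise_rho_estimate(sequences):
    """Maximize the pairwise pseudo-likelihood (7.62) when alpha = 0.

    Every pair k > l within each replicate sequence contributes one term;
    the maximizer depends only on the proportion of agreeing pairs.
    """
    agree = 0
    pairs = 0
    for seq in sequences:
        for k in range(len(seq)):
            for l in range(k):
                pairs += 1
                agree += int(seq[k] == seq[l])
    a = agree / pairs
    # Invert P(agree) = 1/2 + asin(rho) / pi.
    return math.sin(math.pi * (a - 0.5))
```

The estimate is a function of the agreement proportion alone, which is one way of seeing that information in the higher-order relations is neglected.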
If we have $n$ independent replicate sequences then first-, but not second-, order validity applies, as does standard $n$ asymptotics, and estimates of $\alpha$ and $\rho$ are obtained, the latter not fully efficient because of neglect of information in the higher-order relations.
Suppose, however, that $m$ is large, and that there is just one very long sequence. Importantly, in this case direct use of the pseudo-likelihood above is not satisfactory. This can be seen as follows. Under this special correlational structure we may write, at least for $\rho > 0$, the latent, i.e., unobserved, normal random variables in the form
$$ W_k = V\sqrt{\rho} + V_k\sqrt{1 - \rho}, \tag{7.64} $$
where $V, V_1, \ldots, V_m$ are independent standard normal random variables. It follows that the model is equivalent to the use of thresholds at $\alpha - V\sqrt{\rho}$ with independent errors, instead of ones at $\alpha$. Hence, if we are concerned with the behaviour for large $m$, the estimates of $\alpha$ and $\rho$ will be close respectively to $\alpha - V\sqrt{\rho}$ and to zero. The general moral is that application of this kind of pseudo-likelihood to a single long sequence is unsafe if the long-range dependencies are not sufficiently weak.
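The failure for a single long sequence can be seen in a small simulation. The Python sketch below draws one sequence from the equal-correlation representation, then compares the proportion of ones with its conditional probability given the shared component and checks that the pair proportions look like those of independent observations. The parameter values in the usage lines, the seed and the function names are arbitrary choices for illustration.

```python
import math
import random
from statistics import NormalDist


def simulate_sequence(m, alpha, rho, rng):
    """Draw one sequence with W_k = V*sqrt(rho) + V_k*sqrt(1 - rho)."""
    v = rng.gauss(0.0, 1.0)  # shared component, drawn once per sequence
    ys = []
    for _ in range(m):
        w = v * math.sqrt(rho) + rng.gauss(0.0, 1.0) * math.sqrt(1.0 - rho)
        ys.append(1 if w > alpha else 0)
    return v, ys


def conditional_success_probability(v, alpha, rho):
    """P(Y_k = 1 | V = v): threshold alpha - v*sqrt(rho) applied to an
    independent error of variance 1 - rho."""
    shifted = (alpha - v * math.sqrt(rho)) / math.sqrt(1.0 - rho)
    return 1.0 - NormalDist().cdf(shifted)


def pair_summary(ys):
    """Proportion of ones, and proportion of (1, 1) pairs among all pairs."""
    m = len(ys)
    s = sum(ys)
    return s / m, s * (s - 1) / (m * (m - 1))


rng = random.Random(0)
v, ys = simulate_sequence(20000, 0.0, 0.5, rng)
p1, p11 = pair_summary(ys)
# p1 tracks the shifted threshold; p11 is close to p1 squared.
print(p1, conditional_success_probability(v, 0.0, 0.5), p11 - p1 * p1)
```

Because the pair proportions from a single sequence factorize to within a term of order $1/m$, a fit of the pairwise pseudo-likelihood is pulled towards zero correlation whatever the true value of $\rho$.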
Example 7.16. Case-control study. An important illustration of these ideas is provided by a type of investigation usually called a case-control study; in econometrics the term choice-based sampling is used. To study this it is helpful to consider as a preliminary a real or notional population of individuals, each having in principle a binary outcome variable y and two vectors of explanatory variables w and z. Individuals with y = 1 will be called cases and those with y = 0 controls. The variables w are to be regarded as describing the intrinsic
nature of the individuals, whereas z are treatments or risk factors whose effect on y is to be studied. For example, w might include the gender of a study individual, z one or more variables giving the exposure of that individual to environmental hazards and y might specify whether or not the individual dies from a specific cause. Both z and w are typically vectors, components of which may be discrete or continuous.
Suppose that in the population we may think of $Y, Z, W$ as random variables with
$$ P(Y = 1 \mid W = w, Z = z) = L(\alpha + \beta^T z + \gamma^T w), \tag{7.65} $$
where $L(x) = e^x/(1 + e^x)$ is the unit logistic function. Interest focuses on $\beta$, assessing the effect of $z$ on $Y$ for fixed $w$. It would be possible to include an interaction term between $z$ and $w$ without much change to the following discussion.
Now if the response $y = 1$ is rare and also if it is a long time before the response can be observed, direct observation of the system just described can be time-consuming and in a sense inefficient, ending with a very large number of controls relative to the number of cases. Suppose, therefore, instead that data are collected as follows. Each individual with outcome $y$, where $y = 0, 1$, is included in the data with conditional probability, given the corresponding $z, w$, of $p_y(w)$, and $z$ is determined retrospectively. For a given individual, $D$ denotes its inclusion in the case-control sample. We write
$$ P(D \mid Y = y, Z = z, W = w) = p_y(w). \tag{7.66} $$
It is crucial to the following discussion that the selection probabilities do not depend on $z$ given $w$. In applications it would be quite common to take $p_1(w) = 1$, i.e., to take all possible cases. Then for each case one or more controls are selected with probability of selection defined by $p_0(w)$. Choice of $p_0(w)$ is discussed in more detail below.
It follows from the above specification that for the selected individuals, we have in a condensed notation that
$$ \begin{aligned} P(Y = 1 \mid D, z, w) &= \frac{P(Y = 1 \mid z, w)\, P(D \mid 1, z, w)}{P(D \mid z, w)} \\ &= \frac{L(\alpha + \beta^T z + \gamma^T w)\, p_1(w)}{L(\alpha + \beta^T z + \gamma^T w)\, p_1(w) + \{1 - L(\alpha + \beta^T z + \gamma^T w)\}\, p_0(w)} \\ &= L\{\alpha + \beta^T z + \gamma^T w + \log\{p_1(w)/p_0(w)\}\}. \end{aligned} \tag{7.67} $$
There are now two main ways of specifying the choice of controls. In the first, for each case one or more controls are chosen that closely match the case with respect to $w$. We then write for the $k$th such group of a case and controls, essentially without loss of generality,
$$ \gamma^T w + \log\{p_1(w)/p_0(w)\} = \lambda_k, \tag{7.68} $$
where $\lambda_k$ characterizes this group of individuals. This representation points towards the elimination of the nuisance parameters $\lambda_k$ by conditioning. An alternative is to assume for the entire data that approximately
$$ \log\{p_1(w)/p_0(w)\} = \eta + \zeta^T w, \tag{7.69} $$
pointing to the unconditional logistic relation of the form
$$ P(Y = 1 \mid D, z, w) = L(\alpha^* + \beta^T z + \gamma^{*T} w), \tag{7.70} $$
where, on substituting (7.69) into (7.67), $\alpha^* = \alpha + \eta$ and $\gamma^* = \gamma + \zeta$. The crucial point is that $\beta$, but not the other parameters, takes the same value as the originating value in the cohort study.
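The algebraic step in (7.67), that selection with probabilities free of $z$ merely adds $\log\{p_1(w)/p_0(w)\}$ to the linear predictor, can be checked numerically. In the Python sketch below `lin` stands for the value of $\alpha + \beta^T z + \gamma^T w$ and `p1`, `p0` for the two selection probabilities at a given $w$; the names and the test values are illustrative assumptions.

```python
import math


def logistic(x):
    """Unit logistic function L(x) = e^x / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))


def selected_case_probability(lin, p1, p0):
    """Middle expression of (7.67): Bayes' theorem applied to selection."""
    num = logistic(lin) * p1
    return num / (num + (1.0 - logistic(lin)) * p0)


def shifted_logistic(lin, p1, p0):
    """Final expression of (7.67): logistic with an offset log(p1 / p0)."""
    return logistic(lin + math.log(p1 / p0))


for lin in (-3.0, 0.0, 2.5):
    # The two expressions agree up to rounding error.
    print(selected_case_probability(lin, 1.0, 0.01),
          shifted_logistic(lin, 1.0, 0.01))
```

Since the offset does not involve $z$, the coefficient of $z$ is unchanged by the sampling scheme, which is the point exploited in the rest of the example.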
Now rather than base the analysis directly on one of the last two formulae, it is necessary to represent the fact that in this method of investigation the variable $y$ is fixed for each individual and the observed random variable is $Z$. The likelihood is therefore the product over the observed individuals of, again in a condensed notation,
$$ f_{Z\mid D,Y,W} = f_{Y\mid D,Z,W}\, f_{Z\mid D,W} / f_{Y\mid D,W}. \tag{7.71} $$
The full likelihood function is thus a product of three factors. By (7.70) the first factor has a logistic form. The final factor depends only on the known functions $p_k(w)$ and on the numbers of cases and controls and can be ignored.
The middle factor is a product of terms of the form
$$ \frac{f_Z(z)\{L(\alpha + \beta^T z + \gamma^T w)\, p_1(w) + (1 - L(\alpha + \beta^T z + \gamma^T w))\, p_0(w)\}}{\int dv\, f_Z(v)\{L(\alpha + \beta^T v + \gamma^T w)\, p_1(w) + (1 - L(\alpha + \beta^T v + \gamma^T w))\, p_0(w)\}}. \tag{7.72} $$
Thus the parameter of interest, $\beta$, occurs both in the relatively easily handled logistic form of the first factor and in the second more complicated factor. One justification for ignoring the second factor could be to regard the order of the observations as randomized, without loss of generality, and then to regard the first logistic factor as a pseudo-likelihood. There is the difficulty, not affecting the first-order pseudo-likelihood, that the different terms are not independent in the implied new random system, because the total numbers of cases and controls are constrained, and may even be equal. Suppose, however, that the marginal distribution of $Z$, i.e., $f_Z(z)$, depends on unknown parameters $\omega$ in such a way that when $(\omega, \alpha, \beta, \gamma)$ are all unknown the second factor on its own
provides no information about $\beta$. This would be the case if, for example, $\omega$ represented an arbitrary multinomial distribution over a discrete set of possible values of $z$. The profile likelihood for $(\alpha, \beta, \gamma)$, having maximized out over $\omega$, is then essentially the logistic first factor which is thus the appropriate likelihood function for inference about $\beta$.
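A small simulation may help to make the use of the logistic first factor concrete. The Python sketch below is a deliberately simplified setting and not the general analysis of the example: there is a single scalar $z$ with a standard normal distribution, no $w$, every case is retained ($p_1 = 1$) and each control is retained with a constant probability $p_0$. The parameter values, the seed and the hand-written Newton-Raphson fit are all assumptions of the sketch.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def draw_case_control(n_pop, alpha, beta, p0, rng):
    """Keep every case (p1 = 1) and each control with probability p0."""
    sample = []
    for _ in range(n_pop):
        z = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < logistic(alpha + beta * z) else 0
        if y == 1 or rng.random() < p0:
            sample.append((z, y))
    return sample


def fit_logistic(sample, n_iter=25):
    """Newton-Raphson for P(Y = 1 | z) = L(a + b z); returns (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for z, y in sample:
            p = logistic(a + b * z)
            wgt = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * z      # score for the slope
            h00 += wgt             # entries of the information matrix
            h01 += wgt * z
            h11 += wgt * z * z
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


rng = random.Random(1)
sample = draw_case_control(100000, -4.0, 1.0, 0.03, rng)
a_hat, b_hat = fit_logistic(sample)
# Slope should be near beta = 1; intercept near alpha + log(1 / p0).
print(a_hat, b_hat, -4.0 + math.log(1.0 / 0.03))
```

In this setting the fitted slope estimates the population $\beta$, while the fitted intercept estimates the population intercept shifted by $\log(p_1/p_0)$, so that only $\beta$ carries over from the population to the case-control sample.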
Another extreme case arises if $f_Z(z)$ is completely known, when there is in principle further information about $\beta$ in the full likelihood. It is unknown whether such information is appreciable; it seems unlikely that it would be wise to use it.