The Economic and Social Review, Vol. 27, No. 1, October, 1995, pp. 33-42
Estimating Equations with Information Loss on
at Least One Dependent Variable
DENIS CONNIFFE*
The Economic and Social Research Institute
*Paper presented at the Ninth Annual Conference of the Irish Economic Association.
Abstract: Efficient, or joint, estimation of a pair of linear equations with the same explanatory variables reduces to separate estimation of each equation. This is no longer the case if information has been lost on at least one dependent variable; for example, if all that is recorded is whether some value is exceeded or not, or if information has been lost completely. This paper considers a class of problems, for which the efficient estimates can be deduced by simple intuitive arguments and which occurs frequently in practice. Some members of the class have already appeared in the literature, but have not been related, while others are new.
I INTRODUCTION
It is well known that joint (or efficient) estimation of two (or more) equations with the same explanatory variables, measured on the same $n$ observations and having the form
$$
\begin{aligned}
y_{1i} &= x_i' b_1 + u_{1i} \\
y_{2i} &= x_i' b_2 + u_{2i},
\end{aligned} \tag{1}
$$
where $u_{1i}$ and $u_{2i}$ will be assumed bivariate normal, just reduces to single equation estimation of each (Anderson, 1958; Zellner, 1962). Of course, it is demonstrated in texts that if there are cross equation restrictions on coefficients, or if explanatory variables differ between equations, this may no longer be the case. But such cases are actually just further manipulation of the ordinary least squares solutions, appropriate only if strong assumptions hold. For example, if
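$b_1' = (b_{11}', b_{12}')$, $b_2' = (b_{21}', b_{22}')$ and $x_i' = (x_{1i}', x_{2i}')$,
(1) becomes
$$
\begin{aligned}
y_{1i} &= x_{1i}' b_{11} + x_{2i}' b_{12} + u_{1i} \\
y_{2i} &= x_{1i}' b_{21} + x_{2i}' b_{22} + u_{2i}.
\end{aligned} \tag{2}
$$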
These still have the same explanatory variables, so the usual ordinary least squares formulae apply for the coefficients (call them $b_{ij}^*$), as do the usual variances or covariances (call them $v_{ij}$). Now a symmetry constraint, $b_{12} = b_{21}$, would just mean that
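$$
\begin{pmatrix} b_{11}^* \\ b_{12}^* \\ b_{21}^* \\ b_{22}^* \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} b_{11} \\ b_{12} \\ b_{22} \end{pmatrix}
+
\begin{pmatrix} w_1 \\ w_2 \\ w_3 \\ w_4 \end{pmatrix}
\tag{3}
$$
or $B^* = HB + W$,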
where $B$ denotes the reduced coefficient vector $B' = (b_{11}', b_{12}', b_{22}')$. The solution of (3) is itself a simple GLS estimator,
$$\hat{B}_H = [H'V^{-1}H]^{-1}H'V^{-1}B^*,$$
where $V$ is the variance covariance matrix of $B^*$.
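Again, the model (2) would become seemingly unrelated regressions if it could be assumed that $b_{12} = b_{21} = 0$, and then
$$
\begin{pmatrix} b_{11}^* \\ b_{12}^* \\ b_{21}^* \\ b_{22}^* \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} b_{11} \\ b_{22} \end{pmatrix}
+
\begin{pmatrix} w_1 \\ w_2 \\ w_3 \\ w_4 \end{pmatrix}
\tag{4}
$$
or $B^* = KB_k + W$,
giving
$$\hat{B}_k = [K'V^{-1}K]^{-1}K'V^{-1}B^*. \tag{5}$$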
Thus, given that the appropriate estimators for the common explanatory variable case are known, it is quite easy to proceed to cases that just reduce the number of coefficients through assumptions. Of course, the assumptions should be tested, which is another reason for starting from the common explanatory variable case.
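The restricted estimators in (3) to (5) amount to one small matrix computation once $B^*$ and $V$ are available. The following Python sketch is illustrative only: the function name `restricted_gls`, the use of scalar coefficient blocks, and all numerical values are assumptions made for the example and do not come from the paper.

```python
import numpy as np

def restricted_gls(B_star, V, H):
    """Return [H'V^-1 H]^-1 H'V^-1 B* and its variance matrix [H'V^-1 H]^-1.

    Assumes V is symmetric positive definite.
    """
    Vinv_H = np.linalg.solve(V, H)                  # V^-1 H
    A = H.T @ Vinv_H                                # H' V^-1 H
    # Vinv_H.T equals H' V^-1 because V is symmetric.
    B_hat = np.linalg.solve(A, Vinv_H.T @ B_star)
    return B_hat, np.linalg.inv(A)

# Hypothetical unrestricted estimates (b11*, b12*, b21*, b22*), scalar blocks.
B_star = np.array([1.0, 0.4, 0.6, 2.0])
# Hypothetical variance covariance matrix of B* (made up, positive definite).
V = np.array([[0.5, 0.1, 0.2, 0.0],
              [0.1, 1.0, 0.0, 0.2],
              [0.2, 0.0, 1.0, 0.1],
              [0.0, 0.2, 0.1, 0.5]])

# Symmetry constraint b12 = b21, as in (3).
H = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
# Exclusion constraint b12 = b21 = 0, as in (4).
K = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)

B_H, V_H = restricted_gls(B_star, V, H)
B_K, V_K = restricted_gls(B_star, V, K)
print("symmetry-restricted estimate:", B_H)
print("exclusion-restricted estimate:", B_K)
```

Using `np.linalg.solve` rather than forming $V^{-1}$ explicitly is a numerical convenience, not part of the argument in the text.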
However, efficient estimation of two equations with common explanatory variables does not always reduce to single equation estimators. This paper is concerned with one class of situations, where simple intuitive arguments often suggest the appropriate estimators, and which occur frequently in practice. As a starting point, suppose the situation represented by (1) did once hold, but information was lost on one or both of $y_1$ and $y_2$. For example, all measures of $y_2$ might be lost except the signs. Then there is just a binary variable $z$ and the problem becomes that of efficient joint estimation of a linear equation for $y_1$ and a probit model for $z$. This was solved by Chesher (1984). Or, all information might be lost on a sub-set $n - r$ of the $y_2$ values, the remaining $r$ being fully observed. This is the problem of two linear equations with unequal numbers of observations. The appropriate estimators were given in Conniffe (1985). But, as will be seen, there are many other possibilities.
Before proceeding to examine them, it is essential to appreciate why joint estimation of (1) does reduce to single equation estimation. Least squares (or maximum likelihood, given normality) estimation of a single linear equation, with dependent variable $y$ and explanatory variables $x_1$ and $x_2$, leads to minimisation of
$$\sum_i (y_i - x_{1i}' c_{1.2} - x_{2i}' c_{2.1})^2$$
and so to the conditions
$$\sum_i x_{ji}(y_i - x_{1i}' c_{1.2}^* - x_{2i}' c_{2.1}^*) = 0, \qquad j = 1, 2.$$
But these imply that
$$c_1^* = c_{1.2}^* + d_2^* c_{2.1}^*$$
and
$$c_2^* = c_{2.1}^* + d_1^* c_{1.2}^*, \tag{6}$$
where $c_1^*$ and $c_2^*$ are the vectors of simple regression coefficients of $y$ on $x_1$ and $x_2$ respectively and $d_2^*$ and $d_1^*$ are the matrices of simple regression coefficients of $x_1$ on $x_2$ and of $x_2$ on $x_1$.
Returning to the pair of Equations (1), bivariate normality of $u_1$ and $u_2$ implies that $y_{2i}$, conditionally on $y_{1i}$, is normal with mean
$$x_i' b_2 + \delta (y_{1i} - x_i' b_1)$$
or
$$x_i'(b_2 - \delta b_1) + \delta y_{1i},$$
where $\delta = \sigma_{12}/\sigma_{11}$, and with variance $\sigma_{22}(1 - \rho^2)$. So in a regression of $y_2$ on $x$ and $y_1$ the coefficient of $x$, $b^*_{x.y_1}$, estimates $b_2 - \delta b_1$ and the coefficient on $y_1$, $b^*_{y_1.x}$, estimates $\delta$. So a "new" estimator of $b_2$ is
$$\tilde{b}_2 = \widehat{(b_2 - \delta b_1)} + \widehat{(\delta b_1)} = b^*_{x.y_1} + b_1^* b^*_{y_1.x}, \tag{7}$$
where $b_1^*$ is the estimator of $b_1$ from the first equation: the "simple" coefficient of $y_1$ on $x$. But from (6), the right hand side of (7) is just $b_2^*$, so
$$\tilde{b}_2 = b_2^*.$$
It is true that a similar result can hold for non-linear equations (Gallant, 1975), but only if the non-linear functional form is the same for both equations and the disturbance terms are additive normal. But these are very restrictive conditions.
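The algebraic identity behind (7), that the "new" estimator coincides with the single equation OLS estimator, can be checked numerically. The sketch below uses simulated data; the sample size, coefficient values, error covariance and the helper name `ols` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3

# Common explanatory variables (intercept plus two regressors), simulated.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
b1 = np.array([1.0, 0.5, -0.3])   # assumed true coefficients, equation 1
b2 = np.array([0.2, -1.0, 0.8])   # assumed true coefficients, equation 2
U = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 2.0]], size=n)
y1 = X @ b1 + U[:, 0]
y2 = X @ b2 + U[:, 1]

def ols(Z, y):
    """Ordinary least squares coefficients of y on the columns of Z."""
    return np.linalg.lstsq(Z, y, rcond=None)[0]

b1_star = ols(X, y1)                       # simple coefficients of y1 on x
b2_star = ols(X, y2)                       # simple coefficients of y2 on x
coef = ols(np.column_stack([X, y1]), y2)   # regression of y2 on x and y1
b_x_y1, b_y1_x = coef[:k], coef[k]

# Right hand side of (7).
b2_new = b_x_y1 + b1_star * b_y1_x
print("max abs difference from b2*:", np.max(np.abs(b2_new - b2_star)))
```

The difference is zero up to rounding error for any data set with full column rank, since the identity is a property of the least squares normal equations rather than of the simulated distribution.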
II CASES IN THE LITERATURE
Chesher (1984) considered joint estimates of linear and probit models. This could occur, for example, if using household expenditure data to model consumption of two commodities such as food and a major consumer durable. Most households will not purchase (or replace) the major durable at all during the survey period, so a (0,1) variable will result. An underlying $y_2$ variable can be hypothesised, however, perhaps the service derived from the durable. Since it is not observed, $\sigma_{22}$ can be taken as unity and the occurrence of $z = 1$ as corresponding to $y_2 > 0$. Another way of viewing this is that it is $b_2/\sqrt{\sigma_{22}}$ that is identifiable. The single equation estimates of $b_1$ and $b_2$ are obtained by ordinary least squares and probit analysis respectively. Denote the OLS estimator of $b_1$ by $b_1^*$ and the single equation probit estimator by $\hat{b}_2$. The latter is the estimator obtained by maximising the likelihood
$$\prod_i F(x_i' b_2)^{z_i} F(-x_i' b_2)^{1 - z_i},$$
where $F$ is the cumulative distribution function of the standard normal, so that $F(-x_i' b_2)$ is the probability that $z_i = 0$.
When considering efficient estimators for $b_1$ and $b_2$, the idea that information lost on a system cannot lead to improved precision will be employed. If $y_2$, rather than $z$, was available the situation would be (1) and $b_1^*$ would be the efficient estimator. Therefore, it must still be the efficient estimator when $y_2$ is not available. However, we can perhaps do better than $\hat{b}_2$, the single equation probit estimator. Since $y_{2i}$, conditionally on $y_{1i}$, has mean $x_i'(b_2 - \delta b_1) + \delta y_{1i}$ and variance $1 - \rho^2$ with $\rho^2 = \sigma_{12}^2/\sigma_{11}$, a probit analysis employing $x$ and $y_1$ can obtain $b_{x.y_1}$ and $b_{y_1.x}$ that estimate
$$\frac{b_2 - \delta b_1}{(1 - \rho^2)^{1/2}} \quad \text{and} \quad \frac{\delta}{(1 - \rho^2)^{1/2}}$$
respectively. Substituting $b_1^*$ for $b_1$ and $\hat{\sigma}_{11}$, the estimator obtained from the error mean square of the ordinary regression analysis for $y_1$, for $\sigma_{11}$, gives
$$\tilde{b}_2 = \frac{b_{x.y_1} + b_1^* b_{y_1.x}}{\{1 + \hat{\sigma}_{11}(b_{y_1.x})^2\}^{1/2}}. \tag{8}$$
As Chesher remarked, (8) can be calculated given any statistical or econometric software package that contains a probit analysis routine. So can an estimate of its asymptotic variance, $V(\tilde{b}_2)$. The corresponding estimator for $\delta = \sigma_{12}/\sigma_{11}$ is
$$\hat{\delta} = \frac{b_{y_1.x}}{\{1 + \hat{\sigma}_{11}(b_{y_1.x})^2\}^{1/2}}. \tag{9}$$
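Once a probit routine has supplied the coefficients on $x$ and $y_1$, (8) and (9) are simple rescalings. The sketch below is a hedged illustration: `chesher_estimates` and its argument names are hypothetical, and the inputs in the usage example are population quantities built from assumed parameter values (with $\sigma_{22} = 1$), not output from a fitted probit model.

```python
import numpy as np

def chesher_estimates(b_x_y1, b_y1_x, b1_star, sigma11_hat):
    """Apply (8) and (9) to probit coefficients from a fit on x and y1.

    b_x_y1      : probit coefficients on x (array)
    b_y1_x      : probit coefficient on y1 (scalar)
    b1_star     : OLS coefficients of y1 on x (array)
    sigma11_hat : error mean square from the OLS regression for y1
    """
    b_x_y1 = np.asarray(b_x_y1, dtype=float)
    b1_star = np.asarray(b1_star, dtype=float)
    scale = np.sqrt(1.0 + sigma11_hat * b_y1_x ** 2)
    b2_tilde = (b_x_y1 + b1_star * b_y1_x) / scale   # equation (8)
    delta_hat = b_y1_x / scale                       # equation (9)
    return b2_tilde, delta_hat

# Assumed population values, used only to check the algebra.
b1 = np.array([1.0, 0.5])
b2 = np.array([0.2, -1.0])
sigma11, sigma12 = 2.0, 0.8
delta = sigma12 / sigma11
rho2 = sigma12 ** 2 / sigma11            # rho^2 when sigma22 = 1
s = np.sqrt(1.0 - rho2)

# What the probit on x and y1 estimates, according to the text.
probit_x = (b2 - delta * b1) / s
probit_y1 = delta / s

b2_rec, delta_rec = chesher_estimates(probit_x, probit_y1, b1, sigma11)
print("recovered b2:", b2_rec)
print("recovered delta:", delta_rec)
```

Feeding in the population limits of the probit coefficients returns the assumed $b_2$ and $\delta$, which confirms that (8) and (9) invert the scaling by $(1 - \rho^2)^{1/2}$.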
Turning to the case in which $y_2$ is fully observed for only $r$ of the $n$ observations, the same logic as before implies that the efficient estimator of $b_1$ is $b_1^*$, where $b_1^*$ is the ordinary least squares estimate based on all $n$ observations. Again, the conditional distribution of $y_2$ given $y_1$ implies that $b^r_{x.y_1}$ estimates $b_2 - \delta b_1$ and $b^r_{y_1.x}$ estimates $\delta$, where the superscript $r$ is a reminder that these are the ordinary least squares estimates based on the $r$ "complete" observations. Now, as for (7),
$$\tilde{b}_2 = b^r_{x.y_1} + b_1^* b^r_{y_1.x}, \tag{10}$$
and note that $b_1^*$, the estimator of $b_1$ based on all observations, has been substituted for $b_1$. But from (6)
$$b_2^r = b^r_{x.y_1} + b_1^r b^r_{y_1.x},$$
so that (10) may be rewritten
$$\tilde{b}_2 = b_2^r - (b_1^r - b_1^*) b^r_{y_1.x}. \tag{11}$$
This is the estimator discussed by Conniffe (1985), where the asymptotic variance and indeed the exact small sample variances are given. Other published literature could be considered to describe situations falling within the class. For example, if for both $y_1$ and $y_2$ only the signs were available for all $n$, the multivariate probit model of Ashford and Sowdon (1970) would apply.
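The equivalence of (10) and (11) for the case of extra observations on $y_1$ can be illustrated on simulated data. In the following sketch the sample sizes, parameter values and variable names are assumptions for the example; $y_2$ is generated for all $n$ cases but only the first $r$ values are used, to mimic the loss of information.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, k = 300, 120, 2

X = np.column_stack([np.ones(n), rng.normal(size=n)])
b1 = np.array([1.0, 0.5])    # assumed true coefficients, equation 1
b2 = np.array([0.2, -1.0])   # assumed true coefficients, equation 2
U = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.5]], size=n)
y1 = X @ b1 + U[:, 0]
y2 = X @ b2 + U[:, 1]        # only y2[:r] is treated as observed

def ols(Z, y):
    """Ordinary least squares coefficients of y on the columns of Z."""
    return np.linalg.lstsq(Z, y, rcond=None)[0]

b1_star = ols(X, y1)                     # OLS for b1 on all n observations
Xr, y1r, y2r = X[:r], y1[:r], y2[:r]     # the r "complete" observations
b1_r = ols(Xr, y1r)
b2_r = ols(Xr, y2r)
coef = ols(np.column_stack([Xr, y1r]), y2r)
b_x_y1_r, b_y1_x_r = coef[:k], coef[k]

b2_eq10 = b_x_y1_r + b1_star * b_y1_x_r           # equation (10)
b2_eq11 = b2_r - (b1_r - b1_star) * b_y1_x_r      # equation (11)
print("single equation estimate b2^r:", b2_r)
print("adjusted estimate (11):       ", b2_eq11)
print("(10) and (11) agree:", np.allclose(b2_eq10, b2_eq11))
```

Form (11) makes the interpretation plain: the estimate from the $r$ complete cases is corrected by the amount by which the $y_1$ regression on those cases differs from the $y_1$ regression on the full sample.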
III OTHER CASES
One quite feasible situation is where a survey question presents such difficulty that, although the question (the $y_2$ variable) has been reduced to (0,1) format (business surveys, for example, often reduce questions about expectations of future activity to "above or below normal"), the question is unanswered by some respondents. Then there are $t$ ($< n$) observations for which the probit analysis with $x$ and $y_1$ as explanatory variables can be conducted. However, all $n$ observations may be available for the $y_1$ variable and hence for estimating $b_1$. As before, $b_1^*$ is the estimator for $b_1$ and it is intuitively obvious that (8) is the estimator for $b_2$, but with the difference that, although $b_{x.y_1}$ and $b_{y_1.x}$ are based on the $t$ observations, the estimates $b_1^*$ and $\hat{\sigma}_{11}$ are based on all $n$ observations.
squares estimator, is clearly efficient and, considering the $n - r$ observations separately, the efficient estimator is given by (8), but with all components calculated only over the $n - r$ observations. However, this actually estimates $b_2/\sqrt{\sigma_{22}}$ and so needs multiplying by the square root of $\hat{\sigma}_{22}$, the standard estimator of $\sigma_{22}$ from the $r$ observations, to give an estimator of $b_2$, written $\tilde{b}_2^{\,n-r}$ here. Since $b_2^r$, the least squares estimator from the $r$ observations, and $\hat{\sigma}_{22}$ are independent and the $r$ and $n - r$ observations are assumed independent, an efficient combined estimator is obtained by the usual rule of combining inversely as the variances:
$$V_{n-r}(V_r + V_{n-r})^{-1} b_2^r + V_r(V_r + V_{n-r})^{-1} \tilde{b}_2^{\,n-r}, \tag{12}$$
where $V_r$ and $V_{n-r}$ are used to denote the variances of the estimates for the $r$ and $n - r$ observations and $V_{n-r} = \sigma_{22}V_c + b_2 b_2'/2r$, where $V_c$ denotes the variance of the Chesher estimator (8). In practice, the variances are replaced by estimates.
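The rule of combining inversely as the variances, used in (12), can be written as a short matrix routine. This is a minimal sketch under stated assumptions: the function name `combine_inverse_variance` is hypothetical, the two estimates are taken to be independent with known variance matrices, and the numbers in the usage example are invented.

```python
import numpy as np

def combine_inverse_variance(est_r, V_r, est_nr, V_nr):
    """Combine two independent estimates of the same vector as in (12).

    The estimate with variance V_r receives weight V_nr (V_r + V_nr)^-1
    and the other receives weight V_r (V_r + V_nr)^-1.
    """
    S_inv = np.linalg.inv(V_r + V_nr)
    combined = V_nr @ S_inv @ est_r + V_r @ S_inv @ est_nr
    V_combined = V_r @ S_inv @ V_nr    # variance of the combined estimate
    return combined, V_combined

# Invented estimates of b2 from the r and the n - r observations.
est_r = np.array([1.0, 2.0])
V_r = np.array([[0.20, 0.05],
                [0.05, 0.10]])
est_nr = np.array([1.4, 1.8])
V_nr = np.array([[0.40, 0.00],
                 [0.00, 0.30]])

combined, V_combined = combine_inverse_variance(est_r, V_r, est_nr, V_nr)
print("combined estimate:", combined)
print("combined variance:\n", V_combined)
```

The two weights sum to the identity matrix, and the result agrees with the equivalent precision-weighted form $(V_r^{-1} + V_{n-r}^{-1})^{-1}(V_r^{-1} b_2^r + V_{n-r}^{-1} \tilde{b}_2^{\,n-r})$.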
Yet another possible situation would arise from the previous case if some respondents failed to provide any information on $y_2$. Then there could be $r$ observations on $y_2$ and $y_1$, $t$ on $z$ and $y_1$ and $n - r - t$ extra observations on $y_1$. From the $t$ observations on $z$ and $y_1$ an estimator of $b_2$, say $\tilde{b}_2^{\,t}$, can be based on the product of the Chesher estimator (8) and an estimator of $\sqrt{\sigma_{22}}$. From the remaining $n - t$ observations, the $r$ on $y_2$ and $y_1$ and the $n - r - t$ on $y_1$, an estimator, say $\tilde{b}_2^{\,n-t}$, would be based on (11). Conniffe (1985) also described estimation of $\sigma_{22}$ for this data pattern, giving all relevant formulae. Then an efficient overall estimator of $b_2$ is
$$V_{n-t}(V_t + V_{n-t})^{-1} \tilde{b}_2^{\,t} + V_t(V_t + V_{n-t})^{-1} \tilde{b}_2^{\,n-t},$$
where $V_t$ is the variance of $\tilde{b}_2^{\,t}$ and $V_{n-t}$ is the variance of $\tilde{b}_2^{\,n-t}$.
So far, all examples have retained the ordinary least squares estimate of $b_1$, evaluated over all $n$ observations, as the efficient estimator of $b_1$. So let us consider loss of information on both $y_1$ and $y_2$. Suppose a business survey seeks a (0,1) reply to one question and a quantitative reply to another. It could easily happen that quantitative answers are not available from some respondents who do provide replies to the (0,1) question. Then there are $r$ observations with $y_1$ and $z$ and $n - r$ more with only $z$ recorded. An efficient estimator for $b_2$ (taking $\sigma_{22} = 1$) can be deduced along the lines previously employed. The $r$ observations provide an estimator $\tilde{b}_2^{\,r}$ (of the form (8)) and the $n - r$ provide an independent probit estimator $\hat{b}_2^{\,n-r}$. Then an efficient combination is
$$\tilde{b}_2 = V_{n-r}(V_r + V_{n-r})^{-1} \tilde{b}_2^{\,r} + V_r(V_r + V_{n-r})^{-1} \hat{b}_2^{\,n-r},$$
where $V_r$ and $V_{n-r}$ are the variances of $\tilde{b}_2^{\,r}$ and $\hat{b}_2^{\,n-r}$ respectively.
As regards estimation of $b_1$, analogy with (10) suggests an estimator
$$b_1^r - A(\tilde{b}_2^{\,r} - \tilde{b}_2), \tag{13}$$
that is, a modification of the OLS estimator from the $r$ observations by a multiple of the difference between the estimators of $b_2$ based on the $r$ observations ($\tilde{b}_2^{\,r}$) and on all $n$ observations ($\tilde{b}_2$). Choosing $A$ to minimise the (large sample) variance of (13) leads to
$$b_1^r - \frac{\sigma_{12}}{\sigma_{11}} V_1 V_r^{-1}(\tilde{b}_2^{\,r} - \tilde{b}_2), \tag{14}$$
where $V_1$ is the conventional variance (matrix) of $b_1^r$. Replacing $\sigma_{12}/\sigma_{11}$ by (9) and variances by estimates gives the estimator of $b_1$. Note that if $y_2$ was measured, this would reduce to (11) with $b_1$ and $b_2$ interchanged.
Since $V_r$ must be larger for unmeasured $y_2$ than for measured $y_2$, the adjustment to the OLS estimate of $b_1$ ought to be smaller than in the measured case, which is an intuitively plausible result. The argument used to derive (14) is perhaps not sufficient to prove that it is a fully efficient estimator; that is, that its asymptotic variance attains the lower bound provided by the variance of the maximum likelihood estimator. A fully rigorous proof is possible, but it requires some rather complicated mathematical detail.
It may be worth mentioning that one case of this example has already been examined in great detail. This is when the (0,1) variable is itself the indicator of the presence or absence of $y_1$. The context has been the possibility of sample selection bias (Heckman, 1979). However, that is a very special case indeed and will not be pursued here.
IV GENERALISATIONS AND RESERVATIONS
As mentioned in the Introduction, the assumption of common explanatory variables in both equations may sometimes be regarded as an initial hypothesis, so that the estimators discussed in the preceding section may themselves be utilised to estimate the parameters of more restricted models. For example, once initial estimates of $b_1$ and $b_2$ have been obtained it is possible to test and estimate models that assume $b_{12} = b_{21} = 0$, where $b_1' = (b_{11}', b_{12}')$ and $b_2' = (b_{21}', b_{22}')$, by exactly the same procedure as given in (4) and (5).
Other kinds of generalisation follow from considering more than two equations. By considering conditional distributions of $y_2$, given $y_1$ and $y_3$ say, another range of situations can be examined, in some of which probit analyses using $x$, $y_1$ and $y_3$ as explanatory variables would certainly feature. The issues associated with generalisation of (11) to multiple equations have been discussed by Conniffe (1985) and at least some of the deductions would carry over to the class of models considered here.
Turning to reservations about the use of the estimators discussed here, there are clearly assumptions implicit in the combination of estimators over different sub-samples. If $y_1$ and $y_2$ are fully measured in $r$ observations and information on $y_2$ is lost in some fashion in another $n - r$, is there a danger that the $r$ and $n - r$ are really samples from different populations? This question has been addressed rather differently in mainline statistics and in econometrics. In the former, the question has usually been (for example, Rubin, 1976): when can we utilise the extra information in incomplete observations? The estimators based on the $r$ observations are presumed to be valid and the issue is whether the incompleteness in the $n - r$ observations is a warning that the underlying population has changed. In econometrics, Heckman (1979) has asked a rather different question: could the completeness of the $r$ observations indicate that selection bias has produced a sample that no longer represents the population?
Mentioning "asymptotic" does raise another issue. The idea of "efficiency" relates to minimum variance in large samples, and in small samples the potential gain can be more than offset by the extra estimation error involved in estimating joint models rather than single models. In the case of linear equations with extra observations on $y_1$, where (11) is the efficient estimator of $b_2$, the asymptotic variance would suggest that (11) is always better than the single equation estimator $b_2^r$ provided the correlation between $y_1$ and $y_2$ is non-zero. But the exact small sample variance (Conniffe, 1985) shows that in small samples the correlation needs to be appreciable before there is any gain. Probit and other non-linear estimation methods are not amenable to the mathematical devices used to obtain exact small sample formulae, but simulation or other numerical approaches could clarify matters.
However, possible failure of asymptotic properties to hold in small samples is by no means confined to the class of models being considered in this paper. Every branch of econometrics utilises methods originally (and still in some cases) justified only by asymptotic arguments. Reservation about realisable gains in small samples should stimulate relevant investigations, rather than deter employment of intuitively plausible adjustments to standard formulae, which is what I hope I have shown the estimators in this paper to be.
REFERENCES
ANDERSON, T.W., 1958. An Introduction to Multivariate Statistical Analysis, New York: Wiley.
ASHFORD, J.R., and R.R. SOWDON, 1970. "Multivariate Probit Analysis", Biometrics, Vol. 26, pp. 535-546.
CHESHER, A., 1984. "Improving the Efficiency of Probit Estimators", The Review of Economics and Statistics, Vol. 66, pp. 523-527.
CONNIFFE, D., 1985. "Estimating Regression Equations with Common Explanatory Variables but Unequal Numbers of Observations", Journal of Econometrics, Vol. 27, pp. 179-196.
FISHER, R.A., 1925. "Theory of Statistical Estimation", Proceedings of the Cambridge Philosophical Society, Vol. 22, pp. 700-725.
GALLANT, A.R., 1975. "Seemingly Unrelated Nonlinear Regressions", Journal of Econometrics, Vol. 3, pp. 35-50.
HAUSMAN, J.A., 1978. "Specification Tests in Econometrics", Econometrica, Vol. 46, pp. 1251-1271.
HECKMAN, J., 1979. "Sample Selection Bias as Specification Error", Econometrica, Vol. 47, pp. 153-161.
RUBIN, D.B., 1976. "Inference and Missing Data", Biometrika, Vol. 63, pp. 581-592.
ZELLNER, A., 1962. "An Efficient Method of Estimating Seemingly Unrelated Regressions and Tests for Aggregation Bias", Journal of the American Statistical