Regression Analysis of Probability-Linked Data

Ray Chambers

Centre for Statistical and Survey Methodology, University of Wollongong

This report was commissioned by Official Statistics Research, through Statistics New Zealand. The opinions, findings, recommendations and conclusions expressed in this report are those of the author(s), do not necessarily represent Statistics New Zealand and should not be reported as those of Statistics New Zealand. The department takes no responsibility for any omissions or errors in the information contained here.

Abstract

Data obtained after probability linkage of administrative registers will typically include errors because some linked records actually contain data items that are sourced from different individuals. Such errors can induce bias in standard statistical analyses if ignored. In this report we describe some approaches to eliminating this bias in the case of linear regression analysis and, more generally, when inference is based on an estimating equation, with an emphasis on logistic regression. Simulation results that illustrate the gains from allowing for linkage error in linear and logistic regression analysis are presented, as are extensions of the approach to situations where a sample is linked to a register and where the linked registers are of unequal size.

Keywords

Record matching, linkage errors, linear regression, logistic regression, estimating equations, measurement error.

Reproduction of material

Material in this report may be reproduced and published, provided that it does not purport to be published under government authority and that acknowledgement is made of this source.

Citation

Chambers, R. (2009). Regression analysis of probability-linked data, Official Statistics Research Series, 4. Available from http://www.statisphere.govt.nz/official-statistics-research/series/default.htm

Published by Statistics New Zealand

Tatauranga Aotearoa Wellington, New Zealand

ISSN 1177-5017 (Online) ISBN 978-0-478-31569-1 (Online)

Acknowledgements

The theory set out in this paper was not developed in a vacuum. It has benefited considerably from advice and critical input from Walt Davis of Statistics New Zealand, Milorad Kovacevic of Statistics Canada and Glenys Bishop and James Chipperfield of the Australian Bureau of Statistics. My thanks go out to all of them for their encouragement. Also, I would like to acknowledge the input of the referee who provided me with the details of the Neter et al. (1965) reference. This is a well-written paper that nicely summarises many of the statistical issues that I have attempted to tackle in this report.

List of tables

Table 1 Options for $G_q(\theta)$ in logistic regression

Table 2 Specification of $\hat{G}_q$ and $\partial_\theta \hat{U}_q$ in (50) for the linear case

Table 3 Simulation results for the linear model

Table 4 Simulation results for slope estimators under the logistic model

List of figures

Figure 1 Boxplots of percentage relative errors generated by different estimators in linear model simulations

Figure 2 Boxplots of percentage relative errors generated by different slope estimators in logistic model simulations

1 Introduction

In their seminal paper on the topic, Fellegi and Sunter (1969) defined record linkage as “a solution to the problem of recognizing those records in two files which represent identical persons, objects, or events...” Record linkage allows data for a single individual to be compiled from different data sources, enabling more powerful and effective analyses to be carried out than would otherwise be the case. In particular, datasets created by linking individual records constitute a critical resource for research in health, epidemiology, economics, demography, sociology and many other scientific areas. National statistical agencies increasingly rely on linking surveys to administrative registers to provide more accurate measurement and to reduce respondent burden. Frequently, one or more datasets (whether all administrative data or a mix of administrative and survey data) are linked to answer a broader array of research questions than can be addressed through any of the datasets individually.

Linked longitudinal datasets are particularly useful in health related research. These are datasets created by matching individual health and health-related records from a variety of sources over a period of time. For example, a longitudinal dataset created by linking hospital admission and general practitioner records to private health insurance expenditure records for individuals in a particular social and/or demographic group could be used to build models for how changes in that group’s health expenditure influence subsequent uptake of medical services. This type of linkage is able to bring together a much better picture of the driving factors behind many public health issues. Thus, using data obtained from linking physician billing claims held by the Ontario Health Insurance Plan with data for consenting Ontario respondents to the 1994/95 Canadian National Population Health Survey, Iron, Manuel and Williams (2003) report on an analysis of the relationship between utilization and costs of physician services and incidence of self-reported chronic conditions for residents of Ontario province in Canada.

Data linkage is not confined to the health sciences. In a review commissioned by the UK Department for Trade and Industry, Chesher and Nesheim (2006) describe the extensive use of data linkage in economic research, particularly in the United States. Statistics New Zealand has recently developed a linked longitudinal employer-employee dataset based on linking administrative data held on the NZ Inland Revenue Department's tax system and Statistics New Zealand's list of NZ businesses. This dataset allows the analysis of job and worker flows, employment tenure, multiple job holding and business demography. Similarly, the Census Data Enhancement Initiative of the Australian Bureau of Statistics (ABS) aims to create a Statistical Longitudinal Census Dataset that integrates census data from the same individuals over a number of censuses, with the objective of building a research resource for longitudinal analysis of the Australian population. In the UK, the Interdepartmental Migration and Population Task Force set up by the Office for National Statistics has recently recommended the use of record linkage to improve migration and population statistics in the UK. The aim in this case is to link administrative, health register, school enrollment and university student data with incoming passenger survey and labour force survey data to create an integrated longitudinal data set that will allow in-depth analysis of the UK migrant experience.

The process used to link datasets often involves a probabilistic matching of records from one dataset to another. In most linkage operations matching variables present in both datasets are used to maximise the probability that the values of the variables making up the linked record are the correct measurements for the population unit corresponding to that record. However, when analysis is undertaken using the resulting linked data, the errors inherent in this type of record matching are typically ignored. This is unfortunate since these errors introduce bias and additional variability into standard statistical estimation techniques. This poses a significant barrier to policy-relevant research using

Statistical methods for linking datasets are now well established (Herzog, Scheuren and Winkler, 2007), with recent statistical research in this area mainly focused on the confidentiality issues that arise as a consequence of linkage. See Sibthorpe, Kliewer and Smith (1995) and Trutwein, Holman and Rosman (2005) for an Australian health data perspective, and Mackie and Bradburn (2000) for contributions to a workshop on confidentiality and linkage sponsored by the US Committee on National Statistics and the Institute of Medicine. In contrast, aside from the notable contributions of Neter et al. (1965), Scheuren and Winkler (1993) and Lahiri and Larsen (2005), there has been comparatively little methodological research carried out on the impact of linkage errors on analysis of linked data. Linkage errors are the errors caused by incorrectly linking different population units as well as the errors caused by not linking the same population units in the datasets that are linked. These errors are a particular type of measurement error, and can lead to biased inference unless appropriate steps are taken to control and/or adjust for this bias.

In this report we develop a methodological framework that can be used to provide appropriate modifications to standard statistical analysis methods to ensure that they remain unbiased when used with probabilistically linked data. The framework is based on modelling the relationship between the probabilistically linked data and the ‘true’ data that would be obtained if error free linkage were possible. Inference then proceeds on the basis of a combined model defined by the integration of this linkage error model with the statistical model for the ‘true’ data values that is of primary interest. Our assumptions about the data linkage situation and a description of a simple model for linkage errors are set out in the following sub-section. In section 2 we apply these ideas to fitting a simple linear regression model to linked data from two registers that each cover the same population. In section 3 this theory is generalised to where the statistical model of interest is fitted via the solution of an estimating equation, with application to logistic regression serving to motivate our approach. Simulation results for both linear and logistic regression are described in section 4 and illustrate the potential gains from the modified analytic methods that we propose. In section 5 we extend our framework to the important case of linking a survey to a register, while in section 6 we look at another important extension, where the registers that are linked are nested, in the sense that the population making up one of the linked registers is a subgroup of the population making up the other. Section 7 concludes the report with a short discussion of avenues for further research.

1.1 Background and assumptions

In what follows we assume the existence of a population of $N$ units, indexed by $i = 1, \ldots, N$, such that, for each unit in this population, it is possible to measure the values of a scalar random variable $Y$ and a vector random variable $X$. We are interested in modelling the relationship between $Y$ and $X$ in this population, and in particular we seek to fit a model of the form $E(Y \mid X) = g(X; \beta)$ for the regression of $Y$ on $X$. Here $g$ corresponds to a known functional form while the parameter $\beta$ is unknown and needs to be estimated. This is usually straightforward if we have the values of $Y$ and $X$ for a random sample of units from this population. Unfortunately, we do not have such a sample. Instead, we have access to two registers that separately contain the population values of $Y$ and $X$. We shall refer to these as the Y-register and X-register respectively from now on. For the time being we also assume that both registers refer to the same population and have no duplicates, so each is made up of $N$ records.

If each unit in the population has a unique identifier, and this identifier is also stored on both registers, we can use it to link records from the two registers, and then estimate β using the Y and X values associated with the N linked records. Unfortunately, such a unique identifier does not exist. Instead, some form of probability-linking algorithm is used to associate (i.e. link) records on the X-register with records on the Y-register. This algorithm makes it possible (at least conceptually) to link every record on the X-register with a record on the Y-register. That is, linkage is complete and one to one between the Y and X-registers. Clearly, the data set constructed by this process (the linked data) can contain linkage errors, i.e. records where the values of Y and X actually come from different population units.

Although it may be theoretically possible for any two records on the Y and X-registers to be linked, most reasonable probability linking algorithms will only attempt to link records that are similar in some sense. Consequently, we shall assume that the linked records can be partitioned into Q distinct ‘blocks’ such that there is no possibility that linked records in different blocks contain data for the same population unit. We model this situation by assuming that the different blocks correspond to different values of a categorical population variable Z that can be derived from the information on either register, and which is defined in such a way that if a record on one register does not have the same value of Z as the record on the other register, then it is reasonable to assume that these two records cannot correspond to the same unit in the original population. Conversely, the fact that a Y-register record and an X-register record have the same value for Z does not guarantee that they correspond to the same unit, and so linkage errors can still occur within a block. We refer to Z as a blocking variable, and those population units with the same value of Z as being in the same block. Note that errors in measurement of Z can lead to the same population unit having one value of Z on the Y-register and another on the X-register, which invalidates the assumption of no linkage errors when Y and X-register records have different values of Z. Consequently, we shall assume that Z is measured without error on both the Y-register and the X-register. With this set up, data linkage errors only occur among records in both registers in the same block.

This property of the blocking variable Z indicates a subtle but key difference between the use of the blocking concept in our development and its use in data linkage methodology. In the latter case, blocking variables define stages (or ‘passes’) in the linkage process, where at any particular stage matching is carried out with respect to a particular blocking variable. That is, only those remaining unmatched records at this stage with the same value for this blocking variable are considered as potential matches. However, once all matches at a particular stage of the process are declared, all remaining unmatched records are then considered as candidates for matching at the next stage using another blocking variable. Consequently it is quite possible that links can be created between Y and X-register records that have different values for any particular blocking variable. In our case the blocking variable Z is an ex-post construct. It defines a partition of the declared links into groups such that all linkage errors are isolated within the groups – there are no errors that ‘cross’ group boundaries. Clearly Z can be defined in terms of the blocking variables used in creating the links, but there is no fundamental requirement for this. The main requirement is that Z partitions both the Y and X-registers so that all (or virtually all) linkage errors are confined to the groups of records defined by the distinct values of this variable.

Without loss of generality, we denote the $Q$ distinct values taken by $Z$ by $1, 2, \ldots, Q$. Let block $q$ correspond to the $M_q$ population units with $Z = q$, so $N = \sum_q M_q$. Since $Z$ is measured without error in both registers, and linkage is complete, the number of records in block $q$ in each register is the same, i.e. $M_q$.

Let i index the records in the linked data set. Again, without loss of generality we assume that this index is the same as the one used to index the X-register, i.e. the linkage process associates a record from the Y-register, with its associated Y-value, with each record on the X-register. In block q we then have M_q linked data pairs (y_i∗, X_i), where y_i∗ denotes the Y -value from block q on the Y-register that is matched to X_i. More accurately, the record with

X-register. We use $y_q^*$ to denote the vector of order $M_q$ defined by the linked values $y_i^*$ in block $q$ and $X_q$ as the matrix with rows defined by the values $X_i$ in the same block. Also, let $y_q$ denote the unknown vector of order $M_q$ with entries indexed as in the X-register that corresponds to the ‘true’ $Y$ values associated with $X_q$.

Since one and only one of the M_q records in block q on the Y-register can be matched to each distinct record in the corresponding block on the X-register, we model randomness in the outcome of the linkage process via the identity

$$y_q^* = A_q y_q \qquad (1)$$

where $A_q = [a_{ij}^q]$ is an unknown random permutation matrix of order $M_q$. Note that the entries $a_{ij}^q$ of $A_q$ are either zero or one, with a value of one occurring just once in each row and column. Also, since we are assuming that linkage errors are confined to blocks, it is natural to impose the condition that $A_{q_1}$ and $A_{q_2}$ are independently distributed when $q_1 \neq q_2$.
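As a small illustration of identity (1), the following Python sketch builds one random permutation matrix for a single block and applies it to a vector of ‘true’ values. It assumes NumPy is available; the block size, the seed and the variable names are illustrative choices and do not come from the report. A uniformly random permutation is used only for convenience here, whereas a real linkage process would produce far fewer errors.

```python
import numpy as np

# Hypothetical block of M_q = 5 records; all names are illustrative.
rng = np.random.default_rng(1)
M_q = 5

# Row i of A_q has its single 1 in column perm[i], so record i on the
# X-register is linked to Y-register record perm[i].
perm = rng.permutation(M_q)
A_q = np.eye(M_q)[perm]

y_q = np.arange(10.0, 10.0 + M_q)   # 'true' Y values, indexed as in the X-register
ystar_q = A_q @ y_q                 # linked values, identity (1)

# Number of correctly linked records is the trace of A_q.
n_correct = int(np.trace(A_q))
```

Because `A_q` is a permutation matrix, `A_q @ A_q.T` is the identity, a property that is used repeatedly in the variance calculations of section 2.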

Clearly inference based on linked data will involve assumptions about the distribution of the $A_q$. In this report we assume that linkage is non-informative at each level of $Z$, i.e. the distribution of $A_q$ is independent of $y_q$ given $X_q$. Let

$$E(A_q \mid X_q) = E_q. \qquad (2)$$

Given the care that typically goes into the construction of a linked data set, it seems reasonable that a ‘declared’ link is more likely to be correct than incorrect. Although the probability that such a link is correct will typically vary between the records that make up the linked dataset, as a first approximation we assume that the probability of correct linkage is the same for all records in a block. We also assume that it is equally likely that any two Y -records in the same block that are not linked to a particular X-record in that block could in fact be the ‘correct’ link for this record. A simple way of characterising both of these assumptions is via an exchangeable linkage errors model, where for each value of q

$$\Pr(\text{correct linkage}) = \Pr(a_{ii}^q = 1) = \lambda_q \qquad (3)$$

and, for $i \neq j$,

$$\Pr(\text{incorrect linkage}) = \Pr(a_{ij}^q = 1) = \gamma_q. \qquad (4)$$

Given (3) and (4) hold, it follows that (2) is then of the form

$$E_q = (\lambda_q - \gamma_q) I_q + \gamma_q \mathbf{1}_q \mathbf{1}_q^T \qquad (5)$$

where $I_q$ is the identity matrix of order $M_q$ and $\mathbf{1}_q$ denotes a vector of ones of length $M_q$. Since $\mathbf{1}_q^T A_q = \mathbf{1}_q^T$ and $A_q \mathbf{1}_q = \mathbf{1}_q$ we have $\mathbf{1}_q^T E_q = \mathbf{1}_q^T$ and $E_q \mathbf{1}_q = \mathbf{1}_q$, which means that (5) implies

$$\lambda_q + (M_q - 1)\gamma_q = 1. \qquad (6)$$

In other words, we just need to specify $\lambda_q$ in order to completely specify the first order properties of the linkage mechanism under the model (5). This will be particularly useful later since estimation of $\lambda_q$ requires only that we know whether a defined link is correct or incorrect, and not the identity of the correct link.
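The construction of the expected linkage matrix under the exchangeable model can be sketched in a few lines of Python. This is a minimal illustration assuming NumPy; the function name `exchangeable_E` and the example values are hypothetical and are not taken from the report. The off-diagonal probability is obtained from the constraint (6).

```python
import numpy as np

def exchangeable_E(M_q, lam_q):
    """Expected linkage matrix (5) for a block of size M_q >= 2.

    gamma_q is not a free parameter: it follows from constraint (6),
    lam_q + (M_q - 1) * gamma_q = 1.
    """
    if M_q < 2:
        raise ValueError("a block needs at least two records for linkage error")
    gamma_q = (1.0 - lam_q) / (M_q - 1)
    return (lam_q - gamma_q) * np.eye(M_q) + gamma_q * np.ones((M_q, M_q))

# Illustrative values: a block of 5 records with a 90% correct-link rate.
E_q = exchangeable_E(5, 0.9)
row_sums = E_q.sum(axis=1)   # each row sums to 1, as required by E_q 1_q = 1_q
```

Only `lam_q` and the block size are needed as inputs, which reflects the point made above that the first order properties of the linkage mechanism are fully determined by the correct-link probability.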

The model specified by (3) and (4) represents what is probably the simplest way of characterising the behaviour of a probability-based linkage process, and will form the basis for the theory developed in this report. It was originally suggested by Neter et al. (1965) in a groundbreaking paper that investigated its use in assessing the impact of linkage error on response error analysis, where alternative data sources were linked to respondent records in order to assess the extent of response error in these records. As these authors note, and as we shall see in the next section, the impact of linkage error defined by (3) and (4) is to attenuate the relationship between the study variable (in their case the difference between the survey value and the linked alternative value) and explanatory covariates.

Depending on the available information from the operation of the linkage process, more sophisticated models for linkage error can be formulated. For example, it may be the case that the Y and X-registers are ordered so that only ‘nearby’ records in the linked data can possibly correspond to the correct link. This can be modelled by replacing (4) by

$$\Pr(\text{incorrect linkage}) = \Pr(a_{ij}^q = 1) = \begin{cases} \dfrac{1 - \lambda_q}{2 m_q} & \text{if } |j - i| \le m_q \\ 0 & \text{otherwise} \end{cases}$$

with appropriate modifications for values of i close to either the beginning or the end of the X-register. Here 2m_q denotes the number of nearest neighbours to y_i∗ in the linked data set that can actually contain the correct value y_i.

Another extension is where there exists another variable on the X-register, say W, with values w_i that vary within a block, such that the probability of a correct link depends on these values. For example, we could have

$$\Pr(\text{correct linkage}) = \Pr(a_{ii}^q = 1) = p(w_i, w_i; \lambda_q)$$

and, for $i \neq j$,

$$\Pr(\text{incorrect linkage}) = \Pr(a_{ij}^q = 1) = p(w_i, w_j; \lambda_q)$$

where $p(w_i, w_j; \lambda_q)$ is a function that (i) takes values in the interval $[0, 1]$; (ii) is maximised when $w_i = w_j$; and (iii) satisfies $\sum_{j=1}^{M_q} p(w_i, w_j; \lambda_q) = 1$. An obvious candidate function in this case is where $p(w_i, w_j; \lambda_q)$ is proportional to $\exp(-\lambda_q |w_i - w_j|)$. Note however that if $W$ is categorical and available on both registers then by including it in the definition of the blocking variable $Z$ we recover the situation implicit in the exchangeable linkage errors model, where all linked records in the same block have the same probability of being incorrectly linked.
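The candidate function mentioned above can be illustrated with a short Python sketch, assuming NumPy. The function name `distance_link_probs` and the example values of W are hypothetical. The sketch normalises each row so that condition (iii) holds; it does not address whether the resulting matrix is a valid expectation of a permutation matrix, which would also require the columns to sum to one.

```python
import numpy as np

def distance_link_probs(w, lam_q):
    """Row-normalised kernel p(w_i, w_j; lam_q) proportional to
    exp(-lam_q * |w_i - w_j|), for one block.

    Row i holds the probabilities that X-register record i is linked
    to each Y-register record j in the block.
    """
    w = np.asarray(w, dtype=float)
    kernel = np.exp(-lam_q * np.abs(w[:, None] - w[None, :]))
    return kernel / kernel.sum(axis=1, keepdims=True)

# Illustrative block of three records with W values 0, 1 and 3.
P = distance_link_probs([0.0, 1.0, 3.0], lam_q=1.0)
```

Larger values of `lam_q` concentrate each row on the diagonal, so records with distinctive W values are more likely to be correctly linked under this sketch.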

1.2 Research questions

Given the preceding development, there are a number of questions that immediately arise.

1. What are the properties of the estimator of β based on the linked data that assumes all linkages are correct?

The methodological framework described in the previous sub-section was based on a number of strong assumptions about the linkage process that will typically be violated. As a consequence, we can ask further questions.

3. How do we need to modify our inference when linkage is incomplete (i.e. there are unlinked records in one or both of the Y and X-registers)?

4. What happens when one or both of the Y and X-registers are based on sample survey data? How do we integrate sample selection and linkage in inference?

5. We have assumed that all components of X are on one register. What happens if some components are actually held on the Y-register? More generally, what happens if components of X are held on different registers and these are linked either prior to the linkage to the Y-register or simultaneously with the linkage to the Y-register?

In the rest of this report we develop some theory that may help in answering these questions.

2 Linear regression using linked data

In this section we consider the situation where the widely used linear regression model is the focus of inference. That is, the population values of Y and X in each block (i.e. those associated with population units with the same value of Z) satisfy

$$E_X(y_q) = X_q \beta = f_q \qquad (7)$$

$$\mathrm{Var}_X(y_q) = \sigma^2 I_q \qquad (8)$$

where we use a subscript of $X$ to denote conditioning on the value $X_q$ of the explanatory variables in block $q$. Note that in addition to the regression parameter $\beta$ in (7), which is the target of inference, (8) now includes an unknown scale parameter $\sigma^2$. Given the $y_q$ and $X_q$, the optimal estimator of $\beta$ is then its Ordinary Least Squares (OLS) estimator

$$\hat{\beta} = \Big( \sum_q X_q^T X_q \Big)^{-1} \Big( \sum_q X_q^T y_q \Big). \qquad (9)$$

Unfortunately, unless the linkage is perfect, (9) cannot be calculated. Instead, what is usually done is to substitute the linked data values $y_q^*$ for $y_q$ in this expression, which leads to the naïve linked data OLS estimator

$$\hat{\beta}^* = \Big( \sum_q X_q^T X_q \Big)^{-1} \Big( \sum_q X_q^T y_q^* \Big). \qquad (10)$$

2.1 Bias-corrected OLS inference

Under the linkage error model (1) it is easy to see that (10) is actually

$$\hat{\beta}^* = \Big( \sum_q X_q^T X_q \Big)^{-1} \Big( \sum_q X_q^T A_q y_q \Big).$$

Under non-informative linkage

$$E_X(A_q y_q) = E_X(A_q) E_X(y_q) = E_q f_q$$

so

$$E_X(\hat{\beta}^*) = \Big( \sum_q X_q^T X_q \Big)^{-1} \Big( \sum_q X_q^T E_q X_q \Big) \beta = D\beta. \qquad (11)$$

That is, the naïve OLS estimator (10) based on the linked data set is biased. Provided $E_q$ is known and the inverse of the matrix $D$ in (11) exists, an unbiased estimator of $\beta$ in this situation is

$$\hat{\beta}_R = D^{-1} \hat{\beta}^* = \Big\{ \Big( \sum_q X_q^T X_q \Big)^{-1} \Big( \sum_q X_q^T E_q X_q \Big) \Big\}^{-1} \hat{\beta}^*$$

which, since $\sum_q X_q^T E_q X_q$ is then of full rank, reduces to

$$\hat{\beta}_R = \Big( \sum_q X_q^T E_q X_q \Big)^{-1} \Big( \sum_q X_q^T y_q^* \Big). \qquad (12)$$

Note that the subscript of R used to denote the estimator defined by (12) serves as a reminder that this estimator is based on a ratio-type correction for the bias in the naive estimator (10).
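The relationship between the naive estimator (10), the matrix D in (11) and the ratio-corrected estimator (12) can be sketched in Python as follows. This is an illustration only, assuming NumPy and the exchangeable model (5). The function names, the simulated data and the way linkage errors are injected (a cyclic shift of Y values among randomly chosen records) are assumptions made for the example and are not taken from the report.

```python
import numpy as np

def exchangeable_E(M, lam):
    # E_q from (5), with gamma_q implied by (6)
    gamma = (1.0 - lam) / (M - 1)
    return (lam - gamma) * np.eye(M) + gamma * np.ones((M, M))

def naive_and_ratio_corrected(X_blocks, ystar_blocks, lam_blocks):
    """Return the naive estimator (10), the ratio-corrected estimator (12)
    and the bias matrix D of (11), accumulating sums over blocks."""
    p = X_blocks[0].shape[1]
    XtX, XtEX, Xty = np.zeros((p, p)), np.zeros((p, p)), np.zeros(p)
    for X, ystar, lam in zip(X_blocks, ystar_blocks, lam_blocks):
        E = exchangeable_E(X.shape[0], lam)
        XtX += X.T @ X
        XtEX += X.T @ E @ X
        Xty += X.T @ ystar
    beta_naive = np.linalg.solve(XtX, Xty)    # (10)
    beta_R = np.linalg.solve(XtEX, Xty)       # (12)
    D = np.linalg.solve(XtX, XtEX)            # (11)
    return beta_naive, beta_R, D

# Hypothetical simulated data: three blocks with different linkage quality.
rng = np.random.default_rng(0)
beta_true = np.array([1.0, 2.0])
X_blocks, ystar_blocks, lam_blocks = [], [], []
for lam in (0.95, 0.8, 0.6):
    M = 200
    X = np.column_stack([np.ones(M), rng.normal(size=M)])
    y = X @ beta_true + rng.normal(scale=0.5, size=M)
    # Crude stand-in for linkage error: pick records with probability
    # 1 - lam and shift their Y values cyclically among themselves.
    wrong = np.flatnonzero(rng.random(M) > lam)
    ystar = y.copy()
    if wrong.size > 1:
        ystar[wrong] = y[np.roll(wrong, 1)]
    X_blocks.append(X)
    ystar_blocks.append(ystar)
    lam_blocks.append(lam)

beta_naive, beta_R, D = naive_and_ratio_corrected(X_blocks, ystar_blocks, lam_blocks)
```

By construction `D @ beta_R` reproduces `beta_naive`, which is the algebraic content of the ratio-type correction. In simulations of this kind one would expect the naive slope to be attenuated towards zero relative to `beta_true`, with `beta_R` reversing that attenuation on average.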

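We use an iterated expectation argument to obtain the variance of $\hat{\beta}_R$. To start, observe that

$$\mathrm{Var}_X(\hat{\beta}_R) = D^{-1} \mathrm{Var}_X(\hat{\beta}^*) (D^{-1})^T$$

where

$$\mathrm{Var}_X(\hat{\beta}^*) = E_X\{ \mathrm{Var}_{AX}(\hat{\beta}^*) \} + \mathrm{Var}_X\{ E_{AX}(\hat{\beta}^*) \}.$$

Here a subscript of $AX$ denotes conditioning on both $A_q$ and $X_q$, so

$$E_{AX}(\hat{\beta}^*) = \Big( \sum_q X_q^T X_q \Big)^{-1} \Big( \sum_q X_q^T A_q X_q \Big) \beta$$

and

$$\mathrm{Var}_{AX}(\hat{\beta}^*) = \sigma^2 \Big( \sum_q X_q^T X_q \Big)^{-1} \Big( \sum_q X_q^T A_q A_q^T X_q \Big) \Big( \sum_q X_q^T X_q \Big)^{-1} = \sigma^2 \Big( \sum_q X_q^T X_q \Big)^{-1}$$

since $A_q A_q^T = I_q$. Put $V_q = \mathrm{Var}_X(A_q X_q \beta) = \mathrm{Var}_X(A_q f_q)$. It follows that

$$\mathrm{Var}_X(\hat{\beta}_R) = D^{-1} \Big( \sum_q X_q^T X_q \Big)^{-1} \Big\{ \sum_q X_q^T (\sigma^2 I_q + V_q) X_q \Big\} \Big( \sum_q X_q^T X_q \Big)^{-1} (D^{-1})^T$$

$$= \Big( \sum_q X_q^T E_q X_q \Big)^{-1} \Big\{ \sum_q X_q^T (\sigma^2 I_q + V_q) X_q \Big\} \Big( \sum_q X_q^T E_q X_q \Big)^{-1}. \qquad (13)$$

An estimator of (13) is then

$$\hat{V}_X(\hat{\beta}_R) = \Big( \sum_q X_q^T E_q X_q \Big)^{-1} \Big\{ \sum_q X_q^T (\hat{\sigma}^2 I_q + \hat{V}_q) X_q \Big\} \Big( \sum_q X_q^T E_q X_q \Big)^{-1} \qquad (14)$$

where $\hat{\sigma}^2$ and $\hat{V}_q$ are estimates of $\sigma^2$ and $V_q$ respectively.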

In order to define these estimates, we note that after some simplification, and using the fact that $A_q^T A_q = I_q$,

$$E_X \sum_q (y_q^* - f_q)^T (y_q^* - f_q) = E_X \sum_q \big\{ y_q^T y_q - y_q^T f_q - f_q^T y_q + f_q^T f_q - 2 f_q^T (A_q - I_q) y_q \big\}$$

$$= N\sigma^2 - 2 \sum_q f_q^T (E_q - I_q) f_q.$$

It follows that when $f_q$ and $E_q$ are known,

$$\hat{\sigma}^2 = N^{-1} \Big\{ \sum_q (y_q^* - f_q)^T (y_q^* - f_q) - 2 \sum_q f_q^T (I_q - E_q) f_q \Big\} \qquad (15)$$

is an unbiased method of moments estimator of $\sigma^2$ under the linkage errors model (1) and the linear model specified by (7) and (8). Note that (15) can take negative values. In practice, we replace $f_q$ by $\hat{f}_q = X_q \hat{\beta}_R$ in (15) to then obtain a consistent estimator of $\sigma^2$. Development of an expression for $\hat{V}_q$ is somewhat more complicated. In Appendix 1 we show that a large $M_q$ approximation to $V_q$ given a simple second-order extension of the exchangeable linkage errors model defined by (3) and (4) is

$$V_q \approx \mathrm{diag}\Big[ (1 - \lambda_q) \big\{ \lambda_q (f_i - \bar{f}_q)^2 + \bar{f}_q^{(2)} - \bar{f}_q^2 \big\} \Big] \qquad (16)$$

where $f_q = (f_i)$ and $\bar{f}_q$, $\bar{f}_q^{(2)}$ denote the block $q$ averages of the components of $f_q$ and their squares respectively. In order to calculate $\hat{V}_q$ we replace these components in (16) by their estimated values.
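The two plug-in quantities needed for the variance estimator (14) can be sketched in Python as below, assuming NumPy. The function names `sigma2_mom` and `vq_diag` are hypothetical. The first follows the method of moments estimator (15) and the second returns the diagonal of the large-block approximation (16); in practice the fitted values would be plugged in for `f_q`.

```python
import numpy as np

def sigma2_mom(ystar_blocks, f_blocks, E_blocks):
    """Method of moments estimator (15) of sigma^2.

    Each argument is a list with one entry per block: linked values
    y*_q, (fitted) means f_q and expected linkage matrices E_q.
    The result can be negative, as noted in the text.
    """
    N = sum(len(ystar) for ystar in ystar_blocks)
    total = 0.0
    for ystar, f, E in zip(ystar_blocks, f_blocks, E_blocks):
        resid = ystar - f
        total += resid @ resid - 2.0 * f @ (np.eye(len(f)) - E) @ f
    return total / N

def vq_diag(f_q, lam_q):
    """Diagonal entries of the large-block approximation (16) to V_q."""
    f_q = np.asarray(f_q, dtype=float)
    f_bar = f_q.mean()            # block average of the f_i
    f2_bar = (f_q ** 2).mean()    # block average of the squared f_i
    return (1.0 - lam_q) * (lam_q * (f_q - f_bar) ** 2 + f2_bar - f_bar ** 2)

# Illustrative call: two fitted values and a 50% correct-link rate.
v = vq_diag([0.0, 2.0], lam_q=0.5)
```

When `lam_q` equals one the diagonal returned by `vq_diag` is zero and `sigma2_mom` with identity `E_q` reduces to the usual mean squared residual, which is the behaviour expected when there is no linkage error.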

The approach to linear regression estimation using probability-linked data described above is in the spirit of Scheuren and Winkler (1993), where it is suggested that one corrects the naive estimator using an estimate of its bias under an appropriate model for the linkage error process. In our case the ratio-type adjustment we use for this purpose depends on knowing (or having good estimates of) the parameters (i.e. the $\lambda_q$) that characterise this process. As noted earlier, all that is required to estimate these parameters is access to a random ‘audit’ sample of the linked records in each block where the only thing we need to know is whether the declared links are correct or not. This could also be done by comparison with the output from a ‘gold standard’ (e.g. clerical) linkage operation carried out on this sample of records.
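A simple way to turn such an audit sample into block-level estimates is the observed proportion of correct links in each block. The Python sketch below assumes NumPy; the function name `estimate_lambda` and the input layout (one block label and one correct/incorrect flag per audited record) are assumptions made for illustration and are not prescribed by the report.

```python
import numpy as np

def estimate_lambda(block_ids, link_is_correct):
    """Estimate lambda_q as the share of audited links in block q
    that were found to be correct.

    block_ids       : block label of each audited linked record
    link_is_correct : 1/True if the audit confirmed the link, else 0/False
    """
    block_ids = np.asarray(block_ids)
    ok = np.asarray(link_is_correct, dtype=float)
    return {q: float(ok[block_ids == q].mean())
            for q in np.unique(block_ids).tolist()}

# Illustrative audit: four records from block 1, two from block 2.
lam_hat = estimate_lambda([1, 1, 1, 1, 2, 2], [1, 1, 1, 0, 1, 0])
```

With small audit samples these proportions can be very noisy, so some pooling of blocks with similar linkage quality may be worth considering before the estimates are used in the adjusted estimators.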

2.2 Efficient linear estimation using linked data

An alternative approach to fitting a linear model using the probability-linked data is based on directly modelling the regression relationship between the linked values $y_q^*$ and the values in $X_q$. Since $y_q^* = A_q y_q$, and $A_q$ and $y_q$ are independently distributed given $X_q$, it follows that

$$E_X(y_q^*) = E_X(A_q) E_X(y_q) = E_q X_q \beta = H_q \beta. \qquad (17)$$

That is, the $y_q^*$ also follow a linear model with regression coefficient $\beta$ but with a modified set of explanatory variables $H_q$ in block $q$. Lahiri and Larsen (2005) note this relationship and suggest estimation of $\beta$ using the OLS estimator for this situation,

$$\hat{\beta}_A = \Big( \sum_q H_q^T H_q \Big)^{-1} \Big( \sum_q H_q^T y_q^* \Big) = \Big( \sum_q X_q^T E_q^T E_q X_q \Big)^{-1} \Big( \sum_q X_q^T E_q^T y_q^* \Big). \qquad (18)$$

However, the optimality of this estimator depends on the regression errors under (17) being homoskedastic. It is easy to see that this condition generally does not hold, since implicit in the development leading to (13) is the fact that

$$\mathrm{Var}_X(y_q^*) = \sigma^2 I_q + V_q = \Sigma_q \qquad (19)$$

which implies that the variances of the regression errors defined by the linked data vary between blocks. The Best Linear Unbiased Estimator (BLUE) for β given these data is

$$\hat{\beta}_C = \Big( \sum_q H_q^T \Sigma_q^{-1} H_q \Big)^{-1} \Big( \sum_q H_q^T \Sigma_q^{-1} y_q^* \Big) = \Big( \sum_q X_q^T E_q^T \Sigma_q^{-1} E_q X_q \Big)^{-1} \Big( \sum_q X_q^T E_q^T \Sigma_q^{-1} y_q^* \Big). \qquad (20)$$

Note that (20) depends on $\Sigma_q$, and hence on $\sigma^2$ and $\beta$. Its ‘empirical’ (EBLUE) version is defined by substituting estimates for these parameters and iterating, using the estimate (15) for $\sigma^2$ developed in the previous sub-section, combined with the estimate of $\beta$ defined by (20).
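The estimators (18) and (20) differ only in the weighting applied within each block, which the following Python sketch makes explicit. It assumes NumPy; the function names `beta_A` and `beta_C` and the toy data are illustrative. The matrices `E_q` and `Sigma_q` are taken as given here, so the sketch shows a single evaluation of (20) rather than the iterated EBLUE described above.

```python
import numpy as np

def beta_A(X_blocks, ystar_blocks, E_blocks):
    """OLS on the modified regressors H_q = E_q X_q, estimator (18)."""
    p = X_blocks[0].shape[1]
    lhs, rhs = np.zeros((p, p)), np.zeros(p)
    for X, ystar, E in zip(X_blocks, ystar_blocks, E_blocks):
        H = E @ X
        lhs += H.T @ H
        rhs += H.T @ ystar
    return np.linalg.solve(lhs, rhs)

def beta_C(X_blocks, ystar_blocks, E_blocks, Sigma_blocks):
    """Weighted version (20), using Sigma_q^{-1} as the block weight."""
    p = X_blocks[0].shape[1]
    lhs, rhs = np.zeros((p, p)), np.zeros(p)
    for X, ystar, E, Sigma in zip(X_blocks, ystar_blocks, E_blocks, Sigma_blocks):
        H = E @ X
        lhs += H.T @ np.linalg.solve(Sigma, H)      # H^T Sigma^{-1} H
        rhs += H.T @ np.linalg.solve(Sigma, ystar)  # H^T Sigma^{-1} y*
    return np.linalg.solve(lhs, rhs)

# Toy check with no linkage error: both estimators reduce to OLS.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])        # exactly 1 + 2x
b_A = beta_A([X], [y], [np.eye(3)])
b_C = beta_C([X], [y], [np.eye(3)], [np.eye(3)])
```

An EBLUE-style procedure would wrap `beta_C` in a loop that recomputes the estimates of $\sigma^2$ and $V_q$ from the current fitted values until the coefficient estimates stabilise.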

Standard plug-in ‘sandwich-type’ estimators of the variances of (18) and (20) are easily developed using the estimates $\hat{\sigma}^2$ and $\hat{V}_q$ developed in the previous sub-section. These are

$$\hat{V}_X(\hat{\beta}_A) = \Big( \sum_q X_q^T E_q^T E_q X_q \Big)^{-1} \Big\{ \sum_q X_q^T E_q^T (\hat{\sigma}^2 I_q + \hat{V}_q) E_q X_q \Big\} \Big( \sum_q X_q^T E_q^T E_q X_q \Big)^{-1} \qquad (21)$$

in the case of (18) and

$$\hat{V}_X(\hat{\beta}_C) = \Big\{ \sum_q X_q^T E_q^T (\hat{\sigma}^2 I_q + \hat{V}_q)^{-1} E_q X_q \Big\}^{-1} \qquad (22)$$

in the case of (20).

Note that such ‘plug-in’ estimators ignore the contribution to the variance associated with estimation of the linkage model parameters and hence may be biased low. This issue is further discussed in section 3.3.

2.3 Maximum likelihood using linked data

An alternative approach to constructing an efficient estimator of $\beta$ given the linked data is to use the Missing Information Principle or MIP (Orchard and Woodbury, 1972) to derive the maximum likelihood estimator of this parameter given the linked data. In order to do so, we extend the linear model (7) and (8) to include an assumption of normality. That is, given $X_q$, we assume that $y_q \sim N(f_q, \sigma^2 I_q)$. When the $y_q$ are known, the score function for $\beta$ and $\sigma^2$ has components

$$sc_1 = \frac{1}{\sigma^2} \sum_q X_q^T (y_q - f_q) \qquad (23)$$

and

$$sc_2 = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_q (y_q - f_q)^T (y_q - f_q). \qquad (24)$$

In order to apply the MIP, we replace (23) and (24) by their conditional expectations given $y_q^*$ and $X_q$. Using an iterated expectations argument again, we see that

$$\mathrm{Cov}_X(y_q, y_q^*) = \sigma^2 E_X(A_q^T) + \mathrm{Cov}_X(f_q, A_q f_q) = \sigma^2 E_q^T.$$

Combining this result with (17) and (19), it follows that

$$\begin{pmatrix} y_q \\ y_q^* \end{pmatrix} \Big|\, X_q \sim N\left( \begin{pmatrix} f_q \\ E_q f_q \end{pmatrix}, \; \sigma^2 \begin{pmatrix} I_q & E_q^T \\ E_q & \Sigma_q \end{pmatrix} \right)$$

and so

$$E_X(y_q \mid y_q^*) = f_q + E_q^T \Sigma_q^{-1} (y_q^* - E_q f_q) = \hat{y}_q$$

and

$$\mathrm{Var}_X(y_q \mid y_q^*) = \sigma^2 (I_q - E_q^T \Sigma_q^{-1} E_q).$$

We therefore replace (23) by

$$sc_1^* = \frac{1}{\sigma^2} \sum_q X_q^T (\hat{y}_q - f_q) = \frac{1}{\sigma^2} \sum_q X_q^T E_q^T \Sigma_q^{-1} (y_q^* - E_q f_q) \qquad (25)$$

and, since $y_q^T y_q = y_q^{*T} y_q^*$, we replace (24) by

$$sc_2^* = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_q \big( y_q^{*T} y_q^* - 2 f_q^T \hat{y}_q + f_q^T f_q \big)$$

$$= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_q \big\{ (y_q^* - f_q)^T (y_q^* - f_q) - 2 f_q^T (\hat{y}_q - y_q^*) \big\}. \qquad (26)$$

The MLEs for $\beta$ and $\sigma^2$ are defined by setting (25) and (26) to zero and solving for these parameters. Since $\hat{y}_q$ is a function of $\beta$ and $\sigma^2$ this needs to be done numerically. Note that the solution to setting (25) to zero is the BLUE (20). Since the MLE for $\sigma^2$ obtained by setting (26) to zero is not the same as the method of moments estimator (15), the MLE and the EBLUE for $\beta$ will not be the same. However, they are typically very close.

In order to estimate the variances and covariances of these MLEs, we calculate the matrix-valued observed information function corresponding to the MIP-based score function for these parameters and invert it. This can be done either by numerically differentiating (25) and (26), or by using the MIP information identity. This identity states that the information function for $\beta$ and $\sigma^2$ given the linked data is the conditional expectation of the ‘$y_q$ known’ information function given the linked data minus the conditional variance of the ‘$y_q$ known’ score function given the linked data. Denoting conditioning on the linked data $(y_q^*;\ q = 1, 2, \ldots, Q)$ by a superscript of *, the information function generated by these data is

$$\mathrm{info}^* = \begin{pmatrix} E_X^*(\mathrm{info}_{11}) & E_X^*(\mathrm{info}_{12}) \\ E_X^*(\mathrm{info}_{21}) & E_X^*(\mathrm{info}_{22}) \end{pmatrix} - \begin{pmatrix} \mathrm{Var}_X^*(sc_1) & \mathrm{Cov}_X^*(sc_1, sc_2) \\ \mathrm{Cov}_X^*(sc_2, sc_1) & \mathrm{Var}_X^*(sc_2) \end{pmatrix} \tag{27}$$

where

$$E_X^*(\mathrm{info}_{11}) = \frac{1}{\sigma^2} \sum_q X_q^T X_q$$

$$E_X^*(\mathrm{info}_{22}) = -\frac{N}{2\sigma^4} + \frac{1}{\sigma^6} \sum_q \left\{ (y_q^* - f_q)^T (y_q^* - f_q) - 2 f_q^T (\hat{y}_q - y_q^*) \right\}$$

$$E_X^*(\mathrm{info}_{12}) = E_X^*(\mathrm{info}_{21}) = \frac{1}{\sigma^4} \sum_q X_q^T (\hat{y}_q - f_q)$$

$$\mathrm{Var}_X^*(sc_1) = \frac{1}{\sigma^4} \sum_q X_q^T\, \mathrm{Var}_X(y_q \mid y_q^*)\, X_q = \frac{1}{\sigma^2} \sum_q X_q^T \left( I_q - E_q^T \Sigma_q^{-1} E_q \right) X_q$$

$$\begin{aligned} \mathrm{Cov}_X^*(sc_1, sc_2) &= \frac{1}{2\sigma^6} \sum_q \mathrm{Cov}_X\left\{ X_q^T (y_q - f_q),\ (y_q - f_q)^T (y_q - f_q) \mid y_q^* \right\} \\ &= \frac{1}{2\sigma^6} \sum_q \mathrm{Cov}_X\left( X_q^T y_q,\ -2 f_q^T y_q \mid y_q^* \right) \\ &= -\frac{1}{\sigma^6} \sum_q X_q^T\, \mathrm{Var}_X(y_q \mid y_q^*)\, f_q \\ &= -\frac{1}{\sigma^4} \sum_q X_q^T \left( I_q - E_q^T \Sigma_q^{-1} E_q \right) f_q \end{aligned}$$

and


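$$\begin{aligned} \mathrm{Var}_X^*(sc_2) &= \frac{1}{4\sigma^8} \sum_q \mathrm{Var}_X\left\{ (y_q - f_q)^T (y_q - f_q) \mid y_q^* \right\} \\ &= \frac{1}{4\sigma^8} \sum_q \mathrm{Var}_X\left\{ y_q^T y_q - y_q^T f_q - f_q^T y_q + f_q^T f_q \mid y_q^* \right\} \\ &= \frac{1}{\sigma^8} \sum_q \mathrm{Var}_X\left\{ f_q^T y_q \mid y_q^* \right\} \\ &= \frac{1}{\sigma^6} \sum_q f_q^T \left( I_q - E_q^T \Sigma_q^{-1} E_q \right) f_q. \end{aligned}$$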

The observed information for $\beta$ and $\sigma^2$ is the value of $\mathrm{info}^*$ at the values of the MLEs for these parameters. The inverse of this matrix is then used as an estimate of the variance/covariance matrix of these estimators. Note that the value of the matrix

$$\begin{pmatrix} \mathrm{Var}_X^*(sc_1) & \mathrm{Cov}_X^*(sc_1, sc_2) \\ \mathrm{Cov}_X^*(sc_2, sc_1) & \mathrm{Var}_X^*(sc_2) \end{pmatrix}$$

at the MLEs for $\beta$ and $\sigma^2$ is a measure of the information loss caused by incorrect linkage.
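As a rough illustration of this information-loss matrix, the sketch below assembles its blocks from the closed forms for $\mathrm{Var}_X^*(sc_1)$, $\mathrm{Cov}_X^*(sc_1, sc_2)$ and $\mathrm{Var}_X^*(sc_2)$, assuming $f_q = X_q\beta$. The function name, the exchangeable $E_q$ and the placeholder $\Sigma_q$ are assumptions made for the example only.

```python
import numpy as np

def information_loss(blocks, beta, sigma2):
    # Builds the (p+1) x (p+1) matrix of conditional score variances and covariances.
    p = len(beta)
    loss = np.zeros((p + 1, p + 1))
    for X, E, Sigma in blocks:
        f = X @ beta
        # R_q = I_q - E_q^T Sigma_q^{-1} E_q
        R = np.eye(X.shape[0]) - E.T @ np.linalg.solve(Sigma, E)
        loss[:p, :p] += X.T @ R @ X / sigma2      # Var*(sc1)
        cross = -(X.T @ R @ f) / sigma2**2        # Cov*(sc1, sc2)
        loss[:p, p] += cross
        loss[p, :p] += cross
        loss[p, p] += f @ R @ f / sigma2**3       # Var*(sc2)
    return loss

# Illustrative inputs (assumed, not from the report).
M, lam = 5, 0.85
gamma = (1.0 - lam) / (M - 1)
E = (lam - gamma) * np.eye(M) + gamma * np.ones((M, M))  # hypothetical E_q
Sigma = 1.2 * np.eye(M)                                   # placeholder Sigma_q
X = np.column_stack([np.ones(M), np.linspace(-1.0, 1.0, M)])
beta = np.array([0.5, 1.5])

loss = information_loss([(X, E, Sigma)], beta, sigma2=1.0)
loss_perfect = information_loss([(X, np.eye(M), np.eye(M))], beta, sigma2=1.0)
print(loss)
print(loss_perfect)
```

With $E_q = I_q$ and $\Sigma_q = I_q$ the matrix $I_q - E_q^T \Sigma_q^{-1} E_q$ vanishes, so the computed loss is zero in that case.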

2.4 A fixed population approach

Suppose that we have perfectly linked data. The efficient estimator of the regression parameter $\beta$ is then the ‘$y_q$ known’ OLS estimator

$$B = \left( \sum_q X_q^T X_q \right)^{-1} \left( \sum_q X_q^T y_q \right). \tag{28}$$

So far, our emphasis has been on estimation of $\beta$. However, it is legitimate to also consider prediction of $B$ given the fixed finite population of Y and X-values that define the Y and X-registers. In this context, we denote conditioning on these values (i.e. on the values of $y_q$ and $X_q$) by a subscript of $YX$ and look for a predictor $\hat{B}$ of $B$ that satisfies (over repeated applications of the probability linkage process)

$$E_{YX}(\hat{B}) = B. \tag{29}$$

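Note that none of $\hat{\beta}_R$, $\hat{\beta}_A$ and $\hat{\beta}_C$ satisfy (29) since we have

$$E_{YX}(\hat{\beta}_R) = \left( \sum_q X_q^T E_q X_q \right)^{-1} \left( \sum_q X_q^T E_q y_q \right) \neq B,$$

$$E_{YX}(\hat{\beta}_A) = \left( \sum_q X_q^T E_q^T E_q X_q \right)^{-1} \left( \sum_q X_q^T E_q^T E_q y_q \right) \neq B$$

and

$$E_{YX}(\hat{\beta}_C) = \left( \sum_q X_q^T E_q^T \Sigma_q^{-1} E_q X_q \right)^{-1} \left( \sum_q X_q^T E_q^T \Sigma_q^{-1} E_q y_q \right) \neq B.$$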

In order to derive a predictor that satisfies (29), consider the class of linear predictors of $B$ that can be written in the form

$$\hat{B} = \left( \sum_q X_q^T X_q \right)^{-1} \left( \sum_q X_q^T K_q y_q^* \right).$$

If $K_q E_q = I_q$ it is straightforward to see that

$$E_{YX}(\hat{B}) = \left( \sum_q X_q^T X_q \right)^{-1} \left( \sum_q X_q^T K_q E_q y_q \right) = B.$$


If $E_q$ is of full rank (as is the case with (5) when $\lambda_q > \gamma_q$), then an obvious choice is $K_q = E_q^{-1}$. More generally, Kovacevic (personal communication, 2008) has suggested that one put $K_q = (E_q^T E_q)^{-1} E_q^T$, leading to the predictor

$$\hat{\beta}_B = \left( \sum_q X_q^T X_q \right)^{-1} \left\{ \sum_q X_q^T (E_q^T E_q)^{-1} E_q^T y_q^* \right\}. \tag{30}$$

Since (30) is linear in the $y_q^*$, variance estimation for this predictor using a plug-in sandwich-based approach follows directly. The resulting variance estimator is

$$\hat{V}_X(\hat{\beta}_B) = \left( \sum_q X_q^T X_q \right)^{-1} \left\{ \sum_q X_q^T (E_q^T E_q)^{-1} E_q^T \left( \hat{\sigma}^2 I_q + \hat{V}_q \right) E_q (E_q^T E_q)^{-1} X_q \right\} \left( \sum_q X_q^T X_q \right)^{-1}. \tag{31}$$
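The unbiasedness condition (29) for the predictor (30) can be illustrated numerically. The sketch below replaces each $y_q^*$ by its expectation over the linkage process, $E_q y_q$, and checks that (30) then reproduces $B$ while OLS applied directly to the same vectors does not. The exchangeable form chosen for $E_q$, the simulated data and the function names are assumptions made for this example.

```python
import numpy as np

def exchangeable_E(M, lam):
    # Hypothetical E_q: probability lam of a correct link, the rest spread evenly.
    gamma = (1.0 - lam) / (M - 1)
    return (lam - gamma) * np.eye(M) + gamma * np.ones((M, M))

def predictor_B(blocks):
    # Predictor (30): (sum X^T X)^{-1} sum X^T (E^T E)^{-1} E^T y*
    p = blocks[0][0].shape[1]
    xtx, rhs = np.zeros((p, p)), np.zeros(p)
    for X, y_star, E in blocks:
        K = np.linalg.solve(E.T @ E, E.T)  # K_q = (E_q^T E_q)^{-1} E_q^T
        xtx += X.T @ X
        rhs += X.T @ K @ y_star
    return np.linalg.solve(xtx, rhs)

rng = np.random.default_rng(1)
M = 8
E = exchangeable_E(M, lam=0.8)
Xs = [np.column_stack([np.ones(M), rng.normal(size=M)]) for _ in range(4)]
ys = [X @ np.array([2.0, -1.0]) + rng.normal(size=M) for X in Xs]

# 'y_q known' OLS estimator B from (28).
xtx = sum(X.T @ X for X in Xs)
B = np.linalg.solve(xtx, sum(X.T @ y for X, y in zip(Xs, ys)))

# Replace y_q* by its linkage expectation E_q y_q to inspect expected behaviour.
blocks = [(X, E @ y, E) for X, y in zip(Xs, ys)]
B_hat = predictor_B(blocks)
naive = np.linalg.solve(xtx, sum(X.T @ y_star for X, y_star, _ in blocks))
print(B, B_hat, naive)
```

This is a check on expectations only; for a single realised linkage, $\hat{\beta}_B$ will differ from $B$ by sampling error.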


3 Using estimating functions with probability-linked data

In this section we extend the ideas developed for linear regression analysis in the previous section to the case where the regression model of interest is fitted by solving an estimating equation. In particular, we assume that this model is characterised by a $p$-dimensional parameter $\theta$, which is then estimated by solving

$$H(\theta) = 0$$

where $H(\theta)$ is a $p$-dimensional unbiased estimating function for $\theta$, i.e. a function of the data that satisfies $E_X\{H(\theta_0)\} = 0$, where $\theta_0$ is the ‘true’ value of $\theta$. Let $\partial_\theta$ denote the partial differentiation operator with respect to the components of $\theta$. The resulting estimator $\hat{\theta}$ can then be shown to be approximately unbiased for $\theta_0$ since, under appropriate smoothness conditions,

$$0 = H(\hat{\theta}) \approx H(\theta_0) + (\partial_\theta H_0)(\hat{\theta} - \theta_0).$$

Here $\partial_\theta H_0$ is the $p \times p$ matrix of first order partial derivatives of $H(\theta)$ with respect to the components of $\theta$, evaluated at $\theta_0$. Since $H(\theta)$ is an unbiased estimating function, it immediately follows that

$$E_X(\hat{\theta} - \theta_0) \approx -(\partial_\theta H_0)^{-1} E_X\{H(\theta_0)\} = 0$$

provided $\partial_\theta H_0$ is of full rank, and so $\hat{\theta}$ is approximately unbiased for $\theta_0$. Furthermore, we then also have

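$$\mathrm{Var}_X(\hat{\theta}) \approx (\partial_\theta H_0)^{-1}\, \mathrm{Var}_X\{H(\theta_0)\} \left\{ (\partial_\theta H_0)^{-1} \right\}^T \tag{32}$$

leading to the usual sandwich-type estimator of this variance

$$\hat{V}_X(\hat{\theta}) = \left\{ (\partial_\theta H_0)^{-1} \Big|_{\theta_0 = \hat{\theta}} \right\} \hat{V}_X\{H(\theta_0)\} \left\{ (\partial_\theta H_0)^{-1} \Big|_{\theta_0 = \hat{\theta}} \right\}^T \tag{33}$$

where $\hat{V}_X\{H(\theta_0)\}$ is an estimate of $\mathrm{Var}_X\{H(\theta_0)\}$. Typically, it is a plug-in estimate, i.e. $\mathrm{Var}_X\{H(\theta_0)\}$ evaluated at $\theta_0 = \hat{\theta}$.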
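The sandwich form in (32) and (33) is a short matrix computation. As a minimal sketch, the example below applies it to the familiar OLS estimating function $H(\beta) = X^T(y - X\beta)$ with homoskedastic errors, where the sandwich should collapse to $\sigma^2 (X^T X)^{-1}$. The function name and the simulated design matrix are assumptions made for the illustration.

```python
import numpy as np

def sandwich(dH, var_H):
    # (dH)^{-1} Var{H} {(dH)^{-1}}^T, as in (32) and (33)
    dH_inv = np.linalg.inv(dH)
    return dH_inv @ var_H @ dH_inv.T

# OLS check: dH = -X^T X and Var{H} = sigma^2 X^T X.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
sigma2 = 1.7
dH = -(X.T @ X)
var_H = sigma2 * (X.T @ X)

V = sandwich(dH, var_H)
print(V)
```

The sign of the derivative matrix does not matter here, because it enters the sandwich on both sides.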

3.1 Correcting estimating functions for linkage error

We now turn our attention to the situation where a regression model is fitted using an estimating function and data that have been linked using a probability-based method. In particular, we shall concern ourselves with situations where $H(\theta)$ is of the form

$$H(\theta) = \sum_{i=1}^{N} G_i(\theta) \{ y_i - f_i(\theta) \} \tag{34}$$

where $f_i(\theta_0) = E_X(y_i)$ and $G_i(\theta)$ is a vector of order $p$ which is a function of $\theta$ and $X_i$ but not of $y_i$. Clearly (34) defines an unbiased estimating function for $\theta_0$, which we can write in ‘blocked’ form as

$$H(\theta) = \sum_q G_q(\theta) \{ y_q - f_q(\theta) \} \tag{35}$$

where $G_q(\theta)$ is the $p \times M_q$ matrix with columns defined by the vectors $G_i(\theta)$ associated with the population units making up block $q$, and $f_q(\theta)$ is the vector of order $M_q$ defined by their corresponding values of $f_i(\theta)$.

Now consider the situation described in section 1.1 where, instead of $y_q$, we have access to a probability-linked version of this vector, $y_q^* = A_q y_q$. Here $A_q$ is a random permutation matrix of order $M_q$ distributed independently of $y_q$ given the values in $X_q$ (i.e. linkage is non-informative given the values of the explanatory variables), with values of $A_q$ distributed independently between blocks and where $E_X(A_q) = E_q$. Let $H^*(\theta)$ denote the value of (35) when we use $y_q^*$ instead of $y_q$. That is, our naive estimator $\hat{\theta}^*$ of $\theta_0$ that assumes no linkage errors satisfies

$$H^*(\hat{\theta}^*) = \sum_q G_q(\hat{\theta}^*) \{ y_q^* - f_q(\hat{\theta}^*) \} = 0. \tag{36}$$

Clearly, since

$$E_X\{H^*(\theta_0)\} = \sum_q G_q(\theta_0) \{ (E_q - I_q) f_q(\theta_0) \} \neq 0$$

we see that $H^*(\theta)$ is biased if linkage is not perfect, and so the resulting estimator $\hat{\theta}^*$ is also biased in this case. Given the value of $E_q$, we can correct for this bias, replacing the estimating function $H^*(\theta)$ by its bias-corrected version

$$H^*_{adj}(\theta) = H^*(\theta) - \sum_q G_q(\theta) \{ (E_q - I_q) f_q(\theta) \} = \sum_q G_q(\theta) \{ y_q^* - E_q f_q(\theta) \}. \tag{37}$$

Our bias-adjusted estimator of $\theta$ based on the linked data is then $\hat{\theta}^*_{adj}$, where $H^*_{adj}(\hat{\theta}^*_{adj}) = 0$.
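The bias term removed in (37) can be made concrete with a small numerical sketch. It evaluates the naive function (36) and the adjusted function (37) at $\theta_0$, with each $y_q^*$ replaced by its expectation $E_q f_q(\theta_0)$, for a linear $f_q(\theta) = X_q\theta$ and $G_q = X_q^T$. The exchangeable $E_q$, the design values and the function names are assumptions of this example rather than part of the report.

```python
import numpy as np

def H_naive(theta, blocks, f, G):
    # (36): sum_q G_q(theta) {y_q* - f_q(theta)}
    return sum(G(X, theta) @ (y_star - f(X, theta)) for X, y_star, E in blocks)

def H_adj(theta, blocks, f, G):
    # (37): sum_q G_q(theta) {y_q* - E_q f_q(theta)}
    return sum(G(X, theta) @ (y_star - E @ f(X, theta)) for X, y_star, E in blocks)

def f_lin(X, theta):
    # Linear mean function f_q(theta) = X_q theta
    return X @ theta

def G_lin(X, theta):
    # G_q = X_q^T, free of theta in the linear case
    return X.T

M, lam = 10, 0.9
gamma = (1.0 - lam) / (M - 1)
E = (lam - gamma) * np.eye(M) + gamma * np.ones((M, M))  # hypothetical E_q
theta0 = np.array([1.0, 0.5])

blocks = []
for shift in (0.0, 1.0, 2.0):
    X = np.column_stack([np.ones(M), np.linspace(-1.0, 1.0, M) + shift])
    y_star_mean = E @ f_lin(X, theta0)  # expected linked vector, used in place of y_q*
    blocks.append((X, y_star_mean, E))

bias_naive = H_naive(theta0, blocks, f_lin, G_lin)
bias_adj = H_adj(theta0, blocks, f_lin, G_lin)
print(bias_naive, bias_adj)
```

Under these assumptions the naive value equals $\sum_q X_q^T (E_q - I_q) X_q \theta_0$, which is non-zero whenever the slope is non-zero, while the adjusted value is zero by construction.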

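The general results for inference based on unbiased estimating functions clearly apply to $H^*_{adj}(\theta)$ defined by (37). It immediately follows that the large sample variance of $\hat{\theta}^*_{adj}$ is given by (32) with $H^*_{adj}(\theta)$ substituted for $H(\theta)$. That is,

$$\mathrm{Var}_X(\hat{\theta}^*_{adj}) \approx \left( \partial_\theta H^*_{adj} \Big|_{\theta = \theta_0} \right)^{-1} \mathrm{Var}_X\{H^*_{adj}(\theta_0)\} \left\{ \left( \partial_\theta H^*_{adj} \Big|_{\theta = \theta_0} \right)^{-1} \right\}^T \tag{38}$$

with plug-in sandwich-type estimator, see (33), of the form

$$\hat{V}_X(\hat{\theta}^*_{adj}) = \left\{ \partial_\theta H^*_{adj}(\hat{\theta}^*_{adj}) \right\}^{-1} \hat{V}_X\{H^*_{adj}(\theta_0)\} \left[ \left\{ \partial_\theta H^*_{adj}(\hat{\theta}^*_{adj}) \right\}^{-1} \right]^T \tag{39}$$

where $\partial_\theta H^*_{adj}(\hat{\theta}^*_{adj}) = \partial_\theta H^*_{adj} \big|_{\theta = \hat{\theta}^*_{adj}}$.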


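$$\begin{aligned} \mathrm{Var}_X(y_q^*) &= E_X\{ \mathrm{Var}_X(A_q y_q \mid A_q) \} + \mathrm{Var}_X\{ E_X(A_q y_q \mid A_q) \} \\ &= E_X\{ A_q \mathrm{Var}_X(y_q) A_q^T \} + \mathrm{Var}_X\{ A_q f_q(\theta_0) \} \\ &= E_X\{ A_q \Omega_q(\theta_0) A_q^T \} + \mathrm{Var}_X\{ A_q f_q(\theta_0) \} \\ &= E_X\{ A_q \Omega_q(\theta_0) A_q^T \} + V_q(\theta_0) = \Sigma_q(\theta_0) \end{aligned} \tag{40}$$

so

$$\mathrm{Var}_X\{H^*_{adj}(\theta_0)\} = \sum_q G_q(\theta_0)\, \mathrm{Var}_X(y_q^*)\, G_q^T(\theta_0) = \sum_q G_q(\theta_0) \Sigma_q(\theta_0) G_q^T(\theta_0)$$

and hence

$$\hat{V}_X\{H^*_{adj}(\theta_0)\} = \sum_q G_q(\hat{\theta}^*_{adj}) \Sigma_q(\hat{\theta}^*_{adj}) G_q^T(\hat{\theta}^*_{adj}). \tag{41}$$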

In order to compute (41) we need to estimate the covariance matrix $\Sigma_q(\theta)$ specified by (40). In turn, this requires that we estimate both $V_q(\theta_0)$, which can be approximated via (16) after replacing $f_i$ by $f_i(\hat{\theta}^*_{adj})$, and $E_X\{ A_q \Omega_q(\theta_0) A_q^T \}$, which will depend on the particular model that we assume for the $y_q$.

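Next, in order to define the matrix of partial derivatives $\partial_\theta H^*_{adj}(\hat{\theta}^*_{adj})$ in (39) we note that although in theory

$$\partial_\theta H^*_{adj} = \partial_\theta \left[ \sum_q G_q(\theta) \{ y_q^* - E_q f_q(\theta) \} \right]$$

it is often the case that $G_q(\theta)$ varies little as $\theta$ changes. Consequently, we approximate this derivative by

$$\partial_\theta H^*_{adj} \approx -\sum_q G_q(\theta) E_q\, \partial_\theta f_q(\theta).$$

That is, we put

$$\partial_\theta H^*_{adj}(\hat{\theta}^*_{adj}) = -\sum_q G_q(\hat{\theta}^*_{adj}) E_q \left\{ \partial_\theta f_q(\hat{\theta}^*_{adj}) \right\} \tag{42}$$

where $\partial_\theta f_q(\hat{\theta}^*_{adj}) = \partial_\theta f_q(\theta) \big|_{\theta = \hat{\theta}^*_{adj}}$. The final variance estimator for $\hat{\theta}^*_{adj}$ is then obtained by substituting (41) and (42) into (39).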

3.2 Application to linear and logistic regression

Although we have already developed the theory for linear regression in section 2, it is interesting to see how the results obtained there arise as special cases of the general estimating equation theory set out in the previous sub-section. In particular, the Lahiri-Larsen estimator (18) and the BLUE (20) can be obtained from (37) by setting $\theta \equiv \beta$ and $f_q(\beta) = X_q \beta$ (so $\partial_\beta f_q(\beta) = X_q$) with $G_q = X_q^T E_q^T$ in the case of (18) and $G_q = X_q^T E_q^T \Sigma_q^{-1}$ in the case of (20). As far as the predictor (30) of $B$ is concerned, we note that it can be expressed as the solution to

$$\sum_q X_q^T (E_q^T E_q)^{-1} E_q^T \left( y_q^* - E_q X_q \hat{\beta} \right) = 0.$$

It follows that in this case $G_q = X_q^T (E_q^T E_q)^{-1} E_q^T$, which by (42) leads to

$$\partial_\beta H^*_{adj}(\hat{\beta}_B) = -\sum_q X_q^T X_q.$$

In contrast, the ‘ratio-adjusted’ estimator (12) cannot be expressed as the solution of an estimating equation of the form $\sum_q G_q \{ y_q^* - E_q X_q \beta \} = 0$, being instead the solution to the alternative ‘ratio-type’ estimating equation

$$H_R(\beta) = \sum_q X_q^T ( y_q^* - X_q D \beta ) = 0 \tag{43}$$

where

$$D = \left( \sum_q X_q^T X_q \right)^{-1} \left( \sum_q X_q^T E_q X_q \right).$$

As a consequence, the results in the previous subsection do not apply to it directly. However, it is not difficult to show that $H_R(\beta)$ also defines an unbiased estimating function under the assumed linear model, since

$$E_X\left\{ \sum_q X_q^T ( y_q^* - X_q D \beta ) \right\} = \left\{ \sum_q X_q^T ( E_q X_q - X_q D ) \right\} \beta = \left\{ \sum_q X_q^T E_q X_q - \left( \sum_q X_q^T X_q \right) D \right\} \beta = 0.$$
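The cancellation in this unbiasedness argument is easy to confirm numerically. The following sketch, with assumed design matrices and a hypothetical exchangeable $E_q$, builds $D$ from its definition below (43) and evaluates the expectation of $H_R(\beta)$ for an arbitrary $\beta$; the function names are illustrative.

```python
import numpy as np

def ratio_D(blocks):
    # D = (sum_q X_q^T X_q)^{-1} (sum_q X_q^T E_q X_q)
    xtx = sum(X.T @ X for X, E in blocks)
    xtex = sum(X.T @ E @ X for X, E in blocks)
    return np.linalg.solve(xtx, xtex)

def expected_H_R(blocks, beta):
    # E_X{H_R(beta)} = {sum_q X_q^T (E_q X_q - X_q D)} beta under the linear model
    D = ratio_D(blocks)
    return sum(X.T @ (E @ X - X @ D) for X, E in blocks) @ beta

rng = np.random.default_rng(3)
blocks = []
for M, lam in ((5, 0.9), (7, 0.8), (6, 0.95)):
    gamma = (1.0 - lam) / (M - 1)
    E = (lam - gamma) * np.eye(M) + gamma * np.ones((M, M))  # hypothetical E_q
    X = np.column_stack([np.ones(M), rng.normal(size=M)])
    blocks.append((X, E))

D = ratio_D(blocks)
bias = expected_H_R(blocks, np.array([1.0, -2.0]))
print(D)
print(bias)
```

The individual block terms $X_q^T (E_q X_q - X_q D)$ are generally non-zero; only their sum over blocks vanishes.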

The linearisation argument that was earlier used to define an estimator of variance in the ‘standard’ estimating function approach also applies to (12) when it is written as the solution to (43). In particular, we have

$$\partial_\beta H_R(\beta) = -\sum_q X_q^T X_q D = -\sum_q X_q^T E_q^T X_q \tag{44}$$

and

$$\mathrm{Var}_X\{H_R(\beta_0)\} = \sum_q X_q^T \Sigma_q X_q. \tag{45}$$

When (44) and (45) are substituted in (38) we obtain the variance expression (13), leading to the same plug-in estimator of variance as specified by (39).

The case where the regression model of interest corresponds to linear logistic regression is of special interest. Here $f_q(\theta) = \{ f_i(\theta);\ i \in q \}$ where

$$f_i(\theta) = \frac{\exp(X_i^T \theta)}{1 + \exp(X_i^T \theta)}. \tag{46}$$

It follows that

$$\partial_\theta f_q(\theta) = D_q(\theta) X_q \tag{47}$$

where $D_q(\theta) = \mathrm{diag}\left[ f_i(\theta) \{ 1 - f_i(\theta) \} \right]$.

The standard maximum likelihood estimating function (i.e. the score function) for the logistic regression model puts $G_q(\theta) = X_q^T$ in (35). However, this is not the only choice for this matrix when we estimate $\theta$ via the adjusted estimating equation (37). In particular we can also use the expressions for $G_q(\theta)$ that lead to the linear regression estimators (18), (20) and (30) introduced in section 2. We summarise these options in Table 1. Here option M defines the estimating equation for the MLE under perfect linkage, option A leads to the Lahiri-Larsen estimator (18) under a linear model and option B leads to the predictor (30) of the finite population regression vector (28) under the same model. In contrast, option C in Table 1 defines the second order efficient version of (35), which in the logistic case is given by G_qopt₍θ₎= ∂ θ EX yq ∗
