Regression Analysis of Probability-Linked Data

Ray Chambers

Centre for Statistical and Survey Methodology, University of Wollongong

This report was commissioned by Official Statistics Research, through Statistics New Zealand. The opinions, findings, recommendations and conclusions expressed in this report are those of the author(s), do not necessarily represent Statistics New Zealand and should not be reported as those of Statistics New Zealand. The department takes no responsibility for any omissions or errors in the information contained here.

Abstract

Data obtained after probability linkage of administrative registers will typically include errors because some linked records contain data items that are sourced from different individuals. Such errors can induce bias in standard statistical analyses if ignored. In this report we describe some approaches to eliminating this bias in the case of linear regression analysis and, more generally, when inference is based on an estimating equation, with an emphasis on logistic regression. Simulation results that illustrate the gains from allowing for linkage error in linear and logistic regression analysis are presented, as are extensions of the approach to situations where a sample is linked to a register and where the linked registers are of unequal size.

Keywords

Record matching, linkage errors, linear regression, logistic regression, estimating equations, measurement error.

Reproduction of material

Material in this report may be reproduced and published, provided that it does not purport to be published under government authority and that acknowledgement is made of this source.

Citation

Chambers, R. (2009). Regression analysis of probability-linked data, Official Statistics Research Series, 4. Available from http://www.statisphere.govt.nz/official-statistics-research/series/default.htm

Published by Statistics New Zealand

Tatauranga Aotearoa, Wellington, New Zealand

ISSN 1177-5017 (Online)

ISBN 978-0-478-31569-1 (Online)

Acknowledgements

The theory set out in this paper was not developed in a vacuum. It has benefited considerably from advice and critical input from Walt Davis of Statistics New Zealand, Milorad Kovacevic of Statistics Canada and Glenys Bishop and James Chipperfield of the Australian Bureau of Statistics. My thanks go out to all of them for their encouragement. Also, I would like to acknowledge the input of the referee who provided me with the details of the Neter et al. (1965) reference. This is a well-written paper that nicely summarises many of the statistical issues that I have attempted to tackle in this report.

Contents

1 Introduction
1.1 Background and assumptions
1.2 Research questions
2 Linear regression using linked data
2.1 Bias-corrected OLS inference
2.2 Efficient linear estimation using linked data
2.3 Maximum likelihood using linked data
2.4 A fixed population approach
3 Using estimating functions with probability-linked data
3.1 Correcting estimating functions for linkage error
3.2 Application to linear and logistic regression
3.3 Variance estimation when linkage probabilities are estimated
3.4 Maximum likelihood logistic regression with linked data
4 Simulation analysis
4.1 Simulation of linear regression with linked data
4.2 Simulation of logistic regression based on linked data
5 Regression analysis under sample to register linkage
6 Regression analysis under nested linkage
6.1 Using estimating functions with nested linked data
6.2 Fitting linear and logistic models to nested linked data
6.3 Reversing the nesting
7 Conclusions and further research
References
Appendix 1 Approximating the V matrix
Appendix 2 R code for linear model fitting and simulation
R functions for linear regression analysis
R code for linear model simulations
Simulation of known lambda case
Simulation of estimated lambda case
Appendix 3 R code for logistic model fitting and simulation
R functions for logistic regression analysis
R code for logistic model simulations
Simulation of known lambda case

List of tables

Table 1 Options for $G_q(\theta)$ in logistic regression
Table 2 Specification of $\hat G_q$ and $\partial_\theta \hat U_q$ in (50) for the linear case
Table 3 Simulation results for the linear model
Table 4 Simulation results for slope estimators under the logistic model

List of figures

Figure 1 Boxplots of percentage relative errors generated by different estimators in linear model simulations
Figure 2 Boxplots of percentage relative errors generated by different slope estimators in logistic model simulations

1 Introduction

In their seminal paper on the topic, Fellegi and Sunter (1969) defined record linkage as “a solution to the problem of recognizing those records in two files which represent identical persons, objects, or events...” Record linkage allows data for a single individual to be compiled from different data sources, enabling more powerful and effective analyses to be carried out than would otherwise be the case. In particular, datasets created by linking individual records constitute a critical resource for research in health, epidemiology, economics, demography, sociology and many other scientific areas. National statistical agencies increasingly rely on linking surveys to administrative registers to provide more accurate measurement and to reduce respondent burden. Frequently, one or more datasets (whether all administrative data or a mix of administrative and survey data) are linked to answer a broader array of research questions than can be addressed through any of the datasets individually.

Linked longitudinal datasets are particularly useful in health related research. These are datasets created by matching individual health and health-related records from a variety of sources over a period of time. For example, a longitudinal dataset created by linking hospital admission and general practitioner records to private health insurance expenditure records for individuals in a particular social and/or demographic group could be used to build models for how changes in that group’s health expenditure influence subsequent uptake of medical services. This type of linkage is able to bring together a much better picture of the driving factors behind many public health issues. Thus, using data obtained from linking physician billing claims held by the Ontario Health Insurance Plan with data for consenting Ontario respondents to the 1994/95 Canadian National Population Health Survey, Iron, Manuel and Williams (2003) report on an analysis of the relationship between utilization and costs of physician services and incidence of self-reported chronic conditions for residents of Ontario province in Canada.

Data linkage is not confined to the health sciences. In a review commissioned by the UK Department for Trade and Industry, Chesher and Nesheim (2006) describe the extensive use of data linkage in economic research, particularly in the United States. Statistics New Zealand has recently developed a linked longitudinal employer-employee dataset based on linking administrative data held on the NZ Inland Revenue Department's tax system and Statistics New Zealand's list of NZ businesses. This dataset allows the analysis of job and worker flows, employment tenure, multiple job holding and business demography. Similarly, the Census Data Enhancement Initiative of the Australian Bureau of Statistics (ABS) aims to create a Statistical Longitudinal Census Dataset that integrates census data from the same individuals over a number of censuses, with the objective of building a research resource for longitudinal analysis of the Australian population. In the UK, the
Interdepartmental Migration and Population Task Force set up by the Office for National Statistics has recently recommended the use of record linkage to improve migration and population statistics in the UK. The aim in this case is to link administrative, health register, school enrollment and university student data with incoming passenger survey and labour force survey data to create an integrated longitudinal data set that will allow in-depth analysis of the UK migrant experience.

The process used to link datasets often involves a probabilistic matching of records from one dataset to another. In most linkage operations matching variables present in both datasets are used to maximise the probability that the values of the variables making up the linked record are the correct measurements for the population unit corresponding to that record. However, when analysis is undertaken using the resulting linked data, the errors inherent in this type of record matching are typically ignored. This is unfortunate since these errors introduce bias and additional variability into standard statistical estimation techniques. This poses a significant barrier to policy-relevant research using linked data.

Statistical methods for linking datasets are now well established (Herzog, Scheuren and Winkler, 2007), with recent statistical research in this area mainly focused on the confidentiality issues that arise as a consequence of linkage. See Sibthorpe, Kliewer and Smith (1995) and Trutwein, Holman and Rosman (2005) for an Australian health data perspective, and Mackie and Bradburn (2000) for contributions to a workshop on confidentiality and linkage sponsored by the US Committee on National Statistics and the Institute of Medicine. In contrast, aside from the notable contributions of Neter et al. (1965), Scheuren and Winkler (1993) and Lahiri and Larsen (2005), there has been comparatively little methodological research carried out on the impact of linkage errors on analysis of linked data. Linkage errors are the errors caused by incorrectly linking different population units as well as the errors caused by not linking the same population units in the datasets that are linked. These errors are a particular type of measurement error, and can lead to biased inference unless appropriate steps are taken to control and/or adjust for this bias. In this report we develop a methodological framework that can be used to provide appropriate modifications to standard statistical analysis methods to ensure that they remain unbiased when used with probabilistically linked data. The framework is based on modelling the relationship between the probabilistically linked data and the ‘true’ data that would be obtained if error free linkage were possible. Inference then proceeds on the basis of a combined model defined by the integration of this linkage error model with the
statistical model for the ‘true’ data values that is of primary interest. Our assumptions about the data linkage situation and a description of a simple model for linkage errors are set out in the following sub-section. In section 2 we apply these ideas to fitting a simple linear regression model to linked data from two registers that each cover the same population. In section 3 this theory is generalised to where the statistical model of interest is fitted via the solution of an estimating equation, with application to logistic regression serving to motivate our approach. Simulation results for both linear and logistic regression are described in section 4 and illustrate the potential gains from the modified analytic methods that we propose. In section 5 we extend our framework to the important case of linking a survey to a register, while in section 6 we look at another important extension, where the registers that are linked are nested, in the sense that the population making up one of the linked registers is a subgroup of the population making up the other. Section 7 concludes the report with a short discussion of avenues for further research.

1.1 Background and assumptions

In what follows we assume the existence of a population of N units, indexed by $i = 1, \ldots, N$, such that, for each unit in this population, it is possible to measure the values of a scalar random variable Y and a vector random variable X. We are interested in modelling the relationship between Y and X in this population, and in particular we seek to fit a model of the form $E(Y \mid X) = g(X; \beta)$ for the regression of Y on X. Here g corresponds to a known functional form while the parameter β is unknown and needs to be estimated. This is usually straightforward if we have the values of Y and X for a random sample of units from this population. Unfortunately, we do not have such a sample. Instead, we have access to two registers that separately contain the population values of Y and X. We shall refer to these as the Y-register and X-register respectively from now on. For the time being we also assume that both registers refer to the same population and have no duplicates, so each is made up of N records.

If each unit in the population has a unique identifier, and this identifier is also stored on both registers, we can use it to link records from the two registers, and then estimate β using the Y and X values associated with the N linked records. Unfortunately, such a unique identifier does not exist. Instead, some form of probability-linking algorithm is used to associate (i.e. link) records on the X-register with records on the Y-register. This algorithm makes it possible (at least conceptually) to link every record on the X-register with a record on the Y-register. That is, linkage is complete and one to one between the Y and X-registers. Clearly, the data set constructed by this process (the linked data) can contain linkage errors, i.e. records where the values of Y and X actually come from different population units.

Although it may be theoretically possible for any two records on the Y and X-registers to be linked, most reasonable probability linking algorithms will only attempt to link records that are similar in some sense. Consequently, we shall assume that the linked records can be partitioned into Q distinct ‘blocks’ such that there is no possibility that linked records in different blocks contain data for the same population unit. We model this situation by assuming that the different blocks correspond to different values of a categorical population variable Z that can be derived from the information on either register, and which is defined in such a way that if a record on one register does not have the same value of Z as the record on the other register, then it is reasonable to assume that these two records cannot correspond to the same unit in the original population. Conversely, the fact that a Y-register record and an X-register record have the same value for Z does not guarantee that they correspond to the same unit, and so linkage errors can still occur within a block. We refer to
Z as a blocking variable, and those population units with the same value of Z as being in the same block. Note that errors in measurement of Z can lead to the same population unit having one value of Z on the Y-register and another on the X-register, which invalidates the assumption of no linkage errors when Y and X-register records have different values of Z. Consequently, we shall assume that Z is measured without error on both the Y-register and the X-register. With this set up, data linkage errors only occur among records in both registers in the same block.

This property of the blocking variable Z indicates a subtle but key difference between the use of the blocking concept in our development and its use in data linkage methodology. In the latter case, blocking variables define stages (or ‘passes’) in the linkage process, where at any particular stage matching is carried out with respect to a particular blocking variable. That is, only those remaining unmatched records at this stage with the same value for this blocking variable are considered as potential matches. However, once all matches at a particular stage of the process are declared, all remaining unmatched records are then considered as candidates for matching at the next stage using another blocking variable. Consequently it is quite possible that links can be created between Y and X-register records that have different values for any particular blocking variable. In our case the blocking variable Z is an ex-post construct. It defines a partition of the declared links into groups such that all linkage errors are isolated within the groups – there are no errors that ‘cross’ group boundaries. Clearly Z can be defined in terms of the blocking variables used in creating the links, but there is no fundamental requirement for this. The main requirement is that Z partitions both the Y and X-registers so that all (or virtually all) linkage errors are confined to the groups of records defined by the distinct values of this variable.

Without loss of generality, we denote the Q distinct values taken by Z by $1, 2, \ldots, Q$. Let block q correspond to the $M_q$ population units with $Z = q$, so $N = \sum_q M_q$. Since Z is measured without error in both registers, and linkage is complete, the number of records in block q in each register is the same, i.e. $M_q$.

Let i index the records in the linked data set. Again, without loss of generality we assume that this index is the same as the one used to index the X-register, i.e. the linkage process associates a record from the Y-register, with its associated Y-value, with each record on the X-register. In block q we then have $M_q$ linked data pairs $(y_i^*, X_i)$, where $y_i^*$ denotes the Y-value from block q on the Y-register that is matched to $X_i$. More accurately, the record with [...] X-register. We use $y_q^*$ to denote the vector of order $M_q$ defined by the linked values $y_i^*$ in block q and $X_q$ as the matrix with rows defined by the values $X_i$ in the same block. Also, let $y_q$ denote the unknown vector of order $M_q$, with entries indexed as in the X-register, that corresponds to the ‘true’ Y values associated with $X_q$.

Since one and only one of the $M_q$ records in block q on the Y-register can be matched to each distinct record in the corresponding block on the X-register, we model randomness in the outcome of the linkage process via the identity

$$y_q^* = A_q y_q \tag{1}$$

where $A_q = [a_{ij}^q]$ is an unknown random permutation matrix of order $M_q$. Note that the entries $a_{ij}^q$ of $A_q$ are either zero or one, with a value of one occurring just once in each row and column. Also, since we are assuming that linkage errors are confined to blocks, it is natural to impose the condition that $A_{q_1}$ and $A_{q_2}$ are independently distributed when $q_1 \ne q_2$.

Clearly inference based on linked data will involve assumptions about the distribution of the $A_q$. In this report we assume that linkage is non-informative at each level of Z, i.e. the distribution of $A_q$ is independent of $y_q$ given $X_q$. Let

$$E(A_q \mid X_q) = E_q. \tag{2}$$

Given the care that typically goes into the construction of a linked data set, it seems reasonable that a ‘declared’ link is more likely to be correct than incorrect. Although the probability that such a link is correct will typically vary between the records that make up the linked dataset, as a first approximation we assume that the probability of correct linkage is the same for all records in a block. We also assume that it is equally likely that any two Y -records in the same block that are not linked to a particular X-record in that block could in fact be the ‘correct’ link for this record. A simple way of characterising both of these assumptions is via an exchangeable linkage errors model, where for each value of q

$$\Pr(\text{correct linkage}) = \Pr(a_{ii}^q = 1) = \lambda_q \tag{3}$$

and, for $i \ne j$,

$$\Pr(\text{incorrect linkage}) = \Pr(a_{ij}^q = 1) = \gamma_q. \tag{4}$$

Given (3) and (4) hold, it follows that (2) is then of the form

$$E_q = (\lambda_q - \gamma_q) I_q + \gamma_q 1_q 1_q^T \tag{5}$$

where $I_q$ is the identity matrix of order $M_q$ and $1_q$ denotes a vector of ones of length $M_q$. Since $1_q^T A_q = 1_q^T$ and $A_q 1_q = 1_q$ we have $1_q^T E_q = 1_q^T$ and $E_q 1_q = 1_q$, which means that (5) implies

$$\lambda_q + (M_q - 1)\gamma_q = 1. \tag{6}$$

In other words, we just need to specify $\lambda_q$ in order to completely specify the first order properties of the linkage mechanism under the model (5). This will be particularly useful later since estimation of $\lambda_q$ requires only that we know whether a defined link is correct or incorrect, and not the identity of the correct link.
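
As a small illustration of (5) and (6), the sketch below builds $E_q$ for one block from $\lambda_q$ and $M_q$ alone and checks that its rows and columns sum to one. It is written in Python rather than the R used in the appendices; the function name `exchangeable_E` and the numerical values are ours and are purely illustrative.

```python
def exchangeable_E(m, lam):
    """Return E_q of (5) as a list of rows, with gamma_q fixed by constraint (6)."""
    if m < 2:
        raise ValueError("a block needs at least two records")
    gam = (1.0 - lam) / (m - 1)  # solves lam + (m - 1) * gam = 1
    return [[lam if i == j else gam for j in range(m)] for i in range(m)]


# Illustrative values only: a block of 5 records with 90% of links correct.
E = exchangeable_E(5, 0.9)
row_sums = [sum(row) for row in E]
col_sums = [sum(row[j] for row in E) for j in range(5)]
print(E[0])
print(row_sums, col_sums)
```

Only $\lambda_q$ enters as a free parameter here, which is the point made in the text: $\gamma_q$ is determined once $\lambda_q$ and the block size are known.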

The model specified by (3) and (4) represents what is probably the simplest way of characterising the behaviour of a probability-based linkage process, and will form the basis for the theory developed in this report. It was originally suggested by Neter et al. (1965) in a groundbreaking paper that investigated its use in assessing the impact of linkage error on response error analysis, where alternative data sources were linked to respondent records in order to assess the extent of response error in these records. As these authors note, and as we shall see in the next section, the impact of linkage error defined by (3) and (4) is to attenuate the relationship between the study variable (in their case the difference between the survey value and the linked alternative value) and explanatory covariates.

Depending on the available information from the operation of the linkage process, more sophisticated models for linkage error can be formulated. For example, it may be the case that the Y and X-registers are ordered so that only ‘nearby’ records in the linked data can possibly correspond to the correct link. This can be modelled by replacing (4) by

$$\Pr(\text{incorrect linkage}) = \Pr(a_{ij}^q = 1) = \begin{cases} \dfrac{1 - \lambda_q}{2 m_q} & \text{if } |j - i| \le m_q \\ 0 & \text{otherwise} \end{cases}$$

with appropriate modifications for values of i close to either the beginning or the end of the X-register. Here $2m_q$ denotes the number of nearest neighbours to $y_i^*$ in the linked data set that can actually contain the correct value $y_i$.

Another extension is where there exists another variable on the X-register, say W, with values $w_i$ that vary within a block, such that the probability of a correct link depends on these values. For example, we could have

$$\Pr(\text{correct linkage}) = \Pr(a_{ii}^q = 1) = p(w_i, w_i; \lambda_q)$$

and, for $i \ne j$,

$$\Pr(\text{incorrect linkage}) = \Pr(a_{ij}^q = 1) = p(w_i, w_j; \lambda_q)$$

where $p(w_i, w_j; \lambda_q)$ is a function that (i) takes values in the interval [0,1]; (ii) is maximised when $w_i = w_j$; and (iii) satisfies $\sum_{j=1}^{M_q} p(w_i, w_j; \lambda_q) = 1$. An obvious candidate function in this case is where $p(w_i, w_j; \lambda_q)$ is proportional to $\exp(-\lambda_q |w_i - w_j|)$. Note however that if W is categorical and available on both registers then by including it in the definition of the blocking variable Z we recover the situation implicit in the exchangeable linkage errors model, where all linked records in the same block have the same probability of being incorrectly linked.

1.2 Research questions

Given the preceding development, there are a number of questions that immediately arise.

1. What are the properties of the estimator of β based on the linked data that assumes all linkages are correct?

The methodological framework described in the previous sub-section was based on a number of strong assumptions about the linkage process that will typically be violated. As a consequence, we can ask further questions.

3. How do we need to modify our inference when linkage is incomplete (i.e. there are unlinked records in one or both of the Y and X-registers)?

4. What happens when one or both of the Y and X-registers are based on sample survey data? How do we integrate sample selection and linkage in inference?

5. We have assumed that all components of X are on one register. What happens if some components are actually held on the Y-register? More generally, what happens if components of X are held on different registers and these are linked either prior to the linkage to the Y-register or simultaneously with the linkage to the Y-register?

In the rest of this report we develop some theory that may help in answering these questions.

2 Linear regression using linked data

In this section we consider the situation where the widely used linear regression model is the focus of inference. That is, the population values of Y and X in each block (i.e. those associated with population units with the same value of Z) satisfy

$$E_X(y_q) = X_q \beta = f_q \tag{7}$$

$$\mathrm{Var}_X(y_q) = \sigma^2 I_q \tag{8}$$

where we use a subscript of X to denote conditioning on the value $X_q$ of the explanatory variables in block q. Note that in addition to the regression parameter β in (7), which is the target of inference, (8) now includes an unknown scale parameter $\sigma^2$. Given the $y_q$ and $X_q$, the optimal estimator of β is then its Ordinary Least Squares (OLS) estimator

$$\hat\beta = \Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T y_q\Big). \tag{9}$$

Unfortunately, unless the linkage is perfect, (9) cannot be calculated. Instead, what is usually done is to substitute the linked data values $y_q^*$ for $y_q$ in this expression, which leads to the naïve linked data OLS estimator

$$\hat\beta^* = \Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T y_q^*\Big). \tag{10}$$

2.1 Bias-corrected OLS inference

Under the linkage error model (1) it is easy to see that (10) is actually

$$\hat\beta^* = \Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T A_q y_q\Big).$$

Under non-informative linkage $E_X(A_q y_q) = E_X(A_q)\,E_X(y_q) = E_q f_q$, so

$$E_X(\hat\beta^*) = \Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T E_q X_q\Big) \beta = D\beta. \tag{11}$$

That is, the naïve OLS estimator (10) based on the linked data set is biased. Provided $E_q$ is known and the inverse of the matrix D in (11) exists, an unbiased estimator of β in this situation is

$$\hat\beta_R = D^{-1}\hat\beta^* = \Big\{\Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T E_q X_q\Big)\Big\}^{-1} \hat\beta^*$$

which, since $\sum_q X_q^T E_q X_q$ is then of full rank, reduces to

$$\hat\beta_R = \Big(\sum_q X_q^T E_q X_q\Big)^{-1} \Big(\sum_q X_q^T y_q^*\Big). \tag{12}$$

Note that the subscript of R used to denote the estimator defined by (12) serves as a reminder that this estimator is based on a ratio-type correction for the bias in the naïve estimator (10).
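
To make the attenuation in (11) and the correction in (12) concrete, the following Python sketch simulates a single regressor without intercept, for which $X_q^T E_q X_q$ reduces to a scalar. It is not the R code of Appendix 2. The linkage mechanism used in `link_with_errors` (flag each record as wrong with probability $1 - \lambda_q$ and rotate Y-values among the flagged records) is our own simple stand-in for an exchangeable linkage process, and all names and parameter values are illustrative assumptions.

```python
import random


def link_with_errors(y, lam, rng):
    """Return y* = A y, where records flagged wrong (prob. 1 - lam) swap Y-values."""
    wrong = [i for i in range(len(y)) if rng.random() > lam]
    rng.shuffle(wrong)
    y_star = list(y)
    if len(wrong) > 1:
        # Rotate Y-values among flagged records so each receives another record's value.
        for k, i in enumerate(wrong):
            y_star[i] = y[wrong[(k + 1) % len(wrong)]]
    return y_star


def xt_e_x(x, lam):
    """X_q^T E_q X_q for a single regressor under the exchangeable model (5)."""
    gam = (1.0 - lam) / (len(x) - 1)
    return (lam - gam) * sum(v * v for v in x) + gam * sum(x) ** 2


def naive_and_ratio(blocks):
    """blocks: list of (x, y_star, lam). Returns the estimators (10) and (12)."""
    sxy = sum(sum(a * b for a, b in zip(x, ys)) for x, ys, _ in blocks)
    sxx = sum(sum(a * a for a in x) for x, _, _ in blocks)
    sxex = sum(xt_e_x(x, lam) for x, _, lam in blocks)
    return sxy / sxx, sxy / sxex


rng = random.Random(2009)
beta, lam = 2.0, 0.8          # illustrative true slope and correct-link probability
blocks = []
for _ in range(20):           # 20 blocks of 1000 records each
    x = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    y = [beta * v + rng.gauss(0.0, 1.0) for v in x]
    blocks.append((x, link_with_errors(y, lam, rng), lam))

naive, corrected = naive_and_ratio(blocks)
print(round(naive, 3), round(corrected, 3))
```

With a regressor centred near zero, the naïve slope should sit close to $\lambda_q \beta$, while the ratio-corrected slope should sit close to β.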

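We use an iterated expectation argument to obtain the variance of $\hat\beta_R$. To start, observe that $\mathrm{Var}_X(\hat\beta_R) = D^{-1}\,\mathrm{Var}_X(\hat\beta^*)\,(D^{-1})^T$ where

$$\mathrm{Var}_X(\hat\beta^*) = E_X\{\mathrm{Var}_{AX}(\hat\beta^*)\} + \mathrm{Var}_X\{E_{AX}(\hat\beta^*)\}.$$

Here a subscript of AX denotes conditioning on both $A_q$ and $X_q$, so

$$E_{AX}(\hat\beta^*) = \Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T A_q X_q\Big) \beta$$

and

$$\mathrm{Var}_{AX}(\hat\beta^*) = \sigma^2 \Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T A_q A_q^T X_q\Big) \Big(\sum_q X_q^T X_q\Big)^{-1} = \sigma^2 \Big(\sum_q X_q^T X_q\Big)^{-1}$$

since $A_q A_q^T = I_q$. Put $V_q = \mathrm{Var}_X(A_q X_q \beta) = \mathrm{Var}_X(A_q f_q)$. It follows that

$$\mathrm{Var}_X(\hat\beta_R) = D^{-1} \Big(\sum_q X_q^T X_q\Big)^{-1} \Big\{\sum_q X_q^T (\sigma^2 I_q + V_q) X_q\Big\} \Big(\sum_q X_q^T X_q\Big)^{-1} (D^{-1})^T$$

$$= \Big(\sum_q X_q^T E_q X_q\Big)^{-1} \Big\{\sum_q X_q^T (\sigma^2 I_q + V_q) X_q\Big\} \Big(\sum_q X_q^T E_q X_q\Big)^{-1}. \tag{13}$$

An estimator of (13) is then

$$\hat V_X(\hat\beta_R) = \Big(\sum_q X_q^T E_q X_q\Big)^{-1} \Big\{\sum_q X_q^T (\hat\sigma^2 I_q + \hat V_q) X_q\Big\} \Big(\sum_q X_q^T E_q X_q\Big)^{-1} \tag{14}$$

where $\hat\sigma^2$ and $\hat V_q$ are estimates of $\sigma^2$ and $V_q$ respectively.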

In order to define these estimates, we note that after some simplification, and using the fact that $A_q^T A_q = I_q$,

$$E_X \sum_q (y_q^* - f_q)^T (y_q^* - f_q) = E_X \sum_q \big\{ y_q^T y_q - y_q^T f_q - f_q^T y_q + f_q^T f_q - 2 f_q^T (A_q - I_q) y_q \big\} = N\sigma^2 - 2 \sum_q f_q^T (E_q - I_q) f_q.$$

It follows that when $f_q$ and $E_q$ are known,

$$\hat\sigma^2 = N^{-1} \sum_q \big\{ (y_q^* - f_q)^T (y_q^* - f_q) - 2 f_q^T (I_q - E_q) f_q \big\} \tag{15}$$

is an unbiased method of moments estimator of $\sigma^2$ under the linkage errors model (1) and the linear model specified by (7) and (8). Note that (15) can take negative values. In practice, we replace $f_q$ by $\hat f_q = X_q \hat\beta_R$ in (15) to then obtain a consistent estimator of $\sigma^2$. Development of an expression for $\hat V_q$ is somewhat more complicated. In Appendix 1 we show that a large $M_q$ approximation to $V_q$ given a simple second-order extension of the exchangeable linkage errors model defined by (3) and (4) is

$$V_q \approx \mathrm{diag}\Big[ (1 - \lambda_q) \big\{ \lambda_q (f_i - \bar f_q)^2 + \bar f_q^{(2)} - \bar f_q^2 \big\} \Big] \tag{16}$$

where $f_q = (f_i)$ and $\bar f_q$, $\bar f_q^{(2)}$ denote the block q averages of the components of $f_q$ and their squares respectively. In order to calculate $\hat V_q$ we replace these components in (16) by their estimated values.
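
The quality of the approximation (16) can be checked numerically for the diagonal entries, since the variance of the i-th component of $A_q f_q$ depends only on the marginal link probabilities (3) and (4). The Python sketch below compares that exact diagonal with (16) for one block. It says nothing about the off-diagonal entries of $V_q$, which require the second-order extension treated in Appendix 1, and the function names and the choice of $f_q$ are illustrative assumptions.

```python
def exact_diag_V(f, lam):
    """Exact Var of each component of A_q f_q implied by (3), (4) and (6)."""
    m = len(f)
    gam = (1.0 - lam) / (m - 1)
    s1 = sum(f)
    s2 = sum(v * v for v in f)
    out = []
    for fi in f:
        mean = lam * fi + gam * (s1 - fi)             # E of linked value
        second = lam * fi * fi + gam * (s2 - fi * fi)  # E of its square
        out.append(second - mean * mean)
    return out


def approx_diag_V(f, lam):
    """Large-M_q approximation (16)."""
    m = len(f)
    fbar = sum(f) / m
    f2bar = sum(v * v for v in f) / m
    return [(1.0 - lam) * (lam * (fi - fbar) ** 2 + f2bar - fbar ** 2) for fi in f]


lam = 0.9
f = [i / 400 for i in range(400)]  # hypothetical fitted values in one block
exact = exact_diag_V(f, lam)
approx = approx_diag_V(f, lam)
max_gap = max(abs(a - b) for a, b in zip(exact, approx))
print(max_gap)
```

For a block of this size the two sets of diagonal entries should differ only by terms of order $1/M_q$.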

The approach to linear regression estimation using probability-linked data described above is in the spirit of Scheuren and Winkler (1993), where it is suggested that one corrects the naïve estimator using an estimate of its bias under an appropriate model for the linkage error process. In our case the ratio-type adjustment we use for this purpose depends on knowing (or having good estimates of) the parameters (i.e. the $\lambda_q$) that characterise this process. As noted earlier, all that is required to estimate these parameters is access to a random ‘audit’ sample of the linked records in each block where the only thing we need to know is whether the declared links are correct or not. This could also be done by comparison with the output from a ‘gold standard’ (e.g. clerical) linkage operation carried out on this sample of records.
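
A minimal Python sketch of this audit-based estimate is given below: within each block, $\lambda_q$ is estimated by the proportion of audited links found to be correct. The data structure, the block labels and the audit outcomes are all hypothetical.

```python
def estimate_lambda(audit):
    """audit maps a block label to a list of booleans (True = link verified correct)."""
    return {q: sum(flags) / len(flags) for q, flags in audit.items() if flags}


# Hypothetical audit outcomes for two blocks.
audit_sample = {
    "block_1": [True, True, False, True, True, True, True, True, True, True],
    "block_2": [True, False, True, True, False, True, True, True],
}
lam_hat = estimate_lambda(audit_sample)
print(lam_hat)
```

Blocks with no audited records are skipped here rather than assigned a value; how such blocks should be handled in practice is a separate modelling decision.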

2.2 Efficient linear estimation using linked data

An alternative approach to fitting a linear model using the probability-linked data is based on directly modelling the regression relationship between the linked values $y_q^*$ and the values in $X_q$. Since $y_q^* = A_q y_q$, and $A_q$ and $y_q$ are independently distributed given $X_q$, it follows that

$$E_X(y_q^*) = E_X(A_q)\,E_X(y_q) = E_q X_q \beta = H_q \beta. \tag{17}$$

That is, the $y_q^*$ also follow a linear model with regression coefficient β but with a modified set of explanatory variables $H_q$ in block q. Lahiri and Larsen (2005) note this relationship and suggest estimation of β using the OLS estimator for this situation,

$$\hat\beta_A = \Big(\sum_q H_q^T H_q\Big)^{-1} \Big(\sum_q H_q^T y_q^*\Big) = \Big(\sum_q X_q^T E_q^T E_q X_q\Big)^{-1} \Big(\sum_q X_q^T E_q^T y_q^*\Big). \tag{18}$$

However, the optimality of this estimator depends on the regression errors under (17) being homoskedastic. It is easy to see that this condition generally does not hold, since implicit in the development leading to (13) is the fact that

$$\mathrm{Var}_X(y_q^*) = \sigma^2 I_q + V_q = \Sigma_q \tag{19}$$

which implies that the variances of the regression errors defined by the linked data vary between blocks. The Best Linear Unbiased Estimator (BLUE) for β given these data is

$$\hat\beta_C = \Big(\sum_q H_q^T \Sigma_q^{-1} H_q\Big)^{-1} \Big(\sum_q H_q^T \Sigma_q^{-1} y_q^*\Big) = \Big(\sum_q X_q^T E_q^T \Sigma_q^{-1} E_q X_q\Big)^{-1} \Big(\sum_q X_q^T E_q^T \Sigma_q^{-1} y_q^*\Big). \tag{20}$$

Note that (20) depends on $\Sigma_q$, and hence on $\sigma^2$ and β. Its ‘empirical’ (EBLUE) version is defined by substituting estimates for these parameters and iterating, using the estimate (15) for $\sigma^2$ developed in the previous sub-section, combined with the estimate of β defined by (20).
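
The iteration just described can be sketched in Python for the simplest case of a single regressor without intercept, where $H_q$ is a vector and $\Sigma_q$ is taken to be diagonal via (16) and (19). This is not the R implementation of Appendix 2. The starting value (the ratio estimator (12)), the fixed number of iterations, the floor applied when (15) is negative and the simulated linkage mechanism are all our assumptions.

```python
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def e_times(v, lam):
    """E_q v under the exchangeable model (5)."""
    gam = (1.0 - lam) / (len(v) - 1)
    total = sum(v)
    return [(lam - gam) * vi + gam * total for vi in v]


def fit_eblue(blocks, n_iter=10):
    """blocks: list of (x, y_star, lam) for one regressor. Returns (beta, sigma2)."""
    n = sum(len(x) for x, _, _ in blocks)
    hs = [e_times(x, lam) for x, _, lam in blocks]  # H_q = E_q X_q
    # Starting value: ratio estimator (12); this choice is an assumption.
    beta = sum(dot(x, ys) for x, ys, _ in blocks) / sum(
        dot(x, h) for (x, _, _), h in zip(blocks, hs))
    sigma2 = 1.0
    for _ in range(n_iter):
        # Method of moments estimate (15) with f_q replaced by X_q * beta.
        total = 0.0
        for (x, ys, _), h in zip(blocks, hs):
            f = [beta * v for v in x]
            ef = [beta * v for v in h]  # E_q f_q
            total += sum((a - b) ** 2 for a, b in zip(ys, f))
            total -= 2.0 * (dot(f, f) - dot(f, ef))
        sigma2 = max(total / n, 1e-8)  # (15) can be negative; the floor is an assumption
        # Weighted least squares step (20) with diagonal Sigma_q from (16) and (19).
        num = den = 0.0
        for (x, ys, lam), h in zip(blocks, hs):
            f = [beta * v for v in x]
            fbar = sum(f) / len(f)
            spread = dot(f, f) / len(f) - fbar ** 2
            for hi, yi, fi in zip(h, ys, f):
                s = sigma2 + (1.0 - lam) * (lam * (fi - fbar) ** 2 + spread)
                num += hi * yi / s
                den += hi * hi / s
        beta = num / den
    return beta, sigma2


def link_with_errors(y, lam, rng):
    """Assumed mechanism: flagged records (prob. 1 - lam) rotate their Y-values."""
    wrong = [i for i in range(len(y)) if rng.random() > lam]
    rng.shuffle(wrong)
    y_star = list(y)
    if len(wrong) > 1:
        for k, i in enumerate(wrong):
            y_star[i] = y[wrong[(k + 1) % len(wrong)]]
    return y_star


rng = random.Random(7)
blocks = []
for _ in range(10):  # illustrative: 10 blocks of 1000 records, slope 2, sigma^2 = 1
    x = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    y = [2.0 * v + rng.gauss(0.0, 1.0) for v in x]
    blocks.append((x, link_with_errors(y, 0.8, rng), 0.8))

beta_c, sigma2_hat = fit_eblue(blocks)
print(round(beta_c, 3), round(sigma2_hat, 3))
```

Dropping the weights (setting every `s` to one) in the weighted step would give the estimator (18) instead of (20).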

Standard plug-in ‘sandwich-type’ estimators of the variances of (18) and (20) are easily developed using the estimates $\hat\sigma^2$ and $\hat V_q$ developed in the previous sub-section. These are

$$\hat V_X(\hat\beta_A) = \Big(\sum_q X_q^T E_q^T E_q X_q\Big)^{-1} \Big\{\sum_q X_q^T E_q^T (\hat\sigma^2 I_q + \hat V_q) E_q X_q\Big\} \Big(\sum_q X_q^T E_q^T E_q X_q\Big)^{-1} \tag{21}$$

in the case of (18) and

$$\hat V_X(\hat\beta_C) = \Big\{\sum_q X_q^T E_q^T (\hat\sigma^2 I_q + \hat V_q)^{-1} E_q X_q\Big\}^{-1} \tag{22}$$

in the case of (20).

Note that such ‘plug-in’ estimators ignore the contribution to the variance associated with estimation of the linkage model parameters and hence may be biased low. This issue is further discussed in section 3.3.

2.3 Maximum likelihood using linked data

An alternative approach to constructing an efficient estimator of $\beta$ given the linked data is to use the Missing Information Principle or MIP (Orchard and Woodbury, 1972) to derive the maximum likelihood estimator of this parameter given the linked data. In order to do so, we extend the linear model (7) and (8) to include an assumption of normality. That is, given $X_q$, we assume that $y_q \sim N(f_q, \sigma^2 I_q)$. When the $y_q$ are known, the score function for $\beta$ and $\sigma^2$ has components

$$sc_1 = \frac{1}{\sigma^2}\sum_q X_q^T\left(y_q - f_q\right) \tag{23}$$

and

$$sc_2 = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_q \left(y_q - f_q\right)^T\left(y_q - f_q\right). \tag{24}$$

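In order to apply the MIP, we replace (23) and (24) by their conditional expectations given $y_q^*$ and $X_q$. Using an iterated expectations argument again, we see that

$$\mathrm{Cov}_X\left(y_q, y_q^*\right) = \sigma^2 E_X\left(A_q^T\right) + \mathrm{Cov}_X\left(f_q, A_q f_q\right) = \sigma^2 E_q^T.$$

Combining this result with (17) and (19), it follows that

$$\left.\begin{pmatrix} y_q \\ y_q^* \end{pmatrix}\right| X_q \sim N\left(\begin{pmatrix} f_q \\ E_q f_q \end{pmatrix},\ \sigma^2\begin{pmatrix} I_q & E_q^T \\ E_q & \Sigma_q \end{pmatrix}\right)$$

and so

$$E_X\left(y_q \mid y_q^*\right) = f_q + E_q^T\Sigma_q^{-1}\left(y_q^* - E_q f_q\right) = \hat y_q$$

and

$$\mathrm{Var}_X\left(y_q \mid y_q^*\right) = \sigma^2\left(I_q - E_q^T\Sigma_q^{-1}E_q\right).$$

We therefore replace (23) by

$$sc_1^* = \frac{1}{\sigma^2}\sum_q X_q^T\left(\hat y_q - f_q\right) = \frac{1}{\sigma^2}\sum_q X_q^T E_q^T\Sigma_q^{-1}\left(y_q^* - E_q f_q\right) \tag{25}$$

and, since $y_q^T y_q = y_q^{*T} y_q^*$, we replace (24) by

$$sc_2^* = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_q\left(y_q^{*T}y_q^* - 2f_q^T\hat y_q + f_q^T f_q\right) = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_q\left\{\left(y_q^* - f_q\right)^T\left(y_q^* - f_q\right) - 2f_q^T\left(\hat y_q - y_q^*\right)\right\}. \tag{26}$$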

The MLEs for $\beta$ and $\sigma^2$ are defined by setting (25) and (26) to zero and solving for these parameters. Since $\hat y_q$ is a function of $\beta$ and $\sigma^2$ this needs to be done numerically. Note that the solution to setting (25) to zero is the BLUE (20). Since the MLE for $\sigma^2$ obtained by setting (26) to zero is not the same as the method of moments estimator (15), the MLE and the EBLUE for $\beta$ will not be the same. However, they are typically very close.
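The claim that the BLUE (20) solves the MIP score equation (25) can be checked numerically. The sketch below takes $\Sigma_q$ as a given symmetric positive definite matrix and evaluates $\hat y_q$ and $sc_1^*$ for a linear model with $f_q = X_q\beta$; the function names are hypothetical.

```python
import numpy as np

def y_hat(X, E, Sigma, ystar, beta):
    """Conditional expectation of y_q given y*_q for f_q = X_q beta."""
    f = X @ beta
    return f + E.T @ np.linalg.solve(Sigma, ystar - E @ f)

def score_beta(X_blocks, E_blocks, Sigma_blocks, ystar_blocks, beta, sigma2):
    """MIP score sc1* of (25), evaluated at the supplied beta and sigma2."""
    total = np.zeros(X_blocks[0].shape[1])
    for X, E, S, ys in zip(X_blocks, E_blocks, Sigma_blocks, ystar_blocks):
        total += X.T @ (y_hat(X, E, S, ys, beta) - X @ beta)
    return total / sigma2
```

Evaluating `score_beta` at the BLUE computed from the same blocks should return a vector that is zero up to rounding error, whatever value is supplied for `sigma2`.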

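In order to estimate the variances and covariances of these MLEs, we calculate the matrix-valued observed information function corresponding to the MIP-based score function for these parameters and invert it. This can be done by either numerically differentiating (25) and (26), or by using the MIP information identity. This identity states that the information function for $\beta$ and $\sigma^2$ given the linked data is the conditional expectation of the ‘$y_q$ known’ information function given the linked data minus the conditional variance of the ‘$y_q$ known’ score function given the linked data. Denoting conditioning on the linked data $\left(y_q^*;\ q = 1, 2, \ldots, Q\right)$ by a superscript of $*$, the information function generated by these data is

$$\mathrm{info}^* = \begin{pmatrix} E_X^*(\mathrm{info}_{11}) & E_X^*(\mathrm{info}_{12}) \\ E_X^*(\mathrm{info}_{21}) & E_X^*(\mathrm{info}_{22}) \end{pmatrix} - \begin{pmatrix} \mathrm{Var}_X^*(sc_1) & \mathrm{Cov}_X^*(sc_1, sc_2) \\ \mathrm{Cov}_X^*(sc_2, sc_1) & \mathrm{Var}_X^*(sc_2) \end{pmatrix} \tag{27}$$

where

$$E_X^*(\mathrm{info}_{11}) = \frac{1}{\sigma^2}\sum_q X_q^T X_q$$

$$E_X^*(\mathrm{info}_{22}) = -\frac{N}{2\sigma^4} + \frac{1}{\sigma^6}\sum_q\left\{\left(y_q^* - f_q\right)^T\left(y_q^* - f_q\right) - 2f_q^T\left(\hat y_q - y_q^*\right)\right\}$$

$$E_X^*(\mathrm{info}_{12}) = E_X^*(\mathrm{info}_{21}) = \frac{1}{\sigma^4}\sum_q X_q^T\left(\hat y_q - f_q\right)$$

$$\mathrm{Var}_X^*(sc_1) = \frac{1}{\sigma^4}\sum_q X_q^T\,\mathrm{Var}_X\left(y_q \mid y_q^*\right)X_q = \frac{1}{\sigma^2}\sum_q X_q^T\left(I_q - E_q^T\Sigma_q^{-1}E_q\right)X_q$$

$$\begin{aligned}
\mathrm{Cov}_X^*(sc_1, sc_2) &= \frac{1}{2\sigma^6}\sum_q \mathrm{Cov}_X\left\{X_q^T\left(y_q - f_q\right), \left(y_q - f_q\right)^T\left(y_q - f_q\right) \mid y_q^*\right\} \\
&= \frac{1}{2\sigma^6}\sum_q \mathrm{Cov}_X\left(X_q^T y_q, -2f_q^T y_q \mid y_q^*\right) \\
&= -\frac{1}{\sigma^6}\sum_q X_q^T\,\mathrm{Var}_X\left(y_q \mid y_q^*\right)f_q \\
&= -\frac{1}{\sigma^4}\sum_q X_q^T\left(I_q - E_q^T\Sigma_q^{-1}E_q\right)f_q
\end{aligned}$$

and

$$\begin{aligned}
\mathrm{Var}_X^*(sc_2) &= \frac{1}{4\sigma^8}\sum_q \mathrm{Var}_X\left\{\left(y_q - f_q\right)^T\left(y_q - f_q\right) \mid y_q^*\right\} \\
&= \frac{1}{4\sigma^8}\sum_q \mathrm{Var}_X\left\{y_q^T y_q - y_q^T f_q - f_q^T y_q + f_q^T f_q \mid y_q^*\right\} \\
&= \frac{1}{\sigma^8}\sum_q \mathrm{Var}_X\left\{f_q^T y_q \mid y_q^*\right\} \\
&= \frac{1}{\sigma^6}\sum_q f_q^T\left(I_q - E_q^T\Sigma_q^{-1}E_q\right)f_q.
\end{aligned}$$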

The observed information for $\beta$ and $\sigma^2$ is the value of $\mathrm{info}^*$ at the values of the MLEs for these parameters. The inverse of this matrix is then used as an estimate of the variance/covariance matrix of these estimators. Note that the value of the matrix

$$\begin{pmatrix} \mathrm{Var}_X^*(sc_1) & \mathrm{Cov}_X^*(sc_1, sc_2) \\ \mathrm{Cov}_X^*(sc_2, sc_1) & \mathrm{Var}_X^*(sc_2) \end{pmatrix}$$

at the MLEs for $\beta$ and $\sigma^2$ is a measure of the information loss caused by incorrect linkage.

2.4 A fixed population approach

Suppose that we have perfectly linked data. The efficient estimator of the regression parameter $\beta$ is then the ‘$y_q$ known’ OLS estimator

$$B = \left(\sum_q X_q^T X_q\right)^{-1}\left(\sum_q X_q^T y_q\right). \tag{28}$$

So far, our emphasis has been on estimation of $\beta$. However, it is legitimate to also consider prediction of $B$ given the fixed finite population of Y and X-values that define the Y and X-registers. In this context, we denote conditioning on these values (i.e. on the values of $y_q$ and $X_q$) by a subscript of $YX$ and look for a predictor $\hat B$ of $B$ that satisfies (over repeated applications of the probability linkage process)

$$E_{YX}(\hat B) = B. \tag{29}$$

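Note that none of $\hat\beta_R$, $\hat\beta_A$ and $\hat\beta_C$ satisfy (29) since we have

$$E_{YX}(\hat\beta_R) = \left(\sum_q X_q^T E_q X_q\right)^{-1}\left(\sum_q X_q^T E_q y_q\right) \neq B$$

$$E_{YX}(\hat\beta_A) = \left(\sum_q X_q^T E_q^T E_q X_q\right)^{-1}\left(\sum_q X_q^T E_q^T E_q y_q\right) \neq B$$

and

$$E_{YX}(\hat\beta_C) = \left(\sum_q X_q^T E_q^T\Sigma_q^{-1}E_q X_q\right)^{-1}\left(\sum_q X_q^T E_q^T\Sigma_q^{-1}E_q y_q\right) \neq B.$$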

In order to derive a predictor that satisfies (29), consider the class of linear predictors of $B$ that can be written in the form

$$\hat B = \left(\sum_q X_q^T X_q\right)^{-1}\left(\sum_q X_q^T K_q y_q^*\right).$$

If $K_q E_q = I_q$ it is straightforward to see that

$$E_{YX}(\hat B) = \left(\sum_q X_q^T X_q\right)^{-1}\left(\sum_q X_q^T K_q E_q y_q\right) = B.$$

If $E_q$ is of full rank (as is the case with (5) when $\lambda_q > \gamma_q$), then an obvious choice is $K_q = E_q^{-1}$. More generally, Kovacevic (personal communication, 2008) has suggested that one put $K_q = \left(E_q^T E_q\right)^{-1}E_q^T$, leading to the predictor

$$\hat\beta_B = \left(\sum_q X_q^T X_q\right)^{-1}\left\{\sum_q X_q^T\left(E_q^T E_q\right)^{-1}E_q^T y_q^*\right\}. \tag{30}$$

Since (30) is linear in the $y_q^*$, variance estimation for this predictor using a plug-in sandwich-based approach follows directly. The resulting variance estimator is

$$\hat V_X(\hat\beta_B) = \left(\sum_q X_q^T X_q\right)^{-1}\left\{\sum_q X_q^T\left(E_q^T E_q\right)^{-1}E_q^T\left(\hat\sigma^2 I_q + \hat V_q\right)E_q\left(E_q^T E_q\right)^{-1}X_q\right\}\left(\sum_q X_q^T X_q\right)^{-1}. \tag{31}$$
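The unbiasedness condition $K_q E_q = I_q$ can be illustrated in a few lines. Because the predictor (30) is linear in $y_q^*$, evaluating it at $E_{YX}(y_q^*) = E_q y_q$ gives its expectation over the linkage process, which should equal $B$ in (28). The sketch below is illustrative only; `predictor_B` and `ols_B` are hypothetical names.

```python
import numpy as np

def predictor_B(X_blocks, E_blocks, ystar_blocks):
    """Predictor (30) with K_q = (E_q^T E_q)^{-1} E_q^T."""
    p = X_blocks[0].shape[1]
    lhs, rhs = np.zeros((p, p)), np.zeros(p)
    for X, E, ys in zip(X_blocks, E_blocks, ystar_blocks):
        K = np.linalg.solve(E.T @ E, E.T)  # (E_q^T E_q)^{-1} E_q^T
        lhs += X.T @ X
        rhs += X.T @ K @ ys
    return np.linalg.solve(lhs, rhs)

def ols_B(X_blocks, y_blocks):
    """Finite population regression vector (28) from correctly linked data."""
    lhs = sum(X.T @ X for X in X_blocks)
    rhs = sum(X.T @ y for X, y in zip(X_blocks, y_blocks))
    return np.linalg.solve(lhs, rhs)
```

Passing `[E @ y for E, y in zip(E_blocks, y_blocks)]` as `ystar_blocks` to `predictor_B` should reproduce `ols_B(X_blocks, y_blocks)` whenever each $E_q$ has full rank.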


3 Using estimating functions with probability-linked data

In this section we consider extension of the ideas developed for linear regression analysis in the previous section to the case where the regression model of interest is fitted via the solution of an estimating equation. In particular, we assume that this model is characterised by a $p$-dimensional parameter $\theta$, which is then estimated by solving

$$H(\theta) = 0$$

where $H(\theta)$ is a $p$-dimensional unbiased estimating function for $\theta$, i.e. a function of the data that satisfies $E_X\{H(\theta_0)\} = 0$ where $\theta_0$ is the ‘true’ value of $\theta$. Let $\partial_\theta$ denote the partial differentiation operator with respect to the components of $\theta$. The resulting estimator $\hat\theta$ can then be shown to be approximately unbiased for $\theta_0$ since, under appropriate smoothness conditions

$$0 = H(\hat\theta) \approx H(\theta_0) + \left(\partial_\theta H_0\right)\left(\hat\theta - \theta_0\right).$$

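Here $\partial_\theta H_0$ is the $p \times p$ matrix of first order partial derivatives of $H(\theta)$ with respect to the components of $\theta$, evaluated at $\theta_0$. Since $H(\theta)$ is an unbiased estimating function, it immediately follows that

$$E_X\left(\hat\theta - \theta_0\right) \approx -\left(\partial_\theta H_0\right)^{-1}E_X\{H(\theta_0)\} = 0$$

provided $\partial_\theta H_0$ is of full rank, and so $\hat\theta$ is approximately unbiased for $\theta_0$. Furthermore, we then also have

$$\mathrm{Var}_X(\hat\theta) \approx \left(\partial_\theta H_0\right)^{-1}\mathrm{Var}_X\{H(\theta_0)\}\left\{\left(\partial_\theta H_0\right)^{-1}\right\}^T \tag{32}$$

leading to the usual sandwich-type estimator of this variance

$$\hat V_X(\hat\theta) \approx \left\{\left.\left(\partial_\theta H_0\right)^{-1}\right|_{\theta_0 = \hat\theta}\right\}\hat V_X\{H(\theta_0)\}\left\{\left.\left(\partial_\theta H_0\right)^{-1}\right|_{\theta_0 = \hat\theta}\right\}^T \tag{33}$$

where $\hat V_X\{H(\theta_0)\}$ is an estimate of $\mathrm{Var}_X\{H(\theta_0)\}$. Typically, it is a plug-in estimate, i.e. $\mathrm{Var}_X\{H(\theta_0)\}$ evaluated at $\theta_0 = \hat\theta$.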
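As a quick check on (32), consider the simplest special case: ordinary least squares with correctly linked data, a single stacked design matrix $X$ and homoskedastic errors with variance $\sigma^2$ (these simplifications are assumptions made for this illustration only). Taking $H(\beta) = X^T(y - X\beta)$ gives

$$\partial_\beta H_0 = -X^T X, \qquad \mathrm{Var}_X\{H(\beta_0)\} = \sigma^2 X^T X,$$

so that (32) becomes

$$\mathrm{Var}_X(\hat\beta) \approx \left(X^T X\right)^{-1}\sigma^2 X^T X\left(X^T X\right)^{-1} = \sigma^2\left(X^T X\right)^{-1},$$

which is the usual OLS variance. The sign of $\partial_\beta H_0$ cancels because the derivative matrix enters (32) twice.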

3.1 Correcting estimating functions for linkage error

We now turn our attention to the situation where a regression model is fitted using an estimating function and data that have been linked using a probability-based method. In particular, we shall concern ourselves with situations where $H(\theta)$ is of the form

$$H(\theta) = \sum_{i=1}^{N} G_i(\theta)\{y_i - f_i(\theta)\} \tag{34}$$

where $f_i(\theta_0) = E_X(y_i)$ and $G_i(\theta)$ is a vector of order $p$ which is a function of $\theta$ and $X_i$ but not of $y_i$. Clearly (34) defines an unbiased estimating function for $\theta_0$, which we can write in ‘blocked’ form as

$$H(\theta) = \sum_q G_q(\theta)\{y_q - f_q(\theta)\} \tag{35}$$

where Gq(θ) is the p×Mq matrix with columns defined by the vectors Gi(θ) associated with the population units making up block q, and fq(θ) is the vector of order Mq defined by their corresponding values of fi(θ).

Now consider the situation described in section 1.1 where instead of $y_q$, we have access to a probability-linked version of this vector, $y_q^* = A_q y_q$. Here $A_q$ is a random permutation matrix of order $M_q$ distributed independently of $y_q$ given the values in $X_q$ (i.e. linkage is non-informative given the values of the explanatory variables), with values of $A_q$ distributed independently between blocks and where $E_X(A_q) = E_q$. Let $H^*(\theta)$ denote the value of (35) when we use $y_q^*$ instead of $y_q$. That is, our naive estimator $\hat\theta^*$ of $\theta_0$ that assumes no linkage errors satisfies

$$H^*(\hat\theta^*) = \sum_q G_q(\hat\theta^*)\left\{y_q^* - f_q(\hat\theta^*)\right\} = 0. \tag{36}$$

Clearly, since

$$E_X\{H^*(\theta_0)\} = \sum_q G_q(\theta_0)\left\{\left(E_q - I_q\right)f_q(\theta_0)\right\} \neq 0$$

we see that $H^*(\theta)$ is biased if linkage is not perfect, and so the resulting estimator $\hat\theta^*$ is also biased in this case. Given the value of $E_q$, we can correct for this bias, replacing the estimating function $H^*(\theta)$ by its bias-corrected version

$$H^*_{\mathrm{adj}}(\theta) = H^*(\theta) - \sum_q G_q(\theta)\left\{\left(E_q - I_q\right)f_q(\theta)\right\} = \sum_q G_q(\theta)\left\{y_q^* - E_q f_q(\theta)\right\}. \tag{37}$$

Our bias-adjusted estimator of $\theta$ based on the linked data is then $\hat\theta^*_{\mathrm{adj}}$ where $H^*_{\mathrm{adj}}(\hat\theta^*_{\mathrm{adj}}) = 0$.

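The general results for inference based on unbiased estimating functions clearly apply to $H^*_{\mathrm{adj}}(\theta)$ defined by (37). It immediately follows that the large sample variance of $\hat\theta^*_{\mathrm{adj}}$ is given by (32) with $H^*_{\mathrm{adj}}(\theta)$ substituted for $H(\theta)$. That is,

$$\mathrm{Var}_X(\hat\theta^*_{\mathrm{adj}}) \approx \left(\left.\partial_\theta H^*_{\mathrm{adj}}\right|_{\theta = \theta_0}\right)^{-1}\mathrm{Var}_X\left\{H^*_{\mathrm{adj}}(\theta_0)\right\}\left\{\left(\left.\partial_\theta H^*_{\mathrm{adj}}\right|_{\theta = \theta_0}\right)^{-1}\right\}^T \tag{38}$$

with plug-in sandwich-type estimator, see (33), of the form

$$\hat V_X(\hat\theta^*_{\mathrm{adj}}) = \left\{\partial_\theta H^*_{\mathrm{adj}}(\hat\theta^*_{\mathrm{adj}})\right\}^{-1}\hat V_X\left\{H^*_{\mathrm{adj}}(\theta_0)\right\}\left[\left\{\partial_\theta H^*_{\mathrm{adj}}(\hat\theta^*_{\mathrm{adj}})\right\}^{-1}\right]^T \tag{39}$$

where $\partial_\theta H^*_{\mathrm{adj}}(\hat\theta^*_{\mathrm{adj}}) = \left.\partial_\theta H^*_{\mathrm{adj}}\right|_{\theta = \hat\theta^*_{\mathrm{adj}}}$.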


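To evaluate the variance term in (38), note that

$$\begin{aligned}
\mathrm{Var}_X(y_q^*) &= E_X\left\{\mathrm{Var}_X\left(A_q y_q \mid A_q\right)\right\} + \mathrm{Var}_X\left\{E_X\left(A_q y_q \mid A_q\right)\right\} \\
&= E_X\left\{A_q\,\mathrm{Var}_X(y_q)\,A_q^T\right\} + \mathrm{Var}_X\left\{A_q f_q(\theta_0)\right\} \\
&= E_X\left\{A_q\,\mathrm{Var}_X(y_q)\,A_q^T\right\} + V_q(\theta_0) = \Sigma_q(\theta_0)
\end{aligned} \tag{40}$$

so

$$\mathrm{Var}_X\left\{H^*_{\mathrm{adj}}(\theta_0)\right\} = \sum_q G_q(\theta_0)\,\mathrm{Var}_X(y_q^*)\,G_q^T(\theta_0) = \sum_q G_q(\theta_0)\,\Sigma_q(\theta_0)\,G_q^T(\theta_0)$$

and hence

$$\hat V_X\left\{H^*_{\mathrm{adj}}(\theta_0)\right\} = \sum_q G_q(\hat\theta^*_{\mathrm{adj}})\,\Sigma_q(\hat\theta^*_{\mathrm{adj}})\,G_q^T(\hat\theta^*_{\mathrm{adj}}). \tag{41}$$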

In order to compute (41) we need to estimate the covariance matrix $\Sigma_q(\theta)$ specified by (40). In turn, this requires that we estimate both $V_q(\theta_0)$, which can be approximated via (16) after replacing $f_i$ by $f_i(\hat\theta^*_{\mathrm{adj}})$, and $E_X\left\{A_q\,\mathrm{Var}_X(y_q)\,A_q^T\right\}$, which will depend on the particular model that we assume for the $y_q$.

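Next, in order to define the matrix of partial derivatives $\partial_\theta H^*_{\mathrm{adj}}(\hat\theta^*_{\mathrm{adj}})$ in (39) we note that although in theory

$$\partial_\theta H^*_{\mathrm{adj}} = \partial_\theta\sum_q G_q(\theta)\left\{y_q^* - E_q f_q(\theta)\right\}$$

it is often the case that $G_q(\theta)$ varies little as $\theta$ changes. Consequently, we approximate this derivative by

$$\partial_\theta H^*_{\mathrm{adj}} \approx -\sum_q G_q(\theta)\,E_q\,\partial_\theta f_q(\theta).$$

That is, we put

$$\partial_\theta H^*_{\mathrm{adj}}(\hat\theta^*_{\mathrm{adj}}) = -\sum_q G_q(\hat\theta^*_{\mathrm{adj}})\,E_q\left\{\partial_\theta f_q(\hat\theta^*_{\mathrm{adj}})\right\} \tag{42}$$

where $\partial_\theta f_q(\hat\theta^*_{\mathrm{adj}}) = \left.\partial_\theta f_q(\theta)\right|_{\theta = \hat\theta^*_{\mathrm{adj}}}$. The final variance estimator for $\hat\theta^*_{\mathrm{adj}}$ is then obtained by substituting (41) and (42) into (39).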

3.2 Application to linear and logistic regression

Although we have already developed the theory for linear regression in section 2, it is interesting to see how the results obtained there can be recovered as special cases of the general estimating equation theory set out in the previous sub-section. In particular, the Lahiri-Larsen estimator (18) and the BLUE (20) can be obtained from (37) by setting $\theta \equiv \beta$ and $f_q(\beta) = X_q\beta$ (so $\partial_\beta f_q(\beta) = X_q$) with $G_q = X_q^T E_q^T$ in the case of (18) and $G_q = X_q^T E_q^T\Sigma_q^{-1}$ in the case of (20). As far as the predictor (30) of $B$ is concerned, we note that it can be expressed as the solution to

$$\sum_q X_q^T\left(E_q^T E_q\right)^{-1}E_q^T\left(y_q^* - E_q X_q\hat\beta\right) = 0.$$

It follows that in this case $G_q = X_q^T\left(E_q^T E_q\right)^{-1}E_q^T$, which by (42) leads to

$$\partial_\beta H^*_{\mathrm{adj}}(\hat\beta_B) = -\sum_q X_q^T X_q.$$

In contrast, the ‘ratio-adjusted’ estimator (12) cannot be expressed as the solution of an estimating equation of the form $\sum_q G_q\left\{y_q^* - E_q X_q\beta\right\} = 0$, being instead the solution to the alternative ‘ratio-type’ estimating equation

$$H_R(\beta) = \sum_q X_q^T\left(y_q^* - X_q D\beta\right) = 0 \tag{43}$$

where

$$D = \left(\sum_q X_q^T X_q\right)^{-1}\left(\sum_q X_q^T E_q X_q\right).$$

As a consequence, the results in the previous subsection do not apply to it directly. However, it is not difficult to show that $H_R(\beta)$ also defines an unbiased estimating function under the assumed linear model, since

$$E_X\left\{\sum_q X_q^T\left(y_q^* - X_q D\beta\right)\right\} = \left\{\sum_q X_q^T\left(E_q X_q - X_q D\right)\right\}\beta = \left\{\sum_q X_q^T E_q X_q - \left(\sum_q X_q^T X_q\right)D\right\}\beta = 0.$$

The linearisation argument that was earlier used to define an estimator of variance in the ‘standard’ estimating function approach also applies to (12) when it is written as the solution to (43). In particular, we have

$$\partial_\beta H_R(\beta) = -\sum_q X_q^T X_q D = -\sum_q X_q^T E_q^T X_q \tag{44}$$

and

$$\mathrm{Var}_X\{H_R(\beta_0)\} = \sum_q X_q^T\Sigma_q X_q. \tag{45}$$

When (44) and (45) are substituted in (38) we obtain the variance expression (13), leading to the same plug-in estimator of variance as specified by (39).

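The case where the regression model of interest corresponds to linear logistic regression is of special interest. Here $f_q(\theta) = \{f_i(\theta);\ i \in q\}$ where

$$f_i(\theta) = \frac{\exp(X_i^T\theta)}{1 + \exp(X_i^T\theta)}. \tag{46}$$

It follows that

$$\partial_\theta f_q(\theta) = D_q(\theta)X_q \tag{47}$$

where $D_q(\theta) = \mathrm{diag}\left[f_i(\theta)\{1 - f_i(\theta)\}\right]$.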

The standard maximum likelihood estimating function (i.e. the score function) for the logistic regression model puts $G_q(\theta) = X_q^T$ in (35). However, this is not the only choice for this matrix when we estimate $\theta$ via the adjusted estimating equation (37). In particular we can also use the expressions for $G_q(\theta)$ that lead to the linear regression estimators (18), (20) and (30) introduced in section 2. We summarise these options in Table 1. Here option M defines the estimating equation for the MLE under perfect linkage, option A leads to the Lahiri-Larsen estimator (18) under a linear model and option B leads to the predictor (30) of the finite population regression vector (28) under the same model. In contrast, option C in Table 1 defines the second order efficient version of (35), which in the logistic case is given by

$$G_q^{\mathrm{opt}}(\theta) = \left\{\partial_\theta E_X(y_q^*)\right\}^T\mathrm{Var}_X^{-1}(y_q^*) = \left\{\partial_\theta f_q(\theta)\right\}^T E_q^T\Sigma_q^{-1}(\theta) = X_q^T D_q(\theta)E_q^T\Sigma_q^{-1}(\theta). \tag{48}$$
