• No results found

Taylor series linearization

Chapter 3: Logistic regression modelling 3.1 Introduction

3.4 Variance estimation

3.4.1 Taylor series linearization

TSL is a method used to approximate complex smooth non-linear functions by simple linear functions of statistics in order to calculate variances, construct confidence intervals and test hypotheses of parameters (Heeringa, et al., 2010; Lohr, 2010; Kolenikov, 2010). TSL is traditionally used when the statistic of interest is a function of moments (Kolenikov, 2010). Let ฮธ =f(๐‘‡1, ๐‘‡2, โ€ฆ, ๐‘‡๐‘˜) be a smooth function of totals ๐‘‡1, ๐‘‡2, โ€ฆ, ๐‘‡๐‘˜, which can be totals of any particular variable of interest, and let ๐œƒฬ‚ = f (๐‘ก1 , ๐‘ก2, โ€ฆ,๐‘ก๐‘˜) be an estimator of ฮธ, where ๐‘ก1 , ๐‘ก2, โ€ฆ,๐‘ก๐‘˜ are sample estimates of the corresponding totals (Kolenikov, 2010). Consider a

complex sample where ๐‘‡๐‘™, l=1,2,โ€ฆ,k can be estimated by

๐‘ก๐‘™ = โˆ‘๐‘–โˆˆ๐‘†๐‘ค๐‘–๐‘ฆ๐‘–๐‘™, (18)

where ๐‘ก๐‘™ is an estimator of ๐‘‡๐‘™, ๐‘ฆ๐‘–๐‘™ is the response of unit ๐‘– to item ๐‘™, and ๐‘ค๐‘– is the sampling

weight for unit ๐‘–. For simplicity the notation has been reduced to only the USU subscript ๐‘–. A new variable can be defined for constants ๐‘Ž1, .. , ๐‘Ž๐‘˜,

๐‘ž๐‘– = โˆ‘๐‘˜๐‘™=1๐‘Ž๐‘™ ๐‘ฆ๐‘–๐‘™, such that, ๐‘ก๐‘ž = โˆ‘๐‘–โˆˆ๐‘†๐‘ค๐‘– ๐‘ž๐‘– =โˆ‘๐‘–โˆˆ๐‘†๐‘ค๐‘– โˆ‘๐‘˜๐‘™=1๐‘Ž๐‘™ ๐‘ฆ๐‘–๐‘™ = โˆ‘๐‘˜๐‘™=1๐‘Ž๐‘™ โˆ‘๐‘–โˆˆ๐‘†๐‘ค๐‘–๐‘ฆ๐‘–๐‘™ =โˆ‘๐‘˜๐‘™=1๐‘Ž๐‘™ ๐‘ก๐‘™.

The variance can then be estimated by

V(๐‘ก๐‘ž) = V(โˆ‘๐‘˜๐‘™=1๐‘Ž๐‘™ ๐‘ก๐‘™) =โˆ‘๐‘˜๐‘™=1๐‘Ž๐‘™2 V(๐‘ก๐‘™) + 2 โˆ‘๐‘˜โˆ’1๐‘=1โˆ‘๐‘˜๐‘=๐‘™+1๐‘Ž๐‘™ ๐‘Ž๐‘ Cov(๐‘ก๐‘™, ๐‘ก๐‘), (19) where V(๐‘ก๐‘ž) is the variance of the estimated total for constants ๐‘Ž1, .. , ๐‘Ž๐‘˜, V(๐‘ก๐‘™) is the variance of the estimated total ๐‘ก๐‘™, and Cov(๐‘ก๐‘™, ๐‘ก๐‘) is the covariance of ๐‘ก๐‘™ and ๐‘ก๐‘ (Lohr, 2010).

Now consider the estimator of the population mean that can be expressed as a weighted combined ratio estimator,

โˆ‘๐‘–โˆˆ๐‘†๐‘ค๐‘–๐‘ฆ๐‘–๐‘™

โˆ‘๐‘–โˆˆ๐‘†๐‘ค๐‘– =

๐‘ก๐‘™

๐‘ = ๐‘ฆฬ…. (20)

The estimated mean, like many estimators under CS, is a non-linear function of two weighted sample totals. This is true for other estimated quantities too, such as the simple linear and logistic regression coefficients (Heeringa, et al., 2010; Lohr, 2010; Binder, 1983). This is a non-linear statistic and cannot be expressed in the form of Equation 19. To solve the problem of non-linearity of sample estimators, Taylor series expansion is used to approximate the estimates of interest, expressing them as linear combinations of weighted sample totals (Heeringa, et al., 2010).

In order to do so, let

โˆ‘๐‘–โˆˆ๐‘†๐‘ค๐‘–๐‘ฆ๐‘–๐‘™ = u and โˆ‘๐‘–โˆˆ๐‘†๐‘ค๐‘– = v. From Equation 20 it follows that

๐‘ฆฬ… = ๐‘ข

๐‘ฃ.

Using Taylor series expansion Equation 20 can be approximated as ๐‘ฆฬ…๐‘‡๐‘†๐ฟ = ๐‘ข0 ๐‘ฃ0 + (u - ๐‘ข0) [ ๐œ•๐‘ฆฬ…๐‘‡๐‘†๐ฟ ๐œ•๐‘ข ]๐‘ฃโˆ’๐‘ฃ0,๐‘ขโˆ’๐‘ข0 + (v - ๐‘ฃ0) [ ๐œ•๐‘ฆฬ…๐‘‡๐‘†๐ฟ ๐œ•๐‘ฃ ]๐‘ฃโˆ’๐‘ฃ0,๐‘ขโˆ’๐‘ข0 + remainder, ๐‘ฆฬ…๐‘‡๐‘†๐ฟ โ‰ˆ ๐‘ข0 ๐‘ฃ0 + (u - ๐‘ข0)[ ๐œ•๐‘ฆฬ…๐‘‡๐‘†๐ฟ ๐œ•๐‘ข ]๐‘ฃโˆ’๐‘ฃ0,๐‘ขโˆ’๐‘ข0+(v - ๐‘ฃ0)[ ๐œ•๐‘ฆฬ…๐‘‡๐‘†๐ฟ ๐œ•๐‘ฃ ]๐‘ฃโˆ’๐‘ฃ0,๐‘ขโˆ’๐‘ข0, (21)

where ๐‘ข0 and ๐‘ฃ0 are the weighted sample totals which are obtained from the survey data,

[๐œ•๐‘ฆฬ…๐‘‡๐‘†๐ฟ

๐œ•๐‘ข ]๐‘ฃโˆ’๐‘ฃ0,๐‘ขโˆ’๐‘ข0 is the derivative of ๐‘ฆฬ… with respect to ๐‘ข evaluated at the expected values of

the sample estimates ๐‘ข0 and ๐‘ฃ0, and [๐œ•๐‘ฆฬ…๐‘‡๐‘†๐ฟ

๐œ•๐‘ฃ ]๐‘ฃโˆ’๐‘ฃ0,๐‘ขโˆ’๐‘ข0 is the derivative of ๐‘ฆฬ… with respect to ๐‘ฃ

evaluated at the expected values of the sample estimates ๐‘ข0 and ๐‘ฃ0.

Note that the quadratic and higher order terms in the full Taylor series expansion are dropped since those terms are assumed inconsequential when the sample sizes are large enough (Woodruff, 1971; Heeringa, et al., 2010; Lohr, 2010). Furthermore, consistent and ideally unbiased estimators are generally used in the place of the expected values of the sample

estimators (Heeringa, et al., 2010). Making use of Equation 19, the variance of the linearized estimator can be calculated. In Equation 21, let

[๐œ•๐‘ฆฬ…๐‘‡๐‘†๐ฟ

๐œ•๐‘ข ]๐‘ฃโˆ’๐‘ฃ0,๐‘ขโˆ’๐‘ข0 = C and [ ๐œ•๐‘ฆฬ…๐‘‡๐‘†๐ฟ

๐œ•๐‘ฃ ]๐‘ฃโˆ’๐‘ฃ0,๐‘ขโˆ’๐‘ข0 = D.

Then Equation 21 reverts to

๐‘ฆฬ…๐‘‡๐‘†๐ฟ = ๐‘ข0

๐‘ฃ0 + (u - ๐‘ข0)C + (v - ๐‘ฃ0)D, and the variance of ๐‘ฆฬ…๐‘‡๐‘†๐ฟ can be calculated as

V(๐‘ฆฬ…๐‘‡๐‘†๐ฟ) = V( ๐‘ข0

๐‘ฃ0 + (๐‘ข โˆ’ ๐‘ข0)๐ถ + (๐‘ฃ โˆ’ ๐‘ฃ0)๐ท)

= 0 + ๐ถ2 V(๐‘ข โˆ’ ๐‘ข0) + ๐ท2 V(๐‘ฃ โˆ’ ๐‘ฃ0) + 2CDcov(๐‘ข โˆ’ ๐‘ข0, ๐‘ฃ โˆ’ ๐‘ฃ0)

=๐ถ2 V(๐‘ข ) + ๐ท2 V(๐‘ฃ ) + 2 CD cov(๐‘ข , ๐‘ฃ ). C and D can be obtained as

[๐œ•๐‘ฆฬ…๐‘‡๐‘†๐ฟ ๐œ•๐‘ข ]๐‘ฃโˆ’๐‘ฃ0,๐‘ขโˆ’๐‘ข0 = 1 ๐‘ฃ0 and [ ๐œ•๐‘ฆฬ…๐‘‡๐‘†๐ฟ ๐œ•๐‘ฃ ]๐‘ฃโˆ’๐‘ฃ0,๐‘ขโˆ’๐‘ข0 = โˆ’ ๐‘ข0 ๐‘ฃ02, which simplifies to V(๐‘ฆฬ…๐‘‡๐‘†๐ฟ) = ๐‘‰(๐‘ข)+๐‘ฆฬ…๐‘‡๐‘†๐ฟ ๐‘‰(๐‘ฃ)โˆ’2๐‘ฆฬ…๐‘‡๐‘†๐ฟ ๐‘๐‘œ๐‘ฃ(๐‘ข,๐‘ฃ) ๐‘ฃ02 .

Binder (1983) proposed using a multivariate version of the TSL to calculate the variance of the estimator of a logistic regression model parameter (Binder, 1983; Heeringa, et al., 2010). As discussed in Section 3.3, the estimators of the parameters of the logistic regression model can be obtained from the pseudo MLE defined in Equation 17. Similarly, Equation 17 can be used to obtain a variance-covariance matrix of the logistic regression model parameters. A simplified version of Equation 17 for an observation in stratum โ„Ž from the ๐‘—๐‘กโ„Ž PSU and the

๐‘–๐‘กโ„Ž SSU is given below,

โˆ‘ โˆ‘ โˆ‘ ๐‘คโ„Ž ๐‘— ๐‘– โ„Ž๐‘—๐‘–๐‘ซโ„Ž๐‘—๐‘–โ€ฒ [๐œ‹โ„Ž๐‘—๐‘–(1 โˆ’ ๐œ‹โ„Ž๐‘—๐‘–)]โˆ’1(๐‘ฆโ„Ž๐‘—๐‘–โˆ’ ๐œ‹โ„Ž๐‘—๐‘–)= 0, (22) where ๐‘ซโ„Ž๐‘—๐‘– is a vector of partial derivatives, ๐‹(๐œ‹โ„Ž๐‘—๐‘–(๐‘ฉ))

๐œ‘๐ต๐‘˜ , ๐‘˜ = 0, โ€ฆ , ๐‘, ๐‘ is the number of parameters, ๐‘คโ„Ž๐‘—๐‘– is the sampling weight and ๐œ‹โ„Ž๐‘—๐‘–(๐‘ฉ) is the probability of success of the ๐‘–๐‘กโ„Ž

SSU of PSU j from stratum h. Equation 22 reduces to ๐‘ + 1 estimating equations,

The Newton-Raphson method can be used to obtain the weighted parameter estimates by finding a solution for Equation 23. Using TSL, a sandwich-type variance estimator can be obtained in the form of (Heeringa, et al., 2010)

๐‘ฃ๐‘Ž๐‘Ÿ(๐‘ฉฬ‚) = (๐‘ฑโˆ’1)๐‘ฃ๐‘Ž๐‘ŸโŒˆ๐‘†(๐‘ฉฬ‚)โŒ‰(๐‘ฑโˆ’1),

where ๐‘ฑ is the matrix of second-order derivatives with respect to ๐ตฬ‚๐‘˜.

TSL is used in most survey packages under the assumption that the PSUs are sampled with replacement within the strata at the first stage (Kolenikov, 2010). Some advantages of using TSL are that, if the partial derivatives are known, linearization will almost always give the variance estimate of a statistic and, the theory of TSL is well developed (Lohr, 2010). However, it does have some drawbacks. Calculations can be cumbersome when functions are complex. Also, not all statistics yield smooth functions in terms of population totals, and the accuracy depends on the sample size (Lohr, 2010; Kolenikov, 2010).

Although TSL is the default variance estimator in most statistical packages other options are also provided such as the jackknife and the bootstrap. These techniques form part of the resampling methods and will be discussed next.

Related documents