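$$se(\bar{y}_1 - \bar{y}_4) = \sqrt{var(\bar{y}_1) + var(\bar{y}_4) - 2 \cdot cov(\bar{y}_1, \bar{y}_4)} = \sqrt{[6.032 + 104.3 - 2 \cdot (-3.438)] \times 10^8} \approx \$108{,}250$$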

The result of this direct calculation matches the output from the lincom command in Stata.

5.6.2 Comparing Means over Time

Analysts working with longitudinal survey data are often interested in comparing the means of a longitudinal series of survey measures. Chapter 12 will address the various types of longitudinal data and introduce more sophisticated tools for longitudinal analysis of survey data.

When comparing means on single variables measured at two or more “waves” or points in time, an approach similar to that used in Section 5.6.1 can be applied. However, since longitudinal data are often released as a separate file for each time period with distinct weight values for each time point, special data preparation steps may be needed. We now consider an example of this approach using two waves of data from the HRS study.

Table 5.2

Estimated Variance–Covariance Matrix for Subpopulation Means of Total Household Assets

Subpopulation   1              2              3               4
1               6.032 × 10^8   0.714 × 10^8   –1.794 × 10^8   –3.438 × 10^8
2                              2.918 × 10^8   –0.126 × 10^8   1.209 × 10^8
3                                             7.290 × 10^8    0.679 × 10^8
4                                                             1.043 × 10^10

Source: Based on the 2006 HRS data.

Note: Estimates by education level of household head.

© 2010 by Taylor and Francis Group, LLC

144 Applied Survey Data Analysis

Example 5.13: Estimating Differences in Mean Total Household Assets from 2004 to 2006 Using Data from the HRS

To estimate the difference in 2004 and 2006 mean household assets for HRS panel households, the first step is to “stack” the 2004 and 2006 data sets, combining them into a single data set. Provided that they responded to the survey in both HRS waves, each panel household has two records in the stacked data set: one for its 2004 observation and another for the 2006 interview. Each pair of household records includes the wave-specific measure of total household assets and the wave-specific sampling weight. For this example, when stacking the data sets, we assigned these values to two new common variables, TOTASSETS and WEIGHT. Each record also includes the permanently assigned stratum and cluster codes. As described in Example 5.4, the HRS public use data sets include an indicator variable for each wave of data that identifies the respondents who are the household financial reporters (JFINR for 2004; KFINR for 2006). Using the over(year) option, estimates of mean total household assets are computed separately for 2004 and 2006. The subpop(finr0406) option restricts the estimates to the financial reporters for each of these two data collection years.
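The stacking step described above is not shown in the text. A minimal sketch of how it might be done in Stata follows. The file names (hrs2004, hrs2006, stack2004) and the wave-specific asset variable names (assets04, assets06) are hypothetical placeholders, not actual HRS names; only the standard use, gen, save, and append commands are assumed.

```stata
* Hypothetical file and variable names: hrs2004, hrs2006, stack2004,
* assets04, and assets06 are placeholders, not actual HRS names.
use hrs2004, clear
gen year = 2004
gen totassets = assets04
save stack2004, replace

use hrs2006, clear
gen year = 2006
gen totassets = assets06

* Stack the two waves: one record per household per wave
append using stack2004
```

After the append, variables that exist in only one wave's file (such as the wave-specific weights) are missing on the other wave's records, which is why a common weight variable still has to be built from the wave-specific ones.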

The postestimation lincom statement estimates the difference of means for the two time periods, its linearized standard error, and a 95% confidence interval for the difference:

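* Common weight variable: 2004 household weight, replaced by the 2006 weight on 2006 records
gen weight = jwgthh
replace weight = kwgthh if year == 2006
* Flag the household financial reporters in each wave
gen finr04 = 1 if (year==2004 & jfinr==1)
gen finr06 = 1 if (year==2006 & kfinr==1)
gen finr0406 = 1 if finr04==1 | finr06==1
* Declare the design, estimate the wave-specific means, then their difference
svyset secu [pweight = weight], strata(stratum)
svy, subpop(finr0406): mean totassets, over(year)
lincom [totassets]2004 - [totassets]2006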

Contrast        $\bar{y}_{2004} - \bar{y}_{2006}$   $se(\bar{y}_{2004} - \bar{y}_{2006})$   $CI_{.95}(\bar{y}_{2004} - \bar{y}_{2006})$
2004 vs. 2006   –$115,526                           $20,025                                 (–$155,642, –$75,411)

Note that the svyset command has been used again to specify the recoded sampling weight variable (WEIGHT) in the stacked data set. The svy: mean command is then used to request weighted estimates and linearized standard errors (and the covariances of the estimates, which are saved internally) for each subpopulation defined by the YEAR variable. The resulting estimate of the difference of means is $\hat{\Delta} = \bar{y}_{2004} - \bar{y}_{2006} = -\$115{,}526$, with a linearized standard error of $20,025. The analysis provides evidence that the mean total household assets increased significantly from 2004 to 2006 for this population.
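As a rough check, the reported interval can be related back to the point estimate and its standard error. The arithmetic below uses only the reported figures. Reading the result as a Student t multiplier is an inference, not something stated in this section.

$$\frac{(-75{,}411) - (-155{,}642)}{2} = 40{,}115.5, \qquad \frac{40{,}115.5}{20{,}025} \approx 2.003, \qquad \frac{\hat{\Delta}}{se(\hat{\Delta})} = \frac{-115{,}526}{20{,}025} \approx -5.77$$

A multiplier near 2.003 rather than 1.96 is what a t reference distribution with roughly 56 design-based degrees of freedom would give (an assumed value). The interval excludes zero, which agrees with the conclusion that mean assets changed significantly between the two waves.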

5.7 Exercises

1. This exercise serves to illustrate the effects of complex sample designs (i.e., design effects) on the variances of estimated means, due to stratification and clustering in sample selection (see Section 2.6.1 for a review). The following table lists values for an equal probability (self-weighting) sample of n = 36 observations.

Observations                     STRATUM   CLUSTER
 7.0685    13.7441     7.2293    1         1
13.6760     7.2293    13.7315    1         2
13.2310    10.8922    12.3425    1         3
10.9647    11.2793    11.8507    1         4
11.3274    16.4423    11.9133    1         5
17.3248    12.1142    16.7290    1         6
19.7091    12.9173    18.3800    2         7
13.6724    16.2839    14.6646    2         8
15.3685    15.3004    13.5876    2         9
15.9246    14.0902    16.4873    2         10
20.2603    12.0955    18.1224    2         11
12.4546    18.4702    14.6783    2         12

Any software procedures can be used to answer the following four questions.

a. Assume the sample of n = 36 observations is a simple random sample from the population. Compute the sample mean, the standard error of the mean, and a 95% confidence interval for the population mean. (Ignore the finite population correction [fpc], stratification, and the clustering in calculating the standard error.)

b. Next, assume that the n = 36 observations are selected as a stratified random sample of 18 observations from each of two strata. (Ignore the fpc and the apparent clustering.) Assume that the population size of each of the two strata is equal. Compute the sample mean, the standard error of the mean, and a 95% confidence interval for the population mean. (Ignore the fpc and the clustering in calculating the standard error.) What is the estimated $DEFT(\bar{y})$ for the standard error of the sample mean (i.e., the square root of the design effect)?

c. Assume now that the n = 36 observations are selected as an equal probability sample of 12 clusters with exactly three observations from each cluster. Compute the sample mean, the standard error of the mean, a 95% confidence interval for the population mean, and the estimate of $DEFT(\bar{y})$. Ignore the fpc and the stratification in calculating the standard error. Use the simple model of the design effect for sample means to derive an estimate of roh (2.11), the synthetic intraclass correlation (this may take a negative value for this “synthetic” data set).

d. Finally, assume that the n = 36 observations are selected as an equal probability stratified cluster sample of observations. (Two strata, six clusters per stratum, three observations per cluster.) Compute the sample mean, the standard error of the mean, the 95% CI, and estimates of $DEFT(\bar{y})$ and roh.
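The quantities requested in parts (b) through (d) are linked by the relationships sketched below. This assumes that the simple design-effect model cited as (2.11) has the familiar form in which the design effect grows with cluster size; that equation is not reproduced in this section. The symbol $b$ denotes the number of observations per cluster (three in this exercise), and the subscripts "design" and "SRS" are notation introduced here for illustration.

$$DEFT(\bar{y}) = \frac{se_{design}(\bar{y})}{se_{SRS}(\bar{y})}, \qquad DEFF(\bar{y}) = [DEFT(\bar{y})]^2 \approx 1 + (b - 1) \cdot roh \quad\Rightarrow\quad roh \approx \frac{DEFF(\bar{y}) - 1}{b - 1}$$

Under this form, a design effect below 1 yields a negative roh, which is the situation the note in part (c) anticipates.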

2. Using the NCS-R data and a statistical software procedure of your choice, compute a weighted estimate of the total number of U.S. adults who have ever been diagnosed with alcohol dependence (ALD) along with a 95% confidence interval for the total. Make sure to incorporate the complex design when computing the estimate and the confidence interval. Compute a second 95% confidence interval using an alternative variance estimation technique, and compare the two resulting confidence intervals. Would your inferences change at all depending on the variance estimation approach?

3. (Requires SUDAAN or SAS Version 9.2+) Using the SUDAAN or SAS software and the 2005–2006 NHANES data set, estimate the 25th percentile, the median, and the 75th percentile of systolic blood pressure (BPXSY1) for U.S. adults over the age of 50. You will need to create a subpopulation indicator of those aged 51 and older for this analysis. Remember to perform an appropriate subpopulation analysis for this population subclass. Compute 95% confidence intervals for each percentile.

4. Download the NCS-R data set from the book Web site and consider the following questions. For this exercise, the SESTRAT variable identifies the stratum codes for computation of sampling errors, the SECLUSTR variable identifies the sampling error computation units, and the NCSRWTSH variable contains the final sampling weights for Part 1 of the survey for each sampled individual.

a. How many sampling error calculation strata are specified for the NCS-R sampling error calculation model?

b. How many SECUs (or clusters) are there in total?

c. How many degrees of freedom for variance estimation does the NCS-R provide?

d. What is the expected loss, $L_w$, due to random weighting in survey estimation for total population estimates? Hint: $L_w = CV^2(\text{weight})$; see Section 2.7.5.

e. What is the average cluster size, b, for total sample analyses of variables with no item-missing data?

5. Using the statistical software procedure of your choice, estimate the proportion of persons in the 2006 HRS target population with arthritis (ARTHRITIS = 1). Use Taylor series linearization to estimate a standard error for this proportion. Then, answer the following questions:

a. What is the design effect for the total sample estimate of the proportion of persons with arthritis (ARTHRITIS = 1)? What is the design effect for the estimated proportion of respondents age 70 and older (AGE70 = 1)? Hint: Use the standard variance formula, var(p) = p × (1 – p)/(n – 1), to obtain the variance of the proportion p under SRS. Use the weighted estimate of p provided by the software procedure to compute the SRS variance. Don’t confuse standard errors and variances (squares of standard errors).

b. Construct a 95% confidence interval for the mean of DIABETES. Based on this confidence interval, would you say the proportion of individuals with diabetes in the 2006 HRS target population is significantly different from 0.25?
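The hint in part (a) can be written as a single ratio. In the sketch below, $se_{design}(p)$ stands for the linearized standard error reported by the software; that subscript and the symbol $d^2(p)$ are notation assumed here for illustration.

$$d^2(p) = \frac{[se_{design}(p)]^2}{p\,(1 - p)/(n - 1)}$$

The numerator is the squared standard error, so the ratio compares two variances. For part (b), the estimated proportion would be judged significantly different from 0.25 at the 5% level when 0.25 falls outside the 95% confidence interval.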

6. Examine the CONTENTS listing for the NCS-R data set on the book Web site. Choose a dependent variable of interest (e.g., MDE: 1 = Yes, 0 = No). Develop a one-sample hypothesis (e.g., the prevalence of lifetime major depressive episodes in the U.S. adult population is p = 0.20). Write down your hypothesis before you actually look at the sample estimates. Perform the required analysis using a software procedure of your choosing, computing the weighted sample-based estimate of the population parameter and a 95% confidence interval for the desired parameter. Test your hypothesis using the confidence interval. Write a one-paragraph statement of your hypothesis and a summary of the results of your sample-based estimation and your inference/conclusion based on the 95% CI.

7. Two subclasses are defined for NCS-R respondents based on their response to a question on diagnosis of a major depressive episode (MDE) (1 = Yes, 0 = No). For these two subclasses, use the software procedure of your choice to estimate the difference of means and standard error of the difference for body mass index (BMI). Make sure to use the unconditional approach to subclass analysis in this case, given that these subclasses can be thought of as cross-classes (see Section 4.5). Use the output from this analysis to replicate the following summary table.

Subclass                    Variable   $\bar{y}$, [$se(\bar{y})$]
MDE = 1 (Yes)               BMI        27.59 (.131)
MDE = 0 (No)                BMI        26.89 (.102)
Difference in Means (1–0)   BMI        .693 (.103)
