Data Analysis - Analytic Approach: Study Aim 2

CHAPTER 3: Methods

3.3 Analytic Approach: Study Aim 2

3.3.5 Data Analysis

Exploratory analyses were first conducted by examining the frequency distributions

and descriptive statistics for each of the variables included in this analysis. Although these

data have gone through extensive data editing and consistency checks, I inspected all

variables for implausible or out-of-range values and, where possible, used other data collected

in the questionnaires to check for logic and consistency with the key variables of interest.

Bivariate distributions of fibroid status with the exposures of interest and each of the

covariates were also examined, as was the percentage of records with missing responses on

the covariates. To get a clearer picture of the relationship between age and fibroid status, I

categorized age as 21-29 (due to small numbers) and then by successive 2-year categories

(e.g., 30-31, 32-33, …, 58-59), and plotted the log-odds of fibroids by age. The log-odds of

fibroids increased approximately linearly with age, but appeared to level off (or

even form a slight inverted “U” shape) after about age 50.
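
The age categorization and log-odds calculation described above can be sketched as follows. This is a minimal illustration in Python, not the code used for the analysis; the function names age_band and log_odds_by_band and the (age, has_fibroids) record layout are hypothetical.

```python
import math


def age_band(age):
    """Map an age to the categories described above: 21-29, then 2-year bands."""
    if 21 <= age <= 29:
        return "21-29"
    if 30 <= age <= 59:
        low = age - (age % 2)  # lower bound of the band: 30, 32, ..., 58
        return f"{low}-{low + 1}"
    raise ValueError(f"age {age} is outside the 21-59 range")


def log_odds_by_band(records):
    """Return {band: log(cases / non-cases)} for (age, has_fibroids) pairs."""
    counts = {}
    for age, has_fibroids in records:
        band = age_band(age)
        cases, noncases = counts.get(band, (0, 0))
        if has_fibroids:
            counts[band] = (cases + 1, noncases)
        else:
            counts[band] = (cases, noncases + 1)
    # The log-odds is undefined when a band has no cases or no non-cases,
    # so those bands are left out of the result.
    return {
        band: math.log(cases / noncases)
        for band, (cases, noncases) in counts.items()
        if cases > 0 and noncases > 0
    }
```

Plotting the returned values against the ordered bands gives the kind of graph used to judge whether the relationship is approximately linear.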

“Uncorrected” regression model

Logistic regression was used to estimate the association between pesticide use and

uterine fibroid prevalence. The first step was to use the uncorrected outcome, self-reported

uterine fibroid diagnosis. Although effect measure modification was not a primary focus, a

number of the possible endocrine disrupting pesticides were removed from the market in the

dependent (e.g., younger women would not have used DDT), odds ratios and 95%

confidence intervals were estimated for each age stratum (21-34, 35-39, 40-44, 45-49, 50-54,

55-59) and visually inspected for differences. I tested for statistical interaction by age of the

associations between fibroids and pesticide use patterns, ever use of hormonally active

pesticides, and chemical class pesticide groupings by including interaction terms for each

exposure and age, using a significance level of P < 0.10.

I evaluated the linearity assumption for categorical predictors by including disjoint

indicator terms and inspecting graphs of the log-odds of fibroids plotted against the

variable’s categories (151). When a linear trend was seen, I modeled the variable as a single

ordinal (e.g., 0, 1, 2) variable and computed a Wald P value for its coefficient. Based on the

non-linear relationship between log odds of fibroids and age, I added a quadratic term for age

in the models. The quadratic term for age was statistically significant but changed the

exposure effect estimates very little. However, excluding the quadratic term

resulted in a poorly fitting model, as assessed by the Hosmer-Lemeshow goodness-of-fit test

(P < 0.0001) (152), so it was retained.
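
For readers unfamiliar with the goodness-of-fit check mentioned above, the following is a minimal Python sketch of the Hosmer-Lemeshow statistic. It is an illustration only and assumes fitted probabilities are already available; the function name and the default of ten groups are assumptions, not details taken from this analysis.

```python
def hosmer_lemeshow_statistic(predicted, observed, groups=10):
    """Return the Hosmer-Lemeshow chi-square statistic (not its P value).

    predicted: fitted probabilities from a logistic model
    observed:  matching 0/1 (or False/True) outcomes
    """
    # Order subjects by fitted probability, then cut into roughly equal groups.
    pairs = sorted(zip(predicted, observed), key=lambda pair: pair[0])
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)      # expected number of cases
        cases = sum(1 for _, y in chunk if y)    # observed number of cases
        mean_p = expected / size
        # Assumes 0 < mean_p < 1 in every group; otherwise this divides by zero.
        statistic += (cases - expected) ** 2 / (size * mean_p * (1.0 - mean_p))
    return statistic
```

The statistic is conventionally compared with a chi-square distribution on (groups - 2) degrees of freedom; a small P value indicates poor agreement between observed and expected counts.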

A backward elimination approach was used to build the final multivariable logistic

regression model. Age (continuous), age squared, and state of residence were forced into the

models. Each of the other two covariates was dropped sequentially, one at a time, from the

full model (starting with the covariate with the highest P-value in the full model and working

down), and was retained if dropping it changed the exposure odds ratio by 10% or more

relative to the full model.
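
The change-in-estimate rule described above can be sketched as follows. This is a hedged illustration in Python: fit_exposure_or is a hypothetical caller-supplied function that refits the logistic model with the forced terms plus the listed covariates and returns the exposure odds ratio. The percent change here is computed on the odds ratio scale, which is an assumption about how the comparison was made.

```python
def backward_eliminate(fit_exposure_or, candidates, threshold=0.10):
    """Return the candidate covariates retained by the change-in-estimate rule.

    fit_exposure_or: hypothetical function; takes a list of covariate names and
                     returns the exposure odds ratio from the refitted model
    candidates:      covariate names ordered from highest to lowest P value
                     in the full model
    """
    full_or = fit_exposure_or(list(candidates))
    kept = list(candidates)
    for covariate in candidates:
        trial = [name for name in kept if name != covariate]
        # Relative change in the exposure odds ratio versus the full model.
        change = abs(fit_exposure_or(trial) - full_or) / full_or
        if change < threshold:
            kept = trial  # dropping it barely moves the estimate, so drop it
    return kept
```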

Outcome correction

The next step in the analysis was to run logistic regression models utilizing a method

proposed by Magder and Hughes to correct for outcome misclassification (12). This method

incorporates values of sensitivity and specificity into the estimation of logistic regression

parameters and corresponding variances using the Expectation-Maximization (EM) algorithm

to obtain maximum likelihood estimates (153). The procedure can be described as

essentially performing a “…standard logistic regression considering each study subject as

both diseased and not diseased with weights determined by the probability that the study

subject is truly diseased given the data” (12). To paraphrase their illustrative example,

suppose a woman reports that she has had a fibroid diagnosis. Given the sensitivity and

specificity of the self-report and the values of that woman’s covariates, the probability that

she truly has fibroids is estimated as 90%. Then a standard logistic regression is performed

with that woman entered twice: once as diseased with weight = 0.90 and again as

non-diseased with weight = 0.10. These probabilities must be recalculated after the logistic

regression parameters are estimated because the probabilities depend in part

on the values of the parameters. This leads to new probabilities, which lead to new

regression parameters. This process—estimating the probabilities and the regression

parameters—is repeated until the parameter estimates converge.
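
The iteration described above can be sketched in Python as follows. This is a minimal illustration of the general idea, not the SAS macro used in this analysis: it assumes a single covariate, one fixed sensitivity (se) and specificity (sp) for all subjects, and returns point estimates only (no corrected variances). All function names are hypothetical.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def posterior_true_case(reported, p, se, sp):
    """E-step: probability of true disease given the report and model-based p."""
    if reported:
        return se * p / (se * p + (1.0 - sp) * (1.0 - p))
    return (1.0 - se) * p / ((1.0 - se) * p + sp * (1.0 - p))


def weighted_logit(xs, ws, b0, b1, steps=25):
    """M-step: Newton-Raphson fit of a logistic model with weights ws.

    Entering each subject as diseased (weight w) and non-diseased (weight 1 - w)
    gives the same score equations as using w as a fractional outcome.
    """
    for _ in range(steps):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, w in zip(xs, ws):
            p = expit(b0 + b1 * x)
            v = p * (1.0 - p)
            g0 += w - p
            g1 += (w - p) * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-12:
            break
    return b0, b1


def em_corrected_logit(xs, reports, se, sp, tol=1e-8, max_iter=5000):
    """Alternate E- and M-steps until the coefficients stop changing."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        ws = [posterior_true_case(y, expit(b0 + b1 * x), se, sp)
              for x, y in zip(xs, reports)]
        new_b0, new_b1 = weighted_logit(xs, ws, b0, b1)
        if abs(new_b0 - b0) + abs(new_b1 - b1) < tol:
            return new_b0, new_b1
        b0, b1 = new_b0, new_b1
    return b0, b1
```

In this sketch, posterior_true_case plays the role of the 0.90 weight in the example above, and math.exp(b1) is the corrected odds ratio for a one-unit change in the covariate. Setting se = sp = 1 reduces the procedure to an ordinary logistic regression on the self-reported outcome.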

The benefit of the Magder and Hughes method is that it accommodates varying

sensitivity and specificity values for different subgroups of the analysis population. Based on

results from the validity analysis in Aim 1, sensitivity for white women increased with age

(except for the oldest age group) but specificity decreased slightly with age. The descriptive

analysis of presence/absence of fibroids at ultrasound among women reporting a previous

diagnosis suggests, however, that these women may not have been wrong. Rather, tumor

regression could have occurred with intervening factors such as time since diagnosis or

pregnancies.

I used a SAS macro available from the authors at

http://medschool.umaryland.edu/epidemiology/software.asp to perform the outcome

correction. I used results from the Aim 1 analysis to inform the estimates for sensitivity and

specificity of self-reported fibroid diagnosis. For the main correction model, specificity was

set to 0.95 but sensitivity varied by age: 18-29, 0.15; 30-34, 0.20; 35-39, 0.35; 40-44, 0.40;

45-59, 0.30. Sensitivity was set to 0.85 for women who reported having had a hysterectomy

(n = 3,022) based on the assumption that they would be better reporters of fibroid diagnosis.

As above, all corrected odds ratios were adjusted for age, age squared, and state.
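
The main-model assumptions listed above can be written as a simple lookup. This Python sketch is illustrative only; the names are hypothetical, and it assumes that the hysterectomy value overrides the age-specific sensitivity, which is one reading of the text.

```python
SPECIFICITY = 0.95
HYSTERECTOMY_SENSITIVITY = 0.85

# (lowest age, highest age, sensitivity), as listed for the main correction model.
AGE_SENSITIVITY = [
    (18, 29, 0.15),
    (30, 34, 0.20),
    (35, 39, 0.35),
    (40, 44, 0.40),
    (45, 59, 0.30),
]


def validity_parameters(age, hysterectomy):
    """Return (sensitivity, specificity) assumed for one woman."""
    if hysterectomy:
        # Assumption: hysterectomy status takes precedence over age.
        return HYSTERECTOMY_SENSITIVITY, SPECIFICITY
    for low, high, sensitivity in AGE_SENSITIVITY:
        if low <= age <= high:
            return sensitivity, SPECIFICITY
    raise ValueError(f"no sensitivity defined for age {age}")
```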

Additional analyses

Several secondary analyses were conducted. First, I examined associations between

specific pesticides and uterine fibroid diagnosis and compared effect estimates obtained

using different referent groups: 1) including never users of any pesticides as well as users of

pesticides other than that of interest and 2) only users of pesticides other than that of interest

(Appendix B).

Next, I evaluated the degree to which assumptions about self-report validity influence

the corrected odds ratios and 95% confidence intervals (Appendix C). I used age-specific

sensitivity (regardless of hysterectomy status) and specificity = 0.95 as the initial set of

assumptions, and then varied sensitivity, specificity, and both. Assumptions about self-report

validity among women with hysterectomy were evaluated by varying sensitivity and

to 59 years old, whereas the validity analysis population only includes women up to age 49, it

was difficult to predict the shape of the sensitivity and specificity curves for women in older

age ranges. The final sensitivity analysis was conducted to examine the influence of
