Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out
Sandra Taylor, Ph.D.
IDDRC BBRD Core 23 April 2014
Objectives
• Baseline Adjustment
▫ Introduce approaches
▫ Guidance on when to use different approaches
• Missing Data/Drop-out
▫ Raise awareness regarding issues/challenges caused by missing data
▫ Importance for study design and data analysis
▫ Basic understanding of approaches to handling missing data
In longitudinal studies, subjects typically have a baseline measurement
• Interest is commonly in differences in change over time between groups
▫ Does the degree of change differ between groups?
• Differences in starting values (i.e., baseline) are important to consider when trying to assess change over time.
[Figure: response plotted against Time]
Four options for baseline adjustment
1. Retain baseline value as outcome with no assumptions about group differences at baseline
2. Retain baseline value as outcome and assume group means are equal at baseline
3. Subtract baseline from post baseline responses and analyze differences from baseline
4. Include baseline value as a covariate.
Retain baseline as outcome;
No assumptions at baseline
E(Y) = β0 + β1·Group + β2·Time + β3·Group·Time
Allow intercepts (baselines) to differ between groups
[Figure: response over Time for Group 1 and Group 2]
Retain baseline as outcome;
Assume equal at baseline
E(Y) = β0 + β1·Time + β2·Group·Time
Assume same intercepts (baselines) in both groups
[Figure: response over Time for Group 1 and Group 2]
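The difference between Approaches 1 and 2 can be sketched as two mean functions. This is a minimal illustration that assumes a linear time trend, time coded 0 at baseline, and group coded 0/1; the function names, parameter names, and coefficient values are hypothetical and not taken from the slides.

```python
def mean_approach1(group, time, intercept, group_effect, time_slope, interaction):
    """Approach 1: a Group main effect lets the baseline means differ."""
    return intercept + group_effect * group + time_slope * time + interaction * group * time


def mean_approach2(group, time, intercept, time_slope, interaction):
    """Approach 2: no Group main effect, so both groups share one baseline mean."""
    return intercept + time_slope * time + interaction * group * time


# Hypothetical coefficients, chosen only for illustration.
gap_approach1 = (mean_approach1(1, 0, 10.0, 1.5, 0.5, 0.8)
                 - mean_approach1(0, 0, 10.0, 1.5, 0.5, 0.8))
gap_approach2 = (mean_approach2(1, 0, 10.0, 0.5, 0.8)
                 - mean_approach2(0, 0, 10.0, 0.5, 0.8))
print("Baseline gap, Approach 1:", gap_approach1)
print("Baseline gap, Approach 2:", gap_approach2)
```

Under this coding, the baseline gap in Approach 1 equals the Group coefficient, while Approach 2 forces it to zero and lets the groups differ only through the Group·Time term.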
Subtract baseline from post-baseline responses
• Define new variable as response variable
• Model as before
• Interpretation of results is a bit different
▫ Group – Are there differences at time 2?
▫ Group·Time – Are the lines parallel from time 2 to n?
▫ Joint test of Group and Group·Time required to evaluate whether the patterns of change are the same over time
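Defining the new response variable for Approach 3 amounts to subtracting each subject's baseline from that subject's later responses. A minimal sketch, assuming each subject's responses are stored in time order with the baseline first; the subject IDs and values are hypothetical.

```python
def change_from_baseline(responses):
    """Return post-baseline responses minus the baseline (first) value."""
    baseline = responses[0]
    return [y - baseline for y in responses[1:]]


# Hypothetical subjects; the first value in each list is the baseline.
subjects = {
    "s1": [10.0, 11.0, 12.5, 13.0],
    "s2": [8.0, 8.5, 9.0, 10.0],
}
changes = {sid: change_from_baseline(y) for sid, y in subjects.items()}
print(changes)
```

The resulting change scores start at time 2, which is why the Group term now asks whether the groups differ at time 2 rather than at baseline.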
Use Baseline as covariate
• Outcome becomes adjusted change scores (i.e., change over time adjusted for baseline)
• Similar interpretation issues as Approach 3
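For a single post-baseline time point, the baseline-adjusted group effect can be sketched without any statistics library. The sketch assumes an ordinary least squares fit of outcome on baseline and a 0/1 group indicator, in which case the baseline slope is the pooled within-group slope. The function name and data are hypothetical; a real longitudinal analysis would use a model for all time points.

```python
from statistics import mean


def adjusted_group_difference(baseline, outcome, group):
    """Group effect and baseline slope from the OLS fit outcome ~ baseline + group."""
    sxx = 0.0
    sxy = 0.0
    group_means = {}
    for g in (0, 1):
        xs = [x for x, gi in zip(baseline, group) if gi == g]
        ys = [y for y, gi in zip(outcome, group) if gi == g]
        mx, my = mean(xs), mean(ys)
        group_means[g] = (mx, my)
        # Accumulate within-group sums of squares and cross-products.
        sxx += sum((x - mx) ** 2 for x in xs)
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    raw_diff = group_means[1][1] - group_means[0][1]
    baseline_diff = group_means[1][0] - group_means[0][0]
    return raw_diff - slope * baseline_diff, slope


# Hypothetical data generated as outcome = 2 + 0.5 * baseline + 3 * group.
baseline = [8.0, 10.0, 12.0, 9.0, 11.0, 13.0]
group = [0, 0, 0, 1, 1, 1]
outcome = [6.0, 7.0, 8.0, 9.5, 10.5, 11.5]
effect, slope = adjusted_group_difference(baseline, outcome, group)
print("Adjusted group effect:", effect, "baseline slope:", slope)
```

The raw difference in outcome means is 3.5 here; the adjustment removes the part explained by the groups' different baseline means.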
Relationship Among Approaches
• Retain baseline as outcome?
▫ YES: Assume equal means at baseline?
  NO: Approach 1
  YES: Approach 2
▫ NO:
  Analyze change from baseline: Approach 3
  Include baseline as covariate: Approach 4
Which approach to use?
• Randomized or Observational Study?
▫ If randomized, it is reasonable to assume equal baseline values across groups: Approach 2
▫ If observational:
  Approach 2 if it is reasonable to assume equal baseline values across groups
  Approach 1 if baseline values differ across groups
▫ Approaches 3 and 4 applicable where Approaches 1 and 2 are applicable, respectively.
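The guidance above can be written as a small lookup. This is only a sketch of the slide's rule of thumb; the function name and arguments are hypothetical, and judging whether equal baseline means are plausible remains a subject-matter decision.

```python
def suggest_approaches(randomized, equal_baseline_plausible=False):
    """Return (retain-baseline approach, matching alternative approach)."""
    if randomized or equal_baseline_plausible:
        # Equal baseline means assumed: Approach 2, or Approach 4 as its counterpart.
        return ("Approach 2", "Approach 4")
    # Baseline means allowed to differ: Approach 1, or Approach 3 as its counterpart.
    return ("Approach 1", "Approach 3")


print(suggest_approaches(randomized=True))
print(suggest_approaches(randomized=False, equal_baseline_plausible=False))
```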
What is it?
What does it matter?
What do we do about it?
What are missing data and drop-out?
• Missing Data
▫ Observations the researcher intended to collect but did not
▫ Many different causes for missing data
▫ Not specific to longitudinal data, but common in longitudinal studies
• Drop-out
▫ Subjects leave a study before the intended end
▫ Special class of missing data unique to longitudinal data
What does it matter?
• Potential for bias and incorrect inferences
▫ Bias can be severe
• Loss of information/power
▫ Reduced precision and efficiency of estimates relative to complete data
• Data are unbalanced over time
▫ Problem for some analytical methods
• Six Cities Study of Air Pollution and Health
• Hypothetical Weight Loss Study
• Muscatine Coronary Risk Factor Study
Six Cities Study of Air Pollution and Health
• Objective: Characterize lung function growth in children
▫ Enrolled in 1st/2nd grade; followed until graduation
▫ Annual lung function tests
• Wide range (1-12) of observations per child
▫ Late enrollment – moved into school district after 2nd grade
▫ Drop out – moved out of school district
• Consider reasons for moving out of district
Hypothetical Weight Loss Study
• Objective: Determine whether a coached program is more effective than an on-line program
▫ Randomize subjects to each program
▫ Collect weight weekly for 3 months
• Types of missing values
▫ Drop-out: missing all values after time t
▫ Missing observation: missing one or more observations in the middle of the study
• What could cause the missing values?
Muscatine Coronary Risk Factor Study
• Objective: Examine development and persistence of coronary disease risk factors
▫ Children aged 5-15
▫ Measured height and weight biennially; classified children as obese or not
▫ Parental consent required for each measurement
• Less than 40% of children had complete data
• What factors contribute to missing values?
▫ No consent form
▫ Child absent from school on day of measurements
Missing Data Mechanisms
• Three types, distinguished by the relationship between the probability of missingness and the actual values (observed or unobserved)
▫ Missing Completely at Random (MCAR)
▫ Missing at Random (MAR)
▫ Not Missing at Random (NMAR), also written Missing Not at Random (MNAR)
• The mechanisms rest on different assumptions, and the methods that adequately handle missing values differ among them
Missing Completely at Random
• Probability of missing response is unrelated to
▫ The value of the response had it been obtained
▫ The value of observed responses
• Examples:
▫ Missed appointment due to car trouble
▫ Variables measured on a subset of subjects by study design
• Missingness is simply a chance event, unrelated to any of the data, observed or unobserved
• Observed data can be considered a random sample of the complete data
Missing at Random
• Probability of a missing response
▫ depends on the set of observed responses, but
▫ is unrelated to the specific missing value that would have been observed
• Examples:
▫ Removal of a subject from the study, by design, once a pre-specified value is observed
▫ More highly educated people don’t report income (education level is observed)
• Observed data can NOT be considered a random sample of the complete data
Not Missing at Random
• Probability of missing response is related to the specific values that would have been obtained
• Examples
▫ Value is below the detection limit
▫ People with higher incomes don’t report income
▫ Subject skips appointment because of weight gain
• Missingness is non-ignorable
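A small simulation can make the three mechanisms concrete. The sketch below assumes two correlated responses per subject, where the first (y1) is always observed and the second (y2) may be missing; the distributions, cut-offs, and sample size are arbitrary illustrative choices, not values from any of the studies discussed.

```python
import random
from statistics import mean

random.seed(2014)
n = 20000
# y1 is always observed; y2 is the response that may be missing.
y1 = [random.gauss(0.0, 1.0) for _ in range(n)]
y2 = [0.8 * a + random.gauss(0.0, 0.6) for a in y1]


def observed_mean(values, keep):
    """Mean of the values that remain observed."""
    return mean(v for v, k in zip(values, keep) if k)


# MCAR: every y2 has the same 50% chance of being observed.
keep_mcar = [random.random() < 0.5 for _ in range(n)]
# MAR: y2 is missing whenever the observed y1 exceeds 0.
keep_mar = [a <= 0.0 for a in y1]
# NMAR: y2 is missing whenever y2 itself exceeds 0.
keep_nmar = [b <= 0.0 for b in y2]

true_mean = mean(y2)
mcar_mean = observed_mean(y2, keep_mcar)
mar_mean = observed_mean(y2, keep_mar)
nmar_mean = observed_mean(y2, keep_nmar)
print("complete:", round(true_mean, 3))
print("MCAR:", round(mcar_mean, 3), "MAR:", round(mar_mean, 3), "NMAR:", round(nmar_mean, 3))
```

In this setup the naive mean of the observed y2 values stays close to the complete-data mean only under MCAR. Under MAR the bias can in principle be corrected using the observed y1, whereas under NMAR the observed data alone do not carry the information needed to correct it.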
Revisit Examples
• Weight Loss Study
▫ Moves out of area – MCAR
▫ Achieves goal weight – MAR or MNAR
▫ Not losing weight – MAR or MNAR
• Air Pollution and Health Study
▫ Job relocation – MCAR
▫ Child developed respiratory problems – MAR
▫ Avoid developing respiratory problems – MNAR
• Coronary Risk Factor Study
▫ Forgot to sign consent – MCAR
▫ Obese child feigns illness to avoid weighing – MNAR
Approaches to Handling Missing Data
• Deletion Methods
▫ Complete-case analysis (listwise deletion)
▫ Available-data analysis (pairwise deletion)
• Single Imputation Methods
• Model-Based Methods
▫ Multiple imputation
▫ Maximum likelihood
Deletion Methods
• Complete-Case Analysis
▫ Only analyze subjects with complete data
• Available-Data Analysis
▫ Analyze all data that were observed
▫ Some analytical methods can handle partial data (e.g., random effect models)
▫ More efficient and more powerful than complete-case analysis because it uses more information
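The two deletion strategies can be contrasted on a toy data set. This sketch uses simple per-visit means rather than a random effect model, and the subjects and weights are hypothetical; `None` marks a missing observation.

```python
from statistics import mean

# Hypothetical weights at three visits; None marks a missing observation.
data = {
    "s1": [80.0, 78.0, 77.0],
    "s2": [90.0, 89.0, None],   # dropped out after visit 2
    "s3": [85.0, None, 83.0],   # missed visit 2
    "s4": [70.0, 69.0, 68.0],
}


def complete_case_means(rows):
    """Per-visit means using only subjects with no missing values."""
    complete = [r for r in rows.values() if None not in r]
    return [mean(col) for col in zip(*complete)]


def available_data_means(rows):
    """Per-visit means using every observed value at each visit."""
    return [mean(v for v in col if v is not None) for col in zip(*rows.values())]


cc_means = complete_case_means(data)
ad_means = available_data_means(data)
print("complete-case:", cc_means)
print("available-data:", ad_means)
```

Complete-case analysis keeps only two of the four subjects here, while the available-data version uses all ten observed values.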
Deletion Methods
Advantages and Disadvantages
• Advantages
▫ Simple; available-data analysis is the default in many statistics programs
• Disadvantages
▫ Reduced sample size
▫ Complete-case analysis discards data
▫ Biased estimates unless data are MCAR
Single Imputation
• Substitute missing values with an imputed value
• Analyze “complete” data using standard methods
• Many different approaches to single imputation
Single Imputation Methods
• Mean value imputation
▫ Substitute mean value for missing value
• “Last value carried forward” imputation
▫ Use last value observed
• Regression imputation
▫ Replaces missing value with value predicted from regression derived from observed data
• K-nearest neighbor imputation
▫ Impute value based on k most similar subjects
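Two of the simpler methods listed above can be sketched in a few lines. The sketch assumes mean imputation is applied to one variable across subjects and last-value-carried-forward to one subject's series over time; the function names and data are hypothetical, and `None` marks a missing value.

```python
from statistics import mean


def impute_mean(values):
    """Replace each missing value with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_last_value(series):
    """Carry the last observed value forward; leading gaps stay missing."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Hypothetical weights at one visit across four subjects.
visit_weights = [82.0, None, 78.0, 80.0]
# Hypothetical weights for one subject across six visits.
subject_series = [82.0, 80.0, None, 78.0, None, None]

mean_filled = impute_mean(visit_weights)
carried = impute_last_value(subject_series)
print(mean_filled)
print(carried)
```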
Single Imputation Methods
Advantages and Disadvantages
• Advantages
▫ Simple to implement and understand
▫ Maintains sample size
▫ Uses all available information
• Disadvantages
▫ Can reduce variability in the data
▫ Can weaken correlations/covariances
▫ Understates standard errors because imputed values are treated as if observed, ignoring the uncertainty about the predicted unknown values
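The loss of variability is easy to see with mean imputation. In this sketch the data are hypothetical; the standard error of the mean is computed naively from the filled-in data, as a standard analysis of the "complete" data would do.

```python
from statistics import mean, stdev

# Hypothetical weights for six subjects; two values are missing.
with_missing = [70.0, None, 78.0, 82.0, None, 90.0]
observed = [v for v in with_missing if v is not None]

# Mean imputation: fill each gap with the observed mean.
fill = mean(observed)
imputed = [fill if v is None else v for v in with_missing]

sd_observed = stdev(observed)
sd_imputed = stdev(imputed)
# Naive standard errors of the mean, treating imputed values as real data.
se_observed = sd_observed / len(observed) ** 0.5
se_imputed = sd_imputed / len(imputed) ** 0.5
print("SD observed vs imputed:", round(sd_observed, 2), round(sd_imputed, 2))
print("SE observed vs imputed:", round(se_observed, 2), round(se_imputed, 2))
```

The imputed values add no spread but do increase the apparent sample size, so both the standard deviation and the naive standard error shrink.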
Maximum Likelihood
• Parameters estimated based on maximum likelihood using available data
▫ Random effect models implement this approach
• Advantages
▫ Uses all available information
▫ Unbiased estimates for MCAR and MAR data
• Disadvantages
▫ Model must be correctly specified
Multiple Imputation
• Missing values are imputed from a model (e.g., regression model)
• Imputation conducted multiple times
▫ Replacing missing value with a set of plausible values
• Each imputed data set is analyzed
• Results from the analyses of the imputed data sets are pooled into a single estimate
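The slides do not say how the results are pooled. A common choice is Rubin's rules, sketched below under that assumption: the pooled estimate is the average of the per-data-set estimates, and the total variance combines the average within-imputation variance with the between-imputation variance. The function name and the numbers are hypothetical.

```python
from statistics import mean, variance


def pool_estimates(estimates, variances):
    """Pool per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)          # average sampling variance
    between = variance(estimates)     # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from m = 5 imputed data sets.
estimates = [2.0, 2.4, 2.2, 1.8, 2.1]
variances = [0.25, 0.25, 0.25, 0.25, 0.25]  # squared standard errors
pooled, total_var = pool_estimates(estimates, variances)
print("pooled estimate:", round(pooled, 3), "pooled SE:", round(total_var ** 0.5, 3))
```

The between-imputation term is what lets the pooled standard error reflect uncertainty about the imputed values, which single imputation ignores.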
Multiple Imputation
Advantages and Disadvantages
• Advantages
▫ Better reflects data variability
▫ Considers variability due to sampling and imputation
• Disadvantages
▫ More time- and computation-intensive
What if I have MNAR missingness?
• Selection models
• Pattern mixture models
• Random effect models
• Shared parameter models
What to do – study design?
• Carefully consider potential challenges to obtaining complete data
▫ Duration of study, number of visits/surveys, travel distance, participant characteristics/motivations
▫ Provide appropriate compensation/incentives
▫ Plan to enhance/support/encourage completion
• If possible, collect information about why an observation is missing
What to do – data analysis?
• Evaluate missingness in data
▫ How much data is missing?
▫ Are there patterns to missingness?
▫ Are there differences between subjects with complete and incomplete data?
▫ Are there differences in missingness among experimental groups? Within experimental groups?
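The first checks in this list can be sketched as simple summaries. The records, group labels, and function names below are hypothetical; `None` marks a missing observation, and each pattern string uses `O` for observed and `.` for missing.

```python
from collections import Counter

# Hypothetical records: (group, responses over three visits).
records = [
    ("coached", [80.0, 78.0, 77.0]),
    ("coached", [90.0, 89.0, None]),
    ("online", [85.0, None, None]),
    ("online", [70.0, None, 68.0]),
    ("online", [75.0, 74.0, 73.0]),
]


def missing_rate(rows):
    """Fraction of all planned observations that are missing."""
    cells = [v for _, responses in rows for v in responses]
    return sum(v is None for v in cells) / len(cells)


def pattern(responses):
    """Encode one subject's missingness pattern, e.g. 'OO.' for drop-out after visit 2."""
    return "".join("." if v is None else "O" for v in responses)


overall = missing_rate(records)
groups = sorted({g for g, _ in records})
by_group = {g: missing_rate([r for r in records if r[0] == g]) for g in groups}
patterns = Counter(pattern(responses) for _, responses in records)
print("overall:", round(overall, 3))
print("by group:", {g: round(rate, 3) for g, rate in by_group.items()})
print("patterns:", dict(patterns))
```

Patterns that end in a run of missing values indicate drop-out, while gaps followed by observed values indicate intermittent missingness.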