Outline: Background; Missing data analysis in R; rggobi: data visualisation; Amelia: a program for missing data; Conclusion
Quantitative Methods Workshop
Graphical Methods for Investigating Missing Data
Graeme Hutcheson
School of Education University of Manchester
Graeme D. Hutcheson Manchester University
missing data
- Data sets with missing values are very common in the social sciences.
- Missing data are commonly ‘dealt with’ by:
  - list-wise deletion
  - simple data replacement (random values, mean values or values predicted directly from regression models)
  - removing variables with relatively large amounts of missing data from the analysis.
None of these techniques is adequate.
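To make the problem concrete, here is a minimal sketch of two of the techniques listed above, list-wise deletion and mean replacement. It is written in plain Python purely for illustration; the toy data, the variable names and the helper functions `listwise_delete` and `mean_impute` are hypothetical and are not taken from the workshop materials or from any R package.

```python
import statistics

# Hypothetical toy data set: None marks a missing value.
rows = [
    {"age": 23, "score": 61.0},
    {"age": 31, "score": None},
    {"age": None, "score": 70.0},
    {"age": 45, "score": 82.0},
    {"age": 52, "score": None},
    {"age": 38, "score": 75.0},
]


def listwise_delete(data):
    """Keep only the cases with no missing value on any variable."""
    return [r for r in data if all(v is not None for v in r.values())]


def mean_impute(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


complete = listwise_delete(rows)

scores = [r["score"] for r in rows]
observed_scores = [s for s in scores if s is not None]
imputed_scores = mean_impute(scores)

# Sample standard deviation before and after mean replacement.
sd_observed = statistics.stdev(observed_scores)
sd_imputed = statistics.stdev(imputed_scores)

print(f"cases kept after list-wise deletion: {len(complete)} of {len(rows)}")
print(f"sd of observed scores:     {sd_observed:.2f}")
print(f"sd after mean replacement: {sd_imputed:.2f}")
```

In this toy example list-wise deletion throws away cases that still carry observed values, and mean replacement leaves the mean unchanged while shrinking the spread of the variable, which is one reason neither is adequate.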
King et al., 2001: American Political Science Review
‘...approximately 94% (of analyses) use listwise deletion to eliminate entire observations... List-wise deletion discards one-third of cases on average, which deletes both the few nonresponses and the many responses in those cases. The result is a loss of valuable information at best and severe selection bias at worst.’
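A short arithmetic sketch shows how modest amounts of missing data per variable can add up to losses of this size. It assumes, purely for illustration, that each of k variables is missing independently with the same probability p; this is a simplifying assumption and not the calculation used by King et al. The function name `fraction_discarded` is hypothetical.

```python
def fraction_discarded(p_missing, n_variables):
    """Expected share of cases lost to list-wise deletion.

    Assumes every variable is missing independently with probability
    p_missing, so a case is complete with probability
    (1 - p_missing) ** n_variables.
    """
    return 1.0 - (1.0 - p_missing) ** n_variables


# Illustrative values only: 5% missing on each of 8 variables.
for k in (1, 4, 8):
    lost = fraction_discarded(0.05, k)
    print(f"{k} variables at 5% missing each: {lost:.1%} of cases discarded")
```

Under these assumptions, 5% missing on each of eight variables already discards roughly a third of the cases.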
Objectives
- Even though missing data is important, it is rarely dealt with or even acknowledged in educational research. Why is this?
- There is a general ignorance of the damaging effects that missing data can have on analyses.
- There is a lack of training in imputation techniques and a lack of available ‘useable’ software.
- Reviewers are generally reluctant to accept data imputation (without detailed justification, imputed analyses often make ‘easy targets’ for criticism).
- Data imputation is not easy and should not be achieved by
data imputation
- Data imputation, particularly multiple imputation, is now accepted, but researchers have been very slow to adopt it.
- The reason for this is only in part a lack of information and training. A bigger issue is that ... in practice it can take many hours or days to run and cannot be fully automated ... no commercial software includes a correct implementation of multiple imputation (King et al., 2001).
Missing data analysis in R
- The software problem is now being addressed by R. Researchers who have developed techniques over the last decade now have a platform on which to publish their software, and in the last year or so this has made many techniques accessible to researchers.
- A simple search for data imputation and missing data on CRAN shows the following (a selection of results is provided; note that these are only the packages that have the target words in the title):
- Amelia II: A Program for Missing Data
- arrayImpute: Missing imputation for microarray data
- cat: Analysis of categorical-variable datasets with missing values
- EMV: Estimation of Missing Values for a Data Matrix
- impute: Imputation for microarray data
- mi: Missing Data Imputation and Model Checking
- mice: Multivariate Imputation by Chained Equations
- mirf: Multiple imputation and random forests for unobservable phase, high-dimensional data
- mitools: Tools for multiple imputation of missing data
- mix: Estimation/multiple Imputation for Mixed Categorical and Continuous Data
- pan: Multiple imputation for multivariate panel or clustered data
- rggobi: Interface between R and GGobi (missing data tools)
- SeqKnn: Sequential KNN imputation method
- SimHap: A comprehensive modeling framework for epidemiological outcomes and a multiple-imputation approach to haplotypic analysis of population-based data
- VIM: Visualization and Imputation of Missing Values
- yaImpute: An R Package for k-NN Imputation
R add-on packages
- This is an exciting time in statistics and data analysis, as these techniques are only now being made available to all researchers (most of these programs have been uploaded in the last year).
- Many of the packages listed above also have point-and-click interfaces, which make them simple to operate (see, for example, rggobi, VIM, Amelia), and all have comprehensive manuals available for download from CRAN.
- This seminar will briefly demonstrate two packages: rggobi, a data visualisation package, and Amelia II, a data imputation package.
rggobi: A data visualisation program
Full details of ‘rggobi’ can be found at:
http://www.ggobi.org/rggobi
Information about R and installing packages can be found at:
http://www.r-project.org http://www.rgsweb.net
The following analyses are taken directly from the GGobi website (http://www.ggobi.org/) and the book:
Cook, D. and Swayne, D. F. (2007). Interactive and Dynamic Graphics for Data Analysis: With Examples Using R and GGobi. Springer.
The data show environmental readings for two years: an El Niño year (1997) and a non-El Niño year (1993).
missing data shown in margin plots
rggobi: visualising missing data
rggobi: imputing data
rggobi: checking imputed data (simple imputation)
rggobi: checking imputed data (multiple imputation)
Multiple imputation
- Methodologists and statisticians agree that ‘multiple imputation’ is a superior approach to the problem of missing data scattered through one’s explanatory and dependent variables than the methods currently used in applied data analysis (King et al., 2001: American Political Science Review).
- Amelia II is a package that implements a sophisticated multiple imputation of missing data and also provides diagnostics to assess the utility of the imputed data.
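Once a model has been fitted to each of the m imputed data sets, the m results have to be combined into one estimate. The sketch below uses the standard combining rules usually attributed to Rubin; the slides do not describe this step, so it is offered as background and not as Amelia's own code. The function name `pool_rubin` and the example numbers are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, std_errors):
    """Combine one coefficient estimated on m imputed data sets.

    estimates  : the m point estimates of the coefficient
    std_errors : the m corresponding standard errors
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = statistics.mean([se ** 2 for se in std_errors])
    # Between-imputation variance: sample variance of the m estimates.
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Made-up results from m = 5 imputed data sets.
coefs = [1.0, 1.2, 0.8, 1.1, 0.9]
ses = [0.2, 0.2, 0.2, 0.2, 0.2]

pooled_coef, pooled_se = pool_rubin(coefs, ses)
print(f"pooled estimate: {pooled_coef:.3f}")
print(f"pooled standard error: {pooled_se:.3f}")
```

The pooled standard error is larger than any single-imputation standard error because it also carries the between-imputation variability, which is the uncertainty that simple replacement methods ignore.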
Amelia: a simple GUI
Amelia: data input
Amelia: options - variables
Amelia: options - Time Series/Cross Sectional data
With Amelia, time-series and cross-sectional indices can be set.
Researchers often also have additional prior information about missing data values based on previous research, academic consensus, or personal experience. This information can be incorporated into the data imputation algorithm to produce vastly improved imputations.
Amelia: options - priors
Case priors and distributional priors can be easily coded using the GUI.
Amelia: output
The multiply imputed data files can be saved in a number of formats.
Amelia: diagnostics - comparing imputed and observed
Amelia: diagnostics - overimputation
Amelia and rggobi
The values imputed using Amelia can easily be saved and inspected using the graphical capabilities of rggobi. Data can be multiply imputed and then checked graphically for fit.
King et al., 2001: American Political Science Review
‘For political scientists, almost any disciplined statistical model of multiple imputation would serve better than current practices. The threats to the validity of inferences from listwise deletion are of roughly the same magnitude as those from the much better known problems of omitted variable bias.’
Conclusion
- These are just two of the many programs available for data imputation and missing data analysis.
- If you are interested in missing data analysis, investigate the available packages (see the manuals) and install those that might be of use.
- See www.RGSweb.net for data coding and general information about R.