The Basics of Regression Analysis for TIPPS
Lehana Thabane
The Purpose of Regression Modeling
• To verify the association or relationship between a single
variable and one or more explanatory variables
– One explanatory variable: simple regression
– Two or more explanatory variables: multiple regression
• To confirm results from other studies
• To reduce a large number of variables to a smaller set of
variables
• To develop scoring rules for risk assessment
– To determine variables that can discriminate between people with a condition and those without it
• To develop a model to predict future unobserved
responses (outcomes) given a set of predictor variables
Assumptions of Regression Model
• Form of model: y = a + b1*x1 + b2*x2 + error
• Assumptions:
– Existence: A relationship between the dependent and independent variables exists
– Linearity: The relationship is linear over the range of values studied
– Independence:
• For given values of explanatory variable x, the y-values are independent of each other
• The explanatory variables are independent of each other – hence the name “independent variables”
– Normality: For given values of the explanatory variables, the y-values are normally distributed
• Equivalent to saying the errors are normally distributed with zero mean and a constant SD
– Constant variance: The distribution of y-values has equal variance at each value of x
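Taken together, the model form and the assumptions above can be written compactly as follows. This is a sketch for the two-predictor case; the symbols \varepsilon_i (the error term) and \sigma (its common SD) are notation introduced here, not taken from the slides.

```latex
y_i = a + b_1 x_{1i} + b_2 x_{2i} + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2) \ \text{independently for } i = 1, \dots, n
```

Linearity is the form of the mean, independence and normality are carried by the error term, and constant variance is the single \sigma^2 shared by all observations.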
Describing the Relationships
• Pearson’s correlation coefficient, r
– To assess linear relationship between two continuous variables
– The variables should have a joint bivariate Normal distribution
• Spearman’s correlation coefficient, rho
– Also called rank correlation coefficient
– To assess monotonic relationship between two rank (ordinal) variables
– Also converts continuous variables into ranks, then computes a measure of correlation based on ranks
• Kendall’s correlation coefficient, tau
– To assess monotonic relationship between two rank variables
• Biserial correlation coefficient
– To assess relationship between a continuous variable and a binary variable
• Multiserial correlation coefficient
– To assess relationship between a continuous variable and a categorical variable with three or more levels
• Intraclass correlation coefficient, ICC
– To assess the degree of agreement between assessors/reviewers
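The difference between Pearson's r and a rank-based coefficient can be illustrated with a small sketch. The code below is a plain-Python illustration, not a replacement for a statistics package; the function names and the toy data (y equal to the cube of x) are assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson's r: strength of the linear relationship between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: a perfectly monotonic but non-linear relationship.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
print("Pearson r   :", round(pearson(x, y), 3))
print("Spearman rho:", round(spearman(x, y), 3))
```

Because the relationship is monotonic but curved, the rank-based coefficient reaches 1 while Pearson's r stays below 1, which is one reason a scatter plot should accompany any correlation coefficient.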
What does correlation measure?
• Correlation
– is a measure of strength, not causation!
• Correlation: r, rho, tau
– Varies from -1 to 1, with the extremes indicating a perfect negative or positive relationship respectively
– Measures the degree of association
– Measures the strength of linear relationship (r) or monotonic relationship (rho, tau)
– Does not provide information on causation
Assessing the Assumptions of Regression Model
• Existence
– Biological plausibility of the relationship
– R^2, a measure of the percentage of variability explained by the model; goodness-of-fit of the model
• Linearity:
– Use scatter plots to visually assess linear relationship
– Add non-linear terms to the model and assess their statistical significance or model improvement
• Independence of y-values
– Assess serial correlation
– Data collection procedures (e.g. clustered data, time series data, etc.)
• Independence of explanatory variables
– Interdependence leads to multi-collinearity problems
– Use variance inflation factor or tolerance statistics
• Normality
– Use Q-Q plots (should show a straight line)
– Use goodness-of-fit statistics
• Constant variance
– Use plots of residuals versus fitted values/x-values
– Should show a random pattern
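One common way to assess serial correlation in the residuals is the Durbin-Watson statistic. The slides do not name a specific statistic, so this choice is an assumption made for illustration; the residual vectors below are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals.
    Values near 2 suggest no first-order serial correlation; values
    toward 0 suggest positive, and toward 4 negative, serial correlation."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals ordered by time of observation.
trending = [-3, -2, -1, 0, 1, 2, 3]   # neighbours are similar
alternating = [1, -1, 1, -1]          # neighbours flip sign
print(durbin_watson(trending))
print(durbin_watson(alternating))
```

The statistic only makes sense when the observations have a natural order (for example, time), so it complements rather than replaces a review of how the data were collected.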
The Effects of Multi-collinearity
• It increases the variances of the regression coefficients
– The greater these variances, the more unstable the prediction equation will be
• It makes determining the importance of a given predictor difficult
– The common interpretation of a regression coefficient as measuring the change in E(Y) when the corresponding predictor is increased by one unit (while all other predictors are held constant) is not fully applicable
• It severely limits the size of R^2
– because all regressors are going after much of the same variance in the dependent variable (i.e. redundancy among predictors)
Methods For Detecting multi-collinearity
• Informal Diagnostics
– Correlation matrix: large correlation coefficients between regressors are a good indicator of possible multi-collinearity
– Large changes in the estimated regression coefficients when a variable is added or deleted
– Estimated regression coefficients having an algebraic sign opposite to that expected from theoretical considerations or prior experience
– Wide confidence intervals for the regression coefficients representing important predictors
Methods For Detecting multi-collinearity (cont.)
• Variance Inflation Factor
– The variance inflation factor (VIF) is defined as the reciprocal of the tolerance statistic
– VIF_j = 1/(1 - R_j^2), where R_j^2 is the coefficient of determination for regressing X_j on the rest of the explanatory variables
– The higher the VIF, the greater the multi-collinearity problem
– Rule of thumb: VIF > 10 is commonly taken to indicate serious multi-collinearity
• Tolerance Statistics
– Defined as T_j=1-R_j^2 where R_j^2 is the coefficient of determination for regressing X_j on the rest of the explanatory variables.
– The smaller the value of T_j, the greater the possibility of the existence of multi-collinearity.
– Equivalently, the larger the R_j^2 the greater the multi-collinearity problem.
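The tolerance and VIF definitions can be sketched directly: regress each predictor on the others, take R_j^2, and compute T_j = 1 - R_j^2 and VIF_j = 1/T_j. The code below is a minimal plain-Python illustration under stated assumptions: the helper names and the three toy predictors are hypothetical, and a real analysis would use a statistics package.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def r_squared(y, predictors):
    """R^2 from least-squares regression of y on the predictors plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + [list(p) for p in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def tolerance_and_vif(predictors):
    """Return a (tolerance, VIF) pair for each predictor X_j."""
    results = []
    for j, xj in enumerate(predictors):
        others = predictors[:j] + predictors[j + 1:]
        tolerance = 1.0 - r_squared(xj, others)
        results.append((tolerance, 1.0 / tolerance))
    return results

# Hypothetical predictors: x2 is almost a copy of x1, x3 is only weakly related.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.2, 6.8, 8.1]
x3 = [2, 7, 4, 1, 8, 3, 6, 5]
for name, (tol, vif) in zip(["x1", "x2", "x3"], tolerance_and_vif([x1, x2, x3])):
    print(f"{name}: tolerance={tol:.3f}  VIF={vif:.1f}")
```

In this toy data the two nearly identical predictors get a small tolerance and a large VIF, while the weakly related predictor stays close to VIF = 1.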
Methods For Detecting multicollinearity (cont.)
• Eigenvalues of correlation matrix
– Perform principal component (PC) analysis of predictor variables
– PCs are a set of variables which are linear combinations of the original set of variables
• They are uncorrelated with each other
• They have maximum variances (called eigenvalues)
• The larger the eigenvalue, the more important the corresponding PC in representing information in the original set of predictors
– The closer an eigenvalue is to zero, the stronger the indication of multi-collinearity
• An eigenvalue of zero indicates exact collinearity
– Thus, for k predictors, all k PCs should have non-zero eigenvalues if there is no collinearity
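A worked two-predictor case may make the eigenvalue diagnostic concrete. Assuming two standardized predictors with correlation r (a hypothetical setup, not the SPSS example), the correlation matrix and its eigenvalues are:

```latex
R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix},
\qquad
\det(R - \lambda I) = (1 - \lambda)^2 - r^2 = 0
\;\Longrightarrow\;
\lambda_1 = 1 + |r|, \quad \lambda_2 = 1 - |r|
```

The eigenvalues sum to the number of predictors (here 2), and the smaller one shrinks to zero as |r| approaches 1, which is the near-zero eigenvalue the diagnostic looks for.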
Remedial Measures For Multicollinearity
• Increase the sample size
• Get “appropriate” data by using proper sampling designs to collect data
• Remove the most highly correlated independent variables
• Substitute the most highly correlated independent variables
– Use index based on correlated variables (eg BMI)
• Use the centering procedure
– But be careful of the interpretation of the results
• Modify the model
Rules of Thumb for Sample Sizes for Regression Problems
• In regression problems, power refers to the ability to find a specified regression coefficient, or a specified level of R^2, statistically significant at a specified level of significance and specified sample size
• For multiple regression, Hair et al. (2000) state:
(a) With 80% power and α = 0.05, one can detect
– R^2 > 0.23 based on n = 50;
– R^2 > 0.12 based on n = 100
(b) The general rule is that the ratio of number of subjects to number of independent variables should be at least 5:1. There is substantial risk of “overfitting” if it falls below this
(c) The desired ratio is usually about 15 to 20 subjects for each independent variable.
Rules of Thumb (Cont.)
• Green (1991) recommends
1. Rule 1: for testing multiple correlations: n > 50 + 8m, where m is the number of independent variables
2. Rule 2: for testing relationship of outcome with individual predictors: n > 104 + m
• Harris (1985) recommends
1. For five or fewer predictors, the number of subjects should exceed the number of independent variables by 50: n > 50 + m
2. For equations involving six or more predictors, an absolute minimum of 10 subjects per predictor is recommended
• For logistic regression, simulation studies indicate that for stable models one requires 10 to 15 events per predictor variable
1. Babyak (2004)
2. Peduzzi et al. (1996)
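The rules of thumb above are simple arithmetic and can be collected into a small helper. This is a sketch only: the function names are hypothetical, each function returns the threshold stated in the rule, and the default of 10 events per variable is the lower end of the 10 to 15 range quoted above.

```python
def green_overall(m):
    """Green (1991), Rule 1: n should exceed 50 + 8m (testing multiple correlation)."""
    return 50 + 8 * m

def green_individual(m):
    """Green (1991), Rule 2: n should exceed 104 + m (testing individual predictors)."""
    return 104 + m

def harris(m):
    """Harris (1985): n should exceed 50 + m for five or fewer predictors;
    otherwise at least 10 subjects per predictor."""
    return 50 + m if m <= 5 else 10 * m

def max_predictors_logistic(events, events_per_variable=10):
    """Largest number of predictors supported by the observed number of events."""
    return events // events_per_variable

m = 5  # hypothetical number of independent variables
print("Green rule 1 threshold:", green_overall(m))
print("Green rule 2 threshold:", green_individual(m))
print("Harris threshold      :", harris(m))
print("Predictors for 45 events:", max_predictors_logistic(45))
```

When the rules disagree, taking the largest threshold is the conservative choice; for logistic regression the count that matters is the number of events, not the total sample size.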
Model Validation
• Independent verification
– Use of surrogate outcomes
– Waiting until future observations are realized
• Split-sample approach
– Use part of the sample for model building
– The other part for validation
• Resampling techniques
– Repeatedly resample or partition the original data, fitting the model each time
– Examples:
• Bootstrapping: sampling the original data with replacement
• K-fold: dividing data into k equal-sized parts; repeat modelling k times, leaving out one part for model validation
• Leave-one-out: special case of k-fold where k = n (sample size)
• Delete-d: setting aside a percentage d of the sample for validation; repeat the process 100 to 200 times and average the results
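K-fold validation, with leave-one-out as the special case k = n, can be sketched for a simple one-predictor regression. The code below is an illustration under stated assumptions: the function names, the fixed random seed, and the toy data are hypothetical, and mean squared prediction error is just one possible validation measure.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for simple regression y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return my - slope * mx, slope

def k_fold_mse(x, y, k, seed=0):
    """Mean squared prediction error over k folds.
    Each fold is held out once while the model is fitted to the rest."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        squared_errors += [(y[i] - (a + b * x[i])) ** 2 for i in held_out]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical data: an exact line plus a small deterministic wobble.
x = list(range(20))
y_exact = [2 + 3 * v for v in x]
y_noisy = [2 + 3 * v + (1 if v % 2 else -1) for v in x]
print("5-fold MSE, exact line :", k_fold_mse(x, y_exact, k=5))
print("5-fold MSE, noisy line :", k_fold_mse(x, y_noisy, k=5))
print("leave-one-out MSE      :", k_fold_mse(x, y_noisy, k=len(x)))
```

The held-out error, not the in-sample fit, is the quantity that speaks to how the model will predict unobserved responses.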
Reporting the Methods
• State how the sample size was determined (if regression was a primary method of analysis)
• Identify the variables and summarize them descriptively
• Specify how the (explanatory) variables that appear in the final model were selected
• Provide a test of the model goodness-of-fit and the methods for assessing model assumptions
– Specify whether explanatory variables were tested for interaction effects
– Specify whether all potential variables were tested for collinearity and how it was handled
– Normality/constant variance assessments
• Specify how the model was validated
• Specify how
– outliers were handled (Diagnostics: boxplots of residuals)
– influential observations were handled (Diagnostics: Cook’s statistic)
– missing data were handled
• Specify how results are summarized:
– Coefficient, standard error, 95% CI and associated p-value
– Use OR, 95% CI and p-value for logistic regression
– P-values reported to 3 decimal places
• Specify any planned sensitivity analyses
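For logistic regression, the odds ratio (OR) and its 95% CI are usually obtained by exponentiating the coefficient and the limits of its Wald interval. The sketch below assumes that approach; the function name and the coefficient and standard error values are hypothetical, not results from the slides.

```python
import math

def odds_ratio_ci(coefficient, standard_error, z=1.96):
    """Odds ratio with a Wald-type confidence interval.
    The interval is built on the log-odds scale and then exponentiated;
    z = 1.96 corresponds to an approximate 95% interval."""
    lower = math.exp(coefficient - z * standard_error)
    upper = math.exp(coefficient + z * standard_error)
    return math.exp(coefficient), lower, upper

# Hypothetical logistic regression output for one predictor.
or_, lower, upper = odds_ratio_ci(coefficient=0.405, standard_error=0.15)
print(f"OR = {or_:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that excludes 1 corresponds to a coefficient that differs from zero at the matching significance level.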
Reporting the Results
• Report coefficient, standard error, 95% CI and associated p-value
• Report results of goodness-of-fit assessments
– Provide the coefficient of determination, R^2
• This provides the proportion of variability in the response accounted for by the explanatory variables included in the model
– Provide LR statistic, degrees of freedom and associated p-value
• Report results of model assumptions
– Q-Q plots
– Residual plots
– Collinearity statistics (VIF)
• Report results of sensitivity analysis
– Methods of handling missing data
– Different methods of analysis (different assumptions)
– Outliers (analysis with and without outliers)
– Different definitions of outcomes (i.e. different cut-off points for binary outcomes)
– Any twists based on variations in assumptions
How to deal with Missing Data
• List-wise deletion
• Single imputation techniques
– Hot deck
– Cold deck
– Mean imputation
– Use regression technique
– Last observation carried forward (LOCF)
– Composite method
• Multiple imputation
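Three of the simpler approaches listed above (list-wise deletion, mean imputation, and LOCF) can be sketched in a few lines. This is an illustration only: missing values are represented by None, and the function names and data are hypothetical.

```python
def listwise_delete(rows):
    """Keep only the records with no missing value in any variable."""
    return [row for row in rows if all(value is not None for value in row)]

def mean_impute(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def locf(values):
    """Last observation carried forward along one subject's ordered visits.
    Leading missing values stay missing because nothing precedes them."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# Hypothetical repeated measurements for one subject.
visits = [5.0, None, None, 7.0, None]
print(locf(visits))
print(mean_impute(visits))
print(listwise_delete([(1, 2), (None, 3), (4, 5)]))
```

Single imputation fills in one value per gap and so understates uncertainty, which is the motivation for multiple imputation.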
Common Errors in Regression Analyses
• Multivariable versus multivariate analysis
– Multivariable/multiple regression:
• single dependent/outcome variable with multiple independent variables/predictors
– Multivariate:
• multiple dependent/outcome variables with single or multiple independent variables/predictors
• No reporting of the assessment of model assumptions
• Poor reporting of the methods used to select predictor variables for inclusion in multivariable analysis
• No clearly stated hypotheses and justification
• Poor reporting of the results
Common Errors in Regression Analyses (cont.)
• Limited Scope
– The model may be applicable only to the range for which the data were available
• Cover a wide spectrum of the data
• Form of the relationship
– A statistically significant linear relationship does not necessarily mean that the relationship is a straight line
• It is important to have a clear hypothesis and justification for it
• Confounding
– Undefined confounding variables may create the illusion of the existence of a relationship or mask it
• Check whether uncontrolled variables are accounted for
Common Errors in Regression Analyses (cont.)
• Inadequacy
– Goodness-of-fit is not the same as prediction
– A model with very good fit may not do well in predicting unobserved responses
– Why?
• The data used to develop the model may have shown a spurious relationship
– Review the literature to make sure that the model is plausible and has a causal basis (biological hypothesis)
• The relationship may be genuine, but the data/sample was not representative of the target population
– Use proper probability sampling techniques
• The relationship may have changed over time
– It is important to check that the relationship remains unchanged during data collection and in the near future
Remarks About Analysis of Dependent data
• Beware of dependent observations
– Repeated measures
– Clustered data
– Paired data
– Moving average or autoregressive process
– Cluster randomization trials
• Modeling should also focus on the form of the dependence
• Alternative approaches include
– Random-effects or mixed-effects model
– Generalized Estimating Equations (GEE)
Other Important Remarks
• Always use a scatter plot to display the relationship among variables before you model the relationship
• Think about the goal of modeling prior to “turning on the computer”
– Is the goal to determine the cause-and-effect mechanism? (Standard regression techniques won’t be helpful!)
– Is it to derive a formula for prediction?
• Pearson’s correlation coefficient is important for bivariate Normal variables
• Correlation coefficients without scatter plots can be misleading
– Always report correlation coefficients along with scatter plots
SPSS Example: Employee data
• Gender (F,M)
• Date of Birth
• Educational Level (years)
• Employment Category
– Clerical
– Custodial
– Managerial
• Current Salary ($)
• Beginning Salary ($)
• Months since Hire (months)
• Previous Experience (months)
• Minority Classification (yes, no)
Modeling Employee data: Objectives
• Objective:
– To determine factors that predict current salary
– To illustrate how to
• describe relationships using correlation coefficients
• display relationships
• diagnose multi-collinearity
• check model assumptions
• summarize regression analysis results
• Dependent variable:
– Current salary
• Independent variables:
– Age
– Educational Level (years)
– Beginning Salary ($)
– Months since Hire (months)
– Previous Experience (months)
Regression in SPSS: Example
• Matrix scatter plot
Diagnosis of multi-colinearity
Variable                       Tolerance   VIF
Educational Level (years)      .508        1.967
Previous Experience (months)   .347        2.881
Age                            .348        2.877
Beginning Salary               .551        1.814
Months since Hire              .983        1.018

Dimension   Eigenvalue
1           5.315
2           .508
3           .136
4           .021
5           .014
6           .006
Diagnostics + Reporting
• INCOMPLETE
• TIPS ON DIAGNOSTICS:
– Residual plots
– Qqplots
• Check reporting guidelines for reporting the results
References
1. Babyak MA. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosom Med 2004; 66(3): 411-21.
2. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A Simulation Study of the Number of Events per Variable in Logistic Regression. J Clin Epidemiol 1996; 49(12): 1373-9.
3. Good PI, Hardin JW. Common Errors in Statistics (and How to Avoid them). NY: Wiley, 2003
4. Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied Regression Analysis and other Multivariable Methods, 3rd Edition. NY: Duxbury, 1998.
5. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, 1987.
6. Schafer JL. Analysis of Incomplete Multivariate Data. New York: Chapman and Hall, 1997.
7. Schafer JL. Multiple imputation: a primer. Statistical Methods in Medical Research 1999; 8: 3-15.
8. Barnard J, Meng XL. Applications of multiple imputation in medical studies: from AIDS to NHANES. Statistical Methods in Medical Research 1999; 8: 17-36.
9. van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 1999; 18: 681-694.
10. Multiple Imputation Online: http://www.multiple-imputation.com/
11. Lang TA, Secic M. How to report Statistics in Medicine. American College of