Correlation and Regression Analysis
Nayyar Raza Kazmi
Objectives of the Lecture
• To understand the concept of Correlation and Regression Analysis.
• To understand the areas in which Correlation and Regression Models can be applied.
• To understand how to interpret Correlation and Regression parameters.
• Most studies done by postgraduate trainees are cross-sectional in nature.
• Analysis of such studies is mostly confined to the application of descriptive univariate statistics.
• The quality of such studies can be enhanced by further data mining with Correlation and Regression Analysis.
Correlation
– Strength of association between two variables.
– Tells us how much the two variables are associated with one another.
– However, it does not imply CAUSATION.
– Simply tells us whether the two variables
Regression
• If there is a strong correlation between two variables, Regression is used to determine the value of the dependent variable (Y) from the value of the independent variable (X).
• Types
– Simple Linear Regression
– Multiple Linear Regression
– Logistic Regression
Correlation Analysis
Correlation Analysis is a group of statistical techniques to measure the association between two variables.
A Scatter Diagram is a chart that portrays the relationship between two variables.
The Dependent Variable is the variable being predicted or estimated.
The Independent Variable provides the basis for estimation. It is the predictor variable.
Advertising Minutes and $ Sales
[Scatter diagram: Advertising Minutes (horizontal axis) against Sales in $ thousands (vertical axis)]
The Coefficient of Correlation, r
The Coefficient of Correlation (r) is a measure of the strength of the relationship between two variables.
Negative values indicate an inverse relationship and positive values indicate a direct relationship.
[Scale of Pearson's r: -1, 0, 1]
Also called Pearson’s r and Pearson’s product moment correlation coefficient.
It requires interval or ratio-scaled data.
It can range from -1.00 to 1.00.
Values of -1.00 or 1.00 indicate perfect and strong correlation.
Values close to 0.0 indicate weak correlation.
[Scatter diagram of Y against X: Perfect Negative Correlation]
[Scatter diagram of Y against X: Perfect Positive Correlation]
[Scatter diagram of Y against X: Zero Correlation]
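The three cases pictured here can be reproduced numerically. The following is a minimal sketch of Pearson's r in plain Python; the function name `pearson_r` and all data values are illustrative assumptions, not figures from the lecture.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient for paired data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return sxy / math.sqrt(sxx * syy)

x = list(range(11))                          # X = 0, 1, ..., 10
r_pos = pearson_r(x, x)                      # Y rises in step with X
r_neg = pearson_r(x, [10 - v for v in x])    # Y falls as X rises
# Hypothetical symmetric data (Y = X squared) with no linear trend
r_zero = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_pos, r_neg, r_zero)
```

The last data set shows that r measures only linear association: Y depends entirely on X there, yet r is zero.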
Phi Co-efficient
• Used for two categorical variables
$$\phi = \frac{ad - bc}{\sqrt{(a+b)(a+c)(c+d)(b+d)}}$$
• where a, b, c and d are the four cell counts of the 2×2 table
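As a rough illustration, the phi coefficient can be computed from a 2×2 table as sketched below. The sketch assumes the standard definition, in which the denominator is the square root of the product of the four marginal totals, and a layout with a and b in the first row and c and d in the second. The function name and the counts are hypothetical.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi coefficient for a 2x2 table laid out as [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d      # row totals
    col1, col2 = a + c, b + d      # column totals
    denom = math.sqrt(row1 * row2 * col1 * col2)
    if denom == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denom

# Hypothetical counts, not data from the lecture
phi = phi_coefficient(20, 10, 5, 25)
print(round(phi, 3))
```

Like Pearson's r, the result lies between -1 and 1, with values near 0 indicating weak association.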
Regression Equation and Regression Line
$$\hat{Y} = a + bX$$
• where $\hat{Y}$ = computed value of the dependent variable
• a = Y-intercept, the value of $\hat{Y}$ where X equals zero
• b = slope of the regression line, which is the increase or decrease in Y for each change of one unit of X
• X = a given value of the independent variable
Simple Linear Regression
• Determines the value of a Dependent Variable based on a single independent Variable.
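The slides do not say how the intercept a and slope b are estimated. A common choice is ordinary least squares, and the sketch below assumes that method. The function name `fit_line` and the data are hypothetical, chosen so that the points lie exactly on Y = 2 + 3X.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line Y = a + bX."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("X has no variation, so the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                  # change in Y per one-unit change in X
    a = mean_y - b * mean_x        # value of Y where X equals zero
    return a, b

# Hypothetical data lying exactly on Y = 2 + 3X
xs = [1, 2, 3, 4, 5]
ys = [5, 8, 11, 14, 17]
a, b = fit_line(xs, ys)
y_hat = a + b * 6                  # computed value of Y for X = 6
print(a, b, y_hat)
```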
Multiple Linear Regression
• Used when the Dependent Variable is a continuous variable and the independent variables are continuous or categorical.
Putting MLR in Practice
• A descriptive study on normal healthy adults aged 14-25 years gathers data about their weight, systolic Blood
?????
• Is serum cholesterol level associated with weight and systolic blood pressure?
• Can we predict serum cholesterol levels if we know a person's weight and systolic blood pressure?
Y= 18.52+3.20(BP)+[-4.06(Weight)]
So what could be the serum cholesterol level for a person who weighs 75 kg and has a systolic blood pressure of 145 mm Hg?
Y= 18.52+3.20(145)+[-4.06(75)]
Y= 18.52+464+[-304.5]
Y= 18.52+464-304.5
Y= 178.02
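The same arithmetic can be written as a small function, which makes it easy to try other values. This is only a sketch: the coefficients are the ones quoted in the slide's equation, and the function and constant names are illustrative.

```python
# Coefficients as quoted in the slide's fitted equation
INTERCEPT = 18.52
COEF_BP = 3.20        # per mm Hg of systolic blood pressure
COEF_WEIGHT = -4.06   # per kg of body weight

def predict_cholesterol(systolic_bp, weight_kg):
    """Evaluate Y = 18.52 + 3.20(BP) + [-4.06(Weight)]."""
    return INTERCEPT + COEF_BP * systolic_bp + COEF_WEIGHT * weight_kg

predicted = predict_cholesterol(systolic_bp=145, weight_kg=75)
print(round(predicted, 2))
```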
Logistic Regression
• Logistic Regression is used when the outcome variable is categorical
• The independent variables could be either categorical or continuous
• Logistic Regression determines the Odds Ratio for various independent variables for the dichotomous dependent variable
• The Dichotomous Dependent variable could be presence/absence of a complication, disease etc.
• Data for dichotomous variables must be binary coded, like 1 for presence of complication or disease and 0 for absence of complication or disease.
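To make the binary coding and the odds ratio concrete, here is a minimal sketch. For a single binary independent variable, the odds ratio reported by logistic regression equals the cross-product ratio ad/bc of the 2×2 table, and the fitted coefficient is its natural logarithm. The sketch uses that shortcut rather than fitting a model. The function name and the data are hypothetical.

```python
import math

def odds_ratio(exposure, outcome):
    """Odds ratio from two binary-coded lists (1 = present, 0 = absent)."""
    a = b = c = d = 0
    for e, o in zip(exposure, outcome):
        if e not in (0, 1) or o not in (0, 1):
            raise ValueError("variables must be binary coded as 0 or 1")
        if e == 1 and o == 1:
            a += 1     # risk factor present, complication present
        elif e == 1:
            b += 1     # risk factor present, complication absent
        elif o == 1:
            c += 1     # risk factor absent, complication present
        else:
            d += 1     # risk factor absent, complication absent
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)

# Hypothetical data: 10 patients with the risk factor, 10 without
exposure = [1] * 10 + [0] * 10
outcome = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 6

or_value = odds_ratio(exposure, outcome)
log_or = math.log(or_value)   # equals the logistic coefficient in this one-variable case
print(or_value, round(log_or, 3))
```

With several independent variables the odds ratios are adjusted for one another and have to come from a fitted model, which the software listed later in the lecture can produce.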
Putting Logistic Regression in Practice
• Risk Factors for Complications of Diabetes Mellitus in patients admitted to a Tertiary Care Hospital

Risk Factors for Retinopathy       No. of patients (n=32)   %age
BMI > 30                           13                       40.63
Smoking                            28                       87.5
Level of prior awareness           14                       43.75
HbA1C > 7                          10                       31.25
Duration of Diabetes > 10 Years    20                       62.5
Where Correlation and Regression Models can be applied
• Cross-sectional studies.
• K.A.P Studies
• Studies aiming to determine relationships between certain factors of interest and
Software to use
• MS Excel with the Data Analysis add-in installed
• SPSS
• Epi Info 2002
• MedCalc (Recommended because of ease of use and power to perform all types of statistical
• Thank you for your patience. (There is a strong negative correlation between the length of a Biostats lecture and your moods, evident from the 11 o'clock sign on your foreheads.)
• Questions, Queries and Suggestions are welcome.