UNIT IV
Correlation & Co-efficient
Until the previous chapter we were mainly concerned with univariate data. In this chapter we study bivariate and multivariate populations. According to Ya-lun Chou, "There are two related but distinct aspects of the study of association between variables: correlation analysis and regression analysis. Correlation analysis has the objective of determining the degree or strength of the relationship between variables. Regression analysis attempts to establish the nature of the relationship between variables, that is, to study the functional relationship between the variables and thereby provide a mechanism of prediction, or forecasting."
Meaning
In our daily lives we notice that the bigger the house, the higher its upkeep charges; the higher the rate of interest, the greater the amount of saving; a rise in prices brings about a decrease in demand; and the devaluation of a country's currency makes exports cheaper and imports dearer.
These examples show that some kind of relationship exists between two variables. Croxton and Cowden rightly said, "When the relationship between two variables is of a quantitative nature, the appropriate statistical tool for measuring it and expressing it in a formula is known as correlation." Thus correlation is a statistical device which helps in analysing the relationship and also the covariation of two or more variables.
According to Simpson and Kafka, "correlation analysis deals with the association between two or more variables." If two variables vary in such a way that movements in one are accompanied by movements in the other, then these quantities are said to be correlated.
Importance of Correlation
A car owner knows that there is a definite relationship between petrol consumed and distance travelled. On the basis of this relationship the car owner can predict the value of one from the value of the other. Similarly, if he finds some distortion in the relationship, he can set it right.
Correlation helps in the following ways:
1. It helps to predict events, including events separated by a time gap, i.e. it helps in planning.
2. It helps in controlling events.
Types of Correlation
Correlation can be classified under the following heads:
1. Positive and negative correlation
2. Simple, multiple and partial correlation
3. Linear and non-linear correlation
Positive and Negative Correlation
Two variables are said to be positively correlated when they move in the same direction. The correlation is positive (directly related) when an increase in the value of one variable is accompanied by an increase in the value of the other, and a decrease by a decrease.
Two variables are said to be negatively correlated when they move in opposite directions. The correlation is negative (inversely related) when an increase in the value of one variable is accompanied by a decrease in the value of the other, and vice versa.
Simple, Multiple and Partial Correlation
Correlation is said to be simple when only two variables are studied. In multiple correlation three or more variables are studied simultaneously. In partial correlation more than two variables are recognised, but only two are considered to be influencing each other, and the effect of the other influencing variables is kept constant.
Linear and Non-linear Correlation
If the amount of change in one variable tends to bear a constant ratio to the amount of change in the other variable, then the correlation is said to be linear. The correlation is said to be non-linear if the amount of change in one variable does not bear a constant ratio to the amount of change in the other related variable.
Measurement of Correlation
The correlation can be measured by any of the following methods:
1. Scatter diagram
2. Karl Pearson's coefficient of correlation
3. Rank correlation coefficient
Scatter Diagram Method
The scatter diagram represents graphically the relation between two variables X and Y. For each pair of X and Y, one dot is plotted, and we get as many points on the graph as the number of observations. The degree of correlation between the variables can be estimated by examining the shape of the plotted dots. Following are some scatter diagrams showing varied degrees of correlation.
Perfect positive correlation; r = 1
Perfect negative correlation; r = -1
Low degree of positive correlation; 0 < r < 1, with r close to 0
Low degree of negative correlation; -1 < r < 0, with r close to 0
No correlation; r = 0
Advantages
(1) It is very easy to draw a scatter diagram.
(2) It is easily understood and interpreted.
(3) Extreme items do not unduly affect the result, as such points remain isolated in the diagram.
Disadvantages
(1) It does not give the precise degree of correlation.
(2) It is not amenable to further mathematical treatment.
Karl Pearson's Coefficient of Correlation
The measure of the degree of relationship between two variables is called the correlation coefficient. It is denoted by the symbol r. The assumptions that constitute a bivariate linear correlation population model, for which correlation is to be calculated, include the following (Ya-lun Chou):
1. Both X and Y are random variables. Either variable can be designated as the independent variable, and the other variable is the dependent variable.
2. The bivariate population is normal. A bivariate normal population is, among other things, one in which both X and Y are normally distributed.
3. The relationship between X and Y is, in a sense, linear. This assumption implies that all the means of Y's associated with X values fall on a straight line, which is the regression line of Y on X, and all the means of X's associated with Y values fall on a straight line, which is the regression line of X on Y. Furthermore, the population regression lines in the two equations are the same if and only if the relationship between Y and X is perfect, that is, r = ±1. Otherwise, with Y dependent, intercepts and slopes will differ from the regression equation with X dependent.
Karl Pearson's method is the one most widely used in practice; its coefficient is denoted by the symbol r.
The formula for computing coefficient of correlation can take various alternative forms depending upon the choice of the user.
METHOD I — WHEN DEVIATIONS ARE TAKEN FROM ACTUAL ARITHMETIC MEAN
(A) WHEN STANDARD DEVIATIONS ARE GIVEN IN THE QUESTION
\[ r = \frac{\sum xy}{N \sigma_x \sigma_y} \]
Where x = Deviations taken from actual mean of X series
y = Deviations taken from actual mean of Y series
N = Number of items
σx = Standard deviation of X series
σy = Standard deviation of Y series
(B) WHEN STANDARD DEVIATIONS ARE NOT GIVEN IN THE QUESTION
\[ r = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}} \]
Where Σxy = Sum of products of deviations of X and Y series from their actual means
Σx² = Sum of squares of deviations of X series from its mean
Σy² = Sum of squares of deviations of Y series from its mean
Example : 1
Find correlation between marks obtained by 10 students in mathematics and statistics
Solution:
Calculation of coefficient of correlation
1. The following table gives aptitude test scores and productivity indices of 8 randomly selected workers.
Score 57 58 59 59 60 61 62 64
Index 67 68 65 68 72 72 69 71
Calculate the correlation coefficient between aptitude score and productivity index.
Solution
X Y x=X-X̄ y=Y-Ȳ x² y² xy
57 67 -3 -2 9 4 6
58 68 -2 -1 4 1 2
59 65 -1 -4 1 16 4
59 68 -1 -1 1 1 1
60 72 0 3 0 9 0
61 72 1 3 1 9 3
62 69 2 0 4 0 0
64 71 4 2 16 4 8
ΣX=480 ΣY=552 Σx=0 Σy=0 Σx²=36 Σy²=44 Σxy=24
X̄ = ΣX/N = 480/8 = 60, Ȳ = ΣY/N = 552/8 = 69
Karl Pearson's correlation coefficient: r = Σxy / (√Σx² × √Σy²) = 24 / (√36 × √44) = 24 / 39.80 = 0.60
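The deviation method used in this solution can be sketched in a few lines of Python. This is only an illustrative sketch: the function name pearson_from_deviations is an assumption made here, not something defined in these notes, and the data are the eight score and index pairs from the exercise above.

```python
from math import sqrt

def pearson_from_deviations(xs, ys):
    """Karl Pearson's r using deviations from the actual means (Method I, form B)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]            # x = X - mean of X
    dy = [y - mean_y for y in ys]            # y = Y - mean of Y
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_x2 * sum_y2)

scores = [57, 58, 59, 59, 60, 61, 62, 64]    # aptitude test scores
indices = [67, 68, 65, 68, 72, 72, 69, 71]   # productivity indices
r = pearson_from_deviations(scores, indices)
print(round(r, 2))  # 0.6
```

The intermediate sums inside the function correspond to the column totals Σxy, Σx² and Σy² of the table.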
Example: Find the correlation between the sales and expenses from the data given below.
Sales (Rs. lakhs) 50 50 55 60 65 65 60 60 50 65
Expenses (Rs. lakhs) 11 13 14 16 16 15 14 13 13 15
Solution
X Y Y² X² XY
50 11 121 2500 550
50 13 169 2500 650
55 14 196 3025 770
60 16 256 3600 960
65 16 256 4225 1040
65 15 225 4225 975
60 14 196 3600 840
60 13 169 3600 780
50 13 169 2500 650
65 15 225 4225 975
ΣX=580 ΣY=140 ΣY²=1982 ΣX²=34000 ΣXY=8190
r = (NΣXY - ΣXΣY) / (√(NΣX² - (ΣX)²) × √(NΣY² - (ΣY)²))
= (10 × 8190 - 580 × 140) / (√(10 × 34000 - 580²) × √(10 × 1982 - 140²))
= 700 / (√3600 × √220) = 700 / 889.94
r = 0.787
There is a high degree of positive correlation between the two variables.
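As a cross-check, the raw-sums form of the formula can be evaluated directly from the ten sales and expense pairs given in the question. The sketch below is illustrative only; the function name pearson_from_raw_sums is an assumption introduced here.

```python
from math import sqrt

def pearson_from_raw_sums(xs, ys):
    """r = (N*SXY - SX*SY) / sqrt((N*SX2 - SX^2) * (N*SY2 - SY^2))."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    numerator = n * sxy - sx * sy
    denominator = sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
    return numerator / denominator

sales = [50, 50, 55, 60, 65, 65, 60, 60, 50, 65]      # Rs. lakhs
expenses = [11, 13, 14, 16, 16, 15, 14, 13, 13, 15]   # Rs. lakhs
r = pearson_from_raw_sums(sales, expenses)
print(round(r, 3))  # 0.787
```

This form needs no deviations from the mean, which is convenient when the means are not whole numbers.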
Rank Correlation
The rank method for computing the coefficient of correlation is based on the rank or order, not the magnitude, of the variable. Accordingly, it is more suitable when the variables can only be arranged in order, e.g. in the case of intelligence, beauty or any other qualitative phenomenon. The ranks may range from 1 to N. Charles Edward Spearman provided the following formula:
\[ R = 1 - \frac{6 \sum D^2}{N^3 - N} \]
Where N = Number of pairs of variables X & Y
D = Rank difference
Equal ranks or tie in ranks
\[ R = 1 - \frac{6\left\{\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \dots\right\}}{N^3 - N} \]
Where m1, m2, ... are the number of times each tied rank is repeated
Example :
From the data given below, calculate the rank correlation between X & Y
Solution:
Table : Computation of Rank Correlation
This shows there is very high positive correlation between X & Y.
Example 5 : Calculate Rank Correlation from the following data.
Solution :
Table : Calculation of Rank correlation
Here m1, m2, ... denote the number of times ranks are tied in both the variables; the subscripts 1, 2, ... denote the first tie, second tie, ..., in both the variables
= 1 – 0.205 = 0.795
1. Two managers are asked to rank a group of employees in order of potential for eventually becoming top managers. The rankings are:
Employee A B C D E F G H I J
Manager I 10 2 1 4 3 6 5 8 7 9
Manager II 9 4 2 3 1 5 6 8 7 10
Compute the rank correlation coefficient and comment on the value.
Solution
Employee R1 (rank by manager I) R2 (rank by manager II) D²=(R1-R2)²
A 10 9 1
B 2 4 4
C 1 2 1
D 4 3 1
E 3 1 4
F 6 5 1
G 5 6 1
H 8 8 0
I 7 7 0
J 9 10 1
ΣD² = 14
R = 1 - 6ΣD²/(N³ - N) = 1 - (6 × 14)/(10³ - 10) = 1 - 84/990
R = 0.915
Thus there is a high degree of positive correlation in the ranks assigned by the two managers.
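The same calculation can be sketched in Python when the ranks are already given and there are no ties. The function name spearman_from_ranks is an assumption made for illustration; the two rank lists are those of Manager I and Manager II above.

```python
def spearman_from_ranks(r1, r2):
    """R = 1 - 6*sum(D^2) / (N^3 - N), valid when no ranks are tied."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

manager_1 = [10, 2, 1, 4, 3, 6, 5, 8, 7, 9]
manager_2 = [9, 4, 2, 3, 1, 5, 6, 8, 7, 10]
rho = spearman_from_ranks(manager_1, manager_2)
print(round(rho, 3))  # 0.915
```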
2. Calculate the rank correlation coefficient for the following data of marks in two tests given to candidates for a clerical job.
Preliminary test 92 89 87 86 83 77 71 63 53 50
Final test 86 83 91 77 68 85 52 82 37 57
Solution
Ranks are not given, so we have to assign ranks for each test (rank 1 goes to the lowest mark).
Preliminary test R1 Final test R2 D²
92 10 86 9 1
89 9 83 7 4
87 8 91 10 4
86 7 77 5 4
83 6 68 4 4
77 5 85 8 9
71 4 52 2 4
63 3 82 6 9
53 2 37 1 1
50 1 57 3 4
ΣD² = 44
R = 1 - 6ΣD²/(N³ - N) = 1 - (6 × 44)/(10³ - 10) = 1 - 264/990 = 0.733
3. A firm held an examination of eight applicants for a clerical post. From the marks obtained by the applicants in the accountancy and statistics papers, compute the rank coefficient of correlation.
Applicant A B C D E F G H
Accountancy 15 20 28 12 40 60 20 80
Statistics 40 30 50 30 20 10 30 60
Solution:
Some marks are repeated, so tied marks are each given the average of the ranks they would otherwise occupy.
R1 = rank assigned to accountancy
R2 = rank assigned to statistics
Applicant Accountancy R1 Statistics R2 D²
A 15 2 40 6 16
B 20 3.5 30 4 0.25
C 28 5 50 7 4
D 12 1 30 4 9
E 40 6 20 2 16
F 60 7 10 1 36
G 20 3.5 30 4 0.25
H 80 8 60 8 0
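ΣD² = 81.5
Accountancy has one tie of two marks (m1 = 2) and statistics has one tie of three marks (m2 = 3).
\[ R = 1 - \frac{6\left\{\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2)\right\}}{N^3 - N} = 1 - \frac{6(81.5 + 0.5 + 2)}{8^3 - 8} = 1 - \frac{504}{504} \]
R = 0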
There is no correlation between the marks obtained in the two subjects.
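The tie-corrected calculation can be sketched as follows. The helper names average_ranks and spearman_with_ties are assumptions introduced for illustration; ranks are assigned in ascending order (1 = lowest mark), and tied marks share the average of the positions they occupy, as in the table above.

```python
from collections import Counter

def average_ranks(values):
    """Ascending ranks (1 = smallest); tied values share the mean of their positions."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def spearman_with_ties(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # One correction term (m^3 - m)/12 for every group of m tied values, in either series
    correction = sum((m ** 3 - m) / 12
                     for series in (xs, ys)
                     for m in Counter(series).values() if m > 1)
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)

accountancy = [15, 20, 28, 12, 40, 60, 20, 80]
statistics_marks = [40, 30, 50, 30, 20, 10, 30, 60]
rho = spearman_with_ties(accountancy, statistics_marks)
print(average_ranks(accountancy))   # ranks R1
print(rho)
```

Here ΣD² is 81.5 and the two correction terms add 0.5 and 2, so the bracket equals 84 and the coefficient comes out as zero.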
4. Ten competitors in a beauty contest are ranked by three judges as follows.
1st judge 1 6 5 10 3 2 4 9 7 8
2nd judge 3 5 8 4 7 10 2 1 6 9
3rd judge 6 4 9 8 1 2 3 10 5 7
Use the rank correlation coefficient to determine which pair of judges has the nearest approach to common tastes in beauty.
Solution
To find out which pair of judges has the nearest approach to common tastes in beauty, we compare the rank correlation between the judgements of
i) 1st judge and 2nd judge
ii) 2nd judge and 3rd judge
iii) 1st judge and 3rd judge
R1 R2 R3 D1²=(R1-R2)² D2²=(R2-R3)² D3²=(R1-R3)²
1 3 6 4 9 25
6 5 4 1 1 4
5 8 9 9 1 16
10 4 8 36 16 4
3 7 1 16 36 4
2 10 2 64 64 0
4 2 3 4 1 1
9 1 10 64 81 1
7 6 5 1 1 4
8 9 7 1 4 1
ΣD1²=200 ΣD2²=214 ΣD3²=60
R(I & II) = 1 - 6ΣD1²/(N³ - N) = 1 - 1200/990 = -0.212
R(II & III) = 1 - 6ΣD2²/(N³ - N) = 1 - 1284/990 = -0.297
R(I & III) = 1 - 6ΣD3²/(N³ - N) = 1 - 360/990 = 0.636
Since the coefficient of correlation is highest between the judgements of the 1st and 3rd judges, they have the nearest approach to common tastes in beauty.
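The pairwise comparison can be sketched in Python as below. The dictionary layout and the function name spearman_from_ranks are assumptions made for illustration; the rank lists are those of the three judges in the question.

```python
from itertools import combinations

def spearman_from_ranks(r1, r2):
    """R = 1 - 6*sum(D^2) / (N^3 - N), valid when no ranks are tied."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

judges = {
    "1st": [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "2nd": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "3rd": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

# Rank correlation for every pair of judges
results = {
    (a, b): spearman_from_ranks(judges[a], judges[b])
    for a, b in combinations(judges, 2)
}
closest_pair = max(results, key=results.get)

for pair, rho in results.items():
    print(pair, round(rho, 3))
print("closest pair:", closest_pair)
```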
Example :
From the following data, compute coefficient of correlation (r) between X and Y:
Item: X series, Y series
Arithmetic mean: 25, 18
Sum of squares of deviations from A.M.: 136, 138
Summation of products of deviations of X and Y series from their respective means: 122
Number of pairs of values: 15
Solution :
r = Σxy / √(Σx² × Σy²) = 122 / √(136 × 138) = 122 / 137.0 = 0.89
Example :
From the following data, calculate Karl Pearson's coefficient of correlation
Solution :
Table Calculation of coefficient of correlation between height of fathers and height of sons
REGRESSION ANALYSIS
INTRODUCTION
In the last chapter we studied the concept of statistical relationship between two variables such as amount of fertilizer used and yield of a crop; price of a product and its supply; level of sales and amount of advertisement; and so on.
The relationship between such variables does indicate the degree and direction of their association, but it does not answer the question of whether there is any functional (or algebraic) relationship between the two variables. If there is, can it be used to estimate the most likely value of one variable, given the value of the other variable?
"In regression analysis we shall develop an estimating equation, i.e., a mathematical formula that relates the known variables to the unknown variable. (Then, after we have learned the pattern of this relationship, we can apply correlation analysis to determine the degree to which the variables are related. Correlation analysis, then, tells us how well the estimating equation actually describes the relationship.) The variable which is used to predict the unknown variable is called the 'independent' or 'explaining' variable, and the variable whose value is to be predicted is called the 'dependent' or 'explained' variable." (Ya-lun Chou)
DISTINCTION BETWEEN CORRELATION AND REGRESSION
By correlation we mean the degree of association or relationship between two or more variables. Correlation does not say anything about a cause and effect relationship: even a high degree of correlation does not necessarily imply that a cause and effect relationship exists between the two variables. In regression analysis, by contrast, a functional relationship between Y and X is assumed, such that each value of X yields exactly one estimated value of Y. One of the variables is identified as the dependent variable and the other(s) as independent variable(s).
The regression expression is derived for the purpose of predicting values of the dependent variable on the basis of the independent variable(s).
REGRESSION LINES
A regression line is the line which shows the best mean values of one variable corresponding to mean values of the other. With two series X and Y, there are two arithmetic regression lines, one showing the best mean values of X corresponding to mean Y's and the other showing the best mean values of Y corresponding to mean X's. In the context of a scatter diagram, the regression line is the straight line that best fits the scatter diagram. The most commonly used criterion is that it is the straight line that minimises the sum of the squared deviations between the predicted and observed values of the dependent variable. In the case of two variables X and Y, there will be two regression lines: the regression of X on Y and the regression of Y on X.
REGRESSION EQUATIONS
There are different methods of deriving regression equations:
(1) By taking actual values of X and Y
(2) By taking deviations from actual mean
(3) By taking deviations from assumed mean
METHOD I WHEN ACTUAL VALUES ARE TAKEN
The regression equation of Y on X is expressed as follows:
Yc = a+bX
Where a and b can be found out by solving the following two normal equations simultaneously:
ΣY = Na + bΣX
ΣXY = aΣX + bΣX²
The regression equation of X on Y is expressed as follows:
Xc = a + bY
Where a and b can be found out by solving the following two normal equations simultaneously:
ΣX = Na + bΣY
ΣXY = aΣY + bΣY²
Example : From the following table find :
(1) Regression Equation of X on Y.
(2) Regression Equation of Y on X.
Solution :
Table : Calculation of Regression Equations
Regression equation of X on Y is given by: X = a + bY
Where a and b can be found by solving the following two normal equations simultaneously:
ΣX = Na + bΣY
ΣXY = aΣY + bΣY²
Substituting the values obtained from the table above, we get
96 = 6a + 120b ....(1)
2188 = 120a + 2751b ....(2)
Multiply equation (1) by 20 and subtract equation (2) from it:
1920 = 120a + 2400b
2188 = 120a + 2751b
-268 = 0 - 351b
b = 0.76
Put this value of b in equation (1):
96 = 6a + 120 × 0.76
96 = 6a + 91.2
6a = 96 - 91.2
a = 0.8
Put the values of a and b in the regression equation of X on Y:
X = a + bY
X = 0.8 + 0.76Y
Regression equation of Y on X is given by Y = a + bX
Where constants a and b can be found by solving the following two normal equations simultaneously:
ΣY = Na + bΣX
ΣXY = aΣX + bΣX²
Substituting the values obtained from the above table, we get
120 = 6a + 96b ....(1)
2188 = 96a + 1758b ....(2)
Multiply equation (1) by 16 and subtract equation (2) from it:
1920 = 96a + 1536b
2188 = 96a + 1758b
-268 = 0 - 222b
b = 1.21
Put the value of b in equation (1):
120 = 6a + 96 × 1.21
120 = 6a + 116.16
6a = 120 - 116.16
6a = 3.84
a = 0.64
Put the values of a and b in the regression equation of Y on X:
Y = a + bX
Y = 0.64 + 1.21X
METHOD II WHEN DEVIATIONS ARE TAKEN FROM ACTUAL MEAN
Regression equation of X on Y is given by
\[ X - \bar{X} = b_{xy}(Y - \bar{Y}), \qquad b_{xy} = \frac{\sum xy}{\sum y^2} \]
where X̄, Ȳ are the actual means of the X and Y series respectively
Σxy = Sum of products of deviations taken from the actual means of X and Y
Σy² = Sum of squares of deviations from the actual mean of Y
Regression equation of Y on X is given by
\[ Y - \bar{Y} = b_{yx}(X - \bar{X}), \qquad b_{yx} = \frac{\sum xy}{\sum x^2} \]
Where
Σxy = Sum of products of deviations taken from the actual means of X and Y
Σx² = Sum of squares of deviations from the actual mean of X
Example : From the following table find :
(1) Regression Equation of X on Y.
(2) Regression Equation of Y on X.
Solution :
X̄ = 16
Ȳ = 20
Regression equation of X on Y is given by X - X̄ = bxy(Y - Ȳ)
where X̄, Ȳ are the actual means of the X and Y series respectively
bxy = 0.76
Putting the value of bxy in the above equation, with X̄ = 16 and Ȳ = 20:
X - 16 = 0.76(Y - 20)
X - 16 = 0.76Y - 15.2
X = 0.76Y - 15.2 + 16
X = 0.76Y + 0.8
Regression equation of Y on X is given by Y - Ȳ = byx(X - X̄)
Where
byx = 1.21
Putting the value of byx in the above equation, with Ȳ = 20 and X̄ = 16:
Y - 20 = 1.21(X - 16)
Y - 20 = 1.21X - 1.21 × 16
Y - 20 = 1.21X - 19.36
Y = 1.21X - 19.36 + 20
Y = 1.21X + 0.64
FEATURES OF REGRESSION COEFFICIENTS
(1) Both regression coefficients have the same sign, i.e. either both positive or both negative.
(2) The coefficient of correlation can be found if the regression coefficients are known, by the formula
\[ r = \pm\sqrt{b_{xy} \times b_{yx}} \]
(3) The correlation coefficient has the same sign as the regression coefficients, i.e. either positive or negative.
(4) Since -1 ≤ r ≤ 1 and r² = bxy × byx, the two regression coefficients cannot both be greater than one in magnitude.
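These features can be checked numerically. The sketch below is illustrative only: the function name regression_coefficients is an assumption, and the data are the six (X, Y) pairs from the exercise on two regression equations that appears a little later in this unit (X: 10, 12, 13, 12, 16, 15; Y: 40, 38, 43, 45, 37, 43).

```python
from math import sqrt, copysign

def regression_coefficients(xs, ys):
    """Return (mean_x, mean_y, b_yx, b_xy) using deviations from the actual means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    b_yx = sum_xy / sum(a * a for a in dx)   # regression coefficient of Y on X
    b_xy = sum_xy / sum(b * b for b in dy)   # regression coefficient of X on Y
    return mean_x, mean_y, b_yx, b_xy

xs = [10, 12, 13, 12, 16, 15]
ys = [40, 38, 43, 45, 37, 43]
mean_x, mean_y, b_yx, b_xy = regression_coefficients(xs, ys)

# Feature (1): both coefficients share a sign, so their product is never negative.
# Features (2) and (3): r is the square root of the product, with that common sign.
r = copysign(sqrt(b_yx * b_xy), b_yx)
print(b_yx, b_xy, round(r, 3))
```

Both coefficients are negative for these data, so r is taken with a negative sign, and the product of the coefficients stays below one as feature (4) requires.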
Example :
Following are the marks in Maths and English:
Item: Maths, English
Mean: 40, 50
Standard deviation: 10, 16
Coefficient of correlation: 0.5
(1) Find the two regression equations.
(2) Find the most likely marks in Maths if marks in English are 40.
Solution : Let the marks in Maths be denoted by X and the marks in English by Y.
We have: X̄ = 40, Ȳ = 50, σx = 10, σy = 16, r = 0.5
Regression Equation of Y on X
Y - Ȳ = r(σy/σx)(X - X̄)
Y - 50 = 0.5(16/10)(X - 40)
Y - 50 = 0.8(X - 40)
Y - 50 = 0.8X - 32
Y = 18 + 0.8X
Regression Equation of X on Y
X - X̄ = r(σx/σy)(Y - Ȳ)
X - 40 = 0.5(10/16)(Y - 50)
X - 40 = 0.3125(Y - 50)
X = 40 + 0.3125Y - 15.625
X = 24.375 + 0.3125Y
To find the likely marks in Maths if marks in English are 40, put Y = 40 in the regression equation of X on Y.
X = 0.3125(40) + 24.375
= 12.5 + 24.375
= 36.875
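When only the means, standard deviations and r are known, both lines follow from byx = rσy/σx and bxy = rσx/σy. A minimal Python sketch, with the hypothetical helper name regression_lines_from_summary and the Maths and English figures from this example:

```python
def regression_lines_from_summary(mean_x, mean_y, sd_x, sd_y, r):
    """Return ((a, b) for Y = a + bX, (a, b) for X = a + bY) from summary statistics."""
    b_yx = r * sd_y / sd_x
    b_xy = r * sd_x / sd_y
    a_yx = mean_y - b_yx * mean_x    # intercept of the line of Y on X
    a_xy = mean_x - b_xy * mean_y    # intercept of the line of X on Y
    return (a_yx, b_yx), (a_xy, b_xy)

# X = marks in Maths, Y = marks in English
(y_a, y_b), (x_a, x_b) = regression_lines_from_summary(40, 50, 10, 16, 0.5)

# Most likely Maths marks when English marks are 40: use the line of X on Y
maths_estimate = x_a + x_b * 40
print(y_a, y_b)        # intercept and slope of Y on X
print(x_a, x_b)        # intercept and slope of X on Y
print(maths_estimate)
```

The estimate uses the line of X on Y because Maths (X) is the variable being predicted.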
1. Calculate the two regression equations from the following data.
X 10 12 13 12 16 15
Y 40 38 43 45 37 43
Solution:
X̄ = ΣX/N = 78/6 = 13
Ȳ = ΣY/N = 246/6 = 41
byx = Σxy/Σx² = -6/24 = -0.25
bxy = Σxy/Σy² = -6/50 = -0.12
Regression equation of Y on X is Y - Ȳ = byx(X - X̄)
Y - 41 = -0.25(X - 13)
Y = -0.25X + 44.25
Regression equation of X on Y is X - X̄ = bxy(Y - Ȳ)
X - 13 = -0.12(Y - 41)
X = -0.12Y + 17.92
Value of Y when X = 20:
Y = -0.25(20) + 44.25
Y = 39.25
Coefficient of correlation r = ±√(bxy × byx)
r = -√(0.12 × 0.25) = -√0.03 = -0.17 (negative, because both regression coefficients are negative)
Example :
By using the following data, find the two lines of regression and from them compute Karl Pearson's coefficient of correlation.
ΣX = 250, ΣY = 300, ΣXY = 7900, ΣX² = 6500, ΣY² = 10000, N = 10
Solution :
The regression equation of X on Y is expressed as follows:
XC = a + bY
To determine the value of a and b, the following two normal equations are to be solved simultaneously:
ΣX = Na + bΣY
ΣXY = aΣY + bΣY²
250 = 10 × a + b × 300
7900 = a × 300 + b × 10000
250 = 10a + 300b ...(1)
7900 = 300a + 10000b ...(2)
Multiply equation (1) by 30 and subtract equation (2) from it:
7500 = 300a + 9000b
7900 = 300a + 10000b
-400 = 0 - 1000b
b = 0.4
Put the value of b in equation (1):
250 = 10a + 300 × 0.4
250 = 10a + 120
10a = 250 - 120
a = 13
So the required equation of X on Y is
X = 13 + 0.4Y
The regression equation of Y on X is expressed as follows:
Y = a + bX
where a and b can be found by solving the following:
ΣY = Na + bΣX
ΣXY = aΣX + bΣX²
Putting the values in the equations:
300 = 10a + 250b ...(1)
7900 = 250a + 6500b ...(2)
Multiply equation (1) by 25 and subtract equation (2) from it:
7500 = 250a + 6250b
7900 = 250a + 6500b
-400 = 0 - 250b
b = 1.6
Put the value of b = 1.6 in equation (1):
300 = 10a + 250 × 1.6
300 = 10a + 400
300 - 400 = 10a
-100 = 10a
a = -10
Put the values of a and b in the regression equation of Y on X:
Y = -10 + 1.6X
Karl Pearson's coefficient of correlation
r = √(bxy × byx) = √(0.4 × 1.6) = √0.64
r = 0.8
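The elimination carried out by hand above can be sketched as a small solver for the pair of normal equations. The function name solve_normal_equations and its argument names are assumptions made for illustration; the sums are those given in the question.

```python
from math import sqrt

def solve_normal_equations(n, sum_u, sum_v, sum_uv, sum_u2):
    """Solve  sum_v = n*a + b*sum_u  and  sum_uv = a*sum_u + b*sum_u2  for (a, b).

    u is the predictor and v is the variable being estimated (v = a + b*u).
    """
    b = (n * sum_uv - sum_u * sum_v) / (n * sum_u2 - sum_u ** 2)
    a = (sum_v - b * sum_u) / n
    return a, b

n, sum_x, sum_y, sum_xy, sum_x2, sum_y2 = 10, 250, 300, 7900, 6500, 10000

a_xy, b_xy = solve_normal_equations(n, sum_y, sum_x, sum_xy, sum_y2)  # X on Y
a_yx, b_yx = solve_normal_equations(n, sum_x, sum_y, sum_xy, sum_x2)  # Y on X

# Both coefficients are positive here, so the positive root is taken
r = sqrt(b_xy * b_yx)
print(a_xy, b_xy)
print(a_yx, b_yx)
print(round(r, 2))
```

The closed form for b is what results from eliminating a between the two normal equations, which is the same step as multiplying equation (1) and subtracting equation (2).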
1. Find the regression equations using the least squares method:
X 10 12 14 16 18 20
Y 18 27 22 38 33 42
Solution:
X Y X² Y² XY
10 18 100 324 180
12 27 144 729 324
14 22 196 484 308
16 38 256 1444 608
18 33 324 1089 594
20 42 400 1764 840
ΣX=90 ΣY=180 ΣX²=1420 ΣY²=5834 ΣXY=2854
Regression line of Y on X is
Y = a1 + b1X ....(1)
The normal equations are
ΣY = Na1 + b1ΣX
ΣXY = a1ΣX + b1ΣX²
6a1 + 90b1 = 180
90a1 + 1420b1 = 2854
Solving these equations we get
a1 = -3 and b1 = 2.2
The regression equation of Y on X is
Y = -3 + 2.2X ....(2)
Regression equation of X on Y is
X = a2 + b2Y ....(3)
The normal equations are
ΣX = Na2 + b2ΣY
ΣXY = a2ΣY + b2ΣY²
6a2 + 180b2 = 90
180a2 + 5834b2 = 2854
Solving these two equations we get a2 = 4.5, b2 = 0.35
The regression equation of X on Y is
X = 4.5 + 0.35Y ....(4)
Hence (2) and (4) are the required regression equations.
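The least squares fit can also be computed straight from the raw observations. The sketch below is illustrative; the function name least_squares_line is an assumption, and swapping the two arguments gives the other regression line.

```python
def least_squares_line(predictor, response):
    """Fit response = a + b * predictor by solving the two normal equations."""
    n = len(predictor)
    su, sv = sum(predictor), sum(response)
    suv = sum(u * v for u, v in zip(predictor, response))
    su2 = sum(u * u for u in predictor)
    b = (n * suv - su * sv) / (n * su2 - su ** 2)
    a = (sv - b * su) / n
    return a, b

xs = [10, 12, 14, 16, 18, 20]
ys = [18, 27, 22, 38, 33, 42]

a1, b1 = least_squares_line(xs, ys)   # Y on X
a2, b2 = least_squares_line(ys, xs)   # X on Y
print(round(a1, 2), round(b1, 2))
print(round(a2, 2), round(b2, 2))
```

Carrying b2 unrounded gives an intercept a2 of about 4.35; rounding b2 to 0.35 before back-substitution gives 4.5. Either way the line passes through the point of means (X̄ = 15, Ȳ = 30).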
2. The means and variances of X and Y are X̄ = 10, var(X) = 9, Ȳ = 90, var(Y) = 144 respectively. The correlation coefficient between X and Y is 0.8. Find the regression line of Y on X and hence obtain the value of Y when X = 15.
Solution:
Given that
Mean of X = X̄ = 10
Mean of Y = Ȳ = 90
Var(X) = σx² = 9, so σx = 3
Var(Y) = σy² = 144, so σy = 12
Correlation coefficient of Y and X is r = 0.8
Regression equation of Y on X is Y - Ȳ = byx(X - X̄)
To find byx:
byx = rσy/σx
byx = 0.8(12/3)
byx = 3.2
Y - 90 = 3.2(X - 10)
Y = 3.2X + 58
Value of Y when X = 15:
Y = 3.2(15) + 58
Y = 106
3. From the following data calculate i) the correlation coefficient, ii) the S.D. of Y, given byx = 0.89, bxy = 0.85 and σx = 3.
Solution:
i) We know that the coefficient of correlation r = ±√(bxy × byx) = √(0.85 × 0.89) = √0.7565
r = 0.87
ii) We know that byx = r(σy/σx)
0.89 = 0.87(σy/3)
σy = 3.07
4. In a partially destroyed record the following data are available. Variance of X = 25.
Regression equation of X on Y is 5X - Y = 22. Regression equation of Y on X is 64X - 45Y = 24.
Find i) the mean values of X and Y, ii) the coefficient of correlation between X and Y, iii) the standard deviation of Y.
Solution:
i) Solving the regression equations we get the mean values of X and Y
5X - Y = 22 ....(1)
64X - 45Y = 24 ....(2)
Solving these two equations, X = 6 and Y = 8
Mean of X = 6
Mean of Y = 8
ii) The regression equation of Y on X is 64X - 45Y = 24
Y = (64/45)X - 24/45 ....(3)
The regression equation of X on Y is 5X - Y = 22
X = (1/5)Y + 22/5 ....(4)
From (3) and (4)
byx = 64/45, bxy = 1/5
Coefficient of correlation between X and Y is
r = ±√(bxy × byx) = √((1/5) × (64/45)) = 0.53
iii) bxy = r × σx/σy, with σx = √25 = 5
σy = r × σx / bxy = 0.5333 × 5 / 0.2
σy = 13.33
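The three steps of this solution can be sketched in Python. The variable names and the use of Cramer's rule are choices made here for illustration; each regression line is written in the form pX + qY = c.

```python
from math import sqrt

# X on Y: 5X - Y = 22        Y on X: 64X - 45Y = 24
p1, q1, c1 = 5, -1, 22
p2, q2, c2 = 64, -45, 24

# i) The two regression lines intersect at (mean of X, mean of Y); Cramer's rule
det = p1 * q2 - p2 * q1
mean_x = (c1 * q2 - c2 * q1) / det
mean_y = (p1 * c2 - p2 * c1) / det

# ii) Regression coefficients are the slopes once each line is solved for its own variable
b_xy = -q1 / p1        # X = (1/5)Y + 22/5
b_yx = -p2 / q2        # Y = (64/45)X - 24/45
r = sqrt(b_xy * b_yx)  # positive root, since both coefficients are positive

# iii) b_xy = r * sd_x / sd_y, with variance of X = 25
sd_x = sqrt(25)
sd_y = r * sd_x / b_xy

print(mean_x, mean_y)
print(round(r, 2), round(sd_y, 2))
```

A useful check on which line is which: the product bxy × byx must not exceed one, which holds here (about 0.28).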