
Last Update: 2 November 2017
Part I
M - 12
Correlation
Questions: Define correlation and the correlation coefficient. What are the types of correlation? What is the
range of the correlation coefficient? How are the values of correlations interpreted? What is meant by the
significance of correlations? What are the limits of correlation? What are the assumptions of correlation? What are
the properties of correlation? What is meant by bivariate correlations and partial correlations? What are the
parametric and nonparametric methods of statistics? How is correlation related to regression analysis? In what way
does the standard deviation help in prediction? Describe Pearson's product-moment r for linear correlation and the
Spearman rank correlation coefficient (ρ) for linear correlation. Compute the correlation coefficient from the
following data:

Definition:
Correlation is the measure of the association or relation between the changes of two or more variables in
the individuals of a population, or of a sample from that population. For example, a correlation between the scores
of height and weight in the individuals of a sample assesses whether or not the changes in height are significantly
associated with changes in weight and, if so, to what degree and in what direction (same or opposite). Thus the
correlation coefficient is a measure of linear association between two variables.

Range of the Correlation Coefficient


Values of the correlation coefficient are always between -1 and +1. A correlation coefficient of +1 indicates
that two variables are perfectly related in a positive linear sense, a correlation coefficient of -1 indicates that two
variables are perfectly related in a negative linear sense, and a correlation coefficient of 0 indicates that there is no
linear relationship between the two variables. For simple linear regression, the sample correlation coefficient is the
square root of the coefficient of determination, with the sign of the correlation coefficient being the same as the sign
of b1, the coefficient of x1 in the estimated regression equation.
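These bounds can be checked numerically from the definition of r. The following minimal Python sketch (an illustrative aid, not part of the original notes) shows r reaching +1, -1, and 0 for perfectly positive, perfectly negative, and linearly unrelated data:

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1, 2, 3], [2, 4, 6]))   # perfect positive linear relation
print(pearson_r([1, 2, 3], [6, 4, 2]))   # perfect negative linear relation
print(pearson_r([-1, 0, 1], [1, 0, 1]))  # no linear relation
```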

Positive correlation: Definition


In a positive correlation, the two variables tend to
move in the same direction: when the X variable increases,
the Y variable also increases; if the X variable decreases,
the Y variable also decreases.

Negative correlation: Definition


In a negative correlation, the two variables tend to
go in opposite directions. As the X variable increases, the Y
variable decreases.

Neither regression nor correlation analyses can be
interpreted as establishing cause-and-effect relationships.
They can indicate only how, or to what extent, variables are
associated with each other. The correlation coefficient
measures only the degree of linear association between
two variables. Any conclusions about a cause-and-effect
relationship must be based on the judgment of the analyst.

In simple linear regression and correlation, we examine the question of whether there is a linear (straight
line) relationship between two variables. A linear relationship between two variables exists if one variable tends to
increase (or decrease) as the other variable increases, and if the change in one variable for a unit change of the
other is constant across all values of the variables. In this case, an XY plot of one variable against the other will tend
to form a straight line.

In regression, we are generally interested in predicting one variable (called the y-variable, dependent
variable, or response variable) from the other (called the x-variable, independent variable, or predictor variable).
In correlation, we are generally interested in the strength of the relationship: in whether there is a precise
relationship between the two variables, or only a rough relationship. In regression, it makes a difference which
variable to treat as dependent, and which as independent. In correlation, it makes no difference.

The principal result of correlation is the correlation coefficient (r), or the coefficient of determination (r²),
the latter being just the square of the former. Either has value 1 when there is a perfect relationship between the
two variables, and value 0 when there is no relationship. By convention, the correlation coefficient is positive if one
variable increases as the other increases, and is negative if one variable decreases as the other increases. Since the
coefficient of determination is the square of the correlation, it is always positive. Correlation is the more commonly
used statistic. A weakness of the correlation is that it has no intuitive interpretation. An advantage of the coefficient
of determination is that it can be interpreted as the proportion of the quadratic (squared) variation in y that is
attributable to x. Except at the limits of 0 and 1, the absolute value of the correlation is higher than the coefficient
of determination, so it looks better.
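The claim that r² is the proportion of the squared variation in y attributable to x can be verified by fitting the least-squares line and comparing 1 - SSE/SST with the squared correlation. A minimal sketch (illustrative data, not from the notes):

```python
import math

def r_and_r2(x, y):
    """Return (r, 1 - SSE/SST) for simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    b1 = sxy / sxx                 # slope; its sign matches the sign of r
    b0 = my - b1 * mx
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))  # residual variation
    return r, 1 - sse / syy        # second value is the proportion explained

r, r2 = r_and_r2([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8])
assert abs(r * r - r2) < 1e-9      # r squared equals the coefficient of determination
```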

How to Interpret the Values of Correlations?


As mentioned before, the correlation coefficient (r) represents the linear relationship between two variables.
If the correlation coefficient is squared, then the resulting value (r², the coefficient of determination) will represent
the proportion of common variation in the two variables (i.e., the "strength" or "magnitude" of the relationship). In
order to evaluate the correlation between variables, it is important to know this "magnitude" or "strength" as well as
the significance of the correlation.
Significance of Correlations
The significance level calculated for each correlation is a primary source of information about the reliability
of the correlation. Tables of critical values facilitate identifying those coefficients that are significant at some
desired level, such as t(0.05), t(0.01), or t(0.001). The significance of a correlation coefficient of a particular magnitude
will change depending on the size of the sample from which it was computed. The test of significance is based on
the assumptions listed below.

Assumption of Correlation:

The assumptions that must be justified before the product-moment r may be used for correlation are as follows:
1. that both variables give continuous metric data;
2. that the distribution of the residual values (i.e., the deviations from the regression line) for the
dependent variable y follows the normal distribution;
3. that the variability of the residual values is the same for all values of the independent variable x;
4. that the paired scores of the two variables for each individual occur at random in the sample and
beyond the influence of other such paired scores; and
5. that there is a linear association between the variables.

Properties of correlation:
As far as the properties of r are concerned,
1. it ranges from -1.00 to +1.00, the two extreme scores being measures of maximum negative and positive
correlations respectively, while a score of 0.00 signifies the absence of linear correlation;
2. its score is not changed on adding a constant number to, or subtracting one from, all the scores of one or
both variables, nor on multiplying or dividing all the scores by a positive constant number;
3. presence of a significant correlation between two variables does not mean that the changes of either
constitute the cause for the changes of the other. Moreover, a correlation coefficient, worked out and
found significant for a particular section of the population or under a given set of conditions, may not hold
good for other sections of the population or under other conditions. For example, a correlation coefficient
between height and weight, worked out using a sample of newborn animals, may not hold good for full-
grown animals of the same species;
4. it is given by those proportions of the total variances of the two variables as are associated with each other,
and consequently depends on the covariance of the variables, viz., Cov(X, Y); the r scores of samples from
the same population lie dispersed, due to their different sampling errors, in a sampling distribution around
the parametric correlation coefficient (ρ). This sampling distribution of r scores is either positively or
negatively skewed according as ρ has a negative or positive value, but has no skewness when ρ amounts to
zero;
5. although r must be significant before a linear regression of one of the correlated variables may be worked
out on the other variable, r has no direct predictive value, nor does it indicate that the changes of either of
the correlated variables constitute the cause of the changes of the other.
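Property 2 (invariance of r when a constant is added to the scores or the scores are rescaled by a positive constant) can be demonstrated numerically. The following sketch is illustrative and not part of the original notes:

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

x = [2.8, 3.3, 2.7, 3.6, 3.7]
y = [75.0, 89.6, 73.0, 89.7, 90.0]
r = pearson_r(x, y)

# Adding a constant or rescaling by a positive constant leaves r unchanged
assert abs(pearson_r([a + 100 for a in x], y) - r) < 1e-9
assert abs(pearson_r([2.5 * a for a in x], y) - r) < 1e-9

# Multiplying by a negative constant flips only the sign of r
assert abs(pearson_r([-a for a in x], y) + r) < 1e-9
```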

The Bivariate Correlations and the Partial Correlations:


The Bivariate Correlations procedure computes Pearson's (product-moment) correlation coefficient,
Spearman's rho, and Kendall's tau-b with their significance levels. Correlations measure how variables or rank
orders are related. Pearson's correlation coefficient is a measure of linear association. Two variables can be perfectly
related, but if the relationship is not linear, Pearson's correlation coefficient is not an appropriate statistic for
measuring their association.
Example. Is the number of games won by a basketball team correlated with the average number of points
scored per game? A scatter plot indicates that there is a linear relationship. Analyzing data from the
1994-1995 NBA season yields that Pearson's correlation coefficient (0.581) is significant at the 0.01 level.
You might suspect that the more games won per season, the fewer points the opponents scored. These
variables are negatively correlated (-0.401), and the correlation is significant at the 0.05 level.

The Partial Correlations procedure computes partial correlation coefficients that describe the linear
relationship between two variables while controlling for the effects of one or more additional variables. Correlations
are measures of linear association. Two variables can be perfectly related, but if the relationship is not linear, a
correlation coefficient is not an appropriate statistic for measuring their association.

Example. Is there a correlation between birth rate and death rate? An ordinary correlation reveals a
significant correlation coefficient (0.367) at the 0.01 level. However, when you control for an economic
measure, birth rate and death rate are no longer significantly correlated. The correlation coefficient
drops to 0.1003 (with a p value of 0.304).
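A first-order partial correlation, as described above, has a closed form in terms of the three pairwise correlations. A minimal sketch follows; the 0.367 value reuses the example's zero-order correlation, while the two control correlations (0.6 and 0.55) are hypothetical placeholders, not figures from the notes:

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """Correlation of X and Y after controlling for Z (first-order partial)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# If Z is uncorrelated with both variables, controlling for it changes nothing
print(partial_r(0.367, 0.0, 0.0))

# Hypothetical control correlations can shrink the association sharply
print(partial_r(0.367, 0.6, 0.55))
```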

Parametric and Nonparametric methods

The statistical methods discussed above generally focus on the parameters of populations or probability
distributions and are referred to as parametric methods.

Nonparametric methods are statistical methods that require fewer assumptions about a population or
probability distribution and are applicable in a wider range of situations.

For a statistical method to be classified as a nonparametric method, it must satisfy one of the following
conditions:
(1) the method is used with qualitative data, or
(2) the method is used with quantitative data when no assumption can be made about the population
probability distribution.

In cases where both parametric and nonparametric methods are applicable, statisticians usually recommend
using parametric methods because they tend to provide better precision. Nonparametric methods are useful,
however, in situations where the assumptions required by parametric methods appear questionable.

The standard deviation is sometimes required for predicting the correlation.

The range, the difference between the largest value and the smallest value, is the simplest measure of
variability in the data. The range is determined by only the two extreme data values. The variance (s2) and the
standard deviation (s), on the other hand, are measures of variability that are based on all the data and are more
commonly used. The variance of a sample consisting of n items is s² = Σ(x − x̄)²/(n − 1): the deviation (difference)
of each data value from the sample mean is computed and squared, and the squared deviations are then summed
and divided by n − 1 to provide the sample variance.
The standard deviation is the square root of the variance. Because the unit of measure for the standard
deviation is the same as the unit of measure for the data, many individuals prefer to use
the standard deviation as the descriptive measure of variability.

Find the standard deviation (SD = ?)

Class interval    f     xm     xm − A            x'     fx'     x'²     fx'²
156 – 160          4    158    158 − 168 = −10   −2      −8      4       16
161 – 165         14    163    163 − 168 = −5    −1     −14      1       14
166 – 170         25    168    168 − 168 = 0      0       0      0        0
171 – 175         11    173    173 − 168 = +5    +1     +11      1       11
176 – 180          6    178    178 − 168 = +10   +2     +12      4       24
                  N = 60                               Σfx' = 1        Σfx'² = 65

(A = assumed mean = 168; i = class width = 5; x' = (xm − A)/i.)

SD = i √[ Σfx'²/N − (Σfx'/N)² ] = 5 √[ 65/60 − (1/60)² ] = 5 √(1.0833 − 0.0003) = 5 × 1.0407 ≈ 5.20
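The coded (step-deviation) computation in the table can be reproduced in a few lines; this Python sketch simply re-runs the arithmetic above:

```python
import math

mids  = [158, 163, 168, 173, 178]   # class midpoints xm
freqs = [4, 14, 25, 11, 6]          # frequencies f
A, i  = 168, 5                      # assumed mean and class width

xp   = [(m - A) // i for m in mids]               # coded deviations x' = -2 .. +2
N    = sum(freqs)                                 # 60
sfx  = sum(f * x for f, x in zip(freqs, xp))      # sum of fx'
sfx2 = sum(f * x * x for f, x in zip(freqs, xp))  # sum of fx' squared

sd = i * math.sqrt(sfx2 / N - (sfx / N) ** 2)
print(round(sd, 2))
```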
Pearson's product-moment r for linear correlation
The product-moment correlation coefficient, or Pearson's r, is a simple linear correlation coefficient. It serves
to determine the strength or magnitude, as well as the direction or algebraic sign, of the association between two
continuous measurement variables, provided the relationship between their scores conforms to a straight line. The
product-moment r cannot be used directly for multiple correlations between more than two variables, nor if one or
both of the variables is/are discontinuous, ordinal, or qualitative.

Principle:
Where X and Y are the scores of the two variables to be correlated and X̄ and Ȳ are their respective means, r
may be computed by either of the following formulas:

    r = Σ(X − X̄)(Y − Ȳ) / √[ Σ(X − X̄)² · Σ(Y − Ȳ)² ]

or, using raw scores,

    r = (nΣXY − ΣXΣY) / √[ (nΣX² − (ΣX)²)(nΣY² − (ΣY)²) ]

    sr = √[ (1 − r²)/(n − 2) ],    t = r / sr,    df = n − 2

The computed r is significant only if the computed t is either higher than or equal to the critical t score for a
chosen significance level not higher than 0.05.
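The t test above can be wrapped as a small helper. A sketch under the stated formula (illustrative values, not from the notes) shows how the same r may fail to reach significance in a small sample yet exceed the critical t in a larger one:

```python
import math

def t_for_r(r, n):
    """t statistic for testing r against zero, with df = n - 2."""
    sr = math.sqrt((1 - r * r) / (n - 2))
    return r / sr

print(t_for_r(0.5, 10))   # df = 8,  compare with critical t0.05(8)  = 2.306
print(t_for_r(0.5, 30))   # df = 28, compare with critical t0.05(28) = 2.048
```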

Data: The product-moment r has to be computed to find whether or not there is a significant linear correlation
between the following O2 consumption scores (X, ml/min) and tracheal ventilation scores (Y, ml/min) of a sample of
water beetles (α = 0.05).

Ind. No.:   1    2    3    4    5    6    7    8    9    10   11
X scores:  2.8  3.3  2.7  3.6  3.7  2.3  3.5  3.6  2.5  3.0  3.5
Y scores: 75.0 89.6 73.0 89.7 90.0 70.0 85.1 89.4 71.4 85.2 85.8

Critical t scores: t0.05(11) = 2.201; t0.05(22) = 2.074; t0.05(10) = 2.228; t0.05(9) = 2.262; t0.05(20) = 2.086.

Computation:
n = 11
(a) 1. Using sums of products:
Table: for computing r from sums of products.

Ind. No.    X      Y       X − X̄     (X − X̄)²    Y − Ȳ     (Y − Ȳ)²    (X − X̄)(Y − Ȳ)
1           2.8    75.0    −0.336     0.1129      −7.2       51.84        2.4192
2           3.3    89.6     0.164     0.0269       7.4       54.76        1.2136
3           2.7    73.0    −0.436     0.1901      −9.2       84.64        4.0112
4           3.6    89.7     0.464     0.2153       7.5       56.25        3.4800
5           3.7    90.0     0.564     0.3181       7.8       60.84        4.3992
6           2.3    70.0    −0.836     0.6989     −12.2      148.84       10.1992
7           3.5    85.1     0.364     0.1325       2.9        8.41        1.0556
8           3.6    89.4     0.464     0.2153       7.2       51.84        3.3408
9           2.5    71.4    −0.636     0.4045     −10.8      116.64        6.8688
10          3.0    85.2    −0.136     0.0185       3.0        9.00       −0.4080
11          3.5    85.8     0.364     0.1325       3.6       12.96        1.3104
Σ          34.5   904.2      -        2.4655        -       656.02       37.89

X̄ = ΣX/n = 34.5/11 = 3.136        Ȳ = ΣY/n = 904.2/11 = 82.2

r = Σ(X − X̄)(Y − Ȳ) / √[ Σ(X − X̄)² · Σ(Y − Ȳ)² ] = 37.89 / √(2.4655 × 656.02) = 37.89 / 40.217 = 0.942

sr = √[ (1 − r²)/(n − 2) ] = √[ (1 − 0.8874)/9 ] = 0.112        t = r/sr = 0.942/0.112 = 8.4
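As a quick numerical check of the worked example (a Python sketch, not part of the original notes), r and t can be recomputed directly from the raw X and Y scores:

```python
import math

X = [2.8, 3.3, 2.7, 3.6, 3.7, 2.3, 3.5, 3.6, 2.5, 3.0, 3.5]
Y = [75.0, 89.6, 73.0, 89.7, 90.0, 70.0, 85.1, 89.4, 71.4, 85.2, 85.8]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

sxy = sum((a - mx) * (b - my) for a, b in zip(X, Y))   # sum of products
sxx = sum((a - mx) ** 2 for a in X)
syy = sum((b - my) ** 2 for b in Y)

r = sxy / math.sqrt(sxx * syy)
t = r / math.sqrt((1 - r * r) / (n - 2))               # df = n - 2 = 9
print(round(r, 3), round(t, 1))
```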

2. Using sums of products (deviations x = X − X̄ and y = Y − Ȳ):

Name of the Pupil   Score of X   Score of Y      x      y      x²     y²     xy
1                   49           77              −9      0      81      0       0
2                   73           81             +15     +4     225     16     +60
3                   54           87              −4    +10      16    100     −40
4                   50           52              −8    −25      64    625    +200
5                   37           51             −21    −26     441    676    +546
6                   43           77             −15      0     225      0       0
7                   84           93             +26    +16     676    256    +416
8                   74           91             +16    +14     256    196    +224
9                   55           89              −3    +12       9    144     −36
10                  61           72              +3     −5       9     25     −15
                    ΣX = 580     ΣY = 770              Σx² = 2002   Σy² = 2038   Σxy = 1355

(X̄ = 580/10 = 58; Ȳ = 770/10 = 77.)

r = Σxy / (N·σx·σy), where σx = √(Σx²/N) = √200.2 = 14.15 and σy = √(Σy²/N) = √203.8 = 14.27

r = 1355 / (10 × 14.15 × 14.27) = 1355 / 2019.2 ≈ 0.67

Q. 3: Find the coefficient of correlation where Σx² = 2002, Σy² = 2038, Σxy = 1355.

r = Σxy / √(Σx² · Σy²) = 1355 / √(2002 × 2038) = 1355 / 2019.9 = 0.67

(b) 1. Using raw scores :


Table: for computing r from raw scores of variables.
Ind. No.    X      Y       XY        X²       Y²

1 2.8 75 210.00 7.84 5625.00


2 3.3 89.6 295.68 10.89 8028.16
3 2.7 73 197.10 7.29 5329.00
4 3.6 89.7 322.92 12.96 8046.09
5 3.7 90 333.00 13.69 8100.00
6 2.3 70 161.00 5.29 4900.00
7 3.5 85.1 297.85 12.25 7242.01
8 3.6 89.4 321.84 12.96 7992.36
9 2.5 71.4 178.50 6.25 5097.96
10 3 85.2 255.60 9.00 7259.04
11 3.5 85.8 300.30 12.25 7361.64
34.5 904.2 2873.79 110.67 74981.26
r = (nΣXY − ΣXΣY) / √[ (nΣX² − (ΣX)²)(nΣY² − (ΣY)²) ]

  = (11 × 2873.79 − 34.5 × 904.2) / √[ (11 × 110.67 − (34.5)²)(11 × 74981.26 − (904.2)²) ]

  = (31611.69 − 31194.9) / √[ (1217.37 − 1190.25)(824793.86 − 817577.64) ]

  = 416.79 / √(27.12 × 7216.22) = 416.79 / 442.38 = 0.942

Interpretation:

sr = √[ (1 − r²)/(n − 2) ] = √[ (1 − 0.8874)/(11 − 2) ] = 0.112        t = r/sr = 0.942/0.112 = 8.4

df = n − 2 = 11 − 2 = 9;  α = 0.05

Critical t score: t0.05(9) = 2.262

Because the computed t is higher than the critical t score for the 0.05 level of significance,
there is a significant correlation between the variables (P < 0.05).

2. Using raw scores :


Name of the Pupil    x      y      x²      y²      xy
A 48 88 2304 7744 4224
B 32 80 1024 6400 2560
C 36 78 1296 6084 2808
D 34 74 1156 5476 2516
E 39 74 1521 5476 2886
F 37 75 1369 5625 2775
G 41 78 1681 6084 3198
H 45 83 2025 6889 3735
I 40 75 1600 5625 3000
J 30 71 900 5041 2130
Σx = 382    Σy = 776    Σx² = 14876    Σy² = 60444    Σxy = 29832

sr = √[ (1 − r²)/(n − 2) ] = √[ (1 − 0.555)/8 ] = 0.236        t = r/sr = 0.745/0.236 ≈ 3.16

df = n − 2 = 10 − 2 = 8;  α = 0.05

Critical t score: t0.05(8) = 2.306

r = (NΣxy − ΣxΣy) / √[ (NΣx² − (Σx)²)(NΣy² − (Σy)²) ]

  = (10 × 29832 − 382 × 776) / √[ (10 × 14876 − (382)²)(10 × 60444 − (776)²) ]

  = (298320 − 296432) / √[ (148760 − 145924)(604440 − 602176) ]

  = 1888 / √(2836 × 2264) = 1888 / (53.25 × 47.58) = 1888 / 2533.6 ≈ 0.745
Interpretation:

Because the computed t (3.16) is higher than the critical t score t0.05(8) = 2.306 for the 0.05 level of significance,
there is a significant correlation between the variables (P < 0.05).
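The same raw-score formula, applied to the pupil data above as a numerical check (Python sketch, not part of the original notes):

```python
import math

x = [48, 32, 36, 34, 39, 37, 41, 45, 40, 30]
y = [88, 80, 78, 74, 74, 75, 78, 83, 75, 71]
N = len(x)

num = N * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
den = math.sqrt((N * sum(a * a for a in x) - sum(x) ** 2) *
                (N * sum(b * b for b in y) - sum(y) ** 2))
r = num / den
t = r / math.sqrt((1 - r * r) / (N - 2))   # df = 8; critical t0.05(8) = 2.306
print(round(r, 3), round(t, 2))
```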

The Spearman rank correlation coefficient (ρ) for linear correlation:

The Spearman rank correlation coefficient (ρ) is a measure of the relationship between two
variables when data in the form of rank orders are available. For instance, the Spearman rank correlation
coefficient could be used to determine the degree of agreement between men and women concerning their
preference ranking of 10 different television shows. A Spearman rank correlation coefficient of 1 would indicate
complete agreement, a coefficient of −1 would indicate complete disagreement, and a coefficient of 0 would
indicate that the rankings were unrelated.
Limits of the rank correlation coefficient:
Spearman's rank correlation coefficient is given by

    ρ = 1 − 6Σdᵢ² / [n(n² − 1)]

When the two rankings agree perfectly, every dᵢ = 0 and ρ = +1. The lower limit ρ = −1 is attained when the
rankings are completely reversed:

If n is odd and equal to 2m + 1, then for reversed rankings d = 2m, 2m − 2, ..., 2, 0, −2, ..., −2m, so that

    Σdᵢ² = 8m(m + 1)(2m + 1)/6

and, since n(n² − 1) = 4m(m + 1)(2m + 1), ρ = 1 − 2 = −1.

If n = 2m, then d = 2m − 1, 2m − 3, ..., 1, −1, −3, ..., −(2m − 3), −(2m − 1), so that

    Σdᵢ² = 2m(4m² − 1)/3

and, since n(n² − 1) = 2m(4m² − 1), ρ = 1 − 2 = −1.

Thus −1 ≤ ρ ≤ +1.
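The lower limit can be checked mechanically: for completely reversed rankings of n items (rank i paired with rank n + 1 − i), the formula returns exactly −1. An illustrative sketch:

```python
def rho_from_d(d):
    """Spearman's rho from the rank differences d_i (no ties)."""
    n = len(d)
    return 1 - 6 * sum(di * di for di in d) / (n * (n * n - 1))

for n in (6, 7, 10, 11):                          # both even and odd sample sizes
    identical = [0] * n                           # agreeing rankings: all d = 0
    reversed_ = [i - (n + 1 - i) for i in range(1, n + 1)]
    assert rho_from_d(identical) == 1
    assert rho_from_d(reversed_) == -1
```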

Principle:
Ranks are assigned in descending order to the scores of each variable to be correlated, giving average
ranks to all the scores of a tied set. Where D is the difference between the RX and RY ranks of each individual and n
is the sample size,

    ρ = 1 − 6ΣD² / [n(n² − 1)];    s_ρ = √[ (1 − ρ²)/(n − 2) ];    t = ρ / s_ρ;    df = n − 2

The computed ρ is significant only if the t score is either higher than or equal to the critical t score for a
chosen level of significance, not higher than 0.05.
Data:
ρ has to be computed for linear correlation between the following scores of respiratory rates (X per min)
and heart rates (Y per min), both discontinuous variables (α = 0.05).

Ind. No.:   1   2   3   4   5   6   7   8   9  10
X scores:  14  10  19  12  15  20  23  13  15  21
Y scores:  68  62  79  65  72  81  83  68  70  83

Critical t scores: t0.05(20) = 2.086; t0.05(19) = 2.093; t0.05(18) = 2.101; t0.05(10) = 2.228; t0.05(8) = 2.306.

Computation:
Table: for assigning ranks and computing ρ (n = 10).

Ind. No.   X    Y    RX (Rank of X)   RY (Rank of Y)   D = RX − RY    D²


1 14 68 7 7.5 -0.5 0.25
2 10 62 10 10 0 0
3 19 79 4 4 0 0
4 12 65 9 9.0 0 0
5 15 72 5.5 5.0 0.5 0.25
6 20 81 3 3 0 0
7 23 83 1 1.5 -0.5 0.25
8 13 68 8 7.5 -0.5 0.25
9 15 70 5.5 6 -0.5 0.25
10 21 83 2 1.5 0.5 0.25
Σ          -    -         -                -                -        ΣD² = 1.50

ρ = 1 − 6ΣD² / [n(n² − 1)] = 1 − (6 × 1.5)/(10 × 99) = 1 − 0.009 = 0.991

s_ρ = √[ (1 − ρ²)/(n − 2) ] = √[ (1 − 0.982)/(10 − 2) ] = √0.00224 = 0.047

t = ρ / s_ρ = 0.991 / 0.047 = 21.085        df = n − 2 = 10 − 2 = 8

Because the computed t is higher than the critical t for 0.05 level of significance, there is a significant correlation
between the variables ( P < 0.05).
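The rank assignment with tied scores (average ranks) and the resulting ρ can be reproduced programmatically; this sketch re-runs the table above:

```python
def avg_ranks_desc(scores):
    """Assign descending ranks; tied scores share the average of their ranks."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            j += 1                  # extend over the tied block
        avg = (i + 1 + j) / 2       # mean of rank positions i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = avg
        i = j
    return ranks

X = [14, 10, 19, 12, 15, 20, 23, 13, 15, 21]
Y = [68, 62, 79, 65, 72, 81, 83, 68, 70, 83]
n = len(X)
RX, RY = avg_ranks_desc(X), avg_ranks_desc(Y)
D2 = sum((a - b) ** 2 for a, b in zip(RX, RY))   # sum of squared rank differences
rho = 1 - 6 * D2 / (n * (n * n - 1))
print(round(rho, 3))
```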

Alternatively, the association can be examined through the following paired-difference formula:

Principle:
Where D = Y − X is the difference between the X and Y scores of each individual and N is the sample size,

    D̄ = ΣD/N;    s_D = √[ Σ(D − D̄)²/(N − 1) ];    s_D̄ = s_D/√N;    t = D̄ / s_D̄;    df = N − 1

The computed t is significant only if it is either higher than or equal to the critical t score for a
chosen level of significance, not higher than 0.05.

Data:
The t has to be computed for the following scores of respiratory rates (X per min)
and heart rates (Y per min), both discontinuous variables (α = 0.05).

Ind. No.:   1   2   3   4   5   6   7   8   9  10
X scores:  14  10  19  12  15  20  23  13  15  21
Y scores:  68  62  79  65  72  81  83  68  70  83

Critical t scores: t0.05(20) = 2.086; t0.05(19) = 2.093; t0.05(18) = 2.101; t0.05(10) = 2.228;
t0.05(9) = 1.833 (one-tailed value; the two-tailed value is 2.262).
Computation:
Table: for computing the paired differences (n = 10).

Ind. No.   X    Y    D = Y − X    D − D̄    (D − D̄)²

1 14 68 54 -2.9 8.41
2 10 62 52 -4.9 24.01
3 19 79 60 3.1 9.61
4 12 65 53 -3.9 15.21
5 15 72 57 0.1 0.01
6 20 81 61 4.1 16.81
7 23 83 60 3.1 9.61
8 13 68 55 -1.9 3.61
9 15 70 55 -1.9 3.61
10 21 83 62 5.1 26.01
Σ          -    -    569            -       116.90

D̄ = ΣD/N = 569/10 = 56.9

s_D = √[ Σ(D − D̄)²/(N − 1) ] = √(116.90/9) = 3.604        s_D̄ = s_D/√N = 3.604/√10 = 3.604/3.162 = 1.14

t = D̄ / s_D̄ = 56.9/1.14 = 49.91        df = N − 1 = 10 − 1 = 9

Because the computed t is higher than the critical t for 0.05 level of significance, there is a significant correlation
between the variables (P < 0.05).
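A quick check of the paired-difference arithmetic above (Python sketch, not part of the original notes):

```python
import math

X = [14, 10, 19, 12, 15, 20, 23, 13, 15, 21]
Y = [68, 62, 79, 65, 72, 81, 83, 68, 70, 83]
N = len(X)

D = [b - a for a, b in zip(X, Y)]          # per-individual differences Y - X
Dbar = sum(D) / N                          # mean difference
sD = math.sqrt(sum((d - Dbar) ** 2 for d in D) / (N - 1))
sDbar = sD / math.sqrt(N)                  # standard error of the mean difference
t = Dbar / sDbar                           # df = N - 1 = 9
print(round(Dbar, 1), round(t, 1))
```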

