Anda di halaman 1dari 3

ne of the simplest questions we can ask about two variables is: "Is there a linear

relationship between them?"


The (Pearson) correlation coefficient, r, is a measure of linear association between
two continuous variables. It is a measure of how well the data fit a straight line.
1 => r >= -1
If r > 0 we have a Positive correlation
If r < 0 we have a Negative correlation
If r = 0 we have No correlation

Scatterplots showing data sets with different correlations


We can also perform a hypothesis test, the null hypothesis is that there is no
correlation between the two variables (i.e. the population correlation coefficient, r, is
zero).
Note:
It is harder to spot a correlation close to zero, but these are the ones we come
across most often
Other points to note:
The correlation coefficient is unaffected by units of measurement

Correlations of less than 07 should be interpreted cautiously


Correlation does not imply causation
Overall, I do not find correlation to be a very useful technique
Correlation should not be used when:
There is a non-linear relationship between variables
The are outliers
There are distinct sub-groups
For example healthy controls with diseased cases
If the values of one of the variables is determined in advance
e.g. Picking the doses of a drug in an experiment measuring its effect
Two examples of when not to use a correlation coefficient:

(a) When there is a non-linear relationship; (b) when distinct subgroups are present.
In both of these examples the correlation coefficient quoted is spurious.
Spurious correlations crop up all the time:
The price of petrol shows a positive correlation with the divorce rate over
time
Number of deaths from heart attacks in a population rises with incidence of
long-sightedness over time
Maximum daily air temperature and number of deaths of cattle were
positively correlated during March 2001
If we repeatedly measure two variables on the same individual over a period
of time e.g. a child's height and ability to read, then we will tend to see a
correlation
Spearman rank correlation coefficient

A (Pearson) correlation coefficient should be only calculated between two normally


distributed random variables
The Spearman rank correlation coefficient, rs, can be calculated for non-normally
distributed variables. As this coefficient is based only on the ranks of observations
outliers do not affect it unduly.

Linear regression
Regression analysis fits the best line to the observed data and allows us to
make predictions about one variable from the values of the other.
One variable (the independent variable) is assumed to predict the other (the
dependent), the results are not the same if we swap the variables.
The values of the independent variable may be selected.
The values do not have to be normally distributed
There are other assumptions and requirements of a regression analysis. (The
relationships is approximately linear; the residuals have to be normally
distributed etc.)
Regression analysis is best carried out under the guidance of a statistician

Why are variables correlated?


If two variables, A and B, are correlated then there are four possibilities:
The result occurred by chance
A influences ('causes') B (or, B influences A. Not the same thing!)
A and B are influenced by some other variable(s). This can happen in two
ways:
o C may 'cause' both A and B e.g. increased consumption of sugar
increases the number of caries a person has and increases their
weight. Does more weight cause more caries?
o A may lead to an increase in C which 'causes' B e.g. low income
may increase chance of smoking which increases chance of death
from lung cancer. Does low income cause lung-cancer?

Anda mungkin juga menyukai