
Correlation and Causality

LEARNING GOAL
Be able to define correlation, recognize positive and
negative correlations on scatter diagrams
Be aware of important cautions concerning the
interpretation of correlations
Become familiar with the concept of a best-fit line for a
correlation, recognize when such lines have predictive
value and when they may not
Copyright 2009 Pearson Education, Inc.

Definition
A correlation exists between two variables when
higher values of one variable consistently go with
higher values of another variable or when higher
values of one variable consistently go with lower
values of another variable.


Slide 7.1- 2

Here are a few examples of correlations:

There is a correlation between the variables amount of smoking
and likelihood of lung cancer; that is, heavier smokers are more
likely to get lung cancer.
There is a correlation between the variables
height and weight for people; that is, taller
people tend to weigh more than shorter
people.


Slide 7.1- 3

Here are a few examples of correlations:

There is a correlation between the variables demand for apples
and price of apples; that is, demand tends to decrease as price
increases.
There is a correlation between practice time
and skill among piano players; that is, those
who practice more tend to be more skilled.


Slide 7.1- 4

Scatter Diagrams
Definition
A scatter diagram (or scatterplot) is a graph in
which each point represents the values of two
variables.


Slide 7.1- 5


Slide 7.1- 6

The following procedure describes how to make the scatter diagram
in Figure 7.1.
1. We assign one variable to each axis and label the axis with
values that comfortably fit all the data.
Sometimes the axis selection is arbitrary, but if we suspect that
one variable depends on the other, then we plot the explanatory
variable on the horizontal axis and the response variable on the
vertical axis.
In this case, we expect the diamond price to depend at least in part
on its weight; we therefore say that weight is the explanatory
variable (because it helps explain the price) and price is the
response variable (because it responds to changes in the
explanatory variable).
Figure 7.1
Slide 7.1- 7

The following procedure describes how to make the scatter diagram
in Figure 7.1.
1. (cont.) We choose a range of 0 to 2.5 carats for the weight axis
and $0 to $16,000 for the price axis.
2. For each diamond in Table 7.1, we plot a single point at the
horizontal position corresponding to its weight and the vertical
position corresponding to its price.
For example, the point for Diamond 10 goes at a position of 1.11
carats on the horizontal axis and $3,670 on the vertical axis. The
dashed lines on Figure 7.1 show how we locate this point.
3. (Optional) We can label some (or all) of the data points, as is
done for Diamonds 10, 16, and 19 in Figure 7.1.
Figure 7.1

Slide 7.1- 8

EXAMPLE 1 Color and Price


Using the data in Table 7.1 (slide 6), create a scatter diagram
to look for a correlation between a diamond's color and price.
Comment on the correlation.
Solution: We expect price to depend on color, so we plot the
explanatory variable color on the horizontal axis and the
response variable price on the vertical axis in Figure 7.2.
(You should check a few of the points against the data in Table 7.1.)
The points appear much more scattered than in Figure 7.1.
Nevertheless, you may notice a weak trend diagonally downward
from the upper left toward the lower right.
Figure 7.2

Slide 7.1- 9

EXAMPLE 1 Color and Price


Using the data in Table 7.1 (slide 6), create a scatter diagram
to look for a correlation between a diamond's color and price.
Comment on the correlation.
Solution: (cont.)
This trend represents a weak correlation in which diamonds with
more yellow color (higher numbers for color) are less expensive.
This trend is consistent with what we would expect, because
colorless diamonds appear to sparkle more and are generally
considered more desirable.
Figure 7.2


Slide 7.1- 10

Types of Correlation
(Note: detailed descriptions of these graphs appear in the next few slides.)

Figure 7.3 Types of correlation seen on scatter diagrams.



Slide 7.1- 11

Figure 7.3(a-c) Types of correlation seen on scatter diagrams.

Parts a to c of Figure 7.3 show positive correlations, in which
the values of y tend to increase with increasing values of x. The
correlation becomes stronger as we proceed from a to c. In fact,
c shows a perfect positive correlation, in which all the points
fall along a straight line.

Slide 7.1- 12

Figure 7.3(d-f) Types of correlation seen on scatter diagrams.

Parts d to f of Figure 7.3 show negative correlations, in which
the values of y tend to decrease with increasing values of x. The
correlation becomes stronger as we proceed from d to f. In fact,
f shows a perfect negative correlation, in which all the points
fall along a straight line.

Slide 7.1- 13

Figure 7.3(g) Types of correlation seen on scatter diagrams.

Part g of Figure 7.3 shows no correlation between x and y. In
other words, values of x do not appear to be linked to values of
y in any way.


Slide 7.1- 14

Figure 7.3(h) Types of correlation seen on scatter diagrams.

Part h of Figure 7.3 shows a nonlinear relationship, in which x
and y appear to be related but the relationship does not
correspond to a straight line. (Linear means along a straight
line, and nonlinear means not along a straight line.)

Slide 7.1- 15

Types of Correlation
Positive correlation: Both variables tend to increase (or
decrease) together.
Negative correlation: The two variables tend to change
in opposite directions, with one increasing while the other
decreases.
No correlation: There is no apparent (linear) relationship
between the two variables.
Nonlinear relationship: The two variables are related,
but the relationship results in a scatter diagram that does
not follow a straight-line pattern.


Slide 7.1- 16

Measuring the Strength of a Correlation
Statisticians measure the strength of a
correlation with a number called the
correlation coefficient, represented by
the letter r.


Slide 7.1- 17

Properties of the Correlation Coefficient, r


The correlation coefficient, r, is a measure of the
strength of a correlation. Its value can range only from
-1 to 1.
If there is no correlation, the points do not follow any
ascending or descending straight-line pattern, and the
value of r is close to 0.
If there is a positive correlation, the correlation
coefficient is positive (0 < r ≤ 1): Both variables increase
together. A perfect positive correlation (in which all the
points on a scatter diagram lie on an ascending straight
line) has a correlation coefficient r = 1. Values of r close
to 1 mean a strong positive correlation and positive
values closer to 0 mean a weak positive correlation.

Slide 7.1- 18

Properties of the Correlation Coefficient, r


(cont.)
If there is a negative correlation, the correlation
coefficient is negative (-1 ≤ r < 0): When one variable
increases, the other decreases. A perfect negative
correlation (in which all the points lie on a descending
straight line) has a correlation coefficient r = -1. Values
of r close to -1 mean a strong negative correlation and
negative values closer to 0 mean a weak negative
correlation.
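
The two extreme cases above can be checked numerically. The following is a minimal sketch, not part of the original slides: the helper name pearson_r and the sample values are hypothetical, and the function uses the usual definition of the linear correlation coefficient.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient of paired values (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: points lying exactly on a straight line.
xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]    # ascending line
falling = [10 - 3 * x for x in xs]  # descending line

r_up = pearson_r(xs, rising)
r_down = pearson_r(xs, falling)
print(r_up, r_down)
```

For points on an ascending line the result should be 1 (up to rounding), and for points on a descending line it should be -1.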


Slide 7.1- 19

EXAMPLE 3 U.S. Farm Size


Figure 7.5 shows a scatter diagram
for the variables number of farms
and mean farm size in the United
States.
Each dot represents data from
a single year between 1950 and
2000; on this diagram, the earlier
years generally are on the right and
the later years on the left.

Figure 7.5 Scatter diagram for farm size data. Source: U.S.
Department of Agriculture.

Estimate the correlation coefficient by comparing this diagram
to those in Figure 7.3 (slide 13) and discuss the underlying
reasons for the correlation.


Slide 7.1- 20

EXAMPLE 3 U.S. Farm Size


Solution: The scatter diagram shows
a strong negative correlation that
most closely resembles the scatter
diagram in Figure 7.3f, suggesting a
correlation coefficient around r = -0.9.
The correlation shows that as the
number of farms decreases, the size
of the remaining farms increases.

Figure 7.5 Scatter diagram for farm size data. Source: U.S.
Department of Agriculture.

This trend reflects a basic change in the nature of farming:
Prior to 1950, most farms were small family farms. Over time,
these small farms have been replaced by large farms owned by
agribusiness corporations.


Slide 7.1- 21

Figure 7.1 Scatter diagram for the price and weight data in Table 7.1.

Figure 7.2 Scatter diagram for the color and price data in Table 7.1.


Slide 7.1- 22

Calculating the Correlation Coefficient


The formula for the (linear) correlation coefficient r can be
expressed in several different ways that are all algebraically
equivalent, which means that they produce the same value. The
following expression has the advantage of relating more
directly to the underlying rationale for r :


Slide 7.1- 23

The following alternative formula for r has the advantage of
simplifying calculations, so it is often used whenever manual
calculations are necessary. The following formula is also easy
to program into statistical software or calculators:

First calculate each of the required sums, then substitute the
values into the formula. Be sure to note that $\sum(x^2)$ and
$(\sum x)^2$ are not equal: $\sum(x^2)$ tells you to first square
all the values of the variable x and then add them; $(\sum x)^2$
tells you to add the x values first and then square this sum. In
other words, perform the operation within the parentheses first.
Similarly, $\sum(y^2)$ and $(\sum y)^2$ are not the same.
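
The formulas themselves did not survive in this copy of the slides, so the sketch below assumes the two standard forms: the mean product of standardized values, r = Σ[((x - x̄)/s_x)((y - ȳ)/s_y)] / (n - 1), and the sum-based form, r = [nΣxy - (Σx)(Σy)] / [√(nΣ(x²) - (Σx)²) · √(nΣ(y²) - (Σy)²)]. The function names and the data are hypothetical; the point is only that the two forms give the same value and that Σ(x²) differs from (Σx)².

```python
import math
import statistics

def r_from_z_scores(xs, ys):
    """Assumed 'rationale' form: average product of standardized values."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std devs
    total = sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

def r_from_sums(xs, ys):
    """Assumed 'computational' form built from simple sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)  # square first, then add
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data, chosen so the sums are easy to check by hand.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

sum_of_squares = sum(x * x for x in xs)  # 55.0
square_of_sum = sum(xs) ** 2             # 225.0

r_a = r_from_z_scores(xs, ys)
r_b = r_from_sums(xs, ys)
print(sum_of_squares, square_of_sum, r_a, r_b)
```

With these values the sum-based form works out by hand to (5·53 - 15·15) / (√50 · √50) = 40/50 = 0.8, and the standardized form should agree up to rounding.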

Slide 7.1- 24

Beware of Outliers
If you calculate the correlation coefficient for these data,
you'll find that it is a relatively high r = 0.880, suggesting a
very strong correlation.
Figure 7.10

However, if you cover the data point in the upper right corner of
Figure 7.10, the apparent correlation disappears.
In fact, without this data point, the correlation coefficient is r = 0.
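
The effect of a single outlier can be reproduced with a small made-up data set. This is a sketch under stated assumptions: the points below are hypothetical (they are not the data plotted in Figure 7.10), and pearson_r is an illustrative helper rather than anything defined in the slides.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient of paired values (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical cluster: four points at the corners of a square,
# which have no linear trend at all.
cluster_x = [1, 1, 2, 2]
cluster_y = [1, 2, 1, 2]

r_without_outlier = pearson_r(cluster_x, cluster_y)

# Add one point far away in the upper right corner.
r_with_outlier = pearson_r(cluster_x + [10], cluster_y + [10])

print(r_without_outlier, r_with_outlier)
```

The cluster alone gives r = 0, while adding the single distant point pushes r above 0.9, which is why it is worth looking at the scatter diagram before trusting a high value of r.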

Slide 7.2- 25

EXAMPLE 1 Masked Correlation


You've conducted a study to determine how the number of
calories a person consumes in a day correlates with time spent
in vigorous bicycling. Your sample consisted of ten women
cyclists, all of approximately the same height and weight. Over
a period of two weeks, you asked each woman to record the
amount of time she spent cycling each day and what she ate on
each of those days. You used the eating records to calculate the
calories consumed each day.
Figure 7.11 shows a scatter diagram
with each woman's mean time spent
cycling on the horizontal axis and
mean caloric intake on the vertical
axis. Do higher cycling times
correspond to higher intake
of calories?

Slide 7.2- 26

Solution: If you look at the data as a whole, your eye will
probably tell you that there is a positive correlation in which
greater cycling time tends to go with higher caloric intake. But
the correlation is very weak, with a correlation coefficient of
r = 0.374.
However, notice that two points are
outliers: one representing a cyclist
who cycled about a half-hour per
day and consumed more than 3,000
calories, and the other representing
a cyclist who cycled more than 2
hours per day on only 1,200 calories.
It's difficult to explain the two outliers, given that all the women
in the sample have similar heights and weights.


Slide 7.2- 27

Solution: (cont.)
We might therefore suspect that these two women either recorded
their data incorrectly or were not following their usual habits
during the two-week study. If we can confirm this suspicion, then
we would have reason to delete the two data points as invalid.
Figure 7.12 shows that the correlation is quite strong without
those two outlier points, and suggests that the number of calories
consumed rises by a little more than 500 calories for each hour of
cycling.
Of course, we should not remove the outliers without confirming our
suspicion that they were invalid data points, and we should report
our reasons for leaving them out.
Figure 7.12 The data from Figure 7.11 without the two outliers.

Slide 7.2- 28

Beware of Inappropriate Grouping


Correlations can also be misinterpreted when data are grouped
inappropriately. In some cases, grouping data hides correlations.
Consider a (hypothetical) study in which researchers seek a
correlation between hours of TV watched per week and high school
grade point average (GPA). They collect the 21 data pairs in
Table 7.3.
The scatter diagram (Figure 7.13) shows virtually no correlation;
the correlation coefficient for the data is about r = -0.063. The
apparent conclusion is that TV viewing habits are unrelated to
academic achievement.
Figure 7.13

Slide 7.2- 29

However, one astute researcher realizes that some of the
students watched mostly educational programs, while others
tended to watch comedies, dramas, and movies. She therefore
divides the data set into two groups, one for the students who
watched mostly educational television and one for the other
students.
Table 7.4 shows her results with the students divided into these
two groups.


Slide 7.2- 30

Now we find two very strong correlations (Figure 7.14): a
strong positive correlation for the students who watched
educational programs (r = 0.855) and a strong negative
correlation for the other students (r = -0.951).

Figure 7.14 These scatter diagrams show the same data as Figure 7.13,
separated into the two groups identified in Table 7.4.
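
The same masking effect can be seen with a tiny made-up data set. This is a sketch, assuming hypothetical numbers (not the values in Table 7.3 or Table 7.4) and illustrative helper names: the pooled data show no correlation, while each subgroup is perfectly correlated in opposite directions.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient of paired values (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_for(pairs):
    """Correlation coefficient for a list of (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return pearson_r(xs, ys)

# Hypothetical (x, y) pairs for two subgroups.
group_a = [(1, 1), (2, 2), (3, 3)]  # y rises with x
group_b = [(1, 3), (2, 2), (3, 1)]  # y falls with x

r_pooled = r_for(group_a + group_b)
r_a = r_for(group_a)
r_b = r_for(group_b)
print(r_pooled, r_a, r_b)
```

Here the pooled coefficient is 0 even though the subgroups have r = 1 and r = -1, because the two opposite trends cancel when the groups are combined.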


Slide 7.2- 31

In other cases, a data set may show a stronger correlation than
actually exists among subgroups.
Figure 7.15 shows the scatter diagram of the (hypothetical)
data collected by a consumer group studying the relationship
between the weights and prices of cars.

Figure 7.15 Scatter diagram for the car weight and price data.

The data set as a whole shows a strong correlation, but there
is no correlation within either cluster.

Slide 7.2- 32

Correlation Does Not Imply Causality


Perhaps the most important caution about interpreting
correlations is one we've already mentioned:
Correlation does not necessarily imply causality.

Possible Explanations for a Correlation


1. The correlation may be a coincidence.
2. Both correlated variables might be directly
influenced by some common underlying cause.
3. One of the correlated variables may actually be a
cause of the other. But note that, even in this case, it
may be just one of several causes.

Slide 7.2- 33

Useful Interpretations of Correlation


In discussing uses of correlation that might lead to wrong
interpretations, we have described the effects of outliers,
inappropriate groupings, fishing for correlations, and
incorrectly concluding that correlation implies causality.
But there are many correct and useful interpretations of
correlation.
In general, correlation plays a prominent and important role
in a variety of fields, including meteorology, medical
research, business, economics, market research, advertising,
psychology, and computer science.


Slide 7.2- 34

Definition
The best-fit line (or regression line) on a scatter
diagram is a line that lies closer to the data points
than any other possible line (according to a
standard statistical measure of closeness).


Slide 7.3- 35

Predictions with Best-Fit Lines


Cautions in Making Predictions from Best-Fit Lines
1. Don't expect a best-fit line to give a good prediction
unless the correlation is strong and there are many
data points. If the sample points lie very close to the
best-fit line, the correlation is very strong and the
prediction is more likely to be accurate. If the sample
points lie away from the best-fit line by substantial
amounts, the correlation is weak and predictions tend
to be much less accurate.
2. Don't use a best-fit line to make predictions beyond
the bounds of the data points to which the line was fit.


Slide 7.3- 36

Cautions in Making Predictions from Best-Fit Lines (cont.)
3. A best-fit line based on past data is not necessarily
valid now and might not result in valid predictions of
the future.
4. Don't make predictions about a population that is
different from the population from which the sample
data were drawn.
5. Remember that a best-fit line is meaningless when
there is no significant correlation or when the
relationship is nonlinear.
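
Caution 2 in the list above (no predictions beyond the bounds of the data) is easy to enforce in software. The following is a minimal sketch: the function name, the slope and intercept, and the data range are all hypothetical values chosen for illustration, not results from the slides.

```python
def predict_within_bounds(x, slope, intercept, x_min, x_max):
    """Predict y from a best-fit line, refusing to extrapolate.

    x_min and x_max are the smallest and largest x values in the
    data to which the line was fit.
    """
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x = {x} lies outside the fitted range [{x_min}, {x_max}]"
        )
    return slope * x + intercept

# Hypothetical line: calories per day versus hours of exercise per day,
# fit to data covering 0.5 to 2.5 hours.
SLOPE, INTERCEPT = 500.0, 1800.0
X_MIN, X_MAX = 0.5, 2.5

inside = predict_within_bounds(1.0, SLOPE, INTERCEPT, X_MIN, X_MAX)
print(inside)

try:
    predict_within_bounds(18.0, SLOPE, INTERCEPT, X_MIN, X_MAX)
except ValueError as err:
    print("Refused:", err)
```

A check like this addresses only the range caution. It cannot tell whether the correlation is strong, whether the data are still current, or whether the population matches.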


Slide 7.3- 37

EXAMPLE 1 Valid Predictions?


State whether the prediction (or implied prediction) should be
trusted in each of the following cases, and explain why or why
not.
a. You've found a best-fit line for a correlation between the
number of hours per day that people exercise and the
number of calories they consume each day. You've used this
correlation to predict that a person who exercises 18 hours
per day would consume 15,000 calories per day.
Solution:
a. No one exercises 18 hours per day on an ongoing basis, so
this much exercise must be beyond the bounds of any data
collected. Therefore, a prediction about someone who
exercises 18 hours per day should not be trusted.

Slide 7.3- 38

EXAMPLE 1 Valid Predictions?


State whether the prediction (or implied prediction) should be
trusted in each of the following cases, and explain why or why
not.
b. There is a well-known but weak correlation between SAT
scores and college grades. You use this correlation to predict
the college grades of your best friend from her SAT scores.
Solution:
b. The fact that the correlation between SAT scores and college
grades is weak means there is much scatter in the data. As a
result, we should not expect great accuracy if we use this
weak correlation to make a prediction about a single
individual.


Slide 7.3- 39

EXAMPLE 1 Valid Predictions?


State whether the prediction (or implied prediction) should be
trusted in each of the following cases, and explain why or why
not.
c. Historical data have shown a strong negative correlation
between national birth rates and affluence. That is, countries
with greater affluence tend to have lower birth rates. These
data predict a high birth rate in Russia.
Solution:
c. We cannot automatically assume that the historical data still
apply today. In fact, Russia currently has a very low birth
rate, despite also having a low level of affluence.


Slide 7.3- 40

EXAMPLE 1 Valid Predictions?


State whether the prediction (or implied prediction) should be
trusted in each of the following cases, and explain why or why
not.
d. A study in China has discovered correlations that are useful
in designing museum exhibits that Chinese children enjoy. A
curator suggests using this information to design a new
museum exhibit for Atlanta-area school children.
Solution:
d. The suggestion to use information from the Chinese study
for an Atlanta exhibit assumes that predictions made from
correlations in China also apply to Atlanta. However, given
the cultural differences between China and Atlanta, the
curator's suggestion should not be considered without more
information to back it up.

Slide 7.3- 41

EXAMPLE 1 Valid Predictions?


State whether the prediction (or implied prediction) should be
trusted in each of the following cases, and explain why or why
not.
e. Scientific studies have shown a very strong correlation
between children's ingesting of lead and mental retardation.
Based on this correlation, paints containing lead were
banned.
Solution:
e. Given the strength of the correlation and the severity of the
consequences, this prediction and the ban that followed
seem quite reasonable. In fact, later studies established lead
as an actual cause of mental retardation, making the
rationale behind the ban even stronger.

Slide 7.3- 42

EXAMPLE 1 Valid Predictions?


State whether the prediction (or implied prediction) should be
trusted in each of the following cases, and explain why or why
not.
f. Based on a large data set, you've made a scatter diagram for
salsa consumption (per person) versus years of education.
The diagram shows no significant correlation, but you've
drawn a best-fit line anyway. The line predicts that someone
who consumes a pint of salsa per week has at least 13 years
of education.
Solution:
f. Because there is no significant correlation, the best-fit line
and any predictions made from it are meaningless.


Slide 7.3- 43

The Correlation Coefficient and Best-Fit Lines
Best-Fit Lines and $r^2$
The square of the correlation coefficient, or $r^2$, is
the proportion of the variation in a variable that
is accounted for by the best-fit line.


Slide 7.3- 44

EXAMPLE 4 Voter Turnout and Unemployment


Political scientists are interested in knowing what factors affect
voter turnout in elections. One such factor is the unemployment
rate. Data collected in presidential election years since 1964
show a very weak negative correlation between voter turnout
and the unemployment rate, with a correlation coefficient of
about r = -0.1. Based on this correlation, should we use the
unemployment rate to predict voter turnout in the next
presidential election?
Solution: The square of the correlation coefficient is
$r^2 = (-0.1)^2 = 0.01$, which means that only about 1% of the
variation in the
data is accounted for by the best-fit line. Nearly all of the
variation in the data must therefore be explained by other factors.
We conclude that unemployment is not a reliable predictor of
voter turnout.
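
The interpretation of the squared correlation coefficient as a proportion of variation can be checked directly. This is a sketch with hypothetical data (not the voter turnout data), using the standard least-squares line: it compares the square of r with 1 minus the ratio of unexplained variation to total variation.

```python
import statistics

# Hypothetical paired data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

n = len(xs)
mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std devs

# Correlation coefficient and the least-squares line through the data.
r = sum((x - mean_x) * (y - mean_y)
        for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
slope = r * s_y / s_x
intercept = mean_y - slope * mean_x

# Total variation of y about its mean, and the part the line leaves over.
total_variation = sum((y - mean_y) ** 2 for y in ys)
unexplained = sum((y - (slope * x + intercept)) ** 2
                  for x, y in zip(xs, ys))
explained_fraction = 1 - unexplained / total_variation

print(r ** 2, explained_fraction)
```

For these values r = 0.8, so both printed numbers should be about 0.64: the line accounts for roughly 64% of the variation in y.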

Slide 7.3- 45

Finding Equations for Best-Fit Lines


If we draw any line on a scatter diagram, we can measure the
vertical distance between each data point and that line. One
measure of how well the line fits the data is the sum of the
squares of these vertical distances.
A large sum means that the vertical distances of data points from
the line are fairly large and hence the line is not a very good fit.
A small sum means the data points lie close to the line and the fit
is good.
Of all possible lines, the best-fit line is the line that minimizes
the sum of the squares of the vertical distances.
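
The idea of comparing "all possible lines" can be imitated with a coarse search. The following is a rough sketch, assuming hypothetical data and a grid of candidate slopes and intercepts in steps of 0.1; real software solves for the minimum directly rather than searching.

```python
def sum_squared_vertical_distances(xs, ys, slope, intercept):
    """Sum of squared vertical distances from the points to a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical paired data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

# Try slopes 0.0 to 2.0 and intercepts -2.0 to 2.0 in steps of 0.1.
candidates = [
    (sum_squared_vertical_distances(xs, ys, i / 10, j / 10), i / 10, j / 10)
    for i in range(0, 21)
    for j in range(-20, 21)
]

# min() compares the sums first, so it returns the closest-fitting line.
best_sum, best_slope, best_intercept = min(candidates)
print(best_sum, best_slope, best_intercept)
```

On this grid the smallest sum is about 3.6, reached at slope 0.8 and intercept 0.6. Every other candidate line gives a larger sum.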


Slide 7.3- 46

You may recall that the equation of any straight line can be written
in the general form
y = mx + b
where m is the slope of the line and b is the y-intercept of the line.
The formulas for the slope and y-intercept of the best-fit line are as
follows:
slope: $m = r \, \dfrac{s_y}{s_x}$
y-intercept: $b = \bar{y} - (m \times \bar{x})$
In the above expressions, r is the correlation coefficient, $s_x$ denotes
the standard deviation of the x values (or the values of the first
variable), $s_y$ denotes the standard deviation of the y values, $\bar{x}$
represents the mean of the values of the variable x, and $\bar{y}$
represents the mean of the values of the variable y.
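
The slope and intercept formulas translate into a few lines of code. This is a minimal sketch: the function name best_fit_line and the data are hypothetical, and the standard deviations are sample standard deviations as computed by Python's statistics.stdev.

```python
import statistics

def best_fit_line(xs, ys):
    """Return (slope, intercept) of the best-fit line y = m*x + b."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    # Correlation coefficient from standardized deviations.
    r = sum((x - mean_x) * (y - mean_y)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    slope = r * s_y / s_x                # m = r * (s_y / s_x)
    intercept = mean_y - slope * mean_x  # b = mean of y - m * mean of x
    return slope, intercept

# Hypothetical paired data.
slope, intercept = best_fit_line([1.0, 2.0, 3.0, 4.0, 5.0],
                                 [2.0, 1.0, 4.0, 3.0, 5.0])
print(slope, intercept)
```

For these values r = 0.8 and the two standard deviations are equal, so the slope should be about 0.8 and the intercept about 3 - 0.8 × 3 = 0.6.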

Slide 7.3- 47

Because these formulas are tedious to use by hand, we usually
use a calculator or computer to find the slope and y-intercept of
best-fit lines.
When software or a calculator is used to find the slope and
intercept of the best-fit line, results are commonly expressed in
the format $y = b_0 + b_1 x$, where $b_0$ is the intercept and $b_1$
is the slope, so be careful to correctly identify those two values.


Slide 7.3- 48