
Chapter 3

3.1 (a) The amount of time a student spends studying is the explanatory variable and the grade
on the exam is the response variable. (b) Height is the explanatory variable and weight is the
response variable. (c) Inches of rain is the explanatory variable and the yield of corn is the
response variable. (d) It is more reasonable to explore the relationship between a student's grades
in statistics and French. (e) A family's income is the explanatory variable and the years of
education their eldest child completes is the response variable.
3.2 The explanatory variable is weight of a person, and the response variable is mortality rate
(that is, how likely a person is to die over a 10-year period). The other variables that may
influence the relationship between weight and survival are the amount of physical activity,
perhaps measured by hours of exercise per week, and economic status, which could be measured
by annual income of the person, family net worth, amount of savings, or some other financial
variable.
3.3 Water temperature is the explanatory variable, and weight change (growth) is the response
variable. Both are quantitative.
3.4 The explanatory variable is the type of treatment (removal of the breast, or removal of only
the tumor and nearby lymph nodes followed by radiation), and survival time is the response
variable. Type of treatment is a categorical variable, and survival time is a quantitative variable.
3.5 (a) The explanatory variable is the number of powerboat registrations. (b) A scatterplot is
shown below.
[Figure: scatterplot of manatees killed versus powerboat registrations (1000s)]

The scatterplot shows a positive linear relationship between these variables. (c) There is a
positive linear association between powerboat registrations and manatees killed. (d) Yes, the
relationship between these variables is linear. (e) The relationship is a strong, positive, linear
association. Yes, the number of manatees killed can be predicted accurately from powerboat
registrations. For 719,000 powerboat registrations, about 48 manatees would be killed by
powerboats.
3.6 (a) A scatterplot is shown below.

Examining Relationships

[Figure: scatterplot of new adults versus percent returning]

(b) The scatterplot shows a negative, linear, fairly weak relationship. (Note: direction=negative,
form=linear, strength=weak.) (c) Because this association is negative, we conclude that the
sparrowhawk is a long-lived territorial species.
3.7 (a) A positive association between IQ and GPA means that students with higher IQs
tend to have higher GPAs, and those with lower IQs generally have lower GPAs. The plot
does show a positive association. (b) The form of the relationship is roughly linear, because a line
through the scatterplot of points would provide a good summary. The positive association is
moderately strong (with a few exceptions) because most of the points would be close to the line.
(c) The lowest point on the plot is for a student with an IQ of about 103 and a GPA of about 0.5.
3.8 (a) From Figure 3.5, the returns on stocks were about 50% in 1954 and about 28% in 1974.
(b) The return on Treasury bills in 1981 was about 15%. (c) The scatterplot shows no clear
pattern. The statement that high treasury bill returns tend to go with low returns on stocks
implies a negative association; there may be some suggestion of such a pattern, but it is
extremely weak.
3.9 (a) A scatterplot with speed as the explanatory variable is shown below.
[Figure: scatterplot of fuel used (liters/100 km) versus speed (km/h)]

(b) The relationship is curved or quadratic. High amounts of fuel were used for low and high
values of speed and low amounts of fuel were used for moderate speeds. This makes sense
because the best fuel efficiency is obtained by driving at moderate speeds. (Note: 60 km/hr is
about 37 mph.) (c) Poor fuel efficiency (above-average fuel consumption) is found at both high
and low speeds, and good fuel efficiency (below-average fuel consumption) is found at moderate
speeds. (d) The relationship is very strong, with little deviation from a curve that can be drawn
through the points.


3.10 (a) A scatterplot with mass as the explanatory variable is shown below.
[Figure: scatterplot of rate (cal) versus mass (kg)]

(b) The association is positive, and the relationship is linear and moderately strong. (c) The
scatterplot below shows that the pattern of the relationship does hold for men. However, the
relationship between mass and rate is not as strong for men as it is for women. The group of
men has higher lean body masses and metabolic rates than the group of women.
[Figure: scatterplot of rate (cal) versus mass (kg), with separate symbols for sex (F, M)]

3.11 A scatterplot from a calculator is shown below. As expected, the calculator graph looks the
same as the scatterplot in Exercise 3.9 (a).

3.12 A scatterplot from a calculator is shown below. As expected, the calculator graph shows
the same relationship as the scatterplot in Exercise 3.10.


3.13 (a) The scatterplot below shows a strong, positive, linear relationship between the two
measurements. Thus, all five specimens appear to be from the same species.
[Figure: scatterplot of humerus length (cm) versus femur length (cm)]

(b) The femur measurements have a mean of 58.2 and a standard deviation of 13.2. The humerus
measurements have a mean of 66 and a standard deviation of 15.89. The table below shows the
standardized measurements (labeled zfemur and zhumerus) obtained by subtracting the mean and
dividing by the standard deviation. The column labeled "product" contains the product
(zfemur × zhumerus) of the standardized measurements. The sum of the products is 3.97659, so
the correlation coefficient is r = (1/4)(3.97659) = 0.9941.

femur   humerus   zfemur     zhumerus   product
38      41        -1.53048   -1.57329   2.40789
56      63        -0.16669   -0.18880   0.03147
59      70         0.06061    0.25173   0.01526
64      72         0.43944    0.37759   0.16593
74      84         1.19711    1.13277   1.35605
(c) The correlation coefficient is the same, 0.9941.
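The calculation in part (b) can be checked with a short script. This is only a sketch: the function name `correlation` is our own label, and the code uses nothing beyond the five femur and humerus lengths in the table above. It standardizes each variable with the sample standard deviation (divisor n - 1), multiplies the paired z-scores, and divides the sum of the products by n - 1.

```python
from math import sqrt


def correlation(xs, ys):
    """Return r as the sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations use the divisor n - 1.
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = [
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    ]
    return sum(products) / (n - 1)


femur = [38, 56, 59, 64, 74]      # cm, from the table in part (b)
humerus = [41, 63, 70, 72, 84]    # cm, from the table in part (b)

r = correlation(femur, humerus)
print(f"r = {r:.4f}")
```

With five specimens the divisor is 4, which is where the factor 1/4 in part (b) comes from.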
3.14 (a) The scatterplot below, with price as the explanatory variable, shows a strong, positive,
linear association between price and deforestation percent.
[Figure: scatterplot of deforestation (percent) versus price (cents per pound)]

(b) The prices have a mean of 50 and a standard deviation of 16.32. The deforestation percents
have a mean of 1.738% and a standard deviation of 0.928%. The table below shows the
standardized values (labeled zprice and zdeforestation) obtained by subtracting the mean and
dividing by the standard deviation. The column labeled "product" contains the product
(zprice × zdeforestation) of the standardized measurements. The sum of the products is 3.82064,
so the correlation coefficient is r = (1/4)(3.82064) = 0.9552.

price   deforestation   zprice     zdeforestation   product
29      0.49            -1.28638   -1.34507          1.73028
40      1.59            -0.61256   -0.15951          0.09771
54      1.69             0.24503   -0.05173         -0.01268
55      1.82             0.30628    0.08838          0.02707
72      3.10             1.34764    1.46794          1.97826
(c) The correlation coefficient is the same, 0.9552.
3.15 (a) The lowest calorie count is about 107 calories and the sodium level for this brand is
about 145 mg. The highest calorie count is about 195 calories, and the sodium level for this
brand is about 510 mg. (b) The scatterplot shows positive association; high-calorie hot dogs tend
to be high in salt, and low-calorie hot dogs tend to have low sodium. (c) The lower left point is
an outlier. Ignoring this point, the relationship is linear and moderately strong.
3.16 (a) The correlation r is clearly positive but not near 1. The scatterplot shows that students
with high IQs tend to have high grade point averages, but there is more variation in the grade
point averages for students with moderate IQs. (b) The correlation r for the data in Figure 3.8
would be closer to one. The overall positive relationship between calories and sodium is
stronger than the positive association between IQs and GPAs. (c) The outliers with moderate IQ
scores in Figure 3.4 weaken the positive relationship between IQ and GPA, so removing them
would increase r. The outlier in the lower left corner of Figure 3.8 strengthens the positive,
linear relationship between calories and sodium, so removing this outlier would decrease r.
3.17 (a) A scatterplot is shown below.

[Figure: scatterplot of y versus x]

(b) The correlation r = 0.2531. (c) The two scatterplots, using the same scale for both variables,
are shown below.
[Figure: two scatterplots on the same scale, y versus x and y* versus x*]

(d) The correlation between x* and y* is the same as the correlation between x and y, r = 0.2531.
Although the variables have been transformed, the distances between the corresponding points
and the strengths of the association have not changed.
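The invariance described in part (d) can be illustrated numerically. The data for this exercise are not reproduced in these solutions, so the sketch below borrows the femur and humerus lengths from Exercise 3.13 as stand-in values. It assumes the transformation is a positive linear change of scale: one variable is converted from centimeters to inches and a constant is added to the other. The helper `correlation` is our own.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Stand-in data (Exercise 3.13), not the data of this exercise.
femur_cm = [38, 56, 59, 64, 74]
humerus_cm = [41, 63, 70, 72, 84]

CM_TO_IN = 0.3937                                 # positive scale factor
femur_in = [CM_TO_IN * v for v in femur_cm]       # change of units
humerus_shifted = [v + 100 for v in humerus_cm]   # add a constant

r_original = correlation(femur_cm, humerus_cm)
r_transformed = correlation(femur_in, humerus_shifted)
print(r_original, r_transformed)
```

The two values agree up to floating-point rounding because standardizing removes both the added constant and the positive scale factor.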
3.18 (a) The correlation between the percent of returning birds and the number of new adults is
r = -0.748. (b) A scatterplot with the two new points added is shown below.
[Figure: scatterplot of new adults versus percent returning, with points A and B marked separately from the original data]

The correlation for the original data plus point A is r = -0.807. The correlation for the original
data plus point B is r = -0.469. (c) Point A fits in with the negative linear association displayed
by the other points, and even emphasizes (strengthens) that association because, when A is
included, the points of the scatterplot are less spread out (relative to the length of the apparent
line suggested by the points). On the other hand, Point B deviates from the pattern, weakening
the association.
3.19 There is a perfect, positive association between the ages of the women and their spouses, so
r = 1.
3.20 (a) A scatterplot of mileage versus speed is shown below.
[Figure: scatterplot of mileage (mpg) versus speed (mph)]

(b) The speeds have a mean of 40 and a standard deviation of 15.81. The mileages have a mean
of 26.8 mpg and a standard deviation of 2.68 mpg. The table below shows the standardized
values (labeled zspeed and zmpg) obtained by subtracting the mean and dividing by the standard
deviation. The column labeled "product" contains the product (zspeed × zmpg) of the
standardized measurements. The sum of the products is 0.0, so the correlation coefficient is also
0.0.

speed   mpg   zspeed     zmpg       product
20      24    -1.26491   -1.04350    1.31993
30      28    -0.63246    0.44721   -0.28284
40      30     0.00000    1.19257    0.00000
50      28     0.63246    0.44721    0.28284
60      24     1.26491   -1.04350   -1.31993
The correlation coefficient r measures the strength of linear association between two quantitative
variables; this plot shows a nonlinear relationship between speed and mileage.
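A quick numerical check makes the point concrete. The sketch below uses only the five (speed, mpg) pairs from the table in part (b); the variable names are our own. The products of the deviations cancel in pairs, so r is zero even though mileage clearly depends on speed.

```python
from math import sqrt

speed = [20, 30, 40, 50, 60]   # mph, from the table in part (b)
mpg = [24, 28, 30, 28, 24]     # miles per gallon, from the table in part (b)

n = len(speed)
mean_speed = sum(speed) / n
mean_mpg = sum(mpg) / n

# Deviations from the means; the speed deviations are symmetric about zero,
# while the mpg deviations are identical on both sides of the peak at 40 mph.
dev_speed = [s - mean_speed for s in speed]
dev_mpg = [m - mean_mpg for m in mpg]

sum_cross = sum(ds * dm for ds, dm in zip(dev_speed, dev_mpg))
r = sum_cross / sqrt(sum(d * d for d in dev_speed) * sum(d * d for d in dev_mpg))
print(f"r = {r:.4f}")
```

A correlation of zero here rules out a linear trend only; it says nothing about the curved pattern visible in the scatterplot.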
3.21 (a) New York's median household income is about $32,800 and the mean income per
person is about $27,500. (b) Both of these variables measure the prosperity of a state, so you
would expect an increase in one measure to correspond with an increase in the other measure.
Household income will generally be higher than income per person because most households
have one primary source of income and at least one other smaller source of income. (c) In the
District of Columbia there are a relatively small number of individuals earning a great deal of
money. Thus, the income distribution is skewed to the right, which would raise the mean per
capita income above the median household income. (d) Alaska's median household income is
about $48,000. (e) Ignoring the outliers, the relationship is positive, linear, and moderately
strong.


3.22 (a) A scatterplot, with the explanatory variable time, is shown below.

[Figure: scatterplot of pulse (beats per minute) versus time (minutes)]
(b) The association between time and pulse is negative. The faster Professor Moore swims 2000
yards the more effort he will have to exert. Thus, a higher speed (lower time) will correspond
with a higher pulse and slower speeds (higher times) will correspond with lower pulses. (c) The
negative, linear relationship is moderately strong. (d) The correlation is r = -0.744. The
scatterplot shows a negative association between time and pulse. Small times correspond with
large pulses and large times correspond with small pulses. (e) The value of r would not
change.
3.23 (a) Gender is a categorical variable and the correlation coefficient r measures the strength
of linear association for two quantitative variables. (b) The largest possible value of the
correlation coefficient r is 1. (c) The correlation coefficient r has no units.
3.24 The paper's report is wrong because the correlation (r = 0.0) is interpreted incorrectly.
The author incorrectly suggests that a correlation of zero indicates a negative association
between research productivity and teaching rating. The psychologist meant that there is no linear
association between research productivity and teaching rating. In other words, knowledge of a
professor's research productivity will not help you predict her teaching rating.
3.25 (a) A scatterplot, with the correct calories as the explanatory variable, is shown below.

[Figure: scatterplot of guessed calories versus correct calories]
(b) There is a positive, linear relationship between the correct and guessed calories. The guessed
calories for 5 oz. of spaghetti with tomato sauce and the cream-filled snack cake are unusually
high and do not appear to fit the overall pattern displayed for the other foods. (c) The correlation
is r = 0.825. This agrees with the positive association observed in the plot; it is not closer to 1
because of the unusual guessed calories for spaghetti and cake. (d) The fact that the guesses are
all higher than the true calorie count does not influence the correlation. The correlation r would
not change if every guess were 100 calories higher. The correlation r does not change if a
constant is added to all values of a variable because the standardized values would be unchanged.
(e) The correlation without these two foods is r = 0.984. The correlation is closer to 1 because
the relationship is much stronger without these two foods.
3.26 (a) Rachel should choose small-cap stocks because small-cap stocks have a lower
correlation with municipal bonds. Thus, the weak, positive relationship between small-cap
stocks and bonds will provide more diversification than the large-cap stocks, which have a
stronger positive relationship with bonds. (b) She should look for a negative correlation,
although this would also mean that the return on this investment would tend to decrease when
return on bonds increases.
3.27 The correlation is r = 0.481 . The one unusual point (10, 1) is responsible for reducing the
correlation. Outliers tend to have fairly strong effects on correlation; the effect is very strong
here because there are only six observations.
3.28 (a) A scatterplot of yield versus plants is shown below.
[Figure: scatterplot of yield (bushels per acre) versus plants (per acre)]

(b) The overall pattern is not linear. The yield tends to be highest for moderate planting rates
and smallest for small and large planting rates. There is clearly no positive or negative
association between planting rates and yield. (d) The mean yields for the five planting rates are:
Planting rate   Mean
12000           131.025
16000           143.150
20000           146.225
24000           143.067
28000           134.750
A scatterplot with the means added is shown below. We would recommend the planting rate
with the highest average yield, 20,000 plants per acre.

[Figure: scatterplot of yield (bushels per acre) versus plants (per acre), with the mean yields added]

3.29 (a) For every one week increase in age, the rat will increase its weight by an average of 40
grams. (b) The y intercept provides an estimate for the birth weight (100 grams) of this male rat.
(c) A graph is provided below.
[Figure: graph of the regression line, weight (grams) versus time (weeks)]

(d) No, we should not use this line to predict the rat's weight at 104 weeks. This would be
extrapolation. This regression line would predict a weight of 4260 grams (about 9.4 lbs) for a 2
year old rat! The regression equation is only reliable for times where data were collected.
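The arithmetic behind part (d) is easy to reproduce. The sketch below takes the slope (40 grams per week) and intercept (100 grams) quoted in parts (a) and (b); the function name and the grams-per-pound constant are our own additions.

```python
GRAMS_PER_POUND = 453.592  # standard conversion factor, not from the exercise


def predicted_weight(weeks):
    """Weight in grams predicted by the line: intercept 100 g, slope 40 g/week."""
    return 100 + 40 * weeks


weight_g = predicted_weight(104)          # two years, expressed in weeks
weight_lb = weight_g / GRAMS_PER_POUND
print(f"{weight_g} g is about {weight_lb:.1f} lb")
```

The code will return a number for any input, which is exactly the danger: nothing in the equation signals that 104 weeks lies far beyond the ages at which the rat was weighed.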
3.30 (a) The slope is 0.882; this means that, on average, reading score increases by 0.882 for
each one-point increase in IQ. (b) The predicted scores for x = 90 and x = 130 are -33.4 +
0.882 × 90 = 45.98 and -33.4 + 0.882 × 130 = 81.26. (c) This is most easily done by plotting the
points (90, 45.98) and (130, 81.26), then drawing the line connecting them.
[Figure: regression line for reading score versus IQ]

(d) The intercept (-33.4) would correspond to the expected reading score for a child with an IQ
of 0; neither that reading score nor that IQ has any meaningful interpretation.


3.31 (a) The slope is 0.0138 minutes per meter. On average, if the depth of the dive is
increased by one meter, it adds 0.0138 minutes (about 0.83 seconds) to the time spent
underwater. (b) When Depth = 200, the regression line estimates DiveDuration to be 5.45
minutes (5 minutes and 27 seconds). (c) To plot the line, compute DiveDuration = 3.242
minutes when Depth = 40 meters, and DiveDuration = 6.83 minutes when Depth = 300 meters.
[Figure: regression line for duration (minutes) versus depth of dive (meters)]

(d) The intercept suggests that a dive of no depth would last an average of 2.69 minutes; this
obviously does not make any sense.
3.32 (a) The slope is -0.0053; this means that, on average, for each additional week of study
the pH decreased by 0.0053 units. Thus, the acidity of the precipitation increased over time. (b)
To plot the line, compute pH at the beginning (weeks = 0) and end (weeks = 150) of the study.
At the beginning of the study pH is 5.43 and at the end of the study pH is 4.635.
[Figure: regression line for pH versus weeks]

(c) Yes, the y intercept provides an estimate for the pH level at the beginning of the study. (d)
The regression line predicts the pH to be 4.635 at the end of this study.
3.33 (a) A scatterplot from the calculator is shown below.

(b) Let y = number of manatees killed and x = number of powerboat registrations. The least-squares
regression equation is y = -41.43 + 0.1249x.

[Figure: scatterplot of manatees killed versus powerboat registrations (1000s)]

(c) When 716,000 powerboats are registered, the predicted number of manatees killed will be
-41.43 + 0.1249 × 716 = 47.99, or about 48 manatees. (d) Yes, the measures seem to be
succeeding: three of the four new points are below the regression line, indicating that fewer
manatees than predicted were killed. Additional evidence of success is provided by the two
points for 1992 and 1993; they fall well below the overall pattern.

[Figure: scatterplot of manatees killed versus powerboat registrations (1000s), with the four new points added]

(e) The mean number of manatee deaths for the years with 716,000 powerboat registrations is 42.
The prediction of 48 was too high.
3.34 (a) The least-squares regression line is y = 31.9 - 0.304x. The calculator output (and
Minitab output) is shown below.

Minitab output
The regression equation is
newadults = 31.9 - 0.304 %returning

Predictor       Coef   SE Coef       T       P
Constant      31.934     4.838    6.60   0.000
%returning  -0.30402   0.08122   -3.74   0.003

S = 3.66689   R-Sq = 56.0%   R-Sq(adj) = 52.0%


(b) The means, standard deviations, and correlation are: x̄ = 58.23%, s_x = 13.03%,
ȳ = 14.23 new birds, s_y = 5.29 new birds, r = -0.748. (c) The slope is
b = r(s_y/s_x) = -0.748 × (5.29/13.03) ≈ -0.304 and the intercept is
a = ȳ - b x̄ = 14.23 - b × 58.23 ≈ 31.9. (d) The slope tells
us that as the percent of returning birds increases by one the number of new birds will decrease
by 0.304 on average. The y intercept provides a prediction that we will see 31.9 new adults in a
new colony when the percent of returning birds is zero. This value is clearly outside the range of
values studied for the 13 colonies of sparrowhawks and has no practical meaning in this
situation. (e) The predicted value for the number of new adults is 31.9 - 0.304 × 60 = 13.69
(carrying the unrounded coefficients), or about 14.
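Parts (c) and (e) can be reproduced from the summary statistics in part (b) alone. The following sketch uses the standard relations b = r(s_y/s_x) and a = ȳ - b x̄; the variable names are our own, and r is entered as negative because the association is negative.

```python
# Summary statistics from part (b).
r = -0.748      # correlation (negative association)
s_x = 13.03     # standard deviation of percent returning
s_y = 5.29      # standard deviation of new adults
x_bar = 58.23   # mean percent returning
y_bar = 14.23   # mean number of new adults

# Slope and intercept of the least-squares line.
b = r * s_y / s_x
a = y_bar - b * x_bar

# Part (e): predicted number of new adults when 60% of adults return.
prediction = a + b * 60

print(f"b = {b:.3f}, a = {a:.1f}, prediction = {prediction:.2f}")
```

Carrying the unrounded slope and intercept into the prediction avoids the small discrepancy that appears when the rounded values 31.9 and 0.304 are used instead.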
3.35 (a) Let y = Blood Alcohol Content (BAC) and x = Number of Beers. The least-squares
regression line is y = -0.0127 + 0.017964x. (b) The slope indicates that, on average, the BAC
will increase by 0.017964 for each additional beer consumed. The intercept suggests that the
average BAC will be -0.01270 if no beers are consumed; this is clearly ridiculous. (c) The
predicted BAC for a student who consumed 6 beers is -0.0127 + 0.017964 × 6 = 0.0951. (d) The
prediction error is 0.10 - 0.0951 = 0.0049.
3.36 (a) The relationship between the two variables in Figure 3.15 is positive, linear, and very
strong. (b) The regression line predicts that the Sanchez family would average about 500 cubic
feet of gas per day in a month that averages 20 degree-days per day. (c) The blue line in Figure
3.15 is called the least-squares line because it minimizes the sum of the squared deviations of
the observed amounts of gas consumed from the predicted amounts of gas. In other words, the
least squares line minimizes the squared vertical distances from the observed amounts of gas
consumed to the values predicted by the line. (d) The least squares line provides a very good fit
because the prediction errors, the vertical distances from the points to the line, are very small and
the linear relationship is very strong.
3.37 The slope is b = 0.894 × (0.044139929/2.1975365) ≈ 0.018 and the intercept is
a = 0.07375 - b × 4.8125 ≈ -0.0129, which is the same as the equation in Exercise 3.35.

3.38 (a) Let y = gas used and x = degree-days. The least-squares regression line
is y = 1.08921 + 0.188999 x . (b) The slope tells us that on average the amount of gas used
increases by 0.188999 for each one unit increase in degree-days. The y intercept provides a
realistic estimate (108.921 cubic feet) for the average amount of gas used when the average
number of heating degree-days per day is zero. (c) The predicted value is 1.08921 +
0.188999 × 20 = 4.8692, which is very close to the rough estimate of 5 from Exercise 3.36 (b).
(d) The predicted value for this month is 1.08921 + 0.188999 × 30 = 6.7592, so the prediction
error is 640 - 675.92 = -35.92.
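The predictions in parts (c) and (d) follow directly from the fitted line. The sketch below assumes, as the solution's own unit conversion implies, that gas use is recorded in hundreds of cubic feet, so the observed 640 cubic feet enters as 6.40. The function name is our own.

```python
def predicted_gas(degree_days):
    """Predicted gas use (hundreds of cubic feet per day) from the line in part (a)."""
    return 1.08921 + 0.188999 * degree_days


pred_20 = predicted_gas(20)   # part (c)
pred_30 = predicted_gas(30)   # part (d)

observed_30 = 6.40            # 640 cubic feet, in hundreds of cubic feet
# Prediction error (residual) = observed value minus predicted value.
error_cubic_feet = (observed_30 - pred_30) * 100

print(f"{pred_20:.4f}  {pred_30:.4f}  {error_cubic_feet:.2f}")
```

The negative prediction error means the household used less gas that month than the line predicts.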
3.39 (a) There is a positive, linear association between the two variables. There is more
variation in the field measurements for larger laboratory measurements. The values are scattered
above and below the line y = x for small and moderate depths, indicating strong agreement, but
the field measurements tend to be smaller than the laboratory measurements for large depths. (b)
The points for the larger depths fall systematically below the line y = x showing that the field
measurements are too small compared to the laboratory measurements. (c) In order to minimize
the sum of the squared distances from the points to the regression line, the top right part of the
blue line in Figure 3.20 would need to be pulled down to go through the middle of the group of
points that are currently below the blue line. Thus, the slope would decrease and the intercept
would increase. (d) The residual plot clearly shows that the prediction errors increase for larger
laboratory measurements. In other words, the variability in the field measurements increases as
the laboratory measurements increase. The least squares line does not provide a great fit,
especially for larger depths.
3.40 (a) A scatterplot with the least squares regression line is shown below.
[Figure: scatterplot of fuel consumption (liters/100 km) versus speed (km/hour), with the least-squares regression line]

(b) We would certainly not use the regression line to predict fuel consumption. The scatterplot
shows a nonlinear relationship. (c) The sum of the residuals provided is 0.01, which illustrates
a slight roundoff error. (d) The residual plot indicates that the regression line underestimates
fuel consumption for slow and fast speeds and overestimates fuel consumption for moderate
speeds. The quadratic pattern in the residual plot indicates that the regression model is not
appropriate for these data.
[Figure: residual plot, residual versus speed (km/hour)]

3.41 (a) The scatterplot with y = rate and x = mass is shown below.


(b) The least-squares regression line is y = 201.162 + 24.026 x . (c) The slope tells us that a
female will increase her metabolic rate by a mean of 24.026 calories for each additional kg of
lean body mass. The intercept provides an estimate for the average metabolic rate (201 calories)
for women, when their lean body mass is zero (clearly unrealistic). (d) The residual plot (shown
below) shows no clear pattern, so the least squares line is an adequate model for the data.

(e) The residual plot with the predicted value on the horizontal axis looks exactly like the
previous plot of the residuals versus lean body mass.

3.42 (a) The correlations are all approximately the same (to three decimal places,
rA = rB = rC = 0.816 and rD = 0.817), and the regression lines are all approximately y = 3.0 + 0.5x.
For all four sets, we predict y ≈ 8 when x = 10. (b) The scatterplots are provided below.
[Figure: scatterplots of y versus x for Data Sets A, B, C, and D]

(c) The residual plots are shown below.


[Figure: residual plots, residual versus x, for Data Sets A, B, C, and D]

(d) The regression line should only be used for Data Set A. The variables have a moderate linear
association with a fair amount of variability from the regression line and no obvious pattern in
the residual plot. For Data Set B, there is an obvious nonlinear relationship which can be seen in
both plots; we should fit a parabola or some other curve. For Data Set C, the point (13, 12.74)
deviates from the strong linear relationship of the other points, pulling the regression line up. If a
data entry error (or some other error) was made for this point, a regression line for the other
points would be very useful for prediction. For Data Set D, the data point with x = 19 is a very
influential point; the other points alone give no indication of slope for the line. The regression
line is not useful in this situation with only two values of the explanatory variable x.
3.43 (a) The scatterplot of the data with the least-squares regression line is below.

[Figure: scatterplot of response versus amount (ng), with the least-squares regression line]

(b) The regression equation is y = 14.4 + 46.6 x . (c) The residual plot is below. The residuals for
the extreme x-values (x = 0.25 and x = 20.0) are almost all positive; all of the residuals for the
middle two x values are negative.
[Figure: residual plot, residual versus amount (ng)]

(d) r^2 = 0.9997; 99.97% of the variation in the response is explained by the least-squares
regression with the amount of substance. This value suggests that the regression line does a
great job predicting gas chromatograph readings.
3.44 r = sqrt(0.16) = 0.40 (high attendance goes with high grades, so the correlation must be
positive).
3.45 (a) The regression line is y = 482.28 - 9.76x, where y = pulse and x = time. (b) The prediction
is 147.52 bpm, 4.48 bpm lower than the actual value. (c) With y = time and x = pulse,
y = 43.0 - 0.0567x. The predicted time is now 34.39 minutes, only 0.09 minutes (5.4 seconds)
too high. (d) The results depend on which variable is viewed as the explanatory variable.
3.46 (a) You should not use the least-squares regression line from Example 3.13 because the
roles of the variables are reversed. Fat gain is now the explanatory variable and change in NEA
is the response variable. (b) The least-squares line for predicting y = change in NEA from x =
fat gain has slope b = -0.7786 × (257.66/1.1389) = -176.1472 and intercept
a = 324.8 - (-176.1472) × 2.388 = 745.4395. Thus, the regression line is
y = 745.4395 - 176.1472x. (c) There will be variability in NEA change from subject to subject,
so we want to use the regression line to predict the change in NEA rather than data from one
individual. The predicted change in NEA for this new subject is
y = 745.4395 - 176.1472 × 3.0 = 216.9979, which is much different than the NEA change for the
other subject. (d) For the NEA data, r = -0.7786 and r^2 = 0.6062. About 61% of the variation
in NEA change is accounted for by the linear relationship with fat gained. The other 39% is
individual variation among subjects that is not explained by the linear relationship. The two
r^2 values are the same because switching the roles of x and y does not change the correlation
coefficient or the square of the correlation coefficient.
3.47 (a) r^2 = (0.596)^2 = 0.3552. Thus, the straight-line relationship explains 35.52% of the
variation in yearly changes. (b) The regression equation is y = 6.083 + 1.707x. (c) The predicted
change is y = 6.083 + 1.707 × 1.75 = 9.0703%. We could have given the answer without doing
calculations because the regression line must pass through (x̄, ȳ) = (1.75, 9.07).
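Both numbers in this solution can be checked in a few lines. The sketch below uses only the values quoted above (r = 0.596, the fitted line, and the point of means); the variable names are our own.

```python
r = 0.596
r_squared = r ** 2              # fraction of variation explained by the line

intercept, slope = 6.083, 1.707
x_bar, y_bar = 1.75, 9.07       # point of means quoted in part (c)

# A least-squares line always passes through (x_bar, y_bar), so evaluating
# the fitted line at x_bar should return y_bar up to rounding.
predicted_at_mean = intercept + slope * x_bar

print(f"r^2 = {r_squared:.4f}, prediction at x_bar = {predicted_at_mean:.4f}")
```

The tiny gap between the computed prediction and 9.07 comes from rounding in the reported slope and intercept.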

3.48 (a) A scatterplot, with the least-squares regression line, is shown below. The plot shows a
strong, positive linear association between the number of beaver-caused stumps and the number
of beetle larvae clusters.
[Figure: scatterplot of number of clusters of beetle larvae versus number of stumps, with the least-squares regression line]

(b) The least-squares regression line is y = 1.29 + 11.89 x . (c) The residual plot is shown below.
The linear model appears to provide a very good fit.
[Figure: residual plot, residual versus number of stumps]

(d) About 84% of the variation in the number of beetle larvae clusters is accounted for by the
linear relationship with the number of stumps.
3.49 (a) The slope (1.507) says that, on average, BOD rises (falls) by 1.507 mg/L for every 1
mg/L increase (decrease) in TOC. (b) When TOC = 0 mg/L, the predicted BOD level is -55.43
mg/L. The negative value of BOD was obtained because values of TOC near zero were probably
not included in the study. This is another example where the intercept does not have a practical
interpretation.
3.50 (a) The least-squares line for predicting y = GPA from x = IQ has slope
b = 0.6337 × (2.1/13.17) = 0.101 and intercept a = 7.447 - 0.101 × 108.9 = -3.5519. Thus, the
regression line is y = -3.5519 + 0.101x. (b) r^2 = (0.6337)^2 = 0.4016. Thus, 40.16% of the
variation in GPA is accounted for by the linear relationship with IQ. (c) The predicted GPA for
this student is y = -3.5519 + 0.101 × 103 = 6.8511 and the residual is 0.53 - 6.8511 = -6.3211.
3.51 (a) A scatterplot, with the regression line, is shown below.
[Figure: scatterplot of weight (kg) versus age (months), with the regression line]

(b) Clearly, this line does not fit the data very well; the data show a clearly curved pattern. (c)
The residuals sum to 0.01 (the result of roundoff error). The residual plot below shows a clear
quadratic pattern, with the first two and the last four residuals being negative and those between
3 and 8 months being positive.
[Figure: residual plot, residual for weight versus age (months)]

3.52 (a) A scatterplot, with the regression line, is shown below.

[Figure: scatterplot of absorbance versus nitrates (mg/l), with the regression line]

The correlation is 0.99994 > 0.997, so recalibration is not necessary. (b) The regression
line for predicting absorbance is ŷ = 1.6571 + 0.1133x. The average increase in absorbance for a
1 mg/l increase in nitrates is 0.1133. The predicted absorbance when no nitrates are present is
1.6571. Ideally, we should predict no absorbance when nitrates are not present. (c) The
predicted absorbance in a specimen with 500 mg/l of nitrates is
ŷ = 1.6571 + 0.1133 × 500 = 58.308. (d) This prediction should be very accurate since the linear
relationship is almost perfect (see the scatterplot above). Almost 100% (r² = 0.9999) of the
variation in absorbance is accounted for by the linear relationship with nitrates.
3.53 (a) A scatterplot, with the regression line, is shown below.

[Scatterplot with regression line: height (centimeters) vs. age (months)]
(b) The regression line for predicting y = height from x = age is ŷ = 71.95 + 0.3833x. (c) When
x = 40 months: ŷ = 87.28 cm. When x = 60 months: ŷ = 94.95 cm. (d) A change of 6 cm in
12 months is 0.5 cm/month. Sarah is growing at about 0.38 cm/month, more slowly than normal.
3.54 (a) Sarah's predicted height at 480 months is ŷ = 71.95 + 0.3833 × 480 = 255.93 cm.
Converting to inches, Sarah's predicted height is 255.93 × 0.3937 = 100.7596 inches, or about 8.4
feet! (b) The prediction is impossibly large because we incorrectly used the least-squares
regression line to extrapolate far beyond the range of ages in the data.
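The contrast between predicting inside the observed ages (3.53) and extrapolating to 480 months (3.54) can be reproduced with a few lines of code. The sketch below uses the fitted line quoted above. The function name is illustrative, and the "about 35 to 60 months" range is an assumption read from the scatterplot axis rather than from the data table.

```python
CM_TO_INCHES = 0.3937  # conversion factor used in the solution


def predict_height_cm(age_months):
    """Predicted height (cm) from the fitted line y-hat = 71.95 + 0.3833 * age."""
    return 71.95 + 0.3833 * age_months


# Ages inside the range shown on the scatterplot axis (about 35 to 60 months).
height_40 = predict_height_cm(40)
height_60 = predict_height_cm(60)

# Age far outside that range: 40 years = 480 months (extrapolation).
height_480_cm = predict_height_cm(480)
height_480_ft = height_480_cm * CM_TO_INCHES / 12

for age, cm in [(40, height_40), (60, height_60), (480, height_480_cm)]:
    print(f"age {age:3d} months -> predicted height {cm:.2f} cm")
print(f"at 480 months that is about {height_480_ft:.1f} feet")
```

The same formula produces sensible values at 40 and 60 months and an absurd one at 480 months. Nothing in the arithmetic flags the problem, so the range of the data has to be checked separately.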
3.55 (a) The slope of the regression line for predicting final-exam score from pre-exam totals is
b = r(s_y/s_x) = 0.6 × (8/30) = 0.16; for every extra point in the pre-exam total, the score on the final exam
increases by a mean of 0.16. The intercept of the regression line is a = ȳ − bx̄ = 75 − 0.16 × 280 = 30.2; if
the student had a pre-exam total of 0 points, the predicted score on the final would be 30.2. (b)
Julie's predicted final exam score is ŷ = 30.2 + 0.16 × 300 = 78.2. (c) r² = 0.36, so only 36% of
the variability in the final exam scores is accounted for by the linear relationship with pre-exam
totals. About 64% of the individual variation is not accounted for by the least-squares regression
line, so Julie has a good reason to think this is not a good estimate.
3.56 (a) A scatterplot, with the regression line, is shown below.
[Scatterplot with regression line: overseas (% return) vs. U.S. (% return)]
(b) r = 0.4641 and r² = 0.2154, or 21.54%. There is a positive linear association between U.S.
and overseas returns, but it is not very strong. Only 21.54% of the variation in overseas returns
is accounted for by the linear relationship with U.S. returns. (c) The regression line for predicting
overseas returns is ŷ = 5.694 + 0.6201x. (d) The overseas return in 1997 is predicted to be
ŷ = 5.694 + 0.6201 × 33.4 = 26.41%. Thus, the residual is −24.31%. About 78.5% of the
variation in the overseas returns is not accounted for by the linear relationship, and some of
the prediction errors are very large, so predictions based on the regression line will not be very
accurate. (e) In 1986, the overseas return was 69.4%, over 50 percentage points higher than
would be expected. There are no other unusual points.
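The figures in parts (b) through (d) follow directly from the quoted correlation and coefficients. The sketch below recomputes them; the names are illustrative, and only values stated in the solution are used.

```python
r = 0.4641                         # correlation quoted in part (b)
INTERCEPT, SLOPE = 5.694, 0.6201   # coefficients quoted in part (c)


def predict_overseas(us_return):
    """Predicted overseas return (%) for a given U.S. return (%)."""
    return INTERCEPT + SLOPE * us_return


r_squared = r ** 2             # fraction of variation explained by the line
unexplained = 1 - r_squared    # fraction left unexplained
prediction_1997 = predict_overseas(33.4)

print(f"r^2 = {r_squared:.4f}, unexplained = {unexplained:.3f}")
print(f"predicted 1997 overseas return = {prediction_1997:.2f}%")
```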
3.57 (a) The five-number summaries for the different returns are:
            Minimum   First Quartile   Median   Third Quartile   Maximum
U.S.        −26.4%    5.1%             18.2%    30.4%            37.6%
Overseas    −23.2%    2.1%             11.2%    29.6%            69.4%
These statistics are used to construct the side-by-side boxplots shown below.
[Side-by-side boxplots: total return (%) for overseas and U.S. stocks]
(b) The quartiles and the median of the U.S. five-number summary are higher, but the minimum
and maximum overseas returns are higher. (c) Overseas stocks are more volatile. The box is
slightly taller for the overseas returns and the range (maximum − minimum) is larger for the
overseas returns, as shown by the whiskers in the plot above.
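The volatility comparison in part (c) can be made concrete by computing the range and the interquartile range (the height of each box) from the five-number summaries. This is a small sketch with illustrative names. It treats the two minimum returns as losses (−26.4% and −23.2%), which matches the negative axis values on the plots.

```python
# Five-number summaries (percent returns) from part (a); the minimums are
# losses, so they are entered as negative numbers.
five_number = {
    "U.S.":     {"min": -26.4, "q1": 5.1, "median": 18.2, "q3": 30.4, "max": 37.6},
    "Overseas": {"min": -23.2, "q1": 2.1, "median": 11.2, "q3": 29.6, "max": 69.4},
}


def spread(summary):
    """Return (range, IQR) for one five-number summary."""
    return summary["max"] - summary["min"], summary["q3"] - summary["q1"]


spreads = {name: spread(s) for name, s in five_number.items()}
for name, (rng, iqr) in spreads.items():
    print(f"{name:9s} range = {rng:.1f}, IQR = {iqr:.1f}")
```

Both measures of spread come out larger for the overseas returns, consistent with the taller box and longer whiskers described above.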
3.58 Since the least-squares regression line must pass through the point of averages (x̄, ȳ), we know
that ȳ = 46.6 + 0.41x̄. Octavio's predicted final exam score is
ŷ = 46.6 + 0.41(x̄ + 10) = (46.6 + 0.41x̄) + 0.41 × 10 = ȳ + 4.1.
Thus, he is predicted to score 4.1 points above the mean on the final exam.
3.59 (a) The scatterplot is shown below. The guessed values are much higher than expected for
two foods, spaghetti and snack cake.
[Scatterplot: guessed calories vs. correct calories]
(b) The regression line for predicting y = guessed calories from x = actual calories using all
points is ŷ = 58.59 + 1.3036x (r² = 0.68). Excluding spaghetti and snack cake:
ŷ = 43.88 + 1.14721x (r² = 0.968). (c) A scatterplot with the two separate regression lines is
shown below.
[Scatterplot with two regression lines: guessed calories vs. correct calories]
The two removed points could be called influential, in that when they are included, the
regression line passes above every other point; after removing them, the new regression line
passes through the middle of the remaining points.
3.60 (a) Without Child 19, the least-squares regression line for predicting y = score from x = age
is ŷ = 109.305 − 1.1933x.

[Scatterplot: score vs. age]
The slope and intercept change slightly when Child 19 is removed, so this point does not appear
to be extremely influential. (b) With all children, r² = 0.410; without Child 19, r² = 0.572. r²
increases because more of the variability in the scores is explained by the stronger linear
relationship with age. In other words, with Child 19's high Gesell score removed, there is less
variability around the regression line.
3.61 (a) A scatterplot with the two new points is shown below. Point A is a horizontal outlier;
that is, it has a much smaller x-value than the others. Point B is a vertical outlier; it has a higher
y-value than the others.
[Scatterplot: count of new birds vs. percent of returning birds, with Points A and B added]
(b) The three regression formulas are: ŷ = 31.9 − 0.304x (the original data);
ŷ = 22.8 − 0.156x (with Point A); ŷ = 32.3 − 0.293x (with Point B). Adding Point B has little
impact. Point A is influential; it pulls the line down, and changes how the line looks relative to
the original 13 data points.
3.62 (a) Who? The individuals are 16 couples in their mid-twenties who were married or had
been dating for two years. What? The variables are empathy score (a quantitative measure of
empathy from a psychological test) and brain activity (a quantitative variable reported as a
fraction between −1 and 1). Why? The researchers wanted to see how the brain expresses
empathy. In particular, they were interested in checking if women with higher empathy scores
have a stronger response when their partner has a painful experience. When, where, how, and by
whom? The researchers zapped the hands of the men and women to measure brain activity,
presumably in a lab, doctor's office, or hospital. The results appeared in Science in 2004, so the
data were probably collected shortly before publication of the article. (b) Subject 16 is
influential on the correlation. With all subjects, r = 0.515; without Subject 16, r = 0.331. (c)
Subject 16 is not influential on the least-squares regression line (see the scatterplot below).

[Scatterplot with both regression lines: brain activity vs. empathy score]
The regression lines are: ŷ = 0.0578 + 0.0076x (with all subjects) and
ŷ = 0.0152 + 0.0067x (without Subject 16).
3.63 Higher income can cause better health: higher income means more money to pay for
medical care, drugs and better nutrition, which in turn results in better health. Better health can
cause higher income: if workers enjoy better health, they are better able to get and hold a job,
which can increase their income.
3.64 No, you cannot shorten your stay by choosing a smaller hospital. The positive correlation
does not imply a cause and effect relationship. Larger hospitals tend to see more patients in poor
condition, which means that the patients will tend to require a longer stay.
3.65 (a) A scatterplot, with the regression line, is shown below.
[Scatterplot with regression line: farm population (millions of persons) vs. year]
The least-squares regression line for predicting y = farm population from the explanatory
variable x = year is ŷ = 1166.93 − 0.5868x. (b) The farm population decreased on average by
about 0.59 million (590,000) people per year. About 97.7% of the variation in the farm
population is accounted for by the linear relationship with year. (c) The predicted farm
population for the year 2010 is −12,538,000, which is clearly impossible, as population must be greater
than or equal to zero.
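The impossible prediction in part (c) is easy to reproduce, and the same line also shows where the extrapolation first breaks down. The sketch below uses the coefficients quoted in part (a). The function name and the zero-crossing calculation are illustrative additions, not part of the original solution.

```python
INTERCEPT, SLOPE = 1166.93, -0.5868   # coefficients quoted in part (a)


def predict_farm_population(year):
    """Predicted farm population, in millions of persons."""
    return INTERCEPT + SLOPE * year


prediction_2010 = predict_farm_population(2010)

# Year at which the fitted line reaches zero; predictions beyond it are negative.
zero_year = -INTERCEPT / SLOPE

print(f"predicted 2010 population: {prediction_2010:.3f} million")
print(f"fitted line crosses zero during {int(zero_year)}")
```

Any year past the zero crossing gives a negative prediction, so the line cannot describe the farm population much beyond the last year plotted.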
3.66 (a) Who? The individuals are students at a large state university. What? The variables are
the number of first-year students and the number of students who enroll in elementary
mathematics courses. Both variables are quantitative and take on integer values from several
hundred to several thousand, depending on the size of the university. Why? The data were
collected to try to predict the number of students who will enroll in elementary mathematics
courses. When, where, how, and by whom? Faculty members in the mathematics department at
a large state university obtained the enrollment data and class sizes from 1993 to 2000. These
data were probably extracted from a historical database in the Registrar's office. A scatterplot,
with the regression line, is shown below.
[Scatterplot with regression line: number of elementary math students vs. number of first-year students]
The regression line appears to provide a reasonable fit. About 69.4% of the variation in
enrollments for elementary math classes is accounted for by the linear relationship with the
number of first-year students. (b) The residual plots are shown below.

[Residual plots: residual vs. number of first-year students; residual vs. year]
The plot of the residuals against x shows that a somewhat different line would fit the five lower
points well. The three points above the regression line represent a different relation between the
number of first-year students and mathematics enrollments. The plot of the residuals against
year clearly illustrates that the five negative residuals are from the years 1993 to 1997, and the
three positive residuals are from 1998, 1999, and 2000. (c) The change in requirements was not
visible on the scatterplot in part (a) or the plot of the residuals against x. However, the change is
clearly illustrated (negative residuals before 1998 and positive residuals after 1998) on the plot of
the residuals against year.
3.67 The correlation for individual stocks would be lower. Individual stock performances will
be more variable, weakening the relationship.
3.68 (a) A scatterplot, with both regression lines, is shown below. A scatterplot with a circle
around the point from 1986 with the largest residual is shown in the solution to Exercise 3.56.

[Scatterplot with both regression lines: overseas (% return) vs. U.S. (% return)]
As the scatterplot shows, the point from 1986 is not very influential on the regression line. The
two regression lines are: ŷ = 5.694 + 0.6201x (with all points) and ŷ = 4.141 + 0.5885x (without
the point in 1986). (b) The residual plot below, for all of the points, does not show any unusual
patterns, although the large residual is clearly visible.
[Residual plot: residual vs. year]
3.69 (a) Yes, but the relationship is not very strong. (b) The mortality rate is extremely variable
for those hospitals that treat few heart attacks. As the number of patients treated increases, the
variability decreases and the mortality rate appears to decrease, giving the appearance of an
exponentially decreasing pattern of points in the plot. The nonlinearity strengthens the
conclusion that heart attack patients should avoid hospitals that treat few heart attacks.
3.70 (a) A scatterplot, with both regression lines, is below.
[Scatterplot with both regression lines: Round 2 vs. Round 1]
The influential observation (circled) is observation 7, (105, 89). (b) The line with the larger
slope is the line that omits the influential observation (105, 89). The influential point pulls the
regression line with all of the points downward in order to minimize the overall prediction error.


3.71 Age is a lurking variable. We would expect both variables, shoe size and reading
comprehension score, to increase as the child ages.
3.72 (a) A scatterplot, with the two unusual observations marked and the three separate
regression lines added, is shown below.
[Scatterplot with three regression lines: FPG (mg/ml) vs. HbA (%), with Subjects #15 and #18 marked]
(b) The correlations are: r1 = 0.4819 (all observations); r2 = 0.5684 (without Subject 15);
r3 = 0.3837 (without Subject 18). Both outliers change the correlation. Removing Subject 15
increases r, because its presence makes the scatterplot less linear, while removing Subject 18
decreases r, because its presence decreases the relative scatter about the linear pattern. (c) The
three regression lines shown in the scatterplot above are: ŷ = 66.4 + 10.4x (all observations);
ŷ = 69.5 + 8.92x (without #15); ŷ = 52.3 + 12.1x (without #18). While the equation changes in
response to removing either subject, one could argue that neither one is particularly influential,
as the line moves very little over the range of x (HbA) values. Subject #15 is an outlier in terms
of its y value; such points are typically not influential. Subject #18 is an outlier in terms of its x
value, but is not particularly influential because it is consistent with the linear pattern suggested
by the other points.
3.73 (a) Who? The individuals are land masses. What? The two quantitative variables are the
amount of snow cover (in millions of square kilometers) and summer wind stress (in newtons per
square meter). Why? The data were collected to explore a possible effect of global warming.
When, where, how, and by whom? The data from Europe and Asia appear to be collected over a
7-year period during the months of May, June, and July. The amount of snow cover may have
been estimated from aerial photographs or satellite images and the summer wind stress
measurements may have been collected by meteorologists. The scatterplot below suggests a
negative linear association, with correlation r = −0.9179.

[Scatterplot: wind stress (newtons per square meter) vs. snow cover (millions of square kilometers)]
The regression line for predicting y = wind stress from x = snow cover is ŷ = 0.212 − 0.0056x;
r² = 0.843. The linear relationship explains 84.3% of the variation in wind stress. We have good
evidence that decreasing snow cover is strongly associated with increasing wind stress. (b) The
graph shows 3 clusters of 7 points.
3.74 The sketch below shows two clusters of points, each with a positive correlation. The top
cluster represents economists employed by business firms and the bottom cluster represents
economists employed by colleges and universities. When the two clusters are combined into one
large group of economists, the overall correlation is negative.
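The reversal described in 3.74 can be demonstrated numerically. The sketch below uses made-up (x, y) values, not data from the exercise: two clusters that each have a positive correlation, placed so that the upper cluster sits to the left of the lower one. The helper `pearson_r` is an illustrative implementation of the usual correlation formula.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up values for illustration only (not data from the exercise).
top_x, top_y = [1, 2, 3, 4], [10, 12, 11, 13]      # upper cluster
bottom_x, bottom_y = [6, 7, 8, 9], [1, 3, 2, 4]    # lower cluster

r_top = pearson_r(top_x, top_y)
r_bottom = pearson_r(bottom_x, bottom_y)
r_combined = pearson_r(top_x + bottom_x, top_y + bottom_y)

print(f"top cluster:    r = {r_top:.3f}")
print(f"bottom cluster: r = {r_bottom:.3f}")
print(f"combined:       r = {r_combined:.3f}")
```

Within each cluster y rises with x, but the gap between the clusters dominates once they are pooled, so the combined correlation is negative.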

3.75 (a) In the scatterplot below, right-hand points are filled circles; left-hand points are open
circles.
[Scatterplot: time (milliseconds) vs. distance, by hand]
(b) The right-hand points lie below the left-hand points. (This means the right-hand times are
shorter, so the subject is right-handed.) There is no striking pattern for the left-hand points; the
pattern for right-hand points is obscured because they are squeezed at the bottom of the plot. (c)
The regression line for the right hand is ŷ = 99.4 + 0.0283x (r = 0.305, r² = 9.3%). The regression
line for the left hand is ŷ = 172 + 0.262x (r = 0.318, r² = 10.1%). The left-hand regression is
slightly better, but neither is very good: distance accounts for only 9.3% (right) and 10.1% (left)
of the variation in time.
3.76 The two residual plots are shown below; neither shows a systematic pattern.
[Residual plots: residual for right hand vs. time order; residual for left hand vs. time order]
CASE CLOSED (1) A scatterplot is shown below. The average number of home runs hit per
game decreases from 1960 to 1970, then levels off before increasing from about 1980 to 2000.
The correlation is 0.466, which indicates a moderate positive association.
[Scatterplot: average number of HRs per game vs. year, 1960 to 2000]
(2) A scatterplot below, with the regression line, shows a moderately strong linear association
between average home runs per game and year after Rawlings became the supplier. The
correlation is 0.732.
[Scatterplot with regression line: average number of HRs per game vs. year, 1980 to 2000]
(3) The least-squares regression line is ŷ = −61.09 + 0.0316x. The slope (0.0316) indicates the
average increase in the average number of home runs per game as year increases by one. The intercept has
no practical meaning in this setting. (4) The residual plot suggests that the regression line
provides a surprisingly reasonable fit. However, all of the residuals after 1995 are positive, so a
model with some curvature would fit better.
[Residual plot: residual vs. year]
(5) r² = 0.536, which indicates that about 54% of the variation in the average number of home
runs per game is accounted for by the linear relationship with year. In other words, about 46% of
the variation is not explained by the least-squares regression line. (6) The predicted value for
2001 is about 2.16. This estimate is probably not very accurate. In particular, since the residuals
are positive for all years after 1995, this estimate is likely to be too low. (7) The prediction error
is 2.092 − 2.16 = −0.068. The estimate is not bad, and it even overestimated the average number
of home runs per game. (8) No, these data should not be used to predict the mean number of
home runs per game in 2020. This case study has illustrated that patterns can change over time,
so we have no data to help us predict what might happen 20 years in the future. We should not
use the regression line to extrapolate.
3.77 Seriousness of the fire is a lurking variable: more serious fires need more attention. It
would be more accurate to say that a large fire causes more firefighters to be sent, rather than
vice versa.
3.78 (a) Two mothers are 57 inches tall; their husbands are 66 and 67 inches tall. (b) The tallest
fathers are 74 inches tall; there are three of them, and their wives are 62, 64, and 67 inches tall.
(c) There is no clear explanatory variable; either could go on the horizontal axis. (d) Positive
association means that when one parent is short (tall) the other parent also tends to be short (tall).
In other words, there is a direct association between the heights of parents. We say the
association is weak because there is a considerable amount of variation (or scatter) in the points.
3.79 (a) A scatterplot, with the regression line, is shown below. There is a negative association
between alcohol consumption and heart disease.

[Scatterplot with regression line: heart disease death rate (per 100,000) vs. alcohol from wine (liters per year)]
(b) The regression equation for predicting y = heart disease death rate from x = alcohol
consumption is ŷ = 260.56 − 22.969x. The slope provides an estimate for the average decrease
(slope is negative) in the heart disease death rate for a one-liter increase in wine consumption.
Thus, for every extra liter of alcohol consumed, the heart disease death rate decreases on average
by about 23 per 100,000. The intercept provides an estimate for the average death rate (261 per
100,000) when no wine is consumed. (c) The correlation is r = −0.843, which indicates a strong
negative association between wine consumption and heart disease death rate. r² = 0.71, so 71%
of the variation in death rate is accounted for by the linear relationship with wine consumption.
(d) The predicted heart disease death rate is ŷ = 260.56 − 22.969 × 4 = 168.68. (e) No. Positive r
indicates that the least-squares line must have positive slope; negative r indicates that it must
have negative slope. The direction of the association and the slope of the least-squares line must
always have the same sign. Recall that b = r(s_y/s_x) and the standard deviations are always
nonnegative.
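Parts (c) through (e) can be checked numerically. The sketch below uses the correlation and coefficients quoted above, taking r as negative to match the negative association. The standard deviations passed to `slope_from_summary` are hypothetical values chosen only to illustrate the sign argument in part (e); they are not from the exercise.

```python
r = -0.843                            # correlation from part (c)
INTERCEPT, SLOPE = 260.56, -22.969    # coefficients from part (b)


def predict_death_rate(liters_per_year):
    """Predicted heart disease death rate (per 100,000)."""
    return INTERCEPT + SLOPE * liters_per_year


def slope_from_summary(r, s_x, s_y):
    """Least-squares slope b = r * s_y / s_x."""
    return r * s_y / s_x


prediction_4 = predict_death_rate(4)
r_squared = r ** 2

# Hypothetical standard deviations, used only to illustrate part (e):
# with s_x > 0 and s_y > 0, the slope b always has the same sign as r.
b_negative = slope_from_summary(r, s_x=2.5, s_y=68.0)
b_positive = slope_from_summary(-r, s_x=2.5, s_y=68.0)

print(f"predicted death rate at 4 liters: {prediction_4:.2f}")
print(f"r^2 = {r_squared:.2f}")
print(f"slope with r < 0: {b_negative:.2f}; slope with r > 0: {b_positive:.2f}")
```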
3.80 (a) The point at the far left of the plot (Alaska) and the point at the extreme right (Florida)
are unusual. Alaska may be an outlier because its cold temperatures discourage older residents
from remaining in the state. Florida is unusual because many individuals choose to retire there.
(b) The linear association is positive, but very weak. (c) The outliers tend to suggest a stronger
linear trend than the other points and will be influential on the correlation. Thus, the correlation
with the outliers is r = 0.267 , and the correlation without the outliers is r = 0.067 .
3.81 (a) A scatterplot, with the regression line, is shown below.
[Scatterplot with regression line: steps per second vs. speed (feet per second)]
(b) There is a very strong positive linear relationship, r = 0.999. (c) The regression line for
predicting y = steps per second from x = running speed is ŷ = 1.7661 + 0.0803x. (d) Yes,
r² = 0.998, so 99.8% of the variation in steps per second is explained by the linear relationship
with speed. (e) No, the regression line would change because the roles of x and y are reversed.
However, the correlation would stay the same, so r² would also stay the same.
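The point of part (e), that swapping x and y changes the regression line but not the correlation, can be seen on a small example. The values below are made up for illustration and are not the running data from the exercise. The helper `least_squares` is an illustrative implementation of the standard formulas.

```python
import math


def least_squares(xs, ys):
    """Return (intercept, slope, r) for the least-squares line predicting ys from xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r


# Made-up values for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a_yx, b_yx, r_yx = least_squares(x, y)   # predict y from x
a_xy, b_xy, r_xy = least_squares(y, x)   # predict x from y

# Rewrite the second line in the form y = ... to compare slopes on the same axes.
slope_xy_as_y_on_x = 1 / b_xy

print(f"y on x: y-hat = {a_yx:.2f} + {b_yx:.2f}x, r = {r_yx:.4f}")
print(f"x on y: x-hat = {a_xy:.2f} + {b_xy:.2f}y, r = {r_xy:.4f}")
print(f"x-on-y line drawn on the same axes has slope {slope_xy_as_y_on_x:.2f}")
```

The two slopes multiply to r², so the two lines coincide only when r = ±1. With r = 0.999, as in the exercise, they would be close but still not identical.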
3.82 The correlation for the individual runners would be lower because there is much more
variation among the individuals. The variation in the average number of steps for the group is
smaller so the regression line does a great job for the published data.


3.83 (a) One possible measure of the difference is the mean response: 106.2 spikes/second for
pure tones and 176.6 spikes/second for monkey calls, an average of an additional 70.4
spikes/second. (b) A scatterplot, with the regression line ŷ = 93.9 + 0.778x, is shown below.
[Scatterplot with regression line: monkey call response (spikes per second) vs. pure tone response (spikes/second)]
The third point (pure tone 241, call 485 spikes/second) has the largest residual; it is circled. The
first point (474 and 500 spikes/second) is an outlier in the x direction; it is marked with a square.
(c) The correlation drops only slightly (from 0.6386 to 0.6101) when the third point is removed;
it drops more drastically (to 0.4793) without the first point. (d) Without the first point, the
regression line is ŷ = 101 + 0.693x; without the third point, it is ŷ = 98.4 + 0.679x.
3.84 (a) In the mid-1990s, European and American stocks were only weakly linked, but now it
is more common for them to rise and fall together. Thus investing in both types of stocks is
not that much different from investing in either type alone. (b) The article is incorrect; a
correlation of 0.8 means that a straight-line relationship explains about 64% of the variation in
European stock prices.
3.85 The slope is b = r(s_y/s_x) = 0.5 × (2.7/2.5) = 0.54. The regression line, shown below, for predicting y =
husband's height from x = wife's height is ŷ = 33.67 + 0.54x.

[Plot of the regression line: husband's height (inches) vs. woman's height (inches)]
The predicted height is ŷ = 33.67 + 0.54 × 67 = 69.85 inches.


3.86 Who? The individuals are the essays provided by students on the new SAT writing test.
What? The variables are the word count (length of essay) and score. Both variables are
quantitative and take on integer values. Why? The data were collected to investigate the
relationship between length of the essay and score. When, where, how, and by whom? The data
were collected after the first administration of the new SAT writing test in March, 2005. Dr.
Perelman may have obtained the data from the Educational Testing Service or from colleagues
who scored the essays. Graphs: The scatterplot below, with the regression line included, shows
a relationship between length of the essay and score, but the relationship appears to be nonlinear.
The residual plot also shows a clear pattern, so using the least-squares regression line to predict
score from length of essay is not a good idea.
[Scatterplot with regression line: score of essay vs. number of words in essay; residual plot: residual vs. length of essay (number of words)]
Numerical summaries: The correlation between word count and score is 0.881. The least-squares
regression line for predicting y = score from x = word count is ŷ = 1.1728 + 0.0104x. This line
accounts for about 77.5% of the variation in score. Interpretation: Even though the scatterplot
shows a moderately strong positive association between length of the essay and score, we do not
want to jump to conclusions about the nature of this relationship. Better students tend to give
more thorough explanations, so there could be another reason why the longer essays tend to get
high scores. In fact, a careful look at the scatterplot reveals considerably more variation in the
length of the essays for students who received a score of 4, 5, or 6. If Dr. Perelman made his
second conclusion about being right over 90% of the time by rounding the correlation coefficient
from 0.88 to 0.9, then he made a serious mistake with his interpretation of the correlation
coefficient. If scores were assigned by simply sorting the word counts from smallest to largest,
the error rate would be much larger than 10%.
