
Simple Linear Regression


Section V

Chapter 16: Understanding Relationships: Numerical Data, Part 2

Preview
Chapter Learning Objectives
16.1 The Simple Linear Regression Model
16.2 Inferences Concerning the Slope of the Population Regression Line
16.3 Checking Model Adequacy
Are You Ready to Move On? Chapter 16 Review Exercises
Technology Notes
AP* Review Questions for Chapter 16

Preview

In Chapter 4, you learned how to describe relationships between two numerical

variables. When the relationship was judged to be linear you found the equation

of the least squares regression line and assessed the quality of the fit using the

scatterplot, the residual plot, and the values of the coefficient of determination ($r^2$)

and the standard deviation about the least squares line ($s_e$). In this chapter you will

learn how to make inferences about the slope of the population regression line.


85241_ch16_ptg01.indd 734

20/12/12 6:39 PM

Conceptual Understanding

After completing this chapter, you should be able to

C1 Understand how probabilistic and deterministic models differ.

C2 Understand that the simple linear regression model provides a basis for making inferences about linear relationships.

After completing this chapter, you should be able to

M1 Interpret the parameters of the simple linear regression model in context.

M2 Use scatterplots, residual plots, and normal probability plots to assess the credibility of the assumptions of the simple linear regression model.

M3 Know the conditions for appropriate use of methods for making inferences about $\beta$.

M4 Compute the margin of error when the sample slope b is used to estimate a population slope $\beta$.

M5 Use the five-step process for estimation problems (EMC3) and computer output to construct and interpret a confidence interval estimate for the slope of a population regression line.

M6 Use the five-step process (HMC3) to test hypotheses about the slope of the population regression line.

M7 Use graphs to identify potential outliers and influential points.

After completing this chapter, you should be able to

P1 Interpret a confidence interval for a population slope in context.

P2 Carry out the model utility test and interpret the result in context.

Preview Example

Premature Babies

Babies born prematurely (before the 37th week of pregnancy) often have low birth

weights. Is a low birth weight related to factors that affect brain function? The

authors of the paper "Intrauterine Growth Restriction Affects the Preterm Infant's Hippocampus" (Pediatric Research [2008]: 438-43) hoped to use data from a study of premature babies to answer this question. They measured x = birth weight (in grams) and y = hippocampus volume (in mL) for 26 premature babies. The hippocampus is a part of the brain that is important in the development of both short- and long-term memory. The sample correlation coefficient for their data is r = 0.4722 and the equation of the least squares regression line is $\hat{y} = 1.67 + 0.0026x$. The pattern in the scatterplot (Figure 16.1) suggests there may be a positive linear relationship. However, the correlation coefficient is not very large, and the value of the slope is close to zero. Could the pattern observed in the scatterplot, and the nonzero slope, be plausibly

explained by chance? That is, is it plausible that there is no relationship between birth

weight and hippocampus volume in the population of all premature babies? Or does

the sample provide convincing evidence of a linear relationship between these two

variables? If there is evidence of a meaningful relationship between these two variables,

the regression line could be used to predict the hippocampus volume. If the predicted

volume was sufficiently small, early cognitive therapy could be recommended. On the

other hand, if there is no meaningful relationship between these variables, low birth

weight should not automatically trigger potentially expensive therapy.


[Figure 16.1: scatterplot of hippocampus volume (vertical axis) versus birth weight (horizontal axis).]

In this chapter, you will learn methods that will help you determine if there is a

real and useful linear relationship between two variables or if the pattern in the data

could be simply due to chance differences that occur when a sample is selected from a

population.

Section 16.1 The Simple Linear Regression Model

A deterministic relationship between two variables x and y is one in which the value of y is completely determined by the value of the independent variable x. A deterministic relationship can be described, or modeled, using mathematical notation, such as y = f(x), where f(x) is a particular function of x. This relationship is deterministic in the sense that the value of the independent variable is all that is needed to determine the value of the dependent variable. For example, you might convert x = temperature in degrees centigrade to y = temperature in degrees Fahrenheit using y = f(x), where $f(x) = \frac{9}{5}x + 32$. Once the centigrade temperature is known, the Fahrenheit temperature is completely determined. Or you might determine y = amount of money in a savings account after x years, using the compound interest formula, $y = P\left(1 + \frac{r}{n}\right)^{nx}$, where P is the principal (the amount of money deposited), r is the interest rate, and n is the number of times each year the interest is compounded. The number of years you leave the principal in the bank determines the amount in the account.
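The two deterministic relationships just described can be written as ordinary functions. The following is a minimal Python sketch; the function names and the deposit figures in the usage lines are illustrative assumptions, not values from the text.

```python
def fahrenheit(celsius: float) -> float:
    """Deterministic conversion f(x) = (9/5)x + 32."""
    return 9 / 5 * celsius + 32


def balance(principal: float, rate: float, times_per_year: int, years: float) -> float:
    """Compound interest formula y = P(1 + r/n)^(nx)."""
    return principal * (1 + rate / times_per_year) ** (times_per_year * years)


# The same input always produces the same output: there is no random component.
print(fahrenheit(100))            # boiling point of water in degrees Fahrenheit
print(balance(1000, 0.05, 1, 2))  # hypothetical deposit: 1,000 at 5%, compounded yearly
```

Calling either function twice with the same arguments returns the same value, which is what separates a deterministic model from the probabilistic models discussed next.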

In many situations the variables of interest are not deterministically related. For example, the value of y = first-year college grade point average is not determined solely by x = high school grade point average, and y = crop yield is determined partly by factors other than x = amount of fertilizer used. The relationship between two variables, x and y, that are not deterministically related can be described by extending the deterministic model to specify a probabilistic model. The general form of a probabilistic model allows y to be larger or smaller than f(x) by a random amount e. The model equation for a probabilistic model has the form

$$y = f(x) + e$$

In a scatterplot of y versus x, some of the data points will fall above the graph of f(x) and some will fall below. Thinking geometrically, if e > 0, the corresponding point in the scatterplot will lie above the graph of the function y = f(x). If e < 0, the corresponding point will fall below the graph of f(x).

For example, consider the probabilistic model

$$y = \underbrace{50 - 10x + x^2}_{f(x)} + e$$

The graph of the function $y = 50 - 10x + x^2$ is shown as the orange curve in Figure 16.2. The observed point (4, 30) is also shown in the figure. Because $f(4) = 50 - 10(4) + 4^2 = 50 - 40 + 16 = 26$ for this point, you can write y = f(x) + e, where $e = 30 - 26 = 4$. The point (4, 30) falls 4 above the graph of the function $y = 50 - 10x + x^2$.

Unless otherwise noted, all content on this page is © Cengage Learning.

[Figure 16.2: graph of $y = 50 - 10x + x^2$, the deterministic part of a probabilistic model, with the observed point (4, 30) lying e = 4 above the curve's height of 26 at x = 4.]
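The deviation for the observed point (4, 30) can be checked with a short Python sketch. The helper names `f` and `deviation` are chosen here for illustration only.

```python
def f(x: float) -> float:
    """Deterministic part of the model: f(x) = 50 - 10x + x^2."""
    return 50 - 10 * x + x ** 2


def deviation(x: float, y: float) -> float:
    """Random deviation e implied by an observed point (x, y): e = y - f(x)."""
    return y - f(x)


e = deviation(4, 30)
# A positive e puts the point above the curve, a negative e below it.
position = "above" if e > 0 else "below" if e < 0 else "on"
print(f"f(4) = {f(4)}, e = {e}: the point lies {position} the curve")
```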

The simple linear regression model is a special case of the general probabilistic model in

which the deterministic function, f(x), is linear (so its graph is a straight line).

Definition

The simple linear regression model assumes that there is a line with vertical or y intercept $\alpha$ and slope $\beta$, called the population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made,

$$y = \alpha + \beta x + e$$

Without the random deviation e, all observed (x, y) points would fall exactly on

the population regression line. The inclusion of e in the model equation recognizes

that points will deviate from the line by a random amount.

Figure 16.3 shows two observations in relation to the population regression line.

[Figure 16.3: two observations and their deviations from the population regression line (slope $\beta$, vertical intercept $\alpha$). The observation at $x = x_1$ lies above the line (positive deviation $e_1$); the observation at $x = x_2$ lies below it (negative deviation $e_2$).]


Before you actually observe a value of y for any particular value of x, you are

uncertain about the value of e. It could be negative, positive, or even 0. Also, e might

be quite large in magnitude (resulting in a point far from the population regression line)

or quite small (resulting in a point very close to the line). The simple linear regression

model makes some assumptions about the distribution of e at any particular x value in

the population.

Basic Assumptions of the Simple Linear Regression Model

1. The distribution of e at any particular x value has mean value 0. That is, $\mu_e = 0$.

2. The standard deviation of e (which describes the spread of its distribution) is the same for any particular value of x. This standard deviation is denoted by $\sigma_e$.

3. The distribution of e at any particular x value is normal.

4. The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are independent of one another.

The simple linear regression model assumptions about the variability in the values

of e in the population imply that there is also variability in the y values observed at any

particular value of x. Consider y when x has some fixed value x*, so that

$$y = \alpha + \beta x^* + e.$$

Because $\alpha$ and $\beta$ are fixed (they are unknown population values), $\alpha + \beta x^*$ is also a fixed number. The sum of a fixed number and a normally distributed variable (e) is also a normally distributed variable (the bell-shaped curve is simply shifted), so y itself has a normal distribution. Furthermore, $\mu_e = 0$ implies that the mean value of y is $\alpha + \beta x^*$, the height of the population regression line for the value x = x*. Finally, because there is no variability in the fixed number $\alpha + \beta x^*$, the standard deviation of y is the same as the standard deviation of e. These properties are summarized in the following box.

(mean y value for x*) = (height of the population regression line above x*) = $\alpha + \beta x^*$

and

(standard deviation of y for a fixed value x*) = $\sigma_e$

The slope $\beta$ of the population regression line is the mean or expected change in y associated with a 1-unit increase in x. The y intercept $\alpha$ is the height of the population line when x = 0.

The value of $\sigma_e$ determines how much the (x, y) observations deviate vertically from the population line; when $\sigma_e$ is small, most observations will be close to the line, but when $\sigma_e$ is large, the observations will tend to deviate more from the line.

The key features of the model are illustrated in Figures 16.4 and 16.5. Notice that

the three normal curves in Figure 16.4 have identical spreads. This is a consequence of $\sigma_e$ being the same at any value of x, which implies that the variability in the y values at a particular value of x is constant: the variability does not depend on the value of x.
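A small simulation can make these model properties concrete. The sketch below assumes hypothetical parameter values (intercept 2.0, slope 0.5, error standard deviation 1.5) that do not come from the text. It draws many y values at three fixed x values and compares their mean and standard deviation with the height of the line and with the error standard deviation.

```python
import random
import statistics

# Hypothetical parameter values, chosen only for illustration.
ALPHA, BETA, SIGMA_E = 2.0, 0.5, 1.5


def simulate_y(x_star: float, n: int, rng: random.Random) -> list[float]:
    """Draw n observations y = alpha + beta*x* + e with e ~ Normal(0, sigma_e)."""
    return [ALPHA + BETA * x_star + rng.gauss(0, SIGMA_E) for _ in range(n)]


rng = random.Random(1)  # fixed seed so the run is repeatable
summary = {}
for x_star in (10, 20, 30):
    ys = simulate_y(x_star, 20_000, rng)
    summary[x_star] = (statistics.mean(ys), statistics.stdev(ys))
    print(f"x* = {x_star}: mean y = {summary[x_star][0]:.2f} "
          f"(line height {ALPHA + BETA * x_star:.2f}), "
          f"sd of y = {summary[x_star][1]:.2f}")
```

In a run of this sketch, the simulated means should track the line height at each x value while the simulated standard deviations stay close to the same value, which is the constant-spread property shown by the three normal curves.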


[Figure 16.4: the simple linear regression model. The population regression line $y = \alpha + \beta x$ (line of mean values) is shown with normal curves of equal standard deviation centered at $\alpha + \beta x_1$, $\alpha + \beta x_2$, and $\alpha + \beta x_3$.]

[Figure 16.5: data scattered about the population regression line under the model: (a) small $\sigma_e$; (b) large $\sigma_e$.]

Example 16.1

The authors of the article "On Weight Loss by Wrestlers Who Have Been Standing on Their Heads" (paper presented at the Sixth International Conference on Statistics, Combinatorics, and Related Areas, Forum for Interdisciplinary Mathematics, 1999, with the data also appearing in A Quick Course in Statistical Process Control, Mick Norton, 2005) state that

amateur wrestlers who are overweight near the end of the weight certification period, but

just barely so, have been known to stand on their heads for a minute or two, get on their

feet, step back on the scale, and establish that they are in the desired weight class. Using

a headstand as the method of last resort has become a fairly common practice in amateur

wrestling.

Does this really work? Data were collected in an experiment where weight loss was

recorded for each wrestler after exercising for 15 minutes and then doing a headstand for

1 minute 45 sec. Based on these data, the authors of the article concluded that there was in

fact a demonstrable weight loss that was greater than that for a control group that exercised

for 15 minutes but did not do the headstand. (The authors give a plausible explanation for

why this might be the case based on the way blood and other body fluids collect in the head

during the headstand and the effect of weighing while these fluids are draining immediately after standing.) The authors also concluded that a simple linear regression model was

a reasonable way to describe the relationship between the variables

y = weight loss (in pounds)

and

x = body weight prior to exercise and headstand (in pounds)


Suppose that the actual model equation has $\alpha = 0$, $\beta = 0.001$, and $\sigma_e = 0.09$ (these

values are consistent with the findings in the article). The population regression line is

shown in Figure 16.6.

[Figure 16.6: the population regression line y = 0.001x for Example 16.1; the mean y value when x = 190 is 0.19.]

If the distribution of the random errors at any fixed weight (x value) is normal, then the variable y = weight loss is normally distributed with

$$\mu_y = 0 + 0.001x$$

$$\sigma_y = 0.09$$

For example, when x = 190 (corresponding to a 190-pound wrestler), weight loss has mean value

$$\mu_y = 0 + 0.001(190) = 0.19 \text{ pounds}$$

Because the standard deviation of y is $\sigma_y = 0.09$, the interval $0.19 \pm 2(0.09) = (0.01, 0.37)$ includes y values that are within 2 standard deviations of the mean value for y when x = 190. Roughly 95% of the weight loss observations made for 190-lb wrestlers will be in this range. The slope $\beta = 0.001$ can be interpreted as the mean change in weight loss associated with each additional pound of body weight.
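The calculation for a 190-pound wrestler can be reproduced with Python's standard library. This is a sketch under the parameter values assumed in the example; the function name `weight_loss_distribution` is introduced here for illustration.

```python
from statistics import NormalDist

ALPHA, BETA, SIGMA_E = 0.0, 0.001, 0.09  # values assumed in the example


def weight_loss_distribution(body_weight: float) -> NormalDist:
    """Distribution of y = weight loss at a fixed body weight x."""
    return NormalDist(mu=ALPHA + BETA * body_weight, sigma=SIGMA_E)


dist = weight_loss_distribution(190)
low, high = dist.mean - 2 * dist.stdev, dist.mean + 2 * dist.stdev
coverage = dist.cdf(high) - dist.cdf(low)  # proportion within 2 sd of the mean

print(f"mean weight loss: {dist.mean:.2f} lb")
print(f"mean +/- 2 sd: ({low:.2f}, {high:.2f})")
print(f"proportion within 2 sd: {coverage:.4f}")
```

The computed proportion is slightly above 0.95, which is why the text describes the interval as containing roughly 95% of the observations.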

More insight into model properties can be gained by thinking of the population of all

(x, y) pairs as consisting of many smaller subpopulations. Each subpopulation contains

pairs for which x has a fixed value. Suppose, for example, that in a large population of

college students the variables

x = grade point average in major courses

and

y = starting salary after graduation

are related according to the simple linear regression model. Then you can think about the subpopulation of all pairs with x = 3.20 (corresponding to all students with a grade point average of 3.20 in major courses), the subpopulation of all pairs having x = 2.75, and so

on. The model assumes that for each of these subpopulations, y is normally distributed

with the same standard deviation, and that the mean y value (rather than y itself) is linearly

related to x.

In practice, the judgment of whether the simple linear regression model is appropriate (that is, the judgment about the credibility of the assumptions underlying the linear model) must be based on knowledge of how the data were collected, as well as an

inspection of various plots of the data and the residuals. The sample observations should be

independent of one another, which will be the case if the data are from a random sample.

In addition, the scatterplot should show a linear rather than a curved pattern, and the vertical spread of points should be very similar throughout the range of x values. Figure 16.7

shows plots with three different patterns; only the first pattern is consistent with the simple

linear regression model assumptions.


[Figure 16.7: patterns in scatterplots: (a) consistent with the simple linear regression model; (b) suggests a nonlinear probabilistic model; (c) suggests that variability in y changes with x.]

In Section 16.3, you will see how to check whether the basic assumptions of the simple

linear regression model are reasonable. When this is the case, the values of $\alpha$ and $\beta$ (y intercept and slope of the population regression line) can be estimated from sample data.

The estimates of $\alpha$ and $\beta$ are denoted by a and b, respectively. These estimates are the values of the intercept and slope of the least squares regression line. Recall that the least squares regression line is the line for which the sum of squared vertical deviations of points in the scatterplot from the line is smaller than for any other line.

The estimates of the slope and the y intercept of the population regression line are the slope and y intercept, respectively, of the least squares line. That is,

$$b = \text{estimate of } \beta = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

$$a = \text{estimate of } \alpha = \bar{y} - b\bar{x}$$

The values of a and b are usually obtained using statistical software or a graphing calculator. If the slope and intercept are calculated by hand, you can use the following computational formula:

$$b = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$$

The estimated regression line is the least squares line

$$\hat{y} = a + bx$$

Let x* denote a specified value of the independent variable x. Then a + bx* has two different interpretations:

1. It is a point estimate of the mean y value when x = x*.

2. It is a point prediction of an individual y value to be observed when x = x*.
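The defining formulas for a and b and the computational formula for b can be compared directly in code. The following Python sketch uses a small made-up data set (not from the text); the function names are illustrative.

```python
from statistics import fmean


def least_squares(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Return (a, b) from the defining formulas for the least squares line."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def slope_computational(xs: list[float], ys: list[float]) -> float:
    """Slope from the computational formula based on raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    return (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)


# Hypothetical data, made up for this sketch.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares(xs, ys)
print(f"y-hat = {a:.3f} + {b:.3f}x")
print(f"computational slope: {slope_computational(xs, ys):.3f}")
```

Both routes give the same slope up to rounding. The computational form avoids computing the means first, which is convenient by hand but can lose precision in floating point when the x values are large relative to their spread.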

Example 16.2 Mother's Age and Baby's Birth Weight

Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females. (Low birth weight in humans is generally defined as a weight below 2,500 grams.) Because low-birth-weight babies have higher mortality rates, a number of studies have examined the relationship between birth weight and mother's age for babies born to young mothers.

One such study is described in the article "Body Size and Intelligence in 6-Year-Olds: Are Offspring of Teenage Mothers at Risk?" (Maternal and Child Health Journal [2009]:

847-856). The following data on

x = maternal age (in years)

and

y = birth weight of baby (in grams)

are consistent with summary values given in the article and also with data published by the National Center for Health Statistics.

Observation   x (maternal age, yr)   y (birth weight, g)
1             15                     2,289
2             17                     3,393
3             18                     3,271
4             15                     2,648
5             16                     2,897
6             19                     3,327
7             17                     2,970
8             16                     2,535
9             18                     3,138
10            19                     3,573

A scatterplot of the data is given in Figure 16.8. The scatterplot shows a linear pattern,

and the spread in the y values appears to be similar across the range of x values. This

supports the appropriateness of the simple linear regression model.

[Figure 16.8: scatterplot of birth weight (baby's weight, in g) versus maternal age (mother's age, in yr) for Example 16.2.]

For these data, the equation of the estimated regression line was found using statistical software, resulting in

$$\hat{y} = a + bx = -1{,}163.45 + 245.15x$$

An estimate of the mean birth weight of babies born to 18-year-old mothers results from substituting x = 18 into the estimated equation:

$$a + bx = -1{,}163.45 + 245.15(18) = 3{,}249.25 \text{ grams}$$

Similarly, you would predict the birth weight of a baby to be born to a particular 18-year-old mother to be

$$\hat{y} = \text{predicted } y \text{ value when } x = 18 = a + b(18) = 3{,}249.25 \text{ grams}$$

The estimate of the mean weight and the prediction of an individual baby weight are

identical, because the same x value was used in each calculation. However, their interpretations differ. One is the prediction of the weight of a single baby whose mother is 18, whereas

the other is an estimate of the mean weight of all babies born to 18-year-old mothers.

In Example 16.2, the x values in the sample ranged from 15 to 19. The estimated

regression equation should not be used to make an estimate or prediction for any x value

much outside this range. Without sample data for such values, or some clear theoretical

reason for expecting the relationship to be linear outside the observed range of x values,

you have no reason to believe that the estimated linear relationship continues outside the

range from 15 to 19. Making predictions outside this range can be misleading, and statisticians refer to this as the danger of extrapolation.
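The fit for Example 16.2 and the caution about extrapolation can be sketched together in Python. The strict range check in `predict` is a design choice made for this illustration (the text only warns against x values "much outside" the observed range), and the function name is hypothetical.

```python
from statistics import fmean

# Data from Example 16.2: mother's age (years) and baby's birth weight (grams).
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

x_bar, y_bar = fmean(ages), fmean(weights)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, weights))
     / sum((x - x_bar) ** 2 for x in ages))
a = y_bar - b * x_bar


def predict(age: float) -> float:
    """Point estimate or prediction a + b*age; refuses to extrapolate."""
    if not min(ages) <= age <= max(ages):
        raise ValueError(f"age {age} is outside the observed range "
                         f"{min(ages)} to {max(ages)}")
    return a + b * age


print(f"y-hat = {a:.2f} + {b:.2f}x")
print(f"predicted birth weight at age 18: {predict(18):.2f} g")
```

Calling `predict(25)` raises an error instead of returning a number, because the sample gives no information about whether the linear pattern continues at that age.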


The value of $\sigma_e$ determines the extent to which observed points (x, y) tend to fall close to or far away from the population regression line. A point estimate of $\sigma_e$ is based on

$$\text{SSResid} = \sum (y - \hat{y})^2$$

where $\hat{y}_1 = a + bx_1, \ldots, \hat{y}_n = a + bx_n$ are the fitted or predicted y values and the residuals are $y_1 - \hat{y}_1, \ldots, y_n - \hat{y}_n$. SSResid is a measure of the extent to which the sample data spread out around the estimated regression line.

Definition

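The statistic for estimating the variance $\sigma_e^2$ is

$$s_e^2 = \frac{\text{SSResid}}{n - 2}$$

where

$$\text{SSResid} = \sum (y - \hat{y})^2 = \sum y^2 - a\sum y - b\sum xy$$

The subscript e in $\sigma_e^2$ and $s_e^2$ is a reminder that you are estimating the variance of the errors or residuals.

The estimate of $\sigma_e$ is the estimated standard deviation

$$s_e = \sqrt{s_e^2}$$

The number of degrees of freedom associated with estimating $\sigma_e^2$ or $\sigma_e$ in simple linear regression is $n - 2$.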

The estimates and number of degrees of freedom here have analogs in previous work involving a single sample $x_1, x_2, \ldots, x_n$. The sample variance $s^2$ had a numerator of $\sum (x - \bar{x})^2$, a sum of squared deviations (residuals), and denominator $n - 1$, the number of degrees of freedom associated with $s^2$ and s. The use of $\bar{x}$ as an estimate of $\mu$ in the formula for $s^2$ reduces the number of degrees of freedom by 1, from n to $n - 1$. In simple linear regression, estimation of two quantities, $\alpha$ and $\beta$, results in a loss of 2 degrees of freedom, leaving $n - 2$ as the number of degrees of freedom associated with SSResid, $s_e^2$, and $s_e$.
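The estimate of the error standard deviation and its degrees of freedom can be computed as follows. This Python sketch uses made-up data (not from the text), and `residual_sd` is an illustrative name.

```python
import math
from statistics import fmean


def residual_sd(xs: list[float], ys: list[float]) -> tuple[float, int]:
    """Return (s_e, degrees of freedom) for the least squares line."""
    n = len(xs)
    x_bar, y_bar = fmean(xs), fmean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    # SSResid: sum of squared vertical deviations from the fitted line.
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    df = n - 2  # two estimated quantities (a and b) cost two degrees of freedom
    return math.sqrt(ss_resid / df), df


# Hypothetical data, made up for this sketch.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
s_e, df = residual_sd(xs, ys)
print(f"s_e = {s_e:.4f} on {df} degrees of freedom")
```

Dividing by n - 2 instead of n makes the estimate slightly larger, which compensates for the fact that the fitted line was chosen to sit as close to the sample points as possible.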

Once the estimated regression equation has been found, the usefulness of this model is evaluated using a residual plot and the values of $s_e$ and the coefficient of determination, $r^2$. Recall from Chapter 4 that the values of $s_e$ and $r^2$ are interpreted as described in the following box.

The coefficient of determination, $r^2$, is the proportion of variability in y that can be explained by the approximate linear relationship between x and y.

The value of $s_e$, the estimated standard deviation about the population regression line, is interpreted as the typical amount by which an observation deviates from the population regression line.

Example 16.3 Estimating Elk Weight

Wildlife biologists monitor the ecological health of animals. For large animals whose habitat is relatively inaccessible, this can present some practical problems. The Rocky Mountain

elk is the fourth largest deer species and is a case in point. Males range up to 7.5 feet in

length and over 500 pounds in weight. The equipment, manpower, and time needed to weigh

these creatures make direct measurement of weight difficult and expensive. The authors of

the paper "Estimating Elk Weight From Chest Girth" (Wildlife Society Bulletin [1996]: 58-611) found they could reliably estimate elk weights by a much more practical method: measuring the chest girth and then using linear regression to estimate the weight. They measured the chest girth and weight of 19 Rocky Mountain elk in Custer State Park, South Dakota. The resulting data (from a scatterplot in the paper) are given in the accompanying table. The table also includes the predicted values and residuals for the estimated regression line.

Girth (cm)   Weight (kg)   Predicted y Value   Residual
96           87            136.266             -38.2661
105          196           161.069              34.9314
108          163           169.336              -6.3361
109          196           172.092              23.9080
110          183           174.848               8.1522
114          171           185.871             -14.8711
121          230           205.162              24.8380
124          225           213.429              11.5705
131          211           232.720             -21.7203
135          231           243.744             -12.7436
137          225           249.255             -24.2553
138          266           252.011              13.9889
140          241           257.523             -16.5228
142          264           263.034               0.9655
157          284           304.372             -20.3720
157          292           304.372             -12.3720
159          300           309.884              -9.8837
155          337           298.860              38.1397
162          339           318.151              20.8488

The scatterplot (Figure 16.9) gives evidence of a strong positive linear relationship between

x = chest girth (in cm)

and

y = weight (in kg)

[Figure 16.9: scatterplot of weight (kg) versus chest girth (cm) for Example 16.3.]

Regression Analysis: Weight versus Girth

The regression equation is
Weight = -136 + 2.81 Girth

Predictor      Coef   SE Coef       T      P
Constant    -135.51     35.75   -3.79  0.001
Girth        2.8063    0.2686   10.45  0.000

S = 23.6626   R-Sq = 86.5%   R-Sq(adj) = 85.7%


$$\hat{y} = -136 + 2.81x$$

$$r^2 = 0.865$$

$$s_e = 23.6626$$

Approximately 86.5% of the observed variation in elk weight y can be attributed to the

linear relationship between weight and chest girth. The magnitude of a typical deviation

from the least-squares line is about 23.6626 kg, which is relatively small in comparison to

the y values themselves.
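The summary values for the elk data can be recomputed from the girth and weight columns of the table. The Python sketch below uses only the standard library; the identity $r^2 = 1 - \text{SSResid}/\text{SSTo}$, with SSTo the total sum of squared deviations of y about its mean, is the usual definition of the coefficient of determination and is assumed here since it is not restated in this section.

```python
import math
from statistics import fmean

# Girth (cm) and weight (kg) for the 19 elk in Example 16.3.
girth = [96, 105, 108, 109, 110, 114, 121, 124, 131, 135,
         137, 138, 140, 142, 157, 157, 159, 155, 162]
weight = [87, 196, 163, 196, 183, 171, 230, 225, 211, 231,
          225, 266, 241, 264, 284, 292, 300, 337, 339]

n = len(girth)
x_bar, y_bar = fmean(girth), fmean(weight)
s_xx = sum((x - x_bar) ** 2 for x in girth)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(girth, weight))
ss_total = sum((y - y_bar) ** 2 for y in weight)  # SSTo

b = s_xy / s_xx
a = y_bar - b * x_bar
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(girth, weight))

r_squared = 1 - ss_resid / ss_total
s_e = math.sqrt(ss_resid / (n - 2))

print(f"y-hat = {a:.2f} + {b:.4f}x")
print(f"r^2 = {r_squared:.3f}, s_e = {s_e:.4f}")
```

The slope, intercept, $r^2$, and $s_e$ computed this way should agree, up to rounding, with the Coef, R-Sq, and S entries in the software output.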

Another important assumption of the simple linear regression model is that the

random deviations at any particular x value are normally distributed. In Section 16.3,

you will see how the residuals can be used to determine whether this assumption is

plausible.

Section 16.1 Exercises

Each exercise set assesses the following chapter learning objectives: C1, M1

Section 16.1

Exercise Set 1

16.1 Identify the following relationships as deterministic

or probabilistic:

a. The relationship between the length of the sides of a

square and its perimeter.

b. The relationship between the height and weight of an adult.

c. The relationship between SAT score and college freshman

GPA.

d. The relationship between tree height in centimeters and

tree height in inches.

16.2 Let x be the size of a house (in square feet) and y be

the amount of natural gas used (therms) during a specified

period. Suppose that for a particular community, x and y are

related according to the simple linear regression model with

$\alpha$ = y intercept of population regression line = -5.0

3000 square feet.

a. What is the equation of the population regression line?

b. Graph the population regression line by first finding the

point on the line corresponding to x 5 1000 and then

the point corresponding to x 5 2000, and drawing a line

through these points.

c. What is the mean value of gas usage for houses with

2100 sq. ft. of space?

d. What is the average change in usage associated with a 1

sq. ft. increase in size?

e. What is the average change in usage associated with a

100 sq. ft. increase in size?

f. Would you use the model to predict mean usage for a 500

sq. ft. house? Why or why not?

16.3 Suppose that a simple linear regression model is appropriate for describing the relationship between y = house price (in dollars) and x = house size (in square feet) for houses in a large city. The population regression line is y = 23,000 + 47x and $\sigma_e$ = 5000.

a. What is the average change in price associated with one

extra square foot of space? With an additional 100 sq. ft.

of space?

b. Approximately what proportion of 1800 sq. ft. homes

would be priced over $110,000? Under $100,000?

Section 16.1

Exercise Set 2

16.4 Identify the following relationships as deterministic

or probabilistic:

a. The relationship between height at birth and height at one

year of age.

b. The relationship between a positive number and its

square root.

c. The relationship between temperature in degrees

Fahrenheit and degrees centigrade.

d. The relationship between adult shoe size and shirt size.

16.5 The flow rate in a device used for air quality measurement depends on the pressure drop x (inches of water) across the device's filter. Suppose that for x values between 5 and 20, these two variables are related according to the simple linear regression model with population regression line y = -0.12 + 0.095x.

a. What is the mean flow rate for a pressure drop of

10 inches? A drop of 15 inches?

b. What is the average change in flow rate associated with

a 1 inch increase in pressure drop? Explain.

16.6 The paper Predicting Yolk Height, Yolk Width,

Albumen Length, Eggshell Weight, Egg Shape Index, Eggshell

Thickness, Egg Surface Area of Japanese Quails Using

Various Egg Traits as Regressors (International Journal of


relationship between y = eggshell thickness (in micrometers) and x = egg length (mm) for quail eggs. Suppose that the population regression line is y = 0.135 + 0.003x and that $\sigma_e$ = 0.005. Then, for a fixed x value, y has a normal distribution with mean 0.135 + 0.003x and standard deviation 0.005.

a. What is the mean eggshell thickness for quail eggs that

are 15 mm in length? For quail eggs that are 17 mm in

length?

b. What is the probability that a quail egg with a length of

15 mm will have a shell thickness that is greater than

0.18 mm?

c. Approximately what proportion of quail eggs of length

14 mm has a shell thickness of greater than 0.175? Less

than 0.178?

Additional Exercises

16.7 Tom and Ray are managers of electronics stores with slightly different pricing strategies for USB drives. In Tom's store, customers pay the same amount, c, for each USB drive. In Ray's store, it is a little more exciting. The customer pays an up-front cost of $1.00. Ray charges the same

price per USB drive, c, but at the register the customer flips

a coin. If the coin lands heads up, the customer gets his or

her $1.00 back, plus another dollar off the total cost of the

USB drives purchased.

a. Which of these pricing strategies can be expressed as a

deterministic model?

b. Using mathematical notation, specify a model using Tom's pricing strategy that relates y = total cost to x = number of USB drives purchased.

c. Using mathematical notation, specify a model using Ray's pricing strategy that relates y = total cost to x = number of USB drives purchased.

d. Describe the distribution of e for the probabilistic model

described above. What is the mean of the distribution

of e? What is the standard deviation of e?

16.8 Identify the following relationships as deterministic or

probabilistic:

a. The relationship between the speed limit and a driver's

speed.

b. The relationship between the price in dollars and the

price in Euros of an object.

c. The relationship between the number of pages and the

number of words in a text book.

d. The relationship between the possible numbers of pennies and the nickels in a pile if no other coins are in the

pile and the amount of money in the pile is $3.00.

16.9 Hormone replacement therapy (HRT) is thought to

increase the risk of breast cancer. The accompanying data

on x = percent of women using HRT and y = breast cancer

incidence (cases per 100,000 women) for a region in

Cancer Incidence after Decrease in Utilisation of Hormone

Replacement Therapy (Epidemiology [2008]: 427-430). The

describe the relationship between HRT use and breast cancer

incidence.

HRT Use (%)    Breast Cancer Incidence (cases per 100,000 women)
46.30          103.30
40.60          105.00
39.50          100.00
36.60           93.80
30.00           83.50

b. What is the estimated average change in breast cancer

incidence associated with a 1 percentage point increase

in HRT use?

c. What would you predict the breast cancer incidence to be

in a year when HRT use was 40%?

d. Should you use this regression model to predict breast

cancer incidence for a year when HRT use was 20%?

Explain.

e. Calculate and interpret the value of r².

f. Calculate and interpret the value of s_e.

16.10 Consider the accompanying data on x = advertising share and y = market share for a particular brand of soft drink

during 10 randomly selected years.

x 0.103 0.072 0.071 0.077 0.086 0.047 0.060 0.050 0.070 0.052

y 0.135 0.125 0.120 0.086 0.079 0.076 0.065 0.059 0.051 0.039

think the simple linear regression model would be

appropriate for describing the relationship between

x and y?

b. Calculate the equation of the estimated regression line

and use it to obtain the predicted market share when the

advertising share is 0.09.

c. Compute r². How would you interpret this value?

d. Calculate a point estimate of σ_e. How many degrees of

freedom is associated with this estimate?

16.11 The authors of the paper "Weight-Bearing Activity During Youth Is a More Important Factor for Peak Bone Mass than Calcium Intake" (Journal of Bone and Mineral Research [1994]: 1089-1096) studied a number of variables they thought might be related to bone mineral density (BMD). The accompanying data on x = weight at age 13 and y = bone mineral density at age 27 are consistent with summary quantities for women given in the paper.


Weight (kg)    BMD (g/cm^2)
54.4           1.15
59.3           1.26
74.6           1.42
62.0           1.06
73.7           1.44
70.8           1.02
66.8           1.26
66.7           1.35
64.7           1.02
71.8           0.91
69.7           1.28
64.7           1.17
62.1           1.12
68.5           1.24
58.3           1.00


women whose age 13 weight was 60 kg.

16.12 The production of pups and their survival are the most

significant factors contributing to gray wolf population growth.

The causes of early pup mortality are unknown and difficult

to observe. The pups are concealed within their dens for 3

weeks after birth, and after they emerge it is difficult to confirm

their parentage. Researchers recently used portable ultrasound

equipment to investigate some factors related to reproduction ("Diagnosing Pregnancy, in Utero Litter Size, and Fetal Growth with Ultrasound in Wild, Free-Ranging Wolves," Journal of Mammalogy [2006]: 85-92). A scatterplot of y = length of

(in days) is shown below. Computer output from a regression

analysis is also given.

[Figure: scatterplot with fitted line, "Bivariate Fit of Emb Ves Diam (cm) By Gest Age (days)"; vertical axis: Emb Ves Diam (cm), horizontal axis: Gest Age (days)]

[Figure: scatterplot with fitted line; vertical axis: BMD (g/cm^2), horizontal axis: Weight (kg)]

Linear Fit
BMD (g/cm^2) = 0.5584011 + 0.0094363*Weight (kg)

Summary of Fit
RSquare                       0.121081
RSquare Adj                   0.053472
Root Mean Square Error        0.155141
Mean of Response              1.18
Observations (or Sum Wgts)    15

Lack of Fit

Analysis of Variance

Parameter Estimates
Term           Estimate     Std Error    t Ratio    Prob>|t|
Intercept      0.5584011    0.466212     1.20       0.2524
Weight (kg)    0.0094363    0.007051     1.34       0.2038

can be explained by the simple linear regression model?

b. Give a point estimate of σ_e and interpret this estimate.

c. Give an estimate of the average change in BMD associated with a 1 kg increase in weight at age 13.

Summary of Fit
RSquare                       0.792803
RSquare Adj                   0.780615
Root Mean Square Error        0.450587
Mean of Response              2.482526
Observations (or Sum Wgts)    19

Lack of Fit

Analysis of Variance

Parameter Estimates
Term               Estimate      Std Error    t Ratio    Prob>|t|
Intercept          -3.497279     0.748605     -4.67      0.0002*
Gest Age (days)    0.1903121     0.023597     8.07       <.0001*

b. What is the estimated embryonic sac diameter for a

gestational age of 30 days?

c. What is the average change in sac diameter associated

with a 1-day increase in gestational age?

d. What is the average change in sac diameter associated

with a 5-day increase in gestational age?

e. Would you use this model to predict the mean embryonic sac diameter for all gestation ages from conception

to birth? Why or why not?


Section 16.2 Inferences Concerning the Slope of the Population Regression Line

The slope coefficient β in the simple linear regression model represents the average or expected change in the response variable y that is associated with a 1-unit increase in the value of the independent variable x. For example, consider x = the size of a house (in square feet) and y = selling price of the house. If the simple linear regression model is appropriate for the population of houses in a particular city, β would be the average increase in selling price associated with a 1-square-foot increase in size. As another example, if x = amount of time per week a computer system is used and y = the resulting annual maintenance expense, then β would be the expected change in expense associated with using the computer system one additional hour per week.

Because the value of β is almost always unknown, it must be estimated from sample data. The slope of the least squares regression line, b, provides an estimate. In some situations, the value of the statistic b may vary greatly from sample to sample, and the value of b computed from a single sample may be quite different from the value of the population slope, β. In other situations, almost all possible samples result in a value of b that is quite close to β. The sampling distribution of b provides information about the behavior of this statistic.

Inferences about the slope

of the population regression line are based on the

sampling distribution of

the statistic b. The properties given here depend on

the four basic assumptions

of the linear regression

model being met. In Section 16.3, you will see how

to determine if these assumptions are reasonable.

When the four basic assumptions of the simple linear regression model are satisfied:

1. The mean value of the sampling distribution of b is β. That is, μ_b = β, so the sampling distribution of b is always centered at the value of β. This means that b is an unbiased statistic for estimating β.

2. The standard deviation of the sampling distribution of the statistic b is

$$\sigma_b = \frac{\sigma_e}{\sqrt{\sum (x - \bar{x})^2}}$$

3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).
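The first two properties can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the x values, the population intercept and slope (`ALPHA`, `BETA`), and `SIGMA_E` are hypothetical numbers chosen for the demonstration, not values from the text. It repeatedly generates samples from y = α + βx + e, computes the least squares slope b for each sample, and compares the mean and standard deviation of the simulated slopes with β and σ_b.

```python
import math
import random
import statistics


def sigma_b(x_values, sigma_e):
    """Standard deviation of the sampling distribution of b (property 2)."""
    x_bar = statistics.fmean(x_values)
    sxx = sum((x - x_bar) ** 2 for x in x_values)
    return sigma_e / math.sqrt(sxx)


def least_squares_slope(x_values, y_values):
    """Slope b of the least squares regression line."""
    x_bar = statistics.fmean(x_values)
    y_bar = statistics.fmean(y_values)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_values, y_values))
    sxx = sum((x - x_bar) ** 2 for x in x_values)
    return sxy / sxx


def simulate_slopes(x_values, alpha, beta, sigma_e, reps, seed=1):
    """Draw `reps` samples from y = alpha + beta*x + e and return each slope b."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        y_values = [alpha + beta * x + rng.gauss(0.0, sigma_e) for x in x_values]
        slopes.append(least_squares_slope(x_values, y_values))
    return slopes


# Hypothetical population values, assumed only for this illustration.
X = list(range(1, 11))
ALPHA, BETA, SIGMA_E = 5.0, 1.5, 2.0

slopes = simulate_slopes(X, ALPHA, BETA, SIGMA_E, reps=5000)
print("mean of simulated b values:", statistics.fmean(slopes))  # close to BETA
print("sd of simulated b values:  ", statistics.stdev(slopes))  # close to sigma_b
print("sigma_b from the formula:  ", sigma_b(X, SIGMA_E))
```

With these assumed values, the simulated mean should land near 1.5 and the simulated standard deviation near the value given by the formula, which is what unbiasedness and property 2 lead you to expect.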

The fact that b is unbiased tells you that the sampling distribution is centered at the right place, but it gives no information about variability. If σ_b is large, the sampling distribution of b will be quite spread out around β and an estimate far from the value of β could result. For

$$\sigma_b = \frac{\sigma_e}{\sqrt{\sum (x - \bar{x})^2}}$$

to be small, the numerator σ_e should be small (little variability about the population line) and/or the denominator $\sqrt{\sum (x - \bar{x})^2}$ should be large. Because $\sum (x - \bar{x})^2$ is a measure of how much the observed x values spread out, β tends to be more precisely estimated when the x values in the sample are spread out rather than when they are close together. The normality of the sampling distribution of b implies that the standardized variable

$$z = \frac{b - \beta}{\sigma_b}$$

has a standard normal distribution. However, inferential methods cannot be based on this statistic, because the value of σ_b is not known (because the unknown σ_e appears in the numerator of σ_b). One way to proceed is to estimate σ_e with s_e to obtain an estimate of σ_b.

The estimated standard deviation of the statistic b is

$$s_b = \frac{s_e}{\sqrt{\sum (x - \bar{x})^2}}$$
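As a rough sketch of how s_b could be computed from sample data, the function below fits the least squares line, computes s_e as the square root of SSResid/(n − 2) (the usual definition of the standard deviation about the least squares line), and then divides by the square root of the sum of squared deviations of x. The function name and the five data points are hypothetical and were chosen only so the arithmetic is easy to follow by hand.

```python
import math


def slope_standard_error(x_values, y_values):
    """Return (b, s_e, s_b) for the least squares line fit to the data."""
    n = len(x_values)
    x_bar = sum(x_values) / n
    y_bar = sum(y_values) / n
    sxx = sum((x - x_bar) ** 2 for x in x_values)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_values, y_values))
    b = sxy / sxx                      # slope of the least squares line
    a = y_bar - b * x_bar              # intercept of the least squares line
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(x_values, y_values))
    s_e = math.sqrt(ss_resid / (n - 2))  # estimate of sigma_e, df = n - 2
    s_b = s_e / math.sqrt(sxx)           # estimated standard deviation of b
    return b, s_e, s_b


# Hypothetical data, assumed only for this illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b, s_e, s_b = slope_standard_error(x, y)
print(f"b = {b:.3f}, s_e = {s_e:.4f}, s_b = {s_b:.4f}")
```

For these five points, the sum of squared deviations of x is 10 and SSResid is 2.4, so s_e is the square root of 0.8 and s_b is the square root of 0.08.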

For inferences about the

slope of the population regression line, df = n − 2.


When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable

$$t = \frac{b - \beta}{s_b}$$

is the t distribution with df = (n − 2).


In the same way that

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

was used in Chapter 12 to develop a confidence interval for μ, the t variable in the preceding box can be used to obtain a confidence interval for β.

Confidence Interval for β

When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for β, the slope of the population regression line, has the form

b ± (t critical value)s_b

where the t critical value is based on df = n − 2. Appendix Table 3 gives critical values corresponding to the most frequently used confidence levels.

The interval estimate of β is centered at b and extends out from the center by an amount that depends on the sampling variability of b. When s_b is small, the interval is narrow, implying that the investigator has relatively precise knowledge of the value of β. Calculation of a

confidence interval for the slope of a population regression line is illustrated in Example 16.4.

In Section 7.2, you learned four key questions that guide the decision about what statistical inference method to consider in any particular situation. In Section 7.3, a five-step

process for estimation problems was introduced.

The four key questions of Section 7.2 were

Q  Question Type
S  Study Type: Sample data or experiment data?
T  Type of Data: One variable or two? Categorical or numerical?
N  Number of Samples or Treatments: How many samples or treatments?

When the answers to these questions are

Q: estimation
S: sample data
T: two numerical variables
N: one sample

the method you will want to consider in a regression setting is the confidence interval for the slope of a population regression line.

Once you have selected the confidence interval for the slope of a population regression line as the method you want to consider, because this is an estimation problem you

would follow the five-step process for estimation problems (EMC³).

Example 16.4 The Bison of Yellowstone Park

The dedicated work of conservationists for over 100 years has brought the bison in

Yellowstone National Park from near extinction to a herd of over 3,000 animals. This

recovery is a mixed blessing. Many bison have been exposed to the bacteria that cause

brucellosis, a disease that infects domestic cattle, and there are many domestic cattle herds

near Yellowstone. Because of concerns that free-ranging bison can infect nearby cattle, it

is important to monitor and manage the size of the bison population and, if possible, keep

bison from transmitting this bacteria to ranch cattle. The article "Reproduction and Survival of Yellowstone Bison" (The Journal of Wildlife Management [2007]: 2365-2372) described a

large multiyear study of the factors that influence bison movement and herd size. The


researchers studied a number of environmental factors to better understand the relationship between bison reproduction and the environment. One factor thought to influence

reproduction is stress due to accumulated snow, which makes foraging more difficult for

the pregnant bison. Data from 1981-1997 on

y = spring calf ratio (SCR)

and

x = previous fall snow-water equivalent (SWE)

are shown in the accompanying table. Spring calf ratio is the ratio of calves to adults, a

measure of reproductive success. The researchers were interested in estimating the mean

change in spring calf ratio associated with each additional cm in snow-water equivalent.

Let's answer the four key questions for this problem.

SCR     SWE       SCR     SWE
0.19    1,933     0.22    3,317
0.14    4,906     0.22    3,332
0.21    3,072     0.18    3,511
0.23    2,543     0.21    3,907
0.26    3,509     0.25    2,533
0.19    3,908     0.19    4,611
0.29    2,214     0.22    6,237
0.23    2,816     0.17    7,279
0.16    4,128

Q  Question Type: Estimation
S  Study Type: Sample data
T  Type of Data: Two numerical variables
N  Number of Samples or Treatments: One sample

The answers are estimation, sample data, two numerical variables, one sample. This combination of answers suggests considering a confidence interval for the slope of a population regression line. You can now use the five-step process (EMC³) to estimate the slope of the population regression line.

Step

Estimate

In this example, the value of β, the mean increase in spring calf ratio for each

additional 1 cm of snow-water equivalent, will be estimated.

Method

Because the answers to the four key questions are estimation, sample data,

two numerical variables, one sample, a confidence interval for β, the slope of

the population regression line, will be considered.

For this example, a 95% confidence level will be used.

Check

The four basic assumptions of the simple linear regression model need to be

met in order to use the confidence interval.


would need to assume that these years are representative of yearly circumstances at Yellowstone, and that each year's reproduction and snowfall is

independent of previous years. You should keep this in mind when you get

to the step that involves interpretation.

A scatterplot of the data is shown here. The pattern in the plot looks linear

and the spread does not seem to be different for different values of x.

[Figure: scatterplot of SCR (vertical axis) versus SWE (horizontal axis)]

[Figure: residual plot (vertical axis: Residuals)]

outliers, it is reasonable to think that the distribution of e is

approximately normal.

Calculate

Linear Fit
SCR = 0.2606561 - 0.0136639*SWE

Summary of Fit
RSquare                       0.257644
RSquare Adj                   0.208153
Root Mean Square Error        0.033513
Mean of Response              0.209412
Observations (or Sum Wgts)    17


Unless otherwise noted, all content on this page is Cengage Learning.


Parameter Estimates
Term         Estimate     Std Error    t Ratio    Prob>|t|
Intercept    0.2606561    0.023885     10.91      <.0001*
SWE          -0.013664    0.005989     -2.28      0.0375*

s_b = 0.005989 (the Std Error for the SWE term)

df = n − 2 = 17 − 2 = 15

The t critical value for a 95% confidence level and df = 15 is 2.13.

b ± (t critical value)s_b
= −0.0137 ± (2.13)(0.00599)
= (−0.0265, −0.0009)

Communicate

Results

Confidence interval:

You can be 95% confident that the true average change in spring calf

ratio associated with an increase of 1 cm in the snow-water equivalent is

between −0.0265 and −0.0009.

Confidence level:

The method used to construct this interval estimate is successful in

capturing the actual value of the slope of the population regression about

95% of the time.
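The arithmetic in the Calculate step can be checked with a few lines of code. This is a minimal sketch: the function name `slope_confidence_interval` is a hypothetical helper, and the inputs are the rounded values used in the example (b = −0.0137, s_b = 0.00599, t critical value = 2.13).

```python
def slope_confidence_interval(b, s_b, t_critical):
    """Confidence interval b ± (t critical value) * s_b for the population slope."""
    margin = t_critical * s_b
    return b - margin, b + margin


# Rounded values from the bison example: df = 17 - 2 = 15, 95% confidence level.
lower, upper = slope_confidence_interval(b=-0.0137, s_b=0.00599, t_critical=2.13)
print(f"({lower:.4f}, {upper:.4f})")  # prints (-0.0265, -0.0009)
```

Because the whole interval lies below 0, the sample is consistent with a negative population slope, which matches the interpretation given in the Communicate Results step.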

Hypotheses about β can be tested using a t test similar to the t tests introduced in Chapters 12 and 13. The null hypothesis states that β has a specified hypothesized value. The t statistic results from standardizing b, the estimate of β, under the assumption that H0 is true. When H0 is true, the sampling distribution of this statistic is the t distribution with df = n − 2.

Hypothesis Test for the Slope of the Population Regression Line, β

Appropriate when the four basic assumptions of the simple linear regression model are reasonable:

1. The distribution of e at any particular x value has mean value 0 (that is, μ_e = 0).

2. The standard deviation of e is σ_e, which does not depend on x.

3. The distribution of e at any particular x value is normal.

4. The random deviations e_1, e_2, e_3, ..., e_n associated with different observations are independent of one another.

When these conditions are met, the following test statistic can be used:

$$t = \frac{b - \beta_0}{s_b}$$

where β_0 is the hypothesized value from the null hypothesis.

Form of the null hypothesis: H0: β = β_0

When the assumptions of the simple linear regression model are reasonable and the null hypothesis is true, the t test statistic has a t distribution with df = n − 2.

Associated P-value:

When the alternative hypothesis is    The P-value is
Ha: β > β_0                           Area to the right of the computed t under the appropriate t curve
Ha: β < β_0                           Area to the left of the computed t under the appropriate t curve
Ha: β ≠ β_0                           2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative


This test is a method you should consider when the answers to the four key questions

are hypothesis testing, sample data, two numerical variables, one sample. You would carry

out this test using the five-step process for hypothesis testing problems (HMC³).

Inference for a population slope generally focuses on two questions:

(1) Is the population slope equal to zero?

(2) What are plausible values for the population slope?

The question of plausible values can be addressed by calculating a confidence interval for

the population slope. The question of whether a population slope is equal to zero can be answered by using the hypothesis testing procedure with a null hypothesis H0: β = 0. This test of H0: β = 0 versus Ha: β ≠ 0 is called the model utility test for simple linear regression.

The default computer output for inference for a regression slope is for the model utility test.

When the null hypothesis of the model utility test is true, the population regression

line is a horizontal line, and the value of y in the simple linear regression model does not

depend on x. That is,

$$y = \alpha + \beta x + e = \alpha + 0x + e = \alpha + e$$

If β is in fact equal to 0, knowledge of x will be of no use (it will have "no utility") for predicting y. On the other hand, if β is different from 0, there is a useful linear relationship between x and y, and knowledge of x is useful for predicting y. This is illustrated by

the scatterplots in Figure 16.10.

[Figure 16.10: (a) β = 0 (slope = 0); (b) β ≠ 0 (nonzero slope)]

The model utility test for simple linear regression is the test of

H0: β = 0

versus

Ha: β ≠ 0

The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y. If H0 is rejected, you can conclude that the simple linear regression model is useful for predicting y.

The test statistic is the t ratio

$$t = \frac{b - 0}{s_b} = \frac{b}{s_b}$$

It is recommended that the model utility test be carried out before using the estimated

regression line to make inferences.

Have you experienced a sudden flood of memory when scanning from station to station on

your car radio and recognized a song from your past? Perhaps you could remember the title

of the song, the artist, and even when the song was released. From a seemingly small amount

of information you were able to recover a great deal of the song's context from memory. The article "Plink: Thin Slices of Music" (Krumhansl, C. Music Perception [2010]: 337-354) describes

a study of this phenomenon. The investigator compiled a list of songs from Rolling Stone,

Billboard, and Blender lists of songs plus some recent songs familiar to college students.

Twenty-three college students were then exposed to 56 clips of songs. Most of these students

had had musical training, and they listened to popular music for an average of 21.7 hours

per week. After hearing three short clips from a song (only 400 ms in duration), the students

were asked in what year each of the songs was released. The accompanying table shows the

Actual     Judged     Actual     Judged     Actual     Judged     Actual     Judged
Release    Release    Release    Release    Release    Release    Release    Release
1998       1997.2     1976       1983.3     1976       1988.0     1970       1985.4
1967       1973.7     2008       1995.0     2006       1996.7     1975       1985.9
1998       1996.3     1971       1979.8     1974       1985.4     1991       1993.3
1999       1993.3     1965       1976.8     2007       1999.8     2008       1995.4
1983       1985.4     1967       1975.0     1976       1987.2     1965       1977.6
1982       1988.0     1971       1978.0     1974       1977.6     1987       1990.7
1965       1970.2     1967       1978.0     1970       1982.8     1975       1986.3
1991       1992.8     1984       1983.3     1971       1976.3     1968       1986.7
1983       1984.1     1984       1989.8     1999       1988.5     1987       1988.0
1976       1979.3     1968       1976.7     1997       1994.1     2008       1990.2
1971       1975.4     1965       1978.5     2006       1995.4     1982       1991.1
1981       1984.6     1965       1977.2     1981       1989.3     1979       1983.7
1967       1973.7     1979       1986.7     2008       1993.7     2000       1989.8
2007       1997.2     1997       1996.3     1965       1981.1     2000       1991.1

actual release year and the average of the release years given by the students. The actual

release years ranged from 1965 (The Beatles, "Help") to 2008 (Katy Perry, "I Kissed a Girl").

Is there a relationship between the judged and actual release year for these songs? A scatterplot of the data (Figure 16.11) suggests that there is a linear relation between these two variables, but this can be confirmed using the model utility test.

With x = actual release year and y = judged release year, the equation of the estimated regression line is ŷ = 1095 + 0.449x. The five-step process for hypothesis testing

can be used to carry out the model utility test.


[Figure 16.11: scatterplot of judged release year versus actual release year; vertical axis: Judged, horizontal axis: Actual]

Process Step

H Hypotheses

In the model utility test, the null hypothesis is that there is no useful relationship between the actual and the judged release year: H0: β = 0.

The alternative hypothesis specifies that there is a useful relationship: β ≠ 0.

Hypotheses:

Null hypothesis: H0: β = 0

Alternative hypothesis: Ha: β ≠ 0

M Method

Because the answers to the four key questions are hypothesis testing, sample data, two numerical variables in a regression setting and one sample, a hypothesis test for the slope of a population regression line will be considered.

The test statistic for this test is

$$t = \frac{b - 0}{s_b} = \frac{b}{s_b}$$

The value of 0 in the test statistic is the hypothesized value from the null hypothesis.

For this example, a significance level of 0.05 will be used.

Significance level: α = 0.05

C Check

In Section 16.3, you will see how to check to see if the four assumptions of the simple linear regression model

are reasonable. For this example, you can assume that these assumptions are reasonable and proceed with the

model utility test.

C Calculate

Linear Fit
Judged Release = 1095.1525 + 0.449281*Actual Release

Summary of Fit
RSquare                       0.771
RSquare Adj                   0.766759
Root Mean Square Error        3.59844
Mean of Response              1986.013
Observations (or Sum Wgts)    56

Lack of Fit

Analysis of Variance

Parameter Estimates
Term              Estimate     Std Error    t Ratio    Prob>|t|
Intercept         1095.1525    66.07159     16.58      <.0001*
Actual Release    0.449281     0.033321     13.48      <.0001*

s_b = 0.033321 (the Std Error for the Actual Release term)


Test statistic:

$$t = \frac{b - 0}{s_b} = \frac{0.449 - 0}{0.0333} = 13.48$$

Associated P-value:

P-value = 2P(t > 13.48) ≈ 0

C Communicate

results

Because the P-value is less than the selected significance level, the null hypothesis is rejected.

Decision: Reject H0.

Conclusion: The sample data provide convincing evidence that there is a useful linear relationship between the

actual release year and the judged release year.
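The test statistic and P-value in this example can be reproduced approximately in code. The sketch below is an illustration under stated assumptions, not the method used by the statistical software in the example: in place of a t table or a statistics library, it approximates the area under the t curve with Simpson's rule, and the function names are hypothetical. The inputs b = 0.449281, s_b = 0.033321, and n = 56 are taken from the computer output above.

```python
import math


def t_density(t, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))


def two_sided_p_value(t_stat, df, panels=2000):
    """Approximate 2 * (area beyond |t|) using Simpson's rule on [0, |t|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / panels  # panels must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area_zero_to_t = total * h / 3
    # Total area is 1, and the curve is symmetric about 0.
    return max(0.0, 1.0 - 2.0 * area_zero_to_t)


# Values from the computer output for the song release year example.
b, s_b, n = 0.449281, 0.033321, 56
t_stat = (b - 0) / s_b
p_value = two_sided_p_value(t_stat, df=n - 2)

print("t =", round(t_stat, 2))            # prints t = 13.48
print("reject H0 at 0.05:", p_value < 0.05)  # prints reject H0 at 0.05: True
```

As a sanity check on the approximation, `two_sided_p_value(2.131, 15)` comes out very close to 0.05, which agrees with the t critical value of about 2.13 for df = 15 used in the confidence interval example.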

Because the model utility test confirms that there is a useful linear relationship

between judged release year and actual release year, it would be reasonable to use the

estimated regression model to predict the judged release year for a given song based on

its actual release year. Of course, before you do this, you would also want to evaluate the

accuracy of predictions by looking at the value of s_e.

When H0: β = 0 cannot be rejected using the model utility test at a reasonably small significance level, the search for a useful model must continue. One possibility is to relate y to x using a nonlinear model, an appropriate strategy if the scatterplot shows curvature.

Section 16.2 Exercises

Each exercise set assesses the following chapter learning objectives: C2, M3, M4 , M5, M6, P1, P2

Section 16.2 Exercise Set 1

16.13 The standard deviation of the errors, se, is an important part of the linear regression model.

a. What is the relationship between the value of se and the

value of the test statistic in a test of hypotheses about β?

b. What is the relationship between the value of se and the

width of a confidence interval for β?

appropriate amounts of sleep for people 9 to 19 years of

age. In that research, a linear regression model is used to

describe the relationship between alertness and number of

hours of sleep the night before. The researchers reported a

95% confidence interval, but newspapers usually report an

estimate and a margin of error.

a. In order to calculate a margin of error from the reported

confidence interval, what additional conditions, if any,

need to be verified?

b. In order to calculate a margin of error from the reported

confidence interval, what additional information, if any,

is needed?

16.15. A nursing student has completed his final project, and

is preparing for a meeting with his project advisor. The subject

of his project was the relationship between systolic blood pressure (SBP) and body mass index (BMI). The last time he met

with his advisor he had completed his measurements, but only

entered half his data into his statistical software. For the data he


met. In a short paragraph, explain, using appropriate statistical

terminology, which of the conditions below must be rechecked.

1. The standard deviation of e is the same for all values of x.

2. The distribution of e at any particular x value is normal.

16.16 Consider the accompanying data on x = research

and development expenditure (thousands of dollars) and y =

growth rate (% per year) for eight different industries.

x       y
2024    1.90
5038    3.96
905     2.44
3572    0.88
1157    0.37
327     −0.90
378     0.49
191     1.01

information for predicting growth rate from research and

development expenditure? Use a .05 level of significance.

b. Use a 90% confidence interval to estimate the average

change in growth rate associated with a $1000 increase in

expenditure. Interpret the resulting interval.

16.17 The paper "The Effects of Split Keyboard Geometry on Upper Body Postures" (Ergonomics [2009]: 104-111)

describes a study to determine the effects of several keyboard characteristics on typing speed. One of the variables

considered was the front-to-back surface angle of the

keyboard. Minitab output resulting from fitting the simple

linear regression model with x = surface angle (degrees)

and y = typing speed (words per minute) is given below.


The regression equation is
Typing Speed = 60.0 + 0.0036 Surface Angle

Predictor        Coef       SE Coef    T         P
Constant         60.0286    0.2466     243.45    0.000
Surface Angle    0.00357    0.03823    0.09      0.931

Analysis of Variance
Source            DF    SS        MS        F       P
Regression        1     0.0023    0.0023    0.01    0.931
Residual Error    3     0.7857    0.2619
Total             4     0.7880

regression model are met. Carry out a hypothesis test

to decide if there is a useful linear relationship between

x and y.

b. Are the values of s_e and r² consistent with the conclusion

from Part (a)? Explain.

16.18 Do taller adults make more money? The authors

of the paper "Stature and Status: Height, Ability, and Labor Market Outcomes" (Journal of Political Economics [2008]: 499-532) investigated the association between height and

describe the relationship between x = height (in inches) and

y = log(weekly gross earnings in dollars) in a very large

sample of men. The logarithm of weekly gross earnings was

used because this transformation resulted in a relationship

that was approximately linear. The paper reported that the

slope of the estimated regression line was b = 0.023 and

the standard deviation of b was s_b = 0.004. Carry out a

hypothesis test to decide if there is convincing evidence of a

useful linear relationship between height and the logarithm

of weekly earnings. You can assume that the basic assumptions of the simple linear regression model are met.

16.19 The effects of grazing animals on grasslands have

been the focus of numerous investigations by ecologists.

One such study, reported in "The Ecology of Plants, Large Mammalian Herbivores, and Drought in Yellowstone National Park" (Ecology [1992]: 2043-2058), proposed using the simple linear regression model to relate y = green biomass concentration (g/cm³) to x = elapsed time since snowmelt (days).

ŷ = 106.3 − 0.640x. What is the estimate of average change in

biomass concentration associated with a 1-day increase

in elapsed time?

b. What value of biomass concentration would you predict

when elapsed time is 40 days?

c. The sample size was n = 58, and the reported value of

the coefficient of determination was 0.470. What does

this tell you about the linear relationship between the two

variables?


Section 16.2

Exercise Set 2

16.20 Consider a test of hypotheses about β, the population

slope in a linear regression model.

a. If you reject the null hypothesis, β = 0, what does this

mean in terms of a linear relationship between x and y?

b. If you fail to reject the null hypothesis, β = 0, what does

this mean in terms of a linear relationship between x and y?

16.21 Researchers studying pleasant touch sensations measured the firing frequency (impulses per second) of nerves that

were stimulated by a light brushing stroke on the forearm and

also recorded the subject's numerical rating of how pleasant the sensation was. The accompanying data was read from a graph in the paper "Coding of Pleasant Touch by Unmyelinated Afferents in Humans" (Nature Neuroscience, April 12, 2009).

Firing Frequency    Pleasantness Rating
23                  0.2
24                  1.0
22                  1.2
25                  1.2
27                  1.0
28                  2.0
34                  2.3
33                  2.2
36                  2.4
34                  2.8

a. Estimate the mean change in pleasantness rating associated with an increase of 1 impulse per second in firing

frequency using a 95% confidence interval. Interpret the

resulting interval.

b. Carry out a hypothesis test to decide if there is convincing

evidence of a useful linear relationship between firing

frequency and pleasantness rating.

16.22 The largest commercial fishing enterprise in the

southeastern United States is the harvest of shrimp. In a

study described in the paper "Long-term Trawl Monitoring of White Shrimp, Litopenaeus setiferus (Linnaeus), Stocks within the ACE Basin National Estuarine Research Reserve, South Carolina" (Journal of Coastal Research [2008]: 193-199),

abundance of white shrimp. One variable the researchers

thought might be related to abundance is the amount of oxygen in the water. The relationship between mean catch per tow

of white shrimp and oxygen concentration was described by

fitting a regression line using data from ten randomly selected

offshore sites. (The catch per tow is the number of shrimp

caught in a single outing.) Computer output is shown below.

The regression equation is
Mean catch per tow = -5859 + 97.2 O2 Saturation

Predictor        Coef     SE Coef    T        P
Constant         -5859    2394       -2.45    0.040
O2 Saturation    97.22    34.63      2.81     0.023

S = 481.632    R-Sq = 49.6%    R-Sq(adj) = 43.3%

a. Is there convincing evidence of a useful linear relationship between the shrimp catch per tow and oxygen concentration density? Explain.


hypotheses using α = 0.05.

why not?

c. Construct a 95% confidence interval for β and interpret

it in context.

d. What margin of error is associated with the confidence

interval in Part (c)?

Adults with Childhood Lead Exposure (Public Library of Science

Medicine [May 27, 2008]: e112) studied the relationship between

childhood environmental lead exposure and a measure of

brain volume change in a particular region of the brain. Data

were given for x = mean childhood blood lead level (μg/dL)

and y = brain volume change (BVC, in percent). A subset of

data read from a graph that appeared in the paper was used to

produce the accompanying Minitab output.

Regression Analysis: BVC versus Mean Blood Lead Level

The regression equation is

BVC = −0.00179 − 0.00210 Mean Blood Lead Level

Predictor                    Coef    SE Coef      T      P
Constant                −0.001790   0.008303  −0.22  0.830
Mean Blood Lead Level  −0.0021007  0.0005743  −3.66  0.000

Study   Control    CHI
  1        250     303
  2        360     491
  3        475     659
  4        525     683
  5        610     922
  6        740    1044
  7        880    1421
  8        920    1329
  9       1010    1481
 10       1200    1815

Ag[Br,I] Emulsions at Millisecond Range Exposures

(Photographic Science and Engineering [1981]: 138–144) gave the accompanying data on x = % light absorption and y = peak photovoltage.

x    4.0   8.7  12.7  19.1  21.4  24.6  28.9  29.8  30.5
y   0.12  0.28  0.55  0.68  0.85  1.02  1.15  1.34  1.29

evidence of a useful linear relationship between x and y. You

can assume that the basic assumptions of the simple linear

regression model are met.


16.26 The accompanying data were read from a plot

(and are a subset of the complete data set) given in the

article Cognitive Slowing in Closed-Head Injury (Brain and

Cognition [1996]: 429–440). The data represent the mean

response times for a group of individuals with closed-head

injury (CHI) and a matched control group without head

injury on 10 different tasks. Each observation was based on

a different study, and used different subjects, so it is reasonable to assume that the observations are independent.

a. Fit a linear regression model that would allow you to

predict the mean response time for those suffering a

closed-head injury from the mean response time on the

same task for individuals with no head injury.

b. Do the sample data support the hypothesis that there is a

useful linear relationship between the mean response time

for individuals with no head injury and the mean response


Additional Exercises

16.24 a. Explain the difference between the line y = α + βx and the line ŷ = a + bx.

b. Explain the difference between β and b.

c. Let x* denote a particular value of the independent variable. Explain the difference between α + βx* and a + bx*.

d. Explain the difference between σ and s_e.

[Plot: PeakPhotoVoltage versus %LightAbsorption, with fitted line]

Linear Fit

PeakPhotoVoltage = 0.082594 + 0.0446485* %LightAbsorption

Summary of Fit

RSquare                       0.982731
RSquare Adj                   0.980264
Root Mean Square Error        0.061117
Mean of Response              0.808889
Observations (or Sum Wgts)           9

Analysis of Variance

Parameter Estimates

Term               Estimate   Std Error  t Ratio  Prob>|t|
Intercept          0.082594    0.049093     1.68    0.1364
%LightAbsorption  0.0446485    0.002237    19.96   <.0001*


between the peak photovoltage and the percent of light

absorption?

b. What is the equation of the estimated regression line?

c. How much of the observed variation in peak photovoltage can be explained by the model relationship?

d. Predict peak photovoltage when percent absorption is 19.1,

and compute the value of the corresponding residual.

e. The authors claimed that there is a useful linear relationship between the two variables. Do you agree? Carry out

a formal test.

f. Give an estimate of the average change in peak photovoltage associated with a 1 percentage point increase in

light absorption. Your estimate should convey information about the precision of estimation.

believed that for each 1-mm increase in chord length, cranial capacity would be expected to increase by 20 cm³. Do

these new experimental data provide convincing evidence

against this prior belief?

16.29 Suppose you are given the computer output

shown. You are interested in testing the null hypothesis

β = 1.0 versus an alternative hypothesis of β > 1.0.

Describe how you would use the given computer output

to test these hypotheses.

16.28 In anthropological studies, an important characteristic of fossils is cranial capacity. Frequently skulls

are at least partially decomposed, so it is necessary to

use other characteristics to obtain information about

capacity. One measure that has been used is the length

of the lambda-opisthion chord. The article Vertesszollos

and the Presapiens Theory (American Journal of Physical

Anthropology [1971]) reported the accompanying data for n

x (chord length in mm)    78    75    78    81    84    86    87
y (capacity in cm³)      850   775   750   975   915  1015  1030


Linear Fit

y = 5.6452776 + 0.9797401*x

Summary of Fit

RSquare                       0.985289
RSquare Adj                   0.984954
Root Mean Square Error        12.48525
Mean of Response              0.791304
Observations (or Sum Wgts)          46

Lack of Fit

Analysis of Variance

Parameter Estimates

Term        Estimate  Std Error  t Ratio  Prob>|t|
Intercept  5.6452776    1.84302     3.06   0.0037*
x          0.9797401   0.018048    54.29   <.0001*

Section 16.2 introduced methods for estimating and testing hypotheses about β, the slope in the simple linear regression model

y = α + βx + e

In this model, e represents the random deviation of a y value from the population regression line α + βx. The methods presented in Section 16.2 require that some assumptions about the random deviations in the simple linear regression model be met in order for inferences to be valid. These assumptions include:

1. At any particular x value, the distribution of e is normal.

2. At any particular x value, the standard deviation of e is σ_e, which is constant over all values of x (that is, σ_e does not depend on x).

Inferences based on the simple linear regression model are still appropriate if model

assumptions are slightly violated (for example, mild skew in the distribution of e).

However, interpreting a confidence interval or the result of a hypothesis test when assumptions are seriously violated can result in misleading conclusions. For this reason, it is

important to be able to detect any serious violations.

Residual Analysis

If the deviations e1, e2, ..., en from the population line were available, they could be examined for any inconsistencies with model assumptions. For example, a normal probability plot of these deviations would suggest whether or not the normality assumption was plausible. However, because these deviations are

\[
\begin{aligned}
e_1 &= y_1 - (\alpha + \beta x_1) \\
&\;\;\vdots \\
e_n &= y_n - (\alpha + \beta x_n)
\end{aligned}
\]

they can be calculated only if α and β are known. In practice, this will almost never be the case. Instead, diagnostic checks must be based on the residuals

\[
\begin{aligned}
y_1 - \hat{y}_1 &= y_1 - (a + b x_1) \\
&\;\;\vdots \\
y_n - \hat{y}_n &= y_n - (a + b x_n)
\end{aligned}
\]

which are the deviations from the estimated regression line. When all model assumptions are

met, the mean value of the residuals at any particular x value is 0. Any observation that gives

a large positive or negative residual should be examined carefully for any unusual circumstances, such as a recording error or nonstandard experimental condition. Identifying residuals with unusually large magnitudes is made easier by inspecting standardized residuals.

Recall that a quantity is standardized by subtracting its mean value (0 in this case) and

dividing by its actual or estimated standard deviation:

\[
\text{standardized residual} = \frac{\text{residual}}{\text{estimated standard deviation of residual}}
\]

The value of a standardized residual tells you the distance (in standard deviations) of the

corresponding residual from its expected value, 0.

Because residuals at different x values have different standard deviations (depending on the value of x for that observation),¹ computing the standardized residuals can be

tedious. Fortunately, many computer regression programs provide standardized residuals.

Example 16.6 Revisiting the Elk

Example 16.3 introduced data on

x 5 chest girth (in cm)

and

y 5 weight (in kg)

for a sample of 19 Rocky Mountain elk. (See Example 16.3 for a more detailed description

of the study.)

Inspection of the scatterplot in Figure 16.12 suggests the data are consistent with the

assumptions of the simple linear regression model.

[Figure 16.12: scatterplot of Weight (kg) versus Girth (cm) for the elk data]

¹ The estimated standard deviation of the ith residual, y_i − ŷ_i, is

\[
s_e \sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum (x - \bar{x})^2}}
\]


The data, residuals, and the standardized residuals (computed using Minitab) are

given in Table 16.1. For the positive residual with the largest magnitude, 38.1397, the standardized residual is 1.81294. That is, the residual is approximately 1.8 standard deviations above its expected value of 0. This value is not particularly unusual in a sample of this size. Also notice that for the negative residual with the largest magnitude, −38.2661, the standardized residual is −1.92313, still not unusual in a sample of this size. On the standardized

scale, no residual here is surprisingly large.

Table 16.1 Data, residuals, and standardized residuals for the elk data

Observation   Girth (cm)   Weight (kg)   Predicted    Residual   Standardized
                  x             y            ŷ                     Residual
     1            96            98        136.266    −38.2661     −1.92313
     2           105           196        161.069     34.9314      1.68004
     3           108           163        169.336     −6.3361     −0.30135
     4           109           196        172.092     23.9080      1.13323
     5           110           183        174.848      8.1522      0.38517
     6           114           171        185.871    −14.8711     −0.69477
     7           121           230        205.162     24.8380      1.14452
     8           124           225        213.429     11.5705      0.53117
     9           131           211        232.720    −21.7203     −0.99323
    10           135           231        243.744    −12.7436     −0.58320
    11           137           225        249.255    −24.2553     −1.11135
    12           138           266        252.011     13.9889      0.64147
    13           140           241        257.523    −16.5228     −0.75921
    14           142           264        263.034      0.9655      0.04448
    15           157           284        304.372    −20.3720     −0.97540
    16           157           292        304.372    −12.3720     −0.59236
    17           159           300        309.884     −9.8837     −0.47699
    18           155           337        298.860     38.1397      1.81294
    19           162           339        318.151     20.8488      1.01967
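The residuals and standardized residuals in Table 16.1 can be reproduced from the girth and weight values alone. The following is a minimal sketch, assuming Python with NumPy; the function name `standardized_residuals` is an illustrative choice and not part of any statistics package. It fits the least squares line and divides each residual by its estimated standard deviation, which depends on how far that observation's x value is from the mean of x.

```python
import numpy as np

def standardized_residuals(x, y):
    """Residuals and standardized residuals from a least squares line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b = np.sum((x - x.mean()) * (y - y.mean())) / sxx  # estimated slope
    a = y.mean() - b * x.mean()                        # estimated intercept
    resid = y - (a + b * x)
    s_e = np.sqrt(np.sum(resid ** 2) / (n - 2))        # std. deviation about the line
    # Estimated standard deviation of each residual; it varies with x.
    sd_resid = s_e * np.sqrt(1 - 1 / n - (x - x.mean()) ** 2 / sxx)
    return resid, resid / sd_resid

# Elk data from Table 16.1
girth = [96, 105, 108, 109, 110, 114, 121, 124, 131, 135,
         137, 138, 140, 142, 157, 157, 159, 155, 162]
weight = [98, 196, 163, 196, 183, 171, 230, 225, 211, 231,
          225, 266, 241, 264, 284, 292, 300, 337, 339]

resid, std_resid = standardized_residuals(girth, weight)
for obs, (r, s) in enumerate(zip(resid, std_resid), start=1):
    print(f"{obs:2d}  {r:9.4f}  {s:8.5f}")
```

Because the divisor is different for each observation, the ordering of the standardized residuals need not match the ordering of the raw residuals exactly, although for the elk data the two tell the same story.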

Next, consider the assumption of the normality of the e's. Figure 16.13 shows box plots of

the residuals and standardized residuals. The box plots are approximately symmetric and

there are no outliers, so the assumption of normally distributed errors seems reasonable.

[Figure 16.13: box plots of the residuals and standardized residuals for the elk data]

Notice that the boxplots of the residuals and standardized residuals are nearly identical. While it is preferable to work with the standardized residuals, if you do not have access

to a computer package or calculator that will produce standardized residuals, a plot of the

unstandardized residuals should suffice.

A normal probability plot of the standardized residuals (or the residuals) is

another way to assess whether it is reasonable to assume that e1, e2,..., en all come

from the same normal distribution. An advantage of the normal probability plot, shown

in Figure 16.14, is that the value of each residual can be seen, which provides more

information about the distribution. The patterns in the normal probability plots of the standardized residuals and of the residuals for the elk data are both reasonably straight, which supports the assumption that the error distribution is normal. The patterns in the two plots are also similar, so you don't need to construct both; either plot could be used.

[Figure 16.14: normal probability plots of the residuals and standardized residuals for the elk data]
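To see how the points in a normal probability plot are obtained, the sketch below pairs the ordered standardized residuals from Table 16.1 with normal scores and reports the correlation between them as a rough numerical summary of straightness. It assumes Python with NumPy and the standard library; the plotting positions (i − 0.375)/(n + 0.25) are one common convention and an assumption here, since software packages differ slightly in how they compute normal scores.

```python
import numpy as np
from statistics import NormalDist

def normal_scores(n):
    """Approximate expected z values for an ordered sample of size n."""
    std_normal = NormalDist()
    # Plotting positions (i - 0.375) / (n + 0.25): an assumed convention.
    return np.array([std_normal.inv_cdf((i - 0.375) / (n + 0.25))
                     for i in range(1, n + 1)])

# Standardized residuals for the elk data (Table 16.1)
std_resid = [-1.92313, 1.68004, -0.30135, 1.13323, 0.38517, -0.69477,
             1.14452, 0.53117, -0.99323, -0.58320, -1.11135, 0.64147,
             -0.75921, 0.04448, -0.97540, -0.59236, -0.47699, 1.81294,
             1.01967]

ordered = np.sort(std_resid)
scores = normal_scores(len(ordered))

# Each (normal score, ordered residual) pair is one point in the plot.
for z, r in zip(scores, ordered):
    print(f"{z:7.3f}  {r:8.5f}")

# A correlation close to 1 corresponds to a nearly straight pattern.
r = np.corrcoef(scores, ordered)[0, 1]
print(f"correlation = {r:.3f}")
```

The correlation is only a summary; looking at the plot itself remains the better check, because a single unusual point or a curved pattern can be hidden by a high correlation.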

When considering linear regression, your first step should be to study the scatterplot and a residual plot. These two plots provide important information about whether a linear model is appropriate.

A plot of the (x, residual) pairs is called a residual plot, and a plot of the (x, standardized

residual) pairs is a standardized residual plot. Residual and standardized residual plots

typically exhibit the same general shapes. If you are using a computer package or graphing

calculator that calculates standardized residuals, the standardized residual plot is recommended. If not, it is acceptable to use the unstandardized residual plot instead.

A standardized residual plot or a residual plot is often helpful in identifying unusual

or highly influential observations and in checking for violations of model assumptions. A

desirable plot is one that exhibits no particular pattern (such as curvature or a much greater

spread in one part of the plot than in another) and that has no point that is far removed from

all the others. A point in the residual plot falling far above or far below the horizontal line

at height 0 corresponds to a large residual, which can indicate unusual behavior, such as a

recording error, a nonstandard experimental condition, or an atypical experimental subject.

A point with an x value that differs greatly from others in the data set could have exerted

excessive influence in determining the estimated regression line.

A standardized residual plot such as the one pictured in Figure 16.15(a) is desirable, because no point lies much outside the horizontal band between −2 and 2 (so there is no

unusually large residual corresponding to an outlying observation). There is no point far to

the left or right of the others (which could indicate an observation that might greatly influence the estimated line), and there is no pattern to indicate that the model should somehow

be modified. When the plot has the appearance of Figure 16.15(b), the fitted model should

be changed to incorporate curvature (a nonlinear model).

The increasing spread from left to right in Figure 16.15(c) suggests that the

variance of y is not the same at each x value but rather increases with x. A straight-line model may still be appropriate, but the best-fit line should be obtained by using

weighted least squares rather than ordinary least squares. This involves giving more

weight to observations in the region exhibiting low variability and less weight to

observations in the region exhibiting high variability. A specialized regression analysis

textbook or a statistician should be consulted for more information on using weighted

least squares.
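The idea of weighting can be illustrated with a short sketch. It assumes Python with NumPy, uses made-up data in which the spread of y grows with x, and takes the weights to be 1/x² (appropriate only if the standard deviation of y is roughly proportional to x, which is an assumption made purely for illustration). The function name `weighted_least_squares` is hypothetical.

```python
import numpy as np

def weighted_least_squares(x, y, w):
    """Intercept and slope minimizing sum of w * (y - (a + b*x))**2."""
    x, y, w = (np.asarray(v, dtype=float) for v in (x, y, w))
    x_bar_w = np.sum(w * x) / np.sum(w)   # weighted mean of x
    y_bar_w = np.sum(w * y) / np.sum(w)   # weighted mean of y
    b = (np.sum(w * (x - x_bar_w) * (y - y_bar_w))
         / np.sum(w * (x - x_bar_w) ** 2))
    a = y_bar_w - b * x_bar_w
    return a, b

# Made-up data: variability in y increases from left to right.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 6.3, 7.6, 10.9, 11.2, 15.8, 14.1])

# Equal weights reproduce the ordinary least squares line.
a_ols, b_ols = weighted_least_squares(x, y, np.ones_like(x))

# Larger weights where variability is low (small x), smaller where it is high.
a_wls, b_wls = weighted_least_squares(x, y, 1 / x ** 2)

print(f"ordinary: y-hat = {a_ols:.3f} + {b_ols:.3f} x")
print(f"weighted: y-hat = {a_wls:.3f} + {b_wls:.3f} x")
```

In practice the weights are rarely known exactly, which is one reason the text recommends consulting a specialized reference before relying on a weighted fit.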

The standardized residual plots of Figures 16.15(d) and 16.15(e) show an outlier (a

point with a large standardized residual) and a potentially influential observation, respectively. Consider deleting the observation corresponding to such a point from the data set

and refitting a line. Substantial changes in estimates and various other quantities are a

signal that a more careful analysis should be carried out before proceeding.


Notice the difference between an outlier (an observation that is far removed

from the other observations in the y direction) and

a potentially influential observation (an observation

that is far removed from

the other observations in

the x direction).

Figure 16.15 Standardized residual plots: (a) satisfactory plot; (b) plot suggesting that a curvilinear regression model is needed; (c) plot indicating nonconstant variance; (d) plot showing a large residual; (e) plot showing a potentially influential observation.

The article "Snow Cover and Temperature Relationships in North America and Eurasia" (Journal of Climate and Applied Meteorology [1983]: 460–469) explored the relationship between October–November continental snow cover (x, in millions of square kilometers) and December–February temperature (y, in °C). The following data refer to Eurasia during the n = 13 time periods (1969–1970, 1970–1971, ..., 1981–1982):

    x        y     Standardized Residual
  13.00   −13.5          −0.11
  12.75   −15.7          −2.19
  16.70   −15.5          −0.36
  18.85   −14.7           1.23
  16.60   −16.1          −0.91
  15.35   −14.6          −0.12
  13.90   −13.4           0.34
  22.40   −18.9          −1.54
  16.20   −14.8           0.04
  16.70   −13.6           1.25
  13.65   −14.0          −0.28
  13.90   −12.0           1.54
  14.75   −13.5           0.58

A simple linear regression analysis described in the article included r² = 0.52 and r = −0.72, suggesting a significant linear relationship. This is confirmed by a model utility test. The scatterplot and standardized residual plot are displayed in Figure 16.16. There are no unusual patterns, although one standardized residual, −2.19, is a bit on the large side. The most interesting feature is the observation (22.40, −18.9), corresponding to a point far to the right of the others in these plots. This observation may have had a substantial influence on the estimated regression line. The estimated slope when all 13 observations are included is b = −0.459, and s_b = 0.133. When the potentially influential observation is deleted, the estimate of β based on the remaining 12 observations is b = −0.228. The change in slope is

\[
\begin{aligned}
\text{change in slope} &= \text{original } b - \text{new } b \\
&= -0.459 - (-0.228) \\
&= -0.231
\end{aligned}
\]

The change expressed in standard deviations is −0.231/0.133 = −1.74. Because b has changed by substantially more than 1 standard deviation, the observation under consideration appears to be highly influential.
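The influence calculation is simple arithmetic, and it can be written as a short sketch (Python assumed; the variable names are illustrative) using the slope estimates and standard deviation reported for the snow cover data.

```python
# Values reported in the example for the snow cover data
b_all = -0.459      # slope using all 13 observations
s_b = 0.133         # estimated standard deviation of b (all 13 observations)
b_without = -0.228  # slope after deleting the observation (22.40, -18.9)

change = b_all - b_without      # change in slope
change_in_sd = change / s_b     # change expressed in standard deviations of b

print(f"change in slope = {change:.3f}")
print(f"change in standard deviations = {change_in_sd:.2f}")
```

A change of more than about 1 standard deviation in magnitude, as here, is the signal the example uses to label the deleted observation as highly influential.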

Figure 16.16 Plots for the data of Example 16.7: (a) scatterplot of TEMP versus SNOW; (b) standardized residual plot (STRESID versus SNOW), with the potentially influential observation labeled.

In addition, r² based just on the 12 observations is only 0.13, and the t ratio for testing β = 0 is not significant. Evidence for a linear relationship is much less conclusive in light

of this analysis. The investigators should seek a climatological explanation for the influential observation and collect more data, which could be used to find a more useful model.


The paper Physiological Characteristics and Performance of Top U.S. Biathletes (Medicine

and Science in Sports and Exercise [195]: 1302–1310) describes a study of the relationship

between cardiovascular fitness (as measured by time to exhaustion running on a treadmill)

and performance on a 20-kilometer ski race. Data on

x = treadmill time to exhaustion (in minutes)

and

y = 20-km ski time (in minutes)

for 11 athletes are shown in Table 16.2. Standardized residuals and residuals are also given. Is it reasonable to use the given data to construct a confidence interval or test hypotheses about β, the average change in ski time associated with a 1-min increase in treadmill time? It depends on whether the assumptions that the distribution of the deviations from the population regression line at any fixed x is approximately normal and that the variance of this distribution does not depend on x are reasonable. Constructing a normal probability plot of the standardized residuals and a standardized residual plot will provide insight into whether these assumptions are in fact reasonable.

Don't forget to check assumptions. If you are used to checking assumptions before doing much in the way of calculation, it is sometimes easy to forget to check them in a regression setting. Be sure to step back and think about whether the four basic assumptions of the linear regression model are reasonable before making inferences about the population slope or using the estimated model to make predictions.

Figure 16.17 (a) Normal probability plot of standardized residuals; (b) standardized residual plot

Table 16.2 Data, Residuals, and Standardized Residuals for Example 16.8

Observation   Treadmill   Ski Time   Residual   Standardized
                                                  Residual
     1            7.7       71.0       0.172        0.10
     2            8.4       71.4       2.206        1.13
     3            8.7       65.0      −3.494       −1.74
     4            9.0       68.7       0.906        0.44
     5            9.6       64.4      −1.994       −0.96
     6            9.6       69.4       3.006        1.44
     7           10.0       63.0      −2.461       −1.18
     8           10.2       64.6      −0.394       −0.19
     9           10.4       66.9       2.373        1.16
    10           11.0       62.6      −0.527       −0.27
    11           11.7       61.7       0.206        0.12

Figure 16.17 shows a normal probability plot of the standardized residuals and a standardized residual plot. The normal probability plot is quite straight, and the standardized

residual plot does not show evidence of any patterns or of increasing spread.


The article Appropriate Placement of Intubation Depth Marks in a New Cuffed,

Paediatric Tracheal Tube (British Journal of Anaesthesia [2004]: 80-87) describes a study

of the use of tracheal tubes in newborns and infants. Newborns and infants have small

trachea, and there is little margin for error when inserting tracheal tubes. Using X-rays

of a large number of children aged 2 months to 14 years, the researchers examined

the relationships between appropriate trachea tube insertion depth and other variables

such as height, weight, and age. A scatterplot and a standardized residual plot constructed using data on the insertion depth and height of the children (both measured in

cm) are shown in Figure 16.18.

Figure 16.18 (a) Scatterplot for the insertion depth vs. height data of Example 16.9; (b) standardized residual plot.

Figure 16.19 (a) Scatterplot for the insertion depth vs. weight data of Example 16.9; (b) standardized residual plot.

Residual plots like the ones pictured in Figure 16.18(b) are desirable. No point lies

much outside the horizontal band between −2 and 2 (so there are no unusually large

residuals corresponding to outliers). There is no point far to the left or right of the others

(no observation that might be influential), and there is no pattern of curvature or differences in the variability of the residuals for different height values to indicate that the model

assumptions are not reasonable.

But consider what happens when the relationship between insertion depth and weight is

examined. A scatterplot of insertion depth and weight (kg) is shown in Figure 16.19(a), and a

standardized residual plot in Figure 16.19(b). While some curvature is evident in the original

scatterplot, it is even more clearly visible in the standardized residual plot. A careful inspection

of these plots suggests that along with curvature, the residuals may be more variable at larger

weights. When plots have this curved appearance and increasing variability in the residuals, the

linear regression model is not appropriate.


Treefrogs' search for mating partners was examined in the article "The Cause of Correlations Between Nightly Numbers of Male and Female Barking Treefrogs (Hyla gratiosa) Attending Choruses" (Behavioral Ecology [2002]: 274–281). A lek, in the world of animal behavior, is a cluster of males gathered in a relatively small area to exhibit courtship displays. The female preference hypothesis asserts that females will prefer larger leks over smaller leks, presumably because there are more males to choose from. The scatterplot and residual plot in Figure 16.20 show the relationship between the number of females and the number of males in observed leks of barking treefrogs. You can see that the unequal variance, which is noticeable in the scatterplot, is even more evident in the residual plot. This indicates that the assumptions of the linear regression model are not reasonable in this situation.

[Figure 16.20: (a) scatterplot of Number of females versus Number of males for the data of Example 16.20; (b) residual plot]


Section 16.3 Exercises


Each exercise set assesses the following chapter learning objectives: M2, M7

Exercise Set 1

16.30 The following graphs are based on data from an

experiment to assess the effects of logging on a squirrel

population in British Columbia (Effects of Logging Pattern

Wildlife Management [2007]: 2655–2663). Plots of land,

percentages of logging, and the squirrel population density

for each plot was measured after 3 years. The scatterplot,

residual plot, and a boxplot of the residuals are shown

here.

[Figures for Exercise 16.30: scatterplot, residual plot (Residual versus %Logged), and boxplot of the residuals]

is known to be influenced by body size, latitude, and

average environmental temperature. Researchers gathered data on Gopher tortoises in Okeeheelee County Park

in Florida to further understand the factors that affect

reproduction in these animals (Geographic Variation in

Body and Clutch Size of Gopher Tortoises, Copeia [2007]:

355–363). The scatterplot, residual plot, and a normal

regression line with x = body length and y = clutch size

are shown here.

Does it appear that the assumptions of the simple linear

regression model are plausible? Explain your reasoning in

a few sentences.

[Figures for Exercise 16.31: scatterplot of ClutchSize versus Length (mm), residual plot, and normal probability plot of the residuals]

16.32 Carbon aerosols have been identified as a contributing factor in a number of air quality problems. In a chemical analysis of diesel engine exhaust, x = mass (mg/cm²) and y = elemental carbon (mg/cm²) were recorded (Comparison of Solvent Extraction and Thermal [...] Vehicle Exhaust Aerosol Environmental Science Technology [1984]: 231–234). The estimated regression line for this data set is ŷ = 31 + 0.737x. The accompanying table gives the observed x and y values and the corresponding standardized residuals.

    x       y    St. resid.
  164.2    181      2.52
  161.8    170      1.72
  118.7    106     −1.07
  108.1    102     −0.75
   78.9     86     −0.27
  156.9    156      0.82
  230.9    193     −0.73
  248.8    204     −0.95
   89.4     91     −0.51
  387.8    310     −0.89
  109.8    115      0.27
  106.5    110      0.05
  102.4     98     −0.73
   76.4     97      0.85
  135.0    141      0.91
  111.4    132      1.64
   87.0     96      0.08
   97.6     94     −0.77
   79.7     77     −1.11
   64.2     76     −0.20
   89.4     89     −0.68
  131.7    128      0.00
  100.8     88     −1.49
   82.9     90     −0.18
  117.9    130      1.05


unusually large residuals? Do you think that there are any

influential observations?

b. Is there any pattern in the standardized residual plot that

would indicate that the simple linear regression model is

not appropriate?

c. Based on your plot in Part (a), do you think that it is

reasonable to assume that the variance of y is the same at

each x value? Explain.

16.33 The article Vital Dimensions in Volume Perception:

Can the Eye Fool the Stomach? (Journal of Marketing


baby food, Cheez Whiz, Skippy Peanut Butter, and Ahmeds

tandoori paste, to name a few).

Product   Maximum Width (cm)   Minimum Width (cm)
   1             2.50                 1.80
   2             2.90                 2.70
   3             2.15                 2.00
   4             2.90                 2.60
   5             3.20                 3.15
   6             2.00                 1.80
   7             1.60                 1.50
   8             4.80                 3.80
   9             5.90                 5.00
  10             5.80                 4.75
  11             2.90                 2.80
  12             2.45                 2.10
  13             2.60                 2.20
  14             2.60                 2.60
  15             2.70                 2.60
  16             3.10                 2.90
  17             5.10                 5.10
  18            10.20                10.20
  19             3.50                 3.50
  20             2.70                 1.20
  21             3.00                 1.70
  22             2.70                 1.75
  23             2.50                 1.70
  24             2.40                 1.20
  25             4.40                 1.20
  26             7.50                 7.50
  27             4.25                 4.25

prediction of the maximum width of a food container

based on its minimum width.

b. Calculate the standardized residuals (or just the residuals if you don't have access to a computer program that

gives standardized residuals) and make a residual plot to

determine whether there are any outliers.

c. The data point with the largest residual is for a 1-liter

Coke bottle. Delete this data point and refit the regression. Did deletion of this point result in a large change in

the equation of the estimated regression line?

d. For the regression line of Part (c), interpret the estimated

slope and, if appropriate, the intercept.

e. For the data set with the Coke bottle deleted, do you

think that the assumptions of the simple linear regression

model are reasonable? Give statistical evidence for your

answer.

16.34 Models of climate change predict that global temperatures and precipitation will increase in the next 100

in northern latitudes. Researchers gathered data on the

potential effects of climate change for flowering plants

in Norway. (Climatic Variability, Plant Phenology, and

Northern Ungulates, Ecology [1999]: 1322–1339). The table

below gives data for one flower species. Range of flowering

dates and elevation for different sites in Norway were used

to construct the given scatterplot. A potentially influential

point is indicated on the scatterplot.

[Scatterplot: Bivariate Fit of Flowering Date Range by Elevation, with the potentially influential point indicated]

Flowering Range versus Elevation: Tussilago Farfara

Elevation (Meters Above Sea Level)   Flowering Date Range
              23.3                          33.4
               5.6                          32.0
              55.6                          31.9
             140.0                          31.3
              31.1                          28.1
             112.2                          29.3
             106.7                          28.4
              42.2                          26.6
              75.6                          24.9
             176.7                          25.7
             126.7                          24.7
             126.7                          23.5
             176.7                          23.2
             201.1                          21.8
             133.3                          22.3
              90.0                          21.4
              41.1                          19.7
             125.6                          17.6
             477.8                          17.6

What are the values of a, b, r², and s_e?

b. Fit a linear regression model with the indicated point

omitted. What are the values of a, b, r², and s_e?


Exercise Set 2

16.35 In the study described in Exercise 16.31, the effect

of latitude on mean clutch size was investigated. Data

from various locations in Florida, Georgia, Alabama, and

Mississippi on y = mean clutch size and x = latitude were

measured. The scatterplot, standardized residual plot, and

several graphs of the standardized residuals are shown

below.

Does it appear that the assumptions of the simple linear

regression model are plausible? Explain your reasoning in

a few sentences.


Parts (a) and (b).

d. The researchers could use the estimated regression equation based on all 19 observations to make predictions for

elevations ranging from 0 to 200 meters; or they could

use the estimated regression equation based on the 18

observations (omitting the observation identified by an

arrow) to make predictions for elevations ranging from

0 to 500 meters. Which strategy would you recommend,

and why?


frequency and y = pleasantness rating when nerves were

stimulated by a light brushing stroke on the forearm. The x

values and the corresponding residuals from a simple linear

regression are as follows:

a. Construct a standardized residual plot. Does the plot

exhibit any unusual features?

Firing Frequency, x

23

24

22

25

27

28

34

33

36

34

[Figures: scatterplot of mean clutch size versus latitude; standardized residual plot versus latitude; normal probability plot of the residuals]

16.37 The accompanying scatterplot, based on 34 sediment samples with x = sediment depth (cm) and y = oil and grease content (mg/kg), appeared in the article "Mined Land Reclamation Using Polluted Urban Navigable Waterway Sediments" (Journal of Environmental Quality [1984]: 415-422).

Standardized Residual: -1.83, 0.04, 1.45, 0.20, -1.07, 1.19, -0.24, -0.13, -0.81, 1.17

follows. Based on this plot, do you think it is reasonable

to assume that the error distribution is approximately

normal? Explain.


Discuss the effect that the observation (20, 33,000) will have on

the estimated regression line. If this point were omitted, what

do you think will happen to the slope of the estimated regression line compared to the slope when this point is included?

Unless otherwise noted, all content on this page is Cengage Learning.

[Figure: scatterplot of oil and grease content (mg/kg) versus subsample mean depth (cm)]

16.38 Investigators in northern Alaska periodically monitored radio-collared wolves in 25 wolf packs over 4 years, keeping track of the packs' home ranges ("Population Dynamics and Harvest Characteristics of Wolves in the Central Brooks Range, Alaska," Wildlife Monographs [2008]: 1-25).

its members in a specified amount of time. The investigators

noticed that wolf packs with larger home ranges tended to

be located more often by monitoring equipment. The investigators decided to explore the relationship between home

range and the number of locations per pack. A scatterplot

and standardized residual plot of the data are shown below,

as well as plots of the standardized residuals.

Does it appear that the assumptions of the simple linear

regression model are plausible? Explain your reasoning in

a few sentences.

Additional Exercises

16.39 Carbon aerosols have been identified as a contributing factor in a number of air quality problems. In a chemical analysis of diesel engine exhaust, x = mass (mg/cm²) and y = elemental carbon (mg/cm²) were recorded ("Comparison of Solvent Extraction and Thermal Optical Carbon Analysis Methods: Application to Diesel Vehicle Exhaust Aerosol," Environmental Science & Technology [1984]: 231-234). The estimated regression line for this data set is ŷ = 31 + 0.737x.

A scatterplot of the data and a standardized residual plot are

shown below.

[Figures for Exercise 16.38: scatterplot of home range versus locations/pack; standardized residual plot versus locations/pack; plots of the standardized residuals]

[Figures for Exercise 16.39: "Bivariate Fit of carbon By mass" scatterplot; standardized residual plot versus mass]

that there are any influential observations?

b. Is there any pattern in the standardized residual plot that

would indicate that the simple linear regression model is

not appropriate?

c. Based on the scatterplot and the standardized residual

plot, do you think that it is reasonable to assume that the

variance of y is the same at each x value? Explain.

1322-1339). The table below gives data for one flower spe-

latitude for different sites in Norway is also shown. Two

points that are potentially influential are indicated on the

scatterplot.

[Figure: scatterplot of mean flowering date range versus latitude (N)]

Flowering Range Versus Latitude: Anemone Hepatica

Latitude (N)   Flowering Date Range
58.7           46.1
58.2           35.9
58.2           34.7
59.4           32.3
60.0           33.0
59.4           29.7
59.1           26.9
59.3           26.2
59.5           25.6
59.5           27.6
59.7           19.1
59.8           24.4
60.8           26.2

Latitude (N), continued: 60.9, 63.4, 63.4, 60.5, 60.7, 60.7, 61.1

from a motionless resting position outside its own burrow.

When prey appears on the horizon, within say 20 cm, the

scorpion assumes an alert posture; it determines the angular

position of the prey, makes a quick rotation, and runs after it.

In a recent study of the scorpion's accuracy, the angular position (0 degrees = right in front) of the prey and the turning

angle of the scorpion were recorded for 23 attacks. A simple

regression model relating the response angle of the predator

= a + b(t), was fit.

The resulting residual plot is shown. Describe the locations of

any outliers you see in the residual plot.

[Figure: residual plot, residual versus target angle]

Flowering Date Range, continued (Anemone Hepatica table): 26.8, 28.7, 19.2, 22.5, 17.9, 12.9, 11.8

What are the values of a, b, r², and s_e?

b. Fit a linear regression model with the two observations identified by arrows omitted. What are the values of a, b, r², and s_e?

c. In a few sentences, describe any differences you found in

Parts (a) and (b).

d. The researchers could use the estimated regression

equation based on all 20 observations to make predictions for latitudes ranging from 58 to 64, or they could

use the estimated regression equation based on the 18

observations (omitting the two observations identified by

arrows) to make predictions for latitudes ranging from 58

to 62. Which strategy would you recommend, and why?


temperatures and precipitation will increase in the next

100 years, with the largest changes occurring during

winter in northern latitudes. Researchers recently gathered data on the potential effects of climate change

for flowering plants in Norway. (Climatic Variability,


most significant factors contributing to gray wolf population

growth. The causes of early pup mortality are unknown, and

difficult to observe. The pups are concealed within their dens

for 3 weeks after birth, and after they emerge it is difficult to

confirm their parentage. Researchers recently used portable

ultrasound equipment to investigate some factors related to

reproduction ("Diagnosing Pregnancy, in Utero Litter Size, and

Fetal Growth with Ultrasound in Wild, Free-Ranging Wolves,"

Journal of Mammalogy [2006]: 85-92).

fetus (in cm, measured from crown to rump) and gestational age (in days) is shown below. Identify the point

that has the largest residual by giving its approximate

coordinates.

[Figure: scatterplot of crown-rump length (cm) versus gestational age (days)]

Growth Rate of Tamarix as an Indication of Lake Boundary

Fluctuations at Sebkhet Kelbia, Tunisia (Journal of Arid

Environments [1982]: 4351) used a simple linear regres-

(average width in centimeters of the last two annual rings)

and x = stem density (stems/m²). The estimated model was

based on the following data. Also given are the standardized

residuals.

x     y      St. resid.     x     y      St. resid.
4     0.75   -0.28          15    0.55    0.24
5     1.20    1.92          15    0.00   -2.05
6     0.55   -0.90          19    0.35   -0.12
9     0.60   -0.28          21    0.45    0.60
14    0.65    0.54          22    0.40    0.52

regression model to be appropriate?

b. Construct a normal probability plot of the standardized

residuals. Does the assumption that the random deviation

distribution is normal appear to be reasonable? Explain.

c. Construct a standardized residual plot. Are there any

unusually large residuals?

d. Is there anything about the standardized residual plot

that would cause you to question the use of the simple

linear regression model to describe the relationship

between x and y?


All chapter learning objectives are assessed in these exercises. The learning objectives assessed

in each exercise are given in parentheses.

16.44 (C1)

Describe what distinguishes a deterministic model from a

probabilistic model.

16.45 (C2)

In the context of the simple linear regression model, explain the difference between α and a, between β and b, and between σ_e and s_e.

16.46 (M1)

The SAT and ACT exams are often used to predict a student's first-term college grade point average (GPA). Different formulas are used for different colleges and majors. Suppose that a student is applying to State U with an intended major in civil engineering. Also suppose that for this college and this major, the following model is used to predict first-term GPA.

GPA = α + β(ACT)

α = 0.5

β = 0.1

a. In this context, what would be the appropriate interpretation of α?

b. In this context, what would be the appropriate interpretation of β?
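To make the roles of the two parameters concrete, the following minimal sketch evaluates the model's mean first-term GPA at a few ACT scores. The intercept 0.5 and slope 0.1 are the values given in the exercise; the function name `mean_gpa` and the ACT scores are illustrative assumptions.

```python
# Sketch only: evaluates the deterministic part of the model GPA = 0.5 + 0.1(ACT).
# mean_gpa and the ACT scores below are hypothetical, chosen for illustration.

INTERCEPT = 0.5  # value given in the exercise
SLOPE = 0.1      # value given in the exercise


def mean_gpa(act: float) -> float:
    """Mean first-term GPA that the model associates with an ACT score."""
    return INTERCEPT + SLOPE * act


for act in (20, 21, 30):
    print(f"ACT = {act}: mean GPA = {mean_gpa(act):.1f}")

# Raising the ACT score by 1 point changes the mean GPA by the slope.
print(f"change per ACT point: {mean_gpa(21) - mean_gpa(20):.1f}")
```

Comparing the values at ACT = 20 and ACT = 21 shows the slope acting as a change in mean GPA per 1-point change in ACT score, while the intercept is the model's value at ACT = 0, a score well outside any realistic range.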

16.47 (M2)

Theropods were carnivorous dinosaurs, characterized by short forelimbs, living in the Jurassic and Cretaceous periods. (Tyrannosaurus rex is classified as a theropod.) What scientists know about theropods is based on studying incomplete skeletal remains. In a study described in the paper "My Theropod Is Bigger than Yours ... or Not: Estimating Body Size from Skull Length in Theropods" (Journal of Vertebrate


skeletons to develop a model describing the relationship

between body length and skull length. JMP was used to

produce the following graphical displays and computer

output. When you evaluate the fit of an estimated regression

line, all of the information below is considered as a whole.

However, the summary statistics in the computer output and

the different plots each convey some specific information.

a. Using only the scatterplot, do you think a linear model

does a good job of describing the relationship? Explain

why or why not.

b. Using only the residual plot, what can you determine

about whether the basic assumptions of the linear

regression model are met?

c. Using only the normal probability plot and boxplot of

the residuals, what can you determine about whether the

basic assumptions of the linear regression model are met?

d. Using only the values of r² and s_e, what can you say about

the quality of the fit of the linear model for these data?

[Figure: residual plot, residuals versus SkullLength]

[Figure: normal probability plot and boxplot of the residuals]

[Figure: scatterplot of BodyLength versus SkullLength with fitted line]

Linear Fit
BodyLength = 0.7061088 + 7.791973*SkullLength

Summary of Fit
RSquare                      0.953929
RSquare Adj                  0.951218
Root Mean Square Error       0.801042
Mean of Response             5.859474
Observations (or Sum Wgts)   19

Parameter Estimates
Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     0.7061088   0.330485    2.14      0.0475*
SkullLength   7.791973    0.415318    18.76     <.0001*

16.48 (M3)

There are four basic assumptions necessary for making inferences about β, the slope of the population regression line.

a. What are the four assumptions?

b. Which assumptions can be checked using sample data?

c. What statistics or graphs would be used to check each of the assumptions you listed in Part (b)?

Ruffed grouse are a species of bird that nests on the ground. Because of this, chick survival at night in the first few weeks of life depends on avoiding predators. Biologists have theorized that protection from predators might be supplied by the mother hen's choice of brooding sites. One variable that biologists thought might be related to survival is the density of vegetation in the vicinity of the nest. Dense vegetation would possibly reduce the ability of predators to detect the nests. The paper Nocturnal Roost

Ornithology [2005]: 168-174) describes a study in which

of chicks surviving / number of eggs hatched) in 23 nests in different vegetation densities (thousands of stems/hectare). Computer output (from JMP) is shown below.

Linear Fit
BroodSurvival = 0.9468008 - 0.0261902*StemDensity

Summary of Fit
RSquare                      0.193788
RSquare Adj                  0.155397
Root Mean Square Error       0.287538
Mean of Response             0.436043
Observations (or Sum Wgts)   23

Parameter Estimates
Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     0.9468008   0.235108    4.03      0.0006*
StemDensity   -0.02619    0.011657    -2.25     0.0355*

a. Is there convincing evidence of a useful linear relationship between brood survival and stem density? Explain.

b. Would you describe the relationship as strong? Why or why not?

c. Construct a 95% confidence interval for β and interpret it in context.

d. What margin of error is associated with the confidence interval in part (c)?

16.50 (M7)

Researchers in Hawaii have recently documented a large increase in the prevalence of a bird parasite known as chewing lice ("Explosive Increase in Ectoparasites in Hawaiian Forest Birds," The Journal of Parasitology [2008]: 1009-1021). Current data suggest that the prevalence of chewing lice may be less for bird species with a high degree of bill overhang. A species is said to have bill overhang when the upper bill extends downward in front of the end of the lower bill. The following scatterplot shows the relationship between the prevalence of chewing lice and bill overhang for 8 bird species in the Hawaiian Islands. A residual plot is also shown. Use these plots to identify any outliers or potentially influential observations. For each point you identify, assess its influence on the estimated slope of the regression line.

[Figures: scatterplot of lice prevalence versus bill overhang; residual plot versus bill overhang]
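For the brood survival output (StemDensity as the predictor), the arithmetic behind a confidence interval for the slope can be sketched as b ± (t critical value)(s_b). The estimate and standard error below are read from the JMP output; the critical value 2.080 is an assumed t-table entry for 95% confidence with n - 2 = 21 degrees of freedom, and the function name `slope_interval` is hypothetical.

```python
# Sketch: confidence interval for a slope, b +/- (t critical value) * s_b.
# B and S_B come from the JMP Parameter Estimates row for StemDensity.
# T_CRIT = 2.080 is an assumed table value (95% confidence, 21 df).

N = 23            # number of nests
B = -0.0261902    # estimated slope
S_B = 0.011657    # estimated standard error of the slope
T_CRIT = 2.080    # assumption: taken from a t table, not computed here


def slope_interval(b, s_b, t_crit):
    """Return (lower, upper, margin) for the interval b +/- t_crit * s_b."""
    margin = t_crit * s_b
    return b - margin, b + margin, margin


df = N - 2
lower, upper, margin = slope_interval(B, S_B, T_CRIT)
print(f"df = {df}")
print(f"margin of error = {margin:.5f}")
print(f"interval = ({lower:.5f}, {upper:.5f})")
```

The margin of error is the half-width of the interval, so it depends only on the critical value and the standard error, not on the estimate itself.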

16.51 (M6)

Suppose you are given the computer output shown. You want to test the hypothesis β = 1.0. Describe how you would use the computer output to test this hypothesis.

Linear Fit
y = 5.6452776 + 0.9797401*x

Summary of Fit
RSquare                      0.985289
RSquare Adj                  0.984954
Root Mean Square Error       12.48525
Mean of Response             0.791304
Observations (or Sum Wgts)   46

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   5.6452776   1.84302     3.06      0.0037*
x           0.9797401   0.018048    54.29     <.0001*
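The t Ratio printed in the output is the estimate divided by its standard error, so it tests a hypothesized slope of 0. A minimal sketch of the corresponding statistic for a hypothesized slope of 1.0 is given below; the function name `slope_t_statistic` is hypothetical, and the numbers are read from the output above.

```python
# Sketch: t statistic for H0: slope = hypothesized value,
# t = (b - hypothesized value) / s_b, with n - 2 degrees of freedom.

N = 46            # Observations (or Sum Wgts) in the output
B = 0.9797401     # estimated slope
S_B = 0.018048    # standard error of the slope


def slope_t_statistic(b, s_b, hypothesized):
    """Standardize the estimated slope against a hypothesized value."""
    return (b - hypothesized) / s_b


t_vs_zero = slope_t_statistic(B, S_B, 0.0)  # matches the printed t Ratio
t_vs_one = slope_t_statistic(B, S_B, 1.0)   # statistic for a slope of 1.0
df = N - 2

print(f"t for slope 0: {t_vs_zero:.2f}")
print(f"t for slope 1: {t_vs_one:.2f} with {df} df")
```

The second statistic would then be compared with a t distribution with 44 degrees of freedom; the P-values printed in the output do not apply to it.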

Technology Notes

Regression Test

TI-83/84

1. Enter the data for the independent variable into L1 (In order

to access lists press the STAT key, highlight the option called

Edit then press ENTER)

3. Press STAT

4. Highlight TESTS

5. Highlight LinRegTTest and press ENTER

6. Next to β & ρ select the appropriate alternative hypothesis

7. Highlight Calculate


TI-Nspire

1. Enter the data into two separate data lists (In order to access

data lists select the spreadsheet option and press enter)

Note: Be sure to title the lists by selecting the top row of the

column and typing a title.

2. Press the menu key and select 4:Statistics then 4:Stat

Tests then A:Linear Reg t Test and press enter

3. In the box next to X List choose the list title where you

stored your independent data from the drop-down menu

4. In the box next to Y List choose the list title where you

stored your dependent data from the drop-down menu

5. In the box next to Alternate Hyp choose the appropriate

alternative hypothesis from the drop-down menu

6. Press OK

JMP

1. Input the data for the dependent variable into the first

column

2. Input the data for the independent variable into the second

column

3. Click Analyze and select Fit Y by X

4. Select the dependent variable (Y) from the box under Select

Columns and click on Y, Response

5. Select the independent variable (X) from the box under

Select Columns and click on X, Factor

6. Click the red arrow next to Bivariate Fit of and select

Fit Line

MINITAB

1. Input the data for the dependent variable into the first

column

2. Input the data for the independent variable into the second

column

3. Select Stat then Regression then Regression

4. Highlight the name of the column containing the dependent

variable and click Select

5. Highlight the name of the column containing the independent variable and click Select

6. Click OK


the t-test results for the regression analysis.

SPSS

1. Input the data for the dependent variable into one column

2. Input the data for the independent variable into a second

column

3. Click Analyze then click Regression then click Linear

4. Select the name of the dependent variable and click the

arrow to move the variable to the box under Dependent:

5. Select the name of the independent variable and click the

arrow to move the variable to the box under Independent(s):

6. Click OK

Note: The p-value for the regression test can be found in the

Coefficients table in the row with the independent variable name.

Excel

1. Input the data for the dependent variable into the first column

2. Input the data for the independent variable into the second

column

3. Select Analyze then choose Regression then choose

Linear

4. Highlight the name of the column containing the dependent

variable

5. Click the arrow button next to the Dependent box to move

the variable to this box

6. Highlight the name of the column containing the independent variable

7. Click the arrow button next to the Independent box to move

the variable to this box

8. Click OK

Note: The test statistic and p-value for the regression test for the

slope can be found in the third table of output. These values are

listed in the row titled with the independent variable name and

the columns entitled t Stat and P-value.

AP* Review Questions for Chapter 16

Use the following information for questions 1-6.

A study was carried out to investigate the relationship between x = the number of components needing repair and y = the time of the service call (in minutes) for a computer repair company. The number of components and the service time for a random sample of 20 service calls were used to fit a simple linear regression model. Partial computer output is shown below.

The regression equation is
Time = 37.2 + 9.97 Number

Predictor   Coef     SE Coef   T       P
Constant    37.213   7.985     4.66    0.000
Number      9.9695   0.7218    13.81   0.000

S = 18.7534   R-Sq = 89.7%   R-Sq(adj) = 89.2%
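As a reading aid for output of this kind, the sketch below checks two relationships among the printed numbers: each T value is the coefficient divided by its standard error, and in simple linear regression the correlation is the square root of R-Sq carrying the sign of the slope. The function names are hypothetical; the numbers are copied from the output above.

```python
# Sketch: relationships among the entries of the regression output.
import math


def t_ratio(coef, se_coef):
    """T column: coefficient divided by its standard error."""
    return coef / se_coef


def correlation_from_r_sq(r_sq, slope):
    """In simple linear regression, r has the sign of the slope and r**2 = R-Sq."""
    return math.copysign(math.sqrt(r_sq), slope)


t_constant = t_ratio(37.213, 7.985)
t_number = t_ratio(9.9695, 0.7218)
r = correlation_from_r_sq(0.897, 9.9695)

print(f"T for Constant: {t_constant:.2f}")
print(f"T for Number:   {t_number:.2f}")
print(f"correlation r:  {r:.3f}")
```

Note that the correlation and R-Sq are different quantities, which matters when interpreting the value 89.7%.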

1. Which of the following statements is a correct interpretation of the value 9.97?

(A) The average number of components needing repair

goes up 9.97 for each 1 minute increase in the service time of a call.

(B) On average, the service call time goes up 9.97 minutes

for each additional component needing repair.

(C) The service call time is 9.97 minutes when there

are 0 components to repair.

(D) Approximately 9.97% of the observed variation in

the service call times can be explained by the linear

relationship between service time and number of

components requiring repair.

(E) If this regression equation is used to predict service

call times, we can expect predictions to be within

9.97 minutes of the actual time.

2. Which of the following statements is a correct interpretation of the value 89.7%?

(A) On average, the service call time goes up 89.7 minutes

for each additional component needing repair.

(B) The magnitude of a typical difference between an

observed service call time and the service call time

predicted by the linear model is approximately

89.7 minutes.

(C) The correlation between service call time and number of components needing repair is 89.7%.

(D) Approximately 89.7% of the observed variation in

service call time can be explained by the linear relationship between service call time and number of

components needing repair.

(E) If this regression equation is used to predict service

call times, we can expect predictions to be within

89.7 minutes of the actual time.

3. The value of s_e is 18.75. Which of the following is an

appropriate interpretation of this value?

(A) 18.75% of the variability in service time can be explained by the linear relationship between service

call time and number of components needing repair.

(B) There is a positive correlation between service call

time and number of components needing repair.

(C) For every 1-component increase in the number of

components needing repair, the predicted service

call time increases by about 18.75 minutes.

(D) The magnitude of a typical difference between an observed service time and the service call time predicted

by the linear model is approximately 18.75 minutes.

(E) The average service call time is 18.75 minutes.

4. The value of s_e is 18.75. If the assumptions of the simple linear regression model are satisfied, which of the following is correct?

(A) The width of a 95% confidence interval for the slope of the population regression line is 2(18.75) = 37.50.

(B) It would be unlikely that a prediction based on the regression line will be greater than 18.75 minutes.

(C) It would be unlikely that a prediction based on the regression line will differ from the actual value by more than 2(18.75) = 37.50 minutes.

(D) Errors associated with predictions based on the regression line will always be less than 18.75 minutes.

(E) The value of s_e does not provide any information about the anticipated magnitude of prediction errors.

5. Which of the following is a 95% confidence interval for the change in service time associated with a 1-unit increase in the number of components needing repair?

(A) 37.21 ± (1.96)(7.985)

(B) 37.21 ± (2.910)(7.985)

(C) 9.97 ± (1.96)(0.7218)

(D) 9.97 ± (2.10)(0.7218)

(E) 9.97 ± 2(18.7534)

6. If the basic assumptions of the simple linear regression

model are reasonable, what conclusion should be

reached regarding model utility if a significance level of

0.05 is used for the model utility test?

(A) There is convincing evidence of a negative linear

relationship between service call time and number

of components needing repair.

(B) There is convincing evidence that the model is not

useful for predicting service call time.

(C) There is convincing evidence that the model is useful for predicting service call time.

(D) There is not convincing evidence that the model is

useful for predicting service call time.

(E) A conclusion cannot be reached based on the given

information.

AP* and the Advanced Placement Program are registered trademarks of the College Entrance Examination Board, which was not involved in the production of, and does not endorse, this product.


variables x and y, which of the following must be true

of β, the slope of the population regression line?

(A) β < 0

(B) β > 0

(C) β = 0

(D) β > 1

(E) -1 < β < 1

a linear regression. Which of these plots indicates that the relationship between the two variables used to fit the line may not be linear?

[Figures: three standardized residual plots (labeled I, II, and III in the answer choices)]

(A) I only
(B) II only
(C) III only
(D) I and III only
(E) II and III only

[Figure: scatterplot of y with labeled points]

9. Which of the labeled points would have the largest residual when a linear model is fit to the data?

(A) A
(B) B
(C) C
(D) D
(E) Both C and D

10. Which of the labeled points corresponds to a potentially influential observation if a linear model is to be fit to the data?

(A) A
(B) B
(C) C
(D) D
(E) Both C and D

and y, what decision will be made in a test of H0: β = 0 versus Ha: β ≠ 0?

(A) Reject H0 and conclude that there is no evidence that the linear model is useful
(B) Reject H0 and conclude that there is evidence that the linear model is useful
(C) Fail to reject H0 and conclude that there is no evidence that the linear model is useful
(D) Fail to reject H0 and conclude that there is evidence that the linear model is useful
(E) Not enough information to say.


and 13.

As part of a study of the swimming speed of sharks, a random sample of 18 lemon sharks (Triakis semifasciata) were

observed in a laboratory sea tunnel. Body lengths and

maximum sustainable swimming speeds (MSSS, reported

in body lengths per second) were measured for each shark.

The computer output from a regression with y = MSSS and

x = body length is given below.

Linear Fit
MSSS = 1.8928955 - 0.0104278*Length

Summary of Fit
RSquare            0.526395
RSquare Adj        0.496794
S                  0.272031
Mean of Response   1.24
N Observations     18

Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model     1   1.3159870        1.31599       17.7834   0.0007*
Error    16   1.1840130        0.07400
Total    17   2.5000000

Parameter Estimates
Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    1.8928955   0.167575    11.30     <.0001*
Length(cm)   -0.010428   0.002473    -4.22     0.0007*

12. For this data set, the model utility test is based on how many degrees of freedom?

(A) 15
(B) 16
(C) 17
(D) 18
(E) 19

13. What is the P-value associated with the model utility test?

(A) 0.0001
(B) 0.0007
(C) 0.07400
(D) 0.167575
(E) 0.526395

made about the random deviation e in a simple linear regression model?

(A) The distribution of e is normal.
(B) The standard deviation of e, σ_e, depends upon the particular value of x.
(C) The mean value of e is 0.
(D) The random deviations, e1, e2, ..., en, associated with different observations are independent of one another.
(E) The standard deviation of e, σ_e, is the same for each x value.

15. The residual plot below indicates that one or more of the assumptions of the linear regression model may not be met. Which of the following is a reasonable conclusion based on this residual plot?

[Figure: standardized residual plot]

model would be more appropriate.
(B) There is evidence that the residuals are not normally distributed.
(C) The slope of the regression line is non-zero.
(D) The correlation between x and y is non-zero.
(E) There is evidence the residuals do not have the same variance for all x values.
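Returning to the shark regression output used for questions 12 and 13, the summary statistics can be recovered from the Analysis of Variance sums of squares. The sketch below is a consistency check under the usual simple linear regression identities; the variable names are hypothetical, and the inputs are copied from the output.

```python
# Sketch: recomputing summary statistics from the ANOVA sums of squares.
import math

N = 18
SS_MODEL = 1.3159870
SS_ERROR = 1.1840130
SS_TOTAL = 2.5000000
SLOPE = -0.010428      # Length(cm) estimate
SE_SLOPE = 0.002473    # its standard error

df_error = N - 2                          # error degrees of freedom
r_squared = SS_MODEL / SS_TOTAL           # RSquare
ms_error = SS_ERROR / df_error            # Mean Square for Error
s = math.sqrt(ms_error)                   # S in the Summary of Fit
f_ratio = (SS_MODEL / 1) / ms_error       # Model has 1 degree of freedom
t_slope = SLOPE / SE_SLOPE                # t Ratio for the slope

print(f"RSquare  = {r_squared:.6f}")
print(f"S        = {s:.6f}")
print(f"F Ratio  = {f_ratio:.4f}")
print(f"t Ratio  = {t_slope:.2f}, t squared = {t_slope ** 2:.2f}")
```

With a single predictor, the square of the slope's t Ratio agrees with the F Ratio up to rounding of the printed estimates, which is why the two tests report the same P-value.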
