
4/21/2008

19 Linear Patterns

FITTING LINES TO DATA
  Least Squares
INTERPRETING THE FITTED LINE
  Interpreting the Intercept
  Interpreting the Slope
PROPERTIES OF RESIDUALS
  Standard Deviation of the Residuals
EXPLAINING VARIATION
  Using the R-Squared Statistic
CONDITIONS FOR SIMPLE REGRESSION
SUMMARY


Many factors affect the price of a commodity, but we can group these factors into two broad categories: fixed costs and variable costs. Fixed costs are present and of constant size regardless of the quantity; variable costs increase with the amount. As an example, let's consider the price charged by a jewelry store for a diamond. The size of the diamond determines a variable cost. The larger the diamond, the higher the price, all other things held fixed. Some diamonds are more desirable than others because of a rare color or particular sparkle. We express a variable cost as a rate, for example as dollars per carat. (A carat is a unit of weight commonly used for gems. One carat is 0.2 grams.) Fixed costs are present regardless of the size of the diamond. Fixed costs include the vague category often called overhead expenses, such as the cost of lighting the store and maintaining a storefront where the diamonds are shown. These expenses also include the cost of maintaining a web site to advertise goods on-line. If we simply take the ratio of the cost of a diamond to its weight, say $2,500 for a 1-carat diamond, then we mix these costs together. The variable cost is not $2,500 per carat unless this jeweler has no fixed costs. We can separate fixed and variable costs by comparing the prices of diamonds of varying size. By studying the relationship between the price and weight in a diverse collection of diamonds, we will be able to separate fixed costs from variable costs and get a better understanding of what determines the cost of a gem. The technique that we will use to estimate fixed and variable costs is known as regression analysis. Regression analysis builds a description of the relationship (dependence, association) between two variables. In this chapter, we'll focus on using regression as a descriptive tool and only hint at the use of regression for making inferences about populations.


Fitting Lines to Data


response

explanatory variable (predictor)

The following scatterplot shows the price in dollars versus the weight in carats of 320 emerald-cut diamonds. We put the variable that we are trying to predict on the y-axis and call it the response. The associated variable on the x-axis is called an explanatory variable or predictor. Explanatory variables have many names: factors, covariates, or even independent variables. (The last name is traditional, but introduces the word independent in a way that has nothing to do with probability.) Unlike correlation (Chapter 6), it matters which variable is the response and which is the explanatory variable.
[Scatterplot: Price ($, using Credit Card) versus Weight (carats), roughly 0.3 to 0.5 carats.]

Figure 19-1. Prices versus weights for emerald-cut diamonds.

Using the terminology of Chapter 6, the scatterplot shows positive, linear association between weight and price. The association is evident, but not terribly strong. The correlation between weight and price is r = 0.66. There's quite a bit of variation around the upward trend; diamonds of the same weight do not all cost the same amount. Other characteristics aside from weight influence the price, such as the cut, clarity, or color of the stone.

Least Squares
If the association between x and y is linear, then we can summarize the relationship between the variables with a line. To identify the line, we use an intercept b0 and a slope b1. The equation of the resulting line would usually be written y = b0 + b1 x in algebra. Because we associate the symbol y with the observed response, we'll write the equation of a line that describes data with a different symbol on the left side:

ŷ = b0 + b1 x

fitted value Estimate of the response based on fitting a line to data.

The hat (or caret) over y identifies ŷ as a fitted value, an estimate of y based on an equation fit to the data. It's there to remind you that the data vary around the line. The line omits details in order to show the overall trend. Expressed using the names of the variables, the equation of this line is

Estimated Price = b0 + b1 Weight

The word Estimated takes the place of the caret.

residual Vertical distance of a point from a line, e = y − (b0 + b1 x) = y − ŷ.

To measure how close the line comes to a point, we use the vertical deviation y − ŷ. We use vertical deviations, rather than the perpendicular deviations used in geometry, because if we use the line to predict y from x, the deviation y − ŷ is the error we'd make. We want these deviations to be as close to zero as possible. These deviations are important in regression analysis and are called residuals.

e = y − ŷ = y − b0 − b1 x

The units of the residuals match the units of the response. In this example, the residuals are measured in dollars. The double-headed arrows in this drawing illustrate two residuals.

[Diagram: a point y1 above the line at x1, with fitted value b0 + b1 x1, and a point y2 below the line at x2, with fitted value b0 + b1 x2.]

Figure 19-2. A residual is the vertical deviation (positive or negative) from the line.

least squares regression Picks the line that minimizes the sum of the squared residuals.

Residuals are positive for points above the line (y1 in Figure 19-2) and negative for points below the line (y2). Each observation defines a residual, and the best-fitting line makes these deviations as small as possible. A single, straight line will not pass through every point in the scatterplot unless there are only 2 points or the points are perfectly aligned (r = 1 or r = −1). To avoid canceling negative and positive residuals, we choose b0 and b1 so that the resulting line minimizes the sum of the squared residuals. This line is called the least squares regression line. Some of the residuals will be positive and others will be negative; the average of the residuals from a least squares regression is zero. (The formulas for b0 and b1 are shown in Under the Hood: The Least Squares Line.) We used a software package to compute the least squares regression line. The slope b1 = 2,670 and intercept b0 = 43. Put together, the line fit to these diamonds is Estimated Price = 43 + 2,670 × Weight
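The least squares coefficients can be computed directly from the deviations of x and y around their means. Here is a minimal sketch in Python; the weights and prices below are made-up numbers for illustration, not the 320 diamonds from the text.

```python
def least_squares(x, y):
    """Intercept b0 and slope b1 that minimize the sum of squared residuals."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar  # the least squares line passes through (xbar, ybar)
    return b0, b1

# Hypothetical weights (carats) and prices (dollars)
weights = [0.30, 0.35, 0.40, 0.45, 0.50]
prices = [880, 1010, 1110, 1260, 1380]
b0, b1 = least_squares(weights, prices)
residuals = [y - (b0 + b1 * x) for x, y in zip(weights, prices)]
```

A key property of the fit, mentioned above, is that the residuals from a least squares regression always average to zero.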

Interpreting the Fitted Line


You should always look at a graph that shows the least squares regression line with the data. The following scatterplot adds the least squares regression line to the plot of the price on the weight for the 320 diamonds.


[Scatterplot: Price ($, using Credit Card) versus Weight (carats), with the least squares line added and arrows marking the estimates at 0.4 and 0.5 carats.]

Figure 19-3. Estimating prices using the least squares line.

The equation of this line estimates, or predicts, the price of a diamond of any weight. For example, if we set the weight to x = 0.4 carats, then the estimated price is (follow the blue arrows in Figure 19-3)

ŷ = 43 + 2,670 × 0.4 = $1,111

The price of one of the diamonds that weigh 0.4 carats is $1,009. Because this diamond costs less than the fitted value, the residual at this point is negative:

e = y − ŷ = $1,009 − $1,111 = −$102

Another diamond that weighs 0.4 carats is priced at $1,251. It costs more than ŷ, so its residual is positive:

y − ŷ = $1,251 − $1,111 = $140
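The arithmetic for these fitted values and residuals can be sketched directly from the fitted line (the function name here is ours, for illustration):

```python
def fitted_price(weight):
    """Fitted value from the chapter's line: Estimated Price = 43 + 2670 * Weight."""
    return 43 + 2670 * weight

y_hat = fitted_price(0.4)   # $1,111
e_low = 1009 - y_hat        # residual for the 0.4-carat diamond priced below the line
e_high = 1251 - y_hat       # residual for the 0.4-carat diamond priced above the line
```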

To estimate the price of a larger diamond, set x = 0.5 carats. The estimated price is $267 higher than that of a 0.4-carat diamond (follow the green arrows in Figure 19-3):

ŷ = 43 + 2,670 × 0.5 = $1,378

Interpreting the Intercept


The interpretation of the intercept and slope of a line fit to data is an essential part of regression modeling. The intercept b0 and the slope b1 are easy to interpret if you are familiar with the data and pay attention to measurement units. The intercept has the measurement units of y. Because the response in this example measures price in dollars, the estimated intercept b0 is not just 43, it's $43.
interpretations of b0
1. Component of y that is present regardless of x.
2. Estimates the average of the response when x = 0.

To interpret b0, think of it as telling you how much of the response is always present, regardless of x. In this example, a jeweler has costs regardless of the size of the gem: storage, labor, and other costs of running the business. The intercept represents the portion of the price that is present regardless of the weight: fixed costs. This equation estimates that fixed costs make up $43 of the selling price of every diamond.


It is common to see a second interpretation of an intercept. If we plug x = 0 into the equation for the line, we are left with b0:

ŷ = b0 + b1 x = b0 if x = 0

Hence, the intercept is the fitted value if x = 0. For the diamonds, if we set the weight to 0, then

Estimated Price = 43 + 2,670 × 0 = $43

Interpreted literally, this equation estimates the cost of a weightless diamond to be $43. That's a naïve way to think about b0 in this case. To see the problem, we've got to expand the plot to show b0. The x-axis in Figure 19-3 runs from 0.3 carats up to 0.5 carats, matching the range of weights. The vertical line at the left of the plot is not the real y-axis. To show the intercept, we've got to extend the x-axis to x = 0. The odd appearance of the next scatterplot shows why software generally does not do this: too much white space.
[Scatterplot: Price ($, using Credit Card) versus Weight (carats) with the x-axis extended to 0; the fitted line meets the y-axis at b0.]

Figure 19-4. Scatterplot showing the intercept.

extrapolation An estimate outside the range of experience provided in the data.

The intercept is the point where the y-axis and the least squares line meet. This point lies far from the data. That's often the case. Unless the range of the explanatory variable includes zero, b0 lies outside the data and is an extrapolation. An extrapolation is an estimate based on conditions unlike those in the data. Equations become less reliable when extended beyond the observations. Saying anything about weightless diamonds lies outside what these data can tell us.

Interpreting the Slope


The interpretation of the slope typically offers more insights than the intercept because the slope describes how differences in the explanatory variable associate with differences in the response. The slope has units of y divided by units of x. The slope in this example converts carats to dollars: b1 = 2,670 dollars per carat. Once you attach units to b1, its meaning should be clear. The slope estimates the variable costs of a diamond. The slope does not mean that a one-carat diamond costs $2,670, even on average. Estimates of price include the intercept (fixed costs), too.


The slope in a regression equation indicates how the average value of y changes as x changes. In essence, the slope in a regression equation compares averages. The slope $2,670/carat compares average prices of diamonds that differ by 1 carat. That's another extrapolation. These data don't include diamonds that weigh more than 0.5 carats, much less diamonds that differ by a full carat. It is more sensible to use the slope to compare the prices of diamonds that differ by, say, one tenth of a carat. For example, the average difference in price between diamonds that weigh 0.40 carats and 0.50 carats is b1 × (0.50 − 0.40) = $2,670 × 0.1 = $267. Only the difference in weight matters because the fixed costs present in both cancel. The difference in average prices between diamonds that weigh 0.40 versus 0.50 carats is the same as the difference in prices between diamonds that weigh 0.30 versus 0.40 carats. It is tempting, but incorrect, to describe the slope as the change in y caused by changing x by 1. For instance, one might say "The price of a diamond goes up by $2,670 for each additional carat increase in weight." The problem with this language is that we cannot change the weight of a diamond to see how its price changes. Instead, we compared diamonds of different weights. Diamonds that have different weights can be different in other ways as well. We have already seen that other factors affect the price. Perhaps the heavier diamonds have nicer colors or better cuts. Such lurking variables would mean that some of the price increase that we have attributed to weight is due to these other factors. The problem of confounding (confusing the effects of explanatory variables as in a two-sample t-test, Chapter 18) happens in regression analysis as well.
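Because the fixed costs cancel, only the difference in weights enters a comparison of average prices. A quick sketch of that arithmetic:

```python
b1 = 2670  # slope of the fitted line, dollars per carat

def avg_price_difference(w1, w2):
    """Estimated difference in average price between diamonds weighing w2 and w1.
    The intercept (fixed costs) cancels, so it never appears here."""
    return b1 * (w2 - w1)

d1 = avg_price_difference(0.40, 0.50)  # about $267
d2 = avg_price_difference(0.30, 0.40)  # same weight difference, same estimate
```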

Example 19.1 Estimating Consumption

Motivation

state the question

Utilities in many communities rely on meter readers who visit homes to read the meters that record consumption of electricity and gas. Unless someone is home to let the meter reader come in, the utility has to estimate the amount of energy used. The utility in this example sells natural gas to homes in the Philadelphia area. Many of these are older homes with the gas meter located in the basement. We can estimate the use of gas in these homes using a regression equation.
Identify x and y. Link b0 and b1 to problem. Describe data. Check for linear association.

Method

describe the data and select an approach

The explanatory variable is the average number of degrees below 65 during the billing period, and the response is the number of hundred cubic feet of gas (CCF) consumed during the billing period (about a month). The explanatory


variable is 0 if the average temperature is above 65° (assuming that the homeowner won't need to heat in this case). The intercept estimates the base level of gas consumption for things unrelated to temperature (such as heating water), and the slope measures the amount of gas used on average per 1° decrease in temperature. For this analysis, the local utility has 4 years of data (n = 48) for an owner-occupied, detached home. Based on the scatterplot, the association appears linear.
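The predictor described above can be sketched as a simple function (temperatures in degrees Fahrenheit; the function name is ours):

```python
def degrees_below_65(avg_temp):
    """Average degrees below 65 during the billing period, floored at zero
    (no heating assumed when the average temperature is above 65)."""
    return max(0.0, 65.0 - avg_temp)

x_cold = degrees_below_65(55)  # a cold month: 10 degrees below 65
x_warm = degrees_below_65(72)  # a warm month: predictor is 0
```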

Mechanics

do the analysis

The fitted least squares line shown in the scatterplot tracks the pattern in the data very closely (r = 0.98). The equation of this line is Estimated Gas (CCF) = 26.7 + 5.7 × Degrees Below 65. The estimated intercept b0 = 26.7 CCF estimates the amount of gas used for things other than heating, and the slope b1 estimates that this homeowner's use of gas for heating increases on average by about 5.7 CCF per 1° drop in average temperature. There's relatively little variation around the fitted line.

Message

summarize the results

The utility can accurately predict gas use for this home and perhaps homes like this one without reading the meter by using the temperature during the billing period. During the summer, the home uses about 27 hundred cubic feet of gas. As the weather gets colder, gas use rises about 6 hundred cubic feet for each degree below 65°. For example, we expect this home to use about 27 + 6 × 10 = 87 CCF of gas during a billing period with average temperature 55°.
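The message rounds the coefficients to 27 and 6; a sketch with the fitted coefficients gives a slightly different prediction:

```python
def estimated_gas_ccf(degrees_below_65):
    """Fitted line from Example 19.1: Estimated Gas (CCF) = 26.7 + 5.7 * x."""
    return 26.7 + 5.7 * degrees_below_65

summer_use = estimated_gas_ccf(0)   # 26.7 CCF: just the intercept
winter_use = estimated_gas_ccf(10)  # 26.7 + 57 = 83.7 CCF at average temperature 55
```

The rounded message (87 CCF) and the fitted line (83.7 CCF) differ by a few CCF, well within the residual variation around the line.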

Are You There?

A manufacturing plant receives orders for customized mechanical parts. The orders vary in size, from about 25 to 150 units. After configuring the production line, a supervisor oversees the production. This scatterplot plots the production time (in hours) versus the number of units for 45 orders.


Figure 19-5. Scatterplot of production time on order size.

The least squares regression line shown in the scatterplot is

Estimated Production Time (Hours) = 2.1 + 0.031 × Number of Units

(a) Interpret the intercept of the estimated line. Is the intercept visible in Figure 19-5?¹
(b) Interpret the slope of the estimated line.²
(c) Using the fitted line, estimate the amount of time needed for an order with 100 units. Is this estimate an extrapolation?³
(d) Based on the fitted line, how much more time does an order with 100 units require over an order with 50 units?⁴

Properties of Residuals
The slope and intercept describe how y is related to x. The residuals show what's left over after we account for this relationship. If a regression equation works well, it should capture the underlying pattern. Only simple random variation that can be summarized in a histogram should be left in the residuals. To see what is left after fitting the line, it is essential to plot the residuals. You can see the residuals in the original scatterplot, but a better plot makes it easier to identify problems by zooming in on these deviations around the least squares regression line. The explanatory variable remains on the x-axis, but the residuals go on the y-axis in place of the response y.

¹ The intercept (2.1 hours) is best interpreted as the time required for all orders, regardless of size (e.g., to set up production). It is not visible; it lies farther to the left and is thus an extrapolation. ² Once production is running, an order takes about 1.9 minutes (60 × 0.031) per unit. ³ An order for 100 units is estimated to take 2.1 + 0.031 × 100 = 5.2 hours. This is not an extrapolation because we have orders for fewer and for more than 100 units. ⁴ 50 more units would need about 0.031 × 50 = 1.55 additional hours.
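The arithmetic behind parts (c) and (d) can be sketched directly from the fitted line:

```python
def production_hours(units):
    """Fitted line from the exercise: Estimated Production Time = 2.1 + 0.031 * units."""
    return 2.1 + 0.031 * units

t_100 = production_hours(100)                         # 5.2 hours
extra = production_hours(100) - production_hours(50)  # about 1.55 additional hours
```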


[Two panels: the scatterplot of Price ($, using Credit Card) versus Weight (carats) with the fitted line (Price vs. Weight), and the corresponding plot of residuals versus Weight (Residual vs. Weight).]

Figure 19-6. Shearing produces a residual plot.

Visually, the residual plot shears the original scatterplot, pulling up on the left and pushing down on the right until the regression line in the scatterplot becomes flat. The horizontal line at zero in the residual plot corresponds to the regression line in the scatterplot. By flattening the line, the plot of the residuals focuses our attention on the deviations around the line rather than the line itself. Do you see a pattern in the residual plot shown in Figure 19-6? If the least squares line captures the underlying relationship, then a scatterplot of residuals versus x should have no pattern. It should stretch out horizontally, with consistent vertical scatter throughout. It should show no bends and ideally lack outliers. In other words, the residuals should show only simple variation that we can summarize in a histogram. You can check for simplicity of the residuals using the visual test for simplicity (see Chapter 6). One of the following scatterplots is the residual plot from Figure 19-1. The other three scramble the weights so that residuals are randomly paired with the weights. Do you recognize the original plot of the residuals? If all of these plots look the same, then there's no apparent pattern in the residual plot.
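The scrambled plots can be generated by randomly permuting the x values before pairing them with the residuals. A sketch with made-up data (in practice you would plot each pairing next to the real one):

```python
import random

def scrambled_pairings(x, residuals, k=3, seed=0):
    """Return k copies of (x, residual) pairs with the x values shuffled,
    to display alongside the true pairing in the visual test for simplicity."""
    rng = random.Random(seed)
    copies = []
    for _ in range(k):
        shuffled = list(x)
        rng.shuffle(shuffled)
        copies.append(list(zip(shuffled, residuals)))
    return copies

weights = [0.30, 0.33, 0.36, 0.40, 0.44, 0.48]  # hypothetical weights (carats)
resid = [-40, 55, -10, 120, -150, 30]           # hypothetical residuals (dollars)
fakes = scrambled_pairings(weights, resid)
```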

[Four residual plots of e versus Weight (carats): one is the actual plot of the residuals on the weights; the other three pair the residuals with scrambled weights.]

Figure 19-7. Simplicity in residuals.

These plots appear similar. The bottom right plot shows the residuals on the weights. It does have a pattern, but it's subtle: the residuals become more variable as the diamonds become larger. Smaller diamonds have more consistent prices than larger diamonds. This pattern is well hidden in the initial scatterplot of the data, but becomes more apparent in the scatterplot of residuals on weights. Just as a microscope reveals unseen organisms, a residual plot can reveal subtleties invisible to the naked eye. (We'll deal with such problems in Chapter 23.)
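One way to quantify this subtle pattern of growing vertical spread is to compare the standard deviation of the residuals for the smaller versus the larger stones. A sketch with made-up residuals (not the diamond data):

```python
import statistics

def spread_by_half(x, residuals):
    """SD of residuals for x at or below the median versus x above it."""
    cutoff = statistics.median(x)
    low = [e for xi, e in zip(x, residuals) if xi <= cutoff]
    high = [e for xi, e in zip(x, residuals) if xi > cutoff]
    return statistics.stdev(low), statistics.stdev(high)

weights = [0.30, 0.32, 0.34, 0.44, 0.46, 0.48]  # hypothetical
resid = [-20, 15, -10, -180, 150, -200]         # more variable for larger stones
sd_small, sd_large = spread_by_half(weights, resid)
```

A much larger spread in the upper half is the kind of pattern the residual plot makes visible.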

Standard Deviation of the Residuals


A regression equation should capture the pattern that relates x to y and leave behind only simple variation in the residuals. If we decide that the residuals are simple enough, we can summarize them in a histogram.
[Residual plot of the diamond data (Residual versus Weight in carats) alongside a histogram of the residuals, with counts on the horizontal axis.]

Figure 19-8. Summarizing residuals in a histogram.

The histogram of the diamond residuals appears reasonably symmetric around 0 and bell-shaped, with a hint of skewness. If the residuals are nearly normal, we can summarize the residual variation with a mean and standard deviation. Because we fit this line using least squares, the


mean of the residuals must be zero. The standard deviation is more interesting because it measures how much the residuals vary around the fitted line. The standard deviation of the residuals goes by many names, such as the standard error of the regression or the root mean squared error (RMSE). The formula used to compute the standard deviation of the residuals is almost the formula used to calculate other standard deviations:

se = √[(e₁² + e₂² + ⋯ + eₙ²)/(n − 2)]

We subtract 2 in the denominator because we use two estimates derived from the data, b0 and b1, to calculate each residual that contributes to se. (See Chapter 15, Under the Hood: Student's t and Degrees of Freedom.) If all of the data were exactly on a line, then se would be zero. Least squares makes se as small as possible, but it seldom can reduce it to zero.

For the diamonds, se = $169. Like the residuals, the units of se match those of the response (dollars in this case). Since these residuals have a bell-shaped distribution around the regression line, the Empirical Rule implies that the prices of about 2/3 of the diamonds are within $169 of the regression line and about 95% of the prices are within $338 of the line. If a jeweler quotes a price for a 0.5-carat diamond that is $400 more than this line predicts, we ought to be surprised.
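The formula for se translates directly into code; the residuals below are illustrative numbers, not the diamond data:

```python
import math

def residual_sd(residuals):
    """se = sqrt(sum(e_i^2) / (n - 2)); we divide by n - 2 because b0 and b1
    were estimated from the same data used to compute the residuals."""
    n = len(residuals)
    return math.sqrt(sum(e * e for e in residuals) / (n - 2))

se = residual_sd([120, -90, 30, -150, 90])
```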

Explaining Variation
A regression line splits each value of the response into two parts, the fitted value and the residual:

y = ŷ + e

The fitted value ŷ represents the portion of y that is related to x, and the residual e represents the effects of other factors. As a percentage, how much of the variation of y belongs with each of these components? The histograms in this figure show the variation in price (left) and the variation in the residuals (right).
[Side-by-side plots of Price ($, using Credit Card) versus Weight (left) and Residuals versus Weight (right), each with a histogram of the vertical axis alongside.]

Figure 19-9. The prices among diamonds (left) vary more than the residuals (right).

There is less variation in the residuals after subtracting away the line. How much less? What proportion of the variation remains in the

residuals? If you had to put a number between 0% and 100% on the fraction of the variation left in the residuals, what would you say? The correlation between x and y determines this percentage. The sample correlation r is confined to the range −1 ≤ r ≤ 1. If we square the correlation, we get a value between 0 and 1. The sign won't matter. The squared correlation r² is exactly the fraction of the variation accounted for by the least squares regression line, and 1 − r² is the fraction of variation left in the residuals. In fact, the least squares regression line is the line determined by the correlation studied in Chapter 6. For the diamonds, r² = 0.66² ≈ 0.434 and 1 − r² ≈ 0.566. Because 0 ≤ r² ≤ 1, this summary is often described on a percentage scale. For example, one might say "The fitted line explains 43.4% of the variation in price." All regression analyses include r-squared as part of the summary of the fitted equation. If r² = 0, the regression line describes none of the variation in the data. In that case, its slope would be zero. If r² = 1 (or 100%), the line represents all of the variation and se = 0.
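The identity that r² equals the fraction of variation captured by the line can be checked numerically; the data below are made up for illustration:

```python
def variance(v):
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.0]

# least squares fit
xbar, ybar = sum(x) / len(x), sum(y) / len(y)
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
     sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar
resid = [b - (b0 + b1 * a) for a, b in zip(x, y)]

# fraction of variation left in the residuals is 1 - r^2
r_squared = 1 - variance(resid) / variance(y)
```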

r-squared Square of the correlation between x and y; the percentage of explained variation.

Using the R-Squared Statistic


r² is a popular measure of a regression because of its intuitive interpretation as a percentage. Its lack of units makes it incomplete, however, as a summary statistic. For example, r² alone does not indicate the size of the typical residual. The standard deviation of the residuals se is useful along with r² because se conveys these units. Since se = $169, we know that fitted prices frequently differ from actual prices by $100 or more. For example, in Example 19.1 (household gas use) r² = 0.955 and se = 16 CCF. The size of r² tells us that the data stick close to the fitted line, but only in a relative sense. se tells us that, appealing to the Empirical Rule, about 2/3 of the data lie within about 16 CCF of the fitted line. Along with the slope and intercept, you should always report both r² and se so that others can judge how well your equation describes the data. There's no fast rule for how large r² must be. The typical size of r² varies greatly across applications. In macroeconomics, we will frequently see regression lines that have r² larger than 90%. With medical studies and surveys, on the other hand, an r² of 30% or less may indicate an important discovery.

tip

Conditions for Simple Regression


Even though we have not assumed that the data are a sample from a population, we need to check three conditions before we summarize the association between two variables with a line. We can check two of these conditions by looking at scatterplots. Because these conditions are easily verified by looking at plots, regression analysis with one explanatory variable and one response is often called simple regression. Once you see the plots, you should immediately know whether a line is a good


summary of the association. (Regression analysis that uses several explanatory variables at once is called multiple regression.) Two conditions tell us whether the least squares line captures the pattern that relates x and y. The straight-enough condition is met if you think the pattern in the scatterplot looks like a line. If the pattern in the scatterplot of y on x does not look straight, stop. If the relationship appears to bend, for example, an alternative equation is needed (Chapter 20). Because residuals magnify deviations from the linear pattern, it's important to examine the residuals as well. The residuals must meet the simple-enough condition. The plot of the residuals on x should have no pattern, allowing you to summarize them with a histogram. Outliers, bending patterns, or isolated clusters indicate problems worth investigating.
What's embarrassing? Unless you always obtain data by running experiments (Chapter 18), you always have the potential for lurking variables. Just make sure there isn't anything obvious that explains the relationship you are describing.

The third condition requires some thinking. The no embarrassing lurking variable condition is met if you cannot easily think of another explanation for the pattern in the plot of y on x. We mentioned this issue when discussing the interpretation of the slope of the fitted line in the scatterplot of the prices of the diamonds versus their weights (Figure 19-3). If the larger diamonds are systematically different from the smaller diamonds in ways other than weight, then it may be that these other differences explain the increase in price better than weight does. This issue is known as confounding in two-sample tests of the difference in means (Chapter 18). The following example illustrates a relationship that is affected by a lurking variable.

Example 19.2 Lease Costs

Motivation

state the question

When auto dealers lease cars, they include the cost of depreciation. They want to be sure that the price of the lease plus the resale value of the returned car earns a profit. How can a dealer anticipate the effect of age on the value of a used car? A manufacturer who leases thousands of cars can take the approach that we first used with the diamonds: group cars of similar ages and see how the price drops as cars age. A small dealer that needs to account for local conditions won't have enough data. They need regression analysis. Let's help a BMW dealer in the Philadelphia area determine the cost due to depreciation. The dealer currently estimates that $4,000 is enough to cover the depreciation per year.

Method

describe the data and select an approach
Identify x and y. Link b0 and b1 to problem. Describe data. Check straight-enough condition.

We will fit a least squares regression with a linear equation to see how the age of a car is related to its resale value. Age in years is the explanatory variable and resale value in dollars is the response. The slope b1 estimates how much, on average, the resale price falls per year; we expect a negative slope. The intercept estimates the value of a new car, one with age 0. We can use it to see if our fit is reasonable. We have prices of 218 used BMWs in the 3-series located in the Philadelphia region. For each car, we have the price (in dollars) and the age in years. (These same data appear in Chapter 1.) We obtained them from web sites advertising certified used BMWs in 2006. This plot shows the data. Straight-enough. Seems okay. The data for one car at age 0 (4 cars from 2006) seems unusual, but we don't have many cars in that model year. No lurking variables. There is an important lurking variable that is associated with both price and age: how far the car has been driven. Older cars are likely to have been driven further than newer models, and we can anticipate that the higher the mileage, the lower the price. The effect of mileage is mixed with the effect of age. Whatever estimate we get for b1, it will combine the measured effect of age with the unknown effect of mileage.

Mechanics

do the analysis

The scatterplot shows a linear relationship. There are no extreme outliers or isolated clusters. After a little rounding of b0 and b1, the fitted equation is (calculated by software) Estimated Price = 39,850 − 2,900 × Age. Interpret the slope and intercept within the context of the problem, assigning units to both. Don't go on until you're comfortable with the interpretation of these estimates. The slope is the annual estimated decrease in the resale value of the car, about $2,900 per year. The intercept estimates the price of used cars from the current model year to be about $39,850. The average selling price of new cars like these is $43,000, so the intercept suggests that a car depreciates about $3,000 when it is driven off the lot. Use r² and se to summarize the fit of your equation. Add units to se. The regression describes r² = 45% of the variation in prices, leaving the rest to other factors (such as different models and options). The residual standard deviation se =


$3,367. That's a lot of residual variation, but we have not taken account of other factors that affect resale value (such as mileage or options). Check the residuals. A plot of the residuals is an essential part of any regression analysis because it is the best check for additional patterns and interesting quirks in the data.
[Figure: plot of residuals (in dollars) versus Age (in years) for the used BMW data.]
Simple enough condition: The residuals cluster into vertical groups for cars of each model year. There is roughly equal scatter at each age. These seem okay.

Message

summarize the results

Our results indicate that used BMW cars (in the 3-series), on average, decline in value by about $2,900 per year. This estimate combines the effect of a car getting older with the effects of other variables, like mileage and damage, that accumulate as a car gets older. Thus, our current lease pricing that charges $4,000 per year appears profitable. We should confirm that fees at the time the lease is signed are adequate to cover the estimated $3,000 depreciation that occurs when the lessee drives the car off the lot.

State the limitations. The model leaves more than half of the variation in prices unexplained, however. Our estimate of resale value could be off by thousands of dollars. Also, our estimates of the depreciation should only be used for short-term leases. As long as leases last 4 years or less, the estimates should be fine. Longer leases would require extrapolation outside the data used to build our model.
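To see how the fitted equation turns into predictions, and why the extrapolation warning matters, here is a minimal sketch using the rounded coefficients from this example. The `predicted_price` helper and its 0-to-4-year guard are our own illustration, not part of the text's analysis:

```python
# Rounded coefficients from the fitted equation in the example.
b0, b1 = 39850, -2900

def predicted_price(age):
    """Predicted resale value; refuses to extrapolate beyond observed ages."""
    if not 0 <= age <= 4:
        raise ValueError("extrapolating outside the observed ages (0-4 years)")
    return b0 + b1 * age

print(predicted_price(3))  # 39,850 - 2,900 * 3 = 31150
```

Asking for the predicted price of, say, a 6-year-old car raises an error here: the data say nothing about whether the linear pattern continues that far.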


Summary
The least-squares regression line summarizes how the average value of the response (y) depends upon the value of an explanatory variable or predictor (x). The fitted value of the response is ŷ = b0 + b1 x. b0 is the intercept and b1 is the slope. The intercept has the same units as the response, and the slope has the units of the response divided by the units of the explanatory variable. The vertical deviations e = y - ŷ from the line are residuals. Least squares determines values for b0 and b1 that minimize the sum of the squared residuals. The r2 statistic tells the percentage of variation in the response that is described by the equation, and the residual standard deviation se gives the scale of the unexplained variation. The linear equation should satisfy the straight-enough, simple-enough, and no embarrassing lurking variables conditions.

Key Terms
condition
  no embarrassing lurking variable, 19-14
  simple-enough, 19-14
  straight-enough, 19-14
explanatory variable, 19-3
extrapolation, 19-6
fitted value, 19-3
least squares regression, 19-4
predictor, 19-3
residual, 19-4
response, 19-3
root mean squared error (RMSE), 19-12
r-squared, 19-13
simple regression, 19-13
standard error of the regression, 19-12

Formulas
Linear equation

  ŷ = b0 + b1 x

b0 is the intercept and b1 is the slope. This equation for a line is sometimes called slope-intercept form.

Fitted value

  ŷ = b0 + b1 x

Residual

  e = y - ŷ = y - b0 - b1 x

Standard deviation of the residuals

  se^2 = Σ_{i=1}^{n} (y_i - b0 - b1 x_i)^2 / (n - 2) = Σ_{i=1}^{n} e_i^2 / (n - 2)

R-squared
Proportion of the variation in y that has been captured by the equation.

  r^2 = corr(y, x)^2
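As a concrete check on these formulas, the sketch below computes se and r2 directly from the definitions, using hypothetical data rather than any of the chapter's examples:

```python
import math

# Hypothetical (x, y) data that lie close to a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least squares slope and intercept
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# se: residual standard deviation, dividing by n - 2
se = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# r^2: squared correlation between x and y
sx = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
sy = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

print(se, r ** 2)
```

For data this close to a line, r2 comes out near 1 and se is small, on the scale of the residuals.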


Best Practices
Always look at the scatterplot. Regression is reliable only if you look at the plot of the response on the explanatory variable. It also helps to add the fitted line to this plot. Plot the residuals versus x to zoom in on deviations from the regression line.

Know the substantive context of your model. Unless you do, there's no way for you to decide whether the slope and intercept make sense. If you cannot interpret the slope and intercept, the model may have serious flaws. Perhaps there's an important lurking factor or extreme outlier. A plot will show you an outlier, but you've got to know the context to identify a lurking variable.

Limit predictions to the range of observed conditions. When you extrapolate the equation outside the data, you're making a bet that the equation keeps going and going. Without data, you won't know if the predictions make sense. A linear equation often does a reasonable job over the range of observed x-values but fails to describe the relationship beyond this range.

Pitfalls
Believing that changing x causes changes in y because the linear model describes the variation in y. A linear equation is closely related to a correlation, and correlation is not causation. Drawing a regression line on a scatterplot might suggest causation, but it doesn't prove it.

Forgetting about lurking variables. With a little imagination, you can think of other explanations for why heavier diamonds cost more than smaller diamonds. Perhaps heavier diamonds also have more desirable colors or better, more precise cuts. That's what it means to have a lurking variable: perhaps it's the lurking variable that produces the higher costs, not your choice of the explanatory variable.

Trusting summaries like r2 without checking the plots. Although r2 measures the strength of the linear equation's ability to describe the response, a high r2 does not demonstrate the appropriateness of the linear equation. For example, the fitted line in the following figure (which is related to Example 19.1) has r2 = 0.90. The line describes most of the variation but misses the effect of warmer weather and underestimates the slope. The relationship is not linear, a topic we will consider in Chapter 20.


Figure 19-10. Gas consumption versus temperature is not linear.

A single outlier, or data that separate into two groups rather than forming a single cloud of points, can produce a large r2 when, in fact, the linear equation is inappropriate. Conversely, a low r2 value may be due to a single outlier: most of the data may fall roughly along a straight line, with the exception of one outlier.
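This pitfall is easy to demonstrate: two well-separated clumps of points with no relationship inside either clump still yield a large r2. A sketch with made-up data:

```python
import math

# Two separated clusters; within each cluster, y is unrelated to x.
xs = [1.0, 1.1, 0.9, 1.0, 9.0, 9.1, 8.9, 9.0]
ys = [2.0, 1.0, 1.5, 2.5, 11.0, 10.0, 10.5, 11.5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

r = sxy / math.sqrt(sxx * syy)
print(round(r ** 2, 3))  # large, even though the "pattern" is just two clumps
```

A scatterplot of these points makes the problem obvious at a glance, which is exactly why the plot comes before the summary statistic.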

About the Data


We obtained the data for emerald-cut diamonds from the web site of McGivern Diamonds in the fall of 2004. These data are not a representative sample of prices of diamonds; they are the population of diamonds (of these sizes and cut) offered for sale on a specific date. The data on prices of used BMW cars were similarly taken from listings of used cars available within 100 miles of Philadelphia during the fall of 2006.

Under the Hood: The Least Squares Line


The least squares line minimizes the sum

  S(b0, b1) = Σ_{i=1}^{n} (y_i - b0 - b1 x_i)^2

Two equations, known as the normal equations, lead to formulas for the optimal values for b0 and b1 that determine the best-fitting line:

  Σ_{i=1}^{n} (y_i - b0 - b1 x_i) = 0
  Σ_{i=1}^{n} (y_i - b0 - b1 x_i) x_i = 0

(These equations have nothing to do with a normal distribution. Normal in this sense means right angles.) After a bit of algebra, the normal equations give these formulas for the slope and intercept:

  b1 = Σ_{i=1}^{n} (y_i - ȳ)(x_i - x̄) / Σ_{i=1}^{n} (x_i - x̄)^2 = cov(x, y) / var(x) = r (s_y / s_x)

and

  b0 = ȳ - b1 x̄


The least squares line matches the line defined by the sample correlation r in Chapter 6. If you remember that the units of the slope are those of y divided by those of x, you can remember the formula for the slope.

The normal equations say two things about the residuals from a least squares regression. With ŷ = b0 + b1 x, the normal equations are

  Σ_{i=1}^{n} (y_i - ŷ_i) = 0    and    Σ_{i=1}^{n} (y_i - ŷ_i) x_i = 0

Written in this way, the normal equations tell you that

1. The mean of the residuals is zero. The deviation y - ŷ is the residual. Since the sum of the residuals is zero, the average residual is zero as well.

2. The residuals are uncorrelated with the explanatory variable. The second equation is the covariance between x and the residuals. Because the covariance is zero, so is the correlation.
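Both facts can be checked numerically. A minimal sketch with hypothetical data:

```python
# Fit a least squares line, then check the two normal-equation facts.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 2.5, 4.0, 4.5, 6.5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Fact 1: residuals sum (and hence average) to zero
sum_resid = sum(residuals)
# Fact 2: residuals are uncorrelated with x (zero cross-product)
sum_resid_x = sum(e * x for e, x in zip(residuals, xs))

print(sum_resid, sum_resid_x)  # both essentially zero
```

Any least squares fit reproduces this: up to floating-point rounding, both sums are exactly zero, no matter what data you start from.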

Software Hints

All computer packages that fit a regression summarize the fitted model with several tables. These tables may be laid out differently from one package to another, but all contain essentially the same information and all include many more numbers than we need at this point. The output for regression from software includes a section that looks something like this (here, the regression of diamond price on weight):

  Summary of Fit: Response = Price
    RSquare                  0.434303
    Root Mean Square Error   168.634
    Mean of Response         1146.653
    Observations             320

  Parameter Estimates
    Term              Estimate     Std Error    t Ratio   Prob>|t|
    Intercept         43.419237    71.23376      0.61      0.5426
    Weight (carats)   2669.8544    170.8713     15.62      <.0001

The slope and intercept coefficient are given in the second table in a column labeled Estimate. Usually the slope is labeled with the name of the x-variable, and the intercept is labeled Intercept or Constant. The regression equation in this table is

  Estimated Price = 43.419237 + 2669.8544 Weight

Statistics packages show more digits for the estimated slope and intercept than you need for interpretation. Ordinarily, you should round the reported numbers. We will learn about the other numbers in the regression table in the coming chapters. For now, you need to be able to find the coefficients b0 and b1 and to locate se and r2: in this output, se is the Root Mean Square Error and r2 is labeled RSquare.


Excel

The right way to do regression with Excel starts with a picture of the data: the scatterplot of Y on X. We talked about how to scatterplot data back in Chapter 6. In the chart wizard, pick the icon that looks like a scatterplot, identify your columns, and add the plot to your spreadsheet. Next, select the scatterplot and follow the menu commands

  Chart > Add Trendline

Pick the option for adding a line, and Excel will add the least squares line to the plot. If you double-click the line in the chart, Excel will show the equation and r2 for the model. These options do not include se.

You can also use formulas in Excel to find the least squares regression. The formula LINEST does most of the work, but it's easier to load and use the Analysis ToolPak that comes with Excel. After you've done that, follow the menu commands

  Tools > Data Analysis

and fill in the dialog with the range for Y and X. (If you don't see Data Analysis in the Tools menu, there was a problem loading this add-in. Check your installation.) By default, Excel fills another sheet in the open workbook with a summary of the regression. The summary includes various plots if you select that option.

Minitab
The procedure resembles constructing a scatterplot of the data. Follow the menu sequence Graph > Scatterplot and choose the icon that adds a line to the plot. After you pick variables for the response and explanatory variable, click OK. Minitab then draws the scatterplot with the least squares line added. The tabular summary of the fitted model appears in the scrolling window that records results of calculations in your current session.

JMP

Following the menu sequence Analyze > Fit Y by X opens the dialog that we used to construct a scatterplot in Chapter 6. Fill in variables for the response and explanatory variable, then click OK. In the window that shows the scatterplot, click on the red triangle above the scatterplot (near the words "Bivariate Fit of ..."). In the pop-up menu, choose the item Fit Line. JMP adds the least squares line to the plot and appends a tabular summary of the model to the output window below the scatterplot.
