
2/7/08

6 Association between Quantitative Variables


Scatterplots
Association
Covariance
Correlation
Summarizing Association with a Line
Nonlinear Patterns
Spurious Correlation
Checklist: Correlation
Correlation Tables
Summary

2/7/2008

6 Numerical Association

Whether the concern is rising prices or global warming, many people who heat their homes with natural gas would like to use less. According to the Department of Energy, the US consumes about 23 trillion cubic feet of natural gas each year, with more than one-fifth going to households. Businesses have noticed the interest in conservation and found ways to use it as a marketing tool. For example, a government analysis found that homeowners could reduce their energy use by up to 30% by upgrading the insulation of their homes (and making other improvements). Ads like this one from Owens Corning advertise possible savings and tax credits.

Saving 30% sounds great, but does it amount to much money? Most consumers would rather hear how much money they'll save. An ad that promises savings in dollars might be more attractive. It is easier to spend $400 for insulation if you know you will save that much in a year. How much money can homeowners who use natural gas save by insulating their homes? Let's see what sort of ad we can design to motivate homeowners to insulate.

The potential savings from insulating a home vary widely. This figure summarizes the annual consumption of 1,658 homes that heat with natural gas (in thousands of cubic feet, abbreviated MCF).

Figure 6-1. Annual use of natural gas per household, in thousands of cubic feet (MCF).

The distribution is right skewed. Values accumulate near the median at 90 MCF and tail off to the right. The considerable variation translates into a wide range of costs. To convert these amounts to dollars, add a zero to the scale: natural gas sells for about $10 per MCF. For instance, the home at the far right used 409 thousand cubic feet. At $10 per MCF, the annual heating bill for this home is $4,090. By comparison, homeowners at the far left spent less than $100.
January Average Temperatures
          Duluth, MN   Miami, FL
High         16.3        75.2
Low          -2.1        59.2
Daily         7.2        67.1

To tell a particular homeowner how much he or she might save, we need to understand why there's so much variation in usage. An obvious explanation is that some people live in cold climates and others in warm climates. To see how much variation is explained by climate, we need to understand the association between heating and climate.


Scatterplots

Heating degree-days (HDD) measure the severity of winter weather. For example, a typical winter in Florida produces 700 HDD whereas a winter in Minnesota produces 9,000. This histogram shows the distribution of the number of HDD experienced by these homes.1

Figure 6-2. Heating degree-days around the US.

The homes on the left are evidently in Florida whereas those at the right appear to be in Minnesota (or other locations as warm or cold).
scatterplot A graph that displays pairs of values as points on a two-dimensional grid.

For climate to explain variation in the use of natural gas, these variables must be associated. Association between numerical variables resembles association between categorical variables (Chapter 5). Two categorical variables are associated if the distribution of one depends on the value of the other. That's also true for numerical variables, but the methods for seeing and quantifying association differ. In place of a contingency table, a scatterplot is the best method for seeing association between numerical variables.

A scatterplot displays pairs of values. In this example, we have 1,658 pairs, one for each home. One element of each pair is the number of heating degree-days and the other is the amount of natural gas. The scales of these variables determine the axes of the scatterplot. The vertical axis of the scatterplot is called the y-axis, and the horizontal axis is the x-axis. The variable that defines the x-axis specifies the horizontal location, and the variable that defines the y-axis specifies the vertical location. Together, a pair of values defines the coordinates of a point, which is usually written as (x, y).

To decide which variable to put on the x-axis and which to put on the y-axis, display the variation that you are trying to explain along the y-axis. We will call the variable on the y-axis the response. The variable that we use to explain variation is the explanatory variable and goes on the x-axis. In this example, the amount of natural gas used is the response, and the number of heating degree-days is the explanatory variable.

tip
response (y) Name for the variable that has the variation we'd like to understand, shown on the y-axis of a scatterplot.
explanatory variable (x) Name of a variable that is to be used to explain variation in the response; placed on the x-axis in scatterplots.

1 To calculate HDD for a day, average the high and low temperatures. If the average is less than 65, the difference between 65 and the average is the number of heating degree-days. For example, if the high temperature is 60 and the low is 40, then the average is 50 and this day has 65 - 50 = 15 HDD. If the average is above 65, HDD = 0. According to this scale, you don't need heat on such days!
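The daily calculation described in this note can be sketched in a few lines of Python. The function name `heating_degree_days` and the three sample days are illustrative, not part of the data. The sketch also assumes that temperatures are in degrees Fahrenheit and that a seasonal total is the sum of the daily values, which the text does not state.

```python
def heating_degree_days(high, low, base=65.0):
    """Heating degree-days for one day from its high and low temperatures."""
    average = (high + low) / 2
    # Days whose average is at or above the base contribute no degree-days.
    return max(0.0, base - average)

# The example from the note: high 60, low 40 gives 65 - 50 = 15 HDD.
example_day = heating_degree_days(60, 40)

# A warm day needs no heat on this scale.
warm_day = heating_degree_days(80, 70)

# Assumed: a seasonal total adds up the daily values (three made-up days).
days = [(60, 40), (80, 70), (30, 10)]
total = sum(heating_degree_days(high, low) for high, low in days)
print(example_day, warm_day, total)
```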


Figure 6-3 shows the completed scatterplot. Each point in the plot represents one of the 1,658 households that make up these data. For example, the point marked with an x represents a household for which (x, y) = (2776 HDD, 229 MCF).

Figure 6-3. Scatterplot with histograms along the axes.

Figure 6-3 adds the histogram of the response at the right and the histogram of the explanatory variable at the top. Histograms show the marginal distributions of the variables in a scatterplot because they summarize the variation in one variable without taking account of the other. These histograms are analogous to the marginal counts of the categorical variables in a contingency table. As with contingency tables, we cannot judge the presence of association from marginal distributions. For a contingency table, we considered patterns in the cells; for a scatterplot, we look for patterns in the display of points.

Association
Is the use of natural gas associated with heating degree-days? Can we explain some of the differences in consumption by taking account of differences in climate? It appears that we can: the points in Figure 6-3 generally shift upward as we look from left to right. This upward drift means that homes in warm climates typically use less natural gas than homes in cold climates.

Before we reach a conclusion, however, we need to decide whether this pattern is real. Are we imagining it? Here's a simple way to decide. Recall how we measured dependence in a table (Chapter 5). Chi-square compares the observed counts to those in an artificial table that forces the variables to be unrelated. The cells of the artificial table remove any association between the variables. We can do the same for a scatterplot. To construct numerical variables that are not associated, pair each value of y with a randomly chosen value of x. If the scatterplot of the randomly paired data looks like the scatterplot of the original data, then there's little or no pattern and no association.

As an example, one of the scatterplots in the next figure is the scatterplot of gas use on heating DD shown in Figure 6-3. The other three randomly pair the number of heating degree days with the amount of gas used. Do you recognize the original?
[Four scatterplot panels, each showing Natural Gas (MCF), from 0 to 400, versus Heating DD, from 0 to 9000.]

Figure 6-4. Do you recognize the original scatterplot?

visual test for simplicity A method for identifying a pattern in a plot of numerical variables. Compare the original scatterplot to others that randomly match the coordinates.

The original scatterplot is at the lower right. Unlike the others, the amount of gas used in this frame appears lower at the left (less in warm climates) and higher at the right (more in cold climates). In the others, the variation in gas use appears the same regardless of the climate. This comparison of the original plot to several artificial plots in which the variables are unrelated is the visual test for simplicity. If you recognize the original, there's a pattern. Otherwise, we say the data are simple because the variability in the response looks the same everywhere. Because the original plot stands out from the artificial plots in Figure 6-4, there's a pattern that indicates association between HDD and the amount of gas used.
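The artificial frames for the visual test can be built by shuffling one variable. The following is a minimal sketch using made-up data with an upward drift rather than the 1,658 homes; the helper name `random_pairing` and the seeds are hypothetical. Each frame could then be handed to whatever plotting tool is available.

```python
import random

def random_pairing(x, y, seed=None):
    """Pair each x with a randomly chosen y, breaking any association."""
    rng = random.Random(seed)
    shuffled = list(y)
    rng.shuffle(shuffled)
    return list(x), shuffled

# Made-up data with an upward drift (not the data from the text).
rng = random.Random(1)
hdd = [rng.uniform(0, 9000) for _ in range(200)]
gas = [20 + 0.02 * h + rng.gauss(0, 30) for h in hdd]

# One original frame plus three randomly paired frames, as in Figure 6-4.
frames = [(hdd, gas)] + [random_pairing(hdd, gas, seed=s) for s in (10, 11, 12)]

for xs, ys in frames:
    # Shuffling keeps the same values, so the marginal histograms do not change.
    assert sorted(xs) == sorted(hdd)
    assert sorted(ys) == sorted(gas)
```

Because only the pairing changes, any difference between the original frame and the shuffled frames reflects association rather than the marginal distributions.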

Describing Association in a Scatterplot


If you find association in a scatterplot, then you need to describe it. To describe the association, start with its direction. In this example, the colder the winter, the larger the gas consumption tends to be. This pattern has a positive direction because the points in the scatterplot tend to concentrate in the lower left and upper right corners. As the explanatory variable increases, so does the response. A pattern running the other way has a negative direction. As x increases, y tends to decrease.

Another property of the association is its curvature. Does the pattern resemble a line or does it bend? The scatterplot of the natural gas use appears linear. The points roughly concentrate along an upward sloping line that runs from the lower left corner to the upper right. Linear patterns have a consistent direction. Linear patterns with positive direction follow a line with positive slope; linear patterns with negative direction follow a line with negative slope. Curved relationships are harder to describe because the direction changes.

The third property of the association is the amount of variation around the pattern. In this case, there's quite a bit of variation among homes in similar climates. For instance, among homes in a climate with 6,000 HDD, some use 40 MCF of gas compared to others that use 6 times as much. Plus, some homes in cold climates use less gas than others in warm climates. The variation around the linear pattern in gas use also appears to increase with the amount used. The points stick closer to the linear pattern in the warm climates at the left of Figure 6-3 than in the colder climates on the right.

Finally, look for outliers and other surprises. Often the most interesting aspect of a scatterplot is something unexpected. An outlying point is almost always interesting and deserves special attention. Clusters of several outliers raise questions about what makes the group so different. In Figure 6-3, we don't see outliers so much as an increase in the variation that comes with higher consumption of natural gas.

Let's review the questions to answer when describing the association that you see in a scatterplot:
1. Direction. Does the pattern trend up, down, or both?
2. Curvature. Does the pattern appear to be linear or does it curve?
3. Variation. Are the points tightly clustered along the pattern?
4. Outliers and surprises. Did you find something unexpected?

Don't worry about memorizing this list. The key is to look at the scatterplot and think. We'll repeat these steps over and over again, and soon the sequence will become automatic.

Are You There?


One of the following scatterplots shows the price of diamonds versus their weights (in carats). The other shows the box-office gross (in dollars) of movies versus a critic's rating on a 5-point Likert scale. Identify which is which and describe any association.2

2 The diamonds are on the left, with positive association. The pattern is vaguely linear, with quite a bit of variation around the linear trend. The movies are on the right with little or no association. The rating assigned by the critic has little relationship with the popularity at the box office.


Figure 6-5. Which scatterplot matches each situation?

Covariance
covariance A statistic that measures the amount of association between two numerical variables.

Covariance quantifies the amount of association between two numerical variables. To see how covariance works, let's continue with the energy data. In the following plot, we've colored the points in the scatterplot of the amount of natural gas versus the number of heating degree-days (Figure 6-3).

Figure 6-6. Scatterplot of the amount of natural gas versus HDD with reference lines at the means.

The gray lines locate the means of the two variables and divide the plot into four quadrants. Green points in the upper right and lower left quadrants identify households for which both variables are larger or smaller than average. For instance, a household in the upper right quadrant is in a colder than average climate and uses more than the average amount of gas. Red points in the other two quadrants identify households for which one variable is larger than its mean and the other is smaller. These include, for instance, homes in relatively warm locations that use more gas than average.

Points in the green quadrants indicate positive association. Positive association implies that the variables vary together. For example, cases that are relatively large on one axis are also large on the other. If there is positive association, most of the points should be in the upper right and lower left quadrants. That's the case in this example. Points in the other two quadrants (red) suggest negative association: one variable is relatively large whereas the other is relatively small. Negative association implies that most of the data should be in the upper left and lower right quadrants.

[Diagram: a rectangle with one corner at the means \((\bar{x}, \bar{y})\) and the opposite corner at a data point \((x, y)\); its sides are the deviations \(x - \bar{x}\) and \(y - \bar{y}\).]

Figure 6-7. Deviations from the means in the scatterplot.

Rather than count the points in the quadrants, covariance measures the distances from the means. Cases that lie farther from the means have a larger effect on the covariance. Consider the area of the rectangle in Figure 6-7. One corner lies at the two means \((\bar{x}, \bar{y})\) and the other lies at the location \((x, y)\) of a point in the scatterplot. The area is \((x - \bar{x})(y - \bar{y})\). The sign (positive or negative) of this area indicates the type of association indicated by the point. For points in the green quadrants, the product of the deviations is positive because both deviations have the same sign:

\[ (x - \bar{x})(y - \bar{y}) > 0 \]

In the red quadrants, the product is negative. One variable is above its mean (positive deviation) and the other is below its mean (negative deviation). Points with either deviation equal to zero don't contribute to the covariance.

The covariance is (almost) the average of these areas. The formula for the covariance is
\[ \mathrm{cov}(x, y) = \frac{(x_1 - \bar{x})(y_1 - \bar{y}) + (x_2 - \bar{x})(y_2 - \bar{y}) + \cdots + (x_n - \bar{x})(y_n - \bar{y})}{n - 1} \]

The divisor \(n - 1\) matches the divisor in \(s^2\). Because the energy example has a large number of cases (n = 1658), we used a computer to compute the covariance between climate and gas use. To illustrate how the data enter into the formula, the next expression plugs in several values from the data table (a portion appears below).

Case     HDD     Gas
1        4080    72.5
2        2068    35.6
...
1658     8896    107.5

\[
\begin{aligned}
\mathrm{cov}(\mathrm{HDD}, \mathrm{Gas})
&= \frac{(\mathrm{HDD}_1 - \overline{\mathrm{HDD}})(\mathrm{Gas}_1 - \overline{\mathrm{Gas}}) + (\mathrm{HDD}_2 - \overline{\mathrm{HDD}})(\mathrm{Gas}_2 - \overline{\mathrm{Gas}}) + \cdots + (\mathrm{HDD}_{1658} - \overline{\mathrm{HDD}})(\mathrm{Gas}_{1658} - \overline{\mathrm{Gas}})}{1658 - 1} \\
&= \frac{(4080 - 4547.5)(72.5 - 99.9) + (2068 - 4547.5)(35.6 - 99.9) + \cdots + (8896 - 4547.5)(107.5 - 99.9)}{1657} \\
&= 63{,}357
\end{aligned}
\]

2/7/2008

6 Numerical Association

The covariance confirms the presence of positive association, but it is difficult to interpret the specific value. The positive covariance implies that, on average, homes in climates with more than the average number of heating DD use more than the average amount of natural gas. That's consistent with our intuition and the scatterplot. The magnitude, however, is difficult to interpret. This difficulty occurs because the covariance has units: those of the x-variable times those of the y-variable. In this case, the covariance is 63,357 HDD × MCF. To make sense of the covariance, we need to modify it so that we can look at the value and have a sense of the strength of the association.
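As a rough illustration of the formula, the sketch below computes the signed area contributed by each of the three cases shown in the data table, using the means reported in the text (4547.5 HDD and 99.9 MCF). It also defines a general `covariance` function and tries it on a tiny made-up data set. The function name and the made-up values are assumptions for illustration; the full data set is not reproduced here, so the sketch does not recompute 63,357.

```python
def covariance(x, y):
    """Sample covariance: sum of the products of deviations, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Means of the full data set, as reported in the text.
MEAN_HDD, MEAN_GAS = 4547.5, 99.9

# The three cases shown in the data table: (HDD, gas in MCF).
cases = [(4080, 72.5), (2068, 35.6), (8896, 107.5)]

# Signed area of the rectangle between each case and the point of means.
areas = [(hdd - MEAN_HDD) * (gas - MEAN_GAS) for hdd, gas in cases]
for (hdd, gas), area in zip(cases, areas):
    # A positive area places the case in one of the green quadrants.
    print(hdd, gas, round(area, 2), "green" if area > 0 else "red")

# A tiny made-up data set to exercise the function.
toy_cov = covariance([1, 2, 3, 4], [2, 4, 6, 8])
```

All three listed cases have both deviations on the same side of the means, so each contributes a positive area to the sum.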

Correlation
correlation (r) A standardized measure of association between two numerical variables; the correlation is always between -1 and +1.

Correlation is a more easily interpreted measure of association between numerical variables. The correlation between two numerical variables is easy to find once we have the covariance. Just divide the covariance by the product of the standard deviations.

\[ \mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{s_x \, s_y} \]

For example, the correlation between heating DD and gas usage is

\[ \mathrm{corr}(\mathrm{HDD}, \mathrm{Gas}) = \frac{\mathrm{cov}(\mathrm{HDD}, \mathrm{Gas})}{s_{\mathrm{HDD}} \, s_{\mathrm{Gas}}} = \frac{63357 \ \mathrm{HDD} \times \mathrm{MCF}}{2235.4 \ \mathrm{HDD} \times 51.26 \ \mathrm{MCF}} \approx 0.55 \]

The units of the two standard deviations cancel the units of the covariance. The resulting statistic does not have units and is always in the range -1 to +1:

\[ -1 \le \mathrm{corr}(x, y) \le +1 \]

The sign of the correlation indicates the direction of the association. The correlation is often abbreviated by the letter r. The fact that the correlation in this example is well below its upper limit tells you that the data show considerable variation around the pattern. Homes in the same climate vary considerably in how much gas they use. Other factors beyond climate, such as thermostat settings, affect consumption.

Because r has no units, it is not affected by the scale of measurement. For instance, had we measured gas use in cubic meters and climate in degrees Celsius, the correlation would still be r = 0.55.
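The division is easy to check. The sketch below first redoes the arithmetic from the summary statistics quoted in the text, then shows the same calculation from raw values on a small made-up data set. The helper names `std_dev`, `covariance`, and `correlation` are illustrative.

```python
import math

def std_dev(values):
    """Sample standard deviation with divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def covariance(x, y):
    """Sample covariance with divisor n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    return covariance(x, y) / (std_dev(x) * std_dev(y))

# From the summary statistics in the text: cov = 63357, s_HDD = 2235.4, s_Gas = 51.26.
r_text = 63357 / (2235.4 * 51.26)
print(round(r_text, 2))  # 0.55

# From raw values, using a small made-up data set.
r_toy = correlation([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```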


[Two scatterplot panels: Natural Gas (MCF) versus Heating DD, and Natural Gas (metric, MCM) versus Heating DD (Celsius).]

Table 6-1. Scales affect the axes, but not the content of a scatterplot.

If you cover the axes, these scatterplots match. Only the labels on the axes differ. These differences don't affect the direction, curvature, or variation in the scatterplot, so they don't change the strength of the association either. The correlation is the same in both cases.

The correlation can reach -1.0 or +1.0, but these extremes are unusual. They happen only if all the data fall exactly on a single line. For example, this plot shows the temperature at 50 locations on both Fahrenheit and Celsius scales.

Figure 6-8. Centigrade temperatures are perfectly associated with Fahrenheit temperatures.

The correlation r = 1 because there is a line that gives one temperature exactly in terms of the other (Fahrenheit = 32 + 1.8 × Celsius). Data on a line with negative slope imply r = -1.
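Both claims, that rescaling leaves r unchanged and that data on a line give r = ±1, can be checked numerically. This is a minimal sketch with made-up temperatures and a made-up pair of variables; the `correlation` helper is illustrative. The rescaling uses positive multipliers, as a change of units does; a negative multiplier would flip the sign of r.

```python
import math

def correlation(x, y):
    """Correlation computed from sums of squared deviations (the n - 1 terms cancel)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up temperatures; Fahrenheit is an exact linear function of Celsius.
celsius = [-10.0, 0.0, 12.5, 20.0, 37.0]
fahrenheit = [32 + 1.8 * c for c in celsius]
r_line_up = correlation(celsius, fahrenheit)               # essentially +1
r_line_down = correlation(celsius, [-c for c in celsius])  # essentially -1

# Changing units (adding a constant, multiplying by a positive constant)
# leaves the correlation unchanged.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r_original = correlation(x, y)
r_rescaled = correlation([10 * a + 3 for a in x], [0.5 * b for b in y])
```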

Are You There?

This scatterplot shows the amount of natural gas used versus another variable, the size of the home measured in the number of square feet.


Figure 6-9. Scatterplot of natural gas used versus the size of the home.

(a) Describe the association between the two variables.3
(b) The covariance between these variables is 19,096. What are the units of the covariance?4
(c) The standard deviation of natural gas is sy = 51.26 MCF, and the SD of the size is sx = 876.5 sq ft. What is the correlation between the use of natural gas and the size of the home?5
(d) What does the correlation indicate about the association?6

Summarizing Association with a Line


At the beginning of the chapter, we set out to design an ad that would convince homeowners to insulate. We've confirmed that gas use and climate are associated. This association means that we can attribute some of the variation in gas use to differences in climate. How does that affect our ad? It means that we ought to consider an ad that takes account of location. For example, we could replace the ad shown in the introduction with an interactive web ad: "Click to find typical savings from insulating!" The savings would be larger in colder climates.

In fact, the correlation determines a line that we can use to predict the possible savings. This line relates the z-scores of the two variables. (Chapter 4 defines z-scores. A z-score is the deviation from the mean divided by the standard deviation.) If a case lies \(z_x\) standard deviations from \(\bar{x}\), we expect to find y near \(r \, z_x\) standard deviations from \(\bar{y}\). We can write this as

\[ \hat{z}_y = r \, z_x \]

The hat (^) distinguishes what this formula predicts from the real thing. If the correlation r = 0, this line is flat; we'd predict \(\hat{z}_y = 0\) for any value of x.
3 There's weak positive association, vaguely linear. It looks like the variation around the pattern increases with usage.
4 The units of the covariance are those of x times those of y, so that's 19,096 (square feet × MCF).
5 The correlation is 19096/(876.5 × 51.26) ≈ 0.425.
6 The correlation confirms that there is some linear association (it's roughly halfway between 0 and its largest positive value, +1), but it's not as strong as the correlation between gas usage and climate in the text example.

As an example, suppose that a home is located in a cold climate with 8,800 heating degree-days. The average number of heating DD is \(\bar{x}\) = 4,547.5 HDD with standard deviation \(s_x\) = 2,235.4 HDD. Hence, the z-score for a home with 8,800 HDD is

\[ z_x = \frac{8800 - 4547.5}{2235.4} \approx 1.9 \]

This climate is about 1.9 standard deviations colder than the mean. The positive association between heating DD and gas usage suggests that this home uses more than the average amount of gas. The predicted z-score for gas consumption is

\[ \hat{z}_y = r \times z_x = 0.55 \times 1.9 = 1.045 \]

Converting back to the original units (\(\bar{y}\) = 99.9 MCF and \(s_y\) = 51.26 MCF), this works out to

\[ \bar{y} + \hat{z}_y \times s_y = 99.9 + 1.045 \times 51.26 \approx 153.5 \ \mathrm{MCF} \]

The correlation predicts a home in a location with 8,800 heating degree-days to use 153.5 thousand cubic feet of gas, costing $1,535. The savings from insulating would be estimated to range up to 30% of the predicted costs, or 0.3 × $1,535 = $460.50.

Let's do another example, this time for a warmer climate. For a home in a climate 1 SD below the mean, we expect its gas use to be about 0.55 SDs less than the average gas used, or 99.9 - 0.55 × 51.26 = 71.707 MCF. Because less gas is typically used in warmer climates, the predicted savings are smaller, only 0.3 × $717.07 ≈ $215.

We've now got the ingredients for an effective interactive ad. Customers at our web site could click on their location on a map like this one or enter their zip code. Our software would look up the climate in that area and use the correlation line to estimate the possible dollar savings.
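The calculation behind such an ad could look something like the sketch below. It uses the summary statistics from the text, the price of about $10 per MCF, and the 30% upper bound on savings. The function names are hypothetical, and the lookup from a zip code to heating degree-days is left out.

```python
# Summary statistics reported in the text.
MEAN_HDD, SD_HDD = 4547.5, 2235.4   # heating degree-days
MEAN_GAS, SD_GAS = 99.9, 51.26      # natural gas, MCF
R = 0.55                            # correlation between HDD and gas use

PRICE_PER_MCF = 10.0                # dollars, approximate price from the text
MAX_SAVINGS_RATE = 0.30             # "up to 30%" from the government analysis

def predicted_gas(hdd):
    """Predict gas use (MCF) from climate using the correlation line."""
    z_x = (hdd - MEAN_HDD) / SD_HDD    # standardize the climate
    z_y_hat = R * z_x                  # predicted z-score of gas use
    return MEAN_GAS + z_y_hat * SD_GAS # convert back to MCF

def estimated_savings(hdd):
    """Upper estimate of annual dollar savings from insulating."""
    return MAX_SAVINGS_RATE * PRICE_PER_MCF * predicted_gas(hdd)

cold = predicted_gas(8800)               # close to the 153.5 MCF in the text
warm = predicted_gas(MEAN_HDD - SD_HDD)  # one SD warmer than average
print(round(cold, 1), round(warm, 3))
print(round(estimated_savings(8800)), round(estimated_savings(MEAN_HDD - SD_HDD)))
```

The cold-climate prediction differs from the text in the second decimal place because the text rounds the z-score to 1.9 before multiplying.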

Nonlinear Patterns
If the association between the variables is not linear, a line may be a poor summary of the pattern. Linear relationships are common, but do not apply in every situation. Chi-square (Chapter 5) measures all of the association between categorical variables; covariance and correlation quantify only linear association. If the pattern in a scatterplot bends, then covariance and correlation miss some of the association. Be sure that you inspect the scatterplot before relying on these statistics to measure association.

For example, the variable on the x-axis in the following scatterplot is the age of an employee and the variable on the y-axis is the cost in dollars to the employer to provide a life insurance benefit.

Figure 6-10. The cost of insurance does not continue to increase.

You can see a strong pattern, but it's not linear. The direction of the pattern changes. The cost of providing life insurance initially grows as young employees go through their family years and opt for more insurance. As employees age, fewer want life insurance, and the cost to the firm to provide this benefit shrinks. The correlation does not see this pattern because the pattern is not linear. The correlation between age and cost is near zero, r = 0.09. In a sense, the average slope is zero; the positive slope on the left cancels the negative slope on the right. If benefits managers only compute a correlation and do not look at the scatterplot, they might think that these two variables are unrelated. That mistake could produce poor decisions when it comes time to anticipate the cost of this benefit.
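A small made-up example shows how a strong curved pattern can produce a correlation of zero. The ages and costs below are hypothetical and perfectly symmetric around age 45, so the cancellation is exact; they are not the insurance data behind Figure 6-10.

```python
import math

def correlation(x, y):
    """Correlation computed from sums of squared deviations (the n - 1 terms cancel)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ages and costs: cost rises until age 45, then falls.
ages = list(range(20, 71, 5))
cost = [1000 - (age - 45) ** 2 for age in ages]

# Cost is completely determined by age, yet the linear association is zero:
# the rising half and the falling half cancel.
r_curved = correlation(ages, cost)

# Looking only at the left half reveals the strong upward trend.
r_left = correlation(ages[:6], cost[:6])
```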

Spurious Correlation
A scatterplot of the damage (in dollars) caused to homes by fire would show a strong correlation with the number of firefighters who tried to put out the blaze. Does that mean that firefighters cause damage? No. A lurking variable, the size of the blaze, explains the superficial association. A correlation that results from an underlying, lurking variable rather than the shown x and y variables is often called a spurious correlation.

spurious correlation Correlation between variables due to the effects of a lurking variable.

Scatterplots and correlation reveal association, not causation. You have to be cautious interpreting dependence in a contingency table (recall Simpson's paradox), and the same warning applies to scatterplots. Lurking variables can affect the relationship in a scatterplot in the same way that they affect the relationship between categorical variables in a contingency table.

Interpreting the association in a scatterplot requires deeper knowledge about the variables. For the example of this chapter, it seems sensible to interpret the scatterplot of gas usage and heating degree-days as meaning that colder weather causes households to use more natural gas. We can feel pretty good about this interpretation because we've all lived in homes and had to turn up the heat when it gets cold outside. Though it makes sense, it's not proof. What other factors might affect the relationship between HDD and the amount of gas that gets used? We already mentioned two: thermostat settings and the size of the home. How might these factors affect the relationship between heating degree-days and gas use? Maybe homes in colder climates are bigger than those in warmer climates. That might explain the larger amount of gas used for heating. It's not the cold, it's the size of the home. These alternatives may not seem plausible, but we need to rule out alternative explanations before we make decisions based on association.
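A simulation can make the firefighter example concrete. In the sketch below, every number is invented: the size of the blaze drives both the number of firefighters and the damage, and neither affects the other. The two variables are still highly correlated overall, but the correlation drops sharply among fires of nearly the same size.

```python
import math
import random

def correlation(x, y):
    """Correlation computed from sums of squared deviations (the n - 1 terms cancel)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2008)

# Hypothetical model: the size of the blaze is the lurking variable.
size = [rng.uniform(1, 10) for _ in range(5000)]
firefighters = [2 * s + rng.gauss(0, 1) for s in size]
damage = [1000 * s + rng.gauss(0, 500) for s in size]

# Overall, firefighters and damage appear strongly associated.
r_all = correlation(firefighters, damage)

# Among fires of nearly the same size, little association remains.
band = [i for i, s in enumerate(size) if 5.0 <= s <= 5.5]
r_band = correlation([firefighters[i] for i in band],
                     [damage[i] for i in band])
print(round(r_all, 2), round(r_band, 2))
```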

Checklist: Correlation
Correlation measures linear association between two numerical variables. Before you use correlation (and covariance), verify that your data meet the prerequisites in this checklist.

Numerical variables. Correlation applies only to quantitative variables. Don't apply correlation to categorical data.

No obvious lurking variables. There's a spurious correlation between math skills and shoe size for kids in an elementary school, but there are better ways to convey the effect of age.

Linear. Correlation measures the strength of linear association. A correlation near zero implies no linear association, but there could be a nonlinear pattern.

Outliers. Outliers can distort the correlation dramatically. An outlier can enlarge an otherwise small correlation or conceal a large correlation. If you see an outlier, it's often helpful to report the correlation with and without the point.

These conditions are easy to check in the scatterplot. Many correlations, however, are reported without plots. Think about this checklist even if you don't see the scatterplot.
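The advice about outliers is easy to follow in code: compute the correlation twice, once with the suspect point and once without it. The data below are made up so that the eight ordinary points have no linear association at all, and a single distant point is added at (20, 15).

```python
import math

def correlation(x, y):
    """Correlation computed from sums of squared deviations (the n - 1 terms cancel)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: eight points with no linear association.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 3, 6, 4, 5, 3, 6, 4]
r_without = correlation(x, y)

# One outlier far from the rest manufactures a large correlation.
r_with = correlation(x + [20], y + [15])

print("without outlier:", round(r_without, 2))
print("with outlier:   ", round(r_with, 2))
```

Reporting both values makes it plain how much of the apparent association rests on a single case.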

4M: Getting there first


When retailers like Target look for locations for opening a new store, they take into account how far they will be from the nearest competition. It seems foolish to open a new store right next to rival Wal-Mart, or does it? Is it really the case that it's better to locate farther from the competition? Maybe there's a reason no other stores are nearby. For this problem, we'll look at data for a regional chain of retail stores. The chain operates 55 stores and has the total sales at each store over the past year.

Motivation. State the objective.
I will look at the relationship between sales at the retail outlets and the distance to the nearest competitor to see if stores located far from the competition have higher sales.

Method. Plan your approach. Identify the data and state your approach.
My data measure the sales (in dollars) over the prior year for 55 retail stores. The other variable is the distance in miles from the nearest competitor. I'll start with a scatterplot to see the relationship between sales and distance. Once I see the scatterplot, I'll check the conditions before I report the correlation. If the conditions are met, I'll use the correlation to quantify the strength of the association.

Mechanics. Make the scatterplot and compute the correlation if the plot suggests the conditions are met. Use a computer program or graphing calculator if you can.
[Scatterplot of Sales (in dollars, roughly 1,500,000 to 3,000,000) versus Distance (0 to 10 miles).]

Use the correlation checklist.
Numerical variables: Both variables are numerical.
No obvious lurking variable: I am concerned about this one. I'm not sure that all of the stores are of the same size. The chain operates some superstores that combine a traditional department store with a grocery market. If large stores tend to be farther from the competition, I've got a problem. Maybe the association is because of size, not distance.
Linear: The scatterplot seems reasonably linear with substantial variation.
Outliers: Some stores sell a lot and others a little, but that's to be expected. The largest selling store is a bit remote, but not enough to be a problem.
The correlation is 0.74. Sales and distance are positively related. (We used a computer to compute the correlation.)

Message. Summarize the pattern in the data, along with any unusual features.
The data show a strong, positive linear association between distance to the nearest competitor and the annual sales. The correlation between sales and distance is 0.74. I have lingering concerns that there may be other lurking factors behind this relationship, but find no other problems.

If you suspect lurking factors, you ought to mention them again here.


Correlation Tables
It is common in some fields to report the correlation between every pair of numerical variables and arrange these in a table. The rows and columns of the table name the variables, and the cells hold the correlations. Correlation tables, or correlation matrices as they are sometimes called, are compact and convey a lot of information at a glance. They can be an efficient way to explore a large data set because they summarize at a glance many pairwise relationships. Before relying on the correlations you find in the table, however, be sure to review the checklist for the correlation. Because the correlation table does not show plots, you cannot tell whether the results conceal outliers or miss nonlinear patterns. As an example, this table shows correlations among several characteristics reported in Forbes magazine for large companies.
               Assets   Sales   Market Value   Profits   Cash Flow   Employees
Assets          1.000   0.746      0.682        0.602      0.641       0.594
Sales           0.746   1.000      0.879        0.814      0.855       0.924
Market Value    0.682   0.879      1.000        0.968      0.970       0.818
Profits         0.602   0.814      0.968        1.000      0.989       0.762
Cash Flow       0.641   0.855      0.970        0.989      1.000       0.787
Employees       0.594   0.924      0.818        0.762      0.787       1.000

Table 6-2. Correlation matrix of the characteristics of large companies.

For example, the correlation between profits and cash flow is quite high, 0.989. Notice that this correlation appears twice in the table, once above the diagonal and once below it. The table is symmetric about the diagonal. The correlation between profits and cash flow is the same as the correlation between cash flow and profits, so the values match on either side of the diagonal. The correlation (like the covariance) is the same regardless of which variable you call x and which you call y. The diagonal cells of a correlation table are exactly 1. (Can you see why? Think about where the points lie in a scatterplot of assets on assets, for instance.) Correlation tables are commonly offered by statistics packages. Some packages also offer a graphical alternative known as a scatterplot matrix. Rather than show each correlation, a scatterplot matrix shows every scatterplot and makes it harder to miss abnormalities that a correlation can hide.
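The properties described above (symmetry about the diagonal, and diagonal cells equal to 1) can be checked on any correlation table. The following is a minimal sketch in Python with NumPy, neither of which the chapter uses. The data are made up for illustration and are not the Forbes figures in Table 6-2; the variable names and the way the columns are generated are assumptions.

```python
import numpy as np

# Hypothetical data: 200 made-up companies, NOT the Forbes figures in Table 6-2.
rng = np.random.default_rng(0)
n = 200
size = rng.normal(size=n)                         # shared "company size" factor
sales = size + 0.5 * rng.normal(size=n)
profits = 0.8 * sales + 0.6 * rng.normal(size=n)
employees = size + rng.normal(size=n)

names = ["Sales", "Profits", "Employees"]
data = np.column_stack([sales, profits, employees])

# rowvar=False tells NumPy that each column (not each row) is a variable.
table = np.corrcoef(data, rowvar=False)

# Print the table with row and column labels.
print(" " * 10 + "".join(f"{name:>10}" for name in names))
for name, row in zip(names, table):
    print(f"{name:<10}" + "".join(f"{r:10.3f}" for r in row))

# The two properties discussed in the text.
is_symmetric = np.allclose(table, table.T)
diagonal_is_one = np.allclose(np.diag(table), 1.0)
print("symmetric about the diagonal:", is_symmetric)
print("diagonal cells equal 1:", diagonal_is_one)
```

Like any correlation table, this output shows no plots, so the checklist for correlation still has to be verified separately for each pair of variables.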


Summary

Scatterplots show association between two numerical variables. The response goes on the y-axis and the explanatory variable goes on the x-axis. The visual test for simplicity compares the observed scatterplot to artificial plots that remove any pattern. To describe the pattern in a scatterplot, indicate its direction, curvature, variation and other surprising features such as outliers. Covariance measures the amount of linear association between two numerical variables. The correlation r scales the covariance so that the resulting measure of association is always between -1 and +1. The correlation is also the slope of a line that relates the standardized deviations from the mean on the x-axis to standardized deviations from the mean on the y-axis. Spurious correlation results from the presence of a lurking variable, and curvature or outliers can cause correlation to miss the pattern.

Key Terms

correlation
covariance
explanatory variable
response
scatterplot
spurious correlation
visual test for simplicity

Best Practices
To understand the relationship between two numerical variables, start with a scatterplot. If you see the plot, you won't be fooled by outliers, bending patterns, or other features that can mislead the correlation.

Look at the plot, look at the plot, look at the plot. It's the most important thing to do. Be wary of a correlation if you are not familiar with the underlying data.

Use clear labels for the scatterplot. Many people don't like to read; they only look at the pictures. Use labels for the axes in your scatterplots that are self-explanatory.

Describe a relationship completely. If you do not show a scatterplot, make sure to convey the direction, curvature, variation and any unusual features.

Consider the possibility of lurking factors. Correlation shows association, not causation. There might be another factor lurking in the background that's a better explanation for the pattern that you have found.

Use a correlation to quantify the dependence between two numerical variables that are linearly related. Don't use it in other cases. You'll only fool yourself and those who read what you have done.

Verify the correlation checklist before you report the correlations.

Pitfalls
Don't use the correlation if the data are categorical. Keep categorical data in tables where it belongs.

Don't treat association and correlation as causation. The presence of association does not mean that changing one variable causes a change in the other. The apparent relationship may in fact be due to some other, lurking variable.

Don't assume that a correlation of zero means that the variables are not associated. If the relationship bends, the correlation can miss the pattern. A single outlier can also hide an otherwise strong linear pattern or enhance the pattern in data that have little linear association.

Don't assume that a correlation near -1 or +1 means near perfect association. Unless you see the plot, you'll never know whether all that you have done is find an outlier.
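The last two pitfalls are easy to reproduce with a few made-up points. The sketch below, in Python with NumPy (not software the chapter covers), builds three small artificial data sets: a bending pattern whose correlation is zero, a patternless grid of points that a single outlier turns into a high correlation, and a perfect line that a single outlier hides. All of the numbers are invented for illustration.

```python
import numpy as np

def corr(x, y):
    """Correlation between two equal-length sequences of numbers."""
    return float(np.corrcoef(x, y)[0, 1])

# 1. A bending pattern: y is completely determined by x, yet r is zero.
x_bend = np.arange(-5.0, 6.0)          # -5, -4, ..., 5
y_bend = x_bend ** 2
r_bend = corr(x_bend, y_bend)

# 2. No linear association (a 3-by-3 grid of points), then one outlier added.
gx, gy = np.meshgrid([-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0])
x_grid, y_grid = gx.ravel(), gy.ravel()
r_grid = corr(x_grid, y_grid)
r_grid_outlier = corr(np.append(x_grid, 10.0), np.append(y_grid, 10.0))

# 3. A perfect line, then one outlier that hides it.
x_line = np.arange(10.0)               # 0, 1, ..., 9
y_line = x_line.copy()
r_line = corr(x_line, y_line)
r_line_outlier = corr(np.append(x_line, 20.0), np.append(y_line, -2.0))

print(f"bending pattern:        r = {r_bend:.3f}")
print(f"grid, no outlier:       r = {r_grid:.3f}")
print(f"grid, one outlier:      r = {r_grid_outlier:.3f}")
print(f"line, no outlier:       r = {r_line:.3f}")
print(f"line, one outlier:      r = {r_line_outlier:.3f}")
```

In each case a scatterplot would reveal at once what the single number conceals, which is why the plot comes first.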

Checklist for Correlation


Numerical variables
No obvious lurking variable
Linear pattern
No extreme outliers

Formulas
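Covariance

$$\operatorname{cov}(x, y) = \frac{(x_1 - \bar{x})(y_1 - \bar{y}) + (x_2 - \bar{x})(y_2 - \bar{y}) + \cdots + (x_n - \bar{x})(y_n - \bar{y})}{n - 1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

Correlation

$$r = \operatorname{corr}(x, y) = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}$$

When doing calculations by hand, the following formula is also used.

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Covariance and correlation

$$\operatorname{cov}(x, y) = s_x \, s_y \operatorname{corr}(x, y), \qquad \operatorname{corr}(x, y) = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}$$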
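The formulas above translate directly into a few lines of code. The following is a minimal sketch in Python with NumPy, which the chapter does not cover; the function names and the five data points are invented for illustration. It computes the covariance from its definition, the correlation both ways, and compares the result with NumPy's built-in np.corrcoef.

```python
import numpy as np

def covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1))

def correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    # ddof=1 gives the sample standard deviation (divisor n - 1).
    return covariance(x, y) / (np.std(x, ddof=1) * np.std(y, ddof=1))

def correlation_by_hand(x, y):
    """The hand-calculation form: the n - 1 divisors cancel out."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

# Made-up data, chosen so that the arithmetic is easy to follow by hand.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = covariance(x, y)
r = correlation(x, y)
r_hand = correlation_by_hand(x, y)
r_numpy = float(np.corrcoef(x, y)[0, 1])

print(f"cov(x, y)            = {cov_xy:.4f}")
print(f"corr(x, y)           = {r:.4f}")
print(f"corr by hand formula = {r_hand:.4f}")
print(f"np.corrcoef          = {r_numpy:.4f}")
```

For these five points the deviations multiply and sum to 6, so the covariance is 6/4 = 1.5, and all three correlation calculations agree.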



About the Data


The data on household energy consumption were collected by the Energy Information Administration of the US Department of Energy (DoE) as part of its survey of residential energy consumption around the US. How does the government use these data? For one, it uses the data to provide guidance to homeowners who are thinking about making their homes more energy efficient. The Home Energy Saver on DoE's web site at hes.lbl.gov provides a calculator that compares the energy costs for homes in your area. The calculator uses data like those in this chapter to estimate the value of modernizing appliances and adding insulation.

Software Hints
Statistics packages generally make it easy to look at a scatterplot to check whether the correlation is appropriate. Some packages make this easier than others. Many packages allow you to modify or enhance a scatterplot, altering the axis labels, the axis numbering, the plot symbols, or the colors used. Some options, such as color and symbol choice, can be used to display additional information on the scatterplot.

Excel

Use the Chart Wizard to produce a scatterplot of two columns. The process is simplest if you have selected the two columns prior to starting the Wizard, with the x column adjacent and to the left of the y column. (If the columns are not adjacent or in this order, you'll need to fill in the data dialog later.) After clicking the Chart Wizard button, select the icon that looks like a scatterplot without connected dots. Click the next button and you're on your way. Subsequent dialogs allow you to modify the scales and labels of the axes. The chart is interactive. If you change the data in the spreadsheet, the change is transferred to the chart. To compute a correlation, select the function CORREL from the menu of statistical functions. Enter ranges for two numerical variables in the dialog. (You can type the function directly into a cell as well.)

Minitab

To make a scatterplot, follow the menu commands Graph > Plot. Select the default plot (others, for example, let you include the correlation line), then identify the numerical variable for the y-axis and the numerical variable for the x-axis. To compute a correlation, follow the menu items Stat > Basic Statistics > Correlation. Fill the dialog with at least 2 numerical variables. If you choose more than 2, you get a table of correlations. The sequence Stat > Basic Statistics > Covariance produces the covariance of the variables.

JMP

To make a scatterplot, follow the menu commands Analyze > Fit Y by X. In the dialog, assign numerical variables from the available list for the y-axis and the x-axis. Once you have formed the scatterplot, to obtain a correlation use the pop-up menu identified by the red triangle in the upper left corner of the output window holding the scatterplot. Select the item labeled Density ellipse and choose a level of coverage (such as 0.95). The ellipse shown by JMP is thinner as the correlation becomes larger. A table below the scatterplot shows the correlation. To obtain a correlation table, follow the menu commands Analyze > Multivariate Methods > Multivariate and select variables from the list shown in the dialog. By default, JMP produces the correlation table and the scatterplot matrix (a table of scatterplots, one for each correlation). Clicking on the red triangle button in the output window produces other options, including the table of covariances.
