
Lecture Notes in Elementary Statistics

Copyright © 2006 by Andrew Matchett

Preface

These notes grew out of lectures the author has given in statistics classes at University of Wisconsin - La Crosse over the past two decades. The help of Carolyn Helein, Tricia Larson, and Stephanie Lawniczak in proofreading and in suggesting improvements is gratefully acknowledged.

Descriptive Statistics

The field of statistics may be divided into two parts, descriptive statistics and inferential statistics. Descriptive statistics is more elementary and also more general than inferential statistics. It consists of a flexible and continuously growing body of methods for identifying and describing key features of sets of data. A chemist makes five measurements of the boiling point of a newly synthesized compound. A field biologist tags and weighs 27 rattlesnakes found in a week of searching through a remote canyon in Utah. An economist records each week the price of oil, the number of housing starts, and the price of beef. All three of these individuals will begin the study of their data with descriptive statistics. Descriptive statistics serves along with probability theory as the foundation for inferential statistics. The goal in inferential statistics is to use information from a sample to draw inferences and to quantify error probabilities and confidence levels associated with those inferences.

All rights reserved. No part of these notes may be reproduced without prior consent of the author.

ID#  Weight  Gender
01   121     F
02   136     F
03   153     M
04   125     M
05   134     M
06   120     M
07   125     F
08   147     M
09   140     F

Table 1: A Data File with 9 Observations

Descriptive statistics is more general because it may be applied to any sets of data, not just data gathered for purposes of inference. In certain applications, data files are enormous. For example, remote sensing data in the field of geographical information systems (GIS) can occupy many gigabytes on a computer disk. It would be virtually impossible for a person to get any useful information at all by looking at the raw data. But armed with computers and sophisticated software, geographers extract all kinds of useful information including frequency tables, bar charts, summary statistics, and color-coded maps. This, too, is descriptive statistics, and people in the GIS field sometimes define descriptive statistics as the art of converting data into information.

1.1 Univariate Descriptives

Frequency Tables and Visual Displays

Our aim is to describe sets of data. Consider the very small data file in table 1. This file gives weights and genders of 9 experimental subjects or observational units. Weight and gender are variables. Weight is a quantitative variable and gender is a qualitative variable. This file is prototypical. It has the form that has come to be almost universal for data files. Each observational unit has its data recorded in one row of the file, and each column corresponds to a variable. A person or object on which measurements are made is called an observational unit, and the collection of data values

obtained from a given observational unit constitutes what is called an observation. Thus the data file in table 1 has 9 observations and 2 or 3 variables, depending on whether we wish to consider the ID number to be a variable. If the observations are from a controlled experiment, the term experimental unit is sometimes used instead of observational unit. The experimental unit or observational unit is the unit which is measured or weighed or scanned or probed, etc., to determine the values of the variables of interest. When there are many variables, it sometimes happens that people prefer to enter the data for each observational unit in several successive rows rather than in one extremely long row. Therefore the term record is sometimes used as a more precise term than row. But in practice, we usually think of each observational unit as corresponding to one row of data.

It is worthwhile to pause here to point out the breathtaking reach of statistical methods. The observational units in a study may be mice or people or stars or trees or cities or mineral specimens. The field of study may be the social sciences, life sciences, or physical sciences. In all of these disparate areas, the same statistical methods apply and provide a unifying methodology.

Though a data file may have more than one variable, we often begin by considering certain variables one at a time. The data corresponding to a single variable is called univariate data. A number, table, or visual display that gives information about a single variable is called a univariate descriptive. One of the most basic univariate descriptives is the frequency table. A variable generally has a number of possible values. The frequency of a particular value is the number of times that particular value occurs in the data column. The relative frequency is the frequency expressed as a percentage or proportion of the total number of observations under consideration. Here is the frequency table for the variable, gender, in table 1.
Frequency Table for Gender

value  Frequency  Relative Frequency  Cumulative Frequency
F      4          .44                 4
M      5          .56                 9

The table shows immediately the number of observational units of each gender, something that was not immediately apparent from the data file itself. Cumulative frequency is the sum of the frequencies of the values at or before the current one in the listing. Thus the cumulative frequency corresponding to the last value in the listing is always equal to the total number of observational units.

The next table is a frequency table for the variable, weight. The values of the variable have been sorted in ascending order, which is the usual practice in a frequency table for quantitative data. The sorted list of values gives information about the data that was not apparent before. For example, we can see at once that the smallest weight is 120 and the largest is 153. However, the reader may agree that the rest of the table is not very illuminating and gives little insight beyond what is given by the original data file.

Frequency Table for Weight

value  frequency  relative frequency  cumulative rel. freq.
120    1          11.1%               11.1%
121    1          11.1%               22.2%
125    2          22.2%               44.4%
134    1          11.1%               55.6%
136    1          11.1%               66.7%
140    1          11.1%               77.8%
147    1          11.1%               88.9%
153    1          11.1%               100.0%
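Frequency tables like these are easy to produce by machine. The following Python sketch (our own illustration, not part of the original notes) computes the columns above from the raw weight list:

```python
from collections import Counter

def frequency_table(values):
    """Rows of (value, frequency, relative freq., cumulative rel. freq.),
    with the values sorted in ascending order as in the table above."""
    n = len(values)
    counts = Counter(values)
    rows, cumulative = [], 0.0
    for value in sorted(counts):
        relative = counts[value] / n
        cumulative += relative
        rows.append((value, counts[value], relative, cumulative))
    return rows

weights = [121, 136, 153, 125, 134, 120, 125, 147, 140]  # table 1
for value, freq, rel, cum in frequency_table(weights):
    print(f"{value}  {freq}  {rel:6.1%}  {cum:7.1%}")
```

The same function reproduces the gender table if the weight list is replaced by the gender column.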

A more informative table can be made by grouping the data into intervals. We do this in the next table, and the result is a more interesting display which conveys new insight about the data. An important point here is that the intervals are all the same length. Therefore the variation in frequency among the intervals says something about the variation in density of the observations from one location to another. In particular, the table with grouped data shows that the weights cluster more closely together (more densely) at the lower end of the weight range than at the higher end.

Frequency Table for Weight

interval    frequency  relative frequency  cumulative rel. freq.
[120, 130)  4          44.4%               44.4%
[130, 140)  2          22.2%               66.7%
[140, 150)  2          22.2%               88.9%
[150, 160)  1          11.1%               100.0%
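The grouping itself can be sketched in a few lines of Python. The interval edges below follow the table above, and the left-closed, right-open convention guarantees each value lands in exactly one interval (the helper name is our own):

```python
def grouped_frequencies(values, start, stop, width):
    """Count how many values fall in each interval [lo, lo + width),
    for lo = start, start + width, ... up to stop."""
    counts = {}
    for lo in range(start, stop, width):
        counts[(lo, lo + width)] = sum(1 for v in values if lo <= v < lo + width)
    return counts

weights = [121, 136, 153, 125, 134, 120, 125, 147, 140]
table = grouped_frequencies(weights, 120, 160, 10)
```

Dividing each count by the number of observations then gives the relative frequencies.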

In this table, frequency means the number of observations contained in the given interval. The intervals themselves are expressed in a common form of interval notation. In this notation, [120, 130) denotes the interval of real numbers x such that 120 ≤ x < 130. A graphical depiction of the interval [120, 130) is as follows:

[Figure: a number line marked from 110 to 135, with the interval from 120 up to, but not including, 130 highlighted.]

The square bracket signifies that the number 120 is included in the interval, while the parenthesis signifies that the number 130 is not. But 130 is contained in the next interval, [130, 140). It is always important to adopt some sort of endpoint convention like this when grouping data into intervals, so that every data value is in one and only one interval. A more common endpoint convention in practice is to choose borderlines that cannot contain any of the data values. Notice that the weights in our data file are given to the nearest integer. So if we make the borderlines between intervals be 110.5, 120.5, 130.5, 140.5, 150.5, and 160.5, then none of the data values can possibly fall on any borderline. Then again each data value is in one and only one interval. The frequency table that results with this choice of borderlines is as follows:

Weight          Frequency
110.5 to 120.5  1
120.5 to 130.5  3
130.5 to 140.5  3
140.5 to 150.5  1
150.5 to 160.5  1

The information in frequency tables is often presented in graphical form. For a nonnumerical variable we have the bar chart. A bar chart for gender appears below.

[Figure: bar chart showing the frequencies of the genders F and M.]

A bar chart corresponding to a frequency table for numerical data is called a histogram. The next display is a histogram for the variable, weight. It conveys exactly the same information as the frequency table, but in a visual manner.

[Figure: Histogram of Weight Values, with frequency (0 to 4) on the vertical axis and weight (120 to 160) on the horizontal axis. Borderline values are counted in the next higher interval.]


The given histogram is a frequency histogram, because it relates bar heights to frequencies. When bar heights are referenced to relative frequencies, the term relative frequency histogram is used. Stem-and-leaf plots and dotplots are other useful ways of giving visual impressions of distributions. These are shown below for the weight measurements in our data file.

Stem & Leaf Plot of Weight Values

12 | 0 1 5 5
13 | 4 6
14 | 0 7
15 | 3

[Figure: Dotplot of Weight Values, with one dot per observation above a number line running from 120 to 160.]

Measures of Location and Dispersion

In this section we will use the terminology of samples and populations. A number that is computed from a population, or that in some way characterizes a population, is called a parameter. A number computed from a sample is called a statistic. In this section we will define five statistics and three parameters for univariate quantitative data. However, first we need to identify two different kinds of populations. A population is said to be concrete if it is a definite finite set of observational units. The number of units in a concrete population is denoted N. For a concrete population, this number N is always fixed. An example of a concrete population is the set of timber wolves in the state of Wisconsin at the start of the day on January 1, 2000. It is a safe bet that no person knows exactly what the population size, N, is equal to for this population. But we know that it has some specific value. The other kind of population is an abstract population. This is a population with no definite size and whose observational units only come into existence as time progresses or as they are created by an experimenter. For example, when a chemist performs five measurements of the boiling point of a certain liquid so that he can report the average as his estimate of the true boiling point, that chemist is thinking of his five measurements as a sample from the population of measurements he or other chemists could take now or in the future. These measurements have no existence independent of the experiment, and accordingly their number is indefinite. We shall postpone until chapter 3 a precise definition of population parameter for abstract populations.

Throughout this chapter n will denote the number of observational units in a sample, and N the number of observational units in a concrete population. In this section we will focus on numerical variables like weight or height. For any one variable, its values make up one column of the data file. So we think of these values as simply a list of numbers. The mean of a list of numbers is just the average. It is important to avoid confusing population means with sample means, and so we use different symbols for them. We will often use x as the name of a variable. Then the data values may be called x-values.

Definition 1  The sample mean of a list of x-values is denoted x̄ and is defined by

    x̄ = (Σx)/n ,

where the sum is over all observations in the sample. The population mean is denoted μ and is defined similarly:

    μ = (Σx)/N .

This time, of course, the sum is over all observational units in the population.

Definition 2  The median of a list of n numbers is found by first listing the numbers in increasing order. Then, if n is odd, the median is the number in the middle position in the listing; if n is even, the median is the average of the two middle numbers in the listing.

There is no standard symbol for the median and no special symbol to distinguish population median from sample median. Both the median and mean are measures of location. Next we define measures of dispersion. A measure of dispersion indicates how spread out a set of numerical data values is. Variance, standard deviation, and range measure dispersion.

Definition 3  Sample variance is denoted s² and is defined by the formula

    s² = Σ(x − x̄)² / (n − 1) .

Sample standard deviation is denoted s and is defined to be the square root of the sample variance. Range is defined to be the largest data value minus the smallest data value.

Example 1  Consider the following two samples.

Sample 1: 2, 3, 3, 5, 7
Sample 2: 0, 2, 3, 6, 9

For these samples, present the five statistics we have defined and also give dotplots. Discuss how the five statistics reflect features of the dotplots.

Solution

          x̄   median   s²     s      range
Sample 1  4   3        4      2      5
Sample 2  4   3        12.5   3.54   9

The Five Statistics

[Figure: Dotplots of sample 1 and sample 2, each above a number line running from 2 to 10.]

The dotplots show clearly that sample 2 is more spread out (dispersed) than sample 1. This is reflected in the fact that the variance, standard deviation, and range are greater for sample 2 than for sample 1. Both dotplots show a greater density of values on the left side than on the right. Therefore the distributions are not symmetrical. This lack of symmetry is reflected in the fact that the median of each sample differs from the mean. Finally, both data sets have the same general location on the number line, and so one expects that the means should be at least close to one another. The fact that the means of the two sets are equal is consistent with this.
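The five statistics in the table can be checked directly from the definitions. Here is a short Python sketch (our own; the function name is hypothetical):

```python
def five_statistics(sample):
    """Compute mean, median, sample variance, sample standard
    deviation, and range directly from the definitions."""
    n = len(sample)
    mean = sum(sample) / n
    ordered = sorted(sample)
    if n % 2 == 1:
        median = ordered[n // 2]          # middle value
    else:
        median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    std_dev = variance ** 0.5
    data_range = max(sample) - min(sample)
    return mean, median, variance, std_dev, data_range

print(five_statistics([2, 3, 3, 5, 7]))  # sample 1
print(five_statistics([0, 2, 3, 6, 9]))  # sample 2
```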

Population variance is defined by a formula a little bit different from the analogous formula for sample variance. In both formulas variance may be computed as a fraction with numerator equal to the sum of the squares of the deviations of the observed values from the mean. In the denominator we put the number of observations minus 1 in the case of a sample, but for a population we put in the denominator simply the number of observations. That is, for population variance we divide by N instead of the N − 1 which would be analogous to the sample case.

Definition 4  Population variance is denoted σ² and is defined by

    σ² = Σ(x − μ)² / N .

Population standard deviation is denoted σ and is defined to be the square root of the population variance:

    σ = √( Σ(x − μ)² / N ) .

So, if population variance is defined with the number of values in the set as the denominator, why is that not done also for sample variance? The answer is that dividing by n − 1 for sample variance will give a sample variance that is, on the average, a more accurate estimate of the population variance. Why this is true requires a sophisticated understanding of theoretical concepts beyond the scope of this chapter. These concepts will be accumulated in the next two chapters, and the reason for n − 1 in the denominator for sample variance will be resolved in chapter 4. For now the main point is that calculators with built-in routines to compute variance and standard deviation often give the user the choice of which denominator to use, and the user should choose the correct one in each situation.

Interpreting the Standard Deviation

Standard deviation gives us information about how data values are distributed. A set of data values is said to have mound shaped distribution if a dotplot or histogram for the data would have approximately the shape of a symmetrical mound. In the following results, standard deviation may be computed as either population standard deviation (denominator N) or sample standard deviation (denominator n − 1). The results are valid either way.
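Python's standard library mirrors this calculator choice exactly: `statistics.variance` and `statistics.stdev` divide by n − 1, while `statistics.pvariance` and `statistics.pstdev` divide by N. A quick sketch (the data values are arbitrary):

```python
import statistics

data = [2, 3, 3, 5, 7]  # sum of squared deviations from the mean 4 is 16

print(statistics.variance(data))   # sample convention: divides by n - 1
print(statistics.pvariance(data))  # population convention: divides by N
```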

Empirical Rule  If a set of numerical data values has mound shaped distribution, then approximately 68% of the data values will lie within one standard deviation of the mean, and approximately 95% of the data values will lie within two standard deviations of the mean.

Chebyschev's Theorem  For any set of numerical data values, at least 3/4 of the data values will lie within two standard deviations of the mean, and at least 8/9 of the data values will lie within three standard deviations of the mean.

Chebyschev's theorem will be proved in chapter 4, and the Empirical Rule will be justified in chapter 5. Here are examples of the use of each.

Example 2  The mean diameter of red pine trees in a certain pine plantation is 38.7 inches with standard deviation 4.2 inches. Within what limits will the diameters of most trees fall? Assume red pine diameters in this plantation have mound shaped distribution.

Solution  This is an open-ended problem, with a variety of reasonable answers. In problems like these, the answer should be a full sentence that says something intelligent and correct about the distribution of tree diameters. The sentence should also be responsive to the actual question that is asked. So it should give some fact that pertains to most trees. Also the sentence should be self-contained. That is, all the information in it should be clear without reference to the statement of the problem. With that understanding, any of the following statements would be good answers.

1. Approximately 68% of the red pines in the plantation will have diameters between 34.5 inches and 42.9 inches.

2. Approximately 95% of the red pines in the plantation will be between 30.3 inches and 47.1 inches in diameter.

3. Approximately 84% of the red pines in the plantation will have diameters under 42.9 inches.
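The limits quoted in these answers are just the mean plus or minus k standard deviations. A quick check in Python (our own sketch, with a hypothetical helper name):

```python
def spread_limits(mean, std_dev, k):
    """Interval (mean - k*s, mean + k*s); for a mound shaped
    distribution, k=1 covers about 68% and k=2 about 95%."""
    return mean - k * std_dev, mean + k * std_dev

print(spread_limits(38.7, 4.2, 1))  # roughly 68% of diameters
print(spread_limits(38.7, 4.2, 2))  # roughly 95% of diameters
```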

Notice that in the third statement we have used the fact that the assumption of a mound shaped distribution includes the assumption that the distribution is symmetrical. The fact that we were asked to assume a mound shaped distribution is what makes the empirical rule the preferred tool for this problem rather than Chebyschev's Theorem.

Example 3  New car prices in 2002 averaged $26,500 with standard deviation $3100. What can you say about the distribution of new car prices?

Solution  As in the previous problem, the aim is to make some intelligent statement about the distribution of new car prices. The answer should be a full sentence, and should be self-contained so that a person could get the information from it without reading the problem. There is no reason to think that new car prices are mound shaped. Indeed, common sense says that they are not. Therefore it is much better to use Chebyschev's Theorem than the empirical rule. Any of the following statements would be a good answer to the question the problem asks.

1. In 2002 at least 75% of all new cars were priced between $20,300 and $32,700.

2. In 2002 at least 8/9 of all new cars were priced between $17,200 and $35,800.

3. In 2002 at least 75% of all new cars were priced under $32,700.

4. In 2002 at most one quarter of all new cars were priced over $32,700.

Notice that in statement 3 we have the same percentage as in statement 1. This is because Chebyschev's Theorem tells us nothing about where the values are that are not within 2 standard deviations of the mean. They may all be above the mean, they may all be below the mean, or some may be above and some below. We may not assume symmetry as we do when the data values are assumed to have mound shaped distribution.

Percentiles and Z-scores

One of the most common ways of reporting scores of grade school students on a standardized exam is to report percentile ranks. The percentile rank of a given raw score is the percentage of scores below the given one.
The use of percentiles or quantiles to identify locations within a univariate data distribution is very closely related to this.

Definition 5  Given a list of numbers, a p-th percentile is a number c such that p percent (or fewer) of the numbers in the list are less than c, and (100 − p) percent (or fewer) are greater than c.

Given a list of numbers and a percentage, p, the following algorithm will always produce a number c that satisfies the requirements of definition 5 for a p-th percentile.

1. Sort the numbers in the list in increasing order, and determine the sample size, n.

2. Compute p percent of n.

3. If the resulting number is not an integer, round it up to the next larger integer. The result of this step is an integer k to use in the next step.

4. Count k positions up from the smallest value in the list. The data value that is located at the k-th position is a correct value to report as the percentile we are seeking.

Example 4  Find a 60-th percentile location in the following data list.

4, 6, 6, 25, 7, 1, 12, 7, 18, 16, 23, 24

Solution  First we sort the list.

1, 4, 6, 6, 7, 7, 12, 16, 18, 23, 24, 25

There are 12 data values in the list, and .60 × 12 = 7.2. Rounding 7.2 upward gives the integer 8. Finally, counting 8 positions from the start of the list brings us to the number 16, which is therefore our answer. So we report 16 as a 60-th percentile for the given data.

It is instructive to verify that 16 indeed satisfies the requirements of definition 5 for a 60-th percentile. So, let us find the percentage of data values that are less than 16. There are 7 data values less than 16. That amounts to 7/12 of the full set of data values. Now 7/12 converts to the percentage 58.3%. This is less than 60%, as the definition requires. Similarly, the percentage of data values greater than 16 is 33.3%. But 100% − 60% = 40%, and 33.3% < 40%, which is the other requirement of definition 5. Thus 16 satisfies definition 5 as a 60-th percentile for the data set.

It is worth pointing out that the median which we have already defined is a 50-th percentile. Percents are easily converted to proportions. It is just a matter of moving the decimal point over two places. When we shift to the terminology of proportions, it is customary to replace the word percentile with the word quantile. Thus the 60-th percentile is the same as the .6 quantile, or the 3/5 quantile. The 1/4, 1/2, and 3/4 quantiles come up frequently enough that a special name has been bestowed upon the group. They are called quartiles. To be precise, the three are called the lower, middle, and upper quartiles, respectively. The interquartile range (IQR) is defined to be the difference between the upper and lower quartiles:

    IQR = upper quartile − lower quartile.

The IQR gives a measure of spread that is not sensitive to extreme values, and is particularly appropriate to use when the median is used as a measure of location.

The z-score is a simple device for standardizing data values. The idea is to replace a raw score, x, by its distance from the mean measured in standard deviations. That is,

    z = (x − x̄)/s .
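The four-step percentile algorithm translates almost line for line into Python (a sketch; the function name is ours):

```python
import math

def pth_percentile(data, p):
    """Sort the data, take p percent of n, round up to an integer k,
    and return the value in the k-th position (counting from 1)."""
    ordered = sorted(data)
    k = math.ceil(p / 100 * len(ordered))
    return ordered[k - 1]

scores = [4, 6, 6, 25, 7, 1, 12, 7, 18, 16, 23, 24]
print(pth_percentile(scores, 60))  # reproduces Example 4
```

Note that `math.ceil` leaves an exact integer alone, which matches step 3 of the algorithm.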

1.2 Bivariate Descriptives

Bivariate data is data that occupies two columns of a data file and comes from two variables. Bivariate descriptives are tables, visual displays, or statistics that reveal or measure some aspect of the relationship between two variables. When a data file contains many variables, there are often several pairs of variables to which bivariate methods may productively be applied. In this section we will consider contingency tables, scatterplots, least squares lines, and correlation coefficients.

Two-Way Frequency Tables and Scatterplots

We begin with an example of a data file with several variables. Consider the following scenario. The visitors to a certain city zoo during a certain afternoon purchase tickets upon arrival. As they leave the zoo later in the day, their tickets are collected and the variables age, gender, arrival time, and departure time are recorded. Age is recorded in years, and arrival and departure times are recorded to the nearest minute. Zoo visitors usually come in groups: couples, families, one adult supervising several children, etc. Solitary visitors can be considered to be groups of size 1. In the present study, this group phenomenon is recorded by means of a grouping variable called, appropriately, group. The values of the grouping variable are positive integers giving the order in which the groups arrived. Presumably the order of arrival is not as important to us as recording who was grouped with whom, and that is the main information the grouping variable allows us to preserve. Since each person in a group is given the same group number, that number shows who was in which group. The first few lines of the data file might look something like the following:

Name             Group  Age  Gender  arrival time  departure time
Kim Sangreen     001    52   M       1:05          2:15
Thelma Hurd      001    75   F       1:05          2:15
William Kasack   002    45   M       1:15          2:35
Emily Kasack     002    36   F       1:15          2:35
Scott Hoffman    003    43   M       1:15          2:45
Loretta Staller  004    36   F       1:12          3:42
Susan Staller    004    8    F       1:12          3:42
Maria VanCleve   004    8    F       1:12          3:42
Julie VanCleve   004    7    F       1:11          3:43
Ann Rasmussen    004    8    F       1:12          3:43
Shawn Baily      005    19   M       1:24          3:39
Chris Baily      005    22   M       1:24          3:39
...

Looking at the data we see that the first group to arrive, group number 001, contained two people named Kim and Thelma. The second group, group number 002, again contained two people. Group number 003 contained only one person, but group number 004 contained five people, apparently one adult woman and four girls. In this data file, the observational unit is the zoo visitor. We will not consider name to be a variable, since it simply identifies the observational unit. So there are five variables, giving many relationships to explore. Consider the bivariate data for the variables age and gender. One way to explore the relationship between these two variables is to examine a bivariate frequency table. If we group age values into several intervals, and consider the entire data file, not just the few lines reproduced above, we will obtain the following kind of table:

                         Age
           under 18  [18,40]  [41,60]  over 60
gender  M     241       109       97        6
        F     237       385      103       11
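A table of this kind can be tabulated mechanically from raw (gender, age) records. Here is a Python sketch, run on the first five visitors from the zoo file above (the helper names are our own):

```python
def age_group(age):
    """Place an age into one of the four intervals used in the table."""
    if age < 18:
        return "under 18"
    if age <= 40:
        return "[18,40]"
    if age <= 60:
        return "[41,60]"
    return "over 60"

def contingency_table(records):
    """Count the (gender, age group) pairs among (gender, age) records."""
    counts = {}
    for gender, age in records:
        cell = (gender, age_group(age))
        counts[cell] = counts.get(cell, 0) + 1
    return counts

visitors = [("M", 52), ("F", 75), ("M", 45), ("F", 36), ("M", 43)]
table = contingency_table(visitors)
```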

A bivariate frequency table like this one is also called a contingency table. The present table would be called a 2 by 4 contingency table, to express the fact that the body of the table consists of two rows and four columns. This gives 8 cells in the body of the table. An m by n contingency table will have m × n cells. The numbers in the individual cells are frequencies. Thus we see that there were 385 females in the age range 18 to 40. A contingency table can be very powerful at revealing relationships between variables. In the present example we see at once that in the age range [18,40], females outnumber males more than three to one, and that this does not come close to happening for the other age ranges. A contingency table is good for qualitative variables and for numerical variables whose values have been grouped into intervals. For two continuous quantitative variables for which we do not wish to group the data into intervals, a scatterplot is often useful. Figure (1) gives a scatterplot of departure time − arrival time versus age for zoo visitors who were over 60 years old and male. Each male over 60 years old corresponds to a point plotted on the graph. The abscissa (horizontal coordinate) of the point is the man's age, while the ordinate (vertical coordinate) is the duration of the man's visit.

The Least Squares Line

A scatterplot for bivariate numerical data sometimes shows a roughly linear relationship between two variables. This suggests the problem of determining the line that best fits the set of data points in a scatterplot. A brief review of lines and their equations is in order.

[Scatterplot: duration of visit in minutes (60 to 240) on the vertical axis, age (60 to 70) on the horizontal axis.]

Figure 1: Scatterplot of Duration-of-visit vs. Age for Males over 60

Consider a pair of mutually perpendicular axes in the plane, one axis being vertical and the other horizontal. Let these axes be marked off as number lines. Then points in the plane may be represented as ordered pairs of numbers, the first entry being the abscissa or horizontal coordinate and the second entry being the ordinate or vertical coordinate. It is customary to label the vertical axis the y-axis and the horizontal axis the x-axis. Then the point (2,1) would be the point whose x-coordinate is 2 and whose y-coordinate is 1. Recall that a line in the (x, y)-plane usually satisfies an equation of the form y = mx + b, where m and b are specific numbers. We say usually because vertical lines are an exception. We will focus on lines which are not vertical, and then the equation of the line may always be put in the form y = mx + b. Here m is called the slope and b, the y-intercept. It is easy to see why the graph of an equation of the form y = mx + b is a straight line. Consider the specific example, y = .5x + 1. To sketch the graph of this equation in the (x, y)-plane, let us start at the point on the line whose x-coordinate is 0. From the equation, we see that the y-coordinate must be 1. So our starting point is (0,1). Now if x is increased by 1, then y will

increase by .5, landing us on the point (1, 1.5). If x is increased again by 1, then y will increase again by .5, bringing us to the point (2, 2), and so on. What makes the graph straight is that the steepness, or slope, between any two points is constant. We may define the slope of a line as the amount of increase in y for every unit increase in x. The reader may be more familiar with the definition, slope = rise/run, which is equivalent. In the field of statistics it is customary to rename the y-intercept and slope as b0 and b1 respectively, and to rewrite the general linear equation as y = b0 + b1 x. Given a linear equation, to plot the corresponding line in the (x, y)-plane, one may simply plot two points that satisfy the line's equation and then draw the line determined by those two points. Two points determine a line, and a basic skill in beginning analytic geometry or precalculus is to write the equation of a line passing through two given points. What we are about to do is more general. We want the equation of the line that comes as close as possible to passing through several points. This will be a generalization of the process for finding the equation of the line through two points, which is why we do not need to review that topic here. What we will need to do at the outset is to decide on a criterion for measuring how well a given line fits a given set of points.

When we write an equation of the form y = b0 + b1 x, we naturally tend to think of the equation as a formula for y in terms of x. So, in a sense, y depends on x, and we call y the dependent variable and x the independent variable. Given a specific example, we could solve algebraically for x in terms of y. Then our equation would display x as the dependent variable and y as the independent one. So which variable is considered the dependent one and which the independent one is a characteristic of the form of an equation, not a characteristic of the underlying relationship between the variables.
When the two variables are x and y, it is almost universally the custom to arrange that y be the dependent variable. Now suppose we are given n data points, (x1, y1), (x2, y2), . . . , (xn, yn), and a linear equation y = b0 + b1 x. Think of the xi as n pre-determined numbers, fixed once and for all, for purposes of some experiment, perhaps. Maybe the xi are temperature settings and we are interested in investigating the values of some response variable, y, at each of these settings. Then the yi are the observed y-values at the settings, and we will use ŷi to denote the value

    ŷi = b0 + b1 xi ,

computed from our equation. Think of ŷi as the value predicted by the equation for the variable y when x is equal to xi. Then the difference, yi − ŷi, would be the discrepancy between the observed value and the computed value. This discrepancy can be thought of as an amount of error. The best fitting line will correspond to the linear equation that in some sense minimizes the totality of these errors. Of course some of the errors may be positive and some negative. If we simply add all the errors together and use that as a measure of how well the line fits the data, then we would find that for any slope, we could adjust the position of a line with that slope to reduce the total error to zero, or to any positive or negative value for that matter. We need a more discerning measure of fit than that. So we will replace the errors by their squares and then sum. Thus our measure of how well a line fits the data points will be the sum of the squares of the errors. This sum can never be less than zero, and the smaller it is the better the fit. The best fitting line therefore will be the line that minimizes the sum of the squares of the errors. This brings us to the following definition.

Definition 6  Given data points, (xi, yi), i = 1, 2, . . . , n, and a line, y = b0 + b1 x, let ŷi = b0 + b1 xi. Define SSE by

    SSE = Σ(yi − ŷi)² ,

where the sum runs from i = 1 to n.

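In code, this sum of squared errors is a one-line computation. The sketch below is ours (the function name `sse` and the sample data are invented for illustration); it evaluates the errors yi − ŷi for a candidate line and sums their squares.

```python
def sse(points, b0, b1):
    """Sum of squared errors of the line y = b0 + b1*x over (x, y) pairs."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in points)

pts = [(1, 2), (3, 6), (6, 5)]      # made-up data for illustration
# Errors for the line y = x: 2-1 = 1, 6-3 = 3, 5-6 = -1.
print(sse(pts, 0.0, 1.0))           # 1 + 9 + 1 = 11.0
```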
This is called the sum of squares for error. The least squares line for the given data points is the line y = b0 + b1 x with b0 and b1 determined so that SSE is as small as possible.

The least squares line as just defined is what we were previously calling the best fitting line, and, for reasons still to come, is often called the regression line. These three terms may be used interchangeably. We will now present formulas for calculating the coefficients b0 and b1 of the least squares line for a given set of data points, but first we define some notation that will greatly simplify the formulas.

Definition 7 Given data points, (xi, yi), i = 1, 2, . . . , n, the statistics SSxy, SSxx, and SSyy are defined as follows:

SSxy = Σ (xi − x̄)(yi − ȳ)
SSxx = Σ (xi − x̄)²
SSyy = Σ (yi − ȳ)².
The notation SS comes from the words Sum of Squares. While one of the three SS sums is not a sum of squares, the other two are, and the one that is not would be if x and y were the same, so it is in a way a generalized sum of squares. That is admittedly a weak rationale for the notation, but it is the best there is, and the notation is pretty standard, so we shall use it. Notice that to compute these SS quantities one must first compute the sample means of x and y. The fundamental result at which we have been aiming is Theorem 1, which follows.

Theorem 1 Given a set of n bivariate data points, (x1, y1), (x2, y2), . . . , (xn, yn), the best fitting line to the data points is the line y = b0 + b1 x, where

b1 = SSxy / SSxx   and   b0 = ȳ − b1 x̄.
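Theorem 1's formulas translate directly into code. The sketch below is ours (the function name `least_squares` is invented for illustration); it computes SSxy and SSxx from Definition 7 and then the coefficients from Theorem 1. As a sanity check, data lying exactly on y = 1 + 2x should return those coefficients.

```python
def least_squares(points):
    """Fit y = b0 + b1*x by the formulas of Theorem 1."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    ss_xy = sum((x - xbar) * (y - ybar) for x, y in points)
    ss_xx = sum((x - xbar) ** 2 for x, _ in points)
    b1 = ss_xy / ss_xx               # slope
    b0 = ybar - b1 * xbar            # intercept
    return b0, b1

# Points that lie exactly on y = 1 + 2x:
print(least_squares([(0, 1), (1, 3), (2, 5)]))  # (1.0, 2.0)
```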

Before proving this theorem, we present an example. Example 5 Consider the following set of bivariate data.

x 1 3 6 7 8
y 2 6 5 5 7

Sketch the scatterplot of y versus x, find the equation of the best fitting line, and compute SSE for the best fitting line.

Solution The scatterplot is easy, and is given in figure (2). Now we compute b1 and b0. For the given data, x̄ = 5, ȳ = 5, SSxy = 16, and SSxx = 34. From Theorem 1, b1 = .471 and b0 = 2.645. Therefore the least squares line is

[Figure 2: Scatterplot of y vs. x for Example 5.]

y = 2.645 + .471 x.

To compute SSE, add the computed values ŷ for each i to the table of (x, y) values.

x 1 3 6 7 8
y 2 6 5 5 7
ŷ 3.12 4.06 5.47 5.94 6.41

We find that SSE = 6.47.
Proof of Theorem 1 The equation we seek, y = b0 + b1 x, will be determined by the criterion that

SSE(b0, b1) = Σ_{i=1}^{n} [yi − (b0 + b1 xi)]²    (1)
shall be as small as possible. In (1) we have indicated explicitly that SSE depends upon b0 and b1. The expression in brackets in equation (1) is the difference between the observed value of y and the value predicted from a knowledge of x. It can be thought of as the error. Our next step is to rework this error expression as follows.

yi − (b0 + b1 xi) = (yi − ȳ) − b1 xi − (b0 − ȳ)
  = (yi − ȳ) − b1 (xi − x̄) − (b0 − ȳ + b1 x̄)
  = [(yi − ȳ) − b1 (xi − x̄)] − (b0 − ȳ + b1 x̄)    (2)

Squaring expression (2) gives

[yi − (b0 + b1 xi)]² = [(yi − ȳ) − b1 (xi − x̄)]²    (3)
  − 2[(yi − ȳ) − b1 (xi − x̄)](b0 − ȳ + b1 x̄)    (4)
  + (b0 − ȳ + b1 x̄)²    (5)
Lines (3), (4), and (5) give a three-term expression for the square of the error term. So we have made the simple expression on the left hand side of line (3) at least three times as complicated. It had better be worth it. Fortunately it is, because a truly elegant thing is going to happen. We shall sum from 1 to n. We may sum the three terms on lines (3), (4), and (5) separately. In the summation, the first and third terms yield nonnegative sums. However, the middle term elegantly sums to 0 for us, because the deviations of a variable about its mean sum to zero. One additional property of the third term is that it does not contain the summation variable, i, and so summing the third term from 1 to n is the same as adding up n copies of the term, or multiplying by n. Thus summing the terms (3), (4), and (5) from 1 to n yields

SSE = Σ_{i=1}^{n} [(yi − ȳ) − b1 (xi − x̄)]² + n(b0 − ȳ + b1 x̄)²    (6)
An important fact here is that both terms on the right hand side of equation (6) are nonnegative. A second fact is that the second term may be reduced to zero by setting b0 = ȳ − b1 x̄. Therefore all we need to do is figure out the value of b1 that will minimize the first term.

Minimizing the first term of SSE in equation (6) is easy provided we have the tool of differential calculus. We want to minimize Σ [(yi − ȳ) − b1 (xi − x̄)]². View this as a function of b1, and call it f(b1). A minimum of this function will occur at the value of b1 at which the derivative of f is 0. So we shall differentiate f, set the result equal to zero, and solve for b1. The details are as follows.
f(b1) = Σ_{i=1}^{n} [(yi − ȳ) − b1 (xi − x̄)]²

(d/db1) f(b1) = Σ_{i=1}^{n} 2[(yi − ȳ) − b1 (xi − x̄)](−1)(xi − x̄)
  = Σ_{i=1}^{n} 2[−(yi − ȳ)(xi − x̄) + b1 (xi − x̄)²]
  = −2 Σ_{i=1}^{n} (yi − ȳ)(xi − x̄) + 2 b1 Σ_{i=1}^{n} (xi − x̄)²
  = −2 SSxy + 2 b1 SSxx

Setting the last expression equal to zero and solving for b1 gives the desired result and completes the proof.

The Correlation Coefficient

An important statistic for bivariate numerical data is the correlation coefficient. As usual we shall consider our data to be a sample from some population.

Definition 8 The sample correlation coefficient of two variables x and y is denoted r and is defined by either of the equivalent formulas,

r = SSxy / √(SSxx SSyy)

or

r = Σ (x − x̄)(y − ȳ) / √[ (Σ (x − x̄)²)(Σ (y − ȳ)²) ].
This statistic is also called the Pearson product moment coefficient of correlation, after the statistician Karl Pearson. Given two variables, x and y, in a data file, a positive value of the correlation coefficient between them indicates that as one variable increases the other one also tends to increase. A negative value of r indicates that as one variable increases the other tends to decrease. If the two variables are independent of one another or unrelated, then r will tend to be zero or close to zero. However, the converse of this last statement is not true. The correlation coefficient of one variable with another can be zero even when there is a strong relationship between the two variables.

Example 6 The correlation coefficient r for the data,

x −2 −1 0 1 2
y 4 1 0 1 4

is zero, even though x and y satisfy the equation, y = x².
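Definition 8 is straightforward to compute directly. The sketch below is ours (the function name `correlation` is invented for illustration); it implements r = SSxy / √(SSxx SSyy) and confirms that parabola data, y = x² at x = −2, −1, 0, 1, 2, give r = 0 despite the exact functional relationship.

```python
from math import sqrt

def correlation(points):
    """Pearson sample correlation coefficient r = SSxy / sqrt(SSxx * SSyy)."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    ss_xy = sum((x - xbar) * (y - ybar) for x, y in points)
    ss_xx = sum((x - xbar) ** 2 for x, _ in points)
    ss_yy = sum((y - ybar) ** 2 for _, y in points)
    return ss_xy / sqrt(ss_xx * ss_yy)

pts = [(x, x ** 2) for x in (-2, -1, 0, 1, 2)]  # y = x**2, as in Example 6
print(correlation(pts))  # 0.0
```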
