
CORRELATION & REGRESSION

Correlation and regression are concerned with the investigation of relationships between two or more variables.

We consider just two associated variables. We might want to know:




- If a relationship exists between those variables
- If so, how strong that relationship is
- What form that relationship takes
- Whether we can make use of that relationship for predictive purposes, i.e. forecasting

Correlation is used to find the strength of the relationship.
Regression describes the relationship itself, in the form of an equation which best fits the data.

General method for investigating the relationship between 2 variables:

- For an initial insight into the relationship between two variables: plot a scatter diagram.
- If there appears to be a linear relationship, quantify it: calculate the correlation coefficient. This is a measure of the strength of the linear relationship. Its symbol is 'r' and its value lies between -1 and +1.
- If the relationship is found to be significantly strong: find the equation of the line of best fit through the data, using linear regression.
- The 'goodness of fit' statistic can be calculated to see how useful the regression equation is likely to be.
- Once defined by an equation, the relationship can be used for predictive purposes.

Example
The data represent a sample of advertising expenditures and sales for ten randomly selected months. See slide 12 for complete data.

Month   Advertising expenditure (0,000's), x   Sales (0,000's), y
1       1.2                                    101
2       0.8                                    92
3       1.0                                    110
etc.

Plot a scatter diagram of the data

[Figure: plot of sales (0,000's) against advertising expenditure (0,000's). Note: the scales do not start at zero.]

The graph suggests a linear relationship between sales and advertising expenditure. The larger the amount spent on advertising the higher the sales in general.

If there is a relationship, we need to be able to measure the strength of that relationship, i.e. calculate the value of the correlation coefficient.

Pearson's Product Moment Correlation Coefficient (r):
- is a measure of how close to linear the relationship between x and y is
- can be produced directly from a calculator in LR (linear regression) mode

For the sales and advertising data, the correlation coefficient is r = 0.875.
The value of r is always between -1 and +1.

[Figures: five example scatter plots of y against x, one for each of the following values of r]

r = -1: perfect negative correlation
r = -0.7
r = 0: no correlation
r = +0.8
r = +1: perfect positive correlation

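Formula for correlation coefficient, r

$$ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} $$

where

$$ S_{xx} = \sum x^2 - \frac{\sum x \sum x}{n}, \qquad S_{yy} = \sum y^2 - \frac{\sum y \sum y}{n}, \qquad S_{xy} = \sum xy - \frac{\sum x \sum y}{n} $$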

Longhand calculations for correlation coefficient r.


Step 1
x = advertising expenditure (0,000's), y = sales (0,000's)

Month    x      y      x^2     y^2      xy
1        1.2    101    1.44    10201    121.2
2        0.8    92     0.64    8464     73.6
3        1.0    110    1.00    12100    110.0
4        1.3    120    1.69    14400    156.0
5        0.7    90     0.49    8100     63.0
6        0.8    82     0.64    6724     65.6
7        1.0    93     1.00    8649     93.0
8        0.6    75     0.36    5625     45.0
9        0.9    91     0.81    8281     81.9
10       1.1    105    1.21    11025    115.5
Totals   9.4    959    9.28    93569    924.8

Step 2
Therefore:

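$$ S_{xx} = \sum x^2 - \frac{\sum x \sum x}{n} = 9.28 - \frac{9.4 \times 9.4}{10} = 0.444 $$

$$ S_{yy} = \sum y^2 - \frac{\sum y \sum y}{n} = 93569 - \frac{959 \times 959}{10} = 1600.9 $$

$$ S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 924.8 - \frac{9.4 \times 959}{10} = 23.34 $$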

Step 3
Therefore:

$$ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{23.34}{\sqrt{0.444 \times 1600.9}} = 0.875 $$
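The three longhand steps can be checked with a short script. This is a minimal sketch in plain Python using the ten data pairs from the table; the function names `corrected_sums` and `pearson_r` are illustrative choices, not part of the original slides.

```python
import math

# Sample data from the worked example (both in units of 10,000).
advertising = [1.2, 0.8, 1.0, 1.3, 0.7, 0.8, 1.0, 0.6, 0.9, 1.1]
sales = [101, 92, 110, 120, 90, 82, 93, 75, 91, 105]


def corrected_sums(x, y):
    """Return (Sxx, Syy, Sxy) using the sums-of-products formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sxx = sum(v * v for v in x) - sum_x * sum_x / n
    syy = sum(v * v for v in y) - sum_y * sum_y / n
    sxy = sum(p * q for p, q in zip(x, y)) - sum_x * sum_y / n
    return sxx, syy, sxy


def pearson_r(x, y):
    """Pearson's product moment correlation coefficient."""
    sxx, syy, sxy = corrected_sums(x, y)
    return sxy / math.sqrt(sxx * syy)


sxx, syy, sxy = corrected_sums(advertising, sales)
r = pearson_r(advertising, sales)
print(f"Sxx = {sxx:.3f}, Syy = {syy:.1f}, Sxy = {sxy:.2f}, r = {r:.3f}")
# prints: Sxx = 0.444, Syy = 1600.9, Sxy = 23.34, r = 0.875
```

The printed values match the longhand results, which is a quick way to catch arithmetic slips in the table totals.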

Hypothesis test for the value of r


We shall not go into the details here!

Null hypothesis (H0): A linear relationship does not exist between sales and advertising.
Alternative hypothesis (H1): A linear relationship does exist between sales and advertising.
If we calculate a test statistic and critical value, we find that test statistic > critical value, so we reject H0.

Conclude that a linear relationship exists between sales and amount spent on advertising.
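The slides do not say which test is used. One common choice, assumed here, is the t test for a correlation coefficient: t = r √(n − 2) / √(1 − r²) with n − 2 degrees of freedom, compared against a two-tailed 5% critical value. The sketch below uses that assumption; the critical value 2.306 is taken from standard t tables for 8 degrees of freedom, and the function name `t_statistic` is illustrative.

```python
import math


def t_statistic(r, n):
    """t statistic for H0: no linear relationship (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


r = 0.875           # correlation coefficient from the worked example
n = 10              # number of months in the sample
t_critical = 2.306  # assumed: two-tailed 5% point of Student's t, 8 d.f. (from tables)

t = t_statistic(r, n)
reject_h0 = abs(t) > t_critical
print(f"t = {t:.2f}, critical value = {t_critical}, reject H0: {reject_h0}")
```

Under these assumptions the test statistic is well above the critical value, which agrees with the decision to reject H0.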

The Goodness of Fit Statistic ($R^2$)
This also measures the closeness of the relationship between x and y.

$$ R^2 = 100\,r^2 $$

$R^2$ tells us what percentage of the total variation in y (here sales) is explained by the variation in x (here advertising expenditure).

Interpretation:


If r = +1 or -1, then $R^2$ = 100%, so 100% of the variation in y is explained by the variation in x.
If r = 0, then $R^2$ = 0%, so none of the variation in y is explained by the variation in x.
For the data above, the goodness of fit statistic is $R^2 = 100\,r^2 = 100 \times 0.875^2 = 76.6\%$.

76.6% of the variation in sales is explained by the variation in the amount spent on advertising. The remaining 23.4% of the variation is explained by other factors, e.g. price, competitors' prices, etc.
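To see what "explained variation" means numerically, the sketch below fits the least-squares line (the regression line covered in the next part) and compares the leftover variation around that line with the total variation in sales. It is an illustration under that framing, not a calculation taken from the slides; all variable names are illustrative.

```python
# Sample data from the worked example (both in units of 10,000).
advertising = [1.2, 0.8, 1.0, 1.3, 0.7, 0.8, 1.0, 0.6, 0.9, 1.1]
sales = [101, 92, 110, 120, 90, 82, 93, 75, 91, 105]

n = len(advertising)
mean_x = sum(advertising) / n
mean_y = sum(sales) / n

sxx = sum((x - mean_x) ** 2 for x in advertising)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(advertising, sales))
syy = sum((y - mean_y) ** 2 for y in sales)  # total variation in y

# Least-squares line y = a + b x.
b = sxy / sxx
a = mean_y - b * mean_x

# Variation in y left over after the line has been fitted.
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(advertising, sales))

r_squared_pct = 100 * (1 - unexplained / syy)    # percentage explained
r_squared_from_r = 100 * sxy ** 2 / (sxx * syy)  # 100 * r^2

print(f"explained: {r_squared_pct:.1f}%, 100 r^2: {r_squared_from_r:.1f}%")
```

Both routes give the same percentage, so 100r² can be read directly as the share of the variation in sales accounted for by the fitted line.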

Regression equation Since we know, for the sample data, that there is a significant relationship between the two variables, the next obvious step is to find its equation. We can then add the regression line to the scatter diagram and use it to predict future sales, given advertising expenditure for a particular month. The regression equation can be produced directly from a calculator in LR mode.

The regression line has the equation: y = a + bx
- x is the independent variable
- y is the dependent variable
- a is the intercept on the y-axis
- b is the gradient or slope of the line

For the sales and advertising data, the values of a and b are 46.5 and 52.6, so the regression equation is:
y = 46.5 + 52.6x
Sales = 46.5 + 52.6 advertising
(a and b can be found using LR mode on your calculator or by calculation)

Formula for a and b

The line of best fit is found by squaring the differences between the actual y values and the values expected from the line. We choose a and b so that the total of these squared differences is minimized:

$$ b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\,\bar{x} $$

where $\bar{x}$, $\bar{y}$ are the means of the x and y data and the S's are defined as previously. The point $(\bar{x}, \bar{y})$ is called the centroid.
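For readers who want to see where these formulas come from, here is a brief sketch of the standard least-squares derivation. The symbol Q for the total squared difference is introduced here for illustration and does not appear in the slides.

```latex
\begin{align*}
Q(a, b) &= \sum (y - a - bx)^2 \\[4pt]
\frac{\partial Q}{\partial a} = -2 \sum (y - a - bx) = 0
  &\;\Rightarrow\; a = \bar{y} - b\,\bar{x} \\[4pt]
\frac{\partial Q}{\partial b} = -2 \sum x\,(y - a - bx) = 0
  &\;\Rightarrow\; \sum xy - a \sum x - b \sum x^2 = 0 \\[4pt]
\text{substituting } a = \bar{y} - b\,\bar{x}:\quad
  \sum xy - \frac{\sum x \sum y}{n}
  &= b \left( \sum x^2 - \frac{\sum x \sum x}{n} \right) \\[4pt]
  &\;\Rightarrow\; b = \frac{S_{xy}}{S_{xx}}
\end{align*}
```

The first condition also shows that the fitted line always passes through the centroid, since setting $x = \bar{x}$ gives $y = a + b\,\bar{x} = \bar{y}$.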

Calculations for the regression equation.
In the regression equation y = a + bx:

$$ b = \frac{S_{xy}}{S_{xx}} = \frac{23.34}{0.444} = 52.6 $$

$$ a = \bar{y} - b\,\bar{x} = 95.9 - 52.6 \times 0.94 = 46.5 $$

(As $\bar{y} = \frac{\sum y}{n} = \frac{959}{10} = 95.9$ and $\bar{x} = \frac{\sum x}{n} = \frac{9.4}{10} = 0.94$)

Therefore the regression equation is y = 46.5 + 52.6x
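The same calculation can be sketched in a few lines of plain Python. The function name `fit_line` is an illustrative choice; the data are the ten months from the worked example.

```python
# Sample data from the worked example (both in units of 10,000).
advertising = [1.2, 0.8, 1.0, 1.3, 0.7, 0.8, 1.0, 0.6, 0.9, 1.1]
sales = [101, 92, 110, 120, 90, 82, 93, 75, 91, 105]


def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    sxy = sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y) / n
    b = sxy / sxx           # gradient
    a = mean_y - b * mean_x  # intercept, so the line passes through the centroid
    return a, b


a, b = fit_line(advertising, sales)
print(f"y = {a:.1f} + {b:.1f}x")
# prints: y = 46.5 + 52.6x
```

Because the script keeps full precision for b before computing a, its unrounded values differ slightly from the hand calculation, but both round to the same equation.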

Plotting the regression equation on the scatter diagram.
The line y = a + bx can be plotted on the scatter diagram by plotting three points: the centroid $(\bar{x}, \bar{y})$ and any other two points which satisfy the regression equation.

From the data, $(\bar{x}, \bar{y})$ = (0.94, 95.9). Plot (0.94, 95.9)
When x = 0.6, y = 46.5 + (52.6 x 0.6) = 78.06. Plot (0.6, 78.06)
When x = 1.2, y = 46.5 + (52.6 x 1.2) = 109.6. Plot (1.2, 109.6)

[Figure: plot of sales (0,000's) against advertising expenditure (0,000's), with the points for the regression line marked.]

Note
- The regression equation y = a + bx can only be used to calculate an estimate for y given the value of x.
- The linear relationship y = a + bx can only be assumed to exist between y and x for the range of values within the sample.

Interpreting the coefficients in the regression equation: first, the a value
y = 46.5 + 52.6x
Sales = 46.5 + 52.6 advertising

The intercept (a) is the estimate of y when x = 0, but care is needed if using this. Why? Because x = 0 lies outside the range of advertising values in the sample (0.6 to 1.3).

When x = 0, y = 46.5, i.e. when nothing is spent on advertising, sales would be expected on average to be 46.5 units = 46.5 x 10,000 = 465,000.

the b value
y = 46.5 + 52.6x

If x = 0,   y = 46.5, but care is needed here!
If x = 0.6, y = 46.5 + (52.6)(0.6) = 78.06
If x = 0.8, y = 46.5 + (52.6)(0.8) = 88.58
If x = 1,   y = 46.5 + 52.6 = 99.1
If x = 1.2, y = 46.5 + (52.6)(1.2) = 109.62
If x = 2,   y = 46.5 + 52.6 x 2 = 151.7, but care is needed here also!
etc.

So if advertising expenditure is increased by 1 unit, sales will be increased by 52.6 units on average.

For each additional 10,000 spent on advertising, sales will increase by 52.6 x 10,000 = 526,000 on average.
But we cannot estimate sales outside the range of the sample, e.g. we should not try to estimate sales for x = 5 using this method.
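The warning about staying within the sample range can be built into a prediction helper. The sketch below is one possible way to do that, assuming the fitted equation y = 46.5 + 52.6x and the smallest and largest advertising values in the sample (0.6 and 1.3); the name `predict_sales` and the choice to raise an error are illustrative.

```python
A, B = 46.5, 52.6        # intercept and gradient from the worked example
X_MIN, X_MAX = 0.6, 1.3  # range of advertising values in the sample


def predict_sales(advertising):
    """Estimate sales (0,000's) for an advertising spend (0,000's).

    Refuses to extrapolate beyond the range of the sample data.
    """
    if not X_MIN <= advertising <= X_MAX:
        raise ValueError(
            f"x = {advertising} is outside the sample range "
            f"{X_MIN} to {X_MAX}; the linear relationship cannot be assumed"
        )
    return A + B * advertising


print(round(predict_sales(1.0), 1))  # prints: 99.1

try:
    predict_sales(5)
except ValueError as err:
    print(err)
```

Raising an error is a deliberately strict design; a softer variant could return the estimate together with a warning flag instead.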
