Chapter 14 Solutions

Develop Your Skills 14.1

1. Scatter diagrams are shown below.
[Scatter diagram: Salary ($000) vs. Age]
[Scatter diagram: Salary ($000) vs. Years of Postsecondary Education]
[Scatter diagram: Salary ($000) vs. Years of Experience]
All three scatter diagrams show the expected positive relationships. Salary appears to be positively and linearly related to age, years of postsecondary education, and years of experience. We note that the variability in salary increases at older ages and at greater years of experience. Salary seems more strongly related to age for ages under about 40, and more strongly related to years of experience under about 15 years. When salary is plotted against years of postsecondary education, the variability is greater than for the other explanatory variables, but it is also more constant across the range. At this point, years of experience appears to be the strongest candidate as an explanatory variable.

2. An excerpt of the Regression output is shown below.
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.928389811
R Square            0.861907642
Adjusted R Square   0.850399945
Standard Error      6.844780506
Observations        40

ANOVA        df
Regression    3
Residual     36
Total        39

                                    Coefficients
Intercept                           27.70373012
Age                                 -0.3191034
Years of Postsecondary Education    2.846348768
Years of Experience                 1.568477845
The model is as follows:
Salary ($000) = 27.7 - 0.3 (Age) + 2.8 (Years of Postsecondary Education) + 1.6 (Years of Experience)
In other words, salary is $27,704, minus $319 for each year of age, plus $2,846 for each year of postsecondary education, plus $1,568 for each year of experience. The negative coefficient for age does not seem appropriate, and points to problems with this model.
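The fitted equation above can be evaluated directly. A minimal sketch (the function name predict_salary is ours; the coefficients are taken from the Excel output above, with salary in $000s):

```python
# Sketch: evaluating the fitted three-variable regression equation by hand.
# Coefficients come from the Excel output above (salary measured in $000s).
def predict_salary(age, years_postsec, years_exp):
    """Predicted salary ($000) from the three-variable model."""
    return (27.70373012
            - 0.3191034 * age
            + 2.846348768 * years_postsec
            + 1.568477845 * years_exp)

# A 47-year-old with 5 years of postsecondary education and 17 years of
# experience (the data point examined in Exercise 10):
print(round(predict_salary(47, 5, 17), 1))  # 53.6, i.e., about $53,600
```

This reproduces the predicted salary of $53,600 quoted for data point 26 in Exercise 10.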
3. The scatter diagram for age and years of experience is shown below. Note that the age axis starts at 20, since there are no workers under 20 years of age.
[Scatter diagram: Years of Experience vs. Age]
The two variables are very closely related, as we would expect. It is not possible to acquire years of experience without also acquiring years of age. It does not make sense to include both explanatory variables in the model.
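The "very closely related" claim is a statement about correlation. A minimal sketch of the Pearson correlation coefficient in plain Python (the small age/experience sample below is made up for illustration; the exercises use the actual Salaries data set):

```python
import math

# Sketch: Pearson correlation coefficient, the statistic behind the
# "closely related" observation. Data below are illustrative only.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

ages = [25, 32, 40, 47, 55, 61]
experience = [2, 8, 15, 20, 30, 35]
print(round(pearson_r(ages, experience), 3))
```

For data like these, where experience rises almost in lockstep with age, r is close to 1, which is exactly the collinearity problem discussed above.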
4. An excerpt of the Regression output for the salaries data set with years of postsecondary education and age as explanatory variables is shown below.
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.902574999
R Square            0.814641629
Adjusted R Square   0.804622258
Standard Error      7.822241019
Observations        40

ANOVA        df
Regression    2
Residual     37
Total        39

                                    Coefficients
Intercept                           -2.125320671
Age                                 1.121515567
Years of Postsecondary Education    2.155278577
The model is as follows:
Salary ($000) = -2.1 + 1.1 (Age) + 2.2 (Years of Postsecondary Education)
In other words, Salary = -$2,125 + $1,122 for each year of age + $2,155 for each year of postsecondary education.
5. An excerpt of the Regression output for the salaries data set with years of postsecondary education and years of experience is shown below.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.92720353
R Square            0.859706386
Adjusted R Square   0.852122948
Standard Error      6.805249337
Observations        40

ANOVA        df
Regression    2
Residual     37
Total        39

                                    Coefficients
Intercept                           20.87563476
Years of Postsecondary Education    2.673930095
Years of Experience                 1.238703002
The model is as follows:
Salary ($000) = 20.9 + 2.7 (Years of Postsecondary Education) + 1.2 (Years of Experience)
In other words, Salary = $20,876 + $2,674 for every year of postsecondary education + $1,239 for every year of experience.
Develop Your Skills 14.2

6. The residual plots are shown below.
[Residual plot: Age]
[Residual plot: Years of Postsecondary Education]
[Residual plot: Years of Experience]
All three residual plots show the desired horizontal band appearance, and so they appear to be consistent with the required conditions. A plot of the residuals vs. predicted salaries for this model is shown below.
[Residual plot: Residuals vs. Predicted Salary, model with Age, Years of Postsecondary Education, and Years of Experience]
This residual plot also appears to have the desired horizontal band appearance, centred around zero. There is somewhat less variability in the residuals for predicted salaries under about $40,000, but the difference is not pronounced.

7. The residual plots are shown below.
[Residual plot: Age]
[Residual plot: Years of Postsecondary Education]
In the age residual plot, we see a couple of points that lie above and below the desired horizontal band. Also, the residuals appear to be centred somewhat above zero. The postsecondary education residual plot looks more like the desired horizontal band, although once again, the residuals appear to be centred above zero. A plot of the residuals vs. predicted salary is shown below.
[Residual plot: Residuals vs. Predicted Salary, model with Age and Years of Postsecondary Education]
The plot appears to have the desired horizontal band appearance, with the residuals centred around zero. The two circled points correspond to observations with standardized residuals beyond ±2 (observations 29 and 40).
8.
[Residual plot: Years of Postsecondary Education]
[Residual plot: Years of Experience]
Both appear to have the desired horizontal band appearance, centred on zero.
[Residual plot: Residuals vs. Predicted Salary, model with Years of Postsecondary Education and Years of Experience]
There appears to be somewhat less variability for lower predicted salaries. However, overall, the plot shows the desired horizontal band, centred around zero.

9. A histogram of the residuals for the model discussed in Exercise 6 is shown below.
[Histogram of residuals: model with Age, Years of Postsecondary Education, and Years of Experience]
The histogram is somewhat skewed to the right. It appears to be centred close to zero. A histogram of the residuals for the model discussed in Exercise 7 is shown below.
[Histogram of residuals: model with Age and Years of Postsecondary Education]
As for the previous model, we see the histogram is somewhat skewed to the right, and centred approximately on zero. A histogram of the residuals for the model discussed in Exercise 8 is shown below.
[Histogram of residuals: model with Years of Postsecondary Education and Years of Experience]
As with the others, this histogram appears skewed to the right, but the skewness appears more pronounced here.
Copyright 2011 Pearson Canada Inc.
10. For the model containing all explanatory variables: There is one observation that produces a standardized residual just slightly above +2. This is data point 26, where the observed salary is $67,400, age is 47, years of postsecondary education are 5, and years of experience are 17. The actual salary is above the predicted salary of $53,600. If we had access to the original records, we would double-check this data point.

For the model containing years of postsecondary education and age as explanatory variables: There are two observations with standardized residuals beyond ±2 (observations 29 and 40). As mentioned in the answer to Exercise 7, these two points are obvious in the plot of residuals vs. predicted salary for this model. If we had access to the original records, we would double-check these data points.

For the model containing years of postsecondary education and years of experience as explanatory variables: There are no observations with standardized residuals beyond ±2.
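The ±2 screening rule used here can be sketched in a few lines. This is a minimal illustration with made-up residuals (the helper name flag_outliers is ours); here a residual is standardized by the residuals' own standard deviation, which is close to how Excel's "standard residuals" are computed:

```python
import math

# Sketch: flag observations whose standardized residual falls outside +/-2,
# the screening rule used in Exercise 10. Residuals below are illustrative.
def flag_outliers(residuals, cutoff=2.0):
    n = len(residuals)
    mean = sum(residuals) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in residuals) / (n - 1))
    return [i for i, r in enumerate(residuals, start=1)
            if abs((r - mean) / sd) > cutoff]

# Observation 4 stands out in this made-up set:
print(flag_outliers([1.2, -0.8, 0.5, 9.4, -1.1, 0.3, -0.6, 0.9]))  # [4]
```

Flagged points are candidates for double-checking against the original records, not automatic deletions.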
12. Test for the significance of the overall model (all explanatory variables):
H0: β1 = β2 = β3 = 0
H1: At least one of the βi's is not zero.
α = 0.05
From the Excel output, we see that F = 74.9, and the p-value is approximately zero. There is strong evidence that the overall model is significant.

Tests for the significance of the individual explanatory variables:
Age:
H0: β1 = 0
H1: β1 ≠ 0
The p-value from the Excel output is 0.5, so we fail to reject H0. There is not enough evidence to conclude that age is a significant explanatory variable for salaries, when years of postsecondary education and years of experience are included in the model. This is not surprising, given how closely related age and years of experience appear to be.
Years of postsecondary education:
H0: β2 = 0
H1: β2 ≠ 0
The p-value from the Excel output is approximately zero, so we reject H0. There is enough evidence to conclude that years of postsecondary education is a significant explanatory variable for salaries, when age and years of experience are included in the model.

Years of experience:
H0: β3 = 0
H1: β3 ≠ 0
The p-value from the Excel output is approximately zero, so we reject H0. There is enough evidence to conclude that years of experience is a significant explanatory variable for salaries, when age and years of postsecondary education are included in the model.

13. Test for the significance of the overall model (age and years of postsecondary education):
H0: β1 = β2 = 0
H1: At least one of the βi's is not zero.
α = 0.05
From the Excel output, we see that F = 81.3, and the p-value is approximately zero. There is strong evidence that the overall model is significant.
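The overall F statistics quoted in Exercises 12-14 can be recovered from R² alone, using F = (R²/k) / ((1 - R²)/(n - k - 1)), where k is the number of explanatory variables. A minimal sketch (the helper name overall_f is ours; R² values come from the Excel outputs in Exercises 2-5):

```python
# Sketch: recovering the overall F statistic from R^2, n, and k via
# F = (R^2 / k) / ((1 - R^2) / (n - k - 1)).
def overall_f(r_squared, n, k):
    return (r_squared / k) / ((1 - r_squared) / (n - k - 1))

print(round(overall_f(0.861907642, 40, 3), 1))  # full model: 74.9
print(round(overall_f(0.814641629, 40, 2), 1))  # age + education: 81.3
print(round(overall_f(0.859706386, 40, 2), 1))  # education + experience: 113.4
```

These match the F values of 74.9, 81.3, and 113.4 reported in Exercises 12, 13, and 14.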
Tests for the significance of the individual explanatory variables:
Age:
H0: β1 = 0
H1: β1 ≠ 0
The p-value from the Excel output is approximately zero, so we reject H0. There is enough evidence to conclude that age is a significant explanatory variable for salaries, when years of postsecondary education is included in the model.

Years of postsecondary education:
H0: β2 = 0
H1: β2 ≠ 0
The p-value from the Excel output is approximately zero, so we reject H0. There is enough evidence to conclude that years of postsecondary education is a significant explanatory variable for salaries, when age is included in the model.
14. Test for the significance of the overall model (years of experience and years of postsecondary education):
H0: β1 = β2 = 0
H1: At least one of the βi's is not zero.
α = 0.05
From the Excel output, we see that F = 113.4, and the p-value is approximately zero. There is strong evidence that the overall model is significant.

Tests for the significance of the individual explanatory variables:
Years of postsecondary education:
H0: β1 = 0
H1: β1 ≠ 0
The p-value from the Excel output is approximately zero, so we reject H0. There is enough evidence to conclude that years of postsecondary education is a significant explanatory variable for salaries, when years of experience is included in the model.

Years of experience:
H0: β2 = 0
H1: β2 ≠ 0
The p-value from the Excel output is approximately zero, so we reject H0. There is enough evidence to conclude that years of experience is a significant explanatory variable for salaries, when years of postsecondary education is included in the model.
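The adjusted R² values compared in the next exercise can be recovered from the ordinary R² values already reported, using adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1). A minimal sketch (the helper name adjusted_r2 is ours):

```python
# Sketch: adjusted R^2 penalizes extra explanatory variables:
#   adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)
def adjusted_r2(r_squared, n, k):
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.861907642, 40, 3), 2))  # all three variables: 0.85
print(round(adjusted_r2(0.814641629, 40, 2), 2))  # education + age: 0.80
print(round(adjusted_r2(0.859706386, 40, 2), 2))  # education + experience: 0.85
```

These reproduce the adjusted R² values shown in the Excel outputs for the three candidate models.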
15. The adjusted R² values are shown below:

Model                                                        Adjusted R²
All explanatory variables                                    0.85
Years of postsecondary education and age                     0.80
Years of postsecondary education and years of experience     0.85

At this point, the model that contains years of postsecondary education and age does not seem worth considering: its adjusted R² value is lower than for the other models. As we have already seen, age and years of experience are highly correlated, and it appears that the model containing years of experience does a better job.

Develop Your Skills 14.4

16. The Excel output from the Multiple Regression Tools add-in is shown below (in two parts, to fit better on the page):
Prediction Interval                        Confidence Interval
Lower limit       Upper limit              Lower limit      Upper limit
106933.6829       137638.2386              114521.388       130050.5335
With 95% confidence, the interval ($114,521.39, $130,050.53) contains average Woodbon sales when mortgage rates are 6%, housing starts are 3,500, and advertising expenditure is $3,500.

17. It would not be appropriate to use the Woodbon model to make a prediction for mortgage rates of 6%, housing starts of 2,500, and advertising expenditure of $4,000, because the highest advertising expenditure in the sample data is only $3,500. We should not rely on a model for predictions based on explanatory variable values that are outside the range of the sample data on which the model is based.
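The extrapolation check in Exercise 17 is easy to automate: refuse any prediction whose inputs fall outside the observed data ranges. A minimal sketch — the mortgage-rate and housing-starts ranges below are assumptions for illustration; only the advertising maximum of $3,500 is stated in the solution:

```python
# Sketch: guard against extrapolating beyond the sample data ranges.
# Only the advertising maximum ($3,500) is given in the text; the other
# ranges here are ASSUMED purely for illustration.
SAMPLE_RANGES = {
    "mortgage_rate": (4.0, 10.0),    # assumed
    "housing_starts": (1000, 4000),  # assumed
    "advertising": (1500, 3500),     # maximum stated in Exercise 17
}

def in_sample_range(**values):
    """True only if every input lies within its observed data range."""
    return all(SAMPLE_RANGES[name][0] <= v <= SAMPLE_RANGES[name][1]
               for name, v in values.items())

# The Exercise 17 scenario fails because advertising exceeds $3,500:
print(in_sample_range(mortgage_rate=6.0, housing_starts=2500, advertising=4000))  # False
```

A production model would store the actual min/max of each explanatory variable from the training data rather than hard-coded constants.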
18. The Excel output from the Multiple Regression Tools add-in is shown below (in two parts, to fit better on the page):
Confidence Interval and Prediction Intervals Calculations
Confidence level: 95%
Point 1: Age = 35, Years of Postsecondary Education = 5
With 95% confidence, the interval ($31,689, $64,120) contains the salary of an individual who is 35 years old, and who has 5 years of postsecondary education. 19. The Excel output from the Multiple Regression Tools add-in is shown below (in two parts, to fit better on the page):
Confidence Interval and Prediction Intervals Calculations
Confidence level: 95%
Point 1: Age = 35, Years of Postsecondary Education = 5
With 95% confidence, the interval ($44,477, $51,331) contains the average salary of all individuals who are 35 years old, and who have 5 years of postsecondary education. The confidence interval is narrower than the prediction interval from Exercise 18, because the variability in the average salary is less than the variability for an individual salary.
20. The Excel output from the Multiple Regression Tools add-in is shown below (in two parts, to fit better on the page):
With 95% confidence, the interval ($32,509, $60,756) contains the salary of an individual who has 5 years of postsecondary education and 10 years of experience.

21. The text contains scatter diagrams of Woodbon Annual Sales plotted against mortgage rates and advertising expenditure (see Exhibit 14.2). Each relationship appears linear, with no pronounced curvature. A plot of the residuals versus the predicted y-values for this model is shown below.
[Residual plot: Woodbon model, residuals versus predicted sales (mortgage rates and advertising expenditure as explanatory variables)]
The plot shows the desired horizontal band appearance, although there appears to be reduced variability for higher predicted values. The other residual plots are shown below.
The mortgage rates residual plot shows the desired horizontal band appearance.
[Residual plot: Advertising Expenditure]
The advertising expenditure residual plot shows decreased variability for higher advertising expenditures. This is a concern, because it appears to violate the required conditions. At this point, we will refrain from conducting the F-test, as the required conditions are not met.
22. The Excel Regression output for the model that includes per-capita income and population is shown below.
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.720727015
R Square            0.51944743
Adjusted R Square   0.483850943
Standard Error      687.5975486
Observations        30

ANOVA        df    SS            MS         F          Significance F
Regression    2    13798538.88   6899269    14.59266   5.05245E-05
Residual     27    12765340.5    472790.4
Total        29    26563879.38

                    Coefficients    Standard Error    t Stat     P-value
Intercept           25502.98998     1443.513376       17.6673    2.3E-16
Population          0.062546304     0.011890908       5.260011   1.52E-05
Per-capita income   0.024579975     0.034847903       0.70535    0.486634
From this we can see that the model is significant.
H0: β1 = β2 = 0
H1: At least one of the βi's is not zero.
α = 0.05
From the Excel output, we see that F = 14.6, and the p-value is approximately zero. There is strong evidence that the overall model is significant. However, only one of the explanatory variables is significant in this model.
Population:
H0: β1 = 0
H1: β1 ≠ 0
The p-value from the Excel output is approximately zero, so we reject H0. There is enough evidence to conclude that population is a significant explanatory variable for sales, when per-capita income is included in the model.

Per-capita income:
H0: β2 = 0
H1: β2 ≠ 0
The p-value from the Excel output is 0.49, so we fail to reject H0. There is not enough evidence to conclude that per-capita income is a significant explanatory variable for sales, when population is included in the model.
We proceed with the analysis by creating all possible regressions. The output is shown below.
Multiple Regression Tools All Possible Models Calculations (summary)

Model (explanatory variables)      K    Adjusted R²    Standard Error    Significance F
Population                         1    0.4931         681.43            9.18E-06
Per-capita income                  1    —              —                 0.3856
Population + per-capita income     2    0.4839         687.63            5.05E-05
It is clear, from this output, that the model for sales that we would want to explore first is the one with population as the explanatory variable. This model has the highest adjusted R2, the lowest standard error, and it is significant. Of course, once we have focused on this model, we must ensure that it meets the required conditions.
The scatter diagram for sales and population shows some evidence of a positive linear relationship.
[Scatter diagram: Sales vs. Population]
A plot of the residuals versus the predicted sales values for this model is shown below.
This plot shows, more or less, the desired horizontal band appearance. However, there are two points that raise questions, as they are far from the other points (the points are circled on the plot).
[Residual plot: Population]
As we might have expected, the same two points stand out in the plot. There are no dates associated with these data points, so we cannot assess whether they are related over time. A histogram of the residuals is shown below.
[Histogram of residuals: model based on Population]
There are two observations which produce standardized residuals beyond ±2 (observations 2 and 10). These are the same data points that stood out on the residual plots. If we had access to the original data, we would double-check these data points. Because we cannot do that, we will leave them in the model. Our analysis suggests that the model that predicts sales on the basis of population is the best model for this data set. Because the model appears to meet the required conditions, it could be used as the basis for predictions of sales.

23. Here is the correlation matrix for the variables in the Salaries data set.
[Correlation matrix: Salary ($000), Age, Years of Postsecondary Education, Years of Experience]
From this we can see that years of experience and age are very highly correlated, and so we would not choose to include both in our model. Both age and years of experience are very highly correlated with salary, and so one or the other appears to be promising as an explanatory variable.
24. We will use the Excel add-in to provide summary data about all possible models (notice that the output is spread over two pages).
Multiple Regression Tools All Possible Models Calculations (summary)

Model (explanatory variables)                             K    Adjusted R²    Standard Error    Significance F
Age                                                       1    0.7361         9.091             9.17E-13
Years of postsecondary education                          1    —              —                 0.000454
Years of experience                                       1    0.7364         9.086             8.99E-13
Age + years of postsecondary education                    2    0.8046         7.822             2.87E-14
Age + years of experience                                 2    0.7404         9.018             5.54E-12
Years of postsecondary education + years of experience    2    0.8521         6.805             1.66E-16
All three explanatory variables                           3    0.8504         6.845             1.52E-15
25. There are many possible models here. However, the one that looks most promising is the one that includes years of experience and years of postsecondary education. This is a logical model. Overall, it is significant, and each of the explanatory variables is significant. The standard error is relatively low. As well, the model makes sense. It is reasonable to expect that both of these factors would have a positive impact on salary. We cannot decide to rely on this model without checking the required conditions. The residual plots are shown below.
[Residual plot: Residuals vs. Predicted Salary, model with Years of Postsecondary Education and Years of Experience]
[Residual plot: Years of Postsecondary Education]
[Residual plot: Years of Experience]
All the residual plots show the desired horizontal band appearance, centred on zero. A histogram of the residuals has some right-skewness.
[Histogram of residuals]
There are no obvious outliers or influential observations. We choose this model as the best available.
26. The Excel Regression output for the model based on income only is shown below.
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.604253286
R Square            0.365122033
Adjusted R Square   0.345883307
Standard Error      422.1512823
Observations        35

ANOVA        df    SS            MS         F          Significance F
Regression    1    3382189.615   3382190    18.97849   0.000121079
Residual     33    5880986.271   178211.7
Total        34    9263175.886

                Coefficients    Standard Error    t Stat     P-value
Intercept       75.02173946     396.7061107       0.189112   0.851164
Income ($000)   23.17297158     5.319255679       4.356431   0.000121
Compare these results with those shown in Exhibit 14.26, where gender is included in the model. We see that the adjusted R2 is higher for the model that includes gender, and the standard error is lower. It appears that adding the gender variable improves the model.
27. We used indicator variables as shown in Exhibit 14.27 in the text. The Excel Regression output is as shown below.
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.404030563
R Square            0.163240695
Adjusted R Square   0.101258525
Standard Error      313.2587914
Observations        30

ANOVA        df    SS            MS             F             Significance F
Regression    2    516890.0667   258445.0333    2.633671806   0.090179418
Residual     27    2649538.9     98131.07037
Total        29    3166428.967

                      Coefficients    Standard Error    t Stat        P-value
Intercept             1564.4          99.06112778       15.79226923   3.67865E-15
Brand indicator 1     218.9           140.0935904       1.562526875   0.12981017
Brand indicator 2     313.4           140.0935904       2.237075937   0.03373543
H0: β1 = β2 = 0
H1: At least one of the βi's is not zero.
α = 0.05
From the Excel output, we see that F = 2.63, and the p-value is about 9%. There is not enough evidence to infer that there is a significant relationship between battery life and brand. Note that this is the same conclusion we came to in Chapter 11.
28. First, we must set up the data set with indicator variables. We cannot run the regression with the 1, 2, and 3 codes for region that are in the data set. Two indicator variables (in combination) indicate region, as follows:

Region       Indicator 1    Indicator 2
Central      1              0
North        0              1
Southwest    0              0
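The recoding described above can be sketched in one line per region. A minimal illustration (assuming, as the table above implies, that Central, North, and Southwest are coded 1, 2, and 3 in the data set):

```python
# Sketch: convert the 1/2/3 region codes into the two indicator variables
# shown in the table above (1 = Central, 2 = North, 3 = Southwest assumed).
def region_indicators(code):
    """Return (indicator_1, indicator_2) for a region code."""
    return {1: (1, 0), 2: (0, 1), 3: (0, 0)}[code]

print([region_indicators(c) for c in [1, 2, 3]])
# [(1, 0), (0, 1), (0, 0)]
```

Note that three regions need only two indicators; the third (Southwest) is the baseline represented by (0, 0).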
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.692380383
R Square            0.479390595
Adjusted R Square   0.429009039
Standard Error      12.12223863
Observations        35

ANOVA        df    SS            MS         F        Significance F
Regression    3    4194.738105   1398.246   9.5152   0.000131176
Residual     31    4555.408752   146.9487
Total        34    8750.146857

                                      Coefficients    Standard Error    t Stat     P-value
Intercept                             -14.78132092    11.7183791        -1.26138   0.216582
Number of Sales Contacts (Monthly)    0.827088142     0.174906441       4.728746   4.67E-05
Region Indicator 1                    14.07075014     5.122616397       2.74679    0.009933
Region Indicator 2                    6.169658282     5.449944445       1.132059   0.266291
We can see that the overall model is significant. As well, the number of sales contacts is significant. The first indicator variable is also significant, but the second one is not.
What does this mean? It appears that when the first region indicator variable is included in the model, the second one is not significant. If we think about what the region indicator variables tell us, it appears that region is significant, but only in the sense that it matters whether the region is central, or not (the distinction between north and southwest is not significant). In fact, if we re-run the model, keeping only the distinction between sales in the central region or not, the results are as follows:
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.67665967
R Square            0.457868309
Adjusted R Square   0.423985079
Standard Error      12.17545162
Observations        35

ANOVA        df    SS            MS         F          Significance F
Regression    2    4006.414947   2003.207   13.51312   5.56771E-05
Residual     32    4743.73191    148.2416
Total        34    8750.146857

                                      Coefficients    Standard Error    t Stat
Intercept                             -11.10718667    11.30939798       -0.98212
Number of Sales Contacts (Monthly)    0.822594987     0.175628992       4.683708
Region Indicator 1                    10.67039881     4.16780093        2.560199
The model can be interpreted as follows:
Sales = -$11,107.19 + $822.59 × (Number of Sales Contacts) + $10,670.40 for the central region, and
Sales = -$11,107.19 + $822.59 × (Number of Sales Contacts) for the north or southwest regions.
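The two equations above differ only by the indicator coefficient. A minimal sketch restating the fitted model as a function (predicted_sales is our own name; the 30-contact scenario is made up for illustration):

```python
# Sketch: the re-run model from the output above. The regression was fit
# with sales in $000s, so we rescale to dollars at the end.
def predicted_sales(contacts, central_region):
    sales_000 = (-11.10718667
                 + 0.822594987 * contacts
                 + 10.67039881 * (1 if central_region else 0))
    return sales_000 * 1000

# Same monthly contacts, central vs. other regions: the gap is the
# indicator coefficient, about $10,670.
diff = predicted_sales(30, True) - predicted_sales(30, False)
print(round(diff, 2))  # 10670.4
```

This makes the interpretation of an indicator coefficient concrete: holding contacts fixed, central-region sales are predicted to be about $10,670 higher.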
29. Excel's Regression output for the model including both number of employees and shift is shown below.
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.40825172
R Square            0.166669467
Adjusted R Square   0.128790806
Standard Error      4165.200265
Observations        47

ANOVA        df    SS            MS         F          Significance F
Regression    2    152673338.7   76336669   4.400089   0.018112587
Residual     44    763351302.8   17348893
Total        46    916024641.5

                          Coefficients    Standard Error    t Stat     P-value
Intercept                 33118.64628     10531.14671       3.144828   0.002976
Number of Employees       246.1711185     83.00625235       2.965694   0.004865
Shift (0=Day, 1=Night)    158.6692673     1282.379223       0.12373    0.902092
It appears the overall model is significant; however, shift is not a significant explanatory variable when the number of employees is included in the model. As well, if the model is run with only shift included as an explanatory variable, it is not significant. Therefore, it appears that shift is not a useful explanatory variable for the number of units produced. As well, the model based on number of employees, while significant, is not a particularly useful model (the adjusted R² is only 0.15).
30. Province is not a significant explanatory variable for wages and salaries, either as the sole explanatory variable (p-value for the F-test = 0.67), or when age is included in the model (p-value for the test of the province indicator variable coefficient = 0.76). While it appears that the model including age alone is significant, the required conditions are not met. See the residual plot shown below. The plot clearly shows a pattern of increasing variability for higher ages.
[Residual plot: Age]
Chapter Review Exercises

1. The model can be interpreted as follows:
$Monthly Credit Card Balance = $38.36 + $0.99 (Age of Head of Household) + $22.04 (Income in thousands of dollars) + $0.38 (Value of the home in thousands of dollars)
Generally, monthly credit card balances are higher for older heads of household with higher incomes and more expensive homes.
2. H0: β1 = β2 = β3 = 0
H1: At least one of the βi's is not zero.
α = 0.05
From the Excel output, we see that F = 5.96. The F distribution will have (3, 31) degrees of freedom. We estimate that the p-value is less than 0.01. There is strong evidence that the overall model is significant.
3. Age of head of household:
H0: β1 = 0
H1: β1 ≠ 0
The p-value from the Excel output is 0.93, so we fail to reject H0. There is not enough evidence to conclude that age of head of household is a significant explanatory variable for credit card balances, when household income and value of the home are included in the model.

Income ($000):
H0: β2 = 0
H1: β2 ≠ 0
The p-value from the Excel output is 0.02, so we reject H0. There is enough evidence to conclude that household income is a significant explanatory variable for credit card balances, when age of head of household and value of the home are included in the model.
Value of the home ($000):
H0: β3 = 0
H1: β3 ≠ 0
The p-value from the Excel output is 0.92, so we fail to reject H0. There is not enough evidence to conclude that value of the home is a significant explanatory variable for credit card balances, when age of head of household and household income are included in the model.

4. Age of head of household is fairly highly correlated with household income. As well, it appears that age is not a significant explanatory variable, when household income is included in the model. Collinearity between these two variables may be causing a problem in the model.

5. Because the tests are done in the same format as the final exam, it is expected that the tests will prove to be better predictors of the final exam mark. However, good knowledge of the material is likely to result in higher marks for all of the evaluations, so we must consider that any one of them could be a good predictor of the final exam mark.

6. None of the correlations between the explanatory variables is particularly high. There is a fairly high correlation between the mark on Test #2 and the final exam mark, which suggests that the mark on Test #2 might be a good explanatory variable for the final exam mark.

7. The model that predicts the final exam mark on the basis of the mark on Test #2 is clearly the best. The adjusted R² is higher, and the standard error is lower, than for all the other variations. As well, the model is significant (the p-value for the F-test is approximately zero).
8. The Excel output is shown below (split for visibility).

Confidence Interval and Prediction Intervals Calculations
Confidence level: 95%
Point 1: Age of Head of Household = 45, Income ($000) = 65, Value of Home ($000) = 175
Prediction Interval: Lower limit 675.4457684, Upper limit 2486.743786

We have 95% confidence that the interval ($675.45, $2,486.74) contains the monthly credit card bill for a head of household aged 45, with annual income of $65,000 and a home valued at $175,000.
9. All of the models that include Test #2 as an explanatory variable are better than those which do not. Test #2 was the basis for the best model when only one explanatory variable was included in the model, and so this is not surprising. The two-variable model with the highest adjusted R² contains Test #2 and Assignment #2. Adding Assignment #2 to Test #2 as an explanatory variable increases the adjusted R² value from 0.51 to 0.57, and the standard error decreases from 14.3 to 13.4. Prediction and confidence intervals made with the two-variable model would be narrower than for the model with only Test #2. The two-variable model is better, but whether it is "best" depends on the way the model might be used. Suppose it is being used to predict the exam marks, and identify those who are in danger of failing the course, or not achieving a grade level necessary for external accreditation. Test #2 is a significant explanatory variable. If Assignment #2 comes much later in the course, it may be better to use the single-variable model, so that the student can be alerted to a potential problem earlier, with time for adjustments.
10. The residual plots for this model are shown below. All have the desired appearance.
[Residual plot: Assignment #2]
[Residual plot: Test #2]
[Residual plot: Residuals vs. Predicted Exam Mark, model with Test #2 and Assignment #2]
There is one data point that produces a standardized residual that is greater than 2 (observation 69). However, there is no way to double-check this point. It appears this model meets the required conditions. 11. None of the three-variable models represents a real improvement on the model which includes Test #2 and Assignment #2. As we might expect, the best three-variable models include both Test #2 and Assignment #2. The best of these, in terms of higher adjusted R2, also contains Test #1. However, Test #1 is not significant as an explanatory variable when Test #2 and Assignment #2 are included in the model. This is true for all the other models that include both Test #2 and Assignment #2: the third explanatory variable is not significant when Test #2 and Assignment #2 are included in the model. 12. The Excel Regression output for the model containing all possible explanatory variables is shown below.
SUMMARY OUTPUT (excerpt)

Regression Statistics
Multiple R          0.7755
R Square            0.6014
Adjusted R Square   0.5790
Standard Error      13.2268
Observations        95

ANOVA
             df    SS          MS        F        Significance F
Regression    5    23494.40    4698.88   26.86    1.85E-16
Residual     89    15570.33    174.95
Total        94    39064.74

Coefficients (coefficient, standard error, t Stat, p-value):
Intercept:  18.0837, 5.4050, t = 3.35, p = 0.0012
The five explanatory variables have coefficients 0.1001, 0.1493, 0.1350, 0.4428, and 0.0236, with p-values 0.2055, 0.0833, 0.0191, 1.14E-07, and 0.7153 respectively.
Again we see that the explanatory variables other than Test #2 and Assignment #2 are not significant in this model. This is not the best model.
13. Using the model that includes Test #2 and Assignment #2, the Excel output is shown below:
Confidence Interval and Prediction Intervals Calculations
Confidence Level (%) = 95
Point Number 1: Assignment #2 = 65, Test #2 = 70
Prediction Interval: Lower limit 48.91, Upper limit 102.46
We have 95% confidence that the interval (48.9, 102.5) contains the final exam mark of a student who received a mark of 65 on Assignment #2 and 70 on Test #2. After all the analysis, it appears that the best model generates a prediction interval so wide that it is not really useful.
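The prediction interval calculation can be sketched directly from the usual OLS formula. This Python sketch is illustrative only: the tiny data set is invented (the real exercise uses 95 students), and a normal critical value is used in place of the exact t value, which is adequate at the sample sizes in these exercises.

```python
import numpy as np
from statistics import NormalDist

def prediction_interval(X, y, x_new, alpha=0.05):
    """Approximate (1 - alpha) prediction interval for a new observation,
    using  yhat +/- z * s * sqrt(1 + x0' (X'X)^-1 x0).
    X and x_new must include the intercept column of ones."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, p = X.shape
    s = np.sqrt(resid @ resid / (n - p))        # standard error of estimate
    xtx_inv = np.linalg.inv(X.T @ X)
    x0 = np.asarray(x_new, float)
    se_pred = s * np.sqrt(1 + x0 @ xtx_inv @ x0)
    z = NormalDist().inv_cdf(1 - alpha / 2)     # ~1.96 for 95%
    yhat = x0 @ beta
    return yhat - z * se_pred, yhat + z * se_pred

# Invented data set (columns: intercept, Assignment #2, Test #2 -> exam mark)
X = [[1, 50, 55], [1, 60, 62], [1, 70, 68], [1, 80, 75], [1, 90, 88], [1, 55, 60]]
y = [52, 60, 66, 74, 90, 58]
lo, hi = prediction_interval(X, y, [1, 65, 70])
print(lo < hi)  # → True
```

The `sqrt(1 + ...)` term is what makes a prediction interval for an individual wider than a confidence interval for the mean, which uses `sqrt(...)` without the 1.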
14. The Excel output for all possible regressions calculations is shown below.
Multiple Regression Tools: All Possible Models Calculations (excerpt)

Model 1 (Year): K = 1, Significance F = 0.0017; adjusted R^2 0.3252, standard error 2007.38; coefficients: intercept -3,294,907.13 (p = 0.0018), Year 1,651.88 (p = 0.0017)

Model 2 (Kilometres): K = 1, Significance F = 0.0003; adjusted R^2 0.4203, standard error 1860.58; coefficients: intercept 21,217.96 (p = 1.45E-15), Kilometres -0.0569 (p = 0.0003)

Model 3 (Year and Kilometres): K = 2, Significance F = 0.0001
All three models are significant. The model with the highest adjusted R2 is the one that includes both year and kilometres as explanatory variables. Both year and kilometres are significant explanatory variables, when the other variable is included in the model.
However, this model presents some problems. Initial scatter diagrams for each of the explanatory variables are shown below.
[Scatter diagram: List Price vs. Model Year, used cars, 2004-2008 model years — Honda Accord prices on AutoTrader.ca (November 2008)]
In this scatter diagram, we see there is more variability for list prices for older cars. Note there is only one data point for a car from the 2007 model year.
[Scatter diagram: List Price vs. Kilometres, used cars, 2004-2008 model years — Honda Accord prices on AutoTrader.ca (November 2008)]
Here, we see there is more variability in list prices for Honda Accords with higher kilometres.
With both these explanatory variables included in the model, the residual plots are as shown below.
[Residual plots: Year residual plot; Kilometres residual plot]
Both of these plots show an unusual observation, which is circled. Both refer to the observation where a Honda Accord with 121,353 kilometres, 2005 model year, was listed for $19,888. There may be something unusual about this observation to explain why the list price is so unusually high for a relatively high-mileage (kilometrage!) car. Referring back to the original listing, if it were available, might tell us something to explain this. Since we do not have this information available, we cannot assess if this point is legitimate.
[Residuals vs. predicted list price (Year and Kilometres)]
The outlier that showed up on the other residual plots also shows up here. There are some concerns about the model. The standard error is fairly large, so, for example, if we predicted the list price of a 2005 Honda Accord with 85,000 kilometres, the prediction interval would be ($13,114.99, $20,308.02). Therefore, the model is not that useful for predicting the list price of a used Honda Accord.

15. Year of the car is not really a quantitative variable. There are four years (2004, 2005, 2006, and 2007) in the sample data set, so three indicator variables are required. They could be set up as follows:

Year   Indicator Variable 1   Indicator Variable 2   Indicator Variable 3
2004   1                      0                      0
2005   0                      1                      0
2006   0                      0                      1
2007   0                      0                      0

All possible regressions calculations provide many possible models. However, notice again that there is only one observation for the year 2007. The data set is not really large enough to support this analysis. We will proceed, out of curiosity.
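The indicator-variable coding in the table above can be generated mechanically. The following is a small Python sketch, not part of the original solution; the function name is invented, and the choice of 2007 as the baseline year (all zeros) follows the table in the text.

```python
def indicator_columns(years, baseline=2007):
    """Build 0/1 indicator columns for a qualitative 'year' variable.
    With four model years and 2007 as the baseline, three indicator
    columns are needed; baseline rows get all zeros."""
    levels = sorted({y for y in years if y != baseline})
    return [[1 if y == level else 0 for level in levels] for y in years]

print(indicator_columns([2004, 2005, 2006, 2007]))
# → [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
```

Using one fewer indicator than the number of categories avoids perfect multicollinearity with the intercept column (the "dummy variable trap").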
The model with the highest adjusted R2 contains kilometres and only the indicator variable specifying whether the car is from the 2004 model year, or not.
K = 2, Significance F = 2.11E-05
Indicator variable for the 2004 model year: coefficient -2,071.67, p-value 0.0037
The model is as follows:

For the 2004 model year: List Price = $21,641.08 - 0.05(Kilometres) - $2,071.67
For model years 2005, 2006, and 2007: List Price = $21,641.08 - 0.05(Kilometres)

Notice that this model is more intuitive than the model from Exercise 14, which was:

List Price = -$2,076,840.42 + $1,045.99(Year) - 0.043(Kilometres)

Such a model does not really make sense, and this should have been your clue that treating the year of a car as a quantitative variable is not the correct approach.
16. All seven possible regression models are significant, as the output for the all possible regressions calculations shows.
Multiple Regression Tools: All Possible Models Calculations (excerpt)

Model 1: K = 1, Significance F = 0.0016
Model 2: K = 1, Significance F = 6.03E-05; adjusted R^2 0.4228, standard error 1421.27; coefficients: intercept 17,430.61 (p = 4.52E-08), 0.1633 (p = 6.03E-05)
Model 3: K = 1, Significance F = 0.0033; coefficient 0.1888 (p = 0.0033)
Model 4: K = 2, Significance F = 9.02E-05; coefficient 0.1273 (p = 0.0032)
Model 5: K = 2, Significance F = 2.05E-05; coefficient 0.1806 (p = 0.0007)
Model 6: K = 2, Significance F = 4.84E-05
Model 7: K = 3, Significance F = 1.50E-05
None of the one-variable models seems useful, as the adjusted R2 is quite low. Of the two-variable models, the most promising is Model Number 5, with local population and estimated weekly traffic volume as explanatory variables. The model is significant, and each of the explanatory variables is significant when the other one is included in the model. The adjusted R2 is 0.52, which is not high, but still better than for the other two-variable models.

At first, it appears that Model Number 7, which includes all three explanatory variables, might be best, as it has the highest adjusted R2 of all the models. However, note that median income in the local area is not a significant explanatory variable (at a 5% significance level) when the other two variables are included in the model. As well, note that the standard error for this model is almost the same as for Model Number 5, which relies on only two explanatory variables.
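The "all possible models" calculations these solutions rely on amount to a brute-force loop over every subset of explanatory variables. This Python sketch is illustrative only: the function name and the small data set are invented, and a real analysis would also report the Significance F and coefficient p-values that the text uses to judge Model Numbers 5 and 7.

```python
import itertools
import numpy as np

def all_possible_models(X_cols, names, y):
    """Fit every non-empty subset of explanatory variables by OLS and
    report adjusted R^2 for each subset."""
    y = np.asarray(y, float)
    n = len(y)
    results = []
    for k in range(1, len(X_cols) + 1):
        for combo in itertools.combinations(range(len(X_cols)), k):
            X = np.column_stack(
                [np.ones(n)] + [np.asarray(X_cols[i], float) for i in combo])
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            resid = y - X @ beta
            r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
            adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
            results.append((tuple(names[i] for i in combo), round(adj, 3)))
    return results

# Invented data: local population (000s), weekly traffic (000s) -> sales ($000)
pop     = [50, 80, 120, 150, 200, 250, 90, 170]
traffic = [12, 15, 20, 22, 30, 35, 18, 26]
sales   = [25, 26, 28, 29, 31, 33, 27, 30]
for vars_, adj in all_possible_models([pop, traffic], ["pop", "traffic"], sales):
    print(vars_, adj)
```

With two candidate variables there are three models (2^2 - 1); with the three variables in this exercise there are seven, matching the output above.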
Therefore, we will investigate Model Number 5 to see if it conforms to the required conditions. First, such a model makes some sense. Initial scatter diagrams for monthly sales and each explanatory variable show some evidence of a positive linear relationship, although neither relationship looks particularly strong. Note that the scales on some of the axes in the graphs below do not start at zero.
[Scatter diagram: Monthly Sales vs. Local Population]
[Scatter diagram: Monthly Sales vs. Estimated Traffic Volume (Weekly)]
Residual plots show (more or less) the desired horizontal band appearance, as shown below.
[Residual plots: Local Population residual plot; Estimated Traffic Volume (Weekly) residual plot]
[Residuals vs. predicted monthly sales]
A histogram of the residuals is approximately normal. There do not appear to be any outliers or influential observations.
It appears the model meets the required conditions. This model could be the basis of a location decision. The form of the model is as follows: Predicted Monthly Sales = $22,441 + 0.0143(Local Population) + 0.1806(Estimated Weekly Traffic Volume)
17. While it is tempting to add the new data and re-analyze the model that was best in Exercise 16, the correct approach is to look at all possible models. We have to allow for the possibility that the new information ALONE will be the basis of the most important explanatory variable.

In fact, the output of the all possible regressions calculations shows that inclusion of the indicator variable for the location being within a five-minute drive of a major highway does improve the model we chose as best for Exercise 16. However, the best of all of the models, in terms of adjusted R2, is the model with all possible explanatory variables. The adjusted R2 for this model is 0.656, compared with 0.517 for the preferred model in Exercise 16.

The data requirements for this model are more onerous, and this would have to be taken into consideration before the model was selected. While local population and median incomes could be obtained through Statistics Canada, information about estimated weekly traffic volume would probably have to be collected (possibly over several weeks). However, whether a location is within a five-minute drive of a major highway could be determined by looking at road maps and estimating driving distances.

We will analyze the "all-in" model to see if it conforms to the required conditions. Residual plots look acceptable. The histogram of residuals appears normally distributed. There are no obvious outliers or influential observations.
[Local Population residual plot]
[Estimated Traffic Volume (Weekly) residual plot]
[Residuals vs. predicted monthly sales, doughnut shop sales prediction model (all explanatory variables); histogram of residuals]
It appears the model meets the required conditions. The model is as follows:

For locations within a five-minute drive of a major highway:
Predicted Monthly Sales = $17,413 + 0.009(Local Population) + 0.089(Median Income in Local Area) + 0.145(Estimated Weekly Traffic Volume) + $1,137

For locations not within a five-minute drive of a major highway:
Predicted Monthly Sales = $17,413 + 0.009(Local Population) + 0.089(Median Income in Local Area) + 0.145(Estimated Weekly Traffic Volume)

18. a. Since the Canadian economy is resource-based, it does not seem unusual to look to resource stocks as a stand-in for the entire stock index. One could argue that as the economy overall goes, so will go the financial sector and the stock index. The Rona stock seems less likely, ahead of time, to be a good predictor of the TSX. However, we will begin, as usual, by looking at all possible models.
However, when we do this, we do not find a useful model. Of the one-variable models, the best is the one that predicts the TSX on the basis of the price of Potash Corporation stock. However, although the model is significant, the adjusted R2 is quite low, and the standard error is relatively high. When we examine the two-variable models, all of them have at least one variable that is not significant in the model when the other is present. All the three- and four-variable models show the same problem. This is not surprising, as all of the stocks and the TSX will be affected by overall economic conditions. Some multicollinearity is the likely result. Scatter diagrams for the TSX and each stock price also suggest that there is not a linear relationship between these stock prices and the TSX.
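The multicollinearity suspected here can be quantified with variance inflation factors (VIFs), a standard diagnostic not shown in the original output: regress each explanatory variable on the others, and compute VIF = 1/(1 - R^2). A minimal Python sketch, with made-up, nearly collinear price series standing in for the actual stock data:

```python
import numpy as np

def vif(columns):
    """Variance inflation factor for each column: regress it on the other
    columns (with intercept); VIF = 1 / (1 - R^2). Values well above
    roughly 5-10 signal problematic multicollinearity."""
    cols = [np.asarray(c, float) for c in columns]
    n = len(cols[0])
    out = []
    for j, target in enumerate(cols):
        others = [c for i, c in enumerate(cols) if i != j]
        X = np.column_stack([np.ones(n)] + others)
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)
        resid = target - X @ beta
        r2 = 1 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        out.append(float("inf") if r2 >= 1 else 1 / (1 - r2))
    return out

# Two invented, nearly collinear price series produce large VIFs:
a = [10, 12, 14, 16, 18, 20]
b = [21, 24, 29, 32, 37, 40]   # roughly 2*a plus small noise
print(all(v > 10 for v in vif([a, b])))  # → True
```

Stock prices that all move with the overall market would show exactly this pattern, which is why each variable looks insignificant once the others are in the model.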
[Scatter diagram: TSX vs. Royal Bank stock price]
[Scatter diagram: TSX vs. Rona Inc. stock price]
[Scatter diagram: TSX vs. Petro-Canada stock price]
[Scatter diagram: TSX vs. Potash Corp. stock price]
The only relationship that looks somewhat linear is the last one, between the TSX and the Potash Corporation stock price, and there is clearly more variability in the TSX for prices in the lower part of the Potash Corporation stock price range for these data. This helps explain why the only model that appeared to have any predictive power was the one based on the stock price of the Potash Corporation.

b. There is definitely evidence of the stock market crisis at the end of 2008. For example, if we examine the residuals for the model based on the Potash Corporation stock price, they show a definite time-related pattern, as shown below. It is not a good idea to try to build a model using data for this time period. Whatever relationships may have held before the fall of 2008, financial markets were becoming increasingly unpredictable, and the new information that became available at the time of the crisis may change forever the way stock markets work.
[TSX model, residuals over time (Potash Corporation stock price as explanatory variable), November 2002 through 2008]
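The time-related pattern in these residuals can also be checked numerically with the Durbin-Watson statistic, a standard diagnostic that is not part of the original output. A minimal Python sketch with invented residual sequences: values near 2 suggest no autocorrelation, while values near 0 indicate the kind of smooth, trending pattern seen in the plot.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for time-ordered residuals:
    sum of squared successive differences over sum of squares.
    Ranges from 0 (strong positive autocorrelation) to 4."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# Invented sequences: a smooth trend scores near 0, rapid alternation near 4.
trending = [1, 2, 3, 4, 5, 6, 7, 8]
alternating = [1, -1, 1, -1, 1, -1, 1, -1]
print(durbin_watson(trending) < 1, durbin_watson(alternating) > 3)  # → True True
```

A very low statistic on the 2002-2008 residuals would confirm numerically what the plot shows: the model's errors are not independent over time.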
19. There are many possible models. However, for many of the models, when the overall model is significant, some of the individual explanatory variables are not significant, given the other explanatory variables in the model. This is not surprising, as the factors that lead to student success in one subject probably contribute to student success in other subjects. The best one-variable model is based on the mark in Intermediate Accounting 1. The best two-variable model includes the mark in Intermediate Accounting 1 and Cost Accounting 1. Model results are summarized below.
One-variable model (Intermediate Accounting 1): adjusted R^2 0.5205, standard error 12.9389; coefficients: intercept 17.2097 (p = 0.0085), Intermediate Accounting 1 0.7112 (p = 2.06E-09)

Model Number 6 (Intermediate Accounting 1 and Cost Accounting 1): adjusted R^2 0.5904, standard error 11.9590; coefficients: intercept 14.5278 (p = 0.0169), 0.4202 (p = 0.0024), 0.3772 (p = 0.0040)
Of these two, Model Number 6 appears to be the better model, with a higher adjusted R2, and somewhat lower standard error.
20. Scatter diagrams of the Statistics 1 mark and each of the marks in Intermediate Accounting 1 and Cost Accounting 1 are shown below. Both relationships appear linear.
[Scatter diagrams: Statistics 1 mark vs. Intermediate Accounting 1 mark; Statistics 1 mark vs. Cost Accounting 1 mark]
[Residual plots: Intermediate Accounting 1 residual plot; Cost Accounting 1 residual plot; residuals vs. predicted Statistics 1 mark]
The histogram of the residuals is somewhat skewed to the left, as shown below. However, generally, the residuals appear to be normally distributed.
[Histogram of residuals (Cost Accounting 1 and Intermediate Accounting 1 as explanatory variables)]
There are some data points with standardized residuals below -2 or above +2. However, we have no way to verify these data points, so for now, we have no choice but to leave them in the model. It appears that the model to predict the Statistics 1 mark on the basis of marks in Cost Accounting 1 and Intermediate Accounting 1 meets the required conditions.

21. We have 95% confidence that the interval (42, 91) contains the Statistics 1 mark of an individual student who achieved a mark of 65 in both Cost Accounting 1 and Intermediate Accounting 1.
22. Of all the one-variable models, the best is the one based on years of experience. Of all the other models, the one based on years of experience and the local advertising budget is best. This model has an adjusted R2 of 0.95.

The model seems sensible:

Sales = $12,260 + $1,185(Years of Experience) + $4(Local Advertising Budget)

It seems reasonable to expect that salespeople would increase their skill as they gain years of experience, and this could result in increased sales. It also seems likely that increases in the local advertising budget would lead to increases in sales.

Scatter diagrams for each explanatory variable and sales are shown below. Note that the vertical axis on each graph does not start at zero.
[Scatter diagram: Sales vs. Years of Experience]
[Scatter diagram: Sales vs. Local Advertising Budget]
There appears to be a strong linear relationship between sales and years of experience. The relationship between the local advertising budget and sales is less obvious. However, these variables in combination appear to provide the best model for sales.

The residual plots appear to have the desired horizontal band appearance. There is one point that appears unusual in all three plots (it is indicated with a triangular marker). All three points correspond to the same observation, the 40th data point. If we had the ability to double-check the accuracy of this point, we would. This data point is the only one with a standardized residual below -2 or above +2.
[Residual plots: Local Advertising Budget residual plot; residuals vs. predicted sales]
The histogram of the residuals is somewhat bimodal, with some left-skewness. While not significantly non-normal, such a histogram suggests some caution when using the model.
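The left-skewness noted above can be quantified with a sample skewness coefficient rather than judged by eye from the histogram. A Python sketch (the residual values are invented for illustration): negative values indicate left-skewness, and values near 0 are consistent with a symmetric, roughly normal distribution.

```python
def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness coefficient.
    Negative: left-skewed; near zero: symmetric; positive: right-skewed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    g1 = m3 / m2 ** 1.5
    return (n * (n - 1)) ** 0.5 / (n - 2) * g1

# Invented residuals with a long tail of negative values (left-skewed):
resid = [2, 3, 3, 4, 4, 4, 5, 5, 1, -6, -9]
print(sample_skewness(resid) < 0)  # → True
```

A mildly negative coefficient would support the text's judgment: some skewness, but not enough to invalidate the model, though caution is warranted.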
[Histogram of residuals (Years of Experience and Local Advertising Budget as explanatory variables)]
A 95% prediction interval for a salesperson with 15 years of experience, and a local advertising budget of $4,000 would be ($42,081, $50,035).