Q1, 3 & 5 are for tutorial discussions; answers for the rest are already included.
Again, BACT offers useful consulting before each tutorial class.
(1) Price of Beef. How is the price of beef related to other factors? The data in
Tut4-Q1-Price_of_Beef.xlsx give information on the price of beef (PBE) and the
possible explanatory variables: consumption of beef per capita (CBE), retail
food price index (PFO), food consumption per capita index (CFO), and an index
of real disposable income per capita (RDINC) for the years 1925 to 1941 in the
United States.
(a) Use computer software to find the regression equation for predicting the
price of beef based on all of the given explanatory variables. What is the
regression equation?
In order of importance of X variables, from Tut4-Q1-Price_of_Beef(answer)&.xlsx (as given by Excel's Data Analysis add-in):
PBE = 347.60 - 2.83 CFO - 0.772 CBE - 0.327 PFO + 0.544 RDINC
As seen, for predicting the price of beef, food consumption is most
important, followed by consumption of beef, retail food price index, and
real disposable income.
Note that it is usually sufficient (and clearer) to give only 3 significant
digits for numbers. In the above, 347.60 is given to 5 significant digits so that
its 2 decimals match those of 2.83; i.e. the 3-digit rule needs to be
selectively relaxed.
The plots generated by the Data Analysis add-in are not used.
(b) Produce the appropriate residual plots to check Regression assumptions. Is
this inference appropriate for this model? Please explain.
Remember that, practically, there is really no point in doing this, unless
this is the final most-valid model. Nonetheless, there are 3 ways to do
this:
(1) By Excel Data Analysis add-in: Tut4-Q1-Price_of_Beef(answer)&.xlsx
(ignore the Line Fit plots, and the Normal Probability plot).
(2) By RegressTemplate.xlsm: Tut4-Q1-RegressPrice_of_Beef(answer)&.xlsm
(without the Normal probability plot; with axes hand-tuned to achieve
square plots).
(3) By MiniRegressTemplate.xlsx: Tut4-Q1-MiniRegressionTemplatePrice_of_Beef(answer)&.xlsx (axes are scaled automatically, so plots are
mostly not square; the Normal probability plot has a very slight hint of a short
right tail).
All plots look generally OK, with no clearly discernible pattern.
(c) How much (total) variation in the price of beef can be explained by this
model?
R Square = 91.7%. [Adjusted R Square (88.9%) is the proportion of the variance in the
price of beef which is explained by the model.]
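The adjusted figure can be checked from R Square alone. The following is a minimal sketch, assuming n = 17 yearly observations (1925 to 1941) and k = 4 explanatory variables; the function name adjusted_r_square is our own and is not an Excel function.

```python
def adjusted_r_square(r_square: float, n: int, k: int) -> float:
    """Adjusted R Square = 1 - (1 - R Square) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_square) * (n - 1) / (n - k - 1)


# One observation per year from 1925 to 1941 inclusive (assumed from the question).
n_years = 1941 - 1925 + 1

# R Square is the rounded value quoted in the answer; k = 4 X variables.
adj = adjusted_r_square(0.917, n=n_years, k=4)
print(f"Adjusted R Square = {adj:.1%}")
```

The penalty factor (n - 1) / (n - k - 1) is 16/12 here, which is why the adjusted value sits noticeably below R Square for such a short data set.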
(d) Consider the coefficient of beef consumption per capita (CBE). Does it say
that the price of beef goes up when people eat less beef? Explain.
The coefficient is negative, which suggests lower beef price (PBE) is
expected with higher consumption (CBE). But it must be interpreted after
allowing for the effects of the other predictors. CBE (beef consumption)
may be collinear with CFO (food consumption), for example. In addition,
regression coefficients cannot be interpreted causally. Eating less beef (an
X) may occur along with higher beef prices (Y), but it is unlikely to cause
them. For example, the causation (if any) may actually run in the other
direction: more expensive beef (PBE as an X) may lead to lower per capita
consumption (CBE as Y); i.e. the whole regression might have to be
rethought!
So, does having price as Y (instead of as an X) result in a nonsensical
model altogether? Probably.
See also Tut4-Q1-RegressPrice_of_Beef&.xlsm where Year is included, yielding
a model with a smaller Significance F. Year could be standing for factors that
were not included in the original collection of 4 Xs. Some would argue that
PFO (retail food price index) should have captured any trend that Year is here
standing in for, but apparently not. When the Run button is pressed, an
even better model obtains at Tut4-Q1-RegressPrice_of_BeefSol&.xlsm after
RDINC (Real Disposable Income) is deleted, and with variables ordered
according to importance. Year is not even the least important in this model:
PFO (Retail Food Price Index) is worse.
Lesson: don't throw away any data (e.g. Year) if possible; let SSS do any
throwing.
(2) The U.S.News March-1999 rankings of the best Business Schools in the U.S.A. were
analysed with Microsoft Excel, giving the output below, where the Overall Scores of
business schools are regressed on the ten published numerical measures.
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.993695
R Square            0.98743
Adjusted R Square   0.984122
Standard Error      1.617535
Observations        49

ANOVA
             df   SS            MS      F           Significance F
Regression   10   7809.92298    781     298.49653   6.516E-33
Residual     38   99.42395923   2.616
Total        48   7909.346939

               t Stat   P-value     Lower 95%    Upper 95%
Intercept      -6.72    5.975E-08   -130.72502   -70.168723
GMAT           5.33     4.712E-06   0.0751262    0.16713456
RecruitRank    -5.09    1.001E-05   -0.232269    -0.1000923
AcadRank       -4.91    1.75E-05    -0.2301368   -0.0957946
GPA            4.683    3.55E-05    9.2470973    23.3269618
Salary         4.147    0.0001819   0.0001152    0.00033496
Employ3Month   2.695    0.0104222   6.2374214    43.8895516
Employ0Month   2.473    0.0180055   1.7044955    17.1038925
Enroll         0.443    0.6600896   -0.000741    0.00115641
AcceptRate     0.336    0.7389825   -6.6389221   9.27796985
Fees           -0.02    0.9844997   -0.0001469   0.00014405
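The summary statistics in this output all follow from the ANOVA sums of squares and degrees of freedom. The sketch below recomputes them as a consistency check; the variable names are ours, and only the SS and df values are taken from the output.

```python
import math

# Sums of squares and degrees of freedom copied from the ANOVA table.
ss_regression, df_regression = 7809.92298, 10
ss_residual, df_residual = 99.42395923, 38
ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_stat = ms_regression / ms_residual                      # F
r_square = ss_regression / ss_total                       # R Square
adj_r_square = 1 - ms_residual / (ss_total / df_total)    # Adjusted R Square
standard_error = math.sqrt(ms_residual)                   # Standard Error

print(f"F                 = {f_stat:.3f}")
print(f"R Square          = {r_square:.5f}")
print(f"Adjusted R Square = {adj_r_square:.6f}")
print(f"Standard Error    = {standard_error:.6f}")
```

Written this way, Adjusted R Square is seen to compare variances (mean squares), whereas R Square compares sums of squares.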
The correlation matrix for the variables (some with shortened names) is as
follows:
1
0.357  1
0.022  0.221  1
0.119  0.24   0.691967
Overall  Acad  Recruit  GPA   GMAT  Accept  Salary    Employ  Employ3  Fees     Enrol
100      1     5        3.59  722   6.70%   $105,700  80.80%  98.90%   $24,990  726
98       2     2        3.5   689   12.90%  $105,000  95.40%  98.50%   $26,260  1,767
98       5     3        3.45  685   15.50%  $115,000  97.50%  100.00%  $25,872  2,504
98       2     1        3.5   685   13.10%  $100,000  95.00%  100.00%  $26,290  1,557
95       2     10       3.5   690   13.40%  $100,000  97.10%  97.10%   $27,100  708
94       5     7        3.38  695   22.70%  $95,000   94.70%  98.20%   $26,284  2,808
93       9     8        3.45  680   11.50%  $106,000  98.00%  99.10%   $27,770  1,373
93       7     3        3.34  672   22.00%  $97,000   99.70%  100.00%  $25,185  1,927
91       9     6        3.34  664   15.90%  $98,950   98.80%  99.40%   $26,548  671
90       9     16       3.5   683   14.50%  $95,000   97.70%  99.70%   $20,534  1,046
89       13    8        3.36  685   15.00%  $95,000   93.60%  98.30%   $21,479  491
88       9     17       3.4   671   12.00%  $100,000  97.40%  99.00%   $26,100  375
86       14    13       3.4   675   17.70%  $95,000   94.80%  97.80%   $27,923  2,902
84       7     13       3.43  674   11.00%  $90,000   66.80%  87.40%   $19,792  809
83       18    18       3.4   682   25.80%  $95,000   89.70%  97.80%   $25,355  430
81       14    13       3.25  647   28.40%  $92,500   90.60%  94.20%   $25,135  576
81       17    11       3.2   641   22.60%  $92,000   96.00%  98.00%   $16,353  461
79       16    18       3.2   653   29.80%  $88,370   89.10%  95.30%   $24,130  641
79       18    12       3.37  660   23.20%  $85,000   84.70%  88.70%   $15,268  726
73       20    23       3.26  624   25.50%  $76,000   96.70%  99.20%   $15,619  252
72       25    24       3.34  640   30.70%  $77,500   80.50%  99.20%   $24,200  515
72       20    22       3.3   631   40.10%  $80,000   82.00%  91.80%   $17,013  548
71       22    26       3.2   639   27.60%  $80,000   89.90%  94.40%   $23,970  629
71       25    28       3.3   650   25.30%  $87,000   76.40%  97.50%   $24,852  1,229
69       25    24       3.2   630   45.20%  $77,000   85.10%  92.00%   $24,240  433
68       34    21       3.21  637   36.40%  $80,000   73.90%  92.10%   $25,678  519
68       30    31       3.21  641   26.50%  $77,000   83.00%  98.10%   $14,958  438
68       34    29       3.35  653   24.00%  $70,000   80.00%  94.20%   $15,235  867
68       22    35       3.24  620   45.60%  $74,000   96.50%  99.10%   $16,332  1,276
67       22    27       3.13  624   33.10%  $75,000   87.60%  95.20%   $23,800  727
66       34    49       3.3   628   42.80%  $78,500   94.00%  97.00%   $12,411  853
66       34    33       3.3   628   27.20%  $71,600   87.50%  94.80%   $12,182  457
65       52    38       3.2   632   44.60%  $75,500   93.10%  100.00%  $12,865  200
64       30    54       3.4   634   24.90%  $62,000   72.20%  97.20%   $12,411  321
64       44    70       3.31  652   27.00%  $71,500   75.50%  96.20%   $20,577  249
63       52    37       3.2   631   48.70%  $75,000   84.10%  95.50%   $16,300  295
63       42    49       3.2   636   32.10%  $71,000   82.10%  97.30%   $22,598  855
63       42    47       3.3   619   33.60%  $63,000   86.10%  97.70%   $9,125   181
63       34    20       3.4   601   67.90%  $66,000   38.90%  88.90%   $21,670  1,575
63       44    65       3.2   663   30.50%  $65,000   86.40%  97.70%   $19,867  127
63       25    33       3.3   612   47.50%  $67,000   72.60%  93.40%   $19,004  505
63       25    30       3.28  613   45.90%  $62,000   79.30%  92.50%   $16,230  470
63       44    43       3.2   633   42.00%  $65,000   88.40%  99.00%   $20,500  666
62       30    40       3.2   614   40.00%  $74,150   82.60%  89.70%   $20,900  1,139
62       63    43       3.27  630   35.10%  $70,550   90.60%  97.90%   $16,218  390
62       52    45       3.4   632   43.40%  $72,000   74.70%  91.10%   $23,304  389
62       34    40       3.3   647   32.30%  $64,700   71.40%  88.80%   $13,677  349
61       59    54       3.52  634   43.70%  $65,200   78.40%  91.20%   $7,740   271
61       34    35       3.28  618   27.00%  $65,000   77.10%  88.10%   $14,756  247
(a) In the context of the regression model, which variable appeared to be least
related to the overall rankings? Please explain. Comment on whether this
is surprising? What is the practical implication of this result for a typical
U.S.News reader?
Fees is least related: it has the largest P-value. Rather surprising. On the
whole, given other variables that are better predictors of ranking, higher fees
do not imply a better school. If this model is to be believed, a reader should be
aware that high fees might be arbitrary (if it can be argued that schools
charge high fees when they are ranked high).
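Ranking the X variables by P-value, as done in this tutorial, can be sketched as below. The P-values are copied from the Excel output for the 10-variable model; the 0.05 cut-off is a conventional assumption of ours, not something stated in the question.

```python
# P-values of the 10 explanatory variables (intercept excluded).
p_values = {
    "GMAT": 4.712e-06,
    "RecruitRank": 1.001e-05,
    "AcadRank": 1.75e-05,
    "GPA": 3.55e-05,
    "Salary": 0.0001819,
    "Employ3Month": 0.0104222,
    "Employ0Month": 0.0180055,
    "Enroll": 0.6600896,
    "AcceptRate": 0.7389825,
    "Fees": 0.9844997,
}

# Smaller P-value = stronger evidence that the variable matters,
# given that all the other variables are already in the model.
ranked = sorted(p_values, key=p_values.get)
most_important = ranked[0]
least_important = ranked[-1]

# Assumed 5% significance level.
not_significant = [name for name in ranked if p_values[name] > 0.05]

print("Most important: ", most_important)
print("Least important:", least_important)
print("P-value > 0.05: ", not_significant)
```

This ranking holds only for the full model; P-values can shift once any variable is deleted.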
(b) In the context of the regression model, what is the most important
characteristic of a good business school? Please explain. How would you
interpret this result in a simple English sentence for a typical U.S.News
reader?
GMAT is most important, since it has smallest P-value. This is not the
final model, since some variables should probably be deleted (and then
relative importance of variables may change), hence it is premature to
[...] advising an MBA applicant, who has the same criteria for
ranking business schools as U.S.News, on how he might choose the best
among a list of business schools (none of which appears on the U.S.News
list) he had been offered admissions for September 1999. You should not
assume that the MBA applicant would know how U.S.News arrived at the
rankings, or how to do statistical analyses on his own (i.e., don't give him a
formula, just tell him what to look for). Note: a school that did not
participate in the U.S.News survey will not typically have external
measures like RecruitRank, but will normally publish internal data like Fees
and Median Starting Salary.
Applicant cannot (and has no intention to) control many of the drivers in a
regression equation, and is more dependent on correlation analysis: just
aim for school with highest Salary score.
Dear MBA Applicant, Among all the schools that have offered you
admission, you might do well to choose the one with the highest Salary
score. We have identified this to be the single best predictor of the value
you are likely to derive from your MBA education.
(h) Explain how you might obtain a good (better?) prediction equation for the
overall score of a business school. Make a guess as to whether your
procedure is likely to yield a good model, giving your reasons.
Step-wise procedure, like the SSS. Since Sig F and adjusted R-square are
already good, step-wise procedure likely to yield even better (more
useable) model.
(i) Find as good a prediction model as you can for predicting the overall score
of a business school. Clearly state your prediction equation and explain
why it might be good.
If we apply variable selection to the original model Tut4-Q2-RegressU.S.NewsMarch1999Rankings.xlsm, we end up with the last 3
variables (Fees, AcceptRate and Enrol) being deleted from the original
regression model: Tut4-Q2-RegressU.S.NewsMarch1999RankingsSol.xlsm.
It is surprising that the relative importance of the original variables stayed
intact, except that AcadRank and RecruitRank switched adjacent places. This
probably means that the 3 deleted variables were not much related to the
final set of 7 Xs (deleting collinear Xs may change some of the remaining
p-values considerably but not others, and thus alter the relative
significance of the remaining Xs).
It makes more sense to base our regression analysis on this final model,
rather than on the original model with 10 Xs, since the bigger model wasn't
valid as yet.
(3) 4e: P 558, 5e: P 493, problem 19. Electric utility stocks. Tut4-Q3-P10_19.xlsx
Traditionally, utility stocks are bought by pensioners for their steady
dividends, not for their upside price potentials.
Variables in model                                             Standard Error  R Square     Adjusted R Square  Significance F
Home Size                                                      29945.59071     0.48556915   0.482093266        4.06E-23
Home Size, Lot Size                                            24552.37187     0.656518175  0.651844953        7.74E-35
Home Size, Lot Size, Number Rooms                              24613.2081      0.657162102  0.650117488        9.08E-34
Home Size, Lot Size, Number Rooms, Number Baths                23770.43227     0.682428382  0.673667785        3.86E-35
Home Size, Lot Size, Number Rooms, Number Baths, Home (index)  23741.54434     0.685384624  0.674460479        1.87E-34
[Figure: Standard Error, R Square, Adjusted R Square and Significance F (log scale) plotted against the number of variables in the model.]
Note that the vertical axis for Significance F has a log scale. We can see
the same kink at 3 on the horizontal scales of all three graphs: a rise for
Standard Error and Significance F, and a dip for Adjusted R Square. There
is actually also a slight rise for R Square. This marks the introduction
of the 3rd variable (Number Rooms), which does not go well in the presence
of Home Size and Lot Size. We also see that Significance F increases when
the last (5th) variable, Home (index), is added, whereas the other 3 statistics
did not react adversely. This is why Significance F is used in the SSS
scheme of variable selection to produce a parsimonious model: it
suggests that the 4-variable model is best, whereas the other 3 criteria all
prefer the full 5-variable model.
If we instead rank the 5 variables by importance, and delete the worst one in
turn, we get the following.
Variables in model                                             Standard Error  R Square     Adjusted R Square  Significance F
Home Size                                                      29945.59        0.48556915   0.482093266        4.06E-23
Home Size, Lot Size                                            24552.37        0.656518175  0.651844953        7.74E-35
Home Size, Lot Size, Number Baths                              23699.0802      0.682155018  0.675623956        3.68E-36
Home Size, Lot Size, Number Baths, Home (index)                23670.832       0.685084092  0.676396757        2.11E-35
Home Size, Lot Size, Number Baths, Home (index), Number Rooms  23741.54434     0.685384624  0.674460479        1.87E-34
[Figure: Standard Error, R Square, Adjusted R Square and Significance F (log scale) plotted against the number of variables in the model.]
Here we can more clearly see why Significance F is the better criterion for
variable selection. Viewing it as a forward stepwise procedure, R Square
kept increasing with more variables, but all the other statistics turned after
a while: Significance F turned at 3 variables, then Standard Error and
Adjusted R Square turned at 4. Note that the 3-variable model here (with
Number Baths) was not picked up by the earlier sequence (whose 3-variable model had Number Rooms).
For this and other reasons, Significance F is preferable to both Standard
Error and Adjusted R Square (not to mention R Square) for variable
selection: it prefers models with fewer X variables.
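The comparison of criteria can be made explicit by letting each criterion pick its preferred model size in the two sequences of models. This is only an illustrative sketch using the statistics reported for the two sequences; it is not the SSS procedure itself, and the names Model and best_sizes are our own.

```python
from typing import NamedTuple


class Model(NamedTuple):
    n_vars: int
    std_error: float
    r_square: float
    adj_r_square: float
    sig_f: float


def best_sizes(models):
    """Number of X variables in the model preferred by each criterion."""
    return {
        "Standard Error": min(models, key=lambda m: m.std_error).n_vars,
        "R Square": max(models, key=lambda m: m.r_square).n_vars,
        "Adjusted R Square": max(models, key=lambda m: m.adj_r_square).n_vars,
        "Significance F": min(models, key=lambda m: m.sig_f).n_vars,
    }


# First sequence: Number Rooms enters as the 3rd variable.
first_sequence = [
    Model(1, 29945.59071, 0.48556915, 0.482093266, 4.06e-23),
    Model(2, 24552.37187, 0.656518175, 0.651844953, 7.74e-35),
    Model(3, 24613.2081, 0.657162102, 0.650117488, 9.08e-34),
    Model(4, 23770.43227, 0.682428382, 0.673667785, 3.86e-35),
    Model(5, 23741.54434, 0.685384624, 0.674460479, 1.87e-34),
]

# Second sequence: variables ordered by importance.
second_sequence = [
    Model(1, 29945.59, 0.48556915, 0.482093266, 4.06e-23),
    Model(2, 24552.37, 0.656518175, 0.651844953, 7.74e-35),
    Model(3, 23699.0802, 0.682155018, 0.675623956, 3.68e-36),
    Model(4, 23670.832, 0.685084092, 0.676396757, 2.11e-35),
    Model(5, 23741.54434, 0.685384624, 0.674460479, 1.87e-34),
]

first_choice = best_sizes(first_sequence)
second_choice = best_sizes(second_sequence)
print("First sequence: ", first_choice)
print("Second sequence:", second_choice)
```

In both sequences Significance F picks the smallest model of the four criteria, while R Square always picks the largest.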
(b) The regression coefficients with 5 Xs are as follows:
Y: Price       Coefficients  Standard Error  t Stat    P-value
Intercept      81697.6779    10571.61791     7.72802   1.7095E-12
Lot Size       7688.82242    892.3935895     8.615954  1.1224E-14
Home Size      30.3636612    7.052401732     4.305436  3.0647E-05
Number Baths   14726.9008    4329.243968     3.401726  0.00086703
Home (index)   52.8994104    45.47676299     1.163218  0.2466648
Number Rooms   863.266897    2327.602126     0.370883  0.71127009
Where Home Size, Number Baths, Home (index) and Number Rooms are
the same, an increase of one unit in Lot Size is associated with an increase
of about 7700 in Selling Price, on average. The other coefficients are
interpreted in the same way.
Remember: this shouldn't be the model we are interpreting, since it is not
yet valid, as some P-values are large.
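Each t Stat is simply the coefficient divided by its standard error, which gives a quick check on the regression output. A minimal sketch follows; the |t| < 2 screen is only a rough rule of thumb that we assume here, and the P-values remain the proper guide.

```python
# (coefficient, standard error) pairs from the 5-variable regression output.
estimates = {
    "Intercept": (81697.6779, 10571.61791),
    "Lot Size": (7688.82242, 892.3935895),
    "Home Size": (30.3636612, 7.052401732),
    "Number Baths": (14726.9008, 4329.243968),
    "Home (index)": (52.8994104, 45.47676299),
    "Number Rooms": (863.266897, 2327.602126),
}

# t Stat = coefficient / standard error.
t_stats = {name: coef / se for name, (coef, se) in estimates.items()}

# Rough screen (assumed rule of thumb): |t| below about 2 suggests
# the variable adds little once the others are in the model.
weak = [name for name, t in t_stats.items() if abs(t) < 2]

for name, t in t_stats.items():
    print(f"{name:<13} t = {t:.3f}")
print("Weak variables:", weak)
```

The two variables flagged are the ones with the large P-values, which is why this 5-variable model should not be the final one.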
(c) R Square, 68.5%, is the proportion of (total) variation in the selling price
that is explained by the 5 variables.
(5) 4e: P 659, problem 61; 5e: P 581, problem 56. Business admissions. Tut4-Q5-P11_61.xlsx
Freshman: Year 1; Sophomore: Year 2; Junior: Year 3; Senior: Year 4. Some
U.S. colleges (universities) require students to declare their majors after the
sophomore year (e.g. Berkeley); some after the freshman year (e.g. MIT); while
others (especially less well-known colleges/universities) admit students
directly from Year 1 (e.g. Wharton of U Penn, which is, as an exception, very
well known; its graduates include Donald Trump, who launched his run for
U.S. President in 2015).
This problem considers whether to catch students early for a major (before
they settle into some other discipline), or to accept them later when their
aptitudes can be better determined. This issue doesn't exist in the British
(and Singaporean) system, because students apply to different majors for
university admission, and are thus streamed before Year 1 (just like Wharton).
Tut4-Q5-P11_61(answers)&.xlsx gives the correlations.
a. Correlations all low (rather surprising!): largest is 0.43, with most quite a
bit smaller. Hence, collinearity not likely to be a problem.
b. Starting with Tut4-Q5-P11_61Regress&.xlsm, using SSS (click on Run), we
end with Tut4-Q5-P11_61RegressSol&.xlsm, whose largest p-value is >
0.05 (equivalently, its 95% Confidence Interval of -0.0014 to 0.1929 captures
0). If we eliminate that last variable, as required by the question, we obtain
Tut4-Q5-P11_61RegressSol2&.xlsm (although it has a larger Significance F).
The prediction equation is
I-core = -0.30 + 0.15 M119 + 0.20 E270 + 0.14 A202 + 0.14 A201 + 0.12
L201 + 0.12 K201.
The proportion of total variation in I-core that is explained by the freshman
and sophomore grades is only 36%, the R Square value. The standard
error is 0.46, about half a letter grade (commonly for GPA: A is 4, B is 3, C
is 2, D is 1, F is 0).
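To see what the prediction equation implies, it can be applied to a hypothetical student. The sketch below assumes the 4-point grade scale just described; the two example students (straight Bs, straight As) are our own illustrations and are not taken from the data.

```python
# Rounded coefficients from the prediction equation for I-core.
INTERCEPT = -0.30
COEFFICIENTS = {
    "M119": 0.15,
    "E270": 0.20,
    "A202": 0.14,
    "A201": 0.14,
    "L201": 0.12,
    "K201": 0.12,
}


def predict_i_core(grades):
    """Predicted I-core grade from grade points in the six courses."""
    return INTERCEPT + sum(COEFFICIENTS[c] * grades[c] for c in COEFFICIENTS)


# Hypothetical students: B (3.0) or A (4.0) in every course.
all_b = {course: 3.0 for course in COEFFICIENTS}
all_a = {course: 4.0 for course in COEFFICIENTS}

pred_b = predict_i_core(all_b)
pred_a = predict_i_core(all_a)
print(f"Straight-B student: predicted I-core = {pred_b:.2f}")
print(f"Straight-A student: predicted I-core = {pred_a:.2f}")
```

With a standard error of 0.46, any single prediction is uncertain by roughly a full letter grade either way, consistent with the low R Square.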
c. Only Calculus (M119) and Computers (K201) among freshman classes are
useful for predicting junior performance (Wait! How is this possible? Are
we not talking about a business program here? Who needs quant stuff for
biz?!); the other variables are sophomore classes.