Linear Regression
David J. Olive
Department of Mathematics, Southern Illinois University, Carbondale, IL, USA
Y ⊥ x | β^T x.
Multiple linear regression and many experimental design models are special
cases of the linear regression model, and the models can be presented
compactly by defining the population model in terms of the sufficient predictor
SP = β^T x and the estimated model in terms of the estimated sufficient
predictor ESP = β̂^T x. In particular, the response plot or estimated sufficient
summary plot of the ESP versus Y is used to visualize the conditional
distribution Y | β^T x. The residual plot of the ESP versus the residuals is used
to visualize the conditional distribution of the residuals given the ESP.
The literature on multiple linear regression is enormous. See Stigler (1986)
and Harter (1974a,b, 1975a,b,c, 1976) for history. Draper (2002) is a good
source for more recent literature. Some texts that were standard at one time
include Wright (1884), Johnson (1892), Bartlett (1900), Merriman (1907),
Weld (1916), Leland (1921), Ezekiel (1930), Bennett and Franklin (1954),
Ezekiel and Fox (1959), and Brownlee (1965). Recent reprints of several of
these texts are available from www.amazon.com.
Draper and Smith (1966) was a breakthrough because it popularized the
use of residual plots, making the earlier texts obsolete. Excellent texts include
Chatterjee and Hadi (2012), Draper and Smith (1998), Fox (2015), Hamil-
ton (1992), Kutner et al. (2005), Montgomery et al. (2012), Mosteller and
Tukey (1977), Ryan (2009), Sheather (2009), and Weisberg (2014). Cook and
Weisberg (1999a) was a breakthrough because of its use of response plots.
Multivariate linear regression and MANOVA models are special cases. Recent
results from Kakizawa (2009), Su and Cook (2012), Olive et al. (2015), and
Olive (2016b) make the multivariate linear regression model (Chapter 12)
easy to learn after the student has mastered the multiple linear regression
model (Chapters 2 and 3). For the multivariate linear regression model, it is
assumed that the iid zero mean error vectors have fourth moments.
Fourth, recent literature on plots for goodness and lack of fit, bootstrapping,
outlier detection, response transformations, prediction intervals, prediction
regions, and variable selection has been incorporated into the text. See Olive
(2004b, 2007, 2013a,b, 2016a,b,c) and Olive and Hawkins (2005).
Chapter 1 reviews the material to be covered in the text and can be
skimmed and then referred to as needed. Chapters 2 and 3 cover multiple lin-
ear regression, Chapter 4 considers generalized least squares, and Chapters 5
through 9 consider experimental design models. Chapters 10 and 11 cover lin-
ear model theory and the multivariate normal distribution. These chapters
are needed for the multivariate linear regression model covered in Chapter 12.
Chapter 13 covers generalized linear models (GLMs) and generalized additive
models (GAMs).
The text also uses recent literature to provide answers to the following
important questions:
How can the conditional distribution Y | β^T x be visualized?
How can β be estimated?
How can variable selection be performed efficiently?
How can Y be predicted?
The text emphasizes prediction and visualizing the models. Some of the
applications in this text using this research are listed below.
1) It is shown how to use the response plot to detect outliers and to assess
the adequacy of linear models for multiple linear regression and experimental
design.
2) A graphical method for selecting a response transformation for linear
models is given. Linear models include multiple linear regression and many
experimental design models. This method is also useful for multivariate linear
regression.
3) A graphical method for assessing variable selection for the multiple
linear regression model is described. It is shown that for submodels I with
k predictors, the widely used screen Cp(I) ≤ k is too narrow. More good
submodels are considered if the screen Cp(I) ≤ min(2k, p) is used. Variable
selection methods originally meant for multiple linear regression can be ex-
tended to GLMs. See Chapter 13. Similar ideas from Olive and Hawkins
(2005) have been incorporated in Agresti (2013). Section 3.4.1 shows how to
bootstrap the variable selection estimator.
4) Asymptotically optimal prediction intervals for a future response Yf
are given for models of the form Y = x^T β + e where the errors are iid,
can be used to download the R functions and data sets into R. Type ls(). Over
65 R functions from lregpack.txt should appear. In R, enter the command q().
A window asking Save workspace image? will appear. Click on No to remove
the functions from the computer (clicking on Yes saves the functions on R,
but the functions and data are easily obtained with the source commands).
Chapters 2–7 can be used for a one-semester course in regression and
experimental design. For a course in generalized linear models, replace some of
the design chapters by Chapter 13. Design chapters could also be replaced by
Chapters 12 and 13. A more theoretical course would cover Chapters 1, 10, 11,
and 12.
Acknowledgments
This work has been partially supported by NSF grants DMS 0202922 and
DMS 0600933. Collaborations with Douglas M. Hawkins and R. Dennis Cook
were extremely valuable. I am grateful to the developers of useful mathemat-
ical and statistical techniques and to the developers of computer software
and hardware (including R Core Team (2016)). Cook (1998) and Cook and
Weisberg (1999a) influenced this book. Teaching material from this text has
been invaluable. Some of the material in this text has been used in a Math
583 regression graphics course, a Math 583 experimental design course, and
a Math 583 robust statistics course. In 2009 and 2016, Chapters 2 to 7 were
used in Math 484, a course on multiple linear regression and experimental
design. Chapters 11 and 12 were used in a 2014 Math 583 theory of linear
models course. Chapter 12 was also used in a 2012 Math 583 multivariate
analysis course. Chapter 13 was used for a categorical data analysis course.
Thanks also goes to Springer, to Springer's associate editor Donna
Chernyk, and to several reviewers.
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Some Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Other Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
Chapter 1
Introduction
This chapter provides a preview of the book but is presented in a rather ab-
stract setting and will be easier to follow after reading the rest of the book.
The reader may omit this chapter on first reading and refer back to it as
necessary. Chapters 2 to 9 consider multiple linear regression and experimental
design models fit with least squares. Chapter 1 is useful for extending several
techniques, such as response plots and plots for response transformations used
in those chapters, to alternative fitting methods and to alternative regression
models. Chapter 13 illustrates some of these extensions for the generalized
linear model (GLM) and the generalized additive model (GAM).
Response variables are the variables of interest, and are predicted with
a p × 1 vector of predictor variables x = (x1, ..., xp)^T where x^T is the
transpose of x. A multivariate regression model has m > 1 response variables.
For example, predict Y1 = systolic blood pressure and Y2 = diastolic blood
pressure using a constant x1 , x2 = age, x3 = weight, and x4 = dosage amount
of blood pressure medicine. The multivariate location and dispersion model
of Chapter 10 is a special case of the multivariate linear regression model of
Chapter 12.
A univariate regression model has one response variable Y. Suppose Y
is independent of the predictor variables x given a function h(x), written
Y ⊥ x | h(x), where h: R^p → R^d and the integer d is as small as possible. Then
Y follows a dD regression model, where d ≤ p since Y ⊥ x | x. If Y ⊥ x, then
Y follows a 0D regression model. Then there are 0D, 1D, ..., pD regression
models, and all univariate regression models are dD regression models for
some integer 0 ≤ d ≤ p. Cook (1998, p. 49) and Cook and Weisberg (1999a,
p. 414) use similar notation with h(x) = (x^T β1, ..., x^T βd)^T.
The remainder of this chapter considers 1D regression models, where h:
R^p → R is a real function. The additive error regression model Y = m(x) + e
is an important special case with h(x) = m(x). See Section 13.2. An important
special case of the additive error model is the linear regression model
Y = x^T β + e = x1 β1 + ... + xp βp + e. Multiple linear regression and many
experimental design models are special cases of the linear regression model.
The multiple linear regression model has at least one predictor xi that
takes on many values. Chapter 2 fits this model with least squares and
Chapter 3 considers variable selection models such as forward selection. There are
many other methods for fitting the multiple linear regression model, including
lasso, ridge regression, partial least squares (PLS), and principal component
regression (PCR). See James et al. (2013), Olive (2017), and Pelawa
Watagoda and Olive (2017). Chapters 2 and 3 consider response plots, plots
for response transformations, and prediction intervals for the multiple linear
regression model fit by least squares. All of these techniques can be extended
to alternative fitting methods.
Notation. Often the index i will be suppressed. For example, the linear
regression model

Yi = α + β^T xi + ei    (1.2)

for i = 1, ..., n where β is a p × 1 unknown vector of parameters, and ei
is a random error. This model could be written Y = α + β^T x + e. More
accurately, Y | x = α + β^T x + e, but the conditioning on x will often be
suppressed. Often the errors e1, ..., en are iid (independent and identically
distributed) from a distribution that is known except for a scale parameter.
For example, the ei's might be iid from a normal (Gaussian) distribution
with mean 0 and unknown standard deviation σ. For this Gaussian model,
estimation of α, β, and σ is important for inference and for predicting a new
value of the response variable Yf given a new vector of predictors xf.
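The estimation step for this Gaussian model can be illustrated numerically. The book's software is R (the lregpack functions), so the following Python/NumPy sketch is only an illustration with invented names and data: it simulates the model, recovers α, β, and σ by least squares, and forms a prediction at a new predictor vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
alpha, beta, sigma = 2.0, np.array([1.0, -0.5, 0.0]), 1.0
x = rng.normal(size=(n, p))
Y = alpha + x @ beta + rng.normal(scale=sigma, size=n)   # Gaussian errors

# Least squares: augment x with a column of ones so the first coefficient is alpha
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
alpha_hat, beta_hat = coef[0], coef[1:]
resid = Y - X @ coef
sigma_hat = np.sqrt(resid @ resid / (n - p - 1))   # sqrt of the residual mean square

# Prediction at a new predictor vector xf
xf = np.zeros(p)
Yf_hat = alpha_hat + xf @ beta_hat
```

With n = 100 observations the estimates land close to the true (α, β, σ) used to generate the data.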
The class of 1D regression models is very rich, and many of the most
used statistical models, including GLMs and GAMs, are 1D regression mod-
els. Nonlinear regression, nonparametric regression, and linear regression are
special cases of the additive error regression model
Y = h(x) + e = SP + e. (1.3)
Z = t^{-1}(α + β^T x + e)    (1.5)

Y = t(Z) = α + β^T x + e.    (1.6)
Sections 3.2 and 5.4 show how to choose the response transformation t(Z)
graphically, and these techniques are easy to extend to the additive error
regression model Y = h(x) + e. Then the response transformation model is
Y = tλ(Z) = hλ(x) + e, and the graphical method for selecting the response
transformation is to plot ĥλi(x) versus tλi(Z) for several values of λi, choosing
the value of λ = λ0 where the plotted points follow the identity line with unit
slope and zero intercept. For the multiple linear regression model, ĥλi(x) =
x^T β̂λi where β̂λi can be found using the desired fitting method, e.g. lasso.
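Under the stated assumptions, the search over λ can be sketched numerically. This Python/NumPy sketch is an illustration, not the text's R code; the ladder `t_lam` and the λ grid are hypothetical choices. Instead of eyeballing the transformation plots, it uses R² of each fit as a crude numerical stand-in for how closely the plotted points would follow the identity line.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1.0, 3.0, size=(n, 2))
# True model is linear on the log scale: log(Z) = 1 + x1 + x2 + e,
# so the ladder value lambda = 0 (the log transform) should win
Z = np.exp(1.0 + x @ np.array([1.0, 1.0]) + rng.normal(scale=0.05, size=n))

def t_lam(z, lam):
    """Power transformation ladder: (z^lam - 1)/lam, with log(z) at lam = 0."""
    return np.log(z) if lam == 0 else (z ** lam - 1.0) / lam

X = np.column_stack([np.ones(n), x])
r2 = {}
for lam in (-1.0, -0.5, 0.0, 0.5, 1.0):
    W = t_lam(Z, lam)                                  # candidate response t_lambda(Z)
    fit = X @ np.linalg.lstsq(X, W, rcond=None)[0]     # fitted h_lambda-hat(x)
    r2[lam] = 1.0 - np.sum((W - fit) ** 2) / np.sum((W - W.mean()) ** 2)

# Pick the lambda whose plot of fitted values versus t_lambda(Z)
# hugs the identity line most closely (highest R^2 here)
best = max(r2, key=r2.get)
```

In practice one would look at the five plots rather than the R² values, but the same λ is selected here.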
Box (1979) warns that all models are wrong, but some are useful. For
example, the function g in equation (1.4) or the error distribution could
be misspecified. Diagnostics are used to check whether model assumptions
such as the form of g and the proposed error distribution are reasonable.
Often diagnostics use residuals ri. For example, the additive error regression
model (1.3) uses

ri = Yi − ĥ(xi)

where ĥ(x) is an estimate of h(x).
Z = t^{-1}(α + β^T x + e),    (1.7)

Y = t(Z) = α + β^T x + e    (1.8)
able Y and the predictors x. The response plot is used to visualize the
conditional distribution of Y | x, Y | SP, and Y | (α + β^T x) if SP = α + β^T x.
iii) Check for lack of fit of the model with a residual plot of the ESP versus
the residuals.
iv) Fit the model and find ĥ(x). If SP = α + β^T x, estimate α and β, e.g.,
using maximum likelihood estimators.
v) Estimate the mean function E(Yi | xi) = μ(xi) = di τ(xi) or estimate
τ(xi) where the di are known constants.
vii) Check for overdispersion with an OD plot. See Section 13.8.
viii) Check whether Y is independent of x, that is, check whether the
nontrivial predictors x are needed in the model. Check whether SP = h(x) ≡
c where the constant c does not depend on the xi. If SP = α + β^T x, check
whether β = 0, for example, test Ho: β = 0.
ix) Check whether a reduced model can be used instead of the full model.
If SP = α + β^T x = α + βR^T xR + βO^T xO where the r × 1 vector xR consists
of the nontrivial predictors in the reduced model, test Ho: βO = 0.
x) Use variable selection to find a good submodel.
xi) Predict Yi given xi.
The field of statistics known as regression graphics gives useful results for
examining the 1D regression model (1.1) even when the model is unknown or
misspecified. The following section shows that the sufficient summary plot is
useful for explaining the given 1D model while the response plot can often be
used to visualize the conditional distribution of Y |SP . Also see Chapter 13
and Olive (2013b).
Suppose that the response variable Y is quantitative and that at least one pre-
dictor variable xi is quantitative. Then the multiple linear regression (MLR)
model is often a very useful model. For the MLR model,
[Figure residue: sufficient summary plot of SP versus Y and response plots of ESP versus Y for the artificial data; both axes range from about −10 to 5.]
β̂^T x, and the estimated conditional mean function is μ̂(ESP) = ESP. The
estimated or fitted value of Yi is equal to Ŷi = α̂ + β̂^T xi. Now the vertical
deviation of Yi from the identity line is equal to the residual ri = Yi − (α̂ +
β̂^T xi). The interpretation of the ESSP is almost the same as that of the
SSP, but now the mean SP is estimated by the estimated sufficient predictor
(ESP). This plot is also called the response plot and is used as a goodness
of fit diagnostic. The residual plot is a plot of the ESP versus ri and is used
as a lack of fit diagnostic. These two plots should be made immediately after
fitting the MLR model and before performing inference. Figures 1.2 and 1.3
show the response plot and residual plot for the artificial data.
The response plot is also a useful visual aid for describing the ANOVA F
test (see Section 2.4) which tests whether β = 0, that is, whether the nontrivial
predictors x are needed in the model. If the predictors are not needed in the
model, then Yi and E(Yi | xi) should be estimated by the sample mean Ȳ. If
the predictors are needed, then Yi and E(Yi | xi) should be estimated by the
ESP Ŷi = α̂ + β̂^T xi. If the identity line clearly fits the data better than the
horizontal line Y = Ȳ, then the ANOVA F test should have a small p-value
and reject the null hypothesis Ho that the predictors x are not needed in the
MLR model. Figure 1.2 shows that the identity line fits the data better than
any horizontal line. Figure 1.4 shows the response plot for the artificial data
when only X4 and X5 are used as predictors with the identity line and the
line Y = Ȳ added as visual aids. In this plot the horizontal line fits the data
about as well as the identity line, which was expected since Y is independent
of X4 and X5.
It is easy to find data sets where the response plot looks like Figure 1.4,
but the p-value for the ANOVA F test is very small. In this case, the MLR
model is statistically significant, but the investigator needs to decide whether
the MLR model is practically significant.
SP = α + β^T x = α + βS^T xS + βE^T xE = α + βS^T xS.    (1.11)

The extraneous terms that can be eliminated given that the subset S is in
the model have zero coefficients: βE = 0.
Now suppose that I is a candidate subset of predictors, that S ⊆ I and
that O is the set of predictors not in I. Then

SP = α + β^T x = α + βS^T xS = α + βS^T xS + β(I/S)^T xI/S + 0^T xO = α + βI^T xI,

where xI/S denotes the predictors in I that are not in S. Since this is true
regardless of the values of the predictors, βO = 0 if S ⊆ I. Hence for any
subset I that includes all relevant predictors, the population correlation

corr(α + β^T xi, α + βI^T xI,i) = 1.    (1.12)
and consideration. To make this advice more specific, use the rule of thumb
that a candidate subset of predictors I is worth considering if the sample
correlation of ESP and ESP(I) satisfies

corr(α̂ + β̂^T xi, α̂I + β̂I^T xI,i) = corr(β̂^T xi, β̂I^T xI,i) ≥ 0.95.    (1.13)
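A small numerical illustration of the screen (a hypothetical Python/NumPy sketch, not code from the text): a submodel containing the relevant predictors gives corr(ESP, ESP(I)) near 1, while a submodel that omits them does not.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5
x = rng.normal(size=(n, p))
Y = 1.0 + 2.0 * x[:, 0] + 1.0 * x[:, 1] + rng.normal(size=n)   # S = {0, 1}

def esp(cols):
    """OLS ESP alpha_I-hat + beta_I-hat^T x_I for the submodel using `cols`."""
    XI = np.column_stack([np.ones(n), x[:, cols]])
    return XI @ np.linalg.lstsq(XI, Y, rcond=None)[0]

full_esp = esp([0, 1, 2, 3, 4])
good_esp = esp([0, 1])        # contains the relevant predictors
bad_esp = esp([2, 3, 4])      # omits them

corr_good = np.corrcoef(full_esp, good_esp)[0, 1]
corr_bad = np.corrcoef(full_esp, bad_esp)[0, 1]
```

Here `corr_good` clears the 0.95 screen while `corr_bad` is far below it.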
The difficulty with this approach is that fitting large numbers of possible
submodels involves substantial computation. Fortunately, (ordinary) least
squares (OLS) frequently gives a useful ESP, and methods originally meant
for multiple linear regression using the Mallows Cp criterion (see Jones 1946
and Mallows 1973) also work for more general 1D regression models. As a rule
of thumb, the OLS ESP is useful if |corr(OLS ESP, ESP)| ≥ 0.95 where ESP
is the standard ESP (e.g., for generalized linear models, the ESP is α̂ + β̂^T x
where (α̂, β̂) is the maximum likelihood estimator of (α, β)), or if the OLS
response plot suggests that the OLS ESP is good. Variable selection will be
discussed in much greater detail in Chapters 3 and 13, but the following
methods are useful for a large class of 1D regression models.
Backward elimination starts with the full model. All models contain a
constant = U0. Hence the full model contains U0, X1, ..., Xp−1. We will also
say that the full model contains U0, U1, ..., Up−1 where Ui need not equal Xi
for i ≥ 1.
Step 1) k = p − 1: fit each model with p − 1 predictors including a constant.
Delete the predictor Up−1, say, that corresponds to the model with the
smallest Cp. Keep U0, ..., Up−2.
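The backward elimination steps can be sketched as follows. This is an illustrative Python/NumPy version using the Cp criterion; the function names and simulated data are invented for the example and are not from the text's R code.

```python
import numpy as np

rng = np.random.default_rng(3)
n, q = 100, 4
X = rng.normal(size=(n, q))
Y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)   # only X_0 is needed

def cp(cols):
    """Mallows Cp(I) = SSE(I)/MSE + 2k - n, with k = len(cols) + 1 for the constant."""
    def sse(c):
        Xc = np.column_stack([np.ones(n), X[:, list(c)]])
        r = Y - Xc @ np.linalg.lstsq(Xc, Y, rcond=None)[0]
        return r @ r
    mse = sse(range(q)) / (n - q - 1)
    return sse(cols) / mse + 2 * (len(cols) + 1) - n

def backward_elimination():
    """At each step delete the predictor whose removal gives the smallest Cp."""
    active = list(range(q))
    path = [(tuple(active), cp(active))]
    while len(active) > 1:
        best_cp, drop = min(
            (cp([j for j in active if j != i]), i) for i in active)
        active.remove(drop)
        path.append((tuple(active), best_cp))
    return path

path = backward_elimination()
best_subset = min(path, key=lambda s: s[1])[0]
```

With only X_0 relevant, the irrelevant predictors are deleted first and the minimum-Cp model along the path retains X_0.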
All subsets variable selection examines all subsets and keeps track of
several (up to three, say) subsets with the smallest Cp(I) for each group of
submodels containing k predictors including a constant. This method can be
used for p ≤ 30 by using the efficient leaps and bounds algorithms when
OLS and Cp is used (see Furnival and Wilson 1974).
Rule of thumb for variable selection (assuming that the cost of each
predictor is the same): find the submodel Im with the minimum Cp. If Im
uses km predictors including a constant, do not use any submodel that has
more than km predictors. Since the minimum Cp submodel often has too
many predictors, also look at the submodel Io with the smallest value of k,
say ko, such that Cp ≤ 2k. This submodel may have too few predictors.
So look at the predictors in Im but not in Io and see if they can be deleted
or not. (If Im = Io, then it is a good candidate for the best submodel.)
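Under the stated assumptions, the rule of thumb can be illustrated with an all-subsets search. This Python/NumPy sketch brute-forces every subset, which is feasible only because p is small here; for larger p the leaps and bounds algorithm mentioned above would be needed. All names and data are invented for the example.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
n, q = 100, 6
X = rng.normal(size=(n, q))
Y = 1.0 + 3.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(size=n)   # S = {0, 1}

def sse(cols):
    Xc = np.column_stack([np.ones(n), X[:, list(cols)]])
    r = Y - Xc @ np.linalg.lstsq(Xc, Y, rcond=None)[0]
    return r @ r

mse = sse(range(q)) / (n - q - 1)

# All-subsets search: Cp(I) = SSE(I)/MSE + 2k - n, k = |I| + 1 for the constant
results = {}
for m in range(1, q + 1):
    for I in combinations(range(q), m):
        results[I] = sse(I) / mse + 2 * (m + 1) - n

Im = min(results, key=results.get)                 # minimum-Cp submodel
Io = min((I for I in results if results[I] <= 2 * (len(I) + 1)),
         key=len)                                  # smallest k with Cp(I) <= 2k
```

Any submodel that omits X_0 or X_1 has an enormous Cp, so both Im and Io retain the true subset S.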
where SSE is the residual sum of squares from the full model and SSE(I) is
the residual sum of squares from the candidate submodel. Then

Cp(I) = SSE(I)/MSE + 2k − n = (p − k)(FI − 1) + k    (1.14)

where MSE is the residual mean square for the full model. Let ESP(I) =
α̂I + β̂I^T x be the ESP for the submodel and let VI = Y − ESP(I) so that
VI,i = Yi − (α̂I + β̂I^T xi). Let ESP and V denote the corresponding quantities
for the full model. Then Olive and Hawkins (2005) show that corr(VI, V) → 1
forces corr(OLS ESP, OLS ESP(I)) → 1 and that

corr(V, VI) = √(SSE/SSE(I)) = √((n − p)/(Cp(I) + n − 2k)) = √((n − p)/((p − k)FI + n − p)).

Notice that the submodel Ik that minimizes Cp(I) also maximizes corr(V, VI)
among all submodels I with k predictors including a constant. If Cp(I) ≤ 2k
and n ≥ 10p, then 0.948 ≤ corr(V, V(I)), and both corr(V, V(I)) → 1.0 and
corr(OLS ESP, OLS ESP(I)) → 1.0 as n → ∞.
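The correlation identity corr(V, VI) = √(SSE/SSE(I)) is exact for nested OLS fits with an intercept, and is easy to verify numerically. The following Python/NumPy sketch (invented data, not from the text) compares the two sides.

```python
import numpy as np

rng = np.random.default_rng(5)
n, q = 100, 5
X = rng.normal(size=(n, q))
Y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)

def resid(cols):
    Xc = np.column_stack([np.ones(n), X[:, cols]])
    return Y - Xc @ np.linalg.lstsq(Xc, Y, rcond=None)[0]

V = resid(list(range(q)))      # full-model residuals, V = Y - ESP
VI = resid([0, 1])             # submodel residuals,  VI = Y - ESP(I)
sse, sse_i = V @ V, VI @ VI

lhs = np.corrcoef(V, VI)[0, 1]   # sample correlation of V and VI
rhs = np.sqrt(sse / sse_i)       # sqrt(SSE / SSE(I))
```

The two quantities agree to machine precision because the submodel's column space is contained in the full model's.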
Suppose that the OLS ESP and the standard ESP are highly correlated:
|corr(ESP, OLS ESP)| ≥ 0.95. Then often OLS variable selection can be used
for the 1D data, and using the p-values from OLS output seems to be a useful
benchmark. To see this, suppose that n ≥ 5p and first consider the model
Ii that deletes the predictor Xi. Then the model has k = p − 1 predictors
including the constant, and the test statistic is ti where

ti² = FIi.
or

Cp(Ii) = Cp(Ifull) + (ti² − 2).

Using the screen Cp(I) ≤ min(2k, p) suggests that the predictor Xi should
not be deleted if

|ti| > √2 ≈ 1.414.

If |ti| < √2, then the predictor can probably be deleted since Cp decreases.
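Both relations, Cp(Ifull) = p and Cp(Ii) = Cp(Ifull) + (ti² − 2), hold exactly for OLS and can be checked numerically. This Python/NumPy sketch uses invented data and is only an illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, q = 80, 4
X = rng.normal(size=(n, q))
Y = 1.0 + X[:, 0] + rng.normal(size=n)

Xf = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(Xf, Y, rcond=None)
res = Y - Xf @ coef
mse = res @ res / (n - q - 1)
# OLS t statistics; t[0] belongs to the intercept, t[i + 1] to predictor X_i
t = coef / np.sqrt(mse * np.diag(np.linalg.inv(Xf.T @ Xf)))

def cp(cols):
    Xc = np.column_stack([np.ones(n), X[:, cols]])
    rc = Y - Xc @ np.linalg.lstsq(Xc, Y, rcond=None)[0]
    return rc @ rc / mse + 2 * (len(cols) + 1) - n

cp_full = cp(list(range(q)))                     # equals p = q + 1 exactly
i = 2
cp_del = cp([j for j in range(q) if j != i])     # Cp after deleting X_i
```

Deleting one predictor changes Cp by exactly ti² − 2, which is why the |ti| > √2 cutoff appears.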
More generally, for the partial F test, notice that by (1.14), Cp(I) ≤ 2k
iff (p − k)FI − p + 2k ≤ 2k iff (p − k)FI ≤ p iff

FI ≤ p/(p − k).
Now k is the number of terms in the model including a constant while p − k
is the number of terms set to 0. As k → 0, the partial F test will reject Ho
(i.e., say that the full model should be used instead of the submodel I) unless
FI is not much larger than 1. If p is very large and p − k is very small, then
the partial F test will tend to suggest that there is a model I that is about
as good as the full model even though model I deletes p − k predictors.
The Cp(I) ≤ k screen tends to overfit. An additive error single index model
is Y = m(α + x^T β) + e. We simulated multiple linear regression and single
index model data sets with p = 8 and n = 50, 100, 1000, and 10000. The true
model S satisfied Cp(S) ≤ k for about 60% of the simulated data sets, but S
satisfied Cp(S) ≤ 2k for about 97% of the data sets.
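A small simulation in the same spirit can be run in a few lines. This Python/NumPy sketch uses its own, smaller settings (200 replications, n = 100, a specific true model), so the exact percentages will differ from the study quoted above; the qualitative point is that the Cp(S) ≤ 2k screen admits the true model far more often than Cp(S) ≤ k.

```python
import numpy as np

rng = np.random.default_rng(7)
n, q = 100, 7              # p = q + 1 = 8 terms including the constant
kS = 3                     # S: the constant plus the first two predictors
reps, hit_k, hit_2k = 200, 0, 0
for _ in range(reps):
    X = rng.normal(size=(n, q))
    Y = 1.0 + X[:, 0] + X[:, 1] + rng.normal(size=n)
    Xf = np.column_stack([np.ones(n), X])
    rf = Y - Xf @ np.linalg.lstsq(Xf, Y, rcond=None)[0]
    mse = rf @ rf / (n - q - 1)
    rs = Y - Xf[:, :kS] @ np.linalg.lstsq(Xf[:, :kS], Y, rcond=None)[0]
    cp_s = rs @ rs / mse + 2 * kS - n
    hit_k += cp_s <= kS        # screen Cp(S) <= k
    hit_2k += cp_s <= 2 * kS   # screen Cp(S) <= 2k
```

Since Cp(S) = (p − k)(FS − 1) + k with FS roughly an F statistic near 1, the ≤ k screen fails whenever FS exceeds 1, while the ≤ 2k screen allows FS up to p/(p − k).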
The 1D regression models offer a unifying framework for many of the most
used regression models. By writing the model in terms of the sufficient
predictor SP = h(x), many important topics valid for all 1D regression models can
be explained compactly. For example, the previous section presented variable
selection, and equation (1.14) can be used to motivate the test for whether
the reduced model can be used instead of the full model. Similarly, the
sufficient predictor can be used to unify the interpretation of coefficients and to
explain models that contain interactions and factors.
Interpretation of Coefficients
One interpretation of the coefficients in a 1D model (1.11) is that βi is the
rate of change in the SP associated with a unit increase in xi when all other
predictor variables x1, ..., xi−1, xi+1, ..., xp are held fixed. Denote a model
by SP = α + β^T x = α + β1 x1 + ... + βp xp. Then

βi = ∂SP/∂xi  for i = 1, ..., p.

Of course, holding all other variables fixed while changing xi may not be
possible. For example, if x1 = x, x2 = x² and SP = α + β1 x + β2 x², then x2
cannot be held fixed when x1 increases by one unit, but

d SP/dx = β1 + 2 β2 x.
The interpretation of βi changes with the model in two ways. First,
the interpretation changes as terms are added and deleted from the SP.
Hence the interpretation of β1 differs for models SP = α + β1 x1 and
E(Y | SP) = ρ(SP) = exp(SP) / (1 + exp(SP)),

and the change in the conditional expectation associated with a one unit
increase in xi is more complex.
Interactions
Suppose X1 is quantitative and X2 is qualitative with 2 levels and X2 = 1
for level a2 and X2 = 0 for level a1. Then a first order model with interaction
is SP = α + β1 x1 + β2 x2 + β3 x1 x2. This model yields two unrelated lines in the
sufficient predictor depending on the value of x2: SP = α + β2 + (β1 + β3) x1
if x2 = 1 and SP = α + β1 x1 if x2 = 0. If β3 = 0, then there are two
parallel lines: SP = α + β2 + β1 x1 if x2 = 1 and SP = α + β1 x1 if x2 = 0.
If β2 = β3 = 0, then the two lines are coincident: SP = α + β1 x1 for
both values of x2. If β2 = 0, then the two lines have the same intercept:
SP = α + (β1 + β3) x1 if x2 = 1 and SP = α + β1 x1 if x2 = 0. In general,
as factors have more levels and interactions have more terms, e.g. x1 x2 x3 x4,
the interpretation of the model rapidly becomes very complex.
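The two-lines decomposition can be verified directly from the coefficients. In this small Python sketch the coefficient values are arbitrary, chosen only to illustrate the slopes and intercepts named above.

```python
import numpy as np

# Arbitrary coefficients for SP = a + b1*x1 + b2*x2 + b3*x1*x2
a, b1, b2, b3 = 1.0, 2.0, 0.5, -1.0

def sp(x1, x2):
    return a + b1 * x1 + b2 * x2 + b3 * x1 * x2

x1 = np.linspace(0.0, 1.0, 5)
line0 = sp(x1, 0.0)   # level a1: intercept a, slope b1
line1 = sp(x1, 1.0)   # level a2: intercept a + b2, slope b1 + b3

slope0 = (line0[-1] - line0[0]) / (x1[-1] - x1[0])
slope1 = (line1[-1] - line1[0]) / (x1[-1] - x1[0])
```

Setting b3 = 0 makes the two slopes equal (parallel lines), and setting b2 = 0 as well makes the lines coincide.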
1.5 Complements
the plotted points will fall about some line with slope β and intercept α if
the SLR model holds, but in a plot of SP = α + β^T xi versus Yi, the plotted
points will fall about the identity line with unit slope and zero intercept if the
multiple linear regression model holds. If there are more than two nontrivial
predictors, then we generally cannot find a sufficient summary plot and need
to use an estimated sufficient summary plot.
Important theoretical results for the additive error single index model Y =
m(α + β^T x) + e were given by Brillinger (1977, 1983) and Aldrin et al. (1993).
Li and Duan (1989) extended these results to models of the form

Y = g(α + β^T x, e)    (1.15)

where g is a bivariate inverse link function. Olive and Hawkins (2005) discuss
variable selection while Chang (2006) and Chang and Olive (2007, 2010)
discuss (ordinary) least squares (OLS) tests. Severini (1998) discusses when
OLS output is relevant for the Gaussian additive error single index model.
1.6 Problems
R Problem
This chapter introduces the multiple linear regression model, the response
plot for checking goodness of fit, the residual plot for checking lack of fit,
the ANOVA F test, the partial F test, the t tests, and least squares. The
problems use software R, SAS, Minitab, and Arc.
Definition 2.1. The response variable is the variable that you want to
predict. The predictor variables are the variables used to predict the
response variable.
Notation. In this text the response variable will usually be denoted by
Y and the p predictor variables will often be denoted by x1 , . . . , xp . The
response variable is also called the dependent variable while the predictor
variables are also called independent variables, explanatory variables, carri-
ers, or covariates. Often the predictor variables will be collected in a vector x.
Then xT is the transpose of x.
Definition 2.2. Regression is the study of the conditional distribution
Y | x of the response variable Y given the vector of predictors x =
(x1, ..., xp)^T.
Definition 2.3. A quantitative variable takes on numerical values
while a qualitative variable takes on categorical values.
Example 2.1. Archeologists and crime scene investigators sometimes
want to predict the height of a person from partial skeletal remains. A model
for prediction can be built from nearly complete skeletons or from living
humans, depending on the population of interest (e.g., ancient Egyptians
or modern US citizens). The response variable Y is height and the predictor
variables might be x1 ≡ 1, x2 = femur length, and x3 = ulna length.
Definition 2.4. Suppose that the response variable Y and at least one
predictor variable xi are quantitative. Then the multiple linear regression
(MLR) model is

Yi = xi,1 β1 + xi,2 β2 + ... + xi,p βp + ei = xi^T β + ei    (2.1)

for i = 1, ..., n. Here n is the sample size and the random variable ei is the
ith error. Suppressing the subscript i, the model is Y = x^T β + e.
Y = Xβ + e,    (2.2)

If the predictor variables are random variables, then the above MLR model
is conditional on the observed values of the xi. That is, observe the xi and
then act as if the observed xi are fixed.
Definition 2.6. The unimodal MLR model has the same assumptions
as the constant variance MLR model, as well as the assumption that the zero
mean constant variance errors e1, ..., en are iid from a unimodal distribution
that is not highly skewed. Note that E(ei) = 0 and V(ei) = σ² < ∞.
Definition 2.7. The normal MLR model or Gaussian MLR model has
the same assumptions as the unimodal MLR model but adds the assumption
that the errors e1, ..., en are iid N(0, σ²) random variables. That is, the ei
are iid normal random variables with zero mean and variance σ².
The unknown coefficients for the above 3 models are usually estimated
using (ordinary) least squares (OLS).
Definition 2.9. The ordinary least squares (OLS) estimator β̂OLS
minimizes

QOLS(b) = Σ_{i=1}^{n} ri²(b),    (2.4)
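The minimizer of the OLS criterion solves the normal equations X^T X b = X^T Y. The following Python/NumPy sketch (an illustration with invented data, not the text's code) computes β̂OLS that way and checks that perturbing it only increases the criterion.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # trivial predictor first
beta = np.array([1.0, 2.0, -1.0])
Y = X @ beta + rng.normal(size=n)

def Q(b):
    """OLS criterion: the sum of squared residuals r_i(b)^2."""
    r = Y - X @ b
    return r @ r

# The minimizer solves the normal equations X^T X b = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
```

Because Q is a convex quadratic in b, β̂OLS is the unique global minimizer whenever X has full column rank.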
There are many statistical models besides the MLR model, and you should
learn how to quickly recognize an MLR model. A regression model has a
response variable Y and the conditional distribution of Y given the predic-
tors x = (x1 , . . . , xp )T is of interest. Regression models are used to predict Y
and to summarize the relationship between Y and x. If a constant xi,1 ≡ 1
(this notation means that xi,1 = 1 for i = 1, ..., n) is in the model, then
xi,1 is often called the trivial predictor, and the MLR model is said to have
Notation: For MLR, the residual plot will often mean the residual
plot of Ŷi versus ri, and the response plot will often mean the plot of Ŷi
versus Yi.
Remark 2.1. For any MLR analysis, always make the response
plot and the residual plot of Ŷi versus Yi and ri, respectively.
Fig. 2.1 Residual and Response Plots for the Tremearne Data
Denition 2.13. An outlier is an observation that lies far away from the
bulk of the data.
Remark 2.2. For MLR, the response plot is important because MLR
is the study of the conditional distribution of Y | x^T β, and the response
plot is used to visualize the conditional distribution of Y | x^T β since
Ŷ = x^T β̂ is a good estimator of x^T β if β̂ is a good estimator of β.
If the MLR model is useful, then the plotted points in the response plot
should be linear and scatter about the identity line with no gross outliers.
Suppose the fitted values range in value from wL to wH with no outliers. Fix
the fit = w in this range and mentally add a narrow vertical strip centered at
w to the response plot. The plotted points in the vertical strip should have a
mean near w since they scatter about the identity line. Hence Y | fit = w is
like a sample from a distribution with mean w. The following example helps
illustrate this remark.
The response plot may look good while the residual plot suggests that the
unimodal MLR model can be improved. Examining plots to find model
violations is called checking for lack of fit. Again assume that n ≥ 5p.
The unimodal MLR model often provides a useful model for the data, but
the following assumptions do need to be checked.
i) Is the MLR model appropriate?
ii) Are outliers present?
iii) Is the error variance constant or nonconstant? The constant variance
assumption VAR(ei) ≡ σ² is known as homoscedasticity. The nonconstant
variance assumption VAR(ei) = σi² is known as heteroscedasticity.
iv) Are any important predictors left out of the model?
v) Are the errors e1 , . . . , en iid?
vi) Are the errors ei independent of the predictors xi ?
Make the response plot and the residual plot to check i), ii), and iii). An
MLR model is reasonable if the plots look like Figures 1.2, 1.3, 1.4, and 2.1.
A response plot that looks like Figure 13.7 suggests that the model is not
linear. If the plotted points in the residual plot do not scatter about the
r = 0 line with no other pattern (i.e., if the cloud of points is not ellipsoidal or
rectangular with zero slope), then the unimodal MLR model is not sustained.
The ith residual ri is an estimator of the ith error ei. The constant variance
assumption may have been violated if the variability of the point cloud in the
residual plot depends on the value of Ŷ. Often the variability of the residuals
increases as Ŷ increases, resulting in a right opening megaphone shape. (Figure
4.1b has this shape.) Often the variability of the residuals decreases as Ŷ
increases, resulting in a left opening megaphone shape. Sometimes the
variability decreases then increases again, and sometimes the variability increases
then decreases again (like a stretched or compressed football).
Remark 2.3. Residual plots magnify departures from the model while the
response plot emphasizes how well the MLR model fits the data.
Since the residuals ri = êi are estimators of the errors, the residual plot
is used to visualize the conditional distribution e | SP of the errors given the
sufficient predictor SP = x^T β, where SP is estimated by Ŷ = x^T β̂. For the
unimodal MLR model, there should not be any pattern in the residual plot:
as a narrow vertical strip is moved from left to right, the behavior of the
residuals within the strip should show little change.
2.3 Checking Lack of Fit 25
Notation. A rule of thumb is a rule that often but not always works well
in practice.
Rule of thumb 2.1. If the residual plot would look good after several
points have been deleted, and if these deleted points were not gross outliers
(points far from the point cloud formed by the bulk of the data), then the
residual plot is probably good. Beginners often find too many things wrong
with a good model. For practice, use the lregpack function MLRsim to generate
several MLR data sets, and make the response and residual plots for these
data sets: type MLRsim(nruns=10) in R and right click Stop for each plot
(20 times) to generate 10 pairs of response and residual plots. This exercise
will help show that the plots can have considerable variability even when the
MLR model is good. See Problem 2.30.
Rule of thumb 2.2. If the plotted points in the residual plot look like
a left or right opening megaphone, the first model violation to check is the
assumption of nonconstant variance. (This is a rule of thumb because it is
possible that such a residual plot results from another model violation such
as nonlinearity, but nonconstant variance is much more common.)
The residual plot of Y versus r should always be made. It is also a good idea
to plot each nontrivial predictor xj versus r and to plot potential predictors
wj versus r. If the predictor is quantitative, then the residual plot of xj versus
r should look like the residual plot of Y versus r. If the predictor is qualitative,
e.g. gender, then interpreting the residual plot is much more difficult; however,
if each category contains many observations, then the plotted points for each
category should form a vertical line centered at r = 0 with roughly the same
variability (spread or range).
Rule of thumb 2.3. Suppose that the MLR model uses predictors xj
and that data has been collected on variables wj that are not included in
the MLR model. To check whether important predictors have been left out,
make residual plots of xj and wj versus r. If these plots scatter about the
r = 0 line with no other pattern, then there is no evidence that x²j or wj are
needed in the model. If the plotted points scatter about a parabolic curve,
try adding x²j or wj and w²j to the MLR model. If the plot of the potential
predictor wj versus r has a linear trend, try adding wj to the MLR model.
The additive error regression model and EE plot in Section 13.7 can also be
used to check whether important predictors have been left out.
Rule of thumb 2.4. To check that the errors are independent of the pre-
dictors, make residual plots of xj versus r. If the plot of xj versus r scatters
about the r = 0 line with no other pattern, then there is no evidence that the
errors depend on xj . If the variability of the residuals changes with the value
of xj , e.g. if the plot resembles a left or right opening megaphone, the errors
may depend on xj . Some remedies for nonconstant variance are considered
in Chapter 4.
26 2 Multiple Linear Regression
To study residual plots, some notation and properties of the least squares
estimator are needed. MLR is the study of the conditional distribution of
Yi|xiᵀβ, and the MLR model is Y = Xβ + e where X is an n × p matrix
of full rank p. Hence the number of predictors p ≤ n. The ith row of X is
xiᵀ = (xi,1, ..., xi,p) where xi,k is the value of the ith observation on the
kth predictor xk. We will denote the jth column of X by Xj ≡ vj, which
corresponds to the jth variable or predictor xj.
Example 2.4. If Y is brain weight in grams, x1 ≡ 1, x2 is age, and x3 is
the size of the head in (mm)³, then for the Gladstone (1905) data
Y = (3738, 4261, ..., 3306)ᵀ, and the rows of X are (1, 39, 149.5),
(1, 35, 152.5), ..., (1, 19, 141), so that X = [v1 v2 v3].
Hence the first person had brain weight = 3738, age = 39, and size = 149.5.
After deleting observations with missing values, there were n = 267 cases
(people measured on brain weight, age, and size), and x267 = (1, 19, 141)T .
The second predictor x2 = age corresponds to the 2nd column of X and
is X2 = v2 = (39, 35, ..., 19)ᵀ. Notice that X1 ≡ v1 = 1 = (1, ..., 1)ᵀ
corresponds to the constant x1.
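In matrix software the columns vj are just slices of X. A tiny Python sketch using only the three rows of the Gladstone data visible in Example 2.4 (the full data set has n = 267 cases):

```python
import numpy as np

# three of the n = 267 Gladstone cases shown in Example 2.4
Y = np.array([3738.0, 4261.0, 3306.0])
X = np.array([[1.0, 39.0, 149.5],
              [1.0, 35.0, 152.5],
              [1.0, 19.0, 141.0]])

# X = [v1 v2 v3]: v1 is the constant, v2 the age column, v3 the size column
v1, v2, v3 = X[:, 0], X[:, 1], X[:, 2]
print(v2)        # the age column (39, 35, 19)
```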
The results in the following proposition are properties of least squares
(OLS), not of the underlying MLR model. See Chapter 11 for more linear
model theory. Definitions 2.8 and 2.9 define the hat matrix H, vector of
fitted values Ŷ, and vector of residuals r. Parts f) and g) make residual plots
useful. If the plotted points are linear with roughly constant variance and the
correlation is zero, then the plotted points scatter about the r = 0 line with
no other pattern. If the plotted points in a residual plot of w versus r do show
a pattern such as a curve or a right opening megaphone, zero correlation will
usually force symmetry about either the r = 0 line or the w = median(w)
line. Hence departures from the ideal plot of random scatter about the r = 0
line are often easy to detect.
Hᵀ = (Xᵀ)ᵀ[(XᵀX)⁻¹]ᵀXᵀ = X(XᵀX)⁻¹Xᵀ = H.
A = ∑_{i=1}^n Ŷi ri − Ȳ ∑_{i=1}^n ri = ∑_{i=1}^n Ŷi ri
by d) again. But ∑_{i=1}^n Ŷi ri = rᵀŶ = 0 by e).
g) Following the argument in f), the result follows if
A = ∑_{i=1}^n (xi,j − x̄j)(ri − r̄) = 0
where x̄j = ∑_{i=1}^n xi,j/n is the sample mean of the jth predictor. Now
r̄ = ∑_{i=1}^n ri/n = 0 by d), and thus
A = ∑_{i=1}^n xi,j ri − x̄j ∑_{i=1}^n ri = ∑_{i=1}^n xi,j ri
by d) again. But ∑_{i=1}^n xi,j ri = Xjᵀ r = vjᵀ r = 0 by c).
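The orthogonality facts used in this proof — ∑ri = 0, rᵀŶ = 0, and Xjᵀr = 0 — are easy to confirm numerically. A Python sketch on simulated data (the model, coefficients, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
fit = H @ Y                            # fitted values
r = Y - fit                            # residuals

print(abs(r.sum()))                    # d): residuals sum to 0 (constant in model)
print(abs(fit @ r))                    # e): fitted values orthogonal to residuals
print(np.abs(X.T @ r).max())           # c): every column of X orthogonal to r
```

All three quantities are zero up to floating-point roundoff, which is why residual plots scatter about the r = 0 line with zero correlation when the model is correct.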
Without loss of generality, E(e) = 0 for the unimodal MLR model with a
constant, in that if E(e) = μ ≠ 0, then the MLR model can always be written
as Y = xᵀβ + e where E(e) = 0 and E(Y) ≡ E(Y|x) = xᵀβ. To see this
claim, notice that
Y = β1 + x2β2 + ··· + xpβp + e = β1 + E(e) + x2β2 + ··· + xpβp + e − E(e)
= β̃1 + x2β2 + ··· + xpβp + ẽ
where β̃1 = β1 + E(e) and ẽ = e − E(e). For example, if the errors ei are iid
exponential(λ) with E(ei) = λ, use ẽi = ei − λ.
For least squares, it is crucial that σ² exists. For example, if the ei are iid
Cauchy(0,1), then σ² does not exist and the least squares estimators tend to
perform very poorly.
The performance of least squares is analogous to the performance of Ȳ.
The sample mean Ȳ is a very good estimator of the population mean μ if the
Yi are iid N(μ, σ²), and Ȳ is a good estimator of μ if the sample size is large
and the Yi are iid with mean μ and variance σ². This result follows from
the central limit theorem (CLT), but how large is large depends on the
underlying distribution. The n > 30 rule tends to hold for distributions that
are close to normal in that they take on many values and σ² is not huge. Error
distributions that are highly nonnormal with tiny σ² often need n >> 30.
For example, if Y1 , . . . , Yn are iid Gamma(1/m, 1), then n > 25m may be
needed. Another example is distributions that take on one value with very
high probability, e.g. a Poisson random variable with very small variance.
Bimodal and multimodal distributions and highly skewed distributions with
large variances also need larger n. Chihara and Hesterberg (2011, p. 177)
suggest using n > 5000 for moderately skewed distributions.
There are central limit type theorems for the least squares estimators that
depend on the error distribution of the iid errors ei . See Theorems 2.8, 11.25,
and 12.7. We always assume that the ei are continuous random variables with
a probability density function. Error distributions that are close to normal
may give good results for moderate n if n ≥ 10p and n − p ≥ 30 where p is the
number of predictors. Error distributions that need large n for the CLT to
apply for e, will tend to need large n for the limit theorems for least squares
to apply (to give good approximations).
Checking whether the errors are iid is often difficult. The iid assumption is
often reasonable if measurements are taken on different objects, e.g. people.
In industry, often several measurements are taken on a batch of material.
For example, a batch of cement is mixed and then several small cylinders of
concrete are made from the batch. Then the cylinders are tested for strength.
2.4 The ANOVA F Test 29
Experience from such experiments suggests that objects (e.g., cylinders) from
different batches are independent, but objects from the same batch are not
independent.
One check on independence can also be made if the time order of the
observations is known. Let r[t] be the residual where [t] is the time order of
the trial. Hence [1] was the 1st and [n] was the last trial. Plot the time order
t versus r[t] if the time order is known. Again, trends and outliers suggest
that the model could be improved. A box shaped plot with no trend suggests
that the MLR model is good. A plot similar to the Durbin Watson test plots
r[t−1] versus r[t] for t = 2, ..., n. Linear trend suggests serial correlation while
random scatter suggests that there is no lag 1 autocorrelation. As a rule of
thumb, if the OLS slope b is computed for the plotted points, b > 0.25 gives
some evidence that there is positive correlation between r[t−1] and r[t]. Time
series plots, such as the ACF or PACF of the residuals, may be useful.
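The lag 1 slope check takes only a few lines of code. A Python sketch with simulated residual series (the autoregressive coefficient 0.8 and the seeds are invented for illustration):

```python
import numpy as np

def lag1_slope(res):
    """OLS slope from regressing r[t] on r[t-1] for t = 2, ..., n."""
    x, y = res[:-1] - res[:-1].mean(), res[1:] - res[1:].mean()
    return (x @ y) / (x @ x)

rng = np.random.default_rng(2)
e = rng.normal(size=300)
ar = np.empty_like(e)                # serially correlated "residuals"
ar[0] = e[0]
for t in range(1, len(e)):
    ar[t] = 0.8 * ar[t - 1] + e[t]
iid = rng.normal(size=300)           # independent "residuals"

print(lag1_slope(ar))                # well above the 0.25 rule-of-thumb cutoff
print(lag1_slope(iid))               # near zero
```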
After fitting least squares and checking the response and residual plots to see
that an MLR model is reasonable, the next step is to check whether there is
an MLR relationship between Y and the nontrivial predictors x2, ..., xp. If
at least one of these predictors is useful, then the OLS fitted values Ŷi should
be used. If none of the nontrivial predictors is useful, then Ȳ will give as good
predictions as Ŷi. Here the sample mean

Ȳ = (1/n) ∑_{i=1}^n Yi.     (2.5)
In the definition below, SSE is the sum of squared residuals and a residual
ri = êi = errorhat. In the literature errorhat is often rather misleadingly
abbreviated as error.

SSTO = ∑_{i=1}^n (Yi − Ȳ)².     (2.6)

SSR = ∑_{i=1}^n (Ŷi − Ȳ)².     (2.7)

SSE = ∑_{i=1}^n (Yi − Ŷi)² = ∑_{i=1}^n ri².     (2.8)
Proof.

SSTO = ∑_{i=1}^n (Yi − Ŷi + Ŷi − Ȳ)² = SSE + SSR + 2 ∑_{i=1}^n (Yi − Ŷi)(Ŷi − Ȳ).

But

A = ∑_{i=1}^n ri Ŷi − Ȳ ∑_{i=1}^n ri = 0
Definition 2.15. Assume that a constant is in the MLR model and that
SSTO ≠ 0. The coefficient of multiple determination

R² = [corr(Yi, Ŷi)]² = SSR/SSTO = 1 − SSE/SSTO
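The identity SSTO = SSE + SSR and the three equal expressions for R² can be verified numerically. A Python sketch on simulated data (the model and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
fit = X @ beta_hat

ssto = ((Y - Y.mean()) ** 2).sum()
ssr = ((fit - Y.mean()) ** 2).sum()
sse = ((Y - fit) ** 2).sum()          # ssto == sse + ssr when a constant is in the model

r2 = ssr / ssto
print(r2, 1 - sse / ssto, np.corrcoef(Y, fit)[0, 1] ** 2)   # all three agree
```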
The following two propositions suggest that R² does not behave well when
many predictors that are not needed in the model are included in the model.
Such a variable is sometimes called a noise variable and the MLR model
is fitting noise. Proposition 2.5 appears, for example, in Cramér (1946,
pp. 414–415), and suggests that R² should be considerably larger than p/n
if the predictors are useful. Note that if n = 10p and p ≥ 2, then under the
conditions of Proposition 2.5, E(R²) ≈ 0.1.
Notice that each SS/n estimates the variability of some quantity. SSTO/n
≈ SY², SSE/n ≈ Se² = σ̂², and SSR/n ≈ SŶ².
Seber and Lee (2003, pp. 44–47) show that when the MLR model holds,
MSE is often a good estimator of σ². Under regularity conditions, the MSE
is one of the best unbiased quadratic estimators of σ². For the normal MLR
model, MSE is the uniformly minimum variance unbiased estimator of σ².
Seber and Lee also give the following theorem that shows that the MSE is an
unbiased estimator of σ² under very weak assumptions if the MLR model is
appropriate. From Theorem 12.7, MSE is a √n consistent estimator of σ².
Theorem 2.6. If Y = Xβ + e where X is an n × p matrix of full rank p, if
the ei are independent with E(ei) = 0, and VAR(ei) = σ², then σ̂² = MSE
is an unbiased estimator of σ².
The ANOVA F test tests whether any of the nontrivial predictors
x2, ..., xp are needed in the OLS MLR model, that is, whether Yi should
be predicted by the OLS fit Ŷi = β̂1 + xi,2β̂2 + ··· + xi,pβ̂p or with the
sample mean Ȳ. ANOVA stands for analysis of variance, and the computer
output needed to perform the test is contained in the ANOVA table. Below
is an ANOVA table given in symbols. Sometimes Regression is replaced by
Model and Residual by Error.
Source df SS MS F p-value
Regression p−1 SSR MSR Fo=MSR/MSE for Ho:
Residual n−p SSE MSE β2 = ··· = βp = 0
Remark 2.4. Recall that for a 4 step test of hypotheses, the pvalue is the
probability of getting a test statistic as extreme as the test statistic actually
observed, and that Ho is rejected if the pvalue < δ. As a benchmark for this
textbook, use δ = 0.05 if δ is not given. The 4th step is the nontechnical
conclusion which is crucial for presenting your results to people who are not
familiar with MLR. Replace Y and x2 , . . . , xp by the actual variables used in
the MLR model. Follow Example 2.5.
pval − pvalue →P 0
(converges to 0 in probability, so pval is a consistent estimator of pvalue)
as the sample size n → ∞. See Theorem 11.25, Section 11.6, and Chang
and Olive (2010). Then the computer output pval is a good estimator of the
unknown pvalue. We will use Ho ≡ H0 and Ha ≡ HA ≡ H1.
iii) Find the pval = P(Fp−1,n−p > Fo).
iv) State whether you reject Ho or fail to reject Ho. If Ho is rejected, conclude
that there is an MLR relationship between Y and the predictors x2 , . . . , xp . If
you fail to reject Ho, conclude that there is not an MLR relationship between
Y and the predictors x2 , . . . , xp . (Or there is not enough evidence to conclude
that there is an MLR relationship between Y and the predictors.)
Example 2.5. For the Gladstone (1905) data, the response variable Y =
brain weight, x1 ≡ 1, x2 = size of head, x3 = sex, x4 = breadth of head,
x5 = circumference of head. Assume that the response and residual plots
look good and test whether at least one of the nontrivial predictors is needed
in the model using the output shown below.
Summary Analysis of Variance Table
Source df SS MS F p-value
Regression 4 5396942. 1349235. 196.24 0.0000
Residual 262 1801333. 6875.32
Solution: i) Ho: β2 = ··· = β5 = 0 Ha: not Ho
ii) Fo = 196.24 from output.
iii) The pval = 0.0 from output.
iv) The pval < δ (= 0.05 since δ was not given). So reject Ho. Hence there
is an MLR relationship between brain weight and the predictors size, sex,
breadth, and circumference.
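The table entries satisfy MS = SS/df and Fo = MSR/MSE, so the output can be checked with a few lines. A Python sketch using the numbers from the output above:

```python
# ANOVA F statistic rebuilt from the df and SS columns of the output above
ssr, df_reg = 5396942.0, 4     # Regression row
sse, df_res = 1801333.0, 262   # Residual row

msr = ssr / df_reg             # 1349235.5, matching MS for Regression
mse = sse / df_res             # about 6875.32, matching MS for Residual
Fo = msr / mse
print(round(Fo, 2))            # 196.24, matching the F column
```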
Remark 2.5. There is a close relationship between the response plot and
the ANOVA F test. If n ≥ 10p and n − p ≥ 30 and if the plotted points
follow the identity line, typically Ho will be rejected if the identity line fits
the plotted points better than any horizontal line (in particular, the line
Ŷ = Ȳ). If a horizontal line fits the plotted points about as well as the identity
line, as in Figure 1.4, this graphical diagnostic is inconclusive (sometimes the
ANOVA F test will reject Ho and sometimes fail to reject Ho), but the MLR
relationship is at best weak. In Figures 1.2 and 2.1, the ANOVA F test
should reject Ho since the identity line fits the plotted points better than
any horizontal line. Under the above conditions, a graphical ANOVA F test
rejects Ho if the response plot is not similar to the residual plot. The graphical
test is inconclusive if the response plot looks similar to the residual plot. The
graphical test is also useful for multiple linear regression methods other than
least squares, such as M -estimators and other robust regression estimators.
Remark 2.6. If the RR plot of the residuals Yi − Ȳ versus the OLS resid-
uals ri = Yi − Ŷi shows tight clustering about the identity line, then the MLR
relationship is weak: Ȳ fits the data about as well as the OLS fit.
Example 2.6. Cook and Weisberg (1999a, pp. 261, 371) describe a data
set where rats were injected with a dose of a drug approximately proportional
to body weight. The response Y was the fraction of the drug recovered from
the rat's liver. The three predictors were the body weight of the rat, the dose
of the drug, and the liver weight. A constant was also used. The experimenter
expected the response to be independent of the predictors, and 19 cases
were used. However, the ANOVA F test suggested that the predictors were
important. The third case was an outlier and easily detected in the response
and residual plots (not shown). After deleting the outlier, the response and
residual plots looked ok and the following output was obtained.
Fig. 2.2 RR Plot With Outlier Deleted, Submodel Uses Only the Trivial Predictor with
Ŷ = Ȳ
Some assumptions are needed on the ANOVA F test. Assume that both
the response and residual plots look good. It is crucial that there are no
outliers. Then a rule of thumb is that if n − p is large, then the ANOVA
F test pvalue is approximately correct. An analogy can be made with the
central limit theorem: Ȳ is a good estimator for μ if the Yi are iid N(μ, σ²),
and also a good estimator for μ if the data are iid with mean μ and variance
σ² and n is large enough. Also see Theorem 11.25. More on the robustness and
lack of robustness of the ANOVA F test can be found in Wilcox (2012).
If all of the xi are different (no replication) and if the number of predictors
p = n, then the OLS fit Ŷi = Yi and R² = 1. Notice that Ho is rejected if the
statistic Fo is large. More precisely, reject Ho if
Fo > Fp−1,n−p,1−δ
where
P(F ≤ Fp−1,n−p,1−δ) = 1 − δ
when F ~ Fp−1,n−p. Since R² increases to 1 while (n − p)/(p − 1) decreases
to 0 as p increases to n, Theorem 2.7a below implies that if p is large then
the Fo statistic may be small even if some of the predictors are very good. It
is a good idea to use n ≥ 10p or at least n ≥ 5p if possible. Theorem 11.25
can be used to show that pval is a consistent estimator of the pvalue under
reasonable conditions.
Remark 2.7. When a constant is not contained in the model (i.e., xi,1 is
not equal to 1 for all i), then the computer output still produces an ANOVA
table with the test statistic and pvalue, and nearly the same 4 step test of
hypotheses can be used. The hypotheses are now Ho: β1 = ··· = βp = 0 Ha:
not Ho, and you are testing whether or not there is an MLR relationship
between Y and x1, ..., xp. An MLR model without a constant (no intercept)
is sometimes called a regression through the origin. See Section 2.10.
2.5 Prediction
An MLR model may fit the training data (Y1, x1), ..., (Yn, xn)
well, but when new test data is collected, a very different MLR model may be
needed to fit the new data well. In particular, the MLR model seems to fit
the data (Yi, xi) well for i = 1, ..., n, but when the researcher tries to predict
Yf for a new vector of predictors xf, the prediction is very poor in that Ŷf is
not close to the Yf actually observed. Wait until after the MLR model
has been shown to make good predictions before claiming that the
model gives good predictions!
There are several reasons why the MLR model may not fit new data well.
i) The model building process is usually iterative. Data Z, w1, ..., wr is col-
lected. If the model is not linear, then functions of Z are used as a potential
response variable and functions of the wi as potential predictors. After trial
and error, the functions are chosen, resulting in a final MLR model using Y
and x1, ..., xp. Since the same data set was used during the model building
process, biases are introduced and the MLR model fits the training data
better than it fits new test data. Suppose that Y, x1, ..., xp are specified
before collecting data and that the residual and response plots from the re-
sulting MLR model look good. Then predictions from the prespecified model
will often be better for predicting new data than a model built from an iter-
ative process.
ii) If (Yf, xf) come from a different population than the population of
(Y1, x1), ..., (Yn, xn), then prediction for Yf can be arbitrarily bad.
iii) Even a good MLR model may not provide good predictions for an xf
that is far from the xi (extrapolation).
iv) The MLR model may be missing important predictors (underfitting).
v) The MLR model may contain unnecessary predictors (overfitting).
The following theorem is analogous to the central limit theorem and the
theory for the t-interval for μ based on Ȳ and the sample standard deviation
(SD) SY. If the data Y1, ..., Yn are iid with mean μ and variance σ², then Ȳ
is asymptotically normal and the t-interval will perform well if the sample
size is large enough. The result below suggests that the OLS estimators Ŷi
and β̂ are good if the sample size is large enough. The condition max hi → 0
in probability usually holds if the researcher picked the design matrix X or
if the xi are iid random vectors from a well-behaved population. Outliers
can cause the condition to fail. Convergence in distribution, Zn →D Np(0, Σ),
means the multivariate normal approximation can be used for probability
calculations involving Zn. Assume
XᵀX/n → W⁻¹,
so that √n(β̂ − β) →D Np(0, σ² W). Equivalently,
(XᵀX)^{1/2}(β̂ − β) →D Np(0, σ² Ip).     (2.13)
where the inequality follows from Chebyshev's inequality. Hence the asymp-
totic coverage of the nominal 95% PI is at least 73.9%. The 95% PI (2.14)
was often quite accurate in that the asymptotic coverage was close to 95% for
a wide variety of error distributions. The 99% and 90% PIs did not perform
as well.
Example 2.8. For the Buxton (1920) data suppose that the response Y
= height and the predictors were a constant, head length, nasal height,
bigonal breadth, and cephalic index. Five outliers were deleted leaving 82 cases.
Figure 2.3 shows a response plot of the fitted values versus the response Y
with the identity line added as a visual aid. The plot suggests that the model
is good since the plotted points scatter about the identity line in an evenly
populated band although the relationship is rather weak since the correlation
of the plotted points is not very high. The triangles represent the upper and
lower limits of the semiparametric 95% PI (2.17). For this example, 79 (or
96%) of the Yi fell within their corresponding PI while 3 Yi did not. A plot
using the classical PI (2.14) would be very similar for this data. The plot was
made with the following R commands, using the lregpack function piplot.
x <- buxx[-c(61,62,63,64,65),]
Y <- buxy[-c(61,62,63,64,65)]
piplot(x,Y)
[Figure 2.3: response plot of the fitted values versus Y = height for the Buxton data; the vertical axis runs from about 1550 to 1900.]
Given output showing β̂i and given xf, se(pred), and se(Ŷf), Example 2.9
shows how to find Ŷf, a CI for E(Yf|xf), and the classical PI (2.14) for Yf.
Example 2.9. The Rouncefield (1995) data povc.lsp are female and
male life expectancies from n = 91 countries where 6 cases with missing GNP
were deleted. Suppose that it is desired to predict female life expectancy Y
from male life expectancy X. Suppose that if Xf = 60, then se(pred) =
2.1285, and se(Ŷf) = 0.2241. Below is some output.
a) Find Ŷf if Xf = 60.
Solution: In this example, xf = (1, Xf)ᵀ since a constant is in the output
above. Thus Ŷf = β̂1 + β̂2 Xf = −2.93739 + 1.12359(60) = 64.478.
b) If Xf = 60, find a 90% confidence interval for E(Y) ≡ E(Yf|xf).
Solution: The CI is Ŷf ± tn−2,1−δ/2 se(Ŷf) = 64.478 ± 1.645(0.2241) =
64.478 ± 0.3686 = [64.1094, 64.8466]. To use the t-table on the last page of
Chapter 14, use the 2nd to last row marked by Z since d = df = n − 2 =
89 > 30. In the last row find CI = 90% and intersect the 90% column and
the Z row to get the value of t89,0.95 ≈ z0.95 = 1.645.
c) If Xf = 60, find a 90% prediction interval for Yf.
Solution: The PI is Ŷf ± tn−2,1−δ/2 se(pred) = 64.478 ± 1.645(2.1285) =
64.478 ± 3.5014 = [60.9766, 67.9794].
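These computations are easy to script. A Python sketch of parts a)–c) (note that the intercept β̂1 = −2.93739 is negative, which is what makes Ŷf come out near 64.5):

```python
# Example 2.9 arithmetic: point prediction, CI for E(Yf|xf), and PI for Yf
b1, b2 = -2.93739, 1.12359          # OLS intercept and slope from the output
xf = 60.0
yhat_f = b1 + b2 * xf               # about 64.478

z90 = 1.645                          # t_{89, 0.95} approximated by z_{0.95}
se_fit, se_pred = 0.2241, 2.1285
ci = (yhat_f - z90 * se_fit, yhat_f + z90 * se_fit)    # CI for E(Yf | xf)
pi = (yhat_f - z90 * se_pred, yhat_f + z90 * se_pred)  # PI for Yf
print(yhat_f, ci, pi)
```

The PI is much wider than the CI because se(pred) accounts for the variability of a single new observation, not just the variability of the estimated mean.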
Two more PIs will be defined and then the 4 PIs (2.14), (2.17), (2.18),
and (2.20) will be compared via simulation. An asymptotically conservative
(ac) 100(1 − δ)% PI has asymptotic coverage 1 − δ₁ ≥ 1 − δ. We used the (ac)
100(1 − δ)% PI

Ŷf ± max(|ξ̂δ/2|, |ξ̂1−δ/2|) √((n/(n − p))(1 + hf))     (2.18)

In the simulations described below, ξ̂δ will be the sample δ percentile for the
PIs (2.17) and (2.18). A PI is asymptotically optimal if it has the shortest
asymptotic length that gives the desired asymptotic coverage. If the error
distribution is unimodal, an asymptotically optimal PI can be created by
applying the shorth(c) estimator to the residuals where c = ⌈n(1 − δ)⌉ and
⌈x⌉ is the smallest integer ≥ x, e.g., ⌈7.7⌉ = 8. That is, let r(1), ..., r(n) be the
order statistics of the residuals. Compute r(c) − r(1), r(c+1) − r(2), ..., r(n) −
r(n−c+1). Let [r(d), r(d+c−1)] = [ξ̃δ1, ξ̃1−δ2] correspond to the interval with the
smallest distance. Then the large sample asymptotically optimal 100(1 − δ)%
PI for Yf is

[Ŷf + an ξ̃δ1, Ŷf + an ξ̃1−δ2]     (2.20)

where an is given by (2.16).
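The shorth(c) computation is short enough to sketch directly. A Python version (the standard normal residuals and seed are invented for illustration, and the correction factor an from (2.16) is not applied here):

```python
import math
import numpy as np

def shorth_interval(res, delta=0.05):
    """Shortest interval [r(d), r(d+c-1)] covering c = ceil(n(1 - delta)) residuals."""
    r = np.sort(res)
    n = len(r)
    c = math.ceil(n * (1 - delta))
    widths = r[c - 1:] - r[: n - c + 1]   # r(c)-r(1), r(c+1)-r(2), ..., r(n)-r(n-c+1)
    d = int(np.argmin(widths))            # index of the interval with smallest distance
    return r[d], r[d + c - 1]

rng = np.random.default_rng(4)
res = rng.normal(size=1000)               # stand-in for MLR residuals
lo, hi = shorth_interval(res, delta=0.05)
print(lo, hi)   # roughly [-2, 2] for standard normal residuals
```

For symmetric unimodal errors the shorth interval is close to the usual percentile interval, but for skewed errors it shifts toward the heavy side, which is what makes the resulting PI asymptotically optimal in length.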
Remark 2.8. We recommend using the asymptotically optimal PI (2.20)
instead of the classical PI (2.14). The lregpack function pisim can be used to
recreate the simulation described below. See Problem 2.29.
A small simulation study compares the PI lengths and coverages for sample
sizes n = 50, 100, and 1000 for several error distributions. The value n = ∞
gives the asymptotic coverages and lengths. The MLR model with E(Yi) =
1 + xi2 + ··· + xi8 was used. The vectors (x2, ..., x8)ᵀ were iid N7(0, I7).
The error distributions were N(0,1), t3, and exponential(1) − 1. Also, a small
sensitivity study to examine the effects of changing (1 + 15/n) to (1 + k/n)
on the 99% PIs (2.17) and (2.20) was performed. For n = 50 and k between
10 and 20, the coverage increased by roughly 0.001 as k increased by 1.
Tables 2.1–2.3 show the results of the simulations for the 3 error distri-
butions. The letters c, s, a, and o refer to intervals (2.14), (2.17), (2.18),
and (2.20), respectively. For the normal errors, the coverages were about
right and the semiparametric interval tended to be rather long for n = 50
and 100. The classical PI asymptotic coverage tended to be fairly close
to the nominal coverage 1 − δ for all 3 distributions and δ = 0.01, 0.05, and
0.1. The asymptotically optimal PI tended to have short length and simulated
coverage close to the nominal coverage.
The partial F test is used to test whether the reduced model is good in
that it can be used instead of the full model. It is crucial that the reduced
model be selected before looking at the data. If the reduced model is selected
after looking at output and discarding the worst variables, then the pvalue
for the partial F test will be too high. For (ordinary) least squares, usually
a constant is used, and we are assuming that both the full model and the
reduced model contain a constant. The partial F test has null hypothesis
Ho: βq+1 = ··· = βp = 0, and alternative hypothesis HA: at least one of the
βj ≠ 0 for j > q. The null hypothesis is equivalent to Ho: the reduced model
is good. Since only the full model and reduced model are being compared,
the alternative hypothesis is equivalent to HA : the reduced model is not as
good as the full model, so use the full model, or more simply, HA : use the
full model.
To perform the partial F test, fit the full model and the reduced model
and obtain the ANOVA table for each model. The quantities dfF, SSE(F)
and MSE(F) are for the full model and the corresponding quantities from
the reduced model use an R instead of an F . Hence SSE(F) and SSE(R) are
the residual sums of squares for the full and reduced models, respectively.
Shown below is output only using symbols.
Full model
Reduced model
iii) Find the pval = P(FdfR−dfF, dfF > FR). (On exams typically an F table is
used. Here dfR − dfF = p − q = the number of parameters set to 0, and dfF = n − p,
while pval is the estimated pvalue.)
iv) State whether you reject Ho or fail to reject Ho. Reject Ho if the pval ≤ δ
and conclude that the full model should be used. Otherwise, fail to reject Ho
and conclude that the reduced model is good.
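The FR statistic can be computed with a few lines of code. A Python sketch (the SSE and residual df values are the ones that appear in Example 2.11 later in this section, used here purely for illustration):

```python
def partial_F(sse_red, sse_full, df_red, df_full):
    """FR = [(SSE(R) - SSE(F)) / (dfR - dfF)] / MSE(F)."""
    mse_full = sse_full / df_full
    return ((sse_red - sse_full) / (df_red - df_full)) / mse_full

# SSE and residual df for a full and a reduced Buxton model (Example 2.11)
FR = partial_F(sse_red=35030.1, sse_full=34267.4, df_red=72, df_full=69)
print(round(FR, 3))   # 0.512: small, so the reduced model looks good
```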
Six plots are useful diagnostics for the partial F test: the RR plot with
the full model residuals on the vertical axis and the reduced model residuals
on the horizontal axis, the FF plot with the full model fitted values on the
vertical axis, and always make the response and residual plots for the full
and reduced models. Suppose that the full model is a useful MLR model. If
the reduced model is good, then the response plots from the full and reduced
models should be very similar, visually. Similarly, the residual plots (of the
fitted values versus the residuals) from the full and reduced models should be
very similar, visually. Finally, the correlation of the plotted points in the RR
and FF plots should be high, ≥ 0.95, say, and the plotted points in the RR
and FF plots should cluster tightly about the identity line. Add the identity
line to both the RR and FF plots as a visual aid. Also add the OLS line from
regressing r on rR to the RR plot (the OLS line is the identity line in the FF
plot). If the reduced model is good, then the OLS line should nearly coincide
with the identity line in that it should be difficult to see that the two lines
intersect at the origin, as in Figure 2.2. If the FF plot looks good but the
RR plot does not, the reduced model may be good if the main goal of the
analysis is to predict Y.
In Chapter 3, Example 3.8 describes the Gladstone (1905) data. Let the
reduced model use a constant, (size)1/3 , sex, and age. Then Figure 3.7 shows
the response and residual plots for the full and reduced models, and Figure 3.9
shows the RR and FF plots.
Example 2.10. For the Buxton (1920) data, n = 76 after 5 outliers and
6 cases with missing values are removed. Assume that the response variable
Y is height, and the explanatory variables are x2 = bigonal breadth, x3 =
cephalic index, x4 = finger to ground, x5 = head length, x6 = nasal height,
and x7 = sternal height. Suppose that the full model uses all 6 predictors plus
a constant (x1) while the reduced model uses the constant, cephalic index,
and finger to ground. Test whether the reduced model can be used instead of
the full model using the output below.
ii) FR = 41588.9/496.629 = 83.742.
iii) pval = P (F4,69 > 83.742) = 0.00.
iv) The pval < δ (= 0.05, since δ was not given), so reject Ho. The full model
should be used instead of the reduced model. (Bigonal breadth, head length,
nasal height, and sternal height are needed in the MLR for height given that
cephalic index and finger to ground are in the model.)
Using a computer to get the pval makes sense, but for exams you may need
to use a table. In ARC, you can use the Calculate probability option from the
ARC menu, enter 83.742 as the value of the statistic, 4 and 69 as the degrees
of freedom, and select the F distribution. To use the table near the end of
Chapter 14, use the bottom row since the denominator degrees of freedom 69
> 30. Intersect with the column corresponding to k = 4 numerator degrees of
freedom. The cutoff value is 2.37. If the FR statistic was 2.37, then the pval
would be 0.05. Since 83.742 > 2.37, the pval < 0.05, and since 83.742 >> 2.37,
we can say that the pval ≈ 0.0.
2.7 The Wald t Test 49
Example 2.11. Now assume that the reduced model uses the constant,
sternal height, finger to ground, and head length. Using the output below, test
whether the reduced model is good.
Summary Analysis of Variance Table for Reduced Model
Source df SS MS F p-value
Regression 3 259704. 86568. 177.93 0.0000
Residual 72 35030.1 486.528
Solution: The 4 step partial F test follows.
i) Ho: the reduced model is good Ha: use the full model
ii)
FR = [(SSE(R) − SSE(F))/(dfR − dfF)] / MSE(F) = [(35030.1 − 34267.4)/(72 − 69)] / 496.629
= 254.2333/496.629 = 0.512.
iii) The pval = P (F3,69 > 0.512) = 0.675.
iv) The pval > δ, so fail to reject Ho. The reduced model is good.
To use the F table near the end of Chapter 14, use the bottom row since
the denominator degrees of freedom 69 > 30. Intersect with the column cor-
responding to k = 3 numerator degrees of freedom. The cutoff value is 2.61.
Since 0.512 < 2.61, pval > 0.05, and this is enough information to fail to
reject Ho.
Some R commands and output to do the above problem are shown below.
cyp <- matrix(scan(),nrow=76,ncol=8,byrow=T,dimnames=
list( c(), c("indx", "ht", "sternal", "finger",
"hdlen","nasal","bigonal", "cephalic")))
#copy and paste the data set cyp.lsp then press enter
cyp <- cyp[,-1]; cyp <- as.data.frame(cyp)
full <- lm(ht~.,data=cyp)
red <- lm(ht~sternal+finger+hdlen,data=cyp)
anova(red,full)
Model 1: ht ~ sternal + finger + hdlen
Model 2: ht ~ sternal + finger + hdlen + nasal
+ bigonal + cephalic
Res.Df RSS Df Sum of Sq F Pr(>F)
1 72 35030
2 69 34267 3 762.67 0.5119 0.6754
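The partial F statistic can also be reproduced directly from the two residual lines of the anova output; a minimal sketch (the sums of squares and degrees of freedom are copied from the output above):

```r
# Reproduce the partial F test of Example 2.11 from the ANOVA quantities.
sseR <- 35030.1; dfR <- 72   # reduced model: SSE(R), df_R
sseF <- 34267.4; dfF <- 69   # full model: SSE(F), df_F
mseF <- sseF / dfF           # MSE(F), about 496.629
FR <- ((sseR - sseF) / (dfR - dfF)) / mseF            # about 0.512
pval <- pf(FR, dfR - dfF, dfF, lower.tail = FALSE)    # about 0.675
```

The `pf` call replaces the F table lookup.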
Use the normal table or the d = Z line in the t table if the degrees of freedom
d = n − p ≥ 30. Again pval is the estimated p-value.
iv) State whether you reject Ho or fail to reject Ho and give a nontechnical
sentence restating your conclusion in terms of the story problem.
The added variable plot (also called a partial regression plot) is used to
give information about the test Ho: βk = 0. The points in the plot cluster
about a line through the origin with slope = β̂k. An interesting fact is that the
residuals from this line, i.e. the residuals from regressing r(Y|x(k)) on r(xk|x(k)),
are exactly the same as the usual residuals from regressing Y on x. The range
of the horizontal axis gives information about the collinearity of xk with the
other predictors. Small range implies that xk is well explained by the other
predictors. The r(xk|x(k)) represent the part of xk that is not explained by
the remaining variables while the r(Y|x(k)) represent the part of Y that is not
explained by the remaining variables.
An added variable plot with a clearly nonzero slope and tight clustering
about a line implies that xk is needed in the MLR for Y given that the other
predictors x2, . . . , xk−1, xk+1, . . . , xp are in the model. Slope near zero in the
added variable plot implies that xk may not be needed in the MLR for Y
given that all other predictors x2, . . . , xk−1, xk+1, . . . , xp are in the model.
If the zero line with 0 slope and 0 intercept and the OLS line are added to
the added variable plot, the variable is probably needed if it is clear that the
two lines intersect at the origin. Then the point cloud should be tilted away
from the zero line. The variable is probably not needed if the two lines nearly
coincide near the origin in that you cannot clearly tell that they intersect at
the origin.
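The fact that the residuals from the added variable plot line equal the usual MLR residuals is easy to check numerically; a small sketch with simulated data (all variable names here are illustrative):

```r
# Check: regressing r(Y|x_(k)) on r(x_k|x_(k)) gives the same residuals as
# the full regression of Y on all predictors, and the slope equals betahat_k.
set.seed(1)
n <- 100
x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2 * x2 - x3 + rnorm(n)
full <- lm(y ~ x2 + x3)
ry <- residuals(lm(y ~ x3))    # part of Y not explained by x3
rx <- residuals(lm(x2 ~ x3))   # part of x2 not explained by x3
avp <- lm(ry ~ rx)             # the added variable plot line for x2
max(abs(residuals(avp) - residuals(full)))   # essentially 0
coef(avp)[["rx"]] - coef(full)[["x2"]]       # essentially 0
```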
Shown below is output only using symbols, and the following example shows
how to use output to perform the Wald t test.
Response = Y
Coefficient Estimates
Label    Estimate  Std. Error  t-value              p-value
Constant β̂1        se(β̂1)      to,1 = β̂1/se(β̂1)     for Ho: β1 = 0
x2       β̂2        se(β̂2)      to,2 = β̂2/se(β̂2)     for Ho: β2 = 0
⋮
xp       β̂p        se(β̂p)      to,p = β̂p/se(β̂p)     for Ho: βp = 0
Example 2.12. The output above was collected from 26 districts in Prus-
sia in 1843. See Hebbler (1847). The goal is to study the relationship between
Y = the number of women married to civilians in the district with the predic-
tors x2 = the population of the district, and x3 = military women = the number
of women married to husbands in the military.
a) Find a 95% confidence interval for β2 corresponding to population.
The CI is β̂k ± tn−p,1−δ/2 se(β̂k). Since n = 26, df = n − p = 26 − 3 = 23.
From the t table at the end of Chapter 14, intersect the df = 23 row with
the column that is labelled by 95% in the CI row near the bottom of the
table. Then tn−p,1−δ/2 = 2.069. Using the output shows that the 95% CI is
0.180225 ± 2.069(0.00503871) = [0.16980, 0.19065].
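A quick sketch of the computation in a), with the estimate and standard error copied from the example and `qt` standing in for the t table:

```r
bhat <- 0.180225; se <- 0.00503871   # from the output for x2 = population
n <- 26; p <- 3
tcut <- qt(0.975, df = n - p)        # about 2.069
ci <- bhat + c(-1, 1) * tcut * se    # about [0.16980, 0.19065]
```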
b) Perform a 4 step test for Ho: β2 = 0 corresponding to population.
i) Ho: β2 = 0 HA: β2 ≠ 0
ii) to,2 = 35.768
iii) pval = 0.0
iv) Reject Ho, the population is needed in the MLR model for the number
of women married to civilians if the number of military women is in the
model.
c) Perform a 4 step test for Ho: β3 = 0 corresponding to military women.
i) Ho: β3 = 0 HA: β3 ≠ 0
ii) to,3 = 0.713
iii) pval = 0.4883
iv) Fail to reject Ho, the number of military women is not needed in the
MLR model for the number of women married to civilians if population is in
the model.
Figure 2.4, made with the commands shown below, shows the added vari-
able plots for x2 and x3 . The plot for x2 strongly suggests that x2 is needed
in the MLR model while the plot for x3 indicates that x3 does not seem to
be very important. The slope of the OLS line in a) is 0.1802 while the slope
of the line in b) is 1.894.
source("G:/lregdata.txt")
x2 <- marry[,1]
x3 <- marry[,5]
y <- marry[,3]
#par(mfrow=c(1,2),pty="s")
#square plots look nice but have too much white space
par(mfrow=c(1,2))
resy2 <- residuals(lm(y~x3))
resx2 <- residuals(lm(x2~x3))
plot(resx2,resy2)
abline(lsfit(resx2,resy2)$coef)
title("a) Added Variable Plot for x2")
resy3 <- residuals(lm(y~x2))
resx3 <- residuals(lm(x3~x2))
plot(resx3,resy3)
abline(lsfit(resx3,resy3)$coef)
title("b) Added Variable Plot for x3")
[Fig. 2.4 Added variable plots: a) for x2 (vertical axis resy2), b) for x3 (vertical axis resy3)]
Fig. 2.5 The OLS Fit Minimizes the Sum of Squared Residuals
The OLS estimator β̂ minimizes the OLS criterion
Q_OLS(η) = Σ_{i=1}^n ri²(η)
where the residual ri(η) = Yi − xi^T η. In other words, let ri = ri(β̂) be the
OLS residuals. Then Σ_{i=1}^n ri² ≤ Σ_{i=1}^n ri²(η) for any p × 1 vector η, and
the equality holds iff η = β̂ if the n × p design matrix X is of full rank p ≤ n. In
particular, if X has full rank p, then Σ_{i=1}^n ri² < Σ_{i=1}^n ri²(β) = Σ_{i=1}^n ei² even
if the MLR model Y = Xβ + e is a good approximation to the data.
Figure 2.5a shows the response plot for the OLS fit, where the
deviations from the identity line are the residuals ri(β̂). For this data, the OLS
estimator β̂ = (498.726, 1.597, 30.462, 0.696)^T. Figure 2.5b shows the re-
sponse plot using the ESP x^T η where η = (498.726, 1.597, 30.462, 0.796)^T.
Hence only the coefficient for x4 was changed; however, the residuals ri(η) in
the resulting plot are much larger in magnitude on average than the residuals
in the OLS response plot. With slightly larger changes in the OLS ESP, the
resulting η will be such that the squared residuals are massive.
Proof: Seber and Lee (2003, pp. 36–37). Recall that the hat matrix H =
X(X^T X)^{−1} X^T and notice that (I − H)^T = I − H, that (I − H)H = 0,
and that HX = X. Let η be any p × 1 vector. Then
(Y − Xβ̂)^T (Xβ̂ − Xη) = (Y − HY)^T (HY − HXη) =
Y^T (I − H)H(Y − Xη) = 0.
Thus Q_OLS(η) = ‖Y − Xη‖² = ‖Y − Xβ̂ + Xβ̂ − Xη‖² =
‖Y − Xβ̂‖² + ‖Xβ̂ − Xη‖² + 2(Y − Xβ̂)^T (Xβ̂ − Xη).
Hence
‖Y − Xη‖² = ‖Y − Xβ̂‖² + ‖Xβ̂ − Xη‖². (2.21)
So
‖Y − Xη‖² ≥ ‖Y − Xβ̂‖²
with equality iff
X(β̂ − η) = 0
Also,
Q_OLS(η) = Σ_{i=1}^n (Yi − xi,1 η1 − xi,2 η2 − ⋯ − xi,p ηp)²,
and
∂Q_OLS(η)/∂ηj = −2 Σ_{i=1}^n xi,j (Yi − xi,1 η1 − xi,2 η2 − ⋯ − xi,p ηp) = −2(vj)^T (Y − Xη)
where vj is the jth column of X. Combining the p partial derivatives and
setting them equal to 0 at η = β̂ gives
X^T Y − X^T X β̂ = 0,
or
X^T X β̂ = X^T Y. (2.22)
Equation (2.22) is known as the normal equations. If X has full rank,
then β̂ = (X^T X)^{−1} X^T Y. To show that β̂ is the global minimizer of the
OLS criterion, use the argument following Equation (2.21).
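The normal equations (2.22) can be verified numerically; a sketch on simulated data (solving the normal equations directly is shown only for illustration — `lm` uses a more stable QR decomposition):

```r
set.seed(2)
n <- 30
X <- cbind(1, rnorm(n), rnorm(n))        # n x p design matrix, p = 3
Y <- X %*% c(1, 2, 3) + rnorm(n)
bhat <- solve(t(X) %*% X, t(X) %*% Y)    # solve X'X betahat = X'Y
max(abs(t(X) %*% (Y - X %*% bhat)))      # X'(Y - X betahat) = 0 up to rounding
max(abs(bhat - coef(lm(Y ~ X[, 2] + X[, 3]))))   # agrees with lm
```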
Y ⊥ x | x^T β.
For the location model,
Q_OLS(η) = Σ_{i=1}^n (Yi − η)² and dQ_OLS(η)/dη = −2 Σ_{i=1}^n (Yi − η).
Setting the derivative equal to 0 and calling the solution μ̂ gives Σ_{i=1}^n Yi = nμ̂,
or μ̂ = Ȳ. The second derivative
d²Q_OLS(η)/dη² = 2n > 0,
hence μ̂ is the global minimizer.
Yi = β1 + β2 Xi + ei = α + β Xi + ei
∂Q/∂η1 = −2 Σ_{i=1}^n (Yi − η1 − η2 Xi)
and
∂²Q/∂η1² = 2n.
Similarly,
∂Q/∂η2 = −2 Σ_{i=1}^n Xi (Yi − η1 − η2 Xi)
and
∂²Q/∂η2² = 2 Σ_{i=1}^n Xi².
Setting the first partial derivatives to zero and calling the solutions β̂1 and
β̂2 shows that the OLS estimators β̂1 and β̂2 satisfy the normal equations:
Σ_{i=1}^n Yi = nβ̂1 + β̂2 Σ_{i=1}^n Xi and
Σ_{i=1}^n Xi Yi = β̂1 Σ_{i=1}^n Xi + β̂2 Σ_{i=1}^n Xi².
ki = (Xi − X̄) / Σ_{j=1}^n (Xj − X̄)². (2.24)
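These formulas are easy to check numerically; a small sketch on simulated data, also verifying the standard fact that β̂2 = Σ ki Yi with the ki of (2.24):

```r
set.seed(3)
x <- rnorm(20, mean = 5)
y <- 3 + 0.5 * x + rnorm(20)
b2 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b1 <- mean(y) - b2 * mean(x)
max(abs(c(b1, b2) - coef(lm(y ~ x))))        # essentially 0
ki <- (x - mean(x)) / sum((x - mean(x))^2)   # the ki of (2.24)
sum(ki * y) - b2                             # betahat_2 = sum(ki * Yi)
```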
The no intercept MLR model, also known as regression through the origin, is
still Y = Xβ + e, but there is no intercept in the model, so X does not contain
a column of ones 1. Hence the intercept term β1 = β1(1) is replaced by β1 xi1.
Software gives output for this model if the no intercept or intercept = F
option is selected. For the no intercept model, the assumption E(e) = 0 is
important, and this assumption is rather strong.
Many of the usual MLR results still hold: β̂_OLS = (X^T X)^{−1} X^T Y, the
vector of predicted fitted values Ŷ = X β̂_OLS = HY where the hat matrix
H = X(X^T X)^{−1} X^T provided the inverse exists, and the vector of residuals
is r = Y − Ŷ. The response plot and residual plot are made in the same way
and should be made before performing inference.
The main difference in the output is the ANOVA table. The ANOVA F
test in Section 2.4 tests Ho: β2 = ⋯ = βp = 0. The test in this section tests
Ho: β1 = ⋯ = βp = 0 ≡ Ho: β = 0. The following definition and test
follows Guttman (1982, p. 147) closely.
a) The total sum of squares
SST = Σ_{i=1}^n Yi². (2.25)
b) The model sum of squares
SSM = Σ_{i=1}^n Ŷi². (2.26)
c) The residual sum of squares or error sum of squares
SSE = Σ_{i=1}^n (Yi − Ŷi)² = Σ_{i=1}^n ri². (2.27)
d) The degrees of freedom (df) for SSM is p, the df for SSE is n − p, and
the df for SST is n. The mean squares are MSE = SSE/(n − p) and MSM =
SSM/p.
Source df SS MS F p-value
Model p SSM MSM Fo=MSM/MSE for Ho:
Residual n-p SSE MSE β = 0
The ANOVA F test can also be found with the no intercept model by
adding a column of ones to the R matrix x and then performing the partial
F test with the full model and the reduced model that only uses the column
of ones. Notice that the intercept=F option needs to be used to fit both
models. The residual standard error RSE = √MSE. Thus SSE = (n −
k)(RSE)² where n − k is the denominator degrees of freedom for the F test.
> ls.print(lsfit(x[,1],y,intercept=F))
Residual Standard Error=164.5028
F-statistic (df=1, 266)=15744.48
((266*(164.5028)^2 - 262*(82.9175)^2)/4)/(82.9175)^2
[1] 196.2435
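A self-contained sketch on simulated data, checking (2.25)–(2.27) for the no intercept model — SST = SSM + SSE, and Fo = MSM/MSE matches the F statistic lm reports when fitting without an intercept:

```r
set.seed(4)
x <- runif(30, 1, 10)
y <- 2 * x + rnorm(30)
fit0 <- lm(y ~ x - 1)                      # no intercept model
SST <- sum(y^2)                            # (2.25)
SSM <- sum(fitted(fit0)^2)                 # (2.26)
SSE <- sum(residuals(fit0)^2)              # (2.27)
SST - (SSM + SSE)                          # essentially 0
Fo <- (SSM / 1) / (SSE / (30 - 1))         # MSM/MSE with p = 1
Fo - summary(fit0)$fstatistic[["value"]]   # matches lm's F for Ho: beta = 0
```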
2.11 Summary
1) The response variable is the variable that you want to predict. The pre-
dictor variables are the variables used to predict the response variable.
2) Regression is the study of the conditional distribution Y |x.
3) The MLR model is
Yi = β1 xi,1 + β2 xi,2 + ⋯ + βp xi,p + ei = xi^T β + ei
for i = 1, . . . , n. Here n is the sample size and the random variable ei is the ith
error. Assume that the errors are iid with E(ei) = 0 and VAR(ei) = σ² < ∞.
Assume that the errors are independent of the predictor variables xi. The
unimodal MLR model assumes that the ei are iid from a unimodal distribution
that is not highly skewed. Usually xi,1 ≡ 1.
In matrix form, the MLR model is Y = Xβ + e. The OLS estimator β̂
minimizes the OLS criterion Q_OLS(η) = Σ_{i=1}^n ri²(η).
9) If the MLR model contains a constant, then R² = [corr(Yi, Ŷi)]² =
SSR/SSTO = 1 − SSE/SSTO.
Source df SS MS F p-value
Regression p-1 SSR MSR Fo=MSR/MSE for Ho:
Residual n-p SSE MSE β2 = ⋯ = βp = 0
11) The large sample 100(1 − δ)% CI for E(Yf |xf) = xf^T β = E(Ŷf) is
Ŷf ± tn−p,1−δ/2 se(Ŷf) where P(T ≤ tn−p,δ) = δ if T has a t distribution with
n − p degrees of freedom.
12) The classical 100(1 − δ)% PI for Yf is Ŷf ± tn−p,1−δ/2 se(pred), but it
should be replaced with the asymptotically optimal PI (2.20).
Full model
Reduced model
iii) Find the pval = P(FdfR−dfF, dfF > FR). (On exams typically an F table is
used. Here dfR − dfF = p − q = number of parameters set to 0, and dfF = n − p,
while pval is the estimated p-value.)
iv) State whether you reject Ho or fail to reject Ho. Reject Ho if the pval ≤ δ
and conclude that the full model should be used. Otherwise, fail to reject Ho
and conclude that the reduced model is good.
Use the normal table or the d = Z line in the t table if the degrees of freedom
d = n − p ≥ 30.
iv) State whether you reject Ho or fail to reject Ho and give a nontechnical
sentence restating your conclusion in terms of the story problem. If Ho is
rejected, then conclude that xk is needed in the MLR model for Y given that
the other predictors are in the model. If you fail to reject Ho, then conclude
that xk is not needed in the MLR model for Y given that the other predictors
are in the model.
16) Given Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ), Σ_{i=1}^n (Xi − X̄)², X̄, and Ȳ, find the
least squares line Ŷ = β̂1 + β̂2 X where
β̂2 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)²
and β̂1 = Ȳ − β̂2 X̄.
17) Given ρ̂, sX, sY, X̄, and Ȳ, find the least squares line Ŷ = β̂1 + β̂2 X
where β̂2 = ρ̂ sY /sX and β̂1 = Ȳ − β̂2 X̄.
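A sketch of item 17 with made-up summary statistics (the numbers below are illustrative, not from any data set in the text):

```r
rho <- 0.8; sX <- 2; sY <- 5   # sample correlation and sds (made up)
Xbar <- 10; Ybar <- 3          # sample means (made up)
b2 <- rho * sY / sX            # betahat_2 = rho * sY / sX = 2
b1 <- Ybar - b2 * Xbar         # betahat_1 = Ybar - betahat_2 * Xbar = -17
```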
2.12 Complements
The Least Squares Central Limit Theorem 2.8 is often a good approximation
if n 10p and the error distribution has light tails, i.e. the probability of
an outlier is nearly 0 and the tails go to zero at an exponential rate or faster.
For error distributions with heavier tails, much larger samples are needed,
and the assumption that the variance 2 exists is crucial, e.g. Cauchy errors
are not allowed. Norman and Streiner (1986, p. 63) recommend n 5p.
The classical MLR prediction interval does not work well and should be re-
placed by the Olive (2007) asymptotically optimal PI (2.20). Lei and Wasser-
man (2014) provide an alternative: use the Lei et al. (2013) PI [rL, rU] on the
residuals; then the PI for Yf is
[Ŷf + rL, Ŷf + rU]. (2.28)
Bootstrap PIs need more theory, and instead of using B = 1000 samples, use
B = max(1000, n). See Olive (2014, pp. 279–285).
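A rough sketch of the form of PI (2.28), using plain sample quantiles of the residuals as a crude stand-in for [rL, rU] (the Lei et al. (2013) PI uses a more careful construction):

```r
set.seed(7)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)
fit <- lm(y ~ x)
rq <- quantile(residuals(fit), c(0.025, 0.975))    # crude [rL, rU]
yhatf <- predict(fit, newdata = data.frame(x = 0.5))
pi28 <- unname(yhatf + rq)                         # [Yhat_f + rL, Yhat_f + rU]
```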
For the additive error regression model Y = m(x) + e, the response plot
of Ŷ = m̂(x) vs. Y, with the identity line added as a visual aid, is used
like the MLR response plot. We want n ≥ 10 df where df is the degrees of
freedom from fitting m̂. Olive (2013a) provides PIs for this model, including
the location model. These PIs are large sample PIs provided that the sample
quantiles of the residuals are consistent estimators of the population quantiles
of the errors. The response plot and PIs could also be used for methods
described in James et al. (2013) such as ridge regression, lasso, principal
components regression, and partial least squares. See Pelawa Watagoda and
Olive (2017) if n is not large compared to p.
In addition to large sample theory, we want the PIs to work well on a
single data set as future observations are gathered, but only have the training
data (x1 , Y1 ), . . . , (xn , Yn ). Much like k-fold cross validation for discriminant
analysis, randomly divide the data set into k = 5 groups of approximately
equal size. Compute the model from 4 groups and use the 5th group as a
validation set: compute the PI for xf = xj for each j in the 5th group.
Repeat so each of the 5 groups is used as a validation set. Compute the
proportion of times Yi was in its PI for i = 1, . . . , n as well as the average
length of the n PIs. We want the proportion near the nominal proportion
and short average length if two or more models or PIs are being considered.
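The 5-fold check described above can be sketched as follows, here with the classical PI from `predict()` standing in for PI (2.20):

```r
set.seed(5)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)
fold <- sample(rep(1:5, length.out = n))       # random split into 5 groups
inPI <- logical(n); len <- numeric(n)
for (k in 1:5) {
  fit <- lm(y ~ x, data = dat[fold != k, ])    # fit on the other 4 groups
  pik <- predict(fit, newdata = dat[fold == k, ],
                 interval = "prediction")      # PIs for the validation group
  yk <- dat$y[fold == k]
  inPI[fold == k] <- yk >= pik[, "lwr"] & yk <= pik[, "upr"]
  len[fold == k] <- pik[, "upr"] - pik[, "lwr"]
}
mean(inPI)   # proportion of Yi in their PIs; want near the nominal 0.95
mean(len)    # average PI length; want short
```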
Following Chapter 11, under the regularity conditions, much of the infer-
ence that is valid for the normal MLR model is approximately valid for the
unimodal MLR model when the sample size is large. For example, confidence
intervals for βi are asymptotically correct, as are t tests for βi = 0 (see Li
and Duan (1989, p. 1035)), the MSE is an estimator of σ² by Theorems 2.6
and 2.7, and variable selection procedures perform well (see Chapter 3 and
Olive and Hawkins 2005).
Algorithms for OLS are described in Datta (1995), Dongarra et al. (1979),
and Golub and Van Loan (1989). See Harter (1974a,b, 1975a,b,c, 1976) for
a historical account of multiple linear regression. Draper (2002) provides a
bibliography of more recent references.
Cook and Weisberg (1997, 1999a: ch. 17) call a plot that emphasizes model
agreement a model checking plot. Anscombe (1961) and Anscombe and Tukey
(1963) suggested graphical methods for checking multiple linear regression
and experimental design methods that were the state of the art at the
time.
The rules of thumb given in this chapter for residual plots are not perfect.
Cook (1998, pp. 46) gives an example of a residual plot that looks like a
right opening megaphone, but the MLR assumption that was violated was
linearity, not constant variance. Ghosh (1987) gives an example where the
residual plot shows no pattern even though the constant variance assumption
is violated. Searle (1988) shows that residual plots will have parallel lines if
several cases take on each of the possible values of the response variable, e.g.
if the response is a count.
Several authors have suggested using the response plot to visualize the
coefficient of determination R² in multiple linear regression. See, for example,
Chambers et al. (1983, p. 280). Anderson-Sprecher (1994) provides an ex-
cellent discussion about R². Kachigan (1982, pp. 174–177) also gives a good
explanation of R². Also see Kvalseth (1985) and Freedman (1983).
Hoaglin and Welsh (1978) discuss the hat matrix H, and Brooks et al.
(1988) recommend using hf = xf^T (X^T X)^{−1} xf < max hi for valid predictions. Simultaneous
R Squared: R²
Sigma hat: √MSE
Number of cases: n
Degrees of Freedom: n − p
Source df SS MS F p-value
Regression p-1 SSR MSR Fo=MSR/MSE for Ho:
Residual n-p SSE MSE β2 = ⋯ = βp = 0
The typical relevant OLS output has the form given above, but occa-
sionally software also includes output for a lack of fit test as shown below.
Source df SS MS Fo
Regression p − 1 SSR MSR Fo=MSR/MSE
Residual n − p SSE MSE
lack of fit c − p SSLF MSLF FLF = MSLF/MSPE
pure error n − c SSPE MSPE
Yi = m(xi) + ei (2.29)
where E(Yi|xi) = m(xi), m is some possibly nonlinear function, and the
ei are iid N(0, σ²). Notice that the MLR model is the special case with
m(xi) = xi^T β. The lack of fit test needs at least one replicate: 2 or more Ys
with the same value of predictors x. Then there are c replicate groups with
nj observations in the jth group. Each group has the vector of predictors xj,
say, and at least one nj > 1. Also, Σ_{j=1}^c nj = n. Denote the Ys in the jth
group by Yij, and let the sample mean of the Ys in the jth group be Ȳj.
Then the sample variance of the jth group is
(1/(nj − 1)) Σ_{i=1}^{nj} (Yij − Ȳj)²,
and the pure error sum of squares is
SSPE = Σ_{j=1}^c Σ_{i=1}^{nj} (Yij − Ȳj)².
Although the lack of fit test seems clever, examining the response plot and
residual plot is a much more effective method for examining whether or not
the MLR model fits the data well provided that n ≥ 10p. A graphical version
of the lack of fit test would compute the Ȳj and see whether they scatter
about the identity line in the response plot. When there are no replicates,
the range of Ŷ could be divided into several narrow nonoverlapping intervals
called slices. Then the mean Ȳj of each slice could be computed and a step
function with step height Ȳj at the jth slice could be plotted. If the step
function follows the identity line, then there is no evidence of lack of fit.
However, it is easier to check whether the Yi are scattered about the identity
line. Examining the residual plot is useful because it magnifies deviations
from the identity line that may be difficult to see until the linear trend is
removed. The lack of fit test may be sensitive to the assumption that the
errors are iid N(0, σ²).
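A sketch of the lack of fit test with replicates, on simulated data; the same F statistic can be obtained from `anova()` by comparing the linear fit with the group-means model:

```r
set.seed(6)
x <- rep(1:6, each = 4)                 # c = 6 replicate groups, nj = 4
y <- 2 + 3 * x + rnorm(length(x))
fit <- lm(y ~ x)
ybarj <- ave(y, x)                      # group means Ybar_j
SSPE <- sum((y - ybarj)^2)              # pure error sum of squares
SSLF <- sum(residuals(fit)^2) - SSPE    # lack of fit sum of squares
n <- length(y); cg <- 6; p <- 2
FLF <- (SSLF / (cg - p)) / (SSPE / (n - cg))
a <- anova(fit, lm(y ~ factor(x)))      # linear vs. group-means model
FLF - a$F[2]                            # essentially 0
```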
When Y ⊥ x | x^T β, then the response plot of the estimated sufficient pre-
dictor (ESP) x^T β̂ versus Y is used to visualize the conditional distribution of
Y | x^T β, and will often greatly outperform the corresponding lack of fit test.
When the response plot can be combined with a good lack of fit plot such as
a residual plot, using a one number summary of lack of fit such as the test
statistic FLF makes little sense.
Nevertheless, the literature for lack of fit tests for various statistical meth-
ods is enormous. See Joglekar et al. (1989), Pena and Slate (2006), and Su
and Yang (2006) for references.
For the following homework problems, Cody and Smith (2006) is useful
for SAS, while Cook and Weisberg (1999a) is useful for Arc. Becker et al.
(1988) and Crawley (2013) are useful for R.
2.13 Problems
2.1. Assume that the response variable Y is height, and the explanatory
variables are X2 = sternal height, X3 = cephalic index, X4 = finger to ground,
X5 = head length, X6 = nasal height, and X7 = bigonal breadth. Suppose that
the full model uses all 6 predictors plus a constant (= X1) while the reduced
model uses the constant and sternal height. Test whether the reduced model
can be used instead of the full model using the output above. The data set
had 74 cases.
2.2. The above output, starting on the previous page, comes from the
Johnson (1996) STATLIB data set bodyfat after several outliers are deleted.
It is believed that Y = β1 + β2 X2 + β3 X2² + e where Y is the person's bodyfat
and X2 is the person's density. Measurements on 245 people were taken. In
addition to X2 and X2², 7 additional measurements X4, . . . , X10 were taken.
Both the full and reduced models contain a constant X1 ≡ 1.
b) Test whether the reduced model can be used instead of the full model.
2.3. The above output was produced from the file mussels.lsp in Arc. See
Cook and Weisberg (1999a). Let Y = log(M) where M is the muscle mass
of a mussel. Let X1 ≡ 1, X2 = log(H) where H is the height of the shell,
and let X3 = log(S) where S is the shell mass. Suppose that it is desired to
predict Yf if log(H) = 4 and log(S) = 5, so that xf^T = (1, 4, 5). Assume that
se(Ŷf) = 0.410715 and that se(pred) = 0.467664.
2.4. The above output, starting on the previous page, is from the multiple
linear regression of the response Y = height on the two nontrivial predictors
sternal height = height at shoulder, and finger to ground = distance from the
tip of a person's middle finger to the ground.
a) Consider the plot with Yi on the vertical axis and the least squares
fitted values Ŷi on the horizontal axis. Sketch how this plot should look if the
multiple linear regression model is appropriate.
b) Sketch how the residual plot should look if the residuals ri are on the
vertical axis and the fitted values Ŷi are on the horizontal axis.
c) From the output, are sternal height and finger to ground useful for
predicting height? (Perform the ANOVA F test.)
2.5. Suppose that it is desired to predict the weight of the brain (in
grams) from the cephalic index measurement. The output below uses data
from 267 people.
predictor coef Std. Error t-value p-value
Constant 865.001 274.252 3.154 0.0018
cephalic 5.05961 3.48212 1.453 0.1474
Do a 4 step test for β2 = 0.
2.7. Suppose that the 95% confidence interval for β2 is [−17.457, 15.832].
In the simple linear regression model, is X a useful linear predictor for Y? If
your answer is no, could X be a useful predictor for Y? Explain.
2.8. Suppose it is desired to predict the yearly return from the stock
market from the return in January. Assume that the correlation ρ̂ = 0.496.
Using the table below, find the least squares line Ŷ = β̂1 + β̂2 X.
xi yi
38 41
56 63
59 70
64 72
74 84
2.10. In the above table, xi is the length of the femur and yi is the length
of the humerus taken from five dinosaur fossils (Archaeopteryx) that preserved
both bones. See Moore (2000, p. 99).
a) Complete the table and find the least squares estimators β̂1 and β̂2.
a) What is E(Yi)?
c) Show that your estimator is the global minimizer of the least squares criterion
Q by showing that the second derivative d²Q(η)/dη² > 0 for all values of η.
2.12. The location model is Yi = μ + ei for i = 1, . . . , n where the ei are iid
with mean E(ei) = 0 and constant variance VAR(ei) = σ². The least squares
estimator μ̂ of μ minimizes the least squares criterion Q(η) = Σ_{i=1}^n (Yi − η)².
To find the least squares estimator, perform the following steps.
a) Find the derivative dQ/dη, set the derivative equal to zero, and solve for
η. Call the solution μ̂.
b) To show that the solution was indeed the global minimizer of Q, show
that d²Q/dη² > 0 for all real η. (Then the solution μ̂ is a local min and Q is
convex, so μ̂ is the global min.)
2.13. The normal error model for simple linear regression through the
origin is
Yi = β Xi + ei
for i = 1, . . . , n where e1, . . . , en are iid N(0, σ²) random variables.
b) Find E(β̂).
c) Find VAR(β̂).
(Hint: Note that β̂ = Σ_{i=1}^n ki Yi where the ki depend on the Xi, which are
treated as constants.)
2.14. Suppose that the regression model is Yi = 10 + 2Xi2 + β3 Xi3 + ei for
i = 1, . . . , n where the ei are iid N(0, σ²) random variables. The least squares
criterion is Q(η3) = Σ_{i=1}^n (Yi − 10 − 2Xi2 − η3 Xi3)². Find the least squares es-
timator β̂3 of β3 by setting the first derivative dQ(η3)/dη3 equal to zero. Show
that your β̂3 is the global minimizer of the least squares criterion Q by show-
ing that the second derivative d²Q(η3)/dη3² > 0 for all values of η3.
Minitab Problems
Double click means press the leftmost mouse button twice in rapid
succession. Drag means hold the mouse button down. This technique is
used to select menu options.
After your computer is on, get into Minitab, often by searching programs
and then double clicking on the icon marked Student Minitab.
i) In a few seconds, the Minitab session and worksheet windows fill the screen.
At the top of the screen there is a menu. The upper left corner has the menu
option File. Move your cursor to File and drag down the option Open
Worksheet. A window will appear. Double click on the icon Student. This
will display a large number of data sets.
ii) In the middle of the screen there is a scroll bar, a gray line with left and
right arrow keys. Use the right arrow key to make the data file Prof.mtw
appear. Double click on Prof.mtw. A window will appear. Click on OK.
iii) The worksheet window will now be filled with data. The top of the screen
has a menu. Go to Stat and drag down Regression. Another window will
appear: drag down Regression (write this as Stat>Regression>Regression).
iv) A window will appear with variables to the left and the response variable
and predictors (explanatory variables) to the right. Double click on instrucr
to make it the response. Double click on manner to make it the (pre-
dictor) explanatory variable. Then click on OK.
v) The required output will appear in the session window. You can view the
output by using the vertical scroll bar on the right of the screen.
vi) Copy and paste the output into Word, or to print your single page of
output, go to File, and drag down the option Print Session Window. A
window will appear. Click on ok. Then get your output from the printer.
Use the F3 key to clear entries from a dialog window if you make a mistake
or want a new plot.
To get out of Minitab, move your cursor to the x in the upper right
corner of the screen. When asked whether to save changes, click on no.
2.15. (Minitab problem.) See the above instructions for using Minitab.
Get the data set prof.mtw. Assign the response variable to be instrucr (the
instructor rating from course evaluations) and the explanatory variable (pre-
dictor) to be manner (the manner of the instructor). Run a regression on
these variables.
d) To get residual and response plots you need to store the residuals and
fitted values. Use the menu commands Stat>Regression>Regression to get
the regression window. Put instrucr in the Response and manner in the
Predictors boxes. Then click on Storage. From the resulting window click
on Fits and Residuals. Then click on OK twice.
To get a response plot, use the commands Graph>Plot, (double click)
place instrucr in the Y box, and Fits1 in the X box. Then click on OK. Print
the plot by clicking on the graph and then clicking on the printer icon.
e) To make a residual plot, use the menu commands Graph>Plot to get
a window. Place Resi1 in the Y box and Fits1 in the X box. Then click
on OK. Print the plot by clicking on the graph and then clicking on the
printer icon.
f) To save your Minitab data on your flash drive, use the menu commands
File>Save Current Worksheet as. In the resulting dialog window, the top
box says Save in and there is an arrow icon to the right of the top box.
Click several times on the arrow icon until the Save in box reads My
computer, then click on Removable Disk (J:). In the File name box, enter
H2d16.mtw. Then click on OK.
SAS Problems
Copy and paste the SAS programs for problems 2.17 and 2.18
from (http://lagrange.math.siu.edu/Olive/lreghw.txt), or enter
the SAS program in Notepad or Word.
SAS is a statistical software package widely used in industry. You will need
a flash drive. Referring to the program for Problem 2.17, the semicolon ;
is used to end SAS commands and the options ls = 70; command makes
the output readable. (An * can be used to insert comments into the SAS
program. Try putting an * before the options command and see what it does
to the output.) The next step is to get the data into SAS. The command data
wcdata; gives the name wcdata to the data set. The command input x
y; says the first entry is variable x and the 2nd variable y. The command
cards; means that the data is entered below. Then the data is entered
and the isolated semicolon indicates that the last case has been entered. The
command proc print; prints out the data. The command proc corr; will
give the correlation between x and y. The commands proc plot; plot y*x;
makes a scatterplot of x and y. The commands proc reg; model y=x; output
out = a p =pred r =resid; tells SAS to perform a simple linear regression
with y as the response variable. The output data set is called a and contains
the fitted values and residuals. The command proc plot data = a; tells SAS
to make plots from data set a rather than data set wcdata. The command
plot resid*(pred x); will make a residual plot of the fitted values versus the
residuals and a residual plot of x versus the residuals. The next plot command
makes a response plot.
To use SAS on windows (PC), use the following steps.
i) Get into SAS, often by double clicking on an icon for programs such as a
Math Progs icon and then double clicking on a SAS icon. If your computer
does not have SAS, go to another computer.
ii) A window should appear with 3 icons. Double click on The SAS System
for . . . .
iii) Like Minitab, a window with a split screen will open. The top screen
says Log-(Untitled) while the bottom screen says Editor-Untitled1. Press the
spacebar and an asterisk appears: Editor-Untitled1*.
2.17. a) Copy and paste the program for this problem from
(http://lagrange.math.siu.edu/Olive/lreghw.txt), or enter the SAS
program in Notepad or Word. The ls stands for linesize so l is a lowercase L,
not the number one.
When you are done entering the program, you may want to save the
program as h2d17.sas on your flash drive (J: drive, say). (On the top menu of
the editor, use the commands File > Save as. A window will appear. Use
the upper right arrow to locate Removable Disk (J:) and then type the file
name in the bottom box. Click on OK.)
b) Get back into SAS, and from the top menu, use the File> Open
command. A window will open. Use the arrow in the upper right corner
of the window to navigate to Removable Disk (J:). (As you click on the
arrow, you should see My Documents, C: etc, then Removable Disk (J:).)
Double click on h2d17.sas. (Alternatively cut and paste the program into the
SAS editor window.) To execute the program, use the top menu commands
Run>Submit. An output window will appear if successful.
If you were not successful, look at the log window for hints on errors.
A single typo can cause failure. Reopen your file in Word or Notepad and
make corrections. Occasionally you cannot find your error. Then find your
instructor or wait a few hours and reenter the program.
c) To copy and paste relevant output into Word or Notepad, click on the
output window and use the top menu commands Edit>Select All and then
the menu commands Edit>Copy.
In Notepad use the commands Edit>Paste. Then use the mouse to high-
light the relevant output. Then use the commands Edit>Copy.
Finally, in Word, use the command Paste. You can also cut output from
Word and paste it into Notepad.
You may want to save your SAS output as the file HW2d17.doc on your
flash drive.
d) To save your output on your flash drive, use the Word menu commands
File > Save as. In the Save in box select Removable Disk (J:) and in
the File name box enter HW2d17.doc. To get a Word printout, click on the
printer icon or use the menu commands File>Print.
Save the output giving the least squares coefficients in Word.
e) Predict Y if X = 40.
f) What is the residual when X = 40?
2.18. This problem shows how to use SAS for MLR. The data are from
Kutner et al. (2005, problem 6.5). The response is brand liking, a measure-
ment for whether the consumer liked the brand. The variable X1 is moisture
content and the variable X2 is sweetness. Copy and paste the program for
this problem from (http://lagrange.math.siu.edu/Olive/lreghw.txt).
a) Execute the SAS program and copy the output file into Notepad. Scroll
down the output that is now in Notepad until you find the regression
coefficients and ANOVA table. Then cut and paste this output into Word.
b) Do the 4 step ANOVA F test.
You should scroll through your SAS output to see how it made the re-
sponse plot and various residual plots, but cutting and pasting these plots is
tedious. So we will use Minitab to get these plots. Find the program for this
problem from (http://lagrange.math.siu.edu/Olive/lreghw.txt). Then
copy and paste the numbers (between cards; and the semicolon ;) into
Minitab. Use the mouse commands Edit>Paste Cells. This should enter
the data in the Worksheet (bottom part of Minitab). Under C1 enter Y and
under C2 enter X1 under C3 enter X2. Use the menu commands
Stat>Regression>Regression to get a dialog window. Enter Y as the re-
sponse variable and X1 and X2 as the predictor variables. Click on Storage
then on Fits, Residuals, and OK OK.
c) To make a response plot, enter the menu commands Graph>Plot
and place Y in the Ybox and FITS1 in the Xbox. Click on OK. Then
use the commands Edit>Copy Graph to copy the plot. Include the plot in
Word with the commands Edit> Paste. If these commands fail, click on
the graph and then click on the printer icon.
d) Based on the response plot, does a linear model seem reasonable?
e) To make a residual plot, enter the menu commands Graph>Plot and
place RESI 1 in the Ybox and FITS1 in the Xbox. Click on OK. Then
use the commands Edit>Copy Graph to copy the plot. Include the plot in
Word with the commands Edit> Paste. If these commands fail, click on
the graph and then click on the printer icon.
f) Based on the residual plot does a linear model seem reasonable?
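The regression behind Problem 2.18 can also be sketched outside SAS and Minitab. The sketch below is Python rather than the book's software, and the numbers are made up for illustration (the real brand liking data ship with the SAS program from lreghw.txt); it fits Y on X1 and X2 by least squares and forms the fitted values and residuals that the response and residual plots graph.

```python
import numpy as np

def ols_fit(X, y):
    """Least squares of y on X plus a constant; returns (coef, fitted, resid)."""
    Xd = np.column_stack([np.ones(len(y)), X])   # design matrix with constant
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    fitted = Xd @ coef
    return coef, fitted, y - fitted

rng = np.random.default_rng(0)
X1 = rng.uniform(4, 10, 16)                  # moisture content (hypothetical values)
X2 = rng.uniform(2, 4, 16)                   # sweetness (hypothetical values)
Y = 38 + 4.4 * X1 + 4.4 * X2 + rng.normal(0, 2, 16)  # brand liking (hypothetical)

coef, fitted, resid = ols_fit(np.column_stack([X1, X2]), Y)
# response plot: graph fitted (horizontal) versus Y (vertical)
# residual plot: graph fitted (horizontal) versus resid (vertical)
```

With the constant in the model, the residuals sum to zero and are orthogonal to each predictor, which is why the residual plot is centered on the r = 0 line.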
Problems using ARC
To quit Arc, move the cursor to the x in the upper right corner and click.
Warning: Some of the following problems use data from the
book's webpage (http://lagrange.math.siu.edu/Olive/lregbk.htm).
Save the data files on a flash drive G, say. Get in Arc and use the menu
commands File > Load and a window with a Look in box will appear. Click
on the black triangle and then on Removable Disk (G:). Then click twice on
the data set name.
In Arc, use the menu commands Edit > Copy. In Word, use the menu
command Paste. This should copy the graph into the Word document.
a) Cut and paste the output (from Coefficient Estimates to Sigma hat)
into Word. Write down the least squares equation Ŷ = β̂1 + β̂2 x.
c) Make a residual plot of the fitted values versus the residuals. Use
the commands Graph&Fit > Plot of and put L1:Fit-values in H and
L1:Residuals in V. Put sex in the Mark by box. Move the OLS bar to 1.
Put the plot into Word. Does the plot look ellipsoidal with zero mean?
2.21. In Arc enter the menu commands File>Load>Data and open the
file mussels.lsp. This data set is from Cook and Weisberg (1999a).
The response variable Y is the mussel muscle mass M, and the explanatory
variables are X2 = S = shell mass, X3 = H = shell height, X4 = L = shell
length, and X5 = W = shell width.
Enter the menu commands Graph&Fit>Fit linear LS and fit the model:
enter S, H, L, W in the Terms/Predictors box, M in the Response box
and click on OK.
2.22. Get cyp.lsp as described above Problem 2.19. You can open the
file in Notepad and then save it on a flash drive G, say, using the Notepad
menu commands File>Save As and clicking the top checklist then click
Removable Disk (G:). You could also save the file on the desktop, load it
in Arc from the desktop, and then delete the file (sending it to the Recycle
Bin).
a) In Arc enter the menu commands File>Load>Removable Disk (G:)
and open the file cyp.lsp. This data set consists of various measurements
taken on men from Cyprus around 1920. Let the response Y = height and
X = cephalic index = 100(head breadth)/(head length). Use Arc to get the
least squares output and include the relevant output in Word.
i) The Arc menu L1 should have been created for the regression. Use
the menu commands L1>Prediction to open a dialog window. Enter 1400
650 in the box and click on OK. Include the resulting output in Word.
j) Let Xf,2 = 1400 and Xf,3 = 650 and use the output from i) to find a
95% CI for E(Yf). Use the last line of the output, that is, se = S(Ŷf).
a) Include the ANOVA tables for the full and reduced models in Word.
d) Both plots should cluster tightly about the identity line if the reduced
model is about as good as the full model. Is the reduced model good?
e) Perform the 4 step partial F test (of Ho: the reduced model is good)
using the 2 ANOVA tables from part a).
2.26. a) Activate the cbrain.lsp data set in ARC. Fit least squares with
age, sex, size^(1/3), and headht as terms and brnweight as the response. See
Problem 2.20. Assume that the multiple linear regression model is appropriate.
(This may be a reasonable assumption: 5 infants appear as outliers,
but the data set has hardly any cases that are babies. If age was uniformly
represented, the babies might not be outliers anymore.) Assuming that ARC
makes the menu L1 for this regression, select AVP-All 2D. A window will
appear. Move the OLS slider bar to 1 and click on the zero line box. The
window will show the added variable plots for age, sex, size^(1/3), and headht
as you move along the slider bar that is below case deletions. Include all 4
added variable plots in Word.
b) What information do the 4 plots give? For example, which variables do
not seem to be needed?
(If it is clear that the zero and OLS lines intersect at the origin, then the
variable is probably needed, and the point cloud should be tilted away from
the zero line. If it is difficult to see where the two lines intersect since they
nearly coincide near the origin, then the variable may not be needed, and the
point cloud may not tilt away from the zero line.)
R Problems
plot(zred$fit,buxy)
abline(0,1)
e) Use the following command to make the residual plot for the reduced
model. Include the plot in Word.
plot(zred$fit,zred$resid)
f) The plots look bad because of 5 massive outliers. The following
commands remove the outliers. Include the output in Word.
plot(zred$fit,zbux[,5])
abline(0,1)
i) Use the following command to make the residual plot for the reduced
model without the outliers. Include the plot in Word.
plot(zred$fit,zred$resid)
2.28. Get the R commands for this problem. The data is such that Y =
2 + x2 + x3 + x4 + e where the zero mean errors are iid [exponential(2) −
2]. Hence the residual and response plots should show high skew. Note that
β = (2, 1, 1, 1)^T. The R code uses 3 nontrivial predictors and a constant, and
the sample size n = 1000.
a) Copy and paste the commands for part a) of this problem into R. Include
the response plot in Word. Is the lowess curve fairly close to the identity line?
b) Copy and paste the commands for part b) of this problem into R.
Include the residual plot in Word: press the Ctrl and c keys at the same time.
Then use the menu command Paste in Word. Is the lowess curve fairly
close to the r = 0 line?
c) The output out$coef gives β̂. Write down β̂. Is β̂ close to β?
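The book's R code for this problem is fetched from lreghw.txt. A self-contained Python sketch of the same simulation (an assumed equivalent, not the book's code) generates the skewed errors, fits least squares, and recovers β = (2, 1, 1, 1)^T:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=(n, 3))                   # 3 nontrivial predictors
e = rng.exponential(scale=2.0, size=n) - 2.0  # iid exponential(2) - 2: mean 0, right skewed
y = 2.0 + x.sum(axis=1) + e                   # so beta = (2, 1, 1, 1)^T

X = np.column_stack([np.ones(n), x])          # constant plus predictors
betahat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ betahat
fitted = X @ betahat
print(np.round(betahat, 2))                   # close to [2, 1, 1, 1]
```

Plotting fitted versus y and fitted versus resid would show the high skew: the lowess curve stays near the identity (or r = 0) line, but the residuals have a long right tail.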
c) Repeat b) using the command pisim(n=100, type = 3). Now the er-
rors are EXP(1) - 1.
e) The infants are in the lower left corner of the plot. Do the PIs seem to
be better for the infants or the bulk of the data? Explain briefly.
Building a multiple linear regression (MLR) model from data is one of the
most challenging regression problems. The final full model will have response
variable Y = t(Z), a constant x1, and predictor variables x2 =
t2(w2, . . . , wr), . . . , xp = tp(w2, . . . , wr) where the initial data consists of
Z, w2, . . . , wr. Choosing t, t2, . . . , tp so that the final full model is a useful
MLR approximation to the data can be difficult.
Model building is an iterative process. Given the problem and data but
no model, the model building process can often be aided by graphs that
help visualize the relationships between the different variables in the data.
Then a statistical model can be proposed. This model can be fit and inference
performed. Then diagnostics from the fit can be used to check the assumptions
of the model. If the assumptions are not met, then an alternative model can
be selected. The fit from the new model is obtained, and the cycle is repeated.
This chapter provides some tools for building a good full model.
Warning: Researchers often have a single data set and tend to expect
statistics to provide far more information from the single data set than is
reasonable. MLR is an extremely useful tool, but MLR is at its best when the
final full model is known before collecting and examining the data. However,
it is very common for researchers to build their final full model by using
the iterative process until the final model fits the data well. Researchers
should not expect that all or even many of their research questions can be
answered from such a full model. If the final MLR full model is built from
a single data set in order to fit that data set well, then typically inference
from that model will not be valid. The model may be useful for describing
the data, but may perform very poorly for prediction of a future response.
The model may suggest that some predictors are much more important than
others, but a model that is chosen prior to collecting and examining the data
is generally much more useful for prediction and inference.
Often a final full model is built after collecting and examining the data.
This procedure is called data snooping, and such models cannot be
expected to be reliable. If possible, spend about 1/8 of the budget to collect
data and build an initial MLR model. Spend another 1/8 of the budget to
collect more data to check the initial MLR model. If changes are necessary,
continue this process until no changes from the previous step are needed,
resulting in a tentative MLR model. Then spend between 1/2 and 3/4 of the
budget to collect data assuming that the tentative model will be useful.
Alternatively, if the data set is large enough, use a training set of a
random sample of k of the n cases to build a model, where 10p ≤ n/2 ≤ k ≤
0.9n. Then use a validation set of the other n − k cases to confirm that the
model built with the training set is good. This technique may help reduce
biases, but needs n ≥ 20p.
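The training/validation split above can be sketched in a few lines. The sketch below is Python (the book works in R) and the function name is my own; it randomly partitions the case indices while enforcing the rule of thumb.

```python
import numpy as np

def train_validation_split(n, p, k, rng):
    """Split case indices into a size-k training set and a size n-k validation
    set, enforcing the rule of thumb 10p <= n/2 <= k <= 0.9n (so n >= 20p)."""
    if not (10 * p <= n / 2 <= k <= 0.9 * n):
        raise ValueError("need 10p <= n/2 <= k <= 0.9n")
    perm = rng.permutation(n)        # random order of the n cases
    return perm[:k], perm[k:]

rng = np.random.default_rng(1)
train, valid = train_validation_split(n=200, p=5, k=150, rng=rng)
print(len(train), len(valid))        # 150 50
```

The model would then be built using only the cases indexed by train, and checked (e.g., with prediction intervals) on the cases indexed by valid.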
Rule of thumb 3.1. If the MLR model is built using the variable selection
methods from Section 3.4, then the final submodel can be used for
description. If the full model was found after collecting the data, the model
may not be useful for inference and prediction. If the full model was selected
before collecting the data, then the prediction region method of bootstrapping
the variable selection model, described in Section 3.4.1, may be useful.
Predictor transformations are used for general regression problems, not just
for multiple linear regression. A power transformation has the form x =
t_λ(w) = w^λ for λ ≠ 0 and x = t_0(w) = log(w) for λ = 0. Often λ ∈ Λ_L where
Λ_L = {−1, −1/2, −1/3, 0, 1/3, 1/2, 1} is a coarse grid (ladder) of powers.
There are several rules of thumb that are useful for visually selecting a
power transformation to remove nonlinearities from the predictors. Let a
plot of X1 versus X2 have X2 on the vertical axis and X1 on the horizontal
axis.
Rule of thumb 3.2. a) If strong nonlinearities are apparent in the
scatterplot matrix of the predictors w2, . . . , wp, it is often useful to remove the
nonlinearities by transforming the predictors using power transformations.
c) Suppose the plot of X1 versus X2 is nonlinear. The unit rule says that
if X1 and X2 have the same units, then try the same transformation for both
X1 and X2 .
Assume that all values of X1 and X2 are positive. Then the following six
rules are often used.
d) The log rule states that a positive predictor that has the ratio between
the largest and smallest values greater than ten should be transformed to logs.
So X > 0 and max(X)/ min(X) > 10 suggests using log(X).
88 3 Building an MLR Model
e) The range rule states that a positive predictor that has the ratio between
the largest and smallest values less than two should not be transformed.
So X > 0 and max(X)/ min(X) < 2 suggests keeping X.
f) The bulging rule states that changes to the power of X2 and the power
of X1 can be determined by the direction that the bulging side of the curve
points. If the curve is hollow up (the bulge points down), decrease the power
of X2 . If the curve is hollow down (the bulge points up), increase the power
of X2 . If the curve bulges towards large values of X1 increase the power of
X1 . If the curve bulges towards small values of X1 decrease the power of X1 .
See Tukey (1977, pp. 173–176).
i) The cube root rule says that if X is a volume measurement, then the cube
root transformation X^(1/3) may be useful.
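Rules d) and e) are easy to automate. The sketch below is Python rather than the book's R, and the function names are my own; each function simply compares max(X)/min(X) to the cutoff from the rule.

```python
import numpy as np

def log_rule(x):
    """True if the log rule suggests replacing positive x by log(x)."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("the rules apply to positive variables only")
    return bool(x.max() / x.min() > 10.0)

def range_rule(x):
    """True if the range rule suggests leaving positive x untransformed."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("the rules apply to positive variables only")
    return bool(x.max() / x.min() < 2.0)

shell_mass = [10.0, 42.0, 120.0, 350.0]   # ratio 35 > 10: take logs
height = [80.0, 95.0, 140.0]              # ratio 1.75 < 2: keep as is
print(log_rule(shell_mass), range_rule(height))   # True True
```

Note that a variable can fail both rules (ratio between 2 and 10), in which case the bulging rule and the transformation plots decide.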
In the literature, it is sometimes stated that predictor transformations
that are made without looking at the response are free. The reasoning is
that the conditional distribution of Y |(x2 = a2, . . . , xp = ap) is the same
as the conditional distribution of Y |[t2(x2) = t2(a2), . . . , tp(xp) = tp(ap)]:
there is simply a change of labelling. Certainly if Y |(x = 9) ~ N(0, 1), then
Y |(√x = 3) ~ N(0, 1). To see that Rule of thumb 3.2a does not always work,
suppose that Y = β1 + β2x2 + · · · + βpxp + e where the xi are iid lognormal(0,1)
random variables. Then wi = log(xi) ~ N(0, 1) for i = 2, . . . , p and the
scatterplot matrix of the wi will be linear while the scatterplot matrix of the
xi will show strong nonlinearities if the sample size is large. However, there is
an MLR relationship between Y and the xi while the relationship between Y
and the wi is nonlinear: Y = β1 + β2 e^{w2} + · · · + βp e^{wp} + e ≠ β^T w + e. Given
Y and the wi with no information on the relationship, it would be difficult to
find the exponential transformation and to estimate the βi. The moral is that
predictor transformations, especially the log transformation, can and often
do greatly simplify the MLR analysis, but predictor transformations can turn
a simple MLR analysis into a very complex nonlinear analysis.
3.1 Predictor Transformations 89
To spread small values of the variable, make λ smaller. To spread large values
of the variable, make λ larger.
For example, if both variables are right skewed, then there will be many
more cases in the lower left of the plot than in the upper right. Hence small
values of both variables need spreading. Figures 13.3 b) and 13.16 have this
shape.
[Figure 3.1: four panels a)–d) plotting a power transformation x of w over 0.0 ≤ w ≤ 2.5, illustrating which values need spreading.]
Small values of the vertical variable need spreading. Hence in Figure 3.1d, small values of x need
spreading. Notice that the plotted points bulge down towards large values of
the horizontal variable.
Example 3.2: Mussel Data. Cook and Weisberg (1999a, pp. 351,
433, 447) gave a data set on 82 mussels sampled off the coast of New
Zealand. The response is muscle mass M in grams, and the predictors are
a constant, the length L and height H of the shell in mm, the shell width
W , and the shell mass S. Figure 3.2 shows the scatterplot matrix of the
predictors L, W , H, and S. Examine the variable length. Length is on the
vertical axis on the three top plots and the right of the scatterplot matrix
labels this axis from 150 to 300. Length is on the horizontal axis on the
three leftmost marginal plots, and this axis is labelled from 150 to 300 on the
bottom of the scatterplot matrix. The marginal plot in the bottom left corner
has length on the horizontal and shell on the vertical axis. The marginal
plot that is second from the top and second from the right has height on the
horizontal and width on the vertical axis. If the data is stored in x, the plot
can be made with the following command in R.
pairs(x,labels=c("length","width","height","shell"))
[Figure 3.2: scatterplot matrix of the mussel predictors length, width, height, and shell.]
Since max(S)/min(S) ≈ 350/10 = 35 > 10, the log rule suggests that log S may be useful. If log S
replaces S in the scatterplot matrix, then there may be some nonlinearity
present in the plot of log S versus W with small values of W needing spreading.
Hence the ladder rule suggests reducing λ from 1, and we tried log(W).
Figure 3.3 shows that taking the log transformations of W and S results in
a scatterplot matrix that is much more linear than the scatterplot matrix of
Figure 3.2. Notice that the plot of W versus L and the plot of log(W ) versus
L both appear linear. This plot can be made with the following commands.
z <- x; z[,2] <- log(z[,2]); z[,4] <- log(z[,4])
pairs(z,labels=c("length","Log W","height","Log S"))
The plot of shell versus height in Figure 3.2 is nonlinear, and small values
of shell need spreading since if the plotted points were projected on the
horizontal axis, there would be too many points at values of shell near 0.
Similarly, large values of height need spreading.
[Figure 3.3: scatterplot matrix of the transformed mussel predictors length, Log W, height, and Log S.]
Definition 3.3. Assume that all of the values of the response Zi are
positive. Then the modified power transformation family

t_λ(Zi) ≡ Zi^(λ) = (Zi^λ − 1)/λ     (3.3)

for λ ≠ 0 and Zi^(0) = log(Zi). Generally λ ∈ Λ where Λ is some interval such
as [−1, 1] or a coarse subset such as Λ_L. This family is a special case of the
response transformations considered by Tukey (1957).
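Definition 3.3 translates directly into code. The sketch below is a Python version (the book's examples use R and Arc); the λ = 0 case is the log, and the (Z^λ − 1)/λ form makes the family continuous in λ.

```python
import numpy as np

def t_lambda(z, lam):
    """Modified power transformation t_lambda(Z) of Definition 3.3 for positive z."""
    z = np.asarray(z, dtype=float)
    if np.any(z <= 0):
        raise ValueError("z must be positive")
    return np.log(z) if lam == 0 else (z**lam - 1.0) / lam

z = np.array([0.5, 1.0, 2.0, 4.0])
print(t_lambda(z, 1))    # equals z - 1: lam = 1 just recenters
```

As λ → 0, (z^λ − 1)/λ → log(z), which is why the log slots into the ladder at λ = 0 rather than the useless constant transformation z^0 = 1.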
Warning: Rule of thumb 3.2 does not always work. For example, the
log rule may fail. If the relationships in the scatterplot matrix are already
linear or if taking the transformation does not increase the linearity (especially
in the row containing the response), then no transformation may be better
than taking a transformation. For the Arc data set evaporat.lsp, the log
rule suggests transforming the response variable Evap, but no transformation
works better.
There are several reasons to use a coarse grid of powers. First, several of the
powers correspond to simple transformations such as the log, square root, and
cube root. These powers are easier to interpret than λ = 0.28, for example.
[Figure 3.4: transformation plots of TZHAT versus Z, sqrt(Z), log(Z), and 1/Z for a) λ = 1, b) λ = 1/2, c) λ = 0, and d) λ = −1.]
According to Mosteller and Tukey (1977, p. 91), the most commonly used
power transformations are the λ = 0 (log), λ = 1/2, λ = −1, and λ = 1/3
transformations in decreasing frequency of use. Secondly, if the estimator λ̂n
can only take values in Λ_L, then sometimes λ̂n will converge (e.g., in
probability) to λo ∈ Λ_L. Thirdly, Tukey (1957) showed that neighboring power
transformations are often very similar, so restricting the possible powers to
a coarse grid is reasonable. Note that powers can always be added to the
grid Λ_L. Useful powers are ±1/4, 2/3, 2, and 3. Powers from numerical
methods can also be added.
If more than one value of λ ∈ Λ_L gives a linear plot, take the simplest or
most reasonable transformation or the transformation that makes the most
sense to subject matter experts. Also check that the corresponding residual
plots of Ŵ versus W − Ŵ look reasonable. The values of λ in decreasing order
of importance are 1, 0, 1/2, −1, and 1/3. So the log transformation would be
chosen over the cube root transformation if both transformation plots look
equally good.
3.2 Graphical Methods for Response Transformations 95
The essential point of the next example is that observations that influence
the choice of the usual Box–Cox numerical power transformation are often
easily identified in the transformation plots. The transformation plots are
especially useful if the bivariate relationships of the predictors, as seen in the
scatterplot matrix of the predictors, are linear.
Example 3.4: Mussel Data. Consider the mussel data of Example 3.2
where the response is muscle mass M in grams, and the predictors are the
length L and height H of the shell in mm, the logarithm log W of the shell
width W, the logarithm log S of the shell mass S, and a constant. With this
starting point, we might expect a log transformation of M to be needed
[Figure 3.5: transformation plots of TZHAT versus t_λ(Z) for a) λ = 1, b) λ = 0, c) λ = 0.28 (Z**(0.28)), and d) λ = −1 (1/Z), with cases 8 and 48 standing out.]
Fig. 3.5 Transformation Plots for the Mussel Data
because M and S are both mass measurements and log S is being used as
a predictor. Using log M would essentially reduce all measurements to the
scale of length. The Box–Cox likelihood method gave λ̂o = 0.28 with
approximate 95 percent confidence interval 0.15 to 0.4. The log transformation
is excluded under this inference, leading to the possibility of using different
transformations of the two mass measurements.
Shown in Figure 3.5 are transformation plots for four values of λ. A striking
feature of these plots is the two points that stand out in three of the four
plots (cases 8 and 48). The Box–Cox estimate λ̂ = 0.28 is evidently influenced
by the two outlying points and, judging deviations from the identity line in
Figure 3.5c, the mean function for the remaining points is curved. In other
words, the Box–Cox estimate is allowing some visually evident curvature
in the bulk of the data so it can accommodate the two outlying points.
Recomputing the estimate of λo without the highlighted points gives λ̂o =
0.02, which is in good agreement with the log transformation anticipated
at the outset. Reconstruction of the transformation plots indicated that now
the information for the transformation is consistent throughout the data on
the horizontal axis of the plot.
Note that in addition to helping visualize λ̂ against the data, the transformation
plots can also be used to show the curvature and heteroscedasticity in
the competing models indexed by λ ∈ Λ_L. Example 3.4 shows that the plot
can also be used as a diagnostic to assess the success of numerical methods
such as the Box–Cox procedure for estimating λo.
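The Box–Cox search itself is short to sketch. The code below is Python and uses the standard profile log-likelihood (assumed here, not taken from the book's software): for each λ in the coarse grid, transform the response, fit OLS, and evaluate −(n/2) log(SSE/n) + (λ − 1) Σ log Yi.

```python
import numpy as np

def boxcox_profile(X, y, lams):
    """Return the lambda in lams maximizing the Box-Cox profile log-likelihood."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    logy_sum = np.log(y).sum()
    best_lam, best_ll = None, -np.inf
    for lam in lams:
        # modified power transformation of the response
        w = np.log(y) if lam == 0 else (y**lam - 1.0) / lam
        b = np.linalg.lstsq(Xd, w, rcond=None)[0]    # OLS fit for this lambda
        r = w - Xd @ b
        ll = -0.5 * n * np.log((r @ r) / n) + (lam - 1.0) * logy_sum
        if ll > best_ll:
            best_lam, best_ll = lam, ll
    return best_lam

rng = np.random.default_rng(3)
x = rng.uniform(1, 9, 300)
y = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.05, 300))  # log(Y) is linear in x
lams = [-1, -1/2, -1/3, 0, 1/3, 1/2, 1]               # coarse grid of powers
print(boxcox_profile(x, y, lams))                     # 0: the log transformation
```

Replacing the best-λ bookkeeping with a plot of w against its fitted values for each λ gives exactly the transformation plots of Figures 3.4 and 3.5, and, as Example 3.4 shows, the plots can overrule the numerical maximizer when a few cases drive it.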
3.3 Main Effects, Interactions, and Indicators 97
Example 3.5: Mussel Data Again. Return to the mussel data, this
time considering the regression of M on a constant and the four untransformed
predictors L, H, W, and S. Figure 3.2 shows the scatterplot matrix
of the predictors L, H, W , and S. Again nonlinearity is present. Figure 3.3
shows that taking the log transformations of W and S results in a linear
scatterplot matrix for the new set of predictors L, H, log W , and log S. Then
the search for the response transformation can be done as in Example 3.4.
Definition 3.5. Suppose that the explanatory variables have the form
x2, . . . , xk, xjj = xj^2, xij = xi xj, x234 = x2 x3 x4, et cetera. Then the variables
x2, . . . , xk are main effects. A product of two or more different main effects
is an interaction. A variable such as x2^2 or x3^7 is a power. An x2 x3 interaction
will sometimes also be denoted as x2:x3 or x2*x3.
Rule of thumb 3.3. Suppose that the MLR model contains at least one
power or interaction. Then the corresponding main effects that make up the
powers and interactions should also be in the MLR model.
Rule of thumb 3.3 suggests that if x3^2 and x2 x7 x9 are in the MLR model,
then x2, x3, x7, and x9 should also be in the MLR model. A quick way to check
whether a term like x3^2 is needed in the model is to fit the main effects model
and then make a scatterplot matrix of the predictors and the residuals, where
the residuals r are on the top row. Then the top row shows plots of xk versus
r, and if a plot is parabolic, then xk^2 should be added to the model. Potential
predictors wj could also be added to the scatterplot matrix. If the plot of
wj versus r shows a positive or negative linear trend, add wj to the model.
If the plot is quadratic, add wj and wj^2 to the model. This technique is for
quantitative variables xk and wj.
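The quick check above can be sketched numerically as well as graphically. In the Python sketch below (illustrative data, not from the book), the response really does contain an x3^2 term; fitting the main effects model and then regressing the residuals on (x3, x3^2) exposes the parabolic pattern as a clearly nonzero quadratic coefficient.

```python
import numpy as np

def ols(X, y):
    """Least squares of y on X plus a constant; returns (coef, resid)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return b, y - Xd @ b

rng = np.random.default_rng(7)
n = 400
x2, x3 = rng.normal(size=n), rng.normal(size=n)
y = 1 + x2 + x3 + 0.8 * x3**2 + rng.normal(0, 0.5, n)  # x3^2 really is needed

_, r = ols(np.column_stack([x2, x3]), y)          # main effects only
b, _ = ols(np.column_stack([x3, x3**2]), r)       # regress residuals on x3, x3^2
print(round(float(b[2]), 2))                      # near 0.8: parabolic residual pattern
```

A plot of x3 versus r for this fit would be the parabola that the scatterplot-matrix check is looking for.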
Example 3.6. Two varieties of cement that replace sand with coal waste
products were compared to a standard cement mix. The response Y was the
compressive strength of the cement measured after 7, 28, 60, 90, or 180 days
[Figure 3.6: response plot (Y versus fitted values) and residual plot (RESID versus fitted values) for the cement data.]
of curing time = x2 . This cement was intended for sidewalks and barriers but
not for construction. The data is likely from small batches of cement prepared
in the lab, and is likely correlated; however, MLR can be used for exploratory
and descriptive purposes. Actually using the different cement mixtures in the
field (e.g., as sidewalks) would be very expensive. The factor mixture had 3
levels: 2 for the standard cement, and 0 and 1 for the coal based cements.
3.4 Variable Selection 99
A plot of x2 versus Y (not shown but see Problem 3.15) resembled the left
half of a quadratic Y = c(x2 − 180)^2. Hence x2 and x2^2 were added to the
model.
Figure 3.6 shows the response plot and residual plots from this model.
The standard cement mix uses the symbol + while the coal based mixes
use an inverted triangle and square. OLS lines based on each mix are added
as visual aids. The lines from the two coal based mixes do not intersect,
suggesting that there may not be an interaction between these two mixes.
There is an interaction between the standard mix and the two coal mixes
since these lines do intersect. All three types of cement become stronger with
time, but the standard mix has the greater strength at early curing times
while the coal based cements become stronger than the standard mix at the
later times. Notice that the interaction is more apparent in the residual plot.
Problem 3.15 adds a factor Fx3 based on mix as well as the x2 Fx3 and
x2^2 Fx3 interactions. The resulting model is an improvement, but there is
still some curvature in the residual plot, and one case is not fit very well.
Variable selection, also called subset or model selection, is the search for a
subset of predictor variables that can be deleted without important loss of
information. A model for variable selection in multiple linear regression can
be described by
where x_{I/S} denotes the predictors in I that are not in S. Since this is true
regardless of the values of the predictors, β_O = 0 and the sample correlation
corr(x_i^T β, x_{I,i}^T β_I) = 1.0 for the population model if S ⊆ I.
All too often, variable selection is performed and then the researcher tries
to use the final submodel for inference as if the submodel was selected before
gathering data. At the other extreme, it could be suggested that variable
selection should not be done because classical inferences after variable selection
are not valid. Neither of these two extremes is useful.
Ideally the model is known before collecting the data. After the data is
collected, the MLR assumptions are checked and then the model is used
for inference. Alternatively, a preliminary study can be used to collect data.
Then the predictors and response can be transformed until a full model is
built that seems to be a useful MLR approximation of the data. Then variable
selection can be performed, suggesting a final model. Then this final model is
the known model used before collecting data for the main part of the study.
See the two paragraphs above the paragraph above Rule of thumb 3.1. If
the full model is known, inference with the bootstrap prediction region
method and prediction intervals of Section 3.4.1 may be useful.
In practice, the researcher often has one data set, builds the full model,
and performs variable selection to obtain a final submodel. In other words, an
extreme amount of data snooping was used to build the final model. A major
problem with the final MLR model (chosen after variable selection or data
snooping) is that it is not valid for inference in that the p-values for the OLS
t-tests and ANOVA F test are likely to be too small, while the p-value for the
partial F test that uses the final model as the reduced model is likely to be
too high. Similarly, the actual coverage of the nominal 100(1 − δ)% prediction
intervals tends to be too small and unknown (e.g., the nominal 95% PIs may
only contain 83% of the future responses Yf). Thus the model is likely to fit
the data set from which it was built much better than future observations.
Call the data set from which the MLR model was built the training data,
consisting of cases (Yi, xi) for i = 1, . . . , n. Then the future predictions tend
to be poor in that |Yf − Ŷf| tends to be larger on average than |Yi − Ŷi|.
To summarize, a final MLR model selected after variable selection can be
useful for description and exploratory analysis: the tests and intervals can
be used for exploratory purposes, but the final model is usually not valid for
inference.
Generally the research paper should state that the model was built with
one data set, and is useful for description and exploratory purposes, but
should not be used for inference. The research paper should only suggest
that the model is useful for inference if the model has been shown to be
useful on data collected after the model was built. For example, if
the researcher can collect new data and show that the model produces valid
inferences (e.g., 97 out of 100 95% prediction intervals contained the future
response Yf ), then the researcher can perhaps claim to have found a model
that is useful for inference.
Other problems exist even if the full MLR model Y = x^T β + e is good.
Let I ⊆ {1, . . . , p} and let xI be the final vector of predictors. If xI is missing
important predictors contained in the full model, sometimes called underfitting,
then the final model Y = x_I^T β_I + e may be a very poor approximation
to the data; in particular the full model may be linear while the final model
may be nonlinear. Similarly the full model may satisfy V(ei) = σ^2 while the
constant variance assumption is violated by the submodel: V(ei) = σ_i^2. These
two problems are less severe if the joint distribution of (Y, x^T)^T is multivariate
normal, since then Y = x_I^T β_I + e satisfies the constant variance MLR
model regardless of the subset I used. See Problem 10.10.
In spite of these problems, if the researcher has a single data set with
many predictors, then usually variable selection must be done. Let p − 1 be
the number of nontrivial predictors and assume that the model also contains
a constant. Also assume that n ≥ 10p. If the MLR model found after variable
selection has good response and residual plots, then the model may be very
useful for descriptive and exploratory purposes.
Simpler models are easier to explain and use than more complicated models,
and there are several other important reasons to perform variable selection.
First, an MLR model with unnecessary predictors has a mean square
error for prediction that is too large. Let xS contain the necessary predictors,
let x be the full model, and let xI be a submodel. If (3.4) holds and S ⊆ I,
then E(Y |xI) = x_I^T β_I = x_S^T β_S = x^T β. Hence OLS applied to Y and xI
yields an unbiased estimator β̂_I of β_I. If (3.4) holds, S ⊆ I, β_S is a k × 1
vector, and β_I is a j × 1 vector with j > k, then
(1/n) Σ_{i=1}^n V(Ŷ_{I,i}) = σ^2 j/n > σ^2 k/n = (1/n) Σ_{i=1}^n V(Ŷ_{S,i}).     (3.7)
(1/n) Σ_{i=1}^n V(Ŷ_i) = (1/n) tr(σ^2 H) = (σ^2/n) tr((X^T X)^{-1} X^T X) = σ^2 p/n
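The trace identity behind the display above is easy to check numerically. The Python sketch below (illustrative, not from the book) builds a design matrix with p columns and verifies that tr(H) = p, so the fitted-value variances average to σ^2 p/n for the full model.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # p columns
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X^T X)^{-1} X^T
print(round(float(np.trace(H)), 6))     # 4.0, i.e. tr(H) = p
```

The same computation on the submodel design matrix gives tr(H_I) = k, which is where the σ^2 k/n term in (3.7) comes from.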
Either include all of the indicator variables corresponding to a factor in the
model or exclude all of the indicator variables from the model. If the model
contains powers or interactions, also include all main effects in the model
(see Section 3.3).
Next we suggest methods for finding a good submodel. We make the
simplifying assumptions that the full model is good, that all predictors have the
same cost, that each submodel contains a constant, and that there is no
theory requiring that a particular predictor must be in the model. Also assume
that n ≥ 10p, and that the response and residual plots of the full model
are good. Rule of thumb 3.5 should be used for the full model and for all
submodels.
The basic idea is to obtain fitted values from the full model and the
candidate submodel. If the candidate model is good, then the plotted points in
a plot of the submodel fitted values versus the full model fitted values should
follow the identity line. In addition, a similar plot should be made using the
residuals.
A problem with this idea is how to select the candidate submodel from
the nearly 2^p potential submodels. One possibility would be to try to order
the predictors in importance, say x1, . . . , xp. Then let the kth model contain
the predictors x1, x2, . . . , xk for k = 1, . . . , p. If the predicted values from the
submodel are highly correlated with the predicted values from the full model,
then the submodel is good. All subsets selection, forward selection, and
backward elimination can be used (see Section 1.3), but criteria to separate
good submodels from bad ones are needed.
Two important summaries for submodel I are R^2(I), the proportion of
the variability of Y explained by the nontrivial predictors in the model, and
MSE(I) = σ̂_I^2, the estimated error variance. See Definitions 2.15 and 2.16.
Suppose that model I contains k predictors, including a constant. Since
adding predictors does not decrease R^2, the adjusted R_A^2(I) is often used,
where

R_A^2(I) = 1 − (1 − R^2(I)) n/(n − k) = 1 − MSE(I) n/SST.
See Seber and Lee (2003, pp. 400–401). Hence the model with the maximum
R_A^2(I) is also the model with the minimum MSE(I).
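The two forms of the adjusted R^2 above are algebraically identical, which a quick numerical check confirms. The Python sketch below (illustrative data, not from the book) fits a small model and evaluates both expressions.

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 60, 3                   # submodel with k predictors including the constant
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 1, n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
sse = float(np.sum((y - X @ b) ** 2))
sst = float(np.sum((y - y.mean()) ** 2))
r2 = 1 - sse / sst
mse = sse / (n - k)
ra2_a = 1 - (n / (n - k)) * (1 - r2)   # first form
ra2_b = 1 - n * mse / sst              # second form
print(np.isclose(ra2_a, ra2_b))        # True
```

The identity follows since (1 − R^2) n/(n − k) = [SSE/SST] n/(n − k) = n MSE/SST.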
The partial F statistic for testing whether the predictors in the full model
but not in I are needed is F_I = ([SSE(I) − SSE]/(p − k))/MSE,
where SSE is the error sum of squares from the full model, and SSE(I) is the
error sum of squares from the candidate submodel. An extremely important
criterion for variable selection is the Cp criterion.
104 3 Building an MLR Model
Definition 3.8.

Cp(I) = SSE(I)/MSE + 2k − n = (p − k)(F_I − 1) + k

where MSE is the error mean square for the full model.
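The identity in Definition 3.8 can be checked with a few lines of code. A minimal Python sketch with hypothetical SSE values, where F_I is the partial F statistic for comparing submodel I to the full model and MSE = SSE/(n − p):

```python
# Sketch: compute Cp(I) both ways and check the identity
# Cp(I) = SSE(I)/MSE + 2k - n = (p - k)(F_I - 1) + k,
# where F_I = [(SSE(I) - SSE)/(p - k)] / MSE and MSE = SSE/(n - p).
n, p = 100, 10
sse_full = 180.0                 # hypothetical full-model SSE
sse_I, k = 195.0, 6              # hypothetical submodel with k predictors
mse = sse_full / (n - p)
cp = sse_I / mse + 2 * k - n
f_I = ((sse_I - sse_full) / (p - k)) / mse
cp_via_f = (p - k) * (f_I - 1) + k
print(round(cp, 6), round(cp_via_f, 6))  # the two forms agree
```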
From Section 1.3, recall that all subsets selection, forward selection, and backward elimination produce one or more submodels of interest for k = 2, ..., p where the submodel contains k predictors including a constant. The following proposition helps explain why Cp is a useful criterion and suggests that for subsets I with k terms, submodels with Cp(I) ≤ min(2k, p) are especially interesting. Olive and Hawkins (2005) show that this interpretation of Cp can be generalized to 1D regression models with a linear predictor β^T x, such as generalized linear models. Denote the residuals and fitted values from the full model by r_i = Y_i − x_i^T β̂ = Y_i − Ŷ_i and Ŷ_i = x_i^T β̂, respectively. Similarly, let β̂_I be the estimate of β_I obtained from the regression of Y on x_I and denote the corresponding residuals and fitted values by r_{I,i} = Y_i − x_{I,i}^T β̂_I and Ŷ_{I,i} = x_{I,i}^T β̂_I where i = 1, ..., n.
Proposition 3.1. Suppose that a numerical variable selection method suggests several submodels with k predictors, including a constant, where 2 ≤ k ≤ p.
a) The model I that minimizes Cp(I) maximizes corr(r, r_I).
b) Cp(I) ≤ 2k implies that corr(r, r_I) ≥ √(1 − p/n).
c) As corr(r, r_I) → 1, corr(ESP, ESP(I)) → 1.

Using the screen Cp(I) ≤ min(2k, p) suggests that the predictor x_i should not be deleted if |t_i| > √2 ≈ 1.414. If |t_i| < √2, then the predictor can probably be deleted since Cp decreases. The literature suggests using the Cp(I) ≤ k screen, but this screen eliminates too many potentially useful submodels.
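The √2 cutoff follows from Definition 3.8: deleting the single predictor x_i gives a submodel with k = p − 1 and F_I = t_i², so Cp(I) = t_i² + p − 2, which passes the Cp(I) ≤ min(2k, p) = p screen (for p ≥ 2) exactly when |t_i| ≤ √2. A minimal Python check with hypothetical t statistics:

```python
# Sketch: deleting one predictor x_i from the full model gives a submodel
# with k = p - 1 predictors and partial F statistic F_I = t_i^2, so by
# Cp(I) = (p - k)(F_I - 1) + k we get Cp(I) = t_i^2 + p - 2.
# The screen Cp(I) <= min(2k, p) = p (for p >= 2) then holds exactly
# when t_i^2 <= 2, i.e. |t_i| <= sqrt(2) ~ 1.414.
def cp_after_single_deletion(t_i, p):
    k = p - 1
    f_I = t_i ** 2
    return (p - k) * (f_I - 1) + k   # = t_i^2 + p - 2

p = 10
for t_i in (0.5, 1.0, 1.414, 2.0):
    cp = cp_after_single_deletion(t_i, p)
    passes_screen = cp <= min(2 * (p - 1), p)
    print(t_i, round(cp, 3), passes_screen)
```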
3.4 Variable Selection 105
Six graphs will be used to compare the full model and the candidate submodel. Let β̂ be the estimate of β obtained from the regression of Y on all of the terms x. Many numerical methods such as forward selection, backward elimination, stepwise, and all subsets methods using the Cp(I) criterion (Jones 1946; Mallows 1973) have been suggested for variable selection. We will use the FF plot, RR plot, the response plots from the full and submodel, and the residual plots (of the fitted values versus the residuals) from the full and submodel. These six plots will contain a great deal of information about the candidate subset provided that Equation (3.4) holds and that a good estimator (such as OLS) for β and β_I is used.
For these plots to be useful, it is crucial to verify that a multiple linear regression (MLR) model is appropriate for the full model. Both the response plot and the residual plot for the full model need to be used to check this assumption. The plotted points in the response plot should cluster about the identity line (that passes through the origin with unit slope) while the plotted points in the residual plot should cluster about the horizontal axis (the line r = 0). Any nonlinear patterns or outliers in either plot suggest that an MLR relationship does not hold. Similarly, before accepting the candidate model, use the response plot and the residual plot from the candidate model to verify that an MLR relationship holds for the response Y and the predictors x_I. If the submodel is good, then the residual and response plots of the submodel should be nearly identical to the corresponding plots of the full model. Assume that all submodels contain a constant.
The FF plot of the submodel fitted values versus the full model fitted values and the RR plot of the submodel residuals versus the full model residuals serve as visual aids. The subset I is good if the plotted points cluster tightly about the identity line in both plots. In particular, the OLS line and the identity line should nearly coincide so that it is difficult to tell that the two lines intersect at the origin in the RR plot.
To verify that the six plots are useful for assessing variable selection, the following notation will be useful. Suppose that all submodels include a constant and that X is the full rank n × p design matrix for the full model. Let the corresponding vectors of OLS fitted values and residuals be Ŷ = X(X^T X)^{-1} X^T Y = HY and r = (I − H)Y, respectively. Suppose that X_I is the n × k design matrix for the candidate submodel and that the corresponding vectors of OLS fitted values and residuals are Ŷ_I = X_I(X_I^T X_I)^{-1} X_I^T Y = H_I Y and r_I = (I − H_I)Y, respectively.
A plot can be very useful if the OLS line can be compared to a reference line and if the OLS slope is related to some quantity of interest. Suppose that a plot of w versus z places w on the horizontal axis and z on the vertical axis. Then denote the OLS line by ẑ = a + bw. The following proposition shows that the plotted points in the FF, RR, and response plots will cluster about the identity line. Notice that the proposition is a property of OLS and holds even if the data does not follow an MLR model. Let corr(x, y) denote the correlation between x and y. Also recall that the OLS line passes through the means of the two variables (w, z).
(*) Notice that the OLS slope from regressing z on w is equal to one if and only if the OLS slope from regressing w on z is equal to [corr(z, w)]².
Parts i), v), and vi) of the proof of Proposition 3.2 follow.
i) The slope b = 1 if Σ_i Ŷ_{I,i} Ŷ_i = Σ_i Ŷ²_{I,i}. This equality holds since Ŷ_I^T Ŷ = Y^T H_I Y = Y^T H_I H_I Y = Ŷ_I^T Ŷ_I. Since b = 1, a = Ȳ − Ȳ = 0.
v) The OLS line passes through the origin. Hence a = 0. The slope b = r^T r_I / r^T r. Since r^T r_I = Y^T (I − H)(I − H_I)Y and (I − H)(I − H_I) = I − H, the numerator r^T r_I = r^T r and b = 1.
vi) Again a = 0 since the OLS line passes through the origin. From v),

1 = √(SSE(I)/SSE) [corr(r, r_I)].
Hence

corr(r, r_I) = √(SSE/SSE(I))

and the slope

b = √(SSE/SSE(I)) [corr(r, r_I)] = [corr(r, r_I)]².
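The relation corr(r, r_I) = √(SSE/SSE(I)) (Proposition 3.2 vi)) can be verified numerically. A sketch in Python with numpy (the text's own software is in R), using simulated data and a submodel that drops one predictor; both residual vectors have mean zero since each model contains a constant:

```python
# Sketch (numpy): numerically check corr(r, r_I) = sqrt(SSE/SSE(I))
# on simulated data; the submodel I drops the last predictor.
# The data, coefficients, and seed are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -1.0, 0.5, 0.1]) + rng.normal(size=n)

def ols_resid(X, Y):
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Y - X @ beta

r = ols_resid(X, Y)            # full model residuals
rI = ols_resid(X[:, :-1], Y)   # submodel residuals (last predictor dropped)
sse, sseI = r @ r, rI @ rI
corr = np.corrcoef(r, rI)[0, 1]
print(round(corr, 6), round(np.sqrt(sse / sseI), 6))  # equal
```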
Remark 3.2. Daniel and Wood (1980, p. 85) suggest using Mallows' graphical method for screening subsets by plotting k versus Cp(I) for models close to or under the Cp = k line. Proposition 3.2 vi) implies that if Cp(I) ≤ k or F_I < 1, then corr(r, r_I) and corr(ESP, ESP(I)) both go to 1.0 as n → ∞. Hence models I that satisfy the Cp(I) ≤ k screen will contain the true model S with high probability when n is large. This result does not guarantee that the true model S will satisfy the screen, but overfit is likely. Let d be a lower bound on corr(r, r_I). Proposition 3.2 vi) implies that if

Cp(I) ≤ 2k + n(1/d² − 1) − p/d²,

then corr(r, r_I) ≥ d.
Rule of thumb 3.7. Assume that the full model has good response and residual plots and that n ≥ 10p. Let subset I have k predictors, including a constant. Know how to find good models from output. The following rules of thumb (roughly in order of decreasing importance) may be useful. It is often not possible for all 10 rules of thumb to hold simultaneously. Let Imin be the minimum Cp model and let II be the model with the fewest predictors satisfying Cp(II) ≤ Cp(Imin) + 1. Do not use more predictors than model II to avoid overfitting. Then the submodel I is good if
i) the response and residual plots for the submodel look like the response and residual plots for the full model,
ii) corr(ESP, ESP(I)) = corr(Ŷ, Ŷ_I) ≥ 0.95.
iii) The plotted points in the FF plot (= EE plot for MLR) cluster tightly about the identity line.
iv) Want the p-value ≥ 0.01 for the partial F test that uses I as the reduced model.
v) The plotted points in the RR plot cluster tightly about the identity line.
vi) Want R²(I) > 0.9 R² and R²(I) > R² − 0.07 (recall that R²(I) ≤ R² = R²(full) since adding predictors to I does not decrease R²(I)).
vii) Want Cp(Imin) ≤ Cp(I) ≤ min(2k, p) with no big jumps in Cp (the increase should be less than four) as variables are deleted.
viii) Want hardly any predictors with p-values > 0.05.
ix) Want few predictors with p-values between 0.01 and 0.05.
x) Want MSE(I) to be smaller than or not much larger than the MSE from the full model.
(If n ≥ 5p, use the above rules, but we want n ≥ 10k.)
Make a scatterplot matrix of the estimated sufficient predictors of candidate models M1, M2, M3, M4, and M5 that are highly correlated with the full model ESP (the correlation should be at least 0.9 and preferably greater than 0.95). Similarly, make a scatterplot matrix of the residuals for M1, M2, M3, M4, and M5.
To summarize, the final submodel should have few predictors, few variables with large OLS t test p-values (0.01 to 0.05 is borderline), good response and residual plots, and an FF plot (= EE plot) that clusters tightly about the identity line. If a factor has c − 1 indicator variables, either keep all c − 1 indicator variables or delete all c − 1 indicator variables; do not delete some of the indicator variables.
Example 3.7. The pollution data of McDonald and Schwing (1973) can be obtained from STATLIB or the text's website. The response Y = mort is the mortality rate, and most of the independent variables were related to pollution. A scatterplot matrix of the first 9 predictors and Y was made, and then a scatterplot matrix of the remaining predictors with Y. The log rule suggested making the log transformation with 4 of the variables. The summary output is shown below and on the following page. The response and residual plots were good. Notice that p = 16 and n = 60 < 5p. Also many p-values are too high.
Response = MORT
Label Estimate Std. Error t-value p-value
Constant 1881.11 442.628 4.250 0.0001
DENS 0.00296 0.00397 0.747 0.4588
EDUC -19.6669 10.7005 -1.838 0.0728
log[HC] -31.0112 15.5615 -1.993 0.0525
HOUS -0.40107 1.64372 -0.244 0.8084
HUMID -0.44540 1.06762 -0.417 0.6786
JANT -3.58522 1.05355 -3.403 0.0014
JULT -3.84292 2.12079 -1.812 0.0768
log[NONW] 27.2397 10.1340 2.688 0.0101
log[NOX] 57.3041 15.4764 3.703 0.0006
OVR65 -15.9444 8.08160 -1.973 0.0548
POOR 3.41434 2.74753 1.243 0.2206
POPN -131.823 69.1908 -1.905 0.0633
PREC 3.67138 0.77814 4.718 0.0000
log[SO] -10.2973 7.38198 -1.395 0.1700
WWDRK 0.88254 1.50954 0.585 0.5618
Taking the minimum Cp model from backward elimination gives the output shown below. The response and residual plots were OK, although the correlation in the RR and FF plots was not especially high. The R² in the submodel decreased from about 0.79 to 0.74, while σ̂ = √MSE was 33.22 for the full model and 33.31 for the submodel. Removing nonlinearities from the predictors by using two scatterplots and the log rule, and then using backward elimination and forward selection, seems to be very effective for finding the important predictors for this data set. See Problem 13.17 in order to reproduce this example with the essential plots.
Response = MORT
Label Estimate Std. Error t-value p-value
Constant 943.934 82.2254 11.480 0.0000
EDUC -15.7263 6.17683 -2.546 0.0138
JANT -1.86899 0.48357 -3.865 0.0003
log[NONW] 33.5514 5.93658 5.652 0.0000
log[NOX] 21.7931 4.29248 5.077 0.0000
PREC 2.92801 0.59011 4.962 0.0000
The variable cause was coded as 1 if the cause of death was acute, 3 if the cause of death was chronic, and coded as 2 otherwise. A variable ageclass was coded as 0 if the age was under 20, 1 if the age was between 20 and 45, and as 3 if the age was over 45. Head size, the product of the head length, head breadth, and head height, is a volume measurement, hence (size)^{1/3} was also used as a predictor with the same physical dimensions as the other lengths. Thus there are 11 nontrivial predictors and one response, and all models will also contain a constant. Nine cases were deleted because of missing values, leaving 267 cases.
Figure 3.7 shows the response plots and residual plots for the full model and the final submodel that used a constant, (size)^{1/3}, age, and sex. The five cases separated from the bulk of the data in each of the four plots correspond to five infants. These may be outliers, but the visual separation reflects the small number of infants and toddlers in the data. A purely numerical variable selection procedure would miss this interesting feature of the data. We will first perform variable selection with the entire data set, and then examine the effect of deleting the five cases. Using forward selection and the Cp statistic on the Gladstone data suggests the subset I5 containing a constant, (size)^{1/3}, age, sex, breadth, and cause with Cp(I5) = 3.199. The p-values for breadth and cause were 0.03 and 0.04, respectively. The subset I4 that deletes cause has Cp(I4) = 5.374, and the p-value for breadth was 0.05. Figure 3.8d shows the RR plot for the subset I4. Note that the correlation of the plotted points is very high and that the OLS and identity lines nearly coincide.
Fig. 3.7 Gladstone data: comparison of the full model and the submodel.
Then the RR plot of r_{I,i} versus r_i and the FF plot of Ŷ_{I,i} versus Ŷ_i were constructed.
Fig. 3.8 Gladstone data: submodels added (size)^{1/3}, sex, age, and finally breadth.
For this model, the correlation in the FF plot (Figure 3.8b) was very high, but in the RR plot the OLS line did not coincide with the identity line (Figure 3.8a). Next sex was added to I, but again the OLS and identity lines did not coincide in the RR plot (Figure 3.8c). Hence age was added to I. Figure 3.9a shows the RR plot with the OLS and identity lines added. These two lines now nearly coincide, suggesting that a constant plus (size)^{1/3}, sex, and age contains the relevant predictor information. This subset has Cp(I) = 7.372, R²(I) = 0.80, and σ̂_I = 74.05. The full model which used 11 predictors and a constant has R² = 0.81 and σ̂ = 73.58. Since the Cp criterion suggests adding breadth and cause, the Cp criterion may be leading to an overfit.
Figure 3.9b shows the FF plot. The five cases in the southwest corner correspond to five infants. Deleting them leads to almost the same conclusions, although the full model now has R² = 0.66 and σ̂ = 73.48 while the submodel has R²(I) = 0.64 and σ̂_I = 73.89.
Fig. 3.9 Gladstone data with predictors (size)^{1/3}, sex, and age.
Example 3.9. Cook and Weisberg (1999a, pp. 261, 371) describe a data set where rats were injected with a dose of a drug approximately proportional to body weight. The data set is included as the file rat.lsp in the Arc software.
The point of this example is that a subset of outlying cases can cause numeric second-moment criteria such as Cp to find structure that does not exist. The FF and RR plots can sometimes detect these outlying cases, allowing the experimenter to run the analysis without the influential cases. The example also illustrates that global numeric criteria can suggest a model with one or more nontrivial terms when in fact the response is independent of the predictors.
Numerical variable selection methods for MLR are very sensitive to influential cases such as outliers. Olive and Hawkins (2005) show that a plot of the residuals versus Cook's distances (see Section 3.5) can be used to detect influential cases. Such cases can also often be detected from response, residual, RR, and FF plots.
Warning: deleting influential cases and outliers will often lead to better plots and summary statistics, but the cleaned data may no longer represent the actual population. In particular, the resulting model may be very poor for both prediction and description.
Multiple linear regression data sets with cases that influence numerical variable selection methods are common. Table 3.1 shows results for seven interesting data sets. The first two rows correspond to the Ashworth (1842) data, the next 2 rows correspond to the Gladstone data in Example 3.8, and the next 2 rows correspond to the Gladstone data with the 5 infants deleted. Rows 7 and 8 are for the Buxton (1920) data, while rows 9 and 10 are for the Tremearne (1911) data. These data sets are available from the book's website. Results from the final two data sets are given in the last 4 rows. The last 2 rows correspond to the rat data described in Example 3.9. Rows 11 and 12 correspond to the ais data that comes with Arc (Cook and Weisberg 1999a).
The full model used p predictors, including a constant. The final submodel I also included a constant, and the nontrivial predictors are listed in the second column of Table 3.1. For a candidate submodel I, let Cp(I, c) denote the value of the Cp statistic for the clean data that omits influential cases and outliers. The third column lists p, Cp(I), and Cp(I, c), while the first column gives the set of influential cases. Two rows are presented for each data set. The second row gives the response variable and any predictor transformations. For example, for the Gladstone data p = 10 since there were 9 nontrivial predictors plus a constant. Only the predictor size was transformed, and the final submodel is the one given in Example 3.8. For the rat data, the final submodel is the one given in Example 3.9: none of the 3 nontrivial predictors was used.
Table 3.1 and simulations suggest that if the subset I has k predictors, then using the Cp(I) ≤ min(2k, p) screen is better than using the conventional Cp(I) ≤ k screen. The major and ais data sets show that deleting the influential cases may increase the Cp statistic. Thus interesting models from the entire data set and from the clean data set should be examined.
Example 3.10. Conjugated linoleic acid (CLA) occurs in beef and dairy products and appears to have many human health benefits. Joanne Numrich provided four data sets where the response was the amount of CLA (or related compounds), and the explanatory variables were feed components from the cattle diet. The data was to be used for descriptive and exploratory purposes. Several data sets had outliers with unusually high levels of CLA. These outliers were due to one researcher and may be the most promising cases in the data set. However, to describe the bulk of the data with OLS MLR, the outliers were omitted. In one of the data sets there are 33 cases and 25
The bootstrap will be described and then applied to variable selection. Suppose there is data w_1, ..., w_n collected from a distribution with cdf F into an n × p matrix W. The empirical distribution, with cdf F_n, gives each observed data case w_i probability 1/n. Let the statistic T_n = t(W) = t(F_n) be computed from the data. Suppose the statistic estimates θ = t(F). Let t(W*) = t(F*_n) = T*_n indicate that t was computed from an iid sample from the empirical distribution F_n: a sample of size n was drawn with replacement from the observed sample w_1, ..., w_n.
Some notation is needed to give the Olive (2013a) prediction region used to bootstrap a hypothesis test. Suppose w_1, ..., w_n are iid p × 1 random vectors with mean μ and nonsingular covariance matrix Σ_w. Let a future test observation w_f be independent of the w_i but from the same distribution. Let (w̄, S) be the sample mean and sample covariance matrix where

w̄ = (1/n) Σ_{i=1}^n w_i  and  S = S_w = (1/(n − 1)) Σ_{i=1}^n (w_i − w̄)(w_i − w̄)^T.   (3.8)

Let D_i² = D²_{w_i}(w̄, S) for each observation w_i. Let D_(c) be the cth order statistic of D_1, ..., D_n. Consider the hyperellipsoid

A_n = {w : D²_w(w̄, S) ≤ D²_(c)} = {w : D_w(w̄, S) ≤ D_(c)}.   (3.10)
Hence if there was an iid sample T_{1,n}, ..., T_{B,n} of the statistic, the Olive (2013a) large sample 100(1 − δ)% prediction region {w : D²_w(T̄, S_T) ≤ D²_(c)} for T_{f,n} contains E(T_n) ≈ θ with asymptotic coverage 1 − δ. To make the asymptotic coverage equal to 1 − δ, use the large sample 100(1 − δ)% confidence region {w : D²_w(T_{1,n}, S_T) ≤ D²_(c)}. The prediction region method bootstraps this procedure by using a bootstrap sample of the statistic T*_{1,n}, ..., T*_{B,n}. Centering the region at T*_{1,n} instead of T̄* is not needed since the bootstrap sample is centered near T_n: the distribution of √n(T_n − θ) is approximated by the distribution of √n(T* − T_n) or by the distribution of √n(T* − T̄*).
If H_0 is true and E(Aβ̂) = c, then θ = 0. Let D_0² = T̄*^T [S*_T]^{-1} T̄* and fail to reject H_0 if D_0 ≤ D_(U_B) and reject H_0 if D_0 > D_(U_B). This percentile method is equivalent to computing the prediction region (3.10) on the w_i = T*_i and checking whether 0 is in the prediction region.
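A minimal numpy sketch of the percentile test just described; the cutoff index U_B is taken here as ⌈B(1 − δ)⌉, a simplification of whatever small sample correction the text's predreg function uses, so this is an illustration rather than the text's exact implementation:

```python
# Sketch (numpy): prediction-region/percentile test on a bootstrap
# sample T*_1,...,T*_B. Fail to reject H0: theta = 0 if D_0 <= D_(U_B),
# where D_0 is the Mahalanobis distance of 0 from the bootstrap mean.
import math
import numpy as np

def predreg_test(Tstar, delta=0.05):
    """Tstar: B x r matrix of bootstrapped statistics."""
    B = Tstar.shape[0]
    Tbar = Tstar.mean(axis=0)
    Sinv = np.linalg.inv(np.atleast_2d(np.cov(Tstar, rowvar=False)))
    # distances of the bootstrap points from their mean
    D = np.sqrt(np.einsum('ij,jk,ik->i', Tstar - Tbar, Sinv, Tstar - Tbar))
    cutoff = np.sort(D)[min(B - 1, math.ceil(B * (1 - delta)) - 1)]
    D0 = math.sqrt(Tbar @ Sinv @ Tbar)   # distance of 0 from Tbar
    return D0, cutoff, D0 <= cutoff      # True = fail to reject H0

rng = np.random.default_rng(1)
Tstar = rng.normal(loc=0.1, scale=1.0, size=(1000, 3))  # H0 nearly true
D0, cutoff, fail_to_reject = predreg_test(Tstar)
print(round(D0, 3), round(cutoff, 3), fail_to_reject)
```

For an asymptotically normal statistic with r = 3, the cutoff should be near √(χ²_{3,0.95}) ≈ 2.8.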
Methods for bootstrapping the multiple linear regression model are well known. The estimated covariance matrix of the (ordinary) least squares estimator is

Ĉov(β̂_OLS) = MSE (X^T X)^{-1}.
The residual bootstrap computes the least squares estimator and obtains the n residuals and fitted values r_1, ..., r_n and Ŷ_1, ..., Ŷ_n. Then a sample of size n is selected with replacement from the residuals, resulting in r*_{11}, ..., r*_{n1}. Hence the empirical distribution of the residuals is used. Then a vector Y*_1 = (Y*_{11}, ..., Y*_{n1})^T is formed where Y*_{i1} = Ŷ_i + r*_{i1}. Then Y*_1 is regressed on X, resulting in the estimator β̂*_1. This process is repeated B times, resulting in the estimators β̂*_1, ..., β̂*_B. This method should have n ≥ 10p so that the residuals r_i are close to the errors e_i.
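The residual bootstrap described above can be sketched in a few lines of numpy (the text's own functions, such as regboot, are in R); the simulated data, coefficients, seed, and B are arbitrary:

```python
# Sketch (numpy): residual bootstrap for OLS. Resample the residuals
# with replacement, form Y*_i = Yhat_i + r*_i, and refit, giving B
# bootstrap estimators beta*_1,...,beta*_B.
import numpy as np

def residual_bootstrap(X, Y, B=200, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(Y)
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    fitted = X @ beta
    resid = Y - fitted
    betas = np.empty((B, X.shape[1]))
    for b in range(B):
        rstar = rng.choice(resid, size=n, replace=True)  # resample residuals
        Ystar = fitted + rstar                           # Y*_i = Yhat_i + r*_i
        betas[b], *_ = np.linalg.lstsq(X, Ystar, rcond=None)
    return betas

rng = np.random.default_rng(2)
n, p = 120, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=n)
betas = residual_bootstrap(X, Y, B=300, rng=rng)
print(betas.shape, np.round(betas.mean(axis=0), 2))
```

The bootstrap means should be close to the OLS fit, and by the Efron (1982) result the bootstrap standard errors are roughly √((n − p)/n) times the least squares standard errors.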
Efron (1982, p. 36) notes that for the residual bootstrap, the sample covariance matrix of the β̂*_i is estimating the population bootstrap matrix ((n − p)/n) MSE (X^T X)^{-1} as B → ∞. Hence the residual bootstrap standard error SE(β̂*_i) ≈ √((n − p)/n) SE(β̂_{i,OLS}).
If the z_i = (Y_i, x_i^T)^T are iid observations from some population, then a sample of size n can be drawn with replacement from z_1, ..., z_n. Then the response and predictor variables can be formed into vector Y*_1 and design matrix X*_1. Then Y*_1 is regressed on X*_1, resulting in the estimator β̂*_1. This process is repeated B times, resulting in the estimators β̂*_1, ..., β̂*_B. If the z_i are the rows of a matrix Z, then this nonparametric bootstrap uses the empirical distribution of the z_i.
Following Seber and Lee (2003, p. 100), the classical test statistic for testing H_0: Aβ = c, where A is a full rank r × p matrix, is

(n/(n − p)) (Aβ̂ − c)^T [MSE A(X^T X)^{-1} A^T]^{-1} (Aβ̂ − c),

and we expect D²_(U_B) ≈ (n/(n − p)) χ²_{r,1−δ} for large n and B, and p << n.
u ∼ Σ_{i=1}^K π_i N_p(0, σ²_{I_i} W_{I_i,0}),

where 0 ≤ π_i ≤ 1, Σ_{i=1}^K π_i = 1, and K is the number of subsets I_i that contain S.
Inference techniques for the variable selection model have not had much success. Efron (2014) lets t(Z) be a scalar valued statistic, based on all of the data Z, that estimates a parameter of interest θ. Form a bootstrap sample Z*_i and t(Z*_i) for i = 1, ..., B. Then θ̂* = s(Z) = (1/B) Σ_{i=1}^B t(Z*_i), a bootstrap smoothing or bagging estimator. In the regression setting with variable selection, Z*_i can be formed with the nonparametric or residual bootstrap using the full model. The prediction region method can also be applied to t(Z). For example, when A is 1 × p, the prediction region method uses θ = Aβ − c, t(Z) = Aβ̂ − c, and T̄* = θ̄*. Efron (2014) used the confidence interval T ± z_{1−δ} SE(T), which is symmetric about T. The prediction region method uses T̄* ± √(S*_T) D_(U_B), which is also a symmetric interval centered at T̄*. If both the prediction region method and Efron's method are large sample confidence intervals for θ, then they have the same asymptotic length (scaled by multiplying by √n), since otherwise the shorter interval will have lower asymptotic coverage. Since the prediction region interval is a percentile interval, the shorth(c) interval could have much shorter length than both the Efron interval and the prediction region interval if the bootstrap distribution is not symmetric.
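A shorth(c) interval is the shortest closed interval containing at least c of the B bootstrap values. A minimal Python sketch, taking c = ⌈B(1 − δ)⌉ (the text's shorth3 function may use a corrected count); note the interval is closed, so a point mass at 0 is covered:

```python
# Sketch: the shorth(c) interval of a bootstrap sample -- the shortest
# closed interval [v_(i), v_(i+c-1)] over windows of c consecutive
# order statistics, with c = ceil(B(1 - delta)).
import math

def shorth(vals, delta=0.05):
    v = sorted(vals)
    B = len(v)
    c = math.ceil(B * (1 - delta))
    # slide a window of c consecutive order statistics; keep the shortest
    best = min(range(B - c + 1), key=lambda i: v[i + c - 1] - v[i])
    return v[best], v[best + c - 1]

# a right-skewed bootstrap sample with many zeroes, as can occur when
# bootstrapping a weak predictor's coefficient under variable selection
vals = [0.0] * 60 + [0.1 * i for i in range(1, 41)]  # B = 100
lo, hi = shorth(vals)
print(lo, hi)  # closed interval containing 0
```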
The prediction region method can be used for vector valued statistics and parameters, and may not need the statistic to be asymptotically normal. These features are likely useful for variable selection models. Prediction intervals and regions can have higher than the nominal coverage 1 − δ if the distribution is discrete or a mixture of a discrete distribution and some other distribution. In particular, coverage can be high if the w_i distribution is a mixture of a point mass at 0 and the method checks whether 0 is in the prediction region. Such a mixture often occurs for variable selection methods. The bootstrap sample for the W_i = β̂*_{ij} can contain many zeroes and be highly skewed if the jth predictor is weak. Then the computer program may fail because S_w is singular, but if all or nearly all of the β̂*_{ij} = 0, then there is strong evidence that the jth predictor is not needed given that the other predictors are in the variable selection method.
As an extreme simulation case, suppose β̂*_{ij} = 0 for i = 1, ..., B and for each run in the simulation. Consider testing H_0: β_j = 0. Then regardless of the nominal coverage 1 − δ, the closed interval [0, 0] will contain 0 for each run and the observed coverage will be 1 > 1 − δ. Using the open interval (0, 0) would give observed coverage 0. Also intervals [0, b] and [a, 0] correctly suggest failing to reject β_j = 0, while intervals (0, b) and (a, 0) incorrectly suggest rejecting H_0: β_j = 0. Hence closed regions and intervals make sense.
Olive (2016a) showed that applying the prediction region method results in a large sample 100(1 − δ)% confidence region for θ for a wide variety of problems, and used the method for variable selection where θ = Aβ.
Example 3.11. Cook and Weisberg (1999a, pp. 351, 433, 447) give a data set on 82 mussels sampled off the coast of New Zealand. Let the response variable be the logarithm log(M) of the muscle mass, and the predictors are the length L and height H of the shell in mm, the logarithm log(W) of the shell width W, the logarithm log(S) of the shell mass S, and a constant. Inference for the full model is shown along with the shorth(c) nominal 95% confidence intervals for β_i computed using the nonparametric and residual bootstraps. As expected, the residual bootstrap intervals are close to the classical least squares confidence intervals β̂_i ± 2SE(β̂_i).
The minimum Cp model from all subsets variable selection uses a constant, H, and log(S). The shorth(c) nominal 95% confidence intervals for β_i using the residual bootstrap are shown. Note that the interval for H is right skewed and contains 0 when closed intervals are used instead of open intervals. The least squares output is also shown, but should only be used for inference if the model was selected before looking at the data.
It was expected that log(S) may be the only predictor needed, along with a constant, since log(S) and log(M) are both log(mass) measurements and likely highly correlated. Hence we want to test H_0: β_2 = β_3 = β_4 = 0 with the Imin model selected by all subsets variable selection. (Of course this test would be easy to do with the full model using least squares theory.) Then H_0: Aβ = (β_2, β_3, β_4)^T = 0. Using the prediction region method with the full model gave an interval [0, 2.930] with D_0 = 1.641. Note that √(χ²_{3,0.95}) = 2.795. So fail to reject H_0. Using the prediction region method with the Imin variable selection model had [0, D_(U_B)] = [0, 3.293] while D_0 = 1.134. So fail to reject H_0.
library(leaps)
#the functions regboot, rowboot, vselboot, shorth3, and predreg
#are from the text's website
y <- log(mussels[,5]); x <- mussels[,1:4]
x[,4] <- log(x[,4]); x[,2] <- log(x[,2]) #log(S) and log(W)
out <- regboot(x,y,B=1000) #residual bootstrap
tem <- rowboot(x,y,B=1000) #nonparametric bootstrap
outvs <- vselboot(x,y,B=1000) #bootstrap the minimum Cp model
#get bootstrap CIs
apply(out$betas,2,shorth3)
apply(tem$betas,2,shorth3)
apply(outvs$betas,2,shorth3)
ls.print(outvs$full)
ls.print(outvs$sub)
#test if beta_2 = beta_3 = beta_4 = 0
Abeta <- out$betas[,2:4]
#prediction region method with residual bootstrap
predreg(Abeta)
Abeta <- outvs$betas[,2:4]
#prediction region method with Imin
predreg(Abeta)
Example 3.12. Consider the Gladstone (1905) data set where the variables are as in Problem 3.6. Output is shown below for the full model and the bootstrapped minimum Cp forward selection estimator. Note that the shorth intervals for length and sex are quite long. These variables are often in and often deleted from the bootstrap forward selection model. Output for II is also shown. For this data set, II = Imin.
The regression models used the residual bootstrap on the full model least squares estimator and on the all subsets variable selection estimator for the model Imin. The residuals were from least squares applied to the full model in both cases. Results are shown for when the iid errors e_i ∼ N(0, 1). Table 3.2 shows two rows for each model giving the observed confidence interval coverages and average lengths of the confidence intervals. The term reg is for the full model regression, and the term vs is for the all subsets variable selection. The column for the test gives the length and coverage = P(fail to reject H_0) for the interval [0, D_(U_B)] where D_(U_B) is the cutoff for the confidence region. The volume of the confidence region will decrease to 0 as n → ∞. The cutoff will often be near √(χ²_{r,0.95}) if the statistic T is asymptotically normal. Note that √(χ²_{2,0.95}) = 2.448 is very close to 2.449 for the full model regression bootstrap test. The coverages were near 0.95 for the regression bootstrap on the full model. For Imin the coverages were near 0.95 for β_1 and β_2, but higher for the other 3 tests since zeroes often occurred for β̂*_j for j = 3, 4. The average lengths and coverages were similar for the full model and all subsets variable selection Imin for β_1 and β_2, but the lengths were shorter for Imin for β_3 and β_4.
3.5 Diagnostics
Let u_i denote the (p − 1) × 1 vector of nontrivial predictors for the ith case, with sample mean

ū = (1/n) Σ_{i=1}^n u_i   (3.14)

and sample covariance matrix

C = Cov(U) = (1/(n − 1)) Σ_{i=1}^n (u_i − ū)(u_i − ū)^T,   (3.15)

so the Mahalanobis distance MD_i of u_i satisfies MD_i² = (u_i − ū)^T C^{-1} (u_i − ū).
Let

Ŷ_(i) = X β̂_(i)   (3.16)

denote the n × 1 vector of fitted values from estimating β with OLS without the ith case. Denote the jth element of Ŷ_(i) by Ŷ_{(i),j}. It can be shown that the variance of the ith residual VAR(r_i) = σ²(1 − h_i), where the leverage h_i is the ith diagonal element of the hat matrix H. The usual estimator of the error variance is

σ̂² = Σ_{i=1}^n r_i² / (n − p).

The (internally) studentized residual

e_i = r_i / (σ̂ √(1 − h_i))

has zero mean and approximately unit variance.
The Cook's distance

CD_i = (1/(p σ̂²)) Σ_{j=1}^n (Ŷ_{(i),j} − Ŷ_j)².

It can be shown that

CD_i = r_i² h_i / (p σ̂² (1 − h_i)²) = (e_i²/p) h_i/(1 − h_i).
When the statistics CD_i, h_i, and MD_i are large, case i may be an outlier or influential case. Examining a stem plot or dot plot of these three statistics for unusually large values can be useful for flagging influential cases. Cook and Weisberg (1999a, p. 358) suggest examining cases with CD_i > 0.5 and that cases with CD_i > 1 should always be studied. Since H = H^T and H = HH, the hat matrix is symmetric and idempotent. Hence the eigenvalues of H are zero or one, and trace(H) = Σ_{i=1}^n h_i = p. It can be shown that 0 ≤ h_i ≤ 1. Rousseeuw and Leroy (1987, pp. 220, 224) suggest using h_i > 2p/n and MD_i² > χ²_{p−1,0.95} as benchmarks for leverages and Mahalanobis distances, where χ²_{p−1,0.95} is the 95th percentile of a chi-square distribution with p − 1 degrees of freedom.
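The diagnostics above can be computed directly from the hat matrix. A numpy sketch on simulated data with one planted high-leverage case, checking the identity between the two forms of CD_i, trace(H) = p, and the h_i > 2p/n benchmark (the data and seed are arbitrary):

```python
# Sketch (numpy): compute leverages h_i, studentized residuals e_i, and
# Cook's distances CD_i for an OLS fit, and check the identity
# CD_i = r_i^2 h_i / (p sigma^2 (1 - h_i)^2) = (e_i^2/p) h_i/(1 - h_i).
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
X[0, 1:] += 6.0                       # plant one high-leverage case
Y = X @ np.array([1.0, 1.0, -2.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix
h = np.diag(H)                        # leverages
r = Y - H @ Y                         # residuals
sigma2 = (r @ r) / (n - p)            # usual error variance estimate
e = r / np.sqrt(sigma2 * (1 - h))     # studentized residuals
cd = r**2 * h / (p * sigma2 * (1 - h)**2)
cd2 = (e**2 / p) * h / (1 - h)        # second form of Cook's distance
flagged = np.where(h > 2 * p / n)[0]  # leverage benchmark
print(round(np.trace(H), 6), np.allclose(cd, cd2), flagged[:5])
```

The planted case (index 0) should be flagged by the 2p/n benchmark.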
Note that Proposition 3.4c) implies that Cook's distance is the product of the squared residual and a quantity that becomes larger the farther u_i is from ū. Hence influence is roughly the product of leverage and distance of Y_i from Ŷ_i (see Fox 1991, p. 21). Mahalanobis distances and leverages both define hyperellipsoids based on a metric closely related to the sample covariance matrix of the nontrivial predictors. All points u_i on the same hyperellipsoidal contour are the same distance from ū and have the same leverage (or the same Mahalanobis distance).
Cook's distances, leverages, and Mahalanobis distances can be effective for finding influential cases when there is a single outlier, but can fail if there are two or more outliers. Nevertheless, these numerical diagnostics combined with response and residual plots are probably the most effective techniques for detecting cases that affect the fitted values when the multiple linear regression model is a good approximation for the bulk of the data.
A scatterplot of x versus y (recall the convention that a plot of x versus y
means that x is on the horizontal axis and y is on the vertical axis) is used to
visualize the conditional distribution y|x of y given x (see Cook and Weisberg 1999a, p. 31). For the simple linear regression model (with one nontrivial predictor x_2), the most effective technique for checking the assumptions of the model is to make a scatterplot of x_2 versus Y and a residual plot of x_2 versus r_i. Departures from linearity in the scatterplot suggest that the simple linear regression model is not adequate. The points in the residual plot should scatter about the line r = 0 with no pattern. If curvature is present or if the distribution of the residuals depends on the value of x_2, then the simple linear regression model is not adequate.
In general there is more than one nontrivial predictor and in this setting
two plots are crucial for any multiple linear regression analysis, re-
gardless of the regression estimator (e.g., OLS, L1, etc.). The first plot is the
residual plot of the fitted values Ŷ_i versus the residuals r_i, and the second
plot is the response plot of the fitted values Ŷ_i versus the response Y_i.
Recalling Definitions 2.11 and 2.12, residual and response plots are plots of
w_i = x_i^T η versus r_i and Y_i, respectively, where η is a known p × 1 vector. The
most commonly used residual and response plots take η = β̂. Plots against the
individual predictors x_j and potential predictors are also used. If the
residual plot is not ellipsoidal with zero slope, then the unimodal MLR model
(with iid errors from a unimodal distribution that is not highly skewed) is
not sustained. In other words, if the variables in the residual plot show some
type of dependency, e.g. increasing variance or a curved pattern, then the
multiple linear regression model may be inadequate. Proposition 2.1 showed
that the response plot simultaneously displays the fitted values, response, and
residuals. The plotted points in the response plot should scatter about the
identity line if the multiple linear regression model holds. Recall that residual
plots magnify departures from the model while the response plot emphasizes
how well the model fits the data.
When the bulk of the data follows the MLR model, the following rules of
thumb are useful for finding influential cases and outliers from the response
and residual plots. Look for points with large absolute residuals and for points
far away from Ȳ. Also look for gaps separating the data into clusters. To
determine whether small clusters are outliers or good leverage points, give
zero weight to the clusters, and fit an MLR estimator to the bulk of the
data. Denote the weighted estimator by β̂_w. Then plot Ŷ_w versus Y using
the entire data set. If the identity line passes through the bulk of the data
but not the cluster, then the cluster points may be outliers. In Figure 3.7,
the 5 infants are good leverage points in that the fit to the bulk of the data
passes through the cluster of infants. For the Buxton (1920) data, the cluster
of cases far from the bulk of the data in Figure 3.11 are outliers.
3.6 Outlier Detection 133
To see why gaps are important, recall that the coefficient of determination
R^2 is equal to the squared correlation (corr(Y, Ŷ))^2. R^2 overemphasizes the
strength of the MLR relationship when there are two clusters of data since
much of the variability of Y is due to the smaller cluster.
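A tiny made-up illustration of this point: for simple linear regression R^2 = (corr(x, y))^2, and two widely separated clusters can produce a large R^2 even though the relationship within each cluster is weak.

```python
# Illustrative sketch with made-up numbers: R^2 is inflated when the data
# split into two distant clusters, even though the within-cluster
# relationship is weak.
def r_squared(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

x1, y1 = [0, 1, 2, 3], [2, 0, 1, 3]          # cluster 1: weak relationship
x2, y2 = [20, 21, 22, 23], [22, 20, 21, 23]  # cluster 2: far from cluster 1
print(round(r_squared(x1, y1), 3))           # small within-cluster R^2
print(round(r_squared(x1 + x2, y1 + y2), 3)) # large R^2 for the combined data
```

Here the within-cluster R^2 is 0.16, while the combined R^2 is about 0.985.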
Definition 3.12. Outliers are cases that lie far from the bulk of the data.
Hence Y outliers are cases that have unusually large vertical distances from
the MLR fit to the bulk of the data while x outliers are cases with predictors
x that lie far from the bulk of the x_i. Suppose that some analysis to detect
outliers is performed. Masking occurs if the analysis suggests that one or
more outliers are in fact good cases. Swamping occurs if the analysis suggests
that one or more good cases are outliers.
The residual and response plots are very useful for detecting outliers. If
there is a cluster of cases with outlying Y's, the identity line will often pass
through the outliers. If there are two clusters with similar Y's, then the two
plots may fail to show the clusters. Then methods to detect x outliers
may be useful.
Let the q continuous predictors in the MLR model be collected into vectors
u_i for i = 1, . . . , n. Let the n × q matrix W have n rows u_1^T, . . . , u_n^T. Let the
q × 1 column vector T(W) be a multivariate location estimator, and let the
q × q symmetric positive definite matrix C(W) be a covariance estimator.
Often q = p − 1 and only the constant is omitted from x_i to create u_i.
The ith squared Mahalanobis distance is

D_i^2 = D_i^2(T(W), C(W)) = (u_i − T(W))^T C(W)^{-1} (u_i − T(W))

for each point u_i. Notice that D_i^2 is a random variable (scalar valued).
The classical estimators are the sample mean and sample covariance matrix

T(W) = ū = (1/n) Σ_{i=1}^n u_i  and  C(W) = S = [1/(n − 1)] Σ_{i=1}^n (u_i − T(W))(u_i − T(W))^T.
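A small sketch (our own data, q = 2) of the classical squared Mahalanobis distance D_i^2 = (u_i − ū)^T S^{-1} (u_i − ū):

```python
# Hedged sketch with made-up data: classical squared Mahalanobis distances
# for q = 2 predictors, using T(W) = ubar and C(W) = S (divisor n - 1).
def mahalanobis_sq(u):
    n, q = len(u), 2
    ubar = [sum(row[j] for row in u) / n for j in range(q)]
    d = [[row[j] - ubar[j] for j in range(q)] for row in u]
    # sample covariance matrix S
    s = [[sum(d[i][a] * d[i][b] for i in range(n)) / (n - 1)
          for b in range(q)] for a in range(q)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    # quadratic form (u_i - ubar)^T S^{-1} (u_i - ubar)
    return [d[i][0] ** 2 * inv[0][0] + 2 * d[i][0] * d[i][1] * inv[0][1]
            + d[i][1] ** 2 * inv[1][1] for i in range(n)]

u = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [10, 1]]   # last case is far out
d2 = mahalanobis_sq(u)
print(round(sum(d2), 6))
```

With the classical estimators, Σ D_i^2 = (n − 1)q (here 10), since Σ (u_i − ū)(u_i − ū)^T = (n − 1)S; the outlying case has the largest distance.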
Olive (2002) shows that the plotted points in the DD plot will follow the
identity line with zero intercept and unit slope if the predictor distribution
is multivariate normal (MVN), and will follow a line with zero intercept but
nonunit slope if the distribution is elliptically contoured with nonsingular
covariance matrix but not MVN. (Such distributions have linear scatterplot
matrices. See Chapter 10.) Hence if the plotted points in the DD plot follow
some line through the origin, then there is some evidence that outliers and
strong nonlinearities have been removed from the predictors.
[Figure 3.11: Response plot and residual plot (RES versus fitted values) for the Buxton data, with the outlying cases 61-65 labeled.]
Figure 3.11 shows the response plot and residual plot for the Buxton
data. Although an index plot of Cook's distance CD_i may be useful for
flagging influential cases, the index plot provides no direct way of judg-
ing the model against the data. As a remedy, cases in the response plot
with CD_i > min(0.5, 2p/n) were highlighted. Notice that the OLS fit passes
through the outliers, but the response plot is resistant to Y outliers since Y
is on the vertical axis. Also notice that although the outlying cluster is far
from Ȳ, only two of the outliers had large Cook's distance. Hence masking
occurred for both Cook's distances and for OLS residuals, but not for OLS
fitted values.
[Figure 3.12: DD plots for the Buxton predictor data; RD denotes the robust distance. Panel a) uses all cases, panel b) has the five outliers deleted.]
Figure 3.12a shows the DD plot made from the four predictors head length,
nasal height, bigonal breadth, and cephalic index. The five massive outliers cor-
respond to head lengths that were recorded to be around 5 feet. Figure 3.12b
is the DD plot computed after deleting these points and suggests that the
predictor distribution is now much closer to a multivariate normal distribu-
tion.
High leverage outliers are a particular challenge to conventional numer-
ical MLR diagnostics such as Cook's distance, but can often be visualized
using the response and residual plots. The following techniques are useful for
detecting outliers when the multiple linear regression model is appropriate.
1. Find the OLS residuals and fitted values and make a response plot and
a residual plot. Look for clusters of points that are separated from the
bulk of the data and look for residuals that have large absolute values.
Beginners frequently label too many points as outliers. Try to estimate
the standard deviation of the residuals in both plots. In the residual plot,
look for residuals that are more than 5 standard deviations away from
the r = 0 line. The identity line and r = 0 line may pass right through a
cluster of outliers, but the cluster of outliers can often be detected because
there is a large gap between the cluster and the bulk of the data, as in
Figure 3.11.
2. Make a DD plot of the predictors that take on many values (the continuous
predictors).
3. Make a scatterplot matrix of several diagnostics such as leverages, Cook's
distances, and studentized residuals.
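The numeric part of step 1 can be sketched as follows. One assumption of this sketch is ours, not the text's: since the classical standard deviation is inflated by the outliers themselves (masking), the scale of the bulk is estimated robustly by 1.483 × MAD, which is consistent for the standard deviation at the normal distribution.

```python
# Hedged sketch with made-up residuals: flag cases more than 5 standard
# deviations from the r = 0 line, estimating the bulk's SD by 1.483 * MAD
# so that the outliers do not inflate the scale estimate (masking).
def flag_outliers(residuals, k=5.0):
    srt = sorted(residuals)
    n = len(srt)
    med = srt[n // 2] if n % 2 else (srt[n // 2 - 1] + srt[n // 2]) / 2
    devs = sorted(abs(r - med) for r in residuals)
    mad = devs[n // 2] if n % 2 else (devs[n // 2 - 1] + devs[n // 2]) / 2
    scale = 1.483 * mad      # robust estimate of the SD of the bulk
    return [i for i, r in enumerate(residuals) if abs(r) > k * scale]

res = [0.2, -0.1, 0.3, -0.4, 0.1, -0.2, 0.15, -0.05, 9.0]
print(flag_outliers(res))    # only the last case is flagged
```

Had the classical SD been used here, the single large residual would have inflated it enough that nothing was flagged, illustrating masking.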
3.7 Summary
FF and RR plots cluster tightly about the identity line. In the RR plot, the
OLS line and identity line can be added to the plot as visual aids. It should
be difficult to see that the OLS and identity lines intersect at the origin, so
the two lines should nearly coincide at the origin. If the FF plot looks good
but the RR plot does not, the submodel may be good if the main goal of the
analysis is prediction.
10) Forward selection. Step 1) k = 1: Start with a constant w_1 = x_1.
Step 2) k = 2: Compute C_p for all models with k = 2 containing a constant
and a single predictor x_i. Keep the predictor w_2 = x_j, say, that minimizes C_p.
Step 3) k = 3: Fit all models with k = 3 that contain w_1 and w_2. Keep the
predictor w_3 that minimizes C_p. . . .
Step j) k = j: Fit all models with k = j that contain w_1, w_2, . . . , w_{j−1}. Keep
the predictor w_j that minimizes C_p. . . .
Step p): Fit the full model.
11) Let I_min correspond to the submodel with the smallest C_p. Find
the submodel I_I with the fewest number of predictors such that C_p(I_I) ≤
C_p(I_min) + 1. Then I_I is the initial submodel that should be examined. It
is possible that I_I = I_min or that I_I is the full model. Models I with fewer
predictors than I_I such that C_p(I) ≤ C_p(I_min) + 4 are interesting and should
also be examined. Models I with k predictors, including a constant and with
fewer predictors than I_I, such that C_p(I_min) + 4 < C_p(I) ≤ min(2k, p) should
be checked.
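Rule 11) is mechanical enough to sketch in code. The C_p table below copies the all subsets output from Problem 3.4 (k counts the predictors including the constant):

```python
# Sketch of rule 11): pick the submodel with the fewest predictors whose
# C_p is within 1 of the smallest C_p.  Table values follow Problem 3.4.
def initial_submodel(models):
    # models: list of (k, Cp, variables)
    cp_min = min(cp for _, cp, _ in models)
    candidates = [m for m in models if m[1] <= cp_min + 1]
    return min(candidates, key=lambda m: m[0])   # fewest predictors

models = [(1, 903.5, "INTERCEPT ONLY"), (2, 0.7, "B"), (2, 406.6, "A"),
          (2, 426.0, "C"), (3, 2.1, "A C"), (3, 2.6, "B C"),
          (3, 2.6, "A B"), (4, 4.0, "A B C")]
print(initial_submodel(models))
```

Here I_min is the model using B (C_p = 0.7), no other model has C_p ≤ 1.7, and so I_I = I_min.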
12) There are several guidelines for building an MLR model. Suppose
that variable Z is of interest and variables W1 , . . . , Wr have been collected
along with Z. Make a scatterplot matrix of W1 , . . . , Wr and Z. (If r is large,
several matrices may need to be made. Each one should include Z.) Remove
or correct any gross outliers. It is often a good idea to transform the Wi
to remove any strong nonlinearities from the predictors. Eventually
3.8 Complements
With one data set, OLS is a great place to start but a bad place to end. If
n = 5kp where k > 2, it may be useful to take a random sample of n/k cases
to build the MLR model. Then check the model on the full data set.
Predictor Transformations
Cook (1993) shows that partial residual plots are useful for visualizing
provided that the plots of xi versus xj are linear. More general ceres plots, in
particular ceres plots with smooth augmentation, can be used to visualize
if Y = uT + (x j )p + e but the linearity condition fails. Fitting the additive
model Y = 1 + j=2 Sj (xj ) + e or Y = 1 + 2 x2 + + j1 xj1 + S(xj ) +
j+1 xj+1 + + p xp + e and plotting S(xj ) can be useful. Similar ideas are
also useful for GLMs. See Chapter 13 and Olive (2013b) which also discusses
response plots for many regression models.
The assumption that all values of the variables are positive, needed for power
transformations, can be removed by using the modified power transformations of
Yeo and Johnson (2000).
Response Transformations
Application 3.1 was suggested by Olive (2004b, 2013b) for additive error
regression models Y = m(x) + e. An advantage of this graphical method is
that it works for linear models: that is, for multiple linear regression and for
many experimental design models. Notice that if the plotted points in the
transformation plot follow the identity line, then the plot is also a response
plot. The method is also easily performed for MLR methods other than least
squares.
A variant of the method would plot the residual plot or both the response
and the residual plot for each of the seven values of λ. Residual plots are also
useful, but they do not distinguish between nonlinear monotone relationships
and nonmonotone relationships. See Fox (1991, p. 55).
Cook and Olive (2001) also suggest a graphical method for selecting and
assessing response transformations under model (3.2). Cook and Weisberg
(1994) show that a plot of Z versus x^T β̂ (swap the axes on the transformation
plot for λ = 1) can be used to visualize t if Y = t(Z) = x^T β + e, suggesting
that t^{-1} can be visualized in a plot of x^T β̂ versus Z.
If there is nonlinearity present in the scatterplot matrix of the nontrivial
predictors, then transforming the predictors to remove the nonlinearity
will often be a useful procedure. More will be said about response
transformations for experimental designs in Section 5.4.
In R, all subsets variable selection can be performed with the regsubsets function from the leaps package; type ?regsubsets for the documentation.
Bootstrap
Olive (2016a,b,c) showed that the prediction region method for creating
a large sample 100(1 − δ)% confidence region for an r × 1 parameter vector
is a special case of the percentile method when r = 1, and gave sufficient
conditions for r > 1. The shorth method gives the shortest percentile method
intervals, asymptotically, and should be used when B ≥ 1000. Efron (2014)
reviews some alternative methods for variable selection inference.
Consider the residual bootstrap, and let r^W denote an n × 1 random vector
of elements selected with replacement from the n residuals r_1, . . . , r_n. Then
there are K = n^n possible values for r^W. Let r^W_1, . . . , r^W_K be the possible
values of r^W. These values are equally likely, so each is selected with probability
ρ = 1/K. Note that r^W has a discrete distribution. Then

E(r^W_j) = (E(r_{1j}), . . . , E(r_{nj}))^T.

Now the marginal distribution of r_{ij} takes on the n values r_1, . . . , r_n with
the same probability 1/n. So each of the n marginal distributions is the
empirical distribution of the residuals. Hence E(r_{ij}) = Σ_{i=1}^n r_i/n = r̄, and
r̄ = 0 for least squares residuals for multiple linear regression when there
is a constant in the model. So for least squares, E(r^W_j) = 0, and

E(β̂*_j) = (X^T X)^{-1} X^T E(Ŷ + r^W_j) = (X^T X)^{-1} X^T Ŷ
= (X^T X)^{-1} X^T HY = (X^T X)^{-1} X^T Y = β̂

since X^T H = X^T.
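A hedged simulation sketch (data and constants are our own) of the residual bootstrap for simple linear regression: bootstrap responses are Y* = Ŷ + r^W with r^W drawn with replacement from the residuals, and the bootstrap slopes then center near the OLS slope, consistent with E(β̂*_j) = β̂.

```python
# Hedged sketch of the residual bootstrap for simple linear regression.
import random

def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((a - xbar) * (c - ybar) for a, c in zip(x, y)) / \
        sum((a - xbar) ** 2 for a in x)
    return ybar - b * xbar, b          # intercept, slope

random.seed(1)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
a, b = ols(x, y)
fit = [a + b * xi for xi in x]
res = [yi - fi for yi, fi in zip(y, fit)]  # least squares residuals, mean 0
boot = []
for _ in range(2000):
    # bootstrap response: fitted values plus resampled residuals
    ystar = [fi + random.choice(res) for fi in fit]
    boot.append(ols(x, ystar)[1])
print(round(sum(boot) / len(boot), 2), round(b, 2))
```

The average of the 2000 bootstrap slopes is very close to the OLS slope, as the expectation argument above predicts.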
Diagnostics
Outliers
Theory for the FCH, RFCH, and RMVN estimators is given in Olive (2008: ch. 10, 2016c: ch. 4)
and Olive and Hawkins (2010). These three estimators are also used in Zhang et al. (2012).
Lasso and Other Variable Selection Techniques
Response plots, prediction intervals, and the bootstrap prediction region
method are also useful for other variable selection techniques such as lasso
and ridge regression. If n ≤ 400 and p ≤ 3000, Bertsimas et al. (2016) give a
fast all subsets variable selection method.
Recent theory for lasso assumes that the shrinkage penalty λ is selected before
looking at the data, rather than being estimated using k-fold cross validation. See Hastie
et al. (2015). The prediction region method appears to be useful when n >> p
if none of the β_i = 0, but (in 2016) it takes a long time to simulate lasso with
k-fold cross validation.
Lasso seems to work under ASSUMPTION L: assume the predictors are
uncorrelated or the number of active predictors (predictors with nonzero
coefficients) is not much larger than 20. When n is fixed and p increases, the
lasso prediction intervals increase in length slowly provided that assumption
L holds. Methods are being developed that should work under more reasonable
assumptions. See Pelawa Watagoda (2017) and Pelawa Watagoda and Olive
(2017).
3.9 Problems
3.1. From the above output from backward elimination, what terms
should be used in the MLR model to predict Y? (You can tell that the non-
trivial variables are finger to ground, nasal height, and sternal height from
the delete lines. DON'T FORGET THE CONSTANT!)
3.2. The table on the following page gives summary statistics for 4 MLR
models considered as final submodels after performing variable selection. The
response plot and residual plot for the full model L1 were good. Model L3 was
the minimum C_p model found. Which model should be used as the final
submodel? Explain briefly why each of the other 3 submodels should not be
used.
3.3. The above table gives summary statistics for 4 MLR models consid-
ered as final submodels after performing variable selection. The response plot
and residual plot for the full model L1 were good. Model L2 was the minimum
C_p model found.
a) Which model is I_I, the initial submodel to look at?
b) What other model or models, if any, should be examined?
3.4. The output below and on the following page is from software that
does all subsets variable selection. The data is from Ashworth (1842). The
predictors were A = log(1692 property value), B = log(1841 property value),
and C = log(percent increase in value), while the response variable is Y =
log(1841 population).
a) The top output corresponds to data with 2 small outliers. From this
output, what is the best model? Explain briefly.
b) The bottom output corresponds to the data with the 2 outliers removed.
From this output, what is the best model? Explain briefly.
 k    CP   ADJ R SQ   R SQ   RESID SS   VARIABLES
--  -----  --------  ------  --------   -------------
1 903.5 0.0000 0.0000 183.102 INTERCEPT ONLY
2 0.7 0.9052 0.9062 17.1785 B
2 406.6 0.4944 0.4996 91.6174 A
2 426.0 0.4748 0.4802 95.1708 C
3 2.1 0.9048 0.9068 17.0741 A C
3 2.6 0.9043 0.9063 17.1654 B C
3 2.6 0.9042 0.9062 17.1678 A B
4 4.0 0.9039 0.9069 17.0539 A B C
R Problems
zx <- cbrainx[,c(1,3,5,6,7,8,9,10)]        # keep the predictor columns of interest
zbrain <- as.data.frame(cbind(cbrainy,zx)) # response cbrainy plus predictors
zfull <- lm(cbrainy~.,data=zbrain)         # fit the full model
summary(zfull)
back <- step(zfull)                        # backward elimination using AIC
To quit Arc, move the cursor to the x in the northeast corner and click.
Problems 3.7-3.11 use data sets that come with Arc (Cook and Weisberg
1999a).
3.7. a) In Arc enter the menu commands File>Load>Data and open
the file big-mac.lsp. Next use the menu commands Graph&Fit> Plot of to
obtain a dialog window. Double click on TeachSal and then double click on
BigMac. Then click on OK. These commands make a plot of x = TeachSal =
primary teacher salary in thousands of dollars versus y = BigMac = minutes
of labor needed to buy a Big Mac and fries. Include the plot in Word.
3.8. In Arc enter the menu commands File>Load>Data and open the
file mussels.lsp. Use the commands Graph&Fit>Scatterplot Matrix of. In
the dialog window select H, L, W, S, and M (so select M last). Click on OK
and include the scatterplot matrix in Word. The response M is the edible part
of the mussel while the 4 predictors are shell measurements. Are any of the
marginal predictor relationships nonlinear? Is E(M|H) linear or nonlinear?
a) Fit the full model with Y = log(Vol), X_1 = log(D), and X_2 = log(Ht).
Add the output that has the LS coefficients to Word.
b) Fitting the full model will result in the menu L1. Use the commands
L1>AVP-All 2D. This will create a plot with a slider bar at the bottom
that says log[D]. This is the added variable plot for log(D). To make an added
variable plot for log(Ht), click on the slider bar. Add the OLS line to the AV
plot for log(Ht) by moving the OLS slider bar to 1, and add the zero line by
clicking on the Zero line box. Include the resulting plot in Word.
c) Fit the reduced model that drops log(Ht). Make an RR plot with the
residuals from the full model on the V axis and the residuals from the sub-
model on the H axis. Add the LS line and the identity line as visual aids.
(Click on the Options menu to the left of the plot and type y=x in the
resulting dialog window to add the identity line.) Include the plot in Word.
d) Similarly make an FF plot using the fitted values from the two models.
Add the OLS line which is the identity line. Include the plot in Word.
e) Next put the residuals from the submodel on the V axis and log(Ht)
on the H axis. Move the OLS slider bar to 1, and include this residual plot
in Word.
f) Next put the residuals from the submodel on the V axis and the fitted
values from the submodel on the H axis. Include this residual plot in Word.
g) Next put log(Vol) on the V axis and the fitted values from the submodel
on the H axis. Move the OLS slider bar to 1, and include this response plot
in Word.
From the Graph&Fit menu, select Fit linear LS. Use log[BigMac] as the response
and the other 9 log variables as the Terms. This model is the full model.
Include the output in Word.
d) Make a response plot (L1:Fit-Values in H and log(BigMac) in V) and
residual plot (L1:Fit-Values in H and L1:Residuals in V), and include both
plots in Word.
e) Using the L1 menu, select Examine submodels and try forward
selection and backward elimination. Using the C_p ≤ min(2k, p) rule suggests
that the submodel using log[service], log[TeachSal], and log[TeachTax] may be
good. From the Graph&Fit menu, select Fit linear LS, fit the submodel
and include the output in Word.
f) Make a response plot (L2:Fit-Values in H and log(BigMac) in V) and
residual plot (L2:Fit-Values in H and L2:Residuals in V) for the submodel,
and include the plots in Word.
g) Make an RR plot (L2:Residuals in H and L1:Residuals in V) and FF plot
(L2:Fit-Values in H and L1:Fit-Values in V) for the submodel, and include
the plots in Word. Move the OLS slider bar to 1 in each plot to add the
identity line. For the RR plot, click on the Options menu then type y = x in
the long horizontal box near the bottom of the window and click on OK to
add the identity line.
h) Do the plots and output suggest that the submodel is good? Explain.
Warning: The following problems use data from the book's web-
page (http://lagrange.math.siu.edu/Olive/lregbk.htm). Save the
data files on a flash drive G, say. Get in Arc and use the menu commands
File > Load and a window will appear. Click on Removable Disk (G:).
Then click twice on the data set name.
3.12. The following data set has 5 babies that are good leverage
points: they look like outliers but should not be deleted because they follow
the same model as the bulk of the data.
a) In Arc enter the menu commands File>Load>Removable Disk (G:)
and open the file cbrain.lsp. Select transform from the cbrain menu, and add
size^(1/3) using the power transformation option (p = 1/3). From
Graph&Fit, select Fit linear LS. Let the response be brnweight and as terms
include everything but size and Obs. Hence your model will include size^(1/3).
This regression will add L1 to the menu bar. From this menu, select Examine
submodels. Choose forward selection. You should get models including k = 2
to 12 terms including the constant. Find the model with the smallest C_p(I) =
C_I statistic and include all models with the same k as that model in Word.
That is, if k = 2 produced the smallest C_I, then put the block with k = 2
into Word. Next go to the L1 menu, choose Examine submodels and choose
Backward Elimination. Find the model with the smallest C_I and include all
of the models with the same value of k in Word.
g) For your submodel in f), make an RR plot with the residuals from the
full model on the V axis and the residuals from the submodel on the H axis.
Add the OLS line and the identity line y=x as visual aids. Include the RR
plot in Word.
h) Similarly make an FF plot using the fitted values from the two models.
Add the OLS line which is the identity line. Include the FF plot in Word.
i) Using the submodel, include the response plot (of Ŷ versus Y) and
residual plot (of Ŷ versus the residuals) in Word.
j) Using results from f)-i), explain why your submodel is a good model.
3.13. Activate the cyp.lsp data set. Choosing no more than 3 nonconstant
terms, try to predict height with multiple linear regression. Include a plot with
the tted values on the horizontal axis and height on the vertical axis. Is your
model linear? Also include a plot with the tted values on the horizontal axis
and the residuals on the vertical axis. Does the residual plot suggest that the
linear model may be inappropriate? (There may be outliers in the plot. These
could be due to typos or because the error distribution has heavier tails than
the normal distribution.) State which model you use.
3.16. This problem gives a slightly simpler model than Problem 3.15 by
using the indicator variable x3 = 1 if standard cement (if x2 = 2) and x3 =
0 otherwise (if x2 is 0 or 1). Activate the cement.lsp data.
a) From the cement menu, select Transform, select x1, and place a 2 in the
p box. This should add x1^2 to the data set. From the cement menu, select
Make interactions and select x1 and x3.
b) From Graph&Fit select Fit linear LS, select x1, x1^2, x3, and x1*x3 as
the terms and y as the response. Include the output in Word.
c) Make the response and residual plots. When making these plots, place
x2 in the Mark by box. Include the plots in Word. Does the model seem ok?
3.17. Get the McDonald and Schwing (1973) data pollution.lsp from
(http://lagrange.math.siu.edu/Olive/lregbk.htm), and save the file on
a flash drive. Activate the pollution.lsp dataset with the menu commands
File > Load > Removable Disk (G:) > pollution.lsp. Scroll up the screen
to read the data description. Often simply using the log rule on the predictors
with max(x)/min(x) > 10 works wonders.
a) Make a scatterplot matrix of the first nine predictor variables and the re-
sponse Mort. The commands Graph&Fit > Scatterplot-Matrix of will bring
down a Dialog menu. Select DENS, EDUC, HC, HOUS, HUMID, JANT,
JULT, NONW, NOX, and MORT. Then click on OK.
A scatterplot matrix with slider bars will appear. Move the slider bars for
NOX, NONW, and HC to 0, providing the log transformation. In Arc, the
diagonals have the min and max of each variable, and these were the three
predictor variables satisfying the log rule. Open Word.
In Arc, use the menu commands Edit > Copy. In Word, use the menu
command Paste. This should copy the scatterplot matrix into the Word
document. Print the graph.
b) Make a scatterplot matrix of the last six predictor variables and the
response Mort. The commands Graph&Fit > Scatterplot-Matrix of will
bring down a Dialog menu. Select OVR65, POOR, POPN, PREC, SO,
WWDRK, and MORT. Then click on OK. Move the slider bar of SO to 0
and copy the plot into Word. Print the plot as described in a).
c) Click on the pollution menu and select Transform. Click on the log
transformations button and select HC, NONW, NOX, and SO. Click on OK.
Then fit the full model with the menu commands Graph&Fit > Fit lin-
ear LS. Select MORT for the response. For the terms, select DENS, EDUC,
log[HC], HOUS, HUMID, JANT, JULT, log[NONW], log[NOX], OVR65,
POOR, POPN, PREC, log[SO], and WWDRK. Click on OK.
This model is the full model. To make the response plot use the menu
commands Graph&Fit >Plot of. Select MORT for the V-box and L1:Fit-
Values for the H-box. Click on OK. When the graph appears, move the OLS
slider bar to 1 to add the identity line. Copy the plot into Word.
To make the residual plot use the menu commands Graph&Fit >Plot
of. Select L1:Residuals for the V-box and L1:Fit-Values for the H-box. Click
on OK. Copy the plot into Word. Print the two plots.
to count a woman's husband if he was not at home. Do not use the predictor
X_2 in the full model. Do parts a), b), and c) above Problem 3.19.
3.24. For the file pop.lsp, described below, use Z = Y. Do parts a), b),
and c) above Problem 3.19.
This data set comes from Ashworth (1842). Try transforming all variables
to logs. Then the added variable plots show two outliers. Delete these two
cases. Notice the effect of these two outliers on the p-values for the coefficients
and on numerical methods for variable selection.
Note: then log(Y) and log(X_2) make a good submodel.
3.25. For the file pov.lsp, described below, use i) Z = flife and ii)
Z = gnp2 = gnp + 2. This data set comes from Rouncefield (1995). Making
loc into a factor may be a good idea. Use the commands poverty>Make factors
and select the variable loc. For ii), try transforming to logs and deleting the 6
cases with gnp2 = 0. (These cases had missing values for gnp. The file povc.lsp
has these cases deleted.) Try your final submodel on the data that includes
the 6 cases with gnp2 = 0. Do parts a), b), and c) above Problem 3.19.
b) Write down your final model (e.g., a very poor final model is
exp(BigMac) = β_1 + β_2 exp(EngSal) + β_3 (TeachSal)^3 + e).
c) Include the least squares output for your model and between 3 and 5
plots that justify that your multiple linear regression model is reasonable.
Below or beside each plot, give a brief explanation for how the plot gives
support for your model.
3.28. This is like Problem 3.27 with the BigMac data. Assume that a
multiple linear regression model holds for Y = t(Z) and for some terms
(usually powers or logs of the predictors). Using the techniques learned in
class, find such a model. Give output for the full model, output for the final
submodel, and use several plots to justify your choices. These data sets, as
well as the BigMac data set, come with Arc. See Cook and Weisberg (1999a).
(INSTRUCTOR: Allow 2 hours for each part.)
file "response" Z
a) allomet.lsp BRAIN
b) casuarin.lsp W
c) evaporat.lsp Evap
d) hald.lsp Y
e) haystack.lsp Vol
f) highway.lsp rate
(From the menu Highway, select "Add a variate" and
type sigsp1 = sigs + 1. Then you can transform
sigsp1.)
g) landrent.lsp Y
h) ozone.lsp ozone
i) paddle.lsp Weight
j) sniffer.lsp Y
k) water.lsp Y
i) Write down the full model that you use and include the full model
residual plot and response plot in Word. Give R^2 for the full model.
iii) Include the least squares output for your model and between 3 and
5 plots that justify that your multiple linear regression model is reasonable.
Below or beside each plot, give a brief explanation for how the plot gives
support for your model.
3.29. a) Activate buxton.lsp (you need to download the file onto your
flash drive Removable Disk (G:)). From the Graph&Fit menu, select Fit
linear LS. Use height as the response variable and bigonal breadth, cephalic
index, head length, and nasal height as the predictors. Include the output in
Word.
b) Make a response plot (L1:Fit-Values in H and height in V) and residual
plot (L1:Fit-Values in H and L1:Residuals in V) and include both plots in
Word.
c) In the residual plot use the mouse to move the cursor just above and
to the left of the outliers. Hold the leftmost mouse button down and move
the mouse to the right and then down. This will make a box on the residual
plot that contains the outliers. Go to the Case deletions menu and click
on Delete selection from data set. From the Graph&Fit menu, select Fit
linear LS and fit the same model as in a) (the model should already be
entered, just click on OK). Include the output in Word.
d) Make a response plot (L2:Fit-Values in H and height in V) and residual
plot (L2:Fit-Values in H and L2:Residuals in V) and include both plots in
Word.
e) Explain why the outliers make the MLR relationship seem much
stronger than it actually is. (Hint: look at R^2.)
Variable Selection in SAS
3.30. Copy and paste the SAS program for this problem into the SAS
editor. Then perform the menu commands Run>Submit to obtain about
15 pages of output. Do not print out the output.
The data is from SAS Institute (1985, pp. 695-704, 717-718). Aerobic
fitness is being measured by the ability to consume oxygen. The response
Y = Oxygen (uptake rate) is expensive to measure, and it is hoped that
the OLS fitted value Ŷ can be used instead. The variables are Age in years, Weight in
kg, RunTime = time in minutes to run 1.5 miles, RunPulse = heart rate
when Y is measured (while running), RestPulse = resting heart rate, and MaxPulse =
maximum heart rate recorded while running.
The concepts of a random vector, the expected value of a random vector, and
the covariance matrix of a random vector are needed before covering generalized least
squares. Recall that for random variables Y_i and Y_j, the covariance of Y_i and
Y_j is Cov(Y_i, Y_j) ≡ σ_{i,j} = E[(Y_i − E(Y_i))(Y_j − E(Y_j))] = E(Y_i Y_j) − E(Y_i)E(Y_j),
provided the second moments of Y_i and Y_j exist.
The covariance matrix of Y is Cov(Y) = E[(Y − E(Y))(Y − E(Y))^T],
where the ij entry of Cov(Y) is Cov(Y_i, Y_j) = σ_{i,j}, provided that each σ_{i,j}
exists. Otherwise Cov(Y) does not exist.
and
E(AY) = AE(Y) and E(AY B) = AE(Y)B. (4.2)
Also
Cov(a + AY) = Cov(AY) = ACov(Y)A^T. (4.3)
For the linear regression model, E(Y) = Xβ + E(e) = Xβ.
The hat matrix H = X(X^T X)^{-1} X^T satisfies H^T = H and HH = H.
Recall that the vector of residuals r_OLS = (I − H)Y = Y − Ŷ_OLS. Hence
E(r_OLS) = E(Y) − E(Ŷ_OLS) = E(Y) − E(Y) = 0. Using (4.3) and (4.4),
Cov(r_OLS) = (I − H)Cov(Y)(I − H)^T = σ^2 (I − H).
Definition 4.3. Suppose that the response variable and at least one of the
predictor variables is quantitative. Then the generalized least squares (GLS)
model is
Y = Xβ + e, (4.5)
where Y is an n × 1 vector of dependent variables, X is an n × p matrix
of predictors, β is a p × 1 vector of unknown coefficients, and e is an n × 1
vector of unknown errors. Also E(e) = 0 and Cov(e) = σ^2 V where V is a
known n × n positive definite matrix.
Definition 4.4. The GLS estimator is
β̂_GLS = (X^T V^{-1} X)^{-1} X^T V^{-1} Y. (4.6)
The weighted least squares (WLS) model
Y = Xβ + e, (4.7)
is the special case of the GLS model where V = diag(v_1, . . . , v_n) is diagonal.
The weights are w_i = 1/v_i, and the WLS estimator is
β̂_WLS = (X^T V^{-1} X)^{-1} X^T V^{-1} Y. (4.8)
The fitted values are Ŷ_FGLS = X β̂_FGLS. The feasible weighted least squares
(FWLS) estimator is the special case of the FGLS estimator where V =
V(θ) is diagonal. Hence the estimated weights are ŵ_i = 1/v̂_i = 1/v_i(θ̂). The
FWLS estimator and fitted values will be denoted by β̂_FWLS and Ŷ_FWLS,
respectively.
Notice that the ordinary least squares (OLS) model is a special case of
GLS with V = I_n, the n × n identity matrix. It can be shown that the GLS
estimator minimizes the GLS criterion Q(b) = (Y − Xb)^T V^{-1} (Y − Xb).
166 4 WLS and Generalized Least Squares
Notice that the FGLS and FWLS estimators have p + q + 1 unknown param-
eters. These estimators can perform very poorly if n < 10(p + q + 1).
The GLS and WLS estimators can be found from the OLS regression
(without an intercept) of a transformed model. Typically there will be a
constant in the model: the first column of X is a vector of ones. Following
Seber and Lee (2003, pp. 66–68), there is a nonsingular n × n matrix K
such that V = KK T . Let Z = K −1 Y , U = K −1 X, and ε = K −1 e. This
method uses the fast, but rather unstable, Cholesky decomposition.
Proposition 4.1. a)
Z = U β + ε (4.10)
follows the OLS model since E(ε) = 0 and Cov(ε) = σ² I n .
b) The GLS estimator β̂ GLS can be obtained from the OLS regression
(without an intercept) of Z on U .
c) For WLS, Yi = xTi β + ei . The corresponding OLS model Z = U β + ε
is equivalent to Zi = uTi β + εi for i = 1, . . . , n where uTi is the ith row of U .
Then Zi = √wi Yi and ui = √wi xi . Hence β̂ WLS can be obtained from the
OLS regression (without an intercept) of Zi = √wi Yi on ui = √wi xi .
Proof. a) E(ε) = K −1 E(e) = 0 and
Cov(ε) = K −1 Cov(e)(K −1 )T = σ² K −1 V (K −1 )T
= σ² K −1 KK T (K −1 )T = σ² I n .
Notice that OLS without an intercept needs to be used since U does not
contain a vector of ones. The first column of U is K −1 1 ≠ 1.
b) Let β̂ ZU denote the OLS estimator obtained by regressing Z on U .
Then
β̂ ZU = (U T U )−1 U T Z = (X T (K −1 )T K −1 X)−1 X T (K −1 )T K −1 Y
= (X T V −1 X)−1 X T V −1 Y = β̂ GLS
since (K −1 )T K −1 = (KK T )−1 = V −1 .
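Proposition 4.1c can be checked numerically. Below is a minimal sketch in Python (illustrative data and weights, not from the text): it fits a straight line by solving the weighted normal equations directly, and again by OLS without an intercept on Zi = √wi Yi with columns √wi and √wi xi. The two coefficient vectors agree.

```python
# Check Proposition 4.1c: the WLS estimator equals OLS (no intercept)
# applied to Z_i = sqrt(w_i) Y_i and u_i = sqrt(w_i) x_i.
# Illustrative data; any positive weights work.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3]
w = [1.0, 0.8, 0.5, 0.4, 0.25, 0.2]   # weights w_i = 1/v_i

def solve2(a, b, c, d, e, f):
    """Solve the 2x2 system [[a, b], [c, d]] beta = [e, f]."""
    det = a * d - b * c
    return ((e * d - f * b) / det, (a * f - c * e) / det)

# Weighted normal equations: X^T V^{-1} X beta = X^T V^{-1} Y
# with X = [1, x_i] and V^{-1} = diag(w_i).
sw   = sum(w)
swx  = sum(wi * xi for wi, xi in zip(w, x))
swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
swy  = sum(wi * yi for wi, yi in zip(w, y))
swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
beta_wls = solve2(sw, swx, swx, swxx, swy, swxy)

# OLS without intercept on the transformed data:
# columns u1 = sqrt(w_i)*1 and u2 = sqrt(w_i)*x_i, response Z_i = sqrt(w_i)*Y_i.
u1 = [math.sqrt(wi) for wi in w]
u2 = [math.sqrt(wi) * xi for wi, xi in zip(w, x)]
z  = [math.sqrt(wi) * yi for wi, yi in zip(w, y)]
a = sum(v * v for v in u1)
b = sum(p * q for p, q in zip(u1, u2))
d = sum(v * v for v in u2)
e = sum(p * q for p, q in zip(u1, z))
f = sum(p * q for p, q in zip(u2, z))
beta_zu = solve2(a, b, b, d, e, f)
```

Both fits solve the same normal equations, so beta_wls and beta_zu agree to rounding error.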
Following Johnson and Wichern (1988, p. 51) and Freedman (2005, p. 54),
there is a symmetric, nonsingular n × n square root matrix R = V 1/2 such
that V = RR. Let Z = R−1 Y , U = R−1 X, and ε = R−1 e. This method
uses the spectral theorem (singular value decomposition) and has better computational properties than the transformation based on the Cholesky decomposition.
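The square root matrix is easy to illustrate on a small case. For the hypothetical matrix V = [[2, 1], [1, 2]], the eigenvalues are 3 and 1 with eigenvectors (1, 1)/√2 and (1, −1)/√2, so R = V 1/2 can be written down explicitly. The Python sketch below checks that RR = V and that R−1 V R−1 = I, which is the computation behind Cov(ε) = σ² I n in Proposition 4.2.

```python
# Symmetric square root of V = [[2, 1], [1, 2]] via its spectral
# decomposition: eigenvalues 3 and 1, eigenvectors (1,1)/sqrt(2), (1,-1)/sqrt(2).
import math

s3 = math.sqrt(3.0)
V = [[2.0, 1.0], [1.0, 2.0]]
# R = V^{1/2}: same eigenvectors, eigenvalues sqrt(3) and 1.
R    = [[(s3 + 1) / 2, (s3 - 1) / 2],
        [(s3 - 1) / 2, (s3 + 1) / 2]]
# R^{-1} = V^{-1/2}: eigenvalues 1/sqrt(3) and 1.
Rinv = [[(1 / s3 + 1) / 2, (1 / s3 - 1) / 2],
        [(1 / s3 - 1) / 2, (1 / s3 + 1) / 2]]

def matmul(A, B):
    """Multiply two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

RR = matmul(R, R)                    # should equal V
I2 = matmul(Rinv, matmul(V, Rinv))   # should equal the identity
```

Since R is symmetric, (R−1)T = R−1, so Cov(R−1 e) = R−1 (σ²V )R−1 = σ² I, as the identity check confirms.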
Proposition 4.2. a)
Z = U β + ε (4.11)
follows the OLS model since E(ε) = 0 and Cov(ε) = σ² I n .
b) The GLS estimator β̂ GLS can be obtained from the OLS regression
(without an intercept) of Z on U .
c) For WLS, Yi = xTi β + ei . The corresponding OLS model Z = U β + ε
is equivalent to Zi = uTi β + εi for i = 1, . . . , n where uTi is the ith row of U .
Then Zi = √wi Yi and ui = √wi xi . Hence β̂ WLS can be obtained from the
OLS regression (without an intercept) of Zi = √wi Yi on ui = √wi xi .
Proof. a) E(ε) = R−1 E(e) = 0 and
Cov(ε) = R−1 Cov(e)(R−1 )T = σ² R−1 RR R−1 = σ² I n .
Notice that OLS without an intercept needs to be used since U does not
contain a vector of ones. The first column of U is R−1 1 ≠ 1.
b) Let β̂ ZU denote the OLS estimator obtained by regressing Z on U .
Then
β̂ ZU = (U T U )−1 U T Z = (X T R−1 R−1 X)−1 X T R−1 R−1 Y
= (X T V −1 X)−1 X T V −1 Y = β̂ GLS
since R−1 R−1 = V −1 .
Remark 4.1. Standard software produces WLS output and the ANOVA
F test and Wald t tests are performed using this output.
Remark 4.2. The FGLS estimator can also be found from the OLS
regression (without an intercept) of Ẑ on Û where V (θ̂) = R̂ R̂. Similarly
the FWLS estimator can be found from the OLS regression (without an intercept) of Ẑi = √ŵi Yi on ûi = √ŵi xi . But now Û is a random matrix
instead of a constant matrix. Hence these estimators are highly nonlinear.
OLS output can be used for exploratory purposes, but the p-values are generally not correct. The Olive (2016a,b) nonparametric bootstrap tests may
be useful for FGLS and FWLS. The nonparametric bootstrap could also be
applied to the OLS estimator.
Under regularity conditions, the OLS estimator β̂ OLS is a consistent estimator of β when the GLS model holds, but GLS should be used because it
generally has higher efficiency.
Definition 4.8. Let β̂ ZU be the OLS estimator from regressing Z on U .
The vector of fitted values is Ẑ = U β̂ ZU and the vector of residuals is
r ZU = Z − Ẑ. Then β̂ ZU = β̂ GLS for GLS, β̂ ZU = β̂ FGLS for FGLS,
β̂ ZU = β̂ WLS for WLS, and β̂ ZU = β̂ FWLS for FWLS. For GLS, FGLS,
WLS, and FWLS, a residual plot is a plot of Ẑi versus rZU,i and a response
plot is a plot of Ẑi versus Zi .
Notice that the residual and response plots are based on the OLS output
from the OLS regression without intercept of Z on U . If the model is good,
then the plotted points in the response plot should follow the identity line
in an evenly populated band while the plotted points in the residual plot
should follow the line rZU,i = 0 in an evenly populated band (at least if the
distribution of ε is not highly skewed).
Plots based on Ŷ GLS = X β̂ ZU and on ri,GLS = Yi − Ŷi,GLS should be
similar to those based on β̂ OLS . Although the plot of Ŷi,GLS versus Yi should
be linear, the plotted points will not scatter about the identity line in an
evenly populated band. Hence this plot cannot be used to check whether
the GLS model with V is a good approximation to the data. Moreover, the
ri,GLS and Ŷi,GLS may be correlated and usually do not scatter about the
r = 0 line in an evenly populated band. The plots in Definition 4.8 are both
a check on linearity and on whether the model using V (or V̂ ) gives a good
approximation of the data, provided that n > k(p + q + 1) where k ≥ 5 and
preferably k ≥ 10.
For GLS and WLS (and for exploratory purposes for FGLS and FWLS),
plots, model building, and variable selection should be based on Z and U .
Form Z and U and then use OLS software for model selection and variable
selection. If the columns of X are v 1 , . . . , v p , then the columns of U are
U1 , . . . , Up where Uj = R−1 v j corresponds to the jth predictor Xj . For
example, the analog of the OLS residual plot of the jth predictor versus the
residuals is the plot of the jth predictor Uj versus r ZU . The notation is
confusing but the idea is simple: form Z and U , then use OLS software and
the OLS techniques from Chapters 2 and 3 to build the model.
Fig. 4.1 Plots for Draper and Smith Data
Example 4.2. Draper and Smith (1981, pp. 112–114) present an FWLS
example with n = 35 and p = 2. Hence Y = β1 + β2 x + e. Let v̂i =
vi (θ̂) = 1.5329 − 0.7334xi + 0.0883x²i . Thus θ̂ = (1.5329, −0.7334, 0.0883)T .
Figure 4.1a and b shows the response and residual plots based on the OLS
regression of Y on x. The residual plot has the shape of a right opening
megaphone, suggesting that the variance is not constant. Figure 4.1c and d
shows the response and residual plots based on FWLS with weights ŵi = 1/v̂i .
See Problem 4.2 to reproduce these plots. Software meant for WLS needs the
weights. Hence FWLS can be computed using WLS software with the estimated weights, but the software may print WLS instead of FWLS, as in
Figure 4.1c and d.
Warning. A problem with the response and residual plots for GLS and
FGLS given in Definition 4.8 is that some of the transformed cases (Zi , uTi )T
can be outliers or high leverage points.
iii) Find the pval = P(FdfR −dfF ,dfF > FR ). (On exams often an F table is
used. Here dfR − dfF = p − q = number of parameters set to 0, and dfF = n − p.)
iv) State whether you reject Ho or fail to reject Ho. Reject Ho if pval ≤ δ
and conclude that the full model should be used. Otherwise, fail to reject Ho
and conclude that the reduced model is good.
Assume that the GLS model contains a constant β1 . The GLS ANOVA F
test of Ho : β2 = · · · = βp = 0 versus Ha: not Ho uses the reduced model that
contains the first column of U . The GLS ANOVA F test of Ho : βi = 0
versus Ha : βi ≠ 0 uses the reduced model with the ith column of U deleted.
For the special case of WLS, the software will often have a weights option
that will also give correct output for inference.
Example 4.3. Suppose that the data from Example 4.2 has valid weights,
so that WLS can be used instead of FWLS. The R commands below perform WLS.
> ls.print(lsfit(dsx,dsy,wt=dsw))
Residual Standard Error=1.137
R-Square=0.9209
F-statistic (df=1, 33)=384.4139, p-value=0
Estimate Std.Err t-value Pr(>|t|)
Intercept -0.8891 0.3004 -2.9602 0.0057
X 1.1648 0.0594 19.6065 0.0000
> ls.print(lsfit(u[,1],z,intercept=F))
Residual Standard Error=3.9838, R-Square=0.7689
F-statistic (df=1, 34)=113.1055, p-value=0
Estimate Std.Err t-value Pr(>|t|)
X 4.5024 0.4234 10.6351 0
> ((34*(3.9838)^2-33*(1.137)^2)/1)/(1.137)^2
[1] 384.4006
The WLS t-test for this data has t = 19.6065 which corresponds to F =
t2 = 384.4 since this test is equivalent to the WLS ANOVA F test when there
is only one predictor. The WLS t-test for the intercept has F = t2 = 8.76.
This test statistic can be found from the no intercept OLS model by leaving
the first column of U out of the model and then performing the partial F test
as shown below.
> ls.print(lsfit(u[,2],z,intercept=F))
Residual Standard Error=1.2601
F-statistic (df=1, 34)=1436.300
Estimate Std.Err t-value Pr(>|t|)
X 1.0038 0.0265 37.8985 0
> ((34*(1.2601)^2-33*(1.137)^2)/1)/(1.137)^2
[1] 8.760723
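The arithmetic in the two partial F computations above follows F = [(SSE(R) − SSE(F))/(dfR − dfF)]/MSE(F) with SSE = df × RSE². A sketch that redoes it with the printed values:

```python
# Partial F test: F = [(SSE(R) - SSE(F)) / (dfR - dfF)] / MSE(F),
# where SSE = df * RSE^2. Values taken from the WLS output above.
mse_full = 1.137 ** 2                 # full model: RSE 1.137 on 33 df

# Reduced model deleting the second column of U (tests the slope):
sse_red_slope = 34 * 3.9838 ** 2
F_slope = (sse_red_slope - 33 * mse_full) / 1 / mse_full

# Reduced model deleting the first column of U (tests the intercept):
sse_red_int = 34 * 1.2601 ** 2
F_int = (sse_red_int - 33 * mse_full) / 1 / mse_full
```

F_slope reproduces 384.4 = t² for the slope, and F_int reproduces 8.76 = t² for the intercept.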
4.4 Complements
The theory for GLS and WLS is similar to the theory for the OLS MLR
model, but the theory for FGLS and FWLS is often lacking or huge sample sizes are needed. However, FGLS and FWLS are often used in practice
because usually V is not known and V̂ must be used instead. Kariya and
Kurata (2004) is a PhD level text covering FGLS. Cook and Zhang (2015)
suggest an envelope method for WLS.
Shi and Chen (2009) describe numerical diagnostics for GLS. Long and
Ervin (2000) discuss methods for obtaining standard errors when the constant
variance assumption is violated.
Following Sheather (2009, ch. 9, ch. 10) many linear models with serially correlated errors (e.g. AR(1) errors) and many linear mixed models
can be fit with FGLS. Both Sheather (2009) and Houseman et al. (2004)
use the Cholesky decomposition and make the residual plots based on the
Cholesky residuals Z − Ẑ where V (θ̂) = K̂ K̂ T . We recommend plots based
on Z − Ẑ where V (θ̂) = R̂ R̂. In other words, use the transformation corresponding to Proposition 4.2 instead of the transformation corresponding to
Proposition 4.1.
4.5 Problems
R Problems
4.2. Download the wlsplot function and the Draper and Smith (1981)
data dsx, dsy, dsw.
a) Enter the R command wlsplot(x=dsx, y = dsy, w = dsw) to reproduce Figure 4.1. Once you have the plot you can print it out directly, but
it will generally save paper to place the plots in the Word editor.
b) Activate Word (often by double clicking on a Word icon). Click on the
screen and type Problem 4.2. In R, click on the plot and then press the
keys Ctrl and c simultaneously. This procedure makes a temporary copy of
the plot. In Word, move the pointer to Edit and hold down the leftmost mouse
button. This will cause a menu to appear. Drag the pointer down to Paste.
In the future, these menu commands will be denoted by Edit>Paste. The
plot should appear on the screen. To save your output on your flash drive
(J, say), use the Word menu commands File > Save as. In the Save in box
select Removable Disk (J:) and in the File name box enter HW4d2.doc. To
exit from Word, click on the X in the upper right corner of the screen. In
Word a screen will appear and ask whether you want to save changes made
in your document. Click on No. To exit from R, type q() or click on the
X in the upper right corner of the screen and then click on No.
4.3. Download the fwlssim function. This creates WLS data if type
is 1 or 3 and FWLS data if type is 2 or 4. Let the sufficient predictor
SP = 25 + 2x2 + · · · + 2xp . Then Y = SP + |SP − 25k|e where the xij are
iid N (0, 1) and the ei are iid N (0, σ²). Thus Y |SP ∼ N (SP, (SP − 25k)² σ²). If type is 1 or 2,
then k = 1/5, but k = 1 if type is 3 or 4. The default has σ² = 1.
The function creates the OLS response and residual plots and the FWLS
(or WLS) response and residual plots.
a) Type the following command several times. The OLS and WLS plots
tend to look the same.
fwlssim(type=1)
b) Type the following command several times. Now the FWLS plots often
have outliers.
fwlssim(type=2)
c) Type the following command several times. The OLS residual plots have
a saddle shape, but the WLS plots tend to have highly skewed fitted values.
fwlssim(type=3)
d) Type the following command several times. The OLS residual plots
have a saddle shape, but the FWLS plots tend to have outliers and highly
skewed fitted values.
fwlssim(type=4)
Chapter 5
One Way Anova
5.1 Introduction
n1 units treatment 1, the next n2 units treatment 2, . . . , and the final np units
treatment p.
Balanced designs have the group sizes equal: ni ≡ m = n/p. Label the
units alphabetically so Carroll gets 1, . . . , Xumong gets 9. The R function
sample can be used to draw a random permutation. Then the first 3 numbers
in the permutation correspond to group 1, the next 3 to group 2, and the final
3 to group 3. Using the output shown below gives the following 3 groups.
> sample(9)
[1] 6 7 9 5 1 4 2 8 3
> rand(9,3)
$perm
[1] 6 7 9 5 1 4 2 8 3
$groups
[1] 2 3 3 2 2 1 1 3 1
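The randomization step can be sketched outside of R as well; a minimal Python version (the seed is arbitrary, chosen only for reproducibility):

```python
# Randomly assign 9 units (labeled 1-9) to 3 treatment groups of size 3:
# draw a random permutation and split it into consecutive blocks.
import random

random.seed(1)                          # for a reproducible assignment
perm = random.sample(range(1, 10), 9)   # random permutation of 1..9
groups = [perm[0:3], perm[3:6], perm[6:9]]
```

The first block gets treatment 1, the second treatment 2, and the third treatment 3, exactly as with the R sample output above.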
Definition 5.5. Replication means that for each treatment, the ni response variables Yi,1 , . . . , Yi,ni are approximately iid random variables.
Example 5.2. a) If ten students work two types of paper mazes three
times each, then there are 60 measurements that are not replicates. Each
student should work the six mazes in random order since speed increases
with practice. For the ith student, let Zi1 be the average time to complete
the three mazes of type 1, let Zi2 be the average time for mazes of type 2,
and let Di = Zi1 − Zi2 . Then D1 , . . . , D10 are replicates.
b) Cobb (1998, p. 126) states that a student wanted to know if the shapes
of sponge cells depend on the color (green or white). He measured hundreds
of cells from one white sponge and hundreds of cells from one green sponge.
There were only two units, so n1 = 1 and n2 = 1. The student should have
used a sample of n1 green sponges and a sample of n2 white sponges to get
more replicates.
c) Replication depends on the goals of the study. Box et al. (2005, pp. 215–219) describe an experiment where the investigator times how long it takes
him to bike up a hill. Since the investigator is only interested in his performance, each run up a hill is a replicate (the time for the ith run is a sample
from all possible runs up the hill by the investigator). If the interest had been
on the effect of eight treatment levels on student bicyclists, then replication
would need n = n1 + · · · + n8 student volunteers where ni ride their bike up
the hill under the conditions of treatment i.
The one way Anova model is used to compare p treatments. Usually there
is replication and Ho: μ1 = μ2 = · · · = μp is a hypothesis of interest. Investigators may also want to rank the population means from smallest to
largest.
Definition 5.6. Let fZ (z) be the pdf of Z. Then the family of pdfs
fY (y) = fZ (y − μ) indexed by the location parameter μ, −∞ < μ < ∞,
is the location family for the random variable Y = μ + Z with standard
pdf fZ (z).
Definition 5.7. A one way fixed effects Anova model has a single qualitative predictor variable W with p categories a1 , . . . , ap . There are p different
distributions for Y , one for each category ai . The distribution of
Y |(W = ai ) ∼ fZ (y − μi )
where the location family has second moments. Hence all p distributions come
from the same location family with different location parameter μi and the
same variance σ².
Definition 5.8. The one way fixed effects normal Anova model is the special case where
Y |(W = ai ) ∼ N (μi , σ²).
Example 5.3. The pooled 2 sample t-test is a special case of a one way
Anova model with p = 2. For example, one population could be ACT scores
for men and the second population ACT scores for women. Then W = gender
and Y = score.
Definition 5.9. The cell means model is the parameterization of the one
way fixed effects Anova model such that
Yij = μi + eij
where Yij is the value of the response variable for the jth trial of the ith
factor level. The μi are the unknown means and E(Yij ) = μi . The eij are
iid from the location family with pdf fZ (z) and unknown variance σ² =
VAR(Yij ) = VAR(eij ). For the normal cell means model, the eij are iid
N (0, σ²) for i = 1, . . . , p and j = 1, . . . , ni .
The cell means model is a linear model (without intercept) of the form
Y = X c βc + e :
(Y11 , . . . , Y1,n1 , Y21 , . . . , Y2,n2 , . . . , Yp,1 , . . . , Yp,np )T = X c (μ1 , μ2 , . . . , μp )T + (e11 , . . . , e1,n1 , e21 , . . . , ep,np )T , (5.1)
where the ith column of the n × p matrix X c is an indicator equal to 1 in the
ni rows corresponding to the trials at level i and 0 elsewhere.
Notation. Let Yi0 = ∑_{j=1}^{ni} Yij and let
μ̂i = Ȳ i0 = Yi0 /ni = (1/ni ) ∑_{j=1}^{ni} Yij . (5.2)
E(Y ) = X c βc = (μ1 , . . . , μ1 , μ2 , . . . , μ2 , . . . , μp , . . . , μp )T ,
and
β̂ c = (X Tc X c )−1 X Tc Y = (Ȳ 10 , . . . , Ȳ p0 )T = (μ̂1 , . . . , μ̂p )T .
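Since X Tc X c = diag(n1 , . . . , np ) is diagonal, the normal equations decouple and β̂ c is simply the vector of group sample means. A Python sketch with made-up data for p = 3 levels:

```python
# Cell means model: since Xc^T Xc = diag(n_1, ..., n_p), solving the
# normal equations reduces to mu_hat_i = (sum of group i responses) / n_i,
# i.e. the group sample mean. Illustrative data with p = 3 groups.
groups = [
    [64.0, 72.0, 68.0],      # level 1
    [78.0, 91.0, 97.0],      # level 2
    [75.0, 93.0, 78.0],      # level 3
]

# (Xc^T Y)_i = sum_j Y_ij, and dividing by n_i solves the ith equation.
mu_hat = [sum(g) / len(g) for g in groups]

# Fitted values repeat the group mean; residuals are deviations from it.
fitted = [m for g, m in zip(groups, mu_hat) for _ in g]
resid  = [y - m for g, m in zip(groups, mu_hat) for y in g]
```

The residuals sum to zero within each group, as they must for least squares with indicator columns.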
Since the cell means model is a linear model, there is an associated response
plot and residual plot. However, many of the interpretations of the OLS
quantities for Anova models differ from the interpretations for MLR models.
First, for MLR models, the conditional distribution Y |x makes sense even if
x is not one of the observed xi provided that x is not far from the xi . This
fact makes MLR very powerful. For MLR, at least one of the variables in x
is a continuous predictor. For the one way fixed effects Anova model, the p
distributions Y |xi make sense where xTi is a row of X c .
Also, the OLS MLR ANOVA F test for the cell means model tests H0 :
βc = 0, that is, H0 : μ1 = · · · = μp = 0, while the one way fixed effects ANOVA F
test given after Definition 5.13 tests H0 : μ1 = · · · = μp .
Definition 5.10. Consider the one way fixed effects Anova model. The
response plot is a plot of Ŷij ≡ μ̂i versus Yij and the residual plot is a plot of
Ŷij ≡ μ̂i versus rij .
The points in the response plot scatter about the identity line and the
points in the residual plot scatter about the r = 0 line, but the scatter need
not be in an evenly populated band. A dot plot of Z1 , . . . , Zm consists of an
axis and m points each corresponding to the value of Zi . The response plot
consists of p dot plots, one for each value of μ̂i . The dot plot corresponding
to μ̂i is the dot plot of Yi1 , . . . , Yi,ni . The p dot plots should have roughly the
same shape and spread.
Rule of thumb 5.2. Often an outlier is very good, but more often an
outlier is due to a measurement error and is very bad.
The assumption of the Yij coming from the same location family with
different location parameters μi and the same constant variance σ² is a big
assumption and often does not hold. Another way to check this assumption is
to make a box plot of the Yij for each i. The box in the box plot corresponds
to the lower, middle, and upper quartiles of the Yij . The middle quartile
is just the sample median mi of the ith sample: at least half of the Yij ≤ mi
and at least half of the Yij ≥ mi . The p boxes should be roughly the same
length and the median should occur in roughly the same position (e.g., in
the center) of each box. The whiskers in each plot should also be roughly
similar. Histograms for each of the p samples could also be made. All of the
histograms should look similar in shape.
Example 5.4. Kuehl (1994, p. 128) gives data for counts of hermit crabs
on 25 different transects in each of six different coastline habitats. Let Z be
the count. Then the response variable Y = log10 (Z + 1/6). Although the
counts Z varied greatly, each habitat had several counts of 0 and often there
were several counts of 1, 2, or 3. Hence Y is not a continuous variable. The cell
means model was fit with ni = 25 for i = 1, . . . , 6. Each of the six habitats
was a level. Figure 5.1a and b shows the response plot and residual plot.
There are 6 dot plots in each plot. Because several of the smallest values in
each plot are identical, it does not always look like the identity line is passing
through the six sample means Ȳ i0 for i = 1, . . . , 6. In particular, examine the
dot plot for the smallest mean (look at the 25 dots furthest to the left that
fall on the vertical line FIT ≈ 0.36). Random noise (jitter) has been added to
the response and residuals in Figure 5.1c and d. Now it is easier to compare
the six dot plots. They seem to have roughly the same spread.
Fig. 5.1 Response plot (Y versus FIT), residual plot (RESID versus FIT), and jittered versions (JY and JR versus FIT) for the crab habitat data
The plots contain a great deal of information. The response plot can be
used to explain the model, check that the sample from each population (treat-
ment) has roughly the same shape and spread, and to see which populations
have similar means. Since the response plot closely resembles the residual plot
in Figure 5.1, there may not be much difference in the six populations. Linearity seems reasonable since the samples scatter about the identity line. The
residual plot makes the comparison of similar shape and spread easier.
SST O = ∑_{i=1}^{p} ∑_{j=1}^{ni} (Yij − Ȳ 00 )² .
SST R = ∑_{i=1}^{p} ni (Ȳ i0 − Ȳ 00 )² .
SSE = ∑_{i=1}^{p} ∑_{j=1}^{ni} (Yij − Ȳ i0 )² .
σ̂² = M SE = (1/(n − p)) ∑_{i=1}^{p} ∑_{j=1}^{ni} r²ij = (1/(n − p)) ∑_{i=1}^{p} ∑_{j=1}^{ni} (Yij − Ȳ i0 )² = (1/(n − p)) ∑_{i=1}^{p} (ni − 1)Si² = S²pool ,
where S²pool is known as the pooled variance estimator.
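The sums of squares satisfy SSTO = SSTR + SSE, and the test statistic is Fo = MSTR/MSE. A Python sketch of the decomposition on made-up data:

```python
# One way Anova sums of squares on illustrative data with p = 2 groups:
# SSTO = sum (Y_ij - grand mean)^2,
# SSTR = sum n_i (group mean - grand mean)^2,
# SSE  = sum (Y_ij - group mean)^2, and SSTO = SSTR + SSE.
groups = [[3.0, 5.0, 4.0], [8.0, 6.0, 7.0, 9.0]]
n = sum(len(g) for g in groups)
p = len(groups)

grand = sum(y for g in groups for y in g) / n
means = [sum(g) / len(g) for g in groups]

ssto = sum((y - grand) ** 2 for g in groups for y in g)
sstr = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
sse  = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)

mstr = sstr / (p - 1)
mse  = sse / (n - p)          # the pooled variance estimator S^2_pool
Fo   = mstr / mse
```

For this data grand = 6, the group means are 4 and 7.5, SSTR = 21, SSE = 7, SSTO = 28, and Fo = 21/1.4 = 15.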
The ANOVA F test tests whether the p means are equal. If Ho is not
rejected and the means are equal, then it is possible that the factor is unimportant, but it is also possible that the factor is important but the
level is not. For example, the factor might be type of catalyst. The yield
may be equally good for each type of catalyst, but there would be no yield if
no catalyst was used.
The ANOVA table is the same as that for MLR, except that SSTR replaces the regression sum of squares. The MSE is again an estimator of σ².
The ANOVA F test tests whether all p means μi are equal. Shown below is
an ANOVA table given in symbols. Sometimes "Treatment" is replaced by
"Between treatments," "Between Groups," "Model," "Factor," or "Groups."
Sometimes "Error" is replaced by "Residual," or "Within Groups." Sometimes "p-value" is replaced by "P," "Pr(> F )," or "PR > F." The p-value
is nearly always an estimated p-value, denoted by pval.
Source df SS MS F p-value
Treatment p − 1 SSTR MSTR Fo = MSTR/MSE pval = P (Fp−1,n−p > Fo )
Error n − p SSE MSE
The F test is for Ho : μ1 = · · · = μp .
iv) State whether you reject Ho or fail to reject Ho. If the pval ≤ δ, reject Ho
and conclude that the mean response depends on the factor level. (Hence not
all of the treatment means are equal.) Otherwise fail to reject Ho and conclude
that the mean response does not depend on the factor level. (Hence all of the
treatment means are equal, or there is not enough evidence to conclude that
the mean response depends on the factor level.) Give a nontechnical sentence.
Rule of thumb 5.3. If
max(S1 , . . . , Sp ) ≤ 2 min(S1 , . . . , Sp ),
then the one way ANOVA F test results will be approximately correct if the
response and residual plots suggest that the remaining one way Anova model
assumptions are reasonable. See Moore (2007, p. 634). If all of the ni ≥ 5,
replace the standard deviations by the ranges of the dot plots when examining the response and residual plots. The range Ri = max(Yi,1 , . . . , Yi,ni ) −
min(Yi,1 , . . . , Yi,ni ) = length of the ith dot plot for i = 1, . . . , p.
The assumption that the zero mean iid errors have constant variance
V (eij ) ≡ σ² is much stronger for the one way Anova model than for the multiple linear regression model. The assumption implies that the p population
distributions have pdfs from the same location family with different means
μ1 , . . . , μp but the same variances σ1² = · · · = σp² ≡ σ². The one way ANOVA
F test has some resistance to the constant variance assumption, but confidence intervals have much less resistance to the constant variance assumption.
Consider confidence intervals for μi such as Ȳ i0 ± tni −1,1−δ/2 √(M SE/ni ).
MSE is a weighted average of the Si² . Hence MSE overestimates small σi²
and underestimates large σi² when the σi² are not equal. Hence using √M SE
instead of Si will make the CI too long or too short, and Rule of thumb 5.3
does not apply to confidence intervals based on MSE.
Remark 5.2. When the assumption that the p groups come from the
same location family with finite variance σ² is violated, the one way ANOVA
F test may not make much sense because unequal means may not imply the
superiority of one category over another. Suppose Y is the time in minutes
until relief from a headache and that Y1j ∼ N (60, 1) while Y2j ∼ N (65, σ²).
If σ² = 1, then the type 1 medicine gives headache relief 5 minutes faster, on
average, and is superior, all other things being equal. But if σ² = 100, then
many patients taking medicine 2 experience much faster pain relief than those
taking medicine 1, and many experience a much longer time until pain relief.
In this situation, predictor variables that would identify which medicine is
faster for a given patient would be very useful.
Example 5.5. The output below represents grams of fat (minus 100
grams) absorbed by doughnuts using 4 types of fat. See Snedecor and Cochran
(1967, p. 259). Let μi denote the mean amount of fat i absorbed by doughnuts,
i = 1, 2, 3 and 4. a) Find μ̂1 . b) Perform a 4 step ANOVA F test.
Solution: a) μ̂1 = Ȳ 10 = Y10 /n1 = ∑_{j=1}^{n1} Y1j /n1 =
(64 + 72 + 68 + 77 + 56 + 95)/6 = 432/6 = 72.
b) i) H0 : μ1 = μ2 = μ3 = μ4 Ha : not H0
ii) F = 5.41
iii) pval = 0.0069
iv) Reject H0 , the mean amount of fat absorbed by doughnuts depends on
the type of fat.
Notice that the strain of clover 3dok1 appears to have the highest mean
nitrogen content. There are 4 pairs of means that are not significantly different. The letter B suggests 3dok5 and 3dok7, the letter C suggests 3dok7 and
compos, the letter D suggests compos and 3dok4, while the letter E suggests
3dok4 and 3dok13 are not significantly different.
B 23.980 5 3dok5
B
C B 19.920 5 3dok7
C
C D 18.700 5 compos
D
E D 14.640 5 3dok4
E
E 13.260 5 3dok13
Remark 5.3. Two graphical methods can also be used. Recall from Chapter 1 that a response plot is an estimated sufficient summary plot. If n is not
too small, each ni ≥ 5, and the sample mean (where the dot plot crosses
the identity line) for one dot plot is below or above another dot plot, then
conclude that the population mean corresponding to the higher dot plot is
greater than the population mean corresponding to the lower dot plot. As the
ni increase, the sample mean of one dot plot only needs to be above or below most of the cases in the other dot plot. The p population means may or
may not be equal if all p of the dot plots have lots of overlap. This will happen, for example, if the response plot looks like the residual plot. Hence this
graphical method is inconclusive for Figure 5.1a. Remark 5.2 gives another
situation where this graphical method can fail. An advantage of this graphical method is that the p populations do not need to come from populations
with the same variance or from the same location scale family as long as OLS
gives a consistent estimator of β. The second graphical method is given in
Definition 5.15.
Example 5.6, continued: Figure 5.2 shows the response and residual
plots for the clover data. The plots suggest the constant variance assumption
is not reasonable. The population means may or may not differ for the groups
with the two smallest sample means, but these two groups appear to have
smaller population means than the other groups. Similarly, the population
means may or may not differ for the two groups with sample means near
Fig. 5.2 Response and Residual Plots for Clover Data
20, but these two groups appear to have population means that are smaller
than the two groups with the largest sample means. The population means
of these last two groups may or may not dier. Figure 5.2 was made with the
following commands, using the lregpack function aovplots.
x<-c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,
5,6,6,6,6,6)
y<-c(19.4,32.6,27.0,32.1,33.0,17.7,24.8,27.9,25.2,
24.3,17.0,19.4,9.1,11.9,15.8,20.7,21.0,20.5,18.8,
18.6,14.3,14.4,11.8,11.6,14.2,17.3,19.4,19.1,16.9,
20.8)
x <- factor(x)
z <- aov(y~x)
aovplots(Y=y,FIT=fitted(z),RES=resid(z))
#right click stop twice
Definition 5.15. Graphical Anova for the one way model uses the
residuals as a reference set instead of a t, F , or normal distribution. The
scaled treatment deviations or scaled effects c(Ȳ i0 − Ȳ 00 ) = c(μ̂i − Ȳ 00 )
are scaled to have the same variability as the residuals. A dot plot of the
scaled deviations is placed above the dot plot of the residuals.
For n ≥ 100, let r(1) ≤ r(2) ≤ · · · ≤ r(n) be the order statistics of the
residuals. Then instead of the range, use r(⌈0.975n⌉) − r(⌈0.025n⌉) as the distance, where ⌈x⌉ is the smallest integer ≥ x, e.g. ⌈7.7⌉ = 8. So effects outside
of the interval (r(⌈0.025n⌉) , r(⌈0.975n⌉) ) are significant. See Box et al. (2005,
pp. 136, 166). A derivation of the scaling constant c = √((n − p)/(p − 1)) is
given in Section 5.6.
ganova(x,y)
smn 0.0296 0.0661 -0.0508 -0.0449
Treatments "A" "B" "C" "D"
Example 5.7. Cobb (1998) describes a one way Anova design used to
study the amount of calcium in the blood. For many animals, the body's
ability to use calcium depends on the level of certain hormones in the blood.
The response was 1/(level of plasma calcium). The four groups were A: Female controls, B: Male controls, C: Females given hormone, and D: Males
given hormone. There were 10 birds of each gender, and five from each gender were given the hormone. The output above uses the lregpack function
ganova to produce Figure 5.3.
In Figure 5.3, the top dot plot has the scaled treatment deviations. From
left to right, these correspond to C, D, A, and B since the output shows that
the deviation corresponding to C is the smallest with value −0.0508. Since the
deviations corresponding to C and D are much closer than the range of the
residuals, the C and D effects yielded similar mean response values. A and
B appear to be significantly different from C and D. The distance between
the scaled A and B treatment deviations is about the same as the distance
between the smallest and largest residuals, so there is only marginal evidence
that the A and B effects are significantly different.
Since all 4 scaled deviations lie outside of the range of the residuals, all
effects A, B, C, and D appear to be significant.
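The scaled deviations of Definition 5.15 can be sketched with the scaling constant c = √((n − p)/(p − 1)); the data below are made up, not the Example 5.7 calcium data:

```python
# Graphical Anova: scale the treatment deviations by c = sqrt((n-p)/(p-1))
# so they have roughly the same variability as the residuals.
# Illustrative data with p = 2 groups (not the Example 5.7 data).
import math

groups = [[10.0, 12.0, 11.0, 13.0], [15.0, 17.0, 16.0, 18.0]]
n = sum(len(g) for g in groups)
p = len(groups)

grand = sum(y for g in groups for y in g) / n
means = [sum(g) / len(g) for g in groups]
c = math.sqrt((n - p) / (p - 1))

scaled_dev = [c * (m - grand) for m in means]          # top dot plot
residuals  = [y - m for g, m in zip(groups, means) for y in g]

# An effect looks significant if it falls outside the range of the residuals.
sig = [d < min(residuals) or d > max(residuals) for d in scaled_dev]
```

Here both scaled deviations lie well outside the residual range, so both effects would look significant in the graphical Anova display.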
Definition 5.16. For the random effects one way Anova, the levels of
the factor are a random sample of levels from some population of levels F .
The cell means model for the random effects one way Anova is Yij = μi + eij
for i = 1, . . . , p and j = 1, . . . , ni , where the μi are randomly selected from some
population with mean μ and variance σμ² . The eij and μi are independent,
and the eij are iid from a location family with pdf f, mean 0, and variance
σ². The Yij |μi ∼ f(y − μi ), the location family with location parameter μi
and variance σ². Unconditionally, E(Yij ) = μ and V (Yij ) = σμ² + σ².
For the random effects model, the μi are independent random variables
with E(μi ) = μ and V (μi ) = σμ² . The cell means model for the fixed effects one
way Anova is very similar to that for the random effects model, but the μi
are fixed constants rather than random variables.
Definition 5.17. For the normal random effects one way Anova model,
μi ∼ N (μ, σμ² ). Thus the μi are independent N (μ, σμ² ) random variables. The
eij are iid N (0, σ²) and the eij and μi are independent. For this model,
Yij |μi ∼ N (μi , σ²) for i = 1, . . . , p. Note that the conditional variance σ² is
the same for each μi . Unconditionally, Yij ∼ N (μ, σ² + σμ² ).
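The unconditional variance decomposition V (Yij ) = σμ² + σ² can be checked by simulation. A Python sketch with hypothetical values σμ² = 4 and σ² = 1:

```python
# Simulate the normal random effects one way Anova model:
# mu_i ~ N(mu, sig_mu^2), e_ij ~ N(0, sig^2), Y_ij = mu_i + e_ij.
# Unconditionally Var(Y_ij) = sig_mu^2 + sig^2.
import random
import statistics

random.seed(7)
mu, sig_mu, sig = 50.0, 2.0, 1.0     # sig_mu^2 = 4, sig^2 = 1
p, m = 800, 10                       # p random levels, m replicates each

ys = []
for _ in range(p):
    mu_i = random.gauss(mu, sig_mu)  # draw a random level
    ys.extend(random.gauss(mu_i, sig) for _ in range(m))

var_hat = statistics.variance(ys)    # should be near 4 + 1 = 5
```

With many random levels the sample variance settles near σμ² + σ² = 5, while the variance within any one level stays near σ² = 1.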
The ANOVA tables for the fixed and random effects one way Anova models
are exactly the same, and the two F tests are very similar. The main difference
is that the conclusions for the random effects model can be generalized to
the entire population of levels. For the fixed effects model, the conclusions
only hold for the p fixed levels. If Ho : σμ² = 0 is true and the random effects
model holds, then the Yij are iid with pdf f(y − μ). So the F statistic for
the random effects test has an approximate Fp−1,n−p distribution if the ni
are large by the results for the fixed effects one way ANOVA test. For both
tests, the pval is an estimate of the population p-value.
Source df SS MS F P
brand 5 854.53 170.906 238.71 0.0000
error 42 30.07 0.716
Example 5.8. Data is from Kutner et al. (2005, problem 25.7). A re-
searcher is interested in the amount of sodium in beer. She selects 6 brands
of beer at random from 127 brands and the response is the average sodium
content measured from 8 cans of each brand.
a) State whether this is a random or fixed effects one way Anova. Explain
briefly.
b) Using the output above, perform the appropriate 4 step ANOVA F
test.
Solution: a) Random effects, since the beer brands were selected at random
from a population of brands.
b) i) H0: σμ² = 0 Ha: σμ² > 0
ii) F0 = 238.71
iii) pval = 0.0
iv) Reject H0, so σμ² > 0 and the mean amount of sodium depends on the
beer brand.
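As a check on steps ii) and iii), the F statistic and p-value can be reproduced from the MS column of the output above. The book's software is R; the following Python sketch is only an illustration and assumes the MSTR, MSE, and degrees of freedom shown in the output.

```python
# Reproduce F0 = MSTR/MSE and its p-value from the sodium-in-beer output.
# The numbers below are taken from the output table; scipy supplies the
# F distribution tail probability.
from scipy import stats

mstr, mse = 170.906, 0.716    # MS for brand (treatment) and error
df1, df2 = 5, 42              # p - 1 = 6 - 1 brands, n - p = 48 - 6
F0 = mstr / mse               # about 238.7, matching the output
pval = stats.f.sf(F0, df1, df2)   # P(F_{5,42} > F0), essentially 0
```

Since pval is far below any reasonable level, H0: σμ² = 0 is rejected, agreeing with step iv).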
Remark 5.4. The response and residual plots for the random effects models
are interpreted in the same way as for the fixed effects model, except that
the dot plots are from a random sample of p levels instead of from p fixed
levels.
5.4 Response Transformations for Experimental Design
Definition 5.18. Assume that all of the values of the response Zi are
positive. A power transformation has the form Y = tλ(Z) = Z^λ for λ ≠ 0
and Y = t0(Z) = log(Z) for λ = 0 where λ ∈ ΛL = {−1, −1/2, 0, 1/2, 1}.
In the following example, the plots show tλ(Z) on the vertical axis. The
label TZHAT on the horizontal axis gives the fitted values that result from
using tλ(Z) as the response in the software.
[Figure: five transformation plots of tλ(Z) versus TZHAT for λ ∈ {−1, −1/2, 0, 1/2, 1}, with vertical axes labeled 1/Z, 1/sqrt(Z), LOG(Z), sqrt(Z), and Z.]
5.5 Summary
1) The fixed effects one way Anova model has one qualitative explanatory
variable called a factor and a quantitative response variable Yij. The factor
variable has p levels, E(Yij) = μi and V(Yij) = σ² for i = 1, . . . , p and
j = 1, . . . , ni. Experimental units are randomly assigned to the treatment
levels.
2) Let n = n1 + ··· + np. In an experiment, the investigators use randomization
to randomly assign n units to treatments. Draw a random permutation
of {1, . . . , n}. Assign the first n1 units to treatment 1, the next n2 units to
treatment 2, . . . , and the final np units to treatment p. Use ni ≡ m = n/p if
possible. Randomization washes out the effect of lurking variables.
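The randomization in point 2) can be sketched in code. The book's snippets use R, where sample(n) draws the random permutation; the Python sketch below mimics that with numpy (the sizes n = 12 and p = 3 are my own example values).

```python
# Randomly assign n units to p treatments via a random permutation of
# {1, ..., n}: the first m units go to treatment 1, the next m to
# treatment 2, and so on (balanced case ni = m = n/p).
import numpy as np

rng = np.random.default_rng(0)
n, p = 12, 3
m = n // p
perm = rng.permutation(n) + 1          # a random permutation of {1, ..., n}
groups = [perm[i*m:(i+1)*m] for i in range(p)]   # unit labels per treatment
```

Each unit label appears in exactly one treatment group, so no unit is assigned twice.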
3) The 4 step fixed effects one way ANOVA F test has steps
i) Ho: μ1 = μ2 = ··· = μp and Ha: not Ho.
ii) Fo = MSTR/MSE is usually given by output.
iii) The pval = P(Fp−1,n−p > Fo) is usually given by output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends
on the factor level. Otherwise fail to reject Ho and conclude that the mean
response does not depend on the factor level. (Hence all of the treatment
means are equal, or there is not enough evidence to conclude that the mean
response depends on the factor level.) Give a nontechnical sentence.
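The 4 step test above can be carried out from scratch. The sketch below is an illustration on made-up data (the data and the level 0.05 are my assumptions, not from the text); computing MSTR and MSE by hand and comparing with scipy's f_oneway checks the arithmetic.

```python
# One way ANOVA F test computed from the definitions, then checked
# against scipy.stats.f_oneway.
import numpy as np
from scipy import stats

groups = [np.array([9.8, 10.1, 9.5, 10.4]),
          np.array([12.0, 11.6, 12.3, 11.9]),
          np.array([10.2, 10.0, 10.7, 9.9])]
p, n = len(groups), sum(len(g) for g in groups)
ybar = np.concatenate(groups).mean()                     # grand mean Ybar_00
sstr = sum(len(g) * (g.mean() - ybar)**2 for g in groups)
sse = sum(((g - g.mean())**2).sum() for g in groups)
F0 = (sstr / (p - 1)) / (sse / (n - p))                  # ii) Fo = MSTR/MSE
pval = stats.f.sf(F0, p - 1, n - p)                      # iii) the p-value
F_sp, p_sp = stats.f_oneway(*groups)                     # same test via scipy
reject = pval <= 0.05                                    # iv) reject Ho?
```

Here the second group's mean is clearly larger, so the test rejects Ho.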
Source     df    SS    MS    F              p-value
Treatment  p−1   SSTR  MSTR  Fo = MSTR/MSE  pval for Ho: μ1 = ··· = μp
Error      n−p   SSE   MSE
Y | (W = ai) ∼ fZ(y − μi)
where the location family has second moments. Hence all p distributions
come from the same location family with different location parameter μi and
the same variance σ². The one way fixed effects normal Anova model is the
special case where Y | (W = ai) ∼ N(μi, σ²).
7) The response plot is a plot of Ŷ versus Y. For the one way Anova model,
the response plot is a plot of Ŷij = μ̂i versus Yij. Often the identity line with
unit slope and zero intercept is added as a visual aid. Vertical deviations
from the identity line are the residuals rij = Yij − Ŷij = Yij − μ̂i. The plot
will consist of p dot plots that scatter about the identity line with similar
shape and spread if the fixed effects one way Anova model is appropriate.
The ith dot plot is a dot plot of Yi,1, . . . , Yi,ni. Assume that each ni ≥ 10. If
the response plot looks like the residual plot, then a horizontal line fits the p
dot plots about as well as the identity line, and there is not much difference
in the μ̂i. If the identity line is clearly superior to any horizontal line, then
at least some of the means differ.
8) The residual plot is a plot of Ŷ versus residual r = Y − Ŷ. The plot will
consist of p dot plots that scatter about the r = 0 line with similar shape
and spread if the fixed effects one way Anova model is appropriate. The ith
dot plot is a dot plot of ri,1, . . . , ri,ni. Assume that each ni ≥ 10. Under
the assumption that the Yij are from the same location family with different
parameters μi, each of the p dot plots should have roughly the same shape
and spread. This assumption is easier to judge with the residual plot than
with the response plot.
9) Rule of thumb: If max(S1, . . . , Sp) ≤ 2 min(S1, . . . , Sp), then the one
way ANOVA F test results will be approximately correct if the response and
residual plots suggest that the remaining one way Anova model assumptions
are reasonable. Replace the Si by the ranges Ri of the dot plots in the residual
and response plots.
10) In an experiment, the investigators assign units to treatments. In
an observational study, investigators simply observe the response, and the
treatment groups need to be p random samples from p populations (the levels).
The effects of lurking variables are present in observational studies.
11) If a qualitative variable has c levels, represent it with c − 1 or c indicator
variables. Given a qualitative variable, know how to represent the data with
indicator variables.
12) The cell means model for the fixed effects one way Anova is Yij =
μi + eij where Yij is the value of the response variable for the jth trial
of the ith factor level for i = 1, . . . , p and j = 1, . . . , ni. The μi are the
unknown means and E(Yij) = μi. The eij are iid from the location family
with pdf fZ(z), zero mean, and unknown variance σ² = V(Yij) = V(eij).
For the normal cell means model, the eij are iid N(0, σ²). The estimator
μ̂i = Ȳi0 = Σ_{j=1}^{ni} Yij/ni = Ŷij. The ith residual is rij = Yij − Ȳi0, and Ȳ00 is
the sample mean of all of the Yij and n = Σ_{i=1}^p ni. The total sum of squares
SSTO = Σ_{i=1}^p Σ_{j=1}^{ni} (Yij − Ȳ00)², the treatment sum of squares SSTR =
Σ_{i=1}^p ni(Ȳi0 − Ȳ00)², and the error sum of squares SSE = Σ_{i=1}^p Σ_{j=1}^{ni} (Yij −
Ȳi0)². The MSE is an estimator of σ². The ANOVA table is the same as
that for multiple linear regression, except that SSTR replaces the regression
sum of squares and that SSTO, SSTR, and SSE have n − 1, p − 1, and n − p
degrees of freedom.
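The sums of squares in point 12) satisfy the ANOVA decomposition SSTO = SSTR + SSE. A small numerical check (my own, on arbitrary unbalanced data) is:

```python
# Verify SSTO = SSTR + SSE for the one way Anova decomposition,
# using unequal group sizes ni.
import numpy as np

rng = np.random.default_rng(1)
groups = [rng.normal(mu, 1.0, size=ni) for mu, ni in [(0, 5), (1, 7), (3, 4)]]
y00 = np.concatenate(groups).mean()                       # Ybar_00
ssto = sum(((g - y00)**2).sum() for g in groups)          # total SS
sstr = sum(len(g) * (g.mean() - y00)**2 for g in groups)  # treatment SS
sse = sum(((g - g.mean())**2).sum() for g in groups)      # error SS
assert np.isclose(ssto, sstr + sse)                       # the ANOVA identity
```

The corresponding degrees of freedom also add: (p − 1) + (n − p) = n − 1.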
13) Let Yi0 = Σ_{j=1}^{ni} Yij and let
μ̂i = Ȳi0 = Yi0/ni = (1/ni) Σ_{j=1}^{ni} Yij.
Hence the dot notation means sum over the subscript corresponding to the
0, e.g. j. Similarly, Y00 = Σ_{i=1}^p Σ_{j=1}^{ni} Yij is the sum of all of the Yij. Be able
to find μ̂i from data.
14) If the p treatment groups have the same pdf (so μi ≡ μ in the location
family) with finite variance σ², and if the one way ANOVA F test statistic is
computed from all n!/(n1! ··· np!) ways of assigning ni of the response variables
to treatment i, then the histogram of the F test statistics is approximately
Fp−1,n−p for large ni.
15) For the one way Anova, the fitted values Ŷij = Ȳi0 and the residuals
rij = Yij − Ŷij.
16) Know that for the random effects one way Anova, the levels of
the factor are a random sample of levels from some population of levels F.
Assume the μi are iid with mean μ and variance σμ². The cell means model
for the random effects one way Anova is Yij = μi + eij for i = 1, . . . , p and
j = 1, . . . , ni. The sample size n = n1 + ··· + np and often ni ≡ m so n = pm.
The μi and eij are independent. The eij have mean 0 and variance σ². The
Yij | μi ∼ f(y − μi), a location family with variance σ², while eij ∼ f(y). In
the test below, if H0: σμ² = 0 is true, then the Yij are iid with pdf f(y − μ),
so the F statistic is approximately Fp−1,n−p if the ni are large.
17) Know that the 4 step random effects one way Anova test is
i) H0: σμ² = 0 HA: σμ² > 0
ii) F0 = MSTR/MSE is usually obtained from output.
iii) The pval = P(Fp−1,n−p > F0) is usually obtained from output.
iv) If pval ≤ δ, reject Ho, conclude that σμ² > 0 and that the mean response
depends on the factor level. Otherwise, fail to reject Ho, conclude that σμ² = 0
and that the mean response does not depend on the factor level. (Or there
is not enough evidence to conclude that the mean response depends on the
factor level.)
18) Know how to tell whether the experiment is a fixed or random effects
one way Anova. (Were the levels fixed or a random sample from a population
of levels?)
19) Suppose Y = t_{λo}(Z) = E(Y) + e = x^T β + e
where the subscripts (e.g., Yij) have been suppressed. If λo was known, then
Y = t_{λo}(Z) would follow the DOE model. Assume that all of the values
of the response Z are positive. A power transformation has the form
Y = tλ(Z) = Z^λ for λ ≠ 0 and Y = t0(Z) = log(Z) for λ = 0 where
λ ∈ ΛL = {−1, −1/2, 0, 1/2, 1}.
20) A graphical method for response transformations computes the fitted
values Ŵ from the DOE model using W = tλ(Z) as the response for each
of the five values of λ ∈ ΛL. Let T̂ = Ŵ = TZHAT and plot TZHAT vs.
tλ(Z) for λ ∈ {−1, −1/2, 0, 1/2, 1}. These plots are called transformation
plots. The residual or error degrees of freedom used to compute the MSE
should not be too small. Choose the transformation Y = tλ(Z) that has the
best plot. Consider the one way Anova model with ni ≥ 5 for i = 1, . . . , p.
i) The dot plots should spread about the identity line with similar shape
and spread. ii) Dot plots that are approximately symmetric are better than
skewed dot plots. iii) Spread that increases or decreases with TZHAT (the
shape of the plotted points is similar to a right or left opening megaphone)
is bad.
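For the one way Anova model, the fitted values used in a transformation plot are just the group means of the transformed response. The sketch below (my own Python illustration; the book does this in R, and the data here are made up) computes the quantities plotted for each λ in the coarse grid.

```python
# For each lambda in the grid, transform Z and compute the one way Anova
# fitted values TZHAT, which are the group means of t_lambda(Z).
import numpy as np

def t_lam(z, lam):
    # power transformation: Z^lambda, with log(Z) at lambda = 0
    return np.log(z) if lam == 0 else z**lam

rng = np.random.default_rng(2)
groups = [rng.lognormal(mu, 0.3, size=8) for mu in (0.0, 0.5, 1.0)]
for lam in (-1.0, -0.5, 0.0, 0.5, 1.0):
    w = [t_lam(g, lam) for g in groups]                  # transformed response
    tzhat = [np.full(len(g), gw.mean()) for g, gw in zip(groups, w)]
    # plot tzhat (horizontal) versus t_lam(Z) (vertical); the best lambda
    # gives dot plots scattering about the identity line with similar spread
```

For this lognormal data, the λ = 0 (log) panel would give the most symmetric dot plots.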
21) The transformation plot for the selected transformation is also the
response plot for that model (e.g., for the model that uses Y = log(Z) as
the response). Make all of the usual checks on the DOE model (residual and
response plots) after selecting the response transformation.
22) The log rule says try Y = log(Z) if max(Z)/min(Z) > 10 where
Z > 0 and the subscripts have been suppressed (so Z ≡ Zij for the one way
Anova model).
23) A contrast C = Σ_{i=1}^p ki μi where Σ_{i=1}^p ki = 0. The estimated contrast
is Ĉ = Σ_{i=1}^p ki Ȳi0.
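Estimating a contrast is a one line computation. The sketch below is my own illustration (the data are made up, and the standard error formula SE(Ĉ) = √(MSE Σ ki²/ni) is the usual one for contrasts, although this summary does not state it).

```python
# Estimate the contrast C = sum_i k_i mu_i by Chat = sum_i k_i Ybar_i0,
# with the usual standard error sqrt(MSE * sum_i k_i^2 / n_i).
import numpy as np

groups = [np.array([4.1, 3.9, 4.4]), np.array([5.0, 5.3, 4.8]),
          np.array([6.2, 5.9, 6.4])]
k = np.array([1.0, 0.0, -1.0])       # contrast coefficients, must sum to 0
assert np.isclose(k.sum(), 0.0)
chat = sum(ki * g.mean() for ki, g in zip(k, groups))   # mu1hat - mu3hat
n, p = sum(len(g) for g in groups), len(groups)
mse = sum(((g - g.mean())**2).sum() for g in groups) / (n - p)
se = np.sqrt(mse * sum(ki**2 / len(g) for ki, g in zip(k, groups)))
```

With these coefficients, Ĉ compares the first and third treatment means.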
24) Consider a family of null hypotheses for contrasts {Ho: Σ_{i=1}^p ki μi = 0
where Σ_{i=1}^p ki = 0 and the ki may satisfy other constraints}. Let δS denote
the probability of a type I error for a single test from the family. The family
level δF is an upper bound on the (usually unknown) size δT. Know how to
interpret δF ≈ δT = P(of making at least one type I error among the family
of contrasts) where a type I error is a false rejection.
25) Two important families of contrasts are the family of all possible contrasts
and the family of pairwise differences Cij = μi − μj where i ≠ j.
The Scheffé multiple comparisons procedure has a δF for the family of all
possible contrasts, while the Tukey multiple comparisons procedure has a δF
for the family of all (p choose 2) pairwise contrasts.
5.6 Complements
Often the data does not consist of samples from p populations, but consists
of a group of n = mp units where m units are randomly assigned to each of
the p treatments. Then the Anova models can still be used to compare treatments,
but statistical inference to a larger population cannot be made. Of
course a nonstatistical generalization to larger populations can be made. The
nonstatistical generalization from the group of units to a larger population
is most compelling if several experiments are done with similar results. For
example, generalizing the results of an experiment for psychology students
to the population of all of the university's students is less compelling than the
following generalization. Suppose one experiment is done for psychology students,
one for engineers, and one for English majors. If all three experiments
give similar results, then generalize the results to the population of all of the
university's students.
Four good texts on the design and analysis of experiments are Box et al.
(2005), Cobb (1998), Kuehl (1994), and Ledolter and Swersey (2007). Also
see Dean and Voss (2000), Kirk (2012), Maxwell and Delaney (2003), Montgomery
(2012), and Oehlert (2000).
A randomization test has H0: the different treatments have no effect.
This null hypothesis is also true if all p pdfs Y | (W = ai) ∼ fZ(y − μi) are
the same. An impractical randomization test uses all M = n!/(n1! ··· np!) ways of
assigning ni of the Yij to treatment i for i = 1, . . . , p. Let F0 be the usual
F statistic. The F statistic is computed for each of the M permutations, and
H0 is rejected if the proportion of the M F statistics that are larger than
F0 is less than δ. The distribution of the M F statistics is approximately
Fp−1,n−p for large n when H0 is true. The power of the randomization test is
also similar to that of the usual F test. See Hoeffding (1952). These results
suggest that the usual F test is semiparametric: the pvalue is approximately
correct if n is large and if all p pdfs Y | (W = ai) ∼ fZ(y − μi) are the same.
Let [x] be the integer part of x, e.g. [7.7] = 7. Olive (2014, section 9.3)
shows that practical randomization tests that use a random sample of
max(1000, [n log(n)]) permutations have level and power similar to the tests
that use all M possible permutations. See Ernst (2009) and the lregpack function
rand1way for R code.
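A rough Python analog of such a practical randomization test is sketched below. This is my own illustration, not the lregpack rand1way code; the data, seed, and B = 1000 permutations are assumptions.

```python
# Practical randomization test: permute the group labels B times and
# estimate the p-value by the proportion of permuted F statistics that
# are at least as large as the observed F0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0, 1, 8), rng.normal(1.5, 1, 8)])
labels = np.repeat([0, 1], 8)

def fstat(y, lab):
    gs = [y[lab == i] for i in np.unique(lab)]
    return stats.f_oneway(*gs).statistic

F0 = fstat(y, labels)
B = 1000
perm_F = np.array([fstat(y, rng.permutation(labels)) for _ in range(B)])
randpval = np.mean(perm_F >= F0)     # randomization p-value estimate
```

Under H0 the permuted F statistics have roughly the Fp−1,n−p shape, so randpval should be close to the usual F test p-value.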
All of the parameterizations of the one way fixed effects Anova model
yield the same predicted values, residuals, and ANOVA F test, but the interpretations
of the parameters differ. The cell means model is a linear model
(without intercept) of the form Y = Xc βc + e that can be fit using OLS.
The OLS MLR output gives the correct fitted values and residuals but an
incorrect ANOVA table. An equivalent linear model (with intercept) with
correct OLS MLR ANOVA table as well as residuals and fitted values can
be formed by replacing any column of the cell means model by a column of
ones 1. Removing the last column of the cell means model and making the
first column 1 gives the model Y = β0 + β1 x1 + ··· + βp−1 xp−1 + e, given in
matrix form by (5.5) below.
It can be shown that the OLS estimators corresponding to (5.5) are β̂0 =
Ȳp0 = μ̂p, and β̂i = Ȳi0 − Ȳp0 = μ̂i − μ̂p for i = 1, . . . , p − 1. The cell means
model has β̂i = μ̂i = Ȳi0.
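A quick numerical check of these OLS estimators (my own sketch, on made-up data) fits model (5.5) with numpy's least squares routine:

```python
# Check that OLS for model (5.5) gives beta0hat = Ybar_p0 and
# betaihat = Ybar_i0 - Ybar_p0 for i = 1, ..., p-1.
import numpy as np

rng = np.random.default_rng(4)
ni = [4, 5, 6]                                       # p = 3 groups
groups = [rng.normal(m, 1.0, n) for m, n in zip([1.0, 2.0, 3.0], ni)]
y = np.concatenate(groups)
lab = np.repeat([0, 1, 2], ni)
# design: a column of ones plus indicators for groups 1, ..., p-1
X = np.column_stack([np.ones(len(y))] +
                    [(lab == i).astype(float) for i in range(2)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ybars = [g.mean() for g in groups]
assert np.isclose(beta[0], ybars[2])                          # Ybar_p0
assert np.allclose(beta[1:], [ybars[0]-ybars[2], ybars[1]-ybars[2]])
```

The fitted values and residuals agree with those from the cell means parameterization.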
Wilcox (2012) gives an excellent discussion of the problems that outliers
and skewness can cause for the one and two sample t intervals, the t test,
tests for comparing 2 groups, and the ANOVA F test. Wilcox (2012) replaces
ordinary population means by truncated population means and uses trimmed
means to create analogs of one way Anova and multiple comparisons.
$$
\begin{pmatrix} Y_{11}\\ \vdots\\ Y_{1,n_1}\\ Y_{21}\\ \vdots\\ Y_{2,n_2}\\ \vdots\\ Y_{p,1}\\ \vdots\\ Y_{p,n_p} \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 1 & 0 & \cdots & 0\\
1 & 0 & 1 & \cdots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 1 & \cdots & 0\\
\vdots & & & & \vdots\\
1 & 0 & 0 & \cdots & 1\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 0 & \cdots & 1\\
1 & 0 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 0 & \cdots & 0
\end{pmatrix}
\begin{pmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_{p-1} \end{pmatrix}
+
\begin{pmatrix} e_{11}\\ \vdots\\ e_{1,n_1}\\ e_{21}\\ \vdots\\ e_{2,n_2}\\ \vdots\\ e_{p,1}\\ \vdots\\ e_{p,n_p} \end{pmatrix}
\tag{5.5}
$$
Graphical Anova uses scaled treatment effects = scaled treatment deviations
d̃i = c di = c(Ȳi0 − Ȳ00) for i = 1, . . . , p. Following Box et al.
(2005, p. 166), suppose ni ≡ m = n/p for i = 1, . . . , p. If Ho: μ1 =
··· = μp is true, we want the sample variance of the scaled deviations to
be approximately equal to the sample variance of the residuals. So we want

$$ \frac{\frac{1}{p}\sum_{i=1}^p c^2 d_i^2}{\frac{1}{n}\sum_{i=1}^n r_i^2} \approx F_0 = \frac{MSTR}{MSE} = \frac{SSTR/(p-1)}{SSE/(n-p)} = \frac{\sum_{i=1}^p m\,d_i^2/(p-1)}{\sum_{i=1}^n r_i^2/(n-p)} $$

since SSTR = Σ_{i=1}^p m(Ȳi0 − Ȳ00)² = Σ_{i=1}^p m di². So

$$ \frac{n}{p}\,\frac{\sum_{i=1}^p c^2 d_i^2}{\sum_{i=1}^n r_i^2} = F_0 = \frac{m(n-p)}{p-1}\,\frac{\sum_{i=1}^p d_i^2}{\sum_{i=1}^n r_i^2}, \qquad\text{so}\qquad c^2 = \frac{mp}{n}\,\frac{n-p}{p-1} = \frac{n-p}{p-1} $$

since mp/n = 1. Thus c = √((n − p)/(p − 1)).
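With ni ≡ m, the variance-matching relation above holds exactly, not just approximately, which gives a numerical check (my own sketch on simulated data):

```python
# Check that with c = sqrt((n-p)/(p-1)) and ni = m, the ratio of the
# sample variance of the scaled deviations (divisor p) to that of the
# residuals (divisor n) equals F0 exactly.
import numpy as np

rng = np.random.default_rng(5)
p, m = 4, 6
n = p * m
groups = [rng.normal(mu, 1.0, m) for mu in (0.0, 0.5, 1.0, 2.0)]
y00 = np.concatenate(groups).mean()
d = np.array([g.mean() - y00 for g in groups])     # treatment deviations
r = np.concatenate([g - g.mean() for g in groups]) # residuals
c = np.sqrt((n - p) / (p - 1))
lhs = ((c * d)**2).sum() / p / ((r**2).sum() / n)
F0 = (m * (d**2).sum() / (p - 1)) / ((r**2).sum() / (n - p))
assert np.isclose(lhs, F0)
```

So under Ho the scaled deviations and the residuals have about the same spread exactly when F0 is near 1.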
For Graphical Anova, see Box et al. (2005, pp. 136, 150, 164, 166) and
Hoaglin et al. (1991). The R package granova, available from CRAN
(http://streaming.stat.iastate.edu/CRAN/) and authored by R.M.
Pruzek and J.E. Helmreich, may be useful.
The modified power transformation family is

$$ Y_i^{(\lambda)} = t_\lambda(Z_i) \equiv \frac{Z_i^\lambda - 1}{\lambda} $$

for λ ≠ 0, and Y_i^{(0)} = t_0(Z_i) = log(Z_i) for λ = 0, where λ ∈ ΛL.
Box and Cox (1964) give a numerical method for selecting the response
transformation for the modified power transformations. Although the method
gives a point estimator λ̂o, often an interval of reasonable values is generated
(either graphically or using a profile likelihood to make a confidence
interval), and λ ∈ ΛL is used if it is also in the interval.
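The Box and Cox point estimator is available in standard software. The sketch below is my own illustration using scipy's maximum likelihood Box-Cox routine (the book's own method is the graphical one); a numerical λ̂ is then snapped to the nearest value in the coarse grid ΛL.

```python
# Box-Cox MLE of lambda, then replace lambda-hat by the nearest value in
# the coarse grid Lambda_L = {-1, -1/2, 0, 1/2, 1}.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
z = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # log(Z) is normal
_, lam_hat = stats.boxcox(z)                        # MLE of lambda
grid = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
lam = grid[np.argmin(np.abs(grid - lam_hat))]       # grid value nearest lam_hat
```

For this lognormal data the true power is λ = 0, so the snapped estimate should be the log transformation.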
There are several reasons to use a coarse grid ΛL of powers. First, several of
the powers correspond to simple transformations such as the log, square root,
and reciprocal. These powers are easier to interpret than λ = 0.28, for example.
Secondly, if the estimator λ̂n can only take values in ΛL, then sometimes
λ̂n will converge in probability to λo ∈ ΛL. Thirdly, Tukey (1957) showed
that neighboring modified power transformations are often very similar, so
restricting the possible powers to a coarse grid is reasonable.
The graphical method for response transformations is due to Olive (2004b).
A variant of the method would plot the residual plot or both the response
and the residual plot for each of the ve values of . Residual plots are also
useful, but they do not distinguish between nonlinear monotone relationships
and nonmonotone relationships. See Fox (1991, p. 55). Alternative methods
are given by Cook and Olive (2001) and Box et al. (2005, p. 321).
An alternative to one way Anova is to use FWLS (see Chapter 4) on the
cell means model with σ²V = diag(σ1², . . . , σp²) where σi² occurs ni times
on the diagonal and σi² is the variance of the ith group for i = 1, . . . , p.
Then V̂ = diag(S1², . . . , Sp²) where Si² = (1/(ni − 1)) Σ_{j=1}^{ni} (Yij − Ȳi0)² is the sample
variance of the Yij. Hence the estimated weights for FWLS are ŵij ≡ ŵi =
1/Si². Then the FWLS cell means model has Y = Xc βc + e as in (5.1) except
Cov(e) = diag(σ1², . . . , σp²).
Hence Z = Uc βc + ε. Then Uc^T Uc = diag(n1 ŵ1, . . . , np ŵp), (Uc^T Uc)⁻¹ =
diag(S1²/n1, . . . , Sp²/np) = (Xc^T V̂⁻¹ Xc)⁻¹, and Uc^T Z = (ŵ1 Y10, . . . , ŵp Yp0)^T.
Thus from Chapter 4,
β̂FWLS = (Ȳ10, . . . , Ȳp0)^T = β̂c.
That is, the FWLS estimator equals the one way Anova estimator of βc based
on OLS applied to the cell means model. The ANOVA F test generalizes
the pooled t test in that the two tests are equivalent for p = 2. The FWLS
procedure is also known as the Welch one way Anova and generalizes the
Welch t test. The Welch t test is thought to be much better than the pooled
t test if n1 ≠ n2 and σ1² ≠ σ2². See Brown and Forsythe (1974a,b), Kirk (1982,
pp. 100, 101, 121, 122), Olive (2014, pp. 278-279), Welch (1947, 1951), and
Problem 5.11.
In matrix form, Z = Uc βc + ε becomes

$$
\begin{pmatrix}
\sqrt{w_1}\,Y_{1,1}\\ \vdots\\ \sqrt{w_1}\,Y_{1,n_1}\\ \sqrt{w_2}\,Y_{2,1}\\ \vdots\\ \sqrt{w_2}\,Y_{2,n_2}\\ \vdots\\ \sqrt{w_p}\,Y_{p,1}\\ \vdots\\ \sqrt{w_p}\,Y_{p,n_p}
\end{pmatrix}
=
\begin{pmatrix}
\sqrt{w_1} & 0 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
\sqrt{w_1} & 0 & \cdots & 0\\
0 & \sqrt{w_2} & \cdots & 0\\
\vdots & \vdots & & \vdots\\
0 & \sqrt{w_2} & \cdots & 0\\
\vdots & & & \vdots\\
0 & 0 & \cdots & \sqrt{w_p}\\
\vdots & \vdots & & \vdots\\
0 & 0 & \cdots & \sqrt{w_p}
\end{pmatrix}
\begin{pmatrix} \mu_1\\ \mu_2\\ \vdots\\ \mu_p \end{pmatrix}
+
\begin{pmatrix}
\epsilon_{11}\\ \vdots\\ \epsilon_{1,n_1}\\ \epsilon_{21}\\ \vdots\\ \epsilon_{2,n_2}\\ \vdots\\ \epsilon_{p,1}\\ \vdots\\ \epsilon_{p,n_p}
\end{pmatrix}
\tag{5.6}
$$
The Welch test statistic is

$$ F_W = \frac{\sum_{i=1}^p w_i(\bar{Y}_{i0} - \bar{Y}_{00})^2/(p-1)}{1 + \dfrac{2(p-2)}{p^2-1}\sum_{i=1}^p \dfrac{(1 - w_i/u)^2}{n_i - 1}} $$

where wi = ni/Si², u = Σ_{i=1}^p wi, and Ȳ00 = Σ_{i=1}^p wi Ȳi0/u. Then the test
statistic is compared to an Fp−1,dW distribution where dW = [f] and

$$ \frac{1}{f} = \frac{3}{p^2-1}\sum_{i=1}^p \frac{(1 - w_i/u)^2}{n_i - 1}. $$

For the modified Welch (1947) test, the test statistic is compared to an
Fp−1,dMW distribution where dMW = [f] and

$$ f = \frac{\left[\sum_{i=1}^p S_i^2/n_i\right]^2}{\sum_{i=1}^p \dfrac{(S_i^2/n_i)^2}{n_i-1}} = \frac{\left[\sum_{i=1}^p 1/w_i\right]^2}{\sum_{i=1}^p \dfrac{(1/w_i)^2}{n_i-1}}, $$

where

$$ c_i = \left(1 - \frac{n_i}{n}\right) S_i^2 \Big/ \sum_{i=1}^p \left(1 - \frac{n_i}{n}\right) S_i^2. $$
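The Welch statistic FW and its approximate degrees of freedom can be coded directly from the formulas above. The sketch below is my own Python implementation (the text simulates these tests with the R function anovasim); for p = 2, FW reduces to the square of the Welch t statistic, which provides a check against scipy.

```python
# Welch one way Anova: FW and the approximate denominator df f,
# following the formulas above with wi = ni/Si^2 and u = sum(wi).
import numpy as np
from scipy import stats

def welch_anova(groups):
    p = len(groups)
    ni = np.array([len(g) for g in groups])
    ybar = np.array([g.mean() for g in groups])
    w = ni / np.array([g.var(ddof=1) for g in groups])   # wi = ni/Si^2
    u = w.sum()
    ycom = (w * ybar).sum() / u                          # weighted grand mean
    num = (w * (ybar - ycom)**2).sum() / (p - 1)
    h = ((1 - w/u)**2 / (ni - 1)).sum()
    FW = num / (1 + 2*(p - 2)/(p**2 - 1) * h)
    f = (p**2 - 1) / (3 * h)                             # 1/f = 3h/(p^2-1)
    return FW, f

rng = np.random.default_rng(7)
g1, g2 = rng.normal(0, 1, 10), rng.normal(1, 3, 15)      # unequal variances
FW, f = welch_anova([g1, g2])
t = stats.ttest_ind(g1, g2, equal_var=False)             # Welch t test
```

For p = 2 the interaction term 2(p − 2)/(p² − 1) vanishes and FW equals t², with f equal to the Welch-Satterthwaite degrees of freedom.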
The lregpack function anovasim can be used to simulate and compare
the four tests with the usual one way ANOVA test. Some simulation results
are in Haenggi (2009).
5.7 Problems
got 1000, four got 5000, and four got 10000. These four groups are denoted by
"none", "n1000", "n5000", and "n10000", respectively. The seedling growths
were all recorded and the table below gives the one way ANOVA results.
a) What is μ̂none?
b) Do a four step test for whether the four mean growths are equal.
(So Ho: μnone = μn1000 = μn5000 = μn10000.)
c) Examine the Bonferroni comparison of means. Which groups of means
are not significantly different?
> sample(11)
[1] 7 10 9 8 1 6 3 11 2 4 5
y1 y5 y2 y3 y4
9.8 10.8 15.4 17.6 21.6
5.6. The tensile strength of a cotton/nylon fiber used to make women's
shirts is believed to be affected by the percentage of cotton in the fiber. The
5 levels of cotton percentage that are of interest are tabled above. Also shown
is a (Tukey pairwise) comparison of means. Which groups of means are not
significantly different? Data is from Montgomery (1984, pp. 51, 66).
5.8. Ledolter and Swersey (2007, p. 49) describe a one way Anova design
used to study the effectiveness of 3 product displays (A, B, and C). Fifteen
stores were used and each display was randomly assigned to 5 stores. The
response Y was the sales volume for the week during which the display was
present compared to the base sales for that store.
a) Find μ̂2 = μ̂B using output on the previous page.
b) Perform a 4 step ANOVA F test.
Fig. 5.5 Graphical Anova for Problem 5.9
ganova(x,y)
smn -3.2333 -3.0374 6.2710
Treatments "A" "B" "C"
5.9. Ledolter and Swersey (2007, p. 49) describe a one way Anova design
used to study the effectiveness of 3 product displays (A, B, and C). Fifteen
stores were used and each display was randomly assigned to 5 stores. The
response Y was the sales volume for the week during which the display was
present compared to the base sales for that store. Figure 5.5 is the Graphical
Anova plot found using the lregpack function ganova.
a) Which two displays (from A, B, and C) yielded similar mean sales
volume?
b) Which effect (from A, B, and C) appears to be significant?
Source df SS MS F P
treatment 3 89.19 29.73 15.68 0.0002
error 12 22.75 1.90
Problems using R.
5.11. The pooled t procedures are a special case of one way Anova with
p = 2. Consider the pooled t CI for μ1 − μ2. Let X1, . . . , Xn1 be iid with
mean μ1 and variance σ1². Let Y1, . . . , Yn2 be iid with mean μ2 and variance
σ2². Assume that the two samples are independent (or that n1 + n2 units
were randomly assigned to two groups) and that ni → ∞ for i = 1, 2 in
such a way that n1/(n1 + n2) → ρ ∈ (0, 1). Let θ = σ2²/σ1², let the pooled
sample variance Sp² = [(n1 − 1)S1² + (n2 − 1)S2²]/[n1 + n2 − 2], and let
τ² = [1 − ρ + ρθ]/[ρ + (1 − ρ)θ]. Show that

$$ \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \xrightarrow{D} N(0, 1) \quad\text{and} $$

$$ \frac{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}\;\frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \xrightarrow{D} N(0, \tau^2). $$
y <- ycrab+1/6
aovtplt(crabhab,y)
5.14. The following data set considers the number of warp breaks per
loom, where the factor is tension (low, medium, or high).
a) Copy and paste the commands for this problem into R.
Highlight the ANOVA table by pressing the left mouse key and dragging
the cursor over the ANOVA table. Then use the menu commands "Edit>
Copy". Enter Word and use the menu command "Paste".
b) To place the residual plot in Word, get into R and click on the plot, hit the Ctrl and c
keys at the same time. Enter Word and use the menu command "Paste" or
hit the Ctrl and v keys at the same time.
c) Copy and paste the commands for this part into R.
Click on the response plot, hit the Ctrl and c keys at the same time. Enter
Word and use the menu command "Paste".
5.15. Obtain the Box et al. (2005, p. 134) blood coagulation data from
lregdata and the R program ganova from lregpack. The program does graphical
Anova for the one way Anova model.
a) Enter the following command and include the plot in Word by simultaneously
pressing the Ctrl and c keys, then using the menu command "Paste"
in Word, or hit the Ctrl and v keys at the same time.
ganova(bloodx,bloody)
The scaled treatment deviations are on the top of the plot. As a rule
of thumb, if all of the scaled treatment deviations are within the spread of
the residuals, then the population treatment means are not significantly different
(they all give response near the grand mean). If some deviations are outside
of the spread of the residuals, then not all of the population treatment means
are equal. Box et al. (2005, p. 137) state "The graphical analysis discourages
overreaction to high significance levels and avoids underreaction to 'very
nearly' significant differences."
b) From the output, which two treatment means were approximately the
same?
z<-rand1way(y=bloody,group=bloodx,B=1000)
hist(z$rdist)
z$Fpval
z$randpval
5.16. Cut and paste the SAS program for this problem into the SAS
Editor.
To execute the program, use the top menu commands "Run>Submit". An
output window will appear if successful.
(If you were not successful, look at the log window for hints on errors.
A single typo can cause failure. Reopen your file in Word or Notepad and
make corrections. Occasionally you cannot find your error. Then find your
instructor or wait a few hours and reenter the program.)
Data is from SAS Institute (1985, pp. 126-129). See Example 5.6.
a) In SAS, use the menu commands "Edit>Select All" then "Edit>Copy".
In Word, use the menu command "Paste". Highlight the first page of output
and use the menu command "Cut". (SAS often creates too much output.
These commands reduce the output from 4 pages to 3 pages.)
You may want to save your SAS output as the file HW5d16.doc on your
flash drive.
b) Perform the 4 step test for Ho: μ1 = μ2 = ··· = μ6.
c) From the residual and response plots, does the assumption of equal
population standard deviations (σi ≡ σ for i = 1, . . . , 6) seem reasonable?
5.17. To get in ARC, you need to find the ARC icon. Suppose the ARC icon
is in a math progs folder. Move the cursor to the math progs folder, click the
right mouse button twice, move the cursor to ARC, double click, move the
cursor to ARC, double click. These menu commands will be written "math
progs > ARC > ARC". To quit ARC, move the cursor to the x in the northeast
corner and click.
This Cook and Weisberg (1999a, p. 289) data set contains IQ scores on
27 pairs of identical twins, one raised by foster parents (IQf) and the other
by biological parents (IQb). C gives the social class of the biological parents:
C = 1 for upper class, 2 for middle class, and 3 for lower class. Hence the
Anova test is for whether mean IQ depends on class.
a) Activate the twins.lsp dataset with the menu commands
"File > Load > Data > twins.lsp".
b) Use the menu commands "Twins>Make factors", select C, and click on
OK. The line "{F}C Factor 27 Factor, first level dropped" should appear on
the screen.
c) Use the menu commands "Twins>Description" to see a description of
the data.
d) Enter the menu commands "Graph&Fit>Fit linear LS" and select {F}C
as the term and IQb as the response. Highlight the output by pressing the
left mouse key and dragging the cursor over the output. Then use the menu
commands "Edit>Copy". Enter Word and use the menu command "Paste".
This McKenzie and Goldman (1999, p. T-234) data set has 30 three-month-old
infants randomized into five groups of 6 each. Each infant is shown a
mobile of one of five multicolored designs, and the goal of the study is to see
if the infant attention span varies with the type of design of mobile. The times
that each infant spent watching the mobile are recorded.
b) Choose "Stat>Basic Statistics>Display Descriptive Statistics", select
"C1 Time" as the Variable, click the "By variable" option and press Tab.
Select "C2 Design" as the By variable.
c) From the window in b), click on "Graphs", the "Boxplots of data" option, and "OK" twice. Click on the plot
and then click on the printer icon to get a plot of the boxplots.
d) Select "Stat>ANOVA>One-way", select "C1-time" as the response and
"C2-Design" as the factor. Click on "Store residuals" and click on "Store fits".
Then click on "OK". Click on the output and then click on the printer icon.
e) To make a residual plot, select "Graph>Plot". Select "Resi1" for Y
and "Fits1" for X and click on "OK". Click on the plot and then click on
the printer icon to get the residual plot.
f) To make a response plot, select "Graph>Plot". Select "C1 Time" for
Y and "Fits1" for X and click on "OK". Click on the plot and then click
on the printer icon to get the response plot.
g) Do the 4 step test for Ho: μ1 = μ2 = ··· = μ5.
To get out of Minitab, move your cursor to the x in the NE corner of
the screen. When asked whether to save changes, click on "no".
Chapter 6
The K Way Anova Model
For a K way Anova model, A1, . . . , AK are the factors with li levels for
i = 1, . . . , K. Hence there are l1 l2 ··· lK treatments where each treatment
uses exactly one level from each factor. First the two way Anova model is
discussed, and then the model with K > 2. Interactions between the K factors
are important.
Definition 6.1. The fixed effects two way Anova model has two factors
A and B plus a response Y. Factor A has a levels and factor B has b levels.
There are ab treatments.
Definition 6.2. The cell means model for two way Anova is Yijk =
μij + eijk where i = 1, . . . , a; j = 1, . . . , b; and k = 1, . . . , m. The sample size
n = abm. The μij are constants and the eijk are iid from a location family
with mean 0 and variance σ². Hence the Yijk ∼ f(y − μij) come from a location
family with location parameter μij. The fitted values are Ŷijk = Ȳij0 = μ̂ij
while the residuals rijk = Yijk − Ŷijk.
For one way Anova models, the cell sizes ni need not be equal. For K way
Anova models with K ≥ 2 factors, the statistical theory is greatly simplified
if all of the cell sizes are equal. Such designs are called balanced designs.
Definition 6.3. A balanced design has all of the cell sizes equal: for the
two way Anova model, nij ≡ m.
Definition 6.4. A two way Anova design uses factorial crossing if each
combination of an A level and a B level is used and called a treatment. There
are ab treatments for the two way Anova model.
Remark 6.1. If A and B are factors, then there are 5 possible models.
i) The two way Anova model has terms A, B, and AB.
ii) The additive model or main effects model has terms A and B.
iii) The one way Anova model that uses factor A.
iv) The one way Anova model that uses factor B.
v) The null model does not use any of the three terms A, B, or AB. If the
null model holds, then Yijk ∼ f(y − μ00), so the Yijk form a random sample of
size n from a location family, and the distribution of the response is the same
for all ab treatments. For models i)-iv), the distribution of the response is
not the same for all ab treatments.
Remark 6.2. The response plot, residual plot, and transformation plots
for response transformations are used in the same way as in Chapter 5. The
plots work best if the MSE degrees of freedom ≥ max(10, n/5). The model
is overfitting if 1 ≤ MSE df < max(10, n/5), and then the plots may only
be useful for detecting large deviations from the model. For the model that
contains A, B, and AB, there will be ab dot plots of size m, and we need
m ≥ 5 to check for similar shape and spread. For the additive model, the
response and residual plots often look like those for multiple linear regression.
Then the plotted points should scatter about the identity line or r = 0 line
in a roughly evenly populated band if the additive two way Anova model is
reasonable. We want n ≥ 5(number of parameters in the model) for inference.
So we want n ≥ 5ab or m ≥ 5 when all interactions and main effects are in
the two way Anova model.
Shown below is an ANOVA table for the two way Anova model, given in symbols.
Sometimes "Error" is replaced by "Residual" or "Within Groups." A and
B are the main effects while AB is the interaction. Sometimes "p-value" is
replaced by "P", "Pr(> F)", or "PR > F." The p-value corresponding to
FA is for Ho: μ10 = ··· = μa0. The p-value corresponding to FB is for Ho:
μ01 = ··· = μ0b. The p-value corresponding to FAB is for Ho: there is no
interaction. The sample p-value pval is an estimator of the population
p-value.
Source  df               SS    MS    F               p-value
A       a−1              SSA   MSA   FA = MSA/MSE    pval
B       b−1              SSB   MSB   FB = MSB/MSE    pval
AB      (a−1)(b−1)       SSAB  MSAB  FAB = MSAB/MSE  pval
Error   n−ab = ab(m−1)   SSE   MSE
does not depend on the level of B. (Or there is not enough evidence to
conclude that the mean response depends on the level of B.)
The interaction plot is rather hard to use, especially if the nij = m are
small. For small m, the curves can be far from parallel, even if there is no
interaction. The further the curves are from being parallel, the greater the
evidence of interaction. Intersection of curves suggests interaction unless the
two curves are nearly the same. The two curves may be nearly the same if
two levels of one factor give nearly the same mean response for each level of
the other factor. Then the curves could cross several times even though there
is no interaction. Software fills space. So the vertical axis needs to be checked
to see whether the sample means for two curves are close with respect to
the standard error √(MSE/m) for the means.
The interaction plot is the most useful if the conclusions for the plot agree
with the conclusions for the F test for no interaction.
Definition 6.7. The overparameterized two way Anova model has Yijk =
μij + eijk with μij = μ00 + αi + βj + (αβ)ij where the interaction parameters
(αβ)ij = μij − μi0 − μ0j + μ00. The A main effects are αi = μi0 − μ00 for
i = 1, . . . , a. The B main effects are βj = μ0j − μ00 for j = 1, . . . , b. Here
Σi αi = 0, Σj βj = 0, Σi (αβ)ij = 0 for j = 1, . . . , b and Σj (αβ)ij = 0 for
i = 1, . . . , a. Thus Σi Σj (αβ)ij = 0.
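The zero-sum constraints in Definition 6.7 can be checked numerically. Below is a hedged Python sketch (not from the text; the function name is invented) that decomposes a table of cell means μij into μ00, the αi, the βj, and the (αβ)ij. By construction the main effects and the interaction parameters sum to zero.

```python
def decompose(mu):
    """mu[i][j] = cell mean; returns (mu00, alpha, beta, gamma) with
    mu[i][j] = mu00 + alpha[i] + beta[j] + gamma[i][j]."""
    a, b = len(mu), len(mu[0])
    mu00 = sum(map(sum, mu)) / (a * b)                 # grand mean
    mui0 = [sum(row) / b for row in mu]                # row (A) means
    mu0j = [sum(mu[i][j] for i in range(a)) / a for j in range(b)]
    alpha = [m - mu00 for m in mui0]                   # A main effects
    beta = [m - mu00 for m in mu0j]                    # B main effects
    gamma = [[mu[i][j] - mui0[i] - mu0j[j] + mu00 for j in range(b)]
             for i in range(a)]                        # interactions
    return mu00, alpha, beta, gamma
```

If all the gamma[i][j] are zero, the table of means is additive and the interaction plot curves are parallel.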
Fig. 6.1 Interaction Plot for Example 6.1. [Figure: mean of Y (roughly 1.1
to 1.5) plotted against the levels 1 and 2 of factor A.]
with line segments, then there will be b parallel curves with curve height
depending on βj. If there is interaction, then not all of the curves will be
parallel. The interaction plot replaces the μij by the μ̂ij = Ȳij0.
Example 6.2. The output below uses data from Kutner et al. (2005, prob-
lems 19.14–15). The output is from an experiment on hay fever, and 36 vol-
unteers were given medicine. The two active ingredients (factors A and B)
in the medicine were varied at three levels each (low, medium, and high).
The response is the number of hours of relief. (The factor names for this
problem are A and B.)
a) Give a four step test for the A*B interaction.
b) Give a four step test for the A main effects.
c) Give a four step test for the B main effects.
Source DF SS MS F P
A 2 220.0200 110.0100 1827.86 0.000
B 2 123.6600 61.8300 1027.33 0.000
Interaction 4 29.4250 7.3562 122.23 0.000
Error 27 1.6250 0.0602
Use factorial crossing to compare the effects (main effects, pairwise inter-
actions, . . . , K-fold interaction if there are K factors) of two or more factors.
If A1, . . . , AK are the factors with li levels for i = 1, . . . , K, then there are
l1 l2 · · · lK treatments where each treatment uses exactly one level from each
factor.
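Factorial crossing is just a Cartesian product of the factor levels. A minimal Python sketch (illustrative; the function name is invented) enumerates the l1 l2 · · · lK treatments:

```python
from itertools import product

def treatments(levels):
    """levels[i] lists the li levels of factor Ai.
    Returns all l1*l2*...*lK treatments, one level from each factor."""
    return list(product(*levels))
```

For example, two factors with 2 and 3 levels give 2 × 3 = 6 treatments.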
Source                                  df  SS         MS       F       p-value
K main effects                              e.g. SSA   MSA      FA      pA
(K choose 2) 2 factor interactions          e.g. SSAB  MSAB     FAB     pAB
(K choose 3) 3 factor interactions          e.g. SSABC MSABC    FABC    pABC
  ...                                       ...        ...      ...     ...
(K choose K−1) K−1 factor interactions
the K factor interaction                    SSA···L    MSA···L  FA···L  pA···L
Error                                       SSE        MSE
Shown above is a partial ANOVA table for a K way Anova design
with the degrees of freedom left blank. For A, use H0: μ100 = · · · = μl1 00.
The other main effects have similar null hypotheses. For interaction, use H0:
no interaction.
These models get complex rapidly as K and the number of levels li in-
crease. As K increases, there are a large number of models to consider. For
experiments, usually the 3 way and higher order interactions are not signif-
icant. Hence a full model that includes all K main effects and (K choose 2)
2 way interactions is a useful starting point for response, residual, and
transformation plots. The higher order interactions can be treated as potential
terms and checked for significance. As a rule of thumb, significant interactions
tend to involve significant main effects.
The sample size n = m l1 l2 · · · lK ≥ m 2^K is minimized by taking li = 2 for
i = 1, . . . , K. Hence the sample size grows exponentially fast with K. Designs
that use the minimum number of levels 2 are discussed in Section 8.1.
6.3 Summary
1) The fixed effects two way Anova model has two factors A and B plus a
response Y. Factor A has a levels and factor B has b levels. There are ab
treatments. The cell means model is Yijk = μij + eijk where i = 1, . . . , a; j =
1, . . . , b; and k = 1, . . . , m. The sample size n = abm. The μij are constants
and the eijk are iid with mean 0 and variance σ². Hence the Yijk ~ f(y − μij)
come from a location family with location parameter μij. The fitted values
are Ŷijk = Ȳij0 = μ̂ij while the residuals rijk = Yijk − Ŷijk.
2) Know that the 4 step test for AB interaction is
i) Ho: no interaction HA: there is an interaction
ii) FAB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ reject Ho, and conclude that there is an interaction between A
and B, otherwise fail to reject Ho, and conclude that there is no interaction
between A and B.
3) Keep A and B in the model if there is an AB interaction.
4) Know that the 4 step test for A main effects is
i) Ho: μ10 = · · · = μa0   HA: not Ho
ii) FA is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ reject Ho and conclude that the mean response depends on the
level of A, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of A.
5) Know that the 4 step test for B main effects is
i) Ho: μ01 = · · · = μ0b   HA: not Ho
ii) FB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ reject Ho and conclude that the mean response depends on the
level of B, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of B.
6) Shown is an ANOVA table for the two way Anova model given in sym-
bols. Sometimes "Error" is replaced by "Residual" or "Within Groups." A
and B are the main effects while AB is the interaction. Sometimes "p-value"
is replaced by "P", "Pr(> F)", or "PR > F". The p-value corresponding to
FA is for Ho: μ10 = · · · = μa0. The p-value corresponding to FB is for Ho:
μ01 = · · · = μ0b. The p-value corresponding to FAB is for Ho: there is no
interaction.
Source  df                  SS    MS    F               p-value
A       a − 1               SSA   MSA   FA = MSA/MSE    pval
B       b − 1               SSB   MSB   FB = MSB/MSE    pval
AB      (a − 1)(b − 1)      SSAB  MSAB  FAB = MSAB/MSE  pval
Error   n − ab = ab(m − 1)  SSE   MSE
6.4 Complements
Four good texts on the design and analysis of experiments are mentioned
in the Complements of Chapter 5. The software for K way Anova is often
used to fit block designs. Each block is entered as if it were a factor and the
main effects model is fit. The one way block design treats the block like one
factor and the treatment factor as another factor and uses two way Anova
software without interaction to get the correct sum of squares, F statistic,
and p-value. The Latin square design treats the row block as one factor, the
column block as a second factor, and the treatment factor as another factor.
Then the three way Anova software for main eects is used to get the correct
sum of squares, F statistic, and p-value. These two designs are described in
Chapter 7. The K way software is also used to get output for the split plot
designs described in Chapter 9.
Consider finding a model using pretesting or variable selection, and then
acting as if that model was selected before examining the data. This method
does not lead to valid inference. See Fabian (1991) for results on the 2 way
Anova model. If the method can be automated, the bootstrap method of Olive
(2016a) is conjectured to be useful for inference. This bootstrap method may
also be useful for unbalanced designs where the nij are not all equal to m.
Gail (1996) explains why it took so long to use double blinded completely
randomized controlled experiments to test new vaccines.
6.5 Problems
a) Copy and paste the SAS program into SAS, use the file command
Run>Submit.
b) Click on the Graph1 window and scroll down to the second interaction
plot of tmp vs ymn. Press the printer icon to get the plot.
c) Is interaction present?
d) Click on the output window then click on the printer icon. This will
produce 5 pages of output, but only hand in the ANOVA table, response plot,
and residual plots.
(Cutting and pasting the output into Word resulted in bad plots. Using
Notepad gave better plots, but the printer would not easily put the ANOVA
table and two plots on one page each.)
e) Do the residual and response plots look ok?
6.4. a) Copy the SAS data for problem 6.3 into Notepad. Then hit Enter
every three numbers so that the data is in 3 columns.
1 50 130
1 50 155
1 50 74
1 50 180
1 65 34
. . .
. . .
. . .
3 80 60
b) Copy and paste the data into Minitab using the menu commands
Edit>Paste Cells and click on OK. Right below C1 type material, below
C2 type temp and below C3 type mvoltage.
c) Select Stat>ANOVA>Two-way, select C3 mvoltage as the response
and C1 material as the row factor and C2 temp as the column factor.
Click on "Store residuals" and click on "Store fits". Then click on OK.
Click on the output and then click on the printer icon.
d) To make a residual plot, select Graph>Plot. Select Resi1 for Y and
Fits1 for X and click on OK. Click on the printer icon to get a plot of
the graph.
e) To make a response plot, select Graph>Plot. Select C3 mvoltage for
Y and Fits1 for X and click on OK. Click on the printer icon to get
a plot of the graph.
R Problem
6.5. The Box et al. (2005, p. 318) poison data has 4 types of treatments
(1,2,3,4) and 3 types of poisons (1,2,3). Each animal is given a poison and a
treatment, and the response is survival in hours. Get the poison data from
lregdata.
a) Type the following commands to see that the output for the three
models is the same. Print the output.
out1<-aov(stime~ptype*treat,poison)
summary(out1)
out2<-aov(stime~ptype + treat + ptype*treat,poison)
summary(out2)
out3<-aov(stime~.^2,poison)
summary(out3)
#The three models are the same.
b) Type the following commands to see the residual plot. Include the plot
in Word.
plot(fitted(out1),resid(out1))
title("Residual Plot")
c) Type the following commands to see the response plot. Include the plot
in Word.
attach(poison)
out4 <- aov((1/stime)~ptype*treat,poison)
summary(out4)
f) Type the following commands to get the residual plot. Copy the plot
into Word.
plot(fitted(out4),resid(out4))
title("Residual Plot")
g) Type the following commands to get the response plot. Copy the plot
into Word.
h) Type the following commands to get the interaction plot. Copy the plot
into Word.
interaction.plot(treat,ptype,(1/stime))
detach(poison)
Blocks are groups of similar units and blocking can yield experimental designs
that are more efficient than designs that do not block. One way block designs
and Latin square designs will be discussed.
Definition 7.1. A block is a group of mk similar or homogeneous units.
In a block design, each unit in a block is randomly assigned to one of k
treatments with each treatment getting m units from the block. The meaning
of similar is that the units are likely to have similar values of the response
when given identical treatments.
In agriculture, adjacent plots of land are often used as blocks since adjacent
plots tend to give similar yields. Litter mates, siblings, twins, time periods
(e.g., different days), and batches of material are often used as blocks.
Following Cobb (1998, p. 247), there are 3 ways to get blocks. i) Sort units
into groups (blocks) of mk similar units. ii) Divide large chunks of material
(blocks) into smaller pieces (units). iii) Reuse material or subjects (blocks)
several times. Then the time slots are the units.
Example 7.1. For i), to study the effects of k different medicines, sort
n = bk people into b groups of size k according to similar age and weight. For
ii), suppose there are b plots of land. Divide each plot into k subplots. Then
each plot is a block and the subplots are units. For iii), give the k different
treatments to each person over k months. Then each person has a block of
time slots and the ith month = time slot is the unit.
Suppose there are b blocks and n = kb. The one way Anova design randomly
assigns b of the units to each of the k treatments. Blocking places a constraint
on the randomization, since within each block of units, exactly one unit is
randomly assigned to each of the k treatments.
Hence a one way Anova design would use the R command sample(n) and
the first b units would be assigned to treatment 1, the second b units to
treatment 2, . . . , and the last b units would be assigned to treatment k.
For the completely randomized block designs, described below, the com-
mand sample(k) is done b times: once for each block. The ith command is
for the units of the ith block. If k = 5 and the sample(5) command yields
2 5 3 1 4, then the 2nd unit in the ith block is assigned to treatment
1, the 5th unit to treatment 2, the 3rd unit to treatment 3, the 1st unit to
treatment 4, and the 4th unit to treatment 5.
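This within-block randomization is easy to sketch in Python (mimicking R's sample(k); the function name is invented for illustration):

```python
import random

def crbd_randomize(b, k, rng=random):
    """For each of b blocks, draw a random permutation of 1..k.
    If the permutation is p, then unit p[t-1] of the block is assigned
    to treatment t, as in the text's sample(k) description."""
    assignments = []
    for _ in range(b):
        perm = rng.sample(range(1, k + 1), k)          # like R's sample(k)
        unit_to_trt = {unit: t for t, unit in enumerate(perm, start=1)}
        assignments.append(unit_to_trt)
    return assignments
```

With k = 5, a draw of 2 5 3 1 4 assigns unit 2 to treatment 1, unit 5 to treatment 2, unit 3 to treatment 3, unit 1 to treatment 4, and unit 4 to treatment 5, matching the worked example above.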
Remark 7.1. Blocking and randomization often make the iid error
assumption hold to a useful approximation.
Definition 7.2. For the one way block design or completely ran-
domized block design (CRBD), there is a factor A with k levels and
there are b blocks. The CRBD model is Yij = μ + τi + βj + eij where
i = 1, . . . , k and j = 1, . . . , b. The ith treatment mean

μi = (1/b) Σ_{j=1}^{b} (μ + τi + βj) = μ + τi.

So the μi are all equal if the τi are all equal. The errors eij are iid with 0
mean and constant variance σ².
Notice that the CRBD model is additive: there is no block treatment in-
teraction. The ANOVA table for the CRBD is like the ANOVA table for a
two way Anova main effects model. Shown below is a CRBD ANOVA table in
symbols. Sometimes "Treatment" is replaced by "Factor" or "Model". Some-
times "Blocks" is replaced by the name of the blocking variable. Sometimes
"Error" is replaced by "Residual".
Source     df              SS    MS    F              p-value
Blocks     b − 1           SSB   MSB   Fblock         pblock
Treatment  k − 1           SSTR  MSTR  F0 = MSTR/MSE  pval for Ho
Error      (k − 1)(b − 1)  SSE   MSE
Rule of thumb 7.1. If pblock ≥ 0.1, then blocking was not useful. If
0.05 < pblock < 0.1, then the usefulness was borderline. If pblock ≤ 0.05,
then blocking was useful.
Remark 7.2. The response, residual, and transformation plots are used
almost in the same way as for the one and two way Anova model, but all of
the dot plots have sample size m = 1. Look for the plotted points falling in
roughly evenly populated bands about the identity line and r = 0 line. See
Problem 7.4 for these plots and the following plot.
Definition 7.3. The block response scatterplot plots blocks versus the
response. The plot will have b dot plots of size k with a symbol corresponding
to the treatment. Dot plots with clearly different means suggest that blocking
was useful. A symbol pattern within the blocks (e.g., symbols A and B are
always highest while C and D are always lowest) suggests that the response
depends on the factor level.
Definition 7.4. Graphical Anova for the CRBD model uses the resid-
uals as a reference set instead of an F distribution. The scaled treatment
deviations √(b − 1)(Ȳi0 − Ȳ00) have about the same variability as the resid-
uals if Ho is true. The scaled block deviations √(k − 1)(Ȳ0j − Ȳ00) also have
about the same variability as the residuals if blocking is ineffective. A dot
plot of the scaled block deviations is placed above the dot plot of the scaled
treatment deviations which is placed above the dot plot of the residuals. For
small n ≤ 40, suppose the distance between two scaled deviations (A and
B, say) is greater than the range of the residuals = max(rij) − min(rij).
Then declare A and B to be significantly different. If the distance is less
than the range, do not declare A and B to be significantly different. Scaled
deviations that lie outside the range of the residuals are significant: the cor-
responding treatment means are significantly different from the overall mean.
For n ≥ 100, let r(1) ≤ r(2) ≤ · · · ≤ r(n) be the order statistics of the
residuals. Then instead of the range, use r(⌈0.975n⌉) − r(⌈0.025n⌉) as the dis-
tance where ⌈x⌉ is the smallest integer ≥ x, e.g. ⌈7.7⌉ = 8. So effects out-
side of the interval (r(⌈0.025n⌉), r(⌈0.975n⌉)) are significant. See Box et al. (2005,
pp. 150–151).
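A minimal Python sketch of these scaled deviations (an illustration of the idea, not the lregpack ganova2 function) for a k × b table of responses with treatment i in block j:

```python
from math import sqrt

def graphical_anova(Y):
    """Y[i][j] = response for treatment i (of k) in block j (of b)."""
    k, b = len(Y), len(Y[0])
    grand = sum(map(sum, Y)) / (k * b)
    trt = [sum(row) / b for row in Y]                       # treatment means
    blk = [sum(Y[i][j] for i in range(k)) / k for j in range(b)]  # block means
    resid = [Y[i][j] - trt[i] - blk[j] + grand
             for i in range(k) for j in range(b)]           # CRBD residuals
    scaled_trt = [sqrt(b - 1) * (t - grand) for t in trt]
    scaled_blk = [sqrt(k - 1) * (x - grand) for x in blk]
    ref_range = max(resid) - min(resid)   # reference range for small n
    return scaled_trt, scaled_blk, resid, ref_range
```

A scaled deviation lying outside (min residual, max residual) is then flagged as significant under the small-n rule of thumb.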
Example 7.2. Ledolter and Swersey (2007, p. 60) give completely ran-
domized block design data. The block variable = market had 4 levels (1
Binghamton, 2 Rockford, 3 Albuquerque, 4 Chattanooga) while the treat-
ment factor had 4 levels (A no advertising, B $6 million, C $12 million, D
$18 million advertising dollars in 1973). The response variable was average
cheese sales (in pounds per store) sold in a 3-month period.
a) From the graphical Anova in Figure 7.1, were the blocks useful?
b) Perform an appropriate 4 step test for whether advertising helped cheese
sales.
Solution: a) In Figure 7.1, the top dot plot is for the scaled block deviations.
The leftmost dot corresponds to blocks 4 and 1, the middle dot to block 3
and the rightmost dot to block 1 (see output from the lregpack function
ganova2). Yes, the blocks were useful since some (actually all) of the dots
corresponding to the scaled block deviations fall outside the range of the
residuals. This result also agrees with pblock = 4.348e−06 < 0.05.
b) i) Ho: μ1 = μ2 = μ3 = μ4   HA: not Ho
ii) Fo = 1.313
iii) pval = 0.3292
iv) Fail to reject Ho, the mean sales does not depend on advertising level.
In Figure 7.1, the middle dot plot is for the scaled treatment deviations.
From left to right, these correspond to B, A, D, and C since the output shows
that the deviation corresponding to C is the largest with value 733.3. Since
Fig. 7.1 Graphical Anova for Example 7.2. [Figure: dot plots of the scaled
block deviations, the scaled treatment deviations, and the residuals.]
the four scaled treatment deviations all lie within the range of the residuals,
the four treatments again do not appear to be signicant.
Example 7.3. Snedecor and Cochran (1967, p. 300) give a data set with
5 types of soybean seed. The response frate = number of seeds out of 100
that failed to germinate. Five blocks were used. On the previous page is a
block response scatterplot where A, B, C, D, and E refer to seed type. The 2
in the second block indicates that A and C both had values 10. Which type
of seed has the highest germination failure rate?
a) A b) B c) C d) D e) E
Solution: a) A since A is on the top for blocks 2–5 and second for block 1.
Fig. 7.2 One Way Block Design Does Not Fit All of the Data. [Figure:
response plot of Y versus FIT and residual plot of RES versus FIT.]
Note: The response and residual plots in Figure 7.2 suggest that one case
is not t well by the model. The Bs and Es in the block response plot suggest
that there may be a block treatment interaction, which is not allowed by the
completely randomized block design. Figure 7.2 was made with the following
commands using the lregpack function aovplots.
Blocking is used to reduce the MSE so that inference such as tests and con-
fidence intervals are more precise. Below is a partial ANOVA table for a k way
Anova design with one block where the degrees of freedom are left blank. For
A, use H0: μ100 = · · · = μl1 00. The other main effects have similar null
hypotheses. For interaction, use H0: no interaction.
These models get complex rapidly as k and the number of levels li in-
crease. As k increases, there are a large number of models to consider. For
experiments, usually the 3 way and higher order interactions are not signif-
icant. Hence a full model that includes the blocks, all k main effects, and
all (k choose 2) two way interactions is a useful starting point for response,
residual, and transformation plots. The higher order interactions can be
treated as potential terms and checked for significance. As a rule of thumb,
significant interactions tend to involve significant main effects.
Source                               df  SS         MS       F       p-value
block                                    SSblock    MSblock  Fblock  pblock
k main effects                           e.g. SSA   MSA      FA      pA
(k choose 2) 2 way interactions          e.g. SSAB  MSAB     FAB     pAB
(k choose 3) 3 way interactions          e.g. SSABC MSABC    FABC    pABC
  ...                                    ...        ...      ...     ...
(k choose k−1) k−1 way interactions
the k way interaction                    SSA···L    MSA···L  FA···L  pA···L
Error                                    SSE        MSE
The following example has one block and 3 factors. Hence there are 3 two
way interactions and 1 three way interaction.
Example 7.4. Snedecor and Cochran (1967, pp. 361364) describe a block
design (2 levels) with three factors: food supplements Lysine (4 levels), Me-
thionine (3 levels), and Protein (2 levels). Male pigs were fed the supplements
in a 4×3×2 factorial arrangement and the response was average daily weight
gain. The ANOVA table is shown on the following page. The model could be
Solution: a) Randomly.
b) Yes, 0.0379 < 0.05.
c) H0: μ0010 = μ0020   HA: not H0
FP = 19.47
pval = 0.0002
Reject H0 , the mean weight gain depends on the protein level.
d) None.
Source df SS MS F pvalue
block 1 0.1334 0.1334 4.85 0.0379
L 3 0.0427 0.0142 0.5164 0.6751
M 2 0.0526 0.0263 0.9564 0.3990
P 1 0.5355 0.5355 19.47 0.0002
LM 6 0.2543 0.0424 1.54 0.2099
LP 3 0.2399 0.0800 2.91 0.0562
MP 2 0.0821 0.0410 1.49 0.2463
LMP 6 0.0685 0.0114 0.4145 0.8617
error 23 0.6319 0.0275
Latin square designs have a lot of structure. The design contains a row block
factor, a column block factor, and a treatment factor, each with a levels. The
two blocking factors, and the treatment factor are crossed, but it is assumed
that there is no interaction. A capital letter is used for each of the a treatment
levels. So a = 3 uses A, B, C while a = 4 uses A, B, C, D.
Five Latin squares are shown below. The first, third, and fifth are standard.
If a = 5, there are 56 standard Latin squares.
A B C   A B C   A B C D   A B C D E   A B C D E
B C A   C A B   B A D C   E A B C D   B A E C D
C A B   B C A   C D A B   D E A B C   C D A E B
                D C B A   C D E A B   D E B A C
                          B C D E A   E C D B A
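The defining property (each treatment letter appears once in every row and every column) and the standard property (first row and first column in order) are easy to verify programmatically. A hedged Python sketch, with invented function names:

```python
def is_latin(square):
    """square: list of a rows, e.g. ["ABC", "BCA", "CAB"]."""
    a = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == a and set(row) == symbols for row in square)
    cols_ok = all({row[j] for row in square} == symbols for j in range(a))
    return rows_ok and cols_ok

def is_standard(square):
    """Standard: first row and first column are in alphabetical order."""
    col = [row[0] for row in square]
    return list(square[0]) == sorted(square[0]) and col == sorted(col)
```

Applied to the five squares above, all five pass is_latin, while only the first, third, and fifth pass is_standard.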
Yijk = μ + τi + βj + γk + eijk

where τi is the ith treatment effect, βj is the jth row block effect, γk is the
kth column block effect with i, j, and k = 1, . . . , a. The errors eijk are iid
with 0 mean and constant variance σ². The ith treatment mean μi = μ + τi.
Shown below is an ANOVA table for the Latin square model given in
symbols. Sometimes "Error" is replaced by "Residual" or "Within Groups".
Sometimes rblocks and cblocks are replaced by the names of the blocking
factors. Sometimes "p-value" is replaced by "P", "Pr(> F)", or "PR > F".
Source      df              SS    MS    F              p-value
rblocks     a − 1           SSRB  MSRB  Frow           prow
cblocks     a − 1           SSCB  MSCB  Fcol           pcol
treatments  a − 1           SSTR  MSTR  Fo = MSTR/MSE  pval
Error       (a − 1)(a − 2)  SSE   MSE
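The entries of this table can be computed from the level means. Below is an illustrative Python sketch (the text uses SAS, Minitab, and R; this function name is invented) for a Latin square with a treatments:

```python
def latin_square_anova(y, rows, cols, trts):
    """y[s]: response; rows[s], cols[s], trts[s]: levels 1..a of observation s."""
    a = max(rows)
    grand = sum(y) / len(y)
    def means(labels):
        # mean response at each level 1..a of a factor (a observations per level)
        return [sum(v for v, lev in zip(y, labels) if lev == level) / a
                for level in range(1, a + 1)]
    ssrb = a * sum((m - grand) ** 2 for m in means(rows))
    sscb = a * sum((m - grand) ** 2 for m in means(cols))
    sstr = a * sum((m - grand) ** 2 for m in means(trts))
    sse = sum((v - grand) ** 2 for v in y) - ssrb - sscb - sstr
    mse = sse / ((a - 1) * (a - 2))
    F0 = (sstr / (a - 1)) / mse
    return ssrb, sscb, sstr, sse, F0
```

On the Dunn and Clark (1974) wheat data of Example 7.5 below, this sketch reproduces SSRB = 774.335, SSCB = 133.425, SSTR = 1489.4, SSE = 611.1, and F0 ≈ 4.87, matching the ANOVA table output.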
Rule of thumb 7.2. Let pblock be prow or pcol. If pblock ≥ 0.1, then block-
ing was not useful. If 0.05 < pblock < 0.1, then the usefulness was borderline.
If pblock ≤ 0.05, then blocking was useful.
Be able to perform the 4 step ANOVA F test for the Latin square
design. This test is similar to the fixed effects one way ANOVA F test.
i) Ho: μ1 = μ2 = · · · = μa and HA: not Ho.
ii) Fo = MSTR/MSE is usually given by output.
iii) The pval = P(Fa−1,(a−1)(a−2) > Fo) is usually given by output.
iv) If the pval ≤ δ, reject Ho and conclude that the mean response depends
on the factor level. Otherwise fail to reject Ho and conclude that the mean
response does not depend on the factor level. (Or there is not enough evidence
to conclude that the mean response depends on the factor level.) Give a
nontechnical sentence. Use δ = 0.05 if δ is not given.
Remark 7.4. The response, residual, and transformation plots are used
almost in the same way as for the one and two way Anova models, but all of
the dot plots have sample size m = 1. Look for the plotted points falling in
roughly evenly populated bands about the identity line and r = 0 line. See
Problem 7.5 and the following example.
Source df SS MS F P
rblocks 3 774.335 258.1117 2.53 0.1533
cblocks 3 133.425 44.4750 0.44 0.7349
fertilizer 3 1489.400 496.4667 4.87 0.0476
error 6 611.100 101.8500
Example 7.5. Dunn and Clark (1974, p. 129) examine a study of four
fertilizers on yields of wheat. The row blocks were 4 types of wheat. The
column blocks were 4 plots of land. Each plot was divided into 4 subplots
and a Latin square design was used. (To illustrate the inference for Latin
square designs, ignore the fact that the data had an outlier. Case 14 had a
yield of 64.5 while the next highest yield was 35.5. For the response plot in
Figure 7.3, note that both Y and Ŷ are large for the high yield. Also note
that Ŷ underestimates Y by about 10 for this case.)
a) Were the row blocks useful? Explain briey.
b) Were the column blocks useful? Explain briey.
c) Do an appropriate 4 step test.
Solution:
a) No, prow = 0.1533 > 0.1.
b) No, pcol = 0.7349 > 0.1.
c) i) H0: μ1 = μ2 = μ3 = μ4   HA: not H0
ii) F0 = 4.87
iii) pval = 0.0476
iv) Reject H0 . The mean yield depends on the fertilizer level.
Figure 7.3 was made with the following commands using the lregpack func-
tion aovplots.
rblocks <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4)
cblocks <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
fertilizer <- c(1,2,3,4,2, 3, 4, 1, 3, 4, 1, 2, 4, 1, 2, 3)
yield <- c(35.5,24.5,14.7,35.5, 14.4, 6.2, 13.7, 24.5, 14.1,
16.2, 34.3, 19.7, 15.0, 64.5, 34.6, 19.0)
rblocks <- factor(rblocks)
cblocks <- factor(cblocks)
fertilizer <- factor(fertilizer)
dcls <- data.frame(yield,rblocks,cblocks,fertilizer)
rm(yield,rblocks,cblocks,fertilizer)
Fig. 7.3 Latin Square Data. [Figure: response plot of Y versus FIT and
residual plot of RES versus FIT.]
attach(dcls)
z <- aov(yield~rblocks+cblocks+fertilizer)
summary(z)
aovplots(Y=yield,FIT=fitted(z),RES=resid(z))
#right click Stop twice, drag the plots to make them square
detach(dcls)
Remark 7.5. The Latin square model is additive, but the model is often
incorrectly used to study nuisance factors that can interact. Factorial or
fractional factorial designs should be used when interaction is possible.
Example 7.6. In the social sciences, often a blocking factor is time: the
levels are the a time slots. Following Cobb (1998, p. 254), a Latin square design
was used to study the response Y = blood sugar level, where the row blocks
were 4 rabbits, the column blocks were 4 time slots, and the treatments were
4 levels of insulin. Label the rabbits as I, II, III, and IV; the dates as 1, 2, 3,
4; and the 4 insulin levels i1 < i2 < i3 < i4 as 1, 2, 3, 4. Suppose the random
permutation for the rabbits was 3, 1, 4, 2; the permutation for the dates 1,
4, 3, 2; and the permutation for the insulin levels was 2, 3, 4, 1. Then i2 is
the northwest corner of the square gets B = variety 2, the northeast corner
gets D = variety 4, the southwest corner gets A = variety 3, the southeast
corner gets C = variety 5, et cetera.
7.4 Summary
Source     df              SS    MS    F              p-value
Blocks     b − 1           SSB   MSB   Fblock         pblock
Treatment  k − 1           SSTR  MSTR  F0 = MSTR/MSE  pval for Ho
Error      (k − 1)(b − 1)  SSE   MSE
6) Rule of thumb: If pblock ≥ 0.1, then blocking was not useful. If 0.05 <
pblock < 0.1, then the usefulness was borderline. If pblock ≤ 0.05, then blocking
was useful.
7) The response, residual, and transformation plots for CRBD models are
used almost in the same way as for the one and two way Anova model, but all
of the dot plots have sample size m = 1. Look for the plotted points falling
in roughly evenly populated bands about the identity line and r = 0 line.
8) The block response scatterplot plots blocks versus the response.
The plot will have b dot plots of size k with a symbol corresponding to the
treatment. Dot plots with clearly different means suggest that blocking was
useful. A symbol pattern within the blocks suggests that the response depends
on the factor level.
9) Shown is an ANOVA table for the Latin square model given in sym-
bols. Sometimes "Error" is replaced by "Residual" or "Within Groups".
Sometimes rblocks and cblocks are replaced by the blocking factor name.
Sometimes "p-value" is replaced by "P", "Pr(> F)", or "PR > F".
Source      df              SS    MS    F              p-value
rblocks     a − 1           SSRB  MSRB  Frow           prow
cblocks     a − 1           SSCB  MSCB  Fcol           pcol
treatments  a − 1           SSTR  MSTR  Fo = MSTR/MSE  pval
Error       (a − 1)(a − 2)  SSE   MSE
10) Let pblock be prow or pcol. Rule of thumb: If pblock ≥ 0.1, then blocking
was not useful. If 0.05 < pblock < 0.1, then the usefulness was borderline. If
pblock ≤ 0.05, then blocking was useful.
11) The ANOVA F test for the Latin square design with a treatments is
nearly the same as the fixed effects one way ANOVA F test.
i) Ho: μ1 = μ2 = · · · = μa and HA: not Ho.
ii) Fo = MSTR/MSE is usually given by output.
iii) The pval = P(Fa−1,(a−1)(a−2) > Fo) is usually given by output.
iv) If the pval ≤ δ, reject Ho and conclude that the mean response depends
on the factor level. Otherwise fail to reject Ho and conclude that the mean
response does not depend on the factor level. Give a nontechnical sentence.
12) The response, residual, and transformation plots for Latin square de-
signs are used almost in the same way as for the one and two way Anova
models, but all of the dot plots have sample size m = 1. Look for the plotted
points falling in roughly evenly populated bands about the identity line and
r = 0 line.
13) The randomization is done in 3 steps. Draw 3 random permutations
of 1, . . . , a. Use the 1st permutation to randomly assign row block levels to
the numbers 1, . . . , a. Use the 2nd permutation to randomly assign column
block levels to the numbers 1, . . . , a. Use the 3rd permutation to randomly
assign treatment levels to the 1st a letters (A, B, C, and D if a = 4).
14) Graphical Anova for the completely randomized block de-
sign makes a dot plot of the scaled block deviations √(k − 1) β̂j =
√(k − 1)(ȳ0j0 − ȳ000) on top, a dot plot of the scaled treatment deviations
(effects) √(b − 1) τ̂i = √(b − 1)(ȳi00 − ȳ000) in the middle, and a dot plot
of the residuals on the bottom. Here k is the number of treatments and b is
the number of blocks.
15) Graphical Anova uses the residuals as a reference distribution. Suppose
the dot plot of the residuals looks good. Rules of thumb: i) An effect is
marginally significant if its scaled deviation is as big as the biggest residual
or as negative as the most negative residual. ii) An effect is significant if it is
well beyond the minimum or maximum residual. iii) Blocking was effective
if at least one scaled block deviation is beyond the range of the residuals.
iv) The treatments are different if at least one scaled treatment effect is
beyond the range of the residuals. (These rules depend on the number of
residuals n. If n is very small, say 8, then the scaled effect should be well
beyond the range of the residuals to be significant. If n is 40, the value
of the minimum residual and the value of the maximum residual correspond
to a 1/40 + 1/40 = 1/20 = 0.05 critical value for significance.)
7.5 Complements
Box et al. (2005, pp. 150–156) explain Graphical Anova for the CRBD and
why randomization combined with blocking often makes the iid error assump-
tion hold to a reasonable approximation.
It is easier to see model deficiencies if the response and residual plots are
square. In R, drag the plots so the plots look square. Matched pairs tests are
a special case of CRBD with k = 2.
The R package granova may be useful for graphical Anova. It is available
from (http://streaming.stat.iastate.edu/CRAN/) and authored by R.M.
Pruzek and J.E. Helmreich. Also see Hoaglin et al. (1991).
A randomization test has H0: the different treatments have no effect.
This null hypothesis is also true if within each block, all k pdfs are from the
same location family. Let j = 1, . . . , b index the b blocks. There are b pdfs, one
for each block, that come from the same location family but possibly different
location parameters: fZ(y − μ0j). Let A be the treatment factor with k levels
ai. Then Yij | (A = ai) ~ fZ(y − μ0j) where j is fixed and i = 1, . . . , k.
Thus the levels ai have no effect on the response, and the Yij are iid within
each block if H0 holds. Note that there are k! ways to assign Y1j, . . . , Ykj
to the k treatments within each block. An impractical randomization test
uses all M = [k!]^b ways of assigning responses to treatments. Let F0 be the
usual CRBD F statistic. The F statistic is computed for each of the M
permutations and H0 is rejected if the proportion of the M F statistics that
are larger than F0 is less than δ. The distribution of the M F statistics is
approximately Fk−1,(k−1)(b−1) for large n under H0. The randomization test
and the usual CRBD F test also have the same power, asymptotically. See
Hoeffding (1952) and Robinson (1973). These results suggest that the usual
CRBD F test is semiparametric: the p-value is approximately correct if n is
large and if all k pdfs Yij | (A = ai) ~ fZ(y − μ0j) are the same for each block
where j is fixed and i = 1, . . . , k. If H0 does not hold, then there are kb pdfs
Yij | (A = ai) ~ fZ(y − μij) from the same location family. Hence the location
parameter depends on both the block and treatment.
Olive (2014, section 9.3) shows that practical randomization tests that
use a random sample of max(1000, [n log(n)]) randomizations have level and
power similar to the tests that use all M possible randomizations. Here each
randomization uses b randomly drawn permutations of 1, . . . , k.
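A practical randomization test along these lines can be sketched in Python (an illustration of the idea, not the Olive (2014) procedure; the function names are invented). Each randomization permutes the responses within every block and recomputes the CRBD F statistic.

```python
import random

def crbd_F(Y):
    """CRBD F statistic; Y[i][j] = response for treatment i in block j.
    Assumes SSE > 0 (noisy data)."""
    k, b = len(Y), len(Y[0])
    grand = sum(map(sum, Y)) / (k * b)
    trt = [sum(row) / b for row in Y]
    blk = [sum(Y[i][j] for i in range(k)) / k for j in range(b)]
    sstr = b * sum((t - grand) ** 2 for t in trt)
    sse = sum((Y[i][j] - trt[i] - blk[j] + grand) ** 2
              for i in range(k) for j in range(b))
    return (sstr / (k - 1)) / (sse / ((k - 1) * (b - 1)))

def randomization_pval(Y, reps=500, rng=None):
    """Estimate the randomization p-value: the proportion of within-block
    permutations whose F statistic is at least the observed F0."""
    rng = rng or random.Random()
    k, b = len(Y), len(Y[0])
    F0 = crbd_F(Y)
    hits = 0
    for _ in range(reps):
        # permute the k responses within each block (one permutation per block)
        cols = [rng.sample([Y[i][j] for i in range(k)], k) for j in range(b)]
        Z = [[cols[j][i] for j in range(b)] for i in range(k)]
        hits += crbd_F(Z) >= F0
    return hits / reps
```

With strong treatment effects the estimated p-value is small, since only the few permutations that preserve the treatment grouping reproduce a large F.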
Hunter (1989) discusses some problems with the Latin square design.
Welch (1990) suggests that the ANOVA F test is not a good approximation
for the permutation test for the Latin square design.
7.6 Problems
7.4. This problem is for a one way block design and uses data from Box
et al. (2005, p. 146).
a) Copy and paste the SAS program for this problem from
(http://lagrange.math.siu.edu/Olive/lreghw.txt). Print out the output,
but only turn in the ANOVA table, residual plot, and response plot.
b) Do the plots look ok?
c) Copy the SAS data into Minitab much as done for Problem 6.4. Right
below C1 type block, below C2 type treat, and below C3 type yield.
d) Select Stat>ANOVA>Two-way, select C3 yield as the response and
C1 block as the row factor and C2 treat as the column factor. Click on
Fit additive model, click on Store residuals, and click on Store fits.
Then click on OK.
e) block response scatterplot: Use the menu commands Edit>Command
Line Editor and write the following lines in the window.
GSTD
LPLOT yield vs block codes for treat
f) Click on the submit commands box and print the plot. Click on the
output and then click on the printer icon.
g) Copy (http://lagrange.math.siu.edu/Olive/lregdata.txt) into R.
Type the following commands to get the following ANOVA table.
z<-aov(yield~block+treat,pen)
summary(z)
7.5. This problem is for a Latin square design and uses data from Box
et al. (2005, pp. 157-160).
Copy and paste the SAS program for this problem from
(http://lagrange.math.siu.edu/Olive/lreghw.txt).
a) Click on the output and use the menu commands Edit>Select All
and Edit>Copy. In Word use the menu command Paste, then use the
left mouse button to highlight the first page of output. Then use the menu
command Cut. Then there should be one page of output including the
ANOVA table. Print out this page.
b) Copy the data for this problem from
(http://lagrange.math.siu.edu/Olive/lregdata.txt)
into R. Use the following commands to create a residual plot. Copy and paste
the plot into Word. (Click on the plot and simultaneously hit the Ctrl and c
buttons. Then go to Word and use the menu command Paste.)
z<-aov(emissions~rblocks+cblocks+additives,auto)
summary(z)
plot(fitted(z),resid(z))
title("Residual Plot")
abline(0,0)
c) Use the following commands to create a response plot. Copy and paste
the plot into Word. (Click on the plot and simultaneously hit the Ctrl and c
buttons. Then go to Word and use the menu command Paste.)
attach(auto)
FIT <- auto$emissions - z$resid
plot(FIT,auto$emissions)
title("Response Plot")
abline(0,1)
detach(auto)
d) Do the plots look ok?
e) Were the column blocks useful? Explain briey.
f) Were the row blocks useful? Explain briey.
g) Do an appropriate 4 step test.
7.6. Obtain the Box et al. (2005, p. 146) penicillin data from
(http://lagrange.math.siu.edu/Olive/lregdata.txt) and the R pro-
gram ganova2 from (http://lagrange.math.siu.edu/Olive/lregpack.
txt). The program does graphical Anova for completely randomized block
designs.
a) Copy and paste the R commands for this problem into R. Include the
plot in Word by simultaneously pressing the Ctrl and c keys, then using the
menu command Paste in Word.
b) Blocking seems useful because some of the scaled block deviations are
outside of the spread of the residuals. The scaled treatment deviations are in
the middle of the plot. Do the treatments appear to be significantly different?
Chapter 8
Orthogonal Designs
Orthogonal designs for factors with two levels can be fit using least squares.
The orthogonality of the contrasts allows each coefficient to be estimated
independently of the other variables in the model.
This chapter covers 2^k factorial designs, 2^(k-f)_R fractional factorial
designs, and Plackett Burman PB(n) designs. The entries in the design matrix
X are either -1 or 1. The columns of the design matrix X are orthogonal:
ci^T cj = 0 for i ≠ j where ci is the ith column of X. Also ci^T ci = n, and the
absolute values of the column entries sum to n.
The first column of X is 1, the vector of ones, but the remaining columns
of X are the coefficients of a contrast. Hence the ith column ci has entries
that are -1 or 1, and the entries of the ith column ci sum to 0 for i > 1.
Factorial designs are a special case of the k way Anova designs of Chapter 6,
and these designs use factorial crossing to compare the effects (main effects,
pairwise interactions, . . . , k-fold interaction) of the k factors. If A1, . . . , Ak are
the factors with li levels for i = 1, . . . , k, then there are l1 l2 · · · lk treatments,
where each treatment uses exactly one level from each factor. The sample
size n = m l1 l2 · · · lk ≥ m 2^k. Hence the sample size grows exponentially fast
with k. Often the number of replications m = 1.
Often each run is expensive, for example, in industry and medicine. A goal
is to improve the product in terms of higher quality or lower cost. Often the
subject matter experts can think of many factors that might improve the
product. The number of runs n is minimized by taking li = 2 for i = 1, . . . , k.
Rule of thumb 8.1. Do not spend more than 25% of the budget on the
initial experiment. It may be a good idea to plan for four experiments, each
taking 25% of the budget.
Definition 8.3. Recall that a contrast C = Σ_{i=1}^p di μi where
Σ_{i=1}^p di = 0, and the estimated contrast is Ĉ = Σ_{i=1}^p di Ȳi0, where
the μi and Ȳi0 are appropriate population and sample means. In a table of
contrasts, the coefficients di of the contrast are given, where a - corresponds
to -1 and a + corresponds to 1. Sometimes a column I corresponding to the
overall mean is given where each entry is a +. The column corresponding to
I is not a contrast.
To make a table of contrasts there is a rule for main effects and a rule for
interactions.
8.1 Factorial Designs 247
I  A  B  C  AB  AC  BC  ABC    y
+  -  -  -  +   +   +   -    ȳ1110
+  +  -  -  -   -   +   +    ȳ2110
+  -  +  -  -   +   -   +    ȳ1210
+  +  +  -  +   -   -   -    ȳ2210
+  -  -  +  +   -   -   +    ȳ1120
+  +  -  +  -   +   -   -    ȳ2120
+  -  +  +  -   -   +   -    ȳ1220
+  +  +  +  +   +   +   +    ȳ2220
divisor 8  4  4  4   4   4   4
The table of contrasts for a 2^4 design is shown on the following page. The
column of ones corresponding to I was omitted. Again rows correspond to
runs, and the levels of the main effects A, B, C, and D completely specify the
run. The first row of the table corresponds to the low levels of A, B, C, and
D. In the second row, the level of A is high while B, C, and D are low. Note
that the interactions are obtained by multiplying the component columns,
where + = 1 and - = -1. Hence the first row of the column corresponding
to the ABC entry is (-)(-)(-) = -.
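The multiplication rule for interaction columns, and the orthogonality claims made at the start of the chapter, are easy to check numerically. Below is an illustrative Python sketch (the text's own software is R; the function names here are not from the text): the runs are generated in standard order, and each contrast column is the entrywise product of its component main effect columns.

```python
from itertools import product

def runs_2k(k):
    """Runs of a 2^k design in standard order (first factor varies
    fastest); each run is a tuple of k levels, each -1 or 1."""
    return [t[::-1] for t in product([-1, 1], repeat=k)]

def contrast(runs, factors):
    """Contrast column for the given factors, e.g. (0, 2) for AC:
    the entrywise product of the component main effect columns."""
    col = []
    for r in runs:
        v = 1
        for f in factors:
            v *= r[f]
        col.append(v)
    return col
```

For k = 3 this reproduces the table above: the ABC column starts with (-)(-)(-) = -1, every contrast column sums to 0, distinct columns have zero dot product, and c^T c = n = 8.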
Randomization for a 2^k design: The runs are determined by the levels
of the k main effects in the table of contrasts. So a 2^3 design is determined by
the levels of A, B, and C. Similarly, a 2^4 design is determined by the levels
of A, B, C, and D. Randomly assign units to the m 2^k runs. Often the units
are time slots. If possible, perform the m 2^k runs in random order.
Genuine run replicates need to be used. A common error is to take m
measurements per run, and act as if the m measurements are from m runs.
Definition 8.4. If the response depends on the two levels of the factor,
then the factor is called active. If the response does not depend on the two
levels of the factor, then the factor is called inert.
Active factors appear to change the mean response as the level of the factor
changes from -1 to 1. Inert factors do not appear to change the response as
the level of the factor changes from -1 to 1. An inert factor could be needed
but the level low or high is not important, or the inert factor may not be
needed and so can be omitted from future studies. Often subject matter
experts can tell whether the inert factor is needed or not.
The 2^k designs are used for exploratory data analysis: they provide
answers to the following questions.
i) Which combinations of levels are best?
ii) Which factors are active and which are inert? That is, use the 2^k design
to screen for factors where the response depends on whether the level is high
or low.
iii) How should the levels be modified to improve the response?
If all 2^k runs give roughly the same response, then choose the levels that
are cheapest to increase profit. Also the system tends to be robust to changes
in the factor space, so managers do not need to worry about the exact values
of the levels of the factors.
In an experiment, there will be an interaction between management, sub-
ject matter experts (often engineers), and the data analyst (statistician).
Remark 8.1. If m = 1, then there is one response per run but k main
effects, C(k,2) 2 factor interactions, C(k,j) j factor interactions, and 1 k way
interaction. Then the MSE df = 0 unless at least one high order interaction
is assumed to be zero. A full model that includes all k main effects and
all C(k,2) two way interactions is a useful starting point for response, residual,
and transformation plots. The higher order interactions can be treated as
potential terms and checked for significance. As a rule of thumb, significant
interactions tend to involve significant main effects.
Rule of thumb 8.2. Mentally add 2 lines parallel to the identity line and
2 lines parallel to the r = 0 line that cover most of the cases. Then a case
is an outlier if it is well beyond these 2 lines. This rule often fails for large
outliers since often the identity line goes through or near a large outlier, so
its residual is near zero. Often such outliers are still far from the bulk of
the data, and there will be a gap in the response plot (along the identity
line) separating the bulk of the data from the outliers. Such gaps appear in
Figures 3.7, 3.10b) (in an FF plot), 3.11, and 7.3, where the gap would be
easier to see if the plot were square. A response that is far from the bulk of
the data in the response plot is a large outlier (large in magnitude).
Rule of thumb 8.3. Often an outlier is very good, but more often an
outlier is due to a measurement error and is very bad.
Definition 8.6. A critical mix is a single combination of levels, out of
2^k, that gives good results. Hence a critical mix produces good outliers (or a
single outlier if m = 1).
Be able to pick out active and inert factors and good (or the best) combi-
nations of factors (cells or runs) from the table of contrasts = table of runs.
Often the table will only contain the contrasts for the main effects. If high
values of the response are desirable, look for high values of ȳ for m > 1. If
m = 1, then ȳ = y. The following two examples help illustrate the process.
O  H  C    y
-  -  -  5.9
+  -  -  4.0
-  +  -  3.9
+  +  -  1.2
-  -  +  5.3
+  -  +  4.8
-  +  +  6.3
+  +  +  0.8
Solution: a) The two lowest values of y are 0.8 and 1.2, which correspond
to + + + and + + -. (Note that if the 1.2 was 4.2, then + + + corresponding
to 0.8 would be a critical mix.)
b) C would be inert since O and H should be at their high + levels.
run  R  T  C  D    y
1    -  -  -  -   14
2    +  -  -  -   16
3    -  +  -  -    8
4    +  +  -  -   22
5    -  -  +  -   19
6    +  -  +  -   37
7    -  +  +  -   20
8    +  +  +  -   38
9    -  -  -  +    1
10   +  -  -  +    8
11   -  +  -  +    4
12   +  +  -  +   10
13   -  -  +  +   12
14   +  -  +  +   30
15   -  +  +  +   13
16   +  +  +  +   30
Suppose the model using all of the columns of X is used. If some columns
are removed (e.g. those corresponding to the insignificant effects), then for
2^k designs the following quantities remain unchanged for the terms that were
not deleted: the effects, the coefficients, and SS(effect) = MS(effect). The
MSE, SE(effect), F and t statistics, p-values, fitted values, and residuals do
change.
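The invariance of the coefficients follows from the normal equations: for an orthogonal ±1 design, X^T X = nI, so each coefficient equals c^T y/n and depends only on its own column. An illustrative Python sketch (the text's own software is R; the function names here are not from the text):

```python
from itertools import product

def model_matrix_23():
    """Full 2^3 model matrix in standard order with columns
    I, A, B, C, AB, AC, BC, ABC."""
    rows = []
    for c, b, a in product([-1, 1], repeat=3):   # a varies fastest
        rows.append([1, a, b, c, a * b, a * c, b * c, a * b * c])
    return rows

def ls_coefs(rows, keep, y):
    """Least squares coefficients using only the columns in keep.
    The kept columns are orthogonal with c^T c = n, so the normal
    equations reduce to beta_hat_j = c_j^T y / n, column by column."""
    n = len(y)
    for j in keep:                       # verify X^T X = n I
        for l in keep:
            dot = sum(r[j] * r[l] for r in rows)
            assert dot == (n if j == l else 0)
    return {j: sum(r[j] * v for r, v in zip(rows, y)) / n for j in keep}
```

Fitting the full model and then a reduced model gives identical coefficients for the retained terms, which is the invariance claimed above.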
The regression equation corresponding to the significant effects (e.g. found
with a QQ plot of Definition 8.9) can be used to form a reduced model. For
example, suppose the full (least squares) fitted model is Ŷi = β̂0 + β̂1 xi1 +
β̂2 xi2 + β̂3 xi3 + β̂12 xi12 + β̂13 xi13 + β̂23 xi23 + β̂123 xi123. Suppose the A, B, and
AB effects are significant. Then the reduced (least squares) fitted model is
Ŷi = β̂0 + β̂1 xi1 + β̂2 xi2 + β̂12 xi12, where the coefficients (β̂s) for the reduced
model can be taken from the full model since the 2^k design is orthogonal.
The coefficient β̂0 corresponding to I is equal to the I effect, but the
coefficient β̂ of a factor x corresponding to an effect is β̂ = 0.5 (effect). Consider
significant effects and assume interactions can be ignored.
i) If a large response Y is desired and β̂ > 0, use x = 1. If β̂ < 0, use
x = -1.
ii) If a small response Y is desired and β̂ > 0, use x = -1. If β̂ < 0, use
x = 1.
Rule of thumb 8.4. To predict Y with Ŷ, the number of coefficients =
the number of β̂s in the model should be ≤ n/2, where the sample size n =
number of runs. Otherwise the model is overfitting.
From the regression equation Ŷ = x^T β̂, be able to predict Y given x. Be
able to tell whether x = 1 or x = -1 should be used. Given the x values
of the main effects, get the x values of the interactions by multiplying the
columns corresponding to the main effects.
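This prediction recipe can be sketched in a few lines. Below is an illustrative Python sketch (the helper names are not from the text): interaction x values are products of the main effect x values, and the sign rule in i) and ii) picks the levels.

```python
def predict(b0, coefs, x):
    """Predict Y from a fitted 2^k model. coefs maps a tuple of main
    effect indices (e.g. (0, 1) for the AB term) to its coefficient;
    the x value of an interaction is the product of the main effect
    x values, each of which is -1 or 1."""
    yhat = b0
    for factors, b in coefs.items():
        v = 1
        for f in factors:
            v *= x[f]
        yhat += b * v
    return yhat

def best_levels(main_coefs, want_large=True):
    """Sign rule i)/ii): ignoring interactions, set x = 1 when the
    coefficient sign matches the goal, else x = -1."""
    s = 1 if want_large else -1
    return {f: (1 if b * s > 0 else -1) for f, b in main_coefs.items()}
```

For example, with β̂0 = 10, β̂1 = 2, β̂2 = -1.5, β̂12 = 0.5 and the choice x1 = 1, x2 = -1, the predicted value is 10 + 2 + 1.5 - 0.5 = 13.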
Least squares output in symbols is shown below. Often "Estimate" is re-
placed by "Coef" or "Coefficient". Often "Intercept" is replaced by "Con-
stant". The t statistic and p-value are for whether the term or effect is sig-
nificant. So t12 and p12 are for testing whether the x12 term or AB effect is
significant.
The least squares coefficient β̂ = 0.5 (effect). The sum of squares for an x
corresponding to an effect is equal to SS(effect).
SE(coef) = SE(β̂) = 0.5 SE(effect) = sqrt(MSE/n). Also SE(β̂0) = sqrt(MSE/n).
Example 8.3. a) The biggest possible model for the 2^3 design is Y =
β0 + β1 x1 + β2 x2 + β3 x3 + β12 x12 + β13 x13 + β23 x23 + β123 x123 + e, with least
squares fitted or predicted values given by Ŷi = β̂0 + β̂1 xi1 + β̂2 xi2 + β̂3 xi3 +
β̂12 xi12 + β̂13 xi13 + β̂23 xi23 + β̂123 xi123.
The second order model is Y = β0 + β1 x1 + β2 x2 + β3 x3 + β12 x12 + β13 x13 +
β23 x23 + e. The main effects model is Y = β0 + β1 x1 + β2 x2 + β3 x3 + e.
b) A typical least squares output for the 2^3 design with m = 2 is shown
below. Often "Estimate" is replaced by "Coef".
Residual Standard Error=2.8284 = sqrt(MSE)
R-Square=0.9763 F-statistic (df=7, 8)=47.054 p-value=0
There are several advantages to least squares over 2^k software. The dis-
advantage of the following four points is that the design will no longer be
orthogonal: the estimated coefficients and hence the estimated effects will
depend on the terms in the model. i) If there are several missing values or
outliers, delete the corresponding rows from the design matrix X and the
vector of responses y, as long as the number of rows of the design matrix is
≥ the number of columns. ii) If the exact quantitative levels are not observed,
replace them by the observed levels cx in the design matrix. iii) If the wrong
levels are used in a run, replace the corresponding row in the design matrix
by a row corresponding to the levels actually used. iv) The number of
replications per run i can be mi, that is, we do not need mi ≡ m.
If the number of replications m ≥ 2, then the standard error for the effect is

SE(effect) = sqrt(MSE/(m 2^(k-2))).   (8.2)

Sometimes MSE is replaced by σ̂².

SE(mean) = sqrt(MSE/(m 2^k))   (8.3)

where m 2^k = n, m ≥ 2, and sometimes MSE is replaced by σ̂².
The sum of squares for an effect is also the mean square for the effect since
df = 1.

MS(effect) = SS(effect) = m 2^(k-2) (effect)^2   (8.4)

for m ≥ 1.
A 95% confidence interval (CI) for an effect is

effect ± t(dfe, 0.975) SE(effect)

where dfe is the MSE degrees of freedom. Use t(dfe, 0.975) ≈ z(0.975) = 1.96 if
dfe > 30. The effect is significant if the CI does not contain 0, while the effect
is not significant if the CI contains 0.
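Equation (8.2) and the CI are easy to compute. For example, with m = 3, k = 3, and MSE = 0.54 (the values in the twocub output later in this chapter), SE(effect) = sqrt(0.54/6) = 0.3. An illustrative Python sketch (the text's software is R; since only the standard library is used, the t quantile must be supplied by hand from tables):

```python
import math

def se_effect(mse, m, k):
    """Equation (8.2): SE(effect) = sqrt(MSE / (m 2^(k-2)))."""
    return math.sqrt(mse / (m * 2 ** (k - 2)))

def effect_ci(effect, mse, m, k, t975):
    """95% CI: effect +- t(dfe, 0.975) SE(effect). Pass in the t
    quantile for dfe = (m-1)2^k, or 1.96 if dfe > 30."""
    half = t975 * se_effect(mse, m, k)
    return effect - half, effect + half
```

With effect = 2.5 and dfe = (3-1)2^3 = 16, so t(16, 0.975) ≈ 2.120, the CI is roughly (1.86, 3.14); since it does not contain 0, the effect is significant.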
Source   df         SS      MS      F      p-value
A        1          SSA     MSA     FA     pA
B        1          SSB     MSB     FB     pB
C        1          SSC     MSC     FC     pC
AB       1          SSAB    MSAB    FAB    pAB
AC       1          SSAC    MSAC    FAC    pAC
BC       1          SSBC    MSBC    FBC    pBC
ABC      1          SSABC   MSABC   FABC   pABC
Error    (m-1)2^k   SSE     MSE

One can use t statistics for effects with t0 = effect/SE(effect) ~ t(dfe), where
dfe is the MSE df. Then t0^2 = MS(effect)/MSE = F0 ~ F(1, dfe).
Source                               df         SS             MS      F       p-value
k main effects                       1 each     e.g. SSA =     MSA     FA      pA
C(k,2) 2 factor interactions         1 each     e.g. SSAB =    MSAB    FAB     pAB
C(k,3) 3 factor interactions         1 each     e.g. SSABC =   MSABC   FABC    pABC
...
C(k,k-1) (k-1) factor interactions   1 each
the k factor interaction             1          SSA...L =      MSA...L FA...L  pA...L
Error                                (m-1)2^k   SSE            MSE
I  A  B  C  AB  AC  BC  ABC      y
+  -  -  -  +   +   +   -    6.333
+  +  -  -  -   -   +   +    4.667
+  -  +  -  -   +   -   +    9.0
+  +  +  -  +   -   -   -    6.667
+  -  -  +  +   -   -   +    4.333
+  +  -  +  -   +   -   -    2.333
+  -  +  +  -   -   +   -    7.333
+  +  +  +  +   +   +   +    4.667
divisor 8  4  4  4   4   4   4
Fig. 8.1 Normal QQ plot of the effects.
The lregpack functions twocub and twofourth can be used to find the ef-
fects, SE(effect), and QQ plots for 2^3 and 2^4 designs. If m = 1, the twofourth
function also makes the response and residual plots based on the second order
model for 2^4 designs.
For the data in Example 8.4, the output below and on the following page
shows that the A and C effects have values -2.166 and -2.000 while the B
effect is 2.500. These are the three significant effects shown in the QQ plot
in Figure 8.1. The two commands below produced the output.
z<-c(6.333,4.667,9,6.667,4.333,2.333,7.333,4.667)
twocub(z,m=3,MSE=0.54)
$Aeff
[1] -2.16625
$Beff
[1] 2.50025
$Ceff
[1] -2.00025
$ABeff
[1] -0.33325
$ACeff
[1] -0.16675
$BCeff
[1] 0.16675
$ABCeff
[1] 0.00025
$MSA
[1] 28.15583
$MSB
[1] 37.5075
$MSC
[1] 24.006
$MSAB
[1] 0.6663334
$MSAC
[1] 0.1668334
$MSABC
[1] 3.75e-07
$MSE
[1] 0.54
$SEeff
[1] 0.3
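The effects in this output can be reproduced directly from the table of contrasts: for a 2^3 design, effect = c^T ȳ/4 and MS(effect) = m 2^(k-2) (effect)^2 = 2m(effect)^2. Below is an illustrative Python check (this is not the twocub source, which is R):

```python
from itertools import product

def effects_23(ybar, m):
    """Effects and MS(effect) for a 2^3 design from the cell means
    ybar in standard order: effect = c^T ybar / 4 and
    MS(effect) = 2 m (effect)^2."""
    runs = [(a, b, c) for c, b, a in product([-1, 1], repeat=3)]
    terms = {"A": (0,), "B": (1,), "C": (2,), "AB": (0, 1),
             "AC": (0, 2), "BC": (1, 2), "ABC": (0, 1, 2)}
    out = {}
    for name, factors in terms.items():
        col = []
        for r in runs:
            v = 1
            for f in factors:
                v *= r[f]
            col.append(v)
        eff = sum(c * y for c, y in zip(col, ybar)) / 4
        out[name] = (eff, 2 * m * eff ** 2)
    return out
```

Applied to the cell means above with m = 3, this recovers the A, B, and C effects -2.16625, 2.50025, and -2.00025 and MSA ≈ 28.156, matching the twocub output.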
Factorial designs are expensive since n = m 2^k when there are k factors and m
replications. A fractional factorial design uses n = m 2^(k-f) where f is defined
below, and so costs much less. Such designs can be useful when the higher
order interactions are not significant.
Definition 8.10. A 2^(k-f)_R fractional factorial design has k factors and
takes m 2^(k-f) runs, where the number of replications m is usually 1. The design
is an orthogonal design and each factor has two levels, low = -1 and high =
1. R is the resolution of the design.
Remark 8.2. A 2^(k-f)_R design has no q factor interaction (or main effect for
q = 1) confounded with any other effect consisting of less than R - q factors.
So a 2^(k-f)_III design has R = 3 and main effects are confounded with 2 factor
interactions. In a 2^(k-f)_IV design, R = 4 and main effects are not confounded
with 2 factor interactions, but 2 factor interactions are confounded with other
2 factor interactions. In a 2^(k-f)_V design, R = 5 and main effects and 2 factor
interactions are only confounded with 4 and 3 way or higher interactions,
respectively. The R = 4 and R = 5 designs are good because the 3 way and
higher interactions are rarely significant, but these designs are more expensive
than the R = 3 designs.
In a 2^(k-f)_R design, each effect is confounded or aliased with 2^f - 1 other
effects. Thus the Mth main effect is really an estimate of the Mth main effect
plus 2^f - 1 other effects. If R ≥ 3 and none of the two factor interactions are
significant, then the Mth main effect is typically a useful estimator of the
population Mth main effect.
Rule of thumb 8.8. Main effects tend to be larger than q factor inter-
action effects, and the lower order interaction effects tend to be larger than
the higher order interaction effects. So two way interaction effects tend to be
larger than three way interaction effects.
Rule of thumb 8.9. Significant interactions tend to have significant
component main effects. Hence if A, B, C, and D are factors, B and D are
inert, and A and C are active, then the AC effect is the two factor interaction
most likely to be active. If only A was active, then the two factor interactions
containing A (AB, AC, and AD) are the ones most likely to be active.
Suppose each run costs $1000 and m = 1. The 2^k factorial designs need 2^k
runs while fractional factorial designs need 2^(k-f) runs. These designs use the
fact that three way and higher interactions tend to be inert for experiments.
2^3          A  B  C  AB   AC   BC   ABC
2^(4-1)_IV   A  B  C  AB+  AC+  BC+  D
2^(5-2)_III  A  B  C  D    E    BC+  BE+
2^(6-3)_III  A  B  C  D    E    F    AF+
2^(7-4)_III  A  B  C  D    E    F    G
Consider the designs given in Remarks 8.3 and 8.4. Least squares estimates
for the 2^(k-f)_R designs with ko = 3 use the design matrix corresponding to a 2^3
design, while the designs with ko = 4 use the design matrix corresponding to
the 2^4 design given in Section 8.1.
Randomly assign units to runs. Do runs in random order if possible. In in-
dustry, units are often time slots (periods of time), so randomization consists
8.2 Fractional Factorial Designs 261
Assume none of the interactions are significant. Then the 2^(7-4)_III fractional
factorial design allows estimation of 7 main effects in 2^3 = 8 runs. The
2^(15-11)_III fractional factorial design allows estimation of 15 main effects in
2^4 = 16 runs. The 2^(31-26)_III fractional factorial design allows estimation
of 31 main effects in 2^5 = 32 runs.
Fractional factorial designs with k - f = ko can be fit with software meant
for 2^ko designs. Hence the lregpack functions twocub and twofourth can
be used for the ko = 3 and ko = 4 designs that use the standard table
of contrasts. The response and residual plots given by twofourth are not
appropriate, but the QQ plot and the remaining output are relevant. Some
of the interactions will correspond to main effects for the fractional factorial
design.
For example, if the Example 8.4 data was from a 2^(4-1)_IV design, then the
A, B, and C effects would be the same, but the D effect is the effect labelled
ABC. So the D effect ≈ 0.
Fig. 8.2 Normal QQ plot of the effects.
Example 8.5. Montgomery (1984, pp. 344-346) gives data from a 2^(7-4)_III
design with the QQ plot shown in Figure 8.2. The goal was to study eye focus
time with factors A = sharpness of vision, B = distance of target from eye,
C = target shape, D = illumination level, E = target size, F = target density,
and G = subject. The lregpack function twocub gave the effects above.
a) What is the D effect?
b) What effects are significant?
Solution: By the last line in the table given in Remark 8.3, note that for
this design, A, B, C, AB, AC, BC, ABC correspond to A, B, C, D, E, F, G. So
the AB effect from the output is the D effect.
8.3 Plackett Burman Designs 263
I  A  B  C  AB  AC  BC  ABC     y
+  -  -  -  +   +   +   -    86.8
+  +  -  -  -   -   +   +    85.9
+  -  +  -  -   +   -   +    79.4
+  +  +  -  +   -   -   -    60.0
+  -  -  +  +   -   -   +    94.6
+  +  -  +  -   +   -   -    85.4
+  -  +  +  -   -   +   -    84.5
+  +  +  +  +   +   +   +    80.3
used to screen k main effects when the number of runs n is small. Often
k = n - 4, n - 3, n - 2, or n - 1 is used. We will assume that the number of
replications m = 1.
A contrast matrix for the PB(12) design is shown below. Again the column
of plusses corresponding to I is omitted. If k = 8 then effects A to H are
used but effects J, K, and L are empty. As a convention, the mean square
and sum of squares for factor E will be denoted as MSe and SSe, while MSE
= σ̂².
run A B C D E F G H J K L
1 + - + - - - + + + - +
2 + + - + - - - + + + -
3 - + + - + - - - + + +
4 + - + + - + - - - + +
5 + + - + + - + - - - +
6 + + + - + + - + - - -
7 - + + + - + + - + - -
8 - - + + + - + + - + -
9 - - - + + + - + + - +
10 + - - - + + + - + + -
11 - + - - - + + + - + +
12 - - - - - - - - - - -
The PB(n) designs are k factor, 2 level orthogonal designs. So finding
quantities such as effects, MS, SS, least squares estimates, et cetera for PB(n)
designs is similar to finding the corresponding quantities for the 2^k and
2^(k-f)_R designs. Randomize units (often time slots) to runs, and least
squares can be used.
Remark 8.6. For the PB(n) design, let c be a column from the table of
contrasts where + = 1 and - = -1. Let y be the column of responses since
m = 1. If k < n - 1, pool the last J = n - 1 - k empty effects into the
MSE with df = J as the full model. This procedure is done before looking
at the data, so is not data snooping. The MSE can also be given or found
by pooling insignificant MSs into the MSE, but the latter method uses data
snooping. This pooling needs to be done if k = n - 1 since then there is no
df for MSE. The following formulas ignore the I effect.
a) The effect corresponding to c is effect = c^T y/(n/2) = 2 c^T y/n.
b) The standard error for the effect is SE(effect) = sqrt(MSE/(n/4)) = sqrt(4 MSE/n).
c) SE(mean) = sqrt(MSE/n).
d) The sum of squares and mean sum of squares for an effect is

MS(effect) = SS(effect) = (n/4) (effect)^2.
Fig. 8.3 QQ plot of the effects.
For the PB(n) design, the least squares coefficient β̂ = 0.5 (effect). The sum
of squares for an x corresponding to an effect is equal to SS(effect). SE(coef)
= SE(β̂) = 0.5 SE(effect) = sqrt(MSE/n). Also SE(β̂0) = sqrt(MSE/n).
Example 8.7. Shown below is least squares output using PB(12) data
from Ledolter and Swersey (2007, pp. 244-256). There were k = 10 factors, so
the MSE has 1 df and there are too many terms in the model. In this case the
QQ plot shown in Figure 8.3 is more reliable for finding significant effects.
a) Which effects, if any, appear to be significant from the QQ plot?
b) Let the reduced model be Ŷ = β̂0 + β̂r1 xr1 + ··· + β̂rj xrj, where j is the
number of significant terms found in a). Write down the reduced model.
c) Want large Y. Using the model in b), choose the x values that will give
large Ŷ, and predict Y.
Estimate Std.Err t-value Pr(>|t|)
Intercept 6.7042 2.2042 3.0416 0.2022
c1 8.5792 2.2042 3.8922 0.1601
c2 -1.7958 2.2042 -0.8147 0.5648
c3 2.3125 2.2042 1.0491 0.4847
c4 4.1208 2.2042 1.8696 0.3127
c5 3.1542 2.2042 1.4310 0.3883
c6 -3.3958 2.2042 -1.5406 0.3665
c7 0.9542 2.2042 0.4329 0.7399
c8 -1.1208 2.2042 -0.5085 0.7005
c9 1.3125 2.2042 0.5955 0.6581
c10 1.7875 2.2042 0.8110 0.5662
Solution: a) The most significant effects are either in the top right or
bottom left corner. Although the points do not all scatter closely about the
line, the point in the bottom left is not significant. So none of the effects
corresponding to the bottom left of the plot are significant. A is the significant
effect with value 2(8.5792) = 17.1584. See the top right point of Figure 8.3.
b) Ŷ = 6.7042 + 8.5792 x1.
c) Ŷ = 6.7042 + 8.5792(1) = 15.2834.
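The arithmetic in this solution follows the coefficient-effect relation β̂ = 0.5 (effect). A small Python check (illustrative helpers, not from the text):

```python
def effect_from_coef(coef):
    """For these orthogonal +-1 designs, effect = 2 * coefficient."""
    return 2 * coef

def predict_reduced(b0, terms):
    """Yhat = b0 + sum of (coefficient * x) over the kept terms,
    with each x equal to -1 or 1."""
    return b0 + sum(b * x for b, x in terms)
```

Here `effect_from_coef(8.5792)` gives 17.1584, and `predict_reduced(6.7042, [(8.5792, 1)])` gives 15.2834, reproducing a) and c).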
The lregpack function pb12 can be used to find effects and MS(effect) for
PB(12) data. Least squares output and a QQ plot are also given.
8.4 Summary
for m ≥ 1.
7) If a single run out of 2^k cells gives good values for the response, then
that run is called a critical mix.
8) A factor is active if the response depends on the two levels of the factor,
and is inert, otherwise.
9) Randomization for a 2^k design: randomly assign units to the m 2^k runs.
The runs are determined by the levels of the k main effects in the table of
contrasts. So a 2^3 design is determined by the levels of A, B, and C. Similarly,
a 2^4 design is determined by the levels of A, B, C, and D. Perform the m 2^k
runs in random order if possible.
10) A table of contrasts for a 2^3 design is shown below. The first column
is for the mean and is not a contrast. The last column corresponds to the
cell means. Note that ȳ1110 = y111 if m = 1. So ȳ might be replaced by y if
m = 1.
I  A  B  C  AB  AC  BC  ABC    y
+  -  -  -  +   +   +   -    ȳ1110
+  +  -  -  -   -   +   +    ȳ2110
+  -  +  -  -   +   -   +    ȳ1210
+  +  +  -  +   -   -   -    ȳ2210
+  -  -  +  +   -   -   +    ȳ1120
+  +  -  +  -   +   -   -    ȳ2120
+  -  +  +  -   -   +   -    ȳ1220
+  +  +  +  +   +   +   +    ȳ2220
divisor 8  4  4  4   4   4   4
11) Be able to pick out active and inert factors and good (or the best)
combinations of factors (cells or runs) from the table of contrasts = table of
runs.
12) Plotted points far away from the identity line and r = 0 line are
potential outliers, but often the identity line goes through or near an outlier
that is large in magnitude. Then the case has a small residual. Look for gaps
in the response and residual plots.
13) A 95% confidence interval (CI) for an effect is

effect ± t(dfe, 0.975) SE(effect)

where dfe is the MSE degrees of freedom. Use t(dfe, 0.975) ≈ z(0.975) = 1.96 if
dfe > 30. The effect is significant if the CI does not contain 0, while the effect
is not significant if the CI contains 0.
14) Suppose there is no replication so m = 1. Find J interaction mean
squares that are small compared to the bulk of the mean squares. Add them
up (pool them) to make MSE with dfe = J: MSE is the sum of the J pooled
mean squares divided by J.
Source   df         SS      MS      F      p-value
A        1          SSA     MSA     FA     pA
B        1          SSB     MSB     FB     pB
C        1          SSC     MSC     FC     pC
AB       1          SSAB    MSAB    FAB    pAB
AC       1          SSAC    MSAC    FAC    pAC
BC       1          SSBC    MSBC    FBC    pBC
ABC      1          SSABC   MSABC   FABC   pABC
Error    (m-1)2^k   SSE     MSE
18) Below is the ANOVA table for a 2^k design. For A, use H0: μ100 =
μ200. The other main effects have similar null hypotheses. For interaction,
use H0: no interaction. If m = 1, use a procedure similar to point 14) for
exploratory purposes.
Source                               df         SS             MS      F       p-value
k main effects                       1 each     e.g. SSA =     MSA     FA      pA
C(k,2) 2 factor interactions         1 each     e.g. SSAB =    MSAB    FAB     pAB
C(k,3) 3 factor interactions         1 each     e.g. SSABC =   MSABC   FABC    pABC
...
C(k,k-1) (k-1) factor interactions   1 each
the k factor interaction             1          SSA...L =      MSA...L FA...L  pA...L
Error                                (m-1)2^k   SSE            MSE
2^3          A  B  C  AB   AC   BC   ABC
2^(4-1)_IV   A  B  C  AB+  AC+  BC+  D
2^(5-2)_III  A  B  C  D    E    BC+  BE+
2^(6-3)_III  A  B  C  D    E    F    AF+
2^(7-4)_III  A  B  C  D    E    F    G
44) Suppose the full model using all of the columns of X is used. If some
columns are removed (e.g. those corresponding to the insignificant effects),
then for the orthogonal designs in point 43) the following quantities remain
unchanged for the terms that were not deleted: the effects, the coefficients,
and SS(effect) = MS(effect).
50) The least squares coefficient β̂ = 0.5 (effect). The sum of squares for an
x corresponding to an effect is equal to SS(effect). SE(coef) = SE(β̂) = 0.5
SE(effect) = sqrt(MSE/n). Also SE(β̂0) = sqrt(MSE/n).
51) The Plackett Burman PB(n) designs have k factors where 2 ≤ k ≤
n - 1. The factors have 2 levels and orthogonal contrasts like the 2^k and
2^(k-f)_R designs. The PB(n) designs are resolution 3 designs, but the
confounding of main effects with 2 factor interactions is complex. The PB(n)
designs use n runs where n is a multiple of 4. The values n = 12, 20, 24, 28,
and 36 are especially common.
52) The PB(n) designs are usually used with main effects, so assume that all
interactions are insignificant. So they are main effects designs used to screen k
main effects when the number of runs n is small. Often k = n - 4, n - 3, n - 2,
or n - 1 is used. We will assume that the number of replications m = 1.
53) If k = n - 1 there is no df for MSE. If k < n - 1, pool the last
J = n - 1 - k empty effects into the MSE with df = J as the full model.
This procedure is done before looking at the data, so is not data snooping.
run A B C D E F G H J K L
1 + - + - - - + + + - +
2 + + - + - - - + + + -
3 - + + - + - - - + + +
4 + - + + - + - - - + +
5 + + - + + - + - - - +
6 + + + - + + - + - - -
7 - + + + - + + - + - -
8 - - + + + - + + - + -
9 - - - + + + - + + - +
10 + - - - + + + - + + -
11 - + - - - + + + - + +
12 - - - - - - - - - - -
54) The contrast matrix for the PB(12) design is shown above. Again the
column of plusses corresponding to I is omitted. If k = 8 then effects A to
H are used but effects J, K, and L are empty. As a convention, the mean
square and sum of squares for factor E will be denoted as MSe and SSe, while
MSE = σ̂².
55) The PB(n) designs are k factor, 2 level orthogonal designs. So finding
effects, MS, SS, least squares estimates, et cetera for PB(n) designs is similar
to finding the corresponding quantities for the 2^k and 2^(k-f)_R designs.
56) For the PB(n) design, let c be a column from the table of contrasts
where + = 1 and - = -1. Let y be the column of responses since m = 1.
For k < n - 1, MSE can be found for the full model as in 53). MSE can also
be given or found by pooling insignificant MSs into the MSE, but the latter
method uses data snooping.
a) The effect corresponding to c is effect = c^T y/(n/2) = 2 c^T y/n.
b) The standard error for the effect is SE(effect) = sqrt(MSE/(n/4)) = sqrt(4 MSE/n).
c) SE(mean) = sqrt(MSE/n).
d) The sum of squares and mean square for an effect is

MS(effect) = SS(effect) = (n/4) (effect)^2.
57) For the PB(n) design, the least squares coefficient β̂ = 0.5 (effect). The
sum of squares for an x corresponding to an effect is equal to SS(effect).
SE(coef) = SE(β̂) = 0.5 SE(effect) = sqrt(MSE/n). Also SE(β̂0) = sqrt(MSE/n).
8.5 Complements
Box et al. (2005) and Ledolter and Swersey (2007) are excellent references
for k factor, 2 level orthogonal designs.
Suppose it is desired to increase the response Y and that A, B, C, . . . are
the k factors. The main effects for A, B, . . . measure

∂Y/∂A, ∂Y/∂B,

et cetera. The interaction effect AB measures

∂²Y/(∂A ∂B).

Hence

∂Y/∂A ≈ 0, ∂Y/∂B ≈ 0, and ∂²Y/(∂A ∂B) large
8.6 Problems
8.1. From the above least squares output, what is the AB effect?
I  A  B  C  AB  AC  BC  ABC     Y
+  -  -  -  +   +   +   -    3.81
+  +  -  -  -   -   +   +    4.28
+  -  +  -  -   +   -   +    3.74
+  +  +  -  +   -   -   -    4.10
+  -  -  +  +   -   -   +    3.75
+  +  -  +  -   +   -   -    3.66
+  -  +  +  -   -   +   -    3.82
+  +  +  +  +   +   +   +    3.68
I  A  B  C  AB  AC  BC  ABC     y
+  -  -  -  +   +   +   -    86.8
+  +  -  -  -   -   +   +    85.9
+  -  +  -  -   +   -   +    79.4
+  +  +  -  +   -   -   -    60.0
+  -  -  +  +   -   -   +    94.6
+  +  -  +  -   +   -   -    85.4
+  -  +  +  -   -   +   -    84.5
+  +  +  +  +   +   +   +    80.3
8.4. Suppose that for 2^3 data with m = 2, the MSE = 407.5625. Find
SE(effect).
I  A  B  C  AB  AC  BC  ABC     y
+  -  -  -  +   +   +   -    63.6
+  +  -  -  -   -   +   +    76.8
+  -  +  -  -   +   -   +    60.3
+  +  +  -  +   -   -   -    80.3
+  -  -  +  +   -   -   +    67.2
+  +  -  +  -   +   -   -    71.3
+  -  +  +  -   -   +   -    68.3
+  +  +  +  +   +   +   +    74.3
divisor 8  4  4  4   4   4   4
I  A  B  C  AB  AC  BC  ABC    y
+  -  -  -  +   +   +   -    32
+  +  -  -  -   -   +   +    35
+  -  +  -  -   +   -   +    28
+  +  +  -  +   -   -   -    31
+  -  -  +  +   -   -   +    48
+  +  -  +  -   +   -   -    39
+  -  +  +  -   -   +   -    28
+  +  +  +  +   +   +   +    29
divisor 8  4  4  4   4   4   4
8.7. Suppose the B effect = 5, SE(effect) = 2, and dfe = 8.
i) Find a 95% confidence interval for the B effect.
ii) Is the B effect significant? Explain briefly.
R (along with 1 SAS and 1 Minitab) Problems
8.8. Copy the Box et al. (2005, p. 199) product development data from
(http://lagrange.math.siu.edu/Olive/lregdata.txt) into R.
Then type the following commands.
8.9. Get the SAS program for this problem from (http://lagrange.math.siu.
edu/Olive/lreghw.txt). The data is the pilot plant example from Box et al.
(2005, pp. 177–186). The response variable is Y = yield, while the three
predictors (T = temp, C = concentration, K = catalyst) are at two levels.
a) Print out the output but do not turn in the first page.
b) Do the residual and response plots look ok?
8.10. Get the data for this problem. The data is the pilot plant example
from Box et al. (2005, pp. 177–186) examined in Problem 8.9. Minitab needs
the levels for the factors and the interactions.
Highlight the data and use the menu commands Edit>Copy. In Minitab,
use the menu command Edit>PasteCells. After a window appears, click on
ok.
Below C1 type A, below C2 type B, below C3 type C, and below
C8 type yield.
a) Use the menu command STAT>ANOVA>Balanced Anova. Put
yield in the responses box and
A|B|C
in the Model box. Click on Storage. When a window appears, click on
Fits and Residuals. Then click on OK. This window will disappear.
Click on OK.
b) Next highlight the bottom 8 lines and use the menu commands
Edit>Delete Cells. Then the data set does not have replication. Use the
menu command STAT>ANOVA>Balanced Anova. Put yield in the
responses box and
A B C A*C
in the Model box. Click on Storage. When a window appears, click on
Fits and Residuals. Then click on OK. This window will disappear.
Click on OK.
(The model A|B|C would have resulted in an error message, not enough
data.)
c) Print the output by clicking on the top window and then clicking on
the printer icon.
d) Make a response plot with the menu commands Graph>Plot with
yield in the Y box and FIT2 in the X box. Print by clicking on the printer
icon.
e) Make a residual plot with the menu commands Graph>Plot with
RESI2 in the Y box and FIT2 in the X box. Print by clicking on the printer
icon.
f) Do the plots look ok?
8.11. Get the R code and data for this problem from
(http://lagrange.math.siu.edu/Olive/lreghw.txt). The data is the pilot plant
example from Box et al. (2005, pp. 177–186) examined in Problems 8.9 and
8.10.
a) Copy and paste the code into R. Then copy and paste the output into
Notepad. Print out the page of output.
b) The least squares estimate = coefficient for x1 is half the A effect. So
what is the A effect?
This is the Ledolter and Swersey (2007, p. 80) cracked pots 2^4 data and
the response and residual plots are from the model without 3 and 4 factor
interactions.
b) Copy the plots into Word and print the plots. Do the response and
residual plots look ok?
8.15. Download lregpack into R. The data is the PB(12) example from
Box et al. (2005, p. 287).
a) Type the following commands. Copy and paste the QQ plot into Word
and print the plot.
b) Copy and paste the output into Notepad and print the output.
c) As a 2^5 design, the effects B, D, BD, E, and DE were thought to be real.
The PB(12) design works best when none of the interactions is significant.
From the QQ plot and the output for the PB(12) design, which factors, if
any, appear to be significant?
d) The output gives the A, B, C, D, and E effects along with the cor-
responding least squares coefficients β̂1, . . . , β̂5. What is the relationship
between the coefficients and the effects?
For parts e) to g), act as if the PB(12) design with 5 factors is
appropriate.
e) The full model has Ŷ = β̂0 + β̂1 x1 + β̂2 x2 + β̂3 x3 + β̂4 x4 + β̂5 x5. The
reduced model is Ŷ = β̂0 + β̂j xj where xj is the significant term found in c).
Give the numerical formula for the reduced model.
f) Compute Ŷ using the full model if xi = 1 for i = 1, . . . , 5. Then compute
Ŷ using the reduced model if xj = 1.
g) If the goal of the experiment is to produce large values of Y, should
xj = 1 or xj = −1 in the reduced model? Explain briefly.
Chapter 9
More on Experimental Designs
This chapter considers split plot designs briefly and reviews, in Section 9.2,
the ten designs considered in earlier chapters. The one and two way Anova
designs, completely randomized block design, and split plot designs are the
building blocks for more complicated designs. Some split plot designs can be
written as a linear model, Y = x^T β + e, but the errors are dependent with
a complicated correlation structure.
Definition 9.1. Split plot designs have two units. The large units are
called whole plots and contain blocks of small units called subplots. The
whole plots get assigned to factor A while the subplots get assigned to factor
B (randomly if the units are experimental but not randomly if the units are
observational). A and B are crossed so the AB interaction can be studied.
The split plot design depends on how whole plots are assigned to A. Three
common methods are described below, and methods a) and b) are described
in more detail in the following subsections. The randomization and split plot
ANOVA table depend on the design used for assigning the whole plots to
factor A.
a) The whole plots are assigned to A completely at random, as in a one
way Anova.
b) The whole plots are assigned to A and to a blocking variable as in a
completely randomized block design if the whole plots are experimental, but
a complete block design is used if the whole plots are observational.
c) The whole plots are assigned to A, to row blocks, and to column blocks
as in a Latin square.
The key feature of a split plot design is that there are two units of different
sizes: one size for each of the 2 factors of interest. The larger units are assigned
to A. The large units contain blocks of small units assigned to factor B. Also
factors A and B are crossed.
Shown below is the split plot ANOVA table when the whole plots are assigned
to factor A as in a one way Anova design. The whole plot error is error(W) and
can be obtained as an A*replication interaction. The subplot error is error(S).
FA = MSA/MSEW, FB = MSB/MSES, and FAB = MSAB/MSES. R
computes the three test statistics and pvalues correctly, but for SAS FA and
the pvalue pA need to be computed using MSA, MSEW, dfA, and dfeW
obtained from the ANOVA table. Sometimes error(W) is also denoted as
residuals. There are ma whole plots, and each whole plot contains b sub-
plots. Thus there are mab subplots. As always, the pvalue column actually
gives pval, an estimate of the pvalue.
Source df SS MS F p-value
A a − 1 SSA MSA FA pA
error(W) or A*repl a(m − 1) SSEW MSEW
B b − 1 SSB MSB FB pB
AB (a − 1)(b − 1) SSAB MSAB FAB pAB
residuals or error(S) a(m − 1)(b − 1) SSES MSES
The tests of interest for this split plot design are nearly identical to those of
a two way Anova model. Yijk has i = 1, . . . , a, j = 1, . . . , b, and k = 1, . . . , m.
Keep A and B in the model if there is an AB interaction.
a) The 4 step test for AB interaction is
i) Ho: there is no interaction HA : there is an interaction.
ii) FAB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that there is an interaction between A
and B, otherwise fail to reject Ho and conclude that there is no interaction
between A and B. (Or there is not enough evidence to conclude that there is
an interaction.)
b) The 4 step test for A main eects is
i) Ho: μ100 = · · · = μa00 HA: not Ho.
ii) FA is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of A, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of A. (Or there is not enough evidence to
conclude that the response depends on the level of A.)
9.1 Split Plot Designs 285
Source df SS MS F p-value
variety 7 763.16 109.02 1.232 0.3421
MSEW 16 1415.83 88.49
treatment 3 30774.3 10258.1 423.44 0.00
variety*treatment 21 2620.1 124.8 5.150 0.00
error(S) 48 1162.8 24.2
Example 9.1. This split plot data is from Chambers and Hastie (1993,
p. 158). There were 8 varieties of guayule (rubber plant) and 4 treatments
were applied to seeds. The response was the rate of germination. The whole
plots were greenhouse flats and the subplots were 4 subplots of the flats. Each
flat received seeds of one variety (A). Each subplot contained 100 seeds and
was treated with one of the treatments (B). There were m = 3 replications
so each variety was planted in 3 flats for a total of 24 flats and 4(24) = 96
observations.
Factorial crossing: Variety and treatments (A and B) are crossed since all
combinations of variety and treatment occur. Hence the AB interaction can
be measured.
Blocking: The whole plots are the 24 greenhouse flats. Each flat is a block
of 4 subplots. Each of the 4 subplots gets one of the 4 treatments.
Randomization: The 24 flats are assigned to the 8 varieties completely at
random. Use the sample(24) command to generate a random permutation.
The first 3 numbers of the permutation get variety one, the next 3 get variety
2, . . . , the last 3 get variety 8. Use the sample(4) command 24 times, once
for each flat. If 2, 4, 1, 3 was the permutation for the ith flat, then the 1st
subplot gets treatment 3, the 2nd gets treatment 1, the 3rd gets treatment
4, and the 4th subplot gets treatment 2.
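A minimal sketch of this randomization, with Python's random.sample standing in for R's sample() (an assumption for illustration only):

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# like sample(24): a random permutation of the 24 flats
perm = random.sample(range(1, 25), 24)
# first 3 flats in the permutation get variety 1, the next 3 variety 2, ...
variety_of_flat = {flat: pos // 3 + 1 for pos, flat in enumerate(perm)}

# like sample(4) done 24 times: treatment order within each flat
order_in_flat = {flat: random.sample(range(1, 5), 4) for flat in range(1, 25)}
```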
a) Ho: μ100 = · · · = μ800 Ha: not Ho
FA = 1.232
pval = 0.3421
Fail to reject Ho, the mean rate of germination does not depend on va-
riety. (This test would make more sense if there was no variety * treatment
interaction.)
b) Ho: μ010 = · · · = μ040 Ha: not Ho
FB = 423.44
pval = 0.00
Reject Ho, the mean rate of germination depends on treatment.
c) Ho: no interaction Ha: there is an interaction
FAB = 5.15
pval = 0.00
Reject Ho, there is a variety * treatment interaction.
Shown below is the split plot ANOVA table when the whole plots are
assigned to factor A and a blocking variable as in a completely randomized
block design. The whole plot error is error(W) and can be obtained as
a block*A interaction. The subplot error is error(S). FA = M SA/M SEW,
FB = M SB/M SES, and FAB = M SAB/M SES. Factor A has a levels
and factor B has b levels. There are r blocks of a whole plots. Each whole
plot contains b subplots, and each block contains a whole plots and thus ab
subplots. Hence there are ra whole plots and rab subplots.
SAS computes the last two test statistics and pvalues correctly, and the
last line of SAS output gives FA and the pvalue pA . The initial line of output
for A is not correct. The output for blocks is probably not correct.
Source df SS MS F p-value
blocks r − 1
A a − 1 SSA MSA FA pA
error(W) or block*A (r − 1)(a − 1) SSEW MSEW
B b − 1 SSB MSB FB pB
AB (a − 1)(b − 1) SSAB MSAB FAB pAB
error(S) a(r − 1)(b − 1) SSES MSES
The tests of interest for this split plot design are nearly identical to those
of a two way Anova model. Yijk has i = 1, . . . , a, j = 1, . . . , b and k = 1, . . . , r.
Keep A and B in the model if there is an AB interaction.
a) The 4 step test for AB interaction is
i) Ho: there is no interaction HA : there is an interaction.
ii) FAB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that there is an interaction between A
and B, otherwise fail to reject Ho and conclude that there is no interaction
9.1 Split Plot Designs 287
between A and B. (Or there is not enough evidence to conclude that there is
an interaction.)
b) The 4 step test for A main eects is
i) Ho: μ100 = · · · = μa00 HA: not Ho.
ii) FA is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of A, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of A. (Or there is not enough evidence to
conclude that the response depends on the level of A.)
c) The 4 step test for B main eects is
i) Ho: μ010 = · · · = μ0b0 HA: not Ho.
ii) FB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of B, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of B. (Or there is not enough evidence to
conclude that the response depends on the level of B.)
Source df SS MS F p-value
Block 5 4.150 0.830
Variety 2 0.178 0.089 0.65 0.5412
Block*Variety 10 1.363 0.136
Date 3 1.962 0.654 23.39 0.00
Variety*Date 6 0.211 0.035 1.25 0.2973
error(S) 45 1.259 0.028
Example 9.2. The ANOVA table above is for the Snedecor and Cochran
(1967, pp. 369–372) split plot data where the whole plots are assigned to
factor A and to blocks in a completely randomized block design. Factor A =
variety of alfalfa (ladak, cossack, ranger). Each field had two cuttings, with
the second cutting on July 7, 1943. Factor B = date of third cutting (none,
Sept. 1, Sept. 20, Oct. 7) in 1943. The response variable was yield (tons per
acre) in 1944. The 6 blocks were fields of land divided into 3 plots of land,
one for each variety. Each of these 3 plots was divided into 4 subplots for
date of third cutting. So each block had 3 whole plots and 12 subplots.
a) Perform the test corresponding to A.
b) Perform the test corresponding to B.
c) Perform the test corresponding to AB.
Warning: Although the split plot model can be written as a linear model,
the errors are not iid and have a complicated correlation structure. It is also
difficult to get fitted values and residuals from the software, so the model
can't be easily checked with response and residual plots. These facts make
the split plot model very hard to use for most researchers.
b) For a random effects one way Anova model, the levels are a random
sample from a population of levels.
Randomization: Let n = m1 + · · · + mp and do the sample(n) command.
Assign the first m1 units to treatment (level) 1, the next m2 units to treatment 2,
. . . , the last mp units to treatment p.
II) Two way Anova: Factor A has a levels and factor B has b levels. The
two factors are crossed, forming ab cells.
Randomization: Let n = mab and do the sample(n) command. Randomly
assign m units to each of the ab cells. Assign the rst m units to the (A, B) =
(1, 1) cell, the next m units to the (1,2) cell, . . . , the last m units to the (a, b)
cell.
III) k way Anova: There are k factors A1, . . . , Ak with a1, . . . , ak levels,
respectively. The k factors are crossed, forming a1 a2 · · · ak cells.
Randomization: Let n = m a1 a2 · · · ak and do the sample(n) command. Ran-
domly assign m units to each cell. Each cell is a combination of levels, so the
(1, 1, . . . , 1, 1) cell gets the 1st m units.
IV) Completely randomized block design: Factor A has k levels (treat-
ments), and there are b blocks (a blocking variable has b levels) of k units.
Randomization: Let n = kb and do the sample(k) command b times.
Within each block of k units, randomly assign 1 unit to each treatment.
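This randomization can be sketched as follows, a hypothetical illustration with Python's random.sample playing the role of R's sample(k):

```python
import random

random.seed(2)

def crbd_plan(k, b):
    """For each of the b blocks of k units, return a random permutation of
    the k treatments: unit i of block j gets treatment plan[j][i]."""
    return [random.sample(range(1, k + 1), k) for _ in range(b)]

plan = crbd_plan(k=4, b=3)  # 3 blocks, 4 treatments
```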
V) Latin squares: Factor A has a levels (treatments), the row blocking
variable has a blocks of a units, and the column blocking variable has a blocks
of a units. There are a² units since the row and column blocking variables are
crossed. The treatment factor, row blocking variable, and column blocking
variable are also crossed. A Latin square is such that each of the a treatments
occurs once in each row and once in each column.
Randomization: Pick an a × a Latin square. Use the sample(a) command
to assign row levels to numbers 1 to a. Use the sample(a) command to assign
column levels to numbers 1 to a. Use the sample(a) command to assign
treatment levels to the first a capital letters. If possible, use the sample(a²)
command to assign units, 1 unit to each cell of the Latin square.
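One way to carry this out is sketched below; starting from the cyclic square L[i][j] = (i + j) mod a is an illustrative assumption, not the text's method, and Python's random.sample stands in for R's sample(a):

```python
import random

random.seed(3)

def random_latin_square(a):
    """Start from the cyclic a x a Latin square L[i][j] = (i + j) mod a, then
    use three random permutations (like sample(a) three times) to shuffle
    row levels, column levels, and treatment labels."""
    rows = random.sample(range(a), a)
    cols = random.sample(range(a), a)
    trts = random.sample(range(a), a)
    return [[trts[(rows[i] + cols[j]) % a] for j in range(a)] for i in range(a)]

L = random_latin_square(4)
# each of the 4 treatments occurs once in each row and once in each column
```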
VI) 2^k factorial design: There are k factors, each with 2 levels.
Randomization: Let n = m 2^k and do the sample(n) command. Randomly
assign m units to each cell. Each cell corresponds to a run which is determined
by a string of k +'s and −'s corresponding to the k main effects.
VII) 2^{k-f}_R fractional factorial design: There are k factors, each with 2
levels.
Randomization: Let n = 2^{k-f} and do the sample(n) command. Randomly
assign 1 unit to each run which is determined by a string of k +'s and −'s
corresponding to the k main effects.
VIII) Plackett–Burman PB(n) design: There are k factors, each with 2
levels.
Try to become familiar with the designs and their randomization so that
you can recognize a design given a story problem.
9.3 Summary
1) The analysis of the response, not that of the residuals, is of primary im-
portance. The response plot can be used to analyze the response in the back-
ground of the fitted model. For linear models such as experimental designs,
the estimated mean function is the identity line and should be added as a
visual aid.
2) Assume that the residual degrees of freedom are large enough for testing.
Then the response and residual plots contain much information. Linearity and
constant variance may be reasonable if the plotted points scatter about the
identity line in a (roughly) evenly populated band. Then the residuals should
scatter about the r = 0 line in an evenly populated band. It is easier to check
linearity with the response plot and constant variance with the residual plot.
Curvature is often easier to see in a residual plot, but the response plot
can be used to check whether the curvature is monotone or not. The response plot
is more effective for determining whether the signal to noise ratio is strong
or weak, and for detecting outliers, influential cases, or a critical mix.
3) The three basic principles of DOE (design of experiments) are
i) Use randomization to assign units to treatments.
ii) Use factorial crossing to compare the effects (main effects, pairwise
interactions, . . . , J-fold interaction) of J ≥ 2 factors. If A1, . . . , AJ are the
factors with li levels for i = 1, . . . , J, then there are l1 l2 · · · lJ treatments
where each treatment uses exactly one level from each factor.
iii) Blocking is used to divide units into blocks of similar units where
similar means the units are likely to have similar values of the response
when given the same treatment. Within each block randomly assign units to
treatments.
4) Split plot designs have two units. The large units are called whole plots
and contain blocks of small units called subplots. The whole plots get assigned
to factor A while the subplots get assigned to factor B (randomly if the units
are experimental but not randomly if the units are observational). A and B
are crossed so the AB interaction can be studied.
5) The split plot design depends on how whole plots are assigned to A.
Three common methods are a) the whole plots are assigned to A completely
at random, as in a one way Anova, b) the whole plots are assigned to A
and to a blocking variable as in a completely randomized block design (if the
whole plots are experimental, a complete block design is used if the whole
plots are observational), c) the whole plots are assigned to A, to row blocks,
and to column blocks as in a Latin square.
6) The split plot ANOVA table when whole plots are assigned to levels of
A as in a one way Anova is shown below. The whole plot error is error(W) and
can be obtained as an A*replication interaction. The subplot error is error(S).
FA = M SA/M SEW, FB = M SB/M SES, and FAB = M SAB/M SES. R
computes the three test statistics and pvalues correctly, but for SAS FA
and the pvalue pA need to be computed using MSA, MSEW, dfA, and dfeW
obtained from the ANOVA table.
Source df SS MS F p-value
A a − 1 SSA MSA FA pA
error(W) or A*repl a(m − 1) SSEW MSEW
B b − 1 SSB MSB FB pB
AB (a − 1)(b − 1) SSAB MSAB FAB pAB
residuals or error(S) a(m − 1)(b − 1) SSES MSES
Source df SS MS F p-value
blocks r − 1
A a − 1 SSA MSA FA pA
error(W) or block*A (r − 1)(a − 1) SSEW MSEW
B b − 1 SSB MSB FB pB
AB (a − 1)(b − 1) SSAB MSAB FAB pAB
error(S) a(r − 1)(b − 1) SSES MSES
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of B, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of B.
9.4 Complements
9.5 Problems
Source df SS MS F p-value
Block 2 77.55 38.78
Method 2 128.39 64.20 7.08 0.0485
Block*Method 4 36.28 9.07
Temp 3 434.08 144.69 41.94 0.00
Method*Temp 6 75.17 12.53 2.96 0.0518
error(S) 12 50.83 4.24
9.1. The ANOVA table above is for the Montgomery (1984, pp. 386–389)
split plot data where the whole plots are assigned to factor A and to blocks
in a completely randomized block design. The response variable is tensile
strength of paper. Factor A is (preparation) method with 3 levels (1, 2, 3).
Factor B is temperature with 4 levels (200, 225, 250, 275). The pilot plant
can make 12 runs a day and the experiment is repeated each day, with days
as blocks. A batch of pulp is made by one of the 3 preparation methods. Then
the batch of pulp is divided into 4 samples, and each sample is cooked at one
of the four temperatures.
a) Perform the test corresponding to A.
b) Perform the test corresponding to B.
c) Perform the test corresponding to AB.
Source df SS MS F p-value
Block 1 0.051 0.051
Nitrogen 3 37.32 12.44 29.62 0.010
Block*Nitrogen 3 1.26 0.42
Thatch 2 3.82 1.91 9.10 0.009
Nitrogen*Thatch 6 4.15 0.69 3.29 0.065
error(S) 12 1.72 0.21
9.2. The ANOVA table above is for the Kuehl (1994, pp. 473–481) split
plot data where the whole plots are assigned to factor A and to blocks in
a completely randomized block design. The response variable is the average
chlorophyll content (mg/gm of turf grass clippings). Factor A is nitrogen
fertilizer with 4 levels (1, 2, 3, 4). Factor B is length of time that thatch was
allowed to accumulate with 3 levels (2, 5, or 8 years).
There were 2 blocks of 4 whole plots to which the levels of factor A were
assigned. The 2 blocks formed a golf green which was seeded with turf grass.
The 8 whole plots were plots of golf green. Each whole plot had 3 subplots
to which the levels of factor B were randomly assigned.
a) Perform the test corresponding to A.
b) Perform the test corresponding to B.
c) Perform the test corresponding to AB.
Source df SS MS F p-value
Block 5 4.150 0.830
Variety 2 0.178 0.089 0.65 0.5412
Block*Variety 10 1.363 0.136
Date 3 1.962 0.654 23.39 0.00
Variety*Date 6 0.211 0.035 1.25 0.2973
error(S) 45 1.259 0.028
9.3. The ANOVA table above is for the Snedecor and Cochran (1967, pp.
369–372) split plot data where the whole plots are assigned to factor A and to
blocks in a completely randomized block design. Factor A = variety of alfalfa
(ladak, cossack, ranger). Each field had two cuttings, with the second cutting
on July 7, 1943. Factor B = date of third cutting (none, Sept. 1, Sept. 20,
Oct. 7) in 1943. The response variable was yield (tons per acre) in 1944. The
6 blocks were fields of land divided into 3 plots of land, one for each variety.
Each of these 3 plots was divided into 4 subplots for date of third cutting.
So each block had 3 whole plots and 12 subplots.
a) Perform the test corresponding to A.
b) Perform the test corresponding to B.
c) Perform the test corresponding to AB.
attach(steel)
out<-aov(resistance~heat*coating + Error(wplots),steel)
summary(out)
detach(steel)
This split plot steel data is from Box et al. (2005, p. 336). The whole plots
are time slots to use a furnace, which can hold 4 steel bars at one time. Factor
A = heat has 3 levels (360, 370, 380 °F). Factor B = coating has 4 levels
(4 types of coating: c1, c2, c3, and c4). The response was corrosion resistance.
a) Perform the test corresponding to A.
b) Perform the test corresponding to B.
c) Perform the test corresponding to AB.
9.7. This is the same data as in Problem 9.6, using SAS. Copy and paste
the SAS program from (http://lagrange.math.siu.edu/Olive/lrsashw.txt)
into SAS, run the program, then print the output. Only include the second
page of output.
To get the correct F statistic for heat, you need to divide MS heat by MS
wplots.
f(z|μ, Σ)
where the ith row of W is x_i^T and the jth column is v_j. Each column v_j of
W corresponds to a variable. For example, the data may consist of n visitors
to a hospital where the p = 2 variables height and weight of each individual
were measured.
There are some differences in the notation used in multiple linear regression
and multivariate location dispersion models. Notice that W could be used
as the design matrix in multiple linear regression although usually the first
column of the regression design matrix is a vector of ones. The n × p design
matrix in the multiple linear regression model was denoted by X, and x_i^T
was the ith row of X. In the multivariate location dispersion model, X and
X_i will be used to denote a p × 1 random vector with observed value x_i,
and x_i^T is the ith row of the data matrix W. Johnson and Wichern (1988,
pp. 7, 53) uses X to denote the n × p data matrix and an n × 1 random
vector, relying on the context to indicate whether X is a random vector or
data matrix. Software tends to use different notation. For example, R will
use commands such as
var(x)
to compute the sample covariance matrix of the data. Hence x corresponds
to W, x[,1] is the first column of x, and x[4,] is the 4th row of x.
and
E(AX) = AE(X) and E(AXB) = AE(X)B. (10.3)
Thus
Cov(a + AX) = Cov(AX) = A Cov(X) A^T. (10.4)
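Equation (10.4) can be checked numerically: the sample covariance matrix obeys the same identity exactly, since it is bilinear in the centered data. A small stdlib-only Python sketch, where the shift a and matrix A are arbitrary illustrative choices:

```python
import random

def sample_cov(data):
    """2 x 2 sample covariance matrix of a list of (x1, x2) points."""
    n = len(data)
    m = [sum(p[j] for p in data) / n for j in range(2)]
    return [[sum((p[i] - m[i]) * (p[j] - m[j]) for p in data) / (n - 1)
             for j in range(2)] for i in range(2)]

random.seed(0)
X = [(random.gauss(0, 1), random.gauss(0, 2)) for _ in range(200)]
a = [5.0, -1.0]              # the shift a drops out of the covariance
A = [[2.0, 1.0], [0.0, 3.0]]
Y = [(a[0] + A[0][0] * x1 + A[0][1] * x2,
      a[1] + A[1][0] * x1 + A[1][1] * x2) for (x1, x2) in X]

S = sample_cov(X)
# A S A^T, computed entrywise
ASAT = [[sum(A[i][k] * S[k][l] * A[j][l] for k in range(2) for l in range(2))
         for j in range(2)] for i in range(2)]
SY = sample_cov(Y)  # matches ASAT up to floating point rounding
```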
Some important properties of multivariate normal (MVN) distributions are
given in the following three propositions. These propositions can be proved
using results from Johnson and Wichern (1988, pp. 127–132).
Cov(X) = .
X1|X2 = x2 ∼ Nq(μ1 + Σ12 Σ22⁻¹(x2 − μ2), Σ11 − Σ12 Σ22⁻¹ Σ21).
Example 10.1. Let p = 2 and let (Y, X)^T have a bivariate normal distri-
bution. That is,
(Y, X)^T ∼ N2( (μY, μX)^T, [[σY², Cov(Y, X)], [Cov(X, Y), σX²]] ).
f(x, y) = (1/2) f1(x, y) + (1/2) f2(x, y)
where f1(x, y) = [1/(2π sqrt(1 − ρ²))] exp(−(x² + 2ρxy + y²)/(2(1 − ρ²))),
f2 is the analogous bivariate normal density with the +2ρxy term replaced
by −2ρxy, x and y are real, and 0 < ρ < 1. Since both marginal distributions
of fi(x, y) are N(0,1) for i = 1 and 2 by Proposition 10.2 a), the marginal
distributions of X and Y are N(0,1). Since ∫∫ xy fi(x, y) dx dy = −ρ for i = 1
and ρ for i = 2, X and Y are uncorrelated, but X and Y are not independent
since f(x, y) ≠ fX(x)fY(y).
E(X) = μ (10.7)
and
Cov(X) = cX Σ (10.8)
where
cX = −2ψ′(0).
U = D² = D²(μ, Σ) = (X − μ)^T Σ⁻¹ (X − μ). (10.9)
h(u) = [kp π^{p/2}/Γ(p/2)] u^{p/2 − 1} g(u). (10.10)