
David J. Olive

Linear Regression
David J. Olive
Department of Mathematics
Southern Illinois University
Carbondale, IL, USA

ISBN 978-3-319-55250-7 ISBN 978-3-319-55252-1 (eBook)


DOI 10.1007/978-3-319-55252-1

Library of Congress Control Number: 2017934111

Mathematics Subject Classification (2010): 62J05

© Springer International Publishing AG 2017


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

Regression is the study of the conditional distribution Y | x of the response variable Y given the p × 1 vector of predictors x. In a linear regression model, Y = β^T x + e, and Y is conditionally independent of x given a single linear combination β^T x of the predictors, written

Y ⊥ x | β^T x.

Multiple linear regression and many experimental design models are special cases of the linear regression model, and the models can be presented compactly by defining the population model in terms of the sufficient predictor SP = β^T x and the estimated model in terms of the estimated sufficient predictor ESP = β̂^T x. In particular, the response plot or estimated sufficient summary plot of the ESP versus Y is used to visualize the conditional distribution Y | β^T x. The residual plot of the ESP versus the residuals is used to visualize the conditional distribution of the residuals given the ESP.
The literature on multiple linear regression is enormous. See Stigler (1986)
and Harter (1974a,b, 1975a,b,c, 1976) for history. Draper (2002) is a good
source for more recent literature. Some texts that were standard at one time
include Wright (1884), Johnson (1892), Bartlett (1900), Merriman (1907),
Weld (1916), Leland (1921), Ezekiel (1930), Bennett and Franklin (1954),
Ezekiel and Fox (1959), and Brownlee (1965). Recent reprints of several of
these texts are available from www.amazon.com.
Draper and Smith (1966) was a breakthrough because it popularized the
use of residual plots, making the earlier texts obsolete. Excellent texts include
Chatterjee and Hadi (2012), Draper and Smith (1998), Fox (2015), Hamil-
ton (1992), Kutner et al. (2005), Montgomery et al. (2012), Mosteller and
Tukey (1977), Ryan (2009), Sheather (2009), and Weisberg (2014). Cook and
Weisberg (1999a) was a breakthrough because of its use of response plots.


Other texts of interest include Abraham and Ledolter (2006), Harrell


(2015), Pardoe (2012), Mickey et al. (2004), Cohen et al. (2003), Kleinbaum
et al. (2014), Mendenhall and Sincich (2011), Vittinghoff et al. (2012), and
Berk (2003).
This text is an introduction to linear regression models for undergraduates and beginning graduate students in a mathematics or statistics department. The text is also suitable for graduate students in fields like quantitative psychology. The
prerequisites for this text are linear algebra and a calculus-based course in
statistics at the level of Chihara and Hesterberg (2011), Hogg et al. (2014),
Rice (2006), or Wackerly et al. (2008). The student should be familiar with
vectors, matrices, confidence intervals, expectation, variance, normal distri-
bution, and hypothesis testing.
This text will not be easy reading for nonmathematical students. Lindsey
(2004) and Bowerman and O'Connell (2000) attempt to present regression
models to students who have not had calculus or linear algebra. Also see
Kachigan (1991, ch. 35) and Allison (1999).
This text does not give much history of regression, but it should be noted that many of the most important ideas in statistics are due to Fisher, Neyman, E.S. Pearson, and K. Pearson. See Lehmann (2011). For example, David (2006–2007) says that the following terms were due to Fisher: analysis of variance, confounding, consistency, covariance, degrees of freedom, efficiency, factorial design, information, information matrix, interaction, level of significance, likelihood, location, maximum likelihood, null hypothesis, pivotal quantity, randomization, randomized blocks, sampling distribution, scale, statistic, Student's t, test of significance, and variance.
David (2006–2007) says that terms due to Neyman and E.S. Pearson include alternative hypothesis, composite hypothesis, likelihood ratio, power, power function, simple hypothesis, size of critical region, test criterion, test of hypotheses, and type I and type II errors. Neyman also coined the term confidence interval. David (2006–2007) says that terms due to K. Pearson include bivariate normal, goodness of fit, multiple regression, nonlinear regression, random sampling, skewness, standard deviation, and weighted least squares.
This text is different from the massive competing literature in several
ways. First, response plots are heavily used in this text. With the response
plot, the presentation for multiple linear regression is about the same as the
presentation for simple linear regression. Hence the text immediately starts
with the multiple linear regression model, rather than spending 100 pages on
simple linear regression and then covering multiple regression.
Second, the assumption of iid normal N(0, σ²) errors is replaced by the assumption that the iid zero mean errors have constant variance σ². Then large sample theory can be used to justify hypothesis tests, confidence intervals, and prediction intervals.
Third, the multivariate linear model y_i = B^T x_i + ε_i for i = 1, . . . , n has m ≥ 2 response variables Y_1, . . . , Y_m and p predictor variables x_1, x_2, . . . , x_p.

Multivariate linear regression and MANOVA models are special cases. Recent
results from Kakizawa (2009), Su and Cook (2012), Olive et al. (2015), and
Olive (2016b) make the multivariate linear regression model (Chapter 12)
easy to learn after the student has mastered the multiple linear regression
model (Chapters 2 and 3). For the multivariate linear regression model, it is
assumed that the iid zero mean error vectors have fourth moments.
Fourth, recent literature on plots for goodness and lack of fit, bootstrapping,
outlier detection, response transformations, prediction intervals, prediction
regions, and variable selection has been incorporated into the text. See Olive
(2004b, 2007, 2013a,b, 2016a,b,c) and Olive and Hawkins (2005).
Chapter 1 reviews the material to be covered in the text and can be
skimmed and then referred to as needed. Chapters 2 and 3 cover multiple lin-
ear regression, Chapter 4 considers generalized least squares, and Chapters 5
through 9 consider experimental design models. Chapters 10 and 11 cover lin-
ear model theory and the multivariate normal distribution. These chapters
are needed for the multivariate linear regression model covered in Chapter 12.
Chapter 13 covers generalized linear models (GLMs) and generalized additive
models (GAMs).
The text also uses recent literature to provide answers to the following
important questions:
How can the conditional distribution Y | β^T x be visualized?
How can β be estimated?
How can variable selection be performed efficiently?
How can Y be predicted?
The text emphasizes prediction and visualizing the models. Some of the
applications in this text using this research are listed below.
1) It is shown how to use the response plot to detect outliers and to assess
the adequacy of linear models for multiple linear regression and experimental
design.
2) A graphical method for selecting a response transformation for linear
models is given. Linear models include multiple linear regression and many
experimental design models. This method is also useful for multivariate linear
regression.
3) A graphical method for assessing variable selection for the multiple linear regression model is described. It is shown that for submodels I with k predictors, the widely used screen Cp(I) ≤ k is too narrow. More good submodels are considered if the screen Cp(I) ≤ min(2k, p) is used. Variable selection methods originally meant for multiple linear regression can be extended to GLMs. See Chapter 13. Similar ideas from Olive and Hawkins (2005) have been incorporated in Agresti (2013). Section 3.4.1 shows how to bootstrap the variable selection estimator.
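The two screens in item 3 can be sketched as follows. This is a hedged illustration with made-up Cp values, not code from the text; the helper names `passes_classical_screen` and `passes_wide_screen` are hypothetical. A submodel I with k of the p predictors passes the classical screen when Cp(I) ≤ k, and the wider screen when Cp(I) ≤ min(2k, p):

```python
# Sketch of the two variable selection screens: a submodel I with k
# predictors passes the classical screen if Cp(I) <= k, and the wider
# screen recommended in the text if Cp(I) <= min(2k, p).

def passes_classical_screen(cp, k):
    """Widely used but narrow screen: Cp(I) <= k."""
    return cp <= k

def passes_wide_screen(cp, k, p):
    """Wider screen Cp(I) <= min(2k, p) admits more good submodels."""
    return cp <= min(2 * k, p)

# Hypothetical submodels (k predictors, Cp value) with p = 10 predictors
p = 10
for k, cp in [(3, 3.5), (4, 6.9), (5, 4.2)]:
    print(k, cp, passes_classical_screen(cp, k), passes_wide_screen(cp, k, p))
```

The second submodel (k = 4, Cp = 6.9) is rejected by the narrow screen but kept by the wider one, which is the point of the remark above.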
4) Asymptotically optimal prediction intervals for a future response Y_f are given for models of the form Y = β^T x + e where the errors are iid, unimodal, and independent of x. Asymptotically optimal prediction regions are developed for multivariate linear regression.
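To make item 4 concrete, here is a minimal residual-quantile sketch on simulated data. It is a simplification, not the book's procedure: the asymptotic optimality arguments and finite-sample correction factors discussed in the text are omitted, and the 2.5%/97.5% sample quantiles of the residuals are used directly:

```python
# Minimal sketch of a prediction interval for Y_f given x_f under
# Y = beta^T x + e with iid errors: shift the fitted value Yhat_f by
# sample quantiles of the residuals.  Simplified; the correction
# factors from the text are omitted.  Data are simulated.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -1.0])
Y = X @ beta + rng.normal(size=n)            # iid N(0,1) errors

bhat, *_ = np.linalg.lstsq(X, Y, rcond=None)  # OLS fit
resid = Y - X @ bhat

xf = np.array([1.0, 0.5, -0.5])               # hypothetical new predictor vector
yhat_f = xf @ bhat
lo, hi = yhat_f + np.quantile(resid, [0.025, 0.975])   # nominal 95% PI
print(round(float(lo), 2), round(float(hi), 2))
```

With roughly standard normal errors the interval width should be near 2 × 1.96 ≈ 3.9, which a plug-in normal-theory interval would also give; the quantile version adapts to non-normal error distributions.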
5) Rules of thumb for selecting predictor transformations are given.
6) The DD plot is a graphical diagnostic for whether the predictor dis-
tribution is multivariate normal or from some other elliptically contoured
distribution. The DD plot is also useful for detecting outliers in the predic-
tors and for displaying prediction regions for multivariate linear regression.
7) The multivariate linear regression model has m response variables.
Plots, prediction regions, and tests are developed that make this model nearly
as easy to use as the multiple linear regression model (m = 1), at least for
small m.
Throughout the book, there are goodness of fit and lack of fit plots for
examining the model. The response plot is especially important.
The website (http://lagrange.math.siu.edu/Olive/lregbk.htm) for this book provides R programs in the file lregpack.txt and several R data sets in the file lregdata.txt. Section 14.1 discusses how to get the data sets and programs into the software, but the following commands will work.
Downloading the book's R functions lregpack.txt and data files lregdata.txt into R: The commands

source("http://lagrange.math.siu.edu/Olive/lregpack.txt")
source("http://lagrange.math.siu.edu/Olive/lregdata.txt")

can be used to download the R functions and data sets into R. Type ls(). Over 65 R functions from lregpack.txt should appear. In R, enter the command q(). A window asking "Save workspace image?" will appear. Click on "No" to remove the functions from the computer (clicking on "Yes" saves the functions in R, but the functions and data are easily obtained with the source commands).
Chapters 2–7 can be used for a one-semester course in regression and experimental design. For a course in generalized linear models, replace some of the design chapters by Chapter 13. Design chapters could also be replaced by Chapters 12 and 13. A more theoretical course would cover Chapters 1, 10, 11, and 12.
Acknowledgments
This work has been partially supported by NSF grants DMS 0202922 and
DMS 0600933. Collaborations with Douglas M. Hawkins and R. Dennis Cook
were extremely valuable. I am grateful to the developers of useful mathemat-
ical and statistical techniques and to the developers of computer software
and hardware (including R Core Team (2016)). Cook (1998) and Cook and
Weisberg (1999a) influenced this book. Teaching material from this text has
been invaluable. Some of the material in this text has been used in a Math
583 regression graphics course, a Math 583 experimental design course, and
a Math 583 robust statistics course. In 2009 and 2016, Chapters 2 to 7 were
used in Math 484, a course on multiple linear regression and experimental
design. Chapters 11 and 12 were used in a 2014 Math 583 theory of linear

models course. Chapter 12 was also used in a 2012 Math 583 multivariate
analysis course. Chapter 13 was used for a categorical data analysis course.
Thanks also go to Springer, to Springer's associate editor Donna Chernyk, and to several reviewers.

Carbondale, IL, USA David J. Olive


Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Some Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Other Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17


2.1 The MLR Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Checking Goodness of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Checking Lack of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Residual Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Other Model Violations . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 The ANOVA F Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6 The Partial F Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.7 The Wald t Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.8 The OLS Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.9 Two Important Special Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.9.1 The Location Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.9.2 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . 57
2.10 The No Intercept MLR Model . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.12 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.12.1 Lack of Fit Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.13 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

3 Building an MLR Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85


3.1 Predictor Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.2 Graphical Methods for Response Transformations . . . . . . . . . 92


3.3 Main Effects, Interactions, and Indicators . . . . . . . . . . . . . . 97


3.4 Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.4.1 Bootstrapping Variable Selection . . . . . . . . . . . . . . . . . 119
3.5 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
3.6 Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.8 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
3.9 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

4 WLS and Generalized Least Squares . . . . . . . . . . . . . . . . . . . . . . 163


4.1 Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
4.2 GLS, WLS, and FGLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.3 Inference for GLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.4 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

5 One Way Anova . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
5.2 Fixed Effects One Way Anova . . . . . . . . . . . . . . . . . . . . . . . . 177
5.3 Random Effects One Way Anova . . . . . . . . . . . . . . . . . . . . . . . 189
5.4 Response Transformations for Experimental Design . . . . . . . 191
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
5.6 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
5.7 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202

6 The K Way Anova Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213


6.1 Two Way Anova . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.2 K Way Anova Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
6.4 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

7 Block Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227


7.1 One Way Block Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
7.2 Blocking with the K Way Anova Design . . . . . . . . . . . . . . . . . 233
7.3 Latin Square Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
7.5 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
7.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

8 Orthogonal Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245


8.1 Factorial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
8.2 Fractional Factorial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
8.3 Plackett Burman Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
8.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
8.5 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
8.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277

9 More on Experimental Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . 283


9.1 Split Plot Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
9.1.1 Whole Plots Randomly Assigned to A . . . . . . . . . . . . . 284
9.1.2 Whole Plots Assigned to A as in a CRBD . . . . . . . . . 286
9.2 Review of the DOE Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
9.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
9.4 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
9.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294

10 Multivariate Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299


10.1 The Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . 300
10.2 Elliptically Contoured Distributions . . . . . . . . . . . . . . . . . . . . . 303
10.3 Sample Mahalanobis Distances . . . . . . . . . . . . . . . . . . . . . . . . . 307
10.4 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
10.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309

11 Theory for Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313


11.1 Projection Matrices and the Column Space . . . . . . . . . . . . . . . 313
11.2 Quadratic Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
11.3 Least Squares Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
11.3.1 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
11.4 Nonfull Rank Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
11.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
11.6 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
11.7 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340

12 Multivariate Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 343


12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
12.2 Plots for the Multivariate Linear Regression Model . . . . . . . . 348
12.3 Asymptotically Optimal Prediction Regions . . . . . . . . . . . . . . 350
12.4 Testing Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
12.5 An Example and Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
12.5.1 Simulations for Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 372
12.5.2 Simulations for Prediction Regions . . . . . . . . . . . . . . . . 375
12.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
12.7 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
12.8 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383

13 GLMs and GAMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389


13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
13.2 Additive Error Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
13.3 Binary, Binomial, and Logistic Regression . . . . . . . . . . . . . . . . 394
13.4 Poisson Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
13.5 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
13.6 Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419

13.7 Generalized Additive Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 428


13.7.1 Response Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
13.7.2 The EE Plot for Variable Selection . . . . . . . . . . . . . . . 432
13.7.3 An EE Plot for Checking the GLM . . . . . . . . . . . . . . . 433
13.7.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
13.8 Overdispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
13.9 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
13.10 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442

14 Stuff for Students . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459


14.1 R and Arc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
14.2 Hints for Selected Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
14.3 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
Chapter 1
Introduction

This chapter provides a preview of the book but is presented in a rather abstract setting and will be easier to follow after reading the rest of the book. The reader may omit this chapter on first reading and refer back to it as necessary. Chapters 2 to 9 consider multiple linear regression and experimental design models fit with least squares. Chapter 1 is useful for extending several techniques, such as response plots and plots for response transformations used in those chapters, to alternative fitting methods and to alternative regression models. Chapter 13 illustrates some of these extensions for the generalized linear model (GLM) and the generalized additive model (GAM).
Response variables are the variables of interest, and are predicted with
a p × 1 vector of predictor variables x = (x_1, . . . , x_p)^T where x^T is the
transpose of x. A multivariate regression model has m > 1 response variables.
For example, predict Y1 = systolic blood pressure and Y2 = diastolic blood
pressure using a constant x1 , x2 = age, x3 = weight, and x4 = dosage amount
of blood pressure medicine. The multivariate location and dispersion model
of Chapter 10 is a special case of the multivariate linear regression model of
Chapter 12.
A univariate regression model has one response variable Y. Suppose Y is independent of the predictor variables x given a function h(x), written Y ⊥ x | h(x), where h : R^p → R^d and the integer d is as small as possible. Then Y follows a dD regression model, where d ≤ p since Y ⊥ x | x. If Y ⊥ x, then Y follows a 0D regression model. Then there are 0D, 1D, . . . , pD regression models, and all univariate regression models are dD regression models for some integer 0 ≤ d ≤ p. Cook (1998, p. 49) and Cook and Weisberg (1999a, p. 414) use similar notation with h(x) = (x^T β_1, . . . , x^T β_d)^T.
The remainder of this chapter considers 1D regression models, where h : R^p → R is a real function. The additive error regression model Y = m(x) + e is an important special case with h(x) = m(x). See Section 13.2. An important special case of the additive error model is the linear regression model Y = x^T β + e = x_1 β_1 + · · · + x_p β_p + e. Multiple linear regression and many experimental design models are special cases of the linear regression model.


The multiple linear regression model has at least one predictor x_i that takes on many values. Chapter 2 fits this model with least squares and Chapter 3 considers variable selection methods such as forward selection. There are many other methods for fitting the multiple linear regression model, including lasso, ridge regression, partial least squares (PLS), and principal component regression (PCR). See James et al. (2013), Olive (2017), and Pelawa Watagoda and Olive (2017). Chapters 2 and 3 consider response plots, plots for response transformations, and prediction intervals for the multiple linear regression model fit by least squares. All of these techniques can be extended to alternative fitting methods.

1.1 Some Regression Models

All models are wrong, but some are useful.


Box (1979)

In data analysis, an investigator is presented with a problem and data from some population. The population might be the collection of all possible outcomes from an experiment while the problem might be predicting a future value of the response variable Y or summarizing the relationship between Y and the p × 1 vector of predictor variables x. A statistical model is used to provide a useful approximation to some of the important underlying characteristics of the population which generated the data. Many of the most used models for 1D regression, defined below, are families of conditional distributions Y | x = x_o indexed by x = x_o. A 1D regression model is a parametric model if the conditional distribution is completely specified except for a fixed finite number of parameters; otherwise, the 1D model is a semiparametric model. GLMs and GAMs, defined below, are covered in Chapter 13.

Definition 1.1. Regression investigates how the response variable Y changes with the value of a p × 1 vector x of predictors. Often this conditional distribution Y | x is described by a 1D regression model, where Y is conditionally independent of x given the sufficient predictor SP = h(x), written

Y ⊥ x | SP or Y ⊥ x | h(x), (1.1)

where the real valued function h : R^p → R. The estimated sufficient predictor ESP = ĥ(x). An important special case is a model with a linear predictor h(x) = α + β^T x where ESP = α̂ + β̂^T x. This class of models includes the generalized linear model (GLM). Another important special case is a generalized additive model (GAM), where Y is independent of x = (x_1, . . . , x_p)^T given the additive predictor AP = α + Σ_{j=1}^p S_j(x_j) for some (usually unknown) functions S_j. The estimated additive predictor EAP = ESP = α̂ + Σ_{j=1}^p Ŝ_j(x_j).

Notation: In this text, a plot of x versus Y will have x on the horizontal


axis, and Y on the vertical axis.

Plots are extremely important for regression. When p = 1, x is both a sufficient predictor and an estimated sufficient predictor. So a plot of x versus Y is both a sufficient summary plot and a response plot. Usually the SP is unknown, so only the response plot can be made. The response plot will be extremely useful for checking the goodness of fit of the 1D regression model.

Definition 1.2. A sufficient summary plot is a plot of the SP versus Y. An estimated sufficient summary plot (ESSP) or response plot is a plot of the ESP versus Y.
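As a minimal sketch on simulated data (not code from the text; the book's own software is the R collection lregpack.txt), the coordinates of a response plot under least squares are simply (ESP_i, Y_i) with ESP_i the ith fitted value, and a good fit scatters about the identity line:

```python
# Sketch of the coordinates of a response plot for OLS: horizontal
# coordinate ESP_i = Yhat_i (fitted value), vertical coordinate Y_i.
# Plotting (esp, Y) with the identity line overlaid gives the response
# plot; the residual plot uses (esp, resid) instead.  Simulated data.
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 3.0, -2.0]) + rng.normal(size=n)

bhat, *_ = np.linalg.lstsq(X, Y, rcond=None)
esp = X @ bhat            # estimated sufficient predictor = fitted values
resid = Y - esp           # residuals, mean zero since a constant is in the model
print(round(float(np.corrcoef(esp, Y)[0, 1]), 3))
```

When the model fits, the correlation of the plotted coordinates is high and the scatter about the identity line has roughly constant spread; curvature or fanning in the plot signals lack of fit.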

Notation. Often the index i will be suppressed. For example, the linear regression model

Y_i = α + β^T x_i + e_i (1.2)

for i = 1, . . . , n where β is a p × 1 unknown vector of parameters, and e_i is a random error. This model could be written Y = α + β^T x + e. More accurately, Y | x = α + β^T x + e, but the conditioning on x will often be suppressed. Often the errors e_1, . . . , e_n are iid (independent and identically distributed) from a distribution that is known except for a scale parameter. For example, the e_i's might be iid from a normal (Gaussian) distribution with mean 0 and unknown standard deviation σ. For this Gaussian model, estimation of α, β, and σ is important for inference and for predicting a new value of the response variable Y_f given a new vector of predictors x_f.

The class of 1D regression models is very rich, and many of the most
used statistical models, including GLMs and GAMs, are 1D regression mod-
els. Nonlinear regression, nonparametric regression, and linear regression are
special cases of the additive error regression model

Y = h(x) + e = SP + e. (1.3)

The multiple linear regression model and experimental design model or ANOVA model are special cases of the linear regression model. Another important class of parametric or semiparametric 1D regression models has the form

Y = g(α + x^T β, e) or Y = g(x^T β, e). (1.4)

Special cases include GLMs and the response transformation model

Z = t^{-1}(α + β^T x + e) (1.5)

where t^{-1} is a one to one (typically monotone) function. Hence

Y = t(Z) = α + β^T x + e. (1.6)

Sections 3.2 and 5.4 show how to choose the response transformation t(Z)
graphically, and these techniques are easy to extend to the additive error
regression model Y = h(x) + e. Then the response transformation model is
Y = tλ(Z) = hλ(x) + e, and the graphical method for selecting the response
transformation is to plot ĥλi(x) versus tλi(Z) for several values of λi, choosing
the value of λ = λ0 where the plotted points follow the identity line with unit
slope and zero intercept. For the multiple linear regression model, ĥλi(x) =
xᵀβ̂λi where β̂λi can be found using the desired fitting method, e.g. lasso.

In the literature, the response variable is sometimes called the dependent


variable while the predictor variables are sometimes called carriers, covariates,
explanatory variables, or independent variables. The ith case (Yi, xiᵀ) consists
of the values of the response variable Yi and the predictor variables xiᵀ =
(xi,1, . . . , xi,p) where p is the number of predictors and i = 1, . . . , n. The
sample size n is the number of cases.

Box (1979) warns that all models are wrong, but some are useful. For
example, the function g in equation (1.4) or the error distribution could
be misspecified. Diagnostics are used to check whether model assumptions
such as the form of g and the proposed error distribution are reasonable.
Often diagnostics use residuals ri. For example, the additive error regression
model (1.3) uses

ri = Yi − ĥ(xi)

where ĥ(x) is an estimate of h(x).

Exploratory data analysis (EDA) can be used to find useful models when
the form of the regression model is unknown. For example, if the monotone
function t is unknown, and

Z = t⁻¹(α + βᵀx + e),   (1.7)

then the transformation

Y = t(Z) = α + βᵀx + e   (1.8)

follows a linear regression model. EDA can be used to find response and
predictor transformations to build a model. See Sections 3.1 and 3.2.

After selecting a 1D regression model such as a GLM, the investigator will


often want to check whether the model is useful and to perform inference.
Several things to consider are listed below.
i) Use the response plot (and the sufficient summary plot) to explain the
1D regression model to consulting clients, students, or researchers.
ii) Goodness of fit: use the response plot to show that the model provides a
simple, useful approximation for the relationship between the response variable
Y and the predictors x. The response plot is used to visualize the conditional
distribution of Y|x, Y|SP, and Y|(α + βᵀx) if SP = α + βᵀx.
iii) Check for lack of fit of the model with a residual plot of the ESP versus
the residuals.
iv) Fit the model and find ĥ(x). If SP = α + βᵀx, estimate α and β, e.g.,
using maximum likelihood estimators.
v) Estimate the mean function E(Yi|xi) = μ(xi) = diτ(xi) or estimate
τ(xi) where the di are known constants.
vii) Check for overdispersion with an OD plot. See Section 13.8.
viii) Check whether Y is independent of x, that is, check whether the
nontrivial predictors x are needed in the model. Check whether SP = h(x)
≡ c where the constant c does not depend on the xi. If SP = α + βᵀx, check
whether β = 0, for example, test Ho: β = 0.
ix) Check whether a reduced model can be used instead of the full model.
If SP = α + βᵀx = α + βRᵀxR + βOᵀxO where the r × 1 vector xR consists
of the nontrivial predictors in the reduced model, test Ho: βO = 0.
x) Use variable selection to find a good submodel.
xi) Predict Yi given xi.

The field of statistics known as regression graphics gives useful results for
examining the 1D regression model (1.1) even when the model is unknown or
misspecified. The following section shows that the sufficient summary plot is
useful for explaining the given 1D model while the response plot can often be
used to visualize the conditional distribution of Y|SP. Also see Chapter 13
and Olive (2013b).

1.2 Multiple Linear Regression

Suppose that the response variable Y is quantitative and that at least one pre-
dictor variable xi is quantitative. Then the multiple linear regression (MLR)
model is often a very useful model. For the MLR model,

Yi = α + xi,1β1 + xi,2β2 + ⋯ + xi,pβp + ei = α + xiᵀβ + ei = α + βᵀxi + ei   (1.9)

for i = 1, . . . , n. Here Yi is the response variable, xi is a p × 1 vector of
nontrivial predictors, α is an unknown constant, β is a p × 1 vector of unknown
coefficients, and ei is a random variable called the error.
The Gaussian or normal MLR model makes the additional assumption that
the errors ei are iid N(0, σ²) random variables. This model can also be written
as Y = α + βᵀx + e where e ~ N(0, σ²), or Y|x ~ N(α + βᵀx, σ²), or Y|x ~
N(SP, σ²), or Y|SP ~ N(SP, σ²). The normal MLR model is a parametric
model since, given x, the family of conditional distributions is completely
specified by the parameters α, β, and σ². Since Y|SP ~ N(SP, σ²), the
conditional mean function E(Y|SP) ≡ M(SP) = μ(SP) = SP = α + βᵀx.
The MLR model is discussed in detail in Chapters 2, 3, and 4.


Fig. 1.1 SSP for MLR Data

A sufficient summary plot (SSP) of the sufficient predictor SP = α + βᵀxi
versus the response variable Yi with the mean function added as a visual aid
can be useful for describing the multiple linear regression model. This plot
cannot be used for real data since α and β are unknown. To make Figure 1.1,
the artificial data used n = 100 cases with k = 5 nontrivial predictors. The
data used α = 1, β = (1, 2, 3, 0, 0)ᵀ, ei ~ N(0, 1) and x from a multivariate
normal distribution x ~ N5(0, I).
In Figure 1.1, notice that the identity line with unit slope and zero
intercept corresponds to the mean function since the identity line is the line
Y = SP = α + βᵀx = μ(SP) = E(Y|SP). The vertical deviation of Yi
from the line is equal to ei = Yi − (α + βᵀxi). For a given value of SP,
Yi ~ N(SP, σ²). For the artificial data, σ² = 1. Hence if SP = 0 then
Yi ~ N(0, 1), and if SP = 5 then Yi ~ N(5, 1). Imagine superimposing the
N(SP, σ²) curve at various values of SP. If all of the curves were shown, then
the plot would resemble a road through a tunnel. For the artificial data, each
Yi is a sample of size 1 from the normal curve with mean α + βᵀxi.
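The artificial data are easy to reproduce. The sketch below uses Python rather than this text's R (the seed and random draws are arbitrary assumptions, so the plotted points would differ from Figure 1.1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 5                       # cases and nontrivial predictors
alpha = 1.0
beta = np.array([1.0, 2.0, 3.0, 0.0, 0.0])
x = rng.standard_normal((n, k))     # x ~ N_5(0, I)
e = rng.standard_normal(n)          # e ~ N(0, 1)
SP = alpha + x @ beta               # sufficient predictor
Y = SP + e                          # Y | SP ~ N(SP, 1)
# A sufficient summary plot would graph SP (horizontal) versus Y (vertical)
# with the identity line Y = SP added as the mean function.
```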
The estimated sufficient summary plot (ESSP) is a plot of α̂ + β̂ᵀxi versus
Yi with the identity line added as a visual aid.
1.2 Multiple Linear Regression 7


Fig. 1.2 ESSP = Response Plot for MLR Data



Fig. 1.3 Residual Plot for MLR Data




Fig. 1.4 Response Plot when Y is Independent of the Predictors

For MLR, the ESP = α̂ + β̂ᵀx and the estimated conditional mean function
is μ̂(ESP) = ESP. The estimated or fitted value of Yi is equal to Ŷi = α̂ + β̂ᵀxi.
Now the vertical deviation of Yi from the identity line is equal to the
residual ri = Yi − (α̂ + β̂ᵀxi). The interpretation of the ESSP is almost the
same as that of the SSP, but now the mean SP is estimated by the estimated
sufficient predictor (ESP). This plot is also called the response plot and is
used as a goodness of fit diagnostic. The residual plot is a plot of the ESP
versus ri and is used as a lack of fit diagnostic. These two plots should be
made immediately after fitting the MLR model and before performing
inference. Figures 1.2 and 1.3 show the response plot and residual plot for the
artificial data.
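The quantities shown in these two plots come directly from an OLS fit. A Python sketch (illustrative, not this text's R code; the simulated data are an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.standard_normal((n, 5))
Y = 1.0 + x @ np.array([1.0, 2.0, 3.0, 0.0, 0.0]) + rng.standard_normal(n)

X = np.column_stack([np.ones(n), x])          # design matrix with a constant
bhat, *_ = np.linalg.lstsq(X, Y, rcond=None)  # (alpha-hat, beta-hat)
ESP = X @ bhat                                # ESP = fitted values
res = Y - ESP                                 # residuals
# Response plot: ESP versus Y with the identity line.
# Residual plot: ESP versus res with the horizontal line res = 0.
```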

The response plot is also a useful visual aid for describing the ANOVA F
test (see Section 2.4) which tests whether β = 0, that is, whether the nontrivial
predictors x are needed in the model. If the predictors are not needed in the
model, then Yi and E(Yi|xi) should be estimated by the sample mean Ȳ. If
the predictors are needed, then Yi and E(Yi|xi) should be estimated by the
ESP Ŷi = α̂ + β̂ᵀxi. If the identity line clearly fits the data better than the
horizontal line Y = Ȳ, then the ANOVA F test should have a small p-value
and reject the null hypothesis Ho that the predictors x are not needed in the
MLR model. Figure 1.2 shows that the identity line fits the data better than
any horizontal line. Figure 1.4 shows the response plot for the artificial data
when only X4 and X5 are used as predictors with the identity line and the
line Y = Ȳ added as visual aids. In this plot the horizontal line fits the data
about as well as the identity line, which was expected since Y is independent
of X4 and X5.

It is easy to find data sets where the response plot looks like Figure 1.4,
but the p-value for the ANOVA F test is very small. In this case, the MLR
model is statistically significant, but the investigator needs to decide whether
the MLR model is practically significant.

1.3 Variable Selection

A standard problem in 1D regression is variable selection, also called subset or
model selection. Assume that the 1D regression model uses a linear predictor,

Y ⫫ x|(α + βᵀx),   (1.10)

that a constant α is always included, that x = (x1, . . . , xp−1)ᵀ are the p − 1
nontrivial predictors, and that the n × p matrix X with ith row (1, xiᵀ) has full
rank p. Then variable selection is a search for a subset of predictor variables
that can be deleted without important loss of information.
To clarify ideas, assume that there exists a subset S of predictor variables
such that if xS is in the 1D model, then none of the other predictors are
needed in the model. Write E for these (extraneous) variables not in S,
partitioning x = (xSᵀ, xEᵀ)ᵀ. Then

SP = α + βᵀx = α + βSᵀxS + βEᵀxE = α + βSᵀxS.   (1.11)

The extraneous terms that can be eliminated given that the subset S is in
the model have zero coefficients: βE = 0.
Now suppose that I is a candidate subset of predictors, that S ⊆ I and
that O is the set of predictors not in I. Then

SP = α + βᵀx = α + βSᵀxS = α + βSᵀxS + β(I/S)ᵀxI/S + 0ᵀxO = α + βIᵀxI,

where xI/S denotes the predictors in I that are not in S. Since this is true
regardless of the values of the predictors, βO = 0 if S ⊆ I. Hence for any
subset I that includes all relevant predictors, the population correlation

corr(α + βᵀxi, α + βIᵀxI,i) = 1.   (1.12)

This observation, which is true regardless of the explanatory power of the
model, suggests that variable selection for a 1D regression model (1.11) is
simple in principle. For each value of j = 1, 2, . . . , p − 1 nontrivial predictors,
keep track of subsets I that provide the largest values of corr(ESP, ESP(I)).
Any such subset for which the correlation is high is worth closer investigation
and consideration. To make this advice more specific, use the rule of thumb
that a candidate subset of predictors I is worth considering if the sample
correlation of ESP and ESP(I) satisfies

corr(α̂ + β̂ᵀxi, α̂I + β̂Iᵀ xI,i) = corr(β̂ᵀxi, β̂Iᵀ xI,i) ≥ 0.95.   (1.13)
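Rule of thumb (1.13) can be checked by fitting the full model and a submodel with OLS and correlating the two ESPs. An illustrative Python sketch (the simulated data and the choice of submodel I are assumptions, not from the text):

```python
import numpy as np

def ols_esp(X, Y):
    """Return the OLS ESP (fitted values) for a design matrix X with a constant."""
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return X @ b

rng = np.random.default_rng(2)
n = 200
x = rng.standard_normal((n, 5))
Y = 1.0 + x @ np.array([1.0, 2.0, 3.0, 0.0, 0.0]) + rng.standard_normal(n)

full = np.column_stack([np.ones(n), x])
sub = np.column_stack([np.ones(n), x[:, :3]])   # submodel I keeps x1, x2, x3
r = np.corrcoef(ols_esp(full, Y), ols_esp(sub, Y))[0, 1]
print(r)  # near 1 here, since the deleted predictors have zero coefficients
```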

The difficulty with this approach is that fitting large numbers of possible
submodels involves substantial computation. Fortunately, (ordinary) least
squares (OLS) frequently gives a useful ESP, and methods originally meant
for multiple linear regression using the Mallows Cp criterion (see Jones 1946
and Mallows 1973) also work for more general 1D regression models. As a rule
of thumb, the OLS ESP is useful if |corr(OLS ESP, ESP)| ≥ 0.95 where ESP
is the standard ESP (e.g., for generalized linear models, the ESP is α̂ + β̂ᵀx
where (α̂, β̂) is the maximum likelihood estimator of (α, β)), or if the OLS
response plot suggests that the OLS ESP is good. Variable selection will be
discussed in much greater detail in Chapters 3 and 13, but the following
methods are useful for a large class of 1D regression models.

Perhaps the simplest method of variable selection is the t directed search
(see Daniel and Wood 1980, pp. 100–101). Let k be the number of predictors
in the model, including the constant. Hence k = p for the full
model. Let X1, . . . , Xp−1 denote the nontrivial predictor variables and let
W1, W2, . . . , Wp−1 be the predictor variables in decreasing order of importance.
Use theory if possible, but if no theory is available then fit the full
model using OLS and let ti denote the t statistic for testing Ho: βi = 0. Let
|t|(1) ≤ |t|(2) ≤ ⋯ ≤ |t|(p−1). Then Wi corresponds to the Xj with |t|(p−i)
for i = 1, 2, . . . , p − 1. That is, W1 has the largest t statistic, W2 the next
largest, etc. Then use OLS to compute Cp(Ij) for the p − 1 models Ij where
Ij contains W1, . . . , Wj and a constant for j = 1, . . . , p − 1.

Forward selection starts with a constant = W0.
Step 1) k = 2: compute Cp for all models containing the constant and a
single predictor Xi. Keep the predictor W1 = Xj, say, that corresponds to
the model with the smallest value of Cp.
Step 2) k = 3: Fit all models with k = 3 that contain W0 and W1. Keep the
predictor W2 that minimizes Cp.
Step j) k = j + 1: Fit all models with k = j + 1 that contain W0, W1, . . . , Wj.
Keep the predictor Wj+1 that minimizes Cp.
Step p − 1) k = p: Fit the full model.
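The forward selection steps above can be sketched as follows. This is an illustrative Python version (the `cp` helper, simulated data, and seed are assumptions, not lregpack code):

```python
import numpy as np

def cp(X, Y, mse_full):
    """Mallows Cp = SSE(I)/MSE + 2k - n for the model with design matrix X."""
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    sse = np.sum((Y - X @ b) ** 2)
    return sse / mse_full + 2 * k - n

def forward_selection(x, Y):
    """Greedy forward selection on Cp; returns predictor column indices in entry order."""
    n, q = x.shape
    full = np.column_stack([np.ones(n), x])
    b, *_ = np.linalg.lstsq(full, Y, rcond=None)
    mse_full = np.sum((Y - full @ b) ** 2) / (n - q - 1)
    chosen, remaining = [], list(range(q))
    while remaining:
        # add the predictor giving the smallest Cp at this step
        scores = [(cp(np.column_stack([np.ones(n), x[:, chosen + [j]]]), Y, mse_full), j)
                  for j in remaining]
        best_cp, best_j = min(scores)
        chosen.append(best_j)
        remaining.remove(best_j)
    return chosen

rng = np.random.default_rng(3)
x = rng.standard_normal((200, 5))
Y = 1.0 + x @ np.array([1.0, 2.0, 3.0, 0.0, 0.0]) + rng.standard_normal(200)
order = forward_selection(x, Y)
print(order)  # predictors in the order they entered the model
```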

Backward elimination starts with the full model. All models contain a
constant = U0. Hence the full model contains U0, X1, . . . , Xp−1. We will also
say that the full model contains U0, U1, . . . , Up−1 where Ui need not equal Xi
for i ≥ 1.
Step 1) k = p − 1: fit each model with p − 1 predictors including a constant.
Delete the predictor Up−1, say, that corresponds to the model with the smallest
Cp. Keep U0, . . . , Up−2.
Step 2) k = p − 2: fit each model with p − 2 predictors including the constant.
Delete the predictor Up−2 that corresponds to the smallest Cp. Keep
U0, U1, . . . , Up−3.
Step j) k = p − j: fit each model with p − j predictors and a constant. Delete the
predictor Up−j that corresponds to the smallest Cp. Keep U0, U1, . . . , Up−j−1.
Step p − 2) k = 2: The current model contains U0, U1, and U2. Fit the model
U0, U1 and the model U0, U2. Assume that model U0, U1 minimizes Cp. Then
delete U2 and keep U0 and U1.
(Step p − 1, which finds Cp for the model that only contains the constant U0,
is often omitted.)
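Backward elimination can be sketched the same way. This illustrative Python version (simulated data and helper names are assumptions) records the Cp value at each deletion and reports the submodel on the path with the smallest Cp:

```python
import numpy as np

def cp_score(X, Y, mse_full):
    # Mallows Cp = SSE(I)/MSE + 2k - n, where k = number of columns of X
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.sum((Y - X @ b) ** 2) / mse_full + 2 * k - n

rng = np.random.default_rng(4)
n = 200
x = rng.standard_normal((n, 5))
Y = 1.0 + x @ np.array([1.0, 2.0, 3.0, 0.0, 0.0]) + rng.standard_normal(n)

full = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(full, Y, rcond=None)
mse_full = np.sum((Y - full @ b) ** 2) / (n - full.shape[1])

kept = list(range(5))                      # indices of nontrivial predictors
path = [(cp_score(full, Y, mse_full), sorted(kept))]
while len(kept) > 1:
    # delete the predictor whose removal yields the smallest Cp
    scores = [(cp_score(np.column_stack([np.ones(n), x[:, [i for i in kept if i != j]]]),
                        Y, mse_full), j) for j in kept]
    best_cp, drop = min(scores)
    kept.remove(drop)
    path.append((best_cp, sorted(kept)))

best = min(path)                           # smallest Cp submodel on the path
print(best)
```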

All subsets variable selection examines all subsets and keeps track of
several (up to three, say) subsets with the smallest Cp(I) for each group of
submodels containing k predictors including a constant. This method can be
used for p ≤ 30 by using the efficient leaps and bounds algorithms when
OLS and Cp is used (see Furnival and Wilson 1974).

Rule of thumb for variable selection (assuming that the cost of each
predictor is the same): find the submodel Im with the minimum Cp. If Im
uses km predictors including a constant, do not use any submodel that has
more than km predictors. Since the minimum Cp submodel often has too
many predictors, also look at the submodel Io with the smallest value of k,
say ko, such that Cp ≤ 2k. This submodel may have too few predictors.
So look at the predictors in Im but not in Io and see if they can be deleted
or not. (If Im = Io, then it is a good candidate for the best submodel.)

Variable selection with the Cp criterion is closely related to the partial F
test for testing whether a reduced model should be used instead of the full
model. See Section 2.6. The following results are properties of OLS and hold
even if the data does not follow a 1D model. If the candidate model of xI has
k terms (including the constant), then the partial F test for reduced model
I uses test statistic

FI = { [SSE(I) − SSE] / [(n − k) − (n − p)] } / { SSE/(n − p) } = [(n − p)/(p − k)] [SSE(I)/SSE − 1]

where SSE is the residual sum of squares from the full model and SSE(I) is
the residual sum of squares from the candidate submodel. Then

Cp(I) = SSE(I)/MSE + 2k − n = (p − k)(FI − 1) + k   (1.14)
where MSE is the residual mean square for the full model. Let ESP(I) =
α̂I + β̂Iᵀ x be the ESP for the submodel and let VI = Y − ESP(I) so that
VI,i = Yi − (α̂I + β̂Iᵀ xi). Let ESP and V denote the corresponding quantities
for the full model. Then Olive and Hawkins (2005) show that corr(VI, V) → 1
forces corr(OLS ESP, OLS ESP(I)) → 1 and that

corr(V, VI) = √[SSE/SSE(I)] = √[(n − p)/(Cp(I) + n − 2k)] = √[(n − p)/((p − k)FI + n − p)].

Also Cp(I) ≤ 2k corresponds to corr(VI, V) ≥ dn where

dn = √(1 − p/n).

Notice that the submodel Ik that minimizes Cp(I) also maximizes corr(V, VI)
among all submodels I with k predictors including a constant. If Cp(I) ≤ 2k
and n ≥ 10p, then 0.948 ≤ corr(V, V(I)), and both corr(V, V(I)) → 1.0 and
corr(OLS ESP, OLS ESP(I)) → 1.0 as n → ∞.
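These identities can be verified numerically. An illustrative Python check (the simulated data and the choice of submodel I are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 6                       # p model terms including the constant
x = rng.standard_normal((n, p - 1))
Y = 1.0 + x @ np.array([1.0, 2.0, 3.0, 0.0, 0.0]) + rng.standard_normal(n)

def fit(X):
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return X @ b, np.sum((Y - X @ b) ** 2)

full = np.column_stack([np.ones(n), x])
sub = np.column_stack([np.ones(n), x[:, :3]])   # candidate I with k = 4 terms
k = sub.shape[1]

esp, sse = fit(full)
esp_I, sse_I = fit(sub)
mse = sse / (n - p)

FI = ((sse_I - sse) / (p - k)) / mse            # partial F statistic
cp_I = sse_I / mse + 2 * k - n                  # Mallows Cp for submodel I
lhs = np.corrcoef(Y - esp, Y - esp_I)[0, 1]     # corr(V, V_I) with V = Y - ESP
```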

If a 1D model (1.11) holds, a common assumption made for variable
selection is that the fitted full model ESP is a good estimator of the sufficient
predictor, and the usual graphical and numerical checks on this assumption
should be made. Also assume that the OLS ESP is useful. This assumption
can be checked by making an OLS response plot or by verifying that
|corr(OLS ESP, ESP)| ≥ 0.95. Then we suggest that submodels I are
interesting if Cp(I) ≤ min(2k, p).

Suppose that the OLS ESP and the standard ESP are highly correlated:
|corr(ESP, OLS ESP)| ≥ 0.95. Then often OLS variable selection can be used
for the 1D data, and using the p-values from OLS output seems to be a useful
benchmark. To see this, suppose that n ≥ 5p and first consider the model
Ii that deletes the predictor Xi. Then the model has k = p − 1 predictors
including the constant, and the test statistic is ti where

ti² = FIi.

Using (1.14) and Cp(Ifull) = p, notice that

Cp(Ii) = (p − (p − 1))(ti² − 1) + (p − 1) = ti² − 1 + Cp(Ifull) − 1,

or
Cp(Ii) = Cp(Ifull) + (ti² − 2).
Using the screen Cp(I) ≤ min(2k, p) suggests that the predictor Xi should
not be deleted if

|ti| > √2 ≈ 1.414.

If |ti| < √2, then the predictor can probably be deleted since Cp decreases.
More generally, for the partial F test, notice that by (1.14), Cp(I) ≤ 2k
iff (p − k)FI − p + 2k ≤ 2k iff (p − k)FI ≤ p iff

FI ≤ p/(p − k).

Now k is the number of terms in the model including a constant while p − k
is the number of terms set to 0. As k → 0, the partial F test will reject Ho
(i.e., say that the full model should be used instead of the submodel I) unless
FI is not much larger than 1. If p is very large and p − k is very small, then
the partial F test will tend to suggest that there is a model I that is about
as good as the full model even though model I deletes p − k predictors.
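The identity Cp(Ii) = Cp(Ifull) + (ti² − 2) above is easy to confirm numerically. A hedged Python sketch (simulated data, arbitrary choice of deleted predictor):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 6
x = rng.standard_normal((n, p - 1))
Y = 1.0 + x @ np.array([1.0, 2.0, 3.0, 0.0, 0.0]) + rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])

b, *_ = np.linalg.lstsq(X, Y, rcond=None)
sse = np.sum((Y - X @ b) ** 2)
mse = sse / (n - p)
se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))
t = b / se                                  # OLS t statistics

i = 4                                       # delete the predictor in column 4 of X
XI = np.delete(X, i, axis=1)                # submodel I_i with k = p - 1 terms
bI, *_ = np.linalg.lstsq(XI, Y, rcond=None)
cp_Ii = np.sum((Y - XI @ bI) ** 2) / mse + 2 * (p - 1) - n
# Cp for the full model is p, so cp_Ii should equal p + t[i]**2 - 2
```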
The Cp(I) ≤ k screen tends to overfit. An additive error single index model
is Y = m(α + xᵀβ) + e. We simulated multiple linear regression and single
index model data sets with p = 8 and n = 50, 100, 1000, and 10000. The true
model S satisfied Cp(S) ≤ k for about 60% of the simulated data sets, but S
satisfied Cp(S) ≤ 2k for about 97% of the data sets.

1.4 Other Issues

The 1D regression models offer a unifying framework for many of the most
used regression models. By writing the model in terms of the sufficient predictor
SP = h(x), many important topics valid for all 1D regression models can
be explained compactly. For example, the previous section presented variable
selection, and equation (1.14) can be used to motivate the test for whether
the reduced model can be used instead of the full model. Similarly, the
sufficient predictor can be used to unify the interpretation of coefficients and to
explain models that contain interactions and factors.

Interpretation of Coefficients
One interpretation of the coefficients in a 1D model (1.11) is that βi is the
rate of change in the SP associated with a unit increase in xi when all other
predictor variables x1, . . . , xi−1, xi+1, . . . , xp are held fixed. Denote a model
by SP = α + βᵀx = α + β1x1 + ⋯ + βpxp. Then

βi = ∂SP/∂xi for i = 1, . . . , p.

Of course, holding all other variables fixed while changing xi may not be
possible. For example, if x1 = x, x2 = x² and SP = α + β1x + β2x², then x2
cannot be held fixed when x1 increases by one unit, but

dSP/dx = β1 + 2β2x.
The interpretation of βi changes with the model in two ways. First,
the interpretation changes as terms are added and deleted from the SP.
Hence the interpretation of β1 differs for models SP = α + β1x1 and
SP = α + β1x1 + β2x2. Secondly, the interpretation changes as the parametric
or semiparametric form of the model changes. For multiple linear regression,
E(Y|SP) = SP and an increase in one unit of xi increases the conditional
expectation by βi. For binary logistic regression,

E(Y|SP) = ρ(SP) = exp(SP)/(1 + exp(SP)),

and the change in the conditional expectation associated with a one unit
increase in xi is more complex.
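For instance, with hypothetical coefficients SP = α + β1x1 (an illustrative Python sketch, not from the text), the effect of a one unit increase in x1 on E(Y|SP) = ρ(SP) depends on where SP starts:

```python
import math

def rho(sp):
    """Binary logistic regression mean function rho(SP) = exp(SP)/(1 + exp(SP))."""
    return math.exp(sp) / (1.0 + math.exp(sp))

alpha, b1 = 0.0, 1.0          # hypothetical coefficient values
for x1 in (-4.0, 0.0, 4.0):
    sp = alpha + b1 * x1
    change = rho(sp + b1) - rho(sp)   # effect of a one unit increase in x1
    print(x1, round(change, 4))
# The change is largest near SP = 0 and tiny in the tails, unlike MLR
# where E(Y|SP) always changes by exactly beta_1.
```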

Factors for Qualitative Variables
The interpretation of the coefficients also changes if interactions and factors
are present. Suppose a factor W is a qualitative random variable that
takes on c categories a1, . . . , ac. Then the 1D model will use c − 1 indicator
variables Wi = 1 if W = ai and Wi = 0 otherwise, where one of the levels ai
is omitted, e.g. use i = 1, . . . , c − 1.

Interactions
Suppose X1 is quantitative and X2 is qualitative with 2 levels and X2 = 1
for level a2 and X2 = 0 for level a1. Then a first order model with interaction
is SP = α + β1x1 + β2x2 + β3x1x2. This model yields two unrelated lines in the
sufficient predictor depending on the value of x2: SP = α + β2 + (β1 + β3)x1
if x2 = 1 and SP = α + β1x1 if x2 = 0. If β3 = 0, then there are two
parallel lines: SP = α + β2 + β1x1 if x2 = 1 and SP = α + β1x1 if x2 = 0.
If β2 = β3 = 0, then the two lines are coincident: SP = α + β1x1 for
both values of x2. If β2 = 0, then the two lines have the same intercept:
SP = α + (β1 + β3)x1 if x2 = 1 and SP = α + β1x1 if x2 = 0. In general,
as factors have more levels and interactions have more terms, e.g. x1x2x3x4,
the interpretation of the model rapidly becomes very complex.
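The four cases above can be checked mechanically. An illustrative Python sketch with hypothetical coefficient values:

```python
# Hypothetical coefficient values for the first order model with interaction
alpha, b1, b2, b3 = 1.0, 2.0, 0.5, 1.5

def sp(x1, x2):
    """SP = alpha + b1*x1 + b2*x2 + b3*x1*x2."""
    return alpha + b1 * x1 + b2 * x2 + b3 * x1 * x2

# Level a1 (x2 = 0): line with intercept alpha and slope b1.
# Level a2 (x2 = 1): line with intercept alpha + b2 and slope b1 + b3.
print(sp(0.0, 0.0), sp(1.0, 0.0))   # points on the x2 = 0 line
print(sp(0.0, 1.0), sp(1.0, 1.0))   # points on the x2 = 1 line
```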

1.5 Complements

Cook and Weisberg (1999a, p. 411) define a sufficient summary plot to be a
plot that contains all the sample regression information about the conditional
distribution of the response given the predictors. To help explain the given 1D
model, use the sufficient summary plot (SSP) of SP versus Yi with the mean
function added as a visual aid. If p = 1, then Y ⫫ x|x and the plot of xi versus
Yi is a SSP and has been widely used to explain regression models such as the
simple linear regression (SLR) model and the logistic regression model with
one nontrivial predictor. See Agresti (2002, cover illustration and p. 169)
and Collett (1999, p. 74). Replacing x by SP has two major advantages.
First, the plot can be made for p ≥ 1 and secondly, the possible shapes that
the plot can take is greatly reduced. For example, in a plot of xi versus Yi,
the plotted points will fall about some line with slope β and intercept α if
the SLR model holds, but in a plot of SP = α + βᵀxi versus Yi, the plotted
points will fall about the identity line with unit slope and zero intercept if the
multiple linear regression model holds. If there are more than two nontrivial
predictors, then we generally cannot find a sufficient summary plot and need
to use an estimated sufficient summary plot.
Important theoretical results for the additive error single index model Y =
m(α + βᵀx) + e were given by Brillinger (1977, 1983) and Aldrin et al. (1993).
Li and Duan (1989) extended these results to models of the form

Y = g(α + βᵀx, e)   (1.15)

where g is a bivariate inverse link function. Olive and Hawkins (2005) discuss
variable selection while Chang (2006) and Chang and Olive (2007, 2010)
discuss (ordinary) least squares (OLS) tests. Severini (1998) discusses when
OLS output is relevant for the Gaussian additive error single index model.

1.6 Problems

1.1. Explain why the model Y = g(α + βᵀx, e) can also be written as Y =
g(α + xᵀβ, e).

R Problem

Use the command source("G:/lregpack.txt") to download the functions
and the command source("G:/lregdata.txt") to download the data.
See Preface or Section 14.1. Typing the name of the lregpack function,
e.g. lrplot2, will display the code for the function. Use the args command,
e.g. args(lrplot2), to display the needed arguments for the function. For the
following problem, the R command can be copied and pasted from (http://
lagrange.math.siu.edu/Olive/lreghw.txt) into R.
1.2. The Beaton et al. (1996) TIMSS data has response variable Y = 1 if
there was a statistically significant gender difference in the nation's 8th grade
TIMSS science test, and Y = 0 otherwise. There were n = 35 countries and 12
predictors, including x1 = male 8th grade score, x2 = female 8th grade score,
x3 = male 7th grade score, x4 = female 7th grade score, and x5 = percent
of 8th graders with educational aids (dictionary, study table, and computer).
Enter (or copy and paste) the R command lrplot2(xtimss,ytimss) for this
problem to make a logistic regression response plot using x1–x5 as predictors.
See Chapter 13. Include the response plot in Word.
Chapter 2
Multiple Linear Regression

This chapter introduces the multiple linear regression model, the response
plot for checking goodness of fit, the residual plot for checking lack of fit,
the ANOVA F test, the partial F test, the t tests, and least squares. The
problems use software R, SAS, Minitab, and Arc.

2.1 The MLR Model

Definition 2.1. The response variable is the variable that you want to
predict. The predictor variables are the variables used to predict the
response variable.
Notation. In this text the response variable will usually be denoted by
Y and the p predictor variables will often be denoted by x1, . . . , xp. The
response variable is also called the dependent variable while the predictor
variables are also called independent variables, explanatory variables, carriers,
or covariates. Often the predictor variables will be collected in a vector x.
Then xᵀ is the transpose of x.
Definition 2.2. Regression is the study of the conditional distribution
Y|x of the response variable Y given the vector of predictors x =
(x1, . . . , xp)ᵀ.
Definition 2.3. A quantitative variable takes on numerical values
while a qualitative variable takes on categorical values.
Example 2.1. Archeologists and crime scene investigators sometimes
want to predict the height of a person from partial skeletal remains. A model
for prediction can be built from nearly complete skeletons or from living
humans, depending on the population of interest (e.g., ancient Egyptians
or modern US citizens). The response variable Y is height and the predictor
variables might be x1 ≡ 1, x2 = femur length, and x3 = ulna length.

© Springer International Publishing AG 2017
D.J. Olive, Linear Regression, DOI 10.1007/978-3-319-55252-1_2

The heights of individuals with x2 = 200mm and x3 = 140mm should be


shorter on average than the heights of individuals with x2 = 500mm and
x3 = 350mm. In this example Y , x2 , and x3 are quantitative variables. If
x4 = gender is a predictor variable, then gender (coded as male = 1 and
female = 0) is qualitative.

Definition 2.4. Suppose that the response variable Y and at least one
predictor variable xi are quantitative. Then the multiple linear regression
(MLR) model is

Yi = xi,1β1 + xi,2β2 + ⋯ + xi,pβp + ei = xiᵀβ + ei   (2.1)

for i = 1, . . . , n. Here n is the sample size and the random variable ei is the
ith error. Suppressing the subscript i, the model is Y = xᵀβ + e.

In matrix notation, these n equations become

Y = Xβ + e,   (2.2)

where Y is an n × 1 vector of dependent variables, X is an n × p matrix
of predictors, β is a p × 1 vector of unknown coefficients, and e is an n × 1
vector of unknown errors. Equivalently,

⎡ Y1 ⎤   ⎡ x1,1 x1,2 . . . x1,p ⎤ ⎡ β1 ⎤   ⎡ e1 ⎤
⎢ Y2 ⎥   ⎢ x2,1 x2,2 . . . x2,p ⎥ ⎢ β2 ⎥   ⎢ e2 ⎥
⎢ ⋮  ⎥ = ⎢  ⋮     ⋮   ⋱    ⋮  ⎥ ⎢ ⋮  ⎥ + ⎢ ⋮  ⎥ .   (2.3)
⎣ Yn ⎦   ⎣ xn,1 xn,2 . . . xn,p ⎦ ⎣ βp ⎦   ⎣ en ⎦

Often the first column of X is X1 = 1, the n × 1 vector of ones. The ith
case (xiᵀ, Yi) = (xi1, xi2, . . . , xip, Yi) corresponds to the ith row xiᵀ of X
and the ith element of Y (if xi1 ≡ 1, then xi1 could be omitted). In the
MLR model Y = xᵀβ + e, the Y and e are random variables, but we only
have observed values Yi and xi. If the ei are iid (independent and identically
distributed) with zero mean E(ei) = 0 and variance VAR(ei) = V(ei) = σ²,
then regression is used to estimate the unknown parameters β and σ².

Definition 2.5. The constant variance MLR model uses the
assumption that the errors e1, . . . , en are iid with mean E(ei) = 0 and
variance VAR(ei) = σ² < ∞. Also assume that the errors are independent
of the predictor variables xi. The predictor variables xi are assumed to be
fixed and measured without error. The cases (xiᵀ, Yi) are independent for
i = 1, . . . , n.

If the predictor variables are random variables, then the above MLR model
is conditional on the observed values of the xi . That is, observe the xi and
then act as if the observed xi are xed.

Definition 2.6. The unimodal MLR model has the same assumptions
as the constant variance MLR model, as well as the assumption that the zero
mean constant variance errors e1, . . . , en are iid from a unimodal distribution
that is not highly skewed. Note that E(ei) = 0 and V(ei) = σ² < ∞.

Definition 2.7. The normal MLR model or Gaussian MLR model has
the same assumptions as the unimodal MLR model but adds the assumption
that the errors e1, . . . , en are iid N(0, σ²) random variables. That is, the ei
are iid normal random variables with zero mean and variance σ².

The unknown coefficients β for the above 3 models are usually estimated
using (ordinary) least squares (OLS).

Notation. The symbol A ≡ B ≡ f(c) means that A and B are equivalent
and equal, and that f(c) is the formula used to compute A and B.

Definition 2.8. Given an estimate b of β, the corresponding vector of
predicted values or fitted values is Ŷ ≡ Ŷ(b) = Xb. Thus the ith fitted value

Ŷi ≡ Ŷi(b) = xiᵀb = xi,1b1 + ⋯ + xi,pbp.

The vector of residuals is r ≡ r(b) = Y − Ŷ(b). Thus the ith residual ri ≡

ri(b) = Yi − Ŷi(b) = Yi − xi,1b1 − ⋯ − xi,pbp.

Most regression methods attempt to find an estimate β̂ of β which
minimizes some criterion function Q(b) of the residuals.

Definition 2.9. The ordinary least squares (OLS) estimator β̂OLS
minimizes
QOLS(b) = Σᵢ₌₁ⁿ ri²(b),   (2.4)

and β̂OLS = (XᵀX)⁻¹XᵀY.

The vector of predicted or fitted values ŶOLS = Xβ̂OLS = HY where the
hat matrix H = X(XᵀX)⁻¹Xᵀ provided the inverse exists. Typically the
subscript OLS is omitted, and the least squares regression equation is
Ŷ = β̂1x1 + β̂2x2 + ⋯ + β̂pxp where x1 ≡ 1 if the model contains a constant.
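These formulas can be verified numerically. An illustrative Python sketch (simulated data, not from the text) computes β̂ = (XᵀX)⁻¹XᵀY and the hat matrix directly:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y     # (X'X)^{-1} X'Y
H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
Y_hat = H @ Y                                   # fitted values HY
```

Note that (XᵀX)⁻¹ is inverted here only for illustration; numerically stable software solves the least squares problem instead of forming the inverse.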

There are many statistical models besides the MLR model, and you should
learn how to quickly recognize an MLR model. A regression model has a
response variable Y and the conditional distribution of Y given the predictors
x = (x1, . . . , xp)ᵀ is of interest. Regression models are used to predict Y
and to summarize the relationship between Y and x. If a constant xi,1 ≡ 1
(this notation means that xi,1 = 1 for i = 1, . . . , n) is in the model, then
xi,1 is often called the trivial predictor, and the MLR model is said to have
a constant or intercept. All nonconstant predictors are called nontrivial
predictors. The term multiple is used if the model uses one or more nontrivial
predictors. (Some authors use multivariate instead of multiple, but in this
text a multivariate linear regression model will have m ≥ 2 response
variables. See Chapter 12.) The simple linear regression model is a special case
of the MLR model that uses exactly one nontrivial predictor. Suppose the
response variable is Y and data has been collected on additional variables
x1, . . . , xp.
An MLR model is linear in the unknown coefficients β. Thus the model
is an MLR model in Y and β if we can write Yi = xiᵀβ + ei or Yi = wiᵀβ + ei
where each wi is a function of x1, . . . , xp. Symbols other than w or x may be
used. Alternatively, the model is linear in the parameters if ∂Y/∂βi does
not depend on the parameters. If Y = xᵀβ + e = x1β1 + ⋯ + xpβp + e, then
∂Y/∂βi = xi. Similarly, if Y = wᵀβ + e, then ∂Y/∂βi = wi.

Example 2.2. a) Suppose that interest is in predicting a function of Z
from functions of $w_1, \ldots, w_k$. If $Y = t(Z) = x^T\beta + e$ where t is a function
and each $x_i$ is some function of $w_1, \ldots, w_k$, then there is an MLR model in
Y and $\beta$. Similarly, $Z = t(Y) = w^T\beta + e$ is an MLR model in Z and $\beta$.
b) To see that $Y = \beta_1 + \beta_2 x + \beta_3 x^2 + e$ is an MLR model in Y and $\beta$,
take $w_1 = 1$, $w_2 = x$, and $w_3 = x^2$. Then $Y = w^T\beta + e$.
c) If $Y = \beta_1 + \beta_2 \exp(\beta_3 x) + e$, then the model is a nonlinear regression
model that is not an MLR model in Y and $\beta$. Notice that the model cannot
be written in the form $Y = w^T\beta + e$ and that $\partial Y/\partial\beta_2 = \exp(\beta_3 x)$ and
$\partial Y/\partial\beta_3 = \beta_2 x \exp(\beta_3 x)$ depend on the parameters.
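To make the linear-in-the-parameters idea concrete, here is a short sketch (not from the text, whose own examples use R; the data and coefficients are simulated and hypothetical) that fits the quadratic model of part b) by least squares using the design matrix with columns $w_1 = 1$, $w_2 = x$, $w_3 = x^2$:

```python
# Sketch: Y = beta1 + beta2*x + beta3*x^2 + e is an MLR model because it is
# linear in beta, so OLS applies with design matrix W = [1, x, x^2].
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-2, 2, n)
beta = np.array([1.0, 2.0, -0.5])           # hypothetical true coefficients
W = np.column_stack([np.ones(n), x, x**2])  # w1 = 1, w2 = x, w3 = x^2
Y = W @ beta + rng.normal(0, 0.1, n)        # iid N(0, 0.01) errors

betahat, *_ = np.linalg.lstsq(W, Y, rcond=None)  # OLS estimate of beta
```

By contrast, the model $Y = \beta_1 + \beta_2\exp(\beta_3 x) + e$ of part c) cannot be fit this way, since no choice of columns makes it linear in $\beta$.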

2.2 Checking Goodness of Fit

It is crucial to realize that an MLR model is not necessarily a


useful model for the data, even if the data set consists of a response
variable and several predictor variables. For example, a nonlinear regression
model or a much more complicated model may be needed. Chapters 1 and 13
describe several alternative models. Let p be the number of predictors and n
the number of cases. If $n \ge 5p$, then plots can be used to check
whether the MLR model is useful for studying the data. This technique is
known as checking the goodness of fit of the MLR model.

Notation. Plots will be used to simplify regression analysis, and in this


text a plot of W versus Z uses W on the horizontal axis and Z on the vertical
axis.

Definition 2.10. A scatterplot of X versus Y is a plot of X versus Y


and is used to visualize the conditional distribution Y |X of Y given X.

Definition 2.11. A response plot is a plot of a variable $w_i$ versus $Y_i$.
Typically $w_i$ is a linear combination of the predictors: $w_i = x_i^T\eta$ where $\eta$ is
a known $p \times 1$ vector. The most commonly used response plot is a plot of the
fitted values $\hat{Y}_i$ versus the response $Y_i$.

Proposition 2.1. Suppose that the regression estimator $b$ of $\beta$ is used to
find the residuals $r_i \equiv r_i(b)$ and the fitted values $\hat{Y}_i \equiv \hat{Y}_i(b) = x_i^T b$. Then
in the response plot of $\hat{Y}_i$ versus $Y_i$, the vertical deviations from the identity
line (that has unit slope and zero intercept) are the residuals $r_i(b)$.

Proof. The identity line in the response plot is $Y = x^T b$. Hence the
vertical deviation is $Y_i - x_i^T b = r_i(b)$. □

Definition 2.12. A residual plot is a plot of a variable $w_i$ versus the
residuals $r_i$. The most commonly used residual plot is a plot of $\hat{Y}_i$ versus $r_i$.

Notation: For MLR, the "residual plot" will often mean the residual
plot of $\hat{Y}_i$ versus $r_i$, and the "response plot" will often mean the plot of $\hat{Y}_i$
versus $Y_i$.

If the unimodal MLR model as estimated by least squares is useful, then
in the response plot the plotted points should scatter about the identity line
while in the residual plot of $\hat{Y}$ versus r the plotted points should scatter
about the r = 0 line (the horizontal axis) with no other pattern. Figures 1.2
and 1.3 show what a response plot and residual plot look like for an artificial
MLR data set where the MLR regression relationship is rather strong in that
the sample correlation corr($\hat{Y}, Y$) is near 1. Figure 1.4 shows a response plot
where the response Y is independent of the nontrivial predictors in the model.
Here corr($\hat{Y}, Y$) is near 0 but the points still scatter about the identity line.
When the MLR relationship is very weak, the response plot will look like the
residual plot.
The above ideal shapes for the response and residual plots are for when
the unimodal MLR model gives a good approximation for the data. If the
plots have the ideal shapes and $n \ge 10p$, then expect inference, except for
classical prediction intervals, to be approximately correct for many unimodal
distributions that are close to the normal distribution.
If the response and residual plots suggest an MLR model with iid skewed
errors, then add lowess to both plots. The scatterplot smoother tries to estimate
the mean function $E(Y|\hat{Y})$ or $E(r|\hat{Y})$ without using any model. If the
lowess curve is close to the identity line in the response plot and close to the
r = 0 line in the residual plot, then the constant variance MLR model may be
a good approximation to the data, but sample sizes much larger than n = 10p
may be needed before inference is approximately correct. Such skewed data
sets seem rather rare, but see Chen et al. (2009) and see Problem 2.28.

Remark 2.1. For any MLR analysis, always make the response
plot and the residual plot of $\hat{Y}_i$ versus $Y_i$ and $r_i$, respectively.

[Figure 2.1 here: the response plot (FIT versus Y) and the residual plot (FIT versus RES) for the Tremearne data, with cases 3, 44, and 63 highlighted.]

Fig. 2.1 Residual and Response Plots for the Tremearne Data

Definition 2.13. An outlier is an observation that lies far away from the
bulk of the data.

Remark 2.2. For MLR, the response plot is important because MLR
is the study of the conditional distribution of $Y|x^T\beta$, and the response
plot is used to visualize the conditional distribution of $Y|x^T\beta$ since
$\hat{Y} = x^T\hat{\beta}$ is a good estimator of $x^T\beta$ if $\hat{\beta}$ is a good estimator of $\beta$.

If the MLR model is useful, then the plotted points in the response plot
should be linear and scatter about the identity line with no gross outliers.
Suppose the fitted values range in value from $w_L$ to $w_H$ with no outliers. Fix
the fit $= w$ in this range and mentally add a narrow vertical strip centered at
w to the response plot. The plotted points in the vertical strip should have a
mean near w since they scatter about the identity line. Hence $Y|(\text{fit} = w)$ is
like a sample from a distribution with mean w. The following example helps
illustrate this remark.

Example 2.3. Tremearne (1911) presents a data set of about 17 measurements
on 115 people of Hausa nationality. We deleted 3 cases because
of missing values and used height as the response variable Y. Along with a
constant $x_{i,1} \equiv 1$, the five additional predictor variables used were height
when sitting, height when kneeling, head length, nasal breadth, and span (perhaps
from left hand to right hand). Figure 2.1 presents the OLS response and
residual plots for this data set. These plots show that an MLR model should
be a useful model for the data since the plotted points in the response plot
are linear and follow the identity line while the plotted points in the residual
plot follow the r = 0 line with no other pattern (except for a possible outlier
marked 44).
To use the response plot to visualize the conditional distribution of $Y|x^T\beta$,
use the fact that the fitted values $\hat{Y} = x^T\hat{\beta}$. For example, suppose the height
given fit = 1700 is of interest. Mentally examine the plot about a narrow
vertical strip about fit = 1700, perhaps from 1685 to 1715. The cases in the
narrow strip have a mean close to 1700 since they fall close to the identity
line. Similarly, when the fit $= w$ for w between 1500 and 1850, the cases have
heights near w, on average.
Cases 3, 44, and 63 are highlighted. The 3rd person was very tall while
the 44th person was rather short. Beginners often label too many points as
outliers. Mentally draw a box about the bulk of the data ignoring any outliers.
Double the width of the box (about the identity line for the response plot
and about the horizontal line for the residual plot). Cases outside of this
imaginary doubled box are potential outliers. Alternatively, visually estimate
the standard deviation of the residuals in both plots. In the residual plot look
for residuals that are more than 5 standard deviations from the r = 0 line.
In Figure 2.1, the standard deviation of the residuals appears to be around
10. Hence cases 3 and 44 are certainly worth examining.
The identity line can also pass through or near an outlier or a cluster
of outliers. Then the outliers will be in the upper right or lower left of the
response plot, and there will be a large gap between the cluster of outliers
and the bulk of the data. See Figure 3.11. Figure 2.1 was made with the
following R commands, using the lregpack function MLRplot and the major.lsp
data set from the text's webpage.

major <- matrix(scan(),nrow=112,ncol=7,byrow=T)


#copy and paste the data set then press enter
major <- major[,-1]
X<-major[,-6]
Y <- major[,6]
MLRplot(X,Y) #left click the 3 highlighted cases,
#then right click Stop for each of the two plots

2.3 Checking Lack of Fit

The response plot may look good while the residual plot suggests that the
unimodal MLR model can be improved. Examining plots to find model violations
is called checking for lack of fit. Again assume that $n \ge 5p$.
The unimodal MLR model often provides a useful model for the data, but
the following assumptions do need to be checked.
i) Is the MLR model appropriate?
ii) Are outliers present?
iii) Is the error variance constant or nonconstant? The constant variance
assumption VAR($e_i$) $\equiv \sigma^2$ is known as homoscedasticity. The nonconstant
variance assumption VAR($e_i$) $= \sigma_i^2$ is known as heteroscedasticity.
iv) Are any important predictors left out of the model?
v) Are the errors $e_1, \ldots, e_n$ iid?
vi) Are the errors $e_i$ independent of the predictors $x_i$?

Make the response plot and the residual plot to check i), ii), and iii). An
MLR model is reasonable if the plots look like Figures 1.2, 1.3, 1.4, and 2.1.
A response plot that looks like Figure 13.7 suggests that the model is not
linear. If the plotted points in the residual plot do not scatter about the
r = 0 line with no other pattern (i.e., if the cloud of points is not ellipsoidal or
rectangular with zero slope), then the unimodal MLR model is not sustained.
The ith residual $r_i$ is an estimator of the ith error $e_i$. The constant variance
assumption may have been violated if the variability of the point cloud in the
residual plot depends on the value of $\hat{Y}$. Often the variability of the residuals
increases as $\hat{Y}$ increases, resulting in a right opening megaphone shape. (Figure
4.1b has this shape.) Often the variability of the residuals decreases as $\hat{Y}$
increases, resulting in a left opening megaphone shape. Sometimes the variability
decreases then increases again, and sometimes the variability increases
then decreases again (like a stretched or compressed football).

2.3.1 Residual Plots

Remark 2.3. Residual plots magnify departures from the model while the
response plot emphasizes how well the MLR model fits the data.

Since the residuals $r_i = \hat{e}_i$ are estimators of the errors, the residual plot
is used to visualize the conditional distribution $e|SP$ of the errors given the
sufficient predictor $SP = x^T\beta$, where SP is estimated by $\hat{Y} = x^T\hat{\beta}$. For the
unimodal MLR model, there should not be any pattern in the residual plot:
as a narrow vertical strip is moved from left to right, the behavior of the
residuals within the strip should show little change.

Notation. A rule of thumb is a rule that often but not always works well
in practice.
Rule of thumb 2.1. If the residual plot would look good after several
points have been deleted, and if these deleted points were not gross outliers
(points far from the point cloud formed by the bulk of the data), then the
residual plot is probably good. Beginners often find too many things wrong
with a good model. For practice, use the lregpack function MLRsim to generate
several MLR data sets, and make the response and residual plots for these
data sets: type MLRsim(nruns=10) in R and right click Stop for each plot
(20 times) to generate 10 pairs of response and residual plots. This exercise
will help show that the plots can have considerable variability even when the
MLR model is good. See Problem 2.30.
Rule of thumb 2.2. If the plotted points in the residual plot look like
a left or right opening megaphone, the first model violation to check is the
assumption of nonconstant variance. (This is a rule of thumb because it is
possible that such a residual plot results from another model violation such
as nonlinearity, but nonconstant variance is much more common.)
The residual plot of $\hat{Y}$ versus r should always be made. It is also a good idea
to plot each nontrivial predictor $x_j$ versus r and to plot potential predictors
$w_j$ versus r. If the predictor is quantitative, then the residual plot of $x_j$ versus
r should look like the residual plot of $\hat{Y}$ versus r. If the predictor is qualitative,
e.g. gender, then interpreting the residual plot is much more difficult; however,
if each category contains many observations, then the plotted points for each
category should form a vertical line centered at r = 0 with roughly the same
variability (spread or range).

Rule of thumb 2.3. Suppose that the MLR model uses predictors $x_j$
and that data has been collected on variables $w_j$ that are not included in
the MLR model. To check whether important predictors have been left out,
make residual plots of $x_j$ and $w_j$ versus r. If these plots scatter about the
r = 0 line with no other pattern, then there is no evidence that $x_j^2$ or $w_j$ are
needed in the model. If the plotted points scatter about a parabolic curve,
try adding $x_j^2$ or $w_j$ and $w_j^2$ to the MLR model. If the plot of the potential
predictor $w_j$ versus r has a linear trend, try adding $w_j$ to the MLR model.
The additive error regression model and EE plot in Section 13.7 can also be
used to check whether important predictors have been left out.
Rule of thumb 2.4. To check that the errors are independent of the predictors,
make residual plots of $x_j$ versus r. If the plot of $x_j$ versus r scatters
about the r = 0 line with no other pattern, then there is no evidence that the
errors depend on $x_j$. If the variability of the residuals changes with the value
of $x_j$, e.g. if the plot resembles a left or right opening megaphone, the errors
may depend on $x_j$. Some remedies for nonconstant variance are considered
in Chapter 4.

To study residual plots, some notation and properties of the least squares
estimator are needed. MLR is the study of the conditional distribution of
$Y_i|x_i^T\beta$, and the MLR model is $Y = X\beta + e$ where $X$ is an $n \times p$ matrix
of full rank p. Hence the number of predictors $p \le n$. The ith row of $X$ is
$x_i^T = (x_{i,1}, \ldots, x_{i,p})$ where $x_{i,k}$ is the value of the ith observation on the
kth predictor $x_k$. We will denote the jth column of $X$ by $X_j \equiv v_j$, which
corresponds to the jth variable or predictor $x_j$.
Example 2.4. If Y is brain weight in grams, $x_1 \equiv 1$, $x_2$ is age, and $x_3$ is
the size of the head in (mm)$^3$, then for the Gladstone (1905) data

$$Y = \begin{pmatrix} 3738 \\ 4261 \\ \vdots \\ 3306 \end{pmatrix}, \quad X = \begin{pmatrix} 1 & 39 & 149.5 \\ 1 & 35 & 152.5 \\ \vdots & \vdots & \vdots \\ 1 & 19 & 141 \end{pmatrix} = [v_1\ v_2\ v_3].$$

Hence the first person had brain weight = 3738, age = 39, and size = 149.5.
After deleting observations with missing values, there were n = 267 cases
(people measured on brain weight, age, and size), and $x_{267} = (1, 19, 141)^T$.
The second predictor $x_2$ = age corresponds to the 2nd column of $X$ and
is $X_2 = v_2 = (39, 35, \ldots, 19)^T$. Notice that $X_1 \equiv v_1 = 1 = (1, \ldots, 1)^T$
corresponds to the constant $x_1$.
The results in the following proposition are properties of least squares
(OLS), not of the underlying MLR model. See Chapter 11 for more linear
model theory. Definitions 2.8 and 2.9 define the hat matrix $H$, vector of
fitted values $\hat{Y}$, and vector of residuals $r$. Parts f) and g) make residual plots
useful. If the plotted points are linear with roughly constant variance and the
correlation is zero, then the plotted points scatter about the r = 0 line with
no other pattern. If the plotted points in a residual plot of w versus r do show
a pattern such as a curve or a right opening megaphone, zero correlation will
usually force symmetry about either the r = 0 line or the w = median(w)
line. Hence departures from the ideal plot of random scatter about the r = 0
line are often easy to detect.

Warning: If n > p, as is usually the case, $X$ is not square, so $(X^TX)^{-1} \ne X^{-1}(X^T)^{-1}$ since $X^{-1}$ does not exist.

Proposition 2.2. Suppose that $X$ is an $n \times p$ matrix of full rank p. Then
a) $H$ is symmetric: $H = H^T$.
b) $H$ is idempotent: $HH = H$.
c) $X^T r = 0$ so that $X_j^T r = v_j^T r = 0$.
d) If there is a constant $X_1 \equiv v_1 = 1$ in the model, then the sum of the
residuals is zero: $\sum_{i=1}^n r_i = 0$.
e) $r^T \hat{Y} = 0$.
f) If there is a constant in the model, then the sample correlation of the
fitted values and the residuals is 0: corr($r, \hat{Y}$) = 0.
g) If there is a constant in the model, then the sample correlation of the
jth predictor with the residuals is 0: corr($r, v_j$) = 0 for $j = 1, \ldots, p$.

Proof. a) $X^TX$ is symmetric since $(X^TX)^T = X^T(X^T)^T = X^TX$.
Hence $(X^TX)^{-1}$ is symmetric since the inverse of a symmetric matrix is
symmetric. (Recall that if $A$ has an inverse then $(A^T)^{-1} = (A^{-1})^T$.) Thus
using $(A^T)^T = A$ and $(ABC)^T = C^TB^TA^T$ shows that

$$H^T = (X^T)^T[(X^TX)^{-1}]^T X^T = X(X^TX)^{-1}X^T = H.$$

b) $HH = X(X^TX)^{-1}X^TX(X^TX)^{-1}X^T = X(X^TX)^{-1}X^T = H$ since $(X^TX)^{-1}X^TX = I_p$, the $p \times p$ identity matrix.

c) $X^Tr = X^T(I_n - H)Y = [X^T - X^TX(X^TX)^{-1}X^T]Y = [X^T - X^T]Y = 0$. Since $v_j$ is the jth column of $X$, $v_j^T$ is the jth row of $X^T$
and $v_j^Tr = 0$ for $j = 1, \ldots, p$.

d) Since $v_1 = 1$, $v_1^Tr = \sum_{i=1}^n r_i = 0$ by c).

e) $r^T\hat{Y} = [(I_n - H)Y]^T HY = Y^T(I_n - H)HY = Y^T(H - H)Y = 0$.

f) The sample correlation between W and Z is
$$\text{corr}(W, Z) = \frac{\sum_{i=1}^n (w_i - \bar w)(z_i - \bar z)}{(n-1)s_w s_z} = \frac{\sum_{i=1}^n (w_i - \bar w)(z_i - \bar z)}{\sqrt{\sum_{i=1}^n (w_i - \bar w)^2}\,\sqrt{\sum_{i=1}^n (z_i - \bar z)^2}}$$
where $s_m$ is the sample standard deviation of m for m = w, z. So the result
follows if $A = \sum_{i=1}^n (\hat Y_i - \bar Y)(r_i - \bar r) = 0$. Now $\bar r = 0$ by d), and thus
$$A = \sum_{i=1}^n \hat Y_i r_i - \bar Y \sum_{i=1}^n r_i = \sum_{i=1}^n \hat Y_i r_i$$
by d) again. But $\sum_{i=1}^n \hat Y_i r_i = r^T\hat Y = 0$ by e).

g) Following the argument in f), the result follows if $A = \sum_{i=1}^n (x_{i,j} - \bar x_j)(r_i - \bar r) = 0$ where $\bar x_j = \sum_{i=1}^n x_{i,j}/n$ is the sample mean of
the jth predictor. Now $\bar r = \sum_{i=1}^n r_i/n = 0$ by d), and thus
$$A = \sum_{i=1}^n x_{i,j} r_i - \bar x_j \sum_{i=1}^n r_i = \sum_{i=1}^n x_{i,j} r_i$$
by d) again. But $\sum_{i=1}^n x_{i,j} r_i = X_j^T r = v_j^T r = 0$ by c). □

2.3.2 Other Model Violations

Without loss of generality, E(e) = 0 for the unimodal MLR model with a
constant, in that if $E(e) = \mu \ne 0$, then the MLR model can always be written
as $Y = x^T\beta + e$ where E(e) = 0 and $E(Y) \equiv E(Y|x) = x^T\beta$. To see this
claim, notice that

$$Y = \beta_1 + x_2\beta_2 + \cdots + x_p\beta_p + e = \beta_1 + E(e) + x_2\beta_2 + \cdots + x_p\beta_p + e - E(e)$$
$$= \tilde\beta_1 + x_2\beta_2 + \cdots + x_p\beta_p + \tilde e$$
where $\tilde\beta_1 = \beta_1 + E(e)$ and $\tilde e = e - E(e)$. For example, if the errors $e_i$ are iid
exponential($\lambda$) with $E(e_i) = \lambda$, use $\tilde e_i = e_i - \lambda$.

For least squares, it is crucial that $\sigma^2$ exists. For example, if the $e_i$ are iid
Cauchy(0,1), then $\sigma^2$ does not exist and the least squares estimators tend to
perform very poorly.
The performance of least squares is analogous to the performance of $\bar Y$.
The sample mean $\bar Y$ is a very good estimator of the population mean $\mu$ if the
$Y_i$ are iid $N(\mu, \sigma^2)$, and $\bar Y$ is a good estimator of $\mu$ if the sample size is large
and the $Y_i$ are iid with mean $\mu$ and variance $\sigma^2$. This result follows from
the central limit theorem (CLT), but how large is "large" depends on the
underlying distribution. The n > 30 rule tends to hold for distributions that
are close to normal in that they take on many values and $\sigma^2$ is not huge. Error
distributions that are highly nonnormal with tiny $\sigma^2$ often need n >> 30.
For example, if $Y_1, \ldots, Y_n$ are iid Gamma(1/m, 1), then n > 25m may be
needed. Another example is distributions that take on one value with very
high probability, e.g. a Poisson random variable with very small variance.
Bimodal and multimodal distributions and highly skewed distributions with
large variances also need larger n. Chihara and Hesterberg (2011, p. 177)
suggest using n > 5000 for moderately skewed distributions.
There are central limit type theorems for the least squares estimators that
depend on the error distribution of the iid errors $e_i$. See Theorems 2.8, 11.25,
and 12.7. We always assume that the $e_i$ are continuous random variables with
a probability density function. Error distributions that are close to normal
may give good results for moderate n if $n \ge 10p$ and $n - p \ge 30$ where p is the
number of predictors. Error distributions that need large n for the CLT to
apply for $\bar e$ will tend to need large n for the limit theorems for least squares
to apply (to give good approximations).

Checking whether the errors are iid is often difficult. The iid assumption is
often reasonable if measurements are taken on different objects, e.g. people.
In industry, often several measurements are taken on a batch of material.
For example, a batch of cement is mixed and then several small cylinders of
concrete are made from the batch. Then the cylinders are tested for strength.

Experience from such experiments suggests that objects (e.g., cylinders) from
different batches are independent, but objects from the same batch are not
independent.
One check on independence can also be made if the time order of the
observations is known. Let $r_{[t]}$ be the residual where [t] is the time order of
the trial. Hence [1] was the 1st and [n] was the last trial. Plot the time order
t versus $r_{[t]}$ if the time order is known. Again, trends and outliers suggest
that the model could be improved. A box shaped plot with no trend suggests
that the MLR model is good. A plot similar to the Durbin Watson test plots
$r_{[t-1]}$ versus $r_{[t]}$ for $t = 2, \ldots, n$. Linear trend suggests serial correlation while
random scatter suggests that there is no lag 1 autocorrelation. As a rule of
thumb, if the OLS slope b is computed for the plotted points, $b > 0.25$ gives
some evidence that there is positive correlation between $r_{[t-1]}$ and $r_{[t]}$. Time
series plots, such as the ACF or PACF of the residuals, may be useful.
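The lag 1 check above is easy to automate. The sketch below (simulated residual series, not data from the text) regresses $r_{[t]}$ on $r_{[t-1]}$: the slope b is near 0 for iid residuals and near the lag 1 autocorrelation for serially correlated ones:

```python
# Lag-1 residual check: OLS slope b from regressing r[t] on r[t-1].
# With iid "residuals" b should be near 0; with AR(1) correlation 0.6
# it should be near 0.6 and well above the b > 0.25 benchmark.
import numpy as np

rng = np.random.default_rng(2)
n = 500
iid = rng.normal(size=n)          # iid series: no serial correlation
ar1 = np.empty(n)                 # AR(1) series with lag-1 correlation 0.6
ar1[0] = rng.normal()
for t in range(1, n):
    ar1[t] = 0.6 * ar1[t - 1] + rng.normal()

def lag1_slope(r):
    """OLS slope from regressing r[t] on r[t-1], t = 2, ..., n."""
    return np.polyfit(r[:-1], r[1:], 1)[0]

b_iid, b_ar1 = lag1_slope(iid), lag1_slope(ar1)
```

In practice r would be the OLS residuals arranged in time order rather than a simulated series.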

If it is assumed that the error distribution is symmetric, make a histogram
of the residuals. Check whether the histogram is roughly symmetric or clearly
skewed. If it is assumed that the errors $e_i$ are iid $N(0, \sigma^2)$, again check
whether the histogram is mound shaped with "short tails." A commonly
used alternative is to make a normal probability plot of the residuals. Let
$r_{(1)} < r_{(2)} < \cdots < r_{(n)}$ denote the residuals ordered from smallest to largest.
Hence $r_{(1)}$ is the value of the smallest residual. The normal probability plot
plots the $e_{(i)}$ versus $r_{(i)}$ where the $e_{(i)}$ are the expected values of the order
statistics from a sample of size n from an N(0, 1) distribution. (Often the $e_{(i)}$
are the standard normal percentiles that satisfy $P(Z \le e_{(i)}) = (i - 0.5)/n$
where $Z \sim N(0, 1)$.)
Rules of thumb: i) if the plotted points scatter about some straight line
in the normal probability plot, then there is no evidence against the normal
assumption. ii) If the plotted points have an "ess" shape (concave up then
concave down), then the error distribution is symmetric with lighter tails
than the normal distribution. iii) If the plot resembles a cubic function, then
the error distribution is symmetric with heavier tails than the normal distribution.
iv) If the plotted points look concave up (e.g. like $x^2$ where x > 0),
then the error distribution is right skewed.
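The normal probability plot coordinates can be computed directly. A sketch (simulated "residuals," using the $(i - 0.5)/n$ percentile convention above):

```python
# Normal probability plot computation: pair the ordered residuals r_(i)
# with the standard normal percentiles e_(i) where P(Z <= e_(i)) = (i-0.5)/n.
# For normal residuals the points fall near a line with slope sigma and
# intercept near 0, matching rule of thumb i).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 200
r = np.sort(rng.normal(0, 2, n))                 # simulated residuals, sigma = 2
e = norm.ppf((np.arange(1, n + 1) - 0.5) / n)    # normal scores e_(i)

slope, intercept = np.polyfit(e, r, 1)           # fitted line of the QQ plot
```

Plotting e versus r (e.g. with matplotlib) and overlaying this line gives the usual normal probability plot.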

2.4 The ANOVA F Test

After fitting least squares and checking the response and residual plots to see
that an MLR model is reasonable, the next step is to check whether there is
an MLR relationship between Y and the nontrivial predictors $x_2, \ldots, x_p$. If
at least one of these predictors is useful, then the OLS fitted values $\hat Y_i$ should
be used. If none of the nontrivial predictors is useful, then $\bar Y$ will give as good
predictions as $\hat Y_i$. Here the sample mean

$$\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i. \qquad (2.5)$$

In the definition below, SSE is the sum of squared residuals and a residual
$r_i = \hat e_i$ = "errorhat." In the literature "errorhat" is often rather misleadingly
abbreviated as "error."

Definition 2.14. Assume that a constant is in the MLR model.
a) The total sum of squares
$$SSTO = \sum_{i=1}^n (Y_i - \bar Y)^2. \qquad (2.6)$$
b) The regression sum of squares
$$SSR = \sum_{i=1}^n (\hat Y_i - \bar Y)^2. \qquad (2.7)$$
c) The residual sum of squares or error sum of squares is
$$SSE = \sum_{i=1}^n (Y_i - \hat Y_i)^2 = \sum_{i=1}^n r_i^2. \qquad (2.8)$$

The result in the following proposition is a property of least squares (OLS),
not of the underlying MLR model. An obvious application is that given any
two of SSTO, SSE, and SSR, the 3rd sum of squares can be found using the
formula SSTO = SSE + SSR.

Proposition 2.3. Assume that a constant is in the MLR model. Then
SSTO = SSE + SSR.

Proof.
$$SSTO = \sum_{i=1}^n (Y_i - \hat Y_i + \hat Y_i - \bar Y)^2 = SSE + SSR + 2\sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y).$$
Hence the result follows if
$$A \equiv \sum_{i=1}^n r_i(\hat Y_i - \bar Y) = 0.$$
But
$$A = \sum_{i=1}^n r_i \hat Y_i - \bar Y \sum_{i=1}^n r_i = 0$$
by Proposition 2.2 d) and e). □



Definition 2.15. Assume that a constant is in the MLR model and that
SSTO $\ne$ 0. The coefficient of multiple determination
$$R^2 = [\text{corr}(Y_i, \hat Y_i)]^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$$
where corr($Y_i, \hat Y_i$) is the sample correlation of $Y_i$ and $\hat Y_i$.
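A quick numerical check (simulated data, not from the text) that the three formulas in Definition 2.15 agree, and that SSTO = SSE + SSR as in Proposition 2.3:

```python
# With a constant in the model: SSTO = SSE + SSR, and R^2 computed as
# SSR/SSTO, as 1 - SSE/SSTO, and as corr(Y, Yhat)^2 all coincide.
import numpy as np

rng = np.random.default_rng(4)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

Yhat = X @ np.linalg.lstsq(X, Y, rcond=None)[0]  # OLS fitted values
SSTO = np.sum((Y - Y.mean())**2)
SSE = np.sum((Y - Yhat)**2)
SSR = np.sum((Yhat - Y.mean())**2)
R2 = SSR / SSTO
```

Without a constant column in X the decomposition, and hence the equivalence of the formulas, can fail, which is why software packages disagree about $R^2$ for no-intercept models.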

Warnings: i) $0 \le R^2 \le 1$, but small $R^2$ does not imply that the MLR
model is bad.
ii) If the MLR model contains a constant, then there are several equivalent
formulas for $R^2$. If the model does not contain a constant, then $R^2$ depends
on the software package.
iii) $R^2$ does not have much meaning unless the response plot and residual
plot both look good.
iv) $R^2$ tends to be too high if n is small.
v) $R^2$ tends to be too high if there are two or more separated clusters of
data in the response plot.
vi) $R^2$ is too high if the number of predictors p is close to n.
vii) In large samples $R^2$ will be large (close to one) if $\sigma^2$ is small compared
to the sample variance $S_Y^2$ of the response variable Y. $R^2$ is also large if the
sample variance of $\hat Y$ is close to $S_Y^2$. Thus $R^2$ is sometimes interpreted as
the proportion of the variability of Y "explained" by conditioning on x, but
warnings i) - v) suggest that $R^2$ may not have much meaning.

The following 2 propositions suggest that $R^2$ does not behave well when
many predictors that are not needed in the model are included in the model.
Such a variable is sometimes called a noise variable and the MLR model
is "fitting noise." Proposition 2.5 appears, for example, in Cramer (1946,
pp. 414-415), and suggests that $R^2$ should be considerably larger than p/n
if the predictors are useful. Note that if n = 10p and $p \ge 2$, then under the
conditions of Proposition 2.5, $E(R^2) \le 0.1$.

Proposition 2.4. Assume that a constant is in the MLR model. Adding
a variable to the MLR model does not decrease (and usually increases) $R^2$.

Proposition 2.5. Assume that a constant $\beta_1$ is in the MLR model, that
$\beta_2 = \cdots = \beta_p = 0$, and that the $e_i$ are iid $N(0, \sigma^2)$. Hence the $Y_i$ are iid
$N(\beta_1, \sigma^2)$. Then
a) $R^2$ follows a beta distribution: $R^2 \sim \text{beta}\!\left(\frac{p-1}{2}, \frac{n-p}{2}\right)$.
b)
$$E(R^2) = \frac{p-1}{n-1}.$$
c)
$$\text{VAR}(R^2) = \frac{2(p-1)(n-p)}{(n-1)^2(n+1)}.$$
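Proposition 2.5 is easy to illustrate by simulation. The sketch below (assumed settings n = 50, p = 5, not data from the text) fits pure noise and recovers a mean $R^2$ near $(p-1)/(n-1) = 4/49 \approx 0.08$:

```python
# When beta_2 = ... = beta_p = 0 and the errors are normal, R^2 has a
# beta((p-1)/2, (n-p)/2) distribution with mean (p-1)/(n-1): the model
# "explains" a nontrivial share of the variance purely by fitting noise.
import numpy as np

rng = np.random.default_rng(5)
n, p, runs = 50, 5, 2000
r2 = np.empty(runs)
for k in range(runs):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    Y = rng.normal(size=n)                       # Y independent of predictors
    Yhat = X @ np.linalg.lstsq(X, Y, rcond=None)[0]
    r2[k] = np.sum((Yhat - Y.mean())**2) / np.sum((Y - Y.mean())**2)

mean_r2 = r2.mean()                              # near (p - 1)/(n - 1)
```

This is the sense in which $R^2$ should be "considerably larger than p/n" before the predictors are taken to be useful.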

Notice that each SS/n estimates the variability of some quantity. $SSTO/n \approx S_Y^2$, $SSE/n \approx S_e^2 = \hat\sigma^2$, and $SSR/n \approx S_{\hat Y}^2$.

Definition 2.16. Assume that a constant is in the MLR model. Associated
with each SS in Definition 2.14 is a degrees of freedom (df) and a
mean square = SS/df. For SSTO, df = n − 1 and MSTO = SSTO/(n − 1).
For SSR, df = p − 1 and MSR = SSR/(p − 1). For SSE, df = n − p and
MSE = SSE/(n − p).

Seber and Lee (2003, pp. 44-47) show that when the MLR model holds,
MSE is often a good estimator of $\sigma^2$. Under regularity conditions, the MSE
is one of the best unbiased quadratic estimators of $\sigma^2$. For the normal MLR
model, MSE is the uniformly minimum variance unbiased estimator of $\sigma^2$.
Seber and Lee also give the following theorem that shows that the MSE is an
unbiased estimator of $\sigma^2$ under very weak assumptions if the MLR model is
appropriate. From Theorem 12.7, MSE is a $\sqrt{n}$ consistent estimator of $\sigma^2$.
Theorem 2.6. If $Y = X\beta + e$ where $X$ is an $n \times p$ matrix of full rank p, if
the $e_i$ are independent with $E(e_i) = 0$ and VAR($e_i$) $= \sigma^2$, then $\hat\sigma^2 = MSE$
is an unbiased estimator of $\sigma^2$.
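Note that Theorem 2.6 does not require normal errors. A simulation sketch (assumed settings, not from the text) with nonnormal Uniform(−1, 1) errors, for which $\sigma^2 = 1/3$:

```python
# MSE = SSE/(n - p) is unbiased for sigma^2 whenever the errors are
# independent with mean 0 and common variance -- here uniform errors,
# which are flat-topped rather than normal, with variance 1/3.
import numpy as np

rng = np.random.default_rng(6)
n, p, runs = 40, 4, 4000
sigma2 = 1.0 / 3.0                 # variance of Uniform(-1, 1)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, 0.0, -1.0])  # hypothetical coefficients

mse = np.empty(runs)
for k in range(runs):
    Y = X @ beta + rng.uniform(-1, 1, n)
    r = Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]
    mse[k] = r @ r / (n - p)       # MSE for this simulated data set
```

Averaging the simulated MSE values over the runs recovers $\sigma^2 = 1/3$ to within simulation error.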
The ANOVA F test tests whether any of the nontrivial predictors
$x_2, \ldots, x_p$ are needed in the OLS MLR model, that is, whether $Y_i$ should
be predicted by the OLS fit $\hat Y_i = \hat\beta_1 + x_{i,2}\hat\beta_2 + \cdots + x_{i,p}\hat\beta_p$ or with the
sample mean $\bar Y$. ANOVA stands for analysis of variance, and the computer
output needed to perform the test is contained in the ANOVA table. Below
is an ANOVA table given in symbols. Sometimes "Regression" is replaced by
"Model" and "Residual" by "Error."

Summary Analysis of Variance Table

Source df SS MS F p-value
Regression p-1 SSR MSR Fo=MSR/MSE for Ho:
Residual n-p SSE MSE β2 = · · · = βp = 0

Remark 2.4. Recall that for a 4 step test of hypotheses, the p-value is the
probability of getting a test statistic as extreme as the test statistic actually
observed and that Ho is rejected if the p-value $< \delta$. As a benchmark for this
textbook, use $\delta = 0.05$ if $\delta$ is not given. The 4th step is the nontechnical
conclusion which is crucial for presenting your results to people who are not
familiar with MLR. Replace Y and $x_2, \ldots, x_p$ by the actual variables used in
the MLR model. Follow Example 2.5.

Notation. The p-value $\equiv$ pvalue given by output tends to only be correct
for the normal MLR model. Hence the output is usually only giving an
estimate of the p-value, which will often be denoted by pval. So reject H0 if
pval $\le \delta$. Often

$$\text{pval} - \text{pvalue} \xrightarrow{P} 0$$
(converges to 0 in probability, so pval is a consistent estimator of pvalue)
as the sample size $n \to \infty$. See Theorem 11.25, Section 11.6, and Chang
and Olive (2010). Then the computer output pval is a good estimator of the
unknown p-value. We will use Ho $\equiv$ H0 and Ha $\equiv$ HA $\equiv$ H1.

Be able to perform the 4 step ANOVA F test of hypotheses.
i) State the hypotheses Ho: $\beta_2 = \cdots = \beta_p = 0$; Ha: not Ho.
ii) Find the test statistic Fo = MSR/MSE or obtain it from output.
iii) Find the pval from output or use the F table: pval = $P(F_{p-1,n-p} > Fo)$.
iv) State whether you reject Ho or fail to reject Ho. If Ho is rejected, conclude
that there is an MLR relationship between Y and the predictors $x_2, \ldots, x_p$. If
you fail to reject Ho, conclude that there is not an MLR relationship between
Y and the predictors $x_2, \ldots, x_p$. (Or there is not enough evidence to conclude
that there is an MLR relationship between Y and the predictors.)
Example 2.5. For the Gladstone (1905) data, the response variable Y =
brain weight, $x_1 \equiv 1$, $x_2$ = size of head, $x_3$ = sex, $x_4$ = breadth of head,
$x_5$ = circumference of head. Assume that the response and residual plots
look good and test whether at least one of the nontrivial predictors is needed
in the model using the output shown below.

Summary Analysis of Variance Table
Source df SS MS F p-value
Regression 4 5396942. 1349235. 196.24 0.0000
Residual 262 1801333. 6875.32

Solution: i) Ho: $\beta_2 = \cdots = \beta_5 = 0$; Ha: not Ho.
ii) Fo = 196.24 from output.
iii) The pval = 0.0 from output.
iv) The pval $< \delta$ (= 0.05 since $\delta$ was not given). So reject Ho. Hence there
is an MLR relationship between brain weight and the predictors size, sex,
breadth, and circumference.
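The entries of the ANOVA table in Example 2.5 can be reproduced from Definition 2.16 (the sums of squares and degrees of freedom are copied from the output above; the F tail probability is computed with SciPy rather than read from an F table):

```python
# Arithmetic check of the Example 2.5 ANOVA table: MSR = SSR/(p-1),
# MSE = SSE/(n-p), Fo = MSR/MSE, and pval = P(F_{p-1, n-p} > Fo).
from scipy.stats import f

SSR, SSE = 5396942.0, 1801333.0
df_reg, df_res = 4, 262            # p - 1 and n - p for n = 267, p = 5
MSR, MSE = SSR / df_reg, SSE / df_res
Fo = MSR / MSE                     # about 196.24, matching the output
pval = f.sf(Fo, df_reg, df_res)    # right-tail probability, essentially 0
```

The computed MSE and Fo match the printed 6875.32 and 196.24 to two decimals, and the tail probability is far below any reasonable $\delta$.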
Remark 2.5. There is a close relationship between the response plot and
the ANOVA F test. If $n \ge 10p$ and $n - p \ge 30$ and if the plotted points
follow the identity line, typically Ho will be rejected if the identity line fits
the plotted points better than any horizontal line (in particular, the line
$Y = \bar Y$). If a horizontal line fits the plotted points about as well as the identity
line, as in Figure 1.4, this graphical diagnostic is inconclusive (sometimes the
ANOVA F test will reject Ho and sometimes fail to reject Ho), but the MLR
relationship is at best weak. In Figures 1.2 and 2.1, the ANOVA F test
should reject Ho since the identity line fits the plotted points better than
any horizontal line. Under the above conditions, a graphical ANOVA F test

rejects Ho if the response plot is not similar to the residual plot. The graphical
test is inconclusive if the response plot looks similar to the residual plot. The
graphical test is also useful for multiple linear regression methods other than
least squares, such as M-estimators and other robust regression estimators.

Definition 2.17. An RR plot is a plot of residuals from 2 different models
or fitting methods.

Remark 2.6. If the RR plot of the residuals $Y_i - \bar Y$ versus the OLS residuals
$r_i = Y_i - \hat Y_i$ shows tight clustering about the identity line, then the MLR
relationship is weak: $\bar Y$ fits the data about as well as the OLS fit $\hat Y_i$.

Example 2.6. Cook and Weisberg (1999a, pp. 261, 371) describe a data
set where rats were injected with a dose of a drug approximately proportional
to body weight. The response Y was the fraction of the drug recovered from
the rat's liver. The three predictors were the body weight of the rat, the dose
of the drug, and the liver weight. A constant was also used. The experimenter
expected the response to be independent of the predictors, and 19 cases
were used. However, the ANOVA F test suggested that the predictors were
important. The third case was an outlier and easily detected in the response
and residual plots (not shown). After deleting the outlier, the response and
residual plots looked ok and the following output was obtained.

Summary Analysis of Variance Table
Source df SS MS F p-value
Regression 3 0.00184396 0.000614652 0.10 0.9585
Residual 14 0.0857172 0.00612265

The 4 step ANOVA F test is
i) Ho: β2 = ⋯ = β4 = 0 Ha: not Ho
ii) Fo = 0.10.
iii) pval = 0.9585.
iv) The pval > δ (δ = 0.05 since δ was not given). So fail to reject Ho. Hence
there is not an MLR relationship between fraction of drug recovered and the
predictors body weight, dose, and liver weight. (More accurately, there is not
enough statistical evidence to conclude that there is an MLR relationship:
failing to reject Ho is not the same as accepting Ho; however, it may be a
good idea to keep the nontechnical conclusions nontechnical.)
Figure 2.2 shows the RR plot where the residuals from the full model
are plotted against Yi − Ȳ, the residuals from the model using no nontrivial
predictors. This plot reinforces the conclusion that the response Y is inde-
pendent of the nontrivial predictors. The identity line and the OLS line from
regressing ri on Yi − Ȳ (that is, use Yi = ri, a constant, and xi,2 = Yi − Ȳ,
find the OLS line and then plot it) are shown as visual aids. If the OLS line
and identity line nearly coincide in that it is difficult to tell that the two lines
intersect at the origin, then the 2 sets of residuals are close.


Fig. 2.2 RR Plot With Outlier Deleted, Submodel Uses Only the Trivial Predictor with
Ŷ = Ȳ

Some assumptions are needed on the ANOVA F test. Assume that both
the response and residual plots look good. It is crucial that there are no
outliers. Then a rule of thumb is that if n − p is large, then the ANOVA
F test p-value is approximately correct. An analogy can be made with the
central limit theorem: Ȳ is a good estimator for μ if the Yi are iid N(μ, σ²)
and also a good estimator for μ if the data are iid with mean μ and variance
σ² if n is large enough. Also see Theorem 11.25. More on the robustness and
lack of robustness of the ANOVA F test can be found in Wilcox (2012).

If all of the xi are different (no replication) and if the number of predictors
p = n, then the OLS fit Ŷi = Yi and R² = 1. Notice that Ho is rejected if the
statistic Fo is large. More precisely, reject Ho if

Fo > F_{p−1, n−p, 1−δ}

where

P(F ≤ F_{p−1, n−p, 1−δ}) = 1 − δ

when F ∼ F_{p−1, n−p}. Since R² increases to 1 while (n − p)/(p − 1) decreases
to 0 as p increases to n, Theorem 2.7 a) below implies that if p is large then
the Fo statistic may be small even if some of the predictors are very good. It
is a good idea to use n ≥ 10p or at least n ≥ 5p if possible. Theorem 11.25
can be used to show that pval is a consistent estimator of the p-value under
reasonable conditions.

Theorem 2.7. Assume that the MLR model has a constant β1.
a)
Fo = MSR/MSE = [R²/(1 − R²)] · [(n − p)/(p − 1)].
b) If the errors ei are iid N(0, σ²), and if Ho: β2 = ⋯ = βp = 0 is true,
then Fo has an F distribution with p − 1 numerator and n − p denominator
degrees of freedom: Fo ∼ F_{p−1, n−p}.
c) If the errors are iid with mean 0 and variance σ², if the error distribution
is close to normal, and if n − p is large enough, and if Ho is true, then
Fo ≈ F_{p−1, n−p} in that the p-value from the software (pval) is approximately
correct.
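The identity in Theorem 2.7 a) can be verified numerically from any OLS fit with an intercept. Below is a minimal sketch in Python with NumPy for illustration (the book's own computing is done in R); the simulated data, seed, and variable names are assumptions of the sketch, not values from the text.

```python
import numpy as np

# Simulated MLR data with an intercept; sizes and coefficients are illustrative.
rng = np.random.default_rng(0)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, 0.0, -1.0])
Y = X @ beta + rng.normal(size=n)

bhat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Yhat = X @ bhat
r = Y - Yhat

SSE = r @ r
SSR = (Yhat - Y.mean()) @ (Yhat - Y.mean())
MSR, MSE = SSR / (p - 1), SSE / (n - p)

Fo = MSR / MSE                                     # ANOVA F statistic
R2 = SSR / (SSR + SSE)                             # with an intercept, SSTO = SSR + SSE
Fo_from_R2 = (R2 / (1 - R2)) * (n - p) / (p - 1)   # Theorem 2.7 a)
```

Since the identity is algebraic, the two computations of Fo agree up to rounding error for any data set fit with a constant.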

Remark 2.7. When a constant is not contained in the model (i.e., xi,1 is
not equal to 1 for all i), then the computer output still produces an ANOVA
table with the test statistic and p-value, and nearly the same 4 step test of
hypotheses can be used. The hypotheses are now Ho: β1 = ⋯ = βp = 0 Ha:
not Ho, and you are testing whether or not there is an MLR relationship
between Y and x1, . . . , xp. An MLR model without a constant (no intercept)
is sometimes called a regression through the origin. See Section 2.10.

2.5 Prediction

This section gives estimators for predicting a future or new value Yf of
the response variable given the predictors xf, and for estimating the mean
E(Yf) ≡ E(Yf |xf). This mean is conditional on the values of the predictors
xf, but the conditioning is often suppressed.
Warning: All too often the MLR model seems to fit the training data

(Y1, x1), . . . , (Yn, xn)

well, but when new test data is collected, a very different MLR model is
needed to fit the new data well. In particular, the MLR model seems to fit
the data (Yi, xi) well for i = 1, . . . , n, but when the researcher tries to predict
Yf for a new vector of predictors xf, the prediction is very poor in that Ŷf is
not close to the Yf actually observed. Wait until after the MLR model
has been shown to make good predictions before claiming that the
model gives good predictions!
There are several reasons why the MLR model may not fit new data well.
i) The model building process is usually iterative. Data Z, w1, . . . , wr is col-
lected. If the model is not linear, then functions of Z are used as a potential
response variable and functions of the wi as potential predictors. After trial
and error, the functions are chosen, resulting in a final MLR model using Y
and x1, . . . , xp. Since the same data set was used during the model building

process, biases are introduced and the MLR model fits the training data
better than it fits new test data. Suppose that Y, x1, . . . , xp are specified
before collecting data and that the residual and response plots from the re-
sulting MLR model look good. Then predictions from the prespecified model
will often be better for predicting new data than a model built from an iter-
ative process.
ii) If (Yf, xf) come from a different population than the population of
(Y1, x1), . . . , (Yn, xn), then prediction for Yf can be arbitrarily bad.
iii) Even a good MLR model may not provide good predictions for an xf
that is far from the xi (extrapolation).
iv) The MLR model may be missing important predictors (underfitting).
v) The MLR model may contain unnecessary predictors (overfitting).

Three remedies for i) are a) use previously published studies to select
an MLR model before gathering data. Unfortunately, often the population
from a published study is quite different from the population of the data
collected by the researcher. Then the MLR model from the published study
is not adequate. b) Do a trial study. Collect some data, build an MLR model
using the iterative process. Then use this model as the prespecified model and
collect data for the main part of the study. Better yet, do a trial study, specify
a model, collect more trial data, improve the specified model, and repeat until
the latest specified model works well. Unfortunately, trial studies are often
too expensive or not possible because the data is difficult to collect. c) If
the data set is large enough, use a training set of a random sample of k
of the n cases to build a model where 10p ≤ n/2 ≤ k ≤ 0.9n. Then use a
validation set of the other n − k cases to confirm that the model built
with the training set is good. This technique may help reduce biases, but
needs n ≥ 20p. See James et al. (2013, pp. 176–178). In particular, build the
model with the training set, then check the asymptotically optimal prediction
interval (2.20), derived later in this section, on the validation set.

Definition 2.18. Consider the MLR model Y = Xβ + e and the hat
matrix H = X(XᵀX)⁻¹Xᵀ. Let hi = hii be the ith diagonal element of H
for i = 1, . . . , n. Then hi is called the ith leverage and hi = xᵢᵀ(XᵀX)⁻¹xᵢ.
Suppose new data is to be collected with predictor vector xf. Then the
leverage of xf is hf = xfᵀ(XᵀX)⁻¹xf. Extrapolation occurs if xf is far
from the x1, . . . , xn.

Rule of thumb 2.5. Predictions based on extrapolation are not reliable.
A rule of thumb is that extrapolation occurs if hf > max(h1, . . . , hn). This
rule works best if the predictors are linearly related in that a plot of xi versus
xj should not have any strong nonlinearities. If there are strong nonlinearities
among the predictors, then xf could be far from the xi but still have hf <
max(h1, . . . , hn).
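The leverage computation and the extrapolation rule of thumb can be sketched as follows in Python with NumPy (the book's own code uses R). The simulated height range mimics Example 2.7 below; the seed and data are illustrative assumptions.

```python
import numpy as np

# Simulated heights roughly matching the 57 to 79 inch range of Example 2.7.
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.uniform(57, 79, size=n)])
XtX_inv = np.linalg.inv(X.T @ X)

# Sample leverages h_i = x_i'(X'X)^{-1} x_i: the diagonal of the hat matrix.
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)

def leverage(xf):
    """Leverage h_f of a new predictor vector x_f."""
    return float(xf @ XtX_inv @ xf)

xf_in = np.array([1.0, 70.0])    # inside the observed height range
xf_out = np.array([1.0, 1.0])    # the "1 inch man": far from the data
extrapolating = leverage(xf_out) > h.max()   # rule of thumb flags this case
```

For simple regression, hi = 1/n + (xi − x̄)²/Σ(xj − x̄)², so leverage grows quadratically with distance from the bulk of the data, which is why the 1 inch case is flagged.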

Example 2.7. Consider predicting Y = weight from x = height and a
constant from data collected on men between 18 and 24 where the minimum
height was 57 and the maximum height was 79 inches. The OLS equation
was Ŷ = −167 + 4.7x. If x = 70, then Ŷ = −167 + 4.7(70) = 162 pounds.
If x = 1 inch, then Ŷ = −167 + 4.7(1) = −162.3 pounds. It is impossible
to have negative weight, but it is also impossible to find a 1 inch man. This
MLR model should not be used for x far from the interval (57, 79).

Definition 2.19. Consider the unimodal MLR model Y = xᵀβ + e where
E(e) = 0. Then the regression function is the hyperplane

E(Y) ≡ E(Y |x) = x1β1 + x2β2 + ⋯ + xpβp = xᵀβ.  (2.9)

Assume OLS is used to find β̂. Then the point estimator of Yf given x = xf
is

Ŷf = xf,1β̂1 + ⋯ + xf,pβ̂p = xfᵀβ̂.  (2.10)

The point estimator of E(Yf) ≡ E(Yf |xf) given x = xf is also Ŷf = xfᵀβ̂.
Assume that the MLR model contains a constant β1 so that x1 ≡ 1. The large
sample 100(1 − δ)% confidence interval (CI) for E(Yf |xf) = xfᵀβ = E(Yf)
is

Ŷf ± t_{n−p, 1−δ/2} se(Ŷf)  (2.11)

where P(T ≤ t_{n−p, δ}) = δ if T has a t distribution with n − p degrees of
freedom. Generally se(Ŷf) will come from output, but

se(Ŷf) = √(MSE hf) = √(MSE xfᵀ(XᵀX)⁻¹xf).

Recall the interpretation of a 100(1 − δ)% CI for a parameter θ is that
if you collect data then form the CI, and repeat for a total of k times where
the k trials are independent from the same population, then the probability
that m of the CIs will contain θ follows a binomial(k, ρ = 1 − δ) distribution.
Hence if 100 95% CIs are made, ρ = 0.95 and about 95 of the CIs will contain
θ while about 5 will not. Any given CI may (good sample) or may not (bad
sample) contain θ, but the probability of a "bad sample" is δ.

The following theorem is analogous to the central limit theorem and the
theory for the t-interval for μ based on Ȳ and the sample standard deviation
(SD) SY. If the data Y1, . . . , Yn are iid with mean μ and variance σ², then Ȳ
is asymptotically normal and the t-interval will perform well if the sample
size is large enough. The result below suggests that the OLS estimators Ŷi
and β̂ are good if the sample size is large enough. The condition max hi → 0
in probability usually holds if the researcher picked the design matrix X or
if the xi are iid random vectors from a well-behaved population. Outliers
can cause the condition to fail. Convergence in distribution, Zn →D Np(0, Σ),
means the multivariate normal approximation can be used for probability
calculations involving Zn. When p = 1, the univariate normal distribution
can be used. See Sen and Singer (1993, p. 280) for the theorem, which implies
that β̂ ≈ Np(β, σ²(XᵀX)⁻¹). See Chapter 10 for the multivariate normal
distribution.
Theorem 2.8, LS CLT (Least Squares Central Limit Theorem):
Consider the MLR model Yi = xᵢᵀβ + ei and assume that the zero mean
errors are iid with E(ei) = 0 and VAR(ei) = σ². Also assume that
maxi(h1, . . . , hn) → 0 in probability as n → ∞ and

XᵀX/n → W⁻¹

as n → ∞. Then the least squares (OLS) estimator β̂ satisfies

√n (β̂ − β) →D Np(0, σ²W).  (2.12)

Equivalently,

(XᵀX)^{1/2} (β̂ − β) →D Np(0, σ²Ip).  (2.13)

Definition 2.20. A large sample 100(1 − δ)% prediction interval (PI) has
the form [L̂n, Ûn] where P(L̂n ≤ Yf ≤ Ûn) → 1 − δ as the sample size
n → ∞. For the Gaussian MLR model, assume that the random variable Yf
is independent of Y1, . . . , Yn. Then the 100(1 − δ)% PI for Yf is

Ŷf ± t_{n−p, 1−δ/2} se(pred)  (2.14)

where P(T ≤ t_{n−p, δ}) = δ if T has a t distribution with n − p degrees of
freedom. Generally se(pred) will come from output, but

se(pred) = √(MSE (1 + hf)).

The interpretation of a 100(1 − δ)% PI for a random variable Yf is similar
to that of a CI. Collect data, then form the PI, and repeat for a total of k
times where the k trials are independent from the same population. If Yfi is the
ith random variable and PIi is the ith PI, then the probability that Yfi ∈ PIi
for m of the PIs follows a binomial(k, ρ = 1 − δ) distribution. Hence if 100
95% PIs are made, ρ = 0.95 and Yfi ∈ PIi happens about 95 times.
There are two big differences between CIs and PIs. First, the length of
the CI goes to 0 as the sample size n goes to ∞ while the length of the PI
converges to some nonzero number J, say. Secondly, the CI for E(Yf |xf)
given in Definition 2.19 tends to work well for the unimodal MLR model if
the sample size is large while the PI in Definition 2.20 is made under the
assumption that the ei are iid N(0, σ²) and may not perform well if the
normality assumption is violated.
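The classical PI (2.14) is easy to compute directly from an OLS fit. Here is a minimal sketch in Python with NumPy (the book's computing is in R); the data, seed, and xf are illustrative assumptions, and the normal cutoff 1.96 stands in for t_{n−p, 0.975} since n − p is large.

```python
import numpy as np

# Simulated MLR data; coefficients and sizes are illustrative.
rng = np.random.default_rng(2)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 1.0, -1.0]) + rng.normal(size=n)

bhat, *_ = np.linalg.lstsq(X, Y, rcond=None)
r = Y - X @ bhat
MSE = (r @ r) / (n - p)

xf = np.array([1.0, 0.5, -0.5])                 # new predictor vector
hf = xf @ np.linalg.inv(X.T @ X) @ xf           # leverage of xf
Yf_hat = xf @ bhat
se_pred = np.sqrt(MSE * (1 + hf))               # se(pred) of (2.14)
lo, hi = Yf_hat - 1.96 * se_pred, Yf_hat + 1.96 * se_pred
```

The interval is centered at Ŷf and, unlike a CI for E(Yf |xf), its length does not shrink to zero as n grows, because the 1 + hf term is dominated by the constant 1.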

To see this, consider xf such that the heights Y of women between 18
and 24 is normal with a mean of 66 inches and an SD of 3 inches. A 95%
CI for E(Y |xf) should be centered at about 66 and the length should go
to zero as n gets large. But a 95% PI needs to contain about 95% of the
heights so the PI should converge to the interval 66 ± 1.96(3). This result
follows because if Y ∼ N(66, 9) then P(Y < 66 − 1.96(3)) = P(Y > 66 +
1.96(3)) ≈ 0.025. In other words, the endpoints of the PI estimate the 97.5
and 2.5 percentiles of the normal distribution. However, the percentiles of a
parametric error distribution depend heavily on the parametric distribution
and the parametric formulas are violated if the assumed error distribution is
incorrect.
Assume that the unimodal MLR model is valid so that e is from some
distribution with 0 mean and variance σ². Olive (2007) shows that if 1 − γ is
the asymptotic coverage of the classical nominal (1 − δ)100% PI (2.14), then

1 − γ = P(−σ z_{1−δ/2} < e < σ z_{1−δ/2}) ≥ 1 − 1/z²_{1−δ/2}  (2.15)

where the inequality follows from Chebyshev's inequality. Hence the asymp-
totic coverage of the nominal 95% PI is at least 73.9%. The 95% PI (2.14)
was often quite accurate in that the asymptotic coverage was close to 95% for
a wide variety of error distributions. The 99% and 90% PIs did not perform
as well.

Let ξδ be the δ percentile of the error e, i.e. P(e ≤ ξδ) = δ. Let ξ̂δ be
the sample δ percentile of the residuals. Then the results from Theorem 2.8
suggest that the residuals ri estimate the errors ei, and that the sample
percentiles of the residuals estimate the ξδ. For many error distributions,

E(MSE) = E( Σ ri²/(n − p) ) = σ² = E( Σ ei²/n ),

where the sums run over i = 1, . . . , n. This result suggests that

√(n/(n − p)) ri ≈ ei.

Using

an = (1 + 15/n) √( (n/(n − p)) (1 + hf) ),  (2.16)

a large sample semiparametric 100(1 − δ)% PI for Yf is

[Ŷf + an ξ̂_{δ/2}, Ŷf + an ξ̂_{1−δ/2}].  (2.17)

This PI is very similar to the classical PI except that ξ̂δ is used instead of
zδ to estimate the error percentiles ξδ. The large sample coverage 1 − γ of
this nominal 100(1 − δ)% PI is asymptotically correct: 1 − γ = 1 − δ.
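The semiparametric PI (2.17) can be sketched in a few lines of Python with NumPy (the book's lregpack functions in R implement this properly); the skewed simulated errors, seed, and xf below are illustrative assumptions of the sketch.

```python
import numpy as np

# Simulated MLR data with skewed exponential(1) - 1 errors, as in the simulations.
rng = np.random.default_rng(4)
n, p = 200, 3
delta = 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
e = rng.exponential(1.0, size=n) - 1.0
Y = X @ np.array([1.0, 1.0, -1.0]) + e

bhat, *_ = np.linalg.lstsq(X, Y, rcond=None)
r = Y - X @ bhat                                   # residuals estimate the errors

xf = np.array([1.0, 0.5, -0.5])
hf = xf @ np.linalg.inv(X.T @ X) @ xf
Yf_hat = xf @ bhat

an = (1 + 15 / n) * np.sqrt((n / (n - p)) * (1 + hf))   # correction factor (2.16)
xi_lo, xi_hi = np.quantile(r, [delta / 2, 1 - delta / 2])  # sample percentiles
lo, hi = Yf_hat + an * xi_lo, Yf_hat + an * xi_hi       # PI (2.17)
```

Because the endpoints come from the residual percentiles rather than normal cutoffs, the interval adapts to the skewness of the error distribution.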

Example 2.8. For the Buxton (1920) data suppose that the response Y
= height and the predictors were a constant, head length, nasal height, bigo-
nal breadth, and cephalic index. Five outliers were deleted leaving 82 cases.
Figure 2.3 shows a response plot of the fitted values versus the response Y
with the identity line added as a visual aid. The plot suggests that the model
is good since the plotted points scatter about the identity line in an evenly
populated band although the relationship is rather weak since the correlation
of the plotted points is not very high. The triangles represent the upper and
lower limits of the semiparametric 95% PI (2.17). For this example, 79 (or
96%) of the Yi fell within their corresponding PI while 3 Yi did not. A plot
using the classical PI (2.14) would be very similar for this data. The plot was
made with the following R commands, using the lregpack function piplot.

x <- buxx[-c(61,62,63,64,65),]
Y <- buxy[-c(61,62,63,64,65)]
piplot(x,Y)

Fig. 2.3 95% PI Limits for Buxton Data

Given output showing β̂i and given xf, se(pred), and se(Ŷf), Example 2.9
shows how to find Ŷf, a CI for E(Yf |xf), and the classical PI (2.14) for Yf.

Below is shown typical output in symbols. Sometimes "Label" is replaced by
"Predictor" and "Estimate" by "coef" or "Coefficients."

Label     Estimate  Std. Error  t-value                p-value
Constant  β̂1        se(β̂1)      to,1 = β̂1/se(β̂1)      for Ho: β1 = 0
x2        β̂2        se(β̂2)      to,2 = β̂2/se(β̂2)      for Ho: β2 = 0
...
xp        β̂p        se(β̂p)      to,p = β̂p/se(β̂p)      for Ho: βp = 0

Example 2.9. The Rouncefield (1995) data povc.lsp are female and
male life expectancies from n = 91 countries where 6 cases with missing GNP
were deleted. Suppose that it is desired to predict female life expectancy Y
from male life expectancy X. Suppose that if Xf = 60, then se(pred) =
2.1285, and se(Ŷf) = 0.2241. Below is some output.

Label Estimate Std. Error t-value p-value
Constant -2.93739 1.42523 -2.061 0.0422
mlife 1.12359 0.0229362 48.988 0.0000

a) Find Ŷf if Xf = 60.
Solution: In this example, xf = (1, Xf)ᵀ since a constant is in the output
above. Thus Ŷf = β̂1 + β̂2Xf = −2.93739 + 1.12359(60) = 64.478.
b) If Xf = 60, find a 90% confidence interval for E(Y) ≡ E(Yf |xf).
Solution: The CI is Ŷf ± t_{n−2, 1−δ/2} se(Ŷf) = 64.478 ± 1.645(0.2241) =
64.478 ± 0.3686 = [64.1094, 64.8466]. To use the t-table on the last page of
Chapter 14, use the 2nd to last row marked by Z since d = df = n − 2 =
89 > 30. In the last row find CI = 90% and intersect the 90% column and
the Z row to get the value of t_{89, 0.95} ≈ z_{0.95} = 1.645.
c) If Xf = 60, find a 90% prediction interval for Yf.
Solution: The PI is Ŷf ± t_{n−2, 1−δ/2} se(pred) = 64.478 ± 1.645(2.1285) =
64.478 ± 3.5014 = [60.9766, 67.9794].

Two more PIs will be defined and then the 4 PIs (2.14), (2.17), (2.18),
and (2.20) will be compared via simulation. An asymptotically conservative
(ac) 100(1 − δ)% PI has asymptotic coverage 1 − γ ≥ 1 − δ. We used the (ac)
100(1 − δ)% PI

Ŷf ± max(|ξ̂_{δ/2}|, |ξ̂_{1−δ/2}|) √( (n/(n − p)) (1 + hf) )  (2.18)

which has asymptotic coverage

1 − γ = P[ −max(|ξ_{δ/2}|, |ξ_{1−δ/2}|) < e < max(|ξ_{δ/2}|, |ξ_{1−δ/2}|) ].  (2.19)

Notice that 1 − δ ≤ 1 − γ ≤ 1 − δ/2 and 1 − γ = 1 − δ if the error distribution
is symmetric.

In the simulations described below, ξ̂δ will be the sample δ percentile for the
PIs (2.17) and (2.18). A PI is asymptotically optimal if it has the shortest
asymptotic length that gives the desired asymptotic coverage. If the error
distribution is unimodal, an asymptotically optimal PI can be created by
applying the shorth(c) estimator to the residuals where c = ⌈n(1 − δ)⌉ and
⌈x⌉ is the smallest integer ≥ x, e.g., ⌈7.7⌉ = 8. That is, let r(1), . . . , r(n) be the
order statistics of the residuals. Compute r(c) − r(1), r(c+1) − r(2), . . . , r(n) −
r(n−c+1). Let [r(d), r(d+c−1)] = [ξ̃_{δ1}, ξ̃_{1−δ2}] correspond to the interval with the
smallest distance. Then the large sample asymptotically optimal 100(1 − δ)%
PI for Yf is

[Ŷf + an ξ̃_{δ1}, Ŷf + an ξ̃_{1−δ2}]  (2.20)

where an is given by (2.16).
Remark 2.8. We recommend using the asymptotically optimal PI (2.20)
instead of the classical PI (2.14). The lregpack function pisim can be used to
recreate the simulation described below. See Problem 2.29.
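The shorth(c) computation behind PI (2.20) is short enough to sketch directly. Below is an illustrative Python/NumPy version (the book's lregpack in R provides tested implementations); the toy residual vector in the demo is an assumption chosen so the shortest interval visibly excludes an outlier.

```python
import math
import numpy as np

def shorth_pi(resid, yf_hat, an, delta=0.05):
    """Shortest interval covering c = ceil(n(1 - delta)) of the ordered
    residuals, scaled by an and shifted to the prediction, as in PI (2.20)."""
    r = np.sort(np.asarray(resid, dtype=float))
    n = len(r)
    c = math.ceil(n * (1 - delta))
    widths = r[c - 1:] - r[:n - c + 1]      # r(d+c-1) - r(d) for each start d
    d = int(np.argmin(widths))
    return yf_hat + an * r[d], yf_hat + an * r[d + c - 1]

# Toy check: with delta = 0.25, c = 4 of the 5 residuals must be covered,
# and the shortest such interval excludes the outlying residual 100.
lo, hi = shorth_pi([0.0, 1.0, 2.0, 3.0, 100.0], yf_hat=0.0, an=1.0, delta=0.25)
```

Unlike the percentile endpoints of (2.17), the shorth endpoints need not be symmetric in probability about the median, which is what makes the interval asymptotically the shortest one with the desired coverage for unimodal errors.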
A small simulation study compares the PI lengths and coverages for sample
sizes n = 50, 100, and 1000 for several error distributions. The value n = ∞
gives the asymptotic coverages and lengths. The MLR model with E(Yi) =
1 + xi2 + ⋯ + xi8 was used. The vectors (x2, . . . , x8)ᵀ were iid N7(0, I7).
The error distributions were N(0,1), t3, and exponential(1) − 1. Also, a small
sensitivity study to examine the effects of changing (1 + 15/n) to (1 + k/n)
on the 99% PIs (2.17) and (2.20) was performed. For n = 50 and k between
10 and 20, the coverage increased by roughly 0.001 as k increased by 1.

Table 2.1 N(0,1) Errors.

δ    n    clen  slen  alen  olen  ccov scov acov ocov
0.01 50   5.860 6.172 5.191 6.448 .989 .988 .972 .990
0.01 100  5.470 5.625 5.257 5.412 .990 .988 .985 .985
0.01 1000 5.182 5.181 5.263 5.097 .992 .993 .994 .992
0.01 ∞    5.152 5.152 5.152 5.152 .990 .990 .990 .990
0.05 50   4.379 5.167 4.290 5.111 .948 .974 .940 .968
0.05 100  4.136 4.531 4.172 4.359 .956 .970 .956 .958
0.05 1000 3.938 3.977 4.001 3.927 .952 .952 .954 .948
0.05 ∞    3.920 3.920 3.920 3.920 .950 .950 .950 .950
0.1  50   3.642 4.445 3.658 4.193 .894 .945 .895 .929
0.1  100  3.455 3.841 3.519 3.690 .900 .930 .905 .913
0.1  1000 3.304 3.343 3.352 3.304 .901 .903 .907 .901
0.1  ∞    3.290 3.290 3.290 3.290 .900 .900 .900 .900

Table 2.2 t3 Errors.

δ    n    clen  slen   alen   olen   ccov scov acov ocov
0.01 50   9.539 12.164 11.398 13.297 .972 .978 .975 .981
0.01 100  9.114 12.202 12.747 10.621 .978 .983 .985 .978
0.01 1000 8.840 11.614 12.411 11.142 .975 .990 .992 .988
0.01 ∞    8.924 11.681 11.681 11.681 .979 .990 .990 .990
0.05 50   7.160 8.313  7.210  8.139  .945 .956 .943 .956
0.05 100  6.874 7.326  7.030  6.834  .950 .955 .951 .945
0.05 1000 6.732 6.452  6.599  6.317  .951 .947 .950 .945
0.05 ∞    6.790 6.365  6.365  6.365  .957 .950 .950 .950
0.1  50   5.978 6.591  5.532  6.098  .915 .935 .900 .917
0.1  100  5.696 5.756  5.223  5.274  .916 .913 .901 .900
0.1  1000 5.648 4.784  4.842  4.706  .929 .901 .904 .898
0.1  ∞    5.698 4.707  4.707  4.707  .935 .900 .900 .900

Table 2.3 Exponential(1) − 1 Errors.

δ    n    clen  slen  alen  olen  ccov scov acov ocov
0.01 50   5.795 6.432 6.821 6.817 .971 .987 .976 .988
0.01 100  5.427 5.907 7.525 5.377 .974 .987 .986 .985
0.01 1000 5.182 5.387 8.432 4.807 .972 .987 .992 .987
0.01 ∞    5.152 5.293 8.597 4.605 .972 .990 .995 .990
0.05 50   4.310 5.047 5.036 4.746 .946 .971 .955 .964
0.05 100  4.100 4.381 5.189 3.840 .947 .971 .966 .955
0.05 1000 3.932 3.745 5.354 3.175 .945 .954 .972 .947
0.05 ∞    3.920 3.664 5.378 2.996 .948 .950 .975 .950
0.1  50   3.601 4.183 3.960 3.629 .920 .945 .925 .916
0.1  100  3.429 3.557 3.959 3.047 .930 .943 .945 .913
0.1  1000 3.303 3.005 3.989 2.460 .931 .906 .951 .901
0.1  ∞    3.290 2.944 3.991 2.303 .929 .900 .950 .900

The simulation compared coverages and lengths of the classical (2.14),
semiparametric (2.17), asymptotically conservative (2.18), and asymptoti-
cally optimal (2.20) PIs. The latter 3 intervals are asymptotically optimal
for symmetric unimodal error distributions in that they have the shortest
asymptotic length that gives the desired asymptotic coverage. The PIs (2.17)
and (2.20) also give the correct asymptotic coverage if the unimodal errors
are not symmetric, while the PI (2.18) gives higher coverage (is conserva-
tive). The simulation used 5000 runs and gave the proportion p̂ of runs
where Yf fell within the nominal 100(1 − δ)% PI. The count mp̂ has a
binomial(m = 5000, p = 1 − δn) distribution where 1 − δn converges to
the asymptotic coverage 1 − γ. The standard error for the proportion
is √(p̂(1 − p̂)/5000) = 0.0014, 0.0031, and 0.0042 for p = 0.01, 0.05 and
0.1, respectively. Hence an observed coverage p̂ ∈ (0.986, 0.994) for 99%,
p̂ ∈ (0.941, 0.959) for 95%, and p̂ ∈ (0.887, 0.913) for 90% PIs suggests that
there is no reason to doubt that the PI has the nominal coverage.

Tables 2.1–2.3 show the results of the simulations for the 3 error distri-
butions. The letters c, s, a, and o refer to intervals (2.14), (2.17), (2.18),
and (2.20), respectively. For the normal errors, the coverages were about
right and the semiparametric interval tended to be rather long for n = 50
and 100. The classical PI asymptotic coverage 1 − γ tended to be fairly close
to the nominal coverage 1 − δ for all 3 distributions and δ = 0.01, 0.05, and
0.1. The asymptotically optimal PI tended to have short length and simulated
coverage close to the nominal coverage.

2.6 The Partial F Test

Suppose that there is data on variables Z, w1, . . . , wr and that a useful MLR
model has been made using Y = t(Z), x1 ≡ 1, x2, . . . , xp where each xi is
some function of w1, . . . , wr. This useful model will be called the full model. It
is important to realize that the full model does not need to use every variable
wj that was collected. For example, variables with outliers or missing values
may not be used. Forming a useful full model is often very difficult, and it is
often not reasonable to assume that the candidate full model is good based
on a single data set, especially if the model is to be used for prediction.
Even if the full model is useful, the investigator will often be interested in
checking whether a model that uses fewer predictors will work just as well.
For example, perhaps xp is a very expensive predictor but is not needed given
that x1, . . . , xp−1 are in the model. Also a model with fewer predictors tends
to be easier to understand.

Definition 2.21. Let the full model use Y, x1 ≡ 1, x2, . . . , xp and let
the reduced model use Y, x1, xi2, . . . , xiq where {i2, . . . , iq} ⊂ {2, . . . , p}.

The partial F test is used to test whether the reduced model is good in
that it can be used instead of the full model. It is crucial that the reduced
model be selected before looking at the data. If the reduced model is selected
after looking at output and discarding the worst variables, then the p-value
for the partial F test will be too high. For (ordinary) least squares, usually
a constant is used, and we are assuming that both the full model and the
reduced model contain a constant. The partial F test has null hypothesis
Ho: βiq+1 = ⋯ = βip = 0, and alternative hypothesis HA: at least one of the
βij ≠ 0 for j > q. The null hypothesis is equivalent to Ho: the reduced model
is good. Since only the full model and reduced model are being compared,
the alternative hypothesis is equivalent to HA: the reduced model is not as
good as the full model, so use the full model, or more simply, HA: use the
full model.

To perform the partial F test, fit the full model and the reduced model
and obtain the ANOVA table for each model. The quantities dfF, SSE(F)
and MSE(F) are for the full model and the corresponding quantities from
the reduced model use an R instead of an F. Hence SSE(F) and SSE(R) are
the residual sums of squares for the full and reduced models, respectively.
Shown below is output only using symbols.

Full model

Source      df           SS      MS      Fo and p-value
Regression  p − 1        SSR     MSR     Fo = MSR/MSE
Residual    dfF = n − p  SSE(F)  MSE(F)  for Ho: β2 = ⋯ = βp = 0

Reduced model

Source      df           SS      MS      Fo and p-value
Regression  q − 1        SSR     MSR     Fo = MSR/MSE
Residual    dfR = n − q  SSE(R)  MSE(R)  for Ho: β2 = ⋯ = βq = 0

Be able to perform the 4 step partial F test of hypotheses. i) State
the hypotheses. Ho: the reduced model is good Ha: use the full model
ii) Find the test statistic.

FR = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] / MSE(F)

iii) Find the pval = P(F_{dfR−dfF, dfF} > FR). (On exams typically an F table is
used. Here dfR − dfF = p − q = number of parameters set to 0, and dfF = n − p,
while pval is the estimated p-value.)
iv) State whether you reject Ho or fail to reject Ho. Reject Ho if the pval ≤ δ
and conclude that the full model should be used. Otherwise, fail to reject Ho
and conclude that the reduced model is good.

Sometimes software has a shortcut. In particular, the R software uses the
anova command. See Problem 2.27. As an example, assume that the full
model uses x2 and x3 while the reduced model uses x2 . Both models contain
a constant. Then the following commands will perform the partial F test.
(On the computer screen the second command looks more like
red <- lm(y~x2).)
full <- lm(y~x2+x3)
red <- lm(y~x2)
anova(red,full)
For an n × 1 vector a, let

‖a‖ = √(a1² + ⋯ + an²) = √(aᵀa)

be the Euclidean norm of a. If r and rR are the vectors of residuals from
the full and reduced models, respectively, notice that SSE(F) = ‖r‖² and
SSE(R) = ‖rR‖².

The following proposition suggests that Ho is rejected in the partial F test
if the change in residual sum of squares SSE(R) − SSE(F) is large compared
to SSE(F). If the change is small, then FR is small and the test suggests
that the reduced model can be used.

Proposition 2.6. Let R² and R²R be the multiple coefficients of determi-
nation for the full and reduced models, respectively. Let Ŷ and ŶR be the
vectors of fitted values for the full and reduced models, respectively. Then
the test statistic in the partial F test is

FR = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] / MSE(F)

= [ (‖Ŷ‖² − ‖ŶR‖²) / (dfR − dfF) ] / MSE(F)

= [ (SSE(R) − SSE(F)) / SSE(F) ] · (n − p)/(p − q)
= [ (R² − R²R) / (1 − R²) ] · (n − p)/(p − q).
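The algebraic equivalence in Proposition 2.6 between the SSE form and the R² form of FR can be checked numerically. The sketch below uses Python with NumPy for illustration (the book's own code uses R's lm and anova); the simulated data, in which the last p − q predictors are irrelevant by construction, are an assumption of the sketch.

```python
import numpy as np

# Nested OLS models: the full model uses all p columns of X, the reduced
# model the first q. Sizes, seed, and coefficients are illustrative.
rng = np.random.default_rng(3)
n, p, q = 120, 5, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X[:, :q] @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

def sse_and_r2(Xm, y):
    """Residual sum of squares and R^2 for an OLS fit with intercept."""
    b, *_ = np.linalg.lstsq(Xm, y, rcond=None)
    resid = y - Xm @ b
    sse = resid @ resid
    ssto = ((y - y.mean()) ** 2).sum()
    return sse, 1 - sse / ssto

SSE_F, R2_F = sse_and_r2(X, Y)           # full model
SSE_R, R2_R = sse_and_r2(X[:, :q], Y)    # reduced model
MSE_F = SSE_F / (n - p)

FR = ((SSE_R - SSE_F) / (p - q)) / MSE_F               # dfR - dfF = p - q
FR_from_R2 = ((R2_F - R2_R) / (1 - R2_F)) * (n - p) / (p - q)
```

Since the reduced model is nested in the full model, SSE(R) ≥ SSE(F), so FR is always nonnegative, and the two formulas agree up to rounding error.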
Definition 2.22. An FF plot is a plot of fitted values from 2 different
models or fitting methods.

Six plots are useful diagnostics for the partial F test: the RR plot with
the full model residuals on the vertical axis and the reduced model residuals
on the horizontal axis, the FF plot with the full model fitted values on the
vertical axis, and always make the response and residual plots for the full
and reduced models. Suppose that the full model is a useful MLR model. If
the reduced model is good, then the response plots from the full and reduced
models should be very similar, visually. Similarly, the residual plots (of the
fitted values versus the residuals) from the full and reduced models should be
very similar, visually. Finally, the correlation of the plotted points in the RR
and FF plots should be high, ≥ 0.95, say, and the plotted points in the RR
and FF plots should cluster tightly about the identity line. Add the identity
line to both the RR and FF plots as a visual aid. Also add the OLS line from
regressing r on rR to the RR plot (the OLS line is the identity line in the FF
plot). If the reduced model is good, then the OLS line should nearly coincide
with the identity line in that it should be difficult to see that the two lines
intersect at the origin, as in Figure 2.2. If the FF plot looks good but the
RR plot does not, the reduced model may be good if the main goal of the
analysis is to predict Y.

In Chapter 3, Example 3.8 describes the Gladstone (1905) data. Let the
reduced model use a constant, (size)^{1/3}, sex, and age. Then Figure 3.7 shows
the response and residual plots for the full and reduced models, and Figure 3.9
shows the RR and FF plots.

Example 2.10. For the Buxton (1920) data, n = 76 after 5 outliers and
6 cases with missing values are removed. Assume that the response variable
Y is height, and the explanatory variables are x2 = bigonal breadth, x3 =
cephalic index, x4 = finger to ground, x5 = head length, x6 = nasal height,
and x7 = sternal height. Suppose that the full model uses all 6 predictors plus
a constant (x1) while the reduced model uses the constant, cephalic index,
and finger to ground. Test whether the reduced model can be used instead of
the full model using the output below.

Summary Analysis of Variance Table for the Full Model
Source df SS MS F p-value
Regression 6 260467. 43411.1 87.41 0.0000
Residual 69 34267.4 496.629

Summary Analysis of Variance Table for Reduced Model
Source df SS MS F p-value
Regression 2 94110.5 47055.3 17.12 0.0000
Residual 73 200623. 2748.27

Solution: The 4 step partial F test is shown below.
i) Ho: the reduced model is good Ha: use the full model
ii)

FR = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] / MSE(F)
= [ (200623.0 − 34267.4) / (73 − 69) ] / 496.629
= 41588.9/496.629 = 83.742.

iii) pval = P(F4,69 > 83.742) = 0.00.
iv) The pval < δ (δ = 0.05, since δ was not given), so reject Ho. The full model
should be used instead of the reduced model. (Bigonal breadth, head length,
nasal height, and sternal height are needed in the MLR for height given that
cephalic index and finger to ground are in the model.)
Using a computer to get the pval makes sense, but for exams you may need
to use a table. In ARC, you can use the Calculate probability option from the
ARC menu, enter 83.742 as the value of the statistic, 4 and 69 as the degrees
of freedom, and select the F distribution. To use the table near the end of
Chapter 14, use the bottom row since the denominator degrees of freedom 69
> 30. Intersect with the column corresponding to k = 4 numerator degrees of
freedom. The cutoff value is 2.37. If the FR statistic was 2.37, then the pval
would be 0.05. Since 83.742 > 2.37, the pval < 0.05, and since 83.742 >> 2.37,
we can say that the pval ≈ 0.0.

Example 2.11. Now assume that the reduced model uses the constant,
sternal height, finger to ground, and head length. Using the output below, test
whether the reduced model is good.
Summary Analysis of Variance Table for Reduced Model
Source df SS MS F p-value
Regression 3 259704. 86568. 177.93 0.0000
Residual 72 35030.1 486.528
Solution: The 4 step partial F test follows.
i) Ho: the reduced model is good Ha: use the full model
ii)

FR = [(SSE(R) − SSE(F))/(dfR − dfF)]/MSE(F)
= [(35030.1 − 34267.4)/(72 − 69)]/496.629 = 254.2333/496.629 = 0.512.

iii) The pval = P(F3,69 > 0.512) = 0.675.
iv) The pval > δ, so fail to reject Ho. The reduced model is good.
To use the F table near the end of Chapter 14, use the bottom row since
the denominator degrees of freedom 69 > 30. Intersect with the column
corresponding to k = 3 numerator degrees of freedom. The cutoff value is 2.61.
Since 0.512 < 2.61, pval > 0.05, and this is enough information to fail to
reject Ho.
Some R commands and output to do the above problem are shown below.
cyp <- matrix(scan(),nrow=76,ncol=8,byrow=T,dimnames=
list( c(), c("indx", "ht", "sternal", "finger",
"hdlen","nasal","bigonal", "cephalic")))
#copy and paste the data set cyp.lsp then press enter
cyp <- cyp[,-1]; cyp <- as.data.frame(cyp)
full <- lm(ht~.,data=cyp)
red <- lm(ht~sternal+finger+hdlen,data=cyp)
anova(red,full)
Model 1: ht ~ sternal + finger + hdlen
Model 2: ht ~ sternal + finger + hdlen + nasal
+ bigonal + cephalic
Res.Df RSS Df Sum of Sq F Pr(>F)
1 72 35030
2 69 34267 3 762.67 0.5119 0.6754

2.7 The Wald t Test

Often investigators hope to examine βk in order to determine the importance
of the predictor xk in the model; however, βk is the coefficient for xk given
that the other predictors are in the model. Hence βk depends strongly on
the other predictors in the model. Suppose that the model has an intercept:

x1 ≡ 1. The predictor xk is highly correlated with the other predictors if the
OLS regression of xk on x1, . . . , xk−1, xk+1, . . . , xp has a high coefficient of
determination R²k. If this is the case, then often xk is not needed in the model
given that the other predictors are in the model. If at least one R²k is high
for k ≥ 2, then there is multicollinearity among the predictors.
As an example, suppose that Y = height, x1 ≡ 1, x2 = left leg length, and
x3 = right leg length. Then x2 should not be needed given x3 is in the model
and β2 = 0 is reasonable. Similarly β3 = 0 is reasonable. On the other hand,
if the model only contains x1 and x2, then x2 is extremely important with β2
near 2. If the model contains x1, x2, x3, x4 = height at shoulder, x5 = right
arm length, x6 = head length, and x7 = length of back, then R²i may be high
for each i ≥ 2. Hence xi is not needed in the MLR model for Y given that
the other predictors are in the model.
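The R²k diagnostic above can be computed by regressing each predictor on the others. A sketch (Python with NumPy, not code from the text; the simulated predictors are purely illustrative) in which x2 and x3 are nearly collinear, so both R²k values are high:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x2 = rng.normal(size=n)
x3 = x2 + rng.normal(scale=0.1, size=n)  # x3 is nearly equal to x2

def r_squared(y, X):
    # R^2 from the OLS regression of y on the columns of X (X includes a constant)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

ones = np.ones(n)
R2_2 = r_squared(x2, np.column_stack([ones, x3]))  # regress x2 on the rest
R2_3 = r_squared(x3, np.column_stack([ones, x2]))  # regress x3 on the rest
print(R2_2 > 0.9 and R2_3 > 0.9)  # True: strong multicollinearity
```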
Definition 2.23. The 100 (1 − δ)% CI for βk is β̂k ± tn−p,1−δ/2 se(β̂k). If
the degrees of freedom d = n − p ≥ 30, the N(0,1) cutoff z1−δ/2 may be used.
Know how to do the 4 step Wald t–test of hypotheses.
i) State the hypotheses Ho: βk = 0 Ha: βk ≠ 0.
ii) Find the test statistic to,k = β̂k/se(β̂k) or obtain it from output.
iii) Find pval from output or use the t–table: pval =

2P(tn−p < −|to,k|) = 2P(tn−p > |to,k|).

Use the normal table or the d = Z line in the t–table if the degrees of freedom
d = n − p ≥ 30. Again pval is the estimated p–value.
iv) State whether you reject Ho or fail to reject Ho and give a nontechnical
sentence restating your conclusion in terms of the story problem.

Recall that Ho is rejected if the pval ≤ δ. As a benchmark for this textbook,
use δ = 0.05 if δ is not given. If Ho is rejected, then conclude that xk is needed
in the MLR model for Y given that the other predictors are in the model.
If you fail to reject Ho, then conclude that xk is not needed in the MLR
model for Y given that the other predictors are in the model. (Or there is
not enough evidence to conclude that xk is needed in the MLR model given
that the other predictors are in the model.) Note that xk could be a very
useful individual predictor, but may not be needed if other predictors are
added to the model. It is better to use the output to get the test statistic
and pval than to use formulas and the t–table, but exams may not give the
relevant output.
Definition 2.24. Assume that there is a constant x1 ≡ 1 in the model,
and let x(k) = (x1, . . . , xk−1, xk+1, . . . , xp)T be the vector of predictors with
the kth predictor xk deleted. Let r(k) be the residuals from regressing Y on
x(k), that is, on all of the predictor variables except xk. Let r(xk|x(k)) denote
the residuals from regressing xk on x(k). Then an added variable plot for
xk is a plot of r(xk|x(k)) versus r(k) for k = 2, . . . , p.

The added variable plot (also called a partial regression plot) is used to
give information about the test Ho: βk = 0. The points in the plot cluster
about a line through the origin with slope = β̂k. An interesting fact is that the
residuals from this line, i.e. the residuals from regressing r(k) on r(xk|x(k)),
are exactly the same as the usual residuals from regressing Y on x. The range
of the horizontal axis gives information about the collinearity of xk with the
other predictors. Small range implies that xk is well explained by the other
predictors. The r(xk|x(k)) represent the part of xk that is not explained by
the remaining variables while the r(k) represent the part of Y that is not
explained by the remaining variables.
An added variable plot with a clearly nonzero slope and tight clustering
about a line implies that xk is needed in the MLR for Y given that the other
predictors x2, . . . , xk−1, xk+1, . . . , xp are in the model. Slope near zero in the
added variable plot implies that xk may not be needed in the MLR for Y
given that all other predictors x2, . . . , xk−1, xk+1, . . . , xp are in the model.
If the zero line with 0 slope and 0 intercept and the OLS line are added to
the added variable plot, the variable is probably needed if it is clear that the
two lines intersect at the origin. Then the point cloud should be tilted away
from the zero line. The variable is probably not needed if the two lines nearly
coincide near the origin in that you cannot clearly tell that they intersect at
the origin.
Shown below is output only using symbols and the following example shows
how to use output to perform the Wald t–test.

Response = Y
Coefficient Estimates

Label     Estimate  Std. Error  t-value                 p-value
Constant  β̂1       se(β̂1)      to,1                    for Ho: β1 = 0
x2        β̂2       se(β̂2)      to,2 = β̂2/se(β̂2)        for Ho: β2 = 0
...
xp        β̂p       se(β̂p)      to,p = β̂p/se(β̂p)        for Ho: βp = 0
Output for Ex. 2.12
Label Estimate Std. Error t-value p-value
Constant -7736.26 2660.36 -2.908 0.0079
x2 0.180225 0.00503871 35.768 0.0000
x3 -1.89411 2.65789 -0.713 0.4832

R Squared: 0.988, Sigma hat: 4756.08, n = 26

Summary Analysis of Variance Table


Source df SS MS F p-value
Regression 2 41380950140. 20690475070. 914.69 0.00
Residual 23 520265969. 22620260.

Example 2.12. The output above was collected from 26 districts in Prus-
sia in 1843. See Hebbler (1847). The goal is to study the relationship between
Y = the number of women married to civilians in the district with the predic-
tors x2 = the population of the district, and x3 = military women = number
of women married to husbands in the military.
a) Find a 95% confidence interval for β2 corresponding to population.
The CI is β̂k ± tn−p,1−δ/2 se(β̂k). Since n = 26, df = n − p = 26 − 3 = 23.
From the t–table at the end of Chapter 14, intersect the df = 23 row with
the column that is labelled by 95% in the CI row near the bottom of the
table. Then tn−p,1−δ/2 = 2.069. Using the output shows that the 95% CI is
0.180225 ± 2.069(0.00503871) = [0.16980, 0.19065].
b) Perform a 4 step test for Ho: β2 = 0 corresponding to population.
i) Ho: β2 = 0 HA: β2 ≠ 0
ii) to,2 = 35.768
iii) pval = 0.0
iv) Reject Ho, the population is needed in the MLR model for the number
of women married to civilians if the number of military women is in the
model.
c) Perform a 4 step test for Ho: β3 = 0 corresponding to military women.
i) Ho: β3 = 0 HA: β3 ≠ 0
ii) to,3 = −0.713
iii) pval = 0.4832
iv) Fail to reject Ho, the number of military women is not needed in the
MLR model for the number of women married to civilians if population is in
the model.
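The interval in part a) is just arithmetic once the t cutoff is read from the table. A quick check (Python; the estimate and standard error come from the output above):

```python
# 95% CI for beta_2 in Example 2.12: estimate +/- t_{23, 0.975} * SE
est, se, tcut = 0.180225, 0.00503871, 2.069  # t cutoff from the t table, df = 23
lo, hi = est - tcut * se, est + tcut * se
print(round(lo, 5), round(hi, 5))  # approximately 0.16980 and 0.19065
```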

Figure 2.4, made with the commands shown below, shows the added vari-
able plots for x2 and x3. The plot for x2 strongly suggests that x2 is needed
in the MLR model while the plot for x3 indicates that x3 does not seem to
be very important. The slope of the OLS line in a) is 0.1802 while the slope
of the line in b) is −1.894.

source("G:/lregdata.txt")
x2 <- marry[,1]
x3 <- marry[,5]
y <- marry[,3]
#par(mfrow=c(1,2),pty="s")
#square plots look nice but have too much white space
par(mfrow=c(1,2))
resy2 <- residuals(lm(y~x3))
resx2 <- residuals(lm(x2~x3))
plot(resx2,resy2)
abline(lsfit(resx2,resy2)$coef)
title("a) Added Variable Plot for x2")
resy3 <- residuals(lm(y~x2))

[Figure 2.4: panel a) Added Variable Plot for x2 plots resy2 versus resx2;
panel b) Added Variable Plot for x3 plots resy3 versus resx3.]

Fig. 2.4 Added Variable Plots for x2 and x3

resx3 <- residuals(lm(x3~x2))


plot(resx3,resy3)
abline(lsfit(resx3,resy3)$coef)
title("b) Added Variable Plot for x3")
par(mfrow=c(1,1))
If the predictor xk is categorical, e.g. gender, the added variable plot may
look like two spheres, but if the OLS line is added to the plot, it will have
slope equal to β̂k.

2.8 The OLS Criterion

The OLS estimator β̂ minimizes the OLS criterion

QOLS(η) = Σ_{i=1}^n ri²(η)

[Figure 2.5: panel a) OLS Minimizes Sum of Squared Vertical Deviations
plots Y versus OLSESP; panel b) This ESP Has a Much Larger Sum plots
Y versus BADESP.]
Fig. 2.5 The OLS Fit Minimizes the Sum of Squared Residuals

where the residual ri(η) = Yi − x_i^T η. In other words, let ri = ri(β̂) be the
OLS residuals. Then Σ_{i=1}^n ri² ≤ Σ_{i=1}^n ri²(η) for any p × 1 vector η, and
the equality holds iff η = β̂ if the n × p design matrix X is of full rank p ≤ n.
In particular, if X has full rank p, then Σ_{i=1}^n ri² < Σ_{i=1}^n ri²(β) = Σ_{i=1}^n ei²
even if the MLR model Y = Xβ + e is a good approximation to the data.

Example 2.13. When a model depends on the predictors x only through
the linear combination x^T β, then x^T β is called a sufficient predictor and
x^T β̂ is called an estimated sufficient predictor (ESP). For OLS the model is
Y = x^T β + e, and the fitted value Ŷ = ESP. To illustrate the OLS criterion
graphically, consider the Gladstone (1905) data where we used brain weight as
the response. A constant, x2 = age, x3 = sex, and x4 = (size)^{1/3} were used
as predictors after deleting five infants from the data set. In Figure 2.5a, the
OLS response plot of the OLS ESP = Ŷ versus Y is shown. The vertical devi-
ations from the identity line are the residuals, and OLS minimizes the sum of
squared residuals. If any other ESP x^T η is plotted versus Y, then the vertical
deviations from the identity line are the residuals ri(η). For this data, the OLS
estimator β̂ = (498.726, 1.597, 30.462, 0.696)^T. Figure 2.5b shows the re-
sponse plot using the ESP x^T η where η = (498.726, 1.597, 30.462, 0.796)^T.
Hence only the coefficient for x4 was changed; however, the residuals ri(η) in
the resulting plot are much larger in magnitude on average than the residuals
in the OLS response plot. With slightly larger changes in the OLS ESP, the
resulting η will be such that the squared residuals are massive.

Proposition 2.10. The OLS estimator β̂ is the unique minimizer of the
OLS criterion if X has full rank p ≤ n.

Proof: Seber and Lee (2003, pp. 36–37). Recall that the hat matrix H =
X(X^T X)^{-1} X^T and notice that (I − H)^T = I − H, that (I − H)H = 0,
and that HX = X. Let η be any p × 1 vector. Then

(Y − Xβ̂)^T (Xβ̂ − Xη) = (Y − HY)^T (HY − HXη) =
Y^T (I − H)H(Y − Xη) = 0.

Thus QOLS(η) = ‖Y − Xη‖² = ‖Y − Xβ̂ + Xβ̂ − Xη‖² =
‖Y − Xβ̂‖² + ‖Xβ̂ − Xη‖² + 2(Y − Xβ̂)^T (Xβ̂ − Xη).
Hence
‖Y − Xη‖² = ‖Y − Xβ̂‖² + ‖Xβ̂ − Xη‖².  (2.21)

So
‖Y − Xη‖² ≥ ‖Y − Xβ̂‖²

with equality iff
X(β̂ − η) = 0

iff β̂ = η since X is full rank. □

Alternatively calculus can be used. Notice that ri(η) = Yi − xi,1 η1 − xi,2 η2 −
⋯ − xi,p ηp. Recall that x_i^T is the ith row of X while v_j is the jth column.
Since QOLS(η) =

Σ_{i=1}^n (Yi − xi,1 η1 − xi,2 η2 − ⋯ − xi,p ηp)²,

the jth partial derivative

∂QOLS(η)/∂ηj = −2 Σ_{i=1}^n xi,j (Yi − xi,1 η1 − xi,2 η2 − ⋯ − xi,p ηp) = −2 (v_j)^T (Y − Xη)

for j = 1, . . . , p. Combining these equations into matrix form, setting the
derivative to zero and calling the solution β̂ gives

X^T Y − X^T X β̂ = 0,

or
X^T X β̂ = X^T Y.  (2.22)

Equation (2.22) is known as the normal equations. If X has full rank,
then β̂ = (X^T X)^{-1} X^T Y. To show that β̂ is the global minimizer of the
OLS criterion, use the argument following Equation (2.21).
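As a numerical check of the normal equations (a sketch in Python with NumPy, not from the text; the simulated data are illustrative), the solution of X^T X β̂ = X^T Y makes X^T r vanish and agrees with a generic least squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # full rank n x p
beta = np.array([2.0, -1.0, 0.5])
Y = X @ beta + rng.normal(scale=0.3, size=n)

# Solve the normal equations X^T X beta_hat = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
r = Y - X @ beta_hat  # residual vector

print(np.max(np.abs(X.T @ r)) < 1e-8)  # X^T r = 0 up to rounding error
print(np.allclose(beta_hat,
                  np.linalg.lstsq(X, Y, rcond=None)[0]))  # same minimizer
```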

2.9 Two Important Special Cases

When studying a statistical model, it is often useful to try to understand
the model that contains a constant but no nontrivial predictors, then try to
understand the model with a constant and one nontrivial predictor, then the
model with a constant and two nontrivial predictors, and then the general
model with many predictors. In this text, most of the models are such that
Y is independent of x given x^T β, written

Y ⊥ x | x^T β.

Then wi = x_i^T β is a scalar, and trying to understand the model in terms
of x_i^T β is about as easy as trying to understand the model in terms of one
nontrivial predictor. In particular, the response plot of x_i^T β̂ versus Yi is
essential.
For MLR, the two main benefits of studying the MLR model with one
nontrivial predictor X are that the data can be plotted in a scatterplot of Xi
versus Yi and that the OLS estimators can be computed by hand with the
aid of a calculator if n is small.

2.9.1 The Location Model

The location model

Yi = μ + ei,  i = 1, . . . , n  (2.23)

is a special case of the multiple linear regression model where p = 1, X = 1,
and β = β1 = μ. This model contains a constant but no nontrivial predictors.
In the location model, β̂OLS = β̂1 = μ̂ = Ȳ. To see this, notice that

QOLS(η) = Σ_{i=1}^n (Yi − η)²  and  dQOLS(η)/dη = −2 Σ_{i=1}^n (Yi − η).

Setting the derivative equal to 0 and calling the solution μ̂ gives Σ_{i=1}^n Yi = n μ̂,
or μ̂ = Ȳ. The second derivative

d²QOLS(η)/dη² = 2n > 0,

hence μ̂ is the global minimizer.
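A small numerical illustration of this fact (Python sketch, not from the text): the sample mean minimizes the location-model criterion, and any perturbation of it increases the sum of squares.

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(loc=5.0, size=40)

def Q(eta):
    # OLS criterion for the location model
    return np.sum((Y - eta) ** 2)

mu_hat = Y.mean()
# Q(mu_hat + d) = Q(mu_hat) + n*d^2, so every perturbation is worse
print(all(Q(mu_hat) < Q(mu_hat + d) for d in (-1.0, -0.1, 0.1, 1.0)))  # True
```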

2.9.2 Simple Linear Regression

The simple linear regression (SLR) model is

Yi = β1 + β2 Xi + ei = α + β Xi + ei

where the ei are iid with E(ei) = 0 and VAR(ei) = σ² for i = 1, . . . , n.
The Yi and ei are random variables while the Xi are treated as known
constants. The parameters β1, β2, and σ² are unknown constants that
need to be estimated. (If the Xi are random variables, then the model is
conditional on the Xi's provided that the errors ei are independent of the
Xi. Hence the Xi's are still treated as constants.)
The SLR model is a special case of the MLR model with p = 2, xi,1 ≡ 1,
and xi,2 = Xi. The normal SLR model adds the assumption that the ei are
iid N(0, σ²). That is, the error distribution is normal with zero mean and
constant variance σ². The response variable Y is the variable that you want
to predict while the predictor variable X is the variable used to predict the
response. For SLR, E(Yi) = β1 + β2 Xi and the line E(Y) = β1 + β2 X is the
regression function. VAR(Yi) = σ².
For SLR, the least squares estimators β̂1 and β̂2 minimize the least
squares criterion Q(η1, η2) = Σ_{i=1}^n (Yi − η1 − η2 Xi)². For a fixed η1 and η2,
Q is the sum of the squared vertical deviations from the line Y = η1 + η2 X.
The least squares (OLS) line is Ŷ = β̂1 + β̂2 X where the slope

β̂2 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)²

and the intercept β̂1 = Ȳ − β̂2 X̄.
By the chain rule,

∂Q/∂η1 = −2 Σ_{i=1}^n (Yi − η1 − η2 Xi)

and
∂²Q/∂η1² = 2n.

Similarly,
∂Q/∂η2 = −2 Σ_{i=1}^n Xi (Yi − η1 − η2 Xi)

and
∂²Q/∂η2² = 2 Σ_{i=1}^n Xi².

Setting the first partial derivatives to zero and calling the solutions β̂1 and
β̂2 shows that the OLS estimators β̂1 and β̂2 satisfy the normal equations:

Σ_{i=1}^n Yi = n β̂1 + β̂2 Σ_{i=1}^n Xi  and

Σ_{i=1}^n Xi Yi = β̂1 Σ_{i=1}^n Xi + β̂2 Σ_{i=1}^n Xi².
The first equation gives β̂1 = Ȳ − β̂2 X̄.
There are several equivalent formulas for the slope β̂2:

β̂2 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)²
= [Σ_{i=1}^n Xi Yi − (1/n)(Σ_{i=1}^n Xi)(Σ_{i=1}^n Yi)] / [Σ_{i=1}^n Xi² − (1/n)(Σ_{i=1}^n Xi)²]
= Σ_{i=1}^n (Xi − X̄)Yi / Σ_{i=1}^n (Xi − X̄)²
= [Σ_{i=1}^n Xi Yi − n X̄ Ȳ] / [Σ_{i=1}^n Xi² − n (X̄)²] = ρ̂ sY/sX.

Here the sample correlation ρ̂ = ρ̂(X, Y) = corr(X, Y) =

Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / [(n − 1) sX sY]
= Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / sqrt[Σ_{i=1}^n (Xi − X̄)² Σ_{i=1}^n (Yi − Ȳ)²]

where the sample standard deviation

sW = sqrt[ (1/(n − 1)) Σ_{i=1}^n (Wi − W̄)² ]

for W = X, Y. Notice that the term n − 1 that occurs in the denominator of
ρ̂, s²Y, and s²X can be replaced by n as long as n is used in all 3 quantities.
Also notice that the slope β̂2 = Σ_{i=1}^n ki Yi where the constants

ki = (Xi − X̄) / Σ_{j=1}^n (Xj − X̄)².  (2.24)
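The equivalence of the slope formulas above is easy to verify numerically (a Python/NumPy sketch rather than the book's R; the simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(scale=0.5, size=n)
Xbar, Ybar = X.mean(), Y.mean()

# Three of the equivalent formulas for the slope
b2_a = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
b2_b = (np.sum(X * Y) - n * Xbar * Ybar) / (np.sum(X ** 2) - n * Xbar ** 2)
b2_c = np.corrcoef(X, Y)[0, 1] * Y.std(ddof=1) / X.std(ddof=1)  # rho_hat*sY/sX
b1 = Ybar - b2_a * Xbar  # intercept

print(np.allclose([b2_a, b2_a], [b2_b, b2_c]))             # formulas agree
print(np.allclose([b1, b2_a], np.polyfit(X, Y, 1)[::-1]))  # matches LS fit
```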

2.10 The No Intercept MLR Model

The no intercept MLR model, also known as regression through the origin, is
still Y = Xβ + e, but there is no intercept in the model, so X does not contain
a column of ones 1. Hence the intercept term β1 = β1(1) is replaced by β1 xi1.
Software gives output for this model if the no intercept or intercept = F
option is selected. For the no intercept model, the assumption E(e) = 0 is
important, and this assumption is rather strong.
Many of the usual MLR results still hold: β̂OLS = (X^T X)^{-1} X^T Y, the
vector of predicted fitted values Ŷ = X β̂OLS = HY where the hat matrix
H = X(X^T X)^{-1} X^T provided the inverse exists, and the vector of residuals
is r = Y − Ŷ. The response plot and residual plot are made in the same way
and should be made before performing inference.
The main difference in the output is the ANOVA table. The ANOVA F
test in Section 2.4 tests Ho: β2 = ⋯ = βp = 0. The test in this section tests
Ho: β1 = ⋯ = βp = 0 ⇔ Ho: β = 0. The following definition and test
follow Guttman (1982, p. 147) closely.

Definition 2.25. Assume that Y = Xβ + e where the ei are iid. Assume
that it is desired to test Ho: β = 0 versus HA: β ≠ 0.
a) The uncorrected total sum of squares

SST = Σ_{i=1}^n Yi².  (2.25)

b) The model sum of squares

SSM = Σ_{i=1}^n Ŷi².  (2.26)

c) The residual sum of squares or error sum of squares is

SSE = Σ_{i=1}^n (Yi − Ŷi)² = Σ_{i=1}^n ri².  (2.27)

d) The degrees of freedom (df) for SSM is p, the df for SSE is n − p, and
the df for SST is n. The mean squares are MSE = SSE/(n − p) and MSM =
SSM/p.

The ANOVA table given for the no intercept or intercept = F option
is below.

Summary Analysis of Variance Table

Source    df   SS   MS   F           p-value
Model     p    SSM  MSM  Fo=MSM/MSE  for Ho:
Residual  n-p  SSE  MSE              β = 0

The 4 step no intercept ANOVA F test for β = 0 is below.
i) State the hypotheses Ho: β = 0, Ha: β ≠ 0.
ii) Find the test statistic Fo = MSM/MSE or obtain it from output.
iii) Find the pval from output or use the F table: pval = P(Fp,n−p > Fo).
iv) State whether you reject Ho or fail to reject Ho. If Ho is rejected, conclude
that there is an MLR relationship between Y and the predictors x1, . . . , xp. If
you fail to reject Ho, conclude that there is not an MLR relationship between
Y and the predictors x1, . . . , xp. (Or there is not enough evidence to conclude
that there is an MLR relationship between Y and the predictors.)

Warning: Several important models can be cast in the no intercept MLR
form, but often a different test than Ho: β = 0 is desired. For example, when
the generalized or weighted least squares models of Chapter 4 are transformed
into no intercept MLR form, the test of interest is Ho: β2 = ⋯ = βp = 0.
The one way ANOVA model of Chapter 5 is equivalent to the cell means
model, which is in no intercept MLR form, but the test of interest is Ho:
μ1 = ⋯ = μp.

Proposition 2.11. Suppose Y = Xβ + e where X may or may not con-
tain a column of ones. Then the partial F test of Section 2.6 can be used for
inference.

Example 2.14. Consider the Gladstone (1905) data described in Exam-
ple 2.5. If the file of data sets lregdata is downloaded into R, then the ANOVA
F statistic for testing β2 = ⋯ = β4 = 0 can be found with the following
commands. The command lsfit adds a column of ones to x, which contains
the variables size, sex, breadth, and circumference. Three of these predictor
variables are head measurements. Then the response Y is brain weight, and
the model contains a constant (intercept).

> y <- cbrainy


> x <- cbrainx[,c(11,10,3,6)]
> ls.print(lsfit(x,y))
F-statistic (df=4, 262)=196.2433

The ANOVA F test can also be found with the no intercept model by
adding a column of ones to the R matrix x and then performing the partial
F test with the full model and the reduced model that only uses the column
of ones. Notice that the intercept=F option needs to be used to fit both
models. The residual standard error is RSE = √MSE. Thus SSE = (n −
k)(RSE)² where n − k is the denominator degrees of freedom for the F test

and k is the numerator degrees of freedom = number of variables in the


model. The column of ones xone is counted as a variable. The last line of
output computes the partial F statistic and is again 196.24.

> xone <- 1 + 0*1:267


> x <- cbind(xone,x)
> ls.print(lsfit(x,y,intercept=F))
Residual Standard Error=82.9175
F-statistic (df=5, 262)=12551.02

Estimate Std.Err t-value Pr(>|t|)


xone 99.8495 171.6189 0.5818 0.5612
size 0.2209 0.0358 6.1733 0.0000
sex 22.5491 11.2372 2.0066 0.0458
breadth -1.2464 1.5139 -0.8233 0.4111
circum 1.0255 0.4719 2.1733 0.0307

> ls.print(lsfit(x[,1],y,intercept=F))
Residual Standard Error=164.5028
F-statistic (df=1, 266)=15744.48

Estimate Std.Err t-value Pr(>|t|)


X 1263.228 10.0674 125.477 0

((266*(164.5028)^2 - 262*(82.9175)^2)/4)/(82.9175)^2
[1] 196.2435

2.11 Summary

1) The response variable is the variable that you want to predict. The pre-
dictor variables are the variables used to predict the response variable.
2) Regression is the study of the conditional distribution Y |x.
3) The MLR model is

Yi = xi,1 β1 + xi,2 β2 + ⋯ + xi,p βp + ei = x_i^T β + ei

for i = 1, . . . , n. Here n is the sample size and the random variable ei is the ith
error. Assume that the errors are iid with E(ei) = 0 and VAR(ei) = σ² < ∞.
Assume that the errors are independent of the predictor variables xi. The
unimodal MLR model assumes that the ei are iid from a unimodal distribution
that is not highly skewed. Usually xi,1 ≡ 1.

4) In matrix notation, these n equations become



Y = Xβ + e,

where Y is an n × 1 vector of dependent variables, X is an n × p matrix
of predictors, β is a p × 1 vector of unknown coefficients, and e is an n × 1
vector of unknown errors.
5) The OLS estimators are β̂OLS = (X^T X)^{-1} X^T Y and σ̂² = MSE =
Σ_{i=1}^n ri²/(n − p). Thus σ̂ = √MSE. The vector of predicted or fitted values
Ŷ_OLS = X β̂OLS = HY where the hat matrix H = X(X^T X)^{-1} X^T. The
ith fitted value Ŷi = x_i^T β̂. The ith residual ri = Yi − Ŷi and the vector of
residuals r = Y − Ŷ = (I − H)Y. The least squares regression equation for
a model containing a constant is Ŷ = β̂1 + β̂2 x2 + ⋯ + β̂p xp.
6) Always make the response plot of Ŷ versus Y and residual plot of
Ŷ versus r for any MLR analysis. The response plot is used to visualize the
MLR model, that is, to visualize the conditional distribution of Y |x^T β. If the
unimodal MLR model of 3) is useful, then i) the plotted points in the response
plot should scatter about the identity line with no other pattern, and ii) the
plotted points in the residual plot should scatter about the r = 0 line with no
other pattern. If either i) or ii) is violated, then the unimodal MLR model is
not sustained. In other words, if the plotted points in the residual plot show
some type of dependency, e.g. increasing variance or a curved pattern, then
the multiple linear regression model may be inadequate.
7) Use xf with leverage hf ≤ max hi for valid predictions.
8) If the MLR model contains a constant, then SSTO = SSE + SSR where
SSTO = Σ_{i=1}^n (Yi − Ȳ)², SSR = Σ_{i=1}^n (Ŷi − Ȳ)², and SSE =
Σ_{i=1}^n (Yi − Ŷi)² = Σ_{i=1}^n ri².
9) If the MLR model contains a constant, then R² = [corr(Yi, Ŷi)]² =
SSR/SSTO = 1 − SSE/SSTO.

Source      df   SS   MS   F           p-value
Regression  p-1  SSR  MSR  Fo=MSR/MSE  for Ho:
Residual    n-p  SSE  MSE              β2 = ⋯ = βp = 0

10) Be able to perform the 4 step ANOVA F test of hypotheses.
i) State the hypotheses Ho: β2 = ⋯ = βp = 0 Ha: not Ho.
ii) Find the test statistic Fo = MSR/MSE or obtain it from output.
iii) Find pval, the estimated p–value, from output or use the F table:
pval = P(Fp−1,n−p > Fo).
iv) State whether you reject Ho or fail to reject Ho. If Ho is rejected, conclude
that there is an MLR relationship between Y and the predictors x2, . . . , xp.
If you fail to reject Ho, conclude that there is not an MLR relationship
between Y and the predictors x2, . . . , xp.

11) The large sample 100 (1 − δ)% CI for E(Yf |xf) = x_f^T β = E(Ŷf) is
Ŷf ± tn−p,1−δ/2 se(Ŷf) where P(T ≤ tn−p,δ) = δ if T has a t distribution with
n − p degrees of freedom.
12) The classical 100 (1 − δ)% PI for Yf is Ŷf ± tn−p,1−δ/2 se(pred), but
should be replaced with the asymptotically optimal PI (2.20).
Full model

Source      df           SS      MS      Fo and p-value
Regression  p − 1        SSR     MSR     Fo=MSR/MSE
Residual    dfF = n − p  SSE(F)  MSE(F)  for Ho: β2 = ⋯ = βp = 0

Reduced model

Source      df           SS      MS      Fo and p-value
Regression  q − 1        SSR     MSR     Fo=MSR/MSE
Residual    dfR = n − q  SSE(R)  MSE(R)  for Ho: β2 = ⋯ = βq = 0

13) Be able to perform the 4 step partial F test of hypotheses. i) State
the hypotheses Ho: the reduced model is good Ha: use the full model.
ii) Find the test statistic

FR = [(SSE(R) − SSE(F))/(dfR − dfF)]/MSE(F).

iii) Find the pval = P(FdfR−dfF, dfF > FR). (On exams typically an F table is
used. Here dfR − dfF = p − q = number of parameters set to 0, and dfF = n − p,
while pval is the estimated p–value.)
iv) State whether you reject Ho or fail to reject Ho. Reject Ho if the pval ≤ δ
and conclude that the full model should be used. Otherwise, fail to reject Ho
and conclude that the reduced model is good.

Label     Estimate  Std. Error  t-value                 p-value
Constant  β̂1       se(β̂1)      to,1                    for Ho: β1 = 0
x2        β̂2       se(β̂2)      to,2 = β̂2/se(β̂2)        for Ho: β2 = 0
...
xp        β̂p       se(β̂p)      to,p = β̂p/se(β̂p)        for Ho: βp = 0

14) The 100 (1 − δ)% CI for βk is β̂k ± tn−p,1−δ/2 se(β̂k). If the degrees
of freedom d = n − p ≥ 30, the N(0,1) cutoff z1−δ/2 may be used.
15) The corresponding 4 step t–test of hypotheses has the following steps.
i) State the hypotheses Ho: βk = 0 Ha: βk ≠ 0.
ii) Find the test statistic to,k = β̂k/se(β̂k) or obtain it from output.
iii) Find the pval from output or use the t–table: pval =

2P(tn−p < −|to,k|) = 2P(tn−p > |to,k|).

Use the normal table or the d = Z line in the t–table if the degrees of freedom
d = n − p ≥ 30.
iv) State whether you reject Ho or fail to reject Ho and give a nontechnical
sentence restating your conclusion in terms of the story problem. If Ho is
rejected, then conclude that xk is needed in the MLR model for Y given that
the other predictors are in the model. If you fail to reject Ho, then conclude
that xk is not needed in the MLR model for Y given that the other predictors
are in the model.
16) Given Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ), Σ_{i=1}^n (Xi − X̄)², X̄, and Ȳ, find the
least squares line Ŷ = β̂1 + β̂2 X where

β̂2 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)²

and β̂1 = Ȳ − β̂2 X̄.
17) Given ρ̂, sX, sY, X̄, and Ȳ, find the least squares line Ŷ = β̂1 + β̂2 X
where β̂2 = ρ̂ sY/sX and β̂1 = Ȳ − β̂2 X̄.

2.12 Complements

The Least Squares Central Limit Theorem 2.8 is often a good approximation
if n ≥ 10p and the error distribution has light tails, i.e. the probability of
an outlier is nearly 0 and the tails go to zero at an exponential rate or faster.
For error distributions with heavier tails, much larger samples are needed,
and the assumption that the variance σ² exists is crucial, e.g. Cauchy errors
are not allowed. Norman and Streiner (1986, p. 63) recommend n ≥ 5p.
The classical MLR prediction interval does not work well and should be re-
placed by the Olive (2007) asymptotically optimal PI (2.20). Lei and Wasser-
man (2014) provide an alternative: use the Lei et al. (2013) PI [rL, rU] on the
residuals, then the PI for Yf is

[Ŷf + rL, Ŷf + rU].  (2.28)

Bootstrap PIs need more theory and instead of using B = 1000 samples, use
B = max(1000, n). See Olive (2014, pp. 279–285).
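A minimal sketch of the residual-quantile PI idea in (2.28) (Python; here plain empirical residual quantiles stand in for the Lei et al. (2013) construction, so this illustrates the form of the interval, not their exact method):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0, 10, size=n)
Y = 3.0 + 2.0 * x + rng.normal(size=n)

# Fit SLR by least squares and form the residuals
b2 = np.sum((x - x.mean()) * (Y - Y.mean())) / np.sum((x - x.mean()) ** 2)
b1 = Y.mean() - b2 * x.mean()
resid = Y - (b1 + b2 * x)

# Empirical 95% interval [rL, rU] on the residuals, shifted to Yhat_f
rL, rU = np.quantile(resid, [0.025, 0.975])
xf = 5.0
yhat_f = b1 + b2 * xf
PI = (yhat_f + rL, yhat_f + rU)
print(PI[0] < yhat_f < PI[1])  # True: the interval brackets the fitted value
```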
For the additive error regression model Y = m(x) + e, the response plot
of Ŷ = m̂(x) vs. Y, with the identity line added as a visual aid, is used
like the MLR response plot. We want n ≥ 10 df where df is the degrees of
freedom from fitting m̂. Olive (2013a) provides PIs for this model, including
the location model. These PIs are large sample PIs provided that the sample
quantiles of the residuals are consistent estimators of the population quantiles
of the errors. The response plot and PIs could also be used for methods
described in James et al. (2013) such as ridge regression, lasso, principal
components regression, and partial least squares. See Pelawa Watagoda and
Olive (2017) if n is not large compared to p.
In addition to large sample theory, we want the PIs to work well on a
single data set as future observations are gathered, but only have the training
data (x1, Y1), . . . , (xn, Yn). Much like k-fold cross validation for discriminant
analysis, randomly divide the data set into k = 5 groups of approximately
equal size. Compute the model from 4 groups and use the 5th group as a
validation set: compute the PI for xf = xj for each j in the 5th group.
Repeat so each of the 5 groups is used as a validation set. Compute the
proportion of times Yi was in its PI for i = 1, . . . , n as well as the average
length of the n PIs. We want the proportion near the nominal proportion
and short average length if two or more models or PIs are being considered.
Following Chapter 11, under the regularity conditions, much of the infer-
ence that is valid for the normal MLR model is approximately valid for the
unimodal MLR model when the sample size is large. For example, confidence
intervals for βi are asymptotically correct, as are t tests for βi = 0 (see Li
and Duan (1989, p. 1035)), the MSE is an estimator of σ² by Theorems 2.6
and 2.7, and variable selection procedures perform well (see Chapter 3 and
Olive and Hawkins 2005).
Algorithms for OLS are described in Datta (1995), Dongarra et al. (1979),
and Golub and Van Loan (1989). See Harter (1974a,b, 1975a,b,c, 1976) for
a historical account of multiple linear regression. Draper (2002) provides a
bibliography of more recent references.
Cook and Weisberg (1997, 1999a: ch. 17) call a plot that emphasizes model
agreement a model checking plot. Anscombe (1961) and Anscombe and Tukey
(1963) suggested graphical methods for checking multiple linear regression
and experimental design methods that were the state of the art at the
time.
The rules of thumb given in this chapter for residual plots are not perfect.
Cook (1998, pp. 4–6) gives an example of a residual plot that looks like a
right opening megaphone, but the MLR assumption that was violated was
linearity, not constant variance. Ghosh (1987) gives an example where the
residual plot shows no pattern even though the constant variance assumption
is violated. Searle (1988) shows that residual plots will have parallel lines if
several cases take on each of the possible values of the response variable, e.g.
if the response is a count.
Several authors have suggested using the response plot to visualize the
coefficient of determination R² in multiple linear regression. See, for example,
Chambers et al. (1983, p. 280). Anderson-Sprecher (1994) provides an ex-
cellent discussion about R². Kachigan (1982, pp. 174–177) also gives a good
explanation of R². Also see Kvalseth (1985), and Freedman (1983).
Hoaglin and Welsh (1978) discuss the hat matrix H, and Brooks et al.
(1988) recommend using xf with hf < max hi for valid predictions. Simultaneous

prediction intervals are given by Sadooghi-Alvandi (1990). Olive (2007) sug-


gests three large sample prediction intervals for MLR that are valid under
the unimodal MLR model. Also see Schoemoyer (1992).
Sall (1990) discusses the history of added variable plots while Darlington
(1969) provides an interesting proof that β̂ minimizes the OLS criterion.

2.12.1 Lack of Fit Tests

Label     Estimate  Std. Error  t-value               p-value
Constant  β̂1        se(β̂1)      to,1                  for Ho: β1 = 0
x2        β̂2        se(β̂2)      to,2 = β̂2/se(β̂2)      for Ho: β2 = 0
  ...
xp        β̂p        se(β̂p)      to,p = β̂p/se(β̂p)      for Ho: βp = 0

R Squared: R²
Sigma hat: √MSE
Number of cases: n
Degrees of Freedom: n − p

Source      df    SS    MS    F             p-value
Regression  p−1   SSR   MSR   Fo=MSR/MSE    for Ho: β2 = · · · = βp = 0
Residual    n−p   SSE   MSE

The typical relevant OLS output has the form given above, but occasionally
software also includes output for a lack of fit test as shown below.

Source       df    SS     MS     Fo
Regression   p−1   SSR    MSR    Fo=MSR/MSE
Residual     n−p   SSE    MSE
lack of fit  c−p   SSLF   MSLF   FLF = MSLF/MSPE
pure error   n−c   SSPE   MSPE

The lack of fit test assumes that

Yi = m(xi) + ei                (2.29)

where E(Yi|xi) = m(xi), m is some possibly nonlinear function, and that
the ei are iid N(0, σ²). Notice that the MLR model is the special case with
m(xi) = xiᵀβ. The lack of fit test needs at least one replicate: 2 or more Ys
with the same value of predictors x. Then there are c replicate groups with
nj observations in the jth group. Each group has the vector of predictors xj,
say, and at least one nj > 1. Also, ∑_{j=1}^{c} nj = n. Denote the Ys in the jth
group by Yij, and let the sample mean of the Ys in the jth group be Ȳj.
Then

    (1/(nj − 1)) ∑_{i=1}^{nj} (Yij − Ȳj)²

is an estimator of σ² for each group with nj > 1. Let


    SSPE = ∑_{j=1}^{c} ∑_{i=1}^{nj} (Yij − Ȳj)².

Then MSPE = SSPE/(n − c) is an unbiased estimator of σ² when
model (2.29) holds, regardless of the form of m. The PE in SSPE stands
for pure error.
Now SSLF = SSE − SSPE = ∑_{j=1}^{c} nj (Ȳj − Ŷj)². Notice that Ȳj is an
unbiased estimator of m(xj) while Ŷj is an estimator of m if the MLR model
is appropriate: m(xj) = xjᵀβ. Hence SSLF and MSLF can be very large if
the MLR model is not appropriate.
The 4 step lack of fit test is i) Ho: no evidence of MLR lack of fit, HA:
there is lack of fit for the MLR model.
ii) FLF = MSLF/MSPE.
iii) The pval = P(Fc−p,n−c > FLF).
iv) Reject Ho if pval ≤ δ and state the HA claim that there is lack of fit.
Otherwise, fail to reject Ho and state that there is not enough evidence to
conclude that there is MLR lack of fit.
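The lack of fit quantities are easy to compute from the replicate groups. The following Python sketch (the function name and toy data are illustrative, not from the text) computes FLF for the simple linear regression case, where p = 2:

```python
from collections import defaultdict

def lack_of_fit_F(x, y, p=2):
    """Lack of fit F statistic for simple linear regression (p = 2).

    Requires replicates: at least one x value must appear more than once.
    """
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    # OLS slope and intercept for the SLR model
    b2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b1 = ybar - b2 * xbar
    sse = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
    # group the responses by predictor value to get the c replicate groups
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    c = len(groups)
    sspe = sum(sum((yij - sum(ys) / len(ys)) ** 2 for yij in ys)
               for ys in groups.values())
    sslf = sse - sspe            # SSLF = SSE - SSPE
    mslf, mspe = sslf / (c - p), sspe / (n - c)
    return mslf / mspe           # compare with an F(c - p, n - c) distribution

# toy data with replicates at x = 1, 2, 3 and a quadratic trend,
# so the SLR model should show lack of fit
x = [1, 1, 2, 2, 3, 3]
y = [1.1, 0.9, 4.2, 3.8, 8.9, 9.1]
FLF = lack_of_fit_F(x, y)
```

Here the group means lie on a curve, so FLF is large and the test rejects the SLR model.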

Although the lack of fit test seems clever, examining the response plot and
residual plot is a much more effective method for examining whether or not
the MLR model fits the data well provided that n ≥ 10p. A graphical version
of the lack of fit test would compute the Ȳj and see whether they scatter
about the identity line in the response plot. When there are no replicates,
the range of Ŷ could be divided into several narrow nonoverlapping intervals
called slices. Then the mean Ȳj of each slice could be computed and a step
function with step height Ȳj at the jth slice could be plotted. If the step
function follows the identity line, then there is no evidence of lack of fit.
However, it is easier to check whether the Yi are scattered about the identity
line. Examining the residual plot is useful because it magnifies deviations
from the identity line that may be difficult to see until the linear trend is
removed. The lack of fit test may be sensitive to the assumption that the
errors are iid N(0, σ²).
When Y ⊥ x | xᵀβ, then the response plot of the estimated sufficient predictor
(ESP) xᵀβ̂ versus Y is used to visualize the conditional distribution of
Y | xᵀβ, and will often greatly outperform the corresponding lack of fit test.
When the response plot can be combined with a good lack of fit plot such as
a residual plot, using a one number summary of lack of fit such as the test
statistic FLF makes little sense.
Nevertheless, the literature for lack of fit tests for various statistical methods
is enormous. See Joglekar et al. (1989), Pena and Slate (2006), and Su
and Yang (2006) for references.

For the following homework problems, Cody and Smith (2006) is useful
for SAS, while Cook and Weisberg (1999a) is useful for Arc. Becker et al.
(1988) and Crawley (2013) are useful for R.

2.13 Problems

Problems with an asterisk * are especially important.

Output for Problem 2.1


Full Model Summary Analysis of Variance Table
Source df SS MS F p-value
Regression 6 265784. 44297.4 172.14 0.0000
Residual 67 17240.9 257.327

Reduced Model Summary Analysis of Variance Table


Source df SS MS F p-value
Regression 1 264621. 264621. 1035.26 0.0000
Residual 72 18403.8 255.608

2.1. Assume that the response variable Y is height, and the explanatory
variables are X2 = sternal height, X3 = cephalic index, X4 = finger to ground,
X5 = head length, X6 = nasal height, and X7 = bigonal breadth. Suppose that
the full model uses all 6 predictors plus a constant (= X1 ) while the reduced
model uses the constant and sternal height. Test whether the reduced model
can be used instead of the full model using the output above. The data set
had 74 cases.
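As a check on the arithmetic (not part of the original problem statement), the partial F statistic can be computed directly from the two ANOVA tables:

```python
# residual SS and df read from the Problem 2.1 output tables
sse_full, df_full = 17240.9, 67
sse_red, df_red = 18403.8, 72

# partial F = [(SSE(R) - SSE(F)) / (df_R - df_F)] / MSE(F)
mse_full = sse_full / df_full
F = ((sse_red - sse_full) / (df_red - df_full)) / mse_full
# F is roughly 0.9, far below typical F(5, 67) critical values,
# so the reduced model appears adequate
```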

Output for Problem 2.2


Full Model Summary Analysis of Variance Table
Source df SS MS F p-value
Regression 9 16771.7 1863.52 1479148.9 0.0000
Residual 235 0.29607 0.00126

Reduced Model Summary Analysis of Variance Table


Source df SS MS F p-value
Regression 2 16771.7 8385.85 6734072.0 0.0000
Residual 242 0.301359 0.0012453

Coefficient Estimates, Response = y, Terms = (x2 x2^2)


Label Estimate Std. Error t-value p-value
Constant 958.470 5.88584 162.843 0.0000
x2 -1335.39 11.1656 -119.599 0.0000
x2^2 421.881 5.29434 79.685 0.0000

2.2. The above output, starting on the previous page, comes from the
Johnson (1996) STATLIB data set bodyfat after several outliers are deleted.
It is believed that Y = β1 + β2X2 + β3X2² + e where Y is the person's bodyfat
and X2 is the person's density. Measurements on 245 people were taken. In
addition to X2 and X2², 7 additional measurements X4, . . . , X10 were taken.
Both the full and reduced models contain a constant X1 ≡ 1.

a) Predict Y if X2 = 1.04. (Use the reduced model Y = β1 + β2X2 +
β3X2² + e.)

b) Test whether the reduced model can be used instead of the full model.
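For part a), the prediction is a direct plug-in with the reduced model coefficients from the output (a computational sketch, not part of the original text):

```python
# coefficients from the Problem 2.2 reduced model output
b1, b2, b3 = 958.470, -1335.39, 421.881
x2 = 1.04
# fitted quadratic in density: Yhat = b1 + b2*X2 + b3*X2^2
yhat = b1 + b2 * x2 + b3 * x2 ** 2  # roughly 26 percent bodyfat
```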

Output for Problem 2.3


Label Estimate Std. Error t-value p-value
Constant -5.07459 1.85124 -2.741 0.0076
log[H] 1.12399 0.498937 2.253 0.0270
log[S] 0.573167 0.116455 4.922 0.0000

R Squared: 0.895655 Sigma hat: 0.223658, n = 82


(log[H] log[S]) (4 5)
Prediction = 2.2872, s(pred) = 0.467664,
Estimated population mean value = 2.287, s = 0.410715

2.3. The above output was produced from the file mussels.lsp in Arc. See
Cook and Weisberg (1999a). Let Y = log(M) where M is the muscle mass
of a mussel. Let X1 ≡ 1, X2 = log(H) where H is the height of the shell,
and let X3 = log(S) where S is the shell mass. Suppose that it is desired to
predict Yf if log(H) = 4 and log(S) = 5, so that xfᵀ = (1, 4, 5). Assume that
se(Ŷf) = 0.410715 and that se(pred) = 0.467664.

a) If xfᵀ = (1, 4, 5) find a 99% confidence interval for E(Yf).

b) If xfᵀ = (1, 4, 5) find a 99% prediction interval for Yf.
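Both intervals have the form estimate ± t(1−δ/2, n−p) × se. A sketch of the computation follows; the critical value 2.64 is an approximation (from t tables) to the 0.995 quantile of the t distribution with n − p = 82 − 3 = 79 degrees of freedom, and is not from the original text:

```python
# quantities from the Problem 2.3 output
yhat = 2.2872
se_mean, se_pred = 0.410715, 0.467664
t_crit = 2.64  # approximate t quantile for a 99% interval with 79 df

ci = (yhat - t_crit * se_mean, yhat + t_crit * se_mean)  # CI for E(Yf)
pi = (yhat - t_crit * se_pred, yhat + t_crit * se_pred)  # PI for Yf
```

The PI is wider than the CI since se(pred) > se(Ŷf).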


Problem 2.4 Output, Coef. Estimates Response = height
Label Estimate Std. Error t-value p-value
Constant 227.351 65.1732 3.488 0.0008
sternal height 0.955973 0.0515390 18.549 0.0000
finger to ground 0.197429 0.0889004 2.221 0.0295

R Squared: 0.879324 Sigma hat: 22.0731

Summary Analysis of Variance Table


Source df SS MS F p-value
Regression 2 259167. 129583. 265.96 0.0000
Residual 73 35567.2 487.222

2.4. The above output, starting on the previous page, is from the multiple
linear regression of the response Y = height on the two nontrivial predictors
sternal height = height at shoulder, and finger to ground = distance from the
tip of a person's middle finger to the ground.

a) Consider the plot with Yi on the vertical axis and the least squares
fitted values Ŷi on the horizontal axis. Sketch how this plot should look if the
multiple linear regression model is appropriate.

b) Sketch how the residual plot should look if the residuals ri are on the
vertical axis and the fitted values Ŷi are on the horizontal axis.

c) From the output, are sternal height and finger to ground useful for
predicting height? (Perform the ANOVA F test.)

2.5. Suppose that it is desired to predict the weight of the brain (in
grams) from the cephalic index measurement. The output below uses data
from 267 people.
predictor coef Std. Error t-value p-value
Constant 865.001 274.252 3.154 0.0018
cephalic 5.05961 3.48212 1.453 0.1474
Do a 4 step test for Ho: β2 = 0.

2.6. Suppose that the scatterplot of X versus Y is strongly curved rather


than ellipsoidal. Should you use simple linear regression to predict Y from
X? Explain.

2.7. Suppose that the 95% confidence interval for β2 is [−17.457, −15.832].
In the simple linear regression model, is X a useful linear predictor for Y? If
your answer is no, could X be a useful predictor for Y? Explain.

2.8. Suppose it is desired to predict the yearly return from the stock
market from the return in January. Assume that the correlation is 0.496.
Using the table below, find the least squares line Ŷ = β̂1 + β̂2X.

variable mean X or Y standard deviation s


January return 1.75 5.36
yearly return 9.07 15.35
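The slope and intercept follow from the standard identities β̂2 = corr × sy/sx and β̂1 = Ȳ − β̂2X̄ (a computational sketch, not part of the original problem):

```python
# summary statistics from the Problem 2.8 table
r = 0.496
xbar, sx = 1.75, 5.36    # January return
ybar, sy = 9.07, 15.35   # yearly return

b2 = r * sy / sx         # slope, roughly 1.42
b1 = ybar - b2 * xbar    # intercept, roughly 6.58
```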

2.9. Suppose that ∑(Xi − X̄)(Yi − Ȳ) = 70690.0,
∑(Xi − X̄)² = 19800.0, X̄ = 70.0, and Ȳ = 312.28.

a) Find the least squares slope β̂2.

b) Find the least squares intercept β̂1.

c) Predict Y if X = 80.
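These summary statistics give the slope, intercept, and prediction directly (a computational sketch, not part of the original problem):

```python
sxy = 70690.0   # sum of (Xi - xbar)(Yi - ybar)
sxx = 19800.0   # sum of (Xi - xbar)^2
xbar, ybar = 70.0, 312.28

b2 = sxy / sxx            # least squares slope
b1 = ybar - b2 * xbar     # least squares intercept
yhat_80 = b1 + b2 * 80.0  # prediction at X = 80, roughly 348
```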

xi   yi   xi − x̄   yi − ȳ   (xi − x̄)(yi − ȳ)   (xi − x̄)²


38 41

56 63

59 70

64 72

74 84

2.10. In the above table, xi is the length of the femur and yi is the length
of the humerus taken from five dinosaur fossils (Archaeopteryx) that preserved
both bones. See Moore (2000, p. 99).
a) Complete the table and find the least squares estimators β̂1 and β̂2.

b) Predict the humerus length if the femur length is 60.
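The table can be completed in a few lines of code (a sketch; the numbers are the five fossil measurements from the problem):

```python
x = [38, 56, 59, 64, 74]  # femur lengths
y = [41, 63, 70, 72, 84]  # humerus lengths
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
b2 = sxy / sxx            # slope, roughly 1.197
b1 = ybar - b2 * xbar     # intercept
yhat_60 = b1 + b2 * 60    # predicted humerus length for femur length 60
```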

2.11. Suppose that the regression model is Yi = 7 + βXi + ei for i =
1, . . . , n where the ei are iid N(0, σ²) random variables. The least squares
criterion is Q(β) = ∑_{i=1}^{n} (Yi − 7 − βXi)².

a) What is E(Yi)?

b) Find the least squares estimator β̂ of β by setting the first derivative
(d/dβ)Q(β) equal to zero.

c) Show that your β̂ is the global minimizer of the least squares criterion
Q by showing that the second derivative (d²/dβ²)Q(β) > 0 for all values of β.
2.12. The location model is Yi = μ + ei for i = 1, . . . , n where the ei are iid
with mean E(ei) = 0 and constant variance VAR(ei) = σ². The least squares
estimator μ̂ of μ minimizes the least squares criterion Q(η) = ∑_{i=1}^{n} (Yi − η)².
To find the least squares estimator, perform the following steps.

a) Find the derivative (d/dη)Q, set the derivative equal to zero and solve
for η. Call the solution μ̂.

b) To show that the solution was indeed the global minimizer of Q, show
that (d²/dη²)Q > 0 for all real η. (Then the solution μ̂ is a local min and Q is
convex, so μ̂ is the global min.)
2.13. The normal error model for simple linear regression through the
origin is
Yi = βXi + ei
for i = 1, . . . , n where e1, . . . , en are iid N(0, σ²) random variables.

a) Show that the least squares estimator for β is

    β̂ = ∑_{i=1}^{n} Xi Yi / ∑_{i=1}^{n} Xi².

b) Find E(β̂).

c) Find VAR(β̂).
(Hint: Note that β̂ = ∑_{i=1}^{n} ki Yi where the ki depend on the Xi which are
treated as constants.)
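The estimator in part a) is easy to check numerically: when the data satisfy Yi = βXi exactly (no error), the formula recovers β (a sketch, not part of the original problem):

```python
beta = 3.0
x = [1.0, 2.0, 3.0, 4.0]
y = [beta * xi for xi in x]  # exact no-error data through the origin

# least squares estimator for regression through the origin
beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
```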
2.14. Suppose that the regression model is Yi = 10 + 2Xi2 + β3Xi3 + ei for
i = 1, . . . , n where the ei are iid N(0, σ²) random variables. The least squares
criterion is Q(β3) = ∑_{i=1}^{n} (Yi − 10 − 2Xi2 − β3Xi3)². Find the least squares
estimator β̂3 of β3 by setting the first derivative (d/dβ3)Q(β3) equal to zero.
Show that your β̂3 is the global minimizer of the least squares criterion Q by
showing that the second derivative (d²/dβ3²)Q(β3) > 0 for all values of β3.
Minitab Problems
Double click means press the rightmost mouse button twice in rapid
succession. Drag means hold the mouse button down. This technique is
used to select menu options.
After your computer is on, get into Minitab, often by searching programs
and then double clicking on the icon marked Student Minitab.
i) In a few seconds, the Minitab session and worksheet windows fill the screen.
At the top of the screen there is a menu. The upper left corner has the menu
option File. Move your cursor to File and drag down the option Open
Worksheet. A window will appear. Double click on the icon Student. This
will display a large number of data sets.

ii) In the middle of the screen there is a scroll bar, a gray line with left and
right arrow keys. Use the right arrow key to make the data file Prof.mtw
appear. Double click on Prof.mtw. A window will appear. Click on OK.
iii) The worksheet window will now be filled with data. The top of the screen
has a menu. Go to Stat and drag down Regression. Another window will
appear: drag down Regression (write this as Stat>Regression>Regression).
iv) A window will appear with variables to the left and the response variable
and predictors (explanatory variables) to the right. Double click on instrucrs
to make it the response. Double click on manner to make it the (predictor)
explanatory variable. Then click on OK.
v) The required output will appear in the session window. You can view the
output by using the vertical scroll bar on the right of the screen.
vi) Copy and paste the output into Word, or to print your single page of
output, go to File, and drag down the option Print Session Window. A
window will appear. Click on ok. Then get your output from the printer.

Use the F3 key to clear entries from a dialog window if you make a mistake
or want a new plot.

To get out of Minitab, move your cursor to the x in the upper right
corner of the screen. When asked whether to save changes, click on no.

2.15. (Minitab problem.) See the above instructions for using Minitab.
Get the data set prof.mtw. Assign the response variable to be instrucr (the
instructor rating from course evaluations) and the explanatory variable (pre-
dictor) to be manner (the manner of the instructor). Run a regression on
these variables.

a) Place the computer output into Word.

b) Write the regression equation.

c) Predict instrucr if manner = 2.47.

d) To get residual and response plots you need to store the residuals and
fitted values. Use the menu commands Stat>Regression>Regression to get
the regression window. Put instrucr in the Response and manner in the
Predictors boxes. Then click on Storage. From the resulting window click
on Fits and Residuals. Then click on OK twice.
To get a response plot, use the commands Graph>Plot, (double click)
place instrucr in the Y box, and Fits1 in the X box. Then click on OK. Print
the plot by clicking on the graph and then clicking on the printer icon.
e) To make a residual plot, use the menu commands Graph>Plot to get
a window. Place Resi1 in the Y box and Fits1 in the X box. Then click
on OK. Print the plot by clicking on the graph and then clicking on the
printer icon.

2.16. a) Enter the following data on the Minitab worksheet:


x y
30 73
20 50
60 128
80 170
40 87
50 108
60 135
30 60
70 148
60 132
To enter the data click on the C1 column header and enter x. Then click on
the C2 header and enter y. Then enter the data. Or copy the data from
Problem 2.17 obtained from (http://lagrange.math.siu.edu/Olive/lreghw.txt).
Then in Minitab, use the menu commands Edit>Paste Cells and click on
OK. Obtain the regression output from Minitab with the menu commands
Stat>Regression>Regression.
b) Place the output into Word.
c) Write down the least squares equation.
To save your output on your flash drive (J, say), use the Word menu
commands File > Save as. In the Save in box select Removable Disk
(J:), and in the File name box enter HW2d16.doc. To get a Word printout,
click on the printer icon or use the menu commands File>Print.
d) To get residual and response plots you need to store the residuals and
fitted values. Use the menu commands Stat>Regression>Regression to get
the regression window. Put Y in the Response and X in the Predictors
boxes. Then click on Storage. From the resulting window click on Fits and
Residuals. Then click on OK twice.
To make a response plot, use the menu commands Graph>Plot to get
a window. Place Y in the Y box and Fits1 in the X box. Then click on
OK. Print the plot by clicking on the graph and then clicking on the printer
icon.
e) To make a residual plot of the fitted values versus the residuals, use
the menu commands Graph>Plot to get a window. Place Resi1 in the Y
box and Fits1 in the X box. Then click on OK. Print the plot by clicking
on the graph and then clicking on the printer icon.

f) To save your Minitab data on your flash drive, use the menu commands
File>Save Current Worksheet as. In the resulting dialog window, the top
box says Save in and there is an arrow icon to the right of the top box.
Click several times on the arrow icon until the Save in box reads My
computer, then click on Removable Disk (J:). In the File name box, enter
H2d16.mtw. Then click on OK.
SAS Problems
Copy and paste the SAS programs for problems 2.17 and 2.18
from (http://lagrange.math.siu.edu/Olive/lreghw.txt), or enter
the SAS program in Notepad or Word.
SAS is a statistical software package widely used in industry. You will need
a flash drive. Referring to the program for Problem 2.17, the semicolon ;
is used to end SAS commands and the options ls = 70; command makes
the output readable. (An * can be used to insert comments into the SAS
program. Try putting an * before the options command and see what it does
to the output.) The next step is to get the data into SAS. The command data
wcdata; gives the name wcdata to the data set. The command input x
y; says the first entry is variable x and the 2nd variable y. The command
cards; means that the data is entered below. Then the data is entered
and the isolated semicolon indicates that the last case has been entered. The
command proc print; prints out the data. The command proc corr; will
give the correlation between x and y. The commands proc plot; plot y*x;
make a scatterplot of x and y. The commands proc reg; model y=x; output
out = a p = pred r = resid; tell SAS to perform a simple linear regression
with y as the response variable. The output data set is called a and contains
the fitted values and residuals. The command proc plot data = a; tells SAS
to make plots from data set a rather than data set wcdata. The command
plot resid*(pred x); will make a residual plot of the fitted values versus the
residuals and a residual plot of x versus the residuals. The next plot command
makes a response plot.
To use SAS on windows (PC), use the following steps.
i) Get into SAS, often by double clicking on an icon for programs such as a
Math Progs icon and then double clicking on a SAS icon. If your computer
does not have SAS, go to another computer.

ii) A window should appear with 3 icons. Double click on The SAS System
for . . . .

iii) Like Minitab, a window with a split screen will open. The top screen
says Log-(Untitled) while the bottom screen says Editor-Untitled1. Press the
spacebar and an asterisk appears: Editor-Untitled1*.

2.17. a) Copy and paste the program for this problem from
(http://lagrange.math.siu.edu/Olive/lreghw.txt), or enter the SAS
program in Notepad or Word. The ls stands for linesize so l is a lowercase L,
not the number one.
When you are done entering the program, you may want to save the
program as h2d17.sas on your flash drive (J: drive, say). (On the top menu of
the editor, use the commands File > Save as. A window will appear. Use
the upper right arrow to locate Removable Disk (J:) and then type the file
name in the bottom box. Click on OK.)
b) Get back into SAS, and from the top menu, use the File> Open
command. A window will open. Use the arrow in the upper right corner
of the window to navigate to Removable Disk (J:). (As you click on the
arrow, you should see My Documents, C: etc, then Removable Disk (J:).)
Double click on h2d17.sas. (Alternatively cut and paste the program into the
SAS editor window.) To execute the program, use the top menu commands
Run>Submit. An output window will appear if successful.

If you were not successful, look at the log window for hints on errors.
A single typo can cause failure. Reopen your file in Word or Notepad and
make corrections. Occasionally you cannot find your error. Then find your
instructor or wait a few hours and reenter the program.

c) To copy and paste relevant output into Word or Notepad, click on the
output window and use the top menu commands Edit>Select All and then
the menu commands Edit>Copy.
In Notepad use the commands Edit>Paste. Then use the mouse to high-
light the relevant output. Then use the commands Edit>Copy.
Finally, in Word, use the command Paste. You can also cut output from
Word and paste it into Notepad.

You may want to save your SAS output as the file HW2d17.doc on your
flash drive.

d) To save your output on your flash drive, use the Word menu commands
File > Save as. In the Save in box select Removable Disk (J:) and in
the File name box enter HW2d17.doc. To get a Word printout, click on the
printer icon or use the menu commands File>Print.
Save the output giving the least squares coefficients in Word.
e) Predict Y if X = 40.
f) What is the residual when X = 40?

2.18. This problem shows how to use SAS for MLR. The data are from
Kutner et al. (2005, problem 6.5). The response is brand liking, a
measurement for whether the consumer liked the brand. The variable X1 is moisture
content and the variable X2 is sweetness. Copy and paste the program for
this problem from (http://lagrange.math.siu.edu/Olive/lreghw.txt).
a) Execute the SAS program and copy the output file into Notepad. Scroll
down the output that is now in Notepad until you find the regression
coefficients and ANOVA table. Then cut and paste this output into Word.
b) Do the 4 step ANOVA F test.
You should scroll through your SAS output to see how it made the
response plot and various residual plots, but cutting and pasting these plots is
tedious. So we will use Minitab to get these plots. Find the program for this
problem from (http://lagrange.math.siu.edu/Olive/lreghw.txt). Then
copy and paste the numbers (between cards; and the semicolon ;) into
Minitab. Use the mouse commands Edit>Paste Cells. This should enter
the data in the Worksheet (bottom part of Minitab). Under C1 enter Y and
under C2 enter X1 under C3 enter X2. Use the menu commands
Stat>Regression>Regression to get a dialog window. Enter Y as the
response variable and X1 and X2 as the predictor variables. Click on Storage
then on Fits, Residuals, and OK OK.
c) To make a response plot, enter the menu commands Graph>Plot
and place Y in the Ybox and FITS1 in the Xbox. Click on OK. Then
use the commands Edit>Copy Graph to copy the plot. Include the plot in
Word with the commands Edit> Paste. If these commands fail, click on
the graph and then click on the printer icon.
d) Based on the response plot, does a linear model seem reasonable?
e) To make a residual plot, enter the menu commands Graph>Plot and
place RESI 1 in the Ybox and FITS1 in the Xbox. Click on OK. Then
use the commands Edit>Copy Graph to copy the plot. Include the plot in
Word with the commands Edit> Paste. If these commands fail, click on
the graph and then click on the printer icon.
f) Based on the residual plot does a linear model seem reasonable?
Problems using ARC

To quit Arc, move the cursor to the x in the upper right corner and click.
Warning: Some of the following problems use data from the
book's webpage (http://lagrange.math.siu.edu/Olive/lregbk.htm).
Save the data files on a flash drive G, say. Get in Arc and use the menu
commands File > Load and a window with a Look in box will appear. Click
on the black triangle and then on Removable Disk (G:). Then click twice on
the data set name.

2.19*. (Scatterplot in Arc.) Get cbrain.lsp as described above. (Activate


the cbrain.lsp dataset with the menu commands File > Load > Removable
Disk (G:) > cbrain.lsp.) Scroll up the screen to read the data description.

a) Make a plot of age versus brain weight brnweight. The commands


Graph&Fit > Plot of will bring down a menu. Put age in the H box
and brnweight in the V box. Put sex in the Mark by box. Click OK. Make
the lowess bar on the plot read .1. Open Word.

In Arc, use the menu commands Edit > Copy. In Word, use the menu
command Paste. This should copy the graph into the Word document.

b) For a given age, which gender tends to have larger brains?

c) At what age does the brain weight appear to be decreasing?



2.20. (SLR in Arc.) Activate cbrain.lsp as in Problem 2.19. Brain weight


and the cube root of size should be linearly related. To add the cube root of
size to the data set, use the menu commands cbrain > Transform. From
the window, select size and enter 1/3 in the p: box. Then click OK. Get some
output with commands Graph&Fit > Fit linear LS. In the dialog window,
put brnweight in Response, and (size)1/3 in terms.

a) Cut and paste the output (from Coefficient Estimates to Sigma hat)
into Word. Write down the least squares equation Ŷ = β̂1 + β̂2x.

b) If (size)1/3 = 15, what is the estimated brnweight?

c) Make a residual plot of the fitted values versus the residuals. Use
the commands Graph&Fit > Plot of and put L1:Fit-values in H and
L1:Residuals in V. Put sex in the Mark by box. Move the OLS bar to 1.
Put the plot into Word. Does the plot look ellipsoidal with zero mean?

d) Make a response plot of the fitted values versus Y = brnweight. Use


the commands Graph&Fit > Plot of and put L1:Fit-values in H and
brnweight in V. Put sex in Mark by. Move the OLS bar to 1. Put the plot
into Word. Does the plot look linear?

2.21. In Arc enter the menu commands File>Load>Data and open the
file mussels.lsp. This data set is from Cook and Weisberg (1999a).
The response variable Y is the mussel muscle mass M, and the explanatory
variables are X2 = S = shell mass, X3 = H = shell height, X4 = L = shell
length, and X5 = W = shell width.
Enter the menu commands Graph&Fit>Fit linear LS and fit the model:
enter S, H, L, W in the Terms/Predictors box, M in the Response box
and click on OK.

a) To get a response plot, enter the menu commands


Graph&Fit>Plot of and place L1:Fit-Values in the Hbox and M in the
Vbox. Copy the plot into Word.

b) Based on the response plot, does a linear model seem reasonable?

c) To get a residual plot, enter the menu commands Graph&Fit>Plot of


and place L1:Fit-Values in the Hbox and L1:Residuals in the Vbox. Copy
the plot into Word.

d) Based on the residual plot, what MLR assumption seems to be violated?

e) Include the regression output in Word.



f) Ignoring the fact that an important MLR assumption seems to have
been violated, do any of the predictors seem to be needed given that the other
predictors are in the model?

g) Ignoring the fact that an important MLR assumption seems to have
been violated, perform the ANOVA F test.

g) Ignoring the fact that an important MLR assumption seems to have


been violated, perform the ANOVA F test.

2.22. Get cyp.lsp as described above Problem 2.19. You can open the
file in Notepad and then save it on a flash drive G, say, using the Notepad
menu commands File>Save As and clicking the top checklist then click
Removable Disk (G:). You could also save the file on the desktop, load it
in Arc from the desktop, and then delete the file (sending it to the Recycle
Bin).
a) In Arc enter the menu commands File>Load>Removable Disk (G:)
and open the file cyp.lsp. This data set consists of various measurements
taken on men from Cyprus around 1920. Let the response Y = height and
X = cephalic index = 100(head breadth)/(head length). Use Arc to get the
least squares output and include the relevant output in Word.

b) Intuitively, the cephalic index should not be a good predictor for a
person's height. Perform a 4 step test of hypotheses with Ho: β2 = 0.

2.23. a) In Arc open the file cyp.lsp (obtained as in Problem 2.22).
The response variable Y is height, and the explanatory variables are a
constant, X2 = sternal height (probably height at shoulder), and X3 = finger
to ground.
Enter the menu commands Graph&Fit>Fit linear LS and fit the model:
enter sternal height and finger to ground in the Terms/Predictors box,
height in the Response box and click on OK.
Include the output in Word. Your output should certainly include the lines
from Response = height to the ANOVA table.

b) Predict Y if X2 = 1400 and X3 = 650.

c) Perform a 4 step ANOVA F test of the hypotheses with
Ho: β2 = β3 = 0.

d) Find a 99% CI for β2.

e) Find a 99% CI for β3.

f) Perform a 4 step test for Ho: β2 = 0.

g) Perform a 4 step test for Ho: β3 = 0.

h) What happens to the conclusion in g) if δ = 0.01?



i) The Arc menu L1 should have been created for the regression. Use
the menu commands L1>Prediction to open a dialog window. Enter 1400
650 in the box and click on OK. Include the resulting output in Word.

j) Let Xf,2 = 1400 and Xf,3 = 650 and use the output from i) to find a
95% CI for E(Yf). Use the last line of the output, that is, se = S(Ŷf).

k) Use the output from i) to find a 95% PI for Yf. Now se(pred) = s(pred).

2.24. In Arc enter the menu commands File>Load>Removable Disk
(G:) and open the file cyp.lsp (obtained as in Problem 2.22).
The response variable Y is height, and the explanatory variables are
X2 = sternal height (probably height at shoulder), and X3 = finger to ground.

Enter the menu commands Graph&Fit>Fit linear LS and fit the model:
enter sternal height and finger to ground in the Terms/Predictors box,
height in the Response box and click on OK.

a) To get a response plot, enter the menu commands


Graph&Fit>Plot of and place L1:Fit-Values in the Hbox and height in
the Vbox. Copy the plot into Word.

b) Based on the response plot, does a linear model seem reasonable?

c) To get a residual plot, enter the menu commands Graph&Fit>Plot of


and place L1:Fit-Values in the Hbox and L1:Residuals in the Vbox. Copy
the plot into Word.

d) Based on the residual plot, does a linear model seem reasonable?

2.25. In Arc enter the menu commands File>Load>Removable Disk
(G:) and open the file cyp.lsp (obtained as in Problem 2.22).
The response variable Y is height, and the explanatory variables are X2 =
sternal height, X3 = finger to ground, X4 = bigonal breadth, X5 = cephalic
index, X6 = head length, and X7 = nasal height. Enter the menu commands
Graph&Fit>Fit linear LS and fit the model: enter the 6 predictors (in
order: X2 1st and X7 last) in the Terms/Predictors box, height in the
Response box and click on OK. This gives the full model. For the reduced
model, only use predictors 2 and 3.

a) Include the ANOVA tables for the full and reduced models in Word.

b) Use the menu commands Graph&Fit>Plot of. . . to get a dialog
window. Place L2:Fit-Values in the Hbox and L1:Fit-Values in the Vbox. Place
the resulting plot in Word.

c) Use the menu commands Graph&Fit>Plot of. . . to get a dialog
window. Place L2:Residuals in the Hbox and L1:Residuals in the Vbox. Place
the resulting plot in Word.

d) Both plots should cluster tightly about the identity line if the reduced
model is about as good as the full model. Is the reduced model good?

e) Perform the 4 step partial F test (of Ho: the reduced model is good)
using the 2 ANOVA tables from part a).
2.26. a) Activate the cbrain.lsp data set in ARC. Fit least squares with
age, sex, size1/3 , and headht as terms and brnweight as the response. See
Problem 2.20. Assume that the multiple linear regression model is appro-
priate. (This may be a reasonable assumption, 5 infants appear as outliers
but the data set has hardly any cases that are babies. If age was uniformly
represented, the babies might not be outliers anymore.) Assuming that ARC
makes the menu L1 for this regression, select AVP-All 2D. A window will
appear. Move the OLS slider bar to 1 and click on the zero line box. The
window will show the added variable plots for age, sex, size1/3 , and headht
as you move along the slider bar that is below case deletions. Include all 4
added variable plots in Word.
b) What information do the 4 plots give? For example, which variables do
not seem to be needed?
(If it is clear that the zero and OLS lines intersect at the origin, then the
variable is probably needed, and the point cloud should be tilted away from
the zero line. If it is dicult to see where the two lines intersect since they
nearly coincide near the origin, then the variable may not be needed, and the
point cloud may not tilt away from the zero line.)
R Problems

Use the command source("G:/lregpack.txt") to download the functions
and the command source("G:/lregdata.txt") to download the data.
See Preface or Section 14.1. Typing the name of the lregpack function,
e.g. piplot, will display the code for the function. Use the args command,
e.g. args(pisim), to display the needed arguments for the function. For some
of the following problems, the R commands can be copied and pasted from
(http://lagrange.math.siu.edu/Olive/lreghw.txt) into R.
2.27. a) Download the data into R as described above.
For the Buxton (1920) data suppose that the response Y = height and the
predictors were a constant, head length, nasal height, bigonal breadth, and
cephalic index. There are 87 cases.
Type the following commands
zbux <- cbind(buxx,buxy)
zbux <- as.data.frame(zbux)
zfull <- lm(buxy~len+nasal+bigonal+cephalic,data=zbux)
zred <- lm(buxy~len+nasal,data=zbux)
anova(zred,zfull)
b) Include the output in Word: press the Ctrl and c keys at the same
time. Then use the menu commands Paste in Word (or copy and paste the
output: hit the Ctrl and v keys at the same time).
c) Use the output to perform the partial F test where the full model is
described in a) and the reduced model uses a constant, head length, and
nasal height. The output from the anova(zred,zfull) command produces
the correct partial F statistic.
d) Use the following commands to make the response plot for the reduced
model. Include the plot in Word.

plot(zred$fit,buxy)
abline(0,1)

e) Use the following command to make the residual plot for the reduced
model. Include the plot in Word.

plot(zred$fit,zred$resid)

f) The plots look bad because of 5 massive outliers. The following commands
remove the outliers. Include the output in Word.

zbux <- zbux[-c(60,61,62,63,64,65),]
zfull <- lm(buxy~len+nasal+bigonal+cephalic,data=zbux)
zred <- lm(buxy~len+nasal,data=zbux)
anova(zred,zfull)

g) Redo the partial F test.

h) Use the following commands to make the response plot for the reduced
model without the outliers. Include the plot in Word.

plot(zred$fit,zbux[,5])
abline(0,1)

i) Use the following command to make the residual plot for the reduced
model without the outliers. Include the plot in Word.

plot(zred$fit,zred$resid)

j) Do the plots look ok?
2.28. Get the R commands for this problem. The data is such that Y =
2 + x2 + x3 + x4 + e where the zero mean errors are iid [exponential(2) -
2]. Hence the residual and response plots should show high skew. Note that
β = (2, 1, 1, 1)^T. The R code uses 3 nontrivial predictors and a constant, and
the sample size n = 1000.
a) Copy and paste the commands for part a) of this problem into R. Include
the response plot in Word. Is the lowess curve fairly close to the identity line?
b) Copy and paste the commands for part b) of this problem into R.
Include the residual plot in Word: press the Ctrl and c keys at the same time.
Then use the menu command Paste in Word. Is the lowess curve fairly
close to the r = 0 line?
c) The output out$coef gives β̂. Write down β̂. Is β̂ close to β?
2.29. a) Download the R functions piplot and pisim from lregpack.

b) The command pisim(n=100, type = 1) will produce the mean length
of the classical, semiparametric, asymptotically conservative, and asymptot-
ically optimal PIs when the errors are normal, as well as the coverage pro-
portions. Give the simulated lengths and coverages. Note: pimenlen gives the
four lengths (classical, semi, ac, aopt). Make a table with headers classical,
semi, ac, and aopt.

c) Repeat b) using the command pisim(n=100, type = 3). Now the er-
rors are EXP(1) - 1.

d) Download lregdata.txt and type the command
piplot(cbrainx,cbrainy). This command gives the semiparametric PI lim-
its for the Gladstone data. Include the plot in Word.

e) The infants are in the lower left corner of the plot. Do the PIs seem to
be better for the infants or the bulk of the data? Explain briefly.

2.30. Use the function MLRsim as described in Rule of Thumb 2.1 to
generate 10 pairs of response and residual plots. Right click Stop twenty
times, and include the last plot in Word.
Chapter 3
Building an MLR Model

Building a multiple linear regression (MLR) model from data is one of the
most challenging regression problems. The final full model will have response
variable Y = t(Z), a constant x1, and predictor variables x2 =
t2(w2, . . . , wr), . . . , xp = tp(w2, . . . , wr) where the initial data consists of
Z, w2, . . . , wr. Choosing t, t2, . . . , tp so that the final full model is a useful
MLR approximation to the data can be difficult.
Model building is an iterative process. Given the problem and data but
no model, the model building process can often be aided by graphs that
help visualize the relationships between the different variables in the data.
Then a statistical model can be proposed. This model can be fit and inference
performed. Then diagnostics from the fit can be used to check the assumptions
of the model. If the assumptions are not met, then an alternative model can
be selected. The fit from the new model is obtained, and the cycle is repeated.
This chapter provides some tools for building a good full model.
Warning: Researchers often have a single data set and tend to expect
statistics to provide far more information from the single data set than is
reasonable. MLR is an extremely useful tool, but MLR is at its best when the
final full model is known before collecting and examining the data. However,
it is very common for researchers to build their final full model by using
the iterative process until the final model fits the data well. Researchers
should not expect that all or even many of their research questions can be
answered from such a full model. If the final MLR full model is built from
a single data set in order to fit that data set well, then typically inference
from that model will not be valid. The model may be useful for describing
the data, but may perform very poorly for prediction of a future response.
The model may suggest that some predictors are much more important than
others, but a model that is chosen prior to collecting and examining the data
is generally much more useful for prediction and inference. A single data
set is a great place to start an analysis, but can be a terrible way
to end the analysis.

Often a nal full model is built after collecting and examining the data.
This procedure is called data snooping, and such models cannot be ex-
pected to be reliable. If possible, spend about 1/8 of the budget to collect
data and build an initial MLR model. Spend another 1/8 of the budget to
collect more data to check the initial MLR model. If changes are necessary,
continue this process until no changes from the previous step are needed,
resulting in a tentative MLR model. Then spend between 1/2 and 3/4 of the
budget to collect data assuming that the tentative model will be useful.
Alternatively, if the data set is large enough, use a training set of a
random sample of k of the n cases to build a model where 10p n/2 k
0.9n. Then use validation set of the other n k cases to conrm that the
model built with the training set is good. This technique may help reduce
biases, but needs n 20p.
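A minimal R sketch of such a training/validation split, using simulated data (the variable names and the model are ours, for illustration only):

```r
# Training/validation split sketch: build the model on k cases, check on n - k.
set.seed(1)
n <- 100; p <- 3                      # n cases, p predictors including the constant
dat <- data.frame(x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + dat$x2 + dat$x3 + rnorm(n)
k <- floor(n/2)                       # here 10p <= n/2 <= k <= 0.9n holds
train <- sample(1:n, k)               # random sample of k of the n cases
fit <- lm(y ~ x2 + x3, data = dat[train, ])    # model built on the training set
pred <- predict(fit, newdata = dat[-train, ])  # check on the n - k held-out cases
vmse <- mean((dat$y[-train] - pred)^2)         # validation mean squared error
```

If the validation error is much larger than the training error, the model built on the training set is suspect.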
After obtaining a final full model, researchers will typically find a final
submodel after performing variable selection. Even if the final full model was
selected before collecting data, the final submodel, obtained after performing
variable selection, may be hard to use.
Rule of thumb 3.1. If the MLR model is built using the variable se-
lection methods from Section 3.4, then the final submodel can be used for
description. If the full model was found after collecting the data, the model
may not be useful for inference and prediction. If the full model was selected
before collecting the data, then the prediction region method of bootstrap-
ping the variable selection model, described in Section 3.4.1, may be useful.
The remainder of this chapter considers interactions, predictor transfor-
mations, variable selection, and diagnostics. These techniques are useful for a
wide variety of regression models, including those covered in Chapters 12 and
13. This chapter also gives a graphical method for response transformations
which can be extended to additive error regression models.
3.1 Predictor Transformations

As a general rule, inferring about the distribution of Y |X from a lower
dimensional plot should be avoided when there are strong nonlinearities
among the predictors.
Cook and Weisberg (1999b, p. 34)

Predictor transformations are used to remove gross nonlinearities in the
predictors, and this technique is often very useful. Power transformations are
particularly effective, and the techniques of this section are often useful for
general regression problems, not just for multiple linear regression. A power
transformation has the form x = tλ(w) = w^λ for λ ≠ 0 and x = t0(w) =
log(w) for λ = 0. Often λ ∈ ΛL where

ΛL = {−1, −1/2, −1/3, 0, 1/3, 1/2, 1}    (3.1)

is called the ladder of powers. Often when a power transformation is needed,
a transformation that goes down the ladder, e.g. from λ = 1 to λ = 0, will
be useful. If the transformation goes too far down the ladder, e.g. if λ = 0
is selected when λ = 1/2 is needed, then it will be necessary to go back up
the ladder. Additional powers such as 2 and 3 can always be added.
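The power transformation and the ladder of powers (3.1) can be sketched in R as follows (the function name tpow is ours):

```r
# Power transformation t_lambda(w): w^lambda for lambda != 0, log(w) at lambda = 0.
tpow <- function(w, lambda) if (lambda == 0) log(w) else w^lambda
ladder <- c(-1, -1/2, -1/3, 0, 1/3, 1/2, 1)   # the ladder of powers (3.1)
w <- c(1, 4, 9, 100)
tpow(w, 1/2)    # one step down the ladder from lambda = 1: 1, 2, 3, 10
tpow(w, 0)      # the log transformation
```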
Denition 3.1. A scatterplot of x versus Y is used to visualize the con-
ditional distribution of Y |x. A scatterplot matrix is an array of scatter-
plots. It is used to examine the marginal relationships of the predictors and
response.
In this section we will only make a scatterplot matrix of the predictors.
Often nine or ten variables can be placed in a scatterplot matrix. The names
of the variables appear on the diagonal of the scatterplot matrix. The software
Arc gives two numbers, the minimum and maximum of the variable, along
with the name of the variable. The R software labels the values of each
variable in two places, see Example 3.2 below. Let one of the variables be W .
All of the marginal plots above and below W have W on the horizontal axis.
All of the marginal plots to the left and the right of W have W on the vertical
axis.
There are several rules of thumb that are useful for visually selecting a
power transformation to remove nonlinearities from the predictors. Let a
plot of X1 versus X2 have X2 on the vertical axis and X1 on the horizontal
axis.
Rule of thumb 3.2. a) If strong nonlinearities are apparent in the scat-
terplot matrix of the predictors w2, . . . , wp, it is often useful to remove the
nonlinearities by transforming the predictors using power transformations.

b) Use theory if available.

c) Suppose the plot of X1 versus X2 is nonlinear. The unit rule says that
if X1 and X2 have the same units, then try the same transformation for both
X1 and X2.

Assume that all values of X1 and X2 are positive. Then the following six
rules are often used.

d) The log rule states that a positive predictor that has the ratio between
the largest and smallest values greater than ten should be transformed to logs.
So X > 0 and max(X)/min(X) > 10 suggests using log(X).

e) The range rule states that a positive predictor that has the ratio be-
tween the largest and smallest values less than two should not be transformed.
So X > 0 and max(X)/min(X) < 2 suggests keeping X.

f) The bulging rule states that changes to the power of X2 and the power
of X1 can be determined by the direction that the bulging side of the curve
points. If the curve is hollow up (the bulge points down), decrease the power
of X2. If the curve is hollow down (the bulge points up), increase the power
of X2. If the curve bulges towards large values of X1, increase the power of
X1. If the curve bulges towards small values of X1, decrease the power of X1.
See Tukey (1977, pp. 173–176).

g) The ladder rule appears in Cook and Weisberg (1999a, p. 86).
To spread small values of a variable, make λ smaller.
To spread large values of a variable, make λ larger.

h) If it is known that X2 ≈ X1^λ and the ranges of X1 and X2 are such that
this relationship is one to one, then X1 ≈ X2^(1/λ) and X2 ≈ X1^λ.
Hence either the transformation X2^(1/λ) or X1^λ will linearize the plot. Note
that log(X2) ≈ λ log(X1), so taking logs of both variables will also linearize
the plot. This relationship frequently occurs if there is a volume present. For
example, let X2 be the volume of a sphere and let X1 be the circumference
of a sphere.

i) The cube root rule says that if X is a volume measurement, then the cube
root transformation X^(1/3) may be useful.
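The log and range rules d) and e) can be sketched as a small R check (the function name suggest is ours; the cutoffs 10 and 2 come from the rules above):

```r
# Sketch of the log rule and range rule checks for a positive variable x.
suggest <- function(x) {
  ratio <- max(x) / min(x)
  if (ratio > 10) "try log(x)"            # log rule: ratio > 10
  else if (ratio < 2) "do not transform"  # range rule: ratio < 2
  else "consider the ladder of powers"
}
suggest(c(10, 50, 350))    # ratio 35 > 10, so the log rule applies
suggest(c(80, 100, 140))   # ratio 1.75 < 2, so keep x as is
```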
In the literature, it is sometimes stated that predictor transformations
that are made without looking at the response are free. The reasoning is
that the conditional distribution of Y |(x2 = a2, . . . , xp = ap) is the same
as the conditional distribution of Y |[t2(x2) = t2(a2), . . . , tp(xp) = tp(ap)]:
there is simply a change of labelling. Certainly if Y |(x = 9) ∼ N(0, 1), then
Y |(√x = 3) ∼ N(0, 1). To see that Rule of thumb 3.2a does not always work,
suppose that Y = β1 + β2x2 + · · · + βpxp + e where the xi are iid lognormal(0,1)
random variables. Then wi = log(xi) ∼ N(0, 1) for i = 2, . . . , p and the
scatterplot matrix of the wi will be linear while the scatterplot matrix of the
xi will show strong nonlinearities if the sample size is large. However, there is
an MLR relationship between Y and the xi while the relationship between Y
and the wi is nonlinear: Y = β1 + β2 e^(w2) + · · · + βp e^(wp) + e ≠ β^T w + e. Given
Y and the wi with no information on the relationship, it would be difficult to
find the exponential transformation and to estimate the βi. The moral is that
predictor transformations, especially the log transformation, can and often
do greatly simplify the MLR analysis, but predictor transformations can turn
a simple MLR analysis into a very complex nonlinear analysis.
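The lognormal example above can be simulated in R; the two pairs plots show the claimed behavior:

```r
# Simulation sketch: MLR holds for lognormal predictors x_i even though their
# scatterplot matrix is nonlinear, while the logged predictors w_i look linear.
set.seed(1)
n <- 200
x2 <- exp(rnorm(n)); x3 <- exp(rnorm(n))   # iid lognormal(0,1) predictors
y  <- 1 + x2 + x3 + rnorm(n)               # MLR model holds in the x_i
w2 <- log(x2); w3 <- log(x3)               # w_i = log(x_i) ~ N(0,1)
pairs(cbind(x2, x3, y))    # strong nonlinearities among the x_i
pairs(cbind(w2, w3, y))    # the w_i are linear, but Y is nonlinear in the w_i
```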
Theory, if available, should be used to select a transformation. Frequently
more than one transformation will work. For example, if W = weight and X1
= volume = (X2)(X3)(X4), then W versus X1^(1/3) and log(W) versus log(X1) =
log(X2) + log(X3) + log(X4) may both work. Also if W is linearly related with
X2, X3, X4 and these three variables all have length units mm, say, then the
units of X1 are (mm)^3. Hence the units of X1^(1/3) are mm.
Suppose that all values of the variable w to be transformed are positive.
The log rule says use log(w) if max(wi)/min(wi) > 10. This rule often works
wonders on the data and the log transformation is the most used (modified)
power transformation. If the variable w can take on the value of 0, use
log(w + c) where c is a small constant like 1, 1/2, or 3/8.

To use the ladder rule, suppose you have a scatterplot of two variables
x1^(λ1) versus x2^(λ2) where both x1 > 0 and x2 > 0. Also assume that the plotted
points follow a nonlinear one to one function. Consider the ladder of powers

ΛL = {−1, −1/2, −1/3, 0, 1/3, 1/2, 1}.

To spread small values of the variable, make λi smaller. To spread large values
of the variable, make λi larger.

For example, if both variables are right skewed, then there will be many
more cases in the lower left of the plot than in the upper right. Hence small
values of both variables need spreading. Figures 13.3 b) and 13.16 have this
shape.
Consider the ladder of powers. Often no transformation (λ = 1) is best,
then the log transformation, then the square root transformation, then the
reciprocal transformation.
Example 3.1. Examine Figure 3.1. Let X1 = w and X2 = x. Since w is on
the horizontal axis, mentally add a narrow vertical slice to the plot. If a large
amount of data falls in the slice at the left of the plot, then small values need
spreading. Similarly, if a large amount of data falls in the slice at the right of
the plot (compared to the middle and left of the plot), then large values need
spreading. For the variable on the vertical axis, make a narrow horizontal
slice. If the plot looks roughly like the northwest corner of a square, then
small values of the horizontal and large values of the vertical variable need
spreading. Hence in Figure 3.1a, small values of w need spreading. Notice that
the plotted points bulge up towards small values of the horizontal variable.
If the plot looks roughly like the northeast corner of a square, then large
values of both variables need spreading. Hence in Figure 3.1b, large values
of x need spreading. Notice that the plotted points bulge up towards large
values of the horizontal variable. If the plot looks roughly like the southwest
corner of a square, as in Figure 3.1c, then small values of both variables
need spreading. Notice that the plotted points bulge down towards small
values of the horizontal variable. If the plot looks roughly like the southeast
corner of a square, then large values of the horizontal and small values of the
[Figure 3.1 here: four scatterplots a)–d) of x versus w.]
Fig. 3.1 Plots to Illustrate the Bulging and Ladder Rules
vertical variable need spreading. Hence in Figure 3.1d, small values of x need
spreading. Notice that the plotted points bulge down towards large values of
the horizontal variable.
Example 3.2: Mussel Data. Cook and Weisberg (1999a, pp. 351,
433, 447) gave a data set on 82 mussels sampled off the coast of New
Zealand. The response is muscle mass M in grams, and the predictors are
a constant, the length L and height H of the shell in mm, the shell width
W , and the shell mass S. Figure 3.2 shows the scatterplot matrix of the
predictors L, W , H, and S. Examine the variable length. Length is on the
vertical axis on the three top plots and the right of the scatterplot matrix
labels this axis from 150 to 300. Length is on the horizontal axis on the
three leftmost marginal plots, and this axis is labelled from 150 to 300 on the
bottom of the scatterplot matrix. The marginal plot in the bottom left corner
has length on the horizontal and shell on the vertical axis. The marginal
plot that is second from the top and second from the right has height on the
horizontal and width on the vertical axis. If the data is stored in x, the plot
can be made with the following command in R.

pairs(x,labels=c("length","width","height","shell"))
Nonlinearity is present in several of the plots. For example, width and
length seem to be linearly related while length and shell have a nonlinear
relationship. The minimum value of shell is 10 while the max is 350. Since
[Figure 3.2 here: scatterplot matrix of length, width, height, and shell.]
Fig. 3.2 Scatterplot Matrix for Original Mussel Data Predictors
350/10 = 35 > 10, the log rule suggests that log S may be useful. If log S
replaces S in the scatterplot matrix, then there may be some nonlinearity
present in the plot of log S versus W with small values of W needing spread-
ing. Hence the ladder rule suggests reducing λ from 1, and we tried log(W).
Figure 3.3 shows that taking the log transformations of W and S results in
a scatterplot matrix that is much more linear than the scatterplot matrix of
Figure 3.2. Notice that the plot of W versus L and the plot of log(W ) versus
L both appear linear. This plot can be made with the following commands.
z <- x; z[,2] <- log(z[,2]); z[,4] <- log(z[,4])
pairs(z,labels=c("length","Log W","height","Log S"))
The plot of shell versus height in Figure 3.2 is nonlinear, and small values
of shell need spreading since if the plotted points were projected on the
horizontal axis, there would be too many points at values of shell near 0.
Similarly, large values of height need spreading.
[Figure 3.3 here: scatterplot matrix of length, Log W, height, and Log S.]
Fig. 3.3 Scatterplot Matrix for Transformed Mussel Data Predictors
3.2 Graphical Methods for Response Transformations

If the ratio of largest to smallest value of y is substantial, we usually begin
by looking at log y.
Mosteller and Tukey (1977, p. 91)

The applicability of the multiple linear regression model can be expanded
by allowing response transformations. An important class of response trans-
formation models adds an additional unknown transformation parameter λo,
such that

Yi = tλo(Zi) ≡ Zi^(λo) = E(Yi|xi) + ei = xi^T β + ei.    (3.2)

If λo was known, then Yi = tλo(Zi) would follow a multiple linear regression
model with p predictors including the constant. Here, β is a p × 1 vector
of unknown coefficients depending on λo, x is a p × 1 vector of predictors
that are assumed to be measured with negligible error, and the errors ei are
assumed to be iid with zero mean.
Definition 3.2. Assume that all of the values of the response Zi are
positive. A power transformation has the form Y = tλ(Z) = Z^λ for λ ≠ 0
and Y = t0(Z) = log(Z) for λ = 0 where
λ ∈ ΛL = {−1, −1/2, −1/3, 0, 1/3, 1/2, 1}.

Definition 3.3. Assume that all of the values of the response Zi are
positive. Then the modified power transformation family

tλ(Zi) ≡ Zi^(λ) = (Zi^λ − 1)/λ    (3.3)

for λ ≠ 0 and Zi^(0) = log(Zi). Generally λ ∈ Λ where Λ is some interval such
as [−1, 1] or a coarse subset such as ΛL. This family is a special case of the
response transformations considered by Tukey (1957).
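A one-line R sketch of the modified power transformation (3.3), with a check that the family approaches the log transformation as λ goes to 0 (the name tmod is ours):

```r
# Modified power transformation (3.3): (z^lambda - 1)/lambda, log(z) at lambda = 0.
tmod <- function(z, lambda) {
  if (lambda == 0) log(z) else (z^lambda - 1) / lambda
}
tmod(2, 1)        # z - 1 = 1: no transformation up to a shift
tmod(2, 0)        # log(2)
tmod(2, 0.001)    # close to log(2), since the family is continuous in lambda
```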
A graphical method for response transformations refits the model using
the same fitting method: changing only the response from Z to tλ(Z).
Compute the fitted values Ŵi using Wi = tλ(Zi) as the response. Then
a transformation plot of Ŵi versus Wi is made for each of the seven values of
λ ∈ ΛL with the identity line added as a visual aid. Vertical deviations from
the identity line are the residuals ri = Wi − Ŵi. Then a candidate response
transformation Y = tλ(Z) is reasonable if the plotted points follow the iden-
tity line in a roughly evenly populated band if the unimodal MLR model is
reasonable for Y = W and x. See Definition 2.6. Curvature from the identity
line suggests that the candidate response transformation is inappropriate.
By adding the response Z to the scatterplot matrix, the methods of
the previous section can also be used to suggest good values of λ, and it is
usually a good idea to use predictor transformations to remove nonlinearities
from the predictors before selecting a response transformation. Check that
the scatterplot matrix with the transformed variables is better than the scat-
terplot matrix of the original variables. Notice that the graphical method is
equivalent to making response plots for the seven values of W = tλ(Z),
and choosing the best response plot where the MLR model seems most
reasonable. The seven response plots are called transformation plots be-
low. Our convention is that a plot of X versus Y means that X is on the
horizontal axis and Y is on the vertical axis.
Warning: Rule of thumb 3.2 does not always work. For example, the
log rule may fail. If the relationships in the scatterplot matrix are already lin-
ear or if taking the transformation does not increase the linearity (especially
in the row containing the response), then no transformation may be better
than taking a transformation. For the Arc data set evaporat.lsp, the log
rule suggests transforming the response variable Evap, but no transformation
works better.

Definition 3.4. A transformation plot is a plot of Ŵ versus W with the
identity line added as a visual aid.
There are several reasons to use a coarse grid of powers. First, several of the
powers correspond to simple transformations such as the log, square root, and
cube root. These powers are easier to interpret than λ = 0.28, for example.
[Figure 3.4 here: four transformation plots of W versus TZHAT for
lambda = 1, 0.5, 0 (log), and −1.]
Fig. 3.4 Four Transformation Plots for the Textile Data
According to Mosteller and Tukey (1977, p. 91), the most commonly used
power transformations are the λ = 0 (log), λ = 1/2, λ = −1, and λ = 1/3
transformations in decreasing frequency of use. Secondly, if the estimator λ̂n
can only take values in ΛL, then sometimes λ̂n will converge (e.g., in prob-
ability) to λo ∈ ΛL. Thirdly, Tukey (1957) showed that neighboring power
transformations are often very similar, so restricting the possible powers to
a coarse grid is reasonable. Note that powers can always be added to the
grid ΛL. Useful powers are 1/4, 2/3, 2, and 3. Powers from numerical
methods can also be added.
Application 3.1. This graphical method for selecting a response trans-
formation is very simple. Let Wi = tλ(Zi). Then for each of the seven values
of λ ∈ ΛL, perform OLS on (Wi, xi) and make the transformation plot of
Ŵi versus Wi. If the plotted points follow the identity line for λ*, then take
λ̂o = λ*, that is, Y = tλ*(Z) is the response transformation. (Note that this
procedure can be modified to create a graphical diagnostic for a numerical
estimator λ̂ of λo by adding λ̂ to ΛL. OLS can be replaced by other methods.)

If more than one value of λ ∈ ΛL gives a linear plot, take the simplest or
most reasonable transformation or the transformation that makes the most
sense to subject matter experts. Also check that the corresponding residual
plots of Ŵ versus W − Ŵ look reasonable. The values of λ in decreasing order
of importance are 1, 0, 1/2, −1, and 1/3. So the log transformation would be
chosen over the cube root transformation if both transformation plots look
equally good.
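Application 3.1 can be sketched in R with simulated data for which log(Z) satisfies an MLR model; the λ = 0 transformation plot should then track the identity line:

```r
# Sketch of Application 3.1 with simulated data: for each lambda in the ladder,
# fit OLS with W = t_lambda(Z) as the response and plot fitted values versus W
# with the identity line added.
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
z <- exp(1 + x1 + x2 + rnorm(n, sd = 0.5))   # here log(Z) follows an MLR model
ladder <- c(-1, -1/2, -1/3, 0, 1/3, 1/2, 1)
op <- par(mfrow = c(2, 4))
for (lam in ladder) {
  w <- if (lam == 0) log(z) else z^lam       # candidate response W = t_lambda(Z)
  fit <- lm(w ~ x1 + x2)
  plot(fit$fitted.values, w, main = paste("lambda =", round(lam, 2)))
  abline(0, 1)    # points should follow the identity line for a good lambda
}
par(op)           # the lambda = 0 panel should look best for these data
```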
After selecting the transformation, the usual checks should be made. In
particular, the transformation plot for the selected transformation is the re-
sponse plot, and a residual plot should also be made. The following example
illustrates the procedure, and the plots show W = tλ(Z) on the vertical axis.
The label TZHAT on the horizontal axis denotes the fitted values Ŵ that
result from using W = tλ(Z) as the response in the OLS software.
Example 3.3: Textile Data. In their pioneering paper on response trans-
formations, Box and Cox (1964) analyze data from a 3^3 experiment on the
behavior of worsted yarn under cycles of repeated loadings. The response
Z is the number of cycles to failure and a constant is used along with the
three predictors length, amplitude, and load. Using the normal profile log
likelihood for λo, Box and Cox determine λ̂o = −0.06 with approximate 95
percent confidence interval −0.18 to 0.06. These results give a strong indi-
cation that the log transformation may result in a relatively simple model,
as argued by Box and Cox. Nevertheless, the numerical Box–Cox transfor-
mation method provides no direct way of judging the transformation against
the data.

Shown in Figure 3.4 are transformation plots of Ŵ versus W = Z^λ for
four values of λ, except log(Z) is used if λ = 0. The plots show how the trans-
formations bend the data to achieve a homoscedastic linear trend. Perhaps
more importantly, they indicate that the information on the transformation
is spread throughout the data in the plot since changing λ causes all points
along the curvilinear scatter in Figure 3.4a to form along a linear scatter in
Figure 3.4c. Dynamic plotting using λ as a control seems quite effective for
judging transformations against the data, and the log response transformation
does indeed seem reasonable.

Note the simplicity of the method: Figure 3.4a shows that a response trans-
formation is needed since the plotted points follow a nonlinear curve while
Figure 3.4c suggests that Y = log(Z) is the appropriate response transforma-
tion since the plotted points follow the identity line. If all 7 plots were made
for λ ∈ ΛL, then λ = 0 would be selected since this plot is linear. Also, Fig-
ure 3.4a suggests that the log rule is reasonable since max(Z)/min(Z) > 10.
The essential point of the next example is that observations that influence
the choice of the usual Box–Cox numerical power transformation are often
easily identified in the transformation plots. The transformation plots are
especially useful if the bivariate relationships of the predictors, as seen in the
scatterplot matrix of the predictors, are linear.
Example 3.4: Mussel Data. Consider the mussel data of Example 3.2
where the response is muscle mass M in grams, and the predictors are the
length L and height H of the shell in mm, the logarithm log W of the shell
width W, the logarithm log S of the shell mass S, and a constant. With this
starting point, we might expect a log transformation of M to be needed
[Figure 3.5 here: four transformation plots for lambda = 1, 0, 0.28, and −1,
with cases 8 and 48 standing out.]
Fig. 3.5 Transformation Plots for the Mussel Data
because M and S are both mass measurements and log S is being used as
a predictor. Using log M would essentially reduce all measurements to the
scale of length. The Box–Cox likelihood method gave λ̂o = 0.28 with ap-
proximate 95 percent confidence interval 0.15 to 0.4. The log transformation
is excluded under this inference, leading to the possibility of using different
transformations of the two mass measurements.

Shown in Figure 3.5 are transformation plots for four values of λ. A striking
feature of these plots is the two points that stand out in three of the four
plots (cases 8 and 48). The Box–Cox estimate λ̂ = 0.28 is evidently influenced
by the two outlying points and, judging deviations from the identity line in
Figure 3.5c, the mean function for the remaining points is curved. In other
words, the Box–Cox estimate is allowing some visually evident curvature
in the bulk of the data so it can accommodate the two outlying points.
Recomputing the estimate of λo without the highlighted points gives λ̂o =
−0.02, which is in good agreement with the log transformation anticipated
at the outset. Reconstruction of the transformation plots indicated that now
the information for the transformation is consistent throughout the data on
the horizontal axis of the plot.

Note that in addition to helping visualize λ against the data, the transfor-
mation plots can also be used to show the curvature and heteroscedasticity in
the competing models indexed by λ ∈ ΛL. Example 3.4 shows that the plot
can also be used as a diagnostic to assess the success of numerical methods
such as the Box–Cox procedure for estimating λo.
Example 3.5: Mussel Data Again. Return to the mussel data, this
time considering the regression of M on a constant and the four untransformed
predictors L, H, W, and S. Figure 3.2 shows the scatterplot matrix
of the predictors L, H, W, and S. Again nonlinearity is present. Figure 3.3
shows that taking the log transformations of W and S results in a linear
scatterplot matrix for the new set of predictors L, H, log W, and log S. Then
the search for the response transformation can be done as in Example 3.4.
3.3 Main Eects, Interactions, and Indicators

Section 1.4 explains interactions, factors, and indicator variables in an
abstract setting where Y depends on x only through xT β, written Y ⫫ x | xT β,
where xT β is the sufficient predictor (SP).
MLR is such a model. The Section 1.4 interpretations given in terms of the
SP can be given in terms of E(Y |x) for MLR since E(Y |x) = xT β = SP for
MLR.

Definition 3.5. Suppose that the explanatory variables have the form
x2, . . . , xk, xjj = xj², xij = xi xj, x234 = x2 x3 x4, et cetera. Then the variables
x2, . . . , xk are main effects. A product of two or more different main effects
is an interaction. A variable such as x2² or x3^7 is a power. An x2 x3 interaction
will sometimes also be denoted as x2:x3 or x2*x3.

Definition 3.6. A factor W is a qualitative random variable. Suppose W
has c categories a1, . . . , ac. Then the factor is incorporated into the MLR
model by using c − 1 indicator variables xW j = 1 if W = aj and xW j = 0
otherwise, where one of the levels aj is omitted, e.g. use j = 1, . . . , c − 1. Each
indicator variable has 1 degree of freedom. Hence the degrees of freedom of
the c − 1 indicator variables associated with the factor is c − 1.
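The c − 1 indicator coding of Definition 3.6 can be sketched in a few lines. The factor values and level names below are hypothetical, chosen only for illustration; the first level acts as the omitted reference level.

```python
# Sketch: incorporate a factor with c levels into an MLR design
# using c - 1 indicator variables; the first level in `levels` is
# omitted and serves as the reference level.

def indicator_columns(w, levels):
    """Return the c - 1 indicator columns for factor values w."""
    return [[1 if wi == lev else 0 for wi in w] for lev in levels[1:]]

# hypothetical factor with c = 3 levels observed on n = 5 cases
W = ["white", "black", "other", "white", "other"]
cols = indicator_columns(W, ["white", "black", "other"])
print(cols)  # [[0, 1, 0, 0, 0], [0, 0, 1, 0, 1]]
```

Each returned column would be appended to the design matrix; together the c − 1 columns carry the factor's c − 1 degrees of freedom.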

Rule of thumb 3.3. Suppose that the MLR model contains at least one
power or interaction. Then the corresponding main effects that make up the
powers and interactions should also be in the MLR model.

Rule of thumb 3.3 suggests that if x3² and x2 x7 x9 are in the MLR model,
then x2, x3, x7, and x9 should also be in the MLR model. A quick way to check
whether a term like x3² is needed in the model is to fit the main effects model
and then make a scatterplot matrix of the predictors and the residuals, where
the residuals r are on the top row. Then the top row shows plots of xk versus
r, and if a plot is parabolic, then xk² should be added to the model. Potential
predictors wj could also be added to the scatterplot matrix. If the plot of
wj versus r shows a positive or negative linear trend, add wj to the model.
If the plot is quadratic, add wj and wj² to the model. This technique is for
quantitative variables xk and wj.

The simplest interaction to interpret is the interaction between a
quantitative variable x2 and a qualitative variable x3 with 2 levels. Suppose that
x3 = 1 for level a2 and x3 = 0 for level a1. Then a first order model with
interaction is SP = E(Y |x) = β1 + β2 x2 + β3 x3 + β4 x2 x3. This model yields
two unrelated lines in the conditional expectation depending on the value of
x3: E(Y |x) = β1 + β3 + (β2 + β4)x2 if x3 = 1, and E(Y |x) = β1 + β2 x2 if
x3 = 0. If β4 = 0, then there are two parallel lines: E(Y |x) = β1 + β3 + β2 x2
if x3 = 1, and E(Y |x) = β1 + β2 x2 if x3 = 0. If β3 = β4 = 0, then the two
lines are coincident: E(Y |x) = β1 + β2 x2 for both values of x3. If β3 = 0,
then the two lines have the same intercept: E(Y |x) = β1 + (β2 + β4)x2 if
x3 = 1, and E(Y |x) = β1 + β2 x2 if x3 = 0.
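The two lines implied by the first order model with interaction can be computed directly. A minimal sketch, with hypothetical coefficient values chosen only for illustration:

```python
# The model E(Y|x) = b1 + b2*x2 + b3*x3 + b4*x2*x3 with x3 in {0, 1}
# gives one line in x2 per level of x3. Coefficients are hypothetical.

b1, b2, b3, b4 = 10.0, 2.0, 5.0, -1.5

def line(x3):
    """Intercept and slope of E(Y|x) viewed as a line in x2 at level x3."""
    return b1 + b3 * x3, b2 + b4 * x3

print(line(0))  # (10.0, 2.0): intercept b1, slope b2 at the reference level
print(line(1))  # (15.0, 0.5): intercept b1 + b3, slope b2 + b4
# b4 = 0 would make the two slopes equal, giving parallel lines;
# b3 = b4 = 0 would make the two lines coincident.
```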
Notice that β4 = 0 corresponds to no interaction. The estimated slopes
of the two lines will not be exactly identical, so the two estimated lines will
not be parallel even if there is no interaction. If the two estimated lines
have similar slopes and do not cross, there is evidence of no interaction,
while crossing lines are evidence of interaction provided that the two lines
are not nearly coincident. Two lines with very different slopes also suggest
interaction. In general, as factors have more levels and interactions have more
terms, e.g. x2 x3 x4 x5, the interpretation of the model rapidly becomes very
complex.

Example 3.6. Two varieties of cement that replace sand with coal waste
products were compared to a standard cement mix. The response Y was the
compressive strength of the cement measured after 7, 28, 60, 90, or 180 days

a) Response Plot b) Residual Plot


500
6000

RESID
Y

500
3000

4000 6000 4000 6000


FIT FIT
Fig. 3.6 Plots to Illustrate Interaction for the Cement Data

of curing time = x2. This cement was intended for sidewalks and barriers but
not for construction. The data is likely from small batches of cement prepared
in the lab, and is likely correlated; however, MLR can be used for exploratory
and descriptive purposes. Actually using the different cement mixtures in the
field (e.g., as sidewalks) would be very expensive. The factor mixture had 3
levels: 2 for the standard cement, and 0 and 1 for the coal based cements.

A plot of x2 versus Y (not shown, but see Problem 3.15) resembled the left
half of a quadratic Y = c(x2 − 180)². Hence x2 and x2² were added to the
model.
Figure 3.6 shows the response plot and residual plot from this model.
The standard cement mix uses the symbol + while the coal based mixes
use an inverted triangle and square. OLS lines based on each mix are added
as visual aids. The lines from the two coal based mixes do not intersect,
suggesting that there may not be an interaction between these two mixes.
There is an interaction between the standard mix and the two coal mixes
since these lines do intersect. All three types of cement become stronger with
time, but the standard mix has the greater strength at early curing times
while the coal based cements become stronger than the standard mix at the
later times. Notice that the interaction is more apparent in the residual plot.
Problem 3.15 adds a factor Fx3 based on mix as well as the x2*Fx3 and
x2²*Fx3 interactions. The resulting model is an improvement, but there is
still some curvature in the residual plot, and one case is not fit very well.

3.4 Variable Selection

Variable selection, also called subset or model selection, is the search for a
subset of predictor variables that can be deleted without important loss of
information. A model for variable selection in multiple linear regression can
be described by

Y = xT β + e = βT x + e = xTS βS + xTE βE + e = xTS βS + e        (3.4)

where e is an error, Y is the response variable, x = (xTS, xTE)T is a p × 1
vector of predictors, xS is a kS × 1 vector, and xE is a (p − kS) × 1 vector.
Given that xS is in the model, βE = 0 and E denotes the subset of terms
that can be eliminated given that the subset S is in the model.

Since S is unknown, candidate subsets will be examined. Let xI be the
vector of k terms from a candidate subset indexed by I, and let xO be the
vector of the remaining predictors (out of the candidate submodel). Then

Y = xTI βI + xTO βO + e.        (3.5)

Definition 3.7. The model Y = xT β + e that uses all of the predictors is
called the full model. A model Y = xTI βI + e that only uses a subset xI of the
predictors is called a submodel. The full model is always a submodel. The
sufficient predictor (SP) is the linear combination of the predictor variables
used in the model. Hence the full model has SP = xT β and the submodel
has SP = xTI βI.

The estimated sufficient predictor (ESP) is xT β̂, and the following
remarks suggest that a submodel I is worth considering if the correlation
corr(ESP, ESP(I)) ≥ 0.95. Suppose that S is a subset of I and that
model (3.4) holds. Then

SP = xT β = xTS βS = xTS βS + xTI/S β(I/S) + xTO 0 = xTI βI        (3.6)

where xI/S denotes the predictors in I that are not in S. Since this is true
regardless of the values of the predictors, β(I/S) = 0 and the sample correlation
corr(xTi β, xTI,i βI) = 1.0 for the population model if S ⊆ I.

All too often, variable selection is performed and then the researcher tries
to use the final submodel for inference as if the submodel had been selected
before gathering data. At the other extreme, it could be suggested that variable
selection should not be done because classical inferences after variable selection
are not valid. Neither of these two extremes is useful.
Ideally the model is known before collecting the data. After the data is
collected, the MLR assumptions are checked and then the model is used
for inference. Alternatively, a preliminary study can be used to collect data.
Then the predictors and response can be transformed until a full model is
built that seems to be a useful MLR approximation of the data. Then variable
selection can be performed, suggesting a final model. Then this final model is
the known model used before collecting data for the main part of the study.
See the two paragraphs above Rule of thumb 3.1. If
the full model is known, inference with the bootstrap prediction region
method and prediction intervals of Section 3.4.1 may be useful.
In practice, the researcher often has one data set, builds the full model,
and performs variable selection to obtain a final submodel. In other words, an
extreme amount of data snooping was used to build the final model. A major
problem with the final MLR model (chosen after variable selection or data
snooping) is that it is not valid for inference: the p-values for the OLS
t-tests and ANOVA F test are likely to be too small, while the p-value for the
partial F test that uses the final model as the reduced model is likely to be
too high. Similarly, the actual coverage of the nominal 100(1 − δ)% prediction
intervals tends to be too small and unknown (e.g., the nominal 95% PIs may
only contain 83% of the future responses Yf). Thus the model is likely to fit
the data set from which it was built much better than future observations.
Call the data set from which the MLR model was built the training data,
consisting of cases (Yi, xi) for i = 1, . . . , n. Then the future predictions tend
to be poor in that |Yf − Ŷf| tends to be larger on average than |Yi − Ŷi|.
To summarize, a final MLR model selected after variable selection can be
useful for description and exploratory analysis: the tests and intervals can
be used for exploratory purposes, but the final model is usually not valid for
inference.
Generally the research paper should state that the model was built with
one data set, and is useful for description and exploratory purposes, but
should not be used for inference. The research paper should only suggest
that the model is useful for inference if the model has been shown to be
useful on data collected after the model was built. For example, if
the researcher can collect new data and show that the model produces valid
inferences (e.g., 97 out of 100 95% prediction intervals contained the future
response Yf ), then the researcher can perhaps claim to have found a model
that is useful for inference.
Other problems exist even if the full MLR model Y = xT β + e is good.
Let I ⊆ {1, . . . , p} and let xI be the final vector of predictors. If xI is missing
important predictors contained in the full model, sometimes called
underfitting, then the final model Y = xTI βI + e may be a very poor approximation
to the data: in particular, the full model may be linear while the final model
may be nonlinear. Similarly, the full model may satisfy V(ei) = σ² while the
constant variance assumption is violated by the submodel: V(ei) = σi². These
two problems are less severe if the joint distribution of (Y, xT)T is
multivariate normal, since then Y = xTI βI + e satisfies the constant variance MLR
model regardless of the subset I used. See Problem 10.10.

In spite of these problems, if the researcher has a single data set with
many predictors, then usually variable selection must be done. Let p − 1 be
the number of nontrivial predictors and assume that the model also contains
a constant. Also assume that n ≥ 10p. If the MLR model found after variable
selection has good response and residual plots, then the model may be very
useful for descriptive and exploratory purposes.
Simpler models are easier to explain and use than more complicated
models, and there are several other important reasons to perform variable
selection. First, an MLR model with unnecessary predictors has a mean square
error for prediction that is too large. Let xS contain the necessary predictors,
let x be the full model, and let xI be a submodel. If (3.4) holds and S ⊆ I,
then E(Y |xI) = xTI βI = xTS βS = xT β. Hence OLS applied to Y and xI
yields an unbiased estimator β̂I of βI. If (3.4) holds, S ⊆ I, βS is a k × 1
vector, and βI is a j × 1 vector with j > k, then

(1/n) Σi V(ŶI,i) = σ² j/n > σ² k/n = (1/n) Σi V(ŶS,i).        (3.7)

In particular, the full model has j = p. Hence having unnecessary
predictors decreases the precision for prediction. Fitting unnecessary predictors is
sometimes called fitting noise or overfitting. As an extreme case, suppose
that the full model contains p = n predictors, including a constant, so that
the hat matrix H = In, the n × n identity matrix. Then Ŷ = Y so that
VAR(Ŷi|x) = VAR(Yi).
To see that (3.7) holds, assume that the model includes all p possible
terms so it may overfit but does not underfit. Then Ŷ = HY and Cov(Ŷ) =
σ² H I HT = σ² H. Thus
(1/n) Σi V(Ŷi) = (1/n) tr(σ² H) = (σ²/n) tr(X(XT X)⁻¹XT) = (σ²/n) tr((XT X)⁻¹XT X) = σ² p/n

where tr(A) is the trace operation. Replacing p by j and k and replacing H
by HI and HS implies Equation (3.7). Hence if only k parameters are needed
and p >> k, then serious overfitting occurs and increases (1/n) Σi V(Ŷi).
Secondly, often researchers are interested in examining the effects of certain
predictors on the response. Recall that βi measures the effect of xi given that
all of the other predictors x1, . . . , xi−1, xi+1, . . . , xp are in the model. If some
of the predictors are highly correlated, then these predictors may not be
needed in the MLR model given that the other predictors are in the model.
Hence it will not be possible to examine the effects of these predictors on the
response unless the MLR model is changed.
Thirdly, there may be an extremely expensive predictor xp that researchers
would like to omit. If xp is not needed in the MLR model given that
x1, . . . , xp−1 are in the model, then xp can be removed from the model, saving
money.

A major assumption before performing variable selection is that the full
model is good. A factor with c levels can be incorporated into the full model by
creating c − 1 indicator variables. Sometimes the categories can be combined
into fewer categories. For example, if the factor is race with levels white,
black, and other, new levels white and nonwhite may be useful for some data
sets. Two rules of thumb are useful for building a full model. Notice that Rule
of thumb 3.4 uses data snooping. Hence the full model and the submodels
chosen after variable selection can be used for description and exploratory
analysis, but should not be used for inference.
Rule of thumb 3.4. Remove strong nonlinearities from the predictors by
making scatterplot matrices of the predictors and the response. If necessary,
transform the predictors and the response using methods from Sections 3.1
and 3.2. Do not transform indicator variables. Each scatterplot matrix should
contain the response entered as the last variable. Do not use more than 10 or
11 variables per scatterplot matrix. Hence if there are 90 predictor variables,
make 10 scatterplot matrices. The first will contain x1, . . . , x9, Y and the last
will contain x82, . . . , x90, Y.
Often a variable xi does not need to be transformed if the transformation
does not increase the linearity of the plot of xi versus Y . If the plot of xi
versus xj is nonlinear for some xj , try to transform one or both of xi and xj
in order to remove the nonlinearity, but be careful that the transformations
do not cause a nonlinearity to appear in the plots of xi and xj versus Y .

Rule of thumb 3.5. Let xw1, . . . , xw,c−1 correspond to the indicator
variables of a factor W. Either include all of the indicator variables in the
model or exclude all of the indicator variables from the model. If the model
contains powers or interactions, also include all main effects in the model
(see Section 3.3).

Next we suggest methods for finding a good submodel. We make the
simplifying assumptions that the full model is good, that all predictors have the
same cost, that each submodel contains a constant, and that there is no
theory requiring that a particular predictor must be in the model. Also assume
that n ≥ 10p, and that the response and residual plots of the full model
are good. Rule of thumb 3.5 should be used for the full model and for all
submodels.

The basic idea is to obtain fitted values from the full model and the
candidate submodel. If the candidate model is good, then the plotted points in
a plot of the submodel fitted values versus the full model fitted values should
follow the identity line. In addition, a similar plot should be made using the
residuals.
A problem with this idea is how to select the candidate submodel from
the nearly 2^p potential submodels. One possibility would be to try to order
the predictors in importance, say x1, . . . , xp. Then let the kth model contain
the predictors x1, x2, . . . , xk for k = 1, . . . , p. If the predicted values from the
submodel are highly correlated with the predicted values from the full model,
then the submodel is good. All subsets selection, forward selection, and
backward elimination can be used (see Section 1.3), but criteria to separate
good submodels from bad ones are needed.
Two important summaries for submodel I are R²(I), the proportion of
the variability of Y explained by the nontrivial predictors in the model, and
MSE(I) = σ̂I², the estimated error variance. See Definitions 2.15 and 2.16.
Suppose that model I contains k predictors, including a constant. Since
adding predictors does not decrease R², the adjusted R²A(I) is often used,
where

R²A(I) = 1 − (1 − R²(I)) n/(n − k) = 1 − MSE(I) n/SST.

See Seber and Lee (2003, pp. 400–401). Hence the model with the maximum
R²A(I) is also the model with the minimum MSE(I).
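A small sketch of the display above, with made-up R²(I) values, checking numerically that the model maximizing R²A(I) is the model minimizing MSE(I):

```python
# Adjusted R^2_A(I) = 1 - (1 - R^2(I)) n/(n - k) = 1 - MSE(I) n/SST.
# The R^2(I) values per submodel size k are hypothetical.

def r2_adj(r2, n, k):
    return 1.0 - (1.0 - r2) * n / (n - k)

def mse(r2, n, k, sst):
    return (1.0 - r2) * sst / (n - k)   # since SSE(I) = (1 - R^2(I)) SST

n, sst = 50, 400.0
models = {2: 0.60, 3: 0.70, 4: 0.71}   # k -> R^2(I), made up
best_by_r2a = max(models, key=lambda k: r2_adj(models[k], n, k))
best_by_mse = min(models, key=lambda k: mse(models[k], n, k, sst))
assert best_by_r2a == best_by_mse      # the two criteria agree
print(best_by_r2a)
```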

For multiple linear regression, recall that if the candidate model of xI
has k terms (including the constant), then the partial F statistic for testing
whether the p − k predictor variables in xO can be deleted is

FI = [(SSE(I) − SSE)/((n − k) − (n − p))] / [SSE/(n − p)] = [(n − p)/(p − k)] [SSE(I)/SSE − 1]

where SSE is the error sum of squares from the full model, and SSE(I) is the
error sum of squares from the candidate submodel. An extremely important
criterion for variable selection is the Cp criterion.
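The two forms of the partial F statistic above can be checked against each other numerically; the SSE values, n, p, and k below are hypothetical:

```python
# Partial F statistic for testing whether the p - k predictors outside
# the candidate submodel I can be deleted, computed both ways from the
# display above. SSE values are made up for illustration.

def partial_f(sse_full, sse_sub, n, p, k):
    return ((sse_sub - sse_full) / (p - k)) / (sse_full / (n - p))

n, p, k = 100, 6, 3
sse_full, sse_sub = 250.0, 280.0
f1 = partial_f(sse_full, sse_sub, n, p, k)
f2 = ((n - p) / (p - k)) * (sse_sub / sse_full - 1.0)  # equivalent form
print(round(f1, 4), round(f2, 4))
```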

Definition 3.8.

Cp(I) = SSE(I)/MSE + 2k − n = (p − k)(FI − 1) + k

where MSE is the error mean square for the full model.
From Section 1.3, recall that all subsets selection, forward selection, and
backward elimination produce one or more submodels of interest for k =
2, . . . , p where the submodel contains k predictors, including a constant. The
following proposition helps explain why Cp is a useful criterion and suggests
that for subsets I with k terms, submodels with Cp(I) ≤ min(2k, p) are
especially interesting. Olive and Hawkins (2005) show that this interpretation
of Cp can be generalized to 1D regression models with a linear predictor βT x,
such as generalized linear models. Denote the residuals and fitted values from
the full model by ri = Yi − xTi β̂ = Yi − Ŷi and Ŷi = xTi β̂, respectively.
Similarly, let β̂I be the estimate of βI obtained from the regression of Y on xI,
and denote the corresponding residuals and fitted values by rI,i = Yi − xTI,i β̂I
and ŶI,i = xTI,i β̂I where i = 1, . . . , n.
Proposition 3.1. Suppose that a numerical variable selection method
suggests several submodels with k predictors, including a constant, where
2 ≤ k ≤ p.
a) The model I that minimizes Cp(I) maximizes corr(r, rI).
b) Cp(I) ≤ 2k implies that corr(r, rI) ≥ √(1 − p/n).
c) As corr(r, rI) → 1,

corr(xT β̂, xTI β̂I) = corr(ESP, ESP(I)) = corr(Ŷ, ŶI) → 1.

Proof. These results are a corollary of Proposition 3.2 below. □
Remark 3.1. Consider the model Ii that deletes the predictor xi. Then
the model has k = p − 1 predictors, including the constant, and the test
statistic is ti, where ti² = FIi.
Using Definition 3.8 and Cp(Ifull) = p, it can be shown that

Cp(Ii) = Cp(Ifull) + (ti² − 2).

Using the screen Cp(I) ≤ min(2k, p) suggests that the predictor xi should
not be deleted if |ti| > √2 ≈ 1.414.
If |ti| < √2, then the predictor can probably be deleted since Cp decreases.
The literature suggests using the Cp(I) ≤ k screen, but this screen eliminates
too many potentially useful submodels.
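The arithmetic in Remark 3.1 can be sketched directly; the t statistics below are hypothetical:

```python
# Remark 3.1: deleting predictor x_i from the full model moves Cp from
# Cp(I_full) = p to p + t_i^2 - 2, so Cp decreases exactly when
# |t_i| < sqrt(2). The t statistics below are made up.

from math import sqrt

p = 5
for t_i in (0.8, 1.414, 2.5):
    cp_after_delete = p + t_i**2 - 2
    keep = abs(t_i) > sqrt(2)
    print(t_i, round(cp_after_delete, 3), "keep" if keep else "can delete")
```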

More generally, it can be shown that Cp(I) ≤ 2k if and only if

FI ≤ p/(p − k).

Now k is the number of terms in the model, including a constant, while p − k
is the number of terms set to 0. As k → 0, the partial F test will reject Ho:
βO = 0 (i.e., say that the full model should be used instead of the submodel I)
unless FI is not much larger than 1. If p is very large and p − k is very small,
then the partial F test will tend to suggest that there is a model I that is
about as good as the full model even though model I deletes p − k predictors.

Definition 3.9. The fit–fit or FF plot is a plot of ŶI,i versus Ŷi while
a residual–residual or RR plot is a plot of rI,i versus ri. A response plot is a
plot of ŶI,i versus Yi. An EE plot is a plot of ESP(I) versus ESP. For MLR,
the EE and FF plots are equivalent.

Six graphs will be used to compare the full model and the candidate
submodel. Let β̂ be the estimate of β obtained from the regression of Y on all
of the terms x. Many numerical methods such as forward selection, backward
elimination, stepwise, and all subsets methods using the Cp(I) criterion
(Jones 1946; Mallows 1973) have been suggested for variable selection. We
will use the FF plot, RR plot, the response plots from the full and submodel,
and the residual plots (of the fitted values versus the residuals) from the full
and submodel. These six plots will contain a great deal of information about
the candidate subset provided that Equation (3.4) holds and that a good
estimator (such as OLS) for β and βI is used.
For these plots to be useful, it is crucial to verify that a multiple linear
regression (MLR) model is appropriate for the full model. Both the re-
sponse plot and the residual plot for the full model need to be
used to check this assumption. The plotted points in the response plot
should cluster about the identity line (that passes through the origin with
unit slope) while the plotted points in the residual plot should cluster about
the horizontal axis (the line r = 0). Any nonlinear patterns or outliers in
either plot suggest that an MLR relationship does not hold. Similarly, be-
fore accepting the candidate model, use the response plot and the residual
plot from the candidate model to verify that an MLR relationship holds for
the response Y and the predictors xI . If the submodel is good, then the
residual and response plots of the submodel should be nearly identical to the
corresponding plots of the full model. Assume that all submodels contain a
constant.

Application 3.2. To visualize whether a candidate submodel using
predictors xI is good, use the fitted values and residuals from the submodel and
full model to make an RR plot of the rI,i versus the ri and an FF plot of ŶI,i
versus Ŷi. Add the OLS line to the RR plot and the identity line to both plots as
visual aids. The subset I is good if the plotted points cluster tightly about
the identity line in both plots. In particular, the OLS line and the identity
line should nearly coincide so that it is difficult to tell that the two lines
intersect at the origin in the RR plot.

To verify that the six plots are useful for assessing variable selection,
the following notation will be useful. Suppose that all submodels include
a constant and that X is the full rank n × p design matrix for the full
model. Let the corresponding vectors of OLS fitted values and residuals
be Ŷ = X(XT X)⁻¹XT Y = HY and r = (I − H)Y, respectively.
Suppose that XI is the n × k design matrix for the candidate submodel
and that the corresponding vectors of OLS fitted values and residuals are
ŶI = XI(XTI XI)⁻¹XTI Y = HI Y and rI = (I − HI)Y, respectively.

A plot can be very useful if the OLS line can be compared to a reference
line and if the OLS slope is related to some quantity of interest. Suppose that
a plot of w versus z places w on the horizontal axis and z on the vertical axis.
Then denote the OLS line by z = a + bw. The following proposition shows
that the plotted points in the FF, RR, and response plots will cluster about
the identity line. Notice that the proposition is a property of OLS and holds
even if the data does not follow an MLR model. Let corr(x, y) denote the
correlation between x and y.

Proposition 3.2. Suppose that every submodel contains a constant and
that X is a full rank matrix.
Response Plot: i) If w = ŶI and z = Y, then the OLS line is the identity
line.
ii) If w = Y and z = ŶI, then the OLS line has slope b = [corr(Y, ŶI)]² =
R²(I) and intercept a = Ȳ(1 − R²(I)) where Ȳ = Σi Yi/n and R²(I) is
the coefficient of multiple determination from the candidate model.
FF or EE Plot: iii) If w = ŶI and z = Ŷ, then the OLS line is the identity
line. Note that ESP(I) = ŶI and ESP = Ŷ.
iv) If w = Ŷ and z = ŶI, then the OLS line has slope b = [corr(Ŷ, ŶI)]² =
SSR(I)/SSR and intercept a = Ȳ[1 − (SSR(I)/SSR)] where SSR is the
regression sum of squares.
RR Plot: v) If w = r and z = rI, then the OLS line is the identity line.
vi) If w = rI and z = r, then a = 0 and the OLS slope b = [corr(r, rI)]² and

corr(r, rI) = √(SSE/SSE(I)) = √[(n − p)/(Cp(I) + n − 2k)] = √[(n − p)/((p − k)FI + n − p)].

Proof: Recall that H and HI are symmetric idempotent matrices and
that HHI = HI. The mean of the OLS fitted values is equal to Ȳ and the
mean of the OLS residuals is equal to 0. If the OLS line from regressing z on w
is z = a + bw, then a = z̄ − b w̄ and

b = Σ(wi − w̄)(zi − z̄) / Σ(wi − w̄)² = corr(z, w) SD(z)/SD(w).

Also recall that the OLS line passes through the means of the two variables
(w̄, z̄).
(*) Notice that the OLS slope from regressing z on w is equal to one if
and only if the OLS slope from regressing w on z is equal to [corr(z, w)]².

i) The slope b = 1 if Σ ŶI,i Yi = Σ ŶI,i². This equality holds since ŶIT Y =
YT HI Y = YT HI HI Y = ŶIT ŶI. Since b = 1, a = Ȳ − Ȳ = 0.

ii) By (*), the slope

b = [corr(Y, ŶI)]² = R²(I) = Σ(ŶI,i − Ȳ)² / Σ(Yi − Ȳ)² = SSR(I)/SSTO.

The result follows since a = Ȳ − bȲ.

iii) The slope b = 1 if Σ ŶI,i Ŷi = Σ ŶI,i². This equality holds since
ŶT ŶI = YT HHI Y = YT HI Y = ŶIT ŶI. Since b = 1, a = Ȳ − Ȳ = 0.

iv) From iii),

1 = [SD(Ŷ)/SD(ŶI)] corr(Ŷ, ŶI).

Hence

corr(Ŷ, ŶI) = SD(ŶI)/SD(Ŷ)

and the slope

b = [SD(ŶI)/SD(Ŷ)] corr(Ŷ, ŶI) = [corr(Ŷ, ŶI)]².

Also the slope

b = Σ(ŶI,i − Ȳ)² / Σ(Ŷi − Ȳ)² = SSR(I)/SSR.

The result follows since a = Ȳ − bȲ.

v) The OLS line passes through the origin. Hence a = 0. The slope b =
rT rI / rT r. Since rT rI = YT (I − H)(I − HI)Y and (I − H)(I − HI) =
I − H, the numerator rT rI = rT r and b = 1.

vi) Again a = 0 since the OLS line passes through the origin. From v),

1 = √(SSE(I)/SSE) corr(r, rI).

Hence

corr(r, rI) = √(SSE/SSE(I))

and the slope

b = √(SSE/SSE(I)) corr(r, rI) = [corr(r, rI)]².

Algebra shows that

corr(r, rI) = √[(n − p)/(Cp(I) + n − 2k)] = √[(n − p)/((p − k)FI + n − p)]. □
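Parts v) and vi) of Proposition 3.2 can be checked numerically on a small made-up data set, using OLS fitted by the normal equations (the data values below are hypothetical):

```python
# Numerical check of Proposition 3.2 v) and vi): regressing the submodel
# residuals r_I on the full-model residuals r gives OLS slope 1, and
# corr(r, r_I) = sqrt(SSE/SSE(I)). Data are made up for illustration.

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def ols_resid(X, Y):
    """Residuals from OLS of Y on the columns of X (normal equations)."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    XtY = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(p)]
    bhat = solve(XtX, XtY)
    return [y - sum(bi * xi for bi, xi in zip(bhat, row)) for row, y in zip(X, Y)]

# hypothetical data: columns are the constant, x2, x3
X = [[1, 1, 2], [1, 2, 1], [1, 3, 4], [1, 4, 3], [1, 5, 7], [1, 6, 5]]
Y = [3.1, 4.0, 7.2, 8.1, 12.0, 12.8]
r = ols_resid(X, Y)                        # full model residuals
rI = ols_resid([row[:2] for row in X], Y)  # submodel: constant + x2

sse, sse_I = sum(ri**2 for ri in r), sum(ri**2 for ri in rI)
slope = sum(a * b for a, b in zip(r, rI)) / sse       # regress r_I on r
corr = sum(a * b for a, b in zip(r, rI)) / (sse * sse_I) ** 0.5
print(abs(slope - 1.0) < 1e-8, abs(corr - (sse / sse_I) ** 0.5) < 1e-8)
```

Both checks should print True: the slope is 1 because rT rI = rT r, and the second equality is part vi) with SSE = rT r and SSE(I) = rIT rI.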

A standard model selection procedure will often be needed to suggest
models. For example, forward selection or backward elimination could be
used. If p < 30, Furnival and Wilson (1974) provide a technique for selecting
a few candidate subsets after examining all possible subsets.

Remark 3.2. Daniel and Wood (1980, p. 85) suggest using Mallows'
graphical method for screening subsets by plotting k versus Cp(I) for models
close to or under the Cp = k line. Proposition 3.2 vi) implies that if Cp(I) ≤ k
or FI < 1, then corr(r, rI) and corr(ESP, ESP(I)) both go to 1.0 as n → ∞.
Hence models I that satisfy the Cp(I) ≤ k screen will contain the true model
S with high probability when n is large. This result does not guarantee that
the true model S will satisfy the screen, but overfit is likely. Let d be a lower
bound on corr(r, rI). Proposition 3.2 vi) implies that if

Cp(I) ≤ 2k + n[1/d² − 1] − p/d²,

then corr(r, rI) ≥ d. The simple screen Cp(I) ≤ 2k corresponds to

d = dn = √(1 − p/n).

To avoid excluding too many good submodels, consider models I with
Cp(I) ≤ min(2k, p). Models under both the Cp = k line and the Cp = 2k line
are of interest.
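The Cp cutoff in Remark 3.2 is easy to compute; the sketch below also checks that d = √(1 − p/n) recovers the simple Cp(I) ≤ 2k screen (n, p, k, and d are hypothetical):

```python
# Remark 3.2: the largest Cp(I) that still guarantees corr(r, r_I) >= d
# for a k-term submodel, and a check of the simple-screen special case.

from math import isclose, sqrt

def cp_cutoff(k, n, p, d):
    """Cp(I) <= this value implies corr(r, r_I) >= d."""
    return 2 * k + n * (1.0 / d**2 - 1.0) - p / d**2

n, p, k = 100, 10, 4
d_n = sqrt(1 - p / n)
assert isclose(cp_cutoff(k, n, p, d_n), 2 * k)  # recovers Cp(I) <= 2k
print(round(cp_cutoff(k, n, p, 0.95), 3))       # stricter d, smaller cutoff
```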

Rule of thumb 3.6. a) After using a numerical method such as forward
selection or backward elimination, let Imin correspond to the submodel with
the smallest Cp. Find the submodel II with the fewest number of predictors
such that Cp(II) ≤ Cp(Imin) + 1. Then II is the initial submodel that should
be examined. It is possible that II = Imin or that II is the full model. Do
not use more predictors than model II to avoid overfitting.
b) Models I with fewer predictors than II such that Cp(I) ≤ Cp(Imin) + 4
are interesting and should also be examined.
c) Models I with k predictors, including a constant, and with fewer predictors
than II such that Cp(Imin) + 4 < Cp(I) ≤ min(2k, p) should be checked
but often underfit: important predictors are deleted from the model. Underfit
is especially likely to occur if a predictor with one degree of freedom is
deleted (recall that if the c − 1 indicator variables corresponding to a factor
are deleted, then the factor has c − 1 degrees of freedom) and the jump in Cp
is large, greater than 4, say.
d) If there are no models I with fewer predictors than II such that Cp(I) ≤
min(2k, p), then model II is a good candidate for the best subset found by
the numerical procedure.

Rule of thumb 3.7. Assume that the full model has good response and
residual plots and that n ≥ 10p. Let subset I have k predictors, including a
constant. Know how to find good models from output. The following rules of
thumb (roughly in order of decreasing importance) may be useful. It is often
not possible to have all 10 rules of thumb hold simultaneously. Let Imin be
the minimum Cp model and let II be the model with the fewest predictors
satisfying Cp(II) ≤ Cp(Imin) + 1. Do not use more predictors than model II
to avoid overfitting. Then the submodel I is good if
i) the response and residual plots for the submodel look like the response
and residual plots for the full model,
ii) corr(ESP, ESP(I)) = corr(Ŷ, ŶI) ≥ 0.95.
iii) The plotted points in the FF plot (= EE plot for MLR) cluster tightly
about the identity line.
iv) Want the p-value ≥ 0.01 for the partial F test that uses I as the reduced
model.
v) The plotted points in the RR plot cluster tightly about the identity line.
vi) Want R²(I) > 0.9 R² and R²(I) > R² − 0.07 (recall that R²(I) ≤ R² =
R²(full) since adding predictors to I does not decrease R²(I)).
vii) Want Cp(Imin) ≤ Cp(I) ≤ min(2k, p) with no big jumps in Cp (the
increase should be less than four) as variables are deleted.
viii) Want hardly any predictors with p-values > 0.05.
ix) Want few predictors with p-values between 0.01 and 0.05.
x) Want MSE(I) to be smaller than or not much larger than the MSE from
the full model.
(If n ≥ 5p, use the above rules, but we want n ≥ 10k.)

The following description of forward selection and backward elimination
modifies the description of Section 1.3 slightly. Criteria such as AIC,
MSE(I), or R²A(I) are sometimes used instead of Cp. For forward selection,
the numerical method may add the predictor not yet in the model that has
the smallest p-value for the t test. For backward elimination, the numerical
method may delete the variable in the model (that is not a constant) that
has the largest p-value for the t test.

Forward selection. Step 1) k = 1: Start with a constant w1 = x1. Step
2) k = 2: Compute Cp for all models with k = 2 containing a constant and a
single predictor xi. Keep the predictor w2 = xj, say, that minimizes Cp.
Step 3) k = 3: Fit all models with k = 3 that contain w1 and w2. Keep the
predictor w3 that minimizes Cp. . . .
Step j) k = j: Fit all models with k = j that contain w1, w2, . . . , wj−1. Keep
the predictor wj that minimizes Cp. . . .
Step p) Fit the full model.
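The steps above can be sketched as a loop over the Cp criterion. In practice the SSE of every candidate subset comes from an OLS fit; here the SSE values are made up for illustration, and the printed path would then be screened with Rule of thumb 3.6.

```python
# Forward selection driven by Cp(I) = SSE(I)/MSE + 2k - n, where
# k = len(subset) counts the constant. SSE values are hypothetical.

n, p = 50, 4                     # predictors: x1 (constant), x2, x3, x4
sse = {                          # subset (always containing x1) -> SSE(I)
    (1,): 500.0, (1, 2): 210.0, (1, 3): 320.0, (1, 4): 260.0,
    (1, 2, 3): 200.0, (1, 2, 4): 150.0, (1, 2, 3, 4): 148.0,
}
mse_full = sse[(1, 2, 3, 4)] / (n - p)

def cp(subset):
    return sse[subset] / mse_full + 2 * len(subset) - n

model = (1,)                     # Step 1: start with the constant
while len(model) < p:
    # add the predictor not yet in the model that minimizes Cp
    candidates = [tuple(sorted(model + (j,))) for j in range(2, p + 1)
                  if j not in model]
    model = min(candidates, key=cp)
    print(model, round(cp(model), 2))
```

Note that Cp of the full model is always p by construction, so the last printed value is 4.0; the interesting submodels are the earlier ones on the path with small Cp.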

Backward elimination. All models contain a constant = u1. Step 0)
k = p: Start with the full model that contains x1, . . . , xp. We will also say
that the full model contains u1, . . . , up where u1 = x1 but ui need not equal
xi for i > 1.
Step 1) k = p − 1: Fit each model with k = p − 1 predictors, including a
constant. Delete the predictor up, say, that corresponds to the model with
the smallest Cp. Keep u1, . . . , up−1.
Step 2) k = p − 2: Fit each model with p − 2 predictors, including a
constant. Delete the predictor up−1 corresponding to the smallest Cp. Keep
u1, . . . , up−2. . . .
Step j) k = p − j: Fit each model with p − j predictors, including a
constant. Delete the predictor up−j+1 corresponding to the smallest Cp. Keep
u1, . . . , up−j. . . .
Step p − 2) k = 2: The current model contains u1, u2, and u3. Fit the model
u1, u2 and the model u1, u3. Assume that model u1, u2 minimizes Cp. Then
delete u3, and keep u1 and u2.

Heuristically, backward elimination tries to delete the variable that will
increase Cp the least. An increase in Cp greater than 4 (if the predictor has 1
degree of freedom) may be troubling in that a good predictor may have been
deleted. In practice, the backward elimination program may use some other
criterion: e.g., delete the variable such that the submodel I with j predictors
has a) the smallest Cp(I) or b) the biggest p-value in the test H0 : βi = 0
versus HA : βi ≠ 0 where the model with j + 1 terms from the previous step
(using the j predictors in I and the variable xj+1 ) is treated as the full model.
Heuristically, forward selection tries to add the variable that will decrease
Cp the most. A decrease in Cp less than 4 (if the predictor has 1 degree of
freedom) may be troubling in that a bad predictor may have been added.
In practice, the forward selection program may use some other criterion, e.g.,
add the variable such that the submodel I with j nontrivial predictors has
a) the smallest Cp(I) or b) the smallest p-value in the test H0 : βi = 0 versus
HA : βi ≠ 0 where the current model with j terms plus the predictor xi is
treated as the full model (for all variables xi not yet in the model).
Suppose that the full model is good and is stored in M1. Let M2, M3,
M4, and M5 be candidate submodels found after forward selection, backward
elimination, et cetera. Recall that ESP(I) = ŶI . Make a scatterplot matrix
of the ESPs for M1, M2, M3, M4, M5, and Y . Good candidates should have
3.4 Variable Selection 111

estimated sufficient predictors that are highly correlated with the full model
ESP (the correlation should be at least 0.9 and preferably greater than 0.95).
Similarly, make a scatterplot matrix of the residuals for M1, M2, M3, M4,
and M5.
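As a toy illustration of this correlation check (simulated data, not the text's examples; helper names are made up), fit a full model and a candidate submodel and correlate their fitted values (the ESPs):

```python
import numpy as np

def ols_fit(X, y):
    """OLS fitted values (the ESP for multiple linear regression)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

rng = np.random.default_rng(1)
n = 200
x1, x2 = rng.normal(size=(2, n))
y = 1 + 2 * x1 + rng.normal(scale=0.5, size=n)
X_full = np.column_stack([np.ones(n), x1, x2])
X_sub = np.column_stack([np.ones(n), x1])      # candidate submodel I
r = float(np.corrcoef(ols_fit(X_full, y), ols_fit(X_sub, y))[0, 1])
print(round(r, 3))  # rule of thumb: want at least 0.9, preferably > 0.95
```

Here x2 carries no information, so the submodel's ESP is nearly identical to the full model's and the correlation is close to 1.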

To summarize, the final submodel should have few predictors, few variables
with large OLS t test p-values (0.01 to 0.05 is borderline), good response and
residual plots, and an FF plot (= EE plot) that clusters tightly about the
identity line. If a factor has c − 1 indicator variables, either keep all c − 1
indicator variables or delete all c − 1 indicator variables; do not delete just
some of the indicator variables.

Example 3.7. The pollution data of McDonald and Schwing (1973) can
be obtained from STATLIB or the text's website. The response Y = mort
is the mortality rate, and most of the independent variables were related
to pollution. A scatterplot matrix of the first 9 predictors and Y was made,
and then a scatterplot matrix of the remaining predictors with Y . The log
rule suggested making the log transformation with 4 of the variables. The
summary output is shown below and on the following page. The response
and residual plots were good. Notice that p = 16 and n = 60 < 5p. Also
many p-values are too high.

Response = MORT
Label Estimate Std. Error t-value p-value
Constant 1881.11 442.628 4.250 0.0001
DENS 0.00296 0.00397 0.747 0.4588
EDUC -19.6669 10.7005 -1.838 0.0728
log[HC] -31.0112 15.5615 -1.993 0.0525
HOUS -0.40107 1.64372 -0.244 0.8084
HUMID -0.44540 1.06762 -0.417 0.6786
JANT -3.58522 1.05355 -3.403 0.0014
JULT -3.84292 2.12079 -1.812 0.0768
log[NONW] 27.2397 10.1340 2.688 0.0101
log[NOX] 57.3041 15.4764 3.703 0.0006
OVR65 -15.9444 8.08160 -1.973 0.0548
POOR 3.41434 2.74753 1.243 0.2206
POPN -131.823 69.1908 -1.905 0.0633
PREC 3.67138 0.77814 4.718 0.0000
log[SO] -10.2973 7.38198 -1.395 0.1700
WWDRK 0.88254 1.50954 0.585 0.5618

R Squared: 0.787346 Sigma hat: 33.2178


Number of cases: 60 Degrees of freedom: 44

Summary Analysis of Variance Table


Source df SS MS F p-value
Regression 15 179757. 11983.8 10.86 0.0000
Residual 44 48550.5 1103.42
Shown below this paragraph is some output from forward selection. The
minimum Cp model had Cp = 7.353 with 7 predictors, including a con-
stant. Deleting JANT from this model increased Cp to 17.763, suggesting
that JANT is an important predictor. Notice that Cp > 2k = 12 for the
model that deletes JANT.
Base terms: (log[NONW] EDUC log[SO] PREC)
df RSS | k C_I
Add: log[NOX] 54 72563.9 | 6 17.763
Add: JANT 54 72622. | 6 17.815
Add: HOUS 54 74884.8 | 6 19.866
Add: POPN 54 75350.2 | 6 20.288
Add: log[HC] 54 75373.4 | 6 20.309
Add: JULT 54 75405.8 | 6 20.338
Add: OVR65 54 75692.2 | 6 20.598
Add: HUMID 54 75747.4 | 6 20.648
Add: DENS 54 75872.1 | 6 20.761
Add: POOR 54 75938.4 | 6 20.821
Add: WWDRK 54 75971.8 | 6 20.851

Base terms: (log[NONW] EDUC log[SO] PREC log[NOX])


df RSS | k C_I
Add: JANT 53 58871. | 7 7.353
Add: log[HC] 53 69233.3 | 7 16.744
Add: HOUS 53 70774.1 | 7 18.141
Add: POPN 53 71424.7 | 7 18.730
Add: POOR 53 72049.4 | 7 19.296
Add: OVR65 53 72337.1 | 7 19.557
Add: JULT 53 72348.6 | 7 19.568
Add: WWDRK 53 72483.1 | 7 19.690
Add: DENS 53 72494.9 | 7 19.700
Add: HUMID 53 72563.9 | 7 19.763
Output for backward elimination is shown below, and the minimum Cp
model had Cp = 6.284 with 6 predictors, including a constant. Deleting
EDUC increased Cp to 10.800 > 2k = 10. Since Cp increased by more than
4, EDUC is probably important.
Current terms: (EDUC JANT log[NONW] log[NOX] OVR65 PREC)
df RSS | k C_I
Delete: OVR65 54 59897.9 | 6 6.284
Delete: EDUC 54 66809.3 | 6 12.547
Delete: log[NONW] 54 73178.1 | 6 18.319
Delete: JANT 54 76417.1 | 6 21.255
Delete: PREC 54 83958.1 | 6 28.089
Delete: log[NOX] 54 86823.1 | 6 30.685

Current terms: (EDUC JANT log[NONW] log[NOX] PREC)


df RSS | k C_I
Delete: EDUC 55 67088.1 | 5 10.800
Delete: JANT 55 76467.4 | 5 19.300
Delete: PREC 55 87206.7 | 5 29.033
Delete: log[NOX] 55 88489.6 | 5 30.196
Delete: log[NONW] 55 95327.5 | 5 36.393

Taking the minimum Cp model from backward elimination gives the output
shown below. The response and residual plots were OK, although the
correlation in the RR and FF plots was not real high. The R2 in the submodel
decreased from about 0.79 to 0.74, while σ̂ = √MSE was 33.22 for the
full model and 33.31 for the submodel. Removing nonlinearities from the
predictors by using two scatterplots and the log rule, and then using backward
elimination and forward selection, seems to be very effective for finding
the important predictors for this data set. See Problem 3.17 in order to
reproduce this example with the essential plots.
Response = MORT
Label Estimate Std. Error t-value p-value
Constant 943.934 82.2254 11.480 0.0000
EDUC -15.7263 6.17683 -2.546 0.0138
JANT -1.86899 0.48357 -3.865 0.0003
log[NONW] 33.5514 5.93658 5.652 0.0000
log[NOX] 21.7931 4.29248 5.077 0.0000
PREC 2.92801 0.59011 4.962 0.0000

R Squared: 0.737644 Sigma hat: 33.305


Number of cases: 60 Degrees of freedom: 54

Summary Analysis of Variance Table


Source df SS MS F p-value
Regression 5 168410. 33681.9 30.37 0.0000
Residual 54 59897.9 1109.22

Example 3.8. The FF and RR plots can be used as a diagnostic for
whether a given numerical method is including too many variables.
Gladstone (1905) attempts to estimate the weight of the human brain (measured
in grams after the death of the subject) using simple linear regression with
a variety of predictors including age in years, height in inches, head height
in mm, head length in mm, head breadth in mm, head circumference in mm,
and cephalic index. The sex (coded as 0 for females and 1 for males) of each
subject was also included. The variable cause was coded as 1 if the cause of

death was acute, 3 if the cause of death was chronic, and coded as 2 oth-
erwise. A variable ageclass was coded as 0 if the age was under 20, 1 if the
age was between 20 and 45, and as 3 if the age was over 45. Head size, the
product of the head length, head breadth, and head height, is a volume mea-
surement, hence (size)1/3 was also used as a predictor with the same physical
dimensions as the other lengths. Thus there are 11 nontrivial predictors and
one response, and all models will also contain a constant. Nine cases were
deleted because of missing values, leaving 267 cases.
Figure 3.7 shows the response plots and residual plots for the full model
and the final submodel that used a constant, (size)1/3 , age, and sex. The five
cases separated from the bulk of the data in each of the four plots correspond
to five infants. These may be outliers, but the visual separation reflects the
small number of infants and toddlers in the data. A purely numerical variable
selection procedure would miss this interesting feature of the data. We will
first perform variable selection with the entire data set, and then examine the
effect of deleting the five cases. Using forward selection and the Cp statistic
on the Gladstone data suggests the subset I5 containing a constant, (size)1/3 ,
age, sex, breadth, and cause with Cp (I5 ) = 3.199. The p-values for breadth
and cause were 0.03 and 0.04, respectively. The subset I4 that deletes cause
has Cp (I4 ) = 5.374 and the p-value for breadth was 0.05. Figure 3.8d shows
the RR plot for the subset I4 . Note that the correlation of the plotted points
is very high and that the OLS and identity lines nearly coincide.

Fig. 3.7 Gladstone data: comparison of the full model and the submodel. [Four
panels: a) Full Response Plot, b) Full Residual Plot, c) Sub Response Plot, d) Sub
Residual Plot; the horizontal axes are the fitted values FFIT or SFIT3.]

A scatterplot matrix of the predictors and response suggests that (size)1/3
might be the best single predictor. First we regressed Y = brain weight on the
eleven predictors described above (plus a constant) and obtained the residuals
ri and fitted values Ŷi . Next, we regressed Y on the subset I containing
(size)1/3 and a constant and obtained the residuals rI,i and the fitted values
ŶI,i . Then the RR plot of rI,i versus ri , and the FF plot of ŶI,i versus Ŷi ,
were constructed.

Fig. 3.8 Gladstone data: submodels added (size)1/3 , sex, age, and finally breadth.
[Four panels: a) RR Plot for (size)^(1/3), b) FF Plot for (size)^(1/3), c) RR Plot
for 2 Predictors, d) RR Plot for 4 Predictors.]

For this model, the correlation in the FF plot (Figure 3.8b) was very
high, but in the RR plot the OLS line did not coincide with the identity
line (Figure 3.8a). Next sex was added to I, but again the OLS and identity
lines did not coincide in the RR plot (Figure 3.8c). Hence age was added
to I. Figure 3.9a shows the RR plot with the OLS and identity lines added.
These two lines now nearly coincide, suggesting that a constant plus (size)1/3 ,
sex, and age contains the relevant predictor information. This subset has
Cp (I) = 7.372, R2(I) = 0.80, and σ̂I = 74.05. The full model which used
11 predictors and a constant has R2 = 0.81 and σ̂ = 73.58. Since the Cp
criterion suggests adding breadth and cause, the Cp criterion may be leading
to an overfit.
Figure 3.9b shows the FF plot. The five cases in the southwest corner
correspond to five infants. Deleting them leads to almost the same conclusions,
although the full model now has R2 = 0.66 and σ̂ = 73.48 while the submodel
has R2(I) = 0.64 and σ̂I = 73.89.

Fig. 3.9 Gladstone data with predictors (size)1/3 , sex, and age. [Two panels: a)
RR Plot (FRES versus SRES3), b) FF Plot (FFIT versus SFIT3).]

Fig. 3.10 RR and FF Plots for Rat Data. [Two panels: a) RR Plot (full$residual
versus sub$residual), b) FF Plot (ffit versus sfit).]

Example 3.9. Cook and Weisberg (1999a, pp. 261, 371) describe a data
set where rats were injected with a dose of a drug approximately proportional
to body weight. The data set is included as the file rat.lsp in the Arc software
and can be obtained from the website (www.stat.umn.edu/arc/). The
response Y is the fraction of the drug recovered from the rat's liver. The three
predictors are the body weight of the rat, the dose of the drug, and the liver
weight. The experimenter expected the response to be independent of the
predictors, and 19 cases were used. However, the Cp criterion suggests using
the model with a constant, dose, and body weight, both of whose coefficients
were statistically significant. The RR and FF plots are shown in Figure 3.10.
The identity line was added to both plots and the OLS line was added to the
RR plot. The upper right corner of the FF plot shows one outlier, the third
case, that is clearly separated from the rest of the data.
We deleted this case and again searched for submodels. The Cp statistic
is less than one for all three simple linear regression models, and the RR and
FF plots look the same for all submodels containing a constant. Figure 2.2
shows the RR plot where the residuals from the full model are plotted against
Y − Ȳ, the residuals from the model using no nontrivial predictors. This plot
suggests that the response Y is independent of the nontrivial predictors.

The point of this example is that a subset of outlying cases can cause
numeric second-moment criteria such as Cp to find structure that does not
exist. The FF and RR plots can sometimes detect these outlying cases,
allowing the experimenter to run the analysis without the influential cases. The
example also illustrates that global numeric criteria can suggest a model with
one or more nontrivial terms when in fact the response is independent of the
predictors.

Numerical variable selection methods for MLR are very sensitive to
influential cases such as outliers. Olive and Hawkins (2005) show that a plot
of the residuals versus Cook's distances (see Section 3.5) can be used to
detect influential cases. Such cases can also often be detected from response,
residual, RR, and FF plots.
Warning: deleting influential cases and outliers will often lead to
better plots and summary statistics, but the cleaned data may no
longer represent the actual population. In particular, the resulting
model may be very poor for both prediction and description.
Multiple linear regression data sets with cases that influence numerical
variable selection methods are common. Table 3.1 shows results for seven
interesting data sets. The first two rows correspond to the Ashworth (1842)
data, the next 2 rows correspond to the Gladstone data in Example 3.8, and
the next 2 rows correspond to the Gladstone data with the 5 infants deleted.
Rows 7 and 8 are for the Buxton (1920) data, while rows 9 and 10 are for
the Tremearne (1911) data. These data sets are available from the book's
website. Results from the final two data sets are given in the last 4 rows. The
last 2 rows correspond to the rat data described in Example 3.9. Rows 11
and 12 correspond to the ais data that comes with Arc (Cook and Weisberg
1999a).

The full model used p predictors, including a constant. The final submodel
I also included a constant, and the nontrivial predictors are listed in the
second column of Table 3.1. For a candidate submodel I, let Cp (I, c) denote
the value of the Cp statistic for the clean data that omits influential cases and
outliers. The third column lists p, Cp (I), and Cp (I, c), while the first column

Table 3.1 Summaries for Seven Data Sets

influential cases        submodel I                         p, Cp(I), Cp(I, c)
file, response           transformed predictors
14, 55                   log(x2)                            4, 12.665, 0.679
pop, log(y)              log(x1), log(x2), log(x3)
118, 234, 248, 258       (size)1/3, age, sex                10, 6.337, 3.044
cbrain, brnweight        (size)1/3
118, 234, 248, 258       (size)1/3, age, sex                10, 5.603, 2.271
cbrain-5, brnweight      (size)1/3
11, 16, 56               sternal height                     7, 4.456, 2.151
cyp, height              none
3, 44                    x2, x5                             6, 0.793, 7.501
major, height            none
11, 53, 56, 166          log(LBM), log(Wt), sex             12, 1.701, 0.463
ais, %Bfat               log(Ferr), log(LBM), log(Wt), Ht
3                        no predictors                      4, 6.580, 1.700
rat, y                   none

gives the set of influential cases. Two rows are presented for each data set. The
second row gives the response variable and any predictor transformations.
For example, for the Gladstone data p = 10 since there were 9 nontrivial
predictors plus a constant. Only the predictor size was transformed, and the
final submodel is the one given in Example 3.8. For the rat data, the final
submodel is the one given in Example 3.9: none of the 3 nontrivial predictors
was used.
Table 3.1 and simulations suggest that if the subset I has k predictors,
then using the Cp (I) ≤ min(2k, p) screen is better than using the
conventional Cp (I) ≤ k screen. The major and ais data sets show that deleting the
influential cases may increase the Cp statistic. Thus interesting models from
the entire data set and from the clean data set should be examined.

Example 3.10. Conjugated linoleic acid (CLA) occurs in beef and dairy
products and appears to have many human health benefits. Joanne Numrich
provided four data sets where the response was the amount of CLA (or
related compounds), and the explanatory variables were feed components from
the cattle diet. The data was to be used for descriptive and exploratory
purposes. Several data sets had outliers with unusually high levels of CLA. These
outliers were due to one researcher and may be the most promising cases in
the data set. However, to describe the bulk of the data with OLS MLR,
the outliers were omitted. In one of the data sets there are 33 cases and 25
predictors, including a constant. Regressing Y on all of the predictors gave
R2 = 0.84 and an ANOVA F test p-value of 0.223, suggesting that none of the
predictors are useful. From Proposition 2.5, an R2 > (p − 1)/(n − 1) = 0.75
is not very surprising. Remarks above Theorem 2.7 help explain why R2 can
be high with a high ANOVA F test p-value.
Of course just fitting the data to the collected variables is a poor way
to proceed. Only variables x1 , x2 , x5 , x6 , x20 , and x21 took on more than a
few values. Taking log(Y ) and using variables x2 , x9 , x23 , and x24 seemed to
result in an adequate model, although the number of distinct fitted values
was rather small. See Problem 3.18 for more details.

3.4.1 Bootstrapping Variable Selection

The bootstrap will be described and then applied to variable selection.
Suppose there is data w1 , . . . , wn collected from a distribution with cdf F into
an n × p matrix W . The empirical distribution, with cdf Fn , gives each
observed data case wi probability 1/n. Let the statistic Tn = t(W ) = t(Fn )
be computed from the data. Suppose the statistic estimates θ = t(F ). Let
t(W *) = t(F*n ) = T*n indicate that t was computed from an iid sample from
the empirical distribution Fn : a sample of size n was drawn with replacement
from the observed sample w1 , . . . , wn .
Some notation is needed to give the Olive (2013a) prediction region used
to bootstrap a hypothesis test. Suppose w1 , . . . , wn are iid p × 1 random
vectors with mean μ and nonsingular covariance matrix Σw . Let a future
test observation wf be independent of the wi but from the same distribution.
Let (w̄, S) be the sample mean and sample covariance matrix where

w̄ = (1/n) Σi=1..n wi   and   S = Sw = (1/(n − 1)) Σi=1..n (wi − w̄)(wi − w̄)T .   (3.8)

Then the ith squared sample Mahalanobis distance is the scalar

D²w = D²w(w̄, S) = (w − w̄)T S⁻¹ (w − w̄).   (3.9)

Let D²i = D²wi for each observation wi . Let D(c) be the cth order statistic of
D1 , . . . , Dn . Consider the hyperellipsoid

An = {w : D²w(w̄, S) ≤ D²(c) } = {w : Dw(w̄, S) ≤ D(c) }.   (3.10)
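A numpy sketch of region (3.10) follows (illustrative helper names, not the book's software; assumes S is nonsingular and uses c = ⌈n(1 − δ)⌉):

```python
import numpy as np

def prediction_region(W, delta=0.05):
    """Cutoff D_(c) for region (3.10) with c = ceil(n(1 - delta)).
    Returns (wbar, inverse of S, cutoff)."""
    n = W.shape[0]
    wbar = W.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(W, rowvar=False))   # assumes S is nonsingular
    dev = W - wbar
    d = np.sqrt(np.einsum('ij,jk,ik->i', dev, S_inv, dev))  # D_i = D_{w_i}(wbar, S)
    c = int(np.ceil(n * (1 - delta)))
    return wbar, S_inv, float(np.sort(d)[c - 1])

def in_region(w, wbar, S_inv, cutoff):
    """Is w inside the hyperellipsoid {w : D_w(wbar, S) <= D_(c)}?"""
    return bool(np.sqrt((w - wbar) @ S_inv @ (w - wbar)) <= cutoff)
```

With δ = 0.05, about 95% of the training points fall inside the region by construction.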

If n is large, we can use c = kn = ⌈n(1 − δ)⌉. If n is not large, using c =
Un where Un decreases to kn can improve small sample performance. Olive
(2013a) showed that (3.10) is a large sample 100(1 − δ)% prediction region
for a large class of distributions, although regions with smaller volumes may
exist. Note that the result follows since if w̄ and S are nonsingular, then the
Mahalanobis distance is a continuous function of (w̄, S). Let D = D(μ, Σw ).
Then Di converges in distribution to D and D²i converges in distribution to
D². Hence the sample percentiles of the Di are consistent estimators of the
population percentiles of D at continuity points of the cumulative distribution
function (cdf) of D. Prediction region (3.10) estimates the highest density
region for a large class of elliptically contoured distributions. Some of the
above terms appear in Chapter 10.
Definition 3.10. Given training data w1 , . . . , wn , a large sample 100(1 −
δ)% prediction region for a future test value wf is a set An such that P(wf ∈
An ) → 1 − δ as n → ∞, while a large sample confidence region for a parameter
θ is a set An such that P(θ ∈ An ) → 1 − δ as n → ∞.
When teaching confidence intervals, it is often noted that by the central
limit theorem, the probability that Ȳn is within two standard deviations
(2SD(Ȳn ) = 2σ/√n) of μ is about 95%. Hence the probability that μ
is within two standard deviations of Ȳn is about 95%. Thus the interval
[μ − 1.96S/√n, μ + 1.96S/√n ] is a large sample 95% prediction interval for a
future value of the sample mean Ȳn,f if μ is known, while
[Ȳn − 1.96S/√n, Ȳn + 1.96S/√n ] is a large sample 95% confidence interval
for the population mean μ. Note that the lengths of the two intervals are
the same. Where the interval is centered determines whether the interval is a
confidence or a prediction interval. Here S is the sample standard deviation.
The following theorem shows that the hyperellipsoid Rc centered at the
statistic Tn is a large sample 100(1 − δ)% confidence region for θ, but the
hyperellipsoid Rp centered at known θ is a large sample 100(1 − δ)% prediction
region for a future value of the statistic Tf,n .
Theorem 3.3. Let the 100(1 − δ)th percentile D²1−δ be a continuity point
of the distribution of D². Assume that D²θ(Tn , ΣT ) converges in distribution
to D², that D²θ(Tn , Σ̂T ) converges in distribution to D², and that D̂²1−δ
converges in probability to D²1−δ where P(D² ≤ D²1−δ ) = 1 − δ. i) Then
Rc = {w : D²w(Tn , Σ̂T ) ≤ D̂²1−δ } is a large sample 100(1 − δ)% confidence
region for θ, and if θ is known, then Rp = {w : D²w(θ, Σ̂T ) ≤ D̂²1−δ } is a
large sample 100(1 − δ)% prediction region for a future value of the statistic
Tf,n . ii) Region Rc contains θ iff region Rp contains Tn .
Proof: i) Note that D²θ(Tn , Σ̂T ) = D²Tn(θ, Σ̂T ). Thus the probability
that Rc contains θ is P(D²θ(Tn , Σ̂T ) ≤ D̂²1−δ ) → 1 − δ, and the probability
that Rp contains Tf,n is P(D²(Tf,n , Σ̂T ) ≤ D̂²1−δ ) → 1 − δ, as n → ∞.
ii) D²θ(Tn , Σ̂T ) ≤ D̂²1−δ iff D²Tn(θ, Σ̂T ) ≤ D̂²1−δ . □

Hence if there was an iid sample T1,n , . . . , TB,n of the statistic, the Olive
(2013a) large sample 100(1 − δ)% prediction region {w : D²w(T̄, ST ) ≤ D²(c) }
for Tf,n contains E(Tn ) ≈ θ with asymptotic coverage 1 − δ. To make the
asymptotic coverage equal to 1 − δ, use the large sample 100(1 − δ)% confidence
region {w : D²w(T1,n , ST ) ≤ D²(c) }. The prediction region method bootstraps
this procedure by using a bootstrap sample of the statistic T*1,n , . . . , T*B,n .
Centering the region at T1,n instead of T̄* is not needed since the bootstrap
sample is centered near Tn : the distribution of √n(Tn − θ) is approximated
by the distribution of √n(T* − Tn ) or by the distribution of √n(T* − T̄*).

Consider testing H0 : θ = c versus H1 : θ ≠ c where c is a known r × 1
vector. If a confidence region can be constructed for θ − c, then fail to reject
H0 if 0 is in the confidence region, and reject H0 if 0 is not in the confidence
region.
The prediction region method makes a bootstrap sample wi = θ̂*i − c
for i = 1, . . . , B. Make the prediction region (3.10) for the wi , and reject H0
if 0 is not in the prediction region. As shown below, the prediction region
method is a special case of the percentile method, and a special case of
bootstrapping a test statistic.
For p = 1, the percentile method uses an interval that contains UB ≈
kB = ⌈B(1 − δ)⌉ of the T*i,n from a bootstrap sample T*1,n , . . . , T*B,n where
the statistic Ti,n is an estimator of θ based on a sample of size n. Often
the n is suppressed. Here ⌈x⌉ is the smallest integer ≥ x, e.g. ⌈7.8⌉ = 8.
Let T*(1) , T*(2) , . . . , T*(B) be the order statistics of the bootstrap sample. Then
one version of the percentile method discards the largest and smallest ⌈Bδ/2⌉
order statistics, resulting in an interval [LB , RB ] that is a large sample
100(1 − δ)% confidence interval (CI) for θ, and also a large sample 100(1 − δ)%
prediction interval (PI) for a future bootstrap value T*f,n .
Olive (2016a,b, 2014: p. 283) recommended using the shorth(c) estimator
for the percentile method. The shorth interval tends to be shorter than the
interval that deletes the smallest and largest ⌈Bδ/2⌉ observations Wi when
the Wi do not come from a symmetric distribution. Frey (2013) showed that
for large B and iid data, the shorth(kB ) PI has maximum undercoverage
≈ 1.12√(δ/B), and used the shorth(c) estimator as the large sample 100(1 − δ)%
prediction interval where c = min(B, ⌈B[1 − δ + 1.12√(δ/B)]⌉). Hence if
δ = 0.05 and B = 1000, there may be about 1% undercoverage using c = kB =
⌈B(1 − δ)⌉.
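A minimal Python sketch of the shorth(c) interval with Frey's choice of c follows (the function name is made up; assumes an iid sample of bootstrap values):

```python
import numpy as np

def shorth_ci(x, delta=0.05):
    """shorth(c) interval: the shortest interval containing c order statistics,
    with the Frey (2013) correction c = min(B, ceil(B[1 - delta + 1.12*sqrt(delta/B)]))."""
    x = np.sort(np.asarray(x, dtype=float))
    B = len(x)
    c = min(B, int(np.ceil(B * (1 - delta + 1.12 * np.sqrt(delta / B)))))
    widths = x[c - 1:] - x[:B - c + 1]   # widths of all intervals of c order stats
    i = int(np.argmin(widths))
    return float(x[i]), float(x[i + c - 1])
```

For skewed bootstrap distributions this interval can be noticeably shorter than the percentile interval that discards the smallest and largest ⌈Bδ/2⌉ values.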
Consider testing H0 : θ = c versus H1 : θ ≠ c, and the statistic Ti = θ̂i − c.
If E(Ti ) = μ and Cov(Ti ) = ΣT were known, then the squared Mahalanobis
distance D²i(μ, ΣT ) = (Ti − μ)T (ΣT )⁻¹ (Ti − μ) would be a natural
statistic to use if the percentile D²1−δ(μ, ΣT ) was known. The prediction
region method bootstraps the squared Mahalanobis distances, forming the
bootstrap sample wi = T*i = θ̂*i − c and the squared Mahalanobis distances

D²i = D²i(T̄*, S*T ) = (T*i − T̄*)T [S*T ]⁻¹ (T*i − T̄*)

where T̄* = (1/B) Σi=1..B T*i and S*T = (1/(B − 1)) Σi=1..B (T*i − T̄*)(T*i − T̄*)T
are the sample mean and sample covariance matrix of T*1 , . . . , T*B . Then the
percentile method that contains the smallest UB ≈ B(1 − δ) distances is used
to get the closed interval [0, D(UB ) ].

If H0 is true and E[θ̂*] ≈ c, then μ ≈ 0. Let D²0 = T̄*T [S*T ]⁻¹ T̄* and fail
to reject H0 if D0 ≤ D(UB ) and reject H0 if D0 > D(UB ) . This percentile
method is equivalent to computing the prediction region (3.10) on the wi = T*i
and checking whether 0 is in the prediction region.
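The test can be sketched as follows (illustrative code, not the book's software; W holds the bootstrap sample wi = θ̂*i − c as rows):

```python
import numpy as np

def prediction_region_test(W, delta=0.05):
    """Reject H0: theta = c if 0 falls outside the prediction region (3.10)
    computed on the rows w_i* = thetahat_i* - c of W."""
    B = W.shape[0]
    wbar = W.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(W, rowvar=False))
    dev = W - wbar
    d = np.sqrt(np.einsum('ij,jk,ik->i', dev, S_inv, dev))
    UB = int(np.ceil(B * (1 - delta)))
    cutoff = np.sort(d)[UB - 1]          # D_(UB)
    d0 = float(np.sqrt(wbar @ S_inv @ wbar))  # distance of 0 from wbar
    return bool(d0 > cutoff)             # True => reject H0
```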
Methods for bootstrapping the multiple linear regression model are well
known. The estimated covariance matrix of the (ordinary) least squares
estimator is
Ĉov(β̂OLS ) = MSE (X T X)⁻¹ .
The residual bootstrap computes the least squares estimator and obtains the
n residuals and fitted values r1 , . . . , rn and Ŷ1 , . . . , Ŷn . Then a sample of size
n is selected with replacement from the residuals resulting in r*11 , . . . , r*n1 .
Hence the empirical distribution of the residuals is used. Then a vector Y*1 =
(Y*11 , . . . , Y*n1 )T is formed where Y*i1 = Ŷi + r*i1 . Then Y*1 is regressed on X
resulting in the estimator β̂*1 . This process is repeated B times resulting in
the estimators β̂*1 , . . . , β̂*B . This method should have n ≥ 10p so that the
residuals ri are close to the errors ei .
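A numpy sketch of the residual bootstrap just described (illustrative function name, not the book's software):

```python
import numpy as np

def residual_bootstrap(X, y, B=1000, seed=None):
    """Residual bootstrap for OLS: resample residuals with replacement,
    form Y* = fitted + r*, and refit. Returns a (B, p) array of beta*."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    betas = np.empty((B, p))
    for b in range(B):
        y_star = fitted + rng.choice(resid, size=n, replace=True)
        betas[b], *_ = np.linalg.lstsq(X, y_star, rcond=None)
    return betas
```

The sample standard deviations of the columns of the returned array give residual bootstrap standard errors.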
Efron (1982, p. 36) notes that for the residual bootstrap, the sample
covariance matrix of the β̂*i is estimating the population bootstrap matrix
((n − p)/n) MSE (X T X)⁻¹ as B → ∞. Hence the residual bootstrap standard
error SE(β̂*i ) ≈ √((n − p)/n) SE(β̂i,OLS ).
If the z i = (Yi , xTi )T are iid observations from some population, then a
sample of size n can be drawn with replacement from z 1 , . . . , z n . Then the
response and predictor variables can be formed into vector Y*1 and design
matrix X*1 . Then Y*1 is regressed on X*1 resulting in the estimator β̂*1 . This
process is repeated B times resulting in the estimators β̂*1 , . . . , β̂*B . If the
z i are the rows of a matrix Z, then this nonparametric bootstrap uses the
empirical distribution of the z i .
Following Seber and Lee (2003, p. 100), the classical test statistic for
testing H0 : Aβ = c, where A is a full rank r × p matrix, is

FR = (Aβ̂ − c)T [MSE A(X T X)⁻¹AT ]⁻¹ (Aβ̂ − c) / r,

and when H0 is true, rFR converges in distribution to χ²r for a large class of
error distributions. The sample covariance matrix S w of the wi = Aβ̂*i − c
is estimating ((n − p)/n) MSE A(X T X)⁻¹AT , and w̄ converges in probability
to 0 when H0 is true. Thus under H0 , the squared distance

D²i = (wi − w̄)T S⁻¹w (wi − w̄) ≈ (n/(n − p)) (Aβ̂*i − c)T [MSE A(X T X)⁻¹AT ]⁻¹ (Aβ̂*i − c),

and we expect D²(UB) ≈ (n/(n − p)) χ²r,1−δ for large n and B, and p << n.

Returning to variable selection, suppose model I is selected. Then least
squares output for the model Y = X I βI + e can be obtained, but the least
squares output is not correct for inference. In particular, MSE(I)(X TI X I )⁻¹
is not the correct estimated covariance matrix of β̂I . The selected model tends
to fit the data too well, so SE(β̂i ) from the incorrect estimated covariance
matrix tends to be too small. Hence the confidence intervals for βi are too
short, and hypothesis tests reject H0 : βi = 0 too often.
Hastie et al. (2009, p. 57) note that variable selection is a shrinkage
estimator: the coefficients are shrunk to 0 for the omitted variables. Suppose
n ≥ 10p. If β̂I is k × 1, form β̂I,0 from β̂I by adding 0s corresponding to
the omitted variables. Then β̂I,0 is a nonlinear estimator of β, and the
residual bootstrap method can be applied. For example, suppose β̂ = β̂Imin ,0
is formed from model Imin that minimizes Cp from some variable selection
method such as forward selection, backward elimination, stepwise selection,
or all subsets variable selection. Instead of computing the least squares
estimator from regressing Y*i on X, perform variable selection on Y*i and X,
fit the model that minimizes the criterion, and add 0s corresponding to the
omitted variables, resulting in estimators β̂*1 , . . . , β̂*B .
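Combining the residual bootstrap with min-Cp selection gives a sketch of this variable selection bootstrap (a toy all-subsets search, illustrative only and feasible only for small p; all function names are made up):

```python
import numpy as np
from itertools import combinations

def min_cp_beta(X, y):
    """beta_hat_{Imin,0}: fit the min-C_p subset (constant always kept),
    then pad the omitted coefficients with zeroes."""
    n, p = X.shape

    def fit(cols):
        b, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        r = y - X[:, cols] @ b
        return float(r @ r), b

    mse_full = fit(list(range(p)))[0] / (n - p)
    best_cp, best = np.inf, None
    for k in range(0, p):                        # k nontrivial predictors
        for extra in combinations(range(1, p), k):
            cols = [0] + list(extra)
            s, b = fit(cols)
            cp = s / mse_full + 2 * len(cols) - n
            if cp < best_cp:
                best_cp, best = cp, (cols, b)
    cols, b = best
    beta0 = np.zeros(p)
    beta0[cols] = b                              # zeroes for omitted variables
    return beta0

def vs_bootstrap(X, y, B=200, seed=None):
    """Residual bootstrap of the variable selection estimator."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted, resid = X @ beta, y - X @ beta
    return np.array([min_cp_beta(X, fitted + rng.choice(resid, n, True))
                     for _ in range(B)])
```

Each row of the returned array is a β̂*Imin,0 with exact zeroes in the omitted slots, so the bootstrap sample directly exhibits the point mass at 0 that arises for weak predictors.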
Suppose the variable selection method, such as forward selection or all
subsets, produces K models. Let model Imin be the model that minimizes
the criterion, e.g. Cp (I) or AIC(I). Following Seber and Lee (2003, p. 448)
and Nishii (1984), the probability that model Imin from Cp or AIC underfits
goes to zero as n → ∞. Since there are a finite number of regression models
I that contain the true model, and each model gives a consistent estimator
β̂I,0 of β, the probability that Imin picks one of these models goes to one as
n → ∞. Hence β̂Imin ,0 is a consistent estimator of β under model (3.4).
D
Note that if S I, and Y = X I I + eI , then n( I I )
Nk (0, I2 W I ) under mild regularity conditions where n(X TI X I )1 W I .
D
Hence n( I,0 ) Np (0, I2 W I,0 ) where the W I,0 has a column and
row of zeroes added for each variable not in I. Note that W I,0 is singular
unless I corresponds to the full model. For example, if p = 3 and model I
uses a constant and x3 with

  W11 0 W12
W11 W12
WI = , then W I,0 = 0 0 0 .
W21 W22
W21 0 W22
Hence it is reasonable to conjecture that √n(β̂Imin ,0 − β) converges in
distribution to U where

U = Σi=1..K πi Np (0, σ²Ii W Ii ,0 ),

0 ≤ πi ≤ 1, Σi=1..K πi = 1, and K is the number of subsets Ii that contain S.
Inference techniques for the variable selection model have not had much
success. Efron (2014) lets t(Z) be a scalar valued statistic, based on all of the
data Z, that estimates a parameter of interest $\theta$. Form a bootstrap sample
$Z^*_i$ and $t(Z^*_i)$ for $i = 1, \ldots, B$. Then $\tilde{\theta} = s(Z) = \frac{1}{B}\sum_{i=1}^{B} t(Z^*_i)$, a bootstrap
smoothing or bagging estimator. In the regression setting with variable
selection, $Z^*_i$ can be formed with the nonparametric or residual bootstrap
using the full model. The prediction region method can also be applied
to t(Z). For example, when A is $1 \times p$, the prediction region method uses
$\theta = A\beta$, $t(Z) = A\hat{\beta}$, and $T^* = A\hat{\beta}^*$. Efron (2014) used the confidence
interval $T \pm z_{1-\delta}SE(T)$, which is symmetric about T. The prediction region
method uses $\overline{T}^* \pm S_T D_{(U_B)}$, which is also a symmetric interval centered
at $\overline{T}^*$. If both the prediction region method and Efron's method are large
sample confidence intervals for $\theta$, then they have the same asymptotic length
(scaled by multiplying by $\sqrt{n}$), since otherwise the shorter interval will have
lower asymptotic coverage. Since the prediction region interval is a percentile
interval, the shorth(c) interval could have much shorter length than both the
Efron interval and the prediction region interval if the bootstrap distribution
is not symmetric.
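The shorth(c) interval can be sketched numerically. The following Python sketch is illustrative only (the book's own software is in R): the function name and the simple choice $c = \lceil B(1-\delta)\rceil$ are assumptions, and the small-sample correction to c used by the actual shorth3 function is omitted. The sketch finds the shortest closed interval containing at least c of the B bootstrap values.

```python
import math

def shorth(values, delta=0.05):
    # Shortest closed interval containing at least c = ceil(B(1 - delta))
    # of the B sorted bootstrap values (illustrative; no small-sample
    # correction to c).
    z = sorted(values)
    B = len(z)
    c = math.ceil(B * (1 - delta))
    d_best = 0
    for d in range(B - c + 1):
        if z[d + c - 1] - z[d] < z[d_best + c - 1] - z[d_best]:
            d_best = d
    return (z[d_best], z[d_best + c - 1])
```

For a bootstrap sample that is mostly zeroes, the interval collapses to the closed interval [0, 0], which is why the distinction between closed and open intervals matters in the discussion that follows.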
The prediction region method can be used for vector valued statistics and
parameters, and may not need the statistic to be asymptotically normal.
These features are likely useful for variable selection models. Prediction
intervals and regions can have higher than the nominal coverage $1 - \delta$ if the
distribution is discrete or a mixture of a discrete distribution and some other
distribution. In particular, coverage can be high if the $w_i$ distribution is a
mixture of a point mass at 0 and the method checks whether 0 is in the
prediction region. Such a mixture often occurs for variable selection methods.
The bootstrap sample for the $W_i = \hat{\beta}^*_{ij}$ can contain many zeroes and be
highly skewed if the jth predictor is weak. Then the computer program may
fail because $S^*_w$ is singular, but if all or nearly all of the $\hat{\beta}^*_{ij} = 0$, then there
is strong evidence that the jth predictor is not needed given that the other
predictors are in the variable selection method.

As an extreme simulation case, suppose $\hat{\beta}^*_{ij} = 0$ for $i = 1, \ldots, B$ and for
each run in the simulation. Consider testing $H_0: \beta_j = 0$. Then regardless of
the nominal coverage $1 - \delta$, the closed interval [0,0] will contain 0 for each
run and the observed coverage will be $1 > 1 - \delta$. Using the open interval
(0,0) would give observed coverage 0. Also intervals [0, b] and [a, 0] correctly
suggest failing to reject $H_0: \beta_j = 0$, while intervals (0, b) and (a, 0) incorrectly
suggest rejecting $H_0: \beta_j = 0$. Hence closed regions and intervals make sense.
Olive (2016a) showed that applying the prediction region method results
in a large sample $100(1 - \delta)\%$ confidence region for $\theta$ for a wide variety of
problems, and used the method for variable selection where $\theta = A\beta$.
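For a scalar parameter the prediction region method can be sketched as follows. This Python sketch is illustrative only: the function name is an assumption, the book's predreg function is in R, and the small-sample correction to the cutoff index $U_B$ is omitted here, so the cutoff below is simply an upper order statistic of the scaled distances.

```python
import math
import statistics

def predreg_1d(boot, delta=0.05):
    # Distances of the bootstrap values from their mean, scaled by the
    # bootstrap standard deviation; the cutoff plays the role of D_(U_B).
    B = len(boot)
    tbar = sum(boot) / B
    s = statistics.stdev(boot)
    dists = sorted(abs(t - tbar) / s for t in boot)
    cutoff = dists[math.ceil(B * (1 - delta)) - 1]
    d0 = abs(0.0 - tbar) / s      # distance of 0 from the bootstrap center
    return d0, cutoff             # fail to reject H0: theta = 0 if d0 <= cutoff
```

When the bootstrap sample is centered near 0, the distance $D_0$ falls inside the cutoff and the test fails to reject $H_0: \theta = 0$; shifting the sample away from 0 reverses the conclusion.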

Example 3.11. Cook and Weisberg (1999a, pp. 351, 433, 447) give a
data set on 82 mussels sampled off the coast of New Zealand. Let the response
variable be the logarithm log(M) of the muscle mass, and the predictors are
the length L and height H of the shell in mm, the logarithm log(W) of the shell
width W, the logarithm log(S) of the shell mass S, and a constant. Inference
for the full model is shown along with the shorth(c) nominal 95% confidence
intervals for $\beta_i$ computed using the nonparametric and residual bootstraps.
As expected, the residual bootstrap intervals are close to the classical least
squares confidence intervals $\hat{\beta}_i \pm 2SE(\hat{\beta}_i)$.
The minimum $C_p$ model from all subsets variable selection uses a constant,
H, and log(S). The shorth(c) nominal 95% confidence intervals for $\beta_i$ using
the residual bootstrap are shown. Note that the interval for H is right skewed
and contains 0 when closed intervals are used instead of open intervals. The
least squares output is also shown, but should only be used for inference if
the model was selected before looking at the data.
It was expected that log(S) may be the only predictor needed, along with
a constant, since log(S) and log(M) are both log(mass) measurements and
likely highly correlated. Hence we want to test $H_0: \beta_2 = \beta_3 = \beta_4 = 0$ with
the $I_{min}$ model selected by all subsets variable selection. (Of course this test
would be easy to do with the full model using least squares theory.) Then
$H_0: A\beta = (\beta_2, \beta_3, \beta_4)^T = 0$. Using the prediction region method with the
full model gave an interval [0, 2.930] with $D_0 = 1.641$. Note that $\sqrt{\chi^2_{3,0.95}} = 2.795$. So fail to reject $H_0$. Using the prediction region method with the $I_{min}$
variable selection model had $[0, D_{(U_B)}] = [0, 3.293]$ while $D_0 = 1.134$. So fail
to reject $H_0$.

large sample full model inference


      Est.   SE     t  Pr(>|t|) rowboot        resboot
int -1.249 0.838 -1.49 0.14 [-2.93,-0.048][-3.138,0.194]
L -0.001 0.002 -0.28 0.78 [-0.005,0.003][-0.005,0.004]
W 0.130 0.374 0.35 0.73 [-0.384,0.827][-0.555,0.971]
H 0.008 0.005 1.50 0.14 [-0.002,0.018][-0.003,0.017]
S 0.640 0.169 3.80 0.00 [ 0.188,1.001][ 0.276,0.955]
output and shorth intervals for the min Cp submodel
Est. SE t Pr(>|t|) 95% shorth CI
int -0.9573 0.1519 -6.3018 0.0000 [-2.769, 0.460]
L 0 [-0.004, 0.004]
W 0 [-0.595, 0.869]
H 0.0072 0.0047 1.5490 0.1254 [ 0.000, 0.016]
S 0.6530 0.1160 5.6297 0.0000 [ 0.324, 0.913]
The R code used to produce the above output is shown below.

library(leaps)
y <- log(mussels[,5]); x <- mussels[,1:4]
x[,4] <- log(x[,4]); x[,2] <- log(x[,2])
out <- regboot(x,y,B=1000)
tem <- rowboot(x,y,B=1000)
outvs <- vselboot(x,y,B=1000) #get bootstrap CIs,
apply(out$betas,2,shorth3);
apply(tem$betas,2,shorth3);
apply(outvs$betas,2,shorth3)
ls.print(outvs$full)
ls.print(outvs$sub)
#test if beta_2 = beta_3 = beta_4 = 0
Abeta <- out$betas[,2:4]
#prediction region method with residual bootstrap
predreg(Abeta)
Abeta <- outvs$betas[,2:4]
#prediction region method with Imin
predreg(Abeta)

Example 3.12. Consider the Gladstone (1905) data set where the variables
are as in Problem 3.6. Output is shown below for the full model and the
bootstrapped minimum $C_p$ forward selection estimator. Note that the shorth
intervals for length and sex are quite long. These variables are often in and
often deleted from the bootstrap forward selection model. Output for $I_I$ is
also shown. For this data set, $I_I = I_{min}$.

large sample full model inference for Ex. 3.12


Estimate SE t Pr(>|t|) resboot
Int -3021.255 1701.070 -1.77 0.077 [-6549.8,322.79]
age -1.656 0.314 -5.27 0.000 [ -2.304,-1.050]
breadth -8.717 12.025 -0.72 0.469 [-34.229,14.458]
cephalic 21.876 22.029 0.99 0.322 [-20.911,67.705]
circum 0.852 0.529 1.61 0.109 [ -0.065, 1.879]
headht 7.385 1.225 6.03 0.000 [ 5.138, 9.794]
height -0.407 0.942 -0.43 0.666 [ -2.211, 1.565]
len 13.475 9.422 1.43 0.154 [ -5.519,32.605]
sex 25.130 10.015 2.51 0.013 [ 6.717,44.19]
output and shorth intervals for the min Cp submodel
Estimate SE t Pr(>|t|) 95% shorth CI
Int -1764.516 186.046 -9.48 0.000 [-6151.6,-415.4]
age -1.708 0.285 -5.99 0.000 [ -2.299,-1.068]
breadth 0 [-32.992, 8.148]
cephalic 5.958 2.089 2.85 0.005 [-10.859,62.679]
circum 0.757 0.512 1.48 0.140 [ 0.000, 1.817]
headht 7.424 1.161 6.39 0.000 [ 5.028, 9.732]
height 0 [ -2.859, 0.000]
len 6.716 1.466 4.58 0.000 [ 0.000,30.508]
sex 25.313 9.920 2.55 0.011 [ 0.000,42.144]
output and shorth for I_I model
Estimate SE t Pr(>|t|) 95% shorth CI
Int -1764.516 186.046 -9.48 0.000 [-6104.9,-778.2]
age -1.708 0.285 -5.99 0.000 [ -2.259,-1.003]
breadth 0 [-31.012, 6.567]
cephalic 5.958 2.089 2.85 0.005 [ -6.700,61.265]
circum 0.757 0.512 1.48 0.140 [ 0.000, 1.866]
headht 7.424 1.161 6.39 0.000 [ 5.221,10.090]
height 0 [ -2.173, 0.000]
len 6.716 1.466 4.58 0.000 [ 0.000,28.819]
sex 25.313 9.920 2.55 0.011 [ 0.000,42.847]
The R code used to produce the above output is shown below. The last
four commands are useful for examining the variable selection output.
x<-cbrainx[,c(1,3,5,6,7,8,9,10)]
y<-cbrainy
library(leaps)
out <- regboot(x,y,B=1000)
outvs <- fselboot(x,cbrainy) #get bootstrap CIs,
apply(out$betas,2,shorth3);
apply(outvs$betas,2,shorth3)
ls.print(outvs$full)
ls.print(outvs$sub)
outvs <- modIboot(x,cbrainy) #get bootstrap CIs,
apply(out$betas,2,shorth3);
apply(outvs$betas,2,shorth3)
ls.print(outvs$sub)
tem<-regsubsets(x,y,method="forward")
tem2<-summary(tem)
tem2$which
tem2$cp
A small simulation study was done in R using B = max(1000, n, 20p) and
5000 runs. The regression model used $\beta = (1, 1, 0, 0)^T$ with n = 100, p = 4,
and various zero mean iid error distributions. The design matrix X consisted
of iid N(0,1) random variables. Hence the full model least squares confidence
intervals for $\beta_i$ should have length near $2t_{96,0.975}\sigma/\sqrt{n} \approx 2(1.96)/10 = 0.392$
when the iid zero mean errors have variance $\sigma^2 = 1$. The simulation computed
the shorth(c) interval for each $\beta_i$ and used the prediction region method
to test $H_0: \beta_3 = \beta_4 = 0$. The nominal coverage was 0.95 with $\delta = 0.05$.
Observed coverage between 0.94 and 0.96 would suggest coverage is close to the
nominal value.
The regression models used the residual bootstrap on the full model least
squares estimator and on the all subsets variable selection estimator for the
model $I_{min}$. The residuals were from least squares applied to the full model
in both cases. Results are shown for when the iid errors $e_i \sim N(0, 1)$.
Table 3.2 shows two rows for each model giving the observed confidence interval
coverages and average lengths of the confidence intervals. The term "reg" is
for the full model regression, and the term "vs" is for the all subsets variable
selection. The column for the test gives the length and coverage = P(fail
to reject $H_0$) for the interval $[0, D_{(U_B)}]$ where $D_{(U_B)}$ is the cutoff for the
confidence region. The volume of the confidence region will decrease to 0 as
$n \to \infty$. The cutoff will often be near $\sqrt{\chi^2_{r,0.95}}$ if the statistic T is
asymptotically normal. Note that $\sqrt{\chi^2_{2,0.95}} = 2.448$ is very close to 2.449 for the
full model regression bootstrap test. The coverages were near 0.95 for the
regression bootstrap on the full model. For $I_{min}$ the coverages were near 0.95
for $\beta_1$ and $\beta_2$, but higher for the other 3 tests since zeroes often occurred for
$\hat{\beta}^*_j$ for j = 3, 4. The average lengths and coverages were similar for the full
model and all subsets variable selection $I_{min}$ for $\beta_1$ and $\beta_2$, but the lengths
were shorter for $I_{min}$ for $\beta_3$ and $\beta_4$.

Table 3.2 Bootstrapping Regression and Variable Selection

model cov/len   β1     β2     β3     β4     test
reg   cov     0.9496 0.9430 0.9440 0.9454 0.9414
      len     0.3967 0.3996 0.3997 0.3997 2.4493
vs    cov     0.9482 0.9486 0.9974 0.9974 0.9896
      len     0.3965 0.3990 0.3241 0.3257 2.6901

The R code for the simulation is shown below.


regbootsim(nruns=5000) #takes a while
library(leaps)
vsbootsim(nruns=5000) #takes a long while
vsbootsim2(nruns=5000) #bootstraps forwards selection
Remark 3.3. Predictor transformations can be done as long as the response
variable is not used. Suppose the p predictors are selected and variable
selection is done. Use the prediction region method for exploratory testing.
Olive (2013a) gives prediction intervals for models of the form Y = m(x) + e.
The variable selection model is such a model, so use the Olive (2013a) PI
after automated variable selection using $C_p$ or AIC.
The Olive (2013a) PI has
$$b_n = \left(1 + \frac{15}{n}\right)\sqrt{\frac{n + 2p}{n - p}}. \qquad (3.11)$$
Let $q_n = \min(1 - \delta + 0.05,\ 1 - \delta + p/n)$ for $\delta > 0.1$ and
$$q_n = \min(1 - \delta/2,\ 1 - \delta + 10\delta p/n), \ \text{otherwise}. \qquad (3.12)$$
If $1 - \delta < 0.999$ and $q_n < 1 - \delta + 0.001$, set $q_n = 1 - \delta$.

Let $c = \lceil nq_n \rceil$. Compute $r_{(c)} - r_{(1)}, r_{(c+1)} - r_{(2)}, \ldots, r_{(n)} - r_{(n-c+1)}$. Let
$(r_{(d)}, r_{(d+c-1)}) = (\hat{\xi}_{\delta_1}, \hat{\xi}_{1-\delta_2})$ correspond to the interval with the smallest
length. Then the asymptotically optimal $100(1 - \delta)\%$ large sample PI for
$Y_f$ is
$$(\hat{m}(x_f) + b_n\hat{\xi}_{\delta_1},\ \hat{m}(x_f) + b_n\hat{\xi}_{1-\delta_2}). \qquad (3.13)$$
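The computation in (3.11)-(3.13) can be sketched numerically. The book's lregpack functions are in R; the following Python sketch and its function name are illustrative assumptions that follow the steps above literally.

```python
import math

def olive_pi(resids, m_hat_xf, p, delta=0.05):
    # bn from (3.11), qn from (3.12), then the shortest window of
    # c = ceil(n * qn) sorted residuals gives the quantile estimates
    # used in the PI (3.13).
    n = len(resids)
    bn = (1 + 15.0 / n) * math.sqrt((n + 2.0 * p) / (n - p))
    if delta > 0.1:
        qn = min(1 - delta + 0.05, 1 - delta + p / n)
    else:
        qn = min(1 - delta / 2.0, 1 - delta + 10.0 * delta * p / n)
    if 1 - delta < 0.999 and qn < 1 - delta + 0.001:
        qn = 1 - delta
    c = math.ceil(n * qn)
    r = sorted(resids)
    d = min(range(n - c + 1), key=lambda i: r[i + c - 1] - r[i])
    return (m_hat_xf + bn * r[d], m_hat_xf + bn * r[d + c - 1])
```

The correction factor $b_n > 1$ widens the interval for small n and decays to 1 as $n \to \infty$, which matches the slow approach of the simulated PI lengths to the asymptotic length reported below.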
Let $Y = x^T\beta + e$ where $\beta$ is $p \times 1$, but let $\hat{m}(x) = x^T_{I_{min}}\hat{\beta}_{I_{min}}$ using
the minimum $C_p$ model $I_{min}$ from forward selection. The lregpack function
vspisim simulates (3.13) when $\beta = (1, 1, \ldots, 1, 0, \ldots, 0)^T$ where the first
k + 1 entries of the $p \times 1$ vector $\beta$ are 1s, for various error distributions using
the nominal 95% PI. This simulation is similar to the full model simulation
done under Remark 2.8. With 5000 runs, p = 4, k = 1, and N(0, 1) errors,
the asymptotic length is 3.92 = 2(1.96). With n = 40, the coverage was
0.9858 with average length 7.7557. With n = 80, the coverage was 0.979 with
average length 5.0278. With n = 200, the coverage was 0.966 with average
length 4.2852. With n = 400, the coverage was 0.958 with average length
4.081. After variable selection, coverage starts to be good for $n \geq 10p$, but
the PI length was not near the optimal asymptotic length until $n \geq 100p$.
Use the following R code.
library(leaps)
vspisim(n=40,p=4,k=1,type=1,nruns=5000)

3.5 Diagnostics

Automatic or blind use of regression models, especially in exploratory work,
all too often leads to incorrect or meaningless results and to confusion
rather than insight. At the very least, a user should be prepared to make and
study a number of plots before, during, and after fitting the model.
Chambers et al. (1983, p. 306)

Diagnostics are used to check whether model assumptions are reasonable.
This section focuses on diagnostics for the unimodal MLR model $Y_i = x_i^T\beta + e_i$
for $i = 1, \ldots, n$ where the errors are iid from a unimodal distribution that is
not highly skewed with $E(e_i) = 0$ and $VAR(e_i) = \sigma^2$. See Definition 2.6.
It is often useful to use notation to separate the constant from the nontrivial
predictors. Assume that $x_i = (1, x_{i,2}, \ldots, x_{i,p})^T \equiv (1, u_i^T)^T$ where the
$(p-1) \times 1$ vector of nontrivial predictors $u_i = (x_{i,2}, \ldots, x_{i,p})^T$. In matrix
form,
$$Y = X\beta + e, \quad X = [X_1, X_2, \ldots, X_p] = [1, U],$$
$1$ is an $n \times 1$ vector of ones, and $U = [X_2, \ldots, X_p]$ is the $n \times (p-1)$ matrix
of nontrivial predictors. The kth column of $U$ is the $n \times 1$ vector of the jth
predictor $X_j = (x_{1,j}, \ldots, x_{n,j})^T$ where $j = k + 1$. The sample mean and
covariance matrix of the nontrivial predictors are
$$\overline{u} = \frac{1}{n}\sum_{i=1}^{n} u_i \qquad (3.14)$$
and
$$C = \mathrm{Cov}(U) = \frac{1}{n-1}\sum_{i=1}^{n}(u_i - \overline{u})(u_i - \overline{u})^T, \qquad (3.15)$$
respectively, where $u_i^T$ is the ith row of $U$.

Some important numerical quantities that are used as diagnostics measure
the distance of $u_i$ from $\overline{u}$ and the influence of case i on the OLS fit $\hat{\beta}_{OLS}$.
The ith residual $r_i = Y_i - \hat{Y}_i$, and the vector of fitted values is
$\hat{Y} = X\hat{\beta} = X(X^TX)^{-1}X^TY = HY$ where $H$ is the hat matrix. Case (or leave one out
or deletion) diagnostics are computed by omitting the ith case from the OLS
regression. Following Cook and Weisberg (1999a, p. 357), let
$$\hat{Y}_{(i)} = X\hat{\beta}_{(i)} \qquad (3.16)$$
denote the $n \times 1$ vector of fitted values from estimating $\beta$ with OLS without
the ith case. Denote the jth element of $\hat{Y}_{(i)}$ by $\hat{Y}_{(i),j}$. It can be shown that
the variance of the ith residual $VAR(r_i) = \sigma^2(1 - h_i)$. The usual estimator
of the error variance is
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} r_i^2}{n - p}.$$
The (internally) studentized residual
$$e_i = \frac{r_i}{\hat{\sigma}\sqrt{1 - h_i}}$$
has zero mean and approximately unit variance.

Definition 3.11. The ith leverage $h_i = H_{ii}$ is the ith diagonal element
of the hat matrix $H$. The ith squared (classical) Mahalanobis distance
$$MD_i^2 = (u_i - \overline{u})^T C^{-1}(u_i - \overline{u}).$$
The ith Cook's distance
$$CD_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})^T X^TX(\hat{\beta}_{(i)} - \hat{\beta})}{p\hat{\sigma}^2}
= \frac{(\hat{Y}_{(i)} - \hat{Y})^T(\hat{Y}_{(i)} - \hat{Y})}{p\hat{\sigma}^2} \qquad (3.17)$$
$$= \frac{1}{p\hat{\sigma}^2}\sum_{j=1}^{n}(\hat{Y}_{(i),j} - \hat{Y}_j)^2.$$
p

Proposition 3.4. a) (Rousseeuw and Leroy 1987, p. 225)
$$h_i = \frac{1}{n-1}MD_i^2 + \frac{1}{n}.$$
b) (Cook and Weisberg 1999a, p. 184)
$$h_i = x_i^T(X^TX)^{-1}x_i = (u_i - \overline{u})^T(U^TU)^{-1}(u_i - \overline{u}) + \frac{1}{n}.$$
c) (Cook and Weisberg 1999a, p. 360)
$$CD_i = \frac{r_i^2}{p\hat{\sigma}^2(1 - h_i)}\,\frac{h_i}{1 - h_i} = \frac{e_i^2}{p}\,\frac{h_i}{1 - h_i}.$$

When the statistics $CD_i$, $h_i$, and $MD_i$ are large, case i may be an outlier or
influential case. Examining a stem plot or dot plot of these three statistics for
unusually large values can be useful for flagging influential cases. Cook and
Weisberg (1999a, p. 358) suggest examining cases with $CD_i > 0.5$ and that
cases with $CD_i > 1$ should always be studied. Since $H = H^T$ and $H = HH$,
the hat matrix is symmetric and idempotent. Hence the eigenvalues of $H$ are
zero or one, and trace$(H) = \sum_{i=1}^{n} h_i = p$. It can be shown that $0 \le h_i \le 1$.
Rousseeuw and Leroy (1987, pp. 220, 224) suggest using $h_i > 2p/n$ and
$MD_i^2 > \chi^2_{p-1,0.95}$ as benchmarks for leverages and Mahalanobis distances,
where $\chi^2_{p-1,0.95}$ is the 95th percentile of a chi-square distribution with $p - 1$
degrees of freedom.
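For simple linear regression (p = 2) the leverages have the closed form $h_i = 1/n + (x_i - \overline{x})^2/\sum_j(x_j - \overline{x})^2$, so the identities above can be checked numerically. The following Python sketch uses made-up data (the book's software is in R); it computes leverages, studentized residuals, and Cook's distances via Proposition 3.4c).

```python
x = [1.0, 2.0, 3.0, 4.0, 10.0]     # made-up data; last case has high leverage
y = [1.1, 1.9, 3.2, 3.9, 10.5]
n, p = len(x), 2
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx    # OLS slope
b0 = sum(y) / n - b1 * xbar                                 # OLS intercept
r = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]           # residuals
sigma2 = sum(ri ** 2 for ri in r) / (n - p)                 # sigma-hat squared
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]          # leverages
e = [ri / (sigma2 * (1 - hi)) ** 0.5 for ri, hi in zip(r, h)]   # studentized
cd = [ei ** 2 / p * hi / (1 - hi) for ei, hi in zip(e, h)]      # Prop. 3.4c)
```

The leverages sum to trace(H) = p = 2, and the last case, far from $\overline{x}$, has by far the largest leverage, illustrating the $h_i > 2p/n$ benchmark.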
Note that Proposition 3.4c) implies that Cook's distance is the product
of the squared residual and a quantity that becomes larger the farther $u_i$
is from $\overline{u}$. Hence influence is roughly the product of leverage and distance
of $\hat{Y}_i$ from $Y_i$ (see Fox 1991, p. 21). Mahalanobis distances and leverages
both define hyperellipsoids based on a metric closely related to the sample
covariance matrix of the nontrivial predictors. All points $u_i$ on the same
hyperellipsoidal contour are the same distance from $\overline{u}$ and have the same
leverage (or the same Mahalanobis distance).
Cook's distances, leverages, and Mahalanobis distances can be effective for
finding influential cases when there is a single outlier, but can fail if there
are two or more outliers. Nevertheless, these numerical diagnostics combined
with response and residual plots are probably the most effective techniques for
detecting cases that affect the fitted values when the multiple linear regression
model is a good approximation for the bulk of the data.
A scatterplot of x versus y (recall the convention that a plot of x versus y
means that x is on the horizontal axis and y is on the vertical axis) is used to
visualize the conditional distribution y|x of y given x (see Cook and Weisberg
1999a, p. 31). For the simple linear regression model (with one nontrivial
predictor $x_2$), the most effective technique for checking the assumptions of
the model is to make a scatterplot of $x_2$ versus Y and a residual plot of $x_2$
versus $r_i$. Departures from linearity in the scatterplot suggest that the simple
linear regression model is not adequate. The points in the residual plot should
scatter about the line r = 0 with no pattern. If curvature is present or if the
distribution of the residuals depends on the value of $x_2$, then the simple linear
regression model is not adequate.
In general there is more than one nontrivial predictor and in this setting
two plots are crucial for any multiple linear regression analysis,
regardless of the regression estimator (e.g., OLS, $L_1$, etc.). The first plot is the
residual plot of the fitted values $\hat{Y}_i$ versus the residuals $r_i$, and the second
plot is the response plot of the fitted values $\hat{Y}_i$ versus the response $Y_i$.

Recalling Definitions 2.11 and 2.12, residual and response plots are plots of
$w_i = x_i^T\eta$ versus $r_i$ and $Y_i$, respectively, where $\eta$ is a known $p \times 1$ vector. The
most commonly used residual and response plots take $\eta = \hat{\beta}$. Plots against
the individual predictors $x_j$ and potential predictors are also used. If the
residual plot is not ellipsoidal with zero slope, then the unimodal MLR model
(with iid errors from a unimodal distribution that is not highly skewed) is
not sustained. In other words, if the variables in the residual plot show some
type of dependency, e.g. increasing variance or a curved pattern, then the
multiple linear regression model may be inadequate. Proposition 2.1 showed
that the response plot simultaneously displays the fitted values, response, and
residuals. The plotted points in the response plot should scatter about the
identity line if the multiple linear regression model holds. Recall that residual
plots magnify departures from the model while the response plot emphasizes
how well the model fits the data.

When the bulk of the data follows the MLR model, the following rules of
thumb are useful for finding influential cases and outliers from the response
and residual plots. Look for points with large absolute residuals and for points
far away from $\overline{Y}$. Also look for gaps separating the data into clusters. To
determine whether small clusters are outliers or good leverage points, give
zero weight to the clusters, and fit an MLR estimator to the bulk of the
data. Denote the weighted estimator by $\hat{\beta}_w$. Then plot $\hat{Y}_w$ versus $Y$ using
the entire data set. If the identity line passes through the bulk of the data
but not the cluster, then the cluster points may be outliers. In Figure 3.7,
the 5 infants are good leverage points in that the fit to the bulk of the data
passes through the cluster of infants. For the Buxton (1920) data, the cluster
of cases far from the bulk of the data in Figure 3.11 are outliers.
To see why gaps are important, recall that the coefficient of determination
$R^2$ is equal to the squared correlation $(\mathrm{corr}(Y, \hat{Y}))^2$. $R^2$ over emphasizes the
strength of the MLR relationship when there are two clusters of data since
much of the variability of Y is due to the smaller cluster.

Information from numerical diagnostics can be incorporated into the
response plot by highlighting cases that have large absolute values of the
diagnostic. For example, the Cook's distance $CD_i$ for the ith case tends to be
large if $\hat{Y}_i$ is far from the sample mean $\overline{Y}$ and if the corresponding absolute
residual $|r_i|$ is not small. If $\hat{Y}_i$ is close to $\overline{Y}$, then $CD_i$ tends to be small
unless $|r_i|$ is large. Thus cases with large Cook's distances can often be found
by examining the response and residual plots. An exception to these rules
of thumb occurs if a group of cases forms a cluster and the OLS fit passes
through the cluster. Then the $CD_i$'s corresponding to these cases tend to be
small even if the cluster is far from $\overline{Y}$.

Example 3.13. Tremearne (1911) presents a data set of about 17 measurements
on 115 people of Hausa nationality. We deleted 3 cases because
of missing values and used height as the response variable Y. The five
predictor variables used were height when sitting, height when kneeling, head
length, nasal breadth, and span (perhaps from left hand to right hand).
Figure 2.1 presents the OLS residual and response plots for this data set. Points
corresponding to cases with Cook's distance > min(0.5, 2p/n) are shown as
highlighted squares (cases 3, 44, and 63). The 3rd person was very tall while
the 44th person was rather short. From the plots, the standard deviation of
the residuals appears to be around 10. Hence cases 3 and 44 are certainly
worth examining. Two other cases have residuals near fifty.
Data sets like this one are very common. The majority of the cases seem to
follow a multiple linear regression model with iid Gaussian errors, but a small
percentage of cases seem to come from an error distribution with heavier tails
than a Gaussian distribution.

3.6 Outlier Detection

Do not attempt to build a model on a set of poor data! In human surveys,
one often finds 14-inch men, 1000-pound women, students with no lungs,
and so on. In manufacturing data, one can find 10,000 pounds of material
in a 100-pound capacity barrel, and similar obvious errors. All the planning
and training in the world will not eliminate these sorts of problems. ... In
our decades of experience with messy data, we have yet to find a large
data set completely free of such quality problems.
Draper and Smith (1981, p. 418)
There is an enormous literature on outlier detection in multiple linear
regression. Typically a numerical measure such as Cook's distance or a
residual plot based on resistant fits is used. The following terms are frequently
encountered.

Definition 3.12. Outliers are cases that lie far from the bulk of the data.
Hence Y outliers are cases that have unusually large vertical distances from
the MLR fit to the bulk of the data while x outliers are cases with predictors
x that lie far from the bulk of the $x_i$. Suppose that some analysis to detect
outliers is performed. Masking occurs if the analysis suggests that one or
more outliers are in fact good cases. Swamping occurs if the analysis suggests
that one or more good cases are outliers.

The residual and response plots are very useful for detecting outliers. If
there is a cluster of cases with outlying Y's, the identity line will often pass
through the outliers. If there are two clusters with similar Y's, then the two
plots may fail to show the clusters. Then using methods to detect x outliers
may be useful.
Let the q continuous predictors in the MLR model be collected into vectors
$u_i$ for $i = 1, \ldots, n$. Let the $n \times q$ matrix $W$ have n rows $u_1^T, \ldots, u_n^T$. Let the
$q \times 1$ column vector $T(W)$ be a multivariate location estimator, and let the
$q \times q$ symmetric positive definite matrix $C(W)$ be a covariance estimator.
Often q = p - 1 and only the constant is omitted from $x_i$ to create $u_i$.

Definition 3.13. The ith squared Mahalanobis distance is
$$D_i^2 = D_i^2(T(W), C(W)) = (u_i - T(W))^T C^{-1}(W)(u_i - T(W)) \qquad (3.18)$$
for each point $u_i$. Notice that $D_i^2$ is a random variable (scalar valued).

The classical Mahalanobis distance corresponds to the sample mean and
sample covariance matrix
$$T(W) = \overline{u} = \frac{1}{n}\sum_{i=1}^{n} u_i$$
and
$$C(W) = S = \frac{1}{n-1}\sum_{i=1}^{n}(u_i - T(W))(u_i - T(W))^T,$$
and will be denoted by $MD_i$. When $T(W)$ and $C(W)$ are robust estimators,
$D_i = \sqrt{D_i^2}$ will sometimes be denoted by $RD_i$. We suggest using the
Olive (2008) RFCH or RMVN estimator as the robust estimator. The sample
Mahalanobis distance $D_i = \sqrt{D_i^2}$ is an analog of the absolute value of the
sample z-score $|z_i| = |(Y_i - \overline{Y})/\hat{\sigma}|$. Also notice that the Euclidean distance of
$u_i$ from the estimate of center $T(W)$ is $D_i(T(W), I_q)$ where $I_q$ is the $q \times q$
identity matrix. Plot the $MD_i$ versus the $RD_i$ to detect outlying u.
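The classical distances $MD_i^2$ are easy to compute directly. The following Python sketch (made-up bivariate data, q = 2; the book's software is in R) computes them from the sample mean and covariance matrix. A useful check is the identity $\sum_i MD_i^2 = (n-1)q$, which follows from $\sum_i d_id_i^T = (n-1)S$.

```python
# Made-up data: q = 2 predictors for n = 5 cases; the last case is an x outlier.
u = [(1.0, 2.0), (2.0, 3.1), (3.0, 3.9), (4.0, 5.2), (10.0, 1.0)]
n, q = len(u), 2
ubar = tuple(sum(ui[j] for ui in u) / n for j in range(q))   # sample mean
# sample covariance matrix S (2 x 2)
s = [[sum((ui[j] - ubar[j]) * (ui[k] - ubar[k]) for ui in u) / (n - 1)
      for k in range(q)] for j in range(q)]
det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
sinv = [[s[1][1] / det, -s[0][1] / det],                     # 2 x 2 inverse
        [-s[1][0] / det, s[0][0] / det]]
md2 = []                                                     # squared distances
for ui in u:
    d = (ui[0] - ubar[0], ui[1] - ubar[1])
    md2.append(d[0] * (sinv[0][0] * d[0] + sinv[0][1] * d[1])
               + d[1] * (sinv[1][0] * d[0] + sinv[1][1] * d[1]))
```

The x outlier in the last case gets the largest $MD_i^2$, which is what the DD plot exploits by comparing these classical distances with robust ones.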
Definition 3.14: Rousseeuw and Van Driessen (1999). The DD plot
is a plot of the classical Mahalanobis distances $MD_i$ versus robust Mahalanobis
distances $RD_i$. The DD plot is best for $n \geq 20p$.

Olive (2002) shows that the plotted points in the DD plot will follow the
identity line with zero intercept and unit slope if the predictor distribution
is multivariate normal (MVN), and will follow a line with zero intercept but
nonunit slope if the distribution is elliptically contoured with nonsingular
covariance matrix but not MVN. (Such distributions have linear scatterplot
matrices. See Chapter 10.) Hence if the plotted points in the DD plot follow
some line through the origin, then there is some evidence that outliers and
strong nonlinearities have been removed from the predictors.

Fig. 3.11 Residual and Response Plots for Buxton Data. [The figure shows the response plot (FIT versus Y) and the residual plot (FIT versus RES), with the outlying cases 61-65 labeled.]

Example 3.14. Buxton (1920, pp. 23-25) gives 20 measurements of 88
men. We chose to predict stature using an intercept, head length, nasal height,
bigonal breadth, and cephalic index. Observation 9 was deleted since it had
missing values. Five individuals, numbers 62-66, were reported to be about
0.75 inches tall with head lengths well over five feet! This appears to be a
clerical error; these individuals' stature was recorded as head length and the
integer 18 or 19 given for stature, making the cases massive outliers with
enormous leverage.
Figure 3.11 shows the response plot and residual plot for the Buxton
data. Although an index plot of Cook's distance $CD_i$ may be useful for
flagging influential cases, the index plot provides no direct way of judging
the model against the data. As a remedy, cases in the response plot
with $CD_i > \min(0.5, 2p/n)$ were highlighted. Notice that the OLS fit passes
through the outliers, but the response plot is resistant to Y outliers since Y
is on the vertical axis. Also notice that although the outlying cluster is far
from $\overline{Y}$, only two of the outliers had large Cook's distance. Hence masking
occurred for both Cook's distances and for OLS residuals, but not for OLS
fitted values.

Fig. 3.12 DD Plots for Buxton Data. [The figure shows a) the DD plot (MD versus RD) with the outliers and b) the DD plot without the outliers.]

Figure 3.12a shows the DD plot made from the four predictors head length,
nasal height, bigonal breadth, and cephalic index. The five massive outliers
correspond to head lengths that were recorded to be around 5 feet. Figure 3.12b
is the DD plot computed after deleting these points and suggests that the
predictor distribution is now much closer to a multivariate normal distribution.

High leverage outliers are a particular challenge to conventional numerical
MLR diagnostics such as Cook's distance, but can often be visualized
using the response and residual plots. The following techniques are useful for
detecting outliers when the multiple linear regression model is appropriate.
1. Find the OLS residuals and fitted values and make a response plot and
a residual plot. Look for clusters of points that are separated from the
bulk of the data and look for residuals that have large absolute values.
Beginners frequently label too many points as outliers. Try to estimate
the standard deviation of the residuals in both plots. In the residual plot,
look for residuals that are more than 5 standard deviations away from
the r = 0 line. The identity line and r = 0 line may pass right through a
cluster of outliers, but the cluster of outliers can often be detected because
there is a large gap between the cluster and the bulk of the data, as in
Figure 3.11.
2. Make a DD plot of the predictors that take on many values (the continuous
predictors).
3. Make a scatterplot matrix of several diagnostics such as leverages, Cook's
distances, and studentized residuals.

Detecting outliers is much easier than deciding what to do with them.
After detection, the investigator should see whether the outliers are recording
errors. The outliers may become good cases after they are corrected. But
frequently there is no simple explanation for why the cases are outlying.
Typical advice is that outlying cases should never be blindly deleted and that
the investigator should analyze the full data set including the outliers as well
as the data set after the outliers have been removed (either by deleting the
cases or the variables that contain the outliers).
Typically two methods are used to find the cases (or variables) to delete.
The investigator computes OLS diagnostics and subjectively deletes cases,
or a resistant multiple linear regression estimator is used that automatically
gives certain cases zero weight. A third, much more effective method is to use
the response and residual plots.
Suppose that the data has been examined, recording errors corrected, and
impossible cases deleted. For example, in the Buxton (1920) data, 5 people
with heights of 0.75 inches were recorded. For this data set, these heights
could be corrected. If they could not be corrected, then these cases should
be discarded since they are impossible. If outliers are present even after
correcting recording errors and discarding impossible cases, then we can add an
additional rough guideline.
If the purpose is to display the relationship between the predictors and the
response, make a response plot using the full data set (computing the fitted
values by giving the outliers weight zero) and using the data set with the
outliers removed. Both plots are needed if the relationship that holds for
the bulk of the data is obscured by outliers. The outliers are removed from
the data set in order to get reliable estimates for the bulk of the data. The
identity line should be added as a visual aid and the proportion of outliers
should be given.
3.7 Summary

1) Suppose you have a scatterplot of two variables $x_1^{\lambda_1}$ versus $x_2^{\lambda_2}$ with
$x_1, x_2 > 0$, and that the plotted points follow a nonlinear one to one function.
Consider the ladder of powers -1, -0.5, -1/3, 0, 1/3, 0.5, and 1. The
ladder rule says: to spread small values of the variable, make $\lambda_i$ smaller; to
spread large values of the variable, make $\lambda_i$ larger.
2) Suppose w is positive. The log rule says use log(w) if
$\max(w_i)/\min(w_i) > 10$.
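The log rule in point 2) amounts to a one-line check. A Python sketch with made-up data (the book's software is in R):

```python
import math

w = [0.5, 2.0, 9.0, 40.0, 250.0]      # made-up positive variable
if max(w) / min(w) > 10:              # log rule: here the ratio is 500 > 10
    w = [math.log(wi) for wi in w]    # so replace w by log(w)
```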
3) There are several guidelines for choosing power transformations. First,
see rules 1) and 2) above. Suppose that all values of the variable w to be
transformed are positive. The log rule often works wonders on the data.
If the variable w can take on the value of 0, use log(w + c) where c is a
small constant like 1, 1/2, or 3/8. The unit rule says that if $X_i$ and y
have the same units, then use the same transformation of $X_i$ and y. The
cube root rule says that if w is a volume measurement, then the cube root
transformation $w^{1/3}$ may be useful. Consider the ladder of powers given in
point 1). No transformation ($\lambda = 1$) is best, then the log transformation,
then the square root transformation. Theory, if available, should be used to
select a transformation. Frequently more than one transformation will work.
For example, if y = weight and $X_1$ = volume = $X_2X_3X_4$, then y vs. $X_1^{1/3}$
or log(y) vs. $\log(X_1) = \log(X_2) + \log(X_3) + \log(X_4)$ may both work. Also if
y is linearly related with $X_2, X_3, X_4$ and these three variables all have length
units mm, say, then the units of $X_1$ are $(\text{mm})^3$. Hence the units of $X_1^{1/3}$ are
mm.
4) To find a response transformation, make the transformation plots
and choose a transformation such that the transformation plot is linear.
5) A factor (with c levels a1, . . . , ac) is incorporated into the MLR model
by using c − 1 indicator variables xWj = 1 if W = aj and xWj = 0 otherwise,
where one of the levels aj is omitted, e.g. use j = 1, . . . , c − 1.
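As a sketch of point 5) (Python for illustration; in the book's R, factor() builds these columns automatically, and factor_to_indicators below is a hypothetical helper):

```python
import numpy as np

def factor_to_indicators(levels, omit):
    """Code a factor as c - 1 indicator variables x_Wj, omitting level `omit`.
    x_Wj = 1 if W = a_j and x_Wj = 0 otherwise."""
    kept = [a for a in sorted(set(levels)) if a != omit]
    X = np.column_stack([[1.0 if w == a else 0.0 for w in levels] for a in kept])
    return X, kept

W = ["a1", "a2", "a3", "a2", "a1"]
X, kept = factor_to_indicators(W, omit="a1")  # columns for a2 and a3
```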
6) For variable selection, the model Y = x^T β + e that uses all of the
predictors is called the full model. A model Y = x_I^T β_I + e that only uses a
subset x_I of the predictors is called a submodel. The full model is always
a submodel. The full model has SP = x^T β and the submodel has
SP = x_I^T β_I.
7) Make scatterplot matrices of the predictors and the response. Then re-
move strong nonlinearities from the predictors using power trans-
formations. The log rule is very useful.
8) Either include all of the indicator variables for a factor in the model or
exclude all of them. If the model contains powers or interactions, also include
all main effects in the model.
9) After selecting a submodel I, make the response and residual plots for
the full model and the submodel. Make the RR plot of rI,i versus ri and the
FF plot of ŶI,i versus Ŷi. The submodel is good if the plotted points in the
FF and RR plots cluster tightly about the identity line. In the RR plot, the
OLS line and identity line can be added to the plot as visual aids. It should
be difficult to see that the OLS and identity lines intersect at the origin, so
the two lines should nearly coincide at the origin. If the FF plot looks good
but the RR plot does not, the submodel may be good if the main goal of the
analysis is for prediction.
10) Forward selection Step 1) k = 1: Start with a constant w1 = x1.
Step 2) k = 2: Compute Cp for all models with k = 2 containing a constant
and a single predictor xi. Keep the predictor w2 = xj, say, that minimizes Cp.
Step 3) k = 3: Fit all models with k = 3 that contain w1 and w2. Keep the
predictor w3 that minimizes Cp. ...
Step j) k = j: Fit all models with k = j that contain w1, w2, . . . , wj−1. Keep
the predictor wj that minimizes Cp. ...
Step p): Fit the full model.
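The forward selection loop above can be sketched as follows (Python for illustration; Cp(I) = SSE(I)/MSE(full) − n + 2k is the usual definition, and forward_selection is a hypothetical helper, not lregpack code):

```python
import numpy as np

def cp_stat(X, Y, cols, mse_full):
    """Mallows Cp for the submodel using the given columns of X."""
    b, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
    resid = Y - X[:, cols] @ b
    return float(resid @ resid) / mse_full - len(Y) + 2 * len(cols)

def forward_selection(X, Y):
    """Greedy forward selection with the Cp criterion.
    Column 0 of X is the constant; returns columns in the order added."""
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    mse_full = float((Y - X @ b) @ (Y - X @ b)) / (n - p)
    selected, remaining = [0], list(range(1, p))
    while remaining:
        best = min(remaining, key=lambda j: cp_stat(X, Y, selected + [j], mse_full))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
Y = 5 * X[:, 1] + rng.normal(size=n)  # only the column-1 predictor matters
order = forward_selection(X, Y)       # column 1 should enter right after the constant
```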

Backward elimination: All models contain a constant = u1. Step 0)
k = p: Start with the full model that contains x1, . . . , xp. We will also say
that the full model contains u1, . . . , up where u1 = x1 but ui need not equal
xi for i > 1.
Step 1) k = p − 1: Fit each model with k = p − 1 predictors including a
constant. Delete the predictor up, say, that corresponds to the model with
the smallest Cp. Keep u1, . . . , up−1.
Step 2) k = p − 2: Fit each model with p − 2 predictors including a
constant. Delete the predictor up−1 corresponding to the smallest Cp. Keep
u1, . . . , up−2. ...
Step j) k = p − j: Fit each model with p − j predictors including a
constant. Delete the predictor up−j+1 corresponding to the smallest Cp. Keep
u1, . . . , up−j. ...
Step p − 2) k = 2. The current model contains u1, u2, and u3. Fit the model
u1, u2 and the model u1, u3. Assume that model u1, u2 minimizes Cp. Then
delete u3 and keep u1 and u2.

11) Let Imin correspond to the submodel with the smallest Cp. Find
the submodel II with the fewest number of predictors such that Cp(II) ≤
Cp(Imin) + 1. Then II is the initial submodel that should be examined. It
is possible that II = Imin or that II is the full model. Models I with fewer
predictors than II such that Cp(I) ≤ Cp(Imin) + 4 are interesting and should
also be examined. Models I with k predictors, including a constant and with
fewer predictors than II such that Cp(Imin) + 4 < Cp(I) ≤ min(2k, p) should
be checked.
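A numerical sketch of rule 11) (Python; the (k, Cp) pairs below are illustrative, similar in spirit to the Problem 3.2 table):

```python
def initial_submodel(models):
    """Rule 11): pick I_I, the model with the fewest predictors k whose
    Cp is within 1 of the minimum Cp.  `models` holds (k, Cp) pairs."""
    cp_min = min(cp for _, cp in models)
    candidates = [(k, cp) for k, cp in models if cp <= cp_min + 1]
    return min(candidates)  # smallest k; ties broken by smaller Cp

models = [(10, 10.0), (6, 3.00), (4, 2.43), (3, 22.0)]
k_II, cp_II = initial_submodel(models)  # the 4-predictor model wins
```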
12) There are several guidelines for building an MLR model. Suppose
that variable Z is of interest and variables W1, . . . , Wr have been collected
along with Z. Make a scatterplot matrix of W1, . . . , Wr and Z. (If r is large,
several matrices may need to be made. Each one should include Z.) Remove
or correct any gross outliers. It is often a good idea to transform the Wi
to remove any strong nonlinearities from the predictors. Eventually
you will find a response variable Y = tZ(Z) and nontrivial predictor variables
X2, . . . , Xp for the full model. Interactions such as Xk = Wi Wj and powers
such as Xk = Wi^2 may be of interest. Indicator variables are often used in
interactions, but do not transform an indicator variable. The response plot
for the full model should be linear, and the residual plot should be ellipsoidal
with zero trend. Find the LS output. Often we want the number of predictors
k in the submodel to be small. We will almost always include a constant in
the submodel. If the submodel seems to be good, make the response plot and
residual plot for the submodel. They should be linear and ellipsoidal with
zero trend, respectively. From the output, see if any terms can be eliminated
(look for predictors Xi such that the p-value > 0.01, 0.05, or 0.1 for H0:
βi = 0). Also see point 13) below.
13) Assume that the full model has good response and residual plots and
that n ≥ 10p. Let subset I have k predictors, including a constant. The
following rules of thumb may be useful, but may not all hold simultaneously.
Let Imin be the minimum Cp model and let II be the model with the fewest
predictors satisfying Cp(II) ≤ Cp(Imin) + 1. Do not use more predictors than
model II to avoid overfitting. Then the submodel I is good if
i) the response and residual plots for the submodel look like the response
and residual plots for the full model,
ii) corr(ESP,ESP(I)) = corr(Ŷ, ŶI) ≥ 0.95.
iii) The plotted points in the FF plot cluster tightly about the identity line.
iv) Want the p-value ≥ 0.01 for the partial F test that uses I as the reduced
model.
v) The plotted points in the RR plot cluster tightly about the identity line.
vi) Want R2(I) > 0.9R2 and R2(I) > R2 − 0.07 (recall that R2(I) ≤ R2(full)
since adding predictors to I does not decrease R2(I)).
vii) Want Cp(Imin) ≤ Cp(I) ≤ min(2k, p) with no big jumps in Cp (the
increase should be less than four) as variables are deleted.
viii) Want hardly any predictors with p-values > 0.05.
ix) Want few predictors with p-values between 0.01 and 0.05.
14) Always check that the full model is good. If the candidate model seems
to be good, the usual MLR checks should still be made. In particular, the
response plot and residual plot need to be made for the submodel.
15) Influence is roughly (leverage)(discrepancy). The leverages hi are
the diagonal elements of the hat matrix H and measure how far xi is from
the sample mean of the predictors. Cook's distance is widely used, but the
response plot and residual plot are the most effective tools for detecting
outliers and influential cases.
3.8 Complements

With one data set, OLS is a great place to start but a bad place to end. If
n = 5kp where k > 2, it may be useful to take a random sample of n/k cases
to build the MLR model. Then check the model on the full data set.

Predictor Transformations

One of the most useful techniques in regression is to remove
gross nonlinearities in the predictors by using predictor transformations.
The log rule is very useful for transforming highly skewed
predictors. The linearizing of the predictor relationships could be done by
using marginal power transformations or by transforming the joint distribution
of the predictors towards an elliptically contoured distribution. The
linearization might also be done by using simultaneous power transformations
λ = (λ2, . . . , λp)^T of the predictors so that the vector w = (x2^(λ2), . . . , xp^(λp))^T
of transformed predictors is approximately multivariate normal. A method
for doing this was developed by Velilla (1993). (The basic idea is the same
as that underlying the likelihood approach of Box and Cox for estimating a
power transformation of the response in regression, but the likelihood comes
from the assumed multivariate normal distribution of w.) The Cook and
Nachtsheim (1994) procedure can cause the distribution to be closer to elliptical
symmetry. Marginal Box-Cox transformations also seem to be effective.
Power transformations can also be selected with slider bars in Arc.
Suppose that it is thought that the model Y = x^T β + e could be improved
by transforming xj. Let x^T β = u^T η + βj xj where u^T η = x1 β1 + · · · +
xj−1 βj−1 + xj+1 βj+1 + · · · + xp βp. Let τ(xj) denote the unknown
transformation.

Definition 3.15. Consider the OLS residuals ri(j) = Yi − ui^T η̂ obtained
from the OLS regression of Y on u. A partial residual plot or component plus
residual plot or ceres plot with linear augmentation is a plot of the ri(j) versus
xj and is used to visualize τ.
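A minimal computational sketch of Definition 3.15 (Python for illustration; partial_residuals is a made-up name, and the plot itself is omitted):

```python
import numpy as np

def partial_residuals(X, Y, j):
    """OLS residuals r_i(j) = Y_i - u_i' etahat from regressing Y on
    u = all columns of X except column j.  Plotting these against X[:, j]
    is used to visualize the transformation tau."""
    u = np.delete(X, j, axis=1)
    etahat, *_ = np.linalg.lstsq(u, Y, rcond=None)
    return Y - u @ etahat

# sanity check: if Y depends only on u, the partial residuals vanish
n = 50
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = 2 + 3 * X[:, 1]               # no dependence on column 2
r = partial_residuals(X, Y, 2)
```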

Cook (1993) shows that partial residual plots are useful for visualizing τ
provided that the plots of xi versus xj are linear. More general ceres plots, in
particular ceres plots with smooth augmentation, can be used to visualize τ
if Y = u^T η + τ(xj) + e but the linearity condition fails. Fitting the additive
model Y = β1 + Σ(j=2 to p) Sj(xj) + e or Y = β1 + β2 x2 + · · · + βj−1 xj−1 + S(xj) +
βj+1 xj+1 + · · · + βp xp + e and plotting Ŝ(xj) can be useful. Similar ideas are
also useful for GLMs. See Chapter 13 and Olive (2013b) which also discusses
response plots for many regression models.

The assumption that all values of x1 and x2 are positive for power trans-
formation can be removed by using the modified power transformations of
Yeo and Johnson (2000).
Response Transformations

Application 3.1 was suggested by Olive (2004b, 2013b) for additive error
regression models Y = m(x) + e. An advantage of this graphical method is
that it works for linear models: that is, for multiple linear regression and for
many experimental design models. Notice that if the plotted points in the
transformation plot follow the identity line, then the plot is also a response
plot. The method is also easily performed for MLR methods other than least
squares.
A variant of the method would plot the residual plot or both the response
and the residual plot for each of the seven values of λ. Residual plots are also
useful, but they do not distinguish between nonlinear monotone relationships
and nonmonotone relationships. See Fox (1991, p. 55).
Cook and Olive (2001) also suggest a graphical method for selecting and
assessing response transformations under model (3.2). Cook and Weisberg
(1994) show that a plot of Z versus x^T β̂ (swap the axes on the transformation
plot for λ = 1) can be used to visualize t if Y = t(Z) = x^T β + e, suggesting
that t^(−1) can be visualized in a plot of x^T β̂ versus Z.
If there is nonlinearity present in the scatterplot matrix of the nontrivial
predictors, then transforming the predictors to remove the nonlinear-
ity will often be a useful procedure. More will be said about response
transformations for experimental designs in Section 5.4.
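The graphical method judges each transformation plot by eye. A crude numerical stand-in (Python; the book's tplot function in R draws the actual plots, and the correlation score below is only a proxy for visual linearity) is to fit each ladder value and score the fitted-values-versus-transformed-response plot by its correlation:

```python
import numpy as np

def power_tfm(z, lam):
    """Ladder-of-powers transformation: log for lam = 0, else z**lam (z > 0)."""
    return np.log(z) if lam == 0 else z ** lam

def best_ladder_lambda(X, Z, ladder=(-1, -0.5, -1/3, 0, 1/3, 0.5, 1)):
    """For each lambda on the ladder, regress Z**lambda on X and score how
    tightly the plot of fitted values versus Z**lambda follows a line."""
    def linearity(lam):
        W = power_tfm(Z, lam)
        b, *_ = np.linalg.lstsq(X, W, rcond=None)
        return np.corrcoef(X @ b, W)[0, 1]
    return max(ladder, key=linearity)

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Z = np.exp(1 + x + 0.05 * rng.normal(size=n))  # log(Z) is linear in x
lam_hat = best_ladder_lambda(X, Z)             # the log transformation wins
```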

There has been considerable discussion on whether the response transfor-
mation parameter λ should be selected with maximum likelihood (see Bickel
and Doksum 1981), or selected by maximum likelihood and then rounded to
a meaningful value on a coarse grid ΛL (see Box and Cox 1982 and Hinkley
and Runger 1984). Suppose that no strong nonlinearities are present among
the predictors x and that if predictor transformations were used, then the
transformations were chosen without examining the response. Also assume
that
Y = tλo(Z) = x^T β + e.
Suppose that a transformation tλ is chosen without examining the response.
Results in Li and Duan (1989), Chen and Li (1998), and Chang and Olive
(2010) suggest that if x has an approximate elliptically contoured distribu-
tion, then the OLS ANOVA F, partial F, and Wald t tests will have the
correct level asymptotically, even if λ ≠ λo.
Now assume that the response is used to choose λ. For example, assume
that the numerical Box-Cox method is used. Then λ̂ is likely to be variable
unless the sample size is quite large, and considerable bias can be introduced,
as observed by Bickel and Doksum (1981). Now assume that λ̂ is chosen with
the graphical method (and assume that ties are broken by using theory or by
using the following list in decreasing order of importance 1, 0, 1/2, −1, and
1/3 so that the log transformation is chosen over the cube root transformation
if both look equally good). Then λ̂ will often rapidly converge in probability
to a value λ* ∈ ΛL. Hence for moderate sample sizes, it may be reason-
able to assume that the OLS tests have approximately the correct level. Let
W = tλ*(Z) and perform the OLS regression of W on x. If the response and
residual plots suggest that the MLR model is appropriate, then the response
transformation from the graphical method will be useful for description and
exploratory purposes, and may be useful for prediction and inference. If a
numerical method is used to choose λ, perhaps in an interval or on a coarse
grid, the Olive (2016a) bootstrap tests for H0: Aβ = c may be useful.
The MLR assumptions always need to be checked after making a response
transformation. Since the graphical method uses a response plot to choose
the transformation, the graphical method should be much more reliable than
a numerical method. Transformation plots should be made if a numerical
method is used, but numerical methods are not needed to use the graphical
method.

Variable Selection and Multicollinearity

The literature on numerical methods for variable selection in the OLS


multiple linear regression model is enormous. Three important papers are
Jones (1946), Mallows (1973), and Furnival and Wilson (1974). Chatterjee
and Hadi (1988, pp. 43-47) give a nice account on the effects of overfitting
on the least squares estimates. Ferrari and Yang (2015) give a method for
testing whether a model is underfitting. Section 3.4.1 followed Olive (2016a)
closely. See Olive (2016b) for more on prediction regions. Also see Claeskens
and Hjort (2003), Hjort and Claeskens (2003), and Efron et al. (2004). Texts
include Burnham and Anderson (2002), Claeskens and Hjort (2008), and
Linhart and Zucchini (1986).
Cook and Weisberg (1999a, pp. 264-265) give a good discussion of the
effect of deleting predictors on linearity and the constant variance assumption.
Walls and Weeks (1969) note that adding predictors increases the variance
of a predicted response. Also R2 gets large. See Freedman (1983).
Discussion of biases introduced by variable selection and data snooping
include Hurvich and Tsai (1990), Leeb and Pötscher (2006), Selvin and Stuart
(1966), and Hjort and Claeskens (2003). This theory assumes that the full
model is known before collecting the data, but in practice the full model is
often built after collecting the data. Freedman (2005, pp. 192-195) gives an
interesting discussion on model building and variable selection.
The predictor variables can be transformed if the response is not used,
and then inference can be done for the linear model. Suppose the p predictor
variables are fixed so Y = t(Z) = Xβ + e, and the computer program outputs
β̂, after doing an automated response transformation and automated variable
selection. Then the nonlinear estimator β̂ can be bootstrapped. See Olive
(2016a). If data snooping, such as using graphs, is used to select the response
transformation and the submodel from variable selection, then strong, likely
unreasonable assumptions are needed for valid inference for the final nonlinear
model.
Olive and Hawkins (2005) discuss influential cases in variable selection,
as do Léger and Altman (1993). The interpretation of Mallows' Cp given in
Proposition 3.2 is due to Olive and Hawkins (2005), who show that the Cp
statistic can be used for variable selection for many 1D regression models,
such as GLMs, where SP = β^T x. Other interpretations of the Cp statistic
specific to MLR can be given. See Gilmour (1996). The Cp statistic is due to
Jones (1946). Also see Kennard (1971).
The AIC(I) statistic is often used instead of Cp(I). The full model and
the model Imin found with the smallest AIC are always of interest. Burnham
and Anderson (2004) suggest that if Δ(I) = AIC(I) − AIC(Imin), then
models with Δ(I) ≤ 2 are good, models with 4 ≤ Δ(I) ≤ 7 are borderline,
and models with Δ(I) > 10 should not be used as the final submodel. Find
the submodel II with the smallest number of predictors such that Δ(II) ≤ 2.
Then II is the initial submodel to examine, and often II = Imin. Also examine
submodels I with fewer predictors than II with Δ(I) ≤ 7. See Chapter 13.
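The Δ(I) screening rule can be sketched numerically (Python; the AIC values below are invented purely for illustration):

```python
def aic_deltas(aic_by_model):
    """Burnham and Anderson screening: Delta(I) = AIC(I) - AIC(Imin)."""
    aic_min = min(aic_by_model.values())
    return {m: a - aic_min for m, a in aic_by_model.items()}

# hypothetical AIC values for a full model and three submodels
aics = {"full": 112.4, "I1": 108.1, "I2": 108.9, "I3": 119.5}
deltas = aic_deltas(aics)
good = [m for m, d in deltas.items() if d <= 2]          # competitive models
borderline = [m for m, d in deltas.items() if 4 <= d <= 7]
avoid = [m for m, d in deltas.items() if d > 10]         # not the final submodel
```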
When there are strong linear relationships among the predictors, multi-
collinearity is present. Let Rk^2 be the coefficient of multiple determination
when xk is regressed on the remaining predictor variables, including a con-
stant. The variance inflation factor is VIF(k) = 1/(1 − Rk^2). Both Rk^2 and
VIF(k) are large when multicollinearity is present. Following Cook and Weis-
berg (1999a, p. 274), if sk is the sample standard deviation of xk, then the
standard error of β̂k is

se(β̂k) = [√MSE / (sk √(n − 1))] √(1/(1 − Rk^2)) = [√MSE / (sk √(n − 1))] √VIF(k).

Hence β̂k becomes more difficult to estimate when multicollinearity is present.


Variable selection is a useful way to reduce multicollinearity, and alternatives
such as ridge regression are discussed in Gunst and Mason (1980). See James
et al. (2013, ch. 6) for more information on variable selection, ridge regression,
and lasso. Belsley (1984) shows that centering the data before diagnosing the
data for multicollinearity is not necessarily a good idea.
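The VIF computation can be sketched directly from its definition (Python for illustration; in R, for example, the vif function in the car package computes this):

```python
import numpy as np

def vif(X, k):
    """Variance inflation factor VIF(k) = 1/(1 - R_k^2), where R_k^2 comes
    from regressing column k of X on the remaining columns plus a constant."""
    y = X[:, k]
    Z = np.column_stack([np.ones(len(X)), np.delete(X, k, axis=1)])
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r2 = 1 - np.sum((y - Z @ b) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
low = vif(np.column_stack([x1, x2]), 0)   # near 1: no multicollinearity
high = vif(np.column_stack([x1, x1 + 0.01 * rng.normal(size=n)]), 0)
```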
We note that the pollution data of Example 3.7 has been heavily analyzed
in the ridge regression literature, but this data was easily handled by the log
rule combined with variable selection. The pollution data can be obtained
from this text's website, or from the STATLIB website:
(http://lib.stat.cmu.edu/).
The leaps function in R and Proc Rsquare in SAS can be used to perform
all subsets variable selection with the Cp criterion. The step and regsubsets
functions in R can be used for forward selection and backward elimination.
See Problem 3.6. Get more information on these R functions with the follow-
ing commands.
?step
library(leaps)
?leaps
?regsubsets
Bootstrap

Olive (2016a,b,c) showed that the prediction region method for creating
a large sample 100(1 − δ)% confidence region for an r × 1 parameter vector
is a special case of the percentile method when r = 1, and gave sufficient
conditions for r > 1. The shorth method gives the shortest percentile method
intervals, asymptotically, and should be used when B ≥ 1000. Efron (2014)
reviews some alternative methods for variable selection inference.
Consider the residual bootstrap, and let r^W denote an n × 1 random vector
of elements selected with replacement from the n residuals r1, . . . , rn. Then
there are K = n^n possible values for r^W. Let r^W_1, . . . , r^W_K be the possible
values of r^W. These values are equally likely, so are selected with probability
1/K. Note that r^W has a discrete distribution. Then

E(r^W_j) = (E(r1j), . . . , E(rnj))^T.

Now the marginal distribution of rij takes on the n values r1, . . . , rn with
the same probability 1/n. So each of the n marginal distributions is the
empirical distribution of the residuals. Hence E(rij) = Σ(i=1 to n) ri/n = r̄, and
r̄ = 0 for least squares residuals for multiple linear regression when there
is a constant in the model. So for least squares, E(r^W_j) = 0, and E(β̂*_j) =
(X^T X)^(−1) X^T E(Ŷ + r^W_j) = (X^T X)^(−1) X^T Ŷ = (X^T X)^(−1) X^T HY =
(X^T X)^(−1) X^T Y = β̂
since HX = X and X^T H = X^T. Here j = 1, . . . , B.
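The residual bootstrap just described can be sketched as follows (Python for illustration; averaging the bootstrap estimates should recover β̂, matching the expectation computed above):

```python
import numpy as np

def residual_bootstrap(X, Y, B=200, seed=0):
    """Residual bootstrap for OLS: draw r^W with replacement from the
    residuals, refit on Yhat + r^W, and collect the B bootstrap betahats."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    bhat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    fit = X @ bhat
    res = Y - fit
    boot = np.empty((B, p))
    for j in range(B):
        rstar = rng.choice(res, size=n, replace=True)
        boot[j], *_ = np.linalg.lstsq(X, fit + rstar, rcond=None)
    return boot, bhat

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = 1 + 2 * X[:, 1] - X[:, 2] + rng.normal(size=n)
boot, bhat = residual_bootstrap(X, Y, B=400)
# the bootstrap means sit close to betahat, as the expectation argument predicts
```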

Diagnostics

Excellent introductions to OLS diagnostics include Fox (1991) and Cook
and Weisberg (1999a, pp. 161-163, 183-184, section 10.5, section 10.6, ch. 14,
ch. 15, ch. 17, ch. 18, and section 19.3). More advanced works include Belsley
et al. (1980), Cook and Weisberg (1982), Atkinson (1985), and Chatterjee and
Hadi (1988). Hoaglin and Welsch (1978) examine the hat matrix while Cook
(1977) introduces Cook's distance. Also see Velleman and Welsch (1981).
Cook and Weisberg (1997, 1999a: ch. 17) call a plot that emphasizes model
agreement a model checking plot.

Outliers

Olive (2008) is an introduction to outlier detection and robust regression.


Also see Olive (2005) and Olive and Hawkins (2011). Some useful properties
of the DD plot are given in Olive (2002). Theory for the FCH, RFCH, and
RMVN estimators is given in Olive (2008: ch. 10, 2016c: ch. 4) and Olive and
Hawkins (2010). These three estimators are also used in Zhang et al. (2012).
Lasso and Other Variable Selection Techniques
Response plots, prediction intervals, and the bootstrap prediction region
method are also useful for other variable selection techniques such as lasso
and ridge regression. If n 400 and p 3000, Bertsimas et al. (2016) give a
fast all subsets variable selection method.
Recent theory for lasso assumes that λ is selected before looking at the
data, rather than being estimated using k-fold cross validation. See Hastie
et al. (2015). The prediction region method appears to be useful when n >> p
if none of the βi = 0, but (in 2016) it takes a long time to simulate lasso with
k-fold cross validation.
Lasso seems to work under ASSUMPTION L: assume the predictors are
uncorrelated or the number of active predictors (predictors with nonzero
coefficients) is not much larger than 20. When n is fixed and p increases, the
lasso prediction intervals increase in length slowly provided that assumption
L holds. Methods are being developed that should work under more reasonable
assumptions. See Pelawa Watagoda (2017) and Pelawa Watagoda and Olive
(2017).

3.9 Problems

Problems with an asterisk * are especially important.

Output for problem 3.1. Current terms:


(finger to ground nasal height sternal height)
df RSS | k C_I
Delete: nasal height 73 35567.2 | 3 1.617
Delete: finger to ground 73 36878.8 | 3 4.258
Delete: sternal height 73 186259. | 3 305.047

3.1. From the above output from backward elimination, what terms
should be used in the MLR model to predict Y? (You can tell that the
nontrivial variables are finger to ground, nasal height, and sternal height from
the delete lines. DON'T FORGET THE CONSTANT!)

3.2. The table on the following page gives summary statistics for 4 MLR
models considered as final submodels after performing variable selection. The
response plot and residual plot for the full model L1 was good. Model L3 was
the minimum Cp model found. Which model should be used as the final
submodel? Explain briefly why each of the other 3 submodels should not be
used.
Output for Problem 3.2.


                              L1      L2      L3      L4
# of predictors               10      6       4       3
# with 0.01 ≤ p-value ≤ 0.05  0       0       0       0
# with p-value > 0.05         6       2       0       0
R^2(I)                        0.774   0.768   0.747   0.615
corr(Ŷ, ŶI)                   1.0     0.996   0.982   0.891
Cp(I)                         10.0    3.00    2.43    22.037
√MSE                          63.430  61.064  62.261  75.921
p-value for partial F test    1.0     0.902   0.622   0.004

Output for Problem 3.3.


                              L1      L2      L3      L4
# of predictors               10      5       4       3
# with 0.01 ≤ p-value ≤ 0.05  0       1       0       0
# with p-value > 0.05         8       0       0       0
R^2(I)                        0.655   0.650   0.648   0.630
corr(Ŷ, ŶI)                   1.0     0.996   0.992   0.981
Cp(I)                         10.0    4.00    5.60    13.81
√MSE                          73.548  73.521  73.894  75.187
p-value for partial F test    1.0     0.550   0.272   0.015

3.3. The above table gives summary statistics for 4 MLR models consid-
ered as final submodels after performing variable selection. The response plot
and residual plot for the full model L1 was good. Model L2 was the minimum
Cp model found.
a) Which model is II , the initial submodel to look at?
b) What other model or models, if any, should be examined?

3.4. The output below and on the following page is from software that
does all subsets variable selection. The data is from Ashworth (1842). The
predictors were A = log(1692 property value), B = log(1841 property value),
and C = log(percent increase in value), while the response variable is Y =
log(1841 population).
a) The top output corresponds to data with 2 small outliers. From this
output, what is the best model? Explain briefly.
b) The bottom output corresponds to the data with the 2 outliers removed.
From this output, what is the best model? Explain briefly.

Output for Problem 3.4.


ADJ 99 cases 2 outliers
k CP R SQ R SQ RESID SS VARIABLES
-- ----- ---- ---- -------- -------------
1 760.7 0.0000 0.0000 185.928 INTERCEPT ONLY
2 12.7 0.8732 0.8745 23.3381 B
2 335.9 0.4924 0.4976 93.4059 A


2 393.0 0.4252 0.4311 105.779 C
3 12.2 0.8748 0.8773 22.8088 B C
3 14.6 0.8720 0.8746 23.3179 A B
3 15.7 0.8706 0.8732 23.5677 A C
4 4.0 0.8857 0.8892 20.5927 A B C

ADJ 97 cases after deleting 2 outliers

k CP R SQ R SQ RESID SS VARIABLES
-- ----- ---- ------ -------- -------------
1 903.5 0.0000 0.0000 183.102 INTERCEPT ONLY
2 0.7 0.9052 0.9062 17.1785 B
2 406.6 0.4944 0.4996 91.6174 A
2 426.0 0.4748 0.4802 95.1708 C
3 2.1 0.9048 0.9068 17.0741 A C
3 2.6 0.9043 0.9063 17.1654 B C
3 2.6 0.9042 0.9062 17.1678 A B
4 4.0 0.9039 0.9069 17.0539 A B C
R Problems

Use the command source("G:/lregpack.txt") to download the functions
and the command source("G:/lregdata.txt") to download the data.
See Preface or Section 14.1. Typing the name of the lregpack function,
e.g. tplot, will display the code for the function. Use the args command,
e.g. args(tplot), to display the needed arguments for the function. For some
of the following problems, the R commands can be copied and pasted from
(http://lagrange.math.siu.edu/Olive/lreghw.txt) into R.
3.5*. a) Download the R function tplot that makes the transformation
plots for λ ∈ ΛL.
b) Use the following R command to make a 100 × 3 matrix. The columns
of this matrix are the three nontrivial predictor variables.
nx <- matrix(rnorm(300),nrow=100,ncol=3)
Use the following command to make the response variable Y.
y <- exp( 4 + nx%*%c(1,1,1) + 0.5*rnorm(100) )
This command means the MLR model log(Y) = 4 + X2 + X3 + X4 + e will
hold where e ~ N(0, 0.25).
To find the response transformation, you need the program tplot given
in a). Type ls() to see if the programs were downloaded correctly.
c) To make the transformation plots type the following command.
tplot(nx,y)
The first plot will be for λ = −1. Move the cursor to the plot and hold
the rightmost mouse key down and highlight "Stop" to go to the next plot.
Repeat these mouse operations to look at all of the plots. The identity line is
included in each plot. When you get a plot where the plotted points cluster
about the identity line with no other pattern, include this transformation
plot in Word by pressing the Ctrl and c keys simultaneously. This will copy
the graph. Then in Word use the menu command "Paste". You should get
the log transformation.

d) Type the following commands.
out <- lsfit(nx,log(y))
ls.print(out)
Use the mouse to highlight the created output and include the output in
Word.
e) Using the output in d), write down the least squares equation for the
fitted values of log(Y).

3.6. Download cbrainx and cbrainy into R.


The data is the brain weight data from Gladstone (1905). The response Y
is brain weight while the predictors are age, breadth, cephalic, circum, headht,
height, len, sex, and a constant. The step function can be used to perform
forward selection and backward elimination in R.
a) Copy and paste the commands for this problem into R. The commands
fit the full model, display the LS output, and perform backward elimination
using the AIC criterion. Copy and paste the output for backward elimination
into Word (one page of output).

zx <- cbrainx[,c(1,3,5,6,7,8,9,10)]
zbrain <- as.data.frame(cbind(cbrainy,zx))
zfull <- lm(cbrainy~.,data=zbrain)
summary(zfull)
back <- step(zfull)

b) We want low AIC and as few predictors as possible. Backward elimination
starts with the full model then deletes one nontrivial predictor at a time.
The term <None> corresponds to the current model that does not eliminate
any terms. The terms listed above <None> correspond to models that have
smaller AIC than the current model. R stops when eliminating terms makes
the AIC higher than the current model. Which terms, including a constant,
were in this minimum AIC model?
c) Copy and paste the commands for this problem into R. The commands
fit the null model that only contains a constant. Forward selection starts at
the null model (corresponding to lower) and considers 8 nontrivial predictors


(given by upper).
Copy and paste the output for forward selection into Word (two pages of
output).
zint <- lm(cbrainy~1,data=zbrain)
forw <- step(zint,scope=list(lower=~1,
upper=~age+breadth+cephalic+circum+headht+height
+len+sex),direction="forward")
d) Forward selection in R starts with the null model and then adds a
predictor circum to the model. Forward selection in R allows you to consider
models with fewer predictors than the minimum AIC model (unlike backward
elimination). Which terms, including a constant, were in the minimum AIC
model?
e) The following code can be used to do all subsets regression. When k = 6
there is a Cp = 6.009241 corresponding to the 39th model, which looks good.
From the output, this model contains a constant and variables 1 = x2, 3
= x4, 5 = x6, 7 = x8, and 8 = x9. Note that R labels the nontrivial predictors
from 1 to 8, so variable j = xj+1.
library(leaps)
out<-leaps(x=zx,y=cbrainy)
out
plot(out$size,out$Cp)
tem<-1:length(out$size)
tem[out$Cp < 6.01]
for(i in 2:max(out$size))
print( c(i, min(out$Cp[out$size==i])))
out$which[39,]
1 2 3 4 5 6 7 8
TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE
zx[1,]
age breadth cephalic circum headht height len sex
39.0 149.5 81.9 550.0 137.0 68.0 182.5 1.0
Problems using ARC

To quit Arc, move the cursor to the x in the northeast corner and click.
Problems 3.73.11 use data sets that come with Arc (Cook and Weisberg
1999a).
3.7*. a) In Arc enter the menu commands File>Load>Data and open
the file big-mac.lsp. Next use the menu commands Graph&Fit>Plot of to
obtain a dialog window. Double click on TeachSal and then double click on
BigMac. Then click on OK. These commands make a plot of x = TeachSal =
primary teacher salary in thousands of dollars versus y = BigMac = minutes
of labor needed to buy a Big Mac and fries. Include the plot in Word.
Consider transforming y with a (modified) power transformation

y^(λ) = (y^λ − 1)/λ for λ ≠ 0, and y^(λ) = log(y) for λ = 0.

b) Should simple linear regression be used to predict y from x? Explain.
c) In the plot, λ = 1. Which transformation will increase the linearity of
the plot, log(y) or y^(−2)? Explain.

3.8*. In Arc enter the menu commands File>Load>Data and open the
file mussels.lsp. Use the commands Graph&Fit>Scatterplot Matrix of. In
the dialog window select H, L, W, S, and M (so select M last). Click on OK
and include the scatterplot matrix in Word. The response M is the edible part
of the mussel while the 4 predictors are shell measurements. Are any of the
marginal predictor relationships nonlinear? Is E(M|H) linear or nonlinear?

3.9*. The file wool.lsp has data from a 3^3 experiment on the behavior of
worsted yarn under cycles of repeated loadings. The response Y is the number
of cycles to failure and the three predictors are the length, amplitude, and
load. Make five transformation plots by using the following commands.
From the menu Wool select transform and double click on Cycles.
Select modified power and use p = −1, −0.5, 0, and 0.5. Use the menu
commands Graph&Fit>Fit linear LS to obtain a dialog window. Next fit LS
five times. Use Amp, Len, and Load as the predictors for all 5 regressions, but
use Cycles^(−1), Cycles^(−0.5), log[Cycles], Cycles^(0.5), and Cycles as the response.
Use the menu commands Graph&Fit>Plot of to create a dialog window.
Double click on L5:Fit-Values and double click on Cycles, double click on
L4:Fit-Values and double click on Cycles^(0.5), double click on L3:Fit-Values
and double click on log[Cycles], double click on L2:Fit-Values and double click
on Cycles^(−0.5), double click on L1:Fit-Values and double click on Cycles^(−1).
a) You may stop when the resulting plot is linear. Let Z = Cycles. Include
the plot of Ŷ versus Y = Z^(λ) that is linear in Word. Move the OLS slider
bar to 1. What response transformation do you end up using?
b) Use the menu commands Graph&Fit>Plot of and put L5:Fit-Values
in the H box and L3:Fit-Values in the V box. Is the plot linear?

3.10. In Arc enter the menu commands File>Load>Data and open
the file bcherry.lsp. The menu Trees will appear. Use the menu commands
Trees>Transform and a dialog window will appear. Select terms Vol, D,
and Ht. Then select the log transformation. The terms log Vol, log D, and log
Ht should be added to the data set. If a tree is shaped like a cylinder or a
cone, then Vol ∝ D^2 Ht and taking logs results in a linear model.
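The claim that logs linearize the volume relationship can be checked numerically. The cylinder constant π/4 below (with D taken as the diameter) is just an assumption for illustration.

```python
import math

# For a cylinder, Vol = (pi/4) * D^2 * Ht with D the diameter, so
# log(Vol) = log(pi/4) + 2*log(D) + 1*log(Ht):
# exactly linear in log(D) and log(Ht) with slopes 2 and 1.
for D, Ht in [(1.0, 10.0), (2.5, 7.0), (4.0, 30.0)]:
    vol = (math.pi / 4) * D ** 2 * Ht
    lhs = math.log(vol)
    rhs = math.log(math.pi / 4) + 2 * math.log(D) + math.log(Ht)
    assert abs(lhs - rhs) < 1e-12  # holds up to floating point error
print("log Vol is linear in log D and log Ht for a cylinder")
```

A cone only changes the constant (π/12 instead of π/4), not the slopes, which is why the log model is plausible for trees of either shape.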

a) Fit the full model with Y = log Vol, X1 = log D, and X2 = log Ht.
Add the output that has the LS coefficients to Word.
152 3 Building an MLR Model

b) Fitting the full model will result in the menu L1. Use the commands
L1>AVP–All 2D. This will create a plot with a slider bar at the bottom
that says log[D]. This is the added variable plot for log(D). To make an added
variable plot for log(Ht), click on the slider bar. Add the OLS line to the AV
plot for log(Ht) by moving the OLS slider bar to 1, and add the zero line by
clicking on the Zero line box. Include the resulting plot in Word.

c) Fit the reduced model that drops log(Ht). Make an RR plot with the
residuals from the full model on the V axis and the residuals from the sub-
model on the H axis. Add the LS line and the identity line as visual aids.
(Click on the Options menu to the left of the plot and type y=x in the
resulting dialog window to add the identity line.) Include the plot in Word.

d) Similarly make an FF plot using the fitted values from the two models.
Add the OLS line which is the identity line. Include the plot in Word.

e) Next put the residuals from the submodel on the V axis and log(Ht)
on the H axis. Move the OLS slider bar to 1, and include this residual plot
in Word.

f) Next put the residuals from the submodel on the V axis and the fitted
values from the submodel on the H axis. Include this residual plot in Word.

g) Next put log(Vol) on the V axis and the fitted values from the submodel
on the H axis. Move the OLS slider bar to 1, and include this response plot
in Word.

h) Does log(Ht) seem to be an important term? If the only goal is to


predict volume, will much information be lost if log(Ht) is omitted? Beside
each of the 6 plots, remark on the information given by the plot.
(Some of the plots will suggest that log(Ht) is needed while others will suggest
that log(Ht) is not needed.)

3.11∗. a) In this problem we want to build an MLR model to predict
Y = t(BigMac) where t is some power transformation. In Arc enter the
menu commands File>Load>Data and open the file big-mac.lsp. Make a
scatterplot matrix of the variables, except City, and include the plot in
Word.
b) The log rule makes sense for the BigMac data. From the scatterplot
matrix, use the Transformations menu and select Transform to logs.
Include the resulting scatterplot matrix in Word.
c) From the Mac menu, select Transform. Then select all 10 variables
and click on the Log transformations button. Then click on OK. From the

Graph&Fit menu, select Fit linear LS. Use log[BigMac] as the response
and the other 9 log variables as the Terms. This model is the full model.
Include the output in Word.
d) Make a response plot (L1:Fit-Values in H and log(BigMac) in V) and
residual plot (L1:Fit-Values in H and L1:Residuals in V), and include both
plots in Word.
e) Using the L1 menu, select Examine submodels and try forward
selection and backward elimination. Using the Cp(I) ≤ min(2k, p) rule suggests
that the submodel using log[service], log[TeachSal], and log[TeachTax] may be
good. From the Graph&Fit menu, select Fit linear LS, fit the submodel
and include the output in Word.
f) Make a response plot (L2:Fit-Values in H and log(BigMac) in V) and
residual plot (L2:Fit-Values in H and L2:Residuals in V) for the submodel,
and include the plots in Word.
g) Make an RR plot (L2:Residuals in H and L1:Residuals in V) and FF plot
(L2:Fit-Values in H and L1:Fit-Values in V) for the submodel, and include
the plots in Word. Move the OLS slider bar to 1 in each plot to add the
identity line. For the RR plot, click on the Options menu then type y = x in
the long horizontal box near the bottom of the window and click on OK to
add the identity line.
h) Do the plots and output suggest that the submodel is good? Explain.

Warning: The following problems use data from the book's web-
page (http://lagrange.math.siu.edu/Olive/lregbk.htm). Save the
data files on a flash drive G, say. Get in Arc and use the menu commands
File > Load and a window will appear. Click on Removable Disk (G:).
Then click twice on the data set name.

3.12∗. The following data set has 5 babies that are good leverage
points: they look like outliers but should not be deleted because they follow
the same model as the bulk of the data.
a) In Arc enter the menu commands File>Load>Removable Disk (G:)
and open the file cbrain.lsp. Select transform from the cbrain menu, and add
size^(1/3) using the power transformation option (p = 1/3). From
Graph&Fit, select Fit linear LS. Let the response be brnweight and as terms
include everything but size and Obs. Hence your model will include size^(1/3).
This regression will add L1 to the menu bar. From this menu, select Examine
submodels. Choose forward selection. You should get models including k = 2
to 12 terms including the constant. Find the model with the smallest Cp(I) =
CI statistic and include all models with the same k as that model in Word.
That is, if k = 2 produced the smallest CI, then put the block with k = 2
into Word. Next go to the L1 menu, choose Examine submodels and choose
Backward Elimination. Find the model with the smallest CI and include all
of the models with the same value of k in Word.

b) Which minimum Cp model was chosen by forward selection?

c) Which minimum Cp model was chosen by backward elimination?

d) Which minimum Cp model do you prefer? Explain.

e) Give an explanation for why the two models are different.

f) Pick a submodel and include the regression output in Word.

g) For your submodel in f), make an RR plot with the residuals from the
full model on the V axis and the residuals from the submodel on the H axis.
Add the OLS line and the identity line y=x as visual aids. Include the RR
plot in Word.

h) Similarly make an FF plot using the fitted values from the two models.
Add the OLS line which is the identity line. Include the FF plot in Word.

i) Using the submodel, include the response plot (of Ŷ versus Y ) and
residual plot (of Ŷ versus the residuals) in Word.

j) Using results from f)-i), explain why your submodel is a good model.

3.13. Activate the cyp.lsp data set. Choosing no more than 3 nonconstant
terms, try to predict height with multiple linear regression. Include a plot with
the fitted values on the horizontal axis and height on the vertical axis. Is your
model linear? Also include a plot with the fitted values on the horizontal axis
and the residuals on the vertical axis. Does the residual plot suggest that the
linear model may be inappropriate? (There may be outliers in the plot. These
could be due to typos or because the error distribution has heavier tails than
the normal distribution.) State which model you use.

3.14. Activate the insulation data, contributed by Elizabeth Spector,


with the commands File>Load>Removable Disk (G:)>insulation.lsp.
The data description should appear in the Listener window.
Then go to the Graph&Fit menu and choose Plot of . . . and select
time for the H box, y for the V box, and type for the Mark by
box. Then click on OK and a window with a plot should open.
a) The OLS popdown menu is the triangle below OLS. Select Fit by
marks–general and then use the cursor to move the small black box to 2 on
the OLS slider bar. Then copy and paste the plot to Word. This command
ts least squares quadratic functions to the data from each of the 5 types of
insulation.
b) If there is no interaction, then the 5 curves will be roughly parallel and
will not cross. The curves will cross if there is interaction. Is there interaction?

c) The top curve corresponds to no insulation, and the temperature rapidly


rose and then rapidly cooled off. Corn pith corresponds to curve 2. Is corn
pith comparable to the more standard types of insulation 3–5?

3.15. Activate the cement.lsp data, contributed by Alyass Hossin. Act
as if 20 different samples were used to collect this data. If 5 measurements
on 4 different samples were used, then experimental design with repeated
measures or longitudinal data analysis may be a better way to analyze this
data.
a) From Graph&Fit select Plot of, place x1 in H, y in V, and x2 in the
Mark by box. From the OLS menu, select Fit by marks–general and move the
slider bar to 2. Include the plot in Word.
b) A quadratic seems to be a pretty good MLR model. From the cement
menu, select Transform, select x1, and place a 2 in the p box. This should
add x1^2 to the data set. From Graph&Fit select Fit linear LS, select x1 and
x1^2 as the terms and y as the response. Include the output in Word.
c) Make the response plot. Again from the OLS menu, select Fit by marks–
general and move the slider bar to 1. Include the plot in Word. This plot
suggests that there is an interaction: the CM cement is stronger for low curing
times and weaker for higher curing times. The plot suggests that there may
not be an interaction between the two new types of cement.
d) Place the residual plot in Word. (Again from the OLS menu, select Fit
by marks–general and move the slider bar to 1.) The residual plot is slightly
fan shaped.
e) From the cement menu, select Make factors and select x2. From the
cement menu, select Make interactions and select x1 and (F)x2. Repeat,
selecting x1^2 and (F)x2. From Graph&Fit select Fit linear LS, select x1, x1^2,
(F)x2, x1*(F)x2, and x1^2*(F)x2 as the terms and y as the response. Include
the output in Word.
f) Include the response plot and residual plot in Word.
g) Next delete the standard cement in order to compare the two coal
based cements. From Graph&Fit select Scatterplotmatrix of, then select x1,
x2, and y. Hold down the leftmost mouse button and highlight the x2 = 2
cases. Then from the Case deletions menu, select Delete selection from data
set. From Graph&Fit select Fit linear LS, select x1, x1^2, x2 as the terms and
y as the response. Include the output in Word. The output suggests that the
MA brand is about 320 psi less strong than the ME brand. (May need to add
x2*x1 and x2*x1^2 interactions.)
h) Include the response plot and residual plot in Word. The residual plot
is not particularly good.

3.16. This problem gives a slightly simpler model than Problem 3.15 by
using the indicator variable x3 = 1 if standard cement (if x2 = 2) and x3 =
0 otherwise (if x2 is 0 or 1). Activate the cement.lsp data.

a) From the cement menu, select Transform, select x1, and place a 2 in the
p box. This should add x1^2 to the data set. From the cement menu, select
Make interactions and select x1 and x3.
b) From Graph&Fit select Fit linear LS, select x1, x1^2, x3, and x1*x3 as
the terms and y as the response. Include the output in Word.
c) Make the response and residual plots. When making these plots, place
x2 in the Mark by box. Include the plots in Word. Does the model seem ok?

3.17∗. Get the McDonald and Schwing (1973) data pollution.lsp from
(http://lagrange.math.siu.edu/Olive/lregbk.htm), and save the file on
a flash drive. Activate the pollution.lsp dataset with the menu commands
File > Load > Removable Disk (G:) > pollution.lsp. Scroll up the screen
to read the data description. Often simply using the log rule on the predictors
with max(x)/ min(x) > 10 works wonders.
a) Make a scatterplot matrix of the rst nine predictor variables and the re-
sponse Mort. The commands Graph&Fit > Scatterplot-Matrix of will bring
down a Dialog menu. Select DENS, EDUC, HC, HOUS, HUMID, JANT,
JULT, NONW, NOX, and MORT. Then click on OK.
A scatterplot matrix with slider bars will appear. Move the slider bars for
NOX, NONW, and HC to 0, providing the log transformation. In Arc, the
diagonals have the min and max of each variable, and these were the three
predictor variables satisfying the log rule. Open Word.
In Arc, use the menu commands Edit > Copy. In Word, use the menu
command Paste. This should copy the scatterplot matrix into the Word
document. Print the graph.
b) Make a scatterplot matrix of the last six predictor variables and the
response Mort. The commands Graph&Fit > Scatterplot-Matrix of will
bring down a Dialog menu. Select OVR65, POOR, POPN, PREC, SO,
WWDRK, and MORT. Then click on OK. Move the slider bar of SO to 0
and copy the plot into Word. Print the plot as described in a).
c) Click on the pollution menu and select Transform. Click on the log
transformations button and select HC, NONW, NOX, and SO. Click on OK.
Then fit the full model with the menu commands Graph&Fit > Fit lin-
ear LS. Select MORT for the response. For the terms, select DENS, EDUC,
log[HC], HOUS, HUMID, JANT, JULT, log[NONW], log[NOX], OVR65,
POOR, POPN, PREC, log[SO], and WWDRK. Click on OK.
This model is the full model. To make the response plot use the menu
commands Graph&Fit >Plot of. Select MORT for the V-box and L1:Fit-
Values for the H-box. Click on OK. When the graph appears, move the OLS
slider bar to 1 to add the identity line. Copy the plot into Word.
To make the residual plot use the menu commands Graph&Fit >Plot
of. Select L1:Residuals for the V-box and L1:Fit-Values for the H-box. Click
on OK. Copy the plot into Word. Print the two plots.

d) Using the L1 menu, select Examine submodels and try forward


selection. Using the L1 menu, select Examine submodels and try back-
ward elimination. You should get a lot of output including that shown in
Example 3.7.
Fit the submodel with the menu commands Graph&Fit > Fit linear
LS. Select MORT for the response. For the terms, select EDUC, JANT,
log[NONW], log[NOX], and PREC. Click on OK.
This model is the submodel suggested by backward elimination. To make
the response plot use the menu commands Graph&Fit >Plot of. Select
MORT for the V-box and L2:Fit-Values for the H-box. Click on OK. When
the graph appears, move the OLS slider bar to 1 to add the identity line.
Copy the plot into Word.
To make the residual plot use the menu commands Graph&Fit >Plot
of. Select L2:Residuals for the V-box and L2:Fit-Values for the H-box. Click
on OK. Copy the plot into Word. Print the two plots.
e) To make an RR plot use the menu commands Graph&Fit >Plot of.
Select L1:Residuals for the V-box and L2:Residuals for the H-box. Click on
OK. Move the OLS slider bar to one. On the window for the plot, click on
Options. A window will appear. Type y = x and click on OK to add the
identity line. Copy the plot into Word. Print the plot.
f) To make an FF plot use the menu commands Graph&Fit >Plot of.
Select L1:Fit-Values for the V-box and L2:Fit-Values for the H-box. Click on
OK. Move the OLS slider bar to one and click on OK to add the identity
line. Copy the plot into Word.
g) Using the response and residual plots from the full model and submodel
along with the RR and FF plots, does the submodel seem ok?

3.18. Get the Joanne Numrich data c12.lsp from


(http://lagrange.math.siu.edu/Olive/lregbk.htm), and save the file on
a flash drive. Activate the c12.lsp dataset with the menu commands File >
Load > Removable Disk (G:) > c12.lsp. Scroll up the screen to read the
data description. This data set is described in Example 3.10.
a) A bad model uses Y1 and all 24 nontrivial predictors. There are many
indicator variables. Click on the CLA menu and select Transform. Click on
the log transformations button and select y1 . Click on OK.
b) Use the menu commands Graph&Fit > Fit linear LS. Select log[y1]
for the response. For the terms, select x1, x2, x8, x9, x10, x11, x18, x20, x23,
and x24. Click on OK.
This model will be used as the full model. To make the response plot use
the menu commands Graph&Fit >Plot of. Select log[y1] for the V-box and
L1:Fit-Values for the H-box. Click on OK. When the graph appears, move
the OLS slider bar to 1 to add the identity line. Copy the plot into Word.
To make the residual plot use the menu commands Graph&Fit >Plot
of. Select L1:Residuals for the V-box and L1:Fit-Values for the H-box. Click
on OK. Copy the plot into Word. Print the two plots.

c) As in Problem 3.17, use forward selection, backward elimination, and
plots to find a good submodel.
Using material learned in Chapters 2–3, analyze the data sets described in
Problems 3.19–3.29. Assume that the response variable Y = t(Z) and that
the predictor variables X2 , . . . , Xp are functions of the remaining variables
W2 , . . . , Wr . Unless told otherwise, the full model Y, X1 , X2 , . . . , Xp (where
X1 ≡ 1) should use functions of every variable W2 , . . . , Wr (and often p = r).
(In practice, often some of the variables and some of the cases are deleted,
but we will use all variables and cases, unless told otherwise, primarily so
that the instructor has some hope of grading the problems in a reasonable
amount of time.)
Read the description of the data provided by Arc. Once you have a
good full model, perform forward selection and backward elimination. Find
the model I_min that minimizes Cp(I), find the model I_I with the fewest
number of predictors such that Cp(I_I) ≤ Cp(I_min) + 1 (it is possible that
I_I = I_min), and find the smallest value of k such that Cp(I) ≤ min(p, 2k).
Model I_I often has too many terms while the 2nd model often has too few
terms.
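The three selection rules in the paragraph above can be sketched as follows. The list of (k, Cp(I)) values is made up for illustration, and Cp(I) is assumed to have been computed already (e.g., taken from Arc's Examine submodels output):

```python
# Candidate submodels I, recorded as (k, Cp(I)) pairs where k counts
# the terms in I including the constant; p is the number of terms in
# the full model.  The numbers are made up for illustration.
p = 6
candidates = [(2, 12.1), (3, 2.4), (4, 1.9), (5, 2.7), (6, 6.0)]

# I_min: the submodel minimizing Cp(I).
k_min, cp_min = min(candidates, key=lambda kc: kc[1])

# I_I: the fewest-term submodel with Cp(I_I) <= Cp(I_min) + 1.
k_II, cp_II = min((kc for kc in candidates if kc[1] <= cp_min + 1),
                  key=lambda kc: kc[0])

# Smallest k with Cp(I) <= min(p, 2k).
k_rule = min(k for k, cp in candidates if cp <= min(p, 2 * k))

print(k_min, k_II, k_rule)  # 4 3 3
```

Here the three rules pick models with 4, 3, and 3 terms; on real data they can disagree more sharply, which is why the text recommends checking the candidate submodels with plots.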
a) Give the output for your full model, including Y = t(Z) and R2 . If it
is not obvious from the output what your full model is, then write down the
full model. Include a response plot for the full model. (This plot should be
linear.) Also include a residual plot.
b) Give the output for your final submodel. If it is not obvious from the
output what your submodel is, then write down the final submodel.
c) Give between 3 and 5 plots that justify that your multiple linear regres-
sion submodel is reasonable. Below or beside each plot, give a brief explana-
tion for how the plot gives support for your model.
3.19. For the file bodfat.lsp, described in Problem 2.2, use Z = Y =
bodyfat but do not use X1 = density as a predictor in the full model. You
may use the remaining 13 nontrivial predictor variables. Do parts a), b), and
c) above.
3.20∗. For the file boston2.lsp use Z = (y =) CRIM. Do parts a), b), and
c) above Problem 3.19.
Note: Y = log(CRIM ), X4 , X8 , is an interesting submodel, but more pre-
dictors are probably needed. The data set comes from Harrison and Rubinfeld
(1978).
3.21∗. For the file major.lsp, described in Example 2.3, use Z = Y . Do
parts a), b), and c) above Problem 3.19.
Note: there are 1 or more outliers that affect numerical methods of variable
selection.
3.22. For the file marry.lsp, described in Example 2.12, use Z = Y . This
data set comes from Hebbler (1847). The census takers were not always willing

to count a woman's husband if he was not at home. Do not use the predictor
X2 in the full model. Do parts a), b), and c) above Problem 3.19.

3.23∗. For the file museum.lsp, described below, use Z = Y . Do parts a),
b), and c) above Problem 3.19.
This data set consists of measurements taken on skulls at a museum and
was extracted from tables in Schaaffhausen (1878). There are at least three
groups of data: humans, chimpanzees, and gorillas. The OLS fit obtained from
the humans passes right through the chimpanzees. Since Arc numbers cases
starting at 0, cases 47–59 are apes. These cases can be deleted by highlighting
the cases with small values of Y in the scatterplot matrix and using the
case deletions menu. (You may need to maximize the window containing the
scatterplot matrix in order to see this menu.)
i) Try variable selection using all of the data.
ii) Try variable selection without the apes.
If all of the cases are used, perhaps only X1 , X2 , and X3 should be used
in the full model. Note that Y and X2 have high correlation.

3.24∗. For the file pop.lsp, described below, use Z = Y . Do parts a), b),
and c) above Problem 3.19.
This data set comes from Ashworth (1842). Try transforming all variables
to logs. Then the added variable plots show two outliers. Delete these two
cases. Notice the effect of these two outliers on the p-values for the coefficients
and on numerical methods for variable selection.
Note: then log(Y ) and log(X2 ) make a good submodel.

3.25∗. For the file pov.lsp, described below, use i) Z = flife and ii)
Z = gnp2 = gnp + 2. This data set comes from Rouncefield (1995). Making
loc into a factor may be a good idea. Use the commands poverty>Make factors
and select the variable loc. For ii), try transforming to logs and deleting the 6
cases with gnp2 = 0. (These cases had missing values for gnp. The file povc.lsp
has these cases deleted.) Try your final submodel on the data that includes
the 6 cases with gnp2 = 0. Do parts a), b), and c) above Problem 3.19.

3.26∗. For the file skeleton.lsp, described below, use Z = y.
This data set is also from Schaaffhausen (1878). At one time I heard or
read a conversation between a criminal forensics expert with his date. It went
roughly like "If you wound up dead and I found your femur, I could tell what
your height was to within an inch." Two things immediately occurred to me.
The first was "no way" and the second was that the man must not get many
dates! The files cyp.lsp and major.lsp have measurements including height,
but their R2 < 0.9. The skeleton data set has at least four groups: stillborn
babies, newborns and children, older humans, and apes.
a) Take logs of each variable and fit the regression of log(Y) on log(X1 ),
. . . , log(X13 ). Make a residual plot and highlight the case with the smallest
residual. From the Case deletions menu, select Delete selection from data

set. Go to Graph&Fit and again fit the regression of log(Y) on log(X1 ), . . . ,
log(X13 ) (you should only need to click on OK). The output should say that
case 37 has been deleted. Include this output for the full model in Word.
b) Do part b) above Problem 3.19.
c) Do part c) above Problem 3.19.

3.27. Activate big-mac.lsp in Arc. Assume that a multiple linear regres-


sion model holds for t(y) and some terms (functions of the predictors) where
y is BigMac = hours of labor to buy Big Mac and fries. Using techniques you
have learned in class find such a model. (Hint: Recall from Problem 3.11
that transforming all variables to logs and then using the model constant,
log(service), log(TeachSal) and log(TeachTax) was ok but the residuals did
not look good. Try adding a few terms from the minimal Cp model.)
a) Write down the full model that you use (e.g., a very poor full model is
exp(BigMac) = β1 + β2 exp(EngSal) + β3 (TeachSal)^3 + e) and include a
response plot for the full model. (This plot should be linear.) Give R2 for the
full model.

b) Write down your final model (e.g., a very poor final model is
exp(BigMac) = β1 + β2 exp(EngSal) + β3 (TeachSal)^3 + e).
c) Include the least squares output for your model and between 3 and 5
plots that justify that your multiple linear regression model is reasonable.
Below or beside each plot, give a brief explanation for how the plot gives
support for your model.

3.28. This is like Problem 3.27 with the BigMac data. Assume that a
multiple linear regression model holds for Y = t(Z) and for some terms
(usually powers or logs of the predictors). Using the techniques learned in
class, find such a model. Give output for the full model, output for the final
submodel and use several plots to justify your choices. These data sets, as
well as the BigMac data set, come with Arc. See Cook and Weisberg (1999a).
(INSTRUCTOR: Allow 2 hours for each part.)

file "response" Z
a) allomet.lsp BRAIN
b) casuarin.lsp W
c) evaporat.lsp Evap
d) hald.lsp Y
e) haystack.lsp Vol
f) highway.lsp rate
(From the menu Highway, select "Add a variate" and
type sigsp1 = sigs + 1. Then you can transform
sigsp1.)
g) landrent.lsp Y
h) ozone.lsp ozone
i) paddle.lsp Weight

j) sniffer.lsp Y
k) water.lsp Y

i) Write down the full model that you use and include the full model
residual plot and response plot in Word. Give R2 for the full model.

ii) Write down the final submodel that you use.

iii) Include the least squares output for your model and between 3 and
5 plots that justify that your multiple linear regression model is reasonable.
Below or beside each plot, give a brief explanation for how the plot gives
support for your model.
3.29∗. a) Activate buxton.lsp (you need to download the file onto your
flash drive Removable Disk (G:)). From the Graph&Fit menu, select Fit
linear LS. Use height as the response variable and bigonal breadth, cephalic
index, head length, and nasal height as the predictors. Include the output in
Word.
b) Make a response plot (L1:Fit-Values in H and height in V) and residual
plot (L1:Fit-Values in H and L1:Residuals in V) and include both plots in
Word.
c) In the residual plot use the mouse to move the cursor just above and
to the left of the outliers. Hold the leftmost mouse button down and move
the mouse to the right and then down. This will make a box on the residual
plot that contains the outliers. Go to the Case deletions menu and click
on Delete selection from data set. From the Graph&Fit menu, select Fit
linear LS and fit the same model as in a) (the model should already be
entered, just click on OK). Include the output in Word.
d) Make a response plot (L2:Fit-Values in H and height in V) and residual
plot (L2:Fit-Values in H and L2:Residuals in V) and include both plots in
Word.
e) Explain why the outliers make the MLR relationship seem much
stronger than it actually is. (Hint: look at R2 .)
Variable Selection in SAS

3.30. Copy and paste the SAS program for this problem into the SAS
editor. Then perform the menu commands Run>Submit to obtain about
15 pages of output. Do not print out the output.
The data is from SAS Institute (1985, pp. 695–704, 717–718). Aerobic
fitness is being measured by the ability to consume oxygen. The response
Y = Oxygen (uptake rate) is expensive to measure, and it is hoped that
the OLS Ŷ can be used instead. The variables are Age in years, Weight in
kg, RunTime = time in minutes to run 1.5 miles, RunPulse = heart rate
when Y is measured, RestPulse = heart rate while resting, and MaxPulse =
maximum heart rate recorded while running.

The selection commands do forward selection, backward elimination, and


all subsets selection where the best ten models with the lowest Cp are
recorded. The proc rsquare command also does all subsets regression with
the Cp criterion.
The plots give the response and residual plots for the full model and the
submodel that used Age, RunTime, RunPulse, MaxPulse, and a constant.
a) Was the above plot for the minimum Cp model?
b) Do the plots suggest that the submodel was good?

Variable Selection in Minitab


3.31. Get the data set prof.mtb as described in Problem 2.15. The data is
described in McKenzie and Goldman (1999, pp. ED-22-ED-23). Assign the re-
sponse variable to be instrucr (the instructor rating from course evaluations)
and the predictors to be interest in the course, manner of the instructor, and
course = rating of the course.
a) To get residual and response plots you need to store the residuals and
tted values. Use the menu commands Stat>Regression>Regression to get
the regression window. Put instrucr in the Response and interest, manner,
and course in the Predictors boxes. Then click on Storage. From the result-
ing window click on Fits and Residuals. Then click on OK twice.
b) To get a response plot, use the commands Graph>Plot, (double click)
place instrucr in the Y box, and Fits1 in the X box. Then click on OK. Print
the plot by clicking on the graph and then clicking on the printer icon.
c) To make a residual plot, use the menu commands Graph>Plot to get
a window. Place Resi1 in the Y box and Fits1 in the X box. Then click
on OK. Print the plot by clicking on the graph and then clicking on the
printer icon.
d) To perform all subsets regression, use the menu commands
Stat>Regression>Best Subsets to get the regression window. Put instrucr
in the Response and interest, manner, and course in the Free predictors
boxes. Which submodel is good?
Chapter 4
WLS and Generalized Least Squares

4.1 Random Vectors

The concepts of a random vector, the expected value of a random vector, and
the covariance of a random vector are needed before covering generalized least
squares. Recall that for random variables Yi and Yj , the covariance of Yi and
Yj is Cov(Yi , Yj ) ≡ σ_{i,j} = E[(Yi − E(Yi ))(Yj − E(Yj ))] = E(Yi Yj ) − E(Yi )E(Yj )
provided the second moments of Yi and Yj exist.
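The two expressions for the covariance agree exactly; here is a check on a small made-up discrete joint distribution, using exact rational arithmetic:

```python
from fractions import Fraction as F

# A made-up joint pmf for (Y1, Y2) on {0,1} x {0,1}.
pmf = {(0, 0): F(1, 4), (0, 1): F(1, 4), (1, 0): F(1, 8), (1, 1): F(3, 8)}
assert sum(pmf.values()) == 1

E_Y1 = sum(p * y1 for (y1, y2), p in pmf.items())         # 1/2
E_Y2 = sum(p * y2 for (y1, y2), p in pmf.items())         # 5/8
E_Y1Y2 = sum(p * y1 * y2 for (y1, y2), p in pmf.items())  # 3/8

# Covariance by the definition E[(Y1 - E(Y1))(Y2 - E(Y2))] ...
cov_def = sum(p * (y1 - E_Y1) * (y2 - E_Y2) for (y1, y2), p in pmf.items())
# ... and by the shortcut E(Y1 Y2) - E(Y1) E(Y2).
cov_short = E_Y1Y2 - E_Y1 * E_Y2

print(cov_def, cov_short)  # 1/16 1/16
```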

Definition 4.1. Y = (Y1 , . . . , Yn )^T is an n × 1 random vector if Yi is
a random variable for i = 1, . . . , n. Y is a discrete random vector if each Yi
is discrete, and Y is a continuous random vector if each Yi is continuous.
A random variable Y1 is the special case of a random vector with n = 1.

Definition 4.2. The population mean of a random n × 1 vector Y =
(Y1 , . . . , Yn )^T is
E(Y ) = (E(Y1 ), . . . , E(Yn ))^T
provided that E(Yi ) exists for i = 1, . . . , n. Otherwise the expected value
does not exist. The n × n population covariance matrix

Cov(Y ) = E[(Y − E(Y ))(Y − E(Y ))^T ] = (σ_{i,j})

where the ij entry of Cov(Y ) is Cov(Yi , Yj ) = σ_{i,j} provided that each σ_{i,j}
exists. Otherwise Cov(Y ) does not exist.

The covariance matrix is also called the variance–covariance matrix and
variance matrix. Sometimes the notation Var(Y ) is used. Note that Cov(Y )
is a symmetric positive semidefinite matrix. If Z and Y are n × 1 random
vectors, a a conformable constant vector, and A and B are conformable
constant matrices, then

E(a + Y ) = a + E(Y ) and E(Y + Z) = E(Y ) + E(Z) (4.1)


and
E(AY ) = AE(Y ) and E(AY B) = AE(Y )B. (4.2)
Also
Cov(a + AY ) = Cov(AY ) = A Cov(Y ) A^T. (4.3)

Example 4.1. Consider the OLS model Y = Xβ + e where the ei are
iid with mean 0 and variance σ^2. Then Y and e are random vectors while
a = Xβ is a constant vector. Notice that E(e) = 0. Thus

E(Y ) = Xβ + E(e) = Xβ.

Since the ei are iid,

Cov(Y ) = Cov(e) = σ^2 I_n (4.4)

where I_n is the n × n identity matrix. This result makes sense because the
Yi are independent with Yi = x_i^T β + ei . Hence VAR(Yi ) = VAR(ei ) = σ^2.
Recall that β̂_OLS = (X^T X)^{-1} X^T Y . Hence

E(β̂_OLS ) = (X^T X)^{-1} X^T E(Y ) = (X^T X)^{-1} X^T Xβ = β.

That is, β̂_OLS is an unbiased estimator of β. Using (4.3) and (4.4),

Cov(β̂_OLS ) = (X^T X)^{-1} X^T Cov(Y )X(X^T X)^{-1}

= σ^2 (X^T X)^{-1} X^T X(X^T X)^{-1} = σ^2 (X^T X)^{-1}.


Recall that Ŷ_OLS = X β̂_OLS = X(X^T X)^{-1} X^T Y = HY . Hence

E(Ŷ_OLS ) = X(X^T X)^{-1} X^T E(Y ) = X(X^T X)^{-1} X^T Xβ = Xβ = E(Y ).

Using (4.3) and (4.4),

Cov(Ŷ_OLS ) = H Cov(Y )H^T = σ^2 H

since H^T = H and HH = H.
Recall that the vector of residuals r_OLS = (I − H)Y = Y − Ŷ_OLS . Hence
E(r_OLS ) = E(Y ) − E(Ŷ_OLS ) = E(Y ) − E(Y ) = 0. Using (4.3) and (4.4),

Cov(r_OLS ) = (I − H)Cov(Y )(I − H)^T = σ^2 (I − H)

since I − H is symmetric and idempotent: (I − H)^T = I − H and
(I − H)(I − H) = I − H.
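The symmetry and idempotency of H (and hence of I − H) used in this example can be verified exactly on a toy design matrix; this is a numerical illustration, not code from the book:

```python
from fractions import Fraction as F

def mat(rows):
    """Build a matrix of exact Fractions from integer rows."""
    return [[F(v) for v in r] for r in rows]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(r) for r in zip(*A)]

# Toy design matrix: a constant column and one predictor.
X = mat([[1, 0], [1, 1], [1, 2]])
Xt = transpose(X)
XtX = matmul(Xt, X)                  # [[3, 3], [3, 5]]

# Exact inverse of the 2x2 matrix X^T X via the adjugate formula.
(a, b), (c, d) = XtX
det = a * d - b * c
XtX_inv = [[d / det, -b / det], [-c / det, a / det]]

H = matmul(matmul(X, XtX_inv), Xt)   # hat matrix H = X (X^T X)^{-1} X^T
I3 = mat([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
IH = [[I3[i][j] - H[i][j] for j in range(3)] for i in range(3)]

print(H == transpose(H))             # True: H^T = H
print(matmul(H, H) == H)             # True: HH = H, so Cov(Yhat) = sigma^2 H
print(matmul(IH, IH) == IH)          # True: so Cov(r) = sigma^2 (I - H)
```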

4.2 GLS, WLS, and FGLS

Definition 4.3. Suppose that the response variable and at least one of the
predictor variables is quantitative. Then the generalized least squares (GLS)
model is
Y = Xβ + e, (4.5)
where Y is an n × 1 vector of dependent variables, X is an n × p matrix
of predictors, β is a p × 1 vector of unknown coefficients, and e is an n × 1
vector of unknown errors. Also E(e) = 0 and Cov(e) = σ^2 V where V is a
known n × n positive definite matrix.
Definition 4.4. The GLS estimator

β̂_GLS = (X^T V^{-1} X)^{-1} X^T V^{-1} Y.    (4.6)

The fitted values are Ŷ_GLS = X β̂_GLS.


Definition 4.5. Suppose that the response variable and at least one of the
predictor variables is quantitative. Then the weighted least squares (WLS)
model with weights w1, . . . , wn is the special case of the GLS model where
V is diagonal: V = diag(v1, . . . , vn) and wi = 1/vi. Hence

Y = Xβ + e,    (4.7)

E(e) = 0, and Cov(e) = σ^2 diag(v1, . . . , vn) = σ^2 diag(1/w1, . . . , 1/wn).


Definition 4.6. The WLS estimator

β̂_WLS = (X^T V^{-1} X)^{-1} X^T V^{-1} Y.    (4.8)

The fitted values are Ŷ_WLS = X β̂_WLS.


Definition 4.7. The feasible generalized least squares (FGLS) model is
the same as the GLS model except that V = V(θ) is a function of an
unknown q × 1 vector of parameters θ. Let the estimator of V be V̂ = V(θ̂).
Then the FGLS estimator

β̂_FGLS = (X^T V̂^{-1} X)^{-1} X^T V̂^{-1} Y.    (4.9)

The fitted values are Ŷ_FGLS = X β̂_FGLS. The feasible weighted least squares
(FWLS) estimator is the special case of the FGLS estimator where V =
V(θ) is diagonal. Hence the estimated weights ŵi = 1/v̂i = 1/vi(θ̂). The
FWLS estimator and fitted values will be denoted by β̂_FWLS and Ŷ_FWLS,
respectively.
Notice that the ordinary least squares (OLS) model is a special case of
GLS with V = I_n, the n × n identity matrix. It can be shown that the GLS
estimator minimizes the GLS criterion

Q_GLS(η) = (Y − Xη)^T V^{-1} (Y − Xη).

Notice that the FGLS and FWLS estimators have p + q + 1 unknown
parameters. These estimators can perform very poorly if n < 10(p + q + 1).
The GLS and WLS estimators can be found from the OLS regression
(without an intercept) of a transformed model. Typically there will be a
constant in the model: the first column of X is a vector of ones. Following
Seber and Lee (2003, pp. 66–68), there is a nonsingular n × n matrix K
such that V = KK^T. Let Z = K^{-1} Y, U = K^{-1} X, and ε = K^{-1} e. This
method uses the fast, but rather unstable, Cholesky decomposition.

Proposition 4.1. a)

Z = Uβ + ε    (4.10)

follows the OLS model since E(ε) = 0 and Cov(ε) = σ^2 I_n.
b) The GLS estimator β̂_GLS can be obtained from the OLS regression
(without an intercept) of Z on U.
c) For WLS, Yi = x_i^T β + ei. The corresponding OLS model Z = Uβ + ε
is equivalent to Zi = u_i^T β + εi for i = 1, . . . , n where u_i^T is the ith row of U.
Then Zi = √wi Yi and u_i = √wi x_i. Hence β̂_WLS can be obtained from the
OLS regression (without an intercept) of Zi = √wi Yi on u_i = √wi x_i.

Proof. a) E(ε) = K^{-1} E(e) = 0 and

Cov(ε) = K^{-1} Cov(e) (K^{-1})^T = σ^2 K^{-1} V (K^{-1})^T

= σ^2 K^{-1} K K^T (K^{-1})^T = σ^2 I_n.

Notice that OLS without an intercept needs to be used since U does not
contain a vector of ones. The first column of U is K^{-1} 1 ≠ 1.
b) Let β̂_ZU denote the OLS estimator obtained by regressing Z on U.
Then

β̂_ZU = (U^T U)^{-1} U^T Z = (X^T (K^{-1})^T K^{-1} X)^{-1} X^T (K^{-1})^T K^{-1} Y

and the result follows since V^{-1} = (KK^T)^{-1} = (K^T)^{-1} K^{-1} =
(K^{-1})^T K^{-1}.

c) The result follows from b) if Zi = √wi Yi and u_i = √wi x_i. But
for WLS, V = diag(v1, . . . , vn) and hence K = K^T = diag(√v1, . . . , √vn).
Hence

K^{-1} = diag(1/√v1, . . . , 1/√vn) = diag(√w1, . . . , √wn)

and Z = K^{-1} Y has ith element Zi = √wi Yi. Similarly, U = K^{-1} X has
ith row u_i^T = √wi x_i^T. □

Following Johnson and Wichern (1988, p. 51) and Freedman (2005, p. 54),
there is a symmetric, nonsingular n × n square root matrix R = V^{1/2} such
that V = RR. Let Z = R^{-1} Y, U = R^{-1} X, and ε = R^{-1} e. This method
uses the spectral theorem (singular value decomposition) and has better
computational properties than the transformation based on the Cholesky
decomposition.

Proposition 4.2. a)

Z = Uβ + ε    (4.11)

follows the OLS model since E(ε) = 0 and Cov(ε) = σ^2 I_n.
b) The GLS estimator β̂_GLS can be obtained from the OLS regression
(without an intercept) of Z on U.
c) For WLS, Yi = x_i^T β + ei. The corresponding OLS model Z = Uβ + ε
is equivalent to Zi = u_i^T β + εi for i = 1, . . . , n where u_i^T is the ith row of U.
Then Zi = √wi Yi and u_i = √wi x_i. Hence β̂_WLS can be obtained from the
OLS regression (without an intercept) of Zi = √wi Yi on u_i = √wi x_i.

Proof. a) E(ε) = R^{-1} E(e) = 0 and

Cov(ε) = R^{-1} Cov(e) (R^{-1})^T = σ^2 R^{-1} V (R^{-1})^T

= σ^2 R^{-1} R R R^{-1} = σ^2 I_n

since R is symmetric. Notice that OLS without an intercept needs to be used
since U does not contain a vector of ones. The first column of U is R^{-1} 1 ≠ 1.
b) Let β̂_ZU denote the OLS estimator obtained by regressing Z on U.
Then

β̂_ZU = (U^T U)^{-1} U^T Z = (X^T (R^{-1})^T R^{-1} X)^{-1} X^T (R^{-1})^T R^{-1} Y

and the result follows since V^{-1} = (RR)^{-1} = R^{-1} R^{-1} = (R^{-1})^T R^{-1}.

c) The result follows from b) if Zi = √wi Yi and u_i = √wi x_i. But for
WLS, V = diag(v1, . . . , vn) and hence R = diag(√v1, . . . , √vn). Hence

R^{-1} = diag(1/√v1, . . . , 1/√vn) = diag(√w1, . . . , √wn)

and Z = R^{-1} Y has ith element Zi = √wi Yi. Similarly, U = R^{-1} X has ith
row u_i^T = √wi x_i^T. □
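Propositions 4.1 and 4.2 say the weighted normal equations X^T V^{-1} X β = X^T V^{-1} Y and the no-intercept OLS fit of Zi = √wi Yi on u_i = √wi x_i give the same coefficients. A minimal pure-Python sketch for the simple linear regression case p = 2, with made-up data and our own helper names:

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
w = [1.0, 2.0, 3.0, 4.0, 5.0]          # weights w_i = 1/v_i

def wls_direct(x, y, w):
    # solve the 2x2 weighted normal equations for Y_i = b1 + b2 x_i + e_i
    s_w = sum(w); s_wx = sum(wi * xi for wi, xi in zip(w, x))
    s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    s_wy = sum(wi * yi for wi, yi in zip(w, y))
    s_wxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = s_w * s_wxx - s_wx ** 2
    return ((s_wxx * s_wy - s_wx * s_wxy) / det,
            (s_w * s_wxy - s_wx * s_wy) / det)

def wls_via_ols(x, y, w):
    # OLS without intercept of Z_i = sqrt(w_i) Y_i on (sqrt(w_i), sqrt(w_i) x_i)
    u1 = [math.sqrt(wi) for wi in w]
    u2 = [math.sqrt(wi) * xi for wi, xi in zip(w, x)]
    z = [math.sqrt(wi) * yi for wi, yi in zip(w, y)]
    a11 = sum(v * v for v in u1); a12 = sum(p * q for p, q in zip(u1, u2))
    a22 = sum(v * v for v in u2)
    c1 = sum(p * q for p, q in zip(u1, z)); c2 = sum(p * q for p, q in zip(u2, z))
    det = a11 * a22 - a12 ** 2
    return ((a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det)

print(wls_direct(x, y, w))
print(wls_via_ols(x, y, w))   # agrees with the direct solution
```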

Remark 4.1. Standard software produces WLS output and the ANOVA
F test and Wald t tests are performed using this output.

Remark 4.2. The FGLS estimator can also be found from the OLS
regression (without an intercept) of Z on U where V(θ̂) = R̂R̂. Similarly,
the FWLS estimator can be found from the OLS regression (without an
intercept) of Zi = √(ŵi) Yi on u_i = √(ŵi) x_i. But now U is a random matrix
instead of a constant matrix. Hence these estimators are highly nonlinear.
OLS output can be used for exploratory purposes, but the p-values are
generally not correct. The Olive (2016a,b) nonparametric bootstrap tests may
be useful for FGLS and FWLS. The nonparametric bootstrap could also be
applied to the OLS estimator.
Under regularity conditions, the OLS estimator β̂_OLS is a consistent
estimator of β when the GLS model holds, but GLS should be used because it
generally has higher efficiency.
Definition 4.8. Let β̂_ZU be the OLS estimator from regressing Z on U.
The vector of fitted values is Ẑ = U β̂_ZU and the vector of residuals is
r_ZU = Z − Ẑ. Then β̂_ZU = β̂_GLS for GLS, β̂_ZU = β̂_FGLS for FGLS,
β̂_ZU = β̂_WLS for WLS, and β̂_ZU = β̂_FWLS for FWLS. For GLS, FGLS,
WLS, and FWLS, a residual plot is a plot of Ẑi versus r_ZU,i and a response
plot is a plot of Ẑi versus Zi.
Notice that the residual and response plots are based on the OLS output
from the OLS regression without intercept of Z on U. If the model is good,
then the plotted points in the response plot should follow the identity line
in an evenly populated band while the plotted points in the residual plot
should follow the line r_ZU,i = 0 in an evenly populated band (at least if the
distribution of ε is not highly skewed).
Plots based on Ŷ_GLS = X β̂_ZU and on r_i,GLS = Yi − Ŷ_i,GLS should be
similar to those based on β̂_OLS. Although the plot of Ŷ_i,GLS versus Yi should
be linear, the plotted points will not scatter about the identity line in an
evenly populated band. Hence this plot cannot be used to check whether
the GLS model with V is a good approximation to the data. Moreover, the
r_i,GLS and Ŷ_i,GLS may be correlated and usually do not scatter about the
r = 0 line in an evenly populated band. The plots in Definition 4.8 are both
a check on linearity and on whether the model using V (or V̂) gives a good
approximation of the data, provided that n > k(p + q + 1) where k ≥ 5 and
preferably k ≥ 10.
For GLS and WLS (and for exploratory purposes for FGLS and FWLS),
plots and model building and variable selection should be based on Z and U.
Form Z and U and then use OLS software for model selection and variable
selection. If the columns of X are v_1, . . . , v_p, then the columns of U are
U_1, . . . , U_p where U_j = R^{-1} v_j corresponds to the jth predictor X_j. For
example, the analog of the OLS residual plot of the jth predictor versus the
residuals is the plot of the jth predictor U_j versus r_ZU. The notation is
confusing but the idea is simple: form Z and U, then use OLS software and
the OLS techniques from Chapters 2 and 3 to build the model.

[Figure 4.1 panels: a) OLS Response Plot (FIT vs. Y), b) OLS Residual Plot
(FIT vs. RESID), c) WLS Response Plot (ZFIT vs. Z), d) WLS Residual Plot
(ZFIT vs. ZRESID)]

Fig. 4.1 Plots for Draper and Smith Data

Example 4.2. Draper and Smith (1981, pp. 112–114) present an FWLS
example with n = 35 and p = 2. Hence Y = β1 + β2 x + e. Let v̂i =
vi(θ̂) = 1.5329 − 0.7334xi + 0.0883xi^2. Thus θ̂ = (1.5329, −0.7334, 0.0883)^T.
Figure 4.1a and b shows the response and residual plots based on the OLS
regression of Y on x. The residual plot has the shape of the right opening
megaphone, suggesting that the variance is not constant. Figure 4.1c and d
shows the response and residual plots based on FWLS with weights ŵi = 1/v̂i.
See Problem 4.2 to reproduce these plots. Software meant for WLS needs the
weights. Hence FWLS can be computed using WLS software with the
estimated weights, but the software may print WLS instead of FWLS, as in
Figure 4.1c and d.

Warning. A problem with the response and residual plots for GLS and
FGLS given in Definition 4.8 is that some of the transformed cases (Zi, u_i^T)^T
can be outliers or high leverage points.

Remark 4.3. If the response Yi is the sample mean or sample median of
ni cases where the ni are not all equal, then use WLS with weights wi = ni.
See Sheather (2009, p. 121).
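A quick simulation sketch of why Remark 4.3's weights work: if Yi is the mean of ni iid observations with variance σ^2, then VAR(Yi) = σ^2/ni, so vi is proportional to 1/ni and the WLS weight is wi = ni. (Pure Python, made-up numbers.)

```python
import random, statistics

random.seed(0)
sigma = 2.0
for n_i in (4, 16, 64):
    # 5000 replicated sample means, each based on n_i iid observations
    means = [statistics.fmean(random.gauss(0, sigma) for _ in range(n_i))
             for _ in range(5000)]
    # Var(Ybar_i) = sigma^2 / n_i, so n_i * Var(Ybar_i) is roughly sigma^2 = 4
    print(n_i, round(statistics.variance(means) * n_i, 2))
```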

4.3 Inference for GLS

Inference for the GLS model Y = Xβ + e can be performed by using the
partial F test for the equivalent no intercept OLS model Z = Uβ + ε.
Following Section 2.10, create Z and U, fit the full and reduced model using
the "no intercept" or "intercept = F" option. Let pval be the estimated
p-value.
The 4 step partial F test of hypotheses: i) State the hypotheses Ho:
the reduced model is good, Ha: use the full model.
ii) Find the test statistic

FR = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] / MSE(F).

iii) Find the pval = P(F_{dfR − dfF, dfF} > FR). (On exams often an F table is
used. Here dfR − dfF = p − q = number of parameters set to 0, and dfF = n − p.)
iv) State whether you reject Ho or fail to reject Ho. Reject Ho if pval ≤ δ
and conclude that the full model should be used. Otherwise, fail to reject Ho
and conclude that the reduced model is good.

Assume that the GLS model contains a constant β1. The GLS ANOVA F
test of Ho: β2 = ··· = βp = 0 versus Ha: not Ho uses the reduced model that
contains the first column of U. The GLS ANOVA F test of Ho: βi = 0
versus Ha: βi ≠ 0 uses the reduced model with the ith column of U deleted.
For the special case of WLS, the software will often have a weights option
that will also give correct output for inference.

Example 4.3. Suppose that the data from Example 4.2 has valid weights,
so that WLS can be used instead of FWLS. The R commands below
perform WLS.
> ls.print(lsfit(dsx,dsy,wt=dsw))
Residual Standard Error=1.137
R-Square=0.9209
F-statistic (df=1, 33)=384.4139, p-value=0
Estimate Std.Err t-value Pr(>|t|)
Intercept -0.8891 0.3004 -2.9602 0.0057
X 1.1648 0.0594 19.6065 0.0000

Alternative R commands given below produce similar output.


zout<-lm(dsy~dsx,weights=dsw)
summary(zout)
anova(zout)
zoutr<-lm(dsy~1,weights=dsw)
anova(zoutr,zout)

The F statistic 384.4139 tests Ho: β2 = 0 since weights were used. The
WLS ANOVA F test for Ho: β2 = 0 can also be found with the no intercept
model by adding a column of ones to x, forming U and Z, and computing the
partial F test where the reduced model uses the first column of U. Notice that
the "intercept=F" option needs to be used to fit both models. The residual
standard error = RSE = √MSE. Thus SSE = (n − k)(RSE)^2 where n − k
is the denominator degrees of freedom for the F test and k is the numerator
degrees of freedom = number of variables in the model. The column of ones
xone is counted as a variable. The last line of output computes the partial F
statistic and is again 384.4.

> xone <- 1 + 0*1:35


> x <- cbind(xone,dsx)
> z <- as.vector(diag(sqrt(dsw))%*%dsy)
> u <- diag(sqrt(dsw))%*%x
> ls.print(lsfit(u,z,intercept=F))
Residual Standard Error=1.137, R-Square=0.9817
F-statistic (df=2, 33)=886.4982, p-value=0
Estimate Std.Err t-value Pr(>|t|)
xone -0.8891 0.3004 -2.9602 0.0057
dsx 1.1648 0.0594 19.6065 0.0000

> ls.print(lsfit(u[,1],z,intercept=F))
Residual Standard Error=3.9838, R-Square=0.7689
F-statistic (df=1, 34)=113.1055, p-value=0
Estimate Std.Err t-value Pr(>|t|)
X 4.5024 0.4234 10.6351 0
> ((34*(3.9838)^2-33*(1.137)^2)/1)/(1.137)^2
[1] 384.4006

The WLS t-test for this data has t = 19.6065, which corresponds to F =
t^2 = 384.4 since this test is equivalent to the WLS ANOVA F test when there
is only one predictor. The WLS t-test for the intercept has F = t^2 = 8.76.
This test statistic can be found from the no intercept OLS model by leaving
the first column of U out of the model and then performing the partial F test
as shown below.

> ls.print(lsfit(u[,2],z,intercept=F))
Residual Standard Error=1.2601
F-statistic (df=1, 34)=1436.300
Estimate Std.Err t-value Pr(>|t|)
X 1.0038 0.0265 37.8985 0

> ((34*(1.2601)^2-33*(1.137)^2)/1)/(1.137)^2
[1] 8.760723
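The two partial F computations above follow the same recipe: recover SSE = df × RSE^2 from the printed residual standard errors, then form FR. A small Python helper (our own, for illustration) reproduces both numbers:

```python
# FR = [(SSE(R) - SSE(F)) / (dfR - dfF)] / MSE(F), with SSE = df * RSE^2
def partial_f(rse_r, df_r, rse_f, df_f):
    sse_r = df_r * rse_r ** 2
    sse_f = df_f * rse_f ** 2
    return ((sse_r - sse_f) / (df_r - df_f)) / (rse_f ** 2)

print(round(partial_f(3.9838, 34, 1.137, 33), 1))   # 384.4, the ANOVA F test
print(round(partial_f(1.2601, 34, 1.137, 33), 2))   # 8.76, the intercept test
```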

4.4 Complements

The theory for GLS and WLS is similar to the theory for the OLS MLR
model, but the theory for FGLS and FWLS is often lacking or huge sample
sizes are needed. However, FGLS and FWLS are often used in practice
because usually V is not known and V̂ must be used instead. Kariya and
Kurata (2004) is a PhD level text covering FGLS. Cook and Zhang (2015)
suggest an envelope method for WLS.
Shi and Chen (2009) describe numerical diagnostics for GLS. Long and
Ervin (2000) discuss methods for obtaining standard errors when the constant
variance assumption is violated.
Following Sheather (2009, ch. 9, ch. 10), many linear models with serially
correlated errors (e.g. AR(1) errors) and many linear mixed models
can be fit with FGLS. Both Sheather (2009) and Houseman et al. (2004)
use the Cholesky decomposition and make the residual plots based on the
Cholesky residuals Z − Ẑ where V(θ̂) = K̂K̂^T. We recommend plots based
on Z − Ẑ where V(θ̂) = R̂R̂. In other words, use the transformation
corresponding to Proposition 4.2 instead of the transformation corresponding to
Proposition 4.1.

4.5 Problems

Problems with an asterisk * are especially important.

R Problems

Use the command source("G:/lregpack.txt") to download the functions
and the command source("G:/lregdata.txt") to download the data.
See Preface or Section 14.1. Typing the name of the lregpack function,
e.g. wlsplot, will display the code for the function. Use the args command,
e.g. args(wlsplot), to display the needed arguments for the function. For some
of the following problems, the R commands can be copied and pasted from
(http://lagrange.math.siu.edu/Olive/lreghw.txt) into R.
4.1. Generalized and weighted least squares are each equivalent to a least
squares regression without intercept. Let V = diag(1, 1/2, 1/3, . . . , 1/9) =
diag(1/wi) where n = 9 and the weights wi = i for i = 1, . . . , 9. Let x^T =
(1, x1, x2, x3). Then the weighted least squares fit with weight vector w^T =
(1, 2, . . . , 9) should be equivalent to the OLS regression of √wi Yi = Zi on
u_i where u_i^T = √wi x^T = (√wi, √wi x1, √wi x2, √wi x3). There is no intercept
because the vector of ones has been replaced by a vector of the √wi's. Copy
and paste the commands for this problem into R, and include the output
from both lsfit commands. The coefficients from both lsfit commands should
be the same.

4.2. Download the wlsplot function and the Draper and Smith (1981)
data dsx, dsy, dsw.
a) Enter the R command wlsplot(x=dsx, y=dsy, w=dsw) to reproduce
Figure 4.1. Once you have the plot you can print it out directly, but
it will generally save paper by placing the plots in the Word editor.
b) Activate Word (often by double clicking on a Word icon). Click on the
screen and type "Problem 4.2." In R, click on the plot and then press the
keys Ctrl and c simultaneously. This procedure makes a temporary copy of
the plot. In Word, move the pointer to Edit and hold down the leftmost mouse
button. This will cause a menu to appear. Drag the pointer down to Paste.
In the future, these menu commands will be denoted by "Edit>Paste." The
plot should appear on the screen. To save your output on your flash drive
(J, say), use the Word menu commands "File > Save as." In the Save in box
select "Removable Disk (J:)" and in the File name box enter HW4d2.doc. To
exit from Word, click on the X in the upper right corner of the screen. In
Word a screen will appear and ask whether you want to save changes made
in your document. Click on No. To exit from R, type q() or click on the
X in the upper right corner of the screen and then click on No.

4.3. Download the fwlssim function. This creates WLS data if type
is 1 or 3 and FWLS data if type is 2 or 4. Let the sufficient predictor
SP = 25 + 2x2 + ··· + 2xp. Then Y = SP + |SP − 25|^k e where the xij and
ei are iid N(0, 1). Thus Y |SP ∼ N(SP, |SP − 25|^{2k} σ^2). If type is 1 or 2,
then k = 1/5, but k = 1 if type is 3 or 4. The default has σ^2 = 1.
The function creates the OLS response and residual plots and the FWLS
(or WLS) response and residual plots.
a) Type the following command several times. The OLS and WLS plots
tend to look the same.

fwlssim(type=1)

b) Type the following command several times. Now the FWLS plots often
have outliers.

fwlssim(type=2)

c) Type the following command several times. The OLS residual plots have
a saddle shape, but the WLS plots tend to have highly skewed fitted values.

fwlssim(type=3)

d) Type the following command several times. The OLS residual plots
have a saddle shape, but the FWLS plots tend to have outliers and highly
skewed fitted values.

fwlssim(type=4)
Chapter 5
One Way Anova

Chapters 5–9 consider experimental design models. These models are linear
models, and many of the techniques used for multiple linear regression can
be used for experimental design models. In particular, least squares, response
plots, and residual plots will be important. These models have been used
to greatly increase agricultural yield, greatly improve medicine, and greatly
improve the quality of manufactured goods. The models are also good for
screening out good ideas from bad ideas (e.g., for a medical treatment for
heart disease or for improving the gas mileage of a car).

Definition 5.1. Models in which the response variable Y is quantitative,
but all of the predictor variables are qualitative, are called analysis of
variance (ANOVA or Anova) models, experimental design models, or design of
experiments (DOE) models. Each combination of the levels of the predictors
gives a different distribution for Y. A predictor variable W is often called a
factor, and a factor level ai is one of the categories W can take.

5.1 Introduction

Definition 5.2. A lurking variable is not one of the variables in the
study, but may affect the relationships among the variables in the study.
A unit is the experimental material assigned treatments, which are the
conditions the investigator wants to study. The unit is experimental if it was
randomly assigned to a treatment, and the unit is observational if it was not
randomly assigned to a treatment.

Definition 5.3. In an experiment, the investigators use randomization
to assign treatments to units. To assign p treatments to n = n1 + ··· + np
experimental units, draw a random permutation of {1, . . . , n}. Assign the first

n1 units treatment 1, the next n2 units treatment 2, . . . , and the final np units
treatment p.

Randomization allows one to do valid inference such as F tests of
hypotheses and confidence intervals. Randomization also washes out the effects
of lurking variables and makes the p treatment groups similar except for the
treatment. The effects of lurking variables are present in observational studies
defined in Definition 5.4.

Definition 5.4. In an observational study, investigators simply observe
the response, and the treatment groups need to be p random samples from p
populations (the levels) for valid inference.

Example 5.1. Consider using randomization to assign the following nine


people (units) to three treatment groups.

Carroll, Collin, Crawford, Halverson, Lawes,


Stach, Wayman, Wenslow, Xumong

Balanced designs have the group sizes the same: ni ≡ m = n/p. Label the
units alphabetically so Carroll gets 1, . . . , Xumong gets 9. The R function
sample can be used to draw a random permutation. Then the first 3 numbers
in the permutation correspond to group 1, the next 3 to group 2, and the final
3 to group 3. Using the output shown below gives the following 3 groups.

group 1: Stach, Wayman, Xumong


group 2: Lawes, Carroll, Halverson
group 3: Collin, Wenslow, Crawford

> sample(9)
[1] 6 7 9 5 1 4 2 8 3

Often there is a table or computer file of units and related measurements,
and it is desired to add the unit's group to the end of the table. The lregpack
function rand reports a random permutation and the quantity groups[i] =
treatment group for the ith person on the list. Since persons 6, 7, and 9 are in
group 1, groups[7] = 1. Since Carroll is person 1 and is in group 2, groups[1]
= 2, et cetera.

> rand(9,3)
$perm
[1] 6 7 9 5 1 4 2 8 3

$groups
[1] 2 3 3 2 2 1 1 3 1
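The assignment scheme used by sample and rand above can be sketched in Python; assign_groups is a hypothetical stand-in for the lregpack rand function, shown for the balanced case ni = n/p.

```python
import random

def assign_groups(n, p, seed=None):
    # Draw a random permutation of 1..n; the first m = n/p units in the
    # permutation get treatment 1, the next m get treatment 2, and so on.
    # groups[i] = treatment group of the (i+1)th unit on the original list.
    rng = random.Random(seed)
    perm = list(range(1, n + 1))
    rng.shuffle(perm)
    m = n // p                      # balanced design assumed
    groups = [0] * n
    for pos, unit in enumerate(perm):
        groups[unit - 1] = pos // m + 1
    return perm, groups

perm, groups = assign_groups(9, 3, seed=1)
print(perm)
print(groups)        # each group label appears exactly 3 times
```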

Denition 5.5. Replication means that for each treatment, the ni re-
sponse variables Yi,1 , . . . , Yi,ni are approximately iid random variables.

Example 5.2. a) If ten students work two types of paper mazes three
times each, then there are 60 measurements that are not replicates. Each
student should work the six mazes in random order since speed increases
with practice. For the ith student, let Zi1 be the average time to complete
the three mazes of type 1, let Zi2 be the average time for mazes of type 2,
and let Di = Zi1 − Zi2. Then D1, . . . , D10 are replicates.
b) Cobb (1998, p. 126) states that a student wanted to know if the shapes
of sponge cells depend on the color (green or white). He measured hundreds
of cells from one white sponge and hundreds of cells from one green sponge.
There were only two units, so n1 = 1 and n2 = 1. The student should have
used a sample of n1 green sponges and a sample of n2 white sponges to get
more replicates.
c) Replication depends on the goals of the study. Box et al. (2005, pp. 215–
219) describe an experiment where the investigator times how long it takes
him to bike up a hill. Since the investigator is only interested in his
performance, each run up a hill is a replicate (the time for the ith run is a sample
from all possible runs up the hill by the investigator). If the interest had been
on the effect of eight treatment levels on student bicyclists, then replication
would need n = n1 + ··· + n8 student volunteers where ni ride their bike up
the hill under the conditions of treatment i.

5.2 Fixed Effects One Way Anova

The one way Anova model is used to compare p treatments. Usually there
is replication, and Ho: μ1 = μ2 = ··· = μp is a hypothesis of interest.
Investigators may also want to rank the population means from smallest to
largest.
Definition 5.6. Let fZ(z) be the pdf of Z. Then the family of pdfs
fY(y) = fZ(y − μ) indexed by the location parameter μ, −∞ < μ < ∞,
is the location family for the random variable Y = μ + Z with standard
pdf fZ(z).
Definition 5.7. A one way fixed effects Anova model has a single
qualitative predictor variable W with p categories a1, . . . , ap. There are p different
distributions for Y, one for each category ai. The distribution of

Y |(W = ai) ∼ fZ(y − μi)

where the location family has second moments. Hence all p distributions come
from the same location family with different location parameter μi and the
same variance σ^2.
Definition 5.8. The one way fixed effects normal Anova model is the
special case where

Y |(W = ai) ∼ N(μi, σ^2).

Example 5.3. The pooled 2 sample t-test is a special case of a one way
Anova model with p = 2. For example, one population could be ACT scores
for men and the second population ACT scores for women. Then W = gender
and Y = score.

Notation. It is convenient to relabel the response variables Y1, . . . , Yn as
the vector Y = (Y11, . . . , Y1,n1, Y21, . . . , Y2,n2, . . . , Yp1, . . . , Yp,np)^T where the
Yij are independent and Yi1, . . . , Yi,ni are iid. Here j = 1, . . . , ni where ni is
the number of cases from the ith level and i = 1, . . . , p. Thus n1 + ··· + np =
n. Similarly use double subscripts on the errors. Then there will be many
equivalent parameterizations of the one way fixed effects Anova model.

Definition 5.9. The cell means model is the parameterization of the one
way fixed effects Anova model such that

Yij = μi + eij

where Yij is the value of the response variable for the jth trial of the ith
factor level. The μi are the unknown means and E(Yij) = μi. The eij are
iid from the location family with pdf fZ(z) and unknown variance σ^2 =
VAR(Yij) = VAR(eij). For the normal cell means model, the eij are iid
N(0, σ^2) for i = 1, . . . , p and j = 1, . . . , ni.

The cell means model is a linear model (without intercept) of the form
Y = X_c β_c + e:

| Y_11   |   | 1 0 ... 0 |             | e_11   |
| ...    |   | . .     . |             | ...    |
| Y_1,n1 |   | 1 0 ... 0 |   | μ_1 |   | e_1,n1 |
| Y_21   |   | 0 1 ... 0 |   | μ_2 |   | e_21   |
| ...    | = | . .     . |   | ... | + | ...    |    (5.1)
| Y_2,n2 |   | 0 1 ... 0 |   | μ_p |   | e_2,n2 |
| ...    |   | . .     . |             | ...    |
| Y_p,1  |   | 0 0 ... 1 |             | e_p,1  |
| ...    |   | . .     . |             | ...    |
| Y_p,np |   | 0 0 ... 1 |             | e_p,np |
Notation. Let Yi0 = Σ_{j=1}^{ni} Yij and let

μ̂i = Ȳi0 = Yi0/ni = (1/ni) Σ_{j=1}^{ni} Yij.    (5.2)

Hence the "dot notation" means sum over the subscript corresponding to the
0, e.g. j. Similarly, Y00 = Σ_{i=1}^{p} Σ_{j=1}^{ni} Yij is the sum of all of the Yij.

Let X_c = [v_1 v_2 ··· v_p], and notice that the indicator variables used
in the cell means model (5.1) are v_hk = x_hk = 1 if the hth case has W = a_k,
and v_hk = x_hk = 0 otherwise, for k = 1, . . . , p and h = 1, . . . , n. So Yij
has x_hk = 1 only if i = k and j = 1, . . . , ni. The model can use p indicator
variables for the factor instead of p − 1 indicator variables because the model
does not contain an intercept. Also notice that X_c^T X_c = diag(n1, . . . , np),

E(Y) = X_c β_c = (μ1, . . . , μ1, μ2, . . . , μ2, . . . , μp, . . . , μp)^T,

and X_c^T Y = (Y10, Y20, . . . , Yp0)^T. Hence (X_c^T X_c)^{-1}
= diag(1/n1, . . . , 1/np) and the OLS estimator

β̂_c = (X_c^T X_c)^{-1} X_c^T Y = (Ȳ10, . . . , Ȳp0)^T = (μ̂1, . . . , μ̂p)^T.

Thus Ŷ = X_c β̂_c = (Ȳ10, . . . , Ȳ10, . . . , Ȳp0, . . . , Ȳp0)^T. Hence the ijth fitted
value is

Ŷij = Ȳi0 = μ̂i    (5.3)

and the ijth residual is

rij = Yij − Ŷij = Yij − μ̂i.    (5.4)

Since the cell means model is a linear model, there is an associated response
plot and residual plot. However, many of the interpretations of the OLS
quantities for Anova models differ from the interpretations for MLR models.
First, for MLR models, the conditional distribution Y |x makes sense even if
x is not one of the observed x_i, provided that x is not far from the x_i. This
fact makes MLR very powerful. For MLR, at least one of the variables in x
is a continuous predictor. For the one way fixed effects Anova model, the p
distributions Y |x_i make sense where x_i^T is a row of X_c.
Also, the OLS MLR ANOVA F test for the cell means model tests H0:
β_c = 0, or equivalently H0: μ1 = ··· = μp = 0, while the one way fixed effects
ANOVA F test given after Definition 5.13 tests H0: μ1 = ··· = μp.
Definition 5.10. Consider the one way fixed effects Anova model. The
response plot is a plot of Ŷij ≡ μ̂i versus Yij, and the residual plot is a plot of
Ŷij ≡ μ̂i versus rij.
The points in the response plot scatter about the identity line and the
points in the residual plot scatter about the r = 0 line, but the scatter need
not be in an evenly populated band. A dot plot of Z1, . . . , Zm consists of an
axis and m points each corresponding to the value of Zi. The response plot
consists of p dot plots, one for each value of μ̂i. The dot plot corresponding
to μ̂i is the dot plot of Yi1, . . . , Yi,ni. The p dot plots should have roughly the

same amount of spread, and each μ̂i corresponds to level ai. If a new level
af corresponding to xf was of interest, hopefully the points in the response
plot corresponding to af would form a dot plot at μ̂f similar in spread to
the other dot plots, but it may not be possible to predict the value of μ̂f.
Similarly, the residual plot consists of p dot plots, and the plot corresponding
to μ̂i is the dot plot of ri1, . . . , ri,ni.
Assume that each ni ≥ 10. Under the assumption that the Yij are from
the same location family with different parameters μi, each of the p dot plots
should have roughly the same shape and spread. This assumption is easier
to judge with the residual plot. If the response plot looks like the residual
plot, then a horizontal line fits the p dot plots about as well as the identity
line, and there is not much difference in the μi. If the identity line is clearly
superior to any horizontal line, then at least some of the means differ.
Denition 5.11. An outlier corresponds to a case that is far from the
bulk of the data. Look for a large vertical distance of the plotted point from
the identity line or the r = 0 line.
Rule of thumb 5.1. Mentally add 2 lines parallel to the identity line and
2 lines parallel to the r = 0 line that cover most of the cases. Then a case is
an outlier if it is well beyond these 2 lines.
This rule often fails for large outliers since often the identity line goes
through or near a large outlier so its residual is near zero. A response that is
far from the bulk of the data in the response plot is a "large outlier" (large
in magnitude). Look for a large gap between the bulk of the data and the
large outlier.
Suppose there is a dot plot of nj cases corresponding to level aj that is
far from the bulk of the data. This dot plot is probably not a cluster of "bad
outliers" if nj ≥ 4 and n ≥ 5p. If nj = 1, such a case may be a large outlier.

Rule of thumb 5.2. Often an outlier is very good, but more often an
outlier is due to a measurement error and is very bad.
The assumption of the Yij coming from the same location family with
different location parameters μi and the same constant variance σ^2 is a big
assumption and often does not hold. Another way to check this assumption is
to make a box plot of the Yij for each i. The box in the box plot corresponds
to the lower, middle, and upper quartiles of the Yij. The middle quartile
is just the sample median mi of the data: at least half of the Yij ≤ mi
and at least half of the Yij ≥ mi. The p boxes should be roughly the same
length and the median should occur in roughly the same position (e.g., in
the center) of each box. The whiskers in each plot should also be roughly
similar. Histograms for each of the p samples could also be made. All of the
histograms should look similar in shape.
Example 5.4. Kuehl (1994, p. 128) gives data for counts of hermit crabs
on 25 different transects in each of six different coastline habitats. Let Z be
the count. Then the response variable Y = log10(Z + 1/6). Although the

counts Z varied greatly, each habitat had several counts of 0 and often there
were several counts of 1, 2, or 3. Hence Y is not a continuous variable. The cell
means model was t with ni = 25 for i = 1, . . . , 6. Each of the six habitats
was a level. Figure 5.1a and b shows the response plot and residual plot.
There are 6 dot plots in each plot. Because several of the smallest values in
each plot are identical, it does not always look like the identity line is passing
through the six sample means Y i0 for i = 1, . . . , 6. In particular, examine the
dot plot for the smallest mean (look at the 25 dots furthest to the left that
fall on the vertical line FIT 0.36). Random noise (jitter) has been added to
the response and residuals in Figure 5.1c and d. Now it is easier to compare
the six dot plots. They seem to have roughly the same spread.

[Figure 5.1: a) Response Plot, b) Residual Plot, c) Jittered Response Plot,
d) Jittered Residual Plot. The response Y (or the residual) is plotted against
FIT for the crab data.]

Fig. 5.1 Plots for Crab Data

The plots contain a great deal of information. The response plot can be
used to explain the model, check that the sample from each population (treat-
ment) has roughly the same shape and spread, and to see which populations
have similar means. Since the response plot closely resembles the residual plot
in Figure 5.1, there may not be much difference in the six populations. Lin-
earity seems reasonable since the samples scatter about the identity line. The
residual plot makes the comparison of similar shape and spread easier.
182 5 One Way Anova

Definition 5.12. a) The total sum of squares

SSTO = Σ_{i=1}^p Σ_{j=1}^{ni} (Yij − Ȳ00)².

b) The treatment sum of squares

SSTR = Σ_{i=1}^p ni (Ȳi0 − Ȳ00)².

c) The residual sum of squares or error sum of squares

SSE = Σ_{i=1}^p Σ_{j=1}^{ni} (Yij − Ȳi0)².

Definition 5.13. Associated with each SS in Definition 5.12 is a degrees
of freedom (df) and a mean square = SS/df. For SSTO, df = n − 1 and
MSTO = SSTO/(n − 1). For SSTR, df = p − 1 and MSTR = SSTR/(p − 1).
For SSE, df = n − p and MSE = SSE/(n − p).
Let S²i = Σ_{j=1}^{ni} (Yij − Ȳi0)²/(ni − 1) be the sample variance of the ith
group. Then the MSE is a weighted sum of the S²i:

σ̂² = MSE = (1/(n − p)) Σ_{i=1}^p Σ_{j=1}^{ni} r²ij = (1/(n − p)) Σ_{i=1}^p Σ_{j=1}^{ni} (Yij − Ȳi0)²
= (1/(n − p)) Σ_{i=1}^p (ni − 1)S²i = S²pool

where S²pool is known as the pooled variance estimator.
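As a numerical check on this identity, a minimal Python sketch (hypothetical data and numpy only; the book's own code is in R) verifies that the MSE computed from the residuals equals the pooled variance Σ(ni − 1)S²i/(n − p):

```python
import numpy as np

# Hypothetical one way Anova data: p = 3 groups with unequal sample sizes.
groups = [np.array([12.1, 10.8, 11.5, 13.0]),
          np.array([14.2, 15.1, 13.7, 14.9, 15.5]),
          np.array([9.9, 10.4, 11.1])]
n = sum(len(g) for g in groups)
p = len(groups)

# MSE from the residuals r_ij = Y_ij - Ybar_i0.
sse = sum(((g - g.mean())**2).sum() for g in groups)
mse = sse / (n - p)

# Pooled variance estimator: weighted average of the group variances S_i^2.
s2_pool = sum((len(g) - 1) * g.var(ddof=1) for g in groups) / (n - p)

print(np.isclose(mse, s2_pool))  # the two estimators agree
```

The agreement holds for any data set since (ni − 1)S²i is exactly the ith group's sum of squared residuals.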

The ANOVA F test tests whether the p means are equal. If Ho is not
rejected and the means are equal, then it is possible that the factor is unim-
portant, but it is also possible that the factor is important but the
levels used in the study are not. For example, the factor might be type of
catalyst. The yield may be equally good for each type of catalyst, but there
would be no yield if no catalyst was used.
The ANOVA table is the same as that for MLR, except that SSTR re-
places the regression sum of squares. The MSE is again an estimator of σ².
The ANOVA F test tests whether all p means μi are equal. Shown below is
an ANOVA table given in symbols. Sometimes "Treatment" is replaced by
"Between treatments," "Between Groups," "Model," "Factor," or "Groups."
Sometimes "Error" is replaced by "Residual," or "Within Groups." Some-
times "p-value" is replaced by "P," "Pr(> F)," or "PR > F." The p-value
is nearly always an estimated p-value, denoted by pval.
5.2 Fixed Eects One Way Anova 183

Summary Analysis of Variance Table

Source     df   SS    MS    F             p-value
Treatment  p-1  SSTR  MSTR  Fo=MSTR/MSE   for Ho:
Error      n-p  SSE   MSE                 μ1 = ⋯ = μp

Be able to perform the 4 step fixed effects one way ANOVA F
test of hypotheses.
i) State the hypotheses Ho: μ1 = μ2 = ⋯ = μp and Ha: not Ho.
ii) Find the test statistic Fo = MSTR/MSE or obtain it from output.
iii) Find the pval from output or use the F table: pval =

P(Fp−1,n−p > Fo).

iv) State whether you reject Ho or fail to reject Ho. If the pval ≤ δ, reject Ho
and conclude that the mean response depends on the factor level. (Hence not
all of the treatment means are equal.) Otherwise fail to reject Ho and conclude
that the mean response does not depend on the factor level. (Hence all of the
treatment means are equal, or there is not enough evidence to conclude that
the mean response depends on the factor level.) Give a nontechnical sentence.

Rule of thumb 5.3. If

max(S1, . . . , Sp) ≤ 2 min(S1, . . . , Sp),

then the one way ANOVA F test results will be approximately correct if the
response and residual plots suggest that the remaining one way Anova model
assumptions are reasonable. See Moore (2007, p. 634). If all of the ni ≥ 5,
replace the standard deviations by the ranges of the dot plots when exam-
ining the response and residual plots. The range Ri = max(Yi,1, . . . , Yi,ni) −
min(Yi,1, . . . , Yi,ni) = length of the ith dot plot for i = 1, . . . , p.
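This rule of thumb is a one-line comparison; a minimal Python sketch (hypothetical data, numpy only):

```python
import numpy as np

def constant_variance_ok(groups):
    """Rule of thumb 5.3: the one way ANOVA F test is roughly valid
    when max(S_1,...,S_p) <= 2 min(S_1,...,S_p)."""
    sds = [np.std(g, ddof=1) for g in groups]
    return bool(max(sds) <= 2 * min(sds))

# Hypothetical groups: similar spreads pass, very unequal spreads fail.
print(constant_variance_ok([[1.0, 2.0, 3.0], [4.1, 5.0, 6.2]]))    # True
print(constant_variance_ok([[1.0, 1.1, 0.9], [0.0, 10.0, 20.0]]))  # False
```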

The assumption that the zero mean iid errors have constant variance
V(eij) ≡ σ² is much stronger for the one way Anova model than for the mul-
tiple linear regression model. The assumption implies that the p population
distributions have pdfs from the same location family with different means
μ1, . . . , μp but the same variances σ²1 = ⋯ = σ²p ≡ σ². The one way ANOVA
F test has some resistance to the constant variance assumption, but confi-
dence intervals have much less resistance to the constant variance assumption.
Consider confidence intervals for μi such as Ȳi0 ± tni−1,1−δ/2 √(MSE/ni).
MSE is a weighted average of the S²i. Hence MSE overestimates small σ²i
and underestimates large σ²i when the σ²i are not equal. Hence using MSE
instead of S²i will make the CI too long or too short, and Rule of thumb 5.3
does not apply to confidence intervals based on MSE.
184 5 One Way Anova

Remark 5.1. If the units are a representative sample of some population


of interest, then randomization of units into groups makes the assumption
that Yi1 , . . . , Yi,ni are iid hold to a useful approximation for large sample
theory. Random sampling from populations also induces the iid assumption.
Linearity can be checked with the response plot, and similar shape and spread
of the location families can be checked with both the response and residual
plots. Also check that outliers are not present. If the p dot plots in the
response plot are approximately symmetric, then the sample sizes ni can
be smaller than if the dot plots are skewed.

Remark 5.2. When the assumption that the p groups come from the
same location family with finite variance σ² is violated, the one way ANOVA
F test may not make much sense because unequal means may not imply the
superiority of one category over another. Suppose Y is the time in minutes
until relief from a headache and that Y1j ∼ N(60, 1) while Y2j ∼ N(65, σ²).
If σ² = 1, then the type 1 medicine gives headache relief 5 minutes faster, on
average, and is superior, all other things being equal. But if σ² = 100, then
many patients taking medicine 2 experience much faster pain relief than those
taking medicine 1, and many experience much longer time until pain relief.
In this situation, predictor variables that would identify which medicine is
faster for a given patient would be very useful.

Example 5.5. The output below represents grams of fat (minus 100
grams) absorbed by doughnuts using 4 types of fat. See Snedecor and Cochran
(1967, p. 259). Let μi denote the mean amount of fat i absorbed by doughnuts,
i = 1, 2, 3 and 4. a) Find μ̂1. b) Perform a 4 step ANOVA F test.
Solution: a) μ̂1 = Ȳ10 = Y10/n1 = Σ_{j=1}^{n1} Y1j/n1 =
(64 + 72 + 68 + 77 + 56 + 95)/6 = 432/6 = 72.
b) i) H0: μ1 = μ2 = μ3 = μ4 Ha: not H0
ii) F = 5.41
iii) pval = 0.0069
iv) Reject H0, the mean amount of fat absorbed by doughnuts depends on
the type of fat.

fat1 fat2 fat3 fat4


64 78 75 55
72 91 93 66
68 97 78 49
77 82 71 64
56 85 63 70
95 77 76 68
One way Anova for Fat1 Fat2 Fat3 Fat4
Source DF SS MS F P
treatment 3 1636.5 545.5 5.41 0.0069
error 20 2018.0 100.9
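The F statistic and pval in the output above can be reproduced directly from the data; a minimal Python sketch using scipy (the book's own code is in R):

```python
import numpy as np
from scipy import stats

# Doughnut data from Example 5.5: grams of fat (minus 100 g) absorbed.
fat1 = [64, 72, 68, 77, 56, 95]
fat2 = [78, 91, 97, 82, 85, 77]
fat3 = [75, 93, 78, 71, 63, 76]
fat4 = [55, 66, 49, 64, 70, 68]

# One way ANOVA F test: Fo = MSTR/MSE with (p-1, n-p) = (3, 20) df.
Fo, pval = stats.f_oneway(fat1, fat2, fat3, fat4)
print(round(Fo, 2), round(pval, 4))  # 5.41 0.0069

# muhat_1 = sample mean of the first group.
print(np.mean(fat1))  # 72.0
```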
Definition 5.14. A contrast C = Σ_{i=1}^p ki μi where Σ_{i=1}^p ki = 0. The
estimated contrast is Ĉ = Σ_{i=1}^p ki Ȳi0.
If the null hypothesis of the fixed effects one way ANOVA test is not true,
then not all of the means μi are equal. Researchers will often have hypotheses,
before examining the data, that they desire to test. Often such a hypothesis
can be put in the form of a contrast. For example, the contrast C = μi − μj
is used to compare the means of the ith and jth groups while the contrast
μ1 − (μ2 + ⋯ + μp)/(p − 1) is used to compare the last p − 1 groups with
the 1st group. This contrast is useful when the 1st group corresponds to a
standard or control treatment while the remaining groups correspond to new
treatments.
Assume that the normal cell means model is a useful approximation to the
data. Then the Ȳi0 ∼ N(μi, σ²/ni) are independent, and

Ĉ = Σ_{i=1}^p ki Ȳi0 ∼ N(C, σ² Σ_{i=1}^p k²i/ni).

Hence the standard error

SE(Ĉ) = √(MSE Σ_{i=1}^p k²i/ni).

The degrees of freedom is equal to the MSE degrees of freedom = n − p.
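The estimated contrast and its standard error are easy to compute by hand; the Python sketch below (numpy only) reuses the Example 5.5 doughnut data with the illustrative contrast C = μ1 − μ2, i.e. k = (1, −1, 0, 0):

```python
import numpy as np

# Doughnut data from Example 5.5; contrast C = mu_1 - mu_2.
groups = [np.array([64, 72, 68, 77, 56, 95]),
          np.array([78, 91, 97, 82, 85, 77]),
          np.array([75, 93, 78, 71, 63, 76]),
          np.array([55, 66, 49, 64, 70, 68])]
k = np.array([1.0, -1.0, 0.0, 0.0])
n = sum(len(g) for g in groups)
p = len(groups)

means = np.array([g.mean() for g in groups])
ni = np.array([len(g) for g in groups])
mse = sum(((g - g.mean())**2).sum() for g in groups) / (n - p)

Chat = (k * means).sum()               # estimated contrast: 72 - 85 = -13
se = np.sqrt(mse * (k**2 / ni).sum())  # SE(Chat), with df = n - p = 20
print(round(Chat, 1), round(se, 3))    # -13.0 5.799
```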


Consider a family of null hypotheses for contrasts {Ho: Σ_{i=1}^p ki μi = 0
where Σ_{i=1}^p ki = 0 and the ki may satisfy other constraints}. Let δS denote
the probability of a type I error for a single test from the family where a
type I error is a false rejection. The family level δF is an upper bound
on the (usually unknown) size δT. Know how to interpret δF ≥ δT = P(of
making at least one type I error among the family of contrasts).
Two important families of contrasts are the family of all possible con-
trasts and the family of pairwise differences Cij = μi − μj where i ≠ j. The
Scheffé multiple comparisons procedure has a δF for the family of all possible
contrasts, while the Tukey multiple comparisons procedure has a δF for the
family of all p(p − 1)/2 pairwise contrasts.
To interpret output for multiple comparisons procedures, the underlined
means or blocks of letters beside groups of means indicate that the group of
means are not significantly different.
Example 5.6. The output below uses data from SAS Institute (1985,
pp. 126–129). The mean nitrogen content of clover depends on the strain
of clover (3dok1, 3dok5, 3dok7, compos, 3dok4, 3dok13). Recall that means
μ1 and μ2 are significantly different if you can conclude that μ1 ≠ μ2 while
μ1 and μ2 are not significantly different if there is not enough evidence to
conclude that μ1 ≠ μ2 (perhaps because the means are approximately equal
or perhaps because the sample sizes are not large enough).

Notice that the strain of clover 3dok1 appears to have the highest mean
nitrogen content. There are 4 pairs of means that are not significantly differ-
ent. The letter B suggests 3dok5 and 3dok7, the letter C suggests 3dok7 and
compos, the letter D suggests compos and 3dok4, while the letter E suggests
3dok4 and 3dok13 are not significantly different.

Means with the same letter are not significantly


different.
Waller Grouping Mean N strain
A 28.820 5 3dok1

B 23.980 5 3dok5
B
C B 19.920 5 3dok7
C
C D 18.700 5 compos
D
E D 14.640 5 3dok4
E
E 13.260 5 3dok13

Remark 5.3. Two graphical methods can also be used. Recall from Chap-
ter 1 that a response plot is an estimated sufficient summary plot. If n is not
too small, each ni ≥ 5, and the sample mean (where the dot plot crosses
the identity line) for one dot plot is below or above another dot plot, then
conclude that the population mean corresponding to the higher dot plot is
greater than the sample mean corresponding to the lower dot plot. As the
ni increase, the sample mean of one dot plot only needs to be above or be-
low most of the cases in the other dot plot. The p population means may or
may not be equal if all p of the dot plots have lots of overlap. This will hap-
pen, for example, if the response plot looks like the residual plot. Hence this
graphical method is inconclusive for Figure 5.1a. Remark 5.2 gives another
situation where this graphical method can fail. An advantage of this graphi-
cal method is that the p populations do not need to come from populations
with the same variance or from the same location scale family as long as OLS
gives a consistent estimator of β. The second graphical method is given in
Definition 5.15.

Example 5.6, continued: Figure 5.2 shows the response and residual
plots for the clover data. The plots suggest the constant variance assumption
is not reasonable. The population means may or may not dier for the groups
with the two smallest sample means, but these two groups appear to have
smaller population means than the other groups. Similarly, the population
means may or may not dier for the two groups with sample means near

[Figure 5.2: Response Plot (Y versus FIT) and Residual Plot (RES versus
FIT) for the clover data.]

Fig. 5.2 Response and Residual Plots for Clover Data

20, but these two groups appear to have population means that are smaller
than the two groups with the largest sample means. The population means
of these last two groups may or may not dier. Figure 5.2 was made with the
following commands, using the lregpack function aovplots.

x<-c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,
5,6,6,6,6,6)

y<-c(19.4,32.6,27.0,32.1,33.0,17.7,24.8,27.9,25.2,
24.3,17.0,19.4,9.1,11.9,15.8,20.7,21.0,20.5,18.8,
18.6,14.3,14.4,11.8,11.6,14.2,17.3,19.4,19.1,16.9,
20.8)

x <- factor(x)
z <- aov(y~x)
aovplots(Y=y,FIT=fitted(z),RES=resid(z))
#right click stop twice

Definition 5.15. Graphical Anova for the one way model uses the
residuals as a reference set instead of a t, F, or normal distribution. The
scaled treatment deviations or scaled effects c(Ȳi0 − Ȳ00) = c(μ̂i − Ȳ00)
are scaled to have the same variability as the residuals. A dot plot of the
scaled deviations is placed above the dot plot of the residuals. Assume that
ni ≡ m = n/p for i = 1, . . . , p. For small n ≤ 40, suppose the distance be-
tween two scaled deviations (A and B, say) is greater than the range of the
residuals = max(rij) − min(rij). Then declare A and B to be significantly
different. If the distance is less than the range, do not declare A and B to
be significantly different. Scaled deviations that lie outside the range of the
residuals are significant (so significantly different from the overall mean).

For n ≥ 100, let r(1) ≤ r(2) ≤ ⋯ ≤ r(n) be the order statistics of the
residuals. Then instead of the range, use r(⌈0.975n⌉) − r(⌈0.025n⌉) as the dis-
tance where ⌈x⌉ is the smallest integer ≥ x, e.g. ⌈7.7⌉ = 8. So effects outside
of the interval (r(⌈0.025n⌉), r(⌈0.975n⌉)) are significant. See Box et al. (2005,
pp. 136, 166). A derivation of the scaling constant c = √((n − p)/(p − 1)) is
given in Section 5.6.
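The scaled deviations are simple to compute; a minimal Python sketch of the small-n rule on hypothetical balanced data (numpy only), with c = √((n − p)/(p − 1)):

```python
import numpy as np

# Hypothetical balanced one way Anova data: p = 3 groups of m = 4 responses.
groups = [np.array([10.2, 11.0, 9.8, 10.6]),
          np.array([12.5, 13.1, 12.0, 12.9]),
          np.array([10.0, 9.5, 10.8, 10.1])]
y = np.concatenate(groups)
n, p = len(y), len(groups)

c = np.sqrt((n - p) / (p - 1))   # scaling constant from Definition 5.15
scaled_dev = np.array([c * (g.mean() - y.mean()) for g in groups])
resid = np.concatenate([g - g.mean() for g in groups])

# A scaled deviation outside the range of the residuals is significant.
significant = (scaled_dev < resid.min()) | (scaled_dev > resid.max())
for dev, sig in zip(scaled_dev, significant):
    print(round(dev, 2), bool(sig))   # all three effects are significant here
```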

[Figure 5.3: dot plot of the scaled treatment deviations placed above a dot
plot of the residuals.]

Fig. 5.3 Graphical Anova

ganova(x,y)
smn 0.0296 0.0661 -0.0508 -0.0449
Treatments "A" "B" "C" "D"

Example 5.7. Cobb (1998) describes a one way Anova design used to
study the amount of calcium in the blood. For many animals, the body's
ability to use calcium depends on the level of certain hormones in the blood.
The response was 1/(level of plasma calcium). The four groups were A: Fe-
male controls, B: Male controls, C: Females given hormone, and D: Males
given hormone. There were 10 birds of each gender, and five from each gen-
der were given the hormone. The output above uses the lregpack function
ganova to produce Figure 5.3.

In Figure 5.3, the top dot plot has the scaled treatment deviations. From
left to right, these correspond to C, D, A, and B since the output shows that
the deviation corresponding to C is the smallest with value −0.0508. Since the
deviations corresponding to C and D are much closer than the range of the
residuals, the C and D effects yielded similar mean response values. A and
B appear to be significantly different from C and D. The distance between
the scaled A and B treatment deviations is about the same as the distance
between the smallest and largest residuals, so there is only marginal evidence
that the A and B effects are significantly different.
Since all 4 scaled deviations lie outside of the range of the residuals, all
effects A, B, C, and D appear to be significant.

5.3 Random Effects One Way Anova

Definition 5.16. For the random effects one way Anova, the levels of
the factor are a random sample of levels from some population of levels ΛF.
The cell means model for the random effects one way Anova is Yij = μi + eij
for i = 1, . . . , p and j = 1, . . . , ni. The μi are randomly selected from some
population with mean μ and variance σ²μ, where selecting a level from ΛF
is equivalent to selecting a μi from this population. The eij and μi are
independent, and the eij are iid from a location family with pdf f, mean 0,
and variance σ². The Yij|μi ∼ f(y − μi), the location family with location
parameter μi and variance σ². Unconditionally, E(Yij) = μ and V(Yij) =
σ² + σ²μ.

For the random effects model, the μi are independent random variables
with E(μi) = μ and V(μi) = σ²μ. The cell means model for fixed effects one
way Anova is very similar to that for the random effects model, but the μi
are fixed constants rather than random variables.

Definition 5.17. For the normal random effects one way Anova model,
μi ∼ N(μ, σ²μ). Thus the μi are independent N(μ, σ²μ) random variables. The
eij are iid N(0, σ²) and the eij and μi are independent. For this model,
Yij|μi ∼ N(μi, σ²) for i = 1, . . . , p. Note that the conditional variance σ² is
the same for each μi. Unconditionally, Yij ∼ N(μ, σ² + σ²μ).
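The unconditional variance decomposition V(Yij) = σ² + σ²μ can be checked by simulation; a Python sketch of the normal random effects model with hypothetical parameter values μ = 50, σμ = 2, σ = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma_mu, sigma = 50.0, 2.0, 1.0
p, m = 2000, 5   # many randomly drawn levels, m responses per level

# Draw level means mu_i ~ N(mu, sigma_mu^2), then Y_ij | mu_i ~ N(mu_i, sigma^2).
mu_i = rng.normal(mu, sigma_mu, size=p)
y = rng.normal(loc=np.repeat(mu_i, m), scale=sigma)

print(round(y.mean(), 1))   # close to mu = 50
print(round(y.var(), 1))    # close to sigma^2 + sigma_mu^2 = 5
```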

The fixed effects one way Anova tested Ho: μ1 = ⋯ = μp. For the
random effects one way Anova, interest is in whether μi ≡ μ for every μi in
ΛF where the population of levels is not necessarily finite. Note that if σ²μ = 0,
then μi ≡ μ for all μi ∈ ΛF. In the sample of p levels, the μi will differ if
σ²μ > 0.

Be able to perform the 4 step random effects one way ANOVA
F test of hypotheses:
i) Ho: σ²μ = 0 Ha: σ²μ > 0
ii) Fo = MSTR/MSE is usually obtained from output.
iii) The pval = P(Fp−1,n−p > Fo) is usually obtained from output.
iv) If pval ≤ δ reject Ho, conclude that σ²μ > 0 and that the mean response
depends on the factor level. Otherwise, fail to reject Ho, conclude that σ²μ = 0
and that the mean response does not depend on the factor level. (Or there
is not enough evidence to conclude that the mean response depends on the
factor level.)

The ANOVA tables for the fixed and random effects one way Anova models
are exactly the same, and the two F tests are very similar. The main difference
is that the conclusions for the random effects model can be generalized to
the entire population of levels. For the fixed effects model, the conclusions
only hold for the p fixed levels. If Ho: σ²μ = 0 is true and the random effects
model holds, then the Yij are iid with pdf f(y − μ). So the F statistic for
the random effects test has an approximate Fp−1,n−p distribution if the ni
are large by the results for the fixed effects one way ANOVA test. For both
tests, the pval is an estimate of the population p-value.

Source df SS MS F P
brand 5 854.53 170.906 238.71 0.0000
error 42 30.07 0.716

Example 5.8. Data is from Kutner et al. (2005, problem 25.7). A re-
searcher is interested in the amount of sodium in beer. She selects 6 brands
of beer at random from 127 brands and the response is the average sodium
content measured from 8 cans of each brand.
a) State whether this is a random or fixed effects one way Anova. Explain
briefly.
b) Using the output above, perform the appropriate 4 step ANOVA F
test.
Solution: a) Random effects since the beer brands were selected at random
from a population of brands.
b) i) H0: σ²μ = 0 Ha: σ²μ > 0
ii) F0 = 238.71
iii) pval = 0.0
iv) Reject H0, so σ²μ > 0 and the mean amount of sodium depends on the
beer brand.

Remark 5.4. The response and residual plots for the random effects mod-
els are interpreted in the same way as for the fixed effects model, except that
the dot plots are from a random sample of p levels instead of from p fixed
levels.

5.4 Response Transformations for Experimental Design

A model for an experimental design is Yi = E(Yi) + ei for i = 1, . . . , n where
the error ei = Yi − E(Yi) and E(Yi) ≡ E(Yi|xi) is the expected value of the
response Yi for a given vector of predictors xi. Many models can be fit with
least squares (OLS or LS) and are linear models of the form

Yi = xi,1 β1 + xi,2 β2 + ⋯ + xi,p βp + ei = xiᵀβ + ei

for i = 1, . . . , n. Often xi,1 ≡ 1 for all i. In matrix notation, these n equations
become

Y = Xβ + e,

where Y is an n × 1 vector of dependent variables, X is an n × p design
matrix of predictors, β is a p × 1 vector of unknown coefficients, and e is
an n × 1 vector of unknown errors. If the fitted values are Ŷi = xiᵀβ̂, then
Yi = Ŷi + ri where the residuals ri = Yi − Ŷi.
The applicability of an experimental design model can be expanded by
allowing response transformations. An important class of response transfor-
mation models adds an additional unknown transformation parameter λo,
such that

Yi = tλo(Zi) ≡ Zi^λo = E(Yi) + ei = xiᵀβ + ei.

If λo was known, then Yi = tλo(Zi) would follow the linear model for the
experimental design.

Definition 5.18. Assume that all of the values of the response Zi are
positive. A power transformation has the form Y = tλ(Z) = Z^λ for λ ≠ 0
and Y = t0(Z) = log(Z) for λ = 0 where λ ∈ ΛL = {−1, −1/2, 0, 1/2, 1}.
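The power transformation family can be coded directly; a minimal Python sketch (numpy only):

```python
import numpy as np

LAMBDAS = (-1.0, -0.5, 0.0, 0.5, 1.0)   # the candidate set Lambda_L

def t_lambda(z, lam):
    """Power transformation: Y = Z**lam for lam != 0, Y = log(Z) for lam = 0.
    All responses z must be positive."""
    z = np.asarray(z, dtype=float)
    if np.any(z <= 0):
        raise ValueError("power transformations require Z > 0")
    return np.log(z) if lam == 0 else z**lam

z = np.array([0.5, 1.0, 4.0])
print(t_lambda(z, 0.5))   # sqrt(Z)
print(t_lambda(z, 0.0))   # log(Z)
```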

A graphical method for response transformations computes the fitted val-
ues Ŵi from the experimental design model using Wi = tλ(Zi) as the re-
sponse. Then a plot of the Ŵ versus W is made for each of the five values
of λ ∈ ΛL. The plotted points follow the identity line in a (roughly) evenly
populated band if the experimental design model is reasonable for (W, Ŵ).
An exception is the one way Anova model where there will be p dot plots
of roughly the same shape and spread that scatter about the identity line.
If more than one value of λ ∈ ΛL gives a linear plot, consult subject matter
experts and use the simplest or most reasonable transformation. Also look
at the residual plots of the competing transformations. Note that ΛL has 5
models, and the graphical method selects the model with the best response
plot. After selecting the transformation, the usual checks should be made. In
particular, the transformation plot is also the response plot, and a residual
plot should be made. The Equation (3.3) transformations could also be used.

Definition 5.19. A transformation plot is a plot of (Ŵ, W) with the iden-
tity line added as a visual aid.

In the following example, the plots show tλ(Z) on the vertical axis. The
label TZHAT on the horizontal axis denotes the fitted values that result from
using tλ(Z) as the response in the software.

[Figure 5.4: five transformation plots for the crab data, with 1/Z, 1/√Z,
log(Z), √Z, and Z on the vertical axes plotted against TZHAT.]

Fig. 5.4 Transformation Plots for Crab Data

For one way Anova models with ni ≡ m ≥ 5, look for a transformation
plot that satisfies the following conditions. i) The p dot plots scatter about
the identity line with similar shape and spread. ii) Dot plots with more skew
are worse than dot plots with less skew or dot plots that are approximately
symmetric. iii) Spread that increases or decreases with TZHAT is bad.
Example 5.4, continued. Following Kuehl (1994, p. 128), let C be the
count of crabs and let the response Z = C + 1/6. Figure 5.4 shows the
five transformation plots. The transformation log(Z) results in dot plots that
have roughly the same shape and spread. The transformations 1/Z and 1/√Z
do not handle the 0 counts well, and the dot plots fail to cover the identity
line. The transformations √Z and Z have variance that increases with the
mean. See Problem 5.13 to reproduce the plots.
Remark 5.5. The graphical method for response transformations can be
used for design models that are linear models, not just one way Anova models.
The method is nearly identical to that of Chapter 3, but ΛL only has 5 values.
The log rule states that if all of the Zi > 0 and if max(Zi)/min(Zi) > 10,
then the response transformation Y = log(Z) will often work.
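The log rule is a one-line check; a minimal Python sketch (numpy only, hypothetical data):

```python
import numpy as np

def log_rule(z):
    """Suggest Y = log(Z) when all Z > 0 and max(Z)/min(Z) > 10."""
    z = np.asarray(z, dtype=float)
    return bool(np.all(z > 0) and z.max() / z.min() > 10)

# Crab-like counts Z = C + 1/6 span several orders of magnitude.
print(log_rule([1 / 6, 2.17, 35.2, 410.0]))   # True: try log(Z)
print(log_rule([8.0, 9.5, 11.0]))             # False
```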

5.5 Summary

1) The fixed effects one way Anova model has one qualitative explanatory
variable called a factor and a quantitative response variable Yij. The factor
variable has p levels, E(Yij) = μi and V(Yij) = σ² for i = 1, . . . , p and
j = 1, . . . , ni. Experimental units are randomly assigned to the treatment
levels.
2) Let n = n1 + ⋯ + np. In an experiment, the investigators use random-
ization to randomly assign n units to treatments. Draw a random permutation
of {1, . . . , n}. Assign the first n1 units to treatment 1, the next n2 units to
treatment 2, . . . , and the final np units to treatment p. Use ni ≡ m = n/p if
possible. Randomization washes out the effect of lurking variables.
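The randomization scheme above can be sketched with a random permutation; a minimal Python example (numpy, hypothetical group sizes):

```python
import numpy as np

def assign(n_per_treatment, seed=0):
    """Randomly assign n = sum(n_i) units to p treatments: draw a
    permutation of {0,...,n-1}, give the first n_1 unit labels to
    treatment 1, the next n_2 to treatment 2, and so on."""
    rng = np.random.default_rng(seed)
    n = sum(n_per_treatment)
    perm = rng.permutation(n)
    out, start = [], 0
    for ni in n_per_treatment:
        out.append(perm[start:start + ni])
        start += ni
    return out

groups = assign([4, 4, 4])
print(groups)   # three disjoint groups of 4 unit labels
```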
3) The 4 step fixed effects one way ANOVA F test has steps
i) Ho: μ1 = μ2 = ⋯ = μp and Ha: not Ho.
ii) Fo = MSTR/MSE is usually given by output.
iii) The pval = P(Fp−1,n−p > Fo) is usually given by output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends
on the factor level. Otherwise fail to reject Ho and conclude that the mean
response does not depend on the factor level. (Hence all of the treatment
means are equal, or there is not enough evidence to conclude that the mean
response depends on the factor level.) Give a nontechnical sentence.

Summary Analysis of Variance Table

Source     df   SS    MS    F             p-value
Treatment  p-1  SSTR  MSTR  Fo=MSTR/MSE   for Ho:
Error      n-p  SSE   MSE                 μ1 = ⋯ = μp

4) Shown is an ANOVA table given in symbols. Sometimes "Treatment" is
replaced by "Between treatments," "Between Groups," "Model," "Factor," or
"Groups." Sometimes "Error" is replaced by "Residual," or "Within Groups."
Sometimes "p-value" is replaced by "P," "Pr(> F)," or "PR > F."
5) Boxplots and dot plots for each level are useful for this test. A dot plot
of Z1, . . . , Zm consists of an axis and m points each corresponding to the
value of Zi. If all of the boxplots or dot plots are about the same for the
response plot, then the ANOVA F test may or may not fail to reject Ho. If
Ho is true, then Yij = μ + eij where the eij are iid with 0 mean and constant
variance σ². Then μ̂ = Ȳ00 and the factor levels do not help predict Yij.
6) Let fZ(z) be the pdf of Z. Then the family of pdfs fY(y) = fZ(y − μ)
indexed by the location parameter μ, −∞ < μ < ∞, is the location family
for the random variable Y = μ + Z with standard pdf fZ(y). A one way
fixed effects Anova model has a single qualitative predictor variable W with
p categories a1, . . . , ap. There are p different distributions for Y, one for each
category ai. The distribution of

Y |(W = ai) ∼ fZ(y − μi)

where the location family has second moments. Hence all p distributions
come from the same location family with different location parameter μi and
the same variance σ². The one way fixed effects normal Anova model is the
special case where Y |(W = ai) ∼ N(μi, σ²).
7) The response plot is a plot of Ŷ versus Y. For the one way Anova model,
the response plot is a plot of Ŷij = μ̂i versus Yij. Often the identity line with
unit slope and zero intercept is added as a visual aid. Vertical deviations
from the identity line are the residuals rij = Yij − Ŷij = Yij − μ̂i. The plot
will consist of p dot plots that scatter about the identity line with similar
shape and spread if the fixed effects one way Anova model is appropriate.
The ith dot plot is a dot plot of Yi,1, . . . , Yi,ni. Assume that each ni ≥ 10. If
the response plot looks like the residual plot, then a horizontal line fits the p
dot plots about as well as the identity line, and there is not much difference
in the μi. If the identity line is clearly superior to any horizontal line, then
at least some of the means differ.
8) The residual plot is a plot of Ŷ versus residual r = Y − Ŷ. The plot will
consist of p dot plots that scatter about the r = 0 line with similar shape
and spread if the fixed effects one way Anova model is appropriate. The ith
dot plot is a dot plot of ri,1, . . . , ri,ni. Assume that each ni ≥ 10. Under
the assumption that the Yij are from the same location family with different
parameters μi, each of the p dot plots should have roughly the same shape
and spread. This assumption is easier to judge with the residual plot than
with the response plot.
9) Rule of thumb: If max(S1, . . . , Sp) ≤ 2 min(S1, . . . , Sp), then the one
way ANOVA F test results will be approximately correct if the response and
residual plots suggest that the remaining one way Anova model assumptions
are reasonable. Replace the Si by the ranges Ri of the dot plots in the residual
and response plots.
10) In an experiment, the investigators assign units to treatments. In
an observational study, investigators simply observe the response, and the
treatment groups need to be p random samples from p populations (the lev-
els). The effects of lurking variables are present in observational studies.
11) If a qualitative variable has c levels, represent it with c − 1 or c indicator
variables. Given a qualitative variable, know how to represent the data with
indicator variables.
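The indicator coding in point 11 can be sketched in Python (numpy only); dropping the first level is one common convention when the model has an intercept, so c levels give c − 1 columns:

```python
import numpy as np

def indicators(levels, drop_first=True):
    """Code a qualitative variable with c levels as c-1 (or c) 0-1 columns."""
    cats = sorted(set(levels))
    cols = cats[1:] if drop_first else cats
    return np.column_stack([[1 if v == c else 0 for v in levels] for c in cols])

w = ["a1", "a2", "a3", "a1"]
print(indicators(w))           # 2 columns: indicators for a2 and a3
print(indicators(w, False))    # 3 columns, one per level
```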
12) The cell means model for the fixed effects one way Anova is Yij =
μi + eij where Yij is the value of the response variable for the jth trial
of the ith factor level for i = 1, . . . , p and j = 1, . . . , ni. The μi are the
unknown means and E(Yij) = μi. The eij are iid from the location family
with pdf fZ(z), zero mean, and unknown variance σ² = V(Yij) = V(eij).
For the normal cell means model, the eij are iid N(0, σ²). The estimator
μ̂i = Ȳi0 = Σ_{j=1}^{ni} Yij/ni = Ŷij. The ith residual is rij = Yij − Ȳi0, and Ȳ00 is
the sample mean of all of the Yij and n = Σ_{i=1}^p ni. The total sum of squares
SSTO = Σ_{i=1}^p Σ_{j=1}^{ni} (Yij − Ȳ00)², the treatment sum of squares SSTR =
Σ_{i=1}^p ni (Ȳi0 − Ȳ00)², and the error sum of squares SSE = Σ_{i=1}^p Σ_{j=1}^{ni} (Yij −
Ȳi0)². The MSE is an estimator of σ². The ANOVA table is the same as
that for multiple linear regression, except that SSTR replaces the regression
sum of squares and that SSTO, SSTR, and SSE have n − 1, p − 1, and n − p
degrees of freedom.
13) Let Yi0 = Σ_{j=1}^{ni} Yij and let

μ̂i = Ȳi0 = Yi0/ni = (1/ni) Σ_{j=1}^{ni} Yij.

Hence the "dot" notation means sum over the subscript corresponding to the
0, e.g. j. Similarly, Y00 = Σ_{i=1}^p Σ_{j=1}^{ni} Yij is the sum of all of the Yij. Be able
to find μ̂i from data.
14) If the p treatment groups have the same pdf (so μi ≡ μ in the location
family) with finite variance σ², and if the one way ANOVA F test statistic is
computed from all n!/(n1! ⋯ np!) ways of assigning ni of the response variables
to treatment i, then the histogram of the F test statistic is approximately
Fp−1,n−p for large ni.
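This permutation interpretation can be checked by simulation; the Python sketch below (scipy, simulated data) samples random permutations rather than enumerating all n!/(n1! ⋯ np!) assignments, and checks that the upper tail of the permutation distribution of the F statistic roughly matches Fp−1,n−p:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sizes = [10, 10, 10]
y = rng.normal(size=sum(sizes))   # common pdf: Ho is true here

def fstat(y, sizes):
    """One way ANOVA F statistic after splitting y into the given group sizes."""
    groups, start = [], 0
    for ni in sizes:
        groups.append(y[start:start + ni])
        start += ni
    return stats.f_oneway(*groups).statistic

# Permutation distribution of the F statistic for this data set.
perm_f = np.array([fstat(rng.permutation(y), sizes) for _ in range(2000)])

# Fraction exceeding the F_{p-1, n-p} 95% point should be near 0.05.
cut = stats.f.ppf(0.95, 2, 27)
frac = np.mean(perm_f > cut)
print(round(frac, 2))   # near 0.05
```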
15) For the one way Anova, the fitted values Ŷij = Ȳi0 and the residuals
rij = Yij − Ŷij.
16) Know that for the random effects one way Anova, the levels of
the factor are a random sample of levels from some population of levels ΛF.
Assume the μi are iid with mean μ and variance σ²μ. The cell means model
for the random effects one way Anova is Yij = μi + eij for i = 1, . . . , p and
j = 1, . . . , ni. The sample size n = n1 + ⋯ + np and often ni ≡ m so n = pm.
The μi and eij are independent. The eij have mean 0 and variance σ². The
Yij|μi ∼ f(y − μi), a location family with variance σ² while eij ∼ f(y). In
the test below, if H0: σ²μ = 0 is true, then the Yij are iid with pdf f(y − μ),
so the F statistic ≈ Fp−1,n−p if the ni are large.
17) Know that the 4 step random effects one way Anova test is
i) H0: σ²μ = 0 HA: σ²μ > 0
ii) F0 = MSTR/MSE is usually obtained from output.
iii) The pval = P(Fp−1,n−p > F0) is usually obtained from output.
iv) If pval ≤ δ reject Ho, conclude that σ²μ > 0 and that the mean response
depends on the factor level. Otherwise, fail to reject Ho, conclude that σ²μ = 0
and that the mean response does not depend on the factor level. (Or there
is not enough evidence to conclude that the mean response depends on the
factor level.)
18) Know how to tell whether the experiment is a fixed or random effects
one way Anova. (Were the levels fixed or a random sample from a population
of levels?)

19) The applicability of a DOE (design of experiments) model can be ex-
panded by allowing response transformations. An important class of response
transformation models is

Y = tλo(Z) = E(Y) + e = xᵀβ + e

where the subscripts (e.g., Yij) have been suppressed. If λo was known, then
Y = tλo(Z) would follow the DOE model. Assume that all of the values
of the response Z are positive. A power transformation has the form
Y = tλ(Z) = Z^λ for λ ≠ 0 and Y = t0(Z) = log(Z) for λ = 0 where
λ ∈ ΛL = {−1, −1/2, 0, 1/2, 1}.
20) A graphical method for response transformations computes the fitted
values Ŵ from the DOE model using W = t_λ(Z) as the response for each
of the five values of λ ∈ Λ_L. Let T̂ = Ŵ = TZHAT and plot TZHAT vs.
t_λ(Z) for λ ∈ {−1, −1/2, 0, 1/2, 1}. These plots are called transformation
plots. The residual or error degrees of freedom used to compute the MSE
should not be too small. Choose the transformation Y = t_λ(Z) that has the
best plot. Consider the one way Anova model with n_i ≥ 5 for i = 1, . . . , p.
i) The dot plots should spread about the identity line with similar shape
and spread. ii) Dot plots that are approximately symmetric are better than
skewed dot plots. iii) Spread that increases or decreases with TZHAT (the
shape of the plotted points is similar to a right or left opening megaphone)
is bad.
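The fitted values used in a transformation plot are just the cell means of the transformed response. A small Python sketch of computing the (TZHAT, t_λ(Z)) pairs for each λ in the coarse grid, with hypothetical positive response data:

```python
import math

# For each lambda in the coarse grid, transform Z and compute the one way
# Anova fitted values (the cell means of the transformed response W).
# Hypothetical data, not from the text.

def t_lam(z, lam):
    return math.log(z) if lam == 0 else z ** lam

def fitted_by_group(groups, lam):
    """Return per-group lists of (fitted TZHAT, transformed response) pairs."""
    out = []
    for g in groups:
        w = [t_lam(z, lam) for z in g]
        wbar = sum(w) / len(w)          # cell mean = fitted value
        out.append([(wbar, wi) for wi in w])
    return out

grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
groups = [[1.0, 2.0, 4.0], [8.0, 16.0, 32.0]]
plots = {lam: fitted_by_group(groups, lam) for lam in grid}
```

Plotting each of the five lists against the identity line gives the five transformation plots described above.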
21) The transformation plot for the selected transformation is also the
response plot for that model (e.g., for the model that uses Y = log(Z) as
the response). Make all of the usual checks on the DOE model (residual and
response plots) after selecting the response transformation.
22) The log rule says try Y = log(Z) if max(Z)/min(Z) > 10 where
Z > 0 and the subscripts have been suppressed (so Z ≡ Z_ij for the one way
Anova model).
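The log rule is a one line check; a Python sketch with hypothetical data:

```python
# The log rule: try Y = log(Z) when max(Z)/min(Z) > 10 (all Z > 0).
def log_rule(z):
    return max(z) / min(z) > 10

z = [0.5, 2.0, 7.0]        # ratio 14 > 10, so the rule suggests log(Z)
suggest = log_rule(z)
```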
23) A contrast C = Σ_{i=1}^p k_i μ_i where Σ_{i=1}^p k_i = 0. The estimated contrast
is Ĉ = Σ_{i=1}^p k_i Ȳ_{i0}.
24) Consider a family of null hypotheses for contrasts {Ho: Σ_{i=1}^p k_i μ_i = 0
where Σ_{i=1}^p k_i = 0 and the k_i may satisfy other constraints}. Let δ_S denote
the probability of a type I error for a single test from the family. The family
level δ_F is an upper bound on the (usually unknown) size δ_T. Know how to
interpret δ_F ≥ δ_T = P(of making at least one type I error among the family
of contrasts) where a type I error is a false rejection.
25) Two important families of contrasts are the family of all possible contrasts and the family of pairwise differences C_ij = μ_i − μ_j where i ≠ j.
The Scheffé multiple comparisons procedure has a δ_F for the family of all
possible contrasts, while the Tukey multiple comparisons procedure has a δ_F
for the family of all \binom{p}{2} pairwise contrasts.
26) Know how to interpret output for multiple comparisons procedures.
Underlined means or blocks of letters beside groups of means indicate that
the group of means are not significantly different.
27) Graphical Anova for the one way Anova model makes a dot plot
of scaled treatment deviations (effects) above a dot plot of the residuals. For
small n ≤ 40, suppose the distance between two scaled deviations (A and B,
say) is greater than the range of the residuals = max(r_ij) − min(r_ij). Then
declare A and B to be significantly different. If the distance is less than
the range, do not declare A and B to be significantly different. Assume
the n_i ≡ m for i = 1, . . . , p. Then the ith scaled deviation is c(Ȳ_{i0} − Ȳ_{00}) =
c d̂_i where c = \sqrt{dfe/dftreat} = \sqrt{(n−p)/(p−1)}.
28) The analysis of the response, not that of the residuals, is of primary
importance. The response plot can be used to analyze the response in the
background of the fitted model. For linear models such as experimental designs, the estimated mean function is the identity line and should be added
as a visual aid.
29) Assume that the residual degrees of freedom are large enough for testing. Then the response and residual plots contain much information. Linearity
and constant variance may be reasonable if the p dot plots have roughly the
same shape and spread, and the dot plots scatter about the identity line.
The p dot plots of the residuals should have similar shape and spread, and
the dot plots scatter about the r = 0 line. It is easier to check linearity with
the response plot and constant variance with the residual plot. Curvature is
often easier to see in a residual plot, but the response plot can be used to
check whether the curvature is monotone or not. The response plot is more
effective for determining whether the signal to noise ratio is strong or weak,
and for detecting outliers or influential cases.
5.6 Complements
Often the data does not consist of samples from p populations, but consists
of a group of n = mp units where m units are randomly assigned to each of
the p treatments. Then the Anova models can still be used to compare treat-
ments, but statistical inference to a larger population cannot be made. Of
course a nonstatistical generalization to larger populations can be made. The
nonstatistical generalization from the group of units to a larger population
is most compelling if several experiments are done with similar results. For
example, generalizing the results of an experiment for psychology students
to the population of all of the university students is less compelling than the
following generalization. Suppose one experiment is done for psychology students, one for engineers, and one for English majors. If all three experiments
give similar results, then generalize the results to the population of all of the
university's students.
Four good texts on the design and analysis of experiments are Box et al.
(2005), Cobb (1998), Kuehl (1994), and Ledolter and Swersey (2007). Also
see Dean and Voss (2000), Kirk (2012), Maxwell and Delaney (2003), Montgomery (2012), and Oehlert (2000).
A randomization test has H0: the different treatments have no effect.
This null hypothesis is also true if all p pdfs Y|(W = a_i) ~ f_Z(y − μ) are
the same. An impractical randomization test uses all M = n!/(n_1! ··· n_p!)
ways of assigning n_i of the Y_ij to treatment i for i = 1, . . . , p. Let F0 be the usual
F statistic. The F statistic is computed for each of the M permutations and
H0 is rejected if the proportion of the M F statistics that are larger than
F0 is less than δ. The distribution of the M F statistics is approximately
F_{p−1,n−p} for large n when H0 is true. The power of the randomization test is
also similar to that of the usual F test. See Hoeffding (1952). These results
suggest that the usual F test is semiparametric: the pvalue is approximately
correct if n is large and if all p pdfs Y|(W = a_i) ~ f_Z(y − μ) are the same.
Let [x] be the integer part of x, e.g. [7.7] = 7. Olive (2014, Section 9.3)
shows that practical randomization tests that use a random sample of
max(1000, [n log(n)]) permutations have level and power similar to the tests
that use all M possible permutations. See Ernst (2009) and the lregpack
function rand1way for R code.
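A practical randomization F test can be sketched as follows in Python; the data are hypothetical and this is not the book's rand1way R implementation, just an illustration of the idea:

```python
import random

# Practical randomization F test: recompute F on random relabelings of the
# responses and report the proportion of permuted F values >= observed F0.
def f_stat(y, labels, p):
    groups = [[yi for yi, g in zip(y, labels) if g == i] for i in range(p)]
    n = len(y)
    grand = sum(y) / n
    sstr = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    sse = sum(sum((yi - sum(g) / len(g)) ** 2 for yi in g) for g in groups)
    return (sstr / (p - 1)) / (sse / (n - p))

def rand_test(y, labels, p, B=1000, seed=1):
    rng = random.Random(seed)
    f0 = f_stat(y, labels, p)
    perm = labels[:]
    count = 0
    for _ in range(B):
        rng.shuffle(perm)
        if f_stat(y, perm, p) >= f0:
            count += 1
    return f0, count / B        # (observed F0, randomization pvalue)

y = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 9.0, 10.0, 11.0]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
f0, pval = rand_test(y, labels, 3)
```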
All of the parameterizations of the one way fixed effects Anova model
yield the same predicted values, residuals, and ANOVA F test, but the interpretations of the parameters differ. The cell means model is a linear model
(without intercept) of the form Y = X_c β_c + e that can be fit using OLS.
The OLS MLR output gives the correct fitted values and residuals but an
incorrect ANOVA table. An equivalent linear model (with intercept) with
correct OLS MLR ANOVA table as well as residuals and fitted values can
be formed by replacing any column of the cell means model by a column of
ones 1. Removing the last column of the cell means model and making the
first column 1 gives the model Y = β_0 + β_1 x_1 + ··· + β_{p−1} x_{p−1} + e given in
matrix form by (5.5).
It can be shown that the OLS estimators corresponding to (5.5) are β̂_0 =
Ȳ_{p0} = μ̂_p, and β̂_i = Ȳ_{i0} − Ȳ_{p0} = μ̂_i − μ̂_p for i = 1, . . . , p − 1. The cell means
model has β̂_i = μ̂_i = Ȳ_{i0}.
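These identities are easy to verify numerically; a Python sketch with hypothetical group means (the function names are illustrative, not from the text):

```python
# Intercept parameterization (5.5): with the last group as baseline,
# bhat_0 = Ybar_p0 and bhat_i = Ybar_i0 - Ybar_p0 recover the cell means.
def intercept_coefs(group_means):
    base = group_means[-1]
    return [base] + [m - base for m in group_means[:-1]]

def cell_means_from_coefs(coefs):
    b0 = coefs[0]
    return [b0 + bi for bi in coefs[1:]] + [b0]

coefs = intercept_coefs([2.0, 4.0, 7.0])      # hypothetical Ybar_i0
recovered = cell_means_from_coefs(coefs)
```

Both parameterizations give the same fitted values, so the recovered cell means match the original group means exactly.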
Wilcox (2012) gives an excellent discussion of the problems that outliers
and skewness can cause for the one and two sample t intervals, the t test,
tests for comparing 2 groups, and the ANOVA F test. Wilcox (2012) replaces
ordinary population means by truncated population means and uses trimmed
means to create analogs of one way Anova and multiple comparisons.
\begin{pmatrix} Y_{11}\\ \vdots\\ Y_{1,n_1}\\ Y_{21}\\ \vdots\\ Y_{2,n_2}\\ \vdots\\ Y_{p,1}\\ \vdots\\ Y_{p,n_p} \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 & \dots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 1 & 0 & \dots & 0\\
1 & 0 & 1 & \dots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 1 & \dots & 0\\
\vdots & & & & \vdots\\
1 & 0 & 0 & \dots & 1\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 0 & \dots & 1\\
1 & 0 & 0 & \dots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 0 & \dots & 0
\end{pmatrix}
\begin{pmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_{p-1} \end{pmatrix}
+
\begin{pmatrix} e_{11}\\ \vdots\\ e_{1,n_1}\\ e_{21}\\ \vdots\\ e_{2,n_2}\\ \vdots\\ e_{p,1}\\ \vdots\\ e_{p,n_p} \end{pmatrix}
\qquad (5.5)
Graphical Anova uses scaled treatment effects = scaled treatment deviations d_i = c d̂_i = c(Ȳ_{i0} − Ȳ_{00}) for i = 1, . . . , p. Following Box et al.
(2005, p. 166), suppose n_i ≡ m = n/p for i = 1, . . . , p. If Ho: μ_1 =
··· = μ_p is true, want the sample variance of the scaled deviations to
be approximately equal to the sample variance of the residuals. So want

\frac{\frac{1}{p}\sum_{i=1}^p c^2 \hat d_i^2}{\frac{1}{n}\sum_{i=1}^n r_i^2} \approx F_0 = \frac{MSTR}{MSE} = \frac{SSTR/(p-1)}{SSE/(n-p)} = \frac{\sum_{i=1}^p m \hat d_i^2/(p-1)}{\sum_{i=1}^n r_i^2/(n-p)}

since SSTR = \sum_{i=1}^p m(\bar Y_{i0} - \bar Y_{00})^2 = \sum_{i=1}^p m \hat d_i^2. So

F_0 \approx \frac{\frac{n}{p}\sum_{i=1}^p c^2 \hat d_i^2}{\sum_{i=1}^n r_i^2} = \frac{\frac{m(n-p)}{p-1}\sum_{i=1}^p \hat d_i^2}{\sum_{i=1}^n r_i^2}.

Equating numerators gives

c^2 = \frac{mp}{n}\,\frac{(n-p)}{(p-1)} = \frac{(n-p)}{(p-1)}

since mp/n = 1. Thus c = \sqrt{(n-p)/(p-1)}.
For Graphical Anova, see Box et al. (2005, pp. 136, 150, 164, 166) and
Hoaglin et al. (1991). The R package granova, available from
(http://streaming.stat.iastate.edu/CRAN/), and authored by R.M.
Pruzek and J.E. Helmreich, may be useful.
The modified power transformation family

Y_i^{(λ)} = t_λ(Z_i) \equiv \frac{Z_i^λ - 1}{λ}

for λ ≠ 0 and t_0(Z_i) = log(Z_i) for λ = 0 where λ ∈ Λ_L.
Box and Cox (1964) give a numerical method for selecting the response
transformation for the modified power transformations. Although the method
gives a point estimator λ̂_o, often an interval of reasonable values is generated (either graphically or using a profile likelihood to make a confidence
interval), and λ ∈ Λ_L is used if it is also in the interval.
There are several reasons to use a coarse grid Λ_L of powers. First, several of
the powers correspond to simple transformations such as the log, square root,
and reciprocal. These powers are easier to interpret than λ = 0.28, for example. Secondly, if the estimator λ̂_n can only take values in Λ_L, then sometimes
λ̂_n will converge in probability to λ_o ∈ Λ_L. Thirdly, Tukey (1957) showed
that neighboring modified power transformations are often very similar, so
restricting the possible powers to a coarse grid is reasonable.
The graphical method for response transformations is due to Olive (2004b).
A variant of the method would plot the residual plot or both the response
and the residual plot for each of the five values of λ. Residual plots are also
useful, but they do not distinguish between nonlinear monotone relationships
and nonmonotone relationships. See Fox (1991, p. 55). Alternative methods
are given by Cook and Olive (2001) and Box et al. (2005, p. 321).
An alternative to one way Anova is to use FWLS (see Chapter 4) on the
cell means model with σ²V = diag(σ_1², . . . , σ_p²) where σ_i² occurs n_i times
on the diagonal and σ_i² is the variance of the ith group for i = 1, . . . , p.
Then V̂ = diag(S_1², . . . , S_p²) where S_i² = \frac{1}{n_i-1}\sum_{j=1}^{n_i}(Y_{ij} - \bar Y_{i0})^2 is the sample
variance of the Y_ij. Hence the estimated weights for FWLS are ŵ_ij ≡ ŵ_i =
1/S_i². Then the FWLS cell means model has Y = X_c β_c + e as in (5.1) except
Cov(e) = diag(σ_1², . . . , σ_p²).
Hence Ẑ = Û_c β_c + ε̂. Then Û_c^T Û_c = diag(n_1ŵ_1, . . . , n_pŵ_p), (Û_c^T Û_c)^{−1} =
diag(S_1²/n_1, . . . , S_p²/n_p) = (X_c^T V̂^{−1} X_c)^{−1}, and Û_c^T Ẑ = (n_1ŵ_1Ȳ_{10}, . . . , n_pŵ_pȲ_{p0})^T.
Thus from Chapter 4,

β̂_{FWLS} = (Ȳ_{10}, . . . , Ȳ_{p0})^T = β̂_c.

That is, the FWLS estimator equals the one way Anova estimator of β based
on OLS applied to the cell means model. The ANOVA F test generalizes
the pooled t test in that the two tests are equivalent for p = 2. The FWLS
procedure is also known as the Welch one way Anova and generalizes the
Welch t test. The Welch t test is thought to be much better than the pooled
t test if n_1 ≠ n_2 and σ_1² ≠ σ_2². See Brown and Forsythe (1974a,b), Kirk
(1982, pp. 100, 101, 121, 122), Olive (2014, pp. 278–279), Welch (1947, 1951), and
Problem 5.11.
In matrix form Ẑ = Û_c β_c + ε̂ becomes

\begin{pmatrix}
\sqrt{\hat w_1}\,Y_{11}\\ \vdots\\ \sqrt{\hat w_1}\,Y_{1,n_1}\\
\sqrt{\hat w_2}\,Y_{21}\\ \vdots\\ \sqrt{\hat w_2}\,Y_{2,n_2}\\
\vdots\\ \sqrt{\hat w_p}\,Y_{p,1}\\ \vdots\\ \sqrt{\hat w_p}\,Y_{p,n_p}
\end{pmatrix}
=
\begin{pmatrix}
\sqrt{\hat w_1} & 0 & 0 & \dots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
\sqrt{\hat w_1} & 0 & 0 & \dots & 0\\
0 & \sqrt{\hat w_2} & 0 & \dots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
0 & \sqrt{\hat w_2} & 0 & \dots & 0\\
\vdots & & & & \vdots\\
0 & 0 & 0 & \dots & \sqrt{\hat w_p}\\
\vdots & \vdots & \vdots & & \vdots\\
0 & 0 & 0 & \dots & \sqrt{\hat w_p}
\end{pmatrix}
\begin{pmatrix} \mu_1\\ \mu_2\\ \vdots\\ \mu_p \end{pmatrix}
+
\begin{pmatrix} \epsilon_{11}\\ \vdots\\ \epsilon_{1,n_1}\\ \epsilon_{21}\\ \vdots\\ \epsilon_{2,n_2}\\ \vdots\\ \epsilon_{p,1}\\ \vdots\\ \epsilon_{p,n_p} \end{pmatrix}
\qquad (5.6)
Four tests for Ho: μ_1 = ··· = μ_p can be used if Rule of Thumb 5.3:
max(S_1, . . . , S_p) ≤ 2 min(S_1, . . . , S_p) fails. Let Y = (Y_1, . . . , Y_n)^T, and let
Y_{(1)} ≤ Y_{(2)} ≤ ··· ≤ Y_{(n)} be the order statistics. Then the rank transformation
of the response is Z = rank(Y) where Z_i = j if Y_i = Y_{(j)} is the jth order
statistic. For example, if Y = (7.7, 4.9, 33.3, 6.6)^T, then Z = (3, 1, 4, 2)^T.
The first test performs the one way ANOVA F test with Z replacing Y. See
Montgomery (1984, pp. 117–118). Two of the next three tests are described
in Brown and Forsythe (1974b). Let ⌈x⌉ be the smallest integer ≥ x, e.g.
⌈7.7⌉ = 8. Then the Welch (1951) ANOVA F test uses test statistic
F_W = \frac{\sum_{i=1}^p w_i(\bar Y_{i0} - \tilde Y_{00})^2/(p-1)}{1 + \frac{2(p-2)}{p^2-1}\sum_{i=1}^p (1 - w_i/u)^2/(n_i - 1)}

where w_i = n_i/S_i², u = \sum_{i=1}^p w_i and \tilde Y_{00} = \sum_{i=1}^p w_i \bar Y_{i0}/u. Then the test
statistic is compared to an F_{p−1,d_W} distribution where d_W = ⌈f⌉ and

1/f = \frac{3}{p^2-1}\sum_{i=1}^p \left(1 - \frac{w_i}{u}\right)^2/(n_i - 1).
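The Welch statistic F_W and its degrees of freedom d_W depend only on the group sizes, means, and variances; a Python sketch (the summary values below are hypothetical, not tied to any data set in the text):

```python
import math

# Welch (1951) one way Anova F statistic and its denominator df d_W,
# computed from group summaries (sample sizes, means, variances).
def welch_f(ns, means, variances):
    p = len(ns)
    w = [n / s2 for n, s2 in zip(ns, variances)]      # w_i = n_i / S_i^2
    u = sum(w)
    ytilde = sum(wi * m for wi, m in zip(w, means)) / u
    num = sum(wi * (m - ytilde) ** 2 for wi, m in zip(w, means)) / (p - 1)
    q = sum((1 - wi / u) ** 2 / (n - 1) for wi, n in zip(w, ns))
    fw = num / (1 + 2 * (p - 2) * q / (p ** 2 - 1))
    f = (p ** 2 - 1) / (3 * q)                        # since 1/f = 3q/(p^2-1)
    return fw, math.ceil(f)                           # d_W = ceil(f)

fw, dw = welch_f([10, 12, 15], [5.0, 6.0, 9.0], [1.0, 2.0, 4.0])
```

The pvalue would come from the F_{p−1,d_W} distribution, which is outside the standard library and so is omitted here.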
For the modified Welch (1947) test, the test statistic is compared to an
F_{p−1,d_{MW}} distribution where d_{MW} = ⌈f⌉ and

f = \frac{\left[\sum_{i=1}^p (S_i^2/n_i)\right]^2}{\sum_{i=1}^p \frac{1}{n_i-1}(S_i^2/n_i)^2} = \frac{\left[\sum_{i=1}^p (1/w_i)\right]^2}{\sum_{i=1}^p \frac{1}{n_i-1}(1/w_i)^2}.

Some software uses f instead of d_W or d_{MW}, and variants on the denominator
degrees of freedom d_W or d_{MW} are common.
The modified ANOVA F test uses test statistic

F_M = \frac{\sum_{i=1}^p n_i(\bar Y_{i0} - \bar Y_{00})^2}{\sum_{i=1}^p \left(1 - \frac{n_i}{n}\right) S_i^2}.

The test statistic is compared to an F_{p−1,d_M} distribution where d_M = ⌈f⌉
and

1/f = \sum_{i=1}^p c_i^2/(n_i - 1)

where

c_i = \left(1 - \frac{n_i}{n}\right) S_i^2 \Big/ \sum_{i=1}^p \left(1 - \frac{n_i}{n}\right) S_i^2.
The lregpack function anovasim can be used to simulate and compare
the four tests with the usual one way ANOVA test. Some simulation results
are in Haenggi (2009).
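The rank transformation used by the first of the four tests can be sketched in Python; this minimal version assumes no ties and reproduces the example given earlier in this section:

```python
# Rank transformation: Z_i = j if Y_i is the jth order statistic.
# Assumes no ties among the Y_i (tied values would need average ranks).
def rank_transform(y):
    order = sorted(range(len(y)), key=lambda i: y[i])
    z = [0] * len(y)
    for j, i in enumerate(order, start=1):
        z[i] = j
    return z

z = rank_transform([7.7, 4.9, 33.3, 6.6])     # example from the text
```

The first test then applies the usual one way ANOVA F test with Z in place of Y.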
5.7 Problems

Problems with an asterisk * are especially important.
Output for Problem 5.1.
A B C D E
9.8 9.8 8.5 7.9 7.6
10.3 12.3 9.6 6.9 10.6
13.6 11.1 9.5 6.6 5.6
10.5 10.6 7.4 7.6 10.1
8.6 11.6 7.6 8.9 10.5
11.1 10.9 9.9 9.1 8.6
Analysis of Variance for Time
Source DF SS MS F P
Design 4 44.88 11.22 5.82 0.002
Error 25 48.17 1.93
Total 29 93.05
5.1. In a psychology experiment on child development, the goal was to
study how different designs of mobiles vary in their ability to capture the
infant's attention. Thirty 3-month-old infants were randomly divided into
five groups of six each. Each group was shown a mobile with one of five
designs A, B, C, D, or E. The time that each infant spent looking at the
design was recorded in the output above along with the ANOVA table. Data
is taken from McKenzie and Goldman (1999, p. 234). See the above output.
a) Find μ̂_2 = μ̂_B.
b) Perform a 4 step ANOVA F test.
5.2. Moore (2007, p. 651): Nematodes are microscopic worms. A botanist
desired to learn how the presence of the nematodes affects tomato growth.
She used 16 pots each with a tomato seedling. Four pots got 0 nematodes, four
got 1000, four got 5000, and four got 10000. These four groups are denoted by
none, n1000, n5000, and n10000, respectively. The seedling growths
were all recorded and the table below gives the one way ANOVA results.
a) What is μ̂_none?
b) Do a four step test for whether the four mean growths are equal.
(So Ho: μ_none = μ_n1000 = μ_n5000 = μ_n10000.)
c) Examine the Bonferroni comparison of means. Which groups of means
are not significantly different?

Output for Problem 5.2.


Variable MEAN SAMPLE SIZE GROUP STD DEV
NONE 10.650 4 2.0535
N1000 10.425 4 1.4863
N5000 5.600 4 1.2437
N10000 5.450 4 1.7711
TOTAL 8.0312 16 1.6666
One Way Analysis of Variance Table
Source     df   SS      MS     F     p-value
Treatments  3  100.647 33.549 12.08  0.0006
Error      12   33.328  2.777
Total      15  133.974
Bonferroni Comparison of Means
Homogeneous
Variable Mean Groups
----------------------------
NONE 10.650 I
N1000 10.425 I
N5000 5.600 .. I
N10000 5.450 .. I
5.3. According to Cobb (1998, p. 9), when the famous statistician W. G.
Cochran was starting his career, the experiment was to study rat nutrition
with two diets: ordinary rat food and rat food with a supplement. It was
thought that the diet with the supplement would be better. Cochran and his
coworker grabbed rats out of a cage, one at a time, and Cochran assigned
the smaller less healthy rats to the better diet because he felt sorry for them.
The results were as expected for the rats chosen by Cochran's coworker, but
the better diet looked bad for Cochran's rats.
a) What were the units?
b) Suppose rats were taken from the cage one at a time. How should the
rats have been assigned to the two diets?
5.4. Use the output from the command below
> sample(11)
[1] 7 10 9 8 1 6 3 11 2 4 5
to assign the following 11 people to three groups of size n1 = n2 = 4 and
n3 = 3.
Anver, Arachchi, Field, Haenggi, Hazaimeh,
Liu, Pant, Tosun, Yi, Zhang, Zhou
5.5. Sketch a good response plot if there are 4 levels with Ȳ_{10} = 2,
Ȳ_{20} = 4, Ȳ_{30} = 6, Ȳ_{40} = 7, and n_i = 5.
output for Problem 5.6

level:      1          2          3          4          5
       15 percent 20 percent 25 percent 30 percent 35 percent
y1   y5   y2   y3   y4
9.8  10.8 15.4 17.6 21.6
5.6. The tensile strength of a cotton nylon fiber used to make women's
shirts is believed to be affected by the percentage of cotton in the fiber. The
5 levels of cotton percentage that are of interest are tabled above. Also shown
is a (Tukey pairwise) comparison of means. Which groups of means are not
significantly different? Data is from Montgomery (1984, pp. 51, 66).
output for Problem 5.7
Source df SS MS F P
color 2 7.60 3.80 0.390 0.684
error 12 116.40 9.70
5.7. A researcher is interested in whether the color (red, blue, or green)
of a paper maze affects the time to complete the maze.
a) State whether this is a random or fixed effects one way Anova. Explain
briefly.
b) Using the above output, perform the appropriate 4 step ANOVA F test.
Output for Problem 5.8.

 A    B    C
9.5 8.5 7.7
3.2 9.0 11.3
4.7 7.9 9.7
7.5 5.0 11.5
8.3 3.2 12.4 Analysis of Variance for Time
Source DF SS MS F P
Design 2 49.168 24.584 4.4625 0.0356
Error 12 66.108 5.509
5.7 Problems 205

5.8. Ledolter and Swersey (2007, p. 49) describe a one way Anova design
used to study the effectiveness of 3 product displays (A, B, and C). Fifteen
stores were used and each display was randomly assigned to 5 stores. The
response Y was the sales volume for the week during which the display was
present compared to the base sales for that store.
a) Find μ̂_2 = μ̂_B using the output above.
b) Perform a 4 step ANOVA F test.
Fig. 5.5 Graphical Anova for Problem 5.9 (dot plot of the scaled treatment
deviations above a dot plot of the residuals)
ganova(x,y)
smn -3.2333 -3.0374 6.2710
Treatments "A" "B" "C"
5.9. Ledolter and Swersey (2007, p. 49) describe a one way Anova design
used to study the effectiveness of 3 product displays (A, B, and C). Fifteen
stores were used and each display was randomly assigned to 5 stores. The
response Y was the sales volume for the week during which the display was
present compared to the base sales for that store. Figure 5.5 is the Graphical
Anova plot found using the lregpack function ganova.
a) Which two displays (from A, B, and C) yielded similar mean sales
volume?
b) Which effect (from A, B, and C) appears to be significant?
Source df SS MS F P
treatment 3 89.19 29.73 15.68 0.0002
error 12 22.75 1.90
5.10. A textile factory weaves fabric on a large number of looms. They
would like to obtain a fabric of uniform strength. Four looms are selected
at random and four samples of fabric are obtained from each loom. The
strength of each fabric sample is measured. Data is from Montgomery (1984,
pp. 74–75).
a) State whether this is a random or fixed effects one way Anova. Explain
briefly.
b) Using the output above, perform the appropriate 4 step ANOVA F test.
Problems using R.

Use the command source("G:/lregpack.txt") to download the functions
and the command source("G:/lregdata.txt") to download the data.
See Preface or Section 14.1. Typing the name of the lregpack function,
e.g. pcisim, will display the code for the function. Use the args command,
e.g. args(pcisim), to display the needed arguments for the function. For some
of the following problems, the R commands can be copied and pasted from
(http://lagrange.math.siu.edu/Olive/lreghw.txt) into R.
5.11. The pooled t procedures are a special case of one way Anova with
p = 2. Consider the pooled t CI for μ_1 − μ_2. Let X_1, . . . , X_{n_1} be iid with
mean μ_1 and variance σ_1². Let Y_1, . . . , Y_{n_2} be iid with mean μ_2 and variance
σ_2². Assume that the two samples are independent (or that n_1 + n_2 units
were randomly assigned to two groups) and that n_i → ∞ for i = 1, 2 in
such a way that ρ̂ = n_1/(n_1 + n_2) → ρ ∈ (0, 1). Let θ = σ_2²/σ_1², let the pooled
sample variance S_p² = [(n_1 − 1)S_1² + (n_2 − 1)S_2²]/[n_1 + n_2 − 2], and let
τ² = (1 − ρ + ρθ)/[ρ + (1 − ρ)θ]. Then

\frac{\bar X - \bar Y - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \xrightarrow{D} N(0, 1) \quad\text{and}

\frac{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\; \frac{\bar X - \bar Y - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} = \frac{\bar X - \bar Y - (\mu_1 - \mu_2)}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \xrightarrow{D} N(0, \tau^2).

Now let θ̂ = S_2²/S_1² and τ̂² = (1 − ρ̂ + ρ̂θ̂)/(ρ̂ + (1 − ρ̂)θ̂). Notice that
τ = 1 if ρ = 1/2, and τ = 1 if θ = 1. The usual large sample (1 − δ)100%
pooled t CI for (μ_1 − μ_2),

\bar X - \bar Y \pm t_{n_1+n_2-2,1-\delta/2}\, S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \qquad (5.7)

is valid if θ = 1. The large sample (1 − δ)100% modified pooled t CI for
(μ_1 − μ_2) is

\bar X - \bar Y \pm t_{n_1+n_2-4,1-\delta/2}\, S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}. \qquad (5.8)
The large sample (1 − δ)100% Welch CI for (μ_1 − μ_2) is

\bar X - \bar Y \pm t_{d,1-\delta/2} \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \qquad (5.9)

where d = max(1, [d_0]), and

d_0 = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{1}{n_1-1}\left(\frac{S_1^2}{n_1}\right)^2 + \frac{1}{n_2-1}\left(\frac{S_2^2}{n_2}\right)^2}.
Suppose n_1/(n_1 + n_2) → ρ. It can be shown that if the CI length is multiplied by \sqrt{n_1}, then the scaled length of the pooled t CI converges in probability to

2 z_{1-\delta/2} \sqrt{\frac{\rho}{1-\rho}\,\sigma_1^2 + \sigma_2^2}

while the scaled lengths of the modified pooled t CI and Welch CI both
converge in probability to

2 z_{1-\delta/2} \sqrt{\sigma_1^2 + \frac{\rho}{1-\rho}\,\sigma_2^2}.

The pooled t CI should have coverage that is too low if

\frac{\rho}{1-\rho}\,\sigma_1^2 + \sigma_2^2 < \sigma_1^2 + \frac{\rho}{1-\rho}\,\sigma_2^2.

See Olive (2014, Example 9.23).
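The Welch degrees of freedom d = max(1, [d_0]) above can be computed from the two-sample summaries; a Python sketch (the input values are hypothetical, chosen to resemble the pcisim call in part b):

```python
import math

# Welch CI degrees of freedom d = max(1, [d0]) from two-sample summaries,
# where [x] is the integer part of x.
def welch_df(n1, s1sq, n2, s2sq):
    v1, v2 = s1sq / n1, s2sq / n2
    d0 = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return max(1, math.floor(d0))

d = welch_df(100, 10.0, 200, 1.0)
```

When the two variance-to-size ratios are equal, d_0 reduces to 2(n − 1) for common sample size n, and d_0 is always at least min(n_1, n_2) − 1.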
a) Download the function pcisim.
b) Type the command pcisim(n1=100,n2=200,var1=10,var2=1) to
simulate the CIs for N(μ_i, σ_i²) data for i = 1, 2. The terms pcov, mpcov, and
wcov are the simulated coverages for the pooled, modified pooled, and Welch
95% CIs. Record these quantities. Are they near 0.95?
5.12. From the end of Section 5.6, four tests for Ho: μ_1 = ··· = μ_p can
be used if Rule of Thumb: max(S_1, . . . , S_p) ≤ 2 min(S_1, . . . , S_p) fails. In R,
get the function anovasim. When H0 is true, the coverage = proportion of
times the test rejects H0 has a nominal value of 0.05. The term faovcov is
for the usual F test, modfcov is for a modified F test, wfcov is for the Welch
test, mwfcov for the modified Welch test, and rfcov for the rank test. The
function generates 1000 data sets with p = 4, n_i ≡ 20, m_i = μ_i and
sd_i = σ_i.
a) Get the coverages for the following command. Since the four population
means and the four population standard deviations are equal, we want the
coverages to be near or less than 0.05. Are they? anovasim(m1 = 0, m2 = 0,
m3 = 0, m4 = 0, sd1 = 1, sd2 = 1, sd3 = 1, sd4 = 1)
b) Get the coverages for the following command. The population means
are equal, but the population standard deviations are not. Are the coverages
near or less than 0.05? anovasim(m1 = 0, m2 = 0, m3 = 0, m4 = 0, sd1 =
1, sd2 = 2, sd3 = 3, sd4 = 4)
c) Now use the following command where H0 is false: the four population
means are not all equal. We want the coverages near 1. Are they?
anovasim(m1 = 1, m2 = 0, m3 = 0, m4 = 1)
d) Now use the following command where H0 is false. We want the coverages near 1. Since the σ_i are not equal, the ANOVA F test is expected to
perform poorly. Is the ANOVA F test the best? anovasim(m4 = 1, sd4 = 9)
5.13. This problem uses data from Kuehl (1994, p. 128).
a) Get lregdata and lregpack into R. Type the following commands. Then
simultaneously press the Ctrl and c keys. In Word use the menu command
"Paste". Print out the figure.

y <- ycrab+1/6
aovtplt(crabhab,y)

b) From the figure, what response transformation should be used: Y =
1/Z, Y = 1/√Z, Y = log(Z), Y = √Z, or Y = Z?

5.14. The following data set considers the number of warp breaks per
loom, where the factor is tension (low, medium, or high).
a) Copy and paste the commands for this problem into R.
Highlight the ANOVA table by pressing the left mouse key and dragging
the cursor over the ANOVA table. Then use the menu commands Edit>
Copy. Enter Word and use the menu command Paste. b) To place the
residual plot in Word, get into R and click on the plot, hit the Ctrl and c
keys at the same time. Enter Word and use the menu command Paste or
hit the Ctrl and v keys at the same time.
c) Copy and paste the commands for this part into R.
Click on the response plot, hit the Ctrl and c keys at the same time. Enter
Word and use the menu command Paste.
5.15. Obtain the Box et al. (2005, p. 134) blood coagulation data from
lregdata and the R program ganova from lregpack. The program does graphical Anova for the one way Anova model.
a) Enter the following command and include the plot in Word by simultaneously pressing the Ctrl and c keys, then using the menu command "Paste"
in Word, or hit the Ctrl and v keys at the same time.

ganova(bloodx,bloody)

The scaled treatment deviations are on the top of the plot. As a rule
of thumb, if all of the scaled treatment deviations are within the spread of
the residuals, then population treatment means are not significantly different
(they all give response near the grand mean). If some deviations are outside
of the spread of the residuals, then not all of the population treatment means
are equal. Box et al. (2005, p. 137) state "The graphical analysis discourages
overreaction to high significance levels and avoids underreaction to 'very
nearly' significant differences."
b) From the output, which two treatment means were approximately the
same?
c) To perform a randomization F test in R, get the program rand1way
from lregpack, and type the following commands. The output z$rdist is the
randomization distribution, z$Fpval is the pvalue of the usual F test, and
z$randpval is the pvalue of the randomized F test.

z<-rand1way(y=bloody,group=bloodx,B=1000)
hist(z$rdist)
z$Fpval
z$randpval

d) Include the histogram in Word.
One Way Anova in SAS

To get into SAS, often you click on a SAS icon, perhaps something like
"The SAS System for ...". A window with a split screen will open. The top
screen says "Log-(Untitled)" while the bottom screen says "Editor-Untitled1".
Press the spacebar and an asterisk appears: "Editor-Untitled1*".
For Problem 5.16, consider saving your file as hw5d16.sas on your flash
drive (J, say). (On the top menu of the editor, use the commands "File > Save
as". A window will appear. Use the upper right arrow to locate "Removable
Disk (J:)", and then type the file name in the bottom box. Click on OK.) From
the top menu in SAS, use the "File > Open" command. A window will open.
Use the arrow in the NE corner of the window to navigate to "Removable
Disk (J:)". (As you click on the arrow, you should see My Documents, C: etc.,
then Removable Disk (J:).) Double click on hw5d16.sas.
This point explains the SAS commands. The semicolon ";" is used to
end SAS commands and the "options ls = 70;" command makes the output
readable. (An * can be used to insert comments into the SAS program.
Try putting an * before the options command and see what it does to the
output.) The next step is to get the data into SAS. The command "data
clover;" gives the name "clover" to the data set. The command
"input strain $ nitrogen @@;" says the first entry is variable strain and
the $ means it is categorical, the second variable is nitrogen and the @@
means read 2 variables, then 2, . . . , until the end of the data. The command
"cards;" means that the data is entered below. Then the data is entered and
the isolated semicolon indicates that the last case has been entered.
The commands "proc glm; class strain; model nitrogen = strain;" tell
SAS to perform one way Anova with nitrogen as the response variable and
strain as the factor.
5.16. Cut and paste the SAS program for this problem into the SAS
Editor.
To execute the program, use the top menu commands "Run>Submit". An
output window will appear if successful.
(If you were not successful, look at the log window for hints on errors.
A single typo can cause failure. Reopen your file in Word or Notepad and
make corrections. Occasionally you cannot find your error. Then find your
instructor or wait a few hours and reenter the program.)
Data is from SAS Institute (1985, pp. 126–129). See Example 5.6.
a) In SAS, use the menu commands "Edit>Select All" then "Edit>Copy".
In Word, use the menu command "Paste". Highlight the first page of output
and use the menu command "Cut". (SAS often creates too much output.
These commands reduce the output from 4 pages to 3 pages.)
You may want to save your SAS output as the file HW5d16.doc on your
flash drive.
b) Perform the 4 step test for Ho: μ_1 = μ_2 = ··· = μ_6.
c) From the residual and response plots, does the assumption of equal
population standard deviations (σ_i ≡ σ for i = 1, . . . , 6) seem reasonable?
One Way Anova in ARC

5.17. To get in ARC, you need to find the ARC icon. Suppose the ARC icon
is in a math progs folder. Move the cursor to the math progs folder, click the
right mouse button twice, move the cursor to ARC, double click, move the
cursor to ARC, double click. These menu commands will be written "math
progs > ARC > ARC". To quit ARC, move cursor to the x in the northeast
corner and click.
This Cook and Weisberg (1999a, p. 289) data set contains IQ scores on
27 pairs of identical twins, one raised by foster parents (IQf) and the other
by biological parents (IQb). C gives the social class of the biological parents:
C = 1 for upper class, 2 for middle class, and 3 for lower class. Hence the
Anova test is for whether mean IQ depends on class.
a) Activate the twins.lsp dataset with the menu commands
"File > Load > Data > twins.lsp".
b) Use the menu commands "Twins>Make factors", select C and click on
OK. The line "{F}C Factor 27 Factor - first level dropped" should appear on
the screen.
c) Use the menu commands "Twins>Description" to see a description of
the data.
d) Enter the menu commands "Graph&Fit>Fit linear LS" and select {F}C
as the term and IQb as the response. Highlight the output by pressing the
left mouse key and dragging the cursor over the output. Then use the menu
commands "Edit>Copy". Enter Word and use the menu command "Paste".
e) Enter the menu commands "Graph&Fit>Boxplot of" and enter IQb in
the selection box and C in the "Condition on" box. Click on OK. When the
boxplots appear, click on the "Show Anova" box. Click on the plot, hit the
Ctrl and c keys at the same time. Enter Word and use the menu command
"Paste". Include the output in Word. Notice that the regression and Anova
F statistic and p-value are the same.
f) Residual plot: Enter the menu commands "Graph&Fit>Plot of", select
L1:Fit-Values for the H box and L1:Residuals for the V box, and
click on OK. Click on the plot, hit the Ctrl and c keys at the same time.
Enter Word and use the menu command "Paste".
g) Response plot: Enter the menu commands "Graph&Fit>Plot of", select
L1:Fit-Values for the H box and IQb for the V box, and click on
OK. When the plot appears, move the OLS slider bar to 1 to add the
identity line. Click on the plot, hit the Ctrl and c keys at the same time.
Enter Word and use the menu command "Paste".
h) Perform the 4 step test for Ho: μ_1 = μ_2 = μ_3.
One Way Anova in Minitab
5.18. a) In Minitab, use the menu command File>Open Worksheet
and double click on Baby.mtw. A window will appear. Click on OK.

This McKenzie and Goldman (1999, p. T-234) data set has 30 three-month-
old infants randomized into ve groups of 6 each. Each infant is shown a
mobile of one of ve multicolored designs, and the goal of the study is to see
if the infant attention span varies with type of design of mobile. The times
that each infant spent watching the mobile are recorded.
b) Choose Stat>Basic Statistics>Display Descriptive Statistics, select
C1 Time as the Variable, click the By variable option and press Tab.
Select C2 Design as the By variable. c) From the window in b), click on
Graphs the Boxplots of data option, and OK twice. Click on the plot
and then click on the printer icon to get a plot of the boxplots.
d) Select Stat>ANOVA>One-way, select C1-time as the response and
C2-Design as the factor. Click on Store residuals and click on Store fits.
Then click on OK. Click on the output and then click on the printer icon.
e) To make a residual plot, select Graph>Plot. Select Resi1 for Y
and Fits1 for X and click on OK. Click on the plot and then click on
the printer icon to get the residual plot.
f) To make a response plot, select Graph>Plot. Select C1 Time for
Y and Fits1 for X and click on OK. Click on the plot and then click
on the printer icon to get the response plot.
g) Do the 4 step test for Ho: μ1 = μ2 = · · · = μ5 .
To get out of Minitab, move your cursor to the x in the NE corner of
the screen. When asked whether to save changes, click on no.
Chapter 6
The K Way Anova Model

For a K way Anova model, A1 , . . . , AK are the factors with li levels for
i = 1, . . . , K. Hence there are l1 l2 · · · lK treatments where each treatment
uses exactly one level from each factor. First the two way Anova model is
discussed and then the model with K > 2. Interactions between the K factors
are important.

6.1 Two Way Anova

Definition 6.1. The fixed effects two way Anova model has two factors
A and B plus a response Y . Factor A has a levels and factor B has b levels.
There are ab treatments.

Definition 6.2. The cell means model for two way Anova is Yijk =
μij + eijk where i = 1, . . . , a; j = 1, . . . , b; and k = 1, . . . , m. The sample size
n = abm. The μij are constants and the eijk are iid from a location family
with mean 0 and variance σ². Hence the Yijk ∼ f (y − μij ) come from a location
family with location parameter μij . The fitted values are Ŷijk = Ȳij0 = μ̂ij
while the residuals rijk = Yijk − Ŷijk .

For one way Anova models, the cell sizes ni need not be equal. For K way
Anova models with K ≥ 2 factors, the statistical theory is greatly simplified
if all of the cell sizes are equal. Such designs are called balanced designs.

Definition 6.3. A balanced design has all of the cell sizes equal: for the
two way Anova model, nij ≡ m.

In addition to randomization of units to treatments, another key principle
of experimental design is factorial crossing. Factorial crossing allows for
estimation of main effects and interactions.


Definition 6.4. A two way Anova design uses factorial crossing if each
combination of an A level and a B level is used and called a treatment. There
are ab treatments for the two way Anova model.

Experimental two way Anova designs randomly assign m of the n = mab


units to each of the ab treatments. Observational studies take random samples
of size m from ab populations.

Definition 6.5. The main effects are A and B. The AB interaction is
not a main effect.

Remark 6.1. If A and B are factors, then there are 5 possible models.
i) The two way Anova model has terms A, B, and AB.
ii) The additive model or main effects model has terms A and B.
iii) The one way Anova model that uses factor A.
iv) The one way Anova model that uses factor B.
v) The null model does not use any of the three terms A, B, or AB. If the
null model holds, then Yijk ∼ f (y − μ00 ), so the Yijk form a random sample of
size n from a location family, and the distribution of the response is the same
for all ab treatments. For models i)–iv), the distribution of the response is
not the same for all ab treatments.

Remark 6.2. The response plot, residual plot, and transformation plots
for response transformations are used in the same way as in Chapter 5. The
plots work best if the MSE degrees of freedom ≥ max(10, n/5). The model
is overfitting if 1 ≤ MSE df < max(10, n/5), and then the plots may only
be useful for detecting large deviations from the model. For the model that
contains A, B, and AB, there will be ab dot plots of size m, and we need
m ≥ 5 to check for similar shape and spread. For the additive model, the
response and residual plots often look like those for multiple linear regression.
Then the plotted points should scatter about the identity line or r = 0 line
in a roughly evenly populated band if the additive two way Anova model is
reasonable. We want n ≥ 5(number of parameters in the model) for inference.
So we want n ≥ 5ab or m ≥ 5 when all interactions and main effects are in
the two way Anova model.

Shown is an ANOVA table for the two way Anova model given in symbols.
Sometimes "Error" is replaced by "Residual" or "Within Groups." A and
B are the main effects while AB is the interaction. Sometimes "p-value" is
replaced by "P", "Pr(> F )", or "PR > F". The p-value corresponding to
FA is for Ho: μ10 = · · · = μa0 . The p-value corresponding to FB is for Ho:
μ01 = · · · = μ0b . The p-value corresponding to FAB is for Ho: there is no
interaction. The sample p-value pval is an estimator of the population
p-value.

Source  df                  SS    MS    F               p-value
A       a − 1               SSA   MSA   FA = MSA/MSE    pval
B       b − 1               SSB   MSB   FB = MSB/MSE    pval
AB      (a − 1)(b − 1)      SSAB  MSAB  FAB = MSAB/MSE  pval
Error   n − ab = ab(m − 1)  SSE   MSE
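For a balanced design, every entry in this table can be computed directly from the cell, row, column, and grand means. The following sketch (in Python with made-up data; the text's software sections use R, SAS, and Minitab) builds the sums of squares and F ratios:

```python
# Hypothetical balanced two way layout: a = 2, b = 3, m = 2 replicates.
# y[i][j] holds the m responses for level i of A and level j of B.
y = [[[10.0, 12.0], [14.0, 16.0], [18.0, 20.0]],
     [[11.0, 13.0], [17.0, 19.0], [23.0, 25.0]]]

a, b, m = len(y), len(y[0]), len(y[0][0])
n = a * b * m

cell = [[sum(y[i][j]) / m for j in range(b)] for i in range(a)]   # Ybar_ij0
rowm = [sum(cell[i]) / b for i in range(a)]                       # Ybar_i00
colm = [sum(cell[i][j] for i in range(a)) / a for j in range(b)]  # Ybar_0j0
grand = sum(rowm) / a                                             # Ybar_000

SSA = b * m * sum((ri - grand) ** 2 for ri in rowm)
SSB = a * m * sum((cj - grand) ** 2 for cj in colm)
SSAB = m * sum((cell[i][j] - rowm[i] - colm[j] + grand) ** 2
               for i in range(a) for j in range(b))
SSE = sum((obs - cell[i][j]) ** 2
          for i in range(a) for j in range(b) for obs in y[i][j])

dfA, dfB, dfAB, dfE = a - 1, b - 1, (a - 1) * (b - 1), n - a * b
MSE = SSE / dfE
FA = (SSA / dfA) / MSE
FB = (SSB / dfB) / MSE
FAB = (SSAB / dfAB) / MSE
print(SSA, SSB, SSAB, SSE, FA, FB, FAB)
```

For this small layout SSA = 27, SSB = 200, SSAB = 8, and SSE = 12, and the four pieces add up to the total sum of squares 247.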

Be able to perform the 4 step test for AB interaction:
i) Ho: no interaction HA : there is an interaction
ii) FAB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that there is an interaction between A
and B, otherwise fail to reject Ho and conclude that there is no interaction
between A and B. (Or there is not enough evidence to conclude that there is
an interaction between A and B.)

Remark 6.3. i) Keep A and B in the model if there is an AB interaction.
The two tests for main effects (below) make the most sense if we fail to
reject the test for interaction. Rejecting Ho for main effects makes sense
when there is an AB interaction because the main effects tend to be larger
than the interaction effects. (Failing to reject Ho for main effects when there
is an AB interaction may not make sense.)
ii) The main effects tests are just like the F test for the fixed effects one
way Anova model. If population means are close, then larger sample sizes are
needed for the F test to reject Ho with high probability. If Ho is not rejected
and the means are equal, then it is possible that the factor is unimportant,
but it is also possible that the factor is important but the levels used give
similar mean responses. For example, factor A might be type of catalyst. The yield may be
equally good for each type of catalyst, but there would be no yield if no
catalyst was used.

Be able to perform the 4 step test for A main effects:
i) Ho: μ10 = · · · = μa0 HA : not Ho
ii) FA is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of A, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of A. (Or there is not enough evidence to
conclude that the mean response depends on the level of A.)

Be able to perform the 4 step test for B main effects:
i) Ho: μ01 = · · · = μ0b HA : not Ho
ii) FB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of B, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of B. (Or there is not enough evidence to
conclude that the mean response depends on the level of B.)

Remark 6.4. One could do a one way Anova on p = ab treatments, but
this procedure loses information about A, B, and the AB interaction.

Definition 6.6. An interaction plot is made by plotting the levels of one
factor (either 1, . . . , a or 1, . . . , b) versus the cell sample means Ȳij0 . Typically
the factor with more levels (e.g., A if a > b) is used on the horizontal axis.
If the levels of A are on the horizontal axis, use line segments to join the a
means that have the same j. There will be b curves on the plot. If the levels
of B are on the horizontal axis, use line segments to join the b means that
have the same i. There will be a curves on the plot. If no interaction is
present, then the curves should be roughly parallel.
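Since parallel curves differ only by a constant shift, a small numerical sketch (Python, with made-up cell means) can mimic what the plot displays:

```python
# Hypothetical cell sample means Ybar_ij0 for a = 3 levels of A and
# b = 2 levels of B; row i is level i of A, column j is level j of B.
ybar = [[10.0, 13.0],
        [12.0, 15.0],
        [11.0, 14.0]]

a, b = len(ybar), len(ybar[0])

# With A on the horizontal axis there are b curves; curve j joins the a
# means that share the same j.
curves = [[ybar[i][j] for i in range(a)] for j in range(b)]

# Two curves are parallel when their pointwise difference is constant, so
# a crude numerical check of "no interaction" is that every curve differs
# from curve 0 by a constant.
def parallel(curves, tol=1e-9):
    base = curves[0]
    for c in curves[1:]:
        diffs = [ci - bi for ci, bi in zip(c, base)]
        if max(diffs) - min(diffs) > tol:
            return False
    return True

print(curves)            # [[10.0, 12.0, 11.0], [13.0, 15.0, 14.0]]
print(parallel(curves))  # True: curve 1 is curve 0 shifted up by 3
```

Here the two curves are exactly parallel, so these cell means show no sample evidence of interaction.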

The interaction plot is rather hard to use, especially if the nij = m are
small. For small m, the curves can be far from parallel, even if there is no
interaction. The further the curves are from being parallel, the greater the
evidence of interaction. Intersection of curves suggests interaction unless the
two curves are nearly the same. The two curves may be nearly the same if
two levels of one factor give nearly the same mean response for each level of
the other factor. Then the curves could cross several times even though there
is no interaction. Software fills space. So the vertical axis needs to be checked
to see whether the sample means for two curves are close with respect to
the standard error √(MSE/m) for the means.
The interaction plot is the most useful if the conclusions for the plot agree
with the conclusions for the F test for no interaction.

Definition 6.7. The overparameterized two way Anova model has Yijk =
μij + eijk with μij = μ00 + αi + βj + (αβ)ij where the interaction parameters
(αβ)ij = μij − μi0 − μ0j + μ00 . The A main effects are αi = μi0 − μ00 for
i = 1, . . . , a. The B main effects are βj = μ0j − μ00 for j = 1, . . . , b. Here
Σi αi = 0, Σj βj = 0, Σi (αβ)ij = 0 for j = 1, . . . , b and Σj (αβ)ij = 0 for
i = 1, . . . , a. Thus Σi Σj (αβ)ij = 0.
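This decomposition is easy to verify numerically. A sketch (Python, with a made-up table of cell means) that recovers the effects and checks the sum-to-zero constraints:

```python
# Decompose a hypothetical table of cell means mu_ij (a = 2 levels of A,
# b = 3 levels of B) into mu_ij = mu_00 + alpha_i + beta_j + (ab)_ij.
mu = [[11.0, 15.0, 19.0],
      [12.0, 18.0, 24.0]]

a, b = len(mu), len(mu[0])
mu_i0 = [sum(row) / b for row in mu]                             # row means
mu_0j = [sum(mu[i][j] for i in range(a)) / a for j in range(b)]  # col means
mu_00 = sum(mu_i0) / a                                           # overall mean

alpha = [v - mu_00 for v in mu_i0]   # A main effects, sum to 0
beta = [v - mu_00 for v in mu_0j]    # B main effects, sum to 0
ab = [[mu[i][j] - mu_i0[i] - mu_0j[j] + mu_00
       for j in range(b)] for i in range(a)]  # interaction effects

# Sum-to-zero constraints over each index.
assert abs(sum(alpha)) < 1e-9 and abs(sum(beta)) < 1e-9
assert all(abs(sum(ab[i][j] for i in range(a))) < 1e-9 for j in range(b))
assert all(abs(sum(ab[i])) < 1e-9 for i in range(a))

print(alpha, beta, ab)
```

For this table the interaction effects are not all zero, so the factor effects are not additive.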

The mean parameters have the following meaning. The parameter μij is
the population mean response for the ijth treatment. The means μ0j =
Σi μij /a (sum over i = 1, . . . , a), and the means μi0 = Σj μij /b (sum over
j = 1, . . . , b).
As was the case for multiple linear regression, interaction is rather difficult
to understand. Note that if all of the interaction parameters (αβ)ij = 0, then
the factor effects are additive: μij = μ00 + αi + βj . Hence no interaction
implies that the factor effects are additive while interaction implies that
the factor effects are not additive. When there is no interaction, μ1j = μ00 +
α1 + βj , μ2j = μ00 + α2 + βj , . . . , μaj = μ00 + αa + βj . Consider a plot with
the μij on the vertical axis and the levels 1, 2, . . . , a of A on the horizontal
axis. If there is no interaction and if the μij with the same j are connected

[Figure: interaction plot with factor A levels 1 and 2 on the horizontal axis,
mean of Y on the vertical axis, and one curve for each level of B (labeled 1
and 2).]
Fig. 6.1 Interaction Plot for Example 6.1.

with line segments, then there will be b parallel curves with curve height
depending on βj . If there is interaction, then not all of the b curves will be
parallel. The interaction plot replaces the μij by the μ̂ij = Ȳij0 .

Example 6.1. Cobb (1998, pp. 200–212) describes an experiment on
weight gain for baby pigs. The response Y was the average daily weight
gain in pounds for each piglet (over a period of time). Factor A consisted of
0 mg of an antibiotic or 40 mg of an antibiotic, while factor B consisted of 0
mg of vitamin B12 or 5 mg of B12. Hence there were 4 diets (A, B) = (0,0),
(40,0), (0,5), or (40,5). Hence level 1 corresponds to 0 mg and level 2 to more
than 0 mg.
The interaction plot shown in Figure 6.1 suggests that there is an interaction.
If no vitamin B12 is given, then the pigs given the antibiotic have less
mean weight gain than the pigs not given the antibiotic. For pigs given the
diet with 5 mg of B12, the antibiotic was useful, with a mean gain near 1.6.
Pigs with A = 1 (no antibiotic in the diet) had similar mean weight gains,
but pigs with A = 2 (antibiotic in the diet) had greatly different mean weight
gains. The best diet had both vitamin B12 and the antibiotic, while the worst
diet had the antibiotic but no vitamin B12.

Example 6.2. The output below uses data from Kutner et al. (2005, problems
19.14–15). The output is from an experiment on hay fever, and 36 volunteers
were given medicine. The two active ingredients (factors A and B)
in the medicine were varied at three levels each (low, medium, and high).

The response is the number of hours of relief. (The factor names for this
problem are A and B.)
a) Give a four step test for the A*B interaction.
b) Give a four step test for the A main eects.
c) Give a four step test for the B main eects.

Source DF SS MS F P
A 2 220.0200 110.0100 1827.86 0.000
B 2 123.6600 61.8300 1027.33 0.000
Interaction 4 29.4250 7.3562 122.23 0.000
Error 27 1.6250 0.0602

Solution: a) Ho: no interaction HA : there is an interaction
FAB = 122.23
pval = 0.0
Reject Ho, there is an interaction between the active ingredients A and B.
b) Ho: μ10 = μ20 = μ30 HA : not Ho
FA = 1827.86
pval = 0.0
Reject Ho, the mean hours of relief depends on active ingredient A.
c) Ho: μ01 = μ02 = μ03 HA : not Ho
FB = 1027.33
pval = 0.0
Reject Ho, the mean hours of relief depends on active ingredient B.
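The F statistics in the output can be checked by hand from the SS and DF columns, since each F is the effect mean square divided by the MSE. A quick check (Python):

```python
# SS and DF values copied from the ANOVA output above.
SSA, SSB, SSAB, SSE = 220.02, 123.66, 29.425, 1.625
dfA, dfB, dfAB, dfE = 2, 2, 4, 27

MSE = SSE / dfE                 # about 0.0602, as in the output
FA = (SSA / dfA) / MSE          # MSA/MSE
FB = (SSB / dfB) / MSE          # MSB/MSE
FAB = (SSAB / dfAB) / MSE       # MSAB/MSE
print(round(FA, 2), round(FB, 2), round(FAB, 2))  # 1827.86 1027.33 122.23
```

The recomputed ratios agree with the F column of the output to two decimal places.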

6.2 K Way Anova Models

Use factorial crossing to compare the effects (main effects, pairwise interactions,
. . . , K-fold interaction if there are K factors) of two or more factors.
If A1 , . . . , AK are the factors with li levels for i = 1, . . . , K, then there are
l1 l2 · · · lK treatments where each treatment uses exactly one level from each
factor.

Source                                    df  SS          MS       F       p-value
K main effects                                e.g. SSA    MSA      FA      pA
(K choose 2) 2 factor interactions            e.g. SSAB   MSAB     FAB     pAB
(K choose 3) 3 factor interactions            e.g. SSABC  MSABC    FABC    pABC
. . .
(K choose K−1) K − 1 factor interactions
the K factor interaction                      SSA···L     MSA···L  FA···L  pA···L
Error                                         SSE         MSE

Shown above is a partial ANOVA table for a K way Anova design
with the degrees of freedom left blank. For A, use H0 : μ10···0 = · · · = μl1 0···0 .
The other main effects have similar null hypotheses. For interaction, use H0 :
no interaction.
These models get complex rapidly as K and the number of levels li increase.
As K increases, there are a large number of models to consider. For
experiments, usually the 3 way and higher order interactions are not significant.
Hence a full model that includes all K main effects and (K choose 2) 2 way
interactions is a useful starting point for response, residual, and transformation
plots. The higher order interactions can be treated as potential terms
and checked for significance. As a rule of thumb, significant interactions tend
to involve significant main effects.
The sample size n = m l1 l2 · · · lK ≥ m 2^K is minimized by taking li = 2 for
i = 1, . . . , K. Hence the sample size grows exponentially fast with K. Designs
that use the minimum number of levels 2 are discussed in Section 8.1.
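The counts above are easy to tabulate. A short sketch (Python; the helper name is made up) counting the treatments, the effects of each order, and the sample size for given levels:

```python
from math import comb, prod

# For a K way Anova with levels l_1, ..., l_K and m replicates per
# treatment, count the treatments, the sample size n = m * l_1 ... l_K,
# and the comb(K, j) effects of each interaction order j.
def kway_counts(levels, m):
    K = len(levels)
    treatments = prod(levels)
    n = m * treatments
    effects = {j: comb(K, j) for j in range(1, K + 1)}  # order 1 = main effects
    return treatments, n, effects

treatments, n, effects = kway_counts([2, 2, 2], m=3)
print(treatments, n, effects)   # 8 24 {1: 3, 2: 3, 3: 1}
assert n >= 3 * 2 ** 3          # n = m * 2^K exactly when every l_i = 2
```

With every li = 2 the bound n ≥ m 2^K is attained, illustrating why two-level designs minimize the sample size.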

6.3 Summary

1) The fixed effects two way Anova model has two factors A and B plus a
response Y . Factor A has a levels and factor B has b levels. There are ab
treatments. The cell means model is Yijk = μij + eijk where i = 1, . . . , a; j =
1, . . . , b; and k = 1, . . . , m. The sample size n = abm. The μij are constants
and the eijk are iid with mean 0 and variance σ². Hence the Yijk ∼ f (y − μij )
come from a location family with location parameter μij . The fitted values
are Ŷijk = Ȳij0 = μ̂ij while the residuals rijk = Yijk − Ŷijk .
2) Know that the 4 step test for AB interaction is
i) Ho: no interaction HA : there is an interaction
ii) FAB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho, and conclude that there is an interaction between A
and B, otherwise fail to reject Ho, and conclude that there is no interaction
between A and B.
3) Keep A and B in the model if there is an AB interaction.
4) Know that the 4 step test for A main effects is
i) Ho: μ10 = · · · = μa0 HA : not Ho
ii) FA is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of A, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of A.
5) Know that the 4 step test for B main effects is
i) Ho: μ01 = · · · = μ0b HA : not Ho
ii) FB is obtained from output.

iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of B, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of B.
The tests for main effects (points 4) and 5)) do not always make sense if
the test for interactions is rejected.

6) Shown is an ANOVA table for the two way Anova model given in symbols.
Sometimes "Error" is replaced by "Residual" or "Within Groups." A
and B are the main effects while AB is the interaction. Sometimes "p-value"
is replaced by "P", "Pr(> F )", or "PR > F". The p-value corresponding to
FA is for Ho: μ10 = · · · = μa0 . The p-value corresponding to FB is for Ho:
μ01 = · · · = μ0b . The p-value corresponding to FAB is for Ho: there is no
interaction.

Source  df                  SS    MS    F               p-value
A       a − 1               SSA   MSA   FA = MSA/MSE    pval
B       b − 1               SSB   MSB   FB = MSB/MSE    pval
AB      (a − 1)(b − 1)      SSAB  MSAB  FAB = MSAB/MSE  pval
Error   n − ab = ab(m − 1)  SSE   MSE

7) An interaction plot is made by plotting the levels of one factor (either
1, . . . , a or 1, . . . , b) versus the cell sample means Ȳij0 . Typically the factor
with more levels (e.g., A if a > b) is used on the horizontal axis. If the levels
of A are on the horizontal axis, use line segments to join the a means that
have the same j. There will be b curves on the plot. If the levels of B are on
the horizontal axis, use line segments to join the b means that have the same
i. There will be a curves on the plot. If no interaction is present, then the
curves should be roughly parallel.
8) The interaction plot is rather hard to use, especially if the nij = m are
small. For small m, the curves could be far from parallel even if there is no
interaction, but the further the curves are from being parallel, the greater the
evidence of interaction. Intersection of curves suggests interaction unless the
two curves are nearly the same. The two curves may be nearly the same if
two levels of one factor give nearly the same mean response for each level of
the other factor. Then the curves could cross several times even though there
is no interaction. Software fills space. So the vertical axis needs to be checked
to see whether the sample means for two curves are close with respect to
the standard error √(MSE/m) for the means.
9) The interaction plot is the most useful if the conclusions for the plot
agree with the conclusions for the F test for no interaction.
10) The μij of the cell means model can be parameterized as μij = μ00 +
αi + βj + (αβ)ij for i = 1, . . . , a and j = 1, . . . , b. Here the αi are the A main
effects and Σi αi = 0. The βj are the B main effects and Σj βj = 0. The
(αβ)ij are the interaction effects and satisfy Σi (αβ)ij = 0, Σj (αβ)ij = 0,
and Σi Σj (αβ)ij = 0. The interaction effect (αβ)ij = μij − μi0 − μ0j + μ00 .
Here the row factor means μi0 = Σj μij /b, the column factor means μ0j =
Σi μij /a, and μ00 = Σi Σj μij /(ab).
11) If there is no interaction, then the factor effects are additive: μij =
μ00 + αi + βj .
12) If A and B are factors, then there are 5 possible models.
i) The two way Anova model has terms A, B, and AB.
ii) The additive model or main effects model has terms A and B.
iii) The one way Anova model that uses factor A.
iv) The one way Anova model that uses factor B.
v) The null model does not use any of the three terms A, B, or AB. If the
null model holds, then Yijk ∼ f (y − μ00 ), so the Yijk form a random sample of
size n from a location family, and the distribution of the response is the same
for all ab observed treatments.
13) A two way Anova model could be fit as a one way Anova model with
k = ab treatments, but for balanced models where nij ≡ m, this procedure
loses information about A, B, and the interaction AB.
14) Response, residual, and transformation plots are used in the same way
for the two way Anova model as for the one way Anova model.

6.4 Complements

Four good texts on the design and analysis of experiments are mentioned
in the Complements of Chapter 5. The software for K way Anova is often
used to fit block designs. Each block is entered as if it were a factor and the
main effects model is fit. The one way block design treats the block like one
factor and the treatment factor as another factor and uses two way Anova
software without interaction to get the correct sum of squares, F statistic,
and p-value. The Latin square design treats the row block as one factor, the
column block as a second factor, and the treatment factor as another factor.
Then the three way Anova software for main effects is used to get the correct
sum of squares, F statistic, and p-value. These two designs are described in
Chapter 7. The K way software is also used to get output for the split plot
designs described in Chapter 9.
Consider finding a model using pretesting or variable selection, and then
acting as if that model was selected before examining the data. This method
does not lead to valid inference. See Fabian (1991) for results on the 2 way
Anova model. If the method can be automated, the bootstrap method of Olive
(2016a) is conjectured to be useful for inference. This bootstrap method may
also be useful for unbalanced designs where the nij are not all equal to m.
Gail (1996) explains why it took so long to use double blinded completely
randomized controlled experiments to test new vaccines.

An alternative method is to perform Anova on ranks. These rank tests
appear to work for main effects, but not for interactions. See Marden and
Muyot (1995).

6.5 Problems

Problems with an asterisk * are especially important.

Output for 6.1.


Source df SS MS F P
A 2 24.6 12.3 0.24 0.791
B 2 28.3 14.2 0.27 0.763
Interaction 4 1215.3 303.8 5.84 0.001
Error 36 1872.4 52.0
6.1. The above output uses data from Kutner et al. (2005, problems
19.16–17). A study measured the number of minutes to complete a
repair job at a large dealership. The two explanatory variables were A =
technician and B = make of drive. The output is given above.
a) Give a four step test for no interaction.
b) Give a four step test for the B main eects.
6.2. Suppose A has 5 levels and B has 4 levels. Sketch an interaction plot
if there is no interaction.
Two Way Anova in SAS
In SAS, Y = A|B is equivalent to Y = A B A*B. Thus the SAS model
statement could be written in either of the following two forms.
proc glm;
class material temp;
model mvoltage = material|temp;
output out =a p = pred r = resid;
proc glm;
class material temp;
model mvoltage = material temp material*temp;
output out =a p = pred r = resid;
6.3. Cut and paste the SAS program from
(http://lagrange.math.siu.edu/Olive/lreghw.txt) for 6.3 into the SAS
Editor.
To execute the program, use the top menu commands Run>Submit. An
output window will appear if successful. The data is from Montgomery (1984,
p. 198) and gives the maximum output voltage for a typical type of storage
battery. The two factors are material (1, 2, 3) and temperature (50, 65, 80 °F).

a) Copy and paste the SAS program into SAS, use the file command
Run>Submit.
b) Click on the Graph1 window and scroll down to the second interaction
plot of tmp vs ymn. Press the printer icon to get the plot.
c) Is interaction present?
d) Click on the output window then click on the printer icon. This will
produce 5 pages of output, but only hand in the ANOVA table, response plot,
and residual plots.
(Cutting and pasting the output into Word resulted in bad plots. Using
Notepad gave better plots, but the printer would not easily put the ANOVA
table and two plots on one page each.)
e) Do the residual and response plots look ok?

Two Way Anova in Minitab

6.4. a) Copy the SAS data for problem 6.3 into Notepad. Then hit Enter
every three numbers so that the data is in 3 columns.

1 50 130
1 50 155
1 50 74
1 50 180
1 65 34
. . .
. . .
. . .
3 80 60

b) Copy and paste the data into Minitab using the menu commands
Edit>Paste Cells and click on OK. Right below C1 type material, below
C2 type temp and below C3 type mvoltage.
c) Select Stat>ANOVA>Two-way, select C3 mvoltage as the response
and C1 material as the row factor and C2 temp as the column factor.
Click on Store residuals and click on Store fits. Then click on OK.
Click on the output and then click on the printer icon.
d) To make a residual plot, select Graph>Plot. Select Resi1 for Y and
Fits1 for X and click on OK. Click on the printer icon to get a plot of
the graph.
e) To make a response plot, select Graph>Plot. Select C3 mvoltage for
Y and Fits1 for X and click on OK. Click on the printer icon to get
a plot of the graph.

f) Use the menu commands Stat>ANOVA>Interaction Plots. Enter


mvoltage in the Responses box and material and temp in the Factors
box. Click on OK and print the plot.

g) Use the menu commands Stat>ANOVA>Interaction Plots. Enter


mvoltage in the Responses box and temp and material in the Factors
box. Click on OK and print the plot.
h) Do the 4 step test for interaction.

R Problem

Use the command source("G:/lregpack.txt") to download the functions
and the command source("G:/lregdata.txt") to download the data.
See Preface or Section 14.1. Typing the name of the R function, e.g. aov,
will display the code for the function. Use the args command, e.g. args(aov),
to display the needed arguments for the function. For the following problem,
the R commands can be copied and pasted from
(http://lagrange.math.siu.edu/Olive/lreghw.txt) into R.
In R,
Y ~ A + B is equivalent to Y ~ . , so the period
indicates use all main effects. Y ~ A + B + A:B is
equivalent to Y ~ A*B and to Y ~ .^2, which means
fit all main effects and all two way interactions.
A problem is that A and B need to be of type factor.

6.5. The Box et al. (2005, p. 318) poison data has 4 types of treatments
(1,2,3,4) and 3 types of poisons (1,2,3). Each animal is given a poison and a
treatment, and the response is survival in hours. Get the poison data from
lregdata.
a) Type the following commands to see that the output for the three
models is the same. Print the output.

out1<-aov(stime~ptype*treat,poison)
summary(out1)
out2<-aov(stime~ptype + treat + ptype*treat,poison)
summary(out2)
out3<-aov(stime~.^2,poison)
summary(out3)
#The three models are the same.

b) Type the following commands to see the residual plot. Include the plot
in Word.

plot(fitted(out1),resid(out1))
title("Residual Plot")

c) Type the following commands to see the response plot. Include the plot
in Word.

FIT <- poison$stime - out1$resid


plot(FIT,poison$stime)
abline(0,1)
title("Response Plot")

d) Why is the two way Anova model inappropriate?


e) Now the response Y = 1/stime will be used. Type the following com-
mands to get the output. Copy the output into Word.

attach(poison)
out4 <- aov((1/stime)~ptype*treat,poison)
summary(out4)

f) Type the following commands to get the residual plot. Copy the plot
into Word.

plot(fitted(out4),resid(out4))
title("Residual Plot")

g) Type the following commands to get the response plot. Copy the plot
into Word.

FIT <- 1/poison$stime - out4$resid


plot(FIT,(1/poison$stime))
abline(0,1)
title("Response Plot")

h) Type the following commands to get the interaction plot. Copy the plot
into Word.

interaction.plot(treat,ptype,(1/stime))
detach(poison)

i) Test whether there is an interaction using the output from e).


Chapter 7
Block Designs

Blocks are groups of similar units, and blocking can yield experimental designs
that are more efficient than designs that do not block. One way block designs
and Latin square designs will be discussed.
Definition 7.1. A block is a group of mk similar or homogeneous units.
In a block design, each unit in a block is randomly assigned to one of k
treatments with each treatment getting m units from the block. The meaning
of similar is that the units are likely to have similar values of the response
when given identical treatments.
In agriculture, adjacent plots of land are often used as blocks since adjacent
plots tend to give similar yields. Litter mates, siblings, twins, time periods
(e.g., different days), and batches of material are often used as blocks.
Following Cobb (1998, p. 247), there are 3 ways to get blocks. i) Sort units
into groups (blocks) of mk similar units. ii) Divide large chunks of material
(blocks) into smaller pieces (units). iii) Reuse material or subjects (blocks)
several times. Then the time slots are the units.
Example 7.1. For i), to study the effects of k different medicines, sort
n = bk people into b groups of size k according to similar age and weight. For
ii), suppose there are b plots of land. Divide each plot into k subplots. Then
each plot is a block and the subplots are units. For iii), give the k different
treatments to each person over k months. Then each person has a block of
time slots and the ith month = time slot is the unit.

7.1 One Way Block Designs

Suppose there are b blocks and n = kb. The one way Anova design randomly
assigns b of the units to each of the k treatments. Blocking places a constraint
on the randomization, since within each block of units, exactly one unit is
randomly assigned to each of the k treatments.


Hence a one way Anova design would use the R command sample(n), and
the first b units would be assigned to treatment 1, the second b units to
treatment 2, . . . , and the last b units would be assigned to treatment k.
For the completely randomized block designs, described below, the command
sample(k) is done b times: once for each block. The ith command is
for the units of the ith block. If k = 5 and the sample(5) command yields
2 5 3 1 4, then the 2nd unit in the ith block is assigned to treatment
1, the 5th unit to treatment 2, the 3rd unit to treatment 3, the 1st unit to
treatment 4, and the 4th unit to treatment 5.
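This per-block randomization is easy to sketch outside of R as well. The following (Python; the function name is made up) mirrors calling sample(k) once for each block:

```python
import random

# Blocked randomization: within each of b blocks of k units, a random
# permutation assigns exactly one unit to each of the k treatments.
def randomize_blocks(b, k, seed=None):
    rng = random.Random(seed)
    plan = []
    for _ in range(b):
        units = list(range(1, k + 1))
        rng.shuffle(units)      # units[t - 1] receives treatment t
        plan.append(units)
    return plan

plan = randomize_blocks(b=4, k=5, seed=1)
for block in plan:
    # each block uses every treatment exactly once
    assert sorted(block) == [1, 2, 3, 4, 5]
print(plan)
```

Each inner list is one block's permutation, so the randomization is constrained exactly as described: every treatment appears once per block.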

Remark 7.1. Blocking and randomization often make the iid error
assumption hold to a useful approximation.
For example, if grain is planted in n plots of land, yields tend to be similar
(correlated) in adjacent identically treated plots, but the yields from all of
the plots vary greatly, and the errors are not iid. If there are 4 treatments
and blocks of 4 adjacent plots, then randomized blocking makes the errors
approximately iid.

Definition 7.2. For the one way block design or completely randomized
block design (CRBD), there is a factor A with k levels and
there are b blocks. The CRBD model is

Yij = μij + eij = μ + τi + βj + eij

where τi is the ith treatment effect with Σi τi = 0 (i = 1, . . . , k), and βj is
the jth block effect with Σj βj = 0 (j = 1, . . . , b). Then

μi0 = (1/b) Σj (μ + τi + βj ) = μ + τi .

So the μi0 are all equal if the τi are all equal. The errors eij are iid with 0
mean and constant variance σ².

Notice that the CRBD model is additive: there is no block treatment
interaction. The ANOVA table for the CRBD is like the ANOVA table for a
two way Anova main effects model. Shown below is a CRBD ANOVA table in
symbols. Sometimes "Treatment" is replaced by "Factor" or "Model." Sometimes
"Blocks" is replaced by the name of the blocking variable. Sometimes
"Error" is replaced by "Residual."

Source     df              SS    MS    F              p-value
Blocks     b − 1           SSB   MSB   Fblock         pblock
Treatment  k − 1           SSTR  MSTR  F0 = MSTR/MSE  pval for Ho
Error      (k − 1)(b − 1)  SSE   MSE

Be able to perform the 4 step completely randomized block design
ANOVA F test of hypotheses. This test is similar to the fixed effects
one way ANOVA F test. As always, pval is the estimated p-value.
i) Ho: μ1 = μ2 = · · · = μk and HA : not Ho.
ii) Fo = MSTR/MSE is usually given by output.
iii) The pval = P(Fk−1,(k−1)(b−1) > Fo ) is usually given by output.
iv) If the pval ≤ δ, reject Ho and conclude that the mean response depends
on the factor level. Otherwise fail to reject Ho and conclude that the mean
response does not depend on the factor level. (Or there is not enough evidence
to conclude that the mean response depends on the factor level.) Give a
nontechnical sentence.

Rule of thumb 7.1. If pblock ≥ 0.1, then blocking was not useful. If
0.05 < pblock < 0.1, then the usefulness was borderline. If pblock ≤ 0.05,
then blocking was useful.

Remark 7.2. The response, residual, and transformation plots are used
almost in the same way as for the one and two way Anova models, but all of
the dot plots have sample size m = 1. Look for the plotted points falling in
roughly evenly populated bands about the identity line and r = 0 line. See
Problem 7.4 for these plots and the following plot.

Definition 7.3. The block response scatterplot plots blocks versus the
response. The plot will have b dot plots of size k with a symbol corresponding
to the treatment. Dot plots with clearly different means suggest that blocking
was useful. A symbol pattern within the blocks (e.g., symbols A and B are
always highest while C and D are always lowest) suggests that the response
depends on the factor level.

Definition 7.4. Graphical Anova for the CRBD model uses the residuals
as a reference set instead of an F distribution. The scaled treatment
deviations √(b - 1)(Ȳi0 - Ȳ00) have about the same variability as the
residuals if Ho is true. The scaled block deviations √(k - 1)(Ȳ0j - Ȳ00) also
have about the same variability as the residuals if blocking is ineffective. A dot
plot of the scaled block deviations is placed above the dot plot of the scaled
treatment deviations, which is placed above the dot plot of the residuals. For
small n ≤ 40, suppose the distance between two scaled deviations (A and
B, say) is greater than the range of the residuals = max(rij) - min(rij).
Then declare A and B to be significantly different. If the distance is less
than the range, do not declare A and B to be significantly different. Scaled

deviations that lie outside the range of the residuals are significant: the
corresponding treatment means are significantly different from the overall mean.
For n ≥ 100, let r(1) ≤ r(2) ≤ · · · ≤ r(n) be the order statistics of the
residuals. Then instead of the range, use r(⌈0.975n⌉) - r(⌈0.025n⌉) as the
distance where ⌈x⌉ is the smallest integer ≥ x, e.g. ⌈7.7⌉ = 8. So effects
outside of the interval (r(⌈0.025n⌉), r(⌈0.975n⌉)) are significant. See Box et al.
(2005, pp. 150-151).
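The scaled deviations of Definition 7.4 are easy to compute directly. The following Python sketch (illustrative, not part of lregpack) uses the soybean germination data of Example 7.3 with b = 5 blocks and k = 5 treatments, forms the scaled deviations, and finds the range of the residuals used as the reference set.

```python
# Scaled deviations and residuals for graphical Anova (Definition 7.4).
from math import sqrt

# Rows = b = 5 blocks, columns = k = 5 treatments (seed types).
frate = [[8, 2, 4, 3, 9],
         [10, 6, 10, 5, 7],
         [12, 7, 9, 9, 5],
         [13, 11, 8, 10, 5],
         [11, 5, 10, 6, 3]]
b, k = len(frate), len(frate[0])
grand = sum(sum(row) for row in frate) / (b * k)
block_means = [sum(row) / k for row in frate]
treat_means = [sum(frate[j][i] for j in range(b)) / b for i in range(k)]

# sqrt(b-1)(Ybar_i0 - Ybar_00) and sqrt(k-1)(Ybar_0j - Ybar_00)
scaled_treat = [sqrt(b - 1) * (m - grand) for m in treat_means]
scaled_block = [sqrt(k - 1) * (m - grand) for m in block_means]

# Residuals of the additive fit: r_ij = Y_ij - Ybar_i0 - Ybar_0j + Ybar_00.
resid = [frate[j][i] - treat_means[i] - block_means[j] + grand
         for j in range(b) for i in range(k)]
rng = max(resid) - min(resid)

# A scaled deviation outside (min(resid), max(resid)) is significant.
print([round(d, 2) for d in scaled_treat])  # [6.56, -2.64, 1.36, -1.84, -3.44]
print([round(d, 2) for d in scaled_block])  # [-4.64, 0.16, 1.76, 3.76, -1.04]
print(round(rng, 2))
```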

Example 7.2. Ledolter and Swersey (2007, p. 60) give completely ran-
domized block design data. The block variable = market had 4 levels (1
Binghamton, 2 Rockford, 3 Albuquerque, 4 Chattanooga) while the treat-
ment factor had 4 levels (A no advertising, B $6 million, C $12 million, D
$18 million advertising dollars in 1973). The response variable was average
cheese sales (in pounds per store) sold in a 3-month period.
a) From the graphical Anova in Figure 7.1, were the blocks useful?
b) Perform an appropriate 4 step test for whether advertising helped cheese
sales.

Output for Example 7.2.


Df Sum Sq Mean Sq F value Pr(>F)
block 3 79308210 26436070 54.310 4.348e-06
treatment 3 1917416 639139 1.313 0.3292
Residuals 9 4380871 486763

> ganova2(x,block,y)

scaled block deviations
-3790.377  4720.488  2881.483  -3811.594
block 1    block 2   block 3   block 4

scaled treatment deviations
-266.086  -833.766  733.307  366.545
"A"       "B"       "C"      "D"

Solution: a) In Figure 7.1, the top dot plot is for the scaled block deviations.
The leftmost dot corresponds to blocks 4 and 1, the middle dot to block 3,
and the rightmost dot to block 2 (see output from the lregpack function
ganova2). Yes, the blocks were useful since some (actually all) of the dots
corresponding to the scaled block deviations fall outside the range of the
residuals. This result also agrees with pblock = 4.348e-06 < 0.05.
b) i) Ho: μ1 = μ2 = μ3 = μ4 HA: not Ho
ii) Fo = 1.313
iii) pval = 0.3292
iv) Fail to reject Ho, the mean sales does not depend on advertising level.
In Figure 7.1, the middle dot plot is for the scaled treatment deviations.
From left to right, these correspond to B, A, D, and C since the output shows
that the deviation corresponding to C is the largest with value 733.3. Since

Fig. 7.1 Graphical Anova for a One Way Block Design

the four scaled treatment deviations all lie within the range of the residuals,
the four treatments again do not appear to be significant.

frate Block Response Plot for Example 7.3


-
- A
12.0+ A
- B A
- 2 D C
-
- E 2
8.0+ A C
- E B
- B D
-
- D E E B
4.0+ C
- D E
- B
-
-
+------+------+------+------+------+--block
1 2 3 4 5

Example 7.3. Snedecor and Cochran (1967, p. 300) give a data set with
5 types of soybean seed. The response frate = number of seeds out of 100
that failed to germinate. Five blocks were used. On the previous page is a
block response scatterplot where A, B, C, D, and E refer to seed type. The 2
in the second block indicates that A and C both had values 10. Which type
of seed has the highest germination failure rate?
a) A b) B c) C d) D e) E
Solution: a) A since A is on the top for blocks 2-5 and second for block 1.

Fig. 7.2 One Way Block Design Does Not Fit All of the Data

Note: The response and residual plots in Figure 7.2 suggest that one case
is not fit well by the model. The Bs and Es in the block response plot suggest
that there may be a block × treatment interaction, which is not allowed by the
completely randomized block design. Figure 7.2 was made with the following
commands using the lregpack function aovplots.

block <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5)
seed <- rep(1:5,5)
block <- factor(block)
seed <- factor(seed)
frate <- c(8,2,4,3,9,10,6,10,5,7,12,7,9,9,5,13,11,8,
10,5,11,5,10,6,3)

soy <- data.frame(block,seed,frate)


rm(block,seed,frate)
attach(soy)
z <- aov(frate~block+seed,soy)
aovplots(Y=frate,FIT=fitted(z),RES=resid(z))
#right click Stop twice
detach(soy)

7.2 Blocking with the K Way Anova Design

Blocking is used to reduce the MSE so that inference such as tests and
confidence intervals are more precise. Below is a partial ANOVA table for a
k way Anova design with one block where the degrees of freedom are left blank.
For A, use H0: μ10···0 = · · · = μl1 0···0. The other main effects have similar null
hypotheses. For interaction, use H0: no interaction.
These models get complex rapidly as k and the number of levels li increase.
As k increases, there are a large number of models to consider. For
experiments, usually the 3 way and higher order interactions are not
significant. Hence a full model that includes the blocks, all k main effects, and
all (k choose 2) two way interactions is a useful starting point for response,
residual, and transformation plots. The higher order interactions can be treated
as potential terms and checked for significance. As a rule of thumb, significant
interactions tend to involve significant main effects.

Source                                 df  SS          MS       F       p-value
block                                      SSblock     MSblock  Fblock  pblock
k main effects                             e.g. SSA    MSA      FA      pA
(k choose 2) two way interactions          e.g. SSAB   MSAB     FAB     pAB
(k choose 3) three way interactions        e.g. SSABC  MSABC    FABC    pABC
  ...                                      ...         ...      ...     ...
(k choose k-1) (k-1) way interactions
the k way interaction                      SSA···L     MSA···L  FA···L  pA···L
Error                                      SSE         MSE

The following example has one block and 3 factors. Hence there are 3 two
way interactions and 1 three way interaction.

Example 7.4. Snedecor and Cochran (1967, pp. 361-364) describe a block
design (2 levels) with three factors: food supplements Lysine (4 levels),
Methionine (3 levels), and Protein (2 levels). Male pigs were fed the supplements
in a 4 × 3 × 2 factorial arrangement and the response was average daily weight
gain. The ANOVA table is shown on the following page. The model could be
described as Yijkl = μijkl + eijkl for i = 1, 2, 3, 4; j = 1, 2, 3; k = 1, 2; and
l = 1, 2 where i, j, k are for L, M, P and l is for block. Note that μi000 is the
mean corresponding to the ith level of L.
a) There were 24 pigs in each block. How were they assigned to the 24 =
4 · 3 · 2 runs (a run is an L,M,P combination forming a pig diet)?
b) Was blocking useful?
c) Perform a 4 step test for the significant main effect.
d) Which, if any, of the interactions were significant?

Solution: a) Randomly.
b) Yes, 0.0379 < 0.05.
c) H0: μ0010 = μ0020 HA: not H0
FP = 19.47
pval = 0.0002
Reject H0, the mean weight gain depends on the protein level.
d) None.

Source  df  SS      MS      F       p-value
block    1  0.1334  0.1334  4.85    0.0379
L        3  0.0427  0.0142  0.5164  0.6751
M        2  0.0526  0.0263  0.9564  0.3990
P        1  0.5355  0.5355  19.47   0.0002
LM       6  0.2543  0.0424  1.54    0.2099
LP       3  0.2399  0.0800  2.91    0.0562
MP       2  0.0821  0.0410  1.49    0.2463
LMP      6  0.0685  0.0114  0.4145  0.8617
error   23  0.6319  0.0275

Remark 7.3. There are 3 basic principles of DOE. Randomization,
factorial crossing, and blocking can be used to create many DOE models.
i) Use randomization to assign units to treatments.
ii) Use factorial crossing to compare the effects of 2 or more factors in
the same experiment: if A1, A2, . . . , Ak are the k factors where the ith factor
Ai has li levels, then there are (l1)(l2) · · · (lk) treatments where a treatment
has one level from each factor.
iii) Use blocking to increase precision. Divide units into blocks of similar
homogeneous units where similar implies that the units are likely to have
similar values of the response if given the same treatment. Within each block,
randomly assign units to treatments.
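Factorial crossing in ii) amounts to a Cartesian product of the level sets. A minimal Python sketch (illustrative, using the factor names of Example 7.4 for concreteness):

```python
# Factorial crossing: each treatment takes one level from each factor.
from itertools import product

levels = {"Lysine": 4, "Methionine": 3, "Protein": 2}  # Example 7.4 factors
treatments = list(product(*(range(1, l + 1) for l in levels.values())))

# (l1)(l2)(l3) = 4 * 3 * 2 = 24 treatments
print(len(treatments))  # 24
print(treatments[0])    # (1, 1, 1): level 1 of each factor
```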

7.3 Latin Square Designs

Latin square designs have a lot of structure. The design contains a row block
factor, a column block factor, and a treatment factor, each with a levels. The
two blocking factors and the treatment factor are crossed, but it is assumed
that there is no interaction. A capital letter is used for each of the a treatment
levels. So a = 3 uses A, B, C while a = 4 uses A, B, C, D.

Definition 7.5. In an a × a Latin square, each letter appears exactly once
in each row and in each column. A standard Latin square has letters written
in alphabetical order in the first row and in the first column.

Five Latin squares are shown below. The first, third, and fifth are standard.
If a = 5, there are 56 standard Latin squares.
A B C   A B C   A B C D   A B C D E   A B C D E
B C A   C A B   B A D C   E A B C D   B A E C D
C A B   B C A   C D A B   D E A B C   C D A E B
                D C B A   C D E A B   D E B A C
                          B C D E A   E C D B A

Definition 7.6. The model for the Latin square design is

Yijk = μ + τi + βj + γk + eijk

where τi is the ith treatment effect, βj is the jth row block effect, γk is the
kth column block effect with i, j, and k = 1, . . . , a. The errors eijk are iid
with 0 mean and constant variance σ². The ith treatment mean μi = μ + τi.

Shown below is an ANOVA table for the Latin square model given in
symbols. Sometimes "Error" is replaced by "Residual" or "Within Groups."
Sometimes rblocks and cblocks are replaced by the names of the blocking
factors. Sometimes "p-value" is replaced by "P", "Pr(> F)", or "PR > F."

Source      df          SS    MS    F               p-value
rblocks     a-1         SSRB  MSRB  Frow            prow
cblocks     a-1         SSCB  MSCB  Fcol            pcol
treatments  a-1         SSTR  MSTR  Fo = MSTR/MSE   pval
Error       (a-1)(a-2)  SSE   MSE

Rule of thumb 7.2. Let pblock be prow or pcol. If pblock ≥ 0.1, then
blocking was not useful. If 0.05 < pblock < 0.1, then the usefulness was
borderline. If pblock ≤ 0.05, then blocking was useful.

Be able to perform the 4 step ANOVA F test for the Latin square
design. This test is similar to the fixed effects one way ANOVA F test.
i) Ho: μ1 = μ2 = · · · = μa and HA: not Ho.
ii) Fo = MSTR/MSE is usually given by output.
iii) The pval = P(F_{a-1,(a-1)(a-2)} > Fo) is usually given by output.
iv) If the pval ≤ δ, reject Ho and conclude that the mean response depends
on the factor level. Otherwise fail to reject Ho and conclude that the mean
response does not depend on the factor level. (Or there is not enough evidence
to conclude that the mean response depends on the factor level.) Give a
nontechnical sentence. Use δ = 0.05 if δ is not given.

Remark 7.4. The response, residual, and transformation plots are used
almost in the same way as for the one and two way Anova models, but all of
the dot plots have sample size m = 1. Look for the plotted points falling in
roughly evenly populated bands about the identity line and r = 0 line. See
Problem 7.5 and the following example.

Source df SS MS F P
rblocks 3 774.335 258.1117 2.53 0.1533
cblocks 3 133.425 44.4750 0.44 0.7349
fertilizer 3 1489.400 496.4667 4.87 0.0476
error 6 611.100 101.8500

Example 7.5. Dunn and Clark (1974, p. 129) examine a study of four
fertilizers on yields of wheat. The row blocks were 4 types of wheat. The
column blocks were 4 plots of land. Each plot was divided into 4 subplots
and a Latin square design was used. (To illustrate the inference for Latin
square designs, ignore the fact that the data had an outlier. Case 14 had a
yield of 64.5 while the next highest yield was 35.5. For the response plot in
Figure 7.3, note that both Y and Ŷ are large for the high yield. Also note
that Ŷ underestimates Y by about 10 for this case.)
a) Were the row blocks useful? Explain briefly.
b) Were the column blocks useful? Explain briefly.
c) Do an appropriate 4 step test.

Solution:
a) No, prow = 0.1533 > 0.1.
b) No, pcol = 0.7349 > 0.1.
c) i) H0: μ1 = μ2 = μ3 = μ4 HA: not H0
ii) F0 = 4.87
iii) pval = 0.0476
iv) Reject H0. The mean yield depends on the fertilizer level.

Figure 7.3 was made with the following commands using the lregpack
function aovplots.
rblocks <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4)
cblocks <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
fertilizer <- c(1,2,3,4,2, 3, 4, 1, 3, 4, 1, 2, 4, 1, 2, 3)
yield <- c(35.5,24.5,14.7,35.5, 14.4, 6.2, 13.7, 24.5, 14.1,
16.2, 34.3, 19.7, 15.0, 64.5, 34.6, 19.0)
rblocks <- factor(rblocks)
cblocks <- factor(cblocks)
fertilizer <- factor(fertilizer)
dcls <- data.frame(yield,rblocks,cblocks,fertilizer)
rm(yield,rblocks,cblocks,fertilizer)

Fig. 7.3 Latin Square Data

attach(dcls)
z <- aov(yield~rblocks+cblocks+fertilizer)
summary(z)
aovplots(Y=yield,FIT=fitted(z),RES=resid(z))
#right click Stop twice, drag the plots to make them square
detach(dcls)
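The ANOVA table above Example 7.5 can also be reproduced directly from the sums of squares. The following Python sketch is an independent numerical check of the Latin square decomposition for the Dunn and Clark data, not the book's R route through aov:

```python
# Latin square ANOVA sums of squares for the Dunn and Clark wheat data.
a = 4
rblocks = [1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4]
cblocks = [1,2,3,4, 1,2,3,4, 1,2,3,4, 1,2,3,4]
fert    = [1,2,3,4, 2,3,4,1, 3,4,1,2, 4,1,2,3]
yield_  = [35.5,24.5,14.7,35.5, 14.4,6.2,13.7,24.5,
           14.1,16.2,34.3,19.7, 15.0,64.5,34.6,19.0]
n = a * a
grand = sum(yield_) / n

def ss(labels):
    # a * sum over levels of (level mean - grand mean)^2
    tot = 0.0
    for lev in set(labels):
        ys = [y for y, l in zip(yield_, labels) if l == lev]
        tot += len(ys) * (sum(ys) / len(ys) - grand) ** 2
    return tot

SSRB, SSCB, SSTR = ss(rblocks), ss(cblocks), ss(fert)
SST = sum((y - grand) ** 2 for y in yield_)
SSE = SST - SSRB - SSCB - SSTR              # df = (a-1)(a-2) = 6
F0 = (SSTR / (a - 1)) / (SSE / ((a - 1) * (a - 2)))
print(round(SSRB, 3), round(SSCB, 3), round(SSTR, 1), round(SSE, 1))
print(round(F0, 2))  # 4.87, matching the output above Example 7.5
```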

Remark 7.5. The Latin square model is additive, but the model is often
incorrectly used to study nuisance factors that can interact. Factorial or
fractional factorial designs should be used when interaction is possible.

Remark 7.6. The randomization is done in 3 steps. Draw 3 random
permutations of 1, . . . , a. Use the 1st permutation to randomly assign row
block levels to the numbers 1, . . . , a. Use the 2nd permutation to randomly
assign column block levels to the numbers 1, . . . , a. Use the 3rd permutation
to randomly assign treatment levels to the 1st a letters (A, B, C, and D
if a = 4).
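These three steps can be sketched in code. The Python function below is illustrative (it is not from lregpack), and it assumes the unrandomized starting square is the cyclic square with entry (i + j) mod a, one convenient standard choice:

```python
# Randomize a Latin square: permute rows, columns, and letter assignments.
import random

def random_latin_square(a, seed=None):
    rng = random.Random(seed)
    # Start from the cyclic square: entry (i, j) = (i + j) mod a.
    square = [[(i + j) % a for j in range(a)] for i in range(a)]
    rows = rng.sample(range(a), a)   # 1st permutation: row blocks
    cols = rng.sample(range(a), a)   # 2nd permutation: column blocks
    lets = rng.sample(range(a), a)   # 3rd permutation: treatment letters
    letters = "ABCDEFGHIJ"[:a]
    return [[letters[lets[square[r][c]]] for c in cols] for r in rows]

sq = random_latin_square(4, seed=1)
for row in sq:
    print(" ".join(row))
# Each letter appears exactly once in every row and every column.
```

Permuting rows, permuting columns, and relabeling the letters each preserve the Latin property, so the result is always a Latin square.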

Example 7.6. In the social sciences, often a blocking factor is time: the
levels are a time slots. Following Cobb (1998, p. 254), a Latin square design
was used to study the response Y = blood sugar level, where the row blocks
were 4 rabbits, the column blocks were 4 time slots, and the treatments were
4 levels of insulin. Label the rabbits as I, II, III, and IV; the dates as 1, 2, 3,
4; and the 4 insulin levels i1 < i2 < i3 < i4 as 1, 2, 3, 4. Suppose the random
permutation for the rabbits was 3, 1, 4, 2; the permutation for the dates was
1, 4, 3, 2; and the permutation for the insulin levels was 2, 3, 4, 1. Then i2 is
treatment A, i3 is treatment B, i4 is treatment C, and i1 is treatment D.
Then the data are as shown below on the left. The data is rearranged for
presentation on the right.

raw data presentation data


date date
rabbit 4/23 4/27 4/26 4/25 rabbit 4/23 4/25 4/26 4/27
III 57A 45B 60C 26D I 24B 46C 34D 48A
I 24B 48A 34D 46C II 33D 58A 57B 60C
IV 46C 47D 61A 34B III 57A 26D 60C 45B
II 33D 60C 57B 58A IV 46C 34B 61A 47D

Example 7.7. Following Cobb (1998, p. 255), suppose there is a
rectangular plot divided into 5 rows and 5 columns to form 25 subplots. There
are 5 treatments which are 5 varieties of a plant, labelled 1, 2, 3, 4, 5; and
the response Y is yield. Adjacent subplots tend to give similar yields under
identical treatments, so the 5 rows form the row blocks and the 5 columns
form the column blocks. To perform randomization, three random
permutations are drawn. Shown below are 3 Latin squares. The one on the left is an
unrandomized Latin square.
Suppose 2, 4, 3, 5, 1 is the permutation drawn for rows. The middle Latin
square with randomized rows has 1st row which is the 2nd row from the
original unrandomized Latin square. The middle square has 2nd row that is
the 4th row from the original, the 3rd row is the 3rd row from the original,
the 4th row is the 5th row from the original, and the 5th row is the 1st row
from the original.

unrandomized randomized rows randomized Latin square


rows columns rows columns rows columns
1 2 3 4 5 1 2 3 4 5 1 4 2 5 3
1 A B C D E 2 B C D E A 2 B E C A D
2 B C D E A 4 D E A B C 4 D B E C A
3 C D E A B 3 C D E A B 3 C A D B E
4 D E A B C 5 E A B C D 5 E C A D B
5 E A B C D 1 A B C D E 1 A D B E C

Suppose 1, 4, 2, 5, 3 is the permutation drawn for columns. Then the


randomized Latin square on the right has 1st column which is the 1st column
from the middle square, the 2nd column is the 4th column from the middle
square, the 3rd column is the 2nd column from the middle square, the 4th
column is the 5th column from the middle square, and the 5th column is the
3rd column from the middle square.
Suppose 3, 2, 5, 4, 1 is the permutation drawn for variety. Then variety 3
is treatment A, 2 is B, 5 is C, 4 is D, and variety 1 is E. Now sow each subplot
with the variety given by the randomized Latin square on the right. Hence

the northwest corner of the square gets B = variety 2, the northeast corner
gets D = variety 4, the southwest corner gets A = variety 3, the southeast
corner gets C = variety 5, et cetera.

7.4 Summary

1) A block is a group of similar (homogeneous) units in that the units in a
block are expected to give similar values of the response if given the same
treatment.
2) In agriculture, adjacent plots of land are often used as blocks since
adjacent plots tend to give similar yields. Litter mates, siblings, twins, time
periods (e.g., different days), and batches of material are often used as blocks.
3) The completely randomized block design (CRBD) with k treatments
and b blocks of k units uses randomization within each block to assign
exactly one of the block's k units to each of the k treatments. This design is a
generalization of the matched pairs procedure used for k = 2.
4) The ANOVA F test for the completely randomized block design with
k treatments and b blocks is nearly the same as the fixed effects one way
ANOVA F test.
i) Ho: μ1 = μ2 = · · · = μk and HA: not Ho.
ii) Fo = MSTR/MSE is usually given by output.
iii) The pval = P(F_{k-1,(k-1)(b-1)} > Fo) is usually given by output.
iv) If the pval ≤ δ, reject Ho and conclude that the mean response depends
on the factor level. Otherwise fail to reject Ho and conclude that the mean
response does not depend on the factor level. Give a nontechnical sentence.
5) Shown below is an ANOVA table for the completely randomized block
design.

Source     df          SS    MS    F               p-value
Blocks     b-1         SSB   MSB   Fblock          pblock
Treatment  k-1         SSTR  MSTR  F0 = MSTR/MSE   pval for Ho
Error      (k-1)(b-1)  SSE   MSE

6) Rule of thumb: If pblock ≥ 0.1, then blocking was not useful. If 0.05 <
pblock < 0.1, then the usefulness was borderline. If pblock ≤ 0.05, then blocking
was useful.
7) The response, residual, and transformation plots for CRBD models are
used almost in the same way as for the one and two way Anova model, but all
of the dot plots have sample size m = 1. Look for the plotted points falling
in roughly evenly populated bands about the identity line and r = 0 line.
8) The block response scatterplot plots blocks versus the response.
The plot will have b dot plots of size k with a symbol corresponding to the
treatment. Dot plots with clearly different means suggest that blocking was
useful. A symbol pattern within the blocks suggests that the response depends
on the factor level.

9) Shown is an ANOVA table for the Latin square model given in
symbols. Sometimes "Error" is replaced by "Residual" or "Within Groups."
Sometimes rblocks and cblocks are replaced by the blocking factor name.
Sometimes "p-value" is replaced by "P", "Pr(> F)", or "PR > F."

Source      df          SS    MS    F               p-value
rblocks     a-1         SSRB  MSRB  Frow            prow
cblocks     a-1         SSCB  MSCB  Fcol            pcol
treatments  a-1         SSTR  MSTR  Fo = MSTR/MSE   pval
Error       (a-1)(a-2)  SSE   MSE

10) Let pblock be prow or pcol. Rule of thumb: If pblock ≥ 0.1, then blocking
was not useful. If 0.05 < pblock < 0.1, then the usefulness was borderline. If
pblock ≤ 0.05, then blocking was useful.
11) The ANOVA F test for the Latin square design with a treatments is
nearly the same as the fixed effects one way ANOVA F test.
i) Ho: μ1 = μ2 = · · · = μa and HA: not Ho.
ii) Fo = MSTR/MSE is usually given by output.
iii) The pval = P(F_{a-1,(a-1)(a-2)} > Fo) is usually given by output.
iv) If the pval ≤ δ, reject Ho and conclude that the mean response depends
on the factor level. Otherwise fail to reject Ho and conclude that the mean
response does not depend on the factor level. Give a nontechnical sentence.
12) The response, residual, and transformation plots for Latin square
designs are used almost in the same way as for the one and two way Anova
models, but all of the dot plots have sample size m = 1. Look for the plotted
points falling in roughly evenly populated bands about the identity line and
r = 0 line.
13) The randomization is done in 3 steps. Draw 3 random permutations
of 1, . . . , a. Use the 1st permutation to randomly assign row block levels to
the numbers 1, . . . , a. Use the 2nd permutation to randomly assign column
block levels to the numbers 1, . . . , a. Use the 3rd permutation to randomly
assign treatment levels to the 1st a letters (A, B, C, and D if a = 4).
14) Graphical Anova for the completely randomized block design
makes a dot plot of the scaled block deviations √(k - 1)(Ȳ0j - Ȳ00)
on top, a dot plot of the scaled treatment deviations (effects)
√(b - 1)(Ȳi0 - Ȳ00) in the middle, and a dot plot of the residuals on the
bottom. Here k is the number of treatments and b is the number of blocks.
15) Graphical Anova uses the residuals as a reference distribution. Suppose
the dot plot of the residuals looks good. Rules of thumb: i) An effect is
marginally significant if its scaled deviation is as big as the biggest residual
or as negative as the most negative residual. ii) An effect is significant if it is
well beyond the minimum or maximum residual. iii) Blocking was effective
if at least one scaled block deviation is beyond the range of the residuals.
iv) The treatments are different if at least one scaled treatment effect is
beyond the range of the residuals. (These rules depend on the number of
residuals n. If n is very small, say 8, then the scaled effect should be well
beyond the range of the residuals to be significant. If n is 40, the value
of the minimum residual and the value of the maximum residual correspond
to a 1/40 + 1/40 = 1/20 = 0.05 critical value for significance.)

7.5 Complements

Box et al. (2005, pp. 150-156) explain Graphical Anova for the CRBD and
why randomization combined with blocking often makes the iid error
assumption hold to a reasonable approximation.
It is easier to see model deficiencies if the response and residual plots are
square. In R, drag the plots so the plots look square. Matched pairs tests are
a special case of CRBD with k = 2.
The R package granova may be useful for graphical Anova. It is available
from (http://streaming.stat.iastate.edu/CRAN/) and authored by R.M.
Pruzek and J.E. Helmreich. Also see Hoaglin et al. (1991).
A randomization test has H0: the different treatments have no effect.
This null hypothesis is also true if within each block, all k pdfs are from the
same location family. Let j = 1, . . . , b index the b blocks. There are b pdfs, one
for each block, that come from the same location family but possibly different
location parameters: fZ(y - μ0j). Let A be the treatment factor with k levels
ai. Then Yij|(A = ai) ∼ fZ(y - μ0j) where j is fixed and i = 1, . . . , k.
Thus the levels ai have no effect on the response, and the Yij are iid within
each block if H0 holds. Note that there are k! ways to assign Y1j, . . . , Ykj
to the k treatments within each block. An impractical randomization test
uses all M = [k!]^b ways of assigning responses to treatments. Let F0 be the
usual CRBD F statistic. The F statistic is computed for each of the M
permutations and H0 is rejected if the proportion of the M F statistics that
are larger than F0 is less than δ. The distribution of the M F statistics is
approximately F_{k-1,(k-1)(b-1)} for large n under H0. The randomization test
and the usual CRBD F test also have the same power, asymptotically. See
Hoeffding (1952) and Robinson (1973). These results suggest that the usual
CRBD F test is semiparametric: the p-value is approximately correct if n is
large and if all k pdfs Yij|(A = ai) ∼ fZ(y - μ0j) are the same for each block
where j is fixed and i = 1, . . . , k. If H0 does not hold, then there are kb pdfs
Yij|(A = ai) ∼ fZ(y - μij) from the same location family. Hence the location
parameter depends on both the block and treatment.
Olive (2014, section 9.3) shows that practical randomization tests that
use a random sample of max(1000, [n log(n)]) randomizations have level and
power similar to the tests that use all M possible randomizations. Here each
randomization uses b randomly drawn permutations of 1, . . . , k.
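A practical randomization test of this kind can be sketched as follows. The Python code below is an illustration (the data are the soybean germination values from Example 7.3, and the function names are illustrative, not lregpack's); each randomization permutes the responses within each block:

```python
# Practical randomization test for the CRBD (sketch).
import random

def crbd_F(data):  # data[j][i]: response for block j, treatment i
    b, k = len(data), len(data[0])
    grand = sum(sum(r) for r in data) / (b * k)
    tmeans = [sum(data[j][i] for j in range(b)) / b for i in range(k)]
    bmeans = [sum(r) / k for r in data]
    SSTR = b * sum((m - grand) ** 2 for m in tmeans)
    SSB = k * sum((m - grand) ** 2 for m in bmeans)
    SST = sum((y - grand) ** 2 for r in data for y in r)
    SSE = SST - SSB - SSTR
    return (SSTR / (k - 1)) / (SSE / ((k - 1) * (b - 1)))

def randomization_pval(data, reps=1000, seed=0):
    rng = random.Random(seed)
    F0 = crbd_F(data)
    count = 0
    for _ in range(reps):
        # permute the responses within each block (b permutations of 1..k)
        perm = [rng.sample(row, len(row)) for row in data]
        if crbd_F(perm) >= F0:
            count += 1
    return count / reps

data = [[8, 2, 4, 3, 9], [10, 6, 10, 5, 7], [12, 7, 9, 9, 5],
        [13, 11, 8, 10, 5], [11, 5, 10, 6, 3]]
print(randomization_pval(data))  # close to the F distribution p-value near 0.02
```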

Hunter (1989) discusses some problems with the Latin square design.
Welch (1990) suggests that the ANOVA F test is not a good
approximation for the permutation test for the Latin square design.

7.6 Problems

Problems with an asterisk * are especially important.


Output for 7.1.
source Df Sum Sq Mean Sq F value Pr(>F)
block 4 49.84 12.46 2.3031 0.10320
seed 4 83.84 20.96 3.8743 0.02189
Residuals 16 86.56 5.41
7.1. Snedecor and Cochran (1967, p. 300) give a data set with 5 types of
soybean seed. The response frate = number of seeds out of 100 that failed
to germinate. Five blocks were used. Assume the appropriate model can be
used (although this assumption may not be valid due to a possible interaction
between the block and the treatment).
a) Did blocking help? Explain briefly.
b) Perform the appropriate 4 step test using the output above.
Output for 7.2.
Source df SS MS F P
blocks 3 197.004 65.668 9.12 0.001
treatment 5 201.316 40.263 5.59 0.004
error 15 108.008 7.201
7.2. Current nitrogen fertilization recommendations for wheat include
applications of specified amounts at specified stages of plant growth. The
treatment consisted of six different nitrogen application and rate schedules.
The wheat was planted in an irrigated field that had a water gradient in one
direction as a result of the irrigation. The field plots were grouped into four
blocks, each consisting of six plots, such that each block occurred in the same
part of the water gradient. The response was the observed nitrogen content
from a sample of wheat stems from each plot. The experimental units were
the 24 plots. Data is from Kuehl (1994, p. 263).
a) Did blocking help? Explain briefly.
b) Perform the appropriate 4 step test using the output above.


7.3. An experimenter wanted to test 4 types of an altimeter. There were
eight helicopter pilots available for hire with from 500 to 3000 flight hours of
experience. The response variable was the altimeter reading error. Perform
the appropriate 4 step test using the output below. Data is from Kirk (1982,
p. 244).

Output for Problem 7.3


Source df SS MS F P
treatment 3 194.50 64.833 47.78 0.000
blocks 7 12.50 1.786 1.32
error 21 28.50 1.357

One way randomized block designs in SAS, Minitab, and R

7.4. This problem is for a one way block design and uses data from Box
et al. (2005, p. 146).
a) Copy and paste the SAS program for this problem from
(http://lagrange.math.siu.edu/Olive/lreghw.txt). Print out the
output but only turn in the ANOVA table, residual plot, and response plot.
b) Do the plots look ok?
c) Copy the SAS data into Minitab much as done for Problem 6.4. Right
below C1 type block, below C2 type treat, and below C3 type yield.
d) Select Stat>ANOVA>Two-way, select C3 yield as the response and
C1 block as the row factor and C2 treat as the column factor. Click on
Fit additive model, click on Store residuals, and click on Store fits.
Then click on OK.
e) block response scatterplot: Use the menu commands Edit>Command
Line Editor and write the following lines in the window.
GSTD
LPLOT yield vs block codes for treat

f) Click on the submit commands box and print the plot. Click on the
output and then click on the printer icon.
g) Copy (http://lagrange.math.siu.edu/Olive/lregdata.txt) into R.
Type the following commands to get the following ANOVA table.
z<-aov(yield~block+treat,pen)
summary(z)

Df Sum Sq Mean Sq F value Pr(>F)


block 4 264.000 66.000 3.5044 0.04075 *
treat 3 70.000 23.333 1.2389 0.33866
Residuals 12 226.000 18.833

h) Did blocking appear to help?


i) Perform a 4 step F test for whether yield depends on treatment.

Latin Square Designs in SAS and R


(Latin square designs can be fit by Minitab, but not with the Student's
version of Minitab.)
For some of the following problems, the R commands can be copied and
pasted from (http://lagrange.math.siu.edu/Olive/lreghw.txt) into R.

7.5. This problem is for a Latin square design and uses data from Box
et al. (2005, pp. 157160).
244 7 Block Designs

Copy and paste the SAS program for this problem from
(http://lagrange.math.siu.edu/Olive/lreghw.txt).
a) Click on the output and use the menu commands Edit>Select All
and Edit>Copy. In Word use the menu command Paste, then use the
left mouse button to highlight the first page of output. Then use the menu
command Cut. Then there should be one page of output including the
ANOVA table. Print out this page.
b) Copy the data for this problem from
(http://lagrange.math.siu.edu/Olive/lregdata.txt)
into R. Use the following commands to create a residual plot. Copy and paste
the plot into Word. (Click on the plot and simultaneously hit the Ctrl and c
buttons. Then go to Word and use the menu command Paste.)
z<-aov(emissions~rblocks+cblocks+additives,auto)
summary(z)
plot(fitted(z),resid(z))
title("Residual Plot")
abline(0,0)
c) Use the following commands to create a response plot. Copy and paste
the plot into Word. (Click on the plot and simultaneously hit the Ctrl and c
buttons. Then go to Word and use the menu command Paste.)
attach(auto)
FIT <- auto$emissions - z$resid
plot(FIT,auto$emissions)
title("Response Plot")
abline(0,1)
detach(auto)
d) Do the plots look ok?
e) Were the column blocks useful? Explain briefly.
f) Were the row blocks useful? Explain briefly.
g) Do an appropriate 4 step test.
7.6. Obtain the Box et al. (2005, p. 146) penicillin data from
(http://lagrange.math.siu.edu/Olive/lregdata.txt) and the R program
ganova2 from (http://lagrange.math.siu.edu/Olive/lregpack.txt).
The program does graphical Anova for completely randomized block
designs.
a) Copy and paste the R commands for this problem into R. Include the
plot in Word by simultaneously pressing the Ctrl and c keys, then using the
menu command Paste in Word.
b) Blocking seems useful because some of the scaled block deviations are
outside of the spread of the residuals. The scaled treatment deviations are in
the middle of the plot. Do the treatments appear to be significantly different?
Chapter 8
Orthogonal Designs

Orthogonal designs for factors with two levels can be fit using least squares.
The orthogonality of the contrasts allows each coefficient to be estimated
independently of the other variables in the model.
This chapter covers 2^k factorial designs, 2^(k-f)_R fractional factorial
designs, and Plackett Burman PB(n) designs. The entries in the design matrix
X are either -1 or 1. The columns of the design matrix X are orthogonal:
c_i^T c_j = 0 for i ≠ j where c_i is the ith column of X. Also c_i^T c_i = n,
and the absolute values of the column entries sum to n.
The first column of X is 1, the vector of ones, but the remaining columns
of X are the coefficients of a contrast. Hence the ith column c_i has entries
that are -1 or 1, and the entries of the ith column c_i sum to 0 for i > 1.

8.1 Factorial Designs

Factorial designs are a special case of the k way Anova designs of Chapter 6,
and these designs use factorial crossing to compare the effects (main effects,
pairwise interactions, ..., k-fold interaction) of the k factors. If A_1, ..., A_k
are the factors with l_i levels for i = 1, ..., k, then there are l_1 l_2 · · · l_k
treatments where each treatment uses exactly one level from each factor. The
sample size n = m ∏_{i=1}^k l_i ≥ m 2^k. Hence the sample size grows
exponentially fast with k. Often the number of replications m = 1.

Definition 8.1. An experiment has n runs where a run is used to measure
a response. A run is a treatment = a combination of k levels. So each run
uses exactly one level from each of the k factors.

Often each run is expensive, for example, in industry and medicine. A goal
is to improve the product in terms of higher quality or lower cost. Often the

subject matter experts can think of many factors that might improve the
product. The number of runs n is minimized by taking l_i = 2 for i = 1, ..., k.

Definition 8.2. A 2^k factorial design is a k way Anova design where each
factor has two levels: low = -1 and high = 1. The design uses n = m2^k runs.
Often the number of replications m = 1. Then the sample size n = 2^k.
A 2^k factorial design is used to screen potentially useful factors. Usually
at least k = 3 factors are used, and then 2^3 = 8 runs are needed. Often
the units are time slots, and each time slot is randomly assigned to a run
= treatment. The subject matter experts should choose the two levels. For
example, a quantitative variable such as temperature might be set at 80°F
coded as -1 and 100°F coded as 1, while a qualitative variable such as type
of catalyst might have catalyst A coded as -1 and catalyst B coded as 1.

Improving a process is sequential and iterative. Often high values of
the response are desirable (e.g. yield), but often low values of the response
are desirable (e.g. number of defects). Industrial experiments have a budget.
The initial experiment may suggest additional factors that were omitted,
suggest new sets of two levels, and suggest that many initial factors were
not important or that the factor is important, but the level of the factor is
not. (For example, one factor could be a catalyst with chemical yield as the
response. It is possible that both levels of the catalyst produce about the
same yield, but the yield would be 0 if the catalyst was not used. Then the
catalyst is an important factor, but the yield did not depend on the level of
catalyst used in the experiment.)
Suppose k = 5 and A, B, C, D, and E are factors. Assume high response
is desired and high levels of A and C correspond to high response where A
is qualitative (e.g. 2 brands) and C is quantitative but set at two levels (e.g.
temperature at 80 and 100°F). Then the next stage may use an experiment
with factor A at its high level and at a new level (e.g. a new brand) and
C at the highest level from the previous experiment and at a higher level
determined by subject matter experts (e.g. at 100 and 120°F).

Rule of thumb 8.1. Do not spend more than 25% of the budget on the
initial experiment. It may be a good idea to plan for four experiments, each
taking 25% of the budget.
Definition 8.3. Recall that a contrast C = Σ_{i=1}^p d_i μ_i where
Σ_{i=1}^p d_i = 0, and the estimated contrast is Ĉ = Σ_{i=1}^p d_i Ȳ_{i0}
where μ_i and Ȳ_{i0} are appropriate population and sample means. In a
table of contrasts, the coefficients d_i of the contrast are given where a -
corresponds to -1 and a + corresponds to 1. Sometimes a column I
corresponding to the overall mean is given where each entry is a +. The
column corresponding to I is not a contrast.

To make a table of contrasts there is a rule for main effects and a rule for
interactions.

a) In a table of contrasts, the column for A starts with a - then a + and
the pattern repeats. The column for B starts with 2 -'s and then 2 +'s and
the pattern repeats. The column for C starts with 4 -'s and then 4 +'s and
the pattern repeats. The column for the ith main effects factor starts with
2^(i-1) -'s and 2^(i-1) +'s and the pattern repeats where i = 1, ..., k.
b) In a table of contrasts, a column for an interaction containing several
factors is obtained by multiplying the columns for each factor where + = 1
and - = -1. So the column for ABC is obtained by multiplying the column
for A, the column for B, and the column for C.
A table of contrasts for a 2^3 design is shown below. The first column is
for the mean and is not a contrast. The last column corresponds to the cell
means. Note that ȳ_1110 = y_111 if m = 1. So ȳ might be replaced by y if
m = 1. Each row corresponds to a run. Only the levels of the main effects A,
B, and C are needed to specify each run. The first row of the table
corresponds to the low levels of A, B, and C. Note that the divisors are
2^(k-1) except for the divisor of I which is 2^k where k = 3.

I A B C AB AC BC ABC    y
+ - - -  +  +  +   -  ȳ_1110
+ + - -  -  -  +   +  ȳ_2110
+ - + -  -  +  -   +  ȳ_1210
+ + + -  +  -  -   -  ȳ_2210
+ - - +  +  -  -   +  ȳ_1120
+ + - +  -  +  -   -  ȳ_2120
+ - + +  -  -  +   -  ȳ_1220
+ + + +  +  +  +   +  ȳ_2220
divisor 8 4 4 4 4 4 4 4
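Rules a) and b) above are mechanical, so they are easy to check with a short program. The following Python sketch (an illustration only; the book's software is in R) builds the 2^3 table of contrasts from the two rules and verifies that the columns are orthogonal contrasts.

```python
# Build the 2^3 table of contrasts from rules a) and b) above.
# Rule a): the column for the ith main effect repeats a block of
# 2^(i-1) entries of -1 followed by 2^(i-1) entries of +1.
# Rule b): an interaction column is the entrywise product of its
# main effect columns, with + = 1 and - = -1.
k = 3
n = 2 ** k

def main_effect(i):
    block = [-1] * 2 ** (i - 1) + [1] * 2 ** (i - 1)
    return block * (n // len(block))

def interact(*cols):
    out = [1] * n
    for col in cols:
        out = [o * c for o, c in zip(out, col)]
    return out

A, B, C = main_effect(1), main_effect(2), main_effect(3)
ABC = interact(A, B, C)

# Each contrast column sums to 0, and distinct columns are orthogonal.
columns = [A, B, C, interact(A, B), interact(A, C), interact(B, C), ABC]
assert all(sum(col) == 0 for col in columns)
assert all(sum(u * v for u, v in zip(columns[i], columns[j])) == 0
           for i in range(7) for j in range(i + 1, 7))
```

The first run works out to (-,-,-) for A, B, C, hence + for AB, AC, BC and - for ABC, matching the first row of the table above.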

The table of contrasts for a 2^4 design is shown on the following page. The
column of ones corresponding to I was omitted. Again rows correspond to
runs and the levels of the main effects A, B, C, and D completely specify the
run. The first row of the table corresponds to the low levels of A, B, C, and
D. In the second row, the level of A is high while B, C, and D are low. Note
that the interactions are obtained by multiplying the component columns
where + = 1 and - = -1. Hence the first row of the column corresponding
to the ABC entry is (-)(-)(-) = -.
Randomization for a 2^k design: The runs are determined by the levels
of the k main effects in the table of contrasts. So a 2^3 design is determined
by the levels of A, B, and C. Similarly, a 2^4 design is determined by the
levels of A, B, C, and D. Randomly assign units to the m2^k runs. Often the
units are time slots. If possible, perform the m2^k runs in random order.
Genuine run replicates need to be used. A common error is to take m
measurements per run, and act as if the m measurements are from m runs.
If as a data analyst you encounter this error, average the m measurements
into a single value of the response.
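Such pre-analysis averaging might look like the following Python sketch (the measurement values are made up for illustration):

```python
# m = 2 measurements were mistakenly recorded per run; they are not
# genuine replicates, so collapse them to one response per run before
# fitting. The measurement values below are hypothetical.
measurements = {1: [14.2, 13.8], 2: [16.4, 15.6], 3: [8.1, 7.9]}
response = {run: sum(vals) / len(vals) for run, vals in measurements.items()}
```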
run A B C D AB AC AD BC BD CD ABC ABD ACD BCD ABCD
1 - - - - + + + + + + - - - - +
2 + - - - - - - + + + + + + - -
3 - + - - - + + - - + + + - + -
4 + + - - + - - - - + - - + + +
5 - - + - + - + - + - + - + + -
6 + - + - - + - - + - - + - + +
7 - + + - - - + + - - - + + - +
8 + + + - + + - + - - + - - - -
9 - - - + + + - + - - - + + + -
10 + - - + - - + + - - + - - + +
11 - + - + - + - - + - + - + - +
12 + + - + + - + - + - - + - - -
13 - - + + + - - - - + + + - - +
14 + - + + - + + - - + - - + - -
15 - + + + - - - + + + - - - + -
16 + + + + + + + + + + + + + + +

Denition 8.4. If the response depends on the two levels of the factor,
then the factor is called active. If the response does not depend on the two
levels of the factor, then the factor is called inert.

Active factors appear to change the mean response as the level of the factor
changes from -1 to 1. Inert factors do not appear to change the response as
the level of the factor changes from -1 to 1. An inert factor could be needed
but the level low or high is not important, or the inert factor may not be
needed and so can be omitted from future studies. Often subject matter
experts can tell whether the inert factor is needed or not.
The 2^k designs are used for exploratory data analysis: they provide
answers to the following questions.
i) Which combinations of levels are best?
ii) Which factors are active and which are inert? That is, use the 2^k design
to screen for factors where the response depends on whether the level is high
or low.
iii) How should the levels be modified to improve the response?

If all 2^k runs give roughly the same response, then choose the levels that
are cheapest to increase profit. Also the system tends to be robust to changes
in the factor space so managers do not need to worry about the exact values
of the levels of the factors.
In an experiment, there will be an interaction between management,
subject matter experts (often engineers), and the data analyst (statistician).
Remark 8.1. If m = 1, then there is one response per run but k main
effects, C(k,2) 2 factor interactions, C(k,j) j factor interactions, and 1 k way
interaction. Then the MSE df = 0 unless at least one high order interaction
is assumed to be zero. A full model that includes all k main effects and
all C(k,2) two way interactions is a useful starting point for response,
residual, and transformation plots. The higher order interactions can be
treated as potential terms and checked for significance. As a rule of thumb,
significant interactions tend to involve significant main effects.

Definition 8.5. An outlier corresponds to a case that is far from the
bulk of the data.

Rule of thumb 8.2. Mentally add 2 lines parallel to the identity line and
2 lines parallel to the r = 0 line that cover most of the cases. Then a case
is an outlier if it is well beyond these 2 lines. This rule often fails for large
outliers since often the identity line goes through or near a large outlier so
its residual is near zero. Often such outliers are still far from the bulk of
the data, and there will be a gap in the response plot (along the identity
line) separating the bulk of the data from the outliers. Such gaps appear in
Figures 3.7, 3.10b) (in an FF plot), 3.11, and 7.3 where the gap would be
easier to see if the plot was square. A response that is far from the bulk of
the data in the response plot is a large outlier (large in magnitude).
Rule of thumb 8.3. Often an outlier is very good, but more often an
outlier is due to a measurement error and is very bad.
Definition 8.6. A critical mix is a single combination of levels, out of
the 2^k runs, that gives good results. Hence a critical mix produces good
outliers (or a single outlier if m = 1).

Be able to pick out active and inert factors and good (or the best)
combinations of factors (cells or runs) from the table of contrasts = table of
runs. Often the table will only contain the contrasts for the main effects. If
high values of the response are desirable, look for high values of ȳ for m > 1.
If m = 1, then ȳ = y. The following two examples help illustrate the process.

O H C  y
- - - 5.9
+ - - 4.0
- + - 3.9
+ + - 1.2
- - + 5.3
+ - + 4.8
- + + 6.3
+ + + 0.8
Example 8.1. Box et al. (2005, pp. 209-210) describes a 2^3 experiment
with the goal of reducing the wear rate of deep groove bearings. Here m = 1
so n = 8 runs were used. The 2^3 design employed two levels of osculation
(O), two levels of heat treatment (H), and two different cage designs (C).
The response Y is the bearing failure rate and low values of the observed
response y are better than high values.
a) Which two combinations of levels are the best?
b) If two factors are active, which factor is inert?

Solution: a) The two lowest values of y are 0.8 and 1.2 which correspond
to + + + and + + -. (Note that if the 1.2 was 4.2, then + + + corresponding
to 0.8 would be a critical mix.)
b) C would be inert since O and H should be at their high + levels.

run  R T C D  y
 1   - - - - 14
 2   + - - - 16
 3   - + - -  8
 4   + + - - 22
 5   - - + - 19
 6   + - + - 37
 7   - + + - 20
 8   + + + - 38
 9   - - - +  1
10   + - - +  8
11   - + - +  4
12   + + - + 10
13   - - + + 12
14   + - + + 30
15   - + + + 13
16   + + + + 30

Example 8.2. Ledolter and Swersey (2007, p. 80) describes a 2^4
experiment for a company that manufactures clay pots to hold plants. For one
of the company's newest products, there had been an unacceptably high
number of cracked pots. The production engineers believed that the following
factors are important: R = rate of cooling (slow or fast), T = kiln
temperature (2000°F or 2060°F), C = coefficient of expansion of the clay (low
or high), and D = type of conveyor belt (metal or rubberized) used to allow
employees to handle the pots. The response y is the percentage of cracked
pots per run (so small y is good).

a) For fixed levels of R, T, and C, is the D+ level or D- level of D better
(compare run 1 with run 9, 2 with 10, ..., 8 with 16)?
b) Fix D at the better level. Is the C- or C+ level better?
c) Fix C and D at the levels found in a) and b). Is the R- or R+ level
better?
d) Which factor seems to be inert?
Solution: a) D+ since for fixed levels of R, T, and C, the number of cracks
is lower if D = + than if D = -.
b) C-, c) R-, d) T.
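The paired comparisons in part a) can be verified directly from the run table (a Python check, using the responses listed above):

```python
# Responses for runs 1-16 of the 2^4 pot experiment; runs 1-8 have
# D = -1, and runs 9-16 repeat the same R, T, C settings with D = +1.
y = [14, 16, 8, 22, 19, 37, 20, 38,
     1, 8, 4, 10, 12, 30, 13, 30]

# Compare run i with run i + 8: same R, T, C, opposite level of D.
pairs = [(y[i], y[i + 8]) for i in range(8)]
d_plus_wins = sum(high_d < low_d for low_d, high_d in pairs)
```

All 8 pairs have fewer cracked pots at D = +, which is why the solution picks D+.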

A 2^k design can be fit with least squares. In the table of contrasts let a
+ = 1 and a - = -1. The design matrix X needs a row for each response:
we can't use the mean response for each fixed combination of levels. Let x_0
correspond to I, the column of 1s. Let x_i correspond to the ith main effect
for i = 1, ..., k. Let x_ij correspond to 2 factor interactions, and let
x_{i_1,...,i_G} correspond to G way interactions for G = 2, ..., k. Let the
design matrix X have columns corresponding to the x's. Then X will have
n = m2^k rows. Let y be the vector of responses.
The table below relates the quantities in the 2^3 table of contrasts with the
quantities used in least squares when the design matrix

X = [x_0, x_1, x_2, x_3, x_12, x_13, x_23, x_123].

Software often does not need the column of ones x_0.

x_0 x_1 x_2 x_3 x_12 x_13 x_23 x_123 y
 I   A   B   C   AB   AC   BC   ABC  y
The table below relates quantities in the 2^4 table of contrasts with the
quantities used in least squares. Again the omitted x_0 corresponds to I, the
column of ones, while y is the vector of responses.
x_1 x_2 x_3 x_4 x_12 x_13 x_14 x_23 x_24 x_34 x_123 x_124 x_134 x_234 x_1234
 A   B   C   D   AB   AC   AD   BC   BD   CD   ABC   ABD   ACD   BCD   ABCD
Definition 8.7. The least squares model for a 2^k design contains a
least squares population coefficient β for each x in the model. The model
can be written as Y = x^T β + e with least squares fitted values Ŷ = x^T β̂.
In matrix form the model is Y = Xβ + e and the vector of fitted values is
Ŷ = Xβ̂. The biggest possible model contains all of the terms. The second
order model contains β_0, all main effects, and all second order interactions,
and is recommended as the initial full model for k ≥ 3. The main effects
model removes all interactions. If a model contains an interaction, then the
model should also contain all of the corresponding main effects. Hence if a
model contains x_123, the model should contain x_1, x_2, and x_3.

Definition 8.8. The coefficient β_0 corresponding to I is equal to the
population I effect of x_0, and the (sample) I effect = β̂_0. For an x other
than x_0, the population effect for x is 2β, the change in Y as x changes two
units from -1 to 1, and the (sample) effect is 2β̂. The (sample) coefficient β̂
estimates the population coefficient β.
Suppose the model using all of the columns of X is used. If some columns
are removed (e.g. those corresponding to the insignificant effects), then for
2^k designs the following quantities remain unchanged for the terms that
were not deleted: the effects, the coefficients, and SS(effect) = MS(effect).
The MSE, SE(effect), F and t statistics, pvalues, fitted values, and residuals
do change.
The regression equation corresponding to the significant effects (e.g. found
with a QQ plot of Definition 8.9) can be used to form a reduced model. For
example, suppose the full (least squares) fitted model is Ŷ_i = β̂_0 + β̂_1 x_i1 +
β̂_2 x_i2 + β̂_3 x_i3 + β̂_12 x_i12 + β̂_13 x_i13 + β̂_23 x_i23 + β̂_123 x_i123. Suppose
the A, B, and AB effects are significant. Then the reduced (least squares)
fitted model is Ŷ_i = β̂_0 + β̂_1 x_i1 + β̂_2 x_i2 + β̂_12 x_i12 where the coefficients
(β̂'s) for the reduced model can be taken from the full model since the 2^k
design is orthogonal.
The coefficient β̂_0 corresponding to I is equal to the I effect, but the
coefficient β̂ of a factor x corresponding to an effect is β̂ = 0.5 (effect).
Consider significant effects and assume interactions can be ignored.
i) If a large response Y is desired and β̂ > 0, use x = 1. If β̂ < 0, use
x = -1.
ii) If a small response Y is desired and β̂ > 0, use x = -1. If β̂ < 0, use
x = 1.
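Rules i) and ii) can be checked by brute force over the level combinations. The sketch below uses hypothetical coefficients for a reduced model with two main effects and their interaction (the numbers are made up for illustration):

```python
# Hypothetical reduced fitted model:
# Yhat = b0 + b1*x1 + b2*x2 + b12*x1*x2 (coefficients are made up).
b0, b1, b2, b12 = 64.25, 11.5, -2.5, 0.75

def yhat(x1, x2):
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2

# Ignoring the small interaction, rule i) says: b1 > 0 so take x1 = 1,
# and b2 < 0 so take x2 = -1 when a large response is desired.
levels = [(x1, x2) for x1 in (-1, 1) for x2 in (-1, 1)]
best = max(levels, key=lambda t: yhat(*t))
```

Enumerating all four combinations confirms that the sign rule picks the maximizer.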
Rule of thumb 8.4. To predict Y with Ŷ, the number of coefficients =
the number of β̂'s in the model should be ≤ n/2, where the sample size n =
number of runs. Otherwise the model is overfitting.
From the regression equation Ŷ = x^T β̂, be able to predict Y given x. Be
able to tell whether x = 1 or x = -1 should be used. Given the x values
of the main effects, get the x values of the interactions by multiplying the
columns corresponding to the main effects.
Least squares output in symbols is shown below. Often "Estimate" is
replaced by "Coef" or "Coefficient". Often "Intercept" is replaced by
"Constant". The t statistic and pvalue are for whether the term or effect is
significant. So t_12 and p_12 are for testing whether the x_12 term or AB
effect is significant.

                       Coef or Est.  Std.Err   t      pvalue
Intercept or constant  β̂_0           SE(coef)  t_0    p_0
x_1                    β̂_1           SE(coef)  t_1    p_1
x_2                    β̂_2           SE(coef)  t_2    p_2
x_3                    β̂_3           SE(coef)  t_3    p_3
x_12                   β̂_12          SE(coef)  t_12   p_12
x_13                   β̂_13          SE(coef)  t_13   p_13
x_23                   β̂_23          SE(coef)  t_23   p_23
x_123                  β̂_123         SE(coef)  t_123  p_123
The least squares coefficient β̂ = 0.5 (effect). The sum of squares for an x
corresponding to an effect is equal to SS(effect). SE(coef) = SE(β̂) = 0.5
SE(effect) = √(MSE/n). Also SE(β̂_0) = √(MSE/n).
Example 8.3. a) The biggest possible model for the 2^3 design is Y =
β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + β_12 x_12 + β_13 x_13 + β_23 x_23 +
β_123 x_123 + e with least squares fitted or predicted values given by Ŷ_i =
β̂_0 + β̂_1 x_i1 + β̂_2 x_i2 + β̂_3 x_i3 + β̂_12 x_i12 + β̂_13 x_i13 + β̂_23 x_i23 +
β̂_123 x_i123.
The second order model is Y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + β_12 x_12 +
β_13 x_13 + β_23 x_23 + e. The main effects model is Y = β_0 + β_1 x_1 +
β_2 x_2 + β_3 x_3 + e.
b) A typical least squares output for the 2^3 design with m = 2 is shown
below. Often "Estimate" is replaced by "Coef".
Residual Standard Error=2.8284 = sqrt(MSE)
R-Square=0.9763 F-statistic (df=7, 8)=47.054 pvalue=0

Estimate Std.Err t-value Pr(>|t|)


Intercept 64.25 0.7071 90.8632 0.0000
x1 11.50 0.7071 16.2635 0.0000
x2 -2.50 0.7071 -3.5355 0.0077
x3 0.75 0.7071 1.0607 0.3198
x12 0.75 0.7071 1.0607 0.3198
x13 5.00 0.7071 7.0711 0.0001
x23 0.00 0.7071 0.0000 1.0000
x123 0.25 0.7071 0.3536 0.7328
c) i) The least squares coefficient or estimate = effect/2. So in the above
table, the A effect = 2(11.5) = 23. If x corresponds to the least squares
coefficient β̂, then the coefficient β̂ = (x^T y)/(x^T x).
ii) The sum of squares = mean square corresponding to an x is equal to the
sum of squares = mean square of the corresponding effect. If x corresponds
to the least squares coefficient β̂, then the SS = MS = (x^T y)²/(x^T x).
iii) Suppose m ≥ 2. Then SE(coef) = SE(effect)/2 = 0.5 √(MSE/(m2^(k-2))).
Hence in the above table, SE(effect) = 2(0.7071) = 1.4142.
iv) The t statistic t_0 = coef/SE(coef), and t_0² = F_0 where t_0 ∼ t_dfe and
F_0 ∼ F_{1,dfe} where dfe = (m - 1)2^k is the MSE df. Hence the pvalues for
least squares and the 2^k software are the same. For example, the pvalue for
testing the significance of x_1 = pvalue for testing significance of the A effect
= 0.000 in the above table. Also t_A = 16.2635 and t_A² = F_A = 264.501.
v) The MSE, fitted values, and residuals are the same for the least squares
output and the 2^k software.
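The identities in c) can be confirmed numerically from the output above (a quick Python check; the numbers are copied from the table for the A term):

```python
# Numbers from the least squares output above for the A term (x1).
coef_A, se_coef, t_A = 11.50, 0.7071, 16.2635

effect_A = 2 * coef_A    # coefficient = effect/2
se_effect = 2 * se_coef  # SE(effect) = 2 SE(coef)
F_A = t_A ** 2           # t^2 = F for a 1 df effect
```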
Suppose the two levels of the quantitative variable are a < b and x is
the actual value used. Then code x as

c = (2x - (a + b)) / (b - a).

Note that the code gives c = -1 for x = a and c = 1 for x = b. Thus if the
2 levels are a = 100 and b = 200 but x = 187 is observed, then code x as
c = [2(187) - (100 + 200)]/[200 - 100] = 0.74.
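The coding formula is easy to wrap as a small helper (Python, for illustration):

```python
def code_level(x, a, b):
    # c = (2x - (a + b)) / (b - a) maps a -> -1, b -> +1,
    # and interior values of [a, b] to values in (-1, 1).
    return (2 * x - (a + b)) / (b - a)
```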
There are several advantages to least squares over 2^k software. The
disadvantage of the following four points is that the design will no longer be
orthogonal: the estimated coefficients and hence the estimated effects will
depend on the terms in the model. i) If there are several missing values or
outliers, delete the corresponding rows from the design matrix X and the
vector of responses y as long as the number of rows of the design matrix ≥
the number of columns. ii) If the exact quantitative levels are not observed,
replace them by the observed levels c_x in the design matrix. iii) If the wrong
levels are used in a run, replace the corresponding row in the design matrix
by a row corresponding to the levels actually used. iv) The number of
replications per run i can be m_i, that is, we do not need m_i ≡ m.

Definition 8.9. A normal QQ plot is a plot of the effects versus standard
normal percentiles. There are L = 2^k - 1 effects for a 2^k design.

Rule of thumb 8.5. The nonsignificant effects tend to follow a line
closely in the middle of the QQ plot while the significant effects do not
follow the line closely. Significant effects will be the most negative or the
most positive effects.
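The plotting coordinates behind Definition 8.9 and Rule of thumb 8.5 can be computed with the standard normal quantile function. Here is a small Python sketch; the effect values and the plotting-position formula (i - 0.5)/L are illustrative choices, not the book's exact recipe:

```python
# QQ plot coordinates: sorted effects vs. standard normal percentiles.
from statistics import NormalDist

effects = [-2.166, 2.500, -2.000, -0.333, -0.167, 0.167, 0.000]
L = len(effects)  # L = 2^k - 1 = 7 effects for a 2^3 design

# Plotting positions (i - 0.5)/L for i = 1, ..., L; one common choice.
quantiles = [NormalDist().inv_cdf((i - 0.5) / L) for i in range(1, L + 1)]
pairs = list(zip(quantiles, sorted(effects)))
```

The effects near 0 fall on a line through the middle of the plot; the extreme points (here -2.166, -2.000, and 2.500) are the candidates for significant effects.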
Know how to find the effect, the standard error of the effect, the sum
of squares for an effect, and a confidence interval for the effect from a table
of contrasts using the following rules.
Let c be a column from the table of contrasts where + = 1 and - = -1.
Let ȳ be the column of cell means. Then the effect corresponding to c is

    effect = c^T ȳ / 2^(k-1).                                    (8.1)

If the number of replications m ≥ 2, then the standard error for the effect is

    SE(effect) = √(MSE / (m2^(k-2))).                            (8.2)

Sometimes MSE is replaced by σ̂².

    SE(mean) = √(MSE / (m2^k))                                   (8.3)

where m2^k = n, m ≥ 2, and sometimes MSE is replaced by σ̂².
The sum of squares for an effect is also the mean square for the effect since
df = 1.

    MS(effect) = SS(effect) = m2^(k-2) (effect)²                 (8.4)

for m ≥ 1.
A 95% confidence interval (CI) for an effect is

    effect ± t_{dfe,0.975} SE(effect)                            (8.5)

where dfe is the MSE degrees of freedom. Use t_{dfe,0.975} ≈ z_{0.975} = 1.96
if dfe > 30. The effect is significant if the CI does not contain 0, while the
effect is not significant if the CI contains 0.
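As a numeric illustration of Equation (8.5) in the dfe > 30 case (so t ≈ 1.96), with made-up values for the effect and its standard error:

```python
# 95% CI for an effect: effect +/- t_{dfe,0.975} SE(effect).
# Illustrative values; with dfe > 30, t_{dfe,0.975} ~ 1.96.
effect, se_effect = 23.0, 1.4142
half_width = 1.96 * se_effect
ci = (effect - half_width, effect + half_width)
significant = not (ci[0] <= 0.0 <= ci[1])  # CI excludes 0 => significant
```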

Rule of thumb 8.6. Suppose there is no replication so m = 1. Find J
interaction mean squares that are small compared to the bulk of the mean
squares. Add them up to make MSE with dfe = J. So

    MSE = (sum of the small MS's) / J.

This method uses data snooping and MSE tends to underestimate σ². So
the F test statistics are too large and the pvalues too small. Use this method
for exploratory data analysis, not for inference based on the F distribution.
Rule of thumb 8.7. MS(effect) = SS(effect) ≈ σ² χ²_1 ≈ MSE χ²_1 if the
effect is not significant. MSE ≈ σ² χ²_dfe / dfe if the model holds. A rule of
thumb is that an effect is significant if MS > 5 MSE. The rule comes from
the fact that χ²_{1,0.975} ≈ 5.
Below is the ANOVA table for a 2^3 design. Suppose m = 1. For A, use
H_0: μ_100 = μ_200. For B, use H_0: μ_010 = μ_020. For C, use H_0: μ_001 =
μ_002. For interaction, use H_0: no interaction. If m > 1, the subscripts need
an additional 0, e.g. H_0: μ_1000 = μ_2000.

Source  df          SS     MS     F      p-value
A       1           SSA    MSA    F_A    p_A
B       1           SSB    MSB    F_B    p_B
C       1           SSC    MSC    F_C    p_C
AB      1           SSAB   MSAB   F_AB   p_AB
AC      1           SSAC   MSAC   F_AC   p_AC
BC      1           SSBC   MSBC   F_BC   p_BC
ABC     1           SSABC  MSABC  F_ABC  p_ABC
Error   (m - 1)2^k  SSE    MSE

Following Rule of thumb 8.6, if m = 1, pool J interaction mean squares
that are small compared to the bulk of the data into an MSE with dfe = J.
Such tests are for exploratory purposes only: the MSE underestimates σ²,
so the F test statistics are too large and the pvalues = P(F_{1,J} > F_0) are
too small. (Actually the pvalue = pval, an estimated pvalue.) For example
F_0 = F_A = MSA/MSE. As a convention for using an F table, use the
denominator df closest to dfe = J, but if dfe = J > 30 use denominator df
= ∞.
On the following page is the ANOVA table for a 2^k design. For A, use
H_0: μ_10···0 = μ_20···0. The other main effects have similar null hypotheses.
For interaction, use H_0: no interaction. If m = 1, use a procedure similar to
Rule of Thumb 8.6 for exploratory purposes.
One can use t statistics for effects with t_0 = effect/SE(effect) ∼ t_dfe where
dfe is the MSE df. Then t_0² = MS(effect)/MSE = F_0 ∼ F_{1,dfe}.

Source                            df  SS            MS       F        p-value
k main effects                    1   e.g. SSA =    MSA      F_A      p_A
C(k,2) 2 factor interactions      1   e.g. SSAB =   MSAB     F_AB     p_AB
C(k,3) 3 factor interactions      1   e.g. SSABC =  MSABC    F_ABC    p_ABC
..                                ..  ..            ..       ..       ..
C(k,k-1) k-1 factor interactions  1
the k factor interaction          1   SSA···L =     MSA···L  F_A···L  p_A···L
Error                     (m - 1)2^k  SSE           MSE

I A B C AB AC BC ABC   y
+ - - -  +  +  +   - 6.333
+ + - -  -  -  +   + 4.667
+ - + -  -  +  -   + 9.0
+ + + -  +  -  -   - 6.667
+ - - +  +  -  -   + 4.333
+ + - +  -  +  -   - 2.333
+ - + +  -  -  +   - 7.333
+ + + +  +  +  +   + 4.667
divisor 8 4 4 4 4 4 4 4

Example 8.4. Box et al. (2005, p. 189) describes a 2^3 experiment designed
to investigate the effects of planting depth (0.5 or 1.4 in.), watering (once or
twice daily), and type of lima bean (baby or large) on yield. The table of
contrasts is shown above. The number of replications m = 3.
a) Find the A effect.
b) Find the AB effect.
c) Find SSA = MSA.
d) Find SSAB = MSAB.
e) If MSE = 0.54, find SE(effect).
Solution: a) The A effect =
(-6.333 + 4.667 - 9 + 6.667 - 4.333 + 2.333 - 7.333 + 4.667)/4 = -8.665/4
= -2.16625. Note that the appropriate + and - signs are obtained from the
A column.
b) The AB effect =
(6.333 - 4.667 - 9 + 6.667 + 4.333 - 2.333 - 7.333 + 4.667)/4 = -1.333/4
= -0.33325.
Fig. 8.1 QQ plot for Example 8.4

c) SSA = m2^(k-2) (effect)² = 3(2)(-2.16625)² = 28.1558.
d) SSAB = 6(effect)² = 6(-0.33325)² = 0.6663.
e)

    SE(effect) = √(MSE/(m2^(k-2))) = √(0.54/(3·2)) = √0.09 = 0.3.
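The hand computations in this example can be replicated directly from Equations (8.1)-(8.4) (a Python check using the cell means from the table above):

```python
# Cell means and contrast columns for the 2^3 lima bean experiment.
ybar = [6.333, 4.667, 9.0, 6.667, 4.333, 2.333, 7.333, 4.667]
A = [-1, 1, -1, 1, -1, 1, -1, 1]
B = [-1, -1, 1, 1, -1, -1, 1, 1]
AB = [a * b for a, b in zip(A, B)]
k, m, MSE = 3, 3, 0.54

def effect(col):
    # Equation (8.1): effect = c^T ybar / 2^(k-1)
    return sum(c * y for c, y in zip(col, ybar)) / 2 ** (k - 1)

A_eff, AB_eff = effect(A), effect(AB)
SSA = m * 2 ** (k - 2) * A_eff ** 2         # Equation (8.4)
SSAB = m * 2 ** (k - 2) * AB_eff ** 2
SE_eff = (MSE / (m * 2 ** (k - 2))) ** 0.5  # Equation (8.2)
```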

The lregpack functions twocub and twofourth can be used to find the
effects, SE(effect), and QQ plots for 2^3 and 2^4 designs. If m = 1, the
twofourth function also makes the response and residual plots based on the
second order model for 2^4 designs.
For the data in Example 8.4, the output below and on the following page
shows that the A and C effects have values -2.166 and -2.000 while the B
effect is 2.500. These are the three significant effects shown in the QQ plot
in Figure 8.1. The two commands below produced the output.

z<-c(6.333,4.667,9,6.667,4.333,2.333,7.333,4.667)
twocub(z,m=3,MSE=0.54)

$Aeff
[1] -2.16625
$Beff
[1] 2.50025
$Ceff
[1] -2.00025
$ABeff
[1] -0.33325
$ACeff
[1] -0.16675
$BCeff
[1] 0.16675
$ABCeff
[1] 0.00025
$MSA
[1] 28.15583
$MSB
[1] 37.5075
$MSC
[1] 24.006
$MSAB
[1] 0.6663334
$MSAC
[1] 0.1668334
$MSABC
[1] 3.75e-07
$MSE
[1] 0.54
$SEeff
[1] 0.3

8.2 Fractional Factorial Designs

Factorial designs are expensive since n = m2^k when there are k factors and
m replications. A fractional factorial design uses n = m2^(k-f) where f is
defined below, and so costs much less. Such designs can be useful when the
higher order interactions are not significant.

Definition 8.10. A 2^(k-f)_R fractional factorial design has k factors
and takes m2^(k-f) runs where the number of replications m is usually 1. The
design is an orthogonal design and each factor has two levels low = -1 and
high = 1. R is the resolution of the design.

Definition 8.11. A main effect or q factor interaction is confounded or
aliased with another effect if it is not possible to distinguish between the
two effects.
Remark 8.2. A 2^(k-f)_R design has no q factor interaction (or main
effect for q = 1) confounded with any other effect consisting of less than
R - q factors. So a 2^(k-f)_III design has R = 3 and main effects are
confounded with 2 factor interactions. In a 2^(k-f)_IV design, R = 4 and
main effects are not confounded with 2 factor interactions but 2 factor
interactions are confounded with other 2 factor interactions. In a 2^(k-f)_V
design, R = 5 and main effects and 2 factor interactions are only confounded
with 4 and 3 way or higher interactions respectively. The R = 4 and R = 5
designs are good because the 3 way and higher interactions are rarely
significant, but these designs are more expensive than the R = 3 designs.
In a 2^(k-f)_R design, each effect is confounded or aliased with 2^f - 1
other effects. Thus the Mth main effect is really an estimate of the Mth main
effect plus 2^f - 1 other effects. If R ≥ 3 and none of the two factor
interactions are significant, then the Mth main effect is typically a useful
estimator of the population Mth main effect.

Rule of thumb 8.8. Main effects tend to be larger than q factor
interaction effects, and the lower order interaction effects tend to be larger
than the higher order interaction effects. So two way interaction effects tend
to be larger than three way interaction effects.
Rule of thumb 8.9. Significant interactions tend to have significant
component main effects. Hence if A, B, C, and D are factors, B and D are
inert and A and C are active, then the AC effect is the two factor interaction
most likely to be active. If only A was active, then the two factor interactions
containing A (AB, AC, and AD) are the ones most likely to be active.
Suppose each run costs $1000 and m = 1. The 2^k factorial designs need
2^k runs while fractional factorial designs need 2^(k-f) runs. These designs
use the fact that three way and higher interactions tend to be inert for
experiments.

Remark 8.3. Let k_o = k - f. Some good fractional factorial designs for
k_o = 3 are shown below. The designs shown use the same table of contrasts
as the 2^3 design and can be fit with 2^3 software.

2^3          A B C AB  AC  BC  ABC
2^(4-1)_IV   A B C AB+ AC+ BC+ D
2^(5-2)_III  A B C D   E   BC+ BE+
2^(6-3)_III  A B C D   E   F   AF+
2^(7-4)_III  A B C D   E   F   G

Consider the 2^(4-1)_IV design. It has 4 factors A, B, C, and D. The D
main effect is confounded with the ABC three way interaction, which is likely
to be inert. The D effect is the D effect plus the ABC effect. But if the ABC
effect is not significant, then the D effect is a good estimator of the
population D effect. Confounding = aliasing is the price to pay for using
fractional factorial designs instead of the more expensive factorial designs.
The two factor interactions are followed by a +, e.g. AB+, since these
interactions are confounded with other two factor interactions.
If m = 1, the 2^(4-1)_IV design uses 8 runs while a 2^4 factorial design
uses 16 runs. The runs for the 2^(4-1)_IV design are defined by the 4 main
effects: use the first 3 columns and the last column of the table of contrasts
for the 2^3 design to define the runs. Randomly assign the units (often time
slots) to the runs.
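The aliasing can be seen numerically: with the generator D = ABC, the D column is identical to the ABC contrast, so least squares cannot separate the two effects. A small Python sketch:

```python
# 8 runs of the 2^(4-1)_IV design: A, B, C from the full 2^3 design
# (first factor alternating fastest), and D set by the generator D = ABC.
runs = [(a, b, c) for c in (-1, 1) for b in (-1, 1) for a in (-1, 1)]
A = [r[0] for r in runs]
D = [a * b * c for a, b, c in runs]    # D column = generator ABC

# D is still a contrast and is orthogonal to the main effect columns,
# but it is exactly the ABC interaction column: the two are confounded.
ABC = [a * b * c for a, b, c in runs]
```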

Remark 8.4. Some good fractional factorial designs for k_o = k - f = 4
are shown below. The designs shown use the same table of contrasts as the
2^4 design and can be fit with 2^4 software. Here the designs are 1) 2^4, and
the fractional factorial designs 2) 2^(5-1)_V, 3) 2^(6-2)_IV, 4) 2^(7-3)_IV,
5) 2^(8-4)_IV, 6) 2^(9-5)_III, and 7) 2^(15-11)_III.
design
1) A B C D AB  AC  AD  BC  BD  CD  ABC  ABD  ACD  BCD  ABCD
2) A B C D AB  AC  AD  BC  BD  CD  DE   CE   BE   AE   E
3) A B C D AB+ AC+ AD+ BC+ BD+ CD+ E    3int 3int F    AF+
4) A B C D AB+ AC+ AD+ BC+ BD+ CD+ E    3int F    G    AG+
5) A B C D AB+ AC+ AD+ BC+ BD+ CD+ E    F    G    H    AH+
6) A B C D AB+ AC+ AD+ BC+ BD+ CD+ E    F    G    H    J
7) A B C D E   F   G   H   J   K   L    M    N    O    P
Remark 8.5. Let k_o = k - f for a 2^(k-f)_R design. The QQ plot for
2^(k-f)_R designs is used in a manner similar to that of 2^k designs where
k = k_o. The formulas for effects and mean squares are like the formulas for
a 2^(k_o) design. Let c be a column from the table of contrasts where + = 1
and - = -1. Let ȳ be the column of cell means. Then MSE = σ̂² needs to
be given or estimated by setting high order interactions to 0 for m = 1.
Typically m = 1 for fractional factorial designs. The following formulas
ignore the I effect.
a) The effect corresponding to c is effect = c^T ȳ / 2^(k_o - 1).
b) The standard error for the effect is SE(effect) = √(MSE/(m2^(k_o - 2))).
c) SE(mean) = √(MSE/(m2^(k_o))) where m2^(k_o) = n.
d) The sum of squares and mean square for an effect are
MS(effect) = SS(effect) = m2^(k_o - 2) (effect)².

Consider the designs given in Remarks 8.3 and 8.4. Least squares estimates
for the 2^(k-f)_R designs with k_o = 3 use the design matrix corresponding
to a 2^3 design while the designs with k_o = 4 use the design matrix
corresponding to the 2^4 design given in Section 8.1.
Randomly assign units to runs. Do runs in random order if possible. In
industry, units are often time slots (periods of time), so randomization
consists of randomly assigning time slots to runs, which is equivalent to
doing the runs in random order. For the above 2^(k-f)_R designs, fix the
main effects using the corresponding columns in the two tables of contrasts
given in Section 8.1 to determine the levels needed in the m2^(k-f) runs.
The fractional factorial designs can be fit with least squares, and the model
can be written as Y = x^T β + e with least squares fitted values Ŷ = x^T β̂.
In matrix form the model is Y = Xβ + e and the vector of fitted values is
Ŷ = X β̂.
The biggest possible model for a 2^{k−f}_R design where k − f = 3 is
Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + β_3 x_i3 + β_12 x_i12 + β_13 x_i13 + β_23 x_i23 + β_123 x_i123 + e_i
with least squares fitted or predicted values given by Ŷ_i = β̂_0 + β̂_1 x_i1 +
β̂_2 x_i2 + β̂_3 x_i3 + β̂_12 x_i12 + β̂_13 x_i13 + β̂_23 x_i23 + β̂_123 x_i123.
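As a concrete illustration, the full model above can be fit numerically. The Python sketch below (illustrative only; it is not the book's lregpack R code) builds the 8 × 8 design matrix for a 2^3 design in standard order and fits it by least squares; the response values y are made up for the example.

```python
# Sketch: full-model least squares fit for a 2^3 = 8 run design.
# The response values y are made up for illustration.
import itertools
import numpy as np

# Main-effect columns in standard order: A alternates -1, +1;
# B alternates in pairs of 2; C alternates in blocks of 4.
runs = np.array(list(itertools.product([-1, 1], repeat=3)))[:, ::-1]
A, B, C = runs[:, 0], runs[:, 1], runs[:, 2]

# Interaction columns are elementwise products of main-effect columns.
X = np.column_stack([np.ones(8), A, B, C, A*B, A*C, B*C, A*B*C])

y = np.array([3.0, 5.0, 2.0, 6.0, 7.0, 9.0, 6.0, 10.0])  # made-up data
bhat, *_ = np.linalg.lstsq(X, y, rcond=None)

# For an orthogonal design each non-intercept coefficient is half the
# corresponding effect, so the effects are recovered directly.
effects = 2 * bhat[1:]
```

For these made-up data the intercept equals the grand mean 6.0 and the A effect is 3.0, agreeing with effect = c^T y / 2^{k−1}.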
The regression equation corresponding to the significant effects (e.g. found
with a QQ plot) can be used to form a reduced model. For example, suppose
the full (least squares) fitted model is Ŷ_i = β̂_0 + β̂_1 x_i1 + β̂_2 x_i2 +
β̂_3 x_i3 + β̂_12 x_i12 + β̂_13 x_i13 + β̂_23 x_i23 + β̂_123 x_i123. Suppose the
A, B, and AB effects are significant. Then the reduced (least squares) fitted
model is Ŷ_i = β̂_0 + β̂_1 x_i1 + β̂_2 x_i2 + β̂_12 x_i12 where the coefficients
(β̂'s) for the reduced model can be taken from the full model since fractional
factorial designs are orthogonal.
For the fractional factorial designs, the coefficient β̂_0 corresponding to I
is equal to the I effect, but the coefficient β̂ of a factor x corresponding to
an effect is β̂ = 0.5 effect. Consider significant effects and assume
interactions can be ignored.
i) If a large response Y is desired and β̂ > 0, use x = 1. If β̂ < 0, use
x = −1.
ii) If a small response Y is desired and β̂ > 0, use x = −1. If β̂ < 0, use
x = 1.
From the regression equation Ŷ = x^T β̂, be able to predict Y given x. Be
able to tell whether x = 1 or x = −1 should be used. Given the x values of
the main effects, get the x values of the interactions by multiplying the
columns corresponding to the main effects in the interaction. Least squares
output is similar to that in Section 8.1. The least squares coefficient
β̂ = 0.5 (effect). The sum of squares for an x corresponding to an effect is
equal to SS(effect). SE(coef) = SE(β̂) = 0.5 SE(effect) = sqrt(MSE/n). Also
SE(β̂_0) = sqrt(MSE/n).
Assume none of the interactions are significant. Then the 2^{7−4}_III
fractional factorial design allows estimation of 7 main effects in 2^3 = 8
runs. The 2^{15−11}_III fractional factorial design allows estimation of 15
main effects in 2^4 = 16 runs. The 2^{31−26}_III fractional factorial design
allows estimation of 31 main effects in 2^5 = 32 runs.
Fractional factorial designs with k − f = k_o can be fit with software meant
for 2^{k_o} designs. Hence the lregpack functions twocub and twofourth can
be used for the k_o = 3 and k_o = 4 designs that use the standard table
of contrasts. The response and residual plots given by twofourth are not
appropriate, but the QQ plot and the remaining output are relevant. Some
of the interactions will correspond to main effects for the fractional
factorial design.
For example, if the Example 8.4 data was from a 2^{4−1}_IV design, then the
A, B, and C effects would be the same, but the D effect is the effect labelled
ABC. So the D effect ≈ 0.

[Figure 8.2: normal QQ plot of the effects versus normal quantiles.]
Fig. 8.2 QQ plot for Example 8.5

  Aeff    Beff    Ceff   ABeff   ACeff   BCeff  ABCeff
20.625  38.375  -0.275  28.875  -0.275  -0.625  -2.425

Example 8.5. Montgomery (1984, pp. 344–346) gives data from a 2^{7−4}_III
design with the QQ plot shown in Figure 8.2. The goal was to study eye focus
time with factors A = sharpness of vision, B = distance of target from eye,
C = target shape, D = illumination level, E = target size, F = target density,
and G = subject. The lregpack function twocub gave the effects above.
a) What is the D effect?
b) What effects are significant?

Solution: By the last line in the table given in Remark 8.3, note that for
this design, A, B, C, AB, AC, BC, ABC correspond to A, B, C, D, E, F, G. So
the AB effect from the output is the D effect.
a) 28.875, since the D effect is the AB effect.
b) A, B, and D, since these are the effects that do not follow the line in
the QQ plot shown in Figure 8.2.

I A B C AB AC BC ABC    y
+ - - - +  +  +  -   86.8
+ + - - -  -  +  +   85.9
+ - + - -  +  -  +   79.4
+ + + - +  -  -  -   60.0
+ - - + +  -  -  +   94.6
+ + - + -  +  -  -   85.4
+ - + + -  -  +  -   84.5
+ + + + +  +  +  +   80.3

Example 8.6. The above table of 2^3 contrasts is for 2^{5−2}_III data.
a) Estimate the B effect.
b) Estimate the D effect.
Solution: a)
(−86.8 − 85.9 + 79.4 + 60 − 94.6 − 85.4 + 84.5 + 80.3)/4 = −48.5/4 = −12.125.
b) Use Remark 8.3 to see that the D effect corresponds to the AB column.
So the D effect =
(86.8 − 85.9 − 79.4 + 60 + 94.6 − 85.4 − 84.5 + 80.3)/4 = −13.5/4 = −3.375.
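The arithmetic in this solution can be checked numerically. The Python sketch below (illustrative; not the book's lregpack code) forms the standard 2^3 contrast columns for A and B and applies effect = c^T y / 2^{k_o−1} with k_o = 3, using the AB column for the aliased D effect.

```python
# Numerical check of Example 8.6: B and D effects for the 2^{5-2}_III data,
# where the D effect uses the AB contrast column (D is aliased with AB).
import numpy as np

y = np.array([86.8, 85.9, 79.4, 60.0, 94.6, 85.4, 84.5, 80.3])
A = np.array([-1, 1, -1, 1, -1, 1, -1, 1])   # A alternates -, +
B = np.array([-1, -1, 1, 1, -1, -1, 1, 1])   # B alternates in pairs

b_effect = B @ y / 4        # c^T y / 2^(ko - 1) with ko = 3
d_effect = (A * B) @ y / 4  # AB column carries the D effect here
# b_effect ≈ -12.125, d_effect ≈ -3.375
```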

8.3 Plackett Burman Designs

Definition 8.12. The Plackett Burman PB(n) designs have k factors
where 2 ≤ k ≤ n − 1. The factors have 2 levels and orthogonal contrasts like
the 2^k and 2^{k−f}_R designs. The PB(n) designs are resolution 3 designs,
but the confounding of main effects with 2 factor interactions is complex.
The PB(n) designs use n runs where n is a multiple of 4. The values n = 12,
20, 24, 28, and 36 are especially common.
Fractional factorial designs need at least 2^{k_o} runs. Hence if there are 17
main effects, 32 runs are needed for a 2^{17−12}_III design while a PB(20)
design only needs 20 runs. The price to pay is that the confounding pattern
of the main effects with the two way interactions is complex. Thus the PB(n)
designs are usually used with main effects, and it is assumed that all
interactions are insignificant. So the Plackett Burman designs are main
effects designs used to screen k main effects when the number of runs n is
small. Often k = n − 4, n − 3, n − 2, or n − 1 is used. We will assume that
the number of replications m = 1.
A contrast matrix for the PB(12) design is shown below. Again the column
of plusses corresponding to I is omitted. If k = 8 then effects A to H are
used but effects J, K, and L are empty. As a convention the mean square
and sum of squares for factor E will be denoted as MSe and SSe while MSE
= σ̂².
run A B C D E F G H J K L
1 + - + - - - + + + - +
2 + + - + - - - + + + -
3 - + + - + - - - + + +
4 + - + + - + - - - + +
5 + + - + + - + - - - +
6 + + + - + + - + - - -
7 - + + + - + + - + - -
8 - - + + + - + + - + -
9 - - - + + + - + + - +
10 + - - - + + + - + + -
11 - + - - - + + + - + +
12 - - - - - - - - - - -
The PB(n) designs are k factor 2 level orthogonal designs. So finding
quantities such as effects, MS, SS, least squares estimates, et cetera for
PB(n) designs is similar to finding the corresponding quantities for the 2^k
and 2^{k−f}_R designs. Randomize units (often time slots) to runs and least
squares can be used.
Remark 8.6. For the PB(n) design, let c be a column from the table of
contrasts where + = 1 and − = −1. Let y be the column of responses since
m = 1. If k < n − 1, pool the last J = n − 1 − k empty effects into the
MSE with df = J as the full model. This procedure is done before looking
at the data, so is not data snooping. The MSE can also be given or found
by pooling insignificant MSs into the MSE, but the latter method uses data
snooping. This pooling needs to be done if k = n − 1 since then there is no
df for MSE. The following formulas ignore the I effect.
a) The effect corresponding to c is effect = c^T y/(n/2) = 2 c^T y/n.
b) The standard error for the effect is SE(effect) = sqrt(MSE/(n/4)) = sqrt(4 MSE/n).
c) SE(mean) = sqrt(MSE/n).
d) The sum of squares and mean square for an effect are
MS(effect) = SS(effect) = (n/4)(effect)².
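The Remark 8.6 formulas can be sketched in code. The Python fragment below is illustrative only: the contrast column c is the A column of the PB(12) table above, while the responses y and the MSE value are made up for the example.

```python
# Sketch of the Remark 8.6 formulas for a PB(n) design with n = 12.
import numpy as np

n = 12
# Column A of the PB(12) contrast matrix shown in the text.
c = np.array([1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1, -1])
y = np.arange(1.0, 13.0)   # made-up responses, one per run (m = 1)
mse = 4.0                  # assumed given (or pooled from empty effects)

effect = 2 * c @ y / n            # a) effect = c^T y / (n/2)
se_effect = np.sqrt(4 * mse / n)  # b) SE(effect) = sqrt(MSE / (n/4))
se_mean = np.sqrt(mse / n)        # c) SE(mean)
ms_effect = (n / 4) * effect**2   # d) MS(effect) = SS(effect)
```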
[Figure 8.3: normal QQ plot of the effects versus normal quantiles for the PB(12) design.]
Fig. 8.3 QQ Plot for Example 8.7

For the PB(n) design, the least squares coefficient β̂ = 0.5 (effect). The sum
of squares for an x corresponding to an effect is equal to SS(effect). SE(coef)
= SE(β̂) = 0.5 SE(effect) = sqrt(MSE/n). Also SE(β̂_0) = sqrt(MSE/n).

Example 8.7. Shown below is least squares output using PB(12) data
from Ledolter and Swersey (2007, pp. 244–256). There were k = 10 factors, so
the MSE has 1 df and there are too many terms in the model. In this case the
QQ plot shown in Figure 8.3 is more reliable for finding significant effects.
a) Which effects, if any, appear to be significant from the QQ plot?
b) Let the reduced model be Ŷ = β̂_0 + β̂_r1 x_r1 + ··· + β̂_rj x_rj where j is
the number of significant terms found in a). Write down the reduced model.
c) Want large Y. Using the model in b), choose the x values that will give
large Ŷ, and predict Y.
Estimate Std.Err t-value Pr(>|t|)
Intercept 6.7042 2.2042 3.0416 0.2022
c1 8.5792 2.2042 3.8922 0.1601
c2 -1.7958 2.2042 -0.8147 0.5648
c3 2.3125 2.2042 1.0491 0.4847
c4 4.1208 2.2042 1.8696 0.3127
c5 3.1542 2.2042 1.4310 0.3883
c6 -3.3958 2.2042 -1.5406 0.3665
c7 0.9542 2.2042 0.4329 0.7399
c8 -1.1208 2.2042 -0.5085 0.7005
c9 1.3125 2.2042 0.5955 0.6581
c10 1.7875 2.2042 0.8110 0.5662
Solution: a) The most significant effects are either in the top right or
bottom left corner. Although the points do not all scatter closely about the
line, the point in the bottom left is not significant. So none of the effects
corresponding to the bottom left of the plot are significant. A is the
significant effect with value 2(8.5792) = 17.1584. See the top right point of
Figure 8.3.
b) Ŷ = 6.7042 + 8.5792 x1.
c) Ŷ = 6.7042 + 8.5792(1) = 15.2834.

The lregpack function pb12 can be used to find effects and MS(effect) for
PB(12) data. Least squares output and a QQ plot are also given.
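The prediction logic in part c) follows the earlier rule: with interactions ignored, set x = 1 when the coefficient is positive and x = −1 when it is negative. A minimal Python sketch, using the intercept and x1 coefficient from the output above:

```python
# Sketch of Example 8.7c: choose the x level maximizing the predicted
# response from the reduced model Yhat = b0 + b1*x1.
b0, b1 = 6.7042, 8.5792        # from the PB(12) least squares output above

x1 = 1.0 if b1 > 0 else -1.0   # large Y wanted: match the coefficient's sign
yhat = b0 + b1 * x1            # predicted response at the chosen level
```

Here x1 = 1 since b1 > 0, giving yhat = 15.2834 as in the solution.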

8.4 Summary

1) In a table of contrasts, the contrast for A starts with a − then a +
and the pattern repeats. The contrast for B starts with 2 −'s and then 2
+'s and the pattern repeats. The contrast for C starts with 4 −'s and then
4 +'s and the pattern repeats. The contrast for the ith main effects factor
starts with 2^{i−1} −'s and 2^{i−1} +'s and the pattern repeats for
i = 1, . . . , k.
2) In a table of contrasts, a column for an interaction containing several
factors is obtained by multiplying the columns for each factor where + = 1
and − = −1. So the column for ABC is obtained by multiplying the column
for A, the column for B, and the column for C.
3) Let c be a column from the table of contrasts where + = 1 and − = −1.
Let y be the column of cell means. Then the effect corresponding to c is
effect = c^T y / 2^{k−1}.
4) If the number of replications m ≥ 2, then the standard error for the
effect is
SE(effect) = sqrt(MSE/(m 2^{k−2})).
Sometimes MSE is replaced by σ̂².
5)
SE(mean) = sqrt(MSE/(m 2^k))
where m 2^k = n, m ≥ 2, and sometimes MSE is replaced by σ̂².
6) Since df = 1, the sum of squares and mean square for an effect is
MS(effect) = SS(effect) = m 2^{k−2} (effect)²
for m ≥ 1.
7) If a single run out of the 2^k cells gives good values for the response,
then that run is called a critical mix.
8) A factor is active if the response depends on the two levels of the
factor, and is inert otherwise.
9) Randomization for a 2^k design: randomly assign units to the m 2^k runs.
The runs are determined by the levels of the k main effects in the table of
contrasts. So a 2^3 design is determined by the levels of A, B, and C.
Similarly, a 2^4 design is determined by the levels of A, B, C, and D.
Perform the m 2^k runs in random order if possible.
10) A table of contrasts for a 2^3 design is shown below. The first column
is for the mean and is not a contrast. The last column corresponds to the
cell means. Note that ȳ_{1110} = y_{111} if m = 1. So ȳ might be replaced by
y if m = 1.

I A B C AB AC BC ABC    ȳ
+ - - - +  +  +  -   ȳ_{1110}
+ + - - -  -  +  +   ȳ_{2110}
+ - + - -  +  -  +   ȳ_{1210}
+ + + - +  -  -  -   ȳ_{2210}
+ - - + +  -  -  +   ȳ_{1120}
+ + - + -  +  -  -   ȳ_{2120}
+ - + + -  -  +  -   ȳ_{1220}
+ + + + +  +  +  +   ȳ_{2220}
divisor 8 4 4 4 4 4 4 4

11) Be able to pick out active and inert factors and good (or the best)
combinations of factors (cells or runs) from the table of contrasts = table of
runs.
12) Plotted points far away from the identity line and r = 0 line are
potential outliers, but often the identity line goes through or near an outlier
that is large in magnitude. Then the case has a small residual. Look for gaps
in the response and residual plots.
13) A 95% confidence interval (CI) for an effect is
effect ± t_{dfe,0.975} SE(effect)
where dfe is the MSE degrees of freedom. Use t_{dfe,0.975} ≈ z_{0.975} = 1.96
if dfe > 30. The effect is significant if the CI does not contain 0, while the
effect is not significant if the CI contains 0.
14) Suppose there is no replication so m = 1. Find J interaction mean
squares that are small compared to the bulk of the mean squares. Add them
up (pool them) to make MSE with dfe = J. So
MSE = (sum of small MS's)/J.
This method uses data snooping and MSE tends to underestimate σ². So
the F test statistics are too large and the pvalues = P(F_{1,J} > F_0) are too
small. For example, F_0 = F_A = MSA/MSE. As a convention for using an
F table, use the denominator df closest to dfe = J, but if dfe = J > 30 use
denominator df = ∞. Use this method for exploratory data analysis, not for
inference based on the F distribution.
15) MS = SS ≈ σ² χ²_1 ≈ MSE χ²_1 if the effect is not significant. MSE ≈
σ² χ²_{dfe}/dfe if the model holds. A rule of thumb is that an effect is
significant if MS > 5 MSE. The rule comes from the fact that χ²_{1,.975} ≈ 5.
16) The table of contrasts for a 2^4 design is below. The column of ones
corresponding to I was omitted.
run A B C D AB AC AD BC BD CD ABC ABD ACD BCD ABCD
1 - - - - + + + + + + - - - - +
2 + - - - - - - + + + + + + - -
3 - + - - - + + - - + + + - + -
4 + + - - + - - - - + - - + + +
5 - - + - + - + - + - + - + + -
6 + - + - - + - - + - - + - + +
7 - + + - - - + + - - - + + - +
8 + + + - + + - + - - + - - - -
9 - - - + + + - + - - - + + + -
10 + - - + - - + + - - + - - + +
11 - + - + - + - - + - + - + - +
12 + + - + + - + - + - - + - - -
13 - - + + + - - - - + + + - - +
14 + - + + - + + - - + - - + - -
15 - + + + - - - + + + - - - + -
16 + + + + + + + + + + + + + + +
17) Below is the ANOVA table for a 2^3 design. Let m ≥ 1. For A, use
H0: μ_{100} = μ_{200}. For B, use H0: μ_{010} = μ_{020}. For C, use
H0: μ_{001} = μ_{002}. For interaction, use H0: no interaction.

Source  df        SS     MS     F     p-value
A       1         SSA    MSA    FA    pA
B       1         SSB    MSB    FB    pB
C       1         SSC    MSC    FC    pC
AB      1         SSAB   MSAB   FAB   pAB
AC      1         SSAC   MSAC   FAC   pAC
BC      1         SSBC   MSBC   FBC   pBC
ABC     1         SSABC  MSABC  FABC  pABC
Error   (m−1)2^k  SSE    MSE
18) Below is the ANOVA table for a 2^k design. For A, use H0: μ_{10···0} =
μ_{20···0}. The other main effects have similar null hypotheses. For
interaction, use H0: no interaction. If m = 1 use a procedure similar to
point 14) for exploratory purposes.

Source                                 df        SS            MS       F      p-value
k main effects                         1 each    e.g. SSA    = MSA      FA     pA
(k choose 2) 2 factor interactions     1 each    e.g. SSAB   = MSAB     FAB    pAB
(k choose 3) 3 factor interactions     1 each    e.g. SSABC  = MSABC    FABC   pABC
...
(k choose k−1) k−1 factor interactions 1 each
the k factor interaction               1         SSA···L    = MSA···L   FA···L pA···L
Error                                  (m−1)2^k  SSE           MSE

19) Genuine run replicates need to be used. A common error is to take m
measurements per run, and act as if the m measurements are from m runs.
If as a data analyst you encounter this error, average the m measurements
into a single value of the response.
20) One can use t statistics for effects with t_0 = effect/SE(effect) ≈ t_{dfe}
where dfe is the MSE df. Then t_0² = MS(effect)/MSE = F_0 ≈ F_{1,dfe}.

21) A 2^k design can be fit with least squares. In the table of contrasts
let a + = 1 and a − = −1. Then X needs a row for each response:
we can't use the mean response for each fixed combination of levels. Let x_0
correspond to I, the column of 1's. Let x_i correspond to the ith main effect
for i = 1, . . . , k. Let x_{ij} correspond to 2 factor interactions, and let
x_{i1,...,iG} correspond to G way interactions for G = 2, . . . , k. Let the
design matrix X have columns corresponding to the x's. Let y be the vector
of responses.
22) The table below relates the quantities in the 2^3 table of contrasts with
the quantities used in least squares when the design matrix

X = [x_0, x_1, x_2, x_3, x_12, x_13, x_23, x_123].

Software often does not need the column of ones x_0.

x_0 x_1 x_2 x_3 x_12 x_13 x_23 x_123 y
I   A   B   C   AB   AC   BC   ABC   y
23) The table below relates quantities in the 2^4 table of contrasts with the
quantities used in least squares. Again x_0 corresponds to I, the column of
ones, while y is the vector of responses.
x_1 x_2 x_3 x_4 x_12 x_13 x_14 x_23 x_24 x_34 x_123 x_124 x_134 x_234 x_1234
A   B   C   D   AB   AC   AD   BC   BD   CD   ABC   ABD   ACD   BCD   ABCD
24) A typical least squares output for the 2^3 design is shown below. Often
"Estimate" is replaced by "Coef".

Estimate Std.Err t-value Pr(>|t|)


Intercept 64.25 0.7071 90.8632 0.0000
x1 11.50 0.7071 16.2635 0.0000
x2 -2.50 0.7071 -3.5355 0.0077
x3 0.75 0.7071 1.0607 0.3198
x12 0.75 0.7071 1.0607 0.3198
x13 5.00 0.7071 7.0711 0.0001
x23 0.00 0.7071 0.0000 1.0000
x123 0.25 0.7071 0.3536 0.7328

25) i) The least squares coefficient or estimate β̂ = effect/2. So in the
above table, the A effect = 2(11.5) = 23. If x corresponds to the least
squares coefficient, then the coefficient is β̂ = (x^T y)/(x^T x).
ii) The sum of squares = mean square corresponding to an x is equal to the
sum of squares = mean square of the corresponding effect. If x corresponds
to the least squares coefficient, then the SS = MS = (x^T y)²/(x^T x).
iii) Suppose m ≥ 2. Then SE(coef) = SE(effect)/2 = 0.5 sqrt(MSE/(m 2^{k−2})).
Hence in the above table, SE(effect) = 2(0.7071) = 1.4142.
iv) The t statistic t_0 = coef/SE(coef), and t_0² = F_0 where t_0 ≈ t_{dfe}
and F_0 ≈ F_{1,dfe} where dfe = (m−1)2^k is the MSE df. Hence the pvalues for
least squares and the 2^k software are the same. For example, the pvalue for
testing the significance of x_1 = pvalue for testing significance of the A
effect = 0.000 in the above table. Also t_A = 16.2635 and t_A² = F_A = 264.501.
v) The MSE, fitted values, and residuals are the same for the least squares
output and the 2^k software.
26) There are several advantages to least squares over 2^k software. i) If
there are several missing values or outliers, delete the corresponding rows
from the design matrix X and the vector of responses y as long as the
number of rows of the design matrix ≥ the number of columns. ii) If the exact
quantitative levels are not observed, replace them by the observed levels in
the design matrix. See point 27). iii) If the wrong levels are used in a run,
replace the corresponding row in the design matrix by a row corresponding
to the levels actually used.
27) Suppose the two levels of the quantitative variable are a < b and x is
the actual value used. Then code x as c = (2x − (a + b))/(b − a). Note that
the code gives c = −1 for x = a and c = 1 for x = b.
28) A normal QQ plot is a plot of the effects versus standard normal
percentiles. There are L = 2^k − 1 effects for a 2^k design. A rule of thumb
is that nonsignificant effects tend to follow a line closely in the middle of
the plot while the significant effects do not follow the line closely.
Significant effects will be the most negative or the most positive effects.
29) A 2^{k−f}_R fractional factorial design has k factors and takes m 2^{k−f}
runs where the number of replications m is usually 1.

30) Let k_o = k − f. Some good fractional factorial designs for k_o = 3 are
shown below. The designs shown use the same table of contrasts as the 2^3
design given in point 10), and can be fit with 2^3 software.

2^3         A B C AB  AC  BC  ABC
2^{4−1}_IV  A B C AB+ AC+ BC+ D
2^{5−2}_III A B C D   E   BC+ BE+
2^{6−3}_III A B C D   E   F   AF+
2^{7−4}_III A B C D   E   F   G

31) Some good fractional factorial designs for k_o = k − f = 4 are shown
below. The designs shown use the same table of contrasts as the 2^4 design
given in point 16), and can be fit with 2^4 software. Here the designs are
1) 2^4, and the fractional factorial designs 2) 2^{5−1}_V, 3) 2^{6−2}_IV,
4) 2^{7−3}_IV, 5) 2^{8−4}_IV, 6) 2^{9−5}_III, and 7) 2^{15−11}_III.
design
1)A B C D AB AC AD BC BD CD ABC ABD ACD BCD ABCD
2)A B C D AB AC AD BC BD CD DE CE BE AE E
3)A B C D AB+ AC+ AD+ BC+ BD+ CD+ E 3int 3int F AF+
4)A B C D AB+ AC+ AD+ BC+ BD+ CD+ E 3int F G AG+
5)A B C D AB+ AC+ AD+ BC+ BD+ CD+ E F G H AH+
6)A B C D AB+ AC+ AD+ BC+ BD+ CD+ E F G H J
7)A B C D E F G H J K L M N O P
32) Let k_o = k − f for a 2^{k−f}_R design. Then the formulas for effects and
mean squares are like the formulas for a 2^{k_o} design. Let c be a column
from the table of contrasts where + = 1 and − = −1. Let y be the column of
cell means. Need MSE = σ̂² to be given or estimated by setting high order
interactions to 0 for m = 1. Typically m = 1 for fractional factorial designs.
a) The effect corresponding to c is effect = c^T y / 2^{k_o−1}.
b) The standard error for the effect is SE(effect) = sqrt(MSE/(m 2^{k_o−2})).
c) SE(mean) = sqrt(MSE/(m 2^{k_o})) where m 2^{k_o} = n.
d) The mean square and sum of squares for an effect are
MS(effect) = SS(effect) = m 2^{k_o−2} (effect)².
33) Least squares estimates for the 2^{k−f}_R designs in points 30) and 31)
are obtained by using the design matrix corresponding to the table of
contrasts in point 10) for k_o = 3 and point 16) for k_o = 4.
34) The QQ plot for 2^{k−f}_R designs is used in a manner similar to point
28).

35) Randomly assign units to runs. Do runs in random order if possible.
In industry, units are often time slots (periods of time), so randomization
consists of randomly assigning time slots to units, which is equivalent to
doing the runs in random order. For the 2^{k−f}_R designs in points 30) and
31), fix the main effects using the corresponding columns of contrasts given
in points 10) and 16) to determine the levels needed in the m 2^{k−f} runs.
36) Active factors appear to change the mean response as the level of
the factor changes from −1 to 1. Inert factors do not appear to change the
response as the level of the factor changes from −1 to 1. An inert factor
could be needed but the level low or high is not important, or the inert
factor may not be needed and so can be omitted from future studies. Often
subject matter experts can tell whether the inert factor is needed or not.
37) A 2^{k−f}_R design has no q factor interaction (or main effect for q = 1)
confounded with any other effect consisting of less than R − q factors. So
a 2^{k−f}_III design has R = 3 and main effects are confounded with 2 factor
interactions. In a 2^{k−f}_IV design, R = 4 and main effects are not
confounded with 2 factor interactions, but 2 factor interactions are
confounded with other 2 factor interactions. In a 2^{k−f}_V design, R = 5 and
main effects and 2 factor interactions are only confounded with 4 and 3 way
or higher interactions respectively.
38) In a 2^{k−f}_R design, each effect is confounded or aliased with 2^f − 1
other effects. Thus the Mth main effect is really an estimate of the Mth main
effect plus 2^f − 1 other effects. If R ≥ 3 and none of the two factor
interactions are significant, then the Mth main effect is typically a useful
estimator of the population Mth main effect.
39) The R = 4 and R = 5 designs are good because the 3 way and higher
interactions are rarely significant, but these designs are more expensive than
the R = 3 designs.
40) In this text, most of the DOE models can be fit with least squares, and
the model can be written as Y = x^T β + e with least squares fitted values
Ŷ = x^T β̂. In matrix form the model is Y = Xβ + e and the vector of fitted
values is Ŷ = X β̂.
41) The full model for a 2^3 or 2^{k−f}_R design where k − f = 3 is
Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + β_3 x_i3 + β_12 x_i12 + β_13 x_i13 + β_23 x_i23 + β_123 x_i123 + e_i
with least squares fitted or predicted values given by
Ŷ_i = β̂_0 + β̂_1 x_i1 + β̂_2 x_i2 + β̂_3 x_i3 + β̂_12 x_i12 + β̂_13 x_i13 + β̂_23 x_i23 + β̂_123 x_i123.
42) An interaction such as x_i123 satisfies x_i123 = (x_i1)(x_i2)(x_i3).
43) For orthogonal designs like 2^k, 2^{k−f}_R, or PB(n) (described in point
51)), the x value of an effect takes on values 1 or −1. The columns of the
design matrix X are orthogonal: c_i^T c_j = 0 for i ≠ j where c_i is the ith
column of X.

44) Suppose the full model using all of the columns of X is used. If some
columns are removed (e.g. those corresponding to the insignificant effects),
then for the orthogonal designs in point 43) the following quantities remain
unchanged for the terms that were not deleted: the effects, the coefficients,
and SS(effect) = MS(effect). The MSE, SE(effect), F and t statistics,
pvalues, fitted values, and residuals do change.
45) The regression equation corresponding to the significant effects (e.g.
found with a QQ plot) can be used to form a reduced model. For example,
suppose the full (least squares) fitted model is Ŷ_i = β̂_0 + β̂_1 x_i1 +
β̂_2 x_i2 + β̂_3 x_i3 + β̂_12 x_i12 + β̂_13 x_i13 + β̂_23 x_i23 + β̂_123 x_i123.
Suppose the A, B, and AB effects are significant. Then the reduced (least
squares) fitted model is Ŷ_i = β̂_0 + β̂_1 x_i1 + β̂_2 x_i2 + β̂_12 x_i12 where
the coefficients (β̂'s) for the reduced model are taken from the full model.
46) For the designs in 43), the coefficient β̂_0 corresponding to I is equal
to the I effect, but the coefficient β̂ of a factor x corresponding to an
effect is β̂ = 0.5 effect. Consider significant effects and assume interactions
can be ignored.
i) If a large response Y is desired and β̂ > 0, use x = 1. If β̂ < 0, use
x = −1.
ii) If a small response Y is desired and β̂ > 0, use x = −1. If β̂ < 0, use
x = 1.
47) Rule of thumb: to predict Y with Ŷ, the number of coefficients = the
number of β̂'s in the model should be ≤ n/2, where the sample size n = the
number of runs.
48) From the regression equation Ŷ = x^T β̂, be able to predict Y given x.
Be able to tell whether x = 1 or x = −1 should be used. Given the x values
of the main effects, get the x values of the interactions using 42).
49) Least squares output for the designs in 43) is shown below, first for an
example and then in symbols. Often "Estimate" is replaced by "Coef" or
"Coefficient". Often "Intercept" is replaced by "Constant". The t statistic
and pvalue are for whether the term or effect is significant. So t_12 and
p_12 are for testing whether the x_12 term or AB effect is significant.

Residual Standard Error=2.8284 = sqrt(MSE)


R-Square=0.9763 F-statistic (df=7, 8)=47.054 pvalue=0

Estimate Std.Err t-value Pr(>|t|)


Intercept 64.25 0.7071 90.8632 0.0000
x1 11.50 0.7071 16.2635 0.0000
x2 -2.50 0.7071 -3.5355 0.0077
x3 0.75 0.7071 1.0607 0.3198
x12 0.75 0.7071 1.0607 0.3198
x13 5.00 0.7071 7.0711 0.0001
x23 0.00 0.7071 0.0000 1.0000
x123 0.25 0.7071 0.3536 0.7328

                       Coef or Est.  Std.Err    t      pvalue
Intercept or constant  β̂_0          SE(coef)   t_0    p_0
x1                     β̂_1          SE(coef)   t_1    p_1
x2                     β̂_2          SE(coef)   t_2    p_2
x3                     β̂_3          SE(coef)   t_3    p_3
x12                    β̂_12         SE(coef)   t_12   p_12
x13                    β̂_13         SE(coef)   t_13   p_13
x23                    β̂_23         SE(coef)   t_23   p_23
x123                   β̂_123        SE(coef)   t_123  p_123

50) The least squares coefficient β̂ = 0.5 (effect). The sum of squares for
an x corresponding to an effect is equal to SS(effect). SE(coef) = SE(β̂) =
0.5 SE(effect) = sqrt(MSE/n). Also SE(β̂_0) = sqrt(MSE/n).
51) The Plackett Burman PB(n) designs have k factors where 2 ≤ k ≤
n − 1. The factors have 2 levels and orthogonal contrasts like the 2^k and
2^{k−f}_R designs. The PB(n) designs are resolution 3 designs, but the
confounding of main effects with 2 factor interactions is complex. The PB(n)
designs use n runs where n is a multiple of 4. The values n = 12, 20, 24, 28,
and 36 are especially common.
52) The PB(n) designs are usually used with main effects, so assume that
all interactions are insignificant. So they are main effects designs used to
screen k main effects when the number of runs n is small. Often k = n − 4,
n − 3, n − 2, or n − 1 is used. We will assume that the number of
replications m = 1.
53) If k = n − 1 there is no df for MSE. If k < n − 1, pool the last
J = n − 1 − k empty effects into the MSE with df = J as the full model.
This procedure is done before looking at the data, so is not data snooping.
run A B C D E F G H J K L
1 + - + - - - + + + - +
2 + + - + - - - + + + -
3 - + + - + - - - + + +
4 + - + + - + - - - + +
5 + + - + + - + - - - +
6 + + + - + + - + - - -
7 - + + + - + + - + - -
8 - - + + + - + + - + -
9 - - - + + + - + + - +
10 + - - - + + + - + + -
11 - + - - - + + + - + +
12 - - - - - - - - - - -
54) The contrast matrix for the PB(12) design is shown above. Again the
column of plusses corresponding to I is omitted. If k = 8 then effects A to
H are used but effects J, K, and L are empty. As a convention the mean
square and sum of squares for factor E will be denoted as MSe and SSe while
MSE = σ̂².
55) The PB(n) designs are k factor 2 level orthogonal designs. So finding
effects, MS, SS, least squares estimates, et cetera for PB(n) designs is
similar to finding the corresponding quantities for the 2^k and 2^{k−f}_R
designs.
56) For the PB(n) design, let c be a column from the table of contrasts
where + = 1 and − = −1. Let y be the column of responses since m = 1.
For k < n − 1, MSE can be found for the full model as in 53). MSE can also
be given or found by pooling insignificant MSs into the MSE, but the latter
method uses data snooping.
a) The effect corresponding to c is effect = c^T y/(n/2) = 2 c^T y/n.
b) The standard error for the effect is SE(effect) = sqrt(MSE/(n/4)) = sqrt(4 MSE/n).
c) SE(mean) = sqrt(MSE/n).
d) The sum of squares and mean square for an effect is
MS(effect) = SS(effect) = (n/4)(effect)².
57) For the PB(n) design, the least squares coefficient β̂ = 0.5 (effect).
The sum of squares for an x corresponding to an effect is equal to SS(effect).
SE(coef) = SE(β̂) = 0.5 SE(effect) = sqrt(MSE/n). Also SE(β̂_0) = sqrt(MSE/n).

8.5 Complements

Box et al. (2005) and Ledolter and Swersey (2007) are excellent references
for k factor 2 level orthogonal designs.
Suppose it is desired to increase the response Y and that A, B, C, . . . are
the k factors. The main effects for A, B, . . . measure
∂Y/∂A, ∂Y/∂B,
et cetera. The interaction effect AB measures
∂²Y/(∂A ∂B).
Hence
∂Y/∂A ≈ 0, ∂Y/∂B ≈ 0, and ∂²Y/(∂A ∂B) large
implies that the design is in the neighborhood of a maximum of a response
that looks like a ridge.
An estimated contrast is Ĉ = Σ_{i=1}^p d_i Ȳ_{i0}, and
SE(Ĉ) = sqrt(MSE Σ_{i=1}^p d_i²/n_i).
If d_i = ±1, p = 2^k, and n_i = m, then SE(Ĉ) = sqrt(MSE 2^k/m). For a 2^k
design, an effect can be written as a contrast with d_i = ±1/2^{k−1}, p = 2^k,
and n_i = m. Thus
SE(effect) = sqrt(MSE Σ_{i=1}^{2^k} (1/m)(1/2^{2k−2})) = sqrt(MSE/(m 2^{k−2})).

There is an algebra for computing confounding patterns for fractional
factorial designs. Let M be any single letter effect (A, B, C, et cetera), and
let I be the identity element. Then i) IM = M, ii) MM = I, and iii)
multiplication is commutative: LM = ML.
For a 2^{k−1}_R design, set one main effect equal to an interaction, e.g.
D = ABC. The equation D = ABC is called a generator. Note that DD = I =
DABC = ABCD. The equation I = ABCD is the generating relationship.
Then MI = M = ABCDM, so M is confounded or aliased with ABCDM.
So A = AI = AABCD = BCD and A is confounded with BCD. Similarly,
BD = BDI = BDABCD = AC, so BD is confounded with AC.
For a 2^{k−2}_R design, 2 main effects L and M are set equal to an
interaction. Thus L² = I and M² = I, but it is also true that L²M² = I. As
an illustration, consider the 2^{6−2}_IV design with E = ABC and F = BCD.
So E² = I = ABCE, F² = I = BCDF, and E²F² = I = ABCEBCDF = ADEF.
Hence the generating relationship I = ABCE = BCDF = ADEF has 3
words, and each effect is confounded with 3 other effects. For example,
AI = AABCE = ABCDF = AADEF or A = BCE = ABCDF = DEF.
For a 2^{k−f}_R design, f main effects L_1, . . . , L_f are set equal to
interactions. There are (f choose 1) equations of the form L_i² = I,
(f choose 2) equations of the form L_i² L_j² = I, (f choose 3) equations of
the form L_{i1}² L_{i2}² L_{i3}² = I, . . . , (f choose f) equations of the
form L_1² L_2² ··· L_f² = I. These equations give a generating relationship
with 2^f − 1 words, so each effect is confounded with 2^f − 1 other effects.
If the generating relationship is I = W_1 = W_2 = ··· = W_{2^f−1}, then the
resolution R is equal to the length of the smallest word. So I = ABC and
I = ABCE = ABC = ADEF both have R = 3.
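Since MM = I for every letter, multiplication of words is the symmetric difference of their letter sets, and the alias pattern can be computed mechanically. The Python sketch below (illustrative, not from the text's software) reproduces the 2^{6−2}_IV example with generating relationship I = ABCE = BCDF = ADEF.

```python
# Sketch of the confounding algebra: a word is a set of letters, I is the
# empty set, and multiplication is symmetric difference since M*M = I.
def multiply(*words):
    """Multiply words in the effect algebra; returns 'I' for the identity."""
    out = set()
    for w in words:
        out ^= set(w)  # symmetric difference cancels repeated letters
    return "".join(sorted(out)) or "I"

# Generating relationship for the 2^{6-2}_IV design with E = ABC, F = BCD:
words = ["ABCE", "BCDF", multiply("ABCE", "BCDF")]  # third word is ADEF
aliases = sorted(multiply("A", w) for w in words)
# A is aliased with BCE, ABCDF, and DEF, matching the text.
```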
The convention is to ignore 3 way or higher order interactions. So the alias
pattern of the k main effects and the (k choose 2) 2 way interactions with
other main effects and 2 way interactions is of interest.

8.6 Problems

Problems with an asterisk * are especially important.

Output for 8.1: Residual Standard Error=2.8284


R-Square=0.9763 F-statistic (df=7, 8)=47.054 pvalue=0
Estimate Std.Err t-value Pr(>|t|)
Intercept 64.25 0.7071 90.8632 0.0000
x1 11.50 0.7071 16.2635 0.0000
x2 -2.50 0.7071 -3.5355 0.0077
x3 0.75 0.7071 1.0607 0.3198
x12 0.75 0.7071 1.0607 0.3198
x13 5.00 0.7071 7.0711 0.0001
x23 0.00 0.7071 0.0000 1.0000
x123 0.25 0.7071 0.3536 0.7328

8.1. From the above least squares output, what is the AB effect?

I A B C AB AC BC ABC    Y
+ - - - +  +  +  -   3.81
+ + - - -  -  +  +   4.28
+ - + - -  +  -  +   3.74
+ + + - +  -  -  -   4.10
+ - - + +  -  -  +   3.75
+ + - + -  +  -  -   3.66
+ - + + -  -  +  -   3.82
+ + + + +  +  +  +   3.68

8.2. Ledolter and Swersey (2007, pp. 108–109) describe a 2^3 experiment
designed to increase subscriptions of the magazine Ladies' Home Journal. The
2005 campaign made 8 brochures containing an order card. Each brochure was
mailed to 15042 households, and the response Y was the percentage of orders.
Factor A was front side of order card with (−1) highlighting "Double our Best
Offer" and (+1) highlighting "We never had a bigger sale." Factor B was back
side of order card with (−1) emphasizing "Two extra years free," while (+1)
featured magazine covers of a previous issue. Factor C was brochure cover
with (−1) featuring Kelly Ripa and (+1) Dr. Phil. Assume m = 1.
a) Find the A effect.
b) Find the C effect.
c) Find SSC = MSC.
d) If two of the three factors A, B, and C are active, which is inactive?

I  A  B  C  AB AC BC ABC     y
+  -  -  -  +  +  +  -    86.8
+  +  -  -  -  -  +  +    85.9
+  -  +  -  -  +  -  +    79.4
+  +  +  -  +  -  -  -    60.0
+  -  -  +  +  -  -  +    94.6
+  +  -  +  -  +  -  -    85.4
+  -  +  +  -  -  +  -    84.5
+  +  +  +  +  +  +  +    80.3

8.3. The above table of 2^3 contrasts is for 2^{5-2}_{III} data.
a) Estimate the B effect.
b) Estimate the D effect.

8.4. Suppose that for 2^3 data with m = 2, the MSE = 407.5625. Find
SE(effect).

I  A  B  C  AB AC BC ABC     y
+  -  -  -  +  +  +  -    63.6
+  +  -  -  -  -  +  +    76.8
+  -  +  -  -  +  -  +    60.3
+  +  +  -  +  -  -  -    80.3
+  -  -  +  +  -  -  +    67.2
+  +  -  +  -  +  -  -    71.3
+  -  +  +  -  -  +  -    68.3
+  +  +  +  +  +  +  +    74.3
divisor 8  4  4  4  4  4  4  4

8.5. Ledolter and Swersey (2007, p. 131) describe a 2^{7-4}_{III} data set shown
with the table of 2^3 contrasts above. Estimate the D effect.

I  A  B  C  AB AC BC ABC    y
+  -  -  -  +  +  +  -    32
+  +  -  -  -  -  +  +    35
+  -  +  -  -  +  -  +    28
+  +  +  -  +  -  -  -    31
+  -  -  +  +  -  -  +    48
+  +  -  +  -  +  -  -    39
+  -  +  +  -  -  +  -    28
+  +  +  +  +  +  +  +    29
divisor 8  4  4  4  4  4  4  4

8.6. Kuehl (1994, pp. 361-366) describes a 2^3 experiment designed to
investigate the effects of furnace temperature (1840 or 1880°F), heating time
(23 or 25 sec), and transfer time (10 or 12 sec) on the quality of a leaf spring
used for trucks. (The response Y was a measure of the quality.) The table of
contrasts is shown above.
a) Find the A effect.
b) Find the B effect.
c) Find the AB effect.
d) If m = 1, find SSA.
e) If m = 1, find SSB.
f) If m = 1, find SSAB.
g) If m = 2 and MSE = 9, find SE(effect).
(The SE is the same regardless of the effect.)
h) Suppose high Y = y is desirable. If two of the factors A, B, and C are
inert and one is active, then which is active and which are inert? (Hint: look
at the 4 highest values of y. Is there a pattern?)
i) If one of the factors has an interaction with the active factor, what is
the interaction (e.g. AB, AC, or BC)?


8.7. Suppose the B effect = 5, SE(effect) = 2, and dfe = 8.
i) Find a 95% confidence interval for the B effect.
ii) Is the B effect significant? Explain briefly.
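For intervals of this type, the usual form is effect ± t(1 - δ/2, dfe) SE(effect). A Python sketch of that arithmetic (illustrative; the critical value t(0.975, 8) ≈ 2.306 is taken from a t table):

```python
effect = 5.0
se_effect = 2.0
t_crit = 2.306   # t_{0.975} with dfe = 8, from a t table

lo = effect - t_crit * se_effect
hi = effect + t_crit * se_effect
print(round(lo, 3), round(hi, 3))   # 0.388 9.612

# the interval excludes 0, so such an effect is significant at the 5% level
print(lo > 0)                        # True
```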
R (along with 1 SAS and 1 Minitab) Problems

Use the command source("G:/lregpack.txt") to download the functions
and the command source("G:/lregdata.txt") to download the data.
See Preface or Section 14.1. Typing the name of the R function, e.g.
aov, will display the code for the function. Use the args command, e.g.
args(aov), to display the needed arguments for the function. For some of
the following problems, the R commands can be copied and pasted from
(http://lagrange.math.siu.edu/Olive/lreghw.txt) into R.

8.8. Copy the Box et al. (2005, p. 199) product development data from
(http://lagrange.math.siu.edu/Olive/lregdata.txt) into R.
Then type the following commands.

out <- aov(conversion~K*Te*P*C,devel)


summary(out)

a) Include the output in Word.


b) What are the five effects with the biggest mean squares?
Note: an AB interaction is denoted by A:B in R.

8.9. Get the SAS program for this problem from (http://lagrange.math.siu.
edu/Olive/lreghw.txt). The data is the pilot plant example from Box et al.
(2005, pp. 177186). The response variable is Y = yield, while the three
predictors (T = temp, C = concentration, K = catalyst) are at two levels.
a) Print out the output but do not turn in the first page.
b) Do the residual and response plots look ok?

8.10. Get the data for this problem. The data is the pilot plant example
from Box et al. (2005, pp. 177186) examined in Problem 8.9. Minitab needs
the levels for the factors and the interactions.
Highlight the data and use the menu commands Edit>Copy. In Minitab,
use the menu command Edit>PasteCells. After a window appears, click on
ok.
Below C1 type "A", below C2 type "B", below C3 type "C", and below
C8 type "yield".
a) Use the menu command STAT>ANOVA>Balanced Anova put
yield in the responses box and
A|B|C
in the Model box. Click on Storage. When a window appears, click on
Fits and Residuals. Then click on OK. This window will disappear.
Click on OK.
b) Next highlight the bottom 8 lines and use the menu commands
Edit>Delete Cells. Then the data set does not have replication. Use the
menu command STAT>ANOVA>Balanced Anova put yield in the re-
sponses box and
A B C A*C
in the Model box. Click on Storage. When a window appears, click on
Fits and Residuals. Then click on OK. This window will disappear.
Click on OK.
(The model A|B|C would have resulted in an error message, not enough
data.)
c) Print the output by clicking on the top window and then clicking on
the printer icon.
d) Make a response plot with the menu commands Graph>Plot with
yield in the Y box and FIT2 in the X box. Print by clicking on the printer
icon.
e) Make a residual plot with the menu commands Graph>Plot with
RESI2 in the Y box and FIT2 in the X box. Print by clicking on the printer
icon.
f) Do the plots look ok?

8.11. Get the R code and data for this problem from
(http://lagrange.math.siu.edu/Olive/lreghw.txt). The data is the pilot plant
example from Box et al. (2005, pp. 177186) examined in Problems 8.9 and
8.10.
a) Copy and paste the code into R. Then copy and paste the output into
Notepad. Print out the page of output.
b) The least squares estimate = coefficient for x1 is half the A effect. So
what is the A effect?

8.12. a) Obtain the R program twocub from
(http://lagrange.math.siu.edu/Olive/lregpack.txt). To get the effects, mean
squares, and SE(effect) for the Box et al. (2005, p. 177) pilot plant data, type
the following commands and include the output in Word.

mns <- c(60,72,54,68,52,83,45,80)


twocub(mns,m=2,MSE=8)

b) Which effects appear to be significant from the QQ plot? (Match the
effects on the plot with the output on the screen.)

8.13. Box et al. (2005, p. 237) describe a 2^{4-1}_{IV} fractional factorial design.
Assuming that you downloaded the twocub function in the previous problem,
type the following R commands.
mns <- c(20,14,17,10,19,13,14,10)
twocub(mns,m=1)
a) Include the output in Word, print out the output, and label the effects
on the output with the corresponding effects from a 2^{4-1}_{IV} fractional factorial
design.
b) Include the QQ plot in Word. Print out the plot. Which effects (from
the fractional factorial design) seem to be significant?

8.14. a) Download lregpack into R, and type the following commands.

mns <- c(14,16,8,22,19,37,20,38,1,8,4,10,12,30,13,30)


twofourth(mns)

This is the Ledolter and Swersey (2007, p. 80) cracked pots 2^4 data, and
the response and residual plots are from the model without 3 and 4 factor
interactions.
b) Copy the plots into Word and print the plots. Do the response and
residual plots look ok?

8.15. Download lregpack into R. The data is the PB(12) example from
Box et al. (2005, p. 287).
a) Type the following commands. Copy and paste the QQ plot into Word
and print the plot.

resp <- c(56,93,67,60,77,65,95,49,44,63,63,61)


pb12(resp,k=5)

b) Copy and paste the output into Notepad and print the output.
c) As a 2^5 design, the effects B, D, BD, E, and DE were thought to be real.
The PB(12) design works best when none of the interactions is significant.
From the QQ plot and the output for the PB(12) design, which factors, if
any, appear to be significant?
d) The output gives the A, B, C, D, and E effects along with the corresponding
least squares coefficients β̂1, . . . , β̂5. What is the relationship
between the coefficients and the effects?
For parts e) to g), act as if the PB(12) design with 5 factors is
appropriate.
e) The full model has Ŷ = β̂0 + β̂1 x1 + β̂2 x2 + β̂3 x3 + β̂4 x4 + β̂5 x5. The
reduced model is Ŷ = β̂0 + β̂j xj where xj is the significant term found in c).
Give the numerical formula for the reduced model.
f) Compute Ŷ using the full model if xi = 1 for i = 1, . . . , 5. Then compute
Ŷ using the reduced model if xj = 1.
g) If the goal of the experiment is to produce large values of Y, should
xj = 1 or xj = -1 in the reduced model? Explain briefly.
Chapter 9
More on Experimental Designs

This chapter considers split plot designs briefly and reviews the ten designs
considered in Chapter 5 to Section 9.1. The one and two way Anova designs,
completely randomized block design, and split plot designs are the building
blocks for more complicated designs. Some split plot designs can be written as
a linear model, Y = x^T β + e, but the errors are dependent with a complicated
correlation structure.

9.1 Split Plot Designs

Definition 9.1. Split plot designs have two units. The large units are
called whole plots and contain blocks of small units called subplots. The
whole plots get assigned to factor A while the subplots get assigned to factor
B (randomly if the units are experimental but not randomly if the units are
observational). A and B are crossed so the AB interaction can be studied.

The split plot design depends on how whole plots are assigned to A. Three
common methods are described below, and methods a) and b) are described
in more detail in the following subsections. The randomization and split plot
ANOVA table depend on the design used for assigning the whole plots to
factor A.
a) The whole plots are assigned to A completely at random, as in a one
way Anova.
b) The whole plots are assigned to A and to a blocking variable as in a
completely randomized block design (if the whole plots are experimental, but
a complete block design is used if the whole plots are observational).
c) The whole plots are assigned to A, to row blocks, and to column blocks
as in a Latin square.
The key feature of a split plot design is that there are two units of different
sizes: one size for each of the 2 factors of interest. The larger units are assigned
to A. The large units contain blocks of small units assigned to factor B. Also
factors A and B are crossed.

9.1.1 Whole Plots Randomly Assigned to A

Shown below is the split plot ANOVA table when the whole plots are assigned
to factor A as in a one way Anova design. The whole plot error is error(W) and
can be obtained as an A*replication interaction. The subplot error is error(S).
FA = M SA/M SEW, FB = M SB/M SES, and FAB = M SAB/M SES. R
computes the three test statistics and pvalues correctly, but for SAS FA and
the pvalue pA need to be computed using MSA, MSEW, dfA , and dfeW ob-
tained from the ANOVA table. Sometimes error(W) is also denoted as
residuals. There are ma whole plots, and each whole plot contains b sub-
plots. Thus there are mab subplots. As always, the pvalue column actually
gives pval, an estimate of the pvalue.

Source                 df           SS    MS    F    p-value
A                      a-1          SSA   MSA   FA   pA
error(W) or A*repl     a(m-1)       SSEW  MSEW
B                      b-1          SSB   MSB   FB   pB
AB                     (a-1)(b-1)   SSAB  MSAB  FAB  pAB
residuals or error(S)  a(m-1)(b-1)  SSES  MSES

The tests of interest for this split plot design are nearly identical to those of
a two way Anova model. Yijk has i = 1, . . . , a, j = 1, . . . , b, and k = 1, . . . , m.
Keep A and B in the model if there is an AB interaction.
a) The 4 step test for AB interaction is
i) Ho: there is no interaction HA : there is an interaction.
ii) FAB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that there is an interaction between A
and B, otherwise fail to reject Ho and conclude that there is no interaction
between A and B. (Or there is not enough evidence to conclude that there is
an interaction.)
b) The 4 step test for A main eects is
i) Ho: μ100 = · · · = μa00  HA: not Ho.
ii) FA is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of A, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of A. (Or there is not enough evidence to
conclude that the response depends on the level of A.)

c) The 4 step test for B main eects is


i) Ho: μ010 = · · · = μ0b0  HA: not Ho.
ii) FB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of B, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of B. (Or there is not enough evidence to
conclude that the response depends on the level of B.)

Source df SS MS F p-value
variety 7 763.16 109.02 1.232 0.3421
MSEW 16 1415.83 88.49
treatment 3 30774.3 10258.1 423.44 0.00
variety*treatment 21 2620.1 124.8 5.150 0.00
error(S) 48 1162.8 24.2

Example 9.1. This split plot data is from Chambers and Hastie (1993,
p. 158). There were 8 varieties of guayule (rubber plant) and 4 treatments
were applied to seeds. The response was the rate of germination. The whole
plots were greenhouse flats and the subplots were 4 subplots of the flats. Each
flat received seeds of one variety (A). Each subplot contained 100 seeds and
was treated with one of the treatments (B). There were m = 3 replications
so each variety was planted in 3 flats for a total of 24 flats and 4(24) = 96
observations.
Factorial crossing: Variety and treatments (A and B) are crossed since all
combinations of variety and treatment occur. Hence the AB interaction can
be measured.
Blocking: The whole plots are the 24 greenhouse flats. Each flat is a block
of 4 subplots. Each of the 4 subplots gets one of the 4 treatments.
Randomization: The 24 flats are assigned to the 8 varieties completely at
random. Use the sample(24) command to generate a random permutation.
The first 3 numbers of the permutation get variety one, the next 3 get variety
2, . . . , the last 3 get variety 8. Use the sample(4) command 24 times, once
for each flat. If 2, 4, 1, 3 was the permutation for the ith flat, then the 1st
subplot gets treatment 3, the 2nd gets treatment 1, the 3rd gets treatment
4, and the 4th subplot gets treatment 2.

a) Perform the test corresponding to A.


b) Perform the test corresponding to B.
c) Perform the test corresponding to AB.

Solution: a) Ho: μ100 = · · · = μ800  Ha: not Ho


FA = 1.232
pval = 0.3421
Fail to reject Ho, the mean rate of germination does not depend on va-
riety. (This test would make more sense if there was no variety * treatment
interaction.)
b) Ho: μ010 = · · · = μ040  Ha: not Ho
FB = 423.44
pval = 0.00
Reject Ho, the mean rate of germination depends on treatment.
c) Ho: no interaction Ha: there is an interaction
FAB = 5.15
pval = 0.00
Reject Ho, there is a variety * treatment interaction.
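The F statistics used in Example 9.1 can be recomputed from the SS and df columns of the ANOVA table. A short arithmetic check in Python (illustrative; small differences from the printed F values come from rounding in the SS shown):

```python
# Mean squares are SS/df from the Example 9.1 ANOVA table.
msa  = 763.16 / 7     # variety (factor A)
msew = 1415.83 / 16   # whole plot error
msb  = 30774.3 / 3    # treatment (factor B)
msab = 2620.1 / 21    # variety*treatment
mses = 1162.8 / 48    # subplot error

# Split plot F ratios: A is tested against the whole plot error,
# while B and AB are tested against the subplot error.
fa  = msa / msew
fb  = msb / mses
fab = msab / mses
print(round(fa, 3), round(fb, 2), round(fab, 2))   # 1.232 423.45 5.15
```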

9.1.2 Whole Plots Assigned to A as in a CRBD

Shown below is the split plot ANOVA table when the whole plots are
assigned to factor A and a blocking variable as in a completely random-
ized block design. The whole plot error is error(W) and can be obtained as
a block*A interaction. The subplot error is error(S). FA = M SA/M SEW,
FB = M SB/M SES, and FAB = M SAB/M SES. Factor A has a levels
and factor B has b levels. There are r blocks of a whole plots. Each whole
plot contains b subplots, and each block contains a whole plots and thus ab
subplots. Hence there are ra whole plots and rab subplots.
SAS computes the last two test statistics and pvalues correctly, and the
last line of SAS output gives FA and the pvalue pA . The initial line of output
for A is not correct. The output for blocks is probably not correct.

Source               df           SS    MS    F    p-value
blocks               r-1
A                    a-1          SSA   MSA   FA   pA
error(W) or block*A  (r-1)(a-1)   SSEW  MSEW
B                    b-1          SSB   MSB   FB   pB
AB                   (a-1)(b-1)   SSAB  MSAB  FAB  pAB
error(S)             a(r-1)(b-1)  SSES  MSES

The tests of interest for this split plot design are nearly identical to those
of a two way Anova model. Yijk has i = 1, . . . , a, j = 1, . . . , b and k = 1, . . . , r.
Keep A and B in the model if there is an AB interaction.
a) The 4 step test for AB interaction is
i) Ho: there is no interaction HA : there is an interaction.
ii) FAB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that there is an interaction between A
and B, otherwise fail to reject Ho and conclude that there is no interaction
between A and B. (Or there is not enough evidence to conclude that there is
an interaction.)
b) The 4 step test for A main eects is
i) Ho: μ100 = · · · = μa00  HA: not Ho.
ii) FA is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of A, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of A. (Or there is not enough evidence to
conclude that the response depends on the level of A.)
c) The 4 step test for B main eects is
i) Ho: μ010 = · · · = μ0b0  HA: not Ho.
ii) FB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of B, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of B. (Or there is not enough evidence to
conclude that the response depends on the level of B.)

Source df SS MS F p-value
Block 5 4.150 0.830
Variety 2 0.178 0.089 0.65 0.5412
Block*Variety 10 1.363 0.136
Date 3 1.962 0.654 23.39 0.00
Variety*Date 6 0.211 0.035 1.25 0.2973
error(S) 45 1.259 0.028

Example 9.2. The ANOVA table above is for the Snedecor and Cochran
(1967, pp. 369-372) split plot data where the whole plots are assigned to
factor A and to blocks in a completely randomized block design. Factor A =
variety of alfalfa (ladak, cossack, ranger). Each field had two cuttings, with
the second cutting on July 7, 1943. Factor B = date of third cutting (none,
Sept. 1, Sept. 20, Oct. 7) in 1943. The response variable was yield (tons per
acre) in 1944. The 6 blocks were fields of land divided into 3 plots of land,
one for each variety. Each of these 3 plots was divided into 4 subplots for
date of third cutting. So each block had 3 whole plots and 12 subplots.
a) Perform the test corresponding to A.
b) Perform the test corresponding to B.
c) Perform the test corresponding to AB.

Solution: a) Ho: μ100 = · · · = μ300  Ha: not Ho


FA = 0.65
pval = 0.5412
Fail to reject Ho, the mean yield does not depend on variety.

b) Ho: μ010 = · · · = μ040  Ha: not Ho


FB = 23.39
pval = 0.0
Reject Ho, the mean yield depends on cutting date.
c) Ho: no interaction Ha: there is an interaction
FAB = 1.25
pval = 0.2973
Fail to reject Ho, there is no interaction between variety and cutting date.

Warning: Although the split plot model can be written as a linear model,
the errors are not iid and have a complicated correlation structure. It is also
difficult to get fitted values and residuals from the software, so the model
can't be easily checked with response and residual plots. These facts make
the split plot model very hard to use for most researchers.

9.2 Review of the DOE Models

The three basic principles of DOE (design of experiments) are
i) use randomization to assign treatments to units.
ii) Use factorial crossing to compare the effects (main effects, pairwise
interactions, . . . , J-fold interaction) of J ≥ 2 factors. If A1, . . . , AJ are the
factors with li levels for i = 1, . . . , J, then there are l1 l2 · · · lJ treatments
where each treatment uses exactly one level from each factor.
iii) Blocking is used to divide units into blocks of similar units where
similar means the units are likely to have similar values of the response
when given the same treatment. Within each block, randomly assign units to
treatments.
Next the 10 designs of Chapter 5 to Section 9.1 are summarized. If the
randomization cannot be done as described, then much stronger assumptions
on the data are needed for inference to be approximately correct. There are
three common ways of assigning units. For inference, i) requires the least
assumptions and iii) the most.
i) Experimental units are randomly assigned.
ii) Observational units are a random sample of units from a population
of units. Each combination of levels determines a population. So a two way
Anova design has ab populations.
iii) Units (such as time slots) can be assigned systematically due to con-
straints (e.g., physical, cost, or time constraints).

I) One way Anova: Factor A has p levels.
a) For a fixed effects one way Anova model, the levels are fixed.
b) For a random effects one way Anova model, the levels are a random
sample from a population of levels.
Randomization: Let n = m1 + · · · + mp and do the sample(n) command.
Assign the first m1 units to treatment (level) 1, the next m2 units to
treatment 2, . . . , the last mp units to treatment p.
II) Two way Anova: Factor A has a levels and factor B has b levels. The
two factors are crossed, forming ab cells.
Randomization: Let n = mab and do the sample(n) command. Randomly
assign m units to each of the ab cells. Assign the rst m units to the (A, B) =
(1, 1) cell, the next m units to the (1,2) cell, . . . , the last m units to the (a, b)
cell.
III) k way Anova: There are k factors A1, . . . , Ak with a1, . . . , ak levels,
respectively. The k factors are crossed, forming a1 a2 · · · ak cells.
Randomization: Let n = m a1 a2 · · · ak and do the sample(n) command.
Randomly assign m units to each cell. Each cell is a combination of levels,
so the (1, 1, . . . , 1, 1) cell gets the 1st m units.
IV) Completely randomized block design: Factor A has k levels (treat-
ments), and there are b blocks (a blocking variable has b levels) of k units.
Randomization: Let n = kb and do the sample(k) command b times.
Within each block of k units, randomly assign 1 unit to each treatment.
V) Latin squares: Factor A has a levels (treatments), the row blocking
variable has a blocks of a units, and the column blocking variable has a blocks
of a units. There are a^2 units since the row and column blocking variables are
crossed. The treatment factor, row blocking variable, and column blocking
variable are also crossed. A Latin square is such that each of the a treatments
occurs once in each row and once in each column.
Randomization: Pick an a a Latin square. Use the sample(a) command
to assign row levels to numbers 1 to a. Use the sample(a) command to assign
column levels to numbers 1 to a. Use the sample(a) command to assign
treatment levels to the first a capital letters. If possible, use the sample(a^2)
command to assign units, 1 unit to each cell of the Latin square.
VI) 2^k factorial design: There are k factors, each with 2 levels.
Randomization: Let n = m 2^k and do the sample(n) command. Randomly
assign m units to each cell. Each cell corresponds to a run which is determined
by a string of k +'s and -'s corresponding to the k main effects.
VII) 2^{k-f}_R fractional factorial design: There are k factors, each with 2
levels.
Randomization: Let n = 2^{k-f} and do the sample(n) command. Randomly
assign 1 unit to each run which is determined by a string of k +'s and -'s
corresponding to the k main effects.
VIII) Plackett Burman PB(n) design: There are k factors, each with 2
levels.

Randomization: Let n = 4J for some J. Do the sample(n) command.
Randomly assign 1 unit to each run which is a string of n - 1 +'s and -'s.
(Each run corresponds to a row in the design matrix, so we are ignoring the
column of 1's corresponding to I in the design matrix.)
IX) Split plot design where the whole plots are assigned to A as in a one
way Anova design: The whole plot factor A has a levels and each whole plot
is a block of b subplots used to study factor B which has b levels. Split plot
designs have two types of units: the whole plots are the larger units and the
subplots are the smaller units.
Randomization: a) Suppose there are n = ma whole plots. Randomly
assign m whole plots to each level of A with the sample(n) command. Assign
the first m units (whole plots) to treatment (level) 1, the next m units to
treatment 2, . . . , the last m units to treatment a.
b) Do the sample(b) command ma times, once for each whole plot. Within
each whole plot, randomly assign 1 subplot (unit) to each of the b levels of B.
X) Split plot design where the whole plots are assigned to A and a blocking
variable as in a completely randomized block design: The whole plot factor A
has a levels and each whole plot is a block of b subplots used to study factor
B which has b levels. Split plot designs have two types of units: the whole
plots are the larger units and the subplots are the smaller units. There are
also r blocks of a whole plots. Each whole plot has b subplots. Thus there are
ra whole plots and rab subplots.
Randomization: a) Do the sample(a) command r times, once for each
block. For each block of a whole plots, randomly assign 1 whole plot to each
of the a levels of A.
b) Do the sample(b) command ra times, once for each whole plot. Within
each whole plot, randomly assign 1 subplot to each of the b levels of B.
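The randomization for design IX) can be sketched in Python, with random.sample standing in for R's sample() command; the sizes a, b, and m below are purely illustrative.

```python
import random

random.seed(1)      # for a reproducible illustration
a, b, m = 3, 4, 2   # levels of A, levels of B, replications
n = m * a           # number of whole plots

# a) randomly assign m whole plots to each of the a levels of A
perm = random.sample(range(1, n + 1), n)
whole_plots = {level: perm[m * (level - 1): m * level]
               for level in range(1, a + 1)}

# b) within each whole plot, randomly assign one subplot
#    to each of the b levels of B
subplots = {wp: random.sample(range(1, b + 1), b)
            for wp in range(1, n + 1)}

print(whole_plots)
print(subplots)
```

Design X) differs only in step a): within each of the r blocks, one whole plot is randomly assigned to each level of A.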

Try to become familiar with the designs and their randomization so that
you can recognize a design given a story problem.

Example 9.3. Cobb (1998, pp. 200-212) describes an experiment on


weight gain for baby pigs. The response Y was the average daily weight
gain in pounds for each piglet (over a period of time). Factor A consisted of
0 mg of an antibiotic or 40 mg of an antibiotic while factor B consisted of 0
mg of vitamin B12 or 5 mg of B12. Hence there were 4 diets (A, B) = (0,0),
(40,0), (0,5) or (40,5). If there were 12 piglets and 3 were randomly assigned
to each diet, what type of experimental design was used?
Solution: A and B are crossed with each combination of (A, B) levels
forming a diet. So the two way Anova (or 22 factorial) design was used.

Example 9.4. In 2008, a PhD student was designing software to analyze


a complex image. 100 portions of the image need to be analyzed correctly,
and the response variable is the proportion of errors. Sixteen test images
were available and thought to be representative. The goal was to achieve


an average error rate of less than 0.3 if many images were examined. The
student had identified 3 factors to reduce the error rate. Each factor had 2
levels. Thus there were 8 versions of the software that analyze images.
The student selected a single test image and ran a 2^3 design with 8 time
slots as units. Factor A was active but factors B and C were inert. When
A was at the (+) level the error rate was about 0.27. Briey explain why
this experiment did not give much information about how the software will
behave on many images.
Solution: More images are needed, 1 image is not enough.
(A better design is a completely randomized block design that uses each of
the 16 images as a block and factor A = software version with 8 levels. The
units for the block are 8 time slots so each of the 8 versions of the software
is tested on each test image.)

9.3 Summary

1) The analysis of the response, not that of the residuals, is of primary im-
portance. The response plot can be used to analyze the response in the back-
ground of the tted model. For linear models such as experimental designs,
the estimated mean function is the identity line and should be added as a
visual aid.
2) Assume that the residual degrees of freedom are large enough for testing.
Then the response and residual plots contain much information. Linearity and
constant variance may be reasonable if the plotted points scatter about the
identity line in a (roughly) evenly populated band. Then the residuals should
scatter about the r = 0 line in an evenly populated band. It is easier to check
linearity with the response plot and constant variance with the residual plot.
Curvature is often easier to see in a residual plot, but the response plot can
be used to check whether the curvature is monotone or not. The response plot
is more effective for determining whether the signal to noise ratio is strong
or weak, and for detecting outliers, influential cases, or a critical mix.
3) The three basic principles of DOE (design of experiments) are
i) use randomization to assign units to treatments.
ii) Use factorial crossing to compare the effects (main effects, pairwise
interactions, . . . , J-fold interaction) for J ≥ 2 factors. If A1, . . . , AJ are the
factors with li levels for i = 1, . . . , J, then there are l1 l2 · · · lJ treatments
where each treatment uses exactly one level from each factor.
iii) Blocking is used to divide units into blocks of similar units where
similar means the units are likely to have similar values of the response
when given the same treatment. Within each block randomly assign units to
treatments.

4) Split plot designs have two units. The large units are called whole plots
and contain blocks of small units called subplots. The whole plots get assigned
to factor A while the subplots get assigned to factor B (randomly if the units
are experimental but not randomly if the units are observational). A and B
are crossed so the AB interaction can be studied.
5) The split plot design depends on how whole plots are assigned to A.
Three common methods are a) the whole plots are assigned to A completely
at random, as in a one way Anova, b) the whole plots are assigned to A
and to a blocking variable as in a completely randomized block design (if the
whole plots are experimental, a complete block design is used if the whole
plots are observational), c) the whole plots are assigned to A, to row blocks,
and to column blocks as in a Latin square.
6) The split plot ANOVA table when whole plots are assigned to levels of
A as in a one way Anova is shown below. The whole plot error is error(W) and
can be obtained as an A*replication interaction. The subplot error is error(S).
FA = M SA/M SEW, FB = M SB/M SES, and FAB = M SAB/M SES. R
computes the three test statistics and pvalues correctly, but for SAS FA
and the pvalue pA need to be computed using MSA, MSEW, dfA , and dfeW
obtained from the ANOVA table.

Source                 df           SS    MS    F    p-value
A                      a-1          SSA   MSA   FA   pA
error(W) or A*repl     a(m-1)       SSEW  MSEW
B                      b-1          SSB   MSB   FB   pB
AB                     (a-1)(b-1)   SSAB  MSAB  FAB  pAB
residuals or error(S)  a(m-1)(b-1)  SSES  MSES

7) The tests of interest corresponding to 6) are nearly identical to those of


a two way Anova model. Yijk has i = 1, . . . , a, j = 1, . . . , b, and k = 1, . . . , m.
Keep A and B in the model if there is an AB interaction.
a) The 4 step test for AB interaction is
i) Ho: there is no interaction HA : there is an interaction.
ii) FAB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that there is an interaction between A
and B, otherwise fail to reject Ho and conclude that there is no interaction
between A and B.
b) The 4 step test for A main eects is
i) Ho: μ100 = · · · = μa00  HA: not Ho.
ii) FA is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of A, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of A.

c) The 4 step test for B main eects is


i) Ho: μ010 = · · · = μ0b0  HA: not Ho.
ii) FB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of B, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of B.
8) The split plot ANOVA table when whole plots are assigned to levels
of A as in a completely randomized block design is shown below. The whole
plot error is error(W) and can be obtained as a block*A interaction. The
subplot error is error(S). FA = M SA/M SEW, FB = M SB/M SES, and
FAB = M SAB/M SES. SAS computes the last two test statistics and pvalues
correctly, and the last line of SAS output gives FA and the pvalue pA . The
initial line of output for A is not correct. The output for blocks is probably
not correct.

Source                df                SS    MS    F    p-value
blocks                r - 1
A                     a - 1             SSA   MSA   FA   pA
error(W) or block*A   (r - 1)(a - 1)    SSEW  MSEW
B                     b - 1             SSB   MSB   FB   pB
AB                    (a - 1)(b - 1)    SSAB  MSAB  FAB  pAB
error(S)              a(r - 1)(b - 1)   SSES  MSES

9) The tests of interest corresponding to 8) are nearly identical to those
of a two way Anova model and point 7). Yijk has i = 1, . . . , a, j = 1, . . . , b,
and k = 1, . . . , r. Keep A and B in the model if there is an AB interaction.
a) The 4 step test for AB interaction is
i) Ho: there is no interaction HA : there is an interaction.
ii) FAB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that there is an interaction between A
and B, otherwise fail to reject Ho and conclude that there is no interaction
between A and B.
b) The 4 step test for A main effects is
i) Ho: μ100 = · · · = μa00 HA: not Ho.
ii) FA is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of A, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of A.
c) The 4 step test for B main effects is
i) Ho: μ010 = · · · = μ0b0 HA: not Ho.
ii) FB is obtained from output.
iii) The pval is obtained from output.
iv) If pval ≤ δ, reject Ho and conclude that the mean response depends on the
level of B, otherwise fail to reject Ho and conclude that the mean response
does not depend on the level of B.

9.4 Complements

See Robinson et al. (2009) for a comparison of completely randomized designs,
completely randomized block designs, and split plot designs. Some history of
experimental designs is given by Box (1980, 1984). Also see David (1995,
2006–7) and Hahn (1982).
The importance of DOE is discussed in Gelman (2005), and a review is
given by Steinberg and Hunter (1984). For experiments done as class projects,
see Hunter (1977).

9.5 Problems

Source df SS MS F p-value
Block 2 77.55 38.78
Method 2 128.39 64.20 7.08 0.0485
Block*Method 4 36.28 9.07
Temp 3 434.08 144.69 41.94 0.00
Method*Temp 6 75.17 12.53 2.96 0.0518
error(S) 12 50.83 4.24

9.1. The ANOVA table above is for the Montgomery (1984, pp. 386–389)
split plot data where the whole plots are assigned to factor A and to blocks
in a completely randomized block design. The response variable is tensile
strength of paper. Factor A is (preparation) method with 3 levels (1, 2, 3).
Factor B is temperature with 4 levels (200, 225, 250, 275). The pilot plant
can make 12 runs a day and the experiment is repeated each day, with days
as blocks. A batch of pulp is made by one of the 3 preparation methods. Then
the batch of pulp is divided into 4 samples, and each sample is cooked at one
of the four temperatures.
a) Perform the test corresponding to A.
b) Perform the test corresponding to B.
c) Perform the test corresponding to AB.
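As a check on the printed table, the whole plot F statistic FA = MSA/MSEW and its p-value can be reproduced in R from the mean squares shown above (a minimal sketch; the numbers below are read off the Problem 9.1 ANOVA table):

```r
MSA <- 64.20; MSEW <- 9.07   # Method and Block*Method mean squares from the table
FA <- MSA / MSEW             # whole plot F statistic, about 7.08
dfA <- 2                     # Method df
dfeW <- 4                    # Block*Method df
pA <- 1 - pf(FA, dfA, dfeW)  # right tail F probability, about 0.0485
```

The computed values agree with the F and p-value entries printed in the Method row of the table.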

Source df SS MS F p-value
Block 1 0.051 0.051
Nitrogen 3 37.32 12.44 29.62 0.010
Block*Nitrogen 3 1.26 0.42
Thatch 2 3.82 1.91 9.10 0.009
Nitrogen*Thatch 6 4.15 0.69 3.29 0.065
error(S) 12 1.72 0.21

9.2. The ANOVA table above is for the Kuehl (1994, pp. 473–481) split
plot data where the whole plots are assigned to factor A and to blocks in
a completely randomized block design. The response variable is the average
chlorophyll content (mg/gm of turf grass clippings). Factor A is nitrogen
fertilizer with 4 levels (1, 2, 3, 4). Factor B is length of time that thatch was
allowed to accumulate with 3 levels (2, 5, or 8 years).
There were 2 blocks of 4 whole plots to which the levels of factor A were
assigned. The 2 blocks formed a golf green which was seeded with turf grass.
The 8 whole plots were plots of golf green. Each whole plot had 3 subplots
to which the levels of factor B were randomly assigned.
a) Perform the test corresponding to A.
b) Perform the test corresponding to B.
c) Perform the test corresponding to AB.

Source df SS MS F p-value
Block 5 4.150 0.830
Variety 2 0.178 0.089 0.65 0.5412
Block*Variety 10 1.363 0.136
Date 3 1.962 0.654 23.39 0.00
Variety*Date 6 0.211 0.035 1.25 0.2973
error(S) 45 1.259 0.028

9.3. The ANOVA table above is for the Snedecor and Cochran (1967, pp.
369–372) split plot data where the whole plots are assigned to factor A and to
blocks in a completely randomized block design. Factor A = variety of alfalfa
(ladak, cossack, ranger). Each field had two cuttings, with the second cutting
on July 7, 1943. Factor B = date of third cutting (none, Sept. 1, Sept. 20,
Oct. 7) in 1943. The response variable was yield (tons per acre) in 1944. The
6 blocks were fields of land divided into 3 plots of land, one for each variety.
Each of these 3 plots was divided into 4 subplots for date of third cutting.
So each block had 3 whole plots and 12 subplots.
a) Perform the test corresponding to A.
b) Perform the test corresponding to B.
c) Perform the test corresponding to AB.

9.4. Following Montgomery (1984, pp. 386–389), suppose the response
variable is tensile strength of paper. One factor is (preparation) method with
3 levels (1, 2, 3). Another factor is temperature with 4 levels (200, 225,
250, 275).
a) Suppose the pilot plant can make 12 runs a day and the experiment is
repeated each day, with days as blocks. A batch of pulp is made by one of
the 3 preparation methods. Then the batch of pulp is divided into 4 samples,
and each sample is cooked at one of the four temperatures. Which factor,
method or temperature, is assigned to subplots?
b) Suppose the pilot plant could make 36 runs in one day. Suppose that 9
batches of pulp are made, that each batch of pulp is divided into 4 samples,
and each sample is cooked at one of the four temperatures. How should the 9
batches be allocated to the three preparation methods, and how should the
4 samples be allocated to the four temperatures?
c) Suppose the pilot plant can make 36 runs in one day and that the units
are 36 batches of material to be made into pulp. Each of the 12 method-temperature
combinations is to be replicated 3 times. What type of experimental
design should be used? (Hint: not a split plot.)

R and SAS Problems

Use the command source("G:/lregpack.txt") to download the functions
and the command source("G:/lregdata.txt") to download the data.
See Preface or Section 14.1. Typing the name of the R function, e.g. aov,
will display the code for the function. Use the args command, e.g. args(aov),
to display the needed arguments for the function. For some of the following
problems, the R commands and SAS programs can be copied and pasted
from (http://lagrange.math.siu.edu/Olive/lreghw.txt) into R.

9.5. a) Download (http://lagrange.math.siu.edu/Olive/lregdata.txt) into
R, and type the following commands. Then copy and paste the output into
Notepad and print the output.
attach(guay)
out<-aov(plants~variety*treatment + Error(flats),guay)
summary(out)
detach(guay)
This split plot data is from Chambers and Hastie (1993, p. 158). There are
8 varieties of guayule (rubber plant) and 4 treatments were applied to seeds.
The response was the rate of germination. The whole plots were greenhouse
flats and the subplots were subplots of the flats. Each flat received seeds of
one variety (A). Each subplot contained 100 seeds and was treated with one
of the treatments (B). There were m = 3 replications so each variety was
planted in 3 flats for a total of 24 flats and 4(24) = 96 observations.
b) Use the output to test whether the response depends on variety.

9.6. Download (http://lagrange.math.siu.edu/Olive/lregdata.txt) into R,
and type the following commands. Then copy and paste the output into
Notepad and print the output.

attach(steel)
out<-aov(resistance~heat*coating + Error(wplots),steel)
summary(out)
detach(steel)

This split plot steel data is from Box et al. (2005, p. 336). The whole plots
are time slots to use a furnace, which can hold 4 steel bars at one time. Factor
A = heat has 3 levels (360, 370, 380°F). Factor B = coating has 4 levels
(4 types of coating: c1, c2, c3, and c4). The response was corrosion resistance.
a) Perform the test corresponding to A.
b) Perform the test corresponding to B.
c) Perform the test corresponding to AB.

9.7. This is the same data as in Problem 9.6, using SAS. Copy and paste
the SAS program from (http://lagrange.math.siu.edu/Olive/lrsashw.txt)
into SAS, run the program, then print the output. Only include the second
page of output.
To get the correct F statistic for heat, you need to divide MS heat by MS
wplots.

9.8. a) Copy and paste the SAS program from
(http://lagrange.math.siu.edu/Olive/lrsashw.txt) into SAS, run the program,
then print the output. Only include the second page of output.
This data is from the SAS Institute (1985, pp. 131–132). The B and AB
ANOVA table entries are correct, but the correct entry for A is the last line
of output where Block*A is used as the error.
b) Perform the test corresponding to A.
c) Perform the test corresponding to B.
d) Perform the test corresponding to AB.
Chapter 10
Multivariate Models

The multivariate location and dispersion model is a special case of the
multivariate linear regression model when the design matrix is equal to the
vector of ones: X = 1. See Chapter 12. (Similarly, the location model is a
special case of the multiple linear regression model. See Section 2.9.1.) The
multivariate normal and elliptically contoured distributions are important
parametric models for the multivariate location and dispersion model. The
multivariate normal distribution is useful in the large sample theory of the
linear model, covered in Chapter 11, while elliptically contoured distributions
are useful for multivariate linear regression. Section 3.4.1 used prediction re-
gions for iid multivariate data to bootstrap hypothesis tests.

Definition 10.1. An important multivariate location and dispersion
model is a joint distribution with joint pdf

f(z|μ, Σ)

for a p × 1 random vector x that is completely specified by a p × 1 population
location vector μ and a p × p symmetric positive definite population dispersion
matrix Σ. Thus P(x ∈ A) = ∫_A f(z) dz for suitable sets A.

The multivariate location and dispersion model is in many ways similar to


the multiple linear regression model. The data are iid vectors from some dis-
tribution such as the multivariate normal (MVN) distribution. The location
parameter of interest may be the mean or the center of symmetry of an
elliptically contoured distribution. Hyperellipsoids will be estimated instead
of hyperplanes, and Mahalanobis distances will be used instead of absolute
residuals to determine if an observation is a potential outlier.
Assume that X1, . . . , Xn are n iid p × 1 random vectors and that the
joint pdf of X1 is f(z|μ, Σ). Also assume that the data Xi = xi has been
observed and stored in an n × p matrix


W = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}
  = \begin{pmatrix} x_{1,1} & x_{1,2} & \dots & x_{1,p} \\
                    x_{2,1} & x_{2,2} & \dots & x_{2,p} \\
                    \vdots  & \vdots  & \ddots & \vdots \\
                    x_{n,1} & x_{n,2} & \dots & x_{n,p} \end{pmatrix}
  = \begin{pmatrix} v_1 & v_2 & \dots & v_p \end{pmatrix}

where the ith row of W is x_i^T and the jth column is v_j. Each column v_j of
W corresponds to a variable. For example, the data may consist of n visitors
to a hospital where the p = 2 variables height and weight of each individual
were measured.
There are some differences in the notation used in multiple linear regression
and multivariate location dispersion models. Notice that W could be used
as the design matrix in multiple linear regression although usually the first
column of the regression design matrix is a vector of ones. The n × p design
matrix in the multiple linear regression model was denoted by X, and x_i^T
was the ith row of X. In the multivariate location dispersion model, X and
X_i will be used to denote a p × 1 random vector with observed value x_i,
and x_i^T is the ith row of the data matrix W. Johnson and Wichern (1988,
pp. 7, 53) use X to denote the n × p data matrix and an n × 1 random
vector, relying on the context to indicate whether X is a random vector or
data matrix. Software tends to use different notation. For example, R will
use commands such as
var(x)
to compute the sample covariance matrix of the data. Hence x corresponds
to W, x[,1] is the first column of x, and x[4, ] is the 4th row of x.
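For concreteness, this notation can be illustrated with a small made-up data matrix (the values are hypothetical):

```r
# A made-up data matrix with n = 5 cases (rows) and p = 2 variables (columns).
x <- matrix(c(1, 2,
              2, 4,
              3, 6,
              4, 8,
              5, 10), ncol = 2, byrow = TRUE)
var(x)   # p x p sample covariance matrix of the data
x[, 1]   # first column = first variable
x[4, ]   # 4th row = 4th case
```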

10.1 The Multivariate Normal Distribution

Definition 10.2: Rao (1965, p. 437). A p × 1 random vector X has a
p-dimensional multivariate normal distribution Np(μ, Σ) iff t^T X has a uni-
variate normal distribution for any p × 1 vector t.

If Σ is positive definite, then X has a pdf

f(z) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} e^{-(1/2)(z-\mu)^T \Sigma^{-1} (z-\mu)}     (10.1)

where |Σ|^{1/2} is the square root of the determinant of Σ. Note that if p = 1,
then the quadratic form in the exponent is (z − μ)(σ²)^{−1}(z − μ) and X has
the univariate N(μ, σ²) pdf. If Σ is positive semidefinite but not positive
definite, then X has a degenerate distribution. For example, the univariate
N(0, 0²) distribution is degenerate (the point mass at 0).
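As a quick numerical check of (10.1), when Σ is diagonal the MVN pdf factors into a product of univariate normal pdfs. A minimal R sketch with made-up values:

```r
# Evaluate the Np(mu, Sigma) pdf (10.1) at z and compare with the product of
# univariate normal densities; Sigma is diagonal, so the two must agree.
p <- 2
mu <- c(0, 1)
Sigma <- diag(c(4, 9))                     # variances 4 and 9
z <- c(1, 2)
quad <- drop(t(z - mu) %*% solve(Sigma) %*% (z - mu))
f <- exp(-quad / 2) / ((2 * pi)^(p / 2) * sqrt(det(Sigma)))
fprod <- dnorm(1, mean = 0, sd = 2) * dnorm(2, mean = 1, sd = 3)
c(f, fprod)   # equal
```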

Definition 10.3. The population mean of a random p × 1 vector X =
(X1, . . . , Xp)^T is
E(X) = (E(X1), . . . , E(Xp))^T
and the p × p population covariance matrix

Cov(X) = E(X − E(X))(X − E(X))^T = ((σij)).

That is, the ij entry of Cov(X) is Cov(Xi, Xj) = σij.

The covariance matrix is also called the variance–covariance matrix and
variance matrix. Sometimes the notation Var(X) is used. Note that Cov(X)
is a symmetric positive semidefinite matrix. If X and Y are p × 1 random
vectors, a a conformable constant vector, and A and B are conformable
constant matrices, then

E(a + X) = a + E(X) and E(X + Y) = E(X) + E(Y)     (10.2)

and
E(AX) = AE(X) and E(AXB) = AE(X)B.     (10.3)
Thus
Cov(a + AX) = Cov(AX) = ACov(X)A^T.     (10.4)
Some important properties of multivariate normal (MVN) distributions are
given in the following three propositions. These propositions can be proved
using results from Johnson and Wichern (1988, pp. 127–132).

Proposition 10.1. a) If X ∼ Np(μ, Σ), then E(X) = μ and
Cov(X) = Σ.

b) If X ∼ Np(μ, Σ), then any linear combination t^T X = t1X1 + · · · +
tpXp ∼ N1(t^T μ, t^T Σ t). Conversely, if t^T X ∼ N1(t^T μ, t^T Σ t) for every p × 1
vector t, then X ∼ Np(μ, Σ).
c) The joint distribution of independent normal random variables
is MVN. If X1, . . . , Xp are independent univariate normal N(μi, σi²) random
variables, then X = (X1, . . . , Xp)^T is Np(μ, Σ) where μ = (μ1, . . . , μp)^T and
Σ = diag(σ1², . . . , σp²) (so the off diagonal entries σij = 0 while the diagonal
entries of Σ are σii = σi²).
d) If X ∼ Np(μ, Σ) and if A is a q × p matrix, then AX ∼
Nq(Aμ, AΣA^T). If a is a p × 1 vector of constants and b is a constant, then
a + bX ∼ Np(a + bμ, b²Σ). (Note that bX = bIpX with A = bIp.)
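Part d) is easy to use numerically. A sketch in R with made-up μ, Σ, and A, computing the mean and covariance matrix of AX implied by the proposition:

```r
# Made-up population quantities for X ~ N3(mu, Sigma).
mu <- c(1, 2, 3)
Sigma <- matrix(c(4, 1, 0,
                  1, 2, 1,
                  0, 1, 3), nrow = 3, byrow = TRUE)
A <- matrix(c(1, 1,  0,
              0, 1, -1), nrow = 2, byrow = TRUE)
muA <- drop(A %*% mu)            # mean of AX is A mu
SigmaA <- A %*% Sigma %*% t(A)   # covariance of AX is A Sigma A^T
```

Here AX ∼ N2(muA, SigmaA) by Proposition 10.1 d).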

It will be useful to partition X, μ, and Σ. Let X1 and μ1 be q × 1 vectors,
let X2 and μ2 be (p − q) × 1 vectors, let Σ11 be a q × q matrix, let Σ12
be a q × (p − q) matrix, let Σ21 be a (p − q) × q matrix, and let Σ22 be a
(p − q) × (p − q) matrix. Then

X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \quad
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \mbox{and} \quad
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.

Proposition 10.2. a) All subsets of an MVN are MVN: (Xk1,
. . . , Xkq)^T ∼ Nq(μ̃, Σ̃) where μ̃i = E(Xki) and Σ̃ij = Cov(Xki, Xkj). In
particular, X1 ∼ Nq(μ1, Σ11) and X2 ∼ Np−q(μ2, Σ22).
b) If X1 and X2 are independent, then Cov(X1, X2) = Σ12 =
E[(X1 − E(X1))(X2 − E(X2))^T] = 0, a q × (p − q) matrix of zeroes.
c) If X ∼ Np(μ, Σ), then X1 and X2 are independent iff Σ12 = 0.
d) If X1 ∼ Nq(μ1, Σ11) and X2 ∼ Np−q(μ2, Σ22) are independent, then

\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim
N_p\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},
\begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix} \right).

Proposition 10.3. The conditional distribution of an MVN is
MVN. If X ∼ Np(μ, Σ), then the conditional distribution of X1 given
that X2 = x2 is multivariate normal with mean μ1 + Σ12Σ22^{−1}(x2 − μ2) and
covariance matrix Σ11 − Σ12Σ22^{−1}Σ21. That is,

X_1 | X_2 = x_2 \sim N_q(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\;
\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}).

Example 10.1. Let p = 2 and let (Y, X)^T have a bivariate normal distri-
bution. That is,

\begin{pmatrix} Y \\ X \end{pmatrix} \sim
N_2\left( \begin{pmatrix} \mu_Y \\ \mu_X \end{pmatrix},
\begin{pmatrix} \sigma_Y^2 & \mathrm{Cov}(Y,X) \\
\mathrm{Cov}(X,Y) & \sigma_X^2 \end{pmatrix} \right).

Also recall that the population correlation between X and Y is given by

\rho(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{VAR}(X)}\sqrt{\mathrm{VAR}(Y)}}
= \frac{\sigma_{X,Y}}{\sigma_X \sigma_Y}

if σX > 0 and σY > 0. Then Y |X = x ∼ N(E(Y |X = x), VAR(Y |X = x))
where the conditional mean

E(Y|X = x) = \mu_Y + \mathrm{Cov}(Y,X) \frac{1}{\sigma_X^2} (x - \mu_X)
= \mu_Y + \rho(X,Y) \sqrt{\frac{\sigma_Y^2}{\sigma_X^2}} (x - \mu_X)

and the conditional variance

\mathrm{VAR}(Y|X = x) = \sigma_Y^2 - \mathrm{Cov}(X,Y) \frac{1}{\sigma_X^2} \mathrm{Cov}(X,Y)

= \sigma_Y^2 - \rho(X,Y) \sqrt{\frac{\sigma_Y^2}{\sigma_X^2}}\,
\rho(X,Y) \sqrt{\sigma_X^2} \sqrt{\sigma_Y^2}

= \sigma_Y^2 - \rho^2(X,Y)\, \sigma_Y^2 = \sigma_Y^2 [1 - \rho^2(X,Y)].

Also aX + bY is univariate normal with mean aμX + bμY and variance
a²σX² + b²σY² + 2ab Cov(X, Y).
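Plugging made-up numbers into these bivariate formulas in R (all parameter values below are hypothetical):

```r
# Made-up bivariate normal parameters for (Y, X).
muY <- 2; muX <- 1
sigY <- 3; sigX <- 2
rho <- 0.5
covXY <- rho * sigX * sigY                    # Cov(X, Y) = rho sigmaX sigmaY
x <- 3
condmean <- muY + covXY * (x - muX) / sigX^2  # E(Y | X = x)
condvar  <- sigY^2 * (1 - rho^2)              # VAR(Y | X = x)
```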

Remark 10.1. There are several common misconceptions. First, it is not
true that every linear combination t^T X of normal random variables
is a normal random variable, and it is not true that all uncorrelated
normal random variables are independent. The key condition in Propo-
sition 10.1b and Proposition 10.2c is that the joint distribution of X is MVN.
It is possible that X1, X2, . . . , Xp each has a marginal distribution that is uni-
variate normal, but the joint distribution of X is not MVN. The following
example is from Rohatgi (1976, p. 229). Suppose that the joint pdf of X and
Y is a mixture of two bivariate normal distributions both with EX = EY = 0
and VAR(X) = VAR(Y) = 1, but Cov(X, Y) = ±ρ. Hence f(x, y) =

\frac{1}{2}\, \frac{1}{2\pi\sqrt{1-\rho^2}}
\exp\left( \frac{-1}{2(1-\rho^2)} (x^2 - 2\rho xy + y^2) \right) +

\frac{1}{2}\, \frac{1}{2\pi\sqrt{1-\rho^2}}
\exp\left( \frac{-1}{2(1-\rho^2)} (x^2 + 2\rho xy + y^2) \right)
\equiv \frac{1}{2} f_1(x, y) + \frac{1}{2} f_2(x, y)

where x and y are real and 0 < ρ < 1. Since both marginal distributions
of fi(x, y) are N(0,1) for i = 1 and 2 by Proposition 10.2 a), the marginal
distributions of X and Y are N(0,1). Since ∫∫ xy fi(x, y) dx dy = ρ for i = 1
and −ρ for i = 2, X and Y are uncorrelated, but X and Y are not independent
since f(x, y) ≠ fX(x)fY(y).
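One can check numerically at a single point that the mixture density is not the product of its N(0,1) marginals. A sketch in R with ρ = 0.5, evaluating at the made-up point (1, 1):

```r
# Bivariate normal pdf with N(0,1) marginals and correlation r.
bvn <- function(x, y, r)
  exp(-(x^2 - 2 * r * x * y + y^2) / (2 * (1 - r^2))) / (2 * pi * sqrt(1 - r^2))
rho <- 0.5
fmix  <- 0.5 * bvn(1, 1, rho) + 0.5 * bvn(1, 1, -rho)  # mixture pdf at (1, 1)
fprod <- dnorm(1) * dnorm(1)                           # product of the marginals
c(fmix, fprod)   # close but not equal: X and Y are dependent
```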

Remark 10.2. In Proposition 10.3, suppose that X = (Y, X2, . . . , Xp)^T.
Let X1 = Y and X2 = (X2, . . . , Xp)^T. Then E[Y |X2] = β1 + β2X2 + · · · +
βpXp and VAR[Y |X2] = σ² is a constant that does not depend on X2. Hence
Y |X2 = β1 + β2X2 + · · · + βpXp + e follows the multiple linear regression
model where e ∼ N(0, σ²). Here β1 = E(Y) − β^T E(X2), β = Σ_XX^{−1} Σ_XY, and
σ² = σY² − Σ_YX Σ_XX^{−1} Σ_XY where Σ_XX = Cov(X2) and Σ_XY = Cov(X2, Y).
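These population regression quantities can be computed directly. A sketch in R with made-up second moments for (Y, X2, X3):

```r
# Made-up second moments for (Y, X2, X3).
EY <- 1
EX <- c(2, 3)                                     # E(X2)
sigY2 <- 4                                        # VAR(Y)
SigmaXX <- matrix(c(2, 1,
                    1, 3), nrow = 2, byrow = TRUE)  # Cov(X2)
SigmaXY <- c(1, 0)                                # Cov(X2, Y)
beta <- solve(SigmaXX, SigmaXY)                   # slope vector
beta1 <- EY - sum(beta * EX)                      # intercept
sigma2 <- sigY2 - sum(SigmaXY * beta)             # error variance
```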

10.2 Elliptically Contoured Distributions

Definition 10.4: Johnson (1987, pp. 107–108). A p × 1 random vector X
has an elliptically contoured distribution, also called an elliptically symmetric
distribution, if X has joint pdf

f(z) = k_p |\Sigma|^{-1/2} g[(z - \mu)^T \Sigma^{-1} (z - \mu)],     (10.5)

and we say X has an elliptically contoured ECp(μ, Σ, g) distribution.



If X has an elliptically contoured (EC) distribution, then the characteristic
function of X is

\phi_X(t) = \exp(i t^T \mu) \psi(t^T \Sigma t)     (10.6)

for some function ψ. If the second moments exist, then

E(X) = μ     (10.7)

and
Cov(X) = c_X Σ     (10.8)
where
c_X = −2ψ′(0).

Definition 10.5. The population squared Mahalanobis distance

U ≡ D² = D²(μ, Σ) = (X − μ)^T Σ^{−1} (X − μ).     (10.9)

For elliptically contoured distributions, U has pdf

h(u) = \frac{\pi^{p/2}}{\Gamma(p/2)} k_p u^{p/2-1} g(u).     (10.10)
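The squared distance (10.9), with μ and Σ replaced by given values, is computed by R's mahalanobis function in the stats package. A quick check against the definition with made-up values:

```r
mu <- c(0, 0)
Sigma <- matrix(c(2, 1,
                  1, 2), nrow = 2)
x <- c(1, -1)
D2 <- mahalanobis(x, center = mu, cov = Sigma)     # (x - mu)^T Sigma^{-1} (x - mu)
D2hand <- drop(t(x - mu) %*% solve(Sigma) %*% (x - mu))
c(D2, D2hand)   # both 2
```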

For c > 0, an ECp(μ, cI, g) distribution is spherical about μ where I is
the p × p identity matr