
Lecture Slides on Mixed Models

Based on

A Course in Mixed Models for Use in


Animal Health and Animal Welfare Research

Søren Højsgaard & Erik Jørgensen

Biometry Research Unit


Danish Institute of Agricultural Sciences
Research Centre Foulum

October 18, 2001


1 Preface

In the spring of 2001 the Biometry Research group at the Danish Institute of Agricultural Sciences
arranged a course in Mixed Models for researchers at the Department of Animal Health and
Animal Welfare at the same institute. The course consisted of a combination of lectures, group
exercises, written assignments and a final project report based on data from experiments that
the project participants were involved in.
During the course, the book SAS System for Mixed Models by Littell et al. (1996) was used,
referred to as LMSW in the present document. It was necessary to supplement the book with
additional theoretical material and examples based on data from the research institute. This
led to a large collection of slides used for the presentations.
This supplementary material is compiled in the present document. We hope the readers will
find it useful. Perhaps the online version1 of this document will be even more useful, because of
the hypertext facilities.

Søren Højsgaard & Erik Jørgensen


sorenh@agrsci.dk Erik.Jorgensen@agrsci.dk

Biometry Research Unit


Danish Institute of Agricultural Sciences
Research Centre Foulum
P.O. Box 50
DK-8830 Tjele

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/HSVmixed2001Slides.pdf

Contents

1 Preface 3

Contents 9

2 Overview of slides 11

3 Basic Concepts from Linear algebra 13


Why Linear Algebra?? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Linear Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
n–dimensional Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Linear Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Linear dependence and independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Projections onto Linear Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4 Linear normal models 39


Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Linear Normal Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Random Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Functions of Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
The Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
The Distribution of a LNM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
The Expectation in a LNM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Representations of Models in SAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Least Squares Estimation in a LNM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Estimation on matrix form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
The parameter vector β . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Estimability and Contrasts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Estimability in SAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Least Squares Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Calculating things in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

5 Some Basic Statistical Concepts 97


Data and Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Why the Normal Distribution is so “Normal” . . . . . . . . . . . . . . . . . . . . . . . 101


The Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102


Some General Principles of Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
How good is an estimator? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Consistency of Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Desirable Properties of Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . 112
The Method of Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . 113
The Likelihood function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
The Maximum likelihood principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
How Good is the Estimate? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
The Asymptotic Normal Distribution of the MLE . . . . . . . . . . . . . . . . . . . . . 122
Asymptotical normality of transformations of the MLE . . . . . . . . . . . . . . . . . . 125
Tests of Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
How to get the asymptotic normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

6 An overview 137
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Darwin's maize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Galton's approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
The correct approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
What has happened . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
The 5th pot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Population genetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Population genetics / animal breeding . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Mixed Models in general . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

7 Experimental planning and design 149


Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
The research process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Darwin's maize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Lice decision support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Research decision support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Design options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

8 Randomized Complete Block Design 157


Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Linear Normal Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Random vs. Fixed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
ML - estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Proc Mixed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Other examples of RCBD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Proc Mixed continued . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
IC - options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173


9 Randomized Complete Block Design II 175


Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
BLUEs and BLUPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
BLUP Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Model Check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

10 Split-Plot Experiments 183


The General Idea behind Split–Plot Experiments . . . . . . . . . . . . . . . . . . . . . 184
Variance and Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Comparing Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Inference Issues for Mixed Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Analysis of the Split–Plot Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Modelling the Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Three Technical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Back to the Original Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Unbalanced cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Satterthwaites approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
How Good is Satterthwaites Approximation . . . . . . . . . . . . . . . . . . . . . . . . 201
Two–sample Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Split–Plot Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Making the “right” tests with PROC MIXED . . . . . . . . . . . . . . . . . . . . . . . 204
A Severe Warning!! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Some Tentative Conclusions on Satterthwaite . . . . . . . . . . . . . . . . . . . . . . . 207
Random or Fixed Effects? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
Multilocation Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

11 Examples of Split-Plot Designs 213


Example: W. Schouten Ph.D. work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
Breed Effect on Production . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
Straw shortener . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
Group Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Herd Investigations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Multilocation trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219

12 Estimation and tests in mixed models 221


Maximum Likelihood and Linear Normal Models . . . . . . . . . . . . . . . . . . . . . 222
Maximum Likelihood Estimation in Mixed Models . . . . . . . . . . . . . . . . . . . . 225
Using ML or REML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Tests in Mixed Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

13 Complications concerning Variance Components 235


Sugar Beet example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Reason . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
Likelihood contour plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238


G not positive definite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239


Warning Satterthwaite goes wrong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Testing effects of random components . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

14 Repeated Measurements 245


Analyzing Repeated Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
Tacit Assumptions when using the Split–Plot Model . . . . . . . . . . . . . . . . . . . 248
Modelling of Covariances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
Types of random variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
Unstructured Covariance Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
The AR(1)–model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
How to estimate the autocorrelation?? . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
Compound Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
Which Covariance Structure to use? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
Numerical Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
What does the covariance structure mean for the conclusions? . . . . . . . . . . . . . . 263

15 Repeated Measurements: Covariance structures 265


Repeated statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
Types of variance structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
Unstructured . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
Autoregressive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
Antedependence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
Toeplitz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Heterogeneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
AR vs CS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273

16 Random Regression 275


The Basic Idea behind Random Regression . . . . . . . . . . . . . . . . . . . . . . . . 276
Analyzing the Individual Regression Coefficients . . . . . . . . . . . . . . . . . . . . . 279
Random Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
How to ... In SAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
Correlation structure in Random Regression Models . . . . . . . . . . . . . . . . . . . 285

17 Factor Structure Diagrams 289


Factor Structure Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
Two–way ANOVA with Replicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
Two–way ANOVA without Replicates . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
Block Experiments with Replicates within Blocks . . . . . . . . . . . . . . . . . . . . . 294
Block Experiments without Replicates within Blocks . . . . . . . . . . . . . . . . . . . 296
Split Plot Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297

18 Covariate Models and Multivariate Response 301


Example of the use of covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302


Model reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304


Table 5:1 LMSW page 5.2.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
SAS- Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
Feed vs daily gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Multivariate Responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
The Components of a MLNM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
How to ... In SAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
The general setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316

19 Heterogeneous Variance 319


Why Variance Heterogeneity is Important to Recognize . . . . . . . . . . . . . . . . . 320
Graphical Investigation of the Variance Structure . . . . . . . . . . . . . . . . . . . . . 321
Variance Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
The Delta–method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
Taylors Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
Applying Taylors Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
Transformation of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
Modelling Variance Heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
Heterogeneous Variance for Grouped Data . . . . . . . . . . . . . . . . . . . . . . . . . 334
Power–of–Mean for Data with Covariates . . . . . . . . . . . . . . . . . . . . . . . . . 340
On transformations, normal approximation and confidence intervals . . . . . . . . . . 345
Transformations and confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . 349

20 Variance heterogeneity: Example of the effect of transformation 355


Variance Homogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
Model of Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
Model comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
Treatment differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
Natural Scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363

21 Variance Homogeneity: Diurnal Variation 365


Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
Random Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
Model of mean ? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
Modelling variance inhomogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
SAS model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
Experience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370

22 Links to supplementary material 371

Bibliography 373

2 Overview of slides

The course was arranged in three blocks of lectures.

1. Brush-up concerning the necessary prerequisites of statistical concepts, linear algebra and
linear normal models. In addition, a historical review was given and experimental planning
was discussed. This covers Chapters 3-7.

2. This block of lectures covered the basic application of Mixed Models within the experi-
mental designs typically used at the Department of Animal Health and Animal Welfare.
That is

• randomized complete block designs (Chapters 8 and 9),
• split-plot designs (Chapters 10 and 11),
• repeated measurements (Chapters 14 and 15),
• random regression (Chapter 16),
• covariates and multivariate response (Chapter 18).

In addition, the fundamentals concerning estimation and tests in Mixed Models are discussed
in Chapter 12. The two remaining issues, numerical problems (Chapter 13) and factor
structure diagrams (Chapter 17), were included because of questions raised by the
participants. In practical examples some of the variance component estimates were very
often set to 0, leading to problems concerning the calculation of d.f. (i.e., with Satterthwaite's
approximation). This further raised a need for a more 'manual' approach towards
d.f. calculations in different designs.

3. In the final part of the course some additional topics and developments within Mixed
Models were presented, and efforts were made to give a general summary and overview of
the topics. Lectures concerning variance heterogeneity are presented in Chapters 19 and 20.
An example using the presented methods on data concerning diurnal variation is presented
in Chapter 21.
In addition, the preliminary work on the final project report was presented during this
final block.


The final chapter (22) of this book consists of links to supplementary material, mainly SAS
examples.
The exercises used in the course are not included but can be found by visiting the home page of
the course1
Finally, it should be mentioned that each chapter starts with a very short introduction to the
topic. In addition, a link to the full screen version of the presentation can be found.

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/HSVmixed2001.htm

3 Basic Concepts from Linear algebra

Linear algebra is an important prerequisite in order to understand the model formulation and
calculations within Mixed Models. The following slides served as a brush-up on the theory, with
a presentation of the most important concepts and results.
Link to the full screen presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/LinAlg.f.pdf


Why Linear Algebra??

• Many statistical models used in practice are assumed to have some


kind of a linear structure. (Linear regression and analysis of variance
are classical examples.)

• Linear algebra is the branch of mathematics that deals with linear


structures.

• Linear algebra is a convenient tool for handling models with linear


structures.

• Moreover, many concepts from linear algebra can be given


geometrical interpretation.

• Hence geometry can be a way to understand statistical models with


linear structures

Vectors

Vectors: A column vector is a list of numbers stacked on top of each other, e.g.
$$a = \begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix}$$
A row vector is a list of numbers written one after the other, e.g.
$$b = (2, 1, 3)$$
In both cases, the list is ordered, i.e.
$$(2, 1, 3) \neq (1, 2, 3).$$



• Note In what follows all vectors are column vectors unless otherwise stated.
In general an n–vector has the form
$$a = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}$$
where the $a_i$ are numbers.


Transpose of vectors: Transposing turns a column vector into a row vector and a row vector into a column vector. The transpose is denoted by "$\top$". For example,
$$a^\top = (a_1, a_2, \ldots, a_n)$$
Hence transposing twice takes us back to where we started:
$$a = (a^\top)^\top$$
• Example:
$$\begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix}^\top = (1, 3, 2) \quad \text{and} \quad (1, 3, 2)^\top = \begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix}$$


Multiplying a vector by a number: If $a$ is a vector and $\alpha$ is a number, then $\alpha a$ is the vector
$$\alpha a = \begin{pmatrix} \alpha a_1 \\ \alpha a_2 \\ \vdots \\ \alpha a_n \end{pmatrix}$$
• Example:
$$7 \begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 7 \\ 21 \\ 14 \end{pmatrix}$$

Sum of vectors: Let $a$ and $b$ be n–vectors. The sum $a + b$ is the n–vector
$$a + b = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix} = \begin{pmatrix} a_1 + b_1 \\ a_2 + b_2 \\ \vdots \\ a_n + b_n \end{pmatrix} = b + a$$
• Note Only vectors of the same dimension can be added!
• Example:
$$\begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix} + \begin{pmatrix} 2 \\ 8 \\ 9 \end{pmatrix} = \begin{pmatrix} 1+2 \\ 3+8 \\ 2+9 \end{pmatrix} = \begin{pmatrix} 3 \\ 11 \\ 11 \end{pmatrix}$$


Inner product of vectors: Let $a$ and $b$ be n–vectors. The inner product $a \cdot b$ is the number
$$a \cdot b = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n = \sum_{i=1}^n a_i b_i$$
• Note The product is a number – not a vector.
• Note Only vectors of the same dimension can be multiplied!
• Example:
$$\begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix} \cdot \begin{pmatrix} 2 \\ 8 \\ 9 \end{pmatrix} = 1 \cdot 2 + 3 \cdot 8 + 2 \cdot 9 = 44$$


The length (norm) of a vector: The length (or norm) of a vector $a$ is
$$\|a\| = \sqrt{a \cdot a} = \sqrt{\sum_{i=1}^n a_i^2}$$
The 0–vector and the 1–vector: The 0–vector (1–vector) is a vector with 0 (1) on all entries. The 0–vector (1–vector) is frequently written simply as $0$ ($1$) or as $0_n$ ($1_n$) to emphasize that it is of length $n$.
Orthogonal (perpendicular) vectors: Two vectors $a$ and $b$ with $a \neq 0$ and $b \neq 0$ are orthogonal if their inner product is zero, written
$$a \perp b \Leftrightarrow a \cdot b = 0$$
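These definitions are easy to check numerically. A minimal sketch in PROC IML (assuming SAS/IML is available; the two vectors are chosen only for illustration):

PROC IML;
   a  = {1, 1};                  /* column vectors                    */
   b  = {2, -2};
   ip = a` * b;                  /* inner product a . b               */
   na = sqrt(a` * a);            /* length (norm) of a                */
   PRINT ip na;                  /* ip = 0, so a and b are orthogonal */
QUIT;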

Matrices

Matrix: A matrix $A$ with $r$ rows and $c$ columns is an $r \times c$ table of the form
$$A = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1c} \\ a_{21} & a_{22} & \ldots & a_{2c} \\ \vdots & \vdots & \ddots & \vdots \\ a_{r1} & a_{r2} & \ldots & a_{rc} \end{pmatrix}$$
It is said that $A$ has the dimension $r \times c$.
• Note One can regard $A$ as consisting of $c$ column vectors put after each other:
$$A = [a_1 : a_2 : \cdots : a_c]$$
Transpose of matrices: A matrix is transposed by interchanging rows and columns, and the transpose is denoted by "$\top$". That is,
$$A^\top = \begin{pmatrix} a_{11} & a_{21} & \ldots & a_{r1} \\ a_{12} & a_{22} & \ldots & a_{r2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1c} & a_{2c} & \ldots & a_{rc} \end{pmatrix}$$
Example:
$$\begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix}^\top = \begin{pmatrix} 1 & 3 & 2 \\ 2 & 8 & 9 \end{pmatrix}$$


• Note If A is an r × c matrix then A> is a c × r matrix.


• Note One can regard a column vector of length r as an r × 1
matrix and a row vector of length c as a 1 × c matrix.


Multiplying a matrix by a number: For a number $\alpha$ and a matrix $A$, the product $\alpha A$ is the matrix
$$\alpha A = \begin{pmatrix} \alpha a_{11} & \alpha a_{12} & \ldots & \alpha a_{1c} \\ \alpha a_{21} & \alpha a_{22} & \ldots & \alpha a_{2c} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha a_{r1} & \alpha a_{r2} & \ldots & \alpha a_{rc} \end{pmatrix}$$
Example:
$$7 \begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix} = \begin{pmatrix} 7 & 14 \\ 21 & 56 \\ 14 & 63 \end{pmatrix}$$


Sum of matrices: Let $A = [a_1 : a_2 : \cdots : a_c]$ and $B = [b_1 : b_2 : \cdots : b_c]$ be $r \times c$ matrices.
The sum $A + B$ is the $r \times c$ matrix given by
$$A + B = [a_1 + b_1 : a_2 + b_2 : \cdots : a_c + b_c]$$
$$= \begin{pmatrix} a_{11} & \ldots & a_{1c} \\ \vdots & \ddots & \vdots \\ a_{r1} & \ldots & a_{rc} \end{pmatrix} + \begin{pmatrix} b_{11} & \ldots & b_{1c} \\ \vdots & \ddots & \vdots \\ b_{r1} & \ldots & b_{rc} \end{pmatrix} = \begin{pmatrix} a_{11}+b_{11} & \ldots & a_{1c}+b_{1c} \\ \vdots & \ddots & \vdots \\ a_{r1}+b_{r1} & \ldots & a_{rc}+b_{rc} \end{pmatrix} = B + A$$

• Note Only matrices with the same dimensions can be added.
Example:
$$\begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix} + \begin{pmatrix} 5 & 4 \\ 8 & 2 \\ 3 & 7 \end{pmatrix} = \begin{pmatrix} 6 & 6 \\ 11 & 10 \\ 5 & 16 \end{pmatrix}$$


Multiplication of a matrix and a vector: Let $A$ be an $r \times c$ matrix and let $b$ be a $c$–dimensional column vector. The product $Ab$ is the $r \times 1$ matrix
$$Ab = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1c} \\ \vdots & \vdots & \ddots & \vdots \\ a_{r1} & a_{r2} & \ldots & a_{rc} \end{pmatrix} \begin{pmatrix} b_1 \\ \vdots \\ b_c \end{pmatrix} = \begin{pmatrix} a_{11}b_1 + a_{12}b_2 + \cdots + a_{1c}b_c \\ \vdots \\ a_{r1}b_1 + a_{r2}b_2 + \cdots + a_{rc}b_c \end{pmatrix}$$
• Example:
$$\begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix} \begin{pmatrix} 5 \\ 8 \end{pmatrix} = \begin{pmatrix} 1\cdot 5 + 2\cdot 8 \\ 3\cdot 5 + 8\cdot 8 \\ 2\cdot 5 + 9\cdot 8 \end{pmatrix} = \begin{pmatrix} 21 \\ 79 \\ 82 \end{pmatrix}$$

Multiplication of matrices: Let $A$ be an $r \times c$ matrix and $B$ a $c \times t$ matrix, i.e. $B = [b_1 : b_2 : \cdots : b_t]$. The product $AB$ is the $r \times t$ matrix given by
$$AB = A[b_1 : b_2 : \cdots : b_t] = [Ab_1 : Ab_2 : \cdots : Ab_t]$$
Example:
$$\begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix} \begin{pmatrix} 5 & 4 \\ 8 & 2 \end{pmatrix} = \begin{pmatrix} 1\cdot5+2\cdot8 & 1\cdot4+2\cdot2 \\ 3\cdot5+8\cdot8 & 3\cdot4+8\cdot2 \\ 2\cdot5+9\cdot8 & 2\cdot4+9\cdot2 \end{pmatrix} = \begin{pmatrix} 21 & 8 \\ 79 & 28 \\ 82 & 26 \end{pmatrix}$$


• Note The product $AB$ can only be formed if the number of rows in $B$ equals the number of columns in $A$. In that case, $A$ and $B$ are said to be conformable.
• Note In general $AB$ and $BA$ are not identical.
A mnemonic for matrix multiplication is to write $B$ above and to the right of $A$; the entry in row $i$ and column $j$ of the product is then the inner product of the $i$th row of $A$ and the $j$th column of $B$:
$$\begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix} \begin{pmatrix} 5 & 4 \\ 8 & 2 \end{pmatrix} = \begin{pmatrix} 1\cdot5+2\cdot8 & 1\cdot4+2\cdot2 \\ 3\cdot5+8\cdot8 & 3\cdot4+8\cdot2 \\ 2\cdot5+9\cdot8 & 2\cdot4+9\cdot2 \end{pmatrix} = \begin{pmatrix} 21 & 8 \\ 79 & 28 \\ 82 & 26 \end{pmatrix}$$

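The product above can be checked numerically. A small PROC IML sketch (assuming SAS/IML is available), using the matrices from the example:

PROC IML;
   A = {1 2, 3 8, 2 9};          /* 3 x 2 matrix                          */
   B = {5 4, 8 2};               /* 2 x 2 matrix                          */
   C = A * B;                    /* gives {21 8, 79 28, 82 26} as above   */
   PRINT C;
QUIT;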
Special matrices:
• An n × n matrix is said to be a square matrix.
• A matrix with 0 on all entries is the 0–matrix and is often written simply as 0 (or as $0_{r \times c}$ to emphasize the dimension).
• A matrix consisting of 1s in all entries is often written J (or as $J_{r \times c}$ to emphasize the dimension).
• A square matrix with 0 on all off–diagonal entries and elements $d_1, d_2, \ldots, d_n$ on the diagonal is said to be a diagonal matrix and is often written diag$\{d_1, d_2, \ldots, d_n\}$.
• A diagonal matrix with 1s on the diagonal is called the identity matrix and is denoted I (or $I_{n \times n}$ to emphasize the dimension).
• A matrix A is symmetric if $A = A^\top$.


Some rules for matrix operations: For (conformable) matrices


A, B and C the following rules apply

(A + B)> = A> + B >

(AB)> = B >A>
A(B + C) = AB + AC
AB = AC 6⇒ B = C


Inverse of a matrix: The inverse of an n × n matrix A is the matrix


B (which is also n × n) which multiplied with A gives the identity
matrix I. That is,
AB = BA = I.
One says that B is A’s inverse and writes B = A−1.

• Note Only square matrices can have an inverse.

• Note Not all square matrices have an inverse.

• Note When the inverse exists, it is unique.

• Note Finding the inverse of a large matrix A is numerically


complicated.

Example 1. It is easy to find the inverse of a 2 × 2 matrix. When
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$
then the inverse is
$$A^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$
under the assumption that $ad - bc \neq 0$. The number $ad - bc$ is called the determinant of $A$, sometimes written $\det(A)$.
If the determinant $\det(A) = 0$, then $A$ has no inverse. fin

Example 2. Finding the inverse of a diagonal matrix is easy: Let
$$A = \begin{pmatrix} a_1 & 0 & \ldots & 0 \\ 0 & a_2 & & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \ldots & a_n \end{pmatrix}$$
where all $a_i \neq 0$. Then the inverse is
$$A^{-1} = \begin{pmatrix} \frac{1}{a_1} & 0 & \ldots & 0 \\ 0 & \frac{1}{a_2} & & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \ldots & \frac{1}{a_n} \end{pmatrix}$$
If one $a_i = 0$ then $A^{-1}$ does not exist. fin


Generalized inverse: Not all square matrices have an inverse.


However all square matrices have a generalized inverse.
A generalized inverse of a square matrix A is a matrix A− satisfying
that
AA−A = A

Any square matrix has an infinite number of generalized inverses.

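As a sketch of how such matrices can be computed in practice (assuming SAS/IML is available), the INV and GINV functions return the inverse and a (Moore–Penrose) generalized inverse, respectively; the matrices below are chosen only for illustration:

PROC IML;
   A    = {1 2, 3 4};            /* det(A) = -2, so the inverse exists    */
   Ainv = inv(A);
   B    = {1 2, 2 4};            /* singular: second row = 2 * first row  */
   Bg   = ginv(B);               /* a generalized inverse of B            */
   chk  = B * Bg * B;            /* equals B, as required of A A- A = A   */
   PRINT Ainv Bg chk;
QUIT;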

Linear Combinations

Let $a_1, a_2, \ldots, a_c$ be r–vectors and let $A = [a_1 : a_2 : \cdots : a_c]$ be the corresponding $r \times c$ matrix.
Let $v = (v_1, v_2, \ldots, v_c)^\top$ be a c–vector and let
$$x = Av = a_1 v_1 + a_2 v_2 + \cdots + a_c v_c = \sum_j a_j v_j$$
Then the r–vector $x$ is said to be a linear combination of $a_1, a_2, \ldots, a_c$.


Let $w = (w_1, w_2, \ldots, w_c)^\top$ be another c–vector and let correspondingly $y = Aw = \sum_j a_j w_j$.
Then the following can be noted:
• For a number $\alpha$ the vector $\alpha x = \alpha(Av) = A(\alpha v)$ is also a linear combination of $a_1, a_2, \ldots, a_c$.
• The sum $x + y = Av + Aw = A(v + w)$ is also a linear combination of $a_1, a_2, \ldots, a_c$.
• Hence if $x$ and $y$ are both linear combinations of $a_1, a_2, \ldots, a_c$ then so is the sum $\alpha x + \beta y$, where $\alpha$ and $\beta$ are numbers.

n–dimensional Spaces

A 2–vector x = (x1, x2) can be regarded as the point with


coordinates (x1, x2) in a 2–dimensional coordinate system, i.e. in the
plane.

Likewise a 3–vector x = (x1, x2, x3) can be regarded as the point


with coordinates (x1, x2, x3) in a 3–dimensional coordinate system,
i.e. in space.

In general an n–vector $x = (x_1, x_2, \ldots, x_n)$ can be regarded as the point with coordinates $(x_1, x_2, \ldots, x_n)$ in an n–dimensional coordinate system, i.e. in an n–dimensional space. Such a space will here be referred to as $R^n$. It is hard to draw!

To justify such n–dimensional spaces, suppose x consists of a


location of an object (that takes 3 coordinates), the temperature of
the object (that occupies one coordinate) and the time (that also
occupies one coordinate). Hence the total information about the
object can be regarded as a point in a 5–dimensional space.

Note that if $x$ and $y$ are both vectors in $R^n$ then so is the sum $\alpha x + \beta y$.


Linear Subspaces

Consider a set a1, a2, . . . , ac of r–vectors.

We can regard these vectors as “building blocks” for creating new


vectors as linear combinations of the building blocks. Any such
vector is an r–vector

The set of vectors which can be created as linear combinations of


the “building blocks” is called a linear subspace of Rr .

Such a space, let us call it L, is said to be spanned by a1, a2, . . . , ac


and we write L = span(a1, a2, . . . , ac).


Example 3. Consider the vectors
$$a_1 = \begin{pmatrix} 2 \\ 6 \\ 4 \end{pmatrix}, \quad a_2 = \begin{pmatrix} 1 \\ 5 \\ 7 \end{pmatrix}$$
Hence span$(a_1, a_2)$ is the set of vectors which can be written as
$$y = \begin{pmatrix} 2 \\ 6 \\ 4 \end{pmatrix} v_1 + \begin{pmatrix} 1 \\ 5 \\ 7 \end{pmatrix} v_2$$
for all possible choices of $v = (v_1, v_2)$. fin

More precisely, L consists of all vectors of the form
$$a_1 v_1 + a_2 v_2 + \cdots + a_c v_c$$
for all possible choices of c–vectors $v = (v_1, \ldots, v_c)$.
It is common to organize the building blocks as a matrix $A = [a_1 : \cdots : a_c]$. Then another way of describing L is as the set of vectors that can be written as $Av$, or more precisely
$$L = \{y \mid y = Av \text{ for all possible vectors } v\}$$
Frequently one uses the name span(A) for L.


There are some additional aspects of subspaces of which a few will


be illustrated:

Example 4. Consider again the subspace L = span(a1, a2) where

a1 = (2, 6, 4)> a2 = (1, 5, 7)>

• A question is whether all vectors y = (y1, y2, y3)> can be written


as y = a1v1 + a2v2?
The answer is “no”, for example y = (1, 5, 3) can not be written
in that form.

• Another question is whether there are other ways of representing


L?
The answer is “yes” – there are infinitely many. To pick one, let
b1 = a1 + a2 and b2 = a1 − a2. Then L = span(b1, b2).

f in

• Note The 0–vector belongs to all linear subspaces. (In the previous example one gets $y = 0$ by choosing $v = (0, 0)$.)


Linear dependence and independence

Linearly dependent vectors: A set of vectors $a_1, \ldots, a_c$ is linearly dependent if one of them can be written as a linear combination of the others, for example if
$$a_c = \sum_{j=1}^{c-1} a_j q_j$$
where the $q_j$s are numbers.
Linearly independent vectors: If none of the vectors $a_1, \ldots, a_c$ can be written as a linear combination of the others, the set is said to be linearly independent.
Throw–out technique: If one vector, say $a_c$, can be written as a linear combination of the other vectors, then it can be thrown away without changing the structure of the space, i.e.
$$\mathrm{span}(a_1, \ldots, a_c) = \mathrm{span}(a_1, \ldots, a_{c-1})$$
This process can go on until one ends up with a set of linearly independent vectors.
This allows us to find a representation of the space which is as simple (economical) as possible.


Example 5. Consider the vectors
$$a_1 = \begin{pmatrix} 2 \\ 6 \\ 4 \end{pmatrix}, \quad a_2 = \begin{pmatrix} 1 \\ 5 \\ 7 \end{pmatrix}, \quad a_3 = \begin{pmatrix} 0 \\ 2 \\ 5 \end{pmatrix} \quad \text{and} \quad x = \begin{pmatrix} 3 \\ 13 \\ 16 \end{pmatrix}$$
1. The vector $x$ is a linear combination of $a_1$, $a_2$ and $a_3$, since $x = a_1 + a_2 + a_3$.
2. Since $a_3 = a_2 - \frac{1}{2} a_1$, the $a_i$–vectors are linearly dependent. Consequently $x$ can be written as a linear combination of only $a_1$ and $a_2$, because $x = \frac{1}{2} a_1 + 2 a_2$.
3. The vectors $a_1, a_2$ are linearly independent, and so are the sets $a_1, a_3$ and $a_2, a_3$.

f in
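The relations in Example 5 can be verified numerically. A minimal PROC IML sketch (assuming SAS/IML is available):

PROC IML;
   a1 = {2, 6, 4};  a2 = {1, 5, 7};  a3 = {0, 2, 5};  x = {3, 13, 16};
   d1 = x  - (a1 + a2 + a3);     /* zero vector: x  = a1 + a2 + a3     */
   d2 = a3 - (a2 - 0.5*a1);      /* zero vector: a3 = a2 - (1/2) a1    */
   d3 = x  - (0.5*a1 + 2*a2);    /* zero vector: x  = (1/2) a1 + 2 a2  */
   PRINT d1 d2 d3;
QUIT;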

Basis of a subspace: If the vectors $a_1, \ldots, a_c$ span a given subspace L and are linearly independent, they are said to be a basis for L. Any linear subspace has infinitely many different bases.
Dimension of a linear subspace: Yet all bases of a linear subspace share a common feature: They have the same number of elements. The number of elements of a basis is the dimension of the subspace.
Throw–away: Given a linearly dependent set of vectors $a_1, \ldots, a_c$ one can always apply the throw–away technique to obtain a linearly independent set of vectors. This set is then a basis for span$(a_1, \ldots, a_c)$.

Example 6. Consider the vectors
$$a_1 = \begin{pmatrix} 2 \\ 6 \\ 4 \end{pmatrix}, \quad a_2 = \begin{pmatrix} 1 \\ 5 \\ 7 \end{pmatrix}, \quad a_3 = \begin{pmatrix} 0 \\ 2 \\ 5 \end{pmatrix}$$
$$b_1 = \begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix} \quad \text{and} \quad b_2 = \begin{pmatrix} 2 \\ 8 \\ 9 \end{pmatrix}$$
and the corresponding matrices $A = [a_1 : a_2 : a_3]$, $\tilde{A} = [a_1 : a_2]$ and $B = [b_1 : b_2]$.
1. Since $a_3 = a_2 - \frac{1}{2} a_1$, the $a_i$ vectors are linearly dependent.

f in

• Note Since L = span(A) = span(B) one can think of the


matrices A and B as two different ways of representing the same
linear subspace.


Projections onto Linear Subspaces

Example 7. Consider the vectors $a = (2, 2)$ and $y = (1, 2)$.
Clearly $y$ is not in span$(a)$. In statistics the following question is extremely important: Can we find a vector $\hat{y}$ in span$(a)$ which is as "close to" $y$ as possible?
The answer is "yes": Find the (orthogonal) projection of the point $y$ onto the line going through $a$. There is a simple mathematical expression for obtaining $\hat{y}$, namely
$$\hat{y} = a(a^\top a)^{-1} a^\top y = \begin{pmatrix} 2 \\ 2 \end{pmatrix} \frac{1}{8} (2, 2) \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \frac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 3/2 \\ 3/2 \end{pmatrix}$$


The property of ŷ is that the length of y − ŷ is as small as possible.

Moreover, y − ŷ and ŷ are orthogonal. f in

In general let y be an r–vector and let A = [a1 : · · · : ac] be an r × c


matrix.

Then there always exists a vector $\hat{y}$ in span(A) which is as close to $y$ as possible.
If $y$ is in span(A), then $\hat{y} = y$ because in this case the length of $y - \hat{y}$ is zero.

If $y$ is not in span(A) then the expression is as follows: Assume that all columns of A are linearly independent. (Recall that if that is not the case we can throw away redundant columns without changing the space spanned by those remaining.)

Then ŷ = P y where

P = A(A>A)−1A>

is the projection matrix onto span(A).

It then holds that
1. $Py$ is in span(A).
2. $Py$ is the vector in span(A) which is closest to $y$ (in the sense that the length of $y - \hat{y}$ is minimized).
3. $Py = y$ if and only if $y$ is already in span(A).


Example 8. Consider the 3 × 2 matrix $A = [a_1 : a_2]$, where
$$a_1 = \begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix} \quad \text{and} \quad a_2 = \begin{pmatrix} 2 \\ 8 \\ 9 \end{pmatrix}$$
Then the projection matrix onto span$(A)$ is $P = A(A^\top A)^{-1}A^\top$. To find $P$ we first calculate
$$A^\top A = \begin{pmatrix} 1 & 3 & 2 \\ 2 & 8 & 9 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix} = \begin{pmatrix} 14 & 44 \\ 44 & 149 \end{pmatrix}$$
Hence
$$(A^\top A)^{-1} = \frac{1}{150}\begin{pmatrix} 149 & -44 \\ -44 & 14 \end{pmatrix}$$
From this we find
$$(A^\top A)^{-1}A^\top = \frac{1}{150}\begin{pmatrix} 149 & -44 \\ -44 & 14 \end{pmatrix}\begin{pmatrix} 1 & 3 & 2 \\ 2 & 8 & 9 \end{pmatrix} = \frac{1}{150}\begin{pmatrix} 61 & 95 & -98 \\ -16 & -20 & 38 \end{pmatrix}$$
Finally we find
$$P = A(A^\top A)^{-1}A^\top = \frac{1}{150}\begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix}\begin{pmatrix} 61 & 95 & -98 \\ -16 & -20 & 38 \end{pmatrix} = \frac{1}{150}\begin{pmatrix} 29 & 55 & -22 \\ 55 & 125 & 10 \\ -22 & 10 & 146 \end{pmatrix}$$
fin

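The hand calculation in Example 8 can be verified with a few lines of PROC IML (a sketch assuming SAS/IML is available; the vector y below is chosen only for illustration):

PROC IML;
   A = {1 2, 3 8, 2 9};
   P = A * inv(A` * A) * A`;     /* projection matrix onto span(A), as in Example 8 */
   y = {1, 2, 3};
   yhat = P * y;                 /* the point in span(A) closest to y               */
   PRINT P yhat;
QUIT;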

Exercises in linear algebra

Exercise 1. 1. Are the vectors (1, 1) and (1, 2) orthogonal?

2. Are (1, 1) and (2, −2) ?

3. Are (1, 1) and (−1, −1) ?

4. Make a drawing which illustrates these vectors

Exercise 2. Let
$$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix}.$$

1. Is A symmetrical?

2. Is A>A symmetrical?

3. Is AA> symmetrical?

4. What is the result from adding A and A>?

Exercise 3. Let
$$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad \text{and} \quad B = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}.$$

Calculate AB and BA. What can be concluded from this?

Exercise 4. Let a = (1, 1, 1, 0, 0, 0)> be a 6 × 1 matrix. Find aa>


and a>a.

Exercise 5. Let
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$
and
$$B = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$
Calculate AB. What can be concluded from this?

Exercise 6. What is the inverse to the 3 × 3 matrix diag(1, 4, 9)?

Exercise 7. Two equations with two unknowns. Convince yourself


that the system of equations
$$x_1 + 2x_2 = 3$$
$$2x_1 + 3x_2 = 4$$
can be written as
$$\begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 3 \\ 4 \end{pmatrix},$$
i.e. as $Ax = b$. Find $A^{-1}$ and use this for solving the system of equations as follows:
$$x = Ix = A^{-1}Ax = A^{-1}b.$$

Exercise 8. Let
$$A = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}.$$

1. How do vectors of the form Av look when v = (v1, v2)>?

2. Find the projection matrix P = A(A>A)−1A>.

3. Let y = (1, 3, 5, 7)> . Find P y.

4 Linear normal models

Linear normal models serve as a natural starting point for the presentation of Mixed Models
theory. Most researchers within animal science have at least a working knowledge of linear normal
models.
These slides served the purpose of giving an overview of the different concepts, and of linking the
concepts to the underlying statistical theory. Finally, the standard terminology used within
SAS was presented from a theoretical point of view.
Link to the full screen presentation1 .

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/LNM.f.pdf


Introduction

Many well known statistical models used in practice, for example


• linear regression,
• multiple regression,
• analysis of variance,
• analysis of covariance,

can be formulated in the general framework of linear normal models (abbreviated LNM), which undoubtedly is the most important class of models in statistics.


A linear normal model is also sometimes called a


general linear model.

The SAS procedure PROC GLM is designed to deal with the class of
linear normal models.

Any linear normal model can be formulated in matrix form as
$$Y = X\beta + \epsilon$$
where $Y$ is an $n \times 1$ vector of observations, $X$ is an $n \times p$ matrix of covariates, $\beta$ is a $p \times 1$ vector of unknown parameters and $\epsilon$ is an $n \times 1$ vector of unobservable random errors.

Example 1. One–way analysis of variance.
The model
$$Y_{kl} = \alpha_k + \epsilon_{kl}$$
where $\epsilon_{kl} \sim N(0, \sigma^2)$ for $k = 1, 2$ and $l = 1, 2, 3$ can be written in matrix form as
$$\begin{pmatrix} Y_{11} \\ Y_{12} \\ Y_{13} \\ Y_{21} \\ Y_{22} \\ Y_{23} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} + \begin{pmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{13} \\ \epsilon_{21} \\ \epsilon_{22} \\ \epsilon_{23} \end{pmatrix}$$
$$Y = X\beta + \epsilon$$


The vector of expected values $\mu = (\mu_{11}, \mu_{12}, \ldots, \mu_{23})^\top$ is
$$\begin{pmatrix} \mu_{11} \\ \mu_{12} \\ \mu_{13} \\ \mu_{21} \\ \mu_{22} \\ \mu_{23} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} = \begin{pmatrix} \alpha_1 \\ \alpha_1 \\ \alpha_1 \\ \alpha_2 \\ \alpha_2 \\ \alpha_2 \end{pmatrix}$$
$$\mu = X\beta$$
fin


There are good reasons for dealing with LNMs in general instead of treating regression analysis, analysis of variance etc. separately.
For LNMs in general it is easy to establish how to
• estimate parameters,
• estimate contrasts,
• make significance tests,
• perform model control.
From these general results, it can be deduced how to make the corresponding tests in e.g. regression models and in analysis of variance.


It is also convenient to work with LNMs in matrix terminology,


because any LNM can be formulated generally as

$$y = X\beta + \epsilon$$

Moreover, random effects models (mixed models) are an extension of


linear normal models. I.e. any linear normal model is in a sense also
a mixed model.

Many aspects of mixed models become extremely cumbersome if the


matrix representation is not available.

Example 2. Simple linear regression:
The linear regression model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where $\epsilon_i \sim N(0, \sigma^2)$ for $i = 1, \ldots, 6$ can be written in matrix form as
$$\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \\ 1 & x_5 \\ 1 & x_6 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \epsilon_6 \end{pmatrix}$$
$$Y = X\beta + \epsilon$$


The vector of expected values $\mu = (\mu_1, \mu_2, \ldots, \mu_6)^\top$ is
$$\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \\ \mu_5 \\ \mu_6 \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \\ 1 & x_5 \\ 1 & x_6 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} \beta_0 + \beta_1 x_1 \\ \beta_0 + \beta_1 x_2 \\ \beta_0 + \beta_1 x_3 \\ \beta_0 + \beta_1 x_4 \\ \beta_0 + \beta_1 x_5 \\ \beta_0 + \beta_1 x_6 \end{pmatrix}$$
$$\mu = X\beta$$
fin


Linear Normal Models

A linear normal model (LNM) is defined as follows:

1. The observations y1, . . . , yn come from (are realizations of)


independent random variables Y1, . . . , Yn.

2. Each random variable has a normal distribution
$$Y_i = \mu_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2).$$
Hence each $Y_i$ is allowed to have its own mean value, but the variance $\sigma^2$ is the same for all $i = 1, \ldots, n$.

3. To each observation $y_i$ there are covariates (known constants) $x_{i1}, \ldots, x_{ip}$ such that
$$\mu_i = \mu(\beta)_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ip}\beta_p = \sum_{k=1}^p x_{ik}\beta_k.$$
That is, the mean value $\mu_i$ is related to the covariates in a linear way through the parameters $\beta_1, \ldots, \beta_p$.

A practical interpretation of constant variance is that each random


variable Yi has the same tendency to deviate (in a random way)
from its expectation µi.

As has been illustrated, any LNM can be cast in matrix form as
$$Y = X\beta + \epsilon$$
where
Y : is an $n \times 1$ vector of observations,
X : is an $n \times p$ matrix of covariates, whose ith row is $x_{i1}, \ldots, x_{ip}$,
$\beta$ : is a $p \times 1$ vector of unknown parameters, and
$\epsilon$ : is an $n \times 1$ vector of unobservable random errors which are independent and $N(0, \sigma^2)$ distributed.

The matrix X is called the design matrix (or model matrix) because
it contains information about covariates, i.e. about the design of the
study.


Example 3. Polynomial regression:
The polynomial regression model
$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i$$
where $\epsilon_i \sim N(0, \sigma^2)$ for $i = 1, \ldots, 6$ can be written in matrix form as
$$\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \end{pmatrix} = \begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ 1 & x_3 & x_3^2 \\ 1 & x_4 & x_4^2 \\ 1 & x_5 & x_5^2 \\ 1 & x_6 & x_6^2 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \epsilon_6 \end{pmatrix}$$
$$Y = X\beta + \epsilon$$
fin

Random Vectors and Matrices

A random vector Z = (Z1, . . . , Zn)> is a vector of random variables.

Since we are working with vectors of random variables, it is


convenient to establish the notions of

• expectation vector (or mean vector ) and

• covariance matrix of a vector of random variables.

• Most frequently the interest is in the mean vector.
• Yet, the covariance matrix is of interest when modelling that observations cannot be regarded as coming from independent random variables.
• In fact, one view of mixed models is that mixed models are concerned with modelling the covariance matrix in some structured way.


The mean or expectation of a random vector is the vector of mean values, i.e.
$$E(Z) = \begin{pmatrix} E(Z_1) \\ E(Z_2) \\ \vdots \\ E(Z_n) \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{pmatrix} = \mu$$
For a LNM, we have already seen a use of this, namely through writing
$$\mu = X\beta.$$


The covariance matrix Cov(Z) of a random vector
$$Z = (Z_1, \ldots, Z_n)^\top$$
is the $n \times n$ matrix whose element in the ith row and jth column is the covariance between $Z_i$ and $Z_j$.
Example 4. For example, with $n = 3$ we have
$$\mathrm{Cov}(Z) = \begin{pmatrix} \mathrm{Var}(Z_1) & \mathrm{Cov}(Z_1, Z_2) & \mathrm{Cov}(Z_1, Z_3) \\ \mathrm{Cov}(Z_2, Z_1) & \mathrm{Var}(Z_2) & \mathrm{Cov}(Z_2, Z_3) \\ \mathrm{Cov}(Z_3, Z_1) & \mathrm{Cov}(Z_3, Z_2) & \mathrm{Var}(Z_3) \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \end{pmatrix}$$
fin


In general
$$\mathrm{Cov}(Z)_{ij} = \mathrm{Cov}(Z_i, Z_j) = E[(Z_i - \mu_i)(Z_j - \mu_j)].$$
In particular the diagonal elements of Cov(Z) contain the variances,
$$\mathrm{Cov}(Z)_{ii} = \mathrm{Cov}(Z_i, Z_i) = E[(Z_i - \mu_i)^2] = \mathrm{Var}(Z_i).$$
Since $\mathrm{Cov}(Z_i, Z_j) = \mathrm{Cov}(Z_j, Z_i)$, the covariance matrix is symmetric.

Example 5. The error term $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ from a linear normal model has a very simple covariance matrix:
• $\mathrm{Var}(\epsilon_i) = \sigma^2$ because the variance is the same for all units.
• $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ because $\epsilon_i$ and $\epsilon_j$ are independent.
• Hence
$$\mathrm{Cov}(\epsilon) = \sigma^2 \begin{pmatrix} 1 & 0 & \ldots & 0 \\ 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & 1 \end{pmatrix} = \sigma^2 I_n$$
fin


Functions of Random Vectors

Matrix algebra is useful when dealing with


linear functions of random vectors.

If Z is a random n-vector, A is an r × n matrix and b is an r–vector,


then
U = AZ + b
is also a random vector.


The mean and covariance of linear functions of random vectors are easily calculated using the following:

Result 1.

E(AY + b) = A E(Y ) + b (1)


Cov(AY + b) = Cov(AY ) = A Cov(Y )A> (2)


A particular application of (1) and (2) is the following:
• Let $Z$ be a random vector of length $n$ with mean $E(Z)$ (an n–vector) and covariance matrix Cov(Z) (an $n \times n$ matrix).
• Let $a = (a_1, \ldots, a_n)^\top$ be a vector of numbers and consider the linear combination $U = \sum_i a_i Z_i = a^\top Z$.
• Then (1) and (2) imply that
$$E(U) = E(a^\top Z) = a^\top E(Z)$$
$$\mathrm{Cov}(U) = \mathrm{Cov}(a^\top Z) = a^\top \mathrm{Cov}(Z)\, a$$

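As a small numerical sketch of these formulas (assuming SAS/IML is available; the mean vector, covariance matrix and the vector a below are invented purely for illustration):

PROC IML;
   mu    = {1, 2, 3};                   /* E(Z)                       */
   Sigma = {4 1 0, 1 2 1, 0 1 3};       /* Cov(Z), symmetric          */
   a     = {1, -1, 2};
   EU    = a` * mu;                     /* E(a`Z)   = a` E(Z)         */
   VU    = a` * Sigma * a;              /* Var(a`Z) = a` Cov(Z) a     */
   PRINT EU VU;
QUIT;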
The Multivariate Normal Distribution

So far, we have treated the mean and covariance of a random vector.

We shall now discuss a distribution of a random vector:

Definition 1. It is said that Z follows an n–dimensional


multivariate normal distribution (in short MVN) with mean vector
µ = E(Z) and covariance matrix Σ = Cov(Z), written

Z ∼ Nn(µ, Σ)

if a>Z follows a univariate normal distribution for all possible n-


vectors a.

Without going into detail, we shall just mention that if $\Sigma$ has an inverse, then $Z$ has a density which can be written
$$f(z) = (2\pi)^{-n/2} \det(\Sigma)^{-1/2} \exp\{-\tfrac{1}{2}(z - \mu)^\top \Sigma^{-1}(z - \mu)\}$$

Example 6. For n = 2 the density can be displayed as a bell–shaped surface over the plane (the figure is not reproduced here).

f in


The Distribution of a LNM

For a LNM, the vector of unobservable errors is $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^\top$, where $\epsilon_i \sim N(0, \sigma^2)$ and $\epsilon_1, \ldots, \epsilon_n$ are independent.
Hence we have
$$E(\epsilon) = 0 \quad \text{and} \quad \mathrm{Cov}(\epsilon) = \sigma^2 I$$
Since any linear combination of independent $N(0, \sigma^2)$–variables yields a normal variable, we conclude that
$$\epsilon \sim N_n(0, \sigma^2 I)$$
Hence for the linear normal model $Y = X\beta + \epsilon$ we find that
$$E(Y) = \mu = E(X\beta + \epsilon) = X\beta + E(\epsilon) = X\beta$$
$$\mathrm{Cov}(Y) = \mathrm{Cov}(X\beta + \epsilon) = \mathrm{Cov}(\epsilon) = \sigma^2 I$$
and we can write
$$Y \sim N_n(X\beta, \sigma^2 I).$$


The Expectation in a LNM

Example 7. (Continuation of Example 1).
The one–way analysis of variance model in Example 1 can be formulated in at least three different ways:
1. As $Y_{kl} = \alpha_k + \epsilon_{kl}$, with $\beta = (\alpha_1, \alpha_2)^\top$.
2. As $Y_{kl} = \delta + \gamma_k + \epsilon_{kl}$ where $\gamma_2 = 0$, such that $\gamma_1$ represents the treatment effect. Hence, $\beta_2 = (\delta, \gamma_1)^\top$.
3. As $Y_{kl} = \delta + \rho_k + \epsilon_{kl}$. Thus, $\beta_3 = (\delta, \rho_1, \rho_2)^\top$.


In many ways, the latter formulation is the most natural and conventional, but it poses some problems.
Let
$$X = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{pmatrix} \quad X_2 = \begin{pmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 0 \\ 1 & 0 \\ 1 & 0 \end{pmatrix} \quad X_3 = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix} \qquad (3)$$
Any vector which can be written as $X\beta$ must be of the form $(a, a, a, b, b, b)^\top$ for numbers $a$ and $b$.

But that is also the case for vectors of the form $X_2\beta_2$ and $X_3\beta_3$. From this we conclude that with respect to the mean vector the matrices $X$, $X_2$ and $X_3$ are "all the same".
This leads to
$$\mu = X\beta = X_2\beta_2 = X_3\beta_3.$$
1. $X$ corresponds to writing the model as $Y_{kl} = \alpha_k + \epsilon_{kl}$.
2. $X_2$ corresponds to writing the model as $Y_{kl} = \delta + \gamma_k + \epsilon_{kl}$, with $\gamma_2 = 0$.
3. $X_3$ corresponds to writing the model as $Y_{kl} = \delta + \rho_k + \epsilon_{kl}$.

Consider the mean vector µ = (2, 2, 2, 3, 3, 3)> . The formulation as
µ = X3β3 where β3 = (δ, ρ1, ρ2)> is different from the two others in
an important way:

• Under the representation µ = Xβ, there is only one choice of β


namely β = (2, 3) which yields µ.

• Under the representation µ = X2β2, there is only one choice of β2


namely β2 = (3, −1) which yields µ.

• Under the representation µ = X3β3, there are infinitely many ways


of obtaining µ. Two such are β3 = (1, 1, 2) and β3 = (3, −1, 0).

f in


• Example 7 illustrates that there in general are different


representations of the same model. Corresponding to the different
representations, there are different parameters, with different
interpretations.

• We say that there are different parametrizations of the same model.

• The representation µ = X3β3 is said to be over parametrized –


there are too many parameters in the model.


In many practical situations the models we work with are over


parametrized.

Yet, it does not matter which representation of the model we choose, and it is not really important whether the model is over parametrized, in the following sense:

Any question that can be answered under one representation can


also be answered under another.


To treat these issues in detail, it is necessary to think about what a LNM really says: It says that
$$y = X\beta + \epsilon \quad \text{where} \quad \mu = X\beta.$$
Hence $\beta$ affects the distribution of the observables $y$ only indirectly, namely through $X\beta$.
Therefore, since $y$ is what can be observed, we can only use $y$ for saying "something" about $\beta$ if this "something" can be expressed through $X\beta$.
This observation leads to the important notion of estimability and estimable functions.
The columns of X defines a subspace of Rn which we denote by L,
i.e.
L = span(X).

The statement µ = Xβ simply means that µ can be written as a


linear combination of the column vectors of X, i.e. that µ lies in
span(X).

But as has been illustrated in Example 7, there might be more than


one β vector producing µ.

Hence by saying that µ = Xβ, all one really says is that µ belongs
to L.

Moreover, there are infinitely many different ways of representing L,



because one can always find another matrix, say X2 with


span(X2) = span(X) such that any vector µ = Xβ = X2β2.

Therefore, since the parameter vector β is closely related to the


actual representation of L, and since β might not be uniquely
determined, the value of a parameter vector β is rarely of direct
interest in itself.


Example 8. (Continuation of Example 2)
Let $\bar{x}_. = \frac{1}{n}\sum_i x_i$ denote the average of the $x_i$s. Define new variables $z_i = x_i - \bar{x}_.$ and consider the regression model
$$Y_i = \alpha_0 + \alpha_1 z_i + \epsilon_i.$$
This model corresponds to "centering the $x_i$s around their mean". Not surprisingly, this does not change the fundamental structure of the model – it is still a linear regression model, but with the following new design matrix:
$$\tilde{X} = \begin{pmatrix} 1 & z_1 \\ 1 & z_2 \\ 1 & z_3 \\ 1 & z_4 \\ 1 & z_5 \\ 1 & z_6 \end{pmatrix} = \begin{pmatrix} 1 & x_1 - \bar{x}_. \\ 1 & x_2 - \bar{x}_. \\ 1 & x_3 - \bar{x}_. \\ 1 & x_4 - \bar{x}_. \\ 1 & x_5 - \bar{x}_. \\ 1 & x_6 - \bar{x}_. \end{pmatrix}, \quad \tilde{\beta} = \begin{pmatrix} \alpha_0 \\ \alpha_1 \end{pmatrix}$$
fin
Representations of Models in SAS

Here we shall illustrate some of the differences between different


ways of specifying the models in SAS.

The illustration is with PROC MIXED but applies to PROC GLM too.

The model in Example 7 can be analyzed with the SAS program

PROC MIXED;
CLASS TREAT;
MODEL Y = TREAT / SOLUTION;
RUN;

Here TREAT is a variable with levels 1 and 2.



1. First SAS generates the matrix X3.

2. SAS then realizes that the columns of X3 are linearly dependent.

3. SAS therefore proceeds by eliminating columns until a set of linearly independent columns is achieved. This is done in a systematic way: the column corresponding to the highest value of TREAT is removed, which yields X2.

The parameter estimates reported by SAS are therefore (δ, γ1).

Note that it is the option SOLUTION that causes the parameter


estimates to be reported.

October 17, 2001 Mixed Models Course 40

59
4 Linear normal models

The SAS program


PROC MIXED;
CLASS TREAT;
MODEL Y = TREAT / NOINT SOLUTION;
RUN;

on the other hand causes SAS to directly generate X, because the NOINT option specifies that there shall not be a column of 1s in the design matrix. The parameter estimates reported by SAS are therefore (α1, α2).
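With the column for TREAT = 2 removed, the two parameter sets are linked by α1 = δ + γ1 and α2 = δ. As a small illustration of this (and of the ESTIMATE statement treated later), the group means can be recovered from the intercept parametrization. The following is only a sketch, assuming the data set and the CLASS variable TREAT from the program above:

PROC MIXED;
  CLASS TREAT;
  MODEL Y = TREAT / SOLUTION;
  ESTIMATE 'alpha1 = delta + gamma1' INTERCEPT 1 TREAT 1 0;
  ESTIMATE 'alpha2 = delta'          INTERCEPT 1 TREAT 0 1;
RUN;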

October 17, 2001 Mixed Models Course 41

Example 9. Consider the two–way analysis of variance

  Yijk = δ + αi + βj + γij + εijk

where i = 1, 2, j = 1, 2 and k = 1, 2, 3. The mean vector is

  µ = Xβ,   with β = (δ, α1, α2, β1, β2, γ11, γ12, γ21, γ22)>   and

       1 1 0 1 0 1 0 0 0
  X =  1 1 0 0 1 0 1 0 0
       1 0 1 1 0 0 0 1 0
       1 0 1 0 1 0 0 0 1

(where in the design matrix we regard 1 and 0 as vectors of length 3).


October 17, 2001 Mixed Models Course 42

60
This model is highly over parametrized. SAS handles this problem in the way indicated above: A new design matrix giving the same model is created, namely

  µ = X2β2,   where

        1 1 1 1
  X2 =  1 1 0 0      and   β2 = (δ, α1, β1, γ11)>
        1 0 1 0
        1 0 0 0

This corresponds to setting α2 = β2 = γ21 = γ12 = γ22 = 0 beforehand. (That is, every time a parameter contains the level number 2 in its index, it is set to zero.) f in

October 17, 2001 Mixed Models Course 43

This means that SAS solves the problem of an over parametrized model by simply reducing it to a representation which is not over parametrized.

As mentioned previously, this is not a problem, because any question that can be answered under one representation of a model can also be answered under another.

Yet, care should be taken when it comes to interpreting output from


SAS, see Section 18.

October 17, 2001 Mixed Models Course 44

61
4 Linear normal models

Least Squares Estimation in a LNM

In a LNM, the mean µi is a function of the parameter vector β.

One frequently used criterion for estimation is the method of


least squares:

Find the vector µ̂ = (µ̂1, . . . , µ̂n)> which minimizes the sum of squared deviations

  D(β) = Σ_{i=1}^{n} (yi − µi)²

under the restriction that µ̂ = X β̃ for some parameter vector β̃.


October 17, 2001 Mixed Models Course 45

• Such a vector µ̂ always exists and is unique.

• We say that β̃ is a least squares estimate for β. Such an estimate


β̃ also exists, but it is in general not unique.

October 17, 2001 Mixed Models Course 46

62
Example 10. (Continuation of Example 2)

For the regression analysis we find

  D(β) = Σ_{i=1}^{n} (yi − (β0 + β1xi))²

Most standard textbooks on statistics take the following approach to


minimization of D(β):
1) Calculate the derivatives ∂D(β)/∂β0 and ∂D(β)/∂β1,

2) set these equal to zero and

3) solve for β0 and β1.


October 17, 2001 Mixed Models Course 47

This gives

  β̂1 = Σ_i (yi − ȳ.)(xi − x̄.) / Σ_i (xi − x̄.)²
  β̂0 = ȳ. − β̂1 x̄.

f in

October 17, 2001 Mixed Models Course 48

63
4 Linear normal models

Example 11. (Continuation of Example 1) For the one–way analysis


of variance

  D(β) = Σ_{k=1}^{2} Σ_{l=1}^{3} (ykl − αk)²

The values of αk which minimize D(β), where β = (α1, α2)>, are

  αk = (1/3) Σ_{l=1}^{3} ykl = ȳk

The vector µ̂ is in this case (ȳ1, ȳ1, ȳ1, ȳ2, ȳ2, ȳ2)>.

However, if the model is written as Ykl = δ + αk + εkl, i.e. as Y = X3β3 + ε in Example 7, there is no unique least squares estimate
October 17, 2001 Mixed Models Course 49

of β3 = (δ, α1, α2). To see this, just note that

δ = 0, α1 = ȳ1, α2 = ȳ2

and

δ = (ȳ1 + ȳ2)/2, α1 = (ȳ1 − ȳ2)/2, α2 = (−ȳ1 + ȳ2)/2

both result in the same vector µ̂ = (ȳ1, ȳ1, ȳ1, ȳ2, ȳ2, ȳ2)>. f in

October 17, 2001 Mixed Models Course 50

64
Estimation on matrix form

The estimation problem can be formulated very generally in matrix


notation and can be solved generally using projections onto
subspaces:

Using matrix notation the least squares method is:

Find the vector µ̂ = (µ̂1, . . . , µ̂n)> which minimizes

  D(β) = (y − µ)>(y − µ)

under the restriction that µ̂ = X β̃ for some parameter vector β̃.


October 17, 2001 Mixed Models Course 51

Then we have the following results:

1. There always exists a unique vector of expected values µ̂ =


(µ̂1, . . . , µ̂n)> which minimizes D(β).

2. The vector µ̂ is µ̂ = P y, where P is the projection matrix onto span(X).

3. Since µ̂ is in span(X), there exists a vector β̂1 satisfying that


µ̂ = X β̂1. We say that β̂1 is a least squares estimate of β.

4. If the columns of X are linearly independent, there exists only one


vector β̂1 satisfying that µ̂ = X β̂1. In that case the least squares
estimate is unique.
October 17, 2001 Mixed Models Course 52

65
4 Linear normal models

5. If the columns of X are linearly dependent, there exist several least squares estimates, i.e. there is another vector β̂2 with µ̂ = X β̂2 and β̂1 ≠ β̂2.

6. In regression problems, the least squares estimate is typically unique,


whereas in analysis of variance problems, the least squares estimate
is generally not unique.

7. In the case where the least squares estimate is unique, it is given as

  β̂ = (X>X)−1X>y.

It is easy to see why it is so: We know that µ̂ = P y =


X[(X >X)−1X >y]. However, since µ̂ is in span(X), we also
know that µ̂ = X β̂. But both equations can only be true if
β̂ = (X >X)−1X >y.
October 17, 2001 Mixed Models Course 53

The vector e = y − µ̂ is the vector of residuals, reflecting the unobserved error vector ε.

Hence e>e = (y − µ̂)>(y − µ̂) is the residual sum of squares, and if the model fits the data well, e>e should be “small” in some sense.

If there are p linearly independent columns in X, the estimate for the variance σ² is

  σ̂² = e>e/(n − p) = (y − µ̂)>(y − µ̂)/(n − p)

October 17, 2001 Mixed Models Course 54

66
Example 12. (Continuation of Example 7).

With the matrix X as in Example 7, the projection matrix becomes


 
  P = (1/3) ·   1 1 1 0 0 0
                1 1 1 0 0 0
                1 1 1 0 0 0
                0 0 0 1 1 1
                0 0 0 1 1 1
                0 0 0 1 1 1

f in
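As a small numerical illustration, the projection matrix, the fitted values µ̂ = P y and the least squares estimate can be computed directly in PROC IML. The following is only a sketch; the data vector y is invented for the illustration:

proc iml;
  /* design matrix for the one-way layout: two groups with three observations each */
  X = {1 0, 1 0, 1 0, 0 1, 0 1, 0 1};
  P = X * inv(X` * X) * X`;        /* projection onto span(X); equals the matrix above */
  y = {1, 2, 3, 10, 11, 12};       /* hypothetical observations */
  muhat   = P * y;                 /* fitted values: each group mean repeated three times */
  betahat = inv(X` * X) * X` * y;  /* least squares estimate (here the two group means) */
  print P muhat betahat;
quit;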

October 17, 2001 Mixed Models Course 55

The parameter vector β

We shall now assume that the LNM is such that the columns of X are linearly independent, so that the least squares estimate

  β̂ = (X>X)−1X>y

of β is unique.

Letting A = (X>X)−1X>, we note that A is a p × n matrix and see that β̂ = Ay.

October 17, 2001 Mixed Models Course 56

67
4 Linear normal models

Thinking in terms of random variables, the data y is a realization of


a random vector Y with E(Y ) = Xβ and Cov(Y ) = σ 2I. Then

β̂(Y ) = (X >X)−1X >Y = AY

is also a random vector because β̂(Y ) is a function of the random


vector Y .

If the elements of A are denoted aij, we see that the ith component of β̂ is β̂i = Σ_{j=1}^{n} aij yj.

Hence each component β̂i of the vector β̂ is a linear function of the data y. Therefore it is not surprising that the corresponding random variables β̂i(Y ) are dependent in some way.

October 17, 2001 Mixed Models Course 57

Using the relations (1) and (2) we find that

E(β̂(Y )) = AE(Y ) = (X >X)−1X >E(Y )


= (X >X)−1X >Xβ = β (4)

Equation (4) says that the expected value of the least squares
estimator β̂ is simply the true but unknown value β.

October 17, 2001 Mixed Models Course 58

68
Cov(β̂(Y )) = A Cov(Y )A> = σ 2AIA> = σ 2AA>
= σ 2(X >X)−1X >[(X >X)−1X >]>
= σ 2(X >X)−1X >X(X >X)−1
= σ 2(X >X)−1 (5)

Equation (5) says that the covariance of the least squares estimator
β̂ is proportional to the residual variance σ 2. Moreover, the matrix
(X >X)−1 does not depend on the data y but only on the design
matrix X, i.e. on how the study at hand was conducted.

October 17, 2001 Mixed Models Course 59

Recall that on the diagonal of a covariance matrix one finds the variances. Hence, knowing (X>X)−1 and an estimate for σ², we also know the variance estimates for the β̂i.

October 17, 2001 Mixed Models Course 60

69
4 Linear normal models

Example 13. (Continuation of Example 2) Suppose xi = i and


zi = i − 3.5 in the regression example for i = 1, . . . , 6.
Regression of y on x with the program
PROC GLM ;
MODEL y = x / inv;
RUN; QUIT;

gives the result

October 17, 2001 Mixed Models Course 61

The GLM Procedure


X’X Inverse Matrix
Intercept x y
Intercept 0.8666666667 -0.2 -1.286578758
x -0.2 0.0571428571 0.4835938022
y -1.286578758 0.4835938022 3.225955579

Dependent Variable: y Sum of


Source DF Squares Mean Square F Value Pr > F

Model 1 4.09260190 4.09260190 5.07 0.0874


Error 4 3.22595558 0.80648889
Corrected Total 5 7.31855748

Standard
Parameter Estimate Error t Value Pr > |t|
Intercept -1.286578758 0.83603651 -1.54 0.1987
x 0.483593802 0.21467436 2.25 0.0874

October 17, 2001 Mixed Models Course 62

70
The first two diagonal elements of (X>X)−1 times the variance estimate σ̂² (i.e. the Mean Square Error) give variance estimates of the regression parameters.

The square roots of these estimates are the standard errors reported.

Moreover, the off-diagonal element of (X>X)−1 is −0.2, so the estimated covariance between the intercept and the slope is −0.2 σ̂², and these estimates are correlated.
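For instance, using the output above:

  Var̂(β̂0) = 0.8667 · 0.8065 ≈ 0.699,   √0.699 ≈ 0.836
  Var̂(β̂1) = 0.0571 · 0.8065 ≈ 0.046,   √0.046 ≈ 0.215

which agree with the reported standard errors 0.836 and 0.215.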

October 17, 2001 Mixed Models Course 63

Regression of y on z with the program


PROC GLM ;
MODEL y = z / inv;
RUN; QUIT;

gives the result

October 17, 2001 Mixed Models Course 64

71
4 Linear normal models

The GLM Procedure


X’X Inverse Matrix
Intercept z y
Intercept 0.1666666667 0 0.4059995498
z 0 0.0571428571 0.4835938022
y 0.4059995498 0.4835938022 3.225955579

The GLM Procedure


Dependent Variable: y Sum of
Source DF Squares Mean Square F Value Pr > F
Model 1 4.09260190 4.09260190 5.07 0.0874
Error 4 3.22595558 0.80648889
Corrected Total 5 7.31855748

Standard
Parameter Estimate Error t Value Pr > |t|
Intercept 0.4059995498 0.36662626 1.11 0.3302
z 0.4835938022 0.21467436 2.25 0.0874

October 17, 2001 Mixed Models Course 65

In this case we see that centering the x values around their average
(3.5) gives parameter estimates which are uncorrelated. Moreover,
the estimate of the slope (and the associated standard error) is the
same as before. f in

October 17, 2001 Mixed Models Course 66

72
Example 14. (Continuation of Example 2)

With
 
X the 6 × 2 matrix with ith row (1, xi) and β = (β0, β1)>,

we find (when letting n = 6) that

  X>X =  [  n        Σ_i xi  ]
         [  Σ_i xi   Σ_i xi² ]

October 17, 2001 Mixed Models Course 67

Recall that

  A =  [ a  b ]     implies that     A−1 = 1/(ad − bc) · [  d  −b ]
       [ c  d ]                                          [ −c   a ]

(provided that ad − bc ≠ 0). Using this gives

  (X>X)−1 = 1/( n Σ_i xi² − (Σ_i xi)² ) · [  Σ_i xi²   −Σ_i xi ]
                                          [ −Σ_i xi       n    ]

Letting K = n Σ_i xi² − (Σ_i xi)², the variance of the estimator β̂0 for the intercept is

  Var(β̂0) = σ² Σ_i xi² / K
October 17, 2001 Mixed Models Course 68

73
4 Linear normal models

and the variance of the estimator β̂1 for the slope is

  Var(β̂1) = σ² n / K

The estimators β̂0 and β̂1 are correlated, since

  Cov(β̂0, β̂1) = −σ² Σ_i xi / K

f in
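With xi = i for i = 1, . . . , 6, as in Example 13, these formulas give

  Σ_i xi = 21,   Σ_i xi² = 91,   K = 6 · 91 − 21² = 105

  Var(β̂0) = 91 σ²/105 ≈ 0.867 σ²,   Var(β̂1) = 6 σ²/105 ≈ 0.0571 σ²,   Cov(β̂0, β̂1) = −21 σ²/105 = −0.2 σ²

which are exactly the elements of the X’X Inverse Matrix (apart from the factor σ²) reported by PROC GLM in Example 13.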

October 17, 2001 Mixed Models Course 69

Example 15. (Continuation of Example 8)


Since Σ_i (xi − x̄.) = 0 (verify this!), we find that

  X̃>X̃ =  [ n     0       ]  =  [ n     0               ]
          [ 0     Σ_i zi² ]     [ 0     Σ_i (xi − x̄.)²  ]

Since the inverse of a diagonal matrix is also diagonal, we conclude


that the estimators α̂0 and α̂1 are independent. f in

October 17, 2001 Mixed Models Course 70

74
The estimator β̂ has a p–dimensional multivariate normal
distribution (in short MVN), with mean vector β and covariance
matrix σ 2(X >X)−1.

This is written
β̂ ∼ Np(β, σ 2(X >X)−1).

This means that any linear combination λ>β̂ has a univariate normal
distribution
λ>β̂ ∼ N (λ>β, σ 2λ>(X >X)−1λ) (6)
and that is a very important result for practical statistics.

October 17, 2001 Mixed Models Course 71

Estimability and Contrasts

In a LNM with mean vector µ = Xβ one is typically interested in


making statements about (some of) the components of the
parameter vector β.

However, with µ = Xβ we only have indirect knowledge about β, because all we know is that µi = Σ_j xij βj and, as has been illustrated, β is in general not uniquely determined. That is, there can be another vector β2 such that µ = Xβ = Xβ2.

Hence there are some constraints on what can actually be said about
β.

October 17, 2001 Mixed Models Course 72

75
4 Linear normal models

In the one–way analysis of variance of Example 1 one might be


interested in the difference α1 − α2 or in α1 itself and there is no
problem in that. For later purposes it can be noted that

α1 − α2 = (1, −1)(α1, α2)> = (1, −1)β


α1 = (1, 0)(α1, α2)> = (1, 0)β

October 17, 2001 Mixed Models Course 73

Example 16. Consider the two–way analysis of variance

  Yij = δ + αi + βj + εij

where µ = Xβ with

       1 1 0 1 0
  X =  1 1 0 0 1       and   β = (δ, α1, α2, β1, β2)>
       1 0 1 1 0
       1 0 1 0 1

It is clear that this model is grossly over parametrized (why?)

Under this model we can estimate quantities like

  α1 − α2,    δ + α1,    δ + α1 + ½(β1 + β2)
October 17, 2001 Mixed Models Course 74

76
Note that

  α1 − α2 = (0, 1, −1, 0, 0)β,
  δ + α1 + ½(β1 + β2) = (1, 1, 0, ½, ½)β

However, other things like

  α1 = (0, 1, 0, 0, 0)β    or    β1 = (0, 0, 0, 1, 0)β

cannot be estimated under this model.

f in

October 17, 2001 Mixed Models Course 75

In a sense, the only thing uniquely determined in a LNM is µ.

Therefore the only thing one can truly say something about is linear combinations of µ, i.e. linear combinations of the form

a> µ

for some n–vector a.

Most frequently interest is in contrasts of the form λ>β.

Therefore, a natural question is how

a>µ and λ>β

relate to each other?


October 17, 2001 Mixed Models Course 76

77
4 Linear normal models

Since µ = Xβ, we can only say something about β if one can


express it as
a>Xβ.
Note that a>X is a 1 × p vector.

Therefore, we can say something about the contrast λ>β only if one
can find an n–vector a such that

a> X = λ >

If there exists such a vector a, the contrast λ>β is said to be


estimable.

In this case the contrast can be written

λ>β = a>Xβ = a>µ


October 17, 2001 Mixed Models Course 77

After having estimated µ by µ̂, the contrast λ>β is estimated by

  λ>β̂ = a>X β̂ = a>µ̂.

Recall from the section on estimation that there might in general be


many least squares estimates for β. However, the following holds:

Result 2. The least squares estimate of λ>β is unique if and only


if λ>β is estimable.

In other words,

The only thing one can say something about in an unambiguous


way is estimable functions.

October 17, 2001 Mixed Models Course 78

78
From the general result

λ>β̂ ∼ N (λ>β, σ 2λ>(X >X)−1λ) (7)

we know the distribution of the contrast λ>β̂ and hence testing for the contrast being zero is straightforward.

Note that transposing a>X = λ> gives X>a = λ.

Hence the condition for estimability is that λ can be written as a linear combination of the columns of X>, i.e. as a linear combination of the rows of X.

This amounts to solving a set of linear equations – and computers


can do that!

October 17, 2001 Mixed Models Course 79

Example 17. (Continuation of Example 16)

We wish to verify that

  δ + α1 + ½(β1 + β2) = (1, 1, 0, ½, ½)β

is indeed estimable.

That is, we seek a vector a = (a1, a2, a3, a4)> such that

  a>X = (1, 1, 0, ½, ½).

October 17, 2001 Mixed Models Course 80

79
4 Linear normal models

Direct multiplication gives

  a1 + a2 + a3 + a4 = 1
  a1 + a2 = 1
  a3 + a4 = 0
  a1 + a3 = ½
  a2 + a4 = ½

It is not hard to spot that the solution to these equations is

  a1 = a2 = ½   and   a3 = a4 = 0.

f in
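The same check can be left to the computer. The following PROC IML program is only a sketch (it is not part of the original example); it tests whether λ> lies in the row space of X by checking whether λ>ginv(X)X reproduces λ> – if it does, the contrast is estimable and λ>ginv(X) is one choice of the coefficient vector a>:

proc iml;
  /* design matrix of Example 16 */
  X = {1 1 0 1 0,
       1 1 0 0 1,
       1 0 1 1 0,
       1 0 1 0 1};
  lambda = {1 1 0 0.5 0.5};     /* the contrast delta + alpha1 + (beta1 + beta2)/2 */
  a = lambda * ginv(X);         /* candidate coefficient vector a` */
  check = a * X;                /* equals lambda exactly when the contrast is estimable */
  print a check;
quit;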

October 17, 2001 Mixed Models Course 81

Estimability in SAS

In checking whether a specific contrast is estimable, it is


recommended to use PROC GLM.

The following SAS program deals with data from Example 16


proc glm data=a;
class i j;
model y = i j/E;
lsmeans i j /E;
run;

October 17, 2001 Mixed Models Course 82

80
The output caused by the E–option in the MODEL statement is
General Form of Estimable Functions
Effect Coefficients
1 Intercept L1
2 i 1 L2
3 i 2 L1-L2
4 j 1 L4
5 j 2 L1-L4

Recall that β = (δ, α1, α2, β1, β2). The numbers 1,2,3,4,5 identify
the entry of the λ–vector, λ = (λ1, λ2, . . . , λ5), and the Ls specify
the constraints to be satisfied by the λis.

It reads as follows: λ1 can be set to any value L1, and λ2 can be set
to any value L2. But then λ3 is constrained to be equal to L1 − L2.
Likewise, λ4 can be set to any value L4, but then λ5 is constrained

October 17, 2001 Mixed Models Course 83

to be equal to L1 − L4.

From this we see how to specify some contrasts:

  λ = (1, 1, 0, 1, 0) :      λ>β = δ + α1 + β1
  λ = (1, 1, 0, ½, ½) :      λ>β = δ + α1 + ½(β1 + β2)
  λ = (0, 1, −1, 0, 0) :     λ>β = α1 − α2

But we can also see that the contrast δ + ½(α1 + α2) is not estimable: Taking λ1 = 1 and λ2 = λ3 = ½ would give the desired result, but setting λ4 = 0 implies that λ5 = 1, so it is not possible.

The contrasts specified above are constructed as follows in PROC GLM (and in PROC MIXED). Note that we have indicated two ways of
October 17, 2001 Mixed Models Course 84

81
4 Linear normal models

constructing the last contrast.


title ’Estimation of contrasts’;
proc glm data=a;
class i j;
model y = i j /E;
estimate ’Lambda 1’ intercept 1 i 1 0 j 1 0 / E;
estimate ’Lambda 2’ intercept 1 i 1 0 j .5 .5 / E;
estimate ’Lambda 3’ intercept 0 i 1 -1 j 0 0 / E;
estimate ’Lambda 3’ intercept 0 i 1 -1 / E;
run; quit;

October 17, 2001 Mixed Models Course 85

Least Squares Means

The LSMEANS statement in GLM is an attempt to generate meaningful


estimates automatically, sometimes (but not always) with success.
These are denoted least squares means and can be constructed as
title ’Least squares means’;
proc glm data=a;
class i j;
model y = i j ;
lsmeans i j / E stderr;
run; quit;

The output caused by the E–option in the LSMEANS statement is

October 17, 2001 Mixed Models Course 86

82
Least Squares Means
Coefficients for i Least Square Means i Level

Effect 1 2
1 Intercept 1 1
2 i 1 1 0
3 i 2 0 1
4 j 1 0.5 0.5
5 j 2 0.5 0.5

Coefficients for j Least Square Means j Level


Effect 1 2
1 Intercept 1 1
2 i 1 0.5 0.5
3 i 2 0.5 0.5
4 j 1 1 0
5 j 2 0 1

October 17, 2001 Mixed Models Course 87

The interpretation of the columns to the right is exactly as before:

The vector λ = (1, 1, 0, 0.5, 0.5)> gives

  λ>β = δ + α1 + ½(β1 + β2).

From this we see that the LSMEAN for i = 1 is δ + α1 plus the “average effect” of the factor j, i.e. ½(β1 + β2).

October 17, 2001 Mixed Models Course 88

83
4 Linear normal models

Hypothesis Testing

Example 18. The two–way analysis of variance model

  Yij = δ + αi + βj + εij ,    i = 1, 2, j = 1, 2

will in the following be referred to as the large model.

Data is assumed to be in accordance with the large model.

Suppose we are interested in testing whether βj = 0.

October 17, 2001 Mixed Models Course 89

The mean µij of Yij is δ + αi + βj and the mean vector has the form

        µ11        1 1 0 1 0
  µ =   µ12   =    1 1 0 0 1    β  =  Xβ,    where β = (δ, α1, α2, β1, β2)>
        µ21        1 0 1 1 0
        µ22        1 0 1 0 1

Testing βj = 0 corresponds to testing whether the reduced model

  Yij = δ + αi + εij

is in accordance with data.

October 17, 2001 Mixed Models Course 90

84
Under the reduced model, the mean µij of Yij is δ + αi and the mean vector has the form

        µ11        1 1 0
  µ =   µ12   =    1 1 0    β0  =  X0β0,    where β0 = (δ, α1, α2)>
        µ21        1 0 1
        µ22        1 0 1

Hence testing the hypothesis βj = 0 corresponds to testing whether µ = X0β0 when we “know” that µ = Xβ. f in

October 17, 2001 Mixed Models Course 91

Note that any vector µ that can be written as µ = X0β0 can also be
written as µ = Xβ – simply by setting the last two elements of β to
zero.

More generally, any vector in span(X0) is also in span(X), but not


vice versa.

(Recall that span(X0) is the set of vectors that can be written as a


linear combination of the columns of X0.)

Let P and P0 be the projection matrices corresponding to X and

October 17, 2001 Mixed Models Course 92

85
4 Linear normal models

X0. The least squares estimates of µ are

  µ̂ = P y      under the large model
  µ̂0 = P0 y    under the reduced model

How do we judge whether the reduced model is feasible?

The answer lies in the “distance” between the observations and the expected values.

The vector of residuals

e = y − µ̂ = y − P y = (I − P )y
October 17, 2001 Mixed Models Course 93

reflects random deviations from the mean under the large model (in which we “believe”).

Therefore the length of e (and hence the squared length e>e) is expected to be “small” in some sense.

October 17, 2001 Mixed Models Course 94

86
If the reduced model is true then e0 = (I − P0)y is also the vector of
residuals, and the length of the vector should also be small.

On the other hand if the reduced model is not true, then e0 is not
just residuals, because it contains some of the variation due to the
factor βj .

In this case the length of the residual vector is expected to be large.

Consider the difference between the residuals

  D = e0 − e = (y − P0y) − (y − P y) = P y − P0y = (P − P0)y

If the reduced model is true, then this difference is just a difference between residual vectors, and the length of D is expected to be small.

October 17, 2001 Mixed Models Course 95

If we let d and d0 denote the number of independent columns in X


and X0, one can show the following

Result 3.

  E( D>D / (d − d0) ) = (1/(d − d0)) E(D>D) = σ² + k

or equivalently

  E(D>D) = (d − d0)(σ² + k) = (d − d0)σ² + (d − d0)k,

where k ≥ 0 and k = 0 when the reduced model is true.

If σ 2 had been known the result above would be very useful:

If D>D is “much larger” than (d − d0)σ², this would indicate that


October 17, 2001 Mixed Models Course 96

87
4 Linear normal models

k > 0 which in turn causes us to doubt the feasibility of the reduced


model.

October 17, 2001 Mixed Models Course 97

There are two problems in this connection:

1. σ 2 is not known, and

2. what does “much larger” mean...

Yet, in Linear Normal Models there is a simple solution to these two problems, now to be outlined:

October 17, 2001 Mixed Models Course 98

88
Problem 1: σ 2 is not known

Under the large model, the variance estimate is

σ̂ 2 = e>e/(n − d),

i.e. the residual sum of squares divided by the residual degrees of


freedom.

It is well known that E(σ̂ 2) = σ 2, so it is reasonable to assume that


σ̂ 2 ≈ σ 2.

Therefore, if the reduced model is true (and hence k = 0), the ratio

  F = [ D>D/(d − d0) ] / [ e>e/(n − d) ] ≈ 1.
October 17, 2001 Mixed Models Course 99

That takes, to some extent, “care of” the problem that σ 2 is


unknown.

Problem 2: what does “much larger” mean... :

If the reduced model is not true, then the ratio F would tend to be larger than 1. The problem remaining is to define what is meant by “large”. One can show the following:

Result 4. If the reduced model is true then F has an Fd−d0,n−d–


distribution.

Here d − d0 is the number of parameters removed from the model


(i.e. the additional residual degrees of freedom gained by going from
the large to the reduced model), and n − d is the residual degrees of
October 17, 2001 Mixed Models Course 100

89
4 Linear normal models

freedom under the large model.

If the reduced model is not true, then F has an expected value larger
than 1.

Therefore, if F is larger than a pre–specified quantile in the


Fd−d0,n−d–distribution one would doubt the feasibility of the model
reduction, i.e. reject the hypothesis.

October 17, 2001 Mixed Models Course 101

Calculating things in Practice

Consider again the difference between the residuals

  D = e0 − e = (y − P0y) − (y − P y) = P y − P0y = (P − P0)y.

There is an easy way to calculate D >D in practice:

Result 5.

  D>D = e0>e0 − e>e = RSS0 − RSS

where RSS and RSS0 denote the residual (or error) sums of squares under the large and the reduced model respectively.
October 17, 2001 Mixed Models Course 102

90
Tests in LNMs in short form

• Consider a LNM Y ∼ Nn(µ, σ 2I). Hence Y =D µ + e, where


e ∼ Nn(0, σ 2I).

• Consider the models for the mean value

  M : µ ∈ L = C(X)        M0 : µ ∈ L0 = C(X0)        L0 ⊂ L

where M is assumed to hold true, and let M and M0 denote the


corresponding projections of dimension d and d0.

• Under M, M Y = M µ + M e = µ + M e.
October 17, 2001 Mixed Models Course 103

• If M0 is true, then

(M − M0)Y = M µ + M e − M0µ − M0e = (M − M0)e

is only “random noise”. In this case (M − M0)Y is expected to be


small.

• Clearly, M − M0 is the projection onto L ∩ L0⊥.

• Hence

  ||(M − M0)Y ||² / (d − d0) = Y>(M − M0)Y / r(M − M0)

is a measure of how close M0Y is to M Y in relation to the difference in dimensionality of the models.
October 17, 2001 Mixed Models Course 104

91
4 Linear normal models

• We use the results that

  E(Y>AY ) = tr(A Var(Y )) + E(Y )>A E(Y )
  tr(M ) = d,    tr(M − M0) = d − d0

• Assuming only M,

  E( Y>(M − M0)Y / r(M − M0) ) = σ²/(d − d0) · tr(M − M0) + β>X>(M − M0)Xβ / (d − d0)
                                = σ² + β>X>(M − M0)Xβ / (d − d0)
                                = σ² + ||v||²

• If M0 is true, then ||v||² = 0.


October 17, 2001 Mixed Models Course 105

• If we use MSE = Y>(I − M)Y / (n − d) = σ̃² as an estimate for σ², then under M0,

  F = [ Y>(M − M0)Y / (d − d0) ] / [ Y>(I − M)Y / (n − d) ] ≈ 1

• It is clear that numerator and denominator are independent:

  [ (I − M)Y   ]        [ (I − M)µ   ]        [ I − M      0      ]
  [ (M − M0)Y ]  ∼  N(  [ (M − M0)µ  ] ;  σ²  [ 0          M − M0 ] )

• Under M0,

  (1/σ²) Y>(M − M0)Y ∼ χ²( d − d0, β>X>(M − M0)Xβ ),

i.e. a non–central χ² distribution.
October 17, 2001 Mixed Models Course 106

92
• Hence large values of F cause doubt in M0.

October 17, 2001 Mixed Models Course 107

Hypothesis Testing in SAS

In practice SAS performs all relevant calculations (and,


unfortunately, a few more).

Degrees of freedom: A comment regarding the degrees of


freedom reported by SAS is appropriate:

Default in SAS is that all observations are centered around their


average.

This centering “costs” one degree of freedom and therefore SAS


reports the Corrected Total which is n − 1, where n is the
number of observations.

October 17, 2001 Mixed Models Course 108

93
4 Linear normal models

In the large model in Example 18 there are three parameters,


(δ, α1, β1)

Because of the centering of the data, SAS does not regard δ as a


parameter when it comes to reporting degrees of freedom. So the
real number of parameters is the number SAS reports plus 1. Hence
d = 2 + 1 while d0 = 1 + 1.

(Note: If the NOINT option is specified, the model degrees of


freedom become correct.)

In practice it is not a problem whether data are centered or not, because we are mainly interested in differences between the numbers of parameters, i.e. differences in degrees of freedom.

October 17, 2001 Mixed Models Course 109

Example 19. (Continuation of Example 18) Below we find the


output from fitting the large and the reduced model in PROC GLM.
Dependent Variable: y Large model
Sum of
Source DF Squares Mean Square F Value Pr > F

Model 2 3.76999467 1.88499734 2.70 0.3954


Error 1 0.69877998 0.69877998
Corrected Total 3 4.46877465

Source DF Type III SS Mean Square F Value Pr > F


i 1 0.73276693 0.73276693 1.05 0.4924
j 1 3.03722775 3.03722775 4.35 0.2847

Dependent Variable: y Reduced model


Sum of
Source DF Squares Mean Square F Value Pr > F

Model 1 0.73276693 0.73276693 0.39 0.5951


Error 2 3.73600773 1.86800386
Corrected Total 3 4.46877465

October 17, 2001 Mixed Models Course 110

94
In the notation from before

  D>D = RSS0 − RSS = 3.73600773 − 0.69877998 = 3.037
  e>e = RSS = 0.699
  d − d0 = 3 − 2 = 2 − 1 = 1
  n − d = 4 − 3 = 3 − 2 = 1

The F-statistic therefore becomes

  F = (3.037/1) / (0.699/1) = 4.35

This is the statistic reported in the Type III SS–section of the


output. So in most (but not all) cases, SAS does the work for us.
f in
October 17, 2001 Mixed Models Course 111

Example 20. The two–way analysis of variance with interactions

  Yijk = δ + αi + βj + γij + εijk ,    i = 1, 2; j = 1, 2; k = 1, 2, 3

has mean

        µ11        1 1 0 1 0 1 0 0 0
  µ =   µ12   =    1 1 0 0 1 0 1 0 0    β  =  Xβ,    β = (δ, α1, α2, β1, β2, γ11, γ12, γ21, γ22)>
        µ21        1 0 1 1 0 0 0 1 0
        µ22        1 0 1 0 1 0 0 0 1
October 17, 2001 Mixed Models Course 112

95
4 Linear normal models

Here we regard µij , 1 and 0 as vectors of length 3 such that µ


contains 12 elements.

In this form, the model is overparametrized so SAS works with an


equivalent representation, namely

  
  µ = X2β2     (8)

where

        1 1 1 1
  X2 =  1 1 0 0      and    β2 = (δ, α1, β1, γ11)>
        1 0 1 0
        1 0 0 0

f in
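In SAS this model is specified directly with an interaction term; the following is only a sketch, assuming a data set a with variables y, i and j as in the earlier examples:

proc glm data=a;
  class i j;
  model y = i j i*j / solution;
run; quit;

With the SOLUTION option the reported (non-unique) solution corresponds to the representation (8), where the parameters whose index contains level 2 are set to zero.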

October 17, 2001 Mixed Models Course 113

96
5 Some Basic Statistical Concepts

This lecture presented/refreshed basic statistical concepts, such as the central limit theorem, principles of estimation, the likelihood principle and tests of hypotheses.
Link to the full screen presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/StatTheory.f.pdf

97
5 Some Basic Statistical Concepts

Data and Models

The starting point for a statistical analysis is a set of observations

y = (y1, . . . , yn)

resulting from an experiment (or perhaps an observational study)


conducted in order to gain insight in a specific area.

We shall in general use the term experiment even though the setting
may not be that of a controlled experiment.

October 18, 2001 Mixed Models Course 1

Some Characteristics:

A fundamental characteristic of the experiment is that the outcome


is stochastic rather than deterministic.

Hence, if the experiment is repeated again under similar conditions


the new result would not necessarily be y.
Because of the random/stochastic variation in data, it is natural to
consider models based on probability theory, because this is the
branch of mathematics dealing with random variation. In this
setting, the starting point is the set of possible outcomes

Y = (Y1, . . . , Yn)

of the experiment.

October 18, 2001 Mixed Models Course 2

98
Here Yi could be for example
• the set of all real numbers,
• the set of positive real numbers,
• the set {diseased, not diseased}, or
• the set {low, medium, high}.

The link between the observed value yi and the set of possible values
Yi is established through the notion of a random variable Yi.

A random variable Yi is a function whose values can be in the set


Yi, and the observed value yi is said to be a realization of the
random variable Yi.

October 18, 2001 Mixed Models Course 3

The random variable Yi is a function, but not a deterministic


function such as e.g.
f (x) = x2 + 7.

It is a random function whose outcome on one hand is uncertain but


on the other hand typically governed by some rules. Those rules are
best formulated in terms of a probability distribution.

Example 1. : Binomial Experiment Any animal can be infected


with a specific disease, i.e. it can be diseased or not–diseased.

For the ith animal in the population the state of disease is denoted by
Yi and Yi can therefore take one of the values {diseased, not diseased}
(for brevity written simply as {1, 0}).

f in
October 18, 2001 Mixed Models Course 4

99
5 Some Basic Statistical Concepts

Example 2. : Binomial Experiment If the possible outcomes of Yi are {diseased, not diseased} (for brevity written simply as {1, 0}), the random variable Yi can be either 1 or 0. A statistical model for Yi is obtained by specifying the probability distribution for Yi, for instance

  p(Y = y) = θ^y (1 − θ)^(1−y)

where 0 ≤ θ ≤ 1. f in

Example 3. : Samples from the normal distribution If Yi has a


normal distribution, e.g. Yi ∼ N (θ, 1) the set of possible outcomes
Yi is the real line. f in

October 18, 2001 Mixed Models Course 5

In both examples, the function Yi is specified through a


probability distribution.

The distribution depends on an (unknown) parameter θ. (In the


examples, θ is a single number but more generally the parameter is a
vector θ = (θ1, . . . , θp).)

October 18, 2001 Mixed Models Course 6

100
In statistical terms, one speaks of a parametrical statistical model:

1. It is a statistical model, because the outcome of Yi is described in


terms of a probability distribution.

2. It is a parametrical model because once the parameter θ is known


the distribution is known.

October 18, 2001 Mixed Models Course 7

Why the Normal Distribution is so “Normal”

The most frequently employed distribution is the normal distribution.


Many (but certainly not all) random phenomena encountered in
practice exhibit a certain regularity:

1. Observations have a tendency to be clustered around a “mean


value”.

2. Deviations from the “mean value” are often symmetric.

3. The histogram of observations can be well approximated with the


bell–shaped normal (or Gaussian) distribution

October 18, 2001 Mixed Models Course 8

101
5 Some Basic Statistical Concepts

[Figure: histogram of z.mean (relative frequency), showing a bell–shaped distribution.]

The bell-shaped curve is written

  f(y; µ, σ²) = ( 1/(√(2π) σ) ) exp( −(y − µ)²/(2σ²) )

Why does this bell–shaped curve fit quite well to many


phenomenons encountered in practice??

October 18, 2001 Mixed Models Course 9

The Central Limit Theorem

Parts of the answer is given by the Central Limit Theorem:

Let Z1, . . . , Zn be independent random variables with E(Zi) = µi and Var(Zi) = σi².

Let Y = Σ_{i=1}^{n} Zi.

Then E(Y ) = µ = Σ_i µi and Var(Y ) = σ² = Σ_i σi².

What about the distribution of Y ?

October 18, 2001 Mixed Models Course 10

102
Result 1. The Central Limit Theorem says that

Y ∼approx N (µ, σ 2).

The approximation becomes better as n → ∞.

(Note: We have not made any assumption about the distribution of the Zis – it has only been assumed that they are independent.)

Many things encountered in nature can be regarded as the sum of many small (independent) contributions. That is one explanation why the normal distribution is so “normal”.

October 18, 2001 Mixed Models Course 11

Example 4. Let Zi be uniformly distributed on [0, 1], i.e. all values


in the [0, 1]–interval are “equally likely” for i = 1, . . . , 4.
How does the distribution of Z̄ = (1/n) Σ_{i=1}^{n} Zi look?

Quite normal, actually !


[Figure: histograms of z1 and z2 (uniform), a histogram of z.mean, and a normal Q–Q plot of z.mean.]

f in
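A figure like the one above can be reproduced by simulation. The SAS program below is only a sketch; the data set and variable names are invented for the illustration:

data clt;
  do rep = 1 to 1000;
    /* mean of four independent uniform(0,1) variables */
    z_mean = mean(ranuni(0), ranuni(0), ranuni(0), ranuni(0));
    output;
  end;
run;

proc univariate data=clt noprint;
  histogram z_mean / normal;   /* histogram of the simulated means with a fitted normal curve */
run;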

October 18, 2001 Mixed Models Course 12

103
5 Some Basic Statistical Concepts

Some General Principles of Estimation

After establishing a statistical model a problem is to estimate the


value of the parameter θ. To find this estimate we need to make
some assumptions.

In what follows, a very fundamental assumption will be made:

There exists a true (but unknown) value of θ.

If θ had been known, then the distribution of Yi would be known


too. That is we would know the characteristics of the mechanism
which generated the data y.
October 18, 2001 Mixed Models Course 13

A consequence of this is that the important task is to obtain a good


estimate of θ. Some examples of doing so are given in the following.

October 18, 2001 Mixed Models Course 14

104
Example 5. (Continuation of Example 2) Consider the experiment
of tossing a “pin” n times, giving data y = (y1, . . . , yn). Hence the
possible outcomes are Yi = {up, down} which we write {1, 0}.

It is assumed that
P (Yi = 1) = θ
for all i, such that the probability of observing “pin up” (!) is the
same every time. If we observe that the pin points upwards all together y+ = Σ_i yi times, then it takes only very little creativity to
suggest that the relative frequency

y+/n

is a sensible estimate for θ. f in

October 18, 2001 Mixed Models Course 15

Example 6. : Linear regression Consider the case where a known number xi is associated with each outcome yi of the experiment, and where it is suspected that there might be an approximately linear relationship between xi and yi.

This can lead to the linear regression model

Yi ∼ N (θi, σ 2) where θi = θ0 + θ1xi

This model is fundamentally different from the model in Example 2:


In Example 2, each observation was assumed to have the same
distribution. In the present model, this is not the case as the mean
for each random variable Yi is allowed to depend on the value of xi.

October 18, 2001 Mixed Models Course 16

105
5 Some Basic Statistical Concepts

It is well known from any standard textbook on statistics that the


parameters θ = (θ0, θ1) can be estimated by minimizing the squared
distance between the observed and the expected values, i.e. by
minimizing the function
  D(θ0, θ1) = Σ_i ( yi − (θ0 + θ1xi) )²

f in

October 18, 2001 Mixed Models Course 17

Example 7. (Continuation of Example 3) Suppose we conduct an


experiment where each observation yi is a realization of Yi ∼ N (θ, 1).
Then it takes very little imagination to suggest that the average

  z1 = (1/n) Σ_{i=1}^{n} yi

is a sensible estimate for θ. f in

October 18, 2001 Mixed Models Course 18

106
In the examples above it is easy to suggest ways of estimating the
unknown parameters. These can be described as:

Example 5: Estimation by the relative frequency.

Example 6: Estimation by minimizing the squared distance.

Example 7: Estimation by the average.

However, it is clear that there is a need for:

• General principles for obtaining those estimates.

• Some notion for how “good” an estimate is.


October 18, 2001 Mixed Models Course 19

In the following we present and discuss some of these principles


briefly.

The exposition is by no means intended to be comprehensive or very precise.

The aim is solely to illustrate some of the considerations made in


connection with estimation of unknown parameters on the basis of
data.

Eventually the exposition leads to the method of maximum


likelihood.

October 18, 2001 Mixed Models Course 20

107
5 Some Basic Statistical Concepts

Method of Moments

One approach is to base the estimation on the moments, i.e. the expectation, variance etc. of random variables.

Recall that the first moment of a random variable X is E(X) and the second central moment of X is E(X − E(X))² = Var(X).

For Example 3 with Yi ∼ N(θ, 1) we define a new random variable, say Z1, as the average of the Yis. Then it is well known that

  Z1 = (1/n) Σ_{i=1}^{n} Yi ∼ N(θ, 1/n)

October 18, 2001 Mixed Models Course 21

The estimate z1 = (1/n) Σ_{i=1}^{n} yi can then be regarded as a realization of the random variable Z1 which has mean E(Z1) = θ.

It is important to keep in mind that Z1 is a function of Y1, . . . , Yn, which can be emphasized by writing Z1(Y ). Likewise, z1 is a function of the observed data, which is emphasized by writing z1(y).

We say that

• the random variable Z1(Y ) is an estimator, and

• a specific value of Z1(y) is an estimate.

October 18, 2001 Mixed Models Course 22

108
The method of moments is to consider θ̂(y) = z1(y) as a good estimate of θ, because the corresponding random variable Z1(Y ) has θ as its expectation:

  E(Z1(Y )) = θ     (1)

October 18, 2001 Mixed Models Course 23

How good is an estimator?

An estimator with the property (1) is said to be unbiased.

Unbiasedness seems to be a desirable property of an estimator.

However, there are many estimators with the property (1). Two additional ones are

• the average Z2(Y ) = (Y1 + Y2)/2 of the first two random variables, and

• Z3(Y ) = Y1, i.e. the first random variable itself.


October 18, 2001 Mixed Models Course 24

109
5 Some Basic Statistical Concepts

Yet, intuition indicates that z1 is a “better” estimate of θ than


z2 = (y1 + y2)/2 which in turn is “better” than z3 = y1.

To be precise about what is meant by “better” we consider the


variance of the estimators:

V ar(Z1(Y )) = 1/n
V ar(Z2(Y )) = 1/2
V ar(Z3(Y )) = 1
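The first of these follows from the rule for the variance of a sum of independent variables:

  Var(Z1(Y )) = Var( (1/n) Σ_i Yi ) = (1/n²) Σ_i Var(Yi) = (1/n²) · n · 1 = 1/n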

October 18, 2001 Mixed Models Course 25

Hence (with more than 2 observations), we have

V ar(Z1) < V ar(Z2) < V ar(Z3),

and on the basis of this it is clear that we will consider Z1 to be a


better estimate of θ than Z2 or Z3.

Note: Because estimates are realizations of random variables (their corresponding estimators), it is “a must” always to report a variance, a standard deviation or a related quantity whenever reporting the value of an estimate.

October 18, 2001 Mixed Models Course 26

110
Someone might suggest to estimate θ by Z4(Y ) = Z1(Y ) + 7.

In terms of considering estimators with small variance as being


“good”, one can argue that Z4 is just as good as Z1, because
V ar(Z4) = V ar(Z1).

However, E(Z4) = θ + 7 ≠ θ, so Z4 is not an unbiased estimator of θ.

These considerations suggest that good estimators should be


unbiased and have as small variance as possible.

October 18, 2001 Mixed Models Course 27

These two criteria lead to the theory of Minimum Variance Unbiased Estimation – sometimes written briefly as MVUE. It is not surprising that Z1 is a MVUE (Minimum Variance Unbiased Estimator).

In general, establishing MVUEs can be a complicated task: Finding estimators that are unbiased may not be too hard, but finding one with the smallest possible variance may be very complicated.

October 18, 2001 Mixed Models Course 28

111
5 Some Basic Statistical Concepts

Consistency of Estimators

The estimator Z1 has other nice properties compared with Z2, Z3


and Z4.
When the number of observations n tends to infinity, the variance of Z1 tends to 0. The practical implication of this is straightforward: Z1 becomes indistinguishable from its expectation θ. An estimator with this property is said to be consistent.

Consistency is an attractive feature of an estimator, because it means that the estimate of θ gets better and better the more data we collect.

It is clear that none of Z2, Z3 and Z4 is consistent.

October 18, 2001 Mixed Models Course 29

Desirable Properties of Estimators

From the discussion above we have found that

• Unbiasedness,

• Smallest possible variance, and

• Consistency

are three attractive properties of estimators.

October 18, 2001 Mixed Models Course 30

112
Estimators, whatever kind they are, are functions of the random
variables Y1, . . . , Yn from which data y1, . . . , yn are realizations.
Hence estimators are random variables and as such they have a
distribution. This distribution is needed when drawing inference
about a parameter, e.g. when making a test or constructing a
confidence interval.

Therefore a fourth desirable property of an estimator is that

• The distribution of the estimator is known.

October 18, 2001 Mixed Models Course 31

The Method of Maximum Likelihood

There is a general estimation method called maximum likelihood


estimation to be discussed in the following.

An estimator obtained from this method does not in general have the attractive properties mentioned above – but almost. That is, when the sample size goes to infinity (in a sufficiently well behaved way), then the properties hold.

We say that the estimator is asymptotically unbiased, asymptotically has the smallest possible variance, is consistent and, finally, the distribution of the estimator is asymptotically normal.
October 18, 2001 Mixed Models Course 32

113
5 Some Basic Statistical Concepts

These four properties of maximum likelihood estimators indicate why this is such a powerful method.

Moreover, it turns out that the estimation can be carried out by maximizing a particular function, called the likelihood function.

Maximization of such a function can in practice be complicated, but is in principle not much different from what we all learned in high school: Calculate the derivative, set it equal to zero and solve!

October 18, 2001 Mixed Models Course 33

Example 8. : Binomial Experiment

Consider n throws with a pin where θ = P r(“Falls with pin up”).


Hence the outcome of the ith toss can be {Up, Down}, written briefly as {1, 0}, and

  p(yi; θ) = P(Yi = yi; θ) = θ^yi (1 − θ)^(1−yi)

October 18, 2001 Mixed Models Course 34

114
Suppose the observed data are y = {1, 1, 0, 1, 0, 1, 0, . . . , 0, 0}.

If the outcomes of the tosses are independent, then the probability of


observing y is

p(y; θ) = p(y1; θ)p(y2; θ) . . . p(yn; θ)


= p(1)p(1)p(0)p(1)p(0)p(1)p(0) . . . p(0)p(0)
= θθ(1 − θ)θ(1 − θ)θ(1 − θ) . . . (1 − θ)(1 − θ)
  = θ^y+ (1 − θ)^(n−y+)     (2)

where n is the number of times the pin is thrown and y+ = Σ_i yi is the number of times the pin points up.

f in

October 18, 2001 Mixed Models Course 35

The Likelihood function

When data y is observed, p(y; θ) can be regarded as a function of


θ. This function is called the likelihood function and is denoted by
L(θ).

Hence in the example,

  L(θ) = θ^y+ (1 − θ)^(n−y+).

To be specific, let the pin be thrown n = 25 times, and suppose that


pin up is observed y+ = 10 times. Then we have

  L(θ; y) = θ^10 (1 − θ)^(25−10)


October 18, 2001 Mixed Models Course 36

115
5 Some Basic Statistical Concepts

Figure 1 shows a plot of L(θ) against θ for n = 25 and y+ = 10.

[Figure 1: Likelihood function for n = 25 and y+ = 10.]
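A plot like Figure 1 can be produced by evaluating L(θ) on a grid of θ values. The SAS program below is only a sketch; the data set and variable names are invented:

data lik;
  n = 25; yplus = 10;
  do theta = 0.01 to 0.99 by 0.01;
    L    = theta**yplus * (1 - theta)**(n - yplus);           /* likelihood     */
    logL = yplus*log(theta) + (n - yplus)*log(1 - theta);     /* log-likelihood */
    output;
  end;
run;

proc gplot data=lik;
  plot L*theta logL*theta;
run; quit;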

October 18, 2001 Mixed Models Course 37

116
The Maximum likelihood principle

The principle in maximum likelihood estimation is that

the estimate of θ is the value of θ which maximizes the likelihood


function.

One can think of θ̂ as the value of θ which maximizes the probability


of observing the data which one actually has observed.

October 18, 2001 Mixed Models Course 38

• This value is called the maximum likelihood estimate (MLE) and is often denoted by θ̂.

• The corresponding estimator is called the maximum likelihood estimator.

For clarity one should write θ̂(y) for the estimate and θ̂(Y ) for the corresponding estimator, but this is too cumbersome to do. So, except for special cases, we simply write θ̂ for both entities and then derive from the context whether it is an estimate (a number) or an estimator (the corresponding random variable).

Figure 1 suggests that 0.4 is the maximum likelihood estimate.

October 18, 2001 Mixed Models Course 39

117
5 Some Basic Statistical Concepts

It is often easier to maximize the log-likelihood function often


denoted by l(θ):

l(θ) = log L(θ) = y+ log θ + (n − y+) log(1 − θ)

Since log is a monotone function the value of θ maximizing l(θ) will


also maximize L(θ).

October 18, 2001 Mixed Models Course 40

Figure 2 shows a plot of l(θ) against θ for n = 25 and y+ = 10.


[Figure 2: Log–likelihood function for n = 25 and y+ = 10.]

October 18, 2001 Mixed Models Course 41

118
Maximization of

l(θ) = y+ log θ + (n − y+) log(1 − θ)

is obtained by solving the equation

  S(θ) = l′(θ) = 0,

where l′(θ) denotes the derivative of l(θ).

• The function S(θ) is called the score function.

• The equation S(θ) = 0 is called the likelihood equation.

We find that

  S(θ) = l′(θ) = y+/θ − (n − y+)/(1 − θ) = 0
October 18, 2001 Mixed Models Course 42

119
5 Some Basic Statistical Concepts

which happens if and only if

  θ̂ = y+/n

Hence, the maximum likelihood estimate is just the relative frequency. The corresponding maximum likelihood estimator is

  θ̂(Y+) = Y+/n.

Hence when y+ (= 10) is observed, the observed value of the maximum likelihood estimator (i.e. the maximum likelihood estimate) becomes θ̂(y+) = θ̂(10) = 0.4 – in accordance with Figure 1 and Figure 2.

October 18, 2001 Mixed Models Course 43

How Good is the Estimate?

When y+ = 10 and n = 25 we have θ̂ = 0.4, but the same value is


found if y+ = 2 and n = 5.

However, intuition suggests that with 25 observations we should


have more confidence that θ̂ is a good estimate than with only 5
observations. That is, we would expect that the variance of the
estimator is smaller with 25 observations than with only 5.

It is well known for binomial experiments that V ar(Y+) = nθ(1 − θ)


and hence that V ar(θ̂) = θ(1 − θ)/n which indeed confirms the
intuition.

October 18, 2001 Mixed Models Course 44

120
In Figure 3 the likelihood function is shown for (n = 5, y+ = 2), (n = 10, y+ = 4), (n = 25, y+ = 10) and (n = 50, y+ = 20).

[Figure 3: Likelihood functions for (n = 5, y+ = 2), (n = 10, y+ = 4), (n = 25, y+ = 10) and (n = 50, y+ = 20).]
October 18, 2001 Mixed Models Course 45

It is clear from those graphs that the more observations, the more “peaked” the likelihood function is and the higher its curvature is at its maximum.

That is, the value of L(θ̂) is more and more distinct from the value of L(θ) for θ ≠ θ̂ when more and more observations are made.

It is therefore not surprising that there is a connection (indeed it turns out to be a close connection) between the variance of the maximum likelihood estimator and the curvature of the likelihood function at its maximum.

This connection is presented in the next sections.

October 18, 2001 Mixed Models Course 46

121
5 Some Basic Statistical Concepts

The Asymptotic Normal Distribution of the MLE

In this section we present a very important result:

The maximum likelihood estimator is asymptotically normally


distributed.

This property of the MLE is central to much practical statistical


inference.

October 18, 2001 Mixed Models Course 47

Example 9. Frequently one is interested in making statements about θ on the basis of the experiment. For example one might be interested in whether one can reasonably assume that the true value of θ is 0.5.

The key to answering this question is the random variable θ̂(Y ). Put in a popular way, one has to investigate whether 0.5 is a “likely” outcome of θ̂(Y ). To answer that question, one needs to know the distribution of θ̂(Y ) – and this distribution is in general very complicated to find. f in

October 18, 2001 Mixed Models Course 48

122
Therefore one frequently resorts to an approximate result, on which so much resides in statistics:

When n → ∞ and certain conditions are satisfied, then it holds approximately that

  θ̂ ∼ N( θ, −1/l″(θ̂) )

That is, the distribution of θ̂(Y ) will asymptotically be like a normal distribution with the true (but unknown) parameter θ as expectation and variance −1/l″(θ̂).

October 18, 2001 Mixed Models Course 49

Example 10. For the binomial experiment, it is not hard to see why the MLE is asymptotically normal:

We can regard y+ as a sum of independent random variables yi, where yi = 1 corresponds to pin up and yi = 0 is “pin not up”. Hence the Central Limit Theorem gives that y+ is approximately normally distributed, and hence so is θ̂ = y+/n.

For a single experiment we know that E(yi) = θ and Var(yi) = θ(1 − θ). From this we find that

  E(θ̂) = θ,    Var(θ̂) = θ(1 − θ)/n

so approximately,

  θ̂ ∼ N( θ, θ(1 − θ)/n )

f in

October 18, 2001 Mixed Models Course 50

123
5 Some Basic Statistical Concepts

Example 11. In general, the answer is not so straightforward. We therefore outline the “standard” calculations which one goes through in this connection:

The expression for the variance is obtained as follows: Recall that the log-likelihood and score functions are given by

  l(θ) = x log θ + (n − x) log(1 − θ)

  S(θ) = l′(θ) = x/θ − (n − x)/(1 − θ)

Differentiating the score function and changing sign gives

  −l″(θ) = x/θ² + (n − x)/(1 − θ)²
October 18, 2001 Mixed Models Course 51

In practice θ is unknown. However, it can be justified to plug the estimate θ̂ = x/n into l″(θ), and this gives

  −l″(θ̂) = n / ( θ̂(1 − θ̂) )

Hence, asymptotically,

  θ̂ ∼ N( θ, θ̂(1 − θ̂)/n )

October 18, 2001 Mixed Models Course 52

124
With n = 25, x = 10 we get θ̂ = 0.4 and Var(θ̂) ≈ 0.0096. Hence, an (approximate) 95% confidence interval for θ is

  ( θ̂ − 1.96·√Var(θ̂) ; θ̂ + 1.96·√Var(θ̂) ) = (0.4 − 0.19 ; 0.4 + 0.19) = (0.21 ; 0.59)

f in
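The same interval can be computed in a small SAS data step; this is only a sketch, with the numbers of the example hard-coded:

data ci;
  n = 25; yplus = 10;
  thetahat = yplus / n;
  se = sqrt( thetahat*(1 - thetahat)/n );
  lower = thetahat - 1.96*se;
  upper = thetahat + 1.96*se;
  put thetahat= se= lower= upper=;   /* writes the estimate and the 95% limits to the log */
run;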

October 18, 2001 Mixed Models Course 53

Asymptotic normality of transformations of the MLE

If h is a function of θ, then the distribution of h(θ̂) will, asymptotically, look like a normal distribution with mean h(θ) and a variance which can be estimated by −h′(θ̂)²/l″(θ̂), i.e. asymptotically

  h(θ̂) ∼ N( h(θ), −h′(θ̂)²/l″(θ̂) )

October 18, 2001 Mixed Models Course 54

125
5 Some Basic Statistical Concepts

Example 12. For example, if we are more comfortable with interpreting the odds η = h(θ) = θ/(1 − θ), we find h′(θ) = 1/(1 − θ)². Hence, asymptotically,

  η̂ ∼ N( θ/(1 − θ), θ̂/(n(1 − θ̂)³) ) ≈ N( θ/(1 − θ), 0.074 ).

f in

October 18, 2001 Mixed Models Course 55

Tests of Hypotheses

The final point to touch upon concerns tests of hypotheses regarding


θ.

Suppose interest is in testing whether θ is equal to a specific fixed


value θ0.

The likelihood ratio test

The maximum likelihood estimate θ̂ is the value of θ which gives the


observed data the highest probability which is L(θ̂).

If the value θ0 assigns nearly the same probability L(θ0) as θ̂ does,


we would be tempted to accept the hypothesis that θ = θ0.
October 18, 2001 Mixed Models Course 56

126
In other words, it is tempting to consider the likelihood ratio test statistic Q defined by

  Q = L(θ0) / L(θ̂)

Clearly Q is a number between 0 and 1, and values close to 1 are in favor of the hypothesis.

It can be shown that if the hypothesis is true then

  −2 log Q = 2( l(θ̂) − l(θ0) )

has (when n is large) approximately a χ² distribution with 1 degree of freedom. Large values of −2 log Q lead to rejection of the hypothesis. In Figure 4 it can be seen that −2 log Q is twice the
October 18, 2001 Mixed Models Course 57

vertical distance between the value of l at θ̂ and at θ0.

[Figure 4: Illustration of the likelihood ratio test, the score test and the Wald test: the log-likelihood curve l(θ), the vertical distance l(θ̂) − l(θ0), and the slope l′(θ) at θ0, with θ̂ and θ0 marked on the θ-axis.]

The Score Test

A test statistic equivalent to −2 log Q is obtained by considering the slope of l at the point θ0. It is known that the slope of l at θ̂ is 0
October 18, 2001 Mixed Models Course 58

127
5 Some Basic Statistical Concepts

(l′(θ̂) = 0 by definition of the MLE). Hence values of l′(θ0) near 0 will also speak in favor of the hypothesis.

It can be shown that when n is large and the hypothesis is true, the distribution of the so-called score test statistic

  S = −l′(θ0)² / l″(θ0)

will also look like a χ² distribution with 1 degree of freedom.

Hence when n is large the likelihood ratio test and the score test are
equivalent.

The Wald Test

A third test is the Wald test which compares the values of θ̂ and θ0
October 18, 2001 Mixed Models Course 59

directly, corresponding to the horizontal distance in Figure 4.

It can be shown that when n is large and the hypothesis is true, the distribution of the Wald test statistic

  W = (θ̂ − θ0)² · ( −l″(θ̂) )

will also look like a χ² distribution with 1 degree of freedom.

Note that W is simply the square of the difference (θ̂ − θ0) divided by its estimated variance −1/l″(θ̂). In the literature, one frequently uses the term “Wald test” about the square root of W, which yields a test statistic with approximately a N(0, 1) distribution.

October 18, 2001 Mixed Models Course 60

128
Hence when n is large the likelihood ratio test, the score test and
the Wald test are equivalent.
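As a worked illustration (with the numbers from the pin example), consider testing θ0 = 0.5 with n = 25 and y+ = 10, so that θ̂ = 0.4:

  −2 log Q = 2( l(0.4) − l(0.5) ) = 2( −16.82 − (−17.33) ) ≈ 1.01
  S = −l′(0.5)²/l″(0.5) = (−10)²/100 = 1.00
  W = (0.4 − 0.5)² · 25/(0.4 · 0.6) ≈ 1.04

The three statistics are close to each other and all well below 3.84, the 95% quantile of the χ² distribution with 1 degree of freedom, so the hypothesis θ = 0.5 is not rejected.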

October 18, 2001 Mixed Models Course 61

How to get the asymptotic normality

This section is somewhat theoretical.

Consider the following general setup: Let X be a single random variable. The expectation and variance of X are

  µ = E(X) = ∫ x p(x; θ) dx

  Var(X) = ∫ (x − µ)² p(x; θ) dx.

Since X is a random variable, then so is the score function S(θ; X) = l′(θ; X).
October 18, 2001 Mixed Models Course 62

129
5 Some Basic Statistical Concepts

For later purposes we need the mean and the variance of the score
function.

To obtain these quantities, we use the following facts:

  S(θ) = l′(θ; x) = ( log p(x; θ) )′ = p′(x; θ)/p(x; θ)

  S′(θ) = l″(θ; x) = −p′(x; θ)²/p(x; θ)² + p″(x; θ)/p(x; θ)

  ∫ p(x; θ) dx = 1

The function S′(θ) is called the Hessian (matrix) and is very important in connection with PROC MIXED.
October 18, 2001 Mixed Models Course 63

Moreover, in most cases of practical interest, the order of


differentiation and integration can be interchanged. Hence

  ∫ (d/dθ) p(x; θ) dx = (d/dθ) ∫ p(x; θ) dx = (d/dθ) 1 = 0

Mean of the score function: We shall suppress the dependence on X in the following. We find that

  E(S(θ)) = E(l′(θ)) = ∫ l′(θ) p(x; θ) dx = ∫ p′(x; θ) dx

Interchanging the order of differentiation and integration yields

  E(S(θ)) = ∫ (d/dθ) p(x; θ) dx = (d/dθ) ∫ p(x; θ) dx = (d/dθ) 1 = 0
October 18, 2001 Mixed Models Course 64

130
So the expected value of the score function is zero.

Variance of the score function. The variance of the score function
has a special name, the Fisher information, and is usually
denoted by I(θ). Hence we have

I(θ) = Var(S(θ)) = E(S(θ)²)
     = E([l′(θ)]²)
     = ∫ l′(θ)² p(x; θ) dx = ∫ (1/p(x; θ)) p′(x; θ)² dx

because the expected value is zero.

A more convenient expression for the variance can be found in


October 18, 2001 Mixed Models Course 65

terms of the derivative of the score function:

E(S′(θ)) = E(l″(θ))
         = ∫ [ −(1/p(x; θ)²) (p′(x; θ))² + (1/p(x; θ)) p″(x; θ) ] p(x; θ) dx
         = ∫ [ −(1/p(x; θ)) (p′(x; θ))² + p″(x; θ) ] dx

Interchanging the order of differentiation and integration as before
gives that ∫ p″(x; θ) dx = 0. Hence

E(S′(θ)) = −∫ (1/p(x; θ)) (p′(x; θ))² dx = −Var(S(θ; X)).
October 18, 2001 Mixed Models Course 66

131
5 Some Basic Statistical Concepts

Hence we have for a single observation

E(S(θ)) = 0
I(θ) = Var(S(θ)) = E(S(θ)²) = −E(S′(θ))          (3)
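As a quick check of (3) in a familiar case (a simple illustration, not part of the original example material): for a single observation X ∼ N (θ, σ²) with σ² known,

S(θ) = l′(θ; X) = (X − θ)/σ²,     S′(θ) = −1/σ²

so E(S(θ)) = 0 and I(θ) = Var((X − θ)/σ²) = σ²/σ⁴ = 1/σ² = −E(S′(θ)), in agreement with (3).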

The likelihood for all data

From (2) it is seen that the likelihood for all data is the product of
the likelihood for each observation, i.e.
L(θ; y) = p(y1; θ) · · · p(yn; θ) = ∏i p(yi; θ),

Consequently, the log–likelihood, the score function and the


derivative of the score function for all data are sums of independent
October 18, 2001 Mixed Models Course 67

components:

l(θ) = Σi l(θ; yi) = Σi li(θ)
S(θ) = l′(θ; y) = Σi l′(θ; yi) = Σi S(θ; yi) = Σi Si(θ),
S′(θ) = Σi Si′(θ),          (4)

For a single observation we have

E(Si(θ)) = 0
I(θ) = Var(Si(θ)) = E(Si(θ)²) = −E(Si′(θ))
October 18, 2001 Mixed Models Course 68

132
and correspondingly for all observations

E(S(θ)) = 0
V ar(S(θ)) = nI(θ).

We then need three small results:

Result 1: Since S′(θ; y) = Σi Si′(θ), it is reasonable to assume (using
the law of large numbers) that

(1/n) S′(θ) = (1/n) Σi Si′(θ) ≈ E(Si′(θ)) = −I(θ)

Result 2: S(θ) = Σi Si(θ) is a sum of independent random
variables with E(Si(θ)) = 0 and Var(Si(θ)) = I(θ). Hence by
October 18, 2001 Mixed Models Course 69

the central limit theorem, approximately

S(θ) ∼ N (0, nI(θ))

Result 3: Let θ0 be the true (but unknown to us) value of the


parameter θ. Let us assume that θ̂ is a good estimate, i.e. close to
θ0. Then

0 = S(θ̂) ≈ S(θ0) + S′(θ0)(θ̂ − θ0)

That is

(1/√n) S(θ0) ≈ −(1/n) S′(θ0) · √n(θ̂ − θ0)
             ≈ I(θ0) · √n(θ̂ − θ0)
October 18, 2001 Mixed Models Course 70

133
5 Some Basic Statistical Concepts

The left hand side is approximately N (0, I(θ)) distributed.

Hence, approximately, (1/(√n I(θ0))) S(θ0) ∼ N (0, I(θ)−1). That is,
approximately,

√n(θ̂ − θ0) ∼ N (0, I(θ)−1)

or

θ̂ ∼ N (θ0, (nI(θ))−1),

as desired.

October 18, 2001 Mixed Models Course 71

Likelihood and Linear Normal Models

For a linear normal model maximum likelihood estimation is the


same as least squares estimation. The unknown parameters are β
and σ 2, so let θ = (β, σ 2).

October 18, 2001 Mixed Models Course 72

134
Because the observations are independent, the likelihood becomes

L(θ) = f (y1, . . . , yn; θ)
     = ∏i=1..n f (yi; θ)
     = ∏i=1..n (1/√(2π)) (1/√(σ²)) exp(−(1/(2σ²)) (yi − µi)²)
     = (1/√(2π))ⁿ (1/√(σ²))ⁿ exp(−(1/(2σ²)) Σi (yi − µi)²)
     = (1/√(2π))ⁿ (1/√(σ²))ⁿ exp(−(1/(2σ²)) (y − Xβ)>(y − Xβ))

For the moment, suppose σ is known.


October 18, 2001 Mixed Models Course 73

Maximizing L(θ) = L(β, σ²) is done by minimizing

Σi (yi − µi)² = (y − Xβ)>(y − Xβ). But this is exactly what is
done in least squares estimation.

October 18, 2001 Mixed Models Course 74

135
5 Some Basic Statistical Concepts

Once β has been estimated, it can be verified that the maximum


likelihood estimate for σ is

σ̂² = (1/n) (y − µ̂)>(y − µ̂)

In practice, one never uses this variance estimate. Instead one uses

σ̃² = (1/(n − p)) (y − µ̂)>(y − µ̂)

where p is the number of parameters in β.


October 18, 2001 Mixed Models Course 75

The reason for using the latter estimate is that

E(σ̂²) = ((n − p)/n) σ²
E(σ̃²) = σ²

Hence the latter estimate is unbiased while the former is not.

October 18, 2001 Mixed Models Course 76

136
6 An overview

The purpose of this lecture was to illustrate, how the problems of the research within the
biological sciences is related to the progress within statistical theory both in general, and related
to mixed models.
Starting out with an experiment reported from Darwin, the lecture discussed the state of the art
of experimental design and analysis at Darwin’s time, proceeded with the progress in statistical
theory, very much related to animal breeding, and ended up with the general theory of mixed
models. Important researchers such as F. Galton, R.A. Fisher, S. Wright, C.R.Henderson were
presented.
The slides are in Danish. Link to the full screen presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/oversigt.f.pdf

137
6 An overview

Outline
• Baggrund for metoder
• Historisk forløb
• Relation til vores fagområder

February 7, 2001

Darwins Majs

• C. Darwin (1876) The effects of cross- and self-fertilisation in the


vegetable Kingdom. John Murray, London.

• c.f. Fisher, R.A. (1935) Design of Experiments. Oliver and Boyd.

February 7, 2001

138
3

Darwins Majs
column I II III
Crossed Self. -fert
Pot I 23 48 17 38
12 20 38
21 20
Pot II 22 20
19 18 18 38
21 48 18 58
Pot III 22 18 18 58
20 38 15 28
18 28 16 48
21 58 18
23 28 16 28
Pot IV 21 18
22 18 12 68
23 15 48
12 18

February 7, 2001

Darwins Majs
column I II III
Crossed Self. -fert
Pot I 23.50 17.38
12.00 20.38
21.00 20.00
Pot II 22.00 20.00
19.13 18.38
21.50 18.63
Pot III 22.13 18.63
20.38 15.25
18.25 16.50
21.63 18.00
23.25 16.25
Pot IV 21.00 18.00
22.13 12.75
23.00 15.50
12.00 18.00

February 7, 2001

139
6 An overview

Darwins Majs
” As only a moderate number of crossed and self-fertilised plants
were measured, it was of great importance to learn, how far the
averages were trustworthy. I therefore asked Mr Galton, who has
much experience in statistical researches, to examine some of my
tables..... I may premise that if we took by chance a dozen score of
men belonging to different nations and measured them, it would I
presume, be very rash to form any judgment from such small
numbers on their average heights. But the case is somewhat
different with my crossed and self-fertilised plants, as they were of
exactly the same age, were subjected from first to last to the same
conditions, and were descended from the same parents”

February 7, 2001

Galtons tilgang
column I II III Sorteret Diff.
Crossed Self. -fert Crossed Self. -fert
Pot I 23.50 17.38 23.50 20.38 3.125
12.00 20.38 23.25 20.00 3.250
21.00 20.00 23.00 20.00 3.000
Pot II 22.00 20.00 22.13 18.63 3.500
19.13 18.38 22.13 18.63 3.500
21.50 18.63 22.00 18.38 3.625
Pot III 22.13 18.63 21.63 18.00 3.625
20.38 15.25 21.50 18.00 3.500
18.25 16.50 21.00 18.00 3.000
21.63 18.00 21.00 17.38 3.625
23.25 16.25 20.38 16.50 3.875
Pot IV 21.00 18.00 19.13 16.25 2.875
22.13 12.75 18.25 15.50 2.750
23.00 15.50 12.00 15.25 -3.250
12.00 18.00 12.00 12.75 -0.750

February 7, 2001

140
7

Galtons Tilgang

• Sortering

• Differencer

• Spredning (Most probable error) – men ikke t-test

February 7, 2001

Hvem var Galton

Anthropologi, Meteorologi, populationsgenetik, Eugenics


(arvehygiejne), fingeraftryk, Korrelation.

Meget interesseret i målemetoder, objektiv kvantificering af


fænomener.

K. Pearson’s Guru

February 7, 2001

141
6 An overview

Korrekt tilgang ?
column I II III Diff.
Crossed Self. -fert
Pot I 23.50 17.38 3.125
12.00 20.38 3.250
21.00 20.00 3.000
Pot II 22.00 20.00 3.500
19.13 18.38 3.500
21.50 18.63 3.625
Pot III 22.13 18.63 3.625
20.38 15.25 3.500
18.25 16.50 3.000
21.63 18.00 3.625
23.25 16.25 3.875
Pot IV 21.00 18.00 2.875
22.13 12.75 2.750
23.00 15.50 -3.250
12.00 18.00 -0.750

February 7, 2001

10

Korrekt tilgang ?
• Differencer

• Spredning + t-test

• Anova. Lineær Normal Model.

• Hypotesetest. Nul hypoteser.

• Uafhængighedsantagelse.

• Randomisering

February 7, 2001

142
11

Korrekt tilgang ?
column I II III Diff.
Crossed Self. -fert
Pot I 23.50 17.38 3.125
12.00 20.38 3.250
21.00 20.00 3.000
Pot II 22.00 20.00 3.500
19.13 18.38 3.500
21.50 18.63 3.625
Pot III 22.13 18.63 3.625
20.38 15.25 3.500
18.25 16.50 3.000
21.63 18.00 3.625
23.25 16.25 3.875
Pot IV 21.00 18.00 2.875
22.13 12.75 2.750
23.00 15.50 -3.250
12.00 18.00 -0.750

February 7, 2001

12

Hvad er sket

• R.A. Fisher
? Rothamstead

• Student (W. Gossett)

February 7, 2001

143
6 An overview

13

Den 5. Potte

• Hvad forventer vi af udslag i potte 5. Hvad


er et gæt på forskellen ?

• Hvorfor ?.

• Hvad er et gæt på niveauet for Self-fertilized.

• Tilfældige effekter,
Populationer,
Stikprøver

February 7, 2001

14

Populationsgenetik

• Population

• P =A+M

• V(P ) = V(A) + V(M )


V(A)
• h2 = V(P )

• Ao = 12 Am + 21 Af

February 7, 2001

144
15

Populationsgenetik

• R.A. Fisher

• Sewall Wright

• (Haldane)

February 7, 2001

16

Hierarkiske populationer

[Diagram: hierarchical population structure with Sires, Females within sires, and Offspring within females.]

February 7, 2001

145
6 An overview

17

Populationsgenetik/ Husdyravl

• R.A. Fisher

• Sewall Wright

• Jay R. Lush

• C.R. Henderson

• S.R. Searle.

February 7, 2001

18

Husdyravl

• Oprindelig Hierarkisk Struktur

• Strukturen bryder ned, specielt pga. KS

• Metoder til krydset klassifikation

• Henderson’s Mixed Model Equations

February 7, 2001

146
19

Husdyravl

• Hovedvægt på estimation (Selektion)

• Afhængighed beskrives ved residual varians og


heretabilitet

• Problem er primært regneteknisk (Matrice-


invertering)

• Normalt MANGE! observationer

• Hypotesetest af mindre interesse

February 7, 2001

20

Mixed Models generelt

• Gentagne målinger/longitudinelle data

• Spatiale observationer

• Hierarkiske forsøgsdesign (e.g. split-plot)

• Mixed Model Equations fælles referenceramme

• Fælles program udvikling

February 7, 2001

147
6 An overview

21

Mixed Models generelt

• Hypotesetest af stor interesse

• Afhængighed beskrives ved mange


variansparametre

• Begrænset antal observationer

• Stadig løse ender

February 7, 2001

148
7 Experimental planning and design

The purpose of the lecture was to refresh the concepts used in experimental planning and design,
i.e., hypothesis, power of designs, blocking. Typical blocking factors were discussed.
Different types of experimental design, such as randomized block, split-plot, latin squares and
factorial designs, were discussed, and examples were sought within the participants areas of
research.
The slides are in Danish. Link to full-screen presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/Forsplanpl.f.pdf

149
7 Experimental planning and design

Outline
• Hypotheses
• Decision Support
• Need of information for planning
• Restrictions in experimental design
• Different designs

February 12, 2001

Forskningsprocessen

Pakke

Ansøgning
Publicering

Forsøg

February 12, 2001

150
3

Forskningsprocessen

• Få ideer til områder, hvor eksisterende


viden/teori er utilstrækkelig/forkert

• Foretage iagttagelser, så ideerne kan be- eller


afkræftes

• Beslutte om viden/teori skal justeres

• (Kvantificering af viden)

• Gruppearbejde over tid og sted

February 12, 2001

Darwins Majs
column I Height, Inch
Crossed Self. -fert
Pot I 23.50 17.38
12.00 20.38
21.00 20.00
Pot II 22.00 20.00
19.13 18.38
21.50 18.63
Pot III 22.13 18.63
20.38 15.25
18.25 16.50
21.63 18.00
23.25 16.25
Pot IV 21.00 18.00
22.13 12.75
23.00 15.50
12.00 18.00

February 12, 2001

151
7 Experimental planning and design

Hypotheses

Hypothesis A GMO sugar beets are not harmful to cows

Hypothesis B GMO sugar beets are harmful to cows

Hypothesis A Pesticide use reduces fertility

Hypothesis B Pesticide use does not reduce fertility

February 12, 2001

Luse Beslutningsstøtte

Table 1: Sprøjteeksempel – gevinsttabel


Afgrødens tilstand
Beslutning Ingen lus Lus
Sprøjt
Omkostninger til Omkostninger til
sprøjtemiddel og sprøjtemiddel og
arbejde arbejde
Sprøjt ikke 0 Udbytte tab

February 12, 2001

152
7

Forskning Beslutningsstøtte

Table 2: Forskningseksempel – gevinsttabel


’Verdens’ tilstand
Beslutning Hypotese 1 er Hypotese 2 er
sand sand
Accepter hypotese 1 OK Fejl !
Accepter hypotese 2 Dyr fejl ! Gennembrud
!

February 12, 2001

Typer af fejlkonklusion

Hypotese 1 Hypotese 2

Type I fejl Type II fejl

February 12, 2001

153
7 Experimental planning and design

Muligheder i designfase

Hypotese 1 Hypotese 2

Forøg præcision Forøg forsøgsudslag


NB!: Type I fejl er konstant, e.g. 0.05

February 12, 2001

10

Biologisk input

• Måleegenskaber

• Forventede forsøgsudslag

• Mulige konklusioner af forsøg

• Afhængige <> uafhængige hypoteser

• Hypotesegene(re)rende egenskaber

February 12, 2001

154
11

Table 3: Oversigt over forventede forsøgsudslag


Egenskab Hypotese 1 er sand Hypotese 2 er sand
Behandling 1 Behandling 2 Behandling 1 Behandling 2
A 100 100 100 120
B
.. .. .. .. ..

February 12, 2001

12

Typiske blokfaktorer

• Kuld
• Sti, Flok, Bur
• Køn
• Afstamning
• Besætning
• Observatør

February 12, 2001

155
7 Experimental planning and design

13

Begrænsninger i design muligheder

• Blokstørrelse
• Opstaldning/Management
• Ressource kamp

February 12, 2001

14

Designtyper

• Randomiseret Blokforsøg
• Split-Plot forsøg
• Romer Kvadrat
• Ikke komplette blokforsøg
• Faktorielle forsøg
• Fraktionerede designs

February 12, 2001

156
8 Randomized Complete Block Design

These are the first slides in the second block of lectures. They start off with the augmentation
of the linear normal model to a mixed model. Then PROC MIXED in SAS were presented, and
example 1.2.1 in LMSW (Littell et al., 1996) were discussed. The slides can be seen as a summary
of chapter 1 in LMSW.
Link to the full screen presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/RDBC.f.pdf

157
8 Randomized Complete Block Design

Outline
• Hypotheses

• Udvidelse af LNM

• Introduktion til Proc Mixed

• RCBD eksempel (1.2.4)

February 28, 2001

Linear Normal Model

Y11 = δ + α1 + u1 + ε11
Y12 = δ + α2 + u1 + ε12
Y21 = δ + α1 + u2 + ε21
Y22 = δ + α2 + u2 + ε22

εij ∼ N (0, σ 2)
Y ∼ N (Xβ, σ 2I)

February 28, 2001

158
5

Matrix formulering

(Y11)   (1 1 0) (δ )   (1 0)        (ε11)
(Y12) = (1 0 1) (α1) + (1 0) (u1) + (ε12)
(Y21)   (1 1 0) (α2)   (0 1) (u2)   (ε21)
(Y22)   (1 0 1)        (0 1)        (ε22)

Y = Xβ + Zu + ε

ε ∼ N (0, R), u ∼ N (0, G)

V(Zu) = Z V(u)Z > = ZGZ >

V(Y ) = ZGZ > + R
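For this small example the variance of Y can be written out explicitly. With G = σu² I and R = σ² I (the usual variance-component assumptions) one gets

V(Y) = σu² ZZ> + σ² I,   where

       (1 1 0 0)
ZZ> =  (1 1 0 0)
       (0 0 1 1)
       (0 0 1 1)

so two observations from the same block have covariance σu², and each observation has total variance σu² + σ².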

February 28, 2001

Random vs. Fixed

• Do the levels of the factor come from a probability distribution?


McCulloch & Searle (1997)

• Are Inferences to be drawn from these data about just these level
of the factor ? Searle, (1971)

February 28, 2001

159
8 Randomized Complete Block Design

ML - estimation

Type Distribution Estimate

LNM Y ∼ N (Xβ, σ 2I) β̂ = (X >X)−1X >y


If V is known:
LMM Y ∼ N (Xβ, V ) β̂ = (X >V −1X)−1X >V −1y

V = ZGZ > + R is not known, depends on parameters,


V = f (σ 2, σu2 ).

February 28, 2001

Likelihood function

l(y, β, σ², σu²) = −(1/2) log |V| − (1/2)(y − Xβ)>V −1(y − Xβ) − (n/2) log(2π)

[Figure: profile of the log-likelihood plotted against σ².]

February 28, 2001

160
9

Proc Mixed I

PROC MIXED < options > ;


BY variables ;
ID variables ;
WEIGHT variable ;

February 28, 2001

10

Proc Mixed II

CLASS variables ;

MODEL dependent = < fixed-effects > < / options > ;


RANDOM random-effects < / options > ;
REPEATED < repeated-effect> < / options > ;

PARMS (value-list) ... < / options > ;


PRIOR <distribution > < / options > ;

February 28, 2001

161
8 Randomized Complete Block Design

11

Proc Mixed III

CONTRAST ’label’ < fixed-effect values ... >


< | random-effect values ... > , ... < / options > ;
ESTIMATE ’label’ < fixed-effect values ... >
< | random-effect values ... >< / options > ;
LSMEANS fixed-effects < / options > ;
MAKE ’table’ OUT=SAS-data-set ;

February 28, 2001

12

Proc Mixed

Model concerns Xβ

Random concerns Zu and G = V(u)

Repeated concerns ε and R = V(ε)

February 28, 2001

162
13

Ingot: casting block / metal bar

metal: metal used for bonding (?) the ingot (nickel, iron, copper)
pres: pressure required to break the bond

/*---Data Set 1.2.4---*/


data rcb;
input ingot metal $ pres;
datalines;
1 n 67.0
1 i 71.9
1 c 72.2
.
.

February 28, 2001

14

Design

Ingot no.
Lodning 1 2 3 4 5 6 7
1 n i c c c n n
2 c n i i n c i
3 i c n n i i c

February 28, 2001

163
8 Randomized Complete Block Design

15

Andre eksempler på RCBD

• Parrede observationer
Den rullende Afprøvning
• (Beretning 685) Stigende mængder solsikkefrø (4 niveauer). 20
kuld a 4 grise.
• Beretning 546. Opdrætningsintensitet, Jersey. 10 par enæggede
tvillinger. Høj vs. lav intensitet,
• Forskningsrapport 25. Airwash systemet. Besætning opdeles efter
lige vs ulige konumre.

February 28, 2001

16

Proc Mixed model

proc mixed data=rcb;


class ingot metal;
model pres=metal;
random ingot;

lsmeans metal / pdiff;


estimate ’nickel mean’ intercept 1 metal 0 0 1;
estimate ’copper vs iron’ metal 1 -1 0;
contrast ’copper vs iron’ metal 1 -1 0;
run;

February 28, 2001

164
17

Anden notation

Yij = µ + αi + uj + εij

uj ∼ N (0, σu2 )
εij ∼ N (0, σε2)

February 28, 2001

18

Tredje notation

Y = Xβ + Zu + ε

u ∼ N (0, G)
ε ∼ N (0, R)

February 28, 2001

165
8 Randomized Complete Block Design

19

SAS (8E) Output


The Mixed Procedure

Model Information

Data Set WORK.RCB


Dependent Variable pres
Covariance Structure Variance Components
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment

February 28, 2001

20

Class Level Information

Class Levels Values

ingot 7 1 2 3 4 5 6 7
metal 3 c i n

February 28, 2001

166
21

Dimensions

Covariance Parameters 2
Columns in X 4
Columns in Z 7
Subjects 1
Max Obs Per Subject 21
Observations Used 21
Observations Not Used 0
Total Observations 21

February 28, 2001

22

Iteration History

Iteration Evaluations -2 Res Log Like Criterion

0 1 112.40987952
1 1 107.79020201 0.00000000

Convergence criteria met.

February 28, 2001

167
8 Randomized Complete Block Design

23

Estimate of σu2 , σε2

Covariance Parameter
Estimates

Cov Parm Estimate

ingot 11.4478
Residual 10.3716

February 28, 2001

24

Kriterier for fit af model, bruges ved modelsammenligninger.

Fit Statistics

-2 Res Log Likelihood 107.8


AIC (smaller is better) 111.8
AICC (smaller is better) 112.6
BIC (smaller is better) 111.7

February 28, 2001

168
25

Signifikans test

Type 3 Tests of Fixed Effects

Num Den
Effect DF DF F Value Pr > F

metal 2 12 6.36 0.0131

February 28, 2001

26

Degrees of Freedom

Numerator: H0 : α1 = α2 = α3 = 0

            (0 1 −1  0) (µ )
K>β = 0  ⇔  (0 1  0 −1) (α1) = 0
            (0 0  1 −1) (α2)
                        (α3)

Num DF is rank(K); here rank(K) = 2, matching the output above.

February 28, 2001

169
8 Randomized Complete Block Design

27

Denominator Containment method: ”Denote the fixed effect in


question A, and search the RANDOM effect list for the effects that
syntactically contain A. For example, the RANDOM effect B(A)
contains A, but the RANDOM effect C does not, even if it has the
same levels as B(A).
Among the RANDOM effects that contain A, compute their rank
contribution to the (XZ) matrix. The DDF assigned to A is
the smallest of these rank contributions. If no effects are found,
the DDF for A is set equal to the residual degrees of freedom,
N − rank(XZ)”

Methods: CONTAIN, BETWITHIN, RESIDUAL, SATTERTH, KENWARDROGER.

MODEL .... / DDFM=SATTERTH;

February 28, 2001

28

Output fra Estimate

Estimates Standard
Label Estimate Error DF t Value Pr > |t|

nickel mean 71.1000 1.7655 12 40.27 <.0001


copper vs iron -5.7143 1.7214 12 -3.32 0.0061

Contrasts
Num Den
Label DF DF F Value Pr > F

copper vs iron 1 12 11.02 0.0061

February 28, 2001

170
29

Least Squares Means

Standard
Effect metal Estimate Error DF t Value Pr > |t|

metal c 70.1857 1.7655 12 39.75 <.0001


metal i 75.9000 1.7655 12 42.99 <.0001
metal n 71.1000 1.7655 12 40.27 <.0001

Differences of Least Squares Means

Standard
Effect metal _metal Estimate Error DF t Value Pr > |t|

metal c i -5.7143 1.7214 12 -3.32 0.0061


metal c n -0.9143 1.7214 12 -0.53 0.6050
metal i n 4.8000 1.7214 12 2.79 0.0164

February 28, 2001

30

GLM

GLM:
Source DF Type III SS Mean Square F Value Pr > F

ingot 6 268.2895238 44.7149206 4.31 0.0151


metal 2 131.9009524 65.9504762 6.36 0.0131

Mixed:
Num Den
Effect DF DF F Value Pr >F

metal 2 12 6.36 0.0131

February 28, 2001

171
8 Randomized Complete Block Design

31

GLM:
Standard LSMEAN
metal pres LSMEAN Error Pr > |t| Number

c 70.1857143 1.2172327 <.0001 1


i 75.9000000 1.2172327 <.0001 2
n 71.1000000 1.2172327 <.0001 3

Mixed: Least Squares Means

Standard
Effect metal Estimate Error DF t Value Pr > |t|

metal c 70.1857 1.7655 12 39.75 <.0001


metal i 75.9000 1.7655 12 42.99 <.0001
metal n 71.1000 1.7655 12 40.27 <.0001

February 28, 2001

32

GLM: Standard
Parameter Estimate Error t Value Pr > |t|

nickel mean 71.1000000 1.21723265 58.41 <.0001


copper vs iron -5.7142857 1.72142692 -3.32 0.0061

Mixed: Standard
Label Estimate Error DF t Value Pr > |t|

nickel mean 71.1000 1.7655 12 40.27 <.0001


copper vs iron -5.7143 1.7214 12 -3.32 0.0061

February 28, 2001

172
33

Summary

• Model specification
• Output elements
• Estimation Methods
• Fit Statistics/Information Criterias
• Degrees of freedom, model parameters.
• GLM differs

February 28, 2001

34

IC Options

The IC option displays a table of various information criteria. The


criteria are all in smaller-is-better form, and are described in .
Criteria Formula Reference
AIC −2l + 2d Akaike (1974)
n∗
AICC −2l + 2d n∗−d−1 Burnham and Anderson (1998)
HQIC −2l + 2d log(log(n)) Hannan and Quinn (1979)
BIC −2l + d log(n) Schwarz (1978)
CAIC −2l + d(log(n) + 1) Bozdogan (1987)
Here l denotes the maximum value of the (possibly restricted) log
likelihood, d the dimension of the model, and n the number of
observations. In Version 6 of SAS/STAT software, n equals the
February 28, 2001

173
8 Randomized Complete Block Design

35

number of valid observations for maximum likelihood estimation and


n − p for restricted maximum likelihood estimation, where p equals
the rank of X. In later versions, n equals the number of effective
subjects as displayed in the ”Dimensions” table, unless this value
equals 1, in which case n equals the number of levels of the first
RANDOM effect you specify. If the number of effective subjects
equals 1 and you have no RANDOM statements, then n reverts to
the Version 6 values. For AICC (a finite-sample corrected version of
AIC), n∗ equals the Version 6 values of n, unless this number is less
than d + 2, in which case it equals d + 2.

For restricted likelihood estimation, d equals q the effective number


of estimated covariance parameters. In Version 6, when a parameter
estimate lies on a boundary constraint, then it is still included in the
calculation of d, but in later versions it is not. The most common
February 28, 2001

36

example of this behavior is when a variance component is estimated


to equal zero. For maximum likelihood estimation, d equals q + p.

For ODS purposes, the name of the ”Information Criteria” table is


”InfoCrit.

February 28, 2001

174
9 Randomized Complete Block Design II

These slides discussed the concept of BLUE and BLUP estimates. The question of model control
is addressed.
Link to full-screen presentation presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/RDBC2SLU.f.pdf

175
9 Randomized Complete Block Design II

Outline
• BLUEs and BLUPs
• Examples of model control

February 28, 2001

BLUEs and BLUPs

• Best Linear Unbiased Estimator l>Xβ0


• Best Predictor: E(u|y)

February 28, 2001

176
3

Linear Regression
[Figure: two scatter plots of x2 against x1, illustrating the regression of x2 on x1.]

February 28, 2001

Linear Regression

(x2)       ( (µx2)   (V2   C21) )
(x1)  ∼  N ( (µx1) , (C12  V1 ) )

E(X2|X1) = µx2 + C21 V1⁻¹ (x1 − µx1)
V(X2|X1) = V2 − C21 V1⁻¹ C21>
V(E(X2|X1)) = C21 V1⁻¹ C21>

February 28, 2001

177
9 Randomized Complete Block Design

(u)       ( (µu)   (G   C) )
(y)  ∼  N ( (µy) , (C>  V) )

u1 = u1
u2 = u2
Y11 = δ + α1 + u1 + ε11
Y12 = δ + α2 + u1 + ε12
Y21 = δ + α1 + u2 + ε21
Y22 = δ + α2 + u2 + ε22

February 28, 2001

BLUEs and BLUPs

• Best Linear Unbiased Estimator l>Xβ0


• Best Predictor: E(u|y)
• Best Linear Predictor: (µu) + CV −1(y − µy )
• Best Linear Unbiased Predictor:
BLUP(t>Xβ + s>u) = t>Xβ0 + s>CV −1(y − Xβ0)
• Estimated Best (?) Linear Unbiased Predictor:
EBLUP(t>Xβ + s>u) = t>X β̂0 + s>Ĉ V̂ −1(y − X β̂0)

EBLUP(t>Xβ + s>u) = t>X β̂0 + s>ĜZ >V̂ −1(y − X β̂0)

February 28, 2001

178
7

Variance in BLUP

u true value, ũ BLUP estimate, εu error in prediction.

u = ũ + εu ⇔ u − ũ = εu

V(u) = V(ũ) + V(εu)

The error of prediction:

V(ũ − u) = G − CV −1C >

The variance in BLUP value:

V(ũ) = CV −1C >

February 28, 2001

Example

One-way classification model: Effect of number of observations per block

ũi = BLUP(ui) = [niσu² / (σ² + niσu²)] (ȳi· − µ)

i: block no, ni: number of observations in block i, ȳi·: block mean.


As ni → ∞ the coefficient niσu²/(σ² + niσu²) → 1 and the variance of the BLUP estimates
V(ũi) → G.
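As a rough numerical illustration, plugging in the ingot estimates from the previous lecture (σ̂u² ≈ 11.45, σ̂² ≈ 10.37) with ni = 3 observations per ingot gives the shrinkage coefficient 3 · 11.45/(10.37 + 3 · 11.45) ≈ 0.77, so the BLUP pulls each ingot mean deviation about 23% of the way back towards the overall mean.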

February 28, 2001

179
9 Randomized Complete Block Design II

Fixed vs. Random


[Figure: histograms of estimated block effects, (c) Ingots and (d) Litter; horizontal axis: Block Effect.]

February 28, 2001

10

BLUP summary

• BLUP corresponds to the conditional expectation of the random effect given


observation

• Under normality assumptions and known variance BP=BLUP

• With unknown variance this no longer holds.

• Variance of BLUPs depends on the precision of information concerning the


random effects

February 28, 2001

180
11

Model check - LNM - model

• εi are independent and identically distributed, εi ∼ N (0, σ 2)

• Residual vs. predicted

• Residual vs. anything else

• Probit plots.

• εi,t vs εi,t−1

• etc.

February 28, 2001

12

Residuals – Mixed Models

Distribution of residuals Mixed Models

(y − Xβ) = (Zu + ε) ∼ N (0, V )

i.e., not iid. ( option OUTPM in PROC MIXED)

Another definition of residuals

(y − Xβ − Z E(u)) = (Z(u − ũ) + ε) ∼ N (0, VG − VGV −1VG> + R)

where VG = ZGZ >. i.e., not iid. ( option OUTP in PROC MIXED)

Standardized residuals ?

February 28, 2001

181
9 Randomized Complete Block Design II

13

Residual vs predicted

[Figure: two residual-versus-predicted plots, (e) residuals r1 against predicted values p1 and (f) residuals r2 against predicted values p2.]

February 28, 2001

182
10 Split-Plot Experiments

These slides present the theoretical background for split-plot designs. The slides augments the
presentation of split-plot designs in chapter 2 in LMSW, (Littell et al., 1996). The concept of
variance-components are presented, and the different variance of different contrast presented. In
addition concepts such the distribution of Sum of Squares, Satterthwaite’s approximation and
the distinction between random and fixed effects are presented.
Link to the full screen presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/SplitPlot.f.pdf

183
10 Split-Plot Experiments

The General Idea behind Split–Plot Experiments

“Once upon a time there were linear normal models -


systematic effects plus one error term...”
Yet, many experiments and studies have a hierarchical structure

• with respect to treatments

• with respect to error structures

Split–plot models is a very powerful (and early) way of handling such


situations.
April 6, 2001 Mixed Models Course 1

The name “split–plot” comes from the area of field experiements:

• Some treatments (say factor A) are applied to entire plots (parcels).


Those plots are called whole–plot–units and the factor A is the
whole plot factor.

• A plot is sometimes further sub–divided into sub–plots and other


treatments (say factor B) are applied to each of these sub–plots. The
sub–plots are called split–plot–units and the factor B is called the
split–plot factor.

April 6, 2001 Mixed Models Course 2

184
Other examples:

• Treatment A (e.g. feeding) applied to a whole pig pen (the whole–plot)


while treatment B (something...) is applied to pigs within a pen.

• Treatment A is applied to an entire litter of piglets, treatment B is


applied to each piglet in the litter.

• Treatment A is a management strategy applied to a whole farm, while


treatment B is a treatment of each pig pen on the farm.

April 6, 2001 Mixed Models Course 3

The basic property of split–plot experiments is that subjects within a


whole–plot are more similar than subjects in different
whole–plots.

More generally, Subjects/individuals/plots close (in some sense) to each


other are expected to be more similar than if they were further apart.

Split–plot models are sometimes appropriate for analyzing repeated


measurements.

April 6, 2001 Mixed Models Course 4

185
10 Split-Plot Experiments

Example 1. (Example 2.2 from LMSW).

• The effect of 3 bacterial inoculation treatments (INOC, indexed with j)


applied to 2 grass cultivars (CULT, indexed with i).

• There are 4 blocks (BLOCK, indexed with k.) and and CULT is randomly
assigned to each half of the block.

• Half a block is the whole–plot unit. Each whole–plot unit is subdivided


into 3 split plot units and each INOC is applied there.

The statistical model is

yijk = µ + αi + βj + γij + rk + wik + εijk

where rk ∼ N (0, σr²), wik ∼ N (0, σw²) and εijk ∼ N (0, σ²). fin

April 6, 2001 Mixed Models Course 5
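A PROC MIXED specification matching this model could look as sketched below; the data set name and the variable names (block, cult, inoc, drywt) are assumptions and may differ from the actual names used in LMSW:

proc mixed data=cultspd;             /* hypothetical data set name        */
   class block cult inoc;
   model drywt = cult inoc cult*inoc /* fixed effects: alpha, beta, gamma */
                 / ddfm=satterth;
   random block block*cult;          /* r_k and w_ik as random effects    */
run;

The RANDOM statement carries the block and whole–plot error terms, so the whole–plot factor CULT is automatically tested against the whole–plot variation.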

Variance and Correlation

The total variance is

Var(yijk) = Var(rk) + Var(wik) + Var(εijk)
          = σr² + σw² + σ² = σtot²

which justifies the name variance component model:

• The total variance is a sum of individual variance contributions.

• Moreover, each variance contribution can be assigned to a specific


feature of the experiment.

April 6, 2001 Mixed Models Course 6

186
The variance components have implications for the correlation structure
among the variables:

1. Observations within the same block (k) but with different levels
   of factor A (i) are correlated through the block component:
   Corr(yijk, yi′j′k) = Corr(yijk, yi′jk) = Cov(yijk, yi′jk)/Var(yijk) = σr²/σtot²

2. Observations within the same block (k) and with the same level
   of factor A (i) but different levels of factor B (j) are correlated
   through the block component and the whole–plot component:
   Corr(yijk, yij′k) = Cov(yijk, yij′k)/σtot² = (σr² + σw²)/σtot²

April 6, 2001 Mixed Models Course 7

Hence in the split plot model it is assumed that the correlation, when
present, is positive.

The split–plot structure has important implications with respect to the


statistical inference:

1. The effect of the interaction between A and B and treatment B itself


should be compared with the “residual variation” i.e. the variation
between the split–plot units.

2. The effect of treatment A should be compared with the “whole–plot


variation”, i.e. the variation between the whole–plot units.

We shall illustrate these points for a balanced split–plot experiment.

April 6, 2001 Mixed Models Course 8

187
10 Split-Plot Experiments

Comparing Differences

Consider again the model

yijk = µ + αi + βj + γij + rk + wik + εijk

where rk ∼ N (0, σr²), wik ∼ N (0, σw²) and εijk ∼ N (0, σ²), and
i = 1 . . . a, j = 1 . . . b and k = 1 . . . c.

A simple calculation of differences of means illustrates the special issues


arising in a split–plot experiment.

April 6, 2001 Mixed Models Course 9

Different levels of factor A can be compared by

ȳ1.. − ȳ2.. = α1 − α2 + γ̄1. − γ̄2. + (w̄1. − w̄2.) + (ε̄1.. − ε̄2..)

Var(ȳ1.. − ȳ2..) = Var(w̄1. − w̄2.) + Var(ε̄1.. − ε̄2..)
                 = 2σw²/c + 2σ²/(bc) = (2/c)(σw² + σ²/b)

Different levels of factor B can be compared by

ȳ.1. − ȳ.2. = β1 − β2 + γ̄.1 − γ̄.2 + (ε̄.1. − ε̄.2.)

Var(ȳ.1. − ȳ.2.) = Var(ε̄.1. − ε̄.2.) = (2/c)(σ²/a)

April 6, 2001 Mixed Models Course 10

188
Hence Var(ȳ1.. − ȳ2..) is bigger than Var(ȳ.1. − ȳ.2.).

In other words, the effect of the whole–plot–factor is determined less


accurately than the effect of the split–plot factor.
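To fix ideas with purely illustrative numbers (a = 2, b = 3, c = 4, σw² = σ² = 1; these are not values from LMSW): the whole–plot comparison has Var(ȳ1.. − ȳ2..) = (2/4)(1 + 1/3) ≈ 0.67, while the split–plot comparison has Var(ȳ.1. − ȳ.2.) = (2/4)(1/2) = 0.25, so the split–plot factor is estimated considerably more precisely.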

April 6, 2001 Mixed Models Course 11

Inference Issues for Mixed Models

For balanced experiments, inference is based on F–tests.

For unbalanced cases, inference is a delicate issue. Loosely speaking


“What are the denominator degrees of freedom”.

In PROC MIXED one can make “approximate F–tests” (but SAS never
informs you that the tests are only approximate).

Several suggestions have been made regarding this. One such is


Satterthwaites Approximation.

April 6, 2001 Mixed Models Course 12

189
10 Split-Plot Experiments

Analysis of the Split–Plot Experiment

Consider again the model

yijk = µ + αi + βj + γij + rk + wik + εijk

where rk ∼ N (0, σr²), wik ∼ N (0, σw²) and εijk ∼ N (0, σ²), and
i = 1 . . . a, j = 1 . . . b and k = 1 . . . c.

For simplicity suppose that factor B does not represent a treatment but
only replications within each whole–plot. Then the model reduces to

yijk = µ + αi + rk + wik + εijk


April 6, 2001 Mixed Models Course 13

The replicates due to factor B are eliminated by calculating the average


within each block and treatment:

ȳi.k = µ + αi + rk + (wik + ε̄i.k)   where   Var(wik + ε̄i.k) = σw² + σ²/b

• Hence the between whole–plot variation (σw²) remains unchanged while
  the within whole–plot variation σ² is reduced by a factor b.

• Therefore by taking more replicates within a whole–plot unit, parts of


the variation is reduced , while other parts of the variation
remains the same.

April 6, 2001 Mixed Models Course 14

190
Modelling the Mean

Let zik = ȳi.k denote the mean and define uik = wik + ε̄i.k.

Then the model for the means can be written

zik = µ + αi + rk + uik

where uik ∼ N (0, σu²) with Var(uik) = σw² + (1/b)σ² = σu², and
rk ∼ N (0, σr²).

This is an ordinary ANOVA–model with one treatment, one (random)


block effect and no interaction. Analyzing such a model is straight
forward.
April 6, 2001 Mixed Models Course 15

Three Technical Results

In connection with ANOVA calculations, one frequently uses the following results:

ANOVA1: Let X, Y be independent with E(X) = E(Y ) = 0 and let a be a number.
Then

E(a + X + Y )² = Var(a + X + Y ) + [E(a + X + Y )]²
               = Var(X) + Var(Y ) + a² = E(X²) + E(Y²) + a²

ANOVA2: Let Y1, . . . , Yn be independent with Yi ∼ N (µ, σ²), and let
SSD = Σi=1..n (Yi − Ȳ.)². Then

E(SSD) = (n − 1)σ² = (n − 1) Var(Yi)
SSD ∼ σ²χ²(n − 1)

April 6, 2001 Mixed Models Course 16

191
10 Split-Plot Experiments

ANOVA3: Let Y1, . . . , Yn be independent with Yi = µi + εi, where εi ∼ N (0, σ²), and
let

SSD = Σi=1..n (Yi − Ȳ.)²   and   Q(µ) = Σi=1..n (µi − µ̄.)²

Then

E(SSD) = Q(µ) + E(Σi=1..n (εi − ε̄.)²) = Q(µ) + (n − 1)σ²

April 6, 2001 Mixed Models Course 17

With
zik = µ + αi + rk + uik

summation gives

z̄i. = µ + αi + r̄. + ūi.


z̄.. = µ + ᾱ. + r̄. + ū..

The difference
z̄i. − z̄.. = (αi − ᾱ.) + (ūi. − ū..)

is a measurement of the treatment effect, and does not depend on the


block.
April 6, 2001 Mixed Models Course 18

192
Letting SSDA = Σi (z̄i. − z̄..)² we find that

E(SSDA) = Σi (αi − ᾱ.)² + E(Σi (ūi. − ū..)²)
        = Q(α) + (a − 1) σu²/c

and hence

E(c Σi (z̄i. − z̄..)²) = c Q(α) + (a − 1)σu².

• If there is no effect of treatment A then Q(α) = 0 and SSDA has a


χ2–distribution.

• To be able to make the F –test we need to find a quantity which has


σu2 as expected value no matter whether αi = 0 or not.
April 6, 2001 Mixed Models Course 19

1. Let SSDAC = Σik (zik − z̄i. − z̄.k + z̄..)². It is easy to see that

zik − z̄i. − z̄.k + z̄.. = uik − ūi. − ū.k + ū..

2. It is not difficult to verify (and it can be found in any standard text


book on statistics) that

E(SSDAC ) = σu2 (a − 1)(c − 1).

3. Finally it is equally easy to verify that SSDA and SSDAC are


independent.
April 6, 2001 Mixed Models Course 20

193
10 Split-Plot Experiments

4. Therefore the F –statistic for testing αi = 0 becomes

F = [c · SSDA/(a − 1)] / [SSDAC/((a − 1)(c − 1))]
  = [c Σi (z̄i. − z̄..)²/(a − 1)] / [Σik (zik − z̄i. − z̄.k + z̄..)²/((a − 1)(c − 1))]
  ∼ F(a−1, (a−1)(c−1))

Large values of F are critical to the hypothesis.

April 6, 2001 Mixed Models Course 21

• The important point is that the treatment effect of factor A is “tested
  against” the variance σu² = σw² + σ²/b, which largely consists of the
  whole–plot variation (σw²) plus a “minor” contribution from the split–plot
  variation (σ²/b).

• In the balanced case, the test for αi = 0 can be made by simply


analyzing the “means”. That is the reason why PROC GLM in
special (balanced) cases can make the correct tests in certain variance
component models.

April 6, 2001 Mixed Models Course 22

194
Back to the Original Setup

Return to the original model with a treatment effect of factor B, i.e.

yijk = µ + αi + βj + γij + rk + wik + ijk

1. The interaction effect γij is tested exactly as if wik and rk had been
fixed effects. I.e. the test is made “against” the residual variation σ 2.

2. In the absence of γij , the main effect βj is also tested as if wik and rk
had been fixed effects.

3. The main effect of factor A is tested as described previously. Just note


that the effect of B cancels out in all calculations.

April 6, 2001 Mixed Models Course 23

Unbalanced cases

All the nice calculations previously presented break down when the
design is no longer balanced.

Consider again
yijk = µ + αi + rk + wik + εijk
and suppose this time that i = 1 . . . a, k = 1 . . . c and j = 1 . . . bik .

Hence there might not be the same number of replicates (j) within each
whole–plot unit.

April 6, 2001 Mixed Models Course 24

195
10 Split-Plot Experiments

As before, the replicates due to factor B are eliminated by calculating


the average within each block and treatment:

zik = ȳi.k = µ + αi + rk + (wik + ε̄i.k)

But now with uik = wik + ε̄i.k

Var(uik) = σw² + σ²/bik = σu,ik²

That is, the zik s have different variances.

April 6, 2001 Mixed Models Course 25

1. One unpleasant consequence of this is that

z̄i. = µ + αi + r̄. + ūi.

has variance depending on i, since Var(ūi.) = (1/c²) Σk σu,ik² = (1/c²) Σk (σw² + σ²/bik).

2. Another, equally unpleasant, consequence is that SSDAC from before


does not have a χ2 distribution.

3. Consequently, the F –statistic from before does not have an F


distribution.

April 6, 2001 Mixed Models Course 26

196
Some consequences of this:

• Hence we can still calculate the F –statistic, but it has an unknown


distribution in the unbalanced case.

• Hence we have a problem in judging whether an observed F –statistic


is “large”.

• It seems plausible that when the experiment is “nearly balanced”, then


F must “nearly” be F–distributed. But what is “nearly balanced”, and
what to do when the experiment is very unbalanced?

April 6, 2001 Mixed Models Course 27

A related problem:

A related problem arises even in the balanced case. Suppose interest is in


comparing
µ11 − µ21 = α1 − α2 + γ11 − γ21.

The optimal estimate of this contrast is in the balanced case the difference

ȳ11. − ȳ21.

and the variance of that difference is


Var(ȳ11. − ȳ21.) = (2/3)(σw² + σ²)

April 6, 2001 Mixed Models Course 28

197
10 Split-Plot Experiments

• The problem is that to estimate σw² + σ², two sums–of–squares are
  needed.

• To put it in general terms, suppose SSD1 ∼ σ1²χ²(f1) and SSD2 ∼
  σ2²χ²(f2) are needed. The problem arising is that the weighted sum

  SSD = a1 SSD1 + a2 SSD2

  does not have a χ²–distribution unless σ1 = σ2 and a1 = a2.


• Satterthwaites idea was the following: Let us assume that SSD
approximately has a χ2–distribution.
• The problem is then how many degrees of freedom – but this number
can be “estimated” in the following way.

April 6, 2001 Mixed Models Course 29

Satterthwaites approximation

Consider the two–sample problem

Yij ∼ N (µi, σi2), i = 1, 2, j = 1, . . . , ni

Then

Ȳi ∼ N (µi, σi²/ni),     Ȳ1 − Ȳ2 ∼ N (µ1 − µ2, σ1²/n1 + σ2²/n2)

Si² = (1/fi) Σj=1..ni (Yij − Ȳi.)² ∼ (σi²/fi) χ²(fi),     fi = ni − 1

April 6, 2001 Mixed Models Course 30

198
Let σD² = σ1²/n1 + σ2²/n2. A natural and unbiased estimate for σD² is

SD² = S1²/n1 + S2²/n2          (1)

Question: What is the distribution of SD²?

Satterthwaite (worked at General Electric, USA) (approx. 1945): We
don’t know, but let’s approximate the distribution of SD² with a suitable
χ²–distribution:

SD² ∼approx (φ²/η) χ²(η)          (2)

April 6, 2001 Mixed Models Course 31

• With SD² = S1²/n1 + S2²/n2 we have

  E(SD²) = σ1²/n1 + σ2²/n2 = σD²
  Var(SD²) = 2(σ1⁴/(n1²f1) + σ2⁴/(n2²f2))

• Under the approximation SD² ∼approx (φ²/η) χ²(η) we have

  E(SD²) = φ²
  Var(SD²) = 2φ⁴/η
April 6, 2001 Mixed Models Course 32

199
10 Split-Plot Experiments

• Satterthwaite's idea: Match the first two moments:

  φ² = σD²

  η = (σD²)² / (σ1⁴/(n1²f1) + σ2⁴/(n2²f2))

• In real life σi², and hence σD², are unknown. Instead we plug in the
  estimates si² and sD² in the calculation of η:

  η = (sD²)² / (s1⁴/(n1²f1) + s2⁴/(n2²f2))

April 6, 2001 Mixed Models Course 33

Example 2. Let σ1² = 2, σ2² = 10, n1 = n2 = 6, f1 = f2 = 5. Then

σD² = 2/6 + 10/6 = 2

η = 2² / (2²/(6²·5) + 10²/(6²·5)) = 6.9 ≈ 7

Hence

SD² = S1²/n1 + S2²/n2 ∼approx (σD²/7) χ²(7)
fin
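The plug-in computation of η is easy to script. A minimal SAS data step (the data set name satt is made up, and the true variances of Example 2 are used here as if they were sample estimates) reproduces η ≈ 6.9:

data satt;
   s1sq = 2;  s2sq = 10;           /* (estimated) group variances       */
   n1 = 6;    n2 = 6;              /* group sizes                       */
   f1 = n1 - 1;  f2 = n2 - 1;      /* degrees of freedom                */
   sd2 = s1sq/n1 + s2sq/n2;        /* estimate of Var(Ybar1 - Ybar2)    */
   eta = sd2**2 / ( (s1sq/n1)**2/f1 + (s2sq/n2)**2/f2 );  /* approx 6.9 */
run;

proc print data=satt; run;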

April 6, 2001 Mixed Models Course 34

200
Example 3. Let σ1² = 100, σ2² = 90, n1 = 100, n2 = 10, f1 = 99, f2 = 9. Then

σD² = 100/100 + 90/10 = 10

η = (1 + 9)² / (1²/99 + 9²/9) = 11.1

If the variances are assumed equal, then

σD² = [(99·100 + 9·90)/108] (1/100 + 1/10) = 10.9

which has a scaled χ²(108)–distribution.

Quite a difference! fin

April 6, 2001 Mixed Models Course 35

How Good is Satterthwaites Approximation

The 1000 EURO question is now : How good is Satterthwaites


approximation ???

The usual answer : Simulate and calculate coverage percentages !!!

April 6, 2001 Mixed Models Course 36

201
10 Split-Plot Experiments

Two–sample Problem

Model:
Yij = µi + ij , i = 1, 2, j = 1, . . . , ni
where ij ∼ N (0, σi2).

1. Simulate data where µ1 = µ2.

2. Test hypothesis µ1 = µ2 at different significane levels.

- Using Satterthwaites approximation

- Using the Containment method, (default in PROC MIXED).

3. Calculate coverage percentages.


April 6, 2001 Mixed Models Course 37

n1 σ1 n2 σ2 Method DDF F pr0.01 χ2 pr0.01 F pr0.05 χ2 pr0.05 F pr0.10 χ2 pr0.10


3 1 3 20 contain 4 0.047 0.127 0.114 0.204 0.182 0.260
3 1 3 20 satterth 2.16 0.020 0.124 0.056 0.202 0.106 0.258
8 1 3 20 contain 9 0.071 0.110 0.133 0.169 0.187 0.227
8 1 3 20 satterth 2.01 0.013 0.110 0.052 0.169 0.088 0.227
3 1 8 20 contain 9 0.009 0.030 0.053 0.084 0.101 0.134
3 1 8 20 satterth 7.16 0.006 0.030 0.046 0.084 0.093 0.134
8 1 8 20 contain 14 0.010 0.024 0.064 0.084 0.112 0.145
8 1 8 20 satterth 7.04 0.007 0.024 0.038 0.084 0.096 0.145
16 1 16 20 contain 30 0.013 0.025 0.068 0.078 0.119 0.128
16 1 16 20 satterth 15.1 0.010 0.025 0.060 0.078 0.110 0.128
3 1 3 5 contain 4 0.026 0.105 0.090 0.178 0.157 0.235
3 1 3 5 satterth 2.61 0.013 0.105 0.056 0.178 0.107 0.234
8 1 3 5 contain 9 0.078 0.132 0.168 0.210 0.226 0.271
8 1 3 5 satterth 2.62 0.026 0.132 0.070 0.210 0.130 0.271
3 1 8 5 contain 9 0.020 0.046 0.062 0.089 0.117 0.144
3 1 8 5 satterth 7.94 0.016 0.046 0.059 0.089 0.112 0.144
8 1 8 5 contain 14 0.026 0.035 0.056 0.080 0.107 0.131
8 1 8 5 satterth 7.73 0.014 0.035 0.048 0.080 0.090 0.131

Table 1: Two–sample problem - 1000 simulations

April 6, 2001 Mixed Models Course 38

202
Split–Plot Experiment

We consider the model

Yijk = µ + αi + βj + wik + εijk,   i = 1, 2,  k = 1, . . . , ni,  j = 1, . . . , nik

where wik ∼ N (0, σw²) and εijk ∼ N (0, σ²).

• Make simulations for different values of σw².

• In the simulations α1 = α2.

• Test of the hypothesis α1 = α2.

April 6, 2001 Mixed Models Course 39

The design is as follows:

n1 = 3 and n2 = 8

i = 1 : j = 1 . . . n1k = 5

i = 2 : k = 1 . . . 3 : j = 1 . . . n1k = 3

i = 2 : k = 4 . . . 8 : j = 1 . . . n1k = 9

So all problems arise due to unbalancedness (rather than variance


heterogeneity as before).

April 6, 2001 Mixed Models Course 40

203
10 Split-Plot Experiments

σ σw Method DDF F pr0.01 χ2 pr0.01 F pr0.05 χ2 pr0.05 F pr0.10 χ2 pr0.10


1 1 contain 9 0.007 0.030 0.050 0.068 0.086 0.125
1 1 satterth 9.67 0.012 0.030 0.051 0.068 0.088 0.125
3 1 contain 9 0.004 0.018 0.037 0.064 0.083 0.125
3 1 satterth 21.7 0.009 0.018 0.043 0.064 0.098 0.125
6 1 contain 9 0.002 0.014 0.020 0.043 0.057 0.086
6 1 satterth 33.5 0.012 0.014 0.034 0.043 0.072 0.086
9 1 contain 9 0.002 0.020 0.034 0.063 0.083 0.116
9 1 satterth 36.5 0.011 0.020 0.054 0.063 0.097 0.116

Table 2: Split–Plot Experiment - 1000 simulations

April 6, 2001 Mixed Models Course 41

Making the “right” tests with PROC MIXED

A typical SAS program for analyzing the split plot data above is like
proc mixed data=sim noitprint;
class i j k subject;
model y = i j /ddfm=contain chisq;
random i*k;
run;

• The containment method is default in PROC MIXED (but can be


specified explicitely with ddfm=contain) in the MODEL statement.

• This tells SAS that when testing any of the fixed effects in the model,
SAS should look for a random effect which syntactically contains the
April 6, 2001 Mixed Models Course 42

204
fixed effect: Since i is contained in i*j SAS then knows that that it is
this random effect the test should be “made against”.

• It is well known that this is the right thing to do when the experiment
is balanced.

April 6, 2001 Mixed Models Course 43

A Severe Warning!!

A very commonly made mistake in this connection is the following:


Each combination (i, k) often identifies an experimental entity, e.g. an
animal or a (whole) plot in a field. Typically one would have a variable in
the data set identifying such an entity. For illustration we have made a
variable, called subject defined as (i, k). A typical SAS program would
then be:
proc mixed data=sim noitprint;
class i j k subject;
model y = i j /ddfm=contain chisq;
random subject;
run;

Such a program is made under the mistaken impression that since


April 6, 2001 Mixed Models Course 44

205
10 Split-Plot Experiments

subject and (i, k) really identifies the same units in the experiment
then it should be immaterial what one writes.

This is not true, and the reason is the following:

Since i is not syntactically contained in subject the tests (for effect of


the factor i) would be made against the residual variance, which we
know is wrong.

April 6, 2001 Mixed Models Course 45

To emphasize this point, suppose that we declare a new variable icopy


which is just a copy of i. Then writing
proc mixed data=sim noitprint;
class i j k subject icopy;
model y = i j /ddfm=contain chisq;
random icopy*k;
run;

will also make SAS perform the test of effect of the factor i against the
residual variance which, as pointed out above, is wrong.

If, however, we write ddfm=satterth in any of the examples above,


then SAS will actually identify the right variance component to make the
test for effect of factor i against.

April 6, 2001 Mixed Models Course 46

206
Some Tentative Conclusions on Satterthwaite

• For small samples, Satterthwaites method performs much better than


the default Containment method.

• For larger samples, there is not much difference between the two
methods. In practice, this is because the difference between the
quantiles in a F (1, 7) and F (1, 14) distribution is not large whereas the
differences between quantiles in a F (1, 2) and a F (1, 4) distribution can be
substantial.

• Both methods generally perform better than the large sample χ2 tests.

• A drawback of Satterthwaites method is that it is computationally


April 6, 2001 Mixed Models Course 47

somewhat intensive.

• Results suggest the use of Satterthwaites approximation.

April 6, 2001 Mixed Models Course 48

207
10 Split-Plot Experiments

Random or Fixed Effects?

Sometimes it is straight forward to decide on whether a specific effect


should be considered as random or fixed.

In other cases, it is a more delicate issue.

The text below is taken from lecture notes by L. R. Schaeffer, University


of Guelph, Ontario, Canada:
Fixed factors are factors in which the classes comprise all of the possible classes of
interest that could be observed. For example, the sex of an animal is either male,
female, sterilized male, or sterilized female. If the number of classes in a factor is small
and confined to this number even if conceptual resampling were performed an infinite
number of times, then the factor is likely fixed. Other examples are age classes,

April 6, 2001 Mixed Models Course 49

lactation number, management system, cage number, and breed class. Usually if the
sampling were to be repeated a second time, those factors which maintain the same
classes between the two samplings would be fixed factors. For example, a growth trial
on pigs using two diets would probably need to use the same housing facilities, the
same age groups of pigs, and the same diets, but the individual pigs would necessarily
have to be new animals because an animal could not go through the same growth
phase a second time in its life. Pig effects would be considered a random factor while
the other effects would be fixed.

Random factors are factors whose levels are considered to be drawn randomly from an
infinitely large population of levels. As in the previous pig experiment, pigs were
considered random because the pig population of the world is large enough to be
considered infinitely large, and the group that were involved in that experiment were a
random sample from that population. In actual fact, however, the pigs on that
experiment were likely sampled from those relatively few pigs that were available at the
time the trial started, but still they are considered to be a random factor because if the
experiment were to be repeated again, there would likely be a completely different
group of pigs involved.

April 6, 2001 Mixed Models Course 50

208
Another way to determine if a factor is fixed or random is to know how the results will
be used. In a nutrition trial the results infer something about the diets in the trial. The
diets are specific and no inferences should be made about other diets not tested in the
experiment. Hence diet effects would be a fixed factor. In contrast, if animal effects
were in the model, inferences about how any animal might respond to a specific diet
may need to be made. There should not be anything peculiar about the animal on the
trial that would nullify that inference. Animal effects would be a random factor.

In general, a few questions need to be answered to make the correct choice of fixed or
random factor designation. Some of the questions are:

1. How many levels of the factor are in the model? If small, then perhaps this is a
fixed factor. If large, then perhaps this is a random factor.

2. Is the number of levels in the population large enough to be considered infinite? If


yes, then perhaps this factor is random.

3. Would the same levels be used again if the experiment were to be repeated a second

April 6, 2001 Mixed Models Course 51

time? If yes, then perhaps this factor is fixed.

4. Are inferences to be made about levels not included in the experiment? If yes, then
perhaps this factor should be random.

5. Were the levels of a factor determined in a nonrandom manner? If yes, then perhaps
this factor should be treated as fixed.

By studying the scientific literature, a researcher should be able to get some help in this
decision process. If in doubt, then the assistance of an experienced statistician should
be sought.

April 6, 2001 Mixed Models Course 52

209
10 Split-Plot Experiments

Multilocation Trials

Consider the following setup:

• Four treatments, e.g. of housing systems for pigs are to be compared.

• Studies are carried out on 9 farms (locations)

• Within each farm a randomized block design with 3 blocks is employed,


i.e. each treatment is repeated 3 times within each farm, once in each
block.

How to analyze such data?


April 6, 2001 Mixed Models Course 53

Note that since there are replicates within each farm, the
farm–treatment interaction can be estimated.

The following model seems appealing:

yijk = µ + τi + Lj + (RL)jk + (τ L)ij + ijk

where i = 1 . . . 4 is treatment, j = 1 . . . 9 is location and k = 1 . . . 3 is


block.

April 6, 2001 Mixed Models Course 54

210
It is reasonable to assume that (RL)jk and ijk are random. But other
effects need more consideration:

• One can consider Lj and (hence) (τ L)ij as being random.

• Alternatively one can consider Lj and (τ L)ij to be fixed effects.

The effects in question can be considered random if the farms (locations)


are random representatives from the population of farms with specific
characteristics.

But if the farms are selected as e.g. “those 9 farms whose owners
responded to a questionnaire sent out to all farms with given
characteristics”, then the farms are not random representatives from the
April 6, 2001 Mixed Models Course 55

population. In that case, the effects in question should be regarded as


fixed, and one can not extrapolate the conclusions from the study
outside these 9 farms.

What to do if 6 farms are selected randomly, while 3 are not?

What to do if there are only 3 randomly selected farms in the study?

April 6, 2001 Mixed Models Course 56

211
10 Split-Plot Experiments

212
11 Examples of Split-Plot Designs

The purpose of this lecture was to illustrate the kind of problems that may arise, if split-plot
designs are not treated properly. Most of the experiments presented were made at the Danish
Institute of Agricultural Sciences, or rather the National Institute of Animal Science, as it was
called in those days.
Another common aspect of several of the experiments were that they have led to a heated debate.
The pro’s and con’s in those debates were presented.
Link to the full screen presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/SPLITPLOTExamples.pdf

213
11 Examples of Split-Plot Designs

. . . After reading 50 of these papers in AABS-issues [Applied


Animal Behaviour Science] of 1984 and 1985, we found that in
about 25 cases statistical methods were used incorrectly. The
main defect was that observations entered into test statistics
were not independent. In a number of cases it was totally
unclear how the authors made their computations
Hoekstra & Jansen, AABS 16 (1986) 303-308

March 6, 2001 1

Example: W. Schouten Ph.D. work

Rearing conditions and Behaviour in pigs.

How does early experience influence later behaviour ?

’Barren’ farrowing crates vs. 2 × 2 m2 vs ’enriched’ large straw pens


28 m2.

8 sows (4 sister-pairs). Within each sister-pair the pigs were


assigned to treatment at random. Each litter consisted of 8 pigs,
i.e., a total of 64 piglets.

Detailed behavioural observations

March 6, 2001 2

214
Anovas

                   Reported Model               Litter Averages
Effect             df       SS        F         df      SS        F
Sister-Pair         3      384.2                 3      48.0
Housing System      1      893.3     5.14 ∗      1     117.0    2.424
Residual           59    10253.7                 3     138.2
Total              63                            7

March 6, 2001 3

Mixed model formulation

Reported model:

Yijk = µ + Pi + Hj + εijk

Pi Effect of sister pair i ∈ {1, . . . , 4}. Hj effect of housing. εijk


random residual.
Correct model:

Yijk = µ + Pi + Hj + Sij + εijk

Sij Effect of sow.
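In PROC MIXED the correct model amounts to declaring the whole–plot unit (the sow, i.e. the sister-pair by housing combination) as a random effect. A sketch, with assumed variable names pair, housing and y:

proc mixed data=schouten;       /* hypothetical data set name            */
   class pair housing;
   model y = pair housing / ddfm=satterth;
   random pair*housing;         /* one sow per pair-housing combination, */
                                /* so this term is the random sow effect */
run;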

March 6, 2001 4

215
11 Examples of Split-Plot Designs

Breed effect on production

Are the present feeding standards for essential nutrients per FUp
sufficient for Ad lib feeding ?

Beretning 579. A. Just et al. (1985)

6 litters (YY) and 6 litters of (LL) 6 (7) pigs (boars, gilts,


castrates). Two levels of nutrient concentrations in the feed.

March 6, 2001 5

Model

Yijkl = µ + ai + bj + ck + dl(j) + (ab)ij + (ac)ik + εijkl

• ai: effect of feed nutrient concentration, i ∈ {1, 2}


(Norm vs. Norm +20%).
• bj : effect of breed, j ∈ 1, 2 (LL and YY).
• ck : effect of sex k, k ∈ {1, 2, 3}.
• dl(j): effect of litter l within breed j.
• (ab)ij : interaction between feed concentration and breed.
• (ac)ik : interaction between feed concentration and sex.
• εijkl : random residual.

March 6, 2001 6

216
Similar designs

• Breeding line vs. pecking behaviour

• Rearing Conditions vs. later productivity

• Effect of organic feed.

• Effect of GMO production.

March 6, 2001 7

Straw shortener
A number of sows were fed with either control feed or feed containing straw from
fields treated with straw shortener (CCC). To investigate long term effects the
study covered 4 parities.

Reported model:
Yijk = µ + ti + pj + (tp)ij + εijk
Yijk : Observed variable e.g., litter size. ti: effect of treatment. pj : effect of parity.
(tp)ij : Interaction between parity and treatment.εijk : random residual

Correct model:
Yijk = µ + ti + pj + (tp)ij + Sik + εijk
Sik : Effect of sow k on treatment i, Sik ∼ N (0, σS2 )

March 6, 2001 8

217
11 Examples of Split-Plot Designs

Group housing

Loose housed sows. Automatic feeding systems.

Hypothesis: Pelleted feed reduces aggression compared with mealy


feed.

Hypothesis: Pelleted feed reduces the effect of rank on received


aggression.

March 6, 2001 9

Herd Investigations

Inspired by Nørgård (1999).

Yijklm = µ + ai + sj + Hijk + vl + (vs)jl + εijklm

• Yijklm measurement at slaughter.


• ai : Effect of Abattoir i.
• sj : Effect of herd disease state j.
• Hijk : Random effect of herd, Hijk ∼ N (0, σ²_H).
• vl: Effect of season l.
• (vs)jl: Interaction between season and disease state.
• εijklm: Random residual from mth animal. εijklm ∼ N (0, σ 2)

March 6, 2001 10

218
Multi location trials

Yijk = µ + τi + Lj + R(L)jk + (τ L)ij + εijk

• τi: effect of treatment


• Lj : effect of location
• R(L)jk : random effect of block within location, R(L)jk ∼ N (0, σ²_R)
• (τ L)ij : interaction between treatment and location
• εijk : residual εijk ∼ N (0, σ 2)
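A PROC MIXED sketch of this multi-location model could look as follows (the data set name multiloc and the variable names trt, loc, block and y are assumptions):

proc mixed data=multiloc;
  class trt loc block;
  model y = trt loc trt*loc;
  random block(loc);      /* random block within location */
run;

If the locations themselves were regarded as a random sample, loc and trt*loc would instead be moved to the RANDOM statement.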

March 6, 2001 11

12 Estimation and tests in mixed models

The purpose of this lecture was to give a detailed description of theoretical issues of estimation
and tests in mixed models, i.e. properties of maximum likelihood estimators in the linear normal
model and the mixed linear normal model. Concepts such as ML and REML are introduced.
Link to the full screen presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/MLMixed.f.pdf

221
12 Estimation and tests in mixed models

Maximum Likelihood and Linear Normal Models

Example 1. Consider the linear regression model

yi = β0 + β1xi + εi

We shall show that the maximum likelihood estimate and the least squares
estimate for
β = (β0, β1)
are identical.

April 6, 2001 Mixed Models Course 1

Because of the independence, the joint density for y1, . . . , yn (and hence
the likelihood function) becomes

f (y1, . . . , yn; β) = ∏_i f (yi; β)
                    = ∏_i (1/(√(2π) σ)) exp(−(1/(2σ²)) (yi − (β0 + β1xi))²)
                    = (1/((√(2π))^n σ^n)) exp(−(1/(2σ²)) Σ_i (yi − (β0 + β1xi))²)
                    = L(β)

April 6, 2001 Mixed Models Course 2

222
The likelihood function is

L(β) = (1/((√(2π))^n σ^n)) exp(−(1/(2σ²)) Σ_i (yi − (β0 + β1xi))²)

• Let D(β0, β1) = Σ_i (yi − (β0 + β1xi))².

• If σ is known then L(β) is maximized by minimizing the sum


of squared deviations D(β0, β1) (because of the “−” sign in the
exponential).

• Therefore the maximum likelihood estimate is the same as the least


squares estimate.

f in

April 6, 2001 Mixed Models Course 3

For a general linear normal model

y = Xβ + ε  where ε ∼ N (0, σ²I)

the likelihood is

L(β, σ²) = (1/((√(2π))^n σ^n)) exp(−(1/(2σ²)) Σ_i (yi − µi)²)
         = (1/((√(2π))^n σ^n)) exp(−(1/(2σ²)) (y − Xβ)>(y − Xβ))

Hence the maximum likelihood estimate for β is found by minimizing

(y − Xβ)>(y − Xβ).

April 6, 2001 Mixed Models Course 4

223
12 Estimation and tests in mixed models

Once β̂ (and hence µ̂) is found, it is not hard to verify that L(β̂, σ²) is
maximized as a function of σ² by

σ̂² = (1/n) (y − X β̂)>(y − X β̂)

However, in practice one never uses the ML estimate for σ². Instead one uses

σ̃² = (1/(n − p)) (y − X β̂)>(y − X β̂)

where p is the number of parameters in the model.


April 6, 2001 Mixed Models Course 5

The reason for using σ̃² instead of σ̂² is that

E(σ̃²) = σ²
E(σ̂²) = ((n − p)/n) σ²

That is σ̃ 2 is an unbiased estimate for σ 2 while σ̂ 2 is biased.
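As a small worked illustration of the size of the bias (the numbers are chosen for illustration only): in a simple linear regression p = 2, so with n = 10 observations

E(σ̂²) = ((n − p)/n) σ² = (8/10) σ²

i.e. the ML estimate underestimates the residual variance by 20% on average, whereas σ̃² is on target.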

April 6, 2001 Mixed Models Course 6

224
It can be noted that

σ̃² = (1/(n − p)) (y − X β̂)>(y − X β̂)

is called the REML estimate for σ², where REML means REstricted or REsidual Maximum Likelihood.

The REML method is frequently applied in connection with mixed


models in an attempt to obtain unbiased variance estimates.

April 6, 2001 Mixed Models Course 7

Maximum Likelihood Estimation in Mixed Models

For a mixed model


y = Xβ + Zu + 
the variance of y is Cov(y) = V = Z Cov(u)Z > + Cov().

• The unknown parameters are in this case (β, V ).

• The typical case is that V itself depends only on a small number of parameters, e.g. on α = (σ²_r, σ²_w, σ²) in a split–plot experiment.

• So we write V = V (α).

April 6, 2001 Mixed Models Course 8

225
12 Estimation and tests in mixed models

In mixed models, maximum likelihood estimation becomes much more


involved.

The likelihood function is


L(β, V ) = (1/(√(2π))^n) det(V )^(−1/2) exp(−(1/2)(y − Xβ)>V^(−1)(y − Xβ))

Here det(V ) is a number, called the determinant of V .


There are two situations to consider: When V is known and when V is
unknown.

April 6, 2001 Mixed Models Course 9

Case 1 - V is known: If V is known then L is maximized by


minimizing
(y − Xβ)>V −1(y − Xβ)

This quantity is minimized by

β̂ = (X >V −1X)−1X >V −1y

which is also the weighted least squares estimate of β.
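A one-line check that this estimator is unbiased (a small supplementary calculation, not part of the original slides):

E(β̂) = (X>V^(−1)X)^(−1)X>V^(−1)E(y) = (X>V^(−1)X)^(−1)X>V^(−1)Xβ = β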

April 6, 2001 Mixed Models Course 10

226
Case 2 - V is unknown: If V is unknown (which of course is
generally the case in practice) things become more complicated.

There are different approaches available. Two of these are

• Maximum Likelihood (ML) and

• Restricted Maximum Likelihood (REML)

April 6, 2001 Mixed Models Course 11

Maximum Likelihood: The expression

β̂(V ) = (X >V −1X)−1X >V −1y

depends on V which is unknown. If the expression for β̂ is substituted


into L we get

L(β̂(V ), V ) = (1/(√(2π))^n) det(V )^(−1/2) exp(−(1/2)(y − X β̂(V ))>V^(−1)(y − X β̂(V )))

This likelihood depends now only on V .

April 6, 2001 Mixed Models Course 12

227
12 Estimation and tests in mixed models

Maximization of L has to be done iteratively.

This gives V̂ and hence

β̂(V̂ ) = (X >V̂ −1X)−1X >V̂ −1y

Typically, V only depends on a few parameters, say α, so we write


V = V (α).

In that case L(β̂(V (α)), V (α)) has to be maximized as a function of α.

April 6, 2001 Mixed Models Course 13

Restricted Maximum Likelihood:

An alternative to ML estimation is REML estimation.

This is the default method in PROC MIXED.

Consider a mixed model

y = Xβ + Zu + , where Var(y) = V

and V and β are unknown.

If β had been known, the residuals would be

ε = y − Xβ ∼ N (0, V )
April 6, 2001 Mixed Models Course 14

228
and one could use the ML method from before for estimating V .

However, β is not known. Therefore one frequently does the following:


The least squares estimate of β is

β̂ls = (X >X)−1X >y

and while not the optimal estimate for β, it is still an unbiased estimate.

One then considers the residuals

εls = y − X β̂ls ∼ N (0, A(X)V A(X)>)

where A(X) is a known matrix which is a function of X.

April 6, 2001 Mixed Models Course 15

The likelihood for the “residuals” εls then depends only on V and one can maximize that likelihood numerically.

This gives the REML estimate V̂reml for V .

When V depends on fewer parameters α the result is the REML estimate


α̂reml.

With this estimate at hand we can estimate β as

β̂reml = β̂(V̂reml) = (X>V̂reml^(−1) X)^(−1) X>V̂reml^(−1) y

April 6, 2001 Mixed Models Course 16

229
12 Estimation and tests in mixed models

Using ML or REML

In practice the ML and the REML estimates do not differ much.

The main argument for REML estimation is that, at least in the balanced
cases, V̂reml is unbiased while V̂ml is not.

Whether V̂reml is always unbiased is not known.
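In PROC MIXED the estimation method is chosen with the METHOD= option; REML is the default. A small sketch (the data set and variable names are generic placeholders):

proc mixed data=mydata method=reml;   /* the default; could be omitted */
  class block trt;
  model y = trt;
  random block;
run;

proc mixed data=mydata method=ml;     /* maximum likelihood instead of REML */
  class block trt;
  model y = trt;
  random block;
run;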

April 6, 2001 Mixed Models Course 17

Tests in Mixed Models

In dealing with tests in mixed models we shall first assume that the
covariance matrix V is known.

Typically we are interested in testing hypotheses of the form λ>β = k for


some vector λ and some number k (often k = 0.)

We know that the contrast λ>β is estimable if and only if there is a


vector a such that a>X = λ>.

The estimate of the contrast λ>β is a>X β̂, where

X β̂ = X(X >V −1X)−1X >V −1y


April 6, 2001 Mixed Models Course 18

230
Standard calculations give that

Var(X β̂) = X(X >V −1X)−1X >V −1X(X >V −1X)−1X >
= X(X >V −1X)−1X >

so
X β̂ ∼ N (Xβ, X(X >V −1X)−1X >).

Hence
a>X β̂ ∼ N (a>Xβ, a>X(X >V −1X)−1X >a)

If the hypothesis λ>β = k is true then

a>X β̂ − k ∼ N (0, a>X(X >V −1X)−1X >a)


April 6, 2001 Mixed Models Course 19

Therefore if V is known the task is to test whether E(a>X β̂ − k) = 0


when Cov(a>X β̂ − k) is known.

This can be done by constructing the statistic

X 2 = (a>X β̂ − k)>[a>X(X >V −1X)−1X >a]−1(a>X β̂ − k)

which under the hypothesis has a χ2(f1)–distribution where f1 is the


number of parameters “eliminated” in the contrast a>X β̂ = k

April 6, 2001 Mixed Models Course 20

231
12 Estimation and tests in mixed models

The problem is what to do when V is unknown?

In some cases (e.g. in a split–plot experiment) the structure of V is such that V = ω²W^(−1), where W is known and ω² is unknown.

In that case, one can construct an F–statistic

F = (a>X β̂ − k)>[a>X(X>W^(−1)X)^(−1)X>a]^(−1)(a>X β̂ − k) / (f1 ω̂²)

which under the hypothesis has an Ff1,f2 –distribution.

How to derive f2 shall not be discussed here. We just note that PROC
MIXED attempts to construct such test statistics and to derive the
appropriate number f2 of denominator degrees of freedom.
April 6, 2001 Mixed Models Course 21

In this connection it is to be pointed out that it is extremely important to


specify the random effects in the RANDOM–statement in the correct way.

April 6, 2001 Mixed Models Course 22

232
Another approach is to construct approximate F–tests by establishing a denominator D, such that

F = [(a>X β̂ − k)>[a>X(X>V^(−1)X)^(−1)X>a]^(−1)(a>X β̂ − k)/f1] / (D/f2)

has an approximate F –distribution when the hypothesis is true.

Adding the option DDFM=SATTERTH to the MODEL–statement causes PROC


MIXED to attempt to construct such tests.

April 6, 2001 Mixed Models Course 23

A final option is the following:

When n → ∞ (in a suitably regular way) then V̂ and V becomes


indistinguishable.

Therefore, one approach is to simply “pretend” that the ML estimate V̂


is the true, but unknown variance V .

One can force PROC MIXED to make such tests by adding the CHISQ option to the MODEL statement.
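For example (a sketch; the effects a and b are placeholders), the two options mentioned above are requested on the MODEL statement as follows, one or the other depending on the approach chosen:

model y = a b a*b / ddfm=satterth;   /* Satterthwaite denominator degrees of freedom */
model y = a b a*b / chisq;           /* additional chi-square (Wald) tests            */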

April 6, 2001 Mixed Models Course 24

13 Complications concerning Variance
Components

This lecture illustrated some of the problems that may arise because of numerical problems in the iterative search for the maximum likelihood, and the reason why some of the variance components are set equal to 0.
Based on an example from one of the exercises, the profile of the likelihood function is illustrated.
A special problem is that Satterthwaite's approximation fails in the cases where a variance component is set to 0 and the G matrix is not positive semidefinite. Rules of thumb are suggested for that case.
Finally, the relevance of a test of a positive variance component is discussed, e.g. compared to a test of a block effect when block is treated as a fixed effect.
Link to the fullscreen presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/Complicate.pdf

235
13 Complications concerning Variance Components

Sugar beet example

Pct Sukk
Num Den
Effect DF DF F Value Pr > F
OPTAGN 1 2 15.21 0.0599
SAATID 4 16 189.37 <.0001
OPTAGN*SAATID 4 16 5.37 0.0061

Kg
OPTAGN 1 18 336.85 <.0001
SAATID 4 18 408.52 <.0001
OPTAGN*SAATID 4 18 12.70 <.0001

March 13, 2001 1

Inspection of Log

Pct Sukk

NOTE: Convergence criteria met.


NOTE: There were 30 observations read from the data set WORK.ROER.

Kg

NOTE: Convergence criteria met.


NOTE: Estimated G matrix is not positive definite.
NOTE: There were 30 observations read from the data set WORK.ROER.

March 13, 2001 2

236
Sugar beet example

Table 1: Covariance Parameter Estimates


Pct Sukk
Cov Parm Estimate Alpha Lower Upper
BLOK 0.001000 0.05 0.000164 37.9371
BLOK(OPTAGN) 0.001000 0.05 0.000219 0.2840
Residual 0.001333 0.05 0.000740 0.003088

Kg
BLOK 0.05344 0.05 0.01660 3.13E192
BLOK(OPTAGN) 0 . . .
Residual 5.1215 0.05 2.9241 11.2004

March 13, 2001 3

Outline

• Estimation of variance components
  – Why is σ̂²_X = 0 ?
  – Consequences
  – Rules of Thumb

• Are random effects significant ?
  – Are we really interested ?
  – Likelihood ratio tests

March 13, 2001 4

237
13 Complications concerning Variance Components

Reason

The likelihood function is maximized subject to the constraint that the variance component parameters σ²_X ≥ 0.

The precision of numerical optimisation methods depends on the internal representation of numbers in the computer. Proc Mixed solves this by setting σ̂²_X = 0 if it is close to 0.

Other statistical packages (R, S-Plus) handle the constraint by maximising the likelihood as a function of log(σ²_X).

Sometimes (e.g., repeated measurements) the assumption that σ²_X ≥ 0 cannot be justified.

March 13, 2001 5

Likelihood contour plot, Pct Sukk

[Contour plot of the likelihood as a function of log10(σ²_B) and log10(σ²_B(O))]
March 13, 2001 6

238
Likelihood contour plot, Kg

[Contour plot of the likelihood as a function of log10(σ²_B) and log10(σ²_B(O))]
March 13, 2001 7

G Not positive Definite

            [ σ²_B    0      0      0         0      0       ]
            [  0      ...    0      0         0      0       ]
V(u) = G =  [  0      0      σ²_B   0         0      0       ]
            [  0      0      0      σ²_B(O)   0      0       ]
            [  0      0      0      0         ...    0       ]
            [  0      0      0      0         0      σ²_B(O) ]

March 13, 2001 8

239
13 Complications concerning Variance Components

G Not positive Definite

              [ σ̂²_B    0      0      0         0      0       ]
              [  0      ...    0      0         0      0       ]
V̂(u) = Ĝ =    [  0      0      σ̂²_B   0         0      0       ]
              [  0      0      0      σ̂²_B(O)   0      0       ]
              [  0      0      0      0         ...    0       ]
              [  0      0      0      0         0      σ̂²_B(O) ]

March 13, 2001 9

G Not positive Definite

      [ σ̂²_B   0      0      0    0     0 ]
      [  0     ...    0      0    0     0 ]
Ĝ =   [  0     0      σ̂²_B   0    0     0 ]
      [  0     0      0      0    0     0 ]
      [  0     0      0      0    ...   0 ]
      [  0     0      0      0    0     0 ]

Ĝ^(−1) = ???

March 13, 2001 10

240
Warning: Satterthwaite Goes Wrong

Satterthwaite's approximation uses the estimated variance components for the calculation of test degrees of freedom. The calculations include differentiation with respect to σ̂²_X. At boundary values such as 0 this derivative is not defined.

In the PARMS statement a lower bound on the estimated variance components may be specified, e.g.,

PARMS /LBOUND=0.001,0.001,0.001;

This produces the same problems as σ̂²_X = 0.

March 13, 2001 11

Conclusions

• If estimated covariance parameters are > 0, use Satterthwaite's approximation.

• If not
– If model reductions are ”natural”, reestimate parameters using
revised models.
– Nested design should be reformulated to maintain design
– Use containment method but be careful to specify model
syntactically correct. (Compare with random statement in GLM)

March 13, 2001 12

241
13 Complications concerning Variance Components

Testing Effects of Random components

• Why are we interested in testing σ²_B > 0 ?

• Model Reduction

• σ̂²_B = 0 is not a test and may not be used for this purpose.

• Fixed effects vs. Random Effects

• Biological significance, i.e., if we sample x individuals at random, what is the average difference between the lowest and the highest, and what is a confidence interval for the difference? What are the correlation, heritability, repeatability, sensitivity and specificity?

March 13, 2001 13

Model Reduction

Consider a model A and a model B that represents a special case of A, e.g., one where one of the variance components σ²_X = 0. B is said to be nested within A. In this case a likelihood ratio test may be performed.

Then 2(LogLike_A − LogLike_B) is asymptotically χ² distributed with (p_A − p_B) degrees of freedom, where p_A is the number of parameters in model A.

NB! This is not feasible if σ̂²_X = 0.
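A sketch of how such a likelihood ratio test can be carried out in practice (the data set and variable names are placeholders): fit both models with METHOD=ML, read off the −2 log likelihood reported under "Fit Statistics" for each fit, and compare the difference to a χ² distribution.

proc mixed data=mydata method=ml;       /* model A: both variance components */
  class block trt;
  model y = trt;
  random block block*trt;
run;

proc mixed data=mydata method=ml;       /* model B: block*trt removed */
  class block trt;
  model y = trt;
  random block;
run;

/* LR statistic = (-2 log L of model B) - (-2 log L of model A),
   compared to a chi-square distribution with 1 degree of freedom */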

March 13, 2001 14

242
General recommendations

• Using ML any nested models may be compared


• Using REML only nested models with identical fixed effects may be
compared.
• With respect to test for variance components this test is
conservative, i.e., true p-value is smaller than the calculated. Thus
the test results in too few significant findings.
• With respect to test for fixed effects this test is anti conservative,
i.e., true p-value is larger than the calculated. Thus the test results
in too many significant findings. (Therefore likelihood ratio tests
should not be used for fixed effects).

March 13, 2001 15

Fixed Effects

If the variance component is 0, this implies that ui = uj for every i


and j.

i.e., Reformulate model and treat the factor of interest as Fixed.

However:

ui ≈ uj does not imply that σu2 = 0

March 13, 2001 16

243
13 Complications concerning Variance Components

Biologically significant

• Very often the real interest can be formulated as an interval of the


variance component parameter, e.g., is it larger than some preset
’irrelevance’ level ?
• The confidence interval produced with the CL option in the
Proc Mixed statement are often sufficient for this. However,
the general comment about sufficient sample size is VERY relevant
here.
• Many ’biologically’ relevant parameters are combinations of several variance component parameters, e.g., the correlation (repeatability) σ²_A/(σ²_A + σ²_ε). Therefore the joint distribution of the parameter estimates needs to be considered. This is not trivial (Interest ???).
March 13, 2001 17

Covariance Matrix: Sugar beet PCT Sukk

Asymptotic Covariance Matrix of Estimates

Row Cov Parm CovP1 CovP2 CovP3

1 BLOK 3.069E-6 -8.02E-7


2 BLOK(OPTAGN) -8.02E-7 1.613E-6 -4.44E-8
3 Residual -4.44E-8 2.222E-7

March 13, 2001 18

244
14 Repeated Measurements

This lecture gives an introduction to repeated measurements, and is a supplement to Chapter 3


in LMSW (Littell et al., 1996). It illustrates how it is possible to modify the tacit assumptions
of the split-plot design into a more flexible modelling of the variance matrix.
Different variance structures are illustrated graphically, and the use of SAS to compare different structures is presented. The AR(1) and CS structures are discussed in detail. Finally, methods for comparing different structures are shown.
Links to full-screen presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/Repeated.f.pdf

245
14 Repeated Measurements

Analyzing Repeated Measurements

Consider the setup:

• A treatment factor A with a levels is applied to individuals, e.g.


pigs.

• Within each treatment there are c individuals

• On each individual repeated measurements of the same response is


made at b different time points.

October 11, 2001 Mixed Models Course 1

Example: Exercise Therapy (LMSW p. 88)

• Subjects (SUBJ) were assigned to one of three different training


programs (PROGRAM) on weightlifting.

• The strength (STRENGTH) of the subjects was measured every second day (TIME) for a two-week period from the start of the study.

Some questions:

• Is there a treatment effect?

• Is there an interaction between treatment and time?

October 11, 2001 Mixed Models Course 2

246
Mean profiles

[Plot of the group mean strength against time (1–7) for the three programs C, R and W]

The task: Comparison of the mean profiles

Clear evidence of treatment effect and treat–by–time interaction.

October 11, 2001 Mixed Models Course 3

Individual profiles:

[Plots of the individual strength profiles against time for each program: CONT, RI and WI]

No evidence of non–constant variance!!

Sometimes (but certainly not always!) repeated measurements can


be appropriately dealt with by a split–plot model.

October 11, 2001 Mixed Models Course 4

247
14 Repeated Measurements

• A statistical model for this situation could be

yijk = µ + αi + βj + γij + wik + εijk

where wik ∼ N (0, σ²_w) and εijk ∼ N (0, σ²).

• Here i denotes treatment, k is replications (within treatment) and


j is “time”

• “Time” is called the within–subject factor.

• Note: “Time” can also refer to different locations, e.g. in the


intestine.

• It is the usual split–plot model!

October 11, 2001 Mixed Models Course 5

Tacit Assumptions when using the Split–Plot Model

It is important to realize the assumptions one makes in applying a split–plot model to a repeated measurement problem:

1. It is assumed that the variance is constant.

   This may not be a reasonable assumption: Sometimes the variance increases with the mean, and if the mean changes over “time”, this assumption is violated.
   If time is really location in the intestine, there might be certain segments where the variance of a given response is much larger than in other segments.
October 11, 2001 Mixed Models Course 6

248
2. It is assumed that the correlation between two measurements on the
same individual is the same – no matter how far the measurements
are apart in time.
This may not be a reasonable assumption: Observations close
to each other in time might be expected to be more alike than
observations far from each other.

3. It is assumed that the correlation is positive.


This may not be a reasonable assumption: Consider a feeding
experiment. If the feed intake is lower than expected in one week
because of diseases it may be higher than expected in the next
week. Hence the observations would be negatively correlated.

October 11, 2001 Mixed Models Course 7

4. It is assumed that the biological questions can be answered through the interaction γij and possibly the main effects αi and βj.

   That might be too crude a model. For example, data might indicate that the mean value evolves over time in a specific way, e.g.

   µij = µ + αi + β × j + β2 × j²

October 11, 2001 Mixed Models Course 8

249
14 Repeated Measurements

Modelling of Covariances

A classical way of thinking of a statistical model is as

Observables = Systematic effects + Random effects

Most frequently, the main interest is in the systematic effects, while the random effects are considered a nuisance.

Yet, the random effects are important to understand and to model in


an appropriate way.

October 11, 2001 Mixed Models Course 9

Types of random variation

[Four simulated example series plotted against x: “m + e” (residual variation only), “m + subj” (random subject effect), “m + ser” (serial dependence) and “m + subj + ser + e” (all three combined)]

October 11, 2001 Mixed Models Course 10

250
Can be summarized as:

• Random subject effect

• Serial dependence

• Residual variation

October 11, 2001 Mixed Models Course 11

Unstructured Covariance Matrix

Consider Exercise Therapy data.

A very general model is the model where for each treatment i and time j there is a mean value µij, and the measurements have a completely unstructured covariance matrix:

Yik = (Yi1k, . . . , Yi7k)> ∼ N7(µi = (µi1, . . . , µi7)>, V )

where k refers to subject within treatment, and where V is a 7 × 7 unstructured matrix.
October 11, 2001 Mixed Models Course 12

251
14 Repeated Measurements

Since the subjects are independent the random vector arising after
stacking all Yik s on the top of each other has a covariance matrix
consisting of V ’s on the “diagonal” and 0s outside.

Such a matrix is said to be block diagonal.

Note that in V there are 7 × 8/2 = 28 parameters.

October 11, 2001 Mixed Models Course 13

This model can be fitted with the following SAS program:


proc mixed data=weight2;
class program subj time;
model strength = program time program*time / outP=pred;
repeated time / subject=subj*program type=un r;
ods listing exclude r; ods output r=r rcorr=rcorr;
data r; set r; keep col1-col7;
data rcorr; set rcorr; keep col1-col7;
run;

The data set r contains the estimated covariance matrix, while


rcorr contains the correlation matrix

Note that V is the covariance matrix for Yik. But if we write Yik = µi + εik (note: everything here are vectors) then V is also the covariance matrix for the error terms εik, which have mean 0.

October 11, 2001 Mixed Models Course 14

252
The estimated correlation matrix is
1.0000 0.9602 0.9246 0.8716 0.8421 0.8091 0.7968
0.9602 1.0000 0.9396 0.8770 0.8596 0.8273 0.7917
0.9246 0.9396 1.0000 0.9556 0.9372 0.8975 0.8755
0.8716 0.8770 0.9556 1.0000 0.9601 0.9094 0.8874
0.8421 0.8596 0.9372 0.9601 1.0000 0.9514 0.9165
0.8091 0.8273 0.8975 0.9094 0.9514 1.0000 0.9531
0.7968 0.7917 0.8755 0.8874 0.9165 0.9531 1.0000

October 11, 2001 Mixed Models Course 15

The AR(1)–model

Consider a sequence of measurements z1, z2, . . . , zT made on the


same experimental unit at T time points t = 1, . . . , T .

It is assumed that E(zt) = 0 for all t.

A frequently employed model is the AutoRegressive model of order


1, which states that

zt = ρ zt−1 + εt,   t = 2, . . . , T

where εt ∼ N (0, σ²_z), all independent, and where −1 < ρ < 1.


October 11, 2001 Mixed Models Course 16

253
14 Repeated Measurements

Hence what happens at time t is ρ times what happened at time


t − 1 + some random noise.

The variance of each zt is the same and is denoted ω 2.

This variance can be found as:

ω² = Var(zt) = Var(ρ zt−1 + εt)
   = ρ² Var(zt−1) + Var(εt)
   = ρ² ω² + σ²_z

Hence ω² = σ²_z/(1 − ρ²).

October 11, 2001 Mixed Models Course 17

It is illustrative to investigate the covariance structure of this model.

First consider observations one time–step apart:

Cov(zt, zt−1) = Cov(ρ zt−1 + εt, zt−1)
             = ρ Cov(zt−1, zt−1) = ρ Var(zt−1) = ρ ω²

Next we consider observations two time–steps apart:

Cov(zt, zt−2) = Cov(ρ zt−1 + εt, zt−2)
             = Cov(ρ zt−1, zt−2) = ρ Cov(zt−1, zt−2)
             = ρ² ω²
October 11, 2001 Mixed Models Course 18

254
In general, the covariance between observations k time–steps apart is

Cov(zt, zt−k) = ρ^k ω²

The correlation between observations k time steps apart therefore becomes

γ(k) = Corr(zt, zt−k) = ρ^k ω²/ω² = ρ^k

The number k is called the lag between the observations and γ(k) is called the autocorrelation function.

If the postulated model is correct, the autocorrelation should tend to


0 as the lag increases.
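As a small illustration of the model (not part of the original material), an AR(1) series can be simulated in a SAS data step; the choice ρ = 0.5 is arbitrary:

data ar1;
  rho = 0.5;
  z = 0;                       /* starting value */
  do t = 1 to 50;
    z = rho*z + rannor(0);     /* z_t = rho*z_(t-1) + epsilon_t */
    output;
  end;
run;

Plotting z against t, and the empirical autocorrelations against lag, should show the geometric decay ρ^k.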

Some Autocorrelations
October 11, 2001 Mixed Models Course 19

[Plots of the autocorrelation function ρ^k against lag (0–50) together with a simulated series, for ρ = 0.5 and for ρ = −0.5]

October 11, 2001 Mixed Models Course 20

255
14 Repeated Measurements

[Plots of the autocorrelation function ρ^k against lag (0–50) together with a simulated series, for ρ = 0.9 and for ρ = 0.1]

October 11, 2001 Mixed Models Course 21

How to estimate the autocorrelation??

A very brute–force way of estimating the autocorrelation is the


following: Suppose there are observations from 4 time points, i.e.
t = 1, . . . , 4 on many subjects and assume observations all have zero
mean.

Then the (symmetric) matrix of correlations is

        [ 1    ρ12  ρ13  ρ14 ]
Corr =  [ ρ12  1    ρ23  ρ24 ]
        [ ρ13  ρ23  1    ρ34 ]
        [ ρ14  ρ24  ρ34  1   ]

October 11, 2001 Mixed Models Course 22

256
Simple estimates of the autocorrelation for observations one, two and three time–steps apart are

γ̂(1) = (ρ12 + ρ23 + ρ34)/3
γ̂(2) = (ρ13 + ρ24)/2
γ̂(3) = ρ14

Obviously, for higher values of k, γ(k) will be poorly estimated as it


is the average over few values.

October 11, 2001 Mixed Models Course 23

The autocorrelation can be estimated (as described above) by


invoking the macro:
%autocorr(r);

where r is the covariance matrix estimated in connection with the


model with unstructured covariance matrix.
If the file autocorr.sas is located in e.g. c:\stat then the macro
is included, i.e. made available by submitting the statement
%include ’d:\stat\autocorr.sas’;

This creates the SAS dataset autocorr with autocorrelation and lag.

October 11, 2001 Mixed Models Course 24

257
14 Repeated Measurements

The macro also creates a plot of the autocorrelation against lag:


Autocorrelation for Exercise Therapy data

[Plot of the estimated autocorrelation (ranging from about 0.80 to 1.00) against lag 0–6]

What can be concluded from that?


October 11, 2001 Mixed Models Course 25

• There is a clear indication of positive correlation and that the


correlation decreases with time.

• Whether the correlation structure can be appropriately described


by ρk is another issue. There is not much evidence for or against
that structure.

October 11, 2001 Mixed Models Course 26

258
Since all autocorrelations γ(k) are positive it is tempting to plot
log γ(k) against k as well.

The reason is that if the autocorrelation is γ(k) = ρk then


log γ(k) = k log ρ.

Hence a plot of log γ(k) = k log ρ against k should approximately


yield a straight line with intercept 0 and slope log ρ:

October 11, 2001 Mixed Models Course 27

Log Autocorrelation for Exercise Therapy data

[Plot of log(autocorrelation) against lag 0–6]

October 11, 2001 Mixed Models Course 28

259
14 Repeated Measurements

Again, there is not any strong evidence against the AR(1) structure.

From the graph it follows that the slope is approximately


log ρ ≈ −0.23/6 = −0.038 such that ρ ≈ 0.962.

Hence the correlation between observations does decrease as the


time between them increases – but it decreases very slowly!!

October 11, 2001 Mixed Models Course 29

Compound Symmetry

The Split–plot model can also be formulated using a REPEATED


statement instead of a RANDOM statement.
proc mixed data=weight2;
class program subj time;
model strength = program time program*time;
repeated time / type=cs sub=subj(program) r rcorr;
ods listing exclude r; ods output r=r;
run;

Fortunately, the results using a REPEATED or a RANDOM statement are


the same!
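For reference, a sketch of the corresponding RANDOM-statement formulation of the same split-plot model (variable names as in the REPEATED version above):

proc mixed data=weight2;
  class program subj time;
  model strength = program time program*time;
  random subj(program);    /* random subject effect instead of repeated / type=cs */
run;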

The option type=cs specifies that the covariance matrix for each subject has a compound symmetry structure:

[ σ² + σ²_w   σ²_w        . . .   σ²_w      ]
[ σ²_w        σ² + σ²_w   . . .   σ²_w      ]
[   ...         ...       . . .    ...      ]
[ σ²_w        σ²_w        . . .   σ² + σ²_w ]

October 11, 2001 Mixed Models Course 30

From the SAS output one sees that the correlation between observations on the same subject is estimated to

σ²_w/(σ²_w + σ²) ≈ 0.8892

October 11, 2001 Mixed Models Course 31

Which Covariance Structure to use?

With all this flexibility in choosing the covariance structure, some


guidelines are needed for choosing an appropriate one:

• Parsimony: Covariance structures with few parameters are most


attractive as there are fewer parameters to be estimated from data.

• Exploratory data analysis: A graphical investigation of the data


might suggest an appropriate covariance structure.

• Subject matter considerations: Sometimes the problem at hand


really dictates an appropriate covariance structure
October 11, 2001 Mixed Models Course 32

261
14 Repeated Measurements

• Necessity: Sometimes one is for numerical reasons forced to use a


very simple covariance structure – PROC MIXED might not be able
to fit the complex ones.

• Numerical criteria: There are some numerical criteria, which can


be a guideline.

October 11, 2001 Mixed Models Course 33

Numerical Criteria

AIC and BIC are some criteria to be used. They are both the
log–likelihood + some term penalizing for the number of parameters
used in the model. BIC penalizes the use of many parameters harder
than AIC.

Smaller values of both criteria indicate a good fit.

For the Exercise Therapy the result is

Structure CS AR(1) UN
AIC 1424.9 1270.8 1290.9
BIC 1428.9 1274.9 1348.1

October 11, 2001 Mixed Models Course 34

262
Hence the result is in favor of using the AR(1)–structure.

October 11, 2001 Mixed Models Course 35

What does the covariance structure mean for the


conclusions?

For the Exercise Therapy the p–values for the test of no interaction
effect are:
Structure CS AR(1) UN
Program*Time 0.0005 0.3007 0.1297

Radically different conclusions!

The data really suggests that the interaction is present!

October 11, 2001 Mixed Models Course 36

15 Repeated Measurements: Covariance
structures

This lecture gives an overview of how to specify different covariance structures in SAS via the
REPEATED statement in PROC MIXED. The lecture is based on the description in the on-line SAS-
manual1 .
The most important types of covariance structure are presented.

• Unstructured (UN)

• Autoregressive (AR(1)–SP(POW))

• Antedependence (ANTE(1))

• Toeplitz (TOEP)

• Heterogeneous variance (ARH(1),CSH, etc.)

The pro’s and con’s of the different structures are discussed


Link to full screen Presentation2

1
http://dokumentation.agrsci.dk/sasdocv8/sasdoc/sashtml/onldoc.htm
2
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/RepeatedType.f.pdf

265
15 Repeated Measurements: Covariance structures

Repeated statement

Y = Xβ + Zu + ε
V(ε) = R

R is an n × n matrix, where n is the number of observations.

In order to handle this, a structure of the matrix is defined with


repeated use of the elements in the structure.

March 21, 2001 1

Repeated Statement

The syntax of the REPEATED statement

REPEATED < repeated-effect > < / options >;

Usually a formulation like:


REPEATED time / subj=animal*treat ;

A good precaution is always to specify the repeated-effect

March 21, 2001 2

266
Missing data: example

Treat Animal Time Y


A 1 1 12.4
A 1 2 .
A 1 3 14.5
B 1 1 14.3
B 1 2 15.3
B 1 3 14.8
.. .. .. ..

March 21, 2001 3

PROC MIXED: REPEATED Statement

REPEATED < repeated-effect > < / options > ;

You can specify the following options in the REPEATED statement


after a slash (/).
GROUP=effect HLM HLPS
LDATA=SAS-data-set LOCAL LOCALW
NONLOCALW R<=value-list> RC<=value-list>
RCI<=value-list> RCORR<=value-list> RI<=value-list>
SSCP SUBJECT=effect TYPE=covariance-structure

March 21, 2001 4

267
15 Repeated Measurements: Covariance structures

Types of variance structure

• Approximately 30 different methods

• ”Time”/”linear” structure vs. spatial structure

• Homogeneous vs. heterogeneous variance

• ”Banded” vs full structure

March 21, 2001 5

Unstructured: type=un

The covariance matrix of the measurements on each subject is

[ σ11  σ12  σ13  σ14 ]
[      σ22  σ23  σ24 ]
[           σ33  σ34 ]
[                σ44 ]

(only the upper triangle is shown; the matrix is symmetric)

Number of parameters: t × (t + 1)/2

March 21, 2001 6

268
Autoregressive: type=AR(1)

The covariance matrix of the measurements on each subject is

      [ 1   ρ    ρ²   ρ³ ]
σ² ×  [     1    ρ    ρ² ]
      [          1    ρ  ]
      [               1  ]

     ρ      ρ      ρ      ρ
Y1 ---- Y2 ---- Y3 ---- Y4 ---- Y5

March 21, 2001 7

Autocovariance

[Plot of the autocovariance function ρ^lag against lag 0–6]
March 21, 2001 8

269
15 Repeated Measurements: Covariance structures

Autocovariance

[Plot of the autocovariance function ρ^lag against lag 0–6]
March 21, 2001 9

Autoregressive: type=SP(POW)

The covariance matrix of the measurements on each subject is

      [ 1   ρ^|t2−t1|   ρ^|t3−t1|   ρ^|t4−t1| ]
σ² ×  [     1           ρ^|t3−t2|   ρ^|t4−t2| ]
      [                 1           ρ^|t4−t3| ]
      [                             1         ]

March 21, 2001 10

270
Ante-Dependence: type=ANTE(1)

AR(1):
     ρ      ρ      ρ      ρ
Y1 ---- Y2 ---- Y3 ---- Y4 ---- Y5

ANTE(1):
     ρ1     ρ2     ρ3     ρ4
Y1 ---- Y2 ---- Y3 ---- Y4 ---- Y5

March 21, 2001 11

Ante-Dependence: type=ANTE(1)

The covariance matrix of the measurements on each subject is

[ σ1²   σ1σ2ρ1   σ1σ3ρ1ρ2   σ1σ4ρ1ρ2ρ3 ]
[       σ2²      σ2σ3ρ2     σ2σ4ρ2ρ3   ]
[                σ3²        σ3σ4ρ3     ]
[                           σ4²        ]

March 21, 2001 12

271
15 Repeated Measurements: Covariance structures

Toeplitz: type=TOEP

The covariance matrix of the measurements on each subject is

[ σ²   σ1   σ2   σ3 ]
[      σ²   σ1   σ2 ]
[           σ²   σ1 ]
[                σ² ]

March 21, 2001 13

Heterogenous variance

Instead of an identical variance at every time point, the variance is estimated at each time point.

In general, the type is obtained by simply adding an H to the type name, i.e., csh, arh(1), toeph.

The structures are preserved as far as the correlation between time


points are concerned

More elaborate parametric techniques are available, e.g., LIN.

March 21, 2001 14

272
Conclusions

• Parsimony !
• Fixed observation times and similar intervals : AR(1)
(2 parms)
• Slightly varying observation times and similar intervals
: SP(POW) (2 parms)
• Fixed observation times but intervals of different type:
ANTE(1) (2t − 1 parms (heterogen. variance))
• Fixed observation times, similar intervals, no simple
lag-structure : TOEP (t − 1 parms)
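The structures listed above are all selected with the TYPE= option of the REPEATED statement. A sketch of how two candidate structures could be fitted and compared on the same data (the data set and variable names follow the exercise-therapy example used earlier):

proc mixed data=weight2;
  class program subj time;
  model strength = program time program*time;
  repeated time / subject=subj(program) type=ar(1);
run;

proc mixed data=weight2;
  class program subj time;
  model strength = program time program*time;
  repeated time / subject=subj(program) type=toep;
run;

The AIC/BIC values reported by the two runs can then be compared, smaller being better.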

March 21, 2001 15

AR vs CS

AR(1): neighbouring observations are linked directly

     ρ      ρ      ρ      ρ
Y1 ---- Y2 ---- Y3 ---- Y4 ---- Y5

CS: all observations are linked through a common subject effect A

Y1   Y2   Y3   Y4   Y5
  \    \   |   /    /
           A

March 21, 2001 16

16 Random Regression

The random regression model is discussed starting with an example from one of the exercises.
The presentation supplements chapter 7: Random Coefficients in LMSW (Littell et al., 1996)
The basic idea behind random regression and the implementation of the model in PROC MIXED
is shown. Finally, the implications for the covariance structure of the observations are presented.
Link to full-screen presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/RandomRegression.f.pdf

275
16 Random Regression

The Basic Idea behind Random Regression

Feeding pigs with different amounts of vitamin E supplement.

Weights recorded weekly.


[Plots of Weight against Time for the individual pigs, one panel per treatment group Cu = 1, 2, 3]

October 4, 2001 Mixed Models Course 1

• Clearly (random) between–subject (pig) variation.

• Approximately linear increase in weight.

• Slight tendency to larger dispersion between pigs at the end of the


study than at the beginning.

• Repeated measurement problem.

Aims:

• Find a regression model which describes the weight as function of


time.

• Draw inferences about possible treatment effects.

October 4, 2001 Mixed Models Course 2

276
First idea: fit linear regression model (with random pig effect) and
treatment specific parameters:

yijt = αi + βi t + Uij + εijt

Here, i is treatment, j is subject (pig) within treatment, t is time, Uij ∼ N (0, σ²_u) and εijt ∼ N (0, σ²), all independent.
title ’Linear regression (with random Pig effect)’;
title2 ’Treatment specific parameters’;
proc mixed data=CuFeed;
class Cu Pig;
model Weight = Cu Cu*Time /noint solution outp=R1 ;
random Cu*Pig;
run;

October 4, 2001 Mixed Models Course 3

Plot the curves of residuals:


symbol i=j;
proc gplot data=R1;
by Cu;
plot resid*Time=Pig;
run;
[Plots of the residuals against Time for each pig, one panel per treatment group Cu = 1, 2, 3]

The “residual curves” do not look random.


October 4, 2001 Mixed Models Course 4

277
16 Random Regression

Second idea: fit individual linear regression model (with random pig
effect):
yijt = αi + βij t + Uij + εijt

where i is treatment, j is subject (pig) within treatment, t is time, and Uij ∼ N (0, σ²_u) and εijt ∼ N (0, σ²), independent.
title ’Individual linear regressions (with random Pig effect)’;
proc mixed data=CuFeed;
class Cu Pig;
model Weight = Cu Cu*Pig*Time /noint solution outp=R2;
random Cu*Pig;
ods output solutionf=sf2;
proc gplot data=R2;
by Cu;
plot resid*Time=Pig;
run;

October 4, 2001 Mixed Models Course 5

[Plots of the residuals against Time for each pig under the individual-slope model, one panel per treatment group Cu = 1, 2, 3]

The “residual curves” now look much more random.

This approach gives a whole lot of parameter estimates βij , where i


refers to treatment and j to individual within treatment.

How to proceed with the analysis?


October 4, 2001 Mixed Models Course 6

278
Analyzing the Individual Regression Coefficients

Frequently the task is to estimate the effect of time for each


treatment.

A tempting (and classical) way of doing this is to continue analyzing


the βij s.

For example, β̄i· = (1/J) Σ_j βij is the average slope within treatment i.

The analysis could then proceed by comparing β̄1. , β̄2. and β̄3. in
some way.

Yet - it is somewhat unsatisfactory to first estimate the βij s as systematic effects and then afterwards analyze these as if they were random quantities.

October 4, 2001 Mixed Models Course 7

October 4, 2001 Mixed Models Course 8

279
16 Random Regression

Some graphics of the βij s:


[Histograms and normal Q–Q plots of the estimated slopes Time*Cu*Pig, one row per treatment group Cu = 1, 2, 3]

October 4, 2001 Mixed Models Course 9

Random Regression

A random regression model is an alternative:

yijt = αi + βi t + Uij + Bij t + εijt

The systematic effects are as usual.

The random effects are Uij ∼ N (0, σ²_U), Bij ∼ N (0, σ²_B) and εijt ∼ N (0, σ²).

It is assumed that εijt is independent of Uij and of Bij, but it need not be assumed that Uij and Bij are independent.
October 4, 2001 Mixed Models Course 10

280
Hence

• βi is the population slope for pigs receiving the ith treatment.

• Bij describes individual random deviations from the population


slope.

In this way systematic and random variation of the regression


coefficients can be separated.

October 4, 2001 Mixed Models Course 11

Just like the parameter estimates in a regression usually are


correlated, then so might the random effects Uij and Bij also be.

To obtain such flexibility, we assume

( Uij )        ( 0 )   [ σ²_U    σ_UB ]
(     ) ∼ N2 ( (   ) , [               ] )
( Bij )        ( 0 )   [ σ_UB    σ²_B ]

If σU B = 0 then Uij and Bij are independent.

October 4, 2001 Mixed Models Course 12

281
16 Random Regression

How to ... In SAS

Independence:
title ’Random regression model (with random Pig effect)’;
title2’Independent intercepts and slopes’;
proc mixed data=CuFeed;
class Cu Pig;
model Weight = Cu Cu*Time / ddfm=satterth noint solution outp=R3;
random int Time / sub=Pig type=vc solution;
ods output solutionf=sf3;
ods listing exclude solutionr;
ods output solutionr=sr3;
run;

Independence of Uij and Bij is obtained by type=vc in the RANDOM


statement.

October 4, 2001 Mixed Models Course 13

Dependence:
title ’Random regression model (with random Pig effect)’;
title2’Dependent intercepts and slopes’;
proc mixed data=CuFeed;
class Cu Pig;
model Weight = Cu Cu*Time / ddfm=satterth noint solution outp=R4;
random int Time / sub=Pig type=un solution;
ods output solutionf=sf4;
ods listing exclude solutionr;
ods output solutionr=sr4;
run;

Dependence of Uij and Bij is obtained by type=un in the RANDOM


statement.

October 4, 2001 Mixed Models Course 14

282
Inference

In connection with random regression models we recommend always


using the ddfm=satterth option for estimating the degrees of
freedom.

Contrast etc. can be obtained as follows:


proc mixed data=CuFeed;
class Cu Pig;
model Weight = Cu Time Cu*Time / ddfm=satterth solution outp=R3;
random int Time / sub=Pig type=vc solution;
lsmeans Cu / diff;
estimate ’Slope: Cu1 vs Cu2’ Cu*Time 1 -1 0;
estimate ’Slope: Cu1 vs Cu3’ Cu*Time 1 0 -1;
estimate ’Slope: Cu2 vs Cu3’ Cu*Time 0 1 -1;

October 4, 2001 Mixed Models Course 15

run;

October 4, 2001 Mixed Models Course 16

283
16 Random Regression

When a random regression coefficient is present in the model, then


it is important that the model also contains a random intercept.

To see why consider the random regression model

yijt = αi + βi t + Uij + Bij t + εijt

Suppose that the scale of time t is changed to t′ = c1 t + c2. Then it would be very desirable to obtain the same result whether t or t′ was used as time in the regression.

October 4, 2001 Mixed Models Course 17

Now we use t′ in a random regression model without random intercept:

yijt = αi + βi t + Bij t′ + εijt
     = αi + βi t + Bij (c1 t + c2) + εijt
     = αi + βi t + (Bij c1 t) + (Bij c2) + εijt

Hence Bij c2 will play the role of a random intercept.

In other words, the presence of a random intercept is a matter of the scale on which t is measured.

Likewise, in a polynomial regression involving t²: If there is a random regression coefficient for t² then there must also be a random regression coefficient for t and a random intercept.

October 4, 2001 Mixed Models Course 18
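A sketch of how such a quadratic random regression could be specified (the data set and variable names follow the CuFeed example; Time2 is a hypothetical variable created in a preceding data step):

data CuFeed2;
  set CuFeed;
  Time2 = Time*Time;     /* squared time for the quadratic term */
run;

proc mixed data=CuFeed2;
  class Cu Pig;
  model Weight = Cu Cu*Time Cu*Time2 / ddfm=satterth solution;
  random int Time Time2 / sub=Pig type=un solution;
run;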

October 4, 2001 Mixed Models Course 19

Correlation structure in Random Regression Models

Consider again the random regression model

yijt = αi + βi t + Uij + Bij t + εijt

and assume for simplicity that Uij and Bij are independent.

The variance of Yijt is

Var(Yijt) = σ²_U + σ²_B t² + σ²_e

For later use let V_t = σ²_U + σ²_B t².

October 4, 2001 Mixed Models Course 20

285
16 Random Regression

Next consider the variance at time t + k:

Var(Yij(t+k)) = σ²_U + σ²_B (t + k)² + σ²_e = V_{t+k} + σ²_e
             = σ²_U + t²σ²_B + k(2t + k)σ²_B + σ²_e
             = V_t + k(2t + k)σ²_B + σ²_e

The covariance between Yijt and Yij(t+k) is

Cov(Yijt, Yij(t+k)) = Cov(Uij + Bij t + εijt, Uij + Bij (t + k) + εij(t+k))
                   = Var(Uij) + Cov(Bij t, Bij (t + k))
                   = σ²_U + t(t + k)σ²_B
                   = [σ²_U + t²σ²_B] + tkσ²_B = V_t + tkσ²_B

October 4, 2001 Mixed Models Course 21

In total

Var(Yijt) = V_t + σ²_e
Var(Yij(t+k)) = V_t + k(2t + k)σ²_B + σ²_e
Cov(Yijt, Yij(t+k)) = V_t + tkσ²_B

Hence the correlation is

Corr(Yijt, Yij(t+k)) = (V_t + tkσ²_B) / √( (V_t + σ²_e)(V_t + k(2t + k)σ²_B + σ²_e) )

Now consider a fixed t. The numerator is a linear function in k while the denominator is a quadratic function in k.
October 4, 2001 Mixed Models Course 22

286
Hence we know from high school mathematics that

Corr(Yijt, Yij(t+k)) → 0

as k (i.e. the time span between Yijt and Yij(t+k)) goes to infinity.

In other words, under the random regression model, the correlation decreases with distance in time.

That is an appealing property of the model!

October 4, 2001 Mixed Models Course 23

17 Factor Structure Diagrams

The discussion with participants during the previous lectures had shown the need for an independent means of checking the degrees of freedom in the F-tests in PROC MIXED. The methods for calculating degrees of freedom (option ddfm) are not fool-proof. The containment method may lead to errors if the experimental design cannot be deduced from the model specification, and the Satterthwaite method is erroneous if one of the variance components is estimated as 0.
Therefore, the factor structure diagram method was presented, supplemented with an exercise.
Link to the full screen presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/FactorStructure.f.pdf

289
17 Factor Structure Diagrams

Factor Structure Diagrams

Factor structure diagrams are a way of representing certain factorial designs, including block experiments, split plot experiments etc.

• With such diagrams, it is for certain balanced cases easy to calculate


the correct degrees of freedom for the tests.

• It is also for certain balanced cases easy to identify which “error” an


effect is to be “tested against”.

April 17, 2001 1

However

• it is a somewhat restricted class of models that can be appropriately


represented this way.

• the degree of freedom calculations are not correct in unbalanced cases

• It is a very comprehensive task to describe the class of designs for which


factor structure diagrams can be used

Nonetheless, they are quite useful...

April 17, 2001 2

290
Two–way ANOVA with Replicates

Factors A and B have a and b levels. Replicates within each


combination A × B are denoted by the factor R with r levels.

That is, there are abr units in the experiment

The usual two–factor ANOVA model is

yabr = µ + αa + βb + (αβ)ab + εabr

The model can be represented in a factor structure diagram


April 17, 2001 3

                                            A^{a}_{a-1}
[ABR]^{abr}_{abr-ab} --- AB^{ab}_{ab-a-b+1}                --- O^{1}_{1}
                                            B^{b}_{b-1}

(superscript = number of levels, subscript = degrees of freedom; the lines of the
original diagram connect [ABR] to AB, AB to A and B, and A and B to O)

• The term O is to be identified with µ

• The term A is to be identified with αa

• The term AB is to be identified with (αβ)ab etc.

• Terms in [. . . ] are random effects.


April 17, 2001 4

291
17 Factor Structure Diagrams

Calculating the degrees of freedom

1. Fill in the levels of the factors as superscripts.

2. Then calculate the degrees of freedom (DF) recursively from right to left:
   The DF for O is 1.
   The DF for A is a minus the sum of the DFs of the factors pointing towards A in the diagram, i.e.
a−1=a−1

3. Proceed like this towards left in the diagram: The DF for AB are

ab − (a − 1) − (b − 1) − 1 = ab − a − b + 1
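As a small worked check with the values used in the simulation below (a = 4, b = 2, r = 3): DF(A) = 4 − 1 = 3, DF(B) = 2 − 1 = 1, DF(AB) = 8 − 3 − 1 − 1 = 3, and DF([ABR]) = 24 − 8 = 16, which matches the numerator and denominator degrees of freedom in the PROC MIXED output shown next.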

April 17, 2001 5

“Proof that it works...”


%let a=4; %let b=2; %let r=3;
title ’Two-way ANOVA with replicates’;
data data1;
do A=1 to &a;
do B=1 to &b;
do R=1 to &r;
y=rannor(0);
output;
end; end; end;

proc mixed data=data1 noinfo noclprint;


class A B R;
model Y = A B A*B;
run;

Type 3 Tests of Fixed Effects


Num Den
Effect DF DF F Value Pr > F

A 3 16 0.56 0.6470
B 1 16 1.54 0.2329
A*B 3 16 1.72 0.2021

April 17, 2001 6

292
Two–way ANOVA without Replicates
If there are no replicates within each combination of A and B (i.e.
r = 1), the model is

yab = µ + αa + βb + εab

since the interaction can not be estimated.

Following the lines from before, a diagram is


                                             A^{a}_{a-1}
[ABR]^{ab}_{ab-ab=0} --- AB^{ab}_{ab-a-b+1}                 --- O^{1}_{1}
                                             B^{b}_{b-1}

April 17, 2001 7

Another way of looking at it is by saying that the random error is the


interaction!!

So a more appropriate diagram is

                         A^{a}_{a-1}
[AB]^{ab}_{ab-a-b+1}                  --- O^{1}_{1}
                         B^{b}_{b-1}

April 17, 2001 8

293
17 Factor Structure Diagrams

“Proof that it works...”

title ’Two-way ANOVA without replicates’;


data data2;
do A=1 to &a;
do B=1 to &b;
y=rannor(0);
output;
end; end;
proc mixed data=data2 noinfo noclprint;
class A B;
model Y = A B;
run;

Type 3 Tests of Fixed Effects


Num Den
Effect DF DF F Value Pr > F

A 3 3 0.45 0.7377
B 1 3 0.05 0.8414

April 17, 2001 9

Block Experiments with Replicates within Blocks


If A is a (random) block effect and there are replicates of the factor B
within each block the model is

yabr = µ + Ua + βb + Vab + εabr

The diagram is

                                               [A]^{a}_{a-1}
[ABR]^{abr}_{abr-ab} --- [AB]^{ab}_{ab-a-b+1}                  --- O^{1}_{1}
                                               B^{b}_{b-1}

April 17, 2001 10

294
Note:

• The systematic effect B is to be tested against the random effect


closest to it in the diagram, i.e. [AB]

• Note that since A is a random effect, any factor containing A must


also be random.

April 17, 2001 11

“Proof that it works...”


title ’Block experiment with replicates within blocks’;
data data3;
do A=1 to &a;
U = rannor(0);
do B=1 to &b;
V = rannor(0);
do R=1 to &r;
y=rannor(0) + U + V;
output;
end; end; end;

proc mixed data=data3 noinfo noclprint;


class A B R;
model Y = B;
random A A*B;
run;

Type 3 Tests of Fixed Effects


Num Den
Effect DF DF F Value Pr > F

B 1 3 14.99 0.0305

April 17, 2001 12

295
17 Factor Structure Diagrams

Block Experiments without Replicates within Blocks


If A is a (random) block effect and there are no replicates of the factor
B within each block the model is

yab = µ + Ua + βb + εab

The diagram is

                          [A]^{a}_{a-1}
[AB]^{ab}_{ab-a-b+1}                     --- O^{1}_{1}
                          B^{b}_{b-1}

April 17, 2001 13

“Proof that it works...”


title ’Block experiment without replicates within blocks’;
data data4;
do A=1 to &a;
U = rannor(0);
do B=1 to &b;
y=rannor(0) + U;
output;
end; end;
proc mixed data=data4 noinfo noclprint;
class A B;
model Y = B;
random A;
run;

Type 3 Tests of Fixed Effects


Num Den
Effect DF DF F Value Pr > F

B 1 3 3.30 0.1671

April 17, 2001 14

296
Split Plot Experiment
Let A denote the whole–plot treatment and B the split–plot treatment.
Replicate units within A are denoted by R.

The model is:

yabr = µ + αa + Uar + βb + (αβ)ab + εabr

                              [AR]^{ar}_{ar-a}        A^{a}_{a-1}
[ABR]^{abr}_{abr-ar-ab+a}                                              O^{1}_{1}
                              AB^{ab}_{ab-a-b+1}      B^{b}_{b-1}

(lines connect [ABR] to [AR] and AB, [AR] to A, AB to A and B, and A and B to O)

April 17, 2001 15

“Proof that it works”


title ’Split plot experiment’;
%let a=4; %let b=3; %let r=3;
data data5;
do A=1 to &a;
do R=1 to &r;
U = rannor(0);
do B=1 to &b;
y=rannor(0) + U;
output;
end; end; end;
proc mixed data=data5 noinfo noclprint;
class A B R;
model Y = A B A*B;
random A*R;
run;

Type 3 Tests of Fixed Effects


Num Den
Effect DF DF F Value Pr > F

A 3 8 0.68 0.5901
B 2 16 3.81 0.0444
A*B 6 16 2.57 0.0618

April 17, 2001 16

297
17 Factor Structure Diagrams

Split Plot Experiment – Homework

Let E and C be the vitamin E and copper treatments applied to R pigs


within each combination of E and C.

Let M denote the membrane.

Hence the model is

yecrm = µ + αe + βc + (αβ)ec + Uecr + γm + (αγ)em + (βγ)cm + (αβγ)ecm + εecrm

April 17, 2001 17

The factor structure diagram becomes

                                 CM^{cm}_{cm-c-m+1}            M^{m}_{m-1}
                                 ECM^{ecm}_{(e-1)(c-1)(m-1)}
[ECRM]^{ecrm}_{ec(rm-r-m+1)}     EM^{em}_{em-e-m+1}            E^{e}_{e-1}     O^{1}_{1}
     [ECR]^{ecr}_{ec(r-1)}       EC^{ec}_{ec-e-c+1}            C^{c}_{c-1}

(lines connect [ECRM] to ECM and [ECR]; ECM to CM, EM and EC; [ECR] to EC;
 CM to C and M; EM to E and M; EC to E and C; and E, C and M to O)

April 17, 2001 18

298
“Proof that it works”
title ’Split plot experiment - homework - with 3 membranes’;
%let sigma_G = 2;
%let sigma_M = 6;
%let sigma_E = 1;
data mem;
do cu= 1 to 2;
do e_vit= 1 to 2;
do grnr= 1 to 8;
U_g = &sigma_G * rannor(0);
do membran= 1 to 3;
V_m = &sigma_M * rannor(0);
do muskel= 1 to 2;
E = &sigma_E * rannor(0);
y = U_g + V_m + E;
output;
end;
end;
end;
end;
end;
data mem1; set mem(where=(muskel=1));

April 17, 2001 19

proc mixed data=mem1;


class cu e_vit membran grnr;
model y = cu | e_vit | membran ;
random cu*e_vit*grnr ;
run;

Type 3 Tests of Fixed Effects

Num Den
Effect DF DF F Value Pr > F

cu 1 28 0.05 0.8316
e_vit 1 28 0.10 0.7489
cu*e_vit 1 28 1.55 0.2230
membran 2 56 0.10 0.9091
cu*membran 2 56 0.57 0.5708
e_vit*membran 2 56 1.26 0.2904
cu*e_vit*membran 2 56 1.16 0.3198

April 17, 2001 20

299
17 Factor Structure Diagrams

A Neat Little Exercise

1. Draw a factor structure diagram for the entire membrane experiment.

2. Compute the degrees of freedom for each test.

3. Verify by simulation that SAS does the right thing.

Hint: Use a BIG sheet of paper!

April 17, 2001 21

300
18 Covariate Models and Multivariate
Response

The use of covariates in mixed models is discussed, initially based on chapter 5 in LMSW (Littell
et al., 1996), i.e., model specification, comparison, and reduction.
Then it is shown that the covariate model may be naturally modified to include several dependent
variables, i.e., to a multivariate response model. The data manipulation steps in SAS are described and the necessary model specification is shown.
Link to full screen presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/covariate.f.pdf

301
18 Covariate Models and Multivariate Response

Example of use of covariates

Exercise 1: Treatments copper and vitamin E each at three levels.


Litters as blocks. Dependent variables, daily gain (and feed intake).
Weight at start differed.

April 17, 2001 1

Plot

[Scatter plot of daily gain against start weight]
April 17, 2001 2

302
Plot

[Scatter plot of daily gain against start weight]
April 17, 2001 3

Yijk = (αγ)ij + Lk + βij wijk + εijk

• Yijk : Daily gain,
• wijk : weight at start,
• βij : regression coefficient for level ij of treatment,
• (αγ)ij : interaction between copper and vitamin E,
• Lk : random effect of litter (Lk ∼ N (0, σ²_L)),
• εijk : random residual, εijk ∼ N (0, σ²)

Model reduction ?

April 17, 2001 4

303
18 Covariate Models and Multivariate Response

Model reduction

Reformulate as an additive model and remove non-significant terms:

Yijk = (αγ)ij + Lk + βij wijk + εijk

(αγ)ij = µ + αi + γj + (αγ)′ij
βij = β0 + β1i + β2j + β′ij

April 17, 2001 5

Table 5:1 LMSW, page 5.2.2

1. Are all slopes = 0 ? If we fail to reject, go to step 2; else go to step 3.

2. Fit a common slope and test the hypothesis that it is 0. If we fail to reject, compare treatments using ANOVA; else use the parallel-lines model.

3. Test that the slopes are equal. If we fail to reject, use the common-slope model; if we reject, go to step 4.

4. Use the unequal-slopes model.

April 17, 2001 6

304
SAS-code

Step 1:

proc Mixed data=a;


class Kuld Evit Kobber ;
model Tilv= Evit*Kobber
Startv*Evit*Kobber /noint solution ;
random kuld ;

Step 3:

model Tilv= Evit Kobber Evit*Kobber


Startv Startv*Evit Startv*Kobber
Startv*Kobber*Evit ;

April 17, 2001 7

SAS-Anova

Type 3 Tests of Fixed Effects


Num Den
Effect DF DF F Value Pr > F

EVIT 2 34 0.54 0.5905


KOBBER 2 34 0.46 0.6333
EVIT*KOBBER 4 34 1.10 0.3740
STARTV 1 34 27.62 <.0001
STARTV*EVIT 2 34 0.79 0.4627
STARTV*KOBBER 2 34 0.55 0.5829
STARTV*EVIT*KOBBER 4 34 1.13 0.3572

April 17, 2001 8

305
18 Covariate Models and Multivariate Response

Plot

[Plot of daily gain against start weight]
April 17, 2001 9

Final Model
[Plot of daily gain against start weight under the final model]
April 17, 2001 10

306
Feed per day

[Plot of daily gain against feed per day]
April 17, 2001 11

Feed per day

[Plot of daily gain against feed per day]
April 17, 2001 12

307
18 Covariate Models and Multivariate Response

Feed per day

[Plot of daily gain against feed per day]
April 17, 2001 13

SAS-code

Test

proc Mixed data=a;


class Kuld Evit Kobber ;
model Tilv= Kobber
Fedag Fedag*Kobber ;
random kuld ;

Estimation:

model Tilv= Kobber


Fedag*Kobber /noint solution ;

April 17, 2001 14

308
Feed per day

[Plot of daily gain against feed per day]
April 17, 2001 15

The lines actually denote the conditional distribution of the daily gain given the feed intake, i.e.,

Yij = µ + βXij + εij

If both variables measure the effect of the treatment, the joint distribution may be more interesting.
There is a relatively simple relationship between the conditional and the joint distribution:

E(Xij) = µx
E(Yij) = µy = E(µ + βXij) = µ + βµx

April 17, 2001 16

309
18 Covariate Models and Multivariate Response

V(Xij) = σ²_x
V(Yij | Xij) = V(εij) = σ²_y − σ_yx σ_xy/σ²_x
C(Xij, Yij) = C(Xij, µ + βXij + εij) = β V(Xij) = β σ²_x
V(Yij) = σ²_ε + β² σ²_x

i.e., the joint distribution is

( Xij )       ( µx )   [ σ²_x      β σ²_x         ]
(     ) ∼ N ( (    ) , [                          ] )
( Yij )       ( µy )   [ β σ²_x    σ²_ε + β² σ²_x ]

Can this be generalised ?


April 17, 2001 17

Multivariate Responses

Consider a feeding experiment where a treatment factor A (say


supplement of copper) is applied to pigs.

Two responses are measured:

Y 1 : Weight gain

Y 2 : Feed intake

Hence the response is a two–dimensional vector Y = (Y 1, Y 2)>.

April 17, 2001 18

310
Return to the feeding experiment.

A model for each response Y^r, where r = 1, 2, could be

Y^r_ik = µ^r + α^r_i + ε^r_ik

where i = 1, . . . , I is treatment, k = 1, . . . , K is replicates within each treatment, and ε^r_ik ∼ N (0, σ²_r).

Hence all parameters µ^r, α^r_i, σ²_r are specific to the rth response.

April 17, 2001 21

The Components of a MLNM

For each response Y r it is assumed that E(Y r ) can be written as a


linear function of the explanatory variables.

In the example,
E(Yikr ) = δ r + αir

April 17, 2001 22

311
18 Covariate Models and Multivariate Response

It is assumed that the mean value has the same structure for each
response r made on the same unit.

In the example,

E(Yik ) = (E(Yik1 ), E(Yik2 )) = (δ 1 + αi1, δ 2 + αi2) = (µ1i , µ2i )

It is also assumed that the parameters β r and β s , relating to the rth
and the sth response respectively, have nothing in common.

In the example, this means that there are no restrictions on the


parameters of the form that e.g. αi1 and αi2 are restricted to being
identical.

April 17, 2001 23

The responses are possibly correlated. To account for this we allow
for a covariance matrix of the form

\[
\Sigma = C(Y_{ik}) = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}
\]

The model we consider can be briefly written

Yik = (Yik1 , Yik2 ) ∼ N2((µ1i , µ2i ), Σ)

If the vectors are regarded as row vectors, then it just looks like two
linear normal models appended to each other, with the extra finesse
that the two responses are allowed to be non–independent.
And - that is just what it is !
April 17, 2001 24

312
Such models can be dealt with in a mixed model setup.

The trick is to arrange the data in columns.

Suppose there are two treatments, i.e. i = 1, 2 and two pigs per
treatment, i.e. j = 1, 2.

Then there are 4 units in the experiment, each with two measurements,
giving altogether 8 measurements.

April 17, 2001 25

It is not very hard to see that the mean of each of these can be
written in the matrix form

\[
E\begin{pmatrix} Y_{11}^1 \\ Y_{11}^2 \\ Y_{12}^1 \\ Y_{12}^2 \\ Y_{21}^1 \\ Y_{21}^2 \\ Y_{22}^1 \\ Y_{22}^2 \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1
\end{pmatrix}
\begin{pmatrix} \delta^1 \\ \alpha_1^1 \\ \alpha_2^1 \\ \delta^2 \\ \alpha_1^2 \\ \alpha_2^2 \end{pmatrix}
\]

April 17, 2001 26

313
18 Covariate Models and Multivariate Response

The covariance matrix is easy to specify too: The units are assumed
independent, and hence the covariance between measurements on
different units is zero.

The covariance structure for measurements on the same unit


together with the variances are described in the 2 × 2 matrix Σ.

April 17, 2001 27

For all measurements, the covariance matrix is therefore the 8 × 8
matrix

\[
C\begin{pmatrix} Y_{11}^1 \\ Y_{11}^2 \\ Y_{12}^1 \\ Y_{12}^2 \\ Y_{21}^1 \\ Y_{21}^2 \\ Y_{22}^1 \\ Y_{22}^2 \end{pmatrix}
=
\begin{pmatrix}
\Sigma & 0_2 & 0_2 & 0_2 \\
0_2 & \Sigma & 0_2 & 0_2 \\
0_2 & 0_2 & \Sigma & 0_2 \\
0_2 & 0_2 & 0_2 & \Sigma
\end{pmatrix}
\]

where 02 is the 2 × 2 matrix consisting exclusively of 0s.

April 17, 2001 28

314
How to ... In SAS

A brief outline about how to work with such problems in SAS.

The response variables are stacked on top of each other in a variable


called Y.

Let R be another variable with levels, say W and I indicating whether


the corresponding measurement in Y is a measurement of weight or
feed intake.

Let K be a variable identifying the subjects (within the treatment),


and let A be the treatment factor.

Then the following SAS program would do the trick:


April 17, 2001 29

proc mixed data=...;
  class R K A;
  model Y = R R*A / noint ddfm=satterth ...;
  repeated R / subject=K*A type=un;
run;

In the REPEATED statement the subject option specifies the blocks
of the covariance matrix (in the example there are 4 blocks).

The option type=un specifies that the blocks should be completely
unstructured.

The variable R in the REPEATED statement is used for identifying the
different response types.
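
As a supplement (not part of the original slides), a minimal sketch of how the stacking could
be done in a SAS data step is shown below. The dataset and variable names (wide, Pig, Gain,
Intake) are hypothetical; the point is only that each record in the wide data (one per pig)
becomes two records in the long data, with R indicating the response type.

data long;
  set wide;                      /* one record per pig: A, Pig, Gain, Intake */
  K = Pig;                       /* subject identifier used in subject=K*A   */
  R = 'W'; Y = Gain;   output;   /* record for weight gain                   */
  R = 'I'; Y = Intake; output;   /* record for feed intake                   */
  keep A K R Y;
run;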

April 17, 2001 30

315
18 Covariate Models and Multivariate Response

The General Setup

More generally,

  E(Yjr ) = xj> β r

where xj are covariates for the jth experimental unit and β r is a
vector of parameters establishing the connection between E(Yjr ) and xj .

More generally,

  E(Yj ) = (E(Yj1 ), E(Yj2 ), . . . , E(YjR )) = xj> [β 1 : β 2 : · · · : β R ] = xj> B

Hence B = [β 1 : β 2 : · · · : β R ] is now a matrix of parameters where
the rth column contains the parameters associated with the rth response.
April 17, 2001 31

If we let Yj = (Yj1 , . . . , YjR ) be a row vector, then

  E(Yj ) = xj> B

is also a row vector.

If the rows of data from all n units are stacked on top of each other
we obtain an n × R matrix

\[
Y = \begin{pmatrix}
Y_1^1 & Y_1^2 & \dots & Y_1^R \\
Y_2^1 & Y_2^2 & \dots & Y_2^R \\
\vdots & \vdots & & \vdots \\
Y_n^1 & Y_n^2 & \dots & Y_n^R
\end{pmatrix}
\]

Similarly the covariates xj> can be stacked on top of each other to
give a design matrix X (with dimension n × p) in the usual way.
April 17, 2001 32

316
The previous considerations then give that

  E(Y )   =   X       B
  (n × R)    (n × p)  (p × R)

i.e. the mean is now organized as a matrix rather than as a vector.
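
To make the layout concrete (this illustration is not in the original slides), in the feeding
example with two treatments and two responses the parameter matrix is

\[
B = [\,\beta^1 : \beta^2\,] =
\begin{pmatrix}
\delta^1 & \delta^2 \\
\alpha_1^1 & \alpha_1^2 \\
\alpha_2^1 & \alpha_2^2
\end{pmatrix},
\]

so that for a unit j receiving treatment i, xj> is (1, 1, 0) or (1, 0, 1), and xj> B reproduces
the pair of means (δ 1 + αi1 , δ 2 + αi2 ).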

April 17, 2001 33

317
18 Covariate Models and Multivariate Response

318
19 Heterogeneous Variance

The purpose of this lecture was to present why it is important to recognize variance heterogeneity,
how to model such heterogeneity, and the consequences of different modelling approaches. The
lecture extends the description in chapter 8 of LMSW (Littell et al., 1996).
Graphical techniques for finding suitable models of variance heterogeneity are presented, and
variance functions including the power family are introduced. In addition, the effect of
transformation is illustrated.
Link to full-screen presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/VarianceStructure.f.pdf

319
19 Heterogeneous Variance

Why Variance Heterogeneity is Important to


Recognize

Frequently the usual assumptions about variance homogeneity are


not met in practice. In that case the variance is said to be
heterogeneous.

One reason for incorporating variance heterogeneity in the model is


the ability to

• downweight portions of data which are highly variable, and

• extract more information from portions of the data which are more
precise.
October 18, 2001 Mixed Models Course 1

As always there is a price to pay:

• The models become less parsimonious in terms of the number of


parameters.

• Fitting the models can be more difficult (numerical problems).

• Usually, only asymptotic inference can be carried out (i.e. no exact


F–tests etc.)

• Model control becomes more complicated.

October 18, 2001 Mixed Models Course 2

320
Graphical Investigation of the Variance Structure

Frequently there is some structure in the way in which the variance
is non–constant:

Frequently the variance increases when the mean increases.

That is, the variance is a function of the mean, symbolically

Var(Y ) = f (E(Y ))

With grouped data, the variance function can sometimes be


identified.
October 18, 2001 Mixed Models Course 3

Example 1. One-way ANOVA:

  Ykl = αk + εkl

where εkl ∼ N (0, σk2 ) for some treatments k = 1, 2, . . . , K and
replicates within treatments l = 1, 2, . . . , Lk .

Good estimates for the mean and variance in the kth group are

• Mean: ȳk.
• Variance: s2k = (1/(Lk − 1)) Σl (ykl − ȳk. )2

A reasonable idea is to plot s2k against ȳk. to see if the variance is a


function of the mean. f in
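
A small sketch (not from the original course material) of how such a plot could be produced
in SAS is given below; the dataset and variable names (a, trt, y) are hypothetical.

proc means data=a noprint;
  class trt;                         /* one group per treatment               */
  var y;
  output out=grp mean=ybar var=s2;   /* group means and variances             */
run;

data grp;
  set grp;
  if _type_ = 1;                     /* keep only the per-treatment rows      */
  logm = log(ybar);                  /* log of the group mean                 */
  logv = log(s2);                    /* log of the group variance             */
run;

proc gplot data=grp;                 /* plot log-variance against log-mean    */
  plot logv*logm;
run;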

October 18, 2001 Mixed Models Course 4

321
19 Heterogeneous Variance

Variance Functions

After having found that the variance is non–constant, the next step
is to look for some structure in which it is non–constant.

This is obtained by considering a particular function for the variance


as a function of the mean.

Frequently in practice one works with the variance function

Var(Y ) = σ 2µθ

where µ = E(Y ), and σ 2 and θ are unknown constants.

Variance functions of this form are called the power family.

October 18, 2001 Mixed Models Course 5

With
Var(Y ) = σ 2µθ
we have a linear relationship on the log–scale:

log Var(Y ) = log σ 2 + θ log µ

Therefore, in the ANOVA example the natural thing to do is to plot


log s2k against log ȳk. and see if the relationship is approximately
linear.

If so, it may be reasonable to assume that we are within the power
family of variance functions – and this is a nice family, as shall soon
be shown.
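
Continuing the hypothetical sketch from above (not part of the original slides), a rough
estimate of θ is then simply the slope in a regression of log s2k on log ȳk. :

proc reg data=grp;
  model logv = logm;   /* slope estimates theta, intercept estimates log(sigma^2) */
run;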

October 18, 2001 Mixed Models Course 6

322
Example 2. A substance X14 has been added in the concentration
fod ∈ {0.0, 4.4, 6.2, 9.3} to the food for some pigs. The pigs are
fed (up!) with this food until their weight is 60 kg. From then on,
and until they are slaughtered at 100 kg, their food does not contain
the substance.

At 60kg (sample=1) and 100kg (sample=2) muscle biopsies are made


and the concentration of the substance is determined.
[Figure: concentration of X14 plotted against fod; 1 = 60 kg sample, 2 = 100 kg sample]

October 18, 2001 Mixed Models Course 7

Plots of the individual points and of log–variance against log–mean indicate
that the variance increases with the mean:

[Figure: X14 against fod for Sample 1 and Sample 2, and log-variance against
log-mean with estimated slope 1.23 (0.25)]

• One possibility is a linear increase with the slope being ≈ 1.


• Another is that there are two variances: one when fod = 0 and
another one when fod ≠ 0.

f in

October 18, 2001 Mixed Models Course 8

323
19 Heterogeneous Variance

From hereof there are different possibilities:

• Transform data onto a scale where the variance is (approximately)


constant

• Include the heterogeneous variance explicitly in the model

October 18, 2001 Mixed Models Course 9

The Delta–method

First we consider transformation of data onto a scale where the


variance is approximately constant.

Let Y be a random variable and let h() be a nice function, e.g.

  h(y) = √y,  h(y) = y 2 ,  h(y) = log y.

We shall investigate the properties of the transformed random


variable Z where
Z = h(Y )

October 18, 2001 Mixed Models Course 10

324
Example 3. Let Y ∼ N (µ, σ 2). If h is linear, i.e. h(y) = α + βy,
then it is well known that

Z = h(Y ) ∼ N (α + βµ, β 2σ 2)

If h is non–linear, e.g. if h(y) = log y then Z is not normally


distributed. f in

• However, Z = h(Y ) will in certain cases be approximately normal


if Y is normal.

• Moreover, one can find the approximate mean and variance of Z


independently of whether Y is normal or not.

October 18, 2001 Mixed Models Course 11

Taylors Approximation

The road to these results can be based on the following argument:

Let x0 and x be two numbers (not too far apart) and assume that h
is “nice” (i.e. differentiable).

Then it is well known from high school that

h(x) ≈ h(x0) + h0(x0)(x − x0).

The further x is from x0 the worse is this approximation.

This approximation is frequently called a Taylor expansion of h


around x0.
October 18, 2001 Mixed Models Course 12

325
19 Heterogeneous Variance

First order Taylor approximation

[Figure: a function f(x) and its first-order Taylor approximation around x0 ]

h(x) ≈ h(x0) + h0(x0)(x − x0).

October 18, 2001 Mixed Models Course 13

Applying Taylors Approximation

Taylors approximation is now applied to the random variable Y with


mean µ = E(Y ) and variance σ 2 = Var(Y ).

The approximation is around µ. We then get

Z = h(Y ) ≈ h(µ) + h0(µ)(Y − µ).

• Hence, when Y is “close to” µ, then h(Y ) is approximately a linear


function of Y .

• Y “being close to” µ means basically that σ 2 has to be small.

October 18, 2001 Mixed Models Course 14

326
• From the approximation

Z = h(Y ) ≈ h(µ) + h0(µ)(Y − µ).

we also conclude that

E(Z) = E(h(Y )) ≈ h(µ)


Var(Z) = Var(h(Y )) ≈ h0(µ)2 Var(Y )

• Hence, if Y is normal then it follows that Z must also be


approximately normal since Z is an approximately linear function
of Y . In this case we therefore conclude

Z = h(Y ) ≈ N (h(µ), h0(µ)2σ 2).
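
As an illustration (not in the original slides), the approximation can be checked by a small
simulation in SAS. With h(y) = log y we have h0(µ) = 1/µ, so the approximate variance of
Z = log Y is σ 2 /µ2 . The values µ = 12 and σ = 1 below are chosen arbitrarily.

data sim;
  mu = 12; sigma = 1;
  do i = 1 to 10000;
    y = mu + sigma*rannor(2001);     /* Y ~ N(mu, sigma^2)            */
    z = log(y);                      /* transformed variable Z = h(Y) */
    output;
  end;
run;

proc means data=sim mean var;
  var z;       /* compare with h(mu) = log(12) = 2.48 and (sigma/mu)^2 = 0.0069 */
run;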


October 18, 2001 Mixed Models Course 15

It must be emphasized that these results are asymptotic results.
How good they are depends on many things, including

• the variance of Y , i.e. how close Y –values tend to be to µ

• the form of h – how “smooth” (that is, how close to being linear)
h is.

October 18, 2001 Mixed Models Course 16

327
19 Heterogeneous Variance

Transformation of Data

The previous results can sometimes be used for identifying


transformations of data onto a scale where the variance is constant.

It is assumed in the following that

E(Yi) = µi and V ar(Yi) = σ 2µθi .

By plotting log–variance against log–mean one can frequently get a


good estimate of θ, and from that one can (sometimes) identify an
appropriate transformation.

October 18, 2001 Mixed Models Course 17

We look for a function h such that Z = h(Y ) has constant variance σZ2 :

• From the previous section we have

  σZ2 = Var(h(Y )) ≈ h0(µ)2 Var(Y ) = h0(µ)2 σ 2 µθ

• If we solve for h0 we get

  h0(µ) ≈ √(σZ2 /σ 2 ) µ−θ/2

• For later use let c = √(σZ2 /σ 2 ). Hence we look for a function h which
satisfies that its derivative is

  h0(µ) = c µ−θ/2 .

October 18, 2001 Mixed Models Course 19

Such an equation is called a differential equation.

The search for h has to be taken in two steps:

When θ = 2: Then h0(µ) = c/µ, and high school knowledge tells us
that the solution is the natural logarithm, i.e.

  h(µ) = c log(µ).

When θ ≠ 2: In this case we need the anti–derivative of a simple
power function. It is then well known from high school that

  h(µ) = c (2/(2 − θ)) µ(2−θ)/2 .
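
As a quick check (this remark is not in the original slides), taking θ = 1 – the Poisson–like
case mentioned below – the general formula gives the familiar square–root transformation:

\[
h(\mu) = c\,\frac{2}{2-\theta}\,\mu^{(2-\theta)/2}\Big|_{\theta=1} = 2c\,\mu^{1/2},
\]

i.e. up to the constant 2c the transformation is h(y) = √y.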

October 18, 2001 Mixed Models Course 20

329
19 Heterogeneous Variance

With Var(Y ) = σ 2µθ there are some well known special cases:

• Note that θ = 0 implies that Var(Y ) = σ 2 .
  (As is the case in Linear Normal Models.)

• Note that σ 2 = θ = 1 implies that Var(Y ) = µ.
  (As is the case in the Poisson distribution.)

• Note that θ = 2 implies that Var(Y ) = σ 2 µ2 .
  (I.e. the coefficient of variation is constant, as is the case in the
  Gamma distribution.)

October 18, 2001 Mixed Models Course 21

Modelling Variance Heterogeneity

As has been seen, transformation of data in an attempt to obtain
variance homogeneity can be a mixed blessing:

• the transformation can ruin the linearity of the mean structure.

• it can be very difficult to report contrasts and their standard errors
on the original scale.

An attractive alternative to transformation is therefore to include


variance heterogeneity in the model.
October 18, 2001 Mixed Models Course 22

330
Consider the pig–feeding example from before and the model

  yis = α + βxi + βs xi + εis

where i is pig, s is sample and xi is the dose given to the ith pig.

• if εis ∼ N (0, σ 2 ) then it is a LNM, i.e. variance homogeneity is assumed.

• if εis ∼ N (0, σx2i ) then we accommodate different variances
corresponding to different doses of x. (Recall that xi can assume
4 different values, so there are 4 different variance parameters.)

• if εis ∼ N (0, σ12 ) when xi = 0.0 and εis ∼ N (0, σ22 ) when xi ≠ 0.0
there are two different variance parameters in the model.
October 18, 2001 Mixed Models Course 23

• if εis ∼ N (0, σx2i ,s ) then we accommodate different variances
corresponding to different doses of x and to the different samples
(hence there are 8 different variance parameters).

October 18, 2001 Mixed Models Course 24

331
19 Heterogeneous Variance

Fitting the models in PROC MIXED:

data biopsi; set biopsi;
  fod_c = fod;
  if fod = 0.0 then fod_c2 = 1;
  else fod_c2 = 2;
run;

title ’Variance homogeneity’;


proc mixed data=biopsi;
class sample fod_c fod_c2;
model x14=fod fod*sample / ddfm=satterth chisq solution outp=o1;
ods output solutionf=sf1;

title ’Variance heterogeneity, 4 variances’;


proc mixed data=biopsi;
class sample fod_c fod_c2;
model x14=fod fod*sample / ddfm=satterth chisq solution outp=o2;
ods output solutionf=sf2;
repeated fod_c/ type=un(1);

October 18, 2001 Mixed Models Course 25

title ’Variance heterogeneity, 2 variances’;


proc mixed data=biopsi;
class sample fod_c fod_c2;
model x14=fod fod*sample / ddfm=satterth chisq solution outp=o3;
ods output solutionf=sf3;
repeated fod_c2/ type=un(1);
run;

October 18, 2001 Mixed Models Course 26

332
Parts of the SAS output is
Variance homogeneity: Residual 0.1262
-2 Res Log Likelihood 51.6
AIC (smaller is better) 53.6
AICC (smaller is better) 53.7
BIC (smaller is better) 55.4

Variance heterogeneity, 4 variances: Cov Parm Estimate


-2 Res Log Likelihood 39.1 UN(1,1) 0.02512
AIC (smaller is better) 47.1 UN(2,2) 0.08855
AICC (smaller is better) 48.0 UN(3,3) 0.1491
BIC (smaller is better) 54.6 UN(4,4) 0.2481

Variance heterogeneity, 2 variances: Cov Parm Estimate


-2 Res Log Likelihood 41.8 UN(1,1) 0.02517
AIC (smaller is better) 45.8 UN(2,2) 0.1592
AICC (smaller is better) 46.1
BIC (smaller is better) 49.6

October 18, 2001 Mixed Models Course 27

The parameter estimates are:


Effect sample Estimate StdErr DF tValue Probt model

Intercept 0.3130 0.09145 46 3.42 0.0013 varhomo


fod 0.1453 0.01735 46 8.38 <.0001 varhomo
fod*sample 1 0.2433 0.01689 46 14.40 <.0001 varhomo
fod*sample 2 0 . . . . varhomo

Intercept 0.2608 0.04468 12.1 5.84 <.0001 varhet1


fod 0.1546 0.01552 41 9.96 <.0001 varhet1
fod*sample 1 0.2489 0.01985 33.3 12.54 <.0001 varhet1
fod*sample 2 0 . . . . varhet1

Intercept 0.2620 0.04489 11.9 5.84 <.0001 varhet2


fod 0.1524 0.01466 44.9 10.39 <.0001 varhet2
fod*sample 1 0.2432 0.01897 34.9 12.82 <.0001 varhet2
fod*sample 2 0 . . . . varhet2

October 18, 2001 Mixed Models Course 28

333
19 Heterogeneous Variance

Heterogeneous Variance for Grouped Data

Example 4. Example 8.2 from LMSW, p. 268.

• The response is the ultrafiltration rate UFR (in ml/hr) of 20 high


flux membrane dialyzers measured at 7 different transmembrane
pressures TMP.

• The measurements are made in vivo and the aim is to characterize


the ultrafiltration characteristics of the membranes.

• The dialyzers are evaluated in vitro using bovine blood and flow
rates QB of either 200 or 300 dl/min.

October 18, 2001 Mixed Models Course 29

[Figure: UFR plotted against TMP for QB = 200 (left) and QB = 300 (right)]

• Plots suggest inhomogeneous variance, and more specifically that


variance increases with the mean.

• The plot also suggest that there might be individual curves for each
membrane, i.e. to consider random regression coefficient models.

October 18, 2001 Mixed Models Course 30

334
The starting point is the 4th degree polynomial model

  yimj = β0 + τi + (β1 + δ1i )ximj + (β2 + δ2i )x2imj
         + (β3 + δ3i )x3imj + (β4 + δ4i )x4imj + εimj

where x is TMP, i denotes QB–level, m is membrane within QB–
level, and j is the jth measurement on the membrane to which the
measurement ximj is associated.

There are 7 measurements on each membrane, so a crude starting
point could be to assume that εim = (εim1 , . . . , εim7 ) follows a
7–dimensional normal distribution,

  εim ∼ N (0, R)

where R is an unstructured 7 × 7 covariance matrix.


October 18, 2001 Mixed Models Course 31

The SAS program employed by LMSW for fitting this model is


proc mixed data=dial;
class qb sub;
model ufr = tmp|tmp|tmp|tmp qb|tmp|tmp|tmp|tmp;
repeated / type=un subject=sub r rcorr;
ods output r=r rcorr=rcorr;
run;

With this program the data are treated as being equidistant in TMP, i.e. the
actual difference between two TMP–measurements is not accounted for.

This becomes transparent if the program is rewritten as


proc mixed data=dial;
class qb sub index;
model ufr = tmp|tmp|tmp|tmp qb|tmp|tmp|tmp|tmp;
repeated index / type=un subject=sub r rcorr;
ods output r=r rcorr=rcorr;
run;

October 18, 2001 Mixed Models Course 32

335
19 Heterogeneous Variance

Some of the SAS output is

Estimated Covariance matrix


2.76 2.90 3.57 3.04 0.36 0.46 0.64
2.90 5.10 6.40 6.38 4.13 3.32 1.16
3.57 6.40 11.15 12.46 8.33 5.44 4.02
3.04 6.38 12.46 18.54 13.38 10.90 7.68
0.36 4.13 8.33 13.38 17.71 13.83 12.04
0.46 3.32 5.44 10.90 13.83 20.31 11.33
0.64 1.16 4.02 7.68 12.04 11.33 19.67

Estimated Correlation matrix


1.00 0.77 0.64 0.43 0.05 0.06 0.09
0.77 1.00 0.85 0.66 0.43 0.33 0.12
0.64 0.85 1.00 0.87 0.59 0.36 0.27
0.43 0.66 0.87 1.00 0.74 0.56 0.40
0.05 0.43 0.59 0.74 1.00 0.73 0.65
0.06 0.33 0.36 0.56 0.73 1.00 0.57
0.09 0.12 0.27 0.40 0.65 0.57 1.00

October 18, 2001 Mixed Models Course 33

f in

• Note that with the model above there are 7 × 8/2 = 28 parameters
in the covariance matrix.

• The variances increase with TMP, and hence the covariances increase
with the differences in TMP.

• Yet, the correlations decrease with the difference in TMP.

• We seek a more parsimonious model describing this correlation
structure.

October 18, 2001 Mixed Models Course 34

336
• A simple AR(1) model in which the ijth element of R is

Rij = σ 2ρ|i−j|

(which has 2 parameters) will clearly not fit to these data.

• A more flexible alternative is the heterogeneous AR(1) model (the


ARH(1) model) in which the ijth element of R is

Rij = σiσj ρ|i−j|

(which has 8 parameters). This model is still much more


parsimonious than the unstructured covariance matrix which
requires 28 parameters.
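
To make the structure concrete (this illustration is not in the original slides), for three
measurement occasions the ARH(1) covariance matrix would be

\[
R = \begin{pmatrix}
\sigma_1^2 & \sigma_1\sigma_2\rho & \sigma_1\sigma_3\rho^2 \\
\sigma_2\sigma_1\rho & \sigma_2^2 & \sigma_2\sigma_3\rho \\
\sigma_3\sigma_1\rho^2 & \sigma_3\sigma_2\rho & \sigma_3^2
\end{pmatrix},
\]

so the variances may differ between occasions while the correlations still decay as ρ|i−j| .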

October 18, 2001 Mixed Models Course 35

The ARH(1) model can be fitted using


proc mixed data=dial;
class qb sub index;
model ufr = tmp|tmp|tmp|tmp qb|tmp|tmp|tmp|tmp;
repeated index / type=arh(1) subject=sub r rcorr;
ods output r=r rcorr=rcorr;
run;

October 18, 2001 Mixed Models Course 36

337
19 Heterogeneous Variance

The empirical and estimated correlation matrix from the ARH(1)


model are close:

Estimated Correlation matrix (ARH(1))


1.00 0.76 0.58 0.44 0.34 0.26 0.20
0.76 1.00 0.76 0.58 0.44 0.34 0.26
0.58 0.76 1.00 0.76 0.58 0.44 0.34
0.44 0.58 0.76 1.00 0.76 0.58 0.44
0.34 0.44 0.58 0.76 1.00 0.76 0.58
0.26 0.34 0.44 0.58 0.76 1.00 0.76
0.20 0.26 0.34 0.44 0.58 0.76 1.00

Estimated Correlation matrix (Unstructured)


1.00 0.77 0.64 0.43 0.05 0.06 0.09
0.77 1.00 0.85 0.66 0.43 0.33 0.12
0.64 0.85 1.00 0.87 0.59 0.36 0.27
0.43 0.66 0.87 1.00 0.74 0.56 0.40
0.05 0.43 0.59 0.74 1.00 0.73 0.65
0.06 0.33 0.36 0.56 0.73 1.00 0.57
0.09 0.12 0.27 0.40 0.65 0.57 1.00

October 18, 2001 Mixed Models Course 37

For the model with the unstructured covariance matrix, a plot of the
residuals against TMP gives some insight:

[Figure: residuals against TMP for the unstructured model, QB = 200 and QB = 300]

• The profiles do not vary randomly around 0 – some profiles are
steadily increasing, others steadily decreasing.
October 18, 2001 Mixed Models Course 38

338
• This suggests that maybe we are not faced with variance
heterogeneity but rather with individual regression coefficients.

• (After all, there is likely to be some variation between the


membranes).

The random regression model is fitted by:


proc mixed data=dial ;
class qb sub index;
model ufr = tmp|tmp|tmp|tmp qb|tmp|tmp|tmp|tmp / outp=o2;
random int tmp tmp*tmp / subject=sub type=un;
run;

October 18, 2001 Mixed Models Course 39

Now there is no tendency for the residuals to be steadily increasing


or decreasing when plotted against TMP.
[Figure: residuals against TMP for the random regression model, QB = 200 and QB = 300]

Yet, the curves are still somewhat “smooth” suggesting that some
within subject variation has yet to be accounted for.

October 18, 2001 Mixed Models Course 40

339
19 Heterogeneous Variance

Power–of–Mean for Data with Covariates

Previously it was discussed that the variance can sometimes be


regarded as a function of the mean.

This was used for

• identifying situations where serious variance heterogeneity was


present

• suggesting transformations of data

Yet, until now the actual structure – the variance as a function of
the mean – has never been used directly.
October 18, 2001 Mixed Models Course 41

Usually when estimating variance/covariance parameters this is done


by subtracting estimates for the mean from the observed data to
give residuals. The residuals are then used for estimating the
variance/covariance parameters.

REML estimation is a clear example of this.

• In the setup in this section the mean and variance parameters are
not estimated separately.

• With this setup, one can capture variance heterogeneity together


with having random regression coefficients in the model

October 18, 2001 Mixed Models Course 42

340
• We consider cases where the variance of the residuals is

  Var(εi ) = σ 2 |µi |θ

such that the R–matrix is diagonal with Rii = σ 2 |µi |θ .

• Since µi = xi> β, the mixed model becomes complicated:

  y = Xβ + Zu + ε

where

  E(Y ) = Xβ
  Var(ε) = R(σ 2 , β, θ) = diag(σ 2 |xi> β|θ )

are both functions of β.


October 18, 2001 Mixed Models Course 43

• Consequently, maximizing the likelihood function is going to be a


very complicated task.

October 18, 2001 Mixed Models Course 44

341
19 Heterogeneous Variance

Yet, it is easy to suggest a heuristic solution to the estimation


problem:

• Suppose we have a provisional estimate β p of β.

• If this estimate is plugged into R, i.e.

  R(σ 2 , β p , θ) = diag(σ 2 |xi> β p |θ ) = R̃(σ 2 , θ)

then R is all of a sudden only a function of σ 2 and the power θ.

• These parameters can be estimated, together with β and the


parameters in Var(u) in PROC MIXED.

• The trick is then to set β p equal to the new estimate for β and
repeat the iteration until the parameters stop changing.
October 18, 2001 Mixed Models Course 45

In LMSW, p. 278 a way of doing it is shown. A simpler way is given


here:

1. First the iteration has to be started:

proc mixed data=dial;


class qb sub;
model ufr = tmp|tmp|tmp|tmp qb|tmp|tmp|tmp|tmp / s;
random int tmp tmp*tmp / type=un sub=sub;
repeated / local;
ods output solutionf=sf covparms=cp;
run;

October 18, 2001 Mixed Models Course 46

342
2. Then the estimated parameters β are used as provisional parameters
in the next iteration (this happens in the repeated statement).
The estimated parameters of Var(u) are used as starting points
for the maximization algorithm (this happens in the parms
statement).
This step is not necessary, but it speeds up the procedure
considerably:

proc mixed data=dial;


class qb sub;
model ufr = tmp|tmp|tmp|tmp qb|tmp|tmp|tmp|tmp / s outp=o3;
random int tmp tmp*tmp / type=un sub=sub;
repeated / local=pom(sf);
parms / pdata=cp;
ods output solutionf=sf1 covparms=cp1;
run;

October 18, 2001 Mixed Models Course 47

3. Finally the provisional estimate β p is set to the recent estimate for


β.
Likewise, the starting values for the parameters in Var(u) are set
to the recently estimated values of these:

proc compare brief data=sf compare=sf1;


var estimate;

data sf; set sf1;


data cp; set cp1;
run;

Now iterate between 2. and 3. until convergence, i.e. until the


parameters in sf and sf1 become very similar.
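
For convenience (this is not part of the original slides) steps 2 and 3 can be wrapped in a small
macro loop. The number of iterations (10 below) is chosen arbitrarily and should be large enough
for the estimates in sf and sf1 to stabilize.

%macro pomloop(niter=10);
  %do it = 1 %to &niter;
    proc mixed data=dial;
      class qb sub;
      model ufr = tmp|tmp|tmp|tmp qb|tmp|tmp|tmp|tmp / s;
      random int tmp tmp*tmp / type=un sub=sub;
      repeated / local=pom(sf);      /* power-of-mean with provisional estimates   */
      parms / pdata=cp;              /* start from the previous covariance values  */
      ods output solutionf=sf1 covparms=cp1;
    run;
    proc compare brief data=sf compare=sf1;  /* monitor the change in estimates */
      var estimate;
    run;
    data sf; set sf1; run;           /* update the provisional estimates */
    data cp; set cp1; run;
  %end;
%mend;
%pomloop(niter=10)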

October 18, 2001 Mixed Models Course 48

343
19 Heterogeneous Variance

Parts of the output from the final iteration is


Covariance Parameter Estimates

Cov Parm Subject Estimate

UN(1,1) sub 3.8360


UN(2,1) sub -5.8353
UN(2,2) sub 28.2501
UN(3,1) sub 1.3778
UN(3,2) sub -8.3312
UN(3,3) sub 2.6970
POM 1.9785
Residual 0.001974

The power is estimated to be 1.9785 ≈ 2 which, in a sense, corresponds
to the case of a constant coefficient of variation.

October 18, 2001 Mixed Models Course 49

Now there is no tendency for the residuals to be steadily increasing


or decreasing when plotted against TMP.

Also the curves are less smooth than before, suggesting that more of
the within-subject variation has now been accounted for.
[Figure: residuals against TMP for the power-of-mean model, QB = 200 and QB = 300]

October 18, 2001 Mixed Models Course 50

344
Some remarks on transformations, the normal approximation
and confidence intervals

Based on 250 receipts for purchases of petrol, li , and the corresponding
records of kilometres driven per tank, ki , the fuel economy

  yi = ki /li ,   i = 1, . . . , 250

expressed in kilometres per litre has been computed.

The histogram and the probit diagram in the top row of the figure below
show that it is reasonable to assume that the yi ’s are realisations of
random variables Yi , where

  Yi ∼ N (µ, σ 2 ),   i = 1, . . . , 250

On the basis of the data one can now construct, for example, a confidence
interval for µ.

For various reasons it is decided to sell the car in the USA, where fuel
economy is usually reported as “gallons per 100 miles”. To keep things
simple we instead consider “litres per 100 km”, namely

  zi = 100 li /ki = 100 (1/yi ).

That is, we transform the data as zi = h(yi ) = 100/yi .

October 18, 2001 Mixed Models Course 52

345
19 Heterogeneous Variance

It is well known that if Yi is normally distributed, then 100/Yi is NOT
normally distributed.

Below, histograms and QQ-plots of Y and Z = h(Y ) = 100/Y are shown.

[Figure: histograms and normal QQ-plots of y (km per litre) and z = 100/y (litres per 100 km)]

One can trace a slightly right-skewed distribution for the zi ’s, but otherwise
the data seem to be reasonably well described by a normal distribution. That
is, with some justification one can work with 100/Yi being approximately
normally distributed.

October 18, 2001 Mixed Models Course 53

The data above are in fact 250 observations simulated from an
N (12, 12 ) distribution.

We shall now illustrate that the approximation to the normal distribution
becomes gradually worse as the standard deviation increases.

We have therefore repeated the above for the standard deviations σ = 2 and
σ = 3. The results are shown below:

[Figure: histograms and normal QQ-plots of y and z for σ = 2 (left) and σ = 3 (right)]

We shall now illustrate what happens to the mean and

October 18, 2001 Mixed Models Course 54

346
the variance of the transformed data.

In the following we let E(Z) = η and Var(Z) = τ 2 . We can then
estimate η and τ 2 directly from the transformed data as the average
and the sample variance, respectively.

Next, note that with h(x) = 100/x we have h0(x) = −100/x2 . From
the results

  E(Z) = E(h(Y )) ≈ h(µ)
  Var(Z) = Var(h(Y )) ≈ h0(µ)2 Var(Y )

we therefore have E(Z) ≈ 100/µ and Var(Z) ≈ 10000 σ 2 /µ4 .

For σ = 1, 2, 3 the numbers are given in the table below.


October 18, 2001 Mixed Models Course 55

It is seen that 100/µ̂ is a good approximation to E(Z) = η, and likewise
10000 σ̂ 2 /µ̂ 4 is a reasonable approximation to Var(Z) = τ 2 , when the
standard deviation is small. It also appears that when the standard deviation
becomes large, in particular the approximation to Var(Z) = τ 2 becomes poor.

Quantity                                  σ = 1    σ = 2    σ = 3
µ̂                                        11.968   11.919   11.962
σ̂ 2                                       1.146    4.007    9.475
η̂                                         8.423    8.658    9.261
τ̂ 2                                       0.588    2.834   18.833
E(Z) = E(h(Y )) ≈ 100/µ̂                   8.355    8.389    8.359
Var(Z) = Var(h(Y )) ≈ 10000 σ̂ 2 /µ̂ 4      0.558    1.985    4.627

Finally, it is noted that in this example η and µ express


October 18, 2001 Mixed Models Course 56

347
19 Heterogeneous Variance

the same thing, namely the fuel economy.

Through the transformation of the data, zi = 100/yi , we get that zi is expressed in
“litres per 100 km”, which then also becomes the unit of E(Z) = η.

The unit of µ is “km per litre”, and therefore the unit of 100/µ is
“litres per 100 km”.

One may therefore discuss whether 100/µ or η is the relevant
quantity. They are estimated differently, the first as 100 times a
reciprocal average and the second as 100 times the average of

October 18, 2001 Mixed Models Course 57

the reciprocal data:

  100/µ̂ = 100 ( (1/n) Σi yi )−1
  η̂ = (1/n) Σi zi = 100 (1/n) Σi (1/yi )

If one decides that the unit “litres per 100 km” is the relevant quantity,
then we have two ways of obtaining it: either as an average of the
transformed data or as a transformation of the mean of the original data.

October 18, 2001 Mixed Models Course 58

348
Transformation and confidence intervals

Assume that the observed data are y1 , . . . , yn and that these, e.g. in order
to obtain variance homogeneity, have been transformed to z1 , . . . , zn by the
transformation h, i.e. zi = h(yi ).

On the transformed scale a statistical analysis has been carried out. Let
θ be the quantity we are interested in. On the basis of the (transformed)
data we obtain an estimate θ̂ of θ together with an estimate σ̂θ of the
standard error of θ̂.

For example, θ could be the slope in a linear regression

  Zi = α + θxi + εi .
October 18, 2001 Mixed Models Course 59

In general, a (1 − α) confidence interval for θ is given by two random
variables Zlow and Zhigh such that the probability that θ lies in the
interval [Zlow , Zhigh ] is 100(1 − α)%.

In many classical linear models a (1 − α) confidence interval is
computed as

  Ẑlow = θ̂ − t1−α/2 (d) σ̂θ
  Ẑhigh = θ̂ + t1−α/2 (d) σ̂θ

where t1−α/2 (d) is the 1 − α/2 quantile of a t–distribution with d degrees of freedom.

If, for example, θ is the slope in a regression as above, then θ expresses
the expected increase in Z when x is increased by one unit.

Often one is interested in expressing the expected


October 18, 2001 Mixed Models Course 60

349
19 Heterogeneous Variance

increase in Y , i.e. on the original scale, when x is increased by one unit.
Popularly speaking, one wants to express θ “on the original scale”.

This is often done as follows. Let h−1 be the inverse function of h.
Then h−1 (θ) is taken as an expression of θ “on the original scale”.

One therefore applies h−1 to the estimated value θ̂, which gives
η̂ = h−1 (θ̂). The confidence limits on the transformed scale can
also be transformed back with h−1 :

If h is strictly increasing, then

  Ŷlow = h−1 (Ẑlow )
  Ŷhigh = h−1 (Ẑhigh )
October 18, 2001 Mixed Models Course 61

and if h is strictly decreasing, then

  Ŷlow = h−1 (Ẑhigh )
  Ŷhigh = h−1 (Ẑlow )

If [Ẑlow , Ẑhigh ] is a 100(1 − α)% confidence interval for θ, then
[Ŷlow , Ŷhigh ] is a 100(1 − α)% confidence interval for h−1 (θ).

Note: [Ẑlow , Ẑhigh ] is symmetric around θ̂, but [Ŷlow , Ŷhigh ] is in
general NOT symmetric around h−1 (θ̂).

If h is approximately linear, then so is h−1 , and in that case
[Ŷlow , Ŷhigh ] becomes approximately symmetric around h−1 (θ̂).

An alternative to the above is the following: The mean and variance on


October 18, 2001 Mixed Models Course 62

350
the transformed scale are approximately given by

  E(Z) = E(h(Y )) ≈ h(E(Y ))
  Var(Z) = Var(h(Y )) ≈ h0(E(Y ))2 Var(Y ).

These can now be solved by means of h−1 . One obtains

  E(Y ) ≈ h−1 (E(Z))
  Var(Y ) ≈ Var(Z) / [h0(E(Y ))]2 = Var(Z) / [h0(h−1 (E(Z)))]2 .

These results can be applied to the parameter θ that we are


October 18, 2001 Mixed Models Course 63

interested in. One then obtains

  η̂ = h−1 (θ̂)
  σ̃η = σ̂θ / |h0(η̂)|

It is now tempting to compute confidence limits for h−1 (θ) as

  Ỹlow = η̂ − t1−α/2 (d) σ̃η
  Ỹhigh = η̂ + t1−α/2 (d) σ̃η

This interval becomes symmetric around η̂.

However, as far as we know there are no good formal arguments for calling
October 18, 2001 Mixed Models Course 64

351
19 Heterogeneous Variance

[Ỹlow , Ỹhigh ] a 100(1 − α)% confidence interval for h−1 (θ). Therefore
[Ŷlow , Ŷhigh ] is generally recommended.

In some cases, however, [Ỹlow , Ỹhigh ] and [Ŷlow , Ŷhigh ] will in fact
be very similar.

This happens when the variation in the data is small. Within a
narrow interval h can then be regarded as roughly linear, whereby the
approximations above become good.

October 18, 2001 Mixed Models Course 65

Example: Assume that the data have been transformed as

  zi = h(yi )

On the basis of the transformed data a regression is fitted,

  Zi = α + βxi + εi

We are interested in a confidence interval for h−1 (β).

From the data we estimate β̂ = 0.25 and σ̂β = 0.03.

We will now compare two ways of computing the intervals. For the
sake of the argument we shall carry out corresponding calculations for
σ̂β = 0.06 and σ̂β = 0.09.
October 18, 2001 Mixed Models Course 66

352
For simplicity we assume that there are so many observations
that the t distribution resembles a normal distribution. Thereby
t1−α/2 (d) ≈ 1.96 for α = 0.05.

First note that

  h(y) = √y = y 1/2 , so that
  h−1 (y) = y 2 and
  h0(y) = 1/(2√y),

and that η̂ = h−1 (β̂) = 0.0625 and h0(η̂) = 2 (check for yourself!).


October 18, 2001 Mixed Models Course 67

For σ̂β = 0.03 we now obtain

  Ẑlow = β̂ − 1.96 σ̂β = 0.19
  Ẑhigh = β̂ + 1.96 σ̂β = 0.31

Transforming these limits back with h−1 gives

  Ŷlow = 0.19² = 0.0361
  Ŷhigh = 0.31² = 0.0961

which is not symmetric around η̂ = h−1 (β̂) (but almost!).


October 18, 2001 Mixed Models Course 68

353
19 Heterogeneous Variance

Under the second method sketched above we must compute

  σ̃η = σ̂β / |h0(η̂)|
      = σ̂β / 2 = 0.015

since σ̂β = 0.03. We now get

  Ỹlow = η̂ − 1.96 σ̃η = 0.0331
  Ỹhigh = η̂ + 1.96 σ̃η = 0.0919.

We thus see that the intervals [Ỹlow , Ỹhigh ] and [Ŷlow , Ŷhigh ] are
very similar.

October 18, 2001 Mixed Models Course 69

For σ̂β = 0.06 completely analogous calculations are carried out, and we find
σ̃η = 0.06/2 = 0.03. This gives

  Ẑlow = β̂ − 1.96 σ̂β = 0.13
  Ẑhigh = β̂ + 1.96 σ̂β = 0.37
  Ŷlow = Ẑlow² = 0.0175
  Ŷhigh = Ẑhigh² = 0.1351
  Ỹlow = η̂ − 1.96 σ̃η = 0.0037
  Ỹhigh = η̂ + 1.96 σ̃η = 0.1213.

We now see that the intervals [Ỹlow , Ỹhigh ] and [Ŷlow , Ŷhigh ] become more
different.

October 18, 2001 Mixed Models Course 70

354
20 Variance heterogeneity: Example of the effect of
transformation

This lecture illustrates the consequences of transformation, based on an analysis of an experiment
investigating the effect of feed concentration on the muscle content of a certain ingredient.
Transformation back to the original scale is discussed, both related to the mean level and to
estimates of treatment effects.
Finally, examples are shown of different scales for usual production traits within animal produc-
tion.
Link to full-screen presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/VariansHetero.f.pdf

355
20 Variansheterogeneity: Example of effect of transformation

Variance homogeneity

• Variance homogeneity

• Transformation as a solution

• Effect of back-transformation.

12. oktober 2001

Variance homogeneity

Yij = µ + αi + εij

εij ∼ N (0, σ 2)

Variance homogeneity is implied by the missing subscript on σ 2

12. oktober 2001

356
3

Variance homogeneity

Herd no. Herd type Observations Herd average


A 1 10 12.3
B 2 10 13.6
C 1 10 10.2
D 2 10 15.0

12. oktober 2001

Variance homogeneity

Herd no. Herd type Observations Herd average


A 1 100 12.3
B 2 100 13.6
C 1 1 10.2
D 2 1 15.0
Weigh according to precision in measurements

12. oktober 2001

357
20 Variansheterogeneity: Example of effect of transformation

Variance of an average

  Ȳ = (1/nobs ) Σi Yi
  V(Ȳ ) = σY2 /nobs

The magnitude of variance inhomogeneity can be assessed by
using this as an analogue.

12. oktober 2001

Example

A certain ingredient is added to the feed ration in the


concentration x, x ∈ {0.0, 4.4, 6.2, 9.3}. The pigs are fed with the
rations until 60 kg. Biopsies are made at 60 kg. Concentration of
the feed ingredient in the biopsy is measured. Let yi denote the
concentration of the ingredient in animal i.

12. oktober 2001

358
7

Mean curve
[Figure: mean curve — y, muscle concentration at 60 kg, against x, feed content]
12. oktober 2001

Transformation ?
[Figure: Variance against Mean (left) and Log(Variance) against Log(Mean) (right)]

12. oktober 2001

359
20 Variansheterogeneity: Example of effect of transformation

Model of expectations

E(y) = µ + αi

E(√y) = µ + αi ⇒ E(y) = µ2 + αi2 + 2µαi
E(log(y)) = µ + αi ⇒ E(y) = exp(µ) exp(αi )

12. oktober 2001

10

Curve fitting

E(y) = µ + β1 x + β2 x2

E(√y) = µ + β1 x + β2 x2
E(log(y)) = µ + β1 x + β2 x2

12. oktober 2001

360
11

Model comparison

Dependent variable          y                        √y
Parameter           Estimate    P-value      Estimate    P-value
β1                   0.438       0.081∗∗∗     0.242       0.026∗∗∗
β2                  -0.007       0.008       -0.010       0.003∗∗

12. oktober 2001

12

Sqrt transformed
[Figure: sqrt(y) (left) and y (right), muscle concentration at 60 kg, against x, feed content]

12. oktober 2001

361
20 Variansheterogeneity: Example of effect of transformation

13

Comparisons
[Figure: two panels comparing fitted curves — y, muscle concentration at 60 kg, against x, feed content]

12. oktober 2001

14

Treatment differences

Very often we are interested in estimating treatment differences, α1 − α2 .

In SAS we may use the PDIFF option in LSMEANS, or ESTIMATE.

How do we transform ??

[Figure: sqrt(y), muscle concentration at 60 kg, against x, feed content]
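
A sketch (not from the original slides) of how the treatment differences could be obtained on
the transformed scale in PROC MIXED is shown below; the variable names (a, trt, sqrty) are
hypothetical, and sqrty = sqrt(y) is assumed to have been computed in a data step. The estimate
statement assumes four treatment levels. Note that a difference of square roots cannot simply
be squared to give a difference on the original scale – which is exactly the problem raised above.

proc mixed data=a;
  class trt;
  model sqrty = trt / solution;
  lsmeans trt / pdiff cl;                  /* all pairwise differences alpha_i - alpha_j */
  estimate 'trt 1 vs trt 2' trt 1 -1 0 0;  /* a single contrast, assuming 4 levels       */
run;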

12. oktober 2001

362
15

Conclusion

• Transformations may achieve variance homogeneity

• Transformations change the model of the mean

• Back-transformation of expected values is OK

• Back-transformation of general estimable functions may cause
problems

12. oktober 2001

16

Natural scales ?

• Geometric cell-count

• Daily gain vs Age at slaughter

• Feed utilisation FU/Gain vs. Gain/FU

• Calvings per cow year vs. Calving interval.

• Feeding interval vs. Feeding frequency

12. oktober 2001

363
20 Variansheterogeneity: Example of effect of transformation

364
21 Variance Homogeneity: Diurnal Variation

The purpose of this lecture was to illustrate the application and combination of some of the
advanced topics presented during the course.
A data set consisting of half-hourly observations of cortisol release in pigs was analysed using a
random regression model to capture the individual differences between pigs in diurnal variation.
The power-of-mean approach was used to model the variance heterogeneity.
The application of such a model requires iterative use of PROC MIXED.
The experience with the model was that it was possible to estimate the model parameters, but
that it was necessary to ’nudge’ the procedure to secure convergence of the iterative calculations,
and that the calculations were very time-consuming. At the current state of the art the
application of such models is not a routine matter.
Link to full-screen presentation1

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/PowerOfMean.f.pdf

365
21 Variance Homogeneity: Diurnal Variation

Example

In an experiment pigs were assigned to two different treatments in
order to study the effect of the treatment on the diurnal release of
cortisol. Cortisol was sampled continuously over a period of
approximately 24 hours for each animal.

  Yijk = µ + αi + Aij + (β1 + B1j ) cos((2π/24) tijk ) + (β2 + B2j ) sin((2π/24) tijk ) + εijk

where Yijk is the logarithmically transformed plasma cortisol, µ the
general mean, αi the effect of treatment, and Aij the random effect of animal j
within treatment i.
May 2, 2001 1

cos((2π/24) tijk ) and sin((2π/24) tijk ) are covariates for estimating the
diurnal variation. βk and Bkj are the corresponding regression
parameters: βk is a systematic effect and Bkj a random deviation
from the line. The random effects (Aij , B1j , B2j )> ∼ N3 (0, V ),
where V is a 3 × 3 variance matrix, and εijk ∼ N (0, σ 2 ).

May 2, 2001 2

366
Random regression model

The model is a random regression model and can be estimated using


the following SAS statements

*Initial model ;
data a ;
....
PI=3.141593 ;
sint=sin(time*2*pi/24) ;
cost=cos(time*2*pi/24) ;

proc mixed CL data=a ;


class beh dyr ;
model Logcort = beh sint cost /ddfm=satterth ;
random intercept sint cost / subject=dyr*kuld*beh type=un ;

May 2, 2001 3

Example results

[Figure: log(Cortisol) against time (hours) for three animals — Dyrnr 17111, 31111 and 35111]

May 2, 2001 4

367
21 Variance Homogeneity: Diurnal Variation

Model of Mean ?

exp(Xβ) = exp( µ + αi + Aij + (β1 + B1j ) cos((2π/24) tijk ) + (β2 + B2j ) sin((2π/24) tijk ) )

May 2, 2001 5

Modelling variance inhomogeneity

The logarithmic transform of cortisol was used because the variance
increased with the mean. Another approach is to model this increase
directly.
Using the so-called power-of-mean method, we use the measured
cortisol level directly, but instead of homogeneous variance we assume

  εijk ∼ N (0, σn2 |Xβ|δ )

and estimate σn2 and δ.

In order to do this it is necessary to perform the calculations with
PROC MIXED iteratively.
May 2, 2001 6

368
SAS Model

*Initial model ;
proc mixed CL data=a ;
class kuld beh dyr ;
model cortisol = beh sint cost /ddfm=satterth s;
random intercept sint cost / subject=dyr*kuld*beh type=un ;
repeated / subject=dyr*kuld*beh local ;
ods output SolutionF=sf ;
ods output Covparms=cp ;
run;

May 2, 2001 7

* Loop ;
proc mixed CL data=a maxiTER=100 CONVH=1e-8;
class kuld beh dyr ;
model cortisol = beh sint cost /ddfm=satterth s;
random intercept sint cost /
subject=dyr*kuld*beh type=un s ;
repeated /local=pom(sf) ;
parms /pdata=cp ;
ods output SolutionF=sf1 ;
ods output SolutionR=Coeff ;
ods output Covparms=cp1 ;
run ;

proc compare brief data=sf compare=sf1 ;


var estimate ;
run;
data sf ; set sf1 ;
data cp ; set cp1 ;
run;

May 2, 2001 8

369
21 Variance Homogeneity: Diurnal Variation

Experience

• δ was estimated as 3.10, indicating that the logarithmic transformation may not be
sufficient to obtain variance homogeneity (y −1/2 )

• Estimation of a single model runs much longer with POM

• It was necessary to adjust the convergence criteria to obtain
convergence

• Approx. 10 iterations were needed.

May 2, 2001 9

370
22 Links to supplementary material

In order to illustrate the underlying principles in linear algebra it was necessary to introduce
a method for performing the calculations. For that purpose the IML procedure of SAS was
introduced using the small program in ImlExample.sas1
Several SAS macros were introduced for performing standard calculations, e.g., a SAS macro
for calculation of autocorrelations2 . The biometry research unit has further SAS macros and
examples on this web-page3 .
The book used for the course, LMSW (Littell et al., 1996), contains a series of program examples.
These examples may be downloaded from SAS Institute's home pages, but they can be found here 4
as well. Another important link is the SAS online manual5 .
Finally, most of the course participants used Word for text processing and SAS for making graphs.
Getting these two programs to interact satisfactorily was clearly a problem. Therefore a short
note, Eksport af grafer fra SAS til Word 6 , was made, with references to the SAS tech. report
ts252x7 , where the export facilities are discussed in detail.

1
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/ImlExample.sas
2
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/SAS/autocorr.sas
3
http://www.jbs.agrsci.dk/Biometri/SASmateriale/SASmateriale.html
4
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/SAS/sasmixed.sas
5
http://dokumentation.agrsci.dk/sasdocv8/sasdoc/sashtml/onldoc.htm
6
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/SAS2Word.pdf
7
http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/ts252x.pdf

371
22 Links to supplementary material

372
Bibliography

Littell, R.C., G.A. Milliken, W.W. Stroup, & R.D. Wolfinger (1996). SAS System for Mixed
Models. SAS Institute, Inc., Cary, NC.

373
