8th edition
Bjarne Kjær Ersbøll, Knut Conradsen
Kgs. Lyngby 2012
Contents

1 Summary of linear algebra
  1.1 Vector space
    1.1.1 Definition of a vector space
    1.1.2 Direct sum of vector spaces
  1.2 Linear transformations and matrices
    1.2.1 Linear transformations
    1.2.2 Matrices
    1.2.3 Linear transformations using matrix formulation
    1.2.4 Coordinate transformation
    1.2.5 Rank of a matrix
    1.2.6 Determinant of a matrix
    1.2.7 Block matrices
  1.3 Pseudoinverse or generalised inverse matrix
  1.4 Eigenvalue problems. Quadratic forms
    1.4.1 Eigenvalues and eigenvectors for symmetric matrices
    1.4.2 Singular value decomposition of an arbitrary matrix. Q- and R-mode analysis
    1.4.3 Quadratic forms and positive semidefinite matrices
    1.4.4 The general eigenvalue problem for symmetrical matrices
    1.4.5 The trace of a matrix
    1.4.6 Differentiation of linear form and quadratic form
  1.5 Tensor or Kronecker product of matrices
  1.6 Inner products and norms

2 Multidimensional variables
  2.1 Moments of multidimensional random variables
    2.1.1 The mean value
    2.1.2 The variance-covariance matrix (dispersion matrix)
    2.1.3 Covariance
  2.2 The multivariate normal distribution
    2.2.1 Definition and simple properties
    2.2.2 Independence and contour ellipsoids
    2.2.3 Conditional distributions
    2.2.4 Theorem of reproductivity and the central limit theorem
    2.2.5 Estimation of the parameters in a multivariate normal distribution
    2.2.6 The two-dimensional normal distribution
  2.3 Correlation and regression
    2.3.1 The partial correlation coefficient
    2.3.2 The multiple correlation coefficient
    2.3.3 Regression
  2.4 The partition theorem
  2.5 The Wishart distribution and the generalised variance

3 The general linear model
  3.1 Estimation in the general linear model
    3.1.1 Formulation of the model
    3.1.2 Estimation in the regular case
    3.1.3 The case of x'Σ⁻¹x singular
    3.1.4 Constrained estimation
    3.1.5 Confidence intervals for estimated values. Prediction intervals
  3.2 Tests in the general linear model
    3.2.1 Test for a lower dimension of model space
    3.2.2 Successive testing in the general linear model

4 Regression analysis
  4.1 Linear regression analysis
    4.1.1 Notation and model
    4.1.2 Correlation and regression
    4.1.3 Analysis of assumptions
    4.1.4 On "influence statistics"
  4.2 Regression using orthogonal polynomials
    4.2.1 Definition and formulation of the model
    4.2.2 Determination of orthogonal polynomials
  4.3 Choice of the "best" regression equation
    4.3.1 The problem
    4.3.2 Examination of all regressions
    4.3.3 Backwards elimination
    4.3.4 Forward selection
    4.3.5 Stepwise regression
    4.3.6 Numerical appendix
  4.4 Other regression models and solutions
    4.4.1 Orthogonal regression (linear functional relationship)
    4.4.2 Regularization and ridge regression
    4.4.3 Non-linear regression and curve fitting

5 Tests in the multidimensional normal distribution
  5.1 Test for mean value
    5.1.1 Hotelling's T² in the one-sample situation
    5.1.2 Hotelling's T² in the two-sample situation
  5.2 The multidimensional general linear model
  5.3 Multivariate analysis of variance (MANOVA)
    5.3.1 One-sided multidimensional analysis of variance
    5.3.2 Two-sided multidimensional analysis of variance
  5.4 Tests regarding variance-covariance matrices
    5.4.1 Tests regarding a single variance-covariance matrix
    5.4.2 Test for equality of several variance-covariance matrices

6 Discriminant analysis
  6.1 Discrimination between two populations
    6.1.1 Bayes and minimax solutions
    6.1.2 Discrimination between two normal populations
    6.1.3 Discrimination with unknown parameters
    6.1.4 Test for best discrimination function
    6.1.5 Test for further information
  6.2 Discrimination between several populations
    6.2.1 The Bayes solution
    6.2.2 The Bayes solution in the case with several normal distributions
    6.2.3 Canonical discriminant analysis

7 Principal components, canonical variables and correlations, and factor analysis
  7.1 Principal components
    7.1.1 Definition and simple characteristics
    7.1.2 Estimation and testing
  7.2 Canonical variables and canonical correlations
    7.2.1 Definition and properties
    7.2.2 Estimation and testing
  7.3 Factor analysis
    7.3.1 Model and assumptions
    7.3.2 Estimation of factor loadings
    7.3.3 Factor rotation
    7.3.4 Computation of the factor scores
    7.3.5 Briefly on maximum likelihood factor analysis
    7.3.6 Q-mode analysis

Bibliography

Index
Chapter 1
Summary of linear algebra
This chapter contains a summary of linear algebra with special emphasis on its use in statistics. The chapter is not intended to be an introduction to the subject. Rather, it is a summary of an already known subject. Therefore we will not give very many examples within the areas typically covered in algebra and geometry courses. However, we will give more examples and sometimes proofs within areas which usually do not receive much attention in all-round courses, but which do enjoy significant use within algebra in statistics.
In the course of analysis of multidimensional statistical problems one often needs to invert non-regular matrices. For instance this is the case if one considers a problem given on a proper subspace of the considered n-dimensional vector space. Instead of just considering the relevant subspace, many authors prefer giving partly algebraic solutions by introducing the pseudoinverse of a non-regular matrix. In order to ease the reading of other literature we will introduce this concept and try to visualise it geometrically.

We note that use of pseudoinverse matrices gives a very convenient way to solve many matrix equations in an algorithmic form.
1.1 Vector space
We start by giving an overview of the definition and elementary properties of the fundamental concept of a linear vector space.
1.1.1 Definition of a vector space
A vector space (over the real numbers) is a set V with a composition rule + : V × V → V, which is called vector addition, and a composition rule R × V → V called scalar multiplication, which obey

i) ∀u, v ∈ V: u + v = v + u (commutative law for vector addition)

ii) ∀u, v, x ∈ V: u + (v + x) = (u + v) + x (associative law for vector addition)

iii) ∃0 ∈ V ∀u ∈ V: u + 0 = u (existence of a neutral element)

iv) ∀u ∈ V ∃(−u) ∈ V: u + (−u) = 0 (existence of an inverse element)

v) ∀λ ∈ R ∀u, v ∈ V: λ(u + v) = λu + λv (distributive law for scalar multiplication)

vi) ∀λ1, λ2 ∈ R ∀u ∈ V: (λ1 + λ2)u = λ1u + λ2u (distributive law for scalar multiplication)

vii) ∀λ1, λ2 ∈ R ∀u ∈ V: λ1(λ2u) = (λ1λ2)u (associative law for scalar multiplication)

viii) ∀u ∈ V: 1u = u
EXAMPLE 1.1. It is readily shown that all ordered n-tuples (x1, …, xn) of real numbers constitute a vector space if the compositions are defined element by element, i.e.

(x1, …, xn) + (y1, …, yn) = (x1 + y1, …, xn + yn)

and

λ(x1, …, xn) = (λx1, …, λxn).

This vector space is denoted Rⁿ.
A vector "pace 1I which IlIlIubset of 1\ vectm' "pace V 1M culled It ,1'ub,lpace In \I. On
the other hand, If we conftllder vectors "1, ' , , , v ~ \ C V, we CUll dellne the linear .Iparl
of those vectors
SplUl{v\, ... ,vd
US the smallest subspace of V, which contains {V1' ... ,Vk}. It is easily shown that
A vector of the form 2.: (YiVi is called a linear combination of the vectors Vi, 'i =
1, ... , k. The above result can then be expressed such that span { v 1, ... , v k} pre
cisely consists of all linear combinations of the vectors VI, ... ,V k Generally we
define
where Ui <;;;; V, as the smallest subspace of V, which contains all Ui , i = 1, ... ,po
A sidesubspace is a set of the form
v + U = {v + ulu E U},
where I J is a subspace in V.
The situation is sketched in fig. 1.1.
Vectors v1, …, vn are said to be linearly independent if the relation

α1v1 + ⋯ + αnvn = 0

implies that

α1 = ⋯ = αn = 0.

In the opposite case they are said to be linearly dependent, and at least one of them can then be expressed as a linear combination of the others.

A basis for the vector space V is a set of linearly independent vectors which span all of V. Any vector can be expressed unambiguously as a linear combination of the vectors in a basis. The number of elements in different bases of a vector space is always the same. If this number is finite it is called the dimension of the vector space and it is written dim(V).
Figure 1.1: Subspace and corresponding side-subspace in R².
EXAMPLE 1.2. Rⁿ has the basis

e1 = (1, 0, …, 0)', …, en = (0, …, 0, 1)'

and is therefore n-dimensional.

In an expression like

v = α1v1 + ⋯ + αnvn,

where {v1, …, vn} is a basis for V, we call the set α1, …, αn the coordinates of v with respect to the basis {v1, …, vn}.
1.1.2 Direct sum of vector spaces

Let V be a vector space (of finite dimension) and let U1, …, Uk be subspaces of V. We then say that V is the direct sum of the subspaces U1, …, Uk, and we write

V = U1 ⊕ ⋯ ⊕ Uk,

if an arbitrary vector v ∈ V in exactly one way can be expressed as

v = u1 + ⋯ + uk,  ui ∈ Ui.   (1.1)

This condition is equivalent to that for vectors ui ∈ Ui the following holds true:

u1 + ⋯ + uk = 0 ⇒ u1 = ⋯ = uk = 0.

This is again equivalent to

dim(span(U1, …, Uk)) = Σ dim Ui = dim V.

Finally, this is equivalent to the intersection of each Ui with the span of the remaining Uj's being {0}. Of course, it is a general condition that span(U1, …, Uk) = V, i.e. that it is at all possible to find an expression like (1.1). It is the unambiguity of (1.1) which implies that we may call the "sum" direct.
W", sketch some examples below in fig. 1.2.
Ir V is partitioned into a direct sum
thon we call any arbitrary vector v's component in Ui for v's projection onto U
i
(by
tho direction determined by lh, ... , [hl, Ui+l, ... , Uk) and we denote it Pi(V). The
IIUUlttlnn Iii Nkctchcd in lig. 1.3.
Th. projection 1', is idempotent, i.e. Pi 0 Pi (v) = Pi (v), "Iv where fog denotes the
oomblnatlon of f and go
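The decomposition and the idempotency of the projection can be checked numerically. The following is a small sketch (in Python with numpy; the subspaces and the vector are illustration values, not taken from the text): a plane U1 and a line U2 forming a direct sum of R³, with the projection onto U2 along U1 computed via the basis-change matrix.

```python
import numpy as np

# Illustration values: R^3 as the direct sum of a plane U1 and a line U2.
# The basis-change matrix S is regular exactly because the sum is direct,
# so every v has unique coordinates z with v = S z.
u1a, u1b = np.array([1., 0., 0.]), np.array([0., 1., 0.])  # basis of U1
u2 = np.array([1., 1., 1.])                                # basis of U2
S = np.column_stack([u1a, u1b, u2])

def p2(v):
    """Projection of v onto U2 along U1 (keep only the U2-component)."""
    z = np.linalg.solve(S, v)
    return z[2] * u2

v = np.array([2., 3., 5.])
assert np.allclose(p2(p2(v)), p2(v))          # p2 is idempotent
assert np.allclose((v - p2(v)) + p2(v), v)    # v = p1(v) + p2(v) uniquely
```

The same construction works for any complementary pair of subspaces; only the columns of S change.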
1.2 Linear transformations and matrices

We start with a section on linear transformations (or linear mappings).
Figure 1.2: Examples of direct sums in R³. Left: U1 ⊕ U2 ⊕ U3 = R³; the sum is direct because for instance dim U1 + dim U2 + dim U3 = 3. Middle: R³ is not a direct sum of U1 and U2, because dim U1 + dim U2 = 4. Right: here U1 ⊕ U2 = R³, because U1 and U2 besides spanning R³ also satisfy U1 ∩ U2 = {0}.
Figure 1.3: Projection of a vector.
1.2.1 Linear transformations

A transformation (or mapping) A : U → V, where U and V are vector spaces, is said to be linear if

∀λ1, λ2 ∈ R ∀u1, u2 ∈ U: A(λ1u1 + λ2u2) = λ1A(u1) + λ2A(u2).

EXAMPLE 1.3. A transformation A : R → R is linear if its graph is a straight line through (0, 0). If the graph is a straight line which does not pass through (0, 0) we say the transformation is affine.
By the nullspace N(A) of a linear transformation A : U → V we mean the subspace

A⁻¹(0) = {u | A(u) = 0}.

The following formula holds, connecting the dimensions of image space and nullspace:

dim N(A) + dim A(U) = dim U.

In particular we have

dim A(U) ≤ dim U,

with equality if A is injective (i.e. one-to-one). If A is bijective we readily see that dim U = dim V. We say that such a transformation is an isomorphism and that U and V are isomorphic.

Figure 1.4: Graphs for a linear and an affine transformation R → R.

It can be shown that any n-dimensional (real) vector space is isomorphic with Rⁿ. In the following we will therefore often identify an n-dimensional vector space with Rⁿ.

It can be shown that the projections mentioned in the previous section are linear transformations.
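The dimension formula above can be illustrated numerically. The sketch below (Python with numpy; the matrix is an illustration value whose third row is the sum of the first two) recovers both the image dimension and a nullspace basis from a singular value decomposition.

```python
import numpy as np

# rank-nullity: dim N(A) + dim A(R^n) = n, shown on a 3x4 illustration
# matrix whose third row equals the sum of the first two.
A = np.array([[1., 0., 2., 1.],
              [0., 1., 1., 3.],
              [1., 1., 3., 4.]])

n = A.shape[1]
_, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-10))       # dim of the image space A(R^n)
null_basis = Vt[rank:]              # remaining right singular vectors span N(A)

assert rank == 2
assert len(null_basis) == n - rank  # dim N(A) = n - dim A(R^n)
for v in null_basis:                # each basis vector really maps to 0
    assert np.allclose(A @ v, 0)
```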
1.2.2 Matrices

By a matrix A we understand a rectangular table of numbers like

    [ a11 ⋯ a1n ]
A = [  ⋮      ⋮  ]
    [ am1 ⋯ amn ]

We will often use the abbreviated notation A = (aij). More specifically we call A an m × n matrix because there are m rows and n columns. If m = 1 then the matrix can be called a row vector, and if n = 1 it can be called a column vector.

The matrix one gets by interchanging rows and columns is called the transposed matrix of A, and we denote it by A' or by Aᵀ, i.e.

     [ a11 ⋯ am1 ]
A' = [  ⋮      ⋮  ]
     [ a1n ⋯ amn ]

An m × n matrix is square if m = n. A square matrix for which A = A' is called a symmetric matrix. The elements aii, i = 1, …, n, are called the diagonal elements.

An especially important matrix is the identity matrix of order n,

I = diag(1, …, 1).

A matrix which has zeroes off the diagonal is called a diagonal matrix. We use the notation

                      [ b1     0 ]
diag(b1, …, bn) =     [    ⋱    ]
                      [ 0     bn ]
For given n × m matrices A and B one defines the matrix sum

        [ a11 + b11 ⋯ a1m + b1m ]
A + B = [     ⋮            ⋮      ]
        [ an1 + bn1 ⋯ anm + bnm ]

Scalar multiplication is defined by

cA = (c aij),

i.e. element-wise multiplication.

For an m × n matrix C and an n × p matrix D we define the matrix product P = C D by having that P is an m × p matrix with the (i, j)'th element

p_ij = Σ_{k=1}^{n} c_ik d_kj.

We note that the matrix product is not commutative, i.e. that C D generally does not equal D C.
For transposition we have the following rules:

(A + B)' = A' + B'
(cA)'    = c A'
(C D)'   = D' C'
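These rules, and the non-commutativity of the product, are easy to confirm numerically. A small sketch (Python with numpy; the matrix sizes and random values are illustration choices):

```python
import numpy as np

# Check the transposition rules on arbitrary test matrices.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
C, D = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
c = 2.5

assert np.allclose((A + B).T, A.T + B.T)   # (A + B)' = A' + B'
assert np.allclose((c * A).T, c * A.T)     # (cA)' = cA'
assert np.allclose((C @ D).T, D.T @ C.T)   # (CD)' = D'C'  (order reverses)

# The product itself is generally not commutative:
M, N = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
assert not np.allclose(M @ N, N @ M)
```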
1.2.3 Linear transformations using matrix formulation

It can be shown that for any linear transformation A : Rⁿ → Rᵐ there is a corresponding m × n matrix A, such that

∀x ∈ Rⁿ: A(x) = A x.

Conversely an A defined by this relation is a linear transformation. A is easily found as the matrix which as columns has the coordinates of the transformations of the unit vectors in Rⁿ. E.g. the i'th column of A is

A(ei) = A ei,  where ei = (0, …, 0, 1, 0, …, 0)' has the 1 in the i'th position.

If we also have a linear transformation B : Rᵐ → Rᵏ with corresponding matrix B (k × m), then we have that B ∘ A corresponds to B A, i.e.

∀x ∈ Rⁿ: B ∘ A(x) = B(A(x)) = B A x.
Here we note that an n × n matrix A is said to be regular if the corresponding linear transformation is bijective. This is equivalent to the existence of an inverse matrix, i.e. a matrix A⁻¹ which satisfies

A A⁻¹ = A⁻¹ A = I,

where I is the identity matrix of order n. Furthermore we have

(A⁻¹)⁻¹ = A
(kA)⁻¹  = k⁻¹ A⁻¹
(A')⁻¹  = (A⁻¹)',

and for invertible matrices A, B we have

(A B)⁻¹ = B⁻¹ A⁻¹.
Figure 1.5: Sketch of the coordinate transformation problem.
A square matrix which corresponds to an idempotent transformation is itself called idempotent. It is readily seen that a matrix A is idempotent if and only if

A A = A.

We note that if an idempotent matrix is regular, then it equals the identity matrix, i.e. the corresponding transformation is the identity.
1.2.4 Coordinate transformation

In this section we give formulas for the matrix formulation of a linear mapping (transformation) when going from one basis to another.

We first consider the change of coordinates going from one coordinate system to another. Normally, we choose not to distinguish between a vector u and its set of coordinates. This gives a simple notation and usually does not lead to confusion. However, when several coordinate systems are involved we do need to be able to make this distinction.

In Rⁿ we consider two coordinate systems (e1, …, en) and (ẽ1, …, ẽn). The coordinates of a vector u in the two coordinate systems are denoted respectively (α1, …, αn)' and (α̃1, …, α̃n)', cf. figure 1.5. Let the "new" system (ẽ1, …, ẽn) be given by

(ẽ1, …, ẽn) = (e1, …, en) S,

i.e.

ẽi = s1i e1 + ⋯ + sni en,  i = 1, …, n.
The columns in the S-matrix are thus equal to the "new" system's "old" coordinates. S is called the coordinate transformation matrix.

REMARK 1.1. However, many references use the expression coordinate transformation matrix about the matrix S⁻¹. It is therefore important to be sure which matrix one is talking about.

Since

u = α1e1 + ⋯ + αnen = α̃1ẽ1 + ⋯ + α̃nẽn

(cf. fig. 1.5), the connection between a vector's "old" and "new" coordinates becomes

α = S α̃  ⇔  α̃ = S⁻¹ α.
We now consider a linear mapping A : Rⁿ → Rᵐ, and let A's matrix formulation w.r.t. the bases (e1, …, en) and (f1, …, fm) be

β = A α,

and the formulation w.r.t. the bases (ẽ1, …, ẽn) = (e1, …, en)S and (f̃1, …, f̃m) = (f1, …, fm)T be

β̃ = Ã α̃.

Then we have

Ã = T⁻¹ A S,

which is readily found by use of the rules of coordinate transformation for the coordinates.
If we are concerned with mappings Rⁿ → Rⁿ and we use the same coordinate transformation in both domain and image space, then we get the relation

Ã = S⁻¹ A S.

The matrices A and Ã = S⁻¹ A S are then called similar matrices.
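A quick numerical sketch of the similarity relation (Python with numpy; the matrices A and S below are illustration values): applying Ã in the new coordinates agrees with applying A in the old ones, and similar matrices share their determinant.

```python
import numpy as np

# Coordinate change for a mapping R^2 -> R^2: columns of the regular S are
# the "new" basis vectors written in "old" coordinates.
A = np.array([[2., 1.],
              [0., 3.]])
S = np.array([[1., 1.],
              [1., 2.]])

A_tilde = np.linalg.inv(S) @ A @ S   # the similar matrix S^-1 A S

alpha_tilde = np.array([1., -1.])    # coordinates of a vector in the new basis
alpha = S @ alpha_tilde              # the same vector in old coordinates

# Same geometric action, expressed in either coordinate system:
assert np.allclose(S @ (A_tilde @ alpha_tilde), A @ alpha)
# Similar matrices share the determinant (and the eigenvalues):
assert np.isclose(np.linalg.det(A_tilde), np.linalg.det(A))
```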
1.2.5 Rank of a matrix

By the rank of a linear transformation A : Rⁿ → Rᵐ we mean the dimension of the image space, i.e.

rk(A) = rank(A) = dim A(Rⁿ).

By the rank of a matrix A we mean the rank of the corresponding linear transformation. We see that rk(A) exactly equals the number of linearly independent column vectors in A. Trivially we therefore have

rk(A) ≤ n.

If we introduce the transposed matrix A' it is easily shown that rk(A) = rk(A'), i.e. we have

rk(A) ≤ min(m, n).

If A and B are two m × n matrices, then

rk(A + B) ≤ rk(A) + rk(B).

This relation is obvious when one remembers that for the corresponding transformations A and B we have (A + B)(Rⁿ) ⊆ A(Rⁿ) + B(Rⁿ).

If A is an (m × n)-matrix and B is a (k × m)-matrix we have

rk(B A) ≤ rk(A).

If B is regular (m × m) we have

rk(B A) = rk(A).

These relations are immediate consequences of the relation dim B(A(Rⁿ)) ≤ dim A(Rⁿ), where we have equality if B is injective. There are of course analogous relations for an (n × p)-matrix C:

rk(A C) ≤ rk(A),
Figure 1.6: A rectangle and its image after a linear projection.
with equality if C is a regular (n × n)-matrix. From these relations we can deduce for regular B and C that

rk(B A C) = rk(A).

Finally we mention that an (n × n)-matrix A is regular if and only if rk(A) = n.
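The rank rules can be confirmed numerically. A sketch (Python with numpy; the rank-deficient matrix is an illustration value, and a random square matrix is regular with probability 1):

```python
import numpy as np

rk = np.linalg.matrix_rank
rng = np.random.default_rng(1)

# A rank-deficient 3x4 illustration matrix (third row = first row):
A = np.array([[1., 2., 0., 1.],
              [0., 1., 1., 0.],
              [1., 2., 0., 1.]])
B = rng.normal(size=(3, 3))          # regular with probability 1
C = rng.normal(size=(4, 4))          # regular with probability 1

assert rk(A) == 2
assert rk(B @ A) <= rk(A)            # rk(BA) <= rk(A) always
assert rk(B @ A) == rk(A)            # equality, since B is regular
assert rk(A @ C) == rk(A)            # regular C on the right
assert rk(B @ A @ C) == rk(A)        # rk(BAC) = rk(A) for regular B, C
```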
1.2.6 Determinant of a matrix

The abstract definition of the determinant of a square p × p matrix A is

det(A) = Σ_σ ± a_{1σ(1)} ⋯ a_{pσ(p)},

where σ runs over all permutations of the numbers 1, …, p, and where we use the + sign if the permutation is even (i.e. it can be composed of an even number of neighbour swaps) and the − sign if it is odd. If confusion with the absolute value of a real number is unlikely we sometimes use the notation |A| = det(A).

We will not go into the background of this definition. We note that the determinant gives the volume ratio of the corresponding linear transformation, i.e. for an (n × n)-matrix A

|det(A)| = vol(A(I)) / vol(I),

where I is an n-dimensional box and A(I) is the image of I (being an n-dimensional parallelepiped) found by the corresponding transformation.
The situation is sketched in 2 dimensions in fig. 1.6. For 2 × 2 and 3 × 3 matrices the definition of the determinant becomes

det [ a  b ] = a d − b c
    [ c  d ]

and

    [ a  b  c ]
det [ d  e  f ] = a e i + b f g + c d h − c e g − a f h − b d i.
    [ g  h  i ]
For determinants of higher order (here n'th order) we can develop the determinant by the i'th row, i.e.

det(A) = Σ_{j=1}^{n} a_ij (−1)^(i+j) det(A_ij),

where A_ij is the matrix we get after deleting the i'th row and the j'th column of A. The number

(−1)^(i+j) det(A_ij)

is also called the cofactor of the element a_ij. Of course an analogous procedure exists for development by columns.
When one explicitly must evaluate a determinant the following rules are handy:

i) interchanging 2 rows (columns) in A multiplies det(A) by −1.

ii) multiplying a row (column) by a scalar multiplies det(A) by the scalar.

iii) multiplying the matrix by a scalar multiplies det(A) by the scalar raised to the power of p.

iv) adding a multiple of a row (column) to another row (column) leaves det(A) unchanged.
When determining the rank of a matrix it can be useful to remember that the rank is the largest number r for which the matrix has a minor of order r with determinant different from 0. We find as a special case that A is regular if and only if det A ≠ 0. This also seems intuitively obvious when one considers the determinant as being the volume ratio. If it is 0 then the transformation must in some sense "reduce the dimension".
For a square matrix A we have

det(A') = det(A).

For square matrices A and B we have

det(A B) = det(A) det(B).

For a diagonal matrix A = diag(λ1, …, λn) we have

det(A) = λ1 ⋯ λn.

For a triangular matrix C with diagonal elements c1, …, cn we have

det(C) = c1 ⋯ cn.
By means of determinants one can directly state the inverse of a regular matrix A. We have

A⁻¹ = (1 / det(A)) (c_ij)',  c_ij = (−1)^(i+j) det(A_ij),

i.e. the inverse of a regular matrix A is the transposed of the matrix we get by substituting each element in A by its cofactor, divided by det A. However, note that this formula is not directly applicable for the inversion of large matrices because of the large number of computations involved in the calculation of determinants.

Something similar is true for Cramer's theorem on solving a linear system of equations: Consider the regular matrix A = (a1, …, an), where ai denotes the i'th column. Then the solution of the equation

A x = b

is given by

x_i = det(a1, …, a_{i−1}, b, a_{i+1}, …, an) / det(A),  i = 1, …, n.
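Cramer's theorem is straightforward to check numerically. The sketch below (Python with numpy; the system is an illustration value) forms each coordinate by replacing the corresponding column with the right-hand side:

```python
import numpy as np

# Cramer's rule: x_i = det(A with column i replaced by b) / det(A).
A = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
b = np.array([1., 2., 3.])

detA = np.linalg.det(A)            # nonzero, so A is regular
x = np.empty(3)
for i in range(3):
    Ai = A.copy()
    Ai[:, i] = b                   # replace the i'th column by b
    x[i] = np.linalg.det(Ai) / detA

assert np.allclose(A @ x, b)                    # Cramer's x solves the system
assert np.allclose(x, np.linalg.solve(A, b))    # agrees with a direct solve
```

As the text notes, this is a theoretical tool; for large systems one uses factorization-based solvers instead of determinants.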
1.2.7 Block matrices

By a block matrix we mean a matrix of the form

    [ B11 ⋯ B1q ]
B = [  ⋮      ⋮  ]
    [ Bp1 ⋯ Bpq ]

where the blocks B_ij are matrices of order m_i × n_j. A block matrix is also called a partitioned matrix.

When adding and multiplying one can use the usual rules of calculation for matrices and just consider the blocks as elements. For instance we find

[ A  B ] [ R ]   [ A R + B S ]
[ C  D ] [ S ] = [ C R + D S ],

under the obvious condition that the involved products exist etc.
First we give a result on determinants of a "triangular" block matrix.

THEOREM 1.1. Let the square matrix A be partitioned into block matrices

A = [ B  C ]
    [ 0  D ]

where B and D are square and 0 is a matrix only containing 0's. Then we have

det(A) = det(B) det(D).

PROOF. We have that

[ B  C ]   [ I  C ] [ B  0 ]
[ 0  D ] = [ 0  D ] [ 0  I ],

where the I's are identity matrices, not necessarily of the same order. If one develops the first matrix on the right-hand side by its 1st row, we see that it has the same determinant as the matrix one gets by deleting the first row and column. By repeating this until the remaining minor is D, we see that

det [ I  C ] = det(D).
    [ 0  D ]

Analogously we find that the last matrix has the determinant det(B), and the result follows.
The following theorem expands this result.

THEOREM 1.2. Let the matrix Σ be partitioned into block matrices

Σ = [ Σ11  Σ12 ]
    [ Σ21  Σ22 ]

Then we have

det(Σ) = det(Σ22) det(Σ11 − Σ12 Σ22⁻¹ Σ21),

under the condition that Σ22 is regular.

PROOF. Since

[ Σ11  Σ12 ]   [ Σ11 − Σ12 Σ22⁻¹ Σ21   Σ12 ] [ I          0 ]
[ Σ21  Σ22 ] = [ 0                      Σ22 ] [ Σ22⁻¹ Σ21  I ],

and the last factor has determinant 1, the result follows immediately from the previous theorem.
REMARK. The matrix

Σ11 − Σ12 Σ22⁻¹ Σ21

is called the Schur complement of the block Σ22.
The last theorem gives a useful result on inversion of matrices which are partitioned into block matrices.

THEOREM 1.3. For the symmetrical matrix

Σ = [ Σ11  Σ12 ]
    [ Σ21  Σ22 ]

we have

Σ⁻¹ = [ B⁻¹       −B⁻¹ A'           ]
      [ −A B⁻¹    Σ22⁻¹ + A B⁻¹ A' ]

where

A = Σ22⁻¹ Σ21
B = Σ11 − Σ12 Σ22⁻¹ Σ21,

conditioned on the existence of the inverses involved.

PROOF. The result follows immediately by multiplication of Σ and Σ⁻¹.
1.3 Pseudoinverse or generalised inverse matrix of a non-regular matrix

We consider a linear transformation

A : E → F,

where E is an n-dimensional and F an m-dimensional (Euclidean) vector space. The matrix corresponding to A is usually called A and it has the dimensions m × n. We let the null space of A equal U, i.e.

U = N(A) = {x | A(x) = 0},

and call its dimension r. The image space

V = A(E)

has dimension s = n − r, cf. section 1.2.1.

We now consider an arbitrary s-dimensional subspace U* ⊆ E, which is complementary to U, and an arbitrary (m − s)-dimensional subspace V* ⊆ F, which is complementary to V.
An arbitrary vector x ∈ E can now be written as

x = u + u*,  u ∈ U and u* ∈ U*,

since u and u* are given by

u  = x − p_U*(x)
u* = p_U*(x).

Here p_U* denotes the projection of E onto U* along the subspace U. Similarly any y ∈ F can be written

y = (y − p_V(y)) + p_V(y) = v* + v,

where

p_V : F → V

is the projection of F onto V along V*.

Since

A(x) = A(u + u*) = A(u*),

we see that A is constant on the side-subspaces

u* + U = {u* + u | u ∈ U},

and it follows that A's restriction to U* is a bijective mapping of U* onto V. This mapping therefore has an inverse
Figure 1.7: Sketch showing the pseudoinverse transformation.
given by

B1 : V → U*,  B1(A(u*)) = u*,  u* ∈ U*.

We are now able to formulate the definition of the pseudoinverse transformation.

DEFINITION 1.1. By a pseudoinverse or generalised inverse transformation of the transformation A we mean a transformation

B = B1 ∘ p_V : F → E,

where p_V and B1 are as mentioned previously.

REMARK 1.2. The pseudoinverse is thus the combination of the projection onto V along V* and the inverse of A's restriction to U*.

REMARK 1.3. The pseudoinverse is of course by no means unambiguous, because we get one for each choice of the subspaces U* and V*.

We can now state some obvious properties of the pseudoinverse in the following

THEOREM 1.4. The pseudoinverse B of A has the following properties:

i) rk(B) = rk(A) = s

ii) A ∘ B = p_V : F → V

iii) B ∘ A = p_U* : E → U*

It can be shown that these properties also characterise pseudoinverse transformations, because we have

THEOREM 1.5. Let A : E → F be linear with rank s. Assume that B : F → E also has rank s, and that A ∘ B and B ∘ A both are projections of rank s. Then B is a pseudoinverse of A as defined above.

PROOF. Omitted (relatively simple exercise in linear algebra).
We now give a matrix formulation of the above mentioned definitions.

DEFINITION 1.2. Let A be an (m × n)-matrix of rank s. An (n × m)-matrix B which satisfies

i) A B idempotent with rank s

ii) B A idempotent with rank s

is called a pseudoinverse or a generalised inverse of A.
By means of the pseudoinverse we can characterise the set of possible solutions of a system of linear equations. This is due to the following

THEOREM 1.6. Let A and B be as in definition 1.2. The general solution of the homogeneous equation

A x = 0

is

x = (I − B A) z,  z ∈ Rⁿ,

and the general solution of the inhomogeneous equation (which is assumed to be consistent)

A x = y

is

x = B y + (I − B A) z,  z ∈ Rⁿ.

PROOF. We first consider the homogeneous equation. A solution x is obviously a point in the nullspace N(A) = A⁻¹(0) of the linear transformation corresponding to A. The matrix B A corresponds (according to theorem 1.4) precisely to the projection onto U*. Therefore I − B A corresponds to the projection onto the nullspace U = N(A). Therefore, an arbitrary x ∈ N(A) can be written

x = (I − B A) z,  z ∈ Rⁿ.

The statement regarding the homogeneous equation has now been proved.

The equation A x = y only has a solution (i.e. is only consistent) if y lies in the image space of A. For such a y we have

A B y = y,

according to theorem 1.4. The result for the complete solution then follows readily.
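Theorem 1.6 is easy to check numerically. The sketch below (Python with numpy) uses a rank-2 illustration matrix and takes the Moore-Penrose inverse as one concrete choice of generalised inverse B; any other generalised inverse would work the same way.

```python
import numpy as np

# General solution of a consistent system A x = y via a generalised
# inverse B:  x = B y + (I - B A) z for arbitrary z.
A = np.array([[1., 1., 2.],
              [2., 1., 1.],
              [2., 1., 1.]])      # rank 2, hence singular
B = np.linalg.pinv(A)            # one concrete generalised inverse

y = A @ np.array([1., 0., 0.])   # consistent right-hand side by construction
for z in [np.zeros(3), np.array([1., 2., 3.]), np.array([-5., 0., 7.])]:
    x = B @ y + (np.eye(3) - B @ A) @ z
    assert np.allclose(A @ x, y)                 # every such x solves A x = y

# (I - B A) z always solves the homogeneous equation:
assert np.allclose(A @ ((np.eye(3) - B @ A) @ np.array([3., 1., 4.])), 0)
```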
In order to illustrate the concept we now give

EXAMPLE 1.4. We consider the matrix

    [ 1  1  2 ]
A = [ 2  1  1 ]
    [ 2  1  1 ]

A obviously has the rank 2. We will consider the linear transformation corresponding to A, which is

A : E → F,

where E and F are 3-dimensional vector spaces with bases {e1, e2, e3} and {f1, f2, f3}. The coordinates w.r.t. these bases are denoted by small x's and y's respectively, such that A has the coordinate formulation y = A x.

First we will determine the nullspace U for A. We have

x ∈ U ⇔ A x = 0
      ⇔ x1 + x2 + 2x3 = 0 ∧ 2x1 + x2 + x3 = 0
      ⇔ x1 = x3 ∧ x2 = −3x1
      ⇔ x = x1 (1, −3, 1)'.

The nullspace is then

U = span{(1, −3, 1)'}.
As complementary subspace U* we choose to consider the orthogonal complement of U. This has the equation

(1, −3, 1) x = 0,

or

U* = {x | x1 − 3x2 + x3 = 0}.

We now consider a new basis for E, namely {u1, u2, u3}, where we can for instance take

u1 = (1, 0, −1)',  u2 = (3, 1, 0)',  u3 = (1, −3, 1)'.

Coordinates in this basis are denoted using small z's. The conversion from z-coordinates to x-coordinates is given by

              [  1  3   1 ]
x = S z,  S = [  0  1  −3 ]
              [ −1  0   1 ]

The columns of the S-matrix are known to be the u's coordinates in the e-system.

A's image space V is 2-dimensional and is spanned by A's columns. We can for instance choose the first two, i.e.

v1 = (1, 2, 2)',  v2 = (1, 1, 1)'.

As complementary subspace V* we choose V's orthogonal complement. This is produced by taking the cross-product of v1 and v2:

v3 = v1 × v2 = (0, 1, −1)'.
We now consider the new basis {v1, v2, v3} for F. The coordinates in this basis are denoted using small w's. The conversion from w-coordinates to y-coordinates is given by

              [ 1  1   0 ]
y = T w,  T = [ 2  1   1 ]
              [ 2  1  −1 ]

We will now find coordinate expressions for A in z- and w-coordinates. Since

y = A x,

we have

T w = A S z,

or

w = T⁻¹ A S z.

Now we have

            [ −2   1   1 ]
T⁻¹ = (1/2) [  4  −1  −1 ]
            [  0   1  −1 ]

wherefore

          [  2  3  0 ]
T⁻¹ A S = [ −3  1  0 ]
          [  0  0  0 ]

Since {u1, u2} spans U* and {v1, v2} spans V, we note that A's restriction to U*, regarded as a mapping onto V, has the coordinate expression

[ w1 ]   [  2  3 ] [ z1 ]
[ w2 ] = [ −3  1 ] [ z2 ]

This mapping has the inverse

[ z1 ]          [ 1  −3 ] [ w1 ]
[ z2 ] = (1/11) [ 3   2 ] [ w2 ]
If we consider the points as points in E and F, and not just as points in U* and V, then we get

                 [ 1  −3  0 ]
z = C w,  C = (1/11) [ 3   2  0 ]   (1.2)
                 [ 0   0  0 ]

since the projection of F onto V along V* has the coordinate formulation

(w1, w2, w3)' → (w1, w2, 0)'.   (1.3)

Formula 1.2 is the z-w coordinate formulation of the pseudoinverse B of the transformation A. However, we want a description in x- and y-coordinates. Since

x = S z,  z = C w,  w = T⁻¹ y,

where C is the matrix in formula 1.2, we therefore have

B = S C T⁻¹

    [  1  3   1 ]        [ 1  −3  0 ]       [ −2   1   1 ]
  = [  0  1  −3 ] (1/11) [ 3   2  0 ] (1/2) [  4  −1  −1 ]
    [ −1  0   1 ]        [ 0   0  0 ]       [  0   1  −1 ]

           [ −8   7   7 ]
  = (1/22) [  2   1   1 ]
           [ 14  −4  −4 ]

This matrix is a pseudoinverse of A.
As is seen from the previous example, it is rather tedious just to use the definition in order to calculate a pseudoinverse. Often one may utilise the following
THEOREM 1.7. Let the m × n matrix A have rank s and let

A = [ C  D ]
    [ E  F ]

where C is regular with dimension s × s. A (possible) pseudoinverse of A is then

A⁻ = [ C⁻¹  0 ]
     [ 0    0 ]

where the 0-matrices have dimensions such that A⁻ has the dimension n × m.

PROOF. Since rk(A) = s, the last n − s columns can be written as linear combinations of the first s columns, i.e. there exists a matrix H, so that

[ D ]   [ C ]
[ F ] = [ E ] H,

or

D = C H
F = E H.

From this we find

F = E C⁻¹ D.

If we insert this in the formula for A, a direct calculation gives

A A⁻ A = A.

By pre- and post-multiplication of this relation with A⁻ we find that A⁻ A and A A⁻ are idempotent, and both have rank s since C is regular. The theorem now follows from definition 1.2.
We illustrate the use of the theorem in the following

EXAMPLE 1.5. We consider the matrix given in example 1.4,

    [ 1  1  2 ]
A = [ 2  1  1 ]
    [ 2  1  1 ]

Since

C = [ 1  1 ],  C⁻¹ = [ −1   1 ]
    [ 2  1 ]         [  2  −1 ]

we can use as pseudoinverse

     [ −1   1  0 ]
A⁻ = [  2  −1  0 ]
     [  0   0  0 ]

The advantage of using the procedure given in example 1.4 instead of the far more simple one given in example 1.5 is that one obtains a precise geometrical description of the situation.
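The construction of theorem 1.7 can be checked against definition 1.2 numerically. A sketch (Python with numpy; the rank-2 matrix with a regular upper-left 2 × 2 block is an illustration value):

```python
import numpy as np

# Theorem 1.7: with a regular s x s upper-left block C, the matrix
# [[C^-1, 0], [0, 0]] is a generalised inverse of A.
A = np.array([[1., 1., 2.],
              [2., 1., 1.],
              [2., 1., 1.]])
s = 2
C = A[:s, :s]
Aminus = np.zeros((3, 3))           # n x m = 3 x 3 here
Aminus[:s, :s] = np.linalg.inv(C)   # [[C^-1, 0], [0, 0]]

AB, BA = A @ Aminus, Aminus @ A
assert np.allclose(AB @ AB, AB) and np.linalg.matrix_rank(AB) == s  # i)
assert np.allclose(BA @ BA, BA) and np.linalg.matrix_rank(BA) == s  # ii)
assert np.allclose(A @ Aminus @ A, A)   # consequence: A A- A = A
```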
REMARK 1.4. Finally, we note that the literature has a number of definitions of pseudoinverses and generalised inverses, so it is necessary to specify exactly which definition is used. A case of special interest is the so-called Moore-Penrose inverse A⁺ of a matrix A. It satisfies the following:

i) A A⁺ A = A

ii) A⁺ A A⁺ = A⁺

iii) (A A⁺)' = A A⁺

iv) (A⁺ A)' = A⁺ A

Many authors reserve the name pseudoinverse for the Moore-Penrose inverse. It can be shown that condition i) is equivalent to the general conditions for being a generalised inverse. A matrix that satisfies i) and ii) is called a g2 inverse. This is often used in estimation in the so-called general linear model. All 4 conditions guarantee that among the least squares solutions of an inconsistent equation one finds the solution with minimal norm. We will not pursue this further here, but only refer the interested reader to the literature, e.g. [22].
1.4 Eigenvalue problems. Quadratic forms

We begin with the fundamental definitions and theorems.
1.4.1 Elg.nvllu Ind eigenvectors for symmetric matrices
The definition of an eigenvector and an eigenvalue given below is valid for arbitrary square matrices. However, in the sequel we will always assume the involved matrices are symmetrical unless explicitly stated otherwise.

An eigenvalue λ of the symmetric n × n matrix A is a solution to the equation

    det(A − λI) = 0.

There are n (real-valued) eigenvalues (some may have equal values). If λ is an eigenvalue, then vectors x ≠ 0 exist such that

    A x = λ x,

i.e. vectors exist which the linear mapping corresponding to A maps to a multiple of themselves. Such vectors are called eigenvectors corresponding to the eigenvalue λ. The number of eigenvalues different from 0 equals rk(A). An eigenvalue is to be counted as many times as its multiplicity indicates. A more interesting theorem is

THEOREM 1.8. If λᵢ and λⱼ are different eigenvalues, and if xᵢ and xⱼ are the corresponding eigenvectors, then xᵢ and xⱼ are orthogonal, i.e. xᵢ'xⱼ = 0.
PROOF. We have

    A xᵢ = λᵢ xᵢ   and   A xⱼ = λⱼ xⱼ.

Here we readily find

    xⱼ' A xᵢ = λᵢ xⱼ'xᵢ,
    xᵢ' A xⱼ = λⱼ xᵢ'xⱼ.

We transpose the first relationship and get

    xᵢ' A' xⱼ = λᵢ xᵢ'xⱼ.

Since A is symmetric this implies that

    (λᵢ − λⱼ) xᵢ'xⱼ = 0,

and since λᵢ ≠ λⱼ we conclude xᵢ'xⱼ = 0.
The result in theorem 1.8 can be supplemented with the following theorem given without proof.
THEOREM 1.9. If λ is an eigenvalue with multiplicity m, then the set of eigenvectors corresponding to λ forms an m-dimensional subspace. This has the special implication that there exist m orthogonal eigenvectors corresponding to λ.
By combining these two theorems one readily sees the following

COROLLARY 1.1. For an arbitrary symmetric matrix A a basis exists for Rⁿ consisting of mutually orthogonal eigenvectors of A.

If such a basis consisting of orthogonal eigenvectors is normed, then one gets an orthonormal basis (p₁, ..., pₙ). If we let P equal the n × n matrix whose columns are the coordinates of these vectors, i.e.

    P = (p₁ ⋯ pₙ),

we get
P'P=I
P is therefore by definition an orthogonal matrix, and

    A P = P Λ,

where Λ is a diagonal matrix with the eigenvalues of A (repeated according to multiplicity) on the diagonal. By means of this we get the following
THEOREM 1.10. Let A be a symmetric matrix. Then an orthogonal matrix P exists such that

    P' A P = Λ,

where Λ is a diagonal matrix with A's eigenvalues on the diagonal (repeated according to multiplicity). As P one can choose a matrix whose columns are orthonormed eigenvectors of A.

PROOF. Obvious from the above relation.
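Theorem 1.10 corresponds directly to NumPy's symmetric eigensolver; a sketch with an arbitrary symmetric matrix (not from the text):

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])  # arbitrary symmetric example

# eigh returns eigenvalues in ascending order and orthonormal
# eigenvectors as the columns of P.
lam, P = np.linalg.eigh(A)

is_orthogonal = bool(np.allclose(P.T @ P, np.eye(3)))            # P'P = I
is_diagonalised = bool(np.allclose(P.T @ A @ P, np.diag(lam)))   # P'AP = Lambda
```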
THEOREM 1.11. Let A be a symmetric matrix with non-negative eigenvalues. Then a regular matrix B exists such that

    B' A B = E,

where E is a diagonal matrix having 0's or 1's on the diagonal. The number of 1's equals rk(A). If A is of full rank then E becomes an identity matrix.

PROOF. By (post-)multiplication of P with a diagonal matrix C which has the diagonal elements

    cᵢ = 1/√λᵢ  if λᵢ > 0,    cᵢ = 1  if λᵢ = 0,

we readily find the theorem with B = P C.
The relation in theorem 1.10 is equivalent to

    A = P Λ P',

i.e. we have the following partitioning of the matrix

    A = λ₁ p₁ p₁' + ⋯ + λₙ pₙ pₙ'.

This partitioning of the symmetrical matrix A is often called its spectral decomposition, since the eigenvalues {λ₁, ..., λₙ} are called the spectrum of the matrix.

With the obvious definition of Λ^(1/2) being diag(√λ₁, ..., √λₙ), we note that for non-negative eigenvalues we can write A = (P Λ^(1/2))(P Λ^(1/2))'. Here we mention that if A is positive definite, then there is a relation

    A = L L',
where L is a lower triangular matrix. This relation is called the Cholesky decomposition of A (see e.g. [26]). It is unique.
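Numerically the Cholesky factor is produced by `numpy.linalg.cholesky`; a sketch with an arbitrary positive definite matrix:

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])  # arbitrary symmetric positive definite example

L = np.linalg.cholesky(A)  # lower triangular factor

factors = bool(np.allclose(L @ L.T, A))
lower_triangular = bool(np.allclose(L, np.tril(L)))
positive_diagonal = bool(np.all(np.diag(L) > 0))  # this convention fixes uniqueness
```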
Finally we have
THEOREM 1.12. Let A be a regular symmetrical matrix. Then A and A⁻¹ have the same eigenvectors corresponding to reciprocal eigenvalues.

PROOF. Let λ be an eigenvalue of A and x be a corresponding eigenvector, i.e.

    A x = λ x.

Since A is regular this is equivalent to

    A⁻¹ x = (1/λ) x,

which concludes the proof.
Finally, we note that

    det A = ∏ᵢ λᵢ.
EXAMPLE 1.6. Orthogonal transformations of the plane. In order to give a geometrical understanding of the transformations which reduce a symmetrical matrix to diagonal form, we describe the orthogonal transformations of the plane.

By utilising the orthogonality condition P'P = I we readily see that the only orthogonal 2 × 2 matrices are matrices of the form

    [ cos α  −sin α ]        [ cos α   sin α ]
    [ sin α   cos α ]  and   [ sin α  −cos α ].

We will now show that these correspond to rotations around the origin and reflections in straight lines.

We do this by determining coordinate expressions for the linear transformations dα and sα, which respectively represent a rotation of the plane by the angle α and a reflection in the line having the angle α with the x₁-axis.
The transformations are illustrated in figure 1.8.

Figure 1.8: Rotation and reflection as determined by the angle α.
If the length of x is equal to 1, we can write x = (cos v, sin v)' and we have

    dα(x) = [ cos(α + v) ]  =  [ cos α cos v − sin α sin v ]
            [ sin(α + v) ]     [ sin α cos v + cos α sin v ]

          = [ cos α  −sin α ] [ cos v ]
            [ sin α   cos α ] [ sin v ],

i.e. dα has the matrix representation

    dα: [ x₁ ]  →  [ cos α  −sin α ] [ x₁ ]
        [ x₂ ]     [ sin α   cos α ] [ x₂ ].

Similarly,

    sα(x) = [ cos(2α − v) ]  =  [ cos 2α cos v + sin 2α sin v ]
            [ sin(2α − v) ]     [ sin 2α cos v − cos 2α sin v ]

          = [ cos 2α   sin 2α ] [ cos v ]
            [ sin 2α  −cos 2α ] [ sin v ],

so sα has the matrix representation

    sα: [ x₁ ]  →  [ cos 2α   sin 2α ] [ x₁ ]
        [ x₂ ]     [ sin 2α  −cos 2α ] [ x₂ ].

This gives the proof of the introductory statement.
It is easy to see that we have the following relations between rotations and reflections of this kind:

    s_{π/4} ∘ dα = s_{π/4 − α/2},
    sα = s_{π/4} ∘ d_{π/2 − 2α}.

The first relation follows from

    [ 0  1 ] [ cos α  −sin α ]  =  [ sin α   cos α ]
    [ 1  0 ] [ sin α   cos α ]     [ cos α  −sin α ]

                                =  [ cos(π/2 − α)   sin(π/2 − α) ]
                                   [ sin(π/2 − α)  −cos(π/2 − α) ].

The last relation is found from the first by substituting α with π/2 − 2α.
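The relations between rotations and reflections are easy to confirm numerically. In the sketch below, `d` and `s` implement the rotation and reflection matrices derived above; the test angle is arbitrary:

```python
import numpy as np

def d(alpha):
    """Rotation of the plane by the angle alpha."""
    return np.array([[np.cos(alpha), -np.sin(alpha)],
                     [np.sin(alpha),  np.cos(alpha)]])

def s(alpha):
    """Reflection in the line through the origin with angle alpha."""
    return np.array([[np.cos(2 * alpha),  np.sin(2 * alpha)],
                     [np.sin(2 * alpha), -np.cos(2 * alpha)]])

a = 0.7  # arbitrary angle
rel1 = bool(np.allclose(s(np.pi / 4) @ d(a), s(np.pi / 4 - a / 2)))
rel2 = bool(np.allclose(s(a), s(np.pi / 4) @ d(np.pi / 2 - 2 * a)))
```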
Part of the following section will be devoted to the problem of generalising the spectral decomposition to an arbitrary matrix.
1.4.2 Singular value decomposition of an arbitrary matrix. Q- and R-mode analysis
We first state the main result, also known as the Eckart-Young theorem.

THEOREM 1.13. Let x be an arbitrary n × p matrix of rank r. Then matrices U (p × r) and V (n × r) with orthonormal columns exist, as do positive numbers γ₁, ..., γᵣ, such that

    x = V Γ U' = γ₁ v₁ u₁' + ⋯ + γᵣ vᵣ uᵣ',

where Γ = diag(γ₁, ..., γᵣ) and v₁, ..., vᵣ are the columns of V and u₁, ..., uᵣ are the columns of U.

PROOF. Omitted. See e.g. [10].

Remark. The numbers γ₁, ..., γᵣ are called x's singular values. The vectors v₁, ..., vᵣ are called the left singular vectors of x and the vectors u₁, ..., uᵣ the right singular vectors. The factorization of x in the theorem is called the singular value decomposition (SVD) of x.
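Numerically the decomposition is available as `numpy.linalg.svd`. Note that NumPy's naming differs from the text's: NumPy's first return value holds the columns called v₁, ..., vᵣ here, and its last return value holds the rows uᵢ'. A sketch with arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 4))  # an n x p matrix, here of full rank 4

# "left" plays the role of V (n x r), gamma holds the singular values,
# and the rows of "right_t" are the vectors u_i'.
left, gamma, right_t = np.linalg.svd(x, full_matrices=False)

# x = sum_i gamma_i v_i u_i'
recon = sum(g * np.outer(v, u) for g, v, u in zip(gamma, left.T, right_t))
svd_ok = bool(np.allclose(recon, x))
```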
In the sequel we will investigate the relationship between x's singular values and the eigenvalue problems for the symmetrical matrices x x' (n × n) and x'x (p × p).
However, first we will state
THEOREM 1.14. For an arbitrary (real-valued) matrix x it holds that x'x and x x' have non-negative eigenvalues and

    rk(x'x) = rk(x x') = rk(x).
PROOF. It suffices to prove the results for x'x. It is obvious that x'x is symmetric, so an orthogonal matrix P exists such that

    P' x'x P = Λ,

i.e.

    (x P)'(x P) = Λ.                                    (1.1)

By letting x P = B = (bᵢⱼ), we find B'B = Λ, i.e.

    λᵢ = Σⱼ bⱼᵢ² ≥ 0,

i.e. x'x has non-negative eigenvalues. Furthermore we see that

    rk(x'x) = card{λᵢ ≠ 0}
            = card{columns bⱼ in B which are ≠ 0}.

Since bᵢ'bⱼ = 0 for i ≠ j (due to equation 1.1) we have

    rk(x'x) = rk(B).

Since P is regular, and using a result in section 1.2.5, we find

    rk(B) = rk(x P) = rk(x).
First a small corollary to the theorem.
COROLLARY 1.2. Let Σ be symmetrical and positive definite. Then for an arbitrary matrix x it holds that

    rk(x' Σ⁻¹ x) = rk(x),

under the condition that the involved products exist.
PROOF. Since Σ⁻¹ is also regular and positive definite, an orthogonal matrix P exists such that

    P' Σ⁻¹ P = Λ,

where Λ is a diagonal matrix with positive diagonal elements. This implies

    Σ⁻¹ = P Λ P' = (P Λ^(1/2))(P Λ^(1/2))' = B B'.

Here Λ^(1/2) denotes the diagonal matrix whose diagonal elements are the square roots of the corresponding elements of Λ. It is obvious that B = P Λ^(1/2) is regular. This relation is inserted and we find

    x' Σ⁻¹ x = x' B B' x = (B'x)'(B'x),

i.e.

    rk(x' Σ⁻¹ x) = rk(B'x) = rk(x),

which concludes the proof.
Using the notation from theorem 1.13 we have
THEOREM 1.15. The matrix x x' (n × n) has r positive eigenvalues and n − r eigenvalues equal to 0. The positive eigenvalues are γ₁², ..., γᵣ², where γ₁, ..., γᵣ are the singular values of x. The corresponding eigenvectors are v₁, ..., vᵣ.

Similarly x'x (p × p) has r positive and (p − r) zero eigenvalues. The positive eigenvalues are γ₁², ..., γᵣ² and the corresponding eigenvectors are u₁, ..., uᵣ.

The positive eigenvalues of x x' and x'x are therefore equal, and the relationship between the corresponding eigenvectors is (m = 1, ..., r)

    vₘ = (1/γₘ) x uₘ

and

    uₘ = (1/γₘ) x' vₘ,
or in more compact notation

    V = x U Γ⁻¹   and   U = x' V Γ⁻¹.

PROOF. Follows by use of the Eckart-Young theorem.
REMARK 1.5. Analysis of the matrix x'x is called R-mode analysis and the analysis of x x' is called Q-mode analysis. These names originate from factor analysis, cf. chapter 7.
REMARK 1.6. The theorem implies that one can find the results for an R-mode analysis from a Q-mode analysis and vice versa. For practical use one should therefore consider which of the matrices x'x and x x' has the lowest order.
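Remark 1.6 and theorem 1.15 can be illustrated numerically: solve the smaller eigenvalue problem for x'x and recover the eigenvectors of x x' through vₘ = (1/γₘ) x uₘ. A sketch with arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(8, 3))  # n = 8, p = 3, so x'x is the smaller matrix

# R-mode: eigenproblem for x'x (p x p), sorted in decreasing order.
lam, U = np.linalg.eigh(x.T @ x)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]
gamma = np.sqrt(lam)  # the singular values, cf. theorem 1.15

# Q-mode eigenvectors of x x' via v_m = (1/gamma_m) x u_m.
V = x @ U / gamma

qmode_ok = all(np.allclose(x @ x.T @ V[:, m], lam[m] * V[:, m]) for m in range(3))
orthonormal = bool(np.allclose(V.T @ V, np.eye(3)))
```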
1.4.3 Quadratic forms and positive semidefinite matrices
In this section we still consider symmetrical matrices only.
By the quadratic form corresponding to the symmetrical matrix A we mean the mapping

    x → x'Ax = Σᵢ aᵢᵢxᵢ² + 2 Σ_{i<j} aᵢⱼxᵢxⱼ.

We say that a symmetrical matrix A is positive definite respectively positive semidefinite if the corresponding quadratic form is positive respectively non-negative for vectors different from the 0-vector, i.e. if

    ∀x ≠ 0 : x'Ax > 0,

respectively

    ∀x ≠ 0 : x'Ax ≥ 0.

We then also say the quadratic form is positive definite respectively positive semidefinite. We have the following
THEOREM 1.16. The symmetrical matrix A is positive definite respectively positive semidefinite if all A's eigenvalues are positive respectively non-negative.

PROOF. With P as in theorem 1.10 and y = P'x we have

    x'Ax = x'PΛP'x = (P'x)'Λ(P'x) = y'Λy = λ₁y₁² + ⋯ + λₙyₙ².
Another useful result is

THEOREM 1.17. A symmetrical n × n matrix A is positive definite if the determinants of all leading principal minors

    dᵢ = det [ a₁₁  ⋯  a₁ᵢ ]
             [  ⋮        ⋮  ]
             [ aᵢ₁  ⋯  aᵢᵢ ],    i = 1, ..., n,

are positive.

PROOF. Omitted.
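Both criteria (theorem 1.16 and theorem 1.17) are straightforward to check numerically; a sketch for an arbitrary example matrix:

```python
import numpy as np

A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])  # arbitrary symmetric example

# Theorem 1.16: all eigenvalues positive.
eigenvalues_positive = bool(np.all(np.linalg.eigvalsh(A) > 0))

# Theorem 1.17: all leading principal minors positive.
minors = [np.linalg.det(A[:i, :i]) for i in range(1, 4)]
minors_positive = bool(all(d > 0 for d in minors))
```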
We now state a very important theorem on extrema of quadratic forms.

THEOREM 1.18. If we let the eigenvalues of the symmetrical matrix A equal λ₁ ≥ ⋯ ≥ λₙ with corresponding eigenvectors p₁, ..., pₙ, and we define

    R(x) = x'Ax / x'x

and

    Mₖ = { x | x'pᵢ = 0,  i = 1, ..., k − 1 },

then it holds that

    sup_x R(x) = R(p₁) = λ₁,
    inf_x R(x) = R(pₙ) = λₙ,
    sup_{x∈Mₖ} R(x) = R(pₖ) = λₖ.
PROOF. An arbitrary vector x can be written

    x = a₁p₁ + ⋯ + aₙpₙ.

If x'pᵢ = 0, i = 1, ..., k − 1, we find a₁ = ⋯ = a_{k−1} = 0, i.e.

    x = aₖpₖ + ⋯ + aₙpₙ.

Therefore we have

    x'Ax = aₖ²λₖ + ⋯ + aₙ²λₙ

and

    R(x) = (aₖ²λₖ + ⋯ + aₙ²λₙ)/(aₖ² + ⋯ + aₙ²).

It is obvious that this expression is maximal for x = pₖ, where it takes the value λₖ. The result with inf is proved analogously.
REMARK 1.7. The theorem says for k = 1 that the unit vector, i.e. the "direction", for which the quadratic form takes its maximal value, is the eigenvector corresponding to the largest eigenvalue. If we only consider the quadratic form on unit vectors which are orthogonal to the eigenvectors corresponding to the k − 1 largest eigenvalues, then the theorem says that the maximum is attained in the direction of the eigenvector corresponding to the k'th largest eigenvalue.

REMARK 1.8. R(x) is also called Rayleigh's coefficient or quotient.
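The extremal property of the Rayleigh quotient is easy to observe numerically; in the sketch below the matrix is an arbitrary example with eigenvalues 4 and 2 and eigenvectors (1, 1)' and (1, −1)':

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])  # eigenvalues 4 and 2

def rayleigh(x):
    return (x @ A @ x) / (x @ x)

rng = np.random.default_rng(3)
samples = [rayleigh(rng.normal(size=2)) for _ in range(1000)]

# R(x) stays between the extreme eigenvalues and attains them at eigenvectors.
bounded = bool(min(samples) >= 2 - 1e-9 and max(samples) <= 4 + 1e-9)
max_at_p1 = bool(np.isclose(rayleigh(np.array([1.0, 1.0])), 4.0))
min_at_p2 = bool(np.isclose(rayleigh(np.array([1.0, -1.0])), 2.0))
```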
We will now describe the level sets for positive definite forms.

THEOREM 1.19. Let A be positive definite. Then the set of solutions to the equation

    x'Ax = c,    c > 0,

is an ellipsoid with principal axes in the directions of the eigenvectors. The first principal axis corresponds to the smallest eigenvalue, the second to the second smallest eigenvalue etc.

Figure 1.9: Illustration showing change of basis. A point S has (e₁, ..., eₙ)-coordinates x and (p₁, ..., pₙ)-coordinates y.
PROOF. We consider the matrix P = (p₁, ..., pₙ), whose columns are the coordinates of orthonormed eigenvectors of A. Assuming y = P'x the following holds

    x'Ax = y'Λy = λ₁y₁² + ⋯ + λₙyₙ².                    (1.4)

The matrix equation

    y = P'x  ⇔  x = P y

corresponds to a change of basis from the original orthonormal basis {e₁, ..., eₙ} to the orthonormal basis {p₁, ..., pₙ}.

This is seen by letting S be a point whose {e₁, ..., eₙ}-coordinates are called x and whose {p₁, ..., pₙ}-coordinates are called y. Then it holds that

    x₁e₁ + ⋯ + xₙeₙ = y₁p₁ + ⋯ + yₙpₙ,

or

    I x = P y,

i.e.

    x = P y,

where I is the unit matrix.
The expression in 1.4 therefore gives the equation of the set of solutions in y-coordinates, corresponding to the coordinate system consisting of orthonormed eigenvectors. This shows that we are dealing with an ellipsoid. The rest of the theorem now follows by noting that the 1st principal axis corresponds to the yᵢ for which 1/λᵢ is maximal, i.e. for which λᵢ is minimal.
REMARK 1.9. If the matrix is only positive semidefinite then the set of solutions to the equation corresponds to an elliptical cylinder. This can be seen by a change of basis to the basis {p₁, ..., pₙ} consisting of orthonormal eigenvectors, where we for simplicity assume that p₁, ..., pᵣ correspond to the eigenvalues which are different from 0. We then have

    x'Ax = c  ⇔  λ₁y₁² + ⋯ + λᵣyᵣ² + 0·yᵣ₊₁² + ⋯ + 0·yₙ² = c
           ⇔  λ₁y₁² + ⋯ + λᵣyᵣ² = c.

This leads to the statement. If we consider the restriction of the quadratic form to the subspace spanned by the eigenvectors corresponding to eigenvalues > 0, then the set of solutions becomes an ellipsoid.
EXAMPLE 1.7. We consider the symmetrical positive definite matrix

    A = [ 2   √2 ]
        [ √2   3 ].

The quadratic form corresponding to A is

    x'Ax = 2x₁² + 2√2 x₁x₂ + 3x₂²,

so the unit ellipse corresponding to A is the set of solutions to the equation

    2x₁² + 2√2 x₁x₂ + 3x₂² = 1.

In order to determine the principal axes we determine A's eigenvalues. We find

    det(A − λI) = 0  ⇔  λ² − 5λ + 4 = 0  ⇔  λ = 4 ∨ λ = 1.

Eigenvectors corresponding to λ = 4 respectively λ = 1 are seen to be of the form

    t(1, √2)'  respectively  t(1, −√2/2)'.

We norm these and get

    p₁ = (1/√3)(1, √2)',    p₂ = (1/√3)(√2, −1)'.

If we choose the base {p₁, p₂}, then the coordinate representation of the quadratic form becomes

    4y₁² + y₂².

The ellipse has the equation

    4y₁² + y₂² = 1.

It is illustrated in figure 1.10.

Figure 1.10: Ellipse determined by the quadratic form given in example 1.7.

Since arctan(√2) ≈ 54.7°, the new coordinate system corresponds to a rotation of the old one by the angle 54.7°.
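The computations of the example can be verified numerically (the matrix is this editor's reading of the garbled print):

```python
import numpy as np

A = np.array([[2.0, np.sqrt(2.0)],
              [np.sqrt(2.0), 3.0]])

lam, P = np.linalg.eigh(A)  # ascending order
eigenvalues_ok = bool(np.allclose(lam, [1.0, 4.0]))

# Angle between the first axis and the eigenvector for lambda = 4.
p1 = P[:, 1] * np.sign(P[0, 1])  # fix the sign of the eigenvector
angle_deg = float(np.degrees(np.arctan2(p1[1], p1[0])))
```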
1.4.4 The general eigenvalue problem for symmetrical matrices
For use with the theory of canonical correlations and in discriminant analysis we will
need a slightly more general concept of eigenvalues than seen in the previous sections.
We introduce the concept in
DEFINITION 1.3. Let A and B be real-valued m × m symmetrical matrices and let B be of full rank. A number λ, for which

    det(A − λB) = 0,

is termed an eigenvalue of A w.r.t. B. For such a λ it is possible to find an x ≠ 0 such that

    A x = λ B x.

Such a vector x is called an eigenvector for A w.r.t. B.
REMARK 1.10. The concepts given above can be traced back to eigenvalues and eigenvectors for the (generally non-symmetrical) matrix B⁻¹A.
THEOREM 1.20. We consider again the situation in definition 1.3 and further let B be positive definite. There are then m real eigenvalues of A w.r.t. B. If A is positive semidefinite, then these will be non-negative, and if A is positive definite then they will be positive.
PROOF. According to theorem 1.11 there is a regular matrix T where

    T' B T = I.

Let

    D = T' A T.

D is obviously symmetrical, and since

    x' D x = (T x)' A (T x),

we see that D and A are at the same time positive semidefinite respectively positive definite.
Now we have

    (D − λI)v = 0  ⇔  (T'AT − λT'BT)v = 0
               ⇔  (A − λB)(Tv) = 0,

where the last equivalence follows since T is regular. From this we deduce that D's eigenvalues equal A's eigenvalues w.r.t. B, and that the eigenvectors of A w.r.t. B are found by using the transformation T on D's eigenvectors. The result regarding the sign of the eigenvalues follows trivially.
THEOREM 1.21. Let the situation be as above. Then a basis exists for Rᵐ consisting of eigenvectors u₁, ..., uₘ of A w.r.t. B. These vectors can be chosen as conjugated vectors both w.r.t. A as well as w.r.t. B, i.e.

    uᵢ' A uⱼ = 0   and   uᵢ' B uⱼ = 0   for i ≠ j.

PROOF. Follows from the proof of the above theorem and from the corollary to theorem 1.9, remembering that

    uᵢ = T vᵢ,

where v₁, ..., vₘ is an orthonormal basis for Rᵐ consisting of eigenvectors of D.
Finally we have
THEOREM 1.22. Let A be symmetrical and let B be positive definite. Then a regular matrix R exists with

    R' A R = Λ = diag(λ₁, ..., λₘ)

and

    R' B R = I,
where λ₁, ..., λₘ are the eigenvalues of A w.r.t. B. If the i'th column of R is termed rᵢ, then these relations can be written

    A rᵢ = λᵢ B rᵢ

and

    rᵢ' B rⱼ = δᵢⱼ.
PROOF. From the proof of theorem 1.20 we consider the matrix D = T' A T. D is symmetrical, so according to theorem 1.10 there exists an orthogonal matrix C with

    C' D C = Λ,

where we have used that D's eigenvalues are A's eigenvalues w.r.t. B.

If we choose R = T C, then we have that

    R' B R = C' T' B T C = C' C = I

and

    R' A R = C' T' A T C = C' D C = Λ.
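The construction in the proofs of theorems 1.20-1.22 translates directly into a numerical procedure: find T with T'BT = I (here from a Cholesky factor of B) and solve the ordinary symmetric eigenproblem for D = T'AT. A sketch with arbitrary matrices:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 1.0]])            # symmetric
B = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # positive definite

L = np.linalg.cholesky(B)             # B = L L'
T = np.linalg.inv(L).T                # T = (L')^(-1), so T'BT = I
lam, C = np.linalg.eigh(T.T @ A @ T)  # D = T'AT, C'DC = Lambda
R = T @ C                             # columns: eigenvectors of A w.r.t. B

simultaneous = bool(np.allclose(R.T @ A @ R, np.diag(lam))
                    and np.allclose(R.T @ B @ R, np.eye(2)))
definition_ok = all(np.allclose(A @ R[:, i], lam[i] * (B @ R[:, i]))
                    for i in range(2))
```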
Finally we state an analogue of theorem 1.18 in the following

THEOREM 1.23. Let A be positive semidefinite and let B be positive definite. Let A's eigenvalues w.r.t. B be λ₁ ≥ ⋯ ≥ λₘ and let v₁, ..., vₘ denote a basis for Rᵐ consisting of the corresponding eigenvectors with vᵢ' B vⱼ = 0 for i ≠ j. We define the general Rayleigh quotient

    R(x) = x'Ax / x'Bx,

and put

    Mₖ = { x | x' B vᵢ = 0,  i = 1, ..., k − 1 }.

Then we obtain
    sup_x R(x) = R(v₁) = λ₁,
    inf_x R(x) = R(vₘ) = λₘ,
    sup_{x∈Mₖ} R(x) = R(vₖ) = λₖ.
PROOF. Without loss of generality the vᵢ's can be chosen so that vᵢ' B vᵢ = 1, and since an arbitrary vector x can be written

    x = a₁v₁ + ⋯ + aₘvₘ,

we find

    R(x) = (a₁²λ₁ + ⋯ + aₘ²λₘ)/(a₁² + ⋯ + aₘ²).

From this the two first statements are easily seen. If x ∈ Mₖ, then x can be written

    x = aₖvₖ + ⋯ + aₘvₘ,

and

    R(x) = (aₖ²λₖ + ⋯ + aₘ²λₘ)/(aₖ² + ⋯ + aₘ²),

which leads to the desired result.
1.4.5 The trace of a matrix
By the term trace of the (square) matrix A we mean the sum of the diagonal elements, i.e.

    tr(A) = Σᵢ₌₁ⁿ aᵢᵢ.

Obviously

    tr(A') = tr(A).

For (square) matrices A and B the following holds
    tr(A B) = tr(B A).                                  (1.5)

Furthermore we have that the trace equals the sum of the eigenvalues, i.e.

    tr(A) = Σᵢ₌₁ⁿ λᵢ.

This follows trivially from 1.5 and theorem 1.10.
For positive semidefinite matrices the trace is therefore another measure of the "size" of a matrix. If the trace is large then at least some of the eigenvalues are large. On the other hand this measure is not sensitive to whether some eigenvalues are 0, i.e. whether the matrix is degenerate. The determinant is sensitive to that, since we recall

    det(A) = ∏ᵢ₌₁ⁿ λᵢ.
We note further that for an idempotent matrix A we have that

    tr(A) = rk(A).

Further we have

    tr(B⁻B) = rk(B),

where B⁻ is an arbitrary pseudoinverse of B.

Finally we note that for a regular matrix S we have that

    tr(S⁻¹ B S) = tr(B).
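These trace rules are all easy to confirm numerically; a sketch with arbitrary matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 4))
B = rng.normal(size=(4, 4))
S = rng.normal(size=(4, 4)) + 4 * np.eye(4)  # regular for this seed

cyclic = bool(np.isclose(np.trace(A @ B), np.trace(B @ A)))          # (1.5)
eigen_sum = bool(np.isclose(np.trace(A), np.sum(np.linalg.eigvals(A)).real))
similarity = bool(np.isclose(np.trace(np.linalg.inv(S) @ B @ S), np.trace(B)))

# For an idempotent matrix the trace equals the rank.
P = np.diag([1.0, 1.0, 0.0, 0.0])  # projection onto a 2-dimensional subspace
idempotent_rank = bool(np.isclose(np.trace(P), np.linalg.matrix_rank(P)))
```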
1.4.6 Differentiation of linear form and quadratic form
Let f : Rⁿ → R. We will use the following notation for the vector of partial derivatives

    ∂f/∂x = (∂f/∂x₁, ..., ∂f/∂xₙ)'.

The following theorem holds for differentiation of certain forms.

THEOREM 1.24. For a symmetrical (n × n) matrix A and an arbitrary n-dimensional vector b it holds that
i) ∂/∂x (b'x) = b

ii) ∂/∂x (x'x) = 2x

iii) ∂/∂x (x'Ax) = 2Ax.

PROOF. The proofs of i) and ii) are trivial. iii) is (strangely enough) proved most easily by means of the definition. For an arbitrary vector h we have that

    (x + h)'A(x + h) = x'Ax + h'Ah + 2h'Ax.

By choosing h = (0, ..., h, ..., 0)' we see that

    ∂/∂xᵢ (x'Ax) = 2(Ax)ᵢ,

and the result follows readily.
We will illustrate the use of the theorem in the following
EXAMPLE 1.8. We want to find the minimum of the function

    g(θ) = (y − Aθ)'B(y − Aθ),

where y, A and B are given and B is further positive semidefinite (and symmetrical). Since g(θ) is convex (a paraboloid, possibly degenerate), the point corresponding to the minimum is found by solving the equation

    ∂g/∂θ = 0.

First we rewrite g. We have that

    g(θ) = y'By − θ'A'By + θ'A'BAθ − y'BAθ
         = y'By − 2y'BAθ + θ'A'BAθ.

Here we have used that

    θ'A'By = y'BAθ

(both are 1 × 1 matrices, i.e. scalars, and each other's transposed). From this follows that

    ∂g/∂θ = −2A'By + 2A'BAθ,

and it is seen that

    ∂g/∂θ = 0  ⇔  A'BAθ = A'By.

This equation has, as mentioned, always at least one solution. If A'BA is regular then we have

    θ_min = (A'BA)⁻¹A'By.

If the matrix is singular, then we can write

    θ_min = (A'BA)⁻A'By,

where (A'BA)⁻ denotes a pseudoinverse of A'BA.
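The result of the example can be sketched numerically. With regular A'BA the solution of the normal equations makes the gradient vanish and minimises g (all data below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(10, 3))
y = rng.normal(size=10)
B = np.diag(rng.uniform(0.5, 2.0, size=10))  # positive definite weights

# Normal equations A'BA theta = A'By.
theta = np.linalg.solve(A.T @ B @ A, A.T @ B @ y)

def g(t):
    r = y - A @ t
    return float(r @ B @ r)

# The gradient -2A'By + 2A'BA theta vanishes at the solution ...
grad = -2 * A.T @ B @ y + 2 * A.T @ B @ A @ theta
stationary = bool(np.allclose(grad, 0))

# ... and random perturbations cannot decrease g (convexity).
minimal = all(g(theta + 1e-3 * rng.normal(size=3)) >= g(theta) for _ in range(20))
```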
We are now able to find an alternative description of the principal axes of an ellipsoid, due to

THEOREM 1.25. Let A be a positive definite symmetrical matrix. The principal directions of the ellipsoid E_c with the equation

    x'Ax = c,    c > 0,

are those directions where x'x, x ∈ E_c, has stationary points.

PROOF. We may assume that c = 1. We then need to find the stationary points for

    f(x) = x'x

with the condition that

    x'Ax = 1.

We apply a Lagrange multiplier technique and define

    ψ(x, λ) = x'x − λ(x'Ax − 1).
By differentiation we obtain

    ∂ψ/∂x = 2x − 2λAx.

If this quantity is to equal 0, then

    x = λAx,

or

    Ax = (1/λ)x,

i.e. x must be an eigenvector.
1.5 Tensor or Kronecker product of matrices
It is an advantage to use this product when treating the multidimensional general linear
model.
DEFINITION 1.4. Let A be an m × n matrix and let B be a k × l matrix. By the term tensor or Kronecker product of A and B we mean the km × ln matrix

    A ⊗ B = (aᵢⱼ B).                                    (1.6)

This concept corresponds to the tensor product of linear mappings, which can be stated independently of coordinate system (see e.g. [3]). If this is introduced in coordinate form then we can either use 1.6 or, equivalently, A ⊗ B = (A bᵢⱼ). This only corresponds to changing the order of the coordinates, i.e. to changing rows and columns in the respective matrices.
We briefly give some rules of calculation for the tensor product. These are proved trivially by means of the definition.

i) (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)

ii) (A₁ + A₂) ⊗ B = A₁ ⊗ B + A₂ ⊗ B

iii) A ⊗ (B₁ + B₂) = A ⊗ B₁ + A ⊗ B₂

iv) αA ⊗ βB = αβ(A ⊗ B)

v) A₁A₂ ⊗ B₁B₂ = (A₁ ⊗ B₁)(A₂ ⊗ B₂)

vi) (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹, if the inverses exist

vii) (A ⊗ B)⁻ = A⁻ ⊗ B⁻

viii) (A ⊗ B)' = A' ⊗ B'

ix) Let A be symmetrical and p × p with eigenvalues α₁, ..., αₚ and eigenvectors x₁, ..., xₚ, and let B be symmetrical and q × q with eigenvalues β₁, ..., β_q and eigenvectors y₁, ..., y_q. Then A ⊗ B will have the eigenvalues αᵢβⱼ, i = 1, ..., p, j = 1, ..., q, with corresponding eigenvectors xᵢ ⊗ yⱼ.

x) det(A ⊗ B) = (det A)^q (det B)^p
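Several of the rules are easy to confirm with `numpy.kron`; a sketch for rules v), viii) and x) with arbitrary matrices (p = 2, q = 3):

```python
import numpy as np

rng = np.random.default_rng(6)
A, A2 = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))   # p = 2
B, B2 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))   # q = 3

rule_v = bool(np.allclose(np.kron(A @ A2, B @ B2),
                          np.kron(A, B) @ np.kron(A2, B2)))
rule_viii = bool(np.allclose(np.kron(A, B).T, np.kron(A.T, B.T)))
rule_x = bool(np.isclose(np.linalg.det(np.kron(A, B)),
                         np.linalg.det(A) ** 3 * np.linalg.det(B) ** 2))
```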
1.6 Inner products and norms
For n-dimensional vectors we note that the inner product or scalar product or dot product of x and y is defined by

    x · y = x'y = x₁y₁ + ⋯ + xₙyₙ,

and we note that x and y are orthogonal if and only if

    x · y = x'y = 0.

The corresponding norm is

    ||x|| = √(x'x) = √(x₁² + ⋯ + xₙ²).

We note that ||x − y|| represents the Euclidean distance between the points x and y.
For orthogonal vectors x and y (i.e. x ⊥ y) we have the Pythagorean theorem

    ||x + y||² = (x + y)'(x + y)
               = x'x + x'y + y'x + y'y
               = x'x + y'y
               = ||x||² + ||y||².

As indicated in the figure, we note that the (orthogonal) projection p(x) of a vector x onto a subspace U can be determined by means of the norm, since p(x) is the vector in U satisfying

    ||x − p(x)|| = min_{z∈U} ||x − z||.

PROOF. Due to the Pythagorean theorem we have that

    ||x − p(x)||² + ||z − p(x)||² = ||x − z||²,

i.e. the minimal value of ||x − z||², and therefore of ||x − z||, is achieved for z = p(x).
It is easy to show that the validity of the above results only depends on the following properties of the inner product. If we denote the inner product of x and y by (x|y), then they are

IP1: (x|y) = (y|x)

IP2: (x + y|z) = (x|z) + (y|z)

IP3: (kx|y) = k(x|y)

IP4: x ≠ 0 ⇒ (x|x) > 0.
For an arbitrary bilinear form (·|·) which satisfies the above one can define a concept of orthogonality by

    x ⊥ y  ⇔  (x|y) = 0.

For an arbitrary positive definite symmetrical matrix A we can define an inner product by

    (x|y)_A = x'Ay.

It is trivial to prove that IP1-4 are satisfied for this inner product. For the corresponding norm given by

    ||x||_A = √((x|x)_A) = √(x'Ax),

we will, whenever it does not lead to confusion, use the notation (x|y) and ||x||.
We note that the set of points with constant A-norm equal to 1 is the set

    { x | ||x||²_A = 1 } = { x | x'Ax = 1 },

i.e. the points on an ellipsoid.

Conversely, to any non-degenerate ellipsoid there is a corresponding positive definite matrix A, so

    R = { x | x'Ax = 1 } = { x | ||x||²_A = 1 }.

In this way we have brought about a connection between the set of possible inner products and the set of ellipsoids.

Two vectors x and y are orthogonal (with respect to A), i.e.

    x'Ay = 0,

if and only if x and y are conjugate directions in the ellipsoid corresponding to A.
It is also possible to introduce angles by means of the definition

    cos v = (x|y) / (||x|| ||y||).
We now give a lemma which we will need for the theorems on independence of projections of normally distributed stochastic variables.

LEMMA 1.1. Let Rⁿ be partitioned in a direct sum

    Rⁿ = U₁ ⊕ ⋯ ⊕ Uₖ

of nᵢ-dimensional subspaces that are orthogonal w.r.t. the positive definite matrix Σ⁻¹, i.e.

    i ≠ j,  u ∈ Uᵢ,  v ∈ Uⱼ  ⇒  u'Σ⁻¹v = 0.

For i = 1, ..., k we let the projection pᵢ onto Uᵢ be given by the matrix Cᵢ. Then

    Cᵢ Σ Cⱼ' = 0    for all i ≠ j.

Furthermore, we have

    Cᵢ Σ Cᵢ' = Cᵢ Σ = Σ Cᵢ'.

PROOF. Since pᵢ ∘ pᵢ = pᵢ, we have

    Cᵢ Cᵢ = Cᵢ,

and since x − pᵢ(x) is orthogonal to Uᵢ w.r.t. Σ⁻¹ (cf. the illustration), we have for all x and y

    pᵢ(y)' Σ⁻¹ (x − pᵢ(x)) = 0,

i.e.

    y' Cᵢ' Σ⁻¹ [x − Cᵢ x] = 0.

This holds for all x and y, and therefore

    Cᵢ' Σ⁻¹ (I − Cᵢ) = 0,

or

    Cᵢ' Σ⁻¹ = Cᵢ' Σ⁻¹ Cᵢ.

The right hand side of this equation is obviously symmetrical, so that

    Cᵢ' Σ⁻¹ = Σ⁻¹ Cᵢ.

By pre- and postmultiplication with Σ we get

    Σ Cᵢ' = Cᵢ Σ,

so

    Cᵢ Σ Cᵢ' = Cᵢ Cᵢ Σ = Cᵢ Σ.

For i ≠ j the orthogonality of Uᵢ and Uⱼ w.r.t. Σ⁻¹ gives, for all x and y,

    pᵢ(y)' Σ⁻¹ pⱼ(x) = y' Cᵢ' Σ⁻¹ Cⱼ x = 0,

i.e.

    Cᵢ' Σ⁻¹ Cⱼ = 0,

and therefore

    Cᵢ Σ Cⱼ' = Σ Cᵢ' Σ⁻¹ Cⱼ Σ = 0.

Finally, since the sum is direct, for all x it holds that

    x = p₁(x) + ⋯ + pₖ(x),

i.e. (C₁ + ⋯ + Cₖ)x = x. Since x is arbitrary, this implies

    C₁ + ⋯ + Cₖ = I.
Chapter 2
Multidimensional variables
In this chapter we start by supplementing the results on multidimensional random variables given in chapter 0, volume 1. Then we discuss the multivariate normal distribution and distributions derived from it. Finally we briefly describe the special considerations that estimation and testing give rise to.
2.1 Moments of multidimensional random variables
We start with
2.1.1 The mean value
Let there be given a random (or stochastic) matrix, i.e. a matrix where the single elements are random (stochastic) variables:

    X = (Xᵢⱼ),    i = 1, ..., k,    j = 1, ..., n.

We then define the mean value, or the expectation value, or the expected value of X as

    E(X) = (E(Xᵢⱼ)).
THEOREM 2.1. Let A be a k × n matrix of constants. Then
E(A + X) = A + E(X).
This theorem follows trivially from the definition as does the following.
THEOREM 2.2. Let A and B be constant matrices, so that A X and X B exist.
Then
    E(A X) = A E(X),
    E(X B) = E(X) B.
Finally we have
THEOREM 2.3. Let X and Y be stochastic matrices of the same dimensions. Then
E(X + Y) = E(X) + E(Y).
REMARK 2.1. We have not mentioned that we of course assume that the involved expected values exist. This is assumed here and in all of the following where expected values are mentioned.
2.1.2 The variancecovariance matrix (dispersion matrix).
The generalisation of the variance of a stochastic variable is the variance-covariance matrix (or dispersion matrix) for a multidimensional random (stochastic) variable X = (X₁, ..., Xₙ)'. It is defined by
    D(X) = Σ = E{(X − μ)(X − μ)'},

where

    μ = E(X).
It should be noted that D(X) often is also called the covariance matrix and is then denoted Cov(X). However, this is a bit misleading, since it could be misunderstood as the covariance between two (multidimensional) stochastic variables. Another commonly used notation is V(X). Furthermore, we note that
    Σ = E { [ (X₁−μ₁)²          (X₁−μ₁)(X₂−μ₂)   ⋯  (X₁−μ₁)(Xₙ−μₙ) ]
            [ (X₂−μ₂)(X₁−μ₁)    (X₂−μ₂)²         ⋯  (X₂−μ₂)(Xₙ−μₙ) ]
            [ ⋮                                      ⋮              ]
            [ (Xₙ−μₙ)(X₁−μ₁)    (Xₙ−μₙ)(X₂−μ₂)   ⋯  (Xₙ−μₙ)²       ] },
i.e. the variance-covariance matrix's (i, j)'th element is Cov(Xᵢ, Xⱼ), or

    Σ = D(X) = [ V(X₁)        Cov(X₁, X₂)  ⋯  Cov(X₁, Xₙ) ]
               [ Cov(X₂, X₁)  V(X₂)        ⋯  Cov(X₂, Xₙ) ]
               [ ⋮                             ⋮           ]
               [ Cov(Xₙ, X₁)  Cov(Xₙ, X₂)  ⋯  V(Xₙ)       ].
We will often use the following notation

    Σ = [ σ₁²  σ₁₂  ⋯  σ₁ₙ ]     [ σ₁₁  σ₁₂  ⋯  σ₁ₙ ]
        [ σ₂₁  σ₂²  ⋯  σ₂ₙ ]  =  [ σ₂₁  σ₂₂  ⋯  σ₂ₙ ]
        [ ⋮             ⋮   ]     [ ⋮             ⋮   ]
        [ σₙ₁  σₙ₂  ⋯  σₙ² ]     [ σₙ₁  σₙ₂  ⋯  σₙₙ ],

i.e. the variances can be denoted both as σᵢ² and as σᵢᵢ. We note that Σ is symmetric.
More interesting is the following
THEOREM 2.4. The variance-covariance matrix Σ for a multidimensional random variable is positive semidefinite.
PROOF. For any vector y we have

    y'Σy = y' E{(X − μ)(X − μ)'} y
         = E{y'(X − μ)(X − μ)'y}
         = E{[(X − μ)'y]'[(X − μ)'y]}
         ≥ 0,

since the expression inside the last expectation is non-negative.
There exist theorems which are analogous to the ones known for one-dimensional stochastic variables.
THEOREM 2.5. Let X and Y be independent. Then
D(X + Y) = D(X) + D(Y).
Let b be a constant. Then we have
D(b + X) = D(X).
If A is a constant matrix, so that A X exists, then the following holds
D(AX) = A D(X)A'.
PROOF. The first relation comes from

    Cov(Xᵢ + Yᵢ, Xⱼ + Yⱼ) = Cov(Xᵢ, Xⱼ) + Cov(Xᵢ, Yⱼ) + Cov(Yᵢ, Xⱼ) + Cov(Yᵢ, Yⱼ)
                          = Cov(Xᵢ, Xⱼ) + Cov(Yᵢ, Yⱼ),

since Cov(Yᵢ, Xⱼ) = Cov(Xᵢ, Yⱼ) = 0, because X and Y are independent. The second relation is trivial. The last one comes from
    D(A X) = E{(A X − A μ)(A X − A μ)'}
           = E{A [X − μ][X − μ]' A'}
           = A E{[X − μ][X − μ]'} A'
           = A D(X) A'.

If we let

    V = diag(1/σ₁, ..., 1/σₙ)

and "scale" X by V, we get

    D(V X) = V Σ V' = [ 1            σ₁₂/(σ₁σ₂)  ⋯  σ₁ₙ/(σ₁σₙ) ]
                      [ σ₂₁/(σ₂σ₁)  1           ⋯  σ₂ₙ/(σ₂σₙ) ]
                      [ ⋮                           ⋮          ]
                      [ σₙ₁/(σₙσ₁)  σₙ₂/(σₙσ₂)  ⋯  1          ].

We note that the elements are the correlation coefficients between X's components, which is why this matrix is also called the correlation matrix for X, and we write

    D(V X) = (ρᵢⱼ),

where

    ρᵢⱼ = σᵢⱼ/(σᵢσⱼ).
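The scaling that produces the correlation matrix is a one-liner in NumPy; a sketch with an arbitrary positive definite Σ:

```python
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 9.0, 3.0],
                  [0.0, 3.0, 4.0]])  # arbitrary positive definite example

V = np.diag(1.0 / np.sqrt(np.diag(Sigma)))  # diag(1/sigma_1, ..., 1/sigma_n)
R = V @ Sigma @ V.T                         # the correlation matrix

unit_diagonal = bool(np.allclose(np.diag(R), 1.0))
rho_12 = bool(np.isclose(R[0, 1], 2.0 / (2.0 * 3.0)))  # sigma_12/(sigma_1 sigma_2)
```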
2.1.3 Covariance
Let there be given two random variables

    X = (X₁, ..., Xₚ)'   and   Y = (Y₁, ..., Y_q)',

with mean values μ and ν. We now define the covariance between X and Y as

    C(X, Y) = E[(X − μ)(Y − ν)'] = [ Cov(X₁, Y₁)  ⋯  Cov(X₁, Y_q) ]
                                   [ ⋮                ⋮            ]
                                   [ Cov(Xₚ, Y₁)  ⋯  Cov(Xₚ, Y_q) ].

Then

    C(X, X) = D(X)

and

    C(X, Y) = [C(Y, X)]'.

Both relations are trivial.
THEOREM 2.6. Let X and Y be as above, and let A and B be n × p and m × q matrices of constants respectively. Then

    C(A X, B Y) = A C(X, Y) B'.
If U is a p-dimensional and V is a q-dimensional random variable the following holds

    C(X + U, Y) = C(X, Y) + C(U, Y),
    C(X, Y + V) = C(X, Y) + C(X, V).

Finally

    D(X + U) = D(X) + D(U) + C(X, U) + C(U, X).
PROOF. According to the definition we have

    C(A X, B Y) = E[(A X − A μ)(B Y − B ν)']
                = E[A (X − μ)(Y − ν)' B']
                = A E[(X − μ)(Y − ν)'] B'
                = A C(X, Y) B'.

This proves the first statement. Similarly, if we let E(U) = ξ,

    C(X + U, Y) = E[(X + U − μ − ξ)(Y − ν)']
                = E[(X − μ)(Y − ν)' + (U − ξ)(Y − ν)']
                = E[(X − μ)(Y − ν)'] + E[(U − ξ)(Y − ν)']
                = C(X, Y) + C(U, Y),

and the corresponding relation with Y + V is shown analogously. Finally we have

    D(X + U) = C(X + U, X + U)
             = C(X, X) + C(X, U) + C(U, X) + C(U, U).
If C(X, Y) = 0 then X and Y are said to be uncorrelated. This corresponds to all components of X being uncorrelated with all components of Y.

Later, when we consider the multidimensional general linear model, we will need the following
THEOREM 2.7. Let X₁, ..., Xₙ be independent, p-dimensional random variables with the same variance-covariance matrix Σ = (σᵢⱼ). We let

    X = (X₁ ⋯ Xₙ)    (p × n).

(Note that the variable index is the first index and the repetition index is the second.) If we define

    vc(X) = (X₁₁, ..., Xₚ₁, X₁₂, ..., Xₚ₂, ..., X₁ₙ, ..., Xₚₙ)',

i.e. as the vector consisting of the columns in X (vc = vector of columns), we get

    D(vc(X)) = Iₙ ⊗ Σ,

where Iₙ is the identity matrix of n'th order.

PROOF. Follows trivially from the definition of a tensor product and from the definition of the variance-covariance matrix.
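The block structure of Iₙ ⊗ Σ asserted by the theorem is immediate with `numpy.kron`; a sketch with an arbitrary Σ:

```python
import numpy as np

p, n = 2, 3
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])  # arbitrary variance-covariance matrix

D = np.kron(np.eye(n), Sigma)  # D(vc(X)) = I_n (tensor) Sigma

# Sigma repeated down the diagonal, zero blocks off the diagonal
# (the columns of X are independent).
diagonal_blocks = all(np.allclose(D[i*p:(i+1)*p, i*p:(i+1)*p], Sigma)
                      for i in range(n))
off_blocks_zero = bool(np.allclose(D[0:p, p:2*p], 0))
```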
2.2 The multivariate normal distribution
The multivariate normal distribution plays the same important role in the theory of multidimensional variables as the normal distribution does in the univariate case. We start with
2.2.1 Definition and simple properties
Let X₁, ..., Xₚ be mutually independent, N(0, 1) distributed variables. We then say that

    X = (X₁, ..., Xₚ)'

is standardised (normed) p-dimensionally normally distributed, and we write

    X ∈ N(0, I) = Nₚ(0, I),

where the last notation is used if there is any doubt about the dimension. We note that
E(X) = 0, D(X) = I.
We define the multivariate normal distribution with general parameters in
DEFINITION 2.1. We say that the p-dimensional random variable X is normally distributed with parameters μ and Σ if X has the same distribution as

    μ + A U,

where A satisfies

    A A' = Σ,

and where U is standardised p-dimensionally normally distributed. We write

    X ∈ N(μ, Σ) = Nₚ(μ, Σ),

where the last notation again is used if there is any doubt about the dimension.
REMARK 2.2. The definition is only valid if one shows that A A' = B B' implies that the random variables

μ + A U and μ + B V,

where U and V are standardised normally distributed and not necessarily of the same dimension, have the same distribution. The relation is valid, but we will not pursue this further here. From theorem 1.10 it follows that for any positive semidefinite matrix Σ there exists a matrix A with A A' = Σ, so the expression N(μ, Σ) makes sense for any positive semidefinite p × p matrix Σ and any p-dimensional vector μ.
Trivially, we note that

X ∈ N(μ, Σ) ⇒ E(X) = μ and D(X) = Σ,

i.e. the distribution is parametrised by its mean and variance-covariance matrix.

If Σ has full rank, then the distribution has the density given in
THEOREM 2.8. Let X ∈ Np(μ, Σ), and let rg(Σ) = p. Then X has the density

f(x) = (2π)^(−p/2) (det Σ)^(−1/2) exp[−(1/2)(x − μ)' Σ⁻¹ (x − μ)]
     = (2π)^(−p/2) (det Σ)^(−1/2) exp[−(1/2) ‖x − μ‖²_{Σ⁻¹}],

where the norm used is the one defined by Σ⁻¹, see section 1.6.
PROOF. Let U ∈ Np(0, I). Then U has the density

h(u) = ∏_{i=1}^p (2π)^(−1/2) exp(−u_i²/2) = (2π)^(−p/2) exp(−(1/2) u'u).

We then consider the transformation from R^p → R^p given by

u → x = μ + A u,

where A A' = Σ. From theorem 1.14 it follows that A is regular. We obtain

u = A⁻¹(x − μ),

giving

u'u = (x − μ)'(A⁻¹)' A⁻¹ (x − μ) = (x − μ)' Σ⁻¹ (x − μ).
Furthermore, since

det(Σ) = det(A A') = det(A)²,

we have

det(A⁻¹) = 1/√(det Σ),

and the result follows from the theorem on the distribution of transformed random variables.
We note that the inverse variance-covariance matrix Σ⁻¹ is often called the precision of the normal distribution.
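Definition 2.1 is constructive: to simulate from N(μ, Σ) one finds any A with A A' = Σ and sets X = μ + A U. A numpy sketch (μ, Σ and the sample size are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])

A = np.linalg.cholesky(Sigma)            # one choice of A with A A' = Sigma
U = rng.standard_normal((100_000, 2))    # standardised normal vectors
X = mu + U @ A.T                         # row-wise X = mu + A U

print(X.mean(axis=0))                    # approximately mu
print(np.cov(X, rowvar=False))           # approximately Sigma
```

Any decomposition with A A' = Σ works (Cholesky, or P Λ^{1/2} from the eigendecomposition); by remark 2.2 they all give the same distribution.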
If Σ is not regular, then the distribution is degenerate and has no density. We then introduce the concept of the affine support in the following definition.

DEFINITION 2.2. Let X ∈ Np(μ, Σ). By the (affine) support for X we mean the smallest (side) subspace of R^p, where X is defined with probability 1.

REMARK 2.3. If we restrict the considerations to the affine support, then X is regularly distributed and has a density as shown in theorem 2.8.
We have different possibilities for determining the support of a p-dimensional normal distribution. Firstly

THEOREM 2.9. Let X ∈ Np(μ, Σ), and let A be a p × m matrix, so that A A' = Σ. We then let V equal A's projection space, i.e.

V = {A u | u ∈ R^m}.

Then the (affine) support for X is the (side) subspace

μ + V = {μ + v | v ∈ V}.
PROOF. Omitted.
Further, we have
THEOREM 2.10. Let X be as in the previous theorem. Then the subspace V equals the direct sum of the eigenspaces corresponding to those eigenvalues in Σ which are different from 0.

PROOF. Omitted.

Finally we have

THEOREM 2.11. Let X be as in the previous theorems. Then the subspace V equals the orthogonal complement of the null space for Σ, i.e.

V = {v | Σ v = 0}^⊥.

PROOF. Omitted.
The three theorems are illustrated in
EXAMPLE 2.1. We consider X ∈ N3(μ, Σ). Since det(Σ) = 0, X is singularly distributed, and we will determine the affine support. We first seek a matrix A, so that A A' = Σ. To do that, we first determine Σ's eigenvalues
and (normed) eigenvectors. These are the eigenvalues λ1, λ2 > 0 and λ3 = 0 with corresponding orthonormed eigenvectors p1, p2 and p3. It now follows that

Σ = P Λ P',

where P = (p1 p2 p3) and Λ = diag(λ1, λ2, 0). From this we see that we as A-matrix can choose

A = P Λ^{1/2}.
If we regard A as the matrix for a linear projection R³ → R³, we then obtain that the projection space is

V = {A u | u ∈ R³} = {u1 p1 + u2 p2 | u1 ∈ R ∧ u2 ∈ R}.

It is immediately noted that this is also the direct sum of the eigenspaces corresponding to the eigenvalues which are different from 0.

The null space for Σ is given by

{u | Σ u = 0} = {t p3 | t ∈ R}.

This again gives the same description of V.
The affine support for X is then the (side) subspace μ + V.
REMARK 2.4. From the example the proofs of theorems 2.9-2.11 can almost be deduced completely.
We now formulate a trivial but useful theorem.
THEOREM 2.12. Let X ∈ N(μ, Σ). Then

A X + b ∈ N(A μ + b, A Σ A'),

where we implicitly require that the implied matrix products etc. exist.
PROOF. Trivial from the definition.
2.2.2 Independence and contour ellipsoids.
In this section we will give the conditions for independence of the normally distributed
stochastic variables, and we will prove that the isosets for the density functions are
ellipsoids. First we have
THEOREM 2.13. Let

X = [X1; X2] ∈ N( [μ1; μ2], [Σ11 Σ12; Σ21 Σ22] ).

Then

X1 ∈ N(μ1, Σ11) and X2 ∈ N(μ2, Σ22),

and

X1, X2 are stochastically independent ⇔ Σ12 = 0,

where 0 is the null matrix.

PROOF. The first statement follows from the previous theorem. The second follows by proving that the condition Σ12 = 0 ensures that the distribution becomes a product distribution.
Figure 2.1: Density functions for two-dimensional normal distributions with variance-covariance matrices with unit variances and covariances 0, 0.9 and 0.5.
From the theorem it follows that the components in a vector X ∈ N(μ, Σ) are stochastically independent if Σ is a diagonal matrix. We will now show that independence is just a question of choosing a suitable coordinate system.

Let X ∈ N(μ, Σ) and let Σ have the orthonormed eigenvectors p1, ..., pn. We now consider a coordinate system with origin in μ and the vectors p1, ..., pn as basis vectors. The coordinates in this system are called y.
If we let

P = (p1 ... pn),

we have the following correspondence between the original coordinates x and the new coordinates y for any point in R^n:

y = P'(x − μ) ⇔ x = P y + μ,

cf. p. 12.

Note: The above relation is a relation between coordinates for a fixed vector viewed in two coordinate systems.
Using this, if we let Y be the new coordinates for X we have
THEOREM 2.14. Let X ∈ N(μ, Σ) and let Y be as above. Then

Y ∈ N(0, Λ),

where Λ is a diagonal matrix with Σ's eigenvalues on the diagonal.
PROOF. Follows from theorem 2.12 and theorem 1.10.
REMARK 2.5. By translating and rotating (or reflecting) the original coordinate system we have obtained that the variance-covariance matrix is a diagonal matrix, i.e. that the components in the stochastic vector are uncorrelated and thereby also independent.

By rescaling the axes we can even obtain that the variance-covariance matrix has zeros or ones on the diagonal. Considering the basis vectors

c1 p1, ..., cn pn,
where

c_i = 1/√λ_i if λ_i > 0,  c_i = 1 if λ_i = 0,
cf. the proof of theorem 1.11, and calling the coordinates in this system z, we get the equation

z = C'P'(x − μ) = (P C)'(x − μ),

where C = diag(c1, ..., cn).
If we let the z coordinates for X equal Z we get

Z ∈ N(0, E),

where

E = (P C)' Σ (P C) = C'P' Σ P C = C' Λ C

has zeros or ones on the diagonal.
The transformation into the new bases is closely related to the isocurves for the density function for the normal distribution.

As mentioned earlier the density for an X ∈ N(μ, Σ) is

f(x) = k · exp(−(1/2)(x − μ)' Σ⁻¹ (x − μ)) = k · exp(−(1/2) ‖x − μ‖²_{Σ⁻¹}).
Therefore we have

f(x) ≥ k1 ⇔ ‖x − μ‖²_{Σ⁻¹} ≤ c,

where k1 and c are constants. Since Σ⁻¹ is positive definite, the isocurves will be ellipsoids, cf. theorem 1.19. From theorem 1.19 it is also seen that the major axes of these ellipsoids are the eigenvectors for Σ⁻¹, but from theorem 1.12 we note that they are also eigenvectors for Σ. In the new coordinates the densities become

f(y) = k · exp(−(1/2) ∑_i y_i²/λ_i),
where λ_i is the i'th eigenvalue. The ellipsoids

{x | (x − μ)' Σ⁻¹ (x − μ) = c}

are often called contour ellipsoids. The relation to the chi-square (χ²) distribution is given in the following theorem.
THEOREM 2.15. Let P and C be as above. Then

(X − μ)'(P C)(P C)'(X − μ) ∈ χ²(rg(Σ)).

If Σ has full rank p, then

(X − μ)' Σ⁻¹ (X − μ) ∈ χ²(p).

PROOF. (X − μ)'(P C)(P C)'(X − μ) = Z'Z = ∑ δ_i Z_i², where δ_i = 1 if λ_i ≠ 0 and equal to 0 otherwise. Since the non-degenerate components in Z are stochastically independent and N(0,1) distributed, the result follows immediately. The last remark comes from

P C (P C)' = P C C' P' = P Λ⁻¹ P' = Σ⁻¹.
REMARK 2.6. The result of the theorem is that the probability of an outcome being within the contour ellipsoid can be computed using a χ² distribution.

Examples of these concepts will be given in example 2.2, where we consider the two-dimensional normal distribution.
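Theorem 2.15 can be checked by simulation: for full-rank Σ the squared Mahalanobis distance follows a χ²(p) distribution. A numpy sketch with illustrative parameters (5.99 is the 95% quantile of χ²(2)):

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
A = np.linalg.cholesky(Sigma)

X = mu + rng.standard_normal((200_000, 2)) @ A.T

# squared Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu) for every row
d = X - mu
m2 = np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma), d)

# about 95% of the outcomes should fall inside the 95% contour ellipsoid
inside = (m2 <= 5.99).mean()
print(inside)
```

The fraction printed is close to 0.95, independently of the particular (full-rank) Σ chosen.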
2.2.3 Conditional distributions
In this section we consider the partitioning of a random variable X ∈ Np(μ, Σ) into

X = [X1; X2];

we then have

μ = [μ1; μ2] and Σ = [Σ11 Σ12; Σ21 Σ22].
THEOREM 2.16. If X2 is regularly distributed, i.e. if Σ22 has full rank, then the distribution of X1 conditioned on X2 = x2 is again a normal distribution, and the following holds:

E(X1 | X2 = x2) = μ1 + Σ12 Σ22⁻¹ (x2 − μ2)
D(X1 | X2 = x2) = Σ11 − Σ12 Σ22⁻¹ Σ21.

If Σ22 does not have full rank, then the conditional distribution is still normal, and Σ22⁻¹ in the above equations should be substituted by a generalised inverse Σ22⁻.
PROOF. The proof is technical and is omitted, however cf. section 2.2.5.
REMARK 2.7. It is seen that the conditional dispersion of X1 is independent of x2. This result is not valid for all distributions, but is special for the normal distribution. Also we see that the conditional mean is an affine function of x2, cf. the discussion in section 2.3.3. Furthermore, we see that the conditional dispersion equals the Schur complement of the unconditional dispersion of X2.

We will not discuss the implications of the theorem here. Instead we refer to the examples in section 2.2.5.
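The conditioning formulas of theorem 2.16 are easy to evaluate numerically. A minimal numpy sketch (the partition is 1+1-dimensional and μ, Σ are illustrative, not from the text):

```python
import numpy as np

# Partitioned parameters: X = (X1, X2) with X1 and X2 one-dimensional
mu = np.array([1.0, 2.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])

mu1, mu2 = mu[:1], mu[1:]
S11 = Sigma[:1, :1]; S12 = Sigma[:1, 1:]
S21 = Sigma[1:, :1]; S22 = Sigma[1:, 1:]

x2 = np.array([3.0])                          # observed value of X2
S22inv = np.linalg.inv(S22)
cond_mean = mu1 + S12 @ S22inv @ (x2 - mu2)   # mu1 + S12 S22^{-1}(x2 - mu2)
cond_disp = S11 - S12 @ S22inv @ S21          # Schur complement

print(cond_mean)   # [2.2]
print(cond_disp)   # [[2.56]]
```

Note that `cond_disp` does not involve `x2` at all, which is exactly the remark above.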
2.2.4 Theorem of reproductivity and the central limit theorem

Analogous to the theorem of reproductivity for the univariate normal distribution we have
THEOREM 2.17. (Theorem of reproductivity). Let X1, ..., Xk be independent, and let Xi ∈ N(μi, Σi). Then

X1 + ... + Xk ∈ N(μ1 + ... + μk, Σ1 + ... + Σk).

PROOF. Omitted.
As in the univariate case, central limit theorems exist, i.e. sums of independent multidimensional stochastic variables are under general assumptions asymptotically normally distributed. We state an analogue of the Lindeberg-Levy theorem.

THEOREM 2.18. (Central limit theorem). Let the independent and identically distributed variables X1, ..., Xn, ... have finite first and second moments

E(Xi) = μ,  D(Xi) = Σ.

Then we have, with X̄n = (1/n)(X1 + ... + Xn), that

√n (X̄n − μ)

has an N(0, Σ) distribution as its limiting distribution, and we say that X̄n is asymptotically N(μ, (1/n)Σ) distributed.
PROOF. This and the previous theorem can be proved from the corresponding univariate theorems by first using a theorem which characterises the multivariate normal distribution (a multidimensional variable is normally distributed if and only if all linear combinations of its components are (univariate) normally distributed), and by using a theorem which characterises a multivariate limiting distribution through the limiting distributions of linear combinations of the components (coordinates). However, this is outside the scope of this presentation, and the interested reader is referred to the literature, e.g. [21], section 2c.5.
2.2.5 Estimation of the parameters in a multivariate normal distribution

We consider a number of observations X1, ..., Xn, which are assumed independent and identically Np(μ, Σ) distributed. We assume there are more observations than the dimension indicates, i.e. that n > p. In this section we will give estimates of the parameters μ and Σ.
We introduce the notation

X̄ = (1/n) ∑_{i=1}^n Xi,
S = 1/(n−1) ∑_{i=1}^n (Xi − X̄)(Xi − X̄)'.

If we consider the data matrix

X = [X1'; ...; Xn'],

where the i'th row corresponds to the i'th observation, we can also write

(n − 1) S = ∑_{i=1}^n (Xi − X̄)(Xi − X̄)' = X'X − n X̄ X̄'.
With this we can now state

THEOREM 2.19. Let the situation be as stated above. Then the maximum likelihood estimators for μ and Σ equal

μ̂ = X̄,
Σ̂ = ((n−1)/n) S = (1/n) ∑_{i=1}^n (Xi − X̄)(Xi − X̄)'.

μ̂ is an unbiased estimate of μ, and S is an unbiased estimate of Σ.

PROOF. Omitted, see e.g. [2], chapter 3.
REMARK 2.8. Since the empirical variance-covariance matrix S is an unbiased estimate of Σ, and since it only differs from the maximum likelihood estimator by the factor (n−1)/n, we often prefer S as the estimate. Often one will see the notation Σ̂ used for S. One should in each case be aware of what the expression Σ̂ precisely means.

The distribution of μ̂ follows trivially from the theorems of section 2.2.4. The following holds:

μ̂ = X̄ ∈ N(μ, (1/n) Σ).

The distribution of S is the Wishart distribution, the multivariate analogue of the chi-square distribution. It is treated in section 2.5.
We give an example of estimating the parameters in the following section.
2.2.6 The two-dimensional normal distribution

We now specialise the results from before to two dimensions.

Let X = [X1; X2] be normally distributed with parameters (μ, Σ), where

μ = [μ1; μ2] and Σ = [σ1² σ12; σ12 σ2²].

Since det(Σ) = σ1²σ2² − σ12², the inverse Σ⁻¹ is, if det(Σ) ≠ 0,

Σ⁻¹ = (1/(σ1²σ2² − σ12²)) [σ2² −σ12; −σ12 σ1²].

Introducing the correlation coefficient ρ = σ12/(σ1σ2) we get

det(Σ) = σ1²σ2²(1 − ρ²),

and the density becomes

f(x1, x2) = 1/(2π σ1 σ2 √(1−ρ²)) · exp{ −1/(2(1−ρ²)) [ ((x1−μ1)/σ1)² − 2ρ ((x1−μ1)/σ1)((x2−μ2)/σ2) + ((x2−μ2)/σ2)² ] }.

Figure 2.2: The density of a two-dimensional normal distribution.
The graph is shown in fig. 2.2. It is immediately seen that we have a product distribution, i.e. that X1 and X2 are stochastically independent, if ρ = 0, i.e. if Σ is a diagonal matrix.

The conditional distribution of X1 conditioned on X2 = x2 is proportional to the intersecting curve between the density surface and the plane through (0, x2, 0) parallel to the (1)-(3) plane. If we denote the density as g we have

g(·) = c f(·, x2),
where c is a normalisation constant. We have

g(x1) = k1 · exp{ −1/(2(1−ρ²)) [ ((x1−μ1)/σ1)² − 2ρ ((x1−μ1)/σ1)((x2−μ2)/σ2) ] }
      = k2 · exp{ −(x1 − μ1 − ρ(σ1/σ2)(x2−μ2))² / (2σ1²(1−ρ²)) }.

Note that no bookkeeping has been done with respect to x2; it has disappeared into the different constants. From the final result we note that the conditional distribution is normal and that

E(X1 | X2 = x2) = μ1 + ρ (σ1/σ2)(x2 − μ2),

and finally that

V(X1 | X2 = x2) = σ1² (1 − ρ²).
We have shown the result of theorem 2.16 for the case p = 2. Note that the conditional mean depends linearly (or more correctly: affinely) upon x2, and that the conditional variance is independent of x2. Further we have

V(X1) − V(X1 | X2 = x2) = σ1² ρ²,

and the squared coefficient of correlation represents the reduction in variance, i.e. the fraction of X1's variance which can be explained by X2, since

ρ² = (V(X1) − V(X1 | X2 = x2)) / V(X1).
In the following section we consider a numerical example which also involves an estimation problem.
EXAMPLE 2.2. The following table shows corresponding values of the air's content of airborne particulate matter. Two different measuring principles were used: a measure of grey value (using a so-called OECD instrument) and a weighing principle (using a so-called High Volume Sampler). Among other things, the reason for the large deviations is that the measurements using the grey value principle are sensitive to deviations of the suspended dust particles from "normal dust". In this way, a large content of calcium dust in the air could result in the measurements being systematically too small.
Method I:    2   5  15  16  16  19  26  24  16  36
Method II:   2  12   4  21  41  14  31  29  31   8

Method I:   39  42  44  40  42  42  50  51  58  64
Method II:  30  44  26  60  34  34  14  41  58  47
We consider these data as observations from independent, identically distributed stochastic variables

[X1; Y1], ..., [X20; Y20].

We will examine whether we can assume the distribution is normal with parameters (μ, Σ). If the distribution is normal, we find the estimates

μ̂ = [μ̂1; μ̂2] = [x̄; ȳ] = [32.35; 29.05]

and

S = [s1² s12; s12 s2²] = [311 182; 182 279],

where S is the unbiased estimate of Σ. In particular we have

s12 = 1/(n−1) ∑_{i=1}^n (xi − x̄)(yi − ȳ).
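The estimates can be reproduced directly from the table with numpy (`np.cov` with `ddof=1` gives the unbiased estimate S):

```python
import numpy as np

# Data from the table: OECD instrument (method I), High Volume Sampler (method II)
oecd = np.array([2, 5, 15, 16, 16, 19, 26, 24, 16, 36,
                 39, 42, 44, 40, 42, 42, 50, 51, 58, 64], float)
hvs  = np.array([2, 12, 4, 21, 41, 14, 31, 29, 31, 8,
                 30, 44, 26, 60, 34, 34, 14, 41, 58, 47], float)

mu_hat = np.array([oecd.mean(), hvs.mean()])   # estimate of mu
S = np.cov(np.stack([oecd, hvs]), ddof=1)      # unbiased variance-covariance

print(mu_hat)        # [32.35 29.05]
print(S.round(0))    # [[311. 182.] [182. 279.]]
```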
We now want to check whether the observations can be assumed to come from a normal distribution with parameters (μ, Σ). To do that, we first estimate the contour ellipses. The eigenvalues and eigenvectors for S are

λ1 = 477.613 and p1 = [0.736; 0.678]

and

λ2 = 112.676 and p2 = [−0.678; 0.736].
If we choose the coordinate system with origin in μ̂ and with p1 and p2 as basis vectors, the contour ellipsoids have equations of the form

z1²/477.613 + z2²/112.676 = c,

where the new coordinates are given by

z = P'(x − μ̂).

In figure 2.3 we show the observations and 3 contour ellipses corresponding to the c-values c1 = χ²(2)_{0.40} = 1.02, c2 = χ²(2)_{0.80} = 3.22 and c3 = χ²(2)_{0.95} = 5.99.
This has the effect (see theorem 2.15) that in the normal distribution with parameters (μ̂, S) we have the probabilities 40%, 80% and 95% of having observations within the inner, the middle and the outer ellipse. For the areas between the ellipses, resp. outside them, we have the probabilities 40%, 40%, 15% and 5%. These numbers can be compared to the corresponding observed relative frequencies 40%, 30%, 30% and 0%. The fit is, if not overwhelming, at least acceptable.

If one wants a more precise result, one can perform a χ² test. It would then be reasonable to divide the plane further according to the eigenvectors. In the case shown, this would result in 4 × 4 areas with estimated probabilities of 10%, 10%, 3.75% and 1.25%. One can then compute the usual χ² test statistic

∑ (observed − expected)² / expected

and compare it with a χ²(10) distribution (we have estimated 5 parameters). In the present case there are not really enough observations to perform this analysis.
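The classification of the 20 observations relative to the three estimated contour ellipses can be sketched via squared Mahalanobis distances (pairs read from the table in example 2.2):

```python
import numpy as np

x = np.array([[2, 2], [5, 12], [15, 4], [16, 21], [16, 41],
              [19, 14], [26, 31], [24, 29], [16, 31], [36, 8],
              [39, 30], [42, 44], [44, 26], [40, 60], [42, 34],
              [42, 34], [50, 14], [51, 41], [58, 58], [64, 47]], float)

mu_hat = x.mean(axis=0)
S = np.cov(x, rowvar=False)              # unbiased estimate
Sinv = np.linalg.inv(S)

d = x - mu_hat
m2 = np.einsum('ij,jk,ik->i', d, Sinv, d)   # squared Mahalanobis distances

# chi-square(2) quantiles at 40%, 80% and 95%
c1, c2, c3 = 1.02, 3.22, 5.99
counts = [(m2 <= c1).sum(),
          ((m2 > c1) & (m2 <= c2)).sum(),
          ((m2 > c2) & (m2 <= c3)).sum(),
          (m2 > c3).sum()]
print(counts)   # the text reports relative frequencies 40%, 30%, 30%, 0%
```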
The correlation coefficient is estimated at

ρ̂ = 182/√(311 · 279) = 0.62,
Figure 2.3: Estimated contour ellipses and estimated density function corresponding to the data in example 2.2.
and the conditional variances are estimated at

V(X | Y = y) = 311(1 − ρ̂²) = 192
V(Y | X = x) = 279(1 − ρ̂²) = 172.
We see that the conditional variances have been reduced by 38%, corresponding to ρ̂² = 0.38. That the conditional variance of e.g. an OECD measurement for a given High Volume Sampler measurement is substantially less than the unconditional variance seems rather reasonable. If the amount of suspended matter measured using the High Volume Sampler is known, we would not expect results from the OECD instrument which deviate grossly from it; this corresponds to a small conditional variance. If the result from the High Volume Sampler is unknown, then we must expect a measurement from the OECD instrument that can lie anywhere in its natural range of variation, corresponding to a larger unconditional variance.
2.3 Correlation and regression
In this section we will discuss the meaning of the parameters in a multidimensional normal distribution in greater detail. First we will try to generalise the properties of the correlation coefficient seen in the previous section.
2.3.1 The partial correlation coefficient.
The starting point is the formula for the conditional distributions in a multidimensional normal distribution. Let X ∈ Np(μ, Σ), and let the variables be partitioned as follows:

X = [X1; X2],

where X1 consists of the m first elements in X and likewise with the others. Then the conditional dispersion of X1 for given X2 = x2 is, as was shown in theorem 2.16, equal to

Σ11 − Σ12 Σ22⁻¹ Σ21.

By the partial correlation coefficient between Xi and Xj, i, j ≤ m, conditioned on (or: for given) X_{m+1}, ..., Xp, we will understand the correlation in the conditional distribution of Xi and Xj given X_{m+1}, ..., Xp. It is denoted ρ_{ij|m+1,...,p}.
Let

Σ11 − Σ12 Σ22⁻¹ Σ21 = (σ_{ij|m+1,...,p});

we now have

ρ_{ij|m+1,...,p} = σ_{ij|m+1,...,p} / √(σ_{ii|m+1,...,p} · σ_{jj|m+1,...,p}).

For the special case of X being three-dimensional we have, with

Σ = [ σ1²       ρ12σ1σ2   ρ13σ1σ3 ]
    [ ρ12σ1σ2   σ2²       ρ23σ2σ3 ]
    [ ρ13σ1σ3   ρ23σ2σ3   σ3²     ],

that

[ σ1²       ρ12σ1σ2 ]  −  (1/σ3²) [ ρ13σ1σ3 ] [ ρ13σ1σ3  ρ23σ2σ3 ]
[ ρ12σ1σ2   σ2²     ]             [ ρ23σ2σ3 ]

 = [ σ1²(1 − ρ13²)        σ1σ2(ρ12 − ρ13ρ23) ]
   [ σ1σ2(ρ12 − ρ13ρ23)   σ2²(1 − ρ23²)      ].

From this follows that the partial correlation coefficient between X1 and X2 conditioned on X3 is

ρ_{12|3} = (ρ12 − ρ13 ρ23) / √((1 − ρ13²)(1 − ρ23²)).

For a p-dimensional vector X we therefore find

ρ_{ij|k} = (ρ_{ij} − ρ_{ik} ρ_{jk}) / √((1 − ρ_{ik}²)(1 − ρ_{jk}²)).   (**)

             C3S      C3A      BLAINE   Strength 3   Strength 28
C3S           1      -0.309     0.091     0.158        0.344
C3A         -0.309     1        0.192     0.120       -0.166
BLAINE       0.091    0.192     1         0.745        0.320
Strength 3   0.158    0.120     0.745     1            0.464
Strength 28  0.344   -0.166     0.320     0.464        1

Table 2.1: The correlation matrix for 5 cement variables.
Since it is possible to find conditional distributions for given X_{m+1}, ..., Xp by successive conditionings, we can determine partial correlation coefficients of higher order by successive use of (**). E.g. we find

ρ_{ij|kl} = (ρ_{ij|k} − ρ_{il|k} ρ_{jl|k}) / √((1 − ρ²_{il|k})(1 − ρ²_{jl|k}));

here we have first conditioned on Xk and then conditioned on Xl.
In section 2.2.6 we saw that the (squared) correlation coefficient is a measure of the reduction in variance if we condition on one of the variables. Since the partial correlation coefficients are just correlations in conditional distributions, we can use the same interpretation here. We have e.g. that ρ²_{ij|kl} gives the fraction of Xi's variance for given Xk = xk and Xl = xl which is explained by Xj. It should be emphasised that these interpretations are strongly dependent on the assumption of normality. For the general case the conditional variances will depend on the values on which they are conditioned (i.e. depend on xk and xl).

When estimating the partial correlations one just estimates the variance-covariance matrix and then computes the partial correlations as shown. If the estimate of the variance-covariance matrix is a maximum likelihood estimator, then the estimates of the partial correlations computed in this way will also be maximum likelihood estimates (cf. theorem 10, p. 2.28 in volume I).
We will now illustrate the concepts in
EXAMPLE 2.3. (Data are from [20]).

In table 2.1, correlation coefficients between 3- and 28-day strengths for Portland cement, the content of the minerals C3S (alite, tricalcium silicate Ca3SiO5) and C3A (aluminate, tricalcium aluminate Ca3Al2O6), and the degree of fine-grainedness (BLAINE) are given. The correlations are estimated using 51 corresponding observations.

It should be noted that C3S constitutes about 35-60% of normal Portland clinkers and C3A a considerably smaller part of the clinker. The BLAINE is a measure of the specific surface, so that a large BLAINE corresponds to very fine-grained cement.
We will be especially interested in the relationship between the C3A content in clinker
Figure 2.4: (a) Strength by pressure test at ordinary temperature of paste of C3S and C3A seasoned for different amounts of time (from [14]). (b) Pressure strengths for different fine-grainedness of the cement (from [14]). (c) Degree of hydratisation for cement minerals and their dependence on time (from [14]). (d) Relationship between degree of hydratisation and strength (from [14]).
and the two strengths. It is commonly accepted, cf. figure 2.4, that a large content of C3A gives a larger 3-day strength, which is also in correspondence with ρ̂_{C3A,Strength3} = 0.120. The problem is that this larger 3-day strength for cement with a large content of C3A may only depend on C3A's larger degree of hydratisation (the faster the water reacts with the cement, the faster it gains strength). C3A's far greater hydratisation after 3 days is seen from figure 2.4(c), and the degree of hydratisation and its influence on the strengths has been sketched in figure 2.4(d).

If we look at the correlation matrix we also see that the content of C3A is positively correlated with the BLAINE, i.e. cements with a very high content of C3A will usually be very fine-grained, and as is seen in figure 2.4(b) this should also help increase the strength.
Finally we see that the 28-day strength is slightly negatively correlated with the content of C3A. This does not seem strange if we consider the temporal dependence of C3S's
             C3S      C3A      Strength 3   Strength 28
C3S           1      -0.333     0.137        0.333
C3A         -0.333     1       -0.035       -0.246
Strength 3   0.137   -0.035     1            0.358
Strength 28  0.333   -0.246     0.358        1

Table 2.2: Correlation matrix for 4 cement variables conditioned on BLAINE.
and C3A's strength contributions, as seen e.g. in figure 2.4(a), even though the finer grain (for cement with a large content of C3A) should also be seen in the 28-day strength, cf. figure 2.4(b).
In order to separate the other characteristics of C3A from the effects which arise because C3A-rich cement seems to be easier to grind, and therefore often is seen in a somewhat more fine-grained form, we will estimate the conditional correlations for a fixed value of BLAINE. These are seen in table 2.2. We see that the partial correlation coefficient between 3-day strength and C3A for given fine-grainedness is negative (note that the unconditioned correlation coefficient was positive). This implies that for fixed fine-grainedness we must expect that cements with a high content of C3A will tend to have lower strengths. This might indicate that the large 3-day strength for cements with a high content of C3A rather depends on these cements having a large BLAINE (that they are crushed somewhat more easily) than on C3A hydrating quickly!

We see a corresponding effect on the correlation between C3A and the 28-day strength. Here the unconditional correlation is −0.166, and the partial correlation for fixed BLAINE has become −0.246.
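Table 2.2 can be reproduced from table 2.1 with the formula for ρ_{ij|k}. A small numpy sketch (variable order C3S, C3A, BLAINE, Strength 3, Strength 28):

```python
import numpy as np

# Correlation matrix from table 2.1
R = np.array([[ 1.000, -0.309,  0.091,  0.158,  0.344],
              [-0.309,  1.000,  0.192,  0.120, -0.166],
              [ 0.091,  0.192,  1.000,  0.745,  0.320],
              [ 0.158,  0.120,  0.745,  1.000,  0.464],
              [ 0.344, -0.166,  0.320,  0.464,  1.000]])

def partial_corr(R, i, j, k):
    """rho_{ij|k} = (rho_ij - rho_ik rho_jk) / sqrt((1-rho_ik^2)(1-rho_jk^2))"""
    return (R[i, j] - R[i, k] * R[j, k]) / np.sqrt(
        (1 - R[i, k] ** 2) * (1 - R[j, k] ** 2))

b = 2  # condition on BLAINE
print(round(partial_corr(R, 1, 3, b), 3))  # C3A vs Strength 3 given BLAINE: -0.035
print(round(partial_corr(R, 1, 4, b), 3))  # C3A vs Strength 28 given BLAINE: approx -0.245
```

The small differences from the table (e.g. −0.245 vs −0.246) come from the inputs already being rounded to three decimals.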
REMARK 2.9. The example above shows that one has to be very cautious in the interpretation of correlation coefficients. It would be directly misleading e.g. to say that a large content of C3A ensures a large 3-day strength. First of all, it is not possible to conclude anything about the relation between two variables just by looking at their correlation. What you can conclude is that there seems to be a tendency for a high content of C3A and a high 3-day strength to appear at the same time. The reason could be that they both depend on a third but unknown factor, without there having to be any direct relation between the two variables. Secondly, we also see that going from unconditioned to partial correlations can even give a change of sign, corresponding to an effect which is the opposite of the one we get by a direct analysis. The reason for this is a correlation with a third factor, in this case BLAINE, which disturbs the picture.
In many situations we would like to test if a correlation coefficient can be assumed to be 0. One can then use

THEOREM 2.20. Let R = R_{ij|m+1,...,p} be the empirical partial correlation coefficient between Xi and Xj conditioned on (or: for given) X_{m+1}, ..., Xp. It is assumed to be computed from the unbiased estimates of the variance-covariance matrix and from n observations. Then

(R / √(1 − R²)) · √(n − 2 − (p − m)) ∈ t(n − 2 − (p − m)),

if ρ_{ij|m+1,...,p} = 0.
PROOF. Omitted.
REMARK 2.10. The number (p − m) is the number of variables which are fixed (conditioned upon). The degrees of freedom are therefore equal to the number of observations minus 2 minus the number of fixed variables. The theorem is also valid if p − m = 0, i.e. if we have the case of an unconditional correlation coefficient.
We continue example 2.3 in
EXAMPLE 2.4. Let us investigate whether the value of r_{24|3} is significantly different from 0. We find with r_{24|3} = R:

|R| / √(1 − R²) · √(n − 2 − (p − m)) = 0.035 · √(51 − 2 − (5 − 4)) / √(1 − 0.035²) = 0.243 = t(48)_{40%}.

A hypothesis that ρ_{24|3} is 0 will therefore be accepted using a test at level α for α < 80%. (Note: this is by nature a two-sided test.)
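The test statistic of theorem 2.20, evaluated for example 2.4 (pure Python):

```python
import math

n, p_minus_m = 51, 1        # 51 observations, conditioned on one variable (BLAINE)
r = -0.035                  # empirical partial correlation r_24|3 from table 2.2

df = n - 2 - p_minus_m      # 48 degrees of freedom
t = r / math.sqrt(1 - r ** 2) * math.sqrt(df)
print(round(abs(t), 3))     # 0.243, to be compared with t(48) quantiles
```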
If we wish to test other values of P or to determine confidence intervals we can use
THEOREM 2.21. Assume the situation is as in the previous theorem. We consider the
hypothesis
110 : Pijlm+l, ... ,p = Po
vcrliuli
III : Pijlm+l" .. " rI: PO
Wo lot
2.3. CORRELATION AND REGRESSION
and
1 1 + Po
Zo = In.
2 1  Po
Under Ho we will have
(Z  zo) . vn (p  m)  3 approx. E N(O, 1).
PROOF. Omitted.
EXAMPLE 2.5. Let us determine a 95% confidence interval for ρ_{24|3} in example 2.4. We have

P{−1.96 < (Z − z) · √(51 − (5 − 4) − 3) < 1.96} ≈ 95%
⇒ P{6.86 Z − 1.96 < 6.86 z < 6.86 Z + 1.96} ≈ 95%
⇒ P{Z − 0.29 < z < Z + 0.29} ≈ 95%.

The relationship between z and ρ_{24|3} = ρ is

z = (1/2) ln((1 + ρ)/(1 − ρ)) ⇔ ρ = (e^{2z} − 1)/(e^{2z} + 1).

The observed value of Z is

Z = (1/2) ln((1 − 0.035)/(1 + 0.035)) = −0.03501.

The limits for z become

[−0.3250, 0.2549].

The corresponding limits for ρ are

[(e^{−0.6500} − 1)/(e^{−0.6500} + 1), (e^{0.5098} − 1)/(e^{0.5098} + 1)] = [−0.31, 0.25].
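The confidence interval of example 2.5 can be computed directly from theorem 2.21 (a sketch; the exact normal quantile 1.96 is used rather than the rounded half-width 0.29, so the last decimals may differ slightly from the hand computation):

```python
import math

n, p_minus_m = 51, 1
r = -0.035                                    # observed partial correlation r_24|3

z_obs = 0.5 * math.log((1 + r) / (1 - r))     # Fisher z transform of r
half = 1.96 / math.sqrt(n - p_minus_m - 3)    # half-width on the z scale

lo, hi = z_obs - half, z_obs + half
to_rho = lambda z: (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)  # inverse transform
print(round(to_rho(lo), 2), round(to_rho(hi), 2))                 # -0.31 0.25
```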
2.3.2 The multiple correlation coefficient
The partial correlation coefficient is one possible generalisation of the correlation between two variables. The partial correlations are mostly intended to describe the degree of relationship (correlation, covariance) between two variables. Instead we will now consider the formula on p. 79:

ρ² = (V(X1) − V(X1 | X2 = x2)) / V(X1).

This is the "degree of reduction in variance" interpretation of the (squared) correlation coefficient. This we now seek to generalise. We again consider the partition of the p-dimensionally normally distributed vector X in an m-dimensional vector X1 and a (p − m)-dimensional vector X2, and the resulting partitioning of the parameters, i.e.

μ = [μ1; μ2] and Σ = [Σ11 Σ12; Σ21 Σ22].
We now define the multiple correlation coefficient between Xi, i = 1, ..., m, and X2 as the maximal correlation between Xi and a linear combination of X2's elements. It is denoted ρ_{i|m+1,...,p}.

It can be shown that the optimal linear combination of X2's elements is

αi' X2,

where αi' is the i'th row in the matrix Σ12 Σ22⁻¹. This matrix appears in the expression for the conditional mean of X1 given X2; as stated before this is

E(X1 | X2 = x2) = μ1 + Σ12 Σ22⁻¹ (x2 − μ2).

It can also be shown that

min_a V(Xi − a' X2) = V(Xi − αi' X2),

i.e. the considered linear combination minimises the variance of (Xi − a' X2), cf. section 2.3.3.
We now have the following important

THEOREM 2.22. We consider the situation above. Let σi be the i'th column in Σ21, i.e. σi' is the i'th row in Σ12. Then

ρ_{i|m+1,...,p} = √(σi' Σ22⁻¹ σi) / √(σii).

If we let

Σ̃i = [ σii  σi' ]
      [ σi   Σ22 ],

then

1 − ρ²_{i|m+1,...,p} = det Σ̃i / (σii · det Σ22).
PROOF. The proofs of the claims before the theorem are quite simple. One just has to use a Lagrange multiplier and use that the variance-covariance matrix is positive semidefinite. What is claimed in the theorem then follows by applying the formula for the conditional variance-covariance structure (p. 74) to Σ̃i, by use of the matrix formulas in section 1.2.7.
REMARK 2.11. In the theorem we have obtained a number of characterisations of the multiple correlation coefficient, and since

ρ²_{i|m+1,...,p} = (V(Xi) − V(Xi | X2 = x2)) / V(Xi),

we note that we have generalised the property of reduction in variance. It is important to note that we can see from the determinant formula that it is possible to compute the multiple correlation coefficient from the correlation matrix by using the same formulas as are valid when computing it from the variance-covariance matrix.

With regard to the estimation of multiple correlation coefficients, the same remark as above regarding the estimation of partial coefficients holds.
In the next example we continue example 2.4.
EXAMPLE 2.6. To get an impression of the degree to which the contents of C3A and C3S in example 2.3 explain the variation in e.g. the 3-day strength, we can compute the multiple correlation coefficient between Strength 3 and (C3S, C3A). We find

1 − ρ̂²_{4|12} = det( [ 1      0.158   0.120 ]
                      [ 0.158  1      −0.309 ]
                      [ 0.120  −0.309  1     ] ) / ( 1 · det( [ 1      −0.309 ]
                                                              [ −0.309  1     ] ) ),

where the indices of the variables correspond to those used in example 2.3. We find

ρ̂²_{4|12} = 1 − 0.9435 = 0.0565.

The data therefore indicate that only about 6% of the variation in the strength of the cement (from samples which have been collected the way these data have been collected) can be explained by variations in C3S and C3A content alone.
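The determinant formula of theorem 2.22, applied to example 2.6's correlation submatrix:

```python
import numpy as np

# Correlation submatrix for (Strength 3, C3S, C3A), values from example 2.3
Rtilde = np.array([[1.000,  0.158,  0.120],
                   [0.158,  1.000, -0.309],
                   [0.120, -0.309,  1.000]])
R22 = Rtilde[1:, 1:]      # correlation matrix of (C3S, C3A)

# 1 - rho^2 = det(Rtilde) / (1 * det(R22)), cf. theorem 2.22
one_minus_rho2 = np.linalg.det(Rtilde) / np.linalg.det(R22)
rho2 = 1 - one_minus_rho2
print(round(rho2, 4))     # 0.0565
```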
If the multiple correlation coefficient is 0 (i.e. if σi = 0) it is not difficult to determine the distribution of ρ̂_{i|m+1,...,p}. We give the results in slightly changed form in

THEOREM 2.23. Let R = ρ̂_{i|m+1,...,p} be the empirical multiple correlation coefficient between Xi and X2 = (X_{m+1}, ..., Xp)' based upon n observations. Then

(R²/(1 − R²)) · ((n − (p − m) − 1)/(p − m)) ∈ F(p − m, n − (p − m) − 1),

if ρ_{i|m+1,...,p} = 0.

PROOF. Omitted.
REMARK 2.12. The number p − m is equal to the number of variables in X2, i.e. the number of variables we condition on.

This can be used in testing the hypothesis

H0: ρ_{i|m+1,...,p} = 0 against H1: ρ_{i|m+1,...,p} ≠ 0.

We reject the null hypothesis for large values of the test statistic. This is illustrated in
EXAMPLE 2.7. Consider the situation in example 2.6. We now want to examine whether it can be assumed that the multiple correlation between X4 and (X1, X2) is 0. (Note that p − m = 2.) We find the statistic

(R²/(1 − R²)) · ((51 − 2 − 1)/2) = (0.0565/0.9435) · 24 = 1.44.

Since

F(2, 48)_{0.90} = 2.42,

we will at least accept a hypothesis that ρ_{4|12} = 0 for any level α < 10%. With the available data it cannot be rejected that ρ_{4|12} = 0. This does not mean that it is not different from 0 (which it probably is), only that we cannot be sure using the available data, because the true (but unknown) value of ρ_{4|12} is probably rather small.
We shall not consider tests for other values of Pilm+l, ... ,n"
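The determinant formula for the multiple correlation and the F test of theorem 2.23 can be sketched numerically. The function names below are mine, not the book's; the figures are those of examples 2.6 and 2.7.

```python
import numpy as np
from scipy.stats import f

def multiple_corr_sq(R, dep, cond):
    """Squared empirical multiple correlation between variable `dep` and the
    variables in `cond`, from a correlation matrix R, via the determinant
    formula  R^2 = 1 - det(R_full) / det(R_cond)."""
    idx = [dep] + list(cond)
    R_full = R[np.ix_(idx, idx)]
    R_cond = R[np.ix_(cond, cond)]
    return 1.0 - np.linalg.det(R_full) / np.linalg.det(R_cond)

def f_statistic(R2, n, q):
    """Test statistic of theorem 2.23; q = p - m is the number of
    conditioning variables.  Under H0 it follows F(q, n - q - 1)."""
    return R2 / (1.0 - R2) * (n - q - 1) / q

# Numbers from example 2.7: R^2 = 0.0565, n = 51, two conditioning variables.
F = f_statistic(0.0565, 51, 2)
crit = f.ppf(0.90, 2, 48)
print(round(F, 2), round(crit, 2), F > crit)  # 1.44 2.42 False: accept H0
```

For an uncorrelated set of variables (R the identity matrix) the function returns 0, as it should.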
2.3.3 Regression
We start the section with some remarks on errors of estimators and predictors. If we consider an estimator θ̂ of an unknown parameter θ (a fixed number), the Mean Squared Error of θ̂ is

MSE(θ̂) = MSE(θ̂, θ) = E[(θ̂ − θ)²] = V(θ̂) + (E(θ̂) − θ)²,

i.e. MSE(θ̂) is equal to the variance of θ̂ plus the squared bias of θ̂, thus relating the MSE to the precision (low variance) and the accuracy (low bias) of the estimation. If the estimator is unbiased we see that the MSE equals the variance of the estimator.
For a predictor Ŷ = g(X) of a random variable Y the Mean Squared (Prediction) Error is

MSE(Ŷ) = MSE(g(X), Y) = E[(Ŷ − Y)²] = E[(g(X) − Y)²],

where the mean is taken with respect to the joint distribution of Y and g(X). If Ŷ (= g(X)) and Y have the same mean, we obtain

MSE(Ŷ) = V(Ŷ − Y) = V(Y) + V(Ŷ) − 2 Cov(Y, Ŷ).
A reasonable condition for finding a good predictor is of course to minimise the MSE of the predictor. The solution is given in the following theorem.

THEOREM 2.24. The Minimum Mean Squared (Prediction) Error predictor of Y based on X is

Ŷ = g(X) = E(Y|X).

Since E(E(Y|X)) = E(Y), the prediction variance is

V(Y − Ŷ) = E(V(Y|X)).

PROOF. A consequence of basic results on conditional means.
In the case of normally distributed random variables we use the term regression for the above conditional mean. More specifically we proceed as follows.

Let (X', Y)' be a stochastic vector. By the term regression of Y on X we mean the function given by

g(x) = E(Y|X = x),

i.e. the conditional mean as a function of the conditioning variable.

Let (X', Y)' be normally distributed with parameters

μ = (μ_x', μ_y)'  and  Σ = [Σ_xx Σ_xy; Σ_yx σ_yy].

Then theorem 2.16 shows that

g(x) = E(Y|X = x) = μ_y + Σ_yx Σ_xx⁻¹ (x − μ_x),
i.e. the regression is linear (affine). The prediction variance is (since V(Y|X) is independent of X and therefore E(V(Y|X)) = V(Y|X))

V(Y − g(X)) = σ_yy − Σ_yx Σ_xx⁻¹ Σ_xy.
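The conditional-mean formula above is easy to evaluate numerically. Below is a small sketch; the function name and the parameter values are my own illustration, not the book's.

```python
import numpy as np

def regression_of_y_on_x(mu_x, mu_y, S_xx, S_yx, x):
    """Conditional mean E(Y | X = x) = mu_y + S_yx S_xx^{-1} (x - mu_x)."""
    return mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x)

# A small illustration with a 2-dimensional X and a 1-dimensional Y
# (all numbers made up):
mu_x = np.array([1.0, 2.0])
mu_y = np.array([3.0])
S_xx = np.array([[2.0, 0.5], [0.5, 1.0]])
S_yx = np.array([[0.4, 0.3]])

# At x = mu_x the regression passes through mu_y:
print(regression_of_y_on_x(mu_x, mu_y, S_xx, S_yx, mu_x))
```

Using `np.linalg.solve` instead of explicitly inverting Σ_xx is the standard numerically stable way of evaluating the formula.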
We now specialise to two dimensions.

Let (X, Y)' be normally distributed with parameters

μ = (μ_x, μ_y)'  and  Σ = [σ_x² ρσ_xσ_y; ρσ_xσ_y σ_y²].

Then the regression of Y on X is given by

E(Y|X = x) = μ_y + ρ (σ_y/σ_x) (x − μ_x),

and the regression of X on Y is given by

E(X|Y = y) = μ_x + ρ (σ_x/σ_y) (y − μ_y).

Let us assume that we have measurements (X₁, Y₁)', …, (X_n, Y_n)'. The maximum likelihood estimates for the slopes are obtained by using the maximum likelihood estimators for the parameters in the formula. Then

ρ̂ = Σ(X_i − X̄)(Y_i − Ȳ) / √(Σ(X_i − X̄)² Σ(Y_i − Ȳ)²),
σ̂_x² = (1/n) Σ(X_i − X̄)²,  σ̂_y² = (1/n) Σ(Y_i − Ȳ)²,

and we see e.g. that the estimate of the slope in the expression for the regression of Y on X becomes

ρ̂ σ̂_y/σ̂_x = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)² = SP_xy/SS_x.
This gives the empirical regression equation

Ê(Y|X = x) = Ȳ + (SP_xy/SS_x)(x − X̄),
i.e. precisely the same result as we obtain in the one-dimensional linear regression analysis. However, there the assumptions are completely different, since we assume that the values of the independent variable are deterministic. In the present context we assume that they are observations of a normally distributed variable which is correlated with the dependent variable. Concerning the estimation it is not important which of the two models one works with, but the interpretation of the results is of course dependent hereon. We now continue with example 2.8.
EXAMPLE 2.8. In this example we will determine the linear relations from a measurement by one of the two methods stated in example 2.2 to the other measurement. We find the regressions

x̂₂ = x̄₂ + ρ̂ (s₂/s₁)(x₁ − x̄₁) = 0.58x₁ + 10.14

and

x̂₁ = x̄₁ + ρ̂ (s₁/s₂)(x₂ − x̄₂) = 0.65x₂ + 13.43.

These lines are shown in figure 2.5. If we wish to check if there might be some sort of relation between X₁ and X₂ we can examine the correlation coefficient. It has been found to be

ρ̂ = 182/√(311 · 279) = 0.617,

i.e.

ρ̂² = 0.380.

The test statistic for a test of the hypothesis ρ = 0 is, cf. section 2.3.1,

t = (0.617/√(1 − 0.380)) √(20 − 2) = 3.32 > t(18)₀.₉₉₅.
Using a test at level α ≥ 1% we must reject the hypothesis, and we assume that ρ ≠ 0, i.e. that the correlation is different from 0. If we now assume that there is a linear relationship between the methods of measurement in the two cases, it is estimated by the two regressions.

Figure 2.5: The observations, the two regression lines x̂₂ = 0.58x₁ + 10.14 and x̂₁ = 0.65x₂ + 13.43, and a contour ellipse with its main axes.

We can then find estimates of the errors etc. in the usual fashion.
In the figure we have also shown a contour ellipse and its main axes. It can be shown that the first axis is the line which is obtained by minimising the orthogonal squared distance to the points. On the other hand the regression equations are found by minimising the vertical and horizontal distances respectively. The first main axis is therefore also called the orthogonal regression. In chapter 4 we will return to this concept.
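The quantities of example 2.8 can be recomputed directly from the summary statistics. This is my own numeric sketch (the variable names are not from the book); because the book rounds intermediate values, the last digits may differ slightly from the printed ones.

```python
import numpy as np

# Summary statistics as used in example 2.8 (n = 20 paired measurements):
# empirical variances of x1 and x2, and their empirical covariance.
n, ss1, ss2, sp = 20, 311.0, 279.0, 182.0

slope_2_on_1 = sp / ss1        # slope in the regression of x2 on x1 (~0.58)
slope_1_on_2 = sp / ss2        # slope in the regression of x1 on x2 (~0.65)
r = sp / np.sqrt(ss1 * ss2)    # empirical correlation coefficient (~0.617)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)   # test of rho = 0, t(n - 2)

print(round(slope_2_on_1, 3), round(slope_1_on_2, 3), round(r, 3), round(t, 2))
```

Note that the two regression slopes multiply to ρ̂² = 0.380, which is one way of seeing that the two regression lines differ unless |ρ̂| = 1.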
2.4 The partition theorem
In this section we will consider a stochastic variable X ∈ N(μ, Σ), where Σ is regular of order n. We will consider the inner product defined by Σ⁻¹ and the corresponding norm, i.e.

⟨x, y⟩ = x'Σ⁻¹y

and

‖x‖² = x'Σ⁻¹x.
Now let the subspaces U₁, …, U_k be orthogonal (with respect to this inner product), so that

Rⁿ = U₁ ⊕ ⋯ ⊕ U_k.

We let dim U_i = n_i and call the projection onto U_i p_i. The corresponding projection matrix is called C_i. Using the notation mentioned above the following is valid.

THEOREM 2.25 (The partition theorem). If we let

Y_i = p_i(X − μ),  i = 1, …, k,

and

K_i = ‖Y_i‖²,  i = 1, …, k,

then

X − μ = Y₁ + ⋯ + Y_k

and

‖X − μ‖² = Σ_{i=1}^k K_i.

Furthermore Y₁, …, Y_k are stochastically independent and normally distributed, and K₁, …, K_k are stochastically independent and χ²(n_i) distributed variables.
PROOF. We have that Y_i = C_i(X − μ). Therefore

(Y₁', …, Y_k')' = (C₁', …, C_k')'(X − μ).

From this we obtain that (Y₁', …, Y_k')' is normally distributed, since it is a linear transformation of a normally distributed variable.

Figure 2.6: Illustration in R³ with dim U₁ = 2 and dim U₂ = 1. Here the theorem says that Y₁ and Y₂ are independent, and that ‖Y₁‖² ∈ χ²(2) and ‖Y₂‖² ∈ χ²(1).
Now for i ≠ j it follows from the lemma on page 54 that

Cov(Y_i, Y_j) = C_i Σ C_j' = 0.

From this it follows that the components Y₁, …, Y_k are stochastically independent (because their joint distribution is normal).

We must now determine the distribution of ‖p_i(X − μ)‖². We have that X can be written

X = μ + A Z,

where Z ∈ N(0, I) and A A' = Σ. From this it follows that p_i(X − μ) = C_i A Z. Now

‖p_i(A Z)‖² = ‖C_i A Z‖² = Z'A'C_i'Σ⁻¹C_i A Z = Z'D_i Z,

where D_i = A'C_i'Σ⁻¹C_i A. Since

D_i D_i = A'C_i'Σ⁻¹C_i Σ C_i'Σ⁻¹C_i A = A'C_i'Σ⁻¹C_i A = D_i,

D_i is idempotent. In the above we have used lemma 1.1 repeatedly. It is obvious that rk(D_i) = n_i. Now, since

D_i = (A⁻¹C_i A)'(A⁻¹C_i A),

D_i is positive semidefinite (cf. theorem 1.16 p. 38); therefore there exists an orthogonal (and even orthonormal) matrix P (theorem 1.10) so that

P'D_i P = Λ_i,

where Λ_i is a diagonal matrix with rank n_i. Since D_i is idempotent we obtain Λ_i = Λ_i². Therefore Λ_i has n_i 1's and n − n_i 0's on the diagonal. Therefore

Z'D_i Z = Z'P Λ_i P'Z = (P'Z)' Λ_i (P'Z) = V'Λ_i V = V₁² + ⋯ + V_{n_i}²,

where V = P'Z and only the n_i components corresponding to the 1's in Λ_i contribute. Since V ∈ N(0, P'P) = N(0, I) it is seen that

Z'D_i Z ∈ χ²(n_i).
EXAMPLE 2.9. Let X₁, …, X_n be independent and N(μ, σ²) distributed. Then

X = (X₁, …, X_n)' ∈ N(μ1, σ²I),

where 1 = (1, …, 1)'. We consider the subspace U₁ given by

U₁ = {a 1 | a ∈ R},

and the orthogonal subspace to U₁ (with respect to σ²I), called U₂. (This concept of orthogonality corresponds to the usual one.) Now the identity

x = (x̄, …, x̄)' + (x₁ − x̄, …, x_n − x̄)'

shows that the projection onto U₁ is given by

p₁(x) = (x̄, …, x̄)',

which means that the projection onto U₂ is

p₂(x) = (x₁ − x̄, …, x_n − x̄)'.

Figure 2.7: Geometrical illustration of the partitioning in example 2.9.

Since dim U₁ = 1 and dim U₂ = n − 1 we find from the partition theorem that

p₁(X − μ1)  and  p₂(X − μ1)

are stochastically independent, p₁(X − μ1) is normally distributed, and ‖p₂(X − μ1)‖² is χ²(n − 1) distributed. Since

p₁(X − μ1) = (X̄ − μ)1

and

‖p₂(X − μ1)‖² = (1/σ²) Σ(X_i − X̄)²,

we again find the results on the distributions of X̄ and (n − 1)S²/σ² = (1/σ²) Σ(X_i − X̄)².
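The algebra behind example 2.9 can be verified mechanically. The sketch below (my own illustration, not the book's code) builds the two projection matrices C₁ = (1/n)11' and C₂ = I − C₁ and checks idempotency, mutual orthogonality, and the rank partition n = 1 + (n − 1).

```python
import numpy as np

n = 5
ones = np.ones((n, 1))
C1 = ones @ ones.T / n          # projection onto U1 = span{(1,...,1)'}
C2 = np.eye(n) - C1             # projection onto the orthogonal complement U2

# Idempotent, mutually orthogonal, ranks (traces) add up to n:
assert np.allclose(C1 @ C1, C1) and np.allclose(C2 @ C2, C2)
assert np.allclose(C1 @ C2, np.zeros((n, n)))
assert round(np.trace(C1)) == 1 and round(np.trace(C2)) == n - 1

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
assert np.allclose(C1 @ x, np.full(n, x.mean()))     # p1(x) = (x_bar, ..., x_bar)'
assert np.allclose(np.sum((C2 @ x) ** 2), np.sum((x - x.mean()) ** 2))
print("partition checks passed")
```

The last assertion is exactly the identity ‖p₂(x)‖² = Σ(x_i − x̄)² used above (here with σ² = 1).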
2.5 The Wishart distribution and the
generalised variance
In the one-dimensional case a number of sample distributions are derived from the normal distribution. The most important of these is the χ² distribution, which corresponds to the sum of squared normally distributed data. Its multidimensional analogue is the Wishart distribution.
We give the definition by means of the density in

DEFINITION 2.3. Let V be a continuously distributed random p × p matrix, which is symmetric and positive semidefinite with probability 1. Then V is said to be Wishart distributed with parameters (n, Σ), (n ≥ p), if the density of V is

f(v) = (1/c) [det(v)]^((n−p−1)/2) exp(−½ tr(Σ⁻¹v))

for v positive definite and 0 otherwise. Here Σ is a positive definite p × p matrix, and c is the constant given by

c = 2^(np/2) π^(p(p−1)/4) [det(Σ)]^(n/2) ∏_{i=1}^p Γ((n + 1 − i)/2).

Abbreviated we write

V ∈ W(n, Σ) = W_p(n, Σ),

where the latter version is used whenever there is doubt about the dimension.
We now give a remark about the mean and variance of the components in a Wishart distribution.

THEOREM 2.26. Let V = (V_ij) be Wishart distributed W(n, Σ), where Σ = (σ_ij). Then it holds that

E(V_ij) = n σ_ij,
V(V_ij) = n (σ_ij² + σ_ii σ_jj),
Cov(V_ij, V_kl) = n (σ_ik σ_jl + σ_il σ_jk).

PROOF. Omitted.
The analogy with the χ² distribution is seen in

THEOREM 2.27. Let X_i ∈ N_p(0, Σ), i = 1, …, n, be independent and regularly distributed. Then for n ≥ p it holds that

Y = Σ_{i=1}^n X_i X_i' ∈ W(n, Σ).

PROOF. Omitted.

REMARK 2.13. If n < p then Y as it is defined in the theorem does not have a density function. However, we still choose to say that Y is Wishart distributed with parameters (n, Σ). Corresponding remarks hold if Σ is singular. Using this convention the theorem holds without the restriction n ≥ p.
A nearly trivial implication of the above now is

THEOREM 2.28. Let V₁, …, V_k be independent random p × p matrices, which are W(n_i, Σ) distributed. Then it holds that

V₁ + ⋯ + V_k ∈ W(n₁ + ⋯ + n_k, Σ).
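The moment formulas of theorem 2.26 can be checked by Monte Carlo via the representation of theorem 2.27 (Y = Σ X_i X_i'). The following is a simulation sketch under made-up parameter values, not a derivation; the tolerances are loose enough that the assertions hold with overwhelming probability for the fixed seed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_df, p = 10, 2
Sigma = np.array([[2.0, 1.0], [1.0, 3.0]])
A = np.linalg.cholesky(Sigma)           # A A' = Sigma

# Build W(n, Sigma) variables as sums of outer products X_i X_i':
m = 20000
V12 = np.empty(m)
for j in range(m):
    X = A @ rng.standard_normal((p, n_df))   # n_df independent N(0, Sigma) columns
    V = X @ X.T
    V12[j] = V[0, 1]

# Theorem 2.26 predicts E(V12) = n*sigma12 = 10 and
# V(V12) = n*(sigma12^2 + sigma11*sigma22) = 10*(1 + 6) = 70.
print(round(V12.mean(), 1), round(V12.var(), 1))
```

The additivity of theorem 2.28 is implicit here: each V is itself a sum of n independent W(1, Σ) matrices.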
One of the main theorems in the theory of sampling functions of normally distributed random variables is that X̄ and S² are independent and that S² follows a σ²χ²(f)/f distribution, with f equal to one less than the number of observations. This theorem has its multidimensional analogue in

THEOREM 2.29. Let X₁, …, X_n be independent and N_p(μ, Σ) distributed. We let

X̄ = (1/n) Σ_{i=1}^n X_i

and

S = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)(X_i − X̄)'.

Then

X̄ ∈ N_p(μ, (1/n)Σ)

and

S ∈ W(n − 1, (1/(n−1))Σ).

Furthermore, X̄ and S are stochastically independent.

PROOF. Omitted.
We will now consider some results on marginal distributions. We have

THEOREM 2.30. Let V be Wishart distributed with parameters (n, Σ). We consider the partitionings

V = [V₁₁ V₁₂; V₂₁ V₂₂]  and  Σ = [Σ₁₁ Σ₁₂; Σ₂₁ Σ₂₂].

It then holds that

V₁₁ ∈ W(n, Σ₁₁).

Further, it holds that

V₂₂ ∈ W(n, Σ₂₂).

THEOREM 2.31. We again consider the above situation. If Σ₁₂ and Σ₂₁ are 0 matrices, then V₁₁ and V₂₂ are stochastically independent.

PROOF (of both theorems). They follow by considering the corresponding partitionings of the normally distributed vectors underlying the Wishart distributions.
Since the multidimensional normal distribution can be defined independently of the coordinate system, it is not surprising that something similar holds for the Wishart distribution. Because a change from coordinates in one coordinate system to coordinates in another is performed by multiplying with matrices, we have the following

THEOREM 2.32. Let V ∈ W_p(n, Σ) and let A be an arbitrary fixed r × p matrix. Then

A V A' ∈ W_r(n, A Σ A').

PROOF. As indicated above, one just has to consider the normally distributed vectors which result in V and then transform them. The result then follows readily.
We now conclude the chapter by introducing a different generalisation of the one-dimensional variance to the multidimensional case than the variance-covariance matrix.

DEFINITION 2.4. Let the p-dimensional vector X have the variance-covariance matrix Σ. By the term the generalised variance of X we mean the determinant of the variance-covariance matrix, i.e.

gen.var(X) = det(Σ).

REMARK 2.14. In section 1.2.6 we established that the determinant of a matrix corresponds to the volume relationship of the corresponding linear mapping, i.e. it is an intuitively sensible measure of the "size" of a matrix.
If we have observations X₁, …, X_n, then we define the empirical generalised variance in a straightforward way from the empirical variance-covariance matrix

S = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)(X_i − X̄)',

as det(S).
In the normal case we can establish the distribution of the empirical generalised variance.

THEOREM 2.33. Let X_i ∈ N_p(μ, Σ), i = 1, …, n, be stochastically independent. Then the empirical generalised variance det(S) follows the same distribution as

det(Σ)/(n−1)^p · Z₁ ⋯ Z_p,

where Z₁, …, Z_p are stochastically independent and Z_i ∈ χ²(n − i).

PROOF. Omitted.
For p = 1 and 2 it is possible to find the density of the empirical generalised variance. However, for larger values of p this density involves integrals which cannot readily be written in terms of known functions, but for n → ∞ we do have

THEOREM 2.34. Let S be as above (in the normal case). Then

√(n − 1) (det(S)/det(Σ) − 1)

is asymptotically N(0, 2p) distributed.

PROOF. Omitted.
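The empirical generalised variance of definition 2.4 is straightforward to compute; the sketch below (my own helper, not from the book) also illustrates its geometric meaning: for perfectly collinear data the scatter has zero "volume" and det(S) = 0, while for uncorrelated coordinates det(S) is the product of the marginal variances.

```python
import numpy as np

def gen_var(X):
    """Empirical generalised variance det(S) of the rows of X."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / (X.shape[0] - 1)
    return np.linalg.det(S)

# Perfectly collinear data span a line, so det(S) = 0:
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
print(gen_var(X))

# Empirically uncorrelated coordinates: det(S) = product of marginal variances
# (here 4/5 * 18/5 = 2.88):
Y = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 0.0], [2.0, 0.0],
              [1.0, 3.0], [1.0, -3.0]])
print(gen_var(Y))
```
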
Chapter 3
The general linear model
In this chapter we will formulate a model which is a natural generalisation of the analysis of variance and regression analysis models known from introductory statistics. The theorems and definitions will to a large extent be interpreted geometrically in order to give a more intuitive understanding of the problems.
3.1 Estimation in the general linear model
We first give a description of the model in
3.1.1 Formulation of the model

We consider an n-dimensional stochastic variable Y ∈ N(μ, σ²Σ), where Σ is assumed known. Consider the norm given by Σ⁻¹, i.e.

‖y‖² = y'Σ⁻¹y.

The norm defined by the inverse variance-covariance matrix is given by

y'(σ²Σ)⁻¹y = (1/σ²) y'Σ⁻¹y.

The two norms are seen to be proportional, and they result in the same concept of orthogonality. We will consider a number of problems in connection with the estimation and testing of the mean value μ in cases where μ is a known linear function of unknown parameters, i.e.

μ = xθ,

or

μ_i = x_{i1}θ₁ + ⋯ + x_{ik}θ_k,  i = 1, …, n,

where x is assumed known.

Geometrically this can be expressed by assuming that the expected value of the stochastic vector Y is contained in a subspace M of Rⁿ. M is the image of Rᵏ under the linear mapping x. The dimension of M is rk(x) ≤ k. The situation is depicted in the following figure.
Figure 3.1: Geometrical sketch of the general linear model.
We will call such a model, where the unknown mean value μ is a (known) linear function of the parameter θ, a (general) linear model. This is also valid without the assumption that Y is normally distributed.

EXAMPLE 3.1. Consider an ordinary one-dimensional regression analysis model, i.e. we have observations

Y_i = α + βx_i + ε_i,  i = 1, …, n,

where E(ε_i) = 0. This model can be written

(Y₁, …, Y_n)' = x (α, β)' + ε  with  x = [1 x₁; …; 1 x_n],

or

Y = xθ + ε,

i.e. the model is linear in the meaning stated above.
Another example is

EXAMPLE 3.2. We now consider a situation where

Y_i = α + βx_i + γ ln x_i + ε_i,  i = 1, …, n,

and still we have E(ε_i) = 0. Even in this case we have a linear model, which is seen by letting θ = (α, β, γ)' and taking the i'th row of x to be (1, x_i, ln x_i).

We note that the term linear has nothing to do with E(Y|x) = α + βx + γ ln x being linear in the independent variable x; rather it means that E(Y|x), considered as a function of the unknown parameters (α, β, γ)', should be linear. If we had had a model such as

Y_i = α + βx_i + γ x_i^δ + ε_i,

where α, β, γ and δ are the unknown parameters, it would not be possible to write the mean value as xθ with a known x matrix, and we would therefore not have a linear model.
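The point of example 3.2 is that the design matrix may contain arbitrary known transformations of x. A small sketch (with made-up parameter values) that builds the (1, x_i, ln x_i) design and recovers the parameters from noise-free data:

```python
import numpy as np

# Model of example 3.2: Y = alpha + beta*x + gamma*ln(x) + eps is linear
# in the parameters; each row of the design matrix is (1, x_i, ln x_i).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x, np.log(x)])

alpha, beta, gamma = 1.0, 2.0, 3.0           # made-up "true" values
y = alpha + beta * x + gamma * np.log(x)     # noise-free for the illustration

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(theta_hat, 6))                # recovers (1, 2, 3)
```

For the nonlinear model with the unknown exponent δ no such design matrix exists, so least squares would have to be replaced by iterative nonlinear optimisation.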
3.1.2 Estimation in the regular case

We will first formulate the result on the estimation of θ in

THEOREM 3.1. Let x and θ be given as in the preceding section and let Y ∈ N_n(xθ, σ²Σ), where Σ is positive definite. Then the maximum likelihood estimator θ̂ of θ is given by xθ̂ being the projection (with respect to Σ⁻¹) of Y onto M; θ̂ is a solution to the so-called normal equation(s)

x'Σ⁻¹x θ̂ = x'Σ⁻¹Y.

If x has full rank k, then

θ̂ = (x'Σ⁻¹x)⁻¹ x'Σ⁻¹Y,

and since θ̂ is a linear combination of normally distributed variables, it is also normally distributed, with parameters

E(θ̂) = θ,
D(θ̂) = σ²(x'Σ⁻¹x)⁻¹.

It is especially noted that θ̂ is an unbiased estimate of θ.

PROOF. If Y ∈ N(xθ, σ²Σ), where Σ is regular, then the density for Y is

f(y) = (2πσ²)^(−n/2) det(Σ)^(−1/2) exp(−(1/(2σ²))(y − xθ)'Σ⁻¹(y − xθ))
     = k σ^(−n) exp(−(1/(2σ²)) ‖y − xθ‖²).
We have the likelihood function L(θ) = f(y); taking the logarithm on each side gives

ln L(θ) = k₁ − (1/(2σ²)) ‖y − xθ‖².

It is now evident that maximisation of the likelihood function is equivalent to minimisation of the squared distance between points of M and the observation, i.e. equivalent to minimisation of

‖y − xθ‖².

From the result p. 52 the value of xθ̂ giving the minimum is equal to the orthogonal projection (with respect to Σ⁻¹) of y onto M. From example 1.8 p. 48 the optimal θ̂ is the solution to the equation

x'Σ⁻¹x θ̂ = x'Σ⁻¹y.

If x'Σ⁻¹x has full rank k, i.e. if x has rank k (cf. p. 35), we therefore have

θ̂ = (x'Σ⁻¹x)⁻¹ x'Σ⁻¹y.

We have now shown the first half of the theorem. From theorem 2.2 we find that

E(θ̂) = (x'Σ⁻¹x)⁻¹ x'Σ⁻¹ E(Y) = (x'Σ⁻¹x)⁻¹ x'Σ⁻¹xθ = θ.

And from theorem 2.5 we find

D(θ̂) = (x'Σ⁻¹x)⁻¹ x'Σ⁻¹ (σ²Σ) Σ⁻¹x (x'Σ⁻¹x)⁻¹ = σ²(x'Σ⁻¹x)⁻¹.

The situation is illustrated in the following figure 3.2.
REMARK 3.1. We note that θ̂ is estimated by minimising the squared distance onto M; θ̂ is therefore also the least squares estimate of θ. If we do not have the distributional assumption, we will often be able to use the estimator θ̂ in theorem 3.1 as an estimate of θ. It can be shown that the least squares estimator θ̂ has the least generalised variance among all the estimators that are linear functions of the observations, cf. [13]. We also say that the least squares estimator is a best linear unbiased estimator.
Figure 3.2: Geometric sketch of the problem of estimation in the general linear model.
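Theorem 3.1 (and the unbiased variance estimate introduced below) can be sketched as a small function. The names are mine; for Σ = I it reduces to ordinary least squares, which the demo below checks by hand.

```python
import numpy as np

def gls_fit(x, y, Sigma):
    """Estimation in the general linear model of theorem 3.1:
    theta_hat = (x' Sigma^-1 x)^-1 x' Sigma^-1 y, plus the unbiased
    estimate of sigma^2 based on the residual sum of squares."""
    Si = np.linalg.inv(Sigma)
    theta = np.linalg.solve(x.T @ Si @ x, x.T @ Si @ y)
    resid = y - x @ theta
    ss_res = resid @ Si @ resid                   # ||y - x theta||^2 in the Sigma^-1 norm
    s2 = ss_res / (x.shape[0] - np.linalg.matrix_rank(x))
    return theta, s2

# With Sigma = I this is ordinary least squares:
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 4.0])
theta, s2 = gls_fit(x, y, np.eye(3))
print(theta, s2)   # theta = (4/3, 7/3), s2 = 1/3
```

The dispersion of the estimator, σ²(x'Σ⁻¹x)⁻¹, can be estimated by replacing σ² with the returned s².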
Since σ² is often unknown we will now find estimators for it. We have

THEOREM 3.2. Let the situation be as above. The maximum likelihood estimator of σ² is

σ̂² = (1/n) ‖Y − xθ̂‖² = (1/n) (Y − xθ̂)'Σ⁻¹(Y − xθ̂).

The unbiased estimator of σ² is

s² = (1/(n − rk x)) ‖Y − xθ̂‖²,

where xθ̂ is the maximum likelihood estimator of E(Y). The following holds:

‖Y − xθ̂‖² ∈ σ²χ²(n − rk x),

and σ̂² is independent of the maximum likelihood estimator of the expected value and is therefore independent of θ̂.
PROOF. The likelihood function is

L(θ, σ²) = k σ^(−n) exp(−(1/(2σ²)) ‖y − xθ‖²),

and

ln L(θ, σ²) = k₁ − (n/2) ln σ² − (1/(2σ²)) ‖y − xθ‖².

Now

∂ ln L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) ‖y − xθ‖² = −(n/(2σ⁴)) (σ² − (1/n)‖y − xθ‖²).

After differentiating with respect to θ we get the ordinary system of normal equations. We therefore find that the maximum likelihood estimates (θ̂, σ̂²) of (θ, σ²) are solutions to

x'Σ⁻¹x θ̂ = x'Σ⁻¹y  and  σ̂² = (1/n) ‖y − xθ̂‖².
If we consider the partitioning of Rⁿ as the direct sum of M and M⊥, where M⊥ is the orthogonal complement (with respect to Σ⁻¹) of M, we get that

p_M(Y − xθ) = xθ̂ − xθ

and

Y − xθ̂ = p_{M⊥}(Y − xθ)

are stochastically independent and that

‖Y − xθ̂‖² ∈ σ²χ²(dim M⊥) = σ²χ²(n − rk x).

From this we especially get

E(σ̂²) = ((n − rk x)/n) σ²,

i.e. the likelihood estimator of σ² is not unbiased. If we want an unbiased estimate we can obviously use

s² = (1/(n − rk x)) ‖Y − xθ̂‖².

Most often we will be using the unbiased estimate of σ², and we will therefore use the notation σ̂² for it in the following.
REMARK 3.2. If Σ is the identity matrix, then ‖y‖² = Σ_i y_i². So in this case we have

σ̂² = (1/(n − rk x)) Σ_{i=1}^n (Y_i − Ê(Y_i))²,

where Ê(Y_i) = (xθ̂)_i.

DEFINITION 3.1. The deviation

r_i = Y_i − (xθ̂)_i

between the i'th observation and its estimated value Ê(Y_i) = (xθ̂)_i is called the i'th residual. The squared distance between the observation and the estimated model is

SSR = SS_res = ‖Y − xθ̂‖² = (Y − xθ̂)'Σ⁻¹(Y − xθ̂).

In the case Σ = I we see that SSR is the sum of the squared residuals, and also in the general case (if misunderstandings do not occur) we will denote this as the residual sum of squares.
Before we go on we will give a small example for the purpose of illustration.

EXAMPLE 3.3. In the production of a certain synthetic product two raw materials A and B are mainly used. The quality of the end product can be described by a stochastic variable which is normally distributed with mean value μ and variance σ². The mean value is known to depend linearly on the added amounts of A and B respectively, i.e.

μ = θ_A x_A + θ_B x_B,

where x_A is the added amount of A and x_B is the corresponding added amount of B. σ² is assumed to be independent of the added amounts of raw materials. For the determination of θ_A and θ_B three experiments were performed after the following plan.

Table: the experimental plan (the amounts x_A and x_B added in experiments 1, 2 and 3).
The single experiments are assumed to be stochastically independent. The simultaneous distribution of the experimental results Y₁, Y₂, Y₃ is then a three-dimensional normal distribution with mean value

E(Y) = x (θ_A, θ_B)'

and variance-covariance matrix σ²I.

From the plan we compute x'x and its inverse (x'x)⁻¹, together with x'y, and obtain the estimate

(θ̂_A, θ̂_B)' = (x'x)⁻¹ x'y.

Inserting the observed values of Y₁, Y₂, Y₃ gives the observed estimates θ̂_A and θ̂_B, and from these we easily find the fitted values xθ̂ and the residuals y − xθ̂.
This gives the residual sum of squares

(y − xθ̂)'(y − xθ̂) = 25 + 25 + 100 = 150.

Alternatively we may compute

(xθ̂)'(xθ̂) = 14475  and  y'y = 14625,

and obtain the sum of squared residuals as the difference between these, i.e. 14625 − 14475 = 150. In either case we obtain that an unbiased estimate of σ² is

s² = (1/(3 − 2)) · 150 = 150.
3.1.3 The case of x'Σ⁻¹x singular

If rk(x) = p < k then x'Σ⁻¹x is singular and we cannot find a unique solution to the normal equation. However, if we have a pseudoinverse (x'Σ⁻¹x)⁻, then we can write a solution as

θ̂ = (x'Σ⁻¹x)⁻ x'Σ⁻¹y.

However, sometimes it is possible to use a little trick in the determination of the pseudoinverse. The reason for the singularity is that we have too many parameters. It would therefore be reasonable to restrict θ to vary freely only in a (side-)subspace of Rᵏ; we will only consider parameter values which lie in this side subspace. One such subspace could e.g. be determined by θ satisfying the linear equations (restrictions)

bθ = c,

i.e.

b₁₁θ₁ + ⋯ + b₁ₖθₖ = c₁
⋮
b_{m1}θ₁ + ⋯ + b_{mk}θₖ = c_m.
If there exist θ's that satisfy this equation system, then they span a side subspace of dimension k − rk(b). Since rk(x) = p and we have k components in θ, it would be reasonable to remove k − p of these, i.e. impose the restriction k − rk(b) = p, or k = p + rk(b). Now if the matrix obtained by stacking x on top of b has full rank, i.e.

rk [x; b] = k,

we can consider the model

[Y; c] = [x; b] θ + [e; 0].

We let

D = [Σ⁻¹ 0_{n,m}; 0_{m,n} I_{m,m}] = [Σ⁻¹ 0; 0 I],

where the short notation should not cause confusion.

If we in the usual way compute

θ̂ = {[x' b'] D [x; b]}⁻¹ {[x' b'] D [Y; c]} = {x'Σ⁻¹x + b'b}⁻¹ {x'Σ⁻¹Y + b'c},

then we have a quantity which minimises

g(θ) = {[y; c] − [x; b]θ}' D {[y; c] − [x; b]θ}
     = (y − xθ)'Σ⁻¹(y − xθ) + (c − bθ)'(c − bθ)
     ≥ (y − xθ)'Σ⁻¹(y − xθ),

with equality when bθ = c.
Since this is exactly the same quantity we must minimise in order to find the ML estimates, we find that

θ̂ = (x'Σ⁻¹x + b'b)⁻¹ (x'Σ⁻¹Y + b'c)

really is the maximum likelihood estimator of θ. The only requirement is that we can find a matrix b such that the stacked matrix [x; b] has full rank; this corresponds to restricting θ's region of variation.

The variance-covariance matrix of θ̂ becomes

D(θ̂) = σ² (x'Σ⁻¹x + b'b)⁻¹ x'Σ⁻¹x (x'Σ⁻¹x + b'b)⁻¹.

This expression is found immediately by using theorem 2.5.

As before, the unbiased estimate of σ² is

σ̂² = (1/(n − rk x)) ‖Y − xθ̂‖².

Here we have n − rk x = n − k + rk b.
First we give a little theoretical

EXAMPLE 3.4. Consider a very simple one-sided analysis of variance with two groups and two observations in each group. We could imagine that we were examining the effect of a catalyst on the results of some process. We therefore conduct four experiments, two with the catalyst at level A and two with the catalyst at level B. We therefore have the following observations:

level A: Y₁₁, Y₁₂
level B: Y₂₁, Y₂₂.

If we assume that the observations are stochastically independent and have mean values

E(Y₁₁) = E(Y₁₂) = θ₁
E(Y₂₁) = E(Y₂₂) = θ₂,

then we can express the model as

Y = xθ + e  with  x = [1 0; 1 0; 0 1; 0 1].

We easily find that

x'x = [2 0; 0 2],

and

θ̂ = (x'x)⁻¹ x'y = (Ȳ₁., Ȳ₂.)',
which are the usual estimators. If we instead use the (commonly used) parametrisation

E(Y₁₁) = E(Y₁₂) = μ + α₁
E(Y₂₁) = E(Y₂₂) = μ + α₂,

i.e. we express the effect of a catalyst as a general level plus the specific effect of that catalyst, then we have

x = [1 1 0; 1 1 0; 1 0 1; 1 0 1].

It is easily seen that x has rank 2 (the sum of the last two columns equals the first). We will therefore try to introduce a linear restriction between the parameters. We will try with

α₁ + α₂ = 0,  i.e.  b = (0, 1, 1), c = 0.

We can now formally introduce the model

[Y₁₁; Y₁₂; Y₂₁; Y₂₂; 0] = [1 1 0; 1 1 0; 1 0 1; 1 0 1; 0 1 1] (μ, α₁, α₂)' + [e; 0],

or

[Y; 0] = [x; b] θ + [e; 0].
We now have that

x'x + b'b = [4 2 2; 2 2 0; 2 0 2] + [0 0 0; 0 1 1; 0 1 1] = [4 2 2; 2 3 1; 2 1 3].

The inverse of this matrix is

(x'x + b'b)⁻¹ = [1/2 −1/4 −1/4; −1/4 1/2 0; −1/4 0 1/2].

Now, since

[x' b'] [y; 0] = x'y = [Σ Y_ij; Y₁₁ + Y₁₂; Y₂₁ + Y₂₂],

we have

θ̂ = (x'x + b'b)⁻¹ x'y = [(1/4) Σ Y_ij; (1/2)(Y₁₁ + Y₁₂) − (1/4) Σ Y_ij; (1/2)(Y₂₁ + Y₂₂) − (1/4) Σ Y_ij] = [Ȳ; Ȳ₁. − Ȳ; Ȳ₂. − Ȳ],

i.e. exactly the same estimators we are used to from a balanced one-sided analysis of variance (note: we knew beforehand that we would get these estimators, cf. the results earlier in this section).
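The restriction trick of example 3.4 is two lines of linear algebra in practice. The sketch below uses made-up observations; with Σ = I the estimator is θ̂ = (x'x + b'b)⁻¹(x'y + b'c), as derived in section 3.1.3.

```python
import numpy as np

# One-sided ANOVA, two groups of two observations, over-parametrised as
# E(Y_ij) = mu + alpha_i, with the restriction alpha_1 + alpha_2 = 0.
x = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0]])
b = np.array([[0.0, 1.0, 1.0]])    # restriction b theta = c
c = np.array([0.0])

y = np.array([3.0, 5.0, 7.0, 9.0])   # made-up observations

theta = np.linalg.solve(x.T @ x + b.T @ b, x.T @ y + b.T @ c)
print(theta)   # mu_hat = grand mean, alpha_i_hat = group mean - grand mean
```

With these data the group means are 4 and 8 and the grand mean is 6, so θ̂ = (6, −2, 2)', exactly the textbook balanced-ANOVA estimates.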
We will now give a more practical example of the estimation of parameters in the case where x'Σ⁻¹x is singular.

EXAMPLE 3.5. In the production of enzymes one can use two principally different types of bacteria. Via its metabolism one type of bacteria liberates acid during the production (acid producer). The other produces neutral metabolic products. In order to regulate the pH-value in the substrate on which the bacteria are produced, one can add a so-called pH-buffer. It is known that the pH-buffer itself does not have any effect on the production of the enzyme; rather it works through an interaction with the acid content and the metabolic products of the bacteria, so for bacteria which live on a substrate without pH-buffer there is no buffer effect. In order to estimate the mentioned interactions one has measured the difference between the nominal yield and the actual yield of enzyme in 7 experiments as shown in table 3.1.

                          pH-buffer
                      added      not added
bacteria  acid producer  0, 2      19, 15
culture   neutral        6, 0, 2

Table 3.1: Differences between nominal yield and actual yield under different experimental circumstances.
First we will formulate a mathematical model that can describe the above mentioned
experiment.
We have observations

Y₁₁ᵥ, v = 1, 2
Y₁₂ᵥ, v = 1, 2
Y₂₁ᵥ, v = 1, 2, 3.

These are assumed to have the mean values

E(Y₁₁ᵥ) = μ₁ + θ₁₁
E(Y₁₂ᵥ) = μ₁ + θ₁₂
E(Y₂₁ᵥ) = θ₂₁,

where μ₁ is the effect of using acid producing bacteria and θ_ij is the interaction between pH-buffer and bacteria culture.

Furthermore we assume that the observations are stochastically independent and have the same but unknown variance σ².
We can now formulate the model as a general linear model. We have

[Y₁₁₁; Y₁₁₂; Y₁₂₁; Y₁₂₂; Y₂₁₁; Y₂₁₂; Y₂₁₃] = [1 1 0 0; 1 1 0 0; 1 0 1 0; 1 0 1 0; 0 0 0 1; 0 0 0 1; 0 0 0 1] (μ₁, θ₁₁, θ₁₂, θ₂₁)' + e.
We find

x'x = [4 2 2 0; 2 2 0 0; 2 0 2 0; 0 0 0 3]

and

x'y = [Y₁..; Y₁₁.; Y₁₂.; Y₂₁.],

where a dot as an index value indicates that we have summed over the corresponding index.

Since x'x only has rank 3, we are unable to invert it. Instead we can find a pseudoinverse. We use theorem 1.7 p. 26 and get

(x'x)⁻ = [0 0 0 0; 0 1/2 0 0; 0 0 1/2 0; 0 0 0 1/3],

so the estimates of the parameters become (with this special choice of pseudoinverse)

θ̂ = (x'x)⁻ x'y = [0; Ȳ₁₁.; Ȳ₁₂.; Ȳ₂₁.],

where e.g.

Ȳ₂₁. = (1/3) Σ_{v=1}^3 Y₂₁ᵥ.
Now, since

I − (x'x)⁻ x'x = [1 0 0 0; −1 0 0 0; −1 0 0 0; 0 0 0 0],

we have

(I − (x'x)⁻ x'x) z = [t; −t; −t; 0],  t ∈ R.

From theorem 1.6 the complete solution to the normal equations is therefore all vectors of the form

θ̂ + [t; −t; −t; 0] = [t; Ȳ₁₁. − t; Ȳ₁₂. − t; Ȳ₂₁.],  t ∈ R.

An arbitrary maximum likelihood estimator of θ is then of this form.
The observed value of θ̂ is

θ̂_obs = [0; 1; 17; 2⅔].

It is obvious that this estimator is not very satisfactory, since e.g. μ̂₁ always will be 0. In order to get estimators which correspond to our expectations about physical reality, we must impose some constraints on the parameters. It seems reasonable to demand that

θ₁₁ + θ₁₂ = 0,

i.e.

(0, 1, 1, 0) θ = 0,

or

bθ = 0.

It is obvious that the stacked matrix [x; b] has full rank 4, so we can use the result from p. 121. We find
x'x + b'b = [4 2 2 0; 2 2 0 0; 2 0 2 0; 0 0 0 3] + [0 0 0 0; 0 1 1 0; 0 1 1 0; 0 0 0 0] = [4 2 2 0; 2 3 1 0; 2 1 3 0; 0 0 0 3].

Since the matrix is block diagonal, and the upper left block was inverted in example 3.4, we find

(x'x + b'b)⁻¹ = [1/2 −1/4 −1/4 0; −1/4 1/2 0 0; −1/4 0 1/2 0; 0 0 0 1/3].
We now get

θ̂ = (x'x + b'b)⁻¹ x'y = [(1/4) Y₁..; Ȳ₁₁. − (1/4) Y₁..; Ȳ₁₂. − (1/4) Y₁..; Ȳ₂₁.] = [Ȳ₁..; Ȳ₁₁. − Ȳ₁..; Ȳ₁₂. − Ȳ₁..; Ȳ₂₁.].

The observed value is

θ̂_obs = [9; −8; 8; 2⅔] = [acid producing effect; buffer & acid interaction; (no buffer) & acid interaction; buffer & neutral interaction].
We now find the variance-covariance matrix of θ̂. We have

D(θ̂) = σ² (x'x + b'b)⁻¹ x'x (x'x + b'b)⁻¹ = σ² [1/4 0 0 0; 0 1/4 −1/4 0; 0 −1/4 1/4 0; 0 0 0 1/3],

i.e. the estimators are not independent.
In order to estimate σ² we find the vector of residuals. Since

xθ̂ = (μ̂₁ + θ̂₁₁, μ̂₁ + θ̂₁₁, μ̂₁ + θ̂₁₂, μ̂₁ + θ̂₁₂, θ̂₂₁, θ̂₂₁, θ̂₂₁)' = (1, 1, 17, 17, 2⅔, 2⅔, 2⅔)',

the vector of residuals is

y − xθ̂ = (−1, 1, 2, −2, 3⅓, −2⅔, −⅔)'.

We then find

‖y − xθ̂‖² = 1 + 1 + 4 + 4 + 100/9 + 64/9 + 4/9 = 28⅔.

An unbiased estimate of σ² is therefore

s² = (1/(7 − 3)) · 28⅔ = 7 1/6.
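All of the numbers in example 3.5 can be reproduced with a few lines of linear algebra; the sketch below uses the restricted-estimation formula of section 3.1.3 with Σ = I.

```python
import numpy as np

# Design of example 3.5: 4 observations with acid-producing bacteria
# (2 with and 2 without pH-buffer) and 3 with neutral bacteria.
x = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 1]], dtype=float)
y = np.array([0.0, 2.0, 19.0, 15.0, 6.0, 0.0, 2.0])
b = np.array([[0.0, 1.0, 1.0, 0.0]])      # restriction theta_11 + theta_12 = 0

theta = np.linalg.solve(x.T @ x + b.T @ b, x.T @ y)   # c = 0 here
resid = y - x @ theta
s2 = resid @ resid / (7 - 3)              # n - rk(x) = 7 - 3 = 4
print(np.round(theta, 4), round(s2, 4))   # (9, -8, 8, 2.6667) and 7.1667
```
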
3.1.4 Constrained estimation

We now consider a problem that resembles the situation in the previous sections. More specifically, we want to estimate parameters that satisfy a linear constraint.

This is e.g. the case when estimating the angles of a triangle. They obviously satisfy

v₁ + v₂ + v₃ = 180°,

and therefore we must require this for the estimates as well.
The main result on the estimation of θ is expressed in

THEOREM 3.3. Let E(Y) = xθ, where Y is an n-dimensional random variable, x a known n × k matrix and θ a k-dimensional vector of unknown parameters satisfying the s linear constraints

H'θ = c,

where H is a known k × s matrix and c a known s-dimensional vector. Finally we suppose that D(Y) = σ²Σ, where Σ is known. The least squares estimator θ̂ for θ under the constraint H'θ = c is a solution to the equations

[x'Σ⁻¹x H; H' 0] [θ̂; λ̂] = [x'Σ⁻¹Y; c].

PROOF. We must determine a θ that minimises

(Y − xθ)'Σ⁻¹(Y − xθ)

under the constraint H'θ = c. We introduce the Lagrange multiplier λ and put

F(θ, λ) = (Y − xθ)'Σ⁻¹(Y − xθ) + λ'(H'θ − c).

Then

∂F/∂θ = −2x'Σ⁻¹(Y − xθ) + Hλ,
∂F/∂λ = H'θ − c.

Setting these two derivatives equal to 0 gives the stated system of equations, and a solution is a minimum point for (Y − xθ)'Σ⁻¹(Y − xθ) under the constraint.
Next we consider the problem of estimating σ² in

THEOREM 3.4. Letting

[C₁ C₂; C₃ C₄] = [x'Σ⁻¹x H; H' 0]⁻

be a pseudoinverse of the coefficient matrix in Theorem 3.3. Then

D(θ̂) = σ²C₁,

and an unbiased estimator of σ² is

s² = (1/f) (Y − xθ̂)'Σ⁻¹(Y − xθ̂),

where (θ̂', λ̂')' is a solution to the equations in Theorem 3.3, and

f = n − rk [x; H'] + rk(H),

with [x; H'] denoting the matrix x stacked on top of H'.

PROOF. By introducing the pseudoinverse we get

θ̂ = C₁x'Σ⁻¹Y + C₂c,
λ̂ = C₃x'Σ⁻¹Y + C₄c.

From this we immediately obtain

D(θ̂) = C₁x'Σ⁻¹(σ²Σ)Σ⁻¹xC₁' = σ²C₁x'Σ⁻¹xC₁' = σ²C₁.

The last equality sign follows by using properties of pseudoinverse matrices.

By plugging in it is seen that E(s²) = σ², so now it just remains to be shown that the degrees of freedom are as postulated in the theorem. The solution to
H'θ = c

can be written as

θ = θ₀ + Bβ,

where θ₀ is a particular solution and B is a k × (k − s) matrix of rank k − s (assuming rk(H) = s) satisfying

H'B = 0.

Finally β is a (k − s)-dimensional vector of "free", new parameters. If we consider

Z = Y − xθ₀,

we get

E(Z) = xθ − xθ₀ = x(θ − θ₀) = xBβ.

We may now consider the model

Z = xBβ + e,

where e is the error vector, and solve this. By doing so we obtain the earlier stated estimates. Letting β̂ denote the resulting estimator, we obtain

θ̂ = θ₀ + Bβ̂,

and consequently

Y − xθ̂ = Z + xθ₀ − xθ₀ − xBβ̂ = Z − xBβ̂.

From the general theory it follows that the number of degrees of freedom is n − rk(xB). We have

rk(xB) = dim{xBβ | β ∈ R^(k−s)} = dim{xγ | H'γ = 0, γ ∈ Rᵏ}.

The last equality sign follows from the relation

dim S₁ + dim S₂ = rk [x; H'],

where

S₁ = {[x; H']γ | γ ∈ N(H')}
S₂ = {[x; H']γ | γ ∈ N(H')⊥},

remembering that dim S₂ = rk(H).
We now present an illustrative example.

EXAMPLE 3.6. Suppose that we have 3 × 2 independent measurements of the angles of a triangle (e.g. measured in the field), and that they were

v₁: 52, 54
v₂: 74, 74
v₃: 48, 46.

Furthermore we suppose that the uncertainties on these values are the same and may be expressed by a variance σ².

We state this as a linear model with constraints, i.e.

[y₁₁; y₁₂; y₂₁; y₂₂; y₃₁; y₃₂] = [1 0 0; 1 0 0; 0 1 0; 0 1 0; 0 0 1; 0 0 1] (t₁, t₂, t₃)' + e,

(1, 1, 1) (t₁, t₂, t₃)' = 180.

We get

[x'x H; H' 0] = [2 0 0 1; 0 2 0 1; 0 0 2 1; 1 1 1 0].

A (pseudo-)inverse of this matrix is

[C₁ C₂; C₃ C₄] = (1/6) [2 −1 −1 2; −1 2 −1 2; −1 −1 2 2; 2 2 2 −4].

Therefore we get

θ̂ = C₁x'y + C₂ · 180 = (1/6) [2 −1 −1; −1 2 −1; −1 −1 2] [106; 148; 94] + (1/3) [1; 1; 1] · 180 = [−5; 16; −11] + [60; 60; 60] = [55; 76; 49].

We observe that the sum of the estimates is 180.

The dispersion matrix is

D(θ̂) = σ²C₁ = (σ²/6) [2 −1 −1; −1 2 −1; −1 −1 2].

The estimate of σ² is

s² = (1/(6 − 3 + 1)) (y'y − θ̂'x'y − λ̂ · 180) = (1/4)(20992 − 21684 − (−720)) = 7,

i.e. s ≈ 2.6, since

f = n − rk [x; H'] + rk(H) = 6 − 3 + 1 = 4.
134 CHAPTER 3. THE GENERAL LINa". MODII,
 
REMARK 3.3. As indicated in the example this type of model is particularly relevant in geodesy and surveying.
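The constrained estimation in Example 3.6 can be checked numerically by solving the least squares problem together with the restriction as one linear (stationarity) system; the sketch below uses numpy and the numbers of the example:

```python
import numpy as np

# 3 x 2 measurements of the triangle angles (Example 3.6)
v = np.array([52.0, 54, 74, 74, 48, 46])
x = np.kron(np.eye(3), np.ones((2, 1)))   # design matrix, E(v) = x theta
H = np.ones((1, 3))                       # constraint H theta = 180

# Stationarity conditions of  min ||v - x theta||^2  s.t.  H theta = 180:
#   [x'x  H'] [theta ]   [x'v]
#   [H    0 ] [lambda] = [180]
A = np.block([[x.T @ x, H.T], [H, np.zeros((1, 1))]])
b = np.concatenate([x.T @ v, [180.0]])
theta = np.linalg.solve(A, b)[:3]

rss = np.sum((v - x @ theta) ** 2)
sigma2 = rss / (6 - 3 + 1)                # degrees of freedom n - k + s = 4
print(theta, sigma2)                      # -> [55. 76. 49.] 7.0
```

The estimates sum to 180 as they must, and σ̂² = 7 agrees with the example.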
3.1.5 Confidence intervals for estimated values. Prediction intervals
We consider the usual model (n > k)
Y = xθ + e,
where
e ∈ N(0, σ²Σ).
Here we will denote the Y's as dependent variables and the x's as the independent variables.
As usual σ² is (assumed) unknown and Σ is (assumed) known. We have the estimator
θ̂ = (x'Σ⁻¹x)⁻¹x'Σ⁻¹Y
for θ, and σ² is estimated using
s² = (1/(n − k)) ||Y − xθ̂||² = (1/(n − k)) (Y − xθ̂)'Σ⁻¹(Y − xθ̂).
If we wish to predict the expected value of a new observation Y of the dependent variable corresponding to the values of the independent variables
z' = (z₁, ..., z_k),
it is obvious that we will use
Z = z'θ̂
as our predictor.
We have that E(Z) = E(Y) and that
V(Z) = z'D(θ̂)z = σ² z'(x'Σ⁻¹x)⁻¹z = σ²c,
where
c = z'(x'Σ⁻¹x)⁻¹z.
We therefore immediately have
(Z − E(Y))/(σ√c) ∈ N(0, 1),
and therefore also
(Z − E(Y))/(s√c) ∈ t(n − k).
We are now able to formulate and prove
THEOREM 3.5. Let the situation be as above. Then the (1 − α) confidence interval for the expected value of a new observation Y will be
[Z − t(n − k)₁₋α/₂ s√c, Z + t(n − k)₁₋α/₂ s√c].
PROOF. From the above considerations we have
1 − α = P{Z − t(n − k)₁₋α/₂ s√c < E(Y) < Z + t(n − k)₁₋α/₂ s√c},
and therefore we immediately have the theorem.
Often one is more interested in an interval for the new (or future) observations themselves. We consider the problem of determining the confidence interval for the average Ȳ_q of q (coming) observations taken at (z₁, ..., z_k). If Y_iq ∈ N(E(Y), c₁σ²), then we have that
Ȳ_q ∈ N(E(Y), (c₁/q)σ²).
If we now assume that the new (or future) observations are independent of those we already have, then
Z − Ȳ_q ∈ N(0, σ²(c + c₁/q)),
i.e.
(Z − Ȳ_q)/(s√(c + c₁/q)) ∈ t(n − k).
From this we can as before derive
THEOREM 3.6. Let us assume that q new observations taken at (z₁, ..., z_k) each have a variance c₁σ². Furthermore, they are independent of each other and independent of the earlier observations. In that case a (1 − α) confidence interval for the average of the q observations equals the interval
[Z − t(n − k)₁₋α/₂ s√(c + c₁/q), Z + t(n − k)₁₋α/₂ s√(c + c₁/q)].
REMARK 3.4. The above mentioned interval is a confidence interval for an observation and not for a parameter as we are used to. One therefore often speaks of a prediction interval in order to distinguish between the two situations.
REMARK 3.5. We see that the correspondence to the interval for Ȳ_q instead of the interval for E(Ȳ_q) = E(Y) just consists of the expression under the square root sign being larger by the amount c₁/q, corresponding to the variance (c₁/q)σ² of Ȳ_q.
EXAMPLE 3.7. We consider the following corresponding observations of an independent variable x and a dependent variable y:

x | 0    1    2    3    4    5    6
y | 0.4  0.3  1.5  1.3  1.9  4.2  8.0

We assume that the y's originate from independent stochastic variables Y₁, ..., Y₇ which are normally distributed with mean values
E(Y|x) = βx²
and variances
V(Y|x) = σ²x²,  x > 0.
We would now like to find a confidence interval for a new (or future) observation corresponding to x = 10. This observation is called Y, and we have
E(Y) = 100β
V(Y) = 100σ².
We now reformulate the problem in matrix form:

[Y₁]   [ 0]
[Y₂]   [ 1]
[Y₃]   [ 4]
[Y₄] = [ 9] β + e = xβ + e,
[Y₅]   [16]
[Y₆]   [25]
[Y₇]   [36]

where
D(e) = σ² diag(1, 1, 4, 9, 16, 25, 36) = σ²Σ.
We have that
x'Σ⁻¹x = (0, 1, 4, ..., 36) Σ⁻¹ (0, 1, 4, ..., 36)' = 0 + 1 + 4 + 9 + 16 + 25 + 36 = 91
and
x'Σ⁻¹y = 17.2,
so
β̂ = 17.2/91 = 0.1890
and
xβ̂ = (0, 1, 4, 9, 16, 25, 36)' · 0.1890 = (0, 0.1890, 0.7560, 1.7010, 3.0240, 4.7250, 6.8040)'.
The residuals are
Y − P_M(Y) = (0.4000, 0.1110, 0.7440, −0.4010, −1.1240, −0.5250, 1.1960)',
so
(Y − P_M(Y))'Σ⁻¹(Y − P_M(Y)) = 0.4000² + 0.1110² + 0.7440²/4 + 0.4010²/9 + 1.1240²/16 + 0.5250²/25 + 1.1960²/36 = 0.45829,
i.e.
σ̂² = s² = 0.45829/(7 − 1) = 0.07638 = 0.27637².
The constants c and c₁ are equal to
c = z'(x'Σ⁻¹x)⁻¹z = 100 · (1/91) · 100 = 109.89
c₁ = 100.
The prediction for x = 10 is
Z = z'β̂ = 100 · 0.1890 = 18.90.
The confidence interval for the expected value at x = 10 is therefore given by
18.90 ± t(6)₀.₉₇₅ · 0.2764 · √109.89
= 18.90 ± 2.447 · 0.2764 · √109.89
= 18.90 ± 7.09.
The corresponding prediction interval for the next observation is
18.90 ± t(6)₀.₉₇₅ · 0.2764 · √(109.89 + 100)
= 18.90 ± 9.80,
i.e. a somewhat broader interval than for the expected value. The explanation is simply that we have a variance of 10²σ² = 100σ² at x = 10. We depict the observations and the estimated polynomial in the following graph, together with the two confidence limits.
[Figure: the observations and the predicted value ŷ = 0.1890x² plotted against x, with the lower confidence limits for the expected value and for a new observation.]
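The quantities of Example 3.7 can be reproduced with a few lines of numpy. The t-quantile 2.447 is the table value t(6)₀.₉₇₅ used in the example, and the data vector y is the one reconstructed from the example's fitted values and residuals:

```python
import numpy as np

xcol = np.array([0.0, 1, 4, 9, 16, 25, 36])       # regressor x^2 for x = 0..6
y = np.array([0.4, 0.3, 1.5, 1.3, 1.9, 4.2, 8.0])
Sigma = np.diag([1.0, 1, 4, 9, 16, 25, 36])
W = np.linalg.inv(Sigma)                          # weights Sigma^{-1}

xtWx = xcol @ W @ xcol                            # x' Sigma^-1 x = 91
beta = (xcol @ W @ y) / xtWx                      # = 17.2 / 91 = 0.1890
r = y - xcol * beta                               # residuals
s2 = (r @ W @ r) / (7 - 1)                        # weighted RSS / (n - k)
s = np.sqrt(s2)

z, c1 = 100.0, 100.0                              # z = x^2 at x = 10; c1 = V(Y)/sigma^2
c = z * z / xtWx                                  # = 109.89
t975 = 2.447                                      # t(6)_0.975, table value
half_mean = t975 * s * np.sqrt(c)                 # half-width for E(Y), ~7.09
half_pred = t975 * s * np.sqrt(c + c1)            # half-width for a new obs., ~9.80
```

The two half-widths reproduce the intervals 18.90 ± 7.09 and 18.90 ± 9.80 found above.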
3.2 Tests in the general linear model
In this section we will check whether the mean vector can be assumed to lie in a true subspace of the model space, and also whether the mean vector can successively be assumed to lie in subspaces of smaller and smaller dimension, i.e. we want to test successively whether we can use fewer and fewer parameters to describe the data.
3.2.1 Test for a lower dimension of model space
Let Y ∈ N_n(μ, σ²Σ), where Σ is regular and known. We assume that μ ∈ M, a k-dimensional subspace, and we will test the hypothesis
H₀: μ ∈ H against H₁: μ ∈ M\H,
where H is an r-dimensional subspace of M. In the following we will consider the norm given by Σ⁻¹. The maximum likelihood estimator for μ is then the projection P_M(Y) onto M, and if H₀ is true the maximum likelihood estimator is P_H(Y), Y's projection onto H. The ML estimators for σ² in the two cases are respectively
(1/n) ||Y − P_M(Y)||²  and  (1/n) ||Y − P_H(Y)||².
The likelihood function is
L(μ, σ²) = (2πσ²)^(−n/2) |Σ|^(−1/2) exp(−(1/(2σ²)) ||y − μ||²).
14l CHAPTRR 3. THR GENERAL LlNIAI MODEl,
THEOREM 3.7. Let the situation be as above. Then the likelihood ratio test at level α of
H₀: μ ∈ H versus H₁: μ ∈ M\H
is equivalent to the test given by the critical region
C_α = {(y₁, ..., y_n) | [||P_M(y) − P_H(y)||²/(k − r)] / [||y − P_M(y)||²/(n − k)] > F(k − r, n − k)₁₋α}.
PROOF. The likelihood ratio test statistic is
Q = sup_{H₀} L(μ, σ²) / sup L(μ, σ²) = L(P_H(y), σ̂²_H) / L(P_M(y), σ̂²_M)
  = [||y − P_M(y)||² / ||y − P_H(y)||²]^{n/2},
since the exponential terms reduce to exp(−n/2) in both cases. From this we see
Q < q ⟺ ||y − P_M(y)||² / ||y − P_H(y)||² < q^{2/n}.
Since we reject the hypothesis for small values of Q we see that we reject when the length of the leg (cathetus) y − P_M(y) is much less than the length of the hypotenuse. From Pythagoras we have that
||y − P_H(y)||² = ||y − P_M(y)||² + ||P_M(y) − P_H(y)||²,
so we see that we may just as well compare the two legs, i.e. use
Q < q ⟺ [||P_M(y) − P_H(y)||²/(k − r)] / [||y − P_M(y)||²/(n − k)] > c.    (3.1)
Under H₀ both the numerator and the denominator are σ²χ²(f)/f distributed with respectively k − r and n − k degrees of freedom, and they are furthermore independent (follows from the partition theorem). The ratio will therefore be F-distributed under H₀, and the theorem follows from this. The reason why we in (3.1) have divided the respective norms by the dimension of the relevant subspace is of course that we want the test statistic to be F-distributed under H₀, and not just proportional to an F-distributed quantity.
One usually collects the calculations in an analysis of variance table.

Variation                         SS                        Degrees of freedom = dimension
Of model from hypothesis          ||P_M(y) − P_H(y)||²      k − r
Of observations from model        ||y − P_M(y)||²           n − k
Of observations from hypothesis   ||y − P_H(y)||²           n − r
REMARK 3.6. Often one will be in the situation that the subspaces M and H are parameterised, i.e.
∃θ ∈ Rᵏ: μ = xθ
∃γ ∈ Rʳ: μ = x₀γ,
where x and x₀ are n × k respectively n × r (with r ≤ k) matrices. We then have that P_M(y) = xθ̂ and P_H(y) = x₀γ̂ are computed by solving the equations
(x'Σ⁻¹x)θ̂ = x'Σ⁻¹y
(x₀'Σ⁻¹x₀)γ̂ = x₀'Σ⁻¹y
with respect to θ̂ and γ̂.
We now state a useful theorem used in residual analysis in the general linear model.
THEOREM 3.8. Let x be an n × k matrix, not necessarily of full rank. Then the so-called hat matrix
H = x(x'x)⁻x'
is independent of the choice of the generalised inverse (x'x)⁻. Furthermore it is idempotent and symmetric, and
rk(H) = tr(H) = rk(x).
If (x'x) has full rank we of course use the inverse.
H corresponds to projection on the column space of x, and it is easily seen that the matrix
M = I − H = I − x(x'x)⁻x'
projects on the orthogonal complement, i.e.
MH = 0.
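Theorem 3.8 is easy to verify numerically; the sketch below uses a deliberately rank-deficient design matrix and the Moore-Penrose pseudoinverse as one choice of generalised inverse (an illustration, not the book's computation):

```python
import numpy as np

# rank-deficient design: third column = first + second
x = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 1],
              [0, 1, 1],
              [0, 1, 1]], float)

H = x @ np.linalg.pinv(x.T @ x) @ x.T   # hat matrix with a generalised inverse
M = np.eye(5) - H

assert np.allclose(H @ H, H)            # idempotent
assert np.allclose(H, H.T)              # symmetric
assert np.allclose(M @ H, 0)            # M projects on the orthogonal complement
rank = np.linalg.matrix_rank(x)         # = 2
print(np.trace(H), rank)                # tr(H) = rk(H) = rk(x) = 2
```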
REMARK 3.7. Using Pythagoras' theorem we see that there are two other ways of computing
||P_M(y) − P_H(y)||²    (3.2)
besides the direct formula, namely as
||P_M(y)||² − ||P_H(y)||² = (xθ̂)'(xθ̂) − (x₀γ̂)'(x₀γ̂)    (3.3)
or as
||y − P_H(y)||² − ||y − P_M(y)||² = (y − x₀γ̂)'(y − x₀γ̂) − (y − xθ̂)'(y − xθ̂).    (3.4)
For numerical reasons (3.3) is normally preferred, but if one has already computed the residual sums of squares, using (3.4) is straightforward.
REMARK 3.8. Output from statistical standard software will often be organised slightly differently from what is presented above. We assume that Σ = I, so that the norm we are considering is given by
||z||² = z'z = Σᵢ zᵢ².
The output from e.g. SAS using the General Linear Model procedure GLM will then include an analysis of variance table (ANOVA table) as

Source of variation   Sum of Squares       Degrees of Freedom
Model                 SS(Model)            rk(x)
Error                 SSRes(Model)         n − rk(x)
Uncorrected Total     SSTot(Uncorrected)   n

Here
SS(Model) = ||P_M(Y)||² = (xθ̂)'(xθ̂) = Y'HY
SSRes(Model) = ||Y − P_M(Y)||² = (Y − xθ̂)'(Y − xθ̂) = Y'MY
SSTot(Uncorrected) = ||Y||² = Y'Y = Σᵢ Yᵢ².
If we want to test a hypothesis then we may obtain the necessary sums of squares by applying the GLM procedure to the model and to the hypothesis and then compute the numerator of the test statistic (3.1) using one of the formulas (3.2), (3.3) or (3.4).
REMARK 3.9. We now consider general linear models with an intercept α, i.e. models of the form
Y = α + β₁x₁ + ... + β_k x_k + e.
We still use the compact matrix terminology
Y = xθ + e,
where x now is n × (k + 1) and θ is (k + 1) × 1. We also assume that D(e) = σ²I. Many systems for statistical computing will automatically add a column of 1's to the design matrix unless one directly specifies that this should not be done. In the SAS procedure GLM a model statement
model y = x1 x2;
will thus be interpreted as
E(Y) = α + β₁x₁ + β₂x₂.
If we want to avoid the intercept term we must state this explicitly (in GLM via the NOINT option).
In the intercept case the output from the SAS GLM procedure includes an ANOVA table

Source of variation   Sum of Squares      Degrees of Freedom
Model                 SS(Model)           rk(x) − 1
Error                 SSRes(Model)        n − rk(x)
Corrected Total       SSTot(Corrected)    n − 1

Here
SS(Model) = (xθ̂ − Ȳ1)'(xθ̂ − Ȳ1) = Y'HY − nȲ²
SSRes(Model) = (Y − xθ̂)'(Y − xθ̂) = Y'MY
SSTot(Corrected) = Y'Y − nȲ² = Σᵢ (Yᵢ − Ȳ)².
Also in this case we may test a hypothesis by applying the GLM procedure to the model and to the hypothesis and then computing the necessary sums of squares using the formulas (3.2), (3.3) or (3.4).
If we in the above case compute
F = [SS(Model)/(rk(x) − 1)] / [SSRes(Model)/(n − rk(x))],
this will be the test statistic for the hypothesis that all parameters except the intercept are zero, i.e.
H₀: β₁ = ... = β_k = 0
against all alternatives. The critical region is given by
C = {y | F > F(rk(x) − 1, n − rk(x))₁₋α}
when testing at significance level α.
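The corrected sums of squares and the overall F statistic of the remark can be sketched as follows; the data are synthetic and assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x1 = np.arange(n, dtype=float)
Y = 1.0 + 0.5 * x1 + rng.normal(size=n)
x = np.column_stack([np.ones(n), x1])     # n x (k + 1) design with intercept

H = x @ np.linalg.inv(x.T @ x) @ x.T      # hat matrix
M = np.eye(n) - H
Ybar = Y.mean()

SS_model = Y @ H @ Y - n * Ybar**2        # (x theta_hat - Ybar 1)'(x theta_hat - Ybar 1)
SS_res = Y @ M @ Y                        # (Y - x theta_hat)'(Y - x theta_hat)
SS_tot = Y @ Y - n * Ybar**2              # sum (Y_i - Ybar)^2

rk = np.linalg.matrix_rank(x)
F = (SS_model / (rk - 1)) / (SS_res / (n - rk))
```

Note the decomposition SS(Model) + SSRes = SSTot(Corrected), which is the corrected-total identity behind the table.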
Once again we consider the model from Example 3.3 (p. 116).
EXAMPLE 3.8. We have the model
We observe data where y' = (90, 30, 75). We wish to test the hypothesis
We reformulate the hypothesis into
The estimator for γ is
The observed value is γ̂ = 102. From this we see that
xθ̂ − x₀γ̂ = (7, −35, 14)'
and thus
||xθ̂ − x₀γ̂||² = 49 + 1225 + 196 = 1470.
Now, since
y − x₀γ̂ = (12, −30, 24)',  i.e.  ||y − x₀γ̂||² = 144 + 900 + 576 = 1620,
and since we had (p. 118)
||y − xθ̂||² = (y − xθ̂)'(y − xθ̂) = 150,
we get
||xθ̂ − x₀γ̂||² = 1620 − 150 = 1470.
We may also compute this quantity as
(xθ̂)'(xθ̂) − (x₀γ̂)'(x₀γ̂) = 14475 − 13005 = 1470.
From this the test statistic becomes
F = [1470/(k − r)] / [150/(n − k)] = 1470/150 = 9.8,
and we accept the hypothesis at least for any α < 20%.
Explanation of the degrees of freedom:
rk x = k = 2,  rk x₀ = r = 1,  n = 3.
We will now look at the continuation of Example 3.5, p. 123.
EXAMPLE 3.9. From the formulation of the problem it seems reasonable to assume that the parameter θ₂₁ = 0. We will therefore test the hypothesis
H₀: θ₂₁ = 0 against H₁: θ₂₁ ≠ 0.
The hypothesis space H is therefore given by

       [1 1 0]                [μ₁ + θ₁₁]
       [1 1 0]   [μ₁ ]        [μ₁ + θ₁₁]
       [1 0 1]   [θ₁₁]        [μ₁ + θ₁₂]
E(Y) = [1 0 1]   [θ₁₂]    =   [μ₁ + θ₁₂]
       [0 0 0]                [0]
       [0 0 0]                [0]
       [0 0 0]                [0]

= x₁θ.
We now find
x₁'x₁ = [4 2 2; 2 2 0; 2 0 2]
and
x₁'y = (Y₁.., Y₁₁., Y₁₂.)'.
We see that x₁ (and hence x₁'x₁) is singular, and we add the linear restriction
b'θ = (0, 1, 1)(μ₁, θ₁₁, θ₁₂)' = θ₁₁ + θ₁₂ = 0.
Since
b b' contributes [0 0 0; 0 1 1; 0 1 1],
we have
x₁'x₁ + b b' = [4 2 2; 2 3 1; 2 1 3].
This matrix is inverted on p. 121. We therefore find the estimator under H₀ as
The observed value is (9, 8, −8)'. The new residual vector is
y − x₁θ̂ = (1, −1, −2, +2, 6, 0, 2)'.
The squared norm of this vector is 50, and the number of degrees of freedom is 7 − 2 = 5. We therefore find that
||y − P_H(y)||² − ||y − P_M(y)||² = 50 − 28 2/3 = 21 1/3.
We now collect the calculations in the following analysis of variance table.

Variation   SS       f          s²       Test
M − H       21 1/3   3 − 2 = 1  21 1/3   2.97
O − M       28 2/3   7 − 3 = 4  7 1/6
O − H       50       7 − 2 = 5

Since the observed value of the test statistic 2.97 < F(1, 4)₀.₉₀ we will accept the hypothesis, and therefore assume that H₀ is true.
3.2.2 Successive testing in the general linear model
In this section we will illustrate the test procedure one should follow when one successively wants to investigate if the mean vector of one's observations lies in subspaces Hᵢ with
H_m ⊆ ... ⊆ H₁ ⊆ M,  m ≤ k.
We will start by considering the following numbers from the yield of penicillin fermentation using two different types of sugar, namely lactose and cane sugar, at the concentrations 2%, 4%, 6% and 8%.

                       Factor B: concentration
                       2%     4%     6%     8%
Factor A: Lactose      0.606  0.660  0.984  0.908
          Cane sugar   0.761  0.933  1.072  0.979

The numbers are from [5] p. 314. The yield has been expressed by the logarithm of the weight of the mycelium after one week of growth.
We are now interested in investigating the influence of the two factors A and B on the yield. We assume that the observations are stochastically independent and normally distributed. They are called Y_ij, i = 1, 2, j = 1, ..., 4, and further we will assume that
E(Y_ij) = α_i + β_i x_j + γ_i x_j²,
where x_j gives the j'th sugar concentration. We will perform a change of scale of the sugar concentration
2% → −3,  4% → −1,  6% → 1,  8% → 3,
or more stringently define x by
x_j = (x̃_j − 5%)/1%.
We then get the following expression for the mean values.¹

¹We are assuming that the yield within the given limits can be expressed as polynomials of second degree.
E(Y₁ⱼ) = α₁ + β₁x + γ₁x²   (lactose)
E(Y₂ⱼ) = α₂ + β₂x + γ₂x²   (cane sugar),
where x is the scaled concentration. We would now like to investigate:
1) if γ₁ = γ₂ = 0, i.e. if a description by affine functions is sufficient;
2) if that is accepted, then if β₁ = β₂ = β, i.e. if the marginal effect of increasing the concentration is the same for the two types of sugar;
3) if that is accepted, then if α₁ = α₂ = α, i.e. if the two types of sugar are equal with respect to the yield; and if this is accepted,
4) then if β = 0, i.e. if the concentration has any influence at all.
i) We first write the model in matrix form

[Y₁₁]   [1 −3  9  0  0  0]          [ε₁]
[Y₁₂]   [1 −1  1  0  0  0]   [α₁]   [ε₂]
[Y₁₃]   [1  1  1  0  0  0]   [β₁]   [ε₃]
[Y₁₄] = [1  3  9  0  0  0]   [γ₁] + [ε₄]
[Y₂₁]   [0  0  0  1 −3  9]   [α₂]   [ε₅]
[Y₂₂]   [0  0  0  1 −1  1]   [β₂]   [ε₆]
[Y₂₃]   [0  0  0  1  1  1]   [γ₂]   [ε₇]
[Y₂₄]   [0  0  0  1  3  9]          [ε₈]

or
Y = xθ + e.
We find

      [ 4   0  20   0   0   0]
      [ 0  20   0   0   0   0]
x'x = [20   0 164   0   0   0]
      [ 0   0   0   4   0  20]
      [ 0   0   0   0  20   0]
      [ 0   0   0  20   0 164]

Since
[ 4  20; 20 164]⁻¹ = (1/256) [164 −20; −20 4] = [41/64 −5/64; −5/64 1/64],
then

          [ 41/64     0   −5/64      0      0      0  ]
          [   0     1/20     0       0      0      0  ]
(x'x)⁻¹ = [−5/64      0    1/64      0      0      0  ]
          [   0       0      0    41/64     0   −5/64 ]
          [   0       0      0      0     1/20     0  ]
          [   0       0      0   −5/64      0    1/64 ]
From this we see that

α̂₁ = [41(y₁₁ + y₁₂ + y₁₃ + y₁₄) − 5(9y₁₁ + y₁₂ + y₁₃ + 9y₁₄)]/64 = 0.830
β̂₁ = (−3y₁₁ − y₁₂ + y₁₃ + 3y₁₄)/20 = 0.062
γ̂₁ = [−5(y₁₁ + y₁₂ + y₁₃ + y₁₄) + (9y₁₁ + y₁₂ + y₁₃ + 9y₁₄)]/64 = −0.008
α̂₂ = [41(y₂₁ + y₂₂ + y₂₃ + y₂₄) − 5(9y₂₁ + y₂₂ + y₂₃ + 9y₂₄)]/64 = 1.019
β̂₂ = (−3y₂₁ − y₂₂ + y₂₃ + 3y₂₄)/20 = 0.040
γ̂₂ = [−5(y₂₁ + y₂₂ + y₂₃ + y₂₄) + (9y₂₁ + y₂₂ + y₂₃ + 9y₂₄)]/64 = −0.017
Then
P_M(y) = xθ̂ = (0.572, 0.760, 0.884, 0.944, 0.746, 0.962, 1.042, 0.986)'.
We therefore have the residuals
y − P_M(y) = (0.034, −0.100, 0.100, −0.036, 0.015, −0.029, 0.030, −0.007)'.
The squared length of this vector is
||y − P_M(y)||² = 0.034² + ... + (−0.007)² = 0.024467.
As an estimate of σ² we can therefore use
σ̂² = (1/(8 − 6)) · 0.024467 = 0.0122335.
ii) If the hypothesis μ ∈ H₁, i.e. γ₁ = γ₂ = 0, or

    [1 −3  0  0]
    [1 −1  0  0]
    [1  1  0  0]   [α₁]
y = [1  3  0  0]   [β₁]  + e₁ = x₁d₁ + e₁,
    [0  0  1 −3]   [α₂]
    [0  0  1 −1]   [β₂]
    [0  0  1  1]
    [0  0  1  3]

is true, then we get the estimates
d̂₁ = (0.790, 0.062, 0.936, 0.040)'.
The residuals are
y − P_H₁(y) = (0.002, −0.068, 0.132, −0.068, −0.055, 0.037, 0.096, −0.077)'.
The squared length of this vector is
||y − P_H₁(y)||² = 0.002² + ... + (−0.077)² = 0.046215.
iii) If μ ∈ H₂, i.e. β₁ = β₂ = β, the model becomes

    [1 0 −3]
    [1 0 −1]
    [1 0  1]   [α₁]
y = [1 0  3]   [α₂]  + e₂ = x₂d₂ + e₂.
    [0 1 −3]   [β ]
    [0 1 −1]
    [0 1  1]
    [0 1  3]

The estimates become
d̂₂ = (0.790, 0.936, 0.051)'
and the residuals
y − P_H₂(y) = (−0.031, −0.079, 0.143, −0.035, −0.022, 0.048, 0.085, −0.110)'.
The squared norm of this vector is
||y − P_H₂(y)||² = 0.050989.
iv) If μ ∈ H₃, i.e. β₁ = β₂ = β and α₁ = α₂ = α, then the model is

    [1 −3]
    [1 −1]
    [1  1]
y = [1  3]   [α]  + e₃ = x₃d₃ + e₃.
    [1 −3]   [β]
    [1 −1]
    [1  1]
    [1  3]

We find
d̂₃ = (0.863, 0.051)'
and the residuals
y − P_H₃(y) = (−0.104, −0.152, 0.070, −0.108, 0.051, 0.121, 0.158, −0.037)',
giving
||y − P_H₃(y)||² = 0.094059.
v) Finally we consider the case μ ∈ H₄, i.e. β = 0, or

    [1]
    [1]
    [1]
y = [1]  α + e₄ = x₄d₄ + e₄.
    [1]
    [1]
    [1]
    [1]

We find
d̂₄ = α̂ = 0.863,
giving the residuals
y − P_H₄(y) = (−0.257, −0.203, 0.121, 0.045, −0.102, 0.070, 0.209, 0.116)'
and
||y − P_H₄(y)||² = 0.196365.
Letting rk(xᵢ) = rᵢ and rk(x) = k, we can summarise the testing procedure in an analysis of variance table such as

Variation    SS                          Degrees of freedom = dimension
H₄ − H₃      ||P_H₄(y) − P_H₃(y)||²      r₃ − r₄ = 2 − 1 = 1
H₃ − H₂      ||P_H₃(y) − P_H₂(y)||²      r₂ − r₃ = 3 − 2 = 1
H₂ − H₁      ||P_H₂(y) − P_H₁(y)||²      r₁ − r₂ = 4 − 3 = 1
H₁ − M       ||P_H₁(y) − P_M(y)||²       k − r₁ = 6 − 4 = 2
M − obs.     ||P_M(y) − y||²             n − k = 8 − 6 = 2
H₄ − obs.    ||P_H₄(y) − y||²            n − r₄ = 8 − 1 = 7
This table is a simple extension of the table on p. 143. We can use the partition theorem and get, under the different hypotheses, that the sums of squares are independent and distributed as σ²χ² with the respective degrees of freedom.
If a hypothesis Hᵢ is accepted, then the test statistic for the test of Hᵢ₊₁ becomes
Z = [||P_Hᵢ(y) − P_Hᵢ₊₁(y)||²/(rᵢ − rᵢ₊₁)] / [||P_Hᵢ(y) − y||²/(n − rᵢ)].
Under the hypothesis this quantity is F(rᵢ − rᵢ₊₁, n − rᵢ) distributed (according to the partition theorem) and, still following the theory from the previous section, we reject for large values of Z, i.e. for
Z > F(rᵢ − rᵢ₊₁, n − rᵢ)₁₋α.
First we give some computational formulas.
Using Pythagoras' theorem we now see, cf. Remark 3.7, that there are two alternative ways of computing
||P_Hᵢ(y) − P_Hᵢ₊₁(y)||²;
they are
||P_Hᵢ(y)||² − ||P_Hᵢ₊₁(y)||²    (3.5)
and
||y − P_Hᵢ₊₁(y)||² − ||y − P_Hᵢ(y)||².    (3.6)
Of these the first must be preferred for numerical reasons, but if one has computed the residual sums of squares anyhow it seems to be easier to use (3.6).
The analysis of variance table in our case becomes

Variation   SS        f   Test statistic
H₄ − H₃     0.102306  1   5.44
H₃ − H₂     0.043070  1   4.22
H₂ − H₁     0.004774  1   0.41
H₁ − M      0.021748  2   0.89
M − obs.    0.024467  2
Since
4.22 ≈ F(1, 5)₀.₉₁
and
5.44 ≈ F(1, 6)₀.₉₄,
we will not, by testing at say level α = 5%, reject any of the hypotheses H₁, H₂, H₃ or H₄.
NOTE 1. We will of course not test e.g. H₂ if we had rejected H₁, since H₂ is a sub-hypothesis of H₁.
The conclusion is therefore that we (until new investigations reject this) will continue to work with the model that the yield Y by penicillin fermentation is independent of the type of sugar and of the concentration (2% ≤ concentration ≤ 8%) at which the fermentation takes place. We have, with Y_ij ∈ N(α, σ²), that
E(Y) = α and V(Y) = σ²,
and
α̂ = ȳ = 0.863,
σ̂² = 0.196365/7 = 0.028052 ≈ 0.17².
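The chain of residual sums of squares in Example 3.9 can be recomputed directly; small differences against the hand calculations are to be expected, since the example rounds the estimates to three decimals before computing residuals:

```python
import numpy as np

y = np.array([0.606, 0.660, 0.984, 0.908, 0.761, 0.933, 1.072, 0.979])
xs = np.array([-3.0, -1.0, 1.0, 3.0])          # scaled concentrations
one, zero = np.ones(4), np.zeros(4)

def rss(X):
    """Residual sum of squares after projecting y onto the columns of X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(np.sum((y - X @ beta) ** 2))

xM  = np.column_stack([np.r_[one, zero], np.r_[xs, zero], np.r_[xs**2, zero],
                       np.r_[zero, one], np.r_[zero, xs], np.r_[zero, xs**2]])  # full model
xH1 = np.column_stack([np.r_[one, zero], np.r_[xs, zero],
                       np.r_[zero, one], np.r_[zero, xs]])   # gamma1 = gamma2 = 0
xH2 = np.column_stack([np.r_[one, zero], np.r_[zero, one], np.r_[xs, xs]])  # common beta
xH3 = np.column_stack([np.r_[one, one], np.r_[xs, xs]])      # common alpha too
xH4 = np.ones((8, 1))                                        # beta = 0

r = [rss(X) for X in (xM, xH1, xH2, xH3, xH4)]
F_H2 = (r[2] - r[1]) / (r[1] / (8 - 4))        # test of H2 given H1, cf. the table
```

The list r reproduces (up to the rounding mentioned above) the residual sums of squares 0.024467, 0.046215, 0.050989, 0.094059 and 0.196365.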
Chapter 4
Regression analysis
In this chapter we will give an overview of regression analysis. Most of it is a special case of the general linear model, but since a number of applications are concerned with regression situations we will try to describe the results in this language.
There is a small section on orthogonal regression (not to be confused with regression by orthogonal polynomials). From a statistical point of view this is more related to the sections on principal components and factor analysis, and considering ways of computation we also refer to that chapter. However, from a curve-fitting point of view we have found it sensible to mention the concept in the present chapter too.
4.1 Linear regression analysis
In this section linear regression analysis will be analysed by means of the theory for
the general linear model. We start with
4.1.1 Notation and model.
In ordinary regression analysis we simply work with a general linear model with an intercept, i.e. we work with the model
Y = α + β₁x₁ + ... + β_k x_k + e,
where the x's are known variables and the β's (and α) are unknown parameters. If we have given n observations of Y we could more precisely write the model as
Yᵢ = α + β₁x₁ᵢ + ... + β_k x_kᵢ + eᵢ,  i = 1, ..., n,
or
Y = xβ + e.
We assume as usual that
D(e) = σ²Σ,
where Σ is known and σ² is (usually) unknown.
The estimators are found in the usual way by solving the normal equations
x'Σ⁻¹xβ̂ = x'Σ⁻¹Y,
or if Σ = I
x'xβ̂ = x'Y.
In the first case we talk of a weighted regression analysis.
Before we go on it is probably appropriate once again to stress what is meant by the word linear in the term linear regression analysis. As in the ordinary general linear model the meaning is that we have linearity in the parameters. We can easily do regression on e.g. time and the logarithm of time. The model will then just be
E(Y) = α + β₁t + β₂ ln t,
cf. Example 3.2. With n observations this model in matrix form becomes Y = xβ + e with x having rows (1, tᵢ, ln tᵢ).
[Figure 4.1]
Another banality that could be useful to stress is that one can force the regression surface through 0 by deleting α and the first column in the x matrix, i.e. using the model
Y = x₁β + e,
where x₁ is x without the column of 1's.
It can also be useful to note that one can use the following trick if one wishes the regression surface to go through 0. We assume that Σ = I.
We consider the observations Y₁, ..., Y_n and the corresponding values of the independent variables 1, x₁ᵢ, ..., x_kᵢ, i = 1, ..., n. If we add the observations −Y₁, ..., −Y_n with corresponding independent variables 1, −x₁ᵢ, ..., −x_kᵢ, i = 1, ..., n, and write down the usual model, we get

[Y₁ ]    [1   x₁₁  ...   x_k1]          [e₁  ]
[ ⋮ ]    [⋮                  ]   [α ]   [ ⋮  ]
[Y_n]    [1   x₁n  ...   x_kn]   [β₁]   [e_n ]
[−Y₁] =  [1  −x₁₁  ...  −x_k1]   [⋮ ] + [e_n+1]
[ ⋮ ]    [⋮                  ]   [β_k]  [ ⋮  ]
[−Y_n]   [1  −x₁n  ...  −x_kn]          [e_2n]

or more compactly
[Y; −Y] = [1 x₁; 1 −x₁][α; β] + e.
The normal equations become
[1'  1' ; x₁' −x₁'] [1 x₁; 1 −x₁] [α̂; β̂] = [1'  1' ; x₁' −x₁'] [Y; −Y].
If we write out the equations we get
2nα̂ = 0
2x₁'x₁β̂ = 2x₁'Y.
In other words, in this way we have found the estimators of the coefficients of a regression surface which has been forced through 0.
The reason why the above is useful is that a number of standard programmes cannot force the surface through 0. Using the above mentioned trick the problem can be circumvented.
The output from such a programme should be interpreted cautiously, since all the sums of squares are twice their correct size. E.g. the residual sum of squares will be computed as
([Y; −Y] − [x₁β̂; −x₁β̂])'([Y; −Y] − [x₁β̂; −x₁β̂]) = 2(Y − x₁β̂)'(Y − x₁β̂),
i.e. twice the correct residual sum of squares. The mentioned degrees of freedom will not be correct either. We have to write up the ordinary linear model and find the correct degrees of freedom considering the dimensions.
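The mirroring trick can be sketched as follows; the fitted intercept comes out as (numerically) zero, the slope equals the explicit through-the-origin estimate x'Y/x'x, and the doubled residual sum of squares is visible. The data are assumed for illustration:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
Y  = np.array([2.1, 3.9, 6.2, 7.8])

# doubled data set: (Y, x) together with (-Y, -x), both with a 1-column
X2 = np.column_stack([np.ones(8), np.r_[x1, -x1]])
Y2 = np.r_[Y, -Y]
alpha, beta = np.linalg.lstsq(X2, Y2, rcond=None)[0]

beta_direct = (x1 @ Y) / (x1 @ x1)      # regression through the origin
rss_doubled = np.sum((Y2 - X2 @ np.array([alpha, beta])) ** 2)
rss_correct = np.sum((Y - x1 * beta_direct) ** 2)
```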
4.1.2 Correlation and regression
In Theorem 2.23 (p. 90) a result was stated which can be used to test if the multiple correlation coefficient between normally distributed variables is 0. We will now show that this result corresponds to a certain test in a regression model.
We will assume that we have the usual model (p. 161) and we assume that Σ = I. Without any problems we can use the theory from Chapter 3 to test different hypotheses about the parameters α, β₁, ..., β_k.
By formal calculations we can estimate the multiple correlation coefficient between Y and x₁, ..., x_k using the expressions mentioned in section 2.3.2.
It can be shown that we get
R̂² = [||Y − P₀(Y)||² − ||Y − P_M(Y)||²] / ||Y − P₀(Y)||²,
where
P₀(Y) = Ȳ1
and
P_M(Y) = xβ̂ = Ê(Y).
These results are not very surprising. We remember that the multiple correlation coefficient could be found from the linear combination of X which minimises the variance of (Y − a'X), and this corresponds exactly to the condition defining the least squares estimates.
If we as in Remark 3.9 let
SSTot = SSTot(Corrected) = ||Y − P₀(Y)||² = Σ(Yᵢ − Ȳ)²
and
SSRes = SSRes(Model) = ||Y − P_M(Y)||² = Σ(Yᵢ − Ê(Yᵢ))²,
we can write
R̂² = (SSTot − SSRes)/SSTot,
i.e. the squared multiple correlation coefficient can also be expressed as the part of the total variation in the Y's which is explained using the independent variables.
SAS also computes an R-square adjusted for the degrees of freedom. The adjusted R-square is calculated as
ADJRSQ = 1 − ((n − i)/(n − p))(1 − R²),
where i is equal to 1 if there is an intercept and 0 otherwise, n is the number of observations used to fit the model, and p is the number of parameters in the model including a possible intercept.
Furthermore, we see that if we formally write the test on p. 92 for PYlxl "",Xk = 0 we
get  assuming that rk x = k + 1 
R2 n  k 1
IR2 k
IIY  Po(Y)11
2
IIY  PM(Y)11
2
n  k 1
IIY  PM(Y)11
2
k
IlpM(Y)  Po(Y)11
2
/k
IIY  PM(Y)11
2
/(n  k 1)
(SSTot  SSRes) / k
SSRes/(n  k  1)
From the normal theory (p. 142) this is exactly the test statistic for the hypothesis
and the distribution of the test statistic is a F(k, n  k  I)distribution  exactly the
same as we found on p. 142.
For testing it is from the numerical point of view therefore of no importance if we
choose to consider the :I:'S as observations of a kdimensional normally distributed
rundom variable or as fixed delerministlc variables.
ThI"llIIIuo eln thorofore be "eplrated from the ulumptlonl we wUl con,ider In the next
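The identity between the correlation-based statistic and the ANOVA F statistic is easy to confirm numerically; the data below are synthetic, assumed only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # intercept + k regressors
Y = X @ np.array([1.0, 0.8, -0.5]) + rng.normal(size=n)

beta = np.linalg.lstsq(X, Y, rcond=None)[0]
SS_res = np.sum((Y - X @ beta) ** 2)
SS_tot = np.sum((Y - Y.mean()) ** 2)
R2 = (SS_tot - SS_res) / SS_tot

F_from_R2 = R2 / (1 - R2) * (n - k - 1) / k                  # correlation form
F_anova = ((SS_tot - SS_res) / k) / (SS_res / (n - k - 1))   # ANOVA form

adj_R2 = 1 - (n - 1) / (n - (k + 1)) * (1 - R2)              # i = 1, p = k + 1
```

The two F values agree to machine precision, illustrating that the test of ρ = 0 and the overall regression F test are one and the same computation.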
4.1.3 Analysis of assumptions
If we for corresponding x-values have several observations of Y, it would be possible to compute the usual tests for distributional type (histograms, quantile diagrams, χ² tests, etc.) and for the homogeneity of variances (Bartlett's test and others). Finally we could also do run tests for randomness, etc.
However, the situation is often that we very seldom have more than maybe a couple of observations for the different values of the independent variables. It is therefore not possible to do these types of checks of the assumptions. Instead we consider the residuals.
If the model is valid these will be approximately independent and N(0, σ²) distributed.
Initially we shall present some results on the residuals in a regression model. We recall the definitions of the hat matrix H and the matrix M presented in section 3.2. In the full rank case they are
H = x(x'x)⁻¹x'
M = I − H,
and we see that the predicted values are
Ŷ = xθ̂ = HY
and the vector of residuals is
R = Y − Ŷ = MY.
Using results from sections 2.1.2 and 2.1.3 we obtain
D(Ŷ) = σ²H
and
D(R) = σ²M.
Since H and M are not diagonal it follows that the predicted values are correlated, as are the residuals. If we denote element (i, j) of the hat matrix H by hᵢⱼ, we see that the variance of the i'th predicted value is
V(Ŷᵢ) = σ²hᵢᵢ
and the residual variance becomes
V(Rᵢ) = σ²(1 − hᵢᵢ).
Furthermore it follows that the residuals and the predicted values are uncorrelated, since
C(Ŷ, R) = C(HY, MY) = σ²HM = 0.
And finally we find that the sum of the residuals is 0. This follows from
1'R = 1'MY = 0,
since 1' is the first row in x' and we have
x'M = x' − x'x(x'x)⁻¹x' = 0.
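These properties of the residual vector are easy to confirm; any full-rank design with a column of 1's will do, and the one below is assumed for illustration:

```python
import numpy as np

x = np.column_stack([np.ones(6), np.arange(6.0), np.arange(6.0) ** 2])
Y = np.array([1.0, 2.0, 2.5, 4.0, 8.0, 13.0])

H = x @ np.linalg.inv(x.T @ x) @ x.T
M = np.eye(6) - H
R = M @ Y                                  # residual vector

leverages = np.diag(H)                     # h_ii: V(Yhat_i) = sigma^2 h_ii
assert np.allclose(x.T @ M, 0)             # x'M = 0
assert abs(R.sum()) < 1e-9                 # residuals sum to zero
assert np.allclose(H @ M, 0)               # predictions and residuals uncorrelated
```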
If one depicts the residuals in different ways and thereby sees something which does not look like (or could not be) observations of independently N(0, σ²) distributed random variables, then we have an indication that there is something wrong with the model.
Most often we would probably start with a usual analysis of the distribution of the residuals, i.e. do run tests, draw histograms, quantile diagrams, etc.
Afterwards we could depict the residuals against different quantities (time, independent variables, etc.). We show the following 4 sketches to illustrate often seen residual plots and give a short description of what the reason for plots of this kind could be. First we note that a plot of type 1 is always acceptable (however, cf. p. 170).
i) Plot of residuals against time
2 The variance increases with time. Perform a weighted analysis.
3 Lacking terms of the form β · time.
4 Lacking terms of the form β₁ · time + β₂ · time².
Figure 4.2: Residual plots.
ii) Plot of residuals against predicted values Ŷᵢ
2 The variance increases with E(Yᵢ). Perform a weighted analysis or transform the Y's (e.g. with the logarithm or equivalent).
3 Lacking constant term (the regression is possibly erroneously forced through 0). Error in the analysis.
4 Bad model. Try a transformation of the Y's.
iii) Plot against the independent variable xᵢ
2 The variance grows with xᵢ. Perform a weighted analysis or transform the Y's.
3 Error in the computations.
4 Lacking quadratic term in xᵢ.
The above is not meant to be an exhaustive description of how to analyse residual plots but may be considered as an indication of how such an analysis could be done.
REMARK 4.1. In practice we will often have our residual plots printed on printer listings. Then the plots might look as shown on p. 171. The 4 plots have been taken from [24] p. 14-15 in appendix C.
When interpreting these plots we should remember that there is not always an equal number of observations for each value of the independent variable.
This is e.g. the case in the plot which depicts the residuals against variable 10. There are 7 observations corresponding to x₁₀ ≈ 0.2704E−04 and 35 observations corresponding to x₁₀ ≈ 0.7126E−03. The range of variation of the residuals is approximately the same in the two cases. If the residuals corresponding to the 2 values of x₁₀ had the same variance we would, however, expect the range of variation for the one with many observations to be the largest.
In other words, if one has most observations around the centre of gravity of an independent variable, a residual plot should rather be elliptical than of the form 1 to be satisfactory.
4.1.4 On "Influence Statistics"
When judging the quality of a regression analysis one often considers the following two possibilities:
1) Check if deviations from the model look random.
2) Check the effect of single observations.
[Pages 171-172: SAS residual plots belonging to the discussion above; the plots are not legible in this copy.]
CHAPTER 4. REGRESSION ANALYSIS
Considerations regarding 1) are given in section 4.1.3 above. Here we will briefly consider 2).
The deletion formula
Recalculation of the parameter estimates when discarding a single observation can be done using the formula
   (A - uv')⁻¹ = A⁻¹ + (A⁻¹uv'A⁻¹)/(1 - v'A⁻¹u),
where the involved inverses are assumed to exist.
We let x_i' be the i'th row in the design matrix x. Letting A = x'x and u = v = x_i we get
   (x'x - x_i x_i')⁻¹ = (x'x)⁻¹ + ((x'x)⁻¹ x_i x_i' (x'x)⁻¹)/(1 - h_ii).
If we denote by x(i) the x-matrix where the i'th row is removed, we have that
   x(i)'x(i) = x'x - x_i x_i'.
The proof is omitted.
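Although the proof is omitted, the deletion formula is easy to check numerically. The sketch below (made-up data and seed; numpy assumed) inverts x'x after physically removing a row and compares the result with the rank-one updating formula:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))          # made-up 10x3 design matrix
A = X.T @ X
Ainv = np.linalg.inv(A)

i = 4                                  # observation to delete
xi = X[i]
h_ii = xi @ Ainv @ xi                  # leverage of observation i

# direct inverse after removing row i
X_del = np.delete(X, i, axis=0)
lhs = np.linalg.inv(X_del.T @ X_del)

# the deletion formula with u = v = x_i
rhs = Ainv + (Ainv @ np.outer(xi, xi) @ Ainv) / (1.0 - h_ii)
# lhs and rhs agree to machine precision
```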
We can now state the relevant expressions.
Cook's D
A confidence region for the parameter θ consists of all the vectors θ*, which satisfy
   (θ̂ - θ*)'x'x(θ̂ - θ*)/(p σ̂²) ≤ F(p, n - p)_{1-α}.
We use the left hand side as a measure of the distance between a parameter vector and θ̂. We let θ̂(i) be the estimate which corresponds to the deletion of the i'th observation, i.e. which is based on
   y(i) = (y₁, ..., y_{i-1}, y_{i+1}, ..., y_n)'.
Cook's D for the i'th observation then equals
   D_i = (θ̂ - θ̂(i))'x'x(θ̂ - θ̂(i))/(p σ̂²).
If Cook's D equals e.g. F(p, n - p)_{0.60}, then this corresponds to the likelihood estimate moving to the 60% confidence ellipsoid for θ. This is a large change when just removing a single observation. There are several suggestions for cutoff values of Cook's D:
   D > 1
   D > 4/n
   D > 4/(n - p)
   D > F(p, n - p)_{0.50}
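Computing Cook's D from the definition requires a refit per observation, but it also has a closed form in terms of the ordinary residuals and leverages, D_i = r_i² h_ii/(p σ̂² (1 - h_ii)²). The sketch below (made-up data; the closed form is the standard identity, not taken from this text) checks that the two agree:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverages h_ii
s2 = e @ e / (n - p)

# Cook's D from the definition: refit without observation i
D_def = np.empty(n)
for i in range(n):
    Xi, yi = np.delete(X, i, 0), np.delete(y, i)
    bi = np.linalg.lstsq(Xi, yi, rcond=None)[0]
    d = beta - bi
    D_def[i] = d @ (X.T @ X) @ d / (p * s2)

# closed form using residual and leverage only
D_closed = e**2 * h / (p * s2 * (1 - h)**2)
```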
In the SAS procedure REG one can find Cook's D together with other diagnostic statistics. Some are mentioned below.
RSTUDENT & STUDENT RESIDUAL
RSTUDENT is a so-called "studentised" residual, i.e.
   RSTUDENT_i = r_i/(σ̂(i)√(1 - h_ii)),
where r_i is the i'th residual and σ̂(i)² is the estimate of the variance corresponding to deletion of the i'th observation.
SAS also computes a similar statistic, where the i'th observation is not excluded:
   STUDENT RESIDUAL_i = r_i/(σ̂√(1 - h_ii)).
Since both these types of residual are standardised, a sensible rule of thumb is that they should lie within ±2 or ±3.
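Both residual types can be sketched on made-up data. The last line uses the standard identity σ̂(i)² = ((n - p)σ̂² - r_i²/(1 - h_ii))/(n - p - 1), which gives the deleted variance estimates without any refitting:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([0.5, 1.0, 2.0]) + rng.normal(scale=0.3, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
s2 = e @ e / (n - p)

student = e / np.sqrt(s2 * (1 - h))     # internally studentised (STUDENT)

# externally studentised (RSTUDENT): variance re-estimated without obs i
rstudent = np.empty(n)
for i in range(n):
    Xi, yi = np.delete(X, i, 0), np.delete(y, i)
    bi = np.linalg.lstsq(Xi, yi, rcond=None)[0]
    ei = yi - Xi @ bi
    s2_i = ei @ ei / (n - 1 - p)
    rstudent[i] = e[i] / np.sqrt(s2_i * (1 - h[i]))

# the same deleted variances without refitting
s2_loo = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
```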
COVRATIO
COVRATIO measures the change in the determinant of the dispersion matrix for the parameter estimates when the i'th observation is deleted. We find
   COVRATIO_i = det(σ̂(i)²(x(i)'x(i))⁻¹)/det(σ̂²(x'x)⁻¹).
This quantity "should" be close to 1. If it lies far from 1, then the i'th observation has too large an influence. As a rule of thumb one should have |COVRATIO_i - 1| ≤ 3p/n.
Leverage
The quantity h_ii introduced earlier is called the leverage. It is
   h_ii = x_i'(x'x)⁻¹x_i,
and it measures how far the i'th vector of independent variables is from the mean of the remaining ones. Thus it is a measure of influence, since a far-away point "forces" the regression surface to lie closer to it. If we have p parameters in the model, points with a leverage > 2p/n should be investigated.
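The leverages are the diagonal of the hat matrix H = x(x'x)⁻¹x'. On any design with an intercept column they satisfy 1/n ≤ h_ii ≤ 1 and sum to p, which makes the 2p/n rule a "twice the average" rule; a small check on made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
h = np.diag(H)                         # leverages; sum(h) == p
```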
DFFITS
DFFITS is, like Cook's distance, a measure of the total change when deleting one single observation:
   DFFITS_i = (ŷ_i - ŷ(i)_i)/(σ̂(i)√h_ii) = x_i'(θ̂ - θ̂(i))/(σ̂(i)√h_ii).
As a rule of thumb they should lie within say ±2. A similar rule adjusted for the number of observations says within ±2√(p/(n - p)).
DFBETAS
While DFFITS measures changes in the prediction of an observation corresponding to changes in all parameter estimates, DFBETAS simply measures the change in each individual parameter estimate. As a rule of thumb they should lie within say ±2. A rule adjusted for the number of observations says within ±2/√n.
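A leave-one-out computation of DFFITS on made-up data, checked against the identity DFFITS_i = RSTUDENT_i·√(h_ii/(1 - h_ii)) (a standard identity, not taken from this text):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.4, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
s2 = e @ e / (n - p)

dffits = np.empty(n)
for i in range(n):
    Xi, yi = np.delete(X, i, 0), np.delete(y, i)
    bi = np.linalg.lstsq(Xi, yi, rcond=None)[0]
    ei = yi - Xi @ bi
    s2_i = ei @ ei / (n - 1 - p)
    yhat_del = X[i] @ bi               # prediction of obs i from the reduced fit
    dffits[i] = (X[i] @ beta - yhat_del) / np.sqrt(s2_i * h[i])

# externally studentised residuals via the no-refit identity
s2_loo = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
rstudent = e / np.sqrt(s2_loo * (1 - h))
```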
If we have a model E(Y) = α + β₁x₁ + β₂x₂ and must estimate β₁ and β₂, that cannot be done in a reliable way, since we cannot vary one x with the other fixed.
Figure 4.4: All the (x₁, x₂) are in the shaded (blue) area.
Multicollinearity
If the independent (explanatory) variables in a multiple regression are highly correlated, we say that we have a case of multicollinearity. This may cause the individual parameter estimates to be very uncertain, without necessarily ruining the descriptive and predictive power of the model, as long as the explanatory variables vary in the same range, cf. figure 4.4. But predictions where we move out of this range may be highly unreliable.
Diagnostic checks for multicollinearity in SAS include methods based on measuring the correlation between one independent variable and all the others, and methods based on analysing the eigenvalues of x'x.
We define the tolerance (TOL) and the variance inflation factor (VIF) as
   TOL_i = 1 - R²(x_i | all other x-variables)
   VIF_i = 1/TOL_i.
As a rule of thumb, TOL < 0.1, or equivalently VIF > 10, indicates a multicollinearity problem.
We define the condition number as the square root of the largest eigenvalue of x'x divided by the smallest. The condition number should preferably be below 10. If it is above 30, it indicates serious multicollinearity problems.
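The condition number defined this way coincides with the 2-norm condition number of the design matrix itself (ratio of largest to smallest singular value); a check on made-up data with a nearly dependent column:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=30)   # near-dependent column

eig = np.linalg.eigvalsh(X.T @ X)
kappa = np.sqrt(eig.max() / eig.min())           # condition number as defined above
```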
Call in SAS
All the mentioned statistics can be found using simple SAS statements, e.g.
proc reg data = sundhed;
model ilt = maxpuls loebetid / r influence;
Model statements etc. are the same in REG as in GLM. The diagnostics come with the options / r influence.
4.2 Regression using orthogonal polynomials
When performing a regression analysis using polynomials one can often obtain rather large computational savings and numerical stability by introducing the so-called orthogonal polynomials. In the end this will give the same expression for the estimate of the mean value as a function of the independent variable, but with a considerably smaller computational load.
4.2.1 Definition and formulation of the model.
We will assume that a polynomial regression model is given, i.e. that
   Y_j = α + β₁ξ₁(t_j) + ... + β_kξ_k(t_j) + ε_j,   j = 1, ..., n.
Here ξ_i, i = 0, 1, ..., k, are known polynomials of i'th degree in t. We assume that the ε_j's are independent N(0, σ²)-distributed. In the usual fashion we can in this model estimate and test hypotheses regarding the parameters (α, β₁, ..., β_k).
As noted before, it would be a great advantage to consider the so-called orthogonal polynomials, since the computational load will be reduced considerably. We introduce
DEFINITION 4.1. By a set of orthogonal polynomials corresponding to the values t₁, ..., t_n we mean polynomials ξ₀, ..., ξ_k, where ξ_i is of i'th degree, which satisfy
   Σ_{j=1}^n ξ_i(t_j) = 0,   i = 1, 2, ..., k   (4.1)
   Σ_{j=1}^n ξ_i(t_j)ξ_l(t_j) = 0,   i ≠ l   (4.2)
REMARK 4.2. It is seen that ξ₀ is a constant, so 4.1 is of course not used for ξ₀. For notational reasons we let ξ_i(t_j) = ξ_{ij}. Later we will return to the problem of actually determining orthogonal polynomials.
If we now assume that the polynomials in the model are orthogonal, we find, using the matrix
   ξ = [ξ₀(t₁) ξ₁(t₁) ... ξ_k(t₁); ... ; ξ₀(t_n) ξ₁(t_n) ... ξ_k(t_n)],
that
   ξ'ξ = diag(Σ_j ξ₀(t_j)², Σ_j ξ₁(t_j)², ..., Σ_j ξ_k(t_j)²),
i.e. ξ'ξ is a diagonal matrix. We therefore find
   β̂_i = Σ_j ξ_i(t_j)Y_j / Σ_j ξ_i(t_j)²
and
   D(β̂) = σ²(ξ'ξ)⁻¹ = σ² diag(1/Σ_j ξ₀(t_j)², ..., 1/Σ_j ξ_k(t_j)²).
We now have that the estimators for the parameters are uncorrelated, and since we are working in a normal model they are therefore also stochastically independent.
We find that the residual sum of squares is
   SS_res = Σ_j (Y_j - α̂ - Σ_{i=1}^k β̂_iξ_i(t_j))².
From this we immediately have
THEOREM 4.1. We have the following partitioning of the total variation
   Σ_j (Y_j - Ȳ)² = Σ_{i=1}^k β̂_i² Σ_j ξ_i(t_j)² + SS_res,
or with an easily understood notation
   SS_tot = SS_{1.deg} + ... + SS_{k.deg} + SS_res,
i.e. the total sum of squares has been partitioned into terms corresponding to each polynomial plus the residual sum of squares. The degrees of freedom are n - 1, respectively 1, ..., 1 and n - k - 1.
PROOF. Follows trivially from the above.
Using the partition theorem we furthermore have
THEOREM 4.3. The sums of squares
   SS_{i.deg} = (Σ_j ξ_i(t_j)Y_j)² / Σ_j ξ_i(t_j)²,   i = 1, ..., k,
are stochastically independent with expected values
   E(SS_{i.deg}) = σ² + β_i² Σ_j ξ_i(t_j)²,   i = 1, ..., k.
Finally
   E(SS_res/(n - k - 1)) = σ²,
and if β_i = 0, then SS_{i.deg}/σ² ∈ χ²(1).
PROOF. Obvious.
179
The theorems contain the necessary results for establishing tests of the hypotheses β_i = 0.
Variation     SS          f            E(SS/f)
Linear        SS_{1.deg}  1            σ² + β₁² Σ_j ξ₁(t_j)²
Quadratic     SS_{2.deg}  1            σ² + β₂² Σ_j ξ₂(t_j)²
Cubic         SS_{3.deg}  1            σ² + β₃² Σ_j ξ₃(t_j)²
...
k'th order    SS_{k.deg}  1            σ² + β_k² Σ_j ξ_k(t_j)²
Residual      SS_res      n - k - 1    σ²
Total         SS_tot      n - 1
REMARK 4.3. The big advantage of using orthogonal polynomials in the regression
analysis is that one without changing any of the previous computations can introduce
polynomials of degree (p + 1) and degree (p + 2) etc. When establishing the order
for the describing polynomial we will usually continue (estimation and) testing until 2
successive f3i 's = 0 since contributions which are caused by terms of even degree and
terms of odd degree are different in nature. This is, however, a rule of thumb which
should be used with caution. If we e.g. have an idea which is based on physical con
siderations that terms of 5th order are important, then we would not stop the analysis
just because the 3rd and 4th degree coefficients do not differ significantly from O.
4.2.2 Determination of orthogonal polynomials.
It is readily seen that multiplication by a constant does not change the orthogonality conditions 4.1 and 4.2. We therefore choose to let
   ξ₀(t) = 1.
The polynomial of 1st degree is
   ξ₁(t) = t + a,
since we can choose the coefficient of t as 1. From 4.1 we have
   0 = Σ_{j=1}^n ξ₁(t_j) = Σ_{j=1}^n (t_j + a) = Σ_{j=1}^n t_j + na,
i.e. a = -t̄, so that
   ξ₁(t) = t - t̄.
We can then choose ξ₂ as a linear combination of 1, (t - t̄) and (t - t̄)², i.e.
   ξ₂(t) = a₀₂ + a₁₂(t - t̄) + a₂₂(t - t̄)².
From 4.1 we have
   0 = Σ_{j=1}^n ξ₂(t_j) = n·a₀₂ + a₁₂ Σ_j (t_j - t̄) + a₂₂ Σ_j (t_j - t̄)²,
i.e.
   a₀₂/a₂₂ = -(1/n) Σ_j (t_j - t̄)².
From 4.2 we have
   0 = Σ_{j=1}^n ξ₁(t_j)ξ₂(t_j) = a₀₂ Σ_j (t_j - t̄) + a₁₂ Σ_j (t_j - t̄)² + a₂₂ Σ_j (t_j - t̄)³
     = a₁₂ Σ_j (t_j - t̄)² + a₂₂ Σ_j (t_j - t̄)³.
From this we get
   a₁₂/a₂₂ = -Σ_j (t_j - t̄)³ / Σ_j (t_j - t̄)².
ξ₃ etc. are found analogously.
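The coefficient formulas for ξ₁ and ξ₂ can be checked directly; a sketch with made-up, non-equidistant support points (a₂₂ is set to 1):

```python
# made-up (non-equidistant) support points
t = [1.0, 2.0, 4.0, 7.0, 11.0]
n = len(t)
tbar = sum(t) / n

xi1 = [tj - tbar for tj in t]                 # xi_1(t) = t - tbar

# xi_2(t) = a02 + a12*(t - tbar) + (t - tbar)^2, coefficients from 4.1 and 4.2
S2 = sum((tj - tbar)**2 for tj in t)
S3 = sum((tj - tbar)**3 for tj in t)
a02 = -S2 / n
a12 = -S3 / S2
xi2 = [a02 + a12 * (tj - tbar) + (tj - tbar)**2 for tj in t]

sum_xi1 = sum(xi1)
sum_xi2 = sum(xi2)
cross = sum(a * b for a, b in zip(xi1, xi2))  # should vanish by 4.2
```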
The computations are especially simple if the t_j's are equidistant. Then we let
   u_j = (t_j - t̄)/w,   j = 1, ..., n,
where w = t_{j+1} - t_j is the common spacing, so that u_{j+1} - u_j = 1. For these u's the orthogonal polynomials can be found by the recursion
   ξ₀(u) = 1   (4.3)
   ξ₁(u) = u   (4.4)
   ξ_{r+1}(u) = u·ξ_r(u) - (r²(n² - r²))/(4(4r² - 1))·ξ_{r-1}(u),   r = 1, 2, ...   (4.5)
In the table on p. 183 we have given some values of the orthogonal polynomials ξ₁, ..., ξ_k, k ≤ 5, with t = 1, ..., n for n = 1, ..., 8. In order to avoid fractional numbers and large values we have chosen to give polynomials where the coefficient of the term of largest degree is a number λ, which is also given in the table. Furthermore we have stated the sums
   D = Σ_{j=1}^n ξ_i(t_j)².
We now give an illustrative
EXAMPLE 4.1. In the following table corresponding values of reaction temperature and yield of a process (in a fixed time) have been given.

Temperature   Yield
200°F         0.75 oz.
210°F         1.00 oz.
220°F         1.35 oz.
230°F         1.80 oz.
240°F         2.60 oz.
250°F         3.60 oz.
260°F         5.45 oz.

We will try to describe the yield as a function of temperature using a polynomial. We will assume that the assumptions needed to perform a regression analysis are fulfilled. First we transform the temperatures T_i, i = 1, ..., 7, by means of the relation
   t_i = (T_i - (200 - 10))/10 = (T_i - 190)/10.
We then get the values t₁, ..., t₇ = 1, ..., 7.
[Table 4.1 tabulates the values of the orthogonal polynomials ξ₁, ..., ξ₅ together with the leading coefficients λ and the sums D; the entries are not legible in this copy.]
Table 4.1: Values of orthogonal polynomials.
j    ξ₁   ξ₂   ξ₃   ξ₄   ξ₅    y_j
1    -3    5   -1    3   -1    0.75
2    -2    0    1   -7    4    1.00
3    -1   -3    1    1   -5    1.35
4     0   -4    0    6    0    1.80
5     1   -3   -1    1    5    2.60
6     2    0   -1   -7   -4    3.60
7     3    5    1    3    1    5.45
D    28   84    6  154   84    Σy_j = 16.55
Σ_j ξ_{ij}y_j   20.55  11.95  0.85  1.15  0.55    Σy_j² = 56.0475
λ     1    1   1/6  7/12  7/20

SS_tot = 56.0475 - 16.55²/7 = 56.0475 - 39.1289 = 16.9186
α̂ = 16.55/7 = 2.36
β̂₁ = 20.55/28 = 0.7339
β̂₂ = 11.95/84 = 0.1423
β̂₃ = 0.85/6 = 0.1417
β̂₄ = 1.15/154 = 0.0075
β̂₅ = 0.55/84 = 0.0065
SS_{1.deg} = 20.55²/28 = 15.0822
SS_{2.deg} = 11.95²/84 = 1.7000
SS_{3.deg} = 0.85²/6 = 0.1204
SS_{4.deg} = 1.15²/154 = 0.0086
SS_{5.deg} = 0.55²/84 = 0.0036
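The estimates and sums of squares above follow directly from the tabulated polynomial values; a small check of the first two components:

```python
# tabulated orthogonal-polynomial values for n = 7 and the observed yields
xi1 = [-3, -2, -1, 0, 1, 2, 3]
xi2 = [5, 0, -3, -4, -3, 0, 5]
y = [0.75, 1.00, 1.35, 1.80, 2.60, 3.60, 5.45]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

alpha = sum(y) / 7                       # 16.55 / 7
beta1 = dot(xi1, y) / dot(xi1, xi1)      # 20.55 / 28
beta2 = dot(xi2, y) / dot(xi2, xi2)      # 11.95 / 84
SS1 = dot(xi1, y)**2 / dot(xi1, xi1)
SS2 = dot(xi2, y)**2 / dot(xi2, xi2)
```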
We summarise the results in the following table. [The summary analysis-of-variance table is not legible in this copy; the successive F-tests give probabilities of about 99.8%, 99.7% and 98.0% for the 1st, 2nd and 3rd degree terms, and clearly smaller values for the higher-order terms.] We see that the terms of 2nd and 3rd degree are significant and the two following are not significant, so we will choose a polynomial of 3rd degree for the description.
From the recursion formulas 4.3, 4.4 and 4.5 we get, since n = 7,
   ξ₁(t) = t - 4
   ξ₂(t) = (t - 4)² - 48/12 = t² - 8t + 12
   ξ₃(t) = (t - 4)(t² - 8t + 12) - (4·45)/(4·15)·(t - 4) = t³ - 12t² + 41t - 36.
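The recursion can be run numerically for n = 7; it reproduces ξ₃(t) = t³ - 12t² + 41t - 36 as well as the orthogonality conditions:

```python
# recursion 4.3-4.5 for equidistant points t = 1,...,n (here n = 7, so tbar = 4)
n = 7
t = list(range(1, n + 1))
tbar = 4.0

xi = {0: [1.0] * n, 1: [tj - tbar for tj in t]}
for r in (1, 2):
    c = r * r * (n * n - r * r) / (4.0 * (4 * r * r - 1))
    xi[r + 1] = [(tj - tbar) * a - c * b
                 for tj, a, b in zip(t, xi[r], xi[r - 1])]

xi3_direct = [tj**3 - 12 * tj**2 + 41 * tj - 36 for tj in t]
```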
Since λ₁ = 1, λ₂ = 1 and λ₃ = 1/6, we get the following estimated polynomial
   μ̂(t) = 2.36 + 1·β̂₁ξ₁(t) + 1·β̂₂ξ₂(t) + (1/6)·β̂₃ξ₃(t)
        = 0.0236t³ - 0.1409t² + 0.5631t + 0.2818.
Since
   t_i = (T_i - 190)/10,
we can get an expression in the original temperatures by entering this relationship into the expression for μ̂(t). We find
   ĝ(T) = 0.000024T³ - 0.014861T² + 3.147610T - 223.15440.
The estimated polynomial is shown together with the original data in the following figure.
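A check that the estimated 3rd-degree polynomial reproduces the observed yields closely; the coefficients are the rounded ones quoted above, so only rough agreement is asserted (the T-scale version is obtained by substitution rather than from the rounded ĝ coefficients):

```python
def mu(t):
    """Estimated cubic in the coded temperature t = 1..7 (rounded coefficients)."""
    return 0.0236 * t**3 - 0.1409 * t**2 + 0.5631 * t + 0.2818

def g(T):
    """Same polynomial expressed in the original temperature scale."""
    return mu((T - 190) / 10.0)

y = [0.75, 1.00, 1.35, 1.80, 2.60, 3.60, 5.45]
max_abs = max(abs(yj - mu(t)) for t, yj in zip(range(1, 8), y))
```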
4.3 Choice of the "best" regression equation
In this section we will consider the problem of choosing a suitable (small) number of independent variables giving a reasonable description of our data.
4.3.1 The Problem.
If we are in the (unpleasant) situation of not being able to formulate a model based upon physical relationships for the phenomena we are studying, we will often simply register all the variables which might have an effect on our observed values.
Figure 4.5: The correspondence between temperature and yield for the process given in example 4.1. [The figure plots yield (oz., 1-6) against temperature (200-260 °F) together with the estimated polynomial; the curve itself is not reproduced here.]
If we then also include squares, products etc. of the basic variables (from a Taylor-approximation point of view), we will very quickly have an enormous number of terms in our regression. If we start off with 10 basic variables x₁, ..., x₁₀, then an ordinary second order polynomial in these variables will contain 66 terms. If we include 3rd degree terms we have on the order of 150 terms. Expressions containing so many terms will (if it is at all possible to estimate all the parameters) be very tedious to work with. If we e.g. wish to determine optimal production conditions for a chemical process, we could estimate the response surface and find its maximum. This will be extremely difficult if there are many variables involved. We would therefore seek a considerably smaller number of terms which still gives a reasonably good description of the variation in the material (cf. the section on ridge regression).
It is important, however, to note that an expression found by applying the methods discussed in the following should be used with caution. It will (probably) be an expression which describes the data at hand very well. Whether or not it is adequate for predicting future observations depends upon whether the expression also describes the physical conditions well enough. One way of examining this is to base the estimation on only half of the data in the first instance and then compare the other half with the estimated model. If the degree of agreement is reasonable, we have an indication that the model is not completely inadequate as a prediction model.
We will use a single illustrative example for all the methods we will describe. In order
to keep the computations manageable we have only taken a very small part of the original data material. We should therefore not evaluate the suitability of the methods by means of the example, but only use it as an illustration of the principles and the way of going about things. The data are some corresponding measurements of the quality y of a food additive (measured using viscosity) and some production parameters x₁, x₂, x₃ (pressure, temperature and degree of neutralisation). In order to simplify the calculations the data are coded, i.e. the variables have had some constants subtracted and been divided by others. We have the following measurements
y Xl X2 X3
4.9 0 0 2
3.0 1 0 1
0.2 1 1 0
2.9 1 2 2
6.4 2 1 2
Experience shows that within a suitably small region of variation of the production
parameters it is reasonable to assume that the quality shows a linear dependency on
these. We will therefore use the following model
   E(Y|x) = α + β₁x₁ + β₂x₂ + β₃x₃,
or in matrix form
   [Y₁]   [1  0  0  2]            [ε₁]
   [Y₂]   [1  1  0  1]   [α ]     [ε₂]
   [Y₃] = [1  1  1  0] · [β₁]  +  [ε₃]
   [Y₄]   [1  1  2  2]   [β₂]     [ε₄]
   [Y₅]   [1  2  1  2]   [β₃]     [ε₅]
   ε ∈ N(0, σ²I).
In the numerical appendix (p. 197) all the 2³ regression analyses with y as dependent variable and one or more of the x's as independent variables are shown. The following models are possible:
   M:   E(Y) = α + β₁x₁ + β₂x₂ + β₃x₃
   H₁₂: E(Y) = α + β₁x₁ + β₂x₂
   H₁₃: E(Y) = α + β₁x₁ + β₃x₃
   H₂₃: E(Y) = α + β₂x₂ + β₃x₃
   H₁:  E(Y) = α + β₁x₁
   H₂:  E(Y) = α + β₂x₂
   H₃:  E(Y) = α + β₃x₃
   H₀:  E(Y) = α
For each of these 8 models the estimators for α and the β's are shown; we find the projection of the observation vector onto the subspace corresponding to the model, we determine the residual vector, the squared length of the residual vector (the residual sum of squares), the estimate of the variance, and the (squared) multiple correlation coefficient. After that we show the analysis of variance tables for the possible sequences of successive testing of hypotheses that the mean vector is a member of successively smaller (lower-dimensional) subspaces, in sequences like
   M ⊇ H₁₂ ⊇ H₂ ⊇ H₀.
The above mentioned sequence of subspaces corresponds to successive testing of the hypotheses
   β₃ = 0, β₁ = 0, β₂ = 0.
There are 6 (= 3!) possible tables of this type. Finally we show some partial correlation matrices. If we let y = x₄, the empirical variance-covariance matrix is (as usual) defined by the (i, j)'th element being
   s_ij = (1/(n - 1)) Σ_{k=1}^n (x_{ki} - x̄_i)(x_{kj} - x̄_j).
The (i, j)'th element in the correlation matrix is then
   r_ij = s_ij/√(s_ii s_jj).
Using the formula on p. 84 in section 2 we then compute the partial correlations for given x₃ and for given x₂, x₃.
We now have enough background material to mention some of the most popular ways of selecting single independent variables to describe the variation of the dependent variable.
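The partial correlations used below can be reproduced from the ordinary correlations by the standard first-order formula r_{12·3} = (r₁₂ - r₁₃r₂₃)/√((1 - r₁₃²)(1 - r₂₃²)); with the example data this gives the values 0.8416 and -0.5728 quoted later:

```python
import math

# example data as printed (y and x2, x3)
y = [4.9, 3.0, 0.2, 2.9, 6.4]
x2 = [0, 0, 1, 2, 1]
x3 = [2, 1, 0, 2, 2]

def corr(a, b):
    n = len(a)
    am, bm = sum(a) / n, sum(b) / n
    sab = sum((ai - am) * (bi - bm) for ai, bi in zip(a, b))
    saa = sum((ai - am)**2 for ai in a)
    sbb = sum((bi - bm)**2 for bi in b)
    return sab / math.sqrt(saa * sbb)

r_y2, r_y3, r_23 = corr(y, x2), corr(y, x3), corr(x2, x3)

# partial correlation of y and x2 given x3
r_y2_3 = (r_y2 - r_y3 * r_23) / math.sqrt((1 - r_y3**2) * (1 - r_23**2))
```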
4.3.2 Examination of all regressions.
This method can of course only be used if there are reasonably few variables. We summarise the results from the appendix in the following table.
Model                                    Multiple R²   Residual variance s²   Average of s²
H₀:  E(Y) = α                            0             5.47                   5.47
H₁:  E(Y) = α + β₁x₁                     5.1%          6.91
H₂:  E(Y) = α + β₂x₂                     3.8%          7.01                   5.35
H₃:  E(Y) = α + β₃x₃                     70.8%         2.13
H₁₂: E(Y) = α + β₁x₁ + β₂x₂              15.3%         9.26
H₁₃: E(Y) = α + β₁x₁ + β₃x₃              76.0%         2.63                   4.68
H₂₃: E(Y) = α + β₂x₂ + β₃x₃              80.4%         2.14
M:   E(Y) = α + β₁x₁ + β₂x₂ + β₃x₃       97.1%         0.634                  0.634
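The R² column can be recomputed by fitting every subset of {x₁, x₂, x₃} by least squares on the data as printed (numpy assumed). A couple of the book's figures differ from the recomputation in the last digits, so only the one-variable models and H₂₃ are asserted:

```python
import numpy as np
from itertools import combinations

# example data as printed (y; x1, x2, x3)
y = np.array([4.9, 3.0, 0.2, 2.9, 6.4])
x = {1: np.array([0.0, 1, 1, 1, 2]),
     2: np.array([0.0, 0, 1, 2, 1]),
     3: np.array([2.0, 1, 0, 2, 2])}

ss_tot = ((y - y.mean())**2).sum()          # 21.868
r2 = {}
for k in range(4):
    for subset in combinations([1, 2, 3], k):
        X = np.column_stack([np.ones(5)] + [x[j] for j in subset])
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ b
        r2[subset] = 1 - (e @ e) / ss_tot
```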
Looking at the multiple correlation coefficient quickly indicates that we do not gain much by going from one variable (x₃) to 2 variables. The crucial jump happens when including all 3 variables. Considerations of this type lead us rather to just use x₃, i.e. the model E(Y|x) = α + β₃x₃. This decision is strengthened by looking at the residual variance s². We see that s² for the best equation in one variable is less than for the best equation in two variables, which strongly indicates that we should just look at one variable (or use all three). If we besides looking at the smallest s² also look at the average values and depict them by the number of included variables, we get graphs like the following.
[Two graphs: the smallest s² and the average s², each plotted against the number of variables (0-3) in the equation.]
This also indicates that the number of variables in the equation should be either 1 or 3 (there is no significant improvement in going from 1 to 2).
If we only look at the graph with the average values, it is not obvious that we should include any independent variables at all. We could therefore test if β₃ = 0 in the model H₃ (E(Y|x) = α + β₃x₃). We find the test statistic
   (15.488/1)/(6.38/3) = 7.28 ≃ F(1, 3)_{0.92}.
Therefore we will reject β₃ = 0 at all levels greater than 8%.
As a conclusion of these (rather loose) considerations we will use the model H₃:
   ŷ = α̂ + β̂₃x₃ = 0.4 + 2.2x₃.
(Here ^ means "estimated value of".) The estimate of the error (the variance) on the measurements is (estimated with 3 degrees of freedom) s² = 2.13.
REMARK 4.4. It should be added here that the idea of looking at the averages of the residual variances does seem a bit dubious. It has been included merely because the method seems to enjoy widespread use, at least in some parts of the literature.
4.3.3 Backwards elimination.
This method is far more economical with respect to computational time than the previous one. Here we start with the full model M and investigate which of the coefficients has the smallest F-value for a test of the hypothesis that the coefficient might be 0.
This variable is then excluded and the procedure is repeated with the k - 1 remaining variables, etc.
We stop the procedure when none of the remaining variables has an F-value less than the 1 - α quantile in the relevant F-distribution.
We can illustrate the procedure using our example. We collect the computations in the following table. From the table it can be seen that this procedure also ends with the model H₃: E(Y) = α + β₃x₃.
[Flow diagram: start with the full model; find F-values for all variables in the equation; if the smallest F-value is below the chosen limit, remove the corresponding variable and repeat; otherwise stop.]
Figure 4.6: Flow diagram for the backwards-elimination procedure in stepwise regression analysis.
Step   F-value for test of β_i = 0                  Quantile in F-distribution
1   Model: E(Y) = α + β₁x₁ + β₂x₂ + β₃x₃
    β₁: (3.652/1)/(0.634/1) = 5.76                  F(1, 1)_{0.71}
    β₂: (4.621/1)/(0.634/1) = 7.29                  F(1, 1)_{0.72}
    β₃: (17.879/1)/(0.634/1) = 28.20                F(1, 1)_{0.86}
2   Remove x₁. Model is now: E(Y) = α + β₂x₂ + β₃x₃
    β₂: (2.095/1)/(4.285/2) = 0.98                  F(1, 2)_{0.55}
    β₃: (16.757/1)/(4.285/2) = 7.82                 F(1, 2)_{0.88}
3   Remove x₂. Model is now: E(Y) = α + β₃x₃
    β₃: (15.488/1)/(6.38/3) = 7.28                  F(1, 3)_{0.92}
"':0'""8<' __ * .". _ .._ .. _
Tho dilldvlntllo with thl. method "1 thlt w. hlye to ."lve the full rcllrcllilion model
which can bel problem If then vlli.bloll,
[Flow diagram: start with the constant term only; choose the variable with the largest partial correlation with the dependent variable; find the F-value for that variable; if it is significant, include the variable in the equation and repeat; otherwise stop.]
Figure 4.7: Flow diagram for the forward-selection procedure in stepwise regression analysis.
4.3.4 Forward selection
In this procedure we start with only the constant term in the equation. Then we choose the independent variable which shows the greatest correlation with the dependent variable. We then perform an F-test to check if its coefficient is significantly different from 0. If so, it is included in the model.
Among the independent variables not yet included we now choose the one that has the greatest partial correlation coefficient with the dependent variable given the variables already in the equation. We perform an F-test to check if the new variable has contributed to the reduction of the residual variance, i.e. if its coefficient is different from 0. If so, we continue as before; if not, we stop the analysis.
In our example the steps will be the following.
1) From the correlation matrix (p. 202) we see that x₃ has the greatest correlation coefficient with y, viz. 0.8416. We test if β₃ in the model E(Y) = α + β₃x₃ can be assumed to be 0; we have the test statistic (see p. 201)
   (15.488/1)/(6.38/3) = 7.28 ≃ F(1, 3)_{0.92}.
If we use α = 10% we continue (since we then reject β₃ = 0).
2) From the partial correlation matrix given x₃ (p. 202) we see that the variable which has the greatest partial correlation coefficient with the y's (given that x₃ is in the equation) is x₂ (r = -0.5728). We include x₂ and check if β₂ in the model E(Y) = α + β₂x₂ + β₃x₃ can be assumed to be 0. We have the test statistic (see p. 202)
   (2.095/1)/(4.2855/2) = 0.98 ≃ F(1, 2)_{0.55}.
Since we were using α = 10%, the coefficient is not significantly different from 0, and we stop the analysis here without including x₂. The resulting model is
   E(Y) = α + β₃x₃,
where α and β₃ are estimated as earlier. We especially note that x₁ has not been included in the equation at all.
REMARK 4.5. If we had used α = 50% we would have continued the analysis and considered the partial correlations given x₂ and x₃ (see the matrix on p. 203). Now x₁ is the only variable not included, so it is trivially the one which has the greatest partial correlation with y. We include x₁ in the equation and investigate if β₁ in the model E(Y) = α + β₁x₁ + β₂x₂ + β₃x₃ is significantly different from 0. The test statistic is (p. 201)
   (3.652/1)/(0.634/1) = 5.76 ≃ F(1, 1)_{0.71}.
In this case we have seen that the equation was extended considerably just by changing α. It is important to note that changes in α can have drastic consequences for the resulting model.
REMARK 4.6. The procedure of choosing the variable which has the greatest partial correlation with the dependent variable is closely connected to the relation between the partial correlation coefficient and the F-statistic. This is of the form
   F = g(r) = r²/(1 - r²)·f,
where f is the number of degrees of freedom for the denominator (cf. p. 166). This relation is monotonically increasing in r².
If we e.g. in step 2 want to compute the F-test statistic from the correlation matrix, we get
   F = (-0.5728)²/(1 - (-0.5728)²)·2 = 0.98.
It is seen that the mentioned criterion is equivalent to, at each step, always taking the variable which gives the greatest reduction in the residual sum of squares.
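The relation is a one-liner; with the numbers from step 2 it reproduces the F-value 0.98:

```python
def f_from_partial_corr(r, f_denom):
    """F = r^2/(1 - r^2) * f, where f_denom is the denominator d.o.f."""
    return r * r / (1 - r * r) * f_denom

F = f_from_partial_corr(-0.5728, 2)
```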
REMARK 4.7. In some of the existing standard regression programmes it is not possible to state an α. We must then instead give a fixed number as the limit for the F-test statistics we will accept respectively reject, and find a suitable value by looking at a table of F-quantiles. If we e.g. wish to have α = 5%, we see that we should use the value 4, since
   F(1, n)_{0.95} ≈ 4
for reasonably large values of n.
The forward selection method has its merits compared to the backward elimination method in that we do not have to compute the total equation. The biggest drawback of the method is that we do not take into account that variables already in the equation could be redundant if others enter at a later stage. If we e.g. have that x₁ ≈ a·x₂ + b·x₃ and x₁ has been chosen as the most important variable, and we then at a later stage in the analysis also include x₂ and x₃, then it is obvious that we no longer need x₁. It should therefore be removed. This happens in the last method we mention.
4.3.5 Stepwise regression.
The name is badly chosen, since we could equally well call the last two methods by this name. There are also many authors who use the name stepwise regression as a common name for a number of different procedures. In this text we will specifically have the following method in mind. The choice of the variable to enter the equation is performed as in the forward selection procedure, but at every single step we check each of the variables in the equation as if it were the last included variable. We then compute an F-test statistic for all the variables in the equation. If some of these are smaller than the 1 - α quantile in the relevant F-distribution, the corresponding variable is removed. If we look at our standard example we get the following steps (α_in = 50%, α_out = 40%).
1) x₃ is included as in the forward selection procedure and we test if β₃ is significantly different from 0. The test statistic and the conclusion are as before.
2) We now include x₂. We compute the partial F-test for β₂ (in the model E(Y) = α + β₂x₂ + β₃x₃):
   F-value = (2.095/1)/(4.285/2) = 0.98 ≃ F(1, 2)_{0.55}.
Then we compute a partial F-test for β₃ (in the same model). Using the table on p. 201 we find that
   F-value = (16.757/1)/(4.285/2) = 7.82 ≃ F(1, 2)_{0.88}.
3) We now remove x₂ from the equation again, since 0.55 < 0.60. The difference at this step between the forward selection procedure and the stepwise procedure is that we also compute an F-value for x₃ and thereby have the possibility that x₃ will be eliminated from the equation again. This was not possible with the ordinary forward selection procedure.
4) The only remaining variable is x₃. It has a partial F-value of 7.28 ≃ F(1, 3)_{0.92}, so it stays in the equation.
[Flow diagram: choose the variable with the largest partial correlation with the dependent variable; find its F-value; if significant, include the variable in the equation; then find F-values for all variables in the equation and remove any variable whose F-value is too small; repeat until no variable can be entered or removed.]
Figure 4.8: Flow diagram for the stepwise-regression procedure in stepwise regression analysis.
The analysis stops, and we have the model
   ŷ = 0.4 + 2.2x₃.
REMARK 4.8. The reason why we investigated the partial F-value under 2), but not under 4), is that x₁ does not enter the equation at all, since
   0.43 < F(1, 2)_{0.50} = F_{α_in}.
On the other hand x₂ was entered into the equation, since
   0.98 = F(1, 2)_{0.55} > F_{α_in}.
REMARK 4.9. As in the section on the forward selection procedure, we note that we are often forced to use fixed F-values instead of 1 - α quantiles. If we do not use the same level when determining whether to include more variables as when determining whether some of the variables should be removed, we will often let the last value be about half as big as the first one, i.e.
   F(out of equation) = ½·F(into equation).
(This is the opposite of what we actually used in the example.)
4.3.6 Numerical appendix.
In this appendix we show the calculation of the numbers used in the previous sections. It should not be necessary to go through all these computations, but they are shown so that we, with their help, can check our understanding of the different principles.
'IJ ;'"
41)_.'(j'
3,0 1
0,2 1
2,D 1
CU 2
; / : ~ ;1::1
ff':f
() 1
1 0
2 2
1 a
B. Basic Model: E(Y) = α + β₁x₁ + β₂x₂ + β₃x₃, or
   Y = xθ + ε,   ε ∈ N(0, σ²I).

C. Estimators in submodels

i) Model M: E(Y) = α + β₁x₁ + β₂x₂ + β₃x₃
   (α̂, β̂₁, β̂₂, β̂₃)' = (-0.175, 1.450, -1.400, 2.375)'
   P_M(y) = (4.575, 3.650, -0.125, 3.225, 6.075)'
   y - P_M(y) = (0.325, -0.650, 0.325, -0.325, 0.325)'
   1/(5-4)·||y - P_M(y)||² = 0.845/1 = 0.845
   R² = (21.868 - 0.6338)/21.868 = 97.1%

ii) Model H₁₂: E(Y) = α + β₁x₁ + β₂x₂
   (α̂, β̂₁, β̂₂)' = (3.026, 1.243, -0.987)'
   P_H12(y) = (3.026, 4.269, 3.282, 2.295, 4.525)'
   y - P_H12(y) = (1.874, -1.269, -3.082, 0.605, 1.875)'
   1/(5-3)·||y - P_H12(y)||² = 18.5126/2 = 9.2563
   R² = (21.868 - 18.5126)/21.868 = 15.3%

iii) Model H₁₃: E(Y) = α + β₁x₁ + β₃x₃
   (α̂, β̂₁, β̂₃)' = (-0.350, 0.750, 2.200)'
   P_H13(y) = (4.05, 2.60, 0.40, 4.80, 5.55)'
   y - P_H13(y) = (0.85, 0.40, -0.20, -1.90, 0.85)'
   1/(5-3)·||y - P_H13(y)||² = 5.2550/2 = 2.6275
   R² = (21.868 - 5.2550)/21.868 = 76.0%

iv) Model H₂₃: E(Y) = α + β₂x₂ + β₃x₃
   (α̂, β̂₂, β̂₃)' = (0.945, -0.872, 2.309)'
   P_H23(y) = (5.563, 3.254, 0.073, 3.819, 4.691)'
   y - P_H23(y) = (-0.663, -0.254, 0.127, -0.919, 1.709)'
   1/(5-3)·||y - P_H23(y)||² = 4.2855/2 = 2.1427
   R² = (21.868 - 4.2855)/21.868 = 80.4%

v) Model H₁: E(Y) = α + β₁x₁
   (α̂, β̂₁)' = (2.73, 0.75)'
   P_H1(y) = (2.73, 3.48, 3.48, 3.48, 4.23)'
   y - P_H1(y) = (2.17, -0.48, -3.28, -0.58, 2.17)'
   1/(5-2)·||y - P_H1(y)||² = 20.7430/3 = 6.914
   R² = (21.868 - 20.743)/21.868 = 5.1%

vi) Model H₂: E(Y) = α + β₂x₂
   (α̂, β̂₂)' = (3.914, -0.543)'
   P_H2(y) = (3.914, 3.914, 3.371, 2.828, 3.371)'
   y - P_H2(y) = (0.986, -0.914, -3.171, 0.072, 3.029)'
   1/(5-2)·||y - P_H2(y)||² = 21.0429/3 = 7.0143
   R² = (21.868 - 21.043)/21.868 = 3.8%

vii) Model H₃: E(Y) = α + β₃x₃
   (α̂, β̂₃)' = (0.4, 2.2)'
   P_H3(y) = (4.8, 2.6, 0.4, 4.8, 4.8)'
   y - P_H3(y) = (0.1, 0.4, -0.2, -1.9, 1.6)'
   1/(5-2)·||y - P_H3(y)||² = 6.38/3 = 2.1267
   R² = (21.868 - 6.38)/21.868 = 70.8%

viii) Model H₀: E(Y) = α
   α̂ = 3.48
   P_H0(y) = (3.48, 3.48, 3.48, 3.48, 3.48)'
   y - P_H0(y) = (1.42, -0.48, -3.28, -0.58, 2.92)'
   1/(5-1)·||y - P_H0(y)||² = 21.8680/4 = 5.4670
D. Successive testings

1) M ⊇ H12 ⊇ H1 ⊇ H0, i.e. β3 = 0, β2 = 0, β1 = 0:

Variation             SS                              d.o.f.
H0 - H1   (β1 = 0)    21.8680 - 20.7430 =  1.125      1
H1 - H12  (β2 = 0)    20.7430 - 18.5126 =  2.230      1
H12 - M   (β3 = 0)    18.5126 -  0.6338 = 17.879      1
M - obs                           0.6338 =  0.634      1
H0 - obs                         21.868                4

2) M ⊇ H12 ⊇ H2 ⊇ H0, i.e. β3 = 0, β1 = 0, β2 = 0:

Variation             SS                              d.o.f.
H0 - H2   (β2 = 0)    21.8680 - 21.0429 =  0.825      1
H2 - H12  (β1 = 0)    21.0429 - 18.5126 =  2.530      1
H12 - M   (β3 = 0)    18.5126 -  0.6338 = 17.879      1
M - obs                           0.6338 =  0.634      1
H0 - obs                         21.868                4

3) M ⊇ H13 ⊇ H1 ⊇ H0, i.e. β2 = 0, β3 = 0, β1 = 0:

Variation             SS                              d.o.f.
H0 - H1   (β1 = 0)    21.8680 - 20.7430 =  1.125      1
H1 - H13  (β3 = 0)    20.7430 -  5.2550 = 15.488      1
H13 - M   (β2 = 0)     5.2550 -  0.6338 =  4.621      1
M - obs                           0.6338 =  0.634      1
H0 - obs                         21.868                4

4) M ⊇ H13 ⊇ H3 ⊇ H0, i.e. β2 = 0, β1 = 0, β3 = 0:

Variation             SS                              d.o.f.
H0 - H3   (β3 = 0)    21.8680 -  6.38   = 15.488      1
H3 - H13  (β1 = 0)     6.38   -  5.2550 =  1.125      1
H13 - M   (β2 = 0)     5.2550 -  0.6338 =  4.621      1
M - obs                           0.6338 =  0.634      1
H0 - obs                         21.868                4

5) M ⊇ H23 ⊇ H2 ⊇ H0, i.e. β1 = 0, β3 = 0, β2 = 0:

Variation             SS                              d.o.f.
H0 - H2   (β2 = 0)    21.8680 - 21.0429 =  0.825      1
H2 - H23  (β3 = 0)    21.0429 -  4.2855 = 16.757      1
H23 - M   (β1 = 0)     4.2855 -  0.6338 =  3.652      1
M - obs                           0.6338 =  0.634      1
H0 - obs                         21.868                4

6) M ⊇ H23 ⊇ H3 ⊇ H0, i.e. β1 = 0, β2 = 0, β3 = 0:

Variation             SS                              d.o.f.
H0 - H3   (β3 = 0)    21.8680 -  6.38   = 15.488      1
H3 - H23  (β2 = 0)     6.38   -  4.2855 =  2.095      1
H23 - M   (β1 = 0)     4.2855 -  0.6338 =  3.652      1
M - obs                           0.6338 =  0.634      1
H0 - obs                         21.868                4
E. Variance-covariance and correlation matrix for the data.

Variance-covariance matrix = (1/(5-1)) ×

       x1      x2      x3      Y
x1  [  2       1       0       1.50   ]
x2  [  1       2.8     0.4    -1.52   ]
x3  [  0       0.4     3.2     7.04   ]
Y   [  1.50   -1.52    7.04   21.868  ]

(the matrix shown contains the sums of squares and products), and the

       x1      x2      x3      Y
x1  [  1       0.4225  0       0.2268 ]
x2  [  0.4225  1       0.1336 -0.1942 ]   = correlation matrix.
x3  [  0       0.1336  1       0.8416 ]
Y   [  0.2268 -0.1942  0.8416  1      ]
F. Partial correlations for given x3:

[ 1       0.4225   0.2268 ]   [ 0      ]
[ 0.4225  1       -0.1942 ] - [ 0.1336 ] [1]⁻¹ [ 0  0.1336  0.8416 ]
[ 0.2268 -0.1942   1      ]   [ 0.8416 ]

    [ 1       0.4225   0.2268 ]
 =  [ 0.4225  0.9822  -0.3066 ] ,
    [ 0.2268 -0.3066   0.2917 ]

i.e. the partial correlation matrix is

[ 1       0.4263   0.4199 ]   x1
[ 0.4263  1       -0.5728 ]   x2
[ 0.4199 -0.5728   1      ]   Y

The partial correlations of x1 and Y for given (x2, x3) are first calculated using the above mentioned partial correlation matrix:

[ 1       0.4199 ]   [  0.4263 ]
[ 0.4199  1      ] - [ -0.5728 ] [1]⁻¹ [ 0.4263  -0.5728 ]

 =  [ 0.8183  0.6641 ] ,
    [ 0.6641  0.6718 ]

which results in the following correlation matrix

[ 1       0.8957 ]   x1
[ 0.8957  1      ]   Y

As a check we could compute it from the original covariance matrix:

[ 2     1.50   ]   [  1      0    ] [ 2.8  0.4 ]⁻¹ [ 1  -1.52 ]
[ 1.50  21.868 ] - [ -1.52   7.04 ] [ 0.4  3.2 ]   [ 0   7.04 ]

 =  [ 2     1.50   ]   [  1      0    ] [  0.3636  -0.0455 ] [ 1  -1.52 ]
    [ 1.50  21.868 ] - [ -1.52   7.04 ] [ -0.0455   0.3182 ] [ 0   7.04 ]

 =  [ 1.6363  2.3727 ] ,
    [ 2.3727  4.2855 ]

and the partial correlation matrix is then

[ 1       0.8960 ]
[ 0.8960  1      ]
The deviations in the elements off the diagonal are a result of truncation errors.
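The conditional-covariance computation above is mechanical and easy to check. A minimal numpy sketch, using the sums-of-squares-and-products matrix from part E (the constant factor 1/(5-1) cancels in the correlations; note that the x2,Y entry must be taken negative for the partial correlations above to be reproduced):

```python
import numpy as np

# Sums-of-products matrix from part E; variable order x1, x2, x3, Y.
S = np.array([[2.00,  1.00, 0.00,  1.50],
              [1.00,  2.80, 0.40, -1.52],
              [0.00,  0.40, 3.20,  7.04],
              [1.50, -1.52, 7.04, 21.868]])

a, b = [0, 3], [1, 2]   # (x1, Y) conditioned on (x2, x3)
# Conditional (residual) matrix: S_aa - S_ab S_bb^{-1} S_ba
C = S[np.ix_(a, a)] - S[np.ix_(a, b)] @ np.linalg.solve(S[np.ix_(b, b)], S[np.ix_(b, a)])
r = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])   # partial correlation r(x1, Y | x2, x3)
print(np.round(C, 4), round(r, 4))   # C ≈ [[1.6364, 2.3727], [2.3727, 4.2855]], r ≈ 0.896
```

This reproduces both the conditional matrix and the partial correlation 0.896 found above.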
4.4 Other regression models and solutions
In this section we shall look at an alternative criterion for estimating a (linear) function of some independent variables. Furthermore, we shall consider a regularized solution to the normal equations in the case where we have strong multicollinearity between the independent variables, the so-called ridge regression.
4.4.1 Orthogonal regression (linear functional relationship).
In the ordinary least squares estimation of a regression surface we minimise the sum
of squares of the vertical distances between the regression surface and the observed
points.
Often we will be in the situation where it would be more reasonable to minimise the orthogonal distances, and then we talk about orthogonal regression (not to be confused with regression by orthogonal polynomials).

Let us assume that the variables μ1, ..., μp satisfy the linear relationship

α0 + α1 μ1 + ... + αp μp = 0.   (4.6)
Figure 4.9: In ordinary one-dimensional regression, a and b in y = ax + b are determined by minimizing Σ e_i² (the vertical distances). In orthogonal regression, a and b are determined by minimizing Σ d_i² (the orthogonal distances).
i.e. the variables lie in a hyperplane with the above mentioned equation. We are interested in determining this plane, i.e. in determining α0, ..., αp. Assume that it is not possible to observe the values μ1, ..., μp, but that we can only measure

X_ji = μ_ji + Z_ji,   j = 1, ..., p,  i = 1, ..., n,

where the Z_ji's are random variables with mean value 0 and where μ_i1, ..., μ_ip, i = 1, ..., n, satisfy (4.6).

Estimation of the parameters α_i on the basis of such a set of observations is often called estimation of a linear functional relationship in the literature.

Here it would intuitively be reasonable to use exactly the hyperplane which is found by minimising the orthogonal distances down to it. If the Z_ij's are normally distributed with the same variance, it can be shown (see e.g. [13] p. 392) that this plane gives the maximum likelihood estimator of the α's. We formulate the solution to the problem in
THEOREM 4.3. We consider n points x1, ..., xn ∈ R^p and the hyperplane

{ x | α0 + α1 x1 + ... + αp xp = 0 }

which minimises the sum of squares of the orthogonal distances from the points. Then α1, ..., αp are the coordinates of a normed eigenvector corresponding to the smallest eigenvalue of the empirical variance-covariance matrix for the n x-points. The last coefficient is given by

α0 = -(α1 x̄1 + ... + αp x̄p).
PROOF. We write the observations as x_i = (x_i1, ..., x_ip)'. The distance from a point with the coordinates x = (x1, ..., xp)' to the hyperplane is

|α0 + α1 x1 + ... + αp xp| / sqrt(α1² + ... + αp²).

Therefore we must determine α0, ..., αp so that

Σ_i (α0 + α1 x_i1 + ... + αp x_ip)² / (α1² + ... + αp²)

is minimised. If we introduce a zero'th coordinate by x_i0 = 1, i = 1, ..., n, we could write each numerator term as (Σ_{j=0}^p α_j x_ij)². Solving this minimisation problem is equivalent to minimising

f(α0, ..., αp) = Σ_i (Σ_{j=0}^p α_j x_ij)²

subject to the constraint

α1² + ... + αp² = 1.

If we introduce a Lagrange multiplier λ, we see that we must determine the global minimum of

F = Σ_i (Σ_{j=0}^p α_j x_ij)² - λ(α1² + ... + αp² - 1).

The coordinates of the gradient vector are for ν = 1, ..., p

∂F/∂αν = 2 Σ_i (Σ_{j=0}^p α_j x_ij) x_iν - 2λαν,

and for ν = 0

∂F/∂α0 = 2 Σ_i Σ_{j=0}^p α_j x_ij.

Putting these partial derivatives = 0, the last equation becomes

Σ_{i=1}^n Σ_{j=0}^p α_j x_ij = 0,

or

α0 = -(α1 x̄1 + ... + αp x̄p),

where

x̄_j = (1/n) Σ_{i=1}^n x_ij.

If this is inserted in the first set of equations, they can be rewritten as

Σ_{i=1}^n Σ_{j=1}^p α_j (x_ij - x̄_j)(x_iν - x̄_ν) = λ αν,   ν = 1, ..., p.

If we denote the empirical variance-covariance matrix for the observations by Σ̂ and let α = (α1, ..., αp)', the equation system can be written as

Σ̂ α = (λ/(n-1)) α,

i.e. α is an eigenvector of Σ̂ corresponding to the eigenvalue λ/(n-1).

The question is now which of the p eigenvalues for Σ̂ should be chosen. After some manipulations with the original equations it follows that we must choose the smallest eigenvalue. This concludes the proof of the theorem.
REMARK 4.10. The result which has been stated in the theorem has a close connection
to the results which will be shown in a later chapter on principal components.
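Theorem 4.3 translates directly into an eigenvalue computation. A minimal numpy sketch with hypothetical points lying exactly on the line 1 + 2x1 - x2 = 0, so that the orthogonal residuals must vanish:

```python
import numpy as np

def orthogonal_regression(X):
    """Hyperplane alpha0 + alpha'x = 0 minimising orthogonal distances.

    Per theorem 4.3, alpha is a normed eigenvector of the empirical
    covariance matrix corresponding to its smallest eigenvalue."""
    S = np.cov(X, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(S)   # eigenvalues ascending
    alpha = eigenvectors[:, 0]                      # smallest eigenvalue
    alpha0 = -alpha @ X.mean(axis=0)
    return alpha0, alpha

X = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]])
alpha0, alpha = orthogonal_regression(X)
# All points lie on the fitted hyperplane, so the residuals are (numerically) zero.
print(alpha0 + X @ alpha)
```

Up to normalisation and sign, the fitted coefficients here are proportional to (1, 2, -1), i.e. the line the points were placed on.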
4.4.2 Regularization and Ridge Regression
In the analysis of regression models one will often have stability problems if the design matrix is ill-conditioned. This may be detected by suitable regression diagnostics. If we have a model with many independent variables, a solution to this problem may be various stepwise regression procedures. This will, however, not always work satisfactorily, so instead of excluding some variables and focusing completely on others, one might try to utilize the information available in all the independent variables in a different way.
We consider the usual model

Y = xβ + e,

where x is a known n × p matrix, β an unknown parameter vector and e the error vector. We assume that

E(e) = 0,   D(e) = σ² I_n.

The ordinary least squares estimator is then, assuming that x has maximum rank,

β̂ = (x'x)⁻¹ x'Y.

Furthermore we assume that the independent variables are scaled so that x'x has correlation form (i.e. the single independent variables are reduced by their average and divided by their standard deviation). This normalization will help make the estimates more stable numerically and is often recommendable in a practical situation.
If x'x in this form is close to a unit matrix, i.e. if the independent variables are nearly orthogonal, the least squares estimator is fine. If we have multicollinearity, i.e. if the independent variables are strongly correlated, the estimates β̂ will be unstable.
We now analyze some properties of β̂ that have not been dealt with earlier. We have

E(β̂) = β,   D(β̂) = σ² (x'x)⁻¹.

If we put L equal to the distance from β̂ to β, we get

E(L²) = Σ_{i=1}^p E[(β̂_i - β_i)²] = Σ_{i=1}^p V(β̂_i) = σ² tr((x'x)⁻¹).

Since E(β̂) = β, we get that the expected value of the squared length of β̂ is

E(β̂'β̂) = β'β + σ² tr((x'x)⁻¹).
If we denote the eigenvalues of x'x by λ1 ≥ ... ≥ λp, we obtain (according to theorem 1.12 and the results on p. 46)

E(L²) = σ² (1/λ1 + ... + 1/λp) ≥ σ²/λp,

and

E(β̂'β̂) = σ² (1/λ1 + ... + 1/λp) + β'β ≥ σ²/λp + β'β.

If the independent variables are strongly correlated, the eigenvalues of x'x will vary a lot, and consequently the smallest will be very small (λp ≪ 1). According to the relations above, the squared distance between β̂ and β will in this case have a large expected value, and the squared length of β̂ an expectation by far exceeding the squared length of β.

This tendency to 'inflation' of β̂ is caused by requiring unbiasedness of β̂. The question is therefore whether we, by relaxing this requirement, may obtain estimates that in some sense are closer to β. The problem is sketched in the figure below.
Figure 4.10: Unbiased: E(β̂) = β. Biased: E(β̃) ≠ β.

Here we may again refer to the mean squared error of an estimator β̃ (in the one-dimensional case)

MSE(β̃) = E[(β̃ - β)²] = V(β̃) + (E(β̃) - β)²,

i.e. the MSE of an estimator is equal to the variance plus the squared bias. If we therefore, by allowing a small bias, can obtain a great reduction in the variance, this would obviously be preferable. This is exactly what is obtained with the ridge estimator introduced in the definition below.
DEFINITION 4.2. A ridge estimator for β in the model

Y = xβ + e

is an estimator β̂*_k = β*, that is a solution to

(x'x + k·I) β* = x'Y,

i.e.

β* = (x'x + k·I)⁻¹ x'Y.

Here k is a constant ∈ [0, 1].

REMARK 4.11. In numerical mathematics such a way of solving the normal equations is called a (Tikhonov) regularization. This is a very common way of solving ill-posed problems.
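Definition 4.2 is a one-line computation. A sketch in numpy with hypothetical data; for k = 0 the ridge estimator reduces to the ordinary least squares estimator:

```python
import numpy as np

def ridge(x, y, k):
    """Ridge estimator (x'x + kI)^{-1} x'y; k = 0 gives least squares."""
    p = x.shape[1]
    return np.linalg.solve(x.T @ x + k * np.eye(p), x.T @ y)

rng = np.random.default_rng(0)
x = rng.standard_normal((20, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(20)

beta_ols = ridge(x, y, 0.0)
beta_ridge = ridge(x, y, 1.0)
# Shrinkage: the ridge estimate is shorter than the least squares estimate
# (the length of the ridge estimator is decreasing in k, cf. theorem 4.4).
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))   # -> True
```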
We now list some properties of β̂*_k. These properties are among other things used in determining k, a quantity not given beforehand.

We have

THEOREM 4.4. Let the situation be as in the definition. We put x'Y = g and denote the observed residual sum of squares for an arbitrary estimator β̃ by

H(β̃) = (Y - xβ̃)'(Y - xβ̃).

Then the gradient of H in β̃ = 0 is proportional to g and has the opposite direction of g. β̂*_k may be determined by the property that it, for a fixed length, minimizes H, i.e.

H(β̂*_k) = min { H(β̃) : ‖β̃‖ = ‖β̂*_k‖ }.

Furthermore, H(β̂*_k) is an increasing function of k, the length of β̂*_k is decreasing in k, and the angle γ between β̂*_k and g is decreasing in k.

PROOF. Not very complicated, but is omitted. The reader is referred to the literature, e.g. [18].
Figure 4.11:

The instructive figure 4.11 above is taken from [18]. It depicts the situation geometrically in p = 2. The point β̂ in the center of the ellipses is the least squares solution. The ellipses are level curves for H. The circle with center in origo is a tangent to the small ellipse. We see that β̂*_k is the shortest vector that has a residual sum of squares as small as the value of H on the small ellipse. Furthermore, we see that β̂*_k always lies between β̂ and g.
Other properties of the ridge estimator are given in the following theorem.

THEOREM 4.5. Let the situation be as described above. Then β̂*_k is a linear transformation of β̂ since

β̂*_k = Z_k β̂ = (x'x + kI)⁻¹ (x'x) β̂.

β̂*_k is biased since

E(β̂*_k) = Z_k β ≠ β   for k > 0.

The dispersion matrix of β̂*_k is

D(β̂*_k) = σ² [x'x + kI]⁻¹ (x'x) [x'x + kI]⁻¹,

and the expected squared distance to β is

E[(β̂*_k - β)'(β̂*_k - β)] = tr(D(β̂*_k)) + β'(Z_k - I)'(Z_k - I)β.

In the last expression the first term is equal to the (total) variance of β̂*_k and the last term is equal to the squared bias.

PROOF. Omitted. Follows by direct computation.
From the theorem follows an important corollary.

COROLLARY 4.1. If β'β is limited, there exists a k > 0 so that the expected squared distance between β̂*_k and β is strictly smaller than the expected squared distance between β̂ and β.

PROOF. This follows by noting that tr(D(β̂*_k)) is decreasing in k whereas β'(Z_k - I)'(Z_k - I)β is increasing in k. Since k → 0 ⇒ β̂*_k → β̂, the result follows immediately.
The only remaining problem is determining a reasonable k. Historically the so-called ridge trace has been used. There are other alternatives (see e.g. [25] for results on using cross-validation), but the ridge trace is a straightforward method.
DEFINITION 4.3. By the ridge trace we understand the plot of the individual coefficients in the ridge estimator as a function of k.

REMARK 4.12. The philosophy behind using the ridge trace in determining k is a sensitivity argument. From the ridge trace it is seen which coefficients are sensitive to variations in k. One then selects the smallest value of k giving a stable sequence of the coefficients.
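A ridge trace is simply the ridge estimate evaluated on a grid of k-values. A hypothetical numpy sketch; in practice one would plot the columns of `trace` against `ks` and pick the smallest k where the curves flatten out:

```python
import numpy as np

rng = np.random.default_rng(1)
# Strongly collinear design: the third column is nearly a copy of the first.
z = rng.standard_normal((30, 2))
x = np.column_stack([z[:, 0], z[:, 1], z[:, 0] + 0.01 * rng.standard_normal(30)])
x = (x - x.mean(0)) / x.std(0)        # centre and scale the columns
y = x @ np.array([1.0, 1.0, 1.0]) + 0.1 * rng.standard_normal(30)

ks = np.linspace(0.0, 1.0, 21)
trace = np.array([np.linalg.solve(x.T @ x + k * np.eye(3), x.T @ y) for k in ks])
# With k = 0 the near-collinearity makes the first and third coefficients
# poorly determined; increasing k shrinks and stabilises them.
print(trace[0].round(2), trace[5].round(2))
```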
We illustrate the principles in the next example.
EXAMPLE 4.2. ([18]). In the example we consider the relation between ASTM and gas chromatography distillation of gasoline. We shall not dwell on the differences, but only state that gas chromatography is far more precise in assessing the volatility than the ASTM standard used in 1975. It is therefore of interest to use gas chromatography in controlling the distillation, but at the same time being able to state what ASTM result this would correspond to. So we must predict an ASTM result based on gas chromatography measurements.

The ASTM method involves measuring the volatility as the fraction that has evaporated at different temperature levels. We shall only consider one single ASTM temperature, namely 158 F. The fraction evaporated at this temperature is called y.

We now want to predict y based on determinations by gas chromatography of fractions evaporated at 15 different temperature levels. These fractions are called x1, ..., x15. We apply the linear model

E(Y) = β1 x1 + ... + β15 x15.
The independent variables sum to one, and therefore we have not introduced an intercept.

The main use of the model should be predictions of ASTM distillations of gasoline, and it was requested that the prediction standard deviation should be smaller than 1.5%. Furthermore, the results should be used as input for determining optimal mixing procedures by means of linear programming. Earlier results using ordinary least squares and stepwise regression had given coefficients that were unacceptable from a physical/chemical point of view.
There were 59 observations of the 16 variables. These were split in two parts: one consisting of 29 observations used for estimation, and one consisting of 30 used for determining the prediction error. In the next figure we have shown the ridge trace. It is seen that the system stabilizes for k-values around 0.005-0.01. In the case presented, k = 0.006 was chosen.
Figure 4.12: Ridge trace for the model in example 4.2. The coefficients are shown on a relative scale, for k between 0 and 0.050.
Figure 4.13: Prediction standard deviation as a function of k.
In figure 4.13 we show the prediction standard deviation found for the different values of k. Note that the minimum occurs for k = 0.006, the value obtained by analysing the ridge trace. The least squares solution has a prediction standard deviation of 1.28 and the ridge model one of 1.01. In the following figure 4.14 the coefficients are compared with coefficients resulting from theoretical considerations. These are of course not necessarily the 'truth', but one would of course expect a considerable similarity.
Figure 4.14: Comparison of coefficients (least squares, ridge, and theoretical).
It is seen that the ridge estimators show a more steady progression with increasing gas chromatography temperatures. Furthermore, they are closer to the theoretical values.
4.4.3 Nonlinear regression and curve fitting.
Often we will have to analyse regression situations which give rise to non-linear normal equations or likelihood equations.

We could of course then use a general programme for maximising non-linear functions, or an iterative method to solve the non-linear equations. The variance-covariance matrix for the estimates can then be estimated using the inverse information matrix. We shall not pursue this further in the present context.
We give a few examples.
EXAMPLE 4.3. We consider some data which concern conservation of iron items from the Iron Age. The data are from Eva Salomonsen (1977) [23]. At the National Museum's conservation laboratory they have for 63 years used Rosenberg's annealing method to remove chlorides from the iron. In order to investigate the effectiveness of this method, 295 iron items conserved in the years 1913-1974 have been investigated, and the number of defect items, i.e. items where a continuing disintegration has been found, is summed up for each year. The numbers are shown below.
Period     Number         Number of    Defects in %
           investigated   defects      of investigated
1913 52 14 26.9
192124 34 11 32.4
193334 53 10 18.9
194043 47 13 27.7
195354 56 4 7.1
196169 46 4 8.7
197274 7 0 0
Total 295 56 19.0
Table 4.2: Number of defective annealed iron items in comparison to the total number investigated for each specific period.

The defect percentage is seen to grow with time, and this growth is what we want to model. A reasonable model would be to let
x_i = number of defects for age t_i,
n_i = number investigated for age t_i,
p_i = the probability of an item with age t_i being defect,

and then claiming that

X_i ∈ B(n_i, p_i),

i.e. binomially distributed with parameters (n_i, p_i).
As age we of course choose the time which has elapsed since the annealing treatment. For the periods which cover several years, the annealing time has been set to the middle of the considered time interval.

The remaining problem is to find the dependence of the defect probability p_i on time. Here a very often used model is the logistic curve:

p_i = p(t_i) = 1 / (1 + exp(-α - β t_i)).
The curve has asymptotes in p = 0 and p = 1 and is growing, so it of course satisfies the basic requirements we might have. If we define the so-called logit

logit p_i = ln(p_i / (1 - p_i)),

we find that

logit p_i = α + β t_i,

i.e. the model is linear in these logits. The model has been used quite a lot in connection with bioassays.

The likelihood function is

L(α, β) = Π_i C(n_i, x_i) p_i^{x_i} (1 - p_i)^{n_i - x_i},
Figure 4.15: Defect proportion for annealed iron items.
and then

ln L(α, β) = C - Σ x_i ln(1 + exp(-α - β t_i)) - Σ (n_i - x_i)(α + β t_i) - Σ (n_i - x_i) ln(1 + exp(-α - β t_i))
           = C - Σ n_i ln(1 + exp(-α - β t_i)) - Σ (n_i - x_i)(α + β t_i).

We can now either differentiate this expression with respect to α and β and set the differential coefficients equal to 0, or we can maximise ln L(α, β) by means of a general minimising program. By doing the first, the following estimates are found:

α̂ = -2.99984
β̂ = 0.03813.
The resulting logistic curve has been drawn in figure 4.15. Furthermore, 95% confidence intervals around the single observations have been drawn.

A fair agreement is found, but we should, however, be careful about extrapolating hundreds of years into the future, even though the figure might tempt us to.
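Maximising ln L(α, β) by setting the differential coefficients to 0 can be sketched with a few Newton-Raphson steps on the score function. A minimal numpy sketch; the counts below are hypothetical, not the iron data:

```python
import numpy as np

t = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # ages (hypothetical)
n = np.array([50, 50, 50, 50, 50])             # number investigated
x = np.array([5, 8, 15, 20, 30])               # number of defects

theta = np.zeros(2)                            # (alpha, beta)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-theta[0] - theta[1] * t))
    score = np.array([np.sum(x - n * p), np.sum(t * (x - n * p))])
    w = n * p * (1.0 - p)                      # binomial weights
    info = np.array([[np.sum(w), np.sum(w * t)],
                     [np.sum(w * t), np.sum(w * t * t)]])   # Fisher information
    theta = theta + np.linalg.solve(info, score)

alpha, beta = theta
print(round(alpha, 3), round(beta, 4))   # beta > 0: defect probability grows with age
```

At convergence the score (the gradient of ln L) is numerically zero, which is exactly the condition "differential coefficients equal to 0" used in the example.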
Often we are in the situation where we wish to fit given data with a suitable smooth curve, but where it does not seem possible, or is very difficult, to determine a universal law which covers all of the considered area. We could then think of performing a piecewise approximation with different functions. Here a very appropriate device is the spline function. These are introduced in
DEFINITION 4.4. Let there be given an interval [a, b] and observations or points x1, ..., xn which all lie in the interval. Furthermore, let there be a value y_i for each x_i. As a spline of order 2m-1 with knots x1, ..., xn we will understand the function φ which satisfies

1) φ is a polynomial of degree 2m-1 in [x_i, x_{i+1}],
2) φ is a polynomial of degree m-1 in [a, x1] and [xn, b],
3) φ, φ', ..., φ^(2m-2) are continuous in x1, ..., xn.
REMARK 4.13. A spline of order 2m-1 is put together smoothly of polynomials of degree 2m-1. It can be shown that the obtained curve very much resembles the one we would obtain by nailing two nails into each knot and then forcing a very elastic steel ruler through these (a so-called drawing spline). We will not pursue this further but just refer the reader to the literature on the area.
EXAMPLE 4.4. In figure 4.16 the level for a number of points on a line of length 5 km in Dyrehaven is shown. The data are made available by Poul Frederiksen, Department of Surveying, DTU. In different projects it is interesting to compute an expression for the variation around a suitably chosen smooth trend curve. It is therefore obvious to use a cubic spline function. This is done by means of the Harwell programme VB05B which, on the basis of observations (x1, y1), ..., (xm, ym), minimises

F = Σ_{i=1}^m w_i {y_i - S(x_i)}²,

where the w_i's are user-specified weights and S is a cubic spline function with knots t_j, j = 1, ..., n, whose abscissae are specified by the user.

The resulting spline and its knots are also given. We note the very nice fit between the very irregular observations and the spline function.
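The criterion F can be minimised with any least squares routine once a spline basis is chosen. A generic numpy stand-in for a routine like VB05B, using a truncated-power cubic basis with user-chosen knots (this is a sketch, not the Harwell routine itself):

```python
import numpy as np

def cubic_spline_lsq(x, y, knots, w=None):
    """Weighted least squares fit of a cubic spline with fixed knots.

    Basis: 1, x, x^2, x^3 and (x - t_j)_+^3 for each knot t_j."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = np.ones_like(x) if w is None else np.asarray(w, float)
    B = np.column_stack([x**i for i in range(4)] +
                        [np.clip(x - t, 0.0, None)**3 for t in knots])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(B * sw[:, None], y * sw, rcond=None)
    return lambda xx: np.column_stack(
        [np.asarray(xx, float)**i for i in range(4)] +
        [np.clip(np.asarray(xx, float) - t, 0.0, None)**3 for t in knots]) @ coef

x = np.linspace(0.0, 5.0, 40)
y = x**3 - 2.0 * x + 1.0                 # a cubic: lies exactly in the basis span
s = cubic_spline_lsq(x, y, knots=[1.0, 2.0, 3.0, 4.0])
print(bool(np.allclose(s(x), y)))        # -> True
```

Because the test signal is itself a cubic, the fit is exact; with irregular data, as in the example, the spline instead gives a weighted least squares smooth through the points.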
Figure 4.16: Levels and approximating cubic splines.
Chapter 5
Tests in the multidimensional normal distribution
In this chapter we will give a number of generalisations to some of the well known
test statistics based on one dimensional normally distributed random variables. In most
cases the test statistics will be analogues to the well known ones, except for multiplication being substituted with matrix multiplication, numerical values by the determinant
of the matrix etc.
5.1 Test for mean value.
5.1.1 Hotelling's T² in the one-sample situation.
In this section we will consider independent random variables Xl, ... , X n. where
X_i ∈ Np(μ, Σ),

i.e. p-dimensionally normally distributed with mean vector μ and variance-covariance matrix Σ. We assume that Σ is regular and unknown. We want to test a hypothesis about the mean vector μ being equal to a given vector μ0 against all alternatives, i.e.

H0: μ = μ0   against   H1: μ ≠ μ0.
We first repeat some results on the estimators. From theorem 2.29 p. 104 we have the following results on the mean vector X̄ and the empirical variance-covariance matrix S:

X̄ = (1/n) Σ_{i=1}^n X_i ∈ Np(μ, (1/n)Σ),

S = (1/(n-1)) Σ_{i=1}^n (X_i - X̄)(X_i - X̄)' ∈ W(n-1, (1/(n-1))Σ),

and X̄ and S are stochastically independent.
In the following we will furthermore need the following results on the distribution of certain functions of normally distributed and Wishart distributed stochastic variables.

LEMMA 5.1. Let Y be a p-dimensional stochastic variable and let U be a p × p stochastic matrix with

Y ∈ Np(μ, Σ),
mU ∈ W(m, Σ);

furthermore, let Y and U be stochastically independent. We now let

T² = Y' U⁻¹ Y.

Then the following holds:

((m - p + 1)/(m p)) T² ∈ F(p, m - p + 1; μ'Σ⁻¹μ),

i.e. the left hand side is non-centrally F-distributed with non-centrality parameter μ'Σ⁻¹μ and degrees of freedom equal to p and m - p + 1. If μ = 0, then the non-centrality parameter is 0, i.e. we then have the special case

((m - p + 1)/(m p)) T² ∈ F(p, m - p + 1).

PROOF. Omitted. See e.g. [2], p. 106.
We now have the following main result.

THEOREM 5.1. We will use the notation

T² = n (X̄ - μ0)' S⁻¹ (X̄ - μ0),

where X̄, μ0 and S are as stated in the introduction to this section. Then the critical region for a ratio test of H0 against H1 at level α is

C = { (x1, ..., xn) | ((n - p)/((n - 1)p)) t² ≥ F(p, n - p)_{1-α} },

where t² is the observed value of T².

PROOF. From lemma 5.1 we find that

((n - p)/((n - 1)p)) T² ∈ F(p, n - p)

under H0. From this follows that C is the critical region for a test of H0 versus H1 at level α. That this corresponds to a ratio test follows from direct computation by using theorem 1.2 among other things.
REMARK 5.1. The quantity T² is often called Hotelling's T² after Harold Hotelling, who first considered this test statistic.

REMARK 5.2. In the one-dimensional case we use the test statistic

Z = sqrt(n) (X̄ - μ0) / S.

We now have that Z² can be written

Z² = n (X̄ - μ0) (S²)⁻¹ (X̄ - μ0),

i.e. precisely what T² reduces to in the one-dimensional case. Furthermore, note that the square of a Student t-distributed variable t(ν) is F(1, ν)-distributed, which means that there (of course) also is a relation between the distributions of the two test statistics.
In order to compute the test statistic it is useful to remember the following theorem, where it is seen that the inversion of a matrix can be substituted by the calculation of some determinants.

THEOREM 5.2. Let the notation be as above. Then the following holds:

t² = det(s + n(x̄ - μ0)(x̄ - μ0)') / det(s) - 1.

PROOF. Omitted. Purely technical; it follows by using theorem 1.2 p. 17 on the matrix s + n(x̄ - μ0)(x̄ - μ0)'.
We now give an illustrative

EXAMPLE 5.1. In the following table the values of silicium and aluminium (in %) in 7 samples collected on the moon are given.

Sample       1     2     3     4     5     6     7
Silicium    19.4  21.5  19.2  18.4  20.6  19.8  18.7
Aluminium    5.9   4.0   4.0   5.4   6.2   5.7   6.0

We are now very interested in testing if these samples can be assumed to come from a population with the same mean values as basalt from our own planet earth. These are

μ0 = (22.10, 7.40)'.

It seems sensible to use Hotelling's T² to help answer the above question. If we call the observations X1, ..., X7, we find

x̄ = (19.66, 5.31)',

s = [  1.1795  -0.3076 ]
    [ -0.3076   0.8681 ].

Since

x̄ - μ0 = (-2.44, -2.09)',

then

n(x̄ - μ0)(x̄ - μ0)' = [ 41.68  35.70 ]
                      [ 35.70  30.58 ]

and

s + n(x̄ - μ0)(x̄ - μ0)' = [ 42.86  35.39 ]
                          [ 35.39  31.45 ].

Then

t² = 95.49/0.9293 - 1 = 101.75.

The F test statistic is

((7 - 2)/((7 - 1)·2)) t² = 42.4 > F(2, 5)_{0.999} = 37.1,

and the hypothesis is therefore rejected at least at all levels α larger than 0.1%. It therefore does not seem reasonable to assume that the 7 moon samples originate from a population with the same mean values of silicium and aluminium as basalt from our planet earth.
From the result of theorem 5.1 we can easily construct a confidence region for μ. We have, with the usual notation,

THEOREM 5.3. A (1 - α) confidence region for the expectation E(X) = μ is

{ μ | n(x̄ - μ)' s⁻¹ (x̄ - μ) ≤ ((n - 1)p/(n - p)) F(p, n - p)_{1-α} },

i.e. an ellipsoid with centre in x̄ and main axes determined by the eigenvectors of the inverse empirical variance-covariance matrix.

PROOF. Trivial from the definition of a confidence region and theorem 5.1.
We now continue example 5.1 in the following

EXAMPLE 5.2. We will now determine a confidence region for the mean vector. According to theorem 5.3 the confidence region is bordered by the ellipse

(19.66 - μ1, 5.31 - μ2) s⁻¹ (19.66 - μ1, 5.31 - μ2)' = 1.9851.

We find

s⁻¹ = [ 0.9341  0.3310 ]
      [ 0.3310  1.2692 ]

with the eigenvalues 1.4727 and 0.7307 and the corresponding (normed) eigenvectors

( 0.5236 )         (  0.8520 )
( 0.8520 )   and   ( -0.5236 ).

In the coordinate system with origin in x̄ and the above mentioned vectors as unit vectors, the ellipse has the equation

1.4727 y1² + 0.7307 y2² = 1.9851,

or

y1²/1.1610² + y2²/1.6482² = 1.

In figure 5.1 the confidence region and the observations are shown. Furthermore, μ0 = (22.10, 7.40)' is given. It is seen that this observation lies outside the confidence region, corresponding to the hypothesis μ = μ0 against μ ≠ μ0 being rejected at all levels greater than 0.1%, and therefore especially for α = 5%.
5.1.2 Hotelling's T² in the two-sample situation.

Quite analogously to the t-test in the one-dimensional case, Hotelling's T² can be used to investigate if samples from two normal distributions (with the same variance-covariance structure) can be assumed to have the same expected values. We consider independent stochastic variables X1, ..., Xn and Y1, ..., Ym, where

X_i ∈ Np(μ, Σ),
Y_i ∈ Np(ν, Σ),

and we wish to test

H0: μ = ν   against   H1: μ ≠ ν.
Figure 5.1: Observations and confidence region.
We use the notation

X̄ = (1/n) Σ_{i=1}^n X_i,   Ȳ = (1/m) Σ_{i=1}^m Y_i,

S1 = (1/(n-1)) Σ_{i=1}^n (X_i - X̄)(X_i - X̄)',

S2 = (1/(m-1)) Σ_{i=1}^m (Y_i - Ȳ)(Y_i - Ȳ)',

S = ((n-1)S1 + (m-1)S2) / (n + m - 2).
From theorem 2.29 and theorem 2.28 we have

X̄ - Ȳ ∈ Np(μ - ν, (1/n + 1/m)Σ),

(n + m - 2)S ∈ W(n + m - 2, Σ),

and X̄ - Ȳ and S are stochastically independent. We define

T² = (nm/(n + m)) (X̄ - Ȳ)' S⁻¹ (X̄ - Ȳ).

THEOREM 5.4. Then the critical region for a test of H0 against H1 at level α is equal to

C = { (x, y) | ((n + m - p - 1)/((n + m - 2)p)) t² ≥ F(p, n + m - p - 1)_{1-α} }.

Here t² is the observed value of T².

PROOF. From lemma 5.1 and from the above mentioned relationships we find that

((n + m - p - 1)/((n + m - 2)p)) T² ∈ F(p, n + m - p - 1; (nm/(n + m))(μ - ν)'Σ⁻¹(μ - ν)),

and the result follows readily.
Analogous to the one-sample situation, we can use the results to determine a confidence region for the difference between the mean vectors. We have

THEOREM 5.5. We still consider the above mentioned situation and let μ - ν = δ0. Then a (1 - α) confidence region for δ0 is equal to

{ δ | (nm/(n + m))(x̄ - ȳ - δ)' s⁻¹ (x̄ - ȳ - δ) ≤ ((n + m - 2)p/(n + m - p - 1)) F(p, n + m - p - 1)_{1-α} }.

PROOF. Follows directly from the definition of a confidence region and from theorem 5.4.

REMARK 5.3. The confidence region is an ellipsoid with centre in x̄ - ȳ and main axes determined by the eigenvectors of s⁻¹.

REMARK 5.4. As mentioned, the test results and confidence intervals require that the variance-covariance matrices for the X and for the Y observations are equal. If this is not the case, the above mentioned results are not exact and a different procedure should be used. We will not consider this here but refer to [2], p. 118.
We will now consider an example of the use of T² in the two-sample situation.

EXAMPLE 5.3. At the Laboratory for Heating and Climate-technique, DTU, one has measured the following in an experiment:

i) the height in cm,

ii) the evaporation loss in g/m² skin during a 3 hour period,

iii) the mean temperature in °C. The mean temperature is found by measuring the skin temperature at 14 different locations every minute for 5 minutes (same locations every time). The mean temperature is then an average of all 14 × 5 = 70 measurements.

The measurements were made on 16 men and 16 women. The result of the experiment is given in table 5.1.

We consider these numbers as realisations of stochastic variables X1, ..., X16 and Y1, ..., Y16. We furthermore assume that the variables are stochastically independent and that

X_i ∈ N(μ, Σ)   and   Y_i ∈ N(ν, Σ),

i.e. the variance-covariance matrices are assumed equal. Later we will discuss whether this hypothesis is reasonable or not.

The estimates for μ and ν are the empirical mean vectors, i.e.

μ̂ = x̄ = (179.7, 24.5, 33.6)'

and

ν̂ = ȳ = (166.1, 20.5, 33.4)'.

We will now check if the difference between μ̂ and ν̂ is significant, i.e. whether μ and ν can be assumed equal.
Person No.   Height   Evaporation loss   Mean temperature
             in cm    in g/m² skin       in °C

Men
 1           177      18.1               -
 2           189      18.8               33.9
 3           181      20.4               33.2
 4           184      19.5               33.9
 5           183      30.5               33.8
 6           178      22.2               33.3
 7           162      19.4               33.6
 8           176      26.7               39.2
 9           190      16.6               33.2
10           180      45.4               33.2
11           179      24.0               33.5
12           175      34.6               33.9
13           183      21.3               33.8
14           177      33.3               33.5
15           185      22.9               33.9
16           176      18.6               33.8

Women
 1           160      14.6               33.5
 2           171      27.0               32.9
 3           168      27.6               33.5
 4           171      20.2               32.3
 5           169      30.8               33.1
 6           169      17.4               33.4
 7           167      21.1               33.5
 8           170      19.3               33.0
 9           162      21.5               34.1
10           160      15.2               33.8
11           168      15.4               33.0
12           157      25.2               33.7
13           161      13.9               33.9
14           164      20.2               34.8
15           161      25.3               31.9
16           180      12.6               33.5

Table 5.1: Data from indoor-climate experiments, Laboratory for Heating and Climate-technique, DTU. (One mean-temperature entry, marked -, is illegible in this copy.)
With the notation from the preceding theorem we find

x̄ − ȳ = (13.6, 4.0, 0.2)'

and

t² = (16 · 16)/(16 + 16) (x̄ − ȳ)'S⁻¹(x̄ − ȳ) = 52.4.
The test statistic then becomes

(16 + 16 − 3 − 1)/((16 + 16 − 2) · 3) · 52.4 = 16.3.

Since

F(3,28)_0.999 = 7.19,
a hypothesis that μ = ν will at least be rejected at all levels greater than 0.1%. We will therefore conclude that there is a fairly large (simultaneous) difference in the three variables for men and for women, a result which probably will not shock anyone when it is remembered that the first variable gives the height.
If we instead only consider the second and third coordinates, i.e. the values for evaporation loss and mean temperature, we get the test statistic

(16 · 16)/(16 + 16) · (16 + 16 − 2 − 1)/((16 + 16 − 2) · 2) · (4.0  0.2) [ 45.5  0.3 ; 0.3  0.3 ]⁻¹ (4.0  0.2)'.

This quantity is to be compared with the quantiles of an F(2,29) distribution, and it is readily seen that a hypothesis that the mean vectors are equal can be accepted at all reasonable levels.
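The two-sample computation used in the example above can be sketched in code. This is a hedged illustration (not the book's software): it assumes equal variance-covariance matrices in the two groups, pools them as above, and converts T² to an F statistic with the same factor.

```python
import numpy as np

def two_sample_t2(x, y):
    """Hotelling's two-sample T^2 assuming a common variance-covariance
    matrix; returns (T2, F, df1, df2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, p = x.shape
    n2 = y.shape[0]
    d = x.mean(axis=0) - y.mean(axis=0)
    # pooled estimate S of the common variance-covariance matrix
    S = ((n1 - 1) * np.cov(x, rowvar=False) +
         (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    S = np.atleast_2d(S)
    T2 = n1 * n2 / (n1 + n2) * d @ np.linalg.solve(S, d)
    # conversion to an exactly F-distributed statistic
    F = (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2
    return T2, F, p, n1 + n2 - p - 1
```

With n1 = n2 = 16 and p = 3 the conversion factor becomes 28/90, which reproduces the value 16.3 computed above from t² = 52.4.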
5.2 The multidimensional general linear model.
In the previous section we have looked at the one- and two-sample situations for the multidimensional normal distribution. We have seen that the multidimensional results are quite analogous to the one-dimensional ones. In this section and in the following we will continue this and present generalisations of regression and analysis of variance to multidimensional variables.
We consider independently distributed p-dimensional variables Y_1, ..., Y_n with mean vectors μ_i and common variance-covariance matrix Σ. The variance-covariance matrix Σ (and the mean vectors μ_i) are assumed unknown. We arrange the observations in an n × p data matrix

Y = [ Y_1' ; ... ; Y_n' ] = [ Y_11 ... Y_1p ; ... ; Y_n1 ... Y_np ].

Here the single rows represent e.g. repetitions of measurements of a p-dimensional phenomenon. In full analogy with the model which we considered in the univariate general linear model we will assume that the mean parameters μ_i can be written as known linear functions of other (and fewer) unknown parameters θ, i.e.

E(Y) = xθ,

where x is a known n × k matrix and θ an unknown k × p parameter matrix.
It is seen that we assume x known and θ unknown. This model can be viewed from different angles. If we let the j'th column in the Y matrix equal Y_(j), then we can write

E(Y_(j)) = x θ_(j),

where θ_(j) is the j'th column of θ. The n measurements on the j'th "property" (attribute/variable) will therefore follow an ordinary one-dimensional general linear model.
If we instead consider the mean value of a single observation Y_i we find

E(Y_i') = x_i'θ,

where x_i' is the i'th row in the x matrix. If the observations are rearranged into a column vector

vec(Y) = [ Y_(1) ; ... ; Y_(p) ],

we find from theorem 2.7, p. 63, that

D(vec(Y)) = Σ ⊗ I_n,

where Σ ⊗ I_n is the tensor product of Σ and I_n, cf. section 1.5.
The first problem is to estimate θ. We have

THEOREM 5.6. We consider the above mentioned situation. If the observations Y_i are normally distributed the maximum likelihood estimate of θ is given by

θ̂ = (x'x)⁻¹x'Y.
PROOF. Omitted. See e.g. [2].
REMARK 5.5. We see that

θ̂_(j) = (x'x)⁻¹x'Y_(j),

i.e. the estimate for the j'th column of θ is simply equal to the result we get by only considering the one-dimensional general linear model for the j'th "property".
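A minimal numerical sketch of theorem 5.6 and remark 5.5 on synthetic data (the design matrix and responses below are invented for illustration): the ML estimate θ̂ = (x'x)⁻¹x'Y is computed once for the whole Y matrix, and its j'th column coincides with the univariate least-squares fit of the j'th response column.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, p = 20, 3, 2              # observations, regressors, responses
x = rng.normal(size=(n, k))     # known design matrix
Y = rng.normal(size=(n, p))     # n x p data matrix

# theorem 5.6: maximum likelihood estimate of the k x p parameter matrix
theta = np.linalg.solve(x.T @ x, x.T @ Y)

# remark 5.5: column j of theta equals the univariate OLS fit of Y[:, j] on x
theta_col0 = np.linalg.lstsq(x, Y[:, 0], rcond=None)[0]
print(np.allclose(theta[:, 0], theta_col0))  # True
```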
REMARK 5.6. If the observations are not normally distributed one will still be able to use the estimate θ̂, since it, just like in the one-dimensional case, has a Gauss-Markov property. We will not go into detail with this but just mention a couple of results. The least squares property is that for an arbitrary θ the matrix

(Y − xθ)'(Y − xθ) − (Y − xθ̂)'(Y − xθ̂)

is positive semidefinite. From this it follows that

ch_i[(Y − xθ̂)'(Y − xθ̂)] ≤ ch_i[(Y − xθ)'(Y − xθ)],

where ch_i corresponds to the i'th largest eigenvalue. From this it follows again that θ̂ minimises

det[(Y − xθ)'(Y − xθ)]

and

tr[(Y − xθ)'(Y − xθ)].
REMARK 5.7. Above we have silently assumed that x'x has full rank, i.e. rk(x) = k < n. If this is not the case one can, by analogy to the one-dimensional (univariate) results, find solutions by means of pseudoinverse matrices.
After these considerations on the estimation of θ we turn to the estimation of Σ.

THEOREM 5.7. We consider the situation from theorem 5.6. Then the maximum likelihood estimate for Σ equals

Σ* = (1/n) Σ_{i=1}^n (Y_i − θ̂'x_i)(Y_i − θ̂'x_i)'
   = (1/n) (Y − xθ̂)'(Y − xθ̂)
   = (1/n) [Y'Y − (xθ̂)'(xθ̂)].

The (i, j)'th element can also be written

σ*_ij = (1/n) (Y_(i) − xθ̂_(i))'(Y_(j) − xθ̂_(j)).

PROOF. The many identities between the elements are found by simple matrix manipulations. For the results we refer to [2].
The distributions of the estimators mentioned are given in

THEOREM 5.8. We consider the situation from theorems 5.6 and 5.7 and we introduce the usual notations. Then we have that θ̂_(j) is normally distributed,

θ̂_(j) ∈ N(θ_(j), σ_jj (x'x)⁻¹),

and nΣ* is Wishart distributed,

nΣ* ∈ W(n − k, Σ).

Finally Σ* and θ̂ (and therefore also Σ̂ and θ̂) are stochastically independent.
PROOF. It is trivial that

E(θ̂) = E[(x'x)⁻¹x'Y] = (x'x)⁻¹x'x θ = θ,

and from this it follows that E(θ̂) = θ. Furthermore θ̂ is of course normally distributed. Finally we have that

C(θ̂_(i), θ̂_(j)) = (x'x)⁻¹x' C(Y_(i), Y_(j)) x(x'x)⁻¹ = σ_ij (x'x)⁻¹.

From this the result concerning the variance-covariance matrix for θ̂ is readily seen. The results concerning the independence of θ̂ and Σ* and concerning the Wishart distribution are quite analogous to the one-dimensional results, but we will not look further into this.
From the theorem we readily find

COROLLARY 5.1. The unbiased estimate for Σ is equal to

Σ̂ = n/(n − k) Σ* = 1/(n − k) (Y − xθ̂)'(Y − xθ̂).

PROOF. Trivial when you remember that

E(W(k, A)) = kA.
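Continuing the synthetic-data sketch, the ML and unbiased estimates of Σ from theorem 5.7 and corollary 5.1 differ only in the divisor:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, p = 30, 4, 3
x = rng.normal(size=(n, k))
Y = rng.normal(size=(n, p))

theta = np.linalg.solve(x.T @ x, x.T @ Y)   # ML estimate of theta
resid = Y - x @ theta
R = resid.T @ resid                          # residual SSP matrix
Sigma_ml = R / n                             # theorem 5.7 (biased)
Sigma_unbiased = R / (n - k)                 # corollary 5.1

# the identity R = Y'Y - (x theta)'(x theta) from theorem 5.7
print(np.allclose(R, Y.T @ Y - (x @ theta).T @ (x @ theta)))  # True
```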
We now turn to testing the parameters in the model. We have

THEOREM 5.9. We consider the above mentioned situation including the assumption of the normality of the observations. Furthermore we consider the hypothesis

H0: AθB' = C against H1: AθB' ≠ C,

where A(r × k), B(s × p) and C(r × s) are given (known) matrices. We introduce

Δ = Aθ̂B' − C
R = nΣ* = (Y − xθ̂)'(Y − xθ̂) = Y'Y − θ̂'(x'x)θ̂

and

E = BRB'
H = Δ'[A(x'x)⁻¹A']⁻¹Δ.

Then the likelihood ratio test for testing H0 against H1 is equivalent to the test given by the critical region

{ y | det(E)/det(E + H) ≤ U(s, r, n − k)_α },

where U(s, r, n − k)_α is the α quantile in the null-hypothesis distribution of the test statistic (see below).
PROOF. Omitted. The essential part of the proof is that it can be shown that E and H are independent Wishart distributed variables if H0 is true; for more detail we refer to the literature. As is seen indirectly from the formulation of the theorem, the null-hypothesis distribution of

U = det(E)/det(E + H)

only depends on s, r and n − k. This quantity is termed in the literature Wilks' Λ or Anderson's U. Since the distribution depends on three parameters it is somewhat difficult to use in practise and we therefore give an approximation to an F-distribution in the following
THEOREM 5.10. Let U be U(p, q, r)-distributed and let

t = √[(p²q² − 4)/(p² + q² − 5)],   p² + q² ≠ 5
t = 1,                             p² + q² = 5

and

v = r − (p − q + 1)/2.

Then

F = (1 − U^(1/t))/U^(1/t) · (vt + 1 − ½pq)/(pq)

is approximately distributed as

F(pq, vt + 1 − ½pq).

If either p or q is equal to 1 or 2 then the approximation is exact.

PROOF. Omitted.
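Theorem 5.10 translates directly into a small function (a sketch; the parameter convention U(p, q, r) follows the theorem):

```python
import math

def u_to_f(u, p, q, r):
    """F approximation to a U(p, q, r)-distributed statistic (theorem 5.10).
    Returns (F, df1, df2); exact when p or q is 1 or 2."""
    if p * p + q * q == 5:
        t = 1.0
    else:
        t = math.sqrt((p * p * q * q - 4) / (p * p + q * q - 5))
    v = r - (p - q + 1) / 2
    df1 = p * q
    df2 = v * t + 1 - p * q / 2
    F = (1 - u ** (1 / t)) / u ** (1 / t) * df2 / df1
    return F, df1, df2
```

For p = 1 the formulas reduce to the exact relation F = (1 − U)/U · r/q with (q, r) degrees of freedom.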
REMARK 5.8. We see that the test statistic in Theorem 5.9 compares the "size" of the matrices E and E + H. We shall now consider the test statistic as a function of the eigenvalues of E⁻¹H and give three other functions of those eigenvalues that are also commonly used as test statistics for the hypothesis given in Theorem 5.9.

We let λ_1 ≥ ... ≥ λ_n be the ordered eigenvalues of E⁻¹H and let z_i be the corresponding eigenvectors, i.e.
E⁻¹H z_i = λ_i z_i.

The eigenvalues of (E + H)⁻¹E are then

1/(1 + λ_i),

and the eigenvalues of (E + H)⁻¹H are

λ_i/(1 + λ_i).

Thus, using the eigenvalues of the matrices, we see that Wilks' Lambda is equal to

Λ = det(E)/det(E + H) = ∏_{i=1}^n 1/(1 + λ_i),

and we introduce Pillai's Trace

V = tr[(E + H)⁻¹H] = Σ_{i=1}^n λ_i/(1 + λ_i),

Hotelling-Lawley's Trace

H = tr(E⁻¹H) = Σ_{i=1}^n λ_i,

and finally Roy's Maximum Root

R = λ_1.
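The four statistics just introduced can be computed from E and H via the eigenvalues of E⁻¹H; a sketch:

```python
import numpy as np

def manova_statistics(E, H):
    """Wilks' Lambda, Pillai's trace, Hotelling-Lawley's trace and Roy's
    maximum root from the error (E) and hypothesis (H) SSP matrices."""
    lam = np.sort(np.linalg.eigvals(np.linalg.solve(E, H)).real)[::-1]
    wilks = np.prod(1.0 / (1.0 + lam))
    pillai = np.sum(lam / (1.0 + lam))
    hotelling_lawley = np.sum(lam)
    roy = lam[0]
    return wilks, pillai, hotelling_lawley, roy
```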
Earlier we presented an F-approximation to Wilks' Lambda. There exist similar expressions for the other statistics, and all of them are computed in the multivariate procedures of SAS. SAS may also produce exact or near-exact p-values in the multivariate tests. We shall now illustrate the introduced concepts in the following example.
EXAMPLE 5.4. In the period 1968-69 the Royal Veterinary and Agricultural University's Experimental Farm for crop growing, Højbakkegård, conducted an experiment concerning the growth of lucerne. They investigated the offspring from 176 crossings.

In order to establish the "quality" of the single crossings, 9 properties were measured on each one. The 9 variables are given in the following table.

As mentioned, the first variables are graded on a numerical scale. This method is chosen since it is very difficult to measure the respective variables directly, and experience shows that it gives satisfactory results.
1: Type of growth
2: Regrowth after winter
3: Ability to creep (1 = no runners, 9 = most runners)
4: Activity (1 = weakest, 9 = strongest)
5: Time of blooming (1 = latest blooming, 9 = earliest blooming)
6: Plant height (cm)
7: Seed weight (g per plant)
8: Plant weight (g per plant, after drying)
9: Percent seed (%, calculated per plant by means of (7) and (8))
The following analyses are based on the average values for the 9 variables, based on numbers from between 15 and 20 plants (most of the results are based on 20 plants). In the following table a section of these numbers is shown.
(Table: a section of the average values of the 9 variables for crossings No. 1, 2, 3, ..., 176. The columns are: No. of crossing; type of growth; regrowth; ability to creep; activity; time of blooming; plant height; seed weight; plant weight; percent seed. E.g. crossing No. 176: 4.00, 4.40, 4.60, 7.40, 2.90, 50.00, 0.66, 153.50.)
The main goal of the experiment is to examine the variation among the 9 variables. More specifically one was e.g. interested in how variable 3 (ability to creep) and variable 4 (activity) vary with the others. The two variables mentioned are usually of importance for the development of a plant, and it is therefore of importance how they vary with the other variables.

As a first orientation we will compute the empirical correlation matrix. It is found to be
        1       2       3       4       5       6       7       8       9
 1   1.000   0.033   0.116   0.018   0.131  -0.207   0.035   0.087   0.041
 2   0.033   1.000   0.711   0.515   0.125   0.199   0.025   0.348   0.066
 3   0.116   0.711   1.000   0.440   0.022   0.039   0.133   0.218   0.157
 4   0.018   0.515   0.440   1.000   0.201   0.517   0.071   0.689   0.081
 5   0.131   0.125   0.022   0.201   1.000   0.496   0.487   0.168   0.486
 6  -0.207   0.199   0.039   0.517   0.496   1.000   0.453   0.559   0.367
 7   0.035   0.025   0.133   0.071   0.487   0.453   1.000   0.360   0.947
 8   0.087   0.348   0.218   0.689   0.168   0.559   0.360   1.000   0.128
 9   0.041   0.066   0.157   0.081   0.486   0.367   0.947   0.128   1.000
We note that variable 1 (type of growth) is only vaguely correlated with the other variables. On the other hand e.g. variables 2 and 3 (regrowth and ability to creep) and (of course) 7 and 9 (weight of seed and percentage of seed) are very strongly correlated.

As mentioned we are especially interested in variable 3's and variable 4's variation with the other variables. We note that there are a number of fairly large correlations, but it is difficult to get an impression solely based on these. We will therefore examine whether it is possible to express these two variables as linear functions of the others, i.e.

E(Y_1) = Σ_{i=1}^k θ_i1 X_i

E(Y_2) = Σ_{i=1}^k θ_i2 X_i,
where we now have used the variable notations

Y_1 = Ability to "creep"
Y_2 = Activity
X_1 = Type of growth
X_2 = Regrowth after winter
X_3 = Time of blooming
X_4 = Height of plant
X_5 = Weight of seed
X_6 = Weight of plant
X_7 = Percentage of seed
We are obviously talking about a multidimensional general linear model. If we let θ = (θ_ij) we find the estimate θ̂ as a 7 × 2 matrix (its individual entries are not reproduced here), and the unbiased estimate of Σ is

Σ̂ = [ 0.85897  0.07870
       0.07870  0.29444 ].
The matrix (x'x)⁻¹ is found to be

        1         2         3         4         5         6         7
 1   1.55920   0.16549   0.47258   0.05010   0.41826   0.00235   0.42289
 2   0.16549   0.85139   0.17981   0.01327   0.63774   0.01759   0.69467
 3   0.47258   0.17981   1.77862   0.10728   0.29340   0.01164   0.02184
 4   0.05010   0.01327   0.10728   0.02253   0.12325   0.00441   0.17012
 5   0.41826   0.63774   0.29340   0.12325   5.25546   0.08437   7.04885
 6   0.00235   0.01759   0.01164   0.00441   0.08437   0.00243   0.11182
 7   0.42289   0.69467   0.02184   0.17012   7.04885   0.11182  10.11541
From this we can easily compute the variances and covariances of the single θ̂ values, because we have

D(vec(θ̂)) = Σ ⊗ (x'x)⁻¹ = [ σ_11 (x'x)⁻¹   σ_12 (x'x)⁻¹
                             σ_21 (x'x)⁻¹   σ_22 (x'x)⁻¹ ],

and therefore e.g.

V̂(θ̂_42) = 0.2944 · 0.02253 = 0.0066.

These results can be used in the construction of ordinary t-tests for the single coefficients. We will, however, not do this. Instead we will give a couple of examples of how to construct simultaneous tests. First we consider the hypothesis

H0: θ_41 = θ_42 = 0
against all alternatives. This hypothesis must be brought into the form given in theorem 5.9. This can be done by choosing

A = (0 0 0 1 0 0 0),

B = [ 1 0
      0 1 ]

and

C = (0 0).

Then we will have

AθB' = (θ_41  θ_42).
By the use of a standard programme we get the F-test statistic

F = 53.66

with degrees of freedom

(f_1, f_2) = (2, 168).

The test statistic is in this case exactly F-distributed, since s = 2 and r = 1. It is seen that the observed value is significant at all reasonable levels.
As another example consider the hypothesis

H0: θ_41 = θ_42 = θ_51 = θ_52 = θ_61 = θ_62 = 0

against all alternatives. This hypothesis can be transformed into the form of theorem 5.9 by choosing

A = [ 0 0 0 1 0 0 0
      0 0 0 0 1 0 0
      0 0 0 0 0 1 0 ],

B = [ 1 0
      0 1 ]

and

C = [ 0 0
      0 0
      0 0 ],

since we then obtain

AθB' = [ θ_41  θ_42
         θ_51  θ_52
         θ_61  θ_62 ].
Again using a standard programme we find

F = 10.63;   (f_1, f_2) = (6, 336).

Once again we have a clear significance.

As a last example consider a hypothesis concerning two of the coefficients in the equation for Y_2 alone. This is brought into the standard form by choosing A as the corresponding 2 × 7 matrix,

B = (0 1)

and C = (0 0)'. The F-test statistic has (2, 169) degrees of freedom and is found to be 27.4. The values shown are therefore significant.
5.3 Multivariate Analyses of Variance (MANOVA)

We will now specialise the results from the previous section to generalisations of the univariate one- and two-sided analysis of variance. First

5.3.1 One-sided multidimensional analysis of variance

We consider observations

Y_11, ..., Y_1n_1, ..., Y_k1, ..., Y_kn_k.
These are assumed to be stochastically independent with

Y_ij ∈ N_p(μ_i, Σ),   i = 1, ..., k;  j = 1, ..., n_i,

i.e. p-dimensionally normally distributed with the same variance-covariance matrix. We wish to test the hypothesis

H0: μ_1 = ... = μ_k

against

H1: ∃ i, j: μ_i ≠ μ_j.
Analogously to the univariate one-sided analysis of variance we define sums of squares deviation matrices

T = Σ_{i=1}^k Σ_{j=1}^{n_i} (Y_ij − Ȳ)(Y_ij − Ȳ)'

W = Σ_{i=1}^k Σ_{j=1}^{n_i} (Y_ij − Ȳ_i)(Y_ij − Ȳ_i)'

B = Σ_{i=1}^k n_i (Ȳ_i − Ȳ)(Ȳ_i − Ȳ)'.

Here we have, with n = Σ_i n_i,

Ȳ_i = (1/n_i) Σ_j Y_ij   and   Ȳ = (1/n) Σ_i Σ_j Y_ij.

With a bit of algebra we see that the "total" matrix T is the sum of the "between groups" matrix B and the "within groups" matrix W, i.e.

T = W + B,

i.e. as in the one-dimensional case we have a partitioning of the total variation into the variation between groups and the variation within groups.
It is trivial that we as an unbiased estimate of the variance-covariance matrix Σ can use

Σ̂ = 1/(n − k) W.

If the hypothesis is true then T will also be proportional to such an estimate. If the hypothesis is not true then T will be "larger". Therefore the following theorem seems intuitively reasonable.
THEOREM 5.11. The ratio test for the test of the hypothesis H0 against H1 is given by the critical region

{ y | det(W)/det(W + B) ≤ U(p, k − 1, n − k)_α }.

PROOF. Omitted. Is found by special choices of the A, B and C matrices in theorem 5.9.
Just as in the case of the one-dimensional analysis of variance the results are displayed using an analysis of variance table.

Source of variation                SS-matrix                                  Degrees of freedom
Deviation from hypothesis
 = variation between groups        B = Σ_i n_i (Ȳ_i − Ȳ)(Ȳ_i − Ȳ)'            k − 1
Error
 = variation within groups         W = Σ_i Σ_j (Y_ij − Ȳ_i)(Y_ij − Ȳ_i)'      n − k
Total                              T = Σ_i Σ_j (Y_ij − Ȳ)(Y_ij − Ȳ)'          n − 1
As is done in univariate ANOVA it is of course possible to determine the expected values of the B and T matrices even without H0 being true. We will, however, not pursue this further here.
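The matrices T, W and B and the partitioning T = W + B can be sketched as follows (synthetic data, invented for illustration):

```python
import numpy as np

def manova_ssp(groups):
    """Total, within and between SSP matrices for one-sided MANOVA;
    `groups` is a list of (n_i, p) observation arrays."""
    allobs = np.vstack(groups)
    grand = allobs.mean(axis=0)
    T = (allobs - grand).T @ (allobs - grand)
    W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
    B = sum(len(g) * np.outer(g.mean(axis=0) - grand, g.mean(axis=0) - grand)
            for g in groups)
    return T, W, B

rng = np.random.default_rng(2)
groups = [rng.normal(size=(n_i, 3)) for n_i in (5, 7, 6)]
T, W, B = manova_ssp(groups)
print(np.allclose(T, W + B))  # True: the partitioning T = W + B
```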
5.3.2 Two-sided multidimensional analysis of variance

In this case we will only look at a two-sided analysis of variance with 1 observation per cell. We will therefore assume that we have observations

Y_11, ..., Y_1m, ..., Y_k1, ..., Y_km,
which are assumed to be p-dimensionally normally distributed with the same variance-covariance matrix Σ and with mean values

E(Y_ij) = μ + α_i + β_j,

where the parameters α_i and β_j satisfy

Σ_i α_i = Σ_j β_j = 0.
We now want to test the hypothesis

H0: α_1 = ... = α_k = 0

against

H1: ∃ i: α_i ≠ 0,

and

K0: β_1 = ... = β_m = 0

against

K1: ∃ j: β_j ≠ 0.
Analogous to the sums of squares of the one-dimensional (univariate) analysis of variance we define the matrices

Q1 = Σ_{i=1}^k Σ_{j=1}^m (Y_ij − Ȳ_i. − Ȳ.j + Ȳ..)(Y_ij − Ȳ_i. − Ȳ.j + Ȳ..)'

Q2 = m Σ_{i=1}^k (Ȳ_i. − Ȳ..)(Ȳ_i. − Ȳ..)'

Q3 = k Σ_{j=1}^m (Ȳ.j − Ȳ..)(Ȳ.j − Ȳ..)'.

Here we have used the usual notation

Ȳ_i. = (1/m) Σ_j Y_ij,   i = 1, ..., k,

Ȳ.j = (1/k) Σ_i Y_ij,   j = 1, ..., m,

and Ȳ.. = (1/(km)) Σ_i Σ_j Y_ij.
We see that in this case we also have the usual partitioning of the total variation,

T = Q1 + Q2 + Q3,

i.e. the total variation (T) is partitioned into the variation between rows (Q2), the variation between columns (Q3) and the residual variation (interaction variation) (Q1).
THEOREM 5.12. The ratio test at level a for test of Ho against HI is given by the
critical region
The ratio test at level a for test of Ko against](1 is given by the critical region
\
det(ql)
{Yll'''',Ykm ( ) :::;U(p,m1,(k1)(m1)),,}.
det ql + q3
PROOF. Omitted. Follows readily from theorem 5.9. See e.g. [2].
We collect the results in a usual analysis of variance table.

Source of variation     SS-matrix                                                   Degrees of freedom   Test statistic
Differences between
 columns                Q3 = k Σ_j (Ȳ.j − Ȳ..)(Ȳ.j − Ȳ..)'                          m − 1                det(Q1)/det(Q1 + Q3)
Differences between
 rows                   Q2 = m Σ_i (Ȳ_i. − Ȳ..)(Ȳ_i. − Ȳ..)'                        k − 1                det(Q1)/det(Q1 + Q2)
Residual                Q1 = Σ_i Σ_j (Y_ij − Ȳ_i. − Ȳ.j + Ȳ..)(Y_ij − Ȳ_i. − Ȳ.j + Ȳ..)'   (k − 1)(m − 1)
Total                   T = Σ_i Σ_j (Y_ij − Ȳ..)(Y_ij − Ȳ..)'                       km − 1

The matrix 1/((k − 1)(m − 1)) Q1 can be used as an unbiased estimate of Σ.
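For the balanced two-sided layout with one observation per cell, the matrices Q1, Q2, Q3 and the partitioning T = Q1 + Q2 + Q3 can be sketched as follows (synthetic data, invented for illustration):

```python
import numpy as np

def twoway_ssp(Y):
    """Q1 (residual), Q2 (rows), Q3 (columns) and T for a two-sided MANOVA
    with one p-dimensional observation per cell; Y has shape (k, m, p)."""
    k, m, _ = Y.shape
    yi = Y.mean(axis=1)                 # row means, shape (k, p)
    yj = Y.mean(axis=0)                 # column means, shape (m, p)
    yy = Y.mean(axis=(0, 1))            # grand mean, shape (p,)
    resid = Y - yi[:, None, :] - yj[None, :, :] + yy
    Q1 = np.tensordot(resid, resid, axes=([0, 1], [0, 1]))
    Q2 = m * (yi - yy).T @ (yi - yy)
    Q3 = k * (yj - yy).T @ (yj - yy)
    T = np.tensordot(Y - yy, Y - yy, axes=([0, 1], [0, 1]))
    return Q1, Q2, Q3, T

rng = np.random.default_rng(3)
Q1, Q2, Q3, T = twoway_ssp(rng.normal(size=(3, 6, 3)))
print(np.allclose(T, Q1 + Q2 + Q3))  # True
```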
We now give an illustrative example.

EXAMPLE 5.5. At the Royal Veterinary and Agricultural University's experimental farm, Højbakkegård, an experiment concerning the yield of crops was conducted in the period 1956-58 as part of an international study. Experiments on 10 plant types were performed. The kinds of yield which were of interest were the amounts of

dry matter
green matter
nitrogen.

Each type of plant was grown in 6 blocks (i.e. plots of soil with different quality). In order to reduce the amount of data we will limit ourselves to three plants and to the year 1957. The results of the experiment considered are given below.
Type of plant   Type of yield    Block No. 1       2        3        4        5        6
Marchigiana     Dry matter         9.170   10.683   10.063    8.104   10.018    9.570
                Nitrogen           0.286    0.335    0.315    0.259    0.319    0.304
                Green matter      40.959   47.677   44.950   36.919   45.859   43.838
Kayseri         Dry matter         9.403   10.914   11.018   11.385   13.387   12.848
                Nitrogen           0.285    0.330    0.333    0.339    0.400    0.383
                Green matter      42.475   49.546   50.152   51.718   60.758   58.334
Atlantic        Dry matter        11.349   10.971    9.794    8.944   11.715   11.903
                Nitrogen           0.369    0.357    0.319    0.291    0.379    0.386
                Green matter      52.476   60.757   45.151   42.221   55.505   56.364

Yield in 1000 kg/ha.

We wish to analyse how the yield varies with the blocks, the type of plant and the type of yield.
We will first analyse each type of yield by itself. For each we base the analysis on a two-sided analysis of variance. The model is

Y_ij = μ + α_i + β_j + ε_ij   (i = 1, 2, 3;  j = 1, ..., 6),

and we are therefore assuming that each observation Y_ij can be written as a sum of μ (level), α_i (effect of plant), β_j (effect of block) and ε_ij (residual, a small randomly varying quantity).

If we first consider dry matter we get

y_11 = 9.170, y_12 = 10.683, ..., y_36 = 11.903.
The analysis of variance table was found (by means of SSP-ANOVA):

Source of variation   Sums of squares   Degrees of freedom   Mean squares   F-values
A                     11.218244          5                   2.243648       2.25
B                     10.945597          2                   5.472798       5.49
AB                     9.970109         10                   0.997010
Total                 32.133936         17

The test statistic for the hypothesis β_1 = ... = β_6 = 0 is

F = 2.243648/0.997010 = 2.25 < 3.33 = F_95%(5, 10),

i.e. we cannot reject that the β's equal 0. The test statistic for the hypothesis α_1 = α_2 = α_3 = 0 equals

F = 5.472798/0.997010 = 5.49 > 4.10 = F_95%(2, 10).
At a 5% level we therefore reject that the α's all equal 0. However, we note that

F_97.5%(2, 10) = 5.46,

so the significance is only borderline at the 2.5% level.

If we perform the corresponding computations on the nitrogen yield we get (using as observations the coded values y′_ij = 1000 y_ij):
Source of variation   Sums of squares   Degrees of freedom   Mean squares   F-values
A                     10802.27734        5                   2160.45532     2.60
B                      8030.77734        2                   4015.38867     4.83
AB                     8310.55469       10                    831.05542
Total                 27143.60938       17
Here we again find that there is no difference between blocks but there is possibly a difference between plants. This difference is, however, not significant at the 2.5% level.

The corresponding computations on the yield of green matter were (again using coded observations y′_ij = 1000 y_ij):

Source of variation   Sums of squares   Degrees of freedom   Mean squares   F-values
A                     261702416          5                    52340480      2.75
B                     260173824          2                   130086912      6.83
AB                    190600448         10                    19060032
Total                 712476672         17
Here we again have that there is no difference between blocks. We also find a difference between plants at the 5% level but not at the 1% level, since

F_99%(2, 10) = 7.56.

We therefore see that the three types of yield show more or less the same sort of variation: there is no difference between blocks but there is a difference between plants. These differences are, however, not significant at small levels of α.

Now the three forms of yield are known to be strongly interdependent. We therefore expect that the analyses of variance give more or less similar results, and it is interesting to examine the variation in the yield when we take this dependency into consideration. Such an analysis can be performed by a three-dimensional two-sided analysis of variance, i.e. we use the model

Y_ij = μ + α_i + β_j + ε_ij,   i = 1, 2, 3,  j = 1, ..., 6,

where μ, α_i, β_j and ε_ij are now 3-dimensional vectors
and the observations are

Y_ij = ( content of green matter
         content of nitrogen
         content of dry matter )   in plant i, in block j.
The observed values are

y_11 = (40.959, 0.286, 9.170)', ..., y_36 = (56.364, 0.386, 11.903)'.

In this way we can aggregate the three analyses of variance shown above into one.
With the notation from p. 246 the matrices Q1, Q2 and Q3 are found to be

q2 = [ 260.18359
         1.38547   0.00803
        52.37032   0.26262   10.94564 ]

q3 = [ 261.70239
         1.67129   0.01080
        53.97473   0.34801   11.21827 ]

q1 = [ 190.59937
         1.25512   0.00831
        43.45444   0.28667    9.97013 ]

(lower triangles of the symmetric matrices; the variable order is green matter, nitrogen, dry matter).
The matrices have been found by means of the BMD-programme BMDX69. Still by means of the programme mentioned we find

Source       ln(Generalized variance)   U-statistic   Degrees of freedom   Approximate F statistic   Degrees of freedom
I            1.89908                    0.003315      3, 2, 10             43.6455                   6, 16.00
J            4.84194                    0.062894      3, 5, 10             2.5843                    15, 22.49
Full model   7.60824
Here I corresponds to the variation between plants and J to the variation between blocks.

The (in this case exact) F-test statistic for the test of the hypothesis α_1 = α_2 = α_3 = 0 (i.e. the hypothesis that all plants are equal) is 43.6. The number of degrees of freedom is (6, 16). Since 43.6 far exceeds F(6,16)_0.999,
we therefore have a very strong rejection of the hypothesis.
Since
F(15,22)0.975 = 2.50,
we see that now also the hypothesis of the blocks being equal is rejected at the level
0: = 2.5%.
The conclusion on the multidimensional analysis of variance is therefore that there is
a clear difference in the yield for the three types of plants. It is on the other hand more
uncertain if there are differences between the blocks.
We note a difference from the three one-dimensional analyses. In those cases we only had moderate or no significance for the hypothesis of the plant yields being equal. We therefore obtain different results by considering the simultaneous analysis instead of the three marginal ones.
5.4 Tests regarding variance-covariance matrices

In this section we will briefly give some of the tests for hypotheses on variance-covariance matrices: on the one hand tests corresponding to a hypothesis that the variance-covariance matrix has a given structure or is equal to a given matrix, and on the other hand tests corresponding to a hypothesis that several variance-covariance matrices are equal.

5.4.1 Tests regarding a single variance-covariance matrix

First we will give a test that k groups of normally distributed variables are independent. We consider an X ∈ N_p(μ, Σ), and we divide X into k components with the dimensions p_1, ..., p_k, i.e.

X' = (X_1', ..., X_k').
The corresponding partitioning of the parameters is

μ' = (μ_1', ..., μ_k')

and

Σ = [ Σ_11 ... Σ_1k
      ...
      Σ_k1 ... Σ_kk ].

Our hypothesis is now that X_1, ..., X_k are independent, i.e. that the variance-covariance matrix has the form

Σ = diag(Σ_11, ..., Σ_kk).

If we define the estimate Σ̂ computed on the basis of n realisations of X in the usual way, and if we partition Σ̂ analogously to the partitioning of Σ, we have
THEOREM 5.13. We consider the above mentioned situation and let

V = det(Σ̂) / ∏_{i=1}^k det(Σ̂_ii).

Then the quotient test for the test of the above hypothesis is given by the critical region

{ V ≤ v_α }.

When finding the boundary of the critical region we can use that

P{ −m ln V ≤ v } ≈ P{ χ²(f) ≤ v } + (ω_2/m²)[ P{ χ²(f + 4) ≤ v } − P{ χ²(f) ≤ v } ],

where

f = ½(p² − Σ_i p_i²)

and m and ω_2 are constants depending on n and p_1, ..., p_k; their explicit expressions are given in [2].
PROOF. Omitted. See e.g. [2].
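The statistic V of theorem 5.13 is straightforward to compute from a partitioned estimate Σ̂; a sketch:

```python
import numpy as np

def independence_v(S, dims):
    """V = det(S) / prod_i det(S_ii) for the hypothesis that the k blocks
    with dimensions `dims` are independent (theorem 5.13)."""
    V = np.linalg.det(S)
    start = 0
    for d in dims:
        V /= np.linalg.det(S[start:start + d, start:start + d])
        start += d
    return V
```

V equals 1 exactly when all off-block covariances vanish and is smaller than 1 otherwise.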
In the above mentioned situation we looked at a test for a variance-covariance matrix having a certain structure. We will now turn around and look at a test for the hypothesis that a variance-covariance matrix is proportional to a given matrix. We briefly give the result in

THEOREM 5.14. We consider independent observations X_1, ..., X_n with X_i ∈ N_p(μ, Σ), and we let

A = Σ_{i=1}^n (X_i − X̄)(X_i − X̄)'.

The likelihood ratio test statistic for a test of H0: Σ = σ²Σ_0, where Σ_0 is known and σ² unknown, against all alternatives is

W = det(AΣ_0⁻¹) / [tr(AΣ_0⁻¹)/p]^p.

When determining the critical region we can use that

P{ −(n − 1)ρ ln W ≤ z } ≈ P{ χ²(f) ≤ z } + ω_2[ P{ χ²(f + 4) ≤ z } − P{ χ²(f) ≤ z } ],

where

ρ = 1 − (2p² + p + 2)/(6p(n − 1))

f = ½p(p + 1) − 1

ω_2 = (p + 2)(p − 1)(p − 2)(2p³ + 6p² + 3p + 2)/(288p²(n − 1)²ρ²).

PROOF. Omitted. See e.g. [2].
Finally we will consider the situation where we wish to test that a variance-covariance matrix is equal to a given matrix. Then the following holds true.

THEOREM 5.15. We consider independent observations X_1, ..., X_n with X_i ∈ N_p(μ, Σ), and we let

A = Σ_{j=1}^n (X_j − X̄)(X_j − X̄)'.

The quotient test statistic for a test of H0: Σ = Σ_0, where Σ_0 is known, against all alternatives is

(e/n)^{pn/2} [det(AΣ_0⁻¹)]^{n/2} exp(−½ tr(AΣ_0⁻¹)).

When determining the critical region a χ²-approximation analogous to those in the preceding theorems can be used.
PROOF. Omitted. See e.g. [2].
5.4.2 Test for equality of several variance-covariance matrices

We will in this section consider the problem of testing the assumption of equal variance-covariance matrices in Hotelling's two-sample situation and in the multidimensional analysis of variance.

We will assume that there are independent observations

X_i1, ..., X_in_i ∈ N_p(μ_i, Σ_i),   i = 1, ..., k,

and we wish to test the hypothesis

H0: Σ_1 = ... = Σ_k   against   H1: ∃ i, j: Σ_i ≠ Σ_j.

We let

n = Σ_i n_i,

W_i = Σ_{j=1}^{n_i} (X_ij − X̄_i)(X_ij − X̄_i)'

and

W = Σ_{i=1}^k W_i,

cf. section 5.3.1.
We then have
THEOREM 5.16. As a test statistic for the test of H0 against H1 we can use

L = (n − k)^{p(n−k)/2} ∏_{i=1}^k [det W_i]^{(n_i−1)/2} / ( [det W]^{(n−k)/2} ∏_{i=1}^k (n_i − 1)^{p(n_i−1)/2} ).

The critical region is of the form

{ L ≤ l_α },

and in the determination of this we can use that

P{ −2ρ ln L ≤ z } ≈ P{ χ²(f) ≤ z } + ω_2[ P{ χ²(f + 4) ≤ z } − P{ χ²(f) ≤ z } ],

where

f = ½(k − 1)p(p + 1)

ρ = 1 − ( Σ_i 1/(n_i − 1) − 1/(n − k) ) · (2p² + 3p − 1)/(6(p + 1)(k − 1))

ω_2 = p(p + 1)/(48ρ²) · [ (p − 1)(p + 2)( Σ_i 1/(n_i − 1)² − 1/(n − k)² ) − 6(k − 1)(1 − ρ)² ].
PROOF. Omitted. See e.g. [2].
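A sketch of the statistic of theorem 5.16; the quantity −2ρ ln L is compared with a χ²(f) quantile (the ω_2 correction term is omitted here for brevity):

```python
import numpy as np

def equal_covariance_test(samples):
    """-2 rho ln L and the degrees of freedom f for the test of equal
    variance-covariance matrices (theorem 5.16)."""
    k = len(samples)
    p = samples[0].shape[1]
    ni = np.array([s.shape[0] for s in samples])
    n = ni.sum()
    Wi = [(s - s.mean(axis=0)).T @ (s - s.mean(axis=0)) for s in samples]
    W = sum(Wi)
    # ln L with each determinant taken of W_i/(n_i-1) resp. W/(n-k)
    lnL = sum((m - 1) / 2 * np.linalg.slogdet(Wm / (m - 1))[1]
              for m, Wm in zip(ni, Wi))
    lnL -= (n - k) / 2 * np.linalg.slogdet(W / (n - k))[1]
    rho = 1 - (np.sum(1 / (ni - 1)) - 1 / (n - k)) \
        * (2 * p * p + 3 * p - 1) / (6 * (p + 1) * (k - 1))
    f = (k - 1) * p * (p + 1) / 2
    return -2 * rho * lnL, f
```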
Chapter 6
Discriminant analysis
In this section we will address the problem of classifying an individual in one of two (or more) known populations based on measurements of some characteristics of the individual.

We first consider the problem of discriminating between two groups (classes).
6.1 Discrimination between two populations
6.1.1 Bayes and minimax solutions
We consider the populations π_1 and π_2 and wish to conclude whether a given individual is a member of group one or group two. We perform measurements of p different characteristics of the individual and hereby get the result

x' = (x_1, ..., x_p).

If the individual comes from π_1 the frequency function of X is f_1(x) and if it comes from π_2 it is f_2(x).

Let us furthermore assume that we have given a loss function L:
                 Choice:
                 π_1        π_2
Truth   π_1      0          L(1,2)
        π_2      L(2,1)     0
We will assume that there is no loss if we take the correct decision.

In certain situations one also knows approximately the prior probabilities of an individual coming from each of the groups, i.e. we have given a prior distribution (p_1, p_2).

We now seek a decision function d: R^p → {π_1, π_2}, defined by

d(x) = π_1 if x ∈ R_1
d(x) = π_2 if x ∈ R_2 = ∁R_1,

i.e. we divide R^p into two regions R_1 and R_2. If our observation lies in R_1 we will choose π_1 and if our observation lies in R_2 we will choose π_2.
If we have a prior distribution we define the posterior distribution k by

k(π_i|x) = p_i f_i(x) / (p_1 f_1(x) + p_2 f_2(x)),

which is the conditional distribution of π given x; the result follows from Bayes' theorem.

The expected loss in this distribution is

L(π_1, d_R1(x)) k(π_1|x) + L(π_2, d_R1(x)) k(π_2|x) = { L(π_2, π_1) k(π_2|x),  x ∈ R_1
                                                        L(π_1, π_2) k(π_1|x),  x ∈ R_2.
The Bayes solution is defined by minimising this quantity for any x (cf. Vol. 1), i.e. we define R_1 by

x ∈ R_1 ⇔ L(2,1) k(π_2|x) ≤ L(1,2) k(π_1|x)
        ⇔ L(1,2) f_1(x) p_1 / (L(2,1) f_2(x) p_2) ≥ 1
        ⇔ f_1(x)/f_2(x) ≥ L(2,1) p_2 / (L(1,2) p_1).

These considerations are collected in
THEOREM 6.1. The Bayes solution to the classification problem is given by the region

R_1 = { x | f_1(x)/f_2(x) ≥ L(2,1) p_2 / (L(1,2) p_1) }.
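Theorem 6.1 translates directly into a tiny classifier; the densities, losses and priors below are illustrative assumptions (two univariate normal populations with means 0 and 2 and unit variance, up to a common constant):

```python
import math

def bayes_classifier(f1, f2, L12, L21, p1, p2):
    """Return d(x): choose pi_1 iff f1(x)/f2(x) >= L(2,1) p2 / (L(1,2) p1)."""
    c = L21 * p2 / (L12 * p1)
    return lambda x: 1 if f1(x) >= c * f2(x) else 2

f1 = lambda x: math.exp(-0.5 * x ** 2)
f2 = lambda x: math.exp(-0.5 * (x - 2.0) ** 2)
d = bayes_classifier(f1, f2, L12=1.0, L21=1.0, p1=0.5, p2=0.5)
print(d(0.0), d(2.0))  # 1 2
```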
If we do not have a prior distribution we can instead determine a minimax strategy, i.e. determine R_1 so that the maximal risk is minimised. The risk is (cf. Vol. 1)

R(π_1, d_R1) = E_π1 L(π_1, d_R1(X)) = L(1,2) P{X ∈ R_2 | π_1}

R(π_2, d_R1) = E_π2 L(π_2, d_R1(X)) = L(2,1) P{X ∈ R_1 | π_2}.
One can now show (see e.g. the proof of theorem 4, chapter 6 in Vol. 1)

THEOREM 6.2. The minimax solution for the classification problem is given by the region

R_1 = { x | f_1(x)/f_2(x) ≥ c },

where c is determined by

L(1,2) P{ f_1(X)/f_2(X) < c | π_1 } = L(2,1) P{ f_1(X)/f_2(X) ≥ c | π_2 }.
REMARK 6.1. The relation for determining c can be written

L(1,2) · (the probability of misclassification if π1 is true)
  = L(2,1) · (the probability of misclassification if π2 is true).
Since the first is an increasing and the second a decreasing function of c, it is obvious that we will minimise the maximal risk when we have equality. If we do not have any idea about the size of the losses we can let them both equal one. The minimax solution then gives us the region which minimises the maximal probability of misclassification. ∇
We will now consider the important special case where f1 and f2 are normal distributions.
6.1.2 Discrimination between two normal populations
If f1 and f2 are normal with the same variance-covariance matrix we have

THEOREM 6.3. Let π1 ↔ N(μ1, Σ) and π2 ↔ N(μ2, Σ). Then we have

x ∈ R1 ⟺ f1(x)/f2(x) ≥ c ⟺ (x − ½(μ1 + μ2))′ Σ⁻¹ (μ1 − μ2) ≥ ln c.
PROOF. We introduce the inner product (·|·) and the norm ‖·‖ by

(x | y) = x′ Σ⁻¹ y

and

‖x‖² = (x | x).
We then have

f1(x)/f2(x) = exp(−½‖x − μ1‖² + ½‖x − μ2‖²).

From this we readily get

f1(x)/f2(x) ≥ c ⟺ ln f1(x) − ln f2(x) ≥ ln c
  ⟺ −‖x − μ1‖² + ‖x − μ2‖² ≥ 2 ln c
  ⟺ −(x − μ1 | x − μ1) + (x − μ2 | x − μ2) ≥ 2 ln c
  ⟺ 2(x | μ1) − 2(x | μ2) − (μ1 | μ1) + (μ2 | μ2) ≥ 2 ln c
  ⟺ 2(x | μ1 − μ2) − (μ1 | μ1) + (μ2 | μ2) ≥ 2 ln c.

By using the connection between (·|·) and Σ⁻¹ we find that the theorem readily follows.
REMARK 6.2. The expression (x − ½(μ1 + μ2))′ Σ⁻¹ (μ1 − μ2) ≥ ln c is seen to define a subset of ℝᵖ delimited by a hyperplane (for p = 2 a straight line and for p = 3 a plane). The quantity involved is, up to a constant, the orthogonal projection (NB! the orthogonal projection with respect to Σ⁻¹) of x onto the line which connects μ1 and μ2. (It can be shown that the slopes of the projection lines etc. are equal to the slopes of the tangents of the ellipse (ellipsoid) at the points where they intersect the line (μ1, μ2).) Since the length of a projection of a vector is equal to the inner product between the vector and a unit vector on the line, we see that we have classified the observation as coming from π1 iff the projection of x is large enough (computed with sign). Otherwise we will classify the observation as coming from π2.
The function

d(x) = (x − ½(μ1 + μ2))′ Σ⁻¹ (μ1 − μ2)

is called the discriminator or the discriminant function.
We then have that the discriminator is the linear projection which, after the addition of suitable constants, minimises the expected loss (the Bayes situation) or the maximal probability of misclassification (the minimax situation).
In order to make the reader more confident with the content we will now give a slightly different interpretation of a discriminator. If we let

Δ = Σ⁻¹(μ1 − μ2),

we have the following

THEOREM 6.4. The vector Δ has the property that it maximises the function

φ(d) = [(μ1 − μ2)′ d]² / (d′ Σ d).
PROOF. The proof is fairly simple. Since we readily have that φ(k · d) = φ(d), we can determine extremes for φ by determining extremes for the numerator [(μ1 − μ2)′ d]² under the following constraint

d′ Σ d = 1.

We introduce a Lagrange multiplier λ and seek the maximum of

[(μ1 − μ2)′ d]² − λ(d′ Σ d − 1).

Now we have that the derivative with respect to d equals

2[(μ1 − μ2)′ d](μ1 − μ2) − 2λ Σ d.

If we let this equal 0, we have

Σ d = ((μ1 − μ2)′ d / λ)(μ1 − μ2),

i.e.

d = k Σ⁻¹(μ1 − μ2),

where k is a scalar.
REMARK 6.3. The content of the theorem is that the linear function determined by Δ is the projection that "moves" π1 furthest possible away from π2, or, in analysis of variance terms, the projection which maximises the variance between populations divided by the total variance.
The geometrical content of the theorem is indicated in the above figure, where

b: is the projection of the ellipse onto the line μ1, μ2 in the direction determined by Δ′x = 0,
a: is the projection of the ellipse onto the line μ1, μ2 in a different direction.

It is seen that the projection determined by Δ onto the line which connects μ1 and μ2 is the one which "moves" the projections of the contour ellipsoids of the two populations' distributions furthest possible away from each other. ∇
We now give a theorem which is very useful in the determination of misclassification probabilities.
THEOREM 6.5. We consider the criterion in theorem 6.3,

Z = (X − ½(μ1 + μ2))′ Σ⁻¹ (μ1 − μ2).

Then

Z ∈ N( ½‖μ1 − μ2‖², ‖μ1 − μ2‖²)   if π1 is true
Z ∈ N(−½‖μ1 − μ2‖², ‖μ1 − μ2‖²)   if π2 is true,

where ‖μ1 − μ2‖² = (μ1 − μ2)′ Σ⁻¹ (μ1 − μ2).
PROOF. The proof is straightforward. Let us e.g. consider the case π1 true. We then have that E(X) = μ1 and then

E(Z) = μ1′ Σ⁻¹ (μ1 − μ2) − ½(μ1 + μ2)′ Σ⁻¹ (μ1 − μ2)
     = (μ1 − ½(μ1 + μ2) | μ1 − μ2)
     = ½(μ1 − μ2 | μ1 − μ2)
     = ½‖μ1 − μ2‖².

V(Z) = (μ1 − μ2)′ Σ⁻¹ D(X) Σ⁻¹ (μ1 − μ2)
     = (μ1 − μ2)′ Σ⁻¹ (μ1 − μ2)
     = ‖μ1 − μ2‖².

The result regarding π2 is shown analogously.
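A small sketch of how theorem 6.5 gives misclassification probabilities; the value Δ² = 4 below is a made-up illustration, not from the text:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def misclassification_probs(delta2, c):
    """Theorem 6.5: Z ~ N(+delta2/2, delta2) under pi1 and N(-delta2/2, delta2) under pi2.
    Returns (P{classified pi2 | pi1}, P{classified pi1 | pi2})."""
    delta = math.sqrt(delta2)
    p_wrong_1 = phi((math.log(c) - delta2 / 2.0) / delta)        # P{Z < ln c | pi1}
    p_wrong_2 = 1.0 - phi((math.log(c) + delta2 / 2.0) / delta)  # P{Z >= ln c | pi2}
    return p_wrong_1, p_wrong_2

# Hypothetical Mahalanobis distance ||mu1 - mu2||^2 = 4 and c = 1:
p1_err, p2_err = misclassification_probs(4.0, 1.0)
print(round(p1_err, 4), round(p2_err, 4))  # both equal Phi(-1), about 0.1587
```

For c = 1 the two error probabilities are equal by symmetry, which is exactly the minimax situation with equal losses.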
We will now consider some examples.
EXAMPLE 6.1. We consider the case where

π1 ↔ N( (4, 2)′, Σ )  and  π2 ↔ N( (1, 1)′, Σ ),  with  Σ = [ 1 1 ; 1 2 ],

and we want to determine a "best" discriminator function. Since we know nothing about the prior probabilities and so on, we will use the function which corresponds to the constant c in theorem 6.3 being 1. Since

Σ⁻¹ = [ 2 −1 ; −1 1 ],

we get the following function

d(x1, x2) = (x1, x2) Σ⁻¹ (3, 1)′ − ½(μ1′Σ⁻¹μ1 − μ2′Σ⁻¹μ2) = 5x1 − 2x2 − ½(20 − 1),

or

5x1 − 2x2 − 9½ = 0.
If we enter an arbitrary point, e.g. (5, 6)′, we get

5·5 − 2·6 − 9½ = 3½ > 0.

This point is therefore classified as coming from π1.
We have indicated the situation in the following figure.

[Figure: the discriminating line 5x1 − 2x2 − 9½ = 0 together with a contour ellipse belonging to π1's distribution.]
If we have a loss function the procedure is a bit different, which is seen from

EXAMPLE 6.2. Let us assume that we have losses assigned to the different decisions:

                Choice:
                π1    π2
Truth:   π1     0     2
         π2     1     0
Since we have no prior probabilities we will determine the minimax solution. We need

‖μ1 − μ2‖² = 2·9 + 1·1 − 2·3·1 = 13.

From theorem 6.2 it follows that we must determine c so that

2 · P{ f1(X)/f2(X) < c | π1 } = P{ f1(X)/f2(X) ≥ c | π2 }
⟺ 2 · P{ Z < ln c | π1 } = P{ Z ≥ ln c | π2 }
⟺ 2 · P{ N(½·13, 13) < ln c } = P{ N(−½·13, 13) ≥ ln c }
⟺ 2 · P{ N(0, 1) < (ln c − 6.5)/√13 } = P{ N(0, 1) ≥ (ln c + 6.5)/√13 }.

By trying different values of c we see that

c ≈ 0.5617.
Using this value the misclassification probabilities are as follows. If π1 is true:

P{ N(0, 1) < (ln 0.5617 − 6.5)/√13 } ≈ 0.025.

If π2 is true:

P{ N(0, 1) ≥ (ln 0.5617 + 6.5)/√13 } ≈ 0.050.
The discriminating line is now determined by

5x1 − 2x2 − 9½ = ln 0.5617,

or

5x1 − 2x2 − 8.92 = 0.

This line intersects the line connecting μ1 and μ2 in (2.36, 1.46), i.e. it is moved towards μ2 compared to the midpoint (2.5, 1.5). It is also obvious that the line must be moved parallelly in this direction, since we see from the loss matrix that it is more serious to be wrong if π1 is true than if π2 is true. We must therefore expand R1, i.e. move the limiting line towards μ2.
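The search for c can be sketched numerically. Using this example's numbers (Δ² = 13, losses 2 and 1), a simple bisection on the increasing function L(1,2)·P{Z < ln c | π1} − L(2,1)·P{Z ≥ ln c | π2} recovers a value close to the c ≈ 0.5617 found above:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def risk_gap(c, delta2=13.0, L12=2.0, L21=1.0):
    """L(1,2) P{Z < ln c | pi1} - L(2,1) P{Z >= ln c | pi2}, with Z as in theorem 6.5."""
    d = math.sqrt(delta2)
    return (L12 * phi((math.log(c) - delta2 / 2.0) / d)
            - L21 * (1.0 - phi((math.log(c) + delta2 / 2.0) / d)))

# risk_gap is increasing in c, so bisection on a sign-changing bracket finds the root:
lo, hi = 0.01, 10.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if risk_gap(mid) < 0:
        lo = mid
    else:
        hi = mid
c = 0.5 * (lo + hi)
print(round(c, 4))  # close to the 0.5617 found by trial in the text
```

The tiny difference from 0.5617 comes from the text's trial-and-error search versus the bisection's tighter tolerance.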
We must stress that it is of importance that the variance-covariance matrices for the two populations are equal. If this is not the case we will get a completely different solution, as will be seen from the following example.
EXAMPLE 6.3. Let us assume that the variance-covariance matrix for population 2 is changed to an identity matrix, i.e.

π1 ↔ N( (4, 2)′, [ 1 1 ; 1 2 ] ),  π2 ↔ N( (1, 1)′, I ).

Again we want to classify an observation X which comes from one of the above mentioned distributions. Since the variance-covariance matrices are not equal we cannot use the result in theorem 6.3 but have to start from the beginning with theorem 6.2. For c > 0 we have (both determinants being 1)

f1(x)/f2(x) ≥ c ⟺ (x − μ2)′(x − μ2) − (x − μ1)′ Σ1⁻¹ (x − μ1) ≥ 2 ln c.

Since

(x − μ1)′ Σ1⁻¹ (x − μ1) = 2(x1 − 4)² + (x2 − 2)² − 2(x1 − 4)(x2 − 2)
                        = 2x1² + x2² − 2x1x2 − 12x1 + 4x2 + 20

and

(x − μ2)′(x − μ2) = (x1 − 1)² + (x2 − 1)²
                  = x1² + x2² − 2x1 − 2x2 + 2,

then

f1(x)/f2(x) ≥ c ⟺ −x1² + 2x1x2 + 10x1 − 6x2 − 18 ≥ 2 ln c.

If we choose c = 1, we note that the curve which separates R1 and R2 is the hyperbola

−x1² + 2x1x2 + 10x1 − 6x2 − 18 = 0.

It has centre in (3, −2) and asymptotes

x1 − 3 = 0,   x1 − 2x2 − 7 = 0.
These curves are shown in the above figure together with the contour ellipses of the two normal distributions. Note e.g. that a point such as (9, 0) is in R2 and therefore will be classified as coming from the distribution with centre in (1, 1). Furthermore the frequency functions are shown.

We will not consider the problem of misclassification probabilities in cases as the above mentioned where we have quadratic discriminators.
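The quadratic discriminator of this example is easy to evaluate directly; the function below is just 2 ln(f1(x)/f2(x)) written out with the two quadratic forms above:

```python
def quad_discriminator(x1, x2):
    """2 ln(f1/f2) for pi1 = N((4,2)', Sigma) with Sigma^{-1} = [[2,-1],[-1,1]]
    (det Sigma = 1) and pi2 = N((1,1)', I):
    (x - mu2)'(x - mu2) - (x - mu1)' Sigma^{-1} (x - mu1)."""
    q1 = 2 * (x1 - 4) ** 2 + (x2 - 2) ** 2 - 2 * (x1 - 4) * (x2 - 2)
    q2 = (x1 - 1) ** 2 + (x2 - 1) ** 2
    return q2 - q1

# With c = 1 we choose pi1 iff the value is >= 0:
print(quad_discriminator(4, 2))  # at mu1: 10, clearly in R1
print(quad_discriminator(9, 0))  # the point (9,0) from the text: -9, so in R2
```

The sign change at (9, 0) illustrates the curved (hyperbolic) boundary: the point lies far from both means, yet on π2's side of it.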
6.1.3 Discrimination with unknown parameters
If one does not know the two distributions f1 and f2 one must estimate them based on some observations and then construct discriminators from the estimated distributions in the same way as we did for the exact distributions.
Let us consider the normal case

π1 ↔ N(μ1, Σ),  π2 ↔ N(μ2, Σ),

where the parameters are unknown. If we have observations X1, ..., X_n1 which we know come from π1 and observations Y1, ..., Y_n2 which we know come from π2, we can estimate the parameters as usual.
In complete analogy with the theorem on p. 260 we have the discriminator

d(x) = (x − ½(X̄ + Ȳ))′ Σ̂⁻¹ (X̄ − Ȳ).

The exact distribution of this quantity, if we substitute x with a random variable X ∈ N(μi, Σ), is fairly complicated, but for large sample sizes it is asymptotically equal to the distribution of Z in theorem 6.5, so for reasonable sample sizes we can use the theory we have derived.
The estimated norm between the expected values is

D² = (X̄ − Ȳ)′ Σ̂⁻¹ (X̄ − Ȳ).

This is called Mahalanobis' distance. It should here be noted that a number of authors use the expression Mahalanobis' distance also about the quantity ‖μ1 − μ2‖². It is named after the Indian statistician P.C. Mahalanobis, who developed discriminant analysis at the same time as the English statistician R.A. Fisher in the 1930s.
By means of D² we can test if μ1 = μ2, since

Z = (n1 + n2 − p − 1)/(p(n1 + n2 − 2)) · (n1 n2)/(n1 + n2) · D²
is F(p, n1 + n2 − p − 1)-distributed if μ1 = μ2. If μ1 ≠ μ2 then Z has a larger expected value, so the critical region corresponds to large values of Z. This test is of course equivalent to Hotelling's T²-test in section 5.1.2.
We give an example (data come from K.R. Nair: A biometric study of the desert locust, Bull. Int. Stat. Inst., 1951).
EXAMPLE 6.4. In a study of desert locusts one measured the following biometric characteristics:

X1: length of hind femur
X2: maximum width of the head in the genal region
X3: length of pronotum at the skull

The two species which were examined are gregaria and an intermediate phase between gregaria and solitaria.
The following mean values were found.

        Mean values
        Gregaria   Intermediate phase
        n1 = 20    n2 = 72
X1      25.80      28.35
X2       7.81       7.41
X3      10.77      10.75
The estimated variance-covariance matrix is

Σ̂ = [ 4.7350  0.5622  1.4685
      0.5622  0.1413  0.2174
      1.4685  0.2174  0.5702 ].
We are now interested in determining a discriminant function for classification of future locusts by means of measurements of X1, X2, X3.

However, first it would be reasonable to check if the three measurements from the two populations are different at all, i.e. we must investigate if it can be assumed that μ1 = μ2. We have

D² = 9.7421.

This value is inserted in the test statistic p. 269 and we get

Z = (20 + 72 − 3 − 1)/(3(20 + 72 − 2)) · (20·72)/(20 + 72) · 9.7421 = 49.70.
Since

F(3, 88)₀.₉₉₉ ≈ 6,

we will reject the hypothesis of the two mean values being equal. It is therefore reasonable to try constructing a discriminator.
We have

Σ̂⁻¹(x̄1 − x̄2) = (−2.7458, 6.6217, 4.582)′,

and since there is no information on prior probabilities we will use c = 1, i.e. ln c = 0. We will therefore use the function

d(x) = −2.7458 x1 + 6.6217 x2 + 4.582 x3 − 25.3506

in classifying the two possible locust species.
If we for instance have caught a specimen and measured the characteristics

x = (27.06, 8.03, 11.36)′,

we get d(x) = 5.5715 > 0, meaning we will classify the individual as being a gregaria.
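The computations of this example can be sketched in code. Note that the printed means and covariances are rounded, so D² and Z come out slightly different from the 9.7421 and 49.70 above; the small Gaussian-elimination solver simply stands in for multiplying by Σ̂⁻¹:

```python
def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        y[r] = (M[r][n] - sum(M[r][c] * y[c] for c in range(r + 1, n))) / M[r][r]
    return y

# Printed (rounded) pooled covariance and mean differences (gregaria minus intermediate):
S = [[4.7350, 0.5622, 1.4685],
     [0.5622, 0.1413, 0.2174],
     [1.4685, 0.2174, 0.5702]]
d = [25.80 - 28.35, 7.81 - 7.41, 10.77 - 10.75]

coef = solve(S, d)                         # estimate of Sigma^{-1}(mu1 - mu2)
D2 = sum(a * b for a, b in zip(d, coef))   # Mahalanobis' D^2 (text: 9.7421)
n1, n2, p = 20, 72, 3
Z = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * n1 * n2 / (n1 + n2) * D2  # text: 49.70
print(round(D2, 2), round(Z, 1))

# Classifying the specimen from the text with the printed discriminator:
x = [27.06, 8.03, 11.36]
d_x = -2.7458 * x[0] + 6.6217 * x[1] + 4.582 * x[2] - 25.3506
print(round(d_x, 4))  # positive, so the specimen is classified as gregaria
```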
6.1.4 Test for best discrimination function

We remind ourselves that the best discriminator

Δ̂ = Σ̂⁻¹(x̄1 − x̄2)

can be found by maximising the function

φ(d) = [(x̄1 − x̄2)′ d]² / (d′ Σ̂ d).

The maximum value is

φ(Δ̂) = (x̄1 − x̄2)′ Σ̂⁻¹ (x̄1 − x̄2) = D²,

i.e. Mahalanobis' D² is the maximum value of φ(d). For an arbitrary (fixed) d we let

D_r² = φ(d).
We can then test the hypothesis that the linear projection determined by d is the best discriminator by means of the test statistic

Z = (n1 + n2 − p − 1)/(p − 1) · n1 n2 (D² − D_r²) / ((n1 + n2)(n1 + n2 − 2) + n1 n2 D_r²),

which is F(p − 1, n1 + n2 − p − 1)-distributed under the hypothesis. Large values of Z are critical.
We will not consider the reason why the distribution under the null hypothesis looks the way it does, but just note that Z gives a measure of how much the "distance" between the two populations is reduced by using d instead of Δ̂. If this reduction is too big, i.e. Z is large, we will not be able to assume that d gives essentially as good a discrimination between the two populations as Δ̂.
EXAMPLE 6.5. In the following table we give averages of 50 measurements of different characteristics of two types of Iris, Iris versicolor and Iris setosa. (The data come from Fisher's investigations in 1936.)
               Versicolor   Setosa   Difference
Sepal length     5.936      5.006      0.930
Sepal width      2.770      3.428     −0.658
Petal length     4.260      1.462      2.798
Petal width      1.326      0.246      1.080
The estimated variance-covariance matrix (based on 98 degrees of freedom) is

Σ̂ = [ 0.19534   0.09220   0.099626  0.033061
      0.09220   0.12108   0.04718   0.02525
      0.099626  0.04718   0.12549   0.039586
      0.033061  0.02525   0.039586  0.02511 ].
From this it readily follows that

Δ̂ = Σ̂⁻¹(μ̂1 − μ̂2) = (−3.0692, −18.0006, 21.7641, 30.7549)′.
Mahalanobis' distance between the mean values is

D² = (0.930, −0.658, 2.798, 1.080) (−3.0692, −18.0006, 21.7641, 30.7549)′ = 103.2119.
We first test if we can assume that μ1 = μ2. The test statistic is

(50 + 50 − 4 − 1)/(4(50 + 50 − 2)) · (50·50)/(50 + 50) · 103.2119 = 625.3256
  > F(4, 95)₀.₉₉₉₅ ≈ 5.5.

It is therefore not reasonable to assume μ1 = μ2.
By looking at the differences between the components in μ̂1 and μ̂2 we note that the number for versicolor is largest except for X2 (the sepal width). Since we are looking for a linear projection which takes a large value for μ̂1 − μ̂2 we could try the projection

d0′x = x1 − x2 + x3 + x4,

where d0 here is the vector (1, −1, 1, 1)′.
We will now test if it can be assumed that the best discriminator has the form

Δ̂ = constant · (1, −1, 1, 1)′ = constant · d0.
We determine the value of φ corresponding to d0:

φ(d0) = [(μ̂1 − μ̂2)′ d0]² / (d0′ Σ̂ d0) = 61.9479.
The test statistic becomes

(50 + 50 − 4 − 1)/(4 − 1) · 50·50·(103.2119 − 61.9479) / ((50 + 50)(50 + 50 − 2) + 50·50·61.9479)
  = 19.84 > F(3, 95)₀.₉₉₉₅ ≈ 6.5.

We must therefore reject the hypothesis and note that we cannot assume that the best discriminator is of the form x1 − x2 + x3 + x4.
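Example 6.5's quantities can be recomputed from the printed (rounded) tables; small deviations from the text's 103.2119, 625.3256, 61.9479 and 19.84 are due to that rounding:

```python
def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        y[r] = (M[r][n] - sum(M[r][c] * y[c] for c in range(r + 1, n))) / M[r][r]
    return y

S = [[0.19534, 0.09220, 0.099626, 0.033061],   # pooled covariance, 98 df
     [0.09220, 0.12108, 0.04718, 0.02525],
     [0.099626, 0.04718, 0.12549, 0.039586],
     [0.033061, 0.02525, 0.039586, 0.02511]]
diff = [0.930, -0.658, 2.798, 1.080]           # versicolor minus setosa means

delta = solve(S, diff)        # discriminator coefficients, roughly (-3.1, -18.0, 21.8, 30.8)
D2 = sum(a * b for a, b in zip(diff, delta))   # Mahalanobis' D^2, about 103.2
n1 = n2 = 50
p = 4
Z_eq = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * n1 * n2 / (n1 + n2) * D2  # about 625

d0 = [1.0, -1.0, 1.0, 1.0]
num = sum(a * b for a, b in zip(diff, d0)) ** 2
den = sum(d0[i] * S[i][j] * d0[j] for i in range(4) for j in range(4))
Dr2 = num / den               # phi(d0), about 61.95
Z_best = ((n1 + n2 - p - 1) / (p - 1) * n1 * n2 * (D2 - Dr2)
          / ((n1 + n2) * (n1 + n2 - 2) + n1 * n2 * Dr2))   # about 19.8
print(round(D2, 1), round(Z_eq, 1), round(Dr2, 2), round(Z_best, 2))
```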
6.1.5 Test for further information

Suppose one has obtained measurements of a number of variables for some individuals with the objective of determining a discriminant function. Often the question arises whether all the measurements are really necessary, or if one can do with fewer variables in order to separate the populations from each other. One could e.g. think it would be sufficient just to measure the length of sepal and petal in order to discriminate between Iris versicolor and Iris setosa.

We will formulate these thoughts a bit more precisely. In order to perform a discrimination we measure the variables X1, ..., Xp. We now formulate a test in order to investigate if it might be possible that the last q variables are unnecessary for the discrimination.
We still assume that there are n1 observations from π1 and n2 observations from population π2. We let

X = ( X⁽¹⁾ ; X⁽²⁾ ),

where X⁽¹⁾ contains the first p − q variables and X⁽²⁾ the last q, and we perform the same partitioning of the mean vectors and the variance-covariance matrix:

μi = ( μi⁽¹⁾ ; μi⁽²⁾ ),   Σ = [ Σ11 Σ12 ; Σ21 Σ22 ].

We now compute Mahalanobis' distance between the populations, first using the full information, i.e. all p variables, and then using the reduced information, i.e. only the first p − q variables. We then have
D_p² = (μ̂1 − μ̂2)′ Σ̂⁻¹ (μ̂1 − μ̂2)

and

D²_{p−q} = (μ̂1⁽¹⁾ − μ̂2⁽¹⁾)′ Σ̂11⁻¹ (μ̂1⁽¹⁾ − μ̂2⁽¹⁾).
A test for the hypothesis H0 that the last q variables do not contribute to a better discrimination is based on

Z = (n1 + n2 − p − 1)/q · n1 n2 (D_p² − D²_{p−q}) / ((n1 + n2)(n1 + n2 − 2) + n1 n2 D²_{p−q}).

It can be shown that Z ∈ F(q, n1 + n2 − p − 1) if H0 is true. We omit the proof, but just state that Z "measures" relatively how much larger the distance between the populations becomes when going from p − q variables to p variables. It is therefore also intuitively reasonable that we reject the hypothesis that p − q variables are sufficient if Z is large.
We now give an illustrative
EXAMPLE 6.6. We will investigate if it is sufficient only to measure the length of sepal
and petal in order to discriminate the types of Iris given in example 6.5.
We now perform an ordinary discriminant analysis on the data, disregarding the width measurements. The resulting Mahalanobis' distance is

D²₂ = 76.7082,
so the test statistic for the hypothesis is

(50 + 50 − 4 − 1)/2 · 50·50·(103.2119 − 76.7082) / ((50 + 50)(50 + 50 − 2) + 50·50·76.7082)
  = 15.6132 > F(2, 95)₀.₉₉₉₅ ≈ 8.25.
We must therefore conclude that there is actually extra information in the width measurements which can help us in discriminating setosa from versicolor.
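The test of this section is plain arithmetic once the two Mahalanobis distances are known; plugging in the numbers of example 6.6 reproduces the statistic (up to the rounding of the printed D² values):

```python
def further_info_test(D2_p, D2_pq, n1, n2, p, q):
    """Z = (n1+n2-p-1)/q * n1 n2 (D_p^2 - D_{p-q}^2) / ((n1+n2)(n1+n2-2) + n1 n2 D_{p-q}^2);
    F(q, n1+n2-p-1)-distributed under H0 (the last q variables add nothing)."""
    num = (n1 + n2 - p - 1) / q * n1 * n2 * (D2_p - D2_pq)
    den = (n1 + n2) * (n1 + n2 - 2) + n1 * n2 * D2_pq
    return num / den

# Example 6.6: dropping the two width measurements (q = 2) from the iris data:
Z = further_info_test(103.2119, 76.7082, 50, 50, 4, 2)
print(round(Z, 3))  # close to the 15.6132 found in the text
```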
6.2 Discrimination between several populations

6.2.1 The Bayes solution

The main idea of the generalisation in this section is that one compares the populations pairwise as in the previous section, to finally choose the most probable population. We consider the populations

π1, ..., πk.

Based on measurements of p characteristics (or variables) of a given individual we want to classify it as coming from one of the populations π1, ..., πk.
The observed measurement is

x = (x1, ..., xp)′.

If the individual comes from πi then the frequency function for X is fi(x).
We assume that a loss function L is given as shown in the following table.

                     Choice
               π1        π2       ...    πk
Truth:   π1    0         L(1,2)   ...    L(1,k)
         π2    L(2,1)    0        ...    L(2,k)
         ...
         πk    L(k,1)    L(k,2)   ...    0
Finally we assume we have a prior distribution

pi = P(πi),  i = 1, ..., k.

For an individual with the observation x we define the discriminant value or discriminant score for the i'th population as

Sᵢˣ = − Σⱼ L(j, i) pⱼ fⱼ(x)

(note that L(i, i) = 0, so the sum has no term pi fi(x)). Since the posterior probability for πv is

k(πv | x) = pv fv(x) / h(x),   h(x) = Σⱼ pⱼ fⱼ(x),
we note that by choosing the i'th population, Sᵢˣ is a constant times the expected loss with respect to the posterior distribution of π. Since the proportionality factor −h(x) is negative, we note that the Bayes solution to the decision problem is to choose the population which has the largest discriminant value (discriminant score), i.e. choose πv if

Sᵥˣ ≥ Sᵢˣ  for all i.
If all losses L(i, j) (i ≠ j) are equal we can simplify the expression for the discriminant score. We prefer πi to πj if

Sᵢˣ > Sⱼˣ,

i.e. if

pi fi(x) > pj fj(x).

In this case we can therefore choose the discriminant score

Sᵢˣ = pi fi(x).
In this case the Bayes rule is to choose the population which has the largest posterior probability, i.e. choose group i if Sᵢˣ > Sⱼˣ, ∀ j ≠ i. This rule is not only used where the losses are equal but also where it has not been possible to determine such losses. If the pi's are unknown and it is not possible to estimate them, one usually uses the discriminant score

Sᵢˣ = fi(x),

i.e. one chooses the population where the observed probability is the largest.
The minimax solutions are determined by choosing the strategy which makes all the
misclassification probabilities equally large. (Still assuming that all losses are equal.)
However, we will not go into more detail about this here.
6.2.2 The Bayes solution in the case of several normal distributions

We will now consider the case where

πi ↔ N(μi, Σi),

i.e.

fi(x) = (2π)^(−p/2) (det Σi)^(−1/2) exp(−½(x − μi)′ Σi⁻¹ (x − μi)),  i = 1, ..., k.

Since we get the same decision rule by choosing monotone transformations of the discriminant scores, we will take the logarithm of the fi's and disregard the common factor (2π)^(−p/2). This gives (assuming that the losses are equal)

Sᵢ′ = −½ ln(det Σi) − ½(x − μi)′ Σi⁻¹ (x − μi) + ln pi.
This function is quadratic in x and is called a quadratic discriminant function. If the Σi are equal then the terms

−½ ln(det Σ)  and  −½ x′ Σ⁻¹ x

are common to all the Sᵢ's and can therefore be omitted. We then get

Sᵢ′ = x′ Σ⁻¹ μi − ½ μi′ Σ⁻¹ μi + ln pi.

This is seen to be a linear (affine) function in x and is called a linear discriminant function. If there are only two groups we note that we choose group 1 if

S₁′ ≥ S₂′ ⟺ (x − ½(μ1 + μ2))′ Σ⁻¹ (μ1 − μ2) ≥ ln(p2/p1),

i.e. the same result as p. 260.
The posterior probability for the v'th group becomes

k(πv | x) = pv fv(x) / (p1 f1(x) + ... + pk fk(x)).
It is of course possible to describe the decision rules by dividing ℝᵖ into sets R1, ..., Rk so that we choose πi exactly when x ∈ Ri. Among other things this can be seen from the following
EXAMPLE 6.7. We consider populations π1, π2 and π3 given by normal distributions with expected values

μ1 = (4, 2)′,  μ2 = (1, 1)′,  μ3 = (2, 6)′,

and common variance-covariance matrix

Σ = [ 1 1 ; 1 2 ],

cf. the example p. 264. Assuming that all pi are equal, so that we may disregard them in the discriminant scores, we then have
S₁′ = (x1, x2) [ 2 −1 ; −1 1 ] (4, 2)′ − ½·20 = 6x1 − 2x2 − 10

S₂′ = (x1, x2) [ 2 −1 ; −1 1 ] (1, 1)′ − ½·1 = x1 − ½

S₃′ = (x1, x2) [ 2 −1 ; −1 1 ] (2, 6)′ − ½·20 = −2x1 + 4x2 − 10.
We now choose to prefer π1 to π2 if

u₁₂(x) = S₁′ − S₂′ = (6x1 − 2x2 − 10) − (x1 − ½) = 5x1 − 2x2 − 9½ > 0.
We choose to prefer π1 to π3 if

u₁₃(x) = S₁′ − S₃′ = (6x1 − 2x2 − 10) − (−2x1 + 4x2 − 10) = 8x1 − 6x2 > 0,
and finally we will choose to prefer π2 to π3 if

u₂₃(x) = S₂′ − S₃′ = (x1 − ½) − (−2x1 + 4x2 − 10) = 3x1 − 4x2 + 9½ > 0.
It is now evident that we will choose π1 if both u₁₂(x) > 0 and u₁₃(x) > 0, and analogously with the others. We can therefore define the regions

R1 = { x | u₁₂(x) > 0 ∧ u₁₃(x) > 0 }
R2 = { x | u₁₂(x) < 0 ∧ u₂₃(x) > 0 }
R3 = { x | u₁₃(x) < 0 ∧ u₂₃(x) < 0 },

and we have that we will choose πi exactly when x ∈ Ri.
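The regions of this example can be checked directly in code. The classifier below just implements the three u-functions and the region definitions; the intersection point (57/14, 38/7) is computed here for illustration and is not stated in the text:

```python
def classify(x1, x2):
    """Regions from example 6.7: u12 = S1'-S2', u13 = S1'-S3', u23 = S2'-S3'."""
    u12 = 5 * x1 - 2 * x2 - 9.5
    u13 = 8 * x1 - 6 * x2
    u23 = 3 * x1 - 4 * x2 + 9.5
    if u12 > 0 and u13 > 0:
        return 1            # x in R1
    if u12 < 0 and u23 > 0:
        return 2            # x in R2
    return 3                # x in R3

# Each population mean lands in its own region:
print(classify(4, 2), classify(1, 1), classify(2, 6))  # 1 2 3

# The three boundary lines meet in one point, here (57/14, 38/7):
x1, x2 = 57 / 14, 38 / 7
print(max(abs(u) for u in (5 * x1 - 2 * x2 - 9.5,
                           8 * x1 - 6 * x2,
                           3 * x1 - 4 * x2 + 9.5)) < 1e-9)  # True
```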
We have sketched the situation in figure 6.1.

One can easily prove that the lines will intersect in a point. It is, however, also possible to give a simple argument for this. Let us assume that the situation is as in the figure.
We now note that for the point x we have

u₂₃(x) < 0,  i.e.  f2(x) < f3(x)
u₁₃(x) > 0,  i.e.  f1(x) > f3(x)
u₁₂(x) < 0,  i.e.  f1(x) < f2(x).
We have now established a contradiction, i.e. the three lines determined by u₁₂, u₁₃ and u₂₃ must intersect each other in one single point.
If the parameters are unknown, they are estimated and the estimates are substituted in the above mentioned relations, cf. the procedure in section 6.1.3.
Figure 6.1: (sketch of the regions R1, R2 and R3 of example 6.7)

6.2.3 Canonical Discriminant Analysis

In the previous section we have given one form of the generalisation of discriminant analysis from 2 to several populations. We will now describe another procedure which instead generalises theorem 6.4.

We still consider k groups with n1, ..., nk observations in each, i.e. the same situation as in the one-sided MANOVA (section 5.3.1). The group averages are called
X̄1, ..., X̄k. We define the between groups matrix

B = Σ_{i=1..k} ni (X̄i − X̄)(X̄i − X̄)′,

the within groups matrix

W = Σ_{i=1..k} Σ_{j=1..ni} (Xij − X̄i)(Xij − X̄i)′

and the total matrix

T = Σ_{i=1..k} Σ_{j=1..ni} (Xij − X̄)(Xij − X̄)′.
A fundamental equation is that
T = B+W.
We can now go ahead with the discrimination. We seek a best discriminator function, where best means that the function should maximise the ratio between variation between groups and variation within groups. I.e. we seek a function y = d′x so that

φ(d) = d′Bd / d′Wd   (d is chosen so d′Wd = 1)

is maximised. We note from theorem 1.23 that the maximum value is the largest eigenvalue λ1, and that the corresponding eigenvector d1 solves

det(B − λW) = 0

or

det(W⁻¹B − λI) = 0.
We then seek a new discriminant function d2 so that

φ(d2)

is maximised under the constraint that

d2′ W d1 = 0.

This corresponds to the second largest eigenvalue of W⁻¹B and the corresponding eigenvector. In this way one can continue until one gets an eigenvalue of W⁻¹B which is 0 (or until W⁻¹B is exhausted).
A plot of the values

( d1′(xij − x̄), d2′(xij − x̄) )

is a very useful way of visualising the data. These plots separate the data best in the sense described above, i.e. maximising the difference between groups with respect to the variation within groups.
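A minimal sketch of the computation with two small made-up groups (all data invented for illustration): B and W are formed as above, and the leading eigenvector of W⁻¹B is found by power iteration. For k = 2 groups, B has rank one, so this direction must be proportional to W⁻¹(x̄1 − x̄2):

```python
def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def solve2(A, b):
    """Solve a 2x2 system by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

# Two hypothetical groups of bivariate observations:
g1 = [(4.0, 2.0), (5.0, 3.0), (4.5, 2.2), (3.5, 1.8)]
g2 = [(1.0, 1.0), (2.0, 1.5), (1.5, 0.5), (1.0, 2.0)]

def mean(g):
    n = len(g)
    return [sum(x for x, _ in g) / n, sum(y for _, y in g) / n]

m1, m2 = mean(g1), mean(g2)
n1, n2 = len(g1), len(g2)
m = [(n1 * m1[i] + n2 * m2[i]) / (n1 + n2) for i in range(2)]  # grand mean

B = [[0.0, 0.0], [0.0, 0.0]]   # between groups matrix
for ni, mi in ((n1, m1), (n2, m2)):
    for i in range(2):
        for j in range(2):
            B[i][j] += ni * (mi[i] - m[i]) * (mi[j] - m[j])
W = [[0.0, 0.0], [0.0, 0.0]]   # within groups matrix
for g, mi in ((g1, m1), (g2, m2)):
    for x in g:
        for i in range(2):
            for j in range(2):
                W[i][j] += (x[i] - mi[i]) * (x[j] - mi[j])

# Power iteration on W^{-1}B gives the first canonical discriminant direction d1:
d = [1.0, 1.0]
for _ in range(100):
    d = solve2(W, mat_vec(B, d))
    norm = max(abs(c) for c in d)
    d = [c / norm for c in d]

# For k = 2 groups d1 is proportional to W^{-1}(m1 - m2):
ref = solve2(W, [m1[0] - m2[0], m1[1] - m2[1]])
print(abs(d[0] * ref[1] - d[1] * ref[0]) < 1e-9)  # collinear -> True
```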
Another useful plot consists of the vectors

( d11, d21 ), ..., ( d1p, d2p ).

These show with which weight the value of each single variable contributes to the plot in the (d1, d2)-plane.

The functions dᵢ′x are called Canonical Discriminant Functions (CDF) and the type of analysis a Canonical Discriminant Analysis (CDA).
EXAMPLE 6.8. In the following table we give mean values and standard deviations
for the content of different elements of 208 washed soil samples collected in Jameson
Land. The variable Sum gives the sum of the content of Y and La.
Variable Mean Value Standard deviation
B 73 141
Ti 40563 22279
V 678 491
Cr 1135 1216
Mn 2562 2081
Fe 225817 122302
Co 62 26
Ni 116 54
Cu 69 56
Ga 21 10
Zr 14752 14771
Mo 29 20
Sn 56 99
Pb 351 786
Sum  
A distributional analysis showed that the data were best approximated by LN-distributions. Therefore all numbers were logarithmically transformed and were furthermore standardised in order to obtain a mean of 0 and a variance of 1. The problem is to what extent the content of the elements characterises the different geologic periods involved in the area. The numbers of measurements from the different periods are given below.
Period Number
Jura 17
Trias 80
Perm 30
Carbon 9
Devon 31
Tertiary intrusives 35
Caledonian crystalline 4
Eleonora Bay Formation 2
In order to examine this some discriminant analyses were performed. We will not pursue this further here. We will simply illustrate the use of the previously mentioned plot, see figure 6.2.
Figure 6.2: (scatter plot of the soil samples on the two first canonical discriminant functions; the plotting symbols indicate the geological period)
Figure 6.3: (the coefficients of the individual variables, e.g. Mn, on the two canonical discriminant functions)
In figure 6.3 the coefficients for the original variables on the two canonical discriminant functions are shown. By comparing the two figures one can e.g. see that Cu is fairly specific for Devon, and overall the figures give quite a good impression of what the distribution of elements is for the different periods.
Chapter 7

Principal components, canonical variables and correlations, and factor analysis
In this chapter we will give a first overview of some of the methods which can be used to show the underlying structure in a multidimensional data material.

Principal components simply correspond to the results of an eigenvalue analysis of the variance-covariance matrix for a multidimensional stochastic variable. The method has its origin around the turn of the century (Karl Pearson), but it was not until the thirties that it got its precise formulation by Harold Hotelling.
Factor analysis was originally developed by psychologists, Spearman (1904) and Thurstone, at the beginning of the previous century. Because of this the terminology has unfortunately largely been determined by the terminology of the psychologists. Around 1940 Lawley developed the maximum likelihood solutions to the problems in factor analysis, developments which have later been refined by Jöreskog, who in this period introduced factor analysis as a "statistical method".
The canonical variables and correlations also date back to Harold Hotelling. The concept resembles principal components a lot; however, we are now considering the correlation between two sets of variables instead of just transforming a single one.
7.1 Principal components
7.1.1 Definition and simple characteristics
We consider a multidimensional stochastic variable

X = (X1, ..., Xk)′,

which has the variance-covariance (dispersion) matrix

D(X) = Σ,

and without loss of generality we can assume it has the mean value 0. We will sort the eigenvalues of Σ in descending order and will denote them

λ1 ≥ ... ≥ λk ≥ 0.

The corresponding orthonormal eigenvectors are denoted

p1, ..., pk,

and we define the orthogonal matrix P by

P = (p1, ..., pk).

We then have the following
DEFINITION 7.1. By the i'th principal axis of X we mean the direction given by the vector pi corresponding to the i'th largest eigenvalue.

DEFINITION 7.2. By the i'th principal component of X we will understand the projection Yi = pi′X on the i'th principal axis.
The vector

Y = (Y1, ..., Yk)′ = P′X

is called the vector of principal components.

[Figure: the unit ellipsoid with its 1st and 2nd principal axes.]
The situation has been sketched geometrically in the figure above, where we have drawn the unit ellipsoid corresponding to the variance-covariance structure, i.e. the ellipsoid with the equation

x′ Σ⁻¹ x = 1.

It is seen that the principal axes are the main axes in this ellipsoid.
A number of theorems hold about the characteristics of the principal components. Most of these theorems are statistical reformulations of a number of the results on symmetric positive semidefinite matrices which are given in chapter 1.
THEOREM 7.1. The principal components are uncorrelated and the variance of the i'th component is λi, i.e. the i'th largest eigenvalue.

PROOF. From the theorems 2.5 (p. 60) and 1.10 (p. 30) we have

D(Y) = D(P′X) = P′ΣP = Λ = diag(λ1, ..., λk),
and the result follows readily.
Further we have

THEOREM 7.2. The generalised variance of the principal components is equal to the generalised variance of the original observations.

PROOF. From the definition p. 106 we have

GV(X) = det Σ

and

GV(Y) = det Λ = λ1 ··· λk,

and since det Σ = det(PΛP′) = det Λ, the result follows.
A similar result is the following

THEOREM 7.3. The total variance, i.e. the sum of the variances of the original variables, is equal to the sum of the variances of the principal components, i.e.

Σᵢ V(Xᵢ) = Σᵢ V(Yᵢ).

PROOF. Since

Σᵢ V(Xᵢ) = tr Σ

and

Σᵢ V(Yᵢ) = tr Λ,

the result follows from tr Σ = tr(PΛP′) = tr(ΛP′P) = tr Λ.
Finally we have
THEOREM 7.4. The first principal component is the linear combination (with normed coefficients) of the original variables which has the largest variance. The m'th principal component is the linear combination (with normed coefficients) of the original variables which is uncorrelated with the first m − 1 principal components and then has the largest variance. Formally expressed:

sup_{‖b‖=1} V(b′X) = λ1,

and the supremum is attained for b = p1. Further we have

sup_{b ⊥ p1,...,p_{m−1}, ‖b‖=1} V(b′X) = λm,

and the supremum is attained for b = pm.
PROOF. Since

V(b′X) = b′Σb

and

C(Yᵢ, b′X) = pᵢ′Σb = λᵢ pᵢ′b,

so that

C(Yᵢ, b′X) = 0 ⟺ pᵢ ⊥ b,

the theorem is just a reformulation of theorem 1.15 p. 36.
REMARK 7.1. From the theorem we have that if we seek the linear combination of the original variables which explains most of the variation in these, then the first principal component is the solution. If we seek the m variables which explain most of the original variation, then the solution is the m first principal components. A measure of how well these describe the original variation is found by means of theorems 7.1 and 7.3, which show that the m first principal components describe the fraction

(λ1 + ... + λm) / (λ1 + ... + λk)

of the original variation.
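The content of theorems 7.1-7.3 and remark 7.1 can be illustrated for a hypothetical 2×2 dispersion matrix (made-up numbers), where the eigenvalues are available in closed form:

```python
import math

# Hypothetical dispersion matrix Sigma:
S = [[4.0, 1.2], [1.2, 1.0]]

# Eigenvalues of a symmetric 2x2 matrix in closed form:
tr = S[0][0] + S[1][1]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
root = math.sqrt(tr * tr / 4.0 - det)
lam1, lam2 = tr / 2.0 + root, tr / 2.0 - root

# Theorem 7.2: the generalised variance det(Sigma) equals lam1*lam2;
# theorem 7.3: the total variance tr(Sigma) equals lam1+lam2.
assert abs(lam1 * lam2 - det) < 1e-12
assert abs(lam1 + lam2 - tr) < 1e-12

# An (unnormalised) eigenvector for lam1, i.e. the first principal axis:
p1 = [S[0][1], lam1 - S[0][0]]

# Remark 7.1: fraction of the total variation described by the first component:
print(round(lam1 / (lam1 + lam2), 3))  # 0.884
```

For this matrix the first principal component alone accounts for roughly 88% of the total variation, so by the retention strategies discussed below one would keep only that component.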
A better and more qualified measure of how good the "recreation ability" is, is found by trying to reconstruct the original X from the vector

Y* = (Y1, ..., Ym, 0, ..., 0)′.

Since

Y = P′X ⟺ X = PY,

it is tempting to try with

X* = PY*.
We find

D(X*) = P D(Y*) P′ = P diag(λ1, ..., λm, 0, ..., 0) P′,

or

D(X*) = λ1 p1 p1′ + ... + λm pm pm′.

The spectral decomposition of Σ is (p. 31)

Σ = λ1 p1 p1′ + ... + λk pk pk′,

which means that

Σ − D(X*) = λ_{m+1} p_{m+1} p_{m+1}′ + ... + λk pk pk′.

If there is a large difference between the eigenvalues then the smallest ones will be negligible, and the difference between the original variance-covariance matrix and the one "reconstructed" from the first m principal components is therefore small.
7.1 .2 Estimation and Testing
If the variance-covariance matrix is unknown but is estimated on the basis of n observations, then one estimates the principal components and their variances simply by using the estimated variance-covariance matrix as if it were known. If all the eigenvalues in Σ̂ are different, it can be shown that the eigenvalues and eigenvectors we get in this way are maximum likelihood estimates of the true parameters (see e.g. [2]).
There is, however, a very common problem here since it can be shown that the principal
components are dependent of the scales of measurements our original variables have
been measured in. Therefore one often chooses only to consider the normed (standard
ised) variables i.e.
where
i = 1, ... ,no
This transformation corresponds to analysing the empirical correlation matrix instead
of analysing the empirical variance covariance matrix.
If one decides to use only some of the principal components in the further analysis one could e.g. choose the strategy of retaining as many components as are needed to account for at least e.g. 90% of the total variation.
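The 90% strategy can be sketched directly from the eigenvalues (a hypothetical 3-variable covariance matrix is used here for illustration):

```python
import numpy as np

# Retain the smallest number m of principal components whose eigenvalues
# account for at least 90% of the total variation (hypothetical matrix).
Sigma = np.diag([6.0, 3.0, 1.0])

eigvals = np.linalg.eigvalsh(Sigma)[::-1]      # eigenvalues, descending order
cum_frac = np.cumsum(eigvals) / eigvals.sum()  # (l1+...+lm) / (l1+...+lk)
m = int(np.searchsorted(cum_frac, 0.90)) + 1   # first m reaching 90%
print(m)  # → 2, since 6/10 = 60% and (6+3)/10 = 90%
```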
Another criterion would be to test a hypothesis like

H0: λ_{m+1} = · · · = λk

against the alternative that we have a distinct "greater than" (>) among the k − m last eigenvalues.
If we are using the estimated variance-covariance matrix Σ̂, the test statistic becomes

Z1 = −n' ln( (λ̂_{m+1} · · · λ̂k) / λ̄^{k−m} ),

where

n' = n − m − (1/6)( 2(k − m) + 1 + 2/(k − m) ),

and

λ̄ = (λ̂_{m+1} + · · · + λ̂k)/(k − m).
The critical region using a test at level α is approximately

{ (x1, . . . , xn) | z1 > χ²( (1/2)(k − m + 2)(k − m − 1) )_{1−α} }.
If we instead are using the estimated correlation matrix R we get the criterion

Z2 = −n ln( det R / (λ̂1 · · · λ̂m · λ̄^{k−m}) ) = −n ln( (λ̂_{m+1} · · · λ̂k) / λ̄^{k−m} ),

where

λ̄ = (λ̂_{m+1} + · · · + λ̂k)/(k − m).
The critical region for a test at level α becomes approximately equal to

{ (x1, . . . , xn) | z2 > χ²( (1/2)(k − m + 2)(k − m − 1) )_{1−α} }.

However, it should be noted that this approximation is far worse than the corresponding approximation for the variance-covariance matrix.
A discussion of the above mentioned tests can be found in [16].
We now give an example.
EXAMPLE 7.1. The example is based on an example from [6] p. 486. The basic material is measurements of seven variables on 25 boxes with randomly chosen sides. The seven variables are
X1: longest side
X2: second longest side
X3: smallest side
X4: longest diagonal
X5: radius in the circumscribed sphere divided by radius in the inscribed sphere
X6: (longest side + second longest side)/shortest side
X7: surface area/volume.
In the following table we have shown some of the observations of the seven variables:

Box    X1     X2     X3     X4      X5     X6      X7
1      3.760  3.660  0.540  5.275   9.768  13.741  4.782
2      8.590  4.990  1.340  10.022  7.500  10.162  2.130
...
24     8.210  3.080  2.420  9.097   3.753  4.657   1.719
25     9.410  6.440  5.110  12.495  2.446  3.103   0.914
We will now consider the question: Which things about a box determine how we per
ceive its size?
In order to answer this question we will perform a principal component analysis of the
above mentioned data. By such an analysis we hope to find out if the above mentioned
7 variables, which all in one way or another are related to "size" or "form", vary freely
in the 7 dimensional space or if they are more or less concentrated in some subspaces.
We first give the empirical variance-covariance matrix for the variables. It is

        5.400   3.260   0.779   6.391   2.155   3.035  −1.996
        3.260   5.846   1.465   6.083   1.312   2.877  −2.370
        0.779   1.465   2.774   2.204  −3.839  −5.167  −1.740
Σ̂ =     6.391   6.083   2.204   9.107   1.610   2.782  −3.283
        2.155   1.312  −3.839   1.610  10.710  14.770   2.252
        3.035   2.877  −5.167   2.782  14.770  20.780   2.622
       −1.996  −2.370  −1.740  −3.283   2.252   2.622   2.594
Then we determine the eigenvalues and eigenvectors for Σ̂. The eigenvalues are given in descending order together with the fraction and the cumulated fraction of the total variance that they contribute:
Eigenvalue λ̂i, i = 1, . . . , 7    Percentage of total variance    Cumulated percentage of total variance
34.490    60.290    60.290
19.000    33.210    93.500
2.540     4.440     97.940
0.810     1.410     99.350
0.340     0.600     99.950
0.033     0.060     100.010
0.003     0.004     100.014
Computational errors in the determination of the eigenvalues lead to deviations like the
cumulated sum being more than 100%.
The corresponding coordinates of the eigenvectors are shown in the following table.

Variable   p1      p2      p3      p4      p5      p6      p7
X1         0.164   0.422   0.645  −0.090   0.225   0.415  −0.385
X2         0.142   0.447  −0.713   0.050   0.395   0.066   0.329
X3         0.173   0.257   0.130   0.629  −0.607   0.280   0.211
X4         0.170   0.650   0.146   0.212   0.033  −0.403   0.565
X5         0.546  −0.135   0.105   0.165  −0.161  −0.596   0.513
X6         0.768   0.133  −0.149   0.062   0.207   0.465   0.327
X7         0.073  −0.313   0.065   0.719   0.596   0.107   0.092
It is seen that the first eigenvector, which is the direction that corresponds to more than 60% of the total variation, has especially numerically large 5th and 6th coordinates. This means that the first principal component

Y1 = 0.164 X1 + · · · + 0.546 X5 + 0.768 X6 + 0.073 X7

is especially sensitive to variations in X5 and X6. These two variables, the ratio between the radius in the circumscribed sphere and the radius in the inscribed sphere, and the ratio between the sum of the two longest sides and the shortest side, both have something to do with how "flat" a box is. The larger these two variables, the flatter the box. Therefore, the first principal component measures the difference in "flatness" of the boxes. The second eigenvector has large positive coordinates for the first four variables and a fairly large negative coordinate for the last variable. If the second principal component

Y2 = 0.422 X1 + 0.447 X2 + 0.257 X3 + 0.650 X4 + · · · − 0.313 X7,
is large then one or more of the variables X1, . . . , X4 must be large while X7 is small.
Now we know that a cube is the box which for a given volume has the smallest surface area. Therefore we also know that if a box deviates a lot from a cube then it will have a large X7 value, and this corresponds to a very strong reduction of Y2. A large Y2 value therefore indicates that most of the sides are large, and furthermore, more or less equal. We therefore conclude that Y2 measures a more general perception of size.
In the following figure we have depicted the boxes in a coordinate system whose axes are the first two principal axes. The coordinates for a single box then become the values of the first and the second principal component for that specific box. For the first box we e.g. find

Y1 = 0.164 · 3.760 + · · · + 0.073 · 4.782 = 18.18
Y2 = 0.422 · 3.760 + · · · − 0.313 · 4.782 = 2.15.
At the coordinate (18.18, 2.15) we have then drawn a picture of box No. 1, etc. From this graph we also very clearly see the interpretation we have given the principal components. To the left in the graph, corresponding to small values of component No. 1, we have shown the "fattest" boxes and to the right the "flattest". At the top of the graph, corresponding to big values of component No. 2, we have the big boxes and at the bottom we have the small ones.
On the other hand we do not seem to have any precise discrimination between the oblong boxes and the more flat boxes. This discrimination is first seen when we consider the third principal component. It is

Y3 = 0.645 X1 − 0.713 X2 + · · · + 0.065 X7.
Figure 7.1: The boxes depicted in the coordinate system whose axes are the first two principal components.
This component puts a large positive weight on variable No. 1, the length of the largest side, and a large negative weight on the length of the second largest side. An oblong box will have X1 ≫ X2 and therefore Y3 will be relatively large for such a box. If the base of the box corresponding to the two largest sides is close to a square then Y3 will be close to 0 for the respective box.
The three first principal components then take care of about 98% of the total variation and by means of these we can partition a box's "size characteristics" in three uncorrelated components: one corresponding to the flatness of the box (Y1), one which corresponds to a more general concept of size (Y2), and one which corresponds to "the degree of oblongness" (Y3). Now the initial question of: What is "the size of a box" should at least be partly illustrated.
The next example is based on some investigations by Agterberg et al. (see [1] p. 128).
EXAMPLE 7.2. The Mount Albert peridotite intrusion is part of the Appalachian ultramafic belt in the Quebec province. A number of mineral samples were collected and the values of the 4 following variables were determined:
X1: mol% forsterite (= Mg-olivine)
X2: mol% enstatite (= Mg-orthopyroxene)
X3: dimension of unit cell of chrome spinel
X4: specific density of mineral sample.
Using between 99 and 156 observations the following correlation matrix between the variables was estimated:
        1.00   0.32   0.41  −0.31
R =     0.32   1.00   0.68  −0.38
        0.41   0.68   1.00  −0.36
       −0.31  −0.38  −0.36   1.00
It is quite obvious that we should analyse the correlation matrix rather than the variance-covariance matrix. Because we are analysing variables which are measured in non-comparable units we must standardise the numbers.
The eigenvalues and the corresponding eigenvectors are

λ1 = 2.25;  p1 = (0.43, 0.55, 0.57, −0.44)'
λ2 = 0.74;  p2 = (−0.66, 0.49, 0.37, 0.44)'
λ3 = 0.70;  p3 = (0.60, 0.02, 0.16, 0.78)'
λ4 = 0.31;  p4 = (−0.14, −0.68, 0.72, −0.06)'
All the eigenvectors have fairly large coordinates in most places so there does not seem to be any obvious possibility of giving an intuitive interpretation of the principal components.
The first principal component corresponds to 2.25/4 = 56.25% of the total variation. It would be interesting to know if the three smallest eigenvalues of the correlation matrix can be considered as being of the same magnitude.
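As a quick numerical check (a sketch, assuming the correlation matrix as given above), the eigenvalues can be recomputed:

```python
import numpy as np

# Eigenvalues of the 4x4 correlation matrix from example 7.2; they should be
# close to the tabulated 2.25, 0.74, 0.70, 0.31 and sum to 4 (the trace).
R = np.array([[ 1.00,  0.32,  0.41, -0.31],
              [ 0.32,  1.00,  0.68, -0.38],
              [ 0.41,  0.68,  1.00, -0.36],
              [-0.31, -0.38, -0.36,  1.00]])

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
print(np.round(eigvals, 2))
```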
The test statistic we will use is

Z = −n ln[ (0.74 · 0.70 · 0.31) / ((0.74 + 0.70 + 0.31)/3)³ ] = 0.2120 n,
where n is the number of observations on which we have based the correlation matrix. Since this number is not the same for all the different correlation coefficients the
theoretical background for the test disappears so to speak. However, if we disregard
that problem, then the number of degrees of freedom in the χ² distribution with which to compare the test statistic becomes

f = (1/2)(4 − 1 + 2)(4 − 1 − 1) = 5.
Since

χ²(5)_{0.995} = 16.7,

and since 0.2120 n, for n approximately equal to 100, is quite a lot larger than this value, it would be reasonable to conclude that the three smallest eigenvalues in the (true) correlation matrix are not of the same order of magnitude.
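The computation above can be reproduced directly (a sketch; the function name is ours, and n = 100 is the approximate sample size mentioned in the text):

```python
import math

def eigenvalue_equality_statistic(eigenvalues, n):
    """Z = -n * ln( (l_1 ... l_r) / lbar^r ), with lbar the mean eigenvalue;
    tests whether the given eigenvalues can be considered equal."""
    r = len(eigenvalues)
    lbar = sum(eigenvalues) / r
    return -n * math.log(math.prod(eigenvalues) / lbar**r)

Z = eigenvalue_equality_statistic([0.74, 0.70, 0.31], n=100)
print(round(Z, 1))  # → 21.2, well above the chi2(5) 99.5% quantile of 16.7
```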
7.2 Canonical variables and canonical correlations
7.2.1 Definition and properties
In the following we will discuss dependency between groups of variables, where in the last section we only looked at dependency (correlation structure) between single variables.
We consider a random variable Z where p ≤ q and Z and the parameters have been partitioned as follows:

Z = (X', Y')',  E(Z) = (μX', μY')',  D(Z) = Σ = [ ΣXX  ΣXY ; ΣYX  ΣYY ].
If we on the basis of n observations of Z wish to investigate if X and Y are independent this could be done as shown in section 5.3.1 by investigating

det(S) / ( det(SXX) det(SYY) ),

which is U_{p,q,n−1−q} distributed under H0. We will now try to consider the problem from another point of view. We will consider two one-dimensional variables U and V given
by
U = a' X and V = b'Y.
Then we have

V(U) = a'ΣXX a,  V(V) = b'ΣYY b,  Cov(U, V) = a'ΣXY b,

and the correlation between U and V is

ρ(a, b) = a'ΣXY b / √( (a'ΣXX a)(b'ΣYY b) ).

Now we have that X and Y are independent if and only if for all a and b

ρ(a, b) = 0.
The accept region for the hypothesis ρ(a, b) = 0 is of the form (cf. chapter 2)

r²(a, b) ≤ r²_β,

where r(a, b) is the empirical correlation coefficient and r_β is a suitable quantile in the distribution of r under the null hypothesis. We therefore have an accept of ρ ≡ 0 if

∀ a, b:  r²(a, b) ≤ r²_β,

which is obviously equivalent to

max_{a,b} r²(a, b) ≤ r²_β.

We now have that the 2 groups are independent if the maximal (empirical) correlation coefficient between a linear combination of the first group and a linear combination from the second group is suitably small. This maximum correlation coefficient is called the first (empirical) canonical correlation coefficient and the corresponding variables the first (empirical) canonical variables.
It is now obvious that, as in the case of the principal components, we can continue the definitions. We can define the second canonical correlation coefficient as the maximum correlation between linear combinations of the X's and of the Y's such that these combinations are independent of the previous ones, etc. Formally we have

DEFINITION 7.3. Let Z = (X', Y')' be a stochastic variable where X has p components and Y has q components (p ≤ q). The r'th pair of canonical variables is the pair of linear combinations

Ur = αr' X,  Vr = βr' Y,
which each have variance 1, which are uncorrelated with the previous r − 1 pairs of canonical variables, and which have maximum correlation. The correlation is the r'th canonical correlation.
Now we have the problem of determining the canonical variables and correlations. We
have the following theorem:
THEOREM 7.5. Let the situation be as given in the above mentioned definition and let

D(Z) = Σ = [ ΣXX  ΣXY ; ΣYX  ΣYY ]

be partitioned analogously to Z. Then the r'th canonical correlation is equal to the r'th largest root ρr of

det [ −ρ ΣXX   ΣXY ; ΣYX   −ρ ΣYY ] = 0,

and the coefficients in the r'th pair of canonical variables satisfy

(i)  [ −ρr ΣXX   ΣXY ; ΣYX   −ρr ΣYY ] (αr ; βr) = 0

(ii)  αr' ΣXX αr = 1

(iii)  βr' ΣYY βr = 1.
PROOF. We have a maximisation problem with constraints and one can solve the problem by using a Lagrange multiplier technique, see e.g. [2] p. 289.
One can also determine the correlations and the coefficients by solving an eigenvalue problem since we have
THEOREM 7.6. Let the situation be as in the previous theorem. Then we have

( ΣXY ΣYY⁻¹ ΣYX − ρ² ΣXX ) α = 0,

respectively

( ΣYX ΣXX⁻¹ ΣXY − ρ² ΣYY ) β = 0.

PROOF. Omitted, see e.g. [2].

7.2.2 Estimation and testing
If the parameters are unknown they may be estimated from observations. If we insert the maximum likelihood estimates for Σ in the previous theorems we get the maximum likelihood estimates of the parameters. Most often one will probably insert the usual estimate S and one then gets what one can call the empirical values for the parameters involved. More specifically we will assume that we have n independent observations of Z organized in a data matrix
            x11 · · · x1p   y11 · · · y1q
[X Y] =      ·         ·     ·         ·
            xn1 · · · xnp   yn1 · · · ynq
and we assume that the mean has been subtracted from the variables. Then we have the unbiased estimator Σ̂ given by

(n − 1) Σ̂ = [X Y]' [X Y] = [ X'X  X'Y ; Y'X  Y'Y ].
Based on this matrix we can then obtain estimates of canonical correlations and variables by using the formulas in the preceding section.
In order to test whether the canonical correlations are 0 we set up matrices similar to what was done in the multivariate linear model for the case where the Y's are predicted by means of the X's. Thus
T = Y'Y = (n − 1) Σ̂YY
H = Y'X (X'X)⁻¹ X'Y = (n − 1) Σ̂YX Σ̂XX⁻¹ Σ̂XY
E = T − H = (n − 1) ( Σ̂YY − Σ̂YX Σ̂XX⁻¹ Σ̂XY ).
We see that T corresponds to the total variation and E to the residual variation after having predicted Y by means of X.
The eigenvalues of T⁻¹H are solutions λr to

det( T⁻¹H − λ I ) = 0

or

(H − λT) β = 0,

i.e.

( Σ̂YX Σ̂XX⁻¹ Σ̂XY − λ Σ̂YY ) β = 0.

The r'th solution λr = ρ̂r² is equal to the r'th squared canonical correlation according to Theorem 7.6. Next we find the eigenvalues of E⁻¹H, i.e. we must solve

det( E⁻¹H − γ I ) = 0

or

( H − γ(T − H) ) β = 0,

which gives

( (1 + γ) H − γ T ) β = 0

and

( T⁻¹H − γ/(1 + γ) I ) β = 0.

Therefore

λr = γr/(1 + γr)  ⟺  γr = λr/(1 − λr).
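The relation between the two sets of eigenvalues can be verified numerically (a sketch with simulated, centred data; all names are ours):

```python
import numpy as np

# With T = Y'Y, H = Y'X(X'X)^{-1}X'Y and E = T - H, the eigenvalues gamma of
# E^{-1}H and lambda of T^{-1}H satisfy gamma = lambda / (1 - lambda).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2)); X -= X.mean(axis=0)
Y = rng.standard_normal((20, 3)); Y -= Y.mean(axis=0)

T = Y.T @ Y
H = Y.T @ X @ np.linalg.solve(X.T @ X, X.T @ Y)
E = T - H

lam = np.sort(np.linalg.eigvals(np.linalg.solve(T, H)).real)
gam = np.sort(np.linalg.eigvals(np.linalg.solve(E, H)).real)
print(np.allclose(gam, lam / (1 - lam)))  # → True
```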
It now follows that tests for whether the (squared) canonical correlations are zero are equivalent to tests of whether the eigenvalues of E⁻¹H are zero. Here we may refer back
to the tests presented in section 5.2 and obtain the possibilities Wilks' Lambda (Anderson's U), Pillai's Trace, Hotelling-Lawley's Trace, and Roy's maximum root. In addition to the above tests SAS also provides output enabling a thorough analysis of how well different sets of variables are explained by other sets of variables.
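A sketch of how the canonical correlations can be computed from a partitioned (co)variance matrix via the eigenvalue problem of Theorem 7.6 (the function name is ours):

```python
import numpy as np

def canonical_correlations(S, p):
    """Canonical correlations from S = [[Sxx, Sxy], [Syx, Syy]], where the
    first p rows/columns belong to X. The squared canonical correlations
    are the eigenvalues of Sxx^{-1} Sxy Syy^{-1} Syx."""
    Sxx, Sxy = S[:p, :p], S[:p, p:]
    Syx, Syy = S[p:, :p], S[p:, p:]
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Syx)
    rho2 = np.sort(np.linalg.eigvals(M).real)[::-1]
    return np.sqrt(np.clip(rho2, 0.0, 1.0))

# With p = q = 1 the single canonical correlation is just |corr(X, Y)|:
S = np.array([[1.0, 0.6],
              [0.6, 1.0]])
print(canonical_correlations(S, p=1))  # → approximately [0.6]
```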
7.3 Factor analysis
Once again we will consider the analysis of the correlation structure for a single multidimensional variable, but contrary to the case in the section on principal components we here assume an underlying model of the structure.
7.3.1 Model and assumptions
It is assumed that we have an observation

X = (X1, . . . , Xk)',

which, considering the situation historically, can be thought of as a single person's scores in e.g. k different types of intelligence tests or the reactions of a person to k different stimuli. One then has a model for how one thinks that these reactions (scores) depend on m underlying factors, or more specifically that
X = A F + G,

or in more detail

Xj = aj1 F1 + · · · + ajm Fm + Gj,  j = 1, . . . , k.
Here we call F the vector of common factors; they are also called factor scores and they are not observable. Examples of these are characteristics like three-dimensional intelligence, verbal intelligence etc.
The elements of the A matrix are called factor loadings and they give the weights with which the single factors enter the description of the different variables. If one e.g. assumes that F1 describes 3-dimensional intelligence and Fm verbal intelligence, and that Xl is the result of a test of a 3-dimensional kind and Xk the result of a reading test, then one will obviously have that al1 is large and alm is small, and vice versa that ak1 is small and akm is large, corresponding to the 3-dimensional intelligence being determining for a person's scores in the solving of 3-dimensional problems and analogously for the verbal intelligence.
The vector G is called the vector of unique factors and can be thought of as composed of some specific factors, i.e. factors which are special for these specific tests, and of errors, i.e. non-describable deviations. Obviously these factors are not observable either.
Here we must emphasize that both X and F and G are assumed to be stochastic. Therefore we are not considering a general linear model with the parameters F1, . . . , Fm. In order to make this difference quite clear we therefore give the model in the case where we have several observations X1, . . . , Xn. We then have the n models

Xi = A Fi + Gi,  i = 1, . . . , n.

Here we note that Fi and Gi change value when the observations Xi change value.
We can aggregate the above models into

(X1 · · · Xn) = A (F1 · · · Fn) + (G1 · · · Gn).
It is assumed that F and G are uncorrelated and that

E(F) = 0,  D(F) = I = Im,

and

E(G) = 0,  D(G) = Δ = diag(δ1, . . . , δk).
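These assumptions can be illustrated by simulation: with D(F) = I and D(G) = Δ diagonal, the variance-covariance matrix of X = A F + G is A A' + Δ. A sketch with a hypothetical loading matrix:

```python
import numpy as np

# Simulate X = A F + G with uncorrelated F and G, D(F) = I, D(G) = diag(delta);
# the empirical covariance of X should approach A A' + diag(delta).
rng = np.random.default_rng(1)
A = np.array([[0.9, 0.1],
              [0.8, 0.3],
              [0.2, 0.7]])
delta = 1.0 - np.sum(A**2, axis=1)   # chosen so that V(X_j) = 1

n = 200_000
F = rng.standard_normal((n, 2))
G = rng.standard_normal((n, 3)) * np.sqrt(delta)
X = F @ A.T + G

print(np.allclose(np.cov(X, rowvar=False), A @ A.T + np.diag(delta), atol=0.02))  # → True
```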
Furthermore, we assume that the observations are standardised in such a way that V(Xi) = 1, ∀i, i.e. that the variance-covariance matrix for X is equal to its correlation matrix, which is denoted

            1    r12  · · ·  r1k
D(X) = R =  r21  1    · · ·  r2k
            ·    ·           ·
            rk1  rk2  · · ·  1
From the original factor equation we find by means of theorem 2.5 p. 60, that

R = D(X) = A D(F) A' + D(G) = A A' + Δ.

From this we especially find that for j = 1, . . . , k we have

1 = a²j1 + · · · + a²jm + δj.
Here we introduce the notation

h²j = a²j1 + · · · + a²jm,  j = 1, . . . , k.
These quantities are called communalities and h²j describes how large a proportion of Xj's variance is due to the m common factors. Correspondingly δj gives the uniqueness in Xj's variance, i.e. the proportion of Xj's variance which is not due to the m common factors.
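For a given loading matrix the communalities and uniquenesses are simple row sums (a sketch; the matrix A is hypothetical):

```python
import numpy as np

# Communalities h_j^2 = a_j1^2 + ... + a_jm^2 (row sums of squared loadings)
# and uniquenesses delta_j = 1 - h_j^2 for standardised variables.
A = np.array([[0.9, 0.1],
              [0.8, 0.3],
              [0.2, 0.7]])

h2 = np.sum(A**2, axis=1)
delta = 1.0 - h2
print(h2)  # → [0.82 0.73 0.53] up to floating point
```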
Finally the (i, j)'th factor weight gives the correlation between the i'th variable and the j'th factor, i.e.

Cov(Xi, Fj) = Cov( Σv aiv Fv + Gi , Fj ) = aij.
It can be shown [7] that

h²j ≥ r²j,

i.e. that the j'th communality is always larger than or equal to the square r²j of the multiple correlation coefficient between Xj and the rest of the variables. This is not surprising when remembering that this quantity exactly equals the proportion of Xj's variance which is described by the variance in the other Xi's.
7.3.2 Estimation of factor loadings
We now turn to the more basic problem of estimating the factors. What we are interested in determining is A. We find

A A' = R − Δ.
The diagonal elements in this matrix are

1 − δj = h²j,  j = 1, . . . , k.
We do not know these but we could estimate them e.g. by inserting the squares of the multiple correlation coefficients. If we insert these we get a matrix

     r²1   r12  · · ·  r1k
V =  r21   r²2  · · ·  r2k
     ·     ·           ·
     rk1   rk2  · · ·  r²k

in which the elements outside the diagonal are equal to the original correlation matrix R's elements. This matrix is still symmetric but not necessarily positive semidefinite. However, since it is still an estimate of one, we will (silently) assume that it is positive semidefinite. Independently of how the communalities have been estimated the resulting "correlation matrix" is called V. V could e.g. be the above mentioned.
We will call the eigenvalues of V and the corresponding normed orthogonal eigenvectors respectively

λ1 ≥ · · · ≥ λk  and  p1, . . . , pk.

If we let

P = (p1 · · · pk),

we then have from theorem 1.10 p. 30, that

P' V P = Λ = diag(λ1, . . . , λk).

Since P is orthogonal (as a consequence of the eigenvectors being orthonormal) we get

V = P Λ P' = (P Λ^(1/2))(P Λ^(1/2))',
where

Λ^(1/2) = diag(√λ1, . . . , √λk).

We now define Λm^(1/2) as the k × m matrix

            √λ1  · · ·  0
            ·           ·
Λm^(1/2) =  0    · · ·  √λm
            ·           ·
            0    · · ·  0

i.e. Λm^(1/2) consists of the first m columns in Λ^(1/2) corresponding to the m largest eigenvalues. We then see that

(P Λm^(1/2))(P Λm^(1/2))' = P Λm^(1/2) Λm^(1/2)' P' = P diag(λ1, . . . , λm, 0, . . . , 0) P' ≈ V,

cf. the analogous considerations p. 292. Since V is an estimate of A A', we then have

A A' ≈ (P Λm^(1/2))(P Λm^(1/2))',

so it would be natural to choose P Λm^(1/2) as an estimate of A. This solution is called the principal factor solution for our estimation problem.
We will summarize our considerations in the following
THEOREM 7.7. We consider the factor model X = A F + G where X is k-dimensional and F m-dimensional. The correlation matrix of X is denoted R, and V is the matrix which we find by substituting the ones in the diagonal of R with estimates of the communalities. These should be chosen in the interval [r², 1] where r² is the squared multiple correlation coefficient between the relevant variable and the rest of the variables. Usually one chooses either r² or 1. The principal factor solution to the estimation problem is then

Â = (√λ1 p1, . . . , √λm pm),

where λi, i = 1, . . . , m are the m largest eigenvalues of V and where pi, i = 1, . . . , m are the corresponding normed eigenvectors.
REMARK 7.2. In the theorem we assume that the number of factors m is known. If this
is not the case it is common to retain those which correspond to eigenvalues larger than
1. Other authors recommend that one retains one, two or three because that will usually
be the upper limit to how many factors one can give a reasonable interpretation.
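Theorem 7.7 can be sketched directly in code (assumptions: V is symmetric and its m largest eigenvalues are nonnegative; the matrix below is hypothetical and the function name is ours):

```python
import numpy as np

def principal_factor_solution(V, m):
    """Return (sqrt(l1) p1, ..., sqrt(lm) pm) built from the m largest
    eigenvalues l_i and normed eigenvectors p_i of V (Theorem 7.7)."""
    eigvals, eigvecs = np.linalg.eigh(V)
    order = np.argsort(eigvals)[::-1][:m]        # indices of the m largest
    return eigvecs[:, order] * np.sqrt(eigvals[order])

V = np.array([[0.8, 0.6, 0.5],
              [0.6, 0.7, 0.4],
              [0.5, 0.4, 0.6]])
A_hat = principal_factor_solution(V, m=1)
# A_hat @ A_hat.T is then the rank-m approximation of V discussed above.
```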
7.3.3 Factor rotation
Once again we consider the expression

A A' ≈ (P Λm^(1/2))(P Λm^(1/2))'.

If Q is an arbitrary m × m orthonormal matrix, i.e. Q Q' = I, then we have

(P Λm^(1/2) Q)(P Λm^(1/2) Q)' = (P Λm^(1/2)) Q Q' (P Λm^(1/2))' = (P Λm^(1/2))(P Λm^(1/2))' ≈ A A'.

This means that we can have as many estimates of the A-matrix as we want by multiplying the principal factor solution by an orthonormal matrix.
The problem is then how to choose the Q-matrix in a reasonable way. The main principle is that one wants the A-matrix to become "simple" (without explaining what this means).
One of the most often used criteria is the one introduced by Kaiser, the Varimax criterion. It says that we must choose Q in such a way that the quantity

Σj [ (1/k) Σi (a²ij/h²i)² − ( (1/k) Σi a²ij/h²i )² ]

is maximised. It is seen that the expression is the empirical variance of the terms a²ij/h²i. The maximisation will therefore mean that many of the aij's become small (approximately 0) and many become large (close to 1). This corresponds to a simple structure which will be easy to interpret.
Another rotation principle is the so-called quartimax principle. Here we try to make the rows in the factor matrix simple so that the single variables have a simple relation with the factors. Contrary to this the Varimax criterion tries to make the columns simple, corresponding to easily interpretable factors.
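A sketch of varimax rotation via Kaiser's iterative algorithm (this is the raw, unnormalised criterion; the book's version divides each a²ij by the communality h²i, which corresponds to first scaling the rows of A by 1/hi; all names are ours):

```python
import numpy as np

def varimax(A, tol=1e-8, max_iter=500):
    """Rotate the loading matrix A (k x m) by an orthonormal Q that
    (locally) maximises the raw varimax criterion; returns A @ Q."""
    k, m = A.shape
    Q = np.eye(m)
    crit_old = 0.0
    for _ in range(max_iter):
        L = A @ Q
        # SVD step of Kaiser's algorithm (projection back onto Q'Q = I)
        U, s, Vt = np.linalg.svd(A.T @ (L**3 - L @ np.diag((L**2).sum(axis=0)) / k))
        Q = U @ Vt
        if s.sum() - crit_old < tol:
            break
        crit_old = s.sum()
    return A @ Q

A = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
L = varimax(A)
# Rotation leaves A A' (and hence the fitted correlations) unchanged.
```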
Before we continue with the theory we give an example.
EXAMPLE 7.3. We will now perform a factor analysis on the data given in example 7.1. First we determine the correlation matrix. From the estimate of the variance-covariance matrix p. 295 we find
     1.000   0.580   0.201   0.911   0.283   0.287  −0.533
     0.580   1.000   0.364   0.834   0.166   0.261  −0.609
     0.201   0.364   1.000   0.439  −0.704  −0.681  −0.649
R =  0.911   0.834   0.439   1.000   0.163   0.202  −0.676
     0.283   0.166  −0.704   0.163   1.000   0.990   0.427
     0.287   0.261  −0.681   0.202   0.990   1.000   0.357
    −0.533  −0.609  −0.649  −0.676   0.427   0.357   1.000
Completely analogously with the procedure in example 7.1 we then determine the eigenvalues and eigenvectors for R (note that in this case our choice of V is simply R). We find
Eigenvalue λ̂i, i = 1, . . . , 7    Percentage of total variance    Cumulated percentage of total variance
3.3946    48.495    48.495
2.8055    40.078    88.573
0.4373    6.247     94.820
0.2779    3.971     98.791
0.0810    1.157     99.948
0.0034    0.049     99.996
0.0003    0.004     100.000
The coordinates of the corresponding eigenvectors are shown in the following table.
Variable   p̂1     p̂2     p̂3     p̂4     p̂5     p̂6     p̂7
X1         0.405  0.293  0.667   0.089  0.227  0.410  −0.278
X2         0.432  0.222  0.698  −0.034  0.437  0.144  −0.254
X3         0.385  0.356  0.148   0.628  0.512  0.188  −0.108
X4         0.494  0.232  0.119   0.210  −0.105  0.588   0.536
X5         0.128  0.575  0.209   0.111  0.389  0.423  −0.556
X6         0.097  0.580  0.174  −0.006  0.355  0.500   0.498
X7         0.481  0.130  0.018   0.735  0.455  0.033   0.049
We now assume that the number of factors is 2 (the assumption is not based on any
deep consideration of the structure of the problem. The number 2 is chosen because
there are only two eigenvalues larger than 1).
From theorem 7.7 the estimated principal factor solution to the problem is (√λ̂1 p̂1, √λ̂2 p̂2), where