Regression Model in Matri

Page 22
General Linear Regression Model in Matrix Terms
Suppose we have one response variable Y and (p-1) predictor (explanatory)

variables X1, X2, . . . , Xp-1, and n observations, so that the dataset looks like
the following:
X1 X2 . . . Xp-1 Y ε (random error)
X11 X12 . . . X1(p-1) Y1 ε 1

X21 X22 . . . X2(p-1) Y2 ε 1
….. …… ………. ….. ….
….. …… ………. ….. ….
….. …… ………. ….. ….
Xn1 Xn2 . . . Xn(p-1) Yn ε nb
The general linear regression model is given by
Yi = β 0 + β 1Xi1 + β 2X12 + . β p-1 X1(p-1) +... ε i, i = 1, 2, …,

n.
In matrix terms this becomes
Y=Xβ +ε
where
Yi = β 0 + β 1Xi1 + β 2X12 + . β X1(p-1) + . . .
p-1 ε i,
Y is the vector of n responses Y1, Y2, . . . , Yn
X is the n x p matrix with first column all 1’s and the values of X1,
X2 , . . .Xp-1 (assumed to be of rank p)
β is the p x 1 vector of parameters. β 0, β 1. , ...,β p-1.
ε is an n x 1 vector of uncorrelated errors. ε 1, ε 2, ...,ε p.

The random errors ε 1, ε 2, . . . , ε p are assumed to be independent with
mean 0 and common variance σ 2. For the purpose of making statistical
inferences, it is further assumed that the errors are normally distributed.
Page 23
Estimation of Parameters.
The most commonly used criterion to estimate the parameters in the model
is the principle of least squares, which involves minimizing
n
Q= ∑ε
1
i
2
= Σ[Yi - β 0 - β 1Xi1 - β 2Xi2 - . β Xi(p-1)]2 = (Y = X β )′ ( Y = X β )
p-1
It is easily shown that the value b of β which minimizes Q is the solution

of the least squares normal equations
X′Xb = X′Y
which has the solution
b = (X′X)-1X′Y
Note: If we assume that the errors are normally distributed then the least
squares estimator b is also the maximum likelihood estimator of β (to be
discussed later).
Residuals ei are the difference between observed and fitted values and are
given by
ei = Yi – Yî ,
or in vector form by
e=Y– Yˆ = Y – Xb = Y - X(X′X)-1X′Y = [I – X(X′X)-1X′]Y = (I –

H)Y
where H = X(X′X)-1X′. H is called the ‘hat matrix’ and plays an important

role in regression diagnostics (to be discussed below).
The resulting minimum value of Q, called the sum of squared errors SSE, is
given by
Yî
2
SSE =∑ ei = (Y- Xb)′(Y –Xb) = Σ(Yi – )2
Page 24
The fitted values are given by
Yˆ = Xb = X(X′X)-1X′Y = HY
This representation of the vector Yˆ of predicted (fitted) values displays

directly the relationship between them and the observations. Letting hij
denote the i,jth element of H, the fitted value, we have
n
Yî = ∑h
j =1
ij Yj
High values of the diagonal elements hij indicate that the observation Yi has
a high influence on the fitted value.
Example. Two predictors X1 and X2 , n = 4 observations:
1 2  3  0.445 0.374 0.303 −.123 

1 3  3  0.374 0.329 0.284 0.013 
X=   , Y =  3  , H =  ,
1 4  0.303 0.284 0.265 0.148 
     
1 10  40  −.123 0.013 0.148 .0.961 
.0.445 
0.329 
Diag= 
0.265


 
0.961 
The diagonal elements hii measure the influence (leverage) of the individual
observations. Both h22 and h44 have very high leverage. For example,
Yˆ4 = -0.123 Y1 + 0.013 Y2 + 0.148 Y3 + 0.961 Y4 .

Fitted Line Plot
y = - 11.56 + 5.013 x
S 5.14750
40
R-Sq 94.8%
R-Sq(adj) 92.3%
Here is a plot of the fitted and 30
observed values: 20
y
10
1 2 3 4 5 6 7 8 9 10
x
Page 25
From the graph it is easy to see why the observation Y4 has a large influence
(leverage) on its fitted value (and on the fitted regression line as well).
Coefficient of Multiple Determination R2.
We ask: ‘How much improvement is obtained by using a predictor to obtain

fitted (average) values of the response, versus just using the mean y ?’, One
answer is the following:
Compare the two ways of getting fitted values:

1. Use the average value (sample mean ) of the observations Y , so
Yˆ = Y and compute SSTO = ∑(Yi −Y ) = sum of squared errors of
2
predictions.
2. Use the fitted regression line, getting fitted values Yˆ = βˆ 0 + βˆ1 x in the
simple linear regression model.. Then compute SSE = ∑(Yi −Yî ) 2 .
Compare the sum of squared errors of fitted and observed values for the two
methods. Then
R2 = (SSTO – SSE)/ SSTO
equals the proportionate reduction in the sum of squared errors using the
fitted regression line vs. using the sample mean Y . R2.is usually expressed
as a percentage reduction. It is also interpreted as the amount of variability
in the observations that can be explained (or accounted for) by the
predictors. Note that the sample variance of the observations is
S y2 = SSTO/ (n-1)
and the variance of the residuals using the regression equation is given by
MSE = SSE/(n-p). s= MSE is the estimated standard deviation of

the random errors εi.
It is easily shown that the ‘Total sum of Squares’ SSTO can be decomposed
as
Page 26
SST0 = ∑ i − = ∑(Yi −Y ) + ∑(Yi −Yi ) = SSR + SSE.
ˆ ˆ
2 2 2
(Y Y )
This breakdown of sum of squares can be summarized in an ‘Analysis of

Variance’ Table:
Source of Sum of df MS F-Test

Variation Squares
------------- ---------------------- ------ -------------------- -------------
Regression SSR = ∑(Yî −Y ) 2 p - 1 MSR=SSR/(p-1) MSR/MSE
Error SSE = ∑(Yi −Yî ) 2 n – p MSE = SSE/(n-p)
------------- ---------------------- ------ --------------------
SSTO = ∑(Yi −Y )
2
Total n–1
The F-test is used to test the hypothesis that all of the parameters β 1, β 2.,
…, β p-1 are simultaneously zero. Use the p-value of the test to make a
decision on this (this is probably practically not an issue!).
Confidence Intervals.
Recall (from page 23) that b = (X′X)-1X′Y is the estimated vector of (vector)
of parameters β . It can be shown that the variance –covariance matrix of b
is given by
Var-Cov (b) = (X′X)-1σ2
which is estimated by
Est. Var-Cov (b) = (X′X)-1MSE
The square root of the diagonal elements s(bi) of this matrix are the standard
errors of the estimated regression parameters b1, b2., …, bp-1. Confidence
intervals for the bi ‘s are then given by
bi ± t* s(bi), where t* is a critical value of the t-distribution with n-p df.

Tests of hypotheses about individual parameters are conducted using the t-
distribution also—refer to the p-values of these tests in regression output.
Page 27
Similarly, one can construct confidence intervals for the mean response
µnew=E(Ynew) corresponding to a population mean indexed by for values of
x1, x2., …, xp-1. The mean response is estimated by Ŷnew = X h' b , where
'
X new is the (row) vector of values of x1, x2., …, xp-1. It can be shown that
the standard error of the estimated response is given by
s.e.( Ŷnew )= '

X hnew ( X ' X ) −1 X new MSE
Model Selection Criteria.
If there are (P-1) predictors x1, x2., …, xP-1. one can conceivably fit 2P-1
different models to the data. For example, there are P-1 models with one
predictor x1, P(P-1) models with 2 predictors, etc. Some criteria used for
comparing models include the following (p as a subscript below refers to the
number of predictors in a model):
SSEp, R p2 , Ra2, p , Cp , AICp , BICp, and Pressp .
These can be described as follows:
SSEp or R p2 . Note first that SSEp and R p2 are equivalent measures, in that
SSE
R p2 =1= p
SSTO
The goal in using either of these statistics is to choose a model where their
2
values are ‘small’. One can plot, e.g., R p against p and choose a model, or
models, where it is asmptoting (not changing).
Ra2, p ,
is the same measure as R p2 but with an adjustment for sample size. It
is given by
( n −1) SSE p MSE p
Ra2, p = 1−
( n − p ) SSTO
= 1− S y2
2
where S y = SSTO/(n-1) is the sample variance of the observations. Thus,
Ra2, p looks at how the ratio of sample variances for the model with p
Page 28
predictors changes in comparison with the model with no predictors (a

‘baseline’ model).
Cp . This criterion is concerned with the total mean squared error of the n
fitted values for each subset selection model. It is a bit complicated to
describe here. Suffice to say, most statisticians now prefer to use the BIC
criterion.
BICp. Schwarz’s Bayesian Information Criterion is given by
BICp = n ln SSEp – n ln + [ln n] p
We will look at an example using the : SDSS Quasar Sloan Digital Sky
Survey team (CASt dataset SDSS_quasar.dat). Here are 8 of the first 10
observations in the dataset (which contains 46420 observations in all). The
variables are as follows:
Dec. z u_mag g_mag r_mag i_mag z_ mag Radio X-ray J_mag

H_mag K_mag M_i
15.30 1.20 19.92 19.81 19.39 19.16 19.32 -1.00 -9.00 0.00 0.00 0.00
-25.08
13.94 2.24 19.22 18.89 18.45 18.33 18.11 -1.00 -9.00 0.00 0.00 0.00
-27.42
14.93 0.46 19.64 19.47 19.36 19.19 19.00 -1.00 -9.00 0.00 0.00 0.00
-22.73
0.04 0.48 18.24 17.97 18.03 17.96 17.91 0.00 -1.66 16.65 15.82 14.82
-24.05
14.18 0.95 19.52 19.28 19.11 19.16 19.07 -1.00 -9.00 0.00 0.00 0.00
-24.57
-8.86 1.25 19.15 18.72 18.26 18.28 18.26 13.97 -9.00 0.00 0.00 0.00
-26.06
15.33 0.99 19.41 19.18 18.99 19.08 19.13 -1.00 -1.88 0.00 0.00 0.00
-24.71
13.77 0.77 19.35 19.00 18.92 19.01 18.84 -1.00 -9.00 0.00 0.00 0.00
-24.19

Regression Model in Matri

Diunggah oleh

Informasi Dokumen

Deskripsi Asli:

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

Regression Model in Matri

Diunggah oleh

Hak Cipta:

Format Tersedia

Page 22

General Linear Regression Model in Matrix Terms

Suppose we have one response variable Y and (p-1) predictor (explanatory)

X1 X2 . . . Xp-1 Y ε (random error)

X11 X12 . . . X1(p-1) Y1 ε 1

Xn1 Xn2 . . . Xn(p-1) Yn ε nb

The general linear regression model is given by

Yi = β 0 + β 1Xi1 + β 2X12 + . β p-1 X1(p-1) +... ε i, i = 1, 2, …,

In matrix terms this becomes

Y is the vector of n responses Y1, Y2, . . . , Yn

β is the p x 1 vector of parameters. β 0, β 1. , ...,β p-1.

ε is an n x 1 vector of uncorrelated errors. ε 1, ε 2, ...,ε p.

It is easily shown that the value b of β which minimizes Q is the solution

which has the solution

e=Y– Yˆ = Y – Xb = Y - X(X′X)-1X′Y = [I – X(X′X)-1X′]Y = (I –

where H = X(X′X)-1X′. H is called the ‘hat matrix’ and plays an important

This representation of the vector Yˆ of predicted (fitted) values displays

Example. Two predictors X1 and X2 , n = 4 observations:

1 2  3  0.445 0.374 0.303 −.123 

Yˆ4 = -0.123 Y1 + 0.013 Y2 + 0.148 Y3 + 0.961 Y4 .

Here is a plot of the fitted and 30

Coefficient of Multiple Determination R2.

We ask: ‘How much improvement is obtained by using a predictor to obtain

Compare the two ways of getting fitted values:

R2 = (SSTO – SSE)/ SSTO

MSE = SSE/(n-p). s= MSE is the estimated standard deviation of

This breakdown of sum of squares can be summarized in an ‘Analysis of

Source of Sum of df MS F-Test

Var-Cov (b) = (X′X)-1σ2

Est. Var-Cov (b) = (X′X)-1MSE

bi ± t* s(bi), where t* is a critical value of the t-distribution with n-p df.

s.e.( Ŷnew )= '

Model Selection Criteria.

SSEp, R p2 , Ra2, p , Cp , AICp , BICp, and Pressp .

These can be described as follows:

predictors changes in comparison with the model with no predictors (a

BICp. Schwarz’s Bayesian Information Criterion is given by

BICp = n ln SSEp – n ln + [ln n] p

Dec. z u_mag g_mag r_mag i_mag z_ mag Radio X-ray J_mag

Anda mungkin juga menyukai