
Linear Methods for Regression

The Elements of Statistical Learning
Trevor Hastie, Robert Tibshirani, Jerome Friedman

Presented by Junyan Li

Linear regression model
Input: $X^T = (X_1, X_2, \ldots, X_p)$

Linear regression model: $f(X) = \beta_0 + \sum_{j=1}^{p} X_j\beta_j$

The intercept $\beta_0$ is added so that $f(X)$ does not have to pass through the origin.

The linear model could be a reasonable approximation when the inputs include:

Basis expansions: $X_2 = X_1^2,\ X_3 = X_1^3$ => a polynomial representation

Interactions between variables: $X_3 = X_1 \cdot X_2$

Linear regression model

Residual sum of squares (RSS): given a set of training data $(x_i, y_i)$,
$\mathrm{RSS}(\beta) = \sum_{i=1}^{N}\bigl(y_i - f(x_i)\bigr)^2 = \sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2.$

Denote by X the $N \times (p+1)$ matrix with each row an input vector (with a 1 in the first position). Then
$\mathrm{RSS}(\beta) = (y - X\beta)^T(y - X\beta).$
Let $\partial\mathrm{RSS}/\partial\beta = -2X^T(y - X\beta) = 0$. As $\partial^2\mathrm{RSS}/\partial\beta\,\partial\beta^T = 2X^TX \succ 0$ if X is of full rank (if not, remove the redundancies), the unique solution is
$\hat\beta = (X^TX)^{-1}X^Ty, \qquad \hat y = X\hat\beta = X(X^TX)^{-1}X^Ty = Hy,$
where $H = X(X^TX)^{-1}X^T$ is called the hat matrix.
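A minimal numpy sketch of the closed-form fit and hat matrix above; the synthetic data and variable names are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # N x (p+1), 1 in first position
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Normal equations: beta_hat = (X^T X)^{-1} X^T y (solve is more stable than inverting)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H projects y onto the column space of X
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

# The residual y - y_hat is orthogonal to the column space of X
print(beta_hat)
print(np.allclose(X.T @ (y - y_hat), 0))
```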

Linear regression model

$\hat y$ is the orthogonal projection of y onto the subspace spanned by the column vectors of X, $x_0, \ldots, x_p$, with $x_0 \equiv 1$.

$X^T(y - X\hat\beta) = 0$ => the residual $y - \hat y$ is orthogonal to that subspace.

Linear regression model
Assume the observations $y_i$ are uncorrelated and have constant variance $\sigma^2$, and that the $x_i$ are fixed.
Assume the deviations of Y around its expectation are additive and Gaussian, $\varepsilon \sim N(0, \sigma^2)$. Then
$\hat\beta \sim N\bigl(\beta,\ (X^TX)^{-1}\sigma^2\bigr).$

Linear regression model
The residuals $y_i - \hat y_i$ are constrained by p+1 equalities: $X^T(y - \hat y) = 0$.

So $\hat\sigma^2 = \frac{1}{N-p-1}\sum_{i=1}^{N}(y_i - \hat y_i)^2$ is an unbiased estimator of $\sigma^2$, and $(N-p-1)\hat\sigma^2 \sim \sigma^2\chi^2_{N-p-1}$.

To test the hypothesis that a particular coefficient $\beta_j = 0$, use the standardized coefficient
$z_j = \frac{\hat\beta_j}{\hat\sigma\sqrt{v_j}},$
where $v_j$ is the jth diagonal element of $(X^TX)^{-1}$. Then $z_j \sim N(0,1)$ (if $\sigma$ is given) or $z_j \sim t_{N-p-1}$ (if $\sigma$ is not given). When N-p-1 is large enough, the difference between the tail quantiles of a t-distribution and a standard normal distribution is negligible, so we use the standard normal distribution regardless of whether $\sigma$ is given or not.
Simultaneously, we could use the F statistic to test whether a group of parameters could be removed.

$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)},$

where $\mathrm{RSS}_1$ is the residual sum-of-squares for the least squares fit of the bigger model with $p_1+1$ parameters, and $\mathrm{RSS}_0$ the same for the nested smaller model with $p_0+1$ parameters.
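A small numpy/scipy sketch of the z-scores and the F test above; the synthetic data and the choice of which coefficients to drop in the smaller model are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, p = 100, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, 0.0, 0.0, -1.5]) + rng.normal(size=N)

def fit(X, y):
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    rss = np.sum((y - X @ beta) ** 2)
    return beta, rss

beta1, rss1 = fit(X, y)                       # bigger model, p1 = p predictors
sigma2 = rss1 / (N - p - 1)                   # unbiased estimate of sigma^2
v = np.diag(np.linalg.inv(X.T @ X))
z = beta1 / np.sqrt(sigma2 * v)               # z-score for each coefficient
p_values = 2 * stats.t.sf(np.abs(z), df=N - p - 1)

# F test: drop predictors 2 and 3 to form the nested smaller model
X0 = X[:, [0, 1, 4]]
_, rss0 = fit(X0, y)
p1, p0 = p, X0.shape[1] - 1
F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))
p_value_F = stats.f.sf(F, p1 - p0, N - p1 - 1)
print(z, p_values, F, p_value_F)
```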

Gauss-Markov Theorem

The least squares estimates of the parameters $\beta$ have the smallest variance among all linear unbiased estimates.

Sketch: let $\tilde\beta = Cy$ be another linear unbiased estimator of $\beta$, and write $C = (X^TX)^{-1}X^T + D$. Since $E(\varepsilon) = 0$,
$E(\tilde\beta) = \beta + DX\beta$ for all $\beta$ => $DX = 0$.
Then
$\mathrm{Var}(\tilde\beta) = \sigma^2CC^T = \sigma^2\bigl((X^TX)^{-1} + DD^T\bigr) \succeq \mathrm{Var}(\hat\beta).$

However, there may exist a biased estimator with smaller mean squared error, which is intimately related to prediction accuracy.

Multiple Regression from Simple Univariate Regression

If p = 1 and there is no intercept, the least squares estimate and residual are
$\hat\beta = \frac{\langle x, y\rangle}{\langle x, x\rangle}, \qquad r = y - x\hat\beta,$
where $\langle x, y\rangle = \sum_{i=1}^{N} x_iy_i$.

Let $x_1, \ldots, x_p$ be the columns of X. If $\langle x_j, x_k\rangle = 0$ (orthogonal) for each $j \ne k$, the multiple regression coefficients are just the univariate estimates,
$\hat\beta_1 = \frac{\langle x_1, y\rangle}{\langle x_1, x_1\rangle}, \ \ldots,\ \hat\beta_p = \frac{\langle x_p, y\rangle}{\langle x_p, x_p\rangle},$
and the residual $r = y - \sum_j x_j\hat\beta_j$ satisfies $\langle x_j, r\rangle = 0$ for every j, i.e. the inputs and the residual are orthogonal.

Multiple Regression from Simple Univariate Regression
Orthogonal inputs occur most often with balanced, designed experiments (where orthogonality is enforced), but almost never with observational data.

Regression by Successive Orthogonalization:
1. Initialize $z_0 = x_0 = 1$.
2. For j = 1, 2, ..., p:
   Regress $x_j$ on $z_0, z_1, \ldots, z_{j-1}$ to produce coefficients $\hat\gamma_{lj} = \langle z_l, x_j\rangle/\langle z_l, z_l\rangle$, l = 0, ..., j-1,
   and residual vector $z_j = x_j - \sum_{l=0}^{j-1}\hat\gamma_{lj}z_l$.
3. Regress y on the residual $z_p$ to give the estimate $\hat\beta_p = \langle z_p, y\rangle/\langle z_p, z_p\rangle$.

We can see that each of the $x_j$ is a linear combination of the $z_k$, $k \le j$, and the $z_j$ are orthogonal to each other.
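A minimal numpy sketch of the successive-orthogonalization steps above; the data are synthetic and the comparison with a direct least squares fit is only for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # x_0 = 1 in the first column
y = rng.normal(size=N)

# Steps 1-2: successively orthogonalize the columns of X
Z = np.zeros_like(X)
Z[:, 0] = X[:, 0]
for j in range(1, p + 1):
    zj = X[:, j].copy()
    for l in range(j):
        gamma = Z[:, l] @ X[:, j] / (Z[:, l] @ Z[:, l])
        zj -= gamma * Z[:, l]
    Z[:, j] = zj

# Step 3: regressing y on the last residual z_p recovers the multiple-regression
# coefficient of x_p, because z_p is the part of x_p not explained by the others.
beta_p = Z[:, p] @ y / (Z[:, p] @ Z[:, p])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_p, beta_full[p])  # the two agree
```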

Multiple Regression from Simple Univariate Regression

In matrix form step 2 becomes $X = Z\Gamma$, where Z has columns $z_j$ (in order) and $\Gamma$ is the upper triangular matrix with entries $\hat\gamma_{lj}$. Introducing the diagonal matrix D with jth diagonal entry $D_{jj} = \|z_j\|$, let $Q = ZD^{-1}$ and $R = D\Gamma$, so that
$X = QR.$
The QR decomposition represents a convenient orthogonal basis for the column space of X.
Q is an $N \times (p+1)$ orthogonal matrix ($Q^TQ = I$), and R is a $(p+1) \times (p+1)$ upper triangular matrix. The least squares solution is then
$\hat\beta = R^{-1}Q^Ty, \qquad \hat y = QQ^Ty.$
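A short sketch reproducing the QR formulas with numpy's built-in QR routine; the synthetic data and the comparison with the normal equations are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = rng.normal(size=N)

Q, R = np.linalg.qr(X)                     # "reduced" QR: Q is N x (p+1), R is (p+1) x (p+1)
beta_qr = np.linalg.solve(R, Q.T @ y)      # beta_hat = R^{-1} Q^T y (solve, don't invert)
y_hat = Q @ (Q.T @ y)                      # y_hat = Q Q^T y

beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_qr, beta_ne))       # same answer as the normal equations
```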

Subset Selection

Two reasons why we are often not satisfied with the least squares estimates:
Prediction accuracy: the least squares estimates often have low bias but large variance.
Interpretation: with a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects.

1. Best Subset Selection
Best subset regression finds, for each k in {0, 1, 2, ..., p}, the subset of size k that gives the smallest residual sum of squares.
Infeasible for p much larger than 40.

Subset Selection

2. Forward- and Backward-Stepwise Selection

1) Forward-stepwise selection is a greedy algorithm: it starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit.
   Computational: for large p we cannot compute the best subset sequence, but we can always compute the forward stepwise sequence (even when p >> N).
   Statistical: a price is paid in variance for selecting the best subset of each size; forward stepwise is a more constrained search, and will have lower variance, but perhaps more bias.
2) Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit (drop the predictor with the smallest z-score; requires N > p+1).
3) Hybrid strategies consider both forward and backward moves at each step.

Subset Selection

3. Forward-Stagewise Regression
Forward-stagewise regression (FS) is even more constrained than forward-stepwise regression. It starts like forward-stepwise regression, with an intercept equal to $\bar y$, and centered predictors with coefficients initially all 0. At each step the algorithm identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and adds it to the current coefficient for that variable. This is continued until none of the variables have correlation with the residuals.

In forward-stepwise selection, a variable's coefficient is fit completely when it enters, but in FS the coefficients are built up over many small partial steps, which works better in very high-dimensional problems.
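A minimal numpy sketch of the forward-stagewise procedure described above; the data, iteration cap, and stopping tolerance are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 100, 5
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                     # centered predictors
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=N)

intercept = y.mean()                    # intercept equal to y-bar
r = y - intercept                       # current residual
beta = np.zeros(p)

for _ in range(1000):
    corr = X.T @ r                      # (unnormalized) correlation with the residual
    j = np.argmax(np.abs(corr))
    if np.abs(corr[j]) < 1e-8:          # stop when no variable is correlated with r
        break
    delta = (X[:, j] @ r) / (X[:, j] @ X[:, j])   # simple regression coefficient of r on x_j
    beta[j] += delta                    # add it to the current coefficient
    r -= delta * X[:, j]                # update the residual

print(beta)
```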

Shrinkage Methods
Because subset selection is a discrete process (variables are either retained or discarded), it often exhibits high variance, and so doesn't reduce the prediction error of the full model. Shrinkage methods are more continuous, and don't suffer as much from high variability.

1. Ridge Regression

Penalizing by the sum of squares of the coefficients:
$\hat\beta^{\mathrm{ridge}} = \arg\min_\beta \sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2 + \lambda\sum_{j=1}^{p}\beta_j^2,$
or equivalently
$\hat\beta^{\mathrm{ridge}} = \arg\min_\beta \sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \le t.$
There is a one-to-one correspondence between $\lambda$ in the penalized form and t in the constrained form.

Shrinkage Methods

In matrix form (with centered inputs), $\mathrm{RSS}(\lambda) = (y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$, so
$\hat\beta^{\mathrm{ridge}} = (X^TX + \lambda I)^{-1}X^Ty,$
which is obtained by setting the first derivative to zero (the second derivative confirms a minimum).

Singular value decomposition (SVD): $X = UDV^T$, where U and V are $N \times p$ and $p \times p$ orthogonal matrices, with the columns of U spanning the column space of X and the columns of V spanning the row space. D is a $p \times p$ diagonal matrix, with diagonal entries $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$.

Like linear regression, ridge regression computes the coordinates of y with respect to the orthonormal basis U. It then shrinks these coordinates by the factors $d_j^2/(d_j^2 + \lambda)$:
$X\hat\beta^{\mathrm{ridge}} = \sum_{j=1}^{p} u_j\,\frac{d_j^2}{d_j^2 + \lambda}\,u_j^Ty.$

Shrinkage Methods
Eigen decomposition of $X^TX$: $X^TX = VD^2V^T$. The eigenvectors $v_j$ (columns of V) are also called the principal component (or Karhunen-Loeve) directions of X.

Hence the small singular values $d_j$ correspond to directions in the column space of X having small variance, and ridge regression shrinks these directions the most.

Degrees of freedom:
$\mathrm{df}(\lambda) = \mathrm{tr}\bigl[X(X^TX + \lambda I)^{-1}X^T\bigr] = \sum_{j=1}^{p}\frac{d_j^2}{d_j^2 + \lambda}.$
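A small numpy sketch computing the ridge fit through the SVD, the shrinkage factors $d_j^2/(d_j^2+\lambda)$, and df($\lambda$); the data and the value of $\lambda$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, lam = 100, 4, 10.0
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                       # centered inputs; intercept handled as y.mean()
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(size=N)
y_c = y - y.mean()

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) V^T
shrink = d**2 / (d**2 + lam)              # shrinkage factor for each principal direction

# Ridge fit via the SVD: X beta_ridge = sum_j u_j * shrink_j * (u_j^T y)
y_hat_ridge = U @ (shrink * (U.T @ y_c))

# Same fit via the closed form (X^T X + lam I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y_c)
print(np.allclose(y_hat_ridge, X @ beta_ridge))

df = np.sum(shrink)                       # effective degrees of freedom df(lambda)
print(shrink, df)
```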

Shrinkage Methods

2. Lasso Regression
$\hat\beta^{\mathrm{lasso}} = \arg\min_\beta \sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le t.$

The L1 constraint makes the solutions nonlinear in the $y_i$: there is no closed-form expression as in ridge regression. Computing the lasso solution is a quadratic programming problem.

The constraint region $\sum_j|\beta_j| \le t$ has corners; if the solution occurs at a corner, then it has one parameter $\beta_j$ equal to zero.

Shrinkage Methods
Assume that the columns of X are orthonormal => $X^TX = I$, so the ordinary least squares estimate is $\hat\beta = X^Ty$.

Minimize $\tfrac12\|y - X\beta\|^2 + \lambda\sum_j|\beta_j|$.
Since $\|y - X\beta\|^2 = \|y - X\hat\beta\|^2 + \sum_j(\beta_j - \hat\beta_j)^2$ and the first term does not depend on $\beta$, this is equivalent to minimizing, for each j separately,
$\tfrac12(\beta_j - \hat\beta_j)^2 + \lambda|\beta_j|.$
Setting the derivative $(\beta_j - \hat\beta_j) + \lambda\,\mathrm{sign}(\beta_j) = 0$ for $\beta_j \ne 0$ (and checking the point $\beta_j = 0$), we can see that the solution is the soft-thresholding rule
$\hat\beta_j^{\mathrm{lasso}} = \mathrm{sign}(\hat\beta_j)\bigl(|\hat\beta_j| - \lambda\bigr)_+.$
In the same orthonormal setting, ridge regression gives proportional shrinkage $\hat\beta_j/(1+\lambda)$, and best-subset selection keeps $\hat\beta_j$ unchanged only if it is among the largest in magnitude (hard thresholding).

Three properties commonly required of such a thresholding estimator:
1. Unbiasedness: the resulting estimator is nearly unbiased when the true unknown parameter is large, to avoid unnecessary modeling bias.
2. Sparsity: the resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity.
3. Continuity: the resulting estimator is continuous, to avoid instability in model prediction.
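A tiny numpy sketch comparing the three orthonormal-design rules above as functions of the least squares estimate; the grid of values, $\lambda$, and the hard-threshold cutoff are illustrative choices:

```python
import numpy as np

lam = 1.0
beta_ls = np.linspace(-3, 3, 13)                 # hypothetical least squares estimates

# Lasso: soft thresholding, sets small coefficients exactly to zero
beta_lasso = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

# Ridge: proportional shrinkage, never exactly zero
beta_ridge = beta_ls / (1.0 + lam)

# Best subset (hard thresholding at a cutoff): keep or kill, discontinuous at the cutoff
cutoff = 1.5
beta_subset = beta_ls * (np.abs(beta_ls) >= cutoff)

for row in zip(beta_ls, beta_lasso, beta_ridge, beta_subset):
    print("%6.2f %6.2f %6.2f %6.2f" % row)
```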

Shrinkage Methods

3. Least Angle Regression
Forward stepwise regression builds a model sequentially, adding one variable at a time. At each step, it identifies the best variable to include in the active set, and then updates the least squares fit to include all the active variables.
Least angle regression (LAR) uses a similar strategy, but only enters "as much" of a predictor as it deserves.
1. Standardize the predictors to have mean zero and unit norm. Start with the residual $r = y - \bar y$ and $\hat\beta_1 = \hat\beta_2 = \cdots = \hat\beta_p = 0$.
2. Find the predictor $x_j$ most correlated with r (equivalently, making the smallest angle with r).
3. Move $\hat\beta_j$ from 0 towards its least-squares coefficient $\langle x_j, r\rangle$, until some other competitor $x_k$ has as much correlation with the current residual as does $x_j$.
4. Move $\hat\beta_j$ and $\hat\beta_k$ in the direction defined by their joint least squares coefficient of the current residual on $(x_j, x_k)$, until some other competitor $x_l$ has as much correlation with the current residual.
4a. (Lasso modification) If a non-zero coefficient hits zero, drop its variable from the active set of variables and recompute the current joint least squares direction. With this step added, the algorithm computes the lasso path.
5. Continue in this way until all p predictors have been entered. After min(N-1, p) steps, we arrive at the full least-squares solution.
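If scikit-learn is available, its lars_path routine traces this path; a hedged sketch assuming that dependency (the data are synthetic, and method='lasso' corresponds to adding step 4a):

```python
import numpy as np
from sklearn.linear_model import lars_path   # assumes scikit-learn is installed

rng = np.random.default_rng(6)
N, p = 100, 6
X = rng.normal(size=(N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardized predictors
y = X @ np.array([4.0, 0.0, -3.0, 0.0, 0.0, 1.5]) + rng.normal(size=N)

# method='lar' gives plain least angle regression; method='lasso' adds step 4a
alphas, active, coefs = lars_path(X, y - y.mean(), method='lasso')

# coefs has one column per step of the path; the last column is the full LS fit
print(active)            # order in which variables entered the active set
print(coefs[:, -1])      # coefficients at the end of the path
```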

Linear Methods for Classification
With K classes and fitted linear models $\hat f_k(x) = \hat\beta_{k0} + \hat\beta_k^Tx$, the decision boundary between class k and l is the set of points for which $\hat f_k(x) = \hat f_l(x)$, i.e. $\{x : (\hat\beta_{k0} - \hat\beta_{l0}) + (\hat\beta_k - \hat\beta_l)^Tx = 0\}$; the same idea applies to methods that model the posteriors P(G = k | X = x).
For two classes, a popular model is
$P(G = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta^Tx)}{1 + \exp(\beta_0 + \beta^Tx)}, \qquad P(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta_0 + \beta^Tx)},$
so that
$\log\frac{P(G = 1 \mid X = x)}{P(G = 2 \mid X = x)} = \beta_0 + \beta^Tx.$

Separating hyperplanes => explicitly model the boundaries between the classes as linear.

Linear Methods for Classification

Linear regression of an indicator matrix: if G has K classes, there will be K indicators $Y_k$, k = 1, ..., K, with $Y_k = 1$ if G = k, else 0. These are collected together in a vector $Y = (Y_1, \ldots, Y_K)$.
Classify according to $\hat G(x) = \arg\max_k \hat f_k(x)$.
Because of the rigid nature of the regression model, classes may be masked by others, even though they are perfectly separated.
A loose but general rule is that if $K \ge 3$ classes are lined up, polynomial terms up to degree K-1 might be needed to resolve them.

Linear Methods for Classification
Suppose $f_k(x)$ is the class-conditional density of X in class G = k and $\pi_k$ is the prior probability of class k. By Bayes' theorem,
$P(G = k \mid X = x) = \frac{f_k(x)\pi_k}{\sum_{l=1}^{K} f_l(x)\pi_l}.$
Suppose each class density is multivariate Gaussian:
$f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\Bigl(-\tfrac12(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)\Bigr).$
In linear discriminant analysis (LDA) we assume that $\Sigma_k = \Sigma$ for each k. Then
$\log\frac{P(G = k \mid X = x)}{P(G = l \mid X = x)} = \log\frac{f_k(x)}{f_l(x)} + \log\frac{\pi_k}{\pi_l} = \log\frac{\pi_k}{\pi_l} - \tfrac12(\mu_k + \mu_l)^T\Sigma^{-1}(\mu_k - \mu_l) + x^T\Sigma^{-1}(\mu_k - \mu_l),$
which is linear in x, so the decision boundary is a hyperplane in p dimensions.

Linear Methods for Classification

The linear discriminant function:
$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac12\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k.$
Here $x^T\Sigma^{-1}\mu_k = \mu_k^T\Sigma^{-1}x$, as $\Sigma^{-1}$ is symmetric and $x^T\Sigma^{-1}\mu_k$ is a number, whose transpose is still the same number.
The quadratic term $x^T\Sigma^{-1}x$ is ignored, as it is a constant common to every class.

In practice, we estimate the parameters using training data:
$\hat\pi_k = N_k/N$, where $N_k$ is the number of class-k observations;
$\hat\mu_k = \sum_{g_i = k} x_i/N_k$;
$\hat\Sigma = \sum_{k=1}^{K}\sum_{g_i = k}(x_i - \hat\mu_k)(x_i - \hat\mu_k)^T/(N - K).$
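A compact numpy sketch of these plug-in estimates and the discriminant functions $\delta_k(x)$; the two-class synthetic data and labels are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic data: two Gaussian classes with a shared covariance
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(60, 2))
X1 = rng.normal(loc=[2, 1], scale=1.0, size=(40, 2))
X = np.vstack([X0, X1])
g = np.array([0] * 60 + [1] * 40)
N, K = len(g), 2

pi_hat = np.array([np.mean(g == k) for k in range(K)])          # N_k / N
mu_hat = np.array([X[g == k].mean(axis=0) for k in range(K)])   # class means
Sigma_hat = sum((X[g == k] - mu_hat[k]).T @ (X[g == k] - mu_hat[k])
                for k in range(K)) / (N - K)                    # pooled covariance
Sigma_inv = np.linalg.inv(Sigma_hat)

def delta(x):
    """Linear discriminant functions delta_k(x) for each class k."""
    return np.array([x @ Sigma_inv @ mu_hat[k]
                     - 0.5 * mu_hat[k] @ Sigma_inv @ mu_hat[k]
                     + np.log(pi_hat[k]) for k in range(K)])

x_new = np.array([1.0, 0.5])
print(np.argmax(delta(x_new)))   # classify to the class with the largest delta_k
```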

Linear Methods for Classification

In quadratic discriminant analysis (QDA) we assume that the $\Sigma_k$ are not equal:
$\delta_k(x) = -\tfrac12\log|\Sigma_k| - \tfrac12(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) + \log\pi_k.$

LDA in the enlarged quadratic polynomial space is quite similar to QDA.
Regularized discriminant analysis (RDA) shrinks the separate covariances toward the pooled one:
$\hat\Sigma_k(\alpha) = \alpha\hat\Sigma_k + (1 - \alpha)\hat\Sigma, \qquad \alpha \in [0, 1].$
In practice $\alpha$ can be chosen based on the performance of the model on validation data, or by cross-validation.

For computation, write the eigen decomposition $\hat\Sigma_k = U_kD_kU_k^T$, where $U_k$ is a $p \times p$ orthonormal matrix and $D_k$ is a diagonal matrix of positive eigenvalues $d_{kl}$. Then
$(x - \hat\mu_k)^T\hat\Sigma_k^{-1}(x - \hat\mu_k) = \bigl[U_k^T(x - \hat\mu_k)\bigr]^TD_k^{-1}\bigl[U_k^T(x - \hat\mu_k)\bigr], \qquad \log|\hat\Sigma_k| = \sum_l\log d_{kl}.$
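A short numpy sketch of the RDA blend $\hat\Sigma_k(\alpha)$ and the quadratic discriminant evaluated through the eigendecomposition above; the data and the value of $\alpha$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(60, 2))
X1 = rng.normal(loc=[2, 1], scale=1.5, size=(40, 2))
X, g = np.vstack([X0, X1]), np.array([0] * 60 + [1] * 40)
N, K, alpha = len(g), 2, 0.5

pi_hat = np.array([np.mean(g == k) for k in range(K)])
mu_hat = np.array([X[g == k].mean(axis=0) for k in range(K)])
Sigma_k = [np.cov(X[g == k].T) for k in range(K)]                       # per-class covariances
Sigma_pooled = sum((np.sum(g == k) - 1) * Sigma_k[k] for k in range(K)) / (N - K)

def delta_qda(x, k):
    """Quadratic discriminant with the RDA-regularized covariance."""
    S = alpha * Sigma_k[k] + (1 - alpha) * Sigma_pooled                 # Sigma_k(alpha)
    d, U = np.linalg.eigh(S)                                            # S = U diag(d) U^T
    z = U.T @ (x - mu_hat[k])
    return -0.5 * np.sum(np.log(d)) - 0.5 * np.sum(z**2 / d) + np.log(pi_hat[k])

x_new = np.array([1.0, 0.5])
print(max(range(K), key=lambda k: delta_qda(x_new, k)))
```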

Linear Methods for Classification

Gaussian classification with common covariances leads to linear decision boundaries. Classification can be achieved by sphering the data with respect to the common covariance estimate W, and classifying to the closest centroid (modulo $\log\pi_k$) in the sphered space.
Since only the relative distances to the centroids count, one can confine the data to the subspace spanned by the centroids in the sphered space.
This subspace can be further decomposed into successively optimal subspaces in terms of centroid separation. This decomposition is identical to the decomposition due to Fisher.

Linear Methods for Classification

Fisher's problem: find the linear combination $Z = a^TX$ such that the between-class variance is maximized relative to the within-class variance:
$\max_a \frac{a^TBa}{a^TWa},$
where B is the between-classes scatter matrix and W is the within-classes scatter matrix.
Since rescaling a does not change the ratio, this is equivalent to
$\max_a a^TBa \quad \text{subject to} \quad a^TWa = 1.$
Form the Lagrangian $L = \tfrac12 a^TBa - \tfrac12\lambda(a^TWa - 1)$; setting $\partial L/\partial a = Ba - \lambda Wa = 0$ gives the generalized eigenvalue problem
$W^{-1}Ba = \lambda a,$
so the optimal a is the eigenvector of $W^{-1}B$ with the largest eigenvalue (further discriminant directions come from the next eigenvectors).
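A brief sketch solving this generalized eigenvalue problem with scipy's symmetric eigensolver; the three-class synthetic data are illustrative:

```python
import numpy as np
from scipy.linalg import eigh   # generalized symmetric eigensolver

rng = np.random.default_rng(9)
means = [np.array([0, 0]), np.array([3, 1]), np.array([1, 4])]
X = np.vstack([rng.normal(loc=m, size=(40, 2)) for m in means])
g = np.repeat([0, 1, 2], 40)

overall = X.mean(axis=0)
B = np.zeros((2, 2))            # between-classes scatter
W = np.zeros((2, 2))            # within-classes scatter
for k in range(3):
    Xk = X[g == k]
    mk = Xk.mean(axis=0)
    B += len(Xk) * np.outer(mk - overall, mk - overall)
    W += (Xk - mk).T @ (Xk - mk)

# Solve B a = lambda W a; eigh returns eigenvalues in ascending order
eigvals, eigvecs = eigh(B, W)
a = eigvecs[:, -1]              # direction with the largest between/within variance ratio
print(eigvals[-1], a)
```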

Logistic Regression

The model specifies the K-1 log-odds against the last class:
$\log\frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \beta_{k0} + \beta_k^Tx, \qquad k = 1, \ldots, K-1.$
Letting class K be the reference,
$P(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^Tx)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^Tx)}, \qquad P(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^Tx)}.$

In the two-class case, code the classes via a 0/1 response $y_i$, where $y_i = 1$ when $g_i = 1$ and $y_i = 0$ when $g_i = 2$. Writing $p(x;\beta) = P(G = 1 \mid X = x)$, the log-likelihood is
$\ell(\beta) = \sum_{i=1}^{N}\bigl[y_i\log p(x_i;\beta) + (1 - y_i)\log(1 - p(x_i;\beta))\bigr] = \sum_{i=1}^{N}\bigl[y_i\beta^Tx_i - \log(1 + e^{\beta^Tx_i})\bigr].$
Setting the score to zero,
$\frac{\partial\ell(\beta)}{\partial\beta} = \sum_{i=1}^{N} x_i\bigl(y_i - p(x_i;\beta)\bigr) = 0,$
which is solved with the Newton-Raphson algorithm.

Logistic Regression

In matrix notation: let y denote the vector of $y_i$ values, X the $N \times (p+1)$ matrix of $x_i$ values, p the vector of fitted probabilities with ith element $p(x_i;\beta^{\mathrm{old}})$, and W an $N \times N$ diagonal matrix of weights with ith diagonal element $p(x_i;\beta^{\mathrm{old}})(1 - p(x_i;\beta^{\mathrm{old}}))$. The Newton step is then
$\beta^{\mathrm{new}} = \beta^{\mathrm{old}} + (X^TWX)^{-1}X^T(y - p) = (X^TWX)^{-1}X^TWz,$
with adjusted response $z = X\beta^{\mathrm{old}} + W^{-1}(y - p)$, i.e. iteratively reweighted least squares (IRLS).

When this is too expensive, many software packages use a quadratic approximation to logistic regression, and L1-regularized logistic regression.
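A minimal numpy sketch of the IRLS/Newton update above for the two-class case; the synthetic data and convergence tolerance are illustrative:

```python
import numpy as np

rng = np.random.default_rng(10)
N, p = 200, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # includes the intercept column
beta_true = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(p + 1)
for _ in range(25):
    eta = X @ beta
    prob = 1 / (1 + np.exp(-eta))                # fitted probabilities p(x_i; beta_old)
    w = prob * (1 - prob)                        # diagonal of W
    z = eta + (y - prob) / w                     # adjusted response
    beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    if np.max(np.abs(beta_new - beta)) < 1e-8:   # stop when the Newton step is tiny
        beta = beta_new
        break
    beta = beta_new

print(beta)
```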

LDA or Logistic Regression

LDA: $\log\frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \log\frac{\pi_k}{\pi_K} - \tfrac12(\mu_k + \mu_K)^T\Sigma^{-1}(\mu_k - \mu_K) + x^T\Sigma^{-1}(\mu_k - \mu_K) = \alpha_{k0} + \alpha_k^Tx.$

Logistic regression: $\log\frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \beta_{k0} + \beta_k^Tx.$

Although they have exactly the same form, the difference lies in the way the linear coefficients are estimated. The logistic regression model is more general, in that it makes fewer assumptions.
Logistic regression fits the parameters by maximizing the conditional likelihood, i.e. the multinomial likelihood with probabilities P(G = k | X), where P(X) is ignored.
LDA fits the parameters by maximizing the full log-likelihood, based on the joint density $P(X, G = k) = \phi(X;\mu_k,\Sigma)\pi_k$, where P(X) does play a role since $P(X) = \sum_k P(X, G = k)$.
If we assume the $f_k(x)$ are Gaussian, this gives a more efficient way to estimate the parameters.
It is generally felt that logistic regression is a safer, more robust bet than the LDA model, relying on fewer assumptions. It is our experience that the models give very similar results, even when LDA is used inappropriately.

Any Questions?
