
Linear Methods for Regression

The Elements of Statistical Learning
Trevor Hastie, Robert Tibshirani, Jerome Friedman

Presented by Junyan Li

Linear regression model
Input: $X^T = (X_1, X_2, \ldots, X_p)$

Linear regression model: $f(X) = \beta_0 + \sum_{j=1}^{p} X_j\beta_j$

The intercept $\beta_0$ is added so that $f(X)$ does not have to pass through the origin.

The linear model could be a reasonable approximation when the inputs include:

Basis expansions: $X_2 = X_1^2,\ X_3 = X_1^3$ => a polynomial representation

Interactions between variables: $X_3 = X_1 \cdot X_2$

Linear regression model

Residual sum of squares (RSS): given a set of training data $(x_i, y_i)$,
$\mathrm{RSS}(\beta) = \sum_{i=1}^{N}\bigl(y_i - f(x_i)\bigr)^2 = \sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2.$

Denote by X the $N \times (p+1)$ matrix with each row an input vector (with a 1 in the first position). Then
$\mathrm{RSS}(\beta) = (y - X\beta)^T(y - X\beta).$
Let $\partial\mathrm{RSS}/\partial\beta = -2X^T(y - X\beta) = 0$. As $\partial^2\mathrm{RSS}/\partial\beta\,\partial\beta^T = 2X^TX \succ 0$ if X is of full rank (if not, remove the redundancies), the unique solution is
$\hat\beta = (X^TX)^{-1}X^Ty, \qquad \hat y = X\hat\beta = X(X^TX)^{-1}X^Ty = Hy,$
where $H = X(X^TX)^{-1}X^T$ is called the hat matrix.
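A minimal numpy sketch of the closed-form fit and hat matrix above; the synthetic data and variable names are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # N x (p+1), 1 in first position
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Normal equations: beta_hat = (X^T X)^{-1} X^T y (solve is more stable than inverting)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H projects y onto the column space of X
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

# The residual y - y_hat is orthogonal to the column space of X
print(beta_hat)
print(np.allclose(X.T @ (y - y_hat), 0))
```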

Linear regression model

$\hat y$ is the orthogonal projection of y onto the subspace spanned by the column vectors of X, $x_0, \ldots, x_p$, with $x_0 \equiv 1$.

$X^T(y - X\hat\beta) = 0$ => the residual $y - \hat y$ is orthogonal to that subspace.

Linear regression model
Assume the observations $y_i$ are uncorrelated and have constant variance $\sigma^2$, and that the $x_i$ are fixed.
Assume the deviations of Y around its expectation are additive and Gaussian, $\varepsilon \sim N(0, \sigma^2)$. Then
$\hat\beta \sim N\bigl(\beta,\ (X^TX)^{-1}\sigma^2\bigr).$

Linear regression model
The residuals $y_i - \hat y_i$ are constrained by p+1 equalities: $X^T(y - \hat y) = 0$.

So $\hat\sigma^2 = \frac{1}{N-p-1}\sum_{i=1}^{N}(y_i - \hat y_i)^2$ is an unbiased estimator of $\sigma^2$, and $(N-p-1)\hat\sigma^2 \sim \sigma^2\chi^2_{N-p-1}$.

To test the hypothesis that a particular coefficient $\beta_j = 0$, use the standardized coefficient
$z_j = \frac{\hat\beta_j}{\hat\sigma\sqrt{v_j}},$
where $v_j$ is the jth diagonal element of $(X^TX)^{-1}$. Then $z_j \sim N(0,1)$ (if $\sigma$ is given) or $z_j \sim t_{N-p-1}$ (if $\sigma$ is not given). When N-p-1 is large enough, the difference between the tail quantiles of a t-distribution and a standard normal distribution is negligible, so we use the standard normal distribution regardless of whether $\sigma$ is given or not.
Simultaneously, we could use the F statistic to test whether a group of parameters could be removed.

$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)},$

where $\mathrm{RSS}_1$ is the residual sum-of-squares for the least squares fit of the bigger model with $p_1+1$ parameters, and $\mathrm{RSS}_0$ the same for the nested smaller model with $p_0+1$ parameters.
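A small numpy/scipy sketch of the z-scores and the F test above; the synthetic data and the choice of which coefficients to drop in the smaller model are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, p = 100, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, 0.0, 0.0, -1.5]) + rng.normal(size=N)

def fit(X, y):
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    rss = np.sum((y - X @ beta) ** 2)
    return beta, rss

beta1, rss1 = fit(X, y)                       # bigger model, p1 = p predictors
sigma2 = rss1 / (N - p - 1)                   # unbiased estimate of sigma^2
v = np.diag(np.linalg.inv(X.T @ X))
z = beta1 / np.sqrt(sigma2 * v)               # z-score for each coefficient
p_values = 2 * stats.t.sf(np.abs(z), df=N - p - 1)

# F test: drop predictors 2 and 3 to form the nested smaller model
X0 = X[:, [0, 1, 4]]
_, rss0 = fit(X0, y)
p1, p0 = p, X0.shape[1] - 1
F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))
p_value_F = stats.f.sf(F, p1 - p0, N - p1 - 1)
print(z, p_values, F, p_value_F)
```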

Gauss-Markov Theorem

The least squares estimates of the parameters $\beta$ have the smallest variance among all linear unbiased estimates.

Sketch: let $\tilde\beta = Cy$ be another linear unbiased estimator of $\beta$, and write $C = (X^TX)^{-1}X^T + D$. Since $E(\varepsilon) = 0$,
$E(\tilde\beta) = \beta + DX\beta$ for all $\beta$ => $DX = 0$.
Then
$\mathrm{Var}(\tilde\beta) = \sigma^2CC^T = \sigma^2\bigl((X^TX)^{-1} + DD^T\bigr) \succeq \mathrm{Var}(\hat\beta).$

However, there may exist a biased estimator with smaller mean squared error, which is intimately related to prediction accuracy.

Multiple Regression from Simple Univariate Regression

If p = 1 and there is no intercept, the least squares estimate and residual are
$\hat\beta = \frac{\langle x, y\rangle}{\langle x, x\rangle}, \qquad r = y - x\hat\beta,$
where $\langle x, y\rangle = \sum_{i=1}^{N} x_iy_i$.

Let $x_1, \ldots, x_p$ be the columns of X. If $\langle x_j, x_k\rangle = 0$ (orthogonal) for each $j \ne k$, the multiple regression coefficients are just the univariate estimates,
$\hat\beta_1 = \frac{\langle x_1, y\rangle}{\langle x_1, x_1\rangle}, \ \ldots,\ \hat\beta_p = \frac{\langle x_p, y\rangle}{\langle x_p, x_p\rangle},$
and the residual $r = y - \sum_j x_j\hat\beta_j$ satisfies $\langle x_j, r\rangle = 0$ for every j, i.e. the inputs and the residual are orthogonal.

Multiple Regression from Simple Univariate Regression
Orthogonal inputs occur most often with balanced, designed experiments (where orthogonality is enforced), but almost never with observational data.

Regression by Successive Orthogonalization:
1. Initialize $z_0 = x_0 = 1$.
2. For j = 1, 2, ..., p:
   Regress $x_j$ on $z_0, z_1, \ldots, z_{j-1}$ to produce coefficients $\hat\gamma_{lj} = \langle z_l, x_j\rangle/\langle z_l, z_l\rangle$, l = 0, ..., j-1,
   and residual vector $z_j = x_j - \sum_{l=0}^{j-1}\hat\gamma_{lj}z_l$.
3. Regress y on the residual $z_p$ to give the estimate $\hat\beta_p = \langle z_p, y\rangle/\langle z_p, z_p\rangle$.

We can see that each of the $x_j$ is a linear combination of the $z_k$, $k \le j$, and the $z_j$ are orthogonal to each other.
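A minimal numpy sketch of the successive-orthogonalization steps above; the data are synthetic and the comparison with a direct least squares fit is only for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # x_0 = 1 in the first column
y = rng.normal(size=N)

# Steps 1-2: successively orthogonalize the columns of X
Z = np.zeros_like(X)
Z[:, 0] = X[:, 0]
for j in range(1, p + 1):
    zj = X[:, j].copy()
    for l in range(j):
        gamma = Z[:, l] @ X[:, j] / (Z[:, l] @ Z[:, l])
        zj -= gamma * Z[:, l]
    Z[:, j] = zj

# Step 3: regressing y on the last residual z_p recovers the multiple-regression
# coefficient of x_p, because z_p is the part of x_p not explained by the others.
beta_p = Z[:, p] @ y / (Z[:, p] @ Z[:, p])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_p, beta_full[p])  # the two agree
```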

Multiple Regression from Simple Univariate Regression

In matrix form step 2 becomes $X = Z\Gamma$, where Z has columns $z_j$ (in order) and $\Gamma$ is the upper triangular matrix with entries $\hat\gamma_{lj}$. Introducing the diagonal matrix D with jth diagonal entry $D_{jj} = \|z_j\|$, let $Q = ZD^{-1}$ and $R = D\Gamma$, so that
$X = QR.$
The QR decomposition represents a convenient orthogonal basis for the column space of X.
Q is an $N \times (p+1)$ orthogonal matrix ($Q^TQ = I$), and R is a $(p+1) \times (p+1)$ upper triangular matrix. The least squares solution is then
$\hat\beta = R^{-1}Q^Ty, \qquad \hat y = QQ^Ty.$
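A short sketch reproducing the QR formulas with numpy's built-in QR routine; the synthetic data and the comparison with the normal equations are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = rng.normal(size=N)

Q, R = np.linalg.qr(X)                     # "reduced" QR: Q is N x (p+1), R is (p+1) x (p+1)
beta_qr = np.linalg.solve(R, Q.T @ y)      # beta_hat = R^{-1} Q^T y (solve, don't invert)
y_hat = Q @ (Q.T @ y)                      # y_hat = Q Q^T y

beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_qr, beta_ne))       # same answer as the normal equations
```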

Subset Selection

Two reasons why we are often not satisfied with the least squares estimates:
Prediction accuracy: the least squares estimates often have low bias but large variance.
Interpretation: with a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects.

1. Best Subset Selection
Best subset regression finds, for each k in {0, 1, 2, ..., p}, the subset of size k that gives the smallest residual sum of squares.
Infeasible for p much larger than 40.

Subset Selection

2. Forward- and Backward-Stepwise Selection

1) Forward-stepwise selection is a greedy algorithm: it starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit.
   Computational: for large p we cannot compute the best subset sequence, but we can always compute the forward stepwise sequence (even when p >> N).
   Statistical: a price is paid in variance for selecting the best subset of each size; forward stepwise is a more constrained search, and will have lower variance, but perhaps more bias.
2) Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit (drop the predictor with the smallest z-score; requires N > p+1).
3) Hybrid strategies consider both forward and backward moves at each step.

Subset Selection

3. Forward-Stagewise Regression
Forward-stagewise regression (FS) is even more constrained than forward-stepwise regression. It starts like forward-stepwise regression, with an intercept equal to $\bar y$, and centered predictors with coefficients initially all 0. At each step the algorithm identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and adds it to the current coefficient for that variable. This is continued until none of the variables have correlation with the residuals.

In forward-stepwise selection, a variable's coefficient is fit completely when it enters, but in FS the coefficients are built up over many small partial steps, which works better in very high-dimensional problems.
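A minimal numpy sketch of the forward-stagewise procedure described above; the data, iteration cap, and stopping tolerance are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 100, 5
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                     # centered predictors
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=N)

intercept = y.mean()                    # intercept equal to y-bar
r = y - intercept                       # current residual
beta = np.zeros(p)

for _ in range(1000):
    corr = X.T @ r                      # (unnormalized) correlation with the residual
    j = np.argmax(np.abs(corr))
    if np.abs(corr[j]) < 1e-8:          # stop when no variable is correlated with r
        break
    delta = (X[:, j] @ r) / (X[:, j] @ X[:, j])   # simple regression coefficient of r on x_j
    beta[j] += delta                    # add it to the current coefficient
    r -= delta * X[:, j]                # update the residual

print(beta)
```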

Shrinkage Methods
Because subset selection is a discrete process (variables are either retained or discarded), it often exhibits high variance, and so doesn't reduce the prediction error of the full model. Shrinkage methods are more continuous, and don't suffer as much from high variability.

1. Ridge Regression

Penalizing by the sum of squares of the coefficients:
$\hat\beta^{\mathrm{ridge}} = \arg\min_\beta \sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2 + \lambda\sum_{j=1}^{p}\beta_j^2,$
or equivalently
$\hat\beta^{\mathrm{ridge}} = \arg\min_\beta \sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \le t.$
There is a one-to-one correspondence between $\lambda$ in the penalized form and t in the constrained form.

Shrinkage Methods

In matrix form (with centered inputs), $\mathrm{RSS}(\lambda) = (y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$, so
$\hat\beta^{\mathrm{ridge}} = (X^TX + \lambda I)^{-1}X^Ty,$
which is obtained by setting the first derivative to zero (the second derivative confirms a minimum).

Singular value decomposition (SVD): $X = UDV^T$, where U and V are $N \times p$ and $p \times p$ orthogonal matrices, with the columns of U spanning the column space of X and the columns of V spanning the row space. D is a $p \times p$ diagonal matrix, with diagonal entries $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$.

Like linear regression, ridge regression computes the coordinates of y with respect to the orthonormal basis U. It then shrinks these coordinates by the factors $d_j^2/(d_j^2 + \lambda)$:
$X\hat\beta^{\mathrm{ridge}} = \sum_{j=1}^{p} u_j\,\frac{d_j^2}{d_j^2 + \lambda}\,u_j^Ty.$

Shrinkage Methods
Eigen decomposition of $X^TX$: $X^TX = VD^2V^T$. The eigenvectors $v_j$ (columns of V) are also called the principal component (or Karhunen-Loeve) directions of X.

Hence the small singular values $d_j$ correspond to directions in the column space of X having small variance, and ridge regression shrinks these directions the most.

Degrees of freedom:
$\mathrm{df}(\lambda) = \mathrm{tr}\bigl[X(X^TX + \lambda I)^{-1}X^T\bigr] = \sum_{j=1}^{p}\frac{d_j^2}{d_j^2 + \lambda}.$
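A small numpy sketch computing the ridge fit through the SVD, the shrinkage factors $d_j^2/(d_j^2+\lambda)$, and df($\lambda$); the data and the value of $\lambda$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, lam = 100, 4, 10.0
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                       # centered inputs; intercept handled as y.mean()
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(size=N)
y_c = y - y.mean()

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) V^T
shrink = d**2 / (d**2 + lam)              # shrinkage factor for each principal direction

# Ridge fit via the SVD: X beta_ridge = sum_j u_j * shrink_j * (u_j^T y)
y_hat_ridge = U @ (shrink * (U.T @ y_c))

# Same fit via the closed form (X^T X + lam I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y_c)
print(np.allclose(y_hat_ridge, X @ beta_ridge))

df = np.sum(shrink)                       # effective degrees of freedom df(lambda)
print(shrink, df)
```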

Shrinkage Methods

2. Lasso Regression
$\hat\beta^{\mathrm{lasso}} = \arg\min_\beta \sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le t.$

The L1 constraint makes the solutions nonlinear in the $y_i$: there is no closed-form expression as in ridge regression. Computing the lasso solution is a quadratic programming problem.

The constraint region $\sum_j|\beta_j| \le t$ has corners; if the solution occurs at a corner, then it has one parameter $\beta_j$ equal to zero.

Shrinkage Methods
Assume that the columns of X are orthonormal => $X^TX = I$, so the ordinary least squares estimate is $\hat\beta = X^Ty$.

Minimize $\tfrac12\|y - X\beta\|^2 + \lambda\sum_j|\beta_j|$.
Since $\|y - X\beta\|^2 = \|y - X\hat\beta\|^2 + \sum_j(\beta_j - \hat\beta_j)^2$ and the first term does not depend on $\beta$, this is equivalent to minimizing, for each j separately,
$\tfrac12(\beta_j - \hat\beta_j)^2 + \lambda|\beta_j|.$
Setting the derivative $(\beta_j - \hat\beta_j) + \lambda\,\mathrm{sign}(\beta_j) = 0$ for $\beta_j \ne 0$ (and checking the point $\beta_j = 0$), we can see that the solution is the soft-thresholding rule
$\hat\beta_j^{\mathrm{lasso}} = \mathrm{sign}(\hat\beta_j)\bigl(|\hat\beta_j| - \lambda\bigr)_+.$
In the same orthonormal setting, ridge regression gives proportional shrinkage $\hat\beta_j/(1+\lambda)$, and best-subset selection keeps $\hat\beta_j$ unchanged only if it is among the largest in magnitude (hard thresholding).

Three properties commonly required of such a thresholding estimator:
1. Unbiasedness: the resulting estimator is nearly unbiased when the true unknown parameter is large, to avoid unnecessary modeling bias.
2. Sparsity: the resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity.
3. Continuity: the resulting estimator is continuous, to avoid instability in model prediction.
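A tiny numpy sketch comparing the three orthonormal-design rules above as functions of the least squares estimate; the grid of values, $\lambda$, and the hard-threshold cutoff are illustrative choices:

```python
import numpy as np

lam = 1.0
beta_ls = np.linspace(-3, 3, 13)                 # hypothetical least squares estimates

# Lasso: soft thresholding, sets small coefficients exactly to zero
beta_lasso = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

# Ridge: proportional shrinkage, never exactly zero
beta_ridge = beta_ls / (1.0 + lam)

# Best subset (hard thresholding at a cutoff): keep or kill, discontinuous at the cutoff
cutoff = 1.5
beta_subset = beta_ls * (np.abs(beta_ls) >= cutoff)

for row in zip(beta_ls, beta_lasso, beta_ridge, beta_subset):
    print("%6.2f %6.2f %6.2f %6.2f" % row)
```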

Shrinkage Methods

3. Least Angle Regression
Forward stepwise regression builds a model sequentially, adding one variable at a time. At each step, it identifies the best variable to include in the active set, and then updates the least squares fit to include all the active variables.
Least angle regression (LAR) uses a similar strategy, but only enters "as much" of a predictor as it deserves.
1. Standardize the predictors to have mean zero and unit norm. Start with the residual $r = y - \bar y$ and $\hat\beta_1 = \hat\beta_2 = \cdots = \hat\beta_p = 0$.
2. Find the predictor $x_j$ most correlated with r (equivalently, making the smallest angle with r).
3. Move $\hat\beta_j$ from 0 towards its least-squares coefficient $\langle x_j, r\rangle$, until some other competitor $x_k$ has as much correlation with the current residual as does $x_j$.
4. Move $\hat\beta_j$ and $\hat\beta_k$ in the direction defined by their joint least squares coefficient of the current residual on $(x_j, x_k)$, until some other competitor $x_l$ has as much correlation with the current residual.
4a. (Lasso modification) If a non-zero coefficient hits zero, drop its variable from the active set of variables and recompute the current joint least squares direction. With this step added, the algorithm computes the lasso path.
5. Continue in this way until all p predictors have been entered. After min(N-1, p) steps, we arrive at the full least-squares solution.
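If scikit-learn is available, its lars_path routine traces this path; a hedged sketch assuming that dependency (the data are synthetic, and method='lasso' corresponds to adding step 4a):

```python
import numpy as np
from sklearn.linear_model import lars_path   # assumes scikit-learn is installed

rng = np.random.default_rng(6)
N, p = 100, 6
X = rng.normal(size=(N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardized predictors
y = X @ np.array([4.0, 0.0, -3.0, 0.0, 0.0, 1.5]) + rng.normal(size=N)

# method='lar' gives plain least angle regression; method='lasso' adds step 4a
alphas, active, coefs = lars_path(X, y - y.mean(), method='lasso')

# coefs has one column per step of the path; the last column is the full LS fit
print(active)            # order in which variables entered the active set
print(coefs[:, -1])      # coefficients at the end of the path
```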

Linear Methods for Classification
With K classes and fitted linear models $\hat f_k(x) = \hat\beta_{k0} + \hat\beta_k^Tx$, the decision boundary between class k and l is the set of points for which $\hat f_k(x) = \hat f_l(x)$, i.e. $\{x : (\hat\beta_{k0} - \hat\beta_{l0}) + (\hat\beta_k - \hat\beta_l)^Tx = 0\}$; the same idea applies to methods that model the posteriors P(G = k | X = x).
For two classes, a popular model is
$P(G = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta^Tx)}{1 + \exp(\beta_0 + \beta^Tx)}, \qquad P(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta_0 + \beta^Tx)},$
so that
$\log\frac{P(G = 1 \mid X = x)}{P(G = 2 \mid X = x)} = \beta_0 + \beta^Tx.$

Separating hyperplanes => explicitly model the boundaries between the classes as linear.

Linear Methods for Classification

Linear regression of an indicator matrix: if G has K classes, there will be K indicators $Y_k$, k = 1, ..., K, with $Y_k = 1$ if G = k, else 0. These are collected together in a vector $Y = (Y_1, \ldots, Y_K)$.
Classify according to $\hat G(x) = \arg\max_k \hat f_k(x)$.
Because of the rigid nature of the regression model, classes may be masked by others, even though they are perfectly separated.
A loose but general rule is that if $K \ge 3$ classes are lined up, polynomial terms up to degree K-1 might be needed to resolve them.

Linear Methods for Classification
Suppose $f_k(x)$ is the class-conditional density of X in class G = k and $\pi_k$ is the prior probability of class k. By Bayes' theorem,
$P(G = k \mid X = x) = \frac{f_k(x)\pi_k}{\sum_{l=1}^{K} f_l(x)\pi_l}.$
Suppose each class density is multivariate Gaussian:
$f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\Bigl(-\tfrac12(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)\Bigr).$
In linear discriminant analysis (LDA) we assume that $\Sigma_k = \Sigma$ for each k. Then
$\log\frac{P(G = k \mid X = x)}{P(G = l \mid X = x)} = \log\frac{f_k(x)}{f_l(x)} + \log\frac{\pi_k}{\pi_l} = \log\frac{\pi_k}{\pi_l} - \tfrac12(\mu_k + \mu_l)^T\Sigma^{-1}(\mu_k - \mu_l) + x^T\Sigma^{-1}(\mu_k - \mu_l),$
which is linear in x, so the decision boundary is a hyperplane in p dimensions.

Linear Methods for Classification

The linear discriminant function:
$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac12\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k.$
Here $x^T\Sigma^{-1}\mu_k = \mu_k^T\Sigma^{-1}x$, as $\Sigma^{-1}$ is symmetric and $x^T\Sigma^{-1}\mu_k$ is a number, whose transpose is still the same number.
The quadratic term $x^T\Sigma^{-1}x$ is ignored, as it is a constant common to every class.

In practice, we estimate the parameters using training data:
$\hat\pi_k = N_k/N$, where $N_k$ is the number of class-k observations;
$\hat\mu_k = \sum_{g_i = k} x_i/N_k$;
$\hat\Sigma = \sum_{k=1}^{K}\sum_{g_i = k}(x_i - \hat\mu_k)(x_i - \hat\mu_k)^T/(N - K).$
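A compact numpy sketch of these plug-in estimates and the discriminant functions $\delta_k(x)$; the two-class synthetic data and labels are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic data: two Gaussian classes with a shared covariance
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(60, 2))
X1 = rng.normal(loc=[2, 1], scale=1.0, size=(40, 2))
X = np.vstack([X0, X1])
g = np.array([0] * 60 + [1] * 40)
N, K = len(g), 2

pi_hat = np.array([np.mean(g == k) for k in range(K)])          # N_k / N
mu_hat = np.array([X[g == k].mean(axis=0) for k in range(K)])   # class means
Sigma_hat = sum((X[g == k] - mu_hat[k]).T @ (X[g == k] - mu_hat[k])
                for k in range(K)) / (N - K)                    # pooled covariance
Sigma_inv = np.linalg.inv(Sigma_hat)

def delta(x):
    """Linear discriminant functions delta_k(x) for each class k."""
    return np.array([x @ Sigma_inv @ mu_hat[k]
                     - 0.5 * mu_hat[k] @ Sigma_inv @ mu_hat[k]
                     + np.log(pi_hat[k]) for k in range(K)])

x_new = np.array([1.0, 0.5])
print(np.argmax(delta(x_new)))   # classify to the class with the largest delta_k
```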

Linear Methods for Classification

In quadratic discriminant analysis (QDA) we assume that the $\Sigma_k$ are not equal:
$\delta_k(x) = -\tfrac12\log|\Sigma_k| - \tfrac12(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) + \log\pi_k.$

LDA in the enlarged quadratic polynomial space is quite similar to QDA.
Regularized discriminant analysis (RDA) shrinks the separate covariances toward the pooled one:
$\hat\Sigma_k(\alpha) = \alpha\hat\Sigma_k + (1 - \alpha)\hat\Sigma, \qquad \alpha \in [0, 1].$
In practice $\alpha$ can be chosen based on the performance of the model on validation data, or by cross-validation.

For computation, write the eigen decomposition $\hat\Sigma_k = U_kD_kU_k^T$, where $U_k$ is a $p \times p$ orthonormal matrix and $D_k$ is a diagonal matrix of positive eigenvalues $d_{kl}$. Then
$(x - \hat\mu_k)^T\hat\Sigma_k^{-1}(x - \hat\mu_k) = \bigl[U_k^T(x - \hat\mu_k)\bigr]^TD_k^{-1}\bigl[U_k^T(x - \hat\mu_k)\bigr], \qquad \log|\hat\Sigma_k| = \sum_l\log d_{kl}.$
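A short numpy sketch of the RDA blend $\hat\Sigma_k(\alpha)$ and the quadratic discriminant evaluated through the eigendecomposition above; the data and the value of $\alpha$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(60, 2))
X1 = rng.normal(loc=[2, 1], scale=1.5, size=(40, 2))
X, g = np.vstack([X0, X1]), np.array([0] * 60 + [1] * 40)
N, K, alpha = len(g), 2, 0.5

pi_hat = np.array([np.mean(g == k) for k in range(K)])
mu_hat = np.array([X[g == k].mean(axis=0) for k in range(K)])
Sigma_k = [np.cov(X[g == k].T) for k in range(K)]                       # per-class covariances
Sigma_pooled = sum((np.sum(g == k) - 1) * Sigma_k[k] for k in range(K)) / (N - K)

def delta_qda(x, k):
    """Quadratic discriminant with the RDA-regularized covariance."""
    S = alpha * Sigma_k[k] + (1 - alpha) * Sigma_pooled                 # Sigma_k(alpha)
    d, U = np.linalg.eigh(S)                                            # S = U diag(d) U^T
    z = U.T @ (x - mu_hat[k])
    return -0.5 * np.sum(np.log(d)) - 0.5 * np.sum(z**2 / d) + np.log(pi_hat[k])

x_new = np.array([1.0, 0.5])
print(max(range(K), key=lambda k: delta_qda(x_new, k)))
```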

Linear Methods for Classification

Gaussian classification with common covariances leads to linear decision boundaries. Classification can be achieved by sphering the data with respect to the common covariance estimate W, and classifying to the closest centroid (modulo $\log\pi_k$) in the sphered space.
Since only the relative distances to the centroids count, one can confine the data to the subspace spanned by the centroids in the sphered space.
This subspace can be further decomposed into successively optimal subspaces in terms of centroid separation. This decomposition is identical to the decomposition due to Fisher.

Linear Methods for Classification

Fisher's problem: find the linear combination $Z = a^TX$ such that the between-class variance is maximized relative to the within-class variance:
$\max_a \frac{a^TBa}{a^TWa},$
where B is the between-classes scatter matrix and W is the within-classes scatter matrix.
Since rescaling a does not change the ratio, this is equivalent to
$\max_a a^TBa \quad \text{subject to} \quad a^TWa = 1.$
Form the Lagrangian $L = \tfrac12 a^TBa - \tfrac12\lambda(a^TWa - 1)$; setting $\partial L/\partial a = Ba - \lambda Wa = 0$ gives the generalized eigenvalue problem
$W^{-1}Ba = \lambda a,$
so the optimal a is the eigenvector of $W^{-1}B$ with the largest eigenvalue (further discriminant directions come from the next eigenvectors).
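A brief sketch solving this generalized eigenvalue problem with scipy's symmetric eigensolver; the three-class synthetic data are illustrative:

```python
import numpy as np
from scipy.linalg import eigh   # generalized symmetric eigensolver

rng = np.random.default_rng(9)
means = [np.array([0, 0]), np.array([3, 1]), np.array([1, 4])]
X = np.vstack([rng.normal(loc=m, size=(40, 2)) for m in means])
g = np.repeat([0, 1, 2], 40)

overall = X.mean(axis=0)
B = np.zeros((2, 2))            # between-classes scatter
W = np.zeros((2, 2))            # within-classes scatter
for k in range(3):
    Xk = X[g == k]
    mk = Xk.mean(axis=0)
    B += len(Xk) * np.outer(mk - overall, mk - overall)
    W += (Xk - mk).T @ (Xk - mk)

# Solve B a = lambda W a; eigh returns eigenvalues in ascending order
eigvals, eigvecs = eigh(B, W)
a = eigvecs[:, -1]              # direction with the largest between/within variance ratio
print(eigvals[-1], a)
```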

Logistic Regression

The model specifies the K-1 log-odds against the last class:
$\log\frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \beta_{k0} + \beta_k^Tx, \qquad k = 1, \ldots, K-1.$
Letting class K be the reference,
$P(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^Tx)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^Tx)}, \qquad P(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^Tx)}.$

In the two-class case, code the classes via a 0/1 response $y_i$, where $y_i = 1$ when $g_i = 1$ and $y_i = 0$ when $g_i = 2$. Writing $p(x;\beta) = P(G = 1 \mid X = x)$, the log-likelihood is
$\ell(\beta) = \sum_{i=1}^{N}\bigl[y_i\log p(x_i;\beta) + (1 - y_i)\log(1 - p(x_i;\beta))\bigr] = \sum_{i=1}^{N}\bigl[y_i\beta^Tx_i - \log(1 + e^{\beta^Tx_i})\bigr].$
Setting the score to zero,
$\frac{\partial\ell(\beta)}{\partial\beta} = \sum_{i=1}^{N} x_i\bigl(y_i - p(x_i;\beta)\bigr) = 0,$
which is solved with the Newton-Raphson algorithm.

Logistic Regression

In matrix notation: let y denote the vector of $y_i$ values, X the $N \times (p+1)$ matrix of $x_i$ values, p the vector of fitted probabilities with ith element $p(x_i;\beta^{\mathrm{old}})$, and W an $N \times N$ diagonal matrix of weights with ith diagonal element $p(x_i;\beta^{\mathrm{old}})(1 - p(x_i;\beta^{\mathrm{old}}))$. The Newton step is then
$\beta^{\mathrm{new}} = \beta^{\mathrm{old}} + (X^TWX)^{-1}X^T(y - p) = (X^TWX)^{-1}X^TWz,$
with adjusted response $z = X\beta^{\mathrm{old}} + W^{-1}(y - p)$, i.e. iteratively reweighted least squares (IRLS).

When this is too expensive, many software packages use a quadratic approximation to logistic regression, and L1-regularized logistic regression.
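A minimal numpy sketch of the IRLS/Newton update above for the two-class case; the synthetic data and convergence tolerance are illustrative:

```python
import numpy as np

rng = np.random.default_rng(10)
N, p = 200, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # includes the intercept column
beta_true = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(p + 1)
for _ in range(25):
    eta = X @ beta
    prob = 1 / (1 + np.exp(-eta))                # fitted probabilities p(x_i; beta_old)
    w = prob * (1 - prob)                        # diagonal of W
    z = eta + (y - prob) / w                     # adjusted response
    beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    if np.max(np.abs(beta_new - beta)) < 1e-8:   # stop when the Newton step is tiny
        beta = beta_new
        break
    beta = beta_new

print(beta)
```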

LDA or Logistic Regression

LDA: $\log\frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \log\frac{\pi_k}{\pi_K} - \tfrac12(\mu_k + \mu_K)^T\Sigma^{-1}(\mu_k - \mu_K) + x^T\Sigma^{-1}(\mu_k - \mu_K) = \alpha_{k0} + \alpha_k^Tx.$

Logistic regression: $\log\frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \beta_{k0} + \beta_k^Tx.$

Although they have exactly the same form, the difference lies in the way the linear coefficients are estimated. The logistic regression model is more general, in that it makes fewer assumptions.
Logistic regression fits the parameters by maximizing the conditional likelihood, i.e. the multinomial likelihood with probabilities P(G = k | X), where P(X) is ignored.
LDA fits the parameters by maximizing the full log-likelihood, based on the joint density $P(X, G = k) = \phi(X;\mu_k,\Sigma)\pi_k$, where P(X) does play a role since $P(X) = \sum_k P(X, G = k)$.
If we assume the $f_k(x)$ are Gaussian, this gives a more efficient way to estimate the parameters.
It is generally felt that logistic regression is a safer, more robust bet than the LDA model, relying on fewer assumptions. It is our experience that the models give very similar results, even when LDA is used inappropriately.

Any Questions?
