which is referred to as simple linear regression. We would be interested in estimating β0 and β1 from the data we collect.
Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc.
[Figure: scatter plot of house price (in $25K increments) versus house size]
[Figures: scatter plots of Test 2 scores versus Test 1 scores, and of Test B2 scores versus Test B1 scores]
[Figures: four scatter plots of height versus weight]
Deterministic Model: an equation or set of equations that allow us to fully determine the value of the dependent variable from the values of the independent variables.
y = $25,000 + ($75/ft²)(x)
Area of a circle: A = πr²
Probabilistic Model: a method used to capture the randomness that is part of a real-life process.
y = 25,000 + 75x + ε
E.g. do all houses of the same size (measured in square feet) sell for exactly the same price?
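The contrast can be sketched in Python. This is a minimal illustration: the $25,000 + $75/ft² line comes from the slide, while the normal error distribution and its spread are assumptions made for the sketch.

```python
import random

random.seed(1)

def simulated_price(size_sqft, error_sd=5000.0):
    """Probabilistic model y = 25,000 + 75x + epsilon.

    epsilon is drawn from a normal distribution with mean 0; the
    choice of distribution and its spread are illustrative assumptions.
    With error_sd=0 the model collapses to the deterministic line.
    """
    epsilon = random.gauss(0.0, error_sd)
    return 25000.0 + 75.0 * size_sqft + epsilon

# Two houses of the same size need not sell for the same price:
p1 = simulated_price(2000)
p2 = simulated_price(2000)
print(p1 != p2)  # the random error term makes each sale differ
```

Setting `error_sd=0` recovers the deterministic model, which is the only thing the equation pins down exactly.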
[Figure: a straight line y = β0 + β1x, where β1 = slope (rise/run) and β0 = y-intercept]
(This is an application of the least squares method, and it produces a straight line that minimizes the sum of the squared differences between the points and the line.)
[Figure: scatter plot with a fitted straight line; the vertical differences between the points and the line are called residuals or errors. The line of best fit minimizes the sum of the squared differences between the points and the line.]
How did we get the .934 and 2.114 for the y-intercept and slope of the equation of the drawn line?
The coefficients b1 and b0 for the least squares line are calculated as:
b1 = s_xy / s_x²  and  b0 = ȳ − b1·x̄
where s_xy = Σ(xi − x̄)(yi − ȳ) / (n − 1) and s_x² = Σ(xi − x̄)² / (n − 1).
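A minimal sketch of the least squares computation, assuming the standard formulas b1 = s_xy / s_x² and b0 = ȳ − b1·x̄ (the (n − 1) factors cancel in the ratio); the data points below are made up for illustration:

```python
def least_squares(x, y):
    """Least squares coefficients: b1 = s_xy / s_x^2, b0 = ybar - b1 * xbar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # The (n - 1) divisors in s_xy and s_x^2 cancel, so raw sums suffice.
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = ybar - b1 * xbar
    return b0, b1

# Illustrative data (not the textbook's example):
b0, b1 = least_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(b0, b1)  # y-intercept and slope of the fitted line
```

Points that already lie on a straight line recover that line exactly, e.g. `least_squares([0, 1, 2], [1, 3, 5])` gives intercept 1 and slope 2.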
Least Squares Line
See if you can estimate the y-intercept and slope from this data.
[Figure: scatter plot of the data points with the fitted line y = .934 + 2.114x]
Required Conditions
For these regression methods to be valid, the following four conditions for the error variable (ε) must be met:
The probability distribution of ε is normal.
The mean of the distribution is 0; that is, E(ε) = 0.
The standard deviation of ε is σε, which is a constant regardless of the value of x.
The value of ε associated with any particular value of y is independent of ε associated with any other value of y.
The sum of squares for error is calculated as:
SSE = Σ(yi − ŷi)²
and is used in the calculation of the standard error of estimate:
sε = √(SSE / (n − 2))
If sε is zero, all the points fall on the regression line.
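These two quantities follow directly from their definitions: SSE as the sum of squared residuals, and sε = √(SSE / (n − 2)). A sketch with an assumed fitted line and illustrative data:

```python
import math

def sse_and_se(x, y, b0, b1):
    """SSE and standard error of estimate for a given fitted line y = b0 + b1*x."""
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(r * r for r in residuals)       # sum of squared errors
    s_e = math.sqrt(sse / (len(x) - 2))       # standard error of estimate
    return sse, s_e

# Points that fall exactly on y = 1 + 2x give SSE = 0 and s_e = 0,
# matching the remark above that s_e = 0 means a perfect fit:
print(sse_and_se([0, 1, 2], [1, 3, 5], b0=1, b1=2))
```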
Standard Error
If sε is small, the fit is excellent and the linear model should be used for forecasting. If sε is large, the model is poor.
But what is small and what is large?
Standard Error
Judge the value of sε by comparing it to the sample mean of the dependent variable (ȳ).
In this example, sε = .3265 and ȳ = 14.841, so (relatively speaking) it appears to be small; hence our linear regression model of car price as a function of odometer reading is good.
If the error variable (ε) is normally distributed, the test statistic has a Student t distribution with n − 2 degrees of freedom. The rejection region depends on whether we're doing a one- or two-tail test (a two-tail test is most typical).
Example 17.4
Test to determine if the slope is significantly different from 0 (at the 5% significance level).
We want to test:
H0: β1 = 0
H1: β1 ≠ 0
(If the null hypothesis is true, no linear relationship exists.)
The rejection region is |t| > t(α/2, n−2), OR check the p-value.
Example 17.4 (COMPUTE)
We can compute t manually or refer to our Excel output.
We see that the t statistic for odometer (i.e. the slope, b1) is −13.49, which in absolute value is greater than t Critical = 1.984. We also note that the p-value is 0.000.
There is overwhelming evidence to infer that a linear relationship between odometer reading and price exists.
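The manual computation can be sketched as t = b1 / s(b1), with s(b1) = sε / √Σ(xi − x̄)². The data below are illustrative, not the Ford Taurus sample from the example:

```python
import math

def slope_t_stat(x, y):
    """t statistic for H0: beta1 = 0, with n - 2 degrees of freedom."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / s_xx
    b0 = ybar - b1 * xbar
    # Standard error of estimate, then standard error of the slope:
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    s_b1 = s_e / math.sqrt(s_xx)
    return b1 / s_b1, n - 2

t, df = slope_t_stat([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(t, 3), "with", df, "degrees of freedom")
```

Compare |t| against the critical value t(α/2, n−2) for the chosen significance level, exactly as the Excel output does.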
Hence, the interval estimate b1 ± t(α/2) · s(b1) gives (−.0768, −.0570). That is, we estimate that the slope coefficient lies between −.0768 and −.0570.
Coefficient of Determination
Tests thus far have shown if a linear relationship exists; it is also useful to measure the strength of the relationship. This is done by calculating the coefficient of determination R².
The coefficient of determination is the square of the coefficient of correlation (r); hence R² = (r)².
(r will be computed shortly, and this is true for models with only one independent variable.)
Coefficient of Determination
R² has a value of .6483. This means 64.83% of the variation in the auction selling prices (y) is explained by our regression model. The remaining 35.17% is unexplained, i.e. due to error.
Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions.
In general, the higher the value of R², the better the model fits the data.
R² = 1: perfect match between the line and the data points.
R² = 0: there is no linear relationship between x and y.
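That R² equals r² for a one-variable model can be checked numerically. In this sketch R² is computed as 1 − SSE/SST and r as s_xy / (s_x · s_y); the data are illustrative:

```python
import math

def r_squared_and_r(x, y):
    """Coefficient of determination (1 - SSE/SST) and correlation coefficient r."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    s_yy = sum((yi - ybar) ** 2 for yi in y)   # SST: total variation in y
    b1 = s_xy / s_xx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    r2 = 1 - sse / s_yy
    r = s_xy / math.sqrt(s_xx * s_yy)
    return r2, r

r2, r = r_squared_and_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(r2, r * r)  # the two values agree for a one-variable model
```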
Source       degrees of freedom   Sums of Squares        Mean Squares        F-Statistic
Regression   1                    SSR                    MSR = SSR/1         F = MSR/MSE
Error        n − 2                SSE                    MSE = SSE/(n − 2)
Total        n − 1                Variation in y (SST)
Prediction Interval
The prediction interval is used when we want to predict one particular value of the dependent variable, given a specific value of the independent variable:
ŷ ± t(α/2, n−2) · sε · √(1 + 1/n + (xg − x̄)² / ((n − 1)s_x²))
(xg is the given value of x we're interested in)
(Technically this formula is used for infinitely large populations. However, we can interpret our problem as attempting to determine the average selling price of all Ford Tauruses, all with 40,000 miles on the odometer.)
Prediction Interval (the "1" appears under the square root): used to estimate one value of y (at a given x).
Confidence Interval (no "1" under the square root): used to estimate the mean value of y (at a given x).
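The only difference between the two intervals is the extra "1" under the square root. This sketch computes the half-width of each; the critical t value is passed in rather than looked up, an assumption made to keep the example self-contained:

```python
import math

def interval_half_widths(x, s_e, x_g, t_crit):
    """Half-widths of the prediction interval (one y) and the
    confidence interval for the mean of y, at a given x_g."""
    n = len(x)
    xbar = sum(x) / n
    s_xx = sum((xi - xbar) ** 2 for xi in x)   # equals (n - 1) * s_x^2
    core = 1 / n + (x_g - xbar) ** 2 / s_xx
    pred = t_crit * s_e * math.sqrt(1 + core)  # one value of y: extra "1"
    mean = t_crit * s_e * math.sqrt(core)      # mean value of y: no "1"
    return pred, mean

pred, mean = interval_half_widths([1, 2, 3, 4, 5], s_e=0.5, x_g=3, t_crit=2.0)
print(pred > mean)  # the prediction interval is always the wider of the two
```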
Regression Diagnostics
There are three conditions that are required in order to perform a regression analysis. These are:
the error variable must be normally distributed,
the error variable must have a constant variance, &
the errors must be independent of each other.
How can we diagnose violations of these conditions?
Residual Analysis, that is, examine the differences between the actual data points and those predicted by the linear equation.
Nonnormality
We can take the residuals and put them into a histogram to visually check for normality; we're looking for a bell-shaped histogram with the mean close to zero [our old test for normality].
Heteroscedasticity
When the requirement of a constant variance is violated, we have a condition of heteroscedasticity.
We can diagnose heteroscedasticity by plotting the residuals against the predicted values of y.
Heteroscedasticity
If the variance of the error variable (ε) is not constant, then we have heteroscedasticity. Here's the plot of the residuals against the predicted values of y:
When the data are time series, the errors often are correlated. Error terms that are correlated over time are said to be autocorrelated or serially correlated.
We can often detect autocorrelation by graphing the residuals against the time periods. If a pattern emerges, it is likely that the independence requirement is violated.
Outliers
Possible reasons for the existence of outliers include:
there was an error in recording the value,
the point should not have been included in the sample, or
perhaps the observation is indeed valid.
Outliers can be easily identified from a scatter plot.
If the absolute value of the standard residual is > 2, we suspect the point may be an outlier and investigate further.
They need to be dealt with, since they can easily influence the least squares line.
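The "|standard residual| > 2" screen can be sketched as follows. Here the standardized residual is approximated as residual / sε, a simplification of the full formula; the data contain one planted outlier for illustration:

```python
import math

def flag_outliers(x, y, b0, b1):
    """Indices of points whose |residual / s_e| exceeds 2 for the
    given fitted line y = b0 + b1*x (approximate standardized residuals)."""
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(r * r for r in residuals) / (len(x) - 2))
    return [i for i, r in enumerate(residuals) if abs(r / s_e) > 2]

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3, 5, 7, 9, 11, 13, 30, 17]  # the point at index 6 sits far off the line y = 1 + 2x
print(flag_outliers(x, y, b0=1, b1=2))
```

Flagged points should be investigated, not automatically discarded, since (as noted above) the observation may be perfectly valid.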