
A Tutorial on Regression

I put together a tutorial on linear regression, with derivations, examples and all, to fill in the needed background. I'll start with a simple example to show how and why all the formulas come about, adjust the example a bit, and then dive into 3 interesting real-world examples, one of which involves an Olympic running event.
All the regression-related topics have been bundled on Wikipedia under the following link:
Outline of Regression Analysis
http://en.wikipedia.org/wiki/Outline_of_regression_analysis
The topics listed on that page most closely connected with our material are the following:
Non-Statistical:
Linear Regression, Least Squares, Non-linear Least Squares, Curve Fitting
Statistical:
Correlation, Correlation Coefficient,
Mean Square Error, Residual Sum of Squares, Total Sum of Squares
There are a lot of other topics, including those that address issues such as goodness of fit, or when/where/how linear regression models break down.
One Variable
Example 1: An actual linear relation: $y = -4 + 3x$
x:   0   1   2   3   4
y:  -4  -1   2   5   8
If you do linear regression to try and match this to $y = a + bx$, then you should get $a = -4$ and $b = 3$, along with some indication that this is a perfect match. The coefficients $a$ and $b$ are called parameters, and the goal is to adjust them to find the best fit. The model is an example of what is called a parametric model.
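Step 1: Start by taking the averages of the linear and quadratic combinations:
$$\langle 1 \rangle = \frac{1+1+1+1+1}{5} = 1, \qquad \langle x^2 \rangle = \frac{0^2+1^2+2^2+3^2+4^2}{5} = 6,$$
$$\langle x \rangle = \frac{0+1+2+3+4}{5} = 2, \qquad \langle xy \rangle = \frac{0(-4)+1(-1)+2(2)+3(5)+4(8)}{5} = 10,$$
$$\langle y \rangle = \frac{(-4)+(-1)+2+5+8}{5} = 2, \qquad \langle y^2 \rangle = \frac{(-4)^2+(-1)^2+2^2+5^2+8^2}{5} = 22.$$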

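You don't actually need $\langle 1 \rangle$. I just put it in to illustrate the process. It is the 0th moment of $x$ and $y$. The averages $\langle x \rangle$ and $\langle y \rangle$ are the 1st moments of $x$ and $y$, and $\langle x^2 \rangle$, $\langle xy \rangle$, $\langle y^2 \rangle$ are the 2nd moments of $x$ and $y$.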
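Step 2: Subtract out the matching combinations of the 1st moments from the 2nd moments:
$$\sigma_x^2 = \langle x^2 \rangle - \langle x \rangle \langle x \rangle = 6 - 2 \cdot 2 = 2 \quad \text{(variance of } x\text{)},$$
$$\sigma_{xy} = \langle xy \rangle - \langle x \rangle \langle y \rangle = 10 - 2 \cdot 2 = 6 \quad \text{(covariance of } x \text{ and } y\text{)},$$
$$\sigma_y^2 = \langle y^2 \rangle - \langle y \rangle \langle y \rangle = 22 - 2 \cdot 2 = 18 \quad \text{(variance of } y\text{)}.$$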



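You're probably used to seeing these defined in a slightly different way. You can arrive at the same result by first subtracting out the averages
$\Delta x = x - \langle x \rangle$:  -2  -1   0   1   2
$\Delta y = y - \langle y \rangle$:  -6  -3   0   3   6
and then taking their 2nd moments:
$$\sigma_x^2 = \langle \Delta x^2 \rangle = \frac{(-2)^2+(-1)^2+0^2+1^2+2^2}{5} = 2,$$
$$\sigma_{xy} = \langle \Delta x \, \Delta y \rangle = \frac{(-2)(-6)+(-1)(-3)+0 \cdot 0+1 \cdot 3+2 \cdot 6}{5} = 6,$$
$$\sigma_y^2 = \langle \Delta y^2 \rangle = \frac{(-6)^2+(-3)^2+0^2+3^2+6^2}{5} = 18.$$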

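The indication of how well the linear fit works already shows itself here: it is determined by the square of the correlation between $x$ and $y$:
$$r_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}, \qquad r_{xy}^2 = \frac{\sigma_{xy}^2}{\sigma_x^2 \sigma_y^2} = \frac{6^2}{2 \cdot 18} = 1 = 100\%.$$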
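Steps 1 and 2 are easy to check numerically. The following is a minimal Python sketch, assuming the Example 1 data $x = 0, \dots, 4$ and $y = -4 + 3x$; the helper `mean` and all variable names are illustrative choices, not part of the tutorial.

```python
# Example 1 data: y = -4 + 3x sampled at x = 0..4
xs = [0, 1, 2, 3, 4]
ys = [-4, -1, 2, 5, 8]

def mean(values):
    """Plain average of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)

# Step 1: first and second moments
mx, my = mean(xs), mean(ys)
mxx = mean(x * x for x in xs)
mxy = mean(x * y for x, y in zip(xs, ys))
myy = mean(y * y for y in ys)

# Step 2: (co)variances obtained from the moments
var_x = mxx - mx * mx
cov_xy = mxy - mx * my
var_y = myy - my * my

# The same quantities, obtained by subtracting the averages first
dx = [x - mx for x in xs]
dy = [y - my for y in ys]
var_x_c = mean(d * d for d in dx)
cov_xy_c = mean(p * q for p, q in zip(dx, dy))
var_y_c = mean(d * d for d in dy)

# Squared correlation between x and y
r2 = cov_xy ** 2 / (var_x * var_y)
print(var_x, cov_xy, var_y, r2)
```

Both routes give the variances 2 and 18, the covariance 6, and a squared correlation of 1.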



Step 3: The Calculus Step
The best-fitting values for the parameters are determined by minimizing the average square difference between $y$ and $a + bx$:
$$\langle (y - (a + bx))^2 \rangle = \langle y^2 - 2ay - 2bxy + a^2 + 2abx + b^2 x^2 \rangle$$
$$= \langle y^2 \rangle - 2a \langle y \rangle - 2b \langle xy \rangle + a^2 + 2ab \langle x \rangle + b^2 \langle x^2 \rangle$$
$$= 22 - 2a \cdot 2 - 2b \cdot 10 + a^2 + 2ab \cdot 2 + b^2 \cdot 6$$
$$= 22 - 4a - 20b + a^2 + 4ab + 6b^2.$$
The minimum occurs at the stationary point for the parameters $a$ and $b$, which is where their derivatives are each 0. To find it, take the derivative with respect to each parameter, holding the other constant:
$$0 = \left( \frac{d}{da} \right)_{b \text{ constant}} (22 - 4a - 20b + a^2 + 4ab + 6b^2) = -4 + 2a + 4b = 2(a + 2b - 2),$$
$$0 = \left( \frac{d}{db} \right)_{a \text{ constant}} (22 - 4a - 20b + a^2 + 4ab + 6b^2) = -20 + 4a + 12b = 4(a + 3b - 5),$$
$$\Rightarrow \quad a = -4, \quad b = 3.$$

If you substitute back to find what the minimum actually is, you get:
$$\langle (y - (a + bx))^2 \rangle = 22 - 4a - 20b + a^2 + 4ab + 6b^2 = 22 + 16 - 60 + 16 - 48 + 54 = 0,$$
which again indicates a perfect fit.
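The calculus step can be checked with a few lines of Python. This is only a sketch: it hard-codes the expanded Example 1 objective $22 - 4a - 20b + a^2 + 4ab + 6b^2$ and solves the two stationary conditions by Cramer's rule; the function name `objective` is an illustrative choice.

```python
def objective(a, b):
    """Average squared error for Example 1, in its expanded form."""
    return 22 - 4 * a - 20 * b + a * a + 4 * a * b + 6 * b * b

# Stationary conditions, rearranged:  2a + 4b = 4  and  4a + 12b = 20.
# Solve this 2x2 linear system by Cramer's rule.
det = 2 * 12 - 4 * 4
a = (4 * 12 - 4 * 20) / det
b = (2 * 20 - 4 * 4) / det

print(a, b, objective(a, b))
```

Nudging either parameter away from the solution makes `objective` strictly positive, which is a quick way to convince yourself that the stationary point is a minimum.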


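Alternate Step 3: The same result follows by dropping the constant coefficient, working with $\Delta x = x - \langle x \rangle$ and $\Delta y = y - \langle y \rangle$ instead, to find $b$ and then back-substituting in the equation $\langle y \rangle = a + b \langle x \rangle$ to get the constant coefficient afterwards:
$$\langle (\Delta y - b \, \Delta x)^2 \rangle = \sigma_y^2 - 2b \, \sigma_{xy} + b^2 \sigma_x^2 = 18 - 2b \cdot 6 + b^2 \cdot 2 = 18 - 12b + 2b^2.$$
The stationary point for the parameter $b$ is determined by:
$$0 = \frac{d}{db} \langle (\Delta y - b \, \Delta x)^2 \rangle = -2 \sigma_{xy} + 2b \, \sigma_x^2 = -12 + 4b \quad \Rightarrow \quad b = \frac{\sigma_{xy}}{\sigma_x^2} = \frac{6}{2} = 3.$$
Afterwards, you get:
$$a = \langle y \rangle - b \langle x \rangle = 2 - 3(2) = -4.$$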
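Step 4: The goodness of fit is the actual minimum value itself:
$$\langle (\Delta y - b \, \Delta x)^2 \rangle = \sigma_y^2 - 2b \, \sigma_{xy} + b^2 \sigma_x^2 = \sigma_y^2 - 2 \frac{\sigma_{xy}^2}{\sigma_x^2} + \frac{\sigma_{xy}^2}{\sigma_x^2} = \sigma_y^2 - \frac{\sigma_{xy}^2}{\sigma_x^2}.$$
The value has two other equivalent forms:
$$\langle (\Delta y - b \, \Delta x)^2 \rangle = \sigma_y^2 (1 - r_{xy}^2), \qquad \langle (\Delta y - b \, \Delta x)^2 \rangle = \sigma_y^2 - b \, \sigma_{xy}.$$
So, the way this is interpreted is that the model has accounted for the correlation between $x$ and $y$ by subtracting $b \, \sigma_{xy}$, and that the portion extracted from the variance $\sigma_y^2$, as a result, is given by $r_{xy}^2$ (which, for this model, was 100%).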
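So, to recap (I merged steps 3 and 4 in the table below):
To fit $y$ to $a + bx$
Step 1: Take 1st and 2nd moments: $\langle x \rangle$, $\langle y \rangle$, $\langle x^2 \rangle$, $\langle xy \rangle$, $\langle y^2 \rangle$
Step 2: Convert the second moments to (co-)variances:
$$\sigma_x^2 = \langle x^2 \rangle - \langle x \rangle^2, \qquad \sigma_{xy} = \langle xy \rangle - \langle x \rangle \langle y \rangle, \qquad \sigma_y^2 = \langle y^2 \rangle - \langle y \rangle^2$$
Step 3: Get the slope, constant coefficient, and reduced variance:
$$b = \frac{\sigma_{xy}}{\sigma_x^2}, \qquad a = \langle y \rangle - b \langle x \rangle, \qquad \langle (y - (a + bx))^2 \rangle = \sigma_y^2 - b \, \sigma_{xy} = \sigma_y^2 (1 - r_{xy}^2).$$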
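The three-step recap translates almost line for line into code. Here is a minimal Python sketch; the function name `fit_line` and its return convention are illustrative assumptions, not something the tutorial prescribes.

```python
def fit_line(xs, ys):
    """Fit y = a + b*x; return (a, b, reduced_variance, r_squared)."""
    n = len(xs)
    # Step 1: first and second moments
    mx = sum(xs) / n
    my = sum(ys) / n
    mxx = sum(x * x for x in xs) / n
    mxy = sum(x * y for x, y in zip(xs, ys)) / n
    myy = sum(y * y for y in ys) / n
    # Step 2: convert second moments to (co)variances
    var_x = mxx - mx * mx
    cov_xy = mxy - mx * my
    var_y = myy - my * my
    # Step 3: slope, constant coefficient, reduced variance
    b = cov_xy / var_x
    a = my - b * mx
    reduced = var_y - b * cov_xy
    r2 = cov_xy ** 2 / (var_x * var_y)
    return a, b, reduced, r2

# The exactly linear data y = -4 + 3x at x = 0..4
a1, b1, red1, r2_1 = fit_line([0, 1, 2, 3, 4], [-4, -1, 2, 5, 8])
# The curved data y = 4 - 3x + x^2 at x = 0..4, forced onto a straight line
a2, b2, red2, r2_2 = fit_line([0, 1, 2, 3, 4], [4, 2, 2, 4, 8])
print(a1, b1, red1, r2_1)
print(a2, b2, red2, r2_2)
```

The first call recovers the exact line with zero reduced variance; the second shows how a straight line captures only part of the variance of a curved relation.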


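Example 2: A linear fit against $y = 4 - 3x + x^2$
x:   0   1   2   3   4
y:   4   2   2   4   8
The solution:
Step 1: $\langle x \rangle = 2$, $\langle y \rangle = 4$, $\langle x^2 \rangle = 6$, $\langle xy \rangle = 10$, $\langle y^2 \rangle = 20.8$
Step 2: $\sigma_x^2 = 6 - 2^2 = 2$, $\sigma_{xy} = 10 - 2 \cdot 4 = 2$, $\sigma_y^2 = 20.8 - 4^2 = 4.8$
Step 3: $b = \dfrac{\sigma_{xy}}{\sigma_x^2} = \dfrac{2}{2} = 1$, $a = \langle y \rangle - b \langle x \rangle = 4 - 1 \cdot 2 = 2$, $r_{xy}^2 = \dfrac{2^2}{2 \cdot 4.8} = \dfrac{5}{12} \approx 41.7\%$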








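Two Variables
Example 3: A quadratic fit against $y = 4 - 3x + x^2$
This time, take example 2 and fit it against the curve $y = a + bx + cx^2$. Do this by treating $x^2$ as a second independent variable $w = x^2$ and doing the fit with $y = a + bx + cw$.
x:        0   1   2   3   4
w = x^2:  0   1   4   9  16
y:        4   2   2   4   8
The first three steps are all the same, but there are more combinations:
Step 1: $\langle w \rangle = 6$, $\langle x \rangle = 2$, $\langle y \rangle = 4$, $\langle w^2 \rangle = 70.8$, $\langle wx \rangle = 20$, $\langle wy \rangle = 34.8$, $\langle x^2 \rangle = 6$, $\langle xy \rangle = 10$, $\langle y^2 \rangle = 20.8$
Step 2:
$$\sigma_w^2 = 70.8 - 6^2 = 34.8, \qquad \sigma_{wx} = 20 - 6 \cdot 2 = 8, \qquad \sigma_{wy} = 34.8 - 6 \cdot 4 = 10.8,$$
$$\sigma_x^2 = 6 - 2^2 = 2, \qquad \sigma_{xy} = 10 - 2 \cdot 4 = 2, \qquad \sigma_y^2 = 20.8 - 4^2 = 4.8.$$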



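This time, for step 3, if you go back to the calculus problem and rework the derivation, you end up getting the following equations, which I put alongside the equations for 1 variable to draw the comparison:
1 Variable: $y = a + bx$
$$\sigma_{xy} = b \, \sigma_x^2,$$
$$a = \langle y \rangle - b \langle x \rangle,$$
$$\langle (y - (a + bx))^2 \rangle = \sigma_y^2 - b \, \sigma_{xy}.$$
2 Variables: $y = a + bx + cw$
$$\sigma_{xy} = b \, \sigma_x^2 + c \, \sigma_{xw}, \qquad \sigma_{wy} = b \, \sigma_{wx} + c \, \sigma_w^2,$$
$$a = \langle y \rangle - b \langle x \rangle - c \langle w \rangle,$$
$$\langle (y - (a + bx + cw))^2 \rangle = \sigma_y^2 - b \, \sigma_{xy} - c \, \sigma_{wy}.$$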


The main difference is the change in the last equation: the reduced variance is no longer directly related to the correlations $r_{xy}^2$ or $r_{wy}^2$. The total effect of extracting out two variables is different from the two individual effects because the variables $x$ and $w$ also have a correlation $r_{wx}$ with each other. The actual expression is:
$$\text{Fit \%} = \frac{b \, \sigma_{xy} + c \, \sigma_{wy}}{\sigma_y^2} = \frac{r_{xy}^2 + r_{wy}^2 - 2 \, r_{xy} \, r_{wy} \, r_{wx}}{1 - r_{wx}^2}.$$
Without the $r_{wx}$ cross-correlation terms, this would just be the sum $r_{xy}^2 + r_{wy}^2$ of the two individual fits for $x$ and $w$.
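To see the cross-correlation correction at work, the expression can be evaluated for the Example 3 data ($x = 0, \dots, 4$, $w = x^2$, $y = 4 - 3x + x^2$). The sketch below plugs in the (co)variances of that data set directly; the variable names are illustrative.

```python
from math import sqrt

# (Co)variances of the Example 3 data set
var_x, var_w, var_y = 2.0, 34.8, 4.8
cov_xw, cov_xy, cov_wy = 8.0, 2.0, 10.8

# Pairwise correlations
r_xy = cov_xy / sqrt(var_x * var_y)
r_wy = cov_wy / sqrt(var_w * var_y)
r_wx = cov_xw / sqrt(var_x * var_w)

# Two-variable fit, including the cross-correlation between x and w
fit = (r_xy**2 + r_wy**2 - 2 * r_xy * r_wy * r_wx) / (1 - r_wx**2)

# What you would get by simply adding the two one-variable fits
naive_sum = r_xy**2 + r_wy**2

print(fit, naive_sum)
```

The corrected expression comes out at 1 (a perfect fit), whereas the naive sum of the two individual fits exceeds 100%, which cannot be a valid fraction of the variance.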
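So, here is the modified step 3:
Step 3:
$$\sigma_{xy} = b \, \sigma_x^2 + c \, \sigma_{xw}: \quad 2 = 2b + 8c,$$
$$\sigma_{wy} = b \, \sigma_{wx} + c \, \sigma_w^2: \quad 10.8 = 8b + 34.8c,$$
$$\Rightarrow \quad b = -3, \quad c = 1,$$
$$a = \langle y \rangle - b \langle x \rangle - c \langle w \rangle = 4 - 2b - 6c = 4,$$
$$\text{Fit \%} = \frac{b \, \sigma_{xy} + c \, \sigma_{wy}}{\sigma_y^2} = \frac{(-3)(2) + (1)(10.8)}{4.8} = 100\%.$$
So, it's a perfect fit.
So, to recap: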
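To fit $y$ to $a + bx + cw$
Step 1: Take 1st and 2nd moments: $\langle w \rangle$, $\langle x \rangle$, $\langle y \rangle$, $\langle w^2 \rangle$, $\langle wx \rangle$, $\langle wy \rangle$, $\langle x^2 \rangle$, $\langle xy \rangle$, $\langle y^2 \rangle$
Step 2: Convert the second moments to (co-)variances:
$$\sigma_w^2 = \langle w^2 \rangle - \langle w \rangle^2, \qquad \sigma_{wx} = \langle wx \rangle - \langle w \rangle \langle x \rangle, \qquad \sigma_{wy} = \langle wy \rangle - \langle w \rangle \langle y \rangle,$$
$$\sigma_x^2 = \langle x^2 \rangle - \langle x \rangle^2, \qquad \sigma_{xy} = \langle xy \rangle - \langle x \rangle \langle y \rangle, \qquad \sigma_y^2 = \langle y^2 \rangle - \langle y \rangle^2.$$
Step 3: Set up the equations for the parameters, solve and back-substitute:
$$b \, \sigma_x^2 + c \, \sigma_{xw} = \sigma_{xy}, \qquad b \, \sigma_{wx} + c \, \sigma_w^2 = \sigma_{wy},$$
$$a = \langle y \rangle - b \langle x \rangle - c \langle w \rangle,$$
$$\langle (y - (a + bx + cw))^2 \rangle = \sigma_y^2 - b \, \sigma_{xy} - c \, \sigma_{wy}.$$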
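The two-variable recap can be sketched in Python as follows. The names `mean`, `cov` and `fit_two` are illustrative, and solving the 2x2 system by Cramer's rule is just one convenient choice.

```python
def mean(values):
    """Plain average of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)

def cov(us, vs):
    """Covariance <uv> - <u><v>; gives the variance when us and vs coincide."""
    return mean(u * v for u, v in zip(us, vs)) - mean(us) * mean(vs)

def fit_two(ws, xs, ys):
    """Fit y = a + b*x + c*w; return (a, b, c, fit_fraction)."""
    # Step 2: (co)variances (the moments of Step 1 are taken inside cov)
    sxx, sxw, sww = cov(xs, xs), cov(xs, ws), cov(ws, ws)
    sxy, swy, syy = cov(xs, ys), cov(ws, ys), cov(ys, ys)
    # Step 3: solve  b*sxx + c*sxw = sxy,  b*sxw + c*sww = swy
    det = sxx * sww - sxw * sxw
    b = (sxy * sww - sxw * swy) / det
    c = (sxx * swy - sxw * sxy) / det
    a = mean(ys) - b * mean(xs) - c * mean(ws)
    fit = (b * sxy + c * swy) / syy
    return a, b, c, fit

# Example 3: y = 4 - 3x + x^2, with w = x^2 as the second variable
xs = [0, 1, 2, 3, 4]
ws = [x * x for x in xs]
ys = [4, 2, 2, 4, 8]
a, b, c, fit = fit_two(ws, xs, ys)
print(a, b, c, fit)
```

Up to floating-point rounding, the Example 3 data give back the coefficients 4, -3 and 1 with a fit of 100%.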
Example 4: A Synergistic Fit
It is possible for two variables combined to produce a better fit than the sum of their individual fits:
w:  -2  -1   0    1   2
x:   6   5   2  -13   0
y:   4  -1  -6   -1   4
Best Fits
For $y = a + cw$: $a = 0$, $c = 0$, Fit: 0%
For $y = a + bx$: $a = 0$, $b = 0.085$, Fit: 2.4%
For $y = a + bx + cw$: $a = 0$, $b = 0.139$, $c = 0.417$, Fit: 4.0%
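The synergy can be checked numerically. The sketch below assumes the data set $w = (-2, -1, 0, 1, 2)$, $x = (6, 5, 2, -13, 0)$, $y = (4, -1, -6, -1, 4)$; the signs of the individual values are an assumption chosen to be consistent with the quoted best fits, and the helper names are illustrative.

```python
def mean(values):
    """Plain average of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)

def cov(us, vs):
    """Covariance <uv> - <u><v>; gives the variance when us and vs coincide."""
    return mean(u * v for u, v in zip(us, vs)) - mean(us) * mean(vs)

# Assumed data (signs chosen to match the quoted fits)
w = [-2, -1, 0, 1, 2]
x = [6, 5, 2, -13, 0]
y = [4, -1, -6, -1, 4]

# One-variable fits: squared correlations with y
fit_w = cov(w, y) ** 2 / (cov(w, w) * cov(y, y))
fit_x = cov(x, y) ** 2 / (cov(x, x) * cov(y, y))

# Two-variable fit: solve the 2x2 system for b (on x) and c (on w)
det = cov(x, x) * cov(w, w) - cov(x, w) ** 2
b = (cov(x, y) * cov(w, w) - cov(x, w) * cov(w, y)) / det
c = (cov(x, x) * cov(w, y) - cov(x, w) * cov(x, y)) / det
fit_both = (b * cov(x, y) + c * cov(w, y)) / cov(y, y)

print(fit_w, fit_x, fit_both)
```

On this data, $w$ alone explains nothing and $x$ alone about 2.4%, yet together they explain about 4.0%, because $w$ cancels part of $x$ that is unrelated to $y$.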
Three or More Variables
For three or more variables, doing this by calculator involves a lot of work. By contrast, as a computer program, it's a short half-page routine that directly translates the steps listed in the recap boxes and takes about a minute to write up from scratch. In addition, these routines and much more advanced ones are contained in typical statistics software packages, such as SPSS.
The following examples were done by computer, not on a calculator or by hand. I will only spell out the detailed results for one 3-variable regression in the examples below.
Example 5: Men versus Women in the Olympic 200 Meter Sprint
The first Olympics in 776 BC had only 1 event: the one that's equivalent to today's 200 meter run. Men have run it in today's Olympics since 1900, and women since 1948. So, question: are women catching up to men? Could there even be a cross-over point?
A quick look at the graph and you'll see the moral of this example: polynomials cannot be used for long-term trends because all polynomials go to infinity over the long term, unless they are constant. (The example following this one uses sines and cosines instead.)
Here is the list for the Olympic 200 meter winning times in the 20th century (2000 is in the 20th century ... but 1900 is in the 19th century, oops).
Year    D      Men     Women
1900   -5.0    22.2
1904   -4.6    21.6
1908   -4.2    22.6
1912   -3.8    21.7
1920   -3.0    22.0
1924   -2.6    21.6
1928   -2.2    21.8
1932   -1.8    21.2
1936   -1.4    20.7
1948   -0.2    21.1    24.4
1952    0.2    20.7    23.7
1956    0.6    20.6    23.4
1960    1.0    20.5    24.0
1964    1.4    20.3    23.0
1968    1.8    19.83   22.50
1972    2.2    20.00   22.40
1976    2.6    20.23   22.37
1980    3.0    20.19   22.03
1984    3.4    19.80   21.81
1988    3.8    19.75   21.34
1992    4.2    20.01   21.81
1996    4.6    19.32   22.12
2000    5.0    20.09   21.84
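As a first look at the "catching up" question, the straight-line fits for both series can be compared. This is a minimal Python sketch using the winning times listed above, with $D$ = decades after 1950; the helper `fit_line` and the idea of reading off where the two lines intersect are illustrative, and the intersection is exactly the kind of long-range polynomial extrapolation that should not be trusted.

```python
men_years = [1900, 1904, 1908, 1912, 1920, 1924, 1928, 1932, 1936, 1948, 1952, 1956,
             1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988, 1992, 1996, 2000]
men_times = [22.2, 21.6, 22.6, 21.7, 22.0, 21.6, 21.8, 21.2, 20.7, 21.1, 20.7, 20.6,
             20.5, 20.3, 19.83, 20.00, 20.23, 20.19, 19.80, 19.75, 20.01, 19.32, 20.09]
women_years = men_years[9:]          # women's event starts in 1948
women_times = [24.4, 23.7, 23.4, 24.0, 23.0, 22.50, 22.40, 22.37,
               22.03, 21.81, 21.34, 21.81, 22.12, 21.84]

def fit_line(ds, ts):
    """Fit t = a + b*D by the moment method; return (a, b)."""
    n = len(ds)
    md, mt = sum(ds) / n, sum(ts) / n
    cov_dt = sum(d * t for d, t in zip(ds, ts)) / n - md * mt
    var_d = sum(d * d for d in ds) / n - md * md
    b = cov_dt / var_d
    return mt - b * md, b

def decades(years):
    """Convert calendar years to D = decades after 1950."""
    return [(year - 1950) / 10 for year in years]

a_m, b_m = fit_line(decades(men_years), men_times)
a_w, b_w = fit_line(decades(women_years), women_times)

# Where the two straight lines would meet, if the trends simply continued
cross_d = (a_w - a_m) / (b_m - b_w)
cross_year = 1950 + 10 * cross_d
print(b_m, b_w, cross_year)
```

Both slopes are negative and the women's slope is the steeper one, so the two straight lines do cross at some point after 2000, but only because a straight line keeps falling forever.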
The regressions
Linear: $t = a + bD$
Quadratic: $t = a + bD + cD^2$
Cubic: $t = a + bD + cD^2 + dD^3$
($t$ = Winning Time (seconds), $D$ = Decades after 1950)
are done separately for men over the years 1900-2000 and for women over the years 1948-2000. The results are graphed below on this page. For each graph, the curve has been spread out with a width matching the reduced variance. The narrowest curves are those with the best % fits.
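The cubic fits are done by treating $w = x^2$ and $v = x^3$ as separate variables. This involves more combinations, all listed below:
To fit $y$ to $a + bx + cw + dv$
Step 1: Take 1st and 2nd moments: $\langle v \rangle$, $\langle w \rangle$, $\langle x \rangle$, $\langle y \rangle$, $\langle v^2 \rangle$, $\langle vw \rangle$, $\langle vx \rangle$, $\langle vy \rangle$, $\langle w^2 \rangle$, $\langle wx \rangle$, $\langle wy \rangle$, $\langle x^2 \rangle$, $\langle xy \rangle$, $\langle y^2 \rangle$
Step 2: Convert the second moments to (co-)variances:
$$\sigma_v^2, \quad \sigma_{vw}, \quad \sigma_{vx}, \quad \sigma_{vy},$$
$$\sigma_w^2, \quad \sigma_{wx}, \quad \sigma_{wy},$$
$$\sigma_x^2, \quad \sigma_{xy}, \quad \sigma_y^2.$$
Step 3: Set up the equations for the parameters, solve and back-substitute:
$$b \, \sigma_x^2 + c \, \sigma_{xw} + d \, \sigma_{xv} = \sigma_{xy},$$
$$b \, \sigma_{xw} + c \, \sigma_w^2 + d \, \sigma_{wv} = \sigma_{wy},$$
$$b \, \sigma_{vx} + c \, \sigma_{vw} + d \, \sigma_v^2 = \sigma_{vy},$$
$$a = \langle y \rangle - b \langle x \rangle - c \langle w \rangle - d \langle v \rangle,$$
$$\langle (y - (a + bx + cw + dv))^2 \rangle = \sigma_y^2 - b \, \sigma_{xy} - c \, \sigma_{wy} - d \, \sigma_{vy}.$$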
There are fewer women's times (14), so I'll illustrate the cubic fit with those.
v: -0.008 0.008 0.216 1.0 2.744 5.832 10.648 17.576 27 39.304 54.872 74.088 97.336 125
w: 0.04 0.04 0.36 1.0 1.96 3.24 4.84 6.76 9 11.56 14.44 17.64 21.16 25
x: -0.2 0.2 0.6 1.0 1.4 1.8 2.2 2.6 3 3.4 3.8 4.2 4.6 5
y: 24.4 23.7 23.4 24.0 23.0 22.5 22.4 22.37 22.03 21.81 21.34 21.81 22.12 21.84
The calculations were all done to 10 decimals, but are only being shown to 4. So, you may get results that are slightly different.
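Step 1:
$\langle v \rangle = 32.5440$, $\langle w \rangle = 8.3600$, $\langle x \rangle = 2.4000$, $\langle y \rangle = 22.6229$
$\langle v^2 \rangle = 2595.5681$, $\langle vw \rangle = 584.0678$, $\langle vx \rangle = 135.1184$, $\langle vy \rangle = 712.6430$
$\langle w^2 \rangle = 135.1184$, $\langle wx \rangle = 32.5440$, $\langle wy \rangle = 183.5211$
$\langle x^2 \rangle = 8.3600$, $\langle xy \rangle = 52.9951$, $\langle y^2 \rangle = 512.5990$
Step 2:
$\sigma_v^2 = 1536.4562$, $\sigma_{vw} = 312.0000$, $\sigma_{vx} = 57.0128$, $\sigma_{vy} = -23.5953$
$\sigma_w^2 = 65.2288$, $\sigma_{wx} = 12.4800$, $\sigma_{wy} = -5.6059$
$\sigma_x^2 = 2.6000$, $\sigma_{xy} = -1.2997$, $\sigma_y^2 = 0.8053$
Step 3:
$$\sigma_{xy} = b \, \sigma_x^2 + c \, \sigma_{xw} + d \, \sigma_{xv}: \quad -1.2997 = 2.6b + 12.48c + 57.0128d,$$
$$\sigma_{wy} = b \, \sigma_{xw} + c \, \sigma_w^2 + d \, \sigma_{wv}: \quad -5.6059 = 12.48b + 65.2288c + 312d,$$
$$\sigma_{vy} = b \, \sigma_{xv} + c \, \sigma_{wv} + d \, \sigma_v^2: \quad -23.5953 = 57.0128b + 312c + 1536.4562d,$$
$$\Rightarrow \quad b = -0.6392, \quad c = -0.1269, \quad d = 0.0341,$$
$$a = \langle y \rangle - b \langle x \rangle - c \langle w \rangle - d \langle v \rangle = 22.6229 - 2.4b - 8.36c - 32.544d = 24.1070,$$
$$\text{Fit \%} = \frac{b \, \sigma_{xy} + c \, \sigma_{wy} + d \, \sigma_{vy}}{\sigma_y^2} = \frac{(-0.6392)(-1.2997) + (-0.1269)(-5.6059) + (0.0341)(-23.5953)}{0.8053} = 91.50\%.$$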
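The same three steps work for any number of variables once the small linear system is solved by a general method. Below is a minimal Python sketch, assuming plain Gaussian elimination with partial pivoting as the solver (a library routine would do equally well); the names `solve` and `fit_linear` are illustrative. It is applied to the 14 women's times to reproduce the cubic fit.

```python
def mean(values):
    """Plain average of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)

def cov(us, vs):
    """Covariance <uv> - <u><v>; gives the variance when us and vs coincide."""
    return mean(u * v for u, v in zip(us, vs)) - mean(us) * mean(vs)

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    sol = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * sol[k] for k in range(row + 1, n))
        sol[row] = (aug[row][n] - tail) / aug[row][row]
    return sol

def fit_linear(columns, ys):
    """Fit y = a + sum(coef_i * columns[i]); return (a, coefs, fit_fraction)."""
    covs = [[cov(u, v) for v in columns] for u in columns]   # Step 2
    rhs = [cov(u, ys) for u in columns]
    coefs = solve(covs, rhs)                                  # Step 3
    a = mean(ys) - sum(c * mean(u) for c, u in zip(coefs, columns))
    fit = sum(c * r for c, r in zip(coefs, rhs)) / cov(ys, ys)
    return a, coefs, fit

# Women's 200 m times (y) against x = decades after 1950, w = x^2, v = x^3
x = [-0.2, 0.2, 0.6, 1.0, 1.4, 1.8, 2.2, 2.6, 3.0, 3.4, 3.8, 4.2, 4.6, 5.0]
y = [24.4, 23.7, 23.4, 24.0, 23.0, 22.5, 22.4, 22.37,
     22.03, 21.81, 21.34, 21.81, 22.12, 21.84]
w = [t ** 2 for t in x]
v = [t ** 3 for t in x]

a, (b, c, d), fit = fit_linear([x, w, v], y)
print(a, b, c, d, fit)
```

The output should agree with the hand-laid-out steps to about four decimals; small differences in the last digit are expected from rounding.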
Converting Non-Linear Regression to Linear Form
Linear regression can also be used for non-linear relations, if the relations can be converted to linear form. The last two examples are for illustration only, and will only be outlined, but not spelled out in detail.
Example 6: Sun Rise and Sun Set Times (2002)
A fit can be done against sines and cosines. For example, if a function $f(x)$ has a period 1, then it can be fit against the sines/cosines for the base frequency and all of its harmonics:
$$f(x) = a_0 + a_1 \cos 2\pi x + b_1 \sin 2\pi x + a_2 \cos 4\pi x + b_2 \sin 4\pi x + \cdots$$
It can even be done as an infinite series, in which case it is generally a 100% fit. The required calculations simplify because, over any period, each sine and cosine has the same average (zero) and zero correlation with each other. The result is the Fourier series for the function. For example, $x$ has a Fourier series
$$x = \frac{1}{2} - \frac{1}{\pi} \left( \frac{\sin 2\pi x}{1} + \frac{\sin 4\pi x}{2} + \frac{\sin 6\pi x}{3} + \frac{\sin 8\pi x}{4} + \cdots \right), \qquad 0 < x < 1.$$
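The Fourier series for $x$ on $0 < x < 1$ can be checked by summing a finite number of harmonics. A minimal Python sketch follows; the function name `sawtooth_partial_sum` and the choice of 2000 terms are illustrative.

```python
from math import pi, sin

def sawtooth_partial_sum(x, terms):
    """Partial sum of 1/2 - (1/pi) * sum_{n=1..terms} sin(2*pi*n*x) / n."""
    total = sum(sin(2 * pi * n * x) / n for n in range(1, terms + 1))
    return 0.5 - total / pi

# Compare the partial sums with x itself at a few interior points
for x in (0.1, 0.25, 0.7):
    print(x, sawtooth_partial_sum(x, 2000))
```

Convergence is slow (the error shrinks roughly like one over the number of terms) and it fails at the endpoints 0 and 1, where the periodic extension of $x$ jumps.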
As the following example shows, the sun rise and sun set times are not exactly sinusoidal and aren't even in phase with one another! At this latitude, for winter, one peaks around Thanksgiving, the other after January. But the fit will only be done with the first harmonic: the yearly cycle.
The data was taken (from the 2002 World Book Almanac) weekly from Tuesday January 2, 2002 on to Tuesday December 31, 2002 for 40 degrees latitude north (or about 200 miles south of Milwaukee). The fit is against $x$, which is normalized to $2\pi$ for 365.25 days. The variables are taken as $\cos x$ and $\sin x$,
$$t = a + b \cos x + c \sin x$$
with the results
Sunset (minutes): $1086.0062 - 81.7020 \cos x + 23.4493 \sin x$ (98.69% Fit)
Sunrise (minutes): $354.0402 + 81.3104 \cos x + 8.0512 \sin x$ (98.38% Fit)
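The phase difference between the two fitted curves can be read off by rewriting $b \cos x + c \sin x$ as $R \cos(x - \varphi)$. The Python sketch below assumes the fitted coefficients quoted above with the signs $-81.7020$, $+23.4493$ for sunset and $+81.3104$, $+8.0512$ for sunrise, and assumes $x = 0$ falls at the start of the year; the function name is illustrative.

```python
from math import atan2, hypot, pi

DAYS_PER_YEAR = 365.25

def amplitude_and_peak_day(b, c):
    """Rewrite b*cos(x) + c*sin(x) as R*cos(x - phi).

    Returns (R, day of the year on which the term is largest),
    assuming x = 0 at the start of the year.
    """
    amplitude = hypot(b, c)
    phase = atan2(c, b) % (2 * pi)
    return amplitude, phase * DAYS_PER_YEAR / (2 * pi)

# Coefficients of cos x and sin x (signs are an assumption, see above)
sunset_amp, sunset_peak = amplitude_and_peak_day(-81.7020, 23.4493)
sunrise_amp, sunrise_peak = amplitude_and_peak_day(81.3104, 8.0512)

# Earliest sunset is half a year away from the latest sunset
earliest_sunset = (sunset_peak + DAYS_PER_YEAR / 2) % DAYS_PER_YEAR

print(sunset_amp, sunset_peak, earliest_sunset)
print(sunrise_amp, sunrise_peak)
```

Under these assumptions the first harmonic puts the latest sunrise a few days into January and the earliest sunset in mid-December, so the two winter extremes are offset by about three weeks.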
Example 7: World Population (1950-2004)
It's a little-known fact that the population curve is no longer exponential, but is curving downward. Though the slowing-down trend may have temporarily reversed for a couple of years because of the baby-boom echo, it's mostly been slowing down since the late 1980s, and birth rates have plummeted worldwide since the beginning of the century.
A large part of the reason for this can be directly seen in the following UNESCO data.
College Enrollment (% Female) by UNESCO Region
Region 1999 2000 2001 2002 2003 2004 2005 2006
North America and Western Europe 54.2 54.4 54.6 55.1 55.4 55.7 55.9 56.0
Central and Eastern Europe 53.4 53.5 54.0 54.7 54.7 54.8 54.9 54.8
Latin America and the Caribbean 52.7 53.0 52.7 53.3 53.6 53.6 53.5 53.6
Central Asia 48.0 48.2 48.7 49.7 50.1 50.8 51.6 51.7
Arab States 41.6 41.6 44.2 44.8 45.6 47.6 49.0 49.1
East Asia and the Pacific 41.6 41.5 41.5 45.6 45.4 46.6 46.8
South and West Asia 38.4 39.4 39.8 39.5 40.9 40.8 41.3
Sub-Saharan Africa 40.2 37.9 37.6 38.0 37.7 38.0 40.1 40.2
World 48.0 48.4 49.5 49.8 50.2 50.2
This trend goes hand-in-hand with the decline in birth rates, for reasons that should be fairly obvious. Note that males of college age generally outnumber females in nearly every part of the world except sub-Saharan Africa; also except in island nations where occasionally small-number fluctuations lead to temporary female majorities in this age range.
The data was from the US Census International Database in 2004 and covers their estimated world populations for the years 1950-2004.
A fit against an exponential can be done by setting up the variables like so:
$$P = a \, b^{\,t - 1950} \quad \Longleftrightarrow \quad \ln P = \ln a + (t - 1950) \ln b.$$
A linear fit done with the census data produces the result
$$\ln \left( \frac{a}{1 \text{ million}} \right) = 7.85588, \qquad \ln b = 0.017598, \qquad \text{Fit: } 99.65\%.$$
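The log transformation can be sketched in Python as follows. The census figures themselves are not reproduced in this tutorial, so the populations below are synthetic values generated from the reported fit, purely to show the mechanics; the variable names are illustrative.

```python
from math import exp, log

# Reported fit: ln(a / 1 million) and ln b
LN_A, LN_B = 7.85588, 0.017598

years = list(range(1950, 2005))
# Synthetic populations in millions, generated from the reported fit (NOT census data)
pops = [exp(LN_A + (t - 1950) * LN_B) for t in years]

# Linear regression of ln P against (t - 1950), by the moment method
xs = [t - 1950 for t in years]
ys = [log(p) for p in pops]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mx * my
var_x = sum(x * x for x in xs) / n - mx * mx

ln_b = cov_xy / var_x          # slope = ln b
ln_a = my - ln_b * mx          # constant coefficient = ln a
doubling_time = log(2) / ln_b  # years for the exponential to double

print(ln_a, ln_b, doubling_time)
```

With real data the recovered slope would carry scatter; here it simply returns the inputs, and the slope corresponds to a doubling time of a little over 39 years.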
The fit accounts for most of the trend, but undershoots the mark in the 1970-1980 period and veers way off after 2000. The current population is about 7 billion: closer to the green curve:
$$\frac{P}{1 \text{ billion}} = 5.069 + 3.672 \tanh \left( \frac{t - 1987.418}{44.347} \right), \qquad \text{Fit: } 99.99\% \qquad \left( \tanh x \text{ is defined as } \frac{e^x - e^{-x}}{e^x + e^{-x}} \right).$$
Unlike the exponential, there is no simple way to make the fit
$$y = a + b \tanh \left( \frac{t - c}{d} \right)$$
into linear form. The same routine used for step 3 can be used to find $a$ and $b$, and even $c$. But the coefficient $d$ cannot easily be found. I used trial and error, sampling over a number of choices of values for $d$.
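The trial-and-error search over $d$ can be sketched as a simple scan: for each candidate $d$, the model is linear in $a$ and $b$, so an ordinary one-variable fit against $\tanh((t - c)/d)$ gives the best $a$ and $b$, and the candidate with the smallest residual wins. In the Python sketch below, $c$ is simply held at the reported value 1987.418 (a simplification), the data are synthetic values generated from the reported curve rather than census figures, and the grid of candidates is an arbitrary choice.

```python
from math import tanh

def fit_line(xs, ys):
    """Least-squares fit of ys to a + b*xs; returns (a, b, mean squared residual)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    b = cov_xy / var_x
    a = my - b * mx
    resid = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys)) / n
    return a, b, resid

# Reported curve (billions); used here only to generate synthetic data
A, B, C, D = 5.069, 3.672, 1987.418, 44.347
years = list(range(1950, 2005))
pops = [A + B * tanh((t - C) / D) for t in years]

# Scan d = 30.0, 30.5, ..., 60.0 and keep the candidate with the smallest residual
best = None
for half_steps in range(60, 121):
    d = half_steps / 2
    xs = [tanh((t - C) / d) for t in years]
    a, b, resid = fit_line(xs, pops)
    if best is None or resid < best[0]:
        best = (resid, d, a, b)

best_resid, best_d, best_a, best_b = best
print(best_d, best_a, best_b, best_resid)
```

A finer grid around the winning candidate, or a second scan over $c$, would refine the estimate in the same way.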
