SAS Regression Using Dummy Variables and Oneway ANOVA

/*****************************************************************
This example illustrates:
How to create side-by-side boxplots
How to create dummy variables
How to use dummy variables in a linear regression model
How to fit a oneway ANOVA
Procs used:
Proc Means
Proc Boxplot
Proc Reg
Proc Univariate
Proc GLM

Filename: dummy_variables.sas
*******************************************************************/
The commands below allow us to utilize the user-defined formats along with our permanent SAS data set.
OPTIONS FORMCHAR="|----|+|---+=|-/\<>*";
libname b510 "e:\510\";
options fmtsearch=(WORK b510);
options nofmterr;
We get descriptive statistics for all variables within each level of ORIGIN.
proc means data=b510.cars;
class origin;
run;
The MEANS Procedure

            N
ORIGIN    Obs  Variable    N           Mean        Std Dev        Minimum        Maximum

USA       253  MPG       248     20.1282258      6.3768059     10.0000000     39.0000000
               ENGINE    253    247.7134387     98.7799678     85.0000000    455.0000000
               HORSE     249    119.6064257     39.7991647     52.0000000    230.0000000
               WEIGHT    253        3367.33    788.6117392        1800.00        5140.00
               ACCEL     253     14.9284585      2.8011159      8.0000000     22.2000000
               YEAR      253     75.5217391      3.7145843     70.0000000     82.0000000
               CYLINDER  253      6.2766798      1.6626528      4.0000000      8.0000000

Europe     73  MPG        70     27.8914286      6.7239296     16.2000000     44.3000000
               ENGINE     73    109.4657534     22.3719083     68.0000000    183.0000000
               HORSE      71     81.0000000     20.8134572     46.0000000    133.0000000
               WEIGHT     73        2431.49    490.8836172        1825.00        3820.00
               ACCEL      73     16.8219178      3.0109175     12.2000000     24.8000000
               YEAR       73     75.7397260      3.5630332     70.0000000     82.0000000
               CYLINDER   73      4.1506849      0.4907826      4.0000000      6.0000000

Japan      79  MPG        79     30.4506329      6.0900481     18.0000000     46.6000000
               ENGINE     79    102.7088608     23.1401260     70.0000000    168.0000000
               HORSE      79     79.8354430     17.8191991     52.0000000    132.0000000
               WEIGHT     79        2221.23    320.4972479        1613.00        2930.00
               ACCEL      79     16.1721519      1.9549370     11.4000000     21.0000000
               YEAR       79     77.4430380      3.6505947     70.0000000     82.0000000
               CYLINDER   79      4.1012658      0.5904135      3.0000000      6.0000000

We can see that the mean of vehicle miles per gallon (MPG) is lowest for the American cars, intermediate for
the European cars, and highest for the Japanese cars.
We now look at a side-by-side boxplot of miles per gallon (MPG) for each level of origin.
/*Get side-by-side boxplots of MPG for each
vehicle origin*/
proc sort data=b510.cars;
by origin;
run;
goptions device=win target=winprtm;
proc boxplot data=b510.cars;
plot mpg * origin / boxstyle=schematic;
run;
The boxplot shows the pattern of means that we noted in the descriptive statistics. The variance is similar for the
American, European and Japanese cars. The distribution of mpg is somewhat positively skewed for American
and European cars, and negatively skewed for Japanese cars. There are some high outliers in the American and
European cars. Because ORIGIN is a nominal variable, we will not be tempted to think of this as an ordinal
relationship. If we had a different coding for ORIGIN, this graph would have shown a different pattern.
Before we can fit a linear regression model with a categorical (in this case, nominal) predictor we need to create
dummy variables to be used in the model. We will create 3 dummy variables, even though only two of them will
be used in the regression model. Each dummy variable will be coded as 0 or 1. A value of 1 will indicate that a
case is in a given level of origin, and a value of 0 will indicate that the case is not in that level of origin.
The dummy variables for ORIGIN are created in the data step below. Note that the output shows the one car
with a missing origin does not have a value for any of the dummy variables. Also note that the frequency
tabulations for the three dummy variables show that there is one missing value for each dummy variable.
/*Data step to create dummy variables for each level of ORIGIN*/
data b510.cars2;
set b510.cars;
if origin not=. then do;
American=(origin=1);
European=(origin=2);
Japanese=(origin=3);
end;
run;
proc print data=b510.cars2;
var origin American European Japanese weight;
run;
Obs    ORIGIN    American    European    Japanese    WEIGHT
  1    .                .           .           .       732
  2    USA              1           0           0      1800
  3    USA              1           0           0      1875
  4    USA              1           0           0      1915
  5    USA              1           0           0      1955
. . .
256    Europe           0           1           0      1825
257    Europe           0           1           0      1834
258    Europe           0           1           0      1835
259    Europe           0           1           0      1835
260    Europe           0           1           0      1845
. . .
329    Japan            0           0           1      1649
330    Japan            0           0           1      1755
331    Japan            0           0           1      1760
332    Japan            0           0           1      1773
proc freq data=b510.cars2;
tables origin american european japanese;
run;
The FREQ Procedure

                                    Cumulative    Cumulative
ORIGIN      Frequency    Percent     Frequency       Percent
USA               253      62.47           253         62.47
Europe             73      18.02           326         80.49
Japan              79      19.51           405        100.00
Frequency Missing = 1

                                    Cumulative    Cumulative
American    Frequency    Percent     Frequency       Percent
0                 152      37.53           152         37.53
1                 253      62.47           405        100.00
Frequency Missing = 1

                                    Cumulative    Cumulative
European    Frequency    Percent     Frequency       Percent
0                 332      81.98           332         81.98
1                  73      18.02           405        100.00
Frequency Missing = 1

                                    Cumulative    Cumulative
Japanese    Frequency    Percent     Frequency       Percent
0                 326      80.49           326         80.49
1                  79      19.51           405        100.00
Frequency Missing = 1
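The role of the guard on missing ORIGIN can be illustrated with a minimal sketch. The data set toy and its values are hypothetical and are not taken from b510.cars; the missing-value test is written with ne as an equivalent alternative. In SAS a comparison such as (origin = 1) evaluates to 1 when true and 0 when false, and a missing origin makes the comparison false, so an unguarded assignment would code a car with missing origin as 0 rather than leaving the dummy variable missing.

```sas
/* Hypothetical toy data: origin codes 1, 2, 3 and one missing origin */
data toy;
input origin mpg;
datalines;
1 18
2 26
3 31
. 9
;
run;

data toy2;
set toy;
/* Unguarded: a missing origin gives 0, because (. = 1) is false */
American_unguarded = (origin = 1);
/* Guarded: the dummies are left missing when origin is missing */
if origin ne . then do;
American = (origin = 1);
European = (origin = 2);
Japanese = (origin = 3);
end;
run;

proc print data=toy2;
run;
```

In the printed result, the last row should show American_unguarded equal to 0 while American, European and Japanese remain missing, which is the behavior the frequency tables above reflect (one missing value per dummy variable).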
We can now fit a regression model, to predict MPG for each Origin. We will use American cars as the reference
category in this model. To do this we will include the dummy variables for European and Japanese cars in our
model. These two dummy variables represent a contrast between the average MPG for European vs. American
cars and Japanese vs. American cars, respectively. In general, if you have k categories in your categorical
variable, you will need to include k-1 dummy variables in the regression model.
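Written out for k = 3 with American cars as the reference category, the model can be sketched as follows. The symbols \beta_0, \beta_1, \beta_2 and \varepsilon_i are notation introduced here for illustration; they do not appear in the SAS output.

```latex
\begin{aligned}
\text{MPG}_i &= \beta_0 + \beta_1\,\text{European}_i + \beta_2\,\text{Japanese}_i + \varepsilon_i \\
E[\text{MPG} \mid \text{American}] &= \beta_0 \\
E[\text{MPG} \mid \text{European}] &= \beta_0 + \beta_1 \\
E[\text{MPG} \mid \text{Japanese}] &= \beta_0 + \beta_2
\end{aligned}
```

Because both dummy variables are 0 for American cars, the intercept is the American mean, and each slope is the difference between that group's mean and the American mean.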
/*Fit a regression model with American cars as the reference category*/
proc reg data=b510.cars2;
model mpg = european japanese;
plot residual.*predicted.;
output out=regdat p=predicted r=residual rstudent=rstudent;
run; quit;
The REG Procedure
Model: MODEL1
Dependent Variable: MPG

Number of Observations Read                   406
Number of Observations Used                   397
Number of Observations with Missing Values      9

Analysis of Variance
                            Sum of          Mean
Source             DF      Squares        Square    F Value    Pr > F
Model               2   7984.95725    3992.47862      97.97    <.0001
Error             394        16056      40.75232
Corrected Total   396        24041

Root MSE           6.38375    R-Square    0.3321
Dependent Mean    23.55113    Adj R-Sq    0.3287
Coeff Var         27.10593

Parameter Estimates
                  Parameter    Standard
Variable     DF    Estimate       Error    t Value    Pr > |t|
Intercept     1    20.12823     0.40537      49.65      <.0001
European      1     7.76320     0.86400       8.99      <.0001
Japanese      1    10.32241     0.82473      12.52      <.0001
The parameter estimate for the intercept represents the estimated MPG for the reference category, American
cars. Compare this estimate (20.128) with the mean MPG for American cars in the descriptive statistics. The
parameter estimate for European (7.76) represents the contrast in mean MPG for European cars vs. American
cars (the reference). That is, European cars are estimated to have a mean value of MPG that is 7.76 units higher
than American cars on average. This difference is significant ($t_{394} = 8.99$, p<0.0001). The parameter estimate for
Japanese (10.32) represents the contrast in mean MPG for Japanese cars vs. American cars. Japanese cars are
estimated to have a mean value of MPG that is 10.32 units higher than American cars. This difference is
significant ($t_{394} = 12.52$, p<0.0001).
We can calculate the mean MPG for European cars by adding the intercept plus the parameter estimate for
European (20.12823 + 7.76320 = 27.89143). We calculate the mean MPG for Japanese cars by adding the
intercept plus the parameter estimate for Japanese (20.12823 + 10.32241 = 30.45064). We can compare these
values with the values in the output from Proc Means, and see that they agree within rounding error.
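The arithmetic above can be checked with a short data step. This is only a sketch: the estimates and standard errors are typed in by hand from the Parameter Estimates table, and the variable names (b0, b_eur and so on) are chosen here for illustration.

```sas
/* Values copied by hand from the Parameter Estimates table */
data _null_;
b0     = 20.12823;   /* intercept: mean MPG for American cars */
b_eur  = 7.76320;    /* European vs. American contrast */
b_jpn  = 10.32241;   /* Japanese vs. American contrast */
se_eur = 0.86400;    /* standard error of the European estimate */
se_jpn = 0.82473;    /* standard error of the Japanese estimate */
mean_eur = b0 + b_eur;     /* estimated mean MPG, European cars */
mean_jpn = b0 + b_jpn;     /* estimated mean MPG, Japanese cars */
t_eur = b_eur / se_eur;    /* t statistic = estimate / standard error */
t_jpn = b_jpn / se_jpn;
put mean_eur= mean_jpn=;
put t_eur= t_jpn=;
run;
```

The values written to the log should match the group means from Proc Means and the t values in the regression output, up to rounding.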
We now look at the residual vs. predicted values plot to see if there is approximate equality of variances across
levels of origin. Notice that there is only one predicted value of MPG for each ORIGIN. We can see that the
spread of residuals for each origin is approximately the same, indicating that we have reasonable
homoskedasticity for this model fit.
We now look at the distribution of the studentized-deleted residuals for this model, using Proc Univariate, to see
if we have reasonably normally distributed residuals. The plot indicates that we have somewhat longer tails than
expected for a normal distribution, but it is reasonably symmetric.
/*Check distribution of studentized-deleted residuals*/
proc univariate data=regdat normal;
var rstudent;
histogram;
qqplot / normal (mu=est sigma=est);
run;

The tests for normality are significant, as shown below. Because we are testing $H_0$: the distribution of residuals
is normal, we would reject $H_0$ and conclude that the residuals are not normally distributed. We might wish to
investigate transformations of Y (e.g., the log of Y) to get more normally distributed residuals. However, these
departures from normality do not appear to be severe, and transformations will not be explored here.
Tests for Normality
Test                   Statistic          p Value
Shapiro-Wilk           W       0.962915   Pr < W       <0.0001
Kolmogorov-Smirnov     D       0.083057   Pr > D       <0.0100
Cramer-von Mises       W-Sq    0.587669   Pr > W-Sq    <0.0050
Anderson-Darling       A-Sq    3.889483   Pr > A-Sq    <0.0050
We now examine the output data set (regdat) produced by Proc Reg (the name we choose for this data set is
arbitrary). This new data set will contain all the original variables, plus the new ones that we requested. Again,
notice that the predicted value is the same for all observations within the same origin, and is equal to the mean
MPG for that level of origin. Also, notice that some of the residuals are positive and some are negative.
/*Take a look at the output data set*/
proc print data=regdat;
var mpg origin predicted residual rstudent;
run;
Obs    MPG    ORIGIN    predicted     residual    rstudent
  1      9    .                 .            .           .
  2     36    USA         20.1282      15.8718     2.52403
  3     39    USA         20.1282      18.8718     2.99194
  4     36    USA         20.1282      15.5718     2.45983
  5     26    USA         20.1282       5.8718     0.92148
  6     29    USA         20.1282       8.8718     1.39422
  7     34    USA         20.1282      14.2718     2.25170
  8     25    USA         20.1282       4.8718     0.76429
  9     31    USA         20.1282      10.3718     1.63143
 10     34    USA         20.1282      13.3718     2.10805
. . .
101     23    USA         20.1282      2.37177     0.37188
102     14    USA         20.1282     -6.12823    -0.96182
103     20    USA         20.1282     -0.12823    -0.02010
104     18    USA         20.1282     -2.12823    -0.33368
. . .
308     22    Europe      27.8914      -6.2914    -0.99263
309      .    Europe      27.8914            .           .
310     20    Europe      27.8914      -7.5914    -1.19843
311     19    Europe      27.8914      -8.8914    -1.40461
312     18    Europe      27.8914      -9.8914    -1.56351
313     22    Europe      27.8914      -5.8914    -0.92938
314     36    Europe      27.8914       8.5086     1.34384
315     23    Europe      27.8914      -4.8914    -0.77137
316     21    Europe      27.8914      -6.8914    -1.08757
317      .    Europe      27.8914            .           .
318     17    Europe      27.8914     -10.8914    -1.72272
319     20    Europe      27.8914      -7.8914    -1.24597
320     31    Europe      27.8914       2.8086     0.44268
321     27    Europe      27.8914      -0.6914    -0.10896
322     28    Europe      27.8914       0.2086     0.03287
323     30    Europe      27.8914       2.1086     0.33231
. . .
362     18    Japan       30.4506     -12.4506    -1.96999
363     27    Japan       30.4506      -3.4506    -0.54350
364     27    Japan       30.4506      -3.4506    -0.54350
365     30    Japan       30.4506      -0.9506    -0.14968
366     34    Japan       30.4506       3.3494     0.52754
367     28    Japan       30.4506      -2.4506    -0.38592
368     36    Japan       30.4506       5.5494     0.87459
369     29    Japan       30.4506      -1.4506    -0.22841
370     36    Japan       30.4506       5.5494     0.87459
371     34    Japan       30.4506       3.2494     0.51178
We now fit a new model, using Japan as the reference category of ORIGIN. This time we include the two
dummy variables, American and European in our model. The SAS commands and output are shown below:
/*Refit the model using Japan as the reference category*/
proc reg data=b510.cars2;
model mpg = american european;
plot residual.*predicted.;
run; quit;
The REG Procedure
Model: MODEL1
Dependent Variable: MPG

Number of Observations Read                   406
Number of Observations Used                   397
Number of Observations with Missing Values      9

Analysis of Variance
                            Sum of          Mean
Source             DF      Squares        Square    F Value    Pr > F
Model               2   7984.95725    3992.47862      97.97    <.0001
Error             394        16056      40.75232
Corrected Total   396        24041

Root MSE           6.38375    R-Square    0.3321
Dependent Mean    23.55113    Adj R-Sq    0.3287
Coeff Var         27.10593

Parameter Estimates
                  Parameter    Standard
Variable     DF    Estimate       Error    t Value    Pr > |t|
Intercept     1    30.45063     0.71823      42.40      <.0001
American      1   -10.32241     0.82473     -12.52      <.0001
European      1    -2.55920     1.04787      -2.44      0.0150
Notice that the Analysis of Variance table for this model is the same as for the previous model. However, the
parameter estimates differ, because they represent different quantities than they did in the first model. The
intercept is now the estimated mean MPG for Japanese cars (30.45). The parameter estimate for American
represents the contrast in the mean MPG for American cars vs. Japanese cars (American cars have on average,
10.32 less MPG than do Japanese cars). The parameter estimate for European represents the contrast in the
mean MPG for European cars vs. Japanese cars (European cars have on average, 2.56 MPG less than Japanese
cars).
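The two parameterizations describe the same three group means, so the estimates of one can be derived from the other. As a sketch, write \beta_0, \beta_{Eur}, \beta_{Jpn} for the estimates of the first model (American reference) and \gamma_0, \gamma_{Am}, \gamma_{Eur} for those of the second (Japanese reference); this notation is introduced here for illustration only.

```latex
\begin{aligned}
\gamma_0 &= \beta_0 + \beta_{Jpn} = 20.12823 + 10.32241 = 30.45064 \\
\gamma_{Am} &= -\beta_{Jpn} = -10.32241 \\
\gamma_{Eur} &= \beta_{Eur} - \beta_{Jpn} = 7.76320 - 10.32241 = -2.55921
\end{aligned}
```

These agree with the second model's reported estimates (30.45063, 10.32 less, 2.56 less) up to rounding in the last digit.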
We can use another method to fit this same linear model. When we use Proc GLM, we do not have to create the
dummy variables as we did for Proc Reg. Here is sample SAS code for fitting a oneway ANOVA model using
Proc GLM. Note the class statement specifying ORIGIN as a class variable. This causes SAS to create dummy
variables for ORIGIN automatically. SAS will use the highest formatted level (USA in this case) of ORIGIN as
the reference category. SAS also over-parameterizes the model, including a dummy variable for each level of
ORIGIN, but setting the parameter for the highest level equal to zero.
/*Fit an ANOVA model (USA will be the default reference category)*/
proc glm data=b510.cars2;
class origin;
model mpg = origin / solution;
means origin / hovtest=levene(type=abs) tukey;
run; quit;
The GLM Procedure
Class Level Information
Class     Levels    Values
ORIGIN         3    Europe Japan USA
Number of Observations Read    406
Number of Observations Used    397

The GLM Procedure
Dependent Variable: MPG
                              Sum of
Source             DF        Squares    Mean Square    F Value    Pr > F
Model               2     7984.95725     3992.47862      97.97    <.0001
Error             394    16056.41474       40.75232
Corrected Total   396    24041.37199

R-Square    Coeff Var    Root MSE    MPG Mean
0.332134     27.10593    6.383755    23.55113

Source    DF      Type I SS    Mean Square    F Value    Pr > F
ORIGIN     2    7984.957245    3992.478623      97.97    <.0001

Source    DF    Type III SS    Mean Square    F Value    Pr > F
ORIGIN     2    7984.957245    3992.478623      97.97    <.0001

                                       Standard
Parameter            Estimate             Error    t Value    Pr > |t|
Intercept         20.12822581 B      0.40536882      49.65      <.0001
ORIGIN Europe      7.76320276 B      0.86400226       8.99      <.0001
ORIGIN Japan      10.32240710 B      0.82472786      12.52      <.0001
ORIGIN USA         0.00000000 B       .                .         .
NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve
the normal equations. Terms whose estimates are followed by the letter 'B' are not
uniquely estimable.

We also requested Levene's test for homogeneity of variances for the three groups of MPG. Because we are
testing $H_0: \sigma^2_{American} = \sigma^2_{European} = \sigma^2_{Japanese}$, and we do not reject $H_0$, we conclude that the variances are not
significantly different from each other. The results of Levene's test indicate that there is not a problem with
inequality of variances for this model.
Levene's Test for Homogeneity of MPG Variance
ANOVA of Absolute Deviations from Group Means
                 Sum of        Mean
Source     DF   Squares      Square    F Value    Pr > F
ORIGIN      2    3.0446      1.5223       0.11    0.8997
Error     394    5674.0     14.4009
If we had rejected $H_0$ for this model, we could have instructed SAS to fit a model allowing
unequal variances across levels of origin by using the following syntax as part of Proc
GLM.
means origin / hovtest=levene(type=abs) tukey welch;
The Welch test (not shown here) gives a p-value for the ANOVA model, adjusted for unequal variances.
The output below is for Tukey's studentized range test for comparing the means of MPG for each pair of
origins. There are 3 possible comparisons of means, and the Tukey procedure assures that the overall
experimentwise Type I error rate will not be exceeded. By default, SAS uses an overall alpha level of 0.05.
Tukey's Studentized Range (HSD) Test for MPG
NOTE: This test controls the Type I experimentwise error rate.
Alpha                                      0.05
Error Degrees of Freedom                    394
Error Mean Square                      40.75232
Critical Value of Studentized Range     3.32709
Comparisons significant at the 0.05 level are indicated by ***.
                   Difference
ORIGIN                Between      Simultaneous 95%
Comparison              Means     Confidence Limits
Japan  - Europe        2.5592      0.0940     5.0244   ***
Japan  - USA          10.3224      8.3821    12.2627   ***
Europe - Japan        -2.5592     -5.0244    -0.0940   ***
Europe - USA           7.7632      5.7305     9.7959   ***
USA    - Japan       -10.3224    -12.2627    -8.3821   ***
USA    - Europe       -7.7632     -9.7959    -5.7305   ***
We can see from the above output that all pairwise comparisons of means are significant at the .05 level, after
applying the Tukey method for multiple comparisons. SAS shows each comparison of means twice, but that
really isn't necessary. There are many methods for multiple comparisons available in SAS Proc GLM and other
ANOVA procedures, such as Proc Mixed.
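The simultaneous confidence limits can be reproduced by hand from quantities already printed in the output. The sketch below assumes the Tukey-Kramer form of the interval for unequal group sizes, half-width = q * sqrt((MSE/2) * (1/n1 + 1/n2)); that formula is not stated in the output and is an assumption here, and the variable names are illustrative. The group sizes are the N values for MPG from the Proc Means output.

```sas
/* Values copied by hand from the Tukey and Proc Means output */
data _null_;
q     = 3.32709;     /* critical value of studentized range */
mse   = 40.75232;    /* error mean square */
n_jpn = 79;          /* N for MPG, Japanese cars */
n_eur = 70;          /* N for MPG, European cars */
diff  = 2.5592;      /* Japan - Europe difference between means */
/* Assumed Tukey-Kramer half-width for unequal group sizes */
half  = q * sqrt((mse / 2) * (1/n_jpn + 1/n_eur));
lower = diff - half;
upper = diff + half;
put half= lower= upper=;
run;
```

If the assumption holds, the limits written to the log should be close to the 0.0940 and 5.0244 reported for the Japan - Europe comparison.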