
Bivariate analysis

The Multiple Regression Model


Idea: Examine the linear relationship between
1 dependent (Y) and 2 or more independent variables (X_i)

Multiple Regression Model with k Independent Variables:

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$

where $\beta_0$ is the Y-intercept, $\beta_1, \ldots, \beta_k$ are the population slopes, and $\varepsilon_i$ is the random error.
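As a concrete illustration, here is a minimal Python sketch that fits such a model by ordinary least squares; the data, coefficient values, and variable count are invented for the example.

```python
# Minimal sketch: fit Y = b0 + b1*X1 + ... + bk*Xk by least squares.
# All data here are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, k = 28, 2                                 # sample size, number of X variables
X = rng.normal(size=(n, k))                  # independent variables
y = 5 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=n)

X_design = np.column_stack([np.ones(n), X])  # prepend a column of 1s for b0
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(coef)                                  # approximately [5, 2, -3]
```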
Assumptions of Regression
Use the acronym LINE:
Linearity
The underlying relationship between X and Y is linear
Independence of Errors
Error values are statistically independent
Normality of Error
Error values (ε) are normally distributed for any given value of X
Equal Variance (Homoscedasticity)
The probability distribution of the errors has constant variance
Regression Statistics
Multiple R         0.998368
R Square           0.996739
Adjusted R Square  0.995808
Standard Error     1.350151
Observations       28
ANOVA
Source      df  SS        MS        F         Significance F
Regression   6  11701.72  1950.286  1069.876  5.54E-25
Residual    21  38.28108  1.822908
Total       27  11740
$r^2 = \frac{SSR}{SST} = \frac{11701.72}{11740} = 0.996739$

99.674% of the variation in Y is explained by the independent variables.
Adjusted r²

r² never decreases when a new X variable is added to the model.
This can be a disadvantage when comparing models.
What is the net effect of adding a new variable?
We lose a degree of freedom when a new X variable is added.
Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?
Adjusted r² shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used.




$r_{adj}^2 = 1 - \left[ (1 - r^2) \frac{n - 1}{n - k - 1} \right]$

(where n = sample size, k = number of independent variables)

Penalizes excessive use of unimportant independent variables
Smaller than r²
Useful in comparing among models
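A quick sketch confirming both statistics from the SSR, SST, n, and k shown in the output above:

```python
# Compute r-squared and adjusted r-squared from the ANOVA quantities above
# (SSR = 11701.72, SST = 11740, n = 28 observations, k = 6 X variables).
SSR, SST = 11701.72, 11740.0
n, k = 28, 6

r2 = SSR / SST
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2, 6), round(r2_adj, 6))   # 0.996739 0.995808
```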
Error and coefficients relationship
$b_1 = \mathrm{Cov}(Y, X) / \mathrm{Var}_p(X)$

            (Y)        X₁         X₂        X₃          X₄          X₅         X₆
Varp        419.28571  1103.4439  115902.4  1630165.82  36245060.6  706538.59  195.9184
Covar(Y,X)             662.14286  6862.5    25621.4286  120976.786  16061.643  257.1429
b1                     0.6000694  0.059209  0.01571707  0.00333775  0.0227329  1.3125

(each b1 is the bivariate slope Covar/Varp for its column; the first Varp entry is the variance of Y)
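A small sketch of this covariance-over-variance slope; the data are invented for illustration:

```python
# Bivariate slope b1 = Cov(Y, X) / Varp(X), using population (n-denominator)
# moments to match the slide's Covar/Varp notation. Data are illustrative.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
var_x = np.mean((x - x.mean()) ** 2)               # population variance
b1 = cov_xy / var_x
print(b1)   # slope of the simple regression of y on x (1.96 here)
```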
Is the Model Significant?
F Test for Overall Significance of the Model
Shows if there is a linear relationship between all of the
X variables considered together and Y
Use F-test statistic
Hypotheses:
H₀: β₁ = β₂ = … = β_k = 0 (no linear relationship)
H₁: at least one β_i ≠ 0 (at least one independent variable affects Y)
F Test for Overall Significance
Test statistic:

$F = \frac{MSR}{MSE} = \frac{SSR / k}{SSE / (n - k - 1)}$

where F has k (numerator) and (n − k − 1) (denominator) degrees of freedom.
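A sketch reproducing the F statistic and Significance F from the ANOVA table above (scipy is assumed to be available):

```python
# Overall F test from the ANOVA quantities above:
# SSR = 11701.72, SSE = 38.28108, k = 6, n = 28.
from scipy import stats

SSR, SSE, k, n = 11701.72, 38.28108, 6, 28
MSR = SSR / k                            # 1950.286
MSE = SSE / (n - k - 1)                  # 1.822908
F = MSR / MSE                            # about 1069.88
p_value = stats.f.sf(F, k, n - k - 1)    # Significance F, about 5.5e-25
print(F, p_value)
```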
Case discussion
Multiple Regression Assumptions
Assumptions:
The errors are normally distributed
Errors have a constant variance
The model errors are independent
Errors (residuals) from the regression model:

$e_i = Y_i - \hat{Y}_i$
Error terms and coefficient estimates
Once we think of the Error term as a random
variable, it becomes clear that the estimates
of b1, b2, … (as distinguished from their true
values) will also be random variables, because
the estimates generated by the SSE criterion
will depend upon the particular value of e
drawn by nature for each individual in the
data set.
Statistical Inference and Goodness of Fit
The parameter estimates are themselves random
variables, dependent upon the random variables e.
Thus, each estimate can be thought of as a draw
from some underlying probability distribution, the
nature of that distribution as yet unspecified.
If we assume that the error terms e are all drawn
from the same normal distribution, it is possible to
show that the parameter estimates have a normal
distribution as well.
T Statistic and P value

$t = \frac{b_1 - \bar{b}_1}{s_{b_1}}$

where $\bar{b}_1$ is the mean of the sampling distribution of b₁.
Discussion: can you set up the hypothesis that the mean of b₁ equals the estimated b₁ and carry out the t test?
Are Individual Variables Significant?
Use t tests of individual variable slopes
Shows if there is a linear relationship between the variable X_j and Y

Hypotheses:
H₀: β_j = 0 (no linear relationship)
H₁: β_j ≠ 0 (linear relationship does exist between X_j and Y)
Test Statistic:

$t = \frac{b_j - 0}{S_{b_j}} \qquad (df = n - k - 1)$
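As a sketch, each t statistic and two-tailed p-value can be reproduced from the coefficient table below; here the BAR row is used (scipy assumed available):

```python
# Slope t test for one coefficient, using the BAR row of the output:
# b_j = 0.041988, S_bj = 0.005271, with n = 28 and k = 6.
from scipy import stats

b_j, s_bj, n, k = 0.041988, 0.005271, 28, 6
t = (b_j - 0) / s_bj                          # about 7.97
p_value = 2 * stats.t.sf(abs(t), n - k - 1)   # two-tailed, about 8.8e-08
print(t, p_value)
```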

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept  -59.0661      11.28404        -5.23448  3.45E-05  -82.5325   -35.5996
OFF        -0.00696      0.04619         -0.15068  0.881663  -0.10302   0.089097
BAR        0.041988      0.005271        7.966651  8.81E-08  0.031028   0.052949
YNG        0.002716      0.000999        2.717326  0.012904  0.000637   0.004794
VEH        0.00147       0.000265        5.540878  1.69E-05  0.000918   0.002021
INV        -0.00274      0.001336        -2.05135  0.052914  -0.00552   3.78E-05
SPD        -0.2682       0.068418        -3.92009  0.000786  -0.41049   -0.12592
with n − (k + 1) degrees of freedom
Confidence Interval Estimate
for the Slope
Confidence interval for the population slope β_j:

$b_j \pm t_{n-k-1} \, S_{b_j}$

where t has (n − k − 1) d.f.
Example: Form a 95% confidence interval for the effect of changes in bars (BAR) on fatal accidents:

0.041988 ± (2.079614)(0.005271)

So the interval is (0.031028, 0.052949).
This interval does not contain zero, so BAR has a significant effect on accidents.
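A sketch reproducing this interval numerically (scipy assumed available):

```python
# 95% confidence interval for the BAR slope:
# b_j +/- t_{n-k-1} * S_bj with b_j = 0.041988, S_bj = 0.005271, df = 21.
from scipy import stats

b_j, s_bj, df = 0.041988, 0.005271, 21
t_crit = stats.t.ppf(0.975, df)                          # about 2.079614
lower, upper = b_j - t_crit * s_bj, b_j + t_crit * s_bj
print(lower, upper)                                      # about (0.031028, 0.052949)
```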

Using Dummy Variables
A dummy variable is a categorical explanatory
variable with two levels:
yes or no, on or off, male or female
coded as 0 or 1
Regression intercepts are different if the
variable is significant
Assumes equal slopes for other variables
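A minimal sketch of a dummy variable in a regression, on invented data; the two groups share the slope on X₁ but get different intercepts:

```python
# Dummy variable regression: X2 is coded 0/1, so the fitted model gives
# intercept b0 for the X2 = 0 group and b0 + b2 for the X2 = 1 group,
# with a common slope b1 on X1. Data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.integers(0, 2, n)            # dummy: 0 or 1
y = 3 + 1.5 * x1 + 4 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
print(b0, b1, b2)                     # approximately 3, 1.5, 4
```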
Interaction Between
Independent Variables
Hypothesizes interaction between pairs of X
variables
Response to one X variable may vary at different
levels of another X variable

Contains a cross-product term:

$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 = b_0 + b_1 X_1 + b_2 X_2 + b_3 (X_1 X_2)$
Effect of Interaction

Given: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$

Without the interaction term, the effect of X₁ on Y is measured by β₁.
With the interaction term, the effect of X₁ on Y is measured by β₁ + β₃X₂, so the effect changes as X₂ changes.
X₂ = 1:  Ŷ = 1 + 2X₁ + 3(1) + 4X₁(1) = 4 + 6X₁
X₂ = 0:  Ŷ = 1 + 2X₁ + 3(0) + 4X₁(0) = 1 + 2X₁

Interaction Example

Suppose X₂ is a dummy variable and the estimated regression equation is
$\hat{Y} = 1 + 2X_1 + 3X_2 + 4X_1 X_2$

Slopes are different if the effect of X₁ on Y depends on the value of X₂.

[Figure: Ŷ plotted against X₁ (0 to 1.5) for X₂ = 0 and X₂ = 1, showing two lines with different intercepts and slopes]
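A short sketch verifying the two lines above by evaluating the fitted equation at X₂ = 0 and X₂ = 1:

```python
# Evaluate Y-hat = 1 + 2*X1 + 3*X2 + 4*X1*X2 at the two dummy levels.
def y_hat(x1, x2):
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

for x2 in (0, 1):
    intercept = y_hat(0, x2)
    slope = y_hat(1, x2) - y_hat(0, x2)   # slope in X1 at this X2 level
    print(f"X2 = {x2}: Y-hat = {intercept} + {slope}*X1")
# X2 = 0: Y-hat = 1 + 2*X1
# X2 = 1: Y-hat = 4 + 6*X1
```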

Residual Analysis
The residual for observation i, e_i, is the difference between its observed and predicted value
Check the assumptions of regression by examining the
residuals
Examine for linearity assumption
Evaluate independence assumption
Evaluate normal distribution assumption
Examine for constant variance for all levels of X (homoscedasticity)
Graphical Analysis of Residuals

Can plot the residuals $e_i = Y_i - \hat{Y}_i$ versus X
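A sketch of such a residual plot on invented data (matplotlib assumed available):

```python
# Plot residuals e_i = Y_i - Y-hat_i against X for a fitted simple regression.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 40)
y = 2 + 0.8 * x + rng.normal(scale=0.6, size=x.size)

b1, b0 = np.polyfit(x, y, 1)      # fitted slope and intercept
residuals = y - (b0 + b1 * x)     # e_i = Y_i - Y-hat_i

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("X")
plt.ylabel("residuals")
plt.show()   # a random, even band around 0 supports the LINE assumptions
```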
Residual Analysis for Independence

[Figure: two residuals-vs-X plots. A cyclical or trending pattern in the residuals indicates the errors are not independent; a random scatter indicates independence]

Residual Analysis for Equal Variance

[Figure: two residuals-vs-x plots. A fan-shaped spread indicates non-constant variance; an even band indicates constant variance]
Linear vs. Nonlinear Fit

A linear fit does not give random residuals; a nonlinear fit gives random residuals.

[Figure: Y-vs-X scatterplots with linear and nonlinear fitted curves, and the corresponding residual plots]
Quadratic Regression Model

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$

β₁ = the coefficient of the linear term
β₂ = the coefficient of the squared term

Quadratic models may be considered when the scatter diagram takes on one of the following shapes:

[Figure: four Y-vs-X₁ parabolic shapes covering the sign combinations β₁ < 0 or β₁ > 0, with β₂ > 0 (curve opening upward) or β₂ < 0 (curve opening downward)]
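A sketch fitting this quadratic model by least squares, with invented data that curve downward (β₂ < 0):

```python
# Fit Y = b0 + b1*X1 + b2*X1^2 by adding a squared column to the design matrix.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 30)
y = 1 + 4 * x - 0.5 * x**2 + rng.normal(scale=0.8, size=x.size)

X = np.column_stack([np.ones_like(x), x, x**2])   # constant, linear, squared
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
print(b0, b1, b2)                                 # approximately 1, 4, -0.5
```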
