HOMEWORK II
LEONARDO D. VILLAMIL
09/16/2013.
Exercise 2-7.
a) Fit a simple linear regression model to the data.
> df=read.csv("dataSet2-7.csv", header=T)
> fit = lm(df$purity~df$hydro)
> x = df$hydro
> fit
Call:
lm(formula = df$purity ~ df$hydro)
Coefficients:
(Intercept) df$hydro
77.86 11.80
Our fitted model is:
> y = 77.86 + 11.80*x
__________________________________________________________________
The following computes some quantities that will be used later for hypothesis tests and
confidence intervals.
ssr stands for the residual sum of squares:
$$SS_{Res} = \sum (\text{observed} - \text{fitted})^2$$
> ssr=sum((df$purity-y)^2)
> ssr
[1] 232.8348
msr stands for the residual mean square. With n = 20 observations, n - 2 = 18:
$$MS_{Res} = \frac{SS_{Res}}{n-2}$$
> msr=ssr/18
sxx stands for
$$S_{xx} = \sum (x_i - \bar{x})^2$$
> sxx=sum((df$hydro-mean(df$hydro))^2)
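The hand calculations of ssr, msr and sxx can be cross-checked against built-in helpers. The sketch below is only an illustration: it uses a made-up data set (`toy`, not dataSet2-7.csv) and assumes the model is fitted with the `data=` argument; `deviance()`, `sigma()` and `var()` are standard R functions.

```r
# Hypothetical data for illustration only (not dataSet2-7.csv).
set.seed(42)
toy <- data.frame(hydro = seq(0.8, 1.6, length.out = 20))
toy$purity <- 78 + 12 * toy$hydro + rnorm(20, sd = 3)

fit_toy <- lm(purity ~ hydro, data = toy)
n <- nrow(toy)

# Residual sum of squares: by hand and with deviance()
ssr_hand <- sum((toy$purity - fitted(fit_toy))^2)
ssr_auto <- deviance(fit_toy)

# Residual mean square: SS_Res / (n - 2), also sigma(fit)^2
msr_hand <- ssr_hand / (n - 2)
msr_auto <- sigma(fit_toy)^2

# Corrected sum of squares of x: by hand and as (n - 1) * var(x)
sxx_hand <- sum((toy$hydro - mean(toy$hydro))^2)
sxx_auto <- (n - 1) * var(toy$hydro)

print(c(ssr_hand, ssr_auto, msr_hand, msr_auto, sxx_hand, sxx_auto))
```

Using fitted(fit_toy) instead of retyped, rounded coefficients avoids small rounding differences in the residual sum of squares.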
Using the t statistic to test $H_0: \beta_1 = 0$:
$$t_0 = \frac{\hat{\beta}_1}{\sqrt{MS_{Res}/S_{xx}}}$$
$$se(\hat{\beta}_1) = \sqrt{\frac{MS_{Res}}{S_{xx}}}$$
ses stands for the standard error of the slope.
> ses=sqrt(msr/sxx)
> t=11.80/ses
> t
[1] 3.385821
_____________________________________________________________________________________
We can also see that this t statistic value could be obtained directly from R using summary(fit). See the
t value in the df$hydro row of the following summary.
> summary(fit)
Call:
lm(formula = df$purity ~ df$hydro)
Residuals:
Min 1Q Median 3Q Max
-4.6724 -3.2113 -0.0626 2.5783 7.3037
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 77.863 4.199 18.544 3.54e-13 ***
df$hydro 11.801 3.485 3.386 0.00329 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.597 on 18 degrees of freedom
Multiple R-squared: 0.3891, Adjusted R-squared: 0.3552
F-statistic: 11.47 on 1 and 18 DF, p-value: 0.003291
b)
The corresponding P value is 0.003291, which is well below $\alpha = 0.05$. Therefore I reject the null
hypothesis $H_0: \beta_1 = 0$ and conclude that there is a linear relationship between percent purity and
percent hydrocarbons.
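The P value can be reproduced from the t statistic and the residual degrees of freedom. This is a minimal sketch using the numbers printed by summary(fit); the variable names are my own.

```r
# Values taken from the summary(fit) display
t0 <- 3.386      # t value for the slope
df_res <- 18     # residual degrees of freedom (n - 2)

# Two-sided P value for H0: beta1 = 0
p_value <- 2 * pt(-abs(t0), df = df_res)
print(p_value)
```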
c)
We can also see from the summary(fit) display that the value of $R^2$ is 0.3891 or 38.91%.
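For simple linear regression the numbers in the summary are tied together: $t_0^2 = F_0$ and $R^2 = F_0 / (F_0 + n - 2)$. The sketch below checks both using the rounded values from the display, so agreement is only up to rounding; the variable names are my own.

```r
# Values taken from the summary(fit) display
F0 <- 11.47      # F-statistic
t0 <- 3.386      # t value for the slope
df_res <- 18     # residual degrees of freedom (n - 2)

# R^2 recovered from the F statistic
r2_from_F <- F0 / (F0 + df_res)

# The squared slope t statistic equals F in simple linear regression
t_squared <- t0^2

print(c(r2_from_F = r2_from_F, t_squared = t_squared))
```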
d)
A $100(1-\alpha)$ percent confidence interval (CI) on the slope $\beta_1$ is given by
$$\hat{\beta}_1 - t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1)$$
Using the values obtained from the summary(fit) display and data from the context of the problem, I
arrived at the following values:
$$\alpha/2 = 0.025$$
$$\hat{\beta}_1 = 11.80$$
$$t_{0.025,18} = 2.101$$
$$se(\hat{\beta}_1) = 3.485$$
So my 95% CI on the slope is:
$$4.478 \le \beta_1 \le 19.122$$
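The interval can be checked in R by taking the critical value from qt() rather than a table. This is a sketch using the rounded estimate and standard error quoted above; the variable names are my own.

```r
# Values taken from the summary(fit) display
b1 <- 11.80       # slope estimate
se_b1 <- 3.485    # standard error of the slope
df_res <- 18      # residual degrees of freedom (n - 2)
alpha <- 0.05

# Upper alpha/2 critical value of the t distribution
t_crit <- qt(1 - alpha / 2, df = df_res)

# 95% confidence interval on the slope
ci <- c(lower = b1 - t_crit * se_b1, upper = b1 + t_crit * se_b1)

print(t_crit)
print(ci)
```

On the fitted object itself, confint(fit, level = 0.95) returns the same kind of interval without retyping rounded values.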
e)
From the data obtained in the display of summary(fit) and data from the context of the problem we
obtain:
$$\hat{\mu}_{y|x} = \hat{\beta}_0 + \hat{\beta}_1 x = 77.863 + 11.80(1.00) = 89.663$$
$$t_{0.025,18} = 2.101$$
The lower limit of the interval is:
> ls=89.663-2.101*sqrt(msr*((1/20)+((1-mean(x))^2/sxx)))
> ls
[1] 87.50878
The upper limit of the interval is:
> lr=89.663+2.101*sqrt(msr*((1/20)+((1-mean(x))^2/sxx)))
> lr
[1] 91.81722
So my 95% CI on $\mu_{y|x}$ at x = 1.00 is:
$$87.51 \le \mu_{y|x} \le 91.82$$
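The same kind of interval can be obtained from predict() with interval = "confidence". The sketch below is an illustration on made-up data (`toy`, not the exercise data set) and compares the hand formula with predict(). It assumes the model is fitted as lm(purity ~ hydro, data = ...), because predict() needs the column name to match newdata; a model written as lm(df$purity ~ df$hydro) does not work with newdata.

```r
# Hypothetical data for illustration only (not dataSet2-7.csv).
set.seed(1)
toy <- data.frame(hydro = seq(0.8, 1.6, length.out = 20))
toy$purity <- 78 + 12 * toy$hydro + rnorm(20, sd = 3)

fit_toy <- lm(purity ~ hydro, data = toy)
x0 <- 1.00
n <- nrow(toy)

# Hand formula: mu_hat +/- t * sqrt(MS_Res * (1/n + (x0 - xbar)^2 / Sxx))
msr_toy <- sum(resid(fit_toy)^2) / (n - 2)
sxx_toy <- sum((toy$hydro - mean(toy$hydro))^2)
mu_hat <- unname(coef(fit_toy)[1] + coef(fit_toy)[2] * x0)
half <- qt(0.975, df = n - 2) *
  sqrt(msr_toy * (1 / n + (x0 - mean(toy$hydro))^2 / sxx_toy))
manual <- c(mu_hat - half, mu_hat + half)

# Same interval from predict()
auto <- predict(fit_toy, newdata = data.frame(hydro = x0),
                interval = "confidence", level = 0.95)

print(manual)
print(auto)
```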
f) Plot of the data set.
Exercise 2-12.
a.
> df=read.csv("dataSet2-12.csv")
> fit=lm(df$usage~df$temp)
> summary(fit)
Call:
lm(formula = df$usage ~ df$temp)
Residuals:
Min 1Q Median 3Q Max
-2.5629 -1.2581 -0.2550 0.8681 4.0581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.33209 1.67005 -3.792 0.00353 **
df$temp 9.20847 0.03382 272.255 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.946 on 10 degrees of freedom
Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
F-statistic: 7.412e+04 on 1 and 10 DF, p-value: < 2.2e-16
> x=df$temp
My fitted model is:
> y=-6.332+9.208*x
b.
Test for significance of regression.
$$H_0: \beta_1 = 0, \quad H_1: \beta_1 \neq 0$$
> anova(fit)
Analysis of Variance Table
Response: df$usage
Df Sum Sq Mean Sq F value Pr(>F)
df$temp 1 280590 280590 74123 < 2.2e-16 ***
Residuals 10 38 4
From the above display I see that:
$$SS_R = 280590$$
$$SS_{Res} = 38$$
$$F_0 = \frac{SS_R / df_R}{SS_{Res} / df_{Res}} = \frac{MS_R}{MS_{Res}}$$
> f=280590/4
> f
[1] 70147.5
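The anova table reports F = 74123. My hand value is slightly smaller only because the table displays the
residual mean square rounded to 4. Using the residual standard error from summary(fit), the residual
mean square is 1.946^2, about 3.79, which gives an F of about 74,095, in close agreement with the table.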
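The F statistic can be cross-checked against other numbers in the R output. The sketch below recomputes it from the unrounded residual mean square (the square of the residual standard error) and from the squared t statistic of the slope, since $F_0 = t_0^2$ in simple linear regression. The inputs are the rounded values printed by R and the variable names are my own.

```r
# Values taken from the anova(fit) and summary(fit) displays
ss_r <- 280590     # regression sum of squares
s_res <- 1.946     # residual standard error
t0 <- 272.255      # t value for the slope
df_res <- 10       # residual degrees of freedom

# F from the unrounded residual mean square
ms_res <- s_res^2
f_unrounded <- ss_r / ms_res

# F as the square of the slope t statistic
f_from_t <- t0^2

# Upper-tail P value of the F test
p_value <- pf(f_from_t, df1 = 1, df2 = df_res, lower.tail = FALSE)

print(c(f_unrounded = f_unrounded, f_from_t = f_from_t))
print(p_value)
```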
$F_0$ is large and the $P$-value is less than 2.2e-16, therefore I reject $H_0: \beta_1 = 0$.
If we use the t test we also find evidence to reject the null hypothesis.
$$se(\hat{\beta}_1) = 0.03382$$
$$\hat{\beta}_1 = 9.208$$
$$t_0 = 272.255$$
$$P\text{-value} < 2.2\mathrm{e}{-16}$$
Therefore we reject $H_0: \beta_1 = 0$ and conclude that there is a linear relationship between the
number of pounds of steam per month and the average monthly temperature.
c.
y=-6.332+9.208*x #This is also called the line of means
This is equivalent to testing
$$H_0: \beta_1 = 10{,}000, \quad H_1: \beta_1 \neq 10{,}000$$
Now, the usage is recorded per 1000 according to the data provided in the table, so my hypothesized
value is $\beta_1 = 10$.
From the summary(fit) table presented above:
> t=(9.20847-10)/0.03382
> t
[1] -23.4042
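The test of $H_0: \beta_1 = 10$ can also be judged against a critical value. The sketch below uses the estimate and standard error from summary(fit), takes $\alpha = 0.05$ as an assumption for illustration, and computes the two-sided P value from the t distribution with 10 degrees of freedom; the variable names are my own.

```r
# Values taken from the summary(fit) display
b1 <- 9.20847      # slope estimate
se_b1 <- 0.03382   # standard error of the slope
beta10 <- 10       # hypothesized slope (usage is in thousands of pounds)
df_res <- 10       # residual degrees of freedom (n - 2)

# Test statistic for H0: beta1 = beta10
t0 <- (b1 - beta10) / se_b1

# Two-sided critical value at alpha = 0.05 (assumed level)
t_crit <- qt(0.975, df = df_res)
reject <- abs(t0) > t_crit

# Two-sided P value
p_value <- 2 * pt(-abs(t0), df = df_res)

print(c(t0 = t0, t_crit = t_crit))
print(reject)
print(p_value)
```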
For any of the $\alpha/2$ levels defined for the t distribution, the
16
2.2
value
P e