
LINEAR REGRESSION ANALYSIS I

HOMEWORK II
LEONARDO D. VILLAMIL
09/16/2013.

Exercise 2-7.
a) Fit a simple linear regression model to the data.
> df=read.csv("dataSet2-7.csv", header=T)
> fit = lm(df$purity~df$hydro)
> fit
>x=df$hydro
Call:
lm(formula = df$purity ~ df$hydro)
Coefficients:
(Intercept) df$hydro
77.86 11.80

Our fitted model is:
>y=77.86+11.80*x
__________________________________________________________________
The following calculate some variables that will be used ahead to make hypothesis tests and create
confidence intervals.

ssr stands for the residual sum of squares:
SS_Res = Σ(observed − fitted)²
> ssr=sum((df$purity-y)^2)
> ssr
[1] 232.8348
msr stands for the residual mean square:
MS_Res = SS_Res / (n − 2)
> msr=ssr/18
sxx stands for the corrected sum of squares of x:
Sxx = Σ(x_i − x̄)²
> sxx=sum((df$hydro-mean(df$hydro))^2)

Using the t statistic to test H0: β1 = 0:

t0 = β̂1 / sqrt(MS_Res / Sxx)

where

se(β̂1) = sqrt(MS_Res / Sxx)
ses stands for the standard error of the slope.
> ses=sqrt(msr/sxx)
> t=11.80/ses
> t
[1] 3.385821
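As a cross-check, the hand computation above can be verified against summary(lm). This is a hedged sketch on simulated data: the x, y values and coefficients below are stand-ins, since the original dataSet2-7.csv is not reproduced here.

```r
# Verify that the hand-computed t statistic beta1_hat / sqrt(MS_Res / Sxx)
# matches the t value reported by summary(lm).
set.seed(1)
x <- runif(20, 0.9, 1.6)                 # hypothetical hydrocarbon levels
y <- 77 + 12 * x + rnorm(20, sd = 3.5)   # hypothetical purity values
fit <- lm(y ~ x)

n   <- length(x)
msr <- sum(resid(fit)^2) / (n - 2)       # MS_Res = SS_Res / (n - 2)
sxx <- sum((x - mean(x))^2)              # corrected sum of squares of x
t0  <- unname(coef(fit)["x"]) / sqrt(msr / sxx)

all.equal(t0, summary(fit)$coefficients["x", "t value"])
```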
_____________________________________________________________________________________
We can also see that this t statistic could be obtained directly in R with summary(fit); see the t value for df$hydro in the following summary.
> summary(fit)
Call:
lm(formula = df$purity ~ df$hydro)
Residuals:
Min 1Q Median 3Q Max
-4.6724 -3.2113 -0.0626 2.5783 7.3037
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 77.863 4.199 18.544 3.54e-13 ***
df$hydro 11.801 3.485 3.386 0.00329 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 3.597 on 18 degrees of freedom
Multiple R-squared: 0.3891, Adjusted R-squared: 0.3552
F-statistic: 11.47 on 1 and 18 DF, p-value: 0.003291

b)
The corresponding p-value is 0.003291, which is small, therefore I reject the null hypothesis and conclude that there is a linear relationship between percent purity and percent hydrocarbons.

c)
We can also see from the summary(fit) display that the value of R² is 0.3891, or 38.91%.
d)
A 100(1 − α) percent confidence interval (CI) on the slope β1 is given by:

β̂1 − t(α/2, n−2)·se(β̂1) ≤ β1 ≤ β̂1 + t(α/2, n−2)·se(β̂1)
Using the values obtained from the summary(fit) display and data from the context of the problem, I arrive at the following:
α/2 = 0.025
β̂1 = 11.80
t(0.025,18) = 2.101
se(β̂1) = 3.485

So my 95% CI on the slope is:
4.47802 ≤ β1 ≤ 19.122
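R can also produce this slope interval directly with confint(). A sketch on simulated stand-in data (the original CSV is not included here):

```r
# confint() computes estimate +/- t(alpha/2, n-2) * se(estimate)
# for every coefficient in one call.
set.seed(1)
x <- runif(20, 0.9, 1.6)
y <- 77 + 12 * x + rnorm(20, sd = 3.5)
fit <- lm(y ~ x)

ci <- confint(fit, level = 0.95)  # rows: (Intercept), x; cols: 2.5 %, 97.5 %
ci
```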
e)
From the data obtained in the display of summary(fit) and data from the context of the problem we obtain:
ŷ|x0 = β̂0 + β̂1·x0 = 77.863 + 11.80(1.00) = 89.663
t(0.025,18) = 2.101
The left hand side of the interval is:
> ls=89.663-2.101*sqrt(msr*((1/20)+((1-mean(x))^2/sxx)))
> ls
[1] 87.50878
The right hand side of the interval is:
> lr=89.663+2.101*sqrt(msr*((1/20)+((1-mean(x))^2/sxx)))
> lr
[1] 91.81722

So my 95% CI on the mean response μ(y|x0 = 1.00) is:

87.51 ≤ μ(y|x0) ≤ 91.82
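The mean-response interval built by hand above can be reproduced with predict(..., interval = "confidence"). A sketch on simulated stand-in data, with x0 = 1.00 as in part e:

```r
# predict() with interval = "confidence" builds the CI on the mean
# response at a chosen x0.
set.seed(1)
x <- runif(20, 0.9, 1.6)
y <- 77 + 12 * x + rnorm(20, sd = 3.5)
fit <- lm(y ~ x)

pr <- predict(fit, newdata = data.frame(x = 1.00),
              interval = "confidence", level = 0.95)
pr   # columns: fit, lwr, upr
```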

f) Scatter plot for the data set (figure not reproduced in this text).


Exercise 2-12.
a.
> df=read.csv("dataSet2-12.csv")
> fit=lm(df$usage~df$temp)
> summary(fit)

Call:
lm(formula = df$usage ~ df$temp)
Residuals:
Min 1Q Median 3Q Max
-2.5629 -1.2581 -0.2550 0.8681 4.0581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.33209 1.67005 -3.792 0.00353 **
df$temp 9.20847 0.03382 272.255 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 1.946 on 10 degrees of freedom
Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
F-statistic: 7.412e+04 on 1 and 10 DF, p-value: < 2.2e-16

> x=df$temp
My fitted model is:
> y=-6.332+9.208*x

b.
Test for significance of regression:
H0: β1 = 0, H1: β1 ≠ 0
> anova(fit)
Analysis of Variance Table
Response: df$usage
Df Sum Sq Mean Sq F value Pr(>F)
df$temp 1 280590 280590 74123 < 2.2e-16 ***
Residuals 10 38 4

From the above display I see that:
SS_R = 280590
SS_Res = 38

F0 = (SS_R / df_R) / (SS_Res / df_Res) = MS_R / MS_Res
> f=280590/4
> f
[1] 70147.5

The value above differs slightly from the F value in the anova table (74123) because the mean square shown in the table (4) is rounded. F0 is large and the p-value < 2.2e-16, therefore I reject H0: β1 = 0.
If we use the t test we also find evidence to reject the null hypothesis:
se(β̂1) = 0.03382
β̂1 = 9.208
t0 = 272.255
p-value < 2.2e-16
Therefore we reject H0: β1 = 0 and conclude that there is a linear relationship between the number of pounds of steam per month and the average monthly temperature.

c.
y=-6.332+9.208*x #This is also called the line of means

This is equivalent to testing H0: β1 = 10,000 versus H1: β1 ≠ 10,000. Since usage in the table is recorded per 1000 lb, the hypothesized slope becomes β1 = 10.
From the summary(fit) table presented above:
> t=(9.20847-10)/0.03382
> t
[1] -23.4042
The p-value for t0 = −23.40 on 10 degrees of freedom is far smaller than any reasonable α/2, therefore I reject the null hypothesis and claim that the increase is less than 10,000 lb.

d.
> anova(fit)
Analysis of Variance Table
Response: df$usage
Df Sum Sq Mean Sq F value Pr(>F)
df$temp 1 280590 280590 74123 < 2.2e-16 ***
Residuals 10 38 4
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
> x=df$temp
> sxx=sum((x-mean(x))^2)
> msr=38/10  # MS_Res for this model, taken from the anova table above
> ll=527.732-3.169*sqrt(msr*(1+(1/12)+((58-mean(x))^2/sxx)))
> ll
[1] 521.0146
> lr=527.732+3.169*sqrt(msr*(1+(1/12)+((58-mean(x))^2/sxx)))
> lr
[1] 534.4494
The 99% prediction interval on steam usage in a month with average ambient temperature of 58° is approximately:
(521.01, 534.45).
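The same kind of 99% prediction interval can be produced in one call with predict(..., interval = "prediction"). A sketch on simulated stand-in data (the original dataSet2-12.csv is not reproduced here):

```r
# predict() with interval = "prediction" includes the extra "1 +" term
# for a new observation, matching the hand computation above.
set.seed(2)
temp  <- runif(12, 21, 80)
usage <- -6.3 + 9.2 * temp + rnorm(12, sd = 2)
fit   <- lm(usage ~ temp)

pi99 <- predict(fit, newdata = data.frame(temp = 58),
                interval = "prediction", level = 0.99)
pi99   # columns: fit, lwr, upr
```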

The scatter plot for the data (figure not reproduced in this text).
Exercise 2-13.
a.
> df=read.csv("dataSet2-13.csv")
Scatter Plot
> plot(df$days~df$index,xlab="Index",ylab="Days",main="Exercise 2-13",ylim=c(8,160))




b.
> fit=lm(df$days~df$index)
> fit
Call:
lm(formula = df)
Coefficients:
(Intercept) index
183.596 -7.404
My prediction equation will be:
y = 183.596 - 7.404*x

c.
H0: β1 = 0
H1: β1 ≠ 0

> summary(fit)

Call:
lm(formula = df)

Residuals:
Min 1Q Median 3Q Max
-48.252 -21.947 -2.305 26.979 48.008
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 183.596 214.359 0.856 0.406
index -7.404 12.351 -0.599 0.558

From these results, comparing the t values and p-values, I see that there is not enough evidence to reject the null hypothesis. Hence there is no linear relationship between the index and days.

If I use anova function on fit:
> anova(fit)
Analysis of Variance Table
Response: days
Df Sum Sq Mean Sq F value Pr(>F)
index 1 349.7 349.69 0.3593 0.5585
Residuals 14 13624.7 973.20

From the above result I see that F is small (0.3593) and the p-value (0.5585) is large. This reinforces the conclusion that there is no linear relationship between the response and the predictor variable.

d.
My confidence intervals with the original data:
> conf1=predict(fit,interval="confidence",level=0.95)
My prediction intervals with the original data:
> pre1=predict(fit,interval="prediction",level=0.95)
I create a new data frame with values within the range of the predictor variable (the column name must match the predictor used in the fit):
> nd <- data.frame(index=seq(16,18.20,length=51))
Now I create new confidence and prediction intervals for nd:
> conf2=predict(fit,nd,interval="confidence",level=0.95)
> pre2=predict(fit,nd,interval="prediction",level=0.95)
> plot(df$days~df$index,xlab="Index",ylab="Days",main="Exercise 2-13",ylim=c(-40,160))
> abline(fit)
> matlines(df$index,conf1[,c("lwr","upr")],col=2,lty=1,type="b",pch="+")
> matlines(df$index,pre1[,c("lwr","upr")],col=2,lty=2,type="b",pch="1")
> matlines(nd$index,conf2[,c("lwr","upr")],col=4,lty=1,type="b",pch="+")
> matlines(nd$index,pre2[,c("lwr","upr")],col=4,lty=2,type="b",pch=1)

Exercise 2-17.
a.
> df=read.csv("dataSet2-17.csv")
> names(df)=c("b","p")
My scatter plot will be:
>plot(df$b~df$p,xlab="Barometric Pressure (in Hg)",ylab="Boiling Points (F)",main="Boiling Point Vs
Barometric Pressure")

> fit=lm(df$b~df$p, data=df)
> fit
Call:
lm(formula = df$b ~ df$p, data = df)
Coefficients:
(Intercept) df$p
163.93 1.58
My fitted model will be:

y = 163.93 + 1.58*x

Analysis of Variance Table.
> anova(fit)
Analysis of Variance Table
Response: df$b
Df Sum Sq Mean Sq F value Pr(>F)
df$p 1 376.92 376.92 226.04 1.879e-10 ***
Residuals 15 25.01 1.67
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

TESTING SIGNIFICANCE OF REGRESSION. (Analysis of Variance Approach)
The hypotheses:
H0: β1 = 0, H1: β1 ≠ 0
The F value obtained above (226.04) is large and the p-value (1.879e-10) is very small, therefore we reject the null hypothesis and conclude that there is a linear relationship between boiling point and barometric pressure. It can also be concluded that the model is a good fit for the data.

CONFIDENCE INTERVALS ON β0, β1 AND σ².
I will calculate 95% confidence intervals.
n = 17
α/2 = 0.025
t(0.025,15) = 2.131
χ²(0.025,15) = 27.488
> sxx=sum((df$p-mean(df$p))^2)
> sxx
[1] 151.054
se(β̂1) = sqrt(MS_Res/Sxx) = sqrt(1.67/151.054) = 0.10551

se(β̂0) = sqrt(MS_Res·(1/n + x̄²/Sxx))
> se0=sqrt(msres*((1/17)+(mean(df$p))^2/sxx))
se(β̂0) = 2.655158


95% CI on β0:
163.93 ± 2.131(2.655158)
163.93 ± 5.658
(158.272, 169.588)

95% CI on β1:
1.58 ± 2.131(0.10551)
1.58 ± 0.224842
(1.35516, 1.80484)

95% CI on σ²:
The interval is (n−2)·MS_Res/χ²(0.025,15) ≤ σ² ≤ (n−2)·MS_Res/χ²(0.975,15), where (n−2)·MS_Res = SS_Res and n − 2 = 15 (not 17).
> y=163.93+1.58*df$p
> ssres=sum((df$b-y)^2)
> msres=ssres/15
> ll=(15*msres)/27.488
> rl=(15*msres)/6.262
With MS_Res = 1.66757 this gives approximately:
(0.910, 3.995)
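The chi-square interval on σ² generalizes to any simple linear fit. A small helper sketch (the sigma2_ci name is my own, and the simulated data stand in for the original CSV):

```r
# Chi-square CI on sigma^2:
#   SS_Res / qchisq(1 - a, n-2)  <=  sigma^2  <=  SS_Res / qchisq(a, n-2)
# Note the multiplier is the residual degrees of freedom n - 2, not n.
sigma2_ci <- function(fit, level = 0.95) {
  dfres <- df.residual(fit)          # n - 2 for simple regression
  ssres <- sum(resid(fit)^2)         # (n - 2) * MS_Res
  a <- (1 - level) / 2
  c(lower = ssres / qchisq(1 - a, dfres),
    upper = ssres / qchisq(a, dfres))
}

# Usage on simulated stand-in data:
set.seed(3)
p <- runif(17, 20, 30)
b <- 164 + 1.6 * p + rnorm(17, sd = 1.3)
sigma2_ci(lm(b ~ p))
```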

A 95% prediction interval for the boiling point when the pressure is 27:
ŷ = 163.93 + 1.58(27) = 206.59
t(0.025,15) = 2.131
> msres
[1] 1.66757
> se.pred=sqrt(msres*(1+1/17+(27-mean(df$p))^2/sxx))
The interval is 206.59 ± 2.131·se.pred.







The summary(fit) will give me more analysis information.
> summary(fit)

Call:
lm(formula = df$b ~ df$p, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.41483 -0.91550 -0.05148 0.76941 2.72840
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 163.9307 2.6551 61.74 < 2e-16 ***
df$p 1.5796 0.1051 15.04 1.88e-10 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 1.291 on 15 degrees of freedom
Multiple R-squared: 0.9378, Adjusted R-squared: 0.9336
F-statistic: 226 on 1 and 15 DF, p-value: 1.879e-10

The coefficient of determination is R² = 0.9378.

PLOT 95% CONFIDENCE AND PREDICTION BANDS.
My confidence intervals with the original data:
> conf1=predict(fit,interval="confidence",level=0.95)
My prediction intervals with the original data:
> pre1=predict(fit,interval="prediction",level=0.95)
I create a new data frame with values within the range of the predictor variable (the column name must match the predictor used in the fit):
> nd <- data.frame(p=seq(20,30,length=50))
Now I create new confidence and prediction intervals for nd:
> conf2=predict(fit,nd,interval="confidence",level=0.95)
> pre2=predict(fit,nd,interval="prediction",level=0.95)
> plot(df$b~df$p,xlab="Barometric Pressure (in Hg)",ylab="Boiling Point (F)",main="Boiling Point Vs Barometric Pressure", ylim=c(190,215))
> abline(fit)
> matlines(df$p,conf1[,c("lwr","upr")],col=1,lty=1,type="b",pch="+")
> matlines(df$p,pre1[,c("lwr","upr")],col=2,lty=2,type="b",pch="1")
> matlines(nd$p,conf2[,c("lwr","upr")],col=1,lty=1,type="b",pch="+")
> matlines(nd$p,pre2[,c("lwr","upr")],col=2,lty=2,type="b",pch=1)




Exercise 2-18.
>df=read.csv("dataset2-18.csv")
> names(df)=c("firm","s","r")
> fit=lm(df$r~df$s)
> fit
Call:
lm(formula = df$r ~ df$s)
Coefficients:
(Intercept) df$s
22.1627 0.3632
a.
My fitted model:
y = 22.1627 + 0.3632*x

b.
> anova(fit)
Analysis of Variance Table
Response: df$r
Df Sum Sq Mean Sq F value Pr(>F)
df$s 1 7723.3 7723.3 13.983 0.001389 **
Residuals 19 10494.1 552.3

After analyzing the anova output I see that F is large (13.983) and the p-value (0.001389) is small, therefore I conclude that the regression is significant; in other words, there is a linear relationship between the variables in question.
> summary(fit)
Call:
lm(formula = df$r ~ df$s)
Residuals:
Min 1Q Median 3Q Max
-42.422 -12.623 -8.171 8.832 50.526

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.16269 7.08948 3.126 0.00556 **
df$s 0.36317 0.09712 3.739 0.00139 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 23.5 on 19 degrees of freedom
Multiple R-squared: 0.424, Adjusted R-squared: 0.3936
F-statistic: 13.98 on 1 and 19 DF, p-value: 0.001389

From the summary display I see that R² is about 42.4%, which means that a lot of the variation is not explained by this model.

c.
> conf1=predict(fit,interval="confidence",level=0.95)
> pre1=predict(fit,interval="prediction",level=0.95)
I create a new data frame with values within the range of the predictor variable (the column name must match the predictor used in the fit):
> nd <- data.frame(s=seq(0,200,length=50))
Now I create new confidence and prediction intervals for nd:
> conf2=predict(fit,nd,interval="confidence",level=0.95)
> pre2=predict(fit,nd,interval="prediction",level=0.95)

> plot(df$r~df$s,xlab="Amount Spent (Millions)",ylab="Returned Impressions per Week",main="Linear Regression Model", ylim=c(-50,150))
> abline(fit)
> matlines(df$s,conf1[,c("lwr","upr")],col=1,lty=1,type="b",pch="+")
> matlines(df$s,pre1[,c("lwr","upr")],col=2,lty=2,type="b",pch="1")
> matlines(nd$s,conf2[,c("lwr","upr")],col=1,lty=1,type="b",pch="+")
> matlines(nd$s,pre2[,c("lwr","upr")],col=2,lty=2,type="b",pch=1)



d.
Confidence and Prediction interval for x=26.9
> anova(fit)
Analysis of Variance Table
Response: df$r
Df Sum Sq Mean Sq F value Pr(>F)
df$s 1 7723.3 7723.3 13.983 0.001389 **
Residuals 19 10494.1 552.3

From the anova display:
MS_Res = 552.3
t(0.025,19) = 2.093
x0 = 26.9
ŷ|x0 = β̂0 + β̂1·x0 = 22.1627 + 0.3632(26.9) = 31.9328
> sxx=sum((df$s-mean(df$s))^2)
> sxx
[1] 58556.08

95% CI on the mean response.
> 31.92-2.093*sqrt((552.3*((1/21)+(((26.9-mean(df$s))^2)/sxx))))
[1] 20.17142
> 31.92+2.093*sqrt((552.3*((1/21)+(((26.9-mean(df$s))^2)/sxx))))
[1] 43.66858
(20.17142, 43.66858).
95% prediction Interval.
> 31.92-2.093*sqrt((552.3*((1+1/21)+(((26.9-mean(df$s))^2)/sxx))))
[1] -18.65135
> 31.92+2.093*sqrt((552.3*((1+1/21)+(((26.9-mean(df$s))^2)/sxx))))
[1] 82.49135
(-18.65135,82.49135).
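Both intervals at x0 = 26.9 can also be obtained in one step with predict(). A sketch on simulated stand-in data (the original dataset2-18.csv is not reproduced here):

```r
# Confidence interval on the mean response and prediction interval on a
# new observation at the same x0; the prediction interval is the wider one.
set.seed(4)
s <- runif(21, 1, 200)
r <- 22 + 0.36 * s + rnorm(21, sd = 23)
fit <- lm(r ~ s)
nd  <- data.frame(s = 26.9)

predict(fit, nd, interval = "confidence", level = 0.95)
predict(fit, nd, interval = "prediction", level = 0.95)
```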
