
Case Study: Linear regression

Let us assume that we are the owner of a small business, Regional Delivery Service, Inc. (RDS),
which offers same-day delivery for letters, packages, and other small cargo. We are able to use
Google Maps to group individual deliveries into one trip to reduce time and fuel costs.
Therefore, some trips will have more than one delivery. As the owner, we would like to be able
to estimate how long a delivery will take based on three factors: 1) the total trip distance
in miles, 2) the number of deliveries that must be made during the trip, and 3) the daily
price of gas/petrol in U.S. dollars.
To carry out the analysis, we take a random sample of 10 past trips and record four pieces of
information for each trip: the travel time and the three factors listed above.
# install packages (only needed once)
install.packages("car")
install.packages("SparseM")
install.packages("corrplot")
install.packages("PerformanceAnalytics")
# library
library(car)
# dataset: one row per sampled trip
test <- data.frame(
  TravelTime_y     = c(7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4),
  MilesTravel_x1   = c(89, 66, 78, 111, 44, 77, 80, 66, 109, 76),
  NumDeliveries_x2 = c(4, 1, 3, 6, 1, 3, 3, 2, 5, 3),
  GasPrice_x3      = c(3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25))
# scatter plot matrix of all variables
plot(test)
# other method: correlation matrix plot
library(corrplot)
newdatacor = cor(test)
corrplot(newdatacor, method = "number")
# other method: correlation chart
library("PerformanceAnalytics")
chart.Correlation(test, histogram = FALSE, pch = 19)
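The plots above show the correlations visually. As a minimal sketch, assuming Pearson correlation is the measure of interest, the coefficient and its p-value for every pair of variables could be tabulated with base R's `cor.test()`. The data frame repeats the sample listed earlier so the block runs on its own; the names `trips`, `pairs_idx`, and `cor_table` are introduced here for illustration only.

```r
# Sample data repeated from the case study so the block is self-contained
trips <- data.frame(
  TravelTime_y     = c(7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4),
  MilesTravel_x1   = c(89, 66, 78, 111, 44, 77, 80, 66, 109, 76),
  NumDeliveries_x2 = c(4, 1, 3, 6, 1, 3, 3, 2, 5, 3),
  GasPrice_x3      = c(3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25)
)

# All unordered pairs of column names (one pair per matrix column)
pairs_idx <- combn(names(trips), 2)

# One row per pair: Pearson r and the two-sided p-value from cor.test()
cor_table <- do.call(rbind, lapply(seq_len(ncol(pairs_idx)), function(i) {
  a  <- pairs_idx[1, i]
  b  <- pairs_idx[2, i]
  ct <- cor.test(trips[[a]], trips[[b]], method = "pearson")
  data.frame(var1 = a, var2 = b,
             r = round(unname(ct$estimate), 3),
             p_value = round(ct$p.value, 4))
}))

print(cor_table)
```

Keeping the result in a data frame makes it easy to sort or filter the pairs, for example to list only those with a p-value below a chosen significance level.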


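# correlation matrix rounded to 2 decimals
newdatacor = round(cor(test), 2)
# AIC of each model (run after the models m1 to m7 below have been fitted)
AIC(m1)
AIC(m2)
AIC(m3)
AIC(m4)
AIC(m5)
AIC(m6)
AIC(m7)
# stepwise selection by AIC
step(m7, direction = "both")
step(m7, direction = "forward")
step(m7, direction = "backward")
# RMSE of the full model on the sample data
predicted <- predict(m7, test)
difference <- predicted - test$TravelTime_y
rmse <- sqrt(mean(difference^2))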
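The RMSE computed this way divides the sum of squared residuals by the number of observations n, whereas the residual standard error reported by `summary()` divides by the residual degrees of freedom, so the two values differ slightly. A small sketch, assuming a hypothetical helper named `rmse_of()` and using the single-predictor model so that the block runs on its own:

```r
# Sample data repeated from the case study (only the columns needed here)
trips <- data.frame(
  TravelTime_y   = c(7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4),
  MilesTravel_x1 = c(89, 66, 78, 111, 44, 77, 80, 66, 109, 76)
)
fit <- lm(TravelTime_y ~ MilesTravel_x1, data = trips)

# Hypothetical helper: in-sample RMSE, dividing by n
rmse_of <- function(model) sqrt(mean(residuals(model)^2))

rmse_val  <- rmse_of(fit)
sigma_val <- summary(fit)$sigma   # divides by n - 2 here (two coefficients)

# The two quantities are linked by the factor sqrt(df_residual / n)
n <- nrow(trips)
print(c(rmse = rmse_val, resid_se = sigma_val))
print(all.equal(rmse_val, sigma_val * sqrt(df.residual(fit) / n)))
```

Because this RMSE is computed on the same data used for fitting, it always decreases (or stays equal) as predictors are added, so it is best read alongside the adjusted R-squared and AIC.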
Relationship   Correlation coefficient (r)   P value                     Action
Dependent variable vs independent variable (X4 is the dependent variable, TravelTime_y)
X4 vs x1       0.928 (high)                  0.0001 (significant)        Accept (r value is high)
X4 vs x2       0.916 (high)                  0.0002 (significant)        Accept (r value is high)
X4 vs x3       0.236 (low)                   0.5111 (not significant)    Reject (r value is low)
Independent variable vs independent variable
X2 vs x1       0.956 (high)                  <0.0001 (significant)       Reject (r value is high)
X3 vs x1       0.314 (low)                   0.3771 (not significant)    Accept (r value is low)
X3 vs x2       0.453 (low)                   0.1881 (not significant)    Accept (r value is low)
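The high correlation between x1 and x2 is what drives the variance inflation factor (VIF) up when both are used in one model. As a sketch using base R only, the VIF of a predictor can be computed as 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The variable names below are illustrative.

```r
# Miles travelled and number of deliveries, repeated from the sample data
x1 <- c(89, 66, 78, 111, 44, 77, 80, 66, 109, 76)
x2 <- c(4, 1, 3, 6, 1, 3, 3, 2, 5, 3)

# R-squared from regressing x1 on the only other predictor, x2
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# With exactly two predictors, this R-squared is the squared correlation,
# so both predictors share the same VIF
print(c(r = cor(x1, x2), r_squared = r2_x1, vif = vif_x1))
```

A VIF above 10 is the threshold used in this case study to flag problematic multicollinearity.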
# Models: fit, summary, and VIF
# (car::vif() requires at least two predictors, so it is only called for m4 to m7)
m1 <- lm(TravelTime_y ~ MilesTravel_x1, data = test)
summary(m1)
m2 <- lm(TravelTime_y ~ NumDeliveries_x2, data = test)
summary(m2)
m3 <- lm(TravelTime_y ~ GasPrice_x3, data = test)
summary(m3)
m4 <- lm(TravelTime_y ~ MilesTravel_x1 + NumDeliveries_x2, data = test)
summary(m4)
car::vif(m4)
m5 <- lm(TravelTime_y ~ MilesTravel_x1 + GasPrice_x3, data = test)
summary(m5)
car::vif(m5)
m6 <- lm(TravelTime_y ~ NumDeliveries_x2 + GasPrice_x3, data = test)
summary(m6)
car::vif(m6)
m7 <- lm(TravelTime_y ~ MilesTravel_x1 + NumDeliveries_x2 + GasPrice_x3, data = test)
summary(m7)
car::vif(m7)
Results of all 7 models:

Fit statistics (x1 = MilesTravel_x1, x2 = NumDeliveries_x2, x3 = GasPrice_x3):

Model  Predictors   Multiple   Adjusted   Residual    RMSE    F-statistic  p-value   AIC
                    R-squared  R-squared  std. error
m1     x1           0.8615     0.8442     0.3423      0.306   49.77        0.00010   10.7
m2     x2           0.8399     0.8199     0.3681      0.329   41.96        0.00019   12.1
m3     x3           0.0714     -0.0446    0.8864      0.792   0.615        0.4555    29.7
m4     x1, x2       0.8714     0.8347     0.3526      0.295   23.72        0.00076   11.9
m5     x1, x3       0.8661     0.8278     0.3599      0.301   22.63        0.00087   12.3
m6     x2, x3       0.8876     0.8555     0.3297      0.275   27.63        0.00047   10.6
m7     x1, x2, x3   0.8947     0.842      0.3447      0.266   16.99        0.00245   11.9

Coefficients and VIF:

Model  Coefficients                                              VIF
       Intercept  MilesTravel_x1  NumDeliveries_x2  GasPrice_x3  MilesTravel_x1  NumDeliveries_x2  GasPrice_x3
m1     3.185      0.040           -                 -            -               -                 -
m2     4.845      -               0.498             -            -               -                 -
m3     3.536      -               -                 0.81         -               -                 -
m4     3.732      0.026           0.184             -            11.59           11.59             -
m5     3.867      0.041           -                 -0.21        1.14            -                 1.14
m6     7.324      -               0.566             -0.76        -               1.33              1.33
m7     6.211      0.014           0.383             -0.606       14.93           17.35             1.71
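A comparison table of this kind does not have to be filled in by hand. The following is one possible sketch that collects the fit statistics of all seven models into a single data frame; the names `trips`, `formulas`, `fit_stats`, and `results` are assumptions made for this example.

```r
# Sample data repeated from the case study so the block is self-contained
trips <- data.frame(
  TravelTime_y     = c(7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4),
  MilesTravel_x1   = c(89, 66, 78, 111, 44, 77, 80, 66, 109, 76),
  NumDeliveries_x2 = c(4, 1, 3, 6, 1, 3, 3, 2, 5, 3),
  GasPrice_x3      = c(3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25)
)

# The seven candidate models, named as in the case study
formulas <- list(
  m1 = TravelTime_y ~ MilesTravel_x1,
  m2 = TravelTime_y ~ NumDeliveries_x2,
  m3 = TravelTime_y ~ GasPrice_x3,
  m4 = TravelTime_y ~ MilesTravel_x1 + NumDeliveries_x2,
  m5 = TravelTime_y ~ MilesTravel_x1 + GasPrice_x3,
  m6 = TravelTime_y ~ NumDeliveries_x2 + GasPrice_x3,
  m7 = TravelTime_y ~ MilesTravel_x1 + NumDeliveries_x2 + GasPrice_x3
)

# Fit one model and return its summary statistics as a named vector
fit_stats <- function(f) {
  fit <- lm(f, data = trips)
  s   <- summary(fit)
  fs  <- s$fstatistic  # named vector: value, numdf, dendf
  c(r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    resid_se      = s$sigma,
    rmse          = sqrt(mean(residuals(fit)^2)),
    f_stat        = unname(fs["value"]),
    p_value       = unname(pf(fs["value"], fs["numdf"], fs["dendf"],
                              lower.tail = FALSE)),
    aic           = AIC(fit))
}

# One row per model, one column per statistic
results <- as.data.frame(t(sapply(formulas, fit_stats)))
print(round(results, 4))
```

Sorting `results` by `aic` or `adj_r_squared` then gives a quick ranking of the candidates.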

Conclusion:

1. The correlation coefficient of X3 with travel time is very low (0.27), so the models that
include X3 (m3, m5, m6, m7) are not considered for model selection.
2. Models m7 and m4 are affected by multicollinearity: their VIF values exceed 10 because x1 and
x2 are highly correlated.
3. Model m1 has a higher adjusted R-squared, a higher F-statistic, a lower RMSE, and a lower AIC
than model m2, so model m1 is the best model.
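With m1 selected, the original goal of estimating travel time can be addressed directly. The sketch below predicts the travel time for a hypothetical 84-mile trip (the distance is an arbitrary value chosen for illustration, not taken from the sample) and asks `predict()` for a 95% prediction interval.

```r
# Sample data repeated from the case study (only the columns m1 needs)
trips <- data.frame(
  TravelTime_y   = c(7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4),
  MilesTravel_x1 = c(89, 66, 78, 111, 44, 77, 80, 66, 109, 76)
)
m1 <- lm(TravelTime_y ~ MilesTravel_x1, data = trips)

# Hypothetical new trip: 84 miles (assumed value, for illustration only)
new_trip <- data.frame(MilesTravel_x1 = 84)

# Point estimate plus lower and upper bounds of the 95% prediction interval
pred <- predict(m1, newdata = new_trip, interval = "prediction", level = 0.95)
print(pred)  # columns: fit, lwr, upr
```

With only 10 observations the interval is wide, so the point estimate should be treated as a rough guide.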
After excluding the third variable (GasPrice_x3) on the basis of the correlation matrix, we can
also verify the model choice with the stepwise AIC method.

> step(m4,direction = "both")


Start: AIC=-18.41
TravelTime_y ~ MilesTravel_x1 + NumDeliveries_x2

Df Sum of Sq RSS AIC


- NumDeliveries_x2 1 0.066906 0.9374 -19.672
<none> 0.8705 -18.413
- MilesTravel_x1 1 0.213433 1.0839 -18.220

Step: AIC=-19.67
TravelTime_y ~ MilesTravel_x1

Df Sum of Sq RSS AIC


<none> 0.9374 -19.6723
+ NumDeliveries_x2 1 0.0669 0.8705 -18.4128
- MilesTravel_x1 1 5.8316 6.7690 -1.9023

Call:
lm(formula = TravelTime_y ~ MilesTravel_x1, data = test)

Coefficients:
(Intercept) MilesTravel_x1
3.18556 0.04026
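The AIC values printed by `step()` (for example -19.67 for m1) differ from those returned by `AIC()` (10.7 for m1) because `step()` relies on `extractAIC()`, which drops additive constants that are the same for every model fitted to the same data. The sketch below reproduces both numbers for m1 from the residual sum of squares; the variable names are illustrative.

```r
# Sample data repeated from the case study (only the columns m1 needs)
trips <- data.frame(
  TravelTime_y   = c(7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4),
  MilesTravel_x1 = c(89, 66, 78, 111, 44, 77, 80, 66, 109, 76)
)
m1 <- lm(TravelTime_y ~ MilesTravel_x1, data = trips)

n   <- nrow(trips)
rss <- sum(residuals(m1)^2)   # residual sum of squares
k   <- length(coef(m1))       # intercept + slope = 2

# Value on the scale used by step(): n * log(RSS / n) + 2 * k
aic_step <- n * log(rss / n) + 2 * k

# AIC() keeps the Gaussian likelihood constant and also counts the
# error variance as a parameter, which adds n * (log(2 * pi) + 1) + 2
aic_full <- aic_step + n * (log(2 * pi) + 1) + 2

print(c(by_hand_step = aic_step, extractAIC = extractAIC(m1)[2],
        by_hand_full = aic_full, AIC = AIC(m1)))
```

Since the offset is identical for all models fitted to these 10 trips, both scales rank the models in the same order.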
Diagnostic plots for model m1:
par(mfrow = c(2, 2))
plot(m1)

• The first plot is Residuals vs Fitted. The residuals should be randomly scattered with no visible pattern. In the figure they look random, so this looks good.
• The second is the Normal Q-Q (quantile-quantile) plot, which checks the normality of the residuals. The points should follow the straight line. The fit is not very good, but it is acceptable.
• The third (Scale-Location) is similar to the first but uses the square root of the absolute standardized residuals. It should look random as well, and the chart looks OK.
• The fourth (Residuals vs Leverage) highlights extreme, influential points, and here it also looks good.
• More inferences could be drawn if the number of observations were higher.
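The visual checks can be complemented with numbers. As a sketch, the Shapiro-Wilk test gives a formal check of residual normality and Cook's distance quantifies how much each trip influences the fit. The 4/n cut-off used below is a common rule of thumb and an assumption of this example, not part of the original analysis.

```r
# Sample data repeated from the case study (only the columns m1 needs)
trips <- data.frame(
  TravelTime_y   = c(7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4),
  MilesTravel_x1 = c(89, 66, 78, 111, 44, 77, 80, 66, 109, 76)
)
m1 <- lm(TravelTime_y ~ MilesTravel_x1, data = trips)

# Normality of residuals: a small p-value would suggest non-normal residuals
sw <- shapiro.test(residuals(m1))
print(sw)

# Influence of each observation on the fitted model
cooks <- cooks.distance(m1)
print(round(cooks, 3))

# Flag trips above the assumed rule-of-thumb threshold of 4 / n
threshold   <- 4 / nrow(trips)
influential <- which(cooks > threshold)
print(influential)
```

With a sample this small, such tests have little power, so they support rather than replace the plots.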
