

Linear Regression Example in R using lm() Function

Summary: R linear regression uses the lm() function to create a regression model given some formula, in the form of Y~X+X2. To look at the model, you use the summary() function. To analyze the residuals, you pull out the $resid variable from your new model. Residuals are the differences between the prediction and the actual results, and you need to analyze these differences to find ways to improve your regression model.

To do linear (simple and multiple) regression in R, you need the built-in lm function.
Here's the data we will use: one year of marketing spend and company sales by month. Download: CSV

Month  Spend  Sales
1      1000   9914
2      4000   40487
3      5000   54324
4      4500   50044
5      3000   34719
6      4000   42551
7      9000   94871
8      11000  118914
9      15000  158484
10     12000  131348
11     7000   78504
12     3000   36284

Assuming you've downloaded the CSV, we'll read the data into R and call it the dataset variable.

#You may need to use the setwd(directory-name) command to
#change your working directory to wherever you saved the csv.
#Use getwd() to see what your current directory is.
dataset = read.csv("data-marketing-budget-12mo.csv", header=T,
                   colClasses = c("numeric", "numeric", "numeric"))
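Before fitting anything, it's worth a quick sanity check that the file was read in as expected (a minimal sketch using base R; the column names come from the CSV header):

str(dataset)      #should show 3 numeric columns: Month, Spend, Sales
head(dataset, 5)  #first five rows should match the table above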

Simple (One Variable) and Multiple Linear Regression Using lm()


The predictor (or independent) variable for our linear regression will be Spend (notice the capitalized S) and the dependent variable (the one we’re trying to
predict) will be Sales (again, capital S).

The lm function really just needs a formula (Y~X) and then a data source.  We'll use Sales~Spend, data=dataset, and we'll call the resulting linear model simple.fit.

simple.fit = lm(Sales~Spend, data=dataset)
summary(simple.fit)

multi.fit = lm(Sales~Spend+Month, data=dataset)
summary(multi.fit)

Notice that on the multi.fit line, the Spend variable is accompanied by the Month variable and a plus sign (+).  The plus sign includes the Month variable in the model as a second predictor (independent) variable.
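One quick way to confirm which predictors each model actually contains is to print the fitted coefficients (a minimal sketch; coef() is base R):

coef(simple.fit)  #(Intercept) and Spend
coef(multi.fit)   #(Intercept), Spend and Month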

The summary function outputs the results of the linear regression model.
[Figure: output of R's lm summary for both models, showing the formula used, the summary statistics for the residuals, the coefficients (or weights) of the predictor variables, and finally the performance measures including the residual standard error, R-squared, and the F-statistic.]

Both models are statistically significant (see the F-statistic for the regression), and the Multiple R-squared and Adjusted R-squared are both exceptionally high (keep in mind, this is a simplified example).  We also see that all of the variables are significant (as indicated by the "**").

Interpreting R’s Regression Output


Residuals: This section summarizes the residuals, the error between the model's predictions and the actual results.  Smaller residuals are better.
Coefficients: For each variable and the intercept, a weight is produced, and that weight has other attributes like the standard error, a t-value, and a significance level.
  Estimate: This is the weight given to the variable.  In the simple regression case (one variable plus the intercept), for every one-dollar increase in Spend, the model predicts an increase of $10.6222 in Sales.
  Std. Error: Tells you how precisely the estimate was measured.  It's really only useful for calculating the t-value.
  t-value and Pr(>|t|): The t-value is calculated by dividing the coefficient by the Std. Error.  It is then used to test whether or not the coefficient is significantly different from zero.  If it isn't significant, then the coefficient really isn't adding anything to the model and could be dropped or investigated further.  Pr(>|t|) is the significance level.
Performance Measures: Three sets of measurements are provided.
  Residual Standard Error: This is the standard deviation of the residuals.  Smaller is better.
  Multiple / Adjusted R-squared: For one variable, the distinction doesn't really matter.  R-squared shows the amount of variance explained by the model.  Adjusted R-squared takes into account the number of variables and is most useful for multiple regression.
  F-Statistic: The F-test checks whether at least one variable's weight is significantly different from zero.  This is a global test to help assess a model.  If the p-value is not significant (e.g. greater than 0.05), then your model is essentially not doing anything.
Need more concrete explanations?  I explain summary output on this page.
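If you'd rather work with these numbers programmatically than read them off the printout, they can be pulled out of the summary object (a minimal sketch; the component names below are standard parts of R's summary.lm value):

s = summary(simple.fit)
coef(s)                    #matrix of Estimates, Std. Errors, t values and Pr(>|t|)
coef(s)[,"Estimate"] / coef(s)[,"Std. Error"]  #reproduces the t values
s$sigma                    #residual standard error
s$r.squared                #multiple R-squared
s$adj.r.squared            #adjusted R-squared
s$fstatistic               #F-statistic and its degrees of freedom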

With the descriptions out of the way, let’s start interpreting.

Residuals: We can see that the multiple regression model has a smaller range for the residuals: -1,793 to 1,911, versus -3,385 to 3,034 for the simple regression.  Secondly, the median of the multiple regression residuals is much closer to zero than that of the simple regression model.
Coefficients:
  (Intercept): The intercept is what the model predicts when every predictor is zero.  In the simple regression we see that the intercept is much larger, meaning there's a fair amount left over that Spend alone doesn't explain.  Multiple regression shows a negative intercept, but it's closer to zero than the simple regression output.
  Spend: Both the simple and multiple regressions show that for every dollar you spend, you should expect to get around 10 dollars in sales.
  Month: When we add in the Month variable, its coefficient is multiplied by the numeric (ordinal) value of the month.  So for every month further into the year, you add an additional $541 in sales: February adds $1,082 while December adds $6,492 in Sales (see the prediction sketch after this list).
Performance Measures:
  Residual Standard Error: The simple regression model has a much higher residual standard error, meaning its residuals have a greater variance.  A standard error of 2,313 is pretty high considering that average sales are $70,870.
  Multiple / Adjusted R-squared: The R-squared is very high in both cases.  The Adjusted R-squared takes into account the number of variables and so it's more useful for the multiple regression analysis.
  F-Statistic: The F-test is statistically significant for both models.  This means that both models have at least one variable whose weight is significantly different from zero.
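To see the Month coefficient in action, you can ask the fitted model for predictions at specific predictor values (a minimal sketch; the $5,000 spend is just an illustrative input, not a figure from the article):

#Same spend in February and December; the two predictions should differ
#by ten months' worth of the Month coefficient (roughly 10 x 541).
predict(multi.fit, newdata = data.frame(Spend = 5000, Month = c(2, 12)))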

Analyzing Residuals
Anyone can fit a linear model in R.  The real test is analyzing the residuals (the error or the difference between actual and predicted results).

There are four things we're looking for when analyzing residuals:
  The mean of the errors is zero (and the sum of the errors is zero); a quick check for this is shown right after the list.
  The distribution of the errors is normal.
  All of the errors are independent.
  The variance of the errors is constant (homoscedastic).
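The first assumption has a quick numeric check: for any least-squares fit that includes an intercept, the residuals should sum (and therefore average) to essentially zero, up to rounding error.  A minimal sketch:

mean(simple.fit$resid)  #should be ~0
sum(simple.fit$resid)   #should be ~0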

In R, you pull out the residuals by referencing the model and then the resid variable inside the model.  Using the simple linear regression model (simple.fit)
we’ll plot a few graphs to help illustrate any problems with the model.

layout(matrix(c(1,1,2,3),2,2,byrow=T))

#Spend x Residuals Plot
#(plot each residual against its own Spend value; no reordering needed)
plot(simple.fit$resid~dataset$Spend,
     main="Spend x Residuals\nfor Simple Regression",
     xlab="Marketing Spend", ylab="Residuals")
abline(h=0,lty=2)

#Histogram of Residuals
hist(simple.fit$resid, main="Histogram of Residuals",
     ylab="Residuals")

#Q-Q Plot
qqnorm(simple.fit$resid)
qqline(simple.fit$resid)

 Residuals are normally distributed


The histogram and QQ-plot are the ways to visually evaluate whether the residuals fit a normal distribution.
If the histogram looks like a bell curve, the residuals might be normally distributed.
If the QQ-plot has the vast majority of points on or very near the line, the residuals may be normally distributed.

The plots don't seem to be very close to a normal distribution, but we can also use a statistical test.
The Jarque-Bera test (in the fBasics library) checks whether the skewness and kurtosis of your residuals are similar to those of a normal distribution.

The null hypothesis of the Jarque-Bera test is that the skewness and kurtosis of your data are both equal to zero (the same as the normal distribution).

library(fBasics)
jarqueberaTest(simple.fit$resid) #Test residuals for normality
#Null Hypothesis: Skewness and Kurtosis are equal to zero
#Residuals X-squared: 0.9575 p Value: 0.6195

With a p value of 0.6195, we fail to reject the null hypothesis that the skewness and kurtosis of residuals are statistically equal to zero.

Residuals are independent


The Durbin-Watson test is used in time-series analysis to test whether each error is related to the errors that came before it, e.g. a seasonal pattern or a pattern that repeats every other data point.

Using the lmtest library, we can call the “dwtest” function on the model to check if the residuals are independent of one another.

The Null hypothesis of the Durbin-Watson test is that the errors are serially UNcorrelated.

library(lmtest) #dwtest
dwtest(simple.fit) #Test for independence of residuals
#Null Hypothesis: Errors are serially UNcorrelated
#Results: DW = 1.1347, p-value = 0.03062

Based on the results, we can reject the null hypothesis that the errors are serially uncorrelated.  This means we have more work to do.

Let’s try going through these motions for the multiple regression model.
layout(matrix(c(1,2,3,4),2,2,byrow=T))
plot(multi.fit$fitted, rstudent(multi.fit),
     main="Multi Fit Studentized Residuals",
     xlab="Predictions",ylab="Studentized Resid",
     ylim=c(-2.5,2.5))
abline(h=0, lty=2)
plot(dataset$Month, multi.fit$resid,
     main="Residuals by Month",
     xlab="Month",ylab="Residuals")
abline(h=0,lty=2)
hist(multi.fit$resid,main="Histogram of Residuals")
qqnorm(multi.fit$resid)
qqline(multi.fit$resid)

Residuals are Normally Distributed


The histogram of the residuals does not look normally distributed.
However, the QQ-plot shows only a handful of points off of the normal line.
We fail to reject the Jarque-Bera null hypothesis (p-value = 0.5059).

library(fBasics)
jarqueberaTest(multi.fit$resid) #Test residuals for normality
#Null Hypothesis: Skewness and Kurtosis are equal to zero
#Residuals X-squared: 1.3627 p Value: 0.5059

Residuals are independent


We fail to reject the Durbin-Watson test’s null hypothesis (p-value 0.3133)

library(lmtest) #dwtest
dwtest(multi.fit) #Test for independence of residuals
#Null Hypothesis: Errors are serially UNcorrelated
#Results: DW = 2.1077, p-value = 0.3133

Residuals have constant variance


Constant variance can be checked by looking at the “Studentized” residuals – normalized based on the standard deviation.  “Studentizing” lets you compare
residuals across models.

The Multi Fit Studentized Residuals plot shows that there aren’t any obvious outliers.  If a point is well beyond the other points in the plot, then you might
want to investigate.  Based on the plot above, I think we’re okay to assume the constant variance assumption.  More data would definitely help fill in some of
the gaps.
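If you want a statistical test to go along with the visual check, the lmtest library (already loaded for dwtest) also provides the Breusch-Pagan test; its null hypothesis is that the error variance is constant.  This is a supplementary check, a minimal sketch rather than part of the output above:

library(lmtest) #bptest
bptest(multi.fit) #Test residuals for constant variance
#Null Hypothesis: Error variance is constant (homoscedastic)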

Recap / Highlights
Regression is a powerful tool for predicting numerical values.
R's lm function creates a regression model.
Use the summary function to review the weights and performance measures.
The residuals can be examined by pulling out the $resid variable from your model.
You need to check your residuals against these four assumptions:
  The mean of the errors is zero (and the sum of the errors is zero).
  The distribution of the errors is normal.
  All of the errors are independent.
  The variance of the errors is constant (homoscedastic).

Here’s the full code below


dataset = read.csv("data-marketing-budget-12mo.csv", header=T,
                   colClasses = c("numeric", "numeric", "numeric"))
head(dataset,5)

#/////Simple Regression/////
simple.fit = lm(Sales~Spend,data=dataset)
summary(simple.fit)

#Loading the necessary libraries
library(lmtest) #dwtest
library(fBasics) #JarqueBeraTest

#Testing normal distribution and independence assumptions
jarqueberaTest(simple.fit$resid) #Test residuals for normality
#Null Hypothesis: Skewness and Kurtosis are equal to zero
dwtest(simple.fit) #Test for independence of residuals
#Null Hypothesis: Errors are serially UNcorrelated

#Simple Regression Residual Plots
layout(matrix(c(1,1,2,3),2,2,byrow=T))
#Spend x Residuals Plot
plot(simple.fit$resid~dataset$Spend,
     main="Spend x Residuals\nfor Simple Regression",
     xlab="Marketing Spend", ylab="Residuals")
abline(h=0,lty=2)
#Histogram of Residuals
hist(simple.fit$resid, main="Histogram of Residuals",
     ylab="Residuals")
#Q-Q Plot
qqnorm(simple.fit$resid)
qqline(simple.fit$resid)

#///////////Multiple Regression Example///////////
multi.fit = lm(Sales~Spend+Month, data=dataset)
summary(multi.fit)

#Residual Analysis for Multiple Regression
dwtest(multi.fit) #Test for independence of residuals
#Null Hypothesis: Errors are serially UNcorrelated
jarqueberaTest(multi.fit$resid) #Test residuals for normality
#Null Hypothesis: Skewness and Kurtosis are equal to zero

#Multiple Regression Residual Plots
layout(matrix(c(1,2,3,4),2,2,byrow=T))
plot(multi.fit$fitted, rstudent(multi.fit),
     main="Multi Fit Studentized Residuals",
     xlab="Predictions",ylab="Studentized Resid",
     ylim=c(-2.5,2.5))
abline(h=0, lty=2)
plot(dataset$Month, multi.fit$resid,
     main="Residuals by Month",
     xlab="Month",ylab="Residuals")
abline(h=0,lty=2)
hist(multi.fit$resid,main="Histogram of Residuals")
qqnorm(multi.fit$resid)
qqline(multi.fit$resid)

Corrections:

Thanks to Thiam Huat for the correction on coefficient interpretation.

