Straight line regression
Dr. Simon Sheather
MS Analytics Class of 2016
1
Straight line regression
• We begin this course by considering
problems involving modeling the
relationship between two variables. The
simplest meaningful model between two
variables involves a straight line.
2
Case Study: Museum Attendance
Question:
To what extent can future museum attendance be predicted from past attendance?
Is there evidence that the 2012 attendance is significantly different from the 2011
attendance?
3
Data exist on the top 100 museums worldwide, ranked by attendance in 2011.
4
5
Seventy-four museums appear on the top 100 lists in both 2011 and 2012.
The data are given in 2011_2012_Attendance.jmp
6
To what extent can future museum attendance
be predicted from past attendance?
• We begin by modeling
Y, 2012 attendance
as a linear function of
X, 2011 attendance
7
Definition of regression
• Mathematically, the regression of a random variable Y on a
random variable X is E(Y|X), the mean (or average value) of
Y for a given value of X.
• For example, if X = Day of the week and Y = Sales at a given
company, then the regression of Y on X represents the
mean (or average) sales on a given day.
• The X‐variable is called the explanatory or predictor
variable, while the Y‐variable is called the response
variable or the dependent variable.
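When X takes only a few distinct values, as in the day-of-the-week example, the regression E(Y|X) can be estimated simply by averaging Y within each value of X. The sketch below illustrates this idea; the sales figures and the name `conditional_means` are made up for illustration and do not come from the course data.

```python
from collections import defaultdict

# Hypothetical (day, sales) records; the numbers are invented for illustration.
records = [
    ("Mon", 100.0), ("Mon", 120.0),
    ("Tue", 90.0), ("Tue", 110.0), ("Tue", 100.0),
    ("Wed", 150.0),
]

def conditional_means(pairs):
    """Estimate E(Y | X = x) by averaging the Y values observed at each x."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return {x: sum(ys) / len(ys) for x, ys in groups.items()}

regression_of_sales_on_day = conditional_means(records)
for day, mean_sales in regression_of_sales_on_day.items():
    print(f"E(Sales | Day = {day}) is estimated as {mean_sales:.1f}")
```

With a continuous X, such as 2011 attendance, there are too few observations at each x to average, which is why a model such as a straight line is used instead.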
8
2012 PGA Tour Data – Estimated regression functions
for the length of tee shots on par 3’s (n=56,698) & 4’s
(n=152,977) for different yardages (red curves)
Conclusions from the plots?
The data are given in TeeShots2012.jmp
9
PGA Tour tee shots:
Using Fit Y by X in JMP to produce estimated
regression functions based on smoothing splines
1. Select Distance
(yards) then click on Y,
Response
2. Select Yardage then
click on X, Factor
3. Select Par Value then
click on By
4. Click OK
10
PGA Tour tee shots:
Using Fit Y by X in JMP to produce estimated
regression functions based on smoothing splines
1. Click the red triangle
next to Bivariate Fit of
Distance (yards)
2. Select Fit Spline
3. Select 1000000, stiff
11
PGA Tour tee shots:
Using Fit Y by X in JMP to produce estimated
regression functions based on smoothing splines
1. Click the red triangle
next to Smoothing
Spline Fit
2. Click Line Width
3. Click 3
12
PGA Tour tee shots:
Using Fit Y by X in JMP to produce estimated
regression functions
1. Click the slider next to
Change Lambda
2. Adjust the value of
Lambda until the curve
has few if any bumps
13
Straight line regression
• Mathematically, the regression of a random variable Y on a random variable X
is E(Y|X), the mean (or average value) of Y for a given value of X.
14
Which line fits the data best?
Pardoe (2012, Figure 2.4, page 41)
21
This line fits the data best in a least
squares sense
$\hat{y} = \hat{b}_0 + \hat{b}_1 x$
22
23
This line fits the data best in a least squares sense
24
Least Squares Estimates
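As a rough sketch, the least squares estimates for a straight line can be computed from the standard closed-form expressions: the slope is the ratio of the cross-product sum to the sum of squares of x, and the intercept makes the line pass through the point of means. The names `sxx` and `sxy` and the toy data are assumptions for illustration; JMP reports the same quantities through Fit Line.

```python
def least_squares_line(x, y):
    """Return (b0_hat, b1_hat) minimising the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squares of x about its mean, and sum of cross-products.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1_hat = sxy / sxx                # estimated slope
    b0_hat = y_bar - b1_hat * x_bar   # estimated intercept
    return b0_hat, b1_hat

# Hypothetical toy data, not the museum attendance data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

b0_hat, b1_hat = least_squares_line(x, y)
print(f"y_hat = {b0_hat:.2f} + {b1_hat:.2f} x")
```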
25
Least squares line
or residual
26
Case Study: Museum Attendance
How do we interpret
these probabilities?
27
The data are given in 2011_2012_Attendance.jmp
Museum Attendance:
Using Fit Y by X in JMP to obtain the least
squares straight line fit
1. Select 2012 Attendance
then click on Y
2. Select 2011 Attendance
then click on X
3. Click OK
28
Museum Attendance:
Using Fit Y by X in JMP to obtain the least
squares fit
1. Click the red triangle
next to Bivariate Fit of
2012 Attendance
2. Click Fit Line
29
Museum Attendance:
Using Fit Y by X in JMP to obtain the least
squares fit
1. Click the red triangle
next to Linear Fit
2. Click Line Width
3. Click 3
30
Case Study: Museum Attendance
Testing a hypothesis about the intercept b0
Test H0: b0 = 0 against HA: b0 ≠ 0
Test statistic:
$t = \dfrac{\hat{b}_0 - 0}{s_{\hat{b}_0}} \sim t_{n-2}$ (n = 74)
In this case,
$t = \dfrac{16406.246 - 0}{40396.16} = 0.41$
31
Pardoe (2012, page 290)
32
Case Study: Museum Attendance
Testing a hypothesis about the intercept b0
The df (degrees of freedom) equal n - 2, since we are estimating b0 and b1.
In this case,
$t = \dfrac{16406.246 - 0}{40396.16} = 0.41$
So,
p-value = 2P(T ≥ 0.41) = 0.6858
Conclusion:
The intercept b0 is not statistically significantly different from zero.
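The arithmetic behind this test can be checked outside JMP. The sketch below uses only the Python standard library and approximates the t tail probability by numerically integrating the t density with Simpson's rule; the helper names `t_density` and `t_upper_tail` are assumptions introduced here, not part of any library.

```python
import math

def t_density(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(t * t / df))

def t_upper_tail(t, df, steps=2000):
    """P(T >= t) for t >= 0: 0.5 minus the Simpson's-rule integral over [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 0.5 - total * h / 3

# Estimate, standard error and sample size as reported in the case study.
b0_hat, se_b0, n = 16406.246, 40396.16, 74

t_stat = (b0_hat - 0) / se_b0                    # test of H0: b0 = 0
p_value = 2 * t_upper_tail(abs(t_stat), n - 2)   # two-sided, df = n - 2
print(f"t = {t_stat:.2f}, two-sided p-value = {p_value:.4f}")
```

The printed values should land close to the t = 0.41 and p-value = 0.6858 reported above; small differences can arise from rounding and from the numerical integration.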
33
Case Study: Museum Attendance
Testing a hypothesis about the slope b1
Test H0: b1 = 0 against HA: b1 > 0
Test statistic:
$t = \dfrac{\hat{b}_1 - 0}{s_{\hat{b}_1}} \sim t_{n-2}$ (n = 74)
In this case,
$t = \dfrac{1.033107 - 0}{0.017647} = 58.54$
So,
p-value = P(T ≥ 58.54) < 0.0001
Conclusion:
The slope b1 is highly statistically
significantly different from zero. Thus,
there is a statistically significant
positive linear association between
attendance in 2011 and 2012.
34
Case Study: Museum Attendance
Testing a different hypothesis about the slope b1
Test H0: b1 = 1 against HA: b1 ≠ 1
Test statistic:
$t = \dfrac{\hat{b}_1 - 1}{s_{\hat{b}_1}} \sim t_{n-2}$ (n = 74)
In this case,
$t = \dfrac{1.033107 - 1}{0.017647} = 1.876$
So,
p-value = 2P(T ≥ 1.876) = 2(1 – 0.968) = 0.064
Conclusion:
The slope b1 is not statistically significantly
different from 1 at the 5% level. Thus, at
the 5% level, 1 is a plausible value for
the slope b1.
35
Case Study: Museum Attendance
95% confidence interval for the slope b1
• Recall that the slope b1 is not
statistically significantly different
from 1 at the 5% level. Notice
that the 95% confidence interval
for the slope b1 (0.998,1.068)
contains the value 1.
• Notice also that the 95%
confidence interval for the
intercept b0 contains the value 0.
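The intervals can be reproduced from the estimates and standard errors quoted in the case study, using estimate ± (t critical value) × (standard error). In this sketch the critical value 1.9935 is an assumed t-table value for the 97.5th percentile with 72 degrees of freedom, and the function name `confidence_interval` is hypothetical.

```python
# Assumed value from a t table: 97.5th percentile of t with n - 2 = 72 df.
T_CRIT_975_DF72 = 1.9935

def confidence_interval(estimate, std_error, t_crit):
    """Two-sided interval: estimate +/- t_crit * standard error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width

# Estimates and standard errors as reported in the case study.
slope_ci = confidence_interval(1.033107, 0.017647, T_CRIT_975_DF72)
intercept_ci = confidence_interval(16406.246, 40396.16, T_CRIT_975_DF72)

print(f"95% CI for b1: ({slope_ci[0]:.3f}, {slope_ci[1]:.3f})")
print(f"95% CI for b0: ({intercept_ci[0]:.0f}, {intercept_ci[1]:.0f})")
print("1 inside slope CI:", slope_ci[0] < 1 < slope_ci[1])
print("0 inside intercept CI:", intercept_ci[0] < 0 < intercept_ci[1])
```

A two-sided test at the 5% level and a 95% confidence interval always agree in this way: a hypothesised value is rejected exactly when it falls outside the interval.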
36
Case Study: Museum Attendance
Is there evidence that the 2012 attendance is
significantly different from the 2011 attendance?
37
Museum Attendance:
Using Fit Model in JMP to find CIs for the
intercept and slope
1. Select 2012 Attendance
2. Click Y
3. Select 2011 Attendance
4. Click Add
5. Click Run
38
Museum Attendance:
Using Fit Model in JMP to find CIs for the
intercept and slope
1. Click the red triangle
next to Response
2012 Attendance
2. Click Regression
Reports
3. Click Show All
Confidence Intervals
39
Case Study: Museum Attendance
95% Confidence Intervals for E(Y) at each value of X
Notice how the
95% confidence
intervals for
E(Y) at each
value of X
cover very few
of the observed
data points.
40
Case Study: Museum Attendance
95% Prediction Intervals for Y at each value of X
Notice how the
95% prediction
intervals for Y
at each value of
X cover most of
the observed
data points.
41
Museum Attendance:
Using Fit Y by X in JMP to
display CIs at each X
1. Click the red triangle
next to Linear Fit
2. Click Confid Curves Fit
3. Click Confid Shaded Fit
42
Museum Attendance:
Using Fit Y by X in JMP to
display PIs at each X
1. Click the red triangle
next to Linear Fit
2. Click Confid Curves
Indiv
3. Click Confid Shaded
Indiv
43
Case Study: Museum Attendance
44
Museum Attendance:
Using Fit Model in
JMP to save CIs and
PIs at each X
1. Click the red triangle
next to Response 2012
Attendance
2. Click Save Columns
3. Click Mean Confidence
Interval
4. Click Indiv Confidence
Interval
45
Notice that there is an extra term in the standard error part of the prediction interval
formula. This extra term accounts for the fact that individual values of Y vary around
their mean value, E(Y).
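For reference, the standard textbook forms of the two intervals at a given value $x^*$ are sketched below, with $\hat{y}^* = \hat{b}_0 + \hat{b}_1 x^*$. The symbols $x^*$ and $\bar{x}$ are notation assumed here rather than taken from the slides. The only difference between the two formulas is the additional 1 under the square root in the prediction interval.

```latex
% 95% confidence interval for E(Y) at X = x^*
\hat{y}^* \;\pm\; t_{n-2,\,0.975}\; s\,
  \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}

% 95% prediction interval for an individual Y at X = x^*
\hat{y}^* \;\pm\; t_{n-2,\,0.975}\; s\,
  \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
```

Because of the extra 1, the prediction interval does not shrink to zero width as n grows, while the confidence interval for E(Y) does.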
46
95% Confidence Intervals for E(Y) at each value of X
and 95% Prediction Intervals for Y at each value of X
47
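What is s?
s estimates the standard deviation of the random errors, e. It is equal to the standard deviation of the residuals, computed with n - 2 degrees of freedom:
$s = \sqrt{\dfrac{\mathrm{SSE}}{n - 2}}$
where
$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the sum of squared residuals.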
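As a sketch of how s can be computed by hand, the function below takes the squared residuals from a fitted line, sums them, divides by n - 2 and takes the square root. The toy data and the coefficient values (which are the least squares fit of that toy data) are assumptions for illustration only.

```python
import math

def residual_standard_deviation(x, y, b0_hat, b1_hat):
    """s = sqrt(SSE / (n - 2)) for a fitted straight line."""
    residuals = [yi - (b0_hat + b1_hat * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in residuals)   # sum of squared residuals
    return math.sqrt(sse / (len(x) - 2))  # n - 2 df: two estimated parameters

# Hypothetical toy data with its least squares fit y_hat = 0.09 + 1.97 x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

s = residual_standard_deviation(x, y, b0_hat=0.09, b1_hat=1.97)
print(f"s = {s:.4f}")
```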
48
Interpreting the standard deviation of the residuals, s: In
very large samples, approximately 95% of the observed Y‐
values lie within ± 2s of the regression line
Pardoe (2012, Figure 2.7, page 47)
49
50
Evaluating the model fit numerically
51
SSE, Error Sum of Squares, is a measure of the error with the regression model.
(SSE is sometimes also referred to as the residual sum of squares, RSS.)
TSS, Total Sum of Squares, is a measure of the total error without the regression model.
52
R², coefficient of determination
$R^2 = 1 - \dfrac{\mathrm{SSE}}{\mathrm{TSS}}$
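The coefficient of determination compares the error with the regression model (SSE) to the total error without it (TSS), as R² = 1 - SSE/TSS. The sketch below works through that arithmetic on hypothetical toy data; the fitted values come from the least squares line for that toy data, and none of the numbers relate to the museum case study.

```python
def r_squared(y, fitted):
    """R^2 = 1 - SSE / TSS."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error with the model
    tss = sum((yi - y_bar) ** 2 for yi in y)                # error without the model
    return 1 - sse / tss

# Hypothetical toy data and its least squares fit y_hat = 0.09 + 1.97 x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
fitted = [0.09 + 1.97 * xi for xi in x]

r2 = r_squared(y, fitted)
print(f"R^2 = {r2:.4f}")
```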
53
Pardoe (2012, page 51)
54
Pardoe (2012, page 52)
55
Case Study: Museum Attendance
In JMP,
s is given by “Root Mean Square Error”.
In this case, s = 225,218.4
So that ±2s = 450,437
R² = 0.979, which at first sight looks
very high.
Interpretation: 97.9% of the variation in
2012 Attendance (about its mean) can be
explained by a straight line regression
model between 2012 Attendance and
2011 Attendance.
56
Model assumptions for valid inferences
Not essential for large data sets, except for prediction intervals
57
Pictorial View of Model Assumptions
Pardoe (2012, Figure
2.13, page 60)
58
Checking Model Assumptions
59
Examples of residual plots where the assumptions hold
Pardoe (2012, Figure 2.14, page 62)
Moving across each plot from left
to right, we see that the residuals
appear to have an average value
close to zero and remain equally
variable and symmetric around the
horizontal axis. In addition, there
are no clear nonrandom patterns.
60
Examples of residual plots where the normal
distribution assumption does hold
Pardoe (2012, Figure 2.16, page 65)
61
Examples of residual plots where the zero mean
assumption does not hold
Pardoe (2012, Figure 2.15, page 63)
Moving across each plot from left
to right, we see that the residuals
appear not to have an average
value close to zero. In addition,
there are clear nonrandom
patterns. The first and second plots
show a quadratic pattern, while the
third plot shows a cubic pattern. In
each of the 3 situations what
would you do next?
62
Examples of residual plots where the constant variance
assumption does not hold
Pardoe (2012, Figure 2.15, page 63)
Moving across each plot from left
to right, we see that the residuals
appear not to remain equally
variable. The first plot shows
residuals with increasing variance
as x increases. The second plot
shows residuals with decreasing
variance as x increases. The third
plot shows that the variance of the
residuals is larger for both small x
and large x. In each of the 3
situations what would you do next?
63
Examples of residual plots where the normal
distribution assumption does not hold
Pardoe (2012, Figure 2.15, page 63)
Moving across each plot from left
to right, we see that the residuals
appear not to be
normally distributed. The first plot
shows a large positive outlier and a
large negative outlier. The second
plot shows residuals skewed
towards positive values. The third
plot shows residuals skewed
towards negative values. In each of
the 3 situations what would you do
next?
64
Examples of residual plots where the normal
distribution assumption does not hold
Pardoe (2012, Figure 2.16, page 65)
In each plot we see that the
residuals appear not to
be normally distributed. The first
plot shows a large positive outlier
and a large negative outlier. The
second plot shows residuals
skewed towards positive values.
The third plot shows residuals
skewed towards negative values. In
each of the 3 situations what
would you do next?
65
Examples of residual plots where the independence
assumption does not hold
Pardoe (2012, Figure 2.15, page 63)
Moving across each plot from left
to right, we see that the residuals
appear not to be
independent. The first plot shows
that positive residuals are generally
followed by positive residuals and
that negative residuals are
generally followed by negative
residuals. The second plot shows
that negative residuals are
generally followed by positive
residuals. The third plot shows
residuals that belong to at least
two groups. In each of the 3
situations what would you do next?
66
Case Study: Museum Attendance
Conclusions from the residual plots:
67
Museum Attendance:
Using Fit Y by X in JMP to
display residual plots
1. Click the red triangle
next to Linear Fit
2. Click Plot Residuals
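Outside JMP, the inputs to a residual plot are simply the residuals paired with the x-values (or fitted values). The sketch below fits a least squares line to deliberately curved, hypothetical data and prints the residuals in order of x, so that the kind of quadratic pattern described in the earlier residual-plot examples shows up as a run of signs (positive, negative, positive). The data and function names are assumptions for illustration.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a straight line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1_hat = sxy / sxx
    return y_bar - b1_hat * x_bar, b1_hat

# Hypothetical data that follow a curve (y = x^2), not a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]

b0_hat, b1_hat = fit_line(x, y)
residuals = [yi - (b0_hat + b1_hat * xi) for xi, yi in zip(x, y)]

# Text version of a residual plot: one row per observation, ordered by x.
for xi, ei in sorted(zip(x, residuals)):
    print(f"x = {xi:4.1f}  residual = {ei:+6.2f}")
```

The residuals from a least squares line with an intercept always sum to zero, so the check is about patterns across x, not the overall average.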
68