Topic 1: Straight line regression

Dr. Simon Sheather
MS Analytics Class of 2016

1
Straight line regression
• We begin this course by considering
problems involving modeling the
relationship between two variables. The
simplest meaningful model between two
variables involves a straight line.

2
Case Study: Museum Attendance

Question: 
To what extent can future museum attendance be predicted from past attendance? 
Is there evidence that the 2012 attendance is significantly different from the 2011 
attendance?

3
Data exist on top 100 museums worldwide in terms of attendance in 2011.

4
5
74 museums appear on the top 100 lists in both 2011 and 2012.

The data are given in 2011_2012_Attendance.jmp 

6
To what extent can future museum attendance
be predicted from past attendance?
• We begin by modeling
Y, 2012 attendance
as a linear function of
X, 2011 attendance
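
Written out, the straight line model takes the following form. This is a sketch in the b0, b1 notation used later in these slides; the error term e is the random error that the later slide on s refers to.

```latex
Y = b_0 + b_1 X + e, \qquad E(Y \mid X = x) = b_0 + b_1 x
```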

7
Definition of regression
• Mathematically, the regression of a random variable Y on a 
random variable X is E(Y|X), the mean (or average value) of 
Y for a given value of X. 
• For example, if X = Day of the week and Y = Sales at a given 
company, then the regression of Y on X represents the 
mean (or average) sales on a given day.
• The X‐variable is called the explanatory or predictor 
variable, while the Y‐variable is called the response 
variable or the dependent variable.
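
To make the day-of-the-week example concrete, here is a minimal Python sketch that estimates E(Y|X) by averaging Y within each value of X. The sales figures and the helper name `conditional_means` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (day of week, sales) records; not data from the course.
records = [
    ("Mon", 100.0), ("Mon", 120.0),
    ("Tue", 90.0), ("Tue", 110.0), ("Tue", 100.0),
    ("Wed", 150.0),
]

def conditional_means(pairs):
    """Estimate E(Y | X = x) by the average of Y within each value of X."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return {x: mean(ys) for x, ys in groups.items()}

regression_of_sales_on_day = conditional_means(records)
print(regression_of_sales_on_day)
```

With a categorical X the regression is just a table of group means. With a continuous X such as yardage or attendance, a smooth curve or a straight line plays the same role.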

8
2012 PGA Tour Data – Estimated regression functions
for the length of tee shots on par 3’s (n=56,698) & 4’s
(n=152,977) for different yardages (red curves)

Conclusions from the plots?
The data are given in TeeShots2012.jmp 

9
PGA Tour tee shots:
Using Fit Y by X in JMP to produce estimated
regression functions based on smoothing splines
1. Select Distance
(yards) then click on Y,
Response
2. Select Yardage then
click on X, Factor
3. Select Par Value then
click on By
4. Click OK

10
PGA Tour tee shots:
Using Fit Y by X in JMP to produce estimated
regression functions based on smoothing splines
1. Click the red triangle
next to Bivariate Fit of
Distance (yards)
2. Select Fit Spline
3. Select 1000000, stiff

11
PGA Tour tee shots:
Using Fit Y by X in JMP to produce estimated
regression functions based on smoothing splines
1. Click the red triangle
next to Smoothing
Spline Fit
2. Click Line Width
3. Click 3

12
PGA Tour tee shots:
Using Fit Y by X in JMP to produce estimated
regression functions
1. Click the slider next to
Change Lambda
2. Adjust the value of
Lambda until the curve
has few if any bumps

13
Straight line regression
• Mathematically, the regression of a random variable Y on a random variable X 
is E(Y|X), the mean (or average value) of Y for a given value of X. 

14
Which line fits the data best?

Pardoe (2012, Figure 2.4, page 41) 

21
This line fits the data best in a least
squares sense

$$\hat{y} = \hat{b}_0 + \hat{b}_1 x$$

22
23
This line fits the data best in a least squares sense

24
Least Squares Estimates
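
As an illustration, the sketch below computes the least squares estimates with the standard closed-form expressions (slope = Sxy / Sxx, intercept = ybar - slope * xbar) and checks that nudging either estimate increases the sum of squared residuals. The data and the function names are hypothetical; JMP performs the equivalent calculation when you click Fit Line.

```python
def least_squares(x, y):
    """Return (b0_hat, b1_hat) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1_hat = sxy / sxx
    b0_hat = y_bar - b1_hat * x_bar
    return b0_hat, b1_hat

def sum_squared_residuals(x, y, b0, b1):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]

b0_hat, b1_hat = least_squares(x, y)
best = sum_squared_residuals(x, y, b0_hat, b1_hat)
shifted = sum_squared_residuals(x, y, b0_hat + 0.1, b1_hat)
tilted = sum_squared_residuals(x, y, b0_hat, b1_hat + 0.1)
print(b0_hat, b1_hat, best, shifted, tilted)
```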

25
[Figure: the least squares line, with a residual marked]

26
Case Study: Museum Attendance

How do we interpret 
these probabilities?

27
The data are given in 2011_2012_Attendance.jmp 
Museum Attendance:
Using Fit Y by X in JMP to obtain the least
squares straight line fit
1. Select 2012 Attendance
then click on Y
2. Select 2011 Attendance
then click on X
3. Click OK

28
Museum Attendance:
Using Fit Y by X in JMP to obtain the least
squares fit
1. Click the red triangle
next to Bivariate Fit of
2012 Attendance
2. Click Fit Line

29
Museum Attendance:
Using Fit Y by X in JMP to obtain the least
squares fit
1. Click the red triangle
next to Linear Fit
2. Click Line Width
3. Click 3

30
Case Study: Museum Attendance
Testing a hypothesis about the intercept b0
Test H0: b0 = 0 against HA: b0 ≠ 0
Test statistic:
$$t = \frac{\hat{b}_0 - 0}{s_{\hat{b}_0}} \sim t_{n-2} \quad (n = 74)$$

In this case,
$$t = \frac{16406.246 - 0}{40396.16} = 0.41$$

31
Pardoe (2012, page 290) 

32
Case Study: Museum Attendance
Testing a hypothesis about the intercept b0

Test H0: b0 = 0 against HA: b0 ≠ 0

Test statistic:
$$t = \frac{\hat{b}_0 - 0}{s_{\hat{b}_0}} \sim t_{n-2} \quad (n = 74)$$
The df (degrees of freedom) equal n-2, since we are estimating b0 and b1.

In this case,
$$t = \frac{16406.246 - 0}{40396.16} = 0.41$$
So,
p‐value = 2P(T≥0.41)=0.6858
Conclusion:
The intercept b0 is not statistically 
significantly different from zero.

33
Case Study: Museum Attendance
Testing a hypothesis about the slope b1
Test H0: b1 = 0 against HA: b1 > 0
Test statistic:
$$t = \frac{\hat{b}_1 - 0}{s_{\hat{b}_1}} \sim t_{n-2} \quad (n = 74)$$

In this case,
$$t = \frac{1.033107 - 0}{0.017647} = 58.54$$
So,
p-value = P(T ≥ 58.54) < 0.0001

Conclusion:
The slope b1 is highly statistically 
significantly different from zero. Thus, 
there is a statistically significant 
positive linear association between 
attendance in 2011 and 2012.

34
Case Study: Museum Attendance
Testing a different hypothesis about the slope b1
Test H0: b1 = 1 against HA: b1 ≠ 1
Test statistic:
$$t = \frac{\hat{b}_1 - 1}{s_{\hat{b}_1}} \sim t_{n-2} \quad (n = 74)$$

In this case,
$$t = \frac{1.033107 - 1}{0.017647} = 1.876$$
So,
p-value = 2P(T ≥ 1.876) = 2(1 – 0.968) = 0.064

Conclusion:
The slope b1 is not statistically significantly 
different from 1 at the 5% level. Thus, at 
the 5% level, 1 is a plausible value for 
the slope b1.
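
The three tests above can be reproduced outside JMP. The following is a minimal sketch using only the Python standard library: the estimates and standard errors are the values quoted on the slides, while the helper names and the Simpson's-rule tail-area calculation are assumptions made for illustration (a statistics package would normally supply the t distribution).

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 via Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    area = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 0.5 - area * h / 3)

def t_statistic(estimate, hypothesized, std_error):
    """(estimate - hypothesized value) / standard error."""
    return (estimate - hypothesized) / std_error

df = 74 - 2  # n - 2, since b0 and b1 are both estimated

t_b0 = t_statistic(16406.246, 0.0, 40396.16)      # H0: b0 = 0
t_b1_zero = t_statistic(1.033107, 0.0, 0.017647)  # H0: b1 = 0
t_b1_one = t_statistic(1.033107, 1.0, 0.017647)   # H0: b1 = 1

p_b0 = 2 * upper_tail(abs(t_b0), df)          # two-sided
p_b1_zero = upper_tail(t_b1_zero, df)         # one-sided, HA: b1 > 0
p_b1_one = 2 * upper_tail(abs(t_b1_one), df)  # two-sided

print(round(t_b0, 2), round(p_b0, 4))
print(round(t_b1_zero, 2), p_b1_zero)
print(round(t_b1_one, 3), round(p_b1_one, 3))
```

Only the hypothesized value and the choice of one or two tails change between the three tests; the form of the test statistic stays the same.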

35
Case Study: Museum Attendance
95% confidence interval for the slope b1

• Recall that the slope b1 is not 
statistically significantly different 
from 1 at the 5% level. Notice 
that the 95% confidence interval 
for the slope b1 (0.998,1.068) 
contains the value 1.
• Notice also that the 95% 
confidence interval for the 
intercept b0 contains the value 0.
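
A 95% confidence interval of this kind has the form estimate ± t* × standard error. The sketch below recomputes the intervals from the estimates and standard errors quoted on the earlier slides. The critical value 1.9935 is an assumption taken from standard t tables for 72 degrees of freedom, and the function name is hypothetical.

```python
T_CRIT_72_DF = 1.9935  # assumed tabulated 97.5th percentile of t with 72 df

def confidence_interval(estimate, std_error, t_crit=T_CRIT_72_DF):
    """Return (lower, upper) = estimate -/+ t_crit * std_error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width

slope_ci = confidence_interval(1.033107, 0.017647)
intercept_ci = confidence_interval(16406.246, 40396.16)

print([round(v, 3) for v in slope_ci])
print([round(v, 1) for v in intercept_ci])
```

The slope interval brackets 1 and the intercept interval brackets 0, which matches the two-sided test conclusions at the 5% level.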

36
Case Study: Museum Attendance
Is there evidence that the 2012 attendance is
significantly different from the 2011 attendance?

• Notice that the 95% confidence interval for the slope b1 contains the value 1.
• Notice also that the 95% confidence interval for the intercept b0 contains the value 0.
• Thus there is not strong evidence that the 2012 attendance is significantly different from the 2011 attendance.

37
Museum Attendance:
Using Fit Model in JMP to find CIs for the
intercept and slope
1. Select 2012 Attendance
2. Click Y
3. Select 2011 Attendance
4. Click Add
5. Click Run

38
Museum Attendance:
Using Fit Model in JMP to find CIs for the
intercept and slope
1. Click the red triangle
next to Response
2012 Attendance
2. Click Regression
Reports
3. Click Show All
Confidence Intervals

39
Case Study: Museum Attendance
95% Confidence Intervals for E(Y) at each value of X

Notice how the 95% confidence intervals for E(Y) at each value of X
cover very few of the observed data points.

40
Case Study: Museum Attendance
95% Prediction Intervals for Y at each value of X

Notice how the 
95% prediction 
intervals for Y
at each value of 
X cover most of 
the observed 
data points.

41
Museum Attendance:
Using Fit Y by X in JMP to
display CIs at each X
1. Click the red triangle
next to Linear Fit
2. Click Confid Curves Fit
3. Click Confid Shaded Fit

42
Museum Attendance:
Using Fit Y by X in JMP to
display PIs at each X
1. Click the red triangle
next to Linear Fit
2. Click Confid Curves
Indiv
3. Click Confid Shaded
Indiv

43
Case Study: Museum Attendance

44
Museum Attendance:
Using Fit Model in
JMP to save CIs and
PIs at each X
1. Click the red triangle
next to Response 2012
Attendance
2. Click Save Columns
3. Click Mean Confidence
Interval
4. Click Indiv Confidence
Interval

45
Notice that there is an extra term in the standard error part of the prediction interval 
formula. This extra term accounts for the fact that individual values of Y vary around 
their mean value, E(Y).
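
For reference, the standard textbook forms of the two intervals at a new value x_p are sketched below. The notation (x_p, x-bar, and t* for the 97.5th percentile of the t distribution with n-2 degrees of freedom) is an assumption, since the slide's own formulas did not survive extraction. The extra "1 +" under the square root is the extra term referred to above.

```latex
% 95% confidence interval for E(Y) at X = x_p
\hat{y}_p \pm t^{*}_{n-2}\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}

% 95% prediction interval for an individual Y at X = x_p
\hat{y}_p \pm t^{*}_{n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
```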

46
95% Confidence Intervals for E(Y) at each value of X
and 95% Prediction Intervals for Y at each value of X

47
What is s?
s estimates the standard deviation of the random
errors, e. It is equal to the standard deviation of the
residuals:

$$s = \sqrt{\frac{SSE}{n-2}}$$

where
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

48
Interpreting the standard deviation of the residuals, s: In
very large samples, approximately 95% of the observed Y‐
values lie within ± 2s of the regression line

Pardoe (2012, Figure 2.7, page 47)

49
50
Evaluating the model fit numerically

Note that Pardoe (2012) refers to s as the “regression standard error”. We shall typically 
refer to s as the standard deviation of the residuals.

51
SSE, Error Sum of Squares, is a measure of the error with the regression model.
(SSE is sometimes also referred to as the residual sum of squares, RSS.)

TSS, Total Sum of Squares, is a measure of the total error without the regression model.

52
R2, coefficient of determination

$$R^2 = 1 - \frac{SSE}{TSS}$$
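
To see how SSE, TSS, R2 and s fit together, here is a minimal Python sketch on hypothetical toy data. It assumes the standard definitions R2 = 1 - SSE/TSS and s = sqrt(SSE / (n - 2)); the function name and the data are illustrative only.

```python
import math

def fit_summary(x, y):
    """Fit a least squares line and return (SSE, TSS, R2, s)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    # SSE: squared error remaining with the regression model.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # TSS: squared error about the mean, without the regression model.
    tss = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - sse / tss
    s = math.sqrt(sse / (n - 2))  # n - 2 df: two estimated coefficients
    return sse, tss, r_squared, s

# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

sse, tss, r_squared, s = fit_summary(x, y)
print(sse, tss, r_squared, s)
```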

53
Pardoe (2012, page 51)

54
Pardoe (2012, page 52)

55
Case Study: Museum Attendance

In JMP,
s is given by “Root Mean Square Error”. 
In this case, s = 225,218.4,
so that ±2s ≈ ±450,437.

R2 = 0.979, which looks at first sight to be 
very high. 
Interpretation: 97.9% of the variation in 
2012 Attendance (about its mean) can be 
explained by a straight line regression 
model between 2012 Attendance and 
2011 Attendance.
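
As a rough illustration of what these numbers mean for prediction, the sketch below plugs a hypothetical 2011 attendance of 1,000,000 into the fitted line (intercept 16406.246 and slope 1.033107 from the earlier slides) and attaches a crude ±2s band. The input value and the function name are assumptions, and the ±2s band is only an approximation: a proper 95% prediction interval is somewhat wider.

```python
B0_HAT = 16406.246   # estimated intercept, from the slides
B1_HAT = 1.033107    # estimated slope, from the slides
S = 225218.4         # Root Mean Square Error reported by JMP

def predict_2012(attendance_2011):
    """Fitted 2012 attendance for a given 2011 attendance."""
    return B0_HAT + B1_HAT * attendance_2011

x_new = 1_000_000  # hypothetical museum, for illustration only
y_hat = predict_2012(x_new)
rough_band = (y_hat - 2 * S, y_hat + 2 * S)

print(round(y_hat, 1), [round(v, 1) for v in rough_band])
```

Even with R2 = 0.979, the band is roughly 450,000 visitors either side of the fitted value, which is why the slide cautions that R2 only looks very high at first sight.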

56
Model assumptions for valid inferences

The normality assumption is not essential for large data sets, except for prediction intervals.

57
Pictorial View of Model Assumptions

Pardoe (2012, Figure 
2.13, page 60)

58
Checking Model Assumptions

59
Examples of residual plots where the assumptions hold

Pardoe (2012, Figure 2.14, page 62)
Moving across each plot from left 
to right, we see that the residuals 
appear to have an average value 
close to zero and remain equally 
variable and symmetric around the 
horizontal axis. In addition, there 
are no clear nonrandom patterns. 

60
Examples of residual plots where the normal
distribution assumption does hold

Pardoe (2012, Figure 2.16, page 65)

61
Examples of residual plots where the zero mean
assumption does not hold

Pardoe (2012, Figure 2.15, page 63)
Moving across each plot from left 
to right, we see that the residuals 
appear not to have an average 
value close to zero. In addition, 
there are clear nonrandom 
patterns. The first and second plots 
show a quadratic pattern, while the 
third plot shows a cubic pattern. In 
each of the 3 situations what 
would you do next?

62
Examples of residual plots where the constant variance
assumption does not hold

Pardoe (2012, Figure 2.15, page 63)
Moving across each plot from left 
to right, we see that the residuals 
appear not to remain equally 
variable. The first plot shows 
residuals with increasing variance 
as x increases. The second plot 
shows residuals with decreasing 
variance as x increases. The third 
plot shows that the variance of the 
residuals is larger for both small x
and large x. In each of the 3 
situations what would you do next?

63
Examples of residual plots where the normal
distribution assumption does not hold

Pardoe (2012, Figure 2.15, page 63)
Moving across each plot from left 
to right, we see that the residuals 
appear not to be 
normally distributed. The first plot 
shows a large positive outlier and a 
large negative outlier. The second 
plot shows residuals skewed 
towards positive values. The third 
plot shows residuals skewed 
towards negative values. In each of 
the 3 situations what would you do 
next?

64
Examples of residual plots where the normal
distribution assumption does not hold

Pardoe (2012, Figure 2.16, page 65)
In each plot we see that the 
residuals appear not to 
be normally distributed. The first 
plot shows a large positive outlier 
and a large negative outlier. The 
second plot shows residuals 
skewed towards positive values. 
The third plot shows residuals 
skewed towards negative values. In 
each of the 3 situations what 
would you do next?

65
Examples of residual plots where the independence
assumption does not hold

Pardoe (2012, Figure 2.15, page 63)
Moving across each plot from left 
to right, we see that the residuals 
appear not to be 
independent. The first plot shows 
that positive residuals are generally 
followed by positive residuals and 
that negative residuals are 
generally followed by negative 
residuals. The second plot shows 
that negative residuals are 
generally followed by positive 
residuals. The third plot shows 
residuals that belong to at least 
two groups. In each of the 3 
situations what would you do next?
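
One simple numerical companion to these plots is the lag-1 autocorrelation of the residuals taken in their observed order. This is a sketch added for illustration, not a method from the slides: values well above zero correspond to the first pattern (like signs follow like signs), and values well below zero correspond to the second (signs alternate). The residual sequences here are hypothetical.

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and the one that follows it."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [e - mean for e in residuals]
    numerator = sum(centered[i] * centered[i + 1] for i in range(n - 1))
    denominator = sum(c * c for c in centered)
    return numerator / denominator

# Hypothetical residual sequences, for illustration only.
runs_of_same_sign = [1.0, 2.0, 1.5, 0.5, -1.0, -2.0, -1.5, -0.5]
alternating_signs = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

r_runs = lag1_autocorrelation(runs_of_same_sign)
r_alternating = lag1_autocorrelation(alternating_signs)
print(round(r_runs, 3), round(r_alternating, 3))
```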

66
Case Study: Museum Attendance

Conclusions from the residual plots:

67
Museum Attendance:
Using Fit Y by X in JMP to
display residual plots
1. Click the red triangle
next to Linear Fit
2. Click Plot Residuals

68
