
Statistics for Business

STAT130
Unit 8: Correlation and
Regression Analysis
Chapter 13
Simple Linear Regression
Analysis
Introduction
 In addition to hypothesis testing and confidence
intervals, inferential statistics involves
determining whether a relationship between two
or more quantitative variables exists.
 Correlation and regression are the most commonly used
techniques for investigating the relationship between two
or more variables.
 Correlation is a statistical method used to
determine whether a relationship between
variables exists.
 Regression is a statistical method used to
describe the nature of the relationship between
variables—that is, positive or negative, linear or
nonlinear.
Relationships
 Variables
 Dependent Variable: measures an outcome of a
study and is sometimes called the response
variable.
 Independent Variable: explains or causes changes
in the response variable and is sometimes called
the explanatory or predictor variable.
 A scatter plot shows the relationship between two
variables.
 Always plot the independent variable on the
horizontal axis and the dependent variable on
the vertical axis.
Examining Relationships
 In any graph of data, look for the overall pattern
and for striking deviations from that pattern.
 You can describe the overall pattern by the:
 Form: Describe the type of trend between X and Y
(linear, quadratic, exponential).
 Direction: describes the direction of the trend
upward (positive) or downward (negative).
 Strength: Measures the amount of scatter around
the general trend.
 An important kind of deviation is an outlier, an
individual that falls outside the overall pattern of
the relationship.
Examples of Scatter Plots

Correlation Coefficient
 The coefficient of correlation, denoted ρ in the
population and r in the sample, is used to
measure the strength of the linear association
between two quantitative variables.
 The sample coefficient of correlation is

$$ r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \, \sum_{i=1}^{n}(y_i-\bar{y})^2}} = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} $$

 Correlation in MegaStat:
MegaStat → Correlation / Regression → Correlation Matrix
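For readers without MegaStat, the sample correlation can be sketched directly from the sums of squares SS_xy, SS_xx and SS_yy. The function name `pearson_r` and the toy data are assumptions for illustration only; they are not part of the course material.

```python
import math

def pearson_r(x, y):
    """Sample correlation r = SS_xy / sqrt(SS_xx * SS_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# Hypothetical toy data (not from the slides)
x = [1, 2, 3, 4, 5]
y = [10, 8, 7, 4, 1]
r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

Because r divides by both spreads, rescaling x or y (for example miles to kilometres) leaves the result unchanged.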
Properties of r
 -1 ≤ r ≤ 1
 r < 0 → negative linear association
 r > 0 → positive linear association
 r = ±1 → perfect linear relationship
 r is independent of units.
 Rule of thumb to interpret r:
 |r| close to 1 → strong linear association
 |r| close to 0.5 → moderate linear association
 |r| close to 0 → weak or no linear association

Correlation: Examples

Testing the significance of ρ
 We can test to see if the correlation is significant
using the hypotheses
 H0: ρ = 0
 Ha: ρ ≠ 0
 The test statistic is
$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$
which follows a t-distribution with n-2 degrees
of freedom.
 MegaStat produces 95% and 99% critical limits
for ρ. If r falls outside these limits then H0 is
rejected at the corresponding significance level.
Example
 A car dealer wants to find whether there is a linear
relationship between the odometer reading and
the selling price of used cars. A random sample of
100 cars is selected, and the data recorded.
 There is a strong negative linear relationship.
[Scatter plot: Price (vertical axis, 4000 to 6500) versus Odometer (horizontal axis, 10000 to 60000)]
Example
 Compute and interpret the coefficient of
correlation between odometer and price.
 r = -0.806 → strong negative correlation (linear
relationship).
 r falls outside the 99% critical limits (±0.256), which
means that there is a significant linear relationship
between the odometer reading and the price of the car
(ρ ≠ 0).
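The same conclusion can be checked with the t statistic for H0: ρ = 0, using the values reported on this slide (r = -0.806, n = 100). The function name below is an assumption for illustration; this is a sketch, not MegaStat's own routine.

```python
import math

def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Values reported in the car example: r = -0.806 from n = 100 cars
t = correlation_t_statistic(-0.806, 100)
print(f"t = {t:.2f} with {100 - 2} degrees of freedom")
```

The result is close to the slope-test statistic t = -13.495 reported later for the same data; the small gap comes from r being rounded to three decimals.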

Simple Linear Regression
 Regression is used to predict the value of one
variable (the dependent variable - y) based on
the value of other variables (independent
variables x1, x2,…xk).
 The objective of regression analysis is to build a
regression model that can be used to describe,
predict and control the dependent variable on
the basis of the independent variable.
 Examples:
 Relationship between odometer reading (X) and a
used car’s selling price (Y).
 Relationship between years of experience (X) and
the salary of an accountant (Y).
The Model
 The simple linear regression model is
y = β0 + β1x + ε
y = dependent variable
x = independent variable
μy|x = β0 + β1x = the mean value of y given x
β0 = y-intercept
β1 = slope of the line
ε = error variable
 β0 and β1 are called regression parameters
 b0 is the estimate of β0 and b1 is the estimate of β1

The Simple Linear Regression Model

Model Assumptions
1) Constant Variance Assumption
At any given value of x, the population of
potential error term values has a variance that
does not depend on the value of x
2) Normality Assumption
At any given value of x, the population of
potential error term values has a normal
distribution
3) Independence Assumption
Any one value of the error term ε is statistically
independent of any other value of ε

Estimating the coefficients
 The regression equation that estimates the
equation of the first order linear model is:
ŷ = b0 + b1x
 The estimates of the coefficients are:

$$ b_1 = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2} = \frac{SS_{xy}}{SS_{xx}} $$

$$ b_0 = \bar{y} - b_1\bar{x} $$
 Regression in MegaStat
MegaStat → Correlation / Regression → Regression Analysis
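As a rough sketch of what the software computes, the least squares estimates b1 = SS_xy / SS_xx and b0 = ȳ - b1·x̄ can be coded in a few lines. The function name `least_squares` and the toy data are assumptions for illustration, not part of the course files.

```python
def least_squares(x, y):
    """Return (b0, b1) for the fitted line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx          # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1

# Hypothetical toy data (not from the slides)
x = [1, 2, 3, 4, 5]
y = [10, 8, 7, 4, 1]
b0, b1 = least_squares(x, y)
print(f"y_hat = {b0:.2f} + ({b1:.2f}) x")
```

The fitted line always passes through the point (x̄, ȳ), which is what the formula for b0 expresses.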
Interpretation of Regression Coefficients
 The intercept, b0 is the estimated average
value of y when the value of x is zero.
 The slope, b1 is the estimated change in the
average value of y as a result of a one-unit
increase in x.

Example
 A car dealer wants to find the relationship
between the odometer reading and the selling
price of used cars.
 Find and plot the regression line.

[Scatter plot with fitted line: Price (vertical axis) versus Odometer (horizontal axis); fitted line y = -0.031x + 6,533.383, R² = 0.650]

Example
 The estimated regression equation is
ŷ = 6533 – 0.0312 x
 Interpretation of Regression Coefficients:
 The intercept is b0 = 6533
 Do not interpret the intercept as the "price of
cars that have not been driven" (Why?)
 The slope is b1 = -$0.0312
 For each additional mile on the odometer, the
price decreases by an average of $0.0312

Testing the Significance of the Slope
 A regression model is not likely to be useful
unless there is a significant relationship between
x and y.
 To test significance, we use the null hypothesis:
H0: β1 = 0 vs. Ha: β1 ≠ 0
 The test statistic is
$$ t = \frac{b_1}{s_{b_1}} \quad \text{where} \quad s_{b_1} = \frac{s}{\sqrt{SS_{xx}}} $$
which follows a t-distribution with df = n-2.
 This test is equivalent to testing whether the
correlation coefficient equals zero.
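The equivalence between the slope test and the correlation test can be illustrated numerically. The sketch below assumes s is the standard error of the estimate, sqrt(SSE / (n - 2)); the function name and toy data are hypothetical and not taken from the slides.

```python
import math

def slope_and_correlation_t(x, y):
    """Return (t for H0: beta1 = 0, t for H0: rho = 0) on the same data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    # Standard error of the estimate: s = sqrt(SSE / (n - 2))
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    t_slope = b1 / (s / math.sqrt(ss_xx))
    r = ss_xy / math.sqrt(ss_xx * ss_yy)
    t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t_slope, t_corr

# Hypothetical toy data (not from the slides)
t_slope, t_corr = slope_and_correlation_t([1, 2, 3, 4, 5], [10, 8, 7, 4, 1])
print(f"slope test t = {t_slope:.4f}, correlation test t = {t_corr:.4f}")
```

Both statistics agree up to floating-point error, so either test leads to the same decision.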
Example: Testing the Slope
 Test whether the odometer reading of a car and
its price are linearly related.
 Hypotheses:
 H0: β1 = 0 vs. Ha: β1 ≠ 0
 Test statistic: t = -13.495
 P-value ≈ 0
 Conclusion: There is overwhelming evidence to
infer that the odometer reading is linearly related to the
auction selling price, i.e. reject H0: β1 = 0

An F Test for Model
 For simple regression, this is another way to test
the null hypothesis H0: β1 = 0
 The F test tests the significance of the overall
regression relationship between x and y and is
given in the ANOVA table in regression output.
 The test statistic is the square of the test statistic
in the t-test, and the p-value is the same.
 Example:
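As a quick arithmetic check, assuming the slope-test value t = -13.495 reported for the used-car data, the ANOVA F statistic for that model should be approximately

$$ F = t^2 = (-13.495)^2 \approx 182.1 $$

with 1 and n - 2 = 98 degrees of freedom. The F value in the actual regression output may differ slightly in the last digits because t is rounded.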

The Simple Coefficient of Determination (r²)
 For the simple linear regression model:
 Total variation = Σ(yi - ȳ)²
 Explained variation = Σ(ŷi - ȳ)²
 Unexplained variation = Σ(yi - ŷi)²
 Σ(yi - ȳ)² = Σ(ŷi - ȳ)² + Σ(yi - ŷi)²
 The simple coefficient of determination is

$$ r^2 = \frac{\text{Explained variation}}{\text{Total variation}} $$

The Simple Coefficient of Determination (r²)
 The simple coefficient of determination (r²) is
the proportion of the total variation in the
dependent variable (Y) that is explained or
accounted for by the variation in the independent
variable (X).
 It is the square of the coefficient of correlation (r).
 0 ≤ r² ≤ 1.
 r² = 1: Perfect match between the line and the data.
 r² = 0: There is no linear relationship between x and y.
 It does not give any information on the direction of
the relationship between the variables.
 The larger the value of r², the better the fit is.
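The decomposition of the total variation can be verified numerically. The sketch below fits the least squares line, computes the three sums of squares, and forms r² as explained over total; the function name and toy data are assumptions for illustration only.

```python
def variation_decomposition(x, y):
    """Return (total, explained, unexplained) variation for the fitted line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    total = sum((yi - y_bar) ** 2 for yi in y)
    explained = sum((yh - y_bar) ** 2 for yh in y_hat)
    unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return total, explained, unexplained

# Hypothetical toy data (not from the slides)
total, explained, unexplained = variation_decomposition(
    [1, 2, 3, 4, 5], [10, 8, 7, 4, 1]
)
r_squared = explained / total
print(f"total = {total:.2f}, explained = {explained:.2f}, "
      f"unexplained = {unexplained:.2f}, r^2 = {r_squared:.3f}")
```

The identity total = explained + unexplained holds for the least squares line specifically; it need not hold for an arbitrary line drawn through the data.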
Example
 Find the simple coefficient of determination for
the car example; what does this statistic tell you about
the model?
 From the regression output, r² = 0.65
 65% of the variation in the auction selling
price is explained by the variation in odometer
reading. The rest (35%) remains unexplained
by this model.

Using the Regression Equation
 Before using the regression model, we need
to assess how well it fits the data.
 If we are satisfied with how well the model
fits the data, we can use it to make
predictions for y.
 Example
 Predict the selling price of a three-year-old
Taurus with 40,000 miles on the odometer.

$$ \hat{y} = 6533 - 0.0312x = 6533 - 0.0312(40{,}000) = \$5{,}285 $$

Confidence and Prediction Intervals
 There are two different intervals for the
response variable:
 Confidence interval for the mean response: What
is the mean response, μy|x, for a given value, x0,
of the predictor variable?
 Prediction interval for individual value of the
response: What would one predict a new
observation, y, to be for a given value, x0, of the
predictor variable?
 The point estimate for both inferences is the
value of ŷ for the specified value of x0.

Confidence and Prediction Intervals
 A (1-α)100% confidence interval for the mean value
of y when x = x0 is
$$ \hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum (x_i-\bar{x})^2}} $$
 A (1-α)100% prediction interval for an individual
value of y when x = x0 is
$$ \hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum (x_i-\bar{x})^2}} $$
 Here s is the standard error and t_{α/2} is based on
(n-2) degrees of freedom.
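The two intervals differ only by the extra "1 +" under the square root, which the following sketch makes explicit. The function name, the toy data and the choice x0 = 4 are assumptions for illustration; the critical value t_crit is passed in by hand (3.182 is the tabulated t value for a 95% interval with 3 degrees of freedom).

```python
import math

def interval_half_widths(x, y, x0, t_crit):
    """Return (y_hat, CI half-width, PI half-width) at x = x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))               # standard error of the estimate
    y_hat = b0 + b1 * x0                       # point estimate for both intervals
    distance = 1 / n + (x0 - x_bar) ** 2 / ss_xx
    ci = t_crit * s * math.sqrt(distance)      # mean response
    pi = t_crit * s * math.sqrt(1 + distance)  # individual value
    return y_hat, ci, pi

# Hypothetical toy data (not from the slides); n = 5, so df = 3
y_hat, ci, pi = interval_half_widths(
    [1, 2, 3, 4, 5], [10, 8, 7, 4, 1], x0=4, t_crit=3.182
)
print(f"CI: {y_hat:.2f} +/- {ci:.2f}   PI: {y_hat:.2f} +/- {pi:.2f}")
```

The prediction interval is always wider than the confidence interval at the same x0, and both widen as x0 moves away from x̄.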
Confidence and Prediction Intervals
 A prediction interval is intended to capture a new
observation of the dependent variable given
values of the independent variables, while the
confidence interval is intended to capture the mean
of the dependent variable given values of the
independent variables.
 In MegaStat:
 In the regression window, choose "Type in predictor
values" and enter the predictor values for the
independent variable.

Example: Prediction
 The car dealer wants to bid on a lot of 250 Ford
Taurus, where each car has been driven for about
40,000 miles.
 The dealer needs to estimate the mean price per
car. The 95% confidence interval is ($5252, $5322)
 Provide an interval estimate for the bidding price on
a Ford Taurus with 40,000 miles on the odometer.
 The dealer would like to predict the price of a single
car. The 95% prediction interval is ($4984, $5590)

Exercises
1) The president of a company that
manufactures car seats has been concerned
about the number and cost of machine
breakdowns. The problem is that the
machines are old and becoming quite
unreliable. However, the cost of replacing
them is quite high and the president is not
certain that the cost can be made up in
today’s slow economy. To help make a
decision about replacement, he gathered data
about last month’s costs for repairs and the
ages (in months) of the plant’s 20 welding
machines (worksheet: Repair).
Exercises
a) Find the sample regression line.
b) Interpret the coefficients.
c) Determine the coefficient of determination and
discuss what this statistic tells you.
d) Test at 5% significance level whether the age of a
machine and its repair cost are linearly related.
e) Find a 95% prediction interval for the monthly
repair cost of a welding machine that is 120
months old.
f) Find a 95% confidence interval for the average
monthly repair cost of welding machines that are
120 months old.
Exercises
2) Ten cars between 1 and 6 years old were
randomly selected from the classified ads. The
data were obtained (Worksheet: Cars), where x
denotes age, in years, and y denotes price, in
hundreds of dollars.
a) Develop a scatter plot and describe the
relationship between the price and the age of
the cars.
b) Compute the correlation coefficient.
c) Determine the regression equation for the data.
d) Interpret carefully the regression coefficients.

Exercises
e) Compute and interpret the coefficient of
determination, r².
f) Does the age of the car seem a good predictor for
its price? Test the appropriate hypothesis at
α = 0.05.
g) Obtain a point prediction for the mean price of all
4-year-old cars.
h) Obtain a 95% confidence interval for the mean
price of all 4-year old cars.
i) Obtain a point prediction for the mean price of all
12-year-old cars. Comment on the accuracy of
this prediction.

Exercises
3) A fire insurance company wants to relate the
amount of fire damage in major residential fires
to the distance between the burning house and
the nearest fire station. The study is to be
conducted in a large suburb of a major city; a
sample of 15 recent fires in the suburb is
selected. The amount of damage (in $1,000) and
the distance (in miles) between the fire and the
nearest fire station are recorded for each fire.
a) Develop a scatter plot and describe the relationship
between the distance and the damage.
b) Find and interpret the correlation coefficient.
Exercises
c) Determine the regression equation for the data.
d) Interpret carefully the regression coefficients.
e) Compute and interpret R².
f) Does the distance seem a good predictor for the
damage in the burning house? Test the
appropriate hypothesis at α = 0.05.
g) Obtain a point estimate for the mean damage of
all houses on fire that are 4 miles away from the
nearest fire station.
h) Obtain a 90% prediction interval for the damage
of a house on fire that is 4 miles away from the
nearest fire station.
Chapter 14
Multiple Regression
The Multiple Regression Model
 Simple linear regression used one independent
variable to explain the dependent variable
 Some relationships are too complex to be
described using a single independent variable
 Multiple regression uses two or more independent
variables to describe the dependent variable
 This allows multiple regression models to handle
more complex situations
 There is no limit to the number of independent
variables a model can use
 Multiple regression has only one dependent
variable (y)
The Multiple Regression Model
 The multiple linear regression model relating y to
x1, x2,…, xk is
y = β0 + β1x1 + β2x2 + … + βkxk + ε
 β0, β1, β2,…, βk are unknown parameters
 ε is an error term

 The estimated regression equation is given by


ŷ = b0 + b1x1 + b2x2 + … + bkxk
 b0, b1, b2,…, bk are the least squares point
estimates of the parameters β0, β1, β2,…, βk

Multiple Regression
 Coefficients interpretation:
 bi represents an estimate of the change in y
corresponding to a one-unit increase in xi when all
other independent variables are held constant.
 Coefficient of Multiple Determination R²
 The multiple coefficient of determination, R², is the
proportion of the total variation in the n observed
values of the dependent variable that is explained
by the multiple regression model.
 Confidence and Prediction Intervals
 Similar to simple linear regression.

The Overall F Test (Validity of the model)
 This F test is used to find out if all of the
regression coefficients, except the intercept, are
equal to zero.
 To test
H0: β1= β2 = …= βk = 0
Ha: At least one of β1, β2,…, βk ≠ 0
 The test statistic is
$$ F = \frac{(\text{Explained variation})/k}{(\text{Unexplained variation})/[n-(k+1)]} $$
which follows an F distribution with k and n-k-1
degrees of freedom.
Testing the Significance of an Independent
Variable
 To test significance of an independent variable xj,
we test H0: βj = 0 vs. Ha:βj ≠ 0
 Test Statistic
t=bj /sbj
which follows t distribution with df=n-k-1.
 Note on Significance testing:
 Whether the independent variable xj is significantly
related to y in a particular regression model is
dependent on what other independent variables are
included in the model.
 That is, changing independent variables can cause a
significant variable to become insignificant or cause an
insignificant variable to become significant
Example
 A researcher wanted to find the effect of driving
experience and the number of driving violations on
auto insurance premiums. A random sample of 12
drivers insured with the same company and having
similar auto insurance policies was selected from a
large city. The data includes the monthly auto
insurance premiums (in $) paid by these drivers,
their driving experiences (in years), and the
numbers of driving violations committed by them
during the past three years.
 Find the regression equation of monthly premiums
paid by drivers on the driving experiences and the
numbers of driving violations.
Example
 From the scatter plots we
have:
 Moderate negative linear
relationship between the
monthly premium and the
driving experience.
 Strong positive linear
relationship between the
monthly premium and the
number of driving
violations.

Example: Estimated Model
 The proposed regression model is
y = β0 + β1x1 + β2x2 + ε
 y = the monthly auto insurance premium in dollars
 x1 = the driving experience in years of a driver
 x2 = the number of driving violations committed by
a driver during the past three years
 The estimated regression model is
ŷ=110.28 – 2.75 x1 + 16.11 x2

Example: Coefficients Interpretation
 b0 = $110.28
 A driver with no driving experience and no driving violations
committed in the past three years is expected to pay an auto
insurance premium of $110.28 per month.
 That may not be true because none of the drivers in our
sample has both zero experience and zero driving violations.
 b1 = -$2.75
 A driver with one extra year of experience but the same
number of driving violations is expected to pay $2.75 less per
month for the auto insurance premium.
 b2 = $16.11
 A driver with one extra driving violation during the past three
years but with the same years of driving experience is
expected to pay $16.11 more per month for the auto
insurance premium.
Example: R²
 Find and interpret the coefficient of
determination.
 The value of R² = 0.931 tells us that the two
independent variables, years of driving experience
and number of driving violations, explain
93.1% of the variation in the auto insurance
premiums.

Example: Overall F-test
 At the 5% level, can you conclude that the number of
years of driving experience and the number of driving
violations are useful in the regression
model?
 H0: β1= β2 = 0 vs. Ha: At least one of β1, β2 ≠ 0
 F=60.88 and P-value=0.00000589
 Reject H0. The number of years of driving
experience and number of driving violations should
be retained in the model.
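The reported F value can be cross-checked from R² alone. Dividing the numerator and denominator of the F statistic by the total variation gives F = (R²/k) / ((1 - R²)/(n - k - 1)). The sketch below applies this with the values stated for this example (R² = 0.931, n = 12 drivers, k = 2 predictors); the function name is an assumption for illustration.

```python
def f_from_r_squared(r_squared: float, n: int, k: int) -> float:
    """Overall F statistic written in terms of R^2, with k and n-k-1 df."""
    return (r_squared / k) / ((1 - r_squared) / (n - k - 1))

# Values reported in the insurance example
f_stat = f_from_r_squared(0.931, n=12, k=2)
print(f"F = {f_stat:.2f} with 2 and 9 degrees of freedom")
```

The result is close to the reported F = 60.88; the small difference is expected because R² is rounded to three decimals on the slide.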

Example: Testing significance
 At 5% level, can you conclude that the number of
years of driving experience is useful in the
regression model?
 H0: β1= 0 vs. Ha: β1 ≠ 0
 t=-2.812
 P-value=0.0203
 Reject H0. The number of years of driving
experience is a useful predictor and should be
retained in the model.

Example: Prediction
 What is the predicted auto insurance premium paid
per month by a driver with 7 years of driving
experience and 3 driving violations committed in
the past three years?
 ŷ=$139.36
 Find a 99% prediction interval for the auto
insurance premium paid per month by a driver with
7 years of driving experience and 3 driving
violations committed in the past three years.
 ($97.73, $181)

Exercises
1) In an effort to explain to customers why
their electricity bills have been so high
lately, and how, specifically, they could save
money by reducing the thermostat settings
on both space heaters and water heaters, an
electric utility company has collected total
kilowatt consumption figures for last year's
winter months as well as thermostat settings
on space and water heaters for 100 homes.
The data are stored in columns 1 (consumption),
2 (space heater thermostat setting) and 3
(water heater thermostat setting).
Exercises
a) Determine the regression equation.
b) Determine the coefficient of determination and
comment about what it tells you.
c) Test the validity of the model and describe what this
test tells you.
d) Predict with 95% confidence the electricity
consumption of a house whose space heater
thermostat is set at 70 and whose water heater
thermostat is set at 130.
e) Estimate with 95% confidence the average electricity
consumption for houses whose space heater
thermostat is set at 70 and whose water heater
thermostat is set at 130.
Exercises
2) A real estate expert wanted to find the
relationship between the sale price of houses
and various characteristics of the houses. She
collected data on four variables, recorded in
(realestate worksheet), for 13 houses that were
sold recently. The four variables are
 Price= Sale price of a house in thousands of dollars
 Lot size= Size of the lot in acres
 Living area= Living area in square feet
 Age= Age of a house in years

Exercises
a) Determine the regression equation.
b) Interpret carefully the regression coefficients.
c) Test the validity of the model at 5% level.
d) Test whether the living area should be dropped from
the model at 5% level.
e) Determine the coefficient of determination and
comment about what it tells you.
f) Predict the sale price of a house that has a lot size of
2.5 acres, a living area of 3000 square feet, and is
14 years old.
g) Find a 99% prediction interval for a house that has a
lot size of 2.2 acres, a living area of 2500 square
feet, and is 7 years old.
Exercises
3) A realtor working in a large city wants to
identify the secular trend in the weekly number
of single-family houses sold by her firm. For the
past 15 weeks she has collected data on her
firm’s home sales, as shown in the table.
a) Plot the time series. Is there visual evidence of a
quadratic trend?
b) The realtor hypothesizes the model
y = β0 + β1t + β2t² + ε for the trend of the weekly time
series. Fit the model to the data. How well does
the model describe the trend?
