QTM Regression Analysis Ch4 RSH

How to perform Regression analysis
Nadia Z Khan NUST Business School

Friday, May 25, 12
regression analysis
A very valuable tool for today's manager. Regression analysis is used to:
Understand the relationship between variables.
Predict the value of one variable based on another variable.
A regression model has:
a dependent, or response, variable (Y axis)
an independent, or predictor, variable (X axis)
regression analysis
Triple A Construction Company renovates old homes in Albany. The company has found that its dollar volume of renovation work is dependent on the Albany area payroll.

Triple A Sales ($100,000's)    Local Payroll ($100,000,000's)
6                              3
8                              4
9                              6
5                              4
4.5                            2
9.5                            5

Scatter plot: Triple A Sales ($100,000's) on the Y axis against Local Payroll ($100,000,000's) on the X axis.
regression analysis model

Regression: Understand & Predict
Create a scatter plot, then perform regression analysis.

$Y = \beta_0 + \beta_1 X + \epsilon$
Where,
Y = dependent variable (response)
X = independent variable (predictor)
$\beta_0$ = intercept (value of Y when X = 0)
$\beta_1$ = slope
$\epsilon$ = some random error that cannot be predicted

Sample data are used to estimate the true values for the intercept and slope.
$\hat{Y} = b_0 + b_1 X$
Where, $\hat{Y}$ = predicted value of Y
The difference between the actual value of Y and the predicted value (using sample data) is known as the error.
Error = (actual value) - (predicted value)
$e = Y - \hat{Y}$

Sales (Y)    Payroll (X)    (X - X̄)^2    (X - X̄)(Y - Ȳ)
6            3              1            1
8            4              0            0
9            6              4            4
5            4              0            0
4.5          2              4            5
9.5          5              1            2.5
Summations for each column:
42           24             10           12.5

$\bar{Y} = 42/6 = 7$
$\bar{X} = 24/6 = 4$

Calculating the required parameters:
$b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{12.5}{10} = 1.25$
$b_0 = \bar{Y} - b_1 \bar{X} = 7 - (1.25)(4) = 2$
So,
$\hat{Y} = 2 + 1.25X$
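The slope and intercept calculation can be checked with a few lines of Python. This is a minimal sketch using only the six Triple A observations from the slides; the function name `fit_line` is illustrative, not from the original deck.

```python
# Triple A Construction sample from the slides.
payroll = [3, 4, 6, 4, 2, 5]      # X, local payroll in $100,000,000's
sales = [6, 8, 9, 5, 4.5, 9.5]    # Y, Triple A sales in $100,000's


def fit_line(x, y):
    """Return (b0, b1) of the least-squares line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squared X deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(payroll, sales)
print(b0, b1)  # 2.0 1.25
```

The result matches the hand calculation: an intercept of 2 and a slope of 1.25.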
Measuring the Fit of the linear Regression Model

To understand how well X predicts Y, we evaluate the variability in the Y variable:
  SSR: regression variability that is explained by the relationship between X and Y
+ SSE: unexplained variability, due to factors other than the regression
-----------------------------------
= SST: total variability about the mean
Correlation Coefficient: r, the strength of the relationship between the Y and X variables
Standard Error: standard deviation of the error around the regression line
Residual Analysis: validation of the model
Coefficient of Determination: R Sq, the proportion of explained variation
Test for Linearity: significance of the regression model, i.e. the linear regression model
Variability
Figure: scatter plot of sales against Local Payroll ($100,000,000's) with the regression line y = 1.25x + 2 (R² = 0.6944) and the mean line Ȳ, marking SST, SSE, and SSR (explained variability).
Variability
Errors (deviations) may be positive or negative. Summing the errors would be misleading, thus we square the terms prior to summing.
Sum of Squares Total (SST) measures the total variability in Y.
$SST = \sum (Y - \bar{Y})^2$
Sum of the Squared Error (SSE) is less than the SST because the regression line reduced the variability.
$SSE = \sum e^2 = \sum (Y - \hat{Y})^2$
Sum of Squares due to Regression (SSR) indicates how much of the total variability is explained by the regression model.
$SSR = \sum (\hat{Y} - \bar{Y})^2$

For Triple A Construction:
$SST = \sum (Y - \bar{Y})^2 = 22.5$
$SSE = \sum e^2 = \sum (Y - \hat{Y})^2 = 6.875$
$SSR = \sum (\hat{Y} - \bar{Y})^2 = 15.625$

Note: SST = SSR + SSE, where SSR is the explained variability and SSE is the unexplained variability.
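The three sums of squares can be reproduced directly from the data. The sketch below assumes the fitted line from the slides (intercept 2, slope 1.25); the variable names are illustrative.

```python
# Triple A Construction sample and the fitted line from the slides.
payroll = [3, 4, 6, 4, 2, 5]
sales = [6, 8, 9, 5, 4.5, 9.5]
b0, b1 = 2.0, 1.25

y_bar = sum(sales) / len(sales)
predicted = [b0 + b1 * x for x in payroll]

# Total, unexplained, and explained variability.
sst = sum((y - y_bar) ** 2 for y in sales)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predicted))
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)

print(sst, sse, ssr)  # 22.5 6.875 15.625
```

Adding the last two values gives the first, which is the identity SST = SSR + SSE.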
Coefficient of Determination
The coefficient of determination ($r^2$) is the proportion of the variability in Y that is explained by the regression equation.
$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
For Triple A Construction: $r^2 = \frac{15.625}{22.5} = 0.6944$
SST, SSR and SSE by themselves provide little direct interpretation. $r^2$ measures the usefulness of the regression.
69% of the variability in sales is explained by the regression based on payroll.
Note: $0 \le r^2 \le 1$
The correlation coefficient (r) measures the strength of the linear relationship.
$r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}}$
Figure: possible scatter diagrams for values of r.
Shown as Multiple R in the Excel output.
For Triple A Construction, r = 0.8333
Note: $-1 \le r \le 1$
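As a rough check, the correlation coefficient can be computed from the raw sums and compared with the coefficient of determination. This is a sketch under the assumption that the standard computational formula for r is the one intended on the slide; `correlation` is an illustrative name.

```python
import math

payroll = [3, 4, 6, 4, 2, 5]
sales = [6, 8, 9, 5, 4.5, 9.5]


def correlation(x, y):
    """Correlation coefficient r computed from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = correlation(payroll, sales)
print(round(r, 4), round(r ** 2, 4))  # 0.8333 0.6944
```

In simple linear regression, squaring r gives the coefficient of determination, so 0.8333 squared reproduces the 0.6944 reported for Triple A.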

Standard error
The mean squared error (MSE) is the estimate of the error variance of the regression equation.
$s^2 = MSE = \frac{SSE}{n - k - 1}$
Where,
n = number of observations in the sample
k = number of independent variables
The standard error of the estimate, $s = \sqrt{MSE}$, is the standard deviation of the errors around the regression line, just as the standard deviation measures variation around the mean. It is in the same units as Y.
For Triple A Construction, $s^2 = 6.875 / 4 = 1.7188$ and $s = 1.31$, which means a prediction error of about ±1.31 × 100,000 USD in sales.
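The arithmetic behind the standard error is short enough to verify directly. A minimal sketch, assuming SSE = 6.875 from the Triple A example with n = 6 observations and k = 1 independent variable:

```python
import math

sse = 6.875   # sum of squared errors for Triple A
n = 6         # number of observations
k = 1         # number of independent variables

mse = sse / (n - k - 1)   # estimate of the error variance
s = math.sqrt(mse)        # standard error of the estimate, same units as Y

print(mse, round(s, 2))  # 1.71875 1.31
```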
Test for linearity

An F-test is used to statistically test the null hypothesis that there is no linear relationship between the X and Y variables (i.e. $\beta_1 = 0$).
If the significance level for the F test is low, we reject Ho and conclude there is a linear relationship.
If p < alpha, reject the null hypothesis that there is no linear relationship between X & Y.
The p value is the significance level of the test; alpha = level of significance = 1 - confidence level.

$F = \frac{MSR}{MSE}$
where, $MSR = \frac{SSR}{k}$

For Triple A Construction:
$MSR = \frac{15.625}{1} = 15.625$
$F = \frac{15.625}{1.7188} = 9.0909$
The significance level for F = 9.0909 is 0.0394, indicating we reject Ho and conclude a linear relationship exists between sales and payroll.
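The F statistic for Triple A can be recomputed from the sums of squares. This sketch stops at the statistic itself; obtaining the 0.0394 significance level needs an F distribution, for example `scipy.stats.f.sf(f_stat, k, n - k - 1)` if SciPy is available (an assumption, it is not used below).

```python
ssr = 15.625  # explained variability for Triple A
sse = 6.875   # unexplained variability for Triple A
n = 6         # number of observations
k = 1         # number of independent variables

msr = ssr / k              # mean square due to regression
mse = sse / (n - k - 1)    # mean squared error
f_stat = msr / mse

print(round(f_stat, 4))  # 9.0909
```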
Computer Software for Regression

In Excel, use Tools/ Data Analysis. This is an add-in option.
Multiple R is the correlation coefficient.
Standard Error is the standard deviation of the errors around the regression line, just as the standard deviation measures variation around the mean. It is in the same units as Y; for Triple A Construction it means a prediction error of about ±1.3 × 100,000 USD in sales.
The Adjusted R Square takes into account the number of independent variables in the model.
In the ANOVA table, a p value < alpha (0.05 or 0.1) means the linear relationship between X and Y is statistically significant.

Residual Analysis: to verify that the regression assumptions are correct
Assumptions of the Regression Model

We make certain assumptions about the errors in a regression model which allow for statistical testing.
Assumptions:
Errors are independent.
Errors are normally distributed.
Errors have a mean of zero.
Errors have a constant variance.
A plot of the errors (actual value minus predicted value of Y), also called residuals in Excel, may highlight problems with the model.
PITFALLS:
Prediction beyond the range of X values in the sample can be misleading, including interpretation of the intercept (X = 0).
A linear regression model may not be the best model, even in the presence of a significant F test.
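The residuals that these checks rely on are simply actual minus predicted values. A minimal sketch for the Triple A data, assuming the fitted line with intercept 2 and slope 1.25 from the slides:

```python
payroll = [3, 4, 6, 4, 2, 5]
sales = [6, 8, 9, 5, 4.5, 9.5]
b0, b1 = 2.0, 1.25   # fitted line from the slides

# Residual = actual value minus predicted value of Y.
residuals = [y - (b0 + b1 * x) for x, y in zip(payroll, sales)]
mean_residual = sum(residuals) / len(residuals)

print(residuals)      # [0.25, 1.0, -0.5, -2.0, 0.0, 1.25]
print(mean_residual)  # 0.0
```

These residuals, paired with their X values, are what would be plotted to look for non-random patterns.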
Constant variance
Assumption: errors have constant variance.
Plot the residuals against the X values (here for Triple A Construction). The pattern should be random!
Figure: residual plot against X showing non-constant variation in the errors, a violation of the assumption.
Normal distribution
Histogram of Residuals - Should look like a bell curve
With just 6 observations it is not possible to see the bell curve; more observations are needed.
zero mean
Errors have zero Mean
independent errors
Example: The manager of a package delivery store wants to predict weekly sales based on the number of customers making purchases for a period of 100 days. Data is collected over a period of time, so check for an autocorrelation (pattern) effect.
If samples are collected over a period of time and not at the same time, then plot the residuals w.r.t. time to see if any pattern (autocorrelation) exists.
If there is substantial autocorrelation, the validity of the regression model becomes doubtful.
Autocorrelation can also be checked using the Durbin-Watson statistic.
Figure: residuals plotted against time showing a cyclical pattern, a violation.
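The Durbin-Watson statistic mentioned above is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below uses two hypothetical residual series (not from the slides) purely to show the two extremes; values near 2 suggest little autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals ordered in time."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual series, for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0]   # sign flips every period
persistent = [1.0, 1.0, 1.0, 1.0]      # same error every period

print(durbin_watson(alternating))  # 3.0
print(durbin_watson(persistent))   # 0.0
```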
Residual analysis for validating assumptions

Figure: residual plot showing a nonlinear pattern, a violation.
multiple regression
Multiple regression models are similar to simple linear regression models except they include more than one X variable.
Wilson Realty wants to develop a model to determine the suggested listing price for a house based on size and age.
Price    Sq. Feet    Age    Condition
35000    1926        30     Good
47000    2069        40     Excellent
49900    1720        30     Excellent
55000    1396        15     Good
58900    1706        32     Mint
60000    1847        38     Mint
67000    1950        27     Mint
70000    2323        30     Excellent
78500    2285        26     Mint
79000    3752        35     Good
87500    2300        18     Good
93000    2525        17     Good
95000    3800        40     Excellent
97000    1740        12     Mint
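$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$
Where, $b_0$ = intercept, $b_1, \dots, b_k$ = slopes, $X_1, \dots, X_k$ = independent variables, and k = number of independent variables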
multiple regression
67% of the variation in sales price is explained by size and age.
Ho: no linear relationship is rejected. Ho: $\beta_1 = 0$ is rejected. Ho: $\beta_2 = 0$ is rejected.
Wilson Realty has found a linear relationship between price and size and age. The coefficient for size indicates each additional square foot increases the value by $21.91, while each additional year in age decreases the value by $1449.34.
Y = 60815.45 + 21.91(size) - 1449.34(age)
For a 1900 square foot house that is 10 years old, the following prediction can be made:
$87,951 = 60815.45 + 21.91(1900) - 1449.34(10)
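The prediction can be reproduced with a small function. This is a sketch that hard-codes the Wilson Realty coefficients reported on the slide (intercept 60815.45, 21.91 per square foot, and a decrease of 1449.34 per year of age); the function name is illustrative.

```python
def predict_price(size_sq_ft, age_years):
    """Suggested listing price from the size-and-age model on the slide."""
    return 60815.45 + 21.91 * size_sq_ft - 1449.34 * age_years


price = predict_price(1900, 10)
print(round(price, 2))  # 87951.05
```

Rounded to the nearest dollar, this is the $87,951 shown for a 1900 square foot house that is 10 years old.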
binary or dummy variables
dummy variables
Binary (or dummy) variables are special variables that are created for qualitative data.
A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise. The number of dummy variables must equal one less than the number of categories of the qualitative variable.
Return to Wilson Realty, and let's evaluate how to use property condition in the regression model. There are three categories: Mint, Excellent, and Good.
X3 = 1 if the house is in excellent condition
   = 0 otherwise
X4 = 1 if the house is in mint condition
   = 0 otherwise
Note: If both X3 and X4 = 0 then the house is in good condition.
dummy variables
As more variables are added to the model, the r2 usually increases.
Y = 48329.23 + 28.21(size) - 1981.41(age) + 23684.62(if mint) + 16581.32(if excellent)
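The dummy-variable coding can be sketched as follows. The coefficients are those reported on the slide, with the age coefficient read as negative (the sign is missing in the extracted text, so this is an assumption consistent with the earlier size-and-age model). The function names and the 1900 square foot, 10 year old house are illustrative.

```python
def encode_condition(condition):
    """Return (x3, x4): x3 = 1 if excellent, x4 = 1 if mint, both 0 if good."""
    condition = condition.lower()
    if condition not in {"good", "excellent", "mint"}:
        raise ValueError(f"unknown condition: {condition}")
    x3 = 1 if condition == "excellent" else 0
    x4 = 1 if condition == "mint" else 0
    return x3, x4


def predict_price(size_sq_ft, age_years, condition):
    """Listing price from the model with two dummy variables."""
    x3, x4 = encode_condition(condition)
    return (48329.23 + 28.21 * size_sq_ft - 1981.41 * age_years
            + 16581.32 * x3 + 23684.62 * x4)


for condition in ("Good", "Excellent", "Mint"):
    print(condition, round(predict_price(1900, 10, condition), 2))
# Good 82114.13
# Excellent 98695.45
# Mint 105798.75
```

Three categories need only two dummy variables: a house in good condition is the baseline where both are 0, and each dummy coefficient is the price difference relative to that baseline.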
model building
adjusted r-Square
The best model is a statistically significant model with a high r2 and few variables.
As more variables are added to the model, the r2 usually increases. The adjusted r2 takes into account the number of independent variables in the model.
Note: When variables are added to the model, the value of r2 can never decrease; however, the adjusted r2 may decrease.
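The slides do not give the formula for the adjusted r2. A commonly used form, assumed here, is adjusted r2 = 1 - (1 - r2)(n - 1)/(n - k - 1), with n observations and k independent variables. The sketch applies it to the Triple A example.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted r2 for n observations and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


r_squared = 15.625 / 22.5   # SSR / SST for Triple A
adjusted = adjusted_r_squared(r_squared, n=6, k=1)

print(round(r_squared, 4), round(adjusted, 4))  # 0.6944 0.6181
```

Because the penalty grows with k, adding a variable that barely raises r2 can lower the adjusted value.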
multicollinearity
Collinearity or multicollinearity exists when an independent variable is correlated with another independent variable. Duplication of information occurs.
Collinearity and multicollinearity create problems in the coefficients. The overall model prediction is still good; however individual interpretation of the variables is questionable.
When multicollinearity exists, the overall F test is still valid, but the hypothesis tests related to the individual coefficients are not. A variable may appear to be significant when it is insignificant, or a variable may appear to be insignificant when it is significant.
non-linear regression
Engineers at Colonel Motors want to use regression analysis to improve fuel efficiency. They are studying the impact of weight on miles per gallon (MPG).
Linear regression model:
$MPG = 47.8 - 8.2(\text{weight})$
F significance = 0.0003, $r^2$ = 0.7446

Nonlinear (transformed variable) regression model:
$MPG = 79.8 - 30.2(\text{weight}) + 3.4(\text{weight})^2$
F significance = 0.0002, $r^2$ = 0.8478

We should not try to interpret the coefficients of the variables, due to the correlation between (weight) and (weight squared). Normally we would interpret the coefficient for $X_1$ as the change in Y that results from a 1-unit change in $X_1$, while holding all other variables constant. Holding one variable constant while changing the other is impossible in this example, since $X_2 = X_1^2$: if $X_1$ changes, then $X_2$ must change also. This is an example of a problem that exists when multicollinearity is present.
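The two Colonel Motors models can be compared side by side. This sketch hard-codes the coefficients from the slides, reading the missing signs as negative for the weight terms (an assumption, since heavier cars should get fewer miles per gallon). The slides do not state the weight units, so the value 2 below is only an illustrative input.

```python
def mpg_linear(weight):
    """Linear model: MPG as a straight-line function of weight."""
    return 47.8 - 8.2 * weight


def mpg_quadratic(weight):
    """Transformed-variable model: X1 = weight, X2 = weight squared."""
    x1 = weight
    x2 = weight ** 2   # X2 is fully determined by X1
    return 79.8 - 30.2 * x1 + 3.4 * x2


weight = 2  # illustrative value, units as used in the slide's models
print(round(mpg_linear(weight), 1))     # 31.4
print(round(mpg_quadratic(weight), 1))  # 33.0
```

Because `x2` is computed from `x1`, there is no way to change one while holding the other fixed, which is why the individual coefficients are not interpreted.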
