
# Applied Statistics 102

July 2017
## Agenda

- Simple Linear Regression
- Multiple Linear Regression

## Linear Regression

Business Problem: Investigate by how much a typical family's food expenditure changes as a result of a change in its income.

Objective: Test the hypotheses below and model spending on food using family income.
1) Does spending on food increase when a family's income increases?
2) By how much does spending on food change when family income increases (or decreases)?

Dataset: Food consumption ($) and family income ($) for 50 families (embedded Excel file).

## Simple Linear Correlation And Regression

Correlation:
- When the pattern of relationship is known, correlation analysis can be applied to determine the degree to which the variables are related
- Correlation analysis informs how well the estimating equation actually describes the relationship

Regression:
- Quantifying the relationship between two continuous variables
- Predict (or forecast) the value of one variable from knowledge of the value of another variable
- That is, an estimating equation (a mathematical formula) will be developed

## Analysis Steps

1. Assess the relationship between Family Income (X) and Food Spending (Y): scatter plot
4. Calculate error: Y(Predicted) - Y(Actual)
5. Use Family Income (X) to model Error (Y): observe correlation and trendline
14. Interpret other results

## Re-Cap: Correlation Measures the Strength of a Relationship

- The strength of the linear relationship between two variables is measured by the coefficient of correlation, rho (ρ). For a sample, we estimate ρ using Pearson's correlation coefficient R
- Correlation coefficients range between -1 and +1
- Stronger linear relationships have values closer to ±1; weaker relationships have values closer to 0
- 0 indicates no linear relationship at all
- ±1 indicates a perfect linear relationship
- E.g.: income and expenditure are positively correlated; demand and price are negatively correlated

## Mathematically: Coefficient Of Correlation

Sample correlation coefficient:

R = (Σxᵢyᵢ - n·x̄·ȳ) / √[ (Σxᵢ² - n·x̄²) (Σyᵢ² - n·ȳ²) ]

Properties of R:
1. Measures how close all the (x, y) ordered pairs come to falling exactly on a straight line.
2. -1 ≤ R ≤ 1
3. The slope determines only the sign of R.
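The sample correlation formula above can be sketched in a few lines of pure Python. The data used here is the 10-house price/size sample that appears later in these slides, so the result can be checked against the "Multiple R" value in the regression output.

```python
# A minimal pure-Python sketch of the sample correlation formula,
# using the 10-house price/size data from these slides.
from math import sqrt

x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # square feet
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]            # price, $1000s

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# R = (Σxy - n·x̄·ȳ) / sqrt((Σx² - n·x̄²)(Σy² - n·ȳ²))
num = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
den = sqrt((sum(xi**2 for xi in x) - n * xbar**2) *
           (sum(yi**2 for yi in y) - n * ybar**2))
R = num / den
print(round(R, 5))  # 0.76211, matching "Multiple R" in the output slide
```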

## Coefficient Of Determination

- The coefficient of determination R² measures how well the line fits the data
- It tells us how much of the variation in Y is explained by the relationship with X
- Consider the Y variable alone. It has some total variation, calculated using (yᵢ - ȳ)
- This variation can be partitioned into a part explained by the regression line and the residual:

  (yᵢ - ȳ) = (ŷᵢ - ȳ) + (yᵢ - ŷᵢ)

- The equation can be converted to squared terms, which are squared deviations:

  Σ(yᵢ - ȳ)² = Σ(yᵢ - ŷᵢ)² + Σ(ŷᵢ - ȳ)²

  total sum of squared deviations in y = sum of squared deviations of residuals + sum of squared deviations of the regression line

## Explanatory Power Of A Linear Regression Equation

Coefficient of determination:

R² = SSR / SST = 1 - SSE / SST = 1 - (unaccounted-for variance / total variance) = accounted-for variance / total variance

- R² is a descriptive measure of the strength of the regression relationship between X and Y: it measures the proportion of the variability in Y accounted for by the regression relationship with X
- 0 ≤ R² ≤ 1

## Food For Thought: R vs R²

| R | R² |
|---|---|
| It is the measure of linear relationship between two variables (say X and Y). If the data is non-linear, then R is meaningless. | It gives the proportion of the variance (fluctuation) of one variable (Y) that is predictable from the other variable (X). |
| Range of R: -1 ≤ R ≤ +1 | Range of R²: 0 ≤ R² ≤ 1 (mathematically, the coefficient of determination is the square of the coefficient of correlation, and knowing one, the other quantity can easily be ascertained) |
| The closer R is to +1 or -1, the stronger the linear relationship between x and y. | The closer R-squared is to 1, the better x explains y. |
| To take an example, consider a regression equation Y = 2.5X + 4 with R = 0.922, so R² = 0.850. Using R, we can deduce that there is a very strong positive correlation between X and Y, and that an increase or decrease in X would lead to a corresponding increase or decrease in Y. | Using R², we can deduce that 85% of the total variation in Y can be explained by the linear regression equation (between X and Y), and the other 15% of the total variation in Y remains unexplained. |

## Simple Linear Regression: Single Independent Variable

Definitions:
- "Regression is a measure of the average relationship between two or more variables in terms of the original units of the data" (Samuel B. Richmore)
- "One of the most frequently used techniques in economics and business research, to find a relation between two or more variables that are related causally, is regression analysis" (Taro Yamane)

- The dependent variable is also called the regressed, explained, or response variable
- In simple linear regression we generate an equation to calculate the value of a dependent variable (Y) from an independent variable (X)
- Example: time taken to get to work (Y) is a function of the distance travelled (X)

## The Regression Model

Let us develop a simple regression model with an example:
- Say you drive to work at an average of 60 km/hour. It takes about 1 minute for every kilometre travelled
- Travel time = 1 minute × kilometres travelled
- This is a mathematical model that represents the relationship between the two variables

[Figure: time taken (minutes) vs. distance travelled (km), represented on a linear scale as a straight line through the origin]

## The Regression Model (Contd.)

Actually, it won't be that simple, because there will be some time taken to walk to your car and then walk from the car to work. Say this takes an extra 3 minutes per day.

[Figure: the same line shifted upward; the extended line intercepts the Y axis at 3 minutes]

## The Regression Model (Contd.)

It also won't be that precise, because there will be slight variations in time taken due to traffic, road works, etc.

[Figure: scattered points around the line; the line does not pass perfectly through all the points]

## The Regression Model (Contd.)

In general, the regression equation takes the form:

y = β₀ + β₁x + ε

where:
- y = the dependent variable
- x = the independent variable
- β₀ = the y-intercept
- β₁ = the slope
- ε = random error term

## The Regression Model (Contd.): Line Of Best Fit

Given a data set, we need to find a way of calculating the parameters of the equation.

[Figure: scatter plot with several candidate lines, each marked "?"]

We need to fit a line of best fit.

## The Regression Model (Contd.): Line Of Best Fit

[Figure: the simple linear regression model, showing observed points, the fitted line, and the error term]

## The Regression Model (Contd.): Line Of Best Fit

Because the line will seldom fit the data precisely, there is always some error associated with our line. The line of best fit is the line that minimises the spread of these errors.

[Figure: scatter plot with fitted line; ŷᵢ is the predicted value of Y from Xᵢ, and (yᵢ - ŷᵢ) is the error]

## The Regression Model (Contd.): Error Term

The term (yᵢ - ŷᵢ) is known as the error or residual:

eᵢ = yᵢ - ŷᵢ

The line of best fit occurs when the sum of the squared errors is minimised:

SSE = Σ(yᵢ - ŷᵢ)²
## The Regression Model (Contd.): Estimating The Parameters

Slope:

b₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)² = (Σxᵢyᵢ - (Σxᵢ)(Σyᵢ)/n) / (Σxᵢ² - (Σxᵢ)²/n)

Intercept:

b₀ = ȳ - b₁x̄,  where ȳ = Σyᵢ/n and x̄ = Σxᵢ/n

- b₀ is the estimated average value of y when the value of x is zero
- b₁ is the estimated change in the average value of y as a result of a one-unit change in x
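The slope and intercept formulas above can be sketched directly in pure Python. Applied to the 10-house price/size data used later in these slides, they reproduce the coefficients shown in the Excel regression output.

```python
# A minimal sketch of the least-squares slope/intercept formulas,
# applied to the 10-house price/size data from these slides.
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # square feet
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]            # price, $1000s

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar  # the fitted line passes through (x̄, ȳ)

print(round(b0, 5), round(b1, 5))  # 98.24833 0.10977
```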
## Assumptions Of The Error Term In Linear Regression

The OLS method for estimating the regression equation parameters is only valid if certain conditions are met:
- The error variable is normally distributed
- The mean of the error term is zero
- The variance of the error is constant over the entire range of X values
- The error terms are uncorrelated; in other words, the observations have been drawn independently
- The underlying relationship between the dependent and independent variable is linear

## Simple Linear Regression Example

A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet).

- Dependent variable (y) = house price in $1000s
- Independent variable (x) = square feet

## Sample Data For House Price Model

| House Price in $1000s (y) | Square Feet (x) |
|---|---|
| 245 | 1400 |
| 312 | 1600 |
| 279 | 1700 |
| 308 | 1875 |
| 199 | 1100 |
| 219 | 1550 |
| 405 | 2350 |
| 324 | 2450 |
| 319 | 1425 |
| 255 | 1700 |

## Regression Output

| Regression Statistics | |
|---|---|
| Multiple R | 0.76211 |
| R Square | 0.58082 |
| Standard Error | 41.33032 |
| Observations | 10 |

| ANOVA | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 18934.9348 | 18934.9348 | 11.0848 | 0.01039 |
| Residual | 8 | 13665.5652 | 1708.1957 | | |
| Total | 9 | 32600.5000 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 98.24833 | 58.03348 | 1.69296 | 0.12892 | -35.57720 | 232.07386 |
| Square Feet | 0.10977 | 0.03297 | 3.32938 | 0.01039 | 0.03374 | 0.18580 |
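The sums of squares in the ANOVA table can be reproduced from the raw house data. This pure-Python sketch fits the line and then computes SSE, SSR, and SST from their definitions.

```python
# Reproducing the ANOVA sums of squares for the house-price regression
# from the raw data, in pure Python.
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

yhat = [b0 + b1 * xi for xi in x]                      # predicted prices
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # residual sum of squares
SSR = sum((yh - ybar) ** 2 for yh in yhat)             # regression sum of squares
SST = sum((yi - ybar) ** 2 for yi in y)                # total sum of squares

r2 = SSR / SST
print(round(SSR, 4), round(SSE, 4), round(SST, 4), round(r2, 5))
# ≈ 18934.9348 13665.5652 32600.5 0.58082
```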

## Graphical Presentation

Graphical representation of our earlier example:

[Figure: scatter of house price ($1000s) vs. square feet with the fitted line; slope = 0.10977, intercept = 98.248]

house price = 98.24833 + 0.10977 × (square feet)

## Interpretation Of The Intercept, b₀

- b₀ is the estimated average value of Y when the value of X is zero (if x = 0 is in the range of observed x values)
- Here, no houses had 0 square feet, so b₀ = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet
- It may just be an initial fixed cost that needs to be paid, irrespective of the size of the house purchased

## Interpretation Of The Slope Coefficient, b₁

- b₁ measures the estimated change in the average value of Y as a result of a one-unit change in X
- Here, b₁ = 0.10977 tells us that the value of a house increases by 0.10977 × ($1000) = $109.77, on average, for each additional square foot of size

## Least Squares Regression Properties

- The sum of the residuals from the least squares regression line is 0: Σ(yᵢ - ŷᵢ) = 0
- The sum of the squared residuals, Σ(yᵢ - ŷᵢ)², is a minimum
- The simple regression line always passes through the mean of the y variable and the mean of the x variable
- The least squares coefficients are unbiased estimates of β₀ and β₁
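The first and third properties are easy to verify numerically. A quick pure-Python check on the house data from these slides:

```python
# Checking two least-squares properties on the slides' house data:
# (1) residuals sum to (numerically) zero; (2) the line passes through (x̄, ȳ).
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(abs(sum(residuals)) < 1e-9)            # True: residuals sum to ~0
print(abs((b0 + b1 * xbar) - ybar) < 1e-9)   # True: line passes through means
```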

## Explained And Unexplained Variation

SST = SSE + SSR

- SST: total sum of squares
- SSE: sum of squares error
- SSR: sum of squares regression

SST = Σ(y - ȳ)²,  SSE = Σ(y - ŷ)²,  SSR = Σ(ŷ - ȳ)²

where:
- ȳ = average value of the dependent variable
- y = observed values of the dependent variable
- ŷ = estimated value of y for the given x value

## Explained And Unexplained Variation (Contd.)

- SST = total sum of squares: measures the variation of the yᵢ values around their mean ȳ
- SSE = error sum of squares: variation attributable to factors other than the relationship between x and y
- SSR = regression sum of squares: explained variation attributable to the relationship between x and y

## Explained And Unexplained Variation (Contd.)

[Figure: for a single point (Xᵢ, yᵢ), the vertical distances illustrating SST = Σ(yᵢ - ȳ)², SSE = Σ(yᵢ - ŷᵢ)², and SSR = Σ(ŷᵢ - ȳ)², measured from the mean line, the observed point, and the fitted line]

## Coefficient Of Determination, R²

- The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
- The coefficient of determination is also called R-squared and is denoted as R²

R² = SSR / SST,  where 0 ≤ R² ≤ 1

## Coefficient Of Determination, R² (Contd.)

Coefficient of determination:

R² = SSR / SST = (sum of squares explained by regression) / (total sum of squares)

Note: in the single-independent-variable case, the coefficient of determination is

R² = r²

where:
- R² = coefficient of determination
- r = simple correlation coefficient
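The identity R² = r² for simple regression can be checked numerically on the slides' house data:

```python
# A small pure-Python check that R² (SSR/SST) equals r² (squared simple
# correlation) for one-predictor regression, on the slides' house data.
from math import sqrt

x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)

r = sxy / sqrt(sxx * syy)          # simple correlation coefficient
b1 = sxy / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]
r2 = sum((yh - ybar) ** 2 for yh in yhat) / syy   # SSR / SST

print(abs(r2 - r ** 2) < 1e-9)  # True: R² equals r² in simple regression
```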

## Examples Of Approximate R² Values

R² = 1: perfect linear relationship between x and y; 100% of the variation in y is explained by variation in x.

[Figures: points falling exactly on a positively sloped line and on a negatively sloped line]
## Examples Of Approximate R² Values (Contd.)

0 < R² < 1: weaker linear relationship between x and y; some but not all of the variation in y is explained by variation in x.

[Figures: points scattered loosely around positively and negatively sloped lines]

## Examples Of Approximate R² Values (Contd.)

R² = 0: no linear relationship between x and y; the value of Y does not depend on x (none of the variation in y is explained by variation in x).

[Figure: a horizontal fitted line through a patternless scatter]

## Regression Output (Contd.)

R² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082

58.08% of the variation in house prices is explained by variation in square feet (the R Square entry in the regression output shown earlier).

## Standard Error Of Estimate

The standard deviation of the variation of observations around the regression line is estimated by

s = √( SSE / (n - k - 1) )

where:
- SSE = sum of squares error
- n = sample size
- k = number of independent variables in the model

## The Standard Deviation Of The Regression Slope

The standard error of the regression slope coefficient (b₁) is estimated by

s_b1 = s / √( Σ(x - x̄)² ) = s / √( Σx² - (Σx)²/n )

where:
- s_b1 = estimate of the standard error of the least squares slope
- s = √( SSE / (n - k - 1) ) = sample standard error of the estimate
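Both standard errors can be computed from the raw house data with the formulas above; the results match the Excel output quoted in these slides.

```python
# Computing the standard error of estimate s and the slope's standard
# error s_b1 for the slides' house data, pure Python.
from math import sqrt

x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n, k = len(x), 1                     # 10 observations, 1 predictor
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(SSE / (n - k - 1))          # standard error of the estimate
s_b1 = s / sqrt(sxx)                 # standard error of the slope

print(round(s, 5), round(s_b1, 5))   # 41.33032 0.03297
```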

## Regression Output (Contd.)

From the same output: the standard error of the estimate is s = 41.33032 (the Standard Error entry under Regression Statistics), and the standard error of the slope is s_b1 = 0.03297 (the Standard Error column for the Square Feet coefficient).

## Comparing Standard Errors: Graphical Interpretation

[Figures: left, variation of observed y values around the regression line, comparing small s with large s; right, variation in the slope of regression lines from different possible samples, comparing small s_b1 with large s_b1]

## Inference About The Slope: t Test

t test for a population slope: is there a linear relationship between x and y?

Null and alternative hypotheses:
- H0: β₁ = 0 (no linear relationship)
- H1: β₁ ≠ 0 (linear relationship does exist)

Test statistic:

t = (b₁ - β₁) / s_b1,  d.f. = n - k - 1

where:
- b₁ = sample regression slope coefficient
- β₁ = hypothesized slope (0 under H0)
- s_b1 = estimator of the standard error of the slope
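The slope t test can be sketched directly: compute t = (b₁ - 0)/s_b1 and compare it with the critical value ±2.3060 (t with 8 d.f. at α = 0.05) quoted later in these slides.

```python
# A sketch of the slope t test on the slides' house data.
from math import sqrt

x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n, k = len(x), 1
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_b1 = sqrt(SSE / (n - k - 1)) / sqrt(sxx)

t = (b1 - 0) / s_b1                      # hypothesized slope is 0 under H0
print(round(t, 3), abs(t) > 2.3060)      # 3.329 True -> reject H0
```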

## Inference About The Slope: t Test (Contd.)

Estimated regression equation (from the 10-house sample of prices in $1000s vs. square feet listed earlier):

house price = 98.25 + 0.1098 × (sq. ft.)

The slope of this model is 0.1098. Does the square footage of the house affect its sales price?

## Inferences About The Slope: t Test Example

- H0: β₁ = 0;  HA: β₁ ≠ 0
- From the Excel output: b₁ = 0.10977, s_b1 = 0.03297, t Stat = 3.32938, P-value = 0.01039
- Decision rule: with d.f. = 8 and α/2 = 0.025, reject H0 if t < -2.3060 or t > 2.3060
- Since t = 3.329 falls in the rejection region, reject H0
- Conclusion: there is sufficient evidence that square footage affects house price
## Regression Analysis For Description

Confidence interval estimate of the slope:

b₁ ± t(α/2) × s_b1,  d.f. = n - 2

From the Excel printout for house prices (Lower 95% and Upper 95% columns), at the 95% level of confidence the confidence interval for the slope is (0.0337, 0.1858).

## Regression Analysis For Description (Contd.)

- Since the units of the house price variable are $1000s, we are 95% confident that the average impact on sales price is between $33.70 and $185.80 per square foot of house size
- This 95% confidence interval does not include 0
- Conclusion: there is a significant relationship between house price and square feet at the 0.05 level of significance
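The interval above can be reproduced from the raw data with b₁ ± t(α/2)·s_b1, using the critical value 2.3060 (t, 8 d.f.) quoted in these slides.

```python
# A sketch of the 95% confidence interval for the slope, b1 ± t(α/2)·s_b1,
# on the slides' house data.
from math import sqrt

x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_b1 = sqrt(SSE / (n - 2)) / sqrt(sxx)

t_crit = 2.3060                       # t(0.025, d.f. = 8), from the slides
lo, hi = b1 - t_crit * s_b1, b1 + t_crit * s_b1
print(round(lo, 4), round(hi, 4))     # 0.0337 0.1858
```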

## Residual Analysis

Purposes:
- Examine the linearity assumption
- Examine for constant variance at all levels of x
- Evaluate the normal distribution assumption

Graphical analysis of residuals:
- Plot residuals vs. x
- Create a histogram of residuals to check for normality

## Residual Analysis For Normality

The simplest way to assess whether or not the residuals are normal is to draw a histogram and visually inspect the distribution.

[Figure: histogram of residuals e showing a roughly bell-shaped, normally distributed profile]

## Residual Analysis For Linearity

[Figures: left, a curved y-vs-x pattern whose residuals show a systematic arc when plotted against x (not linear); right, a straight-line y-vs-x pattern whose residuals scatter randomly around zero (linear)]

## Residual Analysis For Constant Variance

[Figures: left, heteroscedasticity: the residual spread grows with x (non-constant variance); right, homoscedasticity: the residual spread stays even across x (constant variance)]

## Food For Thought: ANOVA vs Linear Regression

| ANOVA | Linear Regression |
|---|---|
| This is specifically used when the independent variable is categorical and the dependent variable is continuous. | This can be used when both the dependent variable and the independent variables are continuous, and when independent variables are categorical as well. |
| ANOVA is used to see whether particular categories have different effects on the dependent variable, i.e. whether independent categorical variables have different effects on the dependent variable. | Use regression to figure out whether different categorical variables have any effect at all. |
| Regression and ANOVA give exactly the same R². | Regression and ANOVA give exactly the same R²; it measures the extent to which the variation in all the independent variables explains the overall variation in the dependent variable. |
| ANOVA asks: how much do differences in category make a difference in results? | Regression asks: how much does a category matter at all? |

## Food For Thought: ANOVA vs Linear Regression (Contd.)

| ANOVA | Linear Regression |
|---|---|
| The null hypothesis of ANOVA tests whether the categories (or groups) have the same means or not. | The null hypothesis of regression tests whether there is any relationship between the (assumed) dependent variable and the (assumed) independent variables. |
| ANOVA uses the categories to split the overall population into sub-populations (what we call "segments" in marketing and "test groups" in industrial quality control), and then tests against the null hypothesis that the sub-populations all have the same average value of the dependent variable. The F-statistic tests the probability that the means differ only by chance. | Regression solves for the linear equation that minimizes the sum of the squared errors; each dummy variable is assigned a coefficient, i.e. a number by which it is multiplied. Obviously, if a coefficient is zero, then the variable drops out of the equation and doesn't have any effect at all. So for regression, the F-statistic tests how likely it is that the coefficient is not zero (against the null hypothesis that the coefficient is zero and there is no effect). |
| E.g.: if we want to know whether being female means lower income, or having a BA degree means higher income, ANOVA should be used. | E.g.: if we want to know whether gender, or having a college degree, has any effect on income, regression should be used. |

## Agenda

- Simple Linear Regression
- Multiple Linear Regression
## Multiple Regression

- The objective of multiple regression analysis is to predict a single dependent variable from a set of independent variables
- Both independent and dependent variables should be metric (interval or ratio data). However, under certain conditions, dummy-coded (categorical) independent variables can be used

There are some assumptions in using this statistic:
- the criterion variable is assumed to be a random variable
- there is a statistical relationship (estimating the average value) rather than a functional relationship (calculating an exact value)
- there should be a linear relationship among the predictors and between the predictors and the criterion variable

## Multiple Regression (Contd.)

Multiple regression analysis provides a population model:

Yᵢ = β₀ + β₁X₁ᵢ + ... + βₖXₖᵢ + εᵢ

where x₁, x₂, ..., xₖ (k of them) are independent variables, and the data are of the form (y₁, x₁₁, x₂₁, ..., xₖ₁), ..., (yₙ, x₁ₙ, x₂ₙ, ..., xₖₙ).

The same analysis provides a predictive equation:

yᵢ = b₀ + b₁x₁ᵢ + ... + bₖxₖᵢ + eᵢ

where:
- the regression coefficients b₀, b₁, ..., bₖ are estimates of β₀, β₁, ..., βₖ
- b₀ = intercept of the line
- b₁, ..., bₖ = partial regression coefficients
- e = the associated error term

## Multiple Regression Example

Suppose we want to predict rent (in dollars per month) based on the size of the apartment (number of rooms). You would collect data by recording the size and rent and fit a model. The following information has been gathered from a random sample of apartment renters in a city.

| Rent, $ | 360 | 1000 | 450 | 525 | 350 | 300 |
|---|---|---|---|---|---|---|
| No. of rooms | 2 | 6 | 3 | 4 | 2 | 1 |

## Multiple Regression Example (Contd.)

[Figure: scatter of number of rooms vs. rent ($); the points fall roughly on a straight line]

Because the data looks linear, we fit a least-squares regression line.

## Multiple Regression Example (Contd.)

But the number of rooms isn't the only factor that has an impact on rent; the distance from downtown may be another predictor. With multiple regression we have more than one independent variable, so we can use both the number of rooms and the distance from downtown to predict rent. Our new table, with the distance data added, looks like this:

| Rent, $ | 360 | 1000 | 450 | 525 | 350 | 300 |
|---|---|---|---|---|---|---|
| No. of rooms | 2 | 6 | 3 | 4 | 2 | 1 |
| Distance from downtown (miles) | 1 | 1 | 2 | 3 | 10 | 4 |

## Multiple Regression Example (Contd.)

This data can't be graphed like simple linear regression, because there are two independent variables. Regression output (by SAS):

- No. of observations read: 6
- No. of observations used: 6

Analysis of Variance

| Source | DF | SS | MS | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 2 | 306910 | 153455 | 16.28 | 0.0245 |
| Error | 3 | 28277 | 9425.76565 | | |
| Corrected Total | 5 | 335188 | | | |

| Root MSE | 97.08638 | R-Square | 0.9156 |
|---|---|---|---|
| Dependent Mean | 497.5 | Adj R-Sq | 0.8594 |
| Coeff Var | 19.51485 | | |

## Multiple Regression Example (Contd.)

Parameter Estimates

| Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| | Standardized Estimate | Variance Inflation |
|---|---|---|---|---|---|---|---|---|
| Intercept | Intercept | 1 | 96.458 | 118.12 | 0.82 | 0.47 | 0 | 0 |
| Number_of_rooms | Number_of_rooms | 1 | 136.48 | 26.864 | 5.08 | 0.01 | 0.94297 | 1.23 |
| dis_downtown | Distance_from_Downtown | 1 | -2.4035 | 14.171 | -0.17 | 0.88 | -0.0315 | 1.23 |

What does all this mean?
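The SAS parameter estimates can be reproduced from the raw rent data. This pure-Python sketch (no numpy) solves the 3×3 normal equations (XᵀX)b = Xᵀy by Gaussian elimination.

```python
# Fitting the two-predictor rent model by solving the normal equations,
# to reproduce the SAS parameter estimates quoted in these slides.
rooms = [2, 6, 3, 4, 2, 1]
dist  = [1, 1, 2, 3, 10, 4]
rent  = [360, 1000, 450, 525, 350, 300]

# Design matrix columns: intercept, rooms, distance
X = [[1.0, r, d] for r, d in zip(rooms, dist)]
y = rent

# Normal equations: (XᵀX) b = Xᵀy
XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(3)]

# Gaussian elimination with partial pivoting, then back-substitution
A = [XtX[i] + [Xty[i]] for i in range(3)]
for i in range(3):
    p = max(range(i, 3), key=lambda r: abs(A[r][i]))
    A[i], A[p] = A[p], A[i]
    for r in range(i + 1, 3):
        f = A[r][i] / A[i][i]
        A[r] = [a - f * b for a, b in zip(A[r], A[i])]
b = [0.0] * 3
for i in range(2, -1, -1):
    b[i] = (A[i][3] - sum(A[i][j] * b[j] for j in range(i + 1, 3))) / A[i][i]

print([round(v, 3) for v in b])  # [96.458, 136.485, -2.403]
```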

## Multiple Regression Example (Contd.)

We get the coefficient values from the Parameter Estimates section of the SAS output:

Rent = 96.458 + (136.48) × No. of rooms + (-2.4035) × Distance

PS: Just like linear regression, when we fit a multiple regression to data, the terms in the model equation are statistics, not parameters.

## Multiple Regression Example (Contd.)

Hypotheses:
- H0: β₁ = β₂ = β₃ = ... = βₖ = 0 (all independent variables are unimportant for predicting y)
- HA: at least one βₖ ≠ 0 (at least one independent variable is useful for predicting y)

What type of test should be used? The distribution used is called the Fisher (F) distribution; the F-statistic is used with this distribution.

## Multiple Regression Example (Contd.)

Regression SS = Σᵢ (Ŷᵢ - Ȳ)²
Error SS = Σᵢ (Yᵢ - Ŷᵢ)²
Total SS = Σᵢ (Yᵢ - Ȳ)²

Total SS = Regression SS + Error SS:

Σᵢ (Yᵢ - Ȳ)² = Σᵢ (Ŷᵢ - Ȳ)² + Σᵢ (Yᵢ - Ŷᵢ)²

## Multiple Regression Example (Contd.)

There are also the regression mean square, error mean square, and total mean square (abbreviated MS). To calculate these, divide each sum of squares by its respective degrees of freedom:
- Regression d.f. = k
- Error d.f. = n - k - 1
- Total d.f. = n - 1

where k is the number of independent variables and n is the total number of observations used to calculate the regression.

Now we can calculate the F-statistic:

F = (regression mean square) / (error mean square)
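The F-statistic for the rent model can be reproduced from the SS values in the SAS ANOVA table quoted earlier. (The slides' quoted SS are rounded, so the error mean square here differs slightly from SAS's 9425.76565, but F still comes out as 16.28.)

```python
# Computing the F-statistic from the rent model's ANOVA sums of squares,
# as quoted (rounded) in the SAS output earlier in these slides.
model_SS, error_SS = 306910, 28277
n, k = 6, 2

MSR = model_SS / k            # regression mean square, d.f. = k
MSE = error_SS / (n - k - 1)  # error mean square, d.f. = n - k - 1
F = MSR / MSE

print(round(F, 2))  # 16.28, matching the SAS ANOVA table
```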

## Multiple Regression Example (Contd.)

- The p-value for the F-statistic is then found in an F-distribution table. As you saw before, it can also be easily calculated by software.
- A small p-value rejects the null hypothesis that none of the independent variables is significant, i.e., at least one of the independent variables is significant.
- The conclusion in the context of our example: we have strong evidence (p = 0.0245) to reject the null hypothesis, i.e., either no. of rooms or distance from downtown is significant in predicting rent.
- Once you know that at least one independent variable is significant, you can go on to test each independent variable separately.

## Multiple Regression Example (Contd.)

Testing individual terms:
- If an independent variable does not contribute significantly to predicting the value of Y, the coefficient of that variable will be 0
- The test for this hypothesis determines whether the estimated coefficient is significantly different from 0
- From this, we can tell whether an independent variable is important for predicting the dependent variable

## Multiple Regression Example (Contd.)

Test for individual terms:
- H0: βⱼ = 0 (the independent variable xⱼ is not important for predicting y)
- HA: βⱼ ≠ 0 (or βⱼ > 0, or βⱼ < 0) (the independent variable xⱼ is important for predicting y, where j identifies the specified variable)

## Multiple Regression Example (Contd.)

Test statistic:

t = bⱼ / s_bj,  d.f. = n - k - 1

- Remember, this test is only to be performed if the overall model (F test) is significant
- Tests of individual terms for significance are the same as a test of significance in simple linear regression
- A small p-value in the Parameter Estimates table means that the independent variable is significant
- This test of significance shows that no. of rooms is a significant independent variable for predicting rent (p = 0.01), but distance from downtown is not (p = 0.88)

## Multiple Regression (Contd.)

Some more evaluations:
- Strength of association is measured using the coefficient of multiple determination (R²)
- Adjusted R² accounts for the number of predictors:

  Adjusted R² = 1 - (1 - R²) × (n - 1) / (n - k - 1)   ......(1)

- Residual analysis to check the appropriateness of the model:
  - Histogram: normal distribution assumption (can also be checked with a K-S one-sample test)
  - Plotting residuals against predicted values: assumption of constant variance of the error term

## Food For Thought: R² vs Adjusted R²

If we already have R² for a regression model, why do we need adjusted R²?

Answer: when we compare one multiple regression model (call it A) with another multiple regression model (call it B) on the same sample, with A having a different number of independent variables than B, we should not use R². This is because the R² of A and the R² of B (and consequently the predicted values of the dependent variable) have been computed using different numbers of independent variables, so the comparison need not be an apt one. Adjusted R² penalises the number of predictors, as can be clearly seen in its formula in equation (1) on the previous slide (see the denominator n - k - 1, where k is the number of independent variables). Hence it is suitable for making comparisons between two regression models having the same dependent variable but different numbers of independent variables.
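The adjusted R² formula in equation (1) can be sanity-checked against the SAS output for the rent model (R-Square 0.9156, Adj R-Sq 0.8594, n = 6, k = 2). The helper name `adjusted_r2` is just for illustration; the small difference in the last digit comes from the quoted R² being rounded.

```python
# A sketch of the adjusted R² formula from equation (1) in the slides,
# checked against the rent model's SAS output.
def adjusted_r2(r2, n, k):
    """Penalise R² for the number of predictors k given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.9156, n=6, k=2), 4))  # 0.8593, vs. SAS's 0.8594
```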

## Multiple Regression (Contd.)

- Plotting residuals against time/sequence of observations: assumption of non-correlation across error terms (the Durbin-Watson test provides a formal analysis of the same)
- Plotting residuals against independent variables: appropriateness of the model
- Multiple regression is sometimes done stepwise: each predictor variable is included/removed one at a time, which helps in cases of multicollinearity
  - Forward inclusion/selection
  - Backward elimination
  - Stepwise solution

## Multiple Regression (Contd.)

Necessary to understand the relative importance of predictors:
- Statistical significance: partial F test or t-test
- Partial r²: between the independent and dependent variable, controlling for the effect of the other independent variables
- Change in R² when a variable is entered into the equation

Cross validation:
- Regression estimated using the entire data set
- Data split into estimation and validation samples
- Regression on the estimation sample alone, compared with the model on the entire sample on partial regression coefficients
- This model is applied to the validation sample
- The observed and predicted values are correlated to get an r²

Dummy variable regression:
- Used to include nominal/categorical variables as predictor/independent variables
- Class data (high, medium, low; or old, middle-aged, young) is converted into binary (1/0) variables

## Post-Regression Check List

Statistics checklist:
- Calculate the correlation between pairs of x variables
- Watch for evidence of multicollinearity (VIF > 3)
- Check signs of coefficients: do they make sense?
- Check 95% C.I. (use t-statistics as a quick scan): are coefficients significantly different from zero?
- R²: overall quality of the regression, but not the only measure

Residual checklist:
- Normality: look at a histogram of residuals
- Heteroscedasticity: plot residuals against each x variable
- Autocorrelation: if data has a natural order, plot residuals in order and check for a pattern

## Final Check List

Checking on the terms we have learnt so far:
- Linearity: scatter plot, common sense, and knowing your problem; transform, including interactions, if useful
- t-statistics: are the coefficients significantly different from zero? Look at the width of confidence intervals
- F-tests: for subsets, equality of coefficients
- R²: is it reasonably high in the context?
- Influential observations: outliers in predictor space and dependent-variable space
- Normality: plot a histogram of the residuals (Studentized residuals)
- Heteroscedasticity: plot residuals against each x variable; transform if necessary (Box-Cox transformations)
- Autocorrelation: time series plot
- Multicollinearity: compute correlations of the x variables; do signs of coefficients agree with intuition? (Principal components)
- Missing values: values which ought to have been in place, but are missing

## Food For Thought: Interpreting Regression Coefficients

Consider the following generalized regression equation:

Y = B0 + B1*X1 + B2*X2 + E  -----(1)

where Y = response variable; X1 = predictor variable 1; X2 = predictor variable 2; E = residual error; B0 = Y intercept; B1 = first regression coefficient; B2 = second regression coefficient.

Specifically, let:
- Y = height of a shrub (in cm), based on
- X1 = the amount of bacteria in the soil (1000/ml), and
- X2 = whether the plant is located in partial or full sun (0 = partial sun; 1 = full sun)

Say the fitted equation is Y = 42 + 2.3*X1 + 11*X2.

Interpreting coefficients of continuous predictor variables: since X1 is a continuous variable, B1 represents the difference in the predicted value of Y for each one-unit difference in X1, if X2 remains constant. This means that if X1 differed by one unit and X2 did not differ, Y would differ by B1 units, on average.

Food For Thought: Interpreting Regression Coefficients
Regression Coefficients and their interpretation

Interpreting coefficients of continuous predictor variables (contd.): In the current example, shrubs with a 5000/ml bacteria count would, on average, be 2.3 cm taller than those with a 4000/ml bacteria count, which likewise would be about 2.3 cm taller than those with a 3000/ml count, as long as they were in the same type of sun. Note that since the bacteria count is measured in 1000s per ml of soil, 1000 bacteria represent one unit of X1.
Interpreting coefficients of categorical predictor variables: Similarly, B2 is interpreted as the difference in the predicted value of Y for each one-unit difference in X2, if X1 remains constant. However, since X2 is a categorical variable coded as 0 or 1, a one-unit difference represents switching from one category to the other. B2 is then the average difference in Y between the category for which X2 = 0 (the reference group) and the category for which X2 = 1 (the comparison group). So compared to shrubs in partial sun, we would expect shrubs in full sun to be 11 cm taller, on average, at the same level of soil bacteria.

Food For Thought: Interpreting Regression Coefficients
Regression Coefficients and their interpretation

Interpreting the Intercept: B0, the Y-intercept, can be interpreted as the value you
would predict for Y if both X1 = 0 and X2 = 0. We would expect an average height of
42 cm for shrubs in partial sun with no bacteria in the soil. However, this is only a
meaningful interpretation if it is reasonable that both X1 and X2 can be 0, and if the
dataset actually included values for X1 and X2 near 0. If neither of these
conditions holds, then B0 really has no meaningful interpretation. It just anchors
the regression line in the right place. In this case, it is easy to see that X2 sometimes
is 0, but if X1, our bacteria level, never comes close to 0, then our intercept has no
real interpretation.

Food For Thought: Scaling Of Regression variables
Scaling Of Regression variables & their effect on coefficients

Changing the scale of the variable will lead to a corresponding change in the scale of
the coefficients and standard errors BUT no change in the significance or
interpretation.

Consider the equation:

y = b0 + b1x1 + b2x2 + ... + bjxj + e ----(1)

If xj is multiplied by a constant c, then its coefficient is divided by c
If y is multiplied by c, all the OLS coefficients are multiplied by c
Both the t and F statistics are unaffected by changing the units of measurement of any variable
If a variable appears in logarithmic form, changing its units of measurement does not affect the slope coefficient (only the intercept shifts)
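These scaling rules are easy to verify empirically. A short NumPy sketch on illustrative simulated data:

```python
import numpy as np

def slope(x, y):
    """OLS slope of y on x with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 100)
y = 5 + 2 * x + rng.normal(0, 0.5, 100)

b1 = slope(x, y)
print(slope(10 * x, y))   # x scaled by c=10 => slope divided by 10
print(slope(x, 10 * y))   # y scaled by c=10 => slope multiplied by 10
```

The significance of the slope is unchanged in each case; only its magnitude rescales.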

Effects Of Data Scaling In A Tabulated Summary
Data Scaling

|           | Dependent variable y | Dependent variable cy |
|-----------|----------------------|-----------------------|
| R-squared | R2                   | R2                    |
| SSR       | SSR                  | c2 * SSR              |

Food For Thought: Transformation Of Regression variables
Transformation of regression variables

Broadly speaking, there are two kinds of transformations:

Linear transformation: A linear transformation preserves linear relationships between variables. Therefore, the correlation between x and y would be unchanged after a linear transformation. Examples of a linear transformation of variable x would be multiplying x by a constant, dividing x by a constant, or adding a constant to x.

Nonlinear transformation: A nonlinear transformation changes (increases or decreases) linear relationships between variables and, thus, changes the correlation between variables. Examples of a nonlinear transformation of variable x would be taking the square root of x or the reciprocal of x.

In regression, a transformation to achieve linearity is a special kind of nonlinear transformation: one that increases the linear relationship between two variables.

Common Transformation Methods
Common Transformation Methods To Achieve Linearity For Regression Analysis - Tabulated

| Method | Transformation(s) | Regression equation | Predicted value ŷ (back-transformation) |
|---|---|---|---|
| Standard linear regression | None | y = b0 + b1x | ŷ = b0 + b1x |
| Exponential model | Dependent variable = log(y) | log(y) = b0 + b1x | ŷ = 10^(b0 + b1x) |
| Quadratic model | Dependent variable = sqrt(y) | sqrt(y) = b0 + b1x | ŷ = (b0 + b1x)^2 |
| Reciprocal model | Dependent variable = 1/y | 1/y = b0 + b1x | ŷ = 1 / (b0 + b1x) |
| Logarithmic model | Independent variable = log(x) | y = b0 + b1*log(x) | ŷ = b0 + b1*log(x) |
| Power model | Dependent variable = log(y); Independent variable = log(x) | log(y) = b0 + b1*log(x) | ŷ = 10^(b0 + b1*log(x)) |
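The back-transformations in the last column can be collected into one helper. A sketch (logs are base 10, matching the table; the function name is my own):

```python
import numpy as np

def back_transform(model, b0, b1, x):
    """Predicted y in original units, given b0, b1 fitted on the transformed scale."""
    x = np.asarray(x, dtype=float)
    if model == "linear":
        return b0 + b1 * x
    if model == "exponential":      # fitted log10(y) = b0 + b1*x
        return 10 ** (b0 + b1 * x)
    if model == "quadratic":        # fitted sqrt(y) = b0 + b1*x
        return (b0 + b1 * x) ** 2
    if model == "reciprocal":       # fitted 1/y = b0 + b1*x
        return 1.0 / (b0 + b1 * x)
    if model == "logarithmic":      # fitted y = b0 + b1*log10(x)
        return b0 + b1 * np.log10(x)
    if model == "power":            # fitted log10(y) = b0 + b1*log10(x)
        return 10 ** (b0 + b1 * np.log10(x))
    raise ValueError(f"unknown model: {model}")

print(back_transform("quadratic", 1.0, 0.5, 4.0))   # (1 + 0.5*4)^2 = 9.0
```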

Steps Involved In Transformation For Achieving Linearity
Transformation for achieving linearity

Transforming a data set to enhance linearity is a multi-step, trial-and-error process:

1. Conduct a standard regression analysis on the raw data.
2. Construct a residual plot.
   - If the plot pattern is random, do not transform the data.
   - If the plot pattern is not random, continue.
3. Compute the coefficient of determination (R2).
4. Choose a transformation method (from the table above).
5. Transform the independent variable, dependent variable, or both.
6. Conduct a regression analysis using the transformed variables.
7. Compute the coefficient of determination (R2), based on the transformed variables.
   - If the transformed R2 is greater than the raw-score R2, the transformation was successful. Congratulations!
   - If not, try a different transformation method.

The only way to determine which method is best is to try each and compare the results (i.e., residual plots, correlation coefficients).

Food For Thought: Solved Example On Transformation
Transformation example

X 1 2 3 4 5 6 7 8 9
Y 2 1 6 14 15 30 40 74 75

The table above (left) shows the data for the independent and dependent variables, x and y respectively. Applying a linear regression to the untransformed raw data, the residual plot shows a non-random pattern (a U-shaped curve, shown on the right), which suggests that the data are nonlinear.
We repeat the analysis using the quadratic model for transformation, taking the square root of y, rather than y itself, as the dependent variable. Using the transformed data, our regression equation is:
y't = b0 + b1x, where
yt = transformed dependent variable, equal to the square root of y
y't = predicted value of the transformed dependent variable yt
x = independent variable
b0 = y-intercept of the transformed regression line
b1 = slope of the transformed regression line

Food For Thought: Solved Example On Transformation
Transformation example

X 1 2 3 4 5 6 7 8 9
sqrt(Y) 1.41 1.00 2.45 3.74 3.87 5.48 6.32 8.60 8.66
Since the transformation was based on the quadratic model (yt = the square root of y), the transformed regression equation can be expressed in terms of the original units of variable Y as: y' = (b0 + b1x)^2, where
y' = predicted value of y in its original units
x = independent variable
b0 = y-intercept of the transformed regression line
b1 = slope of the transformed regression line
In the residual plot above (from the square-root-transformed regression), there is no pattern, so the transformation has been successful: the relationship between the transformed dependent variable (square root of Y) and the independent variable (X) is linear. Also, the coefficient of determination was 0.96 with the transformed data versus only 0.88 with the raw data. Hence, the transformed data resulted in a better model.
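The R2 comparison for this example can be reproduced directly. A NumPy sketch using the data from the tables above:

```python
import numpy as np

x = np.arange(1, 10, dtype=float)
y = np.array([2, 1, 6, 14, 15, 30, 40, 74, 75], dtype=float)

def r_squared(x, y):
    """Coefficient of determination for simple linear regression of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

print(round(r_squared(x, y), 2))           # raw data -> 0.88
print(round(r_squared(x, np.sqrt(y)), 2))  # sqrt-transformed -> 0.96
```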

Food For Thought: Standardization Of Regression Coefficients
Standardization of regression coefficients

In multiple regression, the relative sizes of the raw coefficients cannot be compared directly. For example, say that we are predicting the graduate-level GPA of applicants to the MPA Program. We use their undergraduate GPA, their GRE scores, and the number of years they have been out of college as independent variables. We obtain the following regression equation:
Y=1.437 + (.367) (UG-GPA) + (.00099) (GRE score) + (-.014) (years out of college) ---- (1)
In the above equation, one cannot compare the size of the various coefficients because the
three independent variables are measured on different scales. Undergraduate GPA is measured
on a scale from 0.0 to 4.0. GRE score is measured on a scale from 0 to 1600. Years out of college
is measured on a scale from 0 to 20. We cannot directly tell which independent variable has
the most effect on Y (graduate level GPA).
However, it is possible to transform the coefficients into standardized regression coefficients, which are written as the plain English letter b. The standardized regression coefficients in any one regression equation are measured on a common scale, because every variable has been converted to z-score units (mean zero, standard deviation one). They are then directly comparable to one another, with the largest coefficient indicating which independent variable has the greatest influence on the dependent variable.

Food For Thought: Standardization Of Regression Coefficients
Standardization of regression coefficients

| Variable Name | Non-Standardized Coefficient | Standardized Coefficient (b) |
|---|---|---|
| GRE score | .00099 | +.175 |
| Years out of college | -.014 | -.122 |
| Intercept or Constant (a) | 1.437 | n/a |

The table above gives the non-standardized as well as the standardized regression coefficients from regression equation (1).
From the standardized coefficients (b), it is clear that Undergraduate GPA is the most important of the three independent variables.
One difference between non-standardized and standardized regression output is that standardized regression does not have an intercept term (a constant). If there is no intercept term, the regression coefficients have been standardized. If there is an intercept term, the regression coefficients have not been standardized.
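The link between the two sets of coefficients can be demonstrated with simulated data (variable names and scales below are illustrative, not the course dataset): z-scoring every variable and refitting without an intercept yields standardized coefficients equal to b_j = B_j * sd(x_j) / sd(y).

```python
import numpy as np

def ols(X, y, intercept=True):
    """OLS coefficients, optionally with an intercept column prepended."""
    if intercept:
        X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)
n = 300
ug_gpa = rng.uniform(2.0, 4.0, n)          # illustrative 0-4 GPA scale
gre = rng.uniform(400, 1600, n)            # illustrative GRE scale
grad_gpa = 1.4 + 0.4 * ug_gpa + 0.001 * gre + rng.normal(0, 0.2, n)

X = np.column_stack([ug_gpa, gre])
B = ols(X, grad_gpa)[1:]                   # unstandardized slopes (skip the intercept)

z = lambda v: (v - v.mean()) / v.std()
Xz = np.column_stack([z(ug_gpa), z(gre)])
b = ols(Xz, z(grad_gpa), intercept=False)  # standardized: no intercept term

print(np.allclose(b, B * X.std(axis=0) / grad_gpa.std()))  # True: b_j = B_j * sd(x_j)/sd(y)
```

Note that the standardized fit correctly has no intercept, matching the point above.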

Standardized vs Unstandardized Regression Coefficients
Standardized vs un-standardized regression coefficients

Standardized (beta or β): Standardized relationships say that for a one-standard-deviation increment on a predictor, the outcome variable increases (or decreases) by some number of SDs corresponding to the coefficient. All variables have been converted to a common metric, namely standard-deviation (z-score) units, so the coefficients can meaningfully be compared in magnitude. In this case, whichever predictor variable has the largest β (in absolute value) can be said to have the most potent relationship to the dependent variable, and this predictor will also have the greatest significance (smallest p value).

Unstandardized (B): Unstandardized relationships say that for a one-raw-unit increment on a predictor, the outcome variable increases (or decreases) by a number of its raw units corresponding to the B coefficient. The different predictor variables' unstandardized B coefficients are not directly comparable to each other, because the raw units for each are (usually) different. In other words, the largest B coefficient will not necessarily be the most significant; it must be judged in connection with its standard error (B/SE = t, which is used to test for statistical significance).