
MULTIPLE LINEAR REGRESSION

JUST A LITTLE POKE

GERÓNIMO MALDONADO-MARTÍNEZ, RPT, MPH, PhD(c)


DIANA M. FERNÁNDEZ-SANTOS, MS, EdD

Sir Francis Galton


Widely promoted regression techniques.
Cousin of Charles Darwin.

Making Sense of Regression


My emphasis here is on

Understanding the key elements of regression


Requirements
Application
Limitations

Regression Is A Powerful Analytical Technique

Enables researchers to do two things:
1. Determine the strength of the relationship
The r-squared value

Regression Is A Powerful Analytical Technique
2. Determine the impact of the independent
variable(s) on the dependent variable
The regression coefficient is the predicted
change in the dependent variable for every one
unit of change in the independent variable
Collectively, the regression coefficients enable
the researchers to make estimates of how the
dependent variable will change using different
scenarios for the independent variables

Assumptions
Variables are normally distributed.
Variables are continuous.
Linear relationship between the
independent and dependent variables.
Homoscedasticity (constant error variance).

Multiple Regression Equation


Y = a + b1X1 + b2X2 + ... + bkXk + e
Where:
Y = predicted value of the dependent variable
a = the constant or Y intercept (where the
imaginary line crosses the Y axis)
b = the regression coefficients
X = the independent variables
e = error
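The equation can be estimated by ordinary least squares. A minimal sketch in Python with made-up data (the variable names and the generating coefficients 5, 2, and 3 are illustrative assumptions, not from the slides):

```python
import numpy as np

# Made-up data: two independent variables and a dependent variable
# generated as Y = 5 + 2*X1 + 3*X2 + random error.
rng = np.random.default_rng(0)
X1 = rng.uniform(0, 10, 50)
X2 = rng.uniform(0, 10, 50)
Y = 5 + 2 * X1 + 3 * X2 + rng.normal(0, 0.5, 50)

# Design matrix: a column of ones for the constant "a", then X1 and X2.
X = np.column_stack([np.ones_like(X1), X1, X2])

# Ordinary least squares recovers [a, b1, b2].
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
a, b1, b2 = coef
print(a, b1, b2)  # close to the generating values 5, 2, 3
```

The fitted b1 is the predicted change in Y for a one-unit change in X1, holding X2 constant.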

Theoretical Linear Model

Linear Regression Model Types
Example: X = size of house, Y = cost of house
Deterministic Model: an equation or set of equations
that allow us to fully determine the value of the
dependent variable from the values of the
independent variables.
Probabilistic Model: a method used to capture the
randomness that is part of a real-life process.

R-square And Its Companions


r = correlation coefficient (overall fit or measure of
association; also called Pearson's r, the Pearson
Product-Moment Correlation coefficient, or the
zero-order coefficient).
r-square = proportion of the explained variance of the
dependent variable (also called the coefficient of
determination)
1 minus r-square = proportion of unexplained variance
in the dependent variable
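All three quantities come straight from the correlation; a quick sketch with made-up paired data (the numbers are illustrative, not from the slides):

```python
import numpy as np

# Made-up paired observations (e.g., hours studied vs. exam score).
x = np.array([2, 4, 5, 7, 8, 10], dtype=float)
y = np.array([55, 60, 70, 72, 80, 89], dtype=float)

r = np.corrcoef(x, y)[0, 1]   # Pearson's r (zero-order coefficient)
r_sq = r ** 2                 # coefficient of determination
unexplained = 1 - r_sq        # proportion of unexplained variance

print(f"r = {r:.3f}, r-square = {r_sq:.3f}, unexplained = {unexplained:.3f}")
```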

Dirty Interpretation

Example: Researchers look at GRE scores and academic


performance in graduate school as measured by grade point
average
The hypothesis is that people who have high GRE scores will
also have high GPAs
From an admissions committee perspective: the belief
that GRE scores are a good predictor of future academic
success and are, therefore, a good criterion for admission
decisions
The researchers report an r-squared of .2
GRE scores explain 20 percent of the variance in GPAs
This means that 80 percent of the variance in GPAs is
explained by other factors.

A quick example

Don't get lost

[Scatter plot: X axis = Age of Planes (5, 10, 20 years); Y axis = Plane Maintenance Costs ($500, $1,000); a line marks the predicted values if the relationship were perfect.]

How It Is Applied
Analysts collect data over the past two years and
crunch it. The computer gives these results:
Y = 100 + .020X
The constant is 100:
If they do not fly at all, the computer estimates
there is still a cost of $100
The .020 is the regression coefficient:
This gets interpreted as: for every mile flown, there
is $.02 change in maintenance costs.

How It Is Applied
Y = 100 + .020X
Interpreting the regression coefficient:
For every mile flown, the maintenance costs
goes up by 2 cents.
For every 100 miles flown, costs are $2
For every 1,000 miles, the costs are $20
For every 100,000 miles, the costs are
$2,000
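The whole interpretation is just plugging mileages into the fitted line. A minimal sketch (note that 0.020 × 100,000 miles works out to $2,000):

```python
def maintenance_cost(miles):
    """Predicted maintenance cost from the fitted line Y = 100 + 0.020X."""
    return 100 + 0.020 * miles

for miles in (100, 1_000, 100_000):
    extra = maintenance_cost(miles) - 100  # cost above the $100 constant
    print(f"{miles:>7,} miles -> ${extra:,.2f} over the base cost")
```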

Making Maintenance Cost Estimates


They can then solve the equation:
Assuming 100,000 miles will be flown, how much
will they need to budget for maintenance?
100,000 multiplied by .020 = $2,000
Y = 100 + $2,000 + error

The estimated maintenance cost:

$2,100 + error

Practicality

Simple Regression: Another Example


Hypothesis: If schools have a higher
percentage of poor children, then they
will have lower test scores.
A regression analysis shows:
A regression coefficient of -.04
An r-squared value of .25

Even More
Interpretation?
Regression coefficient: For every one-percentage-point
increase in the percent of children in poverty within a
school, the average test score goes down by .04 points
R-squared: 25% of the variance in test scores is explained
by the percent of children in poverty in the school
Researchers will ask: what other factors might
explain differences in test scores in the schools?
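Using the slide's fitted coefficient, the predicted gap between two schools is just the coefficient times the difference in poverty rates. A small sketch (the 50% and 10% inputs are made-up values):

```python
POVERTY_COEF = -0.04  # slide's coefficient: score points per percentage point of poverty

def expected_score_gap(poverty_pct_a, poverty_pct_b):
    """Predicted difference in average test score between two schools
    that differ only in percent of children in poverty."""
    return POVERTY_COEF * (poverty_pct_a - poverty_pct_b)

gap = expected_score_gap(50, 10)
print(gap)  # a 40-point poverty gap predicts about a 1.6-point lower average score
```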

Multiple Regression Equation


Y = a + b1X1 + b2X2 + b3X3 + b4X4 + e
Y = dependent variable
X1 = independent variable 1,
controlling for X2, X3, X4
X2 = independent variable 2,
controlling for X1, X3, X4
X3 = independent variable 3,
controlling for X1, X2, X4
X4 = independent variable 4,
controlling for X1, X2, X3

Multiple Regression Equation


It has the same basic structure as simple
regression
Y is still the dependent variable
There is still a constant (a) and some
amount of error (e) that the computer
calculates
But there are more Xs to represent the
multiple independent variables

Multiple Regression:
An Example
Hypothesis: Income is a function of education
and seniority.
We suggest that income (the dependent
variable) will increase as both education and
seniority increase (two independent
variables)
Y (Income) = a + b1(education) + b2(seniority) +
error

Multiple Regression: Interpretation


Results:
Y = 6000 + 400X1 (education) + 200X2 (seniority)
R-square = .67
First look at the R-square: it shows a strong
relationship, so analysis can continue
Partial regression coefficients:
For every year of education, holding seniority
constant, income increases by $400.
For every year of seniority, holding education
constant, income increases by $200.

Multiple Regression: Application


Estimate the income of someone who has
10 years of education and
5 years of seniority
We solve the regression equation:
Multiply the 10 years of education by the regression
coefficient of 400: equals 4,000
Multiply the 5 years of seniority by the regression
coefficient of 200: equals 1,000
Put it together with the constant and you have
Y = 6000 + 400(10) + 200(5) + error
Y = $11,000 + error
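The arithmetic above as a tiny Python sketch of the slide's fitted equation:

```python
def predicted_income(education_years, seniority_years):
    """Plug values into the fitted equation Y = 6000 + 400*X1 + 200*X2."""
    return 6000 + 400 * education_years + 200 * seniority_years

print(predicted_income(10, 5))  # 11000 (plus error)
```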

Demystifying the monster


[Diagram: statistics turn raw data into information.]

Multivariate regression pitfalls


Multicollinearity
Residual confounding
Overfitting

Multicollinearity
Multicollinearity arises when two variables
that measure the same thing or similar things
(e.g., weight and BMI) are both included in a
multiple regression model; they will, in
effect, cancel each other out and generally
destroy your model.
VIF: high is bad (values above about 10 are a
common red flag)
Tolerance: low is bad
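A predictor's VIF is 1 / (1 - R²) from regressing that predictor on the other predictors (statsmodels ships a `variance_inflation_factor` helper, but it is easy to compute by hand). A self-contained sketch with simulated weight/height/BMI data, where the near-redundant BMI column produces a large VIF:

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress X[:, j] on the other columns
    (plus a constant) and return 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    y_hat = A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r_sq = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - r_sq)

# Simulated data: BMI is (almost) a function of weight and height,
# so including all three invites multicollinearity.
rng = np.random.default_rng(1)
weight = rng.normal(70, 10, 200)    # kg
height = rng.normal(1.7, 0.1, 200)  # m
bmi = weight / height**2 + rng.normal(0, 0.3, 200)
X = np.column_stack([weight, height, bmi])

print([round(vif(X, j), 1) for j in range(3)])  # BMI's VIF lands far above 10
```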

Residual Confounding
You cannot completely wipe out confounding simply
by adjusting for variables in multiple regression
unless variables are measured with zero error (which
is usually impossible).
Example: meat eating and mortality
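The point can be demonstrated by simulation; a hedged sketch (the variable names echo the meat-eating example, but the data and effect sizes are invented): even after adjusting for a noisily measured confounder, a spurious exposure effect remains.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
confounder = rng.normal(0, 1, n)                # true confounder (e.g., lifestyle)
exposure = confounder + rng.normal(0, 1, n)     # "meat eating", driven by confounder
outcome = 2 * confounder + rng.normal(0, 1, n)  # "mortality": no direct exposure effect
measured = confounder + rng.normal(0, 1, n)     # confounder measured with error

def adjusted_slope(x, z, y):
    """Coefficient of x in a least-squares regression of y on [1, x, z]."""
    A = np.column_stack([np.ones(len(x)), x, z])
    return np.linalg.lstsq(A, y, rcond=None)[0][1]

b_true = adjusted_slope(exposure, confounder, outcome)  # near 0: confounding removed
b_meas = adjusted_slope(exposure, measured, outcome)    # residual confounding remains
print(round(b_true, 2), round(b_meas, 2))
```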

A clean example in PRISM


A real linear regression output


