Regression Analysis

$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k$
[Figure: diagram of Y with predictors X1, X2, and X3]

STATISTICAL DATA ANALYSIS


COMMON TYPES OF ANALYSIS?
1. Compare Groups
a. Compare Proportions (e.g., Chi-Square Test, $\chi^2$)
H0: $P_1 = P_2 = P_3 = \dots = P_k$

b. Compare Means (e.g., Analysis of Variance)
H0: $\mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$

2. Examine Strength and Direction of Relationships
a. Bivariate (e.g., Pearson Correlation, r)
Between one variable and another: $Y = a + b_1 x_1$

b. Multivariate (e.g., Multiple Regression Analysis)
Between one dep. var. and each of several indep. variables, while holding all other indep. variables constant:
$Y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k$

Simple and Multiple Regression Analysis


What does regression analysis do?
Examines whether changes/differences in values of one variable (dependent variable Y) are linked to changes/differences in values of one or more other variables (independent variables X1, X2, etc.), while controlling for the changes in values of all other Xs.
E.g., the relationship between salary and gender for people who have the same levels of education, work experience, position level, seniority, etc.

The DV (Y) must be metric.
The IVs (Xs) must be either metric or dummy variables.

Central Question Addressed:
Is Y a function of X1, X2, etc.? How?
Is there a relationship between Y and X1, X2, etc. (in each case, after controlling for the effects of all other Xs)? In what way?
What is the relative impact of each X on Y, holding all other Xs constant (that is, all other Xs being equal)?

Simple and Multiple Regression Analysis


More specifically,
Do values of Y tend to increase/decrease as values of X1, X2, etc. increase/decrease?
If so, by how much?
And how strong is the connection/relationship between Xs and Y?
What % of differences/variations in Y values (e.g., income) among study subjects can be explained by (or attributed to) differences in X values (e.g., years of education, years of experience, etc.)?
[Figure: diagram of Y with predictors X1, X2, and X3]

Simple and Multiple Regression Analysis


NOTE: Once we can determine how values of Y change as a function of values of X1, X2, etc., we will also be able to predict/estimate the value of Y from specific values of X1, X2, etc.

$Y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k + \varepsilon$

Therefore, regression analysis, in a sense, is about ESTIMATING values of Y, using information about values of Xs.
Estimation, by definition, involves?
The objective?
To minimize error in estimation.
Or, to compute estimates that are as close to the true/actual values as possible.

Simple and Multiple Regression Analysis


QUESTION: What is the simplest way to obtain an estimate for some population characteristic (e.g., number of credit cards per U.S. household)?
ANSWER:
1. Select a representative sample from the population and
2. Compute the mean ($\bar{X}$) for that sample (e.g., compute the average number of CCs for the sample households).
Regression analysis can be viewed as a technique that often significantly improves the accuracy of estimation results relative to using the mean value.
So, suppose we were to estimate the number of credit cards for U.S. households, based on information from a random sample of, say, n = 8 families.

Simple and Multiple Regression Analysis


Estimating Number of Credit Cards*

i = Family Number; $y_i$ = Actual # of Credit Cards

i      $y_i$
1      4
2      6
3      6
4      7
5      8
6      7
7      8
8      10
Total  56

Estimate? $\bar{y} = \frac{\sum y_i}{n} = \frac{56}{8} = 7$

QUESTION: Can we determine how much error in estimation we are committing by using $\bar{y} = 7$ as our estimate for each of these households?

* This example was adapted from Hair, Black, Babin, Anderson, & Tatham (2006). Multivariate Data Analysis, 6th ed., Prentice Hall.

Simple and Multiple Regression Analysis


Estimating Number of Credit Cards

i = Family Number; $y_i$ = Actual # of Credit Cards; $\hat{y} = \bar{y}$ = Estimate for # of Credit Cards; $y_i - \bar{y}$ = Error in Estimation

i      $y_i$   $\hat{y} = \bar{y}$   $y_i - \bar{y}$
1      4       7                     -3
2      6       7                     -1
3      6       7                     -1
4      7       7                      0
5      8       7                     +1
6      7       7                      0
7      8       7                     +1
8      10      7                     +3
Total  56

$\bar{y} = \frac{\sum y_i}{n} = \frac{56}{8} = 7$

Let's now see all this graphically.

Simple and Multiple Regression Analysis


[Figure: actual # of credit cards (vertical axis, 0 to 10) for families F1 to F8, shown against the estimate $\hat{y} = \bar{y}$; F2 and F3 coincide.]
Let's spread the dots away from each other to see things more clearly!

Simple and Multiple Regression Analysis


[Figure: Graphic Representation. Actual # of credit cards (vertical axis, 0 to 10) for families F1 to F8, with the estimate $\hat{y} = \bar{y}$ and each family's estimation error marked.]
Can we determine the total estimation error for all 8 families?

Simple and Multiple Regression Analysis


What would be the total estimation error for all 8 families combined?
$\sum (y_i - \bar{y}) = (-3) + (-1) + (-1) + 0 + 1 + 0 + 1 + 3 = 0$
Solution?

Simple and Multiple Regression Analysis


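Estimating Number of Credit Cards

i = Family Number; $y_i$ = Actual # of Credit Cards; $\hat{y} = \bar{y}$ = Estimate for # of Credit Cards; $y_i - \bar{y}$ = Error in Estimation; $(y_i - \bar{y})^2$ = Errors Squared

i      $y_i$   $\hat{y} = \bar{y}$   $y_i - \bar{y}$   $(y_i - \bar{y})^2$
1      4       7                     -3                9
2      6       7                     -1                1
3      6       7                     -1                1
4      7       7                      0                0
5      8       7                     +1                1
6      7       7                      0                0
7      8       7                     +1                1
8      10      7                     +3                9
Total  56                             0                22

$\bar{y} = \frac{\sum y_i}{n} = \frac{56}{8} = 7$
$\sum (y_i - \bar{y}) = 0$
$\sum (y_i - \bar{y})^2 = 22$ = SST = Sum of Squares Total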

Simple and Multiple Regression Analysis


22 = SST = Index for total (combined) amount of estimation error for all families (observations) in the sample when using the mean as the estimate.
SST is also the sum of squared deviations from the mean.
o Remember the formula for computing Variance?
Objective in Estimation?
Minimize error, maximize precision.
Can we cut down the amount of estimation error (SST)? How?
Yes, we can, by using information about other variables suspected to be strong predictors of (strongly related to) # of credit cards possessed by families (e.g., family size, family income, etc.).
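
The baseline calculation can be sketched in a few lines of Python. This is only an illustration: it assumes the eight families hold 4, 6, 6, 7, 8, 7, 8, and 10 credit cards (the values implied by the mean of 7 and the error column), and the variable names are made up for the example.

```python
# Actual number of credit cards for sample families F1..F8
# (assumed values, consistent with the mean of 7 and the errors on the slides).
cards = [4, 6, 6, 7, 8, 7, 8, 10]

n = len(cards)
mean_cards = sum(cards) / n                  # baseline estimate: 56 / 8
errors = [y - mean_cards for y in cards]     # y_i minus the mean, per family
total_error = sum(errors)                    # positive and negative errors cancel out
sst = sum(e ** 2 for e in errors)            # Sum of Squares Total
variance = sst / (n - 1)                     # sample variance is SST divided by n - 1

print(mean_cards)    # 7.0
print(total_error)   # 0.0
print(sst)           # 22.0
```

Squaring the errors before adding them is what keeps the positive and negative deviations from cancelling, which is why SST, not the plain sum, serves as the index of total error.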

Simple and Multiple Regression Analysis


i = Family Number; y = Actual # of Credit Cards; x = Family Size

i   y    x
1   4    2
2   6    2
3   6    4
4   7    4
5   8    5
6   7    5
7   8    6
8   10   6

We now can attempt to estimate # of credit cards from the information on family size, rather than from its own mean.
Let's first see this graphically!

Simple and Multiple Regression Analysis

[Figure: scatterplot of # of credit cards (Y, 0 to 10) against family size for families F1 to F8 (e.g., F1 at x = 2, y = 4), with the green line $\hat{y} = \bar{y}$ marking the original (baseline) estimate.]
Plot actual numbers of CCs against family size.
QUESTION: Does the mean ($\bar{y}$) appear to represent the closest estimate of the actual c.c. numbers for our sample families?
That is, is the green line the best line to represent the location of estimates of # of CC for these families?

Simple and Multiple Regression Analysis

Generic equation for any straight line: $Y = a + bx$
[Figure: scatterplot of # of credit cards (Y, 0 to 10) against family size for families F1 to F8, with three candidate lines:]
$\hat{y} = a + 0x = \bar{y}$: Original (Baseline) Estimate
$\hat{y} = a_2 + b_2 x$
$\hat{y} = a_3 + b_3 x$: Regression Line (Line of Best Fit), the new improved location for CC estimates (see next slide)

Simple and Multiple Regression Analysis


[Figure: scatterplot of # of credit cards against family size for families F1 to F8, showing the original (baseline) estimate $\hat{y} = \bar{y}$, the regression line $\hat{y} = a + bx$ (line of best fit, the new improved location for CC estimates), and the estimation error $(y - \hat{y})$ for each family.]
The regression line will minimize $\sum (y - \hat{y})^2$ = total estimation error.
But how do we know the values of a and b in $\hat{y} = a + bx$ (the reg. line)?

EQUATION FOR REGRESSION LINE (LINE OF BEST FIT)
Values of a and b for the regression line $\hat{y} = a + bx$:

$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$

$a = \bar{y} - b\bar{x}$

Let's use the above formulas to compute the values of a and b for the regression line in our example.
We will need: $\bar{y}$, $\bar{x}$, $\sum (x - \bar{x})(y - \bar{y})$, and $\sum (x - \bar{x})^2$

Simple and Multiple Regression Analysis


We need: $\bar{y}$, $\bar{x}$, $\sum (x - \bar{x})(y - \bar{y})$, and $\sum (x - \bar{x})^2$

i = Family Number; y = Actual # of Credit Cards; x = Family Size

i      y    x    $x - \bar{x}$   $y - \bar{y}$   $(x - \bar{x})(y - \bar{y})$   $(x - \bar{x})^2$
1      4    2    -2.25           -3              6.75                           5.0625
2      6    2    -2.25           -1              2.25                           5.0625
3      6    4    -.25            -1              .25                            .0625
4      7    4    -.25             0              0                              .0625
5      8    5    .75              1              .75                            .5625
6      7    5    .75              0              0                              .5625
7      8    6    1.75             1              1.75                           3.0625
8      10   6    1.75             3              5.25                           3.0625
Total  56   34                                   17                             17.5

$\bar{y} = \frac{56}{8} = 7$
$\bar{x} = \frac{34}{8} = 4.25$
$\sum (x - \bar{x})(y - \bar{y}) = 17$
$\sum (x - \bar{x})^2 = 17.5$

Simple and Multiple Regression Analysis


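REGRESSION LINE (LINE OF BEST FIT): $\hat{y} = a + bx$

$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{17}{17.5} = .971$

$a = \bar{y} - b\bar{x} = 7 - .971(4.25) = 2.87$

$\hat{y} = 2.87 + .97x$

a = 2.87: Y-Intercept
b = .97: Regression Coefficient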
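
As a check on the hand calculation, here is a short Python sketch of the same two formulas. The function name `fit_line` and the variable names are illustrative choices, and the data are the eight (family size, credit cards) pairs assumed from the example.

```python
cards = [4, 6, 6, 7, 8, 7, 8, 10]   # y: actual number of credit cards
size = [2, 2, 4, 4, 5, 5, 6, 6]     # x: family size

def fit_line(x, y):
    """Return (a, b) for the least-squares line y_hat = a + b * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

a, b = fit_line(size, cards)
print(round(a, 2), round(b, 3))   # 2.87 0.971
```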

Simple and Multiple Regression Analysis


[Figure: scatterplot of # of credit cards (0 to 10) against family size for families F1 to F8, showing the original (baseline) estimate $\hat{y} = \bar{y}$ and the new improved estimates on the regression line $\hat{y} = 2.87 + .97x$.]
Can we tell how much estimation error we have committed by using the new regression line?
Yes, examine differences between our households' actual # of CCs and their new/regression estimates.

Simple and Multiple Regression Analysis


$\hat{y} = 2.87 + .97x$
E.g., for family 1: $\hat{y} = 2.87 + .97(2) = 4.81$

i = Family Number; y = Actual # of Credit Cards; x = Family Size; $\hat{y}$ = Regression Estimate; $y - \hat{y}$ = Error (Residual); $(y - \hat{y})^2$ = Errors Squared

i   y    x   $\hat{y}$   $y - \hat{y}$   $(y - \hat{y})^2$
1   4    2   4.81        -.81            .66
2   6    2   4.81        1.19            1.42
3   6    4   6.76        -.76            .58
4   7    4   6.76        .24             .06
5   8    5   7.73        .27             .07
6   7    5   7.73        -.73            .53
7   8    6   8.7         -.7             .49
8   10   6   8.7         1.3             1.69

$\sum (y - \hat{y})^2 = 5.486$
SSE = Sum of Squares Error (SS Residual)
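
The residual table can be reproduced with a brief Python sketch. It refits the line from the assumed example data rather than using the rounded coefficients, so the estimates may differ from hand-rounded values in the last digit; all names are illustrative.

```python
cards = [4, 6, 6, 7, 8, 7, 8, 10]   # y: actual number of credit cards
size = [2, 2, 4, 4, 5, 5, 6, 6]     # x: family size

n = len(cards)
x_bar = sum(size) / n
y_bar = sum(cards) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(size, cards))
     / sum((x - x_bar) ** 2 for x in size))
a = y_bar - b * x_bar

fitted = [a + b * x for x in size]                    # regression estimates
residuals = [y - f for y, f in zip(cards, fitted)]    # actual minus estimate
sse = sum(r ** 2 for r in residuals)                  # Sum of Squares Error

for i, (y, f, r) in enumerate(zip(cards, fitted, residuals), start=1):
    print(f"F{i}: actual={y}  estimate={f:.2f}  residual={r:+.2f}")
print(f"SSE = {sse:.3f}")   # SSE = 5.486
```

As with deviations from the mean, the residuals themselves sum to zero, so only their squares give a usable measure of remaining error.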

Simple and Multiple Regression Analysis


Total Baseline Error using the mean (SS Total): 22.0
New or Remaining Error (SS Error or SS Residual): 5.486 ~ 5.5

QUESTION: How much of the original estimation error have we explained away (eliminated) by using the regression model (instead of the mean)?
22 - 5.486 = 16.514 (SS Regression or SS Explained)

[Figure: total variation in Y = 22, split into 16.5 explained by X1 and 5.5 left unexplained]

QUESTION: What % of estimation error have we explained (eliminated) by using the regression model?
$R^2$ = 16.514 / 22 = .751 or 75%. What is this called?
% of differences in # of CCs among households that is explained by differences in their family size.

What does the remaining 25% represent?
Percent of variation (differences) in number of credit cards owned by families that can be accounted for by: (a) all other potential predictors not included in the model, beyond family size, and (b) unexplainable random/chance variations.

Simple and Multiple Regression Analysis


$R^2$ = SS Regression / SS Total = 16.5 / 22 = 75%
$R^2$ is a measure of our success regarding accuracy of our estimation effort.
$R^2$ = % of estimation error that we have been able to explain away by using the regression model, instead of using the mean.
$R^2$ indicates how much better we can predict Y from information about Xs, rather than from using its own mean.
$R^2$ = % of differences (variations) in Y values that is explained by (attributable to) differences in X values.

Note: When dealing with only two variables (a single X and Y):
$r = \sqrt{R^2} = \sqrt{\frac{16.514}{22}} = \sqrt{.75} = .866$ = Pearson Correlation of Y with X1 (NOT controlling for any other var.)
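
A small Python sketch, using the same assumed example data and illustrative names, shows that the R² obtained from the sums of squares matches the square of the Pearson correlation computed directly from its own formula.

```python
import math

cards = [4, 6, 6, 7, 8, 7, 8, 10]   # y: actual number of credit cards
size = [2, 2, 4, 4, 5, 5, 6, 6]     # x: family size

n = len(cards)
x_bar = sum(size) / n
y_bar = sum(cards) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(size, cards))
sxx = sum((x - x_bar) ** 2 for x in size)
sst = sum((y - y_bar) ** 2 for y in cards)     # SS Total

ssr = sxy ** 2 / sxx       # SS Regression for a one-predictor model
sse = sst - ssr            # SS Error (Residual)
r_squared = ssr / sst      # share of baseline error explained

# Pearson correlation from its definition; unlike sqrt(R^2) it carries the sign.
r = sxy / math.sqrt(sxx * sst)

print(round(r_squared, 3))   # 0.751
print(round(r, 3))           # 0.866
```

Taking the square root of R² recovers only the size of r; the direction of the relationship comes from the sign of the slope b.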

Let's now examine all this graphically!

Simple and Multiple Regression Analysis


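[Figure: scatterplot of # of credit cards (0 to 10) against family size for families F1 to F8, with the regression line (new improved estimates) $\hat{y} = 2.87 + .97x$ and the original (baseline) estimate $\hat{y} = \bar{y}$. For family F1, the original baseline error $(y - \bar{y})$ is split into the part explained by the regression model $(\hat{y} - \bar{y})$ and the new error $(y - \hat{y})$, which is unexplained (residual).]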

Simple and Multiple Regression Analysis


5.5 = SSE = The amount of estimation error for the 8 sample families when using simple regression (i.e., a regression model that includes only information about family size).
Can we reduce the amount of estimation error (SSE) to an even lower level and thus improve the estimation process? How?
Yes, by adding information on a second variable suspected to be strongly related to # of credit cards (e.g., family income, X2).

Simple and Multiple Regression Analysis


i = Family Number; $y_i$ = Actual # of Credit Cards; $x_1$ = Family Size; $x_2$ = Family Income ($000)

i   $y_i$   $x_1$   $x_2$
1   4       2       14
2   6       2       16
3   6       4       14
4   7       4       17
5   8       5       18
6   7       5       21
7   8       6       17
8   10      6       25

We now can attempt to estimate # of CCs from our information on family size and family income!
Our regression model will now be a linear plane, rather than a straight line!
Generic equation for a linear plane: $\hat{y} = a + b_1 x_1 + b_2 x_2$

Let's examine the regression plane for our example graphically.
[Figure: 3-D plot of Y = # of credit cards (0 to 12) against X1 = family size and X2 = family income, showing actual values and regression estimates on the plane $\hat{y} = a + b_1 x_1 + b_2 x_2$.]

Formulas are available for computing values of a, b1 and b2.
MULTIPLE REGRESSION MODEL FOR OUR EXAMPLE:
$\hat{y} = .482 + .63x_1 + .216x_2$

Let's now see how much error in estimation we are committing by using this multiple regression model.

Simple and Multiple Regression Analysis


$\hat{y} = .482 + .63x_1 + .216x_2$
E.g., for family 1: $\hat{y} = .482 + .63(2) + .216(14) = 4.77$

i = Family Number; y = Actual # of Credit Cards; $x_1$ = Family Size; $x_2$ = Family Income ($000); $\hat{y}$ = Regression Estimate; $y - \hat{y}$ = Error (Residual); $(y - \hat{y})^2$ = Errors Squared

i   y    $x_1$   $x_2$   $\hat{y}$   $y - \hat{y}$   $(y - \hat{y})^2$
1   4    2       14      4.77        -.77            .59
2   6    2       16      5.20        .80             .64
3   6    4       14      6.03        -.03            .00
4   7    4       17      6.68        .32             .10
5   8    5       18      7.53        .47             .22
6   7    5       21      8.18        -1.18           1.39
7   8    6       17      7.95        .05             .00
8   10   6       25      9.67        .33             .11

SSE = Sum of Squares Error (Residual) = $\sum (y - \hat{y})^2 = 3.05$

Unique (additional) contribution of X2 (family income) beyond X1 = ? 5.5 - 3.05 = 2.45
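
The slides note that formulas exist for a, b1, and b2 without listing them. One standard way to get them for exactly two predictors is to solve the two normal equations built from centered sums of squares and cross-products. The Python sketch below does that for the assumed example data; the helper `cross` and all variable names are illustrative, and this is not necessarily how SPSS computes its output.

```python
cards = [4, 6, 6, 7, 8, 7, 8, 10]          # y: actual number of credit cards
size = [2, 2, 4, 4, 5, 5, 6, 6]            # x1: family size
income = [14, 16, 14, 17, 18, 21, 17, 25]  # x2: family income ($000)

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Sum of cross-products of deviations from the means of u and v."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))

s11, s22, s12 = cross(size, size), cross(income, income), cross(size, income)
s1y, s2y = cross(size, cards), cross(income, cards)

# Solve the 2x2 normal equations by Cramer's rule.
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
a = mean(cards) - b1 * mean(size) - b2 * mean(income)

fitted = [a + b1 * x1 + b2 * x2 for x1, x2 in zip(size, income)]
sse = sum((y - f) ** 2 for y, f in zip(cards, fitted))

print(round(a, 3), round(b1, 2), round(b2, 3))   # 0.482 0.63 0.216
print(round(sse, 2))                             # 3.05
```

With more than two predictors the same idea is normally expressed in matrix form and left to statistical software.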

Simple and Multiple Regression Analysis


The MULTIPLE REGRESSION MODEL FOR OUR EXAMPLE:

$\hat{y} = .482 + .63x_1 + .216x_2$

a = .482: Y-Intercept
b1 and b2 = Regression Coefficients

(NOTE: Only when all Xs can meaningfully take on a value of zero will the intercept have a meaningful/direct/practical interpretation. Otherwise, it is simply an aid in increasing accuracy of estimation.)

0.63: Among families of the same income, an increase in family size by one person would, on average, result in .63 more credit cards.
0.216: Among families of the same size, an income increase of $1,000 results in an average increase of about 0.2 credit cards.
bs represent the effect of each X on Y when all other Xs are controlled for/held constant/taken into account, i.e., after impacts of all other variables are accounted for (remember the high blood pressure-hearing problem connection?)

Simple and Multiple Regression Analysis


The MULTIPLE REGRESSION MODEL FOR OUR EXAMPLE:

$\hat{y} = .482 + .63x_1 + .216x_2$

SST = 22
SSE = 3.05

What is our new $R^2$?
SS Regression = 22 - 3.05 = 18.95
$R^2$ = 18.95 / 22 = .861 or 86%
Percent of differences in households' number of CCs that is explained by differences in family size and family income.

The Remaining 14%? (3.05 / 22 = .14)
Percent of variation in number of credit cards that can be accounted for by (a) all other relevant factors not included in the model, beyond family size and income, and (b) unexplainable random/chance variations.

[Figure: Venn diagram of Y = # of CC overlapping X1 = Family Size and X2 = Family Income. Area a is explained by X1 only, b by X2 only, c by both X1 and X2, and d is unexplained.]
Total Variation/Error in Y = SS Total = a + b + c + d = 22

X1 = Family Size: $\hat{y} = 2.87 + .97X_1$
SSR = a + c = 16.5
$r^2$ = ? $R^2$ = (a + c) / (a + b + c + d) = 16.5 / 22 = 0.75
What do we call the square root of this? Pearson/simple correlation of Y with X1 (not controlling for X2):
$r_{yx_1} = \sqrt{\frac{a + c}{a + b + c + d}} = \sqrt{\frac{16.5}{22}} = \sqrt{0.75} = 0.866$

X2 = Family Income: $\hat{y} = -0.063 + .398X_2$
SSR = b + c = 15.12
$r^2$ = (b + c) / (a + b + c + d) = 15.12 / 22 = 0.687
Pearson/simple correlation of Y with X2 (not controlling for X1):
$r_{yx_2} = \sqrt{\frac{b + c}{a + b + c + d}} = \sqrt{\frac{15.12}{22}} = 0.829$

X1 = Family Size and X2 = Family Income together: $\hat{y} = .482 + .63x_1 + .216x_2$
Graphically = ?
NOTE: c is explained by both X1 and X2
SSR = a + b + c = 18.95
SST = a + b + c + d = 22
$R^2$ = SSR / SST = (a + b + c) / (a + b + c + d) = 18.95 / 22 = 86%
SSE = ?
SSE = d = 22 - 18.95 = 3.05
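
The four areas can be backed out from the three regression sums of squares. The Python sketch below does this for the assumed example data. The names `area_a` to `area_d` are illustrative labels for the regions in the diagram, and the last two lines use the standard definitions of squared part (semipartial) and partial correlations, which the slides mention but do not define.

```python
cards = [4, 6, 6, 7, 8, 7, 8, 10]          # y
size = [2, 2, 4, 4, 5, 5, 6, 6]            # x1
income = [14, 16, 14, 17, 18, 21, 17, 25]  # x2

def cross(u, v):
    """Sum of cross-products of deviations from the means of u and v."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))

s11, s22, s12 = cross(size, size), cross(income, income), cross(size, income)
s1y, s2y, sst = cross(size, cards), cross(income, cards), cross(cards, cards)

ssr_x1 = s1y ** 2 / s11                    # a + c: family size alone
ssr_x2 = s2y ** 2 / s22                    # b + c: family income alone
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
ssr_both = b1 * s1y + b2 * s2y             # a + b + c: both predictors

area_a = ssr_both - ssr_x2                 # explained by X1 only
area_b = ssr_both - ssr_x1                 # explained by X2 only
area_c = ssr_x1 - area_a                   # shared by X1 and X2
area_d = sst - ssr_both                    # unexplained (SSE)

part_sq_x2 = area_b / sst                  # squared part correlation of X2
partial_sq_x2 = area_b / (area_b + area_d) # squared partial correlation of X2

print(round(area_a, 2), round(area_b, 2), round(area_c, 2), round(area_d, 2))
# 3.83 2.44 12.68 3.05
```

The unique contribution of X2 comes out as 2.44 here rather than 2.45 because the slide subtracts the rounded figures 5.5 and 3.05.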

Simple and Multiple Regression Analysis


Remember, for the multiple regression model $\hat{y} = .482 + .63x_1 + .216x_2$ (e.g., for family 1: $\hat{y} = .482 + .63(2) + .216(14) = 4.77$):
SSE = Sum of Squares Error (Residual) = $\sum (y - \hat{y})^2 = 3.05$
Unique (additional) contribution of X2 = 5.5 - 3.05 = 2.45

Exercise 1: Redo the credit card analysis with SPSS.
First, Correlations and Simple Regression.
Next, Multiple Regression (also ask for part and partial correlations).

SPSS CREDIT CARD FILE

Simple and Multiple Regression Analysis


EXERCISE 2:

Using the gss_2 data file, we are interested in understanding the role that the following demographics (age, educ, sibs, agewed), as well as respondent income (rincmdol), job satisfaction (satjob_2), and marriage satisfaction (hapmar_2), play in determining/predicting one's general happiness (happy_2).
We also wish to know which of the above variables is the strongest predictor of general happiness (Standardized Reg. Coefficients).

Use the gss_2 data file and conduct the appropriate analysis.
NOTE:
satjob_2 is coded as:
1 = Very Dissatisfied
2 = A Little Dissatisfied
3 = Pretty Satisfied
4 = Very Satisfied

hapmar_2 is coded as:
1 = Not Too Happy
2 = Pretty Happy
3 = Very Happy
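
Standardized coefficients put predictors measured in different units on a common scale, which is what allows a "strongest predictor" comparison. The gss_2 file is not reproduced here, so the Python sketch below illustrates the idea on the credit card data instead, using the standard conversion beta = b × (standard deviation of X / standard deviation of Y). All names are illustrative.

```python
import math

cards = [4, 6, 6, 7, 8, 7, 8, 10]          # y
size = [2, 2, 4, 4, 5, 5, 6, 6]            # x1
income = [14, 16, 14, 17, 18, 21, 17, 25]  # x2

def cross(u, v):
    """Sum of cross-products of deviations from the means of u and v."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))

s11, s22, s12 = cross(size, size), cross(income, income), cross(size, income)
s1y, s2y, syy = cross(size, cards), cross(income, cards), cross(cards, cards)

# Unstandardized coefficients from the two-predictor normal equations.
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

# The (n - 1) terms cancel, so the ratio of standard deviations
# equals the square root of the ratio of sums of squares.
beta1 = b1 * math.sqrt(s11 / syy)
beta2 = b2 * math.sqrt(s22 / syy)

print(round(beta1, 3), round(beta2, 3))
```

In this small example family size has the larger standardized coefficient, even though the raw coefficients alone cannot be compared because the two predictors are in different units.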

Interpreting Regression Results


Ho: $R^2$ = 0. That is, there is NO RELATIONSHIP between the DV and ANY OF the IVs included in the regression model.
1. Is the overall F significant (i.e., p < 0.05)?
No: Don't reject Ho; no indep. variable has a sig. relationship with the dep. variable. Stop.
Yes: Reject Ho; one or more independent variables are significantly related to the dep. variable.
2. Which independent variable(s) have significant relationships with the dep. var.? In the Coefficients table, look up the result of the t-test for each indep. variable's regression coefficient (b). Ho for the t-test of a given variable hypothesizes that the coefficient b = 0. That is, there is no relationship between the corresponding independent variable and the dep. variable. If a t-test's p < 0.05, reject the null and conclude that the corresponding variable has a significant relationship with the dep. variable.
3. Look up the sign of the regression coefficient (b) ONLY FOR those indep. variables that are found to have a significant relationship with the dependent variable (i.e., those with p < 0.05), and state your conclusions accordingly.
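
To make step 1 concrete, here is a hedged Python sketch of the overall F-test for the credit card model with two predictors, using the slide values SST = 22 and SSE = 3.05. The F formula is the standard one. The p-value line relies on a closed form that holds only when the model has exactly 2 degrees of freedom; in any other case the p-value would come from an F table or statistical software.

```python
n, k = 8, 2              # sample size and number of predictors (credit card example)
sst, sse = 22.0, 3.05    # SS Total and SS Error from the slides
ssr = sst - sse          # SS Regression

df_model = k
df_error = n - k - 1

# Overall F: explained variation per model df over unexplained variation per error df.
f_stat = (ssr / df_model) / (sse / df_error)

# Closed-form upper-tail probability of F; valid ONLY for df_model == 2.
p_value = (1 + df_model * f_stat / df_error) ** (-df_error / 2)

reject_h0 = p_value < 0.05
print(round(f_stat, 2), round(p_value, 4), reject_h0)
```

Here the overall F is significant, so the analysis would move on to the t-tests in step 2.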

Simple and Multiple Regression Analysis


Regression Analysis Using Categorical Variables:
General Rule: Categorical variables should NOT be used in multiple regression since interpretation of the variable's regression coefficient becomes nonsensical.
EXAMPLE: Income = 24000 + 1400 Political Party
Coded: Democrat = 1, Republican = 2, Independent = 3, Other = 4

Exception to the above Rule: Dummy variables (i.e., categorical variables representing only two groups, such as gender, when coded as 0 and 1) can be used as independent variables in regression analysis. The reason is that a dummy variable's values (0, 1) can go up or down by only 1 unit, signifying a change from one group to another.
EXAMPLE: Income = 24000 + 1400 gender
Coded: Female = 0, Male = 1
Meaning?

Note: A dummy variable's regression coefficient represents the average difference in the value of the dependent variable between the two groups represented by the dummy variable.
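
The note about dummy coefficients can be checked numerically. The Python sketch below uses six made-up incomes, chosen only so that the group means are $24,000 and $25,400; the data are hypothetical and the names are illustrative.

```python
# Hypothetical incomes, invented for illustration only.
gender = [0, 0, 0, 1, 1, 1]                           # Female = 0, Male = 1
income = [22000, 24000, 26000, 24400, 25400, 26400]

n = len(gender)
x_bar = sum(gender) / n
y_bar = sum(income) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(gender, income))
     / sum((x - x_bar) ** 2 for x in gender))
a = y_bar - b * x_bar

# Group means computed directly, for comparison with a and b.
female_mean = sum(y for x, y in zip(gender, income) if x == 0) / gender.count(0)
male_mean = sum(y for x, y in zip(gender, income) if x == 1) / gender.count(1)

print(a, b)                                   # 24000.0 1400.0
print(female_mean, male_mean - female_mean)   # 24000.0 1400.0
```

With a single dummy predictor, the intercept equals the mean of the group coded 0 and the coefficient equals the difference between the two group means.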

Simple and Multiple Regression Analysis


EXAMPLE 1:
Coded: Female = 0, Male = 1
Income = 24000 + 1400 gender
Average income of females is $24,000.
Males on average make $1,400 more than females.

MULTIPLE REGRESSION EXAMPLE 2:
Coded: Female = 0, Male = 1
Income = 12000 + 1000 Education Years + 800 Gender
Meaning?
12000: Average income of females with no education is $12,000.
1000: Among people of the same gender, every additional year of education results in an average additional income of $1,000.
800: Males make, on average, $800 more in comparison with females who have the same number of years of education.
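
Plugging values into the Example 2 equation shows each interpretation directly. The function name `predicted_income` is an illustrative choice; the coefficients are the ones given in the example.

```python
def predicted_income(education_years, gender):
    """Example 2 model: Income = 12000 + 1000 * education + 800 * gender (Female = 0, Male = 1)."""
    return 12000 + 1000 * education_years + 800 * gender

# Intercept: a female (gender = 0) with no education.
print(predicted_income(0, 0))                              # 12000

# Education coefficient: one more year, gender held constant.
print(predicted_income(13, 0) - predicted_income(12, 0))   # 1000

# Gender coefficient: male vs. female with the same education.
print(predicted_income(12, 1) - predicted_income(12, 0))   # 800
```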

Exercise 3: Suppose we are interested in knowing what role, if any, the following demographic characteristics play in determining one's income (rincmdol):
Age,
Sex_Dummy (0=male, 1=female),
Age first married (agewed),
Years of education completed (educ), and
Political party affiliation, republic (0=Democrat, 1=Republican).
Use the gss_2 data file and conduct the appropriate analysis.

Exercise 4: Suppose we are interested in knowing what role, if any, demographic characteristics (i.e., age, sex_Dummy, educ, sibs, agewed, incomdol), as well as job satisfaction (satjob_2) and marriage satisfaction (hapmar_2), play in determining one's overall happiness in life (happy_2).
Use the gss_2 data file and conduct the appropriate analysis.

Assignment 5
Data file Salary.sav contains information about 474 employees hired by a Midwestern bank between 1969 and 1971 (NOTE: Due to SPSS site license restrictions, this hyperlink will not work if you are off campus). Of the 474 employees, 258 were men, 216 women, 370 white, and 104 non-white. The bank was subsequently involved in EEOC litigation; the bank was accused of gender and race discrimination in its hiring and compensation practices. The two issues that were of particular interest in the litigation were alleged gender and racial inequalities not only in the bank's beginning salaries (variable salbeg), but also in its later salaries (variable salnow).

1. Print, examine, and interpret correlation coefficients between beginning salary (salbeg) and age in years (age), education in years (edlevel), employment category or job classification level, rated from 1=lowest to 8=highest (jobcat), and work experience in months (work).

2. Conduct the appropriate analysis to see: (a) What role did each of the variables age, education (edlevel), employment category (jobcat), and work experience (work) play, holding all other variables constant, in determining the bank's beginning salaries? For example, what was the differential pay for one additional year of education among new hires who otherwise had the same age, employment category, and work experience? (b) Which of the above demographic characteristics had the strongest influence on beginning pay? How can you tell? (c) What percent of the differences in employees' beginning salaries can be explained by/attributed to differences in all of the above characteristics?

3. Now conduct the appropriate analysis to indicate, holding all other variables constant, what role gender (sex, male=0, female=1) played in determining beginning salaries at the bank. That is, what was the differential beginning pay between male and female employees who otherwise had the same age, education, employment category, and work experience? Does this evidence support the charges of gender discrimination in the bank's practices regarding initial compensation?

4. During litigation, it was charged that the bank's unfair compensation practices had continued beyond its initial salary decisions. That is, the prosecution claimed that with time, the beginning salary disparities between men and women not only did not shrink, but widened further. Conduct the appropriate analysis to indicate (a) everything else being equal, what role gender played in determining employees' later salaries at the bank (salnow). That is, what was the average differential pay between male and female employees who otherwise had the same age, education, employment category, work experience, and job seniority (variable time represents seniority in terms of number of months employed at the bank)? (b) Compare the later pay disparities you have just identified with the beginning pay disparities you had found in question 3 above to explain if the evidence supports the prosecution's charges of continued gender discrimination beyond initial salary decisions, resulting in widening disparities in later pay.

NOTE: For each question, provide thorough explanations on corresponding pages and parts of your printout.

Simple and Multiple Regression Analysis

QUESTIONS OR COMMENTS?
