
Regression Technique

Regression

Web support

Simple regression: a reminder

Multiple regression: an introduction

Reporting regression analyses

Choosing regressors (predictor variables)

Choosing a regression model

Model checking - residuals

Simple Regression

Establish equation for the best-fit line:


y = bx + a

Best-fit line same as Regression line

b is the regression coefficient for x

x is the predictor or regressor variable for y

Multiple Regression

Establish equation for the best-fit line:


y = b1x1 + b2x2 + b3x3 + a

Where:
b1 = regression coefficient for variable x1
b2 = regression coefficient for variable x2
b3 = regression coefficient for variable x3
a = constant

Multiple Regression
R2 - Goodness of fit
Model Summary

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .721(a)    .520       .399                17.70134

a. Predictors: (Constant), AGE, GENDER, INCOME

For multiple regression, R2 will get larger (or at least never smaller) every time another independent variable (regressor/predictor) is added to the model

Add work stress to model?

New regressor may only provide a tiny improvement in amount of variance in the data explained by the model

Need to establish the added value of each additional regressor in predicting the DV

Multiple Regression
R2adj - adjusted R-square

Takes into account the number of regressors in the model

Calculated as:
R2adj = 1 - (1-R2)(N-1)/(N-n-1)
where:
N = number of data points
n = number of regressors

You don't need to memorise this equation, but note that R2adj will always be smaller than R2 (for any model with at least one regressor and R2 below 1)
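As a rough check on the formula, the sketch below recomputes the adjusted R-square from the Model Summary figures (R2 = .520, three regressors). N = 16 is an assumption inferred from the total df of 15 in the ANOVA table; the function name is illustrative only.

```python
def adjusted_r_squared(r2: float, n_points: int, n_regressors: int) -> float:
    """R2adj = 1 - (1 - R2)(N - 1)/(N - n - 1)."""
    return 1 - (1 - r2) * (n_points - 1) / (n_points - n_regressors - 1)


# R2 = .520 and n = 3 regressors come from the Model Summary table.
# N = 16 is inferred (assumption) from the total df of 15 in the ANOVA table.
r2_adj = adjusted_r_squared(0.520, n_points=16, n_regressors=3)
print(f"R2adj = {r2_adj:.3f}")
```

The result agrees with the .399 reported by SPSS to within rounding of the printed R2.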

How well does a model explain the variation in the dependent variable?

Effectiveness vs Efficiency

Effectiveness:
maximises R2
ie: maximises proportion of variance explained by model

Efficiency:
maximises increase in R2adj upon adding another regressor
ie: if new regressor doesn't add much to the variance explained, it is not worth adding

How well does a model explain the variation in the dependent variable?

Effectiveness (R2 and R2adj)

0 - 25%     very poor and likely to be unacceptable
25 - 50%    poor, but may be acceptable
50 - 75%    good
75 - 90%    very good
90% +       likely that there is something wrong with your analysis

Are the regressors, taken together, significantly associated with the dependent variable?

ANOVA(b)

Model 1       Sum of Squares   df   Mean Square   F       Sig.
Regression    4065.388         3    1355.129      4.325   .028(a)
Residual      3760.050         12   313.337
Total         7825.438         15

a. Predictors: (Constant), AGE, GENDER, INCOME
b. Dependent Variable: DEPRESS

Analysis of Variance test checks to see if model, as a whole, has a significant relationship with the DV

Part of the predictive value of each regressor may be shared by one or more of the other regressors in the model, so the model must be considered as a whole (i.e. all regressors/IVs together)

Read off ANOVA table in SPSS output, and report as you did in week 3/4 assignments
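The ANOVA figures can be reproduced by hand. This minimal sketch takes the sums of squares and df from the ANOVA table above and derives the mean squares, the F-ratio and R2; the variable names are illustrative.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table.
ss_regression, df_regression = 4065.388, 3
ss_residual, df_residual = 3760.050, 12

# Mean square = sum of squares / df; F = MS(regression) / MS(residual).
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

# R2 is the share of the total sum of squares explained by the model.
r_squared = ss_regression / (ss_regression + ss_residual)

print(f"F({df_regression},{df_residual}) = {f_ratio:.3f}, R2 = {r_squared:.3f}")
```

Both values match the SPSS output (F = 4.325, R Square = .520), which shows that R2 and F are two views of the same partition of variance.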

What relationship does each individual regressor have with the dependent variable?

Coefficients(a)

              Unstandardized Coefficients   Standardized Coefficients
Model 1       B            Std. Error       Beta                        t        Sig.
(Constant)    68.285       15.444                                       4.421    .001
INCOME        -9.34E-02    .029             -.682                       -3.178   .008
GENDER        3.306        8.942            .075                        .370     .718
AGE           -.162        .344             -.101                       -.470    .646

a. Dependent Variable: DEPRESS

SPSS output table entitled Coefficients

Column headed Unstandardised coefficients - B

Gives regression coefficient for each regressor variable (IV)

With all the other variables held constant

Units of coefficient are units of the DV per unit of the regressor (IV)

What relationship does each individual regressor have with the dependent variable?

Units of coefficient are units of the dependent variable per unit of the regressor

eg: dependent variable: score on video game (in points)
regressor: time of day (in hours)
B coefficient for time = 844.57 (points per hour)
score = (B coefficient x time) + constant
score = (844.57 x time) - 4239.6

This means that for every increase of one hour in the variable time, we would predict that a person's score will increase by 844.57 points
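A small sketch of what the B coefficient means in practice. The coefficient 844.57 is from the example; the constant is taken as -4239.6, where the negative sign is an assumption, and the two times of day are arbitrary illustrative values.

```python
B_TIME = 844.57      # points per hour, from the example
CONSTANT = -4239.6   # constant from the example; negative sign is an assumption


def predicted_score(time_hours: float) -> float:
    """score = (B coefficient x time) + constant."""
    return B_TIME * time_hours + CONSTANT


# Moving one hour later changes the prediction by exactly B,
# whatever the value of the constant.
difference = predicted_score(15) - predicted_score(14)
print(f"Change in predicted score per extra hour: {difference:.2f}")
```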

What relationship does each individual regressor have with the dependent variable?

dependent variable: score on video game
regressor: gender

Gender coded so that: 1 = male, 2 = female

Let B coefficient for gender = 100.00

So, score = (100.00 x gender) + constant

Adding 1 to the variable gender means that we go from male to female

This means that females would be expected to score 100.00 points more than males

Remember that the B coefficient is calculated on the basis that 1 = male and 2 = female (a different coding can change the sign or size of the coefficient, and the constant)
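To illustrate how the coding affects the equation, the sketch below writes the same predictions under three codings. B = 100.00 is from the example; the constant of 500 is a hypothetical value chosen only so the arithmetic is easy to follow.

```python
B_GENDER = 100.00
CONSTANT = 500.0  # hypothetical constant, not from the slides


def score_coded_1_2(gender: int) -> float:
    """Original coding: 1 = male, 2 = female."""
    return B_GENDER * gender + CONSTANT


def score_coded_0_1(gender: int) -> float:
    """Recoded 0 = male, 1 = female: same B, but the constant absorbs one B."""
    return B_GENDER * gender + (CONSTANT + B_GENDER)


def score_coded_reversed(gender: int) -> float:
    """Recoded 1 = female, 2 = male: B changes sign, constant shifts."""
    return -B_GENDER * gender + (CONSTANT + 3 * B_GENDER)


male, female = score_coded_1_2(1), score_coded_1_2(2)
print(male, female, female - male)
```

All three equations give the same predicted scores for males and females; only the coefficient and constant needed to express them differ.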

Which regressor has the most effect on the dependent variable?

Units for each regression coefficient are different, so we must standardise them if we want to compare one with another

Column headed Standardised coefficients - Beta

Can compare the Beta weights for each regressor variable to compare effects of each on the dependent variable

Larger Beta weight (ignoring the sign) indicates stronger effect of regressor on values of DV
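A Beta weight is the B coefficient rescaled by the standard deviations of the regressor and the DV: Beta = B x sd(x) / sd(y). The sketch below shows that rescaling; the income and depression values are hypothetical and are not the data behind the SPSS output.

```python
from statistics import stdev


def beta_weight(b: float, x_values: list, y_values: list) -> float:
    """Standardised coefficient: B multiplied by sd(x) / sd(y)."""
    return b * stdev(x_values) / stdev(y_values)


# Hypothetical data for illustration only.
income = [120, 200, 260, 310, 400, 450]
depress = [62, 55, 49, 41, 37, 30]

# B for INCOME taken from the Coefficients table (-9.34E-02).
beta_income = beta_weight(-0.0934, income, depress)
print(f"Beta for income on these made-up data: {beta_income:.3f}")
```

Because Beta is unit-free, weights for regressors measured on very different scales (income, age, a 0/1 gender code) can be compared directly.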

Are the relationships of each regressor with the dependent variable statistically significant?

Assessed using a t-test

Check values in column headed t and sig

If regression coefficient is negative, then t-value will also be negative (it does not matter about the sign, it is the size of t that is important)
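Each t-value is simply the B coefficient divided by its standard error. The sketch below recomputes t from the B and Std. Error columns of the Coefficients table and ranks the terms by the size of t, ignoring sign.

```python
# (B, Std. Error) pairs copied from the Coefficients table.
coefficients = {
    "(Constant)": (68.285, 15.444),
    "INCOME": (-0.0934, 0.029),
    "GENDER": (3.306, 8.942),
    "AGE": (-0.162, 0.344),
}

# t = B / Std. Error; the sign follows the sign of B.
t_values = {name: b / se for name, (b, se) in coefficients.items()}

# Rank by the absolute size of t, since the sign does not matter.
ranked = sorted(t_values, key=lambda name: abs(t_values[name]), reverse=True)

for name in ranked:
    print(f"{name:<12} t = {t_values[name]:.3f}")
```

Small differences from the printed t column (for example INCOME) arise because the table shows B and Std. Error to only a few significant figures.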

Reporting regression analyses

How should I report a regression analysis?

Reporting Regression analyses

Describe the characteristics of the model before you describe the significance of the relationship

So:
1. R2, R2adj - how well does the model fit the data?
2. F(m,n) - is the relationship significant?
3. Regression equation - how to calculate values of DV from known values of IVs?
4. Describe results in plain English

Reporting Regression analyses


We want to predict IQ score using brain size (MRI), height and gender as regressors

Units:
IQ: IQ points
brain size (MRI): pixels
height: centimetres
gender: 0 = male, 1 = female

Reporting Regression analyses (1)

SPSS output tells us that:


R2 = 21.7%

R2adj = 14.6%

Reporting Regression analyses (2)

SPSS output tells us that:


F(3,33) = 3.051, p < 0.05
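The reported R2, R2adj and F for the IQ model are tied together by the degrees of freedom. This sketch checks that they agree, assuming N = 37 participants (inferred from the residual df of 33 with 3 regressors, since 33 = N - 3 - 1).

```python
def f_from_r2(r2: float, n_points: int, n_regressors: int) -> float:
    """F = (R2 / n) / ((1 - R2) / (N - n - 1))."""
    df_residual = n_points - n_regressors - 1
    return (r2 / n_regressors) / ((1 - r2) / df_residual)


def adjusted_r2(r2: float, n_points: int, n_regressors: int) -> float:
    """R2adj = 1 - (1 - R2)(N - 1)/(N - n - 1)."""
    return 1 - (1 - r2) * (n_points - 1) / (n_points - n_regressors - 1)


# R2 = 21.7% is reported; N = 37 is an assumption inferred from df = 33.
f_ratio = f_from_r2(0.217, n_points=37, n_regressors=3)
r2_adj = adjusted_r2(0.217, n_points=37, n_regressors=3)
print(f"F(3,33) = {f_ratio:.2f}, R2adj = {r2_adj:.1%}")
```

Both results agree with the reported values (3.051 and 14.6%) to within rounding of the printed R2.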

Reporting Regression analyses (3)

Regression equation:
y = b1x1 + b2x2 + b3x3 + a
IQ = (1.824 x 10^-4 x MRI) - (0.316 x height) + (2.426 x gender) + (-6.411)
   = (0.0001824 x MRI) - (0.316 x height) + (2.426 x gender) + (-6.411)
   = (0.0002 x MRI) - (0.316 x height) + (2.426 x gender) + (-6.411)
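The regression equation can be used directly to predict IQ from known values of the regressors. In this sketch the coefficients are those given above, with height entering negatively as the plain-English summary describes; the brain size and height passed in are hypothetical inputs.

```python
def predicted_iq(mri_pixels: float, height_cm: float, gender: int) -> float:
    """IQ = 1.824e-4 x MRI - 0.316 x height + 2.426 x gender - 6.411.

    gender is coded 0 = male, 1 = female.
    """
    return 1.824e-4 * mri_pixels - 0.316 * height_cm + 2.426 * gender - 6.411


# Hypothetical participant: 900,000 pixels, 170 cm.
iq_male = predicted_iq(900_000, 170, gender=0)
iq_female = predicted_iq(900_000, 170, gender=1)
iq_taller = predicted_iq(900_000, 171, gender=0)

print(f"male: {iq_male:.1f}, female: {iq_female:.1f}, 1 cm taller: {iq_taller:.1f}")
```

Changing one regressor while holding the others constant shifts the prediction by exactly that regressor's coefficient, which is how each B should be read.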

Reporting Regression analyses (4)

The regression was a poor fit, describing only 21.7% of the variance in IQ (R2adj = 14.6%), but the overall relationship was statistically significant (F(3,33) = 3.05, p < 0.05).

With other variables held constant, IQ scores were negatively related to height, decreasing by 0.32 IQ points for every extra centimetre in height, and positively related to brain size, increasing by 0.0002 IQ points for every extra pixel of the scan. Women tended to have higher scores than men, by 2.43 IQ points. However, the effect of brain size (MRI) was the only significant effect (t(33) = 2.75, p = 0.01).

Break

Five minutes: please be back promptly

Selecting Regressors

What do we want of a regressor?

To have a significant effect on the dependent variable

Ability to discriminate between values of the dependent variable

Selecting Regressors
How well do potential regressors predict the Dependent Variable?

Dichotomous variable (eg: gender)

Compare the two groups' mean values of the DV using a t-test

If significant, then possible regressor predicts differences in dependent variable

Selecting Regressors
How well do potential regressors predict the Dependent Variable?

Continuous variable (eg: Height)

Compare with the DV using correlation

If significant, then possible regressor predicts differences in dependent variable
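For a continuous candidate regressor, the screening step amounts to computing Pearson's r with the DV and testing it. The sketch below computes r and the matching t statistic, t = r x sqrt((N - 2) / (1 - r^2)), which is compared against a critical t with N - 2 df. The heights and scores are hypothetical.

```python
import math


def pearson_r(xs: list, ys: list) -> float:
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_for_r(r: float, n_points: int) -> float:
    """t statistic for testing r against zero, with N - 2 df."""
    return r * math.sqrt((n_points - 2) / (1 - r * r))


# Hypothetical data: height in cm and DV score.
heights = [160, 165, 170, 175, 180, 185]
scores = [98, 104, 101, 110, 108, 115]

r = pearson_r(heights, scores)
t = t_for_r(r, len(heights))
print(f"r = {r:.3f}, t({len(heights) - 2}) = {t:.2f}")
```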

Selecting Regressors

Some of discriminatory value in regressor may be accounted for by regressors present in model already

gender, income, height

age, experience, value of property

In the presence of all regressors

Adding regressor may not add as much to model's predictive value as you might have anticipated

What makes the best model?

Same number of regressors

Choose model with highest value of R2adj

This gives best value per regressor

Will also have the highest value of R2 and F

Different number of regressors

Highest value of R2adj (more regressors)

Highest value of F (fewer regressors)

Efficiency vs Effectiveness

Effective: highest R2 (most complete)

will have more regressors

will be effective, but not efficient

Efficient: highest F-ratio (most significant)

will have fewer regressors

will be efficient, but not particularly effective

Compromise: largest increase in R2adj (best of both worlds)

will contain only the best regressors available

manageable number of regressors and reasonably effective

Minitab's BREG command

Tries every possible combination of available regressors (up to maximum of 20)

eg: 20 regressors give over 1,000,000 different models
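The "over 1,000,000" figure follows from counting every non-empty subset of the regressors, which is 2^k - 1 for k regressors. The sketch below confirms the count and enumerates the subsets for six columns named C1 to C6; it only lists the combinations and does not fit any models.

```python
from itertools import combinations


def count_models(n_regressors: int) -> int:
    """Number of non-empty subsets of the available regressors."""
    return 2 ** n_regressors - 1


print(f"20 regressors: {count_models(20):,} candidate models")

# Enumerate every subset for six regressor columns.
columns = ["C1", "C2", "C3", "C4", "C5", "C6"]
subsets = [
    combo
    for size in range(1, len(columns) + 1)
    for combo in combinations(columns, size)
]
print(f"6 regressors: {len(subsets)} candidate models")
```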

Command:

Dependent variable is in column 10

Independent variables in columns 1 to 6

BREG C10 C1-C6

Will not be required to carry out this type of analysis in exam, but you need to be able to interpret output

Sample of BREG output


MTB > BREG C13 C1-C12

Best Subsets Regression

Response is prodebt
304 cases used 160 cases contain missing values.

Vars   R-Sq   Adj. R-Sq   C-p   s
7      19.3   17.4        7.3   0.65539
7      19.1   17.2        7.8   0.65602
8      19.9   17.7        6.9   0.65388
8      19.5   17.4        8.2   0.65536
9      20.2   17.8        7.8   0.65375
9      20.1   17.6        8.3   0.65434
10     20.4   17.6        9.3   0.65427

Regressors marked X (number of the seven models above that include each):

incomegp   7
house      0
children   4
singpar    1
agegp      7
bankacc    0
bsocacc    5
manage     7
ccarduse   7
cigbuy     6
xmasbuy    7
locintrn   7
BREG output

Best two models for each possible number of regressors are displayed in output

Compare R2adj values directly

Select best model(s)

Run normal regression in SPSS for each selected model

Compare F-ratio values
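Comparing R2adj values across the BREG rows can be done mechanically. This sketch copies the numeric columns from the sample output and picks the row with the highest adjusted R-square, plus a shortlist of three to re-run as ordinary regressions; the tuple layout is an illustrative choice.

```python
# (Vars, R-Sq, Adj. R-Sq, C-p, s) rows copied from the sample BREG output.
breg_rows = [
    (7, 19.3, 17.4, 7.3, 0.65539),
    (7, 19.1, 17.2, 7.8, 0.65602),
    (8, 19.9, 17.7, 6.9, 0.65388),
    (8, 19.5, 17.4, 8.2, 0.65536),
    (9, 20.2, 17.8, 7.8, 0.65375),
    (9, 20.1, 17.6, 8.3, 0.65434),
    (10, 20.4, 17.6, 9.3, 0.65427),
]

# Best single model by adjusted R-square (column index 2).
best = max(breg_rows, key=lambda row: row[2])

# Shortlist of the top three models; ties keep their original order.
shortlist = sorted(breg_rows, key=lambda row: row[2], reverse=True)[:3]

print(f"Best: {best[0]} regressors, R2adj = {best[2]}%")
for n_vars, r_sq, r_sq_adj, c_p, s in shortlist:
    print(f"{n_vars} regressors: R2 = {r_sq}%, R2adj = {r_sq_adj}%")
```

Note that the 10-regressor model has the highest R2 but not the highest R2adj, which is the effectiveness versus efficiency trade-off in miniature.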

Best Subset Regression model

Identify best subset of regressors from BREG output

Must run ordinary regression procedure

calculates F-ratio

calculates individual coefficients and significance

Highest R2adj values should result in significant F-ratios

if F-ratio not significant, check data and procedure

BUT: Advisable to try two or three models, as the number of respondents contributing to each analysis may not be the same between Minitab and SPSS

Equivalent SPSS procedures

Choose procedure by selecting appropriate tab in drop-down menu

Enter procedure:

Adds all regressors to model simultaneously

Calculates F-ratio and R2adj for all regressors

Stepwise procedure:

Adds regressors one at a time

Calculates F-ratio and R2adj for each set of regressors

considers taking regressors out at each stage

Missing values

Frequently have values missing from data set

missed out questions

couldn't understand question

couldn't collect data for some reason

Must specify missing values in SPSS in Define Variable window

Differences in R2adj or F-ratio values are most likely to be due to missing values

Leads to different N (number of data points) in each analysis
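The effect of missing values on the number of usable cases can be seen with a few rows. In this sketch a case is used only if it has a value for every variable in the model (listwise deletion, which is assumed here to be how each package handles missing data); the records and variable names are hypothetical.

```python
# Hypothetical survey records; None marks a missing value.
records = [
    {"income": 3, "children": 2, "prodebt": 3.1},
    {"income": None, "children": 1, "prodebt": 2.8},
    {"income": 4, "children": None, "prodebt": 3.6},
    {"income": 2, "children": 0, "prodebt": None},
    {"income": 5, "children": 3, "prodebt": 4.0},
]


def complete_cases(rows: list, variables: tuple) -> list:
    """Keep only rows with no missing value on any listed variable."""
    return [row for row in rows if all(row[v] is not None for v in variables)]


small_model = complete_cases(records, ("income", "prodebt"))
large_model = complete_cases(records, ("income", "children", "prodebt"))

print(f"cases used with 1 regressor:  {len(small_model)}")
print(f"cases used with 2 regressors: {len(large_model)}")
```

Adding a regressor can therefore shrink the sample, so two models may be fitted to different sets of respondents.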

Model checking

Residuals (general)

Unusual observations: outliers

Model checking - Residuals

Predicted value for y (dependent variable):
y = b1x1 + b2x2 + ... + a

Actual (observed) value for y

Residual = actual (observed) value minus predicted (calculated) value
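A minimal sketch of computing residuals, using the simple one-regressor case so that the fit can be written out in a few lines. The dose and symptom values are hypothetical, and the least-squares helper is illustrative rather than the SPSS procedure.

```python
def fit_line(xs: list, ys: list) -> tuple:
    """Least-squares slope b and constant a for y = bx + a."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return b, mean_y - b * mean_x


# Hypothetical data: drug dose (mg) and symptom index.
doses = [50, 100, 150, 200, 250]
symptoms = [150, 118, 95, 60, 37]

b, a = fit_line(doses, symptoms)
predicted = [b * x + a for x in doses]

# Residual = observed value minus predicted value.
residuals = [y - y_hat for y, y_hat in zip(symptoms, predicted)]

for x, e in zip(doses, residuals):
    print(f"dose {x:>3} mg: residual {e:+.1f}")
```

With a constant in the model the residuals sum to zero, so it is their spread and pattern, not their total, that is informative.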

Model checking - Residuals


[Figure: two scatterplots of Symptom Index against drug dose in mg. Drug A: good fit, low residuals. Drug B: moderate fit, larger residuals.]

Model checking - Residuals


Residuals should be:

Normally distributed
  some big, some small, most average-sized

Independent of one another
  no constant covariation with one another

Almost identical in terms of variance
  regardless of the values of the IVs or DVs

These things are easy to check with the SPSS plots option

Model checking - Unusual observations

Outliers

Linear regression would work quite well for this data, except for the presence of three outlier points

[Figure: scatterplot of EXAM against ANXIETY, showing three outlier points]

Dealing with outliers

Run regression analysis

Plot data on a scattergram

Remove outliers by deleting the rows in SPSS

Run regression analysis again

Note any qualitative differences (eg: a coefficient changes sign or significance):

if there are qualitative differences, then check data. If no errors, report both analyses

if only quantitative differences, then leave outliers in analysis, noting their presence
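The steps above can be sketched for the one-regressor case. The anxiety and exam values are hypothetical, with the last three points acting as outliers; the comparison checks whether removing them changes the sign of the slope, which is one example of a qualitative difference.

```python
def slope(xs: list, ys: list) -> float:
    """Least-squares slope for y = bx + a."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data; the last three cases are the suspected outliers.
anxiety = [2, 4, 6, 8, 10, 12, 14, 16, 18, 19, 20]
exam = [70, 66, 61, 58, 52, 49, 44, 40, 75, 78, 80]
outlier_rows = {8, 9, 10}  # row indices identified from the scattergram

# Run the analysis with all cases, then again without the outliers.
slope_all = slope(anxiety, exam)
kept = [i for i in range(len(anxiety)) if i not in outlier_rows]
slope_trimmed = slope([anxiety[i] for i in kept], [exam[i] for i in kept])

qualitative_difference = (slope_all > 0) != (slope_trimmed > 0)
print(f"all cases: b = {slope_all:.2f}; outliers removed: b = {slope_trimmed:.2f}")
print("qualitative difference" if qualitative_difference else "quantitative only")
```

Here the slope changes sign, so under the rule above the data would be checked and, if no errors are found, both analyses reported.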

Justification

Removing outliers

Plotting data may indicate that some participants belong to a separate subsample. Eg: people with an exam phobia?

[Figure: scatterplot of EXAM against ANXIETY]

Residuals

DV vs IV

Differences between actual and predicted values (ie: residual values) should show a normal distribution

Some large positive

Some large negative

But mostly small (positive or negative), or zero

ie: Normally distributed

[Figure: scatterplot of EXAM against ANXIETY]

Residuals
DV vs IV

If our best-fit line does not fit too well, this will be revealed in the distribution of the residuals

[Figure: scatterplot of EXAM against ANXIETY]

Questions?

Call Veera at 012-2313979