
# Regression Technique

Regression

Web support

## Model checking - residuals

Simple Regression

y = bx + a

## x is the predictor or regressor variable for y

Multiple Regression

## Establish equation for the best-fit line:

y = b1x1 + b2x2 + b3x3 + a

Where:
b1 = regression coefficient for variable x1
b2 = regression coefficient for variable x2
b3 = regression coefficient for variable x3
a = constant

Multiple Regression
R2 - Goodness of fit

Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .721 | .520 | .399 | 17.70134 |

## For multiple regression, R2 will get larger every time another independent variable (regressor/predictor) is added to the model

## New regressor may only provide a tiny improvement in amount of variance in the data explained by the model

## Need to establish the added value of each additional regressor in predicting the DV

Multiple Regression

## R2adj (adjusted R2) takes into account the number of regressors in the model

Calculated as:
R2adj = 1 - (1-R2)(N-1)/(N-n-1)
where:
N = number of data points
n = number of regressors
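
The R2adj formula can be checked against the Model Summary figures. This is a minimal sketch: the function name is illustrative, and N = 16 is an assumption read off the ANOVA table (total df of 15, plus 1).

```python
def adjusted_r2(r2: float, n_points: int, n_regressors: int) -> float:
    """R2adj = 1 - (1 - R2)(N - 1)/(N - n - 1)."""
    df_residual = n_points - n_regressors - 1
    if df_residual <= 0:
        raise ValueError("need more data points than regressors + 1")
    return 1 - (1 - r2) * (n_points - 1) / df_residual

# R Square = .520 with 3 regressors, as in the Model Summary table.
# N = 16 is assumed from the ANOVA total df (15) + 1.
r2_adj = adjusted_r2(0.520, n_points=16, n_regressors=3)
print(f"R2adj = {r2_adj:.3f}")
```

The result is about 0.40, which agrees with the .399 in the SPSS output once rounding of R Square is allowed for.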

## How well does a model explain the variation in the dependent variable?

Effectiveness vs Efficiency

Effectiveness:
maximises R2
ie: maximises proportion of variance explained by model

Efficiency:
maximises increase in R2adj upon adding another regressor
ie: if new regressor doesn't add much to the variance explained, it is not worth adding

## How well does a model explain the variation in the dependent variable?

0 - 25%
25 - 50%
50 - 75%: good
75 - 90%: very good
90% +

## Are the regressors, taken together, significantly associated with the dependent variable?

ANOVA (b)

| Model 1 | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 4065.388 | 3 | 1355.129 | 4.325 | .028 (a) |
| Residual | 3760.050 | 12 | 313.337 | | |
| Total | 7825.438 | 15 | | | |

a. Predictors: (Constant), AGE, GENDER, INCOME
b. Dependent Variable: DEPRESS
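
The F-ratio in the ANOVA table follows from the sums of squares and degrees of freedom. The sketch below recomputes it from the printed values; the variable names are illustrative only.

```python
# Values copied from the ANOVA table (dependent variable: DEPRESS).
ss_regression = 4065.388
ss_residual = 3760.050
df_regression = 3
df_residual = 12

# Mean square = sum of squares / degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of the two mean squares.
f_ratio = ms_regression / ms_residual

# R Square is the share of the total sum of squares explained by the model.
r_squared = ss_regression / (ss_regression + ss_residual)

print(f"MS regression = {ms_regression:.3f}")
print(f"MS residual   = {ms_residual:.3f}")
print(f"F({df_regression},{df_residual}) = {f_ratio:.3f}")
print(f"R Square = {r_squared:.3f}")
```

The same sums of squares also reproduce the R Square of .520 reported in the Model Summary.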

## Analysis of Variance test checks to see if model, as a whole, has a significant relationship with the DV

## Part of the predictive value of each regressor may be shared by one or more of the other regressors in the model, so the model must be considered as a whole (i.e. all regressors/IVs together)

Read off ANOVA table in SPSS output, and report as you did in week 3/4 assignments

## What relationship does each individual regressor have with the dependent variable?

Coefficients

| Model 1 | B (Unstandardized) | Std. Error | Beta (Standardized) | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | 68.285 | 15.444 | | 4.421 | .001 |
| INCOME | -9.34E-02 | .029 | -.682 | -3.178 | .008 |
| GENDER | 3.306 | 8.942 | .075 | .370 | .718 |
| AGE | -.162 | .344 | -.101 | -.470 | .646 |
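
Each t value in the Coefficients table is the B coefficient divided by its standard error. A minimal sketch using the printed values (the dictionary layout is illustrative only):

```python
# (B, Std. Error) pairs copied from the Coefficients table.
coefficients = {
    "(Constant)": (68.285, 15.444),
    "INCOME": (-9.34e-02, 0.029),
    "GENDER": (3.306, 8.942),
    "AGE": (-0.162, 0.344),
}

# t = B / standard error of B
t_values = {name: b / se for name, (b, se) in coefficients.items()}

for name, t in t_values.items():
    print(f"{name:<12} t = {t:.3f}")
```

The recomputed values match the table to within rounding. INCOME comes out slightly larger in size than the printed -3.178 because its standard error is only shown to two significant figures.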

## What relationship does each individual regressor have with the dependent variable?

## Units of the B coefficient are units of the dependent variable per unit of the regressor

eg: dependent variable: score on video game (in points)
regressor: time of day (in hours)
B coefficient for time = 844.57 (points per hour)
score = (B coefficient × time) + constant
score = (844.57 × time) - 4239.6

This means that for every increase of one hour in the variable time, we would predict that a person's score will increase by 844.57 points

## What relationship does each individual regressor have with the dependent variable?

dependent variable: score on video game
regressor: gender

## Gender coded so that: 1 = male, 2 = female

## So, adding 1 to the variable gender means that we go from male to female

With a B coefficient for gender of 100.00, this means that females would be expected to score 100.00 points more than males
Remember that the B coefficient is calculated on the basis that 1 = male and 2 = female (different coding will give a different coefficient)

## Units for each regression coefficient are different, so we must standardise them if we want to compare one with another

Column headed Standardised coefficients - Beta

## Can compare the Beta weights for each regressor variable to compare the effects of each on values of the DV
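
A Beta weight is the B coefficient rescaled by the standard deviations of the regressor and the DV: Beta = B × sd(x) / sd(y). The sketch below illustrates this for a single regressor. The data are made up for illustration and the function names are hypothetical.

```python
from statistics import mean, stdev

def slope(x, y):
    """Least-squares B coefficient for a single regressor."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def standardized_beta(b, x, y):
    """Beta = B * sd(x) / sd(y), so Beta has no units."""
    return b * stdev(x) / stdev(y)

def z_scores(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical data, not taken from the lecture.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

b = slope(hours, score)
beta = standardized_beta(b, hours, score)

# Same result if both variables are converted to z-scores before fitting.
beta_from_z = slope(z_scores(hours), z_scores(score))

print(f"B = {b:.3f}, Beta = {beta:.3f}, Beta from z-scores = {beta_from_z:.3f}")
```

Because Beta is expressed in standard deviation units, weights for regressors measured on different scales can be compared directly.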

## Are the relationships of each regressor with the dependent variable statistically significant?

## If regression coefficient is negative, then t-value will also be negative (it does not matter about the sign, it is the size of t that is important)

## Describe the characteristics of the model before you describe the significance of the relationship

So:
1. R2, R2adj - how well does the model fit the data?
2. F(m,n) - is the relationship significant?
3. Regression equation - how to calculate values of DV from known values of IVs?

## Reporting Regression analyses

We want to predict IQ score using brain size (MRI), height and gender as regressors

Units:
IQ: IQ points
brain size (MRI): pixels
height: centimetres

## SPSS output tells us that:

R2 = 21.7%
F(3,33) = 3.051, p < 0.05
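
The reported R2 and F-ratio are linked: F = (R2 / n) / ((1 - R2) / (N - n - 1)). The sketch below checks the two figures against each other. N = 37 is an assumption inferred from the residual df of 33 with 3 regressors, and the function name is illustrative.

```python
def f_from_r2(r2: float, n_points: int, n_regressors: int) -> float:
    """F-ratio implied by R2 for a model with n regressors and N data points."""
    df_model = n_regressors
    df_residual = n_points - n_regressors - 1
    return (r2 / df_model) / ((1 - r2) / df_residual)

# R2 = 21.7% with 3 regressors; N = 37 assumed from residual df = 33.
r2 = 0.217
n_points, n_regressors = 37, 3

f_ratio = f_from_r2(r2, n_points, n_regressors)
r2_adj = 1 - (1 - r2) * (n_points - 1) / (n_points - n_regressors - 1)

print(f"F(3,33) = {f_ratio:.2f}")
print(f"R2adj = {r2_adj:.1%}")
```

The recomputed F is about 3.05 and R2adj about 14.6%, in line with the values reported for this analysis.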

## Reporting Regression analyses (3)

Regression equation:
y = b1x1 + b2x2 + b3x3 + a
IQ = (1.824 × 10^-4 × MRI) - (0.316 × height) + (2.426 × gender) + (-6.411)
= (0.0001824 × MRI) - (0.316 × height) + (2.426 × gender) + (-6.411)
= (0.0002 × MRI) - (0.316 × height) + (2.426 × gender) + (-6.411)
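
The regression equation can be used to calculate a predicted IQ from known values of the regressors. This is a sketch only: the function name and the input values are hypothetical, the height coefficient is taken as negative (IQ falls by 0.32 points per extra centimetre in the written report), and gender is assumed to be coded 1 = male, 2 = female as in the earlier video game example.

```python
def predict_iq(mri_pixels: float, height_cm: float, gender_code: int) -> float:
    """Predicted IQ from the fitted equation.

    gender_code: assumed coding 1 = male, 2 = female.
    """
    return (
        1.824e-4 * mri_pixels   # brain size, IQ points per pixel
        - 0.316 * height_cm     # height, IQ points per centimetre
        + 2.426 * gender_code   # gender
        + (-6.411)              # constant
    )

# Hypothetical participant: 900,000 pixels, 170 cm, female.
example = predict_iq(900_000, 170, 2)
print(f"Predicted IQ = {example:.1f}")

# Changing one regressor by one unit shifts the prediction by its B coefficient.
gender_effect = predict_iq(900_000, 170, 2) - predict_iq(900_000, 170, 1)
height_effect = predict_iq(900_000, 171, 2) - predict_iq(900_000, 170, 2)
print(f"Female vs male: {gender_effect:+.3f}; one extra cm: {height_effect:+.3f}")
```

Using the unrounded MRI coefficient matters here, because the regressor takes very large values.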

The regression was a poor fit, describing only 21.7% of the variance in IQ (R2adj = 14.6%), but the overall relationship was statistically significant (F(3,33) = 3.05, p < 0.05). With other variables held constant, IQ scores were negatively related to height, decreasing by 0.32 IQ points for every extra centimetre in height, and positively related to brain size, increasing by 0.0002 IQ points for every extra pixel of the scan. Women tended to have higher scores than men, by 2.43 IQ points. However, the effect of brain size (MRI) was the only significant effect (t(33) = 2.75, p = 0.01).

Selecting Regressors

## To have a significant effect on the dependent variable

Ability to discriminate between values of the dependent variable

Selecting Regressors
How well do potential regressors predict the Dependent Variable?

## If significant, then possible regressor predicts differences in dependent variable

Selecting Regressors

## Some of discriminatory value in regressor may be accounted for by regressors present in model already

## Adding regressor may not add as much to model's predictive value as you might have anticipated

## Highest value of F (fewer regressors)

Efficiency vs Effectiveness

## Best subsets regression tries every possible combination of available regressors (up to maximum of 20)

Command: BREG

## Will not be required to carry out this type of analysis in exam, but you need to be able to interpret output

## Sample of BREG output

MTB > BREG C13 C1-C12
Best Subsets Regression
Response is prodebt
304 cases used; 160 cases contain missing values.

| Vars | R-Sq | R-Sq(adj) | C-p | s |
|---|---|---|---|---|
| 7 | 19.3 | 17.4 | 7.3 | 0.65539 |
| 7 | 19.1 | 17.2 | 7.8 | 0.65602 |
| 8 | 19.9 | 17.7 | 6.9 | 0.65388 |
| 8 | 19.5 | 17.4 | 8.2 | 0.65536 |
| 9 | 20.2 | 17.8 | 7.8 | 0.65375 |
| 9 | 20.1 | 17.6 | 8.3 | 0.65434 |
| 10 | 20.4 | 17.6 | 9.3 | 0.65427 |

Predictor columns (X marks a predictor included in that model): incomegp, house, children, singpar, agegp, bankacc, bsocacc, manage, ccarduse, cigbuy, xmasbuy, locintrn. incomegp, agegp, manage, ccarduse, xmasbuy and locintrn are marked X in all seven models shown.

BREG output

## Best two models for each possible number of regressors are displayed in output
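
The effectiveness versus efficiency choice can be read off the sample BREG output. The sketch below copies the seven displayed rows and picks out the model with the highest R-Sq and the model with the highest adjusted R-Sq. Treating the second R-Sq column as the adjusted value is an assumption, and the tuple layout is illustrative only.

```python
# (vars, R-Sq, R-Sq adjusted, C-p, s) copied from the sample BREG output.
# The second R-Sq column is assumed to be the adjusted value.
models = [
    (7, 19.3, 17.4, 7.3, 0.65539),
    (7, 19.1, 17.2, 7.8, 0.65602),
    (8, 19.9, 17.7, 6.9, 0.65388),
    (8, 19.5, 17.4, 8.2, 0.65536),
    (9, 20.2, 17.8, 7.8, 0.65375),
    (9, 20.1, 17.6, 8.3, 0.65434),
    (10, 20.4, 17.6, 9.3, 0.65427),
]

# Effectiveness: largest proportion of variance explained (R-Sq).
most_effective = max(models, key=lambda m: m[1])

# Efficiency: largest adjusted R-Sq, which penalises extra regressors.
most_efficient = max(models, key=lambda m: m[2])

print(f"Most effective: {most_effective[0]} regressors, R-Sq = {most_effective[1]}%")
print(f"Most efficient: {most_efficient[0]} regressors, R-Sq(adj) = {most_efficient[2]}%")
```

On these rows the 10-regressor model explains the most variance, but the first 9-regressor model has the highest adjusted value, so the tenth regressor adds very little.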

## Must run ordinary regression procedure: calculates F-ratio

## BUT: Advisable to try two or three models, as the number of respondents contributing to each analysis may not be the same between Minitab and SPSS

Enter procedure:

## Calculates F-ratio and R2adj for all regressors

Stepwise procedure:

Missing values

## Must specify missing values in SPSS in Define Variable window

Differences in R2adj or F-ratio values are most likely to be due to missing values
Leads to different n in each analysis

Model checking

Residuals (general)

## Predicted value for y (dependent variable)

y = b1x1 + b2x2 + ... + a

[Figure: two scatterplots of Symptom Index. Good fit: low residuals. Moderate fit: larger residuals.]

## Model checking - Residuals

Residuals should be:

Normally distributed (some big, some small, most average-sized)
no constant covariation with one another
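
A residual is the actual value of the DV minus the value predicted by the best-fit line. The sketch below fits a one-regressor line and lists the residuals. The ANXIETY and EXAM values are made up for illustration and the function name is hypothetical.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares fit of y = bx + a; returns (b, a)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return b, a

# Hypothetical data, not taken from the lecture plots.
anxiety = [2, 5, 8, 11, 14, 17, 20]
exam = [72, 65, 61, 50, 46, 38, 30]

b, a = fit_line(anxiety, exam)
predicted = [b * xi + a for xi in anxiety]

# Residual = actual value - predicted value.
residuals = [yi - pi for yi, pi in zip(exam, predicted)]

for xi, yi, ri in zip(anxiety, exam, residuals):
    print(f"anxiety={xi:>2}  exam={yi:>2}  residual={ri:+.2f}")
```

With a constant in the model the residuals always sum to zero, so it is their spread and shape (for example on a histogram) that shows how well the line fits.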

Outliers

## Linear regression would work quite well for this data, except for the presence of three outlier points

[Figure: scatterplot of EXAM (y-axis) against ANXIETY (x-axis)]

## If there are qualitative differences, then check data. If no errors, report both analyses

If only quantitative differences, then leave outliers in analysis, noting their presence
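
One way to look for such differences is to fit the line with and without the suspected outliers and compare the results. The sketch below does this on made-up data. Treating a change in the sign of the slope as a "qualitative" difference is an assumption made for illustration, and the outliers are flagged by hand, as if read off a plot.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares fit of y = bx + a; returns (b, a)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return b, my - b * mx

# Hypothetical data: nine points on a downward trend plus three outliers.
anxiety = [4, 6, 8, 10, 12, 14, 16, 18, 20, 2, 3, 4]
exam = [66, 59, 56, 50, 44, 41, 34, 31, 24, 10, 15, 12]
outlier_rows = {9, 10, 11}  # positions flagged by eye (assumption)

slope_all, _ = fit_line(anxiety, exam)

kept = [i for i in range(len(anxiety)) if i not in outlier_rows]
slope_trimmed, _ = fit_line([anxiety[i] for i in kept], [exam[i] for i in kept])

# Assumed rule of thumb: a sign change counts as a qualitative difference.
qualitative_difference = (slope_all > 0) != (slope_trimmed > 0)

print(f"Slope with outliers:    {slope_all:+.3f}")
print(f"Slope without outliers: {slope_trimmed:+.3f}")
print("Check data and report both analyses" if qualitative_difference
      else "Leave outliers in, noting their presence")
```

In this made-up example the three points are enough to reverse the direction of the relationship, which is the situation where both analyses would be reported.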

Justification

Removing outliers

## Plotting data may indicate that some participants belong to a separate subsample. Eg: people with an exam phobia?

[Figure: scatterplot of EXAM (y-axis) against ANXIETY (x-axis)]

Residuals

## Differences between actual and predicted values (ie: residual values) should show a normal distribution

[Figure: DV vs IV, scatterplot of EXAM (y-axis) against ANXIETY (x-axis)]

## But mostly small (positive or negative), or zero, ie: Normally distributed

Residuals
## If our best-fit line does not fit too well, this will be revealed in the distribution of the Residuals

[Figure: DV vs IV, scatterplot of EXAM (y-axis) against ANXIETY (x-axis)]

Questions ?