
Module 3: Multiple Regression

Concepts


Fiona Steele [1]
Centre for Multilevel Modelling


Contents

Introduction
    What is Multiple Regression?
        Motivation
        Conditioning
    Data for multiple regression analysis
    Introduction to Dataset

C3.1 Regression with a Single Continuous Explanatory Variable
    C3.1.1 Examining data graphically
    C3.1.2 The linear regression model
    C3.1.3 The fitted regression line
    C3.1.4 Explained and unexplained variance and R-squared
    C3.1.5 Hypothesis testing
    C3.1.6 Model checking

C3.2 Comparing Groups: Regression with a Single Categorical Explanatory Variable
    C3.2.1 Comparing two groups
    C3.2.2 Comparing more than two groups
    C3.2.3 Comparing a large number of groups

C3.3 Regression with More than One Explanatory Variable (Multiple Regression)
    C3.3.1 Statistical control
    C3.3.2 The multiple regression model
    C3.3.3 Using multiple regression to model a non-linear relationship
    C3.3.4 Adding further predictors

C3.4 Interaction Effects
    C3.4.1 Model with fixed slopes across groups
    C3.4.2 Fitting separate models for each group
    C3.4.3 Allowing for varying slopes in a pooled analysis: interaction effects
    C3.4.4 Testing for interaction effects
    C3.4.5 Another example: allowing age effects to be different in different countries

C3.5 Checking Model Assumptions in Multiple Regression
    C3.5.1 Checking the normality assumption
    C3.5.2 Checking the homoskedasticity assumption
    C3.5.3 Outliers

[1] With additional material from Kelvyn Jones. Comments from Sacha Brostoff, Jon Rasbash and
Rebecca Pillinger on an earlier draft are gratefully acknowledged.

Centre for Multilevel Modelling, 2008


All of the sections within this module have online quizzes for you to
test your understanding. To find the quizzes:

EXAMPLE
From within the LEMMA learning environment
- Go down to the section for Module 3: Multiple Regression
- Click "3.1 Regression with a Single Continuous Explanatory Variable"
  to open Lesson 3.1
- Click Q1 to open the first question

All of the sections within this module have practicals so you can
learn how to perform this kind of analysis in MLwiN or other
software packages. To find the practicals:

EXAMPLE
From within the LEMMA learning environment
- Go down to the section for Module 3: Multiple Regression, then
- Either click "3.1 Regression with a Single Continuous Explanatory Variable" to open
  Lesson 3.1 and click [icon]
- Or click [icon] Print all Module 3 MLwiN Practicals

Pre-requisites

- Understanding of types of variables (continuous vs. categorical variables,
  dependent and explanatory); covered in Module 1.
- Correlation between variables
- Confidence intervals around estimates
- Hypothesis testing, p-values
- Independent samples t-test for comparing the means of two groups

Online resources:
http://www.sportsci.org/resource/stats/
http://www.socialresearchmethods.net/
http://www.animatedsoftware.com/statglos/statglos.htm
http://davidmlane.com/hyperstat/index.html

Introduction

What is Multiple Regression?

Multiple regression is a technique used to study the relationship between an
outcome variable and a set of explanatory or predictor variables.

Motivation

To illustrate the ideas of multiple regression, we will consider a research problem
of assessing the evidence for gender discrimination in legal firms. Statistical
modelling can provide the following:

- A quantitative assessment of the size of the effect; e.g. the difference in salary
  between women and men is 5000 per annum;
- A quantitative assessment after taking account of other variables; e.g. a female
  worker earns 6500 less after taking account of years of experience. This
  conditioning on other variables distinguishes multiple regression modelling from
  simple "testing for differences" analyses.
- A measure of uncertainty for the size of the effect; e.g. we can be 95%
  confident that the female-male difference in salary in the population from
  which our sample was drawn is likely to lie between 4500 and 5500.

We can use regression modelling in different modes: 1) as description (what is the
average salary for men and women?), 2) as part of causal inference (does being
female result in a lower salary?), and 3) for prediction ("what happens if"
questions).

Conditioning

The key feature that distinguishes multiple regression from simple regression is
that more than one predictor variable is involved. Even if we are interested in the
effect of just one variable (gender) on another (salary) we need to take account of
other variables as they may compromise the results. We can recognise three
distinct cases where it is important to control or adjust for the effects of other
variables:

i) Inflation of a relationship when not taking into account extraneous
variables. For example, a substantial gender effect could be reduced after
taking account of type of employment. This is because jobs that are
characterized by poor pay (e.g. in the service sector) have a predominantly
female labour force.

ii) Suppression of a relationship. An apparent small gender gap could increase
when account is taken of years of employment; women having longer
service and poorer pay.

iii) No confounding. The original relationship remains substantially unaltered
when account is taken of other variables. Note, however, that there may
be unmeasured confounders.

Data for multiple regression analysis

Statistical analysis requires a quantifiable outcome measure (dependent variable)
to assess the effects of discrimination. Possibilities include the following,
differentiated by the nature of the measurement: a continuous measure of salary,
a binary indicator of whether an employee was promoted or not, a three-category
indicator of promotion (promoted, not promoted, not even considered), a count of
the number of times rejected for promotion, the length of time that it has taken
to gain promotion. All of these outcomes can be analysed using regression
analysis, but different techniques are required for different scales of
measurement.

The term "multiple regression" is usually applied when the dependent variable is
measured on a continuous scale. A dichotomous dependent variable can be
analysed using logistic regression, and multinomial logistic and ordinal regression
can be applied to nominal and ordinal dependent variables respectively. There are
also methods for handling counts (Poisson regression) and time-to-event data
(event history analysis or survival analysis). These techniques will be described in
later Modules.

The explanatory variables may also have different scales of measurement. For
example, gender is a binary categorical variable; ethnicity is categorical with more
than two categories; education might be measured on an ordinal scale (e.g. <11,
11-13, 14-16 and >16 years of education); years of employment could be measured
on a continuous scale. Multiple regression can handle all of these types of
explanatory variable, and we will consider examples of both continuous and
categorical variables in this Module.

Introduction to Dataset

The ideas of multiple regression will be introduced using data from the 2002
European Social Survey (ESS). Measures of ten human values have been
constructed for 20 countries in the European Union. According to value theory,
values are defined as desirable, trans-situational goals that serve as guiding
principles in people's lives. Further details on value theory and how it is
operationalised in the ESS can be found on the ESS education net
(http://essedunet.nsd.uib.no/cms/topics/1/).

We will study one of the ten values, hedonism, defined as the "pleasure and
sensuous gratification for oneself". The measure we use is based on responses to
the question "How much like you is this person?":

- He (sic) seeks every chance he can to have fun. It is important to him to do
  things that give him pleasure.
- Having a good time is important to him. He likes to "spoil" himself.

A respondent's own values are inferred from their self-reported similarity to a
person with the above descriptions. Each of the two items is rated on a 6-point
scale (from "very much like me" to "not like me at all"). The mean of these
ratings is calculated for each individual. The mean of the two hedonism items is
then adjusted for individual differences in scale use [2] by subtracting the mean of all
value items (a total of 21 are used to measure the 10 values). These centred
scores recognise that the 10 values function as a system rather than
independently. The centred hedonism score is interpreted as a measure of the
relative importance of hedonism to an individual in their whole value system.
The scores on the hedonism variable range from -3.76 to 2.90, where higher scores
indicate more hedonistic beliefs.

[2] Some individuals will tend to select responses from one side of the scale ("very much like me")
for any item, while others will select from the other side ("not like me at all"). If we ignore these
differences in response tendency we might incorrectly infer that the first type of individual
believes that all values are important, while the second believes that all values are unimportant.

We consider three countries (France, Germany and the UK) with a total sample
size of 5845. That is, we use a subsample of the original data.

Hedonism is taken as the outcome variable in our analysis. We consider four
explanatory variables:

- Age in years
- Gender (coded 0 for male and 1 for female)
- Country (coded 1 for the UK, 2 for Germany and 3 for France)
- Years of education.

An extract of the data is given below.

    Respondent   Hedonism   Age   Gender   Country   Education
    1              1.55      25     0         2         10
    2              0.76      30     0         2         11
    3             -0.26      59     0         2          9
    4             -1.00      47     1         3         10
    .               .         .     .         .          .
    .               .         .     .         .          .
    5845           0.74      65     0         1          9
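The centred hedonism score described in this section can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the ESS scoring code: the function name, the ordering of the items and the example ratings are all hypothetical, and it assumes the ratings have already been coded so that a higher number means "more like me".

```python
from statistics import mean


def centred_value_score(value_items, all_items):
    """Mean of the items for one value minus the mean of all value items."""
    return mean(value_items) - mean(all_items)


# Hypothetical respondent: 21 ratings on a 1-6 scale, assumed to be coded so
# that 6 = "very much like me". The first two are taken as the hedonism items.
ratings = [5, 6] + [4] * 19
hedonism_score = centred_value_score(ratings[:2], ratings)

# A respondent who gives the same rating to every item gets a centred score of
# zero, whichever end of the scale they favour.
flat_score = centred_value_score([6, 6], [6] * 21)

print(round(hedonism_score, 3), flat_score)
```

Subtracting the respondent's own overall mean is what removes differences in scale use: only the position of hedonism relative to that person's other values is left.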

C3.1 Regression with a Single Continuous Explanatory Variable

We will begin with a description of simple linear regression for studying the
relationship between a pair of continuous variables, which we denote by Y and X.
Simple regression is also commonly known as bivariate regression because only two
variables are involved.
Y is the outcome variable (also called a response or dependent variable)
X is the explanatory variable (also called a predictor or independent variable).

C3.1.1 Examining data graphically

Before carrying out a regression analysis, it is important to look at your data first.
There are various assumptions made when we fit a regression model, which we will
consider later, but there are two checks that should always be carried out before
fitting any models: i) examine the distribution of the variables and check that the
values are all valid, and ii) look at the nature of the relationship between X and Y.

Distribution of Y
We can examine the distribution of a continuous variable using a histogram. At
this stage, we are checking that the values appear reasonable. Are there any
outliers, i.e. observations outside the general pattern? Are there any values of
-99 in the data that should be declared as missing values? We also look at the
shape of the distribution: is it a symmetrical bell-shaped distribution (normal), or
is it skewed? Although it is the residuals [3] that are assumed to be normally
distributed in a multiple regression model, rather than the dependent variable, a
skewed Y will often produce skewed residuals. If the residuals turn out to be
non-normal, it may be possible to transform Y to obtain a normally distributed
variable. For example, a positively skewed distribution (with a long tail to the
right) will often look more symmetrical after taking logarithms.

[3] The residual for each observation is the difference between the observed value of Y and the value
of Y predicted by the model. See C3.1.2 for further details.

Figure 3.1 shows the distribution of the hedonism scores. It appears approximately
normal with no obvious outliers. The mean of the hedonism score is -0.15 and the
standard deviation is 0.97.

Figure 3.1. Histogram of hedonism

Distribution of X
For a regression analysis the distribution of the explanatory variable is
unimportant, but it is sensible to look at descriptive statistics for any variables
that we analyse to check for unusual values.

We will first consider age as an explanatory variable for hedonism. The age range
in our sample is 14 to 98 years with a mean of 46.7 and standard deviation of 18.1.

Relationship between X and Y
In its simplest form, a regression analysis assumes that the relationship between X
and Y is linear, i.e. that it can be reasonably approximated by a straight line. If
the relationship is nonlinear, it may be possible to transform one of the variables
to make the relationship linear or the regression model can be modified (see
C3.3.3). The relationship between two variables can be viewed in a scatterplot. A
scatterplot can also reveal outliers.

Figure 3.2 shows a scatterplot of hedonism versus age, where the size of the
plotting symbol is proportional to the number of respondents represented by a
particular data point. Also shown is what is commonly called the "line of best fit",
which we will come back to in a moment. The scatterplot shows a negative
relationship: as age increases then hedonism decreases. The Pearson correlation
coefficient for the linear relationship is -0.34.
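The numerical side of these preliminary checks (valid ranges, missing-value codes, and the Pearson correlation between X and Y) might be sketched as follows. The helper names and the missing-value code are assumptions made for illustration, and the data are just the five rows shown in the data extract, so the correlation printed here will not match the full-sample value of -0.34.

```python
from math import sqrt
from statistics import mean

MISSING_CODE = -99  # hypothetical missing-value code; check your own codebook


def describe(values):
    """Count, minimum, maximum and mean after dropping the missing-value code."""
    valid = [v for v in values if v != MISSING_CODE]
    return {"n": len(valid), "min": min(valid), "max": max(valid), "mean": mean(valid)}


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# The five respondents listed in the data extract (illustration only).
age = [25, 30, 59, 47, 65]
hedonism = [1.55, 0.76, -0.26, -1.00, 0.74]

print(describe(age))
print(describe(hedonism))
print(round(pearson_r(age, hedonism), 2))
```

A histogram and a scatterplot remain the main tools at this stage; summaries like these only complement them.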

Figure 3.2. Plot of hedonism by age

C3.1.2 The linear regression model

In a linear regression analysis, we fit a straight line to the scatterplot of Y against
X. The equation of a straight line is traditionally written as

    $y = mx + c$    (3.1)

where m is the gradient or slope of the line, and c is the intercept or the point at
which the line cuts the Y-axis (i.e. the value of y when x=0). The gradient is
interpreted as the change in y expected for a 1-unit change in x. In statistics, we
often refer to m and c as coefficients. A coefficient of a variable is a quantity that
multiplies it. The slope m is the coefficient of the predictor x, and the intercept c
is the coefficient of a variable which equals 1 for each observation (usually
referred to as the constant).

Because we will soon be adding more explanatory variables (Xs), it is convenient to
use a more general notation with coefficients represented by Greek betas ($\beta$).
Thus (3.1) becomes

    $y = \beta_0 + \beta_1 x$    (3.2)

so that the intercept is now denoted by $\beta_0$ and the slope by $\beta_1$. The subscripts on
the $\beta$s indicate the variable to which each coefficient is attached. We could have
written (3.2) as $y = \beta_0 x_0 + \beta_1 x_1$ where $x_0 = 1$ for every observation and $x_1 = x$.
Later we will be adding further explanatory variables ($x_2$, $x_3$ etc.) with
coefficients $\beta_2$, $\beta_3$, etc.

For a given individual i ($i = 1, 2, 3, \ldots, n$), we denote their value on Y by $y_i$ and
their value on X by $x_i$. (Note that when we consider more than one explanatory
variable, we will introduce a second subscript to index the variable. For example,
$x_{2i}$ will denote the value on variable $x_2$ for individual i.)

For individual i, the linear relationship between Y and X may be expressed as:

    $y_i = \beta_0 + \beta_1 x_i + e_i$    (3.3)

$e_i$ is called the residual and is the difference between the ith individual's actual
y-value and that predicted by their x-value. We know that we cannot perfectly
predict an individual's value on Y from their value on X; the points in a scatterplot
of x and y will never lie perfectly on a straight line (see Figure 3.2, for example).
The residuals represent the (vertical) scatter of points about the regression line.

The equation (3.3) is called the linear regression model. $\beta_0$ and $\beta_1$ are the
intercept and slope of the regression line in the population from which our sample
was drawn, and $e_i$ is the difference between an individual's y-value and the value
of y predicted by the population regression line. We estimate these quantities
using the sample data. Quantities such as $\beta_0$ and $\beta_1$ that relate to the population
are called parameters. Parameters are very often represented by Greek letters in
statistics [4].

[4] Other examples of parameters are the population mean $\mu$ (mu) and the population variance
$\sigma^2$ (sigma-squared).

We make the following assumptions about the residuals $e_i$:

i) The residuals are normally distributed with zero mean and variance $\sigma^2$
(spoken as "sigma-squared"). This assumption is often written in shorthand
as $e_i \sim N(0, \sigma^2)$.

ii) The variance of the residuals is constant, whatever the value of x. This
means that if we take a "slice" through the scatterplot of y versus x at any
particular value of x, the y values have approximately the same variation as
at any other value of x. If the variance is constant, we say the residuals are
homoskedastic. Otherwise they are said to be heteroskedastic.

iii) The residuals are not correlated with one another, i.e. they are
independent. Correlations might arise if some individuals contribute more
than one observation (e.g. repeated measures) or if individuals are
clustered in some way (e.g. in schools). If it is suspected that residuals are
correlated, the regression model needs to be modified, e.g. to a multilevel
model (see Module 5).

If these assumptions are not met the estimate of $\beta_0$, and more importantly $\beta_1$,
may be biased and imprecise.

C3.1.3 The fitted regression line

In linear regression analysis, $\beta_0$ and $\beta_1$ are estimated from the data using a
method called least squares in which the sum of the squared residuals is
minimized [5]. (Responses with other scales of measurement require other
techniques, but all of them are based on the same underlying principle of
minimizing the "poorness of fit" between the actual data points and the fitted
model.)

By applying the method of least squares to our sample data, we obtain an estimate
of the underlying population value of the intercept and of the slope. These
estimates are denoted by $\hat{\beta}_0$ and $\hat{\beta}_1$ (spoken as "beta-0-hat" and "beta-1-hat").
The predicted value of y for individual i is denoted by $\hat{y}_i$ and is calculated as:

    $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$    (3.4)

The equation (3.4) is the equation of the estimated or fitted regression line. The
predicted value $\hat{y}_i$ is the point on the fitted line corresponding to $x_i$.

If we regress hedonism on age we obtain $\hat{\beta}_0 = 0.712$ and $\hat{\beta}_1 = -0.018$, and the fitted
regression line is written (substituting HED for y and AGE for x) as:

    $\widehat{\mathrm{HED}}_i = 0.712 - 0.018\,\mathrm{AGE}_i$.

The slope estimate tells us that for every extra year of age, hedonism is predicted
to decrease by 0.018. Importantly, the decrease in hedonism expected for an
increase from 14 to 15 years old is the same as for an increase from 54 to 55 years
old. This is a direct consequence of assuming that the underlying functional form
of the model is linear and fitting a linear equation.

We can use the fitted line to predict an individual's hedonism based on their age.
So, for example, for an individual of age 25 we would predict a hedonism score of
$0.712 - (0.018 \times 25) = 0.262$. In contrast, we would predict a score of -0.188 for
someone of age 50. The regression line is the "line of best fit" shown in Figure 3.2.
Most statistical packages will report the results of a regression analysis in tabular
form, e.g. as in Table 3.1.

Table 3.1. Results from a simple regression of hedonism on age

                Coefficient
    Constant       0.712
    Age           -0.018
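As a rough illustration of the least squares calculation and of prediction from the fitted line, the sketch below uses the standard closed-form solution for a single predictor. The function names `fit_line` and `predict` and the three-point toy data set are hypothetical; only the coefficients 0.712 and -0.018 are taken from Table 3.1.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


def predict(intercept, slope, x):
    """Point on the fitted line at x."""
    return intercept + slope * x


# Toy data lying exactly on y = 1 + 2x, so the fit recovers (1, 2).
toy_intercept, toy_slope = fit_line([0, 1, 2], [1, 3, 5])

# Predictions from the estimates reported in Table 3.1.
b0_hat, b1_hat = 0.712, -0.018
pred_age_25 = predict(b0_hat, b1_hat, 25)
pred_age_50 = predict(b0_hat, b1_hat, 50)

print(toy_intercept, toy_slope)
print(round(pred_age_25, 3), round(pred_age_50, 3))  # 0.262 -0.188
```

In practice a statistical package such as MLwiN would be used; the closed form is shown only to make the link between the data and the two estimates explicit.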

[5] A description of least squares can be found at
http://mathforum.org/dynamic/java_gsp/squares.html

Interpretation of the intercept and slope estimates

$\hat{\beta}_0 = 0.712$ is the predicted value of Y when X=0. So we would expect someone of
age zero to have a hedonism score of 0.712. Because the minimum age in the
sample is 14, this is not very informative.

$\hat{\beta}_1 = -0.018$ is the predicted change in Y for a 1 unit change in X. So we expect a
decrease of 0.018 in the hedonism score for each 1 year increase in age.

Centring

Continuous variables are often centred about the mean so that the intercept has a
more meaningful interpretation. For example, we would centre the variable AGE
by subtracting the sample mean of 46 years from each of its values. If we then
repeat the regression analysis replacing AGE by AGE-46, the intercept becomes the
predicted value of Y when AGE-46=0, i.e. when AGE=46. Rather than a prediction
for a baby of 0 years, which is well outside the age range in the sample, the
intercept now gives a prediction for a 46 year old adult.

The intercept in the analysis based on centred AGE is estimated as -0.139, which is
the predicted hedonism score for a 46 year old. Centring does not affect the
estimate of the slope because only the origin of X has been shifted; its scale
(standard deviation) has not changed.

Standardisation and standardised coefficients

Sometimes X is standardised, which involves subtracting the sample mean and then
dividing the result by the standard deviation:

    $\dfrac{X - \text{mean of } X}{\text{SD of } X}$.
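The effect of centring and standardising X on the fitted line can be checked numerically. The sketch below is an illustration under assumed data: the ages and scores are made up, and `fit_line` is a hypothetical helper implementing the usual closed-form least squares solution.

```python
from statistics import mean, stdev


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical ages and hedonism scores, for illustration only.
x = [20, 30, 40, 50, 60, 70]
y = [0.4, 0.1, 0.0, -0.3, -0.2, -0.6]

x_mean, x_sd = mean(x), stdev(x)
x_centred = [a - x_mean for a in x]
x_standardised = [(a - x_mean) / x_sd for a in x]

raw_intercept, raw_slope = fit_line(x, y)
cen_intercept, cen_slope = fit_line(x_centred, y)
std_intercept, std_slope = fit_line(x_standardised, y)

# Centring: same slope; the intercept is now the prediction at the mean of X.
print(raw_slope, cen_slope)
print(cen_intercept, raw_intercept + raw_slope * x_mean)

# Standardising: the slope is the raw slope multiplied by the SD of X.
print(std_slope, raw_slope * x_sd)
```

The same check applied to the module's own estimates is consistent with the reported figures: -0.018 multiplied by the age standard deviation of 18.1 is close to the standardised-age slope quoted in the text, allowing for rounding of the slope.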

Standardising a variable forces it to have a mean of zero and a standard deviation
of 1, while centring shifts only the origin and leaves the scale unaltered.
After standardisation a unit corresponds to one standard deviation, so if X is
standardised its slope is interpreted as the change in Y expected for a one
standard deviation change in X.

Sometimes standardised coefficients are calculated. In simple regression the
standardised coefficient of X is the slope that would be obtained if X and Y had
both been standardised, which is equivalent to the Pearson correlation coefficient.
The standardised coefficient of X is interpreted as the number of standard
deviation units change in Y that we would expect for each standard deviation
change in X. While standardised coefficients put each variable on the same scale,
and may therefore be useful for comparing the effect of X on Y in different
subpopulations, the natural meaning of the X and Y variables is lost. The use and
interpretation of standardised coefficients in multiple regression is discussed in
C3.3.2.

When age is standardised the estimated intercept and slope of the regression line
are $\hat{\beta}_0 = -0.151$ and $\hat{\beta}_1 = -0.335$. If we also standardise hedonism, we obtain
$\hat{\beta}_1 = -0.343$ (now a standardised coefficient), which is equal to the Pearson correlation
coefficient given earlier in C3.1.1 (reported there to two decimal places as -0.34).

Important note: We cannot claim that there is a causal relationship between X
and Y from such a simple model or, indeed, from any regression model applied to
observational data. So when interpreting the slope it is better to avoid
statements like "a change in X leads to or causes an increase in Y". Taking
account of other factors would provide stronger evidence of a causal relationship
if the original relationship did not change as additional predictors are included in
the model.

C3.1.4 Explained and unexplained variance and R-squared

All statistical models have a common form:

    Response = Systematic part + Random part

where for a simple regression (3.3) the systematic part is $\beta_0 + \beta_1 x_i$ and the
random part is the residual $e_i$.

The systematic part gives the average relationship between the response and the
predictor(s), while the random part is what is left over (the unexplained part)
after taking account of the included predictor(s). Figure 3.2 displays the values on
X and Y for individuals in the sample, and a straight line that we have threaded
through the (X, Y) data points to represent the systematic relation between
hedonism and age. The line represents the fitted values, e.g. if you are 20 years
old you are predicted to have a hedonism score of about 0.3.

The term "random" means "allowed to vary" and, in relation to Figure 3.2, the
random part is the portion of hedonism that is not accounted for by the underlying
average relationship with age. Some people are more and some less hedonistic
given their age. The residual is the difference between the actual and predicted
hedonism. In some cases there will be a close fit between the actual and fitted
values, e.g. if differences in age explain most of the variability in hedonism. In
other cases there may be a lot of noise, e.g. if, for any given age, there is a wide
range of hedonism scores. It is helpful to characterise this residual variability. To
do so requires us to make some assumptions about the residuals (normality and
homoskedasticity; see C3.1.2). Under these assumptions we can summarise the
variability in a single statistic, the variance of the residuals $\sigma^2$. We can think of
the residual variance as the part of the variance in Y that is unexplained by X.
The part of the variance in Y that is explained by X (the systematic part of the
model) is called the explained variance in Y. For Figure 3.2 the residual or
unexplained variance is 0.84. The total variance in hedonism scores (which is the
sum of the explained and unexplained variances) is 0.95, so by subtraction the
explained variance is 0.11.

Another key summary statistic is the R-squared ($R^2$) value which gives the
correspondence between the actual and fitted values, on a scale between zero (no
correspondence) and 1 (complete correspondence). R-squared can also be
interpreted as the proportion of the total variance in Y that can be explained by
variability in X. For the hedonism data, R-squared = 0.11/0.95 = 0.12 so 12% of the
variance in hedonism scores can be explained by age.

In the case of simple regression, R-squared is the square of the Pearson correlation
coefficient.
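The variance decomposition behind R-squared can be written out as a short calculation. The first part below uses only the variances reported in the text (0.95 total, 0.84 residual); the function `r_squared` and the toy values used to exercise it are hypothetical additions for illustration.

```python
from statistics import mean


def r_squared(y, fitted):
    """Proportion of the variation in y accounted for by the fitted values."""
    y_mean = mean(y)
    ss_total = sum((obs - y_mean) ** 2 for obs in y)
    ss_residual = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
    return 1 - ss_residual / ss_total


# Reported variances for the regression of hedonism on age.
total_variance = 0.95
residual_variance = 0.84
explained_variance = total_variance - residual_variance
r2_reported = explained_variance / total_variance
print(round(explained_variance, 2), round(r2_reported, 2))  # 0.11 0.12

# Two extreme cases on toy data: a perfect fit gives 1, and predicting the
# mean for everyone gives 0.
toy_y = [1.0, 2.0, 3.0]
print(r_squared(toy_y, toy_y), r_squared(toy_y, [2.0, 2.0, 2.0]))
```

The reported value also agrees with the squared correlation: the standardised coefficient of -0.343 squared is approximately 0.12.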

Module 3 (Concepts): Multiple Regression

Module 3 (Concepts): Multiple Regression

C3.1.5 Hypothesis testing

C3.1.5 Hypothesis testing

C3.1.5

Hypothesis testing

We must bear in mind that the estimates of the intercept and slope are subject to
sampling variability, as is any statistic calculated from a sample. While we have
established that there is a negative relationship between hedonism and age in our
sample, we are really interested in their relationship in the population from which
our sample was drawn (the combined populations of France, Germany and the UK).
In other words, is the relationship statistically significant, or could we have got
such a result by chance?

The null hypothesis ($H_0$) for our test is that there is no relationship between
hedonism and age in the population, in which case $\beta_1 = 0$. The alternative
hypothesis ($H_A$) is that there is a relationship, i.e. $\beta_1 \neq 0$.

The test of a relationship between hedonism and age is based on the estimate of
the slope of the relationship and a measure of the precision of this estimate. The
standard error is a measure of imprecision, where large values indicate greater
uncertainty about the true (population) value. The standard error is inversely
related to sample size, so that the precision of the estimate of $\beta_1$ increases as the
sample size increases. The standard error also depends on the amount of
variability in X and the amount of variance in Y that is unexplained by X (the
residual variance): the standard error decreases as the variance in X increases, and
increases as the residual variance increases.

In our example, the standard error of $\hat{\beta}_1$ is 0.001 and a 95% confidence interval
for $\beta_1$ is therefore

$\hat{\beta}_1 \pm 1.96 \, SE(\hat{\beta}_1) = -0.018 \pm (1.96 \times 0.001) = (-0.020, -0.016)$ [6]

Zero (the value of $\beta_1$ under $H_0$) is well outside the 95% confidence interval, so we
reject the null hypothesis and conclude that the relationship is statistically
significant at the 5% level.

We can also calculate a confidence interval for the population intercept $\beta_0$, but
the slope is of principal interest since it measures the relationship between X and
Y.

Alternatively, but equivalently, we can calculate the test statistic (often called
the Z or t-ratio)

$Z = \hat{\beta}_1 / SE(\hat{\beta}_1) = -0.018 / 0.001 = -27.9$

(calculated from the unrounded estimate and standard error), which is compared to a
normal distribution (or a t distribution if the sample size is small). In this case
the p-value is tiny, less than 0.001. If there were no relationship between hedonism
and age in the population (i.e. the true slope is zero), we would expect less than
0.1% of samples from that population to produce a slope estimate of magnitude
greater than 0.018.

In the practice sections, we will generally use the Z-ratio to test significance
rather than calculating confidence intervals.

C3.1.6 Model checking

A number of assumptions lie behind a regression model. These were given in
C3.1.2 but, briefly, we assume:

i) The residuals $e_i$ are normally distributed.

ii) The variance of the residuals is constant, whatever the value of x, i.e. the
residuals are homoskedastic.

iii) The residuals are not correlated with one another, i.e. they are
independent.

We can check the validity of assumptions i) and ii) by examining plots of the
estimated residuals. If it is suspected that residuals might be correlated because
the data are clustered in some way, we can test assumption iii) by comparing a
multilevel model, which accounts for clustering, with a multiple regression model
which ignores clustering (see Module 5).

To check assumptions about $e_i$, we use the estimated residuals, which are the
differences between the observed and predicted values of y:

$\hat{e}_i = y_i - \hat{y}_i$

We usually work with the standardized residuals $r_i$, which we obtain by dividing
the $\hat{e}_i$ by their standard deviation.

[6] -1.96 and +1.96 are the 2.5% and 97.5% points of a standard normal distribution
(one with a mean of zero and a standard deviation of one). The middle 95% of the
distribution lies between these points.
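The calculations described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the standard error uses the usual textbook formula (residual variance on n - 2 degrees of freedom divided by the sum of squared deviations of x), the toy data are invented, and the standardized residuals simply divide by the sample standard deviation of the residuals. Because the slope and standard error quoted in the text are rounded, the Z-ratio computed from them will not reproduce the value obtained from the unrounded estimates.

```python
import math
from statistics import NormalDist, mean, stdev


def simple_regression(x, y):
    """Least-squares fit of y = b0 + b1*x; returns b0, b1, SE(b1), residuals."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Residual variance on n - 2 degrees of freedom (textbook formula, assumed here).
    resid_var = sum(e ** 2 for e in residuals) / (n - 2)
    se_b1 = math.sqrt(resid_var / sxx)
    return b0, b1, se_b1, residuals


def slope_inference(b1, se_b1, level=0.95):
    """Confidence interval, Z-ratio and two-sided normal p-value for a slope."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    ci = (b1 - z_crit * se_b1, b1 + z_crit * se_b1)
    z_ratio = b1 / se_b1
    p_value = 2 * (1 - NormalDist().cdf(abs(z_ratio)))
    return ci, z_ratio, p_value


# Rounded slope and standard error quoted in the text.
ci, z_ratio, p_value = slope_inference(-0.018, 0.001)
print("95% CI:", tuple(round(v, 3) for v in ci))
print("Z-ratio from rounded inputs:", round(z_ratio, 1))

# Toy data (hypothetical) to illustrate estimated and standardized residuals.
toy_x = [20, 30, 40, 50, 60, 70]
toy_y = [0.5, 0.1, 0.2, -0.3, -0.2, -0.6]
b0, b1, se_b1, residuals = simple_regression(toy_x, toy_y)
sd_resid = stdev(residuals)
std_residuals = [e / sd_resid for e in residuals]
print("Toy slope:", round(b1, 4), "SE:", round(se_b1, 4))
```

The interval agrees with the one given in the text to three decimal places; the toy fit shows that least-squares residuals sum to zero, which is why dividing by their standard deviation is enough to standardize them.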

Centre for Multilevel Modelling, 2008


Checking the normality assumption


We can check whether residuals are normally distributed by looking at a histogram
or a normal probability plot of the standardized residuals. If the normality
assumption holds, the points in a normal plot should lie on a straight line.

Figure 3.3 and Figure 3.4 show a histogram and normal probability plot of residuals
from a simple regression model with age. Both plots suggest that the normal
distribution assumption is reasonable here.
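As a rough sketch of what a normal probability plot compares, the code below computes the two coordinates of each plotted point: the observed cumulative probability of each ordered residual and the cumulative probability expected under a standard normal distribution. The function names, the plotting position (rank - 0.5)/n and both sets of residuals are assumptions made for illustration; statistical packages may use slightly different conventions.

```python
import math
from statistics import NormalDist


def pp_plot_points(std_residuals):
    """Return (observed, expected) cumulative probabilities for a normal P-P plot."""
    n = len(std_residuals)
    ordered = sorted(std_residuals)
    normal = NormalDist()
    # Plotting position (rank - 0.5) / n is one common convention (an assumption).
    observed = [(rank - 0.5) / n for rank in range(1, n + 1)]
    expected = [normal.cdf(r) for r in ordered]
    return list(zip(observed, expected))


def max_gap(points):
    """Largest distance of the plotted points from the 45-degree line."""
    return max(abs(obs - exp) for obs, exp in points)


n = 200
probs = [(k - 0.5) / n for k in range(1, n + 1)]
# Hypothetical residuals: exact normal quantiles, and a right-skewed set built
# from exponential quantiles shifted to have mean close to zero.
normal_resid = [NormalDist().inv_cdf(p) for p in probs]
skewed_resid = [-math.log(1 - p) - 1 for p in probs]

normal_gap = max_gap(pp_plot_points(normal_resid))
skewed_gap = max_gap(pp_plot_points(skewed_resid))
print("Normal residuals, largest gap:", round(normal_gap, 4))
print("Skewed residuals, largest gap:", round(skewed_gap, 4))
```

Perfectly normal residuals fall on the straight line, while the skewed residuals bow away from it, which is the pattern to look for in the plot.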

Checking the homoskedasticity assumption


To check that the variance of the residuals is fairly constant across the range of X,
we can examine a plot of the standardized residuals against X and check that the
vertical scatter of the residuals is roughly the same for different values of X with
no funnelling.
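A numerical companion to the visual check is to compare the spread of the standardized residuals within bands of X. The sketch below is illustrative only: `spread_by_group` is a hypothetical helper, the choice of three equal-sized bands is arbitrary, and the two sets of residuals are invented to show a constant spread and a funnel shape.

```python
from statistics import pstdev


def spread_by_group(x, std_residuals, n_groups=3):
    """Standard deviation of residuals within equal-sized bands of sorted x."""
    pairs = sorted(zip(x, std_residuals))
    size = len(pairs) // n_groups
    spreads = []
    for g in range(n_groups):
        start = g * size
        # The last band absorbs any leftover observations.
        end = (g + 1) * size if g < n_groups - 1 else len(pairs)
        spreads.append(pstdev([r for _, r in pairs[start:end]]))
    return spreads


x = list(range(1, 13))
signs = [1 if i % 2 else -1 for i in x]
constant_resid = [float(s) for s in signs]         # same scatter at every x
funnel_resid = [float(s * i) for s, i in zip(signs, x)]  # scatter grows with x

constant_spreads = spread_by_group(x, constant_resid)
funnel_spreads = spread_by_group(x, funnel_resid)
print("Constant spread:", [round(s, 2) for s in constant_spreads])
print("Funnel shape:   ", [round(s, 2) for s in funnel_spreads])
```

Roughly equal values across the bands are consistent with homoskedasticity; values that rise or fall steadily correspond to the funnelling described above.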

Figure 3.3. Histogram of $r_i$ (horizontal axis: standardized residual; vertical axis: frequency)

Figure 3.4. Normal probability plot of $r_i$ (horizontal axis: observed cumulative probability; vertical axis: expected cumulative probability)

Figure 3.5 shows a plot of $r_i$ versus $x_i$. The vertical spread of the points appears
fairly equal across different values of X, so we conclude that the assumption of
homoskedasticity is reasonable.

Figure 3.5. Plot of $r_i$ versus $x_i$ (horizontal axis: age, in number of years, in 2002; vertical axis: standardized residual)



Outliers

We can also check for outliers using any of the above residual plots. An outlier is a
point with a particularly large residual. We would expect approximately 95% of
the standardized residuals to lie between -2 and +2.

Of major interest, however, is whether an outlier has undue influence on our
results. For example, in simple regression, an outlier with very large values on X
and Y could push up a positive slope. A straightforward way to judge the
influence of an outlier is to refit the regression line after excluding it. If the
results are very similar to those based on all observations, we would conclude that
the outlier does not have undue influence. An observation's influence can also be
measured by a statistic called Cook's D (see C3.5.3).

Don't forget to do the practical for this section! (see page 2 for
details of how to find the practical)
Please read P3.1, which is available in online form or as part of a pdf file.

Don't forget to take the online quiz for this section!
(see page 2 for details of how to find the quiz questions)

C3.2 Comparing Groups: Regression with a Single Categorical Explanatory Variable

When X is continuous, we are fitting a straight line relationship. Regression can
also be applied when X is categorical, in which case we are allowing the mean of Y
to be potentially different for the different categories of X.

C3.2.1 Comparing two groups

Suppose that a categorical explanatory variable X has only two categories. We
wish to compare the mean of our response variable Y for the two groups defined by
these categories.

We will examine whether there are gender differences in hedonism. In the human
values dataset there is a variable called SEX which is coded 1 for female, and 0 for
male. Variables that have codes of 0 and 1 are often called dummy variables. If
we simply calculate the mean of our response variable HED for men and women,
we obtain the results given in Table 3.2.

Table 3.2. Descriptive statistics for hedonism by sex

          Sample size   Mean hedonism score
Women     2747          -0.225
Men       3098          -0.069

So the (female - male) difference in means is -0.225 - (-0.069) = -0.156.

Normal (t) test for comparing two independent samples

We can use a normal test (or t-test if the sample is small) to test for a difference
between women and men in the population. The null hypothesis for the test is
that the difference between the mean hedonism scores of women and men in the
population is zero. The test statistic is -6.12 and the p-value is less than 0.0001.
A 95% confidence interval for the difference between the female and male population
means is (-0.206, -0.106), which does not contain the null value of zero. We
therefore conclude that the difference between women's and men's hedonism
scores is statistically significant (at the 0.01% level).
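To show the mechanics of such a test, here is a minimal sketch that compares two groups from raw scores. Everything in it is an assumption made for illustration: the function name is hypothetical, the two short lists of scores are invented (they are not the survey data), and the standard error uses the unpooled formula, which is only one of the variants a package might apply. With samples this small a t distribution would normally be used instead of the normal.

```python
import math
from statistics import NormalDist, mean, variance


def compare_two_groups(group_a, group_b, level=0.95):
    """Difference in means (a - b) with its SE, Z statistic, CI and p-value."""
    diff = mean(group_a) - mean(group_b)
    # Unpooled standard error of the difference between two independent means.
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    z_stat = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    ci = (diff - z_crit * se, diff + z_crit * se)
    return diff, se, z_stat, ci, p_value


# Invented scores for two small groups (hypothetical, for illustration only).
women = [-0.9, -0.2, 0.1, -0.5, -0.3, 0.2]
men = [0.1, -0.4, 0.3, 0.0, -0.1, 0.5]
diff, se, z_stat, ci, p_value = compare_two_groups(women, men)
print("difference:", round(diff, 3), "SE:", round(se, 3))
print("Z:", round(z_stat, 2), "CI:", tuple(round(v, 3) for v in ci))

# The difference in means reported in the text, from the two group means.
table_diff = -0.225 - (-0.069)
print("difference from Table 3.2:", round(table_diff, 3))
```

The null hypothesis of no difference is rejected at the 5% level when the interval excludes zero, which is the same decision as comparing the Z statistic with plus or minus 1.96.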

Comparing two groups using regression

We can also compare groups using a regression model. The advantage of using
regression, rather than a normal (or t) test, is that in a regression model we can
allow for the effects of other variables as well as gender. To start with, however,
we will consider gender as the only explanatory variable and demonstrate how
men's and women's hedonism scores can be compared using regression.


Suppose we fit the simple regression model

$y_i = \beta_0 + \beta_1 x_i + e_i$

where $y_i$ is the hedonism score of individual i, and $x_i = 1$ if the individual is a
woman, and 0 if the individual is a man [7].

Table 3.3. Regression of hedonism on sex

            Coefficient   Standard error
Constant    -0.069        0.019
Sex         -0.156        0.025

The regression output is given in Table 3.3, from which we obtain the fitted
regression equation:

$\widehat{HED}_i = -0.069 - 0.156 \, SEX_i$

We can use this equation to predict HED for men and women:
For men (SEX=0), $\widehat{HED} = -0.069 - (0.156 \times 0) = -0.069$
For women (SEX=1), $\widehat{HED} = -0.069 - (0.156 \times 1) = -0.225$

Notice that these predicted values are just the mean hedonism scores for men and
women, and that the coefficient of SEX is the difference between these means
(women's mean - men's mean, since SEX is coded 1 for women here).

The null hypothesis that there is no difference between the mean score for men
and women in the population can be expressed as $H_0$: $\beta_1 = 0$. The standard error
of $\hat{\beta}_1$ is 0.025 and the Z-ratio is therefore $-0.156 / 0.025 = -6.12$ (calculated
from the unrounded values). The 95% confidence interval for $\beta_1$ is (-0.206, -0.106).

Note that these results are exactly the same as those for the independent samples
comparison of means test given earlier. So if SEX is the only explanatory variable,
a regression analysis gives exactly the same results as a t-test. But only in a
regression analysis can we include other explanatory variables.

C3.2.2 Comparing more than two groups

Suppose now that a categorical explanatory variable X has three categories. We
wish to compare the mean of our outcome variable Y for the three groups defined
by these categories.

The respondents in the hedonism example come from three countries. The mean
of HED for each country is given in Table 3.4.

Table 3.4. Descriptive statistics for hedonism by country

           Sample size   Mean hedonism score
UK         1748          -0.384
Germany    2785          -0.128
France     1312          0.108

Analysis of Variance (ANOVA)

The standard way to compare more than two groups is to use analysis of variance
(ANOVA) [8]. The null hypothesis is that there is no difference between groups (i.e.
that the group means are all equal). Table 3.5 shows the results from an ANOVA
for a comparison of hedonism for the three countries. When there is just one
categorical variable, this type of analysis is usually called a one-way ANOVA.

Table 3.5. Analysis of variance of country differences in hedonism

                     Sum of squares   d.f.   Mean square   F statistic   p-value
Between countries    184.5            2      92.3          100.4         <0.001
Within countries     5370.7           5842   0.9
Total                5555.2           5844

Note: d.f. is degrees of freedom [9]

The tiny p-value suggests that we can reject the null hypothesis and conclude that
there are significant between-country differences in hedonism.

[7] Note that it would not be sensible to centre or standardize a binary variable.

[8] See, for example, http://www.animatedsoftware.com/statglos/sg_anova.htm

[9] Discussions of degrees of freedom can be found at
http://www.animatedsoftware.com/statglos/sgdegree.htm
and http://davidmlane.com/hyperstat/A42408.html
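The arithmetic behind the one-way ANOVA table for country differences can be checked with a short sketch. The function name is hypothetical; the sums of squares are the rounded values from Table 3.5 and the sample sizes come from Table 3.4, so the F statistic computed here differs slightly from the published one because of rounding.

```python
def one_way_anova_f(ss_between, ss_within, n_groups, n_obs):
    """Degrees of freedom, mean squares and F statistic for a one-way ANOVA."""
    df_between = n_groups - 1
    df_within = n_obs - n_groups
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    f_stat = ms_between / ms_within
    return df_between, df_within, ms_between, ms_within, f_stat


# Sample sizes for the UK, Germany and France (Table 3.4).
n_obs = 1748 + 2785 + 1312

# Rounded sums of squares from Table 3.5.
df_b, df_w, ms_b, ms_w, f_stat = one_way_anova_f(184.5, 5370.7, 3, n_obs)
print("d.f.:", df_b, df_w)
print("mean squares:", round(ms_b, 2), round(ms_w, 2))
print("F statistic:", round(f_stat, 1))
```

Each mean square is a sum of squares divided by its degrees of freedom, and the F statistic is the ratio of the between-country to the within-country mean square.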


Comparing groups using multiple regression

The statistical model behind ANOVA is in fact a multiple regression model. But
rather than including country as an explanatory variable [10], we create dummy
variables for two of the three countries and include these. Suppose we create
three variables which indicate whether a respondent is from a particular country,
i.e.

UK      = 1 if respondent is from the UK, = 0 if from Germany or France
GERM    = 1 if respondent is from Germany, = 0 if from UK or France
FRANCE  = 1 if respondent is from France, = 0 if from UK or Germany

These variables are called dummy variables [11]. In fact we do not need all three of
these variables because if we know a respondent's value on two of them, we can
infer their value on the third. E.g. if we know that UK=0 and GERM=1, then we
know that FRANCE=0. (A respondent can only be living in one country at the time
of survey, so only one of UK, GERM and FRANCE can equal 1 for any given
individual.) By the same argument, when we have a categorical variable with only
two categories (e.g. our SEX variable in C3.2.1) we do not need to create any
additional variables. SEX is already a dummy variable, and can therefore be
included directly in the model as an explanatory variable.

To allow for differences between the UK, Germany and France, we choose
(arbitrarily) two of the country dummy variables and include those as explanatory
variables. Suppose we choose GERM and FRANCE, then the multiple regression
model is:

$HED_i = \beta_0 + \beta_1 GERM_i + \beta_2 FRANCE_i + e_i$

Table 3.6 shows the results from fitting this model.

Table 3.6. Regression analysis of hedonism on country, UK taken as reference

              Coefficient   Standard error   Z-ratio   p-value
Constant      -0.384        0.023
Country
  Germany     0.256         0.029            8.765     <0.001
  France      0.492         0.035            14.052    <0.001

The fitted regression equation is:

$\widehat{HED}_i = -0.384 + 0.256 \, GERM_i + 0.492 \, FRANCE_i$

We can use this equation to predict the hedonism score for inhabitants of each
country.

For UK residents (GERM=0, FRANCE=0):
$\widehat{HED}_i = -0.384 + (0.256 \times 0) + (0.492 \times 0) = -0.384$

For Germans (GERM=1, FRANCE=0):
$\widehat{HED}_i = -0.384 + (0.256 \times 1) + (0.492 \times 0) = -0.128$

For French residents (GERM=0, FRANCE=1):
$\widehat{HED}_i = -0.384 + (0.256 \times 0) + (0.492 \times 1) = 0.108$

Notice that these predictions give exactly the same results as the country means in
Table 3.4. We obtain the prediction for the UK directly; the estimate of the
intercept will always equal the mean for the omitted category, i.e. the category
for which we do not include a dummy variable in the model. The coefficients $\beta_1$
and $\beta_2$ are interpreted as differences between one of the other countries and the
omitted category country.

$\hat{\beta}_1 = 0.256$ is the difference between the means for Germany and the UK
$\hat{\beta}_2 = 0.492$ is the difference between the means for France and the UK

The UK is the reference category. (If we had included the UK and FRANCE dummy
variables in the model, then Germany would have been the reference.)

[10] It would not make sense to fit a model of the form
$HED_i = \beta_0 + \beta_1 COUNTRY_i + e_i$ because the coding of COUNTRY is arbitrary
(i.e. COUNTRY is a nominal variable). In such a model, $\beta_1$ would be interpreted as
the effect on HED of a 1 unit change in COUNTRY, but a 1 unit change in COUNTRY
has no meaning!

[11] This is the most common way of coding dummy variables for a categorical variable
and is often called simple coding, but other types of coding are possible depending
on which comparisons are of interest. A comprehensive discussion of alternative
coding systems can be found at
http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter5/statareg5.htm
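The dummy coding and the predictions from the fitted equation can be illustrated with a small sketch. The helper names are hypothetical, and the coefficients are the rounded estimates from the regression with the UK as reference; the last step re-expresses the same fitted means with Germany as the reference category, using only arithmetic on those coefficients.

```python
CATEGORIES = ("UK", "GERM", "FRANCE")


def dummy_code(country, reference="UK"):
    """0/1 indicator for every category except the reference category."""
    return {c: int(country == c) for c in CATEGORIES if c != reference}


# Rounded estimates with the UK as the reference category.
B0, B_GERM, B_FRANCE = -0.384, 0.256, 0.492


def predict_hed(country):
    """Predicted hedonism score from the fitted equation."""
    d = dummy_code(country)
    return B0 + B_GERM * d["GERM"] + B_FRANCE * d["FRANCE"]


predictions = {c: round(predict_hed(c), 3) for c in CATEGORIES}
print(predictions)

# Changing the reference category to Germany: the intercept becomes the German
# mean and each coefficient becomes a difference from Germany.
b0_germ_ref = B0 + B_GERM
b_uk = -B_GERM
b_france = B_FRANCE - B_GERM
print(round(b0_germ_ref, 3), round(b_uk, 3), round(b_france, 3))
```

The choice of reference category changes which contrasts are read directly from the coefficients, but not the fitted mean for any country.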


The remaining contrast is between Germany and France, which is estimated as
$\hat{\beta}_1 - \hat{\beta}_2 = -0.236$.

The null hypothesis for testing whether there is a difference between the mean
hedonism scores in the populations of Germany and the UK can be expressed as
$H_0$: $\beta_1 = 0$. Similarly the null for testing whether there is a difference between
France and the UK is $H_0$: $\beta_2 = 0$. The simplest way to compare Germany and France
would be to refit the model, making one of these countries the reference category.
Table 3.7 shows the results when Germany is taken as the reference, i.e. when the
UK and FRANCE dummies are included in the model. The difference between the
means for France and Germany is now obtained directly (from the coefficient of
the FRANCE dummy) as 0.236.

Table 3.7. Regression analysis of hedonism on country, Germany as reference

              Coefficient   Standard error   Z-ratio   p-value
Constant      -0.128        0.018
Country
  UK          -0.256        0.029            -8.765    <0.001
  France      0.236         0.032            7.341     <0.001

All coefficients in Table 3.6 and Table 3.7 are significantly different from zero (all
p-values are <0.001), so we conclude that all pairwise differences between
countries are significant.

The above tests are for comparing pairs of countries. In ANOVA, the null
hypothesis is that the means for all three countries are equal. In most statistical
packages, an ANOVA table is given as part of the standard regression analysis
output. The only difference between the regression ANOVA table and the one-way
ANOVA table is that the between-country sum of squares would usually be called
the regression sum of squares, and within-country would be replaced by
residual. All numerical results would be exactly the same. In regression terms,
the null hypothesis being tested is that all coefficients are zero, which in this case
is that $\beta_1 = \beta_2 = 0$. This is just another way of saying that all three country means
are equal.

The advantage of multiple regression over one-way ANOVA is that regression can
allow for the effects of several explanatory variables simultaneously [12].

C3.2.3 Comparing a large number of groups

Suppose that instead of three countries, we had 20 or more countries that we
wished to compare. One approach would be to include dummy variables for all but
one of the countries (19 dummy variables if there are 20 countries). However, there
are some potential drawbacks of this approach:

i) 19 is a large number of coefficients to estimate! Adding interactions
between countries and other explanatory variables (see C3.4) will lead to
even more parameters. Also, if the sample sizes within some countries are
small, the estimates of the coefficients of the dummy variables for those
countries may be unreliable.

ii) Suppose we wish to estimate the effects of country characteristics, e.g. the
effects on hedonism of cultural factors, such as religiosity, or economic
status. It can be shown that it is not possible to estimate the effects of
these variables as well as the coefficients of the country dummy variables.
(This is because any country-level variable can be expressed as a linear
function of the country dummy variables.)

In other applications, the sampled groups may be regarded as a random sample
from a larger population. For example, we may have data on a sample of schools.
In such cases it is the population of groups from which our sample was drawn that
is of interest. However, the origins of the dummy variable (or ANOVA) approach
lie in experimental design where there is typically a small number of groups to be
compared and all groups of interest are sampled. The ANOVA approach does not
allow us to make inferences beyond the groups in our sample.

An approach that overcomes these problems is multilevel modelling (see Module
5).

Don't forget to do the practical for this section! (see page 2 for
details of how to find the practical)
Please read P3.2, which is available in online form or as part of a pdf file.

Don't forget to take the online quiz! (see page 2 for details of how
to find the quiz questions)

[12] A multiple regression analysis with two categorical explanatory variables is
sometimes called a two-way ANOVA, while a regression with a mixture of categorical
and continuous variables is called an analysis of covariance (ANCOVA).


C3.3 Regression with More than One Explanatory Variable (Multiple Regression)

C3.3.1 Statistical control

So far we have used simple regression to assess the linear relationship between
two variables. In reality there will be a number of factors that are potential
predictors of the outcome variable. The advantage of using a regression
framework is that we can straightforwardly account for the effects of multiple
variables simultaneously.

Examples

i) Suppose we compare two secondary schools on their age 16 exam performance,
e.g. we might compare the percentage of students who achieve a pass in five or
more subjects. Suppose we find that school 1 has a higher percentage with 5+
passes than school 2. Would we conclude that school 1's performance was better
than school 2's? What other factors would we like to take into account? An obvious
candidate would be a measure of students' achievement when they entered
secondary school so that school effects are 'value-added'.

ii) Comparisons of men's and women's salaries often reveal that women earn less.
Explanations that are commonly put forward for this discrepancy are that women
tend to work in jobs that have been traditionally lower paid, or that women have
taken time out of paid employment to raise children. To determine whether there
are salary differences between men and women who have been working in the
same job for the same amount of time, we would wish to account for occupation
and number of years of full-time employment as well as other factors such as
education level. Using multiple regression we can test whether these other factors
explain gender differences in salary, i.e. does any gender difference disappear
when we adjust for the effects of these other variables?

We can use multiple regression to take into account or adjust for other factors
that might predict the response variable. Sometimes the effects of these other
factors are of interest in themselves, e.g. predictors of age 16 attainment other
than the school attended. Other times the effects of other factors are not of
major interest, but it is important to adjust for their effects to obtain more
meaningful estimates of effects that we are interested in. Such factors are often
called controls.

C3.3.2 The multiple regression model

In simple regression we have a single predictor or explanatory variable (X), and the
linear regression model is

$y_i = \beta_0 + \beta_1 x_i + e_i$.
In multiple regression, we have more than one predictor. Suppose that we have
two predictors, denoted by $X_1$ and $X_2$, which may be continuous or categorical. We
have in fact already used a multiple regression model to analyse country
differences in hedonism (in C3.2.2). Although there was just one predictor,
country, it was represented by two dummy variables. More generally, we can
include several predictors and any of these may be represented by a set of dummy
variables.

The multiple (linear) regression model for two continuous (or dichotomous)
explanatory variables is written

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + e_i$

where $\beta_0$ is the value of y that would be expected when $x_1 = 0$ and $x_2 = 0$.
The coefficients $\beta_1$ and $\beta_2$ are interpreted as follows:

$\beta_1$ is the coefficient of $x_1$, which is interpreted as the change in y for a 1-unit
change in $x_1$ controlling or adjusting for the effect of $x_2$. In other words, $\beta_1$ is
the effect of $x_1$ for individuals with the same value of $x_2$ (or holding $x_2$
constant).

Similarly, $\beta_2$ is the coefficient of $x_2$, which is interpreted as the change in y
for a 1-unit change in $x_2$ controlling for the effect of $x_1$.
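A small numerical sketch may help to fix the interpretation of the coefficients. The code fits a regression with two predictors by solving the least-squares normal equations directly; the function name and the toy data are hypothetical, and the data are generated without error from known coefficients so that the fit can be checked by eye.

```python
from statistics import mean


def fit_two_predictors(x1, x2, y):
    """Least-squares estimates (b0, b1, b2) for y = b0 + b1*x1 + b2*x2."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    # Solve the two normal equations for the slopes (requires det != 0,
    # i.e. x1 and x2 must not be perfectly correlated).
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Toy data (hypothetical) generated exactly from y = 1 + 2*x1 - 3*x2.
x1 = [0, 1, 2, 3, 4, 5]
x2 = [1, 0, 2, 1, 3, 0]
y = [1 + 2 * a - 3 * b for a, b in zip(x1, x2)]

b0, b1, b2 = fit_two_predictors(x1, x2, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))

# Two individuals with the same x2 who differ by one unit on x1:
# their predicted values differ by b1.
same_x2 = 2
difference = (b0 + b1 * 4 + b2 * same_x2) - (b0 + b1 * 3 + b2 * same_x2)
print(round(difference, 6))
```

The last comparison is what "holding $x_2$ constant" means in practice: the term involving $x_2$ cancels, leaving only the coefficient of $x_1$.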

Because each multiple regression coefficient represents the relationship between
an explanatory variable and the dependent variable, conditioning on the effect of
all other explanatory variables in the model, these coefficients are sometimes
called partial regression coefficients.

We can test for a linear relationship between the response variable Y and a
predictor variable $X_k$ by testing the null hypothesis that the coefficient of $X_k$ is zero
($H_0$: $\beta_k = 0$) versus the alternative hypothesis that the coefficient is non-zero
($H_A$: $\beta_k \neq 0$). As in simple regression, we can test for significance by examining
confidence intervals for each parameter or, equivalently, by comparing Z-ratios to
the normal distribution and calculating a p-value.


As in the simple regression model, $e_i$ is a residual. The residuals now represent
factors other than $X_1$ and $X_2$ that predict Y, but we use $e_i$ in a general way to
represent residuals in any model.

If $X_1$ and $X_2$ are continuous we can examine their relationship using a scatterplot.
Note that when we have two predictors we would need a three-dimensional
scatterplot to represent the relationship between Y and $X_1$ and $X_2$ graphically [13].
As it can be difficult to interpret three-dimensional plots, we can explore the data
by looking at plots of Y versus $X_1$, Y versus $X_2$, and $X_1$ versus $X_2$. The third plot is
important for checking whether $X_1$ and $X_2$ are highly correlated.

Example

We will begin with the case where both $X_1$ and $X_2$ are continuous. Let's consider
the effects of age ($X_1$) and education ($X_2$) on hedonism. We will ignore gender and
country differences for now. We have already examined the bivariate relationship
between hedonism and age and found that older respondents tend to be less
hedonistic (in C3.1). This relationship may change when we account for education
if education is related to both hedonism and age. For example, we would expect
older respondents to have fewer years of education and a higher level of education
might be associated with less hedonistic beliefs if the more career-minded choose
study over having a good time!

Figure 3.6 shows the relationship between hedonism and education (see Figure 3.2
for a plot of hedonism versus age). The relationship between the two explanatory
variables, age and education, is shown in Figure 3.7. The correlation between
hedonism and education is very weak; the Pearson coefficient is only 0.024. As
expected, there is a negative correlation between age and education (r=-0.242).
Because of the weak correlation between hedonism and education, however, we
would not expect the addition of education in a multiple regression to have much
impact on the coefficient of age.

Figure 3.6. Plot of hedonism by education

In C3.1.3 the fitted equation from a simple regression of hedonism on age was
found to be:

$\widehat{HED}_i = 0.712 - 0.018 \, AGE_i$.

If we add education to the model, we obtain the following fitted multiple
regression equation:

$\widehat{HED}_i = 0.971 - 0.019 \, AGE_i - 0.017 \, EDUC_i$.

Notice that, as expected, there is little change in the coefficient of age when
education is added, but the relationship between hedonism and education is now
negative after accounting for age. Both relationships are significantly different
from zero at the 0.1% level. The relationship between hedonism and education
should be interpreted with some caution, however. We should hesitate to
conclude that education affects or causes hedonism. It is likely that hedonism and
education are both influenced by variables that we have not accounted for in this
model.

[13] For those of you who are interested, $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ is the equation
of a plane in 3-dimensional space.
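The two fitted equations for hedonism can be compared by computing predictions from them. In the sketch below the function names are hypothetical, the coefficients are the rounded estimates quoted for the age-only and the age-and-education models, and the ages and years of education passed in are arbitrary illustrative values, with education assumed to be measured in years.

```python
def predict_simple(age):
    """Prediction from the simple regression of hedonism on age."""
    return 0.712 - 0.018 * age


def predict_multiple(age, educ):
    """Prediction from the multiple regression on age and education."""
    return 0.971 - 0.019 * age - 0.017 * educ


# Effect of a 10-year age difference in each model.
simple_gap = predict_simple(40) - predict_simple(30)
adjusted_gap = predict_multiple(40, 12) - predict_multiple(30, 12)
print("Age only:           ", round(simple_gap, 3))
print("Education held at 12:", round(adjusted_gap, 3))

# The adjusted age effect is the same whatever value education is held at.
adjusted_gap_16 = predict_multiple(40, 16) - predict_multiple(30, 16)
print("Education held at 16:", round(adjusted_gap_16, 3))

# Effect of 4 extra years of education for respondents of the same age.
educ_gap = predict_multiple(40, 16) - predict_multiple(40, 12)
print("Four more years of education:", round(educ_gap, 3))
```

Because the model has no interaction term, the age effect does not depend on the level at which education is held, and vice versa.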


Figure 3.7. Plot of age by education

Standardised coefficients

Standardisation and standardised coefficients were introduced in C3.1.3. To recap,
the standardised coefficient for a predictor X is the estimate of the slope that
would be obtained if X and Y were both standardised before the regression
analysis. In simple regression, the standardised coefficient of X is equal to the
Pearson correlation coefficient. In multiple regression, with two predictors $X_1$ and
$X_2$, the standardised coefficient of $X_1$ is interpreted as the change in standardised Y
for a 1-unit change in standardised $X_1$, holding $X_2$ constant. (Recall that 1 unit of a
standardised variable corresponds to 1 standard deviation.) For example, in a
multiple regression model of hedonism on age and education, the standardised
coefficient for AGE is -0.358. Thus we can say that a 1 standard deviation increase
in age predicts a 0.358 standard deviation decrease in hedonism, holding education
constant. Note that if all variables (Y and the Xs) had been standardised prior to
the analysis, then the unstandardised and standardised coefficients would be equal.

Standardised coefficients are produced by many statistical software packages and
reported in much published quantitative research, but they should be interpreted
with caution. It is often claimed that standardised coefficients can be compared
across the predictors to determine which has the strongest influence on Y.
However, predictors are usually correlated with one another and it is rarely
possible to change the value of one without changing the value of another.
Further, some predictors may be easier to manipulate than others, which is
particularly important if regression results are used to inform public policy [14].
When standardised coefficients are reported, they should be accompanied by the
corresponding unstandardised coefficients which represent effects in terms of the
original units of measurement for X and Y. This is particularly important for
categorical X.

C3.3.3 Using multiple regression to model a non-linear relationship

Suppose a scatterplot of Y versus X resembles Figure 3.8. The relationship is
non-linear, so it would not be appropriate to fit the straight line relationship implied
by a linear regression model. We should fit a curve through the points rather than
a line. The simplest curve is a quadratic function (or a second order polynomial):

$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + e_i$

Note that the above is an example of a multiple regression model with $x_1 = x$ and
$x_2 = x^2$. Also shown in Figure 3.8 is the fitted quadratic curve, which turns out to
have equation $\hat{y}_i = 1.00 + 1.02 x_i - 0.47 x_i^2$.

Figure 3.8. Example of a non-linear relationship between Y and X

[14] See http://www.tufts.edu/~gdallal/importnt.htm for further discussion of the use
and interpretation of standardised coefficients.
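To see how a quadratic curve is handled as a multiple regression, the sketch below treats x and x squared as two separate predictors and evaluates a fitted quadratic with intercept 1.00, linear coefficient 1.02 and squared-term coefficient -0.47 (the curve drawn for the non-linear example). The function names are hypothetical, and the turning-point formula x = -b1 / (2 * b2) is a standard result for quadratics rather than something stated in the module.

```python
B0, B1, B2 = 1.00, 1.02, -0.47


def design_row(x):
    """The two 'predictors' of the quadratic model: x1 = x and x2 = x squared."""
    return (x, x ** 2)


def predict(x):
    """Predicted y from the fitted quadratic, written as a two-predictor model."""
    x1, x2 = design_row(x)
    return B0 + B1 * x1 + B2 * x2


def turning_point(b1, b2):
    """Value of x at which the fitted curve stops rising (or falling)."""
    return -b1 / (2 * b2)


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, round(predict(x), 2))

x_peak = turning_point(B1, B2)
print("turning point at x =", round(x_peak, 3))
```

With a negative coefficient on the squared term the curve rises, levels off at the turning point and then falls, which a single straight line could not capture.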


The results from fitting a quadratic curve to the relationship between hedonism
and age are given in Table 3.8. Note that this analysis is based on standardised
age and its square. This is because, for older respondents (remember the oldest is
98), age2 takes very large values; this may cause computational difficulties and the
coefficient of age2 would be very small.
Table 3.8. Regression with quadratic effects for age

Coeff.

S.E.

-0.222
-0.348
0.072

0.017
0.012
0.011

Constant
Standardised age
Standardized age-squared

Z-ratio
-28.669
6.288

C3.3.4

Adding further predictors

Suppose that we have p predictors, which we denote by X1, X2, X3, . . ., Xp. Then
the multiple regression model is

y i = 0 + 1 x 1i + 2 x 2i + 3 x 3i + ... + p x pi + ei
Variation explained: R2

p-value

In simple regression, R2 is the proportion of variance in Y that is explained by the


explanatory variable X (see C3.1.4). When there is more than one X, R2 is the
proportion of variance in Y explained by all variables in the model. An alternative
interpretation of R2 is as the square of the correlation between the predicted
values of Y (from the fitted model) and the observed values of Y.

<0.001
<0.001

The coefficient of age-squared is significantly different from zero at the 0.1%


level, so we conclude that age-squared should be retained in the model and that
the quadratic model is therefore a better fit to the data than the linear model.
The positive coefficient of the squared term, together with the negative
coefficient of the linear term, indicates that the negative relationship flattens out
at older ages. A scatterplot with the fitted curve is shown in Figure 3.9.

The R2 for the regression model with age and education effects is 0.121, so 12.1%
of the variance in hedonism scores is due to variation in age and education. The
correlation between the predicted and observed hedonism scores is 0.348
= 0 . 121 . As suggested by the low bivariate correlation between hedonism and
education, education has little explanatory power; when education is removed the
model R2 decreases only slightly to 0.118.
A problem with R2 is that it always increases even if irrelevant variables are added
to the model. Therefore in multiple regression a measure called the adjusted R2 is
usually quoted. The adjusted R2 takes into account the number of variables in the
model. It is therefore a goodness-of-fit measure that is penalised by the
complexity of the model. With such a measure, the value will only increase if the
additional predictors are accounting for some of the variability in the response. In
this simple example with only two explanatory variables, age and education, the
adjusted R2 turns out to be the same as the unadjusted value.

Multicollinearity
Before carrying out a regression analysis, we should always look at the correlation
between each pair of predictor variables. If the correlation between a pair is very
high (>0.8 say), the estimates of the coefficients of those variables may be
unstable and imprecise (large standard errors). If the two variables are really
measuring the same thing, we should consider dropping one. Otherwise, we might
replace the two variables by a new variable which is a combination of the two [15].

Figure 3.9. Plot of hedonism versus standardised age with fitted quadratic curve

[15] Principal components analysis or factor analysis can be used to reduce a set
of correlated variables into a smaller set of uncorrelated variables.
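The pairwise correlation check described under Multicollinearity can be sketched as follows. The three predictors are made-up values (not the hedonism data), and the 0.8 cut-off is the rule of thumb quoted in the text rather than a formal threshold.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up predictor variables for illustration only
n = 20
predictors = {
    "x1": [float(i) for i in range(n)],
    "x2": [float(i + i % 3) for i in range(n)],    # almost a copy of x1
    "x3": [float((2 * i) % 5) for i in range(n)],  # unrelated to x1
}

THRESHOLD = 0.8  # rule of thumb from the text
flagged = [
    (a, b)
    for a, b in combinations(predictors, 2)
    if abs(pearson(predictors[a], predictors[b])) > THRESHOLD
]
print(flagged)  # [('x1', 'x2')]
```

Only the pair x1 and x2 is flagged, so in practice one of the two would be dropped or the pair combined into a single variable.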

Centre for Multilevel Modelling, 2008

Principles of model selection

In most quantitative research there is a large set of potential explanatory
variables. There are many procedures that have been proposed to automatically
select the best model from a set of variables (e.g. backward elimination, forward
selection, stepwise selection), and many of these have been implemented in
mainstream statistical software. These procedures are sometimes useful in that
they provide a systematic means of model selection, but they should be used with
caution or you may be accused of 'data dredging'. In practice your research
design and analysis will be guided by theory, which will come from previous
research in the same or related areas as well as your own ideas. Often you will
have several rival theories that you wish to compare and assess which have the
stronger empirical support. These theories and your particular research question
will guide the order in which you enter explanatory variables into the model.

For example, suppose you are interested in examining gender differences in salary
levels. The first model you fit might include only a gender effect. Suppose you
find that there is a significant difference between men and women. You might
then add in other explanatory variables to see which ones, if any, help to explain
the gender difference. A further step in the analysis would be to test whether
the gender difference is the same for all men and women, e.g. gender differences
may be larger in some occupation categories than in others (an example of an
interaction effect - see C3.4). In other situations, there will be variables that you
want to include for interpretation purposes. For example, in educational research,
you might be interested in looking at predictors of academic progress rather than
academic attainment at one point in time. One way to do that is to include prior
attainment as an explanatory variable in the model.

Effect sizes

The size of the coefficient for predictor variable Xk will depend on the scales of Xk
and the response variable. For example, suppose we multiply each value of AGE
by 12 to give age in months rather than years and refit the multiple regression
model with age and education effects. We obtain the results shown in Table 3.9.

Table 3.9. Regression of hedonism on age and education for different age scales.

               Age in years          Age in months
               Coeff.    Z-ratio     Coeff.    Z-ratio
Constant        0.971                 0.971
Age            -0.019    -28.206     -0.002    -28.206
Education      -0.017     -4.915     -0.017     -4.915

The coefficient of age in months (AGE*12) is -0.002, which is the coefficient of age
in years (AGE) divided by 12. This is because 1 unit on the scale of AGE*12 is equal
to 1/12 of a unit on the scale of AGE. Notice that the intercept does not change
because AGE=0 means the same whether the measurement is in months or years.
The coefficient of education is unaffected by transformations in age.

Because regression coefficients depend on scale, standardised coefficients are
sometimes quoted too. When researchers talk about effect sizes, they are often
referring to standardised coefficients.

Don't forget to do the practical for this section! (see page 2 for
details of how to find the practical)
Please read P3.3, which is available in online form or as part of a pdf file.

Don't forget to take the online quiz for this section! (see page 2 for
details of how to find the quiz questions)
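The scale dependence shown in Table 3.9 can be checked numerically. The sketch below uses a simple regression on made-up values (not the hedonism data, and without the education term) purely to show that multiplying a predictor by 12 divides its coefficient by 12 and leaves the intercept unchanged; `simple_ols` is a helper written for this illustration.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for a simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope


# Made-up ages (years) and response values for illustration only
age_years = [18, 25, 31, 40, 47, 52, 59, 65, 72, 80]
hed = [0.9, 0.5, 0.6, 0.1, 0.2, -0.3, -0.1, -0.6, -0.4, -0.9]
age_months = [12 * a for a in age_years]

b0_years, b1_years = simple_ols(age_years, hed)
b0_months, b1_months = simple_ols(age_months, hed)

print(abs(b1_months - b1_years / 12) < 1e-12)  # slope is divided by 12
print(abs(b0_months - b0_years) < 1e-9)        # intercept is unchanged

# The same arithmetic applied to the reported coefficient of age in years
print(round(-0.019 / 12, 3))  # -0.002
```

The Z-ratio is unchanged by the rescaling because the standard error is divided by 12 along with the coefficient, which matches the identical Z-ratios in the two halves of Table 3.9.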


C3.4 Interaction Effects


In C3.2 we saw how to compare groups using dummy variables in a regression
model. For example, we compared the mean hedonism score for men and women,
and for different countries. So far, however, we have assumed that the effects of
other predictor variables, e.g. age, are the same for each group. This is
equivalent to assuming that group differences in hedonism are the same for all
values of the other predictors. This assumption may be unrealistic. Perhaps age
differences in hedonism are more pronounced among men, which would imply that
the age effect differs for men and women.
Two predictors are said to have an interaction effect on Y if the effect of one of
the predictors on Y depends on the value of the other predictor.

C3.4.1 Model with fixed slopes across groups

Suppose we fit a multiple regression model with age and gender effects:

$$\text{HED}_i = \beta_0 + \beta_1 \text{AGE}_i + \beta_2 \text{SEX}_i + e_i \qquad (3.7)$$

We obtain the following fitted regression equation:

$$\widehat{\text{HED}}_i = 0.791 - 0.018\,\text{AGE}_i - 0.152\,\text{SEX}_i$$

For SEX=0 (men), the relationship between HED and AGE is represented by the
line:

$$\widehat{\text{HED}}_i = 0.791 - 0.018\,\text{AGE}_i$$

and for SEX=1 (women), the fitted line is:

$$\widehat{\text{HED}}_i = 0.639 - 0.018\,\text{AGE}_i$$

So the lines for men and women have different intercepts, but the same slope, i.e.
the regression lines are parallel (see Figure 3.10). There are two equivalent ways
of interpreting Figure 3.10. We can say that the effect of age on hedonism is the
same for men and women. Alternatively we can say that the gender difference in
hedonism is the same at all ages.

Figure 3.10. Regression lines for men and women, fixed slopes

Note: The age range in the sample is 14 to 98 years. The software used to draw the
plot has extrapolated the regression lines beyond the observed range, which is not
generally recommended.

C3.4.2 Fitting separate models for each group

Is it reasonable to assume that the gender difference in hedonism is the same for
all ages? One way of allowing men and women to have different slopes for the
relationship between hedonism and age is to fit a separate regression line for each
sex. We do this by splitting the sample by gender [16], and fitting a simple
regression of HED on AGE for each sex. If we do this, we obtain the results shown
in Table 3.10.

[16] This is often done using a 'select if' command or menu option, or by requesting
an analysis that is stratified by gender.
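Splitting the sample and fitting one simple regression per group can be sketched as below. The records are made-up values rather than the hedonism data, and `simple_ols` is a helper written for this illustration; statistical packages provide the equivalent through a 'select if' or stratified analysis option.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for a simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope


# Made-up records: (hedonism score, age in years, sex), with sex coded 0 = men, 1 = women
records = [
    (0.9, 20, 0), (0.4, 35, 0), (0.1, 50, 0), (-0.5, 70, 0),
    (0.6, 22, 1), (0.2, 38, 1), (-0.2, 55, 1), (-0.6, 75, 1),
]

fits = {}
for sex in (0, 1):
    # Equivalent of "select if SEX == sex": keep only the cases in this group
    subset = [(age, hed) for hed, age, s in records if s == sex]
    ages, heds = zip(*subset)
    fits[sex] = simple_ols(ages, heds)

for sex, (intercept, slope) in fits.items():
    print(sex, round(intercept, 3), round(slope, 4))
```

Each group gets its own intercept and its own slope, which is exactly what the fixed-slope model (3.7) does not allow.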


Table 3.10. Regression of hedonism on age with separate models fitted for men and women

                  Coeff.    S.E.     Z-ratio
Men
  Constant         0.839    0.047
  Age (years)     -0.019    0.001    -20.854
Women
  Constant         0.597    0.047
  Age (years)     -0.018    0.001    -18.910

For men the slope of age is -0.019, compared to -0.018 for women. So the slope is
slightly steeper for men. Because women have a lower intercept than men, a
steeper slope for men implies that the gender difference is greater among younger
respondents (see Figure 3.11 later).

While splitting the sample into groups is a simple way of allowing for different
slopes for each group, there are several problems with this approach:

i)   The sample size for some groups may be small.

ii)  There may be more than one categorical predictor, and therefore more than
     one way of grouping the data. The effects of the other predictors may vary
     across each grouping, e.g. hedonism may vary by sex and by country.
     Splitting the data into groups defined by sex and country will lead to a large
     number of groups; in this dataset, the sample sizes in each group remain
     large, but this will often not be the case.

iii) In general there will be several predictors in the model, but it is unlikely
     that the effects of all predictors will vary across groups. In that case,
     fitting a separate regression for each group is inefficient. Where the
     coefficient of a predictor does not vary across groups, it would be better to
     estimate it using information from the whole sample; the estimate of the
     coefficient would then be based on a larger sample size and would therefore
     have a smaller standard error than if it were estimated separately for each
     group.

iv)  When separate analyses are carried out for each group, it is not possible to
     carry out hypothesis tests to compare coefficients across groups. For
     example, if we fit separate regressions of hedonism on age for men and
     women we cannot test whether there is a gender difference in the
     relationship between hedonism and age in the population.

C3.4.3 Allowing for varying slopes in a pooled analysis: interaction effects

Rather than fitting a separate model for each sex, we will fit a single model to the
whole pooled sample. We create a new variable which is the product of AGE and
SEX:

AGE_SEX = AGE × SEX

The new variable AGE_SEX is added as another predictor variable to model (3.7) to
give:

$$\text{HED}_i = \beta_0 + \beta_1 \text{AGE}_i + \beta_2 \text{SEX}_i + \beta_3 \text{AGE\_SEX}_i + e_i \qquad (3.8)$$

Table 3.11 gives an extract of the analysis data file to which (3.8) could be fitted.

Table 3.11. Example of hedonism dataset with age by sex interaction variable

Respondent    Hedonism    AGE    SEX    AGE_SEX
1               1.55       25     0       0
2               0.76       30     0       0
3              -0.26       59     0       0
4              -1.00       47     1      47
.                .          .     .       .
.                .          .     .       .
5845            0.74       65     0       0

The inclusion of AGE_SEX, called the interaction between AGE and SEX, allows the
effect of AGE on HED to differ for men and women (or, equivalently, the effect of
sex on HED to depend on AGE). If the effect of age differs by sex, we say that
there is an interaction effect. To see how an interaction effect works, we will
look at the regression model for each value of SEX.

For SEX=0 (men), AGE_SEX=0 so the regression model (3.8) becomes:

$$\text{HED}_i = \beta_0 + \beta_1 \text{AGE}_i + e_i \qquad (3.9)$$

For SEX=1 (women), AGE_SEX=AGE and the regression model (3.8) becomes:

$$\text{HED}_i = \beta_0 + \beta_1 \text{AGE}_i + \beta_2 + \beta_3 \text{AGE}_i + e_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,\text{AGE}_i + e_i \qquad (3.10)$$
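To make the construction of the interaction variable concrete, the sketch below builds AGE_SEX for the rows shown in Table 3.11 and then reads off the intercept and slope that model (3.8) implies for each sex. The coefficient values are the estimates reported in Table 3.12 further on in this section; the function name `group_line` is introduced here for illustration only.

```python
# Rows shown in Table 3.11: (respondent, hedonism, AGE, SEX)
rows = [
    (1, 1.55, 25, 0),
    (2, 0.76, 30, 0),
    (3, -0.26, 59, 0),
    (4, -1.00, 47, 1),
    (5845, 0.74, 65, 0),
]

# The interaction variable is simply the product of the two predictors
age_sex = [age * sex for _, _, age, sex in rows]
print(age_sex)  # [0, 0, 0, 47, 0]


def group_line(b0, b1, b2, b3, sex):
    """Intercept and slope of AGE implied by model (3.8) for a given value of SEX."""
    return b0 + b2 * sex, b1 + b3 * sex


# Estimates reported in Table 3.12 (constant, age, female, age x female)
b0, b1, b2, b3 = 0.839, -0.019, -0.242, 0.002

intercept_men, slope_men = group_line(b0, b1, b2, b3, sex=0)
intercept_women, slope_women = group_line(b0, b1, b2, b3, sex=1)
print(round(intercept_women, 3), round(slope_women, 3))  # 0.597 -0.017

# Predicted gender difference (women minus men) at two ages
gap_20 = b2 + b3 * 20
gap_80 = b2 + b3 * 80
print(round(gap_20, 3), round(gap_80, 3))  # -0.202 -0.082
```

The gap between the two lines shrinks with age because the interaction coefficient has the opposite sign to the coefficient of SEX.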


In equation (3.9) the intercept is $\beta_0$, and in (3.10) it is $\beta_0 + \beta_2$. So
$\beta_2$ is the difference between intercepts for men and women.
In equation (3.9) the slope of AGE is $\beta_1$, and in (3.10) it is $\beta_1 + \beta_3$. So
$\beta_3$ is the difference between slopes for men and women.

Table 3.12 shows the results from fitting model (3.8) to the hedonism data.

Table 3.12. Regression of hedonism on age and sex, pooled analysis with interaction

                 Coeff.    S.E.     Z-ratio    p-value
Constant          0.839    0.048
Age (years)      -0.019    0.001    -20.075    <0.001
Female           -0.242    0.066     -3.649    <0.001
Age × Female      0.002    0.001      1.461     0.144

So the fitted regression equation is:

$$\widehat{\text{HED}}_i = 0.839 - 0.019\,\text{AGE}_i - 0.242\,\text{SEX}_i + 0.002\,\text{AGE\_SEX}_i$$

For SEX=0 (men), the fitted regression equation is

$$\widehat{\text{HED}}_i = 0.839 - 0.019\,\text{AGE}_i$$

For SEX=1 (women), the fitted regression equation is

$$\widehat{\text{HED}}_i = 0.839 - 0.019\,\text{AGE}_i - 0.242 + 0.002\,\text{AGE}_i = 0.597 - 0.017\,\text{AGE}_i$$

Notice that the intercept and slope estimates from the interaction model are
exactly the same as the estimates we got from fitting a simple regression for each
sex separately (the women's slope appears as -0.017 here rather than -0.018 only
because it is computed from rounded coefficients). Figure 3.11 shows the predicted
regression lines for men and women. Note that the lines are no longer parallel
because we have allowed for different slopes in our regression model. The gender
difference in hedonism is slightly larger among young respondents.

Figure 3.11. Regression lines for men and women, varying slopes

C3.4.4 Testing for interaction effects

Is the slope in the regression of hedonism on age significantly different for men
and women?

Recall that $\beta_3$, the coefficient of the interaction variable AGE_SEX in equation
(3.8), is the difference in the slope for men and women. So the null hypothesis
that the slopes are the same for men and women can be expressed as
H0: $\beta_3 = 0$.

From Table 3.12, we see that the Z-ratio for this test is 1.461 and the p-value is
0.144. So we cannot reject the null hypothesis and we conclude that the slope of
age is the same for men and women. We would then return to the simpler model
(3.7) with the fixed slope.
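The p-value quoted for this test can be reproduced from the Z-ratio. The sketch below assumes the Z-ratio is referred to a standard normal distribution and that the test is two-sided, which is consistent with the reported value of 0.144; the function name `two_sided_p` is introduced here for illustration.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a Z-ratio under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


p_interaction = two_sided_p(1.461)   # Z-ratio of the AGE_SEX coefficient in Table 3.12
print(round(p_interaction, 3))       # 0.144
print(p_interaction < 0.05)          # False: no evidence against H0 at the 5% level
```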

C3.4.5 Another example: allowing age effects to be different in different countries

We have concluded that the effect of age on hedonism is the same for men and
women. Or, equivalently, we can conclude that the gender difference in hedonism
is the same for all ages. We will now test whether the effect of age is the same in


each of the three countries, which is equivalent to testing whether differences
between countries depend on age.

In C3.2.2, we allowed for country effects by including dummy variables for
Germany and France, i.e. we included the variables GERM and FRANCE as
predictors in the regression model. To allow the effect of age on hedonism to vary
across countries, we need to create two interaction variables which we will call
AGE_GERM and AGE_FRANCE. These are defined as follows:

AGE_GERM = AGE × GERM
AGE_FRANCE = AGE × FRANCE

The interaction model has the form:

$$\text{HED}_i = \beta_0 + \beta_1 \text{AGE}_i + \beta_2 \text{GERM}_i + \beta_3 \text{FRANCE}_i + \beta_4 \text{AGE\_GERM}_i + \beta_5 \text{AGE\_FRANCE}_i + e_i$$

The results from fitting this model are given in Table 3.13.

Table 3.13. Regression with age by country interaction effect

                  Coeff.    S.E.     Z-ratio    p-value
Constant           0.604    0.061
Age (years)       -0.021    0.001    -17.210    <0.001
Country
  Germany         -0.007    0.078     -0.085     0.932
  France           0.386    0.090      4.277    <0.001
Age × Germany      0.005    0.002      3.207     0.001
Age × France       0.001    0.002      0.570     0.569

The effect of age in each country is:

-0.021 in the UK (the reference category)
-0.021+0.005 = -0.016 in Germany
-0.021+0.001 = -0.020 in France

It therefore appears that the negative effect of age on hedonism is weaker in
Germany than in the UK or France. The coefficient of the AGE_GERM term has a
Z-ratio of 3.207 (p = 0.001), so the differential age effect for Germany is
significant at the 1% level.

The individual Z-ratios for each interaction term allow us to carry out two separate
tests: 1) whether the slopes for the UK and Germany are the same (H0: $\beta_4 = 0$),
and 2) whether the slopes for the UK and France are the same (H0: $\beta_5 = 0$). The
simplest way to compare Germany and France (H0: $\beta_4 - \beta_5 = 0$) would be to refit
the model taking either Germany or France as the reference category.

To test whether all three countries have the same slope (a joint test), we need to
test the null that $\beta_4$ and $\beta_5$ are both (simultaneously) equal to zero. We can do
this using an F-test for comparing nested models: the model in which $\beta_4$ and $\beta_5$
are freely estimated (the interaction model) versus the model with both $\beta_4$ and
$\beta_5$ fixed at zero (the main effects model, i.e. without interaction terms). The
p-value for this test turns out to be 0.040, so there is evidence at the 5% level that
the interaction model is a significantly better fit to the data: at least one of the
age-by-country interaction coefficients is non-zero. We therefore conclude that
the age effect differs between countries.

Don't forget to do the practical for this section! (see page 2 for
details of how to find the practical)
Please read P3.4, which is available in online form or as part of a pdf file.

Don't forget to take the online quiz for this section! (see page 2 for
details of how to find the quiz questions)
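The country-specific age effects in C3.4.5 follow directly from the coefficients in Table 3.13. As a small sketch, the lines below reproduce that arithmetic and also compute the point estimate of the Germany versus France difference in slopes; the dictionary keys are labels chosen for this illustration.

```python
# Age-related coefficients reported in Table 3.13
coef = {"AGE": -0.021, "AGE_GERM": 0.005, "AGE_FRANCE": 0.001}

# Slope of age in each country: reference slope plus the relevant interaction term
slopes = {
    "UK": coef["AGE"],
    "Germany": coef["AGE"] + coef["AGE_GERM"],
    "France": coef["AGE"] + coef["AGE_FRANCE"],
}
for country, slope in slopes.items():
    print(country, round(slope, 3))

# Point estimate of the Germany-France difference in slopes (beta4 - beta5)
diff_germany_france = coef["AGE_GERM"] - coef["AGE_FRANCE"]
print(round(diff_germany_france, 3))  # 0.004
```

This gives only the estimated difference. Its standard error depends on the covariance between the two interaction coefficients, which is not reported in Table 3.13, and that is why refitting with a different reference category is the simplest route to a test.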


C3.5 Checking Model Assumptions in Multiple Regression

The assumptions of a multiple regression model are the same as those for a simple
regression model (see C3.1.6), i.e. i) the residuals $e_i$ are normally distributed, ii)
the variance of the residuals is the same for each value of X (or combination of
values for different X variables), and iii) the residuals are independent. We can
check assumptions i) and ii) by looking at various plots of the standardised
residuals. The same plots can be used to check for outliers, whose influence on
the regression results can be assessed by looking at the distribution of the
Cook's D statistic.

C3.5.1 Checking the normality assumption

In C3.1.6, we checked the normality assumption of simple regression using two
plots of the standardized residuals: a histogram and a normal probability plot. The
same plots are used in multiple regression. Figure 3.12 and Figure 3.13 show the
histogram and normal probability plot of residuals from a multiple regression
model of hedonism that includes age, education, gender and country effects. The
histogram shows a symmetric bell-shaped distribution and the normal plot shows a
straight line, suggesting that the normal distribution assumption is reasonable.

Figure 3.12. Histogram of $r_i$

Figure 3.13. Normal probability plot of $r_i$

C3.5.2 Checking the homoskedasticity assumption

For simple regression, we check that the variance of the residuals is fairly constant
across the range of X in a plot of the standardised residuals against the explanatory
variable X. In multiple regression, it is useful to start with a plot of $r_i$ against
$\hat{y}_i$ because, for any individual, the predicted value of y is a linear function of
their values on all X variables in the model. This should be followed by an
examination of pairwise plots of the standardized residuals against each
explanatory variable X in turn. For each plot we are looking for indications of
'funnelling', where the vertical scatter of the residuals is different for different
values of $x_i$ or $\hat{y}_i$, in which case the assumption of homoskedasticity is not met.

A common reason for funnelling (or heteroskedasticity) is the existence of groups
in the data among which the relationship between Y and one or more X differs, i.e.
unmodelled interaction effects. To illustrate the idea of funnelling, suppose that
the relationship between Y and a continuous variable X1 is different for two
subgroups defined by a binary variable X2: the relationship between Y and X1 is
positive for both groups, but stronger for X2=0 than for X2=1. The predicted
regression lines from a multiple regression of Y on X1, X2 and their interaction X1*X2
are shown in Figure 3.14.
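A small simulation can make this scenario concrete. The sketch below uses made-up data in which Y rises with X1 in both groups, more steeply for X2=0 than for X2=1, and then fits a single line that ignores X2. The group equations, the deterministic "noise" and the helper `simple_ols` are all assumptions chosen for illustration, not values from the module.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for a simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope


# Both groups are observed on the same grid of X1 values from -5 to 5
x_grid = [-5 + 0.5 * i for i in range(21)]
noise = [0.05 * ((2 * i) % 5 - 2) for i in range(21)]  # small deterministic wobble

x1, y = [], []
for xi, e in zip(x_grid, noise):
    x1 += [xi, xi]
    y += [1.0 * xi + e,         # group X2=0: steeper positive slope
          2.5 + 0.5 * xi + e]   # group X2=1: flatter slope, higher at small X1

# Misspecified model: one line for everyone, X2 and its interaction left out
b0, b1 = simple_ols(x1, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x1, y)]


def mean_abs(values):
    return sum(abs(v) for v in values) / len(values)


spread_low = mean_abs([r for xi, r in zip(x1, resid) if xi < 0])
spread_high = mean_abs([r for xi, r in zip(x1, resid) if xi > 0])
print(spread_low > spread_high)  # True: residual spread shrinks as X1 increases
```

Because the two group lines converge as X1 increases, the residuals about the single pooled line are widely scattered at small X1 and tightly clustered at large X1, which is the funnel shape to look for in a residual plot.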


Figure 3.14. Prediction lines from a multiple regression with an interaction effect

Now suppose we mistakenly fit a simple regression of Y on X1, so we ignore the fact
that there are two groups with different relationships between Y and X1. Figure
3.15 shows the residual plot for this misspecified model [17]. (The data points for the
groups defined by X2 are distinguished, but remember that X2 is not included in the
model.) The plot shows evidence of heteroskedasticity because the vertical spread
of the residuals gets smaller as X1 increases; this is an example of what we mean
by 'funnelling'. Why has this happened? Instead of fitting two regression lines
with different intercepts and slopes for each group, we have fitted a single
'average' line which would lie somewhere in between the lines in Figure 3.14 [18]. At
small values of X1, where we have the largest difference in the predicted value of
Y for the two groups, the residuals about this line are large and positive for X2=1
and large and negative for X2=0. The difference between groups becomes smaller
as X1 increases, so the average line will lie close to the individual group lines and
the residuals are smaller.

Figure 3.15. Plot of $r_i$ versus X1 from fitting a misspecified regression without X2 or its
interaction with X1

[17] The residuals are plotted against $x_{1i}$, but the plot of $r_i$ against the predicted
response, $\hat{y}_i$, would look exactly the same because in simple regression $\hat{y}_i$ is
just a linear function of $x_{1i}$.
[18] We would expect this average line to lie closer to the line for the largest group.

Returning to the hedonism data, Figure 3.16 shows a plot of $r_i$ versus standardised
$\hat{y}_i$ from the model with age, education, gender and country included as
explanatory variables. The vertical spread of the points appears fairly equal across
different values of standardised $\hat{y}_i$, so we conclude that the assumption of
homoskedasticity is reasonable.

C3.5.3 Outliers

We can also check for outliers using any of the residual plots. An outlier is a point
with a particularly large residual. We would expect approximately 95% of the
residuals to lie between -2 and +2.
Of major interest, however, is whether an outlier has undue influence on our
results. An influence statistic called Cook's D (where D is for distance) measures


how different our estimated regression coefficients would have been if a sample
observation were omitted. Cook's D is calculated for every observation. The
higher the value of D, the more likely it is that an observation exerts influence on
the estimates of the coefficients. However, D does not have a fixed range and so
we focus on those values of D which are considerably greater than, say, the 90th
percentile.
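As a minimal sketch of how Cook's D behaves, the code below computes it for a simple regression on made-up data with one planted unusual observation. The module does not give a formula; the one used here is the standard expression D_i = e_i² h_i / (p s² (1 - h_i)²), with p = 2 coefficients and leverage h_i, and the function name `cooks_distance` is introduced for illustration.

```python
import statistics


def cooks_distance(x, y):
    """Cook's D for each observation in a simple regression of y on x."""
    n, p = len(x), 2  # p = number of estimated coefficients (intercept and slope)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    s2 = sum(r * r for r in resid) / (n - p)             # residual variance
    leverage = [1 / n + (a - mx) ** 2 / sxx for a in x]  # h_i for simple regression
    return [(r * r / (p * s2)) * h / (1 - h) ** 2 for r, h in zip(resid, leverage)]


# Made-up data: a clean linear trend with a small deterministic wobble
x = list(range(20))
y = [1 + 0.5 * xi + 0.1 * ((2 * xi) % 5 - 2) for xi in x]
y[19] += 10  # plant one unusual observation at the edge of the X range

d = cooks_distance(x, y)
p90 = statistics.quantiles(d, n=10)[-1]  # 90th percentile of Cook's D
most_influential = d.index(max(d))

print(most_influential)  # 19
print(max(d) > p90)      # True
```

The planted case combines a large residual with high leverage, so its D stands well clear of the 90th percentile. Whether such a case matters in practice is judged, as in the text, by refitting without it and comparing the estimates.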

Figure 3.16. Plot of $r_i$ versus standardised $\hat{y}_i$ from a multiple regression of
hedonism scores

For a regression of hedonism on age, education, sex and country, we find that the
90th percentile of the distribution of Cook's D is 0.000046. A boxplot of Cook's D
is given in Figure 3.17. Two observations have relatively large values of D: case
numbers 3225 and 2948. However, removing these observations from the analysis
has negligible impact on our results (see Table 3.14).

Figure 3.17. Boxplot of Cook's D

Table 3.14. Impact of omitting outliers on estimated coefficients and Z-ratios

                       Full sample           Omitting observations
                                             3225 and 2948
                       Coeff.    Z-ratio     Coeff.    Z-ratio
Constant                0.790                 0.789
Age (years)            -0.019    -27.611     -0.019    -27.537
Education (years)      -0.015     -4.281     -0.015     -4.350
Female                 -0.160     -6.752     -0.160     -6.774
Country
  Germany               0.222      8.068      0.222      8.090
  France                0.436     13.145      0.441     13.322

Don't forget to do the practical for this section! (see page 2 for
details of how to find the practical)
Please read P3.5, which is available in online form or as part of a pdf file.

Don't forget to take the online quizzes for this module if you
haven't already done so! (see page 2 for details of how to find the
quizzes)