
Concepts


Fiona Steele¹

Centre for Multilevel Modelling

Contents

Introduction
   Motivation
   Conditioning
Data for multiple regression analysis
   Introduction to Dataset
C3.1 Regression with a Single Continuous Explanatory Variable
   C3.1.1 Examining data graphically
   C3.1.2 The linear regression model
   C3.1.3 The fitted regression line
   C3.1.4 Explained and unexplained variance and R-squared
   C3.1.5 Hypothesis testing
   C3.1.6 Model checking
C3.2 Categorical Explanatory Variable
   C3.2.1
   C3.2.2 Comparing more than two groups
   C3.2.3 Comparing a large number of groups
C3.3 Regression with More than One Explanatory Variable (Multiple Regression)
   C3.3.1 Statistical control
   C3.3.2 The multiple regression model
   C3.3.3 Using multiple regression to model a non-linear relationship
   C3.3.4 Adding further predictors
C3.4
   C3.4.1 Fitting separate models for each group
   C3.4.2 Allowing for varying slopes in a pooled analysis: interaction effects
   C3.4.3 Testing for interaction effects
   C3.4.4 Another example: allowing age effects to be different in different countries
   C3.4.5
C3.5
   C3.5.1
   C3.5.2 Checking the homoskedasticity assumption
   C3.5.3 Outliers

¹ With additional material from Kelvyn Jones. Comments from Sacha Brostoff, Jon Rasbash and Rebecca Pillinger on an earlier draft are gratefully acknowledged.

All of the sections within this module have online quizzes for you to test your understanding. To find the quizzes:

EXAMPLE
Go down to the section for Module 3: Multiple Regression
Click "3.1 Regression with a Single Continuous Explanatory Variable" to open Lesson 3.1
Click Q1 to open the first question

All of the sections within this module have practicals so you can learn how to perform this kind of analysis in MLwiN or other software packages. To find the practicals:

EXAMPLE
Go down to the section for Module 3: Multiple Regression, then
Either click "3.1 Regression with a Single Continuous Explanatory Variable" to open Lesson 3.1 and click through to the practical,
Or click "Print all Module 3 MLwiN Practicals"

Pre-requisites

Different types of variables (dependent and explanatory); covered in Module 1
Correlation between variables
Confidence intervals around estimates
Hypothesis testing, p-values
Independent samples t-test for comparing the means of two groups

Online resources:
http://www.sportsci.org/resource/stats/
http://www.socialresearchmethods.net/
http://www.animatedsoftware.com/statglos/statglos.htm
http://davidmlane.com/hyperstat/index.html

Introduction

Motivation

Multiple regression is a technique used to study the relationship between an outcome variable and a set of explanatory or predictor variables. A regression model may be used 1) for description (e.g. what is the difference in average salary for men and women?), 2) as part of causal inference (does being female result in a lower salary?), and 3) for prediction ("what happens if ...?" questions).

Suppose we face the problem of assessing the evidence for gender discrimination in legal firms. Statistical modelling can provide the following:

A quantitative assessment of the size of the effect; e.g. the difference in salary between women and men is 5000 per annum.

A quantitative assessment of the effect after taking account of other variables; e.g. a female worker earns 6500 less after taking account of years of experience. This conditioning on other variables distinguishes multiple regression modelling from simple "testing for differences" analyses.

A measure of uncertainty for the size of the effect; e.g. we can be 95% confident that the female-male difference in salary in the population from which our sample was drawn is likely to lie between 4500 and 5500.

Conditioning

The key feature that distinguishes multiple regression from simple regression is

that more than one predictor variable is involved. Even if we are interested in the

effect of just one variable (gender) on another (salary) we need to take account of

other variables as they may compromise the results. We can recognise three

distinct cases where it is important to control or adjust for the effects of other

variables:

i) An effect may be reduced after taking account of other variables. For example, a substantial gender effect could be reduced after taking account of type of employment. This is because jobs that are characterized by poor pay (e.g. in the service sector) have a predominantly female labour force.

ii) An effect may increase when account is taken of other variables; for example, the gender gap in salary could widen when account is taken of years of employment, women having longer service and poorer pay.

iii) An effect may be unchanged when account is taken of other variables. Note, however, that there may be unmeasured confounders.

Statistical analysis requires a quantifiable outcome measure (dependent variable)

to assess the effects of discrimination. Possibilities include the following,

differentiated by the nature of the measurement: a continuous measure of salary,

a binary indicator of whether an employee was promoted or not, a three-category

indicator of promotion (promoted, not promoted, not even considered), a count of

the number of times rejected for promotion, the length of time that it has taken

to gain promotion. All of these outcomes can be analysed using regression

analysis, but different techniques are required for different scales of

measurement.

The term multiple regression is usually applied when the dependent variable is

measured on a continuous scale. A dichotomous dependent variable can be analysed using logistic regression, while multinomial logistic and ordinal regression can be applied to nominal and ordinal dependent variables respectively. There are

also methods for handling counts (Poisson regression) and time-to-event data

(event history analysis or survival analysis). These techniques will be described in

later Modules.

The explanatory variables may also have different scales of measurement. For

example, gender is a binary categorical variable; ethnicity is categorical with more

than two categories; education might be measured on an ordinal scale (e.g. <11,

11-13, 14-16 and >16 years of education); years of employment could be measured

on a continuous scale. Multiple regression can handle all of these types of

explanatory variable, and we will consider examples of both continuous and

categorical variables in this Module.

Introduction to Dataset

The ideas of multiple regression will be introduced using data from the 2002 European Social Survey (ESS). Measures of ten human values have been constructed for 20 countries in the European Union. According to value theory, values are defined as desirable, trans-situational goals that serve as guiding principles in people's lives. Further details on value theory and how it is operationalised in the ESS can be found on the ESS education net (http://essedunet.nsd.uib.no/cms/topics/1/).

We will study one of the ten values, hedonism, defined as the "pleasure and sensuous gratification for oneself". The measure we use is based on responses to the question "How much like you is this person?", asked about two short descriptions of a hypothetical person. Each of the two items is rated on a 6-point scale (from "very much like me" to "not like me at all"). The mean of these ratings is calculated for each individual. The mean of the two hedonism items is then adjusted for individual differences in scale use² by subtracting the mean of all value items (a total of 21 are used to measure the 10 values). These centred scores recognise that the 10 values function as a system rather than independently. The centred hedonism score is interpreted as a measure of the relative importance of hedonism to an individual in their whole value system.

The scores on the hedonism variable range from -3.76 to 2.90, where higher scores indicate more hedonistic beliefs.

We consider three countries (France, Germany and the UK) with a total sample size of 5845. That is, we use a subsample of the original data.

Hedonism is taken as the outcome variable in our analysis. We consider the following explanatory variables:

Age in years
Gender (coded 0 for male and 1 for female)
Country (coded 1 for the UK, 2 for Germany and 3 for France)
Years of education

The first few records of the dataset are shown below:

Respondent   Hedonism   Age   Gender   Country   Education
1             1.55      25    0        2         10
2             0.76      30    0        2         11
3            -0.26      59    0        2          9
4            -1.00      47    1        3         10
.             .         .     .        .          .
.             .         .     .        .          .
5845          0.74      65    0        1          9

² Some individuals will tend to select responses from one side of the scale ("very much like me") for any item, while others will select from the other side ("not like me at all"). If we ignore these differences in response tendency we might incorrectly infer that the first type of individual believes that all values are important, while the second believes that all values are unimportant.

We will begin with a description of simple linear regression for studying the relationship between a pair of continuous variables, which we denote by Y and X. Simple regression is also commonly known as bivariate regression because only two variables are involved.

Y is the outcome variable (also called a response or dependent variable).
X is the explanatory variable (also called a predictor or independent variable).

In its simplest form, a regression analysis assumes that the relationship between X and Y is linear, i.e. that it can be reasonably approximated by a straight line. If the relationship is nonlinear, it may be possible to transform one of the variables to make the relationship linear, or the regression model can be modified (see C3.3.3). The relationship between two variables can be viewed in a scatterplot. A scatterplot can also reveal outliers.

C3.1.1 Examining data graphically

Before carrying out a regression analysis, it is important to look at your data first. There are various assumptions made when we fit a regression model, which we will consider later, but there are two checks that should always be carried out before fitting any models: i) examine the distribution of the variables and check that the values are all valid, and ii) look at the nature of the relationship between X and Y.

Distribution of Y

We can examine the distribution of a continuous variable using a histogram. At this stage, we are checking that the values appear reasonable. Are there any outliers, i.e. observations outside the general pattern? Are there any values of 99 in the data that should be declared as missing values? We also look at the shape of the distribution: is it a symmetrical bell-shaped distribution (normal), or is it skewed? Although it is the residuals³ that are assumed to be normally distributed in a multiple regression model, rather than the dependent variable, a skewed Y will often produce skewed residuals. If the residuals turn out to be non-normal, it may be possible to transform Y to obtain a normally distributed variable. For example, a positively skewed distribution (with a long tail to the right) will often look more symmetrical after taking logarithms.

Figure 3.1 shows the distribution of the hedonism scores. It appears approximately normal with no obvious outliers. The mean of the hedonism score is -0.15 and the standard deviation is 0.97.

Distribution of X

For a regression analysis the distribution of the explanatory variable is unimportant, but it is sensible to look at descriptive statistics for any variables that we analyse to check for unusual values. We will first consider age as an explanatory variable for hedonism. The age range in our sample is 14 to 98 years with a mean of 46.7 and standard deviation of 18.1.

Relationship between X and Y

Figure 3.2 shows a scatterplot of hedonism versus age, where the size of the plotting symbol is proportional to the number of respondents represented by a particular data point. Also shown is what is commonly called the "line of best fit", which we will come back to in a moment. The scatterplot shows a negative relationship: as age increases, hedonism decreases. The Pearson correlation coefficient for the linear relationship is -0.34.

³ The residual for each observation is the difference between the observed value of Y and the value of Y predicted by the model. See C3.1.2 for further details.
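The two preliminary checks described above can be scripted in any package. Here is a minimal sketch in Python with numpy; the data are simulated to resemble the age-hedonism example (a negative relationship with similar ranges), not the actual ESS sample:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated stand-ins for the ESS variables (NOT the real data):
# age roughly uniform over 14-98, hedonism negatively related to age.
age = rng.uniform(14, 98, size=500)
hedonism = 0.7 - 0.018 * age + rng.normal(0, 0.9, size=500)

# Check the distributions: range, mean and SD reveal invalid codes and outliers.
print(f"age: min={age.min():.1f}, max={age.max():.1f}, "
      f"mean={age.mean():.1f}, sd={age.std(ddof=1):.1f}")
print(f"hedonism: mean={hedonism.mean():.2f}, sd={hedonism.std(ddof=1):.2f}")

# Check the nature of the relationship: Pearson correlation coefficient.
r = np.corrcoef(age, hedonism)[0, 1]
print(f"Pearson r = {r:.2f}")  # negative, as in the scatterplot of Figure 3.2
```

A histogram and a scatterplot (for example with matplotlib's `hist` and `scatter`) would reproduce the graphical checks of Figures 3.1 and 3.2.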

C3.1.2 The linear regression model

We begin with a model for the relationship between Y and X. The equation of a straight line is traditionally written as

y = mx + c    (3.1)

where m is the gradient or slope of the line, and c is the intercept, the point at which the line cuts the Y-axis (i.e. the value of y when x = 0). The gradient is interpreted as the change in y expected for a 1-unit change in x. In statistics, we often refer to m and c as coefficients. A coefficient of a variable is a quantity that multiplies it. The slope m is the coefficient of the predictor x, and the intercept c is the coefficient of a variable which equals 1 for each observation (usually referred to as the constant).

Because we will soon be adding more explanatory variables (Xs), it is convenient to use a more general notation with coefficients represented by Greek betas (β). Thus (3.1) becomes

y = β0 + β1x    (3.2)

so that the intercept is now denoted by β0 and the slope by β1. The subscripts on the βs indicate the variable to which each coefficient is attached. We could have written (3.2) as y = β0x0 + β1x1 where x0 = 1 for every observation and x1 = x. Later we will be adding further explanatory variables (x2, x3, etc.) with coefficients β2, β3, etc.

For a given individual i (i = 1, 2, 3, ..., n), we denote their value on Y by yi and their value on X by xi. (Note that when we consider more than one explanatory variable, we will introduce a second subscript to index the variable. For example, x2i will denote the value on variable x2 for individual i.)

For individual i, the linear relationship between Y and X may be expressed as:

yi = β0 + β1xi + ei    (3.3)

ei is called the residual and is the difference between the ith individual's actual y-value and that predicted by their x-value. We know that we cannot perfectly predict an individual's value on Y from their value on X; the points in a scatterplot of x and y will never lie perfectly on a straight line (see Figure 3.2, for example). The residuals represent the (vertical) scatter of points about the regression line.

β0 and β1 are the intercept and slope of the regression line in the population from which our sample was drawn, and ei is the difference between an individual's y-value and the value of y predicted by the population regression line. We estimate these quantities using the sample data. Quantities such as β0 and β1 that relate to the population are called parameters. Parameters are very often represented by Greek letters in statistics.

We make the following assumptions about the residuals ei:

i) The residuals are normally distributed with zero mean and variance σ² (spoken as "sigma-squared"). This assumption is often written in shorthand as ei ~ N(0, σ²).

ii) The variance of the residuals is constant, whatever the value of x. This means that if we take a slice through the scatterplot of y versus x at any particular value of x, the y values have approximately the same variation as at any other value of x. If the variance is constant, we say the residuals are homoskedastic. Otherwise they are said to be heteroskedastic.

iii) The residuals are not correlated with one another, i.e. they are independent. Correlations might arise if some individuals contribute more than one observation (e.g. repeated measures) or if individuals are clustered in some way (e.g. in schools). If it is suspected that residuals are correlated, the regression model needs to be modified, e.g. to a multilevel model (see Module 5).

If these assumptions are not met, the estimate of β0, and more importantly β1, may be biased and imprecise.

C3.1.3 The fitted regression line

In linear regression analysis, β0 and β1 are estimated from the data using a method called least squares, in which the sum of the squared residuals is minimized⁵. (Responses with other scales of measurement require other techniques, but all of them are based on the same underlying principle of minimizing the "poorness of fit" between the actual data points and the fitted model.)

By applying the method of least squares to our sample data, we obtain an estimate of the underlying population value of the intercept and of the slope. These estimates are denoted by β̂0 and β̂1 (spoken as "beta-0-hat" and "beta-1-hat").

The predicted value of y for individual i is denoted by ŷi and is calculated as:

ŷi = β̂0 + β̂1xi    (3.4)

The equation (3.4) is the equation of the estimated or fitted regression line. The predicted value ŷi is the point on the fitted line corresponding to xi.

If we regress hedonism on age we obtain β̂0 = 0.712 and β̂1 = -0.018, and the fitted regression line is written (substituting HED for y and AGE for x) as:

HEDi = 0.712 - 0.018 AGEi

The slope estimate tells us that for every extra year of age, hedonism is predicted to decrease by 0.018. Importantly, the decrease in hedonism expected for an increase from 14 to 15 years old is the same as for an increase from 54 to 55 years old. This is a direct consequence of assuming that the underlying functional form of the model is linear and fitting a linear equation.

We can use the fitted line to predict an individual's hedonism based on their age. So, for example, for an individual of age 25 we would predict a hedonism score of 0.712 - (0.018 × 25) = 0.262. In contrast, we would predict a score of -0.188 for someone of age 50. The regression line is the "line of best fit" shown in Figure 3.2.

Most statistical packages will report the results of a regression analysis in tabular form, e.g. as in Table 3.1.

Table 3.1

             Coefficient
Constant     0.712
Age         -0.018

β̂0 = 0.712 is the predicted value of Y when X = 0, i.e. we would predict an individual of age zero to have a hedonism score of 0.712. Because the minimum age in the sample is 14, this is not very informative.

β̂1 = -0.018 is the predicted change in Y for a 1-unit change in X. So we expect a decrease of 0.018 in the hedonism score for each 1-year increase in age.
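In the single-predictor case the least squares estimates have simple closed forms: β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and β̂0 = ȳ − β̂1x̄. A sketch in Python with numpy, using simulated data with a known underlying line (the ESS sample itself is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated (x, y) data generated from a known line: y = 0.7 - 0.018x + e
x = rng.uniform(14, 98, size=1000)
y = 0.7 - 0.018 * x + rng.normal(0, 0.9, size=1000)

# Closed-form least squares estimates for simple regression
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"fitted line: y-hat = {b0:.3f} + ({b1:.3f}) x")

# np.polyfit minimizes the same sum of squared residuals
b1_check, b0_check = np.polyfit(x, y, deg=1)
assert np.allclose([b0, b1], [b0_check, b1_check])

# Prediction from the fitted line, as in the text's age-25 example
print(f"prediction at x=25: {b0 + b1 * 25:.3f}")
```

Because the fitted line is straight, the predicted change for a 1-unit increase in x is b1 everywhere on the x range, whether moving from 14 to 15 or from 54 to 55.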

Centring

Continuous variables are often centred about the mean so that the intercept has a

more meaningful interpretation. For example, we would centre the variable AGE

by subtracting the sample mean of 46 years from each of its values. If we then

repeat the regression analysis replacing AGE by AGE-46, the intercept becomes the

predicted value of Y when AGE-46=0, i.e. when AGE=46. Rather than a prediction

for a baby of 0 years, which is well outside the age range in the sample, the

intercept now gives a prediction for a 46-year-old adult.

The intercept in the analysis based on centred AGE is estimated as -0.139, which is

the predicted hedonism score for a 46 year old. Centring does not affect the

estimate of the slope because only the origin of X has been shifted; its scale

(standard deviation) has not changed.
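The effect of centring is easy to verify numerically. A sketch with simulated data (Python/numpy), using 46 to stand in for the sample mean age as in the text:

```python
import numpy as np

rng = np.random.default_rng(7)
age = rng.uniform(14, 98, size=800)
hed = 0.7 - 0.018 * age + rng.normal(0, 0.9, size=800)

def ols(x, y):
    """Least squares intercept and slope for simple regression."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - b1 * x.mean(), b1

b0_raw, b1_raw = ols(age, hed)
b0_cen, b1_cen = ols(age - 46, hed)   # centre X about (approximately) its mean

# The slope is unchanged by centring; only the intercept shifts.
assert np.isclose(b1_raw, b1_cen)
# The centred intercept equals the raw-scale prediction for a 46-year-old.
assert np.isclose(b0_cen, b0_raw + b1_raw * 46)
```

Shifting the origin of X changes neither its spread nor its covariance with Y, which is why the slope estimate is identical in the two fits.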

Standardisation and standardised coefficients

Sometimes X is standardised, which involves subtracting the sample mean and then dividing the result by the standard deviation:

(X − mean of X) / (SD of X)

⁵ http://mathforum.org/dynamic/java_gsp/squares.html

Standardisation therefore changes both the origin and the scale: a standardised variable has a mean of 0 and a standard deviation of 1, while centring shifts only the origin and leaves the scale unaltered. After standardisation a unit corresponds to one standard deviation, so if X is standardised its slope is interpreted as the change in Y expected for a one standard deviation change in X.

Sometimes standardised coefficients are calculated. In simple regression the

standardised coefficient of X is the slope that would be obtained if X and Y had

both been standardised, which is equivalent to the Pearson correlation coefficient.

The standardised coefficient of X is interpreted as the number of standard

deviation units change in Y that we would expect for each standard deviation

change in X. While standardised coefficients put each variable on the same scale,

and may therefore be useful for comparing the effect of X on Y in different

subpopulations, the natural meaning of the X and Y variables is lost. The use and

interpretation of standardised coefficients in multiple regression is discussed in

C3.3.2.

When age is standardised the estimated intercept and slope of the regression line are β̂0 = -0.151 and β̂1 = -0.335. If we also standardise hedonism, we obtain β̂1 = -0.343 (now a standardised coefficient), which is equal to the Pearson correlation coefficient given earlier in C3.1.1.
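That the fully standardised slope equals the Pearson correlation can be checked directly. A sketch under the same simulated-data assumption as before (Python/numpy, invented data):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(14, 98, size=600)
y = 0.7 - 0.018 * x + rng.normal(0, 0.9, size=600)

def standardise(v):
    """Subtract the sample mean and divide by the sample SD."""
    return (v - v.mean()) / v.std(ddof=1)

def slope(x, y):
    """Least squares slope for simple regression."""
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

r = np.corrcoef(x, y)[0, 1]

# Standardising X alone: the slope is the change in Y per SD change in X.
print(f"slope with X standardised: {slope(standardise(x), y):.3f}")

# Standardising both X and Y gives the standardised coefficient, which in
# simple regression equals the Pearson correlation coefficient.
b_std = slope(standardise(x), standardise(y))
assert np.isclose(b_std, r)
```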

Important note: We cannot claim that there is a causal relationship between X

and Y from such a simple model or, indeed, from any regression model applied to

observational data. So when interpreting the slope it is better to avoid

statements like a change in X leads to or causes an increase in Y. Taking

account of other factors would provide stronger evidence of a causal relationship

if the original relationship did not change as additional predictors are included in

the model.

C3.1.4 Explained and unexplained variance and R-squared

A regression model separates the response into a systematic part and a random part, where for a simple regression (3.3) the systematic part is β0 + β1xi and the random part is the residual ei.

The systematic part gives the average relationship between the response and the

predictor(s), while the random part is what is left over (the unexplained part)

after taking account of the included predictor(s). Figure 3.2 displays the values on

X and Y for individuals in the sample, and a straight line that we have threaded

through the (X, Y) data points to represent the systematic relation between

hedonism and age. The line represents the fitted values, e.g. if you are 20 years

old you are predicted to have a hedonism score of about 0.3.

The term random means allowed to vary and, in relation to Figure 3.2, the

random part is the portion of hedonism that is not accounted for by the underlying

average relationship with age. Some people are more and some less hedonistic

given their age. The residual is the difference between the actual and predicted

hedonism. In some cases there will be a close fit between the actual and fitted

values, e.g. if differences in age explain most of the variability in hedonism. In

other cases there may be a lot of noise, e.g. if, for any given age, there is a wide

range of hedonism scores. It is helpful to characterise this residual variability. To

do so requires us to make some assumptions about the residuals (normality and

homoskedasticity - see C3.1.2). Under these assumptions we can summarise the

variability in a single statistic, the variance of the residuals, σ². We can think of

the residual variance as the part of the variance in Y that is unexplained by X.

The part of the variance in Y that is explained by X (the systematic part of the

model) is called the explained variance in Y. For Figure 3.2 the residual or

unexplained variance is 0.84. The total variance in hedonism scores (which is the

sum of the explained and unexplained variances) is 0.95, so by subtraction the

explained variance is 0.11.

Another key summary statistic is the R-squared (R²) value, which gives the

correspondence between the actual and fitted values, on a scale between zero (no

correspondence) and 1 (complete correspondence). R-squared can also be

interpreted as the proportion of the total variance in Y that can be explained by

variability in X. For the hedonism data, R-squared = 0.11/0.95 = 0.12 so 12% of the

variance in hedonism scores can be explained by age.

In the case of simple regression, R-squared is the square of the Pearson correlation

coefficient.
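The variance decomposition can be verified directly. A sketch with simulated data (Python/numpy; the values 0.84, 0.95 and 0.11 from the ESS example are not reproduced, the decomposition itself is what is being checked):

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.uniform(14, 98, size=2000)
y = 0.7 - 0.018 * x + rng.normal(0, 0.9, size=2000)

# Fit by least squares, then form fitted values and residuals
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
resid = y - y_hat

total_var = y.var(ddof=1)          # total variance in Y
resid_var = resid.var(ddof=1)      # unexplained (residual) variance
expl_var = total_var - resid_var   # explained variance, by subtraction

r_squared = expl_var / total_var
print(f"R-squared = {r_squared:.3f}")

# In simple regression, R-squared equals the squared Pearson correlation.
r = np.corrcoef(x, y)[0, 1]
assert np.isclose(r_squared, r ** 2)
```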


C3.1.5 Hypothesis testing

We must bear in mind that the estimates of the intercept and slope are subject to sampling variability, as is any statistic calculated from a sample. While we have established that there is a negative relationship between hedonism and age in our sample, we are really interested in their relationship in the population from which our sample was drawn (the combined populations of France, Germany and the UK). In other words, is the relationship statistically significant, or could we have got such a result by chance?

The null hypothesis (H0) for our test is that there is no relationship between hedonism and age in the population, in which case β1 = 0.

The test of a relationship between hedonism and age is based on the estimate of the slope of the relationship and a measure of the precision of this estimate. The standard error is a measure of imprecision, where large values indicate greater uncertainty about the true (population) value. The standard error is inversely related to sample size, so that the precision of the estimate of β1 increases as the sample size increases. The standard error also depends on the amount of variability in X and the amount of variance in Y that is unexplained by X (the residual variance): the standard error decreases as the variance in X increases, and the standard error increases as the residual variance increases.

In our example, the standard error of β̂1 is 0.001, and a 95% confidence interval for β1 is therefore

β̂1 ± 1.96 × SE(β̂1) = -0.018 ± 1.96 × 0.001 = (-0.020, -0.016)⁶

Zero (the value of β1 under H0) is well outside the 95% confidence interval, so we reject the null hypothesis and conclude that the relationship is statistically significant at the 5% level.

Alternatively, but equivalently, we can calculate the test statistic (often called the Z or t-ratio)

|Z| = |β̂1| / SE(β̂1) = 0.018 / 0.001 = 27.9

(the value 27.9 is computed from the unrounded estimate and standard error). A large value of |Z| indicates that a slope estimate as large as that observed would be unlikely if the null hypothesis were true, in which case the p-value is small. In this case the p-value is tiny, less than 0.001. If there was no relationship between hedonism and age in the population (i.e. the true slope is zero), we would expect less than 0.1% of samples from that population to produce a slope estimate of magnitude greater than 0.018.

In the practice sections, we will generally use the Z-ratio to test significance rather than calculating confidence intervals.

We can also calculate a confidence interval for the population intercept β0, but the slope is of principal interest since it measures the relationship between X and Y.

⁶ -1.96 and +1.96 are the 2.5% and 97.5% points of a standard normal distribution (one with a mean of zero and a standard deviation of one). The middle 95% of the distribution lies between these points.

C3.1.6 Model checking

The assumptions of the linear regression model were set out in C3.1.2 but, briefly, we assume:

i) The residuals are normally distributed.

ii) The variance of the residuals is constant, whatever the value of x, i.e. the residuals are homoskedastic.

iii) The residuals are not correlated with one another, i.e. they are independent.

We can check the validity of assumptions i) and ii) by examining plots of the estimated residuals. If it is suspected that residuals might be correlated because the data are clustered in some way, we can test assumption iii) by comparing a multilevel model, which accounts for clustering, with a multiple regression model which ignores clustering (see Module 5).

To check assumptions about ei, we use the estimated residuals, which are the differences between the observed and predicted values of y:

êi = yi − ŷi

We usually work with the standardized residuals ri, which we obtain by dividing êi by their standard deviation.
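The quantities used above (the standard error of the slope, the Z-ratio, the 95% confidence interval and the standardized residuals) can all be computed from first principles. A sketch with simulated data (Python/numpy; the numbers are invented, not the module's ESS estimates):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
x = rng.uniform(14, 98, size=n)
y = 0.7 - 0.018 * x + rng.normal(0, 0.9, size=n)

# Least squares fit and residuals
b1, b0 = np.polyfit(x, y, deg=1)
resid = y - (b0 + b1 * x)

# Estimated residual variance (2 parameters estimated -> n - 2 df)
sigma2_hat = np.sum(resid ** 2) / (n - 2)

# Standard error of the slope: sqrt(residual variance / sum of squares of x)
se_b1 = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))

z = b1 / se_b1                                # Z (t) ratio for H0: slope = 0
ci = (b1 - 1.96 * se_b1, b1 + 1.96 * se_b1)   # 95% confidence interval
print(f"slope={b1:.4f}, SE={se_b1:.4f}, Z={z:.1f}, "
      f"95% CI=({ci[0]:.4f}, {ci[1]:.4f})")

# Standardized residuals: residuals divided by their standard deviation
std_resid = resid / np.sqrt(sigma2_hat)
```

With a nonzero true slope and a large sample, |Z| far exceeds 1.96 and the confidence interval excludes zero, mirroring the conclusion drawn for the hedonism-age relationship.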

We can check whether residuals are normally distributed by looking at a histogram

or a normal probability plot of the standardized residuals. If the normality

assumption holds, the points in a normal plot should lie on a straight line.


Figure 3.3 and Figure 3.4 show a histogram and normal probability plot of residuals

from a simple regression model with age. Both plots suggest that the normal

distribution assumption is reasonable here.

To check that the variance of the residuals is fairly constant across the range of X,

we can examine a plot of the standardized residuals against X and check that the

vertical scatter of the residuals is roughly the same for different values of X with

no funnelling.


Figure 3.5 shows a plot of ri versus x i . The vertical spread of the points appears

fairly equal across different values of X, so we conclude that the assumption of

homoskedasticity is reasonable.


Centre for Multilevel Modelling, 2008


Outliers

We can also check for outliers using any of the above residual plots. An outlier is a

point with a particularly large residual. We would expect approximately 95% of

the residuals to lie between 2 and +2.

Of major interest, however, is whether an outlier has undue influence on our

results. For example, in simple regression, an outlier with very large values on X

and Y could push up a positive slope.

A straightforward way to judge the

influence of an outlier is to refit the regression line after excluding it. If the

results are very similar to those based on all observations, we would conclude that

the outlier does not have undue influence. An observations influence can also be

measured by a statistic called Cooks D (see C3.5.3).

Dont forget to do the practical for this section! (see page 2 for

details of how to find the practical)

also be applied when X is categorical, in which case we are allowing the mean of Y

to be potentially different for the different categories of X.

C3.2.1

wish to compare the mean of our response variable Y for the two groups defined by

these categories.

We will examine whether there are gender differences in hedonism. In the human

values dataset there is a variable called SEX which is coded 1 for female, and 0 for

male. Variables that have codes of 0 and 1 are often called dummy variables. If

we simply calculate the mean of our response variable HED for men and women,

we obtain the results given in Table 3.2.

Please read P3.1, which is available in online form or as part of a pdf file.

(see page 2 for details of how to find the quiz questions)

Sample size

Women

Men

2747

3098

-0.225

-0.069

We can use a normal test (or t-test if the sample is small) to test for a difference

between women and men in the population. The null hypothesis for the test is

that the gender difference between the mean of hedonism in the population is

zero. The test statistic is -6.12 and the p-value is less than 0.0001. A 95%

confidence interval for the difference between the female and male population

means is (-0.206, -0.106), which does not contain the null value of zero. We

therefore conclude that the difference between women and mens hedonism

scores is statistically significant (at the 0.01% level).

We can also compare groups using a regression model. The advantage of using

regression, rather than a normal (or t) test, is that in a regression model we can

allow for the effects of other variables as well as gender. To start with, however,

we will consider gender as the only explanatory variable and demonstrate how

men and womens hedonism scores can be compared using regression.

18

19

C3.2.2

y i = 0 + 1 x i + ei

where y i is the hedonism score of individual i, and x i =1 if the individual is a

woman, and 0 if the respondent is male7.

Table 3.3. Regression of hedonism on sex

Coefficient

Constant

Sex

-0.069

-0.156

wish to compare the mean of our outcome variable Y for the three groups defined

by these categories.

The respondents in the hedonism example come from three countries. The mean

of HED for each country is given in Table 3.4.

Table 3.4. Descriptive statistics for hedonism by country

Standard Error

0.019

0.025

Sample size

The regression output is given in Table 3.3, from which we obtain the fitted

regression equation:

HEDi = 0.069 0.156 SEX i

UK

Germany

France

1748

2785

1312

-0.384

-0.128

0.108

The standard way to compare more than two groups is to use analysis of variance

(ANOVA)8. The null hypothesis is that there is no difference between groups (i.e.

that the group means are all equal). Table 3.5 shows the results from an ANOVA

for a comparison of hedonism for the three countries. When there is just one

categorical variable, this type of analysis is usually called a one-way ANOVA.

We can use this equation to predict HED for men and women:

For men (SEX=0), HED = 0.069 (0.156 0) = 0.069

For women (SEX=1), HED = 0.069 (0.156 1) = 0.225

Notice that these predicted values are just the mean hedonism scores for men and

women, and that the coefficient of SEX is the difference between these means

(womens mean mens mean, since SEX is coded 1 for women here).

The null hypothesis that there is no difference between the mean score for men

and women in the population can be expressed as H0: 1 = 0 . The standard error

of 1 is 0.025 and the Z-ratio is therefore 0.156 0.025 = 6.12 . The 95%

confidence interval for 1 is (-0.206, -0.106).

Note that these results are exactly the same as those for the independent samples

comparison of means test given earlier. So if SEX is the only explanatory variable,

a regression analysis gives exactly the same results as a t-test. But only in a

regression analysis can we include other explanatory variables.

Between countries

Within countries

Total

Sum of

squares

d.f.

Mean

square

F statistic

p-value

184.5

5370.7

5555.2

2

5842

5844

92.3

0.9

100.4

<0.001

The tiny p-value suggests that we can reject the null hypothesis and conclude that

there are significant between-country differences in hedonism.

Discussions of degrees of freedom can be found at

http://www.animatedsoftware.com/statglos/sgdegree.htm

and http://davidmlane.com/hyperstat/A42408.html

9

7

20

21

The statistical model behind ANOVA is in fact a multiple regression model. But

rather than including country as an explanatory variable10, we create dummy

variables for two of the three countries and include these. Suppose we create

three variables which indicate whether a respondent is from a particular country,

i.e.

UK

=1 if respondent is from the UK, =0 if from Germany or France

GERM

=1 if respondent is from Germany, =0 if from UK or France

FRANCE

=1 if respondent is from France, =0 if from UK or Germany

These variables are called dummy variables11. In fact we do not need all three of

these variables because if we know a respondents value on two of them, we can

infer their value on the third. E.g. if we know that UK=0 and GERM=1, then we

know that FRANCE=0. (A respondent can only be living in one country at the time

of survey, so only one of UK, GERM and FRANCE can equal 1 for any given

individual.) By the same argument, when we have a categorical variable with only

two categories (e.g. our SEX variable in C3.2.1) we do not need to create any

additional variables. SEX is already a dummy variable, and can therefore be

included directly in the model as an explanatory variable.

To allow for differences between the UK, Germany and France, we choose

(arbitrarily) two of the country dummy variables and include those as explanatory

variables. Suppose we choose GERM and FRANCE, then the multiple regression

model is:

Coefficient

Constant

Country

Germany

France

Standard

error

Z-ratio

-0.384

0.023

0.256

0.492

0.029

0.035

8.765

14.052

p-value

<0.001

<0.001

HEDi = 0.384 + 0.256 GERMi + 0.492 FRANCE i

We can use this equation to predict the hedonism score for inhabitants of each

country.

For UK residents (GERM=0, FRANCE=0):

For Germans (GERM=1, FRANCE=0):

Notice that these predictions give exactly the same results as the country means in

Table 3.4. We obtain the prediction for the UK directly; the estimate of the

intercept will always equal the mean for the omitted category, i.e. the category

for which we do not include a dummy variable in the model. The coefficients 1

and 2 are interpreted as differences between one of the other countries and the

omitted category country.

10

0 + 1 COUNTRYi + ei

because

the coding of COUNTRY is arbitrary (i.e. COUNTRY is a nominal variable). In such a model, b 1

would be interpreted as the effect on HED of a 1 unit change in COUNTRY, but a 1 unit change in

COUNTRY has no meaning!

11

This is the most common way of coding dummy variables for a categorical variable and is often

called simple coding, but other types of coding are possible depending on which comparisons are of

interest. A comprehensive discussion of alternative coding systems can be found at

http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter5/statareg5.htm

22

1 = 0.256 is the difference between the means for Germany and the UK

2 = 0.492 is the difference between the means for France and the UK

The UK is the reference category. (If we had included the UK and FRANCE dummy

variables in the model, then Germany would have been the reference.)

23

1 2 = 0.236.

C3.2.3

The null hypothesis for testing whether there is a difference between the mean

hedonism scores in the populations of Germany and the UK can be expressed as H0:

1 =0. Similarly the null for testing whether there is a difference between France

and the UK is H0: 2 =0. The simplest way to compare Germany and France would

be to refit the model, making one of these countries the reference category.

Table 3.7 shows the results when Germany is taken as the reference, i.e. when the

UK and FRANCE dummies are included in the model. The difference between the

means for France and Germany is now obtained directly (from the coefficient of

the FRANCE dummy) as 0.236.

All coefficients in Table 3.6 and Table 3.7 are significantly different from zero (all

p-values are <0.001), so we conclude that all pairwise differences between

countries are significant.

Constant

Country

UK

France

Standard

error

Z-ratio

-0.128

0.018

-0.256

0.236

0.029

0.032

-8.765

7.341

wished to compare. One approach would be to include 19 dummy variables for

countries. However, there are some potential drawbacks of this approach:

i)

ii)

between countries and other explanatory variables (see C3.4) will lead to

even more parameters. Also, if the sample sizes within some countries are

small, the estimates of the coefficients of the dummy variables for those

countries may be unreliable

Suppose we wish to estimate the effects of country characteristics, e.g. the

effects on hedonism of cultural factors, such as religiosity, or economic

status. It can be shown that it is not possible to estimate the effects of

these variables as well as the coefficients of the country dummy variables.

(This is because any country-level variable can be expressed as a linear

function of the country dummy variables.)

from a larger population. For example, we may have data on a sample of schools.

In such cases it is the population of groups from which our sample was drawn that

is of interest. However, the origins of the dummy variable (or ANOVA) approach

lie in experimental design where there is typically a small number of groups to be

compared and all groups of interest are sampled. The ANOVA approach does not

allow us to make inferences beyond the groups in our sample.

Coefficient

p-value

<0.001

<0.001

5).

The above tests are for comparing pairs of countries. In ANOVA, the null

hypothesis is that the means for all three countries are equal. In most statistical

packages, an ANOVA table is given as part of the standard regression analysis

output. The only difference between the regression ANOVA table and the one-way

ANOVA table is that the between-country sum of squares would usually be called

the regression sum of squares, and within-country would be replaced by

residual. All numerical results would be exactly the same. In regression terms,

the null hypothesis being tested is that all coefficients are zero, which in this case

is that 1 = 2 = 0 . This is just another way of saying that all three country means

are equal.

Dont forget to do the practical for this section! (see page 2 for

details of how to find the practical)

Please read P3.2, which is available in online form or as part of a pdf file.

Dont forget to take the online quiz! (see page 2 for details of how

to find the quiz questions)

The advantage of multiple regression over one-way ANOVA is that regression can

allow for the effects of several explanatory variables simultaneously12.

12

A multiple regression analysis with two categorical explanatory variables is sometimes called a

two-way ANOVA, while a regression with a mixture of categorical and continuous variables is called

an analysis of covariance (ANCOVA).

24

25

C3.3.2 The multiple regression model

Variable (Multiple Regression)

C3.3.1

C3.3.2

In simple regression we have a single predictor or explanatory variable (X), and the

linear regression model is

Statistical control

So far we have used simple regression to assess the linear relationship between

two variables. In reality there will be a number of factors that are potential

predictors of the outcome variable.

The advantage of using a regression

framework is that we can straightforwardly account for the effects of multiple

variables simultaneously.

Examples

i) Suppose we compare two secondary schools on their age 16 exam performance,

e.g. we might compare the percentage of students who achieve a pass in five or

more subjects. Suppose we find that school 1 has a higher percentage with 5+

passes than school 2. Would we conclude that school 1s performance was better

than school 2? What other factors would we like to take into account? An obvious

candidate would be a measure of students achievement when they entered

secondary school so that school effects are value-added.

ii) Comparisons of men and womens salaries often reveal that women earn less.

Explanations that are commonly put forward for this discrepancy are that women

tend to work in jobs that have been traditionally lower paid, or that women have

taken time out of paid employment to raise children. To determine whether there

are salary differences between men and women who have been working in the

same job for the same amount of time, we would wish to account for occupation

and number of years of full-time employment as well as other factors such as

education level. Using multiple regression we can test whether these other factors

explain gender differences in salary, i.e. does any gender difference disappear

when we adjust for the effects of these other variables?

We can use multiple regression to take into account or adjust for other factors

that might predict the response variable. Sometimes the effects of these other

factors are of interest in themselves, e.g. predictors of age 16 attainment other

than the school attended. Other times the effects of other factors are not of

major interest, but it is important to adjust for their effects to obtain more

meaningful estimates of effects that we are interested in. Such factors are often

called controls.

26

y i = 0 + 1 x i + ei .

In multiple regression, we have more than one predictor. Suppose that we have

two predictors, denoted by X1 and X2, which may be continuous or categorical. We

have in fact already used a multiple regression model to analyse country

differences in hedonism (in C3.2.2). Although there was just one predictor,

country, it was represented by two dummy variables. More generally, we can

include several predictors and any of these may be represented by a set of dummy

variables.

The multiple (linear) regression model for two continuous (or dichotomous)

explanatory variables is written

y i = 0 + 1 x 1i + 2 x 2i + ei

where 0 is the value of y that would be expected when x1 =0 and x 2 =0.

The coefficients 1 and 2 are interpreted as follows:

change in x1 controlling or adjusting for the effect of x 2 . In other words, 1 is

the effect of x1 for individuals with the same value of x 2 (or holding x 2

constant).

for a 1-unit change in x 2 controlling for the effect of x1 .

an explanatory variable and the dependent variable, conditioning on the effect of

all other explanatory variables in the model, they are sometimes called partial

regression coefficients.

We can test for a linear relationship between the response variable Y and a

predictor variable Xk by testing the null hypothesis that the coefficient of Xk is zero

(H0: k =0) versus the alternative hypothesis that the coefficient is non-zero (HA:

k 0).

As in simple regression, we can test for significance by examining

confidence intervals for each parameter or, equivalently, by comparing Z-ratios to

the normal distribution and calculating a p-value.

27

factors other than X1 and X2 that predict Y, but we use ei in a general way to

represent residuals in any model.

Notice that, as expected, there is little change in the coefficient of age when

education is added, but the relationship between hedonism and education is now

negative after accounting for age. Both relationships are significantly different

from zero at the 0.1% level. The relationship between hedonism and education

should be interpreted with some caution, however. We should hesitate to

conclude that education affects or causes hedonism. It is likely that hedonism and

education are both influenced by variables that we have not accounted for in this

model.

Note that when we have two predictors we would need a three-dimensional

scatterplot to represent the relationship between Y and X1 and X2 graphically13.

As it can be difficult to interpret three-dimensional plots, we can explore the data

by looking at plots of Y versus X1, Y versus X2, and X1 versus X2. The third plot is

important to check whether X1 versus X2 are highly correlated.

Example

We will begin with the case where both X1 and X2 are continuous. Lets consider

the effects of age (X1) and education (X2) on hedonism. We will ignore gender and

country differences for now. We have already examined the bivariate relationship

between hedonism and age and found that older respondents tend to be less

hedonistic (in C3.1). This relationship may change when we account for education

if education is related to both hedonism and age. For example, we would expect

older respondents to have fewer years of education and a higher level of education

might be associated with less hedonistic beliefs if the more career-minded choose

study over having a good time!

Figure 3.6 shows the relationship between hedonism and education (see Figure 3.2

for a plot of hedonism versus age). The relationship between the two explanatory

variables, age and education, is shown in Figure 3.7. The correlation between

hedonism and education is very weak; the Pearson coefficient is only 0.024. As

expected, there is a negative correlation between age and education (r=-0.242).

Because of the weak correlation between hedonism and education, however, we

would not expect the addition of education in a multiple regression to have much

impact on the coefficient of age.

In C3.1.3 the fitted equation from a simple regression of hedonism on age was

found to be:

If we add education to the model, we obtain the following fitted multiple

regression equation:

dimensional space.

13

0 + 1 x1 + 2 x 2

28

29

particularly important if regression results are used to inform public policy14.

When standardised coefficients are reported, they should be accompanied by the

corresponding unstandardised coefficients which represent effects in terms of the

original units of measurement for X and Y. This is particularly important for

categorical X.

C3.3.3

Suppose a scatterplot of Y versus X resembles Figure 3.8. The relationship is nonlinear, so it would not be appropriate to fit the straight line relationship implied

by a linear regression model. We should fit a curve through the points rather than

a line. The simplest curve is a quadratic function (or a second order polynomial):

y i = 0 + 1 x i + 2 x i2 + ei

Note that the above is an example of a multiple regression model with x 1 = x and

x 2 = x 2 . Also shown in Figure 3.8 is the fitted quadratic curve, which turns out to

have equation yi = 1.00 + 1.02 x i 0.47 x i2 .

Standardised coefficients

Standardisation and standardised coefficients were introduced in C3.1.3. To recap,

the standardised coefficient for a predictor X is the estimate of the slope that

would be obtained if X and Y were both standardised before the regression

analysis. In simple regression, the standardised coefficient of X is equal to the

Pearson correlation coefficient. In multiple regression, with two predictors X1 and

X2, the standardised coefficient of X1 is interpreted as the change in standardised Y

for a 1-unit change in standardised X1, holding X2 constant. (Recall that 1 unit of a

standardised variable corresponds to 1 standard deviation.) For example, in a

multiple regression model of hedonism on age and education, the standardised

coefficient for AGE is -0.358. Thus we can say that a 1 standard deviation change

in age predicts a 0.358 standard deviation decrease in hedonism. Note that if all

variables (Y and the Xs) had been standardised prior to the analysis, then the

unstandardised and standardised coefficients would be equal.

-1

-2

-3

-2.0

reported in much published quantitative research, but they should be interpreted

with caution. It is often claimed that standardised coefficients can be compared

across the predictors to determine which has the strongest influence on Y.

However, predictors are usually correlated with one another and it is rarely

possible to change the value of one without changing the value of another.

30

-1.5

-1.0

-.5

0.0

.5

1.0

1.5

2.0

14

interpretation of standardised coefficients.

31

The results from fitting a quadratic curve to the relationship between hedonism

and age are given in Table 3.8. Note that this analysis is based on standardised

age and its square. This is because, for older respondents (remember the oldest is

98), age2 takes very large values; this may cause computational difficulties and the

coefficient of age2 would be very small.

Table 3.8. Regression with quadratic effects for age

Coeff.

S.E.

-0.222

-0.348

0.072

0.017

0.012

0.011

Constant

Standardised age

Standardized age-squared

Z-ratio

-28.669

6.288

C3.3.4

Suppose that we have p predictors, which we denote by X1, X2, X3, . . ., Xp. Then

the multiple regression model is

y i = 0 + 1 x 1i + 2 x 2i + 3 x 3i + ... + p x pi + ei

Variation explained: R2

p-value

explanatory variable X (see C3.1.4). When there is more than one X, R2 is the

proportion of variance in Y explained by all variables in the model. An alternative

interpretation of R2 is as the square of the correlation between the predicted

values of Y (from the fitted model) and the observed values of Y.

<0.001

<0.001

level, so we conclude that age-squared should be retained in the model and that

the quadratic model is therefore a better fit to the data than the linear model.

The positive coefficient of the squared term, together with the negative

coefficient of the linear term, indicates that the negative relationship flattens out

at older ages. A scatterplot with the fitted curve is shown in Figure 3.9.

The R2 for the regression model with age and education effects is 0.121, so 12.1%

of the variance in hedonism scores is due to variation in age and education. The

correlation between the predicted and observed hedonism scores is 0.348

= 0 . 121 . As suggested by the low bivariate correlation between hedonism and

education, education has little explanatory power; when education is removed the

model R2 decreases only slightly to 0.118.

A problem with R2 is that it always increases even if irrelevant variables are added

to the model. Therefore in multiple regression a measure called the adjusted R2 is

usually quoted. The adjusted R2 takes into account the number of variables in the

model. It is therefore a goodness-of-fit measure that is penalised by the

complexity of the model. With such a measure, the value will only increase if the

additional predictors are accounting for some of the variability in the response. In

this simple example with only two explanatory variables, age and education, the

adjusted R2 turns out to be the same as the unadjusted value.

Multicollinearity

Before carrying out a regression analysis, we should always look at the correlation

between each pair of predictor variables. If the correlation between a pair is very

high (>0.8 say), the estimates of the coefficients of those variables may be

unstable and imprecise (large standard errors). If the two variables are really

measuring the same thing, we should consider dropping one. Otherwise, we might

replace the two variables by a new variable which is a combination of the two15.

Figure 3.9. Plot of hedonism versus standardised age with fitted quadratic curve

15

Principal components analysis or factor analysis can be used to reduce a set of correlated

variables into a smaller set of uncorrelated variables.

32

33

The coefficient of age in months (AGE*12) is -0.002, which is the coefficient of age

in years (AGE) divided by 12. This is because 1 unit on the scale of AGE*12 is equal

to 1/12 of a unit on the scale of AGE. Notice that the intercept does not change

because AGE=0 means the same whether the measurement is in months or years.

The coefficient of education is unaffected by transformations in age.

variables. There are many procedures that have been proposed to automatically

select the best model from a set of variables (e.g. backward elimination, forward

selection, stepwise selection), and many of these have been implemented in

mainstream statistical software. These procedures are sometimes useful in that

they provide a systematic means of model selection, but they should be used with

caution or you may be accused of data dredging. In practice your research

design and analysis will be guided by theory, which will come from previous

research in the same or related areas as well as your own ideas. Often you will

have several rival theories that you wish to compare and assess which have the

stronger empirical support. These theories and your particular research question

will guide the order in which you enter explanatory variables into the model.

For example, suppose you are interested in examining gender differences in salary

levels. The first model you fit might include only a gender effect. Suppose you

find that there is a significant difference between men and women. You might

then add in other explanatory variables to see which ones, if any, help to explain

the gender difference.

A further step in the analysis would be to test whether

the gender difference is the same for all men and women, e.g. gender differences

may be larger in some occupation categories than in others (an example of an

interaction effect - see C3.4). In other situations, there will be variables that you

want to include for interpretation purposes. For example, in educational research,

you might be interested in looking at predictors of academic progress rather than

academic attainment at one point in time. One way to do that is to include prior

attainment as an explanatory variable in the model.

sometimes quoted too. When researchers talk about effect sizes, they are often

referring to standardised coefficients.

Dont forget to do the practical for this section! (see page 2 for

details of how to find the practical)

Please read P3.3, which is available in online form or as part of a pdf file.

Dont forget to take the online quiz for this section! (see page 2 for

details of how to find the quiz questions)

Effect sizes

The size of the coefficient for predictor variable Xk will depend on the scales of Xk

and the response variable. For example, suppose we multiply each value of AGE

by 12 to give age in months rather than years and refit the multiple regression

model with age and education effects. We obtain the results shown in Table 3.9.

Table 3.9. Regression of hedonism on age and education for different age scales.

Age in years

Coeff.

Constant

Age

Education

0.971

-0.019

-0.017

Z-ratio

-28.206

-4.915

Age in months

Coeff.

Z-ratio

0.971

-0.002

-0.017

-28.206

-4.915

34

35

C3.4.1 Model with fixed slopes across groups

In C3.2 we saw how to compare groups using dummy variables in a regression

model. For example, we compared the mean hedonism score for men and women,

and for different countries. So far, however, we have assumed that the effects of

other predictor variables, e.g. age, are the same for each group. This is

equivalent to assuming that group differences in hedonism are the same for all

values of the other predictors. This assumption may be unrealistic. Perhaps age

differences in hedonism are more pronounced among men, which would imply that

the age effect differs for men and women.

Two predictors are said to have an interaction effect on Y if the effect of one of

the predictors on Y depends on the value of the other predictor.

C3.4.1

Suppose we fit a multiple regression model with age and gender effects:

(3.7)

Figure 3.10. Regression lines for men and women, fixed slopes

Note: The age range in the sample is 14 to 98 years. The software used to draw the plot

has extrapolated beyond the observed range regression lines which is not generally

recommended.

For SEX=0 (men), the relationship between HED and AGE is represented by the

line:

and for SEX=1 (women), the fitted line is:

C3.4.2

Is it reasonable to assume that the gender difference in hedonism is the same for

all ages? One way of allowing men and women to have different slopes for the

relationship between hedonism and age is to fit a separate regression line for each

sex. We do this by splitting the sample by gender16, and fitting a simple regression

of HED on AGE for each sex. If we do this, we obtain the results shown in Table

3.10.

So the lines for men and women have different intercepts, but the same slope, i.e.

the regression lines are parallel (see Figure 3.10). There are two equivalent ways

of interpreting Figure 3.10. We can say that the effect of age on hedonism is the

same for men and women. Alternatively we can say that the gender difference in

hedonism is the same at all ages.

16

This is often done using a select if command or menu option, or by requesting an analysis that is

stratified by gender.

36

37

Table 3.10. Regression of hedonism on age with separate models fitted for men and women

               Coeff.    S.E.    Z-ratio
Men
  Constant      0.839    0.047
  Age (years)  -0.019    0.001   -20.854
Women
  Constant      0.597    0.047
  Age (years)  -0.018    0.001   -18.910

For men the slope of age is -0.019, compared to -0.018 for women. So the slope is slightly steeper for men. Because women have a lower intercept than men, a steeper slope for men implies that the gender difference is greater among younger respondents (see Figure 3.11 later).

While splitting the sample into groups is a simple way of allowing for different slopes for each group, there are several problems with this approach:

i) There may be more than one categorical predictor, and therefore more than one way of grouping the data. The effects of the other predictors may vary across each grouping, e.g. hedonism may vary by sex and by country.

ii) Splitting the data into groups defined by sex and country will lead to a large number of groups; in this dataset, the sample sizes in each group remain large, but this will often not be the case.

iii) It may not be the case that the effects of all predictors vary across groups. In that case, fitting a separate regression for each group is inefficient. Where the coefficient of a predictor does not vary across groups, it would be better to estimate it using information from the whole sample; the estimate of the coefficient would then be based on a larger sample size and would therefore have a smaller standard error than if it were estimated separately for each group.

iv) When separate analyses are carried out for each group, it is not possible to carry out hypothesis tests to compare coefficients across groups. For example, if we fit separate regressions of hedonism on age for men and women we cannot test whether there is a gender difference in the relationship between hedonism and age in the population.

C3.4.3

Rather than fitting a separate model for each sex, we will fit a single model to the whole pooled sample. We create a new variable which is the product of AGE and SEX:

AGE_SEX = AGE × SEX

The new variable AGE_SEX is added as another predictor variable to model (3.7) to give:

HEDi = β0 + β1AGEi + β2SEXi + β3AGE_SEXi + ei    (3.8)

Table 3.11 gives an extract of the analysis data file to which (3.8) could be fitted.

Table 3.11. Example of hedonism dataset with age by sex interaction variable

Respondent   Hedonism   AGE   SEX   AGE_SEX
1             1.55      25    0      0
2             0.76      30    0      0
3            -0.26      59    0      0
4            -1.00      47    1     47
.             .         .     .      .
5845          0.74      65    0      0

The inclusion of AGE_SEX, called the interaction between AGE and SEX, allows the effect of AGE on HED to differ for men and women (or, equivalently, the effect of sex on HED to depend on AGE). If the effect of age differs by sex, we say that there is an interaction effect. To see how an interaction effect works, we will look at the regression model for each value of SEX.

For SEX=0 (men), AGE_SEX=0 so the regression model (3.8) becomes:

HEDi = β0 + β1AGEi + ei    (3.9)

For SEX=1 (women), AGE_SEX=AGE and the regression model (3.8) becomes:

HEDi = (β0 + β2) + (β1 + β3)AGEi + ei    (3.10)

In equation (3.9) the intercept is β0, and in (3.10) it is β0 + β2. So β2 is the difference between intercepts for men and women. In equation (3.9) the slope of AGE is β1, and in (3.10) it is β1 + β3. So β3 is the difference between slopes for men and women.

Centre for Multilevel Modelling, 2008
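The equivalence between the pooled interaction model and the two separate fits can be checked numerically. Below is a minimal numpy sketch on simulated data (not the hedonism sample; the coefficients and variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
age = rng.uniform(18, 80, n)
sex = rng.integers(0, 2, n)              # 0 = men, 1 = women
# Illustrative coefficients loosely echoing the tables in this section
hed = 0.84 - 0.019 * age - 0.24 * sex + 0.002 * age * sex + rng.normal(0, 0.5, n)

# Pooled interaction model (3.8): HED = b0 + b1*AGE + b2*SEX + b3*AGE_SEX + e
age_sex = age * sex
X = np.column_stack([np.ones(n), age, sex, age_sex])
b = np.linalg.lstsq(X, hed, rcond=None)[0]

def fit_line(x, y):
    # Simple regression of y on x: returns (intercept, slope)
    A = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(A, y, rcond=None)[0]

b_men = fit_line(age[sex == 0], hed[sex == 0])
b_women = fit_line(age[sex == 1], hed[sex == 1])

# Men's line is (b0, b1); women's line is (b0 + b2, b1 + b3), as in (3.9)-(3.10)
print(np.allclose([b[0], b[1]], b_men),
      np.allclose([b[0] + b[2], b[1] + b[3]], b_women))   # True True
```

Because SEX is binary and both its main effect and its interaction with AGE are included, model (3.8) is saturated in the grouping variable, so the pooled fit reproduces the two group-specific least squares lines exactly.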

Table 3.12 shows the results from fitting model (3.8) to the hedonism data.

Table 3.12. Regression of hedonism on age and sex, pooled analysis with interaction

               Coeff.    S.E.    Z-ratio   p-value
Constant        0.839    0.048
Age (years)    -0.019    0.001   -20.075    <0.001
Female         -0.242    0.066    -3.649    <0.001
Age × Female    0.002    0.001     1.461     0.144

Substituting SEX=1 into the fitted version of (3.8), the predicted regression line for women is:

HEDi = 0.839 - 0.019AGEi - 0.242 + 0.002AGEi
     = 0.597 - 0.017AGEi

Notice that the intercept and slope estimates from the interaction model are exactly the same as the estimates we got from fitting a simple regression for each sex separately. Figure 3.11 shows the predicted regression lines for men and women. Note that the lines are no longer parallel because we have allowed for different slopes in our regression model. The gender difference in hedonism is slightly larger among young respondents.

Figure 3.11. Regression lines for men and women, varying slopes

C3.4.4

Is the slope in the regression of hedonism on age significantly different for men and women? The coefficient of the interaction term, β3 in (3.8), is the difference in the slope for men and women. So the null hypothesis that the slopes are the same for men and women can be expressed as H0: β3 = 0. From Table 3.12, we see that the Z-ratio for this test is 1.461 and the p-value is 0.144. So we cannot reject the null hypothesis and we conclude that the slope of age is the same for men and women. We would then return to the simpler model (3.7) with the fixed slope.
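The arithmetic behind this conclusion is easy to reproduce. A small sketch, assuming only that the Z-ratio is referred to a standard normal distribution (as the module does):

```python
import math

def two_sided_p(z):
    # p = 2 * (1 - Phi(|z|)), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

p = two_sided_p(1.461)   # Z-ratio of the Age x Female term in Table 3.12
print(round(p, 3))       # 0.144, matching the table
```

Note that the coefficient and standard error printed in the table are rounded, so dividing 0.002 by 0.001 does not reproduce the Z-ratio exactly; the test uses the unrounded estimates.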

C3.4.5

We have concluded that the effect of age on hedonism is the same for men and women. Or, equivalently, we can conclude that the gender difference in hedonism is the same for all ages. We will now test whether the effect of age is the same in different countries or, equivalently, whether differences between countries depend on age.

Previously we allowed for differences in hedonism between the UK, Germany and France, i.e. we included the variables GERM and FRANCE as predictors in the regression model, with the UK as the reference category. (We could equally have fitted the model taking either Germany or France as the reference category.) To allow the effect of age on hedonism to vary across countries, we need to create two interaction variables which we will call AGE_GERM and AGE_FRANCE. These are defined as follows:

AGE_GERM = AGE × GERM
AGE_FRANCE = AGE × FRANCE

The interaction model has the form:

HEDi = β0 + β1AGEi + β2GERMi + β3FRANCEi + β4AGE_GERMi + β5AGE_FRANCEi + ei

The results from fitting this model are given in Table 3.13.

Table 3.13. Regression of hedonism on age and country, pooled analysis with interaction effects

                Coeff.    S.E.    Z-ratio   p-value
Constant         0.604    0.061
Age (years)     -0.021    0.001   -17.210    <0.001
Country
  Germany       -0.007    0.078    -0.085     0.932
  France         0.386    0.090     4.277    <0.001
Age × Germany    0.005    0.002     3.207     0.001
Age × France     0.001    0.002     0.570     0.569

From Table 3.13, the estimated slope of age is:

-0.021 in the UK (the reference category)
-0.021 + 0.005 = -0.016 in Germany
-0.021 + 0.001 = -0.020 in France

It therefore appears that the negative effect of age on hedonism is weaker in Germany than in the UK or France. The coefficient of the AGE_GERM term has a Z-ratio of 3.207, so the differential age effect for Germany is significant at the 0.1% level.

The individual Z-ratios for each interaction term allow us to carry out two separate tests: 1) whether the slopes for the UK and Germany are the same (H0: β4 = 0), and 2) whether the slopes for the UK and France are the same (H0: β5 = 0). To test whether all three countries have the same slope (a joint test), we need to test the null hypothesis that β4 and β5 are both (simultaneously) equal to zero. We can do this using an F-test for comparing nested models: the model in which β4 and β5 are freely estimated (the interaction model) versus the model with both β4 and β5 fixed at zero (the main effects model, i.e. without interaction terms). The p-value for this test turns out to be 0.040, so there is evidence at the 5% level that the interaction model is a significantly better fit to the data: at least one of the age-by-country interaction coefficients is non-zero. We therefore conclude that the age effect differs between countries.

Please read P3.4, which is available in online form or as part of a pdf file.

Don't forget to do the practical for this section! (see page 2 for details of how to find the practical)

Don't forget to take the online quiz for this section! (see page 2 for details of how to find the quiz questions)
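The nested-model comparison described above can be sketched as follows, on simulated data rather than the hedonism sample (the coefficients and the planted interaction are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 600
age = rng.uniform(18, 80, n)
country = rng.integers(0, 3, n)          # 0 = UK, 1 = Germany, 2 = France
germ = (country == 1).astype(float)
france = (country == 2).astype(float)
# Illustrative coefficients only; a weaker age effect planted for Germany
y = 0.6 - 0.021 * age + 0.005 * age * germ + rng.normal(0, 0.5, n)

def rss(X, yv):
    # Residual sum of squares from an OLS fit of yv on X
    beta = np.linalg.lstsq(X, yv, rcond=None)[0]
    r = yv - X @ beta
    return r @ r

ones = np.ones(n)
X0 = np.column_stack([ones, age, germ, france])       # main effects only
X1 = np.column_stack([X0, age * germ, age * france])  # + interactions
rss0, rss1 = rss(X0, y), rss(X1, y)

q = 2                 # number of restrictions (beta4 = beta5 = 0)
p = X1.shape[1]       # parameters in the larger model
F = ((rss0 - rss1) / q) / (rss1 / (n - p))
print(rss1 <= rss0)   # True: the larger model always fits at least as well
```

The reported p-value of 0.040 comes from referring F to an F distribution with q and n - p degrees of freedom; that final step is omitted here to keep the sketch dependency-free.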


C3.5 Checking Model Assumptions in Multiple Regression

The assumptions of a multiple regression model are the same as those for a simple regression model (see C3.1.6), i.e. i) the residuals ei are normally distributed, ii) the variance of the residuals is the same for each value of X (or combination of values for different X variables), and iii) the residuals are independent. We can check assumptions i) and ii) by looking at various plots of the standardised residuals. The same plots can be used to check for outliers, and their influence on the regression results can be assessed by looking at the distribution of the Cook's D statistic.

C3.5.1 Checking the normality assumption

For simple regression, we checked the normality assumption using two plots of the standardised residuals: a histogram and a normal probability plot. The same plots are used in multiple regression. Figure 3.12 and Figure 3.13 show the histogram and normal probability plot of residuals from a multiple regression model of hedonism that includes age, education, gender and country effects. The histogram shows a symmetric bell-shaped distribution and the normal plot shows a straight line, suggesting that the normal distribution assumption is reasonable.

C3.5.2

For simple regression, we check that the variance of the residuals is fairly constant across the range of X in a plot of the standardised residuals against the explanatory variable X. In multiple regression, it is useful to start with a plot of the standardised residuals ri against the predicted values ŷi because, for any individual, the predicted value of y is a linear function of their values on all X variables in the model. This should be followed by an examination of pairwise plots of the standardised residuals against each explanatory variable X in turn. For each plot we are looking for indications of "funnelling", where the vertical scatter of the residuals is different for different values of xi or ŷi, in which case the assumption of homoskedasticity is not met.

A common reason for funnelling (or heteroskedasticity) is the existence of groups

in the data among which the relationship between Y and one or more X differs, i.e.

unmodelled interaction effects. To illustrate the idea of funnelling, suppose that

the relationship between Y and a continuous variable X1 is different for two

subgroups defined by a binary variable X2: the relationship between Y and X1 is

positive for both groups, but stronger for X2=0 than for X2=1. The predicted

regression lines from a multiple regression of Y on X1, X2 and their interaction X1*X2

are shown in Figure 3.14.
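The funnelling that this setup produces can be simulated directly. A sketch with invented numbers (two groups whose lines converge as X1 increases, so the between-group gap, and hence the residual spread around a single pooled line, shrinks with X1):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x1 = rng.uniform(-5, 5, n)
x2 = rng.integers(0, 2, n)
# Positive slope in both groups, but steeper for x2=0, so the group lines
# converge at large x1 (the gap is largest at small x1)
y = np.where(x2 == 0, -2 + 1.0 * x1, 2 + 0.2 * x1) + rng.normal(0, 0.3, n)

# Misspecified model: simple regression of y on x1, ignoring x2
X = np.column_stack([np.ones(n), x1])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Funnelling: residual spread is larger in the lower half of the x1 range
low, high = resid[x1 < 0], resid[x1 >= 0]
print(low.std() > high.std())   # True
```

Adding x2 and the x1*x2 interaction to the model removes the pattern, because each group then gets its own line.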


Figure 3.14. Prediction lines from a multiple regression with an interaction effect
[Figure: predicted lines for the x2=0 and x2=1 groups plotted against x1]

Now suppose we mistakenly fit a simple regression of Y on X1, so we ignore the fact that there are two groups with different relationships between Y and X1. Figure 3.15 shows the residual plot for this misspecified model [17]. (The data points for the groups defined by X2 are distinguished, but remember that X2 is not included in the model.) The plot shows evidence of heteroskedasticity because the vertical spread of the residuals gets smaller as X1 increases; this is an example of what we mean by "funnelling". Why has this happened? Instead of fitting two regression lines with different intercepts and slopes for each group, we have fitted a single "average" line which would lie somewhere in between the lines in Figure 3.14 [18]. At small values of X1, where we have the largest difference in the predicted value of Y for the two groups, the residuals about this line are large and positive for X2=1 and large and negative for X2=0. The difference between groups becomes smaller as X1 increases, so the average line will lie close to the individual group lines and the residuals are smaller.

[17] The plot of ri against the predicted values ŷi would look exactly the same because in simple regression ŷi is a linear function of x1i.
[18] We would expect this average line to lie closer to the line for the largest group.

Figure 3.15. Plot of ri versus X1 from fitting a misspecified regression without X2 or its interaction with X1
[Figure: standardised residuals against x1, with the vertical spread narrowing as x1 increases]

Returning to the hedonism data, Figure 3.16 shows a plot of ri versus standardised ŷi from the model with age, education, gender and country included as explanatory variables. The vertical spread of the points appears fairly equal across different values of standardised ŷi, so we conclude that the assumption of homoskedasticity is reasonable.

C3.5.3 Outliers

We can also check for outliers using any of the residual plots. An outlier is a point with a particularly large residual. We would expect approximately 95% of the residuals to lie between -2 and +2.

Of major interest, however, is whether an outlier has undue influence on our results. An influence statistic called Cook's D (where D is for "distance") measures how different our estimated regression coefficients would have been if a sample observation were omitted. Cook's D is calculated for every observation. The higher the value of D, the more likely it is that an observation exerts influence on the estimates of the coefficients. However, D does not have a fixed range and so we focus on those values of D which are considerably greater than, say, the 90th percentile.
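Cook's D as just defined can be computed from scratch by refitting the model once with each observation omitted and measuring how much all the fitted values change. A brute-force sketch on simulated data (statistical packages use an equivalent leverage-based formula instead of n refits); one gross outlier is planted and should receive the largest D:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 0.4, n)
y[0] += 8.0                      # plant one clearly influential outlier

X = np.column_stack([np.ones(n), x])
p = X.shape[1]                   # number of estimated coefficients

def fit(Xm, ym):
    return np.linalg.lstsq(Xm, ym, rcond=None)[0]

beta = fit(X, y)
yhat = X @ beta
s2 = ((y - yhat) @ (y - yhat)) / (n - p)   # residual variance of full fit

D = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = fit(X[keep], y[keep])
    # Cook's D: scaled squared change in all fitted values when case i is dropped
    D[i] = ((yhat - X @ beta_i) ** 2).sum() / (p * s2)

print(int(D.argmax()))   # index of the most influential case
```

Observations whose D stands far above the rest of the distribution, as in the boxplot discussed below, are the ones worth refitting without.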

For a regression of hedonism on age, education, sex and country, we find that the 90th percentile of the distribution of Cook's D is 0.000046. A boxplot of Cook's D is given in Figure 3.17. Two observations have relatively large values of D: case numbers 3225 and 2948. However, removing these observations from the analysis has negligible impact on our results (see Table 3.14).

Table 3.14. Impact of omitting outliers on estimated coefficients and Z-ratios

                     Full sample            Omitting cases 3225 and 2948
                     Coeff.    Z-ratio      Coeff.    Z-ratio
Constant              0.790                  0.789
Age (years)          -0.019    -27.611      -0.019    -27.537
Education (years)    -0.015     -4.281      -0.015     -4.350
Female               -0.160     -6.752      -0.160     -6.774
Country
  Germany             0.222      8.068       0.222      8.090
  France              0.436     13.145       0.441     13.322

Don't forget to do the practical for this section! (see page 2 for details of how to find the practical)

Please read P3.5, which is available in online form or as part of a pdf file.

Don't forget to take the online quizzes for this module if you haven't already done so! (see page 2 for details of how to find the quizzes)


