Quantitative Business Research Methods
Rob J Hyndman
© Rob J Hyndman, 2008.
Email: Rob.Hyndman@buseco.monash.edu.au
Telephone: (03) 9905 2358
www.robhyndman.info
Contents

Preface

1 Research design
1.1 Statistics in research
1.2 Organizing a quantitative research study
1.3 Some quantitative research designs
1.4 Data structure
1.5 The survey process
Appendix A: Case studies

2 Data collection
2.1 Introduction
2.2 Data collecting instruments
2.3 Errors in statistical data
2.4 Questionnaire design
2.5 Data processing
2.6 Sampling schemes
2.7 Scale development
Appendix B: Case studies

3 Data summary
3.1 Summarising categorical data
3.2 Summarising numerical data
3.3 Summarising two numerical variables
3.4 Measures of reliability
3.5 Normal distribution

5 Significance
5.1 Proportions
5.2 Numerical differences

13 Readings
Subject convenor
Objectives
On completion of this unit, students will have:
• gained the necessary quantitative skills to conduct high quality independent research related to business administration;
• obtained a comprehensive grounding in a number of quantitative methods of data production and analysis;
• been introduced to quantitative data analysis through a practical research activity.
Synopsis
This unit considers the quantitative research methods used in studying business, management
and organizational analysis. Topics to be covered:
1. research design including experimental designs, observational studies, case studies, longitudinal analysis and cross-sectional analysis;
2. data collection including designing data collection instruments, sampling strategies and
assessing the appropriateness of archival data for a research purpose;
3. data analysis including graphical and numerical techniques for the exploration of large data sets and a survey of advanced statistical methods for modelling the relationships between variables;
4. communication of quantitative research; and
5. the use of statistical software packages such as SPSS in research.
The effective use of several quantitative research methods will be illustrated through reading
research papers drawn from several disciplines.
References
None of these are required texts—they provide useful background material if you want to read
further. Huck (2007) is excellent on interpreting statistical results in academic papers. Pallant
(2007) is very helpful when using SPSS and in giving advice on how to write up research results.
Use Wild and Seber (2000) if you need to brush up on your basic statistics; it contains lots of
helpful advice and interesting examples.
1. HUCK, S.W. (2007) Reading statistics and research, 5th ed., Allyn & Bacon: Boston, MA.
2. PALLANT, J. (2007) SPSS survival manual, 3rd ed., Allen & Unwin.
3. DE VAUS, D. (2002) Analyzing social science data. SAGE Publications: London.
4. WILD, C.J., & SEBER, G.A.F. (2000) Chance encounters: a first course in data analysis and inference. John Wiley & Sons: New York.
Timetable
17 July Introduction / Chapter 1
24 July Chapter 2
31 July Chapter 3
7 August Chapter 4; SPSS tutorial
14 August Chapter 5
21 August Chapter 6
28 August Chapter 7; SPSS tutorial
4 September Chapters 8–9; SPSS tutorial
11 September Chapter 10
18 September Chapters 11–12; First assignment due
25 September No class
2 October No class
9 October SPSS tutorial
16 October Oral presentations; Second assignment due
Assessment
1. A written report presenting and critiquing a research paper which uses quantitative research methods. 45%
• It can be a published research paper from a scholarly journal, or a company report.
It must contain substantial quantitative research. It must be approved in advance.
• Your report should include comments on the research questions addressed, the appropriateness of the data used, how the data were collected, the method of analysis chosen, and the conclusions drawn.
• Length: 4000–5000 words excluding tables and graphs.
• Due: 17 September
2. A written report presenting some original quantitative analysis of a suitable multivariate
data set. 45%
• You may use your own data, or use data that I will provide. The data set must
include at least four variables. It can be data from your workplace.
• Your report should include comments on the research questions addressed, the appropriateness of the data used, how the data were collected, the method of analysis chosen, and the conclusions drawn.
• You may use any statistical computing package or Excel for analysis.
• Length: 4000–5000 words excluding tables and graphs.
• Due: 15 October
3. A 20 minute oral presentation of one of the above reports. 10%.
• On either 8 or 15 October.
Choose something you are interested in. For example, it can be an article you are reading as
part of your other DBA studies or something you have read as part of your professional life.
The following journals contain some articles that would be suitable; there are also many others. You can obtain online copies of some of these via the Monash Voyager Catalogue. Hard copies should be in the Monash library.
All papers should be approved by Rob Hyndman before you begin work on the assignment.
• Choose something you know about. The best data analyses involve a mix of good knowl-
edge of the data context as well as good use of statistical methodology.
• Don’t try to do too much. One response variable with 3–5 explanatory variables is usually
sufficient. Resist the temptation to write a long treatise!
• You will find it easier if the response variable is numeric. Analysing categorical response
variables with several explanatory variables can be tricky.
• Be clear about the purpose of your analysis. State some explicit objectives or hypotheses,
and address them via your statistical analysis.
• Think about what you include. A few well-chosen graphics that tell a story is better than
pages of computer output that mean very little.
• Start early. Even before we cover much methodology, you can do some basic data summaries and think about the key questions you want to address.
• All data sets should be approved by Rob Hyndman before you begin work on the assignment.
Readings
Most weeks we will read a case study from a research journal and discuss the analysis. Please
read these in advance. We will discuss them in the third hour. You cannot use a paper we
have discussed for your first assessment task. If you have a suggestion of a paper that may be
suitable for class discussion, please let me know.
A definition
Statistical Analysis: Mysterious, sometimes bizarre, manipulations performed upon the col-
lected data of an experiment in order to obscure the fact that the results have no generalizable
meaning for humanity. Commonly, computers are used, lending an additional aura of unreality
to the proceedings.
(Source unknown)
Part 1. Research design
[Figure: scatterplot of the number of O-rings damaged against launch temperature, 55–80°F.]
Charlie’s chooks
[Figure: scatterplot of Y, percentage mortality (4–14%), against an explanatory variable ranging from 0 to 100.]
A doctor wants to investigate who is most at risk for coronary-related deaths. He selects 12
patients at random from his clinic and records their age, blood pressure and drug used. He
also records whether they eventually died from heart disease or not.
There is a strong positive correlation between smoking and lung cancer. There are several
possible explanations.
Postnatal care
Mothers who return home from hospital soon after birth do better than those who stay in
hospital longer.
University applicants
• The average tax rate has increased over time even though the rate in every income category has decreased. Why?
• The average salary of female B.Sc. graduates is lower than the average salary of male B.Sc. graduates. Why?
Causality or association?
• Distinguish between: causation & association, prediction & causation, prediction & explanation.
• Note the difference between deterministic and probabilistic causation.
These questions are broken down in more detail below. (These are mostly taken from Rubin et
al. (1990), and have also appeared in Balnaves and Caputi (2001).)
1.2.1 Hypothesis
1.2.3 Method
• What methods or techniques will be used to collect the data? (This holds for applied and
non-applied research)
• What procedures will be used to apply the methods or techniques?
• What are the limitations of these methods?
• What factors will affect the study’s internal and external validity?
• Will any ethical principles be jeopardized?
1.2.4 Sample
• Who (what) will provide (constitute) the data for the research?
• What is the population being studied?
• Who will be the participants for the research?
• What sampling technique will be used?
• What materials and information are necessary to conduct the research?
• How will they be obtained?
• What special problems can be anticipated in acquiring needed materials and information?
• What are the limitations in the availability and reporting of materials and information?
1.2.6 Communication
• involves intense involvement with a few cases rather than limited involvement with
many cases
• can’t generalize results easily
• useful in exploring ideas and generating hypotheses
Hypotheses:
1. Women believe they are better at managing than men.
2. Children who listen to poetry in early childhood make better progress in learning to read than those who do not.
3. A business will run more efficiently if no person is directly responsible for more
than five other people.
4. There are inherent advantages in businesses staying small.
5. Employees with postgraduate qualifications have shorter job expectancy than
employees without postgraduate qualifications.
A population is the entire collection of ‘things’ in which we are interested. A sample is a subset of
a population. We wish to make an inference about a population of interest based on information
obtained from a sample from that population.
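As a small illustration of this sample-to-population idea, here is a sketch in Python. The population of weekly incomes is entirely made up for the example; only the logic of sampling and estimating matters.

```python
import random

random.seed(1)

# Hypothetical population: weekly incomes (dollars) of 10,000 employees.
population = [random.gauss(1000, 200) for _ in range(10_000)]

# Draw a simple random sample of 100 cases, without replacement.
sample = random.sample(population, 100)

# Use the sample mean to make an inference about the population mean.
sample_mean = sum(sample) / len(sample)
population_mean = sum(population) / len(population)

print(f"sample mean:     {sample_mean:.1f}")
print(f"population mean: {population_mean:.1f}")
```

In practice the population values are unknown, of course; the point of sampling theory is to quantify how far the sample mean is likely to be from the population mean we cannot see.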
EXAMPLES:
• A case is the unit about which you are taking measurements. E.g., a person, a business.
• A variable is a measurement taken on each case.
E.g., age, score on test, grade-level, income.
The ways of organizing, displaying and analysing data depend on the type of data we are investigating.
Note that we sometimes treat numerical data as categories (e.g., three age groups).
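For instance, recoding a numerical age variable into three age groups might be sketched as follows (the cut-points are arbitrary, chosen only for illustration):

```python
def age_group(age: int) -> str:
    """Recode a numerical age into one of three categories."""
    if age < 30:
        return "under 30"
    elif age < 50:
        return "30-49"
    else:
        return "50 and over"

# Hypothetical ages of six respondents.
ages = [23, 35, 61, 47, 29, 52]
groups = [age_group(a) for a in ages]
print(groups)
# ['under 30', '30-49', '50 and over', '30-49', 'under 30', '50 and over']
```

Once recoded, the variable is treated as categorical: counts and percentages per group rather than means.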
Response variable: measures the outcome of a study; also called the dependent variable. The variables used to explain or predict the response are called explanatory (or independent) variables.
We wish to do a statistical comparison of the injury management pilots with the current standard injury management arrangements.
Performance measures
• age
• gender
• injury type
• agency (e.g., powered tools)
• severity of injury
• medical interventions
• employer size
• insuring agency
• weekly pay at time of injury
• industry (ANZSIC code)
• occupation (ASCO code)
Ideally, we would use a randomized controlled trial. This eliminates the effect of driving variables.
We have to use pseudo-control groups and eliminate differences between the control and IMP
groups using statistical models.
• All injuries within the specified industry group, geographical region or insurer will be
subject to the new IMP during 2001.
• The pseudo-controls will be the equivalent groups of employees in 2000 who are not
subject to the new IMP.
Problem of confounding
• If there are differences between the IMP and the control, is it due to the different IM
program or the different group?
Solution:
Comparisons undertaken
Instead, we compare the change between 2000 and 2001 in each industry group and each geographical region.
• If all 2001 groups are different from the 2000 groups after taking into account all drivers,
then it is likely there are changes between years not reflected in the drivers. We won’t be
able to attribute any changes to the IMP.
• If all IMP 2001 groups are different from the 2000 groups after taking into account all drivers,
but the non-IMP 2001 groups are not different from the 2000 groups, then it is likely the
changes between years are due to the IMP.
Needlestick injuries
You are interested in the number and severity of needlestick injuries amongst health workers involved in blood donation and transfusion. Work in groups of three to carefully define the objectives of your survey. You will need to specify
A few years ago, I helped the Health Department with a survey on palliative care. As part
of the study, it was necessary to study the ‘referral’ pattern for palliative care providers: how
many patients they send to hospital (for inpatient or outpatient treatment); how many they
refer to consultants for specialist comment; how many to community health programs; and so
on.
1. sample a group of palliative care practitioners and study their referral patterns;
2. sample a group of palliative care patients and study their referral patterns.
2.1 Introduction
“You don’t have to eat the whole ox to know that the meat is tough.”
Samuel Johnson
Sampling is very familiar to all of us, because we often reach conclusions about phenomena
on the basis of a sample of such phenomena. You may test a swimming pool’s temperature by
dipping your toe in the water or the performance of a new vehicle by a short test drive. These
are among the countless small samples that we rely on when making personal decisions. We
tend to use haphazard methods in picking our sample and risk substantial sampling error.
Research also usually reaches its conclusions on the basis of sampling, but the methods used must adhere to certain rules, which will be discussed below. The goal in obtaining data through survey sampling is to use a sample to make precise inferences about the target population. We want to be highly confident about our inferences. It is important to have a substantial grasp of sampling theory in order to appraise the reliability and validity of the conclusions drawn from the sample taken.
The choice of data collection instrument is crucial to the success of the survey. When determining an appropriate data collection method, many factors need to be taken into account, including the complexity or sensitivity of the topic, the response rate required, the time and money available for the survey, and the population that is to be targeted. Some of the most common data collection methods are described in the following sections.
Part 2. Data collection
Interviewer enumerated surveys involve a trained interviewer going to the potential respondent, asking the questions and recording the responses.
Web surveys are increasingly popular, although care must be taken to avoid sample selection
bias and multiple responses from an individual.
Self-enumeration mail surveys are where the questionnaire is left with the respondent to complete.
• cheaper to administer
• more private and confidential
• in some cases does not require interviewers
A telephone survey is the process where a potential respondent is phoned and asked the survey
questions over the phone.
2.2.5 Diaries
Diaries can be used as a format for a survey. In these surveys respondents are directed to record
the required information over a predetermined period in the diary, book or booklet supplied.
                                                Face-to-face   Telephone      Mail
Representative samples
  Avoidance or refusal bias                     Good           Good           Poor
  Control over who completes the questionnaire Good           Good           Satisfactory
  Gaining access to the selected person        Satisfactory   Good           Good
  Locating the selected person                 Satisfactory   Good           Good
Quality of answers
  Minimize socially desirable responses        Poor           Satisfactory   Good
  Ability to avoid distortion due to:
    interviewer characteristics                Poor           Satisfactory   Good
    interviewer opinions                       Satisfactory   Satisfactory   Good
    influence of other people                  Satisfactory   Good           Poor
  Allows opportunities to consult              Satisfactory   Poor           Good
  Avoids subversion                            Poor           Satisfactory   Good

Table 2.1: Advantages and disadvantages of three methods of data collection (face-to-face interview, telephone and mail). Table taken from de Vaus (2001), who adapted it from Dillman (1978).
1. Provide a reward.
2. Follow up systematically.
3. Keep it short.
4. Choose an interesting topic.
Rather than collecting your own data, you may use some existing data. If you do, keep the
following points in mind.
Available information Is there sufficient documentation of the original research proposal for
which the data were collected? If not, there may be hidden problems in re-using the data.
Geographical area Are the data relevant to the geographical area you are studying? e.g., what
country, city, state or other area does the archive data cover?
Time period Are the data relevant to the time period you are studying? Does your research
area cover recent events, or is it historical or does it look at changes over a specified range
of time? Most data are at least a year old before they are released to the public.
Population What population do you wish to study? This can refer to a group or groups of
people, particular events, official records, etc. In addition you should consider whether
you will look at a specific sample or subset of people, events, records, etc.
Context Does the archival data contain the information relevant to your research area?
In sample surveys there are two types of error that can occur:
• sampling error, which arises because only a part of the population is used to represent the whole population; and
• non-sampling error, which can occur at any stage of a sample survey.
Sampling error is the error we make in selecting samples that are not representative of the
population. Since it is practically impossible for a smaller segment of a population to be exactly
representative of the population, some degree of sampling error will be present whenever we
select a sample. It is important to consider sampling error when publishing survey results as
it gives an indication of the accuracy of the estimate and therefore reflects the importance that
can be placed on interpretations.
If sampling principles are carefully applied within the constraints of available resources, sam-
pling error can be accurately measured and kept to a minimum. Sampling error is affected
by:
• sample size
• variability within the population
• sampling scheme
Generally larger sample sizes decrease sampling error. To halve the sampling error the sample
size has to be increased fourfold. In fact, sampling error can be completely eliminated by
increasing the sample size to include every element in the population.
The population variability also affects the error: more variable populations give rise to larger errors, since estimates calculated from different samples are more likely to differ greatly from one sample to the next. The effect of the variability within the population can be reduced by increasing the sample size to make the sample more representative of the target population.
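Both effects follow directly from the standard error of a sample mean, σ/√n: quadrupling the sample size halves the error, while doubling the population standard deviation doubles it. A tiny sketch with purely illustrative numbers:

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Quadrupling the sample size halves the sampling error...
se_100 = standard_error(sigma=20.0, n=100)   # 20 / 10 = 2.0
se_400 = standard_error(sigma=20.0, n=400)   # 20 / 20 = 1.0

# ...while a more variable population gives a larger error.
se_var = standard_error(sigma=40.0, n=100)   # 40 / 10 = 4.0

print(se_100, se_400, se_var)
```

The same 1/√n behaviour holds (approximately) for proportions and other common estimators, which is why "four times the sample for half the error" is a useful rule of thumb when budgeting a survey.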
Non-sampling error can be defined as those errors in a survey that are not sampling errors: any error not caused by the fact that we have only selected part of the population in the survey. Even if we were to undertake a complete enumeration of the population, non-sampling errors might remain. In fact, as the size of the sample increases, the non-sampling errors may get larger, because of such factors as a possible increase in non-response, interviewer errors, and data processing errors.
For the most part we cannot measure the effect that non-sampling errors will have on the results. Because of their nature, these errors may not be totally eliminated. Perhaps the biggest source of non-sampling error is a poorly designed questionnaire. The questionnaire can influence the response rate achieved in the survey, the quality of responses obtained and consequently the conclusions drawn from survey results.
Some common sources of non-sampling error are discussed in the following paragraphs.
Target population
Failure to identify clearly who is to be surveyed. This can result in an inadequate sampling frame, imprecise definitions of concepts, and poor coverage rules.
Non-response
A non-response error occurs when the respondents do not reflect the sampling frame. This happens when the people who do not respond to the survey differ from the people who do respond. It is common in voluntary response polls. For example, suppose that in an air bag study we asked respondents to call a 0018 number to be interviewed. Because a 0018 call costs $2 per minute, many drivers may not respond. Furthermore, those who do respond may be the people who have had bad experiences with air bags. Thus the final sample of respondents may not even represent the sampling frame.
For example,
• telephone polls miss those people without phones
• household surveys miss homeless, prisoners, students in colleges, etc.
• train surveys only target public transport users, and tend to over-represent regular public transport users.
In 1991 it was claimed that data showed that right-handed persons live on average almost a decade longer than left-handed or ambidextrous persons. The investigators had compared the mean ages at death of people recorded as left, right or mixed handed.
• What is the problem?
The questionnaire
Poorly designed questionnaires with mistakes in wording, content or layout may make it difficult to record accurate answers. The most effective methods of designing a questionnaire are discussed in Section 2.4. Following these principles will help reduce the non-sampling error associated with the questionnaire.
Interviewers
If an interviewer is used to administer the survey, their work has the potential to produce non-sampling error. This can be due to the personal characteristics of the interviewer: for example, an elderly person will often be more comfortable giving information to a female interviewer. The interviewer’s own opinions may also influence the respondent’s answers.
In 1968, one year after a major racial disturbance in Detroit, a sample of black residents was asked:
Do you personally feel that you can trust most white people, some white people,
or none at all?
Of those interviewed by whites, 35% answered “Most”, while only 7% of those interviewed by blacks gave this answer. Many questions were asked in this study. Only on some topics, particularly black–white trust or hostility, did the race of the interviewer have a strong effect on the answers given. The interviewer was a large source of non-sampling error in this study.
Respondents
Respondents can also be a source of non-sampling error. They may refuse to answer questions, or provide inaccurate information to protect themselves. They may have memory lapses and/or lack the motivation to answer the questionnaire, particularly if the questionnaire is lengthy, overly complicated or of a sensitive nature. Respondent fatigue is a very important factor.
Social desirability bias refers to the effect where respondents will provide answers which
they think are more acceptable, or which they think the interviewer wants to hear. For
example, respondents may state that they have a higher income than is actually the case
if they feel this will increase their status.
Respondents may refuse to answer a question which they find embarrassing, or choose a response which prevents them from continuing with the questions. For example, if asked the question “Are you taking oral contraceptive pills for any reason?”, and knowing that if they respond “Yes” they will be asked for more details, respondents who are embarrassed by the question are likely to answer “No”, even if this is incorrect.
Fatigue can be a problem in surveys which require a high level of commitment from respondents. The level of accuracy and detail supplied may decrease as respondents become tired of recording all the information. Sometimes interviewer fatigue can also be a problem, particularly when the interviewers have a large number of interviews to conduct.
In 1987, Shere Hite published a best-selling book called Women and Love. The author distributed
100,000 questionnaires through various women’s groups, asking questions about love, sex, and
relations between women and men. She based her book on the 4.5% of questionnaires that were
returned.
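Note that the problem here is not the sample size: about 4,500 questionnaires came back, so the sampling error alone would be tiny. A back-of-the-envelope calculation (assuming simple random sampling and a 50% proportion, purely for illustration) shows why the low response rate, not the number of responses, is the issue:

```python
import math

mailed = 100_000
returned = round(mailed * 0.045)   # about 4,500 questionnaires returned

# 95% margin of error for an estimated proportion of 0.5 with n = 4,500,
# as if the respondents were a simple random sample:
p = 0.5
moe = 1.96 * math.sqrt(p * (1 - p) / returned)
print(f"{moe:.3f}")   # roughly 0.015, i.e. +/- 1.5 percentage points

# But 95.5% of those sampled never answered, and non-respondents may
# differ systematically from respondents -- a non-sampling error that
# no margin-of-error formula captures.
```

A seemingly precise estimate from a self-selected 4.5% of the sample can therefore be badly biased, however small its nominal margin of error.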
Exercise 1: In Case 2, it was necessary to study the ‘referral’ pattern for palliative care providers: how many patients they send to hospital (for inpatient or outpatient treatment); how many they refer to consultants for specialist comment; how many to community health programs; and so on. Two alternative sampling schemes are available: sample a group of palliative care practitioners and study their referral patterns; or sample a group of palliative care patients and study their referral patterns. Discuss the possible advantages and disadvantages of the two schemes.
2.4.1 Introduction
The purpose of a questionnaire is to obtain specific information with tolerable accuracy and
completeness. Before the questionnaire is designed, the collection objectives should be defined.
These include:
Careful consideration should be given to the content, wording and format of the questionnaire
as one of the largest sources of non-sampling error is poor questionnaire design. This error can
be minimized by considering the objectives of the survey and the required output, and then
devising a list of questions that will accurately obtain the information required.
Relevant questions
It is important to ask only questions that are directly related to the objectives of a survey, as a means of minimizing the burden placed on respondents. The concept of a fatigue point, which occurs when respondents can no longer be bothered answering questions, should be recognized, and questions designed so that the respondent finishes the form before this point is reached.
Towards the end of long questionnaires, respondents may give less thought to their answers
and concentrate less on the instructions and questions, thereby decreasing the accuracy of in-
formation they provide. Very long questionnaires can also lead the respondent to refuse to
complete the questionnaire. Hence it is necessary to ensure only relevant questions are asked.
Reliable questions
It is important to include questions in a questionnaire that can be easily answered. This objective can be achieved by adhering to the following techniques.
Appropriate recall If information is requested by recall, the events should be sufficiently recent or familiar to respondents. People tend to remember what they should have done, have selective memories, and ‘telescope’ into the reference period events which actually occurred outside it. Minimizing the need for recall improves the accuracy of response.
Common reference periods To make it easier for the respondent to answer, use reference periods
which match those of the respondent’s records.
Results justify efforts The effort a respondent must go to in order to obtain the data should be worthwhile. It is reasonable to accept a respondent’s estimate when calculating the exact figures would make little difference to the outcome.
Filtering Respondents should not be asked questions they cannot answer. Filter questions should be asked to skip respondents past questions that do not apply to them.
Factual questions
These questions seek facts rather than opinions. For example, respondents could be asked about behaviour patterns (e.g., When did you last visit a General Practitioner?).
Opinion questions
Rather than facts, these questions seek opinions. There are many problems associated with opinion questions:
• a respondent may not have an opinion/attitude towards the subject so the response
may be provided without much thought;
• opinion questions are very sensitive to changes in wording;
• it is impossible to check the validity of responses to opinion questions.
Hypothetical questions
The “What would you do if . . . ?” type of question. The problems with these questions are similar to those of opinion questions. You can never be certain how valid any answer to a hypothetical question is likely to be.
Questions can generally be classified as one of two types, open or closed, depending on the
amount of freedom allowed in answering the question. When deciding which type of question
to use, consideration should be given to the kind of information sought, ease of processing the
response, and the availability of the resources of time, money, and personnel.
Open questions
Open questions allow the respondents to answer the question in their own words. These questions allow any possible answer, and they can collect exact values from a wide range of possible values. Hence, open questions are used when the list of responses is long and not obvious.
The major disadvantage of open questions is that they are far more demanding than closed questions, both to answer and to process. These questions are most commonly used where a wide range of responses is expected. Also, the answers to these questions depend on the respondents’ ability to write or speak, as much as on their knowledge. Two respondents might have the same knowledge and opinions, but their answers may seem different because of their varying abilities.
Question: I believe Japanese cars are less reliable than European cars.
Format: Likert scale (opinion) question
    Strongly agree   Agree   No opinion   Disagree   Strongly disagree
          1            2          3           4              5
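For analysis, responses to such an item are usually recoded as numbers. A minimal sketch, with the codes taken from the example above and the respondent data invented:

```python
# Numeric codes for a five-point Likert item, as in the example above.
LIKERT_CODES = {
    "Strongly agree": 1,
    "Agree": 2,
    "No opinion": 3,
    "Disagree": 4,
    "Strongly disagree": 5,
}

# Hypothetical responses from six respondents.
responses = ["Agree", "Strongly agree", "No opinion",
             "Disagree", "Agree", "Strongly disagree"]

codes = [LIKERT_CODES[r] for r in responses]
mean_score = sum(codes) / len(codes)
print(codes)          # [2, 1, 3, 4, 2, 5]
```

Note that averaging the codes treats an ordinal scale as if it were numeric; whether that is appropriate depends on the analysis.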
Closed questions
Closed questions ask the respondents to choose an answer from the alternatives provided.
These questions should be used when the full range of responses is known. Closed questions
are far easier to process than open questions. The main disadvantage of closed questions is the
reasons behind a particular selection cannot be determined.
• Limited choice questions require the respondent to choose one of two mutually exclusive
answers. For example yes/no.
• Multiple choice questions require the respondent to choose from a number of responses
provided.
• Checklist questions allow a respondent to choose more than one of the responses pro-
vided.
• Partially closed questions provide a list of alternatives where the last alternative is “Other,
please specify”. These questions are useful when it is difficult to list all possible choices.
• Opinion (Likert) scale An opinion scale question seeks to locate a respondent’s opinion on a rating scale with a limited number of points. For example, a five point scale, which measures both strong and weak attitudes, would ask the respondent whether they strongly agree/agree/are neutral/disagree/strongly disagree with a particular statement of opinion, whereas a three point scale would only measure whether they agree, are neutral or disagree. Opinion scales of this sort are called Likert scales.
Five point scales are best because:
–
–
–
Response Categories
When questions have categories provided, it is important that every response is catered for.
Number of Categories
The quality of the data can be affected if there are too few categories, as the respondent may have difficulty finding one which accurately describes their situation. If there are too many categories, the respondent may have equal difficulty deciding which of several similar options applies.
Don’t Know A ‘Don’t Know’ category can be included so that respondents are not forced to express decisions or attitudes that they would not normally hold. Excluding the option is not usually good; however, it is hard to predict the effect of including it. The decision of whether or not to include a ‘Don’t Know’ option depends, to a large extent, on the subject matter.
I was gratified to be able to answer promptly, and I did. I said I didn’t know.
Mark Twain, Life on the Mississippi
Language
Questions which employ complex or technical language or jargon can confuse or irritate respondents. Respondents who do not understand the question may be unwilling to appear ignorant by asking the interviewer to explain it; if an interviewer is not present, they may not answer, or may answer incorrectly.
Ambiguity
If ambiguous words or phrases are included in a question, the meaning may be interpreted differently by different people. This will introduce errors in the data since different respondents will effectively be answering different questions.
For example, “Why did you fly to New Zealand on Qantas airlines?”. Most might interpret this question as was intended, but it contains three possible questions, so the response might concern any of these: why they flew (rather than travelled some other way), why they went to New Zealand (rather than another destination), or why they chose Qantas (rather than another airline).
Double-barreled questions
When one question contains two concepts, it is known as a double-barreled question. For example, “How often do you go grocery shopping and do you enjoy it?”.
Each concept in the question may have a different answer, or one concept may not be relevant, so respondents may be unsure how to respond. The answers to these questions are almost impossible to interpret. Double-barreled questions should be split into two or more separate questions.
Leading questions
Questions which lead respondents to answers can introduce error. For example, the question “How many days did you work last week?”, if asked without first determining whether respondents did in fact work in the previous week, is a leading question: it implies that the person would have been at work. Respondents may answer incorrectly to avoid telling the interviewer that they were not working.
Unbalanced questions
“Are you in favour of euthanasia?” is an unbalanced question because it provides only one alternative. It can be reworded to “Do you favour or not favour euthanasia?” to give respondents more than one alternative.
Similarly, the use of a persuasive tone can affect the respondent’s answers. Wording should be
chosen carefully to avoid a tone that may produce bias in responses.
Recall/memory error
Respondents tend to remember what should have been done rather than what was done. The quality of data collected from recall questions is influenced by the importance of the event to the respondent and the length of time since the event took place. Subjects of greater interest or importance to the respondent, or events which happen infrequently, will be remembered over longer periods and more accurately. Minimizing the recall period also helps to reduce memory bias.
Telescoping is a specific type of memory error. It occurs when the respondent reports events as occurring either earlier or later than they actually occurred, so that details of an event which actually occurred outside the specified reference period are included.
Sensitive questions
Questions on topics which respondents may see as embarrassing or highly sensitive can produce inaccurate answers. If respondents are required to answer questions with information that might seem socially undesirable, they may provide the interviewer with responses they believe are more ‘acceptable’. If placed at the beginning of the questionnaire, sensitive questions could lead to non-response if respondents are unwilling to continue with the remaining questions.
For example, “Approximately how many cans of beer do you consume each week, on aver-
age?”
1. None
2. 1–3 cans
3. 4–6 cans
4. More than 6
A respondent might answer response 2 or 3 rather than admit to consuming the greatest quan-
tity on the scale. Consider extending the range of choices far beyond what is expected. The
respondent can select an answer closer to the middle and feel more in the normal range.
In 1980, the New York Times CBS News Poll asked a random sample of Americans
about abortion. When asked “Do you think there should be an amendment to the
Constitution prohibiting abortions, or should not there be such an amendment?”
29% were in favour and 62% were opposed. The rest of the sample were uncer-
tain. The same people were later asked a different question: “Do you believe there
should be an amendment to the Constitution protecting the life of the unborn child,
or should not there be such an amendment?” Now 50% were in favour and only
39% were opposed.
Acquiescence
This situation arises when there is a long series of questions for which respondents answer
with the same response category. Respondents get used to providing the same answer and
may answer inaccurately.
Including an introduction
To ensure that the questionnaire can be easily administered by interviewers or respondents, the pages of the questionnaire and the questions should be numbered consecutively with a simple numbering system. Question numbers provide sign-posts along the way: they help if remedial action is required later and you want to refer the interviewer or respondent back to a particular place.
Sequencing
The questions in a questionnaire should follow an order which is logical and flows smoothly from one question to the next. The questionnaire layout should have the following characteristics.
If possible, question ordering should anticipate the order in which respondents will supply information. It is a sign of good survey design if a question not only prompts an answer but also prompts the answer to a question that follows shortly after.
Question ordering
It is important to be aware that earlier questions can influence the responses of later ques-
tions, so the order of questions should be carefully decided. In attitudinal questions, it
is important to avoid conditioning respondents in an early question which could then
bias their responses to later questions. For example, you should ask about awareness of
a concept before any other mention of the concept.
Respondent motivation
Whenever possible, start the questionnaire with easy and pleasant questions to promote inter-
est in the survey and give the respondent confidence in their ability to complete the survey.
The opening questions should ensure that the particular respondent is a member of the survey
population.
Questions that are perceived as irritating or obtrusive tend to get a low response rate and
may effectively trigger a refusal from the respondent. These questions need to be carefully
positioned in a questionnaire where they are least likely to be sensitive.
It is also important that respondents are only asked relevant questions; they may become annoyed and uninterested if this does not occur. Include filter questions to direct respondents past questions which do not apply to them. Filter questions often identify sub-populations. For example,
Questionnaire layout
The questionnaire layout should be aesthetically pleasing, so the layout does not contribute to
respondent fatigue. Things that can interfere with the answering of a questionnaire are: unclear
instructions and questions, insufficient space to provide answers, hard-to-read text, difficulty
in understanding language, back-tracking through the form. Many of these things are bad form
design and are avoidable.
Only include essentials on the questionnaire form. Keep the amount of ink on the form to the minimum necessary for the form to work properly. Anything that is not necessary adds to respondent fatigue, to the subsequent detriment of data quality.
General layout
Consistency of layout: If consistency and logical patterns are introduced into the form design, it
eases the form filler’s task. Patterns that can be useful are:
Type Size: A font size between 10 and 12 is considered the best in most circumstances. If the
respondent does not have perfect vision, or ideal working conditions, small fonts can
cause problems.
Use of all upper-case text: It is best to avoid upper-case text, which has been shown to be hard to read, especially where large amounts of text are involved. Words lose their shape when set in upper case, becoming rectangles. Upper case should be reserved for titles or for emphasis, but this can often be done just as well using other methods, such as bold, italics, or a slightly larger type size.
Line length: As the eye has a clear focus range of only a few degrees, lines should be kept short. It takes several eye movements to scan a line of text; if more than two or three such movements are needed, the eye can become fatigued and there is a tendency to lose track of which line is being read. This leads to backtracking through the text or misinterpretation.
Character and line spacing: It is very important to leave enough space on a form for answers. It
has been shown in research that forms requiring hand written responses need a distance
of 7–8mm between lines and a 4–5mm width for each possible character.
Response layout
Obtaining responses: A popular way of obtaining responses is using tick boxes. However, it is
usually preferable to use a labelled list (e.g., a, b, c, . . . ) and ask respondents to circle their
response. This makes coding and data entry easier.
If a written response is required it is best to provide empty answer spaces, with lines
made up of dots.
Order of response options: The order of response options is important as it can be a source of bias. The options presented first may be selected because they make an impact on respondents, or because respondents lose concentration and do not hear or read the remaining options. The last options may be chosen because they are most easily recalled, particularly when respondents are faced with a long list of options. Long or complex response options may also make recall more difficult and increase the effects due to the order of response options.
Prompt card: If the questionnaire is interviewer based, and a number of response options are given for some questions, then a prompt card may be appropriate. A prompt card is a list of possible responses to a question, displayed on a separate card which the interviewer shows to assist respondents. This helps to decrease error resulting from respondents being unable to remember all the options read out. However, respondents with poor eyesight, migrants with limited English, or adults with literacy problems will have difficulty answering accurately.
Exercise 2: (Case 2) The questionnaire on pages 47–48 was an early draft of the
questionnaire prepared by the client. The questionnaire on pages 49–51 is a
later draft of the questionnaire after I had provided the client with some advice.
See if you can determine why each of the changes has been made. How could
you further improve the questionnaire?
Each type of testing is used at a different stage of survey development and aims to test different
aspects of the survey.
Skirmishing
Skirmishing is the process of informally testing questionnaire design with groups of re-
spondents. The questionnaire is basically unstructured and is tested with a group of
people who can provide feedback on issues such as each question’s frame of reference,
the level of knowledge needed to answer the questions, the range of likely answers to
questions and how answers are formulated by respondents. Skirmishing is also used to
detect flaws or awkward wording of questionnaires as well as testing alternative designs.
At this stage we may use open-ended response categories to work out likely responses.
The questionnaire should be redrafted after skirmishing.
Focus groups
A skirmish tests the questionnaire design against general respondents whilst focus groups
concentrate on a specific audience. For example, a survey studying the effects of living
on unemployment benefits could have a group of unemployed people as a focus group.
A focus group can be used to test questions directed at small sub-populations. For example, if we were looking at community services we may have a filter question to target disabled people. Since there may not be many disabled people chosen in the sample, we need to test the questions on a focus group of disabled people, which is a deliberately biased sample.
Observational studies
Respondents complete a draft questionnaire in the presence of an observer during an observational study. Whilst completing the form, the respondents explain their understanding of the questions and the method required in providing the information. These studies can identify problem questions through observation, the questions respondents ask, or the time taken to complete a particular question. Data availability and the most appropriate person to supply the information can also be gauged through observational studies. It should be stressed to respondents that the form is being tested, not them.
Pilot testing
Pilot testing involves formally testing a questionnaire or survey with a small represen-
tative sample of respondents. Semi-closed questions are usually used in pilot testing to
gather a range of likely responses which are used to develop a more highly structured
questionnaire with closed questions. Pilot testing is used to identify any problems asso-
ciated with the form, such as questionnaire format, length, question wording and allows
comparison of alternative versions of a questionnaire.
2.5 Data processing
Data processing involves translating the answers on a questionnaire into a form that can be
manipulated to produce statistics. In general, this involves coding, editing, data entry, and
monitoring the whole data processing procedure. The main aim of checking the various stages
of data processing is to produce a file of data that is as error free as possible.
Up to this point, the questionnaire has been considered mainly as a means of communication
with the respondent. Just as important, the questionnaire is a working document for the trans-
fer of data on to a computer file. Consequently it is important to design the questionnaire to
facilitate data entry.
Unless all the questions on a questionnaire are “closed” questions, some degree of coding is
required before the survey data can be sent for punching. The appropriate codes should be de-
vised before the questionnaires are processed, and are usually based on the results of pretesting.
Coding consists of labelling the responses to questions (using numerical or alphabetic codes) in order to facilitate data entry and manipulation. Codes should be simple and easy to use. For example, if Question 1 has four responses then those four responses could be given the codes a, b, c, and d. The advantage of coding is that data are stored compactly as a few-digit code rather than as lengthy alphabetical descriptions, which are much harder to categorize.
Coding is relatively expensive in terms of resource effort. However, improvements are always being sought by developing automated techniques to cover this task. Other options include the use of self coding, where respondents enter the appropriate code themselves, or where the interviewer performs the coding during the interview.
Before the interviewing begins, the coding frame for most questions can be devised. That is, the
likely responses are obvious from previous similar surveys or thorough pilot testing, allowing
those responses and relevant codes to be printed on the questionnaire. An “Other (Please
Specify)” answer code is often added to the end of a question with space for interviewers to
write the answer. The standard instruction to interviewers in doubt about any precodes is that
they should write the answers on the questionnaire in full so that they can be dealt with by a
coder later.
Ensure that the questionnaire is designed so data entry personnel have minimal handling of
pages. For example, all codes should be on the left (or right) hand side of the page. It is
advisable to use trained data entry people to enter the data. It is quicker and more reliable and
therefore more cost effective.
2.6 Sampling schemes
When you have a clear idea of the aims of the survey and the data requirements, the degree of
accuracy required, and have considered the resources and time available, you are in a position
to make a decision about the size and the form of collection of sampling units.
The two qualities most desired in a sample (besides that of providing the appropriate findings),
are its representativeness and stability. Sample units may be selected in a variety of ways. The
sampling schemes fall into two general types: probability and non-probability methods.
If the probability of selection for each unit is unknown, or cannot be calculated, the sample is
called a non-probability sample. For non-probability samples, since there is no control over rep-
resentativeness of the sample, it is not possible to accurately evaluate the precision of estimates
(i.e., closeness of estimates under repeated sampling of the same size). However, where time
and financial constraints make probability sampling infeasible, or where knowing the level of
accuracy in the results is not an important consideration, non-probability samples do have a
role to play. Non-probability samples are inexpensive, easy to run and no frame is required.
This form of sampling is popular amongst market researchers and political pollsters as a lot of
their surveys are based on a pre-determined sample of respondents of certain categories.
Probability sampling schemes are those in which the population elements have a known chance
of being selected for inclusion in a sample. Probability sampling rigorously adheres to a pre-
cisely specified system that permits no arbitrary or biased selection. There are four main types
of probability sampling schemes.
Simple Random Sample: If a sample of size n is drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected, the sampling procedure is called simple random sampling. The sample thus obtained is called a simple random sample. This is the simplest form of probability sample to analyse.
Stratified Random Sample: A stratified random sample is one obtained by separating the pop-
ulation elements into non-overlapping groups, called strata, and then selecting a simple
random sample from each stratum. This can be useful when a population is naturally
divided into several groups. If the results on each stratum vary greatly, then it is possi-
ble to obtain more efficient estimators (and therefore more precise results) than would be
possible without stratification.
Systematic Sample: A sample obtained by randomly selecting one element from the first k elements in the frame, and every kth element thereafter, is called a 1-in-k systematic sample with a random start. This is obviously a simple method if there is a list of elements in the frame. Systematic sampling will provide better results than simple random sampling when the elements within a systematic sample are more variable than the population as a whole. This can occur when the frame is ordered.
Cluster Sample: A cluster sample is a probability sample in which each sampling unit is a
collection, or cluster, of elements. The population is divided into clusters and one or
more of the clusters is chosen at random and sampled. Sometimes the entire cluster is
sampled; on other occasions a simple random sample of the chosen clusters is taken.
Cluster sampling is usually done for administrative convenience, and is especially useful
if the population has a hierarchical structure.
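The four schemes can be sketched with Python's standard `random` module. The population frame, sample sizes, strata labels and cluster boundaries below are made-up illustrations, not taken from any example in the text:

```python
import random

random.seed(1)                          # for a reproducible illustration
population = list(range(1, 101))        # a hypothetical frame of 100 unit labels

# Simple random sample: every possible sample of size n is equally likely.
srs = random.sample(population, 10)

# 1-in-k systematic sample with a random start among the first k elements.
k = 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified random sample: an SRS drawn separately within each stratum.
strata = {"metro": population[:40], "country": population[40:]}
stratified = {name: random.sample(units, 5) for name, units in strata.items()}

# Cluster sample: choose whole clusters at random, then keep every element in them.
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
cluster_sample = [x for c in random.sample(clusters, 2) for x in c]
```

Note how the cluster sample keeps entire groups while the stratified sample draws from every group; that is the essential difference between the two schemes.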
A comparison of these four sampling schemes appears in the table on the following page.
Example (Case 2): A few years ago, I advised the Department of Health and Com-
munity Services on a survey of palliative care patients in Victoria.
Objective: To estimate the proportion of palliative care patients in Vic-
torian hospitals.
Difficulties: What is a “palliative care patient”? Proportion of what?
Target population: Patients in acute beds at the time of the survey?
Survey population: All patients in acute beds in Victorian hospitals except for
very small (< 10 bed) country hospitals.
Sampling scheme: Stratified (hospital types) and clustered (hospitals). Random selection of hospitals within each stratum. Total coverage of patients in the selected hospitals.
Sample: All patients in the 18 hospitals selected out of 115 hospitals
in Victoria.
Exercise 3: Consider the four cases listed in the Appendix. What sampling scheme
was used in each case? Why were these schemes used?
2.7 Scale development
With Likert scale data, it is common to construct a new numerical variable by summing the
values of questions on a related topic (treating the answers as numerical scores from 1–5). This
forms a “measure” or “scale” for the underlying “construct”.
More sophisticated means of deriving scales are possible. One common approach is to use
Factor Analysis (discussed in Section 8.1).
2.7.1 Validity
Example:
• A study compares job-satisfaction of people over time and finds it is declining. Does that
mean poor management is leading to declining job satisfaction?
• How would you construct a valid study which enables the measurement of the effect of
management on job satisfaction?
• How do you measure workplace harmony? Is frequency of arguments a valid measure?
• Are the results of a study in your company generalizable to other companies?
• How would you construct a valid study of this issue which applies to other companies?
2.7.2 Reliability
A reliable measure is one that gives the same ‘reading’ when used on repeated occasions.
• A measure is reliable but not valid if it is consistently wrong. e.g., survey on alcohol
intake.
• A measure is valid but unreliable if it sometimes measures the thing of interest, but not
always. e.g., survey on sexual experience.
This survey was designed to estimate the number of palliative care patients in Victorian hos-
pitals. A palliative care patient was defined as a patient who was terminally ill and whose
life expectancy was less than 6 months. The Department of Health and Community Services
did not know how many patients were in this category, but a previous survey in another
state indicated the proportion might be about 12%. The hospitals in Victoria were divided
into eight groups: metropolitan teaching, metropolitan large non-teaching, metropolitan small
non-teaching, country base, large country, small country, metropolitan extended care, country
extended care. These eight hospital types included 115 Victorian public hospitals.
Within each group of hospitals, one or more were selected at random for the sample. Eighteen
hospitals in total were sampled. For each hospital surveyed, the number of palliative care pa-
tients was recorded. From this information, the proportion of hospital patients in Victoria who
could be classified as “palliative care” patients was estimated. The final estimated proportion
was about 4.5%.
This survey was conducted by a company who had designed and marketed health safety prod-
ucts including needle protectors. As part of their marketing, they were interested in the fre-
quency and severity of needlestick injuries amongst health workers. The survey was conducted
in seven Australian cities over a one week period. The sample consisted of 56 staff members of
the Red Cross Transfusion Services and 136 nursing staff in 25 Australian haemodialysis units.
All staff who worked during the survey week were included in the sample. Each filled in a
questionnaire.
The Catholic Church Life Survey is a collection of 25 separate questionnaires designed to collect
information about the opinions and characteristics of the Catholic church’s clergy and mem-
bership. Each diocese in Australia was surveyed. Within each diocese there are both urban and
rural parishes. A sample of urban parishes was surveyed and a sample of rural parishes was
surveyed within each diocese. For those parishes surveyed, a random sample consisting of 2/3
of the members who attended on the day of the survey completed the main questionnaire.
The ways of organizing, displaying and analysing data depend on the type of data
we are investigating.
• Categorical data (also called nominal or qualitative)
e.g. sex, race, type of business, postcode. Averages don’t make sense.
Ordered categories are called ordinal data.
• Numerical data (also called scale, interval and ratio)
e.g. income, test score, age, weight, temperature, time.
Averages make sense.
Note that we sometimes treat numerical data as categories (e.g. three age
groups).
Part 3. Data summary
Pie chart: shows the proportion of observations in each category by the angle of each segment; quite poor at communicating the information.
Bar chart: shows the number of observations in each category by the length of each bar; much easier to see differences.
[Figure: pie charts and bar charts showing counts in the categories neoplasms, diseases and suicide, for males and females.]
3.2.1 Percentiles
• Example: the 90th percentile is the point where 90% of data lie below that point and 10%
of data lie above that point.
• The median is the 50th percentile. It is sometimes labelled Q2.
• The median is the middle measurement when the measurements are arranged in order.
If there are an even number of measurements, it is the average of the middle two.
• The quartiles are the 25th and 75th percentiles. They are often labelled Q1 and Q3.
• The interquartile range is Q3−Q1.
3.2.3 Outliers
One definition of an outlier: Any point which is more than 1.5(IQR) above Q3 or more than 1.5(IQR)
below Q1.
Don’t delete outliers. Investigate them!
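These definitions can be checked with Python's `statistics` module. The data below are invented, and note that the exact quartile convention varies slightly between software packages:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]    # invented data with one suspicious point

median = statistics.median(data)               # 50th percentile (Q2)
q1, q2, q3 = statistics.quantiles(data, n=4)   # quartiles (package conventions differ)
iqr = q3 - q1                                  # interquartile range

# The 1.5*IQR rule: flag, don't delete, anything outside the fences.
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
```

Here the point 100 is flagged as an outlier; following the advice above, it should be investigated rather than deleted.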
Number of airline accidents for 17 Asian airlines for 1985–1994. Source: Newsday
(1995).
Accidents Airline
0 Air India (India)
0 Air Nippon (Japan)
0 All Nippon (Japan)
1 Asiana (South Korea)
0 Cathay Pacific (Hong Kong)
1 Garuda (Indonesia)
5 Indian Airlines (India)
1 Japan Airlines (Japan)
0 Japan Air System (Japan)
1 Korean Air Lines (South Korea)
0 Malaysia Airlines (Malaysia)
10 Merpati (Indonesia)
0 Air Niugini (Papua New Guinea)
3 Philippine Air Lines (Philippines)
3 PIA (Pakistan)
0 SIA (Singapore)
1 Thai Airways (Thailand)
3.2.4 Boxplots
A boxplot is a graphical representation of the five number summary. The box spans the lower quartile Q1 (25th percentile) to the upper quartile Q3 (75th percentile), with a line at the median. The whiskers extend to the minimum and maximum once outliers are omitted, and outliers are plotted as individual points.
[Figure: box plots of letter recognition scores in each age/sex group.]
3.2.5 Histograms
[Figure: histogram of scores (0–20).]
Average (mean)
The average is the sum of the measurements divided by the number of measurements. Usually
denoted by x̄.
Suppose we have n observations and let x1 denote the first observation, x2 the second, and so
on up to xn . Then the average is
Sample mean: \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i.
This is the most widely used measure of the centre of the data set, and it has good arithmetic
properties. But it does have the drawback of being influenced by extreme values (“outliers”).
Trimmed mean
A trimmed mean is the average computed after discarding a fixed proportion of the smallest and largest observations. It is less affected by outliers than the ordinary mean.
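As a sketch (with invented numbers), a 20% trimmed mean discards the smallest and largest fifth of the observations before averaging, which limits the influence of outliers:

```python
def trimmed_mean(data, proportion=0.1):
    """Mean after discarding the given proportion of observations from each end."""
    xs = sorted(data)
    cut = int(len(xs) * proportion)            # points trimmed from each tail
    trimmed = xs[cut:len(xs) - cut] if cut > 0 else xs
    return sum(trimmed) / len(trimmed)

values = [1, 2, 3, 4, 100]            # one extreme value
plain = sum(values) / len(values)     # ordinary mean, dragged up by the outlier
robust = trimmed_mean(values, 0.2)    # trims one point from each end: mean of [2, 3, 4]
```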
QUIZ
• True or False?
1. The median and the average of any data set are always close together.
2. Half of a data set is always below average.
3. With a large sample, the histogram is bound to follow the normal curve
quite closely.
• In a study of family incomes, 1000 observations range from $12,400 a year
to $132,800 a year. By accident, the highest income gets changed to
$1,328,000.
1. Does this affect the mean? If so by how much?
2. Does this affect the median? If so by how much?
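The second quiz question can be explored with a small synthetic income sample (the quiz's 1000 observations are not reproduced here). Only the mean moves when the top value is corrupted:

```python
import statistics

# Six invented incomes standing in for the quiz's 1000 observations.
incomes = [12400, 25000, 31000, 40000, 55000, 132800]
corrupted = incomes[:-1] + [1328000]      # highest income gains an extra zero

mean_shift = statistics.mean(corrupted) - statistics.mean(incomes)
# The mean rises by (1328000 - 132800)/n; the median is unchanged because
# the corrupted value was above the middle of the data both before and after.
```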
Range
The range is the difference between the maximum and minimum. However, it is not a good
measure of spread since it is generally larger when more data are collected and it is sensitive to
outliers.
Interquartile range
The interquartile range (IQR) is the difference between the upper and lower quartiles: Q3 − Q1.
Range = IQR =
The variance is based on the deviations from the mean, i.e., the difference between the individual values and the mean of those values, represented by (x_i − \bar{x}). Obviously, if these were simply added, or averaged, we would always end up with zero. Therefore, we want every deviation to make a positive contribution. A simple way to do this is to square the deviations and then average them. This is known as the variance:
s^2 = \frac{1}{n-1}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right] = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
Note that this is not quite the average of the squared differences from the mean.
The variance is in squared units, so by taking the square-root of the variance, we have a mea-
sure of dispersion that is in the same units of measurement as the original variable. This is
called the standard deviation, and is denoted by:
s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}.
[Figure: histograms of four data sets with standard deviations 1, 5, 10 and 100.]
Example (n = 30): s^2 = \frac{1}{29}\left[(0 - 4.1)^2 + (0 - 4.1)^2 + \cdots + (20 - 4.1)^2\right] = 20.02, so s = \sqrt{20.02} = 4.47.
Example (airline accidents data, n = 17): s^2 = \frac{1}{16}\left[(0 - 1.53)^2 + (0 - 1.53)^2 + \cdots + (10 - 1.53)^2\right] = 6.765, so s = \sqrt{6.765} = 2.60.
• Mean: =AVERAGE(A1:A20)
• Median: =MEDIAN(A1:A20)
3.3.1 Scatterplots
Scatterplots are good at graphically displaying the relationship between two numerical vari-
ables.
3.3.2 Correlation
The Pearson correlation coefficient is a measure of the strength of the linear relationship be-
tween two numerical variables.
It is calculated by
r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)
where sx is the sample standard deviation of the x observations and sy is the sample standard
deviation of the y observations.
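The formula translates directly into Python; the data below are invented so that the answer can be checked by eye:

```python
def pearson_r(x, y):
    """Pearson correlation: average product of standardized scores, n-1 divisor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = (sum((v - xbar) ** 2 for v in x) / (n - 1)) ** 0.5
    sy = (sum((v - ybar) ** 2 for v in y) / (n - 1)) ** 0.5
    return sum((xi - xbar) / sx * (yi - ybar) / sy for xi, yi in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]          # y is an exact linear function of x,
r = pearson_r(x, y)           # so r = 1 and r**2 = 1: all variation explained
```

In Excel, the equivalent calculation is given by the CORREL function.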
r2 : a useful interpretation
The squared correlation, r2 , is the fraction of the variation in the y values that
is explained by the linear relationship.
[Figure: four scatterplots of y against x, illustrating linear relationships of varying strength and direction.]
In many questionnaires, there are several questions that are designed to measure the same
thing (sometimes called a “construct”). The answers to the questions are often added together
to provide an overall “scale” which gives a single measure of the construct.
In these circumstances, it is useful to judge how closely the results from the questions are re-
lated to each other. This is called “internal consistency reliability”.
Internal consistency reliability involves seeing how closely the answers to these questions (or
“items”) are related.
We can look at the correlation between any pair of items which are supposed to be measuring
the construct.
The average inter-item correlation is the average of all correlations between the pairs of items.
Split-half reliability
Here we randomly divide all items that are intended to measure the construct into two sets.
The total score for each set of items is then computed for each person. The split-half reliability
is the correlation between these two total scores.
Cronbach’s alpha
Cronbach’s alpha is the average of all split-half estimates. That is, if we computed all possible
split-half reliabilities (by computing it on all possible divisions of items), and averaged the
results, we would have Cronbach’s alpha.
In practice, there is a quicker way to compute it than actually doing all these split-half estimates.
Suppose there are k items, let si be the standard deviation of the answers to the ith item and s
be the standard deviation of the totals formed by summing all the items for each person. Then
Cronbach’s alpha can be calculated as follows:
    α = (k/(k−1)) (1 − (Σ_{i=1}^{k} s_i²) / s²).
How large is good enough? Some books suggest that α > 0.7 is necessary to have a reliable
scale. I think this is an arbitrary figure, but it gives you some idea of what is expected.
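The variance formula above is easy to code. Here is a Python sketch (not part of the notes' SPSS workflow; the function name is ours), checked against the Example 1 data that follows:

```python
# Cronbach's alpha via the variance formula above.
from statistics import variance

def cronbach_alpha(items):
    """items: a list of k lists, one per question, all of the same length."""
    k = len(items)
    totals = [sum(vals) for vals in zip(*items)]       # each person's scale total
    item_var = sum(variance(col) for col in items)     # sum of the s_i^2
    return k / (k - 1) * (1 - item_var / variance(totals))

# Example 1 data (columns Q1-Q4 of the table below)
q1 = [5, 2, 2, 3, 2, 5, 2, 3, 3, 1, 1, 2, 3, 1, 3, 5, 1, 1, 1, 4]
q2 = [2, 3, 3, 3, 1, 5, 1, 2, 3, 2, 3, 2, 3, 1, 2, 5, 1, 4, 3, 5]
q3 = [2, 1, 3, 5, 2, 2, 2, 2, 5, 5, 3, 4, 5, 1, 4, 3, 1, 1, 1, 4]
q4 = [5, 1, 1, 1, 2, 5, 2, 1, 1, 2, 1, 2, 3, 3, 4, 5, 1, 2, 1, 3]
print(round(cronbach_alpha([q1, q2, q3, q4]), 3))      # 0.664, matching Example 1
```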
Example 1
Data (items Q1–Q4, 20 respondents):

        Q1  Q2  Q3  Q4
    1    5   2   2   5
    2    2   3   1   1
    3    2   3   3   1
    4    3   3   5   1
    5    2   1   2   2
    6    5   5   2   5
    7    2   1   2   2
    8    3   2   2   1
    9    3   3   5   1
   10    1   2   5   2
   11    1   3   3   1
   12    2   2   4   2
   13    3   3   5   3
   14    1   1   1   3
   15    3   2   4   4
   16    5   5   3   5
   17    1   1   1   1
   18    1   4   1   2
   19    1   3   1   1
   20    4   5   4   3

Correlation matrix:

        Q1     Q2     Q3     Q4
  Q1  1.000  0.521  0.250  0.726
  Q2  0.521  1.000  0.182  0.328
  Q3  0.250  0.182  1.000  0.029
  Q4  0.726  0.328  0.029  1.000

Average inter-item correlation: 0.339
Cronbach's alpha: 0.664
Example 2
Data (items Q1–Q4, 20 respondents):

        Q1  Q2  Q3  Q4
    1    1   1   1   1
    2    2   3   2   2
    3    3   2   4   3
    4    5   5   5   3
    5    1   2   1   1
    6    4   4   2   1
    7    4   3   5   4
    8    4   2   3   5
    9    5   5   5   5
   10    5   5   5   5
   11    4   3   4   5
   12    2   2   3   2
   13    3   3   2   2
   14    3   3   5   5
   15    2   3   2   4
   16    4   5   5   5
   17    5   4   5   5
   18    3   3   3   3
   19    5   5   5   4
   20    2   2   1   1

Correlation matrix:

        Q1     Q2     Q3     Q4
  Q1  1.000  0.819  0.826  0.684
  Q2  0.819  1.000  0.697  0.515
  Q3  0.826  0.697  1.000  0.813
  Q4  0.684  0.515  0.813  1.000

Average inter-item correlation: 0.725
Cronbach's alpha: 0.910
Example 3
Data (items Q1–Q4, 20 respondents):

        Q1  Q2  Q3  Q4
    1    2   2   2   2
    2    4   4   4   4
    3    2   2   2   2
    4    1   1   1   1
    5    2   2   2   1
    6    4   3   4   4
    7    1   1   1   1
    8    5   3   5   5
    9    4   4   4   4
   10    4   5   4   4
   11    2   2   2   2
   12    5   5   5   3
   13    2   2   2   2
   14    5   5   5   5
   15    2   2   3   2
   16    4   2   4   4
   17    4   4   5   4
   18    5   5   5   5
   19    2   2   2   2
   20    4   4   5   4

Correlation matrix:

        Q1     Q2     Q3     Q4
  Q1  1.000  0.874  0.968  0.939
  Q2  0.874  1.000  0.864  0.795
  Q3  0.968  0.864  1.000  0.921
  Q4  0.939  0.795  0.921  1.000

Average inter-item correlation: 0.894
Cronbach's alpha: 0.971
Often a set of data, or some statistic calculated from the data, is assumed to follow a normal
distribution. Data which are normally distributed have a histogram with a symmetric bell
shape.
3.5.1 Parameters
The normal distribution is the basis of many statistical methods. It can be specified by two
parameters:
If we call the variable Y, we write Y ∼ N(µ, σ²). We use the probability model to draw conclu-
sions about future observations.
Mean µ: The mean µ is the average of measurements taken from the entire population (rather
than just a sample). We usually denote this by µ to distinguish it from the sample mean x̄. The
sample mean is often used as an estimate of µ.
Many statistical methods assume the data are normal, or that the errors from a
fitted model are normal. To test this assumption:
• Plot the histogram. It should look bell-shaped.
• Do a QQ plot on a computer. It should look straight.
Probability that an observation lies within k standard deviations of the mean
(i.e., between µ − kσ and µ + kσ):

    k     Prob.
  0.50    38.3%
  0.67    50.0%
  1.00    68.3%
  1.28    80.0%
  1.50    86.6%
  1.64    90.0%
  1.96    95.0%
  2.00    95.5%
  2.50    98.8%
  2.58    99.0%
  3.00    99.7%
  3.29    99.9%
  3.89    99.99%
Probability that an observation exceeds µ + kσ:

    k     Prob.
  0.00    50.0%
  0.50    30.9%
  0.84    20.0%
  1.00    15.9%
  1.28    10.0%
  1.50     6.7%
  1.64     5.0%
  2.00     2.3%
  2.33     1.0%
  2.50     0.62%
  3.00     0.13%
  3.09     0.10%
  3.50     0.02%
  3.72     0.01%
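Values like those in the two tables can be reproduced with the standard normal distribution in Python's `statistics` module. This sketch is only an illustration (the notes themselves do not use Python):

```python
# Reproducing two entries from the normal probability tables above.
from statistics import NormalDist

z = NormalDist()                               # standard normal: mu = 0, sigma = 1

# P(mu - k*sigma < Y < mu + k*sigma) for k = 1.96 (first table):
print(round(100 * (2 * z.cdf(1.96) - 1), 1))   # 95.0

# P(Y > mu + k*sigma) for k = 2.33 (second table):
print(round(100 * (1 - z.cdf(2.33)), 1))       # 1.0
```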
• If you must use Excel for basic statistics, use a different spreadsheet from your main data
file.
Part 4. Computing and quantitative research
Figure 4.1: Typical set-up of an Excel spreadsheet ready for importing to a statistics package.
Advantages: Disadvantages:
Some of these problems have been known since at least 1994. Microsoft won’t respond to any
requests for fixes.
4.2.2 SPSS
Advantages:
• Very widely used — lots of people to help.
• Most standard methods are available.
• Click and point interface as well as command interface.

Disadvantages:
• Few modern methods included (e.g., nonparametric smoothing).
• Lots of irrelevant output. Hard to know what's important.
• Routines used not properly documented.
• Very difficult to produce customized analysis.
• Graphics are difficult to customize with code.
• Click-and-point interface.
• Easy to learn and use.
• Sometimes limits on data size
• Tedious for repetitive tasks and repeated analyses.
• Examples: JMP, Statgraphics, Statview.
• Forecast Pro
• EViews (for econometric methods)
• Amos (for structural equation modelling)
BUT. . .
Packages at Monash
Data set
We will use data on emergency calls to the New York Auto Club (the NY equivalent of the
RACV). Download the data from
http://www.robhyndman.info/downloads/NYautoclub.xls
The variable Calls concerns emergency road service calls from the second half of January in
1993 and 1994. In addition, we have the following variables:
The idea is to use these variables to predict the number of emergency calls.
Loading data
1. Run SPSS and open the Excel file with the data.
2. Go to the “Variable view” sheet, and ensure the variables are correctly set to Scale (i.e.,
Numerical) or Nominal (i.e., Categorical).
3. For the categorical variables, give the values meaningful labels.
Data summaries
Exploratory graphs
6. Try plotting the number of calls against each of the other variables using an appropriate
plot (i.e., scatterplot or boxplot). [Go to Graphs in the menu.]
7. Are there any outliers in the data?
8. Which of the explanatory variables seem to be related to Calls?
9. Do you think the effects of some variables may be confounded with other variables?
5.1 Proportions
A survey of 400 people in Melbourne found that 45 were watching the Channel
Nine CSI show on Sunday night. The estimated proportion of people watching is
    p̂ = 45/400 = 0.1125, or 11.25%.
Let’s do a thought experiment. If we were able to collect additional samples of 400 customers,
we could calculate p̂ for each sample. Suppose we obtained 999 additional samples of 400
observations each. We would obtain a different value of p̂ each time because each sample
would be random and different. We now have 1000 values of p̂, all of them different. The
variability in these p̂ values tells us how accurate p̂ is for estimating p.
Of course, we can’t collect additional samples. We just have one sample. But statistical theory
can be used to calculate the standard deviation of these p̂ values if we were able to conduct
such an experiment. The standard deviation of p̂ is called the standard error:
    s.e.(p̂) = √( p(1−p) / (n−1) )
where n is the number of observations in our sample. (This is the standard deviation of the
estimated proportions if we took many samples of size n and estimated the proportion from
each sample.)
For percentages, the standard error is 100 times that for proportions. Notice that
• the standard error depends on the size of the sample but not the size of the target popu-
lation (assuming the target population is very large);
Part 5. Significance
• the standard error is smaller if the sample size is increased. This is to be expected: the
more elements in the survey, the more you will know.
A confidence interval is a range of values which we can be confident includes the true value
of the parameter of interest, in this case the proportion p. If we wish to construct a confidence
interval for p we take a multiple of the standard error either side of the estimate of the propor-
tion.
p̂ ± 1.96s.e.(p̂).
(This is an interval which we are 95% sure will contain the true proportion.)
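For the CSI example, the standard error and interval work out as follows. This Python sketch just mechanizes the arithmetic above (it is not part of the notes' workflow):

```python
# Standard error and 95% confidence interval for the CSI example:
# 45 of 400 people watching, so p-hat = 0.1125.
from math import sqrt

n, count = 400, 45
p_hat = count / n                               # 0.1125
se = sqrt(p_hat * (1 - p_hat) / (n - 1))        # se ≈ 0.0158
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se
print(round(se, 4), round(lower, 3), round(upper, 3))   # roughly 0.081 to 0.144
```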
Notice that this interval is quite wide. If another TV show rates 12.5%, then we can’t
say which of the two shows actually had the bigger audience.
The 95% confidence interval of the proportion can be interpreted to be the range of values that
will contain the true proportion with a probability of 0.95. Thus if we calculate the confidence
interval for a proportion for each of 1000 samples, we would expect that about 950 of the cal-
culated confidence intervals would actually contain the true proportion.
Other confidence intervals beside 95% intervals can be calculated by replacing 1.96 by a differ-
ent multiplying factor.
The multiplying factor (1.96 in the example above) depends on the number of observations in
the sample and the confidence level required.
• It only works for larger n. For small n, we need a different (and more complex) method
of calculation.
Consider an example where the fraction of all Australian married couples with chil-
dren was to be estimated and a simple random sample was used. The population
characteristic is the proportion of married couples with children in the target pop-
ulation. We denote this by p. It cannot be known without surveying the entire
population. The statistic is the proportion of married couples with children in the
sample. We denote this by p̂. It is calculated from the survey data as follows.
The margin of error is usually defined as half the width of a 95% confidence interval.
So in the TV ratings example, the margin of error was 0.018. In the couples with children
example, the margin of error is 0.03.
The following table shows the margin of error for proportions for a range of sample sizes and
proportions.
Sample size calculation is most often done by first specifying what is an acceptable margin of
error for a key population characteristic.
If the survey aims to estimate the proportion of couples with children, the key population
characteristic is p. Making n the subject of equation (5.1), we obtain
    n = 1 + 3.84 p(1−p) / m².
Then substituting in the chosen values for m and p, we can obtain the sample size required.
Again, we can ‘guess’ p from previous knowledge of the population such as a pilot survey or
previous surveys.
Alternatively, a conservative approach is to use p = 0.5 since this results in the largest sample
size. Using p = 0.5 gives the sample size
    n = 1 + 0.96/m² ≈ 1/m².
This provides an upper bound on the required sample size. Other values of p will give smaller
sample sizes.
The following table gives sample sizes for different values of m and p.
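A sample-size calculation of this kind takes one line. The sketch below (ours, not from the notes) rounds to the nearest whole person; strictly one should always round up:

```python
# Sample size for a chosen margin of error m, from n = 1 + 3.84 p(1-p)/m^2.
def sample_size(m, p=0.5):
    """Sample size giving 95% margin of error m when the true proportion is p."""
    return round(1 + 3.84 * p * (1 - p) / m ** 2)

print(sample_size(0.1))    # 97 with the conservative choice p = 0.5
print(sample_size(0.05))   # 385
```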
Exercise 5: For television ratings surveys, what number of people would need to
be surveyed for the margin of error to be 2%?
A researcher is studying the change in stress scores over time for an in-house stress man-
agement program. 10 employees complete the test at the start of the program, and they do
the test again at the end of their first 6 weeks on the program.
The standard error of the sample mean is
    s.e.(x̄) = s/√n
where s is the standard deviation of the sample data and n is the number of observations in
our sample. So in the example, the standard error is 2.2/√10 = 0.7. This figure is used to draw
conclusions about the value of µ.
A confidence interval is a range of values which we can be confident includes the true value
of the parameter of interest, in this case the population mean µ. If we wish to construct a
confidence interval for µ we take a multiple of the standard error either side of the estimate of
the mean. For example a 95% confidence interval for µ in this example is x̄ ± 2.262s.e.(x̄).
The 95% confidence interval of the mean can be interpreted to be the range of values that will
contain the true mean with a probability of 0.95. Thus if we calculate the confidence interval for
a mean for each of 1000 samples, we would expect that about 950 of the calculated confidence
intervals would actually contain the true mean.
Other confidence intervals beside 95% intervals can be calculated by replacing 2.262 by a dif-
ferent multiplying factor.
The multiplying factor (2.262 in the example above) depends on the number of observations in
the sample and the confidence level required.
Definition: The two complementary hypotheses in a hypothesis testing problem are called the
null hypothesis and the alternative hypothesis. They are denoted by H0 and H1 respectively.
In a hypothesis testing problem, after observing the sample the experimenter must decide ei-
ther to accept H0 as true or reject H0 as false and decide in favour of H1 .
To make this decision we use a test statistic. That is we calculate the value of some formula
which is a function of the sample data. The value of the test statistic provides evidence for or
against the null hypothesis.
In the case of a test for the mean µ, the test statistic we use is
    t = x̄ / s.e.(x̄).
In the example,
    t = x̄ / s.e.(x̄) = 2.488/0.689 = 3.613.
5.2.4 P-values
A p-value is the probability of randomly observing a value at least as extreme as the one
observed, when the null hypothesis is true. The decision to accept or reject the null hypothesis
is based on the p-value.
In this context, the p-value is the probability of observing an absolute t value greater than
or equal to the one observed (3.613), if µ = 0. That’s the same as the probability of observing a
value of x̄ at least as far away from 0 as the x̄ value we obtained for this sample (2.488). This
probability can be calculated easily using a statistical computer package.
If we obtain a ‘large’ p-value, then we say that data similar to that observed are likely to have
occurred if the null hypothesis was true. Conversely, a small p-value would indicate that it
is unlikely that the null hypothesis was true (because if the null hypothesis were true, it is
unlikely that such data would occur by chance). The smaller the p-value the more unlikely the
null hypothesis.
The p-value is used to define statistical significance. If the p-value is below 0.05 then we say
this result is statistically significant. The choice of threshold is completely arbitrary. It is only
convention that dictates the use of a 0.05 or 0.01 significance level. Instead of saying an effect is
significant at the 0.05 level, quoting the actual p-value will allow the reader to make their own
interpretation.
One-sided tests
A one-sided test only looks at the evidence against the null hypothesis in one di-
rection (e.g., the mean µ is positive) and ignores the evidence against the null
hypothesis in the other direction (e.g., the mean µ is negative).
The question of whether a p-value should be one or two-sided may arise; a one-
sided p-value is rarely appropriate. Even though there may be a priori evidence to
suggest a one-sided effect, we can never really be sure that one treatment, say, is
better than another. If we did then there would be no need to do an experiment to
determine this! Therefore, routinely use two-sided p-values.
There are at least two reasons why we might get the wrong answer with an hypothesis test.
Type I Error is where we accept the alternative hypothesis (reject the null hypothesis) even
though it is not true. This is sometimes referred to as a false positive. The type I error is set in
advance and is typically 5% (one in 20) or 1% (one in 100). This implies that one in 20 pieces of
scientific research based on an hypothesis test is mistaken! We use α to denote the probability
of type I error (the size or level of the test).
Type II Error is the risk of accepting the null hypothesis (rejecting the alternative hypothesis)
when the alternative is in fact true. This is sometimes referred to as a false negative. It is often
denoted by β.
If the chance of making a type I error is made very small, then automatically the risk of making
a type II error will grow.
The power of a statistical test is 1 − β. This is the probability of accepting the alternative hy-
pothesis when it is true. Obviously we want this as high as possible. However, the smaller we
make α, the less power we have for the test.
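The 5% Type I error rate can be seen by simulation: generate many samples for which the null hypothesis is true, test each one, and count how often the test (wrongly) rejects. A Python sketch, with the 5% critical value 2.262 for 9 degrees of freedom taken from the notes:

```python
# Simulating the Type I error rate of a 5%-level t-test when H0 is true.
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)
n, t_crit, trials = 10, 2.262, 5000    # 2.262: two-sided 5% point, 9 df
rejections = 0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]   # H0 true: mu = 0
    t = mean(sample) / (stdev(sample) / sqrt(n))
    if abs(t) > t_crit:
        rejections += 1                # a false positive
print(rejections / trials)             # close to 0.05
```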
standard error: The standard deviation of a statistic calculated from the data, such as a pro-
portion or the mean difference.
p-value: The probability of observing a value as large as that which was observed if, in fact,
there is no real change.
95% confidence interval: An interval which contains the true mean change with probability of
95%. So if the confidence interval does not include zero, then the p-value is smaller than
0.05.
1. The diastolic blood pressures (DBP) of a group of young men are normally distributed
with mean 70 mmHg and standard deviation 10 mmHg. It follows that
(a) About 95% of the men have a DBP between 60 and 80 mmHg.
(b) About 50% of the men have a DBP above 70 mmHg.
(c) The distribution of DBP is not skewed
(d) All the DBPs must be less than 100 mmHg.
(e) About 2.5% of the men have DBP below 50 mmHg.
2. Following the introduction of a new treatment regime in an alcohol dependency unit,
‘cure’ rates improved. The proportion of successful outcomes in the two years following
the change was significantly higher than in the preceding two years (p < 0.05). It follows
that:
(a) If there had been no real change in cure rates, the probability of getting this differ-
ence or one more extreme by chance, is less than one in twenty.
(b) The improvement in treatment outcome is clinically important.
(c) The change in outcome could be due to a confounding factor.
(d) The new regime cannot be worse than the old treatment.
(e) Assuming that there are no biases in the study method, the new treatment should
be recommended in preference to the old.
3. As the size of a random sample increases:
(a) The standard deviation decreases.
(b) The standard error of the mean decreases.
(c) The mean decreases.
(d) The range is likely to increase.
(e) The accuracy of the parameter estimates increases.
4. A 95% confidence interval for a mean
(a) Is wider than a 99% confidence interval.
(b) In repeated samples will include the population mean 95% of the time.
(c) Will include the sample mean with a probability 1.
(d) Is a useful way of describing the accuracy of a study.
(e) Will include 95% of the observations of a sample.
5. The p-value
(a) Is the probability that the null hypothesis is false
(b) Is generally large for very small studies
(c) Is the probability of the observed result, or one more extreme, if the null hypothesis
were true.
(d) Is one minus the type II error
(e) Can only take a limited number of discrete values such as 0.1, 0.05, 0.01, etc.
Bicep circumference
[Ref: Bland, J.M. and Altman, D.G. (1986) Statistical methods for assessing agreement between
two methods of clinical measurement. Lancet, 307–310.]
The table below shows the circumference (cm) of the right and left bicep of 15 right-handed
tennis players.
Regression is useful when there is a numerical response variable and one or more explana-
tory variables.
Ref: Makridakis, Wheelwright and Hyndman, 1998. Forecasting: methods and applications, John
Wiley & Sons Chapter 5.
Pulp shipments World pulp price Pulp shipments World pulp price
(millions metric tons) (dollars per ton) (millions metric tons) (dollars per ton)
Si Pi Si Pi
10.44 792.32 21.40 619.71
11.40 868.00 23.63 645.83
11.08 801.09 24.96 641.95
11.70 715.87 26.58 611.97
12.74 723.36 27.57 587.82
14.01 748.32 30.38 518.01
15.11 765.37 33.07 513.24
15.26 755.32 33.81 577.41
15.55 749.41 33.19 569.17
16.81 713.54 35.15 516.75
18.21 685.18 27.45 612.18
19.42 677.31 13.96 831.04
20.18 644.59
Part 6. Statistical models and regression
6.1.1 Scatterplots
‘Eye-balling’ the data would suggest that Shipments decreases with price. A plot of shipments
against price is a good preliminary step to ensure that a linear relationship is appropriate.
In regression problems we are interested in how changes in one variable are related to changes
in another. In the case of Shipments and Price we are concerned with how Shipments changes
with Price, not how Price changes with Shipments. The explanatory variable is Price, and the
response variable it predicts is Shipments.
The relationship between the explanatory variable, x, and the response variable, y, is
yi = a + bxi + ei
where a is the intercept of the line, b is the slope, and ei is the error, or that part of the observed
data which is not described by the linear relationship. ei is assumed to be Normally distributed
with mean 0 and standard deviation σ.
If we can find the line that best fits the data we could then determine what increase in price is
associated with a unit decrease in shipments.
Figure 6.1: The relationship between world pulp price and pulp shipments is negative. As the price
increases, the quantity shipped decreases.
The line of ‘best’ fit is found by minimizing the sum of squares of the deviations from the
observed points to the line. This method is called the method of least squares: the fitted line
minimizes
    Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} (y_i − â − b̂x_i)².
These calculations are done easily using a statistics package (or even a calculator).
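The least-squares formulas can also be coded directly. This Python sketch (ours, not part of the notes' workflow) uses the pulp data tabled above, regressing shipments on price:

```python
# Least squares and correlation for the pulp data, from the standard formulas.
from statistics import mean, stdev

price = [792.32, 868.00, 801.09, 715.87, 723.36, 748.32, 765.37, 755.32,
         749.41, 713.54, 685.18, 677.31, 644.59, 619.71, 645.83, 641.95,
         611.97, 587.82, 518.01, 513.24, 577.41, 569.17, 516.75, 612.18, 831.04]
ship = [10.44, 11.40, 11.08, 11.70, 12.74, 14.01, 15.11, 15.26, 15.55, 16.81,
        18.21, 19.42, 20.18, 21.40, 23.63, 24.96, 26.58, 27.57, 30.38, 33.07,
        33.81, 33.19, 35.15, 27.45, 13.96]

n = len(price)
pbar, sbar = mean(price), mean(ship)
r = sum((x - pbar) * (y - sbar) for x, y in zip(price, ship)) \
    / ((n - 1) * stdev(price) * stdev(ship))
b = r * stdev(ship) / stdev(price)     # least-squares slope of shipments on price
a = sbar - b * pbar                    # intercept
print(round(r, 3))                     # about -0.93: a strong negative relationship
```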
A useful plot for spotting outliers is the scatterplot of residuals ei against the explanatory vari-
able xi . This shows whether a straight line was appropriate. We expect to see a scatterplot
resembling a horizontal band with no values too far from the band and no patterns such as
curvature or increasing spread.
Another useful plot for spotting outliers and other unwanted features is to plot residuals
against the fitted values ŷi . Again, we expect to see no pattern.
[Figure: scatterplot of residuals against world pulp price.]
6.1.5 Correlation
Recall: the correlation coefficient is a measure of the strength of the linear relationship.
The pulp price and shipments data have a correlation of r = −0.931, indicating a very strong
negative relationship between pulp price and pulp shipped. If the pulp price increases, the
quantity of pulp shipped tends on average to decrease and vice versa.
So r2 = 0.867 showing that 86.7% of the variation is explained by the regression line. The other
13.3% of the variation is random variation about the line.
The table below gives the values for 32 babies of x, the birth weight, and y, the increase in
weight between the 70th and 100th day of life, as a percentage of birth weight.
[Figure: scatterplot of percentage increase in weight against birth weight.]
What would the expected percentage increase in weight be for an infant whose birth weight
was 94 oz?
Data: returns for 30 stocks listed on NASDAQ and NYSE for 9–13 May 1994.
We look at absolute return in prices of stocks. This is a measure of volatility. For example,
a market where stocks average a weekly 10% change in price (positive or negative) is more
volatile than one which averages a 5% change.
Graphical summary:
[Figure: side-by-side boxplots of absolute returns for NASDAQ and NYSE.]
Numerical summaries:
NASDAQ NYSE
Min. :0.00380 Min. :0.00260
1st Qu.:0.01745 1st Qu.:0.01120
Median :0.03930 Median :0.02480
Mean :0.04395 Mean :0.02913
3rd Qu.:0.05575 3rd Qu.:0.04010
Max. :0.12240 Max. :0.08910
Our model is that each group has a different mean. So if we let yi,j be the ith measurement
from the jth group and µj be the mean of the jth group, then we can write the model as
yi,j = µj + ei,j
Again, we assume e_{i,j} ∼ N(0, σ²). That is, all groups have the same standard deviation.
If a categorical variable takes only two values (e.g., ‘Yes’ or ‘No’), then an equivalent numerical
variable can be constructed taking value 1 if yes and 0 if no. This is called a dummy variable.
In this case, the problem becomes identical to the case with a numerical explanatory variable.
If there are more than two categories, then the variable can be coded using several dummy
variables (one fewer than the total number of categories). Then the problem is one of several
numerical explanatory variables and is discussed in the next section.
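Dummy coding is purely mechanical, as this Python sketch shows (the function name and example categories are ours, chosen for illustration):

```python
# Dummy coding a categorical variable: one fewer dummy than categories,
# with the omitted (first) category acting as the baseline.
def dummy_code(values, categories):
    """Return one row of 0/1 dummies per observation, dropping the first category."""
    return [[1 if v == c else 0 for c in categories[1:]] for v in values]

regions = ["North", "South", "East", "South"]
print(dummy_code(regions, ["North", "South", "East"]))
# [[0, 0], [1, 0], [0, 1], [1, 0]]  -- "North" is the baseline category
```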
In multiple regression there is one variable to be predicted (e.g., sales), but there are two or
more explanatory variables. The general form of multiple regression is
Y = b0 + b1 X1 + b2 X2 + · · · + bk Xk + e.
Thus if sales were the variable to be modelled, several factors such as GNP, advertising, prices,
competition, R&D budget, and time could be tested for their influence on sales by using re-
gression. If it is found that these variables do influence the level of sales, they can be used to
predict future values of sales.
Each of the explanatory variables (X1 , . . . , Xk ) is numerical, although it is easy to handle cate-
gorical variables in a similar way using dummy variables.
To illustrate the application of multiple regression, we will use a case study taken from Makri-
dakis, Wheelwright and Hyndman, 1998. Forecasting: methods and applications, John Wiley &
Sons Chapter 6.
These data refer to a mutual savings bank in a large metropolitan area. In 1993 there was
considerable concern within the mutual savings banks because monthly changes in deposits
were getting smaller and monthly changes in withdrawals were getting bigger. Thus it was
of interest to develop a short-term forecasting model to forecast the changes in end-of-month
(EOM) balance over the next few months. Table 6.1 shows 60 monthly observations (February
1988 through January 1993) of end-of-month balance (in column 2). Note that there was strong
growth in early 1991 and then a slowing down of the growth rate since the middle of 1991.
Also presented in Table 6.1 are the composite AAA bond rates (in column 3) and the rates on
U.S. Government 3-4 year bonds (in column 4). It was hypothesized that these two rates had
an influence on the EOM balance figures in the bank.
Now of interest to the bank was the change in the end-of-month balance and so first differences
of the EOM data in Table 6.1 are shown as column 2 of Table 6.2. These differences, denoted
D(EOM) in subsequent equations, are plotted in Figure 6.3, and it is clear that the bank was
Example:
(1) (2) (3) (4) (1) (2) (3) (4)
Month (EOM) (AAA) (3-4) Month (EOM) (AAA) (3-4)
1 360.071 5.94 5.31 31 380.119 8.05 7.46
2 361.217 6.00 5.60 32 382.288 7.94 7.09
3 358.774 6.08 5.49 33 383.270 7.88 6.82
4 360.271 6.17 5.80 34 387.978 7.79 6.22
5 360.139 6.14 5.61 35 394.041 7.41 5.61
6 362.164 6.09 5.28 36 403.423 7.18 5.48
7 362.901 5.87 5.19 37 412.727 7.15 4.78
8 361.878 5.84 5.18 38 423.417 7.27 4.14
9 360.922 5.99 5.30 39 429.948 7.37 4.64
10 361.307 6.12 5.23 40 437.821 7.54 5.52
11 362.290 6.42 5.64 41 441.703 7.58 5.95
12 367.382 6.48 5.62 42 446.663 7.62 6.20
13 371.031 6.52 5.67 43 447.964 7.58 6.03
14 373.734 6.64 5.83 44 449.118 7.48 5.60
15 373.463 6.75 5.53 45 449.234 7.35 5.26
16 375.518 6.73 5.76 46 454.162 7.19 4.96
17 374.804 6.89 6.09 47 456.692 7.19 5.28
18 375.457 6.98 6.52 48 465.117 7.11 5.37
19 375.423 6.98 6.68 49 470.408 7.16 5.53
20 374.365 7.10 7.07 50 475.600 7.22 5.72
21 372.314 7.19 7.12 51 475.857 7.36 6.04
22 373.765 7.29 7.25 52 480.259 7.34 5.66
23 372.776 7.65 7.85 53 483.432 7.30 5.75
24 374.134 7.75 8.02 54 488.536 7.30 5.82
25 374.880 7.72 7.87 55 493.182 7.27 5.90
26 376.735 7.67 7.14 56 494.242 7.30 6.11
27 374.841 7.66 7.20 57 493.484 7.31 6.05
28 375.622 7.89 7.59 58 498.186 7.26 5.98
29 375.461 8.14 7.74 59 500.064 7.24 6.00
30 377.694 8.21 7.51 60 506.684 7.25 6.24
Table 6.1: Bank data: end-of-month balance (in thousands of dollars), AAA bond rates, and rates for
3-4 year government bond issues over the period February 1988 through January 1993.
facing a volatile situation in the last two years or so. The challenge is to forecast these rapidly
changing EOM values.
In preparation for some of the regression analyses to be done in this chapter, Table 6.2 desig-
nates D(EOM) as Y , the response variable, and shows three explanatory variables X1 , X2 , and
X3 . Variable X1 is the AAA bond rates from Table 6.1, but they are now shown leading the
D(EOM) values. Similarly, variable X2 refers to the rates on 3-4 year government bonds and
they are shown leading the D(EOM) values by one month. Finally, variable X3 refers to the first
differences of the 3-4 year government bond rates, and the timing for this variable coincides
with that of the D(EOM) variable.
Figure 6.3: (a) A time plot of the monthly change of end-of-month balances at a mutual savings bank.
(b) A time plot of AAA bond rates. (c) A time plot of 3-4 year government bond issues. (d) A time plot of
the monthly change in 3-4 year government bond issues. All series are shown over the period February
1988 through January 1993.
Example:
t Y X1 X2 X3 t Y X1 X2 X3
Month D(EOM) (AAA) (3-4) D(3-4) Month D(EOM) (AAA) (3-4) D(3-4)
1 1.146 5.94 5.31 0.29 31 2.169 8.05 7.46 -0.37
2 -2.443 6.00 5.60 -0.11 32 0.982 7.94 7.09 -0.27
3 1.497 6.08 5.49 0.31 33 4.708 7.88 6.82 -0.60
4 -0.132 6.17 5.80 -0.19 34 6.063 7.79 6.22 -0.61
5 2.025 6.14 5.61 -0.33 35 9.382 7.41 5.61 -0.13
6 0.737 6.09 5.28 -0.09 36 9.304 7.18 5.48 -0.70
7 -1.023 5.87 5.19 -0.01 37 10.690 7.15 4.78 -0.64
8 -0.956 5.84 5.18 0.12 38 6.531 7.27 4.14 0.50
9 0.385 5.99 5.30 -0.07 39 7.873 7.37 4.64 0.88
10 0.983 6.12 5.23 0.41 40 3.882 7.54 5.52 0.43
11 5.092 6.42 5.64 -0.02 41 4.960 7.58 5.95 0.25
12 3.649 6.48 5.62 0.05 42 1.301 7.62 6.20 -0.17
13 2.703 6.52 5.67 0.16 43 1.154 7.58 6.03 -0.43
14 -0.271 6.64 5.83 -0.30 44 0.116 7.48 5.60 -0.34
15 2.055 6.75 5.53 0.23 45 4.928 7.35 5.26 -0.30
16 -0.714 6.73 5.76 0.33 46 2.530 7.19 4.96 0.32
17 0.653 6.89 6.09 0.43 47 8.425 7.19 5.28 0.09
18 -0.034 6.98 6.52 0.16 48 5.291 7.11 5.37 0.16
19 -1.058 6.98 6.68 0.39 49 5.192 7.16 5.53 0.19
20 -2.051 7.10 7.07 0.05 50 0.257 7.22 5.72 0.32
21 1.451 7.19 7.12 0.13 51 4.402 7.36 6.04 -0.38
22 -0.989 7.29 7.25 0.60 52 3.173 7.34 5.66 0.09
23 1.358 7.65 7.85 0.17 53 5.104 7.30 5.75 0.07
24 0.746 7.75 8.02 -0.15 54 4.646 7.30 5.82 0.08
25 1.855 7.72 7.87 -0.73 55 1.060 7.27 5.90 0.21
26 -1.894 7.67 7.14 0.06 56 -0.758 7.30 6.11 -0.06
27 0.781 7.66 7.20 0.39 57 4.702 7.31 6.05 -0.07
28 -0.161 7.89 7.59 0.15 58 1.878 7.26 5.98 0.02
29 2.233 8.14 7.74 -0.23 59 6.620 7.24 6.00 0.24
30 2.425 8.21 7.51 -0.05
Table 6.2: Bank data: monthly changes in balance as response variable and three explanatory variables.
(Data for months 54–59 to be ignored in all analyses and then used to check forecasts.)
The numbers in the first row of Table 6.2 are explained as follows:
(Note that the particular choice of these explanatory variables is not arbitrary, but rather based
on an extensive analysis that will not be presented in detail here.)
For the purpose of illustration in this chapter, the last six rows in Table 6.2 will be ignored in
all the analyses that follow, so that they may be used to examine the accuracy of the various
models to be employed. (The idea is to forecast the D(EOM) figures for periods 54–59, and
then compare them with the known figures not used in developing our regression model. This
comparison hasn’t actually been in these notes.)
The bank could model Y (the D(EOM) variable) on the basis of X1 alone, or on the basis of a
combination of the X1 , X2 , and X3 variables shown in columns 3, 4, and 5. So Y , the response
variable, is a function of one or more of the explanatory variables. Although several different
forms of the function could be written to designate the relationships among these variables, a
straightforward one that is linear and additive is
Y = b0 + b1 X1 + b2 X2 + b3 X3 + e, (6.1)
where Y = D(EOM),
X1 = AAA bond rates,
X2 = 3-4 rates,
X3 = D(3-4) year rates,
e = error term.
From equation (6.1) it can readily be seen that if two of the X variables were omitted, the
equation would be like those handled previously with simple linear regression.
Time plots of each of the variables are given in Figure 6.3. These show the four variables
individually as they move through time. Notice how some of the major peaks and troughs line
up, implying that the variables may be related.
Scatterplots of each combination of variables are given in Figure 6.4. These enable us to visu-
alize the relationship between each pair of variables. Each panel shows a scatterplot of one of
the four variables against another of the four variables. The variable on the vertical axis is the
variable named in that row; the variable on the horizontal axis is the variable named in that
column. So, for example, the panel in the top row and second column is a plot of D(EOM)
against AAA. Similarly, the panel in the second row and third column is a plot of AAA against
(3–4). This figure is known as a scatterplot matrix and is a very useful way of visualizing the
relationships between the variables.
Note that the mirror image of each plot above the diagonal is given below the diagonal. For
example, the plot of D(EOM) against AAA given in the top row and second column is mirrored
in the second row and first column with a plot of AAA against D(EOM).
Figure 6.4 shows that there is a weak linear relationship between D(EOM) and each of the
other variables. It also shows that two of the explanatory variables, AAA and (3-4), are related
linearly. This phenomenon is known as collinearity and means it may be difficult to distinguish
the effect of AAA and (3-4) on D(EOM).
For the bank data in Table 6.2—using only the first 53 rows—the model in equation (6.1) can be
solved using least squares to give
Ŷ = −4.34 + 3.37(X1 ) − 2.83(X2 ) − 1.96(X3 ). (6.2)
Note that a “hat” is used over Ŷ to indicate that this is an estimate of Y , not the observed
Y . This estimate Ŷ is based on the three explanatory variables only. The difference between
the observed Y and the estimated Ŷ tells us something about the “fit” of the model, and this
discrepancy is called the residual (or error):
Figure 6.4: Scatterplots of each combination of variables. The variable on the vertical axis is the variable
named in that row; the variable on the horizontal axis is the variable named in that column. This
scatterplot matrix is a very useful way of visualizing the relationships between each pair of variables.
Example:
Y X1 X2 X3
D(EOM) (AAA) (3-4) D(3-4)
Y = D(EOM) 1.000 0.257 -0.391 -0.195
X1 = (AAA) 0.257 1.000 0.587 -0.204
X2 = (3-4) -0.391 0.587 1.000 -0.201
X3 = D(3-4) -0.195 -0.204 -0.201 1.000
Table 6.3: Bank data: the correlations among the response and explanatory variables.
ei = Yi − Ŷi ,
where Yi is the observed value and Ŷi is the value estimated using the regression model.
Computer output
In the case of the bank data and the linear regression of D(EOM) on (AAA), (3-4), and D(3-4),
the full output from a regression program included the following information:
R2 = 0.53.
Residual analysis
Figure 6.5 shows four plots of the residuals obtained after fitting the model in equation (6.1).
The bottom right panel of Figure 6.5 shows the residuals (ei ) against the fitted values (Ŷi ). The
other panels show the residuals plotted against the explanatory variables. Each of the plots can
be interpreted in the same way as the residual plot for simple regression. The residuals should
not be related to the fitted values or the explanatory variables. So each residual plot should
show scatter in a horizontal band with no values too far from the band and no patterns such as
curvature or increasing spread. All four plots in Figure 6.5 show no such patterns.
If there is any curvature pattern in one of the plots against an explanatory variable, it suggests
that the relationship between Y and that X variable is non-linear (a violation of the linearity
assumption). The plot of residuals against fitted values is used to check the assumption of
homoscedasticity and to identify large residuals (possible outliers). For example, if the residuals
show increasing spread from left to right (i.e., as Ŷ increases), then the variance of the residuals
is not constant.
It is also useful to plot the residuals against explanatory variables which were not included in
the model. If such plots show any pattern, it indicates that the variable concerned contains
some valuable predictive information and it should be added to the regression model.
To check the assumption of normality, we can plot a histogram of the residuals. Figure 6.6
shows such a histogram with a normal curve superimposed. The histogram shows the number
of residuals obtained within each of the intervals marked on the horizontal axis. The normal
Figure 6.5: Bank data: plots of the residuals obtained when D(EOM) is regressed against the three
explanatory variables AAA, (3-4), and D(3-4). The lower right panel shows the residuals plotted against
the fitted values (ei vs Ŷi ). The other plots show the residuals plotted against the explanatory variables
(ei vs Xj,i ).
Figure 6.6: Bank data: histogram of residuals with normal curve superimposed.
curve shows how many observations one would get on average from a normal distribution. In
this case, there does not appear to be any problem with the normality assumption.
There is one residual (with value −5.6) lying away from the other values; it can be seen in
the histogram (Figure 6.6) and in the residual plots of Figure 6.5. However, this residual is not
sufficiently far from the other values to warrant close attention.
Computer output for regression will always give the R2 value. This is a useful summary of the
model.
R² = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)²   (6.3)
However, it needs to be used with caution. The problem is that R² does not take into account
“degrees of freedom”. Models become more flexible when more variables are added, so adding
any variable tends to increase the value of R², even if that variable is irrelevant. To allow for
this, we use the adjusted value
R̄² = 1 − (1 − R²)(n − 1)/(n − k − 1),
where n is the number of observations and k is the number of explanatory variables in the
model. Note that R̄² is referred to as “adjusted R²” or “R-bar-squared,” or sometimes as “R²,
corrected for degrees of freedom.”
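As a quick sketch of these calculations (with made-up observed and fitted values rather than output from a real regression):

```python
import numpy as np

# Computing R-squared and adjusted R-squared from observed and fitted values.
# The numbers are invented for illustration; k is the number of explanatory variables.
y     = np.array([3.1, 4.0, 5.2, 6.1, 6.8, 8.2, 9.1, 9.9])   # observed Y
y_hat = np.array([3.0, 4.2, 5.0, 6.3, 7.0, 8.0, 9.2, 9.8])   # fitted Y-hat
n, k = len(y), 3

ss_total = np.sum((y - y.mean())**2)       # total variation in Y
ss_model = np.sum((y_hat - y.mean())**2)   # variation explained by the model
r2 = ss_model / ss_total                   # equation (6.3)

# Adjusted R-squared penalises the extra flexibility of added variables.
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
```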
There are other measures which, like R̄2 , can be used to find the best regression model. Some
computer programs will output several possible measures. Apart from R̄2 , the most commonly
used measures are Mallows’ Cp statistic and Akaike’s Information Criterion (AIC).
Developing a regression model for real data is never a simple process, but some guidelines
can be given. Generally, we have a long list of potential explanatory variables. The “long list”
needs to be reduced to a “short list” by various means, and a certain amount of creativity is
essential.
There are many proposals regarding how to select appropriate variables for a final model. Some
of these are straightforward, but not recommended:
• Plot Y against a particular explanatory variable (Xj ) and if it shows no noticeable rela-
tionship, drop it.
• Look at the correlations among the explanatory variables (all of the potential candidates)
and every time a large correlation is encountered, remove one of the two variables from
further consideration; otherwise you might run into multicollinearity problems (see Sec-
tion 6.6).
• Do a multiple linear regression on all the explanatory variables and disregard all variables
whose p values are very large (say |p| > 0.2).
Although these approaches are commonly followed, none of them is reliable in finding a good
regression model.
Quite often, a combination of the above will be used to reach the final short list of explanatory
variables.
Ideally, we would like to calculate all possible regression models using our set of candidate ex-
planatory variables and choose the best model among them. There are two problems here. First
it may not be feasible to compute all the models because of the huge number of combinations
of variables that is possible. Second, how do we decide what is best?
We will consider the second problem first. A naïve approach to selecting the best model would
be to find the model which gives the largest value of R2 . In fact, that is the model which
contains all the explanatory variables! Every additional explanatory variable will result in an
increase in R2 . Clearly not all of these explanatory variables should be included. So maximizing
the value of R2 is not an appropriate method for finding the best model.
Instead, we can compare the R̄2 values for all the possible regression models and select the
model with the highest value for R̄2 . If we have 44 possible explanatory variables, then we
can use anywhere between 0 and 44 of these in our final model. That is a total of 244 = 18
trillion possible regression models! Even using modern computing facilities, it is impossible to
compute that many regression models in person’s lifetime. So we need some other approach.
Clearly the problem can quickly get out of hand without some help. To select the best ex-
planatory variables from among 44 candidate variables, we need to use stepwise regression
(discussed in the next section).
Stepwise regression is a method which can be used to help sort out the relevant explanatory
variables from a set of candidate explanatory variables when the number of explanatory vari-
ables is too large to allow all possible regression models to be computed.
Several types of stepwise regression are in use today. The most common is described below.
Step 1: Find the single best explanatory variable (call it X1∗ ).
Step 2: Find the best pair of variables (X1∗ together with one of the remaining explanatory
variables—call it X2∗ ).
Step 3: Find the best triple of explanatory variables (X1∗ , X2∗ plus one of the remaining ex-
planatory variables—call the new one X3∗ ).
Step 4: From this step on, the procedure checks whether any of the earlier introduced variables
should be removed. For example, the regression of Y on X2∗ and X3∗ alone might give
a better R̄² than the regression on all three variables X1∗ , X2∗ , and X3∗ . At step 2 the
best pair of explanatory variables had to include X1∗ , but by step 3 the pair X2∗ and
X3∗ could actually be superior to all three variables together.
Step 5: The process of (a) looking for the next best explanatory variable to include, and (b)
checking to see if a previously included variable should be removed, is continued until
certain criteria are satisfied. For example, in running a stepwise regression program, the
user is asked to enter two “tail” probabilities:
1. the probability, P1 , to “enter” a variable, and
2. the probability, P2 , to “remove” a variable.
When it is no longer possible to find any new variable that contributes at the P1 level
to the R̄2 value, or if no variable needs to be removed at the P2 level, then the iterative
procedure stops.
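A minimal sketch of the forward part of this procedure (add whichever variable most improves R̄², stopping when no addition helps) is given below. Note the simplifications: a full implementation would also include the removal step described above, and real programs use F-tests with the P1 and P2 thresholds rather than R̄² directly. The data are simulated.

```python
import numpy as np

def adj_r2(X, y):
    """Adjusted R-squared of a least-squares fit of y on the columns of X (plus intercept)."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid @ resid / np.sum((y - y.mean())**2)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def forward_stepwise(X, y):
    """Greedy forward selection: add the variable that most improves adjusted R-squared."""
    remaining = list(range(X.shape[1]))
    chosen, best = [], -np.inf
    while remaining:
        scores = {j: adj_r2(X[:, chosen + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:      # no candidate improves the criterion: stop
            break
        best = scores[j_best]
        chosen.append(j_best)
        remaining.remove(j_best)
    return chosen

# Simulated data: y depends on columns 0 and 2 only; the other columns are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=100)
selected = forward_stepwise(X, y)       # should pick up columns 0 and 2
```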
• Putting all the explanatory variables in can lead to a significant overall effect but with no
way of determining the individual effects of each variable.
• Using stepwise regression is useful for choosing only the most significant variables in the
regression model.
• Stepwise regression is not guaranteed to lead to the best possible model.
• If you are trying several different models, use the adjusted R2 value to select between
them.
6.6 Multicollinearity
In regression analysis, multicollinearity is the name given to any one or more of the following
conditions:
• two explanatory variables are perfectly (or nearly perfectly) correlated;
• one explanatory variable is an exact (or nearly exact) linear combination of two or more
of the other explanatory variables.
The reason for concern about this issue is first and foremost a computational one. If perfect
multicollinearity exists in a regression problem, it is simply not possible to carry out the LS
solution. If nearly perfect multicollinearity exists, the LS solutions can be affected by round-
off error problems in some calculators and some computer packages. There are computational
methods that are robust enough to take care of all but the most difficult multicollinearity prob-
lems, but not all packages take advantage of these methods. Excel is notoriously bad in this
respect.
The other major concern is that the stability of the regression coefficients is affected by multi-
collinearity. As multicollinearity becomes more and more nearly perfect, the regression coef-
ficients computed by standard regression programs are therefore going to be (a) unstable—as
measured by the standard error of the coefficient, and (b) unreliable—in that different computer
programs are likely to give different solution values.
Multicollinearity is not a problem unless either (i) the individual regression coefficients are of
interest, or (ii) attempts are made to isolate the contribution of one explanatory variable to Y ,
without the influence of the other explanatory variables. Multicollinearity will not affect the
ability of the model to predict.
A common but incorrect idea is that an examination of the intercorrelations among the ex-
planatory variables can reveal the presence or absence of multicollinearity. While it is true that
a correlation very close to +1 or −1 does suggest multicollinearity, it is not true (unless there
are only two explanatory variables) to infer that multicollinearity does not exist when there are
no high correlations between any pair of explanatory variables. This point will be examined in
the next two sections.
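The following sketch (simulated data) illustrates the point: X4 is constructed as an exact linear combination of X1, X2 and X3, giving perfect multicollinearity, yet no pairwise correlation comes close to +1 or −1.

```python
import numpy as np

# Multicollinearity without any high pairwise correlation.
rng = np.random.default_rng(42)
n = 200
X1, X2, X3 = rng.normal(size=(3, n))   # three independent explanatory variables
X4 = X1 + X2 + X3                      # perfect multicollinearity by construction
X = np.column_stack([X1, X2, X3, X4])

# No pairwise correlation is near +1 or -1 (each |corr(Xi, X4)| is about 0.58)...
corr = np.corrcoef(X, rowvar=False)
max_offdiag = np.max(np.abs(corr - np.eye(4)))

# ...yet the design matrix is rank-deficient: only 3 independent columns among 4,
# so the least squares solution is not unique.
rank = np.linalg.matrix_rank(X)
```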
Recall the data on emergency calls to the New York Auto Club (the NY equivalent of the
RACV). (See p.76.)
1. Try fitting a linear regression model to predict Calls using the other variables. [Go to
Analyze → Regression → Linear .]
Which variables are significant? Are the coefficients in the direction you expected? Write
down the R2 and adjusted-R2 values.
2. Now try fitting the same regression model but with “Method” set to “Stepwise”. This
only puts in the variables which are useful and leaves out the others.
Now which variables are included? Write down the R2 and adjusted-R2 values. How
have they changed?
3. Finally, try the model with explanatory variables Flow and Rain. Write down the R2 and
adjusted-R2 values. This shows that the step-wise method in SPSS doesn’t always find
the best model!
Constant 7952.7
Flow −173.8
Rain 1922.2
5. The busiest day of all in 1994 was January 27 when the daily forecast low was 14°F and
the ground was under six inches of snow. The club answered 8947 calls. Could this have
been predicted from the model?
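One way to check is to plug the values into the fitted equation. The sketch below assumes “Flow” is the forecast low temperature (°F) and “Rain” is a 0/1 indicator variable; both interpretations are guesses from context.

```python
# Sketch: predicting daily calls from the fitted coefficients quoted above.
# Assumption: "Flow" is the forecast low temperature in degrees F and "Rain"
# is a 0/1 indicator; these interpretations are not stated in the exercise.
def predict_calls(flow, rain):
    return 7952.7 - 173.8 * flow + 1922.2 * rain

# 27 January 1994: forecast low of 14F.
calls_dry = predict_calls(14, 0)    # 7952.7 - 173.8*14 = 5519.5
calls_wet = predict_calls(14, 1)    # 5519.5 + 1922.2  = 7441.7
```

Either way the prediction falls well short of the 8947 calls actually answered, suggesting the model does not capture extreme days such as heavy snow.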
The model assumes that the residuals are normally distributed with constant variance, so the
residual plots should always be checked before computing any confidence intervals or tests of
significance. If this assumption is invalid, then the confidence intervals and tests are invalid.
When a regression model is fitted, we can test whether the model is any better than having no
variables at all. The test is conducted using an ANOVA (ANalysis Of VAriance) table. The test
is called an F-test. Here the null hypothesis is that no variable has any effect (i.e., all coefficients
are zero). The alternative hypothesis is that at least one variable has some effect (i.e., at least
one coefficient is non-zero).
An analysis of variance seeks to split up the variation in the data into two components: the
variation due to the model and the variation left over in the residuals. If the null hypothesis
is true (no variable is relevant) then we would expect the variation in the residuals to be much
larger than the variation in the model. The calculations required to answer this question are
summarized in an “analysis of variance” or ANOVA table.
Source DF SS MS F P
Regression 1 1357.2 1357.2 149.38 0.000
Residual Error 23 209.0 9.1
Total 24 1566.2
Part 7. Significance in regression
The Analysis of Variance (ANOVA) table above contains six columns: Source of Variation,
degrees of freedom (DF), sums of squares (SS), mean square (MS), the variance ratio or F-value
(F), and the p-value (P). Of primary interest are the F and P columns.
• The F-value follows an F-distribution, and is used to decide if the model is significant.
• The p-value is the probability that a randomly selected value from the F-distribution is
greater than the observed variance ratio.
• As a general rule, if the F-probability (or p-value) is less than 0.05 then the model is
deemed to be significant.
In this case, there is a significant effect due to the included variable.
• If there are two groups, the p-value from the ANOVA (F-test) is the same as the p-value
from the t-test (provided a t-test with “pooled variance” is used).
The regression equation gives us the line best relating shipments to price. We now ask a
statistical question: is the relationship significant? In the context of linear regression, a
relationship is significant if the slope of the line is significantly different from zero; a slope
equal to zero would imply that shipments remain unchanged as price increases, that is, no
relationship.
To test the significance of the relationship between shipments and price the hypotheses are:
H0 : b = 0    H1 : b ≠ 0.
As usual, if the p-value is less than 0.05 then the linear regression is deemed to be significant.
This means that the estimated slope of the line is significantly different from zero.
x = birth weight
y = increase in weight between the 70th and 100th day of life, as a percentage of birth weight.
(Figure: scatterplot of the percentage increase in weight between days 70 and 100 against birth weight.)
Analysis of Variance
SOURCE DF SS MS F p
Regression 1 7666.4 7666.4 24.20 0.000
Error 30 9502.1 316.7
Total 31 17168.5
1. Is there an association between birth weight and % weight increase in the 70th to 100th
day?
When the explanatory variable takes only two values (e.g., male/female), we use a two-sample
t-test and associated methods. The interpretation is similar to the paired t-test used in the
previous section.
• The p-value gives the probability of the group means being as different as was observed
if there was no real difference between the groups.
• The 95% confidence intervals contain the true difference between the means of the two
groups with probability 0.95.
Data: returns for 30 stocks listed on NASDAQ and NYSE for 9–13 May 1994.
We look at the absolute return in the price of each stock. This is a measure of volatility: for
example, a market where stocks average a weekly 10% change in price (positive or negative) is
more volatile than one which averages a 5% change.
Numerical summaries:
NASDAQ NYSE
Min. :0.00380 Min. :0.00260
1st Qu.:0.01745 1st Qu.:0.01120
Median :0.03930 Median :0.02480
Mean :0.04395 Mean :0.02913
3rd Qu.:0.05575 3rd Qu.:0.04010
Max. :0.12240 Max. :0.08910
Analysis of Variance Table
Response: absreturn
Df Sum Sq Mean Sq F value Pr(>F)
exchange 1 0.003293 0.003293 4.0405 0.04908 *
Residuals 58 0.047270 0.000815
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Conclusion
There is some evidence (but not very strong evidence) that the NASDAQ is more volatile
than the NYSE.
When we have a categorical explanatory variable with more than two categories, it is natural
to ask which categories differ from each other. For example, if the variable is “Day of the
week”, are all days different from each other, are weekends different from weekdays, or is it
something more complicated?
We will use data on the number of calls to a Melbourne call centre. Download the data from
http://www.robhyndman.info/downloads/Calls.xls
The variable Calls gives the total number of calls each day. The variable Trend gives the
smooth trend through the data eliminating the effect of daily fluctuations.
1. Produce a time plot of the data over time with the trend on the same graph. Can you
explain the fluctuations in the trend?
2. Calculate the percentage deviation from the trend for each day.
3. Compute summary statistics and boxplots for the deviations for each day.
4. Use an ANOVA test to check that the percentage deviations for each day are significantly
different from each other.
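Since the Calls.xls data are not reproduced here, the sketch below simulates data with a built-in day-of-week effect, then carries out steps 2 and 4 from first principles: compute the percentage deviations from the trend and the one-way ANOVA F statistic across days. All numbers are invented.

```python
import random
from collections import defaultdict

random.seed(1)
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
effect = {"Mon": 8, "Tue": 2, "Wed": 0, "Thu": -2, "Fri": -8}   # made-up day effects (%)

# Simulate calls around a smooth trend and compute percentage deviations (step 2).
deviations = defaultdict(list)
for week in range(20):
    for d in days:
        trend = 1000 + 2 * week
        calls = trend * (1 + (effect[d] + random.gauss(0, 1)) / 100)
        deviations[d].append(100 * (calls - trend) / trend)

# One-way ANOVA F statistic for the day groups (step 4).
all_dev = [x for g in deviations.values() for x in g]
grand = sum(all_dev) / len(all_dev)
means = {d: sum(g) / len(g) for d, g in deviations.items()}
ss_between = sum(len(g) * (means[d] - grand)**2 for d, g in deviations.items())
ss_within = sum((x - means[d])**2 for d, g in deviations.items() for x in g)
df_b = len(days) - 1
df_w = len(all_dev) - len(days)
F = (ss_between / df_b) / (ss_within / df_w)   # a large F means the days differ
```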
Factor analysis is most useful as a way of combining many numerical explanatory variables
into a smaller number of numerical explanatory variables.
Basic idea:
• You try to uncover some underlying, but unobservable quantities called “factors”. Each
variable is assumed to be (approximately) a linear combination of these factors.
• For example, if there are two factors called F1 and F2 , then the ith observed variable Xi
can be written as
Xi = b0 + b1 F1 + b2 F2 + error.
The coefficients b0 , b1 and b2 differ for each of the observed variables.
• The factors are assumed to be independent of each other.
• The factors are chosen so they explain as much of the variation in the observed variables
as possible.
• The factor loadings are the values of b1 and b2 in the above equation.
• Principal components analysis is the usual method for estimating the factors.
• The estimated factors (or scores) can be used as explanatory variables in subsequent
regression models.
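A minimal sketch of the principal components approach, on simulated data: variables are generated from two latent factors plus noise, and the components are recovered from the eigen-decomposition of the correlation matrix. This is a simplification of full factor analysis (no rotation, no uniqueness estimates).

```python
import numpy as np

# Simulate five variables driven by two underlying factors plus noise.
rng = np.random.default_rng(7)
n = 500
F = rng.normal(size=(n, 2))                      # two latent factors
B = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
              [0.2, 0.8], [0.5, 0.5]])           # loadings (made up)
X = F @ B.T + 0.3 * rng.normal(size=(n, 5))      # observed variables

# Principal components of the correlation matrix.
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                # largest eigenvalues first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Proportion of total variance explained by the first two components.
prop_var = eigvals[:2].sum() / eigvals.sum()

# Factor scores: project the standardised data onto the first two eigenvectors.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scores = Z @ eigvecs[:, :2]
```

The scores matrix holds one value per observation per factor, which is exactly what would be carried forward into a subsequent regression.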
Part 8. Dimension reduction
The data on national track records for men are listed in the following table
Correlation matrix:
Figure 8.1: Scatterplot matrix of national track record data. All data in average metres per second.
Loadings:
Factor1 Factor2
X100m 0.275 0.918
X200m 0.379 0.886
X400m 0.546 0.736
X800m 0.684 0.623
X1500m 0.799 0.527
X5000m 0.904 0.382
X10000m 0.911 0.387
Marathon 0.914 0.271
Factor1 Factor2
SS loadings 4.108 3.205
Proportion Var 0.513 0.401
Cumulative Var 0.513 0.914
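The summary rows in this output can be reproduced directly from the loadings: “SS loadings” is the sum of squared loadings down each column, and “Proportion Var” divides that by the number of variables (here 8).

```python
import numpy as np

# Loadings copied from the output above (X100m ... Marathon).
loadings = np.array([
    [0.275, 0.918], [0.379, 0.886], [0.546, 0.736], [0.684, 0.623],
    [0.799, 0.527], [0.904, 0.382], [0.911, 0.387], [0.914, 0.271],
])
ss_loadings = (loadings**2).sum(axis=0)      # approx [4.108, 3.205]
prop_var = ss_loadings / loadings.shape[0]   # approx [0.513, 0.401]
cum_var = prop_var.cumsum()                  # approx [0.513, 0.914]
```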
(Figure: plot of the factor scores, Factor2 against Factor1, for each country; the Cook Islands stands out as an outlier.)
Stock price data for 100 weekly rates of return on five stocks are listed below. The data were
collected for January 1975 through December 1976. The weekly rates of return are defined as
the change in closing price from one Friday to the next, divided by the previous Friday’s closing
price.
Loadings:
Factor1 Factor2
Allied Chemical 0.683 0.192
Du Pont 0.692 0.519
Union Carbide 0.680 0.251
Exxon 0.621
Texaco 0.794 -0.439
Factor1 Factor2
SS loadings 2.424 0.567
Proportion Var 0.485 0.113
Cumulative Var 0.485 0.598
Factor 1 is almost equally weighted across the five stocks and therefore indicates an overall
measure of market activity. Factor 2 represents a contrast between the chemical stocks (Allied
Chemical, Du Pont and Union Carbide) and the oil stocks (Exxon and Texaco). Thus it measures
an industry-specific difference.
Figure 8.3: Time series of five stocks between January 1975 and December 1976.
The following results are from a survey of students’ excuses for not sitting exams.
United States France Britain
Dead grandparent 158 22 220
Car problem 187 90 45
Animal trauma 12 239 8
Crime victim 65 4 125
Do different nationalities have different excuses?
Part 9. Data analysis with a categorical response variable
Example: Snoozing
Dead       3      11     14
expected   8.07   5.93
Total      53     39     92
Logistic regression is used when the response variable is categorical with two categories (e.g.,
Yes/No). The model allows the calculation of the probability of a “Yes” given the set of ex-
planatory variables.
Multinomial regression is a regression model where the response variable is categorical with
more than two categories.
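A sketch of the logistic model itself, with invented coefficients b0 and b1 rather than values estimated from data:

```python
import math

# The logistic regression model: the probability of a "Yes" response is
#   p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk))).
# The coefficients below are made up for illustration only.
def prob_yes(x, b0=-2.0, b1=0.8):
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

# The probability rises with x but always stays strictly between 0 and 1,
# which is why logistic regression is preferred to ordinary regression
# for a two-category response.
p_low, p_high = prob_yes(0), prob_yes(10)
```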
Useful reference
• KLEINBAUM, D.G., and KLEIN, M. (2002) Logistic regression: a self-learning text, 2nd ed,
Springer-Verlag.
1. Repeat the examples in Section 9.1 using SPSS to find the p-values.
2. In a study of health in Zambia, people were rated as having ‘good’, ‘fair’ or ‘poor’ health.
Similarly, the economy of the village in which each person lived was rated as ‘poor’, ‘fair’
or ‘good’. For the 521 villagers assessed, the following data were observed.
Health
Village Good Fair Poor Total
Poor 62 103 68 233
Fair 50 36 33 119
Good 80 69 20 169
Total 192 208 121 521
(a) Find a 95% confidence interval for the proportion of poor villages in Zambia.
(b) Use SPSS to carry out a chi-squared test for independence on these data.
(c) Explain in one or two sentences how these data differ from what you would expect
if health and village were independent.
(d) Do these data show that economic prosperity causes better health? Explain in one
or two sentences.
(e) Consider now only people from poor villages. What proportion of these people
have health that is rated less than good? Give a 95% confidence interval for this
proportion.
(f) An alternative approach to this problem would have been to measure health numer-
ically for each person. What sort of analysis would have been most appropriate in
that case?
I use a decision tree based on the type of response variable and the type of explanatory vari-
able(s).
Response variable: measures the outcome of a study. Also called dependent vari-
able.
Explanatory variable: attempts to explain the variation in the observed outcomes.
Also called independent variables.
Many statistical problems can be thought of in terms
of a response variable and one or more explanatory
variables.
• Study of level of stress-related leave amongst Australian small business em-
ployees.
– Response variable: No. days of stress-related leave in fixed period.
– Explanatory variables: Age, gender, business-type, job-level.
• Return on investment in Australian stocks.
– Response variable: Return
– Explanatory variables: industry, risk profile of company, etc.
Part 10. A survey of statistical methodology
Other methods
Time series
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 4.2986 0.7063 6.0864 0.0000
tegelpc 0.0168 0.0079 2.1276 0.0350
A method for predicting a response variable given a set of explanatory variables, using a “tree”
constructed from binary splits of the explanatory variables. The response variable can be
categorical or numerical, and so can the explanatory variables.
Part 11. Further methods
9. Admission qualification: the honors level of each student’s admission qualification. This
will be first class honors (H1), second class honors level A (H2A) or second class honors
level B (H2B). Note that Masters candidates were classified as either H1, H2A or H2B
equivalent.
10. Publications: whether a student had any publications when admitted to enter the course.
The tree is constructed by recursively partitioning the data into two groups at each step. The
variable used for partitioning the data at each step is selected to maximize the differences be-
tween the two groups. The splitting process is repeated until the groups are too small to be
split into significantly different groups.
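The core of one partitioning step can be sketched in a few lines: for a numerical explanatory variable, try every cut point and keep the one that makes the two groups most different (equivalently, the one with the smallest within-group sum of squares of the response). The variable names and data below are invented.

```python
# Sketch of a single binary split, the building block of a classification
# and regression tree.
def best_split(x, y):
    """Return the cut point on x giving the smallest within-group SS of y."""
    best_cut, best_ss = None, float("inf")
    for cut in sorted(set(x))[1:]:          # candidate cuts between distinct x values
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        ss = sum((yi - sum(g) / len(g))**2 for g in (left, right) for yi in g)
        if ss < best_ss:
            best_cut, best_ss = cut, ss
    return best_cut

# Toy data: completion (1 = completed) drops sharply for enrolment ages of 23+.
age = [20, 21, 21, 22, 23, 24, 25, 26]
comp = [1, 1, 1, 1, 0, 0, 1, 0]
split = best_split(age, comp)   # separates ages below 23 from 23 and over
```

A full tree algorithm applies this search across all explanatory variables, keeps the best split, and then recurses on the two resulting groups.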
Classification tree
(Figure 11.1: classification tree for the completion rate. Root node: completion probability 0.6612, n = 428. The first split is on Faculty: Arts, ArtDesign, Education and Engineering (0.5226, n = 199) versus BusEco, IT, Law, Medicine, Pharmacy and Science (0.7817, n = 229). The left branch splits again on Faculty, with Arts at 0.472 (n = 125); the right branch splits on Publication: No (0.7456, n = 169) versus Yes (0.8833, n = 60).)
Figure 11.1 shows a classification and regression tree for the completion rate. Only six of the
ten variables were significant and used in the tree construction. These were:
• faculty
• age on enrolment
• admission qualification
• admission degree
• international student
• prior publication
On this tree, the variable used to split the data set is shown at each node. At each leaf or
terminal node, the class split at the upper node is displayed with its completion probability.
Note that although there were 445 students in the data set, 11 students had their admission
qualifications missing, 2 students had their admission degrees missing and a further four stu-
dents had other data missing. These students were all removed from the data to be able to
estimate the tree structure. So there are 428 observations used in the tree.
The following conclusions can be drawn from this analysis:
• The most important variable is Faculty, with BusEco, IT, Law, Medicine, Pharmacy and
Science students having a higher completion probability than students from other fac-
ulties. Arts, in particular, had a lower completion probability than the other faculties.
• For Arts students, the next most important variable was age, with young students (en-
rolment age less than 22 years) having a much higher completion rate than older students.
Among the older Arts students, international students performed better.
• For students from BusEco, IT, Law, Medicine, Pharmacy and Science, the situation is more
complex. Students with a publication had a higher completion rate, especially if they also
had a Masters degree. Students without a publication did well if they had an H2A entry
rather than an H1 entry. For students with no publication and an H1 entry, the older
students (enrolment age greater than 23 years) did the worst.
– Arts students with completion probability of 0.47. Of these, students aged 22 or more
on enrolment had completion probability of 0.45 (and only 0.41 for non-international
students).
– BusEco, IT, Law, Medicine, Pharmacy and Science students with a publication had
completion probability of 0.88 (and 100% for Masters students with a publication).
– Law, Pharmacy and Medicine students without a publication and over 23 on enrol-
ment had a completion probability greater than 0.9.
Further reading
• BREIMAN, L., FRIEDMAN, J., OLSHEN, R., and STONE, C. (1984) Classification and regres-
sion trees, Wadsworth & Cole: Belmont, CA.
Sets of linear equations used to specify phenomena in terms of presumed cause-and-effect vari-
ables. Some variables can be unobserved.
Further reading
• SCHUMACKER, R.E., and LOMAX, R.G. (1996) A beginner’s guide to structural equation mod-
eling, Hillsdale, N.J.: Lawrence Erlbaum Associates.
• KLINE, R.B. (1998) Principles and practice of structural equation modeling, Guilford: New
York.
These are models of time series data and are usually designed for forecasting. The most com-
mon models:
• exponential smoothing;
• ARIMA (or Box-Jenkins) models;
• VAR models (for modelling several time series simultaneously).
Further reading
• MAKRIDAKIS, S., WHEELWRIGHT, S., and HYNDMAN, R.J. (1998) Forecasting: methods
and applications, John Wiley & Sons: New York. Chapters 4 and 7.
Replacements for t-tests and ANOVA when the data are not normally distributed.
Further reading
• GIBBONS, J.D., and CHAKRABORTI, S. (2003) Nonparametric statistical inference, CRC
Press.
• Give only as many decimal places as are accurate, meaningful and useful.
• Use horizontal or vertical lines to help the reader make the desired comparisons.
Part 12. Presenting quantitative research
12.2 Graphics
Some graphs mislead by showing a small section of data rather than the whole context in which
the data lie. A decrease in sales over a few months may be part of a long decreasing trend or a
momentary drop in a long increasing trend.
We are accustomed to interpreting graphs with the dependent variable (e.g. sales) on the ver-
tical axis and the independent variable (e.g. time) on the horizontal axis. Flouting convention
can be misleading.
Some graphs attempt to show data over a wide range by using a broken axis. This also can be
misleading. If the data range is too wide for the graph, the data are probably on the wrong scale.
• Look carefully at the axis scales
• Show context
• Keep dependent variable on vertical axis
• Avoid broken axes
Graphical competence demands three quite different skills: the substantive, sta-
tistical, and artistic. Yet now most graphical work, particularly at news publica-
tions, is under the direction of but a single expertise—the artistic. Allowing artist-
illustrators to control the design and content of statistical graphics is almost like
allowing typographers to control the content, style and editing of prose. Substan-
tive and quantitative expertise must also participate in the design of data graphics,
at least if statistical integrity and graphical sophistication are to be achieved.
E.R. Tufte
Cleveland’s conclusions are that there is an ordering in the accuracy with which we carry out
these tasks. The order, from most accurate to least accurate, is:
1. Position along a common scale
2. Position along identical, non-aligned scales
3. Length
4. Angle and slope
5. Area
6. Volume
7. Colour hue, colour saturation, density
Some of the tasks are tied in the list; we don’t have enough insight to determine which can be
done more accurately.
This leads to the basic principle:
Encode data on a graph so that the visual decoding involves tasks as high as
possible in the ordering.
There are some qualifications:
• It’s a guiding principle, not a rule to be slavishly followed;
• Detection and distance have to be taken into account; they may sometimes override the
basic principle.
This principle implies the following insights. The list is not systematic; it simply gives a number
of examples of what follows from the principle.
1. Pie charts are not a good method of graphing proportions because they rely on comparing
angles rather than distance. A better method is to plot proportions as a bar chart or dot
chart. It is also easier to label a bar chart or dot chart than a pie chart.
2. Categorical data with a categorical explanatory variable are difficult to plot. A common
approach is to use a stacked bar chart. The difficulty here is that we need to compare
lengths rather than distances. A better approach is the side-by-side bar chart which leads
to distance comparisons. Ordering the groups can assist making comparisons. However,
side-by-side bar charts can become very cluttered with several group variables.
3. Time series should be plotted as lines with time on the horizontal axis. This enables
distance comparisons, emphasises the ordering due to time and allows several time series
to be plotted on the same graph without visual clutter.
4. If a key point is represented by a changing slope, consider plotting the rate of change
itself rather than the original data.
5. Think of simplifications which enhance the detection of the basic properties of the data.
6. Think of how the distance between related representations of data affects their interpretation.
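The first insight above (plot proportions as an ordered dot chart rather than a pie chart) can be sketched in Python; the category names and proportions are invented for illustration:

```python
# Draw proportions as a text-mode dot chart, ordered from largest to
# smallest, so the reader compares positions along a common scale
# instead of comparing angles in a pie.
def dot_chart(data, width=40):
    lines = []
    for name, p in sorted(data.items(), key=lambda kv: kv[1], reverse=True):
        lines.append(f"{name:<8}{'.' * round(p * width)} {p:.0%}")
    return "\n".join(lines)

shares = {"Brand A": 0.42, "Brand C": 0.27, "Other": 0.18, "Brand B": 0.13}
print(dot_chart(shares))
```

Ordering the categories by value, as `sorted(...)` does here, is the same advice given for side-by-side bar charts: it makes the comparisons the reader wants to draw immediate.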
Friendly versus unfriendly graphics:
• Friendly: words are spelled out; mysterious and elaborate encoding is avoided.
  Unfriendly: abbreviations abound, requiring the viewer to sort through text to decode them.
• Friendly: words run from left to right, the usual direction for reading occidental languages.
  Unfriendly: words run vertically, particularly along the Y-axis, or in several different directions.
• Friendly: little messages help explain the data.
  Unfriendly: the graphic is cryptic, requiring repeated references to scattered text.
• Friendly: elaborately encoded shadings, cross-hatching and colors are avoided; instead, labels are placed on the graphic itself, so no legend is required.
  Unfriendly: obscure codings require going back and forth between the legend and the graphic.
• Friendly: the graphic attracts the viewer and provokes curiosity.
  Unfriendly: the graphic is repellent, filled with chartjunk.
• Friendly: colors, if used, are chosen so that the color-deficient and color-blind (5 to 10 percent of viewers) can make sense of the graphic (blue can be distinguished from other colors by most color-deficient people).
  Unfriendly: the design is insensitive to color-deficient viewers; red and green are used for essential contrasts.
• Friendly: type is upper-and-lower case, with serifs.
  Unfriendly: type is all capitals, sans serif.
References
• CLEVELAND, W.S. (1985). The elements of graphing data, Wadsworth.
• CLEVELAND, W.S. (1993). Visualizing data, Hobart Press.
• TUFTE, E.R. (1983). The visual display of quantitative information, Graphics Press.
• TUFTE, E.R. (1990). Envisioning information, Graphics Press.
[Figure: an example spreadsheet-style chart showing East and West series plotted by quarter (1st Qtr to 4th Qtr).]
The good news is that graphical capabilities are now