

qxd 11/17/2007 5:40 PM Page 23




Advances in clinical psychology critically depend on methods researchers use for investigating the causal relationships among variables. Research questions commonly include issues about the putative mechanisms associated with various forms of psychopathology (e.g., do dysfunctional beliefs cause depression?) and questions about the effects of treatments (e.g., does cognitive therapy cause a greater reduction in eating disorder symptoms than does placebo?). How do we go about drawing objective conclusions about causality? This is a question about whether the manipulation of one variable (i.e., the independent variable) has effects on another variable (i.e., the dependent variable). The answer to this question is more complex than it might seem. Consider the following common scenarios.

Scenario 1: Dr. Smith is a well-known local advocate of a controversial form of psychotherapy. He claims that it works faster and more powerfully than all other treatments. For many years he has practiced this treatment and trained other clinicians through workshops, even though there were no scientific data on its efficacy. When challenged on this point, Dr. Smith retorted that he had seen all the proof he needed with his own eyes. That is, he claimed his patients almost invariably benefited from his therapy. When a treatment outcome study was conducted and published by an independent group of investigators, it was found that the psychotherapy used by Dr. Smith was no more effective than giving patients a placebo. Dr. Smith's response was that there must have been something wrong with the research; the researchers didn't include patients seen in the real world.

Scenario 2: The interns and their clinical supervisors gathered in the seminar room for the weekly journal club. This week's article, which recently appeared in a leading journal, described an experimental investigation of the effects of cognitive factors (expectations of disapproval) on social anxiety. Research participants led to expect high disapproval (from an experimental confederate) experienced more social anxiety than participants who were led to expect no disapproval. The investigators concluded that expectations of disapproval can cause social anxiety and likely play a role in clinically severe conditions such as social anxiety disorder. Despite the many methodological strengths of the study, the participants of the journal club quickly searched out the methodological weaknesses. One of the interns raised a particularly important



question about the generalizability of the study. Her observation was met with nods of approval from her supervisors. By the time the journal club had finished, all the attendees had convinced themselves that the study was fatally flawed. Some attendees even left the meeting with the impression that psychological research, even research published in leading journals, is largely a waste of time.

This chapter is written largely in response to these two types of scenarios, which we have encountered time and again. The reactions illustrated in these scenarios retard the scientific progress of clinical psychology. The first scenario raises many issues regarding internal and external validity. Dr. Smith claims that all his patients benefit from his treatment. Yet his series of case observations has many problems of internal validity. His dismissal of a recent study of his psychotherapy raises the issue of external validity. The journal club scenario raises the question of what we can conclude from research that does not have perfect internal and external validity.

The important issues raised in these scenarios are the focus of the remainder of this chapter. We will begin by defining internal validity and illustrating the various threats to it. Some studies, such as those using quasi-experimental designs, are widely used in clinical research, despite their imperfect internal validity. After reading this chapter, you should have a good understanding of why such studies are used and why they are useful. After discussing internal validity, we then examine the concept of external validity (generalizability) and consider the relationship between internal and external validity. As you will see, there are some research situations in which high external validity is vital and other situations in which it is not a priority. Finally, we will conclude with some comments about how scientific knowledge can be advanced even though most research studies have imperfect internal or external validity. Throughout this discussion we will consider a number of commonly used experimental designs that have been developed to deal with issues of internal and external validity. Our discussion of these designs will be illustrative rather than comprehensive. Detailed discussions of experimental and quasi-experimental designs are available elsewhere (e.g., Asmundson, Norton, & Stein, 2002; Barlow & Hersen, 1984; Campbell & Stanley, 1970; Cook & Campbell, 1979; Onghena & Edgington, 2005).

INTERNAL VALIDITY

Internal validity is the degree to which observed changes in a dependent variable can be attributed to changes in an independent variable. Thus, internal validity is a matter of degree (e.g., high, medium, low) rather than one of presence or absence. The researcher's confidence in his or her findings is proportionate to the strength of internal validity of the research design (Finger & Rand, 2003). True experiments are designs that have strong internal validity; that is, participants are randomized to experimental conditions, and other means are used to ensure that changes in the dependent variable can be attributed to the experimental manipulation of the independent variable. Quasi-experimental designs have weaker internal validity, as we will illustrate later. There are several types of threat to internal validity (Cook & Campbell, 1979; Finger & Rand, 2003; Rosenthal, 2002), including

history,
maturation,
testing,
instrumentation,
statistical regression,
attrition,
selection,
interactions with selection,
diffusion or imitation of treatments,
compensatory equalization of treatments, and
experimenter expectancy.

Each of these threats to internal validity is defined and illustrated in the following sections.

History

Description. When changes in the dependent variable are due to some extraneous event that takes place between pre- and posttest, it makes it difficult to determine whether the results were due to the experimental manipulation (i.e., changes in the independent variable) or to the extraneous event. In some research, such as a short study of memory, this threat can be controlled by shielding participants from outside influences during the study (e.g., testing them in a quiet lab) or by choosing dependent variables that could not plausibly have been affected by

Internal and External Validity in Clinical Research 25

outside influences (Cook & Campbell, 1979). In treatment studies, which may take the participant several weeks to complete, these methods may not effectively control for the effects of history. Other methods are more often used, such as random assignment of participants to an experimental group or a control group. In a therapy study, the latter might be a waiting list control in which the participants are simply assessed twice, with the retest interval being matched to the duration of the treatment study. Participants in the treatment condition would be similarly assessed twice, before and after treatment.

Example. Midway through an uncontrolled study of treatments of driving phobia, a well-known celebrity was killed in a car accident, thereby inflating the fears of the treatment participants. As a result, their posttreatment scores on a measure of driving fear tended to be higher than their pretreatment scores, thereby giving the misleading impression that treatment worsened their phobias. Inclusion of a waiting list control group would have demonstrated the impact of this event on those with driving phobia who were not receiving treatment.

Maturation

Description. Change in the participant over the course of time, where such change is not the focus of interest of the research study. This may involve growth (e.g., getting smarter or stronger) or decline (e.g., dementing). This threat can be addressed by using a control group.

Example. A drug company treated 20 elderly people in the early stages of Alzheimer's disease with a new medication for depression. The investigators concluded that the drug was effective in alleviating depression in this population. However, they failed to realize that depression naturally remitted for many patients as their dementia worsened. That is, as their memories became worse, the participants no longer had insight into the fact that they were dementing and so were no longer depressed about this problem. Inclusion of a control group (in this case, one receiving a placebo pill) would have allowed the researchers to assess natural changes in depressive symptoms associated with increasing dementia.

Testing

Description. The reactive effects of testing where the very act of assessment influences the variable under investigation. Some measures are highly reactive, whereas other measures are largely unreactive. Also, repeated testing can increase familiarity with the test, which might bias scores. This threat can be dealt with in various ways, such as by selecting unreactive measures (e.g., unobtrusive observation) or by including a control group.

Example. In studies of smoking, the act of monitoring one's use of cigarettes affects the frequency of smoking. That is, self-monitoring may help some people realize how much they smoke, thereby motivating them to cut down. To illustrate the effects of test familiarity, consider a study in which tests of intelligence are administered on multiple occasions. With repeated testing, participants may become better at some tests (e.g., the digit-symbol subtest of the Wechsler Adult Intelligence Scale) simply because they have learned the correct responses (e.g., the symbol that goes with each digit) as a result of repeated testing.

Instrumentation

Description. When an effect is due to a change in the measuring instrument from pre- to posttest rather than due to the manipulation of the independent variable. Instrumentation can affect all forms of measurement, including observers, self-report tools, interview schedules, and devices that measure physiological processes.

Example. In observational studies, progressive fatigue of observers who are coding various types of marital interaction can impact their rating accuracy. With increasing fatigue, the observers may be less likely to detect subtle interaction patterns. Using one scale at pretest (e.g., the first edition of the Beck Depression Inventory) and another edition at posttest (e.g., the second edition of this inventory) might suggest a change in depressive symptoms where there was no change. Similarly, in a study measuring physiological reactivity to stress before and after stress inoculation training, changes in equipment calibration might falsely indicate or mask a treatment effect.

Statistical Regression

Description. People selected for extreme scores (very high or very low) will have less extreme scores when they are retested on the same or related variables. Why does regression occur? The


farther a score is from the mean, the more extreme it is. The more extreme the score, the rarer it is and the more likely it is to have been the result of a very rare combination of factors. If these factors are temporally unstable, then statistical regression will occur (Furby, 1973). Statistical regression is always toward the population mean of the group. Its magnitude is greater when the test-retest reliability of a measure is low (indicating that scores are readily influenced by chance factors) and when a person's score is extreme, relative to the mean of the population from which the person was chosen (Cook & Campbell, 1979). Regression effects will not be a threat if assessment methods are chosen that are virtually error free or uninfluenced by random factors (e.g., measuring a person's height; Finger & Rand, 2003). It is important to note that statistical regression effects can be due to psychologically substantive phenomena and should not be automatically dismissed out of hand as statistical artifacts (Taylor, 1994). Regression might be either noise or the phenomenon of interest, depending on one's research goals.

Examples. A gambling researcher screened a large group of students in order to identify people who could be classified as heavy gamblers, as measured by a questionnaire. When these people presented to the lab to participate in the experiment, they completed the questionnaire a second time. To the researcher's chagrin, many of the participants no longer had extreme scores on the questionnaire and had to be excluded from the study. Another example concerns uncontrolled treatment studies, in which a group of patients are selected on the basis of extreme scores on some measure (e.g., scores on a perfectionism scale), and then receive an intervention (e.g., a treatment for excessive perfectionism) as well as a posttest. Statistical regression may occur, resulting in what appears to be a treatment effect (i.e., a decline in scores from pre- to posttest). The solution to this problem is to include a control group. Note that statistical regression is unlikely to occur if participants are selected because they have persistently elevated scores, such as people with chronically high scores on a measure of anxiety (i.e., high trait anxiety). Such people are unlikely to show statistical regression because this phenomenon is due to transient factors that produce elevations in scores (e.g., a near-miss while driving to the lab to participate in an experiment would transiently increase one's anxiety).

Attrition

Description. The loss of participants from a study (e.g., due to mortality or treatment dropout). This threatens internal validity if attrition is not random: for instance, if attrition is greater in one experimental condition than another, or if particular participants are most likely to drop out of the study. Attrition can cause serious problems for clinical researchers by introducing biases into an experiment. There are various methodological and statistical procedures for limiting, evaluating, and correcting for attrition (see Flick, 1988). However, there are circumstances in which attrition can render the results uninterpretable, as illustrated in the following example.

Example. A residential treatment center reported that 80% of patients with anorexia nervosa were much improved or greatly improved after completing the program. Unfortunately, the results were biased because 20% of patients did not complete the program, and no outcome data were available for them. Some withdrew because they benefited quickly from treatment and felt that they no longer needed to be in the program. Others dropped out because they failed to benefit. Some severely anorexic patients had either died or withdrawn from the clinic and were admitted to hospital. Given the large proportion of treatment dropouts and the uncertainty about whether treatment completers differed, as a group, from treatment dropouts, it was not possible to draw any legitimate conclusions from the treatment study.

Selection

Description. When the effects on the dependent variable arise from differences in the kinds of people in the experimental groups. Selection effects are pervasive in quasi-experimental designs (Cook & Campbell, 1979). These are among the most widely used designs in clinical psychology, in which a target group is compared with one or more control groups (e.g., a group of healthy people, a group with another psychopathology). Attempts are made to match the groups on background variables (e.g., demographics), and then they are compared on the variables of interest. However, assignment of participants to groups (e.g., target group vs. healthy control group) is, by definition, nonrandom in quasi-experimental designs. One must remember

that the distinction between the target group and any control group is an observed, not experimentally manipulated, distinction.

Example. Many studies have investigated whether people with anxiety disorders, as compared with healthy controls, tend to selectively focus their attention on sources of threat in their environment (Mogg & Bradley, 1999). Although such studies have yielded a good deal of useful information and have stimulated a great deal of research, these studies are prone to selection effects. That is, although the clinical (target) and control groups were matched on many background variables, there is no guarantee that the differences between the groups (e.g., the threat-focused attention effects) were due to the presence versus absence of an anxiety disorder. The effects could have been due to other factors that were not assessed in the study. In these studies, selection effects are addressed in three ways. First, the plausibility of confounding factors is taken into consideration. Anxious patients and healthy controls could differ on an almost infinite range of factors. Some of these factors could confound the study of threat-focused attention (e.g., depression), while other factors are less plausible (e.g., the participant's zodiac sign). Second, researchers try to control for all the plausible confounding factors (e.g., all participants are asked to refrain from caffeine consumption on the day of testing; testing is done under conditions of normal or corrected-to-normal vision). Finally, if another confounding factor is subsequently identified (e.g., whether or not the person is taking antianxiety medication), then the study can be replicated, controlling for this factor.

Interactions With Selection. Many of the previously mentioned threats to internal validity can interact with selection to produce effects on the dependent variable that may be confused with effects due to the independent variable (Cook & Campbell, 1979). Examples include selection-history, selection-maturation, and selection-instrumentation effects. Selection-maturation interactions occur when the experimental groups mature at different speeds. Selection-history interactive effects occur when different experimental groups come from different settings, where each setting is associated with different histories. In a study designed to test the hypothesis that people with hypertension tend to be high in trait anger (i.e., anger proneness), for example, the hypertensive patients might tend to live in stressful environments, whereas normotensive controls might tend to live in relatively low stress environments. Thus, the different histories of the groups (i.e., differences in environmental stressors) might be responsible for any group differences in anger proneness, even if anger is unrelated to hypertension.

Diffusion or Imitation of Treatments

Description. When participants in the different experimental conditions can communicate with one another, such that participants in one condition learn about what happens in the other condition. This can undermine the differences between the experimental manipulations in each condition.

Example. In a study of the effects of stress (in the form of electric shock) on snake phobias, snake-fearful undergraduate psychology students were randomly assigned to one of two groups. Participants in each group were tested individually. Each participant was asked to walk up to a container housing a large, harmless snake and to touch it. Participants in the experimental group received a painful electric shock at a randomly determined point as they approached the snake. Participants in the control group experienced no shock as they approached the snake. Unfortunately, the students who had completed the experiment described their experiences to students who were soon to participate in the study. This contaminated the experimental manipulation because many of the people in the control group had heard about the electric shock. As they approached the snake they worried about getting a shock. Thus, the no-shock control condition was compromised. Possible solutions to the problem of diffusion of treatments are to ask participants not to discuss the experiment with other students (during the period in which the study is being conducted) or to use experimental designs in which diffusion is not an issue (e.g., in the snake example, one could explicitly inform the control participants that they have been allocated to a no-shock condition).

Compensatory Equalization of Treatments

Description. When participants learn that they have been assigned to an experimental condition where they won't receive the possible benefits received by participants in another experimental


condition. Participants may be reluctant to tolerate this inequality and thereby seek out the potential benefits.

Examples. In a study examining the effects of biofeedback for tension headaches, participants were randomly assigned to either biofeedback or a no-treatment (waiting list) control. Participants in the control group were aware that participants in the other group were receiving a potentially beneficial treatment. This perceived inequality prompted some people from the control group to seek out headache treatment while they were in the waiting list condition. This confounded the investigation of the effects of biofeedback. Another example concerns the use of pill placebo in drug studies. Participants in these studies are informed that they will be randomly assigned to receive either capsules containing the drug under investigation or capsules containing an inert substance (placebo). Unfortunately, in many drug studies, it is not difficult for patients to discover whether they have been assigned to the drug or placebo conditions, because drugs, unlike placebos, commonly produce side effects. Antidepressant medications, for example, may produce dry mouth or temporary jitteriness as side effects. Participants who discover that they are taking placebos may therefore seek out additional treatments during the experiment, thereby confounding the investigation of the effects of the drug. Alternatively, some participants who realize that they are taking a placebo may become further depressed about not getting the real treatment. A solution to such confounds is to use placebos that produce side effects. Some studies have taken this approach (Margraf et al., 1991).

Experimenter Expectancy

Description. A phenomenon whereby the participants' responses are influenced by expectations of the experimenter (or a proxy for the experimenter, such as a therapist or research assistant conducting a component of a study; Rosenthal, 2002). In other words, the participants' responses are shaped in the direction of the experimenter's expectations. These effects may be unintentional on the part of the experimenter (or therapist). This bias, sometimes known as the allegiance effect, can be circumvented by keeping the people running the experiment (e.g., therapists, research assistants) blind to the aims of the investigation and by using procedures to counter any expectation effects of therapists.

Examples. In studies of a novel treatment, compared with some standard treatment, therapists may be highly enthusiastic about the new treatment and less impassioned by the standard treatment. A similar problem was encountered in our recent randomized, controlled study of three treatments for post-traumatic stress disorder (PTSD): behavior therapy, relaxation training, and eye movement desensitization and reprocessing (EMDR; Taylor et al., 2003). Some therapists were enthusiastic advocates of behavior therapy, while others were equally enthusiastic about EMDR. To control for possible expectancy effects, we had two therapists deliver all three treatments. One therapist was an expert and advocate of EMDR, while the other had expertise in behavior therapy. Thus, the therapists had potentially opposite expectations. This design enabled us to assess whether these and other therapist factors influenced treatment outcome. In this study, the treatments differed in efficacy (behavior therapy tended to be most effective), whereas there were no differences in the efficacy of the therapists, and there was no treatment-by-therapist interaction.

Returning to the first scenario that opened this chapter, we can see that most of these threats to internal validity would apply to Dr. Smith's observations about the effects of his treatment. His conclusions that his treatment was highly effective were based on simple pre/post case studies, that is, on patients assessed before and after this therapy. These studies failed to control for history, maturation, and statistical regression. Attrition was also a problem. A number of patients started treatment with Dr. Smith but dropped out because they failed to benefit from therapy. Dr. Smith conveniently failed to include these patients in his appraisal of his treatment's efficacy.

Not all case studies suffer as much from threats to internal validity. Various single-case experimental designs (which are part of the family of quasi-experimental designs) have been developed to deal with these threats (e.g., Barlow & Hersen, 1984; Onghena & Edgington, 2005). To illustrate, as part of our research into

cognitive-behavioral therapy (CBT) of panic disorder, we conducted a case study of an unusual presentation, in which the patient's panic disorder appeared to arise from blood-injury reactivity (vasovagal dizziness and fainting in response to the sight of blood or injury; Anderson, Taylor, & McLean, 1996). The patient was initially treated with standard CBT for panic disorder (Taylor, 2000). Two years later, he relapsed when exposed unexpectedly to blood-injury stimuli. This led us to hypothesize that his blood-injury reactivity played a causal role in his panic disorder. To test this possibility, we provided the patient with another course of standard CBT for panic disorder. As before, he was no longer panicking after treatment. Then we asked the patient if we could expose him to blood-injury stimuli for one month (a videotape of injections and blood extractions). The thought of being exposed to such a tape stimulated blood-injury reactions (e.g., dizziness), which were followed by a relapse of his panic disorder. The next part of the case study involved treating the patient with applied tension, which is a specific treatment for blood-injury reactivity (Öst & Sterner, 1987). This treatment reduced his panic attacks and blood-injury reactivity. When he was reexposed to the videotape, he did not have any blood-injury reactions, his panic disorder did not return, and he was free of psychopathology at his four-month follow-up. This case study involved an ABABCB design, where A = the first and second courses of CBT, B = exposure to blood-injury stimuli, and C = treatment with applied tension. This design makes it unlikely that the results are due to threats to internal validity such as history or maturation.

There are many other types of single-case experimental designs, which can be used for other types of research questions (see Barlow & Hersen, 1984; Onghena & Edgington, 2005). Studies using single-case experimental designs are useful for studying unusual cases and for conducting preliminary evaluations of new treatments. These studies are insufficient in themselves for drawing strong conclusions, but they provide some indication of whether it is useful to conduct further investigations. Early case studies of CBT for panic disorder (e.g., Clark, Salkovskis, & Chalkey, 1985) provided encouraging results, which led researchers to conduct open (uncontrolled) trials to evaluate the treatment with more patients (e.g., Sokol, Beck, Greenberg, Wright, & Berchick, 1989) and to randomized, controlled trials (which have very strong internal validity; Barlow, Gorman, Shear, & Woods, 2000) in which CBT was compared to control conditions (e.g., waiting list or placebo) and to other treatments (e.g., imipramine). Thus, even though single-case experimental designs often have far-from-perfect internal validity, they can yield valuable information and thereby can advance our understanding of psychopathology and its treatment.

Drawing inferences, whether in quasi-experiments or experiments, is a matter of ruling out rival hypotheses (e.g., hypotheses about the role of threats to internal validity) that could account for the results. Randomizing participants to experimental and control groups can overcome many of the threats to internal validity. Random selection of participants and random allocation to experimental conditions ensure, within the limits of sampling error, that the sample is representative of the target population and that the samples in the experimental groups are comparable to one another in terms of the background features of the participants, such as demographics or other variables (Cook & Campbell, 1979).

Randomization doesn't control for some threats, such as diffusion of treatments, compensatory equalization of treatments, or experimenter expectancy. These threats can be overcome by other means, such as those mentioned earlier. For quasi-experimental designs, however, there is always some degree of threat to internal validity, such as the selection threat. Cook and Campbell (1979) offer the following guidelines about how to assess the degree of threat to internal validity.

Estimating the internal validity of a relationship is a deductive process in which the investigator has to systematically think through how each of the internal validity threats may have influenced the data. Then, the investigator has to examine the data to test which relevant threats can be ruled out. In all of this process, the researcher has to be his or her own best critic, trenchantly examining all of the threats he or she can imagine. When all of the threats can be plausibly eliminated, it is possible to make confident conclusions about whether a relationship is probably causal. (p. 55)
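
The way a randomized control group separates statistical regression from a genuine treatment effect can be illustrated with a small simulation. The following Python sketch is only an illustration under stated assumptions and is not an analysis from this chapter. The function name, the score scale (true scores with mean 50 and standard deviation 10), the amount of measurement noise, and the selection cutoff of 65 are all hypothetical choices. People are screened, those with extreme pretest scores are selected, and the selected people are randomly allocated to a "treatment" group or a control group. The simulated treatment has no effect at all.

```python
import random
import statistics


def simulate_regression_to_mean(n_screened=20000, cutoff=65.0, seed=0):
    """Select extreme scorers, randomize them, and retest with NO true treatment effect.

    All numbers here are hypothetical: each observed score is a stable true
    score (mean 50, SD 10) plus occasion-specific noise (SD 10).
    """
    rng = random.Random(seed)

    # Screening: keep only people whose observed pretest score is extreme.
    selected = []
    for _ in range(n_screened):
        true_score = rng.gauss(50.0, 10.0)
        pretest = true_score + rng.gauss(0.0, 10.0)
        if pretest >= cutoff:
            selected.append((true_score, pretest))

    # Random allocation of the selected people to two conditions.
    rng.shuffle(selected)
    half = len(selected) // 2
    groups = {"treatment": selected[:half], "control": selected[half:]}

    results = {}
    for name, members in groups.items():
        pre = [pretest for _, pretest in members]
        # Posttest: same true score, fresh noise, and no treatment effect added.
        post = [true_score + rng.gauss(0.0, 10.0) for true_score, _ in members]
        results[name] = (statistics.mean(pre), statistics.mean(post))
    return results


if __name__ == "__main__":
    for name, (pre_mean, post_mean) in simulate_regression_to_mean().items():
        print(f"{name}: pretest mean {pre_mean:.1f}, posttest mean {post_mean:.1f}")
```

Under these assumptions, both groups should show a clear drop from pretest to posttest even though nothing was done to either group. An uncontrolled pre/post study would misread that drop as improvement, whereas comparing the two randomized groups shows that the change is about the same with or without the "treatment."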


EXTERNAL VALIDITY A popular research strategy is to use ana-

logue samples. For example, students may be
External validity has to do with the generaliz- selected because of their high scores on a
ability of the research findings; to what extent measure of schizotypy for a study of variables
can the findings of an experiment or quasi- thought to be relevant to schizophrenia.
experiment be generalized to and across various Analogue studies have the advantage of having
populations, settings, and epochs? In the follow- strong internal validity (e.g., randomized
ing sections, we examine, in further detail, the assignment of schizotypal students to two or
major types of threats to external validity, the more experimental conditions). However, ana-
relationship between internal and external logue studies may also have important prob-
validity, and the situations in which we should lems with external validity. Can findings
(or shouldn't) be concerned with threats to obtained from schizotypal students who, for
external validity. Threats to external validity are example, report having some degree of magical
evaluated by tests of the extent to which one can thinking and perceptual aberration, be general-
generalize across various kinds of people, set- ized to people with schizophrenia?
tings, and times and are, in essence, tests of sta- Studies using clinical samples also may
tistical interactions (Cook & Campbell, 1979). encounter problems with external validity. Some
The major threats include three types of interac- treatment outcome studies, for example, may be
tions with the experimental condition that the highly selective in the patients that are enrolled.
participants are in. These are interactions with A study of the treatment of bulimia nervosa
selection, setting, and history. might only include patients if they agree to
suspend any other treatment they might be
receiving and remain on a stable dose of any
Interaction of Selection and psychotropic medication they might be receiv-
Experimental Condition ing. These research requirements have the
Description. This concerns the question of advantage of controlling for threats to internal
whether the findings from the selected group of validity, but they do raise questions about exter-
research participants can be generalized to other nal validity; that is, are the patients yielding
categories of people, such as people with other these clinical findings representative of patients
geographic or demographic features. typically seen in clinical practice? If the patients
Examples. A study comparing patients with are not representative, then the question arises
severe major depression with healthy controls as to whether the treatment findings can be gen-
might seek to match the participants on eralized to clinical practice in the real world.
demographic features. Many severely depressed These concerns with patient representativeness
patients are unable to work and are therefore and the use of analogue samples were raised in
unemployed, receiving welfare or disability the two scenarios that opened this chapter.
assistance. To match the patients with the con- Even when participants belong to the target
trols on demographic factors, the researcher population of interest, recruitment factors
might decide to include only unemployed con- might lead to threats to external validity (Cook
trol participants. While this strengthens the & Campbell, 1979). A researcher, for example,
internal validity of the study, it raises the ques- who is interested in studying conversion disor-
tion of whether the results can be generalized to der might recruit patients by placing advertise-
people from other levels of occupational func- ments in the local newspaper. This process of
tioning. If the results of the research study vary recruitment could possibly result in a sample of
across occupational levels, then there is an inter- people with conversion disorder that is unrep-
action between selection (in this case, occupa- resentative of people in general with this disor-
tional status) and experimental condition. This der. This threat to external validity can be
interaction threatens the external validity of the examined by comparing patients recruited
study. The only way to determine whether this from the newspaper to patients recruited by
threat exists is to determine whether the results other means (e.g., from physical referrals) to
vary with occupational status. This means that see whether the groups differ on relevant
further studies might be needed to better under- variables such as the type and severity of the
stand the external validity of the findings. conversion disorder.
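To make the idea of a selection-by-condition interaction more concrete, the following minimal sketch (in Python) computes the treatment effect separately within two occupational strata and then takes the difference between those effects. All scores, group labels, and variable names are invented for illustration and do not come from any study cited here. A contrast near zero would suggest that the effect is similar across strata; in practice, one would test the interaction formally (e.g., with a factorial analysis of variance) rather than rely on a descriptive difference.

```python
from statistics import mean

# Hypothetical symptom-improvement scores (higher = more improvement).
# Keys are (occupational status, experimental condition); all values are invented.
scores = {
    ("unemployed", "treatment"): [12, 14, 11, 13],
    ("unemployed", "control"): [6, 7, 5, 6],
    ("employed", "treatment"): [9, 8, 10, 9],
    ("employed", "control"): [6, 5, 7, 6],
}


def treatment_effect(status):
    """Mean difference (treatment minus control) within one occupational stratum."""
    return mean(scores[(status, "treatment")]) - mean(scores[(status, "control")])


effect_unemployed = treatment_effect("unemployed")
effect_employed = treatment_effect("employed")

# The interaction contrast is the difference between the stratum-specific effects.
# If it is clearly nonzero, the treatment effect depends on occupational status.
interaction_contrast = effect_unemployed - effect_employed

print("Effect among unemployed participants:", effect_unemployed)
print("Effect among employed participants:", effect_employed)
print("Interaction contrast:", interaction_contrast)
```

With these made-up numbers the treatment effect is larger among unemployed participants, which is the pattern that would limit generalization of a study restricted to that group.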

Interaction of Setting or Context and Experimental Condition

Description. The question of concern is whether findings obtained in one setting or context can be generalized to other settings.

Example. Research conducted at Harvard University suggests that people who claim to have been abducted by space aliens are more susceptible, compared to control groups, to forming false memories (McNally, 2003). But do these findings apply to purported alien abductees in general, including people from other educational levels or geographic locations? Alien encounters are commonly reported in Brazil, for example (Pulos & Richman, 1990). Are these people similarly subject to false memories? To answer this question, one may need to repeat the experiment in different settings.

A related threat concerns the novelty of an intervention (Finger & Rand, 2003). If a new treatment is evaluated, typically in a university or hospital research setting, the participant may be aware that he or she is receiving a novel treatment, and the therapist may be highly optimistic or enthusiastic about the intervention. The result obtained under such conditions might not generalize to other contexts, such as settings in which the treatment is no longer regarded as novel.

Interaction of History and Experimental Condition

Description. This concerns the question of whether the findings obtained today would apply to the past or future, or whether the findings would apply to people who had otherwise different histories.

Examples. Contemporary observations of the effects of traumatic stressors and quasi-experimental analogue studies of the effects of mildly disturbing events (e.g., medical students conducting their first human dissection) suggest that exposure to traumatic events leads to symptoms of PTSD, particularly persistent reexperiencing of the event (e.g., dreams or unwanted thoughts of the event). It has been debated as to whether these are timeless responses or whether they are simply a product of contemporary Western culture. In other words, it is unclear whether the finding that stress produces reexperiencing symptoms has strong external validity. There is some suggestion that there are important limits to the external validity (generalizability) of the findings; for example, many soldiers in World War I apparently responded to the trauma of war with symptoms of conversion disorders (e.g., hysterical paralysis or blindness) rather than by developing reexperiencing symptoms (e.g., Lerner, 2003).

Sometimes an experiment takes place in a very special epoch, such as during the weeks following the September 11, 2001, terrorist attacks in New York and Washington, D.C. The results of a study, for example, of college student stress around the time of September 11 might not apply to other periods, past or future. One needs to rely, in part, on common sense to determine whether the results of an experiment would generalize from one time period to another.

Even when circumstances are relatively more mundane, we still cannot logically extrapolate findings from the present to the future. Yet, while logic can never be satisfied, commonsense solutions for short-term historical effects lie in either replicating the experiment at different times . . . or in conducting a literature review to see if prior evidence exists which does not refute the causal relationship. (Cook & Campbell, 1979, p. 74)

Internal Versus External Validity. Internal validity often takes precedence over external validity, because one must first obtain an unambiguous finding before one can generalize the results. Accordingly, many studies in clinical psychology have high internal validity and lower (or unknown) external validity. To illustrate, in a study of memory functioning in generalized anxiety disorder (GAD), internal validity is improved if participants taking medication are excluded from the study. This is because some anxiolytic medications, such as benzodiazepines, may impair memory function. Excluding such participants improves internal validity, but it raises questions about external validity because it remains to be established that the results would apply to GAD patients who happen to be taking medication. Such patients might have clinically more severe GAD than unmedicated patients. This would be an important issue if the researcher hypothesizes that GAD arises from particular patterns of memory processing. By


excluding the more severe patients, it is not possible to determine whether the memory results can be generalized to more severe cases of the disorder.

When Does External Validity Matter? One should not automatically assume that it is important that a study has good external validity. We may not be so concerned with external validity if the focus of the investigation concerns what can happen, instead of what typically does happen (Mook, 1983). Thus external validity is less of a concern if the goal of one's research is to test predictions derived from theory or conjecture. Consider, for example, patients who report that they suddenly became aware of long-buried memories of childhood sexual abuse. The veracity of such recovered memories is highly controversial. Some clinicians argue that these are genuine memories that had been repressed and then retrieved. A number of researchers have argued that these are false memories, sometimes implanted by therapists using hypnosis, guided imagery, or other memory recovery techniques to get to the bottom of the patient's problems (for a review of this debate, see McNally, 2003). This debate raises the following question about the mechanisms of memory, which has been evaluated in several laboratory studies: Is it possible to implant a clearly false childhood memory using memory recovery techniques? Note that this is not an issue of "does it happen" but a question of "can it happen." The answer is yes. Analogue research using university students has shown that it is possible to lead the participants to recall something that, according to their parents, never happened to them, such as being savagely mauled by a dog (e.g., Porter, Yuille, & Lehman, 1999). Although such findings have relevance to the memory recovery controversy, the primary value of this type of research is to shed light on memory processes.

To determine whether external validity is important in a given research investigation, you need to consider the conclusion that you would like to make and whether your sample and research design will enable you to reach this conclusion. The following is a sample of questions that you might ask in deciding whether the usual criteria of external validity should even be considered (Mook, 1983):

Regarding the Sample. Am I trying to estimate from sample characteristics the characteristics of some population? Or am I trying to draw conclusions not about a population but about a theory that specifies what these participants ought to do? Or (as in the case of false memories) would it be important if any subject does, or can be induced to do, this or that?

Regarding the Setting. Is my intention to predict what would happen in a real-life setting or target class of such settings? The answer may be no if the aim is to test a prediction about what ought to happen in the experimental setting. In this situation, external validity is not an issue. If the answer is yes, then you need to consider whether it is necessary that the setting be representative.

Evaluating and Improving External Validity. There are several ways of evaluating and improving external validity. One approach is to try to ensure that the sample is representative of the target population. The deliberate inclusion of a heterogeneous sample can be used to determine if particular variables predict the results. If you are conducting a treatment outcome study, for example, and want to know whether the results vary with socioeconomic status (SES), then you could select patients from a range of different SES levels (using stratified random sampling) and determine whether SES predicts treatment outcome. Note that this approach often requires a large sample (e.g., n = 50 per treatment condition), so that sufficient numbers of participants from each SES level are in each treatment condition. Another approach is to conduct multiple studies across different subgroups, settings, or times. This provides a means of determining whether the findings are replicable.

Benchmarking studies can also be used as a means of evaluating external validity. These are investigations in which results from research conducted in tightly controlled laboratory situations (which have high internal validity and may have low external validity) are compared with results from field studies (which may have good external validity but lower internal validity). A recent meta-analysis compared results from lab and field studies across a range of research domains, including clinically relevant investigations such as studies of aggression or depression (Anderson, Lindsay, & Bushman, 1999). The investigators examined the correspondence between lab- and field-based effect sizes from studies using conceptually similar dependent and independent variables. The results

of field research tended to mirror the findings from lab research, suggesting that lab studies generally have good external validity.

To illustrate benchmarking research, several studies have been conducted in which treatment-outcome findings from a university-based specialty clinic (e.g., the Center for Anxiety and Related Disorders at Boston University) are compared to findings from community mental health clinics. The university-based research tended to have high internal validity, although the use of patient inclusion and exclusion criteria raised questions about the external validity of the findings. Patients are typically excluded from CBT studies if their doses of psychotropic medication are unstable or if they have particular comorbid disorders. Studies of panic disorder, for example, often exclude patients who have comorbid paranoid, schizotypal, or borderline personality disorders. Studies conducted in community clinic settings are more liberal in their inclusion criteria and more closely approximate routine clinical treatment that patients would receive. This means that these studies have good external validity but weaker internal validity. Benchmarking studies of major depression and panic disorder indicate that the results from community clinics are similar to those obtained in university clinics and that the patients from both settings are broadly similar in their pretreatment clinical characteristics, such as the severity and duration of their disorders (e.g., Merrill, Tolbert, & Wade, 2003; Wade, Treat, & Stuart, 1998). These findings indicate that tightly controlled treatment studies from university clinics have good external validity. Such studies address the concerns of critics like Dr. Smith from Scenario 1, who claimed that treatment research findings do not generalize to patients in the real world.

CONCLUSIONS: PERFECTING OUR

Few, if any, research studies are methodologically perfect. Some consumers of the research literature tend to throw out the baby with the bathwater; that is, if a study has a minor limitation, they tend to dismiss it entirely. This was the case for the attendees of the journal club discussed in Scenario 2. But is it really true that imperfect studies are worthless? If this were the case, then scientific progress would not be possible, neither in psychology nor in the other sciences. But what can we legitimately conclude from imperfect investigations? As in all areas of science, no single study in clinical psychology provides the final answer to an important research question. Science is a cumulative process, whereby different studies investigate the research questions in different ways, controlling for different factors. In other words, science progresses through the development of cumulative findings from programs of research (Lakatos & Musgrave, 1970). The overall pattern of findings that emerges across studies is the most important factor in answering important research questions.

The strength of internal and external validity of a study can help researchers evaluate the relative importance of that study in an overall program of research. If a study has very weak internal validity, then it may be given little or no consideration in evaluating what the corpus of research suggests about an important research question. A study might have several strengths but might have some noteworthy weaknesses. An analogue study of schizophrenia, for example, has the shortcoming of not using actual participants with the disorder. This is not a legitimate reason for dismissing the study altogether. The limitation simply raises another question to be answered in another study: If people who have features similar to schizophrenia (analogues) produce particular patterns of findings, then do people with full-blown schizophrenia show the same pattern of results? The analogue study may have high internal validity and lower external validity, whereas the field study (using actual patients with schizophrenia) would probably have lower internal validity (because it is difficult to control for all confounding factors when using clinical samples) but higher external validity. Together, the two types of studies com-

Internal and external validity are important issues in evaluating the merits of a study, but they are not the only considerations. Other important issues include the way the data are analyzed, the reliability and validity of the measures or manipulations used, and the statistical power of the design. Those issues are discussed elsewhere in this volume.
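The point that the overall pattern of findings across studies matters more than any single study, and the benchmarking logic of comparing lab-based with field-based effect sizes, can be illustrated with a small calculation. The sketch below (in Python) pools hypothetical standardized effect sizes with inverse-variance weights, a common fixed-effect approach, and then compares the pooled lab and field estimates. It is not the method used in any study cited in this chapter; every number and name in it is invented for illustration.

```python
from math import sqrt

# Hypothetical (effect size d, sampling variance) pairs; all values are invented.
lab_studies = [(0.60, 0.04), (0.50, 0.02), (0.70, 0.04)]
field_studies = [(0.50, 0.05), (0.55, 0.05)]


def pooled_effect(studies):
    """Return the inverse-variance weighted mean effect and its standard error."""
    weights = [1.0 / variance for _, variance in studies]
    total_weight = sum(weights)
    estimate = sum(w * d for w, (d, _) in zip(weights, studies)) / total_weight
    return estimate, sqrt(1.0 / total_weight)


lab_d, lab_se = pooled_effect(lab_studies)
field_d, field_se = pooled_effect(field_studies)

# A small difference relative to its standard error suggests that the lab
# findings and the field findings tell a similar story.
difference = lab_d - field_d
difference_se = sqrt(lab_se ** 2 + field_se ** 2)

print(f"Pooled lab effect:   {lab_d:.3f} (SE {lab_se:.3f})")
print(f"Pooled field effect: {field_d:.3f} (SE {field_se:.3f})")
print(f"Lab minus field:     {difference:.3f} (SE {difference_se:.3f})")
```

More precise studies receive more weight, so a single small study with an unusual result shifts the pooled estimate only slightly. A full analysis would also examine heterogeneity among the studies before pooling them.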


REFERENCES

Anderson, C. A., Lindsay, J. J., & Bushman, B. J. (1999). Research in the psychological laboratory: Truth or triviality? Current Directions in Psychological Science, 8, 3–9.

Anderson, K. W., Taylor, S., & McLean, P. (1996). Panic disorder associated with blood-injury reactivity: The necessity of establishing functional relationships among maladaptive behaviors. Behavior Therapy, 27, 463–472.

Asmundson, G. J. G., Norton, G. R., & Stein, M. B. (2002). Clinical research in mental health: A practical guide. Thousand Oaks, CA: Sage Publications.

Barlow, D. H., Gorman, J. M., Shear, M. K., & Woods, S. W. (2000). Cognitive-behavioral therapy, imipramine, or their combination for panic disorder: A randomized controlled trial. Journal of the American Medical Association, 283, 2529–2536.

Barlow, D. H., & Hersen, H. (1984). Single case experimental designs. New York: Pergamon.

Campbell, D. T., & Stanley, J. C. (1970). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.

Clark, D. M., Salkovskis, P. M., & Chalkey, A. J. (1985). Respiratory control as a treatment for panic attacks. Journal of Behavior Therapy and Experimental Psychiatry, 16, 23–30.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston: Houghton Mifflin.

Finger, M. S., & Rand, K. L. (2003). Addressing validity concerns in clinical psychology research. In M. C. Roberts & S. S. Ilardi (Eds.), Handbook of research methods in clinical psychology (pp. 13–30). Malden, MA: Blackwell.

Flick, S. N. (1988). Managing attrition in clinical research. Clinical Psychology Review, 8, 499–515.

Furby, L. (1973). Interpreting regressions toward the mean in developmental research. Developmental Psychology, 8, 172–179.

Lakatos, I., & Musgrave, A. (1970). Criticism and the growth of knowledge. Cambridge, UK: Cambridge University Press.

Lerner, P. (2003). Hysterical men: War, psychiatry, and the politics of trauma in Germany, 1890–1930. New York: Cornell University Press.

Margraf, J., Ehlers, A., Roth, W. T., Clark, D. B., Sheikh, J., Agras, W. S., et al. (1991). How blind are double-blind studies? Journal of Consulting and Clinical Psychology, 59, 184–187.

McNally, R. J. (2003). Remembering trauma. Cambridge, MA: Harvard University Press.

Merrill, K. A., Tolbert, V. E., & Wade, W. A. (2003). Effectiveness of cognitive therapy for depression in a community mental health center: A benchmarking study. Journal of Consulting and Clinical Psychology, 71, 404–409.

Mogg, K., & Bradley, B. P. (1999). Selective attention and anxiety: A cognitive-motivational perspective. In T. Dalgleish & M. J. Power (Eds.), Handbook of cognition and emotion (pp. 145–170). New York: Wiley.

Mook, D. G. (1983). In defense of external invalidity. American Psychologist, 38, 379–387.

Onghena, P., & Edgington, E. S. (2005). Customization of pain treatments: Single-case design and analysis. Clinical Journal of Pain, 21, 56–68.

Öst, L.-G., & Sterner, U. (1987). Applied tension: A specific behavioral method for treatment of blood phobia. Behaviour Research and Therapy, 25, 25–29.

Porter, S., Yuille, J. C., & Lehman, D. R. (1999). The nature of real, implanted, and fabricated childhood emotional events: Implications for the recovered memory debate. Law and Human Behavior, 23, 517–537.

Pulos, L., & Richman, G. (1990). Miracles and other realities. Vancouver, British Columbia: Omega.

Rosenthal, R. (2002). Covert communication in classrooms, clinics, courtrooms, and cubicles. American Psychologist, 57, 839–849.

Sokol, L., Beck, A. T., Greenberg, R. L., Wright, F. D., & Berchick, R. J. (1989). Cognitive therapy for panic disorder: A nonpharmacological alternative. Journal of Nervous and Mental Disease, 177, 711–716.

Taylor, S. (1994). The overprediction of fear: Is it a form of regression toward the mean? Behaviour Research and Therapy, 32, 753–757.

Taylor, S. (2000). Understanding and treating panic disorder. New York: Wiley.

Taylor, S., Thordarson, D. S., Maxfield, L., Fedoroff, I. C., Lovell, K., & Ogrodniczuk, J. (2003). Comparative efficacy, speed, and adverse effects of three treatments for PTSD: Exposure therapy, EMDR, and relaxation training. Journal of Consulting and Clinical Psychology, 71, 330–338.

Wade, W. A., Treat, T. A., & Stuart, G. L. (1998). Transporting an empirically supported treatment for panic disorder to a service clinic setting: A benchmarking study. Journal of Consulting and Clinical Psychology, 66, 231–239.