
Statistical Conclusion Validity

The validity of inferences about the
correlation (covariation) between
treatment X and outcome Y


Population parameter: A fixed feature of a population (e.g., the
population mean, the population standard deviation, . . .); we
conventionally indicate parameters using Greek letters (μ, σ, β, . . .)
Sample statistic: A feature that varies from one sample to another
(e.g., the sample average, the sample standard deviation, . . .)
Estimator: Any function of sample data used to estimate a
population parameter
Expectation: The mathematical expectation of a variable Y,
indicated as E[Y], is the population average of this variable
When a sample statistic has expectation equal to the corresponding
population parameter, it is said to be an unbiased estimator of that
parameter

Formal Statistical Inference

The process of drawing conclusions about a
population based on sample data
Practical questions
How much uncertainty is associated with my sample estimate?
Do my results constitute strong evidence or just a
lucky draw/chance finding?

Formal Statistical Inference (contd)

If we select at random a sample of n units from a population of N
units, every possible sample of size n has the same chance of selection

→ N! / (n! (N − n)!) possible samples are equally likely

Example: if we select 3 units at random from a population of 8 units,
56 samples are equally likely

N! / (n! (N − n)!) = 8! / (3! 5!) = (8 · 7 · 6 · 5 · 4 · 3 · 2 · 1) / ((3 · 2 · 1)(5 · 4 · 3 · 2 · 1)) = 56
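The count of equally likely samples can be checked in a short script (Python here, purely illustrative; the population of 8 and sample of 3 match the example above):

```python
from math import comb, factorial

# Number of distinct samples of size n from a population of N units:
# N! / (n! * (N - n)!)
N, n = 8, 3
n_samples = factorial(N) // (factorial(n) * factorial(N - n))

print(n_samples)                # 56
print(n_samples == comb(N, n))  # True: same value via math.comb
```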

The Mean

For a given population

Only one E[Y] = μ (parameter)
Many sample averages Ȳ (statistics) that
depend on

What units are drawn

Unbiasedness of the Sample Mean

If we were to draw infinitely many random samples,
the average of the resulting sample means would be
the population mean

E[Ȳ] = μ
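A quick simulation illustrates unbiasedness. This is a minimal Python sketch with a made-up population of 100 units (the numbers are illustrative, not from the slides):

```python
import random
import statistics

random.seed(1)
population = list(range(1, 101))      # toy population of 100 units
mu = statistics.mean(population)      # population mean (parameter): 50.5

# Average the means of many simple random samples of size 10;
# with more and more samples, this average approaches mu.
sample_means = [statistics.mean(random.sample(population, 10))
                for _ in range(20_000)]

print(abs(statistics.mean(sample_means) - mu) < 0.5)  # True
```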

Variability of the Sample Mean
Sampling variance

Population: variance σ² = Σᵢ (Yᵢ − μ)² / N; std. dev. σ
Sample: variance S² = Σᵢ (Yᵢ − Ȳ)² / (n − 1); std. dev. S
Sample mean: sampling variance Var(Ȳ) = σ² / n; standard error SE(Ȳ) = σ / √n

The standard error (SE) summarizes the
variability in an estimate due
to random sampling
Estimated standard error

The population standard deviation σ is usually
unknown and must be estimated by replacing σ with the
sample standard deviation S, giving the estimated
standard error SE(Ȳ) = S / √n
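In code, the estimated standard error is just S/√n. A sketch with hypothetical data (the values are invented for illustration):

```python
import math
import statistics

sample = [12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 11.9, 10.7]  # hypothetical measurements
n = len(sample)
s = statistics.stdev(sample)   # sample standard deviation S (estimates sigma)
se = s / math.sqrt(n)          # estimated standard error of the sample mean

print(round(se, 3))  # 0.434
```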

T-statistic for the sample mean

Under the working/null hypothesis μ = μ₀,

a t-statistic for the sample mean is

t = (Ȳ − μ₀) / SE(Ȳ) = (Ȳ − μ₀) / (S / √n)

T-statistic for the sample mean (contd)

If the null hypothesis is

μ = μ₀ = 0

a t-statistic for the sample mean is

t = Ȳ / SE(Ȳ) = Ȳ / (S / √n)

Central limit theorem

If μ = μ₀ then, as long as the sample is large
enough, t has a sampling distribution that is
close to a standard normal distribution (mean of 0 and
standard deviation of 1), irrespective of the
population distribution of Y
In other words, for large samples, the distribution
of a t-statistic is independent of the distribution of
the underlying data
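The CLT claim can be checked by simulation. A sketch, assuming a deliberately skewed (exponential) population: even so, the t-statistic behaves roughly like a standard normal variable when n is large.

```python
import math
import random
import statistics

random.seed(2)

def t_stat(sample, mu0):
    """t = (sample mean - mu0) / (S / sqrt(n))."""
    n = len(sample)
    return (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))

# Data from a skewed exponential distribution with mean 1; under the true
# hypothesis mu = 1, the t-statistic should still look standard normal
# for large n, as the CLT says.
ts = [t_stat([random.expovariate(1.0) for _ in range(200)], mu0=1.0)
      for _ in range(5_000)]

share_beyond_2 = sum(abs(t) > 2 for t in ts) / len(ts)
print(round(share_beyond_2, 2))  # close to the normal benchmark of about .05
```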

Distribution of a t-statistic

Hypothesis testing

With standard normal variables, the
frequency of values larger than 2 in absolute
value is about 5%
Any t-statistic larger than 2 in absolute value
is too unlikely to be consistent with the null
hypothesis → We reject the null

Confidence interval

If we repeatedly drew infinitely many independent
samples from the same population, the interval

[Ȳ − 2 × SE(Ȳ), Ȳ + 2 × SE(Ȳ)]

would include the population mean about 95% of
the time
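A coverage simulation makes this concrete. A minimal sketch with a hypothetical normal population (mean 75, SD 10, samples of 50; all numbers illustrative):

```python
import math
import random
import statistics

random.seed(3)
mu, sigma, n = 75.0, 10.0, 50   # hypothetical population and sample size
trials = 4_000

covered = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    ybar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    # Does the interval [ybar - 2*SE, ybar + 2*SE] contain the true mean?
    covered += (ybar - 2 * se <= mu <= ybar + 2 * se)

print(round(covered / trials, 2))  # roughly 0.95
```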

Confidence Level

Confidence level: the percentage of all possible independent

samples from the same population that can be expected to
include the true population parameter
If we repeatedly drew infinitely many independent samples
from the same population and we calculated a confidence
interval for each sample, then a certain percentage
(confidence level) of the intervals would include the
unknown population parameter
Confidence intervals are usually calculated so that this
percentage is 95%, but we can produce 90%, 99%, 99.9% (or
whatever) confidence intervals for the unknown parameter

Comparison of Two Group Averages

μ¹ = E[Y | D = 1]

μ⁰ = E[Y | D = 0]

H₀: μ¹ − μ⁰ = 0

t = (Ȳ¹ − Ȳ⁰) / SE(Ȳ¹ − Ȳ⁰) = (Ȳ¹ − Ȳ⁰) / √(S₁²/n₁ + S₀²/n₀)
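The two-group t-statistic is easy to compute by hand. A sketch with hypothetical test scores (invented data, unequal-variance standard error):

```python
import math
import statistics

def two_sample_t(y1, y0):
    """t = (Ybar1 - Ybar0) / sqrt(S1^2/n1 + S0^2/n0), the unequal-variance form."""
    se = math.sqrt(statistics.variance(y1) / len(y1) +
                   statistics.variance(y0) / len(y0))
    return (statistics.mean(y1) - statistics.mean(y0)) / se

treated = [82, 79, 88, 91, 77, 85, 84, 90]   # hypothetical scores, D = 1
control = [75, 72, 80, 78, 74, 71, 77, 76]   # hypothetical scores, D = 0

print(round(two_sample_t(treated, control), 2))  # 4.39, well beyond 2
```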

Significance vs. Effect magnitude

A large t-statistic may be due to

A large effect size

Or a small estimated standard error

Null Hypothesis Significance Testing
The null hypothesis (H₀) is a claim to be tested,
usually a hypothesis of no difference (e.g., no
difference between test scores in group A and
group B)
The alternative hypothesis (H₁) is the one we
would believe if the null hypothesis is rejected
Rejecting H₀ does not prove H₀ to be false nor H₁ to be true
The only way H₀ can be proven false (or true) is to know
the value of the population parameter(s) specified in the
null hypothesis; sample data do not provide that kind of
certainty

The P-value

The probability of getting the observed
or more extreme results if the null
hypothesis were true
Following Fisher (1926), we usually say that
results are statistically significant if p < .05

More on NHST

                Do not reject H0        Reject H0

H0 is true      Correct Decision        Type I Error
TRUTH           Prob = 1 − α            Prob = α

H0 is false     Type II Error           Correct Decision
                Prob = β                Prob = 1 − β

More on NHST (contd)

α = P(Type I error) = significance level

Power = 1 − β = 1 − P(Type II error)

Statistical Conclusion Validity

1. Do X and Y covary?
Type I error (false positive): We may incorrectly
conclude that X and Y covary when they do not
Type II error (false negative): We may incorrectly
conclude that X and Y do not covary when they do
2. How strongly do X and Y covary?
We can over/underestimate
The magnitude of covariation
The degree of confidence in that magnitude estimate

Threats to Statistical
Conclusion Validity

1. Low Statistical Power

An insufficiently powered experiment may

incorrectly conclude that the relationship between
treatment and outcome is not significant (Shadish et
al. 2002, 55)

Power = 1 − β = 1 − P(Type II error)
The ability of a test to detect relationships that exist in
the population
The probability that a statistical test will reject the null
hypothesis when it is false

1. Low Statistical Power (contd)

Low power goes with larger estimated standard errors
and wider confidence intervals
Common practice: set β = .20 → Power = .80
Important to increase power when missing a real effect
would have negative consequences, e.g. when testing for
harmful effects of a new drug
Low power is a problem when effect sizes are small
Remedy: Meta-analysis
Comprehensive list of remedies: Table 2.3 SKC

1. Low Statistical Power (contd)

Factors affecting power

A. Sample size: The larger the sample
size, the higher the power (see figure
1) → Remedy:
Increasing sample size (sometimes costly or unfeasible)
Figure 1: The relationship between sample size and power for
H0: μ = 75, real μ = 80, one-tailed α = 0.05, for σ's of 10 and 15.

Source: Lane 2015
1. Low Statistical Power (contd)

Factors affecting power

B. Standard deviation (SD): The smaller
the SD, the higher the power (see
figure 1) → Remedies:
Sampling from a homogeneous population
Reducing random measurement error

1. Low Statistical Power (contd)

Factors affecting power (contd)

C. Effect size (i.e. difference between
hypothesized and true parameter):
Easier to detect larger effects (see
figure 2)

Figure 2. The relationship between effect size and power for
H0: μ = 75, one-tailed α = 0.05, for σ's of 10 and 15

Source: Lane 2015

1. Low Statistical Power (contd)

Factors affecting power (contd)

D. Significance level (α): The lower the
α (i.e. the probability of a Type I
error/false positive), the lower the
power (see figure 3)

Figure 3. The relationship between significance level and
power with one-tailed test: H0: μ = 75, real μ = 80, and σ = 10.

Source: Lane 2015
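The power curves in these figures can be reproduced with a one-tailed z-test under a normal approximation. A sketch using the figures' setup (H0: μ = 75, real μ = 80, α = .05, σ of 10 or 15); Lane's exact values may differ slightly depending on the test conventions used:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power(n, sigma, mu0=75.0, mu1=80.0):
    """One-tailed z-test power for H0: mu = mu0 against a real mean mu1.

    Reject H0 when Ybar > mu0 + 1.645 * sigma / sqrt(n) (alpha = .05)."""
    z_crit = 1.645                                  # one-tailed alpha = .05
    shift = (mu1 - mu0) / (sigma / math.sqrt(n))    # effect size in SE units
    return 1 - normal_cdf(z_crit - shift)

# Power rises with n (factor A) and falls with sigma (factor B):
for n in (10, 25, 50):
    print(n, round(power(n, sigma=10), 2), round(power(n, sigma=15), 2))
```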
2. Violated Assumptions of the Test Statistics

Violations of statistical test assumptions can lead to

either overestimating or underestimating the size and
significance of an effect (Shadish et al. 2002, 55)
Example: If we ignore the hierarchical/multilevel
structure of the data (e.g., soccer players nested
within teams, students nested within classes), we
may severely underestimate standard errors and
conclude that effects that might be ascribed to
chance are real (i.e. a higher risk of Type I error)

3. Fishing and the Error Rate Problem

Repeated tests for significant relationships, if uncorrected for

the number of tests, can artifactually inflate statistical
significance (Shadish et al. 2002, 55)
If the nominal α = .05, the actual familywise α = 1 − (1 − .05)^50 ≈ .923 when the test is
repeated fifty times (Maxwell & Delaney 1990)
Fishing until we find a significant effect
Multiple researchers analyzing the same data
Bonferroni correction: divide the target α by the number of tests and use
the Bonferroni-corrected α in all individual tests
Bonferroni and other corrections may be too conservative in low-powered studies (high
risk of Type II error)
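The .923 figure follows directly from the formula for independent tests, and the Bonferroni logic is a one-liner:

```python
# Familywise error rate across k independent tests, each at nominal alpha:
# P(at least one false positive) = 1 - (1 - alpha)^k
alpha, k = 0.05, 50

familywise = 1 - (1 - alpha) ** k
print(round(familywise, 3))  # 0.923

# Bonferroni correction: test each hypothesis at alpha / k instead
familywise_bonferroni = 1 - (1 - alpha / k) ** k
print(round(familywise_bonferroni, 3))  # about 0.049, back near .05
```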

4. Unreliability of Measures

Measurement error weakens the relationship between two variables

and strengthens or weakens the relationship between three or more
variables (Shadish et al. 2002, 55)
With three or more variables, unreliability of measures can lead to
either false positives or false negatives (what does that mean?)
Particularly problematic in longitudinal studies that assess change
over time
Remedies:
Increasing the number of measurements
More items to measure the same concept
Multiple raters
Improving the quality of measures
Using validated scale items
Training for raters
Techniques like latent variable modelling

4. Unreliability of Measures (contd)

Structural equation modeling (SEM): family of

statistical modeling techniques (e.g., confirmatory
factor analysis, path analysis) to test theoretical models
Two main components
Measurement model: uses observed variables (e.g.,
survey items) to define latent constructs (e.g., happiness,
self-efficacy, intelligence)
Structural regression model: system of simultaneous
regression equations to estimate paths linking the latent
constructs

5. Restriction of Range

Reduced range on a variable usually weakens the

relationship between it and another variable (Shadish et al.
2002, 55)
Small range → Lower power
This problem can affect either the
Independent variable (IV). Example:
Comparing two similar treatments → Remedy: using different
treatment doses and even full-dose vs. no treatment
Dependent variable (DV). Examples:
Floor effects (respondents cluster near the bottom)
Ceiling effects (respondents cluster near the top)
Remedy: Using models that are appropriate for limited
variables (e.g., Tobit, truncated regression, Heckman)
6. Unreliability of Treatment Implementation

If a treatment that is intended to be implemented in a

standardized manner is implemented only partially for some
respondents, effects may be underestimated compared with
full implementation (Shadish et al. 2002, 55)
Common in field experiments
It usually decreases effect size, but it can also increase the
effect size when implementation is tailored to the recipients
Important to measure all components of the treatment

7. Extraneous Variance in the Experimental Setting
Some features of an experimental setting may inflate error,
making detection of an effect more difficult (Shadish et al.
2002, 55)
Example: Fire drill or concert downstairs during a lab session
Particularly frequent in field experiments
When sources of extraneous variance cannot be controlled,
we should measure them and include them in the statistical
model

8. Heterogeneity of Units

Increased variability on the outcome variable within conditions

increases error variance, making detection of a relationship more
difficult (Shadish et al. 2002, 55)
Heterogeneity of respondents on an outcome variable increases
standard deviations on that variable and on any other correlated with
it → Weaker apparent treatment effect
Remedies:
Sample units that are similar on characteristics correlated with the outcome
Potential risks:
Lower external validity
Limited range on the DV
Measure respondent characteristics that interact with a cause-effect
relationship and use them for blocking or as covariates
Within-participants designs comparing pre- and post-test scores for each
participant

9. Inaccurate Effect Size Estimation

Some statistics systematically overestimate or

underestimate the size of an effect (Shadish et al.
2002, 55)
Outliers (departures from the normal distribution) can
dramatically decrease effect sizes
Analyzing binary outcomes with effect size measures
intended for continuous variables (correlation coefficient
or standardized mean difference statistic) →
Underestimation of effect size

Internal Validity

The validity of inferences about
whether observed covariation
between X (the presumed treatment)
and Y (the presumed outcome) reflects
a causal relationship from X to Y as
those variables were manipulated or
measured
Internal Validity (contd)

Local molar causal validity (Campbell 1986)

Local: Causal conclusions are limited to the context
of the particular treatments, outcomes, times,
settings, and persons studied
Molar: Treatments are a complex package
consisting of many components, all of which are
tested as a whole

Threats to Internal Validity

Typically, we infer from an effect to a cause by
eliminating other possible causes (Mackie 1974)
Threats to internal validity are those other possible
causes
Different threats are not necessarily independent

1. Ambiguous Temporal Precedence

Lack of clarity about which variable occurred first may yield

confusion about which variable is the cause and which is the
effect (Shadish et al. 2002, 55)
Correlational studies are often unable to answer the
question: Which came first, the chicken or the egg?
Not always: e.g., unlikely that an increase in the sales of
air conditioners increases outside temperature
Particularly tricky because some causation is bidirectional
High performance → Self-efficacy → Higher performance

2. Selection

Systematic differences over conditions in respondents

characteristics that could also cause the observed effect
(Shadish et al. 2002, 55)
A new drug is given only to patients who volunteer to take the new drug
The volunteering patients might differ from nonvolunteers in ways
(e.g., sicker, older, etc.) that might affect the outcome
Random assignment eliminates selection bias because
randomly formed groups differ only by chance

3. History

[External] Events occurring concurrently with

treatment could cause the observed effect (Shadish
et al. 2002, 55)
Example: A study of psychotherapy with depressed
patients at the time a new antidepressant went on
the market

4. Maturation

Naturally occurring changes over time could be

confused with a treatment effect (Shadish et al.
2002, 55)
While maturation is internal, a natural course of
things having to do with some quality of the
participants in the study, history has to do with an
external event of some kind
Example: We may think that an ineffective medicine
works because patients get better by themselves

5. Regression Artifacts

When units are selected for their extreme scores,

they will often have less extreme scores on other
variables, an occurrence that can be confused with a
treatment effect (Shadish et al. 2002, 55)
Test theory → Every measure has
A true score component reflecting a true ability
Plus a random error component that is normally and
randomly distributed around the mean of the measure

5. Regression Artifacts (contd)

High scores will tend to have more positive random error pushing
them up, low scores will tend to have more negative random error
pulling them down
On the same measure at a later time, or on other measures at the
same time, the random error is less likely to be so extreme
A compensatory tutoring program for kids in the lowest 10 percent on a
pretest will seem more effective than it actually is because those kids will
tend to improve anyway in the post-test
People tend to go to psychotherapy after a shock and organizations tend to
hire consultants after a downturn; clients' measured progress is partly a
movement back toward their stable mean as the temporary shock fades

6. Attrition/Mortality

Loss of respondents to treatment or to measurement can produce

artifactual effects if that loss is systematically correlated with
conditions (Shadish et al. 2002, 55)
A special subset of selection bias occurring after the treatment is in place
Unlike selection bias, attrition is not controlled by random assignment
If those dropping out of a compensatory tutoring course are the low pretest
test scorers, by the end of the course the participants who remain will be the
ones with higher academic skills
By comparing the average pretest to posttest scores we would overestimate
the effect of the course

7. Testing

Exposure to a test can affect scores on subsequent

exposures to that test, an occurrence that can be
confused with a treatment effect (Shadish et al.
2002, 55)
Only in pretest-posttest designs
Example: People commonly improve on
standardized tests such as intelligence tests, SATs, or
GREs, due to practice, familiarity, or other forms of reactivity

8. Instrumentation

The nature of a measure may change over time or conditions

in a way that could be confused with a treatment effect
(Shadish 2002, 55)
Only in pretest-posttest designs
Whereas testing involves a change in the participant,
instrumentation involves a change in the instrument
The spring on a bar press might become weaker and easier to push
over time
Schools often use two different types of tests before and after a
compensatory tutoring course (to reduce the testing threat); if the
level of difficulty is not the same between the two tests, part or all of
any pre-post difference is due to the change in instrument, not to the
treatment
9. Additive and Interactive Effects of Threats to
Internal Validity

The impact of a threat can be added to

that of another threat or may depend on
the level of another threat (Shadish et al.
2002, 55).