
Statistics for Diagnostic Tests

Andy Vail
Biostatistics
University of Manchester
FOCUS Training Day, 30th April 2012

Opener

• Assume DNA-fingerprinting is very good
– 1 in 20 million chance of incorrect match
• Defendant has DNA match
• Do the odds:
A. Overwhelmingly imply guilt?
B. Favour guilt?
C. Favour innocence?
Outline

• ‘Proof of concept’ studies
– Summary statistics
– Confidence intervals
– Sample size calculations
• Diagnostic accuracy studies
– Likelihood & odds ratios
– Receiver Operating Characteristic (ROC) curves
• Clinical/cost effectiveness studies

Definition

• A process by which a sample from an individual (patient, location, …) is assigned a value which is used to decide if a particular attribute (disease, microorganism, trait, genotype, ...) is present or absent.
• Underlying process may give a numerical value, but this is dichotomised to make a decision
Proof of concept (1)

• Given a new measurement, how to assess value?
• First show that measurement is reliable
– Between analyses on the same sample
• Different days/operators/reagent batches
• Different laboratories
– Between repeat samples on the same individual
• “Same” time
• Over time

• Always assess ‘blinded’ to earlier result(s)

Reliability analysis

• Categories
– Cohen’s kappa is a measure of the strength of agreement between two categorisations, adjusted for chance agreement
– kappa = (observed – chance) / (maximum – chance); see the sketch below
• Continuous
– Bland-Altman: plot differences against average value
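As a concrete illustration of the kappa formula above, here is a minimal Python sketch for a two-rater agreement table; the counts are hypothetical.

```python
import numpy as np

def cohens_kappa(table):
    """kappa = (observed - chance) / (maximum - chance), where maximum = 1."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    observed = np.trace(table) / n  # proportion of exact agreement
    # Chance agreement: sum over categories of the product of marginal proportions
    chance = (table.sum(axis=0) * table.sum(axis=1)).sum() / n**2
    return (observed - chance) / (1 - chance)

# Hypothetical table: rows = first categorisation, columns = second
print(cohens_kappa([[40, 5],
                    [10, 45]]))  # 0.70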
Proof of concept (2)

• Given reliability, does it detect barn-door differences?
• Identify clear cases and clear non-cases

[Histograms: frequency of test results (scale 0 to 10) in clear cases and in clear non-cases]

Discrimination analysis

• Typically see tests of association
– Mann-Whitney, t-test, chi-squared, Fisher’s
• Interest here is not in average difference
• Interest is extent of overlap
– Potential for mis-classification
Sensitivity and Specificity

                True Diagnosis
                 Y      N
Test Result  Y   a      b        Sensitivity - proportion of cases correctly identified = a/(a+c)
             N   c      d        Specificity - proportion of controls correctly identified = d/(b+d)

                True Diagnosis
                 Y      N
Test Result  Y   90     20       Sensitivity = 90/100 = 90%
             N   10     80       Specificity = 80/100 = 80%

Confidence intervals

• Estimates are no use alone
• Need idea of precision: “give or take a bit”
– “95% confidence interval” sounds more convincing!
• Sensitivity and specificity are just percentages
– Standard methods for CI of a proportion
– 8/10 = 80% (44% to 97%)
– 80/100 = 80% (71% to 87%)
– 800/1000 = 80% (77% to 82%)
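The slides do not say which interval method was used; as a sketch, both the Clopper-Pearson (‘exact’) and Wilson score intervals from statsmodels reproduce the figures above to within rounding.

```python
from statsmodels.stats.proportion import proportion_confint

for count, nobs in [(8, 10), (80, 100), (800, 1000)]:
    exact = proportion_confint(count, nobs, alpha=0.05, method='beta')    # Clopper-Pearson
    wilson = proportion_confint(count, nobs, alpha=0.05, method='wilson') # score interval
    print(f"{count}/{nobs}: exact ({exact[0]:.2f}, {exact[1]:.2f}), "
          f"wilson ({wilson[0]:.2f}, {wilson[1]:.2f})")
```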
95% Confidence Intervals

• If a study were repeated many times, 95% of such intervals would contain the true value

• Sadly not “95% chance that it contains true value”

• The bigger the study, the tighter the confidence interval will be
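The repeated-sampling interpretation can be checked directly by simulation; a minimal sketch, assuming a true proportion of 80% and studies of size 100.

```python
import numpy as np
from statsmodels.stats.proportion import proportion_confint

rng = np.random.default_rng(2012)
true_p, n, covered = 0.8, 100, 0

for _ in range(10_000):  # 10,000 hypothetical repeats of the same study
    count = rng.binomial(n, true_p)
    lo, hi = proportion_confint(count, n, method='wilson')
    covered += (lo <= true_p <= hi)

print(covered / 10_000)  # close to 0.95, the nominal coverage
```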

Sample size calculation

• In case-control studies, can fix number of each group
• Need some idea of desirable/plausible sens & spec

             ------- 95% confidence interval -------
Proportion   n=50          n=100         n=200
80%          67 to 89      71 to 87      74 to 85
90%          79 to 96      83 to 94      85 to 93
95%          85 to 98      89 to 98      91 to 97
100%         93 to 100     96 to 100     98 to 100
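A sketch that regenerates the table, assuming Wilson score intervals (which match the tabulated figures to within rounding):

```python
from statsmodels.stats.proportion import proportion_confint

for p in (0.80, 0.90, 0.95, 1.00):
    cells = []
    for n in (50, 100, 200):
        lo, hi = proportion_confint(round(p * n), n, method='wilson')
        cells.append(f"{100 * lo:.0f} to {100 * hi:.0f}")
    print(f"{p:.0%}:  " + "   ".join(cells))
```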
Diagnostic Accuracy Studies

• Barn-door differences: so what?
• Clinical diagnosis occurs in real, messy cases
• Need
– independent “Gold” or “Reference” Standard
– Usually prospective cases
– Follow-up of test negatives

Same table, more stats!

                True Diagnosis
                 Y      N
Test Result  Y   a      b
             N   c      d

• Sens: a/(a+c)
• Spec: d/(b+d)
• PPV: a/(a+b)
• NPV: d/(c+d)
• LR+: sens/(1-spec)
• LR-: (1-sens)/spec
• DOR: (a/c)/(b/d) = ad/bc
Predictive values (PPV & NPV)

• Depend on prevalence
– meaningless in Case-Control design
– may not transfer to different settings
• Positive Predictive Value
– Proportion of positive results that are correct
– PPV = (Prev x Sens) / [Prev x Sens + (1- Prev)(1-Spec)]

• Negative Predictive Value
– Proportion of negative results that are correct
– NPV = [(1-Prev) x Spec] / [(1-Prev) x Spec + Prev x (1-Sens)]
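Applying the two formulas above shows the prevalence dependence directly; a sketch using the example test from these slides (sensitivity 90%, specificity 80%).

```python
def ppv(prev, sens, spec):
    return prev * sens / (prev * sens + (1 - prev) * (1 - spec))

def npv(prev, sens, spec):
    return (1 - prev) * spec / ((1 - prev) * spec + prev * (1 - sens))

# The same test looks very different in low-prevalence settings
for prev in (0.5, 0.1, 0.01):
    print(f"prev {prev:5.2f}: PPV {ppv(prev, 0.9, 0.8):.2f}, "
          f"NPV {npv(prev, 0.9, 0.8):.3f}")
```

At 50% prevalence (as in the worked example) PPV is 82%, but at 1% prevalence it falls below 5%.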

Likelihood Ratios

• Measure of information contained in test result
• Unlike PPV & NPV, no direct dependence on prevalence
– But only useful in real context
• LR+
– How much more likely to test +ve if affected than if unaffected
• LR-
– How much more likely to test –ve if affected than if unaffected
Diagnostic odds ratio

• Attempt to summarise value as a single figure
• DOR = (LR+)/(LR-)
– How much higher the odds of testing +ve if affected
– How much higher the odds of being affected if testing +ve
• Increasing use
– logistic regression and meta-analysis
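On the logistic-regression point: the coefficient for the test result in a logistic regression of true status on test result equals log(DOR). A sketch, rebuilding individual-level data from the worked example's counts (a=90, b=20, c=10, d=80) and assuming statsmodels is available:

```python
import numpy as np
import statsmodels.api as sm

# Expand the 2x2 counts into one record per person
disease = np.array([1] * 90 + [0] * 20 + [1] * 10 + [0] * 80)
test    = np.array([1] * 110 + [0] * 90)  # first 110 are test-positive

fit = sm.Logit(disease, sm.add_constant(test)).fit(disp=0)
print(np.exp(fit.params[1]))  # 36.0, the diagnostic odds ratio
```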

Example

                True Diagnosis
                 Y      N
Test Result  Y   90     20
             N   10     80

• Sens: 90/(90+10) = 90%
• Spec: 80/(20+80) = 80%
• PPV: 90/(90+20) = 82%
• NPV: 80/(10+80) = 89%
• LR+: 90/20 = 4.5
• LR-: 10/80 = 0.125
• DOR: (90/10)/(20/80) = 36
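All seven statistics on this slide follow from the four cell counts; a quick sketch verifying the figures:

```python
a, b, c, d = 90, 20, 10, 80   # cells of the table above

sens = a / (a + c)            # 0.90
spec = d / (b + d)            # 0.80
ppv  = a / (a + b)            # 0.818 -> 82%
npv  = d / (c + d)            # 0.889 -> 89%
lr_pos = sens / (1 - spec)    # 4.5
lr_neg = (1 - sens) / spec    # 0.125
dor = lr_pos / lr_neg         # 36.0, identical to (a * d) / (b * c)

print(sens, spec, ppv, npv, lr_pos, lr_neg, dor)
```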
ROC curves

• In practice, usually have continuous score (assay)
• Need to make categorical decision (yes/no)
• Choose threshold value to determine diagnosis
• Higher threshold leads to:
– Fewer diagnoses
– Lower sensitivity (as fewer ‘affected’ testing positive)
– Higher specificity (as more ‘unaffected’ testing negative)
[Histogram: frequency of test results (scale 0 to 10), on which a threshold must be chosen]
Determining a cut-off

• Trade-off between sensitivity and specificity
• Produce table for all possible cut-off values
• Calculate and plot Sens v (100% - Spec) for each
• At minimum possible value
– Everyone tests positive, so Sens = 100%, Spec = 0%
• At maximum possible value
– Everyone tests negative, so Sens = 0%, Spec =100%

[ROC plot: Sensitivity v 1-Specificity. The curve runs from the maximum threshold at (0,0), where everyone tests negative, to the minimum threshold at (1,1), where everyone tests positive; a perfect test passes through the top-left corner, and guessing follows the diagonal]

[ROC plot: the upper-right part of the curve is preferable if high sensitivity is key, the lower-left part if high specificity is key; the point nearest the top-left corner is ‘optimal’ to minimise errors]
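A sketch of the ‘table for all possible cut-off values’ using hypothetical scores; by convention here a result at or above the cut-off is called positive.

```python
import numpy as np

# Hypothetical test results (scale 0-10) in affected and unaffected groups
affected   = np.array([4, 5, 6, 6, 7, 8, 8, 9, 9, 10])
unaffected = np.array([0, 1, 2, 2, 3, 3, 4, 5, 5, 6])

for cut in range(0, 12):
    sens = np.mean(affected >= cut)    # affected who test positive
    spec = np.mean(unaffected < cut)   # unaffected who test negative
    print(f"cut-off {cut:>2}: sens {sens:.2f}, 1-spec {1 - spec:.2f}")
# cut-off 0 gives sens 1.00, spec 0.00; cut-off 11 gives sens 0.00, spec 1.00
```

Plotting the sens column against the 1-spec column traces out the ROC curve.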

Area under ROC (AuROC)

• Perfect discrimination: AuROC=1
• Guesswork: AuROC=0.5
• AuROC = probability that randomly chosen affected person will have higher value than randomly chosen unaffected person
[ROC plot: four curves plotted as Sensitivity v 1-Specificity: a perfect test (AuROC=1.0), a strongly discriminating test (AuROC=0.94), a weakly discriminating test (AuROC=0.59), and guessing along the diagonal (AuROC=0.5)]
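The probability interpretation can be checked against a standard AuROC routine; a sketch assuming scikit-learn is available, reusing the hypothetical scores from the cut-off example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

affected   = np.array([4, 5, 6, 6, 7, 8, 8, 9, 9, 10])
unaffected = np.array([0, 1, 2, 2, 3, 3, 4, 5, 5, 6])

# P(random affected scores higher than random unaffected), ties counting half
pairs = (affected[:, None] > unaffected[None, :]).mean() \
      + 0.5 * (affected[:, None] == unaffected[None, :]).mean()

labels = np.r_[np.ones(10), np.zeros(10)]
auc = roc_auc_score(labels, np.r_[affected, unaffected])
print(pairs, auc)  # both 0.935: the two definitions agree
```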

STARD

• STAndards for the Reporting of Diagnostic accuracy
• Checklist (with justification) for reporting:
– Title (1 item)
– Intro (1)
– Method (11)
– Results (11)
– Discussion (1)
STARD examples

• 12. Describe methods for calculating or comparing measures of diagnostic accuracy, and the statistical methods used to quantify uncertainty (e.g. 95% confidence intervals).
• 18. Report distribution of severity of disease (define
criteria) in those with the target condition; other
diagnoses in participants without the target condition.
• 22. Report how indeterminate results, missing
responses and outliers of the index tests were
handled.

Clinical/cost effectiveness

• Are sens, spec, PPV, NPV, AuROC, etc enough?
• What is a good diagnostic test?
– One that makes a clinical difference!
• Ultimately need comparison studies
– ‘Act on new test’ versus ‘Act in ignorance of new test’
– Sample size as for other randomised trials
– CONSORT rather than STARD reporting
– Remarkably rare!
Resources

• BMJ Statistical Notes series
– Altman & Bland. BMJ 1994;308:1552. Sensitivity and specificity.
– Altman & Bland. BMJ 1994;309:102. Predictive values.
– Altman & Bland. BMJ 1994;309:188. ROC plots.
– Altman & Bland. BMJ 2004;329:168. Likelihood ratios.
• Bland & Altman. Lancet 1986;i:307. Assessing agreement between two methods.
• STARD: http://www.stard-statement.org/

Prosecutor’s fallacy

             Guilty    Innocent      Total
DNA Match
No match
Total
Prosecutor’s fallacy

             Guilty    Innocent      Total
DNA Match
No match
Total        1         ~60 million   ~60 million

Prosecutor’s fallacy

             Guilty    Innocent      Total
DNA Match                            3
No match                             ~60 million
Total        1         ~60 million   ~60 million
Prosecutor’s fallacy

             Guilty    Innocent      Total
DNA Match    1         2             3
No match     0         ~60 million   ~60 million
Total        1         ~60 million   ~60 million

• Sens = 100%, 1-Spec = 1/20 million
• Prev = very low = 1/60 million
• Probability that guilty given positive match?
– PPV = 1/3
– Odds are in defendant’s favour!

Summary

• Statistics of diagnostic test accuracy are largely just percentages and ratios of percentages
• Difficulty in technical versus lay language
• Need to ensure statistics and interpretation are appropriate to phase of research
