Published online 1 April 2013 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/pds.3418
ORIGINAL REPORT
ABSTRACT
Purpose This study aimed to develop Natural Language Processing (NLP) approaches to supplement manual outcome validation, specifically to validate pneumonia cases from chest radiograph reports.
Methods We trained one NLP system, ONYX, using radiograph reports from children and adults that were previously manually reviewed.
We then assessed its validity on a test set of 5000 reports. We aimed to substantially decrease manual review, not replace it entirely, and so we classified reports as follows: (1) consistent with pneumonia; (2) inconsistent with pneumonia; or (3) requiring manual review because of complex features. We developed processes tailored either to optimize accuracy or to minimize manual review. Using logistic regression, we jointly modeled sensitivity and specificity of ONYX in relation to patient age, comorbidity, and care setting. We estimated positive and negative predictive value (PPV and NPV) assuming the pneumonia prevalence in the source data.
Results Tailored for accuracy, ONYX identified 25% of reports as requiring manual review (34% of true pneumonias and 18% of non-pneumonias). For the remainder, ONYX's sensitivity was 92% (95% CI 90–93%), specificity 87% (86–88%), PPV 74% (72–76%), and NPV 96% (96–97%). Tailored to minimize manual review, ONYX classified 12% as needing manual review. For the remainder, ONYX had sensitivity 75% (72–77%), specificity 95% (94–96%), PPV 86% (83–88%), and NPV 91% (90–91%).
Conclusions For pneumonia validation, ONYX can replace almost 90% of manual review while maintaining low to moderate misclassification rates. It can be tailored for different outcomes and study needs and thus warrants exploration in other settings. Copyright 2013 John Wiley & Sons, Ltd.
KEY WORDS: pneumonia; Natural Language Processing; sensitivity; specificity; validity; pharmacoepidemiology
Received 13 July 2012; Revised 11 January 2013; Accepted 14 January 2013
INTRODUCTION
Pneumonia is common and can have severe consequences in older adults. A growing literature suggests that some medications increase pneumonia risk.1–5 Pharmacoepidemiologic studies often identify pneumonia cases within large databases using International Classification of Diseases, version 9 (ICD-9) codes or the equivalent. However, ICD-9 codes lack accuracy
*Correspondence to: S. Dublin, Group Health Research Institute, 1730 Minor
Avenue, Suite 1600, Seattle, WA 98101-1448, USA. E-mail: dublin.s@ghc.org
This work was presented as an oral presentation at the HMO Research Network
Annual Conference in Boston, Massachusetts on 24 March 2011, where it
received an Early Career Investigator Award.
METHODS
Setting
Group Health (GH) is an integrated healthcare delivery
system in the Northwest USA with extensive electronic
health data. GH members have coverage through
employer-based plans, individual plans, Medicare, and
Table 1. Characteristics of the source population of radiology reports and the test set of 5000 reports used for validation

                                  --------------- Source dataset* ---------------
Characteristic                    All             Pneumonia       Non-cases        Test set†
                                  (N = 93 110)    cases           (N = 66 765)     (N = 5000)
                                                  (N = 26 345)
                                  n (%)           n (%)           n (%)            n (%)
Patient age (years)
  0–4                             7130 (8)        3084 (12)       4046 (6)         423 (8)
  5–19                            7839 (8)        3426 (13)       4413 (7)         439 (9)
  20–44                           12 248 (13)     3311 (13)       8937 (13)        609 (12)
  45–64                           24 984 (27)     5840 (22)       19 144 (29)      1256 (25)
  65–74                           14 466 (16)     3538 (13)       10 928 (16)      840 (17)
  75–84                           17 940 (19)     4752 (18)       13 188 (20)      987 (20)
  85+                             8503 (9)        2394 (9)        6109 (9)         446 (9)
Pneumonia case                    26 345 (28)     26 345 (100)    0 (0)            2200 (44)
Comorbidities
  Congestive heart failure        14 424 (15)     3712 (14)       10 712 (16)      877 (18)
  Chronic lung disease            29 079 (31)     8240 (31)       20 839 (31)      1767 (35)
  Cancer                          12 285 (13)     3255 (12)       9030 (14)        747 (15)
Setting of care
  Outpatient                      86 028 (92)     23 926 (91)     62 102 (93)      4598 (92)
  Inpatient                       6366 (7)        2191 (8)        4175 (6)         362 (7)
  Missing                         716 (1)         228 (1)         488 (1)          40 (1)

*A dataset of 93 110 chest radiograph reports that were previously manually reviewed for a study of pneumococcal conjugate vaccine.16
RESULTS

Figure 1. ONYX's performance under conditions aimed at improving accuracy (allows more reports to be designated as needing manual review): (A) specificity, (B) sensitivity, and (C) positive predictive value. *Compared with gold standard from manual review of reports. Positive predictive value estimated assuming prevalence of pneumonia in the source dataset (28%). †Comorbid conditions ascertained from health plan automated diagnosis data (International Classification of Diseases, version 9, codes). (Panels plot each measure on a 50–100% scale, overall and by age group, congestive heart failure, chronic lung disease, cancer, and setting of care.)
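The predictive values in Figures 1 and 2 are derived from sensitivity, specificity, and an assumed prevalence via Bayes' rule. A minimal sketch of that calculation (illustrative code, not the authors' model-based estimation; the function name is ours):

```python
def ppv_npv(sens: float, spec: float, prev: float) -> tuple[float, float]:
    """Predictive values from sensitivity, specificity, and prevalence (Bayes' rule)."""
    tp = sens * prev                # true positives per unit population
    fp = (1 - spec) * (1 - prev)    # false positives
    fn = (1 - sens) * prev          # false negatives
    tn = spec * (1 - prev)          # true negatives
    return tp / (tp + fp), tn / (tn + fn)

# Classifier 1 (accuracy-optimized): sensitivity 92%, specificity 87%, prevalence 28%
ppv, npv = ppv_npv(0.92, 0.87, 0.28)   # ppv ≈ 0.73, npv ≈ 0.97
```

These values closely match the reported PPV of 74% and NPV of 96%; small differences reflect the paper's regression-based estimation by subgroup rather than this crude overall calculation.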
Figure 2. ONYX's performance under conditions aimed at decreasing the proportion of reports needing manual review: (A) specificity, (B) sensitivity, and (C) positive predictive value. *Compared with gold standard from manual review of reports. Positive predictive value estimated assuming prevalence of pneumonia in the source dataset (28%). †Comorbid conditions ascertained from health plan automated diagnosis data (International Classification of Diseases, version 9, codes). (Panels plot each measure on a 50–100% scale, overall and by age group, congestive heart failure, chronic lung disease, cancer, and setting of care.)
Classification results per 10 000 reports under three review strategies (assuming the source dataset's 28% pneumonia prevalence):

                                      Manual review    ONYX with          ONYX with
                                      of all records   Classifier 1       Classifier 2
                                                       (prioritizes       (prioritizes fewer
                                                       accuracy)          manual reviews)
Number of charts requiring
manual review                         10 000           2248               1008
  With true pneumonia                 2800             952                504
  Without pneumonia                   7200             1296               504
ONYX: total incorrectly classified    0                916                909
  False positives                     0                768                335
  False negatives                     0                148                574
ONYX: total correctly classified      0                6836               8083
  True positives                      0                1700               1722
  True negatives                      0                5136               6361

Compared with the gold standard of manual review; thus, by definition, there are no false positives or negatives when all reports undergo manual review.

Figure 3. ONYX's accuracy for individual radiologists, plotted as sensitivity against specificity (each on a 40–100% scale). (Limited to the 15 radiologists who read 100 or more reports in the test set. Each number represents one radiologist, and each radiologist is assigned the same number in both graphs.) (A) ONYX with Classifier 1 (prioritizes accuracy). (B) ONYX with Classifier 2 (prioritizes decreasing the amount of manual review). Thus, additional reports are included in the analyses for Figure 3(B) that were not included in Figure 3(A)
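The per-10 000 counts above follow arithmetically from the prevalence, the fraction of reports each classifier diverts to manual review, and its sensitivity and specificity on the auto-classified remainder, all as reported in the abstract. A sketch of that arithmetic (illustrative code, not the authors'):

```python
def project(n: int, prev: float, review_pneu: float, review_non: float,
            sens: float, spec: float) -> tuple[int, int, int]:
    """Counts of manually reviewed charts, false negatives, and false positives."""
    pneu, non = n * prev, n * (1 - prev)
    review = pneu * review_pneu + non * review_non   # charts sent to manual review
    fn = pneu * (1 - review_pneu) * (1 - sens)       # missed pneumonias among auto-classified
    fp = non * (1 - review_non) * (1 - spec)         # false alarms among auto-classified
    return round(review), round(fn), round(fp)

# Classifier 1: reviews 34% of pneumonias and 18% of non-pneumonias; sens 92%, spec 87%
project(10_000, 0.28, 0.34, 0.18, 0.92, 0.87)   # -> (2248, 148, 768)
# Classifier 2: reviews 18% and 7%, respectively; sens 75%, spec 95%
project(10_000, 0.28, 0.18, 0.07, 0.75, 0.95)   # -> (1008, 574, 335)
```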
with chronic lung disease. Classifier 2 had high specificity across all groups but lower sensitivity, especially in the oldest patients. Older patients and those with chronic lung disease may be more likely to have chronic lung abnormalities, and NLP may have difficulty distinguishing chronic, stable abnormal findings from acute infiltrates.
Compared with prior NLP studies, our results indicate similar or somewhat better accuracy, although it is difficult to compare studies because of differences in the gold standards and patient populations, including pneumonia prevalence (which influences PPV and NPV). Also, prior studies used NLP to classify all reports, even those with complex features that our study diverted for manual review. This difference may make our tool's accuracy appear higher. Mendonca et al. applied an NLP system, MedLEE, to chest radiograph reports from the neonatal intensive care unit.14 Sensitivity was 71%, specificity 99%, and PPV only 7.5% compared with medical record review. The low
kappa of 0.37 for infiltrate in adults23 and 0.58 in children.24 Established case definitions, for example, the CDC/National Healthcare Safety Network definition for hospital-acquired pneumonia,10 typically require a positive radiograph in addition to clinical symptoms and findings. One could argue that NLP should draw on symptoms, vital signs, and laboratory findings in addition to radiology reports. Murff et al. took such an approach.11 Compared with medical record review, their NLP algorithms had sensitivity of 64% (95% CI, 58–70%) and specificity of 95% (94–96%) for post-operative pneumonia. While a multifaceted approach may be desirable, challenges to implementation exist, including that many of the required data are rarely measured in outpatients. Still, a crucial building block is an accurate method for classifying chest radiographs, which are central to most pneumonia definitions.
Our study has limitations. We measured comorbid illnesses from administrative data, which do not have perfect accuracy. Our data included relatively few inpatient cases, resulting in less precise estimates for this subgroup. Reports came from a single health-care system, which may limit generalizability. Radiologists from different institutions or geographic regions may use different language, and so an important next step will be to assess transferability: we plan to evaluate our NLP tool's accuracy for reports from other institutions and for other conditions. In the current study, considerable effort was needed to construct a classifier to create a binary pneumonia variable from ONYX's output. We developed a rule-based classifier on the basis of expert opinion, but machine learning is an alternative approach that could improve efficiency. It would also be valuable to explore the use of machine learning alone, without an NLP tool such as ONYX, to detect pneumonia from radiograph reports. Schuemie et al. recently described such efforts; they found that the best approaches achieved sensitivity of 82–93% with PPV of 81–90%.25 We currently use only features of the radiology report to determine which reports need manual review. Future work could explore whether incorporating patient characteristics (e.g., age or comorbidity) can improve this determination.
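A rule-based classifier of the general kind described above maps report language onto the study's three-way decision (pneumonia / no pneumonia / manual review). A toy sketch of the idea, applied directly to report text; the keyword rules here are hypothetical illustrations, not ONYX's actual criteria:

```python
import re

# Hypothetical keyword rules (for illustration only; not ONYX's logic).
POSITIVE = re.compile(r"\b(pneumonia|consolidation|infiltrate)\b", re.I)
NEGATED = re.compile(r"\bno\s+(evidence of )?(pneumonia|consolidation|infiltrate)", re.I)
COMPLEX = re.compile(r"\b(chronic|unchanged|possible|cannot exclude)\b", re.I)

def classify(report: str) -> str:
    """Return 'pneumonia', 'no pneumonia', or 'manual review'."""
    if COMPLEX.search(report):
        return "manual review"      # uncertain or chronic language -> human review
    if NEGATED.search(report):
        return "no pneumonia"       # explicit negation outranks keyword match
    if POSITIVE.search(report):
        return "pneumonia"
    return "no pneumonia"

classify("Focal infiltrate in the right lower lobe.")                 # -> 'pneumonia'
classify("No evidence of pneumonia.")                                 # -> 'no pneumonia'
classify("Chronic changes, cannot exclude superimposed pneumonia.")   # -> 'manual review'
```

A machine-learning alternative would replace the hand-written rules with a text classifier trained on the manually reviewed reports, which is the efficiency trade-off the paragraph above raises.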
In conclusion, vast amounts of clinical data are becoming available as free text within EMRs. These data could be valuable for many purposes but are currently expensive and time consuming to access. New technologies such as NLP offer tremendous opportunities. Studies such as ours provide insight into the potential of NLP to improve research and clinical care.
CONFLICT OF INTEREST
Dr. Dublin has received a Merck/American Geriatrics
Society New Investigator Award.
KEY POINTS
ACKNOWLEDGEMENT
We thank Dr. Lisa Jackson, Principal Investigator of
the PSS, for sharing the PSS data and commenting
on an early draft of the manuscript.
Dr. Dublin was funded by a Paul Beeson Career
Development Award from the National Institute on
Aging (grant K23AG028954), by the Branta Foundation, and by Group Health Research Institute internal
funds. The Beeson award is also supported by the
Hartford and Starr Foundations and Atlantic Philanthropies. Dr. Carrell was funded by National Cancer
Institute grant RC1CA146917. Dr. Chapman was
funded by National Institutes of Health grants
5R01GM090187 and U54HL108460. Group Health
Research Institute internal funds covered the data
collection and analysis.
This work does not necessarily reflect the views of the National Institute on Aging or the National Institutes of Health.
SUPPORTING INFORMATION
Additional supporting information may be found in
the online version of this article
Online Appendix Table 1. Pneumonia classifier: further explanation of criteria used to identify reports needing manual review
Pharmacoepidemiology and Drug Safety, 2013; 22: 834841
DOI: 10.1002/pds
REFERENCES