
pharmacoepidemiology and drug safety 2013; 22: 834–841
Published online 1 April 2013 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/pds.3418

ORIGINAL REPORT

Natural Language Processing to identify pneumonia from radiology reports
Sascha Dublin1,2, Eric Baldwin1, Rod L. Walker1, Lee M. Christensen6, Peter J. Haug5,6, Michael L. Jackson1, Jennifer C. Nelson1,3, Jeffrey Ferraro5,6, David Carrell1 and Wendy W. Chapman4
1 Group Health Research Institute, Seattle, WA, USA
2 Department of Epidemiology, University of Washington, Seattle, WA, USA
3 Department of Biostatistics, University of Washington, Seattle, WA, USA
4 Division of Biomedical Informatics, University of California, San Diego, La Jolla, CA, USA
5 Homer Warner Center for Informatics Research, Intermountain Healthcare, Salt Lake City, UT, USA
6 University of Utah, Salt Lake City, UT, USA

ABSTRACT
Purpose This study aimed to develop Natural Language Processing (NLP) approaches to supplement manual outcome validation, specifically to validate pneumonia cases from chest radiograph reports.
Methods We trained one NLP system, ONYX, using radiograph reports from children and adults that were previously manually reviewed. We then assessed its validity on a test set of 5000 reports. We aimed to substantially decrease manual review, not replace it entirely, and so we classified reports as follows: (1) consistent with pneumonia; (2) inconsistent with pneumonia; or (3) requiring manual review because of complex features. We developed processes tailored either to optimize accuracy or to minimize manual review. Using logistic regression, we jointly modeled sensitivity and specificity of ONYX in relation to patient age, comorbidity, and care setting. We estimated positive and negative predictive value (PPV and NPV) assuming pneumonia prevalence as in the source data.
Results Tailored for accuracy, ONYX identified 25% of reports as requiring manual review (34% of true pneumonias and 18% of non-pneumonias). For the remainder, ONYX's sensitivity was 92% (95% CI 90–93%), specificity 87% (86–88%), PPV 74% (72–76%), and NPV 96% (96–97%). Tailored to minimize manual review, ONYX classified 12% as needing manual review. For the remainder, ONYX had sensitivity 75% (72–77%), specificity 95% (94–96%), PPV 86% (83–88%), and NPV 91% (90–91%).
Conclusions For pneumonia validation, ONYX can replace almost 90% of manual review while maintaining low to moderate misclassification rates. It can be tailored for different outcomes and study needs and thus warrants exploration in other settings. Copyright 2013 John Wiley & Sons, Ltd.
KEY WORDS: pneumonia; Natural Language Processing; sensitivity; specificity; validity; pharmacoepidemiology
Received 13 July 2012; Revised 11 January 2013; Accepted 14 January 2013
*Correspondence to: S. Dublin, Group Health Research Institute, 1730 Minor Avenue, Suite 1600, Seattle, WA 98101-1448, USA. E-mail: dublin.s@ghc.org
This work was presented as an oral presentation at the HMO Research Network Annual Conference in Boston, Massachusetts on 24 March 2011, where it received an Early Career Investigator Award.

INTRODUCTION
Pneumonia is common and can have severe consequences in older adults. A growing literature suggests that some medications increase pneumonia risk.1-5 Pharmacoepidemiologic studies often identify pneumonia cases within large databases using International Classification of Diseases, version 9 (ICD-9) codes or the equivalent. However, ICD-9 codes lack accuracy for pneumonia: in validation studies, their sensitivity


has ranged from 48% to 80% and positive predictive value (PPV) from 73% to 81%.6,7 Misclassification of outcomes can limit statistical power and bias study results. Some studies have reviewed medical records to validate cases,1,3,8 but this approach is costly and time consuming. Automated methods for outcome validation would be very helpful and could also be used for clinical decision support or public health surveillance.
With the growing use of electronic medical records, automated outcome validation may be possible using Natural Language Processing (NLP), in which a computer processes free text to create structured variables. Pneumonia is suited to this approach because the diagnosis


requires a positive chest radiograph,9,10 and chest radiograph reports have a fairly standard format and language. Several studies have used NLP to identify pneumonia from clinical texts,11-14 with sensitivity from 64% to 100% and specificity from 85% to 99%. Most prior studies examined relatively few reports and included few pneumonia cases. Little is known about accuracy in the outpatient setting, where pneumonia is often diagnosed. No prior study evaluated NLP as a filter for chart review.
Our aim was to develop NLP approaches to validate pneumonia cases from electronic radiology reports. Because we knew that certain reports are challenging to classify, we aimed to replace a large portion of manual review with NLP, but not all. We used reports that were previously manually reviewed to train one NLP tool, ONYX,15 and assess its accuracy. We tailored our approach for different scenarios to explore trade-offs between efficiency and accuracy.

METHODS
Setting
Group Health (GH) is an integrated healthcare delivery system in the Northwest USA with extensive electronic health data. GH members have coverage through employer-based plans, individual plans, Medicare, and Medicaid. The racial and ethnic composition is similar to that of the surrounding region: 79% Caucasian, 3% African-American, 8% Asian/Pacific Islander, 1% Native American, 5% Hispanic, and 3% other race. This study was approved by the GH Human Subjects Review Committee with a waiver of consent.

Data sources
The gold standard measure of pneumonia came from medical record reviews performed for the Pneumonia Surveillance Study (PSS).16 Presumptive cases were identified from ICD-9 codes (480-487.0 or 507.0) for GH members of all ages between 1998 and 2004.16 Trained abstractors reviewed about 93 000 electronic chest radiograph reports (Table 1) to determine whether an infiltrate was present or the radiologist interpreted the report as showing pneumonia. To improve consistency and ensure that abnormal findings were likely to represent pneumonia, abstractors were given detailed instructions (manual available on request). For instance, infiltrates described as "streaky," "nodular," "mass-like," or "consistent with atelectasis" did not qualify. Inter-rater agreement was 94% (kappa 0.84) and intra-rater agreement 95% (kappa 0.87). For the current analyses, the gold standard for pneumonia was that a report described an infiltrate meeting study criteria or contained a radiologist's interpretation that pneumonia was present.
Table 1. Characteristics of the source population of radiology reports and the test set of 5000 reports used for validation

                               Source dataset*
Characteristic                 All reports     Positive for      Negative for      Test set‡
                               (N = 93 110)    pneumonia†        pneumonia†        (N = 5000)
                                               (N = 26 345)      (N = 66 765)
                               n (%)           n (%)             n (%)             n (%)
Patient age (years)
  0-4                          7130 (8)        3084 (12)         4046 (6)          423 (8)
  5-19                         7839 (8)        3426 (13)         4413 (7)          439 (9)
  20-44                        12 248 (13)     3311 (13)         8937 (13)         609 (12)
  45-64                        24 984 (27)     5840 (22)         19 144 (29)       1256 (25)
  65-74                        14 466 (16)     3538 (13)         10 928 (16)       840 (17)
  75-84                        17 940 (19)     4752 (18)         13 188 (20)       987 (20)
  85+                          8503 (9)        2394 (9)          6109 (9)          446 (9)
Pneumonia case†                26 345 (28)     26 345 (100)      0 (0)             2200 (44)
Comorbidities
  Congestive heart failure     14 424 (15)     3712 (14)         10 712 (16)       877 (18)
  Chronic lung disease         29 079 (31)     8240 (31)         20 839 (31)       1767 (35)
  Cancer                       12 285 (13)     3255 (12)         9030 (14)         747 (15)
Setting of care
  Outpatient                   86 028 (92)     23 926 (91)       62 102 (93)       4598 (92)
  Inpatient                    6366 (7)        2191 (8)          4175 (6)          362 (7)
  Missing                      716 (1)         228 (1)           488 (1)           40 (1)

*A dataset of 93 110 chest radiograph reports that were previously manually reviewed for a study of pneumococcal conjugate vaccine.16
†According to manual review.
‡Oversampled for true pneumonia cases (according to manual review) and comorbidities.


In addition to using PSS data, we also trained ONYX using reports from a second study that used similar methods.
Information about subjects' age, care setting (outpatient vs. inpatient), and comorbidities (congestive heart failure, chronic obstructive pulmonary disease (COPD), or cancer) came from GH automated data. Comorbidities were defined based on ICD-9 codes from the prior 12 months. We used automated data to identify the individual radiologist reading each report.
NLP tools: ONYX and ConText
ONYX is an open-source NLP system (available at http://code.google.com/p/onyx-mplus/) that integrates knowledge about syntax (the structure of sentences) and semantics (the meaning of words) to interpret free text and produce structured output.15 ONYX can be trained on documents from a particular domain (e.g., the pulmonary domain), and it helps the user create training cases. Its output is a set of concepts identified from individual sentences. For instance, from the phrase "ill-defined density in the right lower lobe," ONYX generates a concept of "localized infiltrate" with a location of "right lower lobe."
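As a rough illustration of the kind of per-sentence structured output just described, consider the following toy sketch. It is not ONYX's actual API; the pattern, function name, and concept labels are invented for the example, and ONYX derives such mappings from trained syntactic and semantic knowledge rather than a hand-written pattern.

```python
import re

# Toy stand-in for the structured output described above; ONYX uses
# trained syntactic/semantic models, not a single hand-written pattern.
FINDING = re.compile(
    r"(?:ill-defined density|infiltrate|opacity)"
    r"(?: in the (?P<location>[a-z ]+lobe))?",
    re.IGNORECASE,
)

def extract_concepts(sentence):
    """Map one report sentence to zero or more concept dictionaries."""
    return [
        {"concept": "localized infiltrate", "location": m.group("location")}
        for m in FINDING.finditer(sentence)
    ]

print(extract_concepts("Ill-defined density in the right lower lobe."))
```

The point of the sketch is only the shape of the output: one dictionary per detected concept, carrying normalized attributes such as anatomic location, which downstream decision rules can then consume.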
Decision rules are applied to ONYX's output to classify the report as a whole. We designed rules to classify reports as follows: (1) consistent with pneumonia; (2) inconsistent with pneumonia; or (3) needing manual review (when a report had certain prespecified complex features). Because studies' needs differ, we developed two classifiers. Both are based on the PSS abstraction manual, and both treat ONYX as a filter for chart review, but the extent of filtering differs. The first classifier is designed to prioritize accuracy (minimizing false positives and false negatives) at the cost of more reports needing manual review. This approach might be useful for a study with greater resources to support manual review. The second classifier is designed to minimize manual review, as might be preferred when resources are limited, although we thought this efficiency might come at the cost of higher misclassification.
The first classifier identifies reports as needing manual review if any of the following are present: (1) seemingly inconsistent statements about pneumonia (e.g., infiltrates present and also absent); (2) atelectasis and pneumonia identified in a single report; or (3) state-change language (e.g., infiltrate "improving" or "resolved"). Online Appendix Table 1 provides more detail. For the second classifier, we made the following changes: (1) reports containing inconsistent statements were classified on the basis of the most frequently occurring concept (pneumonia vs. not pneumonia); (2) reports with both atelectasis and pneumonia were classified as not pneumonia; and (3) ONYX's output was processed by an algorithm called ConText to more accurately detect negated and resolved findings and to distinguish clinical history from current findings.17
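The two rule sets can be sketched as a single function. This is a schematic under stated assumptions: the concept labels ("pneumonia", "no_pneumonia", "atelectasis", "improving", "resolved") stand in for ONYX's real per-sentence output, the tie-handling choice is ours, and state-change language is assumed to be largely resolved upstream by ConText in the second configuration.

```python
# Schematic of the two decision-rule layers described above. Concept
# labels are illustrative stand-ins, not ONYX's actual vocabulary.
def classify(concepts, minimize_review=False):
    pro = concepts.count("pneumonia")
    con = concepts.count("no_pneumonia")
    atelectasis = "atelectasis" in concepts
    state_change = any(c in ("improving", "resolved") for c in concepts)

    if minimize_review:
        # Classifier 2: resolve two of the complex patterns automatically
        # (state-change language assumed handled upstream by ConText).
        if atelectasis and pro:
            return "inconsistent with pneumonia"
        if pro and con:
            if pro == con:  # ties sent to review -- our choice, not the paper's
                return "manual review"
            return ("consistent with pneumonia" if pro > con
                    else "inconsistent with pneumonia")
    else:
        # Classifier 1: defer all complex patterns to a human.
        if (pro and con) or (atelectasis and pro) or state_change:
            return "manual review"

    return "consistent with pneumonia" if pro else "inconsistent with pneumonia"
```

The structural point is that the same extracted concepts feed both configurations; only the handling of the prespecified complex patterns differs.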
ONYX is based on a family of NLP systems12,18 developed largely for chest radiograph interpretation. Our initial training data came from these systems; it consisted of phrases extracted from chest radiograph reports and their relationship to relevant concepts. Next, ONYX was applied to 30 GH reports, and the results were manually reviewed and corrected. This process allows ONYX to learn from mistakes, creating new associations that are applied to future texts. Next, we applied ONYX to 1000 PSS reports and targeted training on the basis of ONYX's mistakes. In addition to editing errors in ONYX's output, we made changes to ONYX's processing engine, such as adding a new semantic or syntactic grammar rule. Finally, ONYX's performance was assessed in an independent test set of 5000 PSS reports oversampled for pneumonia (based on manual review) and comorbidities to improve the precision of estimates of the test statistics for patient subgroups.
Statistical analyses
For each classifier, we calculated the proportion of reports that ONYX classified as requiring manual review and then estimated sensitivity and specificity for the remaining reports. Sensitivity is the proportion of true pneumonia reports that ONYX correctly identified as showing pneumonia. Specificity is the proportion of non-pneumonia reports that ONYX correctly identified as not showing pneumonia. Because we wanted to examine how these measures varied by patient characteristics, we did not calculate them directly but instead modeled them using multivariable logistic regression. Specifically, we jointly modeled ONYX's true positive rate (sensitivity) and false positive rate (1 minus specificity) as a function of patient age, comorbidities, and care setting.19 Interaction terms between each characteristic and pneumonia status were used to facilitate estimation of both the true and false positive rates in a single model. We used weights to account for oversampling and generalized estimating equations to provide standard errors that account for potential correlation between multiple reports from the same patient.20 Using the coefficients from this model, we estimated ONYX's overall sensitivity and specificity in a population with the same characteristics as the source population (using the predictive margins method).21,22 Finally, from the estimated sensitivity
and specificity, we calculated PPV and NPV assuming pneumonia prevalence as in the source population. Specifying prevalence is important because PPV increases as prevalence goes up (the proportion of true positives increases relative to false positives); for similar reasons, NPV goes down as prevalence goes up.
We also estimated ONYX's performance for individual radiologists who had read at least 100 reports from the test set.
Results showed that Classifier 2 had markedly higher specificity and PPV than Classifier 1. This was in part expected because of the new decision rule that arbitrarily classified certain ambiguous reports as not showing pneumonia. As a post hoc analysis, we explored the factors contributing to Classifier 2's higher specificity by comparing the classifiers' performance on the subset of reports that both could classify and by examining Classifier 2's accuracy for the reports that only it was able to classify.
We examined the implications of our findings for a hypothetical study seeking to validate 10 000 potential pneumonia cases, assuming the same prevalence as the source population. We calculated the number of reports that would need manual review and the number of false negatives and false positives under three scenarios: manual review without ONYX's assistance, ONYX with Classifier 1 (tailored for accuracy), and ONYX with Classifier 2 (tailored to decrease manual review).
Analyses were performed using SAS software, version 9.2 (SAS Institute, Inc., Cary, NC).

RESULTS
Table 1 shows characteristics of chest radiograph reports in the source dataset of 93 110 reports and the test set of 5000 reports. Patients' mean age in the test set was 55 years, and 92% of reports came from the outpatient setting. Outpatient reports predominate because GH patients requiring hospitalization are cared for in outside hospitals, so their inpatient radiograph reports are not available.
When ONYX and the pneumonia classifier were tailored to maximize accuracy (Classifier 1), 25% of reports were identified as requiring manual review (34% of true pneumonias and 18% of non-pneumonias). This proportion varied by age, care setting, and presence of chronic lung disease (Online Appendix Figure 1). For the remaining reports, ONYX had an estimated sensitivity of 92% (95% CI, 90-93%), specificity 87% (86-88%), PPV 74% (72-76%), and NPV 96% (96-97%) (Figure 1(A)-(C)). Sensitivity differed slightly by age but not by comorbidity or care setting. Specificity was lowest in those 65 years or older and was lower for people with COPD than for those without COPD.
Tailored to minimize manual review (Classifier 2), ONYX classified 12% of reports as needing manual review, including 18% of true pneumonias and 7% of non-pneumonias. For the remaining reports, sensitivity was 75% (72-77%) and specificity 95% (94-96%). Sensitivity varied considerably by age but specificity varied little, and neither varied substantially by comorbidity or care setting (Figure 2(A)-(C)). The estimated PPV was 86% (83-88%) and NPV 91% (90-91%).

Figure 1. ONYX's performance under conditions aimed at improving accuracy (allows more reports to be designated as needing manual review): (A) specificity, (B) sensitivity, and (C) positive predictive value, shown overall and by age, congestive heart failure, chronic lung disease, cancer, and setting of care. *Compared with gold standard from manual review of reports. †Positive predictive value estimated assuming the prevalence of pneumonia in the source dataset (28%). ‡Comorbid conditions ascertained from health plan automated diagnosis data (International Classification of Diseases, version 9, codes)

Figure 2. ONYX's performance under conditions aimed at decreasing the proportion of reports needing manual review: (A) specificity, (B) sensitivity, and (C) positive predictive value, shown overall and for the same subgroups. *Compared with gold standard from manual review of reports. †Positive predictive value estimated assuming the prevalence of pneumonia in the source dataset (28%). ‡Comorbid conditions ascertained from health plan automated diagnosis data (International Classification of Diseases, version 9, codes)
Online Appendix Table 2 shows results of analyses exploring reasons for Classifier 2's higher specificity. Classifier 2 had high specificity both for the reports that could be classified by both classifiers and for the reports that only it could classify.
Figure 3 shows results for the 15 radiologists who had read at least 100 reports, graphed separately for each classifier. A few had particularly high (or low) values.
Table 2 shows the estimated outcomes if a hypothetical study used each of three approaches to validate 10 000 potential cases.
DISCUSSION
We found that one NLP tool, ONYX, accurately identified a large proportion of pneumonia cases from electronic chest radiograph reports and could dramatically reduce manual review for outcome validation. Because research projects have varying needs, we created two versions of our tools. The first (aiming to maximize accuracy) had sensitivity 92%, specificity 87%, and PPV 74%, but 25% of reports were designated as needing manual review. The second (aiming to decrease manual review) had sensitivity 75%, specificity 95%, and PPV 86%, with 12% of reports needing manual review. With this second classifier, ONYX could eliminate almost 90% of manual medical record review while maintaining a low false positive rate (5%) and moderate false negative rate (25%).

The amount of outcome misclassification that is acceptable will vary according to a project's goals and resources. In some contexts, a false negative rate of 25% may be too high. It could lead to underestimation of pneumonia incidence or selection of cases that are not representative. In these settings, our first classifier may be preferable, with its false negative rate of only 8%. The trade-off is that more resources will be needed for manual review. Still, the false negative rates for both classifiers should be placed in context: most pharmacoepidemiologic studies define pneumonia from ICD-9 codes or the equivalent, which have false negative rates of 20-52%.6,7 With NLP research proceeding at a rapid pace, the accuracy of tools like ONYX will probably continue to improve.
Differences in accuracy between Classifiers 1 and 2 likely stem from several causes, including the addition of ConText, an NLP algorithm designed to improve handling of negation and historical information. Adding ConText should improve both sensitivity and specificity. Another factor was that we changed a decision rule so that certain ambiguous reports were automatically classified as not pneumonia. This decision, which was somewhat arbitrary, favors specificity over sensitivity, because shifting reports into the "not pneumonia" category will inevitably result in more false negatives (reducing sensitivity) as well as more true negatives (improving specificity).
ONYX's accuracy varied modestly by patient age and comorbidity. For Classifier 1, sensitivity was consistently high for all groups. Specificity was more variable, with lower values for older people and those
with chronic lung disease. Classifier 2 had high specificity across all groups but lower sensitivity, especially in the oldest patients. Older patients and those with chronic lung disease may be more likely to have chronic lung abnormalities, and NLP may have difficulty distinguishing chronic, stable abnormal findings from acute infiltrates.

Table 2. Trade-offs between efficiency and accuracy in a hypothetical study with 10 000 potential cases and true pneumonia prevalence the same as in the source population*

                                            Manual review   ONYX with       ONYX with
                                            of all          Classifier 1    Classifier 2
                                            records†        (prioritizes    (prioritizes fewer
                                                            accuracy)       manual reviews)
Number of charts requiring manual review    10 000          2248            1008
  With true pneumonia                       2800            952             504
  Without pneumonia                         7200            1296            504
ONYX: total incorrectly classified          0               916             909
  False positives                           0               768             335
  False negatives                           0               148             574
ONYX: total correctly classified
(no manual review needed)                   0               6836            8083
  True positives                            0               1700            1722
  True negatives                            0               5136            6361

*Assumes that reports classified by NLP as either consistent or inconsistent with pneumonia do not undergo further manual review.
†Compared with the gold standard of manual review and thus, by definition, there are no false positives or negatives when all reports undergo manual review.

Figure 3. ONYX's accuracy for individual radiologists. (Limited to the 15 radiologists who read 100 or more reports in the test set. Each number represents one radiologist, and each radiologist is assigned the same number in both graphs.) (A) ONYX with Classifier 1 (prioritizes accuracy). (B) ONYX with Classifier 2 (prioritizes decreasing the amount of manual review). Thus, additional reports are included in the analyses for Figure 3(B) that were not included in Figure 3(A).
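Because Table 2 simply applies the reported deferral fractions, sensitivity, and specificity to 10 000 hypothetical reports at 28% prevalence, its counts can be re-derived in a few lines. This is a sketch; the function name and the rounding convention are ours.

```python
# Re-derives the Table 2 counts from the reported rates: deferral
# fractions among true pneumonias / non-pneumonias, then sensitivity
# and specificity applied to the reports ONYX classifies automatically.
def table2_counts(defer_pos, defer_neg, sens, spec, n=10_000, prev=0.28):
    pos, neg = n * prev, n * (1 - prev)
    review_pos = round(defer_pos * pos)    # true pneumonias sent to review
    review_neg = round(defer_neg * neg)    # non-pneumonias sent to review
    auto_pos = pos - review_pos            # pneumonias ONYX classifies
    auto_neg = neg - review_neg            # non-pneumonias ONYX classifies
    tp = round(sens * auto_pos)
    tn = round(spec * auto_neg)
    return {
        "manual_review": review_pos + review_neg,
        "true_positives": tp,
        "false_negatives": round(auto_pos) - tp,
        "true_negatives": tn,
        "false_positives": round(auto_neg) - tn,
    }

print(table2_counts(0.34, 0.18, 0.92, 0.87))  # Classifier 1
print(table2_counts(0.18, 0.07, 0.75, 0.95))  # Classifier 2
```

With the paper's rounded inputs, this reproduces the Classifier 1 and Classifier 2 columns of Table 2.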
Compared with prior NLP studies, our results indicate similar or somewhat better accuracy, although it is difficult to compare studies because of differences in the gold standards and patient populations, including pneumonia prevalence (which influences PPV and NPV). Also, prior studies used NLP to classify all reports, even those with complex features that our study diverted for manual review. This difference may make our tool's accuracy appear higher. Mendonca et al. applied an NLP system, MedLEE, to chest radiograph reports from the neonatal intensive care unit.14 Sensitivity was 71%, specificity 99%, and PPV only 7.5% compared with medical record review. The low PPV arose because pneumonia prevalence was very low (2%) and also because the gold standard required signs and symptoms not captured on chest radiographs (and thus not available to MedLEE). Fiszman et al. applied SymText to 292 reports (pneumonia prevalence, 38%) and reported accuracy similar to ours: sensitivity 95%, specificity 85%, and PPV 78%.12 Their gold standard was physicians' interpretation of the reports. Elkin et al. applied the Multithreaded Clinical Vocabulary Server to 400 reports (pneumonia prevalence, 3.5%).13 Compared with physician review of the reports, NLP had sensitivity of 100%, specificity 90%, and PPV 70%. Prior studies included relatively few reports. Most did not describe the care setting (inpatient vs. outpatient), and none reported accuracy for subgroups.
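Because such cross-study comparisons hinge on prevalence, the standard back-calculation of predictive values from sensitivity, specificity, and prevalence is worth keeping at hand. A minimal sketch, using this paper's reported estimates and its 28% source prevalence as inputs:

```python
# Standard Bayes'-rule back-calculation of predictive values from
# sensitivity, specificity, and prevalence.
def ppv(sens, spec, prev):
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def npv(sens, spec, prev):
    return spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

print(round(ppv(0.92, 0.87, 0.28), 2), round(npv(0.92, 0.87, 0.28), 2))  # Classifier 1
print(round(ppv(0.75, 0.95, 0.28), 2), round(npv(0.75, 0.95, 0.28), 2))  # Classifier 2
```

These crude calculations land within about a point of the paper's reported PPV and NPV (74% and 96% for Classifier 1; 86% and 91% for Classifier 2), which were estimated from the regression model rather than from this formula.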
Strengths of our study include the large test set and the availability of clinical information including age, comorbidity, and care setting. These resources allowed more precise estimates of accuracy as well as subgroup-specific estimates. The preponderance of outpatient exams makes our findings more relevant to the general population. To our knowledge, ours is the first study to explore tailoring NLP tools to different study needs.
Thus far, we have applied NLP only to chest radiograph reports and not to other clinical data relevant to pneumonia. Radiologists' interpretation of chest radiographs can be inconsistent, with studies reporting kappa of 0.37 for infiltrate in adults23 and 0.58 in children.24 Established case definitions, for example, the CDC/National Healthcare Safety Network definition for hospital-acquired pneumonia,10 typically require a positive radiograph in addition to clinical symptoms and findings. One could argue that NLP should draw on symptoms, vital signs, and laboratory findings in addition to radiology reports. Murff et al. took such an approach.11 Compared with medical record review, their NLP algorithms had sensitivity of 64% (95% CI, 58-70%) and specificity of 95% (94-96%) for post-operative pneumonia. While a multifaceted approach may be desirable, challenges to implementation exist, including that many of the required data are rarely measured in outpatients. Still, a crucial building block is an accurate method for classifying chest radiographs, which are central to most pneumonia definitions.
Our study has limitations. We measured comorbid illnesses from administrative data, which do not have perfect accuracy. Our data included relatively few inpatient cases, resulting in less precise estimates for this subgroup. Reports came from a single health-care system, which may limit generalizability. Radiologists from different institutions or geographic regions may use different language, and so an important next step will be to assess transferability; we plan to assess our NLP tools' accuracy for reports from other institutions and for other conditions. In the current study, considerable effort was needed to construct a classifier to create a binary pneumonia variable from ONYX's output. We developed a rule-based classifier on the basis of expert opinion, but machine learning is an alternative approach that could improve efficiency. It would also be valuable to explore the use of machine learning alone, without an NLP tool such as ONYX, to detect pneumonia from radiograph reports. Schuemie et al. recently described such efforts; they found that the best approaches achieved sensitivity of 82-93% with PPV 81-90%.25 We currently use only features of the radiology report to determine which reports need manual review. Future work could explore whether incorporating patient characteristics (e.g., age or comorbidity) can improve this determination.
In conclusion, vast amounts of clinical data are becoming available in free text within electronic medical records. They could be valuable for many purposes but currently are expensive and time consuming to access. New technologies such as NLP offer tremendous opportunities. Studies such as ours provide insight into the potential of NLP to improve research and clinical care.

CONFLICT OF INTEREST
Dr. Dublin has received a Merck/American Geriatrics Society New Investigator Award.

KEY POINTS
- When disease outcomes are identified from diagnosis codes in automated databases, misclassification can limit statistical power and bias results.
- Natural Language Processing (NLP) can extract information from free-text electronic medical records in an automated fashion, which could improve validity and efficiency.
- We found that one NLP system, ONYX, works well to identify pneumonia from free-text radiology reports: after training, ONYX could replace nearly 90% of manual medical record review, with sensitivity 75%, specificity 95%, and positive predictive value 86%.
- ONYX is available open-source and can be adapted for different outcomes and study needs.

ACKNOWLEDGEMENT
We thank Dr. Lisa Jackson, Principal Investigator of the PSS, for sharing the PSS data and commenting on an early draft of the manuscript.
Dr. Dublin was funded by a Paul Beeson Career Development Award from the National Institute on Aging (grant K23AG028954), by the Branta Foundation, and by Group Health Research Institute internal funds. The Beeson award is also supported by the Hartford and Starr Foundations and Atlantic Philanthropies. Dr. Carrell was funded by National Cancer Institute grant RC1CA146917. Dr. Chapman was funded by National Institutes of Health grants 5R01GM090187 and U54HL108460. Group Health Research Institute internal funds covered the data collection and analysis.
This work does not necessarily reflect the views of the National Institute on Aging or the National Institutes of Health.
SUPPORTING INFORMATION
Additional supporting information may be found in the online version of this article.
Online Appendix Table 1. Pneumonia classifier: further explanation of criteria used to identify reports needing manual review
Online Appendix Table 2. Factors contributing to changes in performance measures between Classifier 1 and Classifier 2
Online Appendix Figure 1. The proportion of reports that ONYX could classify (i.e., not needing manual review), overall and for subgroups, under conditions aimed at optimizing accuracy, stratified by true pneumonia status (from manual review)

REFERENCES

1. Trifiro G, Gambassi G, Sen EF, et al. Association of community-acquired pneumonia with antipsychotic drug use in elderly patients: a nested case–control study. Ann Intern Med 2010; 152: 418–425, W139-40. DOI:10.1059/0003-4819-152-7-201004060-00006.
2. Dublin S, Walker RL, Jackson ML, et al. Use of opioids or benzodiazepines and risk of pneumonia in older adults: a population-based case–control study. J Am Geriatr Soc 2011; 59: 1899–1907. DOI:10.1111/j.1532-5415.2011.03586.x.
3. Laheij RJ, Sturkenboom MC, Hassing RJ, et al. Risk of community-acquired pneumonia and use of gastric acid-suppressive drugs. JAMA 2004; 292: 1955–1960.
4. Gulmez SE, Holm A, Frederiksen H, et al. Use of proton pump inhibitors and the risk of community-acquired pneumonia: a population-based case–control study. Arch Intern Med 2007; 167: 950–955.
5. Herzig SJ, Howell MD, Ngo LH, et al. Acid-suppressive medication use and the risk for hospital-acquired pneumonia. JAMA 2009; 301: 2120–2128. DOI:10.1001/jama.2009.722.
6. Aronsky D, Haug PJ, Lagor C, et al. Accuracy of administrative data for identifying patients with pneumonia. Am J Med Qual 2005; 20: 319–328. DOI:10.1177/1062860605280358.
7. van de Garde EM, Oosterheert JJ, Bonten M, et al. International classification of diseases codes showed modest sensitivity for detecting community-acquired pneumonia. J Clin Epidemiol 2007; 60: 834–838. DOI:10.1016/j.jclinepi.2006.10.018.
8. Jackson ML, Nelson JC, Weiss NS, et al. Influenza vaccination and risk of community-acquired pneumonia in immunocompetent elderly people: a population-based, nested case–control study. Lancet 2008; 372: 398–405. DOI:10.1016/S0140-6736(08)61160-5.
9. Mandell LA, Wunderink RG, Anzueto A, et al. Infectious Diseases Society of America/American Thoracic Society consensus guidelines on the management of community-acquired pneumonia in adults. Clin Infect Dis 2007; 44 Suppl 2: S27–72. DOI:10.1086/511159.
10. Horan TC, Andrus M, Dudeck MA. CDC/NHSN surveillance definition of health care-associated infection and criteria for specific types of infections in the acute care setting. Am J Infect Control 2008; 36: 309–332. DOI:10.1016/j.ajic.2008.03.002.
11. Murff HJ, FitzHenry F, Matheny ME, et al. Automated identification of postoperative complications within an electronic medical record using Natural Language Processing. JAMA 2011; 306: 848–855. DOI:10.1001/jama.2011.1204.
12. Fiszman M, Chapman WW, Aronsky D, et al. Automatic detection of acute bacterial pneumonia from chest X-ray reports. J Am Med Inform Assoc 2000; 7: 593–604.
13. Elkin PL, Froehling D, Wahner-Roedler D, et al. NLP-based identification of pneumonia cases from free-text radiological reports. AMIA Annu Symp Proc 2008: 172–176.
14. Mendonca EA, Haas J, Shagina L, et al. Extracting information on pneumonia in infants using Natural Language Processing of radiology reports. J Biomed Inform 2005; 38: 314–321. DOI:10.1016/j.jbi.2005.02.003.
15. Christensen LM, Harkema H, Haug P, et al. ONYX: a system for the semantic analysis of clinical text. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing. Association for Computational Linguistics: Boulder, CO, 2009; pp. 19–27.
16. Nelson JC, Jackson M, Yu O, et al. Impact of the introduction of pneumococcal conjugate vaccine on rates of community acquired pneumonia in children and adults. Vaccine 2008; 26: 4947–4954. DOI:10.1016/j.vaccine.2008.07.016.
17. Harkema H, Dowling JN, Thornblade T, et al. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform 2009; 42: 839–851. DOI:10.1016/j.jbi.2009.05.002.
18. Christensen L, Haug PJ, Fiszman M. MPLUS: a probabilistic medical language understanding system. In Proceedings of the Association for Computational Linguistics 2002 Workshop on Natural Language Processing in the Biomedical Domain. Johnson S (Ed). Association for Computational Linguistics 2002: 29–36.
19. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press: Oxford, 2003.
20. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986; 73: 13–22.
21. Lane PW, Nelder JA. Analysis of covariance and standardization as instances of prediction. Biometrics 1982; 38: 613–621.
22. Graubard BI, Korn EL. Predictive margins with survey data. Biometrics 1999; 55: 652–659.
23. Albaum MN, Hill LC, Murphy M, et al. Interobserver reliability of the chest radiograph in community-acquired pneumonia. PORT Investigators. Chest 1996; 110: 343–350.
24. Hansen J, Black S, Shinefield H, et al. Effectiveness of heptavalent pneumococcal conjugate vaccine in children younger than 5 years of age for prevention of pneumonia: updated analysis using World Health Organization standardized interpretation of chest radiographs. Pediatr Infect Dis J 2006; 25: 779–781. DOI:10.1097/01.inf.0000232706.35674.2f.
25. Schuemie MJ, Sen E, 't Jong GW, et al. Automating classification of free-text electronic health records for epidemiological studies. Pharmacoepidemiol Drug Saf 2012. DOI:10.1002/pds.3205.

Copyright © 2013 John Wiley & Sons, Ltd.

