
The development of a data-matching algorithm to define the 'case patient'


Shelley Cox (1,2) BA, BSc (Hons), PhD, Senior Research Fellow (Trauma)
Rohan Martin (1) BHlthSc, VACIS Manager
Piyali Somaia (1) BComm (Hons), BSci, Performance Analyst
Karen Smith (1) BSc (Hons), Grad Dip Epi Biostats, Grad Cert Exec BA, PhD, Manager, Research & Evaluation

(1) Ambulance Victoria, 375 Manningham Rd, Doncaster, VIC 3108, Australia. Email: Rohan.Martin@ambulance.vic.gov.au, Piyali.Somaia@ambulance.vic.gov.au, Karen.Smith@ambulance.vic.gov.au
(2) Corresponding author. Email: Shelley.Cox@ambulance.vic.gov.au
Abstract
Objectives. To describe a model that matches electronic patient care records within a given case to one or more patients within that case.
Method. This retrospective study included data from all metropolitan Ambulance Victoria electronic patient care records (n = 445 576) for the time period 1 January 2009–31 May 2010. Data were captured via VACIS (Ambulance Victoria, Melbourne, Vic., Australia), an in-field electronic data capture system linked to an integrated data warehouse database. The case patient algorithm included Jaro-Winkler, Soundex and weight matching conditions.
Results. The case patient matching algorithm has a sensitivity of 99.98%, a specificity of 99.91% and an overall accuracy of 99.98%.
Conclusions. The case patient algorithm provides Ambulance Victoria with a sophisticated, efficient and highly accurate method of matching patient records within a given case. This method has applicability to other emergency services where unique identifiers are case based rather than patient based.
What is known about the topic? Accurate pre-hospital data that can be linked to patient outcomes is widely accepted as critical to support pre-hospital patient care and system performance.
What does this paper add? There is a paucity of literature describing electronic matching of patient care records at the patient level rather than the case level. Ambulance Victoria has developed a complex yet efficient and highly accurate method for electronically matching patient records, in the absence of a patient-specific unique identifier. Linkage of patient information from multiple patient care records to determine if the records are for the same individual defines the 'case patient'.
What are the implications for practitioners? This paper describes a model of record linkage where patients are matched within a given case at the patient level as opposed to the case level. This methodology is applicable to other emergency services where unique identifiers are case based.
Additional keywords: ambulance, deterministic linkage, electronic patient care record, pre-hospital, probabilistic linkage, record linkage.
Received 19 March 2012, accepted 2 July 2012, published online 21 December 2012
Australian Health Review, 2013, 37, 54–59. Feature: Health Services Research. http://dx.doi.org/10.1071/AH11161
Journal compilation © AHHA 2013. CSIRO Publishing. www.publish.csiro.au/journals/ahr
Introduction
Accurate emergency medical services data that can be linked to patient outcomes are essential to support pre-hospital patient care, and promote ongoing quality improvement to ensure system performance evolves and improves over time.[1,2] In 2006 the Metropolitan Ambulance Service[A] completed implementation of VACIS (Ambulance Victoria, Melbourne, Vic., Australia), an in-field electronic data capture system and linked database that possibly represents the most comprehensive pre-hospital database worldwide. Ambulance paramedics record clinical and operational data for all emergency incidents as electronic patient care records (ePCRs), via VACIS, and these data are stored in an integrated data warehouse.[2,3]
[A] On 1 July 2008 the Metropolitan Ambulance Service, Rural Ambulance Victoria and Alexandra and District Ambulance Service merged to form Ambulance Victoria. All analyses described in this document were performed exclusively on metropolitan data prior to the merge.
Victoria follows a two-tiered emergency medical response model, which includes over 2000 paramedics authorised to practise some advanced life support techniques, as well as over 400 mobile intensive care ambulance (MICA) paramedics.[2,3] The nature of this model means that more than one ambulance crew can attend a single incident (or case), and there can be more than one VACIS ePCR per patient. There can also be multiple patients per case with more than one ePCR each. Following the implementation of VACIS, the ambulance service recognised the need to be able to electronically match ePCRs at the patient level rather than at the case level for the purposes of data analysis and clinical reporting. This requires a set of complex matching algorithms to allow ePCRs within a case to be matched to one or more patients within that case.
There are several different recognised record linkage methods. The two most commonly used methods for electronic data linkage are the deterministic and probabilistic approaches. The deterministic approach links records based on an exact match of a unique identifier or a set of common identifiers.[4] Deterministic linkage often performs poorly for non-unique identifiers such as name and date of birth. Probabilistic linkage bases record linkage decisions on statistical probabilities that two or more records are likely matches, using a set of common identifiers.[4] This method can be more sensitive than deterministic linkage and can be used to match within and across cases.
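The contrast between the two approaches can be illustrated with a minimal Python sketch. The deterministic function links two records only when every identifier agrees exactly, whereas the probabilistic function sums log-likelihood agreement and disagreement weights in the style of the Fellegi-Sunter model. The field names and the m- and u-probabilities are hypothetical values chosen for illustration; they are not taken from this study.

```python
import math

# Hypothetical m-probabilities (agreement given a true match) and
# u-probabilities (agreement given a non-match); illustrative values only.
FIELD_PROBS = {
    "family_name": (0.95, 0.01),
    "given_name": (0.90, 0.02),
    "dob": (0.97, 0.001),
}

def deterministic_match(rec_a, rec_b):
    """Link only when every identifier agrees exactly."""
    return all(rec_a[field] == rec_b[field] for field in FIELD_PROBS)

def probabilistic_score(rec_a, rec_b):
    """Sum log2 agreement or disagreement weights over all identifiers."""
    score = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if rec_a[field] == rec_b[field]:
            score += math.log2(m / u)              # agreement adds evidence
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return score

rec_1 = {"family_name": "SMITH", "given_name": "JOHN", "dob": "1970-01-01"}
rec_2 = {"family_name": "SMITH", "given_name": "JON", "dob": "1970-01-01"}

print(deterministic_match(rec_1, rec_2))   # a single typo defeats the exact match
print(probabilistic_score(rec_1, rec_2))   # the total score remains positive
```

In this sketch a single misspelt given name prevents a deterministic link, while the probabilistic score stays well above zero because the other identifiers agree.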
Record linkage techniques are widely used in epidemiological studies to link records between two or more databases,[4–11] but there is little literature describing matching techniques within one database. The challenges associated with data linkage within one database are similar to those encountered with data linkage between databases. However, in the absence of a unique patient identifier, matching pre-hospital patient records at the patient level rather than the case level is a critical first step in ensuring the accuracy of pre-hospital patient records before attempting to link records between services (e.g. linking pre-hospital data with hospital data).
The current paper describes the development of a model designed to match ePCRs within a given case to one or more patients within that case. The matching of patient information from multiple ePCRs to determine if the records are for the same individual defines the 'case patient'.
Methods
Study setting and population
The study was conducted in the Australian state of Victoria, which covers ~227 416 km²[12] and has a population of 5.44 million people.[13] Ambulance Victoria attended 422 984 cases between 1 January 2009 and 31 May 2010. There were 430 132 case patients and 445 576 ePCRs completed for these patients. From the 422 984 cases attended by the ambulance service, there were 341 032 transports. This study was approved by Ambulance Victoria as an internal quality assurance process.
Data capture
Data were captured via VACIS, an in-field electronic data capture system that is linked to an integrated Clinical Data Warehouse database.[2,3] Data-matching variables included patient given name, patient family name, gender, age (years), date of birth (day/month/year), and street and suburb from the patient's residential address. Two additional data-matching variables were used, namely 'patient treated by other team' and 'patient transported by other team'. When more than one ambulance crew attends a given case there can only be one transporting ambulance for a given patient, and there are circumstances in which not all crews who attend a patient actually treat that patient. If an ambulance crew neither treats nor transports a patient they have attended, they are required to document the reason why on their ePCR. Take, for example, an ambulance crew who attends and treats a patient but does not transport that patient to hospital. This non-transporting ambulance crew is required to document the name of the transporting ambulance crew on their ePCR under the section 'patient transported by other team'. This information is then used in the case patient matching rules to match the ambulance crew team name on the corresponding transporting team's ePCR. The same process is required for teams who attend a patient but do not actually treat or transport that patient.
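The cross-reference between a non-transporting crew's ePCR and the transporting crew's ePCR can be sketched as follows. This is a minimal illustration only: the dictionary keys (`case_id`, `team_name`, `transported_by_other_team`) and the team names are hypothetical and do not reflect the actual VACIS data model.

```python
def linked_by_other_team(non_transporting, transporting):
    """True when the team named on one ePCR authored the other ePCR in the same case."""
    named_team = non_transporting.get("transported_by_other_team")
    return (
        named_team is not None
        and non_transporting["case_id"] == transporting["case_id"]
        and named_team == transporting["team_name"]  # exact match on team name
    )

# Hypothetical ePCRs from two crews attending the same case.
epcr_treating = {"case_id": 1001, "team_name": "TEAM_A",
                 "transported_by_other_team": "TEAM_B"}
epcr_transporting = {"case_id": 1001, "team_name": "TEAM_B",
                     "transported_by_other_team": None}

print(linked_by_other_team(epcr_treating, epcr_transporting))
```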
Matching conditions
The case patient algorithms use a combination of deterministic and probabilistic matching conditions to optimise the number of true patient matches. In instances where two or more ePCRs do not match using one particular matching condition, they may match using a different matching condition. The conditions used in the case patient matching algorithms include Jaro-Winkler, Soundex and weight matching.
Jaro-Winkler matching condition
Jaro-Winkler (JW) scores are a measure of similarity between two letter strings, based on spelling rather than phonetics. JW matches letter strings based on their similarity value using an algorithm that calculates the number of deletions, insertions and substitutions required to transform one letter string into another.[14–17] JW accounts for the length of the letter strings and penalises more for errors at the beginning of a letter string, as well as recognising typographical errors. In contrast to exact matching, standardised exact matching eliminates case (uppercase and lowercase), spaces and non-alphanumeric characters before comparing for an exact match.[15] The higher the JW score for two letter strings, the more similar the letter strings are. For example, the names 'MARTHA' and 'MARHTA' have a JW score of 0.961 or 96%, whereas the names 'JONES' and 'JOHNSON' have a JW score of 0.832 or 83%.[15]
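The two example scores can be reproduced with a short Python sketch of the commonly published Jaro-Winkler formulation, in which the Jaro similarity is computed from matching characters and transpositions and then boosted for a shared prefix. The prefix scaling factor of 0.1 and the four-character prefix limit are the conventional defaults and are assumptions here; the exact variant used by the matching software in this study is not stated in the text.

```python
def jaro(s1, s2):
    """Jaro similarity from matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unmatched, identical character within the window.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Boost the Jaro score for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
print(round(jaro_winkler("JONES", "JOHNSON"), 3))  # 0.832
```

The prefix boost is what makes errors near the start of a name more costly than errors near the end.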
Soundex matching condition
Soundex is a phonetic algorithm for matching names by sound or pronunciation in English, as opposed to matching by spelling. The goal is for names with the same pronunciation to be encoded to the same representation so that they can be matched despite minor differences in spelling.[18] Soundex uses the first letter of a name followed by three digits to form a code. The digits are formed by dropping vowels, as well as 'h', 'w' and 'y', and mapping the remaining letters according to the following code: 1 = b, f, p, v; 2 = c, g, j, k, q, s, x, z; 3 = d, t; 4 = l; 5 = m, n; 6 = r.[18,19] If two or more consecutive letters have the same code, only the first is used. If the process runs out of letters, zero (0) is used. For example, 'SMITH' and 'SMYTH' both result in a Soundex code of S530, which is a positive match. The names 'JONES' and 'JOHNSON' result in different Soundex codes (J520 and J525) and are therefore not a match.[18,19]
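The coding rules above can be sketched in Python as follows. This sketch follows the standard American Soundex convention, in which vowels and 'y' separate repeated codes but 'h' and 'w' do not; that detail is an assumption, because the description above does not specify it, and the matching software used in the study may differ.

```python
# Letter-to-digit mapping taken from the coding scheme described above.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Return the four-character Soundex code for a name."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit:
            if digit != previous:   # keep only the first of repeated codes
                code += digit
            previous = digit
        elif ch not in "HW":        # vowels and Y break a run of equal codes
            previous = ""
        if len(code) == 4:
            break
    return code.ljust(4, "0")       # pad with zeros when letters run out

for name in ("SMITH", "SMYTH", "JONES", "JOHNSON"):
    print(name, soundex(name))
```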
Weight matching condition
The sum of all weights is used to derive a total score that is compared with upper and lower thresholds to determine matches, possible matches or non-matches.[18] A weight match rule is most useful with a large number of attributes, and can prevent a single attribute from invalidating a match. Each attribute is compared using a similarity algorithm that returns a score between 0 and 100 to represent the similarity between two letter strings. For the letter strings to be considered a match, the total score must be equal to or greater than a pre-designated overall 'weight to match' score.[18]
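A weight match of this kind might be sketched as below. The way each attribute's similarity is scaled by its weight, the use of `difflib.SequenceMatcher` as the similarity function, the attribute weights and both thresholds are assumptions made for illustration; the text does not describe how the matching software combines the scores internally.

```python
from difflib import SequenceMatcher

# Hypothetical attribute weights and thresholds, for illustration only.
WEIGHTS = {"family_name": 20, "given_name": 20, "street": 12, "suburb": 12}
LOWER_THRESHOLD = 35
UPPER_THRESHOLD = 50

def similarity(a, b):
    """Similarity from 0 to 100 (difflib ratio used as a stand-in)."""
    return 100 * SequenceMatcher(None, a.upper(), b.upper()).ratio()

def weighted_score(rec_a, rec_b, weights=WEIGHTS):
    """Sum each attribute's weight scaled by its similarity score."""
    return sum(weight * similarity(rec_a[field], rec_b[field]) / 100
               for field, weight in weights.items())

def classify(score, lower=LOWER_THRESHOLD, upper=UPPER_THRESHOLD):
    """Compare the total score with the lower and upper thresholds."""
    if score >= upper:
        return "match"
    if score >= lower:
        return "possible match"
    return "non-match"

rec_1 = {"family_name": "SMITH", "given_name": "JOHN",
         "street": "HIGH ST", "suburb": "DONCASTER"}
rec_2 = dict(rec_1, given_name="JON")   # one attribute misspelt
print(classify(weighted_score(rec_1, rec_2)))
```

Because the total is a sum over attributes, one imperfect attribute lowers the score without ruling out the match on its own.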
Business rules
There are nine levels of matching that define the 'case patient' and a total of 116 matching rules. Table 1 presents a summary of the case patient business rules. The process of matching ePCR patients is hierarchical; that is, an attempt is made to match using the level 1 rules and, if a match cannot be made, an attempt is made to match using the level 2 rules, and so on until all matching rules have been exhausted. This process is presented schematically in Fig. 1. The matching level and rule number assigned to the case patient reflects the worst-case match for all ePCRs matched to the case patient. For example, if ePCR one matches at level 1 rule 1, and ePCR two matches at level 5 rule 1, the matching level and rule assigned to the case patient is level 5 rule 1.

Table 1. Summary of case patient business rules
N/A, not applicable; JW, Jaro-Winkler; DOB, date of birth; ePCR, electronic patient care record

Matching level | Rules | Variables and matching conditions | Summary
Level 0 | N/A | N/A | Includes all patients with just one ePCR
Level 1 | 2 | Family name = Standardised exact; Given name = Standardised exact; Age = Exact or DOB (all three elements) = Exact; Gender = Exact | Identifying variables must be present and match exactly to satisfy level 1
Level 2 | 4 | Family name = Soundex or JW ≥ 85; Given name = Soundex or JW ≥ 85; Age = ±1 year or DOB year = ±1 year; Gender = Exact; Street = Soundex or JW ≥ 85; Suburb = Soundex or JW ≥ 85 | Each rule matches on five data variables. All five variables must be present and match according to the matching conditions for a case to satisfy level 2
Level 3 | 90 | Family name = Soundex or JW ≥ 75 or weight = 20; Given name = Soundex or JW ≥ 75 or weight = 20; DOB day = Exact; DOB month = Exact; DOB year = Exact; Age = ±1 year or weight = 20; Gender = Exact or weight = 15; Street = Soundex or JW ≥ 75 or weight = 12; Suburb = Soundex or JW ≥ 75 or weight = 12; Patient treated/transported by other team: other team name = Exact match | Each rule matches on at least six data variables. All six variables must be present and match according to the matching conditions for a case to satisfy level 3. Use age only if DOB is not recorded on one or more ePCRs
Level 4 | 15 | Family name = Soundex or JW ≥ 75; Given name = Soundex or JW ≥ 75; Age = ±1 year or weight = 20; Gender = Exact; Street = Soundex or JW ≥ 75; Suburb = Soundex or JW ≥ 75; Patient treated/transported by other team: other team name = Exact match | Each rule matches on four data variables. All four variables must be present and match according to the matching conditions for a case to satisfy level 4
Level 5 | 1 | Patient treated/transported by other team: other team name = Exact match | This rule applies when one ePCR has at least given name and family name recorded and the other ePCR(s) do(es) not
Level 6 | 1 | Patient treated/transported by other team: other team name = Exact match | This rule applies when none of the ePCRs have given name and family name recorded
Level 7 | 1 | N/A | Includes cases where there are multiple distinct patients per case that do not match on any other level. Given name, family name and gender must be present to be assigned level 7
Level 8 | 1 | N/A | Includes all unknown patients that have not been matched elsewhere, with at least one conflicting identifying data item
Level 9 | 1 | N/A | Includes all unknown patients that have not been matched elsewhere, with no conflicting identifying data items

Fig. 1. Process flow diagram of case patient matching.[20]
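The hierarchical process and the worst-case level assignment described in this section can be sketched in Python. Only two simplified rules are shown (a level 1 style standardised exact match and a level 5 style 'other team' match); the record fields, the rule functions and the sample records are hypothetical and stand in for the 116 production rules, which are not reproduced here.

```python
import re

def standardise(text):
    """Drop case, spaces and non-alphanumeric characters before comparing."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())

def level_1(a, b):
    """Names present and standardised-exact; gender and age exact."""
    names_agree = all(
        a[f] and b[f] and standardise(a[f]) == standardise(b[f])
        for f in ("family_name", "given_name")
    )
    return names_agree and a["gender"] == b["gender"] and a["age"] == b["age"]

def level_5(a, b):
    """One ePCR names the team that authored the other ePCR."""
    return ((a["other_team"] is not None and a["other_team"] == b["team"])
            or (b["other_team"] is not None and b["other_team"] == a["team"]))

LEVELS = [(1, level_1), (5, level_5)]   # tried in order, strictest first

def match_level(a, b):
    """Return the first (best) level at which two ePCRs match, else None."""
    for level, rule in LEVELS:
        if rule(a, b):
            return level
    return None

def case_patient_level(levels):
    """The case patient takes the worst (highest-numbered) level among its ePCRs."""
    return max(levels)

anchor = {"family_name": "O'Brien", "given_name": "Mary", "gender": "F",
          "age": 70, "team": "T1", "other_team": None}
epcr_one = {"family_name": "OBRIEN", "given_name": "mary", "gender": "F",
            "age": 70, "team": "T2", "other_team": None}
epcr_two = {"family_name": "", "given_name": "", "gender": "F",
            "age": 70, "team": "T3", "other_team": "T1"}

levels = [match_level(anchor, epcr_one), match_level(anchor, epcr_two)]
print(levels, case_patient_level(levels))   # [1, 5] 5
```

As in the worked example in the text, one ePCR matching at level 1 and another at level 5 leaves the case patient recorded at level 5.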
Statistical analyses
Statistical analyses were performed using SPSS Version 17.0 for Windows (SPSS Inc., Chicago, IL). Data are summarised as counts and percentages. The evaluation of data matching relies on four different values: the number of records linked correctly (true positives), the number of records linked incorrectly (false positives), the number of records unlinked correctly (true negatives) and the number of records unlinked incorrectly (false negatives).[10]
Results
The number of true positive matches was calculated using case patient matching levels 0 through 7. A single (n = 1) false positive match was identified and this occurred at level 5. True negative matches were calculated by adding the number of cases at level 8 and level 9 (i.e. unknown patients) (n = 1110 + 47 = 1157). There were n = 60 false negative cases identified at level 7. Sensitivity (a/(a+c)),[B] specificity (d/(b+d)),[2] positive predictive value (a/(a+b)),[2] negative predictive value (d/(d+c))[2] and accuracy ((a+d)/(a+b+c+d))[2] were calculated for the case patient matching levels overall.
Table 2 presents the number of cases, patients and ePCRs for each case patient matching level. There were a total of 430 132 case patient matches made for the time period 1 June 2009–31 May 2010. Ninety-four percent of cases involved just one patient and one ePCR (level 0).
Sensitivity was calculated using levels 0 through 7. There were n = 428 821 patients identified. Within level 7 there were n = 60 patients who were double counted (or unlinked incorrectly) and these were considered false negatives. Therefore there were n = 428 761 true positive case patient matches. The sensitivity of the case patient rules is 99.98%.
Specificity was calculated using the total number of unknown patients in levels 8 and 9 as the numerator (n = 1157), with the denominator formed by adding the number of incorrect patient matches (n = 1) from levels 0 through 7 (1157 + 1 = 1158). The specificity of the case patient matching rules is 99.91%. Positive predictive value is 99.99% and negative predictive value is 95.06%. The overall accuracy of the case patient matching rules is 99.98%.
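The reported performance measures can be checked from the four counts given above (a = 428 761 true positives, b = 1 false positive, c = 60 false negatives, d = 1157 true negatives). The short Python calculation below reproduces the published percentages when each value is truncated, rather than rounded, to two decimal places; the use of truncation is an inference from the arithmetic and is not stated in the text.

```python
import math

# Counts reported in the Results: a = true positives, b = false positives,
# c = false negatives, d = true negatives.
a, b, c, d = 428_761, 1, 60, 1_157

def truncated_pct(proportion):
    """Express a proportion as a percentage truncated to two decimal places."""
    return math.floor(proportion * 10_000) / 100

sensitivity = truncated_pct(a / (a + c))
specificity = truncated_pct(d / (b + d))
ppv = truncated_pct(a / (a + b))
npv = truncated_pct(d / (d + c))
accuracy = truncated_pct((a + d) / (a + b + c + d))

print(sensitivity, specificity, ppv, npv, accuracy)
# 99.98 99.91 99.99 95.06 99.98
```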
Discussion
This study described and evaluated a model designed to match pre-hospital electronic patient care records (ePCRs) within a given case to one or more patients within that case. The findings of the study showed very high sensitivity, specificity and overall accuracy.
Successful data matching is dependent on having an accurate data-matching algorithm and sufficiently accurate data.[7] The case patient model has several key strengths. Matching is hierarchical or sequential, which is important because it allows data users to decide what level of matching stringency they want to use for any given analysis. Further, once a case patient has been identified more than one data record can be matched to that particular case patient. The final match level and rule assigned to the case patient reflects the worst-case match for all ePCRs matched to the case patient. This is a critical strength of the case patient matching algorithm because it avoids the errors and limitations of greedy algorithms.[14,17] A greedy algorithm is one in which a record is always matched to the corresponding available record that has the highest level of agreement. Subsequent records are only matched to available remaining records that have not yet been matched.[17]
Another key strength of the case patient model is that it does not rely solely on exact data matching. Level one is the only level where two or more ePCRs must match exactly for a case patient match to be considered a true positive. This means that true positive patient matches are still possible even when pre-hospital patient records are incomplete or erroneous. The model optimises patient matches between levels two and seven by using a combination of matching rules. The use of Jaro-Winkler, Soundex and weight matching conditions eliminates the need for data records to have homogeneous identifying characteristics. Furthermore, where two or more ePCRs do not match using one type of matching condition they can match on a different matching condition. It has been reported that even in high quality databases there can be an error rate of 20% in given name pairs and 10% in surname pairs amongst records that are true positive matches.[17] This error rate is likely to be increased in the emergency setting. Therefore being able to account for typographical errors using different matching conditions can significantly improve matching efficiency and match rates.[17]
Limitations
Despite the critical strengths of the case patient matching algorithm, there are some instances where the potential exists for patients to be erroneously matched. Examples of these include cases of twins where given name is missing from one or more of the patient records, cases of mothers and newborn babies where the baby's details are not recorded and cases where children have the same names as their parents and age or date of birth has not been recorded.

Table 2. Number of cases, patients and ePCRs for each case patient matching level
ePCR, electronic patient care record

Level number | Number of cases, n (%) | Number of case patients, n (%) | Number of ePCRs, n (%)
Matched patients
0 | 398 987 (94.3) | 398 987 (92.7) | 398 987 (89.5)
1 | 7924 (1.9) | 7933 (1.8) | 15 921 (3.6)
2 | 3748 (0.9) | 3750 (0.9) | 7530 (1.7)
3 | 2241 (0.5) | 2242 (0.5) | 4504 (1.0)
4 | 444 (0.1) | 444 (0.1) | 895 (0.2)
5 | 896 (0.2) | 896 (0.2) | 1835 (0.4)
6 | 29 (0.007) | 29 (0.007) | 61 (0.01)
7 | 7558 (1.8) | 14 540 (3.4) | 14 540 (3.3)
Total | 421 827 (99.7) | 428 821 (99.6) | 444 273 (99.8)
Unmatched (unknown) patients
8 | 1110 (0.3) | 1202 (0.3) | 1202 (0.2)
9 | 47 (0.01) | 49 (0.01) | 101 (0.02)
Total | 1157 (0.3) | 1251 (0.4) | 1303 (0.2)
Overall total | 422 984 (100.0) | 430 132 (100.0) | 445 576 (100.0)

[B] a = true positives; b = false positives; c = false negatives; d = true negatives.
Conclusions
The case patient algorithm has provided Ambulance Victoria with a complex, yet efficient and highly accurate method of matching patients with multiple patient care records within a given case. Matching data at the patient level rather than at the case level is critical for the purposes of data analysis and clinical reporting. This methodology has applicability to other emergency services where unique identifiers are based on the case or event and not on the patient.
Competing interests
There are no competing interests to declare.
Acknowledgements
The authors express their thanks and appreciation to Sharon Hughes and Janna Boulat for their data architecture expertise and technical support in building the case patient algorithm.
References
1 Newgard CD, Zive D, Malveau S, Leopold R, Worrall W, Sahni R. Developing a statewide emergency medical services database linked to hospital outcomes: a feasibility study. Prehosp Emerg Care 2011; 15(3): 303–19. doi:10.3109/10903127.2011.561404
2 Cox S, Currell A, Harriss L, Barger B, Cameron P, Smith K. Evaluation of the Victorian state adult pre-hospital trauma triage criteria. Injury 2012; 43(5): 573–81. doi:10.1016/j.injury.2010.10.003
3 Cox S, Smith K, Currell A, Harriss L, Barger B, Cameron P. Differentiation of confirmed major trauma patients and potential major trauma patients using pre-hospital trauma triage criteria. Injury 2011; 42(9): 889–95. doi:10.1016/j.injury.2010.03.035
4 Li B, Quan H, Fong A, Lu M. Assessing record linkage between health care and Vital Statistics databases using deterministic methods. BMC Health Serv Res 2006; 6: 48. doi:10.1186/1472-6963-6-48
5 Dean JM, Vernon DD, Cook L, Nechodom P, Reading J, Suruda A. Probabilistic linkage of computerized ambulance and inpatient hospital discharge records: a potential tool for evaluation of emergency medical services. Ann Emerg Med 2001; 37(6): 616–26. doi:10.1067/mem.2001.115214
6 Matthew AJ. Probabilistic linkage of large public health data files. Stat Med 1995; 14(5–7): 491–8.
7 Newman TB, Brown AN. Use of commercial record linkage software and vital statistics to identify patient deaths. J Am Med Inform Assoc 1997; 4(3): 233–7. doi:10.1136/jamia.1997.0040233
8 Tromp M, Meray N, Ravelli ACJ, Reitsma JB, Bonsel GJ. Ignoring dependency between linking variables and its impact on the outcome of probabilistic record linkage studies. J Am Med Inform Assoc 2008; 15(5): 654–60. doi:10.1197/jamia.M2265
9 Edelman LS, Cook L, Safe JR. Using probabilistic linkage of multiple databases to describe burn injuries in Utah. J Burn Care Res 2009; 30(6): 983–92.
10 Sauleau EA, Paumier J-P, Buemi A. Medical record linkage in health information systems by approximate string matching and clustering. BMC Med Inform Decis Mak 2005; 5: 32. doi:10.1186/1472-6947-5-32
11 Crilly JL, O'Dwyer JA, O'Dwyer MA, Lind JF, Peters JAL, Tippett VC, et al. Linking ambulance, emergency department and hospital admissions data: understanding the emergency journey. Med J Aust 2011; 194(4): S34–7.
12 Australian Bureau of Statistics. Year Book Australia, 2005; 2005. Available at http://www.abs.gov.au/Ausstats/abs@.nsf/0/5B10622703A022B0CA256F7200833007?opendocument [verified 28 June 2010].
13 Australian Bureau of Statistics. Regional Population Growth, Australia, 2008–2009; 2009. Available at http://www.abs.gov.au/ausstats/abs@.nsf/Products/3218.0~2008-09~Main+Features~Victoria?OpenDocument [verified 24 June 2010].
14 Jaro M. Advances in record linking methodology as applied to the 1985 census of Tampa Florida. J Amer Statist Assoc 1989; 84(406): 414–20.
15 Winkler W. Overview of record linkage and current research directions. Washington, DC: Bureau of the Census; 2006.
16 Jaro M. Probabilistic linkage of large public health data file. Stat Med 1995; 14: 491–8. doi:10.1002/sim.4780140510
17 Winkler W. The state of record linkage and current research problems. Washington, DC: Bureau of the Census; 1999.
18 Oracle. Oracle Warehouse Builder User's Guide 10g Release 2 (10.2.0.2). Oracle Corporation; 2009. Available at http://download.oracle.com/docs/cd/B31080_01/doc/owb.102/b28223/ref_dataquality.htm#i1187308 [verified 28 June 2010].
19 Black P. Soundex in Dictionary of Algorithms and Data Structures. U.S. National Institute of Standards and Technology; 2010. Available at http://xlinux.nist.gov/dads/HTML/soundex.html [verified 29 September 2011].
20 VACIS Clinical Data Warehouse. VACIS Clinical Data Warehouse Technical Case Patient Matching. Melbourne: Ambulance Victoria; 2010.