
The Application of Text Mining Software to Examine Coded Information

Patricia B. Cerrito, Jewish Hospital Center for Advanced Medicine, University of Louisville; Antonio Badia, University of Louisville; James Cox, SAS Institute

Copyright by SIAM. Unauthorized reproduction of this article is prohibited

Abstract

The purpose of this paper is to examine the use of text mining to investigate coded information. Inventory codes are usually nominal rather than ordinal. However, codes are not completely devoid of meaning; generally, items in the inventory that are similar are assigned base numbers with added digits to distinguish between individual items. The standard approach to using such data in statistical models that require at least ordinal data is to create dummy variables, i.e., the presence or absence of an item. Instead, it is possible to treat the code as a text word, and to use text analysis to examine the data. In this paper, two examples from the healthcare industry are examined. The first examines ICD-9 codes that are used in Medicare billing to list patient risk factors and complications. The second examines medications routinely prescribed to patients in open heart surgery.

Keywords

Text mining, ICD-9 codes, nominal data

Introduction

Coded information is problematic using traditional statistical methods. A large retail inventory can contain hundreds and thousands of different codes. For example, the Office Depot web site contains 13 major categories of office supplies. The pencils and pens category contains another 5 subcategories with 11 further refined categories of pens. Rollerball pens for one manufacturer have the following inventory codes: 184580, 184606, 255551, 184598, 184747. Note that 4 out of the 5 selections have 184 as the base part of the number. A second manufacturer has the following codes: 526300, 526328, 526337, 794047, 526346, 526355. Again, 5 out of 6 have the same base 526 code. The Library of Congress classifications and ISBN codes are similar.

Like inventory codes, ICD-9 codes that are similar in definition are also similar in number. ICD-9 codes were developed by the World Health Organization as shorthand for patient illness, complication, and disease. The ICD-9 code can have 5 digits, with the first 3 digits defining the basic disease. For example, 250 codes diabetes. If the last digit is 1 or 3, the condition is Type I (insulin-dependent, juvenile diabetes); a 0 or 2 in the last digit signifies Type II (non-insulin-dependent, adult-onset diabetes). The 4th digit is used to signify different diabetic complications. For example, 25041 indicates a Type I diabetic with renal failure. ICD-9 codes represent a coded language that can be investigated using text analysis, and cluster analysis (or classification) using a text format.

However, the ordering of the first three numbers is arbitrary. The next lowest code to diabetes is 246, which deals with thyroid disorders and is not related to diabetes in any way. The next highest code is 251 for hypoglycemic coma, and does have some relationship to diabetes. However, code 252 deals with diseases of the parathyroid gland. The first three numbers represent a basic category of patient illness. There remain thousands of such codes. It is possible to treat the codes as text rather than as numbers, and to use stemming capabilities to relate codes in clusters.

These codes can be examined to determine whether the data have been recorded uniformly, and whether various hospitals use similar definitions. However, ultimately, the information in the ICD-9 codes submitted with billing data must be compared to information in patient charts. Currently, most hospitals rely on manual extraction of information from patient records, requiring many extractors. With hundreds of pages to examine, manual extraction can result in missed data. Therefore, it will be cost-effective and more accurate to extract information automatically from patient charts. However, statistical methods are not really applicable to extraction of textual information.

Method

Text analysis has become more sophisticated than simple frequency counts. Word groups can be identified as one word, numbers can be included or excluded in the analysis, and words with slightly different word endings can be made equivalent. Different weightings of the frequencies can be used. Common words such as "the" or "and" can be eliminated from the analysis. The focus shifts from the most commonly used words to those that have the most discriminating potential. The basics of text analysis are [1]:

1. Coding: determining the basic unit of analysis, and counting how many times each word appears.
2. Categorizing: creating meaningful categories to which the unit of analysis (for example, terms signifying "cooperation" and terms signifying "competition") can be assigned.
3. Classifying: verifying that the units of analysis can be easily and unambiguously assigned to the appropriate categories.
4. Comparing: comparing the categories in terms of numbers of members in each category.
5. Concluding: drawing theoretical conclusions about the content in its context.

A novel approach to analyzing ICD-9 codes is to treat them as text rather than as categories. In that way, similarities between codes can be related to similarities in patient conditions. Grammar and syntax are not important in examining the ICD-9 codes using text mining. Stemming means finding and returning the root form (or base form) of a word. Stemming enables the investigator to work with linguistic forms that are more abstract than those of the original text. For example, the stem of grind, grinds, grinding, and ground is grind. The document collection often contains terms that do not have the same base form but share the same meaning in context. For example, the words teach, instruct, educate, and train do not have a common stem, but share the same meaning of teach. Text mining can relate words with similar stems. The capability can be extended to numeric codes.

Medpar Data

Medicare billing is frequently used to determine hospital rankings [2]. The standard technique used to determine these rankings is to develop a logistic regression equation using statistically significant secondary codes from the ICD-9 standards. These codes require documentation found in the patient chart, as validated through physician signature. For example, there are 51 ICD-9 codes related to diabetes. Consider just the following:

1. 25000 Type II diabetes mellitus without mention of complications
2. 25001 Type I diabetes mellitus without mention of complications
3. 25002 Type II diabetes mellitus without mention of complications, uncontrolled
4. 25003 Type I diabetes mellitus without mention of complications, uncontrolled
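To make the idea of a code "stem" concrete, the following is a minimal Python sketch, not the authors' SAS procedure. It assumes codes are written without a decimal point and are purely numeric; the names `Icd9Parts`, `split_icd9`, `describe_diabetes`, and `group_by_stem` are hypothetical helpers introduced here. The fifth-digit readings are copied from the four diabetes codes listed above.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Icd9Parts(NamedTuple):
    stem: str    # first three digits: the basic disease category
    fourth: str  # fourth digit (empty string if the code stops at three digits)
    fifth: str   # fifth digit (empty string if absent)


def split_icd9(code: str) -> Icd9Parts:
    """Split a numeric ICD-9 code written without a decimal point, e.g. '25041'."""
    digits = code.strip().replace(".", "")
    if not (digits.isdigit() and 3 <= len(digits) <= 5):
        raise ValueError(f"expected a 3- to 5-digit numeric code, got {code!r}")
    return Icd9Parts(digits[:3], digits[3:4], digits[4:5])


# Fifth-digit readings for stem 250, taken from the four codes listed in the text.
DIABETES_FIFTH_DIGIT = {
    "0": "Type II",
    "1": "Type I",
    "2": "Type II, uncontrolled",
    "3": "Type I, uncontrolled",
}


def describe_diabetes(code: str) -> str:
    """Read the diabetes type from the fifth digit of a 250xx code."""
    parts = split_icd9(code)
    if parts.stem != "250":
        raise ValueError(f"{code!r} is not a diabetes (250) code")
    return DIABETES_FIFTH_DIGIT.get(parts.fifth, "type not recorded")


def group_by_stem(codes: List[str]) -> Dict[str, List[str]]:
    """Group codes that share the same three-digit stem, preserving input order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for code in codes:
        groups[split_icd9(code).stem].append(code)
    return dict(groups)
```

Grouping by stem treats 25000 and 25003 as variants of the same "word", which is the behavior a dummy-variable encoding loses because it sees each five-digit code as an unrelated category.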

A physician has sole discretion in documenting uncontrolled diabetes, with very few guidelines. If a physician is more in the habit of adjusting insulin treatment without documenting "uncontrolled," then one hospital will have fewer patients with a more severe condition than another hospital where the physician habitually documents "uncontrolled" after one measurement above 300 units. There are many individuals and departments responsible for the final assignment of codes. The manual coders must rely on physician documentation. However, it is not the primary concern of the physician (nor should it be) to have complete documentation just to satisfy billing requirements. Moreover, reporting of secondary risk factors does not add to the total amount that the hospital can bill Medicare; billing depends only upon the primary diagnosis and treatment. Therefore, accuracy in the secondary codes is of reduced importance.

Medpar assigns a code 1-4 (or 0-3) to each patient for the level of severity of risk factors [3]. This represents one regression equation that is publicized. However, other organizations do not publish their methods. Using data mining techniques, it is possible to discover what factors are used, and to determine whether one hospital is over- or under-reporting when compared to other hospitals. Because of the lack of uniformity, there is some question as to the validity of risk adjustment analysis [4].

Quality measurement of hospital performance relies upon standard statistical methodology. Consider the standard methodology described by healthgrades.com [5]:

"The models are limited by the following factors: 1) Cases may have been coded incorrectly or incompletely by the hospital; 2) the models can only account for risk factors that are coded into the billing data - if a particular risk factor was not coded into the billing data, such as a patient's socioeconomic status and health behavior, then it was not accounted for with these models; and, 3) although Health Grades, Inc. has taken steps to carefully compile these data using its proprietary methodology, no techniques are infallible, and therefore some information may be missing, outdated, or incorrect."

However, with thousands of possible risk factors as recorded in ICD-9 codes, and hundreds of thousands of patients, virtually any risk factor will be statistically significant but meaningless. In addition, the use of logistic regression itself assumes a uniformity across all hospitals in reporting of these risk factors, a uniformity that is well known not to exist. In this example, there is no definite outcome, no right or wrong way of coding, and no gold standard of coding that an individual hospital can be measured against [6]. Hospitals are typically ranked based on the differential between the predicted patient risk and the actual patient outcome. Ranking becomes more important as the results are used by the general public [7]. There are two ways to improve a hospital's ranking. The first is to improve clinical quality and patient outcome. The second is to increase the predicted risk through improved reporting procedures in the Medpar database [8, 9]. Therefore, comparisons remain somewhat unreliable.

For example, if the regression equation uses the risk factor 25001 and the hospital generally codes only 25000 without specifically labeling the diabetes as uncontrolled, then that hospital will perform more poorly on the regression equation than will a hospital that habitually records 25003 whenever the glucose level exceeds 300.

Consider the graph in Figure 1 that gives a comparison of the severity codes for several hospitals that use Medpar data, restricting attention to DRG 105: cardiac valve procedures. Note that hospital number 6 shifts most of its patients into the most severe category (4) while hospitals number 2 and 4 are equally balanced between severity risks 3 and 4. Is it possible that hospital 6 has extremely sick patients, or is it possible that hospital 6 just reports more risk factors?
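The effect of coding habits on a ranking can be illustrated with a small Python sketch. Everything numeric here is an assumption made for illustration: the intercept, the single coefficient attached to code 25003, and the patient counts are invented, and no published risk model is being reproduced. The sketch only shows the mechanism by which two hospitals with identical patients and identical outcomes can receive different observed-minus-expected scores.

```python
import math
from typing import Iterable, List

# Hypothetical log-odds coefficients, invented purely for illustration.
INTERCEPT = -3.0
COEFFICIENTS = {"25003": 0.8}  # only the "uncontrolled" code carries weight here


def predicted_risk(codes: Iterable[str]) -> float:
    """Logistic model on dummy variables: a code contributes once if it is present."""
    logit = INTERCEPT + sum(COEFFICIENTS.get(code, 0.0) for code in set(codes))
    return 1.0 / (1.0 + math.exp(-logit))


def observed_minus_expected(patients: List[List[str]], deaths: int) -> float:
    """Positive values mean more deaths than the model predicted (a worse ranking)."""
    expected = sum(predicted_risk(codes) for codes in patients)
    return deaths - expected


# The same 100 diabetic patients and the same 6 deaths, coded two different ways.
hospital_a = [["25001"] for _ in range(100)]  # "uncontrolled" never documented
hospital_b = [["25003"] for _ in range(100)]  # "uncontrolled" habitually documented

score_a = observed_minus_expected(hospital_a, deaths=6)
score_b = observed_minus_expected(hospital_b, deaths=6)
```

Under these made-up numbers, `score_a` is positive and `score_b` is negative, so the hospital that documents less appears to perform worse even though nothing clinical differs.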


Figure 1. Comparison of Hospitals for DRG 105
[Bar chart not reproduced: hospitals 1 through 6 on the horizontal axis, a vertical scale from 0 to 80, and one series for each of Rank 1 through Rank 4.]

It is, of course, possible that one hospital has more critically ill patients than the other hospital (1 versus 6), but it is more likely that the one hospital is reporting at much higher rates. A comparison of two of the hospitals indicates that this is the case (Figure 2). Hospital B has a greater number of secondary diagnoses, and correspondingly, a better ranking than hospital A. It is important then to investigate the nature of report cards and the methodology used to ensure valid comparisons.

Figure 2. Comparison of the Number of Secondary Diagnoses
[Chart not reproduced: Percentage of Patients on the vertical axis against Number of Secondary Diagnoses (0 through 8) on the horizontal axis, for Hospital A and Hospital B.]

It is clear that the severity assigned is based to a certain extent on the number of risk factors identified by the coders. An experimental analysis was performed to examine the formulas used to predict patient risks, to examine the feasibility of information extraction from text sources, and to investigate the question of uniform data entry across hospitals. ICD-9 codes have been analyzed as unstructured text to define patient profiles. Up to nine secondary diagnoses can be listed in the Medicare reporting form (UB-92). Some of the secondary diagnoses are for complications from treatment, but many reflect co-morbid illnesses. A study was conducted using data provided by the Kentucky Hospital Association for all hospitals in Kentucky performing cardiovascular procedures. There were a total of over 28,000 cardiovascular procedures available. Table 1 contains the defined patient profiles.

The UB-92 form used for Medicare billing has nine column fields to record information by ICD-9 codes. Text analysis can only examine one field at a time. For analysis purposes, the nine fields were combined into one, with the nine ICD-9 codes for an individual patient defining one text string. Then a singular value decomposition was performed, which creates a matrix where ICD-9 codes that appear in combination can be identified, based on projecting patient profiles into a multidimensional space. A spatial Expectation-Maximization clustering algorithm was then run to cluster the patient profiles. SAS Enterprise Miner for text was used to perform the analysis.

Table 1 gives the results of the text analysis. A total of 8 clusters were identified in the analysis. For example, cluster 1 contains uncomplicated diabetes (both Type I and Type II) as well as minor problems such as arthritis and hypothyroidism. The majority of patients in this cluster are given a risk factor of 1 or 2 rather than 3 or 4. In contrast, cluster 4 is primarily focused on diabetes with complications, primarily retinopathy or neuropathy. Both conditions indicate that the patient suffers from prolonged, often uncontrolled diabetes. Patients in this cluster generally are assigned a risk factor of 4. The difference between recording 25001 and 25003 is slight. Yet the result could be that a hospital with a high proportion in cluster 1 will perform poorly on the standardized risk-adjusted regression equations. Similarly, cluster 2 identifies patients generally with a diagnosis requiring open heart surgery. They represent modest risk factors. Hospitals with a high concentration of patients in cluster 2 are probably under-reporting on risk factors, and will end up performing poorly on rankings based upon regression equations. Cluster 7 is associated with patients with critical risks, primarily dealing with kidney failure or even dialysis.

Table 1. Patient Profiles Developed From Text Mining of Secondary Diagnosis Codes

Clusters Found
1. General Risk Factors: Acquired hypothyroidism, Diabetes, uncomplicated, Fluid disorder, electrolyte and acid-base balance, Obesity, other hyperalimentation, Coronary atherosclerosis, Unspecified kidney and ureter disease, Aortocoronary bypass, Rheumatoid arthritis, Bronchitis
2. General Vascular Risk Factors: Disease, heart conduction, Cardiac dysrhythmia, Occlusion & stenosis, carotid artery, cerebral infarction, Renal artery arteriosclerosis, Peripheral vascular disease, Diaphragmatic hernia, Osteoarthrosis, Sleep apnea, History, certain other diseases, Family history, certain chronic disabling diseases
3. Moderate Cardiac Risks: Fluid disorder, electrolyte and acid-base balance, Unspecified thrombocytopenia, Atrial fibrillation and flutter, Hemorrhage, hematoma, seroma, Iatrogenic hypotension, Hemopericardium, Fever
4. Complicated Diabetes: Diabetes, ophthalmic manifestations, Diabetes, neurological manifestations, Autonomic nervous system disorder, Polyneuropathy, in diabetes, Retinopathy, diabetic, Pericardium, other disease
5. Severe Cardiac Risks: Heart ischemia, Endocardium disease, Supraventricular paroxysmal tachycardia, Ventricular fibrillation and flutter, Rheumatic fever, other endocardial structures, Chronic obstructive bronchitis, Cardiac arrest, Unspecified hypotension, Acute myocardial infarction
6. Severe Respiratory Risks: Pneumonia, Iatrogenic pneumothorax, Acute respiratory failure, Surgical respiratory complication, Coronary atherosclerosis, Other specified complication not classified elsewhere, Chronic obstructive bronchitis, Emphysema, Hearing loss, Protein-calorie malnutrition
7. Severe Renal Risks: Diabetes, renal manifestations, Fluid disorder, electrolyte and acid-base balance, Hypertensive renal disease with renal failure, Nephritis and nephropathy, Acute renal failure, Urinary tract infection, Surgical urinary complication, Volume depletion, Chronic renal failure
8. Severe Neurological Risks and Complications: Fluid disorder, electrolyte and acid-base balance, Cerebral artery occlusion, with cerebral infarction, Acute cerebrovascular disease, Pulmonary insufficiency following trauma and surgery, Paralytic ileus, Cardiogenic shock, Central nervous system complication, Postoperative infection, Digestive system complication, Acute renal failure with tubular lesion of necrosis

Table 2. Proportions of patients in each cluster by hospital

           Hospital
Cluster    1     2     3     4     5     6
1          35    40    35    32    25    14
2          9     8     10    10    13    14
3          23    20    21    26    34    36
4          1     3     1     1     3     8
5          8     9     12    9     11    3
6          6     9     4     8     7     8
7          11    7     7     7     4     11
8          8     5     9     8     9     6

Different hospitals have different proportions of patients in the above clusters (Table 2). Although the hospitals are in the same geographic region with the same population pool, the reduced proportion of patients in the more severe clusters is sufficient to lower the overall predicted risk and lower the rankings determined by agencies such as healthgrades.com. It seems clear that hospital 6 is more successful at recording all five digits of the ICD-9 code while hospital 2 generally codes only the 3-digit primary ICD-9 code. The use of text mining and clustering can clearly demonstrate problems with the regression assumptions used to assign rankings.

Pharmacy Order Database

Using information from an area hospital, all pharmacy orders for 1300 patients undergoing open heart surgery were examined. The database had approximately 800,000 orders for these patients. The drug types are coded by the American Hospital Formulary Service developed by the American Society of Health System Pharmacists. Drugs are coded by type. For example, 2404xx codes beta blockers while 1212xx codes beta stimulators. Within each code, there are numerous subtypes and individual medications. Again, there is a stem to each code that can be recognized via text mining. In this particular example, the drug name was used to cluster the medications.

To create profiles of physician prescribing, the first step is to compress the data with patient as the observational unit rather than medication order as currently exists in the pharmacy database. The means to do this is to create a variable with every drug code for an individual patient contained within a single text string for that patient, and a second variable with each drug name listed in one text string. Then text mining algorithms can examine the drugs as combinations to identify profiles of medication use in a category of patients (such as open heart patients). In a similar fashion, physician prescribing can be profiled by individual patient to determine habits of drug choice. Once the profiles have been identified, the profile can be related to patient outcome.

An initial attempt to cluster drug types using SAS Text Miner by using the drug name was unsatisfactory because of the uneven frequencies in the discovered clusters: 658, 173, 74, 49, 98. A second attempt was made to re-cluster the first 658 values. This also was not satisfactory since 549 of the 658 could not be clustered at all. However, when the same values were clustered by drug codes, the result was a total of 17 clusters. When codes are combined with physician names, 9 clusters result. One of the clusters contains approximately 15,000 combinations of medication orders with physicians (each of the 1300 open heart patients has multiple physicians). This cluster represents the mainstream of practice. The remaining 8 clusters identify outliers in terms of the medications used, and also identify the physicians who prescribe the medication.

One cluster contains 155 orders for the drug coded 402810. This category represents diuretics. This class of drugs is primarily used for patients with congestive heart failure (CHF). It is possible to examine the clinical database to determine whether the patients who were prescribed drugs in the 402810 category were identified as having CHF as a means of comparing manual abstraction, and to determine whether any patients received the diuretics unnecessarily. However, without the text mining and cluster analysis, it is likely that this group of patients would not have been profiled for further examination.

A second cluster of 226 drug orders is primarily associated with medications such as Procrit to improve blood circulation. This is more of a question mark since the approved uses do not include cardiovascular surgery. Instead, the drugs are primarily used for chemotherapy, AIDS, and kidney patients and for elective, noncardiovascular surgery where blood transfusions are not likely. The remaining medications associated with this cluster are for treating allergies, arthritis, and osteoporosis. A third cluster is related to mild infections (urinary tract, skin rash, etc.) where the antibiotics are combined with medications to relieve symptoms.

A second analysis was performed with the data restricted to antibiotics with physicians prescribing. Again, the use of the antibiotic codes resulted in a better clustering than the use of the antibiotic names. Two outliers of interest were found. First, some physicians prescribe antibiotics that are outside of the norm of practice. Second, some antibiotics are used in combination with other antibiotics while others are used by themselves. The drugs fluconazole and nystatin are in the code 81204. There are a number of physicians associated with this code while other physicians are associated with the code 81808.
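The compression step described in this section, from one row per medication order to one text string per patient, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the SAS Text Miner workflow: the function names are hypothetical, the four-digit stem length is an assumption based on the 2404xx and 1212xx examples, and a plain cosine similarity between stem counts stands in for the clustering performed in the study.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def orders_to_patient_strings(orders: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Compress (patient_id, drug_code) order rows into one text string per patient."""
    by_patient = defaultdict(list)
    for patient_id, drug_code in orders:
        by_patient[patient_id].append(str(drug_code))
    return {pid: " ".join(codes) for pid, codes in by_patient.items()}


def stem_profile(code_string: str, stem_len: int = 4) -> Counter:
    """Count drug classes in a patient's string, using the leading digits as the stem."""
    return Counter(code[:stem_len] for code in code_string.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two stem-count profiles (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical order rows; the trailing digits of each code are made up for illustration.
orders = [
    ("p1", "240401"), ("p1", "402810"),
    ("p2", "240402"), ("p2", "402810"),
    ("p3", "121201"),
]
strings = orders_to_patient_strings(orders)
profiles = {pid: stem_profile(text) for pid, text in strings.items()}
```

Patients p1 and p2 receive different individual drugs but the same drug classes, so their stem profiles coincide, while p3 shares no class with either. Matching on stems rather than on full codes or drug names is the property the section credits for the better clustering by code.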
Discussion

The use of a base stem in coded information is widespread. Many inventory codes follow patterns of code assignment. Currently, associations of use are identified using simple correlations (called association rules) [10]. Using codes of all customer purchases, it is possible to examine profiles of customers that can result in better understanding of relations. Clusters can be used to find relationships that are not readily examined because the possible association is not clear. The techniques developed in these examples can easily be translated to the development of improved association rules.

As a follow-up to this study, lists have been created relating specific diagnoses to medication orders. In this way, the pharmacy database can be used to automatically recover diagnosis information from the medication orders. Currently, hospitals rely primarily upon coders to extract information manually from patient charts. Therefore, using text mining to automatically process information can greatly reduce hospital costs while increasing data accuracy.

References

1. Martens BVdV. IST 501: Research Techniques for Information Management. Available at: http://web.syr.edu/~bvmarten/index.html. Accessed 2002.
2. Healthgrades. Healthgrades. Available at: www.healthgrades.com. Accessed August 23, 2001.
3. Pope GC, Ellis RP, Ash AS, et al. Diagnostic cost group hierarchical condition category models for Medicare risk adjustment. 2000. Located at: Health Care Financing Administration.
4. Shahian D, Normand S, Torchiana D, et al. Cardiac surgery report cards: comprehensive review and statistical critique. Annals of Thoracic Surgery. 2001;72(6):1845-1848.
5. Healthgrades.com. Hospital Report Cards Methodology. Healthgrades.com. Available at: http://www.healthgrades.com/public/index.cfm?fuseaction=mod&modtype=content&modact=Hrc_Methodology. Accessed 2001.
6. Spangler WE, May JH, Strum DP, Vargas LG. A data mining approach to characterizing medical code usage patterns. Journal of Medical Systems. 2002;26(3):255-275.
7. Scanlon D, Chernew M, McLaughlin C, Solon G. The impact of health plan report cards on managed care enrollment. Journal of Health Economics. 2002;21(1):19-41.
8. Gorman J. Medicare-choice risk adjustment: don't leave money on the table. Healthcare Financial Management. 2001;55(8):45-48.
9. Tu JV, Schull MJ, Ferris LE, Hux JE, Redelmeier D. Problems for clinical judgment: 4. Surviving in the report card era. CMAJ. 2001;164(12):1709-1712.
10. Anonymous. Data mining: extending the information warehouse framework. IBM. Available at: http://www.almaden.ibm.com/cs/quest/papers/whitepaper.html. 2002.

Correspondence: Patricia B. Cerrito, Department of Mathematics, University of Louisville, Louisville, KY 40292, pcerrito@louisville.edu
