
2011 Seventh International Conference on Natural Computation

Prediction of Survival in Patients with Liver Cancer using Artificial Neural Networks and Classification and Regression Trees
Cheng-Mei Chen, Chien-Yeh Hsu, Hung-Wen Chiu, Hsiao-Hsien Rau
Graduate Institute of Medical Informatics Taipei Medical University Taipei, TAIWAN, R.O.C

Cheng-Mei Chen
Department of Personnel Shin Kong Wu Ho-Su Memorial Hospital Taipei, TAIWAN, R.O.C

Chien-Yeh Hsu
Center of Excellence for Cancer Research Taipei Medical University Taipei, TAIWAN, R.O.C
Abstract: This study established a survival prediction model for liver cancer using data mining technology. The data were collected from the cancer registration database of a medical center in Northern Taiwan between 2004 and 2008. A total of 227 patients were newly diagnosed with liver cancer during this period. Following a literature review and expert consultation, nine variables pertaining to liver cancer survival were analyzed using the t-test and the chi-square test. Six variables were significant. Artificial neural network (ANN) and classification and regression tree (CART) models were adopted as prediction models. The models were tested under three conditions: one variable (clinical stage alone), the six significant variables, and all nine variables (significant and non-significant). Five-year survival was the predicted output. The results showed that the ANN model with nine input variables was the superior predictor of survival (p < 0.001). Its area under the receiver operating characteristic curve (AUC) was 0.915, and its accuracy, sensitivity, and specificity were 0.87, 0.88, and 0.87, respectively. The ANN model is significantly more accurate than the CART model in predicting survival for liver cancer, and it provides patients with information for understanding treatment outcomes.

Keywords: liver cancer; artificial neural networks; classification and regression trees; prediction model

I. INTRODUCTION

Liver cancer is one of the most common forms of cancer in the world. A total of 600,000 people die from liver cancer worldwide every year, and approximately 75% of these deaths are concentrated in the Asia-Pacific region. Malignant neoplasms of the liver and intrahepatic bile ducts account for 13.77% of all malignancies in Taiwan. The incidence of liver cancer ranks first among males in Taiwan and third among females. Liver cancer is the number one cause of mortality for males and the number two cause for females in Taiwan [1, 2]. Liver cancer has therefore gradually caught the attention of the public, and many resources have been devoted to its research.

Most conventional survival analysis methods evaluate specific patients over a certain period by using the Kaplan-Meier method or the Cox proportional hazards model. Chen et al. [3] used cancer data in Taiwan and plotted Kaplan-Meier survival curves to analyze the impact on survival of patient willingness to receive treatment. In recent years, data mining technology has been broadly applied in medicine. Artificial neural networks (ANN) and classification and regression trees (CART) are analysis methods used for classification and prediction.

An ANN is a computing system that uses a large number of mutually connected artificial neurons to imitate biological neural networks. The network learns iteratively: its weights are adjusted through training to minimize the error between the output and the target value, and it is said to have converged when no further adjustment can be made. This learning model is most commonly applied in the backpropagation network (BPN), which is a supervised learning network. A BPN is given input and output training samples from a specific problem, and the network learns the correspondence among the samples [4]. The structure of a BPN comprises an input layer, a hidden layer, and an output layer. The neurons in the input layer represent the input variables of the network, the neurons in the hidden layer process the data from the input layer with a linear transfer function, and the neurons in the output layer represent the output variables of the network [5].

The classification and regression tree (CART) is a tree structure proposed by Breiman in 1984. CART builds a binary-branching tree from the root through iterative operations until it reaches a termination criterion. Two steps are involved. The first is to establish the tree structure by dividing the training samples into subclasses according to the selected test conditions until all samples in a subclass belong to the same category. The second is to prune the tree structure from the bottom of the tree until the pruning guidelines are satisfied [6]. Rules are created from the outcome, and they are used to predict the classification of new samples.

Yang et al. [7] developed a method for predicting the risk of liver cancer in hepatitis B virus carriers, which provides patients with a reference for tracking and treatment. However, survival prediction methods for liver cancer patients have rarely been reported. The aim of this study is to establish a liver cancer survival prediction model using data mining technology and to provide patients with information for understanding treatment outcomes.

978-1-4244-9953-3/11/$26.00 ©2011 IEEE

II. METHODS

A. Sample selection
Samples were collected from the cancer registration database of a medical center in Northern Taiwan between 2004 and 2008. A total of 227 patients were newly diagnosed with liver cancer during this time. Cancer stage was recorded according to the AJCC (TNM) 6th edition, and survival was tracked until March 2010. Samples with missing data caused by the 2007 registration format update (from 65 fields to 95 fields) were excluded, as were samples with unexplained clinical stage and unspecified tumor size. We allocated 60% (136) of the samples as training samples and 40% (91) as test samples.

B. Establishing the network models
The continuous clinical predictor (input) variables were age, tumor size, and tracking period. The categorical input variables were gender, clinical stage, and whether the patient underwent surgery, radiotherapy, chemotherapy, and transcatheter arterial embolization (TAE). Five-year survival was the only output variable, and it was categorical (0, dead; 1, alive). The t-test was applied to the continuous variables at a significance level of 0.05, and the chi-square and Fisher exact tests were applied to the categorical variables. Six variables showed a significant difference (age, tumor size, clinical stage, surgery, radiotherapy, chemotherapy), and three did not (tracking period, gender, TAE). ANN and CART prediction models were then built from the training samples (136 samples) using one variable (clinical stage), six variables (those showing a significant difference), or all nine variables (significant and non-significant) as inputs, with 5-year survival as the output. The predictions were verified with the test samples (91 samples) and the overall samples (227 samples).

C. Artificial neural network analysis
The training outcome of an ANN is determined by the parameter settings that govern prediction error and convergence speed: the number of input neurons, the number of hidden layers, the number of units in the hidden layer, the number of epochs (learning cycles), the learning rate, and the momentum coefficient. However, no well-established theoretical method exists for designing an ideal ANN model [8]; the best designs are typically determined through trial and error [5,9].

D. Performance of models
We used the area under the ROC curve, accuracy, sensitivity, specificity, and positive and negative predictive values as measures of performance. The study compared the performance of ANN and CART models in predicting the survival of liver cancer patients.

E. Kaplan-Meier survival curves
We performed Kaplan-Meier survival analysis to examine whether survival predictions from the ANN and CART models at a particular time could predict survival over the entire follow-up period. The liver cancer patients were divided into 3 groups by survival (good vs. intermediate vs. poor), with cut points at 25% and 50% of survival. Survival curves were plotted using the Kaplan-Meier method, and the log-rank test was used to estimate the statistical significance of differences between groups.

F. Software
The ANNs were run in STATISTICA Neural Networks (Statistica-Neural-Networks TM-6.0, StatSoft, Hamburg, Germany). The CARTs were built with Clementine, version 12.0 (2008. Chicago: SPSS Inc.). The statistical software used for Kaplan-Meier survival analysis was SPSS for Windows (Rel. 19.0.0 2010. Chicago: SPSS Inc.). The AUCs were estimated and compared with MedCalc for Windows, version 10.0 (MedCalc Software, Mariakerke, Belgium).

III. RESULTS

The ratio of dead to alive samples was 6:4. Of the patients, 159 (70%) were men and 82 (36%) were over 70 years old. The first clinical stage was the most common, with 92 patients (41%), followed by the third stage with 56 patients (25%). TAE was given to 113 patients (50%), and 55 (24%) had undergone surgery. Regarding the tracking period, the largest share of patients was tracked within one year (45%), followed by 35% from one to two years and 16% in three years.

A. Artificial Neural Network (ANN)
The study constructed a back-propagation network (BPN) model for testing. After repeated training, optimum performance was achieved with 1 hidden layer of 6 neuron units, 200 epochs, a learning rate of 0.1, and a momentum coefficient of 0.8. The results showed that the network MLP 9:14-12-1:1 was the optimal architecture: 9 input variables, 14 neurons in the input layer, 12 in the hidden layer, 1 in the output layer, and 1 output variable. Clinical stage, surgery, and age had the highest weight rankings among the input variables.
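As an illustration of the training procedure described for the BPN, the following is a minimal sketch of online backpropagation with a momentum term, written in plain Python. It is not the STATISTICA model used in the study. The sigmoid activations, the squared-error loss, the weight initialization range, and the toy data are all assumptions made for the example; only the hyperparameter values (6 hidden units, 200 epochs, learning rate 0.1, momentum 0.8) are taken from the text.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_mlp(X, y, n_hidden=6, epochs=200, lr=0.1, momentum=0.8, seed=0):
    """Train a one-hidden-layer network with online backpropagation."""
    rng = random.Random(seed)
    n_in = len(X[0])
    # Each weight row carries one extra trailing weight for the bias input.
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    # Previous weight changes, reused by the momentum term.
    V1 = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
    V2 = [0.0] * (n_hidden + 1)
    for _ in range(epochs):
        for x, target in zip(X, y):
            xb = list(x) + [1.0]
            h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W1]
            hb = h + [1.0]
            out = sigmoid(sum(w * v for w, v in zip(W2, hb)))
            # Error signal at the output unit (squared error, sigmoid output).
            d_out = (out - target) * out * (1.0 - out)
            # Hidden error signals, computed before the output weights change.
            d_hid = [d_out * W2[j] * h[j] * (1.0 - h[j]) for j in range(n_hidden)]
            for j in range(n_hidden + 1):
                V2[j] = momentum * V2[j] - lr * d_out * hb[j]
                W2[j] += V2[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    V1[j][i] = momentum * V1[j][i] - lr * d_hid[j] * xb[i]
                    W1[j][i] += V1[j][i]
    return W1, W2


def predict(W1, W2, x):
    """Return the network output in (0, 1) for one input vector."""
    xb = list(x) + [1.0]
    hb = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W1] + [1.0]
    return sigmoid(sum(w * v for w, v in zip(W2, hb)))


# Hypothetical toy data: two well-separated groups, not patient records.
X_toy = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
         [1.0, 1.0], [0.9, 1.0], [1.0, 0.9], [0.9, 0.9]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]
W1, W2 = train_mlp(X_toy, y_toy)
low = predict(W1, W2, [0.0, 0.0])
high = predict(W1, W2, [1.0, 1.0])
```

In practice the nine clinical inputs would need to be scaled and the categorical ones encoded before being passed to such a network, which may be why the reported architecture has more input neurons than input variables.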


B. Classification and Regression Tree (CART)
The decision tree consisted of 4 layers (the maximum number of levels below the root was set to 4), and the simplest structure was selected as the optimal tree (the model mode was set to simple). With the same samples and test procedures, the model with nine input variables and the single output variable performed best. The classification rule created was: a tracking period greater than 9.5 months is classified as survival, and a tracking period less than or equal to 9.5 months is classified as death. Tracking period, surgery, and TAE ranked as the three most important variables.
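A threshold rule of this kind is what a CART produces when it chooses a split on a continuous variable. The sketch below shows one common way to do that: try the midpoint between each pair of adjacent sorted values and keep the one with the lowest weighted Gini impurity. The Gini criterion is an assumption here, since the paper does not state which impurity measure Clementine used, and the toy values are hypothetical, chosen so that the best split happens to fall at 9.5 months.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best rule 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((lo + hi) / 2.0, score)
    return best


# Hypothetical tracking periods in months and outcomes (0 = dead, 1 = alive).
months = [2, 4, 6, 9, 10, 14, 20, 30]
alive = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_split(months, alive)
```

A full tree repeats this search over every variable at every node and then prunes, which this sketch leaves out.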

C. Comparison of ANNs and CARTs
The study used one variable (clinical stage), six variables, and nine variables as the input variables, and 5-year survival as the output variable. After the ANN and CART models were established with the training samples, the outcome was verified with the test samples (91 samples) and the overall samples (227 samples), as shown in Table I. The ANN model with nine input variables evaluated on the overall samples, ann_9(o), performed best: the area under the receiver operating characteristic curve (AUC) was 0.915 ± 0.022, and accuracy, sensitivity, and specificity were 0.87, 0.88, and 0.87, respectively. The ROC curve comparison is shown in Fig. 1 (p < 0.001).
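For readers who want to see how the reported measures relate to one another, the sketch below computes accuracy, sensitivity, specificity, PPV, and NPV from predicted and true labels. Treating label 1 as the positive class is an assumption, since the paper does not say which outcome was counted as positive, and the toy labels are hypothetical rather than study data.

```python
def classification_metrics(y_true, y_pred):
    """Compute the five measures from binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
    }


# Hypothetical labels: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```

Sensitivity and specificity depend only on the classifier, whereas PPV and NPV also depend on the share of positives in the sample, so the predictive values in Table I are specific to this cohort.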

TABLE I. THE AUC AND PREDICTIVE ACCURACY OF DIFFERENT ANN AND CART MODELS WITH 1, 6, 9 INPUT VARIABLES

Model          AUC    S.E. [c]  95% CI [d]     Accuracy  Sensitivity  Specificity  PPV [e]  NPV [f]
ann_1(t) [a]   0.773  0.060     0.656 - 0.890  0.71      0.76         0.70         0.49     0.88
ann_6(t) [a]   0.727  0.064     0.623 - 0.815  0.62      0.60         0.62         0.38     0.80
ann_9(t) [a]   0.843  0.052     0.752 - 0.911  0.78      0.76         0.80         0.59     0.90
cart_1(t) [a]  0.728  0.055     0.621 - 0.836  0.71      0.76         0.70         0.49     0.88
cart_6(t) [a]  0.705  0.057     0.600 - 0.796  0.66      0.72         0.64         0.43     0.86
cart_9(t) [a]  0.771  0.050     0.671 - 0.853  0.74      0.80         0.71         0.51     0.90
ann_1(o) [b]   0.759  0.035     0.692 - 0.827  0.71      0.70         0.72         0.60     0.80
ann_6(o) [b]   0.798  0.033     0.734 - 0.861  0.69      0.70         0.72         0.59     0.80
ann_9(o) [b]   0.915  0.022     0.871 - 0.948  0.87      0.88         0.87         0.79     0.93
cart_1(o) [b]  0.711  0.034     0.644 - 0.777  0.71      0.70         0.72         0.59     0.81
cart_6(o) [b]  0.671  0.036     0.601 - 0.741  0.70      0.77         0.66         0.57     0.83
cart_9(o) [b]  0.751  0.032     0.689 - 0.805  0.83      0.89         0.79         0.71     0.93

a. Test samples; b. Overall samples; c. Standard error; d. Confidence interval; e. Positive predictive value; f. Negative predictive value

[Three ROC panels for the overall samples; vertical axis: sensitivity, horizontal axis: 100-specificity. AUC: ann_1(o) 0.759 vs. cart_1(o) 0.711; ann_6(o) 0.798 vs. cart_6(o) 0.671; ann_9(o) 0.915 vs. cart_9(o) 0.751.]

Figure 1. Comparison of ROC curves between ANN and CART prediction models with 1, 6, 9 input variables of overall samples
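The AUC summarized in these ROC curves can be read as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. The sketch below computes it directly from that definition (the Mann-Whitney formulation), counting ties as one half. It is an illustration only; MedCalc's own estimation and its test for comparing curves are not reproduced, and the scores and labels are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve as the share of positive/negative pairs ranked correctly."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical model scores and true outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
area = auc(scores, labels)
```

The pairwise loop is quadratic in the sample size, which is acceptable for a cohort of a few hundred patients; a rank-based version would be preferable for large data sets.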


D. Kaplan-Meier survival analysis
The survival curves of the 3 groups of liver cancer patients are shown in Fig. 2; the log-rank test p-value was 0.000 (p < .05). The 1-year (12-month) survival of the original sample for the 3 groups was 100%, 100%, and 16.1%; the 1-year (12-month) survival predicted by the ANN models for the 3 groups was 100%, 100%, and 11.0%, respectively. The CART models gave a value only for the good group (94.1%); the other groups were not detectable.
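The survival percentages quoted here come from the Kaplan-Meier product-limit estimate, which multiplies, at each time a death occurs, the fraction of at-risk patients who survive that time. The following is a minimal sketch of the estimator under the usual convention that patients censored at a given time are still at risk at that time. The follow-up times and event flags are hypothetical, and the log-rank test used in the study is not implemented.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct death time.

    events: 1 = death observed, 0 = censored (lost to follow-up or still alive).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up times in months.
times = [3, 5, 5, 8, 12, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1, 0, 0]
curve = kaplan_meier(times, events)
```

Reading the curve at 12 months gives the 1-year survival estimate for the group, which is the quantity compared across the good, intermediate, and poor groups above.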

Figure 2. Kaplan-Meier survival curves (horizontal axis: time in months)

IV. DISCUSSION

The most common question asked by liver cancer patients concerns survival. To answer it, we used nine variables (tumor size, clinical stage, age, gender, tracking period, surgery, radiotherapy, chemotherapy, and TAE) related to liver cancer survival from a cancer registration database in Taiwan. The values of these nine variables are easy for patients to obtain: tumor size and clinical stage are available from their doctors, and the remaining seven variables are known to the patients themselves. The results indicate that the ANN using all nine input variables, significant and non-significant, gave the best prediction outcome. The nine-variable model outperformed both the one-variable model (clinical stage) and the six-variable model that used only the significant variables. Similarly, the survival prediction research on breast cancer and colorectal cancer by Burke et al. [10] demonstrated that ANN models are more accurate than the TNM staging system (primary tumor, regional lymph nodes, distant metastasis) when both use the TNM variables alone. These results emphasize the importance of other variables in establishing an accurate prediction model, and they may stem from several factors. ANN has no particular requirements about data distribution and is fault tolerant. ANN is a non-linear model, and variables of different classifications can be accepted. ANN is easy to put into practice: once training is completed, input similar to the training samples can be processed to produce an appropriate output.

Regarding research on ANN use for liver cancer, Tsai et al. [11] developed a classification model for the diagnosis of liver cancer patients, and Wang et al. [12] conducted a cohort study of risk factors for liver cancer in Taiwan, but ANN has rarely been applied to survival prediction of liver cancer. Studies of other cancers [13,14,15] that used demographic characteristics, clinical stage, tumor size, treatment, and other relevant input variables have achieved accuracy of 0.8 and above. Sato et al. [13] obtained an AUC of 0.88 in the survival prediction of esophageal carcinoma. Snow et al. [14] reported an AUC of 0.87 in the diagnosis and prognosis of prostate cancer, and Snow et al. [15] reported an AUC of 0.87 for post-treatment colorectal cancer survival prediction. The prediction model for liver cancer survival established in this study has an AUC of 0.915, which is higher than those of these similar cancer studies.

The Kaplan-Meier method was used to plot the survival curves of the 3 groups. The ANN survival curve was similar to the original sample curve, and the ANN curve had better discriminatory power than the CART curve. Moreover, the log-rank test p-value was 0.0000 (p < .05), which indicated that data from a specific survival period could still reveal overall survival when ANN models are used.

In this study, the prediction accuracy of the CART models was not as good as that of the ANNs, although the rules the CART models created were easy to understand and classification could be established without much computation. The primary classification rule created was tracking period, which refers to the time from the first diagnosis of liver cancer to the last contact with the patient. The doctors we consulted indicated that this classification lacks detail, since it only represents the last correspondence with the research hospital. The hospital could also lose contact with a patient, or the patient could transfer to another hospital or abandon treatment. Thus, tracking period should not be used as the only and primary classification for survival. By contrast, the priority of clinical stage, surgery, and age (the weight ranking of input variables) in the ANN analysis is more reasonable.

In conclusion, the results of this study establish that a liver cancer survival prediction model using ANN is more accurate than one using CART. In the future, we suggest developing a computer system for clinical practice that predicts liver cancer survival using the nine input variables in the ANN prediction model. The system would use the ANN algorithm to calculate the prediction automatically and help patients understand their survival and treatment outcomes more clearly.

REFERENCES
[1] C. L. Lin and J. H. Kao, "Epidemiology of Hepatocellular Carcinoma," J. Chinese Oncol. Soc., vol. 24, no. 5, 2008, pp. 277-281.
[2] T. M. Lin, C. J. Chen, S. F. Tsai, and T. H. Tsai, "Hepatoma in Taiwan," J Natl Public Health Assoc (ROC), vol. 8, no. 2, 1988, pp. 91-100.
[3] H. C. Chen, H. F. Su, C. Y. Li, M. S. Lai, and P. C. Hsieh, "Can physician adherence to treatment guidelines improve patient survival? A population-based analysis of non-small cell lung cancer (NSCLC) in Taiwan," Taiwan J Public Health, vol. 29, no. 2, 2010, pp. 118-130.
[4] J. E. Dayhoff and J. M. DeLeo, "Artificial Neural Networks Opening the Black Box," Cancer, vol. 91, no. 8, 2001, pp. 1615-35.
[5] C. C. Lin et al., "Artificial neural network prediction of clozapine response with combined pharmacogenetic and clinical data," Comput Methods Programs Biomed., vol. 91, no. 2, 2008, pp. 91-9.
[6] J. Wang, M. Li, Y. T. Hu, and Y. Zhu, "Comparison of hospital charge prediction models for gastric cancer patients: neural network vs. decision tree models," BMC Health Serv Res., vol. 9, 2009, pp. 161.
[7] H. I. Yang et al., "Nomograms for Risk of Hepatocellular Carcinoma in Patients With Chronic Hepatitis B Virus Infection," Journal of Clinical Oncology, vol. 28, no. 14, 2010, pp. 2437-44.
[8] A. S. Miller, B. H. Blott, and T. K. Hames, "Review of neural network applications in medical imaging and signal processing," Med. Biol. Eng. Comput, vol. 30, no. 5, 1992, pp. 449-464.
[9] P. C. Davies, "Design issues in neural network development," Neurovest Journal, vol. 5, no. 1, 1994, pp. 21-25.
[10] H. B. Burke et al., "Artificial neural networks improve the accuracy of cancer survival prediction," Cancer, vol. 79, no. 4, 1997, pp. 857-862.
[11] H. J. Tsai, M. C. Ko, W. P. Chang, and D. M. Liou, "Application of Artificial Neural Networks and Classification Tree Models in the Diagnosis of Hepatocellular Carcinoma," Taipei City Med J, vol. 4, no. 8, 2007, pp. 658-667.
[12] C. J. Wang, C. F. Lan, and C. J. Chen, "A Cohort Study on Multiple Risk Factors of Liver Cancer, Lung Cancer and Stomach Cancer in Taiwan," Taiwan J Public Health, vol. 13, no. 4, 1994, pp. 308-314.
[13] F. Sato et al., "Prediction of Survival in Patients with Esophageal Carcinoma using Artificial Neural Networks," Cancer, vol. 103, no. 8, 2005, pp. 1596-605.
[14] P. B. Snow, D. S. Smith, and W. J. Catalona, "Artificial neural network in the diagnosis and prognosis of prostate cancer: a pilot study," J Urol, vol. 152, no. 5, 1994, pp. 1923-1926.
[15] P. B. Snow, D. J. Kerr, J. M. Brandt, and D. M. Rodvold, "Neural network and regression predictors of 5-year survival after colon carcinoma treatment," Cancer, vol. 91, no. 8, 2001, pp. 1673-1678.
