a Bern University of Applied Sciences, Medical Informatics, Biel, Switzerland
b ID Information und Dokumentation im Gesundheitswesen GmbH & Co. KGaA, Berlin, Germany
Keywords: Automatic encoding; Deep learning; Classification system; Autoencoder; Convolutional neural networks

Abstract: Classification systems such as ICD-10 for diagnoses or the Swiss Operation Classification System (CHOP) for procedure classification in the clinical treatment are essential for clinical management and information exchange. Traditionally, classification codes are assigned manually or by systems that rely upon concept-based or rule-based classification methods. Such methods can reach their limit easily due to the restricted coverage of handcrafted rules and of the vocabulary in underlying terminological systems. Conventional machine learning approaches normally depend on selected features within a human annotated training set. However, it is quite laborious to obtain a well labeled data set, and its generation can easily be influenced by accumulative errors caused by human factors. To overcome this, we present our processing pipeline for query matching realized through neural networks within the task of medical procedure classification. The pipeline is built upon convolutional neural networks (CNN) and autoencoders with logistic regression. On the task of relevance determination between query and category text, the autoencoder based method has achieved a micro F1 score of 70.29%, while the convolutional based method has reached a micro F1 score of 60.86% with high efficiency. These two algorithms are compared in experiments with different configurations and baselines (SVM, logistic regression) with respect to their suitability for the task of automatic encoding. Advantages and limitations are discussed.
1. Introduction

In order to claim costs from health insurance as well as for clinical documentation purposes, hospitals and physicians are legally bound to encode diagnoses and procedures with classification codes from relevant classification systems. In Switzerland, these are ICD-10-GM for diagnoses and the Swiss Operation Classification System (CHOP)1 for clinical and surgical treatments.

1.1. Challenges in automatic encoding

In this study, we are focusing on the task of automatic encoding based on CHOP. In total, the CHOP system consists of 18 different categories and over 14,000 different classification codes. In order to realize automatic encoding or, respectively, search for relevant classification codes, rule-based classification systems have been developed [1]. As input, physicians and coding assistants use free-text queries within their search, whereas the output is a set of possible classification codes. Consider the example in Fig. 1: a medical documentation assistant wants to encode a procedure reflected by the keywords Chirurgie CT und MR (surgery, computed tomography and magnetic resonance in English). For this query, the assistant needs to conduct a top-down dictionary look-up within the CHOP classification system, which goes from the top level chapter C0 (measurement and intervention) down to subcategory 00.3 (computer assisted surgery). Subcategory 00.3 is further divided into several subcategories: the category computer assisted surgery with "CT and MR", represented by "Computergesteuerte Chirurgie mit mehreren Datenquellen", is one of the six subcategories under "computer assisted surgery". All subcategories of 00.3 achieve at least a partial match with the query, whereas the subcategories differ only slightly. The best partial matches are achieved for the codes 00.31 and 00.32, which contain at least one of the imaging procedures requested.
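The top-down look-up sketched above can be illustrated with a small toy example. The snippet below uses a hand-made four-entry fragment of the hierarchy (the codes follow the example in Fig. 1, but the category texts and the plain substring matching are illustrative assumptions, not the real CHOP data or a production rule engine):

```python
# Illustrative sketch: partial matching of a query against a toy snippet of
# the CHOP hierarchy. The dictionary below is a made-up fragment; only the
# codes 00.3/00.31/00.32/00.35 are taken from the example in Fig. 1.
TOY_CHOP = {
    "00.3":  "computer assisted surgery",
    "00.31": "computer assisted surgery with CT/CTA",
    "00.32": "computer assisted surgery with MR/MRA",
    "00.35": "computer assisted surgery with multiple data sources",
}

def partial_matches(query_terms, chop=TOY_CHOP):
    """Return codes ranked by the number of query terms found in the category text."""
    scored = []
    for code, text in chop.items():
        hits = sum(term.lower() in text.lower() for term in query_terms)
        if hits:
            scored.append((code, hits))
    return sorted(scored, key=lambda c: (-c[1], c[0]))

# For the literal query "surgery CT MR" every subcategory matches "surgery",
# but no single category text contains both "CT" and "MR"; recognizing that
# CT and MR are *data sources* (hence 00.35) requires semantic knowledge.
print(partial_matches(["surgery", "CT", "MR"]))
```

On this toy data the best partial matches are 00.31 and 00.32, mirroring the situation described in the example, while the correct code 00.35 only matches on "surgery".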
⁎ Corresponding author.
https://doi.org/10.1016/j.artmed.2018.10.001
Received 20 October 2017; Received in revised form 7 September 2018; Accepted 3 October 2018
0933-3657/ © 2018 Elsevier B.V. All rights reserved.
Please cite this article as: Deng, Y., Artificial Intelligence In Medicine, https://doi.org/10.1016/j.artmed.2018.10.001
Y. Deng et al. Artificial Intelligence In Medicine xxx (xxxx) xxx–xxx
Fig. 1. Query scenario and snippet from CHOP on category computer assisted surgery.
Based on their background knowledge, the encoding assistants are able to decide that the correct code is 00.35 (computer assisted surgery with multiple data sources). For an automatic encoding, however, this would require the semantic knowledge that CT and MR are imaging data sources.

In encoding systems, automatic extraction approaches and rule-based encoding approaches are employed to handle the user query and determine the matching between query and category text. Such methods extract specific information from query and category text automatically and support classifying them according to standard medical terminologies [2]. Hence, comprehensive rules have to be specified beforehand considering all possible query situations, which is often highly time-consuming. It is also difficult to achieve completeness and correctness in the rule set. Conventional machine learning approaches aim at learning the correlation between query and CHOP encoding automatically and deriving latent models from large data sets. Although manually chosen features have proven to be effective for some specific classification tasks, the bias caused by feature selection at each processing step is unavoidable. For this kind of encoding with multiple subcategories, it is particularly difficult to obtain sufficient training examples for each subcategory to make the classifier discriminative in the corresponding classes. Without these examples, a trained classifier would have a reduced ability of discrimination for each subcategory and would produce a large number of false positive codes.

With the resurgence of neural networks and deep learning techniques, end-to-end methods for data representation learning and object/pattern recognition in both text and image have advanced substantially. For a classification task based on natural language text, words and relations between words can be learned through semantics-preserving feature vectors [3]. Besides, convolutional neural networks (CNN) [4,5] require only minimal manual pre-processing of the input data and learn feature maps by applying a sliding window on the original input according to a pre-defined filter size and stride. The features corresponding to different filters can be selected by the convolution and pooling process (max pooling or average pooling). In contrast to supervised deep neural networks, unsupervised pre-training approaches like the restricted Boltzmann machine (RBM) [6], deep belief nets (DBN) [7] and the autoencoder [8] provide the possibility of stepwise adaptation. The latent representation of high dimensional input data can be layer-wise reduced and fine-tuned at each layer for specific tasks.

In this paper, we introduce and evaluate a pipeline based on neural networks to support the automatic CHOP encoding. More specifically, the goal is to determine the best neural network model for the CHOP encoding and, in this way, to assign a correct CHOP code to a query text. We assess different models based on neural networks with respect to their performance (relevance determination in precision, recall and F1). Moreover, we assess the impact of a semantic enrichment of the query using a semantic knowledge base on the classification performance.

Based on the aforementioned challenges and objectives, we will address the following research questions:

• Which type of neural network is better suited for the considered matching task in terms of efficiency: CNN or autoencoder?
• Given the positive influence of semantic enrichment for the sake of balancing between query and category text, which method can be applied to deal with the sparseness of the data set caused by a large vocabulary space of enrichment?
• Is the autoencoder a suitable method for knowledge integration? Does semantic enrichment of the feature set by concepts of a semantic network impact the classification accuracy?
• Is layer-wise pre-training of autoencoders suited for representation learning for the task of semantic matching?

1.2. Paper organization

In Section 2, we summarize the related work and introduce the characteristics of our task in contrast to it. In Section 3, we present the data material that is used for training and evaluation. Section 4 includes the formal definition of the CHOP query matching problem in Section 4.1 and the methods we propose for it in Section 4.3. Besides, a working example for input normalization and enrichment is described in Section 4.2. The CNN based method and the autoencoder based methods are described in Sections 4.3.1 and 4.3.2, respectively. Sections 4.4 and 4.5 describe the implementation and experimental settings of our methods. The results and efficiency for relevance determination are presented in Section 5. After summarizing our principal findings and pointing to limitations in Section 6, we refer to directions of future work in Section 6.3.
2. Related work

In this section, we will first introduce a selection of relevant methods for medical document tagging and classification based on neural networks. Next, the specific neural networks for representation learning and knowledge integration will be presented in detail. Finally, the insights provided by the previous work and the differences of our approaches in contrast to the existing methods will be highlighted.

2.1. Neural networks, medical document classification and tagging

Neural networks have been employed for a variety of tasks in the medical domain. Miotto et al. [9] provided a comprehensive summary of applying deep learning technologies in the medical domain. Through a review of 32 recent research papers regarding four domains of clinical applications (clinical imaging, electronic health records, genomics and mobile health), the related deep models such as CNN, recurrent neural networks (RNNs), the restricted Boltzmann machine (RBM) and the autoencoder (AE) have been discussed. The potentials and opportunities of deep models relating to our encoding task are the feature enrichment, the incorporation of expert knowledge and the interpretability of the model. The consideration of these aspects could potentially improve the model performance and also increase the acceptance of deep learning methods in medical use cases.

Similar to the CHOP classification in the sense of medical procedure indexing, the medical subject headings thesaurus (MeSH) is a hierarchical medical thesaurus for indexing biomedical literature. It contains 16 top categories and 27,455 main headings.2 The assignment of MeSH tags to free text based on neural networks depends on a suitable feature representation and an algorithm for the relevance determination.

Peng et al. [10] proposed DeepMeSH, an approach for MeSH tagging based on unsupervised deep representation learning. The document to vector (d2v) [11] and tf-idf feature embeddings combined with MeSHLabeler [12] achieved the best score on the task of large scale semantic indexing in the 2017 BioASQ challenge [13] (task 5a). Further, with an F1-score of 0.6323, the DeepMeSH method yielded an improvement of 12% and 2%, respectively, in comparison to the two baseline indexing algorithms: the Medical Text Indexer (MTI) with an F1-measure of 0.5637 and MeSHLabeler with an F1-measure of 0.6218.

Du et al. [14] provided another possibility of feature representation for MeSH indexing besides d2v. They employed a bidirectional recurrent neural network (BRNN) and an auxiliary regression mechanism to conduct the primary multi-label classification. The compositional serial structure between terms can therefore be extracted. The algorithm outperformed the state-of-the-art baseline Medical Text Indexer (MTI) in F1-score with 0.6220 and reached a higher precision (0.77) than DeepMeSH (0.70). One earlier attempt conducted by Rios and Kavuluru [15] used CNNs to assign MeSH terms to biomedical articles. Paper abstracts of publications listed in PubMed have been processed by multiple-layer CNNs as proposed by Kim [16]. The simple CNN structure with a single convolution and pooling layer achieved an absolute improvement of over 3% in macro F1-score on selected subsets of MeSH terms compared to the baseline MTI method.

The aforementioned examples of MeSH tagging demonstrate the strong potential of the application of neural networks for medical document classification (tagging) and show the importance of representation learning. One lesson learned from the method design is that a comprehensive representation combining multiple levels of salient features can largely enhance the performance of the follow-up classification. In the following section, we will take a further look at deep representation learning based on unsupervised neural networks.

2.2. Unsupervised representation learning using neural networks

In contrast to training with annotated data by supervised learning, unsupervised deep learning methods train the model with the input data itself (reconstruction of the input). Based on the MNIST digit recognition data set [17], Larochelle et al. [18] have evaluated the greedy layer-wise pre-training of a deep network. The results confirmed that layer-wise pre-training can largely increase the effectiveness of the network initialization and enable a rational starting point for the classification. It was also shown that the training can achieve a relatively large benefit from layer-wise pre-training, in particular when the amount of training data is small. In the clinical domain, Miotto et al. [19] have used a three-layer stacked denoising autoencoder (SDA) to learn a deep patient representation based on electronic health records. The representation was then employed to predict disease risk using random forests as classifiers. The evaluation was performed on 76,214 patients comprising 78 diseases from diverse clinical domains within a time window of up to one year. On the task of disease prediction, the SDA based methods significantly outperformed other dimension reduction methods such as principal component analysis (PCA) and k-means.

In our task, a suitable representation learning method is essential for balancing the difference between query and CHOP category text. Beyond that, the representation learning should also be able to facilitate the knowledge integration from an external medical knowledge base.

2.3. Word embedding and autoencoders for knowledge integration

Another important trend of representation learning is the integration of external knowledge through unsupervised learning. In the general domain, word embedding with semantic enrichment has been evaluated by Yu and Dredze [20] and Celikyilmaz et al. [21]. The knowledge from terminologies like WordNet3 or PPDB4 is exploited in these approaches. The types of relations and the weights of the relations between concepts have been combined linearly with a word embedding model (word2vec). The linearly combined models were trained on different data sources: the word embedding was first trained on the input corpus based on continuous bag of words (CBOW). Then, the normalized weights from related words found in the knowledge resource were used to update the weight parameters of CBOW and optimized according to the same loss function. Yu and Dredze's method achieved a 19% improvement in mean reciprocal rank (MRR) on the task of finding semantically related words using the embeddings, while Celikyilmaz et al. achieved around 2% improvement in F1 measure on the task of semantic tagging within a movie dataset [22]. Faruqui et al. [23] have further improved the knowledge integration through a post-processing step by conducting belief propagation on a graph obtained from lexicon-derived relational information. These methods outperformed the prior approaches developed by Yu and Dredze [20] with 5% (sentiment analysis [24]) to 20% (synonym selection (TOEFL) [25]) improvement in accuracy.

Yu et al. [26] used semantic hashing based on autoencoders to support short text understanding and retrieval. More specifically, the short text representations were enriched with concepts and their co-occurring concepts from Probase [27]. After the enrichment, a stacked autoencoder was applied to reduce the dimension of the representation. The obtained hashing representation of short text yielded an improvement of 20% to 30% on the task of news retrieval. On the task of classifying Wikipedia sentences, the proposed method has obtained an improvement of 10% (in comparison with other types

2 https://www.nlm.nih.gov/mesh, accessed on August 23, 2018.
3 https://wordnet.princeton.edu, accessed on September 9, 2017.
4 http://www.cis.upenn.edu/∼ccb/ppdb/, accessed on September 9, 2017.
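The lexicon-driven refinement of embeddings described above can be illustrated with a retrofitting-style post-processing sketch. This follows the general idea of such post-processing (pulling a word vector toward its lexicon neighbours while staying close to its distributional estimate); the update rule, the toy vectors and the `alpha`/`beta` weights are illustrative assumptions, not the exact published algorithms of the cited works:

```python
import numpy as np

# Retrofitting-style sketch: refine word vectors with lexicon relations.
def retrofit(vectors, lexicon, alpha=1.0, beta=1.0, iterations=10):
    """Pull each word vector toward its lexicon neighbours while staying
    close to its original (distributional) estimate."""
    new_vecs = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            neighbours = [n for n in neighbours if n in new_vecs]
            if not neighbours:
                continue
            # weighted average of the original vector and the neighbour vectors
            num = alpha * vectors[word] + beta * sum(new_vecs[n] for n in neighbours)
            new_vecs[word] = num / (alpha + beta * len(neighbours))
    return new_vecs

# Toy data: after refinement, "spine" has moved toward its lexicon neighbour.
vecs = {"spine": np.array([1.0, 0.0]), "spinal_column": np.array([0.0, 1.0])}
refined = retrofit(vecs, {"spine": ["spinal_column"]})
```

With `alpha = beta = 1` and a single neighbour, the refined vector is simply the midpoint of the original vector and the neighbour, which makes the pull toward the lexicon easy to verify.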
Fig. 2. Example for input normalization and enrichment. (A) shows one example of the non-enriched concept vector and the corresponding embedding of query and
category text. (B) indicates the non-enriched concept vector with frequency. (C) represents the enriched concept vector with frequency.
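The construction of an enriched frequency vector like the one labeled (C) in Fig. 2 can be sketched as follows. The concept names, dictionary indices and relation table below are invented for the example; only the similarity values 0.86 and 0.89 follow the worked example discussed in the text:

```python
# Sketch of building an enriched frequency vector as in Fig. 2(C): directly
# mapped concepts receive weight 1, while related concepts from the semantic
# network are added with their similarity value. All names/indices are toy data.
CONCEPT_INDEX = {"spinal_liquor": 0, "spinal_column": 1,
                 "closure_of_fistula": 2, "fistula_extirpation": 3}
RELATED = {  # concept -> [(related concept, similarity)] from a semantic network
    "spinal_liquor": [("spinal_column", 0.86)],
    "closure_of_fistula": [("fistula_extirpation", 0.89)],
}

def enriched_vector(mapped_concepts, index=CONCEPT_INDEX, related=RELATED):
    vec = [0.0] * len(index)
    for c in mapped_concepts:
        vec[index[c]] = 1.0                              # directly mapped concept
        for rel, sim in related.get(c, []):
            vec[index[rel]] = max(vec[index[rel]], sim)  # enrichment entry
    return vec

print(enriched_vector(["spinal_liquor", "closure_of_fistula"]))
# -> [1.0, 0.86, 1.0, 0.89]
```

The resulting sparse vector has the same shape as the paper's example (0 … 0, 1, 0.86, 0 … 0, 0.89, 1, 0 … 0): ones for mapped concepts, similarity weights for enriched ones.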
each vector, namely those concepts that have been retrieved as ancestors of the originally mapped concepts. This is realized by a similarity function. The similarity function employs the class hierarchy and relations of the ID MACS® semantic network to determine the relatedness of two concepts. The value generated by the similarity function represents the frequency of one concept appearing in the representation of one query or category text. For instance, an entry for spinal column with value 0.86 is added to the sparse vector representation of the target text, since spinal column is related to spinal liquor in our semantic network. Similarly, a concept entry for "fistula extirpation" with value 0.89 is added because this concept describes another fistula-related operation which is connected in our semantic network to the concept "closure of a fistula". Hence, the enriched frequency vector has the form (0 … 0, 1, 0.86, 0 … 0, 0.89, 1, 0 … 0, 0 … 0). The non-enriched concept vector is the input for the embedding layer of the CNN (labeled with (A)). The two frequency-based vector representations ((B), (C)) are the input for the autoencoder, SVM or logistic regression.

4.3. Matching methods with CNN and autoencoder

In our pipeline and experiments, we exploit mainly two types of models based on neural networks (see Fig. 3). The first approach is based on a CNN, whereas the second type of network is a classification model connected to a representation layer trained with autoencoders. The CNN is implemented as a baseline to compare with the autoencoder based method. Since the original sequence of the text is eliminated by the semantic enrichment, and the sequence between concepts is a crucial prerequisite for the embedding and convolution in a CNN, we apply the holistic knowledge integration only for the autoencoder method. Additionally, a linear-kernel SVM and logistic regression with L2 regularization are applied on both enriched and non-enriched input as generic baselines to compare with the neural network based approaches.

4.3.1. Matching with CNN

A CNN is a type of feed-forward artificial neural network which has been successfully applied to both image processing [30] and natural language processing [31]. Our architecture exploits a CNN with one layer of convolution and one layer of max pooling. In the convolutional layer, different filter sizes can be defined to cover the potential semantic scope in the query and category description. The architecture is therefore similar to the CNNs employed by Kim and Shen [16,32]. As input for the training, the vectors of the query and category text from the training set are concatenated into one vector. The output of the network is the relevance of a category text with respect to a query. The features are obtained from multiple filters (3, 4, 5); the convolution output is selected through max-pooling and fully connected into one feature vector. The filter sizes 3, 4 and 5 have been determined based on the length statistics of our training corpus, as most of the concept vectors for query and category description have been mapped to 3–5 concepts.

Based on the formulas (1)–(3) in Section 4.1.2, a pair of padded query and CHOP category description is modeled as

P = Q_n ∥ C_n = {x_1, x_2, …, x_i, …, x_n, x′_1, …, x′_j, …, x′_n}   (6)

Since query and category text share the same vocabulary after the normalization and also the same index in the concept dictionary, the query and category pair can be transformed into P = {x_1, …, x_2n}. The convolution process is conducted through filters with a predefined size which slide through the concatenated concept vector P. As already mentioned in Section 4.1.2 shortly after formula (3), the dimension n represents the padded length of a query or a category vector. In this network, a filter of size h generates a window x_{i:i+h−1}. The corresponding features are obtained through an activation function f with a bias term b ∈ ℝ in the form of

Convolutional_features = f(w · x_{i:i+h−1} + b)   (7)
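The convolution of Eq. (7) followed by max pooling can be sketched in NumPy. The embedding matrix, the random filter weights and the sizes below are toy stand-ins (the model described here uses k = 128 embedding dimensions and learned parameters); the sketch only illustrates the windowing, the ReLU activation and the pooling of one feature per filter:

```python
import numpy as np

# Sketch of Eq. (7): slide a filter of size h over the concatenated
# query/category embedding sequence, apply f(w . x_{i:i+h-1} + b), max-pool.
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 8))        # 2n = 10 concept embeddings of dim k = 8 (toy)

def conv_max_pool(P, h, f=lambda z: np.maximum(z, 0.0)):
    """One filter of size h: ReLU(w . x_{i:i+h-1} + b), then max pooling."""
    n_pos, k = P.shape
    w = rng.normal(size=(h, k))     # filter weights (randomly initialized here)
    b = 0.1                         # bias term b
    feats = np.array([f(np.sum(w * P[i:i + h]) + b)
                      for i in range(n_pos - h + 1)])   # length n_pos - h + 1
    return feats.max()              # max pooling keeps the strongest response

# One pooled feature per filter size, concatenated into the feature vector
# that would feed the fully connected and softmax layers.
feature_vec = np.array([conv_max_pool(P, h) for h in (3, 4, 5)])
print(feature_vec.shape)            # (3,)
```

Each filter of size h produces n − h + 1 window responses, of which max pooling keeps one, so the three filter sizes yield a three-dimensional pooled feature vector in this toy setup.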
Fig. 3. Data flow for the training: the bug-tracker logs (query–category text pairs) are processed with the concept dictionary mapper. A non-enriched concept vector dictionary is generated for concept embedding, while enriched vectors are prepared for autoencoder based representation learning. The representation is trained and cross validated for relevance determination. As illustrated in Fig. 2, (A) represents the non-enriched concept vector for concept embedding, (B) the non-enriched concept vector with frequency and (C) the enriched concept vector with frequency.

For each dimension of the concept embedding (size k in the presentation definition: 128) we have assigned one filter to slide through all the possible windows in the concept vector. As can be seen in Fig. 4, a feature vector of length n − h + 1 is generated at each dimension of the embedding. Next, the maximum value in the feature vector is pooled out through the max pooling process so that a fully connected layer can be created. At last, a softmax layer is attached to conduct the classification.

In order to achieve an optimized network configuration for the CNN experiment, we have configured and evaluated the outcome of the embedding based on annotated relevance metrics [33]. The rectified linear unit (ReLU) has been applied as activation function; filter windows of 3, 4 and 5 with 128 dimensional feature maps are applied. We have chosen the size of 128 to ensure a fast calculation of similarity metrics. The dropout rate has been defined as 0.5, whereas an L2 constraint (s) of 0.01 has been chosen to avoid over-fitting. The batch size – chosen
Fig. 4. CNN for relevance determination of CHOP query and category text matching; filter sizes 3, 4 and 5 are selected.
Fig. 6. Relevance classification based on stacked autoencoder. The details of the pre-training module are illustrated in Fig. 7.
input data; meanwhile, the input information is preserved through these pre-training steps as much as possible. Based on the pre-training network illustrated in Fig. 6, the vector enriched by ID MACS® is concatenated into one vector in the form Query ∥ Category. The vectors are multi-hot encoded with concepts. The networks are initialized layer by layer with a denoising mechanism:

(1) Stepwise initialization: In Fig. 6, the main architecture of our mapping process is presented. With the enriched query and category text concatenation, the four nested autoencoders are initialized. We followed the principle of a greedy layer-wise initialization [38]. The greedy algorithm optimizes each piece of the solution independently, one layer at a time, rather than jointly optimizing all layers. Specifically, greedy layer-wise pre-training proceeds one layer at a time, training the kth layer while keeping the previous layers fixed. The lower layers (which are trained first) are not adapted after the upper layers have been introduced. In our pre-training model illustrated in Fig. 7, we also follow this two-phase protocol. The pre-training can then be understood as the first part of an entire two-phase protocol that combines the pre-training phase and a supervised learning phase. The supervised learning phase involves training a classifier on top of the features learned in the pre-training phase. At the same time, the supervised classification can fine-tune the entire network learned in the layer-wise pre-training phase.

As a comparative experiment, we will also evaluate the non-stop deep autoencoder, which conducts the latent learning without interruption. The F1 measure of the classification and the efficiency of the training will be compared.

(2) Enabled with denoising mechanism: A denoising autoencoder minimizes

L(x, g(f(x̃)))   (14)
where x̃ is a copy of x that has been corrupted by some form of noise. The denoising autoencoder must therefore undo this corruption rather than simply copy its input. The added noise first pushes the original data distribution away from the target low dimensional manifold, while the learning process has to project the noised input data back onto the "manifold" [36]. With the input from the last hidden layer h_l:

q_D = g(h_l + N)   (15)

where N is the Gaussian noise and g is the activation function, the denoising process is expected to achieve a higher generalization than ordinary autoencoders.

4.3.2.2. Alternative representation learning: applying the autoencoder only on the query and category dictionary. The aforementioned approaches all start with a concatenated vector of query and category text. There is another way of training which can largely reduce the training time: representation learning only on the CHOP concept dictionary (representation learning before the query/category concatenation; see Fig. 3). The concept dictionary contains the single vector representation for query and category. The corresponding vector representation can be obtained through the query/category id as dictionary key. The query and category text can then be assigned a short representation separately before the concatenation of the query and the category text (see Fig. 8). This alternative approach learns the representation locally based on the vocabulary vector representation without supervised tuning. In comparison with the autoencoder on the fully concatenated query and category text vector illustrated in Fig. 6, the alternative method concentrates on learning the intrinsic relations within a single query or a single category text. It is expected that the relatively low dimension of the representation leads to a short training time and fast convergence.

Fig. 8. Using an autoencoder to learn the latent representation directly on an enriched single query and category concept dictionary.

For the experiment with the autoencoder with supervised classification, the configuration is realized as follows: we employed four autoencoders consecutively to reduce the dimension directly after the enriched input. The noise is added to the encoding step of each of the four layers, while a denoising mechanism has been employed at each denoising layer. The sigmoid activation function and the cross entropy loss are calculated at each layer. Stochastic gradient descent (SGD) is used as optimization method. As already mentioned in Section 4.3.2, through each autoencoder the vector size is reduced to half of the previous layer. In the last step, the supervised classifier receives a 500 dimensional vector to conduct the relevance classification. We have defined a batch size of 128 for the training. This configuration is designed with consideration of both performance expectation and training cost. Since we conduct experiments with limited hardware resources (a single GPU with 12 GB memory), an acceptable training time and cost is essential for a comparison of the different approaches.

4.4. Implementation

We implemented the proposed architecture with the Tensorflow framework.12 Tensorflow was chosen since it provides comprehensive toolkits for the construction of embeddings and neural networks [39]. The computational network works as a road map of the data processing workflow, whereas the real data is not loaded into the computing graph. The input is only defined with sizes and attributes within placeholders. The real input values are triggered only after the session has been initialized, through queue feeding or dictionary feeding; a queue is defined for asynchronous tensor input, while a dictionary is used to input small amounts of data statically. This deferred form of graph computing can also optimize the resource utilization and scalability during the processing. For the creation of a CNN, Tensorflow provides interfaces like tf.nn.embedding_lookup, tf.nn.conv2d and tf.nn.max_pool. The convolution and pooling process can be created and configured intuitively to constitute our proposed CNN architecture. In this implementation, we have also used the Python APIs from the packages scikit-learn13 and numpy14 to implement the benchmark.

12 https://www.tensorflow.org, accessed on September 12, 2017.
13 http://scikit-learn.org/stable/, accessed on September 9, 2017.
14 http://www.numpy.org, accessed on September 9, 2017.

4.5. Experiments

Three groups of methods are tested: the conventional machine learning methods (SVM, logistic regression), the convolutional pooling based method and the autoencoder based relevance determination.
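The greedy layer-wise pre-training with denoising described in Section 4.3.2 can be sketched in plain NumPy. This is a minimal illustration of the principle only: sizes, learning rate, epoch count and the tied-weight simplification are toy assumptions, not the four-layer, 500-dimension configuration of our experiments:

```python
import numpy as np

# Sketch of greedy layer-wise pre-training with denoising autoencoders:
# each layer halves the dimension, the input is corrupted with Gaussian
# noise (cf. Eq. 15), and a sigmoid/cross-entropy reconstruction is trained
# with batch gradient descent. All sizes here are toy values.
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_denoising_layer(X, hidden, lr=0.1, epochs=200, noise=0.1):
    """Train one tied-weight denoising autoencoder layer; return (W, b, codes)."""
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(d, hidden))
    b_h, b_o = np.zeros(hidden), np.zeros(d)
    for _ in range(epochs):
        X_noisy = X + rng.normal(scale=noise, size=X.shape)   # corrupt the input
        H = sigmoid(X_noisy @ W + b_h)                        # encode f(x~)
        X_hat = sigmoid(H @ W.T + b_o)                        # decode g(f(x~))
        # for sigmoid + cross-entropy, the output pre-activation gradient is x_hat - x
        delta_o = (X_hat - X) / n
        delta_h = (delta_o @ W) * H * (1 - H)
        W -= lr * (X_noisy.T @ delta_h + (H.T @ delta_o).T)   # tied-weight update
        b_o -= lr * delta_o.sum(axis=0)
        b_h -= lr * delta_h.sum(axis=0)
    return W, b_h, sigmoid(X @ W + b_h)   # codes for the *clean* input

# Greedy stacking: each autoencoder is trained on the codes of the previous one.
X = rng.random((32, 16))              # e.g. 16-dimensional enriched frequency vectors
codes, stack = X, []
for hidden in (8, 4):                 # halve the dimension layer by layer
    W, b, codes = train_denoising_layer(codes, hidden)
    stack.append((W, b))
print(codes.shape)                    # (32, 4)
```

The final `codes` would feed the supervised classifier (logistic regression in our pipeline), which can then fine-tune the whole stack, corresponding to the second phase of the two-phase protocol.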
The performance of the latter will be compared to the first two methods, which are considered as baselines. We chose different inputs and configurations of these methods with the aim of optimizing the performance. As far as the input is concerned, the CNN method with a single convolutional and pooling layer has been fed with concept embeddings learned from the non-enriched concept vector. The SVM, logistic regression and autoencoder-based methods have been tested on both the non-enriched concept vectors and the enriched frequency vectors. We tested the autoencoder with a denoising layer. Three different types of autoencoders (non-stop deep autoencoder, layer-wise autoencoder with merged input, alternative autoencoder on dictionary) with logistic regression were tested in parallel. Precision, recall, F1 measure and specificity (all micro-averaged) have been compared. In the evaluation, all methods have gone through a 5-fold cross-validation. For the concept embedding, the embedding is trained on the cross-validation fold (4/5 of the entire data corpus). The representations learned by the autoencoders are derived from the enriched or non-enriched frequency dictionary, while the layer-wise autoencoder with concatenated input has also been trained on the cross-validation fold (second configuration (line) in Table 2).

5. Results

In this section, we will first describe the results of the proposed baseline methods. Second, the performance of the autoencoder-based approaches will be presented. Finally, the efficiency aspects of training and deployment of the proposed methods are discussed.

5.1. Performance of baseline classification

Table 1
Baseline classification results according to precision, recall, F1 and specificity (true negative rate = true negatives/(true negatives + false positives)), all in micro average, for relevance determination based on different configurations. "3, 4, 5" indicates that three filters of sizes 3, 4 and 5 have been applied. The embedding was trained on the cross-validation fold (4/5 of the training data). SVM and logistic regression are evaluated on both enriched and non-enriched input.

Configuration                                                            Precision  Recall  F1      Specificity
CNN with single convolution and pooling, 3, 4, 5, Ada, L2 0.01           78.32%     49.76%  60.86%  93.86%
SVM classifier, linear kernel, non-enriched input                        30.04%     25.52%  27.60%  72.9%
Logistic regression, L2, Liblinear, non-enriched input                   23.24%     18.66%  20.70%  71.7%
SVM classifier, linear kernel, enriched input                            28.34%     31.89%  30.01%  63.2%
Logistic regression, L2, Liblinear, enriched input                       21.01%     19.31%  19.65%  66%
Deep autoencoder (non-stop training on dictionary), non-enriched input   73.12%     45.26%  55.26%  93.78%
Layer-wise autoencoder on concatenated input, non-enriched input         74.41%     55.76%  63.75%  94%

Table 1 shows the precision, recall and F1 measure of our baseline models (all in micro average) on non-enriched and enriched input. The CNN with a single convolution and pooling layer has been trained on embeddings obtained from the concept vector (A) demonstrated in Fig. 2. We can see that the CNN with the embedding significantly outperforms the two conventional machine learning methods, with an F1 measure of 60.86%. Based on a frequency-based concept vector with enrichment, the SVM classifier with linear kernel reached an F1 measure of 30.01%, while the logistic regression with L2 regularization yielded an F1 measure of 19.65%. With the same configuration, SVM and logistic regression with the non-enriched concept vector achieved F1 measures of 27.6% and 20.7%, respectively. The specificity (true negative rate) indicates that the irrelevant pairs are better recognized than the relevant pairs in general. The SVM and logistic regression with enriched input reached a higher recall (increasing by 1–6%) but a lower specificity (decreasing by 5–9%) than with non-enriched input. However, even with the enriched input, the outcomes of the SVM and logistic regression are still outperformed by the CNN with concept embedding, without depending on external knowledge.

The enriched version of logistic regression achieved a higher precision than recall, whereas the SVM with enriched input reaches a higher recall (31.89%) than precision. Generally, the enrichment setting (SVM, logistic regression) showed an improvement in F1 measure ranging from 1% to 4% due to an increase in recall. In order to give a reference for the autoencoder with enriched frequency vector, we have extended our baseline group with two autoencoder configurations with non-enriched frequency vectors. The deep autoencoder based on the dictionary reached a micro F1 score of 55.26%, a 5% decrease in F1 compared to the CNN, while the layer-wise autoencoder based on concatenated input showed a slightly better F1 measure (3%) than the CNN based on the non-enriched frequency vector. The layer-wise autoencoder based on concatenated input outperformed all the other baseline methods, since it achieved the highest recall (55.76%). Through the layer-wise pre-training with fine-tuning, this configuration gained a 10% improvement in recall in contrast to the non-stop deep autoencoder based on the dictionary with the same non-enriched input.

5.2. Performance of enriched autoencoder-based method

Table 2
Pre-trained autoencoders with classifier (logistic regression). Unless noted otherwise, the experiments are run with a denoising layer; in the last experiment, the denoising layer is disabled for comparison. Specificity refers to the true negative rate (true negatives/(true negatives + false positives)).

Configuration (all enriched)                                              Precision  Recall  F1      Specificity
Deep autoencoder (non-stop training on dictionary)                        79.32%     54.76%  64.14%  93.6%
Layer-wise autoencoder on concatenated input                              85.21%     59.56%  70.29%  95.3%
Layer-wise alternative autoencoder: representation learned on dictionary  84%        55.9%   68%     95.17%
Layer-wise alternative autoencoder on dictionary, denoising disabled      80.32%     52.46%  63.47%  94.14%

The performance of the enriched autoencoder-based methods using pre-training is presented in Table 2. Different types of autoencoders resulting from varying pre-processing steps are evaluated on the enriched frequency vectors illustrated in (C) in Fig. 2. All autoencoder models were evaluated based on the same enriched input.

As illustrated in Table 2, the layer-wise autoencoder based on concatenated input achieved the best F1 measure. The layer-wise alternative autoencoder trained on the enriched frequency dictionary also reached a high precision (84%) and a moderate recall of 55.9%. The enrichment led to an improvement of the F1 measure between 4% and 10% in comparison with the CNN method. The effectiveness of the denoising layers has been confirmed: the same configuration without the denoising mechanism showed a decrease in precision (4% less) and recall (3% less) in contrast to the same configuration with denoising. In all configurations, the irrelevant pairs were more accurately predicted than the relevant pairs according to the specificity scores, and the classifier tends to over-classify (overkill) the relevant pairs as irrelevant.

5.3. Efficiency of proposed methods

Given the same input, the number of iterations of the different model configurations before model convergence (change of Accuracy@dev < 5%) can be used to indicate the speed of convergence. The count can therefore be seen as an indicator of training efficiency. As can be seen in Table 3, the CNN-based method with a single convolution and pooling layer achieves the best efficiency due to its efficient embedding mechanism and the pooling with short sliding windows (3, 4, 5). Since most of the time costs of the autoencoders are consumed by the representation learning phase, we only compare the iterations and stops in the representation phase. Regarding the efficiency of deployment, TensorFlow provides a comprehensive library for production-ready model serving. With the required resources stored as a Docker container, the chosen model (ckpt file) is transformed into a protocol buffer (pb) file. The frozen model can be deployed as a classifier for relevance prediction. Inference and reconfiguration with the model are highly efficient. For the query normalization and enrichment, the ID MACS® provides response times at the millisecond level, while the representation generation through the frozen model using the pb file can also be expected to have a low response time.

Table 3
Learning efficiency of the proposed methods based on neural networks.

Configuration                           Efficiency of training (iterations)
Deep autoencoder (non-stop training)    14,000
Layer-wise trained autoencoder          Stopped by each training layer
Autoencoder trained on dictionary       8,900
CNN 3, 4, 5, Ada, L2 0.01               6,000

6. Discussion and conclusion

The current results have shown a clear potential of different models based on neural networks for clinical procedure classification. In contrast to previous research, we have conducted model design and evaluation in a relatively narrow application domain (clinical procedure encoding). We achieved better performance using autoencoder-based representation learning compared to the baseline methods implemented with CNN and conventional machine learning methods. The methods are evaluated on a subset of CHOP categories and 24,092 corresponding query-category pairs. With the insights yielded through the experiments, more work will be done to exploit the full CHOP catalogue with more training data.

6.1. Discussion of results

Within the current settings of the autoencoder, the non-stop pre-training is better suited for processing a large, balanced data set, while layer-wise pre-training should be applied when the data set is relatively small and unequally sampled. In our experiment, the layer-wise pre-training clearly yielded an improvement compared to the application of autoencoders without layer-wise initialization. We believe that the layer-wise pre-training has brought the model into the rational subspace, which leads to better optimization effectiveness and prevents the model from getting skewed by the imbalanced input. The enrichment has shown a positive influence on the matching performance. In our experiment, the enriched input processed by the layer-wise pre-trained autoencoder and supervised classifier achieved the best outcomes on the task of relevance determination.

In comparison with the performance of the baseline methods (SVM and logistic regression) with enriched input presented in Table 1, the non-linear transformation conducted by the layer-wise autoencoder has shown a clearly better result. We believe that this performance gain originates from the in-depth latent representation learned through the multi-layer autoencoder. In addition, the separate layers have also been fine-tuned globally with the supervised classification and get adapted to the task on the fly. These flexibilities make autoencoder-based representation learning clearly more attractive than the static linear integration of a knowledge weight based on word embedding, since the latter method can only be trained once based on a training set.

According to the results presented in Table 2, the local features within one vector of a query or one category text can already yield a high F1 measure, while the inter-query-category text features have apparently higher training costs but a slightly better performance on the test set. Hence, the feature selection should be determined by balancing training costs and performance requirements. As a potential extension, we can make use of the query- and category-specific features, such as the exclusive concepts in a category, as negation rules to complement the currently trained classifier. In the current training set, we have only eliminated the exclusive terms. In the next step, we can transform the vector pairs with exclusive terms into negative training examples.

6.2. Summary of principal findings

Through experiments, we compared different configurations of two models based on neural networks (CNN and autoencoder-based classification) to solve the matching issues. The CNN-based methods, SVM and logistic regression are used as baselines to compare with the autoencoder-based methods. The principal findings are:

1. A suitable vector representation for a high-dimensional sparse data set has been identified for the clinical encoding task: concept embedding is well suited for input with original sequence, while a frequency vector is a good representation for semantically enriched input.
2. The suitability of stepwise pre-training and non-stop representation learning based on autoencoders has been assessed: for text representation learning on a small amount of data, stepwise pre-training can be necessary and useful for a better generalization.
3. The usefulness of the knowledge enrichment has been confirmed. It has been shown that the enrichment process can facilitate the semantic matching and increase the model generalization on the test set. More specifically, the autoencoder with pre-training can facilitate the knowledge integration based on the enriched frequency vector. The denoising layer can also make moderate contributions to the performance in terms of model generalization.
4. CNN has been proven to be a suitable method for concept matching in the biomedical context. However, the embedding process and the sliding windows at the convolutional layer rely on the co-occurrence information within the input sequence. For tasks with sequence-less input, CNNs should be avoided or applied with adaptation. In this work, for a non-enriched input with original sequence, the CNN can achieve a moderate prediction performance with a high training efficiency.
5. The unsupervised pre-training based on autoencoders may slightly decrease the convergence speed but makes the model more generic. It achieves a better result in the cross-validation, while the knowledge can be integrated in the pre-training more smoothly.

6.3. Limitations and future work

One limitation is that the quality of the layer-wise pre-training can only be judged by the result of the matching prediction. The status concerning the weights and biases as well as the status of the model at each layer is still unclear. An investigation method should be developed to present the changes of the models, so that the correlation between the hyperparameters and the model status can be determined and the performance can be fine-tuned.

Secondly, the external knowledge base has been used in a relatively simple way: the enrichment was conducted based on hypernyms and hyponyms in the hierarchical terminology. The graphical features and the conditional probabilities between the concepts have not been exploited. As a next step, additional knowledge reasoning can be conducted with Bayes' rule to guide the matching process.

Last but not least, the current data set (24,092 pairs) is relatively
small for deep fine-tuning. A more comprehensive evaluation based on more log data from the production server (over 1 million pairs) has been planned.

During the training, we have not considered the exclusive concepts separately (certain concepts that must not appear in the category text are provided in each CHOP category) due to the capacity of the representation vector. The possible improvements are either the development of a separate classifier to generate an exclusion decision or the transformation of vector pairs with negative concepts into negative training examples.

As a next step, we would like to employ a larger data set to perform a more comprehensive layer-wise pre-training. The status of the model at each layer will be investigated to make the training more interpretable. Beyond that, additional features from the category texts, such as the exclusive concepts and the category hierarchy, will be used to complement the current classifiers, so that the training of the deep networks can be accelerated and the performance of the encoding can be improved. Finally, we will analyze the hardware requirements of the proposed neural network based methods, and hardware for deployment in real production will be selected.

Acknowledgement

This work is partially supported by ID Information und Dokumentation im Gesundheitswesen GmbH & Co. KGaA, Berlin, Germany. Our gratitude also goes to Marie-Anne Pinheiro for her valuable comments. We thank the anonymous reviewers for their careful reading of our manuscript and their many insightful comments and suggestions.