Artificial Intelligence in Medicine: Yihan Deng, André Sander, Lukas Faulstich, Kerstin Denecke

Artificial Intelligence In Medicine xxx (xxxx) xxx–xxx
Contents lists available at ScienceDirect
Artificial Intelligence In Medicine

journal homepage: www.elsevier.com/locate/artmed
Towards automatic encoding of medical procedures using convolutional

neural networks and autoencoders
Yihan Denga, , André Sanderb, Lukas Faulstichb, Kerstin Deneckea
⁎
a
Bern University of Applied Sciences, Medical Informatics, Biel, Switzerland
b
ID Information und Dokumentation im Gesundheitswesen GmbH & Co. KGaA, Berlin, Germany
ARTICLE INFO ABSTRACT
Keywords: Classification systems such as ICD-10 for diagnoses or the Swiss Operation Classification System (CHOP) for
Automatic encoding procedure classification in the clinical treatment are essential for clinical management and information ex-
Deep learning change. Traditionally, classification codes are assigned manually or by systems that rely upon concept-based or
Classification system rule-based classification methods. Such methods can reach their limit easily due to the restricted coverage of
Autoencoder
handcrafted rules and of the vocabulary in underlying terminological systems. Conventional machine learning
Convolutional neural networks
approaches normally depend on selected features within a human annotated training set. However, it is quite
laborious to obtain a well labeled data set and its generation can easily be influenced by accumulative errors
caused by human factors. To overcome this, we will present our processing pipeline for query matching realized
through neural networks within the task of medical procedure classification. The pipeline is built upon con-
volutional neural networks (CNN) and autoencoder with logistic regression. On the task of relevance determi-
nation between query and category text, the autoencoder based method has achieved a micro F1 score of
70.29%, while the convolutional based method has reached a micro F1 score of 60.86% with high efficiency.
These two algorithms are compared in experiments with different configurations and baselines (SVM, logistic
regression) with respect to their suitability for the task of automatic encoding. Advantages and limitations are
discussed.
1. Introduction assistants use free-text queries within their search whereas the output is
a set of possible classification codes. Consider the example in Fig. 1: A
In order to claim costs to the health insurance as well as for clinical medical documentation assistant wants to encode a procedure reflected
documentation purposes, hospitals and physicians are legally bound to by the keywords Chirurgie CT und MR (surgery, computed tomography
encode diagnoses and procedures with classification codes from re- and magnetic resonance in English). For this query, the assistant needs
levant classification systems. In Switzerland, these are ICD-10-GM for to conduct a top-down dictionary look-up within the CHOP classifica-
diagnoses and the Swiss Operation Classification System (CHOP)1 for tion system, which goes from the top level chapter C0 (measurement
clinical and surgical treatments. and intervention) down to subcategory 00.3 (computer assisted sur-
gery). Subcategory 00.3 is further divided in several subcategories: the
1.1. Challenges in automatic encoding category computer assisted surgery with “CT and MR” represented by
“Computergesteuerte Chirurgie mit mehreren Datenquellen” is one of
In this study, we are focusing on the task of automatic encoding the six subcategories under “computer assisted surgery”. All sub-
based on CHOP. In total, the CHOP system consists of 18 different ca- categories of 00.3 achieve at least a partial match with the query
tegories and over 14,000 different classification codes. In order to whereas the subcategories differ only slightly. The best results for
realize automatic encoding or search for relevant classification codes, partial matches are achieved for the codes 00.31 and 00.32 that contain
respectively, rule-based classification systems have been developed to at least one of the imaging procedures requested. Based on their
achieve an automatic encoding [1]. As input, physicians and coding background knowledge, the encoding assistants are able to decide the
Corresponding author.
⁎
E-mail address: yihan.deng@bfh.ch (Y. Deng).

1
https://www.bfs.admin.ch/bfs/de/home/statistiken/gesundheit/nomenklaturen/medkk/instrumente-medizinische-kodierung.assetdetail.1940914.html, ac-
cessed on September 9, 2017.
https://doi.org/10.1016/j.artmed.2018.10.001
Received 20 October 2017; Received in revised form 7 September 2018; Accepted 3 October 2018
0933-3657/ © 2018 Elsevier B.V. All rights reserved.
Please cite this article as: Deng, Y., Artificial Intelligence In Medicine, https://doi.org/10.1016/j.artmed.2018.10.001
Y. Deng et al. Artificial Intelligence In Medicine xxx (xxxx) xxx–xxx
Fig. 1. Query scenario and snippet from CHOP on category computer assisted surgery.
correct code is 00.35 (computer assisted surgery with multiple data each layer for specific tasks.
sources). However, for an automatic encoding this would require se- In this paper, we introduce and evaluate a pipeline based on neural
mantic knowledge that CT and MR are imaging data sources. networks to support the automatic CHOP encoding. More specifically,
In encoding systems, automatic extraction approaches and rule- the goal is to determine the best model of neural networks for the CHOP
based encoding approaches are employed to handle the user query and encoding and in this way, assigning a correct CHOP code to a query
determine the matching between query and category text. Such text. We are assessing different models based on neural networks with
methods extract specific information from query and category text respect to their performance (relevance determination in precision,
automatically and support in classifying them according to standard recall and F1). Moreover, we assess the impact of a semantic enrich-
medical terminologies [2]. Hence, comprehensive rules are specified ment of the query using a semantic knowledge on the classification
beforehand considering all the possible query situations, which is often performance.
highly time-consuming. It is also difficult to achieve completeness and Based on aforementioned challenges and objectives, we will address
correctness in the rule set. Conventional machine learning approaches the following research questions:
aim at learning correlation between query and CHOP encoding auto-
matically and deriving latent models from large data sets. Although • Which type of neural networks is better suited for the considered
manually chosen features have proven to be effective for some specific matching task in terms of efficiency: CNN or autoencoder?
classification tasks, the biases caused by feature selection at each pro- • Given the positive influence of semantic enrichment for the sake of
cessing step is unavoidable. For this kind of encoding with multiple balancing between query and category text, which method can be
subcategories, it is particularly difficult to obtain sufficient training applied to deal with the sparseness of the data set caused by a large
examples for each sub-category to make the classifier differentiable for vocabulary space of enrichment?
instance in the corresponding classes. Without these examples, a • Is autoencoder a suitable method for the knowledge integration?
trained classifier would have a reduced ability of discrimination for Does semantic enrichment of the feature set by concepts of a se-
each subcategory and would produce a large number of false positive mantic network impact the classification accuracy?
codes. • Is layer-wise pre-training of autoencoder suited for representation
With the resurgence of neural networks and deep learning techni- for the task of semantic matching?
ques, the end-to-end methods for data representation learning and ob-
ject/pattern recognition in both, text and image, have advanced sub- 1.2. Paper organization
stantially. For a classification task based on natural language text,
words and relations between words can be learned through semantic In Section 2, we summarize the related work. Moreover, we also
preserving feature vectors [3]. Besides, the convolutional neural net- introduce the characteristics of our task in contrast to the related work.
works (CNN) [4,5] enable a minimal manual pre-processing on the In Section 3, we present our data material that is used for training and
input data and learn the feature maps through applying a sliding evaluation. Section 4 includes the formal definition of the CHOP query
window on the original input according to a pre-defined filtering size matching problem in Section 4.1 and the methods we proposed for the
and stride. The features corresponding to different filters can be se- CHOP query matching problem in Section 4.3. Besides, a working ex-
lected by the convolution and pooling process (max pooling or average ample for input normalization and enrichment is described in Section
pooling). In contrast to supervised deep neural networks, the un- 4.2. The CNN based method and autoencoder based methods are de-
supervised pre-training approaches like restricted Boltzmann machine scribed in Sections 4.3.1 and 4.3.2, respectively. Section 4.4 and 4.5
(RBM) [6], deep belief nets (DBN) [7] and autoencoder [8] provide the describe the implementation and experimental settings of our methods.
possibility of stepwise adaptation. The latent representation of high The results and efficiency for relevance determination are presented in
dimensional input data can be layer-wise reduced and fine-tuned at Section 5. After summarizing our principal findings and pointing to
2
limitations in Section 6, respectively, we refer to directions of future deep representation learning based on unsupervised neural networks.
work in Section 6.3.
2.2. Unsupervised representation learning using neutral networks
2. Related work
In contrast to the training with annotated data by supervised
In this section, we will firstly introduce a fraction of relevant learning, the unsupervised deep learning method trains the model with
methods for medical document tagging and classification based on the input data itself (reconstruction of input). Based on the MNIST digit
neural networks. As next, the specific neural networks for representa- recognition data set [17], Larochelle et al. [18] have evaluated the
tion learning and knowledge integration will be presented in detail. greedy layer-wise pre-training of a deep network. The results confirmed
Finally, the insights provided by the previous work and the difference of that the layer-wise pre-training can largely increase the effectiveness of
our approaches in contrast to the existing methods will be highlighted. the network initialization and enable a rational starting point for the
classification. It has also proved that the training can achieve a rela-
2.1. Neural networks, medical document classification and tagging tively large benefit by the layer-wise pre-training, in particular, when
the amount of training data is small. In the clinical domain, Miotto et al.
Neural networks have been employed for a variety of tasks in the [19] have used a three-layer stacked denoising autoencoder (SDA) to
medical domain. Miotto et al. [9] provided a comprehensive summary learn a deep patient representation based on electronic health records.
for applying deep learning technologies in the medical domain. The representation was then employed to predict disease risk using
Through a review of 32 recent research papers regarding four domains random forest as classifiers. The evaluation was performed on 76,214
of clinical applications (clinical imaging, electronic health records, patients comprising 78 diseases from diverse clinical domains within a
genomics and mobile health), the related deep models such as CNN, time window up to one year. On the task of disease prediction, the SDA
recurrent neural networks (RNNs), restricted Boltzmann machine based methods have significantly outperformed other dimension re-
(RBM) and autoencoder (AE) have been discussed. The potentials and duction methods such as principal component analysis (PCA), k-mean
opportunities of deep models relating to our encoding task are the on the task of disease prediction.
feature enrichment, the incorporation of expert knowledge and the In our task, a suitable representation learning method is essential for
interpretability of the model. The consideration of these aspects could balancing the difference between query and CHOP category text.
potentially improve the model performance and also increase the ac- Beyond that, the representation learning should also be able to facilitate
ceptance of deep learning methods in medical use cases. the knowledge integration from external medical knowledge base.
Similar as the CHOP classification in the sense of medical procedure
indexing, the medical subject headings thesaurus (MeSH) is a hier- 2.3. Word embedding and autoencoders for knowledge integration
archical medical thesaurus for indexing biomedical literature. It con-
tains 16 top categories and 27,455 main headings.2 The assignment of Another important trend of representation learning is the integra-
MeSH tags to free text based on neutral networks depends on a suitable tion of external knowledge through unsupervised learning. In the gen-
feature representation and a algorithm for the relevance determination. eral domain, word embedding with semantic enrichment has been
Peng et al. [10] proposed DeepMeSH, an approach for MeSH tag- evaluated by Yu and Dredze [20] and Celikyilmaz et al. [21]. The
ging based on unsupervised deep representation learning. The docu- knowledge from terminologies like Wordnet3 or PPDB4 is exploited in
ment to vector (d2v) [11] and tf-idf feature embeddings combined with these approaches. The type of relations and weights of the relations
MeSHLabeler [12] achieved the best score on the task of large scale between concepts have been combined with a word embedding model
semantic indexing in the 2017 BioASQ3 challenge [13] (task 5a). Fur- (word2vec) linearly. The linearly combined models were trained on
ther, with an F1-score of 0.6323, the DeepMeSH method yielded an different data sources: the word embedding was firstly trained on the
improvement of 12% and 2% respectively, in comparison to the two input corpus based on continuous bag of words (CBOW). Then, the
baseline index algorithms: Medical text indexer (MTI) with an F1- normalized weights from related words found in the knowledge re-
Measure of 0.5637 and MeSHLabeler with an F1-Measure of 0.6218. source were used to update the weight parameter of CBOW and opti-
Du et al. [14] provided another possibility of feature representation mized according to the same lost function. Yu and Dredze's method
for MeSH indexing beside d2v. They employed the bidirectional re- achieved a 19% improvement in mean reciprocal rank (MRR) on the
current neural network (BRNN) and an auxiliary regression mechanism task of finding semantically related words using the embeddings, while
to conduct the primary multi-label classification. The composed serial Celikyilmaz et al. have achieved around 2% of improvement in F1
structure between terms can therefore be extracted. The algorithm has measure on the task of semantic tagging within one movie dataset [22].
outperformed the state-of-art baseline Medical Text Indexers (MTI) in Faruqui et al. [23] have further improved the knowledge integration
F1-score with 0.6220 and reached a higher precision (0.77) than through a post-processing step by conducting belief propagation on a
DeepMeSH (0.70). One earlier attempt conducted by Rios and Kavuluru graph obtained from lexicon-derived relational information. These
[15] used CNNs to assign MeSH terms to biomedical articles. Paper methods outperformed the prior approaches developed by Yu and
abstract of publications listed in Pubmed have been processed by Dredze [20] with 5% (sentiment analysis [24]) to 20% (Synonym Se-
multiple-layer CNNs proposed by Kim [16]. The simple CNN structure lection (TOEFL) [25]) improvement in accuracy.
with single convolution and pooling layer achieved an absolute im- Yu et al. [26] used semantic hashing based on autoencoder to
provement of over 3% in macro F1-score on selected subsets of MeSH support short text understanding and retrieval. More specifically, the
terms compared to the baseline MTI method. short text representations were enriched with concepts and their co-
The aforementioned examples of MeSH tagging demonstrate the occurring concepts from Probase [27]. After the enrichment, the
strong potentials of the application of neural networks for medical stacked autoencoder was applied to reduce the dimension of the re-
document classification (tagging) and show the importance of representation. The obtained hashing representation of short text has
presentation learning. One lesson learned from the method design is yielded to an improvement of 20% to 30% on the task of news retrieval.
that a comprehensive representation combining multiple levels of On the task of classifying Wikipedia sentences, the proposed method
salient features can largely enhance the performance of the follow-up has obtained an improvement of 10% (in comparison with other types
classification. In the follow section, we will take a further look at the
3
https://wordnet.princeton.edu, accessed on September 9, 2017.
2 4
https://www.nlm.nih.gov/mesh, accessed on August 23, 2018. http://www.cis.upenn.edu/∼ccb/ppdb/, accessed on September 9, 2017.
3
of enrichment) to 18% in accuracy (in contrast to non-enrichment). 3. Data set

In general, the deep neural networks have approximated a latent
representation with a suitable dimension gradually through non-linear The queries and associated CHOP codes have been collected from
transformation. The obtained deep representation was proved to be logs of a bug-tracker of the current rule-based system ID DIACOS®,9
useful on aforementioned classification tasks. including both, open (unresolved) and resolved search pairs. The search
In the biomedical domain, a large number of ontologies and ter- pairs are grouped into search issues. One search issue stands for the
minologies (e.g. SNOMED CT,5 UMLS6 Radlex,7 OntoMed8) provide encoding of a series of similar cases (see example in Fig. 1). Currently,
valuable knowledge on clinical concepts and their relations. Based on about 90% of the query and category description pairs can be processed
the findings from the general domain as presented before, we hy- with the knowledge base ID MACS®.10 About 85% of the matching pairs
pothesize that integrating such knowledge bases into the representation are completely solved in terms of accuracy. The relevance of the CHOP
of texts will introduce in-depth rich semantics to the representation and category description and the user query has been annotated manually
will lead to a quality improvement for the classification tasks. However, after the data gathering. Annotations were conducted by customers and
to our best knowledge, the knowledge enrichment for the purpose of internal testers and were quality-assured by encoding experts, taking
medical encoding with neural networks has not been well studied. In into account the official coding guidelines for CHOP. In order to eval-
particular, the enrichment of German medical text for the task medical uate the annotation quality, a random sampling within a subset of the
encoding is still an open question. Hence, the potentials of word em- annotated pairs including 15% all available unique codes has been
bedding and autoencoder for the knowledge integration will be in- conducted. When a query was labeled as relevant with the correct ca-
vestigated within our experiments. tegory description, the data pair was considered as correctly annotated
pair. Based on this criterion, the annotated data set has reached a match
rate over 98.6%.
2.4. Our task in contrast to related work In summary, the data set underlying this work comprises 24,092
query-category training pairs, consisting of relevant (7,522) and irre-
Compared with the document tagging with MeSH headings, a spe- levant (16,570) pairs. There are 856 unique queries and 14,898 unique
cific challenge in our work is that the structure of the query and the categories in the training pairs. The query and CHOP category have a
CHOP category description differ substantially from texts and queries many-to-many relationship. This many-to-many relationship and the
considered in related work. One encoding query or one CHOP category similarity between category text under one category issue may lead to
description normally contains only one or a few biomedical concepts. difficulties in the model training. However, we will implement incre-
Complex sentence structures can neither be found in the query nor in mental matching rules: The classifiers will be indexed according to the
the category text. Instead, a nominal style is used. However, we have to CHOP structure, updated and redeployed in case of new log data or
consider a high-dimensional vocabulary space. In contrast to public category changes.
accessible medical knowledge bases in English and other open knowl-
edge bases in the general domain, German medical knowledge bases are 4. Methodology
either unavailable (there is no free accessible translation of SNOMED
CT available in German) or they are managed by industrial companies In this section, the definition of data representation on the basis of
and optimized for special commercial purposes. The access and appli- data pre-processing will be introduced in Section 4.1. For illustration
cation of those knowledge bases may therefore be limited by the in- purposes, a working example of all representation types is presented in
terface in terms of concept enrichment, i.e., the search of similar con- Section 4.2. In Section 4.3, details are provided on our methods based
cepts can only be conducted based on provided function for the on CNN and autoencoder for the task of relevance determination. Fi-
similarity comparison. Hence, a suitable method for representation nally, the implementation and settings of the experiments are specified.
learning and enrichment should be designed in consideration of avail-
able interfaces provided by selected external knowledge bases.
4.1. Data representation
To deal with these challenges, we will exploit several types of
methods to enable a knowledge-based semantic matching in the CHOP
In order to process pairs of query and category text with neural
encoding. In view of the aforementioned state-of-the-art analysis
network based methods, a suitable pre-processing approach should be
(neural networks for medical text tagging, unsupervised representation
employed to transform the raw data pairs into the vector representa-
learning, knowledge integration), we will implement and evaluate a
tions. In the following two subsections, we will firstly introduce the pre-
baseline method using CNN due to its known performance and effi-
processing method (normalization and enrichment) for query and ca-
ciency on the task of text classification and tagging. Secondly, a suitable
tegory pairs. Subsequently, the vector representation will be proposed
representation will be determined for query and category text in our
considering the output of pre-processing and the requirements of
specific task. The possibility of knowledge integration in the neural
follow-up networks.
network will be investigated with a particular focus on the knowledge
enriched matching based on unsupervised representation learning [8],
Besides, the semantic matching with conventional machine learning 4.1.1. Data pre-processing
methods (SVM, logistic regression) will also be tested as the baseline of In this work, we consider the problem of assigning one or more
the neural network based methods. CHOP codes to a free-textual query. Within the CHOP classification
We hypothesize that the application of neural networks is a suitable system, to each CHOP code a category description, i.e. a short free-
technique to complement the conventional method and allows to ef- textual category description is given. For normalization and re-
fectively use the available features of the data set. presentation purposes, these category descriptions and queries have
been mapped to concepts of a medical terminology through the ter-
minology server ID MACS®, a software provided by the German com-
pany ID Information und Dokumentation im Gesundheitswesen. ID
5
https://www.snomed.org/snomed-ct/what-is-snomed-ct, accessed on
9
September 9, 2017. https://www.id-berlin.de/en/products/1-coding/1-id-diacos/, accessed on
6
https://www.nlm.nih.gov/research/umls/, accessed on September 9, 2017. September 9, 2017.
7 10
https://www.rsna.org/RadLex.aspx, accessed on September 9, 2017. https://www.id-berlin.de/en/products/8-terminology/15-id-macs/, ac-
8
http://www.onto-med.de/ontologies/, accessed on September 9, 2017. cessed on September 9, 2017.
4
MACS® contains a manually curated medical ontology, which is a concepts.

German derivative of an early version of SNOMED called Wingert k
Q1: nq = x1 ||x2 ||…||x nq , x i (1)
Nomenclature [28]. ID MACS®11 has been in use in more than 1,000
hospitals in Germany, Switzerland and other countries in the past two k
C1:nc = x1 ||x 2 ||…||x nc , x j (2)
decades. Since then the company has developed and extended the ID
MACS® continuously. Its quality is assured by various manual and au- The query and category text are padded to n with zero if necessary,
tomated processes. It has been found useful in applications such as where the n corresponds to the size of the non-enriched vocabulary
semantic search and drug safety. space for both, query and category text. In this way, we get the con-
One challenge of our data set is that queries differ from CHOP ca- catenated query and category pair:
tegory descriptions in several aspects, which makes it difficult to match
P1:2 * n = Qn ||Cn (3)
queries with descriptions: queries are short (often single word), am-
biguous (containing abbreviations or typos), use broader or narrower This processing step is required since a padded input with same length
terminology than the category description texts, may contain irrelevant facilitates the batching and the convolution and pooling process in
information. For addressing this issue, we generate a normalized and CNN.
enriched feature vector by including related concepts of a term from the For the representation of the frequency-based input for autoencoder
ID MACS® semantic network. The enrichment aims at bridging the gap (B) and (C), the normalized query and category text share the same
between queries and categories for the matching task. For instance, in vocabulary space. Hence, the vector of query and category description
the case of a query containing broader concepts than the category de- is represented with the same dimension. Let x be the frequency of one
scription, the enriched query vector would contain also the narrower concept l appearing in the query of category text, while l is the index of
concepts occurring in the category description. Our intuition to perform this concept in the concept dictionary. The dimension of one vector
this pre-processing approach is confirmed by Koopman et al. [29]. They represents the full size of the vocabulary space of one corresponding
concluded from a user study with pre-defined queries and topics that concept dictionary. The value of the frequency of the corresponding
the human process of novel keyword choosing can largely facilitate the index l is obtained by computing the frequency divided by the count of
retrieval of target biomedical documents. Our approach based on se- concepts in the vector of a query or category description. The concepts,
mantic query enrichment and relation prediction through a deep neural which have not appeared in the vector, are set to zero. A query with m
network may emulate this human keyword selection process. Through dimensions is therefore denoted as
our proposed method of query expansion, the discrepancy between
Q1: m = (x1, x2, …, xl, …, xm ) (4)
short and ad hoc query and well-formulated category text will be re-
duced. A more distinguishable representation can be generated for re- where xm indicates the frequency of the term at the corresponding index
levance prediction. The concept vector for query and category text has m. Similarly, the value of x′ in C represents the frequency of a concept x′
therefore two variants: the first concept space only contains concepts in one category text description. A category textual description is
recognized in a query or category text, which sum up to 5,919 concepts. therefore denoted as
The second dictionary contains the enriched vocabulary with 20,193 C1:m = (x1 , x 2 , …, xl , …, x m) (5)
concepts, which includes the parent nodes and ancestor nodes of each
concept found in a text as retrieved from the medical semantic network The enriched vector representation contains more concepts and also
ID MACS®. The concept vector can be looked up in the case specific higher dimensions than the non-enriched vector. The non-enriched
vector dictionary through query id and category id. vector contains 1 or 0 to indicate the occurrence of one concept, while
the enriched vector contains additionally the normalized percentage
value received from a similarity function to show the frequency of
4.1.2. Representation definition
concept appearance. The frequency-based input vector has eliminated
After the data pre-processing, the CHOP classification problem can
the original sequence of terms. The sequence of the dimension corre-
be considered as a matching task between the concept vectors of a
sponds to the concept sequence in the vocabulary dictionary. Based on
query and a category text description. In our task, two types of vector
the representation, the objective of our model is to learn the relevance
representations are defined for query and category text: (1) an em-
(relevant or irrelevant) of C for category encoding Q.
bedding representation learned from concept sequences of query cate-
gory pairs exploited by the CNN based approaches ((A) in Fig. 2), and
4.2. Example of vector representation, normalization and enrichment
(2) a frequency-based vector representation for autoencoder-based
methods ((B) and (C) in Fig. 2), while the frequency based vector can be
To illustrate the details of the data representation, normalization
derived from either non-enriched concept sequences or enriched con-
and enrichment we consider a concrete example of the single term
cept sets.
query “liquorfistel”. One CHOP category text for the corresponding
For representation of query and category pairs based on concept
code 03.59.12 is the category description “Liquorfistelverschluss am
embedding, a task-specific concept embedding is trained based on a
Rückenmark” (“closure of a spinal liquor fistula”) (see Fig. 2). The
normalized concept sequence from query and category pairs. For this
query is mapped by ID MACS® to two concepts associated with the
purpose, the word embedding framework word2vec [3] is applied. One
terms “Fistulisation” and “Liquor” (internal concept ids are 1226 and
query and category pair contains a sequence of normalized concepts,
3389) while the CHOP category text is represented by the concepts
while the word2vec algorithm generates the embedding representation
“closure of a fistula” (id: 1493) and “Liquor spinals” (id: 3410). The
referring to the co-occurrence of the concepts within the scope of one
vector representation of this text is a sparse vector with two concept
query and category pair. Since no enrichment has been applied to the
entries corresponding to the dimensions of the related concept dic-
concepts, the original sequence of these concepts is kept. Based on this
tionary space from ID MACS®.
setting of embedding, let ∥ be the concatenation operation and x i k
As it has already been explained in Section 4.1.2, the query vector is
be the k dimensional concept vector referring to the ith concept in one
represented by (1226, 3389). Meanwhile, the vector with frequencies is
query with nq concepts. Similarly, let x j k be the k dimensional
represented by a vector of the form (0 … 0, 1, 0 … 0, 1, 0 … 0) having
concept vector referring to the jth concept in one category text with nc
two non-zero entries at the positions of the two concepts to which the
query has been mapped to. All texts of the data set with positive and
11
https://www.id-berlin.de/static/files/products/id_logik.pdf, accessed on negative (query/category) pairs are represented in this way. In the
August 19, 2018. enriched vector representation, additional concept entries are added to
5
Fig. 2. Example for input normalization and enrichment. (A) shows one example of the non-enriched concept vector and the corresponding embedding of query and
category text. (B) indicates the non-enriched concept vector with frequency. (C) represents the enriched concept vector with frequency.
each vector, namely those concepts that have been retrieved as ances- 4.3.1. Matching with CNN
tors of the originally mapped concepts. This is realized by a similarity A CNN is a type of feed-forward artificial neural networks, which
function. The similarity function employs the class hierarchy and re- has been successfully applied to both, image processing [30] and nat-
lations of the ID MACS® semantic network to determine the relatedness ural language processing [31]. Our architecture exploits a CNN with
of two concepts. The value generated by the similarity function re- one layer of convolution and one layer of max pooling. In the con-
presents the frequency of one concept appearing in the representation volutional layer, different filter sizes can be defined to cover the po-
of one query or category text. For instance, an entry for spinal column tential semantic scope in the query and category description. The ar-
with value 0.86 is added to the sparse vector representation of the chitecture is therefore similar to the CNN employed by Kim and Shen
target text, since spinal column is related to spinal liquor in our se- [16,32]. As input for the training, the vectors of the query and category
mantic network. Similarly, a concept entry for “fistula extirpation” with text from the training set are concatenated as one vector. The output of
value 0.89 is added because this concept describes another fistula-re- the network is the relevance of a category text with respect to a query.
lated operation which is connected in our semantic network to the The features are obtained from multiple filters (3, 4, 5); the convolution
concept “closure of a fistula”. Hence, the enriched frequency vector has is selected through max-pooling and fully connected into one feature
a form of (0 … 0, 1, 0.86, 0 … 0, 0.89, 1, 0 … 0, 0 … 0). The non-en- vector. The filter size 3, 4, 5 have been determined based on the length
riched concept vector is the input for the embedding layer of CNN statistics of our training corpus as most of the concept vectors for query
(labeled with (A)). The two frequency-based vector representations and category description have been mapped to 3–5 concepts.
((B), (C)) are the input for autoencoder, SVM or logistic regression. Based on the formulas (1)–(3) in Section 4.1.2, a pair of padded
query and CHOP category description is modeled as
4.3. Matching methods with CNN and autoencoder
P = Qn ||Cn = {x1, x2 , …, x i , …, x n , x1 , …, x j , …, xn} (6)
In our pipeline and experiments, we exploit mainly two types of Since query and category text share the same vocabulary after the
models based on neural networks (see Fig. 3). normalization and also the same index in the concept dictionary, the
The first approach is based on CNN whereas the second type of query and category pair can be transformed into: P = {x1, …, x2n}. The
network is a classification model connected to a representation layer convolution process is conducted through filters with predefined size,
trained with autoencoders. The CNN are implemented as baseline to which slides through the concatenated concept vector P. As it has al-
compare with the autoencoder based method. Since the original se- ready been mentioned in Section 4.1.2 shortly after formula (3), the
quence of text has been eliminated after the semantic enrichment and dimension n represents the padded length of a query or a category
the sequence between concepts is a crucial prerequisite for the em- vector. In this network, a filter of size h generates a windows of xi:i+h.
bedding and convolution in CNN, we will only apply the holistic The corresponding features are obtained through an activation function
knowledge integration for the autoencoder method. Additionally, the f with a bias term b ∈ in form of
linear-kernel SVM and logistic regression with L2 regularization are
Convolutional_features = f (w × x i : i + h 1 + b) (7)
applied on both, enriched and non-enriched input, as generic baselines
to compare with the neural network-based approaches. For each dimension of concept embedding (size k in presentation
6
Fig. 3. Data flow for the training: the bug-tracker logs (query-category text pairs) are processed with the concept dictionary mapper. A non-enriched concept vector
dictionary is generated for concept embedding, while enriched vectors are prepared for autoencoder based representation learning. The representation is trained and
cross validated for relevance determination. As is illustrated in Fig. 2, (A) represents the non-enriched concept vector for concept embedding. (B) indicates the non-
enriched concept vector with frequency. (C) represents the enriched concept vector with frequency.
definition: 128) we have assigned one filter to slide through all the In order to achieve an optimized network configuration for the CNN
possible windows in the concept vector. As it can be seen in Fig. 4, a experiment, we have configured and evaluated the outcome of the
feature vector with length of n − h + 1 is generated at each dimension embedding based on annotated relevance metrics [33]. The rectified
of the embedding. Next, the maximum value in the feature vector is linear unit (RELU) has been applied as activation function; filter win-
pooled out through the max pooling process. Next, the maximum value dows of 3, 4, 5 with 128 dimensional feature maps are applied. We have
in the feature vector is pooled out through the max pooling process so chosen the size of 128 to ensure a fast calculation of similarity metrics.
that a fully connected layer can be created. At last, the softmax layer is The dropout rate has been defined as 0.5 whereas an L2 constraint (s) of
attached to conduct the classification. 0.01 has been chosen to avoid over-fitting. The batch size – chosen
Fig. 4. CNN for relevance determination of CHOP query and category text matching, filter sizes with 3, 4, 5 are selected.
7
concept mapper, more specifically, it is based on the aforementioned

similarity function s(c, c′) (see Section 4.2), which assigns a pair of
medical concepts a similarity value in the range of [0, 1], in particular, s
(c, c′) = 1,
h i = f (x i ) (10)
is the encoder function and
ri = g (hi ) (11)
is the decoder function. Let Vrich = g(f(Vrich)) be the optimization
Fig. 5. The basic autoencoder with three layers. The network tries to reproduce target. We will use the same concept mapping process and similarity
its input at the output layer. function to generate the enriched input vector. The autoencoder will
learn to reduce dimensions of the enriched vector through the encoder
and decoder process and to minimize the loss function L:
through incremental experiments using the evaluation set – is 64. These L (Vrich, g (f (Vrich ))) (12)
settings were selected considering the performance and computing ef-
ficiency. where the hi vector is the latent representation used in the relevance
classification. As a consecutive layer after the dimensional reduction
4.3.2. Holistic matching based on an enriched representation layer, another layer based on denoising autoencoder is added to retrain
An autoencoder is an unsupervised learning algorithm that auto- the output from the previous layer. The difference to the previous layer
matically learns features from unlabeled data [8]. It is actually a three- is that the noise derived from noise distribution D, such as Gaussian,
layer neural network and the learning process consists of two main Masking noise, salt-and-pepper, etc., [38] is used to corrupt the original
stages, namely the encoder and the decoder. It can be seen as a feed- input and the autoencoder can then try to differentiate the target pairs
forward neural network trained to reproduce its input at the output within the noise environment, so that a more generalized representa-
layer. The encoder takes a vector x d
as input and encodes it to tion can be learned. We define the data corruption function as qc. In
hidden feature h d
within hidden layer through a deterministic contrast to the original data Vrich, the data corruption process
mapping, that is h = f(Wx + b), where f is the activation function, W is q ( Vrich|Vrich ) has been pushed away from the original data distribution.
a d × d′ dimensional weight matrix. The decoder needs to reconstruct So the loss function can be defined as:
the input: the code h is further decoded to y, which is seen as a pre- L (Vrich, g (f ( Vrich ))) (13)
diction of x that is y = g(W′h + b′), where g is a similar activation
function, W′ is the transposed tied weight with W, b′ is a d dimensional where Vrich represents the enriched vector mixed with predefined
bias vector illustrated in Fig. 5. noise.
The semantic matching based on a single layer convolution and In the perspective of manifold learning, the learning process by the
pooling network has only represented one trend of textual semantic denoising autoencoder is aimed at projecting the corrupted examples
matching [5,34,26]. However, for applications in the medical domain, back to the low dimensional manifold through loss function mini-
knowledge bases and existing terminologies are a valuable resource to mization, so that the obtained representation can yield a clearly higher
enable the clinical interoperability and logical reasoning. These re- generalization level than autoencoder based on a normal data set [37].
sources are expected to balance the input and to facilitate relevance This assertion matches the attributes of our training data perfectly due
determination by bypassing the unnecessary connection in the value to the high similarity between the training set within one CHOP cate-
propagation chain. However, the knowledge is not always easy to be gory (search issue). The performance of the denoising layer in our case
integrated. For the mechanism based on neural networks, especially, will be evaluated. As far as the vector size at each layer is concerned, we
after the sparse coding and semantic enrichment, we normally obtain employed the empirical definition for the vector size: each layer re-
vectors indicating higher-level features and additional knowledge to- duces the vector size approximately to half of the vector size of its
gether with extended dimensions [35]. With the aim of the knowledge previous layer. For the concept vector without enrichment, a layer di-
utilization based on the local proximity and probability from the raw mension with vector sizes: (11838)-5919-2969-1479-500 is applied by
input data, a suitable mechanism to integrate the knowledge in the the four autoencoders’ layers (see Fig. 7). For the concept vector with
representation layer is required. In this work, we propose a mechanism enrichment, a layer size configuration with vector sizes: (40278)-
based on autoencoder to enable biomedical knowledge injection at the 20000-8000-4000-1500-500 has been employed. The current definition
phase of representation learning. In comparison with the stacked au- enables an efficient experiment with a comparable outcome. A fine
toencoder applied on handwritten digit data set [36,37], our method tuning of the vector size and decreasing interval is out of the scope of
has been adapted to process short text. Instead of learning the re- this paper.
presentation from training corpus only, we have employed a German The deep autoencoder extends the single two layer structure into
biomedical terminology ID MACS® as prior knowledge to train stacked multiple layers of symmetrical neural nets to increase the approxima-
autoencoder and to evaluate the influence of external knowledge on tion capacity. As a comparable benchmark to the stacked denosing
representation learning. Derived from the same aforementioned re- autoencoder, we will also evaluate the non-stop deep autoencoder. The
presentation formulas (4) and (5) in Section 4.1.2, the holistic matching training of representation will be fully supervised and the layer attri-
process can be described with the following formulas: butes will not be initialized in a layer-wise manner. The sigmoid
function will be applied at each layer as activation function and the
Mvector = Qm ||Cm (8)
cross entropy loss is computed as objective of minimization. The con-
where Q and C represent the query and category text whereas Mvector crete steps of proposed approaches will be introduced in the following
stands for the concatenated vector, m is the shared dimension of vo- Section 4.3.2.1.
cabulary in both query and category text,
4.3.2.1. Pre-training of frequency based vectors. The pre-training is an
Vrich = fMacs (Mvector ) (9)
important technique to let each layer of a neural network be initialized
where Vrich stands for enriched vector. The function fMacs represents the correctly. At this phase, we employed the unsupervised method based
enrichment approach provided by ID MACS® semantic network and on autoencoder to obtain a compact high level representation of the
8
Fig. 6. Relevance classification based on stacked autoencoder. The details of the pre-training module are illustrated in Fig. 7.
Fig. 7. Pre-training architecture implementing 4 stacked denoising autoencoders.
input data, meanwhile, the input information is attained through these also follow this two phase protocol. The pre-training can then be un-
pre-training steps as much as possible. Based on the pre-training derstood as the first part of an entire two phase protocol that combines
network illustrated in Fig. 6, the vector enriched by ID MACS® is the pre-training phase and a supervised learning phase. The supervised
concatenated into one vector in form of Query ∥ Category. The vectors learning phase involves training a classifier on top of the features
are multiple-hot coded with concepts. The networks are initialized layer learned in the pre-training phase. At the same time, the supervised
by layer with denoising mechanism: classification can fine-tune the entire network learned in the layer-wise
(1) Stepwise initialization: In Fig. 6, the main architecture of our pre-training phase.
mapping process is presented. With the enriched query and category As a comparable experiment, we will also evaluate the non-stop
text concatenation, the four nested autoencoders are initialized. We deep autoencoder, which conducts the latent learning without inter-
followed the principle of a greedy layer-wise initialization [38]. The ruption. The F1 measure of the classification and the efficiency of the
greedy algorithm optimizes each piece of the solution independently, training will be compared.
one layer at a time, rather than jointly optimizing all layers. Specifi- (2) Enabled with denoising mechanism: A denoising autoencoder
cally, greedy layer-wise pre-training proceeds one layer at a time, minimizes
training the kth layer while keeping the previous layers. The lower
layers (which are trained first) are not adapted after the upper layers
have been introduced. In our pre-training model illustrated in Fig. 7, we L (x , g (f (x ))) (14)
9
Fig. 8. Using autoencoder to learn latent representation directly on an enriched single query and category concept dictionary.
where x̃ is a copy of x that has been corrupted by some form of noise. autoencoder, the vector size will be reduced to the half of the previous
The denoising autoencoders must therefore undo this corruption rather layer. In the last step, the supervised classifier has received a 500 di-
than simply copying their input. The adding of noise has firstly pushed mensional vector to conduct the relevance classification. We have de-
the original data distribution away from the target low dimension fined a batch size of 128 to do the training. This configuration is de-
manifold, while the learning process has to project the noised input signed with the consideration of both performance expectation and
data back onto the “manifold” [36]. With the input from the last hidden training cost. Since we conduct experiments with limited hardware
layer hl: resources (single GPU with 12G memory), an acceptable training time
and cost is essential for a comparison based on different approaches.
qD = g (hl + N ) (15)
where N is the Gaussian noise and g is the activation function, the de- 4.4. Implementation
noising process is expected to achieve a higher generalization than
ordinary autoencoders. We implemented the proposed architecture with the Tensorflow
framework.12 Tensorflow was chosen, since it provides comprehensive
4.3.2.2. Alternative representation learning: applying autoencoder only on toolkits for the construction of embedding and neural networks [39].
query and category dictionary. The aforementioned approaches start all The computational network works as a road map of the data processing
with a concatenated vector of query and category text. There is another workflow, whereas the real data is not loaded into the computing
way of training which can largely reduce the training time by graph. The input is only defined with sizes and attributes within pla-
representation learning only on the CHOP concept dictionary ceholders. The real value input will be triggered only after the session
(representation learning before query/category concatenation) (see has been initialized through queue feeding or dictionary feeding, while
Fig. 3). The concept dictionary contains the single vector queue is defined for asynchronous tensor input and dictionary is used to
representation for query and category. The corresponding vector input the data with a small amount statically. This deferred form of
representation can be obtained through query/category id as graph computing can also optimize the resource utilization and scal-
dictionary key. The query and category text can then be assigned ability during the processing. For the creation of a CNN, Tensorflow has
with a short representation separately before the concatenation of the provided interfaces like tf.nn.embedding_lookup, tf.nn.conv2d
query and the category text see Fig. 8. This alternative approach learns and tf.nn.max_pool. The convolution and pooling process can be
the representation locally based on the vocabulary vector created and configured intuitively to constitute our proposed CNN ar-
representation without supervised tuning. In comparison with the chitecture. In this implementation, we have also used the Python APIs
autoencoder with fully connected query and category text vector, from the package scikit-learn13 and numpy14 to implement the bench-
illustrated in Fig. 6, the alternative method has concentrated on the mark.
learning of intrinsic relations within a single query or a single category
text. It is expected that the relatively low dimension of the
4.5. Experiments
representation leads to a short training time and a fast converging.
For the experiment with autoencoder with supervised classification,
Three groups of methods are tested: the conventional machine
the configuration is realized as follows: we employed four autoencoders
learning methods (SVM, logistic regression), the convolutional pooling
consecutively to reduce the dimension directly after the enriched input.
based method and the autoencoder based relevance determination. The
The noise is added to the encoding step of each of the four layers, while
a denoising mechanism has been employed at each denoising layer. The
sigmoid activation function and cross entropy loss are calculated at 12
https://www.tensorflow.org, accessed on September 12, 2017.
each layer. The stochastic gradient decent (SGD) is used as optimization 13
http://scikit-learn.org/stable/, accessed on September 9, 2017.
method. As is already mentioned in Section 4.3.2, through each 14
http://www.numpy.org, accessed on September 9, 2017.
10
Table 1 Table 2
Baseline classification results according to precision, recall, F1, specificity (true Pre-trained autoencoders with classifier (logistic regression). If not mentioned,
negative rate = true negatives/(true negatives + false positives)) (all in micro the experiments are enabled with denoising layer. In the last experiment, the
average) for relevance determination based on different configurations. 3, 4, 5 denoising layer is disabled for comparison. Specificity refers to the true nega-
represent that three filters of size 3, 4, 5 have been applied, respectively. The tive rate, true negative rate = true negatives/(true negatives + false positives).
embedding was trained on cross-validation fold (4/5 of training data). SVM and
Configuration (all enriched) Precision Recall F1 Specificity
logistic regression are evaluated on both enriched and non-enriched input.
Configuration Precision Recall F1 Specificity Deep autoencoder (non-stop training 79.32% 54.76% 64.14% 93.6%
on dictionary)
CNN with single convolution and 78.32% 49.76% 60.86% 93.86% Layer-wise autoencoder on 85.21% 59.56% 70.29% 95.3%
pooling 3, 4, 5 Ada L2 0.01 concatenated input
SVM classifier linear kernel with non- 30.04% 25.52% 27.60% 72.9% Layer-wise alternative autoencoder: 84% 55.9% 68% 95.17%
enriched input representation learned on
Logistic regression L2 Liblinear with 23.24% 18.66% 20.70% 71.7% dictionary
non-enriched input Layer-wise alternative autoencoder 80.32% 52.46% 63.47% 94.14%
SVM classifier linear kernel with 28.34% 31.89% 30.01% 63.2% dictionary denoising disabled
enriched input
Logistic regression L2 Liblinear with 21.01% 19.31% 19.65% 66%
enriched input are better recognized than the relevant pairs in general. The SVM and
Deep autoencoder (non-stop training 73.12% 45.26% 55.26% 93.78%
logistic regression with enriched input have reached a higher recall
on dictionary) with non-enriched
input
(increasing with 1–6%) but a lower specificity percentage (decreasing
Layer-wise autoencoder on 74.41% 55.76% 63.75% 94% with 5–9%) than non-enriched input. However, even with the enriched
concatenated input with non- input, the outcomes of the SVM and logistic regression are still out-
enriched input performed by the CNN with concept embedding without depending on
external knowledge.
The enriched versions of logistic regression has achieved a higher
performance of the latter will be compared to the first two methods that
precision than its own recall whereas the SVM with enriched input
are considered as baseline. We chose different inputs and configurations
reaches a higher recall (31.89%) than precision. Generally, the en-
of these methods with the aim of optimizing the performance. As far as
richment setting (SVM, logistic regression) showed an improvement on
the input is concerned, the CNN method with single convolutional and
F1 measures ranging from 1% to 4% due to an increase in recall. In
pooling layer has been fed with concept embedding learned from the
order to give a reference for the autoencoder with enriched frequency
non-enriched concept vector. The SVM, logistic regression and auto-
vector, we have extended our baseline group with two autoencoder
encoder based methods have been tested on both, the non-enriched
configurations with non-enriched frequency vector. The deep auto-
concept vectors and the enriched frequency vectors. We tested the au-
encoder based on dictionary reached a micro F1 score of 55.26%, which
toencoder with denoising layer. Three different types of autoencoders
shows 5% of decrease in F1 score in contrast to CNN, while the layer-
(non-stop deep autoencoder, layer-wise autoencoder with merged
wise autoencoder based on concatenated input has shown a slightly
input, alternative autoencoder on dictionary) with logistic regression
better F1 measure (3%) than the CNN based on the non-enriched fre-
were tested in parallel. Precision, recall, F1 measure, specificity (all in
quency vector. The layer-wise autoencoder based on concatenated
Micro) have been compared. In the evaluation, all methods have gone
input has outperformed all the other baseline methods since it has
through a 5-fold validation. For the concept embedding, the embedding
achieved the highest score for recall (55.76%). Through the layer-wise
is conducted based on the cross-validation fold (4/5 of the entire data
pre-training with fine-tuning, this configuration have gained a 10%
corpus). The representations learned by autoencoder are derived from
improvement in recall in contrast to the non-stop deep autoencoder
enriched or non-enriched frequency dictionary, while the layer-wise
based on dictionary with the same non-enriched input.
autoencoder with concatenated input has also been trained on cross-
validation fold (second configuration (line) in Table 2).
5.2. Performance of enriched autoencoder-based method
5. Results
The performances of the enriched autoencoder-based method using
In this section, we will first describe the results of the proposed pre-training have been presented in Table 2. Different types of auto-
baseline methods. Second, the performance of the autoencoder-based encoders resulting from varying pre-processing steps are evaluated on
approaches will be presented. Finally, the efficiency aspects of training the enriched frequency vectors illustrated in (C) in Fig. 2. All models of
and deployment of the proposed methods are discussed. autoencoders were evaluated based on the same enriched input.
As illustrated in Table 2, the layer-wise autoencoder based on
5.1. Performance of baseline classification concatenated input has achieved the best F1 measure. The deep auto-
encoder trained on the enriched frequency dictionary has also reached
Table 1 shows the precision, recall and F1 measure of our baseline high precision (84%) and moderate recall of 55%. The enrichment led
model (all in micro average) on non-enriched and enriched input. The to an improvement of the F1 measure between 4% to 10% in compar-
CNN with single convolution and pooling layer has been trained on ison with the CNN method. The effectiveness of denoising layers has
embeddings obtained from the concept vector (A) demonstrated in been confirmed: the same configuration without denoising mechanism
Fig. 2. We can see that the CNN with the embedding significantly
outperforms the other two conventional machine learning methods Table 3
with an F1 measure of 60.86%. Based on a frequency-based concept Learning efficiency of the proposed methods based on neural networks.
vector with enrichment, the SVM classifier with linear kernel reached Configuration Efficiency of training (iterations)
an F1 measure of 30.01%, while the logistic regression with L2 reg-
ularization has yielded an F1 measure of 19.65%. With the same con- Deep autoencoder (non-stop training) 14,000
Layer-wise trained autoencoder Stopped by each training layer
figuration, SVM and logistic regression with the non-enriched concept
Autoencoder trained on dictionary 8,900
vector have achieved F1 measures of 27.6% and 20.7%, respectively. CNN 3, 4, 5 Ada L2 0.01 6,000
The specificity (true negative rate) indicates that the irrelevant pairs
11
showed a decrease in precision (4% less) and recall (3% less) in contrast latter method can only be trained once based on a training set.
to the same configuration with denoising. In all configurations, the ir- According to the results presented in Table 2, the local feature
relevant pairs were more accurately predicted than the relevant pairs within one vector of query or one category text can already yield a high
according to the score of specificity and the classifier tends to over- F1 measure while the inter-query-category text features have appar-
classify (overkill) the relevant as irrelevant. ently higher training costs but a slightly better performance on the test
set. Hence, the feature selection should be determined by balancing
5.3. Efficiency of proposed methods training costs and performance requirements. As a potential extension,
we can make use of the query and category specific features, such as the
Given the same input, the count of iterations on different model exclusive concept in category, as negation rules to complement the
configurations before the model convergence (changes _ o f Accuracy@ current trained classifier. In the current training set, we have only
dev < 5%) can be used to indicate the speed of model converging. The eliminated the exclusive terms. In the next step, we can transform the
count can therefore be seen as an indicator of efficiency by training. As vector pairs with exclusive terms into negative training examples.
it can be seen in Table 3, the CNN-based method with single convolu-
tion and pooling layer achieves the best efficiency due to its efficient 6.2. Summary of principal findings
embedding mechanism and the pooling with short sliding windows (3,
4, 5). Since most of the time costs in autoencoders are consumed by the Through experiments, we compared different configurations of two
representation learning phase, we only compare the iterations and stops models based on neural networks (CNN and autoencoder-based classi-
in the representation phase. Regarding the efficiency of deployment, fication) to solve the matching issues. The CNN-based methods, SVM
Tensorflow provides a comprehensive library for product ready model and logistic regression are used as baseline to compare with the auto-
serving. With the required resource stored as docker container, the encoder based methods. The principal findings are:
chosen model (ckpt file) is transformed into a protocol buffer (pb) file.
The frozen model can be deployed as classifier for relevance prediction. 1. A suitable vector representation for a high dimensional sparse data
The inference and reconfiguration with the model is highly efficient. set has been identified for the clinical encoding task: concept em-
For the query normalization and enrichment, the ID MACS® provides bedding is well suited for the input with original sequence while a
response time at millisecond level, while the representation generation frequency vector is a good representation for semantic enriched
through frozen model using pb file can also expect a low response time. input.
2. The suitability of stepwise pre-training and non-stop representation
6. Discussion and conclusion learning based on autoencoder have been assessed: for text re-
presentation learning on a small data amount, stepwise pre-training
The current results have shown a clear potential of different models can be necessary and useful for a better generalization.
based on neural networks in clinical procedure classification. In con- 3. The usefulness of the knowledge enrichment has been confirmed. It
trast to previous research, we have conducted a model design and has been shown that the enrichment process can facilitate the se-
evaluation in a relatively narrow application domain (clinical proce- mantic matching and increase the model generalization in the test
dure encoding). We achieved better performance using autoencoder- set. More specifically, the autoencoder with pre-training can facil-
based representation learning compared to the baseline methods im- itate the knowledge integration based on the enriched frequency
plemented with CNN and conventional machine learning methods. The vector. The denoising layer can also make moderate contributions to
methods are evaluated with a subset of CHOP categories and 24,092 the performance in terms of model generalization.
corresponding query-category pairs. With the insight yielded through 4. CNN has been proven to be a suitable method for concept matching
the experiment, more work will be done to exploit the full CHOP ca- in the biomedical context. However, the embedding process and the
tegory with more training data. sliding windows at convolutional layer rely on the co-occurrence
information within the input sequence. For those tasks with se-
6.1. Discussion of results quence-less input, CNNs should be avoided or applied with adap-
tation. In this work, for a non-enriched input with original sequence,
Within current settings of autoencoder, the non-stop pre-training is CNN can achieve a moderate prediction performance with a high
better suited for processing a large balanced data set, while layer-wise training efficiency.
pre-training should be applied when the data set is relatively small and 5. The unsupervised pre-training based on autoencoder may slightly
unequally sampled. In our experiment, the layer-wise pre-training decrease the converging speed but makes the model more generic. It
clearly yielded improvement compared to the application of auto- achieves a better result in the cross-validation, while the knowledge
encoders without layer-wise initialization. We believe that the layer- can be integrated in the pre-training more smoothly.
wise pre-training has brought the model into the rational subspace
which leads to better optimization effectiveness and prevents the model 6.3. Limitations and future work
from getting skewed by the imbalanced input. The enrichment has
shown a positive influence on the matching performance. In our ex- One limitation is that the quality of layer-wise pre-training can only
periment, the enriched input processed by the layer-wise pre-trained be judged by the result of matching prediction. The status concerning
autoencoder and supervised classifier achieved the best outcomes on the weight and biases as well as the status of the model at each layer is
the task relevance determination. still unclear. An investigation method should be developed to present
In comparison with the performance of baseline methods (SVM and the changes of models, so that the correlation between the hyper-
logistic regression) with enriched input presented in Table 1, the non- parameter and model status can be determined and the performance
linear transformation conducted by a layer-wise autoencoder has shown can be fine-tuned.
a clearly better result. We believe that this performance augmentation Secondly, the external knowledge base has been used in a relatively
originates from the in-depth latent representation learned through simple way, the enrichment was conducted based on hypernyms and
multi-layer autoencoder. In addition, the separate layers have also been hyponyms in the hierarchical terminology. The graphical features and
fine-tuned globally with the supervised classification and get adapted to the conditional probabilities between the concepts have not been
the task on the fly. These flexibilities made the autoencoder-based re- exploited. As a next step, the additional knowledge reasoning can be
presentation learning clearly more attractive than the static linear in- conducted by Bayes’ rules to guide the matching process.
tegration of a knowledge weight based on word embedding, since the Last but not least, the current data set (24,092 pairs) is relatively
12
small for deep fine-tuning. A more comprehensive evaluation based on [13] Nentidis A, Bougiatiotis K, Krithara A, Paliouras G, Kakadiaris IA. Results of the
more log data from the production server (over 1 million pairs) has fifth edition of the BioASQ challenge. BioNLP, 2017, Vancouver, Canada, August 4,
2017 2017:48–57. https://doi.org/10.18653/v1/W17-2306.
been planned. [14] Du Y, Pan Y, Ji J. A novel serial deep multi-task learning model for large scale
During the training, we have not considered the exclusive concept biomedical semantic indexing. 2017 IEEE international conference on bioinfor-
separately (certain concepts that must not appear in the category text matics and biomedicine (BIBM) 2017:533–7. https://doi.org/10.1109/BIBM.2017.
8217704.
has been provided in each CHOP category) due to the capacity of the [15] Rios A, Kavuluru R. Convolutional neural networks for biomedical text classifica-
representation vector. The possible improvements are either the de- tion: application in indexing biomedical articles. Proceedings of the 6th ACM
velopment of a separate classifier to generate an exclusive decision or conference on bioinformatics, computational biology and health informatics,
BCB’15. New York, NY, USA: ACM; 2015. p. 258–67. https://doi.org/10.1145/
transforming vector pairs with negative concept into negative training 2808719:2808746.
examples. [16] Kim Y. Convolutional neural networks for sentence classification. Proceedings of
As a next step, we would like to employ a larger data set to perform the 2014 conference on empirical methods in natural language processing, EMNLP
2014, October 25–29, 2014, Doha, Qatar. A meeting of SIGDAT, a special interest
a more comprehensive layer-wise pre-training. The status of the model
group of the ACL 2014:1746–51. URL http://aclweb.org/anthology/D/D14/D14-
at each layer will be investigated to make the training more inter- 1181.pdf.
pretable. Beyond, additional features from category texts such as ex- [17] Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to docu-
clusive concept and the category hierarchy will be used to complement ment recognition. Proc IEEE 1998;86(11):2278–324. https://doi.org/10.1109/
5:726791.
the current classifiers, so that the training in deep networks can be [18] Larochelle H, Bengio Y, Louradour J, Lamblin P. Exploring strategies for training
accelerated and the performance of the encoding can be improved. At deep neural networks. J Mach Learn Res 2009;10:1–40. URL http://dl.acm.org/
last, we will analyze the hardware requirement of the proposed neural citation.cfm?id=1577069.1577070.
[19] Miotto R, Li L, Kidd BA, Dudley JT. Deep patient: an unsupervised representation to
network based method. A hardware for deployment in the real pro- predict the future of patients from the electronic health records. Scientific reports
duction will be selected. 2016.
[20] Yu M, Dredze M. Improving lexical embeddings with semantic knowledge.
Proceedings of the 52nd annual meeting of the association for computational lin-
Acknowledgement guistics, ACL 2014, June 22–27, 2014, Baltimore, MD, USA, Volume 2: Short papers
2014:545–50. URL http://aclweb.org/anthology/P/P14/P14-2089.pdf.
This work is partially supported by ID Information und [21] Celikyilmaz A, Hakkani-Tür D, Pasupat P, Sarikaya R. Enriching word embeddings
using knowledge graph for semantic tagging in conversational dialog systems. AAAI
Dokumentation im Gesundheitswesen GmbH & Co. KGaA, Berlin, – Association for the Advancement of Artificial Intelligence; 2015. URL https://
Germany. Our gratitude also goes to Marie-Anne Pinheiro for her va- www.microsoft.com/en-us/research/publication/enriching-word-embeddings-
luable comments. We thank the anonymous reviewers for their careful using-knowledge-graph-for-semantic-tagging-in-conversational-dialog-systems/.
[22] Hakkani-Tür D, Celikyilmaz A, Heck L, Tur G, Zweig G. Probabilistic enrichment of
reading of our manuscript and their many insightful comments and
knowledge graph entities for relation detection in conversational understanding.
suggestions. Proceedings of Interspeech, ISCA – International Speech Communication
Association 2014. URL https://www.microsoft.com/en-us/research/publication/
References probabilistic-enrichment-of-knowledge-graph-entities-for-relation-detection-in-
conversational-understanding/.
[23] Faruqui M, Dodge J, Jauhar SK, Dyer C, Hovy E, Smith NA. Retrofitting word
[1] Li S-T, Chen C-C, Huang F. Conceptual-driven classification for coding advise in vectors to semantic lexicons. Proceedings of the 2015 conference of the North
health insurance reimbursement. Artif Intell Med 2011;51(1):27–41. https://doi. American Chapter of the Association for computational linguistics: human language
org/10.1016/j.artmed.2010.10.003. technologies. Association for Computational Linguistics; 2015. p. 1606–15. URL
[2] Faulstich L, Mueller F, Sander A. Automated clinical documentation from electronic http://www.aclweb.org/anthology/N15-1184.
health records. Medizinische Informatik, Biometrie und Epidemiologie, GMDS [24] Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng A, et al. Recursive deep
2010, 55. Jahrestagung der GMDS, Mannheim, September 2010 2010:547–50. models for semantic compositionality over a sentiment treebank. Proceedings of the
[3] Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of 2013 conference on empirical methods in natural language processing. Association
words and phrases and their compositionality, CoRR abs/1310.4546. URL http:// for Computational Linguistics; 2013. p. 1631–42. URL http://www.aclweb.org/
arxiv.org/abs/1310.4546. anthology/D13-1170.
[4] Hu B, Lu Z, Li H, Chen Q. Convolutional neural network architectures for matching [25] Landauer TK, Dumais ST. A solution to Plato's problem: The latent semantic analysis
natural language sentences. Advances in neural information processing systems 27: theory of acquisition, induction, and representation of knowledge. Psychol Rev
Annual conference on neural information processing systems 2014, December 8–13, 1997;104:211–40.
2014, Montreal, Quebec, Canada 2014:2042–50. URL http://papers.nips.cc/paper/ [26] Yu Z, Wang H, Lin X, Wang M. Understanding short texts through semantic en-
5550-convolutional-neural-network-architectures-for-matching-natural-language- richment and hashing. IEEE Trans Knowl Data Eng 2016;28(2):566–79. https://doi.
sentences.pdf. org/10.1109/TKDE.2015.2485224.
[5] Shen Y, He X, Gao J, Deng L, Mesnil G. A latent semantic model with convolutional- [27] Wu W, Li H, Wang H, Zhu KQ. Probase: a probabilistic taxonomy for text under-
pooling structure for information retrieval. Proceedings of the 23rd ACM interna- standing. Proceedings of the 2012 ACM SIGMOD international conference on
tional conference on conference on information and knowledge management, management of data, SIGMOD’12. New York, NY, USA: ACM; 2012. p. 481–92.
CIKM’14. New York, NY, USA: ACM; 2014. p. 101–10. https://doi.org/10.1145/ https://doi.org/10.1145/2213836.2213891.
2661829.2661935. [28] Wingert F. Automated indexing based on snomed. Methods Inf Med 1984;24:27–34.
[6] Salakhutdinov R, Mnih A, Hinton G. Restricted Boltzmann machines for colla- https://doi.org/10.1055/s-0038-1635350.
borative filtering. Proceedings of the 24th international conference on machine [29] Koopman B, Zuccon G, Bruza P. What makes an effective clinical query and querier?
learning, ICML’07. New York, NY, USA: ACM; 2007. p. 791–8. https://doi.org/10. J Assoc Inf Sci Technol 2017;68(11):2557–71. https://doi.org/10.1002/asi.23959.
1145/1273496.1273596. [30] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolu-
[7] Hinton GE, Osindero S, Teh Y-W. A fast learning algorithm for deep belief nets. tional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ, editors.
Neural Comput 2006;18(7):1527–54. https://doi.org/10.1162/neco.2006.18.7. Advances in neural information processing systems 25. Curran Associates, Inc.;
1527. 2012. p. 1097–105. URL http://papers.nips.cc/paper/4824-imagenet-classification-
[8] Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural with-deep-convolutional-neural-networks.pdf.
networks. Science 2006;313(5786):504–7. https://doi.org/10.1126/science. [31] Kalchbrenner N, Grefenstette E, Blunsom P. A convolutional neural network for
1127647. modelling sentences. CoRR abs/1404.2188. 2014. URL http://arxiv.org/abs/1404.
[9] Miotto R, Wang F, Wang S, Jiang X, Dudley JT. Deep learning for healthcare: re- [32] Shen Y, He X, Gao J, Deng L, Mesnil G. Learning semantic representations using
view, opportunities and challenges. Brief Bioinf 2017. https://doi.org/10.1093/ convolutional neural networks for web search. Proceedings of the 23rd interna-
bib/bbx044. tional conference on world wide web, WWW’14 companion. New York, NY, USA:
[10] Peng S, You R, Wang H, Zhai C, Mamitsuka H, Zhu S. Deepmesh: deep semantic ACM; 2014. p. 373–4. https://doi.org/10.1145/2567948.2577348.
representation for improving large-scale mesh indexing. Bioinformatics [33] Deng Y, Faulstich L, Denecke K. Concept embedding for relevance detection of
2016;32(12):70–9. https://doi.org/10.1093/bioinformatics/btw294. search queries regarding CHOP. MEDINFO 2017: precision healthcare through in-
[11] Le Q, Mikolov T. Distributed representations of sentences and documents. In: Xing formatics – Proceedings of the 16th world congress on medical and health infor-
EP, Jebara T, editors. Proceedings of the 31st international conference on machine matics, Hangzhou, China, 21–25 August, 2017 2017:2333. https://doi.org/10.
learning, Vol. 32 of proceedings of machine learning research. Beijing, China: 3233/978-1-61499-830-3-1260.
PMLR; 2014. p. 1188–96http://proceedings.mlr.press/v32/le14.html. [34] Huang P-S, He X, Gao J, Deng L, Acero A, Heck L. Learning deep structured se-
[12] Liu K, Peng S, Wu J, Zhai C, Mamitsuka H, Zhu S. Meshlabeler: improving the mantic models for web search using clickthrough data. Proceedings of the 22nd
accuracy of large-scale mesh indexing by integrating diverse evidence. ACM international conference on information & knowledge management, CIKM’13.
Bioinformatics 2015;31(12):i339–47. https://doi.org/10.1093/bioinformatics/ New York, NY, USA: ACM; 2013. p. 2333–8. https://doi.org/10.1145/2505515.
btv237. 2505665.
13
[35] Lee H, Battle A, Raina R, Ng AY. Efficient sparse coding algorithms. In: Schölkopf B, www.deeplearningbook.org.
Platt JC, Hoffman T, editors. Advances in neural information processing systems 19. [38] Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol P. Stacked denoising auto-
MIT Press; 2007. p. 801–8. URL http://papers.nips.cc/paper/2979-efficient-sparse- encoders: learning useful representations in a deep network with a local denoising
coding-algorithms.pdf. criterion. J Mach Learn Res 2010;11:3371–408. URL http://portal.acm.org/
[36] Vincent P, Larochelle H, Bengio Y, Manzagol P-A. Extracting and composing robust citation.cfm?id=1953039.
features with denoising autoencoders. Proceedings of the 25th international con- [39] Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. TensorFlow: large-
ference on machine learning, ICML’08. New York, NY, USA: ACM; 2008. p. scale machine learning on heterogeneous systems. 2015 Software available from:
1096–103. https://doi.org/10.1145/1390156.1390294. URL https://www.tensorflow.org/.
[37] Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press; 2016. URL http://
14

Artificial Intelligence in Medicine: Yihan Deng, André Sander, Lukas Faulstich, Kerstin Denecke

Diunggah oleh

Informasi Dokumen

Judul Asli

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

Artificial Intelligence in Medicine: Yihan Deng, André Sander, Lukas Faulstich, Kerstin Denecke

Diunggah oleh

Hak Cipta:

Format Tersedia

Artificial Intelligence In Medicine xxx (xxxx) xxx–xxx

Contents lists available at ScienceDirect

Artificial Intelligence In Medicine

Towards automatic encoding of medical procedures using convolutional

ARTICLE INFO ABSTRACT

E-mail address: yihan.deng@bfh.ch (Y. Deng).

of enrichment) to 18% in accuracy (in contrast to non-enrichment). 3. Data set

MACS® contains a manually curated medical ontology, which is a concepts.

concept mapper, more specifically, it is based on the aforementioned

Fig. 7. Pre-training architecture implementing 4 stacked denoising autoencoders.

Anda mungkin juga menyukai