Article history: Received 23 June 2017; Available online 6 November 2017

Keywords: Integrated intention identification model; Speech act identification; Predicator identification; Sentiment identification; Partial error backpropagation

Abstract

A dialogue system should capture speakers’ intentions, which can be represented by combinations of speech acts, predicators, and sentiments. To identify these intentions from speakers’ utterances, many studies have independently dealt with speech acts, predicators, and sentiments. However, these three elements composing speakers’ intentions are tightly associated with each other. To resolve this problem, we propose a convolutional neural network model that simultaneously identifies speech acts, predicators, and sentiments. The proposed model has well-designed hidden layers for embedding informative abstractions appropriate for speech act identification, predicator identification, and sentiment identification. Nodes in the hidden layers are partially trained by three cycles of error backpropagation: training the nodes associated with speech act identification, predicator identification, and sentiment identification. In the experiments, the proposed model showed higher F1-scores than independent models: 6.8% higher in speech act identification, 6.2% higher in predicator identification, and 4.9% higher in sentiment identification. Based on the experimental results, we conclude that the proposed integration architecture and partial error backpropagation can help to increase the performance of intention identification.

© 2017 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.patrec.2017.11.009
are applied to the CNN. First, the errors between the output values associated with speech act categories and the correct speech act vectors represented by one-hot codes are propagated through the partially connected nodes. Second, partial error backpropagations for predicator identification are performed in the same manner as the partial error backpropagations for speech act identification. Finally, partial error backpropagations for sentiment identification are similarly performed. We expect that informative features (or abstraction values) are accumulated in the embedding vectors, owing to the partial error backpropagations. Then, the output values Si associated with speech act categories are fed to the input nodes for predicator identification and sentiment identification. The previous speech act vector Si−1, represented by a one-hot code, is concatenated to the utterance embedding vector SEi for speech act identification. During the partial error backpropagations, the cross-entropies are used as loss functions in order to maximize the similarities between the correct categories (i.e., Ŝi, P̂i, and Êi) and the output categories (i.e., Si, Pi, and Ei), as shown in Eq. (4).

H_Ŝ(S) = −Σ_i Ŝi log(Si)
H_P̂(P) = −Σ_i P̂i log(Pi)        (4)
H_Ê(E) = −Σ_i Êi log(Ei)
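The following is a minimal sketch of this partial error backpropagation in TensorFlow 1.x graph-mode code (the toolkit cited in Section 4.2 [1]). It is not the authors' implementation: the placeholder names, variable scopes, and single dense layer per part are simplifying assumptions, and the convolutional and partially shared hidden layers of the real model are omitted; only the tag-set sizes, the cross-entropy losses of Eq. (4), and the stated SGD learning rate come from the paper.

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode style, as cited in [1]

# Tag-set sizes from Section 4.1; the embedding size follows Section 4.2.
NUM_SPEECH_ACTS, NUM_PREDICATORS, NUM_SENTIMENTS, EMB_DIM = 15, 67, 10, 50

# Utterance embedding assumed to come from the CNN part of the model.
utterance_emb = tf.placeholder(tf.float32, [None, EMB_DIM], name='utterance_emb')
# Correct categories (S^, P^, E^ in Eq. (4)) as one-hot vectors.
s_hat = tf.placeholder(tf.float32, [None, NUM_SPEECH_ACTS])
p_hat = tf.placeholder(tf.float32, [None, NUM_PREDICATORS])
e_hat = tf.placeholder(tf.float32, [None, NUM_SENTIMENTS])
# Previous speech act vector S_{i-1} as a one-hot code.
prev_s = tf.placeholder(tf.float32, [None, NUM_SPEECH_ACTS])

with tf.variable_scope('speech_act'):
    s_logits = tf.layers.dense(tf.concat([utterance_emb, prev_s], axis=1), NUM_SPEECH_ACTS)
s_out = tf.nn.softmax(s_logits)  # output values S_i, fed to the other two parts

with tf.variable_scope('predicator'):
    p_logits = tf.layers.dense(tf.concat([utterance_emb, s_out], axis=1), NUM_PREDICATORS)
with tf.variable_scope('sentiment'):
    e_logits = tf.layers.dense(tf.concat([utterance_emb, s_out], axis=1), NUM_SENTIMENTS)

# The three cross-entropy losses of Eq. (4).
loss_s = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=s_hat, logits=s_logits))
loss_p = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=p_hat, logits=p_logits))
loss_e = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=e_hat, logits=e_logits))

# Partial error backpropagation: each loss updates only the variables of its own
# identification part; in the full model the partially shared hidden nodes would be
# added to the corresponding var_list.
sgd = tf.train.GradientDescentOptimizer(0.001)

def part_vars(scope):
    return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope)

train_s = sgd.minimize(loss_s, var_list=part_vars('speech_act'))
train_p = sgd.minimize(loss_p, var_list=part_vars('predicator'))
train_e = sgd.minimize(loss_e, var_list=part_vars('sentiment'))
# One training step then runs the three cycles in turn:
#   sess.run(train_s, fd); sess.run(train_p, fd); sess.run(train_e, fd)
```

The var_list arguments are what make the three backpropagation cycles partial: each loss adjusts only the nodes associated with its own identification task.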
4. Evaluation

4.1. Data sets and experimental settings

For our experiments, we collected a Korean dialogue corpus about views on love (3092 utterances; 1874 unique words) from mobile chat rooms in which two users discuss each other’s views on a specific topic by using the short message service of a commercial telecommunication company. Each utterance in the collected dialogues was manually annotated with speech acts, predicators, and sentiments by three undergraduate students who were familiar with dialogue analysis. The classification of speech acts, predicators, and sentiments in dialogues is very subjective, and universally accepted criteria do not exist. In this paper, we defined 15 speech act tags, 67 predicator tags, and 10 sentiment tags. Table 2 shows the speech acts, predicators, and sentiments that frequently occur in the dialogue corpus.

Table 2
Top-n tags in the dialogue corpus.

Speech act (%)     | Predicator (%)  | Sentiment (%)
Statement (51.3)   | None (17.9)     | None (43.5)
Response-if (18.3) | Judge (9.3)     | Fear (10.5)
Ask-if (10.0)      | Other (6.6)     | Sadness (8.9)
Ask-ref (7.8)      | Be (6.3)        | Anger (8.8)
Response-ref (5.3) | Express (6.0)   | Coolness (8.0)
Hope (2.7)         | Know (5.4)      | Love (6.5)
Request (1.2)      | Like (5.2)      | Joy (4.6)
Opinion (1.0)      | Non-exist (4.2) | Wish (3.9)
Ask-confirm (0.9)  | Exist (4.1)     | Other (3.0)
Thanks (0.7)       | Perform (3.8)   | Surprise (2.2)

Prior to manual tagging, we explained the meanings of the speech acts, predicators, and sentiments to the students and showed them some samples that were annotated with domain actions. Then, we assigned one student to code each fifth of the data. Finally, a graduate student post-processed all the tagged data for consistency.

To evaluate IIIM, we divided the annotated dialogue corpus into a training corpus and a test corpus in a ratio of 9:1. Then, we trained IIIM by using the training corpus. Next, we performed 10-fold cross-validation.

Four evaluation measures, the accuracy, macro precision, macro recall rate, and macro F1-score, were used to evaluate the performance of the proposed model. The accuracy is the proportion of correct values among those returned. The macro precision is the average proportion of correct values returned per category. The macro recall rate is the average proportion of correctly returned values per category. The macro F1-score combines the macro precision and macro recall rate with equal weighting in the following form: F1 = (2.0 × macro precision × macro recall rate) / (macro precision + macro recall rate). To statistically validate the confidence levels of the evaluation measures, we performed t-tests between the comparison models (i.e., the independent identification models: a speech act identification model, a predicator identification model, and a sentiment identification model) and IIIM, using the scores of each evaluation measure as the input values of the t-test. The p-values of speech act identification, predicator identification, and sentiment identification were 0.064, 0.064, and 0.064, respectively. This implies that the performance scores are statistically meaningful at a significance level of 90%.
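For concreteness, the following is a small sketch of how these measures and the t-test can be computed with scikit-learn and SciPy. The gold and predicted tags, the per-fold scores, and the use of per-fold macro F1-scores as the t-test inputs are illustrative assumptions, not data from the paper; the macro F1-score is computed with the harmonic-mean formula given above.

```python
from scipy import stats
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Accuracy, macro precision, macro recall rate, and the macro F1-score
    defined as the harmonic mean of macro precision and macro recall."""
    macro_p, macro_r, _, _ = precision_recall_fscore_support(y_true, y_pred, average='macro')
    macro_f1 = (2.0 * macro_p * macro_r) / (macro_p + macro_r)
    return accuracy_score(y_true, y_pred), macro_p, macro_r, macro_f1

# Illustrative gold and predicted speech act tags for a handful of test utterances.
y_true = ['Statement', 'Ask-if', 'Statement', 'Response-if', 'Hope']
y_pred = ['Statement', 'Ask-if', 'Response-if', 'Response-if', 'Statement']
print(evaluate(y_true, y_pred))

# Paired t-test between two models, with per-fold scores as inputs (hypothetical values).
iiim_scores = [0.61, 0.63, 0.60, 0.64, 0.62, 0.63, 0.61, 0.65, 0.62, 0.64]
only_scores = [0.55, 0.58, 0.54, 0.59, 0.56, 0.57, 0.55, 0.60, 0.56, 0.58]
print(stats.ttest_rel(iiim_scores, only_scores))
```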
4.2. Implementation

We implemented IIIM using TensorFlow 1.0 [1]. Training and prediction were performed at the sentence level. We set the size of each Word2Vec embedding vector in Fig. 2 to 50. The training spanned 300 epochs and was performed by mini-batch stochastic gradient descent with a fixed learning rate of 0.001. Each mini-batch consisted of 64 sentences.
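Read as a training regime, the setting above amounts to the loop skeleton below. The toy example list is an assumption standing in for the annotated training sentences, and the actual parameter updates (the three partial backpropagation steps sketched after Eq. (4)) would run where the placeholder comment is.

```python
import random

EPOCHS, BATCH_SIZE = 300, 64  # values stated in Section 4.2

def minibatches(examples, batch_size):
    """Shuffle the training sentences each epoch and yield fixed-size mini-batches."""
    shuffled = list(examples)
    random.shuffle(shuffled)
    for start in range(0, len(shuffled) - batch_size + 1, batch_size):
        yield shuffled[start:start + batch_size]

# Toy stand-in for the annotated training sentences (roughly 9/10 of the 3092 utterances).
train_examples = [('utterance %d' % i, 'Statement', 'Judge', 'None') for i in range(2783)]

for epoch in range(EPOCHS):
    for batch in minibatches(train_examples, BATCH_SIZE):
        # One mini-batch SGD update of the three partial losses would run here, e.g.
        # sess.run(train_s, fd); sess.run(train_p, fd); sess.run(train_e, fd).
        pass
```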
4.3. Experimental results

The first experiment was to compare the performance of IIIM with independent models and a conventionally integrated model by using the same Word2Vec embedding vectors. Table 3 shows the performance differences between IIIM and the independent models.

Table 3
Performance comparison with independent models.
Model | Accuracy | Avg. Macro Precision | Avg. Macro Recall | Avg. Macro F1-score

In Table 3, IIIM-S, IIIM-P, and IIIM-E indicate the speech act identification part, predicator identification part, and sentiment identification part of IIIM, respectively. CIM-S, CIM-P, and CIM-E indicate the speech act identification part, predicator identification part, and sentiment identification part of a conventionally integrated model, respectively. The conventionally integrated model had an ordinary CNN architecture in which there were no shared nodes. S-Only, P-Only, and E-Only indicate a speech act identification model, a predicator identification model, and a sentiment identification model, respectively. These independent models had the same CNN architectures as IIIM, except that they did not have the shared nodes in Fig. 2. As shown in Table 3, IIIM outperformed the conventionally integrated model and the independent models for all measures. In addition, the conventionally integrated model showed poor performance. This fact reveals that the proposed integration architecture can help to increase the performance of intention identification. The performance differences between IIIM and the independent models were as follows, in terms of the average macro F1-scores: IIIM-S & S-Only (6.8%) > IIIM-P & P-Only (6.2%) > IIIM-E & E-Only (4.9%). This shows that speech acts are less associated with predicators and sentiments. In addition, it shows that predicators and sentiments are affected by each other.

The second experiment was to compare IIIM with previous models by using the same training and test corpus. Table 4 shows the performance differences between IIIM and the previous models. In Table 4, Seon et al. [13] is an SVM-based classification model in which the performances of speech act identification and predicator identification are increased by using a mutual retraining method [13]. Lee et al. [10] is an integrated model based on neural networks in which the results of predicator identification are used as inputs of speech act identification [10]. Shin et al. [15] is a sentiment analysis model based on CNN and LSTM (Long Short-Term Memory) networks [15]. IIIM-S + LSTM, IIIM-P + LSTM, and IIIM-E + LSTM are, respectively, an IIIM-S model, an IIIM-P model, and an IIIM-E model modified to the same architecture as [15] to reflect more contextual information, where the output values of the IIIM models are used as input values of the LSTM networks.

Table 4
Performance comparison with previous models.
Model | Accuracy | Avg. Macro Precision | Avg. Macro Recall | Avg. Macro F1-score

As shown in Table 4, IIIM-S and IIIM-P showed performances similar to [13] and [10]. However, we think that IIIM is more efficient than Seon et al. [13] and Lee et al. [10], because they require more engineering effort to extract and select effective features, such as main verb phrases and grammatical roles of words, from utterances. IIIM-S + LSTM and IIIM-P + LSTM outperformed [13] and [10]. This fact reveals that IIIM can become more effective if it is combined with a sequence labeling model that reflects more contextual information. IIIM-E + LSTM showed better performance than [15]. We think that the performance differences were caused by the common hidden nodes in Fig. 2, similar to the performance differences between IIIM-E and E-Only.
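As a rough illustration of the IIIM + LSTM combination discussed above, the sketch below feeds one dialogue's sequence of IIIM speech act output vectors into an LSTM that re-scores each utterance with dialogue context. The dialogue length, hidden size, and single dense output layer are assumptions; this is not the architecture of [15] or of the reported IIIM + LSTM systems.

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode style, as in Section 4.2

NUM_SPEECH_ACTS = 15   # tag-set size from Section 4.1
MAX_UTTERANCES = 30    # assumed maximum dialogue length
LSTM_UNITS = 64        # assumed hidden size

# Sequence of IIIM speech act output vectors S_i over one dialogue:
# shape [batch, utterances, categories].
iiim_outputs = tf.placeholder(tf.float32, [None, MAX_UTTERANCES, NUM_SPEECH_ACTS])
dialogue_lengths = tf.placeholder(tf.int32, [None])  # true number of utterances per dialogue

# LSTM over the dialogue so that each decision can see the preceding utterances.
cell = tf.nn.rnn_cell.LSTMCell(LSTM_UNITS)
states, _ = tf.nn.dynamic_rnn(cell, iiim_outputs, sequence_length=dialogue_lengths,
                              dtype=tf.float32)

# Re-score every utterance with the contextualized state (the IIIM-S + LSTM idea).
logits = tf.layers.dense(states, NUM_SPEECH_ACTS)
speech_act_probs = tf.nn.softmax(logits)
```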
5. Conclusion

We proposed an integrated neural network model that simultaneously identifies speech acts, predicators, and sentiments of dialogue utterances. The proposed model has three kinds of well-designed hidden nodes that embed effective abstractions from input utterances for speech act identification, predicator identification, and sentiment identification. The hidden nodes are trained by partial error backpropagation algorithms. In the experiments, the proposed model showed better performance than the independent models (i.e., the speech act identification model, predicator identification model, and sentiment identification model). Based on the experimental results, we conclude that the proposed integration architecture and partial error backpropagation can help to increase the performance of intention identification.

Acknowledgment

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2016R1A2B4007732).

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: large-scale machine learning on heterogeneous distributed systems, 2015. Software available from tensorflow.org.
[2] S. Aman, S. Szpakowicz, Identifying expressions of emotion in text, in: Proceedings of the 10th International Conference on Text, Speech and Dialogue, 2007.
[3] Y. Goldberg, O. Levy, word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method, 2014. arXiv preprint arXiv:1402.3722.
[4] S. Kang, H. Kim, J. Seo, A reliable multidomain model for speech act classification, Pattern Recognit. Lett. 31 (1) (2010) 71–74.
[5] S. Kim, S. Park, S. Park, S. Lee, K. Kim, A syllable kernel based sentiment classification for movie reviews, J. KIISS 20 (2) (2010) 202–207 (in Korean).
[6] H. Kim, C.-N. Seon, J. Seo, Review of Korean speech act classification: machine learning methods, J. Comput. Sci. Eng. 5 (4) (2011) 288–293.
[7] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Proceedings of NIPS, 2012.
[8] C. Langley, Analysis for speech translation using grammar-based parsing and automatic classification, in: Proceedings of the ACL Student Research Workshop, 2002.
[9] H. Lee, H. Kim, J. Seo, Domain action classification using a maximum entropy model in a schedule management domain, AI Commun. 21 (4) (2008) 221–229.
[10] H. Lee, H. Kim, J. Seo, An integrated neural network model for domain action determination in goal-oriented dialogues, J. Inf. Process. Syst. 9 (2) (2013) 259–270.
[11] L. Levin, C. Langley, A. Lavie, D. Gates, D. Wallace, K. Peterson, Domain specific speech acts for spoken language translation, in: Proceedings of the 4th SIGdial Workshop on Discourse and Dialogue, 2003.
[12] S. Li, C. Zong, X. Wang, Sentiment classification through combining classifiers with multiple feature sets, in: Proceedings of NLP-KE, 2007, pp. 135–140.
[13] C.-N. Seon, H. Lee, H. Kim, J. Seo, Improving domain action classification in goal-oriented dialogues using a mutual retraining method, Pattern Recognit. Lett. 45 (2014) 154–160.
[14] S. Shen, H. Lee, Neural attention models for sequence classification: analysis and application to key term extraction and dialogue act detection, in: Proceedings of INTERSPEECH, 2016.
[15] D. Shin, Y. Lee, J. Jang, H. Rim, Using CNN-LSTM for effective application of dialogue context to emotion classification, in: Proceedings of HCLT, 2016 (in Korean).
[16] A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, C. Van Ess-Dykema, R. Martin, M. Meteer, Dialogue act modeling for automatic tagging and recognition of conversational speech, Comput. Ling. 26 (3) (2000) 339–373.
[17] D. Surendran, G. Levow, Dialogue act tagging with support vector machines and hidden Markov models, in: Proceedings of INTERSPEECH, 2006.
[18] The National Institute of the Korean Language, Final report on achievements of the 21st Sejong project: electronic dictionary, 2007 (in Korean).
[19] N. Webb, M. Hepple, Y. Wilks, Dialogue act classification based on intra-utterance features, in: Proceedings of the AAAI Workshop on Spoken Language Understanding, 2005.
[20] H. Yune, H. Kim, J. Chang, An efficient search method of product reviews using opinion mining techniques, J. KIISE 16 (2) (2010) 222–226 (in Korean).