
Pattern Recognition Letters 101 (2018) 1–5


Integrated neural network model for identifying speech acts, predicators, and sentiments of dialogue utterances
Minkyoung Kim, Harksoo Kim∗
Program of Computer and Communications Engineering, College of IT, Kangwon National University, 1 Gangwondaehak-gil, Chuncheon-si, Gangwon-do
24341, Republic of Korea

Article history:
Received 23 June 2017
Available online 6 November 2017

Keywords:
Integrated intention identification model
Speech act identification
Predicator identification
Sentiment identification
Partial error backpropagation

Abstract

A dialogue system should capture speakers' intentions, which can be represented by combinations of speech acts, predicators, and sentiments. To identify these intentions from speakers' utterances, many studies have independently dealt with speech acts, predicators, and sentiments. However, these three elements composing speakers' intentions are tightly associated with each other. To resolve this problem, we propose a convolutional neural network model that simultaneously identifies speech acts, predicators, and sentiments. The proposed model has well-designed hidden layers for embedding informative abstractions appropriate for speech act identification, predicator identification, and sentiment identification. Nodes in the hidden layers are partially trained by three cycles of error backpropagation: training the nodes associated with speech act identification, predicator identification, and sentiment identification. In the experiments, the proposed model showed higher F1-scores than independent models: 6.8% higher in speech act identification, 6.2% higher in predicator identification, and 4.9% higher in sentiment identification. Based on the experimental results, we conclude that the proposed integration architecture and partial error backpropagation can help to increase the performance of intention identification.

© 2017 Elsevier B.V. All rights reserved.

1. Introduction

A dialogue system should correctly understand speakers' utterances and should respond to their requests in their natural language. To realize the former, the dialogue system should identify the underlying intentions of the speakers' utterances [6]. The speakers' intentions in dialogues can be represented by combinations of speech acts and predicators (so-called main actions or concept sequences) [11]. In addition, the sentiments implied in utterances can help to capture speakers' intentions. Table 1 shows part of a dialogue between a dialogue system and a human user.

In Table 1, we represent a speaker's intention as a comma-separated triple. The first element in the triple is a speech act (e.g., "inform," "ask-ref," "response," and "statement" in the example) that indicates a domain-independent intention associated with the conversational role of an utterance. The second element is a predicator (e.g., "late," "be," "part," and "encourage" in the example) that captures a domain-dependent semantic focus associated with the main meaning of an utterance. The last element is a sentiment (e.g., "none" and "sadness" in the example) that expresses the speaker's attitude with respect to a dialogue topic. As shown in Table 1, a speech act and a predicator represent the speaker's explicit intention, and a sentiment represents an implicit intention supplementing the explicit intention. Moreover, a current speech act is strongly dependent on previous speech acts. For example, the speech act "response" of the third utterance is affected by the previous speech act "ask-ref"; if the previous speech act were not "ask-ref," it could be "inform." A predicator and a sentiment are less dependent on their contexts than a speech act is. On the other hand, they are affected by the lexical meanings of the current utterance and are associated with each other. For example, the predicator "part" of the fourth utterance is determined by the word sense of the main verb phrase "was parted from." In addition, the predicator "part" of the fourth utterance helps the system determine the sentiment "sadness," and the sentiment "sadness" of the fifth utterance helps the system determine the predicator "encourage." In this paper, we propose an integrated model that simultaneously determines speakers' speech acts, predicators, and sentiments.

This paper is organized as follows. In Section 2, we review previous work on intention analysis. In Section 3, we describe the integrated intention analysis model. In Section 4, we explain the experimental setup and report experimental results. Finally, we draw conclusions in Section 5.

∗ Corresponding author.
E-mail address: nlpdrkim@kangwon.ac.kr (H. Kim).

https://doi.org/10.1016/j.patrec.2017.11.009
0167-8655/© 2017 Elsevier B.V. All rights reserved.

Table 1
Example of a dialogue between a system and a user.

Speaker   Utterance                                   Intention
User      I was late in returning home yesterday.     (inform, late, none)
System    What time was it?                           (ask-ref, be, none)
User      11 P.M.                                     (response, be, none)
User      In fact, I was parted from her.             (inform, part, sadness)
System    Come on.                                    (statement, encourage, sadness)
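For illustration only, the intention triples of Table 1 can be represented as a small record type. The sketch below is our own illustration of the (speech act, predicator, sentiment) representation, not code from the paper.

```python
from dataclasses import dataclass

@dataclass
class Intention:
    """A speaker's intention as the triple used throughout this paper."""
    speech_act: str  # domain-independent conversational role, e.g., "inform"
    predicator: str  # domain-dependent semantic focus, e.g., "part"
    sentiment: str   # speaker's attitude toward the dialogue topic, e.g., "sadness"

# The fourth utterance of Table 1: "In fact, I was parted from her."
example = Intention(speech_act="inform", predicator="part", sentiment="sadness")
```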

2. Previous work

Most recent studies on speech acts and predicators have been based on machine learning models. Stolcke et al. [16] proposed a speech act labeling model based on a hidden Markov model (HMM), in which acoustic features (prosodic features as well as lexical features) are used for dealing with speech inputs. Langley [8] proposed a speech act classifier and a predicator classifier with memory-based learning (k-NN) to improve the performance of speech translation. Surendran and Levow [17] replaced the observation probabilities of an HMM with the class probabilities of a support vector machine (SVM) in order to reduce the sparse data problem. Kang et al. [4] proposed a multidomain model based on conditional random fields, in which input features are constructed according to application domains. Although many machine learning models based on various linguistic features have been proposed, the previous models have mainly dealt with speech act identification alone [4,14,16,17,19] or have dealt with speech act identification and predicator identification separately [8,9]. However, the pair consisting of a speech act and a predicator should be identified simultaneously to precisely capture speakers' intentions. Therefore, Lee et al. [10] proposed an integrated neural network model in which speech act identification results are used as inputs to predicator identification. To improve the performance of the integrated model, Seon et al. [13] proposed a mutual retraining method in which speech act identification results are repeatedly used as inputs to predicator identification during training, and vice versa. Although these integrated models showed that an integration architecture can help to increase performance, they did not consider speakers' sentiments as elements composing their intentions.

Previous studies on sentiment classification can be divided into two groups: feature-focused methods [12,20] and learner-focused methods [2,5]. The feature-focused methods have mainly studied feature-weighting schemes based on various resources, such as sentiment dictionaries and sentiment snippets (i.e., two or three sentences including sentiment words). The learner-focused methods have mainly studied how to apply various machine learning models to sentiment classification. To alleviate the feature engineering requirements of learner-focused methods, sentiment classification models based on neural networks that use word embedding vectors as input features have been proposed [15]. Although there have been numerous studies on sentiment classification, most of them have dealt not with dialogue utterances but with short texts such as customers' reviews and blogs.

3. IIIM: integrated intention identification model based on neural networks

Given n utterances, U_{1,n}, in a dialogue D, let S_{1,n}, P_{1,n}, and E_{1,n} denote the n speech act tags, predicator tags, and sentiment tags in D, respectively. The integrated model can then be formally expressed as the following equation:

IM(D) \overset{\mathrm{def}}{=} \arg\max_{S_{1,n},\,P_{1,n},\,E_{1,n}} P(S_{1,n}, P_{1,n}, E_{1,n} \mid U_{1,n})    (1)

According to the chain rule, Eq. (1) can be rewritten as follows:

IM(D) \overset{\mathrm{def}}{=} \arg\max_{S_{1,n},\,P_{1,n},\,E_{1,n}} P(S_{1,n} \mid U_{1,n}) \, P(P_{1,n}, E_{1,n} \mid U_{1,n}, S_{1,n})    (2)

As shown in Eq. (2), the integrated model consists of two parts: the speech act identification model, P(S_{1,n} | U_{1,n}), and the predicator and sentiment identification model, P(P_{1,n}, E_{1,n} | U_{1,n}, S_{1,n}). To simplify the speech act identification model, we assume that a current speech act depends only on the previous speech act (i.e., a first-order Markov assumption), because speech acts are strongly affected by their previous contexts. Then, we assume that a predicator and a sentiment depend only on their current observational information (i.e., a conditional independence assumption), because predicators and sentiments are strongly affected by the lexical meanings of the current utterance. We also apply the conditional independence assumption to the speech act identification model. Fig. 1 depicts the process by which Eq. (2) is simplified into Eq. (3) according to these two assumptions.

Fig. 1. Simplification of the equation by the Markov assumption and the independence assumption.

IM(D) \overset{\mathrm{def}}{=} \arg\max_{S_{1,n},\,P_{1,n},\,E_{1,n}} \prod_{i=1}^{n} P(S_i \mid U_i) \, P(S_i \mid S_{i-1}) \, P(P_i, E_i \mid U_i, S_i)    (3)
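To make the factorization in Eq. (3) concrete, the sketch below scores one utterance by combining the three factors in the log domain and greedily picks the best (speech act, predicator, sentiment) triple given the previous speech act. It is a minimal illustration under our own assumptions about how the factor probabilities are stored; all names are hypothetical, and it is not the authors' implementation.

```python
import numpy as np

def identify_intention(p_s_u, p_s_prev, p_pe_us, prev_s):
    """Greedy per-utterance decoding of the factors in Eq. (3) (illustrative sketch).

    p_s_u    : [n_acts] array, P(S_i | U_i) from the speech act part of the model
    p_s_prev : [n_acts, n_acts] array, transition probabilities P(S_i | S_{i-1})
    p_pe_us  : [n_acts, n_preds, n_sents] array, P(P_i, E_i | U_i, S_i)
    prev_s   : index of the previous speech act S_{i-1}
    """
    best_triple, best_score = None, -np.inf
    for s in range(p_s_u.shape[0]):
        # log P(S_i | U_i) + log P(S_i | S_{i-1})
        act_score = np.log(p_s_u[s]) + np.log(p_s_prev[prev_s, s])
        # Best predicator/sentiment pair under this speech act.
        p, e = np.unravel_index(np.argmax(p_pe_us[s]), p_pe_us[s].shape)
        score = act_score + np.log(p_pe_us[s, p, e])
        if score > best_score:
            best_score, best_triple = score, (s, p, e)
    return best_triple, best_score
```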
To obtain the sequence labels S_{1,n}, P_{1,n}, and E_{1,n} that maximize Eq. (3), we propose an Integrated Intention Identification Model (IIIM) based on Convolutional Neural Networks (CNNs) [7], as shown in Fig. 2. In Fig. 2, W_i is the 50-dimensional Word2Vec embedding vector of the ith word in the input utterance [3]. The embedding vectors are trained on a large balanced corpus, the 21st century Sejong project's POS-tagged corpus [18]. H_X denotes a set of nodes fully connected with the output X; for example, H_P is the set of nodes fully connected with the output vector representing the predicator P_i. Similarly, H_XY denotes a set of nodes fully connected with the outputs X and Y; for example, H_EP is the set of nodes fully connected with the two output vectors representing the sentiment E_i and the predicator P_i. In this paper, these sets of partially grouped nodes, such as H_ES, H_SP, H_EP, and H_ESP, are called shared nodes because they can contain weights associated with multiple outputs. During training, three types of utterance embedding vectors are generated by concatenating nodes in the hidden layer: the utterance embedding vector SE_i for speech act identification, the utterance embedding vector PE_i for predicator identification, and the utterance embedding vector EE_i for sentiment identification.

Fig. 2. Overall architecture of IIIM.

To generate the embedding vectors, three cycles of partial error backpropagation are applied to the CNN. First, the errors between the output values associated with speech act categories and the correct speech act vectors, represented by one-hot codes, are propagated through the partially connected nodes. Second, partial error backpropagation for predicator identification is performed in the same manner as that for speech act identification. Finally, partial error backpropagation for sentiment identification is performed similarly. We expect that informative features (or abstraction values) accumulate in the embedding vectors owing to the partial error backpropagation. Then, the output values S_i associated with speech act categories are fed to the input nodes for predicator identification and sentiment identification. The previous speech act vector S_{i-1}, represented by a one-hot code, is concatenated to the utterance embedding vector SE_i for speech act identification. During the partial error backpropagation, cross-entropies are used as loss functions in order to maximize the similarities between the correct categories (i.e., S_i, P_i, and E_i) and the output categories (i.e., \hat{S}_i, \hat{P}_i, and \hat{E}_i), as shown in Eq. (4).

H_{\hat{S}}(S) = -\sum_{i} \hat{S}_i \log(S_i), \quad H_{\hat{P}}(P) = -\sum_{i} \hat{P}_i \log(P_i), \quad H_{\hat{E}}(E) = -\sum_{i} \hat{E}_i \log(E_i)    (4)
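One plausible way to realize the three partial backpropagation cycles is to minimize each cross-entropy of Eq. (4) with an optimizer restricted to the variables of the node groups wired to the corresponding output, as in the TensorFlow 1.x-style sketch below. The variable grouping and names are our own assumptions, not the authors' code.

```python
import tensorflow as tf  # TensorFlow 1.x, as used for the experiments in Section 4.2

def partial_backprop_ops(logits, labels, var_groups, lr=0.001):
    """Build one training op per output, updating only its connected node groups.

    logits     : dict with keys "S", "P", "E" mapping to output logit tensors
    labels     : dict with the corresponding one-hot label placeholders
    var_groups : dict mapping "S"/"P"/"E" to the list of tf.Variables belonging to
                 the private and shared groups connected to that output
                 (e.g., H_S, H_SP, H_ES, and H_ESP for the speech act output)
    """
    train_ops = {}
    for key in ("S", "P", "E"):
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=labels[key],
                                                     logits=logits[key]))
        optimizer = tf.train.GradientDescentOptimizer(lr)
        # Restricting var_list realizes the "partial" error backpropagation:
        # only the nodes connected to this output receive gradient updates.
        train_ops[key] = optimizer.minimize(loss, var_list=var_groups[key])
    return train_ops
```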
4. Evaluation

4.1. Data sets and experimental settings

For our experiments, we collected a Korean dialogue corpus about views on love (3092 utterances; 1874 unique words) from mobile chat rooms in which two users discuss each other's views on a specific topic by using the short message service of a commercial telecommunication company. Each utterance in the collected dialogues was manually annotated with speech acts, predicators, and sentiments by three undergraduate students who were familiar with dialogue analysis. The classification of speech acts, predicators, and sentiments for dialogues is very subjective, and universally accepted criteria do not exist. In this paper, we defined 15 speech act tags, 67 predicator tags, and 10 sentiment tags. Table 2 shows the speech acts, predicators, and sentiments that frequently occur in the dialogue corpus.

Table 2
Top-n tags in the dialogue corpus.

Speech act (%)        Predicator (%)     Sentiment (%)
Statement (51.3)      None (17.9)        None (43.5)
Response-if (18.3)    Judge (9.3)        Fear (10.5)
Ask-if (10.0)         Other (6.6)        Sadness (8.9)
Ask-ref (7.8)         Be (6.3)           Anger (8.8)
Response-ref (5.3)    Express (6.0)      Coolness (8.0)
Hope (2.7)            Know (5.4)         Love (6.5)
Request (1.2)         Like (5.2)         Joy (4.6)
Opinion (1.0)         Non-exist (4.2)    Wish (3.9)
Ask-confirm (0.9)     Exist (4.1)        Other (3.0)
Thanks (0.7)          Perform (3.8)      Surprise (2.2)

Prior to manual tagging, we explained the meanings of the speech acts, predicators, and sentiments to the students and showed them some samples that were annotated with domain actions. We then assigned one student to code each fifth of the data. Finally, a graduate student post-processed all the tagged data for consistency.

To evaluate IIIM, we divided the annotated dialogue corpus into a training corpus and a test corpus in a ratio of 9:1. Then, we trained IIIM by using the training corpus. Next, we performed 10-fold cross-validation.

Four evaluation measures, accuracy, macro precision, macro recall rate, and macro F1-score, were used to evaluate the performance of the proposed model. The accuracy is the proportion of returned values that are correct. The macro precision is the average proportion of correct values returned per category. The macro recall rate is the average proportion of correctly returned values per category. The macro F1-score combines the macro precision and macro recall rate with equal weighting in the following form: F1 = (2.0 × macro precision × macro recall rate) / (macro precision + macro recall rate).
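For reference, the macro-averaged measures defined above can be computed as in the following sketch; it simply restates the definitions and is not the evaluation script used for the paper.

```python
from collections import Counter

def macro_scores(gold, pred):
    """Macro precision, macro recall rate, and macro F1-score over category labels."""
    labels = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    precision = [tp[l] / (tp[l] + fp[l]) if tp[l] + fp[l] else 0.0 for l in labels]
    recall = [tp[l] / (tp[l] + fn[l]) if tp[l] + fn[l] else 0.0 for l in labels]
    macro_p = sum(precision) / len(labels)
    macro_r = sum(recall) / len(labels)
    macro_f1 = (2.0 * macro_p * macro_r / (macro_p + macro_r)
                if macro_p + macro_r else 0.0)
    return macro_p, macro_r, macro_f1

# Example with three speech act labels.
print(macro_scores(["inform", "ask-ref", "response"], ["inform", "ask-ref", "inform"]))
```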

Table 3
Performance comparison with independent models.

Model Accuracy Avg. Macro Precision Avg. Macro Recall Avg. Macro F1-score

IIIM-S 0.855 0.862 0.720 0.714
CIM-S 0.838 0.705 0.512 0.593
S-Only 0.744 0.799 0.657 0.646
IIIM-P 0.731 0.835 0.691 0.641
CIM-P 0.260 0.296 0.136 0.186
P-Only 0.698 0.837 0.618 0.579
IIIM-E 0.661 0.594 0.559 0.565
CIM-E 0.466 0.412 0.165 0.236
E-Only 0.638 0.607 0.527 0.516

Table 4
Performance comparison with previous models.

Model Accuracy Avg. Macro Precision Avg. Macro Recall Avg. Macro F1-score

IIIM-S 0.910 0.920 0.825 0.825
IIIM-S + LSTM 0.985 0.989 0.958 0.959
[13] 0.941 0.878 0.915 0.894
[10] 0.860 – – –
IIIM-P 0.863 0.885 0.771 0.747
IIIM-P + LSTM 0.975 0.936 0.903 0.872
[13] 0.909 0.827 0.768 0.789
[10] 0.738 – – –
IIIM-E 0.661 0.594 0.559 0.565
IIIM-E + LSTM 0.952 0.969 0.952 0.960
[15] 0.932 0.858 0.853 0.855

To statistically validate the confidence levels of the evaluation measures, we performed t-tests between the comparison models (i.e., the independent identification models: a speech act identification model, a predicator identification model, and a sentiment identification model) and IIIM, using the scores of each evaluation measure as the input values of the t-test. The p-values of speech act identification, predicator identification, and sentiment identification were 0.064, 0.064, and 0.064, respectively. This implies that the performance differences are statistically meaningful at a significance level of 90%.

4.2. Implementation

We implemented IIIM using TensorFlow 1.0 [1]. Training and prediction were done at the sentence level. We set the size of each Word2Vec embedding vector in Fig. 2 to 50. Training spanned 300 epochs and was performed by mini-batch stochastic gradient descent with a fixed learning rate of 0.001. Each mini-batch consisted of 64 sentences.
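Under the settings above, the outer training loop might look like the following sketch. The placeholder names, batching helper, and session wiring are illustrative assumptions layered on the partial-backpropagation ops sketched in Section 3, not the authors' released code.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x

EPOCHS, BATCH_SIZE = 300, 64  # the fixed learning rate 0.001 is set on the optimizers

def iterate_minibatches(features, labels, batch_size=BATCH_SIZE):
    """Yield shuffled mini-batches of 64 sentences, as described in Section 4.2."""
    order = np.random.permutation(len(features))
    for start in range(0, len(features), batch_size):
        idx = order[start:start + batch_size]
        yield features[idx], {k: v[idx] for k, v in labels.items()}

def train(sess, inputs, label_phs, train_ops, features, labels):
    """Run the three partial-backpropagation ops on every mini-batch for 300 epochs."""
    for epoch in range(EPOCHS):
        for batch_x, batch_y in iterate_minibatches(features, labels):
            feed = {inputs: batch_x}
            feed.update({label_phs[k]: batch_y[k] for k in ("S", "P", "E")})
            for key in ("S", "P", "E"):  # three cycles of partial backpropagation
                sess.run(train_ops[key], feed_dict=feed)
```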
used as inputs of speech act identification [10]. Shin et al. [15] is
4.3. Experimental results

The first experiment compared the performance of IIIM with the independent models and with a conventionally integrated model, using the same Word2Vec embedding vectors. Table 3 shows the performance differences between IIIM and the independent models.

In Table 3, IIIM-S, IIIM-P, and IIIM-E denote the speech act identification part, the predicator identification part, and the sentiment identification part of IIIM, respectively. CIM-S, CIM-P, and CIM-E denote the speech act identification part, the predicator identification part, and the sentiment identification part of the conventionally integrated model, respectively. The conventionally integrated model has an ordinary CNN architecture without any shared nodes. S-Only, P-Only, and E-Only denote an independent speech act identification model, predicator identification model, and sentiment identification model, respectively. These independent models have the same CNN architectures as IIIM, except that they do not have the shared nodes of Fig. 2. As shown in Table 3, IIIM outperformed the conventionally integrated model and the independent models on all measures. In addition, the conventionally integrated model showed poor performance. This reveals that the proposed integration architecture can help to increase the performance of intention identification. The performance differences between IIIM and the independent models, in terms of the average macro F1-scores, were as follows: IIIM-S & S-Only (6.8%) > IIIM-P & P-Only (6.2%) > IIIM-E & E-Only (4.9%). This shows that speech acts are less associated with predicators and sentiments, and that predicators and sentiments are affected by each other.

The second experiment compared IIIM with previous models using the same training and test corpus. Table 4 shows the performance differences between IIIM and the previous models. In Table 4, Seon et al. [13] is an SVM-based classification model in which the performance of speech act identification and predicator identification is increased by the mutual retraining method [13]. Lee et al. [10] is an integrated model based on neural networks in which the results of predicator identification are used as inputs to speech act identification [10]. Shin et al. [15] is a sentiment analysis model based on CNN and LSTM (Long Short-Term Memory) networks [15]. IIIM-S + LSTM, IIIM-P + LSTM, and IIIM-E + LSTM are, respectively, an IIIM-S model, an IIIM-P model, and an IIIM-E model modified to the same architecture as [15] to reflect more contextual information, where the output values of the IIIM models are used as input values of LSTM networks.

As shown in Table 4, IIIM-S and IIIM-P showed performances similar to [13] and [10]. However, we think that IIIM is more efficient than Seon et al. [13] and Lee et al. [10], because they require more engineering effort to extract and select effective features, such as main verb phrases and grammatical roles of words, from utterances. IIIM-S + LSTM and IIIM-P + LSTM outperformed [13] and [10]. This reveals that IIIM can become more effective if it is combined with a sequence labeling model that reflects more contextual information. IIIM-E + LSTM showed better performance than [15]. We think that the performance differences were caused by the shared hidden nodes of Fig. 2, similar to the performance differences between IIIM-E and E-Only.

5. Conclusion

We proposed an integrated neural network model that simultaneously identifies speech acts, predicators, and sentiments of dialogue utterances. The proposed model has three kinds of well-designed hidden nodes that embed effective abstractions from input utterances for speech act identification, predicator identification, and sentiment identification. The hidden nodes are trained by partial error backpropagation. In the experiments, the proposed model showed better performance than the independent models (i.e., the speech act identification model, the predicator identification model, and the sentiment identification model). Based on the experimental results, we conclude that the proposed integration architecture and partial error backpropagation can help to increase the performance of intention identification.

Acknowledgment

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2016R1A2B4007732).

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: large-scale machine learning on heterogeneous distributed systems, 2015. Software available from tensorflow.org.
[2] S. Aman, S. Szpakowicz, Identifying expressions of emotion in text, in: Proceedings of the 10th International Conference on Text, Speech and Dialogue, 2007.
[3] Y. Goldberg, O. Levy, word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method, 2014. arXiv preprint arXiv:1402.3722.
[4] S. Kang, H. Kim, J. Seo, A reliable multidomain model for speech act classification, Pattern Recognit. Lett. 31 (1) (2010) 71–74.
[5] S. Kim, S. Park, S. Park, S. Lee, K. Kim, A syllable kernel based sentiment classification for movie reviews, J. KIISS 20 (2) (2010) 202–207 (in Korean).
[6] H. Kim, C.-N. Seon, J. Seo, Review of Korean speech act classification: machine learning methods, J. Comput. Sci. Eng. 5 (4) (2011) 288–293.
[7] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Proceedings of NIPS, 2012.
[8] C. Langley, Analysis for speech translation using grammar-based parsing and automatic classification, in: Proceedings of the ACL Student Research Workshop, 2002.
[9] H. Lee, H. Kim, J. Seo, Domain action classification using a maximum entropy model in a schedule management domain, AI Commun. 21 (4) (2008) 221–229.
[10] H. Lee, H. Kim, J. Seo, An integrated neural network model for domain action determination in goal-oriented dialogues, J. Inf. Process. Syst. 9 (2) (2013) 259–270.
[11] L. Levin, C. Langley, A. Lavie, D. Gates, D. Wallace, K. Peterson, Domain specific speech acts for spoken language translation, in: Proceedings of the 4th SIGdial Workshop on Discourse and Dialogue, 2003.
[12] S. Li, C. Zong, X. Wang, Sentiment classification through combining classifiers with multiple feature sets, in: Proceedings of NLP-KE, 2007, pp. 135–140.
[13] C.-N. Seon, H. Lee, H. Kim, J. Seo, Improving domain action classification in goal-oriented dialogues using a mutual retraining method, Pattern Recognit. Lett. 45 (2014) 154–160.
[14] S. Shen, H. Lee, Neural attention models for sequence classification: analysis and application to key term extraction and dialogue act detection, in: Proceedings of INTERSPEECH, 2016.
[15] D. Shin, Y. Lee, J. Jang, H. Rim, Using CNN-LSTM for effective application of dialogue context to emotion classification, in: Proceedings of HCLT, 2016 (in Korean).
[16] A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, C. Van Ess-Dykema, R. Martin, M. Meteer, Dialogue act modeling for automatic tagging and recognition of conversational speech, Comput. Ling. 26 (3) (2000) 339–373.
[17] D. Surendran, G. Levow, Dialogue act tagging with support vector machines and hidden Markov models, in: Proceedings of INTERSPEECH, 2006.
[18] The National Institute of the Korean Language, Final report on achievements of the 21st Sejong project: electronic dictionary, 2007 (in Korean).
[19] N. Webb, M. Hepple, Y. Wilks, Dialogue act classification based on intra-utterance features, in: Proceedings of the AAAI Workshop on Spoken Language Understanding, 2005.
[20] H. Yune, H. Kim, J. Chang, An efficient search method of product reviews using opinion mining techniques, J. KIISE 16 (2) (2010) 222–226 (in Korean).
