
A PROBABILISTIC INFERENCE OF PARTICIPANTS' INTEREST LEVEL

IN A MULTI-PARTY CONVERSATION BASED ON MULTI-MODAL SENSING

Yusuke Kishita, Hiroshi Noguchi, Hiromi Sanada, and Taketoshi Mori

Graduate School of Interdisciplinary Information Studies, The University of Tokyo


yuusuuke.kishiita@gmail.com

ABSTRACT

Detecting the degree of involvement during conversations is important for summarization, retrieval, and browsing applications. In this paper, we define the degree of involvement as the interest level that a group of participants show in the course of interactions, and propose an automatic detection scheme for scenes of high interest based on multi-modal sensing. Our research is motivated by the fact that non-verbal information such as gesture and facial expressions plays an important role during a face-to-face conversation. Audio-visual features from the entire group are obtained by sensors located in a meeting room, and topics are extracted by applying latent Dirichlet allocation (LDA) to the features. A Support Vector Machine (SVM) is then used to infer the interest level from the topics. We conducted experiments using recordings of conversational scenes (2 hours 43 minutes in total) with interest level labels on a five-point scale. Interest level 4 or over is assigned as high and interest level 3 or under is assigned as low, and the highest accuracy of our inference model reaches 87.3 %.

Index Terms— Automatic interest level detection, multi-modal data, latent Dirichlet allocation.

1. INTRODUCTION

In the real world, most of the communication that people engage in is conversation. They communicate with each other face-to-face by conveying not only verbal information but also non-verbal information. This non-verbal information, such as gesture, facial expression, and volume of voice, plays an important role in a conversation.

In recent years, the approach called intelligent space, which watches and supports human activities within a space, has been proposed, and systems with collections of sensors that are able to recognize the activities and interactions among people have been developed. With computers being aware of social contexts, applications that extract semantically meaningful conversational scenes, and robots that understand people's intentions, will become realizable.

To automatically detect people's internal states from human social interactions, Wrede and Shriberg introduced the concept of hot-spots in group meetings, locating them in terms of participants highly involved in a discussion [1]. The authors made use of the ICSI Meeting Recorder corpus, which includes conversations recorded by close-talking microphones. This study found that some prosodic features, such as the energy of the voice and the fundamental frequency (F0), appear to distinguish between involved and non-involved utterances.

Kennedy and Ellis showed high accuracy in detecting emphasis or excitement of speech utterances, which may be useful identifiers of interest, in meetings from prosodic cues [2].

Few works have studied the use of multi-modal cues for interest level estimation in multi-party conversations. Gatica-Perez et al. investigated the automatic detection of segments of high interest level from audio-visual cues [3]. The audio cues included pitch, energy, and speaking rate, and the visual cues were estimated by computing skin-color head and right-hand blobs for each participant. For this purpose, they applied a methodology based on Hidden Markov Models (HMMs) to a number of audio-visual features. The results were promising: using audio-visual cues could improve performance.

We address the problem using multi-modal (audio-visual) data containing participants' prosodic features, gesture, and facial expression, together with a topic model called latent Dirichlet allocation (LDA) [4]. We conducted experiments with a number of features and various combinations of data, and obtained high performance with our proposed inference model.

The paper is organized as follows. Section 2 discusses the inference framework we used. Section 3 presents experiments and results. Section 4 concludes the findings and future work.

2. FRAMEWORK

In this section, we describe our framework for estimating interest level with multi-modal data. A conversation scene is divided into a sequence of shots, and features within a shot are used for the inference as follows.

1. Audio and visual features are extracted separately at frame level as multi-modal data. These features are obtained for each participant.
2. Audio-visual features of the participants are combined and transformed into audio-visual words. Each shot is represented as a histogram of audio-visual words.

3. Topics of each shot are obtained by applying a topic model [4] to the sequence of shots. The interest level of a conversation scene is inferred on the basis of these topics.

Fig. 1. System from the viewpoint of feature extraction. (Three Kinects and a microphone array are connected through servers to a client; the extracted features are gesture, facial expression, nods, pitch, energy, and MFCC.)

2.1. Audio features extraction

Audio features are extracted from raw audio wave files saved for each participant. The microphone array is composed of 8 non-directional microphones. This device needs calibration for sound source localization, and the MUltiple SIgnal Classification (MUSIC) algorithm is applied to localization over 360 degrees. Each participant's utterance is obtained from the direction of arrival of the sound source. The audio format is a 16-bit encoded wave file, and the sampling rate is 16 kHz. Audio features are extracted within a frame length of 512 samples, i.e. 32 ms, where 160 samples overlap with the previous frame. As a result, 14 features including pitch, log energy of the signal, and mel-frequency cepstral coefficients (MFCC) are extracted for each frame.
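For concreteness, a minimal Python sketch of this kind of frame-level extraction, assuming the librosa library, is given below. The pitch search range and the use of 12 MFCCs (so that pitch, log energy, and the MFCCs give 14 values per frame) are illustrative assumptions rather than the exact configuration described above.

# Minimal sketch of frame-level audio feature extraction (pitch, log energy, MFCC).
# Assumes librosa; pitch range and MFCC count are illustrative assumptions.
import librosa
import numpy as np

FRAME = 512          # 32 ms at 16 kHz
HOP = FRAME - 160    # 160-sample overlap with the previous frame

def frame_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)

    # 12 MFCCs per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                                n_fft=FRAME, hop_length=HOP)

    # Log energy of each frame.
    frames = librosa.util.frame(y, frame_length=FRAME, hop_length=HOP)
    log_energy = np.log(np.sum(frames ** 2, axis=0) + 1e-10)

    # Fundamental frequency (pitch) with the YIN estimator.
    f0 = librosa.yin(y, fmin=80, fmax=400, sr=sr,
                     frame_length=FRAME, hop_length=HOP)

    n = min(mfcc.shape[1], log_energy.shape[0], f0.shape[0])
    # One 14-dimensional feature vector per frame.
    return np.vstack([f0[:n], log_energy[:n], mfcc[:, :n]]).T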

2.2. Visual features extraction

As for visual features, we made use of the Microsoft Xbox Kinect 360 with the Kinect for Windows SDK 1.5. The aim of using the Kinect sensor is to detect and obtain each participant's facial expression and the movement of the upper part of the body at once. These features are extracted at the same frame level at which the audio features are derived.

With the face tracking SDK of the Kinect, results are expressed in terms of weights of six animation units. The animation units represent deltas from the neutral face shape, and we used 3 of them (AU0, AU2, and AU4) to obtain one feature value. If the facial expression is more pronounced, the value is positive; if the expression is subtle, the value is negative.

For extracting gesture, the seated tracking mode of the SDK is applied. The positions of 10 joints (head, neck, right shoulder, right elbow, right wrist, right arm, left shoulder, left elbow, left wrist, and left arm) are measured. The Kinect runs at about 15 fps when recognizing both the face and the skeleton of each participant. The movement of each joint is computed as the time differential of the vector norm when the same joint is recognized within 1 second of the current frame. The gesture of each participant is then calculated as the average of these movements.

To capture participants' reactions during conversations, nods are computed by using the 3D head pose angles: pitch, roll, and yaw. We made use of the pitch angle, which ranges from -90 to 90 degrees, and measured its time differential as a nod.

In total, 3 visual features (facial expression, gesture, and nods) are obtained.
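A rough sketch of how the gesture and nod features could be computed from per-frame Kinect output is shown below. The array layouts, the 15 fps rate, and the 1-second differencing window are assumptions based on the description above, not the exact implementation.

# Illustrative sketch of the gesture and nod features from Kinect tracking output.
import numpy as np

FPS = 15  # approximate Kinect tracking rate

def gesture_feature(joints):
    """joints: (n_frames, 10, 3) positions of the 10 upper-body joints [m]."""
    # Displacement of each joint over roughly 1 second (FPS frames apart).
    disp = np.linalg.norm(joints[FPS:] - joints[:-FPS], axis=2)  # (n - FPS, 10)
    per_joint_speed = disp / 1.0                                  # per second
    # Gesture = average movement over all joints for each frame.
    return per_joint_speed.mean(axis=1)

def nod_feature(head_pitch_deg):
    """head_pitch_deg: (n_frames,) head pitch angle in [-90, 90] degrees."""
    # Nods = time differential of the head pitch angle.
    return np.diff(head_pitch_deg) * FPS  # degrees per second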
Fig. 2. Bag-of-audio-visual-words (AVWs).

2.3. Conversational scene representation

Fig. 1 shows the system we have developed. The 3 Kinects and the microphone array are each connected to a server. The servers send multi-modal features to a client. The multi-modal features of the participants are combined to create an input vector with 51 dimensions. The input vector represents the 3 participants' multi-modal data within a frame.

Fig. 2 shows the procedure of the conversational scene representation. In order to reduce the noise of the input data, vector quantization is applied to create bag-of-audio-visual-words (AVWs). The procedure is summarized as follows; a code sketch of steps 2-4 is given after the list.

1. Divide a conversational scene into a sequence of shots. Feature vectors are extracted from the shots. Each shot lasts 15 seconds.

2. Generate a codebook of audio-visual words by applying K-means clustering to all the frames.

3. Replace each feature vector in a shot with an audio-visual word in the codebook.

4. Transform each shot into a histogram of AVWs.
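The quantization and histogram steps can be sketched as follows, assuming scikit-learn. Variable names and the codebook size are illustrative; frames holds the 51-dimensional per-frame vectors (3 participants x (14 audio + 3 visual) features) and shots is a list of arrays of frames, one per 15-second shot.

# Sketch of codebook generation and per-shot AVW histograms.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(frames, codebook_size=300):
    # Step 2: K-means over all frames defines the audio-visual words.
    return KMeans(n_clusters=codebook_size, n_init=10, random_state=0).fit(frames)

def shot_histograms(shots, codebook):
    K = codebook.n_clusters
    hists = []
    for shot in shots:
        # Step 3: map each frame vector to its nearest audio-visual word.
        words = codebook.predict(shot)
        # Step 4: represent the shot as a histogram (bag) of AVWs.
        hists.append(np.bincount(words, minlength=K))
    return np.array(hists)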
2.4. Interest level inference

The topic model extracts representative elements of the AVWs by applying latent Dirichlet allocation (LDA), a topic estimation method from the natural language processing field.
LDA has been applied successfully to movie affective scene classification [5]. Inspired by that work, we adopt LDA to infer the topics of each conversational scene.

LDA estimates the relationships between observed words and latent topics. Since LDA assumes that a document is a bag of words, and each shot is represented as a histogram of AVWs, it is possible to apply LDA directly to shots. Fig. 3 shows the graphical representation of LDA, whose joint distribution over latent and observed variables is

∏_{k=1}^{K} P(φ_k | β) ∏_{d=1}^{D} P(θ_d | α) ∏_{n=1}^{N} P(Z_{d,n} | θ_d) P(W_{d,n} | φ_{1:K}, Z_{d,n}).

The posterior of the latent variables given the observed words is

P(φ_{1:K}, θ_{1:D}, Z_{1:D,1:N} | W_{1:D,1:N}).   (1)

Fig. 3. Latent Dirichlet allocation (graphical model with hyper-parameters α and β, topic proportions θ_d, topic assignments Z_{d,n}, observed words W_{d,n}, and topics φ_k over plates N, D, and K).

LDA is a generative model: from a collection of documents it infers the per-word topic assignments Z_{d,n}, the per-document topic proportions θ_d, and the per-corpus topic distributions φ_k. The variables α and β represent fixed hyper-parameters. We used a collapsed Gibbs sampler for the estimation.

Fig. 4 shows our model to infer interest level from topics. The bottom layer represents each shot, and LDA is applied to estimate the topics. On the top layer, the interest level of each shot is inferred from the topics by applying a Support Vector Machine (SVM).

Fig. 4. Interest level inference model.
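A simplified sketch of this two-layer model is given below, assuming scikit-learn. Note that scikit-learn's LDA uses variational inference rather than the collapsed Gibbs sampler described above, and the number of topics and the SVM kernel are illustrative choices.

# Sketch of the two-layer inference model: LDA topics on shot histograms, then SVM.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

def topic_proportions(shot_hists, n_topics=50):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    # Each shot becomes an n_topics-dimensional vector of topic proportions.
    return lda.fit_transform(shot_hists), lda

def train_interest_classifier(topic_vectors, labels):
    # Top layer: SVM mapping per-shot topic proportions to high/low interest.
    clf = SVC(kernel="rbf")
    clf.fit(topic_vectors, labels)
    return clf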
3. EXPERIMENTS

This section describes experiments to recognize the interest level of participants in conversations. We first describe the details of the input data and the annotation scheme. We then describe the results with different evaluation measures.

3.1. Data

The experiment was set up in a meeting room with recording instruments. The microphone array was set on a table at the center of the room, and 3 participants sat around the table. The 3 Kinects were located on poles so as to recognize the users while avoiding occlusion. Participants were not supposed to move around during the conversations.

The dataset contains 10 conversations from 14 participants, including 5 males and 9 females. In total, 163 minutes of conversations were collected. The shortest conversation is 13 minutes and the longest is 19 minutes, with an average length of 16 minutes. 80 % of the dataset consists of light conversations, and the rest is discussion about random topics. For the light conversations there were no predefined topics, and those conversations were allowed to develop freely. The 14 participants were selected from colleagues of the same laboratory or the same class, which helped the conversations flow smoothly.

The scoring task was done by one trained annotator. The annotator gave scores to each conversational scene (shot) of 15 seconds from the videos based on intuitive impressions. The measure of interest level comprises the joint interest level of the entire group.

Interest level annotations are based on the following criteria. Interest level 1 covers cases in which no interaction between the participants is taking place. Interest level 2 is a less extreme version of interest level 1. Interest level 3 is annotated when moderate conversation emerges. Interest level 4 is annotated when participants show increased interest and actively contribute to the conversation. Interest level 5 is an extreme version of interest level 4.

Finally, 456 shots were annotated using the dataset. A five-point scale was chosen, but only values 2-5 were actually used in the annotations. Shots annotated with values 2, 3, 4, and 5 account for 13.4 %, 58.1 %, 21.7 %, and 6.8 % of all the shots, respectively.

We applied an SVM to the shots in order to infer interest level. Each shot is represented as a vector of per-shot topic proportions. For example, if the number of topics is 50, then the vector has 50 dimensions. We used a ten-fold cross-validation procedure to evaluate the accuracy, precision, and recall.

3.2. Results

In order to analyze the performance of the inference scheme, we evaluated it with different numbers of topics and AVWs, and then compared it with conventional inference methods.

Instead of directly assigning one of the five interest level scores to each shot, we adopted 2-level interest level recognition. We assigned shots with interest level 4 or over as high, and shots with interest level 3 or under as low.

In this experiment, we varied the codebook size K used in the K-means clustering and the number of topics N obtained by LDA.
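The evaluation protocol can be sketched as follows, again assuming scikit-learn. The binarization threshold matches the description above, while the SVM kernel and scoring setup are illustrative assumptions.

# Sketch of the evaluation: binarize 5-point annotations (high >= 4), then run
# ten-fold cross-validation of the SVM on per-shot topic proportion vectors.
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

def evaluate(topic_vectors, interest_scores):
    labels = (np.asarray(interest_scores) >= 4).astype(int)  # 1 = high, 0 = low
    scores = cross_validate(SVC(kernel="rbf"), topic_vectors, labels, cv=10,
                            scoring=("accuracy", "precision", "recall"))
    return {m: scores[f"test_{m}"].mean()
            for m in ("accuracy", "precision", "recall")}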
We applied our inference scheme to test data for K = 100, 300, 500 and N = 10, 20, 30, 40, 50. Fig. 5, Fig. 6, and Fig. 7 show the precision of the binary classification at different numbers of topics.

Fig. 5. Precision for codebook size 100.

Fig. 6. Precision for codebook size 300.

Fig. 7. Precision for codebook size 500.

The plot for K = 100 has its highest score at N = 10, possibly because of over-fitting. The plots for K = 300 and K = 500 have their lowest scores at N = 10; it seems there are not enough topics to explain the conversations well. The highest accuracy was obtained with K = 300 and N = 50.

Table 1 shows the results of comparing our approach with 2 other methods. One method did not use LDA but applied the SVM to the AVWs directly. The other method used only audio features (pitch, energy, and MFCC) as input data.

Table 1. Comparison of results between different algorithms.

                              Accuracy [%]   Precision   Recall
  MultiModal+LDA (Proposed)       87.3          0.93       0.60
  MultiModal+SVM                  62.5          0.32       0.29
  AudioFeature+LDA                87.1          0.97       0.56

4. CONCLUSION

In this paper, we proposed a method to detect group interest level from face-to-face conversations using multi-modal data. The results suggest that our proposal outperformed existing methods and that the use of a topic model is effective for the task. However, no significant difference was seen between the results of multi-modal features and audio features alone. This may be because annotating interest level, a subjective quantity of the participants, is difficult. Developing an effective annotation scheme to retrieve participants' internal states will be the next step of this work.

5. REFERENCES

[1] Britta Wrede and Elizabeth Shriberg, "Spotting "hot spots" in meetings: human judgments and prosodic cues," in INTERSPEECH, 2003, pp. 2805–2808, ISCA.

[2] Lyndon S. Kennedy and Daniel P. W. Ellis, "Pitch-based emphasis detection for characterization of meeting recordings," in IEEE Workshop on Automatic Speech Recognition and Understanding, 2003.

[3] D. Gatica-Perez, I. McCowan, Dong Zhang, and S. Bengio, "Detecting group interest-level in meetings," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), Mar. 2005, vol. 1.

[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.

[5] Go Irie, Kota Hidaka, Takashi Satou, Akira Kojima, Toshihiko Yamasaki, and Kiyoharu Aizawa, "Latent topic driving model for movie affective scene classification," in Proceedings of the 17th ACM International Conference on Multimedia (MM '09), New York, NY, USA, 2009, pp. 565–568, ACM.
