A PROBABILISTIC INFERENCE OF PARTICIPANTS INTEREST LEVEL
IN A MULTI-PARTY CONVERSATION BASED ON MULTI-MODAL SENSING
Yusuke Kishita, Hiroshi Noguchi, Hiromi Sanada, and Taketoshi Mori
Graduate School of Interdisciplinary Information Studies, The University of Tokyo
yuusuuke.kishiita@gmail.com
ABSTRACT

Detecting the degree of involvement during conversations is important for summarization, retrieval, and browsing applications. In this paper, we define the degree of involvement as the interest level that a group of participants show in the course of interactions, and propose an automatic detection scheme for scenes of high interest based on multi-modal sensing. Our research is motivated by the fact that non-verbal information such as gestures and facial expressions plays an important role during a face-to-face conversation. Audio-visual features from the entire group are obtained by sensors located in a meeting room, and topics are extracted by applying latent Dirichlet allocation (LDA) to the features. Then a Support Vector Machine (SVM) is used to infer the interest level from the topics. We conducted experiments using recordings of conversational scenes (2 hours 43 minutes in total) with interest level labels on a five-point scale. Interest level 4 or over is assigned as high and interest level 3 or under as low, with the result that the highest accuracy of our inference model reaches 87.3 %.

Index Terms— Automatic interest level detection, multi-modal data, latent Dirichlet allocation.

1. INTRODUCTION

In the real world, most of the communication that people engage in is conversation. They communicate with each other face-to-face by conveying not only verbal but also non-verbal information. This non-verbal information, such as gestures, facial expressions, and the volume of the voice, plays an important role in a conversation.

In recent years, the approach called intelligent space, which watches and supports human activities within a space, has been proposed, and systems with collections of sensors that are able to recognize the activities and interactions among people have been developed. With computers being aware of social contexts, applications that extract semantically meaningful conversational scenes, and robots that understand people's intentions, will become possible.

To automatically detect people's internal states from human social interactions, Wrede and Shriberg introduced the concept of hot-spots in group meetings, defined in terms of participants being highly involved in a discussion [1]. The authors made use of the ICSI Meeting Recorder corpus, which includes conversations recorded by close-talking microphones. This study found that some prosodic features, such as voice energy and the fundamental frequency (F0), appear to distinguish between involved and non-involved utterances. Kennedy and Ellis showed high accuracy in detecting emphasis or excitement of speech utterances in meetings, which may be useful identifiers of interest, from prosodic cues [2].

Few works have studied the use of multi-modal cues for interest level estimation in multi-party conversations. Gatica-Perez et al. investigated the automatic detection of segments of high interest level from audio-visual cues [3]. The audio cues included pitch, energy, and speaking rate, and the visual cues were estimated by computing skin-color head and right-hand blobs for each participant. For this purpose, they applied a methodology based on Hidden Markov Models (HMMs) to a number of audio-visual features. The results were promising: using audio-visual cues could improve performance.

We address the problem using multi-modal (audio-visual) data containing participants' prosodic features, gestures, and facial expressions, together with a classification model based on latent Dirichlet allocation (LDA) [4]. We conducted experiments with a number of features and various combinations of data, and obtained high performance with our proposed inference model.

The paper is organized as follows. Section 2 describes the inference framework we used. Section 3 presents experiments and results. Section 4 concludes with the findings and future work.

2. FRAMEWORK

In this section, we describe our framework for estimating the interest level with multi-modal data. A conversation scene is divided into a sequence of shots, and features within a shot are used for the inference as follows.

1. Audio and visual features are extracted separately at frame level as multi-modal data. These features are obtained for each participant.

2. The audio-visual features of the participants are combined and transformed into audio-visual words. Each shot is represented as a histogram of audio-visual words.

3. Topics of each shot are obtained by applying a topic model [4] to the sequence of shots. The interest level of a conversation scene is inferred on the basis of these topics.
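Steps 1-2 above (quantizing frame-level features into audio-visual words and histogramming them per shot) can be sketched as follows. This is a minimal sketch: the random feature values, the small codebook size, and the shot length in frames are illustrative stand-ins, not the paper's actual data or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for frame-level multi-modal features:
# 3 participants x (14 audio + 3 visual) = 51 dims per frame (Sec. 2.3).
n_frames, dim = 900, 51
frames = rng.normal(size=(n_frames, dim))

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns a (k x dim) codebook."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest code word
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers

k = 8  # toy codebook size (the paper tries K = 100, 300, 500)
codebook = kmeans(frames, k)

# Quantize every frame, then histogram the words within each 15 s shot
# (~225 frames at the Kinect's ~15 fps).
words = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1).argmin(1)
shot_len = 225
hists = np.stack([np.bincount(s, minlength=k)
                  for s in words.reshape(-1, shot_len)])
print(hists.shape)  # (4, 8): each shot is now a bag of audio-visual words
```

Each histogram row then plays the role of a "document" for the topic model in step 3.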
2.1. Audio features extraction
Audio features are extracted from raw audio wave files saved for each participant. The microphone array is composed of 8 non-directional microphones. The device needs calibration for sound source localization, and the MUltiple SIgnal Classification (MUSIC) algorithm is applied for localization over 360 degrees. Each participant's utterance is obtained from the direction of arrival of the sound source. The audio format is a 16-bit encoded wave file with a sampling rate of 16 kHz. Audio features are extracted within a frame length of 512 samples, i.e. 32 ms, where 160 samples overlap with the previous frame. As a result, 14 features, including pitch, log energy of the signal, and mel frequency cepstral coefficients (MFCC), are extracted for each frame.

Fig. 1. System from the view point of feature extraction.

Fig. 2. Bag-of-audio-visual-words (AVWs).
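The framing described in Sec. 2.1 (512-sample frames at 16 kHz with a 160-sample overlap) can be sketched as follows. Reading "160 samples are overlapped with the previous frame" as a hop of 512 - 160 = 352 samples is our assumption, and the actual pitch/energy/MFCC computation is omitted.

```python
import numpy as np

SR = 16_000            # sampling rate (Sec. 2.1)
FRAME = 512            # frame length in samples, i.e. 32 ms
OVERLAP = 160          # samples shared with the previous frame
HOP = FRAME - OVERLAP  # assumed 352-sample hop between frame starts

def split_frames(signal):
    """Slice a mono waveform into overlapping analysis frames."""
    n = 1 + max(0, (len(signal) - FRAME) // HOP)
    return np.stack([signal[i * HOP: i * HOP + FRAME] for i in range(n)])

signal = np.arange(SR, dtype=float)  # 1 s ramp as a stand-in waveform
frames = split_frames(signal)
print(frames.shape)                  # (45, 512)

# Consecutive frames really do share their last/first 160 samples.
assert (frames[1][:OVERLAP] == frames[0][-OVERLAP:]).all()
```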
2.2. Visual features extraction

As visual features, we made use of the Microsoft Xbox Kinect sensor with the Kinect for Windows SDK 1.5. The aim of using the Kinect sensor is to detect and obtain each participant's facial expression and upper-body movement at once. These features are extracted at the same frame level at which the audio features are derived.

With the face tracking SDK of the Kinect, results are expressed in terms of weights of six animation units. The animation units represent deltas from the neutral face shape, and we used three of them (AU0, AU2, and AU4) to obtain one feature value. If the facial expression is more pronounced, the value will be positive; if the expression is subtle, the value will be negative.

For extracting gestures, the seated tracking mode of the SDK is applied. The positions of 10 joints (head, neck, right shoulder, right elbow, right wrist, right arm, left shoulder, left elbow, left wrist, and left arm) are measured. The Kinect runs at about 15 fps when recognizing both the face and the skeleton of each participant. The movement of each joint is computed as the time differential of the vector norm when the same joint is recognized within 1 second of the current frame. The gesture of each participant is then calculated as the average of these movements.

To capture participants' reactions during conversations, nods are computed using the 3D head pose angles: pitch, roll, and yaw. We made use of the pitch angle, which ranges from -90 to 90 degrees, and measured its time differential as a nod.

In total, 3 visual features (facial expression, gesture, and nods) are obtained.

2.3. Conversational scene representation

Fig. 1 shows the system we have developed. The 3 Kinects and the microphone array are each connected to a server. The servers send multi-modal features to a client. The multi-modal features of the participants are combined to create an input vector with 51 dimensions. The input vector represents the 3 participants' multi-modal data within a frame.

Fig. 2 shows the procedure of the conversational scene representation. In order to reduce the noise of the input data, vector quantization is applied to create bag-of-audio-visual-words (AVWs). The procedure is summarized as:

1. Divide a conversational scene into a sequence of shots. Feature vectors are extracted from the shots. Each shot lasts 15 seconds.

2. Generate a codebook of audio-visual words by applying K-means clustering to all the frames.

3. Replace each feature vector in a shot with an audio-visual word in the codebook.

4. Transform each shot into a histogram of AVWs.

2.4. Interest level inference

The topic model extracts representative elements of the AVWs by applying latent Dirichlet allocation (LDA), a topic estimation method from the natural language processing field. LDA has been successfully applied to movie affective scene classification [5]. Inspired by that work, we adopt LDA to infer the topics of each conversational scene.

LDA estimates the relationships between observed words and latent topics. Since LDA assumes that a document is a bag-of-words and each shot is represented as a histogram of AVWs, it is possible to apply LDA directly to shots. Fig. 3 shows the graphical representation of LDA; the joint distribution of the latent variables is

\prod_{k=1}^{K} P(\phi_k | \beta) \prod_{d=1}^{D} P(\theta_d | \alpha) \prod_{n=1}^{N} P(Z_{d,n} | \theta_d) P(W_{d,n} | \phi_{1:K}, Z_{d,n}).

The posterior of the latent variables given the observed words is:

P(\phi_{1:K}, \theta_{1:D}, Z_{1:D,1:N} | W_{1:D,1:N})   (1)

LDA is a generative model: from a collection of documents, it infers the per-word topic assignments Z_{d,n}, the per-document topic proportions \theta_d, and the per-corpus topic distributions \phi_k. The variables \alpha and \beta represent fixed hyper-parameters. We used a collapsed Gibbs sampler for the estimation.

Fig. 3. Latent Dirichlet allocation.

Fig. 4 shows our model for inferring the interest level from topics. The bottom layer represents each shot, and LDA is applied to estimate topics. On the top layer, the interest level of each shot is inferred from the topics by applying a Support Vector Machine (SVM).

Fig. 4. Interest level inference model.

3. EXPERIMENTS

This section describes experiments to recognize the interest level of participants in conversations. We first describe the details of the input data and the annotation scheme. We then describe the results with different evaluation measures.

3.1. Data

The experiment was set up in a meeting room with recording instruments. The microphone array was set on a table at the center of the room, and 3 participants sat around the table. The 3 Kinects were located on poles so as to recognize users while avoiding occlusion. Participants were not supposed to move around during conversations.

The dataset contains 10 conversations from 14 participants, including 5 males and 9 females. In total, 163 minutes of conversations were collected. The shortest conversation is 13 minutes and the longest is 19 minutes, with an average length of 16 minutes. 80 % of the dataset consisted of light conversations, and the rest was discussion about random topics. For the light conversations, there were no predefined topics and the conversations were allowed to develop freely. The 14 participants were selected from colleagues of the same laboratory or the same class, which helped conversations flow smoothly.

The scoring task was done by one trained annotator. The annotator gave scores to each conversational scene (shot) of 15 seconds from the videos based on intuitive impressions. The measure of interest level comprises the joint interest level of the entire group.

The interest level annotations are based on the following criteria. Interest level 1 covers cases in which no interaction between the participants is taking place. Interest level 2 is a less extreme version of interest level 1. Interest level 3 is annotated when moderate conversations emerge. Interest level 4 is annotated when participants show increased interest and actively contribute to the conversations. Interest level 5 is an extreme version of interest level 4.

Finally, 456 shots were annotated from the dataset. A five-point scale was chosen, but only the values 2-5 were actually used in the annotations. Shots annotated with the values 2, 3, 4, and 5 account for 13.4 %, 58.1 %, 21.7 %, and 6.8 % of all the shots, respectively.

We applied an SVM to the shots in order to infer the interest level. Each shot is represented as a vector of per-shot topic proportions. For example, if the number of topics is 50, the vector has 50 dimensions. We used a ten-fold cross-validation procedure to evaluate the accuracy, precision, and recall.

3.2. Results

In order to analyze the performance of the inference scheme, we evaluated it with different numbers of topics and AVWs, and then compared it with conventional inference methods.

Instead of directly assigning one of the five interest level scores to each shot, we adopted 2-level interest level recognition. We assigned shots with interest level 4 or over as high, and shots with interest level 3 or under as low.
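The evaluation protocol above (mapping the five-point scores to a binary high/low label and scoring with ten-fold cross-validation) can be sketched as follows. The randomly drawn scores and the pass-through "predictor" are toy stand-ins for the annotated shots and the trained SVM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 456 annotated shots, drawn with the
# label proportions reported in Sec. 3.1 (levels 2-5).
scores = rng.choice([2, 3, 4, 5], size=456, p=[0.134, 0.581, 0.217, 0.068])
labels = (scores >= 4).astype(int)  # level 4-5 -> high (1), level 2-3 -> low (0)

# Ten-fold cross-validation: each fold is held out once as test data.
folds = np.array_split(rng.permutation(len(labels)), 10)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # a real run would train the SVM on train_idx and score it on test_idx

def metrics(y_true, y_pred):
    """Accuracy, precision, and recall for the 'high' class."""
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    acc = float((y_pred == y_true).mean())
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return acc, prec, rec

print(metrics(labels, labels))  # a perfect predictor scores (1.0, 1.0, 1.0)
```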
In this experiment, we varied the codebook size K in the K-means clustering method and the number of topics N obtained by LDA. We applied our inference scheme to test data for K = 100, 300, 500 and N = 10, 20, 30, 40, 50. Fig. 5, Fig. 6, and Fig. 7 show the precision of the binary classification for different numbers of topics. The plot for K = 100 has its highest score for N = 10, likely because of over-fitting. The plots for K = 300 and K = 500 have their lowest scores for N = 10; it seems there are not enough topics to explain the conversations well. The highest accuracy was obtained with K = 300 and N = 50.

Fig. 5. Precision for codebook size 100.

Fig. 6. Precision for codebook size 300.

Fig. 7. Precision for codebook size 500.

Table 1 shows the results of comparing our approach with 2 other methods. One method did not use LDA but applied the SVM to the AVWs directly. The other method used only audio features (pitch, energy, and MFCC) as input data.

Table 1. Comparison of results between different algorithms.

                            Accuracy [%]   Precision   Recall
MultiModal+LDA (Proposed)       87.3          0.93      0.60
MultiModal+SVM                  62.5          0.32      0.29
AudioFeature+LDA                87.1          0.97      0.56

4. CONCLUSION

In this paper, we proposed a method to detect group interest level in face-to-face conversations using multi-modal data. The results suggest that our proposal outperforms existing methods and that the use of a topic model is effective for the task. However, no significant difference was seen between the results with multi-modal features and with audio features alone. This may be because annotating the interest level, which is a subjective quantity of the participants, is difficult. Developing an effective annotation scheme to retrieve participants' internal states will be the next step of this work.

5. REFERENCES

[1] Britta Wrede and Elizabeth Shriberg, "Spotting "hot spots" in meetings: human judgments and prosodic cues," in INTERSPEECH, 2003, pp. 2805-2808, ISCA.

[2] Lyndon S. Kennedy and Daniel P. W. Ellis, "Pitch-based emphasis detection for characterization of meeting recordings," in IEEE Workshop on Automatic Speech Recognition and Understanding, 2003.

[3] D. Gatica-Perez, I. McCowan, Dong Zhang, and S. Bengio, "Detecting group interest-level in meetings," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), Mar. 2005, vol. 1.

[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993-1022, Mar. 2003.

[5] Go Irie, Kota Hidaka, Takashi Satou, Akira Kojima, Toshihiko Yamasaki, and Kiyoharu Aizawa, "Latent topic driving model for movie affective scene classification," in Proc. 17th ACM International Conference on Multimedia (MM '09), New York, NY, USA, 2009, pp. 565-568.