
Proceeding of the 3rd International Conference on Informatics and Technology, 2009

Arabic Speech Emotion Detection Using Hidden Markov Models

Muhannad Al-Naabi, Raja Noor Ainon & Mohammad A.M. Abu Shariah
Faculty of Computer Science & Information Technology
University of Malaya
mnaabi@hotmail.com, ainon@um.edu.my, shariah@perdana.um.edu.my

Abstract
In this paper an Arabic speech emotion detection system is presented. The study focuses on four types of emotion:
happiness, anger, sadness and the absence of emotion, i.e. neutral. The system uses the Hidden Markov Model
(HMM) as its classification method. An emotional utterance database comprising 160 speech samples was recorded:
10 sentences spoken by four different people, each acting four different emotional states. The detection process
starts by dividing the utterance files into two sets, a training set and a testing set. The phonetic and acoustic
features of the speech samples in the training set are obtained by extracting the Mel Frequency Cepstral
Coefficients (MFCC). The system then detects the emotions in the testing set using the trained HMM models.
Experimental results show that an overall detection accuracy of 68.75% was achieved.

Keywords: Emotion Detection, Hidden Markov Models (HMM), Mel Frequency Cepstral Coefficients (MFCC), Arabic speech.

1. Introduction

Implementing a system or machine that can detect a user's emotion is a challenging task. Many studies have
addressed it, but no perfect system currently exists, because detecting emotions in speech requires combining
state-of-the-art speech technology with psychology and linguistics.

The main objective of this work is to design and implement a software system that can detect emotions in spoken
Arabic. The system is built on the Hidden Markov Model (HMM) as its classification method and focuses on four
types of emotion: happiness, anger, sadness and neutral. Such a system would be applicable and useful in
Human-Computer Interaction (HCI) systems, e.g. robots.

A discrete HMM is defined by the probabilistic model λ = (π, A, B), where π is a vector of probabilities describing
the initial states, A is a matrix of probabilities describing the likelihood of transitioning from any one state to any
other state, and B is a matrix of probabilities describing the likelihood of a particular state outputting a particular
observation [1].
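As a purely illustrative reading of this definition (not code from the paper), a small discrete HMM can be written down directly in Python/NumPy; every number below is invented for the example:

```python
import numpy as np

# A 3-state discrete HMM over a 4-symbol output alphabet.
# All probabilities are illustrative, not values from the paper.
pi = np.array([0.6, 0.3, 0.1])            # initial state distribution
A = np.array([[0.7, 0.2, 0.1],            # A[i, j] = P(next state j | current state i)
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
B = np.array([[0.5, 0.2, 0.2, 0.1],       # B[i, k] = P(emitting symbol k | state i)
              [0.1, 0.4, 0.4, 0.1],
              [0.25, 0.25, 0.25, 0.25]])

# A valid model requires each distribution to sum to 1.
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```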

2. Related Works
From experimental results, HMM classifiers have yielded classification accuracy significantly better than linear
discriminant analysis and quadratic discriminant analysis [2]. The HMM-based classifier has an advantage over
static discriminative classifiers in that frame-length normalization is not necessary and the temporal dynamics of
the base features can be captured through the state transition probabilities [3].
In recent years many research works have been conducted to detect emotions in speech, as shown in Table 1.
Unfortunately, no previous research could be found on the detection of emotions in Arabic speech.

Table 1. Related research on detecting emotions in speech using HMM

Source | Language(s)      | No. of emotions | Size of speech samples | Recognition rate
[4]    | German & English | 7               | 5250                   | 77.8%
[5]    | English          | 4               | 880                    | 64.77%
[6]    | German           | 6               | 494                    | 71%
[7]    | Danish           | 5               | 266                    | 59.5%
[2]    | Mandarin         | 5               | 3400                   | 47.1%

As seen from Table 1, the recognition rates range from 47.1% at the lowest to 77.8% at the highest. Several
different extraction techniques were applied; for example, [4] and [6] used Gaussian mixtures while [2] and [7]
used a combination of MFCC and LPC. It is noticeable that the larger the set of speech samples, the better the
recognition rate obtained. The only exception is the work in [2], whose low recognition rate is explained by the use
of both clean and noisy speech in the experiments: for clean speech 62.5% was achieved, but for noisy speech the
rate was only 35.5% at a 5 dB noise level. Their experiments covered noise levels from 5 dB to 40 dB and achieved
a 47.1% recognition rate on average.
Moreover, it can be concluded from the table that the best recognition rates are achieved on German speech. This
is because the German emotional database was prepared very carefully: the recording sessions were held in a
studio, ensuring that all speech files are clean and noise-free, and its large size allows bigger training sets and
therefore better recognition rates.

3. Method

3.1 System Architecture

The process of detecting emotions in speech has three steps: feature extraction, feature selection and emotion
classification. In this work the feature extraction technique is based on the Mel Frequency Cepstral Coefficients
(MFCC), as it is the simplest and most widely used technique in speech processing [2].
The purpose of this technique is to transform the input waveform into a sequence of acoustic feature vectors,
each representing the information in a small time window of the signal.
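To make this step concrete, here is a hedged sketch of MFCC extraction in Python using the librosa library; the paper itself used MATLAB, and the file name, frame length and hop size below are assumptions of this illustration rather than reported settings:

```python
import librosa

# Load an utterance at the corpus sampling rate of 16 kHz (Section 4.2);
# "angry_01.wav" is a hypothetical file name.
signal, sr = librosa.load("angry_01.wav", sr=16000)

# 13 MFCCs per frame with a 25 ms window and 10 ms hop -- common speech
# defaults, not values taken from the paper.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

features = mfcc.T    # shape (num_frames, 13): one feature vector per time window
```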
The last step in the emotion detection process is classification. Researchers have applied several different
approaches to this task; according to [8], the technique most widely accepted as the state of the art is the Hidden
Markov Model (HMM). These steps are shown in Figure 1.
Figure 1. Steps of the emotion detection system: Input (speech signal) → Extract features → Select features →
Detect emotion → Output (detected emotion).

The system has two main parts, i.e., Training and Testing. The output of the Training will be used as an input for
the Testing part. The data in the system is processed sequentially from the Training part, where the features are
extracted, to the Testing part, where an emotion is detected.

Figure 2. Pipe-and-filter architecture of the Arabic Speech Emotion Detector (ASED) system: Input (speech) →
Extract features using MFCC → Select relevant features → Train the system → Test input speech using HMM →
Output (detected emotion).


3.2 Fundamental problems of HMM

In order to use the HMM properly in real-world applications, there are three basic problems that must be solved
for the model. The problems are the Evaluation problem, the Decoding problem and the Learning problem [9].

3.2.1 The Evaluation Problem

Given the observation sequence O = O1, O2, ..., OT and a model λ = (A, B, π), the problem is to compute P(O|λ),
the probability that the observed sequence was produced by the model.
This problem can be viewed as measuring how well a given model matches a given observation sequence, and it
therefore allows choosing, among several candidate models, the one that best matches the observation.
To solve this problem, the Forward (or Backward) algorithm can be used; it reduces the computational cost
through simple iterative mathematical formulae [8].
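A minimal sketch of that recursion for a discrete HMM (illustrative code, not the paper's implementation), with pi, A and B laid out as in the Introduction:

```python
import numpy as np

def forward_probability(pi, A, B, observations):
    """Compute P(O | lambda) for a discrete HMM via the forward recursion."""
    # alpha[i] = P(O_1..O_t, state i at time t | lambda)
    alpha = pi * B[:, observations[0]]        # initialisation (t = 1)
    for o_t in observations[1:]:              # induction (t = 2..T)
        alpha = (alpha @ A) * B[:, o_t]
    return float(alpha.sum())                 # termination: sum over final states

# Example with the illustrative pi, A, B above and symbols indexed 0..3:
# p = forward_probability(pi, A, B, [0, 2, 1, 3])
```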

3.2.2 The Decoding Problem

Given the observation sequence O = O1, O2, ..., OT and a model λ = (A, B, π), the problem is to choose a
corresponding state sequence Q = q1, q2, ..., qT that most likely produced the given observations; that is, we need
to find the state sequence that is optimal in a certain sense given the speech feature sequence.
To solve this problem, a dynamic programming technique called the Viterbi algorithm is recommended. This
algorithm finds the single best state sequence by maximising the joint probability P(O, Q|λ) [8].
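A compact log-domain sketch of the Viterbi algorithm (again illustrative, reusing the discrete-HMM parameter layout from the Introduction):

```python
import numpy as np

def viterbi(pi, A, B, observations):
    """Return the single most likely state sequence for a discrete HMM."""
    T, n = len(observations), len(pi)
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta = log_pi + log_B[:, observations[0]]   # best log-score ending in each state
    psi = np.zeros((T, n), dtype=int)            # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A          # scores[i, j]: leave state i for j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[:, observations[t]]
    path = [int(delta.argmax())]                 # best final state
    for t in range(T - 1, 0, -1):                # trace the back-pointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```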

3.2.3 The Training/Learning Problem

In the training phase the problem is to adjust the model parameters λ so as to maximize P(O|λ). According to [10],
we need to estimate the HMM parameters from a given set of training samples according to some meaningful
criterion. The goal is to optimise the model parameters to obtain the best model representing a given set of
observations belonging to one spoken entity. The standard algorithm for this problem is the Baum-Welch
algorithm, an iterative method that converges to a local maximum of the likelihood P(O|λ) [8].
The reason why HMM is widely used in speech processing is that a speech signal can be viewed as a piece-wise
stationary signal or short time stationary signal.
The training/learning process takes as input the feature vectors created in the previous step. The vectors of all
speech files belonging to one emotion are treated together, so that a separate HMM is created for each emotion.
The training process starts with a K-means initialization, which computes initial parameter estimates for an HMM
with Gaussian mixture outputs.
After the learning process is completed, an acoustic likelihood for each acoustic unit is computed and will be used
in the detection step.
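A rough sketch of this per-emotion training stage, assuming the third-party hmmlearn library (the paper itself used MATLAB, and the state count below is a guess rather than a reported value). Conveniently, hmmlearn's GaussianHMM also initialises its Gaussian means with k-means, echoing the initialisation described above, although the model is simplified here to a single Gaussian per state rather than a mixture:

```python
import numpy as np
from hmmlearn import hmm   # third-party library; an assumption of this sketch

EMOTIONS = ["anger", "happiness", "sadness", "neutral"]

def train_emotion_models(features_by_emotion, n_states=5):
    """Train one HMM per emotion from that emotion's MFCC sequences.

    features_by_emotion maps an emotion name to a list of
    (num_frames, 13) MFCC arrays, one per training utterance.
    """
    models = {}
    for emotion in EMOTIONS:
        sequences = features_by_emotion[emotion]
        X = np.vstack(sequences)                  # all frames, stacked
        lengths = [len(s) for s in sequences]     # frames per utterance
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=20)
        model.fit(X, lengths)                     # Baum-Welch re-estimation
        models[emotion] = model
    return models
```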

4. Creating emotional speech corpus

The first step in this study is to prepare an emotional database that provides utterances for testing and training
purposes.

4.1 Creating sentences

The emotional speech database can be created using either real or acted utterances. Real utterances can be
captured from recorded phone calls, computer games or films. However, for this study acted utterances were
used, as almost no real utterances are spoken in Standard Arabic.
The Arabic language has several varieties that diverge widely from country to country and even within a single
country; it is spoken as the first language in more than 20 countries. One factor in the differentiation of the
varieties is the influence of the languages previously spoken in each area, which have typically contributed a
significant number of new words and have sometimes also influenced pronunciation or word order. It is therefore
difficult for a listener to understand the utterances if the speaker is not speaking Standard Arabic.


However, it is almost impossible to use natural data when basic emotions are the subject of investigation: clear
emotional expression is rare in everyday situations, and recording people experiencing full-blown emotions is
ethically problematic [11].
Therefore, it was decided to use acted utterances, obtained by recording 10 sentences acted in 4 emotions:
Sadness, Happiness, Anger and the non-emotional state, i.e. Neutral. The actors were chosen for the naturalness
and recognizability of their performance; on this basis, four actors belonging to a theatre group were selected.
According to Burkhardt et al. [11], when creating the sentences it is important that all of them be interpretable in
every emotion under review and that they contain no emotional bias. Two kinds of text material normally meet
these requirements:
• Nonsense text material, for instance haphazard series of figures or letters, or fantasy words.
• Normal sentences that could be used in everyday life.
Nonsense material is guaranteed to be emotionally neutral. However, there is the disadvantage that actors will find
it difficult to imagine an emotional situation and to produce natural emotional speech spontaneously. This is why
nonsense material often results in stereotyped overacting.
In comparison with poems and nonsense sentences, the use of everyday communication has proved best, because
this is the natural form of speech under emotional arousal. Moreover, actors can immediately speak them from
memory. In the construction of the database, priority was given to the naturalness of speech material and thus
everyday sentences were used as test utterances. Table 2 shows the sentences used to create the database and their
translation to English.

4.2 Recording session

The recording session was carried out in a room at the headquarters of the theatre group. All recordings carried
out in this room were sometimes interrupted by noise. This is because the room is not a quiet or noise proof room
unlike a special recording studio.
The recording was accomplished using GoldWave v5.11 and a Sony F-V120 Uni-Directional Vocal Microphone
which was connected to HP Pavilion DV6920US 15.4-inch Laptop. The utterances were saved into .wav extension at
16 kHz
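As a small illustrative check (with a hypothetical file name), Python's standard wave module can confirm that a corpus file matches this format:

```python
import wave

# Verify an utterance is stored at the corpus rate of 16 kHz.
with wave.open("utterance_01.wav", "rb") as f:
    assert f.getframerate() == 16000, "unexpected sampling rate"
    print(f.getnchannels(), "channel(s),", f.getnframes(), "frames")
```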


Table 2. Sentences used to create the Arabic emotional database

Sentence (Translation)
1. ﺍﻟﻜِﺘَﺎﺏُ ﻣَﻮﺟُﻮﺩٌ ﻋَﻠَﻰ ﺍﻟﺮﱠﻑِ ﺍﻷَﻭﱠﻝِ — The book is on the first shelf.
2. ﻟَﻘَﺪ ﺭَﺃَﻳﺖُ ﻋَﻠِﻴﱠﺎً ﺍﻟﺒَﺎﺭِﺣَﺔَ ﻓِﻲ ﺍﻟﻨﱠﺎﺩﻱ — I saw Ali in the club yesterday.
3. ﺳَﻨَﺬﻫَﺐُ ﺻَﺒَﺎﺡَ ﺍﻟﻐَﺪِ ﻟِﺸِﺮَﺍءِ ﺍﻷَﻏﺮَﺍﺽِ ﻣِﻦَ ﺍﻟﺴﱡﻮﻕِ — Tomorrow morning we will go to buy things from the market.
4. ﻛَﺎﻥَ ﻟِﻘَﺎﺅُﻧَﺎ ﺑِﺎﻷَﺻﺪِﻗَﺎءِ ﺍﻟﺒَﺎﺭِﺣَﺔَ ﻣُﻔَﺎﺟِﺌﺎً — Our meeting with the friends yesterday was a surprise.
5. ﺯَﺍﺭَﻧِﻲ ﺃَﺣﻤَﺪٌ ﻭَﺃَﺧُﻮﻩُ ﺻَﺒَﺎﺡَ ﺍﻟﻴَﻮﻡِ — Ahmed and his brother visited me this morning.
6. ﺗُﻐﻠَﻖُ ﺍﻟﺢَﺩِﻳﻘَﺔُ ﺍﻟﺴﱠﺎﻋَﺔَ ﺍﻟﺘﱠﺎﺳِﻌَﺔَ ﻣَﺴَﺎءً — The park closes at nine p.m.
7. ﺳَﺘُﻌﺮَﺽُ ﺍﻟﻤُﺒَﺎﺭَﺍﺓُ ﺍﻟﺴﱠﺎﻋَﺔَ ﺍﻟﺜﱠﺎﻣِﻨَﺔَ ﻋَﻠَﻰ ﺍﻟﻘّﻨَﺎﺓِ ﺍﻷُﻭﻟَﻰ — The match will be shown at eight o'clock on Channel One.
8. ﻫَﻞِ ﺍﻟﺠِﻬَﺎﺭُ ﺍﻟﻤَﻮﺟُﻮﺩُ ﻋﻠﻰ ﺍﻟﻂﱠﺍﻭﻟِﺔِ ﻟﻚ؟ — Does the machine on the table belong to you?
9. ﺳَﺄَﺫﻫَﺐُ ﺃَﻳَﺎﻡَ ﺍﻟﻌُﻄﻠَﺔِ ﻓِﻲ ﺭِﺣﻠَﺔٍ ﻣَﻊَ ﺃَﺻﺪِﻗَﺎﺋِﻲ — I will go on a trip with my friends during the holiday.
10. ﺳَﻨﻨﺘَﻘِﻞُ ﺇِﻟَﻰ ﺷِﻘﱠﺔٍ ﺟَﺪِﻳﺪَﺓٍ ﺑَﻌﺪَ ﻳَﻮﻣَﻴﻦِ — We will move to a new apartment in two days.

4.3 Perception test of recorded speech corpus

After the recording process, a perception test was performed to verify the accuracy of the recorded sentences and
to make sure they carry the intended emotional elements. In this test, each utterance was validated by a number
of listeners, who determined the emotion in each recording; the determined emotion was then compared with the
pre-defined one.
A total of 15 persons participated in this test, five of them female, aged between 23 and 32 years, while the rest
were males of varying ages. All of them are native Arabic speakers, so they could understand the recorded
utterances and their emotional element. Table 3 shows the nationalities of the participants.

Table 3. Nationalities of the persons who performed the perception test

Nationality | Number of persons
Omani       | 7
Iraqi       | 2
Sudanese    | 2
Algerian    | 1
Yemeni      | 3
Total       | 15

Each listener was given a form containing a table; each row names a sound file and lists the 4 emotions, from
which the listener had to choose one. The listeners were not told the pre-defined emotion of each file; they played
the sound files in random order and chose the emotion they believed was acted in each file.
The recognition rate of the utterances is 76.85%, which is acceptable compared with the German database, 86%
[11], and the Danish database, 67% [12]. Table 4 shows the recognition rate from the perception test for each
emotion.
As depicted in Table 4, Anger has the highest recognition rate at 83.2% while Happiness has the lowest at 68.5%.

Table 4. Recognition rate in the perception test for each emotion

Emotion   | Rate
Anger     | 83.2%
Happiness | 68.5%
Neutral   | 75.7%
Sadness   | 80.0%
Average   | 76.85%


5. Experimental Results

The system was implemented in MATLAB, a numerical computing environment and programming language from
MathWorks.
Several experiments were conducted to measure the accuracy of the developed system. After the perception test
had confirmed that the utterances sound natural, the utterance database was divided into a training set and a
testing set. The training set contains 112 utterances, 28 per emotion, and was used to train the HMM models. The
testing set contains 48 utterances, 12 per emotion, and was used in the detection process to check the accuracy of
the system.
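Under the same hmmlearn assumption as in the training sketch of Section 3.2.3 (this code only illustrates the decision rule; it does not reproduce the paper's figures), detection scores each test utterance against all four emotion models and picks the best-scoring one:

```python
def detect_emotion(models, mfcc_features):
    """Pick the emotion whose trained HMM gives the highest log-likelihood."""
    scores = {emo: m.score(mfcc_features) for emo, m in models.items()}
    return max(scores, key=scores.get)

def overall_accuracy(models, test_set):
    """test_set: list of (true_emotion, mfcc_features) pairs."""
    hits = sum(detect_emotion(models, feats) == true_emotion
               for true_emotion, feats in test_set)
    return hits / len(test_set)
```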

Table 5 shows the confusion matrix of the classification accuracy for the four emotional states. It is clear that
Anger has the highest recognition rate at 83.3%; in the perception test, Anger also had the highest rate. Happiness
is the most difficult emotion to recognize: its recognition rate was only 50%, with the other three emotions
mistakenly recognized in the remaining 50% of the test samples. Again this matches the perception test, where
Happiness received the lowest rate. This result might be due to the way Arab speakers express happiness: some
express it loudly while others express it quickly.

Table 5. Confusion matrix of the classification accuracy (rows: acted emotion; columns: recognized as, %)

Emotion   | Anger | Happiness | Sadness | Neutral
Anger     | 83.3  | 0         | 0       | 16.67
Happiness | 16.7  | 50.0      | 16.67   | 16.66
Sadness   | 0     | 0         | 66.67   | 33.33
Neutral   | 8.3   | 0         | 16.67   | 75.0
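For completeness, a matrix like Table 5 can be computed from the predictions of the detection sketch above; scikit-learn is an assumption of this illustration, not a tool reported in the paper:

```python
from sklearn.metrics import confusion_matrix

labels = ["anger", "happiness", "sadness", "neutral"]
# test_set and models come from the earlier sketches.
y_true = [true_emotion for true_emotion, _ in test_set]
y_pred = [detect_emotion(models, feats) for _, feats in test_set]

# Row-normalised percentages: rows are acted emotions, columns "recognized as".
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true") * 100
print(cm)
```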

6. Conclusion and Future Work

The aim of this work was to develop a system that can detect four emotions in Arabic speech: Anger, Happiness,
Neutral and Sadness. An emotional speech database was created by 4 people acting 10 sentences in the four
emotions, and was then divided into a training set and a testing set. We faced some difficulties in the recording
session: the speakers had to be professional actors so that the recorded sentences would sound natural, and the
session should ideally have been held in a studio to minimize noise, but no studio was available, so a quiet room
was used instead.
The system can be described as acceptable since it achieved an overall recognition rate of 68.75%. The recognition
rate could be increased by holding the recording session in a studio and by enlarging the emotional database,
especially the training set.
The Arabic Speech Emotion Detector (ASED) system can be improved in several ways:
• In this work only four emotions were covered and only short sentences were acted. In future work, more
emotions might be added and a mixture of short and long sentences should be prepared.


• The training and testing database was prepared in advance. As an improvement, the system might allow the
user to record his/her own files for training or testing. The system should also be able to remove noise in order
to obtain a better recognition rate.

• Regarding the features, we used prosodic features, which gave good results; however, to reduce the overlap
among the emotions we recommend also using linguistic features. For example, a linguistic filter for
distinguishing between anger and happiness may increase the accuracy rate.

• Although the HMM was found to be a good classifier for speech emotion recognition, combining it with another
technique, such as Vector Quantization, might result in better recognition.

7. References

[1] Talwar, G., Kubichek, R. F. & Liang, H., 'Hiddenness Control of Hidden Markov Models and Application to Objective Speech
Quality and Isolated-Word Speech Recognition', Fortieth Asilomar Conference on Signals, Systems and Computers, California,
USA, 2006, pp. 1076-1080.
[2] Pao, T.-L., Liao, W.-Y., Chen, Y.-T., Yeh, Cheng, Y.-M. & Chien, C., 'Comparison of Several Classifiers for Emotion
Recognition from Noisy Mandarin Speech', Proceedings of the Third International Conference on International Information
Hiding and Multimedia Signal Processing, vol. 1, 2007, pp. 23-26.
[3] Kwon, O.-W., Chan, K., Hao, J. & Lee, T.-W., 'Emotion Recognition by Speech Signals', 8th European Conference on Speech
Communication and Technology, Geneva, Switzerland, 2003, pp. 125-128.
[4] Schuller, B., Rigoll, G. & Lang, M., 'Hidden Markov model-based speech emotion recognition', IEEE International
Conference on Acoustics, Speech, and Signal Processing, Maryland, USA, 2003, pp. 1-4.
[5] Lee, C. M., Yildirim, S., Bulut, M., Abe, K., Busso, C., Deng, Z., Lee, S. & Narayanan, S., 'Emotion recognition based on
phoneme classes', 8th International Conference on Spoken Language Processing, Jeju Island, Korea, 2004, pp. 889-892.
[6] El Ayadi, M. M. H., Kamel, M. S. & Karray, F., 'Speech Emotion Recognition using Gaussian Mixture Vector Autoregressive
Models', IEEE International Conference on Acoustics, Speech and Signal Processing, Hawai'i, USA, 2007, pp. IV-957-IV-960.
[7] Lin, Y.-L. & Wei, G., 'Speech emotion recognition based on HMM and SVM', Proceedings of the Fourth International
Conference on Machine Learning and Cybernetics, Guangzhou, China, 2005, pp. 4898-4901.
[8] Abdulla, W. & Kasabov, N., 'The Concepts of Hidden Markov Model in Speech Recognition', Technical Report, Knowledge
Engineering Lab, Department of Information Science, University of Otago, New Zealand, 1999.
[9] Jurafsky, D. & Martin, J., Speech and Language Processing, 2nd edn, Prentice Hall, New Jersey, USA, 1999.
[10] Huang, X., Acero, A. & Hon, H.-W., Spoken Language Processing: A Guide to Theory, Algorithm and System Development,
Prentice Hall PTR, New Jersey, 2001, pp. 288-304.
[11] Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. & Weiss, B., 'A database of German emotional speech', Proceedings
of INTERSPEECH, Lisbon, Portugal, 2005, pp. 1517-1520.
[12] Engberg, I. S. & Hansen, A. V., 'Documentation of the Danish Emotional Speech Database (DES)', Internal AAU Report,
Center for Person Kommunikation, Denmark, 1996.

