Muhannad Al-Naabi, Raja Noor Ainon & Mohammad A.M. Abu Shariah
Faculty of Computer Science & Information Technology
University of Malaya
mnaabi@hotmail.com, ainon@um.edu.my, shariah@perdana.um.edu.my
Abstract
In this paper an Arabic speech emotion detection system is presented. The study focuses on four types of emotions:
happiness, anger, sadness and the absence of emotion, i.e. neutral. The system is designed with a Hidden Markov
Model (HMM) as the classification method. An emotional utterance database comprising 160 speech samples was
recorded: 10 sentences spoken by four different people, each acting four emotional states. The detection process
starts by dividing the utterance files into two sets, a training set and a testing set. The phonetic and acoustic features
of the speech samples in the training set are obtained by extracting the Mel Frequency Cepstral Coefficients (MFCC).
The system then detects the emotions in
the testing set using the HMM models. Experimental results showed that an overall detection accuracy of 68.75%
was achieved.
2. Related Works
From experimental results, HMM classifiers yielded classification accuracy significantly better than the linear
discriminant analysis and quadratic discriminant analysis [2]. The HMM-based classifier has advantages over other
static discriminative classifiers in that frame length normalization is not necessary and temporal dynamics of the base
features can be reflected by using the state transition probability [3].
In recent years many research works have been conducted on detecting emotions in speech, as shown in Table 1.
Unfortunately, no previous research could be found on the detection of emotions in Arabic speech. As seen from
Table 1, the recognition rate of emotions ranges from 47.1% at the lowest to 77.8% at the highest. Several different
extraction techniques were applied in this research; for example, [4] and [6] used Gaussian Mixtures while [2] and
[7] used a mixture of MFCC and LPC. It is noticeable that the bigger the set of speech samples, the better the
recognition rate obtained. The only exception is the work done in [2]; the reason is that both clean speech and noisy
speech were used in its experiments. For clean speech 62.5% was achieved, but for noisy speech the rate was only
35.5% at a 5 dB noise level. The experiments were also conducted at different noise levels from 5 dB to 40 dB, and
on average a 47.1% recognition rate was achieved.
Moreover, it can be concluded from the table that experiments on German consistently achieve the best recognition
rates. This is because the German Emotional Database was very well prepared: the recording sessions were held in a
studio, ensuring that all speech files are clean and noise-free. In addition, its large size allows bigger training sets
and therefore better recognition rates.
3. Method
The process of detecting emotions in speech has three steps: feature extraction, feature selection and emotion
classification. In this work the feature extraction technique employed is based on the Mel Frequency Cepstral
Coefficients (MFCC), as it is the simplest and most widely used technique in speech processing [2].
The purpose of this technique is to transform the input waveform into a sequence of acoustic feature vectors.
These vectors represent the information in a small time window of the signal.
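To make this transformation concrete, the sketch below computes MFCC vectors with NumPy alone. The frame length, hop size, filter count and coefficient count are illustrative defaults; the paper only specifies the 16 kHz sampling rate, so none of these values should be read as the authors' exact configuration.

```python
import numpy as np

def hz_to_mel(f):
    # Convert frequency in Hz to the mel scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    return fbank

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    # Pre-emphasis boosts high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)]) * np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Log mel-filterbank energies, then DCT-II to decorrelate them.
    feats = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    m = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * m + 1) / (2 * n_filters))
    return feats @ dct.T  # shape: (n_frames, n_ceps)
```

Each row of the returned array is one acoustic feature vector describing a 25 ms window (at 16 kHz) of the signal, with a 10 ms hop between consecutive windows.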
The last step in the emotion detection process is classification. Researchers have applied various approaches to
achieve this classification. According to [8], the technique most widely accepted by researchers as the state of the
art is the Hidden Markov Model (HMM). These steps are shown in Figure 1.
[Figure 1. Steps of the emotion detection process: Input (Speech Signal) → Extract features → Select features → Detect Emotion → Output (Detected Emotion)]
The system has two main parts, i.e., Training and Testing. The output of the Training will be used as an input for
the Testing part. The data in the system is processed sequentially from the Training part, where the features are
extracted, to the Testing part, where an emotion is detected.
Figure 2. Pipe and Filter architecture of Arabic Speech Emotion Detector (ASED) System
In order to use the HMM properly in real-world applications, there are three basic problems that must be solved
for the model. The problems are the Evaluation problem, the Decoding problem and the Learning problem [9].
The Evaluation problem: given the observation sequence O = O1, O2, ..., OT and a model λ = (A, B, π), compute
P(O|λ), the probability that the observed sequence is produced by the model.
In addition, this problem can be viewed as measuring how well a given model matches a given observation
sequence, and it can therefore be used to choose, among many candidate models, the one that best matches the
observation.
To solve this problem, the Forward (or Backward) Algorithm can be used; it reduces the computational cost through
simple iterative mathematical formulae [8].
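A minimal NumPy sketch of the Forward Algorithm for a discrete-observation HMM, using the λ = (A, B, π) notation above; the matrix shapes and variable names are illustrative:

```python
import numpy as np

def forward(O, A, B, pi):
    """Evaluation problem: compute P(O | lambda) for a discrete HMM.

    A is the (N, N) state-transition matrix, B the (N, M) emission
    matrix, pi the (N,) initial state distribution, and O a sequence
    of observation symbol indices.
    """
    alpha = pi * B[:, O[0]]               # initialisation
    for t in range(1, len(O)):
        alpha = (alpha @ A) * B[:, O[t]]  # induction: sum over previous states
    return alpha.sum()                    # termination: P(O | lambda)
```

The induction step sums over all ways of reaching each state, so the cost is O(N²T) rather than the O(Nᵀ) of enumerating every state sequence.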
The Decoding problem: given the observation sequence O = O1, O2, ..., OT and a model λ = (A, B, π), choose a
corresponding state sequence Q = q1 q2 ... qT that is most likely to have produced the given observations. That is, we
need to find the state sequence that is optimal in a certain sense given the speech feature sequence.
To solve this problem, a dynamic programming technique called the Viterbi Algorithm is recommended. This
algorithm finds the single best state sequence by maximising P(Q, O|λ) over all state sequences Q [8].
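The Viterbi recursion can be sketched in the same discrete-HMM setting as before; again the variable names are illustrative:

```python
import numpy as np

def viterbi(O, A, B, pi):
    """Decoding problem: most likely state sequence for observations O."""
    N, T = A.shape[0], len(O)
    # delta[t, i]: highest probability of any path ending in state i at time t.
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)    # back-pointers to the best predecessor
    delta[0] = pi * B[:, O[0]]
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A       # (previous state, next state)
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, O[t]]
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()
```

Replacing the Forward Algorithm's sums with maxima and keeping back-pointers is the only structural change, so the cost stays O(N²T).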
The Learning problem: in the training phase the task is to adjust the model parameters λ to maximize P(O|λ).
According to [10], we need to estimate the HMM parameters from a given set of training samples according to some
meaningful criterion.
This problem aims to optimise the model parameters to obtain the best model that represents a certain set of
observations belonging to one spoken entity. The standard algorithm for this problem is the Baum-Welch
Algorithm, an iterative method that converges to a local maximum of the probability function P(O|λ) [8].
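One Baum-Welch re-estimation step can be sketched for a discrete-observation HMM as below. The paper's models use Gaussian mixture outputs; the discrete case is shown here only to make the re-estimation formulas concrete, and all variable names are illustrative:

```python
import numpy as np

def baum_welch_step(O, A, B, pi):
    """One Baum-Welch (EM) re-estimation step for a discrete HMM.

    Returns updated (A, B, pi) and the likelihood P(O | lambda) of the
    current parameters; a further step never decreases this likelihood.
    """
    O = np.asarray(O)
    N, T = A.shape[0], len(O)
    # Forward lattice: alpha[t, i] = P(O_1..O_t, q_t = i | lambda).
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    # Backward lattice: beta[t, i] = P(O_{t+1}..O_T | q_t = i, lambda).
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    # gamma[t, i] = P(q_t = i | O); xi[t, i, j] = P(q_t = i, q_{t+1} = j | O).
    gamma = alpha * beta / likelihood
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, O[1:]].T * beta[1:])[:, None, :]) / likelihood
    # Re-estimation formulas.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.stack([gamma[O == k].sum(axis=0) for k in range(B.shape[1])],
                     axis=1) / gamma.sum(axis=0)[:, None]
    return A_new, B_new, pi_new, likelihood
```

In practice the step is iterated until the likelihood stops improving, which is exactly the convergence to a local maximum described above.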
The reason why HMM is widely used in speech processing is that a speech signal can be viewed as a piece-wise
stationary signal or short time stationary signal.
The training/learning process takes as input the feature vectors created in the previous step. The vectors belonging
to each emotion are treated separately, so that one HMM model is created per emotion. The training process starts
with K-means initialization, which computes initial parameter estimates for an HMM with Gaussian mixture
outputs.
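The K-means initialization can be sketched as follows. The number of states, iteration count and seed are illustrative choices rather than the authors' configuration, and a full system would convert the resulting means and variances into the HMM's Gaussian output parameters:

```python
import numpy as np

def kmeans_init(features, n_states, n_iter=20, seed=0):
    """Initialise per-state Gaussian parameters with a simple K-means.

    features: (n_frames, dim) array of MFCC vectors pooled from one
    emotion's training files. Returns one mean vector and one diagonal
    variance vector per HMM state.
    """
    rng = np.random.default_rng(seed)
    # Pick initial centroids by sampling distinct frames at random.
    means = features[rng.choice(len(features), n_states, replace=False)]
    labels = np.zeros(len(features), dtype=int)
    for _ in range(n_iter):
        # Assign every frame to its nearest centroid (squared distance).
        d = ((features[:, None, :] - means[None]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned frames.
        for k in range(n_states):
            if (labels == k).any():
                means[k] = features[labels == k].mean(axis=0)
    # Diagonal variances per state, floored to avoid degenerate Gaussians.
    variances = np.stack([
        features[labels == k].var(axis=0) + 1e-6 if (labels == k).any()
        else np.ones(features.shape[1])
        for k in range(n_states)])
    return means, variances
```

Each cluster centre then serves as the initial mean of one state's Gaussian output, which gives Baum-Welch a sensible starting point instead of random parameters.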
After the learning process is completed, an acoustic likelihood for each acoustic unit is computed and will be used
in the detection step.
The first step in this study is to prepare an emotional database that provides utterances for testing and training
purposes.
The emotional speech database can be created by using either real or acted utterances. The real utterances can be
captured from recorded phone calls, computer games or films. However, for our study the acted utterances were used
as almost all real utterances are not spoken in Standard Arabic.
The Arabic language has several varieties that diverge widely from country to country and even within a single
country. It is spoken as the first language in more than 20 countries. One factor in the differentiation of the varieties
is the influence of the languages previously spoken in the areas, which have typically provided a significant number
of new words and have sometimes also influenced pronunciation or word order. It is therefore difficult for a listener
to understand the utterances if the speaker is not speaking Standard Arabic.
However, it is almost impossible to use natural data if basic emotions are the subject of investigation: clear
emotional expression is rare in everyday situations, and recording people experiencing full-blown emotions is also
ethically problematic [11].
Therefore, it was decided to use acted utterances, obtained by recording 10 sentences acted in 4 emotions: Sadness,
Happiness, Anger and the non-emotional state, i.e. Neutral. The choice of the actors depended on
the naturalness and recognizability of the performance. Based on this, four actors belonging to a theatre group were
selected.
According to Burkhardt et al. [11], when creating sentences it is important that all of them be interpretable in the
emotions under review and that they contain no emotional bias. Two different kinds of text material would normally
meet these requirements:
• Nonsense text material, like for instance haphazard series of figures or letters, or fantasy words.
• Normal sentences which could be used in everyday life.
Nonsense material is guaranteed to be emotionally neutral. However, there is the disadvantage that actors will find
it difficult to imagine an emotional situation and to produce natural emotional speech spontaneously. This is why
nonsense material often results in stereotyped overacting.
In comparison with poems and nonsense sentences, the use of everyday communication has proved best, because
this is the natural form of speech under emotional arousal. Moreover, actors can immediately speak them from
memory. In the construction of the database, priority was given to the naturalness of speech material and thus
everyday sentences were used as test utterances. Table 2 shows the sentences used to create the database and their
translation to English.
The recording session was carried out in a room at the headquarters of the theatre group. Unlike a dedicated
recording studio, the room was not quiet or soundproof, so the recordings were occasionally interrupted by noise.
The recording was accomplished using GoldWave v5.11 and a Sony F-V120 Uni-Directional Vocal Microphone
connected to an HP Pavilion DV6920US 15.4-inch laptop. The utterances were saved as .wav files at a 16 kHz
sampling rate.
Table 2. Sentences used to create the Arabic emotional database

ﺍﻟﻜِﺘَﺎﺏُ ﻣَﻮﺟُﻮﺩٌ ﻋَﻠَﻰ ﺍﻟﺮﱠﻑِ ﺍﻷَﻭﱠﻝِ.
The book is on the first shelf.

ﻟَﻘَﺪ ﺭَﺃَﻳﺖُ ﻋَﻠِﻴﱠﺎً ﺍﻟﺒَﺎﺭِﺣَﺔَ ﻓِﻲ ﺍﻟﻨﱠﺎﺩﻱ.
I saw Ali in the club yesterday.

ﺳَﻨَﺬﻫَﺐُ ﺻَﺒَﺎﺡَ ﺍﻟﻐَﺪِ ﻟِﺸِﺮَﺍءِ ﺍﻷَﻏﺮَﺍﺽِ ﻣِﻦَ ﺍﻟﺴﱡﻮﻕِ.
Tomorrow morning we will go to buy things from the market.

ﻛَﺎﻥَ ﻟِﻘَﺎﺅُﻧَﺎ ﺑِﺎﻷَﺻﺪِﻗَﺎءِ ﺍﻟﺒَﺎﺭِﺣَﺔَ ﻣُﻔَﺎﺟِﺌﺎً.
Our meeting with the friends yesterday was a surprise.

ﺯَﺍﺭَﻧِﻲ ﺃَﺣﻤَﺪٌ ﻭَﺃَﺧُﻮﻩُ ﺻَﺒَﺎﺡَ ﺍﻟﻴَﻮﻡِ.
Ahmed and his brother visited me this morning.

ﺗُﻐﻠَﻖُ ﺍﻟﺢَﺩِﻳﻘَﺔُ ﺍﻟﺴﱠﺎﻋَﺔَ ﺍﻟﺘﱠﺎﺳِﻌَﺔَ ﻣَﺴَﺎءً.
The park closes at nine p.m.

ﺳَﺘُﻌﺮَﺽُ ﺍﻟﻤُﺒَﺎﺭَﺍﺓُ ﺍﻟﺴﱠﺎﻋَﺔَ ﺍﻟﺜﱠﺎﻣِﻨَﺔَ ﻋَﻠَﻰ ﺍﻟﻘّﻨَﺎﺓِ ﺍﻷُﻭﻟَﻰ.
The match will be shown at eight o'clock on Channel One.

ﻫَﻞِ ﺍﻟﺠِﻬَﺎﺭُ ﺍﻟﻤَﻮﺟُﻮﺩُ ﻋﻠﻰ ﺍﻟﻂﱠﺍﻭﻟِﺔِ ﻟﻚ؟
Does the machine on the table belong to you?

ﺳَﺄَﺫﻫَﺐُ ﺃَﻳَﺎﻡَ ﺍﻟﻌُﻄﻠَﺔِ ﻓِﻲ ﺭِﺣﻠَﺔٍ ﻣَﻊَ ﺃَﺻﺪِﻗَﺎﺋِﻲ.
I will go on a trip with my friends during the holiday.

ﺳَﻨﻨﺘَﻘِﻞُ ﺇِﻟَﻰ ﺷِﻘﱠﺔٍ ﺟَﺪِﻳﺪَﺓٍ ﺑَﻌﺪَ ﻳَﻮﻣَﻴﻦِ.
We will move to a new apartment in two days.

4.3 Perception test of recorded speech corpus

After the recording process was completed, a perception test was performed in order to verify the accuracy of the
recorded sentences and to make sure they carry the intended emotional elements. In this test, each utterance was
validated by a number of listeners who determined the emotion in each recording, and the determined emotion was
then compared with the pre-defined emotion.

A total of 15 persons participated in this test, five of them female, aged between 23 and 32 years, while the rest were
males of varying ages. All of them are Arabs whose mother tongue is Arabic, so that they could understand the
recorded utterances and their emotional elements. Table 3 shows the nationalities of the persons who performed the
test.

Table 3. Nationalities of the persons who performed the perception test

Nationality   Number of persons
Omani         7
Iraqi         2
Sudanese      2
Algerian      1
Yemeni        3
Total         15
The listeners were given a form that contains a table. Each row of this table has a name of a sound file and 4
emotions from which he/she has to choose one emotion. The listeners were not given the pre-defined emotion of each
file. He/she listened to the sound files randomly and had to choose one of the 4 emotions that he/she expected was
acted in that file.
The recognition rate of the utterances is 76.85%, which is acceptable compared to that of the German database,
86% [11], and the Danish database, 67% [12]. Table 4 shows the recognition rate obtained in the perception test for
each emotion.
As depicted in Table 4, Anger has the highest recognition rate at 83.2% while Happiness has the lowest at 68.5%.
Table 4. Recognition rate performed in the perception test for each emotion
Emotion Rate
Anger 83.2%
Happiness 68.5%
Neutral 75.7%
Sadness 80.0%
5. Experimental Results
The system was implemented in MATLAB, a numerical computing environment and programming language
developed by MathWorks.
Several experiments were conducted in order to assess the accuracy of the developed system. A perception test was
first carried out to make sure that the utterances sound natural. After the perception test, the utterance database was
divided into two sets, a training set and a testing set. The training set has 112 utterances, 28 per emotion, and was
used to train the HMM models. The testing set has 48 utterances, 12 per emotion, and was used in the detection
process to check the accuracy of the system. Table 5 shows the confusion matrix of the classification accuracy for
the four emotional states.
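The 112/48 split can be sketched as follows; the file-naming scheme, the random shuffling and the seed are illustrative assumptions, since the paper does not state how the 28/12 split per emotion was chosen:

```python
import random

def split_database(files_by_emotion, n_train=28, seed=42):
    """Split each emotion's 40 utterances into 28 training / 12 testing.

    files_by_emotion maps an emotion name to its list of utterance files.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for emotion, files in files_by_emotion.items():
        shuffled = files[:]          # copy so the input list is untouched
        rng.shuffle(shuffled)
        train[emotion] = shuffled[:n_train]
        test[emotion] = shuffled[n_train:]
    return train, test
```

With four emotions this yields the paper's 112 training and 48 testing utterances; note that the reported overall accuracy of 68.75% corresponds to 33 of the 48 test utterances being detected correctly.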
From Table 5, it is clear that Anger has the highest recognition rate at 83.3%; in the perception test, Anger also had
the highest rate. Happiness is the most difficult emotion to recognize: the recognition rate achieved was only 50%,
with the other three emotions being mistakenly detected in 50% of the test samples. Again, this mirrors the
perception test result, in which Happiness received the lowest rate. This result might be due to the way Arab people
express the happiness emotion: some express it loudly and others express it quickly.
[Table 5. Confusion matrix of the classification accuracy (%): acted emotions Anger, Happiness, Neutral and Sadness versus the emotion recognized]
6. Conclusion
The aim of this work is to develop a system that can detect four emotions in Arabic speech. The emotion states are
Anger, Happiness, Neutral and Sadness. An emotional speech database was created by 4 people who acted 10
sentences in the four emotions. The database was then divided into two sets, a training set and a testing set. We
faced some difficulties in the recording session, since we had to choose people with professional acting experience
so that the recorded sentences sound natural. Furthermore, the recording session should ideally have been held in a
studio to minimize noise, but such a studio was not available, so a quiet room was used instead.
The system's performance can be described as acceptable, since it achieved an overall recognition rate of 68.75%.
The recognition rate could be increased if the recording session were held in a studio and if the number of
utterances in the emotional database, especially in the training set, were increased.
The Arabic Speech Emotion Detector (ASED) system can be improved in several ways:
• In this work only four emotions have been covered and short sentences have been acted. In future work more
emotions might be added and a mixture of short and long sentences should be prepared.
• The Training and Testing speech database was prepared in advance. As an improvement, the system might allow
the user to record his/her own files for Training or Testing. In addition, the system could be given the ability to
remove noise in order to achieve a better recognition rate.
• Regarding the features, we used prosodic features, which gave a good result; however, in order to reduce the
overlap among the emotions, we recommend also using linguistic features. For example, a linguistic filter for
distinguishing between anger and happiness may increase the accuracy rate.
• Although the HMM was found to be a good classifier for speech emotion recognition, using another technique in
combination with the HMM might result in better recognition. One technique that can be combined is Vector
Quantization.
7. References
[1]Talwar, G., Kubichek, R. F. & Liang, H., 'Hiddenness Control of Hidden Markov Models and Application to Objective Speech
Quality and Isolated-Word Speech Recognition', Fortieth Asilomar Conference on Signals, Systems and Computers, California,
USA, 2006, pp. 1076 - 1080.
[2] Pao, T.-L., Liao, W.-Y., Chen, Y.-T., Yeh, Cheng, Y.-M. & Chien, C., 'Comparison of Several Classifiers for Emotion
Recognition from Noisy Mandarin Speech', Proceedings of the Third International Conference on International Information
Hiding and Multimedia Signal Processing, vol. 1, 2007, pp. 23-26
[3] Kwon, O.-W., Chan, K., Hao, J. & Lee, T.-W.,'Emotion Recognition by Speech Signals', 8th European Conference on Speech
Communication and Technology, Geneva, Switzerland, 2003, pp. 125-128
[4] Schuller, B., Rigoll, G., and Lang, M., 'Hidden Markov model-based speech emotion recognition', IEEE International
Conference on Acoustics, Speech, and Signal Processing, Maryland, USA, 2003, pp. 1-4
[5] Lee, C. M., Yildirim, S., Bulut, M., Abe, K., Busso, C., Deng, Z., Lee, S. & Narayanan, S.,'Emotion recognition based on
phoneme classes', 8th International Conference on Spoken Language Processing, Jeju Island, Korea, 2004, pp. 889-892
[6] El Ayadi, M.M.H., Kamel, M.S. & Karray, F.,'Speech Emotion Recognition using Gaussian Mixture Vector Autoregressive
Models', IEEE International Conference on Acoustics, Speech and Signal Processing, Hawai'i, U.S.A., 2007, pp.IV-957 - IV-
960
[7] Lin, Y.-L. & Wei, G., 'Speech emotion recognition based on HMM and SVM', Proceedings of the Fourth International
Conference on Machine Learning and Cybernetics, Guangzhou, China, 2005,pp.4898-4901.
[8] Abdulla, W. & Kasabov, N.,'The Concepts of Hidden Markov Model in Speech Recognition', Technical Report, Knowledge
Engineering Lab, Department of Information Science, University of Otago, New Zealand, 1999.
[9] Jurafsky, D. & Martin, J., Speech and Language Processing, 2nd edn, Prentice Hall, New Jersey, USA, 1999.
[10] Huang, X., Acero, A., & Hon, H.-W. Spoken Language Processing: A Guide to Theory, Algorithm and System Development,
Prentice Hall PTR, New Jersey, 2001, pp. 288- 304
[11] Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. & Weiss B., 'A database of German emotional speech', Proceedings
of the INTERSPEECH, Lissabon, Portugal, 2005, pp. 1517-1520.
[12] Engberg, I.S. & Hansen, A.V., 'Documentation of the Danish emotional speech database (DES)', Internal AAU Report,
Center for Person Kommunikation, Denmark, 1996.