
Language Identification Using

Visual Features
Jacob Laurence Newman
A thesis submitted for the Degree of
Doctor of Philosophy
University of East Anglia
School of Computing Sciences
May, 2011
© This copy of the thesis has been supplied on condition that anyone who consults it is understood to recognise that its copyright rests with the author and that no quotation from the thesis, nor any information derived therefrom, may be published without the author's prior written consent.
Abstract
Language identification (LID) is the task of attributing a spoken language to an utterance of speech. Automatic visual LID (VLID) uses the external appearance and dynamics of the speech articulators for this task, in a process known as computer lip-reading. This has applications to LID where conventional speech recognition is ineffective, such as noisy environments or where an audio signal is unavailable.

This thesis introduces supervised and unsupervised methods for VLID. They are based upon standard audio LID techniques, which use language phonology for discrimination. We test our unsupervised method speaker-dependently, identifying between the languages spoken by individual multilingual speakers. Rate-of-speech and recording session biases are investigated. We present ways of improving the speaker independency of our active appearance model (AAM) features, in tasks to identify between English and French, then later, Arabic and English. We investigate whether a lack of articulatory information in the AAM features limits phone recognition performance, and enquire how it could be improved were more information available.

We show that VLID is possible in both speaker-dependent and independent modes, and that LID using audio features gives, as expected, superior performance. Rate-of-speech, which can indicate language fluency, is shown to assist speaker-dependent discrimination. Recognition of spoken phones from visual features is shown to be poor, and increased recognition accuracy is found to be key to improved VLID. Finally, we show that the articulatory information contained by our AAM features gives equal phone recognition performance to features containing information regarding the front-most speech articulators.
for my Grandma, who has my love, as always.
Acknowledgements
My thanks to Stephen Cox for his expert supervision. His intelligence, level-headedness,
receptiveness, and good humour have made the last three years possible. Thanks
also to my viva examiners, Ben Milner and Jon Barker.
From the LILiR team, thank you to Yuxuan Lan for her patience, knowledge and time. She has been a pleasure to work with. I am very grateful to Richard Harvey for his input and for advising me to apply for this degree in the first place. Thank you to Barry Theobald for his extremely useful assistance, including his short tenure as my primary supervisor. Further thanks to Sarah Hilder and Barry Theobald for a memorable final conference in Japan.
For their company, friendship, and academic input, thank you to the guys in the
lab, particularly the following people: Chris Watkins, Osama Dorgham, Ibrahim
Almajai, Matthew Stocks, Oliver Kirkland and Luke Davis. An extended thanks to
my best mate, Aron Bennett, and to my girlfriend, Eleanor Leist.
Thank you to my parents, Mark and Lesley, for their love, food and humour.
Particular thanks to my Mum for all the cakes and bacon sandwiches, and to my
Dad for the reasonable room rate. A special thanks to my dog, Oscar, for not eating
this thesis.
Contents
List of Abbreviations
List of Figures
List of Tables

1 Introduction
  1.1 Motivation and Aims
  1.2 Thesis Structure

2 Technical Background
  2.1 Introduction
  2.2 Phone Recognition Followed by Language Modelling for Language Identification
  2.3 Distance Metrics
  2.4 Active Appearance Models
  2.5 Vector Quantisation
  2.6 HMM Speech Recognition
    2.6.1 Monophone HMMs
    2.6.2 Triphone HMMs
  2.7 N-Gram Language Modelling
  2.8 Linear Discriminant Analysis
  2.9 Support Vector Machines

3 Description of Datasets
  3.1 Introduction
  3.2 United Nations Multilingual Video Dataset (UN1)
  3.3 United Nations Native English and Arabic Video Dataset (UN2)

4 Literature Review
  4.1 Introduction
  4.2 Human Speech and Language
  4.3 Audio Language Identification Techniques
    4.3.1 National Institute of Standards and Technology Language Recognition Evaluation Task
    4.3.2 Phone-Based Tokenisation
    4.3.3 Gaussian Mixture Model Tokenisation
    4.3.4 Support Vector Machines
    4.3.5 Large Vocabulary Continuous Speech Recognition
  4.4 Audio-Visual Approaches
  4.5 Human LID Experiments
  4.6 Visual-Only Speech Recognition
  4.7 Conclusions

5 Speaker-Dependent VLID
  5.1 Introduction
  5.2 Approach and Dataset
  5.3 VLID using AAMs and Vector Quantisation
    5.3.1 Features: Active Appearance Models
    5.3.2 Feature Modelling: Vector Quantisation
    5.3.3 Language Model Likelihood Classification
  5.4 Experiments
    5.4.1 Initial Experiments
    5.4.2 Removing Rate of Speech
    5.4.3 Testing Rate of Speech
    5.4.4 Testing Session Biases
  5.5 Conclusions

6 Speaker-Independent VLID: UN1
  6.1 Introduction
  6.2 Approach and Dataset
    6.2.1 VLID Dataset
  6.3 Parallel Viseme Recognition Followed by Language Modelling
    6.3.1 Features: Active Appearance Models
    6.3.2 Viseme Modelling and Phoneme to Viseme Mapping
      6.3.2.1 Tied-State Multiple Mixture Triphone HMMs
    6.3.3 Language Modelling and SVM Classification
  6.4 Experiments
    6.4.1 Speaker-independent VLID Results
    6.4.2 Simulated Viseme Error on VLID Performance
  6.5 Conclusions

7 Speaker-Independent VLID: UN2 Dataset
  7.1 Introduction
  7.2 Experimental Setup
  7.3 PRLM Using Visual Phones
    7.3.1 Experimental Parameters
    7.3.2 Results
  7.4 Eliminating Skin Tone
    7.4.1 Results
  7.5 Arabic Phone Recognition
  7.6 PPRLM using Visual Phones
    7.6.1 Results
  7.7 Conclusions

8 Limitations of Visual Speech Recognition
  8.1 Introduction
  8.2 Dataset and Features
  8.3 Experiments
    8.3.1 Speaker-Dependent Articulatory Features
    8.3.2 Speaker-Independent Articulatory Features
    8.3.3 Investigating Sampling Rate
  8.4 Conclusions

9 Conclusions and Future Work
  9.1 Introduction
  9.2 Discussions and Conclusions
  9.3 Further Work
  9.4 List of Publications

A United Nations Universal Declaration of Human Rights Script in English

Bibliography
List of Abbreviations
Abbreviation Meaning
AAM Active Appearance Model
AV Audio-Visual
ASR Automatic Speech Recognition
CMOS Complementary Metal-Oxide-Semiconductor
DCT Discrete Cosine Transform
EMA Electromagnetic Articulography
FMN Feature Mean Normalisation
GMM Gaussian Mixture Model
GPDF Gaussian Probability Density Function
HD High Definition
HMM Hidden Markov Model
HTK Hidden Markov Model Toolkit
IMELDA Integrated Mel-scale representation with LDA
IPA International Phonetic Alphabet
LDA Linear Discriminant Analysis
LID Language Identification
MFCC Mel-Frequency Cepstral Coefficients
PCA Principal Component Analysis
PDF Probability Density Function
PDM Point Distribution Model
PRLM Phone Recognition followed by Language Modelling
PPRLM Parallel Phone Recognition followed by Language Modelling
SVM Support Vector Machine
UN United Nations
VLID Visual Language Identification
VQ Vector Quantisation
List of Figures
1.1 A system diagram showing a language identifier as a sub-system for automatic translation.
2.1 Visual-only language identification: The task of finding the spoken language from a sequence of video frames.
2.2 A system diagram of the phone recognition followed by language modelling approach to audio LID.
2.3 A system diagram of parallel phone recognition followed by language modelling.
2.4 In this figure, the Euclidean distance between A and C is smaller than A to B. With the Mahalanobis distance, using the diagonal covariance matrix represented by the red ellipse, points A and B are closer than A and C.
2.5 An example of hand-labelled landmark points on the outer and inner lip boundaries.
2.6 Example images representing extremal mouth configurations (both shape and appearance) required for training an AAM.
2.7 The first two modes of variation of the shape component of an AAM, varying between ±3 standard deviations from the mean. The first mode appears to capture variation due to mouth opening and closing, and the second appears to capture variation due to lip-rounding.
2.8 The mean and first three modes of variation of the appearance component of an AAM. The appearance images have been suitably scaled for visualisation.
2.9 A plot showing the cumulative variation captured by each AAM appearance dimension.
2.10 A figure illustrating vector quantisation. Vectors x, y and z are coded by their closest VQ codewords, C1, C2 and C3, respectively.
2.11 A dendrogram illustrating an example of hierarchical clustering.
2.12 A diagram of a hidden Markov model.
2.13 An illustration of a Gaussian probability density function (GPDF). For an input feature value, the function returns a probability of that vector (or scalar, in this example) being observed from this distribution. The parameter μ determines which feature value the distribution is centred around, and σ specifies the spread of the function. These values are calculated from the data that the function is required to model. Nearly 68% of feature values are contained within one standard deviation from the mean.
2.14 An illustration of a Gaussian mixture model. The two dashed, blue lines are separately weighted Gaussian mixture components (each with its own mean and variance). These Gaussians approximately model the actual feature distribution (shown as a solid red line), which itself is not Gaussian.
2.15 An example of two triphone contexts for the Arpabet phone, [AA]. Left contexts are shown in red, and right contexts are shown in blue.
2.16 An example triphone context for the phone [aa] and clustering questions for the phone [d]. This figure explains what is meant by the left and right phone contexts, and by broad and narrow clustering questions.
2.17 N-grams: unigrams, bigrams and trigrams.
2.18 This figure demonstrates the effect of different insertion penalty values on varying lengths of hypothesised word sequences, as in HTK. In this example, each word has an acoustic log-likelihood of -5. When the insertion penalty is 0, the addition of each new word decreases the total log-likelihood of the sequence only slightly (the sum of the acoustic likelihoods), and it is conceivable that a word sequence of any length could be selected as the hypothesis (though, more generally, longer sequences have lower likelihoods). When the penalty is 20, longer sequences are seen to have a much greater likelihood than shorter ones. This is because the addition of each new word adds 20 to the acoustic likelihood. When the penalty is -20, the opposite effect is apparent, as each additional word's likelihood is reduced by 20, lowering the overall likelihood of longer sequences versus the shorter ones.
2.19 This diagram illustrates LDA in a two-class, 2D case. The circles and squares represent data from two separate classes. LDA seeks to find w, a projection of the data, which maximises the distance between the class means, whilst minimising the within-class variance. In this example, the data is projected onto a 1D line which maximises the ratio between these two measurements. By contrast, projecting the data onto the other LDA axis (perpendicular to w) would result in no separation between the two classes.
2.20 Support Vector Machine Linear Case
2.21 The red data in (a) is non-linearly separable from the blue data in 2D space. In (b), the red data has become linearly separable by projecting the data into a higher-dimensional space.
3.1 Studio arrangement for the recording of the UN1 dataset. A camera records the whole face of the speaker, whilst the speaker recites the database text as projected onto a screen. The speaker controls the rate of text presentation with a mouse. Lighting is diffused by a reflective surface to illuminate the face of the speaker.
3.2 Example video frame from the UN1 dataset
3.3 Example frame showing the number and location of the tracked landmarks corresponding to the lips in the UN1 dataset.
3.4 Studio arrangement for the recording of the UN2 dataset. A camera records the mouth region of the speaker, whilst the speaker recites the database text as shown on a laptop, from which they can control the rate of text presentation. Lighting is diffused by a reflective surface to illuminate the face of the speaker.
3.5 Example video frame from the UN2 dataset
3.6 Example frame showing the number and location of the tracked landmarks corresponding to the lips in the UN2 dataset.
4.1 A figure showing the vocal tract, and containing the position of the speech articulators and other important anatomy for the production of speech.
4.2 A figure from Zhu et al. [2008] showing DET curves to compare the performance of three audio LID systems, in two separate tasks. The smaller the area under a curve, the better the performance of the system. Each DET curve is quite straight, meaning that the likelihoods from each system follow normal distributions.
4.3 A system diagram of the phone recognition followed by language modelling approach to audio LID.
4.4 A diagram showing two examples of separating hyperplanes. The left-most image shows a non-maximum margin, whilst the right-most image shows a maximum margin, as used by SVMs for classification.
4.5 A diagram showing an approach to language identification using an LVCSR for each language to be recognised. For a given speech utterance, a likelihood is generated by each speech recogniser and is classified either by the recogniser producing the maximum likelihood, or by a more sophisticated classification process.
4.6 This figure is adapted from Lan et al. [2009] and presents a high-level illustration of how Sieve features are generated. A vertical scan-line from a greyscale version of the mouth sub-image (a) is shown as an intensity plot (b). The granularity spectrum from an m-sieve with positive/negative granules shown in red/blue (c). These granules are then counted, or summed, over all scan-lines to produce the scale-histogram (d).
5.1 Unsupervised VLID system diagram, using VQ tokenisation of AAM feature frames. This system is based upon the PRLM architecture used in audio LID.
5.2 A VLID system is trained on two separate recitals of the UN declaration, read by one speaker in two different languages: English and Arabic. The task is to identify the language of some unseen test data from that speaker. This plot presents the results for this experiment for a range of different VQ codebook sizes.
5.3 A VLID system is trained on two separate recitals of the UN declaration, read by one speaker in two different languages: English and German. The task is to identify the language of some unseen test data from that speaker. This plot presents the results for this experiment for a range of different VQ codebook sizes.
5.4 A VLID system is trained on three separate recitals of the UN declaration, read by one speaker in three different languages: English, French and German. The task is to identify the language of some unseen test data from that speaker. This plot presents the results for this experiment.
5.5 This plot presents the VLID error for each of the multilingual speakers tested in this section. The results shown are for the 256 VQ-codeword systems as presented in Figures 5.2, 5.3 and 5.4. Additionally, the mean performance for the three speakers (using 256 codewords) is presented.
5.6 A VLID system is trained on three separate recitals of the UN declaration, read by the same speaker, in English, at three vastly different speeds. The task is to see if there are sufficient differences between the extreme recital speeds that facilitate discrimination, despite the three recitals containing the same language. This plot presents the results for this experiment.
5.7 A VLID system is trained on three separate, identical recitals of the UN declaration in English, spoken by the same speaker. The task is to see if there are recording session biases which facilitate discrimination, despite the three recitals containing the same language. This plot presents the results for this experiment.
6.1 Visual-only LID System Diagram
6.2 Histograms of the original AAM feature dimensions for a single speaker from the UN dataset. Histograms are presented for the first two shape and appearance dimensions. The figure shows that mean and variance varies across dimensions, as well as scale, and that the distributions are approximately Gaussian.
6.3 A 2D distance-preserving projection (Sammon map [Sammon, 1969]) of 250 randomly sampled un-normalised AAM features from each of the 5 UN dataset speakers used in Chapter 6. The figure shows that each speaker is well separated, and therefore that there is strong speaker dependency encoded by AAM features.
6.4 This figure is the same as Figure 6.3, except a z-score normalisation was applied to the features of each speaker. Here, we can see that the speakers' feature spaces are closer than in Figure 6.3, suggesting a possible improvement to the speaker independency of our features.
6.5 A diagram showing the method of back-end classification for PPRLM VLID using the language model likelihoods. The process is split into two streams which are ultimately combined into an SVM classifier. The first stream uses the ratios between the likelihoods from each language model, calculated independently for each viseme recogniser. The second stream applies an LDA transformation to the original likelihoods, and then calculates the ratio between new likelihoods generated from language-dependent GPDFs.
6.6 Speaker-independent VLID results. We employ cross-fold validation across our five speakers, holding out one speaker each time for testing. The task is to identify the language of an unseen utterance as either English or French.
6.7 The effect of viseme accuracy on VLID recognition
7.1 A system diagram of the PRLM language identification architecture applied to visual speech.
7.2 Example confusion matrix for all test data in a speaker-independent visual-only phone recogniser. The labels running vertically on the left-hand side represent input phones from the ground-truth transcription, and each row represents the recognition confusions for each of the specified input phones.
7.3 Example confusion matrix for all test data in a speaker-independent visual-only phone recogniser. The colour of each element represents its value and each is scaled by the sum of the row. The original values from this matrix are shown in Figure 7.2. The labels running vertically on the left-hand side represent input phones from the ground-truth transcription, and each row represents the recognition confusions for each of the specified input phones.
7.4 Example confusion matrix for all test data in a speaker-independent visual-only phone recogniser. The colour of each element represents its value and each is scaled by the sum of the column. Each column shows the phones that were confused with the phone represented by the column (or where insertions took place). The original values from this matrix are shown in Figure 7.2. The labels running vertically on the left-hand side represent input phones from the ground-truth transcription, and each row represents the recognition confusions for each of the specified input phones.
7.5 Language model log-likelihood scores for phone sequences recognised by an English visual-phone recogniser. The language models were built from the training data from their respective languages. The English training data shown here was also used to train the phone models, whereas the Arabic training data was not used. The plot shows that the English test data and Arabic training data are grouped together and are well separated from the English training data.
7.6 Language model log-likelihood scores for phone sequences recognised by an English visual-phone recogniser, trained on a development set of English subjects, and excluding the training data shown on the plot. The language models were built from the training data from their respective languages. The plot shows that the English and Arabic training data are not as well separated as in Figure 7.5, but that the English test data is closer to the English than to the Arabic training examples.
7.7 Speaker-independent VLID results using a PRLM approach with the UN2 visual-only dataset. Each plot line represents the performance of a test speaker.
7.8 Audio LID results corresponding to the VLID results presented in Figure 7.7, using a PRLM architecture and MFCC features from the UN2 dataset.
7.9 A plot showing the difference between the results shown in Figure 7.7 (Visual PRLM) and Figure 7.8 (Audio PRLM), for each speaker. Positive values indicate a reduction in error from using audio features. The audio performance always equals or outperforms the corresponding VLID result.
7.10 A diagram of the processes involved in histogram equalisation. Each colour channel is processed independently, and just the mouth region is used. The equalising mapping function is generated from a sample of training images, and provides a mapping for input pixel intensities. The histograms show the count of the pixels (y-axis) for each intensity value (x-axis). Sample histograms are shown before and after the mapping has been applied. For each colour channel, we can see that the distribution of pixels across the intensity range is approximately even after the equalisation process has been applied.
7.11 Example frames from two Arabic and two English subjects, before and after histogram equalisation.
7.12 A synthetic example showing mouth region pixels, and a 3 x 3 grid centred over a pixel which has been hand-labelled as containing teeth. The structure of the feature vector corresponding to the tooth pixel is also shown.
7.13 Example frames from two Arabic and two English subjects, before and after tooth classification.
7.14 Speaker-independent VLID results for 19 test subjects from the UN2 dataset. These results relate to the histogram equalisation experiment described in Section 7.4.
7.15 A plot showing the difference between the results shown in Figure 7.14 (Visual PRLM with histogram equalisation) and Figure 7.7 (Visual PRLM without histogram equalisation), for each speaker. Positive values indicate an increase in error from using the histogram equalisation normalisation, whereas negative values indicate a reduction in error.
7.16 Speaker-independent VLID results for 19 test subjects from the UN2 dataset. These results relate to the AAM shape-only experiment described in Section 7.4.
7.17 A plot showing the difference between the results shown in Figure 7.16 (Visual PRLM using shape) and Figure 7.14 (Visual PRLM using histogram equalisation), for each speaker. Positive values indicate an increase in error from using the shape features, whereas negative values indicate a reduction in error.
7.18 Speaker-independent VLID results for 19 test subjects from the UN2 dataset. These results relate to the tooth recognition experiment described in Section 7.4.
7.19 A plot showing the difference between the results shown in Figure 7.18 (Visual PRLM using tooth recognition) and Figure 7.14 (Visual PRLM using histogram equalisation), for each speaker. Positive values indicate an increase in error from using the tooth recognition features, whereas negative values indicate a reduction in error.
7.20 A screenshot of the ElixirFM online interface. The figure shows an input word in Arabic script, two separate ways to tokenise the word (as one word, or as two tokens), and phonetic representations corresponding to different inflective forms of the various citations.
7.21 Speaker-independent VLID results for 19 test subjects from the UN2 dataset. These results relate to the PPRLM experiment using histogram-equalised video described in Section 7.6.
7.22 A plot showing the difference between the results shown in Figure 7.21 (Visual PPRLM using histogram equalisation) and Figure 7.14 (Visual PRLM using histogram equalisation), for each speaker. Positive values indicate an increase in error from using the PPRLM architecture, whereas negative values indicate a reduction in error.
7.23 A plot showing the mean performance of each VLID system described in this chapter, and results generated from audio LID experiments on the UN2 dataset.
8.1 A diagram of the midsagittal plane, showing the position of the EMA sensors in red and the name of the articulators that they are used to track. The nose sensor is not used for recognition.
8.2 Speaker-dependent phone recognition using articulatory features. The left-most label on the x-axis denotes that all articulators were used by the recogniser. Each subsequent score denotes the removal of an articulator (named on the axis), and the removal is cumulative moving left to right (e.g. -dorsum indicates that both the velum and the dorsum were not included). The plot shows the mean accuracy averaged over the five folds, and the error bars denote ±1 times the standard error. The accuracy (and the trend) for both speakers is very similar, and as is expected the performance degrades as information is withheld from the recogniser.
8.3 Comparing the performance of speaker-independent phone recognition using articulatory features and AAM features. Note the point of intersection between the two curves. AAM-based features capture the visible shape and appearance information, and perform as well as the EMA features when the articulators from tongue blade/tip forward are available.
8.4 A plot showing the mean squared difference between various frame rates of AAM features and a 60 frames per second signal. Each frame rate was generated by downsampling the 60 frames per second signal, and then upsampling back to the original rate.
List of Tables
3.1 A brief summary of the UN1 and UN2 datasets.
3.2 UN1: Multilingual Database
3.3 UN1: Multilingual Database Continued
3.4 UN2: English & Arabic Database - The English Speech
3.5 UN2: English & Arabic Database - The Arabic Speech
4.1 NIST LRE 2009 Target Languages
6.1 Phoneme to viseme mapping
6.2 Viseme recognition results
7.1 UN2 test subjects used in all VLID experiments
7.2 UN2 subjects used in PRLM VLID experiments for training the visual phone HMMs
7.3 Speaker-independent visual-phone recognition performance, using randomly generated features and a bigram language model. The parameters for recognition were 10 for the grammar scale factor and 12 for the insertion penalty (which we found gave optimal phone recognition accuracy). The results show that an impression of discrimination is given when using a language model, even though the underlying features are meaningless.
7.4 Speaker-independent visual-phone recognition performance of the English training data used to train the recogniser. The results presented relate to the experiment described in Section 7.3.
7.5 Speaker-independent visual-phone recognition performance. The results presented are for the English test data used in the experiment described in Section 7.3.
7.6 Speaker-independent audio-phone recognition performance of the English training data used to train the recogniser. Each plot line represents the performance of a test speaker. The results presented relate to the audio-only experiment described in Section 7.3.
7.7 Speaker-independent audio monophone recognition performance. The results presented are for the English test data used in the audio-only experiment described in Section 7.3.
7.8 Speaker-independent visual-phone recognition performance of the English training data used to train the recogniser. The results presented relate to the histogram equalisation experiment described in Section 7.4.
7.9 Speaker-independent visual-phone recognition performance of the English test data. The results presented relate to the histogram equalisation experiment described in Section 7.4.
7.10 Speaker-independent visual-phone recognition performance of the English training data used to train the recogniser. The results presented relate to the AAM shape-only experiment described in Section 7.4.
7.11 Speaker-independent visual-phone recognition performance of the English test data. The results presented relate to the AAM shape-only experiment described in Section 7.4.
7.12 Speaker-independent visual-phone recognition performance of the English training data used to train the recogniser. The results presented relate to the tooth-recognition experiment described in Section 7.4.
7.13 Speaker-independent visual-phone recognition performance of the English test data. The results presented relate to the tooth-recognition experiment described in Section 7.4.
7.14 Dictionary pronunciations for the Arabic word shown in Figure 7.20. The dictionary entry is the first possible citation form of the Arabic word, generated by the morphological analysis process. Each pronunciation represents an inflective form of the various citations.
8.1 Speaker-independent audio monophone recognition performance, as presented in Section 7.3.2. This phone recognition system was trained and tested using MFCC features sampled at 100Hz.
8.2 Speaker-independent audio monophone recognition performance, as presented in Section 7.3.2. This phone recognition system was trained and tested using MFCC features sampled at 30Hz.
Chapter 1
Introduction
1.1 Motivation and Aims
Lip-reading is the process of using the visual appearance of the mouth during speech to infer its acoustic content. Visual speech cues are known to affect sound perception [McGurk and MacDonald, 1976] and are used by humans [Summerfield, 1992] and machines [Matthews et al., 2002] to improve speech intelligibility. This is not a trivial task, however, since it is well known that many speech sounds that are distinctive acoustically are indistinguishable on the lips [Fisher, 1968]. Language can be portrayed by different modes of communication, such as writing, speaking, and gesturing (as in sign language). Language identification (LID) is the process of ascertaining which language the information is being presented in. Audio language identification is a mature technology, able to discriminate between tens of spoken languages with just a few seconds of representative speech. Figure 1.1 illustrates a typical application of a LID system for a more general function, in this case language selection for translation. Given the success of LID approaches in the audio domain, it is interesting to enquire whether language can be discriminated automatically by purely visual means, i.e. by computer lip-reading. This novel research task is the focus of the research described in this thesis.
[Figure 1.1: unidentified speech enters a language identifier, which routes it to the appropriate translator (Arabic to English, French to English, or German to English), producing English speech for an English recipient.]
Figure 1.1: A system diagram showing a language identifier as a sub-system for automatic translation.
Although visual language identification (VLID) is a new and unexplored area of research, it has several applications, both for scientific research and for practical deployment. The use of visual-only lip-reading can be considered as an extreme case of audio-visual speech recognition, where the audio channel is completely uninformative, caused by a low signal-to-noise ratio, or where there is a disparity between the conditions of the training and testing environments [Acero and Stern, 1990]. VLID can assist audio-visual speech recognition by improving visual recognition features, either in terms of the extraction process, the features themselves, or how they are applied to recognition. If the usefulness of the visual signal can be bettered, it could then be integrated into existing audio-visual technology [Potamianos et al., 2003].

As well as the benefits to the scientific community, there are a number of real-world applications to this research. As Figure 1.1 shows, LID applications are usually part of a larger system, specifically as a sub-system whose function is to identify the language of some speech, from which the appropriate language-dependent system (speech recogniser, translator, etc.) can be selected. A practical example of this is a public information terminal, where natural human interaction would be speech-based and where the spoken language of the user would be unknown. Environmental noise might prohibit conventional LID approaches in such environments, and hence VLID could benefit such a system.
In 2002, an estimated 4.2 million closed-circuit television cameras (CCTV) were operational in the UK, as reported by McCahill and Norris [2002]. The primary functions of CCTV are as a preventative measure against crime and to assist in the conviction of criminals who have been filmed performing illegal activities. The footage from CCTV is often silent or the audio signal is noisy, which makes automatic speech and language recognition a difficult task. There are obvious advantages to being able to automatically transcribe the speech of criminal suspects from CCTV footage, and such a system would most likely have to be multilingual, meaning that a VLID system would have to be incorporated.

Transcriptions of speech compiled by expert speech-readers from CCTV footage are an admissible source of evidence in a court of law. In the case of R v Luttrell [2004], a transcription from 10 minutes of CCTV footage was used to secure a criminal conviction against defendants accused of handling stolen goods. An unsuccessful appeal was launched, and it was ruled that given acceptable CCTV video quality and an adequately competent speech-reading expert, the use of CCTV transcriptions is currently an acceptable form of evidence [Keane et al., 2010]. In practice, the accuracy of such evidence limits its use, as experiments have shown that human performance at lip-reading varies greatly and typically results in word-level accuracies of around 50%. If an automated system could minimise the error of such transcriptions, or even outperform humans, it may become a powerful surveillance tool capable of securing criminal convictions in court.
Scientific developments have historically echoed technological themes portrayed by science fiction [Chen, 2010]. VLID could conceivably underpin more futuristic applications. In the science fiction series Star Trek, a device called the universal translator has the ability to automatically and seamlessly translate any language. Stanley Kubrick's 2001: A Space Odyssey depicts two astronauts who have hidden in an airlock in order to not be heard by their sentient computer pilot, only for it to lip-read them through a glass door. Practical realisations of these types of technologies would require a method of overcoming acoustic noise in speech recognition, better speaker independence, and would also require a language selection component. These research issues are addressed to some extent by the work in this thesis.
The aim of this project is to develop a system that can identify between languages using only visual speech information. Given representative training data for each of the languages to be discriminated, the linguistic identity of an unseen sample of speech must be determined. This definition alone presents our first research question: can languages be distinguished by lip-reading? Although VLID is a novel research task, by separating this project into its constituent research areas (Chapter 4), we can find the approaches adopted in related fields that might be applicable here. Such a review should highlight the technical difficulties and limitations introduced by each task or concept, inevitably determining the precise research direction that this work will take. If languages can be distinguished by lip-reading, then the next question is: how can an automatic system discriminate languages? We can hypothesise the answer to that question from our literature review and test it empirically. Finally, assuming we can develop an automatic system for this task, the final question posed by this thesis is: how robust is automatic language identification? With this question we aim to discover the limitations of the developed system, overcome existing problems where necessary, and suggest the future direction of this research.
1.2 Thesis Structure
This section describes the content and arrangement of the remaining chapters in
this thesis.
Chapter 2 serves as a reference for some of the background technical details required throughout the body of this thesis. It contains details of many of the existing and fundamental methods of computer vision, speech recognition and machine learning applied herein. The chapter starts with a brief introduction illustrating the basics of the VLID task and some important speech recognition concepts. It then describes a method for language discrimination which relies on language phonology to identify between languages, and then in turn discusses the processes involved in each sub-system. This includes active appearance models (AAMs) for lip tracking and feature extraction, vector quantisation (VQ) for frame tokenisation, hidden Markov models for modelling speech, and language models. Linear discriminant analysis (LDA) and support vector machines (SVMs) are outlined as methods for classification and improving feature discrimination.
We introduce two new datasets for visual language discrimination in Chapter 3. Both sets include studio-recorded footage of speakers reciting the Universal Declaration of Human Rights. The first dataset, UN1, is used in the speaker-dependent experiments described in Chapter 5, and the preliminary speaker-independent experiments in Chapter 6. It contains standard definition video, and camera-captured audio of multilingual subjects. The second dataset, UN2, contains high definition video with accompanying high quality audio. The speakers recorded are native English and native Arabic speakers, used in our further speaker-independent experiments described in Chapter 7.

Chapter 4 surveys the fields of research which relate to the topic of VLID. This chapter presents both the fundamental and state-of-the-art techniques adopted in the research areas of audio LID, audio-visual speech recognition and visual-only recognition. It also presents a brief illustration of the concepts behind human speech and language, and research by psychologists into human capabilities of language discrimination by visual means. This chapter motivates the direction of the research conducted in this thesis, by exploring the limitations of existing relevant techniques and using existing knowledge to decide upon realistic goals for improvements that this work can offer.
Our first experiments into VLID are conducted in Chapter 5. This chapter describes an unsupervised method of VLID, in which AAM feature vectors are tokenised into VQ frames, and language models are built from the sequences of VQ symbols produced by language-dependent training data. This system uses language phonology as the feature for discriminating languages. We evaluate this system speaker-dependently, on three separate multilingual speakers. Two further tests are performed to see how sensitive the system is to non-language clues, including the construction of a system to discriminate three recitals of the UN Declaration read at vastly different speeds, and three apparently equal recitals.
Leading on from our speaker-dependent experiments, Chapter 6 is concerned with speaker independence, where the speaker used for testing has not been used in training. This work moves from an unsupervised approach to one which requires word-level transcriptions of the training data. Our method of feature tokenisation here is the recognition of confusable classes of phones, known as visemes. We use our UN1 dataset for this work, and the task is to discriminate between English and French, as spoken by five multilinguals, and tested using five-fold cross validation. We present two methods for improving the speaker independence of our AAM features, and we show the performance of our viseme recognisers. A simulation is performed to demonstrate how language discrimination varies as the accuracy of our recognisers increases, and to determine the improvements in phone recognition required for better VLID performance.
Chapter 7 describes our final work into speaker-independent VLID, using our custom UN2 dataset of high definition video and high quality audio. This chapter is aimed at repeating the experiments of the previous chapter but using a much larger dataset of speakers, and only concentrating on the features of language produced by native speakers. Here, we build a system to discriminate between English and Arabic visual speech, two phonologically dissimilar languages. Instead of recognising superclasses of confused phones, namely visemes, we recognise individual spoken phones visually, which we term visual-phones. Initially, we implement a phone recognition followed by language modelling (PRLM) system to simplify the evaluation of our techniques, and towards the end of the chapter we demonstrate the parallel PRLM (PPRLM) architecture. Part of this work focuses on discarding a subject's skin tone as a possible method of discrimination. Firstly, we histogram-equalise our video frames, secondly we use only shape parameters for our visual-phone recognition, and finally we build a tooth-pixel recogniser to generate binary image masks of the tooth regions, and then extract our AAM features from those images.
Based upon the limitations of VLID highlighted by the previous chapters, we next investigate the limitations of our AAM features by comparing their performance for visual-phone recognition to features derived from the physical movements of the speech articulators (Chapter 8). The dataset we use for the AAM features is the UN2 dataset, and the physical tracking of the articulators is from the MOCHA-TIMIT dataset, as captured by electromagnetic articulography (EMA). This chapter aims to see if phone recognition accuracy using AAM features is comparable to recognition using an amount of directly tracked articulatory information similar to what is visible from an external, frontal view of the mouth. We also look at the performance characteristics as we remove articulatory information from our speech recognisers, and as we add more information than is visible externally. A comparison between the sample rates of audio and visual features is given, to determine whether the visual signal is undersampled. Audio phone recognition performance on the MOCHA-TIMIT data is presented as a benchmark of the current state-of-the-art in audio techniques.
Chapter 9 concludes this thesis by summarising the results and findings of the preceding chapters. Limitations of the work presented here are discussed, and further work to address these issues is suggested. Finally, a list of first-author publications resulting from this work is given.
Chapter 2
Technical Background
2.1 Introduction
[Figure 2.1: a sequence of video frames x1 . . . xT is passed to a language identifier, which outputs the spoken language, e.g. English, French, German or Arabic.]
Figure 2.1: Visual-only language identification: The task of finding the spoken language from a sequence of video frames.
Language identification (LID) is the process of classifying a representative sample of speech, text, or other media, by the language in which it is spoken or written. Audio LID is a mature technology, able to achieve high recognition accuracies with just a few seconds of clean speech. The task is therefore, given reasonably high and consistent quality video footage of people's mouth regions when speaking fluently and continuously, to identify the spoken language of an unknown subject (Figure 2.1). Although there have been some explanatory papers that explore the human capability to recognise spoken language by lip-reading [Ronquest et al., 2010; Soto-Faraco et al., 2007], the work described here is the first attempt to develop an automatic computer system to perform this task, which we term visual language identification (VLID).
In speech recognition, both audio and visual, speech is represented as a series of discrete observation vectors. The sampling frequency of these signals varies between applications, but most domestic video cameras record frames at a frequency of 25 frames per second. By contrast, most audio recording equipment is capable of recording at a frequency of 48kHz, though the features actually used for speech recognition are usually sampled at almost 100Hz. It is not clear therefore whether the visual signal is under-sampled for use in an automated system. Furthermore, it is not known whether there is sufficient articulatory information visible on the face to discriminate units of speech, and ultimately to identify spoken languages. Finally, the very nature of visual features is to encode the physical appearance of the face, and this imposes strong speaker dependency which must be overcome to recognise languages across different people. These are the main challenges facing this research and the focus of our research objectives.
The development of a VLID system requires technical methods from a range of computer science disciplines, most notably, speech processing, computer vision and machine learning. Currently, there is no published literature directly tackling VLID, apart from that resulting from this research. This chapter describes current techniques from existing fields of research which have been adapted, where necessary, to suit the VLID system developed here. It is arranged as follows: Section 2.2 details the most popular approaches to audio language identification, and then the succeeding sections describe that system's constituent parts and supporting methods in more detail. Firstly, distance metrics are introduced in Section 2.3. Feature extraction is covered in Section 2.4, with vector quantisation (VQ) and hidden Markov models (HMMs) following in Sections 2.5 and 2.6 respectively. Language modelling is explained in Section 2.7, and the classification techniques of linear discriminant analysis (LDA) and support vector machines (SVMs) are covered in Sections 2.8 and 2.9.
2.2 Phone Recognition Followed by Language Modelling for Language Identification
Spoken languages differ in many ways, and these differences have been studied closely by language experts and phoneticians. Many approaches to audio LID have been developed to exploit such differences [Sugiyama, 1991; Zue et al., 1993; Zissman, 1996], though some are more successful than others. Chapter 4 describes in more detail some of the approaches which have been tried. Of the techniques which have proven to be useful, a subset rely on information which is completely absent from the visual domain, such as pitch, and therefore are unlikely to be useful in a system which relies solely on visual information.
[Figure 2.2: speech is passed through MFCC feature extraction and an English phone recogniser; the resulting phone sequence is scored by English and French language models, giving likelihoods Pr(En | EnEnLM) and Pr(Fr | EnFrLM), from which the hypothesised language is chosen.]
Figure 2.2: A system diagram of the phone recognition followed by language modelling approach to audio LID.
Figure 2.2 shows a system diagram of a basic audio LID system, known as phone recognition followed by language modelling (PRLM). Zissman [1996] developed this system to exploit the phonotactic differences between languages. Phonotactics govern the allowable sequence of phonemes in a given language. For example, English does not allow a voiced fricative such as [v] to be the second consonantal sound in a word, whereas Swedish does, as in the name Sven. PRLM learns discriminatory information such as this from language-specific training data. In VLID, instead of recognising phones from acoustic information, we will use the visual appearance of the mouth region to represent speech, which has been shown in humans to be sufficient to discriminate languages [Ronquest et al., 2010].
The PRLM system contains three subsystems. The first is feature extraction. For an audio phone recognition system, the features typically used are mel-frequency cepstral coefficients (MFCCs) [Davis and Mermelstein, 1980]. In the visual domain, we extract active appearance model (AAM) features (Section 2.4). The second subsystem is a method of tokenising the incoming features, typically into frames, phones or syllabic units. Next, statistical language models are built for each language, from phone sequences belonging to language-specific training data. Finally, likelihoods can be calculated from each language model given a phone sequence, and these can be processed in some way that classifies the utterance.
The tokenisation subsystem in PRLM usually comprises a set of HMMs, which are used to segment an input speech utterance into a sequence of phones. The choice of which phone set to model is limited to the languages for which acoustic transcriptions are available, since they are required to train phone models. Most usual is to use an English phone recognition system, although some work has been done to model shared or universal phone sets [Li et al., 2007; Huang and Wu, 2007]. An alternative method of tokenisation is to treat each input frame as a token. Vector quantisation (VQ) (Section 2.5) can be used to represent frames as VQ codewords, and the sequence of VQ codes can then be processed as if it were a sequence of phones.
In the language modelling subsystem, the phone sequences generated by the tokeniser are analysed in terms of their co-occurrence (or n-gram) probabilities. Statistical n-gram language models are built (Section 2.7), one per language, from language-specific training data. At test time, each model generates a score representing the likelihood of the input phone sequence having been produced by that model.
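To make the scoring and classification step concrete, the following is a minimal sketch, in Python, of how a decoded phone sequence might be scored against per-language bigram log-probability tables and classified by maximum likelihood. The function and model names, the toy bigram values and the back-off floor are all hypothetical illustrations, not the implementation used in this thesis.

    def score_sequence(phones, bigram_logprobs, backoff_logprob=-10.0):
        # Sum log-likelihoods of each bigram in the decoded phone sequence,
        # backing off to a floor value for unseen bigrams, then length-normalise.
        total = 0.0
        for prev, cur in zip(phones[:-1], phones[1:]):
            total += bigram_logprobs.get((prev, cur), backoff_logprob)
        return total / max(len(phones) - 1, 1)

    # Hypothetical bigram log-probability tables, one per language.
    english_lm = {("s", "v"): -6.2, ("t", "ax"): -1.3}
    french_lm = {("s", "v"): -2.1, ("t", "ax"): -2.8}

    decoded = ["s", "v", "t", "ax"]            # output of the phone recogniser
    scores = {"English": score_sequence(decoded, english_lm),
              "French": score_sequence(decoded, french_lm)}
    print(max(scores, key=scores.get))          # maximum-likelihood language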
The use of a single phone recogniser trained on a single language greatly simplifies the front-end of a language identification system in terms of training and system complexity. However, greater discrimination may be achieved by way of a parallel phone recognition system (Figure 2.3). In this method, known as parallel phone recognition followed by language modelling (PPRLM) [Zissman, 1996], language-specific phone tokenisers are built, effectively giving several language identification systems running in parallel. All data is processed by all tokenisers and language models, which produces n^2 language model likelihoods per utterance, where n is the number of tokenisers. This gives a higher-dimensional vector of likelihoods, which lends itself to the use of a discriminative backend classifier such as a support vector machine (SVM) (Section 2.9). The use of phone recognisers for different languages effectively sensitises the recognition capabilities of the LID system to a greater number of phones, including subtle differences in phoneme realisations between languages, which in itself may provide greater discriminatory capability than single phone recognition systems.

Figure 2.3: A system diagram of parallel phone recognition followed by language modelling. [Diagram: the MFCC features are decoded by both an English and a French phone recogniser; each phone stream is scored by English and French language models, and the resulting likelihoods, Pr(En | EnEnLM), Pr(Fr | EnFrLM), Pr(En | FrEnLM) and Pr(Fr | FrFrLM), are classified by an SVM to give the hypothesised language.]
2.3 Distance Metrics
Speech processing often requires a way to quantify the similarity between frames within a speech signal. A measure of closeness between vectors is necessary for all clustering techniques, such as those used in vector quantisation (Section 2.5). For example, distance metrics are used to calculate the inter- and intra-class distances required by many cluster quality measures, and to find the closest neighbours to a vector when using agglomerative hierarchical clustering. Within HMM modelling of speech utterances, a measure of distance is required to determine the closeness of a frame to a particular state within a model. Zhang and Lu [2003] describe and explain a number of distance metrics, including those outlined in this section.
Described here are most of the distance metrics used within this body of work. For processing efficiency, many implementations of these distance metrics do not include the square root operation (which is unimportant, as it does not affect the ranking of distances), and these are referred to as the squared distance metrics.
The Euclidean distance is a common, simple and effective distance metric. The Euclidean distance between vectors p and q is defined as:

d(p, q) = \sqrt{\sum_{i=1}^{N} (p_i - q_i)^2}     (2.1)

where N is the number of dimensions of p and q.
The cosine distance, more commonly used as a similarity measurement in document retrieval, calculates the angle between two vectors. This measure is useful where the magnitude of two vectors may vary but where that information does not provide a useful discriminatory feature. Such a case might be where vectors are constructed in some way over a series of tokens, but where the number of tokens per vector varies. The cosine distance between vectors p and q is defined as:

\cos(\theta) = \frac{p \cdot q}{||p|| \, ||q||}     (2.2)
Figure 2.4: In this figure, the Euclidean distance between A and C is smaller than that between A and B. With the Mahalanobis distance, using the diagonal covariance matrix represented by the red ellipse, points A and B are closer than A and C.
The Mahalanobis distance uses the covariance matrix of the data to take account of correlation between variables, which means that the way data points are distributed within classes affects the calculated distance. Figure 2.4 shows a case where the Euclidean distance would classify the test vector (A) as being closer to the mean of class C, when the distribution of the classes actually suggests that it should belong to class B. In this instance, the Mahalanobis metric would classify the test vector correctly. Just as the Euclidean distance is a special case of the Mahalanobis distance, in which the full covariance matrix is replaced by the identity matrix, there is a further case where a diagonal covariance matrix is used instead. This is known as the diagonal covariance Mahalanobis distance. The diagonal covariance Mahalanobis distance between vectors p and q is defined as:

d(p, q) = \sqrt{\sum_{i=1}^{N} \frac{(p_i - q_i)^2}{\sigma_i^2}}     (2.3)

where N is the number of dimensions of p and q, and \sigma_i^2 is the i-th diagonal element of the covariance matrix \Sigma.
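As an illustration, the three metrics above can be written compactly as follows. This is a minimal NumPy sketch for a single pair of vectors, assuming the diagonal variances have already been estimated from the data; it is not taken from the thesis software, and the example values are invented.

    import numpy as np

    def euclidean(p, q):
        # Equation 2.1: straight-line distance between two vectors.
        return np.sqrt(np.sum((p - q) ** 2))

    def cosine_distance(p, q):
        # Equation 2.2 gives the cosine of the angle; a common distance
        # form is one minus that similarity.
        return 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

    def diag_mahalanobis(p, q, variances):
        # Equation 2.3: Euclidean distance with each dimension scaled by
        # its variance (the diagonal of the covariance matrix).
        return np.sqrt(np.sum((p - q) ** 2 / variances))

    p = np.array([1.0, 2.0])
    q = np.array([2.0, 0.5])
    print(euclidean(p, q), cosine_distance(p, q),
          diag_mahalanobis(p, q, np.array([0.5, 2.0])))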
As described in Section 2.6, Gaussian probability density functions (GPDFs) are used to find the likelihood of a test vector belonging to the distribution on which the GPDF was trained. The likelihoods this process generates can also be used as a distance metric, as vectors which are furthest from the Gaussian distribution will produce the lowest likelihoods, whilst those that are closest will produce the highest.
2.4 Active Appearance Models
Active appearance models (AAMs) are routinely used for tracking the contours of the lips and other facial features [Cootes et al., 2001]. In addition, the features used for tracking can also be used as features for recognition. AAM is often used as a collective term for two types of model. The first is the AAM itself, which represents appearance, and the second is a point distribution model (PDM), which models shape. Both models are generated by performing principal component analysis (PCA) [Pearson, 1901] on their respective features. For a PDM, those features are the x and y coordinates of the landmarks, and for an AAM they are the pixel intensities of the image warped to the mean shape.
Figure 2.5: An example of hand-labelled landmark points on the outer and inner
lip boundaries.
To construct an AAM, each image is marked with a number, k, of feature points that identify the features of interest on the face. In this case, we use the inner and outer lip contours (Figure 2.5). The images labelled should represent the extremities in shape and appearance that the model is expected to represent (Figure 2.6). The feature points are normalised for pose (translation, rotation and scale) and are subject to a PCA to give a compact model of shape of the form:

s = \bar{s} + S b_s     (2.4)

where s represents a vector of concatenated feature points and \bar{s} is the mean shape. The columns of S are the n leading eigenvectors of the covariance matrix defining the modes of variation of the shape, and the shape parameters, b_s, define the contribution of each mode of variation to the representation of s. An example shape model is shown in Figure 2.7.

Figure 2.6: Example images representing extremal mouth configurations (both shape and appearance) required for training an AAM.
Figure 2.7: The first two modes of variation of the shape component of an AAM, varying between ±3 standard deviations from the mean. The first mode appears to capture variation due to mouth opening and closing, and the second appears to capture variation due to lip-rounding.
PCA projects raw data into a lower-dimensional space, where each eigenvector represents a particular mode of variation and the corresponding eigenvalue represents the energy of its contribution. In this way, PCA can be used as a data reduction technique by discarding AAM dimensions which have low eigenvalues and therefore do not contribute extensively to the overall variation in shape or appearance.
AAMs also allow for appearance variation, where each image is shape-normalised by warping from the labelled feature points, s, to the mean shape, \bar{s}. The pixel intensities within the mean shape are concatenated and the resultant vectors are subject to a PCA. A compact model of the appearance variation is given by:

a = \bar{a} + A b_a     (2.5)

where a is a shape-normalised image (i.e. warped onto the mean shape) and \bar{a} is the mean appearance image. The columns of A are the m leading eigenvectors of the covariance matrix defining the modes of variation of the appearance, and the appearance parameters, b_a, define the contribution of each mode of variation to the representation of a. An example appearance model is shown in Figure 2.8.
Figure 2.8: The mean and first three modes of variation of the appearance component of an AAM. The appearance images have been suitably scaled for visualisation.
There is usually significant correlation between changes in shape and changes in appearance. For example, as the mouth opens we would expect to see the teeth and the tongue, or as the mouth aperture closes we would expect the inside of the mouth to darken. To remove this correlation, a combined model of shape and appearance can be constructed, although we do not do that here. Given the labelled images used to compute the shape and the appearance components of the AAM, the shape and appearance parameters are computed using:

b_s = S^T (s - \bar{s})     (2.6)

then warping from the shape b_s to \bar{s} and computing the appearance parameters using:

b_a = A^T (a - \bar{a})     (2.7)

respectively. These feature vectors are concatenated to give the vector upon which the recognition experiments are based:

b = \begin{bmatrix} b_s \\ b_a \end{bmatrix}.     (2.8)
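A minimal sketch of this feature computation is given below, assuming the PCA bases S and A and the mean shape and appearance have already been estimated from training data, and that each image has been landmarked and warped to the mean shape. The variable names and toy dimensions are illustrative rather than taken from the thesis software.

    import numpy as np

    def aam_features(s, a, s_mean, S, a_mean, A):
        # Equations 2.6-2.8: project shape and shape-normalised appearance
        # onto their PCA bases and concatenate the parameter vectors.
        b_s = S.T @ (s - s_mean)     # shape parameters
        b_a = A.T @ (a - a_mean)     # appearance parameters
        return np.concatenate([b_s, b_a])

    # Toy dimensions: 20 landmark coordinates, 500 warped pixel values,
    # 5 shape modes and 10 appearance modes retained after PCA.
    rng = np.random.default_rng(0)
    s_mean = rng.normal(size=20)
    a_mean = rng.normal(size=500)
    S = np.linalg.qr(rng.normal(size=(20, 5)))[0]
    A = np.linalg.qr(rng.normal(size=(500, 10)))[0]
    b = aam_features(rng.normal(size=20), rng.normal(size=500), s_mean, S, a_mean, A)
    print(b.shape)   # (15,) -- one AAM feature vector per video frame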
These vectors form a parameter trajectory through the AAM space corresponding to the words spoken by the speaker being tracked. The eigenvalues for each dimension vary, which means that the scale of each dimension is different. Normalisation is required to prevent individual modes from dominating distance calculations (such as those in clustering algorithms) on the basis of their contribution to the overall variation. As mentioned before, by discarding eigenvectors with low eigenvalues, PCA is used as a technique for data reduction, removing dimensions whose overall contribution to the measured variation is small. Figure 2.9 shows, for a typical AAM, how the total percentage of variation retained increases as PCA dimensions are progressively added.
The total number of dimensions in Figure 2.9 is shown to be 62 when 100% of the appearance variation is retained. However, the number of dimensions should be equal to the number of pixels contained by the mean shape, multiplied by the number of colour channels (in our case three: red, green and blue). In practice, the maximum number of dimensions retained for both shape and appearance is not reached using the implementation of PCA adopted here. Conventional PCA is not possible, since eigendecomposition requires the covariance matrix to be full rank, which is unfeasible when processing high-dimensional image data (see De la Torre and Black [2003] for examples of how this can be overcome). It is also important to note that the number and order of eigenvectors varies from dataset to dataset, which means that PCA cannot be separately applied to a training set and then recalculated for a test set. In these cases, it is necessary to perform PCA on the training set and then to apply that PCA transformation to the unseen data.
Figure 2.9: A plot showing the cumulative variation captured by each AAM appearance dimension (percentage of variation retained against the number of AAM dimensions, from 1 to 62).
In order to generate AAM features for every frame in a video sequence, we must first track the facial elements we are interested in. As mentioned at the start of this section, AAMs are routinely used for this purpose. There are many algorithms for tracking the positions of the landmark points defining the AAM shape, but the one we use is the inverse compositional project-out algorithm [Matthews and Baker, 2004], which works as follows: during each tracking iteration, the previous landmark points are warped to the mean shape. The error between the appearance in the warped image and the mean appearance is calculated. Gradient descent is used to find the updated positions of the landmark points that minimise the fitting cost (i.e. the error between the actual image and the mean appearance). The initial positions of the feature landmarks are specified manually.
2.5 Vector Quantisation
Vector quantisation (VQ) is the process of discretising a continuous feature space. Several approaches to VQ are described in Gersho and Gray [1991], but in general a number of discrete points are chosen within the space and feature vectors are represented by their closest discrete neighbour (Figure 2.10). The number of VQ points, or codes, within the space can be any number greater than one, although some of the clustering algorithms used to find the codes are limited to code numbers that are a power of two. The complete set of codes is referred to as the codebook. Commonly, the code distribution within a space is required either to model cluster locations or to be evenly distributed within the space. How fine or coarse that distribution is varies between applications. Typically, a clustering algorithm is used to automatically find densely populated areas of the feature space. Hierarchical and K-means clustering, described in this section, are two such methods of automatic code distribution.
Figure 2.10: A figure illustrating vector quantisation. Vectors x, y and z are coded by their closest VQ codewords, C1, C2 and C3, respectively, so an incoming vector sequence such as z6, z1, x4, y7, x3, x4, z3, y1, ... is coded as C3, C3, C1, C2, C1, C1, C3, C2, ...
There are two main types of hierarchical clustering: agglomerative and divisive. Alternative names for these methods are bottom-up and top-down, referring to the direction in which the clustering process proceeds. The dendrogram in Figure 2.11 is an example of a clustering hierarchy that could be generated using either method. In bottom-up clustering, each item initially belongs to its own cluster, and new clusters are created by merging existing clusters. Merging is performed according to a measure of closeness between clusters, called linkage. Top-down clustering starts with every item belonging to the same cluster; each cluster then splits, doubling the number of clusters in an iteration. Splitting a cluster is performed using a flat clustering algorithm, such as K-means. Unlike hierarchical clustering, where there is a clear structure determining the relationship between clusters, flat clustering algorithms generate clusters which are not directly relatable to each other. Both clustering approaches operate until there are no items left to merge or split, or until the desired number of clusters has been reached.
Figure 2.11: A dendrogram illustrating an example of hierarchical clustering (divisive clustering reads the tree from the root downwards; agglomerative clustering reads it from the leaves upwards).
K-means clustering is a computationally inexpensive method of finding k centroid locations in some data [MacQueen, 1967]. The algorithm has two main processes. Firstly, each data point is assigned to the nearest of the k centroids, forming clusters of data points. Initially, the centroids are selected randomly or systematically (e.g. the first k data points). The mean of each cluster is then calculated and becomes the centroid for the next iteration. The algorithm runs for a specified number of iterations, or until the centroid locations move less within an iteration than a specified threshold. Work has shown that the centroids found by K-means are strongly related to the principal component axes generated by PCA on the same data [Ding and He, 2004]. Since PCA is a technique for representing the underlying structure of data, this supports the use of K-means for finding meaningful cluster locations. Using different initial seed locations can alter the final clusters, especially when the structure of the data is ambiguous. It is usual therefore to use the same seed points each time the algorithm is run.
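The following is a minimal sketch of K-means codebook training and quantisation as described above, written in plain NumPy with systematic seeding for repeatability; it is illustrative only, not the clustering implementation used in this work, and the data is synthetic.

    import numpy as np

    def kmeans(data, k, iterations=20):
        # Systematic initialisation: the first k data points become the seeds.
        centroids = data[:k].copy()
        for _ in range(iterations):
            # Assign each vector to its nearest centroid (squared Euclidean distance).
            dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Re-estimate each centroid as the mean of its assigned vectors.
            for j in range(k):
                if np.any(labels == j):
                    centroids[j] = data[labels == j].mean(axis=0)
        return centroids, labels

    rng = np.random.default_rng(1)
    frames = rng.normal(size=(200, 2))          # e.g. 2-D feature vectors
    codebook, codes = kmeans(frames, k=4)
    print(codes[:10])                            # VQ code sequence for the first frames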
2.6 HMM Speech Recognition

Figure 2.12: A diagram of a hidden Markov model (a left-to-right model with a start state, three emitting states and an end state).
Figure 2.12 shows a diagram of a hidden Markov model (HMM). HMMs are a successful technique used in speech recognition to model units of speech [Gales and Young, 2007; Juang and Rabiner, 1991; Rabiner, 1989]. They represent a signal as a sequence of static events, modelled by states. Each state has an associated observation probability density function (PDF) over the features (the observation distribution is typically modelled by a Gaussian PDF or a Gaussian mixture model), and a set of probabilities of moving between states, which is called a transition matrix. This model of speech assumes that the signal being modelled comprises a number of static segments equal to the number of states in the HMM. This assumption is a simplification of the speech signal, but it is an effective model for speech recognition. HMMs are usually built to model phones or words. The number of states per model can only be decided by evaluating the system performance on a development set, although, as a rule, longer speech sounds generally require more states. When training an HMM, there are two sets of parameters to estimate: the observation PDF associated with each state, and the matrix of transition probabilities determining the probability of transiting between states or remaining in the same state.
The purpose of using HMMs to model speech is to use the models to classify unknown speech utterances. The task is therefore, given an input utterance and a set of HMMs, to find the model which most closely matches the input. There are computationally expensive algorithms for finding the most likely model, but usually a dynamic programming algorithm called the Viterbi algorithm [Forney, 1973] is used. This algorithm finds the most likely path through each model, and the model with the highest likelihood then classifies the utterance. The task of estimating the parameters of an HMM is performed by the forward-backward algorithm [Jurafsky and Martin, 2009].
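As an illustration of the decoding step, the sketch below implements Viterbi scoring for a toy discrete-observation HMM in log probabilities. The transition and emission values are invented for the example; the systems in this thesis use continuous observation densities and HTK rather than this code.

    import numpy as np

    def viterbi_log_likelihood(obs, log_init, log_trans, log_emit):
        # Returns the log-likelihood of the best state path for an
        # observation sequence through one HMM (Viterbi decoding).
        delta = log_init + log_emit[:, obs[0]]
        for o in obs[1:]:
            # Best predecessor score for each state, plus the emission score.
            delta = np.max(delta[:, None] + log_trans, axis=0) + log_emit[:, o]
        return np.max(delta)

    # Toy 2-state model with 3 discrete observation symbols (invented numbers).
    log_init = np.log([0.9, 0.1])
    log_trans = np.log([[0.7, 0.3], [0.2, 0.8]])
    log_emit = np.log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    print(viterbi_log_likelihood([0, 1, 2, 2], log_init, log_trans, log_emit))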
2.6.1 Monophone HMMs
In a monophone HMM recogniser, a model is built for each phone to be recognised. In a typical English phone recogniser there are around 44 phones, including models for short pauses and silence. There are two methods for initialising the HMM parameters of monophone models. The first requires time-aligned phonetic transcriptions of the training data, and uses this information as a starting point from which to iteratively adjust the phone boundaries until optimal models have been estimated. Such transcriptions are expensive and time-consuming to obtain, and so a technique called flat start is often used instead. In this approach, a word-level transcription of sentence-length utterances is automatically expanded to the phone level by way of a pronunciation dictionary. Phone boundaries are assumed to be equally distributed within each block of speech, and are then re-estimated from that starting point. This can be used to generate a time-aligned transcription of the training data, which is useful for techniques such as IMELDA [Hunt et al., 1991], where feature vectors corresponding to phones are discriminatively altered to improve their separation.
Figure 2.13: An illustration of a Gaussian probability density function (GPDF), plotting probability density against the continuous feature value, expressed in standard deviations (\sigma) from the mean (\mu). For an input feature value, the function returns a probability of that vector (or scalar, in this example) being observed from this distribution. The parameter \mu determines which feature value the distribution is centred around, and \sigma specifies the spread of the function. These values are calculated from the data that the function is required to model. Nearly 68% of feature values are contained within one standard deviation of the mean.
A continuous probability density function models the distribution of a set of continuous feature vectors and can be used to calculate the likelihood of that distribution having produced any input feature vector, or range of input vectors [Jurafsky and Martin, 2009]. A Gaussian PDF (GPDF) (Figure 2.13) makes the assumption that the underlying distribution of the data is Gaussian, and typically models it as a normal distribution with mean, \mu, and diagonal covariance matrix, \Sigma. The equation for a GPDF is defined as follows:

b_j(o_t) = \frac{1}{(2\pi)^{\frac{D}{2}} |\Sigma_j|^{\frac{1}{2}}} \exp\left( -\frac{1}{2} (o_t - \mu_j)^T \Sigma_j^{-1} (o_t - \mu_j) \right)     (2.9)

where b_j(o_t) is the likelihood of feature vector o_t being observed in state j, \mu_j is the mean of state j and \Sigma_j is the covariance matrix of state j.
Each state in an HMM has a GPDF, which models the feature vectors for the static part of the signal modelled by that state. For an input vector o_t, a GPDF returns a likelihood that it would be observed at state j (Equation 2.9). Although the area under a GPDF sums to one, the output for a single input vector can be equal to or greater than one; although in practice this is not a problem, the values produced are technically not probabilities.
The distribution of features within a state is actually rarely Gaussian, due to systematic variations in pronunciation caused by many factors, the most important being coarticulation and speaker variation. It is common, once models with single GPDFs have been built, to use a Gaussian mixture model (GMM) instead. A GMM models the features as a number of weighted GPDFs, each known as a mixture component. The equation for a GMM is defined as follows:

b_j(o_t) = \sum_{m=1}^{M} w_m \frac{1}{(2\pi)^{\frac{D}{2}} |\Sigma_{jm}|^{\frac{1}{2}}} \exp\left( -\frac{1}{2} (o_t - \mu_{jm})^T \Sigma_{jm}^{-1} (o_t - \mu_{jm}) \right)     (2.10)

where b_j(o_t) is the likelihood of feature vector o_t being observed in state j, \mu_{jm} is the mean of mixture component m in state j, \Sigma_{jm} is the covariance matrix of mixture component m in state j, and M is the number of mixture components, each with weight w_m.
Figure 2.14 shows an example GMM, where a non-Gaussian function is represented by a combination of different GPDFs. Each probability density is weighted in such a way that the cumulative probabilities of the GMM sum to one. The purpose of this model is to better represent the distribution of the underlying features. The number of mixture components per GMM is increased sequentially from one to a specified limit, and the mixture component parameters are re-estimated at each step. Again, the optimal number of components must be identified by evaluating on a development set.
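A minimal sketch of Equations 2.9 and 2.10, for diagonal covariances and invented parameter values, is given below; it is intended only to make the weighted-sum structure of the GMM concrete and is not the model training code used in this work.

    import numpy as np

    def gaussian_pdf(o, mean, var):
        # Equation 2.9 with a diagonal covariance matrix (var holds the diagonal).
        d = len(o)
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.prod(var))
        return norm * np.exp(-0.5 * np.sum((o - mean) ** 2 / var))

    def gmm_likelihood(o, weights, means, variances):
        # Equation 2.10: a weighted sum of Gaussian components.
        return sum(w * gaussian_pdf(o, m, v)
                   for w, m, v in zip(weights, means, variances))

    o = np.array([0.2, -0.1])
    weights = [0.6, 0.4]                                   # mixture weights sum to one
    means = [np.array([0.0, 0.0]), np.array([1.0, -1.0])]
    variances = [np.array([1.0, 1.0]), np.array([0.5, 2.0])]
    print(gmm_likelihood(o, weights, means, variances))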
Figure 2.14: An illustration of a Gaussian mixture model. The two dashed blue lines are separately weighted Gaussian mixture components (each with its own mean and variance). These Gaussians approximately model the actual feature distribution (shown as a solid red line), which itself is not Gaussian.

It is possible to incorporate more temporal information into the features used for recognition by appending dimensions which correspond to the velocity and acceleration (or delta and delta-delta) of each feature. These are calculated for each frame within a sliding window, usually of a couple of frames. Since speech is a temporal signal, this helps to discriminate within the feature space between two speech frames which are spatially similar but which, in the context of their surrounding frames, are moving at different speeds or accelerations. Such differences could arise because the features are part of different feature trajectories, or, put another way, because they belong to different sounds.
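The sketch below shows one common way of appending such dynamic coefficients, using a simple regression over a two-frame window; the window length and weighting are illustrative choices rather than the exact configuration used in this work.

    import numpy as np

    def add_deltas(features, window=2):
        # features: (n_frames, n_dims) array of static feature vectors.
        # Delta coefficients are a weighted difference of neighbouring frames;
        # delta-deltas apply the same operation to the deltas.
        padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
        n = len(features)
        num = sum(t * (padded[window + t:n + window + t] -
                       padded[window - t:n + window - t])
                  for t in range(1, window + 1))
        return num / (2 * sum(t * t for t in range(1, window + 1)))

    frames = np.random.default_rng(2).normal(size=(100, 13))   # e.g. static MFCCs
    d = add_deltas(frames)
    dd = add_deltas(d)
    augmented = np.hstack([frames, d, dd])                      # static + delta + delta-delta
    print(augmented.shape)                                      # (100, 39)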
2.6.2 Triphone HMMs
Tied-state, multiple-mixture HMMs are a technique for modelling context-dependent phones used in state-of-the-art audio phone recognition systems [Banerjee et al., 2008; Livescu et al., 2007]. They are used to improve the robustness of HMMs to the effects of coarticulation in speech. Coarticulation is a phenomenon in which a phoneme's acoustic realisation is altered by the presence of a preceding or succeeding phoneme [Bell-Berti and Harris, 1982]. For example, in the words cats and dogs, the phoneme [s] is unvoiced in the first and voiced in the latter, which is influenced by the voicing of the previous phoneme. By building separate HMMs for different phone contexts, we attempt to model this effect explicitly, rather than via the implicit method of multiple mixture-component monophones. Figure 2.15 shows an example of possible triphone contexts for the Arpabet phoneme [AA].
Figure 2.15: An example of two triphone contexts for the Arpabet phone [AA]: in the word large ([L] [AA] [JH]) the triphone is L-AA+JH, and in the word start ([S] [T] [AA] [T]) it is T-AA+T. Left contexts are shown in red, and right contexts are shown in blue.
Coarticulation is not limited to altering the sound of a phone. In continuous audio speech, a phone can be deleted before it is articulated [Greenberg, 1999]. This effect is particularly prominent in rapid speech [Siegler and Stern, 1995], although it generally does not affect intelligibility. This is because phones are deleted in contexts where their absence will not impact significantly upon the human recognition of the word being articulated. Triphone models do not provide a method of modelling the deletion of visual phones, but by having models for specific phone contexts where phone deletion is likely to occur, and separate models for the remaining contexts, we ensure that visually different articulations of the same phone are confined to their own models. Without triphones, combining phone contexts which are typically deleted with those that are not would produce a single, noisy model. There has been work to explicitly model acoustic phone deletions [Hain and Woodland, 2000], and such an approach could benefit this work in the future.
Ideally, a triphone recognition system would contain a model for every phone context present in the training set. This would be the best way of ensuring that all of the coarticulation effects present in the data were being modelled. In practice, data sparsity limits the number of triphone models that we can build. For an inventory of 44 phones, there is a maximum of 85,184 (44^3) potential triphones (if cross-word triphones are considered and if languages did not forbid certain word-internal triphones). Most of these will not appear in the training data, and many of those that do appear will do so only very infrequently, which would compromise the generalising power of those models. Therefore, context-dependent modelling requires that the complete set of triphones is reduced in some way such that similar models are combined. The two approaches used commonly are data-driven and rule-based clustering. Rule-based clustering is the approach used here.
Before the algorithm for rule-based state-tying can be described, it is necessary to explain that the transition matrices for the set of triphones belonging to a particular phone are explicitly tied together, in order to share the data across all of the contexts. This is permitted because the transition probabilities do not vary greatly across phone contexts [Young et al., 2006]. However, the output distributions within each model are sensitive to context, and by reducing the complete triphone set we can pool data for contexts which share similar features, which increases the number of samples available to train each model. Therefore, it is these parameters for which we cluster our models.
Figure 2.11, in Section 2.5, shows an example of a clustering tree. For the rule-based approach to reducing a complete triphone set, a divisive clustering method is used. Each node in the tree contains a phonetic question about the context of the input model, for example, "Is the phone to the left an [m] in this model?". The answer to each question is either yes or no, which sends the model to the corresponding child node and hence to the next question. An advantage of rule-based clustering over data-driven clustering is that unseen triphones can be classified by a clustering tree, since the questions are based on nothing other than phone context. Generating the tree is the process of selecting the questions to place at each node. The source of the questions is discussed shortly, and Chapter 6 discusses the limitations of this approach when applied to visual speech and a technique for applying it to viseme recognition.
Generating a cluster tree first requires that all available context questions are loaded. Then, the training data which has been passed to that node is pooled to form one cluster. For the training data, the state occupancies, means and variances are used. Each question is asked in turn, each splitting the single cluster into two and giving a possible two-state clustering. We calculate the log likelihood of the data on each side of the node, given the state-tying resulting from the current question. The question selected for the node is the one which provides the biggest increase in log likelihood from the original pool to the two-state split. Each branch of the node passes a list of states to the next node, and the process repeats until the likelihood difference falls below a specified threshold, at which point leaf nodes are tied together, or until a terminal node is reached.
In phonology, the places of articulation and rules of coarticulation are known, and so questions can be devised which categorise English phones by their place of articulation, for both left and right contexts. Ideally, we would discard this subset of questions in favour of generating every possible question, from those that apply to large groups of phones to those that apply to single phones. It is not possible to ask incorrect questions, since only those which maximise the log likelihood for a node are selected for the clustering tree, and those which do not are discarded. In practice, the complete set of questions for a 44-phone system is computationally unfeasible and would increase the processing time incurred by loading the questions and calculating the clustering at each node of the tree. It is for this reason that the reduced question set is used [Young et al., 2006].
Figure 2.16: An example triphone context for the phone [aa] and clustering questions for the phone [d]. The triphone [f]-[aa]+[m] is a model for the phone [aa] whose left context is [f] and whose right context is [m]. For the [d] triphone models, the question "Is the left context nasal?" selects the models where the left context is a phone requiring nasality ([m], [n] or [ng]). An example of a broad, left-context question is one which selects, from all the contexts of [d], the models where [p], [b], [m], [t], [d], [k], [g], [n], [y] or [w] is the left context; an example of a narrow, right-context question is one which selects the models where [p] is the right context. This figure explains what is meant by the left and right phone contexts, and by broad and narrow clustering questions.
The reduced set of questions is specifically designed to divide a phone set into groups which correspond to the configuration of the speech articulators. Furthermore, these configurations are known to give rise to coarticulation effects. Figure 2.16 shows an example of a specific question which asks, "Is the left context nasal?". In this example, the models for a phone that are present at the node where the question is asked are split into two groups: the models where the left-hand phone is any of the ones listed (the nasal phones), and the remaining (non-nasal) models. Questions which select larger groups of phone models are known as broad questions, whilst those which select fewer models are narrow.
2.7 N-Gram Language Modelling
As discussed in Section 2.2, phonotactics is a feature which differs between spoken languages, and as such can be used to help identify between them. Furthermore, it is a feature of structure within a language which can be used to constrain a recognised sequence of symbols, phones or words to conform to phonotactic rules. It can do this either by limiting the search space of allowable sequences, known as the grammar, or by weighting them according to their likelihood of occurring (although limiting the search space to a reduced set of sequences is equivalent to setting a likelihood of zero for the excluded sequences). N-gram language models are the way in which phonotactics are incorporated into speech recognition. An n-gram (Figure 2.17) is a sequence of n tokens (frames, phones, or words, etc.), and a language model is a statistical representation of the occurrence of those n-grams within the data.
Figure 2.17: N-grams: unigrams, bigrams and trigrams. [An n-gram w_{n-N}, ..., w_n comprises the current word and its predecessors; a unigram is w_n, a bigram is w_{n-1}, w_n, and a trigram is w_{n-2}, w_{n-1}, w_n.]

It is usual to work with n-grams where n is two or three, known as bigrams and trigrams respectively. Since the number of possible n-grams for a given n is w^n (where w is the number of tokens), it soon becomes impossible to train reliable language models when n is greater than three, due to data sparsity. In the case of bigrams, the language model stores the likelihood of each token given that a previous token has occurred. The equation for this is defined as the probability of the co-occurrence of the tokens divided by the probability of the previous token occurring:

\Pr(w_i | w_{i-1}) = \frac{\Pr(w_{i-1}, w_i)}{\Pr(w_{i-1})}     (2.11)

where w is a token.
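A minimal sketch of estimating these bigram probabilities from a token sequence by counting, following Equation 2.11, is shown below; smoothing and back-off (discussed later in this section) are deliberately omitted, and the phone sequence is invented.

    from collections import Counter

    def bigram_probabilities(tokens):
        # Maximum-likelihood bigram estimates: count(w_{i-1}, w_i) / count(w_{i-1}).
        unigrams = Counter(tokens[:-1])
        bigrams = Counter(zip(tokens[:-1], tokens[1:]))
        return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

    phones = ["s", "p", "iy", "ch", "s", "p", "aa", "t"]   # a toy phone sequence
    for pair, prob in bigram_probabilities(phones).items():
        print(pair, round(prob, 2))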
Language models can be used to weight sequences of words or phones based upon their observed likelihood of occurring in the training data. In an HTK speech recogniser, the grammar scale factor, s, determines the influence of the language model versus that of the acoustic likelihoods generated by the HMMs during recognition. This is achieved by multiplying the language model likelihoods by the specified scale factor value. As the scale factor increases, the acoustic likelihood becomes smaller in relation to the language model likelihood, and therefore less influential on the total likelihood of the word. An ideal scale factor should improve recognition accuracy by correctly encouraging the more frequently occurring word sequences, as observed in the training data, without making the acoustic likelihoods redundant.
In HTK, the insertion penalty, p, adds a fixed value to the language model likelihood for each word in a recognised sequence. This has the effect of directly controlling the insertion and deletion of words, since longer sequences of words incur a greater overall penalty. Negative values of p mean that longer sequences of words will have a larger negative penalty, which discourages insertions by reducing the overall likelihood of the sequence. Positive values of p encourage fewer deletions by giving longer sequences a larger positive penalty, increasing the overall likelihood.

Figure 2.18: This figure demonstrates the effect of different insertion penalty values (-20, 0 and 20) on varying lengths of hypothesised word sequences, as in HTK, plotting the log-likelihood of the hypothesised sequence against the number of words in it. In this example, each word has an acoustic log-likelihood of -5. When the insertion penalty is 0, the addition of each new word decreases the total log-likelihood of the sequence only slightly (the sum of the acoustic likelihoods), and it is conceivable that a word sequence of any length could be selected as the hypothesis (though, more generally, longer sequences have lower likelihoods). When the penalty is 20, longer sequences have a much greater likelihood than shorter ones, because the addition of each new word adds 20 to the acoustic likelihood. When the penalty is -20, the opposite effect is apparent, as each additional word's likelihood is reduced by 20, lowering the overall likelihood of longer sequences versus shorter ones.

The insertion penalty, as used in HTK, therefore penalises either insertions or deletions, and only acts as a penalty on insertions if the value of p is negative. More generally, the insertion penalty provides a way of adjusting the balance between insertions and deletions. Figure 2.18 demonstrates how longer sequences of words incur greater penalties than shorter ones.
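The interaction of the two controls can be sketched, under the assumption of a simple log-domain combination of scores (an assumption made here for illustration, not quoted from the text), as

\log L_{\text{total}} = \sum_{i=1}^{N} \log L_{\text{acoustic}}(w_i) + s \sum_{i=1}^{N} \log L_{\text{LM}}(w_i \mid w_{i-1}) + pN

for an N-word hypothesis. With the per-word acoustic log-likelihood of -5 used in Figure 2.18, the language model term ignored and p = -20, a three-word hypothesis scores 3(-5) + 3(-20) = -75 whilst a four-word hypothesis scores -100, so inserting extra words is discouraged; with p = 20 the ordering reverses.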
Entropy, as defined in Shannon [1948], is the amount of information (measured in bits) that is required on average to encode a value from a discrete information source. It is defined as follows:

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)     (2.12)

where H(X) is the entropy, in bits, of a discrete random variable X, where X can take the values {x_1, x_2, ..., x_n}, and p(x_i) is a probability function giving the likelihood of an observation of x_i.
Intuitively, this is a measure of the potential surprise which a signal provides as each observation is generated. A sequence of words, such as a portion of text, is an example of an information source. Given that language is considered ergodic (i.e. the last word in a sequence is generally not dependent on all previous words), we can consider the observed sequence of words as one event with an overall likelihood, rather than assuming that the addition of each new word significantly affects the overall surprise of the sequence, and using this assumption we calculate the per-word entropy [Young et al., 2006]:

\hat{H} = -\frac{1}{m} \log_2 P(w_1, w_2, ..., w_m)     (2.13)

where \hat{H} is the per-word entropy, in bits, of a sequence of words w_1, w_2, ..., w_m. The probability function P(w_1, w_2, ..., w_m) gives the likelihood of the m words appearing in that order, and is calculated from a language model.
Perplexity converts a measurement of entropy back into a number of equally likely values (i.e. from bits to a number of choices). This gives the average number of choices from which the next item in a word sequence is likely to be drawn. Perplexity is defined as follows:

PP = 2^{\hat{H}}     (2.14)

where PP is the perplexity of a language model and \hat{H} is the per-word entropy as calculated by Equation 2.13.
This measure is commonly applied to language models, by estimating the entropy of a test sequence of words using the language model likelihoods. Low perplexity means that a language model gives more certainty about the next word in a sequence, whereas high perplexity means there is a higher chance of confusion based upon the number of choices alone. It follows that low perplexity also means the chance of guessing the next word in a sequence is higher, so care must be taken when comparing the accuracy of recognition systems which use language models of different perplexities. From Equations 2.13 and 2.14, we can see that if each word is equally likely to follow any other word (i.e. the likelihoods are uniformly distributed), as in an unweighted phone loop grammar, the perplexity is equal to the number of words in the grammar.
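The sketch below computes per-word entropy and perplexity for a test sequence from a table of bigram probabilities (Equations 2.13 and 2.14); the uniform case illustrates the point above that, with uniformly distributed likelihoods, the perplexity equals the vocabulary size. The bigram table is invented for the example.

    import math

    def perplexity(tokens, bigram_probs):
        # Per-word entropy (Equation 2.13) followed by perplexity (Equation 2.14).
        log_prob = sum(math.log2(bigram_probs[(prev, cur)])
                       for prev, cur in zip(tokens[:-1], tokens[1:]))
        entropy = -log_prob / (len(tokens) - 1)
        return 2 ** entropy

    # Uniform case: with a 4-word vocabulary and uniform bigram probabilities,
    # the perplexity equals the vocabulary size.
    vocab = ["a", "b", "c", "d"]
    uniform = {(u, v): 1.0 / len(vocab) for u in vocab for v in vocab}
    print(perplexity(["a", "b", "c", "d", "a"], uniform))   # 4.0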
It is not ideal to use language models as a technique for constraining the recognised sequence of phones within a language identification system, since the task is to identify which language has produced that sequence, and to force it to conform to a model is to destroy the original information. Therefore, language models are instead used to give the likelihood of a particular model producing an observed sequence of tokens, by summing the log-likelihoods over the sequence. In this way, the language model which best fits the observed sequence can be found. The likelihoods generated must be length-normalised to account for different sequence lengths. Also, some language models have a bias towards producing higher mean likelihoods, possibly because they contain fewer tokens and therefore have a lower perplexity. This must be considered when using classification techniques which rely on maximum likelihoods.
Even in low-order n-gram models, where n is two for example, it is common for bigrams to occur infrequently, if at all. When a previously unseen bigram does occur, if it does not appear in the language model then its likelihood cannot be calculated. To avoid this, we can generate a weighting known as the back-off, which can be used to weight the current unigram, whilst satisfying the constraint that the bigram probabilities for the previous unigram must sum to one. To calculate this, we use a method called discounting, in which a portion of the probability distribution is discounted from the most frequently occurring bigrams. The subtracted probability mass is then distributed amongst those bigrams whose frequency is below a specified threshold, including those which do not occur at all [Katz, 1987].
2.8 Linear Discriminant Analysis
Linear discriminant analysis (LDA) [Webb, 2002; Mika et al., 1999] is a supervised data reduction technique. Unlike principal component analysis (PCA), where a data projection is found based on the underlying structure of the original data, LDA finds a projection of the data which attempts to separate the input classes. It finds a projection into a (K-1)-dimensional space (where K is the number of classes) in which the ratio between the between-class scatter and the within-class scatter is maximal (Figure 2.19). The projection is expected to increase the separability of the classes, which makes LDA useful as a precursor to classification.
Figure 2.19: This diagram illustrates LDA in a two-class, 2D case. The circles and squares represent data from two separate classes. LDA seeks to find w, a projection of the data, which maximises the distance between the class means whilst minimising the within-class variance. In this example, the data is projected onto a 1D line which maximises the ratio between these two measurements. By contrast, projecting the data onto the other LDA axis (perpendicular to w) would result in no separation between the two classes.
The ratio between the between-class and within-class scatter matrices for two-class LDA is defined as follows:

J(w) = \frac{w^T S_B w}{w^T S_W w}     (2.15)

where w is the projection which maximises the ratio between the between-class scatter matrix, S_B (Equation 2.16), and the within-class scatter matrix, S_W (Equation 2.17).
The between-class scatter matrix is defined as:

S_B = \sum_{c} (\mu_c - \bar{x})(\mu_c - \bar{x})^T     (2.16)

where \mu_c is the sample mean of class c and \bar{x} is the overall mean of all the data.
And the within-class scatter matrix is defined as:

S_W = \sum_{c} \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^T     (2.17)

where x_i is vector i in class c and \mu_c is the sample mean of class c.
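A minimal sketch of the two-class case is given below, computing S_B and S_W from Equations 2.16 and 2.17 and taking the discriminant direction as the leading eigenvector of S_W^{-1} S_B (a standard way of solving Equation 2.15, stated here as an assumption rather than quoted from the thesis); the data is synthetic.

    import numpy as np

    def two_class_lda(X1, X2):
        # Between-class and within-class scatter (Equations 2.16 and 2.17),
        # then the projection w maximising the ratio in Equation 2.15.
        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
        mu = np.vstack([X1, X2]).mean(axis=0)
        S_B = np.outer(mu1 - mu, mu1 - mu) + np.outer(mu2 - mu, mu2 - mu)
        S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
        eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
        return eigvecs[:, np.argmax(eigvals.real)].real

    rng = np.random.default_rng(3)
    class1 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
    class2 = rng.normal([3.0, 1.0], 1.0, size=(100, 2))
    w = two_class_lda(class1, class2)
    projected1, projected2 = class1 @ w, class2 @ w   # 1D projections of each class
    print(w)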
2.9 Support Vector Machines

Figure 2.20: A support vector machine in the linear case. The separating hyperplane w \cdot x - b = 0 lies midway between the margin hyperplanes w \cdot x - b = 1 and w \cdot x - b = -1, which pass through the support vectors, giving a maximum margin of width 2/||w||.
A support vector machine (SVM) classifier is a supervised machine learning algorithm which constructs a maximum-margin hyperplane between the data of two classes (Figure 2.20), labelled -1 and 1 respectively. This means that the SVM algorithm attempts to find a margin which maximally separates the data points of each class. For a set of points x, a hyperplane is defined as:

f(x) = w \cdot x - b     (2.18)

where w is the normal vector to the hyperplane and b determines the offset of the hyperplane from the origin. Values for w and b are selected through the SVM training process, such that the equations of the hyperplanes shown in Figure 2.20 are satisfied whilst minimising ||w||, the norm of w.

This function is also used at classification time to determine on which side of the hyperplane a point lies, and hence to which class it will be assigned. Figure 2.20 shows that the maximum-margin hyperplane satisfies f(x) = 0. The training examples which lie closest to the separating hyperplane are referred to as the support vectors, and the equations of the hyperplanes which pass through them are shown in Figure 2.20 (f(x) = 1 for the circular data, and f(x) = -1 for the square data). For the example shown, the output of the hyperplane function for all circular training data should be at least 1, whilst for all the square data it should be at most -1. The task of finding the maximum-margin hyperplane can be presented as an optimisation problem of minimising ||w||, the norm of w, by finding values for w and b. For a more in-depth explanation of SVMs, see Cortes and Vapnik [1995].
!" # $ % & " & % $ # !"
!"
#
$
%
&
"
&
%
$
#
!"
(a) Original 2D Data
!"
#
"
#
!"
!"
#
"
#
!"
"
!"
$"
%"
&"
#"
'"
(b) 3D Projection of 2D Data
Figure 2.21: The red data in (a) is non-linearly separable from the blue data in 2D
space. In (b), the red data has become linearly separable by projecting the data into
a higher-dimensional space.
In cases where a linear decision boundary is not sufficient to separate data from two classes (Figure 2.21(a)), a kernel function may be used to construct a higher-dimensional separating hyperplane, or non-linear decision boundary. As all data is linearly separable in an infinitely high-dimensional space, this approach is guaranteed to find a margin separating the training data. There is no guarantee, however, that the data projection generated by the SVM will separate unseen data, and this is something that must be evaluated empirically. The polynomial kernel function is defined as:

k(x_i, x_j) = ((x_i \cdot x_j) + 1)^p     (2.19)

where x_i and x_j are the input vectors, and p is the order of the polynomial in the kernel function.
The radial basis function is defined as:

k(x_i, x_j) = \exp\left( -\frac{||x_i - x_j||^2}{2\sigma^2} \right)     (2.20)

where x_i and x_j are the input vectors, and \sigma is the standard deviation of the Gaussian function.
The kernel function substitutes all inner product calculations in the construction of the separating hyperplane, a technique known as the kernel trick. Figure 2.21(b) demonstrates how the non-linearly separable data from Figure 2.21(a) (shown in red) can become linearly separable in a higher-dimensional space. The SVM can be extended to multi-class problems in the same ways that many binary classifiers are extended, i.e. using one-against-all or one-against-one strategies. The greater the margin between the hyperplane and the data examples of either class, the greater the potential of the classifier to generalise to unseen data. SVMs have been shown to generalise well and are widely used in many fields. Significantly, SVMs are widely and successfully used in audio language identification approaches for the classification of language model likelihood scores [Ziaei et al., 2009; Suo et al., 2008].
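As an illustration of such a backend, the sketch below trains an SVM with an RBF kernel on vectors of language-model log-likelihood scores, here randomly generated stand-ins for the n^2 PPRLM scores described in Section 2.2, using scikit-learn. It is a usage example under those assumptions, not the backend configuration used in this thesis.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(4)

    # Stand-in backend features: 4 language-model log-likelihoods per utterance
    # (2 tokenisers x 2 language models), with labels 0 = English, 1 = French.
    english = rng.normal([-1.0, -3.0, -2.0, -3.5], 0.5, size=(50, 4))
    french = rng.normal([-3.0, -1.0, -3.5, -2.0], 0.5, size=(50, 4))
    X = np.vstack([english, french])
    y = np.array([0] * 50 + [1] * 50)

    backend = SVC(kernel="rbf", gamma="scale")   # RBF kernel, as in Equation 2.20
    backend.fit(X, y)
    test_scores = np.array([[-1.1, -2.9, -2.2, -3.4]])
    print(backend.predict(test_scores))          # hypothesised language label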
Chapter 3
Description of Datasets
3.1 Introduction
This chapter describes the specification, content, purpose and data capture process of two spoken language datasets, recorded for use in the visual-only language identification experiments described in the course of this thesis. Table 3.1 presents a brief summary of both datasets. Since visual-only language identification is a new field of research, there are no other directly relevant corpora available beyond these datasets. Datasets for computer lip-reading research do exist; however, they do not contain the range and volume of languages required for the work described here. Specifically, the existing corpora almost exclusively contain English speech. Owing to this lack of data, it was decided to record custom datasets which directly fulfilled the requirements of this work. The broadest requirement was to provide a sufficient amount of data on which we could develop, train, test and evaluate VLID systems, and for which we could establish, define and share a protocol for each of these procedures. Furthermore, care was taken to ensure that the data was of high quality, and that the data for each language had no biases of which we were unaware or which we could not reasonably ignore or normalise out. These considerations helped ensure that these datasets are useful to the scientific community, in particular for research into visual-only language identification, but also for other fields of research which involve computer lip-reading.
The first dataset, known as United Nations 1 (UN1), was primarily designed for speaker-dependent language identification. This dataset was required to facilitate the development of a system for identifying the language spoken by a known speaker. It was decided to record multilingual subjects, as this meant that we could collect speaker-dependent training and testing data in multiple languages, providing an easy way to evaluate the generalising power of any system developed. The second dataset, known as United Nations 2 (UN2), was for speaker-independent experiments, specifically for discriminating between English and Arabic speech. Hence, we chose to record as many English and Arabic speakers as possible, to ensure that we had sufficient data for our experiments. Both datasets contain audio and video data of subjects reading the United Nations Declaration of Human Rights. In addition to the audio and video, the datasets contain the tracking information corresponding to the x and y coordinates of a number of contours relating to facial features, principally the lips. In this context, a contour refers to a spline with a fixed number of points, and the boundary of a spline defines the shape or position of a facial feature. This tracking information (in conjunction with the appearance data) was used to generate our features for recognition, but it is included as part of the dataset description because it could be used to extract other types of visual features from the face.
This chapter is arranged as follows: Section 3.2 describes the multilingual video dataset recorded for our speaker-dependent language recognition and preliminary speaker-independent experiments. Within that section, the content of the first dataset is presented, along with information regarding the precise specification of the video data and the audio data, and a description of the data capture process that was used. Section 3.3 then presents the same information for our native English-Arabic video dataset, which was used in our further experiments into speaker-independent language identification.
Table 3.1: A brief summary of the UN1 and UN2 datasets.
Dataset Resolution Frames/Sec # Languages # Speakers # Hours
UN1 576 x 768 25 12 26 6.5
UN2 1920 x 1080 60 2 35 7
3.2 United Nations Multilingual Video Dataset (UN1)
This dataset was recorded initially for the speaker-dependent experiments described in Chapter 5, and was also used for our initial speaker-independent experiments in Chapter 6. In the speaker-dependent experiments, the task was to identify the spoken language of a known subject, given training data for that subject in each of their spoken languages. Little consideration was given to using this dataset in a speaker-independent setting, but there was sufficient coverage of English and French to perform preliminary experiments. Most of the speakers recorded were competent multilinguals, fluent in either two or three languages. Only a few were truly multilingual, in that they had the same linguistic ability in all the languages that they spoke, having learnt to speak them from a very early age [Baker, 2006].
The video camera used for the recordings was a Sony DV domestic video camera. The video format was DV at a compression ratio of 5:1, which is 25 Mbps (approximately 3.1 MBps); the image resolution was 576 x 768 pixels (down-sampled to 480 x 640), and the frame rate was 25 frames per second (progressive scan). The camera was rotated by 90 degrees, so that the dimension with the greatest resolution was the vertical rather than the horizontal dimension. This meant that a subject's face was framed whilst occupying most of the image. Audio was captured from the video camera's built-in microphone at a sample rate of 32 kHz, and the speech in the audio is intelligible. However, audio was not the focus of this database, and therefore no special care was taken to avoid the small amounts of background noise present in the recordings. The video data was captured in a studio environment, where lighting was reasonably controlled. The arrangement of the recording studio is depicted in Figure 3.1.
Figure 3.1: Studio arrangement for the recording of the UN1 dataset (video camera, subject, projector, projection screen, lighting and reflective surface). A camera records the whole face of the speaker, whilst the speaker recites the database text as projected onto a screen. The speaker controls the rate of text presentation with a mouse. Lighting is diffused by a reflective surface to illuminate the face of the speaker.
Each subject sat facing a screen which displayed the text they had to read. They were given a mouse with which to scroll through the document. Subjects were told to sit as still as possible, to face the camera and to avoid occluding their face with their hands. Figure 3.2 shows an example video frame from one of the speakers' recitals, giving a typical example of how each subject is framed. The entire head was captured in order to assist the mouth-tracking process, and in case additional facial information was necessary for any further work. Subjects were also advised to carry on reading regardless of any recital mistakes. The text chosen was the United Nations Universal Declaration of Human Rights, because it is freely available on the web [UN General Assembly, 1948] and translated into over 300 languages. Using the same text in each language also meant that there was some degree of consistency in the style (the formality and how continuous or connected the speech is) and in the phonetic coverage of the speech. Speakers were required to read up to and including the first 16 articles of the declaration in each of their fluent languages. Appendix A contains the full English text of the UN declaration.
Tables 3.2 and 3.3 present details of the UN1 dataset, specifically the languages recorded by each speaker, their gender, the duration of their recitals and whether or not tracking coordinates are available for the video. Figure 3.3 shows the lip landmarks tracked for each frame of video in this dataset (excluding the other facial landmarks). Active appearance model features (Section 2.4) were extracted from the videos and used for the experiments in Chapters 5 and 6. The speaker IDs refer to the order in which the speakers were recorded, and also provide a unique identifier for each subject. The IDs that are missing refer to subjects who were recorded, but whose videos were out of focus, or whose faces were occluded for an extended length of time, rendering the video unusable. Most speakers undertook a single recording session in which they performed all of their recitals (the recordings are date-stamped, so the few recitals which were not performed in a single session are easily identifiable, and such sessions were not used in our experiments). If requested by the subject, a few minutes of preparation time were granted to read the UN declaration before the session began. The video camera was turned off between recitals, the settings on the camera were not changed at any time during the capture of this dataset, and lighting conditions were not altered between sessions, although a small amount of natural light could enter the recording studio.
Figure 3.2: Example video frame from the UN1 dataset
Table 3.2: UN1: Multilingual Database
Speaker ID Native Language Spoken Language Sex Duration (hh:mm:ss) Tracked
1 English English Male 00:06:35 No
2 Tamil English Female 00:07:38 No
3 Italian English Male 00:07:31 Yes
3 Italian Italian Male 00:06:56 Yes
4 Polish Polish Female 00:06:12 No
4 Polish English Female 00:07:10 No
5 Arabic Arabic Male 00:07:21 Yes
5 Arabic English Male 00:11:05 Yes
6 German English Male 00:06:42 Yes
6 German German Male 00:06:34 Yes
7 Portuguese Portuguese Male 00:07:38 No
7 Portuguese English Male 00:04:11 No
8 Arabic English Male 00:08:58 Yes
8 Arabic Arabic Male 00:08:47 Yes
9 English French Female 00:08:55 Yes
9 English English Female 00:06:43 Yes
10 Mandarin Mandarin Female 00:07:47 No
10 Mandarin English Female 00:12:42 No
11 Polish Polish Female 00:06:37 No
11 Polish English Female 00:07:57 No
12 English English Male 00:06:30 Yes
12 English English Male 00:04:42 Yes
12 English English Male 00:07:51 Yes
12 English English Male 00:06:17 Yes
12 English English Male 00:06:10 Yes
12 English English Male 00:06:04 Yes
13 Polish Polish Male 00:07:34 No
14 English English Female 00:06:11 Yes
14 English French Female 00:06:43 Yes
15 English English Female 00:05:39 Yes
15 English French Female 00:08:18 Yes
16 Dutch Dutch Female 00:07:46 No
16 Dutch English Female 00:07:22 No
17 Mandarin Mandarin Female 00:06:41 No
17 Mandarin Cantonese Female 00:08:02 No
17 Mandarin English Female 00:09:31 No
Table 3.3: UN1: Multilingual Database Continued
Speaker ID Native Language Spoken Language Sex Duration (hh:mm:ss) Tracked
18 Tamil English Male 00:07:19 No
19 Mandarin English Female 00:09:06 No
19 Mandarin Mandarin Female 00:06:33 No
20 Arabic Arabic Female 00:07:28 No
20 Arabic English Female 00:11:49 No
21 French English Female 00:07:49 Yes
21 French French Female 00:07:13 Yes
21 French German Female 00:08:59 Yes
22 Russian English Female 00:11:18 No
22 Russian Russian Female 00:08:11 No
22 Russian Hebrew Female 00:15:46 No
23 English French Female 00:08:10 Yes
23 English English Female 00:06:10 Yes
24 Mandarin Mandarin Male 00:06:50 No
25 Mandarin Mandarin Male 00:07:03 No
26 English English Male 00:05:27 Yes
Figure 3.3: Example frame showing the number and location of the tracked land-
marks corresponding to the lips in the UN1 Dataset.
3.3 United Nations Native English and Arabic
Video Dataset (UN2)
This dataset was recorded for the further speaker-independent experiments described
in Section 7. The task in those experiments was to discriminate between Arabic and
English, which are two phonologically dissimilar languages. To overcome the limita-
tions of the initial speaker-independent experiments, a dataset was recorded contain-
ing a larger number of speakers for each language, comprised entirely of native speakers. The video recorded was also of a higher definition and frame-rate than in UN1, and a tie-clip microphone was used to capture high-quality audio. Word-level transcriptions were manually generated for the English and Arabic speech, and automatically expanded phonetic transcriptions were created (the English transcriptions used Arpabet phones and the Arabic transcriptions used a transliteration scheme of the IPA, as described in Smrz [2007]).
The video was captured using a Sanyo Xacti VPC-FH1 domestic video camera,
which contains a CMOS sensor. The video is recorded at 60 frames per second
(progressive scan), at a resolution of 1920 x 1080. It was encoded natively using
the MPEG-4 AVC/H.264 codec, at a compression ratio of 1:118.7, which is 24 Mbps (3 MB/s). Processing the HD video for feature extraction was performed using the UEA's High Performance Cluster (HPC). Without this system, much of the processing required by this project, particularly the video processing, would not have been possible. Audio was captured using a tie-clip microphone at 44.1 kHz, 16-bit, mono, and was uncompressed. The microphone used was an Audio-Technica AT899 omni-directional condenser tie-clip microphone.
The video data was captured in a studio environment, where lighting was reason-
ably controlled. Each subject sat facing a screen which displayed the text they had
to read. They were given a mouse with which to scroll through the document. The
arrangement of the recording studio is depicted in Figure 3.4. The studio arrange-
ment for this dataset was different from that used in UN1 because some speakers
Figure 3.4: Studio arrangement for the recording of the UN2 dataset. A camera
records the mouth region of the speaker, whilst the speaker recites the database text
as shown on a laptop, from which they can control the rate of text presentation.
Lighting is diffused by a reflective surface to illuminate the face of the speaker.
found it difficult to read text from a projected screen (either because the text was too small or they could not focus on it). Subjects were told to sit as still as possible, to face the camera and to avoid occluding their face with their hands. Figure 3.5 shows an example video frame from one of the speakers' recitals, giving a typical example of how each subject is framed. To ensure that maximum information could be extracted from the visual appearance of the mouth, only the mouth region was filmed. Subjects were again advised to carry on reading regardless of any recital
mistakes. As before, the text chosen was the UN declaration of human rights, be-
cause it is freely available on the web [UN General Assembly, 1948] and translated
into over 300 languages.
In total, 25 native English speakers and ten native Arabic subjects were recorded;
their details are shown in tables 3.4 and 3.5 respectively. The speaker IDs refer to
the order in which the speakers were recorded, and also provide a unique identifier
for each subject. The IDs that are missing refer to subjects who were recorded,
but whose videos were out of focus, or whose faces were occluded for an extended
length of time, rendering the video unusable. Also presented are the languages
recorded by each speaker, their gender and the duration of their recitals. Example
Figure 3.5: Example video frame from the UN2 Dataset
lip landmarks tracked for this dataset are shown in Figure 3.6. Sixty lip landmarks were
used in total (32 for the outer lip and 28 for the inner lip). Active appearance model
features (Section 2.4) were extracted for the videos which were used for experiments
in chapters 7 and 8.
It was our original intention to record ten native Egyptian speakers, in order to
concentrate on one widely spoken variation of the Arabic language, i.e. Egyptian
Arabic. This would have helped minimise some of the morphological, syntactical
and phonetic differences between Arabic dialects. In practice, it was not possible to acquire a sufficiently large group of people who spoke this dialect, and so we
recorded as many native Arabic speaking volunteers as we could nd, regardless of
which particular dialect they spoke. All speakers undertook a single recording session
in which they read the UN declaration in their native language. If requested by the
subject, a few minutes of preparation time were granted to read the UN declaration
before the session began. The video camera was turned off between recitals and
care was taken to keep the lighting conditions constant between sessions. Camera
settings were unchanged throughout the recording of this dataset, apart from a
Figure 3.6: Example frame showing the number and location of the tracked land-
marks corresponding to the lips in the UN2 dataset.
colour filter applied to speaker 40 (whom we did not use in our experiments). Camera
focus was set manually, as the autofocus was not capable of holding focus on the
mouth region alone. This is not ideal, since some speakers moved out of focus by
the end of their recital, and focus should be a consideration for any visual-only datasets recorded in the future.
Tables 3.4 and 3.5 show a difference in mean recital duration between the English
and Arabic videos. Upon examination of the videos, it was found that this is mainly
due to the enforced sentence breaks within the Arabic portion of the database,
rather than a difference in reading speed caused by language. Many sentences in the Arabic recitals were separated by a user-activated beep sound to signal the end
of a sentence, during which the subject was silent. The purpose of the beep was to
assist with sentence level segmentation when building an Arabic speech recogniser.
Since silence is not a discriminatory feature of language, video segments recognised
as visual silence are ignored in the VLID process (Chapter 7).
Table 3.4: UN2: English & Arabic Database - The English Speech
Speaker ID Language Nationality Sex Duration (hh:mm:ss)
1 English British Male 00:12:17
2 English British Male 00:11:10
3 English British Male 00:12:33
4 English British Male 00:13:32
5 English British Male 00:11:43
6 English British Male 00:11:49
7 English British Male 00:10:21
8 English British Male 00:11:58
11 English British Male 00:12:16
12 English British Male 00:11:36
13 English British Male 00:11:50
14 English British Male 00:12:00
15 English British Male 00:12:51
16 English British Male 00:11:05
17 English British Male 00:14:16
18 English British Male 00:12:57
19 English British Male 00:11:04
20 English British Male 00:12:13
21 English British Male 00:11:28
23 English British Female 00:13:07
24 English British Female 00:12:29
25 English British Male 00:13:03
26 English British Female 00:11:57
27 English British Female 00:15:20
29 English British Female 00:12:10
Mean: 00:12:17
Total: 05:07:05
Table 3.5: UN2: English & Arabic Database - The Arabic Speech
Speaker ID Language Nationality Sex Duration (hh:mm:ss)
32 Arabic Egyptian Male 00:13:55
33 Arabic Egyptian Male 00:15:34
34 Arabic Egyptian Male 00:15:37
35 Arabic Egyptian Male 00:14:34
36 Arabic Iranian Male 00:13:10
37 Arabic Egyptian Male 00:14:10
38 Arabic Egyptian Female 00:11:14
39 Arabic Jordanian Male 00:14:31
40 Arabic Saudi Arabian Male 00:15:30
41 Arabic Jordanian Female 00:12:15
Mean: 00:14:03
Total: 02:20:30
Chapter 4
Literature Review
4.1 Introduction
In this chapter we present a review of existing literature relating to automatic visual-
only language discrimination. At the time of writing this thesis, the only published
literature directly addressing this problem was that resulting from the research com-
prising this thesis. However, other areas of study have relevance to the challenges
posed by this task and the techniques developed by them were used in the work
described here. We therefore focus our review on these related fields of research. By examining existing techniques, we can confirm the novelty of developing a system for language identification using visual features, gauge the complexity of the task, and produce a performance benchmark with which to compare our completed system.
The rest of this chapter is arranged as follows: Section 4.2 outlines some im-
portant concepts of human speech and language, both auditory and visual, while
Section 4.3 focuses on existing approaches to audio language identification. The field of computer vision is covered in Section 4.4, specifically audio-visual speech recognition. Section 4.5 describes experiments published by psychologists to measure human language identification capabilities using lip-reading. Work into visual-only
speech recognition is discussed in Section 4.6, and Section 4.7 concludes this chapter.
4.2 Human Speech and Language
(Figure 4.1 labels: teeth, tongue, velum, uvula, upper lip, lower lip, nasal cavity, palate, oral cavity, vocal folds, trachea, hyoid bone, esophagus, laryngopharynx, epiglottis, nasopharynx, thyroid cartilage, alveolar ridge, root, dorsum, tip, blade, oropharynx)
Figure 4.1: A figure showing the vocal tract, and containing the position of the
speech articulators and other important anatomy for the production of speech.
Communication between humans is comprised of several channels of information
[Massaro and Cohen, 1983]. The primary mode of communication is speech, which
is foremost an audio channel. Speech is produced by the movement of speech ar-
ticulators, most notably the tongue, jaw, lips and vocal cords [Mermelstein, 1973].
Figure 4.1 shows the anatomy of the vocal tract. The external appearance of the
articulators, as seen by a video camera for instance, constitutes a visual mode of
speech. The recognition of speech from its visual components is known as lip or
speech-reading. Not all of the articulatory parameters are consistently visible from
an external view; for example, the teeth and tongue are periodically occluded when-
ever the lips close. Other articulators are permanently occluded, such as the velum
and the vocal cords. The parameters with some degree of visibility are the mea-
surable characteristics of visual speech.
The term language refers to the broadest system by which humans convey in-
formation to each other. Human languages have evolved to be uniquely complex
amongst those used by other species [Christiansen and Kirby, 2003], in terms of the
limitless potential for informational exchange and the rules governing the languages
themselves. Languages can be communicated in a variety of ways, including by
written text, visually (sign language and body language), or more commonly acous-
tically, by speech. At its lowest level, spoken language is comprised of sound units,
known as phones, which are distinctive. Phones that are considered contrastive
within a certain language are phonemes of that language (for instance [r] and [l] are
phonemes in English but not in Japanese). Each language has its own inventory of
phones, although many are often shared. These are the basic building blocks of any
spoken language. Phones are combined to form larger units called words, which are subject to morphological rules determining which phones the morpheme
(or root word) should contain for a given context. Phonotactics are a more general
set of rules specifying which phone sequences are permissible in a language.
A set of words is known as the lexicon, and each word belongs to a class referring
to a particular language concept. For example, a word can be a noun (referring
to an object, e.g. dog), a verb (referring to an action, e.g. eating), or an adjective
(referring to a descriptive word, e.g. big), etc. Syntax governs the allowable sequence
of word classes in a sentence, and also the precise meaning of a sentence given
the order of the words, which is often used as a synonym for grammar. More
correctly, grammar encompasses many other aspects of language such as morphology
(loosely, the study of words) and phonology (the study of phones). The specics of
the various sets and rules described here vary between languages [Collinge, 1990],
though most have evolved from common language ancestors and so many share
degrees of linguistic similarity. In fact, the similarities between languages can be
CHAPTER 4. LITERATURE REVIEW 58
used to determine their evolution and ancestry [Benedetto et al., 2002]. It is the
intra-language similarity of these features which theoretically determine how dicult
languages are to distinguish. For an in depth discussion on language, see Jusczyk
[2000], and Jurafsky and Martin [2009].
To further explain the concept of a phone, we must first describe the phoneme. A phoneme is the abstract representation of a sound unit required to construct speech. For example, when a human intends to say the word cat, they are aware that cat is comprised of three distinct sound units (/k/ /æ/ /t/) and then the speaker attempts to articulate those sounds. The audible realisation of a phoneme can vary. For instance, the articulation of a phoneme can be influenced by the preceding phone, or an articulatory target might not be met, causing a different sound to be produced [Jackson and Singampalli, 2009], or there may be consistent differences in pronunciations, such as those caused by regional accents. For example, one speaker may pronounce cat as [k] [æ] [t] and another as [k] [æ] [ʔ], where [ʔ] is a glottal stop.
A viseme is described as the visual appearance of a phoneme, and as such the
two are inextricably correlated. As described in Section 2.2, many successful LID
techniques are based on phonemic theory, specifically the identification of spoken phones and the constraints imposed upon sequences of phones by language. Since phones and visemes are just different modes of information representing the same concept, it is likely that phonemic LID approaches will be applicable to visemic information also. However, it is not known how contrastive visemes are in comparison to phones, which will determine how difficult they are to recognise. It is also
not known precisely how strong the relationship between phones and visemes is,
meaning that we do not know how much information from the audio channel could
conceivably be recovered from the visual correlates of speech. When dealing with
visemes, it is generally understood that there are fewer visemes than phonemes. It
is common to map audio phone sets to a much smaller viseme set, either for syn-
thesising speech in animated characters [Ezzat and Poggio, 2000] or for applications
in speech recognition [Visser et al., 1999]. It is therefore confusing to use the term viseme, since any one viseme does not refer to a single phoneme, but rather to a class of phonemes whose appearance is considered by some measure to be visually similar.
4.3 Audio Language Identication Techniques
Audio language identification is a mature field of research, with many successful techniques developed to achieve high levels of language discrimination with only a few seconds of test data. These techniques make use of discriminatory features of language as identified by linguists and phonologists. Many of the features are not expressed visually and are therefore not identifiable in the visual domain by lip-reading. Such methods include measuring the spectral similarity between languages [Sugiyama, 1991], or analysing prosody [Zue et al., 1993], which encompasses the stress and pitch of speech. Other audio LID approaches make use of the phonetic and phonotactic characteristics of languages, which are proven to be an identifiable discriminatory feature between languages [Jusczyk, 2000] and have been shown to be expressed visually [Goldschen et al., 1994]. For more in-depth reviews of language discrimination using audio features, please refer to reviews by Zissman [1996], Zissman and Berkling [2001] and Muthusamy et al. [1994].
This section focuses on reviewing the approaches to audio LID which have used
phonological characteristics to discriminate between languages. We start by describ-
ing a language recognition evaluation task in Section 4.3.1, which is used within the
research community to benchmark the success of state-of-the-art LID technologies.
In Section 4.3.2, we then briefly outline phone recognition approaches to LID (the most relevant of which is described fully in Section 2.2), and provide performance information regarding those systems. Section 4.3.3 gives a description of an alternative to phone recognition, using frame-level tokenisation, and Section 4.3.4 describes the application of support vector machines (SVMs) to LID. The final section illustrates
the use of parallel language-dependent automatic speech recognisers for LID.
4.3.1 National Institute of Standards and Technology Lan-
guage Recognition Evaluation Task
The National Institute of Standards and Technology language recognition evaluation
task (NIST LRE) is a task devised to evaluate the performance of worldwide research
conducted into audio language recognition. The evaluation task has been running
since 1996, and there have been 5 incarnations, most recently in 2009. The task
itself is described as one of language recognition, rather than language identification. NIST describes language identification as the task of identifying which language is being spoken in a speech utterance, whereas pair-wise language recognition is to decide whether or not a particular target language is present in an utterance. One important difference in classification between these two approaches is that language identification will assign a single language ID to an utterance, whereas recognition may result in several languages recognised within an utterance, or even none.
Table 4.1: NIST LRE 2009 Target Languages
Amharic Bosnian Cantonese
Creole (Haitian) Croatian Dari
English (American) English (Indian) Farsi
French Georgian Hausa
Hindi Korean Mandarin
Pashto Portuguese Russian
Spanish Turkish Ukrainian
Urdu Vietnamese
The 2009 NIST LRE [NIST, 2007] was similar to previous years' evaluations
in that participating research groups were presented with an evaluation procedure,
including data for training, development and testing, and methods of testing. The
task was to recognise 23 known languages and several unknown ones. Each group
presented a series of results for each test utterance, which was then processed by
NIST and presented anonymously. For each test utterance, a series of language
hypotheses were presented in turn for all languages to be recognised (e.g. "Does this speech contain English?"), and a group's LID system would return a yes or no answer for each hypothesis, and a value of confidence. The languages for which a question
would be asked depended on which test condition was used. The first was the closed-set
condition, in which each of the 23 languages listed in Table 4.1 were used for the
hypotheses. The second was the open-set, where unknown languages were used in
addition to those in the closed-set. The final condition was designed to evaluate LID
performance on linguistically similar groups, such as dialects, by only considering
the language contained within the speech and one other language of interest.
The Linguistic Data Consortium (LDC) provides development and training data
for each of the target languages. The primary training data, although any named
source is permissible, is narrowband (telephone speech quality) VOA (Voice Of
America) radio broadcasts, with accompanying language ID labels. Thousands of
automatically labelled broadband quality broadcasts are also provided for training
or evaluation. Test utterance durations are 3, 10 and 30 seconds, and all test data is
of telephone speech quality. There are 100 or more utterances for each test duration,
for each of the languages (including the open-set unknown languages), and this data
comes from the VOA corpora or is conversational telephone speech (CTS).
Two primary performance measures are adopted by the NIST LRE. The first is an average cost performance, based upon the equally-weighted and language-weighted sum of the false-positive and false-negative probabilities, and is defined as follows:
\[
C_{avg} = \frac{1}{N_L} \sum_{L_T} \left\{ C_{Miss}\, P_{Target}\, P_{Miss}(L_T) \;+\; \sum_{L_N} C_{FA}\, P_{NonTarget}\, P_{FA}(L_T, L_N) \;+\; C_{FA}\, P_{OutofSet}\, P_{FA}(L_T, L_O) \right\} \tag{4.1}
\]
where $C_{Miss} = C_{FA} = 1$, and $P_{Target} = 0.5$.
$N_L$ is the number of languages in the closed-set.
$L_T$ is the target and $L_N$ is the non-target language.
$P_{NonTarget} = (1 - P_{Target} - P_{OutofSet})/(N_L - 1)$.
$P_{OutofSet} = 0.0$ for the closed-set condition and 0.2 for the open-set.
$P_{Miss}$ is the probability that the target language was missed.
$P_{FA}$ is the false alarm probability for the specified language pair.
$L_O$ is the Out-of-Set language.
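To make the metric concrete, the following Python sketch computes the closed-set average cost from per-language miss and false-alarm probabilities. It is an illustrative implementation of Equation 4.1 rather than NIST's scoring software; the function name and the dictionary-based argument layout are our own assumptions.

```python
# Illustrative sketch (not NIST's evaluation code) of the average cost in
# Equation 4.1. p_miss[t] is the miss probability for target language t;
# p_fa[(t, n)] is the false-alarm probability for target t against non-target n;
# p_fa_oos[t] is the false-alarm probability against out-of-set speech.
def average_cost(p_miss, p_fa, p_fa_oos=None,
                 c_miss=1.0, c_fa=1.0, p_target=0.5, p_out=0.0):
    langs = sorted(p_miss)
    n_l = len(langs)
    p_nontarget = (1.0 - p_target - p_out) / (n_l - 1)
    total = 0.0
    for t in langs:
        cost = c_miss * p_target * p_miss[t]
        cost += sum(c_fa * p_nontarget * p_fa[(t, n)] for n in langs if n != t)
        if p_out > 0.0 and p_fa_oos is not None:
            # Open-set term; zero in the closed-set condition, where p_out = 0.0.
            cost += c_fa * p_out * p_fa_oos[t]
        total += cost
    return total / n_l
```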
The best performing system from the 2009 LRE produces an average cost per-
formance of around 1.5%, which represents a very low cost. In addition to this,
NIST also present detection error tradeoff (DET) curves [Martin et al., 1997], which give a graphical representation of the performance of a recognition system in terms of the false positive and false negative rates as the classification threshold is varied (Figure 4.2). For example, if the recognition system produces a likelihood for a given recognition decision, a threshold can be set (and varied) over which the classification hypothesis is said to be true. The full results of the 2009 LRE can
be found at NIST [2007], and show that excellent levels of language discrimination
can be achieved between high numbers of languages using state-of-the-art audio LID
techniques.
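To illustrate how the points on a DET curve arise, the sketch below sweeps a decision threshold over sets of target and non-target scores and records the miss and false-alarm rates at each setting. The variable names and the acceptance rule are illustrative assumptions, not the procedure of Martin et al. [1997].

```python
# Sketch: (false-alarm rate, miss rate) pairs that make up a DET curve.
# target_scores come from trials where the language hypothesis is true,
# nontarget_scores from trials where it is false.
def det_points(target_scores, nontarget_scores):
    points = []
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score is at or above the threshold.
        miss = sum(s < thr for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        points.append((false_alarm, miss))
    return points
```

On a DET plot these points are drawn on normal-deviate (probit) axes, which is why approximately normally distributed scores appear as the near-straight lines noted in the caption of Figure 4.2.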
Figure 4.2: A figure from Zhu et al. [2008] showing DET curves to compare the
performance of three audio LID systems, in two separate tasks. The smaller the area
under a curve, the better the performance of the system. Each DET curve is quite
straight, meaning that the likelihoods from each system follow normal distributions.
4.3.2 Phone-Based Tokenisation
Several approaches exist which exploit the phonetic content differences between lan-
guages to achieve language discrimination. Such techniques require the training of
a phone recogniser, usually comprising a set of hidden Markov models (HMMs),
which are used to segment input speech into composite phones. Phonetic transcrip-
tions of training data are required to train a phone recogniser and this may be a
prohibitive factor as transcriptions may not be readily available. Furthermore, some
LID systems use language-dependent phone recognisers, which introduces a further
limitation in that the recognition of additional languages is not a trivial task and is
once again reliant upon transcription availability.
(Figure 4.3 blocks: Speech → MFCC Feature Extraction → English Phone Recogniser → English Language Model and French Language Model → Pr(En | EnEnLM) and Pr(Fr | EnFrLM) → Hypothesised Language)
Figure 4.3: A system diagram of the phone recognition followed by language mod-
elling approach to audio LID.
In an approach called phone recognition followed by language modelling (PRLM)
[Zissman, 1996] (Section 2.2), phonotactics is the feature of language used for dis-
crimination. The contention here is that different languages have different rules
regarding the order in which phones may occur in speech. In this technique (Figure
4.3), a single phone recognition system can be used to tokenise an utterance using
a shared phone set, trained using one language. The phone sequences produced by
this system can then be analysed in terms of the co-occurrence (or n-gram) probabil-
ities of phones in an utterance. Statistical models are built using language-specic
training data, and these models generate a likelihood score of input utterances being
produced by that model. For classication, simple maximum likelihood approaches
can be used, or more complex back-end classiers such as Gaussian probability den-
sity functions (GPDF), neural networks or SVMs can be applied. This system can
CHAPTER 4. LITERATURE REVIEW 64
be extended by building PRLM systems using language-specific phone recognisers,
and running the recognition systems in parallel (PPRLM) (Section 2.2).
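The language-modelling stage of PRLM can be sketched as follows, assuming the front-end phone recogniser has already produced a phone string per utterance. The add-one smoothing and the function names are our own illustrative choices and should not be read as the exact system of Zissman [1996].

```python
import math
from collections import Counter

# Train add-one smoothed bigram probabilities from the phone strings of one language.
def train_bigram_lm(phone_sequences, phone_set):
    pair_counts, context_counts = Counter(), Counter()
    for seq in phone_sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    v = len(phone_set)
    return {(a, b): (pair_counts[(a, b)] + 1) / (context_counts[a] + v)
            for a in phone_set for b in phone_set}

# Score a recognised phone sequence against each language model and pick the best.
def classify(phone_seq, language_models):
    def log_likelihood(lm):
        return sum(math.log(lm[(a, b)]) for a, b in zip(phone_seq, phone_seq[1:]))
    return max(language_models, key=lambda lang: log_likelihood(language_models[lang]))
```

In a PPRLM configuration, the same scoring would simply be repeated for each language-dependent phone recogniser, with the resulting scores combined by a back-end classifier.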
The MIT language recognition system, which uses a fusion of scores from a
prosodic recognition system and a PRLM system, was entered into the 2009 NIST
LRE. They achieved a mean recognition error of 1.64% on 30-second test utterances
in the 23 language closed-set test [Torres-Carrasquillo et al., 2010]. Work in Hos-
seini and Homayounpour [2009] and Yin et al. [2008] also use PRLM methods, and
it is widely stated amongst the audio LID community that PRLM is a well estab-
lished method for discriminating languages. PRLM is still a relevant and successful
technique today, despite first appearing in scientific papers around 15 years ago.
4.3.3 Gaussian Mixture Model Tokenisation
The tokenisation sub-system within the LID architecture is usually applied at a phone level, although it could be as high-level as words or sentences. Torres-Carrasquillo et al. [2002] present a variant of the standard PPRLM LID approaches which uses sub-phone, frame-level tokenisation. In this method, a Gaussian mixture model (GMM) is trained for each language from language-specific mel-frequency cepstral coefficients (MFCC) data. Each GMM can be considered to be an acoustic dictio-
nary of sounds, with each mixture component modelling a distinct sound from the
training data. Given an MFCC frame (a typical recognition feature used in audio
speech processing), the mixture component is found which produces the highest
likelihood score, and the index of that component becomes the token for that frame.
For a stream of input frames, a stream of component indices will be produced, on
which language modelling followed by back-end classification can be performed, as
is common in audio LID [Suo et al., 2008; Torres-Carrasquillo et al., 2002].
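The tokenisation step itself reduces to an argmax over mixture components. The sketch below uses scikit-learn's GaussianMixture as a stand-in for the language-specific GMMs described above; it is an illustration of the principle rather than the code of Torres-Carrasquillo et al. [2002].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Train one GMM "acoustic dictionary" per language from that language's MFCC frames.
def train_tokenisers(mfcc_by_language, n_components=64):
    return {lang: GaussianMixture(n_components=n_components).fit(np.vstack(frames))
            for lang, frames in mfcc_by_language.items()}

# Tokenise an utterance: each frame is replaced by the index of the mixture
# component with the highest posterior probability for that frame.
def tokenise(gmm, mfcc_frames):
    return gmm.predict(mfcc_frames)
```

The resulting streams of component indices can then be fed to the same n-gram language modelling and back-end classification used in PRLM.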
For the NIST 1996 12 language evaluation task [NIST, 1995], Torres-Carrasquillo
et al. [2002] present a minimum error rate of 17%, which is higher than standard
PRLM techniques. Despite this increase in error rate, several advantages are offered
by this approach. Firstly, the training of the tokeniser does not require transcribed
data, which simplifies the incorporation of additional languages into the system and
is especially advantageous for VLID where there is no agreed protocol for transcrip-
tions. Secondly, there is a reduction in computational cost using this technique,
versus phone recognition.
4.3.4 Support Vector Machines
A support vector machine (SVM) is a discriminative classifier that has been applied to various subsystems within audio language recognition methods. The classifier
works by constructing a separating hyperplane that maximises the margin between
each class to be discriminated and that hyperplane (Figure 4.4). By use of a kernel
function, a higher dimensional projection of the training data can be found, in which
the data is linearly separable and in which a maximum margin can then be located.
Section 2.9 describes SVMs in more detail.
Figure 4.4: A diagram showing two examples of separating hyperplanes. The left-
most image shows a non-maximum margin, whilst the right-most image shows a
maximum margin, as used by SVMs for classification.
In audio LID papers by Zhai et al. [2006] and Li et al. [2007], a PPRLM LID
system similar to that described in Section 4.3.2 [Zissman, 1996] is proposed, except
the language models themselves, rather than the scores they produce, become the
vectors used by the back-end classifier. Instead of a maximum likelihood, GPDF
or linear discriminant analysis (LDA) back-end, SVMs are built from the bigram
language models of the tokenised training data, an SVM for each language, each
comparing that target language against all other languages. For example, in testing
two very simple languages, each having the same three-phone inventory but different phonotactic rules, each language model would have a maximum of nine
bigram probabilities. Each of the probabilities for each bigram would most likely
differ between languages, with some perhaps not occurring at all. This vector of
bigram probabilities for a language becomes the training vector. Language models
are built from the training data of the target language and for all other languages,
and SVMs are constructed to define the decision boundary which can be used to
classify the language model generated from a test utterance.
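A minimal sketch of this idea is given below: each tokenised training utterance is reduced to a fixed-length vector of bigram relative frequencies, and one linear SVM is trained per target language in a one-versus-rest fashion. The vectorisation and the scikit-learn classifier are our own illustrative choices, not the exact systems of Zhai et al. [2006] or Li et al. [2007].

```python
import numpy as np
from sklearn.svm import LinearSVC

# Map a tokenised utterance (a phone sequence) to a fixed-length vector of
# bigram relative frequencies, using a fixed ordering of all possible bigrams.
def bigram_vector(phone_seq, phone_set):
    index = {(a, b): i for i, (a, b) in
             enumerate((a, b) for a in phone_set for b in phone_set)}
    vec = np.zeros(len(index))
    for a, b in zip(phone_seq, phone_seq[1:]):
        vec[index[(a, b)]] += 1
    return vec / max(vec.sum(), 1.0)

# One-versus-rest training: one linear SVM per target language.
def train_language_svms(phone_seqs, labels, phone_set):
    X = np.array([bigram_vector(s, phone_set) for s in phone_seqs])
    return {lang: LinearSVC().fit(X, [1 if l == lang else 0 for l in labels])
            for lang in set(labels)}
```

At test time, the bigram vector of a test utterance would be scored by each SVM's decision function and the highest-scoring language hypothesised.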
Zhai et al. [2006] present details of experiments to measure the effect of training dataset size imbalances and the effect of GMM back-end classification using SVM scores as an input feature. They also demonstrate the effect of discriminatively re-
ducing the size of a language model before the SVM training process. In a language
with a phoneme inventory count of 47 phonemes, the maximum number of within
and cross-word bigrams is 2,209. By reducing this set to include the more discrim-
inative combinations they expect to see greater accuracy and less computational
expense. The process they use to reduce the bigram set is called term frequency
inverse document frequency (TF-IDF) [Salton, 1989], which is an important tech-
nique used in the field of information retrieval. In this scheme, terms which appear in all languages are least informative and are weighted to 0. More language-specific
bigrams are retained and weighted according to their overall frequency relative to
each language.
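The weighting itself can be sketched as below, treating bigrams as "terms" and languages as "documents"; the exact TF-IDF variant used by Zhai et al. [2006] may differ, so this only illustrates the principle that a bigram occurring in every language receives zero weight.

```python
import math

# counts_by_language: {language: {bigram: count}} accumulated from training data.
# Returns a weight per (language, bigram); bigrams seen in every language get 0.
def tf_idf_weights(counts_by_language):
    n_langs = len(counts_by_language)
    weights = {}
    for lang, counts in counts_by_language.items():
        total = sum(counts.values())
        for bigram, count in counts.items():
            tf = count / total
            df = sum(1 for c in counts_by_language.values() if bigram in c)
            idf = math.log(n_langs / df)   # log(1) = 0 when the bigram is in all languages
            weights[(lang, bigram)] = tf * idf
    return weights
```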
The results described by Zhai et al. [2006] show an improvement by using their
SVM technique over an implementation of a standard PPRLM approach. They
conclude that the TF-IDF weighting scheme was ineffective at improving results, as were GMM back-end classifiers on SVM scores. SVMs are shown to be sensitive to training data imbalances and thus benefit from forced balancing of the sets. On six languages from the OGI-Telephone speech corpus, without GMM back-end classification and with balanced training data, they achieve 94% average accuracy. This
result is 7.3% higher than their standard PPRLM system and suggests the potential
of discriminative classifiers for LID. Li et al. [2007] show similarly good performance
on the 12 language 2003 NIST tests. However, they do not test SVM performance
without discriminative weighting of their language models. Other approaches have
been suggested too, such as in Campbell et al. [2006], where sequences of audio
features are compared by an SVM sequence kernel. In conclusion, SVMs built on
PPRLM technology are currently the most up-to-date and successful LID systems,
and may provide similarly good results when applied to VLID.
4.3.5 Large Vocabulary Continuous Speech Recognition
(Figure 4.5 blocks: Speech → MFCC Feature Extraction → English, French and German Speech Recognisers → Pr(EnASR), Pr(FrASR), Pr(GeASR) → Classifier → Hypothesised Language)
Figure 4.5: A diagram showing an approach to language identification using an LVCSR for each language to be recognised. For a given speech utterance, a likelihood is generated by each speech recogniser, and the utterance is classified either as the language of the recogniser producing the maximum likelihood, or by a more sophisticated classification process.
The syntax of a language governs its allowable sentence structures and is a feature
which differs between languages. Continuous automatic speech recognition (ASR)
systems can make use of higher level language information such as this by using word-
level language models. Large vocabulary continuous speech recognition (LVCSR)
has been applied to audio language identification in the hope that recognising these high-level features of language will improve discrimination (Figure 4.5). Using a set of language-dependent ASR systems, one for each language to be recognised, a test utterance can be classified according to the ASR system producing the highest
likelihood score. On four languages from the spontaneous scheduling task, Mendoza
et al. [1996] achieves 84% identification accuracy and Schultz et al. [1996] presents a
maximum error of less than 3% on three languages from the OGI telephone speech
corpus.
A disadvantage of this approach is that training an ASR system requires a large
amount of transcribed, language-specific training data, in order to train reliable models which generalise well. Also, there is a significant processing overhead associated
with training and testing multiple ASR systems. There is a further complication in
applying this approach to visual speech, since we will show in chapters 6 and 7 that
visual phone recognition accuracy is especially poor, and that recognised sequences
show a large proportion of deletions and insertions. This means that recognition of
words in terms of their constituent phones, via a pronunciation dictionary, is likely
to give poor results. Word level models are therefore required, which are highly data
intensive. If these limitations can be overcome, this approach may become a viable
and powerful tool for VLID in the future.
4.4 Audio-Visual Approaches
The field of computer vision has been applied to speech recognition in order to use
the information present in visual speech to improve standard ASR performance.
This research area is known as audio-visual (AV) speech recognition, and its main
application is for speech recognition in noisy environments, where conventional au-
dio features become ineffective, but where visual features are unaffected [Matthews
et al., 2002; Almajai and Milner, 2009]. Approaches for AV ASR use features de-
rived from the mouth region, since it is the area on the face which conveys the most
information regarding the speech articulators.
Face, mouth and lip tracking approaches have all been employed in the past
as methods to locate automatically the areas of the face that contain the most
speech information, from which feature extraction can take place. Face detection
software usually operates by splitting an image into a pyramid of rectangular regions,
the shape, scale and rotation of which are limited to pre-defined parameters. The greyscale intensities within those regions are analysed and classifiers are trained on the
regions. Once the face has been recognised, smaller facial features such as the
mouth can be more easily located by the same technique, using the constraints
introduced by the knowledge of the face location. Broadly speaking, there are two
types of features that we can extract from the mouth region for use in visual speech
recognition; geometric and appearance-based.
Geometric features are a way of representing the shape of the mouth, usually
the lip contours. Since geometric features do not contain appearance information
regarding the teeth and tongue, which are vital to the production of speech, they
are unlikely to capture sufficient articulatory information to recover a complete audio transcription. However, there may be speaker-dependent shape configurations
which provide adequate information to overcome the missing articulators, but this
information is unlikely to be transferable across speakers. Examples of geometric
features include lip-contour height and width, aspect ratio, area, or perimeter length
[Cetingul et al., 2006; Sujatha and Santhanam, 2010; Bregler and Konig, 1994].
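As an illustration, descriptors of this kind can be computed directly from tracked lip landmarks. The sketch below assumes the outer-lip landmarks are supplied as ordered (x, y) coordinates and uses the shoelace formula for area; it is a generic example rather than the exact feature set of the cited papers.

```python
import numpy as np

# points: (N, 2) array of outer-lip landmark coordinates for one frame,
# assumed to be ordered around the lip contour.
def geometric_lip_features(points):
    width = points[:, 0].max() - points[:, 0].min()
    height = points[:, 1].max() - points[:, 1].min()
    x, y = points[:, 0], points[:, 1]
    # Shoelace formula for the area enclosed by the contour.
    area = 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))
    perimeter = np.linalg.norm(points - np.roll(points, 1, axis=0), axis=1).sum()
    return np.array([width, height, height / width, area, perimeter])
```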
Appearance-based features are derived from the image of the face, specifically
the pixel intensities contained within a region of interest or the contour of the lips.
Once the region of interest has been found, the contour of the lips can be tracked.
AAMs are routinely used for this purpose [Cootes et al., 2001]. An attraction of using AAMs for tracking is that they provide a model of shape and appearance
which can then be used to generate features for recognition. AAMs contain a PDM,
which provides an estimation of shape, and an appearance model representing the
shape-free appearance variation of the mouth. The details of this process have
been described in Section 2.4, but to summarise, PCA is applied to the landmark
coordinates from the lip spline for the shape model, and to the pixel intensities of
the mouth region warped to the mean mouth shape for the appearance.
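As a rough illustration of the shape-model half of this process, the sketch below applies PCA to stacked landmark coordinates so that each retained component is a mode of lip-shape variation. A full AAM additionally aligns the shapes and builds a shape-normalised appearance model, which is omitted here; the array layout is our own assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

# landmarks: (n_frames, n_points, 2) tracked lip coordinates.
# A full AAM would first remove translation, scale and rotation by Procrustes
# alignment and would also build a shape-normalised appearance model.
def shape_model(landmarks, n_modes=10):
    X = landmarks.reshape(len(landmarks), -1)   # one flattened coordinate row per frame
    pca = PCA(n_components=n_modes).fit(X)
    return pca, pca.transform(X)                # model and per-frame shape parameters
```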
Examples of alternative appearance based features are the discrete cosine trans-
Figure 4.6: This figure is adapted from Lan et al. [2009] and presents a high-level
illustration of how Sieve features are generated. A vertical scan-line from a greyscale
version of the mouth sub-image (a) is shown as an intensity plot (b). The granularity
spectrum from an m-sieve with positive/negative granules shown in red/blue (c).
These granules are then counted, or summed, over all scan-lines to produce the scale-
histogram (d).
form (DCT) and the sieve. A DCT is similar to a Fourier transform, in that it is a
method of converting a signal from the spatial domain to the frequency domain. In the case of a mouth region image, the signal to be processed is the greyscale pixel
intensities of each row or column of the image. The DCT represents the signal as
a sum of coefficients corresponding to high-energy, low-frequency cosine functions
[Almajai and Milner, 2008]. Sieve features decompose an image by progressively
removing features of increasing scale [Bangham et al., 1996] (Figure 4.6). There are
several dierent types of sieve, such as those that operate by removing pixel inten-
sity minima, maxima or both. Using a 1D closing-sieve as an example, each line is
raster scanned (a vertical line of pixels is selected), sieves from scale 1 to scale m
CHAPTER 4. LITERATURE REVIEW 71
(where m is the height of the image) are applied to each line, this attens the pixel
intensity minima, and a histogram across all raster lines of the resulting granularity
measure (the count of the pixels removed as each sieve was applied) becomes the
feature vector for that frame. Sieves can also be multi-dimensional, by considering
the connectivity of pixels in adjacent raster lines. In our visual-only experiments we
will use AAMs, as they are currently the features of choice for the UEA lip-reading
group. This decision is based upon a body of research by the UEA group, comparing
the recognition performance of various features for computer lip-reading (see Section
4.6).
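To make the DCT variant concrete, the following is a minimal sketch (assuming NumPy and SciPy; the function name and the size of the retained coefficient block are illustrative choices, not those of any cited system) of extracting a low-frequency 2D-DCT feature vector from a grayscale mouth sub-image.

import numpy as np
from scipy.fftpack import dct

def dct2_features(mouth_roi, keep=8):
    # Separable 2D DCT of a grayscale mouth sub-image: a type-II DCT is
    # applied along the rows and then along the columns.
    coeffs = dct(dct(mouth_roi.astype(float), axis=0, norm='ortho'),
                 axis=1, norm='ortho')
    # Keep a top-left block of low-frequency (high-energy) coefficients
    # as the feature vector for this frame.
    return coeffs[:keep, :keep].ravel()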
The shape and appearance features described can be used independently, or as a combined feature for recognition. For example, PCA can be applied to concatenated shape and appearance features so that each PCA dimension represents a mode of combined shape and appearance variation. Similarly, visual and audio features can simply be concatenated and classified as a global visual/audio class, or each channel can be classified separately and the individual likelihoods combined in some manner, known as early and late fusion respectively. For in-depth reviews of audio-visual speech recognition, see Potamianos et al. [2003] and Aleksic et al. [2005].
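As a hedged illustration of the two fusion strategies just described (the function names and the fixed stream weight are invented for this sketch, not taken from the cited reviews):

import numpy as np

def early_fusion(audio_feat, visual_feat):
    # Early fusion: concatenate the per-frame audio and visual features
    # into a single vector, modelled by one classifier.
    return np.concatenate([audio_feat, visual_feat])

def late_fusion(audio_loglik, visual_loglik, audio_weight=0.7):
    # Late fusion: each channel is classified separately and the
    # per-class log-likelihoods are combined, here with a stream weight.
    return audio_weight * audio_loglik + (1.0 - audio_weight) * visual_loglik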
4.5 Human LID Experiments
In this section we establish a benchmark of human performance on the task of VLID,
so that we have some indication of the complexity of the task, and so that we can
measure the success of the automatic systems we describe over the course of this
thesis.
Soto-Faraco et al. [2007] conducted a number of VLID experiments using human subjects to determine the extent of our ability to discriminate languages based on the visual correlates of speech. In one experiment, Spanish-Catalan bilingual subjects were shown 48 silent video clips of people speaking fluently in Spanish or Catalan, two extremely phonologically similar languages. The task was to identify whether the language spoken in the current utterance differed from that of the previous one. Results showed above-chance classification accuracy for both Spanish- and Catalan-dominant participants on this task (57.4% and 60.9% accuracy, respectively), and neither group was found to have a statistically significant advantage over the other. The syllabic count in a sentence was identified as a significant influence on accuracy, with longer test utterances (with higher syllabic counts) resulting in higher recognition accuracies.
In a second experiment, to test any biases introduced by subtle non-language cues, the first experiment was repeated but using English and Italian participants unable to speak either test language. Analysis of the results revealed no statistically significant advantage for the participants of either language over the other, and no participants achieved significantly above-random classification accuracies (English speakers achieved 53.1% and Italian speakers reached 52.2% accuracy). The syllabic content in a test sentence was identified as a significant classification factor when measured per participant, rather than per sentence or per participant group. They compared these results with the scores from the previous experiment and validated a hypothesis that each group in the first experiment had outperformed each group in the second. They concluded that the higher accuracy must be due to language familiarity rather than any other non-language differences in the video recitals.
A further experiment sought to discover whether knowledge of only one of the test languages was sufficient to aid discrimination. It was discovered that Spanish monolinguals could discriminate between their native language and a language with which they were not familiar, at above-chance levels (55.8% accuracy). When compared to the results from the first experiment, it was found that bilingual speakers had a marginal advantage in this task, and that bilinguals on average took less time to reach a decision for a given test utterance. A final experiment required participants to identify the language of a presented utterance as either Spanish or Catalan, and to transcribe as many of the words as they could, to see to what extent speech-reading of individual words was responsible for language discrimination. On average, participants attempted transcriptions for 10% of words seen, and only 2.5% of transcriptions were correct or even phonetically similar to the actual words spoken.
These results show that humans are capable, to some extent, of discriminating between two linguistically similar languages by lip-reading alone. It is also apparent from this work that separating non-language cues from those relevant to language discrimination is a difficult but necessary task. Knowledge of the languages to be recognised is shown to be required for improved recognition accuracy, and this suggests that there are features of language which can be learnt in advance and then identified as necessary. This is a useful result, as it shows that some degree of automated visual-only discrimination might be possible, assuming that we could determine the methods employed by humans in this task. These results also show that the task is hard for humans, despite knowledge of both languages, which provides a benchmark for any system developed here.
4.6 Visual-Only Speech Recognition
In audio-visual speech recognition (Section 4.4), it is usual for visual speech information to be coupled with audio in some way to improve the intelligibility of speech under noisy conditions. When audio information is not available or is uninformative (i.e., at a low signal-to-noise ratio), speech recognition must rely upon the visual information alone. Many AV ASR papers present the performance of visual-only recognisers, which are the visual component of their AV system. Presenting this information provides an indication of the informational content of the visual signal, specifically that it is unaffected by artificially added acoustic noise in the audio domain, which justifies the fusion of the audio and visual channels for improved robustness. In practice, there may be some effect of acoustic noise on the production of human speech via the Lombard effect (a feature of human speech where sounds are hyper-articulated, or shouted, in the presence of acoustic noise). Research also exists for visual-only speech recognition, which is of direct importance to our work. Visual-only results show us the range of performance we can expect to achieve in various testing scenarios.
Matthews et al. [2002] describes and evaluates two methods of visual feature extraction for integration into an audio-visual speech recogniser. Also presented in that paper are the video-only recognition results, which pertain to multi-speaker (test speakers are included in the training set), word-level, isolated-letter recognition, using HMMs for speech modelling and low-resolution grayscale video. They used 520 utterances for training, which equates to two recitals of each letter by each speaker, and they use a further recital for testing (260 utterances). The best AAM results presented are 41.9% word accuracy, and 26.9% using active shape models (ASM). Also shown are results for sieve-based approaches, which are shown here to be superior to AAMs at 44.6% word accuracy.
Cox et al. [2008] present work on automatic computer lip-reading, using high-definition, colour video, where the task is to identify unconnected letters of the alphabet. They build word-level models using HMMs, and they compare sieve features to AAM features. The results show that reasonable accuracy of above 80% is achieved for all speakers in a multi-speaker testing scenario, using either feature type. As expected, the same experiments performed using audio features give near-perfect recognition performance. In speaker-independent tests (where the testing speaker has not been used in training), the accuracy drops dramatically to below 10%, and in some cases to around chance level. The AAM features are shown to outperform the sieve features for all speaker-independent tests. This paper also presents a convincing illustration of the strong speaker dependency of the AAM features, versus MFCC features, and cites this as the reason for the poor speaker-independent performance.
A further analysis of the speaker-independent performance of various recognition features is presented in Lan et al. [2009]. Using the GRID corpus [Cooke et al., 2006], which is standard-definition video containing constrained-grammar speech, they built word-level HMMs using several types of recognition features (AAMs, 2D-DCT, 1D-sieve, etc.). AAMs are shown to give a maximum mean word accuracy of 65%, whilst sieve and 2D-DCT features both achieve 40% accuracy (where 19% represents chance by guessing). It is concluded that appearance-derived features in general outperform those derived from shape alone, meaning that the appearance of the mouth contains useful information for computer lip-reading.
A final study in Lan et al. [2010], continuing the work in Lan et al. [2009], suggests some improvements for speaker-independent visual-only speech recognition. They used a custom corpus of 12 speakers reading sentences from the Resource Management corpus [Price et al., 1988], and they built viseme HMMs. Once again they compare DCT and AAM features for recognition, and AAM features are found to outperform DCT features. Two techniques are presented for improving speaker independency: firstly, a per-speaker z-score normalisation is proposed [Newman and Cox, 2010], in which each speaker's features are represented in terms of standard deviations from the mean (Equation 6.1). Secondly, they also apply a discriminative training process known as Hi-LDA, in which feature vectors are stacked from a sliding window, and LDA is applied to the stacked vectors corresponding to each viseme class. Viseme recognition accuracies of around 35% are shown by applying z-score normalisation, and a modest improvement is shown by applying Hi-LDA and z-score in conjunction, giving 44% accuracy.
Separate studies in Potamianos et al. [2003] and Liang et al. [2002] also present their visual-only recognition performances for the visual component of their audio-visual systems. Potamianos et al. [2003] presents the same contrast between multi-speaker and speaker-independent scenarios that we have already seen in other studies. In a digit recognition task using studio-recorded video, they present 61.47% word accuracy in a speaker-independent task and 76.42% using a multi-speaker setup. Liang et al. [2002] uses an approach similar to Hi-LDA in Lan et al. [2010], using PCA features derived from the mouth region of interest. They evaluate their system using the Clients portion of the XM2VTS dataset, and achieve a maximum speaker-independent word recognition accuracy of 57.77%.
4.7 Conclusions
In this chapter we have reviewed a selection of published literature relating to the field of visual-only language identification and also to audio-visual speech recognition. This review spanned a range of scientific disciplines including linguistics, various computer science research areas, and psychology. Since this research topic is new and unexplored, we have had to look to these other fields to find useful and relevant literature from which we can establish the feasibility of this project, draw upon applicable techniques and determine a measure of success for any system we develop.
Firstly, we presented a brief overview of the physiology of human speech. We also discussed human language, from the abstract representation of speech sounds in the human brain, namely phonemes, to the higher-level structures of a language such as syntax. The term viseme is introduced to mean the visual appearance of a spoken phone, and is shown to be used for facial animation and speech recognition. Given the relationship between audio and visual speech, and the uses of visemes in other computer science fields, we can conclude that observing the face during speech is sufficient to determine some spoken sounds or words. Furthermore, we presented highly successful audio LID techniques which operate at a phone level, modelling the phonotactics of different languages and using that information as a method of identification. We have also seen that a low degree of visual-only language discrimination is possible by humans with no formal lip-reading experience, between two phonetically similar languages. We can therefore postulate that it is possible to identify a spoken language from visual information alone, and that the task is hard.
We reviewed a number of papers that presented research into visual-only speech recognition, using recognition features commonly used in the field of audio-visual speech recognition. Each paper showed that speaker-dependent performance on any recognition task was superior to speaker-independent tests. Performance in speaker-independent mode varied from near-random accuracy in letter recognition, to 65% word accuracy with a constrained grammar. One paper suggested that the strong speaker-dependency of the AAM features was the reason for the poor performance, and that multi-speaker tests appear quite unaffected because the test speaker has already been seen by the recognisers. We can deduce from the results presented that DCT and sieve features are subject to the same issues, and that features derived from shape alone are not sufficient for speaker-independent recognition. Despite the poor performance of visual-only recognisers, we have found methods for automatically recognising visual speech (in terms of visemes and words), and that to some extent they can be used to recover a speech transcription without any audio.
It is obviously not ideal that the studies presented here are not all performed using the same datasets. In fact, it is reasonable to assume that the difficulty of the tasks varies significantly, from those recognising highly constrained, simple grammars, to those recognising unconstrained continuous phone sequences. Also, different recognition systems and their associated parameters will naturally cause variations in performance. However, several themes have clearly emerged throughout this review. We have seen that automatic language discrimination is possible in the audio domain by recognising phone sequences, and that visual-only phone recognition is possible to some extent. Given these findings, we focussed our research on applying audio LID techniques to the visual features routinely used in audio-visual and visual-only speech recognition approaches. The main research questions highlighted by this review, and motivating this research, are whether or not there is sufficient information presented visually to recover enough phonetic information to discriminate languages, and whether or not this information is consistently recognisable across speakers for speaker-independent recognition.
Chapter 5
Speaker-Dependent Visual
Language Identification
5.1 Introduction
The literature review in Chapter 4 showed that audio language identification (LID) is a mature technology which can achieve a high identification accuracy from only a few seconds of representative speech; by contrast, visual-only recognition is a relatively new field of research. As visual speech processing has developed in the last few years, it is interesting to enquire whether language could be identified purely by visual means. This has practical applications in systems that use either audio-visual speech recognition [Potamianos et al., 2003] or pure lip-reading [Matthews et al., 2002] in noisy environments, or in situations where the audio signal is not available. See Section 1.1 for the full motivation for this thesis.
In this chapter, we describe preliminary experiments in visual language identification (VLID), in which only lip shape and lip motion are used to determine the language of a spoken utterance. We focus on the task of discriminating between two or three languages spoken by the same speaker, because appearance-based recognition features have been shown to be strongly speaker-dependent [Cox et al., 2008]. The
speakers used were selected from a custom dataset that we recorded especially for these experiments. The system developed here is based upon a standard audio LID approach, where feature vectors are tokenised and then language models are estimated from the streams of tokens produced by each language. This method relies upon the phonotactic differences between languages, namely that phonotactic rules specify the permissible sequences of phones for different languages.
It is known that visual cues can be used by humans in speech processing and contribute to intelligibility [Matthews et al., 2002], but performance in lip-reading is much lower than using audio, even by trained lip-readers. Studies have shown that humans are also capable of identifying language from purely visual cues [Soto-Faraco et al., 2007], but again performance is much lower than that obtained using audio signals. The difference between audio and visual performance is due to two factors. The first is that speech communication has evolved in such a way that the audio, rather than the visual, signal has been optimised for error-free communication. The second is that, as the visual signal is a secondary communication channel, most people do not develop their lip-reading ability. It is therefore not clear to what extent the task of identifying language from facial features is difficult purely because it is an unusual one that most people are not skilled in, or because the information required for discrimination is not present in the visual features.
The visual communication units of speech which correspond to phonemes are known as visemes [Fisher, 1968] (an abbreviation of visual phonemes). Language identification using only the visual correlates of speech poses a significant challenge, as there are generally considered to be fewer distinct visemes than phonemes. Broadly speaking, there is a many-to-one mapping from phonemes to visemes, increasing the possibility of confusion between speech units, and therefore increasing the difficulty of language identification. Therefore, the purpose of the work described in this chapter was to determine if language discrimination is possible from visual information alone, despite the lack of information compared to the audio channel.
This chapter is structured as follows: Section 5.2 describes the video dataset recorded for this language identification task and specifies the speakers we use in the experiments described here. The developed visual-only LID system is described in Section 5.3. Section 5.4 explains the test procedure adopted here, presents results produced by the system and discusses our findings. This chapter is concluded in Section 5.5, where the motivations for the next chapter are also presented.
5.2 Approach and Dataset
The literature review in Chapter 4, particularly the work in Cox et al. [2008], showed that there is strong speaker dependency in the visual features commonly used for audio-visual and visual-only speech recognition. This means that there is little or no correspondence between the feature spaces occupied by different speakers. Therefore, we decided that until some features that exhibited greater speaker independence had been developed, we would discriminate between two or more languages read by a single speaker. Hence we chose to record an audio-visual dataset of multilingual speakers (Section 3.2). This approach also has the advantage of focusing on the purely language-specific aspects of the task, whilst ensuring that speaker identity is not inhibiting discrimination.
As Section 3.2 explains, the dataset recorded contains 26 subjects. The video was captured at 25 frames per second, and at a resolution of 576 x 768 pixels (downsampled to 480 x 640). Most of these subjects were fluent in at least two different languages, some in three. Typically, these languages consisted of their mother-tongue and a language that they had spoken for several years in an immersive environment. Each subject read a script to a camera in all of the languages in which they were proficient. The script chosen was the Universal Declaration of Human Rights [UN General Assembly, 1948] (Appendix A), as translations of this text are available in over 300 languages. Subjects were asked to read up to and including the first 16 articles of the declaration, a text of about 900 words and typically lasting about 7 minutes. The subjects were instructed to keep as still as possible, to face the camera, and to avoid occluding their face. They were asked to continue reading regardless of any small mistakes in their recital.
5.3 VLID using AAMs and Vector Quantisation
Figure 5.1: Unsupervised VLID system diagram, using VQ tokenisation of AAM
feature frames. This system is based upon the PRLM architecture used in audio LID.
Figure 5.1 shows the automatic video language identification system developed here. It is a variation of the phone recognition followed by language modelling (PRLM) architecture described in Zissman [1996]. PRLM-based techniques tokenise input features in some way (at a frame level, phone level, or word level) and then build language models from the token sequences to capture the phonotactic rules of a language. Section 2.2 explains the processes involved in more detail, but this section will briefly illustrate how we implement each element of our PRLM system for the visual discrimination of languages.
The video data for each speaker is tracked using an active appearance model (AAM), as described in Section 5.3.1. The vectors this process produces are first clustered using vector quantisation (VQ), detailed in Section 5.3.2, allowing the training data to be tokenised as VQ symbols and bigram language models to be built from the resulting VQ transcriptions. In testing, the AAM vectors are transcribed in the same way and each language model produces a likelihood, which is classified by the method outlined in Section 5.3.3.
5.3.1 Features: Active Appearance Models
The AAM tracks the face and lips and produces a vector representing the shape
and appearance for each frame of video (Section 2.4). However, the parameters
corresponding to non-lip elements and also to the mouth appearance are included
only for the purpose of assisting tracking capability and are discarded for training
and testing, meaning the experiments described here only use the parameters that
describe lip shape. Principal component analysis (PCA) is applied to the set of
vectors for an individual speaker to reduce the dimensionality. The first four PCA components represent translation (x and y), rotation and scale. These modes of variation are explicitly added to our AAM for the purpose of tracking, but are discarded for feature extraction, since the location and orientation of the mouth is not an important linguistic feature. Removing these dimensions leaves around five
components to describe lip shape in these experiments.
AAM generation requires a small number of ground truth frames from which
we build a statistical model to be used for tracking. The frames selected are chosen
to represent the extremities in shape and appearance that the tracker can expect
to encounter. In the system described here, an AAM is built for each speaker using
a manual selection of typical frames. These are taken from near the start, middle
and end of each language for a single speaker, totalling no more than 15 frames per
language.
5.3.2 Feature Modelling: Vector Quantisation
It is common for phonetic transcriptions to be mapped to so-called visemic transcriptions [Visser et al., 1999; Ezzat and Poggio, 2000], which are many-to-one mappings, grouping phones according to how they are visually confused. However, because there is no agreed method for transcribing visual speech, and to see if such a mapping was necessary for language discrimination, we decided to adopt an unsupervised approach to LID in this work. Here, we use a version of the system described in Torres-Carrasquillo et al. [2002], in which phonetic transcriptions are not required to train monophone models, since feature vectors are tokenised at a frame level.
In Torres-Carrasquillo et al. [2002], a Gaussian mixture model (GMM) was used, and the ID of the mixture component that was most likely to have generated a frame was recorded and used as the input to the language model element of the LID system. We have replaced this process with a straightforward vector quantisation (VQ) process, which it closely resembles. We adopted this approach because it is computationally inexpensive, and preliminary audio LID experiments using VQ tokenisation gave comparable performance to LID using GMM tokenisation. Once all AAM vectors have been produced from the video data, the designated training segment is vector quantised using a top-down hierarchical clustering process (Section 2.5) and the diagonal-covariance Mahalanobis distance metric.
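As a concrete illustration of this tokenisation front-end, the sketch below (assuming NumPy and SciPy) scales each dimension by its standard deviation, so that Euclidean distance behaves as a diagonal-covariance Mahalanobis distance, and then maps each AAM frame to its nearest codeword. A flat k-means stands in for the top-down hierarchical splitting of Section 2.5, purely to keep the example short; the function names are illustrative.

import numpy as np
from scipy.cluster.vq import kmeans2, vq

def train_codebook(frames, n_codes=64):
    # Scaling each dimension by its standard deviation makes Euclidean
    # distance behave as a diagonal-covariance Mahalanobis distance.
    scale = frames.std(axis=0)
    codebook, _ = kmeans2(frames / scale, n_codes, minit='++')
    return codebook, scale

def tokenise(frames, codebook, scale):
    # Map every AAM frame to the index of its nearest codeword.
    codes, _ = vq(frames / scale, codebook)
    return codes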
5.3.3 Language Model Likelihood Classification
Bigram language models for each language recorded by a speaker are built from the VQ codeword transcriptions of the training data for each language from that speaker (Section 2.7). Unseen codewords are smoothed to a count of one during generation of the language models. Test data is transcribed into codewords in the same way as the training data, and each language model produces a likelihood for a given utterance. Back-off weights are calculated and used for unseen bigrams in the test data. Classification of a test utterance is determined by the bigram language model producing the highest total likelihood for the given utterance. This is calculated by finding the sum of the log probabilities from a language model across all frames in a test utterance, giving the total probability of a test utterance given a language model.
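A minimal sketch of this language-modelling back-end over VQ token sequences is given below (add-one smoothing replaces the back-off scheme described above, simply to keep the example brief; the class and function names are illustrative).

import math
from collections import Counter

class BigramLM:
    # Add-one smoothed bigram model over VQ token sequences; the thesis
    # uses back-off weights for unseen bigrams, but add-one smoothing
    # keeps this sketch short.
    def __init__(self, sequences, n_codes):
        self.n_codes = n_codes
        self.bigrams = Counter()
        self.histories = Counter()
        for seq in sequences:
            for a, b in zip(seq[:-1], seq[1:]):
                self.bigrams[(a, b)] += 1
                self.histories[a] += 1

    def log_prob(self, seq):
        # Total log probability of a token sequence under this model.
        return sum(math.log((self.bigrams[(a, b)] + 1.0) /
                            (self.histories[a] + self.n_codes))
                   for a, b in zip(seq[:-1], seq[1:]))

def classify(token_seq, models):
    # models: dict mapping language name -> trained BigramLM;
    # the language whose model gives the highest likelihood wins.
    return max(models, key=lambda lang: models[lang].log_prob(token_seq))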
5.4 Experiments
Cross-fold validation was used to evaluate the performance of the LID system devel-
oped here. Experiments were performed on single speakers to account for the high speaker dependency of the AAM features extracted, as explained in Section 5.2 and as motivated by the review in Chapter 4. An equal number of AAM vectors from each
language of a single speaker were divided sequentially and exhaustively to give test
utterance durations of either 60, 30, 7, 3 or 1 seconds. As an example, if a speaker
read the declaration in English (lasting 6 minutes) and French (lasting 7 minutes),
the frames in the shorter recital would be divided into 6 one-minute, 12 30-second,
51 7-second, 120 3-second and 360 1-second test utterances. The longer recitals are
trimmed to the length of the shorter ones and are partitioned consistently with the
shortest one, to ensure balanced training data. A single test utterance is selected
from each language and all remaining test data is used for training. In each experi-
ment the system must classify a test utterance as one of the two or three languages
spoken by this particular speaker. The number of codewords used to vector quantise
the data is also an experimental parameter, ranging from 8 to 256 codes.
Partitioning the data in the way described above means that the number of test
utterances for shorter test durations greatly exceeds the number of longer duration
utterances, e.g. for the 60 second utterance tests in the three language case in Figure
5.4, there are only 21 test utterances in total. Hence, a single misclassification of a 60-second utterance translates to a 4.8% drop in average percentage accuracy, which, although apparently large, may not be statistically significant.
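The partitioning described above amounts to an exhaustive, non-overlapping split of each recital into fixed-duration blocks of frames. A minimal sketch, assuming the 25 frames-per-second rate of the dataset (the function name is illustrative):

def partition_frames(frames, duration_s, fps=25):
    # Split a recital (a sequence of AAM frames) into consecutive,
    # non-overlapping test utterances of a fixed duration; any
    # incomplete tail is discarded.
    frames_per_utt = int(duration_s * fps)
    n_utts = len(frames) // frames_per_utt
    return [frames[i * frames_per_utt:(i + 1) * frames_per_utt]
            for i in range(n_utts)]

# Example: a 6-minute recital at 25 fps has 9000 frames, so
# partition_frames(list(range(9000)), 7) yields 51 seven-second utterances.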
5.4.1 Initial Experiments
The results shown here are from three separate LID experiments: one three-language discrimination experiment and two two-language experiments. Each figure shows the mean error in percent for each duration of test utterance. Tests using codebooks containing between 8 and 256 codewords are presented.
[Plot: mean error (%) against test utterance duration (1, 3, 7, 30 and 60 seconds), with one curve per VQ codebook size (8, 16, 32, 64, 128 and 256).]
Figure 5.2: A VLID system is trained on two separate recitals of the UN declaration, read by one speaker in two different languages: English and Arabic. The task is to identify the language of some unseen test data from that speaker. This plot presents the results for this experiment for a range of different VQ codebook sizes.
Figures 5.2, 5.3 and 5.4 show the results of tests on an English/Arabic bilingual speaker (UN1 speaker 8), an English/German bilingual speaker (UN1 speaker 6) and an English/French/German trilingual speaker (UN1 speaker 22). The performance of the three speakers' 256-codeword systems is also shown in Figure 5.5, including the mean performance across the speakers. Figures 5.3 and 5.4 show a degree of discrimination between the presented languages, although the performance of these speakers is lower than that shown in Figure 5.2. This suggests that different language pairs are not equally discriminable, either because of true intra-language variation or because of the speaking style of those individual speakers in a given language. The figures suggest that classification error decreases as the test utterance duration increases, and low error can be achieved for longer utterances. However, at this stage it seemed somewhat improbable to us that one-second utterances would be sufficient to provide the high discrimination performance between two languages (as shown in Figure 5.2) or between three languages (Figure 5.4). Furthermore, the performance of the eight-codeword systems in these figures suggests that eight mouth shapes are sufficient to discriminate between three languages, which was a surprising result considering audio identification uses sets containing over 40 phones. Given these results, it was decided to investigate the extent to which unintended effects during recording may have biased results. These would include changes in lighting intensity and colour during the recording and changes in pose. Since we use only the shape contours of the mouth, we have largely removed the effect of lighting on our features, although these factors could have affected the performance of the tracker, potentially leading to uneven tracking performance. However, we checked carefully for these effects and were satisfied that there were no systematic differences between the lighting in each video.
[Plot: mean error (%) against test utterance duration (1, 3, 7, 30 and 60 seconds), with one curve per VQ codebook size (8, 16, 32, 64, 128 and 256).]
Figure 5.3: A VLID system is trained on two separate recitals of the UN declaration, read by one speaker in two different languages: English and German. The task is to identify the language of some unseen test data from that speaker. This plot presents the results for this experiment for a range of different VQ codebook sizes.
[Plot: mean error (%) against test utterance duration (1, 3, 7, 30 and 60 seconds), with one curve per VQ codebook size (8, 16, 32, 64, 128 and 256).]
Figure 5.4: A VLID system is trained on three separate recitals of the UN declaration, read by one speaker in three different languages: English, French and German. The task is to identify the language of some unseen test data from that speaker. This plot presents the results for this experiment.
[Plot: mean error (%) against test utterance duration (1, 3, 7, 30 and 60 seconds), with one curve per multilingual speaker (English-Arabic, English-German, English-French-German) plus their mean.]
Figure 5.5: This plot presents the VLID error for each of the multilingual speakers tested in this section. The results shown are for the 256 VQ-codeword systems as presented in Figures 5.2, 5.3 and 5.4. Additionally, the mean performance for the three speakers (using 256 codewords) is presented.
5.4.2 Removing Rate of Speech
Another possible explanation for the low error achieved was that the rate of speech in each recital might be assisting discrimination. Rate of speech is commonly considered to be a measurable characteristic that varies over different languages, but in Roach [1998], Roach suggests that this is a simplistic view. He shows that the number of spoken phones per second in different languages varies only slightly, and claims that the perceived variations in speed are related to how well you understand the languages in question. However, when we measured the duration of the recitals for our speakers, we found that they tended to speak their native tongue faster than their other languages. In other words, recital speed was correlated with language fluency.
In a low-codeword system, each codeword represents a broader area of the feature space, and since rate of speech is linked to the rate of change of the features, we would expect to see longer runs of the same codeword in slower or less fluent speech. Such a characteristic would be modelled by the bigram language models and would therefore contribute towards classification effectiveness. To test the hypothesis that we were actually measuring differences in rate of speech rather than differences in languages, we performed a similar experiment to the one shown in Figure 5.4, except that repetitions of the same codeword were ignored and treated as a single occurrence of the codeword. The accuracy of the eight-codeword system dropped significantly, suggesting that rate of speech was indeed having an effect on performance.
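The treatment of codeword repetitions described above is effectively a run-length collapse of the token stream; a minimal sketch (the function name is illustrative):

from itertools import groupby

def collapse_runs(codewords):
    # Treat consecutive repetitions of the same VQ codeword as a single
    # occurrence, discarding most of the rate-of-speech information.
    return [code for code, _ in groupby(codewords)]

# collapse_runs([5, 5, 5, 2, 2, 7, 5, 5]) -> [5, 2, 7, 5]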
Systems using a higher number of codewords were not as affected, as finer clustering of the vector space results in close clusters of data being represented by a number of different codewords, and hence groups of different codewords, rather than runs of the same codeword, are likely to be observed in slowly changing speech. It is also interesting to observe the performance of the lower-codeword systems in Figure 5.3, which are produced by the speaker whose bilingual fluency we subjectively judged to be the best of those speakers presented here, and whose recitals of the UN declaration in each language are almost equal in duration. It would seem that rate of speech does indeed affect the classification accuracy, though the extent to which it contributes is not easily determined from these experiments.
5.4.3 Testing Rate of Speech
As a test of the sensitivity of our system to variations in speaking rate, we tested it to see whether it could discriminate between three recitations of the same language recorded at different speaking speeds. The system was trained on a single speaker reading three recitals of the UN declaration in English, read at three different speeds: very slow, a normal reading pace and very fast. The test here is whether or not each session is identifiable on rate of speech alone. Rate of speech can affect an utterance's phonetic content through the effects of co-articulation: for instance, assimilation and deletion of phonemes are more prominent in rapid speech. It is probable, therefore, that such a large difference in speech rate, as tested here, will alter the phonetic and thus the visual appearance of the speech, resulting in some ability to discriminate between sessions despite them containing the same language.
[Plot: mean error (%) against test utterance duration (1, 3, 7, 30 and 60 seconds), with one curve per VQ codebook size (8, 16, 32, 64, 128 and 256).]
Figure 5.6: A VLID system is trained on three separate recitals of the UN declaration, read by the same speaker, in English, at three vastly different speeds. The task is to see if there are sufficient differences between the extreme recital speeds to facilitate discrimination, despite the three recitals containing the same language. This plot presents the results for this experiment.
Figure 5.6 shows that similar discrimination is achieved to that in the three-language identification task of Figure 5.4. However, the speed variation in Figure 5.6 is extreme: the durations of the readings of the text at fast, medium and slow speeds were respectively 4.6, 6.2 and 7.8 minutes, whereas the durations of the texts read in three different languages (Figure 5.4) were 7.2, 7.8 and 9.0 minutes. Hence, when different languages were processed, a much smaller speed variation gave about the same discrimination performance, which indicates that there is an effect of language present.
5.4.4 Testing Session Biases
As a final test of the sensitivity of our system to variations in recording conditions, we tested it to see whether it could discriminate between three recording sessions that we had designed to be identical: the same speaker reading the same material in the same language at the same speed. Without any deliberate variation in rate of speech and in language, the system should be unable to discriminate between sessions, and accuracy should therefore be random at around 33% (66% error). Figure 5.7 does show a significant reduction in system performance when compared to Figures 5.4 and 5.6. However, the results are statistically better than random. We can confidently exclude tracking consistency and subtle lighting differences as the causes of this difference, since the AAM is trained with equal amounts of data from all sessions and only shape features, rather than shape and appearance, are used for testing. It is more likely that there is a small physical difference between sessions, such as slight pose variations, or that reading performance across sessions was sufficiently different to make the sessions distinguishable.
[Plot: mean error (%) against test utterance duration (1, 3, 7, 30 and 60 seconds), with one curve per VQ codebook size (8, 16, 32, 64, 128 and 256).]
Figure 5.7: A VLID system is trained on three separate, identical recitals of the
UN declaration in English, spoken by the same speaker. The task is to see if there
are recording session biases which facilitate discrimination, despite the three recitals
containing the same language. This plot presents the results for this experiment.
5.5 Conclusions
This chapter has presented a preliminary study in identifying language purely from visual features. We demonstrated an unsupervised approach to VLID, in which AAM frames are tokenised into VQ codewords, and language models are built from the token sequences for each language. Because we did not have access to visual features that were independent of the identity of the speaker, we used a custom dataset containing multilingual speakers and attempted to discriminate the two or three different languages spoken by each person. These results are equivocal, because they show that speaking-rate can certainly play a part in identification, and speaking-rate is inextricably bound up with a speaker's fluency in a certain language. A further experiment showed that apparently even very small differences in a speaker's recital (possibly pose, lighting conditions or recital speed) were picked up by our system and were sometimes classified with above-random accuracy. However, the fact that different languages spoken at rather similar speeds were as well discriminated as a single language spoken at three extreme speeds indicates that there is a language effect present in these results. Further evidence to confirm that the language effects presented here are genuine comes from the speaker-independent VLID experiments in later chapters, where we successfully discriminate languages without prior knowledge of a speaker's identity (Chapters 6 and 7).
To determine the suitability of this technique for visual-only LID, we must first ascertain the contribution to discrimination performance of visual differences caused by language, ignoring non-language (namely speaker- or session-specific) variation. The most effective way of achieving this is to average out factors such as the rate of speech, pose and any other potential biases by performing speaker-independent language identification on a larger number of speakers. To do this, we must overcome the strong speaker dependency of our AAM features which was illustrated in Chapter 4. The unsupervised approach we took here was selected because it did not require us to impose any rules onto our visual features (such as any sort of mapping between phones and visemes) and instead used the inherent structure of our feature data. However, experiments in audio LID have shown that phone-based tokenisation outperforms frame-based methods, and therefore we will consider such an approach in the next chapter. Chapter 6 will also evaluate this system in a speaker-independent setting, by excluding the test speakers from the training set. In order to do this, we will focus on improving the speaker independence of our recognition features and seek to improve the discriminatory capabilities of our VLID system.
Chapter 6
Speaker-Independent Visual
Language Identification using the
UN1 Dataset
6.1 Introduction
This chapter presents a preliminary study on speaker-independent visual language identification (VLID), in which only lip shape, appearance and motion are used to determine the language of a spoken utterance. In the previous chapter, we had shown that identification of languages using visual features is possible in speaker-dependent mode, i.e. identifying the language spoken by a multilingual speaker. We achieved this by using sub-phonetic units in a manner similar to GMM-tokenisation in audio LID [Zissman, 1996]. Here, we attempt to extend this to speaker-independent discrimination of two languages. This means that we have to abandon the approach used in the previous chapter [Newman and Cox, 2009], where we used speaker-dependent vector quantisation (VQ) codebooks to encode the individual speaking style of a speaker. Instead, we will use visual units which are common to many speakers.
The visual communication units of speech we use here are known as visemes [Fisher, 1968]. A viseme is described in Fisher [1968] as the visual appearance of a phoneme, but the exact relationship between phonemes and visemes in continuous speech is still a matter for ongoing research [Hilder et al., 2010]. Language identification using visemes poses a significant challenge as there are fewer distinct visemes than phonemes. To a first approximation, there is a many-to-one mapping from phonemes to visemes, so that when visemes are used, there is an increased possibility of confusion between speech units, and hence an increased difficulty of language identification. Also, we have shown that the features that we extract from the face are highly speaker-dependent [Cox et al., 2008], which may limit the performance of a speaker-independent system. Here, by appropriately modifying techniques that have been successful in audio language identification (LID), we extend our work to discriminating two languages in speaker-independent mode, using viseme recognition as our tokenisation front-end.
This chapter is structured as follows: Section 6.2 describes the approach to LID adopted in this chapter, and specifies the speakers we selected from our UN1 (Section 3.2) dataset for these experiments. The developed visual-only LID system is described in Section 6.3. Section 6.4 explains the test procedure to be used, presents results produced by the system and discusses our findings. Section 6.5 concludes the chapter and motivates the next chapter.
6.2 Approach and Dataset
Our work in Chapter 5 [Newman and Cox, 2009] showed clearly that using sub-phonetic units in a manner similar to audio LID systems was sufficient to discriminate languages in speaker-dependent experiments. Given that phone recognition generally outperforms sub-phone tokenisation techniques in audio LID, and that preliminary work suggested that considering longer temporal trajectories provided greater consistency across speakers, this chapter focusses on speaker-independent viseme modelling. The approach we have adopted is an adaptation of the parallel phone recognition followed by language modelling (PPRLM) approach described in Zissman [1996] and detailed in Section 2.2. This change of approach allows us to measure the accuracy of our system in terms of units directly related to speech (visemes), which should give a deeper understanding of the speaker independence challenges, and how they can be tackled to facilitate language discrimination.
6.2.1 VLID Dataset
For the experiments described here, we used the database described in Section 3.2 (UN1), which consisted of recordings from 26 subjects. These subjects were fluent in at least two different languages, some in three. Typically, these languages consisted of their mother-tongue and a language that they had spoken for several years in an immersive environment. In this work we have focussed on the task of discriminating between English and French spoken by five speakers, as they are the languages for which the UN1 dataset has the most video data. The speakers from UN1 that we use here are numbers 22, 16, 14, 9 and 24, and in this chapter we will refer to these speakers as 1 to 5. This two-language case will allow us to analyse the problems faced more closely and, if successful, could be extended to include more languages later.
6.3 Parallel Viseme Recognition Followed by Lan-
guage Modelling
Figure 6.1 shows the automatic visual language identification system developed for this work. The video data is tracked using an active appearance model (AAM), as described in Section 6.3.1. Audio transcriptions of the video and the AAM vectors are used to train language-specific tied-state viseme HMMs, detailed in Section 6.3.2. Training data is then automatically transcribed as a sequence of visemes, from which language models can be built, and the language model likelihoods are processed using a support vector machine (SVM) discriminative classifier, as outlined in Section 6.3.3.
[Diagram: AAM features are extracted from the video frames and passed to parallel English and French viseme recognisers; the output of each recogniser is scored by both an English and a French bigram language model, and the four resulting likelihoods, Pr(En | EnEnLM), Pr(Fr | EnFrLM), Pr(En | FrEnLM) and Pr(Fr | FrFrLM), are passed to an SVM classifier, which outputs the hypothesised language.]
Figure 6.1: Visual-only LID System Diagram
6.3.1 Features: Active Appearance Models
An AAM tracks the face and lips and produces a vector representing the shape and
appearance for each frame of video (Section 2.4). However, the parameters corre-
sponding to non-lip elements are included only for the purpose of assisting tracking
capability, and are discarded for training and testing, so that the vector consists
only of parameters that describe the lip shape and appearance. Principal compo-
nent analysis (PCA) is applied to the set of vectors for an individual speaker to
reduce the dimensionality. As in the previous chapter, the rst four PCA compo-
nents are discarded for feature extraction, since they represent translation (x and y),
rotation and scale. This leaves between 50 and 60 components to describe combined
lip shape and appearance for our speaker-independent experiments.
We examined typical AAM features and found the distribution of values within
each dimension to be approximately Gaussian (Figure 6.2), although means, vari-
ances and scale varied from speaker to speaker, and from dimension to dimension.
Given this, each AAM dimension was z-score normalised per speaker, per language,
in an attempt to reduce the speaker dependency of the features. Figure 6.2 shows
Figure 6.2: Histograms of the original AAM feature dimensions for two speakers from the UN1 dataset (Speakers 1 and 4). Histograms are presented for the first two shape and appearance dimensions. The figure shows that mean and variance vary across dimensions, as does scale, and that the distributions are approximately Gaussian.
that the first appearance mode is bimodal, for both speakers. Since the mouth can either be open or shut, and the first appearance mode corresponds to the brightness of the mouth interior, it is possible that these two mouth states are responsible for the peaks in the distributions. Before normalisation, AAM features are well separated between speakers (Figure 6.3), meaning that there is no correspondence between the feature vectors for each subject. After applying the z-score, the relative distance between speakers is much smaller (Figure 6.4). The z-score is a normalisation method similar to Feature Mean Normalisation (FMN), described in Potamianos and Potamianos [1999], except that the distance from the mean is expressed in standard deviations (Equation 6.1):

z(j) = \frac{x(j) - \bar{x}(j)}{\sigma(j)}    (6.1)

where z(j) is the jth dimension of the normalised vector z, x(j) is the unnormalised value in dimension j of the input feature vector x, \bar{x}(j) is the mean of the speaker's vectors in dimension j, and \sigma(j) is the standard deviation in dimension j.
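A minimal NumPy sketch of this per-speaker, per-language normalisation (Equation 6.1); the function name is illustrative:

import numpy as np

def z_score_normalise(features):
    # Per-speaker, per-language z-score (Equation 6.1): each AAM
    # dimension is expressed in standard deviations from that speaker's
    # mean. `features` is an (n_frames, n_dims) array for one speaker
    # in one language.
    return (features - features.mean(axis=0)) / features.std(axis=0)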
Figure 6.3: A 2D distance-preserving projection (Sammon map [Sammon, 1969]) of 250 randomly sampled un-normalised AAM features from each of the 5 UN dataset speakers used in Chapter 6. The figure shows that each speaker is well separated, and therefore that there is strong speaker dependency encoded by AAM features.
Figure 6.4: This figure is the same as Figure 6.3, except that a z-score normalisation was applied to the features of each speaker. Here, we can see that the speakers' feature spaces are closer than in Figure 6.3, suggesting a possible improvement to the speaker independency of our features.
The data was also linearly interpolated from 25 Hz to 100 Hz, as in Almajai and Milner [2009], in order to raise the sample rate of the visual signal to that of typical MFCC data and hence provide a suitable number of visual frames to train three-state HMMs. Although such up-sampling does not, of course, provide any new information, it avoids the problems that are encountered when there are only a few frames per state available.
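A minimal NumPy sketch of this up-sampling step (the factor of four corresponds to the 25 Hz to 100 Hz change described above; the function name is illustrative):

import numpy as np

def upsample_features(features, factor=4):
    # Linearly interpolate an (n_frames, n_dims) visual feature matrix
    # from 25 Hz to 100 Hz (factor 4) so its rate matches typical MFCCs.
    n_frames, n_dims = features.shape
    old_t = np.arange(n_frames)
    new_t = np.linspace(0.0, n_frames - 1, (n_frames - 1) * factor + 1)
    return np.column_stack([np.interp(new_t, old_t, features[:, d])
                            for d in range(n_dims)])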
To improve the discrimination of the features we extract, we weight the ith feature dimension d_i by the mutual information between the feature dimensions and the viseme classes. The mutual information was estimated for each dimension by pooling the training vectors and labelling them according to their corresponding viseme. Then, for each dimension of the feature space, the training-data values (over all viseme classes) were quantised using a linear quantiser with 16 levels. The mutual information between class C_k and d_i is then estimated as follows:

I(C, d_i) = \sum_{k=1}^{K} \sum_{l=1}^{L_i} \Pr(C_k, d_i(l)) \log \left[ \frac{\Pr(C_k \mid d_i(l))}{\Pr(C_k)} \right],    (6.2)

where d_i(l) is the lth quantisation level in dimension d_i and L_i = 16. By weighting the feature vectors in this way, we give greater importance to the AAM dimensions which are most useful for discriminating the viseme classes, whilst giving less weighting to the least important, which we might expect to be the more speaker-dependent dimensions.
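The weighting of Equation 6.2 can be sketched as follows (assuming NumPy, with features as an array of pooled training vectors and viseme_labels as the corresponding per-frame viseme classes; the names and the linear quantiser over the observed range are illustrative assumptions):

import numpy as np

def mutual_information_weights(features, viseme_labels, n_levels=16):
    # Estimate I(C, d_i) for every feature dimension (Equation 6.2) by
    # linearly quantising that dimension into 16 levels and accumulating
    # joint probabilities with the viseme class labels.
    n_frames, n_dims = features.shape
    classes = np.unique(viseme_labels)
    weights = np.zeros(n_dims)
    for i in range(n_dims):
        edges = np.linspace(features[:, i].min(), features[:, i].max(),
                            n_levels + 1)
        levels = np.digitize(features[:, i], edges[1:-1])  # values 0..15
        mi = 0.0
        for c in classes:
            p_c = np.mean(viseme_labels == c)
            for l in range(n_levels):
                p_joint = np.mean((viseme_labels == c) & (levels == l))
                if p_joint > 0.0:
                    p_l = np.mean(levels == l)
                    mi += p_joint * np.log(p_joint / (p_c * p_l))
        weights[i] = mi
    return weights

# The weighted features are then simply features * weights, broadcast
# across frames.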
6.3.2 Viseme Modelling and Phoneme to Viseme Mapping
Visemes are modelled using tied-state triphone viseme models (Section 2.6.2). Triphone models with a lower number of mixture components per model are preferred to monophones with a higher number of components, because triphones are better able to model coarticulation. Coarticulation is likely to occur more in visual speech than in audio, as humans are generally not concerned with the visual consistency of the speech articulators. Instead, the efficient production of speech requires certain articulatory targets to be met, whilst the position of other articulators may be unimportant for a specific speech sound [Jackson and Singampalli, 2009]. Building viseme HMMs requires labelled training data, and so we map the audio transcriptions of our visual data into visemes.
Viseme mappings define a high-level relationship between phonemes and visemes, and although they do not adequately take account of the speaker-independent features of visual speech, the effect of coarticulation, or the subtleties of intra-language variation, they provide a straightforward way of relating the audio and visual domains. Although the mapping between phonemes and visemes is complex, in general there is considered to be a many-to-one relationship from phonemes to visemes. By applying a phoneme-to-viseme mapping to our audio transcriptions, we effectively represent our visual data as sequences of phoneme super-classes.
Table 6.1: Phoneme to viseme mapping
IPA Arpabet Viseme IPA Arpabet Viseme
p P k K
b B /p/ g G
m M n N
f F /f/ /k/
v V l L
t T N NG
d D h HH
s S /t/ j Y
z Z I IH
T TH I@r IA /iy/
D DH i IY
w W /w/ 2 AH
r R @ AX /ah/
tS CH aI AY
dZ JH /ch/ O AO
S SH OI OY /ao/
Z ZH oU OW
E EH U@r UA
eI EY A AA
AE A /aa/
aU AW /eh/ a
: ER 6 OH
E@r EA o
E 4 /oo/
e y
U UH /uh/ o
u UW
/oen/ SIL SIL /sil/
SP SP -
There is currently no universally agreed method for transcribing visemes, but
several mappings between visemes and phonemes exist. For this work we have cho-
sen the scheme described in Lee and Yook [2002] to map English phonemes (notated
using the ARPAbet system) into visemes. A small number of infrequently occurring
ARPAbet phonemes do not appear in the mapping, and have been manually placed
CHAPTER 6. SPEAKER-INDEPENDENT VLID: UN1 105
within visually similar viseme groups. Since the phonemic inventories of many lan-
guages overlap, it is also likely that there is an overlap in the inventory of visual
gestures; however, the precise nature of the similarities in the visual domain is not
well explored. Using the mapping in Fisher [1968], we found where the French IPA
pronunciations overlapped with the English Arpabet equivalents, and combined the
appropriate classes to generate our own French mapping. We merged some French
visemes which appeared extremely infrequently and for which we had limited train-
ing data. Although alternative phoneme to viseme mappings for French do exist,
they are used for visual speech synthesis, rather than recognition. The short pause
model often used in Automatic Speech Recognition (ASR) was discarded from our
set, as we found that the visual articulators do not adopt a consistent position dur-
ing word transitions, but rather move towards the next spoken phoneme. Table 6.1
shows the final mapping we applied to the audio transcriptions of our visual data.
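A minimal sketch of how a many-to-one mapping such as Table 6.1 can be applied to a phone-level transcription is given below (Python; the dictionary covers only the consonant groups and is illustrative rather than the complete table):

    # Illustrative subset of the phoneme-to-viseme mapping in Table 6.1
    # (Arpabet keys; vowel and French entries would be added in the same way).
    PHONE_TO_VISEME = {
        "P": "/p/", "B": "/p/", "M": "/p/",
        "F": "/f/", "V": "/f/",
        "T": "/t/", "D": "/t/", "S": "/t/", "Z": "/t/", "TH": "/t/", "DH": "/t/",
        "W": "/w/", "R": "/w/",
        "CH": "/ch/", "JH": "/ch/", "SH": "/ch/", "ZH": "/ch/",
        "K": "/k/", "G": "/k/", "N": "/k/", "L": "/k/", "NG": "/k/", "HH": "/k/", "Y": "/k/",
        "SIL": "/sil/",
    }

    def map_transcription(phones):
        """Map a phone-level transcription to viseme labels, dropping the
        short-pause model (SP) as described in the text.  Phones outside this
        illustrative dictionary are passed through unchanged."""
        return [PHONE_TO_VISEME.get(p, p) for p in phones if p != "SP"]

    print(map_transcription(["SIL", "B", "IH", "N", "SP", "F", "AO", "R"]))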
6.3.2.1 Tied-State Multiple Mixture Triphone HMMs
Tied-State Mixture Triphone HMMs (Section 2.6.2) are normally used in state-of-
the-art speech recognition systems because of their ability to model coarticulation
around a central phone. To build viseme triphones, we required viseme level tran-
scriptions of our video data. These were generated by manually transcribing the
accompanying audio at word level, using HTK to automatically expand the tran-
scription to phone level, and then applying the viseme mapping. A flat start was
then applied to the training data so that the segmentation of the AAM frames into
visemes was data-driven, and not influenced by the audio segmentation.
The BEEP dictionary was used to provide English pronunciations, and we con-
structed a French pronunciation dictionary manually. Once the audio had been
transcribed, we applied the mapping shown in Table 6.1, to convert our audio tran-
scriptions to visemes. In Table 6.1, the English-only phonemes are also shown in
Arpabet, whilst all phonemes are presented in IPA. From the transcription of our
data, we also generated a bigram grammar network (Section 2.7) for use during
recognition.
Using HTK [Young et al., 2006] we built three-state, single Gaussian HMMs
for each viseme, which were then replicated to form triphone models. Tied-state
triphones were then built using hierarchical tree clustering driven by left and right
viseme context questions. In a phone recognition system, the most significant rules
of coarticulation are known from phonetics knowledge and can be applied to decide
which states to examine for automatic clustering. If such rules are not available, as
in visual speech, then a data driven approach can be adopted instead to decide which
states to tie. Because there are only 16 visemes in our set, as compared to 45 in a
typical phoneme set, we can generate all possible combinations of context rules for
our visemes (65550 rules), and these provide a computationally manageable number
of rules for the clustering process. During clustering, rules that do not satisfy the
state occupancy and likelihood thresholds are ignored, leaving the most appropriate
rules for the given parameters. The thresholds we specified retained between 6%
and 7% of the total number of states after tying. Finally, the number of mixture
components was increased sequentially from one to five, and two was found to give
the best viseme accuracy (presumably for more than two components, the model
overfits the training data).
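The following sketch (Python) illustrates the idea of enumerating left- and right-context questions over the viseme set for data-driven state tying; the state-occupancy and likelihood thresholds described above would then prune most of them during clustering. The HTK-style QS pattern syntax and the exact enumeration are assumptions, and a full power-set sweep does not reproduce the 65550-rule count quoted above.

    from itertools import combinations

    # Placeholder names for the 16 visemes in our set (the real labels are
    # those of Table 6.1).
    VISEMES = [f"v{i:02d}" for i in range(16)]

    def context_questions(visemes):
        """Yield left- and right-context questions for every non-empty subset
        of the viseme set, in the usual HTK tree.hed QS style (assumed
        formatting)."""
        for r in range(1, len(visemes) + 1):
            for subset in combinations(visemes, r):
                name = "_".join(subset)
                left = ",".join(f"{v}-*" for v in subset)
                right = ",".join(f"*+{v}" for v in subset)
                yield f'QS "L_{name}" {{ {left} }}'
                yield f'QS "R_{name}" {{ {right} }}'

    # A full sweep over 16 units gives 2 * (2**16 - 1) = 131070 candidate
    # questions; the 65550 rules quoted in the text therefore correspond to a
    # restricted variant of this enumeration.
    qs = context_questions(VISEMES)
    print(next(qs))
    print(next(qs))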
6.3.3 Language Modelling and SVM Classification
Using the training set, bigram language models for both languages are built from
the viseme transcriptions (recognised from the appropriate language). Test data
is transcribed into visemes and each language model produces a likelihood for a
given utterance, which is length normalised. Back-off weights are calculated and
used for unseen bigrams in the test data. Classification is performed using an SVM
back-end classifier (Section 2.9). For a given utterance in our experiments, four
language model likelihoods are produced, as shown in Figure 6.1. Figure 6.5 shows
the process by which the classification vector for each utterance is constructed before
it is processed by the SVM. The vector we construct for the SVM contains two
streams; the first is the ratio between the language model likelihoods, calculated
separately for the likelihoods from each language-dependent recogniser. The second
stream applies a linear discriminant analysis (LDA) (Section 2.8) transformation to
the likelihood scores, where the class label is the language, in this case English or
French, and hence the resulting LDA transformation is a projection into a single
dimension. Gaussian probability density functions (GPDF) for each language are
then built from the LDA-projected data, and the ratio between the GPDF
likelihoods is the final input into the SVM. The two streams are then concatenated
together. At training time, these vectors are used to build an SVM, which finds
the maximum margin hyperplane separating the training data classes. Our SVM
uses a Gaussian radial basis function kernel to create a non-linear classifier, as the
likelihood scores are not linearly separable. In this task we found that our fusion of
SVM and LDA outperformed implementations of either LDA or SVMs alone.
Figure 6.5: A diagram showing the method of back-end classification for PPRLM
VLID using the language model likelihoods. The process is split into two streams
which are ultimately combined into an SVM classifier. The first stream uses the
ratios between the likelihoods from each language model, calculated independently
for each viseme recogniser. The second stream applies an LDA transformation to the
original likelihoods, and then calculates the ratio between new likelihoods generated
from language-dependent GPDFs. (Inputs: Pr(En | EnEnLM), Pr(Fr | EnFrLM),
Pr(En | FrEnLM), Pr(Fr | FrFrLM); output: hypothesised language.)
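A condensed sketch of this back-end is given below, using scikit-learn and SciPy (neither is mentioned in the thesis, so the library choice, the column ordering of the scores and the exact pre-processing are assumptions): the four language-model log-likelihoods per utterance are turned into two log-ratio values plus an LDA/Gaussian likelihood-ratio value, and an RBF-kernel SVM is trained on the concatenation.

    import numpy as np
    from scipy.stats import norm
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.svm import SVC

    def backend_features(loglik, y=None, lda=None, gauss=None):
        """loglik: (n, 4) log-likelihoods per utterance in the assumed order
        [Pr(En|EnEnLM), Pr(Fr|EnFrLM), Pr(En|FrEnLM), Pr(Fr|FrFrLM)].
        Pass y at training time; pass the fitted lda/gauss at test time."""
        # Stream 1: log-likelihood ratios, one per language-dependent recogniser.
        ratios = np.stack([loglik[:, 0] - loglik[:, 1],
                           loglik[:, 2] - loglik[:, 3]], axis=1)
        # Stream 2: project the four scores onto one LDA dimension, then take
        # the ratio of class-conditional Gaussian log-densities on that dimension.
        if lda is None:
            lda = LinearDiscriminantAnalysis(n_components=1).fit(loglik, y)
            proj = lda.transform(loglik).ravel()
            gauss = {c: (proj[y == c].mean(), proj[y == c].std() + 1e-6)
                     for c in np.unique(y)}
        proj = lda.transform(loglik).ravel()
        c0, c1 = sorted(gauss)
        gratio = norm.logpdf(proj, *gauss[c0]) - norm.logpdf(proj, *gauss[c1])
        return np.column_stack([ratios, gratio]), lda, gauss

    # Training:
    #   feats, lda, gauss = backend_features(train_scores, train_labels)
    #   svm = SVC(kernel="rbf").fit(feats, train_labels)
    # Testing:
    #   test_feats, _, _ = backend_features(test_scores, lda=lda, gauss=gauss)
    #   predictions = svm.predict(test_feats)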
6.4 Experiments
Each of the five subjects we tested was an English/French bilingual speaker. How-
ever, only three of them were true bilinguals, having learnt to speak both languages
from a very early age.
We use speaker-independent, cross-fold validation to evaluate the performance of
the LID system: each speaker is held out in turn for testing, and the remaining four
are used for training. The time-stamped viseme transcriptions from each language of
each single speaker were divided sequentially and exhaustively to give test utterance
durations of 60, 30, 7, 3 or 1 seconds. Partitioning the data in this way means that
the number of test utterances for shorter test durations greatly exceeds the number
of longer duration utterances, and hence improvements on LID accuracies between
speakers for longer test utterances may not be statistically significant.
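For concreteness, dividing a time-stamped transcription sequentially into fixed-duration test utterances might look like the sketch below (Python; the tuple format for tokens and the handling of any final partial chunk are assumptions, as the text does not specify them):

    def segment_utterances(tokens, duration):
        """Split a time-stamped token sequence into consecutive chunks of at
        least `duration` seconds.

        tokens   : list of (start_time, end_time, label) tuples, in seconds.
        duration : target test-utterance length (e.g. 1, 3, 7, 30 or 60).
        Returns  : list of label sequences, one per test utterance.
        """
        utterances, current, chunk_start = [], [], None
        for start, end, label in tokens:
            if chunk_start is None:
                chunk_start = start
            current.append(label)
            if end - chunk_start >= duration:
                utterances.append(current)
                current, chunk_start = [], None
        # Any trailing partial chunk is discarded here (an assumption).
        return utterances

    # Shorter durations yield many more test utterances than longer ones:
    # for d in (1, 3, 7, 30, 60):
    #     print(d, len(segment_utterances(transcription, d)))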
6.4.1 Speaker-independent VLID Results
Table 6.2 shows the speaker-independent viseme accuracy of the five speakers used
in our experiments. For the values quoted, the grammar scale factor within HTK's
Viterbi recognition tool was set to 0.3 and the word insertion penalty to 20 (Section
2.7). These were optimum values, determined empirically.
Table 6.2: Viseme recognition results

                 English           French
  Speaker ID   %Corr    Acc      %Corr    Acc
  1            50.07   34.33     38.06   29.78
  2            49.95   34.28     47.72   28.80
  3            49.64   34.98     41.24   34.47
  4            49.47   34.69     43.57   33.35
  5            49.64   35.57     44.02   34.70
  Mean         49.75   34.77     42.92   32.22
Figure 6.6 shows the VLID results for our system. The lowest mean error of 12.2%
is achieved with 60 seconds of test data. It is reassuring to note that the VLID error-
rate decreases monotonically with utterance length for all speakers except for speaker
number three, whose error rate remains at about chance. However, we noted that
audio LID was significantly worse for this speaker, and examination of their video
recordings showed that they exhibited some unusual visual speech, in particular very
prominent top teeth, and a minor speech defect. These factors probably contributed
to poor performance, but it is these kinds of issues that we wish to examine with a
larger database of speakers in the next chapter.
Figure 6.6: Speaker-independent VLID results. We employ cross-fold validation
across our five speakers, holding out one speaker each time for testing. The task is to
identify the language of an unseen utterance as either English or French. (The plot
shows mean error in percent against test utterance duration in seconds, with one
curve per test subject and one for the mean.)
6.4.2 Simulated Viseme Error on VLID Performance
The accuracy of the viseme recognisers used in our VLID experiments is very
low compared to recognisers typically used in audio phone recognition systems.
Given that our VLID accuracy appears to be limited by our viseme recognisers, it is
interesting to investigate at what level of viseme accuracy high VLID recognition is
attainable. Furthermore, we would like to know how degradation of viseme accuracy
affects VLID performance. We tested this by simulating different viseme recognition
accuracies for our five speakers, with speaker 1 used as test data. We began by
using the ground-truth (oracle) viseme transcriptions of the English and French
texts to construct bigram language models. Then, for each speaker, we reduce the
transcription accuracy by using a model of the pattern of confusions made by an
individual speaker. The simple model we used for this process was to consider
each phone in the transcription in turn, and then to correctly identify, confuse, or
delete that phone according to an (estimated) speaker's confusion matrix. After
each phone, a decision to insert any of the phones in the set was calculated, once
again based on the patterns for that speaker. This model enables us to insert
substitutions, deletions and insertions into the transcription at effectively any error-
dependency and coarticulation. We then perform VLID using test data from one of
the speakers.
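A sketch of this error-simulation model is given below (Python/NumPy; the confusion-matrix conventions, such as a dedicated deletion entry and a per-phone insertion probability, are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    def corrupt_transcription(phones, confusion, insert_prob, phone_set):
        """Corrupt an oracle transcription with a speaker's confusion model.

        phones      : list of ground-truth phone/viseme labels.
        confusion   : dict mapping each phone to a probability vector over
                      phone_set + ['<del>'] (correct label, substitutions,
                      or deletion).
        insert_prob : dict mapping each phone in phone_set to the probability
                      of inserting it after an emitted phone.
        """
        out = []
        targets = list(phone_set) + ["<del>"]
        for p in phones:
            choice = rng.choice(targets, p=confusion[p])
            if choice != "<del>":
                out.append(choice)            # correct label or a substitution
            # Independently decide whether to insert each phone afterwards.
            for q in phone_set:
                if rng.random() < insert_prob[q]:
                    out.append(q)
        return out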
The results of this simulation are shown in Figure 6.7. They indicate that even
with an error-rate around 40%, 100% VLID accuracy is achievable with 60 seconds
of test data.
Figure 6.7: The effect of viseme accuracy on VLID recognition. (The plot shows
mean error in percent against test utterance duration in seconds, with one curve per
simulated viseme error rate from 0% to 90%.)
6.5 Conclusions
This preliminary work on a two-language discrimination problem using only five
speakers indicates that speaker-independent visual language discrimination is pos-
sible, but it is limited by the low accuracy obtainable by our viseme recognisers,
as suggested by the simulation experiment described in Section 6.4.2. The tech-
nique developed is a supervised approach (by which we mean it requires phonetic
or visemic transcriptions of the training data), and is based upon the PRLM LID
architecture commonly used in audio LID.
English and French are both Indo-European languages, sharing many of the same
phonological characteristics. Therefore, it is possible that more dissimilar languages
may be discriminated more easily. Better discrimination may also be achieved by
training on a larger set of speakers, given that more data might be required to
adequately model the variation present in the AAM feature space. In the next
chapter, we will continue this investigation into speaker-independent VLID. Work
will focus on using a larger dataset of two more phonologically dissimilar languages,
namely English and Arabic. The data will also be of higher quality, to determine
whether the dataset used here (in terms of frame rate or image resolution) is
responsible for our poor viseme recognition. Improving the accuracy of our viseme
recogniser, or using the information presented visually in a more useful way, is key
to improved language discrimination and is therefore the focus of our work.
Chapter 7
Speaker-Independent Visual
Language Identification using the
UN2 Dataset
7.1 Introduction
The experiments in this chapter focus on repeating those described in Chapter 6, but
using a larger database of subjects, speaking two more dissimilar languages. In the
previous chapter, we showed that good speaker-independent language discrimination
could be achieved between two phonetically similar languages, specifically, English
and French. We also showed that the accuracy of our viseme recognisers is poor
when compared to similar systems built using audio features, and that higher viseme
recognition is key to improved language discrimination.
In this work, we aim to discover whether or not the poor viseme recognition
results we presented in the previous chapter were due to a sparse coverage of the
active appearance model (AAM) feature space, caused by insufficient speakers in
our dataset. We also focus on the features of language produced by native speakers,
rather than those introduced by second language acquisition. To investigate these
points, we recorded a dataset of 25 native English and 10 native Arabic-speaking
subjects, of which we use 10 English and 9 Arabic recitals for testing (Chapter
3). Furthermore, we recorded the new video data at full HD resolution (1920x1080
pixels) and at a frame rate of 60 frames per second, in order to determine whether
or not the extra image and temporal information captured would lead to greater
phone discrimination. As well as visual language identification (VLID) experiments
using our AAM features, we present recognition experiments using shape features
and appearance features derived from tooth recognition, to eliminate skin-tone as a
discriminatory feature of language.
Section 7.2 describes the experimental setup adopted for these experiments, and
then we evaluate the performance of the system in 7.3. Section 7.4 tests a num-
ber of additional features to eliminate skin-tone from our experiments. In Section
7.5 we describe the process of building an Arabic phone recogniser, and in 7.6 we
present results on using a parallel phone recognition followed by language modelling
(PPRLM) LID approach with the UN2 dataset. Section 7.7 concludes the chapter.
7.2 Experimental Setup
The task in these experiments is to discriminate between English and Arabic from
visual-only information, in a manner similar to the speaker-independent experiments
described in Chapter 6. We recorded a new visual dataset for these experiments,
which contained accompanying audio (Chapter 3). Unlike our previous speaker-
independent experiments, the data we recorded is comprised entirely of native speak-
ers. The amount of data recorded was also much larger than used previously, and
the languages we chose were phonologically very different. Our testing procedure is
19-fold cross validation, where each of the 19 subjects is held out of the training set
in turn, and used for testing instead. Table 7.1 shows details of the 19 test subjects.
As before, for each speaker to be tested, their data was divided sequentially and
exhaustively into segments of 1, 3, 7, 30 and 60 seconds. This means that there
are more one-second than sixty-second utterances, and so some results presented for longer
test utterance durations may not be statistically significant. We also balanced the
amount of training data for each language within the VLID portion of the system,
so that neither language had a bias based upon how well it was modelled.
Table 7.1: UN2 test subjects used in all VLID experiments
SpeakerID Language Sex Use
1 English Male Testing/Training
2 English Male Testing/Training
6 English Male Testing/Training
8 English Male Testing/Training
17 English Male Testing/Training
18 English Male Testing/Training
23 English Female Testing/Training
24 English Female Testing/Training
26 English Female Testing/Training
29 English Female Testing/Training
32 Arabic Male Testing/Training
33 Arabic Male Testing/Training
34 Arabic Male Testing/Training
35 Arabic Male Testing/Training
36 Arabic Male Testing/Training
37 Arabic Male Testing/Training
38 Arabic Female Testing/Training
40 Arabic Male Testing/Training
41 Arabic Female Testing/Training
7.3 PRLM Using Visual Phones
Our previous work in VLID used viseme recognition as our method of tokenising
a speech signal (Chapter 6). During our work where we compared AAM features
to features derived from physically tracking the articulators (Chapter 8), we ran
several recognition experiments where the task was to identify the spoken phone,
rather than the viseme class to which it belonged (which we shall term visual-phone
and viseme, respectively). Figure 7.2 shows a typical phone recognition confusion
matrix generated from a speaker-independent visual-phone recogniser. Figures 7.3
and 7.4 present separate graphical representations of the same information. The
vertical shading in Figure 7.3 shows that many input phones share a similar pattern
of confusion, and that there is a high number of phonetic deletions. The horizontal
shading for some input phones in Figure 7.4 shows that the confusions for a particular
phone are often distributed evenly among all classes, as are phonetic insertions.
The matrices clearly show that the confusions generated are not representable by
many-to-one mappings from phones to conventional viseme classes. For example,
in Figure 7.2, the phones [f] and [v] are both shown to be confused with most of
the other phone classes, and although they are confused with each other (which
would be their viseme grouping), [f] is more often confused with [v] than vice
versa. Furthermore, some degree of phone discrimination can be made between
the phones [p], [b] and [m], where a viseme mapping would make no discrimination.
Since the perplexity of a 15-class viseme sequence is lower than that of a 44-class
phone sequence, the recognition accuracy corresponding to chance is higher in the
first than in the second. Although it may seem counter-intuitive, despite the
higher recognition accuracies of the viseme recognisers, there may be more language
information captured in the lower accuracies achieved by visual-phone recognisers
(where visual-phone refers to having a model for each phoneme, trained using
visual features, rather than combining them to form viseme models). Therefore, in
this work, we tokenise our speech in terms of visual-phones, rather than visemes.
The system used for VLID is based upon the phone recognition followed by language
modelling (PRLM) architecture (Figure 7.1) (Section 2.2).
Figure 7.1: A system diagram of the PRLM language identification architecture
applied to visual speech. (Pipeline: video frames -> AAM feature extraction ->
English visual-phone recogniser -> English and French language models, producing
Pr(En | EnEnLM) and Pr(Fr | EnFrLM) -> SVM classifier -> hypothesised language.)
In some preliminary experiments using our new video dataset, we compared phone
recognition performance across three separate video resolutions, which were gener-
ated by downsampling our original HD video. The three resolutions were: 640x360,
1080x720 and 1920x1080. We found no measurable advantage for visual-phone
recognition by using higher video resolutions. It is possible that having recorded
the video in HD, extra definition and clarity are propagated through to the lower
resolutions during the resampling process. This could mean that the quality of the
lower resolution data might be higher than had the data been recorded natively at
those resolutions. Since we found no advantage, we generated our AAM features
for all of the experiments presented here from video frames resampled to 640x360,
which significantly reduced the time taken for feature generation.
Figure 7.2: Example confusion matrix for all test data in a speaker-independent
visual-only phone recogniser. The labels running vertically on the left-hand side
represent input phones from the ground-truth transcription, and each row represents
the recognition confusions for each of the specified input phones. (Overall results
from the matrix header: SENT %Correct = 0.00 [H = 0, S = 1827, N = 1827];
WORD %Corr = 32.33, Acc = 17.71 [H = 25364, D = 20015, S = 33063, I = 11474,
N = 78442]; the full per-phone rows appear in the original figure.)
Figure 7.3: Example confusion matrix for all test data in a speaker-independent
visual-only phone recogniser. The colour of each element represents its value and
each is scaled by the sum of the row. The original values from this matrix are shown
in Figure 7.2. The labels running vertically on the left-hand side represent input
phones from the ground-truth transcription, and each row represents the recognition
confusions for each of the specified input phones.
Figure 7.4: Example confusion matrix for all test data in a speaker-independent
visual-only phone recogniser. The colour of each element represents its value and each
is scaled by the sum of the column. Each column shows the phones that were confused
with the phone represented by the column (or where insertions took place). The
original values from this matrix are shown in Figure 7.2. The labels running vertically
on the left-hand side represent input phones from the ground-truth transcription, and
each row represents the recognition confusions for each of the specified input phones.
As before, we z-score normalise (Equation 6.1) our AAM features per speaker to
help improve the speaker independency of our features. However, unlike before, we
adopt a sliding window approach, where instead of normalising each frame based on
the global mean and standard deviation of that speaker, we calculate the parameters
within a window of 50 seconds (3000 frames), 25 seconds (1500 frames) either side
of the central frame. The duration of the sliding window must be large enough to
obtain a reasonable estimate of the mean and standard deviation, but short enough
to limit the effect of erroneous data on the complete set of features. Without using
this method, incorrectly tracked frames or other changes in the features will affect
the value of the global mean and standard deviation. By using a sliding window,
the effect of feature anomalies will be limited to the scope of the sliding window and
therefore should improve the estimate of the z-score.
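A minimal sketch of this sliding-window z-score normalisation is given below (Python/NumPy). The window is simply truncated at the start and end of a recording, which is an assumption; the thesis does not say how the boundaries were handled.

    import numpy as np

    def sliding_zscore(features, window=3000):
        """Z-score normalise AAM features for one speaker with a sliding window.

        features : (n_frames, n_dims) array for the speaker.
        window   : total window length in frames (3000 frames = 50 seconds at
                   60 fps); half the window is taken either side of each frame.
        """
        half = window // 2
        out = np.empty_like(features, dtype=float)
        n = len(features)
        for t in range(n):
            lo, hi = max(0, t - half), min(n, t + half + 1)
            seg = features[lo:hi]
            mu = seg.mean(axis=0)
            sigma = seg.std(axis=0) + 1e-8   # guard against zero variance
            out[t] = (features[t] - mu) / sigma
        return out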
In our preliminary experiments using our new dataset, we did not build our visual
phone recogniser from a development set. Instead, we built several English recog-
nisers from our training data, according to which fold of the cross-fold validation
we were using (to ensure that the test speaker was not used to train the models).
The phone recognition accuracy on the training data was significantly higher than
on the test data. This is mainly due to the speaker dependency of the features,
which is modelled increasingly well as the number of mixture components in the
Gaussian mixture models (GMMs) is increased, but does little to improve the
recognition of an unseen speaker. Once language models had been built on the
already-seen English training data and the unseen Arabic training data, we found
that we had over-trained, i.e. we had built a system which was highly accurate on
the training data but did not generalise to unseen data. Hence, upon analysis of the
likelihood scores from the language models, we consistently saw that unseen data
was grouped away from the data used to train the models (Figure 7.5). We solved
this problem (Figure 7.6) by using ten unseen English speakers as a development
set, on which to train our visual phone recogniser, which is the approach commonly
adopted in audio LID. Without using a development set, we had to generate features
for all speakers, for each validation fold, since the feature-generation process for
AAMs is data-driven. Using a development set, we can generate a single AAM
model from which we can create one set of features for all speakers.

Figure 7.5: Language model log-likelihood scores for phone sequences recognised by
an English visual-phone recogniser. The language models were built from the training
data from their respective languages. The English training data shown here was also
used to train the phone models, whereas the Arabic training data was not used. The
plot shows that the English test data and Arabic training data are grouped together
and are well separated from the English training data. (Axes: Arabic language model
log-likelihood against English language model log-likelihood; points shown for Arabic
training, English training and English test data.)
From the UN2 database described in Section 3.3, we selected 19 speakers on
which to run our experiments. 10 English speakers were selected to build an English
visual phone recogniser, and these speakers would only be used to train the HMMs.
10 English speakers and 9 Arabic subjects were used in a 19-fold cross validation
setup, where in each fold a speaker was held out for testing. Table 7.2 shows the
selection of subjects we used for training our visual-phone models.
Figure 7.6: Language model log-likelihood scores for phone sequences recognised by
an English visual-phone recogniser, trained on a development set of English subjects,
and excluding the training data shown on the plot. The language models were built
from the training data from their respective languages. The plot shows that the
English and Arabic training data are not as well separated as in Figure 7.5, but that
the English test data is closer to the English than to the Arabic training examples.
(Axes: Arabic language model log-likelihood against English language model
log-likelihood; points shown for Arabic training, English training and English test data.)
Table 7.2: UN2 subjects used in PRLM VLID experiments for training the visual
phone HMMs
SpeakerID Language Sex Use
3 English Male HMM Training
4 English Male HMM Training
5 English Male HMM Training
7 English Male HMM Training
12 English Male HMM Training
14 English Male HMM Training
15 English Male HMM Training
16 English Male HMM Training
20 English Male HMM Training
21 English Male HMM Training
7.3.1 Experimental Parameters
In the viseme recognition subsystem, a number of parameters must be specified.
Firstly, the state-tying threshold must be chosen such that there are no models after
tying for which there are insufficient examples on which to train them. The experi-
ments in this chapter all use a threshold of 4500, which means that the hierarchical
clustering process stops at the point where state tying cannot achieve a log-likelihood
difference of 4500. To select this value, we ran VLID experiments using visual-phone
transcriptions generated from a range of triphone threshold parameters, and found
a value of about 4500 to be optimal.
Table 7.3: Speaker-independent visual-phone recognition performance, using ran-
domly generated features and a bigram language model. The parameters for recog-
nition were 10 for the grammar scale factor and 12 for the insertion penalty (which we
found gave optimal phone recognition accuracy). The results show that an impression
of discrimination is given when using a language model, even though the underlying
features are meaningless.
Accuracy # Correct # Deleted # Substituted # Inserted # Phones
14.00 22565 26413 29464 11585 78442
Secondly, the grammar scale factor and insertion penalty must be set. As de-
scribed in Section 2.7, the use of a bigram grammar in LID constrains the recog-
nised sequence of phones and destroys the original sequence. We also built a phone
recogniser on randomly generated features, which contained uniformly distributed
pseudorandom numbers lying within a similar range to our AAM features. We found
that using a relatively small grammar scale factor of 10 and an insertion penalty of
12, we could achieve above random phone recognition accuracies which were compa-
rable to those gained by our actual features (Tables 7.7 and 7.3). However, it should
be noted that applying that scale factor to our AAM phone recognisers does produce
higher recognition accuracies. It is interesting to see that a grammar model alone,
applied to uninformative feature information, is sufficient to give an impression of
phone discrimination. Therefore, we set the scale factor to 0, to remove the influence
of the bigram language model on our phone recognition. The insertion penalty is
varied between 0 and -20, to balance the large number of insertions and deletions
in a visual phone recognition system, and based on this we use a global penalty of
-20. Once again, this value was found to give optimal VLID accuracies, although
higher classification accuracies may be achievable by specifying separate penalties
for each speaker. Lastly, the number of mixture components is increased sequentially
from 1 to 16; we found the optimum number to be 10 for this experiment.
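The role of the two decoding parameters can be summarised schematically as follows (a Python sketch of the combined score a Viterbi decoder ranks hypotheses by, not HTK's actual implementation):

    def hypothesis_score(acoustic_loglik, phone_sequence, bigram_logprob,
                         grammar_scale=0.0, insertion_penalty=-20.0):
        """Schematic combined score used to rank decoding hypotheses.

        acoustic_loglik   : log p(observations | phone_sequence) from the HMMs.
        bigram_logprob    : function returning log p(phone | previous phone).
        grammar_scale     : weight on the language model (0 removes its
                            influence, as in the experiments above).
        insertion_penalty : added once per recognised token; more negative
                            values discourage insertions.
        """
        lm = sum(bigram_logprob(prev, cur)
                 for prev, cur in zip(phone_sequence[:-1], phone_sequence[1:]))
        return (acoustic_loglik
                + grammar_scale * lm
                + insertion_penalty * len(phone_sequence))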
In the language modelling subsystem, we smoothed counts of phones that did
not occur in the training data to a count of one. When training the VLID system,
we balanced the amount of training data for each language, so that the number
of utterances in each language was equal. We also length normalised our language
model scores, by using the mean language model score for each utterance. Length
normalisation accounts for the fact that longer utterances will have lower likelihoods
than shorter ones. Although silence (or its visual equivalent) was recognised in the
phone recognition portion of the system, we removed silence from the language
modelling subsystem, since silence is not a valid indicator of the identity of a spoken
language.
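A sketch of the language-modelling subsystem is given below (Python). The add-one-style smoothing is a simple stand-in for the count smoothing and back-off weights described above, and the details of the training data format are assumptions.

    import math
    from collections import Counter

    def train_bigram_lm(transcriptions):
        """Train a bigram model; bigrams unseen in training are smoothed to a
        count of one.  transcriptions: list of phone-label sequences, with
        silence already removed."""
        phones = sorted({p for seq in transcriptions for p in seq})
        counts = Counter((a, b) for seq in transcriptions
                         for a, b in zip(seq, seq[1:]))
        totals = Counter()
        for (a, _), c in counts.items():
            totals[a] += c
        def logprob(a, b):
            return math.log(counts.get((a, b), 1) /
                            (totals.get(a, 0) + len(phones)))
        return logprob

    def score_utterance(seq, logprob):
        """Length-normalised language-model score: the mean bigram
        log-probability, so that long and short test utterances are
        comparable."""
        pairs = list(zip(seq, seq[1:]))
        return sum(logprob(a, b) for a, b in pairs) / max(len(pairs), 1)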
7.3.2 Results
Table 7.4: Speaker-independent visual-phone recognition performance of the En-
glish training data used to train the recogniser. The results presented relate to the
experiment described in Section 7.3.
Accuracy # Correct # Deleted # Substituted # Inserted # Phones
31.25 33580 19840 24803 9136 78223
Table 7.5: Speaker-independent visual-phone recognition performance. The results
presented are for the English test data used in the experiment described in Section
7.3.
Accuracy # Correct # Deleted # Substituted # Inserted # Phones
17.71 25364 20015 33063 11474 78442
Figure 7.7 shows the results for the experiments described in this section. In addi-
tion to the visual language identification results for each speaker, we also present the
mean accuracy of each technique evaluated in this chapter (Figure 7.23). Further-
more, we provide the equivalent audio LID results, from a system using monophone
models, built from mel-frequency cepstral coefficient (MFCC) audio features (Fig-
ure 7.8). Figure 7.9 presents the difference in performance for each speaker shown
in Figures 7.7 and 7.8. We found the optimum number of mixture components to
be 12 for our monophone models. As expected, we can see that almost perfect lan-
guage discrimination can be achieved using audio features after around 7 seconds
of test data. By contrast, most speakers require at least 30 seconds, more often 60
seconds, to achieve a similar performance using visual features. It can also be seen
that there is greater variation between the performance of speakers in the visual
domain when compared to the audio domain, with some speakers only achieving
low levels of discrimination even with greater test utterance durations. A minimum
mean VLID error of 4.64% is achieved after 60 seconds of test data, compared to
0% using audio features.

Figure 7.7: Speaker-independent VLID results using a PRLM approach with the
UN2 visual-only dataset. Each plot line represents the performance of a test speaker.
(Mean error in percent against test utterance duration in seconds; the legend gives
the test speaker IDs.)
Table 7.6: Speaker-independent audio-phone recognition performance of the English
training data used to train the recogniser. The results presented relate to the
audio-only experiment described in Section 7.3.
Accuracy # Correct # Deleted # Substituted # Inserted # Phones
60.0 52466 9584 16173 5532 78223
Table 7.7: Speaker-independent audio monophone recognition performance. The
results presented are for the English test data used in the audio-only experiment
described in Section 7.3.
Accuracy # Correct # Deleted # Substituted # Inserted # Phones
47.58 45628 9126 23688 8305 78442
Figure 7.8: Audio LID results corresponding to the VLID results presented in Figure
7.7, using a PRLM architecture and MFCC features from the UN2 dataset. (Mean
error in percent against test utterance duration in seconds; one curve per test speaker.)
Figure 7.9: A plot showing the difference between the results shown in Figure
7.7 (Visual PRLM) and Figure 7.8 (Audio PRLM), for each speaker. Positive values
indicate a reduction in error from using audio features. The audio performance always
equals or outperforms the corresponding VLID result.
7.4 Eliminating Skin Tone
The results for VLID accuracy presented in Section 7.3.2 showed that good, two-
class, speaker-independent language discrimination could be achieved using visual
features. After finding this result, we sought to confirm that spoken language was
the source of the discrimination, rather than any systematic non-language difference
between the speakers in both languages. Despite extensive care being taken during
the recording process, differences in the recordings could have occurred during data
capture, such as focus, colour balance or lighting conditions. Discrimination may
also have occurred because of consistent physical differences between English and
Arabic speakers, such as skin tone, which have no bearing on which language is
being spoken. Here, we repeat the PRLM VLID experiments described in Section
7.3.2, but using simpler features to discard information that we would not expect
to assist in language discrimination.
Perhaps the most obvious visual difference between our English and Arabic speak-
ers is their skin tone. The Arabic speakers all have a darker skin tone than the
English subjects. Given that our features are derived from the visual appearance
of the face, we examined to what extent our system's successful discrimination of
these languages was based on skin tone, rather than on cues of language. To eliminate
skin tone as a feature of language, we devised a series of tests. Firstly, skin tone is
constant and not a time-varying signal. Thus, we would expect that a static image,
or a very short test utterance duration, would be sufficient to achieve above-random
classification accuracy. From Figure 7.7, we can see that shorter test utterance dura-
tions do not provide sufficient information for high classification accuracy, although
we cannot be certain that skin tone does not introduce some discriminability. Further
to this, we tried to minimise the effect of colour on our features, by normalising the
mouth regions of each speaker to a global colour distribution, then by using
shape-only components, and finally by removing colour altogether with a binary
representation of the mouth and teeth.
Initially, we decided to retain all colour information present in our features and
to normalise each speaker's appearance to a global target. The process we used
is known as histogram equalisation, in which a mapping is found for an input his-
togram, such that the number of pixels in each histogram bin is approximately equal
after the mapping is applied (Figure 7.10). For each speaker, using the images we se-
lected previously for training each speaker's AAM, we extracted the pixels contained
within the outer-lip contour, and constructed a global histogram for those pixels.
The histogram equalisation mapping generated for a speaker's training images was
then applied to all input frames for that speaker.

Figure 7.10: A diagram of the processes involved in histogram equalisation. Each
colour channel is processed independently, and just the mouth region is used. The
equalising mapping function is generated from a sample of training images, and pro-
vides a mapping for input pixel intensities. The histograms show the count of pixels
(y-axis) for each intensity value (x-axis). Sample histograms are shown before and
after the mapping has been applied. For each colour channel, we can see that the
distribution of pixels across the intensity range is approximately even after the equal-
isation process has been applied.

Since we only use the mouth region for recognition, we perform the normalisation
on pixel histograms of the mouth only. This also ensures that the histograms have
some degree of correspondence between speakers, rather than containing extraneous
image information that may be more likely to vary from image to image. This nor-
malisation method assumes that each speaker has a similar distribution of intensities
within each colour channel, but that the mean and variance of those distributions
are different. Using an extreme case as an example, if one speaker had red lips and
another had blue lips, this normalisation would be ineffective. Figure 7.11 shows
two subjects from both languages, before and after histogram equalisation.
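A minimal sketch of this per-channel, mouth-region histogram equalisation is given below (Python/NumPy). Mapping each channel through the pooled cumulative distribution of the mouth-region pixels yields an approximately flat output histogram, which is the standard way of achieving the equalisation described above; details such as the output range are assumptions.

    import numpy as np

    def equalisation_mapping(mouth_pixels, levels=256):
        """Build a per-channel histogram-equalisation lookup table from the
        mouth-region pixels of one speaker's AAM training images.

        mouth_pixels : (n_pixels, 3) uint8 array of RGB values inside the
                       outer-lip contour, pooled over the training images.
        Returns      : (3, levels) array mapping input intensity -> [0, 1].
        """
        lut = np.zeros((3, levels))
        for c in range(3):
            hist = np.bincount(mouth_pixels[:, c], minlength=levels).astype(float)
            cdf = np.cumsum(hist)
            lut[c] = cdf / cdf[-1]        # cumulative distribution as the mapping
        return lut

    def equalise_frame(frame, lut):
        """Apply the speaker's mapping to every pixel of an input frame.
        frame: (h, w, 3) uint8 image; returns a float image in [0, 1]."""
        out = np.empty(frame.shape, dtype=float)
        for c in range(3):
            out[..., c] = lut[c][frame[..., c]]
        return out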
Figure 7.11: Example frames from two Arabic and two English subjects, before and
after histogram equalisation.
Next we sought to remove skin tone altogether from our features. The AAM's
PDM, or shape, features are a colour-free representation of the shape of the inner
and outer lip contours. Of course, using shape features discards any information
about the tongue, teeth, or any other informative colour variation that is separate
from skin tone. Therefore, if language discrimination is possible using only shape,
we would expect to see some degree of discrimination, but significantly
less than when using combined shape and appearance features. When the feature
dimensions corresponding to appearance are discarded, 8 shape parameters remain
out of 70 for the histogram equalised features, and these are the dimensions we use
for recognition.
Although shape parameters were shown to provide some degree of language dis-
crimination, they are not the optimal skin-tone-free features that we can extract
from the face, as they only provide information about one articulator, i.e. the
lips. Our final investigation into skin tone was to find an abstract representation
of an additional articulator, in order to improve upon our shape-only results. Our
contention here is that there is insufficient information within the shape features to
determine the position of the other articulators, and hence to identify the spoken
phone. Once again, we do not expect these results to give comparable recognition
performance to our original features, although we expect better performance than
our shape-only results.
Figure 7.12: A synthetic example showing mouth region pixels, and a 3 x 3 grid
centred over a pixel which has been hand-labelled as containing teeth. The structure
of the feature vector corresponding to the tooth pixel is also shown. (The vector
concatenates the red, green and blue intensities of the nine pixels in the grid, giving
27 values.)
Using the hand-labelled images that were created for building the speaker-specific
AAM trackers, we hand-labelled tooth and mouth regions. A 3 x 3 pixel grid was
placed over each labelled pixel, then the red, green and blue intensities of those
9 pixels were concatenated and used as the feature vector for each central pixel
(Figure 7.12). A multilayer perceptron classifier was trained speaker-dependently to
discriminate between the 27-dimensional vectors corresponding to tooth and non-
tooth regions for each speaker. Test vectors were constructed for each pixel contained
within the outer lip contour of a test image, and then the recognised regions were
set to 0 for non-tooth and 255 for tooth regions, according to the results of the
classifier. Finally, small enclosed regions were removed, to reduce the noise caused
by a small number of pixels being classified incorrectly.
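A sketch of the tooth/non-tooth pixel classifier is given below (Python, with scikit-learn's MLPClassifier standing in for the multilayer perceptron; the network size and training details are assumptions, and pixels on the image border are ignored for simplicity):

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def neighbourhood_features(image, points):
        """Build the 27-dimensional feature vector for each labelled pixel:
        the RGB values of the 3 x 3 neighbourhood centred on it.

        image  : (h, w, 3) uint8 frame.
        points : iterable of (row, col) pixel coordinates inside the lips,
                 away from the image border.
        """
        feats = []
        for r, c in points:
            patch = image[r - 1:r + 2, c - 1:c + 2, :]     # 3 x 3 x 3 block
            feats.append(patch.reshape(-1).astype(float))
        return np.array(feats)

    # Speaker-dependent training on hand-labelled tooth / non-tooth pixels
    # (labels: 1 for tooth, 0 otherwise).  The hidden layer size is an assumption.
    # clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500)
    # clf.fit(neighbourhood_features(train_image, labelled_points), labels)
    #
    # At test time, classify every pixel inside the outer-lip contour and build
    # a binary image: 255 where the classifier says tooth, 0 elsewhere.
    # mask = clf.predict(neighbourhood_features(test_image, lip_pixels))
    # binary = np.where(mask == 1, 255, 0)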
Figure 7.13: Example frames from two Arabic and two English subjects, before and
after tooth classification.
Using this technique, we were able to represent a sequence of video frames as a series of binary images in which the teeth are white and everything else is black. From these, we can perform PCA as in our previous experiments and run VLID experiments as before. This means that our features contain shape information relating to the lip contours and appearance information corresponding to the position of the teeth within the mouth. Figure 7.13 shows some original images and their representation after classification of the teeth.
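A minimal sketch of deriving appearance parameters from the binary tooth masks by PCA is given below. The number of retained components and the array layout are assumptions for illustration; the thesis applies PCA in the same spirit as its earlier feature-building experiments but does not specify these details here.

```python
import numpy as np
from sklearn.decomposition import PCA

def tooth_appearance_features(masks, n_components=10):
    """Project binary tooth masks (n_frames x height x width, values 0/255)
    onto their principal components to give compact appearance features.
    The number of components kept here is illustrative only."""
    X = masks.reshape(len(masks), -1).astype(np.float64) / 255.0
    pca = PCA(n_components=n_components)
    return pca.fit_transform(X), pca
```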
7.4.1 Results
The rest of this section presents the results of PRLM VLID experiments, and the phone recognition accuracies, using the different features for phone recognition described previously. It should be noted that we cannot present phone recognition accuracy of the Arabic test data through the English phone recogniser, since we do not have transcriptions of the Arabic speech in terms of English phones.
Tables 7.8 and 7.9 present the recogniser performance of the training and test
data used to achieve the VLID results shown in Figure 7.14, which relate to the
histogram equalisation experiments. The recogniser used a grammar scale factor of
0, and an insertion penalty of -20. The optimal number of mixture components per
state was found empirically to be 16.
Table 7.8: Speaker-independent visual-phone recognition performance of the En-
glish training data used to train the recogniser. The results presented relate to the
histogram equalisation experiment described in Section 7.4.
Accuracy # Correct # Deleted # Substituted # Inserted # Phones
36.42 36185 20511 21527 7700 78223
Table 7.9: Speaker-independent visual-phone recognition performance of the English
test data. The results presented relate to the histogram equalisation experiment
described in Section 7.4.
Accuracy # Correct # Deleted # Substituted # Inserted # Phones
15.63 25752 18822 33868 13495 78442
Figure 7.14 shows the VLID results of using AAM features built from histogram equalised video frames. Compared to the results achieved without histogram equalisation in Figure 7.7, the performance of each speaker is more varied, with 2 or 3 subjects failing to achieve high discriminability between English and Arabic. For most speakers, however, good discrimination is achieved with between 30 and 60 seconds of test data. From these results, it is not clear whether histogram equalisation is the correct technique for normalising for the effect of skin tone. It may be that this normalisation technique has removed some of the discriminability between English and Arabic based on the skin tone, or it could be that stretching the colour
space in this way has increased the visual difference between speakers, which might explain why more mixture components are required for optimal performance.
[Figure: mean error (percent) plotted against test utterance duration (seconds), one curve per test speaker.]
Figure 7.14: Speaker-independent VLID results for 19 test subjects from the UN2
dataset. These results relate to the histogram equalisation experiment described in
Section 7.4
[Figure: difference in mean error (percent) plotted against test utterance duration (seconds), one curve per test speaker.]
Figure 7.15: A plot showing the difference between the results shown in Figure 7.14 (Visual PRLM with histogram equalisation) and Figure 7.7 (Visual PRLM without histogram equalisation), for each speaker. Positive values indicate an increase in error from using the histogram equalisation normalisation, whereas negative values indicate a reduction in error.
Tables 7.10 and 7.11 present the recogniser performance of the training and test data used to achieve the VLID results shown in Figure 7.16, which relate to the shape-only experiments. The recogniser used a grammar scale factor of 0 and an insertion penalty of -20. The optimal number of mixture components per state was found empirically to be 1. This suggests that more than 1 mixture component over-models the shape features, perhaps because they do not contain much informative, speaker-independent data about the configuration of the mouth.
Table 7.10: Speaker-independent visual-phone recognition performance of the En-
glish training data used to train the recogniser. The results presented relate to the
AAM shape-only experiment described in Section 7.4.
Accuracy # Correct # Deleted # Substituted # Inserted # Phones
12.58 12476 38933 26914 2536 78223
Table 7.11: Speaker-independent visual-phone recognition performance of the En-
glish test data. The results presented relate to the AAM shape-only experiment
described in Section 7.4.
Accuracy # Correct # Deleted # Substituted # Inserted # Phones
12.09 11979 36889 29574 2498 78442
Figure 7.16 shows the VLID results of using AAM shape parameters as features for visual-only phone recognition. The results show that whilst around half the speakers achieve good identification accuracy with 30-second test utterances, the other half show little or no discrimination. This result is as expected, since the amount of articulatory information captured in the shape parameters is limited compared to that in the features derived from the appearance, which is also demonstrated by the lower phone recognition accuracies in Tables 7.10 and 7.11. It is evident that some degree of appearance-free language discrimination is possible, although at the cost of overall accuracy.
[Figure: mean error (percent) plotted against test utterance duration (seconds), one curve per test speaker.]
Figure 7.16: Speaker-independent VLID results for 19 test subjects from the UN2
dataset. These results relate to the AAM shape-only experiment described in Section
7.4
[Figure: difference in mean error (percent) plotted against test utterance duration (seconds), one curve per test speaker.]
Figure 7.17: A plot showing the difference between the results shown in Figure 7.16 (Visual PRLM using shape) and Figure 7.14 (Visual PRLM using histogram equalisation), for each speaker. Positive values indicate an increase in error from using the shape features, whereas negative values indicate a reduction in error.
Tables 7.12 and 7.13 present the recogniser performance of the training and test
data used to achieve the VLID results shown in Figure 7.18, which relate to the ex-
periments that used tooth segmentation. The recogniser used a grammar scale factor
of 0, and an insertion penalty of -20. The optimal number of mixture components
per state was evaluated empirically to be 8.
Table 7.12: Speaker-independent visual-phone recognition performance of the En-
glish training data used to train the recogniser. The results presented relate to the
tooth-recognition experiment described in Section 7.4.
Accuracy # Correct # Deleted # Substituted # Inserted # Phones
21.42 25460 22059 30704 8701 78223
Table 7.13: Speaker-independent visual-phone recognition performance of the En-
glish test data. The results presented relate to the tooth-recognition experiment
described in Section 7.4.
Accuracy # Correct # Deleted # Substituted # Inserted # Phones
15.60 21441 23611 33390 9205 78442
Figure 7.18 shows the VLID results of using AAM features, generated from tooth-segmented video frames, for visual-only phone recognition. The results show an improvement in performance over the shape-only results in Figure 7.16, which is expected since we have added the articulatory information of the teeth to our features. A higher recognition accuracy is also seen in the phone recognition results shown in Tables 7.12 and 7.13. There are two significant outliers whose performance degrades with test utterance duration, and three subjects who achieve only limited language discrimination. These results are lower than those produced by the systems built using unnormalised and histogram equalised video frames (Figures 7.7 and 7.14), which could be due to a loss of visual information regarding modes of variation corresponding to the tongue, and other potentially informative modes. These results do show that reasonable language discrimination can be achieved using colour-free appearance features, which outperform shape alone but are less accurate than full appearance features.
[Figure: mean error (percent) plotted against test utterance duration (seconds), one curve per test speaker.]
Figure 7.18: Speaker-independent VLID results for 19 test subjects from the UN2
dataset. These results relate to the tooth recognition experiment described in Section
7.4
[Figure: difference in mean error (percent) plotted against test utterance duration (seconds), one curve per test speaker.]
Figure 7.19: A plot showing the difference between the results shown in Figure 7.18 (Visual PRLM using tooth recognition) and Figure 7.14 (Visual PRLM using histogram equalisation), for each speaker. Positive values indicate an increase in error from using the tooth recognition features, whereas negative values indicate a reduction in error.
7.5 Arabic Phone Recognition
In order to build a PPRLM system, such as that built in Chapter 6, we must build more than one visual-phone recogniser, ideally one for each language we wish to discriminate between. To do this we require a phonetic transcription of the phones that have been spoken, which can be determined from the audio. In English, we would automatically expand a word-level transcription to phone level, but this is not a trivial task for Arabic, as there are several differences between the Arabic and English languages (both written and spoken) which prohibit the use of standard approaches. Firstly, Arabic is written right-to-left, whereas English is left-to-right. Secondly, Arabic and English use different written letters, meaning a transliteration must be applied to the Arabic script in order to Romanise it, for use in the HTK speech recognition toolkit. Most significantly, the pronunciation of a word may vary considerably depending on the context in which it is used, meaning that a phonetic transcription of Arabic speech requires morphological analysis to determine the exact pronunciation.
Diacritics are marks which are added to written letters to give the explicit pronunciation of the word in which they are contained. In Arabic script, they are known as tashkīl and are rarely applied to written Arabic, meaning that two words which sound different can share the same written letters. The variations in pronunciation that these marks indicate correspond to the different inflective forms of words, which encompass many factors such as the plurality, formality, or gender of a word. It is rare for diacritics to be added to written text, as capable Arabic speakers are able to infer the correct pronunciation from the context. It is more usual to find tashkīl in scripts used to teach the Arabic language, or in the Quran, where correct pronunciation is required by the Muslim faith.
As when building an English phone recogniser, an Arabic recogniser requires a phone-level transcription of the training data, which can be automatically expanded in some way from a word-level transcription. There are two commonly used methods for generating a phonetic transcription of Arabic speech. The first method is known
as a graphemic system, which builds phone models for each character in the Arabic written alphabet; pronunciation variations are then modelled implicitly by each state's GMM. This is not ideal, since it is not known how accurately the GMMs model the underlying distribution of sounds. Also, distinct sounds are combined into one model (corresponding to the Arabic letter which produced them), meaning that any linguistically important discrimination between them is lost.
The most accurate method for building an Arabic speech recogniser requires the generation of an Arabic pronunciation dictionary, mapping each input word to a series of pronunciations representing the various inflected forms of that word. This process requires morphological analysis of the text, for which there is extensive existing research [Al-Sughaiyer and Al-Kharashi, 2004; Buckwalter, 2004]. The system we use is an online interface¹ to the ElixirFM morphological analyser [Smrz, 2007], which is the state of the art in Arabic lexical analysis. The online interface can be passed a string of text in Arabic script and, once the inflection operation is invoked, will return a list of possible phonetic pronunciations for various tokenisations, genders and other features which can affect the inflective forms (Figure 7.20). The pronunciations are shown in the DIN 31635 transliteration scheme, which has a direct correspondence to the IPA.

From the information provided by ElixirFM in Figure 7.20, a pronunciation dictionary can be generated. We use the first citation form as the word entry in the dictionary for each pronunciation variant. As in Figure 7.20, it is possible for the morphological analysis to reveal multiple tokenisations of a word, in which case each combination of the tokenisations must be considered when extracting the pronunciations. This includes each inflection within each token, and so the number of possible pronunciations can increase dramatically with the number of tokenisation variants, though some duplicate pronunciations can be formed, and these can be removed. Table 7.14 shows an example of the dictionary entry for the word analysed by ElixirFM in Figure 7.20.
¹ http://quest.ms.mff.cuni.cz/elixir/
[Figure: annotated ElixirFM output, showing the input Arabic script, its citation forms, a possible tokenisation into two tokens, and an inflection with its phonetic pronunciation.]
Figure 7.20: A screenshot of the ElixirFM online interface. The figure shows an input word in Arabic script, two separate ways to tokenise the word (as one word, or as two tokens), and phonetic representations corresponding to different inflective forms of the various citations.
Once a pronunciation dictionary has been generated, a word-level transcription of the Arabic speech (where each word is in the citation form) can be expanded to phone level, using any of the inflection variants as the initial pronunciation for that word. From that point, a flat-start alignment of the phones can be performed. Of course, the inflection chosen might not be the one actually spoken in the audio, and so once the initial models have been built, the best pronunciation variant is selected from the dictionary for each word, as part of the automatic phone alignment process. Building the phone recogniser from visual features then becomes the same task as building an English recogniser.
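A minimal sketch of this dictionary-building and expansion step is given below. The format of the parsed ElixirFM output, the function names and the whitespace-separated phone strings are all assumptions for illustration; the actual alignment and variant selection are performed by the recognition toolkit.

```python
from collections import defaultdict

def build_dictionary(elixirfm_entries):
    """Map each citation form to its list of pronunciation variants.
    `elixirfm_entries` is assumed to be an iterable of
    (citation_form, pronunciation) pairs parsed from the analyser output;
    duplicate pronunciations arising from different tokenisations are dropped."""
    dictionary = defaultdict(list)
    for citation, pron in elixirfm_entries:
        if pron not in dictionary[citation]:
            dictionary[citation].append(pron)
    return dictionary

def initial_phone_transcription(words, dictionary):
    """Expand a word-level transcription to phones using the first
    pronunciation variant of each word; later alignment passes are then
    free to pick a better-matching variant from the dictionary."""
    phones = []
    for word in words:
        phones.extend(dictionary[word][0].split())
    return phones
```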
Table 7.14: Dictionary pronunciations for the Arabic word shown in Figure 7.20. The dictionary entry is the first possible citation form of the Arabic word, generated by the morphological analysis process. Each pronunciation represents an inflective form of the various citations.
Dictionary Entry   Pronunciation
baḥira             baḥira
baḥira             buḥira
baḥira             baḥḥara
baḥira             buḥḥira
baḥira             baḥḥir
baḥira             baḥrun
baḥira             baḥrin
baḥira             baḥru
baḥira             baḥri
baḥira             baḥra
baḥira             biḥarrin
baḥira             biḥarri
baḥira             biḥurrin
baḥira             biḥurri
7.6 PPRLM using Visual Phones
As in Section 7.3, we found that test language model likelihood scores were skewed towards the unseen training data if we didn't use separate data to train our phone recognisers. This problem is not immediately obvious in a PPRLM system, however, since the effect is present across each language-specific phone recogniser, meaning that no language has a performance bias that depends on whether it was used for training. However, we found that Arabic had a recognition bias. Improvements in VLID accuracy were achieved when we used a proportion of the training data in each validation fold for training the HMMs, and excluded the remaining speakers from that process. In each fold, we used the first 5 speakers for training the HMMs and all training speakers within the rest of the VLID system.

The system we used for recognition was the same as described in Chapter 6, except that the languages we used were English and Arabic. Since our Arabic recogniser does not use the same phone set as our English recogniser, and since the rules of co-articulation may vary from English to Arabic, we cannot use our previous approach
of rule-based state-tying. Instead, we must adopt the data-driven approach (Section 2.6.2), where states are clustered according to their closest neighbouring states. We used a distance threshold of 0.9, which retained a similar number of states for both the English and Arabic recognisers as did the rule-based state-tying used earlier in this chapter. The features we used for recognition were AAM features, generated from histogram equalised video frames (Section 7.4).
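The state tying itself was performed with the recognition toolkit's standard data-driven clustering; the sketch below is only a schematic illustration of threshold-based agglomerative clustering of state mean vectors, to make the idea of a distance threshold concrete. It is not the toolkit's own tying procedure, and the use of unweighted Euclidean distance is an assumption.

```python
import numpy as np

def threshold_cluster(state_means, threshold=0.9):
    """Greedy agglomerative clustering of HMM state mean vectors: repeatedly
    merge the closest pair of clusters until the smallest inter-cluster
    (Euclidean) distance exceeds the threshold."""
    clusters = [[i] for i in range(len(state_means))]
    centres = [np.asarray(m, dtype=np.float64) for m in state_means]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(centres[i] - centres[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break                          # no pair is close enough to merge
        _, i, j = best
        merged = clusters[i] + clusters[j]
        centre = np.mean([state_means[k] for k in merged], axis=0)
        for idx in sorted((i, j), reverse=True):
            del clusters[idx], centres[idx]
        clusters.append(merged)
        centres.append(centre)
    return clusters
```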
7.6.1 Results
[Figure: mean error (percent) plotted against test utterance duration (seconds), one curve per test speaker.]
Figure 7.21: Speaker-independent VLID results for 19 test subjects from the UN2
dataset. These results relate to the PPRLM experiment using histogram equalised
video described in Section 7.6
Figure 7.21 presents the VLID performance of a PPRLM system built using AAM
features derived from histogram equalised video frames. The optimum number of
mixture components was found to be 1. Interestingly, the VLID recognition accuracy
[Figure: difference in mean error (percent) plotted against test utterance duration (seconds), one curve per test speaker.]
Figure 7.22: A plot showing the difference between the results shown in Figure 7.21 (Visual PPRLM using histogram equalisation) and Figure 7.14 (Visual PRLM using histogram equalisation), for each speaker. Positive values indicate an increase in error from using the PPRLM architecture, whereas negative values indicate a reduction in error.
consistently declined as the number of mixture components was increased, which suggests that the generalising power of the PPRLM approach is compromised as the visual models are fitted to the training data more accurately; this is not the conclusion we drew from the PRLM experiments. The three poorest performing subjects are the same as those identified in the PRLM experiments using histogram equalisation in Figure 7.14. Using PPRLM, the mean recognition accuracy is higher (though not statistically significantly so). An explanation for the low number of mixture components required for optimal performance, and for the lack of performance improvement over PRLM, could be the limited number of speakers used to train our models for the PPRLM system. Since we only use a subset of
speakers to train the visual-phone models, it is possible that those models do not
generalise very well and therefore that a small number of mixture components are
required to model the variation they contain.
[Figure: mean error (percent) plotted against test utterance duration (seconds) for the Audio PRLM, Visual PRLM, Histogram Equalisation PRLM, Shape-Only PRLM, Tooth-Recognition PRLM and Histogram Equalisation PPRLM systems.]
Figure 7.23: A plot showing the mean performance of each VLID system described
in this chapter, and results generated from audio LID experiments on the UN2 dataset.
7.7 Conclusions
In this chapter, we have presented a further evaluation of the techniques for visual-only language identification described in Chapter 6, using a larger dataset of native English and Arabic speaking subjects. We have demonstrated that good speaker-independent language discrimination can be achieved using a PRLM LID architecture, with phone recognition performed using visual features. Further to this, we have shown that language discrimination and phone recognition in the visual domain
are significantly harder problems than in the audio domain, which is reflected in the low accuracies attained using state-of-the-art techniques. We have shown that some degree of language discrimination is possible using features completely free from appearance or from colour, which shows that the shape and dynamics of the articulators are key to visual-phone recognition and hence language identification. Finally, we have shown that good discrimination can also be achieved by way of a PPRLM approach, using an English and an Arabic phone recogniser in conjunction.
We have shown that the accuracy of our visual phone recognisers is very low compared to those built using audio features. The key to language discrimination is the accuracy of the phone recognition element in the VLID system, a fact which is confirmed by applying our LID approach to audio features, and also shown by our viseme accuracy simulation in Chapter 6. We have shown that some degree of speaker-independent visual-phone recognition is attainable, and that using our highly constrained video data, it is often possible to identify a spoken language, despite the poor recognition performance.
The number of insertions and deletions in a visual phone recognition system has been shown to be very high, suggesting strong confusability and a lack of information in the visual signal from which to recognise the spoken phones. It is not clear precisely how many visual phones (by which we mean audio phones that could be discriminated visually) are present in continuous visual speech compared to audio, and therefore it is not known exactly how the balance of insertions and deletions should be dealt with. Although here we used a fixed global penalty, it is possible that it should be determined empirically for each speaker. Research into the nature of visual phone deletions and the composite units of visual speech may provide a method for dealing with this problem, including the possibility of redefining how visual speech is transcribed, to account for phones that are indiscriminable visually. Importantly, many-to-one phone-to-viseme mappings do not address this problem sufficiently.
For those speakers for whom language discrimination was not possible, we cannot
confidently identify the cause of their limited performance. However, we can hypothesise some reasons. When analysing the features of failed speakers by way of a 2D distance-preserving projection, they are not shown to be outliers in terms of their relative distance to other speakers. For those speakers who are classified incorrectly, increasing the duration of the test utterance does not improve performance, which implies that there is a fundamental problem in recognising their spoken language. One reason for this might be a subject's atypical speaking style, or a visual appearance that is too different and not representable by the appearance model used to generate the features. It may also be that too many phones are inserted into or deleted from the recognised sequences, thereby corrupting the discriminable language information. We have observed that the classification features lie close to the SVM decision boundary even where classification has been successful, so it is possible that even a subtle degradation of the discriminative elements of a subject's speech is likely to worsen recognition performance.
Further work could seek to develop the idea of an articulator classifier. In particular, a method could be developed to abstractly and speaker-independently represent the location of the tongue. Also, the representation and detection of the teeth developed here could be improved in both accuracy and speaker independency. Given the high degree of error in our visual-only phone recognisers, and the speaker dependency present in our visual features, we cannot be certain of the extendability of this approach to discriminating a greater number of languages. Therefore, this work could be extended to discriminate between three or more different languages, to evaluate how well this approach generalises. Finally, the biggest limitation we have identified is the poor recognition accuracy achieved by our visual-phone recognisers. If we could improve their accuracy, then we would undoubtedly improve the performance of our VLID system. Thus, two major issues must be tackled: firstly, the speaker independence of AAM features, and secondly, the problem of visual ambiguity of phones and therefore the way that they are transcribed. All of these issues are
discussed in greater depth in Chapter 9.
Chapter 8
Limitations of Visual Speech
Recognition
8.1 Introduction
Visual-only speech recognition is known to be a difficult problem [Theobald et al., 2006; Cox et al., 2008; Lan et al., 2009]. We have demonstrated this in this thesis, specifically by the low speaker-independent viseme and visual-phone recognition performances in Chapters 6 and 7. Most automated systems (including ours) extract visual features from image regions that contain the mouth and train recognition systems to identify classes of sounds based on the temporal patterns apparent in these features. This process is known as automated lip-reading [Nankaku et al., 2000; Newman and Cox, 2010]. Conversely, humans tend to use more information than is available on just the lips. For example, humans also use head movements, facial expressions, body gestures and, more importantly, high-level language structure and context to help them interpret human speech. The use of these additional features of communication is referred to as speech-reading [Stork and Hennecke, 1996].

Most of the focus on constructing automated speech recognition systems that utilise visual information has been with respect to audio-visual speech recognition
[Almajai and Milner, 2009; Matthews et al., 2002; Potamianos et al., 2001, 2003]. That is, augmenting an acoustic speech recogniser with visual features to improve the robustness of the recogniser to acoustic noise. Few studies focus on developing pure automated lip-reading systems, and fewer still focus on developing speech-reading systems. The complexity of building an automated system that integrates all of the modalities used by human speech-readers is far beyond the current state of the art. A major limitation of automated lip-reading systems is that not all of the speech articulators are visible, and so the information available to the system is somewhat limited. This means that the differences in articulation for many sounds occur where they cannot be seen, and so many sounds appear visually similar on the lips. For example, the place of articulation for the phonemes /b/, /m/ and /p/ is at the lips (these are bilabial stops and require a closure of the lips). The differences between these sounds are in the voicing and the nasality. Similarly, /f/ and /v/ are labiodental fricatives and require the lower lip and upper teeth to come into close contact. Again, one of the main differences between these sounds is that /f/ is voiceless whilst /v/ is voiced, a difference that cannot be seen at the lips.
It is customary to divide the set of phonemes into visually contrastive groups, referred to as visemes (an abbreviation of 'visual phonemes') [Fisher, 1968]. Visual speech recognition then involves constructing models for visemic classes rather than phonetic classes, and using these models to recognise speech from visual-only data. There are many problems with this. Firstly, the inventory of visual speech units is smaller, so the number of unique words that can be constructed is also significantly lower when using a visemic rather than a phonetic transcription. Secondly, many possible mappings from phonemes to visemes have been proposed and there is no standardised set of visemes for transcribing visual speech. Thirdly, the same viseme can appear very different because of coarticulation effects (e.g., the /l/ and /n/ in 'lean' and 'loon' are very different visually yet supposedly have the same visemic label and so the same visual meaning). Fourthly, there is the problem that the audio and visual signals are asynchronous. Visual gestures will always precede the acoustic realisation of a phoneme, and so an alignment between the two signals
is not always possible.
In this chapter, parts of which are also presented in Newman et al. [2010], we are interested in understanding the limitations of lip-reading systems, and we consider the improvements that could be gained were additional information from other (non-visible) speech articulators available to the recogniser. That is, how accurate can a visual speech recognition system be, given that not all of the articulators can be seen? To this end, we construct automatic speech recognisers trained on various amounts of articulatory information, and measure how the performance of these systems degrades as information is withheld from the recognisers. Hidden Markov model (HMM) speech recognisers are trained using electromagnetic articulography (EMA) data drawn from the MOCHA-TIMIT dataset. Articulatory information is systematically withheld from the recogniser and the performance is tested and compared with that of typical state-of-the-art lip-reading and audio speech recognition systems.
The rest of this chapter is arranged as follows: Section 8.2 describes the MOCHA-TIMIT and UN2 datasets, and the active appearance model (AAM) and EMA features used in the experiments described here. Three experiments are described in Section 8.3; the first compares AAM features to EMA features in a speaker-dependent setting (Section 8.3.1), the second does the same but tests speaker-independently (Section 8.3.2), and the final experiment investigates the sample rate of visual features compared to typical audio features (Section 8.3.3). Section 8.4 concludes this chapter.
8.2 Dataset and Features
The articulatory features used in this work are drawn from the MOCHA-TIMIT dataset [Wrench, 2001], which consists of a series of EMA measurements for a male and a female speaker. The EMA data are captured at 500Hz and represent the x and y positions of eight points on the midsagittal plane: at the upper and lower lips, upper and lower incisors, tongue tip, blade and dorsum, and the velum (Figure 8.1).
The nose bridge and upper incisor are fixed relative to each other and can be used to align the sensor information to a reference point. However, in these experiments we ignore the nose bridge and use the upper incisor and the so-called fleshpoint articulators as recognition features. The EMA data is down-sampled to 100Hz, to match typical audio recognition features.
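A minimal sketch of this 500 Hz to 100 Hz downsampling is given below, assuming the EMA channels are stored as an (n_samples, n_channels) array; SciPy's decimate applies an anti-aliasing filter before reducing the rate. The exact filtering used in practice is not specified in the text, so treat the details as assumptions.

```python
from scipy.signal import decimate

def downsample_ema(ema_500hz):
    """Reduce 500 Hz EMA trajectories to 100 Hz to match the frame rate of
    typical acoustic features. `ema_500hz` is an (n_samples, n_channels)
    array of x, y sensor coordinates."""
    return decimate(ema_500hz, q=5, axis=0, zero_phase=True)
```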
[Figure: the midsagittal plane, with sensor positions marked at the nose bridge, upper and lower lips, upper and lower teeth, velum, and the tongue dorsum, blade and tip.]
Figure 8.1: A diagram of the midsagittal plane, showing the positions of the EMA sensors in red and the names of the articulators that they are used to track. The nose sensor is not used for recognition.
The video data for lip-reading used in this work are drawn from a custom-made dataset (Section 3.3) containing 25 native English speakers reading the United Nations Universal Declaration of Human Rights (Appendix A). We used all 25 speakers for our work here. The video was recorded using a Sanyo Xacti FH1 camera at 1920x1080 resolution and 60 frames per second progressive scan. Note that the MOCHA-TIMIT dataset does have corresponding video data, but the videos were not captured under good conditions. For lip-reading we require some constraints to be imposed on the speakers. For instance, the speaker must be facing the camera
so the articulators are always visible, and the head pose should ideally be reasonably constant, as should the lighting conditions. There is an inherent difficulty in avoiding occlusion of the articulators, since the EMA sensors used to capture the data have wires attaching them to recording equipment.
To extract visual speech features for recognition, an active appearance model (AAM) [Cootes et al., 2001] is trained manually from a few tens of images for each speaker (Section 2.4). Our choice of visual feature is motivated by earlier work comparing visual features for lip-reading [Lan et al., 2009]. We have found that model-based features significantly outperform image-based features (such as eigen-lips or discrete cosine transform (DCT) features). To compensate for inter-speaker differences in the parameters, the visual features, which are highly speaker-dependent, are z-score normalised per speaker (Equation 6.1). Inspection of the original features showed that their distribution is approximately Gaussian, and hence the z-score is an appropriate normalisation to apply. We also normalise the EMA x and y points per utterance, to compensate for physiological differences between the two speakers.
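A sketch of this z-score normalisation, applied per speaker to the AAM parameters and per utterance to the EMA coordinates, is shown below; the guard against constant dimensions is an implementation detail added here for robustness.

```python
import numpy as np

def zscore(features):
    """Normalise each feature dimension to zero mean and unit variance.
    `features` is an (n_frames, n_dims) array covering one speaker (AAM)
    or one utterance (EMA)."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0          # guard against constant dimensions
    return (features - mean) / std
```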
8.3 Experiments
As in the previous chapters, our visual-only recognisers consist of tied-state triphone hidden Markov models (HMMs) (Section 2.6.2) that model coarticulation around a central phone, as is typical in state-of-the-art speech recognition systems. The phone-level transcriptions required to construct these models for the EMA data are provided with the MOCHA-TIMIT dataset, and for the AAM features the phonetic transcriptions were generated from automatically expanded word-level transcriptions of the acoustic speech. A flat start [Young et al., 2006] was then applied to the training data so that the phone-level segmentations were not influenced by the acoustic segmentation.
Each of the 44 phone HMMs starts as a three-state model with a single Gaussian
probability density function associated with each state. These were then replicated to form triphone models for all phone contexts present in the training data. At this stage, there are many triphones with too few examples to train accurate models. To overcome the problem of data sparsity, appropriate states are tied together in order to share data between infrequently occurring triphones. Deciding which states to tie is done using hierarchical tree clustering driven by left and right viseme context questions. In a phone recognition system, the most significant coarticulation rules are known from knowledge of the phonetic properties of speech. In this work we use the phone clustering rules provided by the Resource Management tutorial demo [Price et al., 1988]. During clustering, rules that do not satisfy state occupancy and likelihood thresholds are ignored, leaving the most appropriate rules for the given parameters. The thresholds we specified retained between 6-7% of the total number of states after tying. Finally, the number of mixture components was increased sequentially from one to eight. To ensure the language model does not influence recognition, the grammar scale factor is set to 0, and to balance the insertion and deletion rates the insertion penalty is set to between -15 and -20 (see the HTK book [Young et al., 2006] or Section 2.7 for more information on these properties of the recogniser).
8.3.1 Speaker-Dependent Articulatory Features
We first establish the maximum accuracy that can be achieved using context-dependent phone recognisers trained on the EMA features from the MOCHA-TIMIT dataset. Next, subsets of these features are formed by systematically removing articulatory features. The degradation in performance of the recognisers can then be measured as a function of the loss of information that results from removing the articulators.

First, speaker-dependent recognisers are built for the two speakers in the MOCHA-TIMIT dataset. A five-fold cross validation setup is used, giving 92 test utterances per fold. The features initially consist of all 16 x, y coordinates from all 8 sensors, and then each sensor is removed in turn from the back to the front of the vocal tract
until only one articulator remains. This allows us to simulate the loss of information due to the limited visibility of the articulators toward the back of the mouth during lip-reading. To benchmark the performance against features used in traditional acoustic phone recognition, tied-state, multiple-mixture, speaker-dependent HMMs are also trained and tested using mel-frequency cepstral coefficient (MFCC) features extracted from the acoustic speech signal.
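A sketch of forming these cumulative feature subsets is given below. It assumes each sensor contributes one (x, y) column pair and that the column pairs follow the back-to-front ordering used in Figure 8.2; the exact column ordering in the MOCHA-TIMIT files is an assumption for illustration.

```python
# Back-to-front ordering of the EMA sensors used for recognition.
SENSORS = ["velum", "dorsum", "blade", "tip",
           "upper_incisor", "lower_incisor", "upper_lip", "lower_lip"]

def cumulative_subsets(ema):
    """Yield (label, feature_subset) pairs, removing one sensor at a time
    from the back of the vocal tract. `ema` is an (n_frames, 16) array whose
    column pairs are assumed to follow the SENSORS ordering."""
    for n_removed in range(len(SENSORS)):
        kept = SENSORS[n_removed:]
        cols = []
        for name in kept:
            idx = SENSORS.index(name)
            cols.extend([2 * idx, 2 * idx + 1])     # x and y columns
        label = "all articulators" if n_removed == 0 else "-" + SENSORS[n_removed - 1]
        yield label, ema[:, cols]
```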
[Figure: phone recognition accuracy (percent) for Female MFCC, Male MFCC, Female EMA and Male EMA systems, as articulators are cumulatively removed (all articulators, -velum, -dorsum, -blade, -tip, -upper incisor, -lower incisor, -upper lip).]
Figure 8.2: Speaker-dependent phone recognition using articulatory features. The left-most label on the x-axis denotes that all articulators were used by the recogniser. Each subsequent score denotes the removal of an articulator (named on the axis), and the removal is cumulative moving left-to-right (e.g. -dorsum indicates that both the velum and the dorsum were not included). The plot shows the mean accuracy averaged over the five folds, and the error bars denote ±1 standard error. The accuracy (and the trend) for both speakers is very similar, and as expected the performance degrades as information is withheld from the recogniser.
The mean phone recognition accuracy of the speaker-dependent recognisers is illustrated in Figure 8.2. A mean accuracy of 65% is obtained using MFCC features in a speaker-dependent recogniser for both the male and the female subjects. This is in contrast to the mean accuracy of only 45% achieved using all articulatory features.
This difference in performance is likely to be attributable, in part, to the lack of direct information regarding voicing in the articulatory features, since the vocal folds are not monitored (Figure 4.1 shows the position of the vocal folds within the vocal tract). Analysis of the confusion matrices for these experiments shows that there is greater confusion of phonemes that differ only by voicing (such as /f/ and /v/, or /s/ and /z/) when using EMA features than when using MFCC features. Furthermore, there is also a more even distribution of confusions amongst all of the phone classes, and an increase in phonetic deletions, suggesting that the EMA data itself is less informative than the acoustic features (perhaps limited by the number and position of the EMA sensors). The removal of the EMA sensors gives an understandable reduction in performance, with a significant decrease when no tongue back or velum information is present. This is possibly because these sensors provide strong cues related to nasality, so their absence precludes good discrimination between nasal and non-nasal sounds. Interestingly, using only the position of the lower lip as a feature for speech recognition provides a mean phone recognition accuracy of 12%, which is still significantly above chance (2.25%), although the number of deletions observed is very high when compared to using all sensors for recognition.
8.3.2 Speaker-Independent Articulatory Features
To consider the problem in a more general sense, speaker-independent recognisers are employed, which are trained and tested using the same five-fold cross validation paradigm described in Section 8.3.1. That is, a recogniser is trained from the female EMA data and tested on only the male EMA data, and vice versa. The performance is then averaged over both speakers. The performance of these recognisers is compared to a traditional speaker-independent (visual-only) AAM-based automated lip-reading system trained and tested on AAM features derived from the UN2 dataset (Section 3.3). The results are shown in Figure 8.3.
The performance of the speaker-independent recogniser is significantly below that of the speaker-dependent recognisers. Using all sensors provides a mean phone recog-
[Figure: speaker-independent phone recognition accuracy (percent) for EMA and AAM features, as articulators are cumulatively removed (all articulators, -velum, -dorsum, -blade, -tip, -upper incisor, -lower incisor, -upper lip).]
Figure 8.3: Comparing the performance of speaker-independent phone recognition
using articulatory features and AAM features. Note the point of intersection between
the two curves. AAM-based features capture the visible shape and appearance infor-
mation, and perform as well as the EMA features when the articulators from tongue
blade/tip forward are available.
nition accuracy of only 29% (compared with 45% for speaker-dependent). However, we note that the trends of the curves in both Figures 8.2 and 8.3 are very similar. A mean accuracy of 14.5% is obtained using the AAM features, with a standard deviation of 2.4. It is interesting to note where the line in Figure 8.3 marking the performance of the AAM-based recogniser intersects the line marking the performance of the articulatory features. This appears to be somewhere between when the tongue blade and the tongue tip are included (with the teeth and lips). This is a striking result that shows just how much information is contained in the sparse EMA features compared with the dense description provided by the AAM. This result also shows that similar information appears to be encoded in both feature sets, and possibly where
the upper bound on the expected performance of a visual-only recogniser might be.
8.3.3 Investigating Sampling Rate
The appropriate rate at which to present visual information for human lip-reading and the effects of low frame rates have been the focus of much research [Blokland and Anderson, 1998; Vitkovitch and Barber, 1994; Knoche et al., 2005; Williams et al., 1997]. All of these studies have focussed on reducing the frame rate of visual stimuli presented to humans and observing the effects on speech perception, such as the confusions between visemes. Given that most domestic video cameras capture video at between 25 and 30 frames a second, these studies presented upper frame rates of no more than 30 frames a second, and all reported minimum required frame rates below that figure. However, the findings of these studies do not provide us with sufficient information to determine the ideal frame rate for an automated lip-reading system. Firstly, the human eye does not work on a frame-by-frame basis; rather, its perception of motion is a continuous process, varying in sensitivity based on a number of different factors (including the speed of the observed motion, the way it is moving or changing, and the nature of the moving object itself). Secondly, we cannot be sure to what extent the human brain is filling in the perceptual gaps caused by missing visual information.
Here we present two separate experiments to improve our understanding of the frame-rate requirements of a visual speech recogniser, and to see whether we are under-sampling our visual features. The first experiment aims to establish how much more information is contained within our 60 frames per second video than would be present if we were using lower frame rates. We simulate this by comparing AAM features at different frame rates to our maximum rate of 60 frames per second. Our comparison measure is the mean squared difference between equivalent frames. Lower frame rates are generated by downsampling our 60 frames per second AAM features to the desired rate. Obviously, a lower frame rate means that there are fewer frames overall, so we then upsample to 60 frames per second to allow us to directly compare the signals
on a frame-by-frame basis. In a separate experiment, we build a simple audio phone recogniser using exactly the same configuration as in Section 7.3.2, except that the target rate of our audio features is 30Hz, rather than 100Hz. This experiment is designed to see how well audio features perform for phone recognition when they are sampled at the same rate as typical visual features.
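A sketch of the first of these experiments is given below: the 60 fps AAM parameter stream is downsampled to a target rate, upsampled back to 60 fps, and compared to the original by mean squared difference. Polyphase resampling is used here purely for illustration; the interpolation method and file layout are assumptions, as the text does not specify them.

```python
import numpy as np
from scipy.signal import resample_poly

def frame_rate_mse(features_60fps, target_fps):
    """Mean squared difference between 60 fps AAM features and the same
    features after downsampling to `target_fps` and upsampling back to 60 fps.
    `features_60fps` is an (n_frames, n_dims) array."""
    down = resample_poly(features_60fps, up=target_fps, down=60, axis=0)
    back = resample_poly(down, up=60, down=target_fps, axis=0)
    n = min(len(back), len(features_60fps))      # lengths can differ slightly
    return float(np.mean((back[:n] - features_60fps[:n]) ** 2))

# Example: compare a range of candidate frame rates.
# aam = np.load("speaker_01_aam.npy")            # hypothetical (n_frames, 70) array
# for fps in (5, 10, 15, 20, 25, 30, 60):
#     print(fps, frame_rate_mse(aam, fps))
```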
The results in Tables 8.1 and 8.2 show that audio phone recognition performance breaks down significantly when the sampling rate used for the MFCC features is in the region of standard video frame rates. This must be in part due to many spoken phones lasting less than the duration of a single frame. It therefore seems unlikely that visual-only recognition could be expected to identify phones which are indiscriminable using ideal audio features at the same frame rate. Figure 8.4 shows that using video frame rates between 25 and 30 frames per second gives a good approximation of the 60 frames per second visual signal, suggesting that there is not much information to be gained from greater frame rates. By contrast, we know that higher rates improve audio performance significantly. This result seems to confirm our earlier findings (Sections 8.3.1 and 8.3.2) that visual features lack the information required for better phone discrimination, regardless of frame rate. From our other experiments in this chapter, we can hypothesise that it is articulatory information which is missing from our visual features, or, put another way, we cannot see enough on the face to identify every spoken phone.
Table 8.1: Speaker-independent audio monophone recognition performance, as pre-
sented in Section 7.3.2. This phone recognition system was trained and tested using
MFCC features sampled at 100Hz.
Accuracy # Correct # Deleted # Substituted # Inserted # Phones
47.58 45628 9126 23688 8305 78442
Table 8.2: Speaker-independent audio monophone recognition performance, as pre-
sented in Section 7.3.2. This phone recognition system was trained and tested using
MFCC features sampled at 30Hz.
Accuracy # Correct # Deleted # Substituted # Inserted # Phones
22.04 25835 24611 27996 8547 78442
[Figure: mean squared difference (x 10^5) plotted against frame rate (frames per second).]
Figure 8.4: A plot showing the mean squared difference between various frame rates of AAM features and a 60 frames a second signal. Each frame rate was generated by downsampling the 60 frames per second signal, and then upsampling back to the original rate.
8.4 Conclusions
In this chapter we have investigated the limits of automated lip-reading. We first used EMA features to measure the maximum mean phone recognition accuracy that can be achieved using all eight of the sensors in the MOCHA-TIMIT dataset. We then systematically removed sensor data to simulate the loss of information due to the rearmost articulators not being visible. The performance of these EMA-based recognisers was compared both with a traditional acoustic speech recogniser and with a traditional AAM-based lip-reading system. We found that using all eight
articulatory sensors, not surprisingly, achieved the best performance. However, this was significantly below the performance of an acoustic speech recogniser trained on the equivalent acoustic speech, but significantly better than an AAM-based system. The difference in performance between the acoustic recogniser and the one built using the full articulatory feature set could be due to the lack of voicing information in the EMA features. We found that the performance of the EMA and AAM-based systems was approximately equal when the articulators in front of the tongue blade were available to the recogniser (i.e. information that can be seen from a frontal view of the face). This perhaps suggests an upper bound on the expected performance of pure lip-reading using only visual features. We also showed that 60 frames per second AAM features do not provide much more visual information than those captured at 25-30 frames per second, and that audio recognition performance is poor at sample rates this low. Together, these findings suggest that visual features lack important discriminatory information, and that this would not improve with increased frame rates.
We note that only eight EMA sensors are included in the MOCHA-TIMIT dataset, and that these measure the x, y position along the midsagittal plane. This work would benefit from repeating the study using more articulatory data. For example, the performance of the recognisers using the frontmost articulators was surprisingly poor, probably because only the degree of mouth opening is being measured. Including other sensors would allow more complex mouth gestures to be captured. In removing the sensor information, we only considered removing the sensors one at a time from the back of the mouth forwards. Different combinations of sensors could be tested to determine the most useful set overall. In addition, future work could take a more detailed look at the ways in which the recognisers in the different modalities fail. The results presented here consider only the mean phone recognition accuracy. We might look, for example, at the particular classes and contexts of phones that the different modalities are most able to recognise accurately. This would allow audio-visual recognisers to be coupled with EMA data, and the relative weight of
the different modalities could be adapted according to the available information. This was not done here because we are interested only in investigating lip-reading (i.e. unimodal recognition). Ideally, we would also build a speaker-independent EMA recogniser using a far greater number of speakers than are available in the MOCHA-TIMIT dataset. Additional speakers might improve the generalising power of our speaker-independent models, as we would be providing the system with more information about the different variations of articulation across a number of people, rather than just one other person.
Chapter 9
Conclusions and Future Work
9.1 Introduction
This thesis has presented novel research into automatic language identification using visual features. The task of visual language identification (VLID) is one of observing the visual correlates of speech and deciding which spoken language the utterance contains. Here, we focussed on the development of an automatic method for VLID, using computer lip-reading. We have developed two methods for language identification (LID) of visual speech, based upon audio LID techniques that use language phonology as a feature of discrimination: firstly, an unsupervised approach that tokenises active appearance model (AAM) feature vectors using vector quantisation (VQ), and secondly, a method of visual triphone modelling, initially with phone superclasses (visemes), and then with phones. We have evaluated the performance of our systems in both speaker-dependent and speaker-independent scenarios, showing that VLID is possible in both cases, and that there is sufficient information presented on the lips to discriminate between two or three languages using these techniques, despite the low phone recognition accuracies that we observed. Finally, we investigated the limitations of our AAM features, in terms of the articulatory information that they are missing for improved recognition. This chapter concludes the thesis by first summarising and discussing the conclusions of each preceding
chapter (Section 9.2). Then, in Section 9.3, we explain the limitations of this work and how it could be taken further. Finally, Section 9.4 lists the publications relating to the research described herein.
9.2 Discussions and Conclusions
In the introduction to this thesis (Chapter 1) we described the task of VLID, including its real-world applications, and briefly outlined the major research questions posed by this work: namely, is it possible to identify a language from computer lip-reading alone, and are the discriminatory features of language consistently visible across different speakers? We explained that LID usually exists as a subsystem within a multilingual automatic speech recognition system, to allow the automatic selection of the correct language-dependent system.
Chapter 2 was concerned with providing the background on the pre-existing techniques used within the body of this thesis. We provided precise details of the phone recognition followed by language modelling (PRLM) approach to language discrimination, and then described in turn each of the subsystems it contains, focussing on their specific application to VLID. Language modelling and vector quantisation (VQ) were also introduced, which are integral parts of the unsupervised method for VLID described in Chapter 5. Hidden Markov model (HMM) speech recognition, including triphones, was also covered to accompany its application in Chapters 6, 7 and 8. AAMs were outlined, firstly as a method of tracking facial elements, such as the mouth, and secondly as a feature generation method. Finally, linear discriminant analysis (LDA) and support vector machines (SVMs) were defined as two techniques used in LID for back-end classification.
Chapter 3 introduced two new datasets designed for VLID experiments. United Nations 1 (UN1) was the first dataset recorded, and was intended for speaker-dependent VLID experiments. It contained multilingual speakers, with each speaker reading the UN Declaration of Human Rights in all of the languages in which they
are competent. A standard definition video camera was used to record the audio and video. Tracking information and automatically generated phone-level transcriptions are available for some of the speakers. We also used this dataset in our preliminary speaker-independent experiments described in Chapter 6. The second dataset, United Nations 2 (UN2), contained high definition video and high quality audio from a tie-clip microphone. We recorded this dataset for our further speaker-independent experiments in Chapter 7. This data included 25 native English and 10 native Arabic speakers reading the entire UN Declaration of Human Rights. Feature tracking details and automatically generated phone-level transcriptions are available for all speakers in UN2.
Chapter 4 surveyed various elds of research relating to the topic of VLID. We
discussed the features of human speech and language, to better understand the
speech information that is presented by the visual modality and to nd the discrim-
inatory features of language which a LID system might exploit. Several techniques
for audio LID were presented which use phonological and syntactical information as
a source of discrimination, as these are features that are potentially observable from
the face by phone recognition. The National Institute of Standards and Technology
language recognition evaluation task (NIST LRE) was introduced to illustrate what
currently constitutes a challenging task in audio LID, and also the error measures
they use to benchmark and compare the performance of LID systems. Research into
audio-visual (AV) and visual-only speech recognition showed that techniques exist to
make use of the information provided by the visual modality for improving speech
recognition in noisy environments, and that appearance derived features such as
AAMs provide the best recognition performance in speaker-independent tests. Cru-
cially, this showed that it might be possible to discriminate languages visually by
using LID approaches which use phonetic information, such as PRLM, by recog-
nising spoken phones from visual information. We also showed that humans with
unimpaired hearing are able to discriminate phonetically similar languages from vi-
sual information alone, but not with high accuracy, demonstrating the difficulty of
this task.
In Chapter 5 we introduced a method for VLID using AAMs as our visual fea-
tures. We used AAM shape parameters for these preliminary, speaker-dependent
VLID experiments. We focussed on speaker-dependent experiments because we
were aware of the strong speaker dependency of our recognition features, and we
wanted to concentrate on the language discrimination aspect of the task. The system
for LID we developed was based upon a standard audio LID approach known as
PRLM, where incoming frames are tokenised, language models are built from the
token sequences, and the model producing the maximum likelihood classies a test
utterance. We used a novel method of sub-phone tokenisation, namely VQ of AAM
frames, which operates in much the same way as Gaussian Mixture Model (GMM)
tokenisation. This method was unsupervised, and therefore did not require the pho-
netic transcriptions usually needed to train monophone models, used for phonetic
tokenisation. The results in this chapter showed that speaker-dependent language
discrimination was possible amongst the three speakers for which we tested our ap-
proach. We also demonstrated that we could discriminate between three recitals
containing the same language but read at dramatically different speeds, and to a
lesser extent, three apparently equal recitals of the same language. This suggested
that the system could also use non-language cues for discrimination, and that by
performing speaker-independent experiments we could average out some of these
effects (such as lighting and pose), which provided the motivation for the work in
the next chapter.
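
To make the structure of this unsupervised approach concrete, the sketch below shows the core of a VQ-based PRLM system under simplifying assumptions: AAM frames are vector quantised against a pre-trained codebook, an add-one smoothed bigram language model is estimated per language from the resulting token sequences, and a test utterance is assigned to the language whose model gives the highest log-likelihood. The function and variable names are illustrative rather than those of our implementation, and the codebook itself would come from clustering (e.g. k-means) of pooled training frames.

import numpy as np

def tokenise(frames, codebook):
    # Vector-quantise AAM feature frames: each frame is replaced by the
    # index of its nearest codebook entry (Euclidean distance).
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def train_bigram_lm(token_seqs, vocab_size):
    # Estimate an add-one smoothed bigram model, log P(b | a), from the
    # token sequences belonging to one language.
    counts = np.ones((vocab_size, vocab_size))
    for seq in token_seqs:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return np.log(counts / counts.sum(axis=1, keepdims=True))

def classify(frames, codebook, lms):
    # Score the tokenised utterance with every language model and return
    # the language giving the maximum log-likelihood.
    seq = tokenise(frames, codebook)
    scores = {lang: lm[seq[:-1], seq[1:]].sum() for lang, lm in lms.items()}
    return max(scores, key=scores.get)
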
Chapter 6 presented our preliminary work on speaker-independent VLID. This
time using AAM features consisting of shape and appearance parameters, we built a
PRLM system to distinguish English from French speech. Using our UN1 dataset, we
selected five speakers who spoke both English and French, and then performed a five-
fold cross validation, excluding each speaker in turn to use them for testing. Unlike
before, we transcribed our speech data in terms of confusable phone groups, known
as visemes, and from these transcriptions built viseme HMMs with which to tokenise
our AAM feature data. We decided to move away from frame level tokenisation, since
phone tokenisation has been shown to provide superior LID performance when using
audio features and we thought that logically this would apply to the visual domain
also. To tackle the speaker dependency of our AAM features, we z-score normalised
each dimension and applied a mutual information weighting. Classication was
performed using an SVM back-end classifier, where the input features were the ratios
of the language model likelihood scores. Results showed that speaker-independent
recognition was possible, with mean recognition accuracy of over 80% achieved with
60 seconds of test data, but a simpler, unsupervised audio LID system gave superior
performance. Finally, we ran a simulation in which we artificially degraded perfect
visemic transcriptions for our speakers, which showed that a viseme accuracy in the
region of 50% was required to reach near perfect language discrimination, whereas
our recognisers were only around 35% accurate.
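
As an illustration of the two normalisation steps described above, the following sketch z-score normalises each AAM dimension per speaker and then estimates a per-dimension mutual information weight against integer-coded class labels using simple histogram binning. The function names, the number of bins and the use of quantile bin edges are assumptions made for the example, not a record of our exact implementation.

import numpy as np

def zscore_per_speaker(features, speaker_ids):
    # Normalise each AAM dimension to zero mean and unit variance,
    # separately for every speaker, to reduce speaker dependency.
    out = np.empty_like(features, dtype=float)
    for spk in np.unique(speaker_ids):
        idx = speaker_ids == spk
        mu = features[idx].mean(axis=0)
        sigma = features[idx].std(axis=0) + 1e-8
        out[idx] = (features[idx] - mu) / sigma
    return out

def mi_weights(features, labels, n_bins=10):
    # Estimate the mutual information between each (binned) feature
    # dimension and the integer class labels, for use as feature weights.
    n_classes = labels.max() + 1
    weights = np.zeros(features.shape[1])
    for d in range(features.shape[1]):
        edges = np.quantile(features[:, d], np.linspace(0, 1, n_bins + 1)[1:-1])
        bins = np.digitize(features[:, d], edges)
        joint = np.zeros((n_bins, n_classes))
        for b, y in zip(bins, labels):
            joint[b, y] += 1
        p = joint / joint.sum()
        px = p.sum(axis=1, keepdims=True)
        py = p.sum(axis=0, keepdims=True)
        nz = p > 0
        weights[d] = (p[nz] * np.log(p[nz] / (px @ py)[nz])).sum()
    return weights
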
Continuing the theme of Chapter 6, Chapter 7 was also concerned with speaker-
independent visual-only LID, but this time using a larger training set containing two
phonologically dissimilar languages. In this chapter we used a portion of our cus-
tom UN2 dataset, specifically nine native Arabic and ten native English speakers for
cross-validation testing. We also used a further ten English speakers for training our
visual-phone models in our PRLM experiments. As before, the task was to identify
the language of an unknown speaker. We found that we could achieve good mean
discrimination of around 90% after 60 seconds of test data, and that as expected the
overall performance was worse than the same system built using audio features. We
then tried several approaches to minimise the effects of skin-tone on our features,
to see to what extent it was providing language discrimination. Histogram equal-
isation of our video frames reduced mean performance slightly, and increased the
range of performances observed. Using shape-only parameters provided some degree
of language discrimination but mean performance was significantly degraded, as we
would expect since these features exclude a lot of articulatory information. Then,
using shape parameters and binary video frames depicting tooth pixels, we achieved
better performance than shape alone, which showed that the added, colour-free, ar-
ticulatory information provided useful language information. Finally, having built
an Arabic visual-phone recogniser, we implemented a parallel PRLM architecture
and found that it performed much the same as the equivalent PRLM system. We
found that optimal performance with the PPRLM system was achieved with only
one mixture component per state, compared to 16 in the PRLM system. This could
have been because a much smaller subset of speakers was used to train the visual-
phone models for the PPRLM system, and therefore using more than one mixture
component over-fitted our features. This could also explain why the PPRLM system
did not outperform the approach using a single tokeniser front-end.
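
One of the skin-tone experiments mentioned above, histogram equalisation of the video frames, can be summarised by the sketch below for a single greyscale frame; whether equalisation is applied per colour channel or to a greyscale conversion is a design choice, and the code is only indicative of the general operation rather than of our exact pipeline.

import numpy as np

def histogram_equalise(frame):
    # Histogram-equalise a greyscale uint8 frame so that intensities span
    # the full range, reducing the influence of overall skin-tone and
    # lighting on the appearance parameters extracted afterwards.
    hist = np.bincount(frame.ravel(), minlength=256)
    cdf = hist.cumsum().astype(float)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())   # normalise CDF to [0, 1]
    lut = np.round(cdf * 255).astype(np.uint8)          # intensity mapping table
    return lut[frame]
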
Further to the findings of the previous chapters, Chapter 8 investigated the limita-
tions of the AAM features in order to explain the poor performance of our visual-only
phone recognisers. Using the MOCHA-TIMIT dataset, which contains electromag-
netic articulography (EMA) tracking information regarding several of the speech
articulators, we hypothesised a link between the articulatory information contained
within the EMA and that of the AAM features. To test this, we built state-of-
the-art triphone recognisers as in Chapter 7, separately for each feature set. For
the EMA data, we systematically removed each articulator's data, from the back
of the mouth to the front, and ran recognition experiments each time. In speaker-
dependent experiments, as expected, we observed that performance degraded each
time an articulator was removed. In speaker-independent experiments, despite the
limited amount of training data, we found that using all articulators gave visual-
phone recognition accuracy of around 30%, which is much higher than the accuracies
of around 15% we see with AAM features. As each articulator was removed, we saw
a similar trend of performance degradation to that of the speaker-dependent exper-
iments, suggesting that each articulator provides useful information regarding the
spoken phone. Our speaker-independent results also showed that AAM visual-phone
performance becomes equivalent to that of the EMA features when only the frontal
EMA points are included, falling between the configurations with and without the
tongue-tip. This result is intuitive, since a camera pointed at the face can only capture
information regarding the front-most articulators.
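
The articulator-removal experiments can be thought of as selecting progressively smaller sets of EMA channels before training each recogniser. The sketch below illustrates that selection step only; the articulator names, their back-to-front ordering and the channel indices are purely illustrative and do not necessarily match the MOCHA-TIMIT coil layout.

import numpy as np

# Assumed back-to-front ordering of articulators and the feature columns
# holding their EMA coordinates (illustrative only).
ARTICULATOR_CHANNELS = [
    ("velum",         [0, 1]),
    ("tongue_back",   [2, 3]),
    ("tongue_body",   [4, 5]),
    ("tongue_tip",    [6, 7]),
    ("lower_incisor", [8, 9]),
    ("upper_lip",     [10, 11]),
    ("lower_lip",     [12, 13]),
]

def articulator_subsets(channels):
    # Yield progressively smaller channel subsets, dropping one articulator
    # at a time from the back of the mouth towards the front.
    for i in range(len(channels)):
        kept = channels[i:]
        cols = [c for _, cs in kept for c in cs]
        yield [name for name, _ in kept], cols

# Example use: select the corresponding EMA columns for each experiment.
# ema = np.load("ema_features.npy")          # (frames, channels), hypothetical file
# for names, cols in articulator_subsets(ARTICULATOR_CHANNELS):
#     subset = ema[:, cols]                  # train and test a recogniser on this subset
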
9.3 Further Work
The research comprising this thesis has furthered the field of automatic VLID where
no directly related research existed. It has provided new methods for visual language
discrimination, and has improved the understanding of the intricacies of such a
system and the limitations of what has been implemented. Scientic development
is an incremental process, and as such this work has paved the way for future
research into this field and may be of interest to related areas, such as audio-visual
speech recognition or speaker verification. This section aims to discuss some of
the limitations of our study, and to present ways in which they could be overcome.
We also highlight areas of interest which have been alluded to by this work but
not explored fully. Finally, we suggest possibilities for future directions or
applications of this field.
Apart from one three-language discrimination task in Chapter 5, this research
has focussed on discriminating between two languages. In the future, the number
of languages included in the system could be increased to determine how well this
approach generalises when the chance of language confusion is higher. Groups of
phonetically similar languages could be added to see if they are more confusable
than those with diering phonetic characteristics. Languages such as Mandarin use
tonal information to distinguish between otherwise identical sounding words. Such
languages could be examined to see if the techniques we have presented are sufficient
for discrimination where an integral feature of language is not expressed visually. A
range of skin-tones should be present across the languages used, to remove colour as
a feature of language. Inclusion of additional languages would allow the generation
of a visual language taxonomy, which could be interesting to compare to language
trees which have been produced from acoustic speech, both in terms of the expected
similarities and how they might differ. Finally, the effect of non-native speech on the
visual discrimination of languages could be investigated, as second language speech
is shown to affect speech perception in the audio-visual domain [Ortega-Llebaria
et al., 2001].
Our investigation comparing AAM features to the articulatory information con-
tained within the MOCHA-TIMIT dataset provided a useful insight into the ar-
ticulatory equivalence of the AAMs. However, the MOCHA-TIMIT dataset only
contained two speakers, which is not ideal for evaluating the speaker-independent
performance of EMA data. Also, the AAM features we extracted were from a sepa-
rate dataset from the EMA data: the first contained scripted continuous speech
whilst the second contained short, phonetically balanced sentences. Furthermore,
the EMA features are only extracted from the midsagittal plane, and therefore do
not contain information regarding the width of the mouth or the tongue curvature
visible from the frontal plane, both of which could aid phonetic discrimination. We
propose that a new EMA dataset with high-quality video data should be recorded
to repeat the experiments of Chapter 8, with a view to predicting articulatory in-
formation from AAM features for improved visual-phone recognition.
Since AAM features are derived from principal component analysis on the pixel
intensities of an image, they are aected by changes in the appearance of an image.
Such variations in appearance could be caused by lighting conditions, visual artefacts
introduced by image compression algorithms, or by a physical change in the recorded
scene. A practical example of a change in scene in computer lip-reading would be
a change in head pose of the subject being recorded. Affine transformations such
as a planar rotation and translation are easy to correct for; however, rotation of the
head from side-to-side would change the 2D projection of the lips as observed by
a video camera, and this would alter the extracted AAM features. Work should
therefore be performed to assess the eect of pose on appearance features, and to
see how VLID performance degrades as the angle of pose is increased. Also, physical
tracking of the lips, via motion capture, could be used to provide ground-truth shape
information for the lips, and this could be used to assess how shape parameters vary
with pose. The results of these experiments could also be used to improve the
robustness of AAM features to lighting conditions and pose, which would ultimately
assist a real-world application of VLID.
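
As an example of the kind of correction that is straightforward, the sketch below removes planar translation, rotation and scale from a set of 2D lip landmarks by ordinary Procrustes alignment to a reference shape. Reflections are not handled, and the function is an illustrative sketch rather than part of our feature pipeline.

import numpy as np

def align_similarity(points, reference):
    # Align a set of 2D lip landmarks (N x 2) to a reference shape by
    # removing planar translation, rotation and scale (ordinary Procrustes).
    p = points - points.mean(axis=0)          # centre both shapes
    r = reference - reference.mean(axis=0)
    u, s, vt = np.linalg.svd(p.T @ r)         # cross-covariance between shapes
    rotation = u @ vt                         # optimal 2x2 rotation (up to reflection)
    scale = s.sum() / (p ** 2).sum()
    return scale * p @ rotation
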
The feature of language that we have used for discrimination is phonology, specif-
ically phonotactics, which governs the allowable sequence of phones in a language.
Phonotactics is not the only aspect of language which can be used to differentiate
between languages. Zissman [1996] describes the use of phone duration to improve au-
dio LID. In some preliminary work not described here, we tested the technique in
Zissman [1996], where PRLM is performed using double the number of phones, and
we saw an increase in identification performance. In this method, each recognised
phone is labelled according to whether it is shorter or longer than its mean duration,
which doubles the size of a phone set. Another feature of language is rhythm. Ramus
[2002] explains that babies have the ability to distinguish languages based on acous-
tic rhythm, and Ronquest et al. [2010] suggest that adults also have this ability and,
furthermore, that rhythm is expressed visually. Further work into VLID could therefore
focus on incorporating both of these additional language cues and evaluating their
contribution to language discrimination.
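
A minimal sketch of the duration-tagging idea from Zissman [1996], as tested in that preliminary work, is given below; the label suffixes and the interface through which the per-phone mean durations are supplied are assumptions for illustration only.

def duration_tag(phones, durations, mean_duration):
    # Relabel each recognised phone as a 'long' or 'short' variant according
    # to whether its duration exceeds that phone's mean duration, doubling
    # the effective phone set before language modelling.
    # 'mean_duration' maps each phone label to its average duration as
    # estimated from the training transcriptions (hypothetical interface).
    tagged = []
    for phone, dur in zip(phones, durations):
        suffix = "_long" if dur > mean_duration[phone] else "_short"
        tagged.append(phone + suffix)
    return tagged
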
Although we have demonstrated methods to reduce the strong speaker depen-
dency of our AAM features (z-score normalisation and mutual information feature
weighting), future work could seek to find features for recognition which are inher-
ently more speaker-independent. The methods of transcribing visual speech could
also be investigated further, since we have shown that many acoustic phones are
deleted in the visual modality, most probably because the articulators responsible
for their production are not visible externally. Hence, the speech units (i.e. phones)
applied to segment audio do not necessarily represent the unique visual compo-
nents of speech [Hilder et al., 2010]. We have shown that the solution is not a
simple phoneme to viseme mapping, but rather a method considering the eects of
speech production, co-articulation and speaker dependency. Speaker independency
and speech transcription are the research areas most pivotal to the improvement of
VLID technology, since we believe them to be responsible for the poor recognition
performance of our visual-phone recognisers, compared to those built using audio
features. As is true for audio LID, improved phone recognition will lead to improved
language discrimination.
9.4 List of Publications
Following is a list of first author publications resulting from the research described
in this thesis:
Newman, J. and Cox, S. (2009). Automatic visual-only language identification:
A preliminary study. In Acoustics, Speech and Signal Processing, 2009. ICASSP
2009. IEEE International Conference on, pages 4345-4348.
Newman, J. and Cox, S. (2010). Speaker-independent visual-only language identi-
fication. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE Inter-
national Conference on, pages 5026-5029.
Newman, J., Theobald, B., and Cox, S. (2010). Limitations of visual speech
recognition. In International Conference on Auditory-Visual Speech Processing (AVSP),
pages 65-68.
Appendix A
United Nations Universal
Declaration of Human Rights
Script in English
Universal Declaration of Human Rights
Preamble
Whereas recognition of the inherent dignity and of the equal and inalienable
rights of all members of the human family is the foundation of freedom, justice
and peace in the world,
Whereas disregard and contempt for human rights have resulted in barbarous
acts which have outraged the conscience of mankind, and the advent of a world
in which human beings shall enjoy freedom of speech and belief and freedom
from fear and want has been proclaimed as the highest aspiration of the common
people,
Whereas it is essential, if man is not to be compelled to have recourse, as a last
resort, to rebellion against tyranny and oppression, that human rights should be
protected by the rule of law,
Whereas it is essential to promote the development of friendly relations between
nations,
Whereas the peoples of the United Nations have in the Charter reaffirmed their
faith in fundamental human rights, in the dignity and worth of the human person
and in the equal rights of men and women and have determined to promote
social progress and better standards of life in larger freedom,
Whereas Member States have pledged themselves to achieve, in cooperation
with the United Nations, the promotion of universal respect for and observance of
human rights and fundamental freedoms,
Whereas a common understanding of these rights and freedoms is of the
greatest importance for the full realization of this pledge,
Now, therefore,
The General Assembly,
Proclaims this Universal Declaration of Human Rights as a common standard of
achievement for all peoples and all nations, to the end that every individual and
every organ of society, keeping this Declaration constantly in mind, shall strive by
teaching and education to promote respect for these rights and freedoms and by
progressive measures, national and international, to secure their universal and
effective recognition and observance, both among the peoples of Member States
themselves and among the peoples of territories under their jurisdiction.
Article I
All human beings are born free and equal in dignity and rights. They are
endowed with reason and conscience and should act towards one another in a
spirit of brotherhood.
Article 2
Everyone is entitled to all the rights and freedoms set forth in this Declaration,
without distinction of any kind, such as race, colour, sex, language, religion,
political or other opinion, national or social origin, property, birth or other status.
Furthermore, no distinction shall be made on the basis of the political,
jurisdictional or international status of the country or territory to which a person
belongs, whether it be independent, trust, non-self-governing or under any other
limitation of sovereignty.
Article 3
Everyone has the right to life, liberty and security of person.
Article 4
No one shall be held in slavery or servitude; slavery and the slave trade shall be
prohibited in all their forms.
Article 5
No one shall be subjected to torture or to cruel, inhuman or degrading treatment
or punishment.
Article 6
Everyone has the right to recognition everywhere as a person before the law.
Article 7
All are equal before the law and are entitled without any discrimination to equal
protection of the law. All are entitled to equal protection against any
discrimination in violation of this Declaration and against any incitement to such
discrimination.
Article 8
Everyone has the right to an effective remedy by the competent national tribunals
for acts violating the fundamental rights granted him by the constitution or by law.
Article 9
No one shall be subjected to arbitrary arrest, detention or exile.
Article 10
Everyone is entitled in full equality to a fair and public hearing by an independent
and impartial tribunal, in the determination of his rights and obligations and of any
criminal charge against him.
Article 11
1. Everyone charged with a penal offence has the right to be presumed
innocent until proved guilty according to law in a public trial at which he
has had all the guarantees necessary for his defence.
2. No one shall be held guilty of any penal offence on account of any act or
omission which did not constitute a penal offence, under national or
international law, at the time when it was committed. Nor shall a heavier
penalty be imposed than the one that was applicable at the time the penal
offence was committed.
Article 12
No one shall be subjected to arbitrary interference with his privacy, family, home
or correspondence, nor to attacks upon his honour and reputation. Everyone has
the right to the protection of the law against such interference or attacks.
Article 13
1. Everyone has the right to freedom of movement and residence within the
borders of each State.
2. Everyone has the right to leave any country, including his own, and to
return to his country.
Article 14
1. Everyone has the right to seek and to enjoy in other countries asylum from
persecution.
2. This right may not be invoked in the case of prosecutions genuinely
arising from non-political crimes or from acts contrary to the purposes and
principles of the United Nations.
Article 15
1. Everyone has the right to a nationality.
2. No one shall be arbitrarily deprived of his nationality nor denied the right to
change his nationality.
Article 16
1. Men and women of full age, without any limitation due to race, nationality
or religion, have the right to marry and to found a family. They are entitled
to equal rights as to marriage, during marriage and at its dissolution.
2. Marriage shall be entered into only with the free and full consent of the
intending spouses.
3. The family is the natural and fundamental group unit of society and is
entitled to protection by society and the State.
Article 17
1. Everyone has the right to own property alone as well as in association with
others.
2. No one shall be arbitrarily deprived of his property.
Article 18
Everyone has the right to freedom of thought, conscience and religion; this right
includes freedom to change his religion or belief, and freedom, either alone or in
community with others and in public or private, to manifest his religion or belief in
teaching, practice, worship and observance.
Article 19
Everyone has the right to freedom of opinion and expression; this right includes
freedom to hold opinions without interference and to seek, receive and impart
information and ideas through any media and regardless of frontiers.
Article 20
1. Everyone has the right to freedom of peaceful assembly and association.
2. No one may be compelled to belong to an association.
Article 21
1. Everyone has the right to take part in the government of his country,
directly or through freely chosen representatives.
2. Everyone has the right to equal access to public service in his country.
3. The will of the people shall be the basis of the authority of government;
this will shall be expressed in periodic and genuine elections which shall
be by universal and equal suffrage and shall be held by secret vote or by
equivalent free voting procedures.
Article 22
Everyone, as a member of society, has the right to social security and is entitled
to realization, through national effort and international co-operation and in
accordance with the organization and resources of each State, of the economic,
social and cultural rights indispensable for his dignity and the free development
of his personality.
Article 23
1. Everyone has the right to work, to free choice of employment, to just and
favourable conditions of work and to protection against unemployment.
2. Everyone, without any discrimination, has the right to equal pay for equal
work.
3. Everyone who works has the right to just and favourable remuneration
ensuring for himself and his family an existence worthy of human dignity,
and supplemented, if necessary, by other means of social protection.
4. Everyone has the right to form and to join trade unions for the protection of
his interests.
Article 24
Everyone has the right to rest and leisure, including reasonable limitation of
working hours and periodic holidays with pay.
Article 25
1. Everyone has the right to a standard of living adequate for the health and
well-being of himself and of his family, including food, clothing, housing
and medical care and necessary social services, and the right to security
in the event of unemployment, sickness, disability, widowhood, old age or
other lack of livelihood in circumstances beyond his control.
2. Motherhood and childhood are entitled to special care and assistance. All
children, whether born in or out of wedlock, shall enjoy the same social
protection.
Article 26
1. Everyone has the right to education. Education shall be free, at least in the
elementary and fundamental stages. Elementary education shall be
compulsory. Technical and professional education shall be made
generally available and higher education shall be equally accessible to all
on the basis of merit.
2. Education shall be directed to the full development of the human
personality and to the strengthening of respect for human rights and
fundamental freedoms. It shall promote understanding, tolerance and
friendship among all nations, racial or religious groups, and shall further
the activities of the United Nations for the maintenance of peace.
3. Parents have a prior right to choose the kind of education that shall be
given to their children.
Article 27
1. Everyone has the right freely to participate in the cultural life of the
community, to enjoy the arts and to share in scientific advancement and
its benefits.
2. Everyone has the right to the protection of the moral and material interests
resulting from any scientific, literary or artistic production of which he is the
author.
Article 28
Everyone is entitled to a social and international order in which the rights and
freedoms set forth in this Declaration can be fully realized.
Article 29
1. Everyone has duties to the community in which alone the free and full
development of his personality is possible.
2. In the exercise of his rights and freedoms, everyone shall be subject only
to such limitations as are determined by law solely for the purpose of
securing due recognition and respect for the rights and freedoms of others
and of meeting the just requirements of morality, public order and the
general welfare in a democratic society.
3. These rights and freedoms may in no case be exercised contrary to the
purposes and principles of the United Nations.
Article 30
Nothing in this Declaration may be interpreted as implying for any State, group or
person any right to engage in any activity or to perform any act aimed at the
destruction of any of the rights and freedoms set forth herein.
Bibliography
Acero, A. and Stern, R. (1990). Environmental robustness in automatic speech
recognition. In Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., 1990
International Conference on, volume 2, pages 849-852.
Al-Sughaiyer, I. A. and Al-Kharashi, I. A. (2004). Arabic morphological analysis
techniques: a comprehensive survey. J. Am. Soc. Inf. Sci. Technol., 55:189-213.
Aleksic, P. S., Potamianos, G., and Katsaggelos, A. K. (2005). Exploiting visual
information in automatic speech processing. In Handbook of image and video
processing, chapter 10.8, pages 1263-1289. Elsevier Academic Press, 2nd edition.
Almajai, I. and Milner, B. (2008). Using audio-visual features for robust voice
activity detection in clean and noisy speech. In Proc. EUSIPCO.
Almajai, I. and Milner, B. (2009). Enhancing audio speech using visual speech
features. In INTERSPEECH-2009, pages 1959-1962.
Baker, C. (2006). Foundations of Bilingual Education and Bilingualism. Multilingual
Matters Ltd, 4th edition.
Banerjee, P., Garg, G., Mitra, P., and Basu, A. (2008). Application of triphone
clustering in acoustic modeling for continuous speech recognition in Bengali. In
Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pages
1-4.
Bangham, J. A., Ling, P. D., and Harvey, R. (1996). Scale-space from nonlinear
filters. IEEE Trans. Pattern Anal. Mach. Intell., 18:520-528.
Bell-Berti, F. and Harris, K. S. (1982). Temporal patterns of coarticulation: Lip
rounding. The Journal of the Acoustical Society of America, 71(2):449-454.
Benedetto, D., Caglioti, E., and Loreto, V. (2002). Language trees and zipping.
Phys. Rev. Lett., 88(4):048702-048705.
Blokland, A. and Anderson, A. H. (1998). Effect of low frame-rate video on intelli-
gibility of speech. Speech Communication, 26(1-2):97-103.
Bregler, C. and Konig, Y. (1994). Eigenlips for robust speech recognition. In
Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE Inter-
national Conference on, volume 2, pages 669-672.
Buckwalter, T. (2004). Issues in Arabic orthography and morphology analysis. In
Proceedings of the Workshop on Computational Approaches to Arabic Script-based
Languages, Semitic '04, pages 31-34, Morristown, NJ, USA. Association for Com-
putational Linguistics.
Campbell, W., Campbell, J., Reynolds, D., Singer, E., and Torres-Carrasquillo, P.
(2006). Support vector machines for speaker and language recognition. Computer
Speech & Language, 20(2-3):210-229.
Cetingul, H., Yemez, Y., Erzin, E., and Tekalp, A. (2006). Discriminative anal-
ysis of lip motion features for speaker identification and speech-reading. Image
Processing, IEEE Transactions on, 15(10):2879-2891.
Chen, T. (2010). Science fiction becomes reality. Network, IEEE, 24(4):2-3.
Christiansen, M. H. and Kirby, S. (2003). Language evolution: consensus and con-
troversies. Trends in Cognitive Sciences, 7(7):300-307.
Collinge, N. E., editor (1990). An Encyclopedia of Language. Routledge, UK.
Cooke, M., Barker, J., Cunningham, S., and Shao, X. (2006). An audio-visual
corpus for speech perception and automatic speech recognition. The Journal of
the Acoustical Society of America, 120(5):2421-2424.
Cootes, T. F., Edwards, G. J., and Taylor, C. J. (2001). Active appearance models.
IEEE Trans. Pattern Anal. Mach. Intell., 23:681-685.
Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning,
20:273-297. 10.1023/A:1022627411411.
Cox, S., Harvey, R., Lan, Y., Newman, J., and Theobald, B.-J. (2008). The chal-
lenge of multispeaker lip-reading. In International Conference on Auditory-Visual
Speech Processing (AVSP), pages 179-184.
Davis, S. and Mermelstein, P. (1980). Comparison of parametric representations
for monosyllabic word recognition in continuously spoken sentences. Acoustics,
Speech and Signal Processing, IEEE Transactions on, 28(4):357-366.
De la Torre, F. and Black, M. J. (2003). A framework for robust sub-
space learning. International Journal of Computer Vision, 54:117-142.
10.1023/A:1023709501986.
Ding, C. and He, X. (2004). K-means clustering via principal component analysis.
In Proceedings of the twenty-first international conference on Machine learning,
ICML '04, pages 29, New York, NY, USA. ACM.
Ezzat, T. and Poggio, T. (2000). Visual speech synthesis by morphing visemes. Int.
J. Comput. Vision, 38:45-57.
Fisher, C. G. (1968). Confusions among visually perceived consonants. Journal of
Speech and Hearing Research, 11:796-804.
Forney, G. D., Jr. (1973). The Viterbi algorithm. Proceedings of the IEEE, 61(3):268-
278.
Gales, M. and Young, S. (2007). The application of hidden Markov models in speech
recognition. Found. Trends Signal Process., 1:195-304.
Gersho, A. and Gray, R. M. (1991). Vector quantization and signal compression.
Kluwer Academic Publishers, Norwell, MA, USA.
Goldschen, A., Garcia, O., and Petajan, E. (1994). Continuous optical automatic
speech recognition by lipreading. In Signals, Systems and Computers, 1994. 1994
Conference Record of the Twenty-Eighth Asilomar Conference on, volume 1, pages
572-577.
Greenberg, S. (1999). Speaking in shorthand - a syllable-centric perspective for
understanding pronunciation variation. Speech Commun., 29:159-176.
Hain, T. and Woodland, P. C. (2000). Modelling sub-phone insertions and deletions
in continuous speech recognition. In ICSLP-2000, volume 4, pages 172-175.
Hilder, S., Theobald, B.-J., and Harvey, R. (2010). In pursuit of visemes. In Interna-
tional Conference on Auditory-Visual Speech Processing (AVSP), pages 154-159.
Hosseini, A. and Homayounpour, M. (2009). Improvement of language identification
performance using generalized phone recognizer. In Computer Conference, 2009.
CSICC 2009. 14th International CSI, pages 596-600.
Huang, C.-L. and Wu, C.-H. (2007). Phone set generation based on acoustic and
contextual analysis for multilingual speech recognition. In Acoustics, Speech and
Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol-
ume 4, pages 1017-1020.
Hunt, M., Richardson, S., Bateman, D., and Piau, A. (1991). An investigation
of PLP and IMELDA acoustic representations and of their potential for com-
bination. In Acoustics, Speech, and Signal Processing, 1991. ICASSP-91., 1991
International Conference on, volume 2, pages 881-884.
Jackson, P. J. B. and Singampalli, V. D. (2009). Statistical identification of articu-
lation constraints in the production of speech. Speech Commun., 51:695-710.
Juang, B. H. and Rabiner, L. R. (1991). Hidden Markov models for speech recogni-
tion. Technometrics, 33(3):251-272.
Jurafsky, D. and Martin, J. H. (2009). Speech and language processing: an in-
troduction to natural language processing, computational linguistics, and speech
recognition. Pearson Prentice Hall, 2nd edition.
Jusczyk, P. W. (2000). The Discovery of Spoken Language. The MIT Press.
Katz, S. (1987). Estimation of probabilities from sparse data for the language model
component of a speech recognizer. Acoustics, Speech and Signal Processing, IEEE
Transactions on, 35(3):400-401.
Keane, A., Griffiths, J., and McKeown, P. (2010). The Modern Law of Evidence.
Oxford University Press.
Knoche, H., de Meer, H., and Kirsh, D. (2005). Compensating for low frame rates.
In CHI '05 extended abstracts on Human factors in computing systems, CHI '05,
pages 1553-1556, New York, NY, USA. ACM.
Lan, Y., Harvey, R., Theobald, B.-J., Ong, E.-J., and Bowden, R. (2009). Comparing
visual features for lipreading. In International Conference on Auditory-Visual
Speech Processing (AVSP), pages 102-106.
Lan, Y., Theobald, B.-J., Harvey, R., Ong, E.-J., and Bowden, R. (2010). Improving
visual features for lipreading. In International Conference on Auditory-Visual
Speech Processing (AVSP), pages 142-147.
Lee, S. and Yook, D. (2002). Audio-to-visual conversion using hidden Markov mod-
els. In Proceedings of the 7th Pacific Rim International Conference on Artificial
Intelligence: Trends in Artificial Intelligence, PRICAI '02, pages 563-570, Lon-
don, UK. Springer-Verlag.
Li, H., Ma, B., and Lee, C.-H. (2007). A vector space modeling approach to spoken
language identification. Audio, Speech, and Language Processing, IEEE Transac-
tions on, 15(1):271-284.
Liang, L., Liu, X., Zhao, Y., Pi, X., and Nefian, A. (2002). Speaker independent
audio-visual continuous speech recognition. In Multimedia and Expo, 2002. ICME
'02. Proceedings. 2002 IEEE International Conference on, volume 2, pages 25-28.
Livescu, K., Cetin, O., Hasegawa-Johnson, M., King, S., Bartels, C., Borges, N.,
Kantor, A., Lal, P., Yung, L., Bezman, A., Dawson-Haggerty, S., Woods, B.,
Frankel, J., Magami-Doss, M., and Saenko, K. (2007). Articulatory feature-based
methods for acoustic and audio-visual speech recognition: Summary from the
2006 JHU summer workshop. In Acoustics, Speech and Signal Processing, 2007.
ICASSP 2007. IEEE International Conference on, volume 4, pages 621-624.
MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate
observations. In Cam, L. M. L. and Neyman, J., editors, Proc. of the fifth Berkeley
Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297.
University of California Press.
Martin, A., Doddington, G., Kamm, T., Ordowski, M., and Przybocki, M. (1997).
The DET curve in assessment of detection task performance. In EUROSPEECH-
1997, pages 1895-1898.
Massaro, D. W. and Cohen, M. M. (1983). Evaluation and integration of visual and
auditory information in speech perception. Journal of Experimental Psychology:
Human Perception and Performance, 9(5):753-771.
Matthews, I. and Baker, S. (2004). Active appearance models revisited. Interna-
tional Journal of Computer Vision, 60:135-164.
Matthews, I., Cootes, T., Bangham, J., Cox, S., and Harvey, R. (2002). Extraction
of visual features for lipreading. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 24(2):198-213.
McCahill, M. and Norris, C. (2002). CCTV in London. Urbaneye, working paper
no. 6. http://www.urbaneye.net/results/ue_wp6.pdf. (Last accessed 6/12/10).
McGurk, H. and MacDonald, J. (1976). Hearing lips and seeing voices. Nature,
264(5588):746-748.
Mendoza, S., Gillick, L., Ito, Y., Lowe, S., and Newman, M. (1996). Automatic
language identification using large vocabulary continuous speech recognition. In
Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceed-
ings., 1996 IEEE International Conference on, volume 2, pages 785-788.
Mermelstein, P. (1973). Articulatory model for the study of speech production. The
Journal of the Acoustical Society of America, 53(4):1070-1082.
Mika, S., Ratsch, G., Weston, J., Scholkopf, B., and Mullers, K. (1999). Fisher
discriminant analysis with kernels. In Neural Networks for Signal Processing IX,
1999. Proceedings of the 1999 IEEE Signal Processing Society Workshop, pages
41-48.
Muthusamy, Y., Barnard, E., and Cole, R. (1994). Reviewing automatic language
identification. Signal Processing Magazine, IEEE, 11(4):33-41.
Nankaku, Y., Tokuda, K., Kitamura, T., and Kobayashi, T. (2000). Normalized
training for HMM-based visual speech recognition. In Image Processing, 2000.
Proceedings. 2000 International Conference on, volume 3, pages 234-237.
Newman, J. and Cox, S. (2009). Automatic visual-only language identification: A
preliminary study. In Acoustics, Speech and Signal Processing, 2009. ICASSP
2009. IEEE International Conference on, pages 4345-4348.
Newman, J. and Cox, S. (2010). Speaker independent visual-only language identifi-
cation. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE Inter-
national Conference on, pages 5026-5029.
Newman, J., Theobald, B., and Cox, S. (2010). Limitations of visual speech recogni-
tion. In International Conference on Auditory-Visual Speech Processing (AVSP),
pages 65-68.
NIST (1995). NIST LRE-1996. http://www.itl.nist.gov/iad/mig/tests/lre/1996/.
(Last accessed 6/12/10).
NIST (2007). NIST LRE-2009. http://www.itl.nist.gov/iad/mig/tests/lre/2009/.
(Last accessed 6/12/10).
Ortega-Llebaria, M., Faulkner, A., and Hazan, V. (2001). Auditory-visual L2 speech
perception: Effects of visual cues and acoustic-phonetic context for Spanish learn-
ers of English. In International Conference on Auditory-Visual Speech Processing
(AVSP), pages 149-154.
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space.
Philosophical Magazine, 2:559-572.
Potamianos, G., Neti, C., Gravier, G., Garg, A., and Senior, A. (2003). Recent
advances in the automatic recognition of audiovisual speech. Proceedings of the
IEEE, 91(9):1306-1326.
Potamianos, G., Neti, C., Iyengar, G., and Helmuth, E. (2001). Large-vocabulary
audio-visual speech recognition by machines and humans. In EUROSPEECH-
2001, pages 1027-1030.
Potamianos, G. and Potamianos, A. (1999). Speaker adaptation for audio-visual
speech recognition. In EUROSPEECH-1999, pages 1291-1294.
Price, P., Fisher, W., Bernstein, J., and Pallett, D. (1988). The DARPA 1000-word
resource management database for continuous speech recognition. In Acoustics,
Speech, and Signal Processing, 1988. ICASSP-88., 1988 International Conference
on, volume 1, pages 651-654.
R v Luttrell (2004). 2 Cr App R 520, CA.
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications
in speech recognition. Proceedings of the IEEE, 77(2):257-286.
Ramus, F. (2002). Language discrimination by newborns: Teasing apart phono-
tactic, rhythmic, and intonational cues. Annual Review of Language Acquisition,
2(1):85-115.
Roach, P. (1998). Some languages are spoken more quickly than others. In Trudgill,
P. and Bauer, L., editors, Language Myths, pages 150-159. Penguin.
Ronquest, R. E., Levi, S. V., and Pisoni, D. B. (2010). Language identification from
visual-only speech signals. Attention, Perception, & Psychophysics, 72(6):1601-
1613.
Salton, G. (1989). Automatic text processing: the transformation, analysis, and
retrieval of information by computer. Addison-Wesley Longman Publishing Co.,
Inc., Boston, MA, USA.
Sammon, J. W., Jr. (1969). A nonlinear mapping for data structure analysis. Com-
puters, IEEE Transactions on, C-18(5):401-409.
Schultz, T., Rogina, I., and Waibel, A. (1996). LVCSR-based language identifica-
tion. In Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference
Proceedings., 1996 IEEE International Conference on, volume 2, pages 781-784.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System
Technical Journal, 27:379-423 and 623-656.
Siegler, M. and Stern, R. (1995). On the effects of speech rate in large vocabulary
speech recognition systems. In Acoustics, Speech, and Signal Processing, 1995.
ICASSP-95., 1995 International Conference on, volume 1, pages 612-615.
Smrz, O. (2007). ElixirFM: implementation of functional Arabic morphology. In
Proceedings of the 2007 Workshop on Computational Approaches to Semitic Lan-
guages: Common Issues and Resources, Semitic '07, pages 1-8, Morristown, NJ,
USA. Association for Computational Linguistics.
Soto-Faraco, S., Navarra, J., Weikum, W. M., Vouloumanos, A., Sebastian-Galles,
N., and Werker, J. F. (2007). Discriminating languages by speech-reading. Per-
ception & Psychophysics, 69(2):218-231.
Stork, D. G. and Hennecke, M. E., editors (1996). Speechreading by humans and
machines: models, systems, and applications. Springer.
Sugiyama, M. (1991). Automatic language recognition using acoustic features. In
Acoustics, Speech, and Signal Processing, 1991. ICASSP-91., 1991 International
Conference on, volume 2, pages 813-816.
Sujatha, B. and Santhanam, T. (2010). A novel approach integrating geometric and
Gabor wavelet approaches to improvise visual lipreading. International Journal
of Soft Computing, 5(1):13-18.
Summerfield, Q. (1992). Lipreading and audio-visual speech perception. Philosoph-
ical Transactions: Biological Sciences, 335(1273):71-78.
Suo, H., Li, M., Lu, P., and Yan, Y. (2008). Using SVM as back-end classifier for
language identification. EURASIP J. Audio Speech Music Process., 2008:1-6.
Theobald, B. J., Harvey, R., Cox, S. J., Lewis, C., and Owen, G. P. (2006). Lip-
reading enhancement for law enforcement. In Lewis, C. and Owen, G. P., editors,
Optics and Photonics for Counterterrorism and Crime Fighting II, volume 6402.
SPIE.
Torres-Carrasquillo, P., Singer, E., Gleason, T., McCree, A., Reynolds, D., Richard-
son, F., and Sturim, D. (2010). The MITLL NIST LRE 2009 language recognition
system. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE Inter-
national Conference on, pages 4994-4997.
Torres-Carrasquillo, P. A., Singer, E., Kohler, M. A., Greene, R. J., Reynolds, D. A.,
and Deller Jr., J. R. (2002). Approaches to language identification using Gaussian
mixture models and shifted delta cepstral features. In ICSLP-2002, pages 89-92.
UN General Assembly (1948). Universal declaration of human rights. In General
Assembly Resolutions, volume 217 A (III).
Visser, M., Poel, M., and Nijholt, A. (1999). Classifying visemes for automatic
lipreading. In Matousek, V., Mautner, P., Ocelková, J., and Sojka, P., editors,
Text, Speech and Dialogue, volume 1692 of Lecture Notes in Computer Science,
pages 843-843. Springer Berlin / Heidelberg.
Vitkovitch, M. and Barber, P. (1994). Effect of Video Frame Rate on Subjects'
Ability to Shadow One of Two Competing Verbal Passages. J Speech Hear Res,
37(5):1204-1210.
Webb, A. R. (2002). Statistical Pattern Recognition. John Wiley & Sons, 2nd edition.
Williams, J., Rutledge, J., Garstecki, D., and Katsaggelos, A. (1997). Frame rate
and viseme analysis for multimedia applications. In Multimedia Signal Processing,
1997., IEEE First Workshop on, pages 13-18.
Wrench, A. A. (2001). A new resource for production modelling in speech technology.
In Proc. Inst. of Acoust. (WISP), volume 23, pages 207-217.
Yin, B., Ambikairajah, E., and Chen, F. (2008). Improvements on hierarchical lan-
guage identification based on automatic language clustering. In Acoustics, Speech
and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on,
pages 4241-4244.
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X. A., Moore, G.,
Odell, J., Ollason, D., Povey, D., Valtchev, V., and Woodland, P. (2006). The
HTK Book (for HTK Version 3.4). Cambridge University Engineering Depart-
ment.
Zhai, L., Siu, M., Yang, X., and Gish, H. (2006). Discriminatively trained language
models using support vector machines for language identification. In Speaker and
Language Recognition Workshop, 2006. IEEE Odyssey 2006: The, pages 1-6.
Zhang, D. and Lu, G. (2003). Evaluation of similarity measurement for image
retrieval. In Neural Networks and Signal Processing, 2003. Proceedings of the
2003 International Conference on, volume 2, pages 928-931.
Zhu, D., Li, H., Ma, B., and Lee, C.-H. (2008). Discriminative learning for optimizing
detection performance in spoken language recognition. In Acoustics, Speech and
Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages
4161-4164.
Ziaei, A., Ahadi, S., Yeganeh, H., and Mirrezaie, S. (2009). A new approach for
spoken language identification based on sequence kernel SVMs. In Digital Signal
Processing, 2009 16th International Conference on, pages 1-4.
Zissman, M. (1996). Comparison of four approaches to automatic language identifi-
cation of telephone speech. Speech and Audio Processing, IEEE Transactions on,
4(1):31-44.
Zissman, M. A. and Berkling, K. M. (2001). Automatic language identification.
Speech Communication, 35(1-2):115-124.
Zue, W., Hazen, T. J., and Hazen, T. J. (1993). Automatic language identification
using a segment-based approach. In EUROSPEECH-1993, pages 1303-1306.